Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis and J. van Leeuwen
1800
Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Singapore Tokyo
José Rolim et al. (Eds.)
Parallel and Distributed Processing 15 IPDPS 2000 Workshops Cancun, Mexico, May 1-5, 2000 Proceedings
Series Editors Gerhard Goos, Karlsruhe University, Germany Juris Hartmanis, Cornell University, NY, USA Jan van Leeuwen, Utrecht University, The Netherlands Managing Volume Editor José Rolim Université de Genève, Centre Universitaire d'Informatique 24, rue Général Dufour, CH-1211 Genève 4, Switzerland E-mail:
[email protected] Cataloging-in-Publication Data applied for
Die Deutsche Bibliothek - CIP-Einheitsaufnahme Parallel and distributed processing : 15 IPDPS 2000 workshops, Cancun, Mexico, May 1 - 5, 2000, proceedings / José Rolim et al. (ed.). Berlin ; Heidelberg ; New York ; Barcelona ; Hong Kong ; London ; Milan ; Paris ; Singapore ; Tokyo : Springer, 2000 (Lecture notes in computer science ; Vol. 1800) ISBN 3-540-67442-X
CR Subject Classification (1998): C.1-4, B.1-7, D.1-4, F.1-2, G.1-2, E.1, H.2 ISSN 0302-9743 ISBN 3-540-67442-X Springer-Verlag Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. Springer-Verlag is a company in the BertelsmannSpringer publishing group. © Springer-Verlag Berlin Heidelberg 2000 Printed in Germany Typesetting: Camera-ready by author, data conversion by Boller Mediendesign Printed on acid-free paper SPIN: 10720149 06/3142 543210
Volume Editors
José D.P. Rolim G. Chiola G. Conte L.V. Mancini Oscar H. Ibarra Koji Nakano Stephan Olariu Sethuraman Panchanathan Andreas Uhl Martin Schulz Mohammed J. Zaki Vipin Kumar David B. Skillicorn Sartaj Sahni Timothy Davis Sanguthevar Rajasekeran Sanjay Ranka Denis Caromel Serge Chaumette Geoffrey Fox Peter Graham Albert Y. Zomaya Fikret Ercal
Kenji Toda Sang Hyuk Son Maarten Boasson Yoshiaki Kakuda Deveah Bhatt Lonnie R. Welch Hossam ElGindy Viktor K. Prasanna Hartmut Schmeck Oliver Diessel Beverly Sanders Dominique Méry Fouad Kiamilev Jeremy Ekman Afonso Ferreira Sadik Esener Yi Pan Keqin Li Ron Olsson Laxmikant V. Kale Pete Beckman Matthew Haines Dimiter R. Avresky
Foreword

This volume contains the proceedings from the workshops held in conjunction with the IEEE International Parallel and Distributed Processing Symposium, IPDPS 2000, on 1-5 May 2000 in Cancun, Mexico. The workshops provide a forum for bringing together researchers, practitioners, and designers from various backgrounds to discuss the state of the art in parallelism. They focus on different aspects of parallelism, from run time systems to formal methods, from optics to irregular problems, from biology to networks of personal computers, from embedded systems to programming environments; the following workshops are represented in this volume:

– Workshop on Personal Computer Based Networks of Workstations
– Workshop on Advances in Parallel and Distributed Computational Models
– Workshop on Par. and Dist. Comp. in Image, Video, and Multimedia
– Workshop on High-Level Parallel Prog. Models and Supportive Env.
– Workshop on High Performance Data Mining
– Workshop on Solving Irregularly Structured Problems in Parallel
– Workshop on Java for Parallel and Distributed Computing
– Workshop on Biologically Inspired Solutions to Parallel Processing Problems
– Workshop on Parallel and Distributed Real-Time Systems
– Workshop on Embedded HPC Systems and Applications
– Reconfigurable Architectures Workshop
– Workshop on Formal Methods for Parallel Programming
– Workshop on Optics and Computer Science
– Workshop on Run-Time Systems for Parallel Programming
– Workshop on Fault-Tolerant Parallel and Distributed Systems
All papers published in the workshops proceedings were selected by the program committee on the basis of referee reports. Each paper was reviewed by independent referees who judged the papers for originality, quality, and consistency with the themes of the workshops. We would like to thank the general co-chairs Joseph JaJa and Charles Weems for their support and encouragement, the steering committee chairs, George Westrom and Victor Prasanna, for their guidance and vision, and the finance chair, Bill Pitts, for making this publication possible. Special thanks are due to Sally Jelinek, for her assistance with meeting publicity, to Susamma Barua for making local arrangements, and to Danuta Sosnowska for her tireless efforts in interfacing with the organizers. We gratefully acknowledge sponsorship from the IEEE Computer Society and its Technical Committee of Parallel Processing and the cooperation of the ACM SIGARCH. Finally, we would like to thank Danuta Sosnowska and Germaine Gusthiot for their help in the preparation of this volume. February 2000
José D. P. Rolim
Contents
Workshop on Personal Computer Based Networks of Workstations G. Chiola, G. Conte, L.V. Mancini
1
Memory Management in a Combined VIA/SCI Hardware M. Trams, W. Rehm, D. Balkanski, S. Simeonov
4
ATOLL, a New Switched, High Speed Interconnect in Comparison to Myrinet and SCI M. Fischer, U. Brüning, J. Kluge, L. Rzymianowicz, P. Schulz, M. Waack
16
ClusterNet: An Object-Oriented Cluster Network R.R. Hoare
28
GigaBit Performance under NT M. Baker, S. Scott, A. Geist, L. Browne
39
MPI Collective Operations over IP Multicast H.A. Chen, Y.O. Carrasco, A.W. Apon
51
An Open Market-Based Architecture for Distributed Computing S. Lalis, A. Karipidis
61
The MultiCluster Model to the Integrated Use of Multiple Workstation Clusters M. Barreto, R. Ávila, P. Navaux
71
Parallel Information Retrieval on an SCI-Based PC-NOW S.-H. Chung, H.-C. Kwon, K.R. Ryu, H.-K. Jang, J.-H. Kim, C.-A. Choi
81
A PC-NOW Based Parallel Extension for a Sequential DBMS M. Exbrayat, L. Brunie
91
Workshop on Advances in Parallel and Distributed Computational Models O.H. Ibarra, K. Nakano, S. Olariu
101
The Heterogeneous Bulk Synchronous Parallel Model T.L. Williams, R.J. Parsons
102
On Stalling in LogP G. Bilardi, K.T. Herley, A. Pietracaprina, G. Pucci
109
Parallelizability of Some P -Complete Problems A. Fujiwara, M. Inoue, T. Masuzawa
116
A New Computation of Shape Moments via Quadtree Decomposition C.-H. Wu, S.-J. Horng, P.-Z. Lee, S.-S. Lee, S.-Y. Lin
123
The Fuzzy Philosophers S.-T. Huang
130
A Java Applet to Visualize Algorithms on Reconfigurable Mesh K. Miyashita, R. Hashimoto
137
A Hardware Implementation of PRAM and Its Performance Evaluation M. Imai, Y. Hayakawa, H. Kawanaka, W. Chen, K. Wada, C.D. Castanho, Y. Okajima, H. Okamoto
143
A Non-binary Parallel Arithmetic Architecture R. Lin, J.L. Schwing
149
Multithreaded Parallel Computer Model with Performance Evaluation J. Cui, J.L. Bordim, K. Nakano, T. Hayashi, N. Ishii
155
Workshop on Parallel and Distributed Computing in Image Processing, Video Processing, and Multimedia (PDIVM 2000) S. Panchanathan, A. Uhl
161
MAJC-5200: A High Performance Microprocessor for Multimedia Computing S. Sudharsanan
163
A Novel Superscalar Architecture for Fast DCT Implementation Z. Yong, M. Zhang
171
Computing Distance Maps Efficiently Using an Optical Bus Y. Pan, Y. Li, J. Li, K. Li, S.-Q. Zheng
178
Advanced Data Layout Optimization for Multimedia Applications C. Kulkarni, F. Catthoor, H. De Man
186
Parallel Parsing of MPEG Video in a Multi-threaded Multiprocessor Environment S.M. Bhandarkar, S.R. Chandrasekaran
194
Parallelization Techniques for Spatial-Temporal Occupancy Maps from Multiple Video Streams N. DeBardeleben, A. Hoover, W. Jones, W. Ligon
202
Heuristic Solutions for a Mapping Problem in a TV-Anytime Server Network X. Zhou, R. Lüling, L. Xie
210
RPV: A Programming Environment for Real-Time Parallel Vision Specification and Programming Methodology D. Arita, Y. Hamada, S. Yonemoto, R.-i. Taniguchi
218
Parallel Low-Level Image Processing on a Distributed Memory System C. Nicolescu, P. Jonker
226
Congestion-Free Routing of Streaming Multimedia Content in BMIN-Based Parallel Systems H. Sethu
234
Performance of On-Chip Multiprocessors for Vision Tasks Y. Chung, K. Park, W. Hahn, N. Park, V.K. Prasanna
242
Parallel Hardware-Software Architecture for Computation of Discrete Wavelet Transform Using the Recursive Merge Filtering Algorithm P. Jamkhandi, A. Mukherjee, K. Mukherjee, R. Franceschini
250
Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS 2000) M. Schulz
257
Pipelining Wavefront Computations: Experiences and Performance E.C. Lewis, L. Snyder
261
Specification Techniques for Automatic Performance Analysis Tools M. Gerndt, H.-G. Eßer
269
PDRS: A Performance Data Representation System X.-H. Sun, X. Wu
277
Clix - A Hybrid Programming Environment for Distributed Objects and Distributed Shared Memory F. Mueller, J. Nolte, A. Schlaefer
285
Controlling Distributed Shared Memory Consistency from High Level Programming Languages Y. Jégou
293
Online Computation of Critical Paths for Multithreaded Languages Y. Oyama, K. Taura, A. Yonezawa
301
Problem Solving Environment Infrastructure for High Performance Computer Systems D.C. Stanzione, Jr., W.B. Ligon III
314
Combining Fusion Optimizations and Piecewise Execution of Nested Data-Parallel Programs W. Pfannenstiel
324
Declarative Concurrency in Java R. Ramirez, A.E. Santosa
332
Scalable Monitoring Technique for Detecting Races in Parallel Programs Y.-K. Jun, C.E. McDowell
340
Workshop on High Performance Data Mining M.J. Zaki, V. Kumar, D.B. Skillicorn
348
Implementation Issues in the Design of I/O Intensive Data Mining Applications on Clusters of Workstations R. Baraglia, D. Laforenza, S. Orlando, P. Palmerini, R. Perego
350
A Requirements Analysis for Parallel KDD Systems W.A. Maniatty, M.J. Zaki
358
Parallel Data Mining on ATM-Connected PC Cluster and Optimization of Its Execution Environment M. Oguchi, M. Kitsuregawa
366
The Parallelization of a Knowledge Discovery System with Hypergraph Representation J. Seitzer, J.P. Buckley, Y. Pan, L.A. Adams
374
Parallelisation of C4.5 as a Particular Divide and Conquer Computation P. Becuzzi, M. Coppola, S. Ruggieri, M. Vanneschi
382
Scalable Parallel Clustering for Data Mining on Multicomputers D. Foti, D. Lipari, C. Pizzuti, D. Talia
390
Exploiting Dataset Similarity for Distributed Mining S. Parthasarathy, M. Ogihara
399
Scalable Model for Extensional and Intensional Descriptions of Unclassified Data H.A. Prado, S.C. Hirtle, P.M. Engel
407
Parallel Data Mining of Bayesian Networks from Telecommunications Network Data R. Sterrit, K. Adamson, C.M. Shapcott, E.P. Curran
415
Irregular 2000 - Workshop on Solving Irregularly Structured Problems in Parallel S. Sahni, T. Davis, S. Rajasekeran, S. Ranka
423
Load Balancing and Continuous Quadratic Programming W.W. Hager
427
Parallel Management of Large Dynamic Shared Memory Space: A Hierarchical FEM Application X. Cavin, L. Alonso
428
Efficient Parallelization of Unstructured Reductions on Shared Memory Parallel Architectures S. Benkner, T. Brandes
435
Parallel FEM Simulation of Crack Propagation-Challenges, Status, and Perspectives B. Carter, C.-S. Chen, L.P. Chew, N. Chrisochoides, G.R. Gao, G. Heber, A.R. Ingraffea, R. Krause, C. Myers, D. Nave, K. Pingali, P. Stodghill, S. Vavasis, P.A. Wawrzynek
443
Support for Irregular Computations in Massively Parallel PIM Arrays, Using an Object-Based Execution Model H.P. Zima, T.L. Sterling
450
Executing Communication-Intensive Irregular Programs Efficiently V. Ramakrishnan, I.D. Scherson
457
Non-Memory-Based and Real-Time Zerotree Building for Wavelet Zerotree Coding Systems D. Peng, M. Lu
469
Graph Partitioning for Dynamic, Adaptive, and Multi-phase Computations V. Kumar, K. Schloegel, G. Karypis
476
A Multilevel Algorithm for Spectral Partitioning with Extended Eigen-Models S. Oliveira, T. Soma
477
An Integrated Decomposition and Partitioning Approach for Irregular Block-Structured Applications J. Rantakokko
485
Ordering Unstructured Meshes for Sparse Matrix Computations on Leading Parallel Systems L. Oliker, X. Li, G. Heber, R. Biswas
497
A GRASP for Computing Approximate Solutions for the Three-Index Assignment Problem R.M. Aiex, P.M. Pardalos, L.S. Pitsoulis, M.G.C. Resende
504
On Identifying Strongly Connected Components in Parallel L.K. Fleischer, B. Hendrickson, A. Pınar
505
A Parallel, Adaptive Refinement Scheme for Tetrahedral and Triangular Grids A. Stagg, J. Hallberg, J. Schmidt
512
PaStiX: A Parallel Sparse Direct Solver Based on a Static Scheduling for Mixed 1D/2D Block Distributions P. Hénon, P. Ramet, J. Roman
519
Workshop on Java for Parallel and Distributed Computing D. Caromel, S. Chaumette, G. Fox, P. Graham
526
An IP Next Generation Compliant JavaTM Virtual Machine G. Chelius, É. Fleury
528
An Approach to Asynchronous Object-Oriented Parallel and Distributed Computing on Wide-Area Systems M. Di Santo, F. Frattolillo, W. Russo, E. Zimeo
536
Performance Issues for Multi-language Java Applications P. Murray, T. Smith, S. Srinivas, M. Jacob
544
MPJ: A Proposed Java Message Passing API and Environment for High Performance Computing M. Baker, B. Carpenter
552
Implementing Java Consistency Using a Generic, Multithreaded DSM Runtime System G. Antoniu, L. Bougé, P. Hatcher, M. MacBeth, K. McGuigan, R. Namyst
560
Workshop on Bio-Inspired Solutions to Parallel Processing Problems (BioSP3) A.Y. Zomaya, F. Ercal, S. Olariu
568
Take Advantage of the Computing Power of DNA Computers Z.F. Qiu, M. Lu
570
Agent Surgery: The Case for Mutable Agents L. Bölöni, D.C. Marinescu
578
Was Collective Intelligence before Life on Earth? T. Szuba, M. Almulla
586
Solving Problems on Parallel Computers by Cellular Programming D. Talia
595
Multiprocessor Scheduling with Support by Genetic Algorithms-Based Learning Classifier System J.P. Nowacki, G. Pycka, F. Seredyński
604
Viewing Scheduling Problems through Genetic and Evolutionary Algorithms M. Rocha, C. Vilela, P. Cortez, J. Neves
612
Dynamic Load Balancing Model: Preliminary Assessment of a Biological Model for a Pseudo-search Engine R.L. Walker
620
A Parallel Co-evolutionary Metaheuristic V. Bachelet, E.-G. Talbi
628
Neural Fraud Detection in Mobile Phone Operations A. Boukerche, M.S.M.A. Notare
636
Information Exchange in Multi Colony Ant Algorithms M. Middendorf, F. Reischle, H. Schmeck
645
A Surface-Based DNA Algorithm for the Expansion of Symbolic Determinants Z.F. Qiu, M. Lu
653
Hardware Support for Simulated Annealing and Tabu Search R. Schneider, R. Weiss
660
Workshop on Parallel and Distributed Real-Time Systems K. Toda, S.H. Son, M. Boasson, Y. Kakuda
668
A Distributed Real Time Coordination Protocol L. Sha, D. Seto
671
A Segmented Backup Scheme for Dependable Real Time Communication in Multihop Networks P.K. Gummadi, J.P. Madhavarapu, S.R. Murthy
678
Real-Time Coordination in Distributed Multimedia Systems T.A. Limniotes, G.A. Papadopoulos
685
Supporting Fault-Tolerant Real-Time Applications Using the RED-Linux General Scheduling Framework K.-J. Lin, Y.-C. Wang
692
Are COTS Suitable for Building Distributed Fault-Tolerant Hard Real-Time Systems? P. Chevochot, A. Colin, D. Decotigny, I. Puaut
699
Autonomous Consistency Technique in Distributed Database with Heterogeneous Requirements H. Hanamura, I. Kaji, K. Mori
706
Real-Time Transaction Processing Using Two-Stage Validation in Broadcast Disks K.-w. Lam, V.C.S. Lee, S.H. Son
713
Using Logs to Increase Availability in Real-Time Main-Memory Database T. Niklander, K. Raatikainen
720
Components Are from Mars M.R.V. Chaudron, E. de Jong
727
2+10 1+50 ! H. Hansson, C. Norström, S. Punnekkat
734
A Framework for Embedded Real-Time System Design J.-Y. Choi, H.-H. Kwak, I. Lee
738
Best-Effort Scheduling of (m,k)-Firm Real-Time Streams in Multihop Networks A. Striegel, G. Manimaran
743
Predictability and Resource Management in Distributed Multimedia Presentations C. Mourlas
750
Quality of Service Negotiation for Distributed, Dynamic Real-Time Systems C.D. Cavanaugh, L.R. Welch, B.A. Shirazi, E.-n. Huh, S. Anwar
757
An Open Framework for Real-Time Scheduling Simulation T. Kramp, M. Adrian, R. Koster
766
Workshop on Embedded/Distributed HPC Systems and Applications (EHPC 2000) D. Bhatt, L.R. Welch
773
A Probabilistic Power Prediction Tool for the Xilinx 4000-Series FPGA T. Osmulski, J.T. Muehring, B. Veale, J.M. West, H. Li, S. Vanichayobon, S.-H. Ko, J.K. Antonio, S.K. Dhall
776
Application Challenges: System Health Management for Complex Systems G.D. Hadden, P. Bergstrom, T. Samad, B.H. Bennett, G.J. Vachtsevanos, J. Van Dyke
784
Accommodating QoS Prediction in an Adaptive Resource Management Framework E.-n. Huh, L.R. Welch, B.A. Shirazi, B.C. Tjaden, C.D. Cavanaugh
792
Network Load Monitoring in Distributed Systems K.M. Jahirul Islam, B.A. Shirazi, L.R. Welch, B.C. Tjaden, C.D. Cavanaugh, S. Anwar
800
A Novel Specification and Design Methodology of Embedded Multiprocessor Signal Processing Systems Using High-Performance Middleware R.S. Janka, L.M. Wills
808
Auto Source Code Generation and Run-Time Infrastructure and Environment for High Performance, Distributed Computing Systems M.I. Patel, K. Jordan, M. Clark, D. Bhatt
816
Developing an Open Architecture for Performance Data Mining D.B. Pierce, D.T. Rover
823
A 90k Gate “CLB” for Parallel Distributed Computing B. Schulman, G. Pechanek
831
Power-Aware Replication of Data Structures in Distributed Embedded Real-Time Systems O.S. Unsal, I. Koren, C.M. Krishna
839
Comparison of MPI Implementations on a Shared Memory Machine B. Van Voorst, S. Seidel
847
A Genetic Algorithm Approach to Scheduling Communications for a Class of Parallel Space-Time Adaptive Processing Algorithms J.M. West, J.K. Antonio
855
Reconfigurable Parallel Sorting and Load Balancing on a Beowulf Cluster: HeteroSort P. Yang, T.M. Kunau, B.H. Bennett, E. Davis, B. Wren
862
Reconfigurable Architectures Workshop (RAW 2000) H. ElGindy, V.K. Prasanna, H. Schmeck, O. Diessel
870
Run-Time Reconfiguration at Xilinx S.A. Guccione
873
JRoute: A Run-Time Routing API for FPGA Hardware E. Keller
874
A Reconfigurable Content Addressable Memory S.A. Guccione, D. Levi, D. Downs
882
ATLANTIS - A Hybrid FPGA/RISC Based Re-configurable System O. Brosch, J. Hesser, C. Hinkelbein, K. Kornmesser, T. Kuberka, A. Kugel, R. Männer, H. Singpiel, B. Vettermann
890
The Cellular Processor Architecture CEPRA-1X and Its Configuration by CDL C. Hochberger, R. Hoffmann, K.-P. Völkmann, S. Waldschmidt
898
Loop Pipelining and Optimization for Run Time Reconfiguration K. Bondalapati, V.K. Prasanna
906
Compiling Process Algebraic Descriptions into Reconfigurable Logic O. Diessel, G. Milne
916
Behavioral Partitioning with Synthesis for Multi-FPGA Architectures under Interconnect, Area, and Latency Constraints P. Lakshmikanthan, S. Govindarajan, V. Srinivasan, R. Vemuri
924
Module Allocation for Dynamically Reconfigurable Systems X.-j. Zhang, K.-w. Ng
932
Augmenting Modern Superscalar Architectures with Configurable Extended Instructions X. Zhou, M. Martonosi
941
Complexity Bounds for Lookup Table Implementation of Factored Forms in FPGA Technology Mapping W. Feng, F.J. Meyer, F. Lombardi
951
Optimization of Motion Estimator for Run-Time-Reconfiguration Implementation C. Tanougast, Y. Berviller, S. Weber
959
Constant-Time Hough Transform on a 3D Reconfigurable Mesh Using Fewer Processors Y. Pan
966
Workshop on Formal Methods for Parallel Programming (FMPPTA 2000) B. Sanders, D. Méry
974
A Method for Automatic Cryptographic Protocol Verification J. Goubault-Larrecq
977
Verification Methods for Weaker Shared Memory Consistency Models R.P. Ghughal, G.C. Gopalakrishnan
985
Models Supporting Nondeterminism and Probabilistic Choice M. Mislove
993
Concurrent Specification and Timing Analysis of Digital Hardware Using SDL K.J. Turner, F.J. Argul-Marin, S.D. Laing
1001
Incorporating Non-functional Requirements into Software Architectures N.S. Rosa, G.R.R. Justo, P.R.F. Cunha
1009
Automatic Implementation of Distributed Systems Formal Specifications L.H. Castelo Branco, A.F. do Prado, W. Lopes de Souza, M. Sant'Anna
1019
Refinement Based Validation of an Algorithm for Detecting Distributed Termination M. Filali, P. Mauran, G. Padiou, P. Quéinnec, X. Thirioux
1027
Tutorial 1: Abstraction and Refinement of Concurrent Programs and Formal Specification D. Cansell, D. Méry, C. Tabacznyj
1037
Tutorial 2: A Foundation for Composing Concurrent Objects J.-P. Bahsoun
1039
Workshop on Optics and Computer Science (WOCS 2000) F. Kiamilev, J. Ekman, A. Ferreira, S. Esener, Y. Pan, K. Li
1042
Fault Tolerant Algorithms for a Linear Array with a Reconfigurable Pipelined Bus System A.G. Bourgeois, J.L. Trahan
1044
Fast and Scalable Parallel Matrix Computations with Optical Buses K. Li
1053
Pulse-Modulated Vision Chips with Versatile-Interconnected Pixels J. Ohta, A. Uehara, T. Tokuda, M. Nunoshita
1063
Connectivity Models for Optoelectronic Computing Systems H.M. Ozaktas
1072
Optoelectronic-VLSI Technology: Terabit/s I/O to a VLSI Chip A.V. Krishnamoorthy
1089
Three Dimensional VLSI-Scale Interconnects D.W. Prather
1092
Present and Future Needs of Free-Space Optical Interconnects S. Esener, P. Marchand
1104
Fast Sorting on a Linear Array with a Reconfigurable Pipelined Bus System A. Datta, R. Owens, S. Soundaralakshmi
1110
Architecture Description and Prototype Demonstration of Optoelectronic Parallel-Matching Architecture K. Kagawa, K. Nitta, Y. Ogura, J. Tanida, Y. Ichioka
1118
A Distributed Computing Demonstration System Using FSOI Inter-Processor Communication J. Ekman, C. Berger, F. Kiamilev, X. Wang, H. Spaanenburg, P. Marchand, S. Esener
1126
Optoelectronic Multi-chip Modules Based on Imaging Fiber Bundle Structures D.M. Chiarulli, S.P. Levitan
1132
VCSEL Based Smart Pixel Array Technology Enables Chip-to-Chip Optical Interconnect Y. Liu
1133
Workshop on Run-Time Systems for Parallel Programming (RTSPP) R. Olsson, L.V. Kale, P. Beckman, M. Haines
1134
A Portable and Adaptative Multi-protocol Communication Library for Multithreaded Runtime Systems O. Aumage, L. Bougé, R. Namyst
1136
CORBA Based Runtime Support for Load Distribution and Fault Tolerance T. Barth, G. Flender, B. Freisleben, M. Grauer, F. Thilo
1144
Run-Time Support for Adaptive Load Balancing M.A. Bhandarkar, R.K. Brunner, L.V. Kalé
1152
Integrating Kernel Activations in a Multithreaded Runtime System on Top of Linux V. Danjean, R. Namyst, R.D. Russell
1160
DyRecT: Software Support for Adaptive Parallelism on NOWs E. Godard, S. Setia, E. White
1168
Fast Measurement of LogP Parameters for Message Passing Platforms T. Kielmann, H.E. Bal, K. Verstoep
1176
Supporting Flexible Safety and Sharing in Multi-threaded Environments S.H. Samorodin, R. Pandey
1184
A Runtime System for Dynamic DAG Programming M.-Y. Wu, W. Shu, Y. Chen
1192
Workshop on Fault-Tolerant Parallel and Distributed Systems (FTPDS 2000) D.R. Avresky
1200
Certification of System Architecture Dependability I. Levendel
1202
Computing in the RAIN: A Reliable Array of Independent Nodes V. Bohossian, C.C. Fan, P.S. LeMahieu, M.D. Riedel, L. Xu, J. Bruck
1204
Fault-Tolerant Wide-Area Parallel Computing J.B. Weissman
1214
Transient Analysis of Dependability/Performability Models by Regenerative Randomization with Laplace Transform Inversion J.A. Carrasco
1226
FANTOMAS: Fault Tolerance for Mobile Agents in Clusters H. Pals, S. Petri, C. Grewe
1236
Metrics, Methodologies, and Tools for Analyzing Network Fault Recovery Performance in Real-Time Distributed Systems P.M. Irey IV, B.L. Chappell, R.W. Hott, D.T. Marlow, K.F. O'Donoghue, T.R. Plunkett
1248
Consensus Based on Strong Failure Detectors: A Time and Message-Efficient Protocol F. Greve, M. Hurfin, R. Macêdo, M. Raynal
1258
Implementation of Finite Lattices in VLSI for Fault-State Encoding in High-Speed Networks A.C. Döring, G. Lustig
1266
Building a Reliable Message Delivery System Using the CORBA Event Service S. Ramani, B. Dasarathy, K.S. Trivedi
1276
Network Survivability Simulation of a Commercially Deployed Dynamic Routing System Protocol A. Chowdhury, O. Frieder, P. Luse, P.-J. Wan
1281
Fault-Tolerant Distributed-Shared-Memory on a Broadcast-Based Interconnection Network D. Hecht, C. Katsinis
1286
An Efficient Backup-Overloading for Fault-Tolerant Scheduling of Real-Time Tasks R. Al-Omari, G. Manimaran, A.K. Somani
1291
Mobile Agents to Automate Fault Management in Wireless and Mobile Networks N. Pissinou, Bhagyavati, K. Makki
1296
Heterogeneous Computing Workshop (HCW 2000) V.K. Prasanna, C.S. Raghavendra
1301
Author Index
1307
3rd Workshop on Personal Computer Based Networks of Workstations (PC-NOW 2000)

Clusters composed of fast personal computers are now well established as cheap and efficient platforms for distributed and parallel applications. The main drawback of standard NOWs is the poor performance of the standard inter-process communication mechanisms based on RPC, sockets, TCP/IP, Ethernet. Such standard communication mechanisms perform poorly both in terms of throughput as well as message latency. Several prototypes developed around the world have proved that, by re-visiting the implementation of the communication layer of a standard Operating System kernel, a low cost hardware platform composed of only commodity components can scale up to several tens of processing nodes and deliver communication and computation performance exceeding the one delivered by conventional high-cost parallel platforms. This workshop provides a forum to discuss issues related to the design of efficient NOW/Clusters based on commodity hardware and public domain operating systems as compared to custom hardware devices and/or proprietary operating systems.
Workshop Organizers G. Chiola (DISI, U. Genoa, I) G. Conte (CE, U. Parma, I) L.V. Mancini (DSI, U. Rome, I)
Sponsors IEEE TFCC (Task Force on Cluster Computing)
Program Committee Program Chair:
C. Anglano (U. Piemonte Or., I) M. Baker (CSM, U. Portsmouth, UK) L. Bougé (ENS Lyon, F) G. Chiola (DISI, U. Genoa, I) G. Ciaccio (DISI, U. Genoa, I) G. Conte (CE, U. Parma, I) H.G. Dietz (ECE, Purdue U., USA) W. Gentzsch (GENIAS Software GmbH, D) G. Iannello (DIS, U. Napoli, I) Y. Ishikawa (RWCP, J) K. Li (Princeton U., USA) L.V. Mancini (DSI, U. Roma 1, I) T.G. Mattson (Intel Corp., USA) W. Rehm (Informatik, T.U. Chemnitz, D) P. Rossi (ENEA HPCN, Bologna, I) P. Roe (Queensland U. of Tech., AUS) D.B. Skillicorn (Queens U., CAN) D. Tavangarian (Informatik, U. Rostock, D) B. Tourancheau (LHPC, U. Lyon, F)
Referees C. Anglano O. Aumage M. Baker G. Chiola G. Ciaccio G. Conte M. Fischer
W. Gentzsch G. Iannello Y. Ishikawa L.V. Mancini T.G. Mattson J.-F. Mehaut R. Namyst
W. Rehm P. Roe P. Rossi D. Tavangarian B. Tourancheau R. Westrelin
Accepted Papers

Session 1: Cluster Interconnect Design and Implementation
– M. Trams, W. Rehm, D. Balkanski, and S. Simeonov "Memory Management in a combined VIA/SCI Hardware"
– M. Fischer, et al. "ATOLL, a new switched, high speed Interconnect in comparison to Myrinet and SCI"
– R.R. Hoare "ClusterNet: An Object-Oriented Cluster Network"

Session 2: Off-the-shelf Clusters Communication
– M. Baker, S. Scott, A. Geist, and L. Browne "GigaBit Performance under NT"
– H.A. Chen, Y.O. Carrasco, and A.W. Apon "MPI Collective Operations over IP Multicast"
Session 3: Multiple Clusters and Grid Computing
– S. Lalis and A. Karipidis "An Open Market-Based Architecture for Distributed Computing"
– M. Barreto, R. Avila, and Ph. Navaux "The MultiCluster Model to the Integrated Use of Multiple Workstation Clusters"
Session 4: Data Intensive Applications
– S.H. Chung, et al. "Parallel Information Retrieval on an SCI-Based PC-NOW"
– M. Exbrayat and L. Brunie "A PC-NOW Based Parallel Extension for a Sequential DBMS"
Other Activities In addition to the presentation of contributed papers an invited talk will be scheduled at the workshop.
Memory Management in a combined VIA/SCI Hardware

Mario Trams, Wolfgang Rehm, Daniel Balkanski and Stanislav Simeonov*
{mtr,rehm}@informatik.tu-chemnitz.de
[email protected], [email protected]

Technische Universität Chemnitz**, Fakultät für Informatik
Straße der Nationen 62, 09111 Chemnitz, Germany
Abstract. In this document we make a brief review of memory management and DMA considerations in the case of common SCI hardware and the Virtual Interface Architecture. On this basis we expose our ideas for an improved memory management of a hardware combining the positive characteristics of both basic technologies in order to get one completely new design rather than simply adding one to the other. The described memory management concept provides the opportunity of a real zero-copy transfer for Send-Receive operations while keeping full flexibility and efficiency of a node's local memory management system. From the resulting hardware we expect a very good system throughput for message passing applications even if they use a wide range of message sizes.
* Daniel Balkanski and Stanislav Simeonov are from the Burgas Free University, Bulgaria.
** The work presented in this paper is sponsored by the SMWK/SMWA Saxony ministries (AZ: 7531.50-03-0380-98/6). It is also carried out in strong interaction with the project GRANT SFB393/B6 of the DFG (German National Science Foundation).

1 Motivation and Introduction

PCI-SCI bridges (Scalable Coherent Interface [12]) become a more and more preferable technological choice in the growing market of Cluster Computing based on non-proprietary hardware. Although the absolute performance characteristics of this communication hardware increase more and more, it still has some disadvantages. Dolphin Interconnect Solutions AS (Norway) is the leading manufacturer of commercial SCI link chips as well as the only manufacturer of commercially available PCI-SCI bridges. These bridges offer very low latencies in the range of some microseconds for their distributed shared memory and also reach relatively high bandwidths (more than 80 MBytes/s). In our clusters we use Dolphin's PCI-SCI bridges in conjunction with standard PC components [11]. MPI applications that we are running on our cluster can get a great acceleration from the low latencies of the underlying SCI shared memory if it is used as the communication medium for transferring messages. MPI implementations such as [7] show a
bandwidth of about 35 MByte/s for a message size of 1 kByte, which is quite a lot (refer also to figure 1 later). The major problem of MPI implementations over shared memory is the big CPU utilization for long message sizes due to copy operations. So the just referred good MPI performance [7] is more an academic peak performance which is achieved with more or less total CPU consumption. A standard solution for this problem is to use a block-moving DMA engine for data transfers in the background. Dolphin's PCI-SCI bridges implement such a DMA engine. Unfortunately, it cannot be controlled directly from a user process without violating general protection issues. Therefore kernel calls are required here, which in the end increase the minimum achievable latency and require a lot of additional CPU cycles. The Virtual Interface Architecture (VIA) Specification [16] defines mechanisms for moving the communication hardware closer to the application by migrating protection mechanisms into the hardware. In fact, VIA specifies nothing completely new since it can be seen as an evolution of U-Net [15]. But it is a first try to define a common industry standard for a principle communication architecture for message passing, from hardware to software layers. Due to its DMA transfers and its reduced latency because of user-level hardware access, a VIA system will increase the general system throughput of a cluster computer compared to a cluster equipped with a conventional communication system with similar raw performance characteristics. But for very short transmission sizes it won't come close to programmed IO over global distributed shared memory in terms of latency and bandwidth. This is a natural fact because we can't compare a simple memory reference with DMA descriptor preparation and execution.

Figure 1. Comparison of MPI Implementations for Dolphin's PCI-SCI Bridges and GigaNet's cLAN VIA Hardware
Figure 1 shows bandwidth curves of MPI implementations for both an SCI and a native VIA implementation (GigaNet cLAN). The hardware is in both cases based on the PCI bus, and the machines where the measurements were taken are comparable. The concrete values are based on ping-pong measurements and were taken from [7] in the case of SCI, and from [10] (Linux case) for the cLAN hardware.
As expected, the bandwidth in the case of SCI looks better in the range of smaller message sizes. For larger message sizes the cLAN implementation demonstrates higher bandwidth because of its advanced DMA engine. But no less important is the fact that a DMA engine gives the CPU more time for computations. Details of such CPU utilization considerations are outside the scope of this paper and are already discussed in [14] and [8]. As a summary of these motivating facts we can state that besides a powerful DMA engine controllable from user level, a distributed shared memory for programmed IO is an important feature which shouldn't be missed in a communication system.
2 What are the Memory Management Considerations?

First of all we want to give a short definition of what belongs to memory management regarding this document. This can be stated by the following aspects expressed in the form of questions:

1. How is a process' memory area made available to the Network Interface Controller (NIC), and in what way is main memory protected against wrong accesses?
2. At which point in the system is a DMA engine working, and how are the transactions of this DMA engine validated?
3. In which way is memory of a process on a remote node made accessible for a local process?

Based on these questions we can classify the different communication system architectures in terms of advantages/disadvantages of their memory management. In the analysis presented in the following sections we'll reveal these advantages and disadvantages arising from the common PCI-SCI architecture and the VI Architecture.
3 PCI-SCI vs. VIA discussion and comparison
3.1 Question 1: How is a process' memory area made available to the NIC and in what way is main memory protected against wrong accesses?

Common PCI-SCI case: Current PCI-SCI bridges developed by Dolphin realize a quite static memory management [4] to get access to main memory or rather PCI address space. To avoid unwanted accesses to sensitive locations, the PCI-SCI bridge is set up to allow accesses only to a dedicated memory window. Memory access requests caused by remote machines are only allowed if they fall within the specified window. This causes two big disadvantages:

– Continuous exported regions must also be continuously available inside the physical address space. Additionally, these regions must be aligned to the minimum exportable block size, which is typically quite large (512 kB for Dolphin's bridges).
– Exported memory must reside within this window.

To handle these problems it is required to reserve main memory only for SCI purposes. This, in practice, 'wastes' a part of memory if it is not really exported later. In consequence, these disadvantages of the common PCI-SCI bridge architecture make its use with MPI applications very difficult, especially in view of zero-copy transfer operations. Because data transfers can be processed using the reserved memory region only, MPI applications would be required to use special malloc() functions for allocating data structures used for send/receive purposes later. But this violates a major goal of the MPI standard: Architecture Independence.
VIA case: The VI Architecture specifies a much better view the NIC has on main memory. Instead of a flat one-to-one representation of the physical memory space it implements a more flexible lookup-table address translation. Comparing this mechanism with the PCI-SCI pendant, the following advantages become visible:

– Continuous regions seen by the VIA hardware are not required to be also continuous inside the host physical address space.
– Accesses to sensitive address ranges are prevented by just not including them in the translation table.
– The NIC can get access to every physical memory page, even if this may not be possible for all physical pages at once (when the translation table has fewer entries than the number of physical pages).

The translation table is not only for address translation purposes, but also for protection of memory. To achieve this, a so-called Protection Tag is included in each translation and protection table entry. This tag is checked prior to each access to main memory to qualify the access. For more information about this see later in section 3.2.
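As a rough illustration of this mechanism, a NIC-side translation and protection table entry and the corresponding lookup could be sketched in C as follows. The field names, the table size and the 4 kB page size are assumptions made for the sketch, not the layout defined by the VIA specification or used by Dolphin's hardware.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define PAGE_SHIFT  12                 /* assumed 4 kB host pages        */
    #define PAGE_SIZE   (1u << PAGE_SHIFT)
    #define TPT_ENTRIES 8192               /* assumed table size             */

    /* One entry of the NIC's translation and protection table (TPT). */
    typedef struct {
        uint64_t phys_page;                /* physical page the NIC may use  */
        uint32_t prot_tag;                 /* protection tag of the owner    */
        bool     valid;
    } tpt_entry_t;

    static tpt_entry_t tpt[TPT_ENTRIES];

    /* Translate a NIC-visible address (table index plus page offset) into
     * a physical address; allow the access only if the requester's
     * protection tag matches the tag stored for that page.                 */
    static bool tpt_translate(uint64_t nic_addr, uint32_t requester_tag,
                              uint64_t *phys_addr)
    {
        size_t   idx    = (size_t)(nic_addr >> PAGE_SHIFT);
        uint64_t offset = nic_addr & (PAGE_SIZE - 1);

        if (idx >= TPT_ENTRIES || !tpt[idx].valid)
            return false;                  /* page not registered            */
        if (tpt[idx].prot_tag != requester_tag)
            return false;                  /* page belongs to another owner  */

        *phys_addr = (tpt[idx].phys_page << PAGE_SHIFT) | offset;
        return true;
    }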
Conclusions regarding question 1: It is clear that the VIA approach offers much more flexibility. Using this local memory access strategy in a PCI-SCI bridge design will eliminate all of the problems seen in current designs. Of course, the drawback is the more complicated hardware and the additional cycles to translate the address.
3.2 Question 2: At which point in the system is a DMA engine working and how are the transactions of this DMA engine validated?

Common PCI-SCI case: The DMA engine accesses local memory in the same way as already discussed in section 3.1. Therefore it also inherits all disadvantages when dealing with physical addresses on the PCI-SCI bridge.
For accesses to global SCI memory a more flexible translation table is used. This Downstream Translation Table realizes a virtual view onto global SCI memory, similar to the view of a VIA NIC onto local memory. Every page of the virtual SCI memory can be mapped to a page of the global SCI memory. Regarding validation, the DMA engine can't distinguish between regions owned by different processes (neither local nor remote). Therefore the hardware can't check access rights on the fly. Rather it is required that the DMA descriptor containing the information about the block to copy is assured to be right. In other words, the operating system kernel has to prepare, or at least check, any DMA descriptor to be posted to the NIC. This requires OS calls that we want to remove at all cost.
VIA case: A VIA NIC implements mechanisms to execute a DMA descriptor from user level while assuring protection among multiple processes using the same VIA hardware. A user process can own one or more interfaces of the VIA hardware (so-called Virtual Interfaces). In other words, a virtual interface is a virtual representation of a virtual unique communication hardware. The connection between the virtual interfaces and the VIA hardware is made by Doorbells that represent a virtual interface with its specific control registers. A user-level process can insert a new DMA descriptor into a job queue of the VIA hardware by writing an appropriate value into a doorbell assigned to this process. The size of a doorbell is equal to the page size of the host computer, and so the decision which process may access which doorbell (or virtual interface) can simply be realized by the host's virtual memory management system. Protection during DMA transfers is achieved by usage of Protection Tags. These tags are used by the DMA engine to check whether the access of the currently processed virtual interface to a memory page is right. The protection tag of the accessed memory page is compared with the protection tag assigned to the virtual interface of the process that provided this DMA descriptor. Only if both tags are equal is the access legal and can be performed. A more detailed description of this mechanism is outside the scope of this document (refer to [13] and [16]).
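The user-level posting path can be pictured with the following sketch. The descriptor layout and the doorbell register are purely illustrative (the real formats are defined in the VIA specification [16]); the point is that posting is a single store into a page that the operating system mapped exclusively into the owning process.

    #include <stdint.h>

    /* Illustrative DMA/RDMA descriptor; the real VIA format differs.       */
    typedef struct {
        uint64_t local_addr;     /* address in the VI's local virtual space  */
        uint64_t remote_addr;    /* address in the remote VI's virtual space */
        uint32_t length;         /* transfer length in bytes                 */
        uint32_t flags;          /* e.g. RDMA write, completion request      */
    } dma_desc_t;

    /* A doorbell occupies one host page mapped into exactly one process,
     * so ordinary virtual memory protection decides who may ring it.       */
    typedef struct {
        volatile uint32_t post;  /* write a descriptor index here to post    */
    } doorbell_t;

    /* Post a prepared descriptor without any system call: the NIC later
     * fetches the descriptor and checks protection tags before acting.     */
    static void post_descriptor(doorbell_t *db, uint32_t desc_index)
    {
        db->post = desc_index;   /* single user-level store "rings" the bell */
    }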
Conclusions regarding question 2: The location of the DMA engine is in both cases principally the same. The difference is that in the case of VIA a real lookup-table based address translation is performed between the DMA engine and PCI memory. That is, the VIA DMA operates on a virtual local address space, while the PCI-SCI DMA operates directly with local physical addresses. The answer regarding access protection is simple: the common PCI-SCI DMA engine supports no protection in hardware and must trust that DMA descriptors are right. The VIA hardware supports full protection in hardware, where the DMA engine is only one part of the whole protection mechanism.
3.3 Question 3: In which way is memory of a process on a remote node made accessible for a local process?

Common PCI-SCI case: Making remote memory accessible is a key function in an SCI system, of course. Each PCI-SCI bridge offers a special PCI memory window which is practically the virtual SCI memory seen by the card. So the same SCI memory the DMA engine may access can also be accessed via memory references (also called programmed IO here). The procedure of making globally available SCI memory accessible for the local host is also referred to as importing global memory into local address space. On the other side, every PCI-SCI bridge can open a window to local address space and make it accessible for remote SCI nodes. The mechanism of this window is already described in section 3.1 regarding question 1. The procedure of making local memory globally accessible is also called exporting local memory into global SCI space. Protection is totally guaranteed when dealing with imported and exported memory from the point of view of memory references. Only if a process has got a valid mapping of a remote process' memory page is it able to access this memory.
VIA case: The VI Architecture offers principally no mechanism to access remote memory as it is realized in a distributed shared memory communication system such as SCI. But there is an indirect way by using a so-called Remote DMA (or RDMA) mechanism. This method is very similar to DMA transfers as they are used in common PCI-SCI bridges. A process that wants to transfer data between its local memory and memory of a remote process specifies an RDMA descriptor. This contains an address in the local VIA virtual address space and an address in the remote node's local VIA virtual address space.
Conclusions regarding question 3: While a PCI-SCI architecture allows processes to really share their memory globally across a system, this is not possible with VIA hardware. Of course, VIA was never designed for realizing distributed shared memory.
4 A new PCI-SCI Architecture with VIA Approaches

In our design we want to combine the advantages of an ultra-low latency SCI shared memory with a VIA-like advanced memory management and protected user-level DMA. This combination will make our SCI hardware more suitable for our message passing oriented parallel applications requiring short as well as long transmission sizes.
4.1 Advanced Memory Management

In order to eliminate the above-discussed restrictions with continuous and aligned exported memory regions that must reside in a special window, our PCI-SCI
architecture will implement two address translation tables, for both local and remote memory accesses. In contrast, common PCI-SCI bridges use only one translation table, for accesses to remote memory. This new and more flexible memory management, combined with a reduced minimal page size of distributed shared memory, leads to a much better usage of the main memory of the host system. In fact, our targeted amount of imported SCI memory is 1 GB with a page granularity of 16 kB. With a larger downstream address translation table this page size may be reduced further to match exactly the page size used in the host systems (such as 4 kB for x86 CPUs). In the case of the granularity of memory to be exported in SCI terminology, or to be made available for VIA operations, there's no question: it must be equal to the host system page size. In other words, 4 kB since the primary target system is an x86 one. 128 MB is the planned maximum window size here.
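Assuming exactly these window and page sizes, the number of entries in the two tables follows from simple arithmetic; the short program below is only this back-of-the-envelope calculation, not a statement about the actual FPGA implementation.

    #include <stdio.h>

    int main(void)
    {
        /* Downstream (imported SCI memory): 1 GB window, 16 kB pages. */
        unsigned long long down_window = 1ULL << 30;   /* 1 GB   */
        unsigned long long down_page   = 16ULL << 10;  /* 16 kB  */

        /* Upstream (exported local memory): 128 MB window, 4 kB pages. */
        unsigned long long up_window   = 128ULL << 20; /* 128 MB */
        unsigned long long up_page     = 4ULL << 10;   /* 4 kB   */

        printf("downstream entries: %llu\n", down_window / down_page); /* 65536 */
        printf("upstream entries:   %llu\n", up_window / up_page);     /* 32768 */
        return 0;
    }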
4.2 Operation of Distributed Shared Memory from a memory-related point of view

Figure 2. Address Translations between exporting and importing Processes for programmed IO
Figure 2 gives an overall example of exporting/importing memory regions. The example illustrates the address translations performed when the importing process accesses memory exported by a process on the remote node. The exporting process exports some of its previously allocated memory by registering it within its local PCI-SCI hardware. Registering memory is done on a per-page basis. Remember that in the case of a common PCI-SCI system it would be required that this exported memory is physically located inside the special memory area reserved for SCI purposes. But here we can take advantage of the virtual view onto local memory similar to that in the VI Architecture.
Once the upstream address translation table entries are adjusted, the exported memory can be accessed from remote machines since it has become part of the global SCI memory. To access this memory, the remote machine must import it first. The major step to do here is to set up entries inside its downstream address translation table so that they point to the region inside the global SCI memory that belongs to the exporter. From then on, the only remaining task is to map the physical PCI pages that correspond to the prepared downstream translation entries into the virtual address space of the importing process. When the importing process accesses the imported area, the transaction is forwarded through the PCI-SCI system and addresses are translated three times. First the host MMU translates the address from the process' virtual address space into physical address space (or rather PCI space). Then the PCI-SCI bridge takes up the transaction and translates the address into the global SCI address space by usage of the downstream translation table. The downstream address translation includes generation of the remote node id and the address offset inside the remote node's virtual local PCI address space. When the remote node receives the transaction, it translates the address to the correct local physical (or rather PCI) address by using the upstream address translation table.
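Strung together, the path of a single programmed-IO access can be summarized in the following sketch; all three helper functions are placeholders with invented names, standing in for the host MMU, the importer's downstream table and the exporter's upstream table.

    #include <stdint.h>

    /* Placeholder lookups; real lookups would index the page tables and
     * translation tables described in the text.                            */
    static uint64_t host_mmu_translate(uint64_t proc_virt)   { return proc_virt; }
    static uint64_t downstream_translate(uint64_t pci_addr,
                                         uint16_t *node_id)  { *node_id = 2; return pci_addr; }
    static uint64_t upstream_translate(uint16_t node_id,
                                       uint64_t sci_addr)    { (void)node_id; return sci_addr; }

    /* Path of one programmed-IO access from the importing process'
     * virtual address down to the exporter's physical (PCI) memory:
     * host MMU -> downstream table -> upstream table.                      */
    uint64_t pio_access_path(uint64_t proc_virt)
    {
        uint16_t node_id;
        uint64_t local_pci  = host_mmu_translate(proc_virt);             /* 1st translation */
        uint64_t global_sci = downstream_translate(local_pci, &node_id); /* 2nd translation */
        uint64_t remote_pci = upstream_translate(node_id, global_sci);   /* 3rd translation */
        return remote_pci;
    }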
4.3 Operation of Protected User-Level Remote DMA from a memory-related point of view

Figure 3 shows the principle work of the DMA engine of our PCI-SCI bridge design. This figure shows principally the same address spaces and translation tables as figure 2. Only the process' virtual address spaces and the corresponding translation into physical address spaces are skipped in order not to overload the figure. The DMA engine inside the bridge is surrounded by two address translation tables, or more correctly said, by two address translation and protection tables. On the active node (that is, where the DMA engine is executing DMA descriptors; node 1 here) both translation tables are involved. However, on the remote node practically nothing has changed compared to the programmed IO case. Hence the remote node doesn't make any difference between transactions whether they were generated by the DMA engine or not. Both translation tables of one PCI-SCI bridge incorporate protection tags as described in section 3.2. But while this is used in VIA for accesses to local memory, here it is also used for accesses to remote SCI memory. Together with VIA mechanisms for descriptor notification and execution, the DMA engine is unable to access wrong memory pages, neither local (exported) nor remote (imported) ones. Note that a check for right protection tags is really made only for the DMA engine and only on the active node (node 1 in figure 3). In all other cases the same translation and protection tables are used, but the protection tags inside are ignored.
Figure 3. Address Translations performed during RDMA Transfers
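In other words, before executing a descriptor the DMA engine on the active node compares the protection tag stored with every referenced page, in both tables, against the tag of the Virtual Interface that posted the descriptor. A minimal sketch of this check, with assumed types and with the two tables standing for those of figure 3, could look like this:

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct { uint32_t prot_tag; bool valid; } tpt_entry_t;

    /* Check one descriptor before the DMA engine executes it: every page
     * touched locally (upstream table) and remotely (downstream table)
     * must carry the protection tag of the posting Virtual Interface.      */
    static bool dma_desc_allowed(const tpt_entry_t *upstream_local,
                                 const tpt_entry_t *downstream_remote,
                                 unsigned npages, uint32_t vi_tag)
    {
        for (unsigned i = 0; i < npages; i++) {
            if (!upstream_local[i].valid || upstream_local[i].prot_tag != vi_tag)
                return false;    /* wrong or unregistered local page         */
            if (!downstream_remote[i].valid || downstream_remote[i].prot_tag != vi_tag)
                return false;    /* wrong or unregistered remote page        */
        }
        return true;             /* descriptor may be executed               */
    }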
4.4 A free choice of using either Programmed I/O or User-Level Remote DMA

This kind of global memory management allows applications, or more exactly communication libraries, to decide on the fly, depending on the data size, in which way data should be transferred. In the case of a short message a PIO transfer may be used, and in the case of a longer message an RDMA transfer may be suitable. The corresponding remote node is not concerned in this decision since it doesn't see any differences. This keeps the protocol overhead very low. And finally we want to recall the VIA case. Although there we already have the opportunity of a relatively low-latency protected user-level remote DMA mechanism without the memory handling problems of common PCI-SCI, there's nothing like a PIO mechanism for realizing a distributed shared memory. Hence the advantages of an ultra-low latency PIO transfer are not available there.
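In a communication library this choice could look like the following fragment; the threshold value and the two transfer primitives are placeholders and not part of the proposed hardware interface.

    #include <stddef.h>
    #include <string.h>

    #define PIO_THRESHOLD 1024   /* assumed crossover point in bytes */

    /* Placeholder: in the real system this would build a descriptor and
     * ring the doorbell of the posting process' Virtual Interface.         */
    static void post_rdma_write(void *dst, const void *src, size_t len)
    {
        (void)dst; (void)src; (void)len;
    }

    /* Send into memory the peer has exported and we have imported:
     * short messages by programmed IO (plain stores into the mapped
     * window), long messages by a user-level RDMA descriptor.              */
    void send_to_imported(void *dst_imported, const void *src, size_t len)
    {
        if (len <= PIO_THRESHOLD)
            memcpy(dst_imported, src, len);          /* PIO: CPU writes remote memory  */
        else
            post_rdma_write(dst_imported, src, len); /* DMA engine copies in background */
    }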
5 Influence on MPI Libraries

To show the advantages of the presented advanced memory management we want to take a look at the so-called Rendezvous Protocol that is commonly used for Send-Receive operations. Figure 4 illustrates the principle of the Rendezvous protocol used in common MPI implementations [7] based on Dolphin's PCI-SCI bridges. One big problem in this model is the copy operation that takes place on the receiver's side to take data out of the SCI buffer. Although the principally increasing latency can be hidden due to the overlapping mechanism, a lot of CPU cycles are burned there.
Figure 4. Typical Rendezvous Protocol in common PCI-SCI Implementations

Figure 5. Improved Rendezvous Protocol based on advanced PCI-SCI Memory Management
With our proposed memory management there's a chance to remove this copy operation on the receiver's side. The basic operation of the Rendezvous protocol can be implemented as described in figure 5. Here the sender informs the receiver as usual. Before the receiver sends back an acknowledgement, it checks whether the data structure the data is to be written to is already exported to the sender. If not, the memory region that includes the data structure is registered within the receiver's PCI-SCI bridge and exported to the sender. The sender itself must also import this memory region if this was not already done before. After this the sender copies data from private memory of the sending process directly into private memory of the receiving process. As a further optimization the sender may decide to use the DMA engine to copy data without further CPU intervention. This decision will typically be based on the message size.
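The improved protocol of figure 5 can be summarized in pseudo-C as follows; all functions are hypothetical library calls, and error handling as well as caching of already exported/imported regions is omitted.

    #include <stddef.h>
    #include <stdbool.h>

    /* Hypothetical library calls; names are illustrative only.             */
    void  send_ctrl(int peer, const void *msg, size_t len);     /* small control message */
    void  recv_ctrl(int peer, void *msg, size_t len);
    bool  region_exported_to(int peer, void *buf, size_t len);
    void  export_region_to(int peer, void *buf, size_t len);    /* register + export     */
    void *import_region_from(int peer, void *remote_buf, size_t len);
    void  rdma_or_pio_write(void *dst_imported, const void *src, size_t len);

    /* Receiver side: make the final destination buffer remotely writable
     * before acknowledging, so the sender can deposit the data directly.   */
    void rendezvous_recv(int peer, void *recv_buf, size_t len)
    {
        struct { void *buf; size_t len; } req;
        recv_ctrl(peer, &req, sizeof req);               /* Request_Send          */
        if (!region_exported_to(peer, recv_buf, len))
            export_region_to(peer, recv_buf, len);       /* register + export     */
        send_ctrl(peer, &recv_buf, sizeof recv_buf);     /* Ok_to_Send + address  */
        recv_ctrl(peer, &req, sizeof req);               /* Ready: data has arrived */
    }

    /* Sender side: import the receiver's buffer (once) and write into it
     * directly - no intermediate SCI buffer, no copy on the receiver.      */
    void rendezvous_send(int peer, const void *send_buf, size_t len)
    {
        struct { const void *buf; size_t len; } req = { send_buf, len };
        void *remote_buf, *mapped;

        send_ctrl(peer, &req, sizeof req);               /* Request_Send          */
        recv_ctrl(peer, &remote_buf, sizeof remote_buf); /* Ok_to_Send + address  */
        mapped = import_region_from(peer, remote_buf, len);
        rdma_or_pio_write(mapped, send_buf, len);        /* direct remote write   */
        send_ctrl(peer, &req, sizeof req);               /* Ready                 */
    }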
6 State of the project (November 1999)

We developed our own FPGA-based PCI-SCI card and have prototypes of this card already running. At the moment they offer only a so-called Manual Packet Mode, which is intended for sideband communication besides the regular programmed IO and DMA transfers. The card itself is a 64 Bit/33 MHz PCI Rev. 2.1 one [8]. As SCI link controller we are using Dolphin's LC-2 for now, and we are looking to migrate to the LC-3 as soon as it is available. The reprogrammable FPGA design leads to a flexible, reconfigurable hardware and also offers the opportunity for experiments. Linux low-level drivers for Alpha and x86 platforms and several configuration/test programs were developed. In addition our research group is working on an appropriate higher-level Linux driver for our card [5, 6]. This offers a software interface (advanced Virtual Interface Provider Library) that combines SCI and VIA features such as importing/exporting memory regions, VI connection management etc. It also emulates parts of the hardware so that it is possible to run other software on top of it although the real hardware is not available. As an example, a parallelized MPI version of the popular raytracer POVRAY is already running over this emulation. This program uses an MPI-2 library for
our combined SCI/VIA hardware. This library is also under development at our department [3]. For more details and latest news refer to our project homepage at http://www.tu-chemnitz.de/~mtr/VIA SCI/
7 Other Works on SCI and VIA

Dolphin already presented some performance measurements in [1] for their VIA implementation, which is an emulation over SCI shared memory. Although the presented VIA performance looks very good, it is achieved at the cost of too big CPU utilization again. The number of vendors of native VIA hardware is growing more and more. One of these companies is GigaNet [17], where performance values are already available. GigaNet gives on their web pages latencies of 8 µs for short transmission sizes. Dolphin gives a latency for PIO operations (remote memory access) of 2.3 µs. This demonstrates the relatively big performance advantage a distributed shared memory offers here. The University of California, Berkeley [2] and the Berkeley Lab [9] are doing more open research also in the direction of improving the VIA specification. The work at the University of California, Berkeley is concentrated more on VIA hardware implementations based on Myrinet. In contrast, the work at the Berkeley Lab is targeted mainly at software development for Linux.
8 Conclusions and Outlook

The combined PCI-SCI/VIA system is not just a simple result of adding two different things. Rather it is a real integration of both in one design. More concretely, it is an integration of concepts defined by the VIA specification into a common PCI-SCI architecture, since major PCI-SCI characteristics are kept. The result is a hardware design with completely new qualitative characteristics. It combines the most powerful features of SCI and VIA in order to get highly efficient messaging mechanisms and high throughput over a broad range of message lengths. The advantage that MPI libraries can take from a more flexible memory management was illustrated for the case of a Rendezvous Send-Receive for MPI. The final proof in practice is still pending due to the lack of hardware with all features implemented.
References

1. Torsten Amundsen and John Robinson: High-performance cluster-computing with Dolphin's CluStar PCI adapter card. In: Proceedings of SCI Europe '98, Pages 149-152, Bordeaux, 1998
2. Philip Buonadonna, Andrew Geweke: An Implementation and Analysis of the Virtual Interface Architecture. University of California at Berkeley, Dept.of Computer Science, Berkeley, 1998. www.cs.berkeley.edu/~philipb/via/ 3. A new MPI{2{Standard MPI Implementation with support for the VIA. www.tu-chemnitz.de/informatik/RA/projects/chempi-html/
4. Dolphin Interconnect Solutions AS: PCI-SCI Bridge Spec. Rev. 4.01. 1997.
5. Friedrich Seifert: Design and Implementation of System Software for Transparent Mode Communication over SCI. Student Work, Dept. of Computer Science, University of Technology Chemnitz, 1999. See also: www.tu-chemnitz.de/~sfri/publications.html
6. Friedrich Seifert: Development of System Software to integrate the Virtual Interface Architecture (VIA) into the Linux Operating System Kernel for optimized Message Passing. Diploma Thesis, TU-Chemnitz, Sept. 1999. See also: www.tu-chemnitz.de/informatik/RA/themes/works.html
7. Joachim Worringen and Thomas Bemmerl: MPICH for SCI-connected Clusters. In: Proceedings of SCI-Europe '99, Toulouse, Sept. 1999, pages 3-11. See also: wwwbode.in.tum.de/events/sci-europe99/
8. Mario Trams and Wolfgang Rehm: A new generic and reconfigurable PCI-SCI bridge. In: Proceedings of SCI-Europe '99, Toulouse, Sept. 1999, pages 113-120. See also: wwwbode.in.tum.de/events/sci-europe99/
9. M-VIA: A High Performance Modular VIA for Linux. Project Homepage: http://www.nersc.gov/research/FTG/via/
10. MPI Software Technology, Inc.: Performance of MPI/Pro for cLAN on Linux and Windows. www.mpi-softtech.com/performance/perf-win-lin.html
11. The Open Scalable Cluster ARchitecture (OSCAR) Project. TU Chemnitz. www.tu-chemnitz.de/informatik/RA/projects/oscar html/
12. IEEE Standard for Scalable Coherent Interface (SCI). IEEE Std. 1596-1992. SCI Homepage: www.SCIzzL.com
13. Mario Trams: Design of a system-friendly PCI-SCI Bridge with an optimized User-Interface. Diploma Thesis, TU-Chemnitz, 1998. See also: www.tu-chemnitz.de/informatik/RA/themes/works.html
14. Mario Trams, Wolfgang Rehm, and Friedrich Seifert: An advanced PCI-SCI bridge with VIA support. In: Proceedings of 2nd Cluster-Computing Workshop, Karlsruhe, 1999, pages 35-44. See also: www.tu-chemnitz.de/informatik/RA/CC99/
15. The U-Net Project: A User{Level Network Interface Architecture. www2.cs.cornell.edu/U-Net
16. Intel, Compaq and Microsoft: Virtual Interface Architecture Specification V1.0. VIA Homepage: www.viarch.org
17. GigaNet Homepage: www.giganet.com
ATOLL, a New Switched, High Speed Interconnect in Comparison to Myrinet and SCI
Markus Fischer, Ulrich Brüning, Jörg Kluge, Lars Rzymianowicz, Patrick Schulz, Mathias Waack
University of Mannheim, Germany,
[email protected]
Abstract. While standard processors achieve supercomputer performance, a performance gap exists between the interconnects of MPPs and COTS. Standard solutions like Ethernet cannot keep up with the demand for high speed communication of today's powerful CPUs. Hence, high speed interconnects have an important impact on a cluster's performance. While standard solutions for processing nodes exist, communication hardware is currently only available as a special, expensive, non-portable solution. ATOLL presents a switched, high speed interconnect which fulfills the current needs for user level communication and concurrency in computation and communication. ATOLL is a single chip solution; additional switching hardware is not required.
1 Introduction
Using commodity off-the-shelf components (COTS) is a viable option for building powerful clusters, not only for number crunching but also for highly parallel, commercial applications. The first clusters already show up in the Top500 [6] list, and the number of entries is expected to rise continuously. Powerful CPUs such as the Intel PIII Xeon with SMP functionality achieve processing performance known from supercomputers. Currently a high percentage of existing clusters is equipped with standard solutions such as Fast Ethernet. This is mainly for compatibility reasons, since applications based on standardized TCP/IP are easily portable. This protocol, however, is known to cause too much overhead [7]. Especially lowering latency is an important key to achieving good communication performance. A survey on message sizes shows that protocols and hardware have to be designed to handle short messages extremely well [14]:
– in seven parallel scientific applications, 30% of the messages were between 16 bytes and a kilobyte
– the median message sizes for TCP and UDP traffic in a departmental network were 32 and 128 bytes, respectively
– 99% of the TCP and 86% of the UDP traffic was less than 200 bytes
– on a commercial database, all messages were less than 200 bytes
– the average message size ranges between 19 and 230 bytes
Recent research with Gigabit/s interconnects, such as Myrinet and SCI, has shown that one key to achieving low latency and high bandwidth is to bypass the operating system, avoiding a trap into the system: User Level Communication (ULC) gives the user application full control over the interconnect device (BIP, HPVM, UNET, AM). While ULC shortens the critical path when sending a message, a global instance such as the kernel is no longer involved in scheduling outgoing data. This has the disadvantage that security issues have to be considered if different users are running their applications. Thrashing and context switching through multiple processes can also lower performance. Current research examines how to multiplex a network device efficiently [8] if this is not supported by the NI hardware itself. Therefore, a unique solution would be to support multiple NIs directly in hardware. Designing interconnects for the standard PCI interface cuts down production costs due to higher volume. Nevertheless, the necessary additional switching hardware increases the total cost per node significantly. While PCI is a standard interface designed for I/O, current PCI bridges are limited to a bandwidth of 132 MB/s running at 32bit/33MHz. Upcoming mainboards will run at 64bit/66MHz and achieve a maximum bandwidth of 528MB/s. The paper is organized as follows. The design space for network interfaces is evaluated and an overview of key functionality to achieve good communication performance is given in the next section. Section 3 describes the design issues of ATOLL in comparison to Myrinet and SCI. In Section 4, software layers such as the low level API and message passing interfaces for ATOLL and other NICs are discussed. Finally, Section 5 concludes the paper.
2 Design Space for Network Interfaces
In this section we evaluate current NICs and characterize the design space of I/O features in general, differentiating between hardware and software issues. From the hardware's point of view, features like a special purpose processor on board, additional (staging) memory, support for concurrency by allowing both PIO and DMA operations, or support for shared memory at the lowest level are of interest. The requirement for additional switching hardware to build large-scale clusters is another concern. From the software's point of view it is interesting to examine which protocols are offered and how they are implemented, whether MMU functionality is implemented to allow RDMA, and how message delivery and arrival are detected. The latter has a major impact on performance. We break the design space down into the following items:
– Concurrency with PIO and DMA Transactions, MMU Functionality to Support RDMA
Basically, when sending a message, the NIC's API chooses PIO or DMA for the transfer, depending on the message size. PIO has the advantage of low start-up costs to initiate the transfer. However, since the processor transfers the data
directly to the network, it is busy during the entire transaction. To allow concurrency, the DMA mode must be chosen, in which the processor only prepares a message by creating a descriptor pointing to the actual message. This descriptor is handed to the DMA engine, which picks up the information and injects the message into the network. It is important to know that the DMA engine relies on pinned-down memory, since otherwise pages can be swapped out of memory and the engine usually cannot page on demand by itself. The advantage of using DMA is that it hides latency (allowing for multiple sends and receives). However, it has a higher start-up time than PIO. Typically, a threshold value determines which protocol is chosen for the transaction. Both mechanisms also play an important role when trying to avoid memory copies.
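As an illustration of this threshold-based choice between PIO and DMA, the following C sketch shows one plausible send path. The threshold value, the function names, and the descriptor layout are assumptions for illustration, not any particular NIC's API.

/* Sketch of threshold-based PIO/DMA selection (illustrative only). */
#include <stddef.h>

#define PIO_DMA_THRESHOLD 128   /* bytes; an assumed, tunable value */

struct send_desc { const void *buf; size_t len; int dest; };

static void send_pio(const void *buf, size_t len, int dest)
{
    /* CPU copies the payload directly into the NIC's I/O window:
     * low start-up cost, but the CPU is busy for the whole transfer. */
    (void)buf; (void)len; (void)dest;
}

static void send_dma(const void *buf, size_t len, int dest)
{
    /* CPU only builds a descriptor pointing at pinned memory; the NIC's
     * DMA engine fetches and injects the data asynchronously. */
    struct send_desc d = { buf, len, dest };
    (void)d;
}

void nic_send(const void *buf, size_t len, int dest)
{
    if (len <= PIO_DMA_THRESHOLD)
        send_pio(buf, len, dest);   /* short message: PIO wins on latency  */
    else
        send_dma(buf, len, dest);   /* long message: DMA overlaps with CPU */
}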
– Intelligent Network Adapter, Hardware and Software Protocols
The most important benefit of an intelligent network adapter (processor and SRAM on board) is the flexibility in programming message handling functionality. Protocols for error detection and correction can be programmed in software, and new techniques can be applied as well (VIA). Support for concurrency is also improved. Additional memory on board lowers congestion, and the possibility of deadlocks in the network decreases. It has the advantage of buffering incoming data, thus emptying the network links on which the message has been transferred. However, the memory size is usually limited and expensive, and the number of data copies rises. Another disadvantage of this combination is that the speed of an on-board processor cannot keep up with the main processing unit. Finally, programming the network adapter is a non-trivial task.
– Switches, Scalability and Routing
A benchmark of a point-to-point routine typically shows the best performance only for non-standard situations. Since a parallel application usually consists of dozens of processes communicating in a more or less fixed pattern, measuring the bisection bandwidth gives better information about the underlying communication hardware. A cost-effective SAN has bidirectional links and allows sending and receiving concurrently. A key factor for performance is scalability when switches are added in a multistage connection network to allow larger clusters. Here, blocking behavior becomes the major concern. Another point of interest is the connection from NIC to NIC: data link cables must provide a good compromise between data path width and transfer speed.
– Hardware Support for Shared Memory (Coherency) and NI Locations
Currently a trend towards clustering bigger SMP nodes can be seen. Within an SMP node, a cache coherence protocol like MESI synchronizes to achieve data consistency. To add this functionality to I/O devices (such as the NIC), they would have to participate in the cache coherence protocol and be able to snoop on the system bus. However, this would require a special solution for every processor type and system and cannot be propagated as a commodity solution. With the
growing distance between the NI and the processor, the latency of communication operations rises and, at the same time, the bandwidth declines. The only position that results in a wide distribution and, thus, the necessary higher production volumes, is the standardized PCI bus. This leads to the loss of a number of functions, e.g., cache coherent accesses to the main memory of the processor. As the NI on the PCI card is (and has to be) independent of the processor used, functions like the MMU cannot be recreated in the NI, as they differ according to which processor is being used. For this purpose an adaptable hardware realization of the basic mechanisms or an additional programmable processor on the PCI card can be used.
– Performance Issues: Copy Routines and Notification Mechanisms
Once a message is ready for sending, the data has to be placed at a location from which the NIC can fetch it. Using the standard memcpy routines, however, may show poor performance. The reason is that the CPU's cache is polluted when larger messages are injected into the network. Modern CPUs like the Pentium III or UltraSPARC offer special MMX or VIS instructions which copy the data without polluting the cache. Another critical point is the software overhead caused by the various protocols that guarantee data transfer. Nowadays cables are almost error free, so heavy protocols like TCP/IP are no longer necessary. Since an error may still occur, automatic error detection and correction implemented directly in hardware would improve efficiency. Performance is also sensitive to message arrival detection. A polling method typically wastes a lot of CPU cycles, and an interrupt causes too much overhead, since contexts have to be switched. Avoiding the interrupt mechanism is very important, as each interrupt handling leads to a latency of approximately 60 µs [8].
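As a rough illustration of such cache-friendly copy routines, the fragment below uses SSE2 non-temporal stores as an x86 analogue of the MMX/VIS copy instructions mentioned above. It is a sketch under the assumption of 16-byte aligned buffers and a length that is a multiple of 16 bytes; it is not the hand-tuned routine of any of the cited NIC libraries.

/* Cache-bypassing copy using non-temporal stores (illustrative sketch). */
#include <emmintrin.h>
#include <stddef.h>

void copy_nontemporal(void *dst, const void *src, size_t len)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < len / 16; i++) {
        __m128i v = _mm_load_si128(s + i);   /* aligned load               */
        _mm_stream_si128(d + i, v);          /* store, bypassing the cache */
    }
    _mm_sfence();  /* make the streamed data globally visible */
}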
3 NIC Hardware Layout and Design
In the ATOLL project, all design space features have been carefully evaluated and the result is an implementation of a very advanced technology.
3.1 ATOLL
Overview. The ATOLL cluster interface network is a future communication
technology for building cost-effective and very efficient SANs using standard processing nodes. Due to an extremely low communication start-up time and very broad hardware support for processing messages, a much higher performance standard in the communication of parallel programs is achieved. Unique is the availability of four links of the interface network, an integrated 8 x 8 crossbar, and four independent host ports. They allow diverse network topologies to be created without additional external switches, and the ATOLL network is one of the first network-on-a-chip implementations. This design feature especially supports SMP nodes by assigning multiple processes their own dedicated device. Figure 1 depicts an overview of the hardware layout and data flow of ATOLL.
Fig. 1. ATOLL Hardware Layout and Data Flow
Design Features. ATOLL's special and new feature in comparison to other
NICs is the availability of multiple and independent devices. ATOLL integrates four host and network interfaces, an 8x8 crossbar, and 4 link interfaces into one single ASIC. The chip is mounted on a standard PCI board and has a 64-bit/66MHz PCI interface with a theoretical bandwidth of 528MBytes/s at the PCI bridge. By choosing this interface, ATOLL addresses commodity solutions with high volume production. The crossbar has a fall-through latency of 24ns and a capacity of 2GBytes/s bisection bandwidth. A message is broken down by hardware into 64-byte link packets, protected by CRC and retransmitted automatically upon transmission errors. Therefore, protocol overhead for data transfer is eliminated, and error detection and correction are implemented directly in hardware. The chip itself, with crossbar, host and network interfaces, runs at 250 MHz. Standard techniques for the PCI bus such as write-combining and read-prefetching to increase performance are supported. Sending and receiving of messages can be done simultaneously without involving any additional controlling instances.
Fig. 2. ATOLL Chip
The ATOLL API is responsible for direct communication with each of the network interfaces, providing ULC and giving the user complete control of "his" device. In contrast to other SANs, most of the data flow control is implemented directly in hardware. This results in an extremely low communication latency of less than 2 µs. ATOLL offers Programmed I/O (PIO mode) and Direct Memory Access (DMA mode). A threshold value determines which method to choose. The latter requires one pinned-down DMA data space for each device. This data space is separated into send and receive regions. For starting a transmission in DMA mode, a descriptor is generated and entered into the job queue of the host interface. Injecting the message into the network is initiated by raising the descriptor write pointer, which triggers the ATOLL card to fetch the message. Basically, the descriptor contains the following information: the message length, the destination id, a pointer to the message in DMA memory space, and a message tag. The completion of a DMA task is signaled by writing a data word into main memory, which makes the time-consuming interrupt handling by the processor unnecessary. Figure 3 depicts the three operations of a DMA send process.
Fig. 3. Process of a DMA send job
DMA data and descriptor memory spaces are implemented as ring buffers. When receiving a message, the descriptor for the received message is assembled by the NI and copied into main memory. There it can be seen cache-coherently by the processor. In this mode the DMA engine can also be seen as a message handler in hardware. If PIO mode is used for very short messages, the message is kept in the receive FIFO of the host interface, and the processor is informed of the received message through an update of the FIFO's entries in main memory. Just as in DMA mode, an expensive interrupt is avoided. To overcome deadlocks, a time barrier throws an interrupt to empty the receive buffer. In this mode, busy waiting of the processor on the FIFO entries leads to the extremely short receive latency. As this value is also mirrored cache-coherently into main memory, the processor does not waste valuable memory or I/O bandwidth. Routing is done via source path routing, identifying sender and receiver by a system-wide unique identifier, the ATOLL ID. Routing information is stored in a status page residing in pinned DMA memory space. For each communication partner, a point-to-point
connection is created. If two communication partners are within one SMP node, the ATOLL API transparently maps the communication to shared memory. Finally, the ATOLL NIC supports multithreaded applications. A special register accessible in user mode can be used as a 'test-and-set' semaphore. Typical standard processors like the PIII restrict locking mechanisms to superuser level.
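The DMA send descriptor described above can be pictured roughly as the following C structure. The field set follows the text (length, destination id, pointer into the pinned DMA data space, message tag); the exact layout, field widths and the write-pointer mechanism are assumptions.

/* Illustrative layout of an ATOLL-style DMA send descriptor (assumed). */
#include <stdint.h>

struct atoll_dma_desc {
    uint32_t msg_len;      /* message length in bytes                    */
    uint32_t dest_id;      /* destination ATOLL id                       */
    uint64_t dma_offset;   /* where the payload sits in pinned DMA space */
    uint32_t msg_tag;      /* user-visible message tag                   */
};

/* Posting a send: copy the payload into the pinned send region, append
 * the descriptor to the job queue (a ring buffer), then advance the
 * descriptor write pointer, which triggers the card to fetch and inject
 * the message.  Completion is signaled by a data word written back into
 * main memory, so no interrupt is needed. */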
3.2 Myrinet
Overview. Myrinet is a high-speed interconnection technology for cluster
computing. Figure 4 depicts the layout of the Myrinet NIC. A complete network consists of three basic components: a switch, one Myrinet card per host, and cables which connect each card to the switch. The switch transfers variable-length packets concurrently at 1.28 Gbit/s using wormhole routing through the network. Hardware flow control via back-pressure and in-order delivery are guaranteed. The NI card connects to the PCI bus of the host and holds three DMA engines, a custom programmable network controller called LANai, and up to 2 MByte of fast SRAM to buffer data. Newer cards improve some parameters but do not change the basic layout. They have a 64-bit addressing mechanism allowing 1GByte of memory to be addressed, and a faster RISC processor at 100 MHz accessing the SRAM, which has been increased to 4 MBytes.
Fig. 4. Myrinet Hardware Layout and Data Flow
Design Features. Under the supervision of the RISC, the DMA engines are
responsible for handling data for the following interfaces: host memory/NIC SRAM and SRAM/network, respectively. In detail, one DMA engine moves data from host memory to SRAM and vice versa, the second stores incoming messages from the network link to the SRAM, and the third injects data from SRAM into the network. The LANai processor runs at 100 MHz, controls the DMA operations, and can be programmed individually by a Myrinet Control Program
(MCP). The SRAM serves primarily for staging communication data, but also stores the code of the MCP. To simplify the software, the NI memory can be mapped into the host's virtual address space. As research shows [1], the limited amount of memory on the NIC is not a bottleneck, but the interaction of DMA engines and LANai is. The Myrinet card retrieves five prioritized data streams into the SRAM. However, at a cycle time of 10ns only 2 of them can be addressed, whereas 3 are stalling. This leads to a stalling LANai, which does not get access to the staging memory. When sending a message with Myrinet, the user first copies the data to a buffer in host memory which is accessible by the NI device. The next step is to provide the MCP with the (physical) address of the buffer position. The LANai starts the PCI bridge engine to copy the data from host memory to NIC memory. Finally, the LANai starts up the network DMA engine to inject the data from NIC memory into the network. On the receiving side, the procedure is reversed. First, the LANai starts the receive DMA engine to copy the data to NIC memory and starts the PCI bridge engine to copy the data to an address in host memory (which was previously specified via a rendezvous protocol). Finally, after both copies are performed, the receiving LANai notifies the polling processor of the message arrival by setting a flag in host memory.
3.3 Scalable Coherent Interface (SCI)
Overview. Compared to Myrinet, SCI is not just another network interface
card for message passing, but offers shared memory programming in a cluster environment as well. SCI intends to enable a large cache coherent system with many nodes. Besides its own private cache/memory, each node has an additional SCI cache for caching remote memory. Unfortunately, caching of remote memory is not possible for PCI bus based systems, because transactions on the system bus are not visible on the PCI bus. Therefore an important feature defined in the standard is not available on standard clusters, and SCI is no longer coherent when relying solely on its hardware.
Design Features. One of the key features of SCI is that by exporting and
importing memory chunks, a shared memory programming style is adopted. Remote memory access (RMA) is directly supported at the hardware level (Figure 5 depicts an overview of SCI address translations). By providing a unique handle to the exported memory (SCI Node ID, Chunk ID and Module ID), a remote host can import this 'window' and create a mapping. To exchange messages, data has to be copied into this region and will be transferred by the SCI card, which detects data changes automatically. Packets of 64 bytes are sent immediately; otherwise a store barrier has to be called to force a transaction. In order to notify other nodes when messages have been sent, they can either implement their own flow control and poll on data, or create an interrupter which will trigger the remote host. However, the latter performs badly, with a latency of 36 µs on a Pentium II 450. One major drawback of SCI is that a shared memory programming style cannot easily
Fig. 5. SCI Address Translation
be achieved because of the lacking functionality to cache regions of remote memory in the local processor cache. Furthermore, SCI uses read and write buffers to speed up communication, which brings along a certain amount of inconsistency. Finally, SCI is not attractive to developers, who have to keep in mind the big gap between read and write bandwidth in order to achieve the highest performance (74MB/s remote write vs. 7.5MB/s remote read using a Pentium II 450). Regarding concurrency, the preferred method is to use the processor to copy data. In this case, however, the processor is busy and cannot be used to overlap computation and communication as it could if DMA were used. Using the processor, a remote communication in SCI takes place as just a part of a simple load or store opcode execution in a processor. Typically the remote address results in a cache miss, which causes the cache controller to address remote memory via SCI to get the data, and within the order of a microsecond the remote data is fetched to the cache and the processor continues execution.
4 Software
4.1 Low Level API
The old approach, moving data through I/O-channel or network-style paths, requires assembling an appropriate communication packet in software, pointing the interface hardware at it, and initiating the I/O operation, usually by calling a kernel subroutine. When the data arrives at the destination, hardware stores it in a memory buffer and alerts the processor with an interrupt. Software then moves the data to a temporary user buffer before it is finally copied to its destination. Typically this process results in latencies that are tens to thousands of times higher than user level communication. These latencies are the main limitation on the performance of clusters or networks of workstations.
ATOLL. The ATOLL API provides access to the device at user level. It offers function calls to establish direct point-to-point communication between
two ATOLL ids. An ATOLL id and one corresponding host port are assigned to a process when opening the device. A connection between two ATOLL ids is needed in order to call non-blocking send and receive operations. Send, receive, and multicast have a message passing style in the form of tuples (destination, src, length). Threshold values for PIO and DMA can be adjusted at runtime. In PIO mode, the ATOLL API offers zero level communication. Besides the message passing functionality, the ATOLL API offers lock and unlock primitives for the semaphore that is available for each host port. The ATOLL API is open source.
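Based on the functionality just listed, a hypothetical outline of the ATOLL user-level API could look as follows; all identifiers here are illustrative assumptions rather than the published interface.

/* Hypothetical outline of an ATOLL-style user-level API (assumed names). */
#include <stddef.h>

typedef int atoll_id_t;

atoll_id_t atoll_open(void);                              /* assigns an id and a host port */
int atoll_connect(atoll_id_t local, atoll_id_t remote);   /* required before transfers     */
int atoll_isend(atoll_id_t dest, const void *src, size_t len);  /* non-blocking send       */
int atoll_irecv(atoll_id_t src, void *dst, size_t len);         /* non-blocking receive    */
int atoll_set_pio_dma_threshold(size_t bytes);            /* adjustable at runtime         */
int atoll_lock(atoll_id_t id);                            /* host-port semaphore           */
int atoll_unlock(atoll_id_t id);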
Myrinet and SCI. Well known APIs for the Myrinet NIC are the PM and GM
libraries. Both APIs are open source and offer send and receive functions in a message passing style. For SCI, Dolphin and Scali offer low level APIs to create, map, and export memory to remote nodes. The implementation of sending and receiving of data is left to the user. Typically, ring buffers are implemented in the mapped memory regions. This allows for simple data flow control. Writing data to this region is detected by the SCI card, which transfers the updated data to the remote node.
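A minimal sketch of such a ring buffer placed in an imported SCI memory region is shown below; the layout and flow-control scheme are assumptions for illustration only.

/* Ring-buffer message channel in an exported/imported SCI region (sketch). */
#include <stdint.h>
#include <string.h>

#define RING_SLOTS 64
#define SLOT_BYTES 256

struct sci_ring {                 /* lives inside the shared memory region */
    volatile uint32_t head;       /* advanced by the sender                */
    volatile uint32_t tail;       /* advanced by the receiver              */
    uint8_t slot[RING_SLOTS][SLOT_BYTES];
};

/* Sender side: writes into the mapped region are picked up by the SCI
 * card and propagated to the remote copy; the head update publishes the
 * new slot to the receiver, which polls head for simple flow control. */
int ring_send(struct sci_ring *r, const void *msg, uint32_t len)
{
    uint32_t next = (r->head + 1) % RING_SLOTS;
    if (next == r->tail || len > SLOT_BYTES)
        return -1;                        /* ring full or message too large */
    memcpy(r->slot[r->head], msg, len);
    r->head = next;                       /* publish after the payload      */
    return 0;
}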
4.2 Upper Software Layer for Communication
Open source can be seen as a key to the success of a project. Myrinet and ATOLL are open source projects in which ports to standard message passing environments such as MPI and PVM are available to application developers. Drivers for SCI are not in an open format. This makes it difficult to fix bugs, and the widespread usage of the software is also limited. For all APIs, devices for MPICH [11] have been written. Especially the Score 2.4 implementation [9] based on PM achieves good performance, supporting intra-node communication at 1150 Mbit/s and inter-node communication at 850 Mbit/s using Pentium II 400s. It should be mentioned that Score also allows multiple processes from different users to use the Myrinet interface. Another device written for MPICH is BIP-MPI [10]. This software also achieves good performance, but is restricted to a single job per node. The ScaMPI implementation from Scali also achieves high bandwidths; however, using the BEFF [12] benchmark from Pallas, the latency on clusters with more than 32 nodes increased up to 60 µs. ATOLL, which will be available in 1Q/00, is an open source project, and the first message passing environments will be based on MPICH and PVM. With a hardware latency of 1.4 µs and a link utilization of 80% at 128 bytes, the achievable performance with MPI looks promising. A loop-back device shows a one-way roundtrip time of 2.4 µs.
4.3 Communication Overhead
In this section we discuss current techniques to avoid memcpys when injecting data into the network. Figure 6 depicts the necessary steps involved in a transaction for ATOLL.
Fig. 6. DMA Copy
Recent research tries to avoid unnecessary data copies, resulting in a so-called zero copy mechanism. In this case the data is fetched directly from its position in application memory and deposited directly in remote application memory. Using this method, latency is expected to decrease and bandwidth to increase for data transfers. Basically, if PIO is available, this communication mode can be used for zero copy. When sending, data is injected directly by the CPU into the network. On the receiving side, the message can again be delivered directly with PIO. The disadvantage is that the processor is involved during the entire transaction and cannot be used for computation during that time. To enable the DMA engine to perform this task, a virtual-to-physical address translation must be implemented, which increases hardware complexity significantly. Sharing the page tables between the OS and the device is complex and time consuming, too. The TLB handling is usually performed by the OS. Pages for translation have to be pinned down, and virtual addresses now represent physical ones. The TLB can be placed and managed in NI memory, in host memory, or in both. Using this method, zero copy can be achieved via remote memory writes using the information provided with the TLB. Send and receive operations carry the physical address of the destination buffer, and the DMA engine copies the (contiguous) data directly to the destination address. Typically, a rendezvous model is needed before such an operation can take place, since the location on the receiver side is not known a priori. A requirement for the NIC is to touch dynamically pinned-down data. This DMA method also only makes sense if the data to be transferred is locked down once and the segment can be re-used. Otherwise, expensive lock and unlock sequences, which trap into the system, will lower performance. Another problem coming along with zero copy is that of message flow control: it is not obvious when a message has been transferred and the space can be used again. On the other hand, support for RDMA eases the implementation of one-sided communication. Myrinet features PIO and DMA transactions; however, data to be sent is first stored in the staging memory of the NIC. In a second step, the LANai then injects the message, which will again be stored in the SRAM on the remote card. This may be the reason that the latest research shows only a performance increase of 2% when using zero copy instead of DMA copy [8]. Porting issues for using zero copy mechanisms are another point of concern.
5 Conclusion
We have given a description of ATOLL, a new high speed interconnect combining attractive, efficient design features with new features which fulfill the needs of today's high speed interconnects. ATOLL is a cost-effective, affordable SAN with fixed expenses per node, even for large clusters. More expensive solutions are Myrinet, which has the highest number of installed cards, and SCI; both need additional switches for building larger clusters. In terms of performance and robustness, Myrinet currently seems to be the best choice; however, this may change with the availability of ATOLL and 64-bit/66MHz PCI bridges. Here, the integration of the most important functions of a SAN into one chip yields a high level of performance and extremely low latency for cluster communication. The next development steps of the ATOLL network project will include optical link interconnects for increasing distance. Also under investigation is MMU functionality implemented in hardware. It is also planned to adapt the link technology to the concepts of System I/O, since major parts are easily adaptable. This will provide the user with a high speed, low latency, unified communication infrastructure for the next generation of clusters.
References
[1] Warschko, Blum and Tichy: On the Design and Semantics of User-Space Communication Subsystems. PDPTA 99, Las Vegas, Nevada.
[2] Santos, Bianchini and Amorim: A Survey of Messaging Software Issues and Systems on Myrinet Based Clusters. IEEE, 345 47th Street, New York.
[3] IEEE Standard for Scalable Coherent Interface (SCI), 1993.
[4] O'Carroll, Tezuka, Hori and Ishikawa: The Design and Implementation of Zero Copy MPI using ... In: International Conference on Supercomputing '98, pages 243-250, July 1998.
[5] Rzymianowicz, Bruening, Kluge, Schulz and Waack: ATOLL, A Network on a Chip. PDPTA 99, Las Vegas, Nevada.
[6] http://www.top500.org
[7] Kay and Pasquale: Profiling and Tuning Processing Overheads in TCP/IP. IEEE/ACM Transactions on Networking, Dec. 1996.
[8] Warschko: Efficient Communication in Parallel Computers. PhD thesis, University of Karlsruhe, 1998.
[9] Tezuka, O'Carroll, Hori and Ishikawa: Pin-down Cache: A Virtual Memory Management Technique for Zero-copy Communication. IPPS98, pages 308-314, 1998.
[10] Prylli and Tourancheau: BIP: A New Protocol Designed for High Performance Networking on Myrinet. PCNOW Workshop, IPPS/SPDP98, 1998.
[11] O'Carroll, Tezuka, Hori and Ishikawa: MPICH-PM: Design and Implementation of Zero Copy MPI for PM. Technical Report TR-97011, RWC, March 1998.
[12] http://www.pallas.com
[13] Scholtyssik and Dormanns: Simplifying the Use of SCI Shared Memory by Using Software SVM Techniques. 2nd Workshop Cluster Computing, Karlsruhe, 1999.
[14] Mukherjee and Hill: The Impact of Data Transfer and Buffering Alternatives on Network Interface Design. HPCA98, Feb. 1998.
ClusterNet: An Object-Oriented Cluster Network
Raymond R. Hoare
Department of Electrical Engineering, University of Pittsburgh, Pittsburgh, PA 15261
[email protected]
Abstract. Parallel processing is based on utilizing a group of processors to efficiently solve large problems faster than is possible on a single processor. To accomplish this, the processors must communicate and coordinate with each other through some type of network. However, the only function that most networks support is message routing. Consequently, functions that involve data from a group of processors must be implemented on top of message routing. We propose treating the network switch as a function unit that can receive data from a group of processors, execute operations, and return the result(s) to the appropriate processors. This paper describes how each of the architectural resources that are typically found in a network switch can be better utilized as a centralized function unit. A proof-of-concept prototype called ClusterNet4EPP has been implemented to demonstrate the feasibility of this concept.
1 Introduction
In the not-so-distant past, it was common for groups of people to pool their resources to invest in a single, high-performance processor. The processors used in desktop machines were inferior to the mainframes and supercomputers of that time. However, the market for desktop computers has since superseded that of mainframes and supercomputers combined. Now the fastest processors are first designed for the desktop and are then incorporated into supercomputers. Consequently, the fastest processors, memory systems, and disk controllers are packaged as a single circuit board. Thus, the highest performance "processing element" is a personal computer (PC). Almost every individual and company uses computers to help them be more efficient. Networks enable seemingly random connections of computers to communicate with each other and share resources. Computational and data intensive applications can utilize the resources of a cluster of computers if the network is efficient enough. If the network is inefficient, the added communication and coordination cost reduces, or even removes, the benefit of using multiple computers. As more computers are used to execute an application in parallel, the extra overhead eventually removes the performance benefit of the additional resources. Thus, for a cluster of computers to be
used as a single parallel processing machine, the computers must be able to efficiently communicate and coordinate with each other. Stone, in his popular book on high-performance computer architecture [1], states that the peak performance of a parallel machine is rarely achieved. The five issues he cited are:
1. Delays introduced by interprocessor communications
2. Overhead in synchronizing the work of one processor with another
3. Lost efficiency when one or more processors run out of tasks
4. Lost efficiency due to wasted effort by one or more processors
5. Processing costs for controlling the system and scheduling operations
These issues are relevant for all computer architectures but are particularly troublesome for clusters. Clusters typically use commodity network switches that have been designed for random configurations of computers and routers. Thus, packets are used to encapsulate data and provide routing information. This software overhead accounts for 40-50% of the total communication time [2]. Network switches are designed for typical network traffic. Rarely will every incoming packet be destined for the same output port for a sustained period of time. If it were, the outgoing network link would become saturated, its buffers would become full, the switch would have to drop incoming packets, and the packets would have to be resent. While this is extremely rare for a typical network, the gather and all-to-all communication operations require exactly this type of communication pattern [3]. Processor coordination and synchronization are group operations that require information from each processor. While such operations can be executed using message passing communications, a total of N messages must be sent through the network to gather the data and broadcast the result(s). These communication operations can ideally be overlapped to require only (log2 N + 1) communication time steps. Processor scheduling and control are also operations that require data from each of the processors. However, this information must be maintained to ensure even load distribution. The algorithms used for scheduling and controlling clusters are not computationally intensive but require efficient access to every computer's status. Operations that involve data from a collection of computers are defined as aggregate operations [4]. Ironically, the architectural characteristics of a typical switch are well suited for executing aggregate operations. A typical modern switch interconnects 8 to 80 computers, contains a processor (or routing logic), and stores routing information in a lookup table. For a 32- or 64-processor cluster, a single switch is capable of directly interconnecting all of the computers. Rather than assuming that a cluster is a purely distributed memory architecture that communicates through a point-to-point network, this paper examines the entire cluster architecture, including the network switch, to demonstrate how a better cluster architecture can be created. Specifically, the architectural resources contained in a typical switch will be examined, reallocated and/or changed, to facilitate efficient communication and control of the entire cluster. As shown in the following table, the architectural features of a network switch and a cluster are almost exact opposites. The proposed ClusterNet
architecture utilizes these differences to form a new architecture that is a complement of both distributed and shared memory, as well as a complement of parallel and serial program execution.
Table 1. Architectural features of network switches and a cluster of computers.
Architectural Feature | Network Switch | Cluster
Number of Processors | 1 | 16, 32, 64
I/O Ports per Processor | 8-80 | 1-3
Memory Architecture | Shared | Distributed
Storage | Lookup Table | RAM & Disk
Functionality | Fixed | Programmable
Execution Model | Serial | Parallel
Topology | Unknown | Star Topology
Performance Criteria | Packets per Second | Seconds per Communication
Communication Pattern | Point-to-Point | Point-to-Point & Collective
2 ClusterNet
While network switches can be used to facilitate cluster communications, there are a number of architectural differences between the network switch and the rest of the cluster. By combining these two architectures, a more efficient cluster architecture, called ClusterNet, can be built. Rather than limiting the network switch to routing packets, we propose expanding the role of the switch to execute functions on data gathered from a group of processors. Furthermore, because of the switch's memory architecture, it should also be able to store data. Thus, by combining data storage with computation, an object-oriented cluster network can be created. To simplify our discussion, our new object-oriented switch will be labeled an aggregate function unit (AFU) and the unmodified network switch will just be called a switch. The goal of this paper is to demonstrate how the resources of a switch can be more efficiently utilized when placed within the context of a cluster architecture. Table 2 shows how ClusterNet's usage of architectural resources differs from that of a switch.
Table 2. ClusterNet's usage of architectural resources.
Architectural Resource | Switch Usage | AFU Usage
Routing Logic | Route Messages | Execute Functions
Switch Memory | Address Lookup Table | Data Structures
Switch Port | Input/Output Packet Queue | Register Interface
Physical Link | Send/Receive Packets | Send/Receive Data
Software Interface | Send/Receive Messages | Access to AFU Port
Application Interface | MPI | Aggregate Functions
The remainder of this section discusses each of the resources listed above and how they can be used to provide a more robust cluster architecture called ClusterNet. Section 3 describes a proof-of-concept four-processor prototype that was built. Section 4 describes related work, and Section 5 offers conclusions and future directions.
2.1 Functionality: Router vs. Aggregate Function Execution
The routing logic (or processor) can collect and distribute information from every processor because most network switches interconnect between 4 and 80 computers. However, cluster implementations have maintained a distributed-memory architecture in which the processors communicate through message passing. Ironically, group operations such as Global Sum are implemented by sending N messages through the same switch in log2 N time steps. Each time step requires a minimum of a few microseconds, over 1000 processor cycles. Rather than performing the computation, the network switch is busy routing packets. Instead of using the network switch's processor to route messages, the processor can be used to execute functions within the network. Because the switch is directly connected to each of the processors, data from every processor can be sent into the network switch simultaneously. Upon arrival, the specified function is computed and the result is returned to each processor. To quantify this proposition, we define the following variables:
N - the number of processors in the cluster (2 - 64)
α - the communication time between a processor and the switch (1µs)
k * (N-1) - the number of instructions to be executed
ε - the amount of time required to compute a single instruction (5 ns)
If an associative computation is executed using N processors and a point-to-point network, the amount of time required is approximately (2α + kε) * log2 N, because computation can be overlapped. If the switch's processor is used to execute the same function, the amount of time required is (2α + (N-1)kε). From an asymptotic perspective, it is better to use all N processors rather than the AFU's processor. However, when typical values are used (α = 1µs, ε = 5ns), the resulting graphs show the performance tradeoffs as we change k and N, as shown in Fig. 1.
Fig. 1. Collective computation using the AFU versus using all N processors for k=10 and 100, α=1µs, ε=5ns.
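The two cost models can be evaluated numerically with a few lines of C; the fragment below is simply a reading of the formulas above with α = 1µs and ε = 5ns, not code from the prototype.

/* Evaluate T_parallel = (2*alpha + k*eps) * log2(N) and
 * T_afu = 2*alpha + (N-1)*k*eps for the assumed parameter values. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double alpha = 1e-6;   /* processor-to-switch latency, 1 us */
    const double eps   = 5e-9;   /* time per instruction, 5 ns        */
    const double k     = 10.0;   /* instructions per operand (k = 10) */

    for (int n = 2; n <= 64; n *= 2) {
        double t_par = (2.0 * alpha + k * eps) * log2((double)n);
        double t_afu = 2.0 * alpha + (n - 1) * k * eps;
        printf("N=%2d  parallel=%5.2f us  AFU=%5.2f us\n",
               n, t_par * 1e6, t_afu * 1e6);
    }
    return 0;
}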
2.2 Network Storage: Routing Tables vs. Network-Embedded Data Structures
To enable a switch to be used for any network topology, it must be able to change how it routes different packets. This is typically implemented through a lookup table. When a packet is received, its destination address is used as an index into the lookup table to determine which port the packet should be routed to. This information can also be changed because network configurations change. In a cluster architecture, the routing lookup table is of minimal use because each processor is directly connected to the switch. If we require that processor i be attached to port i, then there is no need for a routing table. The network-embedded memory can then be used as a cluster resource. For example, the lookup table could be used to track cluster-wide resources. If a resource is needed, the lookup table could be used to determine where the resource is located. This concept can be used to implement a dynamically-distributed shared-memory cluster architecture. In a distributed shared-memory architecture (e.g. the Cray T3D) there is a single memory address range that is distributed across all of the processors. Each processor can access any portion of memory by simply specifying a memory address. However, this results in non-uniform memory access times. Direct memory access was not built into the Cray T3E. A dynamically-distributed shared memory still uses a single address range but allows blocks of memory to migrate to the processor that needs them. When a memory request is made, the entire block of memory is relocated and placed in the local memory of the requesting processor. For regular access patterns, this drastically improves performance. However, there is an inherent contradiction within the dynamically-distributed shared-memory architecture: a shared resource table is needed to determine where each block is located. To share this location table, it too must be placed in shared memory. The location table can be distributed across the processors, but this requires two requests for every memory access. If the switch's lookup table is used for the location table, memory requests could be sent to the network and the network could forward them to the processor that currently owns the block. In addition to a lookup table, the network-embedded memory can be used to represent any number of useful data structures. Synchronization data structures can be used to implement static, dynamic, and directed synchronization. A processor load table can be kept in the network to facilitate dynamic task allocation to the least loaded processor. Queues and priority queues can also be used for task allocation and load balancing. Even shared linked lists can be implemented with a small amount of additional control logic.
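As a small illustration of such a network-embedded location table, the following C sketch shows a block-to-owner mapping of the kind the AFU could keep for a dynamically-distributed shared memory; the table size and the migration policy are assumptions.

/* Block location table held in the network (illustrative sketch). */
#include <stdint.h>

#define NUM_BLOCKS 4096          /* shared address range divided into blocks */

static uint8_t owner_of[NUM_BLOCKS];   /* block index -> owning processor/port */

/* The AFU answers a memory request by forwarding it to the current owner. */
uint8_t lookup_owner(uint32_t block)
{
    return owner_of[block % NUM_BLOCKS];
}

/* On migration, the entry is updated so that later requests follow the block. */
void migrate_block(uint32_t block, uint8_t new_owner)
{
    owner_of[block % NUM_BLOCKS] = new_owner;
}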
2.3 Network Port Interface: I/O Queues vs. Register Interface
Because all networks use packets, they also contain I/O queues to store the packets until the routing logic is able to handle them. The drawback to this is that the queues
become full and overflow. Our design does not require queues because it does not route packets. The AFU does, however, execute functions and does transmit data. As shown in Fig. 3, the interface to the AFU appears as four registers. The OpCode register is used to specify which function is to be executed. The Data registers are used to move data between the PC and the AFU. Function parameters and function results are passed through these registers using the full/empty bits to indicate valid data. The Counter register can be used as a function parameter and is useful when accessing the network-embedded data structures described earlier. The Counter is particularly useful when accessing adjacent locations in memory. When a word has been read from memory, the counter automatically increments. In this way, streaming read and write operations can be implemented easily by setting the appropriate OpCode and sending/receiving an entire block of data.
Fig. 3. The AFU Interface Port (registers Data, OpCode, Counter, Data; one side toward the PC, the other toward the AFU)
Table 3. Latencies (in µs) for point-to-point messages for several architectures.
Platform | Latency (in µs) | Send Overhead | Receive Overhead | Ref.
IBM SP2 | 39 | - | - | [5]
Intel Paragon | 6.5 | 21.5% | 33.8% | [2]
Meiko CS-2 | 7.5 | 22.7% | 21.3% | [2]
Cray T3D | 2.2 | - | - | [6]
Memory Channel | 5-20 | - | - | [7]
Myrinet | 11.2 | 17.9% | 23.2% | [2]
SHRIMP | 10+ | - | - | [8]
ParaStation | 5+ | - | - | [9]
PAPERS | 3-5 | - | - | [10]
ClusterNet4EPP | 1.7-5.2 | - | - | [11]
2.4 Software Interface: Packet vs. Direct Read and Write
As shown in Table 3, the software overhead for sending and receiving a message consumes 40-50% of the overall message latency. This is due to the time spent encoding and decoding packets. Rather than accepting this overhead, we propose expanding the functionality of the network and simplifying the network interface. Most architectures layer their communication libraries on top of point-to-point primitives that encode, send, and decode packets. ClusterNet executes functions within the network and can be used to execute collective communications within the network. As a result of executing functions within the network hardware, the software interface is very simple and only requires the seven assembly-level instructions listed below. Lines 1 and 2 are used to set the OpCode and Counter registers. The OpCode is used to specify which function should be executed. If the OpCode has not changed, these
registers do not need to be set. After data is placed into the network, the function is executed and the results are returned. This architecture relies on the fact that the network link between the processor and the AFU performs error detection and correction. A C-level sketch of this sequence is given after the listing.
1. I/O Write (OpCode) /* Optional */
2. I/O Write (Count) /* Optional */
3. I/O Write (Data)
4. I/O Read (Result)
5. if ( Result == NOT_A_NUMBER ) goto line 4
6. if ( Result != PREFIX_TOKEN ) goto line 8
7. I/O "Data" Read (Result)
8. /* The Aggregate Function has completed. */
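The following C function is one possible rendering of this sequence; the register-access wrappers and the two sentinel values are assumptions standing in for the actual port I/O, not a published API.

/* C-level rendering of the seven-instruction AFU access sequence (sketch). */
#include <stdint.h>

extern void     afu_write_opcode(uint32_t op);      /* assumed I/O wrappers */
extern void     afu_write_count(uint32_t count);
extern void     afu_write_data(uint32_t data);
extern uint32_t afu_read_result(void);
extern uint32_t afu_read_prefix_data(void);

#define NOT_A_NUMBER  0xFFFFFFFFu   /* "result not ready yet" sentinel (assumed) */
#define PREFIX_TOKEN  0xFFFFFFFEu   /* "a prefix value follows" marker (assumed) */

uint32_t afu_call(uint32_t opcode, uint32_t count, uint32_t operand)
{
    afu_write_opcode(opcode);            /* 1: optional if unchanged       */
    afu_write_count(count);              /* 2: optional if unchanged       */
    afu_write_data(operand);             /* 3: this node's contribution    */

    uint32_t result;
    do {
        result = afu_read_result();      /* 4-5: spin until result valid   */
    } while (result == NOT_A_NUMBER);

    if (result == PREFIX_TOKEN)          /* 6-7: prefix result delivered   */
        result = afu_read_prefix_data(); /*      through a second read     */
    return result;                       /* 8: aggregate function complete */
}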
3 The ClusterNet4EPP Proof-of-Concept Prototype
The four-processor Object-Oriented Aggregate Network [11], called ClusterNet4EPP, demonstrates that the simplified network interface is feasible and performs very well using a small FPGA (Altera 10K20). The PCs' parallel ports were used as the network interface and require approximately 1µs to access. Experimental results were performed, and a PCI device was accessible in approximately 450 ns. For ClusterNet4EPP, the read and write access time to each of the four registers (Data In, OpCode, Counter, Data Out) was found to be 1.7 µs. IEEE 1284 in EPP mode was used for cable signaling.
Fig. 4. ClusterNet4EPP (four EPP interfaces connect Linux PCs over IEEE 1284 EPP parallel cables to a 4-processor AFU implemented in one Altera 10K20RC240-4)
To demonstrate that network-embedded data structures are feasible and beneficial, an embedded RAM block was placed inside the FPGA. The control logic for the RAM block was modified, and the synchronization/arithmetic operations shown below were implemented. Each operation is executed on a single memory location. While a processor was not placed in the network, these operations can be used to perform simple global operations. Experimentation was performed to determine the effect of memory contention, but due to the small number of processors and a 120 ns memory access time, no effect could be detected. All memory accesses required approximately 1.7 µs. If the OpCode and the Counter need to be set, the total execution time is 5.2 µs. All of the memory operations can be executed on any word in the embedded memory. In addition to memory operations, barrier synchronization and a number of reduction operations were implemented. These operations are described in Table 4.
Table 4. Memory operations for the RAM embedded within ClusterNet4EPP
Memory Operations
– Non-blocking Memory Read / Exchange
– Wait for Lock=1 (or 0), then Read
– Wait for Lock=0 (or 1), Exchange and set Lock=1
– Non-blocking Write and Unlock/Lock
– Wait for Lock=0, OR with RAM
– Wait for Lock=0, XOR with RAM
– Wait for Lock=0, Increment / Decrement RAM
– Wait for Lock=0, RAM = RAM -/+ Data
4. Related Research
The NYU Ultracomputer [12, 13] and the IBM RP3 [14] are both dance-hall shared memory architectures. The Ultracomputer was the first architecture to propose that combining be performed within the processor-to-memory interconnection network. Messages that reference identical memory locations are combined if both messages are buffered within the same switch at the same time. The computations that can be performed within the interconnection network are Fetch-and-Add, Fetch-and-Increment, and other Fetch-and-Op functions, where Op is associative. Active Messages from Berkeley [2] allow functions to be executed at the network interface on the local or remote node. Active Networks perform operations on data values that are passed through the network [15, 16]. Fast Messages [17] modify the network interface drivers to reduce the overhead for sending and receiving messages. Sorting networks were introduced in [18] and have continued to remain a topic of interest [19, 20]. Multistage data manipulation networks are discussed in [21]. A number of commercial architectures have included direct support for various associative aggregate computations. The Cray T3D directly supports, through Cray-designed logic circuits, barrier synchronization, swap, and Fetch-and-Increment [22]. The TMC Connection Machine CM-5 has a control network that supports reduction operations, prefix operations, maximum, logical OR and XOR [22]. These architectures can be considered aggregate networks, but they are very specific in the functions that they are designed to execute. PAPERS, Purdue's Adapter for Parallel Execution and Rapid Synchronization, is a network that allows a number of aggregate computations to be performed within a custom network hub that is attached to a cluster of Linux PCs [23-25]. This design uses a combination of barrier synchronization with a four-bit wide global NAND to construct a robust library of aggregate computations and communications. A number of cluster projects have employed different approaches to reduce the communication cost of point-to-point and broadcast messages. SHRIMP [8, 26] uses memory bus snooping to implement a virtual memory interface. Point-to-point mes-
sages in SHRIMP have a 10+ µs latency and remote procedure calls have a 3+ µs latency. Myrinet [27] provides gigabit bandwidth with a 0.55 µs worst-case latency through its pipelined crossbar switch.
5. Conclusions and Future Directions
This paper has proposed the concept of combining the architectural characteristics of a network switch with those of a cluster of desktop computers. Rather than using the resources of the switch for message routing, this paper has proposed using them to create a function unit that is capable of performing computations on data that is aggregated from a group of processors. Specifically, the following switch resources can better serve the architectural needs of a cluster in the following way:
• The switch lookup table should be used as network-embedded shared memory.
• The functionality of the switch should be expanded to include aggregate functions. This reduces the total amount of time required for group computations.
• The functionality of the switch should be configurable. This will enable greater utilization of the architectural resources of the entire cluster rather than just the processors.
• Packets are not needed if each processor has direct access to a set of registers within the "switch". This removes the need to encode and decode packets and reduces the software overhead to less than ten assembly-level instructions.
ClusterNet4EPP was described and implements numerous instructions that access the shared memory in as little as 1.7µs. A number of functions were implemented that involved data from all of the processors. These functions included OR, XOR, AND and ADD. While ClusterNet4EPP has demonstrated that it is possible to implement functions within the network, there are still a number of issues that have not been addressed. Scalability to large systems has not been demonstrated, and the performance of complex functions is still unknown. Scalability and function performance are currently being examined using an Altera 10K100 that is five times larger than the 10K20 and is currently able to interconnect 8 processors. The figure shows the 10K100 prototype in the left portion of the picture and four of the connectors in the right portion of the picture. The cable in the middle of the picture with the Altera label is the FPGA configuration cable. The EPP is currently working, and the remainder of the design is expected to be completed by SuperComputing '99 in the middle of November. Future directions include using a higher bandwidth physical layer and a PCI network interface card. Each of these areas is under development, but experimental results have not been obtained yet. Additionally, embedding a DSP or RISC processor into the network would enable rapid experimentation with system-level resource management. After that is achieved, user-level programmability of the AFU will be approached.
References
1. H. S. Stone, High-Performance Computer Architecture, Third ed. Reading, MA: Addison-Wesley Publishing Company, 1993.
2. D. Culler, L. Liu, R. Martin, and C. Yoshikawa, "Assessing Fast Network Interfaces," IEEE Micro, vol. 16, pp. 35-43, 1996.
3. M. Snir, S. Otto, S. Huss-Lederman, D. Walker, and J. Dongarra, MPI, The Complete Reference. Cambridge, Massachusetts: The MIT Press, 1996.
4. R. Hoare and H. Dietz, "A Case for Aggregate Networks," Proceedings of the 12th International Parallel Processing Symposium and 9th Symposium on Parallel and Distributed Processing, Orlando, FL, 1998.
5. C. Stunkel et al., "The SP2 High-Performance Switch," IBM Systems Journal, vol. 34, pp. 185-204, 1994.
6. R. Kessler and J. Schwarzmeier, "Cray T3D: a New Dimension for Cray Research," Digest of Papers, COMPCON Spring '93, San Francisco, CA, 1993.
7. M. Fillo and R. Gillett, "Architecture and Implementation of Memory Channel 2," Digital Equipment Corporation High Performance Technical Computing, pp. 34-48, 1997.
8. M. Blumrich et al., "Virtual Memory Mapped Network Interfaces for the SHRIMP Multicomputer," Proceedings of the 21st Annual International Symposium on Computer Architecture, 1994.
9. T. Warshko, W. Tichy, and C. Herter, "Efficient Parallel Computing on Workstation Clusters," University of Karlsruhe, Dept. of Informatics, Karlsruhe, Germany, Technical Report 21/95, 1995.
10. R. Hoare, T. Mattox, and H. Dietz, "TTL-PAPERS 960801, The Modularly Scalable, Field Upgradable, Implementation of Purdue's Adapter for Parallel Execution and Rapid Synchronization," Purdue University, W. Lafayette, Internet On-line Tech Report, 1996.
11. R. Hoare, "Object-Oriented Aggregate Networks," School of Electrical Engineering, Purdue University, W. Lafayette, 1999.
12. A. Gottlieb et al., "The NYU Ultracomputer, Designing a MIMD Shared Memory Parallel Computer," IEEE Transactions on Computers, pp. 175-189, 1983.
13. R. Bianchini, S. Dickey, J. Edler, G. Goodman, A. Gottlieb, R. Kenner, and J. Wang, "The Ultra III Prototype," Proceedings of the Parallel Systems Fair, 1993.
14. G. Pfister and V. Norton, "'Hot Spot' Contention and Combining in Multistage Interconnection Networks," Proceedings of the 1985 International Conference on Parallel Processing, 1985.
15. D. Tennenhouse and D. Wetherall, "Towards an Active Network Architecture," Computer Communications Review, vol. 26, 1996.
16. D. Tennenhouse et al., "A Survey of Active Network Research," IEEE Communications Magazine, vol. 35, pp. 80-86, 1997.
17. H. Bal, R. Hofman, and K. Verstoep, "A Comparison of Three High Speed Networks for Parallel Cluster Computing," Proceedings of the First International Workshop on Communication and Architectural Support for Network-Based Parallel Computing, San Antonio, TX, 1997.
18. K. Batcher, "Sorting Networks and Their Applications," Proceedings of the Spring Joint Computer Conference, 1968.
19. J. Lee and K. Batcher, "Minimizing Communication of a Recirculating Bitonic Sorting Network," Proceedings of the 1996 International Conference on Parallel Processing, 1996.
20. Z. Wen, "Multiway Merging in Parallel," IEEE Transactions on Parallel and Distributed Systems, vol. 7, pp. 11-17, 1996.
21. H. J. Siegel, Interconnection Networks for Large-Scale Parallel Processing: Theory and Case Studies, Second ed. New York, NY: McGraw-Hill, 1990.
22. G. Almasi and A. Gottlieb, Highly Parallel Computing, Second Edition. Redwood City, CA: The Benjamin/Cummings Publishing Company, Inc., 1994.
23. H. Dietz, R. Hoare, and T. Mattox, "A Fine-Grain Parallel Architecture Based on Barrier Synchronization," Proceedings of the International Conference on Parallel Processing, Bloomington, IL, 1996.
24. R. Hoare, H. Dietz, T. Mattox, and S. Kim, "Bitwise Aggregate Networks," Proceedings of the Eighth IEEE Symposium on Parallel and Distributed Processing, New Orleans, LA, 1996.
25. T. Mattox, "Synchronous Aggregate Communication Architecture for MIMD Parallel Processing," School of Electrical and Computer Engineering, Purdue University, W. Lafayette, IN, 1997.
26. E. Felten et al., "Early Experience with Message-Passing on the SHRIMP Multicomputer," Proceedings of the 23rd Annual International Symposium on Computer Architecture, Philadelphia, PA, 1996.
27. N. Boden et al., "Myrinet: A Gigabit per Second Local Area Network," IEEE Micro, vol. 15, 1995, pp. 29-36.
28. R. Brouwer, "Parallel algorithms for placement and routing in VLSI design," Ph.D. Thesis, University of Illinois, Urbana-Champaign, 1991.
29. J. Chandy et al., "Parallel Simulated Annealing Strategies for VLSI Cell Placement," Proceedings of the 1996 International Conference on VLSI Design, Bangalore, India, January 1996.
30. T. Stornetta et al., "Implementation of an Efficient Parallel BDD Package," Proc. 33rd ACM/IEEE Design Automation Conference, 1996.
31. R. Ranjan et al., "Binary Decision Diagrams on Network of Workstations," Proceedings of the International Conference on Computer-Aided Design, pp. 358-364, 1996.
GigaBit Performance under NT
Mark Baker
University of Portsmouth, Hants, PO4 8JF, UK
[email protected]
Stephen Scott and Al Geist
Oak Ridge National Laboratory, Oak Ridge, TN 37831-6367, USA
{scottsl,gst}@ornl.gov
Logan Browne
Hiram College, Hiram, OH 44234, USA
[email protected]

January 13, 2000

Abstract
The recent interest and growing popularity of commodity-based cluster computing has created a demand for low-latency, high-bandwidth interconnect technologies. Early cluster systems have used expensive but fast interconnects such as Myrinet or SCI. Even though these technologies provide low-latency, high-bandwidth communications, the cost of an interface card almost matches that of individual computers in the cluster. Even though these specialist technologies are popular, there is a growing demand for Ethernet, which can provide a low-risk and upgradeable path with which to link clusters together. In this paper we compare and contrast the low-level performance of a range of Gigabit network cards under Windows NT using MPI and PVM. In the first part of the paper we discuss our motivation and rationale for undertaking this work. We then move on to discuss the systems that we are using and our methods for assessing these technologies. In the second half of the paper we present our results and discuss our findings. In the final section of the paper we summarize our experiences and then briefly mention further work we intend to undertake. Keywords: cluster interconnect, communication network, Gigabit Ethernet, PVM, MPI, performance evaluation.
1. Introduction

The concept of a cluster of computers as a distinguished type of computing platform evolved during the early 1990's [1]. Prior to that time, the development of computing platforms composed of multiple processors was typically accomplished with custom-designed systems consisting of proprietary hardware and software. Supercomputers, or high-performance multiprocessor computers, were designed, developed, and marketed to customers for specialized grand challenge applications. Typically, the applications that ran on these supercomputers were written in Fortran or C, but used proprietary numerical or messaging libraries that were generally not portable. However, rapid advances in commercial off-the-shelf (COTS) hardware and the shortening of the design cycle for COTS components made the design of custom hardware cost-ineffective. By the time a company had designed and developed a supercomputer, its processor speed and capability were out-paced by commercial processing components.
In addition to the rapid increase in COTS hardware capability that led to increased cluster performance, software capability and portability increased rapidly during the 1990's. A number of software systems that were originally built as academic projects led to the development of standard portable languages and new standard communication protocols for cluster computing. The programming paradigm for cluster computing falls primarily into two categories: message passing and distributed shared memory (DSM). Although DSM is claimed to be an easier programming paradigm, as the programmer has a global view of all the memory, early efforts instead focused on message passing systems. Parallel Virtual Machine (PVM) [2] started as a message passing research tool in 1989 at Oak Ridge National Laboratory (ORNL). Version 2, written at the University of Tennessee, was publicly released in early 1991. As a result of this effort and other message passing schemes, there was a push for a standardized message passing interface. Thus, in 1994 MPI Version 1 was approved as a de jure standard for message-passing parallel applications [3]. Many implementations of MPI-1 have been developed. Some implementations, such as MPICH, are freely available. Others are commercial products optimized for a particular system, such as SUN HPC MPI. Generally, each MPI implementation is built over faster and less functional low-level interfaces, such as BSD Sockets or the SGI SHMEM interface.
2. Message Passing

2.1 MPI Overview
The MPI standard [4] is the amalgamation of what were considered the best aspects of the most popular message-passing systems at the time of its conception. The standard only defines a message passing library and leaves, amongst other things,
process initialisation and control to individual developers to define. MPI is available on a wide range of platforms and is fast becoming the de facto standard for message passing. The design goals of MPI were portability, efficiency and functionality. Commercial and public domain implementations of MPI exist. These run on a range of systems, from tightly coupled, massively-parallel machines through to networks of workstations. MPI has a range of features including: point-to-point communication, with synchronous and asynchronous communication modes; and collective communication (barrier, broadcast, reduce). MPICH [5, 6], developed by Argonne National Laboratory and Mississippi State University, is probably the most popular of the current, free, implementations of MPI. MPICH is a version of MPI built on top of Chameleon [7]. MPICH and its variants are available for most commonly used distributed and parallel platforms.
2.2 PVM Overview

The Parallel Virtual Machine (PVM) system [8] provides an environment within which parallel programs can be developed and run. PVM is a continuing research and development project between ORNL, Emory University and the University of Tennessee. PVM transparently handles all message routing, data conversion and task scheduling across a network of heterogeneous computer architectures. PVM is available for most computer architectures, including Linux and NT. The PVM system consists of:
– A PVM daemon (or NT service) which is installed on each PVM host computer; this daemon is used to initiate and manipulate the PVM environment.
– A set of libraries to perform parallel communication between PVM tasks, and an initiation method for the parallel environment.
– A console that allows users to manipulate their PVM environment by, for example, adding and deleting hosts as well as starting, monitoring and stopping PVM programs.
– A set of functions for debugging both the PVM environment and a PVM program.

3. Gigabit Ethernet
Gigabit Ethernet offers an upgrade path for current Ethernet installations and allows existing installed stations, management tools and training to be reused. It is anticipated that the initial applications for Gigabit Ethernet will be for campuses or buildings requiring greater bandwidth between routers, switches, hubs, repeaters and servers [9]. At some time in the near future Gigabit Ethernet will be used by high-end desktop computers requiring a higher bandwidth than Fast Ethernet can offer.
Gigabit Ethernet is an extension of the standard (10 Mbps) Ethernet and Fast Ethernet (100 Mbps) standards for network connectivity. The Gigabit Ethernet standard, IEEE 802.3z, was officially approved by the IEEE standards board in June 1998. Gigabit Ethernet employs the same Carrier Sense Multiple Access with Collision Detection (CSMA/CD) protocol, frame format and frame size as its predecessors. Much of the IEEE 802.3z standard is devoted to the definition of the physical layer of the network architecture. For Gigabit Ethernet communications, several physical layer standards are emerging from the IEEE 802.3z effort; these standards cover different link technologies as well as short- and long-distance interconnects. The differences between the technologies are shown in Table 1 [10].
Table 1. Ethernet segment limitations

                     Ethernet 10BaseT   Fast Ethernet 100BaseT   Gigabit Ethernet 1000BaseX
Data Rate            10 Mbps            100 Mbps                 1000 Mbps
Cat 5 UTP            100 m (min)        100 m                    100 m
STP/Coax             500 m              100 m                    25 m
Multimode Fiber      2 km               412 m (half duplex),     550 m
                                        2 km (full duplex)
Single-mode Fiber    25 km              20 km                    5 km
4. MPI NT Environments
There are now six MPI environments for NT [11]. These range from commercial products, such as MPI/Pro and PaTENT, to the standard release of MPICH with a WinSock device. The MPI environments used to evaluate Gigabit network performance are described briefly in Sections 4.1 – 4.3.
4.1 MPI/Pro for Windows NT
MPI/Pro [12] is a commercial environment released in April 1998 by MPI Software Technology, Inc. The current version of MPI/Pro is based on WinMPIch [13] but has been fairly radically redesigned to remove the bottlenecks and other problems that were present. MPI/Pro supports both Intel and Alpha processors and is released to be used with Microsoft Visual C++ and Digital Visual Fortran. The MPI/Pro developers are currently working on a new source base for MPI that does not include any MPICH code and supports the VI Architecture [14].

4.2 PaTENT WMPI
PaTENT [15] is the commercial version of WMPI funded by the European project WINPAR [16]. PaTENT differs from WMPI in a number of small ways, which include: a sanitized release, easier installation, better documentation and full user support. PaTENT is available for Microsoft Visual C++ and Digital Visual Fortran and consists of libraries, header files, examples and daemons for remote
starting. PaTENT includes ROMIO, ANL's implementation of MPI-IO, configured for UFS. PaTENT uses the InstallShield software mechanisms for installation and configuration.

4.3 WMPI
WMPI [17], from the Department of Informatics Engineering of the University of Coimbra, Portugal, is a full implementation of MPI for Microsoft Win32 platforms. WMPI is based on MPICH and includes a P4 [18] device. P4 provides the communication internals and a startup mechanism (which are not specified in the MPI standard). For this reason WMPI also supports the P4 API. The WMPI package is a set of libraries (for Borland C++, Microsoft Visual C++ and Microsoft Visual FORTRAN). The release of WMPI provides libraries, header files, examples and daemons for remote starting.

5. Performance Tests
5.1 Test Equipment
The aim of these tests is restricted to gathering data that helps indicate the expected communications performance (peak bandwidth and message latency) of MPI on NT. The benchmark environment consisted of two dual-processor Pentiums (450 MHz PIII) with 512 MBytes of DRAM, running NT 4 (SP5) and Windows 2000 β3, with individual links between each pair of network cards. The technical details of the network cards assessed are given in Table 2.

Table 2. Network Card Specification

NetGear [19] FA310TX 100Mbps: IEEE 802.3u 100BASE-TX Fast Ethernet and 802.3i. Cost: MSRP $24.95 ($17.50 in qty 50).
GigaNet [20] Clan GNN1000: 32/64-bit 33MHz, PCI 2.1 compliant, 1.25 Gbps full duplex. Cost: MSRP $795.
Packet Engine [21] GNIC II: 32/64-bit 33MHz, PCI 2.1 compliant, 2 Gbps full duplex. Cost: $995 (no longer available; out of the NIC business).
SysKonnect [22] SK-9841: 32/64-bit 33/66MHz, PCI 2.2 compliant, 2 Gbps full duplex. Cost: MSRP $729.
NetGear GA620: 32/64-bit 33/66MHz, PCI 2.1 compliant, 2 Gbps full duplex. Cost: MSRP $299.99.
5.2 Multiprocessor Benchmark: PingPong
In this program, messages of increasing size are sent back and forth between processes. PingPong is an SPMD program written in C using the PVM, MPI and WinSock message passing APIs. These codes have been carefully developed so that all three versions match each other's behaviour as closely as possible.
Note 1: Our references to NT 5 and Windows 2000 are synonymous. Note 2: GigaNet uses a proprietary protocol for communications, rather than Ethernet.
PingPong provides information about the latency of send/receive operations and the uni-directional bandwidth that can be attained on a link. To ensure that anomalies in message timings do not occur, the PingPong test is repeated for all message lengths.

MPI Version
The MPI version of the code uses the blocking send/receive on both processes. MPI_Send(A,nbyte,MPI_BYTE,0,10,MPI_COMM_WORLD); MPI_Recv(A,nbyte,MPI_BYTE,0,20,MPI_COMM_WORLD, &status); 3909HUVLRQ
PVM Version

The PVM version of the code is slightly more complicated, as data needs to be packed into buffers before being sent and unpacked at the receiving end.

Master
pvm_initsend(ENCODING);
for (length = 0; length < maximum; /* increment message length */) {
    pvm_pkbyte(send_buffer, length, 1);
    pvm_send(slave_id, 1);
    pvm_recv(-1, -1);
}

Slave
pvm_initsend(ENCODING);
while (true) {
    bufid = pvm_recv(-1, -1);
    pvm_bufinfo(bufid, (int*)0, (int*)0, &dtid);
    pvm_send(parent_id, 2);
}

Differences of the MPI and PVM Versions of PingPong
A comparison of the MPI and PVM codes shows that there are some potential differences in how user data is handled, and this may cause some performance differences. The most obvious difference is that the PVM master leaves the received user data in a temporary buffer space. This and other effects will be investigated and reported upon in the final workshop presentation.
6. Results

Introduction
In this section we present and discuss the results that were obtained from running the various performance tests under MPI and PVM. It should be noted that not all the PVM results were available at the time of submission of this paper, but they will be available for the actual workshop. It should also be noted that, due to design restrictions, PaTENT and WMPI are unable to use alternative network interfaces,
other than the one pointed at by the local host name. This problem was pointed out to both sets of developers (Genias and Coimbra), but unfortunately a fix was not provided in time to incorporate the results in this paper.

Table 3. Measured 1 Byte Message Latency

      System                              Latency (µs)
  1.  MPI/Pro 1.2.3, SMP NT4                    106.3
  2.  WSOCK 32, SMP NT4                          74.0
  3.  WMPI 1.2, SMP NT4                          44.2
  4.  PaTENT 4.014, SMP NT4                      32.8
  5.  MPI/Pro 1.2.3, SMP NT5                     98.2
  6.  WSOCK 32, SMP NT5                          76.4
  7.  PaTENT, SMP NT5                            35.5
  8.  MPI/Pro 1.2.3, TCP 100 Mbps               207.6
  9.  WSOCK 32, TCP 100 Mbps                     97.5
 10.  WMPI 1.2, TCP 100 Mbps                    283.4
 11.  MPI/Pro 1.2.3, TCP NT5 100 Mbps           244.1
 12.  WSOCK 32, TCP NT5 100 Mbps                112.7
 13.  MPI/Pro 1.2.3, TCP GigaNet                207.8
 14.  WSOCK 32, GigaNet                          96.9
 15.  MPI/Pro 1.2.3, TCP Packet Engine          335.6
 16.  WSOCK 32, TCP Packet Engine               298.4
 17.  MPI/Pro 1.2.3, TCP SysKonnect             178.8
 18.  WSOCK 32, TCP SysKonnect                   90.6
 19.  MPI/Pro 1.2.3, TCP NetGear                585.5
 20.  WSOCK 32, NetGear                         666.2
Latency Results (Table 3)
SM – PaTENT and WMPI clearly have the lowest latencies under NT4, approximately half the time taken by WinSock and MPI/Pro. Under NT5, WinSock and PaTENT latencies are slightly slower than under NT4 (~8%). However, MPI/Pro under NT5 is slightly faster (~8%) than under NT4.
TCP (100 Mbps) – WinSock has less than half the latency of the MPI environments, both under NT4 and NT5. MPI/Pro is about 25% faster than WMPI. Under NT5 all systems exhibit a 10 – 15% increase in latency.
TCP (GigaBit) – The WinSock results for the GigaNet (53%), Packet Engine (11%) and
SysKonnect (50%) network cards are all faster than the MPI/Pro results. However, for NetGear, WinSock (14%) is slower than MPI/Pro. This particular result is unexpected as MPI/Pro is built on top of the WinSock API. Overall, the SysKonnect card exhibits the lowest latencies, closely followed by GigaNet and Packet Engine. The latencies for NetGear are more than double those for the other network cards.
4
SM is where two processes are running on one computer and potentially communicating via Shared-Memory. TCP is where two processes are running on separate computers and communicating via TCP/IP.
Figure 1 - One Byte Network Latencies
Network Bandwidths

Shared Memory Results (Figure 2)
PaTENT and WMPI exhibit the best overall performance under NT4 and NT5. Under NT4, PaTENT and WMPI have a peak bandwidth of just over 100 Mbytes/s, and under NT5 PaTENT peaks at 122 Mbytes/s. MPI/Pro under NT4 and NT5 has a similar bandwidth to WinSock up until message lengths of 8K. MPI/Pro's bandwidth then continues to increase, peaking at 107 Mbytes/s under NT4 and at 122 Mbytes/s under NT5. WinSock peaks at 31 Mbytes/s under NT4 and 39 Mbytes/s under NT5; here it also exhibits a huge performance dip between 16K and 64K message lengths. It should be noted that higher peak bandwidths were achieved under NT5 compared to NT4.

Distributed Memory
MPI/Pro Results (Figure 3)
The bandwidth results from the 100 Mbps and GigaNet network cards between 1 and 512 Bytes are very similar. Thereafter the GigaNet results continue to increase up to 256K length messages where a peak of 37 Mbytes/s is reached. The 100 Mbps network card outperforms the Packet Engine, SysKonnect and NetGear network cards up until message lengths of about 1K. The 100 Mbps technology peaks at 8.8 Mbytes/s. The bandwidth of NetGear is much poorer than all the other technologies up until 2K message lengths. The peak bandwidths for Packet Engine, SysKonnect and NetGear are 12 Mbytes/s, 17 Mbytes/s and 19 Mbytes/s respectively.
Figure 2 - PingPong Shared Memory Results (bandwidth, log scale, versus message length in shared memory for MPI/Pro, WSock, PaTENT and WMPI under NT4 and NT5)

WinSock Results (Figure 4)
The bandwidth results from the 100 Mbps and GigaNet network cards between 1 and 128 Bytes are very similar. Thereafter the GigaNet results continue to increase up to 8K length messages, where a peak of 38 Mbytes/s is reached. The 100 Mbps network card outperforms the SysKonnect network card up until 256 Bytes message length. The 100 Mbps network card outperforms the Packet Engine and NetGear network cards up until 4K message lengths. The 100 Mbps technology peaks at 10 Mbytes/s. The bandwidth of NetGear is much poorer than all the other technologies up until 8K message lengths. The peak bandwidths for Packet Engine, SysKonnect and NetGear are 10.6 Mbytes/s, 17.4 Mbytes/s and 17 Mbytes/s respectively.

7. Summary and Conclusions
Summary
In this paper we have presented and discussed the results from our simple network performance tests on NT using the MPI, PVM and WinSock message passing APIs on six different network interface technologies. At the date of submission, we have been unable to complete the PVM tests, so our discussion on the performance differences is limited at this moment.
Figure 3 - MPI/Pro Bandwidth Results (bandwidth, log scale, versus message length in distributed memory for the 100 Mbps, GigaNet, Packet Engine, SysKonnect and NetGear cards)

Our experiences with the performance of MPI under NT 4 and Windows 2000 are inconclusive. Currently, it appears that in shared-memory mode the latencies under Windows 2000 may be marginally lower than under NT 4. The measured peak bandwidths of Windows 2000 were greater than those of NT 4. In distributed-memory mode the measured latencies under Windows 2000 were approximately 20% higher than the equivalent under NT 4. The measured bandwidths for Windows 2000 and NT 4 were, however, very similar. It is interesting to note that the measured network latencies for 100 Mbps Ethernet cards and GigaNet under WinSock and MPI/Pro are almost equivalent. The performance of the Packet Engine Gigabit card is between 7% and 13% faster respectively. However, the performance of the SysKonnect and NetGear cards is significantly slower than standard 100 Mbps Ethernet.

Price/Performance Considerations
Table 4 shows the price/performance ratios calculated using the network card costs in September 1999 versus the peak measured bandwidth and minimum latency. It should be noted that the calculated ratios shown are only an approximate indicator, as the price of the network cards varies significantly based on the quantity bought and the discounts given. The smaller the price/performance ratio, the better the value for money that can be expected from a network card. The choice of the most appropriate card is often not based
solely on price/performance, but also on other factors such as desired performance, compatibility or availability.

Figure 4 - WinSock Bandwidth Results (bandwidth, log scale, versus message length in distributed memory for the 100 Mbps, GigaNet, Packet Engine, SysKonnect and NetGear cards)
The ratios shown in Table 4 indicate that the 100 Mbps Fast Ethernet cards provide significantly better price/performance than the other network cards. However, the ratios for the NetGear Gigabit card are significantly better than the other Gigabit price/performance ratios available.
Table 4. Network Card Cost versus Performance (MPI/Pro)

Card Make and Speed           Price/Performance ($/MBytes/s)   Price/Performance ($/µs)
NetGear FA310TX 100Mbps       $24.95/8.8 = 2.835               $24.95/208 = 0.12
GigaNet - Clan GNN1000        $795/37 = 21.49                  $795/208 = 3.82
Packet Engine - GNIC II       $995/12 = 82.92                  $995/336 = 2.96
SysKonnect - SK-9841          $729/17 = 42.88                  $729/179 = 4.07
NetGear - GA620               $299.99/19 = 15.79               $299.99/585 = 0.51
Summary of Conclusions
Our work has shown that release 1.2.3 of MPI/Pro imposes an approximate additional 1 Byte latency of 25% and 50% over WinSock under shared and distributed-memory modes respectively. We have shown that the GigaNet Gigabit card provides the highest bandwidth of those tested. We suspect, as currently we do not have a concrete price for this card, that the price/performance of this card will be poorer than that of NetGear but better than Packet Engine and SysKonnect. Our price/performance figures do, however, strongly suggest that the current performance and costs of the Gigabit cards make standard 100 Mbps a much sounder technology investment at the moment. Obviously, other
factors, like required peak bandwidth, may make the decision of which technology to choose not one purely based on price/performance. Another factor that puts Gigabit Ethernet at a disadvantage compared to other network technologies, such as Myrinet [23] and SCI [24], is its relatively high start-up latencies, approximately an order of magnitude higher. These high latencies are being addressed with the new VIA interfaces and drivers being developed for Ethernet.

Future Work
This work is part of an ongoing effort to investigate the performance of a range of cluster-based technologies. The next phase of our work will involve comparing the performance of different network technologies under NT and Linux.

References
1. A. Geist, Cluster Computing: The Wave of the Future, Springer-Verlag Lecture Notes in Computer Science, May 1994.
2. The PVM project - http://www.epm.ornl.gov/pvm/
3. MPI Forum - http://www.mpi-forum.org/docs/docs.html
4. Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, University of Tennessee, Knoxville, Report No. CS-94-230, May 5, 1994.
5. MPICH - http://www.mcs.anl.gov/mpi/mpich/
6. W. Gropp et al., A high-performance, portable implementation of the MPI message passing interface standard - http://www-c.mcs.anl.gov/mpi/mpicharticle/paper.html
7. W. Gropp and B. Smith, Chameleon parallel programming tools users manual, Technical Report ANL-93/23, Argonne National Laboratory, March 1993.
8. PVM: A Users' Guide and Tutorial For Networked Parallel Computing - http://www.netlib.org/pvm3/book/pvm-book.html
9. Gigabit Ethernet Alliance - Gigabit Ethernet: Accelerating the standard for speed, http://www.gigabit-ethernet.org/technology/whitepapers, September 1999.
10. Ethernet Segment Limits - http://www.gigabit-ethernet.org/technology/
11. TOPIC - http://www.dcs.port.ac.uk/~mab/TOPIC/
12. MPI Software Technology, Inc. - http://www.mpi-softtech.com/
13. WinMPIch - http://www.erc.msstate.edu/mpi/mpiNT.html
14. VIA - http://www.viaarch.com
15. PaTENT - http://www.genias.de/products/patent/
16. WINdows based PARallel computing (WINPAR) - http://www.genias.de/
17. WMPI - http://dsg.dei.uc.pt/w32mpi/
18. R. Butler and E. Lusk, User's Guide to the p4 Parallel Programming System, ANL-92/17, Argonne National Laboratory, October 1992.
19. NetGear - http://netgear.baynetworks.com/
20. GigaNet - http://www.giga-net.com/
21. Packet Engine - http://www.packetengines.com/index4.html
22. SysKonnect - http://www.syskonnect.de/
23. N. Boden et al., Myrinet - A Gbps LAN, IEEE Micro, Vol. 15, No. 1, February 1995. http://www.myri.com/
24. Dolphin Interconnect Solutions - http://www.dolphinics.no/
MPI Collective Operations over IP Multicast*

Hsiang Ann Chen, Yvette O. Carrasco, and Amy W. Apon
Computer Science and Computer Engineering, University of Arkansas, Fayetteville, Arkansas, U.S.A.
{hachen, yochoa, [email protected]}

Abstract. Many common implementations of Message Passing Interface (MPI) implement collective operations over point-to-point operations. This work examines IP multicast as a framework for collective operations. IP multicast is not reliable. If a receiver is not ready when a message is sent via IP multicast, the message is lost. Two techniques for ensuring that a message is not lost due to a slow receiving process are examined. The techniques are implemented and compared experimentally over both a shared and a switched Fast Ethernet. The average performance of collective operations is improved as a function of the number of participating processes and message size for both networks.
1 Introduction
Message passing in a cluster of computers has become one of the most popular paradigms for parallel computing. Message Passing Interface (MPI) has emerged to be the de facto standard for message passing. In many common implementations of MPI for clusters, MPI collective operations are implemented over MPI point-to-point operations. Opportunities for optimization remain. Multicast is a mode of communication where one sender can send to multiple receivers by sending only one copy of the message. With multicast, the message is not duplicated unless it has to travel to different parts of the network through switches. Many networks support broadcast or multicast. For example, shared Ethernet, token bus, token ring, FDDI, and reflective memory all support broadcast at the data link layer. The Internet Protocol (IP) supports multicast over networks that have IP multicast routing capability at the network layer. The goal of this paper is to investigate the design issues and performance of implementing MPI collective operations using multicast. IP multicast is used to optimize the performance of MPI collective operations, namely the MPI broadcast and MPI barrier synchronization, for this preliminary work. The results are promising and give insight to work that is planned on a low-latency network. The remainder of this paper describes IP multicast, design issues in the implementations, experimental results, conclusions, and future planned work.
* This work was supported by Grant #ESS-9996143 from the National Science Foundation.
2 IP Multicast
Multicast in IP is a receiver-directed mode of communication. In IP multicast, all the receivers form a group, called an IP multicast group. In order to receive a message, a receiving node must explicitly join the group. Radio transmission is an analogy to this receiver-directed mode of communication. A radio station broadcasts the message on one frequency channel, and listeners tune to that specific channel to hear that specific radio station. In contrast, a sender-directed mode of communication is like newspaper delivery: multiple copies of the paper are delivered door-to-door, and the newspaper company must know every individual address of its subscribers. IP multicast works like radio. The sender only needs to send one copy of the message to the multicast group, and it is the receiver who must be aware of its membership in the group. Membership in an IP multicast group is dynamic. A node can join and leave an IP multicast group freely, and a node can send to a multicast group without having to join it. There is a multicast address associated with each multicast group. IP addresses in the range 224.0.0.0 through 239.255.255.255 (class D addresses) are IP multicast addresses. Multicast messages to an IP multicast group will be forwarded by multicast-aware routers or switches to branches with nodes that belong to the group. IP multicast saves network bandwidth because it reduces the need for the sender to send extra copies of its message, and therefore lowers the latency of the network.
In theory, IP multicast should be widely applicable to reduce latency. However, one drawback of IP multicast is that it is unreliable. The reliable Transmission Control Protocol (TCP) does not provide multicast communication services. The User Datagram Protocol (UDP) is used instead to implement IP multicast applications. UDP is a "best effort" protocol that does not guarantee datagram delivery. This unreliability limits the application of IP multicast as a protocol for parallel computing. There are three kinds of unreliability problems with implementing parallel collective operations over IP multicast. One comes with unreliability at the hardware or data link layer. An unreliable network may drop packets, or deliver corrupted data. In this work, we assume that the hardware is reliable and that packets are delivered reliably at the data link layer. It is also possible that a set of fast senders may overrun a single receiver. In our experimental environment we have not observed these kinds of errors. However, a third problem is related to the software design mismatch between IP multicast and parallel computing libraries such as MPI. In WANs, where IP multicast is generally applied, receivers of a multicast group come and go dynamically, so there is no guarantee of delivery to all receivers. The sender simply does not know who the receivers are. However, in parallel computing all receivers must receive. With IP multicast, only receivers that are ready at the time the message arrives will receive it. However, the asynchronous nature of cluster computing makes it impossible for the sender to know the receive status of the receiver without some synchronizing mechanism, regardless of how reliable the underlying hardware is. This is a paradigm mismatch between IP multicast and MPI. This
paper explores two synchronizing techniques to ensure that messages are not lost because a receiving process is slower than the sender. This work is related to other efforts to combine parallel programming and broadcast or multicast messaging. In work done on the Orca project [8], a technique was developed for ensuring the reliability of a broadcast message that uses a special sequencer node. In research done at Oak Ridge National Laboratory, parallel collective operations in Parallel Virtual Machine (PVM) were implemented over IP multicast [2]. In that work, reliability was ensured by the sender repeatedly sending the same message until acks were received from all receivers. This approach did not produce improvement in performance. One reason for the lack of performance gain is that the multiple sends of the data cause extra delay. The goal of this work is to improve the performance of MPI collective calls. This work focuses on the use of IP multicast in a cluster environment. We evaluate the effectiveness of constructing MPI collective operations, specifically broadcast and barrier, over IP multicast in a commodity-off-the-shelf cluster.
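To make the receiver-directed model concrete, here is a minimal POSIX-sockets sketch of a receiver joining an IP multicast group and waiting for one datagram; the group address 239.0.0.1 and port 5000 are arbitrary examples and error checking is omitted. A sender does not join the group at all: it simply calls sendto() with the group address as the destination.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);            /* UDP socket */

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(5000);                         /* example port */
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    /* Join the multicast group; 239.0.0.1 is an arbitrary class D address. */
    struct ip_mreq mreq;
    mreq.imr_multiaddr.s_addr = inet_addr("239.0.0.1");
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));

    char buf[1500];
    ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);  /* blocks until a datagram arrives */
    printf("received %zd bytes\n", n);

    close(fd);
    return 0;
}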
3 MPI Collective Operations
The Message Passing Interface (MPI) standard specifies a set of collective operations that allow one-to-many, many-to-one, or many-to-many communication modes. MPI implementations, including LAM [6] and MPICH [7], generally implement MPI collective operations on top of MPI point-to-point operations. We use MPICH as our reference MPI implementation.
Fig. 1. MPICH Layers
Fig. 2. MPICH Broadcast mechanism with 4 nodes
MPICH [3] uses a layered approach to implement MPI. The MPICH layers include the Abstract Device Interface (ADI) layer, the Channel Interface layer, and the Chameleon layer. Portability is achieved from the design of the ADI layer, which is hardware dependent. The ADI provides an interface to higher layers that are hardware independent. The MPICH point-to-point operations are built on top of the ADI layer. To avoid implementing collective operations
over MPICH point-to-point functions, the new implementation has to bypass all the MPICH layers, as shown in Fig. 1.
proc 5
proc 6
time step 1
proc 0
proc 1
proc 0
proc 2
proc 2
proc 1
proc 2
proc 4
proc 3
proc 5
proc 6
proc 5
proc 6
proc 3
synchronization time step 2
(scout messages)
proc 0
time step 3 proc 0
time step 4 Multicast Message
proc 1
proc 2
proc 3
proc 4
proc 5
proc 6
MPI broadcast using IP multicast (Binary Algorithm) Fig. 3.
3.1
proc 1
proc 2
proc 3
proc 4
MPI broadcast using IP multicast (Linear Algorithm) Fig. 4.
MPI Broadcast
Since the new layer for MPI collective operations using multicast is compared experimentally with the original MPICH implementation, it is helpful to understand how these functions are implemented in MPICH. MPICH uses a tree structured algorithm in its implementation of the MPI broadcast operation (MPI_Bcast). In the broadcast algorithm, the sender sends separate copies of the message to some of the receivers. After they receive, the receivers at this level in turn send separate copies of the message to receivers at the next level. For example, as illustrated in Fig. 2, in an environment with 7 participating processes, process 0 (the root) sends the message to processes 4, 2, and 1. Process 2 sends to process 3 and process 4 sends to processes 5 and 6. In general, if there are N participating processes, the message size is M bytes and the maximum network frame size is T bytes, it takes (M/T + 1)(N - 1) network frames for one broadcast.
When IP multicast is used to re-implement MPI broadcast, the software must ensure that all receivers have a chance to receive. Two synchronization mechanisms have been implemented, a binary tree algorithm and a linear algorithm. In the binary tree algorithm, the sender gathers small scout messages with no data from all receivers in a binary tree fashion before it sends. With K processes each executing on a separate computer, the height of the binary tree is log2 K + 1. In the synchronization stage at time step 1, all processes at the leaves of the binary tree send. Scout messages propagate up the binary tree until all the messages are finally received at the root of the broadcast. After that, the root broadcasts the data to all processes via a single send using IP multicast. For example, as illustrated in Fig. 3, in an environment with 7 participating processes, processes 4, 5, and 6 send to processes 0, 1, and 2, respectively. Next, process 1 and process 3 send to processes 0 and 2, respectively. Then process 2 sends to process
0. Finally, process 0 sends the message to all processes using IP multicast. In general, with N processes, a total of N - 1 scout messages are sent. With a message size of M and a maximum network frame size of T, M/T + 1 network frames need to be sent to complete one message transmission. Adding the N - 1 scout messages, it takes a total of (N - 1) + M/T + 1 frames to send one broadcast message.
The linear algorithm makes the sender wait for scout messages from all receivers, as illustrated in Fig. 4. Then the message with data is sent via multicast. With K processes in the environment, it takes K - 1 steps for the root to receive all the scout messages, since the root can only receive one message at a time. As illustrated in Fig. 4, with N processes the root receives N - 1 point-to-point scout messages before it sends the data. With 7 nodes, the multicast implementation only requires one-third of the actual data frames compared to the current MPICH implementation. Since the binary tree algorithm takes fewer time steps to complete, we anticipate that it will perform better than the linear algorithm.
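As a worked example (the numbers are chosen here for concreteness and do not appear in the paper): with N = 7 processes, a message of M = 4500 bytes and a frame size of T = 1500 bytes, the MPICH tree needs (M/T + 1)(N - 1) = 4 x 6 = 24 frames per broadcast, while either multicast scheme needs (N - 1) + M/T + 1 = 6 + 4 = 10 frames, of which only 4 carry data.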
Fig. 5. MPICH barrier synchronization with 7 processes
Fig. 6. The Eagle Cluster

3.2 MPI Barrier Synchronization
Another MPI collective operation re-implemented was MPI_Barrier. MPI_Barrier is an operation that synchronizes processes: all processes come to a common stopping point before proceeding. The MPICH algorithm for barrier synchronization can be divided into three phases. In the first phase, processes that cannot be included in the pair-wise point-to-point operations send messages to processes that can. In the second phase, point-to-point sends and receives are performed in pairs. In the third phase, messages are sent from the processes in the second phase to the processes from the first phase to release them. Figure 5 illustrates the MPICH send and receive messages for synchronization between 7
processes. In this example, processes 4, 5, and 6 send messages to processes 0, 1 and 2. In the second phase, point-to-point messages are sent between processes 0, 1, 2, and 3. In the third phase, processes 0, 1, and 2 send messages to 4, 5, and 6 to release them. If there are N participating processes, and K is the biggest power of 2 less than N, a total of 2(N - K) + K log2 K messages need to be sent.
By incorporating IP multicast into the barrier algorithm, we were able to reduce the number of phases to two. The binary algorithm described above is used to implement MPI_Barrier. First, point-to-point messages are reduced to process 0 in a binary tree fashion. After that, a message with no data is sent using multicast to release all processes from the barrier. In general, with N processes in the system, a total of N - 1 point-to-point messages are sent, followed by one multicast message with no data.
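As a rough sketch of the combined scheme (not the authors' code): the reduction phase runs over MPI point-to-point messages up a heap-style binary tree, after which rank 0 releases everyone with a single multicast datagram. The multicast socket is assumed to have been created and joined to the group as in Section 2, and the tree shape and tags are illustrative.

#include <mpi.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* mcast_fd: UDP socket already bound and joined to the group; grp: the group address.
   Group membership must be established before the barrier so the release datagram
   is buffered for any process that is still finishing its reduction sends. */
void barrier_sketch(int rank, int nprocs, int mcast_fd, const struct sockaddr_in *grp) {
    char token = 0;
    MPI_Status st;
    int left = 2 * rank + 1, right = 2 * rank + 2;

    /* Reduction phase: wait for both children (if any), then signal the parent. */
    if (left  < nprocs) MPI_Recv(&token, 1, MPI_CHAR, left,  0, MPI_COMM_WORLD, &st);
    if (right < nprocs) MPI_Recv(&token, 1, MPI_CHAR, right, 0, MPI_COMM_WORLD, &st);

    if (rank != 0) {
        MPI_Send(&token, 1, MPI_CHAR, (rank - 1) / 2, 0, MPI_COMM_WORLD);
        /* Release phase: block until the root's multicast datagram arrives. */
        recvfrom(mcast_fd, &token, 1, 0, NULL, NULL);
    } else {
        /* Root releases all processes with one multicast send of a one-byte payload. */
        sendto(mcast_fd, &token, 1, 0, (const struct sockaddr *)grp, sizeof(*grp));
    }
}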
4 Experimental Results
The platform for this experiment consists of four Compaq Pentium III 500 MHz computers and five Gateway Pentium III 450 MHz computers. The nine workstations are connected via either a 3Com SuperStack II Ethernet Hub or an HP ProCurve Switch. Both the hub and the switch provide 100 Mbps connectivity. The switch is a managed switch that supports IP multicast. Each Compaq workstation is equipped with 256 MB of memory and an EtherExpress Pro 10/100 Ethernet card. Each Gateway computer has 128 MB of memory and a 3Com 10/100 Ethernet card.
Fig. 7. MPI_Bcast with 4 processes over Fast Ethernet Hub
The performance of the MPI collective operations is measured as the longest completion time of the collective operation among all processes. For each message size, 20 to 30 different experiments were run.
Fig. 8. MPI_Bcast with 4 processes over Fast Ethernet Switch
Fig. 9. MPI_Bcast with 6 processes over Fast Ethernet Switch
The graphs show the measured time for all experiments with a line through the median of the times. The graphs illustrate the sample distribution of measured times. Figure 7 shows the performance of MPI_Bcast for both implementations over the hub with 4 processes. The figure shows that the average performance for both the linear and the binary multicast implementation is better for message sizes greater than 1000 bytes. With small messages, the cost of the scout messages causes the multicast performance to be worse than MPICH performance. The figure also shows variations in performance for all implementations due to collisions on the Fast Ethernet network. The variation in performance for MPICH is generally higher than the variation in performance for either multicast implementation. Figures 8, 9, and 10 describe the performance with the switch for 4, 6, and 9 processes respectively. Both the linear and the binary algorithm using multicast show better average performance for a large enough message size.
Fig. 10. MPI_Bcast with 9 processes over Fast Ethernet Switch
Fig. 11. Performance Comparison with MPI_Bcast over hub and switch for 4 processes
The crossover point of average MPICH performance and the average performance of using multicast is where the extra latency of sending scout messages becomes less than the latency from sending extra packets of data when the data is large. For some numbers of nodes, collisions also caused larger variance in performance with the multicast implementations. For example, this is observed for 6 nodes as shown in Fig. 9. With 6 nodes using the binary algorithm, both node 2 and node 1 attempt to send to node 0 at the same time, which causes extra delay. Figure 11 compares the average performance of the switch and the hub for 4 processes. When using IP multicast, the average performance of the hub is better than that of the switch for all measured message sizes. As for the original MPICH implementation, the average performance of the hub becomes worse than that of the switch when the size of the message is bigger than 3000 bytes. The MPICH implementation puts more messages into the network. As the load of the network gets larger, the extra latency of the switch becomes less significant than the improvement gained with more bandwidth. The multicast implementation is better than MPICH for message sizes greater than one Ethernet frame.
Fig. 12. Performance Comparison with MPI_Bcast over 3, 6, and 9 processes over Fast Ethernet switch
Fig. 13. Comparison of MPI_Barrier over Fast Ethernet hub
Figure 12 compares MPICH and the linear multicast implementation for 3, 6, and 9 processes over the switch. The results show that the linear multicast algorithm scales well up to 9 processes and better than MPICH. With the linear implementation, the extra cost for additional processes is nearly constant with respect to message size. This is not true for MPICH. Figure 13 describes the results of the MPI_Barrier operation over the hub. The results for MPI_Barrier show that IP multicast performs better on the average than the original MPICH implementation. The performance improvement increases as the number of processes gets bigger.
In a Single Program Multiple Data (SPMD) environment, message passing using either the linear algorithm or the binary algorithm is correct even when there are multiple multicast groups. However, since the IP multicast implementation requires the receive call to be posted before the message is sent, it is required that each process execute the multicast calls in the same order. This restriction is equivalent to requiring that the MPI code be safe [5]. If several processes broadcast to the same multicast group (in MPI terms, this is the same process group of the same context), the order of broadcast will be correctly preserved. For example, suppose that in an environment including the 4 processes with ids 4, 6, 7 and 8, processes 6, 7, and 8 all belong to the same multicast group and the broadcast is called in the following order.

MPI_Bcast(&buffer, count, MPI_INT, 6, MPI_COMM_WORLD);
MPI_Bcast(&buffer, count, MPI_INT, 7, MPI_COMM_WORLD);
MPI_Bcast(&buffer, count, MPI_INT, 8, MPI_COMM_WORLD);
Using either the binary algorithm or the linear algorithm, process 7 cannot proceed to send the second broadcast until it has received the broadcast message from process 6, and process 8 cannot send the third broadcast until it has received the broadcast message from process 7. The order of the three
broadcasts is carried out correctly. Using a similar argument, when there are two or more multicast groups that a process receives from, the order of broadcast will be correct as long as the MPI code is safe.
5 Conclusions and Future Work
Multicast reduces the number of messages required and, by doing so, improves the performance of MPI collective operations. Its receiver-directed message passing mode allows the sender to address all the receivers as a group. This experiment focused on a particular implementation using IP multicast. Future work is planned in several areas. Improvements are possible to the binary tree and linear communication patterns. While we have not observed buffer overflow due to a set of fast senders overrunning a single receiver, it is possible this may occur in many-to-many communications and needs to be examined further. Additional experimentation using parallel applications is planned. Also, low latency protocols such as the Virtual Interface Architecture [9] standard typically require a receive descriptor to be posted before a message arrives. This is similar to the requirement in IP multicast that the receiver be ready. Future work is planned to examine how multicast may be applied to MPI collective operations in combination with low latency protocols.
References
[1] D. E. Comer. Internetworking with TCP/IP Vol. I: Principles, Protocols, and Architecture. Prentice Hall, 1995.
[2] T. H. Dunigan and K. A. Hall. PVM and IP Multicast. Technical Report ORNL/TM-13030, Oak Ridge National Laboratory, 1996.
[3] W. Gropp, E. Lusk, N. Doss, and A. Skjellum. A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard. Technical Report Preprint MCS-P567-0296, Argonne National Laboratory, March 1996.
[4] N. Nupairoj and L. M. Ni. Performance Evaluation of Some MPI Implementations on Workstation Clusters. In Proceedings of the 1994 Scalable Parallel Libraries Conference, pages 98-105. IEEE Computer Society Press, October 1994.
[5] P. Pacheco. Parallel Programming with MPI. Morgan Kaufmann, 1997.
[6] The LAM source code. http://www.mpi.nd.edu/lam.
[7] The MPICH source code. www-unix.mcs.anl.gov/mpi/index.html.
[8] A. S. Tanenbaum, M. F. Kaashoek, and H. E. Bal. Parallel Programming Using Shared Objects and Broadcasting. Computer, 25(8), 1992.
[9] The Virtual Interface Architecture Standard. http://www.viarch.org.
[10] D. Towsley, J. Kurose, and S. Pingali. A Comparison of Sender-Initiated and Receiver-Initiated Reliable Multicast Protocols. IEEE JSAC, 15(3), April 1997.
An Open Market-Based Architecture for Distributed Computing

Spyros Lalis and Alexandros Karipidis
Computer Science Dept., University of Crete, Hellas
{lalis, [email protected]}
Institute of Computer Science, Foundation for Research and Technology, Hellas
{lalis, [email protected]}

Abstract. One of the challenges in large scale distributed computing is to utilize the thousands of idle personal computers. In this paper, we present a system that enables users to effortlessly and safely export their machines in a global market of processing capacity. Efficient resource allocation is performed based on statistical machine profiles and leases are used to promote dynamic task placement. The basic programming primitives of the system can be extended to develop class hierarchies which support different distributed computing paradigms. Due to the object-oriented structuring of code, developing a distributed computation can be as simple as implementing a few methods.
1 Introduction

The growth of the Internet has provided us with the largest network of interconnected computers in history. As off-the-shelf hardware becomes faster and gains Internet access, the network's processing capacity will continue increasing. Many of these systems are often under-utilized, a fact accentuated by the globe's geography, since "busy" hours in one time-zone tend to be "idle" hours in another. Distributing computations over the Internet is thus very appealing. However, several issues must be resolved for this to be feasible. The obstacle of platform heterogeneity must be overcome and security problems arising from the execution of code from untrusted parties must be confronted. Further inconveniences arise when installing and maintaining the corresponding programming environments. And then, distributed computations must be designed and implemented on top of them, a challenging task even for experienced programmers.
In this paper we present a system that addresses these problems, simplifying distributed computing over the Internet considerably. Through a maintenance-free, web-based user interface any machine can be safely connected to the system to act as a host for remote computations. A framework that promotes code reuse and incremental development through object-oriented extensions is offered to the application programmer. Writing computations for the system can be as trivial as implementing a few routines. We feel that the ease of deploying the system
and developing applications for it is of importance to the scientific community, since most of the programming is done by scientists themselves with little or no support from computer experts.
The rest of the paper is organized as follows. Section 2 summarizes the general properties of the system. Details about the resource allocation mechanism are given in Sect. 3. In Sect. 4 we look into the system architecture, giving a description of components and communication mechanisms. In Sect. 5 we show how our system can be used to develop distributed computations in a straightforward way. A comparison with related work is given in Sect. 6. Section 7 discusses the advantages of our approach. Finally, future directions of this work are mentioned in the last section.
2 System Properties

When designing the system, the most important goal was to achieve a level of simplicity that would make it popular both to programmers and owners of lightweight host machines, most notably PCs. Ease of host registration was thus considered a key issue. Safety barriers to shield hosts from malicious behavior of foreign code were also required. Portability and inter-operability were needed to maximize the number of host platforms that can be utilized. A simple yet powerful programming environment was called for to facilitate the distribution of computations over the Internet. All these features had to be accompanied by a dynamic and efficient mechanism for allocating resources to applications without requiring significant effort from the programmer.
In order to guarantee maximal cross-platform operability the system was implemented in Java. Due to Java's large scale deployment, the system can span across many architectures and operating systems. Host participation is encouraged via a web based interface, which installs a Java applet on the host machine. This accommodates the need for a user friendly interface, as users are accustomed to using web browsers. Furthermore, the security manager installed in Java enabled browsers is a widely trusted firewall, protecting hosts from downloaded programs. Finally, due to the applet mechanism, no administration nor maintenance is required at the host: the majority of users already has a recent version of a web browser installed on their machines.
On the client side we provide an open, extensible architecture for developing distributed applications. Basic primitives are provided which can in turn be used to implement diverse, specialized processing models. Through such models it is possible to hide the internals of the system and/or provide advanced programming support in order to simplify application development.
3 Resource Allocation

Host allocation is based on profiles, which are created by periodically benchmarking each host. A credit-based [1] mechanism is used for charging. Credit
can be translated into anything that makes sense in the context where the system is deployed. Within a non-profit institution, it may represent time units to facilitate quotas. Service-oriented organizations could charge clients for using hosts by converting credit to actual currency.
Both hosts (sellers) and clients (buyers) submit orders to a market, specifying their actual and desired machine profile respectively. The parameters of an order are listed in Table 1. The performance vectors include the host's mean score and variance for a set of benchmarks over key performance characteristics such as integer and floating point arithmetic, network connection speed to the market server etc. The host abort ratio is the ratio of computations killed by the host versus computations initiated on that host (a "kill" happens when a host abruptly leaves the market). The host performance vectors and abort ratio are automatically produced by the system. Host profiles can easily be extended to include additional information that could be of importance for host selection.
Table 1. Parameters specified in orders

price/sec. Sell orders: the minimum amount of credit required per second of use of the host. Buy orders: the maximum amount of credit offered per second of use of the host.
lease duration. Sell orders: the maximum amount of usage time without renegotiation. Buy orders: the minimum amount of usage time without renegotiation.
granted/demanded compensation. Credit granted/demanded for not honoring the lease duration.
performance statistics vectors. Sell orders: the host's average score and variance for each of the benchmarks (measured). Buy orders: the average performance score and variance a buyer is willing to accept.
abort ratio. Sell orders: the host's measured abort ratio. Buy orders: the abort ratio a buyer is willing to accept.
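As an illustration, an order carrying the parameters of Table 1 might be represented roughly as follows; the class and field names are our own and are not part of the system's published interface.

public class Order implements java.io.Serializable {
    public boolean sell;            // true for a host (sell) order, false for a client (buy) order
    public double pricePerSec;      // minimum credit required (sell) or maximum credit offered (buy)
    public long leaseDuration;      // maximum (sell) or minimum (buy) usage time, in seconds
    public double compensation;     // credit granted/demanded for not honoring the lease duration
    public double[] perfMean;       // per-benchmark mean scores (measured or acceptable)
    public double[] perfVariance;   // per-benchmark score variances
    public double abortRatio;       // measured (sell) or maximum acceptable (buy) abort ratio
}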
An economy-based mechanism is employed to match the orders that are put in the market. For each match, the market produces a lease, which is a contract between a host and a client containing their respective orders and the price of use agreed upon. Leases are produced periodically using continuous double auction [8]. A lease entitles the client to utilize the host for a specific amount of time. If the client's task completes within the lease duration, then the buyer transfers an amount of credit to the seller as a reward, calculated by multiplying the actual duration with the lease's price per second. If the lease duration is not honored, an amount of credit is transferred from the dishonoring party to the other.
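A minimal sketch of this settlement rule follows. It assumes that the compensation amount is taken from the dishonoring party's own order, a detail the description above does not spell out.

public class Lease {
    Order sellOrder, buyOrder;
    double pricePerSec;                        // price agreed upon in the double auction

    // Reward paid by the buyer when the task completes within the lease duration.
    double reward(long actualDurationSec) {
        return actualDurationSec * pricePerSec;
    }

    // Credit moved from the dishonoring party to the other when the lease is broken.
    double compensation(boolean hostDishonored) {
        return hostDishonored ? sellOrder.compensation : buyOrder.compensation;
    }
}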
4 System Architecture

4.1 Overview of System Components

An overview of the system's architecture is depicted in Fig. 1. The basic components of our system are the market server, hosts, the host agent, schedulers, tasks and client applications.
Fig. 1. Overview of architecture

The Client Application is a program which needs to perform computations that require considerable processing power. Through the system, it may either distribute a computation across a number of machines or just delegate the execution of an entire computation to a faster machine to speed up execution.
The Market Server is the meeting place for buyers and sellers of processing power. It collects orders from clients and hosts. Using the host profiles, it then matches buy with sell orders and thus allocates resources.
A Host is a machine made available to be used by clients. A host participates in the market through the Host Agent, a Java applet. The user visits a URL with a Java-enabled web browser and the agent is downloaded to his system. The agent communicates with the market server, takes care of placing orders on behalf of the user and executes tasks assigned to the host. It also provides the market server with the benchmark scores needed for the host's profile.
A computation in our system consists of a Scheduler and one or more Tasks. The application installs the scheduler on the market server. The scheduler then places orders in the market for acquiring machines to complete the computation. New orders can be issued at any time in order to adapt to fluid market conditions. When a lease is accepted by the scheduler, a task is launched in the host machine to assist in completing the computation.
4.2 Basic System Services and Communication

There are six protocols used for communication by the system.
The UploadProtocol is a fixed, published Remote Method Invocation (RMI) interface used by the client application to upload a computation to the market server and to instantiate its scheduler. A client application may instantiate multiple schedulers to simultaneously launch the same code with multiple data.
The ControlProtocol is a published RMI interface for the client application to control a scheduler. Through this interface the application performs tasks such as starting a computation with new parameters, altering the computation's budget for acquiring hosts, instructing the scheduler to kill all tasks and exit, etc. The basic functions are implemented in the system classes. The programmer can introduce computation-specific control functions by extending this interface.
The ComputationProtocol is used within the bounds of a single computation for communication among tasks and their scheduler. It is application dependent and thus unknown to the system. We do, however, provide message passing support (not further discussed in this paper) that can be used by application developers to implement flexible, safe and efficient data exchange.
The MarketSchedulerProtocol is used for local communication between the market server and schedulers. The market server implements a standard published interface for servicing requests from schedulers such as placing orders and retrieving host and market status information. Respectively, schedulers provide methods for being notified by the market of events such as the opportunity to acquire a new lease, a change in the client's account balance, the completion of a task's work and the failure of a host that was leased to them.
Similarly, the HostAgentTaskProtocol provides local communication among a host agent and the task it is hosting. The agent implements a published interface for servicing requests from tasks, such as retrieving information about a host's performance.
The MarketHostAgentProtocol is a proprietary protocol used by the market server and the host agent. It allows orders to be placed in the market by the host. It is also used to retrieve tasks from the market, ask for "payment" when tasks complete and to post benchmarking data to the market server.
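The paper does not give the exact signatures of these RMI interfaces; the sketch below only illustrates how the UploadProtocol and the basic control interface might look. All method names and parameters here are assumptions, and each interface would normally live in its own source file.

import java.rmi.Remote;
import java.rmi.RemoteException;

public interface UploadProtocol extends Remote {
    // Ship the computation's classes to the market server and instantiate its scheduler,
    // returning a handle for steering it (method name and parameters are assumptions).
    Control uploadComputation(byte[] classArchive, String schedulerClassName)
            throws RemoteException;
}

interface Control extends Remote {
    void start(Object parameters) throws RemoteException;   // start with (new) parameters
    void stop() throws RemoteException;                      // kill all tasks and exit
    void setBudget(double credit) throws RemoteException;    // alter the budget for acquiring hosts (assumed name)
}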
5 Supporting Distributed Computing Paradigms

Through the set of primitives offered by the system, it is possible to develop a wide range of applications. More importantly, generic support can be provided for entire classes of distributed computations. Applications can then be developed by extending these classes to introduce specific functionality. This incremental development can greatly simplify programming. As an example, in the following we describe this process for embarrassingly parallel computations requiring no communication between tasks. Other distributed computation paradigms can be supported in similar fashion.
5.1 The Generic Master-Slave Model
In this model work is distributed among many processors by a distinguished processor referred to as the "master". The other processors, referred to as "slaves", complete the work assigned to them and return the results to the master. In order to process its workload a slave does not need to communicate with any other slave. This model is used in image processing, genetic algorithms, brute force search and game tree evaluation. One possible implementation of this model is sketched below. For brevity, only the methods a programmer has to be aware of are shown.

public interface MS_Control extends Control {
    void start(Object pars);   // inherited from superclass
    void stop();               // inherited from superclass
    Object[] getResults(boolean all, boolean keep);
}

public abstract class MS_Scheduler extends Scheduler implements MS_Control {
    public abstract Object[] doPartitions(Object pars);
    public void receiveResult(Object result) { /* default: collect the result */ }
}

public abstract class MS_Task extends Task {
    public abstract Object processPartition(Object partition);
}
The MS_Control.start method starts a new computation. MS_Control.start triggers MS_Scheduler.doPartitions to produce the various partitions of the computation. These are forwarded to instances of MS_Task residing on hosts allocated to the computation and MS_Task.processPartition is invoked to process them. The results are returned to the scheduler where post-processing is performed via calls to the MS_Scheduler.receiveResult method.
It is important to notice that programmers need to implement just three methods in order to complete a computation following this model. All other implementation issues, including the resource allocation strategy of the scheduler, remain hidden. The MS_Control interface, which defines the primitives for controlling and retrieving the results of the computation, is implemented by the base MS_Scheduler class and thus does not concern the programmer.
This master/slave model could be further extended to introduce additional functionality such as check-pointing and restarting of tasks for fault tolerance. Programmers would exploit this functionality without effort.
5.2 A Sample Client Application
Based on this model, we show how a specific application, e.g. for computing the Mandelbrot set, can be implemented. We assume that the area to be calculated is partitioned in bands, processed in parallel to speed up execution. The user selects an area and the computation is started to zoom into the selected area.
The parameters, partitions and results of the fractal application must be extensions of the Object class. The classes must implement the Serializable interface in order to be successfully transported across machine boundaries.

class FractalParameters extends Object implements Serializable {
    // ... fractal computation parameters
}

class FractalPartition extends Object implements Serializable {
    // ... parameters for calculating a slice
}

class FractalResult extends Object implements Serializable {
    // ... results of a slice calculation
}
Assuming the parameter and result objects have been appropriately defined, a FractalScheduler class must be programmed as a subclass of MS_Scheduler to produce partitions via the doPartitions method. The MS_Scheduler.receiveResult method is not overridden because individual results are not merged by the scheduler. Also, the basic MS_Control interface needs no extension since it already offers the necessary routines for controlling and monitoring the computation. Analogously, a FractalTask class must be provided that implements the MS_Task.processPartition method to perform the calculation of slices.

class FractalScheduler extends MS_Scheduler {
    public Object[] doPartitions(Object comp_pars) {
        FractalPartition partitions[];
        FractalParameters pars = (FractalParameters) comp_pars;
        // ... split calculation and produce partitions
        return (partitions);
    }
}

class FractalTask extends MS_Task {
    public Object processPartition(Object partition) {
        FractalResult result;
        FractalPartition pars = (FractalPartition) partition;
        // ... perform the computation
        return (result);
    }
}
Finally, to run the application, the computation's classes must be uploaded to the market server using the UploadProtocol and a scheduler instance must be created. The MS_Control interface is used to control the scheduler and periodically retrieve the computation's results.
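A client-side driver for the fractal computation could then look roughly like the sketch below. The server URL, the upload method name and the archive file name are assumptions; only MS_Control.start and MS_Control.getResults are taken from the listings above.

public class FractalClient {
    public static void main(String[] args) throws Exception {
        // Locate the market server's upload interface (URL is illustrative).
        UploadProtocol market =
                (UploadProtocol) java.rmi.Naming.lookup("rmi://market.example.org/upload");

        // Upload the computation's classes and obtain a control handle to a new scheduler.
        byte[] classes = java.nio.file.Files.readAllBytes(
                java.nio.file.Paths.get("fractal-computation.jar"));
        MS_Control scheduler =
                (MS_Control) market.uploadComputation(classes, "FractalScheduler");

        // Start the computation and periodically retrieve the slices computed so far
        // (termination handling elided for brevity).
        scheduler.start(new FractalParameters());
        while (true) {
            Object[] slices = scheduler.getResults(false, false);
            // ... render the received FractalResult objects ...
            Thread.sleep(1000);
        }
    }
}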
6 Related Work

Popular distributed programming environments such as PVM [9] and MPI [9] lack advanced resource allocation support. PVM allows applications to be notified when machines join/leave the system, but the programmer must provide code that investigates hosts' properties and decides on proper allocation. MPI, using a static node setup, prohibits dynamic host allocation: the programmer must make such decisions a priori. Both systems require explicit installation of their runtime system on participating hosts. A user must therefore have access to all participating machines, as she must be able to login to them in order to spawn tasks. This is impractical and may result in only a small number of hosts being utilized, even within a single organization. Finally, the choice of C as the main programming language, compared to Java, is an advantage where speed is concerned. But to be able to exploit different architectures, the user must provide and compile code for each one of them, adding to the complexity and increasing development time due to porting considerations. The maturation of Java technology ("just in time" compilation, Java processors, etc.) could soon bridge the performance gap with C. Notably, a Java PVM implementation is underway [6], which will positively impact the portability of the PVM platform.
Condor is a system that has been around for several years. It provides a comparative "matchmaking" process for resource allocation through its "classified advertisement" matchmaking framework [11]. A credit-based mechanism could be implemented using this framework, but is currently unavailable. Condor too requires extensive administration and lacks support for easy development.
Newer systems such as Legion [10] and Globus [7] address the issues of resource allocation and security. They provide mechanisms for locating hosts and signing code. However, both require administration such as compiling and installing the system as well as access to the host computer. They do not support the widely popular Windows platform (though Legion supports NT) and do little to facilitate application development for non-experts. Globus merely offers an MPI implementation whereas Legion provides the "Mentat" language extensions. Legion's solution is more complete but also complicated for inexperienced programmers. It requires using a preprocessor, an "XDR" style serialization process and introduces error-prone situations since virtual method calls will not work as expected in all cases. Stateful and stateless objects are also handled differently. Finally, adding hosts to a running computation is done from the command line and additional hosts are assigned to the computation at random; no matching of criteria is performed.
Several other systems using Java as the "native" programming language have been designed for supporting globally distributed computations, such as Charlotte [3], Javelin [4] and Challenger [5]. These systems automatically distribute computations over machines. However, they do not employ market-based principles to allocate hosts and do not maintain information about hosts' performance.
The market paradigm has received considerable attention in distributed systems aiming for flexible and efficient resource allocation. A system operating on the same principles as ours is Popcorn [12]. Popcorn also uses auction
mechanisms to allocate hosts to client computations and exploits Java applet technology to achieve portability, inter-operability and safety. However, it does not provide "host profiling", nor does it promote incremental development.
7 Discussion

Besides the fact that the allocation strategies used in most systems don't take into account "behavioral patterns" of hosts, there is also virtually no support for leasing. We argue that both are invaluable for efficient resource allocation in open computational environments.
Providing information about the statistical behavior of participating hosts can assist schedulers in making task placement decisions, avoiding hosts that will degrade performance (and waste credit). For example, assume a scheduler has two tasks to allocate. Blind allocation on two hosts is not a good idea; unless the two machines exhibit comparable performance, the faster machine will be wasted since the computation will be delayed by the slower one. Similarly, using the abort ratio, schedulers can avoid unstable hosts for placing critical parts of a computation. Those can be assigned to perhaps more "expensive" but stable hosts. Computations implementing check-pointing and crash-recovery could utilize less credible hosts.
The lack of leasing is also a drawback in open environments: a client could obtain many processors when there is no contention and continue to hold them when demand rises. This is unacceptable in a real-world scenario where credit reflects priorities or money. This would imply that prioritized or wealthy computations can be blocked by "lesser" ones. To guarantee quality of service, some form of leasing or preemption must be adopted. Leases are also practical in non-competitive environments. The lease duration allows users to indicate the time during which hosts are under-utilized. Based on this knowledge, tasks can be placed on hosts that will be idle for enough time, and checkpoints can be accurately scheduled, right before a host is about to become unavailable.
Finally, it is generally acknowledged that incremental development increases productivity by separation of concerns and modular design. Distributed computing can benefit from such an approach. Modern object-oriented programming environments are a step towards this direction, but significant programming experience and discipline are still required. We feel that with our system's design, it is possible even for inexperienced programmers to write computations rapidly.
8 Future Directions

New versions of the Java platform will offer more fine-grained control in the security system. Using the new mechanisms we expect to be able to provide more efficient services, such as access to local storage for task checkpoints and invocation of native calls to exploit local, tuned libraries such as [2, 13]. Logging mechanisms, along with the signing of classes, will further increase the security of the system.
We also wish to experiment with schedulers capable of recording the performance of previous allocations. Accumulated information can perhaps be converted into "experience", leading towards more efficient allocation strategies.
Lastly, the issue of scalability needs to be addressed. The current architecture is limited by the market server. A single server could not handle the millions or billions of hosts connecting to a truly world-wide version of this service. It would also be impossible to have all schedulers running on the same machine. We intend to overcome this problem by introducing multiple market servers that will allow traffic to be shared among several geographically distributed servers.
References

[1] Y. Amir, B. Awerbuch, and R. S. Borgstrom. A cost-benefit framework for online management of a metacomputing system. In Proceedings of the First International Conference on Information and Computation Economies, pages 140-147, October 1998.
[2] M. Baker, B. Carpenter, G. Fox, S. H. Ko, and S. Lim. mpiJava: An Object-Oriented Java Interface to MPI. Presented at International Workshop on Java for Parallel and Distributed Computing, IPPS/SPDP 1999, April 1999.
[3] A. Baratloo, M. Karaul, Z. M. Kedem, and P. Wyckoff. Charlotte: Metacomputing on the web. In Ninth International Conference on Parallel and Distributed Computing Systems, September 1996.
[4] P. Cappello, B. Christiansen, M. F. Ionescu, M. O. Neary, K. E. Schauser, and D. Wu. Javelin: Internet-based parallel computing using Java. In Proceedings of the ACM Workshop on Java for Science and Engineering Computation, June 1997.
[5] A. Chavez, A. Moukas, and P. Maes. Challenger: A multiagent system for distributed resource allocation. In Proceedings of the First International Conference on Autonomous Agents '97, 1997.
[6] A. Ferrari. JPVM: The Java Parallel Virtual Machine. Journal of Concurrency: Practice and Experience, 10(11), November 1998.
[7] I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. Intl. J. Supercomputer Applications, 11(2), 1997.
[8] D. Friedman. The double auction market institution: A survey. In D. Friedman and J. Rust, editors, Proceedings of the Workshop in Double Auction Markets, Theories and Evidence, June 1991.
[9] G. A. Geist, J. A. Kohl, and P. M. Papadopoulos. PVM and MPI: a Comparison of Features. Calculateurs Paralleles, 8(2):137-150, June 1996.
[10] A. S. Grimshaw and W. A. Wulf. The Legion vision of a worldwide computer. CACM, 40(1):39-45, 1997.
[11] R. Raman, M. Livny, and M. Solomon. Matchmaking: Distributed resource management for high throughput computing. In Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing, July 1998.
[12] O. Regev and N. Nisan. The POPCORN Market: an Online Market for Computational Resources. In Proceedings of the First International Conference on Information and Computation Economies, pages 148-157, October 1998.
[13] The Java Grande Working Group. Recent Progress of the Java Grande Numerics Working Group. http://math.nist.gov/javanumerics/reports/jgfnwg-02.html.
The MultiCluster Model to the Integrated Use of Multiple Workstation Clusters

Marcos Barreto, Rafael Ávila, and Philippe Navaux
Institute of Informatics, UFRGS
Av. Bento Gonçalves, 9500, Bl. IV, PO Box 15064, 90501-910 Porto Alegre, Brazil
E-mail: {barreto, bohrer,
[email protected]

Marcos Barreto: M.Sc. student at PPGC/UFRGS (CAPES fellow). Rafael Ávila: M.Sc. (PPGC/UFRGS, 1999); RHAE/CNPq researcher at PPGC/UFRGS. Philippe Navaux: Ph.D. (INPG, Grenoble, France, 1979); Professor at PPGC/UFRGS.

Abstract. One of the new research tendencies within the well-established cluster computing area is the growing interest in the use of multiple workstation clusters as a single virtual parallel machine, in much the same way as individual workstations are nowadays connected to build a single parallel cluster. In this paper we present an analysis of several aspects concerning the integration of different workstation clusters, such as Myrinet and SCI, and propose our MultiCluster model as an alternative to achieve such an integrated architecture.
1 Introduction

Cluster computing is nowadays a common practice for many research groups around the world that seek high performance for a great variety of parallel and distributed applications, like aerospace and molecular simulations, Web servers, data mining, and so forth. To achieve high performance, many efforts have been devoted to the design and implementation of low-overhead communication libraries, specially dedicated to fast communication networks used to interconnect nodes within a cluster, which is the case of Fast Ethernet [14], Myrinet [3] and SCI [12]. The design of such software is a widely explored area, resulting in proposals like BIP [21], GM [9], VIA [24] and Fast Messages [19].
Currently, there are other research areas being explored, such as administrative tools for cluster management and what is being called Grid Computing, with the objective of joining geographically distributed clusters to form a Metacomputer and taking benefit of the resulting overall computational power [4]. The work presented here is not focused on these areas directly, because our goal is to discuss a practical situation in which a Myrinet cluster must be interconnected with an SCI cluster to form a single parallel machine, which can be used to verify the application's behaviour when it runs on a shared memory cluster or on a message passing cluster, efficiently distribute tasks from an application according to their communication needs, and offer a complete environment intended to teach parallel and distributed
programming, allowing the user to express, through the same API, message passing and shared memory interactions.
This paper is organised as follows: Section 2 presents an analysis of the problems that arise from integrating multiple workstation clusters; in Section 3 we present the MultiCluster model and the DECK environment as our contribution towards this objective; Section 4 brings some comments on related research efforts and finally Section 5 presents our conclusions and current research activities.
2 Integrating Multiple Clusters

When computer networks were an emergent platform for parallel and distributed programming, many efforts were expended to solve problems related to joining individual PCs into a single virtual parallel machine. From these efforts, communication libraries such as PVM [8] and MPI [17] arose to allow individual network nodes to be identified within the parallel environment.
The integration of multiple workstation clusters presents a similar problem. Individual clusters of workstations are nowadays fairly well managed by communication libraries and parallel execution environments. When we start to think of clusters of clusters, again we have the same problems regarding the connection of elements that run independently from each other while still meeting the commitment of offering the user an appropriate environment for parallel and distributed programming. What we mean by appropriate is to provide an intuitive programming interface and offer enough resources to meet the programmer's needs. As the purpose of this paper is to identify these problems and propose possible solutions to them, we have divided our study into hardware and software analysis.

2.1 Hardware Aspects

There are no major problems from the hardware point of view to achieve such integration, since the networks considered (Myrinet and SCI) could co-exist within the same node and use different techniques to communicate. Figure 1 presents the most simple cluster interconnection that could be realised. Each individual cluster could have any number of physical nodes connected through a switch (in the Myrinet case) or directly as a ring (in the SCI case). To allow the integration, each cluster must have a "gateway" node configured with two network interfaces (two Myrinet NIs or a Myrinet + SCI NIs), where the additional Myrinet NI is used to link clusters. For the moment we do not consider SCI a suitable technology as a linking media, since a message-passing paradigm seems more adequate for this purpose.

2.2 Software Aspects

Several points have been discussed by the community in order to identify problems and solutions related to the design and implementation of communication libraries for cluster-based applications, with a main objective: provide high bandwidth at small latencies. Besides this, the development of cluster middleware tools to furnish high availability and single system image support is an ongoing task [4, 11].
Fig. 1. The simplest way to interconnect two workstation clusters.
In the case of clusters of clusters, performance is not a key point due to the drawbacks implicitly imposed by the loosely coupled integration. There are other problems regarding such integration that must be attended to first, and performance will then be the consequence of the techniques used to solve them.
The first point to consider is how to combine message passing with distributed shared memory. A desirable solution would be to offer a single communication abstraction that could be efficiently implemented over message passing and shared memory architectures. In practice, however, it is easier to have an individual mechanism for each one and allow the user to choose between them, depending on his application needs.
Another point to treat is the routing problem, which arises when a task needs to exchange data with another task running in a remote cluster. It is necessary that the communication layer identifies the location of a communication endpoint and knows how to map physical nodes from separate clusters to be capable of routing messages between them.
Finally, heterogeneity could be a problem. Although most individual workstation clusters are internally homogeneous, there may be cases where multiple clusters could be heterogeneous in relation to each other. In these cases, problems regarding "endianisms" and floating-point data representation have to be addressed.
If the previous problems can be efficiently treated, it is also possible to provide the user with the capacity of deciding where to place a specific set of tasks, according to their communication needs. If the application granularity can be modelled considering the underlying platform, it is still possible to achieve good performance.
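As a small illustration of the byte-order issue mentioned above (this is generic Java, not part of DECK, which is described in Sect. 3.3), a 32-bit value written by a little-endian sender is misread by a receiver that assumes big-endian order unless an explicit conversion is made:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianDemo {
    public static void main(String[] args) {
        // A sender on a little-endian machine writes the value 0x0A0B0C0D.
        ByteBuffer wire = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
        wire.putInt(0x0A0B0C0D);
        wire.flip();

        // The receiver first interprets the same four bytes as big-endian, then correctly.
        int misread = wire.order(ByteOrder.BIG_ENDIAN).getInt();
        wire.rewind();
        int correct = wire.order(ByteOrder.LITTLE_ENDIAN).getInt();
        System.out.printf("misread: 0x%08X, correct: 0x%08X%n", misread, correct);
    }
}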
3 The MultiCluster Model

The MultiCluster model is an approach to join independent clusters and provide a simple programming interface which allows the user to configure and utilize such an integrated platform. With this model we intend to address and provide solutions to the problems mentioned in the previous Section, while still keeping a well-structured and
efficient programming environment. To best explain the proposed model, we have divided the discussion in hardware and software aspects. 3.1 Hardware Platform We are assuming the configuration illustrated in Figure 1, which corresponds to our available hardware platform. We currently have a Myrinet cluster, composed by 4 Dual Pentium Pro 200 MHz nodes, and a SCI cluster, composed by 4 Pentium Celeron 300 MHz nodes. These clusters are linked through a Fast Ethernet network. The choice of the media used to interconnect the clusters depends mostly on the application needs. It is possible to use a standard Ethernet link instead of Myrinet to realise the communication between clusters. We propose Myrinet as a link media because it could minimize the loss in performance originated by the integration of different platforms; for our model, however, it is enough that some node in each cluster plays the role of a gateway. It is important to say that questions related to cost and scalability are out of the scope of this paper. In a near future, many companies and universities are likely to own a small number of cluster platforms, and so these questions are particular to each of them. We are assuming the situation where at least two clusters are available and have to be used together. 3.2 Software Structure We have studied each problem mentioned in Section 2.2, trying to find the best solution to each one and structuring our software layer to carry out such solutions. As a result, the MultiCluster model follow some conceptual definitions which rule the way such integration must be handled. Figure 2 shows the user-defined descriptor file to a MultiCluster application. In this file, the user must specify a list of machines within the clusters he wants to use, the communication subnets identifiers (used to inter-cluster communication), a set of logical nodes with its correspondents machines and the gateway nodes. Physical and Logical Nodes. A physical node corresponds to each available machine plugged in any individual cluster and only matters to physical questions. Logical nodes are the set of available nodes from the application’s point of view. In the case of message-passing clusters, each physical node corresponds to one logical node (this is mandatory). In shared-memory clusters, a logical node can be composed of more than one physical node. The distinction between logical nodes for Myrinet and SCI is made by the node id field. For example, “node 1:0” means the second node within the subnet 0 (which is Myrinet in our example), while “node 4:1” means the first node within the subnet 1 (which is SCI). It is important to notice that this numbering scheme, although complex, is entirely processed by the environment in a transparent manner; the user only knows how many logical nodes he has and what are the physical machines within each logical node.
// DECK user-defined descriptor file

// virtual machine
verissimo, quintana, euclides, dionelio, scliar, ostermann, meyer, luft

// communication subnets
myrinet: 0
sci: 1

// logical nodes
node 0:0 machines: verissimo
node 1:0 machines: quintana
node 2:0 machines: euclides
node 3:0 machines: dionelio
node 4:1 machines: scliar, luft
node 5:1 machines: ostermann, meyer

// gateway nodes
gateways: quintana, scliar

Fig. 2. Descriptor file for a MultiCluster application.
Intra- and Inter-node Communication. As the application only sees logical nodes, it is relatively easy to adapt the different communication paradigms: inside a logical node, communication is made by shared memory; between logical nodes, communication is made by message passing. From the user’s point of view, there is only one programming interface furnishing both mechanisms to specify communication over Myrinet or SCI clusters; the underlying communication layer is in charge of implementing one or another paradigm. Heterogeneity. Although a less frequent problem, heterogeneity may arise depending on the availability of clusters that have to be interconnected. Here, we are considering different data representations and the need to indicate to the message receiver what is the architecture type of the message sender. This problem is implicitly treated by the communication software. Even occuring some performance loss due to such integration, it is possible to the user to define the best location for his application tasks, creating communication resources according to each task location (i.e. communication subnets). Through this facility, the granularity of communication could be balanced among clusters, avoiding as long as possible the traffic across the link network. 3.3 The Programming Environment—DECK The interface between the programmer and the MultiCluster architecture is the DECK environment. DECK (Distributed Executive Communication Kernel) is composed of a runtime system and a user API which provides a set of services and abstractions for the development of parallel and distributed applications. A DECK application runs in an SPMD style, split in terms of logical nodes.
DECK is divided in two layers, one called DECK, which directly interacts with the underlying OS and a service layer, where more elaborate resources (including the support for multiple clusters) are made available. Figure 3 shows the layered structure of DECK.
Fig. 3. Internal structure of DECK (service layer: RCD, naming, FT, group, sched; µDECK layer: thread, semaph, msg, mbox, shmem).
µDECK is the platform-dependent part of DECK. This layer implements the five basic abstractions provided within the environment: threads, semaphores, messages, mailboxes and shared segments. Each of these abstractions is treated by the application as an object, and has associated primitives for proper manipulation. Messages present pack/unpack primitives, which do not necessarily perform marshalling/unmarshalling actions. When a message object is created, one of its attributes holds the identification of the host architecture. At the time of a pack no marshalling is performed; at the time of an unpack, if the receiving host is of a different architecture, the proper data conversion is made.¹ Messages can be posted to or retrieved from mailboxes. Only the creator of a mailbox is allowed to retrieve messages from it, but any other thread knowing the mailbox can post to it. To use a mailbox, the creator must register it in a naming server. There are two ways to obtain a mailbox address: fetching it in the name server or receiving it in a message.
The service layer is built on top of µDECK and aims to furnish additional, more sophisticated mechanisms that might be useful to the development of parallel applications, such as naming, group communication and fault tolerance support. In the scope of this paper, two elements of this layer must be analysed: the naming service and the Remote Communication Daemon (RCD).
The name server is a dedicated thread which runs in the first node within each cluster. For example, in the configuration illustrated in Figure 2, there will be a naming server running on "verissimo" and another running on "scliar". Each naming server is responsible for registering mailboxes created within its cluster. The name server is automatically executed when the application starts and has a well-known mailbox to allow other threads to communicate.
¹ It is important to observe that we only expect this to happen for messages crossing cluster boundaries, since clusters are assumed to be internally homogeneous.
The DECK/Myrinet Implementation. In the implementation of DECK on top of Myrinet, we are currently using BIP (Basic Interface for Parallelism) [21] as a communication protocol to efficiently use the underlying hardware and deliver high performance to applications. As BIP utilizes reception queues labeled with tags within each node, our mailbox implementation assigns a specific tag to each mailbox. To create a mailbox, the programmer uses the deck_mbox_create() primitive, passing as arguments the mailbox name and the communication subnet (defined in the descriptor file) in which this mailbox will be used. The communication is made by post and retrieve operations, passing as arguments the corresponding mailbox and the message object, which contains the DECK supported datatypes. Posting a message is an asynchronous operation, while retrieving a message is a synchronous operation. To achieve this behaviour, we use the bip_tisend() and bip_trecv() primitives, respectively.
The implementation of DECK mailboxes and messages on top of BIP is straightforward, since both are based on message passing. Shared segments, however, need additional software DSM support to be implemented with the same library. For the moment we are studying the introduction of a DSM library, such as TreadMarks [25], to allow the usage of shared segments over Myrinet. The primitives for threads and semaphores are trivial and follow the Pthreads standard [13].
The DECK/SCI Implementation. We base our DECK/SCI implementation on two SCI programming libraries: Yasmin [23], which provides basic primitives for creation, mapping and synchronisation of shared segments, and Sthreads [22], which offers a Pthread-like environment on top of Yasmin. A DECK shared segment object offers primitives for creation, naming, mapping and locking. In contrast to Myrinet, SCI allows an easier implementation of both communication paradigms, so DECK/SCI offers mailboxes and messages as well as shared segments.
The creation of threads in DECK/SCI follows a simple round-robin placement strategy, according to the number of physical nodes that compose a logical node, which means that placement is still transparent to the end user. Notice that local memory can still be used for communication by local threads (i.e. threads in the same physical node), but it is up to the programmer to keep this kind of control. This means that, within SCI clusters, memory is only guaranteed to be correctly shared between remote threads if it is mapped into a DECK shared segment.
RCD–Remote Communication Daemon. In order to support the MultiCluster model, the Remote Communication Daemon has been designed as a DECK service responsible for communicating with remote clusters. As each cluster must have a "gateway" node, the RCD is automatically executed inside this node when the application starts and follows the same semantics as the name server, i.e., it also has a well-known mailbox.
The RCD acts upon demand in two special cases: when fetching names defined remotely (i.e. on another cluster) and when posting messages to remote mailboxes. When a DECK primitive fails to fetch a mailbox address in a local name server, it contacts the RCD, which then broadcasts the request to other RCDs in the system and
waits for an answer, returning it to the caller. In the second case, when a DECK primitive sees a remote mailbox address when posting a message, it contacts the RCD, which then forwards the message to the RCD responsible for the communication subnet in which the mailbox is valid.
It is important to emphasize that communication between threads in different logical nodes, as well as different clusters, must always be made by message passing. Even in the case of an SCI cluster, there must be at least one mailbox to allow communication with the RCD and, eventually, retrieve messages. For the moment we are not considering the utilisation of a global shared memory space to establish communication among clusters due to the lack of this support in the DECK/Myrinet implementation.
Our intention in designing DECK in three parts is to make it usable without changes in both single- and multi-clustered environments. In the first case, the RCD will simply not be brought into action by the application, since all the objects will be local to a specific cluster.
4 Related Work

Since the purpose of this paper is to discuss practical questions involved in the integration of multiple clusters and propose our model to achieve such integration, we tried to identify similar proposals regarding this subject.
There is a great number of research projects concerning the integration of multiple workstation clusters, such as NOW [1], Beowulf [2], Globus [7] and Legion [10]. The goal of these projects is to allow parallel and distributed programming over geographically distributed, heterogeneous clusters that correspond to a "global computational grid". The differential characteristic of our MultiCluster model is that we are assuming the simultaneous use of different network technologies, while these projects plan to use a common network technology to connect clusters, providing high scalability.
In terms of programming environments, there are also some efforts concentrated on joining message passing and distributed shared memory facilities, such as Stardust [5] and Active Messages II [16]. The main goal is to provide support for both message passing and distributed shared memory paradigms and, at the same time, offer mechanisms for fault tolerance and load balancing support, as well as portability. There are also some important contributions based on Java, such as JavaNOW [15], JavaParty [20] and Javelin [6]. All these contributions aim to provide distributed programming across networks of workstations or Web-based networks, differing in the communication model they use.
The idea behind MultiCluster is similar in some aspects to the objectives found in the projects/environments mentioned here, though on a smaller scale. Our research goal is to identify and propose solutions to problems related to the specific integration of Myrinet and SCI clusters, while the goals of such projects comprise a larger universe, including fast communication protocols, cluster tools, job scheduling and so on. Nevertheless, it is possible to state brief comparisons: our RCD is a simpler implementation when compared with Nexus, the communication system used inside Globus; it is just a way to give remote access to mailboxes defined in other clusters and allows us to separate the functionality of DECK when it runs on a single cluster platform.
The combination of message passing and distributed shared memory we offer is not so different from the usual mechanisms provided by the other environments. We want to efficiently implement these mechanisms on both clusters, without changing the programming interface. To accomplish this, our choice is to provide a mailbox object and a shared segment object to express message passing and memory sharing, respectively.
5 Conclusions and Current Work

In this paper we exposed some problems related to the integration of two different cluster platforms and proposed our MultiCluster model to achieve such desirable integration. We are developing our software environment aiming to accomplish a number of objectives, such as joining two specific cluster platforms (Myrinet and SCI) and providing a uniform API for parallel and distributed programming on both platforms, as well as opening research activities concerning such integration.
The integration is easier in terms of hardware because many solutions are already implemented within the OS kernel (e.g. co-existence of network device drivers). In terms of software, we have to decide what is the abstraction degree we want to offer to the programmer. It is important that the user be aware of the characteristics of each individual cluster to best adapt his application to take advantage of them. On the other hand, the DECK layer must abstract implementation details as much as possible, offering to the users a complete and simple API able to express the application needs.
Currently, the descriptor file is the key point to configure the MultiCluster platform, because it represents the communication contexts and the logical nodes the user wants to use. Although this configuration is not so transparent, it is the most suitable way to adapt the execution environment according to the user needs. We consider that there are no problems in this task, since the execution environment guarantees the expected functionality.
Our work has been guided towards the design of a complete set of programming resources, enclosed in a software layer. Through the modularisation of DECK, we have divided our work in such a way that we can parallelize our efforts to cover all problems exposed and to make available, as soon as possible, the MultiCluster model. At the moment we already have an implementation of DECK based on Pthreads and UNIX sockets, available at our Web page [18]. This implementation has played an important role in defining the DECK structure and behaviour. At the time of this writing, we are concluding the implementation on top of BIP and collecting some performance results and, at the same time, starting the implementation of DECK objects on top of SCI. The next step is to join both clusters and develop the RCD communication protocol.
References

1. T. Anderson, D. Culler, and D. Patterson. A case for NOW - Network of Workstations. Available by WWW at http://now.cs.berkeley.edu, Oct. 1999.
2. Beowulf. The Beowulf project. Available by WWW at http://www.beowulf.org, Jun. 1999.
3. N. Boden et al. Myrinet: A gigabit-per-second local-area network. IEEE Micro, 15(1):29-36, Feb. 1995.
4. Rajkumar Buyya. High Performance Cluster Computing. Prentice Hall PTR, Upper Saddle River, NJ, 1999.
5. Gilbert Cabillic and Isabelle Puaut. Stardust: an environment for parallel programming on networks of heterogeneous workstations. Journal of Parallel and Distributed Computing, 40:65-80, 1997.
6. B. Christiansen et al. Javelin: Internet-based parallel computing using Java. Available by WWW at http://www.cs.ucsb.edu/research/javelin/, Nov. 1999.
7. Ian Foster and Carl Kesselman. The Globus project. Available by WWW at http://www.globus.org, Jul. 1999.
8. Al Geist et al. PVM: Parallel Virtual Machine. MIT Press, Cambridge, MA, 1994.
9. GM message passing system. Available by WWW at http://www.myri.com, Nov. 1999.
10. A. Grimshaw et al. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1), Jan. 1997.
11. Kai Hwang and Zhiwei Xu. Scalable Parallel Computing: Technology, Architecture, Programming. McGraw-Hill, New York, NY, 1997.
12. IEEE. IEEE standard for Scalable Coherent Interface (SCI). IEEE 1596-1992, 1992.
13. IEEE. Information technology—portable operating system interface (POSIX), threads extension [C language]. IEEE 1003.1c-1995, 1995.
14. IEEE. Local and metropolitan area networks-supplement—media access control (MAC) parameters, physical layer, medium attachment units and repeater for 100Mb/s operation, type 100BASE-T (clauses 21-30). IEEE 802.3u-1995, 1995.
15. Java and High Performance Computing Group. The JavaNOW project. Available by WWW at http://www.jhpc.org/projects.html, Nov. 1999.
16. Steven S. Lumetta, Alan M. Mainwaring, and David E. Culler. Multi-protocol Active Messages on a cluster of SMPs. In Proc. of SuperComputing 97, 1997.
17. MPI Forum. Document for a standard message passing interface. International Journal of Supercomputer Applications and High Performance Computing Technology, 8(3/4), 1994.
18. The MultiCluster project. Available by WWW at http://wwwgppd.inf.ufrgs.br/projects/mcluster, Nov. 1999.
19. S. Pakin, M. Lauria, and A. Chien. High performance messaging on workstations: Illinois Fast Messages for Myrinet. In SuperComputing '95. IEEE Computer Society Press, 1996.
20. Michael Philippsen and Matthias Zenger. JavaParty: A distributed companion to Java. Available by WWW at http://wwwipd.ira.uka.de/JavaParty, Nov. 1999.
21. Loic Prylli and Bernard Tourancheau. BIP: A new protocol designed for high performance networking on Myrinet. In José Rolim, editor, Parallel and Distributed Processing, number 1388 in Lecture Notes in Computer Science, pages 472-485. Springer, 1998.
22. Enno Rehling. Sthreads: Multithreading for SCI clusters. In Proc. of Eleventh Symposium on Computer Architecture and High Performance Computing, Natal - RN, Brazil, 1999. Brazilian Computer Society.
23. H. Taskin. Synchronizationsoperationen für gemeinsamen Speicher in SCI-Clustern. Available by WWW at http://www.uni-paderborn.de/cs/ag-heiss/en/veroeffentlichungen.html, Aug. 1999.
24. VIA - Virtual Interface Architecture. Available by WWW at http://www.via.org, Nov. 1999.
25. Willy Zwaenepoel et al. TreadMarks distributed shared memory (DSM) system. Available by WWW at http://www.cs.rice.edu/~willy/TreadMarks/overview.html, Dec. 1998.
Parallel Information Retrieval on an SCI-Based PC-NOW

Sang-Hwa Chung, Hyuk-Chul Kwon, Kwang Ryel Ryu, Han-Kook Jang, Jin-Hyuk Kim, and Cham-Ah Choi

Division of Computer Science and Engineering, Pusan National University, Pusan, 609-735, Korea
{shchung, hckwon, krryu, hkjang, variant, cca}@hyowon.pusan.ac.kr
Abstract. This paper presents an efficient parallel information retrieval (IR) system which provides fast information service for Internet users on a low-cost high-performance PC-NOW environment. The IR system is implemented on a PC cluster based on the Scalable Coherent Interface (SCI), a powerful interconnecting mechanism for both shared memory models and message passing models. In the IR system, the inverted-index file (IIF) is partitioned into pieces using a greedy declustering algorithm and distributed to the cluster nodes to be stored on each node's hard disk. For each incoming user's query with multiple terms, terms are sent to the corresponding nodes which contain the relevant pieces of the IIF to be evaluated in parallel. According to the experiments, the IR system outperforms an MPI-based IR system using Fast Ethernet as an interconnect. Speed-up of up to 4.0 was obtained with an 8-node cluster in processing each query on a 500,000-document IIF.
1. Introduction

As more and more people are accessing the Internet and acquiring a vast amount of information easily, more people consider that the problem of information retrieval (IR) no longer resides in the lack of information, but in how quickly the right information can be chosen from that vast amount. Many of us have already experienced that some IR systems provide information service much faster than others. How fast an IR system can respond to users' queries mostly depends on the performance of the underlying hardware platform. Therefore, most of the major IR service providers have been urged to spend several hundred thousand dollars to purchase their hardware systems. However, for many small businesses on the Internet, that cost is too high. In this paper, as a cost-effective solution for this problem, a PC cluster interconnected by a high-speed network card is suggested as a platform for fast IR service. With the PC cluster, a massive digital library can be efficiently distributed to PC nodes by utilizing local hard disks. Besides, every PC node can act as an entry point to process multiple users' queries simultaneously.
It is extremely important to select a network adapter to construct a high-speed system area network (SAN). For a message passing system, the Fast Ethernet card or the Myrinet card can be used. For a distributed shared memory (DSM) system, the SCI card can be considered. Fast Ethernet, developed for LAN, is based on complicated protocol software such as TCP/IP, and its bandwidth is not high. The Myrinet [1] card is a high-speed message passing card with a maximum bandwidth of 160 Mbyte/sec. However, the network cost is relatively high because Myrinet
requires crossbar switches for the network connection. Besides, its message-passing mechanism is based on time-consuming operating system calls. For applications with frequent message passing, this can lead to performance degradation. To overcome the system call overhead, systems based on a user-level interface for message passing without intervention of the operating system have been developed. Representative systems include AM [2], FM [3], and U-Net [4]. Recently, Myrinet is also provided with a new message-passing system called GM [5], which supports user-level OS-bypass network interface access.
The SCI (Scalable Coherent Interface: ANSI/IEEE standard 1596-1992) is designed to provide a low-latency (less than 1 µs) and high-bandwidth (up to 1 Gbyte/sec) point-to-point interconnect. The SCI interconnect can assume any topology including ring and crossbar. Once fully developed, the SCI can connect up to 64K nodes. Since the SCI supports DSM models that can feature both NUMA and CC-NUMA variants, it is possible to make transparent remote memory accesses with memory read/write transactions without using explicit message passing. The performance of SCI-based systems has been proven by commercial CC-NUMA servers such as the Sequent NUMA-Q 2000 [6] and Data General's Aviion [7].
In this research, the SCI is chosen as the underlying interconnecting mechanism for clustering. The parallel IR system is implemented on an SCI-based PC cluster using a DSM programming technique. In the IR system, the inverted-index file (IIF) is partitioned into pieces using a greedy declustering algorithm and distributed to the cluster nodes to be stored on each node's hard disk. An IIF is the sorted list of terms (or keywords), with each term having links to the documents containing that term. For each incoming user's query with multiple terms, terms are sent to the corresponding nodes which contain the relevant pieces of the IIF to be evaluated in parallel. An MPI-based IR system using Fast Ethernet as an interconnect is also constructed for comparison purposes.
2. PC Cluster-based IR System

2.1 Typical IR System on Uniprocessor
Figure 1 shows the structure of a typical IR system implemented on a uniprocessor. As shown in the figure, once a user’s query with multiple terms is presented to the system, for each query term in turn the IR engine retrieves relevant information from the IIF in the hard disk. When all the information is collected, the IR engine performs necessary IR operations, scores the retrieved documents, ranks them, and sends the IR result back to the user. For the efficient parallelization of the system, it is important to find out the most time consuming part in executing the IR system. Using the sequential IR system developed previously[8], the system’s execution time is analyzed as shown in Figure 2. In the sequential system, the most time consuming part is disk access. Thus, it is necessary to parallelize disk access. This can be done by partitioning the IIF into pieces and distributing the pieces to the processing nodes in a PC cluster.
Fig. 1. A typical IR system (user interface, IR engine, and inverted-index file on disk)
Fig. 2. Execution time analysis in the sequential IR system (percentage of execution time spent in disk access, vector extraction, IR operations, and ranking)
2.2 Declustering the IIF
Most current IR systems use a very large lookup table called an inverted index file (IIF) to index the relevant documents for given query terms. Each entry of the IIF consists of a term and a list of ids of documents containing the term. Each of the document ids is tagged with the weight of the term for that document. Given a query, all the query terms are looked up in the IIF to retrieve the relevant document ids and the corresponding term weights. Next, the documents are scored based on the term weight values and then ranked before they are reported back to the user. Since our IR system processes user queries in parallel on a PC cluster, it is desirable to have the IIF appropriately declustered over the local hard disks of the processing nodes. We can achieve maximum parallelism if the declustering is done in such a way that the disk I/O and the subsequent scoring job are distributed as evenly as possible over all the processing nodes. An easy random declustering method would be to assign each of the terms (together with its list of documents) in the IIF lexicographically to each of the processing nodes in turn, repeatedly until all the terms are assigned. In this paper, we present a simple greedy declustering method which performs better than the random method. Our greedy declustering method tries to put together in the same node those terms which have a low probability of occurring simultaneously in the same query. If the terms in a query all happen to be stored in the same node, the disk I/O cannot be done in parallel, and the scoring job cannot readily be processed in parallel either. For an arbitrary pair of terms in the IIF, how can we predict the probability of their co-occurring in the same query? We conjecture that this probability has a strong correlation with the probability of their co-occurrence in the same documents. Given a pair of terms, the probability of their co-occurrence in the same documents can be obtained as the number of documents in which the two terms co-occur divided by the number of all the documents in a given document collection. We calculate this probability for every pair of terms by preprocessing the whole document collection. When the size of the document collection is very large, we can limit the calculation of the co-occurrence probabilities to those terms which are significant. The reason is that about 80% of the terms in a document collection usually exhibit only one or two occurrences in the whole collection, and such terms are unlikely to appear in user queries. Also, since the number of terms in a document collection is known to increase only logarithmically as the number of documents increases, our
method will not have much difficulty in scaling up. As more documents are added to the collection, however, re-calculation of the co-occurrence probabilities would be needed for maintenance. This would not happen frequently, though, because the statistical characteristics of a document collection do not change abruptly. In the first step of our greedy declustering algorithm, all the terms in the IIF are sorted in decreasing order of the number of documents in which each term appears. The higher this number, the more important the term is, in the sense that it is quite likely to be included in many queries. This is especially true when the queries are modified by relevance feedback [9]. Such terms also have a longer list of documents in the IIF and thus cause heavier disk I/O. Therefore, it is advantageous to store these terms in different nodes whenever possible to enhance I/O parallelism. Suppose there are n processing nodes. We assign the first n of the sorted terms to each of the n nodes in turn. For the next n terms, each term is assigned to the node which contains the term with the lowest probability of co-occurrence. From the third pass of the term assignment onwards, a term is assigned to the node for which the sum of the probabilities of co-occurrence of the term with the terms already assigned to that node is lowest. This process repeats until all the terms in the IIF are assigned (a small code sketch of this assignment pass is given below, after the system-model overview).
2.3 Parallel IR System Model
The PC cluster-based parallel IR system model is shown in Figure 3. The IR system consists of an entry node and multiple processing nodes. The participating nodes are PCs with local hard disks, connected by an SCI-based high-speed network. The working mechanism of the parallel IR system model can be explained as follows. The entry node accepts a user's query and distributes the query terms to the processing nodes (including itself) based on the declustering information described in the previous subsection. Each processing node consults the partitioned IIF using the list of query terms delivered from the entry node, and collects the necessary document list for each term from its local hard disk. Once all the necessary document lists have been collected, they are transmitted to the entry node. The entry node collects the document lists from the participating processing nodes (including itself), performs the required IR operations such as AND/OR, and ranks the selected documents according to their scores. Finally, the sorted document list is sent back to the user as the IR result.
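To make the greedy assignment pass of Section 2.2 concrete, the following is a minimal, hypothetical sketch. It assumes the pairwise co-occurrence probabilities have already been computed and that the terms arrive sorted by decreasing document frequency; all names are illustrative and do not come from the authors' implementation.

```java
import java.util.*;

/** Illustrative greedy declustering: each term (sorted by decreasing document
 *  frequency) goes to the node whose already-assigned terms have the lowest
 *  summed co-occurrence probability with it. */
public class GreedyDecluster {

    /** coOccur.get(a).get(b) = estimated probability that terms a and b
     *  co-occur in a document (assumed precomputed and symmetric). */
    static int[] assign(List<String> sortedTerms,
                        Map<String, Map<String, Double>> coOccur, int nodes) {
        List<List<String>> perNode = new ArrayList<>();
        for (int i = 0; i < nodes; i++) perNode.add(new ArrayList<>());
        int[] nodeOf = new int[sortedTerms.size()];

        for (int t = 0; t < sortedTerms.size(); t++) {
            String term = sortedTerms.get(t);
            int target;
            if (t < nodes) {            // first pass: one term per node, round-robin
                target = t;
            } else {                    // later passes: minimise summed co-occurrence
                target = 0;
                double best = Double.MAX_VALUE;
                for (int n = 0; n < nodes; n++) {
                    double sum = 0.0;
                    for (String assigned : perNode.get(n))
                        sum += coOccur.getOrDefault(term, Collections.emptyMap())
                                      .getOrDefault(assigned, 0.0);
                    if (sum < best) { best = sum; target = n; }
                }
            }
            perNode.get(target).add(term);
            nodeOf[t] = target;
        }
        return nodeOf;
    }
}
```

With a single term per node, minimising the sum is the same as minimising the single pairwise probability, so the second pass of the paper's description is covered by the same loop.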
Fig. 3. Parallel IR system model (the entry node receives user queries and returns results; query terms and document lists are exchanged with processing nodes 1-4, each holding a declustered part of the database, using the declustering information stored at the entry node)
2.4 Experimental PC Cluster System
In this research, an 8-node SCI-based PC cluster system is constructed as shown in Figure 4. Each node is a 350 MHz Pentium II PC with 128 Mbyte of main memory and a 4.3 Gbyte SCSI hard disk, running Linux kernel 2.0.36. In the cluster, any PC node can be configured as an entry node. As shown in the figure, each PC node is connected to the SCI network through Dolphin Interconnect Solutions' (DIS) PCI-SCI bridge card. There are 4 rings in the network, with 2 nodes in each ring. The rings are interconnected by DIS's 4×4 SCI switch. For DSM programming, DIS's SISCI (Software Infrastructure for SCI) API [10] is used. With this configuration, the maximum point-to-point bulk transfer rate obtained is approximately 80 Mbyte/sec.
Fig. 4. SCI-based 8-node PC cluster system (four SCI rings of two nodes each, interconnected by an SCI switch; in each node, the PCI-SCI bridge card sits on the PCI bus next to the CPU and main memory)
For comparison purposes, an 8-node Fast Ethernet-based PC cluster system is also constructed. Each PC node has the same configuration as a node of the SCI cluster, except that a PCI Fast Ethernet adapter is used for networking. A switching hub is used to interconnect the PC nodes in the cluster. For message-passing programming, MPICH 1.1.1 [11] is used. In this case, the maximum point-to-point bulk transfer rate obtained is approximately 10 Mbyte/sec.
2.5 SCI-based DSM Programming
The SCI interconnect mechanism supports DSM programming. By using SISCI, a node in the SCI-based PC cluster can establish a mapping between its local memory address space and a remote node's memory address space. Once the mapping is established, the local node can access the remote node's memory directly. In DSM programming, the communication between PC nodes in the cluster is done using remote read and remote write transactions instead of message passing. These remote read/write transactions are carried out using the remote read/write functions provided by SISCI. In the actual coding of the IR program, most of the remote memory transactions are implemented using the remote write function, because the remote write function performs about 10 times faster than the remote read function on the DIS PCI-SCI bridge card.
3. Performance of the PC Cluster-based IR System
3.1 Performance Comparison between the SCI-based System and the MPI-based System
In this experiment, average query processing times are measured for the 8-node SCI-based system, the 8-node MPI-based system, and a single-node system. The IIF is constructed from 100,000 documents collected from the articles of a newspaper. A user's query consists of 24 terms. Each query is made to contain a rather large number of terms because queries modified by relevance feedback usually have that many terms. The IIF is randomly declustered and stored on each processing node's local disk. As shown in Table 1, the disk access time is reduced for both the SCI-based system and the MPI-based system when compared with the single-node system. However, the MPI-based system is worse than the single-node system in total query processing time because of its communication overhead. The SCI-based system has much less communication overhead than the MPI-based system and performs better than the single-node system. The speed-up improves with the further optimizations presented in the following subsections.
Table 1. Query processing times of the 8-node SCI-based system and the 8-node MPI-based system (unit: sec)
                        SCI-based system   MPI-based system   Single-node system
Send query term         0.0100             0.0251             0
Receive document list   0.0839             0.2097             0
Disk access             0.0683             0.0683             0.2730
IR operation            0.0468             0.0468             0.0468
Total                   0.2091             0.3500             0.3198
3.2 Effect of Declustering the IIF
The greedy declustering method is compared with the random method on a test set consisting of 500 queries, each containing 24 terms. To generate the test queries, we randomly sampled 500 documents from a document collection containing 500,000 newspaper articles. From each document, the 24 most important terms are selected to make a query. The importance of a term in a document is judged by the value tf × idf, where tf is the term's frequency in that document and idf is the so-called inverse document frequency. The inverse document frequency is given by log2(N/n) + 1, where N is the total number of documents in the collection and n is the number of documents containing the term. Therefore, a term in a document is considered important if its frequency in that document is high enough, but at the same time it does not appear in too many other documents. Table 2 shows the experimental results comparing the random declustering and the greedy declustering methods using those 500 queries on our 500,000-document collection.
Table 2. Comparison of random declustering and greedy declustering (unit: sec)

                                                    Random declustering   Greedy declustering
Average query processing time                      0.5725                0.5384
Accumulated query processing time for 500 queries  286.2534              269.1919
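As an illustration of the query-generation rule described above (selecting the 24 terms with the highest tf × idf value from a sampled document, with idf = log2(N/n) + 1), a minimal sketch follows. The term-frequency and document-frequency maps and all names are hypothetical; this is not the authors' code.

```java
import java.util.*;
import java.util.stream.Collectors;

/** Illustrative tf x idf term selection, with idf = log2(N/n) + 1. */
public class QueryTermSelector {

    static List<String> topTerms(Map<String, Integer> tfInDoc,   // term -> tf in this document
                                 Map<String, Integer> docFreq,   // term -> n (docs containing term)
                                 long totalDocs,                 // N
                                 int howMany) {                  // e.g. 24
        return tfInDoc.entrySet().stream()
            .sorted((a, b) -> Double.compare(weight(b, docFreq, totalDocs),
                                             weight(a, docFreq, totalDocs)))
            .limit(howMany)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    private static double weight(Map.Entry<String, Integer> e,
                                 Map<String, Integer> docFreq, long totalDocs) {
        int n = docFreq.getOrDefault(e.getKey(), 1);
        double idf = Math.log((double) totalDocs / n) / Math.log(2) + 1.0;
        return e.getValue() * idf;                // tf x idf
    }
}
```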
3.3 Performance with Various-sized IIFs
In this subsection, the performance of the SCI-based parallel IR system is analyzed with the number of documents increased up to 500,000. These documents are collected from a daily newspaper; 500,000 documents correspond to the collection of the daily newspaper's articles over 7 years. The size of the IIF increases proportionally with the number of documents: for example, the IIF is 300 Mbytes for 100,000 documents and 1.5 Gbytes for 500,000 documents. The 8-node PC cluster and the greedy declustering method are used for the experiment. The experimental result is presented in Figure 5. It takes 0.1805 seconds to process a single query with the 100,000-document IIF, 0.2536 seconds with the 200,000-document IIF, and 0.5398 seconds with the 500,000-document IIF. As the IIF size increases, the document list for each query term becomes longer, and the time spent on IR operations (AND/OR operations) increases considerably. As a result, the IR operation eventually takes more time than the disk access and becomes the major bottleneck.
Fig. 5. IIF size (100,000 to 500,000 documents) vs. query processing time, broken down into IR operation, disk access, and communication (send query terms + receive document lists)
3.4 Reducing the IR Operation Time
As presented in the previous subsection, the IR operation time turns out to be a new overhead as the IIF size increases. In the IR system, the AND/OR operations are performed by the entry node after all the necessary document lists have been collected from the processing nodes. However, it is possible to perform the AND/OR operations partially, on the document lists collected in each processing node, so that each processing node transmits only the partial result to the entry node. This helps reduce not only the IR operation time but also the communication time. The performance of the improved system in comparison with the original system is shown in Figure 6. In the experiment, the 8-node PC cluster, the greedy declustering method, and the 500,000-document IIF are used. In the original system, the IR operation takes 0.2873 seconds, which is more than 53% of the total query processing time. In the improved system, however, the IR operation takes only 0.1035 seconds, which is about 35% of the total time. Thus, the IR operation again takes less time than the disk access. The communication time is also reduced from 0.1128 seconds to 0.0500 seconds, and the total time is reduced to almost half of that of the original system.
Fig. 6. Query processing time with reduced IR operation time (original vs. improved system, broken down into IR operation, disk access, and communication)
Figure 7 shows the speed-up of the parallel IR system. The maximum speed-up obtained with the 8-node system, compared with the single-node system, is 4.0. As shown in the figure, the speed-up of the parallel IR system saturates rapidly from the 4-node system onwards. As the number of processing nodes in the system increases, the disk access time is reduced because the average number of query terms assigned to each node decreases. (The disk access time includes the time spent on the partial AND/OR operations in the processing nodes.) However, the IR operation time and the communication time increase as the number of document lists transmitted to the entry node grows, and this attenuates the overall speed-up. The problem may be alleviated by the following idea. Instead of sending all the document lists to the entry node, intermediate nodes can be utilized to merge the document lists by performing AND/OR operations in advance, as shown in Figure 8. The entry node then finally handles only two document lists. This will help in reducing both the IR operation time and the communication time. Experiments need to be performed to verify this idea.
Fig. 7. Number of processing nodes (1, 2, 4, 6, 8) vs. query processing time, broken down into IR operation, disk access, and communication
Fig. 8. Merging document lists in intermediate nodes (nodes 1-8 merge their lists pairwise in a tree, so that the entry node, node 1, finally combines only two lists)
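The merging scheme of Figure 8 can be sketched as a pairwise (tree) reduction over sorted document-id lists; the AND case, i.e. the intersection of two sorted lists, is shown below. This is only an illustration of the idea and does not reflect the authors' implementation.

```java
import java.util.*;

/** Illustrative pairwise (tree) merging of sorted document-id lists with AND semantics. */
public class DocListMerger {

    /** Intersection of two ascending lists of document ids (the AND operation). */
    static int[] and(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, k = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { out[k++] = a[i]; i++; j++; }
            else if (a[i] < b[j]) i++;
            else j++;
        }
        return Arrays.copyOf(out, k);
    }

    /** Tree reduction: the lists held by the nodes are merged pairwise, round by
     *  round, so the entry node finally combines only two lists (cf. Figure 8). */
    static int[] reduceAnd(List<int[]> listsPerNode) {
        List<int[]> level = new ArrayList<>(listsPerNode);
        while (level.size() > 1) {
            List<int[]> next = new ArrayList<>();
            for (int i = 0; i + 1 < level.size(); i += 2)
                next.add(and(level.get(i), level.get(i + 1)));
            if (level.size() % 2 == 1) next.add(level.get(level.size() - 1));
            level = next;
        }
        return level.get(0);
    }
}
```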
4. Conclusions
In this paper, an SCI-based PC cluster system is proposed as a cost-effective solution for fast IR service. In the parallel IR system developed on the PC cluster, the IIF is partitioned into pieces using a greedy declustering algorithm and distributed to the cluster nodes to be stored on each node's hard disk. For each incoming user query with multiple terms, the terms are sent to the corresponding nodes, which contain the relevant pieces of the IIF, to be evaluated in parallel. The IR system is developed using a DSM programming technique based on SCI. According to the experiments, the IR system outperforms an MPI-based IR system using Fast Ethernet as an interconnect. A speed-up of 4.0 was obtained with the 8-node cluster in processing each query on a
500,000-document IIF. Currently, the parallel IR system has a single entry node. In future research, a PC cluster-based IR system with multiple entry nodes will be developed, in which each processing node in the cluster can act as an entry node to process multiple users' queries simultaneously. This will help improve both the IR system's utilization and its throughput. With more research effort, we hope this model will evolve into a practical solution for low-cost, high-performance IR service on the Internet.
References
1. N. Boden et al., "Myrinet: A Gigabit per Second Local Area Network", IEEE Micro, Vol. 15, No. 1, February 1995, pp. 29-36.
2. T. von Eicken, D. Culler, et al., "Active Messages: a Mechanism for Integrated Communication and Computation", 1992.
3. S. Pakin, V. Karamcheti, and A. Chien, "Fast Messages (FM): Efficient, Portable Communication for Workstation Clusters and Massively-Parallel Processors", IEEE Concurrency, Vol. 5, No. 2, April-June 1997, pp. 60-73.
4. A. Basu, V. Buch, W. Vogels, and T. von Eicken, "U-Net: A User-Level Network Interface for Parallel and Distributed Computing", Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP), Copper Mountain, Colorado, December 3-6, 1995.
5. GM documentation, http://www.myri.com/GM/doc/gm_toc.html
6. "NUMA-Q: An SCI-based Enterprise Server", http://www.sequent.com/products/highend_srv/sci_wp1.html
7. "SCI Interconnect Chipset and Adapter: Building Large Scale Enterprise Servers with Pentium Pro SHV Nodes", http://www.dg.com/about/html/sci_interconnect_chipset_and_a.html
8. S.H. Park and H.C. Kwon, "An Improved Relevance Feedback for Korean Information Retrieval System", Proc. of the 16th IASTED International Conf. on Applied Informatics, IASTED/ACTA Press, pp. 65-68, Garmisch-Partenkirchen, Germany, February 23-25, 1998.
9. G. Salton and C. Buckley, "Improving Retrieval Performance by Relevance Feedback", Journal of the American Society for Information Science, 41(4), pp. 288-297, 1990.
10. SISCI software for Linux, http://www.dolphinics.no/customer/software/linux/index.html
11. "A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard", http://www-unix.mcs.anl.gov/mpi/mpich/docs.html
A PC-NOW Based Parallel Extension for a Sequential DBMS
Matthieu Exbrayat and Lionel Brunie
Laboratoire d'Ingenierie des Systemes d'Information
Institut National des Sciences Appliquees, Lyon, France
[email protected], [email protected]

Abstract. In this paper we study the use of networks of PCs to handle the parallel execution of relational database queries. This approach is based on a parallel extension, called a parallel relational query evaluator, working in a coupled mode with a sequential DBMS. We present a detailed architecture of the parallel query evaluator and introduce Enkidu, the efficient Java-based prototype that has been built according to our concepts. We expose a set of measurements, conducted over Enkidu, highlighting its performance. We finally discuss the interest and viability of the concept of a parallel extension in the context of relational databases and in the wider context of high-performance computing.

Keywords: Networks of workstations, Parallel DBMS, Java
1 Introduction
Parallelizing Database Management Systems (DBMSs) has been a flourishing field of research for the last fifteen years. Research, experiments and development have been conducted according to three main goals. The first one is to accelerate heavy operations, such as queries involving the confrontation of huge amounts of data (by parallelizing elementary operations over the nodes and distributing data among the disks, i.e. I/O parallelism). The second one is to support a growing number of concurrent users (by dispatching connections and queries among the processors). The third goal is to offer a high level of fault tolerance, and therefore to guarantee the availability of data, for instance in the context of intensive commercial transactions (e.g. by using RAID techniques). The very first parallel DBMSs (PDBMSs) were based on specific machines, such as Gamma [1] and the Teradata Parallel Database Machine [2]. The next logical step appeared in the middle of the 90's, with such PDBMSs as Informix On Line XPS [3], IBM DB2 Parallel Edition [4] and Oracle 7 Parallel Server [5], which were designed to work on standard (parallel) machines. Some of these systems (e.g. Informix) were defined as running on "networks of workstations". Nevertheless, this definition was quite misleading, as they were mainly designed to work on high-end architectures, such as the IBM SP2 machine. The very latest developments, like Oracle 8 Parallel Server [6], take advantage of recent cluster architectures and partially hide the management of parallelism (the administrator only has to define the list of nodes and disks to be used). It is in fact
noticeable that the use of a network of PCs to support a PDBMS has been poorly studied. We can cite Midas [7] (a parallel port of a sequential DBMS to a LAN of PCs) and the 100-node PC cluster database of [8] (developed from scratch). Nevertheless, while the very large majority of studies and products consist in fully porting sequential DBMSs to parallel architectures, we believe that networks of PCs can lead to a new approach to DBMS parallelization, which considers the network of PCs as a parallel extension of an existing sequential DBMS. This extension, named coupled query evaluator, consists of a parallel execution component (on the network of PCs) which works together with a sequential DBMS, in order to offer both high performance for query evaluation (on the parallel component) and coherency for data creation and modification (on the sequential DBMS). In section 2, we detail the architecture of our proposal. Its implementation is then introduced in section 3. In section 4 we present some measurements conducted over our prototype. In section 5 we discuss the relevance and impact of the concept of a parallel extension. Finally, in section 6 we present some application domains of our extension.
2 Architecture

Fig. 1. Extension's basic phases (launch server and calculators; extract and distribute data; evaluate queries; stop calculators and server)

Fig. 2. General overview (clients either access the DBMS classically or submit queries through the extension; the server accesses the source data in the DBMS, extracts and distributes it to the calculators, which evaluate read-only queries in parallel over the distributed data)

2.1 General Overview
The coupled query evaluator works through two successive phases (see Fig. 1). First, data is extracted from the pre-existing relational DBMS and distributed among a network of workstations. Second, this distribution is used for the parallel processing of relational queries. The overall architecture consists of two main components (see Fig. 2): the server and the calculators. The server is the access point. All tasks are submitted to and treated by it. This server is connected to several calculators, which are
in charge of storing and processing the redistributed data. In our architecture we assume that only one component, i.e. one calculator or the server, is running on each station (we must underline that such a choice does not bring any limitation, for instance on an SMP station, since a single calculator can handle several computing threads; see section 3.3).
Fig. 3. Server module (interface, SQL analyser, parallel execution optimizer, parallel execution manager, redistribution manager, load manager, result manager and communications; the figure distinguishes circuit A, data distribution; circuit B, load information; and circuit C, query execution)

Fig. 4. Calculator module (communications, scheduling, storage and computation modules)

2.2 The Server Module
The server module (see Fig. 3) consists of eight components in charge of data distribution (circuit A), collection of load information (circuit B) and parallel query execution (circuit C). Data distribution is done through the redistribution manager (A1), which extracts the requested data from the DBMS. The extracted data is sent to the calculators through the communication module (A2). The redistribution parameters are then stored by the parallel execution optimizer (A3). Processor load information is regularly sent by each calculator (B1 and B2). Distribution and load information is used by the parallel execution optimizer in order to determine the best-suited location for each operation. Query execution is triggered by submitting an SQL query through the interface. This query is translated into an internal format by the SQL analyzer (C1). This raw parallel execution plan (PEP) is then improved by the parallel execution optimizer (C2). The optimized PEP (C3) consists of basic (elementary) operators connected by flows of data and pre- and post-conditions, e.g. scheduling decisions [9]. The parallel execution manager analyses the PEP so that each calculator only receives the operators which execute on it (C4). The parallel execution manager receives (C5) processing information (e.g. end of an operator). Resulting tuples are grouped and stored by the result manager (C6), and then returned to the user (C7).
2.3 The Calculator Module
The calculator module consists of five components (see Fig. 4). The communication module is similar to that of the server module. It allows networking
with the server and with all other calculators. Incoming data is transmitted to and stored by the storage module. Incoming instructions are transmitted to the computation module in the order determined by the scheduling module. Intermediate results that will be used locally are transmitted to the storage module, while other results are sent to other calculators (intermediate results) or to the server (final results). Execution messages are also sent to the server at the end of each operator. Finally, calculators can handle administration messages (e.g. suppression of relations, shutdown of the calculator).
3 Prototyping

3.1 General Overview
Based on the architecture above, we have developed a complete prototype, named Enkidu, written in Java owing to the robustness and portability of this language. Enkidu is a RAM-based parallel query evaluator which offers various distribution and execution strategies. It can be used with real data (downloaded from an existing database or from a storage file) or with self-generated data (according to given skew parameters). Thanks to its Java implementation, Enkidu has already been used under Solaris, Linux and Windows 95.
3.2 Implementation of the Server Module
The server module mainly consists of Java code. Nevertheless, the MPO parallel execution plan optimizer [10], an external component developed in C, is currently being integrated through the Java Native Interface [11]. The server module can simulate concurrent users. This is rather important, since the large majority of existing academic PDBMS prototypes do not really care about concurrent queries (though DBMSs are generally supposed to support and optimize the combined load of concurrent users). Data extraction is done by the server through the JDBC interface. Enkidu first loads the data dictionary; then the administrator can distribute the data. Extraction is done with SQL "SELECT" statements, due to the portability and ease of use of this method (see also section 5.1).
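A minimal sketch of such an extraction step is shown below: a relation is read tuple by tuple through plain JDBC with a SELECT statement. The connection parameters and the relation name are placeholders, not the ones used by Enkidu.

```java
import java.sql.*;
import java.util.*;

/** Illustrative JDBC-based extraction of one relation with a plain SELECT. */
public class Extractor {

    static List<Object[]> extract(String url, String user, String password,
                                  String relation) throws SQLException {
        List<Object[]> tuples = new ArrayList<>();
        try (Connection con = DriverManager.getConnection(url, user, password);
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT * FROM " + relation)) {
            int width = rs.getMetaData().getColumnCount();
            while (rs.next()) {
                Object[] tuple = new Object[width];
                for (int c = 1; c <= width; c++) tuple[c - 1] = rs.getObject(c);
                tuples.add(tuple);           // later distributed to the calculators
            }
        }
        return tuples;
    }
}
```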
3.3 Implementation of the Calculator Module
The calculator module is pure Java software. The computation module is multithreaded: several computation threads work on different operators. Their priority is determined according to the precedence of queries and operators. The thread with the highest priority runs as long as input data remains available. If no more data is temporarily available (in a pipelined mode), secondary-priority threads can start working (i.e. no time is lost waiting). Thread switching is
limited by using a coarse grain of treatment: tuples are grouped in packets, and a computation thread cannot be interrupted until it finishes its current packet. Thread switching is based on a gentlemen's agreement (i.e. when a packet has been computed, the current thread lets another one start, in its priority level or on an upper level if one exists). This multi-threaded approach offers direct gains (optimized workload), and could also be useful in the context of an SMP machine, as threads could be distributed amongst nodes. With such a hardware architecture, a single calculator module could handle the whole SMP: storage and I/O management would be handled on a single node, while other nodes would only run one (or some) computation thread(s). We must also highlight the fact that our calculators have been designed to store data within RAM. Disks remain unused in order to avoid I/O overcosts. While this choice limits the volume of data that can be extracted and distributed, we must notice that the parallel extension is supposed to be an intermediate solution between sequential DBMSs and PDBMSs. Thus, we can argue that the volume of data should remain reasonable (some GBytes at most).
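The packet-grained, priority-driven switching described above can be illustrated with a small scheduler built on standard Java classes. This is a sketch under our own assumptions (packet granularity, numeric priorities), not the Enkidu code.

```java
import java.util.concurrent.*;

/** Illustrative packet-grained operator scheduling: each task processes one
 *  packet of tuples and is only preempted between packets. */
public class PacketScheduler {

    /** A packet of tuples together with the priority of its operator
     *  (smaller value = higher priority). */
    static class Packet implements Comparable<Packet> {
        final int priority;
        final Runnable work;                 // processing of one packet of tuples
        Packet(int priority, Runnable work) { this.priority = priority; this.work = work; }
        public int compareTo(Packet o) { return Integer.compare(priority, o.priority); }
    }

    private final PriorityBlockingQueue<Packet> queue = new PriorityBlockingQueue<>();
    private final ExecutorService workers;

    PacketScheduler(int threads) {
        workers = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++)
            workers.submit(() -> {
                try {
                    while (!Thread.currentThread().isInterrupted())
                        queue.take().work.run();     // a packet is never interrupted mid-way
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
    }

    void submit(int priority, Runnable packetWork) { queue.add(new Packet(priority, packetWork)); }

    void shutdown() { workers.shutdownNow(); }
}
```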
3.4 Communication Issues
We chose to work at the level of Java sockets, owing to their ease of use, and also because existing Java ports of MPI did not offer satisfying performance. The main problem we met concerned serialization. Serialization is a major concept of Java, which consists in automatically transforming objects into byte vectors and vice versa. Thus, objects can be directly put on a flow (between two processes, or between a process and a file). The generic serialization mechanism is powerful, as it also stores the structure of objects within the byte vector and thus guarantees file readability across applications. Nevertheless, this structural data is quite heavy and introduces tremendous overcosts in the context of NOW computing. For this reason we chose to develop a specific light serialization, which only serializes the data. This approach is quite similar to that of [12], and both methods should be compared in a forthcoming paper.
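The contrast between the generic mechanism and a "light" serialization can be sketched as follows: instead of ObjectOutputStream, which also encodes class structure, only the field values of a fixed-layout tuple are written with a DataOutputStream. The tuple layout shown is hypothetical.

```java
import java.io.*;

/** Illustrative "light" serialization: only the values of a fixed-layout tuple
 *  are written, with no class structure, in contrast to ObjectOutputStream. */
public class LightTupleCodec {

    static void write(DataOutputStream out, int id, double value, String name) throws IOException {
        out.writeInt(id);
        out.writeDouble(value);
        out.writeUTF(name);
    }

    static Object[] read(DataInputStream in) throws IOException {
        return new Object[] { in.readInt(), in.readDouble(), in.readUTF() };
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        write(new DataOutputStream(buf), 42, 3.14, "example");
        Object[] tuple = read(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(tuple.length + " fields, " + buf.size() + " bytes on the wire");
    }
}
```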
4 Current Performance of the Extension Prototype

4.1 Underlying Hardware
Enkidu is currently tested over the PopC machine. PopC is an integrated network of PCs developed by Matra Systems & Information as a prototype of their Peakserver cluster [13]. It consists of 12 Pentium Pro processors running under Linux, with 64 MByte of memory each, connected by both an Ethernet and a Myrinet [14] network. The PopC machine is a computing server which is classically used as a testbed for low- and high-level components (Myrinet optimization, thread migration, parallel programs, etc.). In the following tests we use the Ethernet network, as it corresponds to the basic standard LAN of an average mid-size company. We are currently studying a Myrinet-optimized interface. Simultaneous users are simulated by threads running concurrently on the server. To obtain reliable values, each test has been run at least ten times.
4.2 Speed-up
We realized several speed-up tests over our prototype. The one presented in this paper consists of a single hash join involving two relations (100 and 100,000 tuples). We ran these tests with 1, 5 and 10 simultaneous users. We can see in Figure 5 that Enkidu offers the linear speed-ups expected with a hash join. In this figure, speed-ups even seem to be "super-linear". This comes both from the structure of the hash-join algorithm and from the fact that networking delays between the server and the calculators are included in our measurements. For this second reason, multi-user tests offer better speed-ups, as networking delays are overlapped by computation.
Fig. 5. Speed-up measures (speed-up vs. number of stations, 1 to 8, for 1, 5 and 10 simultaneous users, compared with linear speed-up)

4.3 Real Database Tests

Fig. 6. Enkidu vs. Oracle (mean execution time in seconds vs. number of simultaneous users, for Oracle, Enkidu on 1 node and Enkidu on 6 nodes)

Fig. 7. Details of Enkidu execution time (mean execution time in seconds vs. number of simultaneous users, for Enkidu on 1 to 6 nodes)
As a good speed-up could hide poor absolute performance, we also compared Enkidu to a real DBMS. The following tests are based on a basic medicine database: the Claude Bernard data bank [15]. The relatively small size of the latter (some MBytes) is counterbalanced by the lack of indexes (in order to simulate non-pre-optimized queries). We ran our tests both on Enkidu and on our source DBMS (Oracle 7 on a Bull Estrella: PowerPC, 64 MByte RAM, AIX). The set of queries consisted in retrieving the names of the medicines containing a given chemistry, for 1 to 20 concurrent users. Figures 6 and 7 highlight the good performance of Enkidu. The indicated time is the global response time divided by the number of users. In the context of this test, we can notice that, due to the limited size of the database, the observed speed-up is not linear (around 1.7 for 2 machines and 2.3 for 5 machines), as communication and initialization are not negligible compared to computation time.

Extended Database. As the first database was quite small, we conducted a similar test with an extended database (10 times bigger for chemistries and medicines, and 100 times bigger for the links between chemistries and medicines). As our prototype only uses RAM, we could not run this test on a single-node configuration, due to the amount of hashed data and intermediate results (performance would have suffered from the resulting swap); thus we used a 6-node configuration. Concerning the Oracle platform, we only ran one user, due to both the need for very big temporary files and the resulting swap overcosts. We can see in Table 1 that using only RAM allows Enkidu (6 nodes) to compute nearly 50 times faster than Oracle. Considering the ratio of computation time to number of nodes, Enkidu remains 8 times faster.

Table 1. Comparison of computation time between Oracle 7 and Enkidu

System            # of users   global exec. time (s)   mean exec. time (s)   mean exec. time (s) × # of nodes
Oracle (1 node)   1            350                     350                   350
Enkidu (6 nodes)  1            7.4                     7.4                   44.1
Enkidu (6 nodes)  5            36.3                    7.3                   44.6
Enkidu (6 nodes)  10           72.7                    7.3                   43.6
Enkidu (6 nodes)  15           120.1                   8.0                   48.0
Enkidu (6 nodes)  20           143.1                   7.2                   42.9
5 Discussion

5.1 Parallel Extension vs. Porting
Providing a parallel extension constitutes a specialized alternative to parallel porting. We can especially notice the differences along the following axes:
– data storage and access: within the extension, data is loaded from a remote system (the DBMS) and stored in main memory; within a PDBMS, data is stored on the local disks, from which it is accessed as needed;
– transaction management: the extension does not directly offer transaction management, and updates are limited.
Porting does effectively offer a complete solution, with no data load delays. Updates are automatically and relatively simply managed. Nevertheless, we see several drawbacks inherently linked to such parallelization:
– development time: a complete port to parallel architectures is a heavy task, while the extension can be developed in a much faster way, due to its intentionally limited functions;
– persistence of the parallel components: a PDBMS, once initialized, uses a set of disks in a permanent manner. In the context of a network of PCs, this means that a given set of machines is dedicated to the DBMS.
As updates are managed by the DBMS, the extension must regularly refresh its data. Since we mostly work with off-line applications, updates can be delayed, as long as their frequency offers a sufficient "freshness" of the data (e.g. once a day). We propose to re-extract the data rather than use an incremental update (which would need extra development, especially if triggers are used by the sequential DBMS, as tracking updates is then much more difficult). As an example, extracting and distributing the extended database of section 4.3 is done in less than 15 minutes: about 10 minutes for extracting the data (from Oracle to the server) and 1.5 minutes for distributing it (from the server to the calculators). Technically speaking, we use a temporary file in order to handle these two phases independently. Thus, calculators are only locked during distribution.
5.2 Toward a Generalization of the Parallel Extension Concept
The parallel extension concept could fit in a wider perspective of high-performance processing. In effect, many similarities exist between our extension and recent developments, for instance in the field of scientific computing, that extend sequential applications by parallelizing some of their algorithms. Various applications, such as numerical simulation and imaging, could use sequential components during data input and light operations, and could benefit from parallel components during heavy computations. Concerning numerical simulation, parallel computing can be used for heavy computation, such as crash simulation or fluid mechanics, while designing the structure is usually done in a sequential way. Considering image computing, the input and annotation of pictures would be done sequentially, while image fusion or filtering can benefit from parallel algorithms. In a more general way, a parallel extension could be used whenever a piece of software alternately executes light and heavy treatments.
6 Context
As our extension appears to be mainly interesting in the context of "read-mostly" applications, we now give three examples of such applications: decision support systems, data mining and multimedia databases. Within decision support systems (DSS), data are frequently manipulated off-line (i.e. data generation and manipulation are two distinct and independent
tasks). Thus, an extension can be used. As an example we will cite the well-known TPC-R and TPC-H benchmarks. TPC-R [16] is a business reporting benchmark, adapted to repetitive queries, concerning a large amount (1 TByte) of data, and thus oriented toward the (static) optimization of data structures and queries. TPC-H [17] works on ad-hoc queries, i.e. various and theoretically non-predictable queries. It can thus be used with various dataset sizes (from 1 GByte to 1 TByte). Our system can a priori be used in both cases, and at least with TPC-H, as small amounts of data can be manipulated. Although our current implementation is not adapted to large datasets (data is stored in RAM), it could anyway work on databases ranging from 1 to 10 GByte, by using a cluster of PCs with enough memory (e.g. 6 to 10 PCs, each having 256 MByte of memory, could easily handle a 1 GByte database). Of course, we could also implement some existing algorithms that use disks to store temporary data, such as the well-known hybrid hash-join algorithm [18]. Concerning data mining, our extension could at least be used during preprocessing phases, in order to provide fast and repetitive access to the source data. It could even be used during processing, since [19] showed that knowledge extraction can also be done through a classical DBMS using SQL. Concerning multimedia databases, both academic researchers [20] and industrial software developers [21, 22] are deeply involved in delivering multimedia DBMSs. The read-mostly nature of such databases is obvious. For instance, the Medical Knowledge Bank project [23], and especially its initial and continuing medical education section, mainly involves read-only accesses.
7 Summary
In this paper we proposed and discussed the use of a parallel extension in the context of Database Management Systems, and we presented the prototype we built according to our proposal. Our tests showed that this extension is a valuable alternative to the classical parallel porting of DBMSs, especially in the context of read-mostly applications. Future work will follow two main goals: getting even better performance and developing specifically adapted algorithms. From the performance point of view, we plan to develop high-performance (Myrinet-based) Java components. We also wish to upgrade our packet-based techniques toward a real macro-pipelining approach. Finally, we are trying to obtain faster extraction and distribution algorithms. From the applications point of view, we are currently studying some multimedia and information retrieval algorithms working over our architecture. Another important direction of research, from our point of view, consists in testing the concept of a parallel extension in various fields, and in proposing a global and generic definition for it.
References
1. D. DeWitt, S. Ghandeharizadeh, D. Schneider, et al., "The Gamma Database Machine Project," IEEE TKDE, vol. 2, pp. 44-62, Mar. 1990.
2. J. Page, "A Study of a Parallel Database Machine and its Performance: the NCR/Teradata DBC/1012," in Proceedings of the 10th BNCOD Conference, (Aberdeen, Scotland), pp. 115-137, July 1992.
3. B. Gerber, "Informix On Line XPS," in Proceedings of ACM SIGMOD '95, vol. 24 of SIGMOD Record, (San Jose, CA, USA), p. 463, May 1995.
4. C. Baru, G. Fecteau, A. Goya, et al., "DB2 Parallel Edition," IBM Systems Journal, vol. 34, no. 2, pp. 292-322, 1995.
5. R. Bamford, D. Butler, B. Klots, et al., "Architecture of Oracle Parallel Server," in Proceedings of VLDB '98, (New York City, NY, USA), pp. 669-670, Aug. 1998.
6. Oracle, "Oracle Parallel Server: Solutions for Mission Critical Computing," tech. rep., Oracle Corp., Redwood Shores, CA, Feb. 1999.
7. G. Bozas, M. Jaedicke, A. Listl, et al., "On Transforming a Sequential SQL-DBMS into a Parallel One: First Results and Experiences of the MIDAS Project," in Euro-Par '96, (Lyon), pp. 881-886, Aug. 1996.
8. T. Tamura, M. Oguchi, and M. Kitsuregawa, "Parallel Database Processing on a 100 Node PC Cluster: Cases for Decision Support Query Processing and Data Mining," in SC '97, 1997.
9. L. Brunie and H. Kosch, "Optimizing Complex Decision Support Queries for Parallel Execution," in PDPTA '97, (Las Vegas, NV, USA), July 1997.
10. L. Brunie and H. Kosch, "ModParOpt: a Modular Query Optimizer for Multi-query Parallel Databases," in ADBIS '97, (St. Petersburg, Russia), 1997.
11. S. Liang, The Java Native Interface: Programmer's Guide and Specification. Java Series, Addison-Wesley, June 1999.
12. M. Philippsen and B. Haumacher, "More Efficient Object Serialization," in International Workshop on Java for Parallel and Distributed Computing, (San Juan, Puerto Rico, USA), Apr. 1999.
13. Matra SI, "Peakserver, the Information Server," available on the Internet, 1999.
14. N. Boden, D. Cohen, R. Felderman, et al., "Myrinet - A Gigabit-per-Second Local-Area Network," IEEE Micro, vol. 15, pp. 29-36, 1995.
15. A. Flory, C. Paultre, and C. Veilleraud, "A Relational Databank to Aid in the Dispensing of Medicines," in MEDINFO '83, (Amsterdam), pp. 152-155, 1983.
16. TPC, TPC Benchmark R (Decision Support) Standard Specification. San Jose, CA: Transaction Processing Performance Council, Feb. 1999.
17. TPC, TPC Benchmark H (Decision Support) Standard Specification. San Jose, CA: Transaction Processing Performance Council, June 1999.
18. D. Schneider and D. DeWitt, "A Performance Evaluation of Four Parallel Join Algorithms in a Shared-Nothing Multiprocessor Environment," in Proceedings of ACM SIGMOD '89, (Portland, Oregon, USA), pp. 110-121, June 1989.
19. I. Pramudiono, T. Shintani, T. Tamura, et al., "Mining Generalized Association Rules Using Parallel RDB Engine on PC Cluster," in DaWaK '99, (Florence, Italy), pp. 281-292, Sept. 1999.
20. H. Ishikawa, K. Kubota, Y. Noguchi, et al., "Document Warehousing Based on a Multimedia Database System," in ICDE '99, (Sydney, Australia), pp. 168-173, Mar. 1999.
21. Oracle, "Oracle interMedia: Managing Multimedia Content," tech. rep., Oracle Corp., Redwood Shores, CA, Feb. 1999.
22. Informix, "Informix Media 360," tech. rep., Informix, Menlo Park, CA, Aug. 1999.
23. W. Sterling, "The Medical Knowledge Bank: A Multimedia Database Application," NCR Technical Journal, Aug. 1993.
Workshop on Advances in Parallel and Distributed Computational Models

In recent years, new parallel and distributed computational models have been proposed in the literature, reflecting advances in new computational devices and environments such as optical interconnects, FPGA devices, networks of workstations, radio communications, DNA computing, quantum computing, etc. New algorithmic techniques and paradigms have recently been developed for these new models. The main goal of this workshop is to provide a timely forum for the dissemination and exchange of new ideas, techniques and research in the field of the new parallel and distributed computational models. The workshop will bring together researchers and practitioners interested in all aspects of parallel and distributed computing taken in an inclusive, rather than exclusive, sense.

Workshop Chair:
Oscar H. Ibarra (University of California Santa Barbara)
Program Co-Chairs:
Koji Nakano (Nagoya Institute of Technology), Stephan Olariu (Old Dominion University)
Steering Committee
Narsingh Deo (University of Central Florida, USA), Joseph JaJa (University of Maryland, USA), Ernst W. Mayr (Technical University Munich, Germany), Lionel Ni (Michigan State University, USA), Sartaj Sahni (University of Florida, USA), Behrooz Shirazi (University of Texas, USA), Peter Widmayer (ETH Zurich, Switzerland)
Program Committee
Jik Hyun Chang (Sogang University, Korea), Chuzo Iwamoto (Hiroshima University, Japan), Omer Egecioglu (University of California, USA), Hossam ElGindy (University of New South Wales, Australia), Akihiro Fujiwara (Kyushu Institute of Technology, Japan), Ju-wook Jang (Sogang University, Korea), Rong Lin (SUNY Geneseo, USA), Toshimitsu Masuzawa (Nara Institute of Science and Technology, Japan), Rami Melhem (University of Pittsburgh, USA), Eiji Miyano (Kyushu Institute of Design, Japan), Michael Palis (Rutgers University, USA), Sanguthevar Rajasekaran (University of Florida, USA), Nicola Santoro (Carleton University, Canada), James Schwing (Central Washington University, USA), Hong Shen (Griffith University, Australia), Ivan Stojmenovic (University of Ottawa, Canada), Jerry L. Trahan (Louisiana State University, USA), Ramachandran Vaidyanathan (Louisiana State University, USA), Biing-Feng Wang (National Tsinghua University, Taiwan), Jie Wu (Florida Atlantic University, USA), Masafumi Yamashita (Kyushu University, Japan), Tao Yang (University of California, USA), Si Qing Zheng (University of Texas at Dallas, USA), Albert Y. Zomaya (University of Western Australia, Australia)
The Heterogeneous Bulk Synchronous Parallel Model

Tiani L. Williams and Rebecca J. Parsons
School of Computer Science
University of Central Florida
Orlando, FL 32816-2362
{williams, [email protected]}

Abstract. Trends in parallel computing indicate that heterogeneous
parallel computing will be one of the most widespread platforms for computation-intensive applications. A heterogeneous computing environment offers considerably more computational power at a lower cost than a parallel computer. We propose the Heterogeneous Bulk Synchronous Parallel (HBSP) model, which is based on the BSP model of parallel computation, as a framework for developing applications for heterogeneous parallel environments. HBSP enhances the applicability of the BSP model by incorporating parameters that reflect the relative speeds of the heterogeneous computing components. Moreover, we demonstrate the utility of the model by developing parallel algorithms for heterogeneous systems.
1 Introduction

Parallel computers have made an impact on the performance of large-scale scientific and engineering applications such as weather forecasting, earthquake prediction, and seismic data analysis. However, special-purpose massively parallel machines have proven to be expensive to build, difficult to use, and have lagged in performance by taking insufficient advantage of improving technologies. Heterogeneous computing [8, 14] is a cost-effective approach that avoids these disadvantages. A heterogeneous computing environment can represent a diverse suite of architecture types such as Pentium PCs, shared-memory multiprocessors, and high-performance workstations. Unlike parallel computing, such an approach will leverage technologies that have demonstrated sustained success, including computer networks, microprocessor technology, and shared-memory platforms. We propose a framework for the development of parallel applications for heterogeneous platforms. Our model is called Heterogeneous Bulk Synchronous Parallel (HBSP), which is an extension to the BSP model of parallel computation [17]. BSP provides guidance on designing applications for good performance on homogeneous parallel machines. Experiments [5] indicate that the model also accurately predicts parallel program performance on a wide range of parallel
machines. HBSP enhances the applicability of the BSP model by incorporating parameters that reflect the relative speeds of the heterogeneous computing components. Our starting point for the development of algorithms for HBSP is efficient BSP or HCGM [10, 11] applications. Specifically, we develop three HBSP algorithms (prefix sums, matrix multiplication, and randomized sample sort) that distribute the computational load according to processor speed without sacrificing performance. In fact, the cost model indicates that wall-clock performance is increased in many cases. Furthermore, these algorithms can execute unchanged on both heterogeneous and homogeneous platforms. The rest of the paper proceeds as follows. Section 2 reviews related work. Section 3 describes the HBSP model. Section 4 presents a sampling of algorithms for HBSP. Concluding remarks and future directions are given in Section 5.
2 Related Work

The theoretical foundations of the BSP model were presented in a series of papers by Valiant [15, 16, 17, 18, 19], which describe the model, how BSP computers can be programmed either in direct mode or in automatic mode (PRAM simulations), and how to construct efficient BSP computers. Other work presents theoretical results, empirical results, or experimental parameterization of BSP programs [1, 2, 3, 4, 5, 21]. Many alternative models of parallel computation have been proposed in the literature; good surveys on this topic are the papers by Maggs, Matheson, and Tarjan [9] and by Skillicorn and Talia [13]. Several models exist to support heterogeneous parallel computation. However, they are either primarily of theoretical interest or are basically languages/runtime systems without a solid theoretical foundation. For an overview of these approaches, we refer the reader to the surveys by Siegel et al. [12] and Weems et al. [20]. One notable exception is the Heterogeneous Coarse-Grained Multicomputer (HCGM) model, developed by Morin [10, 11]. HBSP and HCGM are similar in structure and philosophy. The main difference is that HCGM is not intended to be an accurate predictor of execution times, whereas HBSP attempts to provide the developer with predictable algorithmic performance.
3 Heterogeneous BSP

The Heterogeneous Bulk Synchronous Parallel (HBSP) model is a generalization of the BSP model [17] of parallel computation. The BSP model is a useful guide for parallel system development. However, it is inappropriate for heterogeneous parallel systems since it assumes all components have equal computation and communication abilities. The goal of HBSP is to provide a framework that makes parallel computing a viable option for heterogeneous systems. HBSP enhances the applicability of BSP by incorporating parameters that reflect the relative speeds of the heterogeneous computing components. An HBSP computer is characterized by the following parameters:
– the number of processor-memory components p, labeled P_0, ..., P_{p-1};
– the gap g_j, for j in [0..p-1], a bandwidth indicator that reflects the speed with which processor j can inject packets into the network;
– the latency L, which is the minimum duration of a superstep, and which reflects the latency to send a packet through the network as well as the overhead to perform a barrier synchronization;
– the processor parameters c_j, for j in [0..p-1], which indicate the speed of processor j relative to the slowest processor; and
– the total speed of the heterogeneous configuration, c = \sum_{i=0}^{p-1} c_i.
i;j j
The overall execution time is the sum of the superstep execution times. The HBSP model leverages existing BSP research. The more complex cost model does not change the basic programming methodology, which relies on the superstep concept. Furthermore, when cj = 1 and gj = gk , where 0 j; k < p, HBSP is equivalent to BSP.
4 HBSP Algorithms This section provides a sampling of applications for the HBSP model based on those proposed by Morin for the HCGM model [10, 11]. Our algorithms, which include pre x sums, matrix multiplication, and randomized sample sort, illustrate the power and elegance of the HBSP model. In each of the applications, the input size is partitioned according to a processor's speed. If ci is the speed of processor Pi , then Pi holds cc n input elements. When discussing the performance of the algorithms, we will often make use of a coarse-grained assumption, p n, i.e., the size of the problem is signi cantly larger than the number of processors. Our interpretation of \signi cantly larger" is p np . i
The Heterogeneous Bulk Synchronous Parallel Model
105
4.1 Pre x Sums Given a sequence of n numbers fx0 ; x1 ; :::; xn,1 g, it is required to compute their pre x sums sj = x0 + x1 + ::: + xj , for all j , 0 j n , 1. Under HBSP, each processor locally computes its pre x sums and sends the total sum to Pf . Next, Pf computes the pre x sums of this sequence and sends the (i , 1)st element of the pre x to Pi . Lastly, Pi adds this value to each element of the pre x sums computed in the rst step to obtain the pre x sums of the overall result. The pre x sums algorithm is shown below. 1. 2. 3. 4. 5.
Each processor locally computes the pre x sums of its cc n input elements. Each processor, Pi , sends the total sum of its input elements to Pf . Pf computes the pre x sums of the p elements received in Step 2. For 1 i p , 1; Pf sends the (i , 1)st element computed in Step 3 to Pi . Each processor computes its nal portion of the pre x sums by adding the value received in Step 4 to each of the values computed in Step 1. i
Analysis. In Step 1 and Step 5, each processor P_i does O((c_i/c)·n) work, and this can be done in O(n/c) time. Steps 2 and 4 require a communication time of max{g_s·1, g_f·p}. Step 3 takes O(p/c_f) computation time. Since c_f ≥ c/p and p ≤ n/p, O(p/c_f) ⊆ O(n/c). Thus, the algorithm takes time

    O(n/c) + 2·max{g_s·1, g_f·p} + 3L.    (2)

If g_s ≤ p·g_f, the communication time is 2p·g_f; otherwise it is 2g_s.
4.2 Matrix Multiplication

Matrix multiplication is perhaps one of the most common operations used in large-scale scientific computing. Given two n × n matrices A and B, we define the matrix C = A · B as C_{i,j} = Σ_{k=0}^{n−1} A_{i,k} · B_{k,j}. We assume that matrix A is partitioned among the processors so that each processor, P_i, holds (c_i/c)·n rows of A and n/p columns of B. At the completion of the computation, P_i will hold (c_i/c)·n rows of C. We denote the parts of A, B, and C held by P_i as A_i, B_i, and C_i, respectively. The matrix multiplication algorithm consists of circulating the columns of B among the processors. When P_i receives column j of B, it can compute column j of C_i. Once P_i has seen all columns of B, it will have computed all of C_i. The matrix multiplication algorithm is given below.

1. repeat p times
2.   P_i computes C_i = A_i · B_i.
3.   P_i sends B_i to P_{(i+1) mod p}.
4. end repeat
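A sequential simulation of this column-circulation scheme; this is a sketch under our own naming, with NumPy used only for the local products.

```python
import numpy as np

def hbsp_matmul(A_rows, B_cols):
    """A_rows[i]: rows of A held by processor i (shape r_i x n).
    B_cols[j]: column block j of B (shape n x n/p), initially at processor j.
    Each round, processor i multiplies its rows of A by the block it holds,
    stores the product, and forwards the block to processor (i+1) mod p."""
    p = len(A_rows)
    block_at = list(range(p))                 # which column block each processor holds
    C_parts = [dict() for _ in range(p)]      # column blocks of C_i, keyed by block id
    for _ in range(p):
        for i in range(p):
            j = block_at[i]
            C_parts[i][j] = A_rows[i] @ B_cols[j]
        block_at = [block_at[(i - 1) % p] for i in range(p)]   # circulate the blocks
    # Assemble each processor's rows of C with the column blocks in order.
    return [np.hstack([C_parts[i][j] for j in range(p)]) for i in range(p)]

# Example with p = 2 processors and a 4x4 matrix.
A, B = np.arange(16).reshape(4, 4), np.eye(4)
C = np.vstack(hbsp_matmul([A[:2], A[2:]], [B[:, :2], B[:, 2:]]))
assert np.allclose(C, A @ B)
```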
Analysis. Step 3 requires P_i to perform O((c_i/c)·n · (n/p) · n) = O(n³·c_i/(p·c)) work. Over p rounds, the total computation time is O(n³/c). During Step 4, each processor sends and receives n/p columns of matrix B. Therefore, the total time of HBSP matrix multiplication is

    O(n³/c) + g_s·n² + pL.    (3)
4.3 Randomized Sample Sort

One approach for parallel sorting that is suitable for heterogeneous computing is randomized sample sort. It is based on the selection of a set of p−1 "splitters" from a set of input keys. In particular, we seek splitters that will divide the input keys into approximately equal-sized buckets. The standard approach is to randomly select p·r sample keys from the input set, where r is called the oversampling ratio. The keys are sorted, and the keys with ranks r, 2r, ..., (p−1)r are selected as splitters. By choosing a large enough oversampling ratio, it can be shown with high probability that no bucket will contain many more keys than the average [7]. Once processors gain knowledge of the splitters, their keys are partitioned into the appropriate bucket. Afterwards, processor i locally sorts all the keys in bucket i. When adapting this algorithm to the HBSP model, we change the way in which the splitters are chosen. To balance the work according to the processor speeds c_0, ..., c_{p−1}, it is necessary that O((c_i/c)·n) keys fall between s_i and s_{i+1}. This leads to the following algorithm.

1. Each processor randomly selects a set of r sample keys from its (c_i/c)·n input keys.
2. Each processor, P_i, sends its sample keys to P_f.
3. P_f sorts the p·r sample keys. Denote these keys by sample_0, ..., sample_{pr−1}, where sample_i is the sample key with rank i in the sorted order. P_f defines p−1 splitters, s_0, ..., s_{p−2}, where s_i = sample_{⌈(Σ_{j=0}^{i} c_j / c)·pr⌉}.
4. P_f broadcasts the p−1 splitters to each of the processors.
5. All keys assigned to the i-th bucket are sent to the i-th processor.
6. Each processor sorts its bucket.
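The following sketch illustrates Steps 1 and 3, the speed-weighted choice of splitters. The names, the default oversampling ratio, and the rounding of the rank are our own assumptions, not the paper's.

```python
import random

def choose_splitters(keys_per_proc, speeds, r=32):
    """Each processor contributes r random sample keys (Step 1); the fastest
    processor sorts the p*r samples and picks p-1 splitters so that about
    (c_i / c) * n keys are expected to land in bucket i (Step 3)."""
    total_speed = float(sum(speeds))
    samples = sorted(
        s for keys in keys_per_proc for s in random.sample(keys, min(r, len(keys)))
    )
    splitters, cum = [], 0.0
    for c in speeds[:-1]:
        cum += c / total_speed                       # cumulative speed fraction
        splitters.append(samples[min(int(cum * len(samples)), len(samples) - 1)])
    return splitters

def bucket(key, splitters):
    """Destination processor (bucket index) of a key, given the splitters."""
    for i, s in enumerate(splitters):
        if key < s:
            return i
    return len(splitters)
```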
Analysis. In Step 1, each processor performs O(r) ⊆ O(n) work. This requires O(n/c_s) time. Step 2 requires a communication time of max{g_s·r, g_f·p·r}. To sort the p·r sample keys, P_f does O(pr lg pr) ⊆ O(n lg n) work. This can be done in O((n/c_f)·lg n) time. Broadcasting the p−1 splitters requires max{g_s·(p−1), g_f·p(p−1)} communication time. Since each processor is expected to receive approximately (c_i/c)·n keys [11], Step 5 uses O(n/c) computation time and max{g_i·(c_i/c)·n} communication time, where i ∈ [0..p−1]. Once each processor
receives its keys, sorting them requires O((n/c)·lg n) time. Thus, the total time is

    O((n/c)·lg n) + X·(r + (p−1)) + max_{i ∈ [0..p−1]} { g_i·(c_i/c)·n } + 4L,    (4)

where X = p·g_f if g_s ≤ p·g_f, and X = g_s otherwise.
5 Conclusions and Future Directions

The HBSP model provides a framework for the development of parallel applications for heterogeneous platforms. HBSP enhances the applicability of BSP by incorporating parameters that reflect the relative speeds of the heterogeneous computing components. Although the HBSP model is somewhat more complex than BSP, it captures the most important aspects of heterogeneous systems. Existing BSP and HCGM algorithms provide the foundation for the HBSP algorithms presented here. These algorithms suggest that improved performance under HBSP results from utilizing the processor speeds of the underlying system. However, experimental evidence is needed to corroborate this claim. We plan to extend this work in several directions. First, a library based on BSPlib (a small, standardized library of BSP functions) [6] will provide the foundation for HBSP programming. Experiments will be conducted to test the effectiveness of the model on a network of heterogeneous workstations. These experiments will test the predictability, scalability, and efficiency of applications written under HBSP. Currently, the HBSP model only addresses a heterogeneous collection of uniprocessor machines. We are investigating variants to the model to address multiprocessor systems. In conclusion, the goal of HBSP is to offer a framework that makes parallel computing a viable option for a wide range of tasks. We seek to demonstrate that it can provide a simple programming approach, portable and efficient application code, predictable execution, and scalable performance.
References

[1] R. H. Bisseling. Sparse matrix computations on bulk synchronous parallel computers. In Proceedings of the International Conference on Industrial and Applied Mathematics, Hamburg, July 1995.
[2] R. H. Bisseling and W. F. McColl. Scientific computing on bulk synchronous parallel architectures. In B. Pehrson and I. Simon, editors, Proceedings of the 13th IFIP World Computer Congress, volume 1, pages 509–514. Elsevier, 1994.
[3] A. V. Gerbessiotis and C. J. Siniolakis. Deterministic sorting and randomized median finding on the BSP model. In Eighth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 223–232, June 1996.
[4] A. V. Gerbessiotis and L. G. Valiant. Direct bulk-synchronous parallel algorithms. Journal of Parallel and Distributed Computing, 22(2):251–267, August 1994.
[5] M. W. Goudreau, K. Lang, S. Rao, T. Suel, and T. Tsantilas. Towards efficiency and portability: Programming with the BSP model. In Eighth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 1–12, June 1996.
[6] J. M. D. Hill, B. McColl, D. C. Stefanescu, M. W. Goudreau, K. Lang, S. B. Rao, T. Suel, T. Tsantilas, and R. Bisseling. BSPlib: The BSP programming library. Parallel Computing, 24(14):1947–1980, 1998.
[7] J. Huang and Y. Chow. Parallel sorting and data partitioning by sampling. In IEEE Computer Society's Seventh International Computer Software & Applications Conference (COMPSAC'83), pages 627–631, November 1983.
[8] A. Khokhar, V. Prasanna, M. Shaaban, and C. Wang. Heterogeneous computing: Challenges and opportunities. Computer, 26(6):18–27, June 1993.
[9] B. M. Maggs, L. R. Matheson, and R. E. Tarjan. Models of parallel computation: A survey and synthesis. In Proceedings of the 28th Hawaii International Conference on System Sciences, volume 2, pages 61–70. IEEE Press, January 1995.
[10] P. Morin. Coarse-grained parallel computing on heterogeneous systems. In Proceedings of the 1998 ACM Symposium on Applied Computing, pages 629–634, 1998.
[11] P. Morin. Two topics in applied algorithmics. Master's thesis, Carleton University, 1998.
[12] H. J. Siegel, H. G. Dietz, and J. K. Antonio. Software support for heterogeneous computing. In A. B. Tucker, editor, The Computer Science and Engineering Handbook, pages 1886–1909. CRC Press, 1997.
[13] D. B. Skillicorn and D. Talia. Models and languages for parallel computation. ACM Computing Surveys, 30(2):123–169, June 1998.
[14] L. Smarr and C. E. Catlett. Metacomputing. Communications of the ACM, 35(6):45–52, June 1992.
[15] L. G. Valiant. Optimally universal parallel computers. Philosophical Transactions of the Royal Society of London, A 326:373–376, 1988.
[16] L. G. Valiant. Bulk-synchronous parallel computers. In M. Reeve and S. E. Zenith, editors, Parallel Processing and Artificial Intelligence, pages 15–22. John Wiley & Sons, Chichester, 1989.
[17] L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103–111, 1990.
[18] L. G. Valiant. General purpose parallel architectures. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science, volume A: Algorithms and Complexity, chapter 18, pages 943–971. MIT Press, Cambridge, MA, 1990.
[19] L. G. Valiant. Why BSP computers? In Proceedings of the 7th International Parallel Processing Symposium, pages 2–5. IEEE Press, April 1993.
[20] C. C. Weems, G. E. Weaver, and S. G. Dropsho. Linguistic support for heterogeneous parallel processing: A survey and an approach. In Proceedings of the Heterogeneous Computing Workshop, pages 81–88, 1994.
[21] T. L. Williams and M. W. Goudreau. An experimental evaluation of BSP sorting algorithms. In Proceedings of the 10th IASTED International Conference on Parallel and Distributed Computing Systems, pages 115–118, October 1998.
On Stalling in LogP*
(Extended Abstract)

Gianfranco Bilardi¹,², Kieran T. Herley³, Andrea Pietracaprina¹, and Geppino Pucci¹

¹ Dipartimento di Elettronica e Informatica, Università di Padova, Padova, Italy.
{bilardi,andrea,geppo}@artemide.dei.unipd.it
² T.J. Watson Research Center, IBM, Yorktown Heights, NY 10598, USA.
³ Department of Computer Science, University College Cork, Cork, Ireland.
[email protected]

* This research was supported, in part, by the Italian CNR, and by MURST under Project Algorithms for Large Data Sets: Science and Engineering.
Abstract. We investigate the issue of stalling in the LogP model. In particular, we introduce a novel quantitative characterization of stalling, referred to as δ-stalling, which intuitively captures the realistic assumption that once the network's capacity constraint is violated, it takes some time (at most δ) for this information to propagate to the processors involved. We prove a lower bound showing that LogP under δ-stalling is strictly more powerful than the stall-free version of the model, where only strictly stall-free computations are permitted. On the other hand, we show that δ-stalling LogP with δ = L can be simulated with at most logarithmic slowdown by a BSP machine with similar bandwidth and latency values, thus extending the equivalence (up to logarithmic factors) between stall-free LogP and BSP argued in [1] to the more powerful L-stalling LogP.
1 Introduction

Over the last decade considerable attention has been devoted to the formulation of a suitable computational model that supports the development of efficient and portable parallel software. The widely-studied BSP [6] and LogP [2] models were conceived to provide a convenient framework for the design of algorithms, coupled with a simple yet accurate cost model, to allow algorithms to be ported across a wide range of machine architectures with good performance. Both models view a parallel computer as a set of p processors with local memory that exchange messages through a communication medium whose performance is essentially characterized by two key parameters: bandwidth (g for BSP and G for LogP) and latency (ℓ for BSP and L for LogP). A distinctive feature of LogP is that it embodies a network capacity constraint stipulating that at any time the total number of messages in transit towards any specific destination should not exceed the threshold ⌈L/G⌉. If this constraint is respected, then every message is guaranteed to arrive within L steps of its submission time. If, however, a processor attempts to submit a message with destination d whose injection into the network would violate the constraint, then the processor is forced to stall until the delivery of some outstanding messages brings the traffic for d below the ⌈L/G⌉ threshold. It seems clear that the intention of the original LogP proposal [2] was strongly to
encourage the development of stall-free programs. Indeed, the delays incurred in the presence of stalling were not formally quantified within the model, making the performance of stalling programs an issue difficult to assess with any precision. At the same time, adhering strictly to the stall-free mode might make algorithm design artificially complex, e.g., in situations involving randomization where stalling is unlikely but not impossible. Hence, ruling out stalling altogether might not be desirable. The relation between BSP and LogP has been investigated in [1], where it is shown that the two models can simulate one another efficiently, under the reasonable assumption that both exhibit comparable values for their respective bandwidth and latency parameters. These results were obtained under a precise specification of stalling behaviour, that attempted to be faithful to the original formulation of the model. Interestingly, however, while the simulation of stall-free LogP programs on the BSP machine can be accomplished with constant slowdown, the simulation of stalling programs incurs a higher slowdown. This difference appears also in subsequent results of [5], where work-preserving simulations are considered. Should stalling programs turn out inherently to require a larger slowdown, it would be an indication that stalling adds power to the LogP model, in contrast with the objective of discouraging its use. The definition of stalling proposed in [1] states that at each step the network accepts submitted messages up to the capacity threshold for each destination, forcing a processor to stall immediately upon submitting a message that exceeds the network capacity, and subsequently awakening the processor immediately when its message can be injected without violating the capacity constraint. Although consistent with the informal descriptions given in [2], the above definition of stalling implies the somewhat unrealistic assumption that the network is able to detect and react to the occurrence of a capacity constraint violation instantaneously. More realistically, some time lag is necessary between the submission of a message and the onset of stalling, to allow information to propagate through the network. In this paper we delve further into the issue of stalling in LogP along the following directions:
– We generalize the definition of stalling, by introducing the notion of δ-stalling. Intuitively, δ captures the time lag between the submission of a message by a processor which violates the capacity constraint, and the time that the processor "realizes" that it must stall. (A similar time lag affects the "unstalling" process.) The extreme case of δ = 1 essentially corresponds to the stalling interpretation given in [1]. While remaining close to the spirit of the original LogP, δ-stalling LogP has the potential of reflecting more closely the behaviour of actual platforms, without introducing further complications in the design and analysis of algorithms.
– We prove that allowing for stalling in a LogP program enhances the computational power of the model. In particular, we prove a lower bound which separates δ-stalling LogP from stall-free LogP computations by a non-constant factor.
– We devise an algorithm to simulate δ-stalling LogP programs in BSP, which achieves at most logarithmic slowdown under the realistic assumption δ = L. This result, combined with those in [1], extends the equivalence (up to logarithmic factors) between LogP and BSP to L-stalling computations.
The rest of the paper is organized as follows. In Section 2 the definitions of BSP and LogP are reviewed and the new δ-stalling rule is introduced. In Section 3 a lower bound is shown that separates δ-stalling LogP from stall-free LogP computations. In Section 4 the simulation of δ-stalling LogP in BSP is presented.
2 The Models

Both the BSP [6] and the LogP [2] models can be defined in terms of a virtual machine consisting of p serial processors with unique identifiers. Each processor i, 0 ≤ i < p, has direct and exclusive access to a private memory and has a local clock. All clocks run at the same speed. The processors interact through a communication medium, typically a network, which supports the routing of messages. In the case of BSP, the communication medium also supports global barrier synchronization. The distinctive features of the two models are discussed below. In the rest of this section we will use P_i^B and P_i^L to denote, respectively, the i-th BSP processor and the i-th LogP processor, with 0 ≤ i < p.

BSP. A BSP machine operates by performing a sequence of supersteps, where in a superstep each processor may perform local operations, send messages to other processors and read messages previously delivered by the network. The superstep is concluded by a barrier synchronization which informs the processors that all local computations are completed and that every message sent during the superstep has reached its intended destination. The model prescribes that the next superstep may commence only after completion of the previous barrier synchronization, and that the messages generated and transmitted during a superstep are available at the destinations only at the start of the next superstep. The performance of the network is captured by a bandwidth parameter g and a latency parameter ℓ. The running time of a superstep is expressed in terms of g and ℓ as T_superstep = w + g·h + ℓ, where w is the maximum number of local operations performed by any processor and h is the maximum number of messages sent or received by any processor during the superstep. The overall time of a BSP computation is simply the sum of the times of its constituent supersteps.

LogP. In a LogP machine, at each time step, a processor can be either operational or stalling. If it is operational, then it can perform one of the following types of operations: execute an operation on locally held data (compute); submit a message to the network destined to another processor (submit); receive a message previously delivered by the network (receive). A LogP program specifies the sequence of operations to be performed by each processor. As in BSP, the behaviour of the network is modeled by a bandwidth parameter G (called gap in [2]) and a latency parameter L with the following meaning. At least G time steps must elapse between consecutive submit or receive operations performed by the same processor. If, at the time that a message is submitted, the total number of messages in transit (i.e., submitted to the network but not yet delivered) for that destination is at most ⌈L/G⌉, then the message is guaranteed to be delivered within L steps. If, however, the number of messages in transit exceeds ⌈L/G⌉, then, due to congestion,
the message may take longer to reach its destination, and the submitting processor may stall for some time before continuing its operations. The quantity ⌈L/G⌉ is referred to as the network's capacity constraint. Note that message delays are unpredictable, hence different executions of a LogP program are possible. If no stalling occurs, then every message arrives in at most L time steps after its submission. Upon arrival, a message is promptly removed from the network and buffered in some input buffer associated with the receiving processor. However, the actual acquisition of the incoming message by the processor, through a receive operation, may occur at a later time. LogP also introduces an overhead parameter o to represent both the time required to prepare a message for submission and the time required to unpack the message after it has been received. Throughout the paper we will assume that max{2, o} ≤ G ≤ L ≤ p. The reader is referred to [1] for a justification of this assumption.

2.1 LogP's Stalling Behaviour

The original definition of the LogP model in [2] provides only a qualitative description of the stalling behaviour and does not specify precisely how the performance of a program is affected by stalling. In [1], the following rigorous characterization of stalling was proposed. At each step the network accepts messages up to saturation, for each destination, of the capacity limit, possibly blocking the messages exceeding such a limit at the senders. From a processor's perspective, the attempt to submit a message violating the capacity constraint results in immediate stalling, and the stalling lasts until the message can be accepted by the network without capacity violation. The above characterization of stalling, although consistent with the intentions of the model's proposers, relies on the somewhat unrealistic assumption that the network is able to monitor at each step the number of messages in transit for each destination, blocking (resp., unblocking) a processor instantaneously in case a capacity constraint violation is detected (resp., ends). In reality, the stall/unstall information would require some time to propagate through the network and reach the intended processors. Below we propose an alternative, yet rigorous, definition of stalling, which respects the spirit of LogP while modelling the behaviour of real machines more accurately. Let 1 ≤ δ ≤ L be an integral parameter. Suppose that at time step t processor P_i^L submits a message m destined to P_j^L, and let c_j(t) denote the total number of messages destined to P_j^L which have been submitted up to (and including) step t and are still in transit at the beginning of this step. If c_j(t) ≤ ⌈L/G⌉, then m reaches its destination at some step t_m, with t < t_m ≤ t + L. If, instead, c_j(t) > ⌈L/G⌉ (i.e., the capacity constraint is violated), the following happens:

1. Message m reaches its destination at some step t_m, with t < t_m ≤ t + G·c_j(t) + L.
2. P_i^L may be signalled to stall at some time step t′, with t < t′ ≤ t + δ. Until step t′ the processor continues its normal operations.
3. Let t̄ denote the latest time step at which a message that caused P_i^L to stall during steps [t, t′) arrives at its destination. Then, the processor is signalled to revert to the operational state at some time t″, with t̄ < t″ ≤ t̄ + δ. (Note that if t′ > t̄ + δ no stalling takes place.)
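A small sketch of rules 1 and 2 above, with hypothetical helper names; it only shows how the capacity threshold ⌈L/G⌉ and the lag δ bound the delivery and stall-signal times for a single submission.

```python
from math import ceil

def delivery_deadline(t, c_j, L, G):
    """Latest possible delivery step for a message submitted at step t,
    given c_j(t) messages in transit for its destination (rule 1)."""
    capacity = ceil(L / G)
    if c_j <= capacity:
        return t + L                  # no violation: delivered within L steps
    return t + G * c_j + L            # violation: delivery may be delayed

def stall_signal_window(t, delta):
    """Interval (t, t + delta] in which the submitting processor may be
    told to stall after a capacity violation (rule 2)."""
    return (t + 1, t + delta)

print(delivery_deadline(t=0, c_j=3, L=4, G=4))   # 16: capacity 1 exceeded
print(stall_signal_window(t=0, delta=4))         # (1, 4)
```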
Intuitively, parameter δ represents an upper bound on the time the network takes to inform a processor that one of the messages it submitted violated the capacity constraint, or that it may revert to the operational state as the result of a decreased load in the network. We refer to the LogP model under the above stalling rule as δ-stalling LogP, or δ-LogP for short. A legal execution of a δ-LogP program is one where message delivery times and stalling periods are consistent with the model's specifications and with the above rule.¹ In [1] a restricted version of LogP has also been considered, which regards as correct only those programs whose executions never violate the capacity constraint, that is, programs where processors never stall. We refer to such a restricted version of the model as stall-free LogP, or SF-LogP for short.
3 Separation between δ-Stalling LogP and Stall-Free LogP

In this section, we demonstrate that allowing for δ-stalling in LogP makes the model strictly more powerful than SF-LogP. We prove our claim by exhibiting a simple problem for which any SF-LogP algorithm requires time that is asymptotically higher than the time attained by a simple δ-LogP algorithm. The problem is 2-compaction [4]. On a shared memory machine, the problem is defined as follows: given a vector x = (x_0, x_1, ..., x_{p−1}) of p integer components with at most two nonzero values x_{i0} and x_{i1}, i0 < i1, compact the nonzero values at the front of the array. On LogP, we recast the problem as follows. Vector x is initially distributed among the processors so that P_i^L holds x_i, for 0 ≤ i < p. The problem simply requires making (i0, x_{i0}) and (i1, x_{i1}) known, respectively, to P_0^L and P_1^L. On δ-LogP the 2-compaction problem can be solved by the following simple deterministic algorithm in O(L) time, for any δ ≥ 1: each processor that holds a nonzero value transmits its identity and its input value first to P_0^L and then to P_1^L. Observe that if G = L such a strategy is illegal for SF-LogP, since it generates a violation of the capacity constraint (since, in this case, ⌈L/G⌉ = 1). The following theorem shows that, indeed, for G = L, 2-compaction cannot be solved on SF-LogP in O(L) time, thus providing a separation between SF-LogP and δ-LogP.

Theorem 1. For any constant ε, 0 < ε < 1, solving 2-compaction with probability greater than (1 + ε)/2 on SF-LogP with G = L requires Ω(L·√(log n)) steps.

Proof (Sketch). In [4] it is proved that solving 2-compaction with probability greater than (1 + ε)/2 on the EREW-PRAM requires Ω(√(log n)) steps, even if each processor is allowed to perform an unbounded amount of local computation per step. The theorem follows by showing that when G = L, any T-step computation of a p-processor SF-LogP can be simulated in O(⌈T/L⌉) steps on a p-processor EREW-PRAM with unbounded local computation. (Details of the simulation will be provided in the full version of the paper.)
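A toy rendering of the deterministic δ-LogP strategy used in the upper bound; the "network" is reduced to two mailboxes, so this is only an illustration of the communication pattern, not a LogP simulator.

```python
def two_compaction(x):
    """Every processor holding a nonzero value sends (index, value) to both
    P0 and P1.  With at most two nonzero entries, each of P0 and P1 receives
    at most two messages; when G = L this exceeds the capacity bound
    ceil(L/G) = 1, which is why the strategy needs delta-stalling."""
    mailbox_p0, mailbox_p1 = [], []
    for i, v in enumerate(x):
        if v != 0:
            mailbox_p0.append((i, v))
            mailbox_p1.append((i, v))
    return sorted(mailbox_p0)[:2]     # the compacted (index, value) pairs

print(two_compaction([0, 0, 7, 0, 9, 0]))   # [(2, 7), (4, 9)]
```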
¹ Note that the characterization of stalling proposed in [1] corresponds to the one given above with δ = 1, except that in [1] a processor reverts to the operational state as soon as the capacity constraint violation ends, which may happen before the message causing the violation reaches its destination.
It must be remarked that the above theorem relies on the assumption G = L. We leave the extension of the lower bound to arbitrary values of G and L as an interesting open problem.
4 Simulation of δ-LogP on BSP

This section shows how to simulate δ-LogP programs efficiently on BSP. The strategy is similar in spirit to the one devised in [1] for the simulation of SF-LogP programs; however, it features a more careful scheduling of interprocessor communication in order to correctly implement the stalling rule. The algorithm is organized in cycles, where in a cycle P_i^B simulates C = max{G, δ} ≤ L consecutive steps (including possible stalling steps) of processor P_i^L, using its own local memory to store the contents of P_i^L's local memory, for 0 ≤ i < p. In order to simplify bookkeeping operations, the algorithm simulates a particular legal execution of the LogP program where all messages reach their destinations at cycle boundaries. (From what follows it will be clear that such a legal execution exists.) Each processor P_i^B has a program counter that at any time indicates the next instruction to be simulated in P_i^L's program. It also maintains in its local memory a pool for outgoing messages Q_out(i), a FIFO queue for incoming messages Q_in(i) (both initially empty), and two integer variables t_i and w_i. Variable t_i represents the clock and always indicates the next time step to be simulated, while w_i is employed in case of stalling to indicate when P_i^L reverts to the operational state. Specifically, P_i^L is stalling in the time interval [t_i, w_i − 1], hence it is operational at step t_i if w_i ≤ t_i. Initially both t_i and w_i are set to 0. The undelivered messages causing processors to stall are retained in a global pool S, which is evenly distributed among the processors. We now outline the simulation of the k-th cycle, k ≥ 0, which comprises time steps Ck, Ck + 1, ..., C(k + 1) − 1. At the beginning of the cycle's simulation we have that t_i = Ck and Q_in(i) contains all messages delivered by the network to P_i^L at the beginning of step Ck, for 0 ≤ i < p. Also, S contains messages that have been submitted in previous cycles and that will reach their destination at later cycles, that is, at time steps Ck′ with k′ > k. The simulation of the k-th cycle proceeds as follows.
1. For 0 ≤ i < p, if w_i < C(k + 1) then P_i^B simulates the next x = C(k + 1) − max{t_i, w_i} instructions in P_i^L's program. A submit is simulated by inserting the message into Q_out(i), and a receive is simulated by extracting a message from Q_in(i). The processor also advances its clock by x, setting t_i = C(k + 1).
2. All messages in ∪_i Q_out(i), together with those in S, are sorted by destination and, within each destination group, by time of submission.
3. Within each destination group, messages are ranked, and a message with rank r is assigned delivery time C(k + ⌈r/⌈L/G⌉⌉) (i.e., the message will be delivered at the beginning of the ⌈r/⌈L/G⌉⌉-th next cycle).
4. Each message to be delivered at cycle k + 1 is placed in the appropriate Q_in(i) queue (that of its destination), while all other messages are placed in S. Comment: Note that S contains only those messages for which a violation of the capacity constraint occurred.
5. For 0 ≤ i < p, if one of the messages submitted by P_i^L is currently in S then
   (a) w_i is set to the maximum delivery time of P_i^L's messages in S;
   (b) if δ < G, then all operations performed by P_i^L in the simulated cycle subsequent to the submission of the first message that ended up in S are "undone", and t_i is adjusted accordingly. Comment: Note that when δ < G processor P_i^L submits only one message in the cycle, hence the operations to be undone do not involve submits and their undoing is straightforward.
6. Messages in S are evenly redistributed among the processors.
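The delivery-time assignment of Steps 2–3 can be illustrated as follows; this is a sketch under our own names that ignores the per-processor queues and the pool bookkeeping.

```python
from math import ceil

def assign_delivery_cycles(messages, k, L, G):
    """Group messages by destination, order each group by submission time,
    and give the message of rank r delivery cycle k + ceil(r / ceil(L/G)).

    messages: list of (destination, submission_time) pairs
    returns:  list of (destination, submission_time, delivery_cycle)
    """
    capacity = ceil(L / G)
    by_dest = {}
    for dest, sub in sorted(messages, key=lambda m: (m[0], m[1])):
        by_dest.setdefault(dest, []).append(sub)
    out = []
    for dest, subs in by_dest.items():
        for rank, sub in enumerate(subs, start=1):
            out.append((dest, sub, k + ceil(rank / capacity)))
    return out

# Three messages for destination 0 with L = G (capacity 1), submitted in
# cycle k = 0: they are spread over the next three cycles.
print(assign_delivery_cycles([(0, 3), (0, 1), (0, 2)], k=0, L=4, G=4))
```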
Theorem 2. For any δ, 1 ≤ δ ≤ L, the above algorithm correctly simulates a cycle of C = max{G, δ} arbitrary δ-LogP steps in time

    O( C · ( 1 + log p · ( 1/G + (g/G)/(1 + log(δC/G)) + (ℓ/C)/(1 + log min{C/G, ℓ/g}) ) ) ).
Proof (Sketch). Consider the simulation of an arbitrary cycle. The proof of correctness, which will be provided in the full version of the paper, entails showing that the operations performed by the BSP processors in the above simulation algorithm do indeed mimic the computation of their LogP counterparts in a legal execution of the cycle. As for the running time, Steps 1 and 5(b) involve O(C) local computation. Step 2 involves the sorting of O((C/G)·p) messages, since |Q_out(i)| = O(C/G), for 0 ≤ i < p, and there can be no more than ⌈δ/G⌉ = O(C/G) messages in S sent by the same (stalling) processor. Finally, the remaining steps are dominated by the cost of prefix operations performed on evenly distributed input sets of size O((C/G)·p) and by the routing of O(C/G)-relations. The stated running time then follows by employing results in [3, 6]. The following corollary is immediately established.

Corollary 1. When ℓ = Θ(L) and g = Θ(G), an arbitrary δ-LogP program can be simulated in BSP with slowdown O((L/G)·log p), if δ = 1, and with slowdown O(log p / min{G, 1 + log(L/G)}), if δ = Θ(L).

The corollary, combined with the results in [1], shows that LogP, under the reasonable L-stalling rule, and BSP can simulate each other with at most logarithmic slowdown when featuring similar bandwidth and latency parameters.
References

1. G. Bilardi, K.T. Herley, A. Pietracaprina, G. Pucci and P. Spirakis. BSP vs. LogP. Algorithmica, 24:405–422, 1999.
2. D.E. Culler, R. Karp, D. Patterson, A. Sahay, K.E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: A practical model of parallel computation. Communications of the ACM, 39(11):78–85, November 1996.
3. M.T. Goodrich. Communication-Efficient Parallel Sorting. In Proc. of the 28th ACM Symp. on Theory of Computing, pages 247–256, Philadelphia PA, 1996.
4. P.D. MacKenzie. Lower bounds for randomized exclusive write PRAMs. Theory of Computing Systems, 30(6):599–626, 1997.
5. V. Ramachandran, B. Grayson, and M. Dahlin. Emulations between QSM, BSP and LogP: a framework for general-purpose parallel algorithm design. TR98-22, Dept. of CS, Univ. of Texas at Austin, November 1998. (Summary in Proc. of ACM-SIAM SODA, 1999.)
6. L.G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103–111, August 1990.
Parallelizability of some P -complete problems? Akihiro Fujiwara1 , Michiko Inoue2 , and Toshimitsu Masuzawa2 1
2
Kyushu Institute of Technology, JAPAN
[email protected] Nara Institute of Science and Technology, JAPAN {kounoe, masuzawa}@is.aist-nara.ac.jp
Abstract. In this paper, we consider parallelizability of some P complete problems. First we propose a parameter which indicates parallelizability for a convex layers problem. We prove P -completeness of the problem and propose a cost optimal parallel algorithm, according to the parameter. Second we consider a lexicographically first maximal 3 sums problem. We prove P -completeness of the problem by reducing a lexicographically first maximal independent set problem, and propose two cost optimal parallel algorithms for related problems. The above results show that some P -complete problems have efficient cost optimal parallel algorithms.
1
Introduction
In parallel computation theory, one of primary complexity classes is the class N C. Let n be the input size of a problem. The problem is in the class N C if there exists an algorithm which solves the problem in T (n) time using P (n) processors where T (n) and P (n) are polylogarithmic and polynomial functions for n, respectively. Many problems in the class P , which is the class of problems solvable in polynomial time sequentially, are also in the class N C. On the other hand, some problems in P seem to have no parallel algorithm which runs in polylogarithmic time using a polynomial number of processors. Such problems are called P -complete. A problem in the class P is P -complete if we can reduce any problem in P to the problem using N C-reduction. (For details of the P completeness, see [9].) Although there are some efficient probabilistic parallel algorithms for some P -complete problems, it is believed that the P -complete problems are inherently sequential and hard to be parallelized. Among many P -complete problems, only some graph problems are known to be asympotically parallelizable. Vitter and Simons[12] showed that the unification, path system accessibility, monotone circuit value and ordered depth-first search problems have cost optimal parallel algorithms if their input graphs are dense graphs, that is, the number of edges is m = Ω(n1+ ) for a constant where the number of vertices is n. ?
Research supported in part by the Scientific Research Grant-in-Aid from Ministry of Education, Science, Sports and Culture of Japan (Scientific research of Priority Areas(B)10205218)
In this paper, we consider parallelizability of two P-complete problems. First we consider a convex layers problem. For the problem, we propose a parameter d which indicates parallelizability of the problem. Using the parameter, we prove that the problem is still P-complete if d = n^ε with 0 < ε < 1. Next we propose a parallel algorithm which runs in O(n log n / p + d²/p + d log d) time using p processors (1 ≤ p ≤ d) on the EREW PRAM. From the complexity, the problem is in NC if d = (log n)^k, where k is a positive constant, and has a cost optimal parallel algorithm if d = n^ε with 0 < ε ≤ 1/2. The second P-complete problem is a lexicographically first maximal 3 sums problem. We prove the P-completeness of the problem, and propose a parallel algorithm, which runs in O(n²/p + n log n) time using p processors (1 ≤ p ≤ n) on the CREW PRAM, for the problem. The above algorithm is cost optimal for 1 ≤ p ≤ n/log n. In addition, we propose a cost optimal parallel algorithm for a related P-complete problem. These results show that some P-complete problems have efficient cost optimal parallel algorithms.
2
Parameterized convex layers
First we give some definitions for convex layers. Definition 1 (Convex layers). Let S be a set of n points in the Euclidean plane. The convex layers is a problem to compute a set of convex hulls, {CH0 , CH1 , . . . , CHm−1 }, which satisfies the following two conditions. (1) CH0 ∪ CH1 ∪ . . . ∪ CHm−1 = S. (2) Each CHi (0 ≤ i ≤ m − 1) is a convex hull of a set of points CHi ∪ CHi+1 ∪ . . . ∪ CHm−1 . 2 Dessmark et al.[5] proved P -completeness of the convex layers problem, and Chazelle[1] proposed an optimal sequential algorithm which runs in O(n log n) time. The sequential algorithm is time optimal because computation of a convex hull, which is the first hull of convex layers, requires Ω(n log n) time[13]. In this paper, we consider an additional parameter d for the problem, and restrict its input points on d horizontal lines. Definition 2 (Convex layers for d lines). The convex layers for d lines is a convex layers problem whose input points are on d horizontal lines. 2 The parameter d is at most n if there is no restrictions for positions of input points. In the following, CL(d) denotes the convex layers for d lines problem. We can solve the problem sequentially in O(n log n) time using the algorithm[1], and prove the lower bound Ω(n log n) by reduction from the sorting. We can prove the following theorem for the problem CL(d). (We omit the proof because of space limitation. The proof is described in [7].) Theorem 1. The problem CL(n ) with 0 < ≤ 1 is P -complete.
2
Next we propose a cost optimal parallel algorithm for CL(d).

Algorithm for computing CL(d)
Input: A set of points {u_0, u_1, ..., u_{n−1}} on lines {l_0, l_1, ..., l_{d−1}}.
Step 1: Set variables TOP = 0 and BOT = d − 1. (l_TOP and l_BOT denote the top and bottom lines, respectively.) Compute the set of points on each line l_i (0 ≤ i ≤ d − 1), and store them in a double-ended queue Q_i in order of x coordinates.
Step 2: For each line l_i (TOP ≤ i ≤ BOT), compute the leftmost point u^i_left and the rightmost point u^i_right.
Step 3: Let U_left and U_right denote the sets of points {u^TOP_left, u^{TOP+1}_left, ..., u^BOT_left} and {u^TOP_right, u^{TOP+1}_right, ..., u^BOT_right}, respectively. Compute a left hull of U_left and a right hull of U_right, and store the obtained points on each hull in CH_left and CH_right, respectively. (The left hull of U_left consists of the points on a convex hull of U_left which lie from u^BOT_left to u^TOP_left in clockwise order. The right hull of U_right is defined similarly.)
Step 4: Remove the points in Q_TOP, Q_BOT, CH_left and CH_right as the outermost convex hull.
Step 5: Compute the top and bottom lines on which there is at least one point. Set TOP and BOT to the obtained top and bottom lines, respectively.
Step 6: Repeat Steps 2, 3, 4 and 5 until no point remains.

We discuss the complexities of the above parallel algorithm on the EREW PRAM. We use at most p processors (1 ≤ p ≤ d) in the algorithm except for Step 1. Step 1 takes O(n log n / p + log n) time using Cole's merge sort [4] and primitive operations, and Step 2 takes O(d/p) time obviously. We can compute the left hull and the right hull in Step 3 using a known parallel algorithm [2, 3] for computing a convex hull of sorted points. The algorithm runs in O(d/p + log d) time for each hull. Step 4 takes O(d/p) time to remove the points. (Points in Q_TOP and Q_BOT are automatically removed by changing TOP and BOT in Step 5.) We can compute the top and bottom lines in Step 5 in O(d/p + log d) time using a basic parallel algorithm computing the maximum and the minimum. Since the number of repetitions of Step 6 is ⌈d/2⌉, we can compute CL(d) in O(n log n / p + log n + (d/p + log d) × ⌈d/2⌉) = O(n log n / p + d²/p + d log d) time, and obtain the following theorem.

Theorem 2. We can solve CL(d) in O(n log n / p + d²/p + d log d) time using p processors (1 ≤ p ≤ d) on the EREW PRAM.
We can show that the class of the problem changes according to the number of lines d from the above complexity. (Details are omitted.)

Corollary 1. We can solve CL((log n)^k), where k is a positive constant, in O(log n log log n) time using n processors on the EREW PRAM; that is, CL((log n)^k) is in NC.

Corollary 2. We can solve CL(n^ε) with 0 < ε ≤ 1/2 in O(n log n / p) time using p processors (1 ≤ p ≤ n^ε) on the EREW PRAM.
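For illustration, here is a direct layer-peeling sketch of convex layers using a standard monotone-chain hull; note that this is the generic peeling scheme of Definition 1, not the line-structured CL(d) algorithm analysed above, and all names are our own.

```python
def cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns the hull vertices in CCW order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def convex_layers(points):
    """Peel off the outermost hull until no point remains (Definition 1)."""
    remaining, layers = set(points), []
    while remaining:
        hull = convex_hull(remaining)
        layers.append(hull)
        remaining -= set(hull)
    return layers

# Square with a diamond strictly inside it: two layers.
print(convex_layers([(0, 0), (4, 0), (4, 4), (0, 4), (2, 1), (2, 3), (1, 2), (3, 2)]))
```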
3
Lexicographically first maximal 3 sums
We first define the lexicographically first maximal 3 sums problem as follows. Definition 3 (Lexicographically first maximal 3 sums). Let I be a set of n distinct integers. The lexicographically first maximal 3 sums is a problem to compute the set of 3 integers LF M 3S = {(a0 , b0 , c0 ), (a1 , b1 , c1 ), . . ., (am−1 , bm−1 , cm−1 )}, which satisfies the following three conditions. 1. The set S = {a0 , b0 , c0 , a1 , b1 , c1 , . . . , am−1 , bm−1 , cm−1 } is a subset of I. 2. Let si = {ai , bi , ci } (0 ≤ i ≤ m − 1). Then, (ai , bi , ci ) is the lexicographically first set of 3 integers which satisfies ai +bi +ci = 0 for I −(s0 ∪s1 ∪. . .∪si−1 ). 3. There is no set of three integers (a0 , b0 , c0 ) which satisfies a0 , b0 , c0 ∈ I − S and a0 + b0 + c0 = 0. 2 Next we prove P -completeness of LFM3S. We show reduction from the lexicographically first maximal independent set (LFMIS) problem to LFM3S. Let G = (V, E) be an input graph for LFMIS. We assume that all vertices in V = {v0 , v1 , . . . , vn−1 } are ordered, that is, vi is less than vj if i < j. In [11], Miyano proved the following lemma for LFMIS. Lemma 1. The LFMIS restricted to graphs with degree at most 3 is P -complete. 2 Using the above lemma, we can prove the P -completeness of LFM3S. (Details are described in [7].) Theorem 3. The problem LFM3S is P -complete. (Outline of proof ) It is obvious that LFM3S is in P . Let G = (V, E) with V = {v0 , v1 , . . . , vn−1 } be an input graph with degree at most 3. First we define a vertex value V V (i) for each vertex vi . The vertex value is a negative integer and defined as V V (i) = i − n. Thus vertices v0 , v1 , . . . , vn−1 have vertex values −n, −(n − 1), . . . , −1 respectively. We also difine a key set of integers Q = {q0 , q1 , . . . , q12 } = {−64, −61, −32, −31, −29, −15, −14, −13, −10, −8, 23, 46, 93}. Using the vertex value and the key set, we define the following 4-tuples for each vertex vi in V (0 ≤ i ≤ n − 1) as inputs for LFM3S. 1. Vertex tuple for vi : V T (i) = [V V (i), q0 , V V (i), 0] 2. Auxiliary tuples for vi : (a) AT1 (i) = [V V (i), q1 , 0, V V (i)] (b) AT2 (i) = [V V (i), q2 , V V (i), 0] (c) AT3 (i) = [V V (i), q3 , 0, V V (i)] (d) AT4 (i) = [V V (i), q4 , 0, V V (i)] (e) AT5 (i) = [V V (i), q5 , V V (i), 0] (f) AT6 (i) = [V V (i), q6 , 0, V V (i)] (g) AT7 (i) = [V V (i), q7 , 0, V V (i)] (h) AT8 (i) = [V V (i), q8 , V V (i), 0] (i) AT9 (i) = [V V (i), q9 , 0, V V (i)] (j) AT10 (i) = [2 ∗ |V V (i)|, q10 , |V V (i)|, |V V (i)|] (k) AT11 (i) = [2 ∗ |V V (i)|, q11 , |V V (i)|, |V V (i)|] (l) AT12 (i) = [2 ∗ |V V (i)|, q12 , |V V (i)|, |V V (i)|]
3. Link tuples for vi : For each adjacent vertex vj of vi , which satisfies i < j, add one of the following tuples. (a) LT1 (i, j) = [|V V (i)| + |V V (j)|, |q0 | + |q1 |, |V V (j)|, |V V (i)|] (b) LT2 (i, j) = [|V V (i)| + |V V (j)|, |q0 | + |q3 |, |V V (j)|, |V V (i)|] (c) LT3 (i, j) = [|V V (i)| + |V V (j)|, |q0 | + |q7 |, |V V (j)|, |V V (i)|] (In case that vi has only one adjacent vertex vj which satisfies i < j, add LT1 (i, j) for vj . In case that vi has the two adjacent vertices vj1 , vj2 , add LT1 (i, j1 ) and LT2 (i, j2 ) for each vertex. In case that vi has the three adjacent vertices, add all three tuples similarly.) The above 4-tuples have the following special feature. Let {V T (i), AT1 (i), AT2 (i), . . . , AT12 (i), LT1 (i, s), LT2 (i, t), LT3 (i, u), V T (s), V T (t), V T (u)} be the input for LFM3S1 . (We assume vs , vt and vu are adjacent vertices which satisfy i < s < t < u.) Then the solution of LFM3S is as follows. (We call the solution TYPE A sums.) {(V T (i), AT4 (i), AT12 (i)), (AT2 (i), AT6 (i), AT11 (i)), (AT5 (i), AT9 (i), AT10 (i)), (AT1 (i), V T (s), LT1 (i, s)), (AT3 (i), V T (t), LT2 (i, t)), (AT7 (i), V T (u), LT3 (i, u))} Note that vertex tuples, V T (s), V T (t) and V T (u), are in the sums. In other words, the above vertex tuples are not in the remaining inputs after the computation. Next, we consider the solution without V T (i) in the input. (We call the solution TYPE B sums.) {(AT1 (i), AT2 (i), AT12 (i)), (AT3 (i), AT5 (i), AT11 (i)), (AT7 (i), AT8 (i), AT10 (i))} In this case, the vertex tuples, V T (s), V T (t) and V T (u), remain in the inputs. We give the above 4-tuples for all vertices in V of LFMIS, and compute LFM3S. Then the vertex vi ∈ V is in the solution of LFMIS if and only if there exists a sum of three 4-tuples (T1 , T2 , T3 ) which satisfies T1 = V T (i) in the solution of LFM3S. (Proof of correctness is omitted.) It is easy to see that the above reduction is in N C. Although we define that inputs of LF M 3S are distinct integers, inputs of the above reduction are 4-tuples. We can easily reduce each 4-tuple to an integer without loss of the features. Let 2g ≤ n < 2g+1 and h = max{g, 6}. Then we can reduce each 4-tuple [α0 , α1 , α2 , α3 ] to α0 ∗ 23(h+1) + (α1 − 65) ∗ 22(h+1) + α2 ∗ 2h+1 + α3 . 2 Finally, we consider a parallel algorithm for LFM3S on the CREW PRAM. We can propose a sequential algorithm which solves LFM3S in O(n2 ) by modifying an algorithm computing the 3 sum problem[8]. The algorithm is the known fastest sequential algorithm for LFM3S. Note that strict lower bound of LFM3S is not known. However the 3 sum has no o(n2 ) algorithm and has an Ω(n2 ) lower bound on a weak model of computation[6]. Algorithm for computing LF M 3S Input: A set of n integers I. 1
The sum of tuples A = [α0 , α1 , α2 , α3 ] and B = [β0 , β1 , β2 , β3 ] is defined as A + B = [α0 + β0 , α1 + β1 , α2 + β2 , α3 + β3 ], and A < B if A is lexicographically less than B. We assume that the sum is zero if the sum of tuples is [0, 0, 0, 0].
Parallelizability of Some P-Complete Problems
121
Step 1: Sort all elements in I. (Let S = (s0 , s1 , . . . , sn−1 ) be the sorted sequence.) Step 2: Repeat the following substeps from i = 0 to i = n − 3. 0 (2-1) Create the following two sorted sequences S 0 and SR from S. 0 S 0 = (si+1 , si+2 , . . . , sn−1 ), SR = (−sn−1 − si , −sn−2 − si , . . . , −si+1 − si ) 0 (For b ∈ S 0 and c ∈ SR which satisfy b = sg and c = −sh − si respectively, b = c if and only if si + sg + sh = 0.) 0 (2-2) Merge S 0 and SR into a sorted sequence SS = (ss0 , ss1 , . . . , ss2(n−i−1)−1 ). (2-3) Compute the smallest element ssj in SS which satisfies ssj = ssj+1 . (2-4) If the above ssj is obtained, compute sg and sh in S such that sg = ssj and 0 sh = −sg − si , respectively. (It is obvious that sg ∈ S 0 and −sg − si ∈ SR since all elements in S are distinct.) Delete si , sg , sh from I, and output (si , sg , sh ), whenever they exist.
We assume the number of processors p is restricted to 1 ≤ p ≤ n. We can n sort n elements in O( n log + log n) time using Cole’s merge sort[4] in Step 1. In p Step 2, we can compute a substep (2-1) in O( np ) time easily. We can compute 2
substeps (2-3) and (2-4) in O( np + log n) time using simple known algorithms and basic operations. In a substep (2-2), we can merge two sorted sequence in O( np +log log n) time using a fast merging algorithm[10]. Since repetition of Step 2 is O(n), we obtain the following theorem. 2
Theorem 4. We can solve LF M 3S in O( np + n log n) time using p processors (1 ≤ p ≤ n) on the CREW PRAM. 2 2
In the case of 1 ≤ p ≤ logn n , the time complexity becomes O( np ). Therefore the above algorithm is cost optimal for 1 ≤ p ≤ logn n . As generalization of LFM3S, we can also obtain the similar results for the following problem. Definition 4 (Lexicographically first maximal set of 3 arguments (LFMS3A)). Let E be a totally ordered set of n elements. The lexicographically first maximal set of 3 arguments is a problem to compute the set of 3 elements LF M S3A = {(a0 , b0 , c0 ), (a1 , b1 , c1 ), . . ., (am , bm , cm )}, which satisfies the following three conditions for a given function f (x, y, z) whose value is T RU E or F ALSE. 1. The set S = {a0 , b0 , c0 , a1 , b1 , c1 , . . . , am , bm , cm } is a subset of E. 2. Let ei = {ai , bi , ci } (0 ≤ i ≤ m). Then, (ai , bi , ci ) is the lexicographically first set of 3 elements which satisfies f (ai , bi , ci ) = T RU E for I − (e0 ∪ e1 ∪ . . . ∪ ei−1 ). 3. There is no set of three elements (a0 , b0 , c0 ) which satisfies a0 , b0 , c0 ∈ I − S and f (a0 , b0 , c0 ) = T RU E. 2 Corollary 3. The problem LFMS3A is P -complete. Theorem 5. We can solve LF M S3A with an unresolvable function f in n log n) time using p processors (1 ≤ p ≤ n2 ) on the CREW PRAM.
2 3 O( np +
2
122
4
A. Fujiwara, M. Inoue, and T. Masuzawa
Conclusions
In this paper, we proved that two problems are P -complete, and proposed cost optimal algorithms for the problems. The results imply that some P -complete problems are parallelizable within the reasonable number of processors. In the future research, we investigate other parallelizable P -complete problems. The result may imply new classification of problems in P . Another future topic is proposition of fast parallel algorithms which run in O(n ) time where 0 < < k for P -complete problems. Only a few P -complete problems are known to have such algorithms[12].
References 1. B. Chazelle. On the convex layers of a planar set. IEEE Transactions on Information Theory, IT-31(4):509–517, 1985. 2. D. Z. Chen. Efficient geometric algorithms on the EREW PRAM. IEEE transactions on parallel and distributed systems, 6(1):41–47, 1995. 3. W. Chen. Parallel Algorithm and Data Structures for Geometric Problems. PhD thesis, Osaka University, 1993. 4. R. Cole. Parallel merge sort. SIAM Journal of Computing, 17(4):770–785, 1988. 5. A. Dessmark, A. Lingas, and A. Maheshwari. Multi-list ranking: complexity and applications. In 10th Annual Symposium on Theoretical Aspects of Computer Science (LNCS665), pages 306–316, 1993. 6. J. Erickson and R. Seidel. Better lower bounds on detecting affine and spherical degeneracies. In 34th Annual IEEE Symposium on Foundations of Computer Science (FOCS ’93), pages 528–536, 1993. 7. A. Fujiwara, M. Inoue, and M. Toshimitsu. Practical parallelizability of some Pcomplete problems. Technical Report of IPSF, Vol. 99, No. 72 (AL-69-2), September 1999. 8. A. Gajentaan and M. H. Overmars. On a class of O(n2 ) problems in computational geometry. Computational geometry, 5:165–185, 1995. 9. R. Greenlaw, H.J. Hoover, and W.L. Ruzzo. Limits to Parallel Computation: PCompleteness Theory. Oxford university press, 1995. 10. C. Kruskal. Searching, merging and sorting in parallel computation. IEEE Transactions on Computers, C-32(10):942–946, 1983. 11. S. Miyano. The lexicographically first maximal subgraph problems: P -completeness and N C algorithms. Mathematical Systems Theory, 22:47–73, 1989. 12. J.S. Vitter and R.A. Simons. New classes for parallel complexity: A study of unification and other complete problems for P . IEEE Transactions of Computers, C-35(5):403–418, 1986. 13. A. C. Yao. A lower bound to finding convex hulls. Journal of the ACM, 28(4):780– 787, 1981.
A New Computation of Shape Moments via Quadtree Decomposition ? Chin-Hsiung Wu1 , Shi-Jinn Horng1;2 , Pei-Zong Lee2 , Shung-Shing Lee3 , and Shih-Ying Lin3 1
National Taiwan University of Science and Technology, Taipei, Taiwan, R. O. C. 2
[email protected] Institute of Information Science, Academia Sinica, Taipei, Taiwan, R. O. C. 3 F ushin Institute of Technology and Commerce, I-Lain, Taiwan, R. O. C.
The main contribution of this paper is in designing an optimal and/or optimal speed-up algorithm for computing shape moments. We introduce a new technique for computing shape moments. The new technique is based on the quadtree representation of images. We decompose the image into squares, since the moment computation of squares is easier than that of the whole image. The proposed sequential algorithm reduces the computational complexity signi cantly. By integrating the adv an tages of both optical transmission and electronic computation, the proposed parallel algorithm can be run in O(1) time. In the sense of the product of time and the number of processors used, the proposed parallel algorithm is time and cost optimal and achieves optimal speed-up. Abstract.
1
Introduction
Moments are widely used in image analysis, pattern recognition and low-level computer vision [6]. The computation of moments of a tw o-dimensional (2-D) image involves a signi cant amount of multiplications and additions in a direct method. Previously, some fast algorithms for computing moments had been proposed using various computation methods [2, 3, 5, 8, 14, 15]. F or an N N binary image, Chung [2] presented a constant time algorithm for computing the horizontal/v ertical convex shape's moments of order up to 3 on an N N recon gurable mesh. Chung's algorithm is unsuitable for complicated objects. In this paper, we will develop a more eÆcient algorithm to overcome the disadvantage of Chung's algorithm. The array with a recon gurable optical bus system is de ned as an array of processors connected to a recon gurable optical bus system whose con guration can be dynamically changed by setting up the local switches of each processor, and messages can be transmitted concurrently on a bus in a pipelined fashion. ?
This work was partially supported by the National Science Council under the contract no. NSC-89-2213-E011-007. Part of this work was carried out when the second author was visiting the Institute of Information Science, Academia Sinica, Taipei, T aiw an, July - Decem ber 1999.
J. Rolim et al. (Eds.): IPDPS 2000 Workshops, LNCS 1800, pp. 123-129, 2000. Springer-Verlag Berlin Heidelberg 2000
124
C.-H. Wu et al.
More recently, two related models have been proposed, namely the array with recon gurable optical buses (AROB) [10] and linear array with a recon gurable pipelined bus system (LARPBS) [9]. The AROB model is essentially a mesh using the basic structure of a classical recon gurable network (LRN) [1] and optical technology. A 2-D AROB of size M N , denoted as 2-D M N AROB, contains M N processors arranged in a 2-D grid. The processor with index (i1 ; i0 ) is denoted by Pi1 ; i0 . For more details on the AROB, see [10]. The main contribution of this paper is in designing an optimal speed-up algorithm for computing the 2-D shape moments. The idea of our algorithm is based on the summation of the contribution of each quadtree node where each quadtree node represents a square region. We rst represent the image by quadtree decomposition. After that, the image is divided into squares. Then we derive the relationship between the quadtree and the computation of shape moments. Finally, using this representation, an eÆcient sequential algorithm (SM) and an optimal parallel algorithm (PSM) for shape moment computations are developed. For a constant c, c 1, the proposed algorithm PSM can be run 1 in O(1) time using N N 1+ c processors when the input image is complicated. If the image is simple (i.e., the image can be represented by a few quadtree nodes), the proposed algorithm PSM can be run in O(1) time using N N processors. In the sense of the product of time and the number of processors used, the proposed algorithm PSM is time and cost optimal and achieves optimal speed-up.
2
Basic Data Manipulation Operations
Given N integers ai with 0 ai < N , 0 i < N , let sum stand for
Xa:
N 1
(1)
i
i=0
For computing Eq. (1), Pavel and Akl [11] proposed an O(1) time algorithm on a 2-D N log N AROB. In the following, we will use another approach1 to design a more exible algorithm for this problem on a 1-D AROB using N 1+ c processors, where c is a constant and c 1. Since ai < N and 0 i < N , each digit has a value ranging from 0 to ! 1 for the radix-! system and a ! -ary representation m3 m2 m1 m0 is equal to m0 !0 + m1 !1 + m2 !2 + m3 !3 . The maximum of sum is at most N (N 1). With this approach, ai and sum are equivalent to
ai =
Xm
T 1
i; k
!k ;
(2)
XS
!l ;
(3)
k=0
sum =
U 1 l=0
l
where T = blog! N c + 1, 0 i < N , U = blog! N (N Sl < !.
1)c + 1, and 0 mi; k ,
A New Computation of Shape Moments via Quadtree Decomposition
P P
P P
125
1 T 1 m !k = T 1 N 1 m !k , let d be the sum of As sum = N k k=0 i=0 i; k k=0 i; k i=0 N coeÆcients mi; k , 0 i < N , which is de ned as
dk =
Xm
N 1 i=0
i; k ;
(4)
where 0 k < T . Then sum can be also formulated as
sum =
Xd
T 1 k=0
k
!k ;
(5)
where 0 dk < !N . Let C0 = 0 and du = 0, T u < U . The relationship between Eqs. (3) and (5) is described by Eqs. (6)-(8).
    e_t = C_t + d_t ,   0 ≤ t < U,    (6)
    C_{t+1} = e_t div ω ,   0 ≤ t < U,    (7)
    S_t = e_t mod ω ,   0 ≤ t < U,    (8)

where e_t is the sum at the t-th digit position and C_t is the carry into the t-th digit position. Hence, S_t of Eq. (8) corresponds to the coefficient of sum in Eq. (3) under the radix-ω system. Since the carry into the t-th digit position is not greater than N, we have C_t ≤ N, 0 ≤ t < U. Since sum ≤ N(N-1), the number of digits representing sum under radix ω is not greater than U, where U = ⌊log_ω N(N-1)⌋ + 1. Therefore, instead of computing Eq. (1) directly, we first compute the coefficients m_{i,k} for each a_i; then each S_t can be computed by Eqs. (4) and (6)-(8); finally, sum can be computed by Eq. (3). For more details, see [13].

Lemma 1. N integers, each of O(log N) bits, can be added in O(1) time on a 1-D N^{1+1/c} AROB for a constant c, c ≥ 1.
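To make the digit-decomposition scheme of Eqs. (2)-(8) concrete, here is a small sequential Java sketch. The input values and the radix ω = 4 are made up for illustration, and on the AROB the column sums d_k would of course be formed in parallel rather than in a loop.

public class RadixSumSketch {
    public static void main(String[] args) {
        int[] a = {7, 3, 5, 2, 6, 0, 4, 1};      // example inputs, 0 <= a_i < N
        int omega = 4;                            // radix, a free parameter
        int N = a.length;
        int T = (int) (Math.log(N) / Math.log(omega)) + 1;                // digits per a_i
        int U = (int) (Math.log((double) N * (N - 1)) / Math.log(omega)) + 1;

        long[] d = new long[U];                   // d_k: column sums of the digits, Eq. (4)
        for (int i = 0; i < N; i++) {
            int v = a[i];
            for (int k = 0; k < T; k++) { d[k] += v % omega; v /= omega; }  // m_{i,k}
        }
        long carry = 0, sum = 0, pow = 1;         // Eqs. (6)-(8): resolve carries digit by digit
        for (int t = 0; t < U; t++) {
            long e = carry + d[t];                // e_t
            carry = e / omega;                    // C_{t+1}
            long S = e % omega;                   // S_t
            sum += S * pow;                       // accumulate Eq. (3)
            pow *= omega;
        }
        System.out.println("sum = " + sum);       // prints 28 for the example above
    }
}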
Consequently, given an N × N integer matrix whose entries are each of O(log N) bits, the sum of these N^2 integers can be computed by the following three steps. First, apply Lemma 1 to compute the partial sum of each row in parallel. Then, route the partial sums located in the first column to the first row. Finally, apply Lemma 1 to accumulate these N partial sums.

Lemma 2. N^2 integers, each of O(log N) bits, can be added in O(1) time on a 2-D N × N^{1+1/c} AROB for a constant c, c ≥ 1.

3 The Quadtree Decomposition
The quadtree is constructed by recursively decomposing the image into four equal-sized quadrants in a top-down fashion. Given an N × N image (N = 2^d for some d), its quadtree representation is a tree of degree four which can be
defined as follows. The root node of the tree represents the whole image. If the whole image has only one color, we label the root node with that color and stop; otherwise, we add four children to the root node, representing the four quadrants of the image. We apply this process recursively to each of the four nodes. If a block has a constant color, then its corresponding node is a leaf node; otherwise, the node has four children. Recently, Lee, Horng et al. [7] presented a constant-time quadtree building algorithm for a given image based on a specified space-filling order.

Lemma 3. [7] The quadtree of an N × N image can be constructed in constant time on an N × N AROB.
Let the data structure of a quadtree node consist of four fields r, c, I and sz. The row and column coordinates of the top-left corner of the quadtree node are represented by r and c, its image color is represented by I, and sz represents the index of the block size of the node: if the block size is 4^s, then sz is s. For a binary image, the third field I can be omitted. In this paper, only the leaves of the quadtree which represent black blocks are useful for computing shape moments; the non-terminal nodes are omitted.
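As an illustration of this node structure and of the decomposition itself, the following Java sketch builds the list of black leaves (r, c, sz) top-down for a small example image. It is a plain recursive sketch for exposition only, not the constant-time AROB construction of Lemma 3; the example image is made up.

import java.util.*;

public class QuadtreeSketch {
    record Leaf(int r, int c, int sz) {}          // top-left corner; block size is 4^sz (side 2^sz)

    static void decompose(int[][] img, int r, int c, int len, List<Leaf> leaves) {
        boolean allZero = true, allOne = true;
        for (int i = r; i < r + len; i++)
            for (int j = c; j < c + len; j++) {
                if (img[i][j] == 0) allOne = false; else allZero = false;
            }
        if (allOne) {                              // uniform black block becomes one leaf
            leaves.add(new Leaf(r, c, Integer.numberOfTrailingZeros(len)));
            return;
        }
        if (allZero || len == 1) return;           // white leaves are not needed for moments
        int h = len / 2;                           // otherwise split into four quadrants
        decompose(img, r,     c,     h, leaves);
        decompose(img, r,     c + h, h, leaves);
        decompose(img, r + h, c,     h, leaves);
        decompose(img, r + h, c + h, h, leaves);
    }

    public static void main(String[] args) {
        int[][] img = {                            // a 4x4 example image
            {1, 1, 0, 0},
            {1, 1, 0, 1},
            {0, 0, 1, 1},
            {0, 0, 1, 1}};
        List<Leaf> leaves = new ArrayList<>();
        decompose(img, 0, 0, img.length, leaves);
        System.out.println(leaves);                // the black blocks only
    }
}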
4 Computing Shape Moments

For a 2-D digital image A = a(x, y), 1 ≤ x, y ≤ N, the moment of order (p, q) is defined as
    m_{pq} = \sum_{x=1}^{N} \sum_{y=1}^{N} x^p y^q a(x, y) ,    (9)

where a(x, y) is an integer representing the intensity function (gray level or binary value) at pixel (x, y). The Delta algorithm [15] and Chung's algorithm [2] are based on summing the contribution of each row; ours is based on summing the contribution of each quadtree node, where each quadtree node represents a square region. For an object represented by a quadtree with λ (black) leaves, exactly λ non-overlapping squares, Q_1, Q_2, ..., Q_λ, are defined. From the definition of moments, computing the double summation in Eq. (9) over a square is easier than over an arbitrary shape; thus, compared to a direct method, the computational complexity can be reduced significantly. Since the double summation in Eq. (9) is a linear operation, the moments of the whole object can be derived from the sums of the moments of these squares. The (p, q)-th order moments of these squares can be computed as follows. From the data structure of the quadtree nodes, we can easily find the locations of the four corners of the corresponding square. For a square Q_i, assume the coordinates of its top-left corner are (r, c) and its size is 4^s. Let u = 2^s denote the length of each side of the square. Then the coordinates of the other three corners of Q_i are (r+u-1, c), (r, c+u-1) and (r+u-1, c+u-1), respectively.
For a binary digital image, the moment computation of a quadtree node Q_i reduces to the separable computation

    m_{pq,i} = \sum_{x=r}^{r+u-1} x^p \sum_{y=c}^{c+u-1} y^q = \sum_{x=r}^{r+u-1} x^p h_{q,i} = h_{q,i} \sum_{k=0}^{u-1} (r+k)^p = g_{p,i} h_{q,i} ,    (10)
where g_{p,i} and h_{q,i} are the order-p and order-q moments along dimension x and dimension y, respectively; they are defined as

    g_{p,i} = \sum_{x=r}^{r+u-1} x^p = \sum_{k=0}^{u-1} (r+k)^p ,
    h_{q,i} = \sum_{y=c}^{c+u-1} y^q = \sum_{k=0}^{u-1} (c+k)^q .    (11)
Similarly, the corresponding moments of the other quadtree nodes can be obtained from Eqs. (10)-(11) by replacing r, c and u with their corresponding values, since those nodes are also represented as squares. Thus, the 2-D shape moments of order (p, q) can be obtained by summing up the corresponding moments of all square regions:

    m_{pq} = \sum_{i=1}^{λ} m_{pq,i} .    (12)
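The following Java sketch evaluates Eqs. (10)-(12) directly from a list of black leaves given by their top-left corners (in the 1-based image coordinates of Eq. (9)) and side lengths. The two example leaves are made up for illustration.

public class QuadtreeMomentsSketch {
    // g_{p,i} resp. h_{q,i}: one-dimensional moment of a run of u consecutive coordinates
    static double lineMoment(int start, int u, int p) {
        double s = 0.0;
        for (int k = 0; k < u; k++) s += Math.pow(start + k, p);
        return s;
    }

    // m_{pq} = sum over leaves i of g_{p,i} * h_{q,i}  (Eqs. (10) and (12))
    static double moment(int[] r, int[] c, int[] u, int p, int q) {
        double m = 0.0;
        for (int i = 0; i < r.length; i++)
            m += lineMoment(r[i], u[i], p) * lineMoment(c[i], u[i], q);
        return m;
    }

    public static void main(String[] args) {
        // two example leaves: a 2x2 block at (1,1) and a 1x1 block at (4,3)
        int[] r = {1, 4}, c = {1, 3}, u = {2, 1};
        System.out.println("m00 = " + moment(r, c, u, 0, 0));   // area = 4 + 1 = 5
        System.out.println("m10 = " + moment(r, c, u, 1, 0));   // sum of x over black pixels
        System.out.println("m01 = " + moment(r, c, u, 0, 1));
        System.out.println("m11 = " + moment(r, c, u, 1, 1));
    }
}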
Let us conclude this section by stating a sequential algorithm for computing shape moments from the above derivations.

Algorithm SM;
1: For each quadtree node Q_i, compute the 2-D shape moments m_{pq,i}, 1 ≤ i ≤ λ, according to Eqs. (10)-(11).
2: Compute the 2-D shape moments m_{pq} by summing up m_{pq,i}, 1 ≤ i ≤ λ, according to Eq. (12).

Theorem 1. Given an N × N binary image A, the 2-D shape moments up to order 3 can be computed in O(λ) time on a uniprocessor, where λ is the number of quadtree nodes.
Proof: The correctness of this algorithm follows directly from Eqs. (9)-(12). The time complexity is analyzed as follows. Steps 1 and 2 each take O(λ) time, where λ is the number of quadtree nodes. Hence, the time complexity is O(λ).
If we consider an N × N binary image all of whose pixels are 1-valued, the comparison of the computational complexity in computing all the moments of order up to p + q ≤ 3 is shown in Table 1. From Table 1, we see that the proposed method reduces the amount of computation significantly. In addition to the computing operations shown in Table 1, contour following, which needs a few comparison operations per pixel, is required by all the non-direct methods to identify the shape of all objects, and it takes O(N^2) time. Our algorithm also needs preprocessing time to create the quadtree nodes for the given image; this can be done in O(N^2) time by the optimal quadtree construction algorithm proposed by Shaffer and Samet [12].
Table 1: Comparison of computational complexity for shape moment methods.

    Method           Direct [6]   Delta [15]   Green's [8]   Integral [3]   This paper
    Multiplication   20N^2        25N          0             8N             8
    Addition         10N^2        N^2 + 6N     128N          22N            22
5 Parallel Moment Computation Algorithm
From Eqs. (9)-(12), the algorithm for computing the 2-D shape moments m_{pq} consists of the following three steps. First, build the quadtree for the given image. Then, for each quadtree node, compute its corresponding 2-D shape moments by multiplying the two one-dimensional moments derived from Eqs. (10)-(11). Finally, the 2-D shape moments are obtained by summing up the corresponding moments computed in Step 2. Initially, assume that the given image A is stored in the local variable a(i, j) of processor P_{i,j}, 1 ≤ i, j ≤ N. Finally, the results are stored in the local variable m_{pq}(1, 1) of processor P_{1,1}. Following the definitions of moments and quadtrees, and the relationship between them, the detailed moment computation algorithm (PSM) is listed as follows.

Algorithm PSM;
1: Apply Lemma 3 to build the quadtree for the given image. After that, the results Q_i, 1 ≤ i ≤ λ, are stored in the local variable Q(x, y) of processor P_{x,y}, where i = xN + y.
2: // For each quadtree node, compute its 2-D shape moments. //
2.1: For each quadtree node Q_i, 1 ≤ i ≤ λ, compute its 1-D shape moments g_p(x, y) and h_q(x, y) for dimension x and dimension y, respectively, according to Eq. (11).
2.2: For each quadtree node Q_i, 1 ≤ i ≤ λ, compute its 2-D shape moments by evaluating Eq. (10) (i.e., m_{pq,i}(x, y) = g_p(x, y) · h_q(x, y)).
3: Compute the 2-D shape moments m_{pq} by summing up m_{pq,i}, 1 ≤ i ≤ λ, using Lemma 1 or 2 according to the value of λ. After that, the 2-D moments m_{pq} are stored in the local variable m_{pq}(1, 1) of processor P_{1,1}.
Theorem 2. Given an N × N binary image A, the 2-D shape moments up to order 3 can be computed in O(1) time, either on an N × N AROB if A is simple (i.e., λ is bounded by O(N)), or on an N × N^{1+1/c} AROB for a constant c, c ≥ 1, if A is complicated.
Proof: The time complexity is analyzed as follows. Step 1 takes O(1) time using N × N processors by Lemma 3. Step 2 takes O(1) time. Step 3 takes O(1) time using N × N or N × N^{1+1/c} processors by Lemma 1 or 2. Hence, the time complexity is O(1). For computing higher-order shape moments, Steps 2 and 3 are repeated max{p, q} times. If both p and q are constant, then the expression for g_{p,i} (or h_{q,i}) defined in Eq. (11) has a constant number of terms, each with a constant power. Therefore, the results of Theorem 2 can be extended.
6 Concluding Remarks
In this paper, we introduce a new technique based on quadtree decomposition for computing shape moments. The quadtree decomposition divides the image into squares, where the number of squares depends on the image complexity. In most applications, the N × N image can be decomposed into O(N) squares by quadtree decomposition. As a result, the shape moments can be parallelized and computed in O(1) time on an N × N AROB.
References
1. Ben-Asher, Y., Peleg, D., Ramaswami, R., Schuster, A.: The Power of Reconfiguration. Journal of Parallel and Distributed Computing 13 (1991) 139-153
2. Chung, K.-L.: Computing Horizontal/Vertical Convex Shape's Moments on Reconfigurable Meshes. Pattern Recognition 29 (1996) 1713-1717
3. Dai, M., Batlou, P., Najim, M.: An Efficient Algorithm for Computation of Shape Moments from Run-length Codes or Chain Codes. Pattern Recognition 25 (1992) 1119-1128
4. Guo, Z., Melhem, R. G., Hall, R. W., Chiarulli, D. M., Levitan, S. P.: Pipelined Communications in Optically Interconnected Arrays. Journal of Parallel and Distributed Computing 12 (1991) 269-282
5. Hatamian, M.: A Real Time Two-dimensional Moment Generation Algorithm and Its Single Chip Implementation. IEEE Trans. ASSP 34 (1986) 546-553
6. Hu, M.-K.: Visual Pattern Recognition by Moment Invariants. IRE Trans. Inform. Theory IT-8 (1962) 179-187
7. Lee, S.-S., Horng, S.-J., Tsai, H.-R., Tsai, S.-S.: Building a Quadtree and Its Applications on a Reconfigurable Mesh. Pattern Recognition 29 (1996) 1571-1579
8. Li, B.-C., Shen, J.: Fast Computation of Moment Invariants. Pattern Recognition 24 (1991) 807-813
9. Pan, Y., Li, K.: Linear Array with a Reconfigurable Pipelined Bus System - Concepts and Applications. Information Sciences - An Int. Journal 106 (1998) 237-258
10. Pavel, S., Akl, S. G.: On the Power of Arrays with Reconfigurable Optical Bus. Proc. Int. Conf. Parallel and Distributed Processing Techniques and Applications (1996) 1443-1454
11. Pavel, S., Akl, S. G.: Matrix Operations Using Arrays with Reconfigurable Optical Buses. Parallel Algorithms and Applications 8 (1996) 223-242
12. Shaffer, C. A., Samet, H.: Optimal Quadtree Construction Algorithms. Computer Vision, Graphics, and Image Processing 37 (1987) 402-419
13. Wu, C.-H., Horng, S.-J., Tsai, H.-R.: Template Matching on Arrays with Reconfigurable Optical Buses. Proc. Int. Symp. Operations Research and its Applications (1998) 127-141
14. Yang, L., Albregtsen, F.: Fast and Exact Computation of Cartesian Geometric Moments Using Discrete Green's Theorem. Pattern Recognition 29 (1996) 1061-1073
15. Zakaria, M. F., Zsombor-Murray, P. J. A., Kessel, J. M. H. H.: Fast Algorithm for the Computation of Moment Invariants. Pattern Recognition 20 (1987) 639-643
The Fuzzy Philosophers

Shing-Tsaan Huang

Department of Computer Science and Information Engineering, National Central University, Chung-Li, Taiwan, R.O.C. E-mail: sthuang@csie.ncu.edu.tw
Consider a network of nodes; each node represents a philosopher; links represent the neighboring relationship among the philosophers. Every philosopher enjoys singing so much that, once getting the chance, he always sings a song within a finite delay. This paper proposes a protocol for the philosophers to follow. The protocol guarantees the following requirements: (1) No two neighboring philosophers sing songs simultaneously. (2) Along any infinite time period, each philosopher gets his chances to sing infinitely often. Following the protocol, each philosopher uses only one bit to memorize his state. Sometimes the philosophers may be fuzzy enough to forget the state. So, a self-stabilizing version of the protocol is also proposed to cope with this problem. However, the philosophers may then need additional bits to memorize their states.
1 Introduction

Consider a network of nodes; each node represents a philosopher; links represent the neighboring relationship among the philosophers. This paper proposes a protocol for the philosophers to follow. The protocol guarantees the following two requirements: (1) No two neighboring philosophers sing songs simultaneously. (2) Along any infinite time period, each philosopher gets his chances to sing infinitely often. Following the protocol, each philosopher only uses a boolean variable to memorize his state. Sometimes the philosophers may be fuzzy enough to forget the state. The fuzzy behavior of the philosophers is modeled as transient faults. A transient fault may perturb the values of the variables of a program, but not the constants and the program code. To cope with all kinds of possible transient faults, Dijkstra [] introduced the self-stabilizing (SS, in short) concept into computer systems. Provided that no more transient faults occur afterwards, an SS system must be able to stabilize eventually to states which fulfil the desired requirements, no matter what its current state is. Singing a song by the philosophers can be modeled as executing the critical section (CS, in short). The formulated problem is then closely related to the dining philosophers by Dijkstra [] and the drinking philosophers by Chandy and Misra [], although the dining philosophers and the drinking philosophers do not handle transient faults.
7KDW QR WZR QHLJKERULQJ SKLORVRSKHUV DUH DOORZHG WR H[HFXWH WKH &6
VLPXOWDQHRXVO\ LV WKH FRPPRQ UHTXLUHPHQW 7KH PDMRU LVVXH IDFHG LQ WKH SKLORVRSKHUV LQ IXOILOOLQJ WKH UHTXLUHPHQW LV WKH V\PPHWU\ SUREOHP ,W ZRXOG EH LPSRVVLEOH WR KDYH D GHWHUPLQLVWLF VROXWLRQ LI WKH V\VWHP LV LQ D VWDWH RI ZKLFK QR QRGH LV GLVWLQJXLVKDEOH IURP WKH RWKHUV
+HUH LQ WKLV SDSHU D VLPSOH DQG HOHJDQW DSSURDFK LV SURSRVHG ZKLFK
DOORZV D QRGH XVH RQO\ RQH ELW WR UHVROYH WKH FRQIOLFWV 7KH UHVXOW VKRXOG EH LQWHUHVWLQJ WR WKRVH ZKR PLJKW GHVLJQ GLVWULEXWHG SURWRFROV WR UHVROYH WKH FRQIOLFWV DPRQJ WKH UHTXHVWV IURP QHLJKERULQJ SURFHVVHV 7KHUH DUH WZR YHUVLRQV RI WKH SURSRVHG SURWRFRO
$SURWRFRO DQG %SURWRFRO
$
SURWRFRO KDV WKH 66 SURSHUW\ LI WKH QHWZRUN LV DF\FOLF EXW QRW RWKHUZLVH %SURWRFRO FDQ FRSH ZLWK WKH WUDQVLHQW IDXOWV LH LW LV DQ 66 SURWRFRO
3URYLGHG WKDW WKH
SKLORVRSKHUV DUH QRW IX]]\ DQ\ PRUH %SURWRFRO HYHQWXDOO\ JXDUDQWHHV WKH WZR UHTXLUHPHQWV +RZHYHU WKH SKLORVRSKHUV PD\ QHHG PRUH ERROHDQ YDULDEOHV WR PHPRUL]HG WKHLU VWDWHV $Q 66 SURWRFRO LV XVXDOO\ SUHVHQWHG LQ UXOHV (DFK UXOH KDV WZR SDUWV WKH JXDUG DQG WKH DFWLRQ 7KH JXDUG LV D ERROHDQ IXQFWLRQ RI WKH VWDWHV RI WKH QRGH DQG LWV QHLJKERUV ,I WKH JXDUG LV WUXH LWV DFWLRQ LV VDLG WR EH HQDEOHG DQG FDQ WKHQ EH H[HFXWHG ,Q SURYLQJ WKH FRUUHFWQHVV RI DQ 66 SURWRFRO WKH IROORZLQJ WKUHH DVVXPSWLRQV PD\ EH FRQVLGHUHG 6HULDO H[HFXWLRQ (QDEOHG DFWLRQV DUH H[HFXWHG RQH DW D WLPH
&RQFXUUHQW H[HFXWLRQ $Q\ QRQHPSW\ VXEVHW RI HQDEOHG DFWLRQV DUH H[HFXWHG DOO DW
D WLPH
'LVWULEXWHG H[HFXWLRQ $ QRGH PD\ UHDG WKH VWDWHV RI LWV QHLJKERUV DW VRPH GLIIHUHQW
WLPHV DQG HYDOXDWH LWV JXDUGV DQG H[HFXWH WKH HQDEOHG DFWLRQV DW D ODWHU PRPHQW $ GLVWULEXWHGFRUUHFW SURWRFRO LV DOVR FRQFXUUHQWFRUUHFW LQ WXUQ LV DOVR VHULDOFRUUHFW EXW QRW YLFH YHUVD %HFDXVH LW LV HDVLHU WR GHVLJQ DQG SURYH VHULDO SURWRFROV PRVW RI WKH 66 SURWRFROV>@ DUH GHVLJQ LQ VXFK D ZD\ 7KH UHVXOW UHSRUWHG LQ WKLV SDSHU LV LQVSLUHG E\ +DGGL[ >@
WKH DOWHUQDWRU
VWXGLHG E\ *RXGD DQG
2QH PDMRU GLIIHUHQFH EHWZHHQ WKHLU UHVXOW DQG WKH FXUUHQW RQH LV WKDW WKHLU
SURWRFRO VXSSRUWV FRUUHFW FRQFXUUHQW H[HFXWLRQ RI VHULDOFRUUHFW 66 SURWRFROV ZKHUHDV WKH SURSRVHG %SURWRFRO VXSSRUWV QRW RQO\ FRUUHFW FRQFXUUHQW H[HFXWLRQ EXW DOVR FRUUHFW
GLVWULEXWHG H[HFXWLRQ
&RUUHFWGLVWULEXWHG H[HFXWLRQ LV FRPPRQO\ EHOLHYHG PRUH
GLIILFXOW $ UXOH LV VDLG WR EH QRQLQWHUIHULQJ SDUW LV H[HFXWHG
LI RQFH LW LV HQDEOHG LW UHPDLQV VR XQWLO WKH DFWLRQ
,W KDV EHHQ VKRZQ WKDW D VHULDOFRUUHFW SURWRFRO LV DOVR GLVWULEXWHG
FRUUHFW SURYLGHG WKDW LWV UXOHV DUH QRQLQWHUIHULQJ >@ 7KH QRQLQWHUIHULQJ SURSHUW\ RI WKH UXOHV PDNHV WKH SURSRVHG %SURWRFRO FDQ VXSSRUW FRUUHFW GLVWULEXWHG H[HFXWLRQ IRU WKH VHULDOFRUUHFW 66 SURWRFROV
2WKHU DWWHPSWV PDGH WR VXSSRUW FRUUHFW GLVWULEXWHG
H[HFXWLRQ IRU VHULDOFRUUHFW 66 SURWRFROV FDQ DOVR EH IRXQG LQ >@>@ 7KH UHVW RI WKH SDSHU LV RUJDQL]HG DV IROORZV
6HFWLRQ SUHVHQWV $SURWRFRO
1H[W
6HFWLRQ JLYHV LWV FRUUHFWQHVV SURRI %SURWRFRO DQG LWV FRUUHFWQHVV GLVFXVVLRQ DUH WKHQ SUHVHQWHG LQ 6HFWLRQ 7KH HIILFLHQF\ RI $SURWRFRO LV GLVFXVVHG LQ 6HFWLRQ
2 A-protocol

The first issue we face is the symmetry problem. To solve the problem, in A-protocol each link is assigned a static direction such that the directed network is acyclic. The directed link is then called the base edge and is denoted by B→. The directed network induced by the base edges is called the B-network. Note that the B-network is static in the sense that all the directions of its edges are fixed; hence the B-network is always acyclic. Associated with each link, there is another edge, called the control edge, denoted by C→. The direction of the control edge is dynamically controlled by two control bits, maintained by the two nodes incident to the edge respectively, via the following four rules:

(0,0): if B→ then C→.
(0,1): if B→ then C←.
(1,0): if B→ then C←.
(1,1): if B→ then C→.

Let the control bit maintained by node i be denoted as Ci. For two neighboring nodes i and j, the rules imply that if Ci ⊕ Cj = 1 then the control edge has the reversed direction of the base edge; otherwise, they have the same direction, where ⊕ is the exclusive-OR operator. The directed network induced by the control edges is called the C-network. According to the four rules, a node can reverse all the directions of its adjacent control edges simply by reversing its control bit. Figure 1 gives an example of the B-network and the C-network. The following A-protocol is a direct consequence of this surprisingly simple result.

(Figure 1: Example of (a) the base network and (b) the control network.)

Let C-sink(i) (or C-source(i)), respectively, denote that all the control edges of node i are incoming to (or outgoing from, respectively) i. A-protocol consists of one guarded rule only:

[RA] C-sink(i) → Execute CS; Ci := ¬Ci.

The idea behind A-protocol is very simple. The control edge is used as an arbitrator to decide which one of the two nodes incident to the edge has the priority to execute the CS: the one pointed to has the priority. C-sink(i) implies that all the neighbors of i agree that node i has the priority. After executing the CS, node i yields the priority to all its neighbors by reversing its control bit. With common knowledge in the mutual exclusion field [], A-protocol obviously has the safety property, i.e., no two neighboring nodes execute the CS simultaneously. [RA] is non-interfering: once the guard is true, it remains true until the action part is executed.
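To make the rule [RA] concrete, the following Java sketch simulates A-protocol under serial execution on a three-node path. The topology, the number of simulated steps and the printing are illustrative choices, not taken from the paper; only the control-bit rule follows the description above.

public class AProtocolSketch {
    static final int N = 3;
    static boolean[][] adj  = new boolean[N][N];   // undirected links
    static boolean[][] base = new boolean[N][N];   // base[u][v] = true: base edge u -> v (acyclic)
    static boolean[] c = new boolean[N];           // one control bit per node, initially 0

    // direction of the control edge on link {u,v}: true if it points to v
    static boolean controlPointsTo(int u, int v) {
        boolean towardsV = base[u][v];              // same direction as the base edge...
        return (c[u] ^ c[v]) ? !towardsV : towardsV; // ...unless Cu XOR Cv = 1
    }

    static boolean cSink(int i) {                   // all control edges point to i
        for (int j = 0; j < N; j++)
            if (adj[i][j] && !controlPointsTo(j, i)) return false;
        return true;
    }

    static void link(int u, int v) { adj[u][v] = adj[v][u] = true; }

    public static void main(String[] args) {
        link(0, 1); link(1, 2);                     // path 0 - 1 - 2
        base[0][1] = true; base[1][2] = true;       // acyclic base orientation

        for (int step = 0; step < 9; step++) {      // serial execution: fire one enabled node
            for (int i = 0; i < N; i++) {
                if (cSink(i)) {                     // rule [RA]
                    System.out.println("node " + i + " executes the CS");
                    c[i] = !c[i];                   // yield priority to all neighbors
                    break;
                }
            }
        }
    }
}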
7KH QRQLQWHUIHULQJ SURSHUW\ RI WKH UXOH PDNHV $SURWRFRO GLVWULEXWHG
The Fuzzy Philosophers FRUUHFW SURYLGHG WKDW LW FDQ EH SURYHG VHULDOFRUUHFW >@
7KHUHIRUH WKH FRUUHFWQHVV
SURRI LQ WKH QH[W VHFWLRQ RQO\ FRQVLGHUV VHULDO H[HFXWLRQ &RUUHFWQHVV RI $SURWRFRO :H SURYH $SURWRFRO FRUUHFW E\ VKRZLQJ WKDW LW KDV WKH IROORZLQJ WZR SURSHUWLHV >3@ 6DIHW\ 3URSHUW\ 1R WZR QHLJKERULQJ QRGHV H[HFXWH WKH &6 VLPXOWDQHRXVO\ >3@ )DLUQHVV 3URSHUW\ $ORQJ DQ\ LQILQLWH FRPSXWDWLRQ HDFK QRGH H[HFXWHV WKH &6 LQILQLWHO\ RIWHQ $V GLVFXVVHG LQ WKH SUHYLRXV VHFWLRQ WKH IROORZLQJ 7KHRUHP LV WUXH 7KHRUHP $SURWRFRO KDV WKH SURSHUW\ >3@ ,Q RUGHU WR VKRZ WKDW $SURWRFRO DOVR KDV WKH SURSHUW\ >3@ ZH VKRZ WKDW $SURWRFRO LV GHDGORFNIUHH ILUVW /HW DOO WKH FRQWURO ELWV PDLQWDLQHG E\ WKH QRGHV EH LQLWLDOL]HG DV ]HUR DQG DVVXPH WKDW WKH V\VWHP IDFHV QR WUDQVLHQW IDXOWV /DWHU LQ %SURWRFRO ZH ZLOO GLVFXVV KRZ WR KDQGOH WKH WUDQVLHQW IDXOWV 8QGHU WKLV DVVXPSWLRQ ZH KDYH WKH IROORZLQJ LQYDULDQW >,@ 7KH &QHWZRUN LV DF\FOLF /HPPD >,@ LV DQ LQYDULDQW 3URRI )LUVW >,@ LV WUXH DW WKH WLPH ZKHQ WKH QHWZRUN LV LQLWLDOL]HG 7KLV LV EHFDXVH WKH &QHWZRUN LV H[DFWO\ WKH %QHWZRUN DW WKH EHJLQQLQJ 6HFRQGO\ LI >,@ LV WUXH EHIRUH D V\VWHP VWDWH WUDQVLWLRQ LW LV DOVR WUXH DIWHU WKH WUDQVLWLRQ 1RWH WKDW D QRGH FKDQJHV IURP D &VLQN QRGH WR D &VRXUFH QRGH ZKHQ LW H[HFXWHV WKH DFWLRQ SDUW RI WKH UXOH 7KLV LV EHFDXVH DOO WKH FRQWURO HGJHV RI WKH QRGH UHYHUVH WKHLU GLUHFWLRQ E\ WKH DFWLRQ SDUW RI WKH UXOH $OVR LW©V QRW KDUG WR VHH WKDW DQ DF\FOLF QHWZRUN UHPDLQV DF\FOLF LI VRPH VLQN QRGH LV UHSODFHG ZLWK D VRXUFH QRGH $OO WKRVH WRJHWKHU LPSOLHV WKDW >,@ LV DQ LQYDULDQW 7KLV HQGV WKH SURRI /HPPD $SURWRFRO LV GHDGORFNIUHH 3URRI %\ /HPPD WKH &QHWZRUN LV DOZD\V DF\FOLF KHQFH DW DQ\ VWDWH WKHUH H[LVWV DW OHDVW RQH &VLQN QRGH ZKLFK LV HQDEOHG 7KLV HQGV WKH SURRI 7KHRUHP $SURWRFRO KDV WKH >3@ SURSHUW\ 3URRI >5$@ LV QRQLQWHUIHULQJ KHQFH DQ HQDEOHG QRGH H[HFXWHV WKH UXOH HYHQWXDOO\ 7KHQ E\ /HPPD DORQJ DQ\ LQILQLWH FRPSXWDWLRQ VRPH QRGH VD\ QRGH L PXVW H[HFXWH WKH &6 LQILQLWHO\ RIWHQ %\ WKH UXOH EHWZHHQ WZR VXFFHVVLYH &6 VWHSV RI QRGH L DOO LWV QHLJKERUV PXVW H[HFXWH WKH &6 RQFH +HQFH DOO WKH QHLJKERUV RI QRGH L PXVW H[HFXWH WKH &6 LQILQLWHO\ RIWHQ DORQJ WKH FRPSXWDWLRQ 7KHQ EHFDXVH WKH QHWZRUN LV ILQLWH WKLV WKHRUHP LV SURYHG :H KDYH SURYHG WKH FRUUHFWQHVV RI $SURWRFRO XQGHU WKH DVVXPSWLRQ WKDW QR WUDQVLHQW IDXOWV PD\ RFFXU +RZHYHU ZKHQ WUDQVLHQW IDXOWV DUH WDNHQ LQWR FRQVLGHUDWLRQ >,@ LV QR ORQJHU DQ LQYDULDQW
7R VHH WKLV FRQVLGHU D WKUHHQRGH ULQJ ZLWK WKH IROORZLQJ
%QHWZRUN FRQILJXUDWLRQ L%→ M%→ N%← L
$W VRPH PRPHQW WKH &QHWZRUN
FRQILJXUDWLRQ PD\ EH DV &→ &← &→ 7KHQ D WUDQVLHQW IDXOW PD\ SHUWXUE &L
FKDQJLQJ LWV YDOXH IURP WR DQG PDNH WKH FRQILJXUDWLRQ DV &← &← &← $ F\FOH H[LVWV ,Q WKH QH[W VHFWLRQ %SURWRFRO LV PRGLILHG IURP $SURWRFRO WR FRSH ZLWK WKH WUDQVLHQW IDXOWV 1RWH WKDW $SURWRFRO KDV WKH 66 SURSHUW\ LI WKH RULJLQDO QHWZRUN LV DF\FOLF WKLV LV EHFDXVH LQ VXFK D FDVH WKH LQYDULDQW >,@ LV YDOLG HYHQ DIWHU WKH WUDQVLHQW IDXOWV %SURWRFRO 7KH LGHD EHKLQG %SURWRFRO LV WR FRORU WKH OLQNV RI WKH QHWZRUN LQWR GLIIHUHQW FRORUV FRORU FRORU FRORUP 7KH VXEQHWZRUN LQGXFHG E\ OLQNV ZLWK FRORU[ LV FDOOHG WKH &[VXEQHWZRUN 7KH FRORULQJ PXVW EH FDUULHG RXW LQ VXFK D ZD\ WKDW HDFK RI WKH &VXEQHWZRUN &VXEQHWZRUN DQG &PVXEQHWZRUN LV DF\FOLF EXW PD\ EH GLVFRQQHFWHG +HUH ZH DVVXPH WKH FRORUV DUH LQLWLDOO\ JLYHQ $FFRUGLQJ WR WKH FRORUV RI WKH OLQNV QRGHV DUH FODVVLILHG LQWR QRQPXWXDOO\H[FOXVLYH GLIIHUHQW FRORU VHWV $ QRGH LV VDLG WR EHORQJ WR &[VHW RU VDLG WR EH D &[ QRGH LI WKH QRGH LV LQFLGHQW WR DW OHDVW RQH OLQN ZLWK FRORU[
1RWH WKDW D QRGH PD\ EHORQJ WR
VHYHUDO GLIIHUHQW FRORU VHWV $V DQ H[DPSOH RQH PD\ FRORU D PHVK ZLWK WZR FRORUV WKH YHUWLFDO OLQNV ZLWK FRORU DQG WKH KRUL]RQWDO OLQNV ZLWK FRORU ,Q VXFK D FRORULQJ HDFK QRGH EHORQJV WR WZR FRORU VHWV ,Q %SURWRFRO LI QRGH L LV D &[ QRGH WKHQ L PDLQWDLQV D FRQWURO ELW IRU WKRVH OLQNV ZLWK FRORU[ GHQRWHG DV &[L WR FRQWURO WKH FRQWURO HGJHV RYHU WKRVH OLQNV +HQFH IRU D QRGH EHORQJV WR N GLIIHUHQW FRORU VHWV N FRQWURO ELWV DUH QHHGHG 6LPLODU WR $SURWRFRO DVVRFLDWHG ZLWK HDFK OLQN WKHUH LV D EDVH HGJH +RZHYHU WKH GLUHFWLRQ RI WKH EDVH HGJHV FDQ EH DUELWUDU\ 7KH UHTXLUHPHQW WKDW %QHWZRUN LQGXFHG E\ WKH EDVH HGJHV LV DF\FOLF LV QR ORQJHU QHFHVVDU\ 7KH UHTXLUHPHQW LV QHHGHG LQ $SURWRFRO EHFDXVH WKH LQYDULDQW >,@ PXVW EH LQLWLDOO\ WUXH 7KH GLUHFWLRQ RI WKH FRQWURO HGJH RYHU D OLQN ZLWK FRORU[ LV GHFLGHG E\ WKH GLUHFWLRQ RI WKH DVVRFLDWHG EDVH HGJH DQG WKH IRXU UXOHV LQ WKH VDPH ZD\ DV LQ $SURWRFRO %SURWRFRO FRQVLVWV RI WZR UXOHV 7KH QRWDWLRQ &[VLQN LQ WKH UXOHV LV FRUUHVSRQGLQJ WR &[VXEQHWZRUN >5%@ >5%@
∀&[ QRGH L ∈ &[VHW &[VLQNL → ([HFXWH &6 &[L ¬&[L ∃&[ &\ [ < \ QRGH L ∈&[VHW QRGH L ∈ &\VHW ¬&[VLQNL ∧ &\VLQNL → &\L ¬&\L
5XOH >5%@ JXDUDQWHHV WKDW %SURWRFRO KDV WKH VDIHW\ SURSHUW\ >3@ EHFDXVH LWV JXDUG JXDUDQWHHV WKDW DOO WKH FRQWURO HGJHV LQFLGHQW WR QRGH L SRLQW WR L 7KH UXOHV LPSO\ WKDW HDFK QRGH ZDLWV IRU H[HFXWLQJ &6 E\ ZDLWLQJ WR KROG QHHGHG VLQN VWDWXV RI WKH FRQWURO VXEQHWZRUNV RQH E\ RQH IURP ORZHU FRORU WR KLJKHU FRORU %\ >5%@ ZKHQ D QRGH GRHV QRW KROG WKH QHHGHG VLQN VWDWXV RI D ORZHU FRORU FRQWURO VXEQHWZRUN LW GRHV QRW NHHS WKH VLQN VWDWXV RI D KLJKHU FRORU FRQWURO VXEQHWZRUN WR DYRLG GHDGORFN $SURWRFRO LV SURYHG VHULDOFRUUHFW ,W LV DOVR GLVWULEXWHGFRUUHFW EHFDXVH WKH RQO\ UXOH >5$@ LV QRQLQWHUIHULQJ
7KLV LV QRW WKH FDVH IRU %SURWRFRO EHFDXVH >5%@ LV QRW
QRQLQWHUIHULQJ 7KH JXDUG RI >5%@ PD\ FKDQJH IURP WUXH WR IDOVH LI WKH DFWLRQ SDUW RI LW GRHV QRW H[HFXWH LQ WLPH +RZHYHU ZKDW ZH UHDOO\ FDUH LV WKH H[HFXWLRQ RI WKH &6 E\ WKH QRGHV WKDW LV WKH DFWLRQ SDUW RI UXOH >5%@
5XOH >5%@ LV REYLRXVO\ QRQ
LQWHUIHULQJ 7KHUHIRUH ZH FRQFOXGH WKDW %SURWRFRO LV DOVR GLVWULEXWHGFRUUHFW DV ORQJ DV WKH WZR SURSHUWLHV >3@ DQG >3@ DUH WKH RQO\ FRQFHUQV
%SURWRFRO FDQ VXSSRUW FRUUHFW GLVWULEXWHG H[HFXWLRQ RI WKH VHULDOFRUUHFW DSSOLFDWLRQ SURWRFRO LQ D YHU\ VLPSOH ZD\ 7KH UXOHV RI WKH DSSOLFDWLRQ SURWRFRO DUH VLPSO\ DWWDFKHG LQWR WKH &6 SDUW RI %SURWRFRO
7KH QRGH KROGLQJ WKH &6 SULYLOHJH DFFRUGLQJ WR %
SURWRFRO WKHQ H[HFXWHV WKH UXOHV RI WKH DSSOLFDWLRQ SURWRFRO
(IILFLHQF\ RI
$SURWRFRO
7KLV VHFWLRQ GLVFXVVHV WKH HIILFLHQF\ RI $SURWRFRO :H DUH XQDEOH WR GHULYH ,Q WKH GLVFXVVLRQ D PD[LPDO FRQFXUUHQW H[HFXWLRQ RI $SURWRFRO LV DVVXPHG DV LQ >@ 7KDW LV WKH QRGHV DUH H[HFXWHG JRRG UHVXOWV UHJDUGLQJ WKH HIILFLHQF\ RI %SURWRFRO
LQ ORFNHG VWHSV LQ HDFK VWHS\] ZKLFK EULQJ WKH V\VWHP IURP VWDWH \ WR VWDWH ] DOO HQDEOHG DFWLRQV DW VWDWH \ DUH H[HFXWHG LQ WKH VWHS $SURWRFRO DVVXPHV QR WUDQVLHQW IDXOWV +HQFH WKH &QHWZRUN LV DF\FOLF
$
&SDWK LV
GHILQHG DV D GLUHFWHG SDWK IURP D QRQVLQN QRGH WKH KHDG QRGH RI WKH &SDWK WR D VLQN
QRGH WKH
WDLO
QRGH RI WKH &SDWK LQ WKH &QHWZRUN 1RWH WKDW WKH KHDG QRGH LV QRW
QHFHVVDU\ D VRXUFH QRGH DOVR IURP D QRQVLQN QRGH PDQ\ &SDWKV PD\ H[LVW +HQFH D &SDWK PD\ LQFOXGH PDQ\ VKRUWHU &SDWKV )RU H[DPSOH &SDWK K L M Z LQFOXGHV &SDWK M Z 7KH
OHQJWK RI D &SDWK LV GHILQHG DV WKH QXPEHU RI HGJHV LQ LW ZKLFK FDQ RQO\ EHFRPH
VKRUWHU EHFDXVH WKH KHDG LV IL[HG DQG WKH WDLO FDQ RQO\ VKULQN 7KH PD[LPXP OHQJWK RI DOO WKH &SDWKV IURP D QRGH LV WKH ORZHVW SRVVLEOH QXPEHU RI VWHSV WKDW WKH QRGH QHHGV WR ZDLW IRU LWV WXUQ WR H[HFXWH WKH &6 7KHUHIRUH WKH PD[LPXP OHQJWK RI DOO GLUHFWHG SDWKV LQ WKH QHWZRUN GHQRWHG DV ;OHQJWK RI WKH QHWZRUN LV XVHG DV WKH PHWULF LQ GLVFXVVLQJ WKH HIILFLHQF\ /HPPD ,Q HDFK VWHS\] DQ H[LVWLQJ &SDWK DW VWDWH \ EHFRPHV RQH HGJH VKRUWHU RU GLVDSSHDUV DW VWDWH ] 3URRI $W VWDWH \ H[FHSW WKH WDLO QRGH ZKLFK LV D VLQN DOO RWKHU QRGHV LQFOXGLQJ WKH KHDG QRGH DQG WKH PLGGOH QRGHV RI WKH SDWK DUH QRW HQDEOHG DQG VR WKH\ UHPDLQ LQ WKH SDWK DW VWDWH ] :KHUHDV WKH WDLO QRGH LV HQDEOHG DW VWDWH \ DQG KHQFH LWV DFWLRQ SDUW LV H[HFXWHG LQ WKH VWHS DQG UHYHUVHV WKH GLUHFWLRQ RI DOO WKH FRQWURO HGJHV LQFLGHQW WR LW ,Q RWKHU ZRUGV WKH WDLO QRGH RI WKH SDWK DW VWDWH \ LV QR ORQJHU SDUW RI LW DW VWDWH ] 7KLV HQGV WKH SURRI 1RWH WKDW WZR RU PRUH &SDWKV ZLWK WKH VDPH KHDG QRGH PD\ PHUJH LQWR RQH ZKHQ WKH\ DUH JHWWLQJ VKRUWHU DQG VKRUWHU )RU H[DPSOH &SDWKK XYZ DQG &SDWKK XV PHUJH LQWR RQH DV &SDWKK XY DW WKH QH[W VWDWH $OVR D QHZ &SDWK PD\ EH FUHDWHG EHFDXVH D VLQN QRGH EHFRPHV D QRQVLQN QRGH DW WKH QH[W VWDWH /HPPD ,Q HDFK VWHS\] D QHZO\ FUHDWHG &SDWK KDV OHQJWK RQH RU FDQ RQO\ EH DV ORQJ DV VRPH &SDWK H[LVWLQJ LQ VWDWH \ 3URRI /HW &VLQNL DW VWDWH \ 1RGH L EHFRPHV D QRQVLQN QRGH DW VWDWH ] 7KHQ LM PD\ EH D QHZ &SDWK RI OHQJWK RQH
2U LM K PD\ EH D QHZ &SDWK EHFDXVH
&SDWKM KN H[LVWV DW VWDWH \ ERWK KDYH WKH VDPH OHQJWK 7KLV HQGV WKH SURRI /HPPD 7KH ;OHQJWK RI WKH &QHWZRUN LV QRQLQFUHDVLQJ
/HPPD LV D GLUHFW FRQVHTXHQFH RI /HPPDV DQG %\ /HPPD WKH ;OHQJWK RI WKH &QHWZRUN FDQ RQO\ EHFRPH VPDOOHU GXULQJ WKH FRPSXWDWLRQ 5HFDOO WKDW DW WKH EHJLQQLQJ WKH &QHWZRUN LV H[DFWO\ WKH %QHWZRUN WKHUHIRUH $SURWRFRO FDQ EH PDGH YHU\ HIILFLHQW E\ DVVLJQLQJ GLUHFWLRQ RI WKH OLQNV RI WKH QHWZRUN LQ VXFK D ZD\ WKDW WKH ;OHQJWK LV PDGH DV VPDOO DV SRVVLEOH )RU H[DPSOH LQ D ULQJ QHWZRUN WKH ;OHQJWK FDQ EH DW PRVW WZR $SURWRFRO LV DQ 66 SURWRFRO ZKHQ LW DSSOLHV RQ DQ DF\FOLF QHWZRUN YL] WUHH QHWZRUN DV PHQWLRQHG EHIRUH
:KHQ WKH QHWZRUN LV DF\FOLF $SURWRFRO LV YHU\ HIILFLHQW
DFFRUGLQJ WR WKH IROORZLQJ 7KHRUHP 7KHRUHP 7KH ;OHQJWK RI WKH &QHWZRUN VWDELOL]HV WR RQH ZKHQ $SURWRFRO DSSOLHV RQ D ILQLWH WUHH QHWZRUN 3URRI (YHQWXDOO\ DQ\ QHZO\ FUHDWHG &SDWK LQ HDFK VWHS\] FDQ RQO\ EH RI OHQJWK RQH DW VWDWH ]
7KLV LV EHFDXVH WKH &QHWZRUN RQ D ILQLWH WUHH QHWZRUN LV ILQLWH DQG DF\FOLF
WKH FUHDWLRQ RI D QHZ &SDWK RI WKH IRUP LM K HYHQWXDOO\ EHFRPHV LPSRVVLEOH :KHQ WKLV KDSSHQV HDFK VWHS DIWHUZDUGV H[LVWLQJ &SDWKV EHFRPH RQH HGJH VKRUWHU DQG QHZO\ FUHDWHG &SDWKV DUH RI OHQJWK RQH +HQFH WKH ;OHQJWK RI WKH &QHWZRUN VWDELOL]HV WR RQH 7KLV HQGV WKH SURRI 7KHRUHP LPSOLHV WKDW ZKHQ $SURWRFRO DSSOLHV RQ D WUHH QHWZRUN WKH V\VWHP VWDELOL]HV WR VWDWHV LQ ZKLFK D QRGH FDQ H[HFXWH WKH &6 RQFH HYHU\ WZR VWHSV $FNQRZOHGJHPHQW 7KLV UHVHDUFK ZDV VXSSRUWHG LQ SDUW E\ WKH 1DWLRQDO 6FLHQFH &RXQFLO RI WKH 5HSXEOLF RI &KLQD XQGHU WKH &RQWUDFW 16& ( 5HIHUHQFHV
Brown, G. M., Gouda, M. G., and Wu, C.-L.: Token systems that stabilize. IEEE Transactions on Computers.
Chandy, K. M. and Misra, J.: The drinking philosophers problem. ACM Transactions on Programming Languages and Systems.
Dijkstra, E. W.: Self-stabilizing systems in spite of distributed control. Communications of the ACM.
Dijkstra, E. W.: Hierarchical ordering of sequential processes. In Operating Systems Techniques, Hoare, C. A. R. and Perrott, R. H. (Eds.), Academic Press, New York.
0 can be turned into a Monte Carlo algorithm which runs in O(T(N)) time with probability of success 1 - O(1/N^α) for any large constant α > 0, by running the algorithm for a constant (depending on α) number of consecutive times and choosing one run that succeeds, without increasing the time complexity. Summarizing the above discussions, we obtain the following results:
Theorem 1. The distance map problem defined on an n × n image can be solved using an LARPBS with n^2 processors in O(log n log log n) time deterministically, or in O(log n) time with high probability, for any practical size of n.
4 Algorithm Using n^3 Processors

We can further reduce the time of the above algorithm by using more processors. In this section, we describe an algorithm which works on an LARPBS with n^3 processors. The n^3-processor algorithm is similar to the n^2-processor algorithm described in the preceding section. The only difference is that we use n^2 processors, instead of n processors, to find the EDT values in a row, thus reducing the time used. Initially, the n^2 pixels are stored in the first n^2 processors. Actually, all the steps except steps 3 and 4 use the first n^2 processors only, and hence are the same as in the n^2-processor algorithm of the preceding section. Now we describe step 3 of the new algorithm in detail. Notice that all values such as the DIST's have been computed and stored in local processors. We divide the LARPBS into n subsystems, each with n^2 processors. Denote these subsystems as LARPBS-i, 0 ≤ i ≤ n-1. Distribute the n rows of pixels, along with the values such as the DIST's computed in the previous steps, to the first n processors of the n subsystems. Thus, each subsystem is responsible for one row of pixels. There are n EDT values to be computed in a row and each subsystem has n^2 processors. Hence, we can let n processors calculate each EDT value, and all the EDT values can be computed concurrently. An EDT value can be computed using the deterministic minimum finding algorithm or the randomized minimum finding algorithm on the DIST values computed in step 2. Obviously, this step involves only broadcast, multicast, array reconfiguration, and minimum finding operations. All these operations take O(1) time except the minimum finding algorithm. Using an analysis similar to the one described previously, it is easily obtained that step 3 can be computed in O(log log n) time deterministically or in O(1) time with high probability. Hence, we have the following results:
Theorem 2. The distance map problem defined on an n × n image can be solved using an LARPBS with n^3 processors in O(log log n) time deterministically, or in O(1) time with high probability, for any practical size of n.
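For readers less familiar with the problem, the sketch below states the distance map being computed (each pixel's Euclidean distance to the nearest 1-pixel, under one common convention) as a naive O(n^4) sequential Java baseline. It is given only to fix the definition; it is not the LARPBS algorithm analyzed in Theorems 1 and 2.

public class NaiveDistanceMap {
    static double[][] distanceMap(int[][] img) {
        int n = img.length;
        double[][] d = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double best = Double.POSITIVE_INFINITY;
                for (int x = 0; x < n; x++)                 // scan all feature pixels
                    for (int y = 0; y < n; y++)
                        if (img[x][y] == 1)
                            best = Math.min(best, Math.hypot(i - x, j - y));
                d[i][j] = best;
            }
        return d;
    }

    public static void main(String[] args) {
        int[][] img = {{0,0,0,0},{0,1,0,0},{0,0,0,0},{0,0,0,1}};  // made-up example
        System.out.println(distanceMap(img)[0][3]);               // distance from (0,3) to nearest 1
    }
}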
5 Conclusions

Due to the high bandwidth of an optical bus and several efficiently implemented data movement operations, the distance map problem is solved efficiently on the LARPBS model. It should be noted that algorithms on a plain mesh or a reconfigurable mesh cannot reach the time bounds described in this paper.
Advanced Data Layout Optimization for Multimedia Applications

Chidamber Kulkarni, Francky Catthoor, Hugo De Man

IMEC, Kapeldreef 75, B-3001 Leuven, Belgium
(C. Kulkarni is also a Ph.D. student, and F. Catthoor and H. De Man are also professors, at the Dept. of EE, Kath. Univ. Leuven)
1 Introduction and Related Work

The increasing disparity between processor and memory speeds has been a motivation for designing systems with deep memory hierarchies. Most data-dominated multimedia applications do not use their cache efficiently and spend much of their time waiting for memory accesses [1]. This also implies a significant additional cost in increased memory bandwidth, in the system bus load and in the associated power consumption, apart from increasing the average memory access time. In this work, we mainly target the embedded (parallel) real-time multimedia processing (RMP) application domain, since algorithms in this domain lend themselves to very good compile-time analysis. Although embedded RMP applications are relatively regular, they are certainly not perfectly linear/affine in their loop and index expressions, and the simultaneous presence of complex accesses to large working sets makes most of the existing approaches largely fail in taking advantage of the locality. Earlier studies have shown that the majority of the execution time is spent in cache stalls due to cache misses, for image processing applications [1] as well as scientific applications [12]. Hence the reduction of such cache misses is of crucial importance. Source-level program transformations that modify the execution order can improve the cache performance of these applications to a large extent [3, 6-9], but a significant number of cache misses still remain. Storage order optimizations [3, 4] are very helpful in reducing the capacity misses. So in the end, mostly conflict cache misses related to a sub-optimal data layout remain. Array padding has been proposed earlier to reduce the latter [11, 14, 15]. These approaches are useful for reducing the cross-conflict misses. However, existing approaches do not eliminate the majority of the conflict misses. Besides [2, 6, 14], very little has been done to measure the impact of data organization (or layout) on the cache performance. Thus there is a need to investigate additional data layout or organization techniques to reduce these cache misses. The fundamental relation which governs the mapping of data from main memory to a cache is given below:

    (Block Address) MOD (Number of Sets in Cache)    (1)
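As an aside, Eq. (1) can be made concrete with a few lines of Java; the cache capacity, block size and associativity used below are illustrative parameters only, not values from the paper.

public class CacheSetMapping {
    static final int CACHE_BYTES = 8 * 1024;   // total cache capacity (assumed)
    static final int BLOCK_BYTES = 32;          // cache line (block) size (assumed)
    static final int ASSOC = 1;                 // 1 = direct mapped

    static int numSets() { return CACHE_BYTES / (BLOCK_BYTES * ASSOC); }

    // Eq. (1): (Block Address) MOD (Number of Sets in Cache)
    static int setIndex(long byteAddress) {
        long blockAddress = byteAddress / BLOCK_BYTES;
        return (int) (blockAddress % numSets());
    }

    public static void main(String[] args) {
        // two addresses that differ by a multiple of the cache size map to the same set
        // and therefore evict each other in a direct-mapped cache (a conflict miss)
        System.out.println(setIndex(0x10000));
        System.out.println(setIndex(0x10000 + CACHE_BYTES));
    }
}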
Based on the number of lines in a set, we define direct-mapped, n-way associative and fully associative caches [16]. It is clear that if we arrange the data in main memory so that they are placed at particular block addresses, depending on their lifetimes and sizes, we can control the mapping of data to the cache and hence (largely) remove the influence of associativity on the mapping of data to the cache. The problem is, however, that trade-offs normally need to be made between many different variables. This requires a global data layout approach which, to our knowledge, has not yet been published before. This has been the motivation for us to come up with a new formalized and automated methodology for optimized data organization in the higher levels of memory. Our approach is called main memory data layout organization (MDO). This is our main contribution, which will be demonstrated on real-life applications. The remainder of the paper is organized as follows: Section 2 presents an example illustration of the proposed main memory data layout organization methodology. This is followed by the introduction of the general memory data layout organization problem and the potential solutions in Section 3. Experimental results on three real-life test vehicles are presented in Section 4. Some conclusions from this work are given in Section 5.
2 Example Illustration

In this section we briefly introduce the basic principle behind main memory data layout organization (MDO) using an example illustration. Consider the example in Figure 1. The initial algorithm in Figure 1(a) needs three arrays to execute the complete program. Note that the initial main memory data layout in Figure 1(b) is single contiguous, irrespective of the array and cache sizes. The initial algorithm can have 3N (cross-) conflict cache misses for a direct-mapped cache in the worst case, i.e., when each of the arrays is placed at an (initial) address which is a multiple of the cache size. Thus, to eliminate all the conflict cache misses, it is necessary that none of the three arrays gets mapped to the same cache locations. The MDO-optimized algorithm, as shown in Figure 1(c), will have no (cross-) conflict cache misses at all. This is because, in the MDO-optimized algorithm, the arrays always get mapped to fixed and non-overlapping locations in the cache. This happens because of the way the data is stored in main memory, as shown in Figure 1(d). To obtain this modified data layout, the following steps are carried out:
1. The initial arrays are split into sub-arrays of equal size. The size of each sub-array is called the tile size.
2. The different arrays are merged so that the sum of their tile sizes equals the cache size. The merged arrays are then stored recursively until all the concerned arrays are completely mapped in main memory.
Thus we now have a new array which comprises all the arrays, but the constituent arrays are stored in such a way that they get mapped into the cache so as to remove the conflict misses. This new array is represented by "x[]" in Figure 1(c).
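A minimal Java sketch of this tile-splitting and merging idea follows. The cache size, the number of arrays and the array lengths are illustrative, and the address function is only one possible way to realize the layout of Figure 1(d), not the tool's actual output.

public class MdoLayoutSketch {
    static final int CACHE_SIZE = 64;                 // cache capacity in elements (assumed)
    static final int NUM_ARRAYS = 3;
    static final int TILE = CACHE_SIZE / NUM_ARRAYS;  // tile size per array

    // address of element i of array a in the merged array x[]
    static int mdoAddress(int a, int i) {
        return (i / TILE) * CACHE_SIZE + a * TILE + (i % TILE);
    }

    // set an address maps to in a direct-mapped cache: Eq. (1) with one element per line
    static int cacheSet(int address) {
        return address % CACHE_SIZE;
    }

    public static void main(String[] args) {
        // every element of array a lands in the band [a*TILE, (a+1)*TILE) of cache sets,
        // so the three arrays can never evict each other (no cross-conflict misses)
        for (int a = 0; a < NUM_ARRAYS; a++) {
            int lo = Integer.MAX_VALUE, hi = Integer.MIN_VALUE;
            for (int i = 0; i < 200; i++) {           // 200 elements per array, for example
                int set = cacheSet(mdoAddress(a, i));
                lo = Math.min(lo, set);
                hi = Math.max(hi, set);
            }
            System.out.println("array " + a + " uses cache sets " + lo + ".." + hi);
        }
    }
}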
In Figure 1(c) and (d), two important observations need to be made: (1) there is a recursive allocation of the different array data, with each recursion equal to the cache size, and (2) the generated addressing, which is used to impose the modified data layout on the linker, contains modulo operations. These can be removed afterwards through a separate optimization stage [5].

(Figure 1: example source code with the initial data layout (a, b) and the MDO-optimized code and main memory data layout (c, d).)
These efforts can be grouped into three categories, based on the nature of the computing platforms they utilize: special-purpose VLSI processors [], specialized vision systems [], and general-purpose parallel machines []. In this paper, we examine the performance of a collection of vision tasks on an on-chip multiprocessor. The performance of general-purpose Commodity-Off-The-Shelf (COTS) microprocessors has been improving at a phenomenal rate over the last decade. Currently, most COTS processors, such as the Intel Pentium series, Compaq
*
The work at USC was supported by the DARPA Data Intensive Systems program under contract F33615-99-1-1483 monitored by Wright Patterson Airforce Base.
$OSKD ,%0 3RZHU3& 6XQ 8OWUD6SDUF +3 3$ DQG 0,36 5 XVH WKH VXSHUVFDODU GHVLJQ WHFKQLTXH>@ 6XFK VXSHUVFDODU SURFHVVRUV H[HFXWH PXOWL SOH LQVWUXFWLRQV LQ D VLQJOH F\FOH E\ H[SORLWLQJ ,QVWUXFWLRQ/HYHO 3DUDOOHOLVP,/3 >@ +RZHYHU VLJQLILFDQW VSHHGXSV PD\ QRW EH DFKLHYHG E\ XVLQJ WKLV WHFKQLTXH EHFDXVH RI WKH OLPLWDWLRQ LPSRVHG E\ WKH LQVWUXFWLRQ ZLQGRZ VL]H DQG DYDLODEOH ,/3 LQ D W\SLFDO VHTXHQWLDO SURJUDP>@ 0RUHRYHU FRQVLGHUDEOH GHVLJQ HIIRUW LV UHTXLUHG WR GHYHORS VXFK KLJK SHUIRUPDQFH SURFHVVRUV 5HVHDUFKHUV KDYH EHHQ VWXG\LQJ DOWHUQDWLYHV WR WKH VXSHUVFDODU DUFKLWHFWXUH>@ 7KH ¦RQFKLS PXOWLSURFHVVRU§>@ LV RQH RI WKH DOWHUQDWLYHV FRQVLGHUHG DV D QH[W JHQHUDWLRQ SURFHVVRU 7KH NH\ IHDWXUH RI VXFK DQ DUFKLWHFWXUH LV D QG PXOWLSURFHVVRU LQ D VLQJOH FKLS WKDW VKDUHV DQ RIIFKLS OHYHO FDFKH WR H[SORLW 7KUHDG/HYHO 3DUDOOHOLVP7/3 >@ LQ DGGLWLRQ WR ,/3 )RU VFLHQWLILF ZRUNORDGV WKH RQFKLS PXOWLSURFHVVRU KDV EHHQ VKRZQ WR RYHUFRPH WKH ,/3 OLPLWDWLRQ>@ +RZHYHU WKH SHUIRUPDQFH RI WKH RQFKLS PXOWLSURFHVVRU RQ YLVLRQ WDVNV WKDW KDYH YDU\LQJ FRP SXWDWLRQDO FKDUDFWHULVWLFV LV QRW NQRZQ 7R LQYHVWLJDWH WKH VXLWDELOLW\ RI WKH RQFKLS PXOWLSURFHVVRU IRU YLVLRQ WDVNV ZH FRQGXFWHG SHUIRUPDQFH VWXGLHV XVLQJ D GHGLFDWHG DUFKLWHFWXUDO VLPXODWRU ,W LV D SUR JUDPGULYHQ F\FOHOHYHO VLPXODWRU FRQVLVWLQJ RI D 3UH3URFHVVLQJ 8QLW DV DQ LQVWUXF WLRQ VLPXODWRU DQG D 3RVW3URFHVVLQJ 8QLW DV D SHUIRUPDQFH VLPXODWRU $OVR D SUR JUDPPLQJ HQYLURQPHQW ZDV GHYHORSHG WR VXSSRUW PXOWLWKUHDGHG SURJUDPPLQJ RQ PXOWLSOH SURFHVVRU FRUHV 2XU VLPXODWLRQV IRFXVHG RQ WKH SHUIRUPDQFH FKDUDFWHULVWLFV RI WKH RQFKLS PXOWLSURFHVVRU LQFOXGLQJ WKH ,QVWUXFWLRQV 3HU &\FOH,3& DQG WKH WRWDO QXPEHU RI H[HFXWLRQ F\FOHV DV WKH QXPEHU RI SURFHVVRU FRUHV LV LQFUHDVHG 9LVLRQ WDVNV ZHUH FKRVHQ IURP WKH ZLGHO\ VWXGLHG '$53$ ,PDJH 8QGHUVWDQGLQJ %HQFK PDUN>@ 7KH VLPXODWLRQ UHVXOWV VKRZ WKDW ¦RQFKLS PXOWLSURFHVVRU§ LV DQ DWWUDFWLYH FDQGLGDWH DUFKLWHFWXUH IRU YLVLRQ WDVNV 7KH RUJDQL]DWLRQ RI WKH SDSHU LV DV IROORZV 2YHUYLHZ RI WKH YLVLRQ WDVNV DQG WKH RQFKLS PXOWLSURFHVVRU DUFKLWHFWXUH FRQVLGHUHG LQ WKLV SDSHU DUH JLYHQ LQ 6HFWLRQ DQG UHVSHFWLYHO\ ,Q 6HFWLRQ WKH DUFKLWHFWXUDO VLPXODWRU DQG LWV SURJUDPPLQJ HQYL URQPHQW DUH H[SODLQHG 6LPXODWLRQ UHVXOWV DUH VKRZQ LQ 6HFWLRQ DQG FRQFOXGLQJ UHPDUNV DUH PDGH LQ 6HFWLRQ
2 Selected Vision Tasks 7KH YLVLRQ WDVNV FRQVLGHUHG LQ WKLV SDSHU DUH VHOHFWHG IURP WKH ,PDJH 8QGHU VWDQGLQJ %HQFKPDUN>@ 7KLV EHQFKPDUN SHUIRUPV WKH UHFRJQLWLRQ RI D ¦PRELOH§ VFXOSWXUH JLYHQ WKH LQSXW LPDJHV IURP LQWHQVLW\ DQG UDQJH VHQVRUV 7KH EHQFKPDUN SHUIRUPV ORZOHYHO RSHUDWLRQV VXFK DV FRQYROXWLRQ WKUHVKROGLQJ FRQQHFWHG FRPSR QHQWV ODEHOLQJ HGJH WUDFNLQJ PHGLDQ ILOWHU +RXJK WUDQVIRUP FRQYH[ KXOO DQG FRU QHU GHWHFWLRQ ,W DOVR SHUIRUPV JURXSLQJ RSHUDWLRQV DQG JUDSK PDWFKLQJ ZKLFK DUH UHSUHVHQWDWLYH H[DPSOHV RI LQWHUPHGLDWHOHYHO DQG KLJKOHYHO SURFHVVLQJ UHVSHFWLYHO\ 7KH EHQFKPDUN XWLOL]HV LQIRUPDWLRQ IURP WZR VHQVRUV LQ RUGHU WR FRPSOHWH WKH LQWHU SUHWDWLRQ SURFHVV ,W PDNHV XVH RI ERWK LQWHJHU DQG IORDWLQJSRLQW UHSUHVHQWDWLRQV
HGJH RU PRGHOGLUHFWHG SURFHVVLQJ 7KH WRSGRZQ SURFHVVLQJ FDQ LQYROYH SURFHVVLQJ RI ORZ DQG LQWHUPHGLDWHOHYHO GDWD WR H[WUDFW DGGLWLRQDO IHDWXUHV IURP WKH GDWD RU FDQ LQYROYH FRQWURO RI ORZ DQG LQWHUPHGLDWHOHYHO SURFHVVHV WR UHGXFH WKH WRWDO DPRXQW RI FRPSXWDWLRQ UHTXLUHG ,Q WKH EHQFKPDUN WKH SURFHVVLQJ EHJLQV ZLWK ORZOHYHO RSHUDWLRQV RQ WKH LQWHQ VLW\ DQG GHSWK LPDJHV IROORZHG E\ JURXSLQJ RSHUDWLRQV RQ WKH LQWHQVLW\ GDWD WR H[WUDFW FDQGLGDWH UHFWDQJOHV 7KHVH FDQGLGDWHV DUH XVHG WR IRUP SDUWLDO PDWFKHV ZLWK WKH VWRUHG PRGHOV )RU HDFK RI WKHVH PRGHOV PXOWLSOH K\SRWKHWLFDO SRVHV PD\ EH HVWDE OLVKHG )RU HDFK RI WKH SRVH VWRUHG LQIRUPDWLRQ LV XVHG WR SUREH WKH GHSWK DQG LQWHQ VLW\ LPDJHV LQ D WRSGRZQ PDQQHU (DFK SUREH WHVWV D K\SRWKHVLV IRU WKH H[LVWHQFH RI D UHFWDQJOH LQ D JLYHQ ORFDWLRQ LQ WKH LPDJHV 5HMHFWLRQ RI D K\SRWKHVLV ZKLFK RQO\ RFFXUV ZKHQ WKHUH LV VWURQJ HYLGHQFH WKDW D UHFWDQJOH LV DFWXDOO\ DEVHQW UHVXOWV LQ WKH HOLPLQDWLRQ RI WKH FRUUHVSRQGLQJ PRGHO SRVH &RQILUPDWLRQ RI WKH K\SRWKHVLV UHVXOWV LQ WKH FRPSXWDWLRQ RI D PDWFK VWUHQJWK IRU WKH UHFWDQJOH DQG LW DOVR UHVXOWV LQ WKH XSGDW LQJ RI LWV UHSUHVHQWDWLRQ LQ WKH PRGHO SRVH ZLWK QHZ VL]H RULHQWDWLRQ DQG SRVLWLRQ LQIRUPDWLRQ $IWHU D SUREH KDV EHHQ SHUIRUPHG IRU HYHU\ XQPDWFKHG UHFWDQJOH LQ WKH OLVW RI PRGHO SRVHV DQ DYHUDJH PDWFK VWUHQJWK LV FRPSXWHG IRU HDFK SRVH WKDW KDV QRW EHHQ HOLPLQDWHG 7KH PRGHO SRVH ZLWK WKH KLJKHVW DYHUDJH LV VHOHFWHG DV WKH EHVW PDWFK 0RUH GHWDLOV RI WKH EHQFKPDUN FDQ EH IRXQG LQ >@
3 On-Chip Multiprocessor ,Q WKLV SDSHU ZH XVH
5DSWRU>@
DQ RQFKLS PXOWLSURFHVVRU FRQVLVWLQJ RI IRXU
LQGHSHQGHQW SURFHVVRU FRUHV FDOOHG *HQHUDO 3URFHVVRU 8QLWV*38V DQG RQH JUDSKLF FRSURFHVVRU FDOOHG *UDSKLF &RSURFHVVRU 8QLW*&8 'XH WR WKH OLPLWHG GLH VL]H ZH KDYH FKRVHQ IRXU *38V WKDW DUH LQWHJUDWHG LQWR D VLQJOH FKLS 7KH *&8 LV VKDUHG E\ IRXU *38V $OVR LQ RUGHU WR FRQWURO *38V*&8 DQG WR SURYLGH DQ LQWHUIDFH WR RXWVLGH ZRUOG DGGLWLRQDO IRXU FRPSRQHQW XQLWV DUH LQFOXGHG LQ 5DSWRU QDPHO\ ,QWHUSURFHVVRU %XV
8QLW,%8
([WHUQDO
&DFKH
&RQWURO
8QLW(&8
0XOWLSURFHVVRU
&RQWURO
8QLW0&8 DQG 3RUW ,QWHUIDFH 8QLW3,8 7KH ,%8 LV D VKDUHG EXV FRQQHFWLQJ WKH *38V DQG WKH (&8 7KH 0&8 GLVWULEXWHV WKH LQWHUUXSWV DFURVV WKH *38V DQG SURYLGHV V\QFKURQL]DWLRQ UHVRXUFHV DPRQJ WKH *38V
7KH 3,8 LV D PXWLSURFHVVRUUHDG\ EXV
LQWHUIDFH WR FRPPXQLFDWH ZLWK WKH H[WHULRU RI WKH 5DSWRU 7KH IRXU *38V H[HFXWH DOO LQVWUXFWLRQV H[FHSW H[WHQGHG JUDSKLF LQVWUXFWLRQV ZLWK WKHLU RZQ UHJLVWHU ILOHV DQG SURJUDP FRXQWHUV EXW VKDUH WKH (&8 WKURXJK WKH ,%8 $ *38 SHUIRUPV JUDSKLF LQ VWUXFWLRQV ZLWK 6LQJOH ,QVWUXFWLRQ 6WUHDP 0XOWLSOH 'DWD 6WUHDP6,0' VW\OH DQG SL[HO SURFHVVLQJ KDUGZDUH 7KH VDOLHQW IHDWXUHV RI 5DSWRU FDQ EH VXPPDUL]HG DV IROORZV QG 6LQJOH FKLS ZD\ PXOWLSURFHVVRU VKDULQJ RIIFKLS OHYHO FDFKH
z z z z z
ELW GDWD DQG ELW YLUWXDO DGGUHVV 63$5& 9 ,QVWUXFWLRQ 6HW $UFKLWHFWXUH,6$ ([WHQVLRQ RI JUDSKLF LQVWUXFWLRQ VHW
VW QG 0XOWLSOH FDFKH VWUXFWXUH FRQVLVWLQJ RI RQFKLS OHYHO FDFKH DQG RIIFKLS
z z
OHYHO FDFKH
VW +DUYDUG VWUXFWXUH RI OHYHO FDFKH FRQVLVWLQJ RI .E\WH LQVWUXFWLRQ FDFKH DQG
.E\WH RI GDWD FDFKH QG QG 2QFKLS OHYHO FDFKH FRQWUROOHU KDQGOLQJ 0E\WH RI XQLILHG RIIFKLS OHYHO FDFKH
4 Simulation Environment 7R HYDOXDWH WKH 5DSWRU TXDQWLWDWLYHO\ ZH GHYHORSHG D GHGLFDWHG VLPXODWRU FDOOHG
5DS6LP
$OVR D SURJUDPPLQJ HQYLURQPHQW FDOOHG 00260XOWLWKUHDGHG 0LQL
26 ZDV GHYHORSHG WR VXSSRUW D PXOWLWKUHDGHG SURJUDPPLQJ RQ WKH PXOWLSOH *38V 7KH RYHUDOO HQYLURQPHQW RI WKH 5DS6LP DQG WKH 0026 LV VKRZQ LQ )LJ
(Fig. 1. The RapSim simulation environment: the MMOS layer (Pthread library, Math library, C library, benchmark program, thread scheduler, interrupt service, timer/keyboard interrupts, RapSim interface support) running on the RapSim Raptor simulator (processor context, register file, PC, main memory, 2nd-level cache), hosted on a SunOS SPARC workstation.)
LV D SURJUDPGULYHQ PLFUR DUFKLWHFWXUH VLPXODWRU WKDW PRGHOV WKH
IRXU *38V DQG D PHPRU\ KLHUDUFK\ VKDUHG E\ WKH IRXU *38V 7KH 5DP6LP FRQVLVWV RI D 3UH3URFHVVLQJ 8QLW DQG D 3RVW3URFHVVLQJ 8QLW 7KH 3UH3URFHVVLQJ 8QLW RI WKH 5DS6LP LV DQ LQVWUXFWLRQ VHW VLPXODWRU ZKLOH WKH 3RVW3URFHVVLQJ 8QLW LV D SHUIRUPDQFH VLPXODWRU 7KH 3UH3URFHVVLQJ 8QLW FRQVLVWV RI IRXU FRPSRQHQWV D SURFHVVRU PRGHO IRU H[ HFXWLQJ LQVWUXFWLRQV GDWD VWUXFWXUHV IRU UHJLVWHU ILOHV SUR[\ PRGHO IRU SURFHVVLQJ V\V VW WHP FDOOV DQG D PRGHO RI OHYHO FDFKH 7KH 3UH3URFHVVLQJ 8QLW IHWFKHV WKH LQVWUXF QG WLRQV DQG WKH GDWD IURP WKH VKDUHG PHPRU\ KLHUDUFK\ LQFOXGLQJ OHYHO FDFKH DQG H[HFXWHV WKH LQVWUXFWLRQV DQG JHQHUDWHV DQ RQWKHIO\ WUDFH FRQVXPHG E\ WKH 3RVW 3URFHVVLQJ 8QLW 7KH 3UH3URFHVVLQJ 8QLW VWDUWV WKH VLPXODWLRQ E\ ORDGLQJ D EHQFK PDUN ELQDU\ ILOH FRPSLOHG DQG VWDWLFDOO\ OLQNHG ZLWK WKH 0026 OLEUDU\ LQWR PHPRU\ PRGHO 'XULQJ WKH ORDGLQJ RI WKH EHQFKPDUN ELQDU\ D SURSHU VWDUWLQJ SURJUDP FRXQWHU LV VHW LQ WKH SURFHVVRU PRGHO 7UDS WDEOH DQG WUDS KDQGOHUV DUH LQLWLDOL]HG LQ WKH PHPR
U\ PRGHO DQG D VWDFN LV FRQVWUXFWHG LQ WKH PHPRU\ PRGHO 7KHQ WKH SURFHVVRU PRGHO H[HFXWHV WKH LQVWUXFWLRQV XVLQJ WKH LQWHUQDO UHVRXUFHV OLNH WKH H[HFXWLRQ XQLWV UHJLVWHU VW
ILOHV DQG OHYHO FDFKH $V WKH 3UH3URFHVVLQJ 8QLW UXQV LWV LQVWUXFWLRQ VWUHDPV LW JHQHUDWHV DQ RQWKHIO\ WUDFH D VHTXHQFH RI H[HFXWHG LQVWUXFWLRQV (DFK HQWU\ RI WKH WUDFH FRQWDLQV HQRXJK LQIRUPDWLRQ VR WKDW WKH 3RVW3URFHVVLQJ 8QLW FDQ SHUIRUP WKH SHUIRUPDQFH VLPXODWLRQ XVLQJ WKH WUDFH DV LQSXWV 7KH 3RVW3URFHVVLQJ 8QLW LV D 5,6& SLSHOLQH PRGHO FRQ GXFWLQJ SHUIRUPDQFH VLPXODWLRQ E\ XVLQJ WKH LQVWUXFWLRQ WUDFHV JHQHUDWHG IURP WKH 3UH 3URFHVVLQJ 8QLW ,W LV PRGHOHG DV D LVVXH
VXSHUVFDODU
LQFOXGLQJ 5HVHUYDWLRQ 6WD
WLRQV56 DQG D 5HRUGHU %XIIHU52% WR VXSSRUW RXWRIRUGHU H[HFXWLRQV 7ZR LQ VWUXFWLRQV LQ D 7UDFH %XIIHU DUH IHWFKHG DQG SUHGHFRGHG LQ D F\FOH 7KH SUHGHFRGHG LQVWUXFWLRQV LQ DQ ,QVWUXFWLRQ %XIIHU DUH GHFRGHG DQG LVVXHG LQWR SURSHU 5HVHUYDWLRQ 6WDWLRQV56 DQG WKH 5HRUGHU %XIIHU LV XSGDWHG VLPXOWDQHRXVO\ (DFK H[HFXWLRQ XQLW UXQV VDIH LQVWUXFWLRQV IURP D SURSHU 5HVHUYDWLRQ 6WDWLRQ UHVROYLQJ GHSHQGHQF\ SURE OHPV 7KH
0026
SURYLGHV WKH 5DS6LP ZLWK D PXOWLWKUHDGHG SURJUDPPLQJ HQYLURQ
PHQW WR XWLOL]H IRXU *38V HIILFLHQWO\ 7KH 0026 KDV D 3WKUHDG>@ OLEUDU\ & OLEUDU\ DQG 5DS6LP LQWHUIDFH 7KH & OLEUDU\ DOORZV PXOWLSOH WKUHDGV WR DFFHVV WKH VKDUHG & OLEUDU\ ZLWKRXW V\QFKURQL]DWLRQ SUREOHPV ZKHUHDV WKH 3WKUHDG OLEUDU\ SURYLGHV V\Q FKURQL]DWLRQ DQG VFKHGXOLQJ UHTXLUHPHQWV DPRQJ PXOWLSOH WKUHDGV 7KH 5DS6LP LQWHU IDFH FRQQHFWV WKH 0026 WR WKH 5DS6LP DQG VFKHGXOHV DQG DVVLJQV WKUHDGV LQWR WKH SURFHVVRU PRGHOV RI WKH 5DS6LP 7KH VLPXODWLRQ SDUDPHWHUV XVHG LQ WKH H[SHULPHQW DUH OLVWHG LQ 7DEOH 7DEOH
VW QG
OHYHO FDFKH VL]H OHYHO FDFKH VL]H
:ULWH XSGDWH SROLF\
6LPXODWLRQ 3DUDPHWHUV
3DUDPHWHU
'HIDXOW 9DOXH .E\WH ,FDFKH .E\WH 'FDFKH E\WHV SHU OLQH 0E\WH E\WHV SHU OLQH
VW QG
OHYHO FDFKH DFFHVV ODWHQF\ OHYHO FDFKH DFFHVV ODWHQF\
0DLQ PHPRU\ DFFHVV ODWHQF\
VW OHYHO QG
FDFKH WR
QG
OHYHO FDFKH ZULWH WKURXJK
OHYHO FDFKH WR PDLQ PHPRU\ ZULWH EDFN
F\FOH F\FOHV F\FOHV
5 Simulation Results and Analysis 7KUHH VHWV RI VLPXODWLRQV ZHUH FRQGXFWHG IRU HDFK YLVLRQ WDVN GHVFULEHG LQ 6HF WLRQ 7KH LPDJH VL]H ZDV ; 7KH WKUHH VHWV RI VLPXODWLRQV ZHUH
z z z
6HTXHQWLDO D QRQPXOWLWKUHDGLQJ UXQQLQJ RQ D *38 FRQILJXUDWLRQ 7KUHDGV ZD\ PXOWLWKUHDGLQJ UXQQLQJ RQ D *38 FRQILJXUDWLRQ 7KUHDGV ZD\ PXOWLWKUHDGLQJ RQ D *38 FRQILJXUDWLRQ 7KH ,QVWUXFWLRQV 3HU &\FOH,3& DQG WKH WRWDO QXPEHU RI H[HFXWLRQ F\FOHV ZHUH
PHDVXUHG DV RXU SHUIRUPDQFH PHWULFV
(Fig. 2. Distribution of Instructions Executed on a 1-GPU Configuration: per-benchmark breakdown into ALU, BRU, LDU, STU, FPU, WIN and NOP instructions.)
7R FKDUDFWHUL]H WKH FRPSXWDWLRQDO UHTXLUHPHQW RI HDFK YLVLRQ WDVN LQ WKH REMHFW UHFRJQLWLRQ V\VWHP ZH EUHDN GRZQ WKH LQVWUXFWLRQV H[HFXWHG RQ D *38 FRQILJXUD WLRQ LQWR VHYHQ FRPSRQHQWV DV VKRZQ LQ )LJ ,Q WKLV )LJ $/8 %58 /'8 678 )38 :,1 UHSUHVHQW $/8 EUDQFK ORDG VWRUH )38 ZLQGRZ UHJLVWHU LQVWUXFWLRQV
Marker annotations can be seen simply as program comments (i.e. they can be ignored) if only the functional semantics of an application is considered. Markers are associated with program points between instructions, possibly in different threads. Constraints may be specified between program points delineated by these markers. For a marker M, time(M) (read as "the visit time at M") denotes the time at which the instruction immediately preceding M has just been completed. In the following, we will refer to time(M) simply as M whenever confusion is unlikely. Given a pair of markers, constraints can be stated to specify their relative order of execution in all executions of the
program. If the execution of a thread T1 reaches a program point whose execution time is constrained to be greater than the execution time of a not yet executed program point in a different thread T2, thread T1 is forced to suspend execution. In the presence of loops and procedure calls, a marker is typically visited several times during program execution. Thus, in general, a marker M associated with a program point p represents an event class E where each of its instances e, e+, e++, ... corresponds to a visit to p during program execution (e represents the first visit, e+ the second, etc.).
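One possible runtime reading of markers and of a simple precedence constraint between two markers is sketched below in Java. This illustrates the intended semantics only; it is not the implementation described in Section 5, which derives the orderings from a constraint logic program. The class and method names are made up.

public class MarkerSketch {
    static final class Marker {
        private long visits = 0;                     // number of times the program point was passed
        synchronized long visits() { return visits; }
        synchronized void visit() { visits++; notifyAll(); }
        // block the calling thread until this marker has been visited more than n times
        synchronized void awaitMoreThan(long n) throws InterruptedException {
            while (visits <= n) wait();
        }
    }

    // enforce p(a, b): the k-th visit of b may only happen after the k-th visit of a
    // (assumes a single thread visits b, which keeps the sketch race-free)
    static void visitConstrained(Marker a, Marker b) throws InterruptedException {
        a.awaitMoreThan(b.visits());                 // wait until a is "ahead" of b
        b.visit();
    }

    public static void main(String[] args) throws Exception {
        Marker a = new Marker(), b = new Marker();
        Thread t = new Thread(() -> {
            try {
                for (int k = 0; k < 3; k++) { visitConstrained(a, b); System.out.println("b visited"); }
            } catch (InterruptedException ignored) {}
        });
        t.start();
        for (int k = 0; k < 3; k++) { Thread.sleep(10); a.visit(); System.out.println("a visited"); }
        t.join();
    }
}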
3.3 Constraints and methods
Sometimes it is more convenient to express the synchronization aspects of an object-oriented program in terms of its methods rather than in terms of specific program points in its code. Thus, we allow in the language higher-level constraints of the form p(c, m1, m2) ⇐ q(e_1, e_2, ..., e_k), where m1 and m2 are methods in class c and e_1, e_2, ..., e_k are program points (i.e. markers) in the code of either m1 or m2. This may be seen as adding some syntactic sugar on top of the base language previously defined.
4 Synchronization constraints
In this section we illustrate how the proposed language is used to specify the concurrency issues (safety properties) of concurrent object-oriented systems by presenting an example: a bounded stack data type. Consider a Java implementation of a bounded stack data type. A basic specification without synchronization code is as follows (ignore the markers for the moment, i.e. treat them as comments):

class BoundedStack {
    static final int MAX = 10;
    int pos = 0;
    Object[] contents = new Object[MAX];

    public Object peek() { return contents[pos - 1]; }
    public Object pop()  { return contents[--pos]; }
    public void push(Object e) { contents[pos++] = e; }
}
The specification of the class BoundedStack with synchronization code added is shown in the program listing below. Let us consider, for instance, the declaration of the peek method with synchronization as shown in the listing. It specifies the safety property that the peek method waits until there are no more threads currently executing a push or a pop method, i.e. mutual exclusion between peek and push and between peek and pop. It is clear that the synchronization code completely dominates the source code: almost all of the code for the peek method is synchronization code. Furthermore, it is very difficult to formally reason about the correctness of the code.
    public class BoundedStack {
        private static final int MAX = 10;
        private int pos = 0;
        private Object[] contents = new Object[MAX];
        private int activeReaders_ = 0;
        private int activeWriters_ = 0;
        private int waitingReaders_ = 0;
        private int waitingWriters_ = 0;

        private boolean empty() { return pos == 0; }
        private boolean full()  { return pos == MAX; }

        public Object peek() {
            synchronized (this) {
                ++waitingReaders_;
                while (waitingWriters_ > 0 || activeWriters_ > 0) {
                    try { wait(); } catch (InterruptedException ex) {}
                }
                --waitingReaders_;
                ++activeReaders_;
            }
            try {
                return contents[pos];
            } finally {
                synchronized (this) { --activeReaders_; notifyAll(); }
            }
        }

        public Object pop() {
            synchronized (this) {
                ++waitingWriters_;
                while (activeReaders_ > 0 || activeWriters_ > 0 || empty()) {
                    try { wait(); } catch (InterruptedException ex) {}
                }
                --waitingWriters_;
                ++activeWriters_;
            }
            try {
                return contents[--pos];
            } finally {
                synchronized (this) { --activeWriters_; notifyAll(); }
            }
        }

        public void push(Object e) {
            synchronized (this) {
                ++waitingWriters_;
                while (activeReaders_ > 0 || activeWriters_ > 0 || full()) {
                    try { wait(); } catch (InterruptedException ex) {}
                }
                --waitingWriters_;
                ++activeWriters_;
            }
            contents[pos++] = e;
            synchronized (this) { --activeWriters_; notifyAll(); }
        }
    }
Similarly, the safety properties that no thread attempts to remove (pop) an item from an empty stack and no thread attempts to append (push) an item into a full stack require coding additional synchronization code in the pop and push methods. These safety properties can be elegantly and formally expressed by using temporal constraints as follows. The requirement that the peek method waits until there are no more threads currently executing a push or a pop method may be implemented by

    mutex(a1, a2, a3, a4).
    mutex(a1, a2, a5, a6).
    mutex(X1, X2, Y1, Y2) ⇐ X2 < Y1, mutex(X1+, X2+, Y1, Y2);
                             Y2 < X1, mutex(X1, X2, Y1+, Y2+).

where a1, a2, ..., a6 are the markers on our initial Java program. We may define equivalent higher-level constraints restricting the execution of the peek, pop and push methods by defining:

    mutex(Stack, peek, push) ⇐ mutex(a1, a2, a3, a4)
    mutex(Stack, peek, pop)  ⇐ mutex(a1, a2, a5, a6)
The requirement that no thread attempts to remove an item from an empty stack and that no thread attempts to append an item into a full stack may be implemented, respectively, by the constraints p(a8, a5) and p(a6, a7(+MAX)), where p is defined by

    p(A, B) ⇐ A < B, p(A+, B+).
5 Implementation
The constraint logic programs have a procedural interpretation that allows a correct specification to be executed, in the sense that events are only executed as permitted by the constraints represented by the program. This procedural interpretation is based on an incremental execution of the program and a lazy generation of the corresponding partial orders. Constraints are generated by the constraint logic program only when needed to reason about the execution times of current events. A description of how this procedural interpretation of constraint logic programs is implemented can be found in [12]. Fairness is implicitly guaranteed by our implementation. Every event that becomes enabled will eventually be executed (provided that the program point associated with it is reached). This is implemented by dealing with event execution requests on a first-in-first-out basis. Although fairness is provided as the default, users may intervene by specifying priority events using temporal constraints (on how to do this, see [12]). It is therefore possible to specify unfair scheduling. A prototype implementation of the ideas presented here has been written in Java. Java was used both to implement the constraint language and to write the code of a number of applications. The discussion of these applications and further details of the implementation can be found in a companion paper.
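As a rough illustration of this procedural reading, the following deliberately simplified Java sketch is our own (it is not the CLP machinery of [12], the names are invented, and the FIFO handling of requests is omitted): a marker visit becomes a call into a small runtime that suspends the visiting thread until a guard, standing in for the constraint store, allows the event.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.function.BiPredicate;

    // Hypothetical sketch: each marker visit blocks until the guard (a stand-in
    // for the temporal constraints) permits it, then records the event instance.
    class MarkerRuntime {
        private final Map<String, Integer> visits = new HashMap<>();     // counts e, e+, e++, ...
        private final BiPredicate<String, Map<String, Integer>> allowed;  // constraint check

        MarkerRuntime(BiPredicate<String, Map<String, Integer>> allowed) {
            this.allowed = allowed;
        }

        synchronized void visit(String marker) throws InterruptedException {
            while (!allowed.test(marker, visits)) {
                wait();                              // suspend: constraints not yet satisfied
            }
            visits.merge(marker, 1, Integer::sum);   // record this instance of the event class
            notifyAll();                             // other suspended visits may now be enabled
        }
    }

    // Example guard for a precedence constraint of the form p(M1, M2): a visit
    // to M2 is allowed only if strictly more visits to M1 have already occurred.
    // new MarkerRuntime((m, v) -> !m.equals("M2")
    //         || v.getOrDefault("M1", 0) > v.getOrDefault("M2", 0));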
6 Conclusion
We have presented a high-level language for expressing synchronization constraints in concurrent object-oriented applications. In the language, the safety properties of the system are explicitly stated as temporal constraints. Programs are annotated at points of interest so that the run-time environment enforces specific temporal relationships between the visit times of these points. Higher-level constraints on class methods may also be defined. Constraints are language independent in that the application program can be specified in any conventional concurrent object-oriented language. The constraints have a procedural interpretation that allows the specification to be executed. The procedural interpretation is based on the incremental and lazy generation of constraints, i.e. constraints are considered only when needed to reason about the execution time of current events. This paper presents work in progress, so several important issues are still to be considered. Our implementation is still in a prototype stage, thus several efficiency issues have still to be addressed. In particular, we will focus on how the
two key features of incrementality and laziness may be most efficiently achieved. Another important issue is how to deal with progress properties. Currently, constraints explicitly state all safety and timing properties of programs. However, the progress (liveness) properties of programs remain implicit. It would be desirable to be able to express these properties explicitly as additional constraints, but so far we have not devised a way to do that. Future versions may also include a deadlock detection feature. We are considering a mechanism that checks user constraints for cycles (e.g., A < B, B < A) whenever a timeout occurs.
References
1. Atkinson, C. 1991. Object-Oriented Reuse, Concurrency and Distribution: An Ada-Based Approach. Addison-Wesley.
2. Van den Bos, J. and Laffra, C. 1989. PROCOL: A parallel object language with protocols. ACM SIGPLAN Notices 24(10):95-112, October 1989. Proc. of OOPSLA '89.
3. Gregory, S. and Ramirez, R. 1995. Tempo: a declarative concurrent programming language. Proc. of the ICLP (Tokyo, June), MIT Press, 1995.
4. Gregory, S. 1995. Derivation of concurrent algorithms in Tempo. In LOPSTR95: Fifth International Workshop on Logic Program Synthesis and Transformation.
5. Hong, S. and Gerber, R. 1995. Compiling real-time programs with timing constraint refinement and structural code motion. IEEE Transactions on Software Engineering, 21.
6. Jahanian, F. and Mok, A. K. 1987. A graph theoretic approach for timing analysis and its implementation. IEEE Transactions on Computers, C-36(8).
7. Kiczales, G., Lamping, J., Mendhekar, A., Maeda, C., Lopes, C., Loingtier, J.-M. and Irwin, J. 1997. Aspect-oriented programming. In ECOOP '97 - Object-Oriented Programming, Lecture Notes in Computer Science, number 1241, pp. 220-242, Springer-Verlag.
8. Kowalski, R. A. and Sergot, M. J. 1986. A logic-based calculus of events. New Generation Computing 4, pp. 67-95.
9. Krakowiak, S., Meysembourg, M., Nguyen Van, H., Riveill, M., Roisin, C. and Rousset de Pina, X. 1990. Design and implementation of an object-oriented strongly typed language for distributed applications. Journal of Object-Oriented Programming 3(3):11-22.
10. Leung, A., Palem, K. and Pnueli, A. 1998. Time C: A Time Constraint Language for ILP Processor Compilation. Technical Report TR1998-764, New York University.
11. Pratt, V. 1986. Modeling concurrency with partial orders. International Journal of Parallel Programming 15, 1, pp. 33-71.
12. Ramirez, R. 1996. A logic-based concurrent object-oriented programming language. PhD thesis, Bristol University.
13. Shaw, A. 1989. Reasoning about time in higher-level language software. IEEE Transactions on Software Engineering, 15(7).
14. Stoyenko, A. D., Marlowe, T. J. and Younis, M. F. 1996. A language for complex real-time systems. Technical Report cis9521, New Jersey Institute of Technology.
15. De Volder, K. and D'Hondt, T. 1999. Aspect-oriented logic meta programming. In Meta-Level Architectures and Reflection, Lecture Notes in Computer Science number 1616, pp. 250-272. Springer-Verlag.
Scalable Monitoring Technique for Detecting Races in Parallel Programs*

Yong-Kee Jun** (Dept. of Computer Science, Gyeongsang National University, Chinju, 660-701 South Korea, [email protected]) and Charles E. McDowell (Computer Science Department, University of California, Santa Cruz, CA 95064 USA, [email protected])

* University Research Program supported by the Ministry of Information and Communication in South Korea.
** At Gyeongsang National University, he is also involved in both the Institute of Computer Research and Development and the Information and Communication Research Center, as a research professor.

Abstract. Detecting races is important for debugging shared-memory
parallel programs, because the races result in unintended nondeterministic executions of the programs. Previous on-the-fly techniques to detect races have a bottleneck caused by the need to check or serialize all accesses to each shared variable in a program that may have nested parallelism with barrier synchronization. The new scalable monitoring technique in this paper reduces the bottleneck significantly by checking or serializing at most 2(B + 1) non-nested accesses in an iteration for each shared variable, where B is the number of barrier operations in the iteration. This technique, therefore, makes on-the-fly race detection more scalable.
1 Introduction
A race is a pair of unsynchronized instructions, in a set of parallel threads, accessing a shared variable where at least one is a write access. Detecting races is important for debugging shared-memory parallel programs, because the races result in unintended nondeterministic executions of the programs. Traditional cyclical debugging with breakpoints is often not effective in the presence of races. Breakpoints can change the execution timing, causing the erroneous behavior to disappear. On-the-fly race detection instruments either the program to be debugged [1, 3, 5], or the underlying system [8, 12, 13], and monitors an execution of the program to report races which occur during the monitored execution. One drawback of existing on-the-fly techniques is the run-time overhead which is incurred from the need to check or serialize all accesses to the same shared-memory location. Every access must be compared with the previous accesses stored in a shared data
structure, often called the access history. In addition, the access history must be updated. This overhead has limited the usefulness of on-the-fly techniques. The overhead can be reduced by detecting only the first races [3, 5, 8, 9, 12], which, intuitively, occur between two accesses that are not causally preceded by any other accesses also involved in races. It is important to detect the first races efficiently, because the removal of the first races can make other races disappear. It is even possible that all races reported by other on-the-fly algorithms would disappear once the first races were removed. A previous paper [5] presents a scalable on-the-fly technique to detect the first races, in which at most two accesses to a shared variable in each thread must be checked. However, this technique is restricted to parallel programs which have neither nested parallelism nor inter-thread synchronization. In this paper, we introduce a new scalable technique for programs which may have nested parallelism and barrier synchronization. After first describing the background information on this work, in Section 3 we introduce a set of accesses, called filtered accesses. The set includes any accesses involved in first races. We then introduce two filtering procedures which examine if the current access is a filtered access during the execution of a program. Checking only filtered accesses is sufficient for detecting first races in the execution instance and reduces the number of non-nested accesses in an iteration that must be checked or serialized to at most 2(B + 1) for each shared variable, where B is the number of barrier operations in the iteration. Before concluding the paper, we briefly mention some related work in Section 4.
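As a concrete (and deliberately trivial) instance of the kind of bug targeted here, written in Java for brevity although this paper addresses Fortran-style parallel loops, the two unsynchronized accesses to x below form a race, because they may execute concurrently and one of them is a write.

    // A minimal data race: the write in t1 and the read in t2 access the shared
    // variable x concurrently and without synchronization, so the printed value
    // depends on the execution timing (0 or 1).
    class TinyRace {
        static int x = 0;

        public static void main(String[] args) throws InterruptedException {
            Thread t1 = new Thread(() -> { x = 1; });                   // write access
            Thread t2 = new Thread(() -> { System.out.println(x); });  // read access
            t1.start();
            t2.start();
            t1.join();
            t2.join();
        }
    }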
2 Background
This work applies to shared-memory parallel programs [10, 11] with nested fork-join parallelism using parallel sections(1) or parallel loops. In this paper we use PARALLEL DO and END DO as in PCF Fortran [11]. The program may have inter-thread coordination in the loops using barriers. The nesting level of an individual loop is equal to one plus the number of the enclosing outer loops, and each loop may enclose zero or more disjoint loops at the same level. For example, Figure 1 shows a parallel loop of nesting depth two, which has two loops in the second nesting level. In an execution of the program, multiple threads of control are created at a PARALLEL DO and terminated at the corresponding END DO statement. These fork and join operations are called thread operations. The concurrency relationship among threads is represented by a directed acyclic graph, called a Partial Order Execution Graph (POEG) [1]. A vertex of a POEG represents a thread operation, and an arc originating from a vertex represents a thread starting from the corresponding thread operation. Figure 1 shows a POEG that is an execution instance of the program shown in the same figure, where a small filled circle on a thread represents an access executed by the thread to the shared variable X. If the program contains barrier synchronization, the POEG will contain additional edges to reflect the induced ordering.

(1) The work in this paper can also be applied to parallel sections without difficulty.
[Figure 1 appears here: an example parallel program, a PARALLEL DO loop of nesting depth two whose body reads and writes the shared variable X, together with its POEG, on which the accesses r0-r6, r8, r9, r12 and w7, w10, w11, w13, w14 to X are marked.]
Fig. 1. A Parallel Program and Partial Order Execution Graph

Because the graph captures the happened-before relationship [6], it represents a partial order over the set of events executed by the program. Concurrency determination is not dependent on the number or relative execution speeds of processors executing the program. An event e_i happened before another event e_j if there exists a path from e_i to e_j in the POEG, and e_i is concurrent with e_j if neither one happened before the other. For example, consider the accesses in Figure 1, where r0 happened before w7 because there exists a path from r0 to w7, and r0 is concurrent with w11 because there is no path between them.
Definition 1. A dynamic iteration of a parallel loop in the i-th nesting level, called a level i iteration, consists of three components: (1) a thread, t_f, in the i-th nesting level that starts immediately at a fork operation, (2) a thread, t_j, in the i-th nesting level that terminates at the corresponding join operation, and (3) a set of threads, T, such that t_f happened before T, which happened before t_j.
a
a
a
a
a
a
a
a
a
a
a
a
a
T
a
a
a
a
a
a
a
T
T
De nition 2. A rst race is either an unaected race or a tangled race.
a
There are twenty-seven races in the POEG shown in Figure 1; all of the accesses in the POEG are involved in races. Among these, only three races, {r0-w11, w7-r8, r8-w10}, are first races, which are tangled races. Eliminating the three tangled races may make the other seven affected races disappear. The term tangled race was introduced by Netzer and Miller [9], describing the situation when no single race from a set of tangled races is unaffected by the others. Note that there can never be exactly one tangled race in an execution. They also introduce a tighter notion of first race, called non-artifact race, which uses the event-control dependences to define how accesses affect each other.

3 Scalable Monitoring Technique
B
a
a
k
a
k
I
a
I
a
a
a
a
w
a
k
k
I
r
k
r
w
w
I
w
w
w
w
w
r
w
w
w
r
r
r
r
w
w
w
r
r
w
r
w
w
r
    0   CheckRead(X, k, bv, c_r)
    1     if ¬P(X, k, bv_k) ∧ ¬F(X, k, bv_k) then
    2       for i := 1 upto k do
    3         P(X, i, bv_i) := true;
    4       endfor
    5       CheckReadFiltered(X, k, bv, c_r);
    6     endif
    7   EndCheckRead

    0   CheckWrite(X, k, bv, c_w)
    1     if ¬P(X, k, bv_k) ∧ ¬F(X, k, bv_k) then
    2       for i := 1 upto k do
    3         F(X, i, bv_i) := true;
    4       endfor
    5       CheckWriteFiltered(X, k, bv, c_w);
    6     elseif P(X, k, bv_k) ∧ ¬F(X, k, bv_k) then
    7       for i := 1 upto k do
    8         F(X, i, bv_i) := true;
    9       endfor
    10      CheckR-writeFiltered(X, k, bv, c_w);
    11    endif
    12  EndCheckWrite

Fig. 2. Checking to filter a read or write access
Among the eight accesses, five are level 2 filtered accesses: three filtered reads {r1, r2, r6}, one filtered write {w10}, and one filtered r-write {w7}. The shaded access names in the POEG distinguish the filtered accesses from the other accesses. Note that the write access w7 is a filtered r-write not only in a level 1 iteration but also in a level 2 iteration. It is necessary to record the fact that some accesses, such as w7, are filtered accesses in iterations at multiple levels, because these accesses can be involved in races at different levels.

Definition 5. A nested filtered access in a level i iteration is a filtered access, a_j, in a level k iteration (i < k) such that a_j is also a level i filtered access.

In Figure 1, the filtered write w10 in a level 2 iteration is a nested filtered r-write in a level 1 iteration. On the other hand, the filtered read r6 in a level 2 iteration is not a nested filtered access in a level 1 iteration. In summary of the accesses in Figure 1, we have fifteen accesses to a shared variable X, which include eight filtered accesses, of which five are involved in three first races. The following theorem shows that monitoring the filtered accesses is a sufficient condition for detecting first races in an execution instance. The proof can be found elsewhere [2].

Theorem 1. If an access a_i is involved in a first race, a_i is a filtered access.

Determining if each access is a filtered access is solely a function of comparing the current access with other previous accesses to the same shared variable in the iterations at multiple nesting levels. For this purpose, we define two states of nested iteration for each shared variable, which are checked on every access to determine if it is a filtered access.

Definition 6. A partially-filtered iteration for a shared variable X in the i-th nesting level is an iteration in which there exists a level i filtered read to X. A fully-filtered iteration for a shared variable X in the i-th nesting level is an iteration in which there exists a level i filtered write or a level i filtered r-write to X.
Kernel  Input Set        Static Accesses      Dynamic Access Checks        Dynamic Filtered Checks
                         Read     Write       Read          Write          Read          Write
MG      64 x 64 x 64     39       23          347,223,802   25,971,050     68,496,214    19,127,166
FT      128 x 128 x 32   16       3           221,448       101,380        113,924       101,380
EP      67108864         4        3           93,463,643    26,485,851     41,950,826    26,483,873
Table 1. Race Instrumentation Statistics

Figure 2 shows two algorithms used to check, in each barrier partition bv, if the current read (c_r) or write access (c_w) to a shared variable X is a level k filtered access. The barrier vector, bv, selects a barrier partition for each nesting level. Each thread and barrier operation determines the value of bv for the corresponding partition and nesting level. P(X, i, bv_i) and F(X, i, bv_i) indicate if a level i iteration I_i is partially-filtered and fully-filtered, respectively, in a partition bv_i in I_i for shared variable X. If there are B non-nested barriers in I_i, then 0 <= bv_i <= B, resulting in (B+1) unique variable pairs for I_i. These private boolean variables are initialized to false at the start of every barrier partition of a level i iteration. Now, look at the procedure CheckRead() (CheckWrite()). If the condition in line 1 is true, it sets all the P(X, i, bv_i) (F(X, i, bv_i)) to true, where i <= k [lines 2-4]. If the current read (write) access is the only access in a barrier partition of a level k iteration, it is a filtered read (write) in the iteration and all the level i iterations are guaranteed to be partially-filtered (fully-filtered). It then performs CheckReadFiltered() (CheckWriteFiltered()) to invoke the detection protocol incurring the expensive centralized check or serialization. The modifications of P(X, i, bv_i) and F(X, i, bv_i) in the higher level iterations do not need to be serialized because the values can only change from false to true. If the condition in line 1 of CheckRead() is false, it does nothing, because if an iteration is fully-filtered or partially-filtered in a barrier partition, there exists at least one previous filtered access in the iteration. In the case of CheckWrite(), it tests another condition. If the line 6 condition is true, CheckWrite() sets all the F(X, i, bv_i) to true, where i <= k [lines 7-9]. If there exists a write access in a barrier partition of a level k iteration, all the level i iterations are fully-filtered. CheckWrite() then performs CheckR-writeFiltered() to invoke the detection protocol incurring the expensive centralized operation. Otherwise, it does nothing, because if an iteration is fully-filtered in a barrier partition there exists at least one previous filtered write or filtered r-write in the iteration. Some experiments were performed on a set of three serial NAS benchmarks, in which a small set of common shared variables were monitored and filtered. Table 1 shows that filtering reduced the number of expensive checks to less than half in the case of read accesses, although the benchmarks are fine-grained. The following theorem shows that these algorithms reduce the number of required checks significantly in a monitored execution. The proof appears elsewhere [2].
Theorem 2. If there exists a set of accesses to a shared variable in an iteration, the set involves at most 2(B + 1) non-nested filtered accesses in an iteration, where B is the number of barrier operations in the iteration.

4 Related Work
Many approaches for efficiently detecting races on-the-fly for parallel programs have been reported. In this section, we briefly mention some important work to improve the scalability of on-the-fly race detection. This work falls into two groups: compiler support [7] to reduce the number of monitored accesses, and underlying system support [8, 12, 13] using scalable distributed shared memory systems. Our technique is novel in that the scalability is provided with simple but powerful instrumented code which can be applied to most existing techniques. Mellor-Crummey [7] describes an instrumentation tool for on-the-fly race detection which applies compile-time analysis to identify variable references that need not be monitored at run-time. Using dependence analysis and interprocedural analysis of scalar side effects, the tool was able to reduce the dynamic counts of instrumented operations by 70-100% for the programs tested. Even with the impressive reductions in dynamic counts of monitoring operations, Mellor-Crummey reports that the monitoring overhead for run-time detection of data races ran as high as a factor of 5.8. Min and Choi [8] propose a technique of on-the-fly race detection that minimizes the number of times the monitored program is interrupted for run-time checking of accesses to shared variables. This scheme uses information from the underlying hardware-based, distributed shared-memory cache coherence protocol and then requires additional hardware support, processor scheduling, cache management and compiler support. Richards and Larus [13] propose a technique similar to that of Min and Choi in a software-based coherence protocol for a fine-grained data-maintaining distributed shared memory system. To detect data races on-the-fly in programs with barrier-only synchronization, this technique resets access histories at barriers, and monitors only the first read and write after obtaining a copy of a coherence block. They obtain substantial performance improvement but risk missing races. They report an implementation of this technique running on a 32-processor CM-5, and some experiments in which monitored applications had slowdowns ranging from 0-3. Perkovic and Keleher [12] implemented on-the-fly race detection in a page-based release-consistent distributed shared memory system, which maintains ordering information that enables the system to make a constant-time determination of whether two accesses are concurrent without compiler support. They extended the system to collect information about the referenced locations and to check at barriers for concurrent accesses to shared locations. Although they statically eliminate over 99% of non-shared accesses in applications, an average of 68% of the total overhead in race detection is the run-time overhead of determining whether an access is to shared memory. Nonetheless, the majority of the results are for non-shared accesses. They report that the applications slow down by an average factor of approximately 2.
5 Conclusion
In this paper, we present a new scalable on-the-fly technique for detecting races in parallel programs which may have nested parallelism with barrier synchronization. Our technique reduces the monitoring overhead by requiring the serialization of at most 2(B + 1) non-nested accesses in an iteration for a shared variable, where B is the number of barrier operations in the iteration. It is important to detect races efficiently, because detecting races might require several iterations of monitoring, and the cost of monitoring a particular execution is still expensive. The technique in this paper can be applied to most existing techniques, therefore making on-the-fly race detection scalable and more practical for debugging shared-memory parallel programs. We have experimented with the technique on a prototype race debugging system, called RaceStand [4], and have been extending it for programs which have more general types of inter-thread coordination than barrier synchronization.
References
1. Dinning, A., and E. Schonberg, "An Empirical Comparison of Monitoring Algorithms for Access Anomaly Detection," 2nd Symp. on Principles and Practice of Parallel Programming, pp. 1-10, ACM, March 1990.
2. Jun, Y., "Improving Scalability of On-the-fly Detection for Nested Parallelism," TR OS-9905, Dept. of Computer Science, Gyeongsang National Univ., March 1999.
3. Jun, Y., and C. E. McDowell, "On-the-fly Detection of the First Races in Programs with Nested Parallelism," 2nd Int. Conf. on Parallel and Distributed Processing Techniques and Applications, pp. 1549-1560, CSREA, August 1996.
4. Kim, D., and Y. Jun, "An Effective Tool for Debugging Races in Parallel Programs," 3rd Int. Conf. on Parallel and Distributed Processing Techniques and Applications, pp. 117-126, CSREA, July 1997.
5. Kim, J., and Y. Jun, "Scalable On-the-fly Detection of the First Races in Parallel Programs," 12th Int. Conf. on Supercomputing, pp. 345-352, ACM, July 1998.
6. Lamport, L., "Time, Clocks, and the Ordering of Events in a Distributed System," Communications of the ACM, 21(7): 558-565, ACM, July 1978.
7. Mellor-Crummey, J., "Compile-time Support for Efficient Data Race Detection in Shared-Memory Parallel Programs," 3rd Workshop on Parallel and Distributed Debugging, pp. 129-139, ACM, May 1993.
8. Min, S. L., and J. D. Choi, "An Efficient Cache-based Access Anomaly Detection Scheme," 4th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, pp. 235-244, ACM, April 1991.
9. Netzer, R. H., and B. P. Miller, "Improving the Accuracy of Data Race Detection," 3rd Symp. on Prin. and Practice of Parallel Prog., pp. 133-144, ACM, April 1991.
10. OpenMP Architecture Review Board, OpenMP Fortran Application Program Interface, Version 1.0, Oct. 1997.
11. Parallel Computing Forum, "PCF Parallel Fortran Extensions," Fortran Forum, 10(3), ACM, Sept. 1991.
12. Perkovic, D., and P. Keleher, "Online Data-Race Detection via Coherency Guarantees," 2nd Usenix Symp. on Operating Systems Design and Implementation, pp. 47-58, ACM/IEEE, Oct. 1996.
13. Richards, B., and J. R. Larus, "Protocol-Based Data-Race Detection," 2nd Sigmetrics Symp. on Parallel and Dist. Tools, pp. 40-47, ACM, August 1998.
3rd IPDPS Workshop on High Performance Data Mining

Preface
The explosive growth in data collection in business and scientific fields has literally forced upon us the need to analyze and mine useful knowledge from it. Data mining refers to the entire process of extracting useful and novel patterns/models from large datasets. Due to the huge size of data and the amount of computation involved in data mining, high-performance computing is an essential component for any successful large-scale data mining application. This workshop provided a forum for presenting recent results in high performance computing for data mining, including applications, algorithms, software, and systems. High-performance was broadly interpreted to include scalable sequential as well as parallel and distributed algorithms and systems. Relevant topics for the workshop included:
1. Scalable and/or parallel/distributed algorithms for various mining tasks like classification, clustering, sequences, associations, trend and deviation detection, etc.
2. Methods for pre/post-processing like feature extraction and selection, discretization, rule pruning, model scoring, etc.
3. Frameworks for KDD systems, and parallel or distributed mining.
4. Integration issues with databases and data-warehouses.
These proceedings contain 9 papers that were accepted for presentation at the workshop. Each paper was reviewed by two members of the program committee. In keeping with the spirit of the workshop some of these papers also represent work-in-progress. In all cases, however, the workshop program highlights avenues of active research in high performance data mining. We would like to thank all the authors and attendees for contributing to the success of the workshop. Special thanks are due to the program committee and external reviewers for help in reviewing the submissions.
February 2000
Mohammed J. Zaki
Vipin Kumar
David B. Skillicorn
Editors
Workshop Co-Chairs
Mohammed J. Zaki (Rensselaer Polytechnic Institute, USA)
Vipin Kumar (University of Minnesota, USA)
David B. Skillicorn (Queens University, Canada)
Program Committee
Philip K. Chan (Florida Institute of Technology, USA)
Alok Choudhary (Northwestern University, USA)
Umeshwar Dayal (Hewlett-Packard Labs., USA)
Alex A. Freitas (Pontifical Catholic University of Parana, Brazil)
Ananth Grama (Purdue University, USA)
Robert Grossman (University of Illinois-Chicago, USA)
Yike Guo (Imperial College, UK)
Jiawei Han (Simon Fraser University, Canada)
Howard Ho (IBM Almaden Research Center, USA)
Chandrika Kamath (Lawrence Livermore National Labs., USA)
Masaru Kitsuregawa (University of Tokyo, Japan)
Sanjay Ranka (University of Florida, USA)
Vineet Singh (Hewlett-Packard Labs., USA)
Domenico Talia (ISI-CNR: Institute of Systems Analysis and Information Technology, Italy)
Kathryn Burn-Thornton (Durham University, UK)

External Reviewers
Eui-Hong (Sam) Han (University of Minnesota, USA)
Wen Jin (Simon Fraser University, Canada)
Harsha S. Nagesh (Northwestern University, USA)
Srinivasan Parthasarathy (University of Rochester, USA)
Implementation Issues in the Design of I/O Intensive Data Mining Applications on Clusters of Workstations

R. Baraglia (1), D. Laforenza (1), Salvatore Orlando (2), P. Palmerini (1) and Raffaele Perego (1)

(1) Istituto CNUCE, Consiglio Nazionale delle Ricerche (CNR), Pisa, Italy
(2) Dipartimento di Informatica, Universita Ca' Foscari di Venezia, Italy
Abstract. This paper investigates scalable implementations of out-of-core I/O-intensive Data Mining algorithms on affordable parallel architectures, such as clusters of workstations. In order to validate our approach, the K-means algorithm, a well known DM clustering algorithm, was used as a test case.
1 Introduction
Data Mining (DM) applications exploit huge amounts of data, stored in files or databases. Such data need to be accessed to discover patterns and correlations useful for various purposes, above all for guiding strategic decision making in the business domain. Many DM applications are strongly I/O intensive since they need to read and process the input dataset several times [1, 6, 7]. Several techniques have been proposed in order to improve the performance of DM applications. Many of them are based on parallel processing [5]. In general, their main goals are to reduce the computation time and/or reduce the time spent on accessing out-of-memory data. Since the early 1990s there has been an increasing trend to move away from expensive and specialized proprietary parallel supercomputers towards clusters of workstations (COWs) [3]. Historically, COWs have been used primarily for science and engineering applications, but their low cost, scalability, and generality provide a wide array of opportunities for new domains of application [13]. DM is certainly one of these domains, since DM algorithms generally exhibit large amounts of data parallelism. However, to efficiently exploit COWs, parallel implementations should be adaptive with respect to the specific features of the machine (e.g. they must take into account the memory hierarchies and caching policies adopted by modern hardware/software architectures). Specific Out-of-Core (OoC) techniques (also known as External Memory techniques) [3, 14] can be exploited to approach DM problems that require huge amounts of memory. OoC techniques are useful for all applications that do not completely fit into the physical memory. Their main goal is to reduce memory hierarchy overheads by bypassing the OS virtual memory system and explicitly
managing I/O. Direct control over data movements between main memory and secondary storage is achieved by splitting the dataset into several small blocks. These blocks are then loaded into data structures which will certainly fit into physical memory. They are processed and, if necessary, written back to disks. The knowledge of the patterns used by the algorithm to access the data can be exploited in an effective way to reduce I/O overheads by overlapping them with useful computations. The access pattern exploited by the DM algorithm discussed in this paper is simple, since read-only datasets are accessed sequentially and iteratively. Note that the general-purpose external memory mechanism provided by the operating system (in our case, the Unix read() system call) is specifically optimized for this kind of data access. This paper investigates scalable implementations of I/O-intensive DM algorithms on affordable parallel architectures, such as clusters of PCs equipped with main memories of a limited size, which are not sufficiently big to store the whole dataset (or even a partition of it). The test case DM application used to validate our approach is based on the on-line K-means algorithm, a well known DM clustering algorithm [8, 10]. The testbed COW was composed of three SMPs, interconnected by a 100BaseT switched Ethernet, where each SMP was equipped with two Pentium II 233 MHz processors, 128 MB of main memory, and a 4 GB UW-SCSI disk. Their OS was Linux, kernel version 2.2.5-15. The paper is organized as follows. Section 2 discusses implementation issues related to the design of I/O-intensive DM applications. Section 3 deals with the K-means algorithm and its parallel implementation. Finally, Section 4 discusses the results of our experiments and draws some conclusions.
2 Implementation of I/O Intensive DM Applications
As mentioned above, we are interested in DM algorithms that sequentially access the same dataset several times. The repeated scanning of the whole dataset entails good spatial locality but scarce temporal locality. The latter can only be exploited if the whole dataset entirely fits into the physical memory. In general, however, this condition cannot be taken for granted because "real life" datasets are generally very large. Moreover, the physical memory is of limited size, and other running processes contend for its usage. The adoption of an OoC algorithm, which takes advantage of possible prefetching policies implemented by both software drivers and disk controllers [11], and which allows multitasking or multithreading strategies to be exploited in order to overlap I/O latencies with useful computations, is thus mandatory. The best policy might thus appear to be to adopt OoC algorithms only if a dataset does not fit into the physical memory. When the memory is large enough, an in-core approach might seem more efficient, since the whole dataset is read once from disk and is repeatedly accessed without further I/O operations. Clearly, such an in-core strategy might fail when other processes use the main memory, thus causing swapping to disk. We believe that "smart" OoC approaches are always preferable to their in-core counterparts, even when datasets are small
with respect to memory size. This assertion is due to the existence of a buffer cache for block devices in modern OSs, such as Linux [2]. The available physical memory left unused by the kernel and processes is dynamically enrolled in the buffer cache on demand. When the requirement for primary memory increases, for example because new processes enter the system, the memory allocated to buffers is reduced. We conducted experiments to compare in-core and out-of-core versions of a simple test program that repeatedly scans a dataset which fits into physical memory. We observed that the two versions of the program have similar performance. In fact, if we consider the OoC version of this simple program, at the end of the first scan the buffer cache contains the blocks of the whole dataset. The following scans of the dataset will not actually access the disk at all, since they find all the blocks to be read in the main memory, i.e. in the buffer cache. In other words, due to the mechanisms provided by the OS, the actual behavior of the OoC program becomes in-core. We also observed another advantage of the OoC program over the in-core solution. During the first scan of the dataset, the OoC program takes advantage of OS prefetching. In fact, during the processing of a block the OS prefetches the next one, thus hiding some I/O time. On the contrary, the I/O time of in-core programs cannot be overlapped with useful computations because the whole dataset has to be read before starting the computation. In summary, the OoC approach not only works well for small datasets, but it also scales up when the problem size exceeds the physical memory size, i.e., in those cases when in-core algorithms fail due to memory swapping. Moreover, to improve scalability for large datasets, we can also exploit multitasking techniques in conjunction with OoC techniques to hide I/O time. To exploit multitasking, non-overlapping partitions of the whole dataset must be assigned to distinct tasks. The same technique can also be used to parallelize the application, by mapping these tasks onto distinct machines. This kind of data-parallel paradigm is usually very effective for implementing DM algorithms, since computation is generally uniform, data exchange between tasks is limited, and generally involves a global synchronization at the end of each scan of the whole dataset. This synchronization is used to check termination conditions and to restore a consistent global state. Consistency restoration is needed since the tasks start each iteration on the basis of a consistent state, generating new local states that only reflect their partial view of the whole dataset. Finally, parallel DM algorithms implemented on COWs also have to deal with load imbalance. In fact, workload imbalance may derive either from different capacities of the machines involved or from unexpected arrivals of external jobs. Since the programming paradigm adopted is data parallel, a possible solution to this problem is to dynamically change partition sizes.
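The following minimal Java sketch (our own rendering; the paper's implementation reads 4 KByte blocks through the Unix read() system call, and the block size and method names here are illustrative) shows the out-of-core scan pattern just described: the dataset is streamed block by block, so each pass touches only one block-sized buffer at a time, and the OS buffer cache and prefetching supply the in-core behavior discussed above.

    import java.io.FileInputStream;
    import java.io.IOException;

    // Out-of-core sequential scan: the dataset is read in small fixed-size blocks,
    // each block is processed as soon as it is available, and the whole pass is
    // repeated as many times as the mining algorithm requires.
    class OocScan {
        static final int BLOCK_SIZE = 4 * 1024;   // 4 KByte blocks, as in the paper

        static void scan(String datasetPath, int passes) throws IOException {
            byte[] block = new byte[BLOCK_SIZE];
            for (int pass = 0; pass < passes; pass++) {
                try (FileInputStream in = new FileInputStream(datasetPath)) {
                    int n;
                    while ((n = in.read(block)) > 0) {
                        process(block, n);         // e.g. update cluster centers
                    }
                }
            }
        }

        static void process(byte[] block, int length) {
            // application-specific processing of one block of data-points
        }
    }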
3 A Test Case DM Algorithm and its Implementation
There is a variety of applications, ranging from marketing to biology, astrophysics, and so on [8], that need to identify subsets of records (clusters) presenting
characteristics of homogeneity. In this paper we used a well known clustering algorithm, the K-means algorithm [10], as a case study representative of a class of I/O intensive DM algorithms. We deal with the on-line formulation of K-means, which can be considered as a competitive learning formulation of the classical K-means algorithm. K-means considers records in a dataset to be represented as data-points in a high dimensional space. Clusters are identified by using the concept of proximity among data-points in this space. The K-means algorithm is known to have some limitations regarding the dependence on the initial conditions and the shape and size of the clusters found [9, 10]. Moreover, it is necessary to define a priori the number K of clusters that we expect to find, even though it is also possible to start with a small number of clusters (and associated centers), and increase this number when specific conditions are observed. The three main steps of the on-line K-means sequential algorithm are: (1) start with a given number of centers randomly chosen; (2) scan all the data-points of the dataset, and for each point p find the center closest to p, assign p to the cluster associated with this center, and move the center toward p; (3) repeat step 2 until the assignment of data-points to the various clusters remains unchanged. In our tests we used synthetic datasets and we fixed K a priori. Note that the repetition of step 2 ensures that centers gradually get attracted into the middle of the clusters.
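A compact Java sketch of one scan of step (2) follows (our own illustration; the learning-rate parameter eta, which controls how far a center is moved toward the current point, is an assumption not fixed in the text).

    // One scan of the on-line K-means step (2): for each data-point p, find the
    // closest center and move that center toward p by a fraction eta of the distance.
    class OnlineKMeansStep {
        static void scan(double[][] points, double[][] centers, double eta) {
            for (double[] p : points) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < centers.length; c++) {
                    double d = 0.0;
                    for (int j = 0; j < p.length; j++) {
                        double diff = p[j] - centers[c][j];
                        d += diff * diff;
                    }
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                for (int j = 0; j < p.length; j++) {
                    centers[best][j] += eta * (p[j] - centers[best][j]);   // move center toward p
                }
            }
        }
    }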
Parallel Implementation. We implemented the OoC version of the algorithm mentioned above, where data-points are repeatedly scanned by sequentially reading small blocks of 4 KBytes from the disk. The program was implemented using MPI according to an SPMD paradigm. A non-overlapping partition of the input file, univocally identified by a pair of boundaries, is processed by each task of the SPMD program. The number of tasks involved in the execution may be greater than the number of physical processors, thus exploiting multitasking. This parallel formulation of our test case is similar to those described in [12, 4], and requires a new consistent global state to be established once each scan of the whole dataset is completed. Our global state corresponds to the new positions reached by the K centers. These positions are determined by summing the vectors corresponding to the centers' movements, which were separately computed by the various tasks involved. In our implementation, the new center positions are computed by a single task, the root one, and are broadcast to the others. The root task also checks the termination condition. The load balancing strategy adopted is simple but effective. It is based on past knowledge of the bandwidths of all concurrent tasks (i.e. the number of points computed in a unit of time). If a load imbalance is detected, the size of the partitions is increased for "faster" tasks and decreased for "slower" ones. This requires input datasets to be replicated on all the disks of our testbed. If complete replication is too expensive or not possible, file partitions with overlapping boundaries can be exploited as well. Let NP be the total number of data-points, and {p_1, ..., p_n} the n tasks of the SPMD program. At the first iteration np_i^1 = NP/n data-points are assigned to each p_i. During iteration j each p_i measures the elapsed time T_i^j spent on elaborating its own block of np_i^j points,
so that t_i^j = T_i^j / np_i^j is the time taken by p_i to elaborate a single point, and b_i^j = 1 / t_i^j = np_i^j / T_i^j is its bandwidth. In order to balance the workload, the numbers np_i^{j+1} of data-points which each p_i has to process in the next iteration are then computed on the basis of the various b_i^j (np_i^{j+1} = alpha_i^j * NP, where alpha_i^j = b_i^j / sum_{i=1}^{n} b_i^j). Finally, the values np_i^{j+1} are easily translated into partition boundaries, i.e. a pair of offsets within the replicated or partially replicated input file.
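In code, the per-iteration re-partitioning amounts to the following (a sketch in Java written for this text; the original program is an MPI/SPMD implementation, and the translation of the np values into file offsets is omitted).

    // Compute the number of data-points np[i] that task p_i should process in the
    // next iteration, proportionally to the bandwidth b[i] = np[i] / T[i] it
    // achieved in the current iteration (NP is the total number of data-points).
    class LoadBalancer {
        static long[] nextPartitionSizes(long[] np, double[] elapsed, long NP) {
            int n = np.length;
            double[] b = new double[n];
            double total = 0.0;
            for (int i = 0; i < n; i++) {
                b[i] = np[i] / elapsed[i];                // bandwidth of task p_i this iteration
                total += b[i];
            }
            long[] next = new long[n];
            long assigned = 0;
            for (int i = 0; i < n; i++) {
                next[i] = (long) (NP * (b[i] / total));   // alpha_i * NP
                assigned += next[i];
            }
            next[n - 1] += NP - assigned;                 // give rounding remainder to the last task
            return next;
        }
    }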
4 Experimental Results and Conclusions
Several experiments were conducted on the testbed with our parallel implementation of K-means based on MPI. Data parallelism, OoC techniques, multitasking, and load balancing strategies were exploited. Note that the successful adoption of multitasking mainly depends on (1) the number of disks with respect to the number of processors available on each machine, and (2) the computational granularity (i.e., the time spent on processing each data block) with respect to the I/O bandwidth. In our experiments on synthetic datasets, we tuned this computational granularity by changing the number K of clusters to look for. Another important characteristic of our approach is the size of the partitions assigned to the tasks mapped on a single SMP machine. If the sum of these sizes is less than the size of the physical main memory, we expect that the behavior of the OoC application will be similar to its in-core counterpart, due to a large enough buffer cache. Otherwise, sequential accesses carried out by a task to its dataset partition will entail disk accesses, so that the only possibility of hiding these I/O times is to exploit, besides OS prefetching, some form of moderate multitasking. Figure 1 shows the effects of the presence of the buffer cache. On a single SMP we ran our test case algorithm with a small dataset (64 MB) and small computational granularity (K=3). Bars show the time spent by the tasks in computing (t_comp), in doing I/O and being idle in some OS queue (t_io + t_idle), and in communication and synchronization (t_comm). The two bars on the left hand side represent the first and the second iterations of a sequential implementation of the test case. The four bars on the right hand side regard the parallel implementation (2 tasks mapped on the same SMP). Note that in both cases t_io and t_idle are high during the first iteration, since the buffer cache is not able to fulfill the read requests (cache misses). On the other hand, these times almost disappear from the second-iteration bars, since the accessed blocks are found in the buffer cache (cache hits). Figure 2 shows the effects of multitasking on a single SMP when the disk has to be accessed. Although a small dataset was used for these experiments, the bars only refer to the first iteration, during which we certainly need to access the disk. Now recall that our testbed machines are equipped with a single disk each. This represents a strong constraint on the I/O bandwidth of our platform. This is particularly evident when several I/O-bound tasks, running in parallel on an SMP, try to access this single disk. In this regard, we found that our test case has different behaviors depending on the computational granularity.
For a fine granularity (K=8), the computation is completely I/O-bound. In this condition it is better to allocate a single task to each SMP (see Figure 2(a)). When we allocated more than one task, the performance worsened because of the limited I/O bandwidth and I/O conflicts. For a coarser granularity (K=32), the performance improved when two tasks were used (see Figure 2(b)). For higher degrees of parallelism the performance decreases. This is due to the overloading of the single disk, and to noise introduced by multitasking into the OS prefetching policy. Figure 3 shows some speedup curves. The plots refer to 20 iterations with K=16. We used at most two tasks per SMP. Note the super-linear speedup achieved when 2 or 3 processors were used. These processors belong to distinct SMPs, so this super-linear speedup is due to the exploitation of multiple disks and to the effects of the buffer cache. In fact, when moderately large datasets were used (64 MB or 128 MB) the data partitions associated with the tasks mapped on each SMP fit into the buffer caches. Overheads due to communications, occurring at the end of each iteration, are very small and do not affect the speedup. In the case of a larger dataset (384 MB), whose size is greater than the whole main memory available, linear speedups were obtained when the number of tasks remains under three. For larger degrees of parallelism, the speedup decreases. This is still due to the limited I/O bandwidth on each SMP.
Figure 1. Execution times of two iterations of the test case on a single SMP. [Bar chart, K=3, 64 MB dataset: for each iteration, the elapsed time in seconds is broken into t_comp, t_comm, and t_io + t_idle for the sequential run and for the 2-task parallel run.]
Figure 4 shows the effectiveness of the load balancing strategy adopted. Both plots refer to experiments conducted using all six processors of our testbed with the 64 MB dataset and K = 32. The plot on the left hand side of the figure shows the number of blocks dynamically assigned to each task by our load balancing algorithm as a function of the iteration index. During the time interval [t1, t3] ([t2, t4]) we executed a CPU-intensive process on the SMP A (M) running tasks A-0 and A-1 (M-0 and M-1). As can be seen, the load balancing algorithm quickly detects the variation in the capacities of the machines, and correspondingly adjusts the size of the partitions by narrowing the partitions assigned to slower machines and enlarging the others.
Figure 2. Execution times of the first iteration on a single SMP, varying the number of tasks exploited and the computational granularity: (a) K=8 and (b) K=32. [Bar charts, 64 MB dataset: elapsed time in seconds broken into t_comp, t_comm, and t_io + t_idle for 1 to 4 tasks per SMP.]
Figure 3. Speedup curves for different dataset sizes (20 iterations, K = 16). [Speed-up versus number of processors (1 to 8) for the 64 MB, 128 MB, and 384 MB datasets.]
The plot on the right hand side compares the execution times obtained with and without our load balancing strategy as a function of the external load present on one of the machines. We can see that in the absence of external load the overhead introduced by the load balancing strategy is negligible. As the external load increases, the benefits of exploiting the load balancing strategy increase as well. In conclusion, this work has investigated the issues related to the implementation of a test case application, chosen as a representative of a large class of DM I/O-intensive applications, on an inexpensive COW. Effective strategies for managing I/O requests and for overlapping their latencies with useful computations have been devised and implemented. Issues related to data parallelism exploitation, OoC techniques, multitasking, and load balancing strategies have been discussed. To validate our approach we conducted several experiments and discussed the encouraging results achieved. Future work regards the evaluation of the possible advantages of exploiting lightweight threads for intra-SMP parallelism and multitasking. Moreover, other I/O intensive DM algorithms have to be considered in order to define a framework of techniques/functionalities useful for efficiently solving general DM applications on COWs, which, unlike homogeneous MPPs, impose additional issues that must be addressed using adaptive strategies.
Figure 4. Effectiveness of the load balancing strategy. [Left: number of blocks (x 1000) assigned to tasks M-0, M-1, A-0, A-1, D-0, D-1 over the iterations (K=32, 64 MB), with external load injected during the intervals marked t1-t4. Right: execution time in seconds with and without load balancing versus the external load (K=32, 64 MB, 20 iterations).]
References
1. Jain, A. K. and Dubes, R. C. Algorithms for Clustering Data. Prentice Hall, 1988.
2. M. Beck et al. Linux Kernel Internals, 2nd ed. Addison-Wesley, 1998.
3. Rajkumar Buyya, editor. High Performance Cluster Computing. Prentice Hall PTR, 1999.
4. I. S. Dhillon and D. S. Modha. A data clustering algorithm on distributed memory machines. In ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 1999.
5. A. A. Freitas and S. H. Lavington. Mining Very Large Databases with Parallel Processing. Kluwer Academic Publishers, 1998.
6. V. Ganti, J. Gehrke, and R. Ramakrishnan. Mining Very Large Databases. IEEE Computer, 32(8):38-45, 1999.
7. E. Han, G. Karypis, and V. Kumar. Scalable Parallel Data Mining for Association Rules. IEEE Transactions on Knowledge and Data Engineering. To appear.
8. J. A. Hartigan. Clustering Algorithms. Wiley & Sons, 1975.
9. G. Karypis, E. Han, and V. Kumar. Chameleon: Hierarchical Clustering Using Dynamic Modeling. IEEE Computer, 32:68-75, 1999.
10. MacQueen, J. B. Some Methods for Classification and Analysis of Multivariate Observations. 5th Berkeley Symp. on Mathematical Statistics and Probability, pages 281-297. Univ. of California Press, 1967.
11. Chris Ruemmler and John Wilkes. An Introduction to Disk Drive Modeling. IEEE Computer, 27(3):17-28, March 1994.
12. K. Stoffel and A. Belkoniene. Parallel k-means clustering for large datasets. Euro-Par'99 Parallel Processing, Lecture Notes in Computer Science, No. 1685. Springer-Verlag, 1999.
13. Sterling, T. L., Salmon, J., Becker, D. J., and Savarese, D. F. How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters. The MIT Press, 1999.
14. J. S. Vitter. External Memory Algorithms and Data Structures. In External Memory Algorithms (DIMACS Series on Discrete Mathematics and Theoretical Computer Science). American Mathematical Society, 1999.
A Requirements Analysis for Parallel KDD Systems
William A. Maniatty 1 and Mohammed J. Zaki 2
1 Computer Science Dept., University at Albany, Albany, NY 12222
[email protected], http://www.cs.albany.edu/maniatty/
2 Computer Science Dept., Rensselaer Polytechnic Institute, Troy, NY 12180
[email protected], http://www.cs.rpi.edu/zaki/
Abstract. The current generation of data mining tools has limited capacity and performance, since these tools tend to be sequential. This paper explores a migration path out of this bottleneck by considering an integrated hardware and software approach to parallelize data mining. Our analysis shows that parallel data mining solutions require the following components: parallel data mining algorithms, parallel and distributed databases, parallel file systems, parallel I/O, tertiary storage, management of online data, support for heterogeneous data representations, security, quality of service, and pricing metrics. State of the art technology in these areas is surveyed with an eye towards an integration strategy leading to a complete solution.
1 Introduction
Knowledge discovery in databases (KDD) employs a variety of techniques, collectively called data mining, to uncover trends in large volumes of data. Many applications generate (or acquire) data faster than it can be analyzed using existing KDD tools, leading to perpetual data archival without retrieval or analysis. Furthermore, analyzing sufficiently large data sets can exceed the available computational resources of existing computers. In order to reverse the vicious cycle induced by these two problematic trends, the issues of performing KDD faster than the rate of arrival and increasing capacity must be dealt with simultaneously. Fortunately, novel applications of parallel computing techniques should assist in solving these large problems in a timely fashion. Parallel KDD (PKDD) techniques are not currently that common, though recent algorithmic advances seek to address these problems (Freitas and Lavington 1998; Zaki 1999; Zaki and Ho 2000; Kargupta and Chan 2000). However, there has been no work in designing and implementing large-scale parallel KDD systems, which must not only support the mining algorithms, but also the entire KDD process, including the pre-processing and post-processing steps (in fact, it has been posited that around 80% of the KDD effort is spent in these steps, rather than mining). The picture gets even more complicated when one considers persistent data management of mined patterns and models. Given the infancy of KDD in general, and PKDD in particular, it is not clear how or where to start, to realize the goal of building a PKDD system
that can handle terabyte-sized (or larger) central or distributed datasets. Part of the problem stems from the fact that PKDD draws input from diverse areas that have traditionally been studied in isolation. Typically, the KDD process is supported by a hierarchical architecture consisting of the following layers (from bottom to top): I/O Support, File System, Database, Query Manager, and Data Mining. However, the current incarnations of this architecture tend to be sequential, limiting both problem size and performance. To implement a successful PKDD toolkit, we need to borrow, adapt, and enhance research in fields such as super-, meta- and heterogeneous-computing environments, parallel and distributed databases, parallel and distributed file systems, parallel I/O, mass storage systems, and so on (not to mention the other fields that make up KDD, such as statistics, machine learning, and visualization). This paper represents a first step in the process of unifying these diverse technologies and leveraging them within the PKDD system. We do this by discussing the system requirements for PKDD and the extant solutions (or lack thereof), i.e., the what and the how of PKDD. These requirements follow from: the basic requirements imposed by KDD (Section 2), current KDD algorithmic techniques (Section 3), the trends in commodity hardware design (Section 4), and software requirements (Section 5). One difficulty in making such a survey is that each research community has its own jargon, which we will try to make accessible by describing it within a common PKDD framework.
2 PKDD Requirements
We begin by discussing the wish-list or desirable features of a functional PKDD system, using it to guide the rest of the survey. We mainly concentrate on aspects that have not yet received wide attention. Algorithm Evaluation: Algorithmic aspects that need attention are the ability to handle high dimensional datasets, to support terabyte data-stores, to minimize the number of data scans, etc. An even more important research area is to provide a rapid development framework to implement and conduct the performance evaluation of a number of competing parallel methods for a given mining task. Currently this is a very time-consuming process, and there are no guidelines on when to use a particular algorithm over another. Process Support: The toolkit should support all KDD steps, from pre-processing operations like sampling, discretization, and feature subset selection, to post-processing operations like rule grouping and pruning and model scoring. Other aspects include (persistent) pattern management operations like caching, efficient retrieval, and meta-level mining. Location Transparency: The PKDD system should be able to seamlessly access and mine datasets regardless of their location, be they centralized or distributed. Data Type Transparency: The system should be able to cope with heterogeneity (e.g., different database schemas), without having to materialize a join of multiple tables. Other difficult aspects deal with handling unstructured (hyper-)text, spreadsheets, and a variety of other data types.
System Transparency: This refers to the fact that the PKDD system should be able to seamlessly access file systems, databases, or data archives. Databases and data warehouses represent one kind of data repository, and thus it is crucial to integrate mining with the DBMS to avoid extracting data to flat files. On the other hand, a huge amount of data remains outside databases in flat files, weblogs, etc. The PKDD system must therefore bridge the gap that exists today between the database and file-systems worlds (Choudhary and Kotz 1996). This is required since database systems today offer little functionality to support mining applications (Agrawal et al. 1993), and most research on parallel file systems and parallel I/O has looked at scientific applications, while data mining operations have very different workload characteristics. Security, QoS and Pricing: In an increasingly networked world one constantly needs access to proprietary third-party and other remote datasets. The two main issues that need attention here are security and Quality-of-Service (QoS) in data mining. We need to prevent unauthorized mining, and we need to provide cost-sensitive mining to guarantee a level of performance. These issues are paramount in web-mining for e-commerce. Availability, Fault Tolerance and Mobility: Distributed and parallel systems have more points of failure than centralized systems. Furthermore, temporary disconnections (which are frequent in mobile computing environments) and reconnections by users should be tolerated with a minimal penalty to the user. Many real-world applications cannot tolerate outages, and in the presence of QoS guarantees and contracts, outages can breach the agreements between providers and users. Little work has been done to address this area as well. In the discussion below, due to space constraints, we choose to concentrate only on the algorithmic and hardware trends, and system transparency issues (i.e., parallel I/O and parallel and distributed databases), while briefly touching on other aspects (a more detailed paper is forthcoming).
3 Mining Methods
Faster and scalable algorithms for mining will always be required. Parallel and distributed computing seems ideally placed to address these big data performance issues. However, achieving good performance on today's multiprocessor systems is a non-trivial task. The main challenges include synchronization and communication minimization, work-load balancing, finding good data layout and data decomposition, and disk I/O minimization. The parallel design space spans a number of systems and algorithmic components such as the hardware platform (shared vs. distributed), kind of parallelism (task vs. data), load balancing strategy (static vs. dynamic), data layout (horizontal vs. vertical) and search procedure used (complete vs. greedy). Recent algorithmic work has been very successful in showing the benefits of parallelism for many of the common data mining tasks including association rules (Agrawal and Shafer 1996; Cheung et al. 1996; Han et al. 1997; Zaki et al. 1997), sequential patterns (Shintani and Kitsuregawa 1998; Zaki
2000), classification (Shafer et al. 1996; Joshi et al. 1998; Zaki et al. 1999; Sreenivas et al. 1999), regression (Williams et al. 2000) and clustering (Judd et al. 1996; Dhillon and Modha 2000; S. Goil and Choudhary 1999). The typical trend in parallel mining is to start with a sequential method, pose various parallel formulations, implement them, and conduct a performance evaluation. While this is very important, it is a very costly process. After all, the parallel design space is vast and results on the parallelization of one serial method may not be applicable to other methods. The result is that there is a proliferation of parallel algorithms without any standardized benchmarking to compare them and provide guidelines on which methods work better under what circumstances. The problem becomes even worse when a new and improved serial algorithm is found, and one is forced to come up with new parallel formulations. Thus, it is crucial that the PKDD system support rapid development and testing of algorithms to facilitate algorithmic performance evaluation. One recent effort in this direction is discussed by (Skillicorn 1999). He emphasizes the importance of, and presents, a set of cost measures that can be applied to parallel algorithms to predict their computation, data access, and communication performance. These measures make it possible to compare different parallel implementation strategies for data-mining techniques without benchmarking each one. A different approach is to build a data mining kernel that supports common data mining operations, and is modular in design so that new algorithms or their "primitive" components can be easily added to increase functionality. An example is the MKS (Anand et al. 1997) kernel. Also, generic set-oriented primitive operations were proposed in (Freitas and Lavington 1998) for classification and clustering, which were integrated with a parallel DBMS.
4 Hardware Models and Trends
The current hardware trends are that memory and disk capacity are increasing at a much higher rate than their speed. Furthermore, CPU capacity is roughly obeying Moore's law, which predicts a doubling of performance approximately every 18 months. To combat bus and memory bandwidth limitations, caching is used to improve the mean access time, giving rise to Non-Uniform Memory Access architectures. To accelerate the rate of computation, modern machines frequently increase the number of processing elements in an architecture. Logically, the memory of such machines is kept consistent, giving rise to a shared memory model, called Symmetric Multiprocessing (SMP) in the architecture community and shared everything in the database community (DeWitt and Gray 1992; Valduriez 1993). However, the scalability of such architectures is limited, so for higher degrees of parallelism a cluster of SMP nodes is used. This model, called shared-nothing in the database literature, is also the preferred architecture for parallel databases (DeWitt and Gray 1992). Redundant arrays of independent (or inexpensive) disks (RAID) (Chen et al. 1994) have gained popularity as a means to increase I/O bandwidth and storage capacity,
reduce latency, and (optionally) support fault tolerance. In many systems, since the amount of data exceeds what can be stored on disk, tertiary storage is used, typically consisting of one or more removable media devices with a juke box to swap the loaded media. In addition to the current trends, there have been other ideas to improve the memory and storage bottlenecks. Active Disks (Riedel et al. 1997) and Intelligent Disks (Keeton et al. 1998) have been proposed as a means to exploit the improved performance of embedded processors in disk controllers to allow more complex I/O operations and optimizations, while reducing the amount of traffic over a congested I/O bus. Intelligent RAM (IRAM) (Kozyrakis and Patterson 1998) seeks to integrate processing elements in the memory. Active disks and IRAM are not currently prevalent, as the required hardware and systems software are not commonly available.
5 Software Infrastructure
Since our goal is to use commodity hardware, much of the support for our desired functionality is pushed back into the software. In this section we discuss some of the system transparency issues in PKDD systems, i.e., support for seamless access to databases and file systems and parallel I/O. We review selected aspects of these areas. The most common database constructions currently in use are relational databases, object-oriented databases, and object-relational databases. The database layer ensures referential integrity and provides support for queries and/or transactions on the data set (Oszu and Valduriez 1999). The database layer is frequently accessed via a query language, such as SQL. We are primarily interested in parallel and distributed database systems (DeWitt and Gray 1992; Valduriez 1993), which have data sets spanning disks. The primary advantages of such systems are that storage capacity is improved and that parallelizing disk access improves bandwidth and (for large I/Os) can reduce latency. Early on, parallel database research explored special-purpose database machines for performance (Hsiao 1983), but today the consensus is that it is better to use available parallel platforms, with the shared-nothing paradigm as the architecture of choice. Shared-nothing database systems include Teradata, Gamma (D. DeWitt et al. 1990), Tandem (Tandem Performance Group 1988), Bubba (Boral et al. 1990), Arbre (Lorie et al. 1989), etc. We refer the reader to (DeWitt and Gray 1992; Valduriez 1993; Khan et al. 1999) for excellent survey articles on parallel and distributed databases. Issues within parallel database research of relevance to PKDD include the data partitioning (over disks) methods used, such as simple round-robin partitioning, where records are distributed evenly among the disks. Hash partitioning is most effective for applications requiring associative access, since records are partitioned based on a hash function. Finally, range partitioning clusters records with similar attributes together. Most parallel data mining work to date has used a round-robin approach to data partitioning; other methods might be more suitable, as the sketch below illustrates.
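To make the three partitioning schemes concrete, the following Python sketch shows one way they might be expressed. It is only an illustration: the record layout, the choice of the first field as the partitioning key, and the range boundaries are assumptions made for the example, not details taken from any of the surveyed systems.

```python
# Minimal sketch of the three record-to-disk partitioning schemes
# discussed above (round-robin, hash, range). The record format and
# boundary values are illustrative assumptions.

def round_robin_partition(records, n_disks):
    """Distribute records evenly over disks, irrespective of content."""
    parts = [[] for _ in range(n_disks)]
    for i, rec in enumerate(records):
        parts[i % n_disks].append(rec)
    return parts

def hash_partition(records, n_disks, key=lambda rec: rec[0]):
    """Place a record on the disk given by a hash of its key; good for
    associative access, since equal keys always map to the same disk."""
    parts = [[] for _ in range(n_disks)]
    for rec in records:
        parts[hash(key(rec)) % n_disks].append(rec)
    return parts

def range_partition(records, boundaries, key=lambda rec: rec[0]):
    """Cluster records with similar key values on the same disk.
    boundaries: sorted upper bounds, one per disk except the last."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for rec in records:
        k = key(rec)
        for d, bound in enumerate(boundaries):
            if k <= bound:
                parts[d].append(rec)
                break
        else:
            parts[-1].append(rec)
    return parts
```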
Exploration of efficient multidimensional indexing structures for PKDD is required (Gaede and Gunther 1998). The vast amount of work on parallel relational query operators, particularly parallel join algorithms, is also of relevance (Pirahesh et al. 1990). The use of DBMS views (Oszu and Valduriez 1999) to restrict the access of a DBMS user to a subset of the data can be used to provide security in KDD systems. Parallel I/O and file systems techniques are geared to handling large data sets in a distributed memory environment, and appear to be a better fit than distributed file systems for managing the large data sets found in KDD applications. Parallel file systems and parallel I/O techniques have been widely studied; (Kotz) maintains an archive and bibliography, which has a nice reference guide (Stockinger 1998). Use of parallel I/O and file systems becomes necessary if RAID devices have insufficient capacity (due to scaling limitations) or contention for shared resources (e.g. buses or processors) exceeds the capacity of SMP architectures. The Scalable I/O initiative (SIO) includes many groups, including the Message Passing Interface (MPI) forum, which has adopted the MPI-IO API (Thakur et al. 1999) for parallel file management. MPI-IO is layered on top of local file systems. MPI uses a run time type definition scheme to define communication and I/O entity types. The ROMIO library (Thakur et al. 1999) implements MPI-IO in Argonne's MPICH implementation of MPI. ROMIO automates scheduling of aggregated I/O requests and uses the ADIO middleware layer to provide portability and isolate the implementation-dependent parts of MPI-IO. PABLO, another SIO member group, has created the portable parallel file system (PPFS II), designed to support efficient access to large data sets in scientific applications with irregular access patterns. More information on parallel and distributed I/O and file systems appears in (Kotz; Carretero et al. 1996; Gibson et al. 1999; Initiative; Moyer and Sunderam 1994; Nieuwejaar and Kotz 1997; Schikuta et al. 1998; Seamons and Winslett 1996). Users of PKDD systems are interested in maximizing performance. Prefetching is an important performance-enhancing technique that can reduce the impact of latency by overlapping computation and I/O (Cortes 1999; Kimbrel et al. 1996; Patterson III 1997). In order for prefetching to be effective, the distributed system uses hints which indicate what data is likely to be used in the near future. Generation of accurate hints (not surprisingly) tends to be difficult, since it relies on predicting a program's flow of control. Many hint generation techniques rely on traces of a program's I/O access patterns. (Kimbrel et al. 1996) surveyed a range of trace-driven techniques and prefetching strategies, and provided performance comparisons. (Madhyastha and Reed 1997) recently used machine learning tools to analyze I/O traces from the PPFS, relying on artificial neural networks for on-line analysis of the current trace and hidden Markov models to analyze data obtained by profiling. (Chang and Gibson 1999) developed SpecHint, which generates hints via speculative execution. We conjecture that PKDD techniques can be used to identify reference patterns, to provide hint generation, and to address open performance analysis issues (Reed et al. 1998).
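As a toy illustration of trace-driven hint generation, the sketch below learns which block tends to follow which in an access trace and suggests the most frequent successor as a prefetch hint. It is only a stand-in for the neural-network, hidden-Markov-model, and speculative-execution approaches cited above; the trace format (a plain sequence of block ids) is an assumption.

```python
from collections import defaultdict, Counter

# Toy trace-driven hint generator: learn which block tends to follow
# which, then emit a prefetch hint for the most frequent successor.
# Real systems (PPFS trace analysis, SpecHint) are far more elaborate.

def build_successor_model(trace):
    """trace: sequence of block ids in the order they were accessed."""
    model = defaultdict(Counter)
    for cur, nxt in zip(trace, trace[1:]):
        model[cur][nxt] += 1
    return model

def prefetch_hint(model, current_block):
    """Return the block to prefetch after current_block, or None."""
    successors = model.get(current_block)
    if not successors:
        return None
    block, _count = successors.most_common(1)[0]
    return block

trace = [3, 7, 8, 3, 7, 9, 3, 7, 8, 3, 7, 8]
model = build_successor_model(trace)
print(prefetch_hint(model, 7))   # 8 is the most frequent successor of 7
```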
As we noted earlier, integration of the various systems components for effective KDD is lagging. The current state of KDD tools can accurately be captured by the term flat-file mining, i.e., prior to mining, all the data is extracted into a flat file, which is then used for mining, effectively bypassing all database functionality. This is mainly because traditional databases are ill-equipped to handle/optimize the complex query structure of mining methods. However, recent work has recognized the need for integrating the database, query management and data mining layers (Agrawal and Shim 1996; Sarawagi et al. 1998). (Agrawal and Shim 1996) postulated that better integration of the query manager, database and data mining layers would provide a speedup. (Sarawagi et al. 1998) confirmed that performance improvements could be attained, with the best performance obtained in cache-mine, which caches and mines the query results on a local disk. SQL-like operators for mining association rules have also been developed (Meo et al. 1996). Further, proposals for data mining query languages (Han et al. 1996; Imielinski and Mannila 1996; Imielinski et al. 1996; Siebes 1995) have emerged. We note that most of this work is targeted for serial environments. PKDD efforts will benefit from this research, but the optimization problems will of course be different in a parallel setting. Some exceptions include the parallel generic primitives proposed in (Freitas and Lavington 1998), and Data Surveyor (Holsheimer et al. 1996), a mining tool that uses the Monet database server for parallel classification rule induction. We further argue that we need a wider integration of parallel and distributed databases and file systems, to fully mine all available data (only a modest fraction of which actually resides in databases). Integration of PKDD and parallel file systems should enhance performance by improving hint generation in prefetching. Integrated PKDD can use parallel file systems for storing and managing large data sets and use a distributed file system as an access point suited to mobile clients for management of query results.
6 Conclusions
We described a list of desirable design features of parallel KDD systems. These requirements motivated a brief survey of existing algorithmic and systems support for building such large-scale mining tools. We focused on the state of the art in databases and parallel I/O techniques. We observe that implementing an effective PKDD system requires integration of these diverse sub-fields into a coherent and seamless system. Emerging issues in PKDD include benchmarking, security, availability, mobility and QoS, motivating fresh research in these disciplines. Finally, PKDD approaches may be used as a tool in these areas (e.g. hint generation for prefetching in parallel I/O), resulting in a bootstrapping approach to software development.
References
R. Agrawal and J. Shafer. Parallel mining of association rules. IEEE Trans. on Knowledge and Data Engg., 8(6):962-969, December 1996.
R. Agrawal and K. Shim. Developing tightly-coupled data mining applications on a relational DBMS. In Int'l Conf. on Knowledge Discovery and Data Mining, 1996.
R. Agrawal, T. Imielinski, and A. Swami. Database mining: A performance perspective. IEEE Trans. on Knowledge and Data Engg., 5(6):914-925, December 1993.
S. Anand, et al. Designing a kernel for data mining. IEEE Expert, pages 65-74, March 1997.
H. Boral, et al. Prototyping Bubba, a highly parallel database system. IEEE Trans. on Knowledge and Data Engg., 2(1), March 1990.
J. Carretero, et al. ParFiSys: A parallel file system for MPP. ACM Operating Systems Review, 30(2):74-80, 1996.
F. Chang and G. Gibson. Automatic I/O hint generation through speculative execution. In Symp. on Operating Systems Design and Implementation, February 1999.
P. M. Chen, et al. RAID: High-performance, reliable secondary storage. ACM Computing Surveys, 26(2):145-185, June 1994.
D. Cheung, et al. A fast distributed algorithm for mining association rules. In 4th Int'l Conf. Parallel and Distributed Info. Systems, December 1996.
A. Choudhary and D. Kotz. Large-scale file systems with the flexibility of databases. ACM Computing Surveys, 28A(4), December 1996.
T. Cortes. High Performance Cluster Computing, Vol. 1, chapter Software RAID and Parallel File Systems, pages 463-495. Prentice Hall, 1999.
D. DeWitt, et al. The GAMMA database machine project. IEEE Trans. on Knowledge and Data Engg., 2(1):44-62, March 1990.
D. DeWitt and J. Gray. Parallel database systems: The future of high-performance database systems. Communications of the ACM, 35(6):85-98, June 1992.
I. S. Dhillon and D. S. Modha. A clustering algorithm on distributed memory machines. In (Zaki and Ho, 2000).
A. Freitas and S. Lavington. Mining Very Large Databases with Parallel Processing. Kluwer Academic Pub., 1998.
V. Gaede and O. Gunther. Multidimensional access methods. ACM Computing Surveys, 30(2):170-231, 1998.
G. Gibson, et al. NASD scalable storage systems. In USENIX99, Extreme Linux Workshop, June 1999.
J. Han, et al. DMQL: A data mining query language for relational databases. In SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, June 1996.
E.-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. In ACM SIGMOD Conf. Management of Data, May 1997.
M. Holsheimer, M. L. Kersten, and A. Siebes. Data surveyor: Searching the nuggets in parallel. In U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining. AAAI Press, 1996.
D. Hsiao. Advanced Database Machine Architectures. Prentice Hall, 1983.
T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of the ACM, 39(11), November 1996.
T. Imielinski, A. Virmani, and A. Abdulghani. DataMine: Application programming interface and query language for database mining. In Int'l Conf. Knowledge Discovery and Data Mining, August 1996.
Scalable I/O Initiative. California Institute of Technology. http://www.cacr.caltech.edu/SIO.
M. Joshi, G. Karypis, and V. Kumar. ScalParC: A scalable and parallel classification algorithm for mining large datasets. In Int'l Parallel Processing Symposium, 1998.
D. Judd, P. McKinley, and A. Jain. Large-scale parallel data clustering. In Int'l Conf. Pattern Recognition, 1996.
H. Kargupta and P. Chan, editors. Advances in Distributed Data Mining. AAAI Press, 2000.
K. Keeton, D. Patterson, and J. M. Hellerstein. The case for intelligent disks. SIGMOD Record, 27(3):42-52, September 1998.
M. F. Khan, et al. Intensive data management in parallel systems: A survey. Distributed and Parallel Databases, 7:383-414, 1999.
T. Kimbrel, et al. A trace-driven comparison of algorithms for parallel prefetching and caching. In USENIX Symp. on Operating Systems Design and Implementation, pages 19-34, October 1996.
D. Kotz. The Parallel I/O Archive. Includes pointers to his Parallel I/O Bibliography; available at http://www.cs.dartmouth.edu/pario/.
C. E. Kozyrakis and D. A. Patterson. New direction in computer architecture research. IEEE Computer, pages 24-32, November 1998.
R. Lorie, et al. Adding inter-transaction parallelism to an existing DBMS: Early experience. IEEE Data Engineering Newsletter, 12(1), March 1989.
T. M. Madhyastha and D. A. Reed. Exploiting global input/output access pattern classification. In Proceedings of SC'97, 1997. On CDROM.
R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. In Int'l Conf. on Very Large Databases, 1996.
S. A. Moyer and V. S. Sunderam. PIOUS: A scalable parallel I/O system for distributed computing environments. In Scalable High-Performance Computing Conf., 1994.
H. Nagesh, S. Goil, and A. Choudhary. MAFIA: Efficient and scalable subspace clustering for very large data sets. Technical Report 9906-010, Northwestern University, June 1999.
N. Nieuwejaar and D. Kotz. The Galley parallel file system. Parallel Computing, 23(4), June 1997.
M. T. Oszu and P. Valduriez. Principles of Distributed Database Systems. Prentice Hall, 1999.
R. H. Patterson III. Informed Prefetching and Caching. PhD thesis, Carnegie Mellon University, December 1997.
Pirahesh, et al. Parallelism in relational data base systems. In Int'l Symp. on Parallel and Distributed Systems, July 1990.
D. A. Reed, et al. Performance analysis of parallel systems: Approaches and open problems. In Joint Symposium on Parallel Processing (JSPP), June 1998.
E. Riedel, G. A. Gibson, and C. Faloutsos. Active storage for large-scale data mining and multimedia. In Int'l Conf. on Very Large Databases, August 1997.
S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with databases: Alternatives and implications. In ACM SIGMOD Conf. on Management of Data, June 1998.
E. Schikuta, T. Fuerle, and H. Wanek. ViPIOS: The Vienna parallel input/output system. In Euro-Par'98, September 1998.
K. E. Seamons and M. Winslett. Multidimensional array I/O in Panda 1.0. Journal of Supercomputing, 10(2):191-211, 1996.
J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In Int'l Conf. on Very Large Databases, 1996.
T. Shintani and M. Kitsuregawa. Mining algorithms for sequential patterns in parallel: Hash based approach. In 2nd Pacific-Asia Conf. on Knowledge Discovery and Data Mining, April 1998.
A. Siebes. Foundations of an inductive query language. In Int'l Conf. on Knowledge Discovery and Data Mining, August 1995.
D. Skillicorn. Strategies for parallel data mining. IEEE Concurrency, 7(4):26-35, October-December 1999.
M. Sreenivas, K. Alsabti, and S. Ranka. Parallel out-of-core divide and conquer techniques with application to classification trees. In Int'l Parallel Processing Symposium, April 1999.
H. Stockinger. Dictionary on parallel input/output. Master's thesis, Dept. of Data Engineering, University of Vienna, February 1998.
Tandem Performance Group. A benchmark of non-stop SQL on the debit credit transaction. In SIGMOD Conference, June 1988.
R. Thakur, W. Gropp, and E. Lusk. On implementing MPI-IO portably and with high performance. In Workshop on I/O in Parallel and Distributed Systems, May 1999.
P. Valduriez. Parallel database systems: Open problems and new issues. Distributed and Parallel Databases, 1:137-165, 1993.
G. Williams, et al. The integrated delivery of large-scale data mining: The ACSys data mining project. In (Zaki and Ho, 2000).
M. J. Zaki and C.-T. Ho, editors. Large-Scale Parallel Data Mining, LNCS Vol. 1759. Springer-Verlag, 2000.
M. J. Zaki, et al. Parallel algorithms for fast discovery of association rules. Data Mining and Knowledge Discovery: An International Journal, 1(4):343-373, December 1997.
M. J. Zaki, C.-T. Ho, and R. Agrawal. Parallel classification for data mining on shared-memory multiprocessors. In Int'l Conf. on Data Engineering, March 1999.
M. J. Zaki. Parallel and distributed association mining: A survey. IEEE Concurrency, 7(4):14-25, 1999.
M. J. Zaki. Parallel sequence mining on SMP machines. In (Zaki and Ho, 2000).
Parallel Data Mining on ATM-Connected PC Cluster and Optimization of its Execution Environments
Masato OGUCHI 1,2 and Masaru KITSUREGAWA 1
1 Institute of Industrial Science, The University of Tokyo, 7-22-1 Roppongi, Minato-ku, Tokyo 106-8558, Japan
2 Informatik 4, Aachen University of Technology, Ahornstr. 55, D-52056 Aachen, Germany
[email protected]
Abstract. In this paper, we have constructed a large scale ATM-connected PC cluster consisting of 100 PCs, implemented a data mining application, and optimized its execution environment. Default parameters of the TCP retransmission mechanism cannot provide good performance for the data mining application, since a lot of collisions occur in the case of all-to-all multicasting in the large scale PC cluster. Using TCP retransmission parameters chosen according to the proposed parameter optimization, reasonably good performance improvement is achieved for parallel data mining on 100 PCs. Association rule mining, one of the best-known problems in data mining, differs from conventional scientific calculations in its usage of main memory. We have investigated the feasibility of using available memory on remote nodes as a swap area when working nodes need to swap out their real memory contents. According to the experimental results on our PC cluster, the proposed method is expected to be considerably better than using hard disks as a swapping device.
1 Introduction
Looking over the recent technology trends, PC/WS clusters connected with high speed networks such as ATM are considered to be a principal platform for future high performance parallel computers. Applications which formerly could only be implemented on expensive massively parallel processors can now be executed on inexpensive clusters of PCs. Various research projects to develop and examine PC/WS clusters have been performed until now [1][2][3]. Most of them, however, only measured basic characteristics of PCs and networks, and/or examined some small benchmark programs. We believe that data intensive applications such as data mining and ad-hoc query processing in databases are quite important for future high performance computers, in addition to the conventional scientific applications [4]. Data mining has attracted a lot of attention recently from both the research and commercial community, for finding interesting trends hidden in large transaction logs. Since data mining is a very computation and I/O intensive process,
parallel processing is required to supply the necessary computational power for very large mining operations. In this paper, we report the results of parallel data mining on an ATM-connected PC cluster consisting of 100 Pentium Pro PCs.

2 Our ATM-connected PC cluster and its communication characteristics
We have constructed a PC cluster pilot system which consists of 100 nodes of 200MHz Pentium Pro PCs connected with an ATM switch. An overview of the PC cluster is shown in Figure 1. Each node of the cluster is equipped with 64Mbytes of main memory, a 2.5Gbytes IDE hard disk, and a 4.3Gbytes SCSI hard disk. Solaris ver.2.5.1 is used as the operating system. All nodes of the cluster are connected by a 155Mbps ATM LAN as well as an Ethernet. HITACHI's AN1000-20, which has 128 ports of 155Mbps UTP-5, is used as the ATM switch. An Interphase 5515 PCI ATM adapter and an RFC-1483 PVC driver, which support LLC/SNAP encapsulation for IP over ATM, are used. Only the UBR traffic class is supported in this driver. TCP/IP is used as the communication protocol. TCP is not only a very popular reliable protocol for computer communication, but also contains all the functions of a general transport layer. Thus the results of our experiments should remain valid even if another reliable transport protocol is used on a large scale cluster.
Fig. 1. An overview of the PC cluster: 100 PCs (Pentium Pro 200MHz, 64MB memory, 2.5GB IDE hard disk, 4.3GB SCSI hard disk) connected by a 155Mbps ATM switch (UTP-5, 128 ports) and a 10Base-T Ethernet.
3 Parallel data mining application and its implementation on the cluster
3.1 Association rule mining
Data mining is a method for the efficient discovery of useful information, such as rules and previously unknown patterns, existing among data items embedded in large databases, which allows more effective utilization of existing data. One of the best known problems in data mining is mining of association rules from a database, so called "basket analysis" [5][6]. Basket type transactions typically consist of a transaction id and the items bought per transaction. An example of an association rule is "if customers buy A and B then 90% of them also buy C". The best known algorithm for association rule mining is the Apriori algorithm proposed by R. Agrawal of IBM Almaden Research [7]. In order to improve the quality of the rules, we have to analyze very large amounts of transaction data, which requires considerably long computation time. We have studied several parallel algorithms for mining association rules [8], based on Apriori. One of these algorithms, called HPA (Hash Partitioned Apriori), is implemented and evaluated here. Apriori first generates candidate itemsets, then scans the transaction database to determine whether the candidates satisfy the user specified minimum support. In the first pass (pass 1), the support for each item is counted by scanning the transaction database, and all items which satisfy the minimum support are picked out. These items are called large 1-itemsets. In the second pass (pass 2), 2-itemsets (length 2) are generated using the large 1-itemsets. These 2-itemsets are called candidate 2-itemsets. Then supports for the candidate 2-itemsets are counted by scanning the transaction database, and large 2-itemsets which satisfy the minimum support are determined. This iterative procedure terminates when the large itemset or candidate itemset becomes empty. Association rules which satisfy the user specified minimum confidence can be derived from these large itemsets. HPA partitions the candidate itemsets among processors using a hash function, like the hash join in relational databases. HPA effectively utilizes the whole memory space of all the processors. Hence it works well for large scale data mining. For the details of HPA, please refer to [8][9].
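As a rough sketch of the hash-partitioning idea behind HPA (not the actual implementation described in [8][9]), the Python fragment below assigns each candidate itemset to an owner node with a hash function and counts, at each node, only the candidates that node owns. Message handling and the shipping of transaction subsets to their owner nodes, which the real system performs, are omitted; all names and data are illustrative assumptions.

```python
from itertools import combinations

# Minimal sketch of hash-partitioned support counting (the HPA idea):
# every candidate itemset has a single owner node determined by a hash
# function, so the candidate set is spread over the cluster's aggregate
# memory instead of being replicated on every node.

def owner(itemset, n_nodes):
    """Owner node of a candidate itemset (itemset is a sorted tuple)."""
    return hash(itemset) % n_nodes

def count_pass(local_transactions, candidates, node_id, n_nodes, k):
    """Count, on this node, the length-k candidates that it owns.
    In the real system, subsets found in remote transactions are sent
    to the owner node; here all transactions are assumed to be local."""
    owned = {c: 0 for c in candidates if owner(c, n_nodes) == node_id}
    for transaction in local_transactions:
        for subset in combinations(sorted(transaction), k):
            if subset in owned:
                owned[subset] += 1
    return owned

# Tiny illustration with 2 nodes and candidate 2-itemsets.
candidates = [("a", "b"), ("a", "c"), ("b", "c")]
transactions = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}]
for node in range(2):
    print(node, count_pass(transactions, candidates, node, 2, k=2))
```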
3.2 Implementation of HPA program on PC cluster
We have implemented the HPA program on our PC cluster. Each node of the cluster has a transaction data file on its own hard disk. The Solaris socket library is used for inter-process communication. All processes are connected with each other by socket connections, thus forming a mesh topology. At the ATM level, PVC (Permanent Virtual Channel) switching is used, since the data is transferred continuously among all the processes. Transaction data is produced using the data generation program developed by Agrawal, designating some parameters such as the number of transactions, the
Table 1. The number of candidate and large itemsets
(C: the number of candidate itemsets, L: the number of large itemsets, T: the execution time of each pass [sec])

pass     C         L      T
pass 1   -         1023   11.2
pass 2   522753    32     69.8
pass 3   19        19     3.2
pass 4   7         7      6.2
pass 5   1         0      12.1
number of different items, and so on. The produced data is divided by the number of nodes and copied to each node's hard disk. The parameters used in the evaluation are as follows: the number of transactions is 10,000,000, the number of different items is 5000, and the minimum support is 0.7%. The size of the transaction data is about 800Mbytes in total. The message block size is set to 8Kbytes and the disk I/O block size is 64Kbytes. The numbers of candidate itemsets and large itemsets, and the execution time of each pass executed on the 100 node PC cluster, are shown in Table 1. Note that the number of candidate itemsets in pass 2 is extremely larger than in the other passes, which often happens in association rule mining.
4 Optimization of transport layer protocol parameters
4.1 Broadcasting on the cluster and TCP retransmission
The execution times of passes 3-5 are relatively long in Table 1, although they do not have a large number of itemsets. At the end of each pass, a barrier synchronization and an exchange of data are needed among all nodes, that is, all-to-all broadcasting takes place. Even if the amount of broadcast data is not large, cells must be discarded at the ATM switch if the timing of the broadcasting is the same at all nodes. Since passes 3-5 have little data to process, the actual execution time is quite short; thus broadcasting is performed almost simultaneously on all nodes, which tends to cause network congestion and, as a result, TCP retransmission. We have executed several experiments to find better retransmission parameter settings suitable for such cases. We use the TCP protocol implemented in the Solaris OS, whose parameters can be changed with user level commands. The two parameters changed here are the `maximum interval of TCP retransmission' and the `minimum interval of TCP retransmission', which we call `MAX' and `MIN' respectively. The default setting is MAX = 60000 [msec] and MIN = 200 [msec] in the current version of Solaris. The interval of
Fig. 2. Execution time of HPA program on PC cluster (execution time [sec] vs. number of PCs, comparing the default TCP retransmission interval (MAX=60000[ms], MIN=200[ms]) with the optimized interval (MIN=250...350[ms], MAX=MIN+100[ms])).
TCP retransmission is dynamically changed in the protocol, within the limits between MAX and MIN. As a result of the experiments, we have found that the default value of MAX is not suitable for the cluster, as it might cause unnecessarily long retransmission intervals. MAX should be set smaller than the default value, for example MAX = MIN + 100 [msec]. Moreover, MIN is better set to a random value, which can prevent the collision of cells at the ATM switch.
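The randomized interval selection described above could be scripted per node roughly as follows. Note that the Solaris ndd parameter names used here (tcp_rexmit_interval_min and tcp_rexmit_interval_max) are assumptions recalled from typical Solaris releases and should be verified against the documentation of the installed system; the value ranges simply repeat the ones quoted in the text.

```python
import random
import subprocess

# Sketch: pick a randomized retransmission interval per node, as
# proposed above, and apply it with Solaris' ndd tool. The parameter
# names tcp_rexmit_interval_min/max are assumptions; check them
# against the ndd documentation of the installed Solaris release.

def tune_tcp_retransmission(min_low=250, min_high=350, gap=100):
    rexmit_min = random.randint(min_low, min_high)   # MIN [msec]
    rexmit_max = rexmit_min + gap                    # MAX [msec]
    for name, value in (("tcp_rexmit_interval_min", rexmit_min),
                        ("tcp_rexmit_interval_max", rexmit_max)):
        subprocess.run(["ndd", "-set", "/dev/tcp", name, str(value)],
                       check=True)
    return rexmit_min, rexmit_max

if __name__ == "__main__":
    print("MIN/MAX set to", tune_tcp_retransmission())
```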
4.2 Total performance of HPA program using proposed method
The HPA program is executed using the proposed TCP parameter setting on the PC cluster pilot system. The execution time of the HPA program is shown in Figure 2. In this figure, one line indicates the case using the default TCP retransmission parameters, i.e. MAX = 60000 [msec] and MIN = 200 [msec], and the other line indicates the case using random parameters (MIN = 250 ... 350 [ms], MAX = MIN + 100 [ms]). Reasonably good speedup is achieved up to 100 PCs using the proposed optimized parameters. Since the application itself is not changed, the difference comes only from TCP retransmission, which occurs along with barrier synchronization and all-to-all data broadcasting.
5 Dynamic remote memory acquisition
5.1 Dynamic remote memory acquisition and its experiments
As shown in Section 3, the number of candidate itemsets in pass 2 is very much larger than in the other passes in association rule mining. The number of itemsets is
strongly dependent on user-specified conditions, such as the minimum support value, and it is difficult to predict how large the number will be before execution. Therefore, it may happen that the number of candidate itemsets increases dramatically in this step so that the memory requirement becomes extremely large. When the required memory is larger than the real memory size, part of the contents of memory must be swapped out. However, because the size of each data item is rather small and all the data is accessed almost at random, swapping out to a storage device is expected to degrade the total performance severely. We have executed several experiments in which available memory in remote nodes is used as a swap area when huge memory is dynamically required. In the experiments, a limit value for memory usage of candidate itemsets is set at each node. When the amount of memory used exceeds this value during the execution of the HPA program (in pass 2), part of the contents is swapped out to available memory in remote nodes, that is, application execution nodes acquire remote memory dynamically. Although such available remote nodes could be found dynamically in a real system, we selected them statically in these experiments. On the other hand, when an application execution node tries to access an item that had been swapped out, a pagefault occurs. The basic behavior of this approach has something in common with distributed shared memory systems [10], memory management systems in distributed operating systems [11], and cache mechanisms in client-server database systems [12]. For example, if the data structures inside applications are considered in distributed shared memory, almost the same effect can be expected. That is to say, it is possible to program almost the same mechanism using some types of distributed shared memory systems. Thus, our mechanism might be regarded as equivalent to a case of distributed shared memory optimized for a particular application. We have executed experiments on the proposed mechanism on the PC cluster. The parameters used in the experiment are as follows. The number of transactions is 1,000,000, the number of different items is 5,000, and the minimum support is 0.1%. The number of application execution nodes is 8 in this evaluation. The number of memory available nodes is varied from 1 to 16. With these conditions, the total number of candidate itemsets in pass 2 is 4,871,881. Since each candidate itemset occupied 24 bytes in total (structure area + data area), approximately 14-15Mbytes of memory were filled with these candidate itemsets at each node.
5.2 Remote update method
When memory usage is limited, the execution time is much longer than when there is no memory limit. This is because the number of swapouts is extremely large. In Table 2 the numbers of pagefaults on each application execution node are shown. Because most of the memory contents are accessed repeatedly, a kind of thrashing seems to happen in these cases. In order to prevent this phenomenon, a method for restricting swapping operations is proposed. When usage of memory reaches the limit value at a node, it acquires remote memory and swaps out part of its memory contents. The contents will be
Table 2. The numbers of pagefaults on each application node

Usage limit   12 [MB]    13 [MB]    14 [MB]    15 [MB]
node 1        1606258    885798     254094     -
node 2        2925254    1896226    1003757    -
node 3        1306521    593000     268039     -
node 4        2361756    1374688    512984     -
node 5        1671840    932374     286945     -
node 6        1723410    896150     191102     -
node 7        2166277    1326941    601657     -
node 8        2545003    1375398    407628     -
swapped in again if this data is accessed later. Instead of swapping, it is sometimes better to send update information to the remote memory when a pagefault occurs. That is to say, once some contents are swapped out to memory in a distant node, they are fixed there and accessed only through a remote memory access interface provided by library functions. This remote update method has been applied only to the itemset counting phase, for simplicity. An access interface function has been developed to realize the remote update operations. The execution time using this method is shown in Figure 3. This figure shows the execution time of pass 2 of the HPA program when the number of memory available nodes is 16. The execution times for dynamic remote memory acquisition, according to this method and to the previous simple swapping case, are compared in the figure. The execution time using hard disks as a swapping device is also shown, for comparison. Seagate Barracuda 7,200 rpm SCSI hard disks have been used for this purpose. Other conditions are the same as in the case of dynamic remote memory acquisition. The execution time using hard disks as swapping devices is very long, especially when the memory usage limit is small, because each access to a hard disk takes much longer than an access to remote memory through the network. The execution time of dynamic remote memory acquisition with simple swapping is better than that for swapping out to hard disks. It increases, however, when the memory usage limit is small, since the number of pagefaults becomes extremely large in such a case. Compared to these results, the execution time of dynamic remote memory acquisition with remote update operations is quite short, even when the memory usage limit is small. It seems to be effective to provide a simple remote access interface for the itemset counting phase, because the number of swapping operations during this phase is very large. These results indicate that the performance of the proposed remote memory acquisition with remote update operations is considerably better than that of the other methods.
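A minimal sketch of such a remote access interface is given below. The 12-byte message format, the helper names, and the use of plain TCP sockets are illustrative assumptions and not the interface actually used in the experiments; the point is only that a fixed, remotely held counter is updated with a small message instead of being swapped back in.

```python
import struct

# Sketch of a remote update interface: once a candidate itemset's counter
# has been fixed in the memory of a memory-available node, the application
# node sends an increment message instead of swapping the page back in.
# The 12-byte message format (8-byte itemset id, 4-byte delta) is an
# illustrative assumption.

def send_remote_increment(sock, itemset_id, delta=1):
    """Ask the memory-available node to add delta to a remote counter."""
    sock.sendall(struct.pack("!QI", itemset_id, delta))

def recv_exact(conn, n):
    """Read exactly n bytes from a socket, or b'' if the peer closed."""
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            return b""
        buf += chunk
    return buf

def serve_remote_counters(listen_sock, counters):
    """Loop run on a memory-available node: apply incoming increments."""
    conn, _addr = listen_sock.accept()
    with conn:
        while True:
            msg = recv_exact(conn, 12)
            if not msg:
                break
            itemset_id, delta = struct.unpack("!QI", msg)
            counters[itemset_id] = counters.get(itemset_id, 0) + delta
```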
References
1. C. Huang and P. K. McKinley: "Communication Issues in Parallel Computing Across ATM Networks", IEEE Parallel and Distributed Technology, Vol.2, No.4, pp.73-86, 1994.
Fig. 3. Comparison of proposed methods (execution time [s] of pass 2 vs. memory usage limit [MB], for swapping out to hard disks, dynamic remote memory acquisition with simple swapping, and dynamic remote memory acquisition with remote update).
2. R. Carter and J. Laroco: "Commodity Clusters: Performance Comparison Between PC's and Workstations", Proceedings of the Fifth IEEE International Symposium on High Performance Distributed Computing, pp.292-304, August 1996.
3. D. E. Culler et al.: "Parallel Computing on the Berkeley NOW", Proceedings of the 1997 Joint Symposium on Parallel Processing (JSPP '97), pp.237-247, May 1997.
4. T. Tamura, M. Oguchi, and M. Kitsuregawa: "Parallel Database Processing on a 100 Node PC Cluster: Cases for Decision Support Query Processing and Data Mining", Proceedings of SuperComputing '97, November 1997.
5. U. M. Fayyad et al.: "Advances in Knowledge Discovery and Data Mining", The MIT Press, 1996.
6. V. Ganti, J. Gehrke, and R. Ramakrishnan: "Mining Very Large Databases", IEEE Computer, Vol.32, No.8, pp.38-45, August 1999.
7. R. Agrawal, T. Imielinski, and A. Swami: "Mining Association Rules between Sets of Items in Large Databases", Proceedings of the ACM International Conference on Management of Data, pp.207-216, May 1993.
8. T. Shintani and M. Kitsuregawa: "Hash Based Parallel Algorithms for Mining Association Rules", Proceedings of the Fourth IEEE International Conference on Parallel and Distributed Information Systems, pp.19-30, December 1996.
9. M. J. Zaki: "Parallel and Distributed Association Mining: A Survey", IEEE Concurrency, Vol.7, No.4, pp.14-25, 1999.
10. C. Amza et al.: "TreadMarks: Shared Memory Computing on Networks of Workstations", IEEE Computer, Vol.29, No.2, pp.18-28, February 1996.
11. M. J. Feeley et al.: "Implementing Global Memory Management in a Workstation Cluster", Proceedings of the ACM Symposium on Operating Systems Principles, pp.201-212, December 1995.
12. S. Dar et al.: "Semantic Data Caching and Replacement", Proceedings of 22nd VLDB Conference, September 1996.
The Parallelization of a Knowledge Discovery System with Hypergraph Representation? Jennifer Seitzer, James P. Buckley, Yi Pan, and Lee A. Adams Department of Computer Science University of Dayton, Dayton, OH 45469-2160
Abstract. Knowledge discovery is a time-consuming and space-intensive endeavor. By distributing such an endeavor, we can diminish both time and space. System INDED (pronounced "indeed") is an inductive implementation that performs rule discovery using the techniques of inductive logic programming and accumulates and handles knowledge using a deductive nonmonotonic reasoning engine. We present four schemes for transforming this large serial inductive logic programming (ILP) knowledge-based discovery system into a distributed ILP discovery system running on a Beowulf cluster. We also present our data partitioning algorithm based on locality, used to accomplish the data decomposition in these scenarios.
1 Introduction Knowledge discovery in databases has been de ned as the non-trivial process of identifying valid, novel, potentially useful, and understandable patterns in data [PSF91]. Data mining is a commonl y used knowledge discovery technique that attempts to reveal patterns within a database in order to exploit implicit information that was previously unknown [CHY96]. One of the more useful applications of data mining is to generate all signi cant associations between items in a data set [AIS93]. A discovered pattern is often denoted in the form of an IF-THEN rule (IF antecedent THEN consequent), where the antecedent and consequent are logical conjunctions of predicates ( rst order logic) or propositions (propositional logic) [Qui86]. Graphs and hypergraphs are used extensively as knowledge representation constructs because of their ability to depict causal chains or networks of implications by interconnecting the consequent of one rule to the antecedent of another. In this work, using the language of logic programming, we use a hypergraph to represent the knowledge base from which rules are mined. Because the hypergraph gets inordinantly large in the serial version of our system [Sei99], we have devised a parallel implementation where, on each node, a smaller sub-hypergraph is created. Consequently, because there is a memory limit to the size of a storable hypergraph, by using this parallel version, we are able to grapple with problems ?
This work is partially supported under Grant 9806184 of the National Science Foundation.
involving larger knowledge bases than those workable on the serial system. A great deal of work has been done in parallelizing unguided discovery of association rules originally in [ZPO97] and recently re ned in [SSC99]. The novel aspects of this work include the parallelization of both a nonmonotonic reasoning system and an ILP learner. In this paper, we present the schemes we have explored and are currently exploring in this pursuit.
2 Serial System INDED System INDED is a knowledge discovery system that uses inductive logic programming (ILP) [LD94] as its discovery technique. To maintain a database of background knowledge, INDED houses a deduction engine that uses deductive logic programming to compute the current state (current set of true facts) as new rules and facts are procured.
2.1 Inductive Logic Programming Inductive logic programming (ILP) is a new research area in arti cial intelligence which attempts to attain some of the goals of machine learning while using the techniques, language, and methodologies of logic programming. Some of the areas to which ILP has been applied are data mining, knowledge acquisition, and scienti c discovery [LD94]. The goal of an inductive logic programming system is to output a rule which covers (entails) an entire set of positive observations, or examples, and excludes or does not cover a set of negative examples [Mug92]. This rule is constructed using a set of known facts and rules, knowledge, called domain or background knowledge. In essence, the ILP objective is to synthesize a logic program, or at least part of a logic program using examples, background knowledge, and an entailment relation. The following de nitions are from [LD94].
De nition 2.1 (coverage, completeness, consistency) Given background knowledge B, hypothesis H, and example set E, hypothesis H covers example e 2 E with respect to B if B [H j= e. A hypothesis H is complete with respect to background B and examples E if all positive examples are covered, i.e., if for all e 2E + , B [H j= e. A hypothesis H is consistent with respect to background B and examples E if no negative examples are covered, i.e., if for all e 2E , , B [H 6j= e. De nition 2.2 (Formal Problem Statement) Let E be a set of training examples consisting of true E + and false E , ground facts of an unknown (target) predicate T . Let L be a description language specifying syntactic restrictions on the de nition of predicate T . Let B be background knowledge de ning predicates qi which may be used in the de nition of T and which provide additional information about the arguments of the examples of predicate T . The ILP problem is to produce a de nition H for T , expressed in L, such that H is complete and consistent with respect to the examples E and background knowledge B. [LD94]
2.2 Serial Architecture
System INDED (pronounced "indeed") is comprised of two main computation engines. The deduction engine is a bottom-up reasoning system that computes the current state by generating a stable model², if there is one, of the current ground instantiation represented internally as a hypergraph, and by generating the well-founded model [VRS91] if there is no stable model [GL90]. This deduction engine is, in essence, a justification truth maintenance system which accommodates non-monotonic updates in the form of positive or negative facts. The induction engine, using the current state created by the deduction engine as the background knowledge base, along with positive examples E⁺ and negative examples E⁻, induces a rule (or rules) which is then used to augment the deductive engine's hypergraph. We use a standard top-down hypothesis construction algorithm (learning algorithm) in INDED [LD94]. This algorithm uses two nested programming loops. The outer (covering) loop attempts to cover all positive examples, while the inner (specialization) loop attempts to exclude all negative examples. Termination is dictated by two user-input values indicating sufficiency and necessity stopping criteria. The following diagram illustrates the discovery constituents of INDED and their symbiotic interaction.
[Diagram: Architecture of System INDED. Positive examples E⁺, negative examples E⁻, and background knowledge B = B⁺ ∪ B⁻ feed the inductive engine (a top-down empirical ILP learner), which passes each learned target predicate to the deductive engine (a JTMS-based nonmonotonic reasoning system); the deductive engine's current state in turn serves as the next background knowledge base.]
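The two nested loops of this learning algorithm can be outlined structurally as follows. In this sketch, covers, refine, the starting clause, and the interpretation of the sufficiency and necessity thresholds are placeholders (assumptions), so it shows the loop structure rather than INDED's actual learner.

```python
# Structural sketch of the top-down (covering / specialization) loops.
# covers(clause, example, background) and refine(clause) are placeholders
# for the learner's coverage test and refinement operator; refine is
# assumed to eventually yield a clause excluding the negative examples.

def learn(E_pos, E_neg, background, covers, refine, start_clause,
          sufficiency=1.0, necessity=0.0):
    hypothesis = []
    uncovered = list(E_pos)
    while uncovered:                       # outer covering loop
        clause = start_clause
        while True:                        # inner specialization loop
            neg_covered = [e for e in E_neg if covers(clause, e, background)]
            if len(neg_covered) <= necessity * max(len(E_neg), 1):
                break                      # enough negatives excluded
            clause = refine(clause)        # specialize the clause
        hypothesis.append(clause)
        uncovered = [e for e in uncovered
                     if not covers(clause, e, background)]
        covered_frac = 1 - len(uncovered) / max(len(E_pos), 1)
        if covered_frac >= sufficiency:    # sufficiency stopping criterion
            break
    return hypothesis
```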
The input files to INDED that initialize the system are the extensional database (EDB) and the intensional database (IDB). The EDB is made up of initial ground facts (facts with no variables, only constants). The IDB is made up of universally quantified rules with no constants, only variables. Together, these form the internal ground instantiation represented internally as the deduction engine hypergraph.
3 Parallelizing INDED
Our main goals in parallelizing INDED are to obtain reasonably accurate rules faster and to decrease the size of the internal deduction hypergraph so that the
² Although the formal definitions of these semantics are cited above, for this paper we can intuitively accept stable and well-founded models as those sets of facts that are generated by transitively applying modus ponens to rules.
The serial version is very limited in what problems it can solve because of memory limitations. Work has been done in the direct partitioning of a hypergraph [KAK99]. In our pursuit to parallelize INDED, however, we are exploring the following schemes in which indirect hypergraph reductions are performed. Each of the following scenarios has been devised to be implemented on a Beowulf cluster [Buy99] using MPI [GLS99]:

1. large-grained control parallel decomposition where one node runs the induction engine while another node runs the deduction engine;
2. large-grained control parallel decomposition where a pipeline of processors is established, each operating on a different current state as created in previous (or subsequent) pipelined iterations;
3. data parallel decomposition where each node runs the same program with smaller input files (hence smaller internal hypergraphs);
4. speculative parallel approach where each node attempts to learn the same rule using a different predicate ranking algorithm in the induction engine.
4 Naive Decomposition

In this decomposition, we create a very coarse-grained system in which two nodes share the execution. One node houses the deduction engine; the other houses the induction engine. Our strategy lets the induction engine initially discover a target predicate from positive and negative examples and an initial background knowledge base. Meanwhile, the deduction engine computes the current state using the initial input files. This current state is sent to the induction engine as its background knowledge base in the subsequent iteration. The learned predicate from the induction engine in one iteration is then fed into the deductive engine to be used during the next iteration in its computation of the current state. This is then used as the background knowledge for the induction engine during the subsequent iteration. In general, during iteration i, the induction engine computes new intensional rules for the deduction engine to use in its computation of the current state in iteration i + 1. Simultaneously, during iteration i, the deduction engine computes a new current state for the induction engine to use as its background knowledge in iteration i + 1. The above process is repeated until all target predicates specified have been discovered. As we extend this implementation, we expect to acquire a pipelined system where the deduction engine is computing state S_{i+1} while the induction engine is using S_i to induce new rules (where i is the current iteration number).
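A minimal sketch of this two-node scheme, assuming mpi4py and run with two MPI processes; the `deduce`/`induce` functions and the I/O helpers are placeholders rather than the real INDED engines, and the sample facts are invented for illustration.

```python
# Hypothetical two-node naive decomposition: rank 0 hosts deduction, rank 1 induction.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
N_ITER = 10                         # illustrative; the paper stops once all targets are found

def load_edb():       return [("has_diabetes", "p1")]     # stand-in ground facts
def load_idb():       return []                           # stand-in rules
def load_examples():  return ([("inject_insulin", "p1")], [])

def deduce(edb, idb):
    """Placeholder for the deduction engine (stable/well-founded model computation)."""
    return set(edb)

def induce(background, e_pos, e_neg):
    """Placeholder for the top-down ILP learner."""
    return ["learned_rule"]

if rank == 0:                                   # deduction engine node
    edb, idb = load_edb(), load_idb()
    for i in range(N_ITER):
        state = deduce(edb, idb)
        comm.send(state, dest=1, tag=i)         # current state S_i -> induction node
        idb += comm.recv(source=1, tag=i)       # rules learned in iteration i augment the IDB
elif rank == 1:                                 # induction engine node
    e_pos, e_neg = load_examples()
    for i in range(N_ITER):
        background = comm.recv(source=0, tag=i)
        comm.send(induce(background, e_pos, e_neg), dest=0, tag=i)
```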
5 Data Parallel Decomposition with Data Partitioning

In this method, each worker node runs INDED when invoked by a master MPI node [GLS99]; each worker executes by running a partial background knowledge
base which, as in the serial version, is spawned by its deduction engine. In particular, each worker receives the full serial intensional knowledge base (IDB) but only a partial extensional knowledge base (EDB). The use of a partial EDB creates a significantly smaller (and different) hypergraph on each Beowulf worker node. This decomposition led to a faster execution due to a significantly smaller internal hypergraph being built. The challenge was to determine the best way to decompose the large serial EDB into smaller EDBs so that the rules obtained were as accurate as those learned by the serial version.
5.1 Data Partitioning and Locality
In this data parallel method, our attention centered on decomposition of the input files to reduce the size of any node's deduction hypergraph. We found that in many cases data transactions exhibited a form of locality of reference. Locality of reference is a phenomenon ardently exploited by cache systems, where the general area of memory referenced by sequential instructions tends to be repeatedly accessed. Locality of reference in the context of knowledge discovery also exists and should be exploited to increase the efficiency of rule mining. A precept of knowledge discovery is that data in a knowledge base system are nonrandom and tend to cluster in a somewhat predictable manner. This tendency mimics locality of reference. There are three types of locality of reference which may coexist in a knowledge base system: spatial, temporal, and functional. In spatial locality of reference, certain data items appear together in a physical section of a database. In temporal locality of reference, data items that were used in the recent past appear again in the near future. For example, if there is a sale in a supermarket for a particular brand of toothpaste on Monday, we will see a lot of sales for this brand of toothpaste on that day. In functional locality of reference, we appeal to a semantic relationship between entities that have a strong semantic tie that affects data transactions relating to them. For example, cereal and milk are two semantically related objects. Although they are typically located in different areas of a store, many purchase transactions of one include the other. All three of these localities can be exploited in distributed knowledge mining, and they help justify the schemes adopted in our implementations discussed in the following sections.
5.2 Partitioning Algorithm
To retain all global dependencies among the predicates in the current state, all Beowulf nodes receive a full copy of the serial IDB. The serial EDB, the initial large set of facts, therefore, is decomposed and partitioned among the nodes. The following algorithm transforms a large serial extensional database (EDB) into p smaller EDBs to be placed on p Beowulf nodes. It systematically creates sets based on constants appearing in the positive example set E+. Some facts from the serial EDB could appear on more than one processor. The algorithm is of linear complexity, requiring only one scan through the serial EDB and the positive example set E+.
Algorithm 5.1 (EDB Partitioning Algorithm) This algorithm is O(n), where n is the number of facts in the EDB.
Input: number of processors p in the Beowulf; the serial extensional database (EDB); positive and negative example sets E+, E−
Output: p individual worker node EDBs

BEGIN ALGORITHM 5.1
For each example e ∈ E+ ∪ E− Do
    For each constant c ∈ e Do
        create an initially empty set S_c of facts
Create one (initially empty) set S_none for facts that have no constants in any example e ∈ E+ ∪ E−
For each fact f ∈ EDB Do
    For each constant c' ∈ f Do
        If a set S_c' exists then S_c' = S_c' ∪ {f}
    If no set exists for any constant of f then S_none = S_none ∪ {f}
Distribute the contents of S_none among all constant sets
Determine load balance by summing all set cardinalities to reflect the total number of parallel EDB entries K
Define min_local_load = ⌈K/p⌉
Distribute all sets S_c evenly among the processors so that each processor has an EDB of roughly equal cardinality, i.e., each node has an EDB of cardinality min_local_load as defined above.
END ALGORITHM 5.1
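A compact Python rendering of Algorithm 5.1, assuming facts are tuples of a predicate name followed by constants; this is our own sketch (the paper's implementation is part of INDED and is not shown), and the final placement step uses a simple greedy packing to approximate the even distribution.

```python
from math import ceil

def partition_edb(edb, examples, p):
    """Split a serial EDB into p worker EDBs, grouping facts by the constants
    that occur in the example set (one scan of the EDB and the examples)."""
    # One set of facts per constant that appears in some example.
    sets = {c: set() for e in examples for c in e[1:]}
    s_none = set()                      # facts sharing no constant with any example
    for f in edb:
        hit = False
        for c in f[1:]:
            if c in sets:
                sets[c].add(f)
                hit = True
        if not hit:
            s_none.add(f)
    # Distribute the "constant-free" facts round-robin among the constant sets.
    for i, f in enumerate(s_none):
        if sets:
            list(sets.values())[i % len(sets)].add(f)
    # Greedily pack the constant sets onto p nodes of roughly equal load.
    k = sum(len(s) for s in sets.values())
    min_local_load = ceil(k / p) if p else 0
    nodes = [set() for _ in range(p)]
    for s in sorted(sets.values(), key=len, reverse=True):
        target = min(nodes, key=len)    # currently lightest node
        target |= s
    return nodes, min_local_load

# Hypothetical toy data (invented for illustration):
edb = [("parent", "ann", "bob"), ("parent", "bob", "cal"), ("likes", "dan", "tea")]
examples = [("grandparent", "ann", "cal")]
nodes, load = partition_edb(edb, examples, p=2)
```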
6 Global Hypergraph using Speculative Parallelism

In this parallelization, each Beowulf node searches the space of all possible rules independently and differently. All input files are the same on all machines. Therefore, each worker is discovering from the same background knowledge base. Every rule discovered by INDED is constructed by systematically appending chosen predicate expressions to an originally empty rule body. The predicate expressions are ranked by employing various algorithms, each of which designates a different search strategy. The highest ranked expressions are chosen to constitute the rule body under construction. In this parallelization of INDED, each node of the Beowulf employs a different ranking procedure and, hence, may construct very different rules. We are considering two possibilities for handling the rules generated by each worker. In the first, as soon as a process converges (finds a valid set of rules), it broadcasts a message to announce the end of the procedure. When the message is received by the other processes, they are terminated. The other possibility we
are considering is to combine the rules of each worker. Different processes may generate different rules due to the use of different ranking algorithms. These rules may be combined after all the processes have terminated, and only good rules are retained. In this way, not only can we speed up the mining process, but we can also achieve a better and richer quality of solutions.
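A minimal sketch of the second option (combine-and-filter), assuming mpi4py; the `learn_rules` and `score` functions and the threshold are invented stand-ins, since the real ranking algorithms and rule evaluation live inside INDED and are not reproduced here.

```python
# Hypothetical combine-and-filter step for the speculative scheme.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def learn_rules(ranking_id):
    """Placeholder: run the induction engine with ranking algorithm `ranking_id`."""
    return [f"rule_from_ranking_{ranking_id}"]

def score(rule):
    """Placeholder: accuracy of a rule on held-out examples."""
    return 0.5

local_rules = learn_rules(ranking_id=rank)      # every node uses a different ranking
all_rules = comm.gather(local_rules, root=0)    # collect candidate rules on the master

if rank == 0:
    merged = {r for rules in all_rules for r in rules}
    good = [r for r in merged if score(r) >= 0.8]   # keep only "good" rules (threshold assumed)
    print(good)
```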
7 Current Status and Results

The current status of our work in these parallelization schemes is as follows. We have implemented the naive decomposition and enjoyed a 50 per cent reduction in execution time. Thus far, however, the bulk of our efforts have centered on implementing and testing the data parallel implementation on an eleven-node Beowulf cluster. Here, we also experienced a great reduction in execution time. Figure 1 illustrates the consistent reduction of time as the number of nodes increased. The problem domain with which we are currently experimenting relates to the diagnosis of diabetes. The accuracy of the discovered rules by the cluster has varied. The rule learned by serial INDED is inject_insulin(A)

if Sim(X, Y) > Sim(X, Z), then it should imply that X is more similar to Y than it is to Z.
The first two properties bound the range of the measure, while the third property ensures that similarities across objects can be meaningfully compared. This last property is particularly useful for clustering purposes. Now we define our metric. Let A and B respectively be the association sets for a database D and for a database E. For an element x ∈ A (respectively in B), let sup_D(x) (respectively sup_E(x)) be the frequency of x in D (respectively in E). Define

    Sim_α(A, B) = ( Σ_{x ∈ A ∩ B} max{0, 1 − α·|sup_D(x) − sup_E(x)|} ) / ‖A ∪ B‖

where α is a scaling parameter. The parameter α has a default value of 1 and reflects how significant the user considers variations in support to be (the higher α is, the more influential variations are). For α = 0 the similarity measure is identical to ‖A ∩ B‖ / ‖A ∪ B‖, i.e., support variance carries no significance.
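As a concrete reading of the definition, a small sketch computing Sim_α from two association sets given as support dictionaries (our own illustration; the item names and support values are invented):

```python
def similarity(sup_d, sup_e, alpha=1.0):
    """Sim_alpha(A, B): A and B are the association sets of databases D and E,
    given here as dicts mapping each association to its support (frequency)."""
    a, b = set(sup_d), set(sup_e)
    if not a | b:
        return 0.0
    overlap = sum(max(0.0, 1.0 - alpha * abs(sup_d[x] - sup_e[x])) for x in a & b)
    return overlap / len(a | b)

# Hypothetical association sets with supports:
A = {("bread",): 0.40, ("milk",): 0.35, ("bread", "milk"): 0.20}
B = {("bread",): 0.42, ("milk",): 0.10, ("beer",): 0.25}
print(similarity(A, B, alpha=1.0))   # shared items weighted by how close their supports are
```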
2.3 Sampling and Association Rules
The use of sampling for approximate, quick computation of associations has been studied in the literature [10]. While computing the similarity measure, sampling can be used at two levels. First, if generating the associations is expensive (for large datasets) one can sample the dataset and subsequently generate the association set from the sample, resulting in huge I/O savings. Second, if the association sets are
large one can estimate the distance between them by sampling, appropriately modifying the similarity measure presented above. Sampling at this level is particularly useful in a distributed setting when the association sets, which have to be communicated to a common location, are very large.
3 Clustering Datasets
Clustering is commonly used for partitioning data [6]. The clustering technique we adopt is simple tree clustering. We use the similarity metric on databases defined in Section 2 as the distance metric for our clustering algorithm. The input to the algorithm is simply the number of clusters in the final result. At the start of the clustering process each database constitutes a unique cluster. Then we repeatedly select the pair of clusters with the highest similarity and merge the pair into one cluster until the desired number of clusters is left. As our similarity metric is based on associations, there is an issue of how to merge their association lattices when two clusters are merged. One solution would be to combine all the datasets and recompute the associations, but this would be time-consuming and involve heavy communication overheads (all the datasets would have to be re-accessed). Another solution would be to intersect the two association lattices and use the intersection as the lattice for the new cluster, but this would be very inaccurate. We take the half-way point between these two extremes. Suppose we are merging two clusters D and E, whose association sets are respectively A and B. The value of sup_D(x) is known only for x ∈ A and that of sup_E(x) is known only for x ∈ B. The actual support of x in the join of D and E is given as

    ( sup_D(x)·‖D‖ + sup_E(x)·‖E‖ ) / ( ‖D‖ + ‖E‖ ).

When x does not belong to A or B, we will approximate the unknown sup-value by a "guess" ε², which can be specific to the cluster as well as to the association x.
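A sketch of this merge rule and the tree-clustering loop, built on the `similarity` function sketched earlier (our own illustration; the guess ε for associations missing from one side is reduced to a single assumed constant, whereas the paper leaves it cluster- and association-specific):

```python
def merge_supports(sup_d, size_d, sup_e, size_e, eps=0.01):
    """Approximate the association supports of the cluster D ∪ E without re-mining.
    eps is the assumed 'guess' for an association missing from one side."""
    merged = {}
    for x in set(sup_d) | set(sup_e):
        sd = sup_d.get(x, eps)
        se = sup_e.get(x, eps)
        merged[x] = (sd * size_d + se * size_e) / (size_d + size_e)
    return merged, size_d + size_e

def cluster_datasets(datasets, k, alpha=1.0):
    """datasets: dict name -> (support dict, number of transactions).
    Repeatedly merge the most similar pair until k clusters remain."""
    clusters = {(name,): ds for name, ds in datasets.items()}
    while len(clusters) > k:
        names = list(clusters)
        best = max(((a, b) for i, a in enumerate(names) for b in names[i + 1:]),
                   key=lambda p: similarity(clusters[p[0]][0], clusters[p[1]][0], alpha))
        a, b = best
        clusters[a + b] = merge_supports(*clusters.pop(a), *clusters.pop(b))
    return list(clusters)
```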
4 Experimental Analysis
In this section we experimentally evaluate our similarity metric. We evaluate the performance and sensitivity of computing this metric using sampling in a distributed setting. We then apply our dataset clustering technique to synthetic datasets from IBM and to a real dataset from the Census Bureau, and evaluate the results obtained.

4.1 Setup
All the experiments (association generation, similarity computation) were performed on a single processor of a DECStation 4100 containing four 600MHz Alpha 21164 processors, with 256MB of memory per processor.

² We are evaluating two methods to estimate ε. The strawman is to randomly guess a value between 0 and the minimum support. The second approach is to estimate the support of an item based on the available supports of its subsets. Due to lack of space we do not detail our experimentation on the choice of ε.
We used different synthetic databases with sizes ranging from 3MB to 30MB, which were generated using the procedure described in [3]. These databases mimic the transactions in a retailing environment. Table 1 shows the databases used and their properties. The number of transactions is denoted as numT, the average transaction size as Tl, the average maximal potentially frequent itemset size as I, the number of maximal potentially frequent itemsets as ‖L‖, and the number of items as Size. We refer the reader to [3] for more detail on the database generation.

Database  numT    Tl  I      ‖L‖  Size
D100      100000  8   2000   4    5MB
D200      200000  12  6000   2    12MB
D300      300000  10  4000   3    16MB
D400      400000  10  10000  6    25MB
Table 1. Database properties

The Census data used in this work was derived from the County Business Patterns (State) database from the Census Bureau. Each dataset we derive (one dataset per state) from this database contains one transaction per county. Each transaction contains items which highlight information on subnational economic data by industry. Each industry is divided into small, medium and large scale concerns. The original data has numeric values corresponding to the number of such concerns occurring in the county. We discretize these numeric values into three categories: high, middle and low. So an item "high-small-agriculture" would correspond to a high number of small agricultural concerns. The resulting set of datasets has as many transactions as counties in the state and a high degree of associativity.

4.2 Sensitivity to Sampling Rate
In Section 2 we mentioned that sampling can be used at two levels to estimate the similarity efficiently in a distributed setting. If association generation proves to be expensive, one can sample the transactions to generate the associations and subsequently use these associations to estimate the similarity accurately. Alternatively, if the number of associations in the lattice is large, one can sample the associations to directly estimate the similarity. We evaluate the impact of using sampling to compute the approximate similarity metric below. For this experiment we break down the execution time of computing the similarity between two of our databases, D300 and D400, under varying sampling rates. The two datasets were located in physically separate locations. We measured the total time to generate the associations for a minimum support of 0.05% (Computing Associations) for both datasets (run in parallel), the time to communicate the associations from one machine to another (Communication Overhead), and the time to compute the similarity metric (Computing Similarity) from these association sets. Transactional sampling influences the computation of the associations, while association sampling influences the latter two aspects of this experiment. Under association sampling, each processor computes a sample of its association set and sends it to the other; both then compute a part of the similarity metric (in parallel). These two values are then merged appropriately, accounting for duplicates in the samples used. While both these sampling levels (transaction and association) could have different
sampling rates, for expository simplicity we chose to set both to a common value. We evaluate the performance under the following sampling rates: 5%, 10%, 12.5%, 20%, and 25%. Figure 1 shows the results from this experiment.

[Fig. 1. Sampling Performance: execution-time breakdown of Sim(D300,D400) into computing associations, communication overhead, and computing similarity, for transactional and association sampling rates from 100% down to 5%.]

Breaking down the performance, it is clear that by using sampling at both levels the performance improves dramatically. For a sampling rate of 10% the time to compute associations goes down by a factor of 4. The communication overhead goes down by a factor of 6 and the time to compute the similarity goes down by a factor of 7. This yields an overall speedup of close to 5. Clearly, the dominant factor in this experiment is computing the associations (85% of total execution time). However, with more traffic in the system, as will be the case when computing the similarity across several datasets (such as in clustering), and when one is modifying the probe set interactively, the communication overhead will play a more dominant role. The above experiment affirms the performance gains from association and transactional sampling. Next, we evaluate the quality of the similarity metric estimated using such approximation techniques for two minimum support values (0.05% and 0.1%). From Table 2 it is clear that using sampling to estimate the similarity metric can be very accurate (within 2% of the ideal (Sampling Rate 100%)) for all sampling rates above 5%. We have observed similar results (speedup and accuracy) for the other dataset pairs as well.

Support  SR-100%  SR-25%  SR-20%  SR-10%  SR-5%
0.05%    0.135    0.134   0.136   0.133   0.140
0.1%     0.12     0.12    0.12    0.12    0.115
Table 2. Sampling Accuracy: Sim(D300,D400)
4.3 Synthetic Dataset Clustering
We evaluated the efficacy of clustering homogeneous distributed datasets based on similarity. We used the synthetic datasets described earlier as a starting point. We randomly split each of the datasets D100, D200, D300, and D400 into 10 datasets of roughly equal size. For the sake of simplicity in exposition we describe only the
experiment that used only the first three subsets from each. We ran a simple tree-based clustering algorithm on these twelve datasets. Figure 2 shows the result. The numbers attached to the joins are the Sim metric with α = 1.0. Clearly the datasets from the same origin are merged first. Given four as the desired number of clusters (or a merge cutoff of 0.2), the algorithm stops right after executing all the merges depicted by full lines, combining all the children from the same parents into single clusters and leaving apart those from different parents. This experiment illustrates two key points. First, the similarity metric coupled with our merging technique seems to be an efficient yet effective way to cluster datasets. Second, hypothetically speaking, if these 12 datasets were representative of a distributed database, combining all 12 and mining for rules would have destroyed any potentially useful structural rules that could have been found if each cluster were mined independently (our approach).
[Dendrogram: the three subsets of each parent dataset (D100, D200, D300, D400) merge first, at similarities of roughly 0.60-0.80, while merges across different parent datasets occur only at much lower similarities (about 0.04-0.14).]
Fig. 2. Dataset Clustering

4.4 Census Dataset Evaluation
Table 3 shows the Sim values (with α = 1.0) for a subset of the Census data for the year 1988. As mentioned earlier, each dataset corresponds to a state in the US. When asked to break the eight states into four clusters, the clustering algorithm returned the clusters [IL, IA, TX], [NY, PA], [FL], and [OR, WA]. On looking at the actual Sim values it is clear that NY and PA have a marked preference for one another, and IL, IA, and TX have a strong preference for one another. OR has a stronger preference for IL, IA and TX, but once IL, IA, and TX were merged it preferred being merged with WA. Interestingly, three pairs of neighboring states, i.e., (OR,WA), (IL,IA), and (NY,PA), are found in the same cluster. An interesting by-play of the discretization of the number of industrial concerns into three categories (high, middle and low) is that states with larger counties (area-wise), such as PA, NY and FL, tend to have higher associativity (since each county has many items) and thereby tend to have less affinity to states with lower associativity. By probing the similarity between IA and IL further, the most influential attribute is found to be agricultural concerns (no surprise there). The reason TX was found to be similar to these states was again due to agricultural concerns, a somewhat surprising result. However, this made sense when we realized that cattle farming is also grouped
under agricultural concerns! Interestingly, we found that the Census data benefited, performance-wise, from association sampling due to its high associativity.

State  IL    NY    PA    FL    TX    OR    WA
IA     0.54  0.01  0.01  0.16  0.44  0.26  0.10
IL           0.02  0.02  0.24  0.52  0.30  0.16
NY                 0.31  0.14  0.01  0.04  0.08
PA                       0.05  0.01  0.03  0.04
FL                             0.24  0.21  0.21
TX                                   0.32  0.16
OR                                         0.25
Table 3. Census Dataset: Sim Values (support = 20%)
5 Conclusions
In this paper we propose a method to measure the similarity among homogeneous databases and show how one can use this measure to cluster similar datasets to perform meaningful distributed data mining. An interesting feature of our algorithm is the ability to interact via informative querying to identify attributes influencing similarity. Experimental results show that our algorithm can adapt to time constraints by providing quick (speedup of 5-7) and accurate estimates (within 2%) of similarity. We evaluate our work on several datasets, synthetic and real, and show the effectiveness of our techniques. As part of future work we will focus on evaluating and applying dataset clustering to other real-world distributed data mining tasks. It seems likely that the notion of similarity introduced here would work well for tasks such as discretization and sequence mining with minor modifications, if any. We are also evaluating the effectiveness of the merging criteria described in Section 3.
References
1. C. Aggarwal and P. Yu. Online generation of association rules. In ICDE'98.
2. R. Agrawal, C. Faloutsos, and A. Swami. Efficient similarity search in sequence databases. In Foundations of Data Organization and Algorithms, 1993.
3. R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. Inkeri Verkamo. Fast discovery of association rules. In U. Fayyad et al., editors, Advances in Knowledge Discovery and Data Mining, pages 307-328. AAAI Press, Menlo Park, CA, 1996.
4. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In 20th VLDB Conf., 1994.
5. G. Das, H. Mannila, and P. Ronkainen. Similarity of attributes by external probes. In KDD 1998.
6. U. M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11):27-34, 1996.
7. R. Goldman, N. Shivakumar, V. Suresh, and H. Garcia-Molina. Proximity search in databases. In VLDB Conf., 1998.
8. H. Jagadish, A. Mendelzon, and T. Milo. Similarity based queries. In PODS, 1995.
9. R. Subramonian. Defining diff as a data mining primitive. In KDD 1998.
10. H. Toivonen. Sampling large databases for association rules. In VLDB Conf., 1996.
11. M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. In KDD, 1997.
Scalable Model for Extensional and Intensional Descriptions of Unclassified Data

Hércules A. Prado¹,²,*, Stephen C. Hirtle²,**, Paulo M. Engel¹,***

¹
Universidade Federal do Rio Grande do Sul, Instituto de Informática, Av. Bento Gonçalves, 9500 - Bairro Agronomia, Porto Alegre / RS - Brasil, Caixa Postal 15.064 - CEP 91.501-970, Fone: +55(051)316-6829, Fax: +55(051)319-1576
² University of Pittsburgh, Department of Information Sciences and Telecommunications, 135 North Bellefield Ave., Pittsburgh, PA 15260, Phone: +1(412)624-9434, Fax: +1(412)624-2788
{prado, [email protected] and
[email protected]

Abstract. Knowledge discovery from unlabeled data comprises two main tasks: identification of "natural groups" and analysis of these groups in order to interpret their meaning. These tasks are accomplished by unsupervised and supervised learning, respectively, and correspond to the taxonomy and explanation phases of the discovery process described by Langley [9]. The efforts of the Knowledge Discovery from Databases (KDD) research field have addressed these two processes along two main dimensions: (1) scaling up the learning algorithms to very large databases, and (2) improving the efficiency of the knowledge discovery process. In this paper we argue that the advances achieved in scaling up supervised and unsupervised learning algorithms allow us to combine these two processes in just one model, providing extensional (who belongs to each group) and intensional (what features best describe each group) descriptions of unlabeled data. To explore this idea we present an artificial neural network (ANN) architecture, using as building blocks two well-known models: the ART1 network, from the Adaptive Resonance Theory family of ANNs [4], and the Combinatorial Neural Model (CNM), proposed by Machado ([11] and [12]). Both models satisfy one important desideratum for data mining, learning in just one pass of the database. Moreover, CNM, the intensional part of the architecture, allows one to obtain rules directly from its structure. These rules represent the insights on the groups. The architecture can be extended to other supervised/unsupervised learning algorithms that comply with the same desiderata.

* Researcher at EMBRAPA — Brazilian Enterprise for Agricultural Research and lecturer at
Catholic University of Brasília (supported by CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior, grant nr. BEX1041/98-3)
** Professor at University of Pittsburgh
*** Professor at Federal University of Rio Grande do Sul
1 Introduction

Research in Knowledge Discovery from Databases (KDD) has developed along two main dimensions: (1) improving the knowledge discovery process, and (2) scaling up this same process to very large databases. Machine Learning, as an important field related to KDD, is founded on three principles [19]:

1. Modeling of cognitive processes, aiming to select characteristics of interest to be formalized as knowledge;
2. Computer science, which offers a formalism to support the descriptions of those characteristics, as well as providing approaches to evaluate the degree of computational difficulty of the issues involved; and
3. Applications, where one departs from practical needs to the implementation of systems.

In this article, we depart from a characterization of the concept formation activity as a cognitive process, proposing a computational approach to support this activity from the point of view of KDD. We take the concept of performance as given by the relation functionality/resources applied. In this way, we present a model where functionality is increased while the resources applied are only slightly changed.
2 Motivation

According to Wrobel [19], a concept is "a generalized description of sets of objects". In this sense, Easterlin and Langley [5] analyse the concept formation process as follows:

1. Given a set of object (instance, event, case) descriptions, usually presented incrementally;
2. find sets of objects that can be grouped together (aggregation), and
3. find intensional descriptions of these sets of objects (characterization).

Murphy and Medin [14] discuss two hypotheses that constrain the way objects are grouped into concepts:

1. Similarity hypothesis: this hypothesis maintains that what defines a class is that its members are similar to each other and not similar to members of other classes.
2. Correlated attribute hypothesis: this hypothesis states that "natural groups" are described according to clusters of features and that categories reflect the cluster structure of correlations.

The first hypothesis presents a problem: the similarity criterion must be applied to a pre-defined set of features, and the definition of this set is affected by the previous knowledge one has about the objects. However, when only little knowledge about the data exists, this criterion is used as a first approximation. Over this approximation, the correlated attribute hypothesis is applied. In a broad sense, what is desirable in this process is to provoke the mental operations that can lead to a problem solution ([16] and [15]). Actually, since it seems that the discovery process, as a rule, requires human judgment [9], it is useful to leave available to the analyst all relevant information to evaluate both hypotheses when searching for the classes' structures.
3 Proposed Architecture

Research in the KDD realm has emphasized improvements in the processes of supervised and unsupervised learning. More recently, many unsupervised learning algorithms have been scaled up according to the desiderata proposed by Agrawal et al. [1] for this kind of learning algorithm. Considering these advances and the ones in supervised learning, we believe there is enough room to scale up the combined process of unsupervised and supervised learning in order to obtain better descriptions of unclassified data. By "better descriptions" we mean obtaining intensional (what are the main characteristics of each class) descriptions, beyond the extensional (what objects are members of each class) ones usually provided. We explore our idea with a hybrid architecture, based on two well-known models: ART1 [4], used for clustering binary data, and CNM ([11] and [12]), used to map the input space onto the formed classes. Both ANNs present an important characteristic to support data mining: they learn in just one pass of the entire data set. The model is illustrated by Figure 1.
Fig. 1. Describing unclassified data
The architecture is composed of five layers according to the following schema: an input layer (F0, where the examples are introduced into the architecture); an aggregation module (F1 and F2, where the classes are defined); and a characterization module (F2, F3, and F4, where the classes are explained). The characterization module requires the pre-existence of classes that would not be available when the process starts. We overcome this problem by creating the
classes by means of a sampling subsystem. The creation of classes through sampling was explored by Guha [8] with consistent results. As a consequence of the sampling process, a complete execution of the system will take more than one pass of the input data set. With s the size of the sample and D the size of the input data set, a complete execution of the system will take, precisely, D + s records. In the next two sections, we describe each model used as a building block for our architecture.
4 ART1 Neural Network

ART1 (F1 and F2 layers) is a member of the so-called ART family, which stands for Adaptive Resonance Theory [10], [3], [4] and [6]. ART1 is a competitive recurrent network with two layers, the input layer and the clustering layer. This network was developed to overcome the plasticity-stability dilemma [7], allowing incremental learning, with a continuous updating of the cluster prototypes, while preserving the previously stored patterns. The clustering algorithm proceeds, in general steps, as follows: (a) the first input is selected to be the first cluster; (b) each next input is compared with each existing cluster; the first cluster where the distance to the input is less than a threshold is chosen to cluster the input. Otherwise, the input defines a new cluster. It can be observed that the number of clusters depends on the threshold and the distance metric used to compare the inputs with the clusters. For each input pattern presented to the network, one output unit is declared winner (for the first pattern, the input pattern itself defines the cluster). The winner backpropagates a signal that encodes the expected pattern template. If the current input pattern differs by more than a defined threshold from the backpropagated signal, the winner is temporarily disabled (by the Orienting System) and the next closest unit is declared winner. The process continues until an output unit becomes a winner, considering the threshold. If none of the output units becomes a winner, a new output unit is defined to cluster the input pattern. Graphically, an ART1 network can be illustrated by Figure 2, where it appears with four input and six output neurons; t_ij and b_ij are, respectively, top-down and bottom-up connections. One important characteristic of this ANN is that it works in just one pass, which is interesting when we are processing a huge amount of data. The training algorithm for this network is the following (an illustrative sketch is given after the steps):

– Step 1. Initialization: The bottom-up b_ij(t) and top-down t_ij(t) weight connections between input node i and output node j at time t are set up. The fraction ρ (vigilance parameter) is defined, indicating how close an input must be to a stored exemplar to match. Initialize N and M, the numbers of input and output nodes.
– Step 2. Apply New Input.
– Step 3. Compute Matching Scores: The bottom-up weights are applied to the input pattern, generating the output signals μ_j.
– Step 4. Select Best Matching Exemplar: μ_{j*} = max_j {μ_j} is taken as the best exemplar.
– Step 5. Vigilance Test: The best exemplar and the input pattern are compared, according to ρ. If the distance is acceptable, control flows to Step 7; otherwise Step 6 proceeds.
Fig. 2. Architecture of an ART network [3]
– Step 6. Disable Best Matching Exemplar: The output of the best matching node selected in Step 4 is temporarily set to zero and no longer takes part in the maximization of Step 4. Then go to Step 3.
– Step 7. Adapt Best Matching Exemplar:
t_ij(t+1) = t_ij(t) · x_i
b_ij(t+1) = ( t_ij(t) · x_i ) / ( 0.5 + Σ_{i=0}^{N−1} t_ij(t) · x_i )
– Step 8. Repeat by Going to Step 2: First enable any node disabled in Step 6.
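A compact illustrative sketch of the ART1 clustering loop for binary inputs, following the steps above (our own simplification; the vigilance test compares the matched template with the input in the usual ART1 fashion, the bottom-up normalization constant L is assumed, and all data are invented):

```python
import numpy as np

def art1_cluster(patterns, rho=0.7, max_clusters=10, L=2.0):
    """Cluster binary vectors with a simplified ART1 (one pass over the data)."""
    top_down = []                                  # t_j: binary templates
    bottom_up = []                                 # b_j: normalized templates
    labels = []
    for x in patterns:
        mu = [b @ x for b in bottom_up]            # Step 3: matching scores
        order = np.argsort(mu)[::-1]               # best exemplars first
        chosen = None
        for j in order:                            # Steps 4-6: search with vigilance
            match = top_down[j] * x
            if match.sum() / max(x.sum(), 1) >= rho:   # Step 5: vigilance test
                chosen = j
                break
        if chosen is None and len(top_down) < max_clusters:
            top_down.append(x.copy())              # new cluster seeded by the input
            bottom_up.append(L * x / (L - 1 + x.sum()))
            chosen = len(top_down) - 1
        elif chosen is not None:                   # Step 7: adapt the winner
            top_down[chosen] = top_down[chosen] * x
            t = top_down[chosen]
            bottom_up[chosen] = L * t / (L - 1 + t.sum())
        labels.append(chosen)
    return labels

data = np.array([[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 1, 1, 1]])
print(art1_cluster(data, rho=0.6))   # similar binary patterns share a cluster label
```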
5 Combinatorial Neural Model (CNM)

CNM (F2, F3, and F4 layers) is a hybrid architecture for intelligent systems that integrates the symbolic and connectionist computational paradigms. This model is able to recognize regularities in high-dimensional symbolic data, performing mappings from this input space to a lower-dimensional output space. Like ART1, this ANN also overcomes the plasticity-stability dilemma [7]. The CNM uses supervised learning and a feedforward topology with one input layer, one hidden layer (here called combinatorial) and one output layer (Figure 3). Each neuron in the input layer corresponds to a concept, a complete idea about an object of the domain, expressed in an object-attribute-value form. They represent the evidences of the application domain. In the combinatorial layer there are aggregative fuzzy AND neurons, each one connected to one or more neurons of the input layer by arcs with adjustable weights. The output layer contains one aggregative fuzzy OR neuron for each possible class (also called hypothesis), linked to one or more neurons in the combinatorial layer. The synapses may be excitatory or inhibitory and they are
characterized by a strength value (weight) between zero (not connected) and one (fully connected synapses).
Fig. 3. Complete version of CNM for 3 input evidences and 2 hypotheses [11]
The network is created completely uncommitted, according to the following steps: (a) one neuron in the input layer for each evidence in the training set; (b) a neuron in the output layer for each class in the training set; and (c) for each neuron in the output layer, a complete set of hidden neurons in the combinatorial layer corresponding to all possible combinations (of length between two and nine) of connections with the input layer. There is no neuron in the combinatorial layer for single connections; in this case, input neurons are connected directly to the hypotheses. The learning mechanism works in only one iteration, and it is described below:

PUNISHMENT AND REWARD LEARNING RULE
– Set for each arc of the network an accumulator with initial value zero;
– For each example case from the training database, do:
  Propagate the evidence beliefs from the input nodes to the hypothesis layer;
  For each arc reaching a hypothesis node, do:
    If the reached hypothesis node corresponds to the correct class of the case
    Then backpropagate from this node to the input nodes, increasing the accumulator of each traversed arc by its evidential flow (Reward)
    Else backpropagate from the hypothesis node to the input nodes, decreasing the accumulator of each traversed arc by its evidential flow (Punishment).

After training, the value of the accumulator associated with each arc arriving at the output layer will be in [-T, T], where T is the number of cases present in the training set. The last step is the pruning of the network; it is performed by the following actions: (a) remove all arcs whose accumulator is lower than a threshold (specified by a specialist); (b) remove all neurons from the input and combinatorial layers that became disconnected from all hypotheses in the output layer; and (c) make the weights of the arcs
arriving at the output layer equal to the value obtained by dividing the arc accumulators by the largest arc accumulator value in the network. After this pruning, the network becomes operational for classification tasks. This ANN has been applied with success in data mining tasks ([2], [17], and [18]).
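A small sketch of the punishment-and-reward accumulator update and the accumulator-threshold pruning described above (our own illustration with invented data; belief propagation is reduced to a min-based fuzzy AND over the evidence values, and the evidence/class names are hypothetical):

```python
from itertools import combinations

def train_cnm(cases, hypotheses, max_order=3, threshold=0.0):
    """cases: list of (evidence_dict, true_class). Each combinatorial neuron is a
    tuple of evidence names feeding one hypothesis; its arc has one accumulator."""
    evidences = sorted({e for ev, _ in cases for e in ev})
    combos = [c for r in range(2, max_order + 1) for c in combinations(evidences, r)]
    combos += [(e,) for e in evidences]            # single evidences connect directly
    acc = {(h, c): 0.0 for h in hypotheses for c in combos}
    for ev, true_class in cases:
        for (h, c) in acc:
            flow = min(ev.get(e, 0.0) for e in c)  # fuzzy AND of the evidences
            if flow == 0.0:
                continue
            acc[(h, c)] += flow if h == true_class else -flow   # reward / punishment
    # Pruning: keep only arcs whose accumulator exceeds the threshold, then normalize.
    kept = {k: v for k, v in acc.items() if v > threshold}
    top = max(kept.values(), default=1.0)
    return {k: v / top for k, v in kept.items()}   # arc weights in (0, 1]

# Hypothetical toy training set (evidence -> belief in [0, 1]):
cases = [({"fever": 1.0, "cough": 1.0}, "flu"),
         ({"fever": 1.0, "rash": 1.0}, "measles"),
         ({"cough": 1.0}, "flu")]
rules = train_cnm(cases, hypotheses=["flu", "measles"])
```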
6 Ongoing Work

This paper presents an architecture to scale up the whole process of concept formation according to two main constraints: the identification of groups composed of similar objects and the description of these groups by the most highly correlated features. The ongoing work includes the implementation and evaluation of this architecture, instantiated with ART1 and CNM, and its extension to cope with continuous data.
References
1. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic Subspace Clustering of High-Dimensional Data for Data Mining Applications. In: Proceedings of ACM SIGMOD98 International Conference on Management of Data, Seattle, Washington, 1998.
2. Beckenkamp, F. G., Feldens, M. A., Pree, W.: Optimizations of the Combinatorial Neural Model. In: Vth Brazilian Symposium on Neural Networks (SBRN'98), Belo Horizonte, Brazil.
3. Bigus, J. P.: Data Mining with Neural Networks. [S.l.]: McGraw-Hill, 1996. p.3-42.
4. Carpenter, G. and Grossberg, S.: Neural Dynamics of Category Learning and Recognition: Attention, Memory, Consolidation, and Amnesia. In: Joel L. Davis (ed.), Brain Structure, Learning, and Memory. AAAS Symposia Series, Boulder, CO: Westview Press, 1988. p.233-287.
5. Easterlin, J. D., Langley, P.: A Framework for Concept Formation. In: Seventh Annual Conference of the Cognitive Science Society, Irvine, CA, 1985.
6. Engel, P. M.: Lecture Notes. Universidade Federal do Rio Grande do Sul. Porto Alegre-RS, Brazil: CPGCC da UFRGS, 1997.
7. Freeman, J. A., Skapura, D. M.: Neural Networks: Algorithms, Applications, and Programming Techniques. [S.l.]: Addison-Wesley, 1992. p.292-339.
8. Guha, S., Rastogi, R., Shim, K.: CURE: An Efficient Clustering Algorithm for Large Databases. In: Proceedings of ACM SIGMOD98 International Conference on Management of Data, Seattle, Washington, 1998.
9. Langley, P.: The Computer-Aided Discovery of Scientific Knowledge. In: Proc. of the First International Conference on Discovery Science, Fukuoka, Japan, 1998.
10. Lippmann, R. P.: An Introduction to Computing with Neural Nets. IEEE ASSP Magazine, April 1987.
11. Machado, R. J., Rocha, A. F.: Handling knowledge in high order neural networks: the combinatorial neural network. Rio de Janeiro: IBM Rio Scientific Center, Brazil, 1989. (Technical Report CCR076).
12. Machado, R. J., Carneiro, W., Neves, P. A.: Learning in the combinatorial neural model. IEEE Transactions on Neural Networks, v.9, p.831-847, Sep. 1998.
13. Medin, D., Altom, M. W., Edelson, S. M. and Freko, D.: Correlated symptoms and simulated medical classification. Journal of Experimental Psychology: Learning, Memory and Cognition, 8:37-50, 1983.
14. Murphy, G. and Medin, D.: The Role of Theories in Conceptual Coherence. Psychological Review, 92(3):289-316, July 1985.
15. Pereira, W. C. de A.: Resolução de Problemas Criativos: Ativação da Capacidade de Pensar. Departamento de Informação e Documentação/EMBRAPA, Brasília-DF, 1980. 54pp.
16. Polya, G.: How to Solve It: A New Aspect of Mathematical Method. Princeton: Princeton University Press, 1972. 253pp.
17. Prado, H. A., Frigeri, S. R., Engel, P. M.: A Parsimonious Generation of Combinatorial Neural Model. In: IV Congreso Argentino de Ciencias de la Computación (CACIC'98), Neuquén, Argentina, 1998.
18. Prado, H. A. do, Machado, K. F., Frigeri, S. R., Engel, P. M.: Accuracy Tuning in Combinatorial Neural Model. In: PAKDD'99 - Pacific-Asia Conference on Knowledge Discovery and Data Mining, Proceedings, Beijing, China, 1999.
19. Wrobel, S.: Concept Formation and Knowledge Revision. Dordrecht, The Netherlands: Kluwer, 1994. 240pp.
Parallel Data Mining of Bayesian Networks from Telecommunications Network Data

Roy Sterritt, Kenny Adamson, C. Mary Shapcott, and Edwin P. Curran

University of Ulster, Faculty of Informatics, Newtownabbey, County Antrim, BT37 0QB, Northern Ireland. {r.sterritt, k.adamson, cm.shapcott,
[email protected] http://www.ulst.ac.uk/
Abstract. Global telecommunication systems are built with extensive redundancy and complex management systems to ensure robustness. Fault identification and management of this complexity is an open research issue with which data mining can greatly assist. This paper proposes a hybrid data mining architecture and a parallel genetic algorithm (PGA) applied to the mining of Bayesian Belief Networks (BBN) from Telecommunication Management Network (TMN) data.
1 Introduction and the Global Picture
High-speed broadband telecommunication systems are built with extensive redundancy and complex management systems to ensure robustness. The presence of a fault may not only be detected by the offending component and its parent, but the consequence of that fault may be discovered by other components. This can potentially result in a net effect of a large number of alarm events being raised and cascaded to the element controller, possibly flooding it, in a testing environment, with raw alarm events. In an operational network a flood is prevented by filtering and masking functions on the actual network elements (NE). Yet there can still be a considerable number of alarms depending on the size and configuration of the network; for instance, although unusual to execute, there does exist the facility for the user to disable the filtering/masking on some of the alarm types. The behaviour of the alarms is so complex it appears non-deterministic. It is very difficult to isolate the true cause of the fault or multiple faults. Data mining aims at the discovery of interesting regularities or exceptions from vast amounts of data and as such can assist greatly in this area. Failures in the network are unavoidable, but quick detection and identification of the fault is essential to ensure robustness. To this end the ability to correlate alarm events becomes very important. This paper will describe how the authors, in collaboration with NITEC, a Nortel Networks R&D lab, have used parallel techniques for mining Bayesian networks from telecommunications network data. The primary purpose is fault management: the induction of a Bayesian network by correlating offline
event data, and the deduction of the cause (fault identification) using this Bayesian network with live events. The problems encountered using traditional computing, and how these can be overcome using a high performance computing architecture and algorithm, are reported along with the results of the architecture and parallel algorithm.
1.1 Telecommunication Fault Management, Fault Correlation and Data Mining BBNs

Artificial intelligence and database techniques are useful tools in the interrogation of databases for hitherto unseen information to support managerial decision making or to aid advanced system modelling. Knowledge discovery in databases is a technique for combing through large amounts of information for relationships that may be of interest to a domain expert but have either been obscured by the sheer volume of data involved or are a product of that volume [1]. Bayesian Belief Networks are a technique for representing and reasoning with uncertainty [2]. They represent CAUSEs and EFFECTs as nodes and connect CAUSEs and EFFECTs as networks with a probability distribution. Bayesian Belief Networks have been successfully used to build applications, such as medical diagnosis, where multiple causes bear on a single effect [2]. We proposed to employ both techniques in conjunction to model complex systems that act in a non-deterministic manner. These problems are common to real-life industrial systems that produce large amounts of information. The data-handling requirements, computationally expensive techniques and real-time responsiveness make parallel processing a necessity. The exemplar under which the architecture is being developed is a telecommunications application, involving the Synchronous Digital Hierarchy (SDH), the backbone of global communications. It offers flexibility in dealing with existing bandwidth requirements and provides capabilities for the increasingly sophisticated telecommunications services of the future [3]. One key area of interest to engineers is the management of events and faults in a network of SDH multiplexers [4]. An event is a change of status within a network that produces a corresponding alarm message or fault. When a network error occurs, each multiplexer determines the nature of the problem and takes steps to minimise any loss in signal or service. To facilitate the process of error recovery, there are many levels of redundancy built into a network, e.g. signal re-routing and the use of self-healing rings. The management of faults is complex because:
– the occurrence of faults is time-variant and non-deterministic;
– faults can produce a cascade of other faults;
– fault-handling must be performed in real-time.

Although the behaviour of individual multiplexers is accurately specified, the complex interactions between a number of multiplexers and the variability of real-life network topologies mean that the process of fault occurrence is more complicated than a single specification will imply. This problem is compounded
by the growth of networks and the increasing variety of network topologies. The application of data mining, and in particular parallel data mining, can assist greatly. Data mining aims at the discovery of interesting regularities or exceptions from vast amounts of data. As has been stated, fault management is a critical but difficult area of telecommunication network management, since networks produce a vast quantity of data that must be analysed and interpreted before faults can be located. Alarm correlation is a central technique in fault identification, yet it is difficult to cope with incomplete data and the sheer complexity involved. At the heart of alarm event correlation is the determination of the cause. The alarms represent the symptoms and as such are not of general interest [5]. There are two real-world concerns:
– the sheer volume of alarm event traffic when a fault occurs;
– the cause, not the symptoms.

A technique that can tackle both these concerns would be best, yet this can be difficult to achieve.
1.2 The Architecture

We proposed a parallel mining architecture for the elucidation of an accurate system representation based primarily on Bayesian Belief Networks that are induced using Knowledge Discovery techniques. The architecture has a modular design that can be reconfigured according to the application specification. The system was prototyped using the INMOS transputer as the target hardware. Within the telecommunications domain, it is hoped that the application of the system will ultimately assist in fault management but also in the analysis of test data. The actual realisation of the architecture for this situation is shown in Figure 1. It can be seen that the input data is available in the form of log data from the element controller. The team identified a need for the efficient preparatory processing of data that appeared in the event log format. This led to the design and preliminary implementation of a data cleaner and a data pre-processor. The data cleaner allows a user to specify the format of a text document in a generic way using a template, and to specify filtering conditions on the output. In the telecommunications application the text document is an event log, and the template file can be altered if the structure of the event records changes. The data cleaner parses a log file and passes the resulting events to the pre-processor, which time-slices the information and creates an intermediate file for use in the induction module. The Bayesian net, created as a result of induction, is potentially useful in a fault situation where the faults most likely to be responsible for observed alarms can be computed from the net and relayed to a human operator. For this reason there is a deduction module in the architecture whereby observed conditions in the telecommunications network can be fed into the Bayesian net and changes in the probabilities of underlying fault conditions can be computed [6].
However, the components are able to operate in isolation, provided that they are supplied with files in the correct input format. In particular, the induction component and the deduction component use data in Bayesian Network Interchange format [7].
Fig. 1. The Architecture
2 The Parallel Data Mining Algorithm

2.1 The Need for Parallelism

In this case, as in many others, the structure of the graphical model (the Bayesian net) is not known in advance, but there is a database of information concerning the frequencies of occurrence of combinations of different variable values (the alarms). In such a case the problem is that of induction: to induce the structure from the data. Heckerman has a good description of the problem [8]. There has been a lot of work in the literature in the area, including that of Cooper and Herskovits [9]. Unfortunately the general problem is NP-hard [10]. For a given number of variables there is a very large number of potential graphical structures which can be induced. To determine the best structure, in theory one should fit the data to each possible graphical structure, score the structure, and then select the structure with the best score. Consequently, algorithms for learning networks from data are usually heuristic once the number of variables gets to be of reasonable size. There are 2^{k(k−1)/2} distinct possible independence graphs for a k-dimensional random vector: this translates to 64 probabilistic models for k = 4, and 32,768 models for k = 6. Several different algorithms were prototyped and tested using the telecommunications data, but since the potential number of graph candidates is so large a genetic algorithm was developed.
2.2 Parallel Cause And Effect Genetic Algorithm (P-CAEGA)

Goldberg describes many ways to view genetic algorithms (GA) [11]: as problem solvers, as a basis for competent machine learning, as a computational model of innovation and creativity, and so on. In the work described here the problem is to find the best cause-and-effect network in a very large solution space of
all possible cause-and-effect networks; since the problem is NP-hard, a heuristic search technique must be used. This led to a consideration of genetic algorithms, since they have been shown to work well in many application areas [12] and offer a robust means of searching for the globally optimal solution [13]. The genetic algorithm works on a population of solutions, which change as the algorithm cycles through a sequence of generations, until a satisfactory solution has been found. Initialisation consists of the creation of an initial population, a pool of breeders. In each generation each breeder is scored, and the best breeders are selected (possibly with some uncertainty) to breed and create solutions for the next generation. A solution created by breeding is a genetic mixture of its two parents, and may also have been subject to random mutations which allow new gene sequences to be created. Solutions are directed graphs; viable solutions (those which will be scored and allowed to breed in the next generation) are directed acyclic graphs. The scoring function used was an adaptation of one proposed by Cooper and Herskovits [9], in which the best fit to the experimental data is calculated using Bayesian techniques. High scoring structures have a greater chance of being selected as parents for the next generation. Due to the sheer volume of data involved in data mining [1], the time required to execute genetic algorithms and the intrinsically parallel nature of genetic algorithms [14], it was decided to implement a parallel version of the CAEGA algorithm (P-CAEGA). There are a number of approaches to parallelising an algorithm for execution on more than one processor. An architecture with common memory can be used, which allows efficient communication and synchronisation. Unfortunately these systems cannot be scaled to a large number of processors because of physical construction difficulties and contention for the memory bus [15]. An alternative is the distributed programming model, or message passing model. Two environments widely researched in this area are the Local Area Network (LAN) using Parallel Virtual Machine (PVM) and Transputer Networks using Inmos development languages [16]. In the network architecture the hardware scales easily to a relatively large number of processors, but this is eventually limited because of network contention. The Transputer hardware is a point-to-point architecture with dedicated high-speed communications, with no contention and no need for addressing. The system can be highly scaleable if the program is constructed accordingly. The parallel prototype implementation was carried out on T805 INMOS Transputers connected to a Sun Workstation, with development performed in parallel C. The sequential prototype of CAEGA had been coded in Pascal. This was converted to C. The C code was then decomposed into processes that needed to be run sequentially and those that could be executed in parallel. Communication channels were used to transfer data between processes. The first parallel version is a straightforward master-slave (processor farm) implementation. Breeding (reproduction, crossover and mutation) was carried out in parallel. In fact the scoring was also implemented in parallel. The selection had to be implemented sequentially and thus remained on the master
(the root processor, which is the controller and is connected to the host). This was necessary, as all of the structures from the new generation needed to be remixed to form new parents from the gene pool before distribution to the slaves for breeding. The remaining processors are utilised as slaves, which carry out the breeding in parallel and report the new structures and their scores to the master (root processor). As was anticipated from preliminary investigations, the scalability achievable is limited by the overhead of communications. For fewer than 8 processors a speed-up of roughly n-2 holds (excluding the master; n-1 slaves). It is believed that, with further work on the efficiency of the algorithm, this could be improved, but a linear increase (excluding the master) is not expected because of the sheer amount of communication involved. This implementation represents a straightforward first prototype. It is a direct parallelisation of CAEGA which did not change the underlying nature of the algorithm. This has resulted in global communications, which limits the scalability and the achievable speed-up. In a LAN implementation this would be even more restrictive due to communication costs. In general an effective concurrent design will require returning to first principles.
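To make the processor-farm structure concrete, the following sketch shows one way such a master-slave GA loop can be organised. It is an illustration only, not the P-CAEGA code: MPI stands in for the transputer channels and parallel C of the prototype, and select_parents, breed and score are hypothetical placeholders for the CAEGA-specific operators described above.

/* Minimal master-slave (processor farm) GA skeleton. Hypothetical
 * stand-in for P-CAEGA: MPI replaces the transputer channels, and
 * select_parents/breed/score are placeholders for the CAEGA operators. */
#include <mpi.h>

#define POP  256          /* population size (illustrative)        */
#define GENS 100          /* number of generations (illustrative)  */
#define LEN  64           /* genome length in bytes (illustrative) */

extern void   select_parents(unsigned char pop[][LEN], double *scores);
extern void   breed(unsigned char parents[][LEN], int n,
                    unsigned char children[][LEN]);    /* crossover + mutation */
extern double score(const unsigned char *individual);  /* Bayesian fit         */

int main(int argc, char **argv)
{
    int rank, size;
    static unsigned char pop[POP][LEN];
    static unsigned char local[POP][LEN];
    static double scores[POP], local_scores[POP];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int chunk = POP / size;                 /* individuals handled per processor */

    for (int g = 0; g < GENS; g++) {
        if (rank == 0)
            select_parents(pop, scores);    /* selection stays on the master */

        /* master distributes parents; slaves breed and score in parallel */
        MPI_Scatter(pop, chunk * LEN, MPI_UNSIGNED_CHAR,
                    local, chunk * LEN, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);
        breed(local, chunk, local);
        for (int i = 0; i < chunk; i++)
            local_scores[i] = score(local[i]);

        /* new structures and their scores are reported back to the master */
        MPI_Gather(local, chunk * LEN, MPI_UNSIGNED_CHAR,
                   pop, chunk * LEN, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);
        MPI_Gather(local_scores, chunk, MPI_DOUBLE,
                   scores, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}

As in the prototype described above, selection remains a sequential step on the master, while breeding and scoring are farmed out to the slaves, so the global communication of the whole population every generation is what limits scalability.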
2.3 Results Typical results of the application of the algorithms described are shown below. The data shown result from an overnight run of automated testing, but do not show dependencies on underlying faults. About 12,000 individual events were recorded, and the graph in Figure 2 shows a generated directed graph in which the width of the edge between two nodes (alarms) is proportional to the strength of connection of the two variables. It can be seen that PPI-AIS and LP-EXC have the strongest relationship, followed by the relationship between PPI-Unexp-Signal and LP-PLM. Note that the directions of the arrows are not important as causal indicators, but variables sharing the same parents do form a group. The graph in Figure 3 shows the edge strengths as strongest if the edge remains in models which become progressively less complex - where there is a penalty for complexity. It can be seen that the broad patterns are the same in the two graphs, but that the weaker edges differ between the two graphs. In the second graph the node NE-Unexpected-Card shows links to three other nodes, whereas it has no direct links in the first graph. The results from an industrial point of view are very encouraging. The case for using a genetic algorithm holds, and parallelising it speeds up the process. The algorithm is currently being used by NITEC to analyse the data produced by an SDH network when a fault occurs. From their point of view it has been a worthwhile effort for the speed-up. It has been established that genetic scoring and scaling could be implemented in parallel, but the communications cost in transmitting the results back to the master removed any benefit over just having the master perform these functions.
Fig. 2. Example results: BBN of TMN data
Fig. 3. Another set of results
2.4 Future Potential Research This study assessed GAs as a solution provider. As Goldberg states, "some of us leave the encounter with far more" [11]. Future development will take the basic solution further and remove the limitation on scalability. What is required is a concurrent algorithm as opposed to a modification of a sequential algorithm. The next planned development is a redesign of the algorithm into the "continental" algorithm (current development name). P-CAEGA as it stands can be classified as global parallelisation. Every individual has a chance to mate with all the rest (i.e. random breeding), thus the implementation did not affect the behaviour of the original algorithm (CAEGA). The "continental" version would be a more sophisticated parallel approach in which the population is divided into sub-populations, relatively isolated from each other. This model introduces a migration element that would be used to send some individuals from one sub-population to another. This adapted algorithm would yield local parallelism, and each processor could be thought of as a continent where the majority of the breeding occurs between residents, with limited migration. This modification to the behaviour of the original algorithm would vastly decrease the communications cost and present a more scalable implementation.
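A minimal sketch of the intended "continental" (island-model) structure follows. It is an illustration under the same assumptions as the previous sketch (MPI as a stand-in, hypothetical routine names), not a design taken from the project.

/* Sketch of the proposed "continental" (island-model) parallel GA.
 * Each processor evolves its own sub-population; occasionally a few
 * individuals migrate to a neighbouring continent. All names are
 * illustrative, not taken from the CAEGA implementation. */
#include <mpi.h>

#define SUBPOP 64
#define LEN    64
#define GENS   100
#define MIGRATE_EVERY 10   /* generations between migrations */
#define NMIG    4          /* individuals exchanged per migration */

extern void evolve_one_generation(unsigned char pop[][LEN], double *scores);

int main(int argc, char **argv)
{
    int rank, size;
    static unsigned char pop[SUBPOP][LEN];
    static double scores[SUBPOP];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int g = 0; g < GENS; g++) {
        /* selection, breeding and scoring are entirely local to the continent */
        evolve_one_generation(pop, scores);

        if (g % MIGRATE_EVERY == 0 && size > 1) {
            /* send a few individuals to the next continent in a ring,
               receive the same number from the previous one */
            int next = (rank + 1) % size;
            int prev = (rank + size - 1) % size;
            MPI_Sendrecv(pop[0], NMIG * LEN, MPI_UNSIGNED_CHAR, next, 0,
                         pop[SUBPOP - NMIG], NMIG * LEN, MPI_UNSIGNED_CHAR,
                         prev, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }
    MPI_Finalize();
    return 0;
}

Because communication occurs only every few generations and involves only a handful of individuals, the global remixing step that limits P-CAEGA disappears, at the cost of changing the breeding behaviour of the original algorithm.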
3 Conclusion
The association between the University of Ulster and Nortel in the area of fault analysis has continued with the Garnet project. Garnet is using the techniques developed in NETEXTRACT to develop useful tools in the area of live testing. The basic idea is to regard the Bayesian nets as abstract views of the test network's response to stimuli. In a further development, the Jigsaw project entails
the construction of a data warehouse for the storage of the test data, with a direct link to data mining algorithms. Although this paper has described the work completed with reference to Nortel's telecommunications network, the architecture is generic in that it can extract cause-and-effect nets from large, noisy databases, the data arising in many areas of science and technology, industry and business, and in social and medical domains. The corresponding hypothesis to the aim of this project can be proposed: that cause-and-effect graphs can be derived to simulate domain experts' knowledge and even extend it. The authors would like to thank Nortel Networks and the EPSRC for their support during the project.
References
1. Agrawal, R., Imielinski, T., Swami, A., "Database Mining: a Performance Perspective", IEEE Trans. KDE, 5(6), Dec., pp. 914-925, 1993.
2. Guan, J.W., Bell, D.A., Evidence Theory and Its Applications, Vol. 1&2, Studies in Computer Science and AI 7,8, Elsevier, 1992.
3. ITU-T Rec. G.782, Types and General Characteristics of SDH Multiplexing Equipment, 1990.
4. ITU-T Rec. G.784, SDH Management, 1990.
5. Harrison, K., "A Novel Approach to Event Correlation", HP Labs., Bristol, HP-9468, July 1994, pp. 1-10.
6. McClean, S.I. and Shapcott, C.M., 1997, "Developing BBNs for Telecommunications Alarms", Proc. Conf. Causal Models & Statistical Learning, pp. 123-128, UNICOM, London.
7. MS Decision Theory and Adaptive Systems Group, 1996, "Proposal for a BBN Interchange Format", MS Document.
8. Heckerman, D., 1996, "BBNs for Knowledge Discovery", in Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P. and Uthurusamy, R. (Eds.), Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press, pp. 273-305.
9. Cooper, G.F. and Herskovits, E., 1992, "A Bayesian Method for the Induction of Probabilistic Networks from Data", ML, 9, pp. 309-347.
10. Chickering, D.M. and Heckerman, D., 1994, "Learning Bayesian networks is NP-hard", MSR-TR-94-17, Microsoft Research, Microsoft Corporation, 1994.
11. Goldberg, D.E., "The Existential Pleasures of Genetic Algorithms", in Winter, G., et al. (Eds.), Genetic Algorithms in Engineering and Computer Science, (England: John Wiley and Sons Ltd., 1995), pp. 23-31.
12. Holland, J.H., Adaptation in Natural and Artificial Systems, (Ann Arbor: The University of Michigan Press, 1975).
13. Larranaga, P., et al., Genetic Algorithms Applied to Bayesian Networks, Proc. of the Applied Decision Technology Conf., 1995, pp. 283-302.
14. Bertoni, A., Dorigo, M., Implicit Parallelism in Genetic Algorithms, AI (61) 2, 1993, pp. 307-314.
15. Ben-Ari, M., Principles of Concurrent and Distributed Programming, (Hertfordshire: Prentice Hall International (UK) Ltd., 1990).
16. Almeida, F., Garcia, F., Roda, J., Morales, D., Rodriguez, C., A Comparative Study of two Distributed Systems: PVM and Transputers, in Cook, B., et al. (Eds.), Transputer Applications and Systems, (IOS Press, 1995), pp. 244-258.
IRREGULAR'00 SEVENTH INTERNATIONAL WORKSHOP ON SOLVING IRREGULARLY STRUCTURED PROBLEMS IN PARALLEL
General Chair: Sartaj Sahni, University of Florida
Program Co-chairs:
Timothy Davis, University of Florida
Sanguthevar Rajasekeran, University of Florida
Sanjay Ranka, University of Florida
Steering Committee:
Afonso Ferreira, CNRS-I3S-INRIA Sophia Antipolis
Jose Rolim, University of Geneva
Program Committee:
Cleve Ashcraft, The Boeing Company
Iain Duff, CLRC Rutherford Appleton Laboratory
Hossam ElGindy, University of Newcastle
Apostolos Gerasoulis, Rutgers University, New Brunswick
John Gilbert, Xerox Palo Alto Research Center
Bruce Hendrickson, Sandia Labs
Vipin Kumar, University of Minnesota
Esmond Ng, Lawrence Berkeley National Laboratory
C. Pandurangan, Indian Institute of Technology, Madras
Alex Pothen, Old Dominion University
Padma Raghavan, Univ. of Tennessee
Rajeev Raman, King's College, London
John Reif, Duke University
Joel Saltz, University of Maryland
Horst Simon, Lawrence Berkeley National Laboratory
Jaswinder Pal Singh, Princeton University
Ramesh Sitaraman, University of Massachusetts
R. Vaidyanathan, Louisiana State University
Kathy Yelick, University of California, Berkeley
Invited Speakers:
William Hager, University of Florida
Vipin Kumar, University of Minnesota
Panos Pardalos, University of Florida
FOREWORD
The Seventh International Workshop on Solving Irregularly Structured Problems in Parallel (Irregular '00) is an annual workshop addressing issues related to deriving efficient parallel solutions for unstructured problems, with particular emphasis on the inter-cooperation between theoreticians and practitioners of the field. Irregular '00 is the seventh in the series, after Geneva, Lyon, Santa Barbara, Paderborn, Berkeley, and Puerto Rico. Twelve of the submitted papers have been selected for presentation by the Program Committee on the basis of referee reports. The final scientific program of Irregular '00 consists of four sessions and three invited talks. We wish to thank all of the authors who responded to the call for papers and our invited speakers. We thank the members of the Program Committee and the anonymous reviewers for reviewing and selecting papers. We would also like to thank the IPDPS'00 conference co-chair Jose Rolim and the members of the steering committee for helping us organize this workshop. January 2000
Timothy Davis Sanguthevar Rajasekeran Sanjay Ranka Sartaj Sahni
Irregular'00 Final Program 8:30am-9:15am
Invited Talk by William Hager (University of Florida): Load Balancing and Continuous Quadratic Programming. 9:15am-10:25
Finite element methods - applications and algorithms
Parallel Management of Large Dynamic Shared Memory Space: A Hierarchical FEM Application, Xavier Cavin, Institut National Polytechnique de Lorraine, and Laurent Alonso, INRIA Lorraine. Efficient Parallelization of Unstructured Reductions on Shared Memory Parallel Architectures, Siegfried Benkner, University of Vienna, and Thomas Brandes, German National Research Center for Information Technology (GMD). Parallel FEM Simulation of Crack Propagation - Challenges, Status, and Perspectives, Bruce Carter, Chuin-Shan Chen, Gerd Heber, Antony R. Ingraffea, Roland Krause, Chris Myers, Paul A. Wawrzynek, L. Paul Chew, Keshav Pingali, Paul Stodghill, and Stephen Vavasis, Cornell University; Nikos Chrisochoides and Demian Nave, University of Notre Dame; and Guang R. Gao, University of Delaware. 10:25 -10:55
Coffee break
10:55 -12:05pm
Architecture and system software support
Support for Irregular Computations in Massively Parallel PIM Arrays, Using an Object-Based Execution Model, Hans P. Zima, University of Vienna and Thomas L. Sterling, California Institute of Technology. Executing Communication-Intensive Irregular Programs Efficiently, Vara Ramakrishnan and Isaac D. Scherson, University of California, Irvine. Non-Memory-based and real-time zerotree building for wavelet zerotree coding systems, Dongming Peng and Mi Lu, Texas A & M University. 12:05 -1:20
Lunch break
1:20pm-2:05pm
Invited Talk by Vipin Kumar (University of Minnesota): Graph Partitioning for Dynamic, Adaptive and Multi-phase Computations (joint work with Kirk Schloegel and George Karypis). 2:05pm-3:15
Graph partitioning - algorithms and applications
A Multilevel Algorithm for Spectral Partitioning with Extended Eigen-Models, Suely Oliveira and Takako Soma, University of Iowa. An Integrated Decomposition and Partitioning Approach, Jarmo Rantakokko, University of California, San Diego. Ordering Unstructured Meshes for Sparse Matrix Computations on Leading Parallel Systems, Leonid Oliker and Xiaoye Li, Lawrence Berkeley Nat. Lab.; Gerd Heber, Cornell University; and Rupak Biswas, MRJ/NASA Ames Res. Center. 3:15-3:45
Coffee break
3:45-4:30
Invited Talk by Panos Pardalos (University of Florida): A GRASP for computing approximate solutions for the Three-Index Assignment Problem 4:30-5:40pm
Graph algorithms and sparse matrix methods
On Identifying Strongly Connected Components in Parallel, Ali Pinar, University of Illinois; Lisa Fleischer, Columbia University; and Bruce Hendrickson, Sandia National Laboratories. A Parallel, Adaptive Refinement Scheme for Tetrahedral and Triangular Grids, Alan Stagg, Los Alamos National Laboratory; Jackie Hallberg, US Army Eng. Res. and Dev. Center, Coastal and Hydraulics Lab.; and Joseph Schmidt, Reston, Virginia. PaStiX: A Parallel Sparse Direct Solver Based on a Static Scheduling for Mixed 1D/2D Block Distributions, Pascal Henon, Pierre Ramet, and Jean Roman, Universite Bordeaux I.
Load Balancing and Continuous Quadratic Programming William W. Hager Department of Mathematics, University of Florida, 358 Little Hall, Gainesville, FL 32611-8105.
[email protected] Abstract. A quadratic programming approach is described for solving the graph partitioning problem, in which the optimal solution to a continuous problem is the exact solution to the discrete graph partitioning problem. We discuss techniques for approximating the solution to the quadratic program using gradient projections, preconditioned conjugate gradients, and a block exchange method.
1 Extended Abstract
In a parallel computing environment, tasks can be modelled as nodes of a graph and the communication links between tasks as edges of the graph. The problem of assigning the tasks to different processors while minimizing the communication between processors is equivalent to partitioning the nodes of the graph into sets chosen so that the number of edges connecting nodes in different sets is as small as possible. The graph partitioning problem was first exposed by Kernighan and Lin in their seminal paper in 1970. Since Sahni and Gonzalez showed in 1976 that graph partitioning is an NP-hard problem, exact solutions can be computed only when the number of nodes is small. To solve large problems, approximation techniques have been developed that include exchange techniques, spectral methods, geometric methods, and multilevel methods. Our approach to the graph partitioning problem is based on a quadratic programming formulation. For the problem of partitioning an n-node graph into k sets, we exhibit a quadratic programming problem (quadratic cost function and linear equality and inequality constraints) in nk variables x_ij, 1 <= i <= n, 1 <= j <= k, where x_ij is a continuous variable taking values on the interval [0, 1]. This quadratic program is equivalent to the discrete graph partitioning problem in the sense that there exists an optimal 0/1 solution to the quadratic program, and the assignment of node i to set j if x_ij = 1 is optimal in the graph partitioning problem. Based on this equivalence between graph partitioning and a quadratic problem, we have applied tools from optimization theory to solve the graph partitioning problem. In this talk, we discuss techniques for approximating the solution to the quadratic program using gradient projections, preconditioned conjugate gradients, and a block exchange method. The advantages and disadvantages of the optimization approach compared to other approaches to graph partitioning are also examined. For papers related to this talk, see http://www.math.ufl.edu/~hager.
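For readers unfamiliar with such formulations, a generic quadratic program for k-way partitioning of an n-node graph with Laplacian L can be written as follows. This is an illustrative form only: the abstract does not give the author's exact cost function, which must in general be modified (it is typically nonconvex) so that an optimal 0/1 solution exists.

\begin{align*}
\min_{x \in \mathbb{R}^{n \times k}} \quad & \tfrac{1}{2} \sum_{j=1}^{k} x_j^{\top} L\, x_j \\
\text{subject to} \quad & \sum_{j=1}^{k} x_{ij} = 1, \qquad 1 \le i \le n, \\
& \ell_j \le \sum_{i=1}^{n} x_{ij} \le u_j, \qquad 1 \le j \le k, \\
& 0 \le x_{ij} \le 1,
\end{align*}

where x_j denotes the j-th column of x, the bounds \ell_j and u_j control the size of set j, and x_{ij} = 1 is read as assigning node i to set j.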
Parallel Management of Large Dynamic Shared Memory Space: A Hierarchical FEM Application
Xavier Cavin* and Laurent Alonso**, ISA research team, LORIA***, Campus Scientifique, BP 239, F-54506 Vandoeuvre-les-Nancy cedex, France
[email protected]
Abstract. We show in this paper the memory management issues raised by a parallel, irregular and dynamic hierarchical application, which constantly allocates and deallocates data over an extremely large virtual address space. First, we show that while data locality in the memory caches is necessary, a lack of virtual page locality may greatly affect the obtained performance. Second, fragmentation and contention problems associated with the required parallel dynamic memory allocation are presented. We propose practical solutions and discuss experimental results obtained on a cache-coherent non-uniform memory access (ccNUMA) distributed shared memory SGI Origin2000 machine.
1 Introduction
The radiosity equation [Kaj86] is widely used in many physical domains and in computer graphics, for its ability to model the global illumination in a given scene. It takes the form of a Fredholm integral equation of the second kind, which can be expressed as:

    f(y) = f_e(y) + \int_{\Omega} k(x, y)\, f(x)\, dx ,    (1)
where f is the radiosity function to be determined. Generally, this unknown function is defined over an irregular, non-uniform physical domain Ω, mainly described in terms of polygonal surfaces (some of them having non-zero initial radiosity f_e). Finding an analytical solution to (1) is not possible in general. Numerical approximations must be employed, generally leading to very expensive algorithms. The fundamental reason for their high cost is that each surface of the input scene may potentially influence all other surfaces via reflection. A common resolution technique is the weighted-residual method, often referred to as the "finite element method" (FEM). Early radiosity algorithms can be analyzed as FEM using piecewise constant basis functions. Later, hierarchical algorithms, inspired by adaptive N-body methods, have been introduced
* Institut National Polytechnique de Lorraine.
** INRIA Lorraine.
*** UMR 7503, a joint research laboratory between CNRS, Institut National Polytechnique de Lorraine, INRIA, Universite Henri Poincare and Universite Nancy 2.
by [HSA91] to increase the efficiency of the computations. They are based on a multi-level representation of the radiosity function, which is dynamically created as the computation proceeds, subdividing surfaces where necessary to increase the accuracy of the solution. Since energy exchanges can occur between any levels of the hierarchies, sub-computation times are highly variable and change at every step of the resolution. This dynamic nature, both in memory and computing resources, combined with the non-uniformity of the physical domain being simulated, makes the parallelization of hierarchical radiosity algorithms a challenge, since straightforward parallelizations generally fail to simultaneously provide the load balancing and data locality necessary for efficient parallel execution, even on modern distributed shared memory multiprocessor machines [SHG95]. In a recent paper [Cav99], we have proposed appropriate partitioning and scheduling techniques for a parallel hierarchical wavelet radiosity algorithm, which deliver an optimal load balancing, by minimizing idle time wasted on locks and synchronization barriers, while still exhibiting an excellent data locality. However, our experiments seemed to show that this was still not sufficient to perform extremely large computations with optimal parallel performance. Indeed, dealing in parallel with a dynamically growing, huge amount of memory (for a whole building simulation, it is not rare that more than 20 Gbytes may be required) is not free of problems if it is to be done in an efficient way. This is even more complicated since most of this memory management is generally hidden from the programmer. While this can be a great convenience when all works well, it quickly becomes harmful when problems start to occur. We show in this paper the two main causes of the performance degradation, and experiment with practical solutions to overcome them on a 64-processor SGI Origin2000 ccNUMA machine. The first problem concerns the irregular memory access patterns of our application, which have to be handled within an extremely large virtual address space. The issues are discussed in Sect. 2, and we show how some SGI IRIX operating system facilities can help enhance virtual page locality and consequently reduce computation times. Parallel dynamic memory allocation is the second problem and comes in two different
flavors: fragmentation and contention. We experiment in Sect. 3 with available IRIX and public domain solutions, and propose enhancements leading to an efficient parallel memory allocator. Finally, Sect. 4 concludes and presents future work.
2 Efficient Irregular Memory Accesses within a Large Virtual Address Space
2.1 Understanding ccNUMA Architecture
In order to fully benefit from a computer system's performance, it is really important to understand the underlying architecture. The SGI Origin2000 is a scalable multiprocessor with distributed shared memory, based on the Scalable Node 0 (SN0) architecture [Cor98]. The basic building block is the node board, composed of two MIPS R10000 processors, each with separate 32 Kbytes first-level
(L1) instruction and data caches on the chip, with 32-byte cache lines, and a unified (instruction and data), commonly 4 Mbytes, two-way set associative second-level (L2) off-chip cache, with 128-byte cache lines. Large SN0 systems are built by connecting the nodes together via a scalable interconnection network. The SN0 architecture allows the memory to be physically distributed (from 64 Mbytes to 4 Gbytes per node), while making all memory equally accessible from a software point of view, in a ccNUMA approach [LL97]. A given processor only operates on data that are resident in its cache: as long as the requested piece of memory is present in the cache, access times are very short; on the contrary, a delay occurs while a copy of the data is fetched from memory (local or remote) into the cache. The trivial conclusion is that a program shall use these caches effectively in order to get optimal performance. Obviously, while the shared memory is seen as a contiguous range of virtual memory addresses, the physical memory is actually divided into pages, which are distributed all over the system. For every memory access, the given virtual address must be translated into the physical address required by the hardware. A hardware cache mechanism, the translation lookaside buffer (TLB), keeps the 64 x 2 most recently used page addresses, allowing an instant virtual-to-physical translation for these pages. This allows a 2 Mbytes memory space (for the default 16 Kbytes page size) to be addressed without translation penalty. Programs using larger virtual memory (the common case) may refer to a virtual address that is not cached in the TLB. In this case (TLB miss), the translation is done by the operating system, in kernel mode, thus adding a non-negligible overhead to the memory access, even if the access is satisfied in a memory cache (see [Cor98] for the precise read latency times).
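The TLB reach quoted above can be checked by a short calculation (this assumes the 64-entries-by-2-pages reading of the garbled figure, which is consistent with the 2 Mbytes stated in the text):

$$ 64 \text{ entries} \times 2 \text{ pages per entry} \times 16\,\mathrm{KB} = 2\,\mathrm{MB}. $$

With the 1 Mbyte pages used later in Table 1, the same number of TLB entries would cover 128 Mbytes without translation penalty.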
2.2 Enhancing Virtual Pages Locality
Having an optimal data cache locality appears to be a necessary, but not sufficient, condition for optimal performance, since an L1 or L2 cache hit may be greatly delayed by a TLB miss. It is really important, however, to understand that data cache locality does not necessarily imply virtual page (i.e. TLB) locality. Indeed, the data are stored in memory caches with a small size granularity (128 bytes for the 4 Mbytes of L2 cache), and may thus come from a large number (much greater than the 64 TLB entries) of different pages. Then, the application may exhibit an optimal data locality through these data and a poor TLB locality at the same time. Unfortunately, this is the case for our hierarchical application, which by nature exhibits high data locality, but suffers from highly irregular data access patterns. Whatever the input data, our application achieves very high cache hit rates of more than 95 % for both L1 and L2 caches, for any number of processors used [CAP98]. At the same time, it is clear that the irregular memory accesses towards a large number of virtual pages are responsible for many TLB misses, especially as the size of the input data increases (monitoring of L1, L2 and TLB misses is done with the IRIX perfex command). This is confirmed by the analysis of the sequential run of the application with the default 16 Kbytes page
size, where about 44 % of the execution time is lost due to TLB misses. Table 1 shows that when the number of processors used increases, the total number of TLB misses quickly falls by 33 % with 16 processors, and then decreases more slowly. This seems to be due to the fact that using N processors allows 64 x N TLB entries to be used at the same time. This suggests that the next generations of MIPS processors should contain more TLB entries. Increasing the number of TLB entries appears to enhance TLB locality. Nevertheless, for a fixed number of available TLB entries, an alternative solution is to increase the total memory space they can address, by simply telling the operating system to increase the size of a memory page, thanks to the IRIX dplace command. As shown by Table 1, doing this reduces the execution times, at least at the small 16-processor scale. Indeed, using a larger number of processors multiplies the number of available TLB entries, thus reducing both the TLB miss problem and the benefits of this solution.
Table 1. Impact of page sizes on TLB misses and execution times

Processors              Page size      1      4      8     16     24     32     40
Execution time (s)      16k        13835   3741   1866   1054    761    653    610
                        1m         10813      -   1538    891    701    612    590
TLB misses (10^6)       16k        17675  14019  12487  11676  10176   9888   9497
                        1m          5479      -   3939   3769   3813   3842   3688
3 Efficient Parallel Dynamic Memory Allocation
Since the memory required by our application has to be dynamically allocated throughout the execution, an efficient dynamic memory allocator has to be employed, both to reduce fragmentation problems, which may cause severe memory waste, and to allow contention-free parallel manipulations. As quoted in [WJNB95], "memory allocation is widely considered to be either a solved problem, or an insoluble one". This appears to be true for the common programmer using the default memory allocation package available on his machine. Once again, most of the time this solution is not a bad one, but when the memory allocator behaves badly, one discovers one is in front of a mysterious "black box". The role of a memory allocator is to keep track of which parts of the memory are in use and which parts are free, and to provide the processes an efficient service, minimizing wasted space without undue time cost, or vice versa. Space considerations appear to be of primary interest. Indeed, worst-case space behavior may lead to complete failure due to memory exhaustion or virtual memory thrashing. Obviously, time performance is also important, especially in parallel, but this is a question of implementation rather than of algorithm, even if the considerations can be more complicated.
We believe that it is essential to rely on existing memory allocators, rather than to develop ad hoc storage management techniques, for obvious reasons of software clarity, flexibility, maintainability, portability and reliability. We focus here on the following three available ones:
1. the IRIX C standard memory allocator, which is a complete black box;
2. the alternative, tunable IRIX "fast main memory allocator", available when linking with the -lmalloc library;
3. the LINUX/GNU libc parallel memory allocator [3], which is an extension of Doug Lea's famous malloc [4], and is also parameterizable.
3.1 The Fragmentation Problem
Fragmentation is the inability to reuse memory that is free. An application may free blocks in a particular order that creates holes between "live" objects. If these holes are too numerous and small, they cannot be used to satisfy further requests for larger blocks. Note here that the notion of fragmented memory at a given moment is completely relative to further requests. Fragmentation is the central problem of memory allocation and has been widely studied in the field, since the early days of computer science. Wilson et al. present in [WJNB95] a wide coverage of the literature (over 150 references) and of the available strategies and policies. Unfortunately, a unique, generally optimal solution has not emerged, because it simply does not exist. Our goal here is not to propose a new memory allocation scheme, but rather to report the behavior of the chosen allocators in terms of fragmentation. Algorithmic considerations have been put aside, since it is difficult to know the principles a given allocator is implemented on. We have chosen to use the experimental methodology proposed in [WJNB95] to illustrate the complex memory usage patterns of our radiosity application: all allocation and deallocation requests done by the program are written to a file during its execution. The obtained trace only reflects the program behavior, and is independent of the allocator. Figure 1 shows the profile of memory use for a complete run of our radiosity application. Although it has been done on a small input test scene, it is representative of what happens during the computations: many temporary, short-lived objects are continuously allocated and deallocated to progressively build the long-lived objects of the solution (here for instance, 120 Mbytes of data are allocated for only 10 Mbytes of "useful" data). Then, the trace is read by a simulator, which has first been linked with the allocator to be evaluated: this allows us to precisely monitor the way it behaves, including the potential fragmentation of the memory. Unfortunately, on such small input test scenes, none of the three allocators suffers from fragmentation (86 % of the allocated memory is actually used). The fragmentation problem only occurs with large input test scenes, the computation of which cannot (yet) be traced with the tools we use [5]. We can just report what we have observed
3. See http://www.dent.med.uni-muenchen.de/~wmglo/malloc-slides.html
4. See http://gee.cs.oswego.edu/dl/html/malloc.html
5. At ftp://ftp.dent.med.uni-muenchen.de/pub/wmglo/mtrace.tar.gz
Fig. 1. Profile of memory use in a short radiosity computation (534 initial surfaces, 20 iterations, 5478 final meshes): these curves plot the overall amount of live data for a run of the program, the time scale being the allocation time expressed in bytes. (a) Complete execution; (b) zoom in. [Both panels plot Megabytes in Use against Allocation Time in Megabytes.]
during our numerous experiments: the standard IRIX C allocator appears to be the better one, while the alternative IRIX allocator leads to catastrophic failure with large input data in its default behavior (our experiments with the tuning parameters have not been successful either); the LINUX/GNU libc allocator is close to the IRIX C allocator, although a little bit more space consuming.
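As an illustration of the trace-based methodology described in this section, a pair of logging wrappers of the following kind suffices to record every allocation and deallocation for later replay by a simulator. This is a hypothetical sketch, not the tracing tool referenced in the footnotes above.

/* Hypothetical wrappers that log every allocation/deallocation to a
 * trace file, in the spirit of the methodology of [WJNB95]. A separate
 * simulator can later replay the trace against a candidate allocator. */
#include <stdio.h>
#include <stdlib.h>

static FILE *trace;                      /* opened once at program start */

void trace_open(const char *path) { trace = fopen(path, "w"); }

void *traced_malloc(size_t n)
{
    void *p = malloc(n);
    if (trace) fprintf(trace, "a %p %zu\n", p, n);   /* allocation record */
    return p;
}

void traced_free(void *p)
{
    if (trace) fprintf(trace, "f %p\n", p);          /* deallocation record */
    free(p);
}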
3.2 Allocating Memory in Parallel
Few solutions have been proposed for parallel allocators (some are cited at the beginning of Sect. 4 in [WJNB95]). However, none of them appears to be implemented inside the two IRIX allocators, since the only available solution is to serialize memory requests with a single lock mechanism. This is obviously not the right way for our kind of application, which constantly allocates and deallocates memory in parallel. Generally, with this strategy, only one or two processes are running at a given moment, while all the remaining ones are idle. We first considered implementing our own parallel allocator, based on the arena facilities provided by IRIX. Basically, each process is assigned its own arena of memory, and is the only one responsible for memory operations in it: we observed that we could achieve optimal time performance, without any contention. Unfortunately, the allocation inside a given arena is based on the alternative IRIX allocator, which gives, as previously said, very poor results in terms of fragmentation. We then considered the LINUX/GNU libc allocator, which exhibits better space performance and provides parallelism facilities based on ideas similar to ours. Unfortunately, a few implementation details greatly limit its scalability to four or eight threads, which is obviously insufficient for us. We thus fixed these minor problems to make the LINUX/GNU libc allocator look more like our parallel version, and finally obtained a parallel allocator which proves to be (for the moment) rather efficient in both space and time performance.
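The arena idea underlying both our prototype and the modified LINUX/GNU libc allocator can be sketched as follows. This is a deliberately simplified illustration (a bump-pointer arena per thread with no freeing, reuse or growth), not the code of either allocator.

/* Simplified per-thread arena allocation: each thread obtains memory
 * only from its own arena, so no lock is needed on the fast path.
 * A real allocator also handles freeing, reuse and arena growth. */
#include <stdlib.h>

#define ARENA_SIZE (64u * 1024u * 1024u)   /* 64 MB per thread, illustrative */
#define MAX_THREADS 64

typedef struct {
    char  *base;       /* start of this thread's arena */
    size_t used;       /* bytes handed out so far       */
} arena_t;

static arena_t arenas[MAX_THREADS];        /* one arena per thread */

void arena_init(int tid)
{
    arenas[tid].base = malloc(ARENA_SIZE); /* backing store from the system */
    arenas[tid].used = 0;
}

void *arena_alloc(int tid, size_t n)
{
    arena_t *a = &arenas[tid];
    n = (n + 15u) & ~(size_t)15u;          /* keep 16-byte alignment */
    if (a->used + n > ARENA_SIZE)
        return NULL;                       /* no arena growth in this sketch */
    void *p = a->base + a->used;
    a->used += n;
    return p;                              /* no lock: tid owns this arena */
}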
4 Conclusion
We show in this paper the problems raised by the memory management of a large, dynamically evolving, shared memory space within an irregular and non-uniform parallel hierarchical FEM algorithm. These problems are not widely covered in the literature, and there are few available solutions. We first propose practical techniques to enhance memory access performance. We then study the characteristics of our application in terms of memory usage, and show that it is highly susceptible to fragmentation problems. Available allocators are considered and experimented with to find which one gives the best answer to the request patterns. Finally, we design a parallel allocator, based on the LINUX/GNU libc one, which appears to give good performance in terms of time and space (feel free to contact us if you want to evaluate this clone). However, deeper insights, inside both our application and the available memory allocators, will be needed to better understand the way they interact, and we believe this is still an open and beautiful problem. Acknowledgments. We would like to thank the Centre Charles Hermite for providing access to its computing resources. Special thanks to Alain Filbois for the hours spent on the Origin.
References
[CAP98] Xavier Cavin, Laurent Alonso, and Jean-Claude Paul. Experimentation of Data Locality Performance for a Parallel Hierarchical Algorithm on the Origin2000. In Fourth European CRAY-SGI MPP Workshop, Garching/Munich, Germany, September 1998.
[Cav99] Xavier Cavin. Load Balancing Analysis of a Parallel Hierarchical Algorithm on the Origin2000. In Fifth European SGI/Cray MPP Workshop, Bologna, Italy, September 1999.
[Cor98] David Cortesi. Origin2000 (TM) and Onyx2 (TM) Performance Tuning and Optimization Guide. Tech Pubs Library guide Number 007-3430-002, Silicon Graphics, Inc., 1998.
[HSA91] Pat Hanrahan, David Salzman, and Larry Aupperle. A Rapid Hierarchical Radiosity Algorithm. In Computer Graphics (ACM SIGGRAPH '91 Proceedings), volume 25, pages 197-206, July 1991.
[Kaj86] James T. Kajiya. The Rendering Equation. In Computer Graphics (ACM SIGGRAPH '86 Proceedings), volume 20, pages 143-150, August 1986.
[LL97] James Laudon and Daniel Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 241-251, Denver, June 1997. ACM Press.
[SHG95] Jaswinder Pal Singh, John L. Hennessy, and Anoop Gupta. Implications of Hierarchical N-body Methods for Multiprocessor Architectures. ACM Transactions on Computer Systems, 13(2):141-202, May 1995.
[WJNB95] Paul R. Wilson, Mark S. Johnstone, Michael Neely, and David Boles. Dynamic storage allocation: A survey and critical review. In Proceedings of International Workshop on Memory Management, volume 986 of Lecture Notes in Computer Science. Springer-Verlag, September 1995.
Efficient Parallelization of Unstructured Reductions on Shared Memory Parallel Architectures*
Siegfried Benkner1 and Thomas Brandes2
1 Institute for Software Technology and Parallel Systems, University of Vienna, Liechtensteinstr. 22, A-1090 Vienna, Austria
[email protected]
2 Institute for Algorithms and Scientific Computing (SCAI), German National Research Center for Information Technology (GMD), Schloss Birlinghoven, D-53754 St. Augustin, Germany
[email protected]
Abstract. This paper presents a new parallelization method for an efficient implementation of unstructured array reductions on shared memory parallel machines with OpenMP. This method is strongly related to parallelization techniques for irregular reductions on distributed memory machines as employed in the context of High Performance Fortran. By exploiting data locality, synchronization is minimized without introducing severe memory or computational overheads as observed with most existing shared memory parallelization techniques.
1 Introduction
Reduction operations on unstructured meshes or sparse matrices account for a large fraction of the total computational costs in many advanced scientific and engineering applications. Typical examples of such applications include crash simulations, fluid-dynamics codes, weather forecasting models, electromagnetic problem modeling, and many others. In order to fully exploit the potential of parallel computers with such applications using high-level parallel languages like OpenMP [11] or HPF [8], it is of paramount importance to apply efficient parallelization strategies to reduction operations performed on irregular data structures. Unstructured reductions are usually implemented by means of loops containing indirect (vector-subscripted) array accesses. When parallelizing such loops it is crucial to avoid high synchronization or serious communication overheads, respectively. Several different parallelization techniques targeted either at distributed or at shared memory parallel architectures have been described in the literature [10,2,7,1] and have been integrated in parallelizing compilers. In this paper we present a new parallelization method for unstructured array reductions on shared memory parallel computers and compare it to existing parallelization strategies. Without loss of generality we discuss these techniques in the context of finite-element methods (FEM). Section 2 describes existing shared memory parallelization techniques for unstructured reductions and their support in OpenMP. Section 3 briefly discusses parallelization for distributed
* The work described in this paper was supported by NEC Europe Ltd. as part of the ADVICE project in cooperation with the NEC C&C Research Laboratories and by the Special Research Program SFB F011 AURORA of the Austrian Science Fund.
memory machines and high-level language support for unstructured reductions as provided in HPF. Section 4 presents an optimized parallelization method for unstructured array reductions on shared memory machines. This method is strongly related to an efficient handling of irregular reductions on distributed memory machines [1]. By exploiting data locality, synchronization is minimized without introducing severe memory or computational overheads as observed with most existing shared memory parallelization techniques. We have implemented this parallelization method in a compilation system [4] that translates high-level data parallel programs into shared memory parallel programs utilizing OpenMP to realize thread parallelism and synchronization. Performance results presented in Section 5 for a typical FEM application kernel verify the effectiveness of our approach and its superiority compared to existing techniques. Figure 1 shows an example of an unstructured reduction operation on a finite-element mesh. The mesh consists of NELEMS elements and NNODES nodes, whereby each element comprises four nodes. The arrays NODE and ELEM store certain physical quantities (to simplify the presentation, only one physical quantity is stored per node and element), e.g. positions for each node or forces for each element, respectively. The integer array IX captures the connectivity of the mesh, i.e. the array section IX(1:4,I) contains the four node numbers of element I.

      integer, parameter              :: NNODES=..., NELEMS=...
      real,    dimension (NNODES)     :: NODE
      real,    dimension (NELEMS)     :: ELEM
      integer, dimension (4, NELEMS)  :: IX

      ! unstructured reduction loop
      do I = 1, NELEMS
        VAL = Work(ELEM(I))
        do K = 1, 4
          NODE(IX(K,I)) = NODE(IX(K,I)) + VAL
        end do
      end do

      real, dimension (NNODES) :: NTMP
      ...
!$omp parallel, private (VAL, K, NTMP)
      NTMP = 0.0
!$omp do
      do I = 1, NELEMS
        VAL = Work(ELEM(I))
        do K = 1, 4
          NTMP(IX(K,I)) = NTMP(IX(K,I)) + VAL
        end do
      end do
!$omp critical
      NODE = NODE + NTMP
!$omp end critical
!$omp end parallel

Fig. 1. Unstructured reduction on a finite-element mesh: sequential version (top) and OpenMP version (bottom).
In each iteration of the unstructured reduction loop shown in Figure 1 a value VAL (e.g. an elemental force) is computed for an element of the mesh by means of a function (Work) and added to all nodes comprising this element. Since the addition is assumed to be an associative and commutative operation, the order in which the loop iterations are executed does not change the final result (except for possible round-off errors). However, when parallelizing the loop it has to be taken into account that different loop iterations may update the same node. This is caused by the fact that neighboring elements of the mesh may share one or more nodes, and thus IX(:,I) may have the same values for different iterations. If the reduction loop is parallelized on a shared memory machine by partitioning the loop iterations among concurrent threads, synchronization will be required to ensure that two distinct threads do not update one node at the same time. On
distributed memory machines, where the mesh is to be partitioned with respect to the local memories of the processors, communication will be required for nodes at processor boundaries.
2 Unstructured Reductions on Shared Memory Machines
In this section we briefly discuss existing techniques for parallelizing unstructured reductions on shared memory machines by means of thread parallelism, as for example offered by OpenMP.
Array Privatization. The central idea of the array privatization technique is that every processor gets its own private copy of the reduction array and performs its part of the loop iterations independently of other processors. Subsequently, the private results of all processors are combined to yield the final result. In order to ensure correct results, this ultimate step requires synchronization. Since the reduction clause of OpenMP [11] supports only reductions on scalar variables, parallelization of array reductions based on array expansion has to be programmed explicitly, as shown in the OpenMP version of Figure 1. The original loop is enclosed in a parallel region and the temporary array NTMP is declared as private to enforce that each thread gets its own copy. The I-loop is parallelized by relying on the default work sharing mechanism of OpenMP. As a consequence, a chunk of loop iterations is assigned to each thread, and all threads execute their chunk of iterations independently of each other. After parallel execution of the loop, the array assignment statement, protected by a critical section, ensures that each thread adds its local result of the reduction operation stored in NTMP to the shared array NODE. Parallelizing compilers for shared memory architectures like POLARIS [2] or SUIF [7] automatically translate sequential reduction loops as outlined in Figure 1, but utilize a thread library instead of OpenMP to implement multithreading. Array privatization is a good solution for reductions on scalar variables and for small arrays. It is less attractive for our FEM example, where NNODES > NELEMS, since the memory overhead and the execution time for the serial part of the computation increase with the number of elements.
Atomic Reductions. OpenMP provides the atomic directive for enforcing the atomic updating of a specific memory location, rather than exposing it to multiple, simultaneously writing threads. Using this directive, the reduction loop of
Figure 1 can be parallelized by means of a parallel do directive and by inserting an atomic directive prior to the assignment to NODE. This technique does not require any additional memory, but may cause high synchronization overheads. Although the atomic directive may be replaced by a critical directive, the atomic directive permits better optimizations. In contrast to a critical section, more than one thread may execute the assignment at the same time as long as different memory locations are updated.
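For illustration, the atomic variant can be written as follows in C with OpenMP; the paper's examples use Fortran, and this sketch merely transcribes the same pattern, with work standing in for the function Work and the arrays passed as arguments.

/* Atomic-update parallelization of the unstructured reduction of Fig. 1,
 * transcribed to C with OpenMP for illustration only. */
#include <omp.h>

void reduce_atomic(int nelems, const double *elem, double *node,
                   const int ix[][4], double (*work)(double))
{
    #pragma omp parallel for
    for (int i = 0; i < nelems; i++) {
        double val = work(elem[i]);
        for (int k = 0; k < 4; k++) {
            #pragma omp atomic
            node[ix[i][k]] += val;   /* atomic per memory location: updates
                                        to distinct nodes proceed concurrently */
        }
    }
}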
3 Unstructured Reductions on Distributed Memory Machines
Parallelization of unstructured reductions on distributed memory machines is more complex than on shared memory machines. Since a distributed memory architecture provides no global address space, the reduction arrays have to be distributed to the local memories of the processors, and accesses to array elements on other processors require communication. As a consequence of the indirect array accesses, analysis of array access patterns, which is a pre-requisite for communication generation, cannot be performed at compile time and, therefore, runtime parallelization techniques have to be applied. In the following, the parallelization of unstructured array reductions for distributed memory machines is discussed in the context of High Performance Fortran (HPF) [8].

!hpf$ distribute(block) :: ELEM, NODE
!hpf$ align IX(*,i) with ELEM(i)
      ...
!hpf$ independent, on home (ELEM(I)), new(VAL, K), reduction(NODE)
      do I = 1, NELEMS
        VAL = Work(ELEM(I))
        do K = 1, 4
          NODE(IX(K,I)) = NODE(IX(K,I)) + VAL
        end do
      end do

Fig. 2. Unstructured Reductions in High Performance Fortran.

HPF provides high-level directives for specifying the distribution of arrays to abstract processors according to various formats (block, cyclic, etc.). The independent directive may be used to assert that a loop does not contain loop-carried dependences and therefore may be parallelized. In this context, temporary variables that are (conceptually) private for each loop iteration may be specified by means of a new clause. Moreover, a reduction clause may be used to indicate that dependences caused by associative and commutative reduction operations can be ignored. As opposed to OpenMP, in HPF array variables may also appear within a reduction clause. With the on home clause the loop iteration space may be partitioned according to the distribution of an array.
Inspector/Executor Parallelization Technique. In order to parallelize the code shown in Figure 2, an HPF compiler distributes the arrays ELEM, NODE, and IX as specified by the distribution and alignment directives to the local memories of the processors. Parallel execution of the reduction loop is usually performed in two phases based on the inspector/executor strategy [10]. During the inspector
phase each processor determines for its share of iterations the set of non-local elements of NODE it needs to access and derives the corresponding communication schedules (i.e. gather/scatter schedules). In the executor phase, on each processor non-local data to be read are gathered from the respective owner processors according to the gather-schedules by means of message-passing communication and are stored in local buffers. This is followed by a local computation phase where each processor executes, independently of the other processors, its share of loop iterations on its local part of the NODE array or the local buffers, respectively. Finally, a global communication phase based on the scatter-schedules takes place, combining all those elements of NODE that have been written by processors not owning them. Since the inspector phase may be very time-consuming, it is essential to amortize the preprocessing overhead over multiple executions of a loop by reusing communication schedules [9,1,3] as long as communication patterns do not change. This is possible since unstructured reductions are performed in many codes within a serial time-step loop and the communication patterns are invariant for all (or at least many) time-steps. Some HPF compilers also employ alternative strategies akin to array privatization or array expansion as discussed in Section 2. However, these techniques usually exhibit a larger memory and/or communication overhead.
Non-Local Access Patterns. The main task of the inspector phase when preprocessing a loop with irregular array accesses (vector subscripts) is to determine, on each processor, the set of non-local array elements to be accessed. Once the non-local access pattern has been determined, the required communication can be derived. Recently, the concept of halos [1] has been proposed for HPF, enabling the explicit specification of non-local data access patterns for distributed arrays. A halo, which in its simplest form comprises a list of global indices, specifies the set of non-local elements to be accessed at runtime for each abstract processor participating in the execution of an HPF program. The information provided by a halo significantly reduces the overheads of the inspector phase and eases the computation and reuse of communication schedules. Figure 3 sketches a mesh partitioned in a node-based way, and the corresponding halo describing the set of non-local nodes to be accessed on each processor. By making the required communication explicit, the size of the halo area provides an appropriate measure for data locality. In the next section we show how the concepts of data distribution and halos can be utilized in order to parallelize irregular reductions efficiently for shared memory parallel architectures.
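In its simplest form a halo is thus nothing more than, per abstract processor, the list of global indices of non-local elements that will be accessed. One possible C rendering of this idea and of the schedule entries derived from it is sketched below; this is illustrative only, not the representation used by any particular HPF compiler.

/* A halo in its simplest form: for each abstract processor, the list of
 * global indices of NODE elements it accesses but does not own.
 * Illustrative structures, not a compiler's internal representation. */
typedef struct {
    int  nhalo;        /* number of non-local nodes accessed          */
    int *global_idx;   /* their global indices, length nhalo          */
} halo_t;

/* Gather/scatter schedules can be derived from the halo: for each halo
 * entry, record the owning processor and the slot in the local buffer. */
typedef struct {
    int owner;         /* processor owning the halo node              */
    int local_buf;     /* position of its copy in the receive buffer  */
} sched_entry_t;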
4 Exclusive Ownership Technique
In this section we present a new parallelization method for unstructured reductions on shared memory machines. This strategy is an extension of the atomic reduction technique described in Section 2 that avails itself of the concepts of data distribution and halos in order to minimize synchronization overheads. It can be employed for compiling an HPF program for multithreaded execution on shared memory parallel computers or for parallelizing irregular array reductions with OpenMP directly. We outline this technique for the FEM reduction loop shown in Figure 1.
[Figure: two numbered mesh diagrams; the legend distinguishes halo nodes, local nodes, shared nodes, and exclusive nodes.]
!$omp parallel, private (K, VAL, N)
!$omp do
      do I = 1, NELEMS
        VAL = Work(ELEM(I))
        do K = 1, 4
          N = IX(K,I)
          if (EXCLUSIVE(N)) then
            NODE(N) = NODE(N) + VAL
          else
!$omp atomic
            NODE(N) = NODE(N) + VAL
          end if
        end do
      end do
!$omp end parallel
Fig. 3. Distributed mesh and halo (left); virtually distributed mesh with exclusive ownership (middle); OpenMP code based on exclusive ownership (right).
As a starting point, a virtual distribution is determined for the arrays ELEM, NODE, and IX with respect to abstract processors. Data distribution is referred to as virtual, since the arrays are not actually distributed but allocated in an unpartitioned manner in shared memory. Based on the virtual distribution, each array element is associated with a unique abstract processor, which becomes the owner of this element. Ownership is then used to determine the work sharing for the reduction loop with respect to abstract processors, whereby each abstract processor will be implemented by a separate thread. In our example, the loop iteration space is partitioned such that each iteration I is assigned to the abstract processor owning element ELEM(I) of the mesh. Assuming a block distribution for ELEM, the loop iteration space can be partitioned by relying on the standard work sharing mechanism of OpenMP. In order to minimize synchronization overheads for the assignment to NODE, we introduce, based on halos, the concept of exclusive ownership. An element of array NODE is exclusively owned by an abstract processor (thread) if it is owned by that processor and not contained in the halo of any other processor. Synchronization via atomic updates is necessary only for loop iterations that access nodes not exclusively owned by the executing thread (shared nodes), while exclusively owned nodes can be handled like private data, requiring no synchronization. In Figure 3 the halo and exclusive ownership information is shown for a simple mesh together with the resulting OpenMP code. In the code, exclusive ownership information is represented by means of a logical array EXCLUSIVE, which can be easily derived from the halo of the array NODE. We assume that the halo is either supplied by a domain partitioning tool or explicitly computed before the reduction loop is executed by analyzing the indirection array IX. In the latter case, the analysis of the indirection array is similar to an inspector phase as applied in the context of distributed memory parallelization, yet much simpler, since, due to the shared address space, no communication is required. Exclusive ownership information can be reused, employing techniques for communication schedule reuse [9,1,3], as long as the indirection array is not changed. Gutierrez et al. [6] presented a parallelization method for irregular reductions on shared memory machines that exploits locality in a way similar to our method. The iteration space of a reduction loop is partitioned among threads in such a
way that conflict-free writing on the reduction array is guaranteed and no synchronization is required. For this purpose, loop-index prefetching arrays are built before the loop is executed, by employing techniques similar to those applied in an inspector/executor strategy. The construction of the loop-index prefetching arrays becomes very complex, and the algorithms presented in [6] work only for one or two reductions within one loop iteration. As opposed to our technique, some loop iterations have to be executed by more than one processor, since more than one element of the reduction array may have to be updated during each iteration. In the context of the FEM example presented previously, all iterations that manipulate elements on the distribution boundary of the mesh would have to be executed by all threads that own a node of this element. As a consequence, redundant computations are introduced, with an overhead depending on the computational costs of the function Work.
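The EXCLUSIVE array used in Fig. 3 can be derived from the indirection array alone, as described in Sect. 4. The following C sketch is illustrative only; owner_of_elem and owner_of_node stand for the virtual block distributions, and a node is marked exclusive unless some iteration executed by a foreign thread touches it.

/* Derive exclusive-ownership flags from the indirection array IX.
 * owner_of_elem/owner_of_node encode the virtual distributions;
 * the routine itself is an illustrative sketch, not the compiler's code. */
#include <stdbool.h>

void mark_exclusive(int nelems, int nnodes, const int ix[][4],
                    int (*owner_of_elem)(int), int (*owner_of_node)(int),
                    bool *exclusive)
{
    for (int n = 0; n < nnodes; n++)
        exclusive[n] = true;                 /* assume exclusive until shared */

    for (int i = 0; i < nelems; i++) {
        int t = owner_of_elem(i);            /* thread executing iteration i  */
        for (int k = 0; k < 4; k++) {
            int n = ix[i][k];
            if (owner_of_node(n) != t)       /* touched from a foreign thread: */
                exclusive[n] = false;        /* updates to n need an atomic    */
        }
    }
}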
5 Performance Results
For the evaluation of the exclusive ownership technique we used a kernel from an industrial crash simulation code [5]. The kernel is based on a time-marching scheme to perform stress-strain calculations on a finite-element mesh consisting of 4-node shell elements. In each time-step, elemental forces are calculated for every element of the mesh (cf. function Work) and added back to the forces stored at nodes by means of unstructured reduction operations. Besides the computation of elemental forces, the unstructured reduction operations to obtain the nodal forces represent the most important contribution to the overall computational costs. Table 1 shows the elapsed times measured on an SGI Origin 2000 (MIPSpro Fortran compiler, version 7.30) for different variants of a crash kernel performing 100 iterations on a mesh consisting of 25600 elements and 25760 nodes. In the table, the entry halo (DM) refers to an HPF version parallelized with the Adaptor compiler [4] for distributed memory according to the inspector/executor strategy; privatization (SM), expansion (SM) and atomic (SM) refer to the different shared memory parallelization strategies discussed in Section 2; redundant (SM) refers to the method based on loop-index prefetching; and exclusive (SM) to our exclusive ownership technique. All versions marked with (SM) utilize thread parallelism based on OpenMP, while the HPF version (DM) is based on process parallelism and relies on MPI for communication. The irregular mesh used in this evaluation exhibits a high locality. There are only 160 non-exclusive (shared) nodes for two processors, 324 for three, and 480 nodes for four processors, respectively. As a consequence, both the distributed memory version and the shared memory versions that exploit data locality (i.e. the exclusive ownership technique and loop-index prefetching) show very satisfying results. The versions based on array privatization and array expansion achieve some speed-up, yet they exhibit very poor scaling, due to the high synchronization overhead or the computational overheads introduced by the serial code section, respectively. The version using atomic updates for all assignments to the node array scales, but exhibits an overhead of about a factor of two. The best performance is obtained with the exclusive ownership strategy.
                        NP = 1   NP = 2   NP = 4   NP = 8   NP = 16   NP = 32
  halo (DM)               6.39     3.58     1.76     0.99      0.61      0.40
  privatization (SM)      5.57     3.81     4.33     8.53     16.83     37.12
  expansion (SM)          6.43     6.03     5.28     4.91      5.04      5.68
  atomic (SM)            11.39     6.51     3.48     2.06      1.41      1.27
  redundant (SM)          5.35     2.95     1.53     0.82      0.65      0.38
  exclusive (SM)          5.10     2.79     1.47     0.74      0.55      0.34

  Table 1. Execution times (secs) for the crash simulation kernel on the SGI Origin 2000.
The version based on index prefetching is slightly worse, since the redundant computation of elemental forces requires more time than the atomic updates in our strategy.
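As a rough reading of Table 1 (our own arithmetic): relative to its own single-processor time, the exclusive (SM) version reaches a speedup of about 5.10/0.34 ≈ 15 on 32 processors, the halo (DM) version about 6.39/0.40 ≈ 16, and the atomic (SM) version about 11.39/1.27 ≈ 9, although the atomic version starts from a single-processor time that is roughly twice that of the other variants.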
6 Summary and Conclusion
An efficient handling of unstructured reductions is crucial for many scientific applications. The usual methods for implementing reductions on shared memory parallel computers based on privatization, array expansion, or atomic updates, may not yield satisfying results for unstructured array reductions as they do not exploit data locality. The parallelization technique presented in this paper exploits data locality in order to minimize synchronization. The performance results verify that the concept of ownership in a shared memory programming model is essential for an efficient realization of unstructured reductions.
References
1. S. Benkner. Optimizing Irregular HPF Applications Using Halos. In Proceedings of IPPS/SPDP Workshops. LNCS 1586, 1999.
2. W. Blume, R. Doallo, R. Eigenmann, et al. Parallel Programming with Polaris. IEEE Computer, 29(12):78-82, 1996.
3. T. Brandes and C. Germain. A Tracing Protocol for Optimizing Data Parallel Irregular Computations. LNCS 1470, pages 629-638, September 1998.
4. T. Brandes and F. Zimmermann. Adaptor - A Transformation Tool for HPF Programs. In Programming Environments for Massively Parallel Distributed Systems, pages 91-96. Birkhauser Boston Inc., April 1994.
5. J. Clinckemaillie, B. Elsner, G. Lonsdale, et al. Performance issues of the parallel PAM-CRASH code. The International Journal of Supercomputer Applications and High Performance Computing, 11(1):3-11, Spring 1997.
6. E. Gutierrez, O. Plata, and E.L. Zapata. On Automatic Parallelization of Irregular Reductions on Scalable Shared Memory Systems. In Euro-Par'99 Parallel Processing, Toulouse, pages 422-429. LNCS 1685, Springer-Verlag, September 1999.
7. M. Hall et al. Maximizing Multiprocessor Performance with the SUIF Compiler. IEEE Computer, 29(12):84-90, 1996.
8. High Performance Fortran Forum. High Performance Fortran Language Specification. Version 2.0, Rice University, January 1997.
9. Ravi Ponnusamy, Joel Saltz, and Alok Choudhary. Runtime-compilation techniques for data partitioning and communication schedule reuse. In Proceedings Supercomputing '93, pages 361-370, 1993.
10. J. Saltz, K. Crowley, R. Mirchandaney, and H. Berryman. Run-time scheduling and execution of loops on message passing machines. Journal of Parallel and Distributed Computing, 8(2):303-312, 1990.
11. The OpenMP Forum. OpenMP Fortran Application Program Interface. Technical Report Ver 1.0, SGI, October 1997.
Parallel FEM Simulation of Crack Propagation - Challenges, Status, and Perspectives*
Bruce Carter^1, Chuin-Shan Chen^1, L. Paul Chew^2, Nikos Chrisochoides^3, Guang R. Gao^4, Gerd Heber^1, Anthony R. Ingraffea^1, Roland Krause^5, Chris Myers^1, Demian Nave^3, Keshav Pingali^2, Paul Stodghill^2, Stephen Vavasis^2, and Paul A. Wawrzynek^1

1 Cornell Fracture Group, Rhodes Hall, Cornell University, Ithaca, NY 14853
  {bcarter, dchen, heber, [email protected], [email protected], [email protected]
2 CS Department, Upson Hall, Cornell University, Ithaca, NY 14853
  {chew, pingali, stodghil, [email protected]
3 CS Department, University of Notre Dame, Notre Dame, IN 46556
  {nikos, [email protected]
4 EECIS Department, University of Delaware, Newark, DE 19716
  [email protected]
5 Center for Comp. Mech., Washington University in Saint Louis, MO 63130
  [email protected]

Abstract. Understanding how fractures develop in materials is crucial to many disciplines, e.g., aeronautical engineering, material sciences, and geophysics. Fast and accurate computer simulation of crack propagation in realistic 3D structures would be a valuable tool for engineers and scientists exploring the fracture process in materials. In the following, we will describe a next generation crack propagation simulation software that aims to make this potential a reality.
1 Introduction
Within the scope of this paper, it is sufficient to think about crack propagation as a dynamic process of creating new surfaces within a solid. During the simulation, crack growth causes changes in the geometry and, sometimes, in the topology of the model. Roughly speaking, with the tools in place before the start of this project, a typical fracture analysis at a resolution of 10^4 degrees of freedom, using boundary elements, would take about 100 hours on a state-of-the-art single processor workstation. The goal of this project is to create a parallel environment which allows the same analysis to be done, using finite elements, in 1 hour at a resolution of 10^6 degrees of freedom. In order to attain this level of performance, our system will have two features that are not found in current fracture analysis systems:
* This work was supported by NSF grants CCR-9720211, EIA-9726388, ACI-9870687, and EIA-9972853.
Parallelism - Current trends in computer hardware suggest that in the near future, high-end engineering workstations will be 8- or 16-way SMP "nodes", and departmental computational servers will be built by combining a number of these nodes using a high-performance network switch. Furthermore, the performance of each processor in these nodes will continue to grow. This will happen not only because of faster clock speeds, but also because finer-grain parallelism will be exploited via multi-way (or superscalar) execution and multi-threading.

Adaptivity - Cracks are (hopefully) very small compared with the dimension of the structure, and their growth is very dynamic in nature. Because of this, it is impossible to know a priori how fine a discretization is required to accurately predict crack growth. While it is possible to over-refine the discretization, this is undesirable, as it tends to dramatically increase the required computational resources. A better approach is to adaptively choose the discretization refinement.

The dynamic nature of crack growth and the need to do adaptive refinement make crack propagation simulation a highly irregular application. Exploiting parallelism and adaptivity presents us with three major research challenges,
– developing algorithms for parallel mesh generation for unstructured 3D meshes with automatic element size control and provably good element quality,
– implementing fast and robust parallel sparse solvers, and
– determining efficient schemes for automatic, hybrid h-p refinement.
To tackle the challenges of developing this system, we have assembled a multidisciplinary and multi-institutional team that draws upon a wide-ranging pool of talent and the resources of 4 universities.
2 System Overview
Figure 1 gives an overview of a typical simulation. During pre-processing, a solid model is created, problem-specific boundary conditions (displacements, tractions, etc.) are imposed, and flaws (cracks) are introduced. In the next step, a volume mesh is created, and (linear elasticity) equations for the displacements are formulated and solved. An error estimator determines whether the desired accuracy has been reached, or further iterations, after subsequent adaptation, are necessary. Finally, the results are fed back into a fracture analysis tool for post-processing and crack propagation. Figure 1 presents the simulation loop of our system in its final and most advanced form. Currently, we have sequential and parallel implementations of the outer simulation loop (i.e., not the inner refinement loop) running with the following restrictions: right now, the parallel mesher can handle only polygonal (non-curved) boundaries. Curved boundaries can be handled by the sequential meshers, though (see section 3). We have not yet implemented unstructured h-refinement and adaptive p-refinement, although the parallel formulator can handle arbitrary p-order elements.
[Figure: flow diagram of the simulation loop - solid model, boundary conditions, and introduced flaws feed into volume meshing, finite element formulation, and iterative solution; estimated errors either trigger structured/unstructured refinement and an increased order of basis functions, or, once acceptable, feed fracture analysis and crack propagation in FRANC3D.]
Fig. 1. Simulation loop.

3 Geometric Modeling and Mesh Generation
The solid modeler used in the project is called OSM. OSM, as well as the main pre- and post-processing tool, FRANC3D, is freely available from the Cornell Fracture Group's website [4]. FRANC3D - a workstation based FRacture ANalysis Code for simulating arbitrary non-planar 3D crack growth - has been under development since 1987, with hydraulic fracture and crack growth in aerospace structures as the primary application targets since its inception. While there are a few 3D fracture simulators available and a number of other software packages that can model cracks in 3D structures, these are severely limited by the crack geometries that they can represent (typically planar elliptical or semi-elliptical only). FRANC3D differs by providing a mechanism for representing the geometry and topology of 3D structures with arbitrary non-planar cracks, along with functions for 1) discretizing or meshing the structure, 2) attaching boundary conditions at the geometry level and allowing the mesh to inherit these values, and 3) modifying the geometry to allow crack growth but with only local remeshing required to complete the model. The simulation process is controlled by the user via a graphical user interface, which includes windows for the display of the 3D structure and a menu/dialogue-box system for interacting with the program. The creation of volume meshes for crack growth studies is quite challenging. The geometries tend to be complicated because of internal boundaries (cracks). The simulation requires smaller elements near each crack front in order to
accurately model high stresses and curved geometry. On the other hand, larger elements might be sufficient away from the crack front. There is a considerable difference between these two scales of element sizes, which amounts to three orders of magnitude in real life applications. A mesh generator must provide automatic element size control and give certain quality guarantees for elements. The mesh generators we studied so far are QMG by Steve Vavasis [14], JMESH by Joaquim Neto [11], and DMESH by Paul Chew [12]. These meshers represent three different approaches: octree-algorithm based (QMG), advancing front (JMESH), and Delaunay mesh (DMESH). QMG and DMESH come with quality guarantees for elements in terms of aspect ratio. All these mesh generators are sequential and give us insight into the generation of large "engineering quality" meshes. We decided to pursue the Delaunay mesh based approach first for a parallel implementation, which is described in [5]. Departing from traditional approaches, we simultaneously do mesh generation and partitioning in parallel.
This not only eliminates most of the overhead of the traditional approach, it is almost a necessary condition to do crack growth simulations at this scale, where it is not always possible, or is too expensive, to keep up with the geometry changes by doing structured h-refinement. The implementation is a parallelization of the so-called Bowyer-Watson (see the references in [5]) algorithm: given an initial Delaunay triangulation, we add a new point to the mesh, determine the simplex containing this point and the point's cavity (the union of simplices with non-empty circumspheres), and, finally, retriangulate this cavity. One of the challenges for a parallel implementation is that this cavity might extend across several submeshes (and processors). What looks like a problem turns out to be the key element in unifying mesh generation and partitioning: the newly created elements, together with an adequate cost function, are the best candidates to do the "partitioning on the fly". We compared our results with Chaco and MeTis in terms of equidistribution of elements, relative quality of mesh separators, data migration, I/O, and total performance. Table 1 shows a runtime comparison between ParMeTis with PartGeomKway (PPGK) and our implementation, called SMGP, on 16 processors of an IBM SP2 for meshes of up to 2000K elements.

  Mesh Size    PPGK   SMGP0   SMGP1   SMGP2   SMGP3
  200K           90      42      42      42      42
  500K          215      65      87      64      62
  1000K         439      97     160      91      94
  2000K        1232     133     310     110     135

  Table 1. Total run time in seconds on 16 processors.

The numbers behind SMGP refer to different cost functions used in driving the partitioning [5].
4 Equation Solving and Preconditioning
So far, our focus has been on iterative solution methods. It is part of our future work to explore the potential of direct methods. We chose PETSc [13, 1] as the basis for our equation solver subsystem. PETSc provides a number of Krylov space solvers, such as Conjugate Gradient and GMRES, and a number of widely-used preconditioners, such as (S)SOR and ILU/ICC. We have augmented the basic library with third party packages, including BlockSolve95 [8] and Barnard's SPAI [2]. In addition, we have implemented a parallel version of the Global Extraction Element-By-Element (GEBE) preconditioner [7] (which is unrelated to the EBE preconditioner of Winget and Hughes [15]), and added it to the collection using PETSc's extension mechanisms. The central idea of GEBE is to extract subblocks of the global stiffness matrix associated with elements and invert them, which is highly parallel. The ICC preconditioner is frequently used in practice, and is considered to be a good preconditioner for many elasticity problems. However, we were concerned that it would not scale well to the large number of processors required for our final system. We believed that GEBE would provide a more scalable implementation, and we hoped that it would converge nearly as well as ICC. In order to test our hypothesis, we ran several experiments on the Cornell Theory Center SP-2. The preliminary performance results for the gear2 and tee2 models are shown in Tables 2 and 3, respectively. gear2 is a model of a power transmission gear with a crack in one of its teeth. tee2 is a model of a T steel profile. For each model, we ran the Conjugate Gradient solver with both PETSc's IC(0) preconditioner and our own parallel implementation of GEBE on 8 to 64 processors. (PC = preconditioner type, p = number of processors, t_PC = time for preconditioner setup, t_it = time per CG iteration. The iteration counts in Tables 2 and 3 correspond to a 10^15 reduction of the residual error, which is completely academic at this point.)

  Table 2. Gear2 (79,656 unknowns)

  PC      p    t_PC    t_it   Iters.
  IC(0)   8   17.08    0.2      416
  GEBE    8    9.43    0.19     487
  IC(0)  16   15.47    0.27     422
  GEBE   16    6.71    0.11     486
  IC(0)  32    8.51    0.32     539
  GEBE   32    3.73    0.08     485
  IC(0)  64   11.00    0.28     417
  GEBE   64    4.74    0.07     485

  Table 3. Tee2 (319,994 unknowns)

  PC      p    t_PC    t_it   Iters.
  IC(0)  32   30.00    0.29    2109
  GEBE   32   35.70    0.21    2421
  IC(0)  64   23.60    0.29    2317
  GEBE   64    7.60    0.12    2418

The experimental results confirm our hypothesis: 1) GEBE converges nearly as quickly as IC(0) for the problems that we tested.
2) Our naive GEBE implementation scales much better than PETSc's IC(0) implementation which uses BlockSolve95.
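The extraction-and-inversion idea can be written compactly. As a sketch of the general GEBE scheme (our own formulation; the exact variant implemented on top of PETSc may differ), applying the preconditioner to a residual r amounts to

    z = M^{-1} r ≈ Σ_e N_e^T (A_e)^{-1} N_e r ,

where the sum runs over all elements e, N_e selects the degrees of freedom of element e, and A_e is the corresponding subblock extracted from the assembled global stiffness matrix. Since the element blocks A_e are small, dense, and independent of each other, they can be inverted and applied concurrently, which is the source of the parallelism noted above.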
5 Adaptivity
Understanding the cost and impact of the different adaptivity options is the central point in our current activities. Our implementation follows the approach of Biswas and Oliker [3] and currently handles tetrahedra, while allowing enough
flexibility for an extension to non-tetrahedral element types. For relatively simple, two-dimensional problems, stress intensity factors can be computed to an accuracy sufficient for engineering purposes with little mesh refinement by proper use of singularly enriched elements. There are many situations, though, when functionals other than stress intensity factors are of interest or when the singularity of the solution is not known a priori. In any case the engineer should be able to evaluate whether the data of interest have converged to some level of accuracy considered appropriate for the computation. It is generally sufficient to show that the data of interest are converging sequences with respect to increasing degrees of freedom. Adaptive finite element methods are the most efficient way to achieve this goal and at the same time they are able to provide estimates of the remaining discretization error. We define the error of the finite element solution as e = u - u_FE, and a possible measure for the discretization error is the energy norm, ||e||^2_E(Ω) = (1/2) B(e, e). The error estimator introduced by Kelly et al. [9, 6] is derived by inserting the finite element solution into the original differential equation system and calculating a norm of the residual using interpolation estimates. An error indicator computable from local results of one element of the finite element solution is then derived, and the corresponding error estimator is computed by summing the contribution of the error indicators over the entire domain. The error indicator is computed with a contribution from the interior residual of the element and a contribution of the stress jumps on the faces of an element. Details on the computation of the error estimator from the finite element solution can be found in [10].
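For orientation, a typical indicator of this kind (our own summary; the constants and exact scaling used in [9, 6, 10] may differ) has the form

    η_K^2 = h_K^2 ||r_K||^2_{L2(K)} + h_K ||J_K||^2_{L2(∂K)} ,

where h_K is the diameter of element K, r_K the interior residual obtained by inserting the finite element solution into the differential equation, and J_K the jump of the stress (flux) across the faces of K; the global error estimate is obtained by summing η_K^2 over all elements.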
6 Future Work
The main focus of our future work will be on improving the performance of the existing system. We have not yet done any specific performance tuning, like locality optimization. This is not only highly platform dependent, but also has to be put in perspective to the forthcoming runtime optimizations, like dynamic load balancing. We are considering introducing special elements at the crack tip, and non-tetrahedral elements (hexes, prisms, pyramids) elsewhere. On the solver side, we explore new preconditioners (e.g., support tree preconditioning), and multigrid, as well as sparse direct solvers, to make our environment more effective and robust. A port of our code base to the new 64-node, 4-way SMP NT cluster at the Cornell Theory Center is underway.
7 Conclusions
At present, our project can claim two major contributions. The first is our parallel mesher/partitioner, which is the first practical implementation of its kind with quality guarantees. This technology makes it possible, for the first time, to fully automatically solve problems using unstructured h-refinement in a parallel setting. The second major contribution is to show that GEBE outperforms ICC, at least for our problem class. We have shown that, not only does GEBE converge almost as quickly as ICC, it is much more scalable in a parallel setting than ICC.
References
[1] Satish Balay, William D. Gropp, Lois Curfman McInnes, and Barry F. Smith. Efficient management of parallelism in object-oriented numerical software libraries. In E. Arge, A.M. Bruaset, and H.P. Langtangen, editors, Modern Software Tools in Scientific Computing. Birkhauser Press, 1997.
[2] Stephen T. Barnard and Robert Clay. A portable MPI implementation of the SPAI preconditioner in ISIS++. In Eighth SIAM Conference on Parallel Processing for Scientific Computing, March 1997.
[3] R. Biswas and L. Oliker. A new procedure for dynamic adaption of three-dimensional unstructured grids. Applied Numerical Mathematics, 13:437-452, 1994.
[4] http://www.cfg.cornell.edu/
[5] Nikos Chrisochoides and Demian Nave. Simultaneous mesh generation and partitioning for Delaunay meshes. In 8th Int'l. Meshing Roundtable, 1999.
[6] J.P. de S.R. Gago, D.W. Kelly, O.C. Zienkiewicz, and I. Babuska. A posteriori error analysis and adaptive processes in the finite element method: Part II - Adaptive mesh refinement. International Journal for Numerical Methods in Engineering, 19:1621-1656, 1983.
[7] I. Hladik, M.B. Reed, and G. Swoboda. Robust preconditioners for linear elasticity FEM analyses. International Journal for Numerical Methods in Engineering, 40:2109-2127, 1997.
[8] Mark T. Jones and Paul E. Plassmann. BlockSolve95 users manual: Scalable library software for the parallel solution of sparse linear systems. Technical Report ANL-95/48, Argonne National Laboratory, December 1995.
[9] D.W. Kelly, J.P. de S.R. Gago, O.C. Zienkiewicz, and I. Babuska. A posteriori error analysis and adaptive processes in the finite element method: Part I - Error analysis. International Journal for Numerical Methods in Engineering, 19:1593-1619, 1983.
[10] Roland Krause. Multiscale Computations with a Combined h- and p-Version of the Finite Element Method. PhD thesis, Universitat Dortmund, 1996.
[11] J.B.C. Neto et al. An algorithm for three-dimensional mesh generation for arbitrary regions with cracks. Submitted for publication.
[12] http://www.cs.cornell.edu/People/chew/chew.html
[13] http://www.mcs.anl.gov/petsc/index.html
[14] http://www.cs.cornell.edu/vavasis/vavasis.html
[15] J.M. Winget and T.J.R. Hughes. Solution algorithms for nonlinear transient heat conduction analysis employing element-by-element iterative strategies. Computational Methods in Applied Mechanical Engineering, 52:711-815, 1985.
Support for Irregular Computations in Massively Parallel PIM Arrays, Using an Object-Based Execution Model

Hans P. Zima^{1,2} and Thomas L. Sterling^1

1 CACR, California Institute of Technology, Pasadena, CA 91125, U.S.A.
2 Institute for Software Science, University of Vienna, Austria
E-mail: {zima, [email protected]

Abstract. The emergence of semiconductor fabrication technology allowing a tight coupling between high-density DRAM and CMOS logic on the same chip has led to the important new class of Processor-In-Memory (PIM) architectures. Furthermore, large arrays of PIMs can be arranged into massively parallel architectures. In this paper, we outline the salient features of PIM architectures and discuss macroservers, an object-based model for such machines. Subsequently, we specifically address the support for irregular problems provided by PIM arrays. The discussion concludes with a case study illustrating an approach to the solution of a sparse matrix vector multiplication.
1 Introduction
Processor-in-Memory or PIM architecture couples processor cores and DRAM blocks, providing direct access to memory row buffers to increase the memory bandwidth achieved by two orders of magnitude. Current generation PIMs are very basic, treating memory as a physical resource. But future generations of PIM architecture may be well suited to the processing of irregular data structures. PIM based system architecture may take a number of forms, from a simple replacement of dumb memory chips with PIM chips to complete systems comprising an array of PIM chips. Manipulating irregular data structures favors specific system architecture characteristics that provide high inter-PIM chip communication bandwidth, efficient address manipulation, and fast response to light-weight service requests. In this paper, we first characterize some properties of irregular problems (Section 2). The subsequent Section 3 provides a short overview of an object-based execution model for PIM arrays that we are currently developing, in parallel
with the design of a PIM array architecture. Section 4 will then discuss in more detail some of the features of future generation systems that are designed to support the efficient processing of irregular problems. We finish with concluding remarks in Section 6.
2 Irregular Problems
Many advanced scientific problems are of an irregular nature and contain a large degree of inherent parallelism. This includes sparse matrix computations, sweeps over unstructured grids, tree searches, and particle-in-cell codes; moreover, many relevant problems are of an adaptive nature. In order to execute such codes efficiently on a massively parallel array of PIMs, a viable tradeoff between the exploitation of locality and parallelism has to be found. As a consequence, the performance of irregular codes is largely determined by the decisions made relating to their data and work distributions. First, a data distribution must be determined by partitioning large data structures and distributing them across the memories of the machine. Secondly, the choice of the work distribution has to be made in conjunction with the data distribution, taking into account dependencies and access patterns in the code. Regular problems, which are characterized by regular, usually linear, data access patterns, can be efficiently implemented on conventional architectures exploiting compile-time knowledge [7]. In contrast, the data access patterns as well as the data and work distributions typical for irregular algorithms must be largely resolved at runtime. Such algorithms pose major problems for most existing parallel machines, mainly as a result of the memory access latency which makes the runtime translation of indirect references and the associated communication very expensive. The problem is compounded by the preprocessing that is often required to effectively organize bulk communication of data [5]. As will be discussed in more detail in Section 4, PIM arrays can offer significant support for this kind of problems. Below we summarize some of the typical features of irregular applications.

Irregular Data Distributions: The management of irregular data
distributions is one of the key issues to be dealt with. Relevant topics include the generation of the data structure, its representation in memory, the implementation of access mechanisms, and (total as well as incremental) redistribution algorithms.
Thread Groups: Irregular algorithms require sophisticated mechanisms
for the generation of groups of cooperating threads. One example is a set of data parallel threads working on a problem in loosely synchronous SPMD mode; another one is a thread structure arising from a tree search. Critical operations - all of which may be intra-chip as well as inter-chip - include thread generation, communication, prefix and reduction operations, mutual exclusion, and condition synchronization.
Address Translation: Address translation refers to the problem of mapping an indirect reference (such as A(X(I)) or a pointer value) to a memory address. For irregular data structures, this translation must, in general, be performed at runtime, since neither the data distributions nor the value of index arrays need to be statically known.
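As a simple illustration (our own example, covering only the regular case): if a one-dimensional array A is block-distributed with block size b over the memory nodes, a reference A(X(I)) with k = X(I) resolves to node ⌊(k-1)/b⌋ and local offset ((k-1) mod b) + 1. For the irregular distributions considered here, no such closed formula exists, and the mapping must instead be looked up at runtime in the distribution metadata.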
3 Macroservers
We have defined an object-based model that provides an abstract programming paradigm for a massively parallel PIM array. Here we give a short outline of this model; for a more detailed specification see [8]. The central concept of the model is called a macroserver. A macroserver is an autonomous, active object that comes into existence by being created based upon a macroserver class, which essentially establishes an encapsulation for a set of variables and methods as well as a context for threads and their synchronization. At any point in time, a macroserver has a well-defined home, which is a location in virtual memory where the metadata needed for the management of the object can be found. Macroservers are associated with a set of variables whose values define its state. Variables may be distributed across the memories of the PIM; the underlying distribution mechanism can be thought of as a generalization of the corresponding concept in languages such as Vienna Fortran [2] or HPF [4], extended to LISP-type data structures and allowing arbitrary mappings as well as incremental redistribution. The activation of a method in a macroserver gives rise to the generation of a (synchronous or asynchronous) thread. Threads can directly read and modify the state of the macroserver to which they belong. Each thread can activate methods of its own or of another macroserver accessible to it; as a consequence, the model offers intra-server as well as inter-server concurrency, reflecting the multi-level parallelism of the underlying architecture. Special support is provided for the operation of thread groups. The model provides a simple mechanism for mutual exclusion (atomic methods) and allows synchronization via condition variables and futures. Metadata includes information about the state variables (such as types and distributions), the signatures of methods, and representations of the threads currently operating in the macroserver. Most of the metadata will be stored at the home of the object, allowing an efficient centralized management by the associated PIM.
4 PIM Arrays and Their Support for Irregular Computations
PIM combines memory cell blocks and processing logic on the same integrated circuit die. The row buffer of the typical memory block may be on the order of 1K or more bits which are acquired in a single memory cycle delivering data at
a typical rate of 4 Gigabytes per second at the sense amps of the memory. A processor designed to directly manipulate this data simultaneously can typically operate at a performance of a Gigaflops. Multiple memory-processor pairs can occupy the same die with a possible performance total today of 4 Gigaflops, with higher rates for simple arithmetic or small field manipulations. A typical memory system comprising an array of PIM chips could provide a total throughput of 256 Gigaflops or, for byte level operations, 1 Teraflops peak throughput. While in some designs core processors developed as conventional microprocessors are "dropped" into the PIM to minimize development time and cost as well as exploit existing software such as compilers, in other cases much simpler processors may be designed explicitly for the PIM context, minimizing the die area consumed while optimizing for effective memory bandwidth. A typical PIM chip includes multiple modules of memory blocks and processors, memory bus interface, and shared functional units such as floating point arithmetic units. Also, addresses employed within the PIM by the on-chip processors are usually physical. A system incorporating PIM chips may differ from the conventional structure outlined above in important ways to take advantage of the capabilities of PIM and employ them in the broader system. While in the most simple structure, the PIM chips may simply replace regular memory retaining the system's processor with its layered cache hierarchy, other structures diverging from this will deliver improved performance. Providing a separate highly parallel network for just inter-PIM communications and a second level of bulk storage as backing store for the PIMs are two examples. Beyond this is the "Gilgamesh" (billion logic gate array mesh) which is a multi-dimensional structure of PIM chips in a standalone system. The processing and manipulation of irregular data structures imposes additional requirements beyond what can be effectively handled with current generation PIM technology and architecture. However, many of the advantages that PIM has for dense contiguous data computing would also convey to metadata-organized irregular data structures if augmented with advanced mechanisms. An important advance that PIM enables is the exploitation of fine grain parallelism intrinsic to irregular structures. This is because many different parts of the distributed structure can be processed simultaneously by the many PIM processors throughout the memory and because the nodes of the structure can be handled efficiently in the memory itself. Virtual memory management including address translation is central to the advanced architecture requirements of PIM for irregular data. Such data incorporates virtual user addresses within the data structure itself, which must be managed directly by the PIM. Conventional TLB based methods employed by microprocessors are likely to be poorly suited because the access patterns encountered by the PIM internal processors often will not experience the necessary level of temporal locality required to make them effective. An innovative approach employs "intrinsic address translation" in which the virtual to physical address mapping is incorporated in the data structure's metadata. With bi-directional links, any page movement across physical memory can cause
automatic update of the translation data in other linked pages. A second method is a "set associative" approach that is a hybrid mapping; partly physical and partly virtual as in cache systems but in main memory instead. This allows efficient address translation between memory chips while permitting random placement and location of pages anywhere within a designated memory chip. Both methods when combined provide relatively general and highly scalable virtual to physical memory address translation. A second requirement is direct inter-PIM message driven communication without system processor intervention. This is being explored by the DIVA project [3] and HTMT project [1]. When a pointer indicates continuation of a structure on an external chip, the computation must be able to "jump" across this physical gap while retaining logical consistency. The "parcel" model [1] permits such actions to follow distributed data structures across PIM arrays. A parcel is a message packet that not only conveys data, but also specifies the actions to be performed. The arrival of a parcel causes a PIM to respond by instantiating the specified thread, carrying out the action on the designated local data, and providing a necessary response, possibly by returning another parcel or by continuing to move through the data structure. A third requirement is architecture support for thread management. Because of the potentially random distribution of the data structure elements or substructures, the order of service requests may be unknown until runtime. A PIM chip may be required to service multiple parcels concurrently. A PIM comprises several subsystems that can operate simultaneously. Finally, some PIM threads may require access to remote resources such as other PIM chips and such accesses will impose delays. One mechanism not yet incorporated in past PIM implementations is multithreading, a means of rapidly switching among multiple concurrent thread contexts. Multithreading provides an efficient low level mechanism for managing multiple active threads of execution. This is useful while waiting several cycles for a memory access or the propagation delay for a shared ALU to permit other work to be performed by the remaining resources. Each time a new parcel arrives, the active thread can be suspended until the incident parcel is at least stored, thereby freeing the receiver hardware resources for the next arrival. Multithreaded architecture will greatly facilitate the manipulation and processing of irregular data structures because it provides a runtime adaptive means of applying system resources to computational needs as determined by the structure of the data itself.
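To make the notion concrete, a parcel might carry information roughly like the following (a purely hypothetical layout of our own; the DIVA and HTMT papers cited above define the actual formats):

      TYPE PARCEL
         INTEGER :: TARGET_NODE    ! PIM node that holds the addressed data
         INTEGER :: METHOD_ID      ! action (thread) to instantiate on arrival
         INTEGER :: DATA_ADDR      ! address of the designated local data
         REAL    :: PAYLOAD(8)     ! operands carried along with the request
      END TYPE PARCEL

On arrival, the receiving PIM would instantiate the thread named by METHOD_ID on the data at DATA_ADDR and may answer by sending a further parcel.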
5 Case Study: Sparse Matrix Vector Multiply
In this section, we outline an approach for the parallelization of a sparse matrix-vector multiplication using our model, based partly on concepts developed in [6]. We first take a look at the sequential algorithm. Consider the operation S = A·B, where A(1:N, 1:M) is a sparse matrix with q nonzero elements, and B(1:M) and S(1:N) are vectors. We assume that the nonzero elements of A
are enumerated using row-major order; the k-th element in this order is called the k-th nonzero element of A. In the Compressed Row Storage (CRS) format, A is represented by three vectors, D, C, and R:
– the data vector, D(1:q), stores the sequence of nonzero elements of A, in the order of their enumeration;
– the column vector, C(1:q), contains in position k the column number, in A, of the k-th nonzero element of A; and
– the row vector, R(1:N+1), contains in position i the number of the first nonzero element of A in row i (if any); R(N+1) is set to q+1.
Based upon this representation, the core loop of the sequential algorithm can be formulated in Fortran as shown in Figure 1.

      INTEGER :: C(q), R(N+1)
      REAL    :: D(q), B(M), S(N)
      INTEGER :: I, J

      DO I = 1, N                       ! loop over the rows of A
         S(I) = 0.0
         DO J = R(I), R(I+1)-1          ! nonzeros of row I
            S(I) = S(I) + D(J)*B(C(J))
         ENDDO
      ENDDO

      Figure 1: Sparse Matrix-Vector Multiply: Core Loop

The first step in developing a parallel version of the algorithm consists of defining a distributed sparse representation of A. This essentially combines a data distribution with a sparse format such as CRS. More specifically, a data distribution [7, 8] is interpreted as if A were a dense array, specifying a local distribution segment for each PIM node. The distributed sparse representation is then obtained by representing the submatrix constituting the local distribution segment in the sparse format (CRS in our case). A number of data distributions have been used for this purpose, including Multiple Recursive Decomposition (MRD) and cyclic distributions [6]. MRD is a method that partitions A into nn rectangular distribution segments, where nn is the number of memory nodes. These segments are the result of a recursive construction algorithm that aims at achieving load balancing by having approximately the same number of nonzero elements in each segment. Based upon a distributed sparse representation using MRD, a parallel algorithm for the sparse matrix vector product can now be formulated by applying (a slightly modified version of) the algorithm in Fig. 1 in parallel to all distribution segments, and then combining the partial results for each row of the original matrix in a reduction operation. Because of lack of space, we do not
discuss further details of the parallel algorithm here (see [8]). However, we outline a number of topics that illustrate the support of the PIM array architecture for this kind of algorithm:
– The CRS representation of the local data segments can be stored and processed locally in each PIM node by microservers.
– The indirect references involving D and B can be resolved in the memory, making the implementation of an inspector/executor scheme [5, 7] much more efficient than for distributed-memory machines.
– The PIM array network offers efficient support for spawning a large number of "similar" parallel threads and for executing reduction operations.
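A rough shared-memory analogue of this scheme (our own OpenMP sketch, not the macroserver formulation of [8]) partitions the nonzeros rather than the rows, so that a row may receive contributions from several threads; the per-thread partial results are then combined by a sum reduction over S, mirroring the row-wise reduction step described above. ROW(1:q) is assumed to hold the row index of each nonzero (it can be derived from R), and an OpenMP compiler supporting array reductions is assumed.

      REAL    :: S(N), D(q), B(M)
      INTEGER :: C(q), ROW(q)
      INTEGER :: K

      S = 0.0
!$OMP PARALLEL DO PRIVATE(K) REDUCTION(+:S)
      DO K = 1, q                             ! partition the nonzeros across threads
         S(ROW(K)) = S(ROW(K)) + D(K) * B(C(K))
      END DO
!$OMP END PARALLEL DO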
6 Conclusion
In this paper, we have discussed the design of massively parallel PIM arrays, together with an object-based execution model for such architectures. An important focus of this work in progress is the capability to deal effectively with irregular problems.
References
[1] J.B. Brockman, P.M. Kogge, V.W. Freeh, S.K. Kuntz, and T.L. Sterling. Microservers: A New Memory Semantics for Massively Parallel Computing. Proceedings ACM International Conference on Supercomputing (ICS'99), June 1999.
[2] B. Chapman, P. Mehrotra, and H. Zima. Programming in Vienna Fortran. Scientific Programming, 1(1):31-50, Fall 1992.
[3] M. Hall, J. Koller, P. Diniz, J. Chame, J. Draper, J. LaCoss, J. Granacki, J. Brockman, A. Srivastava, W. Athas, V. Freeh, J. Shin, and J. Park. Mapping Irregular Applications to DIVA, a PIM-Based Data Intensive Architecture. Proceedings SC'99, November 1999.
[4] High Performance Fortran Forum. High Performance Fortran Language Specification, Version 2.0, January 1997.
[5] J. Saltz, K. Crowley, R. Mirchandaney, and H. Berryman. Run-Time Scheduling and Execution of Loops on Message-Passing Machines. Journal of Parallel and Distributed Computing, 8(2):303-312, 1990.
[6] M. Ujaldon, E.L. Zapata, B.M. Chapman, and H.P. Zima. Vienna Fortran/HPF Extensions for Sparse and Irregular Problems and Their Compilation. IEEE Transactions on Parallel and Distributed Systems, 8(10):1068-1083, October 1997.
[7] H. Zima and B. Chapman. Compiling for Distributed Memory Systems. Proceedings of the IEEE, Special Section on Languages and Compilers for Parallel Machines, pp. 264-287, February 1993.
[8] H. Zima and T. Sterling. Macroservers: An Object-Based Model for Massively Parallel Processor-in-Memory Arrays. Caltech CACR Technical Report, January 2000 (in preparation).
Executing Communication-Intensive Irregular Programs Efficiently

Vara Ramakrishnan* and Isaac D. Scherson

Department of Information and Computer Science, University of California, Irvine, CA 92697
{vara, [email protected]

Abstract. We consider the problem of efficiently executing completely irregular, communication-intensive parallel programs. Completely irregular programs are those whose number of parallel threads as well as the amount of computation performed in each thread vary during execution. Our programs run on MIMD computers with some form of space-slicing (partitioning) and time-slicing (scheduling) support. A hardware barrier synchronization mechanism is required to efficiently implement the frequent communications of our programs, and this constrains the computer to a fixed size partitioning policy. We compare the possible scheduling policies for irregular programs on fixed size partitions: local scheduling and multi-gang scheduling, and prove that local scheduling does better. We then introduce competitive analysis and formally analyze the online rebalancing algorithms required for efficient local scheduling under two scenarios: with full information and with partial information.
1 Introduction
The universe of parallel programs can be broadly divided into regular and irregular programs. Regular programs have a fixed number of parallel threads, each of which perform about the same amount of computation. Irregular programs have a varying number of parallel threads and/or threads which perform unequal amounts of computation. Since the behavior of regular programs is predictable, scheduling them well is relatively easy. There are several static (job arrival time) and dynamic scheduling methodologies [3] which work very well on regular programs. For this reason, we focus on irregular programs, which waste computing resources if managed poorly. There are two main classes of parallel computers, MIMD (multiple-instruction multiple-data) and SIMD (single-instruction multiple-data). MIMD computers have full-fledged processors which fetch, decode and execute instructions independently of each other. SIMD computers have a centralized control mechanism which fetches, decodes and broadcasts each instruction to specialized processors, and they all execute the instruction simultaneously.
* This research was supported primarily by PMC-Sierra, Inc., San Jose, California. http://www.pmc-sierra.com
We assume that the parallel computer is an MIMD computer. MIMD computers have overwhelming advantages in cost and time-to-market, provided by using off-the-shelf processors, whereas SIMD computers need custom-built processors. MIMD computers can also be used more efficiently for the following reasons:
1. An MIMD computer does not force unnecessary synchronization after every instruction, or unnecessary sequentialization of non-interfering branches of computation, as an SIMD computer does.
2. Many jobs can be run simultaneously on different processors (or groups of processors) of an MIMD computer.
These factors, among others, justify our assumption and explain the continuing market trend towards MIMD computers, exemplified by the Thinking Machines CM-5, the Cray Research T3D and, more recently, the Sun Microsystems Enterprise 10000 and the Hewlett-Packard HyperPlex. To enable sharing by multiple programs, MIMD computers usually provide some form of time-slicing (scheduling) or space-slicing (partitioning) or both. Parallel irregularity can be classified as follows: variation in parallelism (number of threads) during execution (termed X-irregularity), and variation in the amount of computation performed per thread (termed Y-irregularity). Of course, a program may be both X and Y-irregular: completely irregular, which is the class of programs we consider. This means our programs definitely exhibit X-irregularity, the behavior of spawning and terminating parallel threads at runtime. We then have to make run-time decisions on where to schedule newly spawned threads, and whether to rebalance remaining threads after threads terminate. Several examples of communication-intensive irregular parallel programs can be found among parallelized electronic design automation applications, and parallel search algorithms used in internet search engines and games. Besides being irregular, these programs require frequent communications among several or all of the running threads. In Section 2 of this paper, we discuss why the most efficient way of implementing frequent communications is barrier synchronization, using a hardware tree mechanism. For a detailed discussion of barrier synchronization and implementation methods, see [4]. Hardware barrier synchronization trees impose fixed size processor partitions on the computer. In Section 3, we discuss the scheduling strategies available for fixed size partitions. We show that local scheduling does better than a policy based on gang scheduling for Y-irregular programs, provided we are able to balance threads across processors. In Section 4, we discuss the problem of balancing threads across processors. Due to frequent communications between threads, it is essential to schedule a newly spawned thread almost immediately. Therefore, the only practical way to schedule a new thread is to temporarily run it on the same processor as its parent thread, and periodically rebalance the threads in a partition to achieve an efficient schedule. We outline an optimal online algorithm which decides when to rebalance, and prove that it does no worse than 2 times the optimal offline strategy. The optimal online algorithm requires complete information about the scheduling state of the partition, which is prohibitively expensive to gather.
Therefore, we propose a novel way to gather partial state information by using the barrier synchronization tree available on the computer. Then, we propose and analyze an online algorithm which uses only this partial state information. We show that our online algorithm based on partial state information performs no worse than n times the optimal offline strategy, where n is the number of processors. This implies that our worst case performance is limited to the performance of running the program sequentially. We intend to expand on this work by demonstrating experimentally that the average case performance of our partial state online algorithm is close to the optimal online algorithm.
2 Constraints on Execution
This section describes the constraints placed by our program characteristics on their efficient execution. We outline the most effective way to implement the frequent communications of our programs, and its impact on partitioning and scheduling options.
2.1 Barrier Synchronization
The overhead of implementing frequent communications as multiple pairwise communications across the data network is very high. An efficient alternative is barrier synchronization: a form of synchronization where a point in the code is designated as a barrier and no thread is allowed to cross the barrier until all threads involved in the computation have reached it. Usually, threads are barrier synchronized before and after each communication. This ensures that all data values communicated are current, without doing any pairwise synchronizations between threads. Barrier synchronization can be implemented in software or in hardware. Software barriers are implemented using shared semaphores or message passing protocols based on recursive doubling. Software implementations provide
flexible usage, but they suffer from either the sequential bottleneck associated with shared semaphores or the large communication latencies of message passing protocols. The time for a software barrier to complete is measured in data network latencies, and is proportional to n for shared semaphores and to log n for protocols based on recursive doubling, where n is the number of processors. Hardware barriers are implemented in their simplest form as a single-bit binary tree network, called a barrier tree. The barrier tree takes single-bit flags from the processors as its inputs, and essentially implements recursive doubling in hardware, using AND gates. The time for a hardware barrier to complete is measured in gate delays, and is proportional to the height of the tree, or equivalently, log n. Although barrier trees are commonly implemented as complete binary trees, this is not essential. A barrier tree can be any spanning tree of the processors, and in fact, such an implementation facilitates partitioning of the computer for multiple programs [5].
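For illustration, a message-passing barrier based on recursive doubling can be sketched as follows (our own MPI sketch, assuming the number of processes is a power of two); it needs log2(n) communication rounds, matching the latency estimate above:

      SUBROUTINE SOFTWARE_BARRIER(COMM)
         USE MPI
         INTEGER, INTENT(IN) :: COMM
         INTEGER :: RANK, NPROCS, PARTNER, DIST, TOKEN, DUMMY, IERR
         INTEGER :: STATUS(MPI_STATUS_SIZE)
         CALL MPI_COMM_RANK(COMM, RANK, IERR)
         CALL MPI_COMM_SIZE(COMM, NPROCS, IERR)
         TOKEN = 1
         DIST  = 1
         DO WHILE (DIST < NPROCS)
            PARTNER = IEOR(RANK, DIST)              ! partner at distance DIST
            CALL MPI_SENDRECV(TOKEN, 1, MPI_INTEGER, PARTNER, 0,  &
                              DUMMY, 1, MPI_INTEGER, PARTNER, 0,  &
                              COMM, STATUS, IERR)
            DIST = DIST * 2
         END DO
      END SUBROUTINE SOFTWARE_BARRIER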
Instead of a single-bit tree, the barrier tree can be constructed with m-bit edges, and any associative logic on m-bit inputs (such as maximum, minimum, sum) can replace the AND gates in the barrier tree. In this case, barriers will complete after a hardware latency no worse than O(log n log m). [4] shows that a tree which computes the maximum of its input flags, a max-tree, is useful in synchronizing irregular programs efficiently. Later in this paper, we discuss a way to also use such a tree in scheduling irregular programs. In general, hardware barrier trees are intrinsically parallel and have very low latency, so they are at least an order of magnitude faster than software barriers.
2.2 Fixed Size Processor Partitions
If barrier synchronization is implemented in software, it imposes no constraints on the partitioning policy. For example, a partition may be defined loosely as the set of processors assigned to the threads of a job, and it may be possible to increase the partition size whenever a new thread is spawned. Due to this
flexibility, scheduling new threads is an easier problem on such computers. If barrier synchronization is implemented as a hardware tree, the barrier tree dictates the possible processor partitions on the computer. This is because we need to ensure that a usable portion of the barrier tree is available on each partition. Then, space-slicing can only be done by assigning fixed size partitions (usually of contiguous processors) to each program. Assuming there is only one barrier tree available per partition, partitions must be non-overlapping. It may be possible to resize partitions dynamically, but this is a slow operation since all the threads running on the partition must be quiesced and the barrier tree reconfigured for the new partition size. Therefore, resizing has to be done at infrequent enough intervals that we may assume a fixed size partition for the duration of a program. We treat a run-time partition resizing as a case where a program terminates all its threads simultaneously, and a new program with the same number of threads starts on the new partition. Note that if the computer does not support partitioning at all, then the entire computer can be treated as one fixed size partition for our purposes. Scheduling in the presence of a hardware barrier tree is a harder problem and more applicable to the class of programs we are interested in. Therefore, in the rest of this paper, we assume that we are executing on computers with a hardware barrier synchronization mechanism, specifically a max-tree, and fixed size, non-overlapping partitions.
3 Scheduling on Fixed Size Partitions
Since barrier synchronization is implemented in hardware, when a new thread is spawned by a program, it must be scheduled on one of the processors within the fixed size partition. Given this fixed size partitioning policy, let us consider what the scheduling options are:
If there is no time-slicing available on the computer, threads cannot be preempted, meaning that a job will relinquish its entire partition only upon completion. Without time-slicing, each processor can run no more than one thread, so there is no way to run a job whose number of threads exceeds the number of processors in the largest partition (which may be the entire computer). This also means it is only possible to run X-irregular jobs which can predict the maximum number of threads they may have at any time during execution (this may not be feasible), and there is a large enough partition to accommodate that number. Therefore, to run X-irregular jobs without restrictions, it is essential that time-slicing be available on the computer. One form of time-slicing called gang scheduling is possible on fixed size partitions. In gang scheduling, each processor has no more than one thread assigned to it, and threads never relinquish processors on an individual basis. At the end of a time-slice, all threads in the partition are preempted simultaneously using a centralized mechanism called multi-context-switch. Then, new threads (either of the same or a different job) are scheduled on the partition. (Note that it is possible to schedule different jobs in each time-slice because the state of the barrier tree can be saved along with the job's other context information, effectively allowing the barrier tree to be time-sliced as well.) By gang scheduling threads of the same job in more than one time-slice, called multi-gang scheduling, it is possible to run X-irregular jobs. However, this is inefficient because each processor will be idle after its thread reaches a barrier, wasting the rest of the time-slice. To address this, the time-slice can be selected to match the communication frequency, but if the job is Y-irregular, the time-slice has to be large enough to allow the longest thread to reach its barrier. In such a case, it would be helpful to allow the barrier tree to trigger the multi-context-switch hardware when the longest thread reaches the barrier, rather than using fixed length time-slices (this feature is not available on any computers we know of). There would still be some idling on most of the other threads' processors due to Y-irregularity, but this cannot be completely eliminated in multi-gang scheduling. An alternate form of time-slicing, called local scheduling, mitigates the idling caused by Y-irregularity. Local scheduling is possible within fixed size partitions, and requires that each processor is capable of individually time-slicing multiple threads allocated to it. These threads must all belong to the same job, since threads from multiple jobs cannot share a barrier tree simultaneously and there is only one barrier tree per partition. In a local scheduled partition, the processor preempts each thread when it reaches a barrier, giving other threads a chance to run and reach the barrier. The processor only sets its flag on the barrier tree after all its threads have reached the barrier. In other words, the processor locally synchronizes all its threads and places the result on the barrier tree.
3.1 Handling Y-Irregularity
Barriers divide the program execution into barrier phases, and any Y-irregularity in the program is fragmented into these barrier phases. There will be some processing resources wasted in each barrier phase because not all threads have
the same amount of computation to perform before reaching the next barrier. We show that local scheduling can usually do better than multi-gang scheduling in eliminating some of this waste.

Theorem. Given a Y-irregular program with a large number of threads distributed evenly across processors, multi-gang scheduling cannot perform any better than local scheduling.
Proof. Let the total number of threads in the job be N, and the number of processors in the partition be a much smaller number n. The number of threads on each processor is either m = ⌈N/n⌉ or m − 1. Let the time for thread i on processor j to reach the barrier be t_ij, with discrete values varying between 0 and M. (If a processor j has only m − 1 threads, then t_mj = 0.) In multi-gang scheduling, the time for one barrier phase to complete is

    T_m = Σ_{i=1..m} max_{1≤j≤n} t_ij ,

while in local scheduling, the time for one barrier phase to complete is

    T_l = max_{1≤j≤n} Σ_{i=1..m} t_ij .
T_m is the sum of the maximum t_{ij} values across all the processors. T_l is the maximum among the sums of the t_{ij} values on each processor. The only way T_l could be as large as T_m is when the largest t_{ij} values all happen to occur on exactly the same processor, whose sum would then be selected as the maximum. For all other cases, T_l would be smaller than T_m. Therefore, T_m ≥ T_l. This proves the theorem. □
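As a quick numerical illustration of the two expressions above (the matrix values below are made up for illustration and are not from the paper), the following fragment computes T_m and T_l for a small t_{ij} matrix:

#include <stdio.h>

#define M_THREADS 3   /* m: thread slots per processor (illustrative) */
#define N_PROCS   2   /* n: processors (illustrative) */

int main(void)
{
    /* t[i][j]: time for thread i on processor j to reach the barrier. */
    int t[M_THREADS][N_PROCS] = { {5, 1}, {2, 4}, {1, 3} };

    /* T_m: sum over threads of the per-thread maximum across processors. */
    int Tm = 0;
    for (int i = 0; i < M_THREADS; i++) {
        int row_max = 0;
        for (int j = 0; j < N_PROCS; j++)
            if (t[i][j] > row_max) row_max = t[i][j];
        Tm += row_max;
    }

    /* T_l: maximum over processors of the per-processor sum across threads. */
    int Tl = 0;
    for (int j = 0; j < N_PROCS; j++) {
        int col_sum = 0;
        for (int i = 0; i < M_THREADS; i++)
            col_sum += t[i][j];
        if (col_sum > Tl) Tl = col_sum;
    }

    printf("Tm = %d, Tl = %d\n", Tm, Tl);  /* here Tm = 12 and Tl = 8, so Tl < Tm */
    return 0;
}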
Since the odds of all the largest t_{ij} values occurring on the same processor are very low, local scheduling generally does better than multi-gang scheduling. Intuitively, local scheduling tends to even out differences in barrier phase times across processors by averaging across the local threads. For the above reason, as well as the fact that barrier tree triggered multi-context-switch mechanisms are not available on any existing computer, we assume that local scheduling, as opposed to multi-gang scheduling, is used to run X-irregular jobs on our computer. This gives rise to the problem of ensuring that the job's threads are distributed evenly across processors, which is addressed in Section 4. Pure local scheduling has the disadvantage of not allowing a partition to be shared by more than one job. This means that a decision to run a particular job may have a potentially large, persistent and unpredictable impact on future jobs wanting to run on the computer [2]. To avoid this problem, a combination of both forms of time-slicing called family scheduling [1] is possible as well. In
family scheduling, it is assumed that the number of threads in a job is larger than the partition size, and they are distributed across the processors as evenly as possible (the number of threads on any two processors may differ by at most 1). Multiple jobs are gang scheduled on the partition. Within its allotted partition and gang scheduled time-slice, multiple threads of the job are local scheduled on each processor. For our purposes, we can treat family scheduling and local scheduling as equivalent, since the gang scheduling time-slice is usually much larger than barrier phase times in our jobs.
3.2 Handling X-Irregularity
Since communications are implemented using barrier synchronizations, they cannot complete unless all threads participate. Therefore, to ensure job progress, all threads must be executed simultaneously or at least given some guarantee of execution within a short time bound. This means newly spawned threads must be scheduled almost immediately to keep processors from idling. Since we have a fixed partition size, the only practical option is to schedule a newly spawned thread on the same processor as its parent thread. This will temporarily cause an imbalance on that processor (and violate the family scheduling rule that the number of threads on any two processors differ by at most 1). A thread's probability of spawning other threads may be data dependent, causing some processors to become heavily loaded compared to others. Therefore, to ensure efficient execution, spawned threads will have to be migrated to other processors. Similarly, when threads terminate, some processors may be underutilized till the remaining threads are migrated to rebalance the load.
4 Online Rebalancing of Threads
At every barrier synchronization point, we have the opportunity to gather information about the state of the partition. If we find an imbalance in the number of threads across processors, we can make the decision whether to correct the imbalance or to leave the threads where they are and continue running till the imbalance gets worse. In making this decision, we must weigh the cost of processor idling due to the imbalance against the cost of rebalancing. In addition, there is also the cost associated with gathering information about the state of the system to enable our decision making, but we will ignore this for the moment. Note that we use the term cost to indicate the time penalty associated with an action. We must make our decisions online: at a given barrier, we have knowledge of the previous imbalances in the system, and the current state. We also know the cost of rebalancing the system, which varies as a function of the imbalance. With this partial knowledge, we must decide whether to rebalance the system or not. In contrast, a theoretical offline algorithm knows the entire sequence of imbalances in advance and can make rebalancing decisions with the benefit of foresight.
Consider a fixed sequence of imbalances σ. Let C_OPT(σ) be the cost incurred by the optimal offline algorithm on this sequence. Let C_A(σ) be the cost incurred by an online algorithm A on the same sequence. Algorithm A is said to be r-competitive if for all sequences of imbalances σ, C_A(σ) ≤ r · C_OPT(σ). The competitive ratio of algorithm A is r. This technique of evaluating an online algorithm by comparing its performance to the optimal offline algorithm is called competitive analysis. It was introduced in [6] and has been used to analyze online algorithms in various fields. The optimal offline algorithm is often referred to as an adversary. Note that a thread's time to reach a barrier may be data dependent, which would mean that all threads cannot be counted as equals when rebalancing decisions are made. However, due to the dynamic nature of thread behavior over the duration of a program, it is very hard to predict a thread's time to reach a barrier. Even if that information were predictable, using it to further improve scheduling decisions is usually not feasible. This is because the variation in times to reach a barrier is limited by the very small time for each barrier phase (due to communication frequency), and it is difficult to make scheduling decisions with low enough overhead to actually recover some of that small time wasted in each barrier phase. In the rest of this paper, we assume that each thread utilizes its barrier phase fully for computation. In addition, we also assume that the barrier phase time does not vary significantly over the duration of the program. The above two assumptions enable us to treat threads of a job as equals, and rebalancing has the sole objective of equalizing the number of threads across processors.
4.1 Costs of Imbalance and Rebalancing
Initially, we assume that complete information about the number of threads on all processors in the job's partition is available at each processor. This information is actually expensive to gather, but we will discuss this expense and alternatives in the next section. We first define two terms which are used to analyze the costs of imbalance and rebalancing:
– The system imbalance, δ, is the maximum thread imbalance on any processor. If there are n processors and N threads, let m = \lfloor N/n \rfloor denote the average number of threads in the system. If k_j is the number of threads on processor j, then \sum_{j=0}^{n-1} k_j = N. The thread imbalance on processor j is δ_j = k_j - m. Note that δ_j values can be either positive or negative, and their maximum, by definition, has a value of 0 or higher. The system imbalance, δ = max_{j=0}^{n-1} δ_j.
– The aggregate system imbalance, Φ, is the number of threads that need to be moved in order to balance the system. Let δ_{Aj} denote the absolute value of the thread imbalance on each processor. In other words, for each processor j, δ_{Aj} = |δ_j|. The aggregate system imbalance, Φ = (\sum_{j=0}^{n-1} δ_{Aj}) / 2.
The cost of running without rebalancing at a barrier is c, where c = t · δ, and t is the time for any thread to reach any barrier, which we have assumed to be a constant for a given program. The cost of rebalancing the tree is C, where C = x + y · Φ, and x and y are constants which depend on the implementation of the data network on the machine. These costs c and C are used to analyze online rebalancing algorithms. For our analysis, we assume that x is negligibly small compared to y · Φ. Note that for communication-intensive programs, t is very small, so C is significantly larger than c on any machine. When the job is initially scheduled or after the last rebalancing, the threads in the partition are evenly distributed across all processors. At this point, δ may be 0 or 1, depending on whether N is a multiple of n. Therefore, the lowest value of c which represents a correctable system imbalance is generally 2t. It is reasonable to assume that the system places a limit M on the maximum number of threads that can be run per processor, and this limit is helpful in bounding the values of c and C later on. An online algorithm would maintain a running sum S of all the c values encountered (resetting S whenever c has a value of 0 or t, corresponding to δ = 0 or 1). When S equals some threshold T, a system rebalance is triggered, at a cost of C.
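A minimal sketch of this threshold rule follows; it is illustrative only (the names, the rebalance() routine, and the way δ and Φ are obtained are assumptions, not part of the paper):

/* Illustrative threshold-based online rebalancing decision (names assumed). */
void rebalance(void);            /* hypothetical routine that migrates threads  */

static double S = 0.0;           /* running sum of imbalance costs c            */

/* Called at every barrier with the current system imbalance delta and the
 * aggregate imbalance phi (threads to move); t, x, y are as in the text. */
void at_barrier(int delta, int phi, double t, double x, double y)
{
    double c = t * delta;        /* cost of running this barrier unbalanced     */
    double C = x + y * phi;      /* cost of rebalancing now                     */
    double T = C;                /* threshold; Section 4.2 shows T = C is best  */

    if (delta <= 1) {            /* system is as balanced as it can get         */
        S = 0.0;
        return;
    }
    S += c;
    if (S >= T) {
        rebalance();             /* pay C, evening out threads across processors */
        S = 0.0;
    }
}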
4.2 Optimal Online Algorithm
Case 1: Consider an algorithm which selects a T smaller than C. Therefore, T = C - ε. To minimize the performance of this algorithm, the adversary would keep the system at a minimum imbalance at all times, by spawning or terminating threads. Therefore, the algorithm pays the cost C of rebalancing the system, while the adversary never pays a rebalancing cost. However, both the algorithm and the adversary pay the cost S = T of running with an imbalance. The algorithm's cost is C + T, while the adversary's cost is T, making the algorithm's competitive ratio r = (2C - ε)/(C - ε) = 1 + C/(C - ε). Therefore, r ≥ 2 for all values of ε, and has a minimum value of 2 occurring when ε = 0.
Case 2: Consider an algorithm which selects a T larger than C. Therefore, T = C + ε. To minimize the performance of this algorithm, the adversary would rebalance its system as early as possible, therefore paying no costs for running with an imbalance. Once again, the algorithm's cost is C + T, while the adversary's cost is C, making the algorithm's competitive ratio r = (2C + ε)/C = 2 + ε/C. Again, r ≥ 2 for all values of ε, and has a minimum value of 2 occurring when ε = 0.
From the above two cases, we see that the optimal online algorithm selects T = C, and has a competitive ratio of 2.
4.3 Low Overhead Alternatives
The cost of gathering complete information about the system configuration at each barrier is too high, requiring n phases of communication over the data network, where each processor is allowed to declare how many threads are assigned to it. To enable hardware barrier synchronization, we assumed that a max-tree
is available on the computer. This tree can be used to inexpensively compute the maximum and minimum number of threads per processor on our system. This is done in two phases: 1. Each processor j places its k_j in its max-tree flag, and the tree returns their maximum, max_{j=1}^{n} k_j, to all the processors. This value directly corresponds to the maximum number of threads running on any processor. 2. Each processor places M - k_j in its max-tree flag, and the tree returns their maximum, max_{j=1}^{n} (M - k_j), to all the processors. By subtracting this value from M, each processor calculates the minimum number of threads running on any processor, since M - max_{j=1}^{n} (M - k_j) = M - [M - min_{j=1}^{n} k_j] = min_{j=1}^{n} k_j. We wish to consider algorithms which estimate costs only based on the difference between the minimum and maximum number of threads on the processors, without knowing the actual average number of threads on the system. The average number of threads on the system has to lie between the minimum and maximum number of threads on any processor. Therefore, the algorithm faces the worst uncertainty in guessing the average when the maximum and minimum are as far apart as possible. This happens in system configurations where at least one processor has 0 threads and at least one has M threads. We refer to the set of system configurations with this property as 0-M configurations. We need to consider only 0-M configurations since we are interested in estimating the algorithm's worst case behavior. The rebalancing period p is the number of barriers run with an imbalance, after which the algorithm chooses to rebalance. The algorithm would have to guess values of c and C, and select a value for p that is as close as possible to C/c. If the algorithm underestimates p, it would rebalance too often, and the adversary's strategy would be to keep the system at a minimum imbalance at all times. The adversary would never pay a rebalancing cost, while the algorithm would pay it more often than it would with complete information. Both pay the cost of running with the minimum imbalance at all times. If the algorithm overestimates p, it would rebalance too infrequently, and the adversary's strategy would be to cause the maximum system imbalance and rebalance immediately. The adversary pays the rebalancing cost, while the algorithm pays the rebalancing cost as well as the cost of running with the maximum imbalance for longer than it would with complete information. Now we analyze 0-M configurations to arrive at the minimum and maximum values that δ and Φ can take. For any 0-M configuration, without loss of generality, we can also assume that processor 0 has 0 threads, and processor n-1 has M threads. Note that the system imbalance δ is always attributable to the processor with the largest number of threads running on it. Therefore, we may assume that δ_{n-1} = δ. The maximum δ, denoted by δ_MAX, has to occur when δ_{n-1} = M - m is maximized, which is when m is minimized. This happens when processors 0 through n-2 all have 0 threads, making m = M/n, and δ_{n-1} = M - M/n = M(n-1)/n. (Any other configuration would have more threads on some of the processors 0
through n-2, making the mean m higher and reducing the value of δ_{n-1}.) Therefore, δ_MAX is M(n-1)/n. For this configuration, referred to as Configuration α, Φ = [(n-1) · M/n + (M - M/n)] / 2, which simplifies to M(n-1)/n. Similarly, the minimum δ, denoted by δ_MIN, has to occur when δ_{n-1} = M - m is minimized, which is when m is maximized. This happens when processors 1 through n-1 all have M threads, making m = M(n-1)/n, and δ_MIN = M - M(n-1)/n = M/n. For this configuration, referred to as Configuration β, Φ = [M(n-1)/n + (n-1) · (M - M(n-1)/n)] / 2, which simplifies to M(n-1)/n. Note that Φ has the same value, M(n-1)/n, at δ_MIN and at δ_MAX. This is also the minimum number of threads that need to be moved to balance any 0-M configuration. Therefore, the minimum value of Φ, Φ_MIN, is M(n-1)/n. The maximum value of Φ, Φ_MAX, occurs when half the processors have M threads and the other half have 0 threads. In this configuration, m = M/2. To balance this configuration, M/2 threads have to be moved from each of the n/2 processors (which have M threads) to the others. Therefore, Φ_MAX = Mn/4, and corresponds to δ = M/2. We refer to this as Configuration γ. Assume there are no limits (other than M) on the number of threads that can be spawned or terminated in one barrier phase. Considering only 0-M configurations, δ can take any value in the range [M/n, M(n-1)/n] at a barrier, regardless of its previous value. Similarly, Φ can take any value in the range [M(n-1)/n, Mn/4] at a barrier, depending only on δ and regardless of its previous value. However, to analyze the worst case behavior of any algorithm, it is sufficient to consider Configurations α, β and γ, since these provide the worst case values of δ and Φ. c = t · δ, with three choices of δ values: M/n, M/2 and M(n-1)/n. C = y · Φ, with just two choices of Φ values: M(n-1)/n and Mn/4. If the algorithm underestimates p, its cost is p · t · M/n + y · Mn/4, while the adversary's cost is p · t · M/n. This makes the competitive ratio r = 1 + yn²/(4pt). If the algorithm overestimates p, its cost is p · t · M(n-1)/n + y · M(n-1)/n, while the adversary's cost is y · M(n-1)/n. This makes the competitive ratio r = (pt + y)/y. We propose an algorithm which selects p = (y/t) · (n/2), corresponding to Configuration γ. By exhaustively considering all possible combinations of δ and Φ values, one can prove that any algorithm does best by choosing this value, thus showing that our algorithm is optimal among incomplete information alternatives. (The proof is omitted here due to lack of space.) By substituting the value of p in the equations for r above, we see that the competitive ratio of our algorithm is n.
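The following fragment sketches how this low-overhead variant could look on one processor. It assumes a hardware max-reduction primitive, here called max_tree_reduce() (a hypothetical name) that returns the maximum of the values placed in the per-processor flags; everything else (names, the rebalance call) is likewise illustrative rather than taken from the paper.

/* Illustrative low-overhead rebalancing using only the max-tree (names assumed). */
int  max_tree_reduce(int value);  /* hardware max over all processors' flags      */
void rebalance(void);             /* hypothetical thread migration routine        */

void at_barrier_low_overhead(int k_j,      /* threads on this processor           */
                             int M,        /* per-processor thread limit          */
                             int n,        /* processors in the partition         */
                             double y,     /* rebalancing cost factor             */
                             double t,     /* time of one barrier phase           */
                             int *barriers_unbalanced)
{
    /* Phase 1: maximum number of threads on any processor. */
    int k_max = max_tree_reduce(k_j);
    /* Phase 2: minimum, obtained as M - max_j(M - k_j). */
    int k_min = M - max_tree_reduce(M - k_j);

    int p = (int)((y / t) * (n / 2.0));    /* rebalancing period for Configuration gamma */

    if (k_max - k_min <= 1) {              /* effectively balanced */
        *barriers_unbalanced = 0;
        return;
    }
    if (++(*barriers_unbalanced) >= p) {
        rebalance();
        *barriers_unbalanced = 0;
    }
}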
5 Summary and Future Work
In this paper, we classify irregular parallel programs based on the source of their irregularity, into X-irregular, Y-irregular and completely irregular programs. Our programs are communication-intensive besides being completely irregular. This
limits us to fixed size partitions, since frequent communication is efficiently implemented using a hardware barrier tree. We compare the possible scheduling algorithms for completely irregular programs on fixed size partitions: multi-gang scheduling and local scheduling, and show that local scheduling does better to mitigate the inefficiencies of Y-irregularity. However, to handle the effects of X-irregularity, threads need to be rebalanced on processors periodically. We propose and analyze online algorithms for rebalancing threads, including an n-competitive algorithm which is efficient due to its low information gathering cost. We intend to run simulations based on our program characteristics to experimentally show the following: Although our algorithm's worst case behavior is in Θ(n), its average behavior is fairly close to the performance of the 2-competitive, optimal algorithm which has a far greater information gathering overhead. Based on our experiments, we also intend to propose algorithms for machines where the rebalancing cost C = x + y · Φ has a large value of x, making infrequent rebalancing advantageous.
References
1. R. M. Bryant and R. A. Finkel. A stable distributed scheduling algorithm. In International Conference on Distributed Computing Systems, April 1981.
2. D. G. Feitelson and M. A. Jette. Improved utilization and responsiveness with gang scheduling. In D. G. Feitelson and L. Rudolph, editors, Job Scheduling Strategies for Parallel Processing - Lecture Notes in Computer Science, volume 1291, pages 238-261. Springer Verlag, 1997.
3. D. G. Feitelson and L. Rudolph. Parallel job scheduling: Issues and approaches. In D. G. Feitelson and L. Rudolph, editors, Job Scheduling Strategies for Parallel Processing - Lecture Notes in Computer Science, volume 949, pages 1-18. Springer Verlag, 1995.
4. V. Ramakrishnan, I. D. Scherson, and R. Subramanian. Efficient techniques for fast nested barrier synchronization. In Symposium on Parallel Algorithms and Architectures, pages 157-164, July 1995.
5. V. Ramakrishnan, I. D. Scherson, and R. Subramanian. Efficient techniques for nested and disjoint barrier synchronization. Journal of Parallel and Distributed Computing, 58:333-356, August 1999.
6. D. D. Sleator and R. E. Tarjan. Amortized efficiency of list update and paging rules. Communications of the ACM, 28:202-208, February 1985.
NON-MEMORY-BASED AND REAL-TIME ZEROTREE BUILDING FOR WAVELET ZEROTREE CODING SYSTEMS
Dongming Peng and Mi Lu, Electrical Engineering Department, Texas A&M University, College Station, TX 77843, USA
1 INTRODUCTION
The wavelet zerotree coding systems, including Embedded Zerotree Wavelet (EZW) [1] and its variants Set Partitioning In Hierarchical Trees (SPIHT) [2] and Space Frequency Quantization (SFQ) [3][4], have three common procedures: 1) 2-D Discrete Wavelet Transform (DWT) [5], 2) zerotree building and symbol generation from the wavelet coefficients (illustrated in Figure 1), and 3) quantization of the magnitudes of significant coefficients and entropy coding, where the second procedure is an important one. All recently proposed architectures ([6]-[9]) for wavelet zerotree coding use memories to build zerotrees. In this paper we contribute to building the zerotrees in a non-memory-based way with real-time performance, leading to a decrease in hardware cost and an increase in processing rate, which is especially desirable in video coding. One of our main ideas is to rearrange the DWT calculations, taking advantage of parallel and pipelined processing, so that any parent coefficient and its children coefficients in zerotrees are guaranteed to be calculated and output simultaneously.
2 THE ARCHITECTURE FOR REARRANGING 2-STAGE 2-D DWT
2.1 Two Preliminary Devices Used in the Architecture for Rearrangement
(1) The Processing Unit (PU) shown in Figure 2 rearranges the calculation of wavelet filtering so that the filter is cut to half its taps, based on the symmetry between the negative and positive wavelet filter coefficients. x, a and c are the input sequence, low-pass and high-pass filtering output sequences respectively. While a datum of sequence x is fed and shifted into the PU per clock cycle, a datum of a is calculated every even clock cycle and a datum of c is calculated every odd clock cycle. The PU in Figure 2(a) can be extended to a parallel format as in Figure 2(b), where if a number of data from sequence x, x_{k+8}, x_{k+7}, ..., x_k, are fed to the PU in parallel at a clock cycle, then x_{k+9}, x_{k+8}, ..., x_{k+1} are fed at the next cycle. (2) In Figure 3 the TU is a systolic array with (Nw+3)·N cells, where
Fig. 1. The algorithms of EZW and 2-D DWT: (a) the relation of parent-children in two-stage DWT; (b) the relation of parent-children in three-stage DWT; (c) the m-th stage separable 2-D DWT formulas and their simplified expressions (h and g are the low-pass and high-pass filters, respectively): row-major filtering of LL_{m-1} along n2 produces Lr_m and Hr_m, and column-major filtering of Lr_m along n1 produces LL_m and LH_m, while column-major filtering of Hr_m produces HL_m and HH_m; (d) the three-stage separable 2-D DWT.
Nw is the width of the wavelet filter and N is the width or the height of the input (square) image to DWT. N is hundreds or thousands of times greater than Nw for most applications of 2-D wavelet transforms (e.g. image/video systems). A cell transfers its content to the next adjacent cell once it receives a datum from its preceding cell. The leftmost cells in odd rows and rightmost cells in even rows have output ports and copy their data to outside. The upper-left cell has an input port and the TU uses it to receive the input sequence. An element in matrix X is fed to the TU per clock cycle in the order according to the indices in Figure 3(b). The TU's (Nw+3) outputs and its newly arrived element belong to the same column in X. An example of the positions of X's elements in the TU after 3N clock cycles of inputting X is illustrated in Figure 3(c).
2.2 The Proposed Architecture and the Analysis of Its Operations
The architecture for the rearrangement of DWT is proposed in Figure 4, and the corresponding timing of operations is presented in Figure 5. Every four sibling coefficients in the first decomposition stage are designed to be calculated together (meanwhile their parent is generated by PU3).
Fig. 2. The structures of the PU (Processing Unit): (a) systolic filter; (b) parallel filter. In both, the cells hold the coefficient pairs (h4,0), (h3,g4), (h2,g3), (h1,g2), (h0,g1), and the adder output is X_i = (P_i + q_i)·h on even clock cycles and Y_i = (P_i + q_i)·g on odd clock cycles.
Fig. 3. A new transpose unit (TU): (a) a systolic array of Nw+3 rows and N columns with outputs Y1, ..., Y_{Nw+3}; (b) regular row-major indexing of matrix X versus snake-like row-major indexing of matrix X; (c) an example of the positions of X's elements in the TU.
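To make the half-tap idea behind the PU more concrete, the fragment below shows how a symmetric FIR filter can be folded so that each coefficient is applied to the sum of two mirrored inputs. This is only an illustration of the folding principle under the assumption of a symmetric filter of even length; it is not the systolic PU itself.

/* Folded evaluation of a symmetric FIR filter: y[n] = sum_i h[i] * x[n-i],
 * assuming h[i] == h[NW-1-i] and n >= NW-1.  Each multiplier is shared by
 * two taps, which is the sense in which the filter is "cut to half taps". */
#define NW 8                        /* even filter length (illustrative)      */

double folded_filter(const double h[NW], const double x[], int n)
{
    double y = 0.0;
    for (int i = 0; i < NW / 2; i++) {
        /* x[n-i] and x[n-(NW-1-i)] share the coefficient h[i]. */
        y += h[i] * (x[n - i] + x[n - (NW - 1 - i)]);
    }
    return y;
}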
Due to the row-major dyadic subsampling, the row-major high/low-pass filtering is alternatively executed by the PU1 in Figure 4 point by point in each row. Based on similar column-major dyadic subsampling, the PU2 takes turns to execute column-major high/low-pass filtering and selects appropriate inputs from TU1 to generate four sibling coefficients consecutively. The order to calculate the siblings is as A, then B, C and D in the example of four siblings illustrated in Figure 1(a). After PU2's calculation of point A by taking Y1, ..., Y_Nw as inputs, PU2 has to calculate B by taking Y3, ..., Y_{Nw+2} as inputs, then PU2 comes back to take Y1, ..., Y_Nw to calculate C, then PU2 takes Y3, ..., Y_{Nw+2} again to calculate D. In Figure 5 it can be seen that PU2 calculates the same kind of column-major convolution (high-pass or low-pass filtering) during the period when a row of input image is sequentially fed to the system in N clock cycles, and the coefficients in the four subbands LL1, LH1, HL1 and HH1 are calculated in turns in a longer period when four rows of input image are fed. The right column in Figure 5 shows the operations of PU3 in Figure 4. In PU3 the second stage of DWT is completed and the parent coefficients are generated. PU3 takes sequential input of LL1 coefficients from PU2 to perform row-major filtering in the first quarter of the period during which four rows of input image are fed to the system. Then in the next three quarters (i.e., i = 4q+1, 4q+2 or 4q+3 in Figure 5), PU3 performs the column-major convolutions by taking the result from the first quarter's row-major filtering as input.
Fig. 4. The architecture for two-stage DWT and the zerotree building: the input image feeds PU1, whose output passes through TU1 (outputs Y1, ..., Y_{Nw+3}) and DM1 to PU2, which produces the subbands LL1, LH1, HL1 and HH1 during the periods 4q, 4q+1, 4q+2 and 4q+3; the LL1 stream feeds PU3, which, together with its feedback block, produces LL2, LH2, HL2 and HH2. 1) DM1 is a demultiplexer that selects Nw-point data from the Nw+3 outputs of TU1. 2) PU2 is a parallel filter as in Figure 2(b) and has four output ports active at different times. 3) PU3 is the hybrid version of the PU that can take either sequential or parallel inputs. 4) The feedback block consists of 2 separate TUs and demultiplexers to select Lr2/Hr2 into the respective TU and to select outputs from the 2 TUs into PU3.
3 THE DESIGN EXTENDED TO GENERAL STAGES OF DWT
Now we consider what should be done to modify the architecture in Figure 4 for m stages of wavelet decomposition. Because the input image is fed into the system in the same way as before, the first stage row-major high/low-pass filtering is still performed in PU1 alternatively as designated in Figure 5. Regarding the first stage column-major high/low-pass filtering performed in PU2, we note that there are 4^{m-1} coefficients in the first stage decomposition corresponding to the same ancestor in the last stage (stage m) decomposition. To satisfy the restriction of generating parent and children simultaneously, it is required that these 4^{m-1} "kindred" coefficients be calculated together. (Meanwhile these coefficients' parents in the intermediate stages of decomposition should be calculated together too.) Note that these 4^{m-1} coefficients are located in 2^{m-1} adjacent rows and 2^{m-1} adjacent columns in their subband. PU2 should alternatively select appropriate inputs among 2^{m-1} different groups of parallel column-major data from TU1, and perform column-major filtering to generate the 4^{m-1} kindred coefficients in turns, where the coefficients calculated with the same group of input belong to the same row. Accordingly, TU1 is an extended version of that in Figure 3(a) and is supposed to have output ports Y1, ..., Y_{Nw+M} with M equal to 2^m. PU3 carries out the rest of the computation in DWT. The second stage decomposition is achieved as follows. In the first quarter of the period when 2^m rows of input image are fed to the system, PU3 gets its inputs, i.e., the coefficients in the LL1 subband from TU1, and alternatively performs the second stage low/high-pass row-major convolution. The calculated results, i.e., the coefficients in Lr2 and Hr2, are stored in two TUs hidden in PU3's feedback block. In the second quarter, the Lr2 coefficients are fed back to PU3 to be column-major filtered to get the results in LL2 and LH2. In the third and fourth quarter, the Hr2 points
are fed back to PU3 to be used to calculate HL2 and HH2 respectively. PU3 achieves further stage decompositions in the available intervals during its execution of the second stage decomposition. By reason of limited space in this paper, the operations of the processors are described with basic principles and not with many details. To sum up, TU1 is modified to have (Nw+M) rows; PU3's feedback block has become a little bigger so that it holds Nw+M rows of coefficients in the results of row-major wavelet filtering; and the switches have become more complicated to select appropriate data at different times.
4 PERFORMANCE ANALYSIS AND CONCLUSION
Since Nw (the width of the wavelet filters) is far less than N (the width or length of the input image) and the size of the boundary effect of wavelet transforms depends only on Nw, in this paper we ignore the boundary effect to simplify our expressions, knowing that it can be resolved by a little adjustment in either timing or architecture. The area of the proposed architecture is dominated by PUs and TUs. A PU contains p·Nw MACs (Multiplier and Accumulator Cells), where p is the number of precision bits of data; thus three PUs contain 3·p·Nw MACs. Because a TU is necessary for the column-major filtering in every stage decomposition, and the number of cells in the TU at the i-th stage decomposition is (N/2^i)·(Nw+2^i), where the first factor is the length of a row and the second factor is the number of rows in the TU, the total area for TUs is O(mN + Nw·N), where m is the number of stages in DWT. Note that the TUs except TU1 are hidden in PU3's feedback block. Thus the whole area of the architecture for m-stage DWT is A = O(pNNw + pNm). The input image is assumed to be fed with one pixel per clock cycle. The system's latency (execution time) T is N² clock cycles. Thus the product of A and T for the system is O(pN³(Nw+m)), where N² is the input size of the algorithm. Our proposed architecture is comparable to conventional DWT architectures ([10]-[14]) in the aspects of area, latency, their product, and hardware utilization, even though not only the DWT but also the zerotree building is achieved in this architecture. We have proposed a non-memory-based design in which the input image is recursively decomposed by DWT and zerotrees are built in real time. The computation of wavelet-based zerotree coding is strongly characterized by computation locality, in that the calculations of coefficients on a certain zerotree depend (only) on the same local sub-area of the 2-D inputs. This desirable feature has been exploited in this paper by calculating children and their parent simultaneously in the rearranged DWT, so that most intermediate data need not be held for future calculations.
1. J.M. Shapiro, Embedded image coding using zerotrees of wavelet coeÆcients, IEEE Transactions on Signal Processing, Volume: 41, 1993, Page(s): 3445 -3462.
2. A. Said, W.A. Pearlman, A new, fast, and efficient image codec based on set partitioning in hierarchical trees, IEEE Transactions on Circuits and Systems for Video Technology, Volume: 6, June 1996, Page(s): 243-250.
3. Zixiang Xiong, K. Ramchandran, M.T. Orchard, Wavelet packet image coding using space-frequency quantization, IEEE Transactions on Image Processing, Volume: 7, June 1998, Page(s): 892-898.
4. Zixiang Xiong, K. Ramchandran, M.T. Orchard, Space-frequency quantization for wavelet image coding, IEEE Transactions on Image Processing, Volume: 6, May 1997, Page(s): 677-693.
5. M. Vetterli, J. Kovacevic, Wavelets and Subband Coding, Prentice Hall, 1995.
6. Li-Minn Ang, Hon Nin Cheung, K. Eshraghian, VLSI architecture for significance map coding of embedded zerotree wavelet coefficients, Proceedings of 1998 IEEE Asia-Pacific Conference on Circuits and Systems, 1998, Page(s): 627-630.
7. Jongwoo Bae, V. K. Prasanna, A fast and area-efficient VLSI architecture for embedded image coding, Proceedings of International Conference on Image Processing, Volume: 3, 1995, Page(s): 452-455.
8. J.M. Shapiro, A fast technique for identifying zerotrees in the EZW algorithm, Proceedings of 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing, Volume: 3, 1996, Page(s): 1455-1458.
9. J. Vega-Pineda, M. A. Suriano, V. M. Villalva, S. D. Cabrera, Y.-C. Chang, A VLSI array processor with embedded scalability for hierarchical image compression, 1996 IEEE International Symposium on Circuits and Systems, Volume: 4, 1996, Page(s): 168-171.
10. Jer Min Jou, Pei-Yin Chen, Yeu-Horng Shiau, Ming-Shiang Liang, A scalable pipelined architecture for separable 2-D discrete wavelet transform, Design Automation Conference, Proceedings of the ASP-DAC '99, Asia and South Pacific, Volume: 1, Page(s): 205-208.
11. M. Vishwanath, R. M. Owens, M. J. Irwin, VLSI architectures for the discrete wavelet transform, IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, Volume: 42, May 1995, Page(s): 305-316.
12. Chu Yu, Sao-Jie Chen, Design of an efficient VLSI architecture for 2-D discrete wavelet transforms, IEEE Transactions on Consumer Electronics, Volume: 45, Feb. 1999, Page(s): 135-140.
13. V. Sundararajan, K. K. Parhi, Synthesis of folded, pipelined architectures for multidimensional multirate systems, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, Volume: 5, 1998, Page(s): 3089-3092.
14. Chu Yu, Sao-Jie Chen, VLSI implementation of 2-D discrete wavelet transform for real-time video signal processing, IEEE Transactions on Consumer Electronics, Volume: 43, Nov. 1997, Page(s): 1270-1279.
15. T. Acharya, Po-Yueh Chen, VLSI implementation of a DWT architecture, Proceedings of the 1998 IEEE International Symposium on Circuits and Systems, Volume: 2, 1998, Page(s): 272-275.
Fig. 5. The real-time algorithm (DWT amenable to EZW): for i = 0 to N-1 /* row */ { for j = 0 to N-1 /* column */ { Do DWT(i,j) } }, where DWT(i,j) specifies (q and s being any non-negative integers), for i = 4q, 4q+1, 4q+2, 4q+3 and the corresponding values of j, which row-major or column-major filtering step is performed by PU1, by PU2 (producing LL1, LH1, HL1, HH1) and by PU3 (producing Lr2, Hr2, LL2, LH2, HL2, HH2).
Notation: (1) The meaning of superscript "k" is explained as follows (k is an integer). Using the simplified expressions as in Figure 1(c), we call the row-major calculation the "r" arrow and the column-major calculation the "c" arrow. Suppose A corresponds to a part of a component in Figure 1(d). If row-wise signals A and A^k are on the right side of an "r" arrow, and A is from the i-th row, then A^k is from the k-th row. If column-wise signals A and A^k are on the right side of a "c" arrow, and A is from Yi, ..., Y_{Nw+i} (see the TU's output in Figure 3), then A^k is from Y_{i+2k}, ..., Y_{Nw+i+2k} (in the same column). If A and A^k are on the left side of a "c" arrow or an "r" arrow, they correspond to 2 coefficients in the same column, in the i-th row and the (i+k)-th row respectively. The meaning of "+" is explained as follows: column-wise signals A, A^+ and A^{k+} are on the right side of a "c" arrow; A is from Yi, ..., Y_{Nw+i} (see the TU's output in Figure 3), A^+ is from Y_{i+1}, ..., Y_{Nw+i+1}, and A^{k+} is from Y_{i+2k+1}, ..., Y_{Nw+i+2k+1} (they are in the same column). (2) Because of the PU's feature of alternating high/low-pass filtering, we use A and A^+ alternately to get column-wise low/high-pass outputs. (3) The reason why we use A and A^1 (A standing for Lr1, LL1) alternately to generate B and B^1 (B standing for LL1, LH1, ...) is based on the restriction that any siblings be generated consecutively, because of the dyadic downsampling for column-wise convolution. (4) Lr2 and Hr2 are fed into 2 separate TUs in PU3's feedback block. The TUs can hold Nw+2 rows of data at most. Careful readers may find that LL2 and LH2 are computed from Lr2 alternately point by point, whereas HL2 and HH2 are computed from Hr2 alternately row by row. This paradox can be resolved by a little manipulation in the 2 TUs in the feedback block. Anyway, during the time of 4q+1

K. Let's assume that S1 contains all strands with X≤K while S2 holds the rest. b-7 select(S1, C): We are going to extract the answer from S1 as S1 is the set that contains those "bags" with items less than full, i.e., X≤K. As the value of each strand is represented by a certain number of digits, we only need to go through these digits one by one and find the answer larger than V.
4 Problem Reconsideration
In the previous section, we introduced new algorithms for solving an NP-complete problem: the knapsack problem. Here we are going to show the advantage of our algorithm: unlike other existing algorithms [1] [2] [3] [12] [18] [21] [25] that need to restart the whole computation process when there are changes to the initial conditions, our algorithm only needs a few extra operations and the new problem will be solved. This greatly saves time and cost for our DNA computer, because DNA computing usually needs a lot of expensive materials and takes a very long time, e.g., months, to complete. We first work on the simplified knapsack problem. The initial condition is an integer K and n items of different sizes. After the procedures we showed in section 3.1, we will obtain a bag with size K and have m items inside where m
Fig. 1. Ordering computational spaces CS according to their strong complexity.
Solving Problems on Parallel Computers by Cellular Programming Domenico Talia ISI-CNR c/o DEIS, UNICAL, 87036 Rende (CS), Italy Email :
[email protected] Abstract. Cellular automata can be used to design high-performance natural solvers on parallel computers. This paper describes the development of applications using CARPET, a high-level programming language based on the biology-inspired cellular automata theory. CARPET is a programming language designed for supporting the development of parallel high-performance software abstracting from the parallel architecture on which programs run. We introduce the main constructs of CARPET and discuss how the language can be effectively utilized to implement natural solvers of real-world complex problems such as forest fire and circuitry simulations. Performance figures of the experiments carried out on a MIMD parallel computer show the effectiveness of our approach both in terms of execution time and speedup.
1. Introduction
Cellular processing languages based on the cellular automata (CA) model [10] represent a significant class of restricted-computation models [8] inspired by a biological paradigm. They are used to solve problems on parallel computing systems in a wide range of application areas such as biology, physics, geophysics, chemistry, economics, artificial life, and engineering. CA provide an abstract setting for the development of natural solvers of dynamic complex phenomena and systems. Natural solvers are algorithms, models and applications that are inspired by processes from nature. Besides CA, typical examples of natural solver methods are neural nets, genetic algorithms, and Lindenmayer systems. CA represent a basic framework for parallel natural solvers because their computation is based on a massive number of cells with local interactions that use discrete time, discrete space and a discrete set of state variable values. A cellular automaton consists of a one-dimensional or multi-dimensional lattice of cells, each of which is connected to a finite neighborhood of cells that are nearby in the lattice. Each cell in the regular spatial lattice can take any of a finite number of discrete state values. Time is discrete, as well, and at each time step all the cells in the lattice are updated by means of a local rule called the transition function, which determines the cell's next state based upon the states of its neighbors. That is, the state
of a cell at a given time depends only on its own state and the states of its nearby neighbors at the previous time step. Different neighborhoods can be defined for the cells. All cells of the automaton are updated synchronously. The global behavior of the system is determined by the evolution of the states of all cells as a result of multiple interactions. An interesting extension of the CA standard model is represented by continuous CA that allow a cell to contain a real, not only an integer value. This class of automata is very useful for simulation of complex phenomena where physical quantities such as temperature or density must be taken into account. CA are intrinsically parallel and they can be mapped onto parallel computers with high efficiency, because the communication flow between processors can be kept low due to locality and regularity. We implemented the CA features in a high-level parallel programming language, called CARPET [9], that assists cellular algorithms design. Unlike early cellular approaches, in which cell state was defined as a single bit or a set of bits, we define the state of a cell as a set of typed substates. This extends the range of applications to be programmed by cellular algorithms. CARPET has been used for programming cellular algorithms in the CAMEL environment [2, 4]. The goal of this paper is to discuss how the language can be effectively utilized to design and implement scientific applications as parallel natural solvers. The rest of the paper is organized as follows. Sections 2 and 3 introduce the constructs of CARPET and the main architectural issues of the CAMEL system. Section 4 presents a simple CARPET example and describes how the language can be utilized to model the forest fire problem. Finally, performance figures that show the scalability of CARPET programs on a multicomputer are given.
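The synchronous update rule described above can be summarized by the following generic sketch. It is plain C rather than CARPET, and the cell type and transition() function are placeholders: all cells compute their next state from the previous generation, and the two lattices are then swapped.

/* Generic synchronous CA step on a 2-D lattice (illustrative, not CARPET). */
#define W 64
#define H 64

typedef int cell_t;                 /* placeholder cell state type            */

/* Placeholder transition function: next state from the old lattice only.    */
cell_t transition(cell_t old[H][W], int y, int x);

void ca_step(cell_t old[H][W], cell_t next[H][W])
{
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            next[y][x] = transition(old, y, x);   /* reads only old states   */
}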
2. Cellular Programming
The rationale for CARPET (CellulAR Programming EnvironmenT) is to make parallel computers available to application-oriented users hiding the implementation issues resulting from architectural complexity. CARPET is a high-level language based on C with additional constructs to define the rules of the transition function of a single cell of a cellular automaton. A CARPET user can program complex problems that may be represented as discrete cells across 1D, 2D, and 3D lattices. CARPET implements a cellular automaton as a SPMD program. CA are implemented as a number of processes each one mapped on a distinct processing element (PE) that executes the same code on different data. However, parallelism inherent to its programming model is not apparent to the programmer. According to this approach, a user defines the main features of a CA and specifies the operations of the transition function of a single cell of the system to be simulated. So using CARPET, a wide variety of cellular algorithms can be described in a simple but very expressive way. The language utilizes the control structures, the types, the operators and the expressions of the C language. A CARPET program is composed of a declaration part that appears only once in the program and must precede any statement (except those of the C pre-processor) and of a program body. The program body has the usual C statements and a set of special statements defined to access and modify the state of a
cell and its neighborhood. Furthermore, CARPET permits the use of C functions and procedures to improve the structure of programs. The declaration section includes constructs that allow a user to specify the dimensions of the automaton (dimension), the radius of the neighborhood (radius), the pattern of the neighborhood (neighbor), and to describe the state of a cell (state) as a set of typed substates that can be: shorts, integers, floats, doubles and arrays of these basic types. The use of float and double substates allows a user to define continuous CA for modeling complex systems or phenomena. At the same time, formal compliance with the standard CA definition can be easily assured by resorting to a discretized set of values. In CARPET, the state of a cell is composed of a set of typed substates, unlike classical cellular automata where the cell state is represented by a few bits. The typification of the substates allows us to extend the range of the applications that can be coded in CARPET, simplifying the writing of programs and improving their readability. Most systems and languages (for example CELLANG [6]) define the cell substates only as integers. In this case, for instance, if a user must store a real value in a substate then she/he must write some procedures for data retyping. The writing of these procedures makes the program longer and more difficult to read or change. The CARPET language frees the user of this tedious task and offers her/him a high level in state declaration. A type identifier must be included for each substate. In the following example the state is constituted of three substates: state (int particles, float temperature, density);
A substate of the current cell can be referenced by the variable cell_substate (e.g., cell_speed). To guarantee the semantics of cell updating in cellular automata, the value of one substate of a cell can be modified only by the update operation. After an update statement the value of the substate, in the current iteration, is unchanged.
The new value takes effect at the beginning of the next iteration. CARPET allows a user to define a logic neighborhood that can represent a wide range of different neighborhoods inside the same radius. Neighborhoods can be asymmetrical or have any other special topological properties (e.g., hexagonal neighborhood). The neighbor declaration assigns a name to specified neighboring cells of the current cell and a vector name that can be used as an alias in referring to a neighbor cell. For instance, the von Neumann and Moore neighborhoods shown in figure 1 can be defined as follows: neighbor Neumann[4]([0,-1]North, [-1,0]West, [0,1]South, [1,0]East); neighbor Moore[8]([1,-1]NEast, [0,-1]North, [-1,-1]NWest, [-1,0]West, [1,0]East, [-1,1]SWest, [0,1]South, [1,1]SEast);
A substate of a neighbor cell is referred to, for instance, as NEast_speed. By the vector name the same substate can be referred to also as Moore[0]_speed. This way of referencing simplifies writing loops in CARPET programs. CARPET permits the definition of global parameters that can be initialized to specific values (e.g., parameter (viscosity 0.25)). The value of a parameter is the same in each cell of the automaton. For this reason, the value of each parameter cannot be changed in the program but it can only be modified, during the simulation,
by the user interface (UI). CARPET also defines a mechanism for programming nondeterministic rules by means of a random function. Finally, a user can define cells with different transition functions by means of the Getx, Gety, Getz functions that return the value of the coordinates X, Y, and Z of the cell in the automaton.
Fig. 1. The von Neumann and Moore neighborhoods in a two-dimensional cellular automaton.
CARPET does not include constructs for configuration and visualization of the data, unlike other cellular languages. As a result, the same CARPET program can be executed with different configurations. The size of the lattice, as well as other details of a cellular automaton, is defined by the UI of the CARPET environment. The UI allows the user, by means of menus, to define the size of a cellular automaton, the number of processors on which the automaton will be executed, and to choose colors to be assigned to the cell substates to support the graphical visualization of their values.
3. A Parallel Environment for CARPET
Parallel computers represent the most natural architecture where CA programming environments can be implemented. In fact, when a sequential computer is used to support the simulation, the execution time might become very high since such a computer has to perform the transition function for each cell of the automaton in a sequential way. Thus, parallel computers are necessary as a practical support for the effective implementation of high-performance CA [1]. The approach previously mentioned motivated the development of CAMEL (Cellular Automata environMent for systEms ModeLing), a parallel software architecture based on the cellular automata model that constitutes the parallel run-time system of CARPET. The latest version of CAMEL, named CAMELot (CAMEL open technology), is a portable implementation based on the MPI communication library. It is available on MIMD parallel computers and PC clusters. The CAMEL run-time system is composed of a set of macrocell processes, each one running on a single processing element of the parallel machine, and of a controller process running on a processor that is identified as the Master processor. The CAMEL system uses the SPMD approach for executing the CA transition function. Because the number of cells that compose an automaton is generally greater than the number of available processors, several elementary cells are mapped on each macrocell process. The whole set of the macrocells implements a cellular automaton and is called the CAMEL Parallel Engine. As mentioned before,
CAMEL also provides a user interface to configure a CARPET program, to monitor the parameters of a simulation and to dynamically change them at run time. CAMEL implements a form of block-cyclic data decomposition for mapping cells on the processors, which aims to address the problem of load imbalance experienced when the areas of active cells are restricted to one or a few domains and the rest of the lattice may be inactive for a certain number of steps [2]. This load balancing strategy divides the computation of the next state of the active cells among all the processors of the parallel machine, avoiding computing the next state of cells that belong to a stationary region. This is a domain decomposition strategy similar to the scattered decomposition technique.
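As a rough illustration of a block-cyclic assignment of lattice rows to processes (CAMEL's actual mapping and block size are not specified here, so the function below is only a sketch with assumed names):

/* Illustrative block-cyclic mapping of lattice rows to processes. */
int owner_of_row(int row, int block_size, int num_procs)
{
    /* Rows are grouped into blocks of block_size consecutive rows; blocks
     * are then dealt out to processes in round-robin fashion, so active
     * regions tend to be spread over all processes. */
    return (row / block_size) % num_procs;
}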
4. Programming Examples
To describe cellular programming in CARPET practically, this section shows two cellular programs. They are simple but representative examples of complex systems and phenomena and can explain how the natural solver approach can be exploited by the CARPET language.
4.1. The wireworld program
This section shows the simple wireworld program written in CARPET. This program should familiarize the reader with the language approach. In fact, figure 2 shows how the CARPET constructs can be used to implement the wireworld model proposed in 1990 by A. K. Dewdney [3] to build and simulate a wide variety of circuitry. In this simple CA model each cell has 4 possible states: space, wire, electron head or electron tail. This simple automaton models electrical pulses with heads and tails, giving them a direction of travel. Cells interact with their 8 neighbours by the following rules: space cells forever remain space cells, electron tails turn into wire cells, electron heads turn into electron tails, and wire cells remain wire cells unless bordered by 1 or 2 electron heads, in which case they become electron heads. By taking special care in the arrangement of the wire (initial configuration of the lattice), with these basic rules electrons composed of heads and tails can move along wires, and you can build and test diodes, OR gates, NOT gates, memory cells, wire crossings and much more complex circuitry.
4.2. A forest fire model
We show here the basic algorithm of a CARPET implementation of a real-life complex application. Preventing and controlling forest fires plays an important role in forest management. Fast and accurate models can aid in managing the forests as well as controlling fires. This programming example concerns a simulation of the propagation of a forest fire that has been modeled as a two-dimensional space partitioned into square cells of uniform size (figure 3); a simple illustrative sketch of such a transition rule is shown below.
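The following C fragment sketches one very simple formulation of such a fire-spreading rule. It is an assumption made for illustration, not the authors' actual model or CARPET code: a tree cell starts burning if at least one of its von Neumann neighbours is burning, and a burning cell burns out after one step.

/* A deliberately simple forest-fire transition rule (illustrative only). */
enum { EMPTY = 0, TREE = 1, BURNING = 2, BURNT = 3 };

int forest_fire_rule(int self, int north, int south, int east, int west)
{
    if (self == BURNING)
        return BURNT;                     /* a burning cell burns out         */
    if (self == TREE &&
        (north == BURNING || south == BURNING ||
         east  == BURNING || west  == BURNING))
        return BURNING;                   /* fire spreads from a neighbour    */
    return self;                          /* EMPTY and BURNT cells are inert  */
}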
#define space     0
#define wire      1
#define electhead 2
#define electail  3
cadef
{
  dimension 2;   /* bidimensional lattice */
  radius 1;
  state (short content);
  neighbor moore[8] ([0,-1]North, [-1,-1]NorthWest, [-1,0]West, [-1,1]SouthWest, [0,1]South, [1,1]SouthEast, [1,0]East, [1,-1]NorthEast);
}
int i;
short count;
{
  count = 0;
  for (i = 0; i

< condition >:< action > with the interpretation of the following decision rule: if a current observed state of the environment matches the condition, then execute the action. The conditional part of a classifier contains some description of the environment, expressed with use of the symbols {0,1}, and additionally a don't-care symbol #. The action part of a classifier contains an action of the CS, associated with the condition. The usefulness of a classifier c, applied in a given situation, is measured by its strength str. A real-valued strength of a classifier is estimated in terms of rewards for its action obtained from the environment. If a measurement of the environment matches the conditional part of a classifier then the classifier is activated and becomes a candidate to send its action to the environment. Action selection is implemented by a competition mechanism based on auction [2], where the winner is the classifier with the highest strength. To modify classifier strengths the simplified credit assignment algorithm [2] is used. The algorithm consists in subtracting a tax of the winning classifier from its strength, and then dividing equally the reward received after executing an action among all classifiers matching the observed state. The strength of a classifier has the same meaning as the fitness function of an individual in a genetic algorithm (GA) (see, e.g. [2]). Therefore, a standard GA with three basic genetic operators: selection, crossover and mutation is applied to create new, better classifiers.
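A small sketch of how such a condition with don't-care symbols can be matched against an observed environment message follows. It is illustrative C with assumed representations; the paper itself does not give this code.

#include <stdbool.h>
#include <string.h>

/* Returns true if the classifier condition matches the observed message.
 * Both are strings over {'0','1'}, and the condition may also contain '#',
 * which matches either value in that position. */
bool condition_matches(const char *condition, const char *message)
{
    size_t len = strlen(condition);
    if (len != strlen(message))
        return false;
    for (size_t i = 0; i < len; i++) {
        if (condition[i] != '#' && condition[i] != message[i])
            return false;
    }
    return true;
}

/* Example: condition "1#0##01" matches message "1001101" but not "0001101". */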
3 Multi-agent Approach to Multiprocessor Scheduling
A multiprocessor system is represented by an undirected unweighted graph G_s = (V_s, E_s) called a system graph. V_s is the set of N_s nodes representing processors and E_s is the set of edges representing channels between processors. A parallel program is represented by a weighted directed acyclic graph G_p = <V_p, E_p>, called a precedence task graph or a program graph. V_p is the set of N_p nodes of the graph, representing elementary tasks. The weight b_k of the node k describes the processing time needed to execute task k on any processor of the system. E_p is the set of edges of the precedence task graph describing the communication patterns between the tasks. The weight a_kl, associated with the edge (k, l), defines the communication time between the ordered pair of tasks k and l when they are located in neighboring processors. If the tasks k and l are located in processors corresponding to vertices u and v in G_s, then the communication delay between them will be defined as a_kl · d(u, v), where d(u, v) is the length of the shortest path in G_s between u and v.
The purpose of scheduling is to distribute the tasks among the processors in such a way that the precedence constraints are preserved, and the response time T is minimized. T depends on the allocation of tasks in the multiprocessor topology and scheduling policy applied in individual processors:
T = f(allocation, scheduling policy).     (1)
We assume that the scheduling policy is fixed for a given run. The scheduling policy accepted in this work assumes that, among the tasks ready to run on a given processor, the task with the greatest number of successors has the highest priority. The priority p_k of a task k is calculated using the following recurrent formula:

p_k = s_k + \sum_{n_k=1}^{s_k} p_{k_{n_k}},     (2)

where s_k is the number of immediate successors of a task k, and p_{k_{n_k}} is the priority of the n_k-th immediate successor of the task k. For the purpose of the scheduling algorithm we specify two additional parameters of a task k mapped into a system graph: the Message Ready Time (MRT) predecessor of the task k, and the MRT successor of the task k. The MRT predecessor of a task k is the predecessor from which the task k receives data last. The task can be processed only when the data from all predecessors have arrived. The MRT successor of the task k is a successor for which the task is the MRT predecessor. We propose an approach to multiprocessor scheduling based on a multi-agent interpretation of the parallel program. We assume that an agent associated with a given task can perform a migration in the system graph. The purpose of the migration is to search for an optimal allocation of program tasks to the processors, according to (1). We assume that the decision about the migration of a given agent will be taken by a CS, after the agent presents local information about its location in the system graph.
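A small Python sketch of the recurrent priority formula (2), computed over a task graph given as successor lists; the memoisation and the data layout are our own choices, not the paper's.

from functools import lru_cache

successors = {            # toy precedence graph: task -> immediate successors
    0: [1, 2],
    1: [3],
    2: [3],
    3: [],
}

@lru_cache(maxsize=None)
def priority(k):
    # p_k = s_k + sum of the priorities of the immediate successors of k
    succ = successors[k]
    return len(succ) + sum(priority(n) for n in succ)

# priority(3) = 0, priority(1) = priority(2) = 1, priority(0) = 2 + 1 + 1 = 4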
4 An Architecture of a Classifier System to Support Scheduling
To adjust the CS for scheduling we need to interpret the notion of the environment of the CS. The environment of the CS is represented by some information concerning the position of a given task in the system graph. A message containing such information consists of 7 bits:
– bit 0: value 0 - the task does not have any predecessors; value 1 - the task has at least one predecessor
– bit 1: value 0 - the task does not have any successors; value 1 - the task has at least one successor
– bit 2: value 0 - the task does not have any brothers; value 1 - the task has brothers
– bits 3 and 4: value 00 - no MRT successor of the task is allocated on the processor where the task is allocated; value 01 - some MRT successors are allocated on the same processor where the task is allocated; value 11 - all MRT successors are allocated on the same processor where the task is allocated; value 10 - the task does not have any MRT successors
– bits 5 and 6: value 00 - no MRT predecessor of the task is allocated on the processor where the task is allocated; value 01 - some MRT predecessors are allocated on the same processor where the task is allocated; value 11 - all MRT predecessors are allocated on the same processor where the task is allocated; value 10 - the task does not have any MRT predecessors.
The list of actions of the CS contains 8 actions:
– action 0: do nothing - the task does not migrate from a given location (processor) to any other processor of the system
– action 1: random action - a randomly chosen action from the set of all actions, except action 1, will be performed
– action 2: random node - the task migrates to one randomly chosen processor of the system
– action 3: pred rnd - the task migrates to the processor where a randomly selected predecessor of the task is located
– action 4: succ rnd - the task migrates to the processor where a randomly selected successor of the task is located
– action 5: less neighbours - the task migrates to the processor where the smallest number of neighbours of the task is located
– action 6: succ MRT - the task migrates to the processor where its MRT successor is located
– action 7: pred MRT - the task migrates to the processor where its MRT predecessor is located.
The conditional part of a classifier contains information about the specific situation of a given task which must be satisfied to execute the action of the classifier. For example, a classifier < # 1 # 0 0 # 0 >:< 6 > can be interpreted in the following way: IF: it does not matter whether the task has predecessors or not (symbol: #) AND IF: the task has successors (symbol: 1) AND IF: it does not matter whether the task has brothers or not (symbol: #) AND IF: none of the MRT successors of the task is located on the processor where the task is located (symbols: 00) AND IF: none of the MRT predecessors of the task is located on the processor where the task is located, or the task does not have MRT predecessors (symbols: # 0) THEN: move the task to the processor where an MRT successor of the task is located (symbol: 6).
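The 7-bit environment message can be assembled as in the following Python sketch; the helper name and the way placement information is passed in are illustrative assumptions, not the authors' implementation.

def encode_message(has_pred, has_succ, has_brothers, mrt_succ_local, mrt_pred_local):
    # Build the 7-bit message described above.
    # mrt_succ_local / mrt_pred_local take one of:
    #   'none' -> 00, 'some' -> 01, 'all' -> 11, 'no_mrt' -> 10
    two_bit = {'none': '00', 'some': '01', 'all': '11', 'no_mrt': '10'}
    return (('1' if has_pred else '0') +
            ('1' if has_succ else '0') +
            ('1' if has_brothers else '0') +
            two_bit[mrt_succ_local] +
            two_bit[mrt_pred_local])

# task 0 of gauss18: no predecessors, has successors, no brothers,
# no MRT successor on the same processor, no MRT predecessors at all
msg = encode_message(False, True, False, 'none', 'no_mrt')   # -> '0100010'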
classifier   strength of classifiers after execution of action by subsequent agents
             initial      0        1        2        3        4        5
 0 :         300.00   298.35   298.20   395.07   490.94   485.79   485.55
 1 :         300.00   299.85   299.70   299.55   299.40   299.25   299.10
 2 :         300.00   299.85   297.70   297.56   297.41   297.26   297.11
 3 :         300.00   299.85   299.70   299.55   299.40   299.25   299.10
 4 :         300.00   299.85   299.70   299.55   299.40   299.25   299.10
 5 :         300.00   299.85   299.70   299.55   299.40   299.25   297.11
 6 :         300.00   299.85   299.70   299.55   299.40   299.25   299.10
 7 :         300.00   298.35   298.20   298.05   297.90   297.75   297.60
 8 :         300.00   299.85   299.70   299.55   299.40   299.25   299.10
 9 :         300.00   299.85   299.70   299.55   299.40   299.25   299.10
10 :         300.00   299.85   299.70   299.55   299.40   299.25   299.10
11 :         300.00   299.85   299.70   299.55   299.40   299.25   299.10
12 :         300.00   299.85   298.20   298.05   297.90   297.75   297.60
13 :         300.00   299.85   298.20   298.05   297.90   297.75   296.11
14 :         300.00   297.85   297.70   297.56   297.41   297.26   297.11
15 :         300.00   299.85   299.70   299.55   299.40   299.25   299.10
16 :         300.00   299.85   299.70   299.55   299.40   299.25   299.10
17 :         300.00   299.85   299.70   299.55   299.40   299.25   299.10
18 :         300.00   299.85   299.70   299.55   299.40   299.25   299.10
19 :         300.00   299.85   299.70   299.55   299.40   299.25   299.10

Fig. 1. Initial population of classifiers in the first few steps of the scheduler's operation.
5 Experimental Results
Experiment #1: Step-by-step simulation (problem: gauss18 -> full2)
We will analyze some initial steps of the work of the scheduler, solving the scheduling problem for the program graph gauss18 ([3], see Fig. 2a) processed in the 2-processor system full2. The program contains 18 tasks and is initially allocated as shown in Fig. 2b, with response time T = 74. Fig. 1 shows an initial population of the CS containing 20 classifiers, with the initial strength of each equal to 300. The agent A0 sends its message < 0100010 > to the CS first. The message describes the actual situation of the task 0 (as shown in Fig. 2a, b) and contains the following information: the task 0 does not have any predecessor; it has successors; it does not have any brothers; all MRT successors are located on a processor different from the processor where task 0 is located; and the task does not have any MRT predecessors. The message matches three classifiers of the CS: the classifiers 0, 7 and 14. The winner of the competition between them is the classifier 14, and its action < 3 > is passed to the agent A0. The action says: migrate to a processor where your randomly chosen predecessor is located. The agent A0 cannot execute this action, because the task 0 does not have any predecessors. So, the allocation of tasks and the corresponding value of T are not changed. The CS receives a reward
Fig. 2. Program graph gauss18 (a) and Gantt charts (b), (c), (d) for allocations of tasks in full2, corresponding to actions of classifiers shown in Fig. 1.
equal to 1 for this action, because it does not change the value of T (the value of the reward is a user-defined parameter). The reward increases the strength of the classifier 14. The new strengths of the classifiers, shown in Fig. 1 in the column corresponding to the agent 0, are the result of applying taxes. The next agent which sends a message to the CS is the agent A1, and the message is < 1011000 >. The message matches classifiers 2, 12 and 13. The winner of the competition is the classifier 2, which sends the action < 1 > to A1. As the result of this action (random action), the action 0 is chosen (do nothing). The allocation of tasks remains the same, and the classifier 2 receives the reward equal to 1. All classifiers pay the life tax, the classifiers 2, 12 and 13 pay the bid tax, and the winner also pays a tax, which results in new values of the strengths of the classifiers. The agent A2 sends to the CS the message < 1110000 >. It matches only the classifier 0, and the action < 7 > is executed by the agent. The execution of the action results in the migration of the task 2 from the processor 1 to the processor 0, where the task's MRT predecessor is located. The changed allocation of tasks reduces T to the value 68 (see Fig. 2c). The classifier 0 receives the reward equal to 100 for the improvement of T (the user-defined value for improvement).
             full2   full8   ring8   cube8   de Bruijn8
tree15           9       7       7       7        7
gauss18         44      44      44      44       44
g18             46      24      27      25       25
g40             80      32      36      34       34
fft64         2055     710     841     778      779
Rnd25_1        495     289     346     327      313
Rnd25_5         95      95      95      95       95
Rnd25_10        62      62      62      62       62
Rnd50_1        890     394     550     502      477
Rnd50_5        207     201     209     216      205
Rnd50_10       138     141     141     138      138
Rnd100_1      1481     582     789     703      671
Rnd100_5       404     364     432     422      389
Rnd100_10      175     173     179     178      172
Fig. 3. The best response times received for different program and system graphs.
Next, the agent A3 sends the message < 1110000 >, the same message as the one sent by the agent A2. The message matches only the classifier 0, which causes the execution by the agent of the same action as previously, and the migration of the task 3 to the processor 0. The new value T = 61 is better than the previous value, and the classifier 0 again receives the reward equal to 100. The agent A4 sends the same message to the CS as the agents A3 and A2 did. However, an attempt to execute the action < 7 > by the agent, i.e., the migration of the task 4 from the processor 1 to the processor 0, increases T to the value 62, so the execution of the action is cancelled and the classifier 0 receives the reward equal to 0 (the user-defined value for causing a worse result). The action executed by the agent A5 does not change the value of T. The message < 1111000 > of the agent matches the classifiers 5 and 13, and the classifier 5 with the action < 5 > is the winner. The execution of the action, i.e., the migration of the task 5 from the processor 1 to the same processor, obviously does not change the tasks' allocation. Agents execute their actions sequentially, in the order of their numbering in the program graph. After the execution of an action by the agent A17, the sequence of actions is repeated again, starting from the agent A0. In the considered experiment, the actions of the next several agents do not improve the value of T. The next improvement of T appears as the result of the execution of an action by the agent A15 (T = 46). The last migration of a task, which decreases T to T = 44, takes place in iteration 38. The found value of T (see Fig. 2d) is optimal and cannot be improved.
Experiment #2: Response times for different scheduling problems
The scheduling algorithm was used to find the response time T for deterministic program graphs such as tree15, gauss18, g18, g40 and fft64, known from the
literature, processed in different topologies of multiprocessor systems, such as full2, full8, ring8, cube8 and de Bruijn8. Also a number of random graphs were used, with 25, 50 and 100 tasks on average (Rnd25_x, Rnd50_x, Rnd100_x), where x denotes the ratio of the average communication time a_kl in a program graph to the average processing time b_k of the tasks in the program graph. Fig. 3 summarizes the results. The results obtained for the deterministic graphs are the same as those known from the literature. The results obtained for the random graphs were compared with results (not shown) obtained with GA-based algorithms, such as parallel GAs of the island and diffusion models. The results obtained with the scheduler are significantly better than those obtained with the parallel GAs.

6 Conclusions
We have presented results of our research on the development of scheduling algorithms in which the scheduling process is supported by a genetic algorithm-based learning classifier system. The results of the experimental study of the system are very promising. They show that the CS is able to develop effective scheduling rules during its operation, and that the solutions found with the CS outperform those obtained by applying non-learning GA-based algorithms.
Acknowledgement
The work has been partially supported by the State Committee for Scientific Research (KBN) under Grant 8 T11A 009 13.
References
1. S. Chingchit, M. Kumar and L. N. Bhuyan, A Flexible Clustering and Scheduling Scheme for Efficient Parallel Computation, in Proc. of the IPPS/SPDP 1999, April 12-16, 1999, San Juan, Puerto Rico, USA, pp. 500-505.
2. D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, Reading, MA, 1989.
3. Y. K. Kwok and I. Ahmad, Dynamic Critical-Path Scheduling: An Effective Technique for Allocating Task Graphs to Multiprocessors, IEEE Trans. on Parallel and Distributed Systems, 7, N5, May 1996, pp. 506-521.
4. S. Mounir Alaoui, O. Frieder and T. El-Ghazawi, A Parallel Genetic Algorithm for Task Mapping on Parallel Machines, in J. Rolim et al. (Eds.), Parallel and Distributed Processing, LNCS 1586, Springer, 1999, pp. 201-209.
5. A. Radulescu, A. J. C. van Gemund and H.-X. Lin, LLB: A Fast and Effective Scheduling for Distributed-Memory Systems, in Proc. of the IPPS/SPDP 1999, April 12-16, 1999, San Juan, Puerto Rico, USA, pp. 525-530.
6. S. Salleh and A. Y. Zomaya, Multiprocessor Scheduling Using Mean-Field Annealing, in J. Rolim (Ed.), Parallel and Distributed Processing, LNCS 1388, Springer, 1998, pp. 288-296.
7. F. Seredynski, Scheduling tasks of a parallel program in two-processor systems with use of cellular automata, Future Generation Computer Systems 14, 1998, pp. 351-364.
This article was processed using the LaTeX macro package with LLNCS style.
Viewing Scheduling Problems through Genetic and Evolutionary Algorithms
Miguel Rocha, Carla Vilela, Paulo Cortez, and Jose Neves
Dep. Informatica - Universidade do Minho - Braga - PORTUGAL
{mrocha, cvilela, pcortez, [email protected]

Abstract. In every system where the resources to be allocated to a given set of tasks are limited, one is faced with scheduling problems that heavily constrain the enterprise's productivity. The scheduling tasks are typically very complex, and although there has been a growing flow of work in the area, the solutions are not yet at the desired level of quality and efficiency. The Genetic and Evolutionary Algorithms (GEAs) offer, in this scenario, a promising approach to problem solving, considering the good results obtained so far in complex combinatorial optimization problems. The goal of this work is, therefore, to apply GEAs to the scheduling processes, giving special attention to indirect representations of the data. One will consider the case of the Job Shop Scheduling Problem, the most challenging and common in industrial environments. A specific application, developed for a Small and Medium Enterprise, the Tipografia Tadinense, Lda, will be presented.
Keywords: Genetic and Evolutionary Algorithms, Job Shop Scheduling.
1 Introduction
In every industrial environment one is faced with a diversity of scheduling problems which can be difficult to solve. Once a good solution is found, it produces very tangible results in terms of the way the resources are used to maximize the profits. The scheduling problems are typically NP-Complete, thus not having the warranty of solvability in polynomial time. Indeed, although there has been a steady evolution in the areas of Artificial Intelligence (AI) and Operational Research (OR) aiming at the development of techniques to give solutions to this type of problems, the basic question has not yet been solved. The Genetic and Evolutionary Algorithms (GEAs) mimic the process of natural selection and have been used to address complex combinatorial optimization problems. Using an evolutionary strategy, the GEAs' objective is to maximize/minimize an objective function f : S(R) ->

; j in [1 : n], j was not visited}| be the number of possible successors that have a probability > λ to be chosen. Then, the average number of alternatives with probability > λ during a generation is

D_λ = (1/(mn)) \sum_{k=1}^{m} \sum_{i=1}^{n} D_λ^{(k)}(i)
Clearly, D_λ >= 1 always holds for λ < 1/(n-1). Note that a similar measure, the λ-branching factor, was used in [7] to measure the dimension of the search space. In contrast to D_λ, the λ-branching factor considers all other cities as possible successors, not only those cities that have not yet been visited by the ant. Hence, D_λ takes into account only the alternatives that the ants really meet, whereas the λ-branching factor is a more abstract and problem-dependent measure. Figure 2 shows the influence of the information exchange strategies on D_λ for λ = 0.01 when information exchange is done every I = 10, respectively I = 50, generations. The figure shows that after every information exchange step the D_λ value becomes larger. But in all cases it falls below 2 before the 80th generation. After generation 150 the D_λ values for method (1) are always lower than for the other methods. They are below 1.1 after generation 270 for I = 10,
respectively after generation 290 for I = 50. It is interesting that during the first 100-150 generations D_λ falls fastest for method (3), but in generation 500 the D_λ value of 1.08 in the case I = 10 is the largest. The D_λ values of methods (2) and (4) with circular information exchange of local best solutions are quite similar. They are always larger than those of method (1). Compared to method (3) they are smaller after generation 300 in the case I = 10, but are always larger in the case I = 50. Table 1 shows the lengths of the best found tours after 500 generations with methods (1)-(4), and for the case that no information exchange takes place, when I = 50. In the case of no information exchange it is better to have one large colony than several smaller ones (see also Figure 3). It was observed that the length of the solution found by one colony does not change any more after generation 250. For methods (1) and (3) there is no advantage in having several colonies over just one colony. It seems that the exchange of only a few migrants in method (1) is so weak that the colonies cannot really profit from it. It should be noted that the picture changes when information exchange is done more often. E.g., for I = 10 we found that 5 colonies are better than one (the best found solution was 638.65 in this case).
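A hedged Python sketch of how D_λ can be computed from the ants' selection probabilities; the way the per-step probabilities are supplied is an assumption of ours, not part of the paper.

def average_alternatives(prob_per_step, lam=0.01):
    # D_lambda: average, over all ants and construction steps, of the number of
    # not-yet-visited successors whose selection probability exceeds lam.
    # prob_per_step[k][i] is the list of selection probabilities ant k saw at step i.
    total, steps = 0, 0
    for ant in prob_per_step:
        for probs in ant:
            total += sum(1 for p in probs if p > lam)
            steps += 1
    return total / steps

# two ants, two steps each, over a tiny instance
probs = [[[0.7, 0.2, 0.1], [0.9, 0.1]],
         [[0.4, 0.4, 0.2], [0.5, 0.5]]]
d_lam = average_alternatives(probs, lam=0.25)   # -> (1 + 1 + 2 + 2) / 4 = 1.5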
Table 1. Different strategies of information exchange: best found solution after 500 generations, I = 50
        No information   Exchange of      Circular exchange   Circular
        exchange         globally best    of locally best     exchange of
                         solution         solutions           migrants
N=1     640.15           --               --                  --
N=5     642.85           640.70           637.10              643.15
N=10    642.85           641.65           637.10              642.75
N=20    648.00           642.90           640.45              645.45
Methods (2) and (4), where local best solutions are exchanged between neighbouring colonies in the ring, perform well. Figure 3 shows that the solutions found with method (2) by 5 or 10 colonies are always better than those of one colony after 250 generations, respectively 350 generations, for I = 50. In generation 500 the length of the best solution found by 10 colonies is about the same as that found by 5 colonies. Moreover, the curves show that there is still potential for further improvement after 500 generations for the 10 colonies and the 20 colonies. The curves for method (4) are quite similar to those in Figure 3 and are omitted. Table 2 shows the behaviour of method (2) when the information exchange is done more often, i.e., every 10 or 5 generations. For an exchange after every 5 generations the solution quality found in generation 500 is not, or only slightly, better for the multiple colonies compared to the case with one colony. It seems that in this case the information exchange is too much, in the sense that the colonies cannot evolve into different directions.
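The circular exchange of locally best solutions can be sketched as below in Python; the colony data structure and the update policy shown here are our own simplification of methods (2)/(4), not the authors' code.

def circular_exchange(colonies):
    # Every colony receives the locally best solution of its predecessor in a
    # directed ring and keeps it if it is shorter than its own best.
    received = [colonies[(k - 1) % len(colonies)]['best'] for k in range(len(colonies))]
    for colony, incoming in zip(colonies, received):
        if incoming['length'] < colony['best']['length']:
            colony['best'] = dict(incoming)

# toy usage: 3 colonies, called every I generations inside the main loop
colonies = [{'best': {'tour': [0, 1, 2], 'length': 650.0}},
            {'best': {'tour': [0, 2, 1], 'length': 642.0}},
            {'best': {'tour': [1, 0, 2], 'length': 660.0}}]
circular_exchange(colonies)   # the third colony adopts the second colony's 642.0 tour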
Table 2. Circular exchange of locally best solutions
        I=5      I=10     I=50
N=5     642.30   638.90   637.10
N=10    642.90   638.55   637.10
N=20    639.35   638.20   640.45
Fig. 1. Difference between the pheromone matrices (curves: local best solution, global best solution, migrants + local best solution, migrants). Left: migration interval 10; right: migration interval 50.
6 Conclusion
Different methods for information exchange in multi colony ant algorithms were studied. Clearly, ant algorithms with several colonies that do not exchange too much information can effectively be parallelized. It was shown that even the solution quality can improve when the colonies do not exchange too much information. Instead of exchanging the local best solution very often and between all colonies, it is better to exchange the local best solution only with the neighbour in a directed ring, and not too often.
Fig. 2. Average number of alternatives D_λ (curves: local best solution, global best solution, migrants + local best solution, migrants). Left: migration interval 10; right: migration interval 50.
Fig. 3. Best found solution for 1 colony with 100 ants, 5 colonies with 20 ants, 10 colonies with 10 ants, and 20 colonies with 5 ants. Left: no information exchange; right: circular exchange of the locally best solution, migration interval 50.
References
1. M. Bolondi, M. Bondaza: Parallelizzazione di un algoritmo per la risoluzione del problema del commesso viaggiatore; Master's thesis, Politecnico di Milano, 1993.
2. B. Bullnheimer, R. F. Hartl, C. Strauss: A New Rank Based Version of the Ant System - A Computational Study; CEJOR, Vol. 7, 25-38, 1999.
3. B. Bullnheimer, G. Kotsis, C. Strauss: Parallelization Strategies for the Ant System; in: R. De Leone et al. (Eds.), High Performance Algorithms and Software in Nonlinear Optimization; series: Applied Optimization, Vol. 24, Kluwer, 87-100, 1998.
4. M. Dorigo: Optimization, Learning and Natural Algorithms (in Italian). PhD thesis, Dipartimento di Elettronica, Politecnico di Milano, 1992.
5. M. Dorigo: Parallel ant system: An experimental study; Unpublished manuscript, 1993.
6. M. Dorigo, V. Maniezzo, A. Colorni: The Ant System: Optimization by a Colony of Cooperating Agents; IEEE Trans. Sys., Man, Cybernetics - B, 26, 29-41, 1996.
7. L. M. Gambardella, M. Dorigo: Ant-Q: A Reinforcement Learning Approach to the Traveling Salesman Problem; Proceedings of ML-95, Twelfth Intern. Conf. on Machine Learning, Morgan Kaufmann, 252-260, 1995.
8. U. Kohlmorgen, H. Schmeck, K. Haase: Experiences with fine-grained parallel genetic algorithms; Ann. Oper. Res., 90, 203-219, 1999.
9. F. Kruger, M. Middendorf, D. Merkle: Studies on a Parallel Ant System for the BSP Model; Unpublished manuscript, 1998.
10. R. Michels, M. Middendorf: An Ant System for the Shortest Common Supersequence Problem; in: D. Corne, M. Dorigo, F. Glover (Eds.), New Ideas in Optimization, McGraw-Hill, 1999, 51-61.
11. T. Stützle: Parallelization strategies for ant colony optimization; in: A. E. Eiben, T. Bäck, M. Schoenauer, H.-P. Schwefel (Eds.), Parallel Problem Solving from Nature - PPSN V, Springer-Verlag, LNCS 1498, 722-731, 1998.
12. T. Stützle, H. Hoos: Improvements on the ant system: Introducing MAX-MIN ant system; in: G. D. Smith et al. (Eds.), Proc. of the International Conf. on Artificial Neural Networks and Genetic Algorithms, Springer-Verlag, 245-249, 1997.
13. E.-G. Talbi, O. Roux, C. Fonlupt, D. Robillard: Parallel ant colonies for combinatorial optimization problems; in: J. Rolim et al. (Eds.), Parallel and Distributed Processing, 11 IPPS/SPDP'99 Workshops, LNCS 1586, Springer, 239-247, 1999.
14. http://www.iwr.uni-heidelberg.de/iwr/comopt/soft/TSPLIB/TSPLIB.html
A Surface-Based DNA Algorithm for the Expansion of Symbolic Determinants
Z. FRANK QIU and MI LU
Department of Electrical Engineering, Texas A&M University
College Station, Texas 77843-3128, U.S.A.
{zhiquan, mlu}@ee.tamu.edu
Abstract. In the past few years since Adleman's pioneering work on solving the HPP (Hamiltonian Path Problem) with a DNA-based computer [1], many algorithms have been designed for solving NP problems. Most of them are solution-based and need some error-correction or error-tolerance technique in order to get good and correct results [3] [7] [9] [11] [21] [22]. The advantage of the surface-based DNA computing technique, with its very low error rate, has been shown many times [12] [18] [17] [20] over solution-based DNA computing, but this technique has not been widely used in the design of DNA computer algorithms. This is mainly due to the restrictions of the surface-based technique compared with those methods using DNA strands in solution. In this paper, we introduce a surface-based DNA computing algorithm for solving a hard computation problem: the expansion of symbolic determinants given their patterns of zero entries. This problem is well known for its exponential difficulty. It is even more difficult than evaluating determinants whose entries are merely numerical [15]. We will show how this problem can be solved with the low-error-rate surface-based DNA computer using our naive algorithm.
1 Introduction
Although there is a flood of ideas about using DNA computers to solve difficult computing problems [2] [16] [19] [15] since Adleman [1] and Lipton [16] presented their ideas, most of them use DNA strands in solution. They all take advantage of the massive parallelism available in DNA computers, as one liter of water can hold 10^22 bases of DNA strands. Because they all let DNA strands float in solution, it is difficult to handle samples, and strands may get lost during some bio-operations. A well-developed method, in which the DNA strands are immobilized on a surface before any other operations, was introduced to the DNA computing area by Liu [18]. This method, which is called surface-based DNA computing, first attaches a set of oligos to a surface (glass, silicon, gold, etc.). They are then subjected to operations such as hybridization from solution or exonuclease degradation, in order to extract the desired solution. This method greatly reduces losses
of DNA molecules during purification steps [18]. The surface-based chemistries have become the standard for complex chemical syntheses and many other chemistries. Although the surface-based DNA computer has been demonstrated to be more reliable, with a low error rate, and easier to handle [8] [12] [18] [20], only a little research on utilizing these properties of this kind of computer has been presented [12]. This happens mainly because, when the oligos are attached to a surface, we lose flexibility due to the restriction that the oligos cannot grow in the direction of the attachment on the surface. In order to take advantage of the newly matured method, algorithms for surface-based computing need to be developed. In this paper, we present a new algorithm to be implemented on a surface-based DNA computer that will take full advantage of these special low-error-rate properties. We will use the expansion of symbolic determinants problem as an example to show the advantage of our algorithm compared with an existing algorithm based on a general DNA computer in solution. Both algorithms will be able to solve some intractable problems that are unrealistic to solve by current conventional electronic computers because of the intense computing power requirement. These problems are harder to solve than the problems in NP-Complete. Our algorithm has all the advantages of surface-based computers over an existing algorithm introduced in [15]. The rest of the paper is organized as follows: the next section will explain the methodology, including the logical and biological operations of surface-based DNA computers. The problem of the expansion of symbolic determinants and our algorithm to solve it will be presented in Section 3. Section 4 will analyze our new surface-based algorithm, and the last section will conclude this paper.
2 Surface-Based Operations
In this section, we show the logical operations of DNA computers and then explain how these operations can be implemented on surface-based DNA computers. All these operations are necessary for solving the computationally hard problem given in the next section. A simple version of the surface-based DNA computer uses three basic operations, mark, unmark, and destroy [17], plus the initialization and append operations introduced in [8]. The explanations of these operations are as follows.
2.1 Abstract Model
1. reset(S): It can also be called initialization. This step will generate all the strands for the following operations. These strands in set S can be generated to represent either the same value or different values, according to the requirement.
2. mark(C, S): All strands in set S satisfying the constraint C are identified as marked. A strand satisfies this constraint if and only if there is a number represented by the strand whose bit i agrees with the bit value specified in the constraint. If no constraint is given, all strands are marked [8].
3. unmark(): Unmark all the marked strands.
4. delete(C): All strands satisfying condition C are removed from set S, where C ∈ {marked, unmarked}.
5. append(C, X): A word X represented by a strand segment is appended to all strands satisfying constraint C. C can be defined as marked or unmarked. If the constraint is marked strands, a word X is appended to all marked strands. Otherwise, a word X will be appended to all unmarked strands.
6. readout(C, S): This operation will select an element in S following criteria C. If no C is given, then an element is selected randomly. We will use this step to obtain the expected answer.

2.2 Biological Implementation
In this section, we include the fundamental biological operations for our surface-based DNA computation model.
1. reset(S): The initialization operation used here is different from those widely used biological DNA operations described in [1] [2] [4] [10] [19]. All the strands generated are attached to a surface instead of floating in the solution. In order to prepare all these necessary strands on the surface, both the surface and one end of the oligonucleotides are specially prepared to enable this attachment. A good attachment chemistry is necessary to ensure that the properly prepared oligonucleotides can be immobilized to the surface at a high density and that unwanted binding will not happen on the surface [8] [18] [17].
2. mark(C, S): Strands are marked simply by making them double-stranded at the free end, as all the strands on the surface are single strands at the beginning. The single strands being added into the container will anneal with the strand segments that need to be marked. Partial double strands will be formed according to the Watson-Crick (WC) complement rule [1] [16] [6].
3. unmark(): This biological operation can be implemented using the method introduced in [8]. Simply washing the surface in distilled water, and raising the temperature if necessary, will yield a container with only single strands attached to the surface. Because of the absence of salt, which stabilizes the double-strand bond, the complementary strands will denature from the oligonucleotides on the surface and will be washed away.
4. delete(C): This operation can be achieved using some enzymes known as exonucleases, which chew up DNA molecules from the end. Details of this operation are introduced in [8]. Exonucleases exist with specificity for either the single- or double-stranded form. By picking different enzymes, marked (double-stranded) or unmarked (single-stranded) strands can be destroyed selectively.
5. append(C, X): Different operations are used depending on whether marked or unmarked strands are going to be appended. If X is going to be appended
to all marked strands, the following bio-operations will be used for appending. Since marked strands are double stranded at the free terminus, the append operation can be implemented using the ligation at the free terminus. The method introduced in [8] can be used here. More details may be found in [8]. To append to unmarked strands, simple hybridization of a splint oligonucleotide followed by ligation as explained in [1] [16] may be used. 6. readout(C, S): This procedure will actually extract out the strand we are looking for. There are many existing methods developed for solution based DNA computing readout [1] [6] [20]. In order to use these methods, we have to detach the strands from the surface first. Some enzymes can recognize short sequences of bases called restriction sites and cut the strand at that site when the sequence is double-stranded [8]. When the segment which is attaching to the surface contains this particular sequence, they can all be detached from the surface when the enzyme is added in.
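The abstract model of Sect. 2.1 can also be mimicked in ordinary software to reason about algorithms before any chemistry is involved. The following Python sketch is only an illustration of the operations' semantics under our own assumptions about the data layout; it does not model the bio-operations themselves.

class Surface:
    # Software stand-in for the surface: each strand is a list of words plus a marked flag.
    def __init__(self, n_strands):
        self.strands = [{'words': [], 'marked': False} for _ in range(n_strands)]

    def mark(self, predicate):
        for s in self.strands:
            if predicate(s['words']):
                s['marked'] = True

    def unmark(self):
        for s in self.strands:
            s['marked'] = False

    def delete(self, marked):
        # remove all strands whose marked status equals the given one
        self.strands = [s for s in self.strands if s['marked'] != marked]

    def append(self, marked, word):
        for s in self.strands:
            if s['marked'] == marked:
                s['words'].append(word)

    def readout(self):
        return [s['words'] for s in self.strands]

# toy usage
surf = Surface(n_strands=4)
surf.append(marked=False, word='a11')        # initially nothing is marked
surf.mark(lambda words: 'a11' in words)
surf.delete(marked=False)                    # destroy the unmarked strands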
3 Hard Computation Problem Solving

3.1 Expansion of Symbolic Determinants Problem

We will use the expansion of symbolic determinants problem as an example to show how our surface-based DNA computer can be used to solve hard problems that are unsolvable by current electronic computers.
Problem: Assume the matrix is n x n:

    | a_11  a_12  a_13  ...  a_1n |
    | a_21  a_22              .   |
    | a_31         .          .   |
    |  .              .       .   |
    | a_n1    .      .   ...  a_nn|

Generally, the determinant of a matrix is

    det(A) = \sum_{σ∈S_n} (-1)^σ a_{σ_1 1} · · · a_{σ_n n}     (1)

where S_n = (σ_1, ..., σ_n) is a permutation space [13] [5] [14]. A complete matrix expansion has n! terms. When there are many zero entries inside, the expansion will be greatly simplified. We are going to solve this kind of problem: to obtain the expansion of matrices with many zero entries in them.

3.2 Surface-Based Algorithm
In order to make the process easy, we encode each item a_ij of the matrix by two parts: (a_ij)_L and (a_ij)_R, where all the (a_kj)_L's are with the same k but
different j, and all the (a_ik)_R's are with the same k but different i. Using this coding method, all items from the same row will have the same left-half code, and all the items from the same column will have the same right-half code. It is as if we construct a_ij by combining a_i and a_j. So, for example, a13 and a19 will be represented by the same left-half segment but different right halves, because they are in the same row but in different columns. For another example, a14 and a84 will have the same right half but different left halves, because they are in the same column but in different rows. The following is an algorithm using the methodology of the previous section. It can be accomplished as follows:
a-1 reset(S): A large amount of strands will be generated on the surface. All the strands are empty initially; they only have the basic header to be annealed to the surface.
a-2 append(X, S): This will make the strands on the surface grow with X. The X here is a_ij ≠ 0, where i is initially set to one and j ∈ (1 : n). All the strands will grow by one unit, and each will contain one item of the first row. After the append operation finishes, wash the surface to get rid of all unnecessary strand segments remaining on the surface.
a-3 Repeat step a-2 with i incremented by one until i reaches n. Now each strand should contain n units, where each unit is an item from one row. So, each strand should have n items from n different rows.
a-4 mark(X, S): We mark all strands containing X, where X is initially set to a_i, the code for the left half of each item representing the row number, with i = 0.
a-5 delete(UM): Destroy all strands that are unmarked. This will eliminate those strands containing fewer than n rows, because no matter what i is, it represents a row and every strand should contain it.
a-6 Repeat steps a-4 and a-5 n times with different i's, i ∈ (1 : n). This guarantees that one item from each row is contained in each strand.
a-7 Repeat steps a-4, a-5 and a-6 with the different a_j's, the codes for the right half of each item representing the column number, j ∈ (1 : n). This is used to keep only those strands that have items from each column and to eliminate those that do not satisfy this condition.
a-8 readout(S): Read out all the remaining strands on the surface; they will be the answer for the expansion of our symbolic determinant. Each strand will contain one item from each row and one item from each column.
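To check the logic of steps a-1 to a-8 on paper, the whole procedure can be simulated in ordinary software. The Python sketch below, under our own encoding assumptions, grows one non-zero item per row and then keeps only the strands that cover every column, which is the net effect of the mark/delete rounds; it simulates the algorithm's outcome, not the bio-operations.

from itertools import product

def symbolic_expansion(zero_pattern):
    # zero_pattern[i][j] is True where a_(i+1)(j+1) is a non-zero symbol.
    # Returns surviving strands: one (row, col) per row, all columns distinct.
    n = len(zero_pattern)
    # a-1..a-3: every strand picks one non-zero item from each row in turn
    per_row = [[(i, j) for j in range(n) if zero_pattern[i][j]] for i in range(n)]
    strands = product(*per_row)
    # a-4..a-7: keep only strands that contain an item from every column
    return [s for s in strands if len({j for _, j in s}) == n]

pattern = [[True, True, False],
           [False, True, True],
           [True, False, True]]
terms = symbolic_expansion(pattern)
# -> [((0,0),(1,1),(2,2)), ((0,1),(1,2),(2,0))], i.e. a11*a22*a33 and a12*a23*a31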
4 Analysis of the Algorithm
The complexity of this new algorithm is O(n), where n is the size of the matrix. In order to show the advantage of our surface-based DNA computer, we need to analyze the traditional method for expanding symbolic determinants. The computing complexity of the traditional method is O(n!). Compared with the traditional method, we have solved a problem harder than NP within a linear number of steps. The advantage of using a DNA computer to solve the expansion of symbolic determinants problem is huge. Because the surface-based DNA technology is used, the DNA computer will be more reliable, with a low error rate.
5 Conclusion
In this paper, we have proposed an algorithm to solve the expansion of symbolic determinants using the surface-based model of DNA computing. Compared with other given applications of DNA computers, our problem is a more computation-intensive one, and our surface-based DNA computer will also reduce the possible errors due to the loss of DNA strands. Further research includes expanding the application of surface-based DNA computing in order to make DNA computers more robust. With the goal of an even lower error rate, we may combine the existing error-resistant methods [3] [7] [9] [11] [21] [22] and the surface-based technology to achieve better results.
References
1. Len Adleman. Molecular computation of solutions to combinatorial problems. Science, November 1994.
2. Martyn Amos. DNA Computation. PhD thesis, University of Warwick, UK, September 1997.
3. Martyn Amos, Alan Gibbons, and David Hodgson. Error-resistant implementation of DNA computations. In Second Annual Meeting on DNA Based Computers, pages 87-101, June 1996.
4. Eric B. Baum. DNA sequences useful for computation. In Second Annual Meeting on DNA Based Computers, pages 122-127, June 1996.
5. Fraleigh Beauregard. Linear Algebra, 3rd Edition. Addison-Wesley Publishing Company, 1995.
6. D. Beaver. Molecular computing. Technical report, Penn State University Technical Report CSE-95-001, 1995.
7. Dan Boneh, Christopher Dunworth, Jeri Sgall, and Richard J. Lipton. Making DNA computers error resistant. In Second Annual Meeting on DNA Based Computers, pages 102-110, June 1996.
8. Weiping Cai, Anne E. Condon, Robert M. Corn, Elton Glaser, Tony Frutos, Zhengdong Fei, Zhen Guo, Max G. Lagally, Qinghua Liu, Lloyd M. Smith, and Andrew Thiel. The power of surface-based DNA computation. In RECOMB'97, Proceedings of the first annual international conference on computational molecular biology, pages 67-74, 1997.
9. Junghuei Chen and David Wood. A new DNA separation technique with low error rate. In 3rd DIMACS Workshop on DNA Based Computers, pages 43-58, June 1997.
10. R. Deaton, R. C. Murphy, M. Garzon, D. R. Franceschetti, and S. E. Stevens, Jr. Good encodings for DNA-based solutions to combinatorial problems. In Second Annual Meeting on DNA Based Computers, pages 131-140, June 1996.
11. Myron Deputat, George Hajduczok, and Erich Schmitt. On error-correcting structures derived from DNA. In 3rd DIMACS Workshop on DNA Based Computers, pages 223-229, June 1997.
12. Tony L. Eng and Benjamin M. Serridge. A surface-based DNA algorithm for minimal set cover. In 3rd DIMACS Workshop on DNA Based Computers, pages 74-82, June 1997.
13. Paul A. Fuhrmann. A Polynomial Approach to Linear Algebra. Springer, 1996.
14. Klaus Jänich. Linear Algebra. Springer-Verlag, 1994.
15. Thomas H. Leete, Matthew D. Schwartz, Robert M. Williams, David H. Wood, Jerome S. Salem, and Harvey Rubin. Massively parallel DNA computation: Expansion of symbolic determinants. In Second Annual Meeting on DNA Based Computers, pages 49-66, June 1996.
16. Richard Lipton. Using DNA to solve SAT. Unpublished draft, 1995.
17. Qinghua Liu, Anthony Frutos, Liman Wang, Andrew Thiel, Susan Gillmor, Todd Strother, Anne Condon, Robert Corn, Max Lagally, and Lloyd Smith. Progress towards demonstration of a surface based DNA computation: A one word approach to solve a model satisfiability problem. In Fourth International Meeting on DNA Based Computers, pages 15-26, June 1998.
18. Qinghua Liu, Zhen Guo, Anne E. Condon, Robert M. Corn, Max G. Lagally, and Lloyd M. Smith. A surface-based approach to DNA computation. In Second Annual Meeting on DNA Based Computers, pages 206-216, June 1996.
19. Z. Frank Qiu and Mi Lu. Arithmetic and logic operations for DNA computers. In Parallel and Distributed Computing and Networks (PDCN'98), pages 481-486. IASTED, December 1998.
20. Liman Wang, Qinghua Liu, Anthony Frutos, Susan Gillmor, Andrew Thiel, Todd Strother, Anne Condon, Robert Corn, Max Lagally, and Lloyd Smith. Surface-based DNA computing operations: Destroy and readout. In Fourth International Meeting on DNA Based Computers, pages 247-248, June 1998.
21. David Harlan Wood. Applying error correcting codes to DNA computing. In Fourth International Meeting on DNA Based Computers, pages 109-110, June 1998.
22. Tatsuo Yoshinobu, Yohei Aoi, Katsuyuki Tanizawa, and Hiroshi Iwasaki. Ligation errors in DNA computing. In Fourth International Meeting on DNA Based Computers, pages 245-246, June 1998.
Hardware Support for Simulated Annealing and Tabu Search
Reinhard Schneider and Reinhold Weiss
[schneider | weiss]@iti.tu-graz.ac.at
Institute for Technical Informatics, Technical University of Graz, AUSTRIA
Abstract. In this paper, we present a concept of a CPU kernel with hardware support for local-search based optimization algorithms like Simulated Annealing (SA) and Tabu Search (TS). The special hardware modules are: (i) A linked-list memory representing the problem space. (ii) CPU instruction-set extensions supporting fast moves within the neighborhood of a solution. (iii) Support for the generation of moves for both algorithms, SA and TS. (iv) A solution mover managing several solution memories according to the optimization progress. (v) Hardware addressing support for the calculation of cost functions. (vi) Support for nonlinear functions in the acceptance procedure of SA. (vii) A status module providing on-line information about the solution quality. (viii) An acceptance prediction module supporting parallel SA algorithms. Simulations of a VHDL implementation show a speedup of up to 260 in comparison to an existing implementation without hardware support.
1 Introduction
Simulated Annealing (SA) [1] and Tabu Search (TS) [2][3] are algorithms that are well suited to solving general combinatorial optimization problems, which are common in the area of real-time multiprocessor systems. Tindell et al. [4] solved a standard real-time mapping task with several requirements using SA. Axelsson [5] applied SA, TS and genetic algorithms, all three based on the local search concept [6], to the problem of HW/SW codesign. In [7] the authors introduced a complete tool for handling parallel digital signal processing systems based on parallel SA. All the research projects mentioned use, like many others, SA, TS or other algorithms based on local search to find solutions for partitioning, mapping and scheduling problems in parallel systems. The results show that these algorithms are able to solve even difficult problems with a good solution quality. The main drawback is the slow optimization speed. This is particularly true for SA. Many researchers have tried to reduce the execution time in different ways. One way is to optimize the algorithm itself, which depends strongly on the application and has a limited possible speedup [8]. Another approach is to parallelize SA [9]. With parallel simulated annealing (PSA) it is possible to achieve greater speedup, independent of the problem, without compromising solution quality [10]. PSA is
already successfully applied to multiprocessor scheduling and mapping [11]. But even with PSA on up-to-date processor hardware, it takes a very long time to compute a multiprocessor schedule for a realistic system complexity. This still prevents the on-line use of SA in dynamic systems, and it is also the main reason why our research focuses on supporting SA and TS by dedicated processor hardware. It is evident that a processor supporting local search also simplifies non-real-time applications using SA and TS. Abramson [12] showed that with a custom computing machine (CCM) it is possible to outperform a software implementation by several orders of magnitude. Other hardware implementations ([13][14]) also showed a significant speedup in comparison with a software implementation. CCMs are very efficient for the problem they are designed for. Unfortunately, they cannot solve other problems. Even a small change in the characteristics of the problem or an unexpected increase of the problem size means that the CCM itself has to be re-designed. Eschermann et al. [15] tried to build a more flexible processor for SA where fewer parts of the algorithm are implemented in hardware, so that different problems can be solved. Unfortunately, this processor has not been developed any further. Up to now, there is no processor available that explicitly supports local search or other nature-inspired algorithms. Our solution combines the flexibility of a programmable CPU with the speed of dedicated hardware in a flexible, modular concept.
2 Local Search
Optimization algorithms based on local search (LS) have in common that they start from an arbitrary solution and try to find better solutions by stepping through a neighborhood of solutions. The neighborhood of a solution i is defined by the neighborhood function N(i). A real cost value c can be mapped to each solution i by a cost function c(i). The problem is to find a globally optimal solution i*, such that c* = c(i*) <= c(i) for all solutions i.
Iterative Improvement. A basic version of LS is iterative improvement. With this technique, the neighborhood N(i) is searched starting from the current solution i. Then, either the first better solution (first improvement) or the solution with the lowest costs within the neighborhood (best improvement) is chosen as the new solution. Improved techniques like SA or TS use different strategies to overcome the problem of getting caught in a local minimum. But all these techniques are based on the same few basic functions. The following pseudo-code describes the basic structure of a local-search based algorithm:

i = Generate-Initial-Solution
REPEAT
  Move = Select-a-Move-within-Neighborhood-N(i)    (1)
  i'   = Apply-Move(i, Move)                       (2)
  dC   = Compute-Change-in-Cost(i, i')             (3)
  IF accept THEN i = i'                            (4)
UNTIL Stopping-Condition-is-true
Key functions of this algorithm are: (1) The selection of a move, which means the selection of a transition from one solution to another. (2) Performing the move to obtain the new solution. (3) Computing the difference in costs, and (4) deciding whether to accept the new solution or not. The definition of the neighborhood and the way of computing the costs depend on the problem that has to be solved. The way of selecting a solution from the neighborhood and the criteria for accepting a new state depend on the algorithm:
Simulated Annealing. The selection of a move in SA is based on a stochastic process. This means that one move is chosen at random out of all possible moves within the neighborhood. Therefore, the quality of the pseudo random number generator is important in order not to omit any solution. In SA, a move which leads to an improvement of cost is always accepted; deteriorations of costs are accepted if they fulfill the Metropolis Criterion, in analogy to the annealing procedure of metals.
Tabu Search. TS always searches the whole neighborhood. The best solution within the neighborhood is taken as the new solution. In order to avoid getting trapped in a local minimum, TS works with the search history: Solutions that have already been selected some time before are forbidden (taboo). These solutions, or the moves that lead to these solutions, respectively, are stored in a tabu list. Solutions in the tabu list may still be accepted if they are extraordinary (e.g., if they are significantly better than all other solutions in the neighborhood). These solutions are also stored in a list called the aspiration list.
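A compact Python sketch contrasting the two acceptance/selection policies just described; the tabu handling is reduced to a plain collection of forbidden moves, and everything else is an illustrative assumption rather than the hardware design.

import math
import random

def sa_accept(delta_cost, temperature):
    # Metropolis rule: always accept improvements, accept deteriorations
    # with probability exp(-delta_cost / temperature).
    if delta_cost <= 0:
        return True
    return math.exp(-delta_cost / temperature) > random.random()

def ts_select(neighborhood, cost, tabu, aspiration):
    # Tabu search: best allowed move in the whole neighborhood; tabu moves
    # are still allowed if they satisfy the aspiration criterion.
    allowed = [m for m in neighborhood if m not in tabu or aspiration(m)]
    return min(allowed, key=cost)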
3 Hardware Support
The analysis of possible hardware support for LS-based algorithms was governed by the following objectives:
– The flexibility of the final system should be maximized.
– Hardware support should be modular so that more than one optimization algorithm can be accelerated by the same hardware.
– As many parts of the algorithms as possible should be realized in hardware.
– The final processor should support parallelization.
– The employment in real-time systems should be supported.
The first goal was achieved by designing the hardware support as a CPU kernel extension. Thus, the flexibility of a fully programmable CPU remained. Additionally, data transfer time drops out in this concept because the hardware modules directly interact with the CPU, the bus system and the main memory. The second goal was achieved by a strict modular design. Different optimization algorithms can be supported by different combinations of the modules. This concept also satisfied the third objective, namely the realization of special, algorithm-dependent functions as modules and their integration into the system. Parallelization techniques were analyzed only for SA. An acceptance prediction module was introduced which efficiently supports the decision tree decomposition algorithm [16], where the processors work in a pipelined, overlapped mode.
Fig. 1. Left: Structure of a typical local-search based algorithm (solution memories, neighborhood generation, solution mover, cost changes, acceptance criterion). Right: Linked-list representation of the generalized assignment problem (agents and jobs) with the fundamental instructions (MOVE, SWAP, REPOSITION, INVERSION) used to generate a neighbor solution.
In real-time systems, it is most important to know about the status of the optimization process. As all optimization algorithms approach the optimal solution, knowledge about the quality of the solution reached so far must exist in order to make a decision at a crucial time. Therefore, a statistical module was designed that continuously checks the current status of the optimization process.
Modular View. A modular view of local-search based algorithms is depicted in Figure 1, left side. The basic modules are: (i) two solution memories, (ii) a neighborhood generator, (iii) a solution mover, (iv) support for the calculation of the cost function, and (v) support for the calculation of the acceptance criterion. Modules (i)-(iii) are closely coupled, as they work on the same data structure. Module (iv) strongly depends on the problem, module (v) depends on the algorithm used. Still, there are possibilities to build hardware support for these two modules. Additional modules (an acceptance prediction module and a status module) are implemented to support parallelization and real-time systems.
Solution Memory. A fast solution memory is fundamental, as the movement within the solution space is the most frequent operation in local search. In order to speed up moves, it is necessary to find an appropriate problem representation in memory. Generally, a combinatorial optimization problem can be described by (i) elements of different types and (ii) relations between them. E.g., the generalized assignment problem (GAP) could look as depicted in Figure 1, right side. The problem of mapping tasks to processors is a special case of the GAP, where the jobs represent tasks and the agents represent the processors. In this model, the elements and their content form the static description of the problem. All possible combinations of relations represent the solution space. One set of relations represents a particular solution. Moving in the solution space means changing relations. Based on this definition, it is easy to define a neighborhood: a solution is defined to be within the neighborhood N(i) if its differences in relations from i are small (e.g., one different relation). Linked lists [17] and matrices are efficient ways of representing relations between two types of elements. We decided to implement both a memory based on a linked-list representation and a memory based on a matrix representation. In the special list/matrix memory (solution memory), it is only pointers to the
static elements in main memory that are stored. Thus, the solution memory is independent of the kind of problem solved and of the size of the corresponding elements. Four basic operations (see Figure 1, right side) on the list representation allow movement in the neighborhood: (i) Moving an element means removing it from one list and appending it to another one (MOVE). (ii) An element can be reordered within a list (REPOSITION). (iii) Two elements of the same list can be exchanged (SWAP). And finally, (iv) the order of a chain of elements can be inverted (INVERSION).
Neighborhood Generation. The generation of the neighborhood depends on the algorithm used. In SA, a new solution is generated at random. Therefore, a set of hardware pseudo random number generators (PRNG) is proposed. One of them has to choose the move, another one has to select the source element (job), and the third one has to choose the destination agent and/or position in the list according to the selected move. In TS, all possible moves within the neighborhood have to be searched. The neighborhood generator has to check whether a selected move is forbidden (tabu) or not. This is done by comparing the move with the tabu list and the aspiration list. This search is accelerated by managing the lists in hardware.
Solution Mover. As the current solution and the new solution must be stored until the acceptance decision has been taken, the linked-list memory is duplicated. The result of the acceptance decision determines which memory has to be synchronized to the other. TS also needs to store the best solution reached within the neighborhood. Additionally, the best solution reached so far is stored. This is important for real-time systems, where the optimization has to stop after a fixed time, and the best solution so far should be available. The solution mover module applies the list operation suggested by the neighborhood generator and manages all solution memories and the transactions between them.
Cost Function. The cost function strongly depends on the problem solved. A function completely realized in hardware decreases flexibility dramatically. Hence, we suggest implementing only addressing support for the cost function: providing an easy way to access the elements that have been affected by the move, e.g., by a list of these elements, supports in particular cost functions that can be computed incrementally. As the order and size of this list depend on the problem, we suggest providing a user-programmable hardware module (e.g., an FPGA-based module) which is tightly coupled to the linked-list memory. This allows the adaptation of the sequence of elements to any individual problem before the optimization process is started.
Acceptance Criterion. In TS, the best solution found so far is always accepted. Thus, no additional hardware is needed. In SA, moves with a cost improvement are always accepted. If costs rise, SA decides on the acceptance of the move by evaluating the Metropolis criterion e^((E_i - E_j)/T) > random(0, 1). Negative cost differences (E_i - E_j) are weighted by a control parameter T and transformed by the non-linear exponential operation. The move is then accepted if the result of e^((E_i - E_j)/T) (always between 0 and 1) is greater than a random number between 0 and 1. A hardware pseudo random number generator improves performance significantly,
as the random number is provided without CPU interaction. Additionally, as the result of the exponential function is compared with a random number, no high accuracy is needed. Therefore, a hardware lookup table with pre-calculated values for each value of T is sufficient. By means of these tables, the evaluation of the exponential function is done in one cycle. Status Information. The absolute value of the cost function cannot be used as status information, because only its relation to the optimal solution is linked to the quality of the solution. But the optimal solution is not known to the system. Therefore, we use statistical status information based on the relative cost changes. Acceptance Prediction. The acceptance prediction module is used to support parallelization in SA. The output value corresponds to the probability of accepting a new move. With this value, a good prediction of the acceptance is available before the actual result of the acceptance criterion is available.
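The following Python sketch mirrors this table-based acceptance test in software. It is an illustration only, not the hardware design: the table size (256 entries) and the bound on cost increases are assumptions we introduce, and delta here denotes the positive cost increase Ej - Ei, so exp(-delta/T) corresponds to e^((Ei - Ej)/T) above.

import math
import random

def build_acceptance_table(T, max_delta, size=256):
    # Pre-computed values of exp(-delta/T) for quantized positive cost increases,
    # mimicking the per-temperature lookup table described above.
    return [math.exp(-(i * max_delta / (size - 1)) / T) for i in range(size)]

def accept(delta, table, max_delta):
    # Cost improvements (delta <= 0) are always accepted; cost increases are
    # accepted if the table value exceeds a uniform random number in (0, 1).
    if delta <= 0:
        return True
    idx = min(len(table) - 1, int(delta / max_delta * (len(table) - 1)))
    return table[idx] > random.random()

table = build_acceptance_table(T=5.0, max_delta=50.0)
print(accept(3.0, table, max_delta=50.0))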
4
Implementation
All modules are implemented using VHDL and are synthesized for emulation on a programmable FPGA chip using Xilinx Foundation software and tools. All modules use parametric problem sizes to be easily adaptable to different systems. Solution Memory. The module consists of four solution memories, realized as both linked list memories and matrix memories: the current solution, the new solution, the best solution found so far and the best solution in the neighborhood. The latter is used only in TS. Memory synchronization works very fast as all memories are arranged physically side by side and connected by a high speed internal bus. Move Generator. Moves are generated in two ways: for SA, a set of pseudo random number (PRN) generators, based on cellular automata [18], automatically generates a move. These automata provide excellent PRNs every cycle with a maximum period of 2^n. For TS, all possible moves have to be considered. These moves are generated sequentially. Each move has to be checked by a move checker. The move checker decides, with the help of the content of the tabu list and the aspiration list, if a move is accepted or not. The search within the lists is realized by parallel comparators. Status Module. A good estimation of the current status of the optimization can be made by averaging the cost changes over the absolute costs. This only works for problems with a smooth cost function without singular minima, which is the case for mapping tasks in multiprocessor systems. Acceptance Prediction. The acceptance prediction unit (for SA) uses an averaged cost value, the last cost differences and the last acceptance decision as input values. The output is a prediction value that indicates if the next new solution will be accepted or not. With the help of this value, the network topology of the parallelized processors can change dynamically.
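As a software illustration of cellular-automaton based pseudo-random bit generation, the sketch below iterates a one-dimensional rule-30 automaton and taps its centre cell each cycle. It only conveys the principle: the hardware generators described in [18] update all cells in parallel in a single clock cycle and need not use this particular rule; the width and seed chosen here are arbitrary.

def ca_prng_bits(seed, width=32, nbits=16):
    # One-dimensional cellular automaton with cyclic boundary; rule 30:
    # new cell = left XOR (centre OR right).
    state = [(seed >> i) & 1 for i in range(width)]
    out = []
    for _ in range(nbits):
        out.append(state[width // 2])          # tap the centre cell each cycle
        state = [state[(i - 1) % width] ^ (state[i] | state[(i + 1) % width])
                 for i in range(width)]
    return out

print(ca_prng_bits(0xACE1))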
5
Results
Timing results were obtained in two ways: firstly by simulation with a VHDL simulator, and secondly by calculating the cycles needed per instruction. The time needed for one iteration strongly depends on the time to perform a move in the neighborhood (move generation, solution mover and acceptance decision) and the time needed to calculate the cost difference. The latter strongly depends on the problem and is therefore not discussed any further. The use of our hardware modules shortens the time for move generation and the acceptance decision to one cycle each. The solution mover is more critical. The time needed to perform a particular move depends on the type of memory (linked-list based or matrix-based) and the problem size, which is indicated by parameter n in Table 1.

Table 1. Timing requirements for the solution mover module.

instruction    cycles: matrix memory    cycles: list memory
swap           10                       26
inversion      n · 8 + 3                n · 28 + 6
remove         n · 8 + 10               22
reposition     n · 8 + 2                n · 26 + 2
In order to assess our solution, a system was designed to solve the travelling salesman problem with SA. Simulations needed 13 cycles for one iteration. With an FPGA running at 13 MHz, the time for one instruction is 1 µs. A software implementation on a digital signal processor with a clock speed of 40 MHz needs 86 µs. The speedup of the hardware-supported solution is therefore 86 or, assuming that the hardware modules will run at the same speed if directly implemented in a CPU, the speedup will be over 260. The acceptance prediction module showed a hit rate of 90% when suspended for only 10% of the time.
6
Discussion
Nature-inspired algorithms are a fast growing field. New and improved algorithms are developed rapidly. But there is a lack of appropriate computer architectures to support these algorithms. The system described in this paper shows that with an extended CPU it is possible to speed up local-search based algorithms significantly. Even though an ASIC prototype has to be realized first in order to verify the speedup, the simulation results are respectable. These modules are an attempt to show which functions could be supported by new, intelligent CPU cores. The costs of integrating these modules in a CPU core are small compared to the speedup they provide. The modular concept is very flexible and allows, e.g., support for parallelization. Based on this concept, a lot of new modules can be imagined: support for other algorithms like genetic algorithms, neural networks, qualitative algorithms, etc. A CPU extended by such modules will probably make expensive special solutions dispensable.
References
[1] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimisation by simulated annealing. Science, 220:671-680, 1983.
[2] Fred Glover. Tabu search: 1. ORSA Journal on Computing, 1(3):190-206, 1989.
[3] Fred Glover. Tabu search: 2. ORSA Journal on Computing, 2(1):4-32, 1990.
[4] K. W. Tindell, A. Burns, and A. J. Wellings. Allocating Hard Real-Time Tasks: An NP-Hard Problem Made Easy. The Journal of Real-Time Systems, (4):145-165, 1992.
[5] Jakob Axelsson. Architecture Synthesis and Partitioning of Real-Time Systems: A Comparison of Three Heuristic Search Strategies. In 5th International Workshop on Hardware/Software Codesign, pages 161-165, March 24-26, 1997.
[6] E. Aarts and K. Lenstra. Local Search in Combinatorial Optimization. Interscience Series in Discrete Mathematics and Optimization. John Wiley & Sons, 1997.
[7] Claudia Mathis, Martin Schmid and Reinhard Schneider. A Flexible Tool for Mapping and Scheduling Real-Time Applications on Parallel Systems. In Proceedings of the Third International Conference on Parallel Processing and Applied Mathematics, Kazimierz Dolny, Poland, September 5-7, 1999.
[8] E. H. L. Aarts and J. H. M. Korst. Simulated Annealing and Boltzmann Machines. Interscience Series in Discrete Mathematics and Optimization. John Wiley & Sons, Chichester, U.K., 1989.
[9] Tarek M. Nabhan and Albert Y. Zomaya. Parallel simulated annealing algorithm with low communication overhead. IEEE Transactions on Parallel and Distributed Systems, 6(12):1226-1233, December 1995.
[10] Soo-Young Lee and Kyung Geun Lee. Synchronous and asynchronous parallel simulated annealing with multiple Markov chains. IEEE Transactions on Parallel and Distributed Systems, 7(10):993-1008, October 1996.
[11] Martin Schmid and Reinhard Schneider. A Model for Scheduling and Mapping DSP Applications onto Multi-DSP Platforms. In Proceedings of the International Conference on Signal Processing Applications and Technology. Miller Freeman, 1999.
[12] David Abramson. A very high speed architecture for simulated annealing. IEEE Computer, 25(5):27-36, May 1992.
[13] J. Niittylahti. Simulated Annealing Hardware Tool. In The 2nd International Conference on Expert Systems for Development, pages 187-191, 1994.
[14] Bang W. Lee and Bing J. Sheu. Paralleled hardware annealing for optimal solutions on electronic neural networks. IEEE Transactions on Neural Networks, 4(4):588-599, July 1993.
[15] B. Eschermann, O. Haberl, O. Bringmann, and O. Seitzr. COSIMA: A Self-Testable Simulated Annealing Processor for Universal Cost Functions. In Euro ASIC, pages 374-377, Los Alamitos, CA, 1992. IEEE Computer Society Press.
[16] Daniel R. Greening. Parallel Simulated Annealing Techniques. In Emergent Computation, pages 293-306. MIT Press, Cambridge, MA, 1991.
[17] A. Postula, D. A. Abramson, and P. Logothetis. A Tail of 2 by n Cities: Performing Combinatorial Optimization Using Linked Lists on Special Purpose Computers. In The International Conference on Computational Intelligence and Multimedia Applications (ICCIMA), February 9-11, 1998.
[18] P. D. Hortensius, R. D. McLeod, and H. C. Card. Parallel random number generation for VLSI systems using cellular automata. IEEE Transactions on Computers, 38(10):1466-1473, October 1989.
Eighth International Workshop on Parallel and Distributed Real-Time Systems held in conjunction with
International Parallel and Distributed Processing Symposium, May 1-2, 2000, Cancun, Mexico
General Chair
Kenji Toda, Electrotechnical Laboratory, Japan
Program Chairs
Sang Hyuk Son, University of Virginia, USA Maarten Boasson, University of Amsterdam, The Netherlands Yoshiaki Kakuda, Hiroshima City University, Japan
Publicity Chair
Amy Apon, University of Arkansas, USA
Steering Committee
David Andrews (Chair), University of Arkansas, USA Dieter K. Hammer, Eindhoven University of Technology, The Netherlands E. Douglas Jensen, MITRE Corporation, USA Guenter Hommel, Technische Universitaet Berlin, Germany Kinji Mori, Tokyo Institute of Technology, Japan Viktor K. Prasanna, University of Southern California, USA Behrooz A. Shirazi, The University of Texas at Arlington, USA Lonnie R. Welch, Ohio University, USA
Program Committee
Tarek Abdelzaher, University of Virginia, USA Giorgio Buttazzo, University of Pavia, Italy Max Geerling, Chess IT, Haarlem, The Netherlands Jorgen Hansson, University of Skovde, Sweden Kenji Ishida, Hiroshima City University, Japan Michael B. Jones, Microsoft Research, USA Tei-Wei Kuo, National Chung Cheng University, Taiwan Insup Lee, University of Pennsylvania, USA Victor Lee, City University of Hong Kong, Hong Kong Jane Liu, University of Illinois, USA Doug Locke, Lockheed Martin, USA G. Manimaran, Iowa State University, USA Tim Martin, Compaq Computer Corporation, USA Sang Lyul Min, Seoul National University, Korea Al Mok, UT Austin, USA C. Siva Ram Murthy, IIT Madras, India Hidenori Nakazato, OKI, Japan Joseph Kee-Yin Ng, Hong Kong Baptist University, Hong Kong Isabelle Puaut, INSA/IRISA, France Ragunathan Rajkumar, Carnegie Mellon University, USA Franklin Reynolds, Nokia Research Center, USA Wilhelm Rossak, FSU Jena, Informatik, Germany Shiro Sakata, NEC, Japan Manas Saksena, University of Pittsburgh, USA Lui Sha, University of Illinois, USA Kang Shin, University of Michigan, USA Hiroaki Takada, Toyohashi University of Technology, Japan Nalini Venkatasubramanian, University of California at Irvine, USA Wei Zhao, Texas A&M University, USA
Message from the Program Chairs
The Eighth International Workshop on Parallel and Distributed Real-Time Systems (WPDRTS'00) is a forum that covers recent advances in real-time systems, a field that is becoming an increasingly important area of computer science and engineering. It brings together practitioners and researchers from academia, industry, and government to explore the best current ideas on real-time systems, and to evaluate the maturity and directions of real-time system technology. As the demand for advanced functionalities and timely management of real-time systems continues to grow, our intellectual and engineering abilities are being challenged to come up with practical solutions to the problems faced in the design and development of complex real-time systems. The workshop presents papers that demonstrate recent advances in research pertaining to real-time systems. Topics addressed in WPDRTS'00 include: Communication and Coordination, Real-Time and Fault-Tolerance, Real-Time Databases, Scheduling and Resource Management, and QoS and Simulation. In addition to the regular paper presentations, the workshop also features a Keynote Speech, "Real-Time Application Specific Operating Systems: Towards a Component Based Solution," by Jack Stankovic, University of Virginia, an invited papers session, and a panel discussion. We would like to thank all who have helped to make WPDRTS'00 a success. In particular, the Program Committee members carefully reviewed the submitted papers. We also would like to thank the authors of all the submitted papers. The efforts of the Steering Committee chair and the Publicity chair are also greatly appreciated. Finally, we thank the IPDPS organizers for providing an ideal environment in Cancun.
Sang H. Son
[email protected] Maarten Boasson
Yoshiaki Kakuda
[email protected] [email protected] Program Chairs 8th International Workshop on Parallel and Distributed Real-Time Systems
A Distributed Real Time Coordination Protocol
Lui Sha¹ and Danbing Seto²
¹CS, University of Illinois at Urbana-Champaign
²United Technologies Research Center
Abstract: When the communication channels are subject to interruptions such as jamming, the coordination of the real time motions of distributed autonomous vehicles becomes a challenging problem that differs significantly from fault tolerant communication problems such as reliable broadcast. In this paper, we investigate how to maintain coordination in spite of arbitrarily long interruptions to the communication.
1 Introduction Internet based instrumentation and controls are an attractive avenue for the development and evolution of distributed real-time systems [1, 2]. However, one of the challenges is the real-time coordination problem in the presence of communication interruptions. In distributed control, coordination concerns how to synchronize the states of distributed control subsystems in real time. A prototypical problem is to command a group of unmanned air vehicles, where each vehicle must closely follow a desired trajectory which is planned in real-time. To synchronize the states of distributed control systems, a reference trajectory is given in real time to each distributed node, a local system. A reference setpoint moves along the reference trajectory according to the specified speed profile. The reference trajectories are designed in such a way that the movements of the reference setpoints represent the synchronized changes of distributed states. The difference between the actual system state and the state represented by the reference setpoint is called tracking error. A tracking error bound specifies the acceptable tracking error on each reference trajectory. A local controller is designed to force the local system's state to follow the reference setpoint closely within the tracking error bound.
Figure 1: Coordinated IP Control Prototype
A reference trajectory is a specification of how a system's state should change as a function of some independent variables. For example, the glide slope that guides aircraft landing specifies both the path and the speed along the path.
This paper addresses the abstract problem of designing reliable communication protocols for distributed real-time coordination with a concrete example. We will use a simplified Inverted Pendulum (IP) control prototype to introduce the basic concepts in real-time coordination. The coordinated IP control system has two IPs, with a nearly massless plastic rod which ties the tips of the two IPs together, as shown in Figure 1. The rod does not affect the inverted pendulum control until the slack is consumed by the difference in the IPs' positions. Each IP consists of a metal rod mounted vertically to a hinge on a motorized cart controlled by a computer. The metal rod rotates freely. It will fall down from its upright position if the cart's movement is not properly controlled. The mission of the overall system is to get the two IPs moving in synchrony to a desired position on the track with the IPs standing upright. Clearly, if the two IPs are significantly out of step with each other, they can pull each other down. Therefore, the two carts must keep the pendulums at the upright position, and maintain their positions synchronized within a small tolerance of, e.g., 5 cm, to prevent the plastic rod from falling. The tolerance is a function of how tightly the two tips are tied together. In this experiment, each IP is controlled locally by a computing node on an Ethernet switch. An operator uses a Command Node on the network to send messages commanding the two IPs where to go. A "communication jamming station" is also connected to the same network, so that we can experimentally test the robustness of the communication protocol designed for coordinated control.
Example 1: As illustrated in Figure 2, suppose that in a coordinated IP control experiment, the initial positions of the two IPs are at the middle of the two parallel tracks, i.e., x1 = 0 cm, x2 = 0 cm at time t = 0. We may command the IPs to move to positions near one end of the parallel tracks at 70 cm, with a motion start time t = 10 sec and a constant speed of 2 cm/sec. The system coordination is carried out by sending commands to the IPs. A command specifies both the start time and the reference trajectory. A reference trajectory specifies the path of the motion and the speed along the specified path.

Figure 2: A Reference Trajectory

In this example, the paths for the IPs are two straight lines, each connecting track position 0 to track position 70 cm.
Suppose that both IPs receive their commands in time, that is, before the start time t = 10 sec. The Local Reference Setpoint will start moving exactly at t = 10 sec with a constant speed of 2 cm/sec. The local control forces the IP to follow the Local Reference Setpoint.
Given a reference trajectory, the reference setpoint specifies where the controlled system's state ought to be along the reference trajectory. That is, it is a specification of the desired system state as a function of time. Feedback control is used to force the physical system's state to converge to the reference setpoint.
If both IPs' controls are functioning correctly, the tracking error between the IP position and the Local Reference Setpoint will tend to zero. In the experiment, the error between the two IPs' positions is allowed to be as large as 5 cm, which is called the global coordination error bound. Typically, in distributed control systems, the global coordination error bound is translated into sufficient conditions that can be observed and controlled locally. For example, if each IP is within 2.5 cm of its Local Reference Setpoint, then the global coordination error bound is satisfied. This localized condition is referred to as the local tracking error bound. In Example 1, both IPs start their motions at the same time. In practice, it is quite common to command different objects to start motions at different specified times. However, distributed real-time coordination with synchronized start times is the key problem. The problem of using different start times can always be decomposed into two problems: 1) a coordination problem with synchronized start times and 2) a stand-alone control problem. For example, suppose that initially one of the IPs, IP1, is at -5 cm while IP2 is at 0 cm. We would like them to line up first and then move in synchrony. This problem can be decomposed into two problems: 1) command IP1 to first move to x1 = 0 cm, and 2) command them to move in synchrony as illustrated in Example 1. Obviously, the hard problem is coordinated control with synchronized start times. In the following, we will focus on problems that require synchronous start times. We have so far assumed that both IPs receive their commands on time. This assumption is unrealistic in an open network. Obviously, if one IP receives its command on time, and the other receives its command much later, the IP that moves first will pull down the other IP. This is an example of coordinated control failure. The design of the real-time coordination communication protocol concerns the problem of how to send the trajectories to distributed nodes quickly and reliably. That is, in spite of arbitrarily long interruptions to any or all of the communication channels, the protocol must guarantee that distributed nodes will never receive a set of inconsistent commands that will lead to coordination failure. This problem is related to the synchronization of distributed objects [3] in the sense that the states of distributed objects cannot diverge arbitrarily. However, in the synchronization of distributed objects, the problem is how to force distributed executions to enter a set of prescribed states when a certain condition is met, not how to quickly and reliably communicate the trajectories that constrain state transitions. The communication protocol design for real-time coordination is similar to the design of fault tolerant communication protocols, such as reliable broadcast [4, 5], in the sense that we need to find a way to reliably provide distributed objects with consistent information. However, in real-time coordination, we have a weaker form of consistency constraints due to the existence of the tracking error bound. On the other hand, we are faced with a hard constraint on the relative delays between the messages received by coordinating nodes. We will revisit this point after we specify the real-time coordination problem. In Section 2, we will define the problem and show some of the pitfalls in protocol design. In Section 3 we present the solutions. Section 4 is the conclusion and summary.
2 Problem Formulation In this section, we will define the communication protocol design problem for real-time coordination. Our assumptions are as follows:
Assumption 1: Communication delays change widely and unpredictably. They are normally short with respect to application needs. However, very long delays can happen suddenly without warning.
674
L. Sha and D. Seto
Assumption 2: Messages are encrypted. Adversaries are unable to forge or alter the content of messages without being detected. Assumption 3: The clocks of distributed nodes are synchronized. Assumption 4: The control of objects is precise. Errors due to control algorithms, environments or mechanical problems are negligible. Assumptions 3 and 4 allow us to ignore tracking errors due to control or clock synchronization inaccuracies. That is, they allow us to focus on the specific problem of tracking errors caused by communication delays. The local tracking error bound in the following discussion is used only to constrain the tracking error due to communication delays. From a system engineering perspective, this is the portion of the tracking error bound that is allocated for tracking errors due to communication delays. It is important to note that the bound for control errors is in the form of ±B, because the object being controlled can either undershoot or overshoot the reference setpoint. The tracking error due to late start is always positive. It measures how much the object lags behind the reference setpoint. We will use the symbol B to denote the tracking error bound in the rest of this paper.
Assumption 5: A reliable point-to-point communication protocol is used in all the communications. In coordinated control (with synchronous starts), we could specify fixed start times as in Example 1. However, we cannot guarantee that the coordination works using fixed start times if the duration between the start time and the current time is shorter than the worst case communication delay. Observing this constraint causes long delays to the communication of the trajectories. We are interested in protocols using start times that are set dynamically to take advantage of the windows of opportunity in communication - moments at which bandwidth is available. The simplest coordination protocol using dynamic start times is to let each node start its motion immediately after it has received its command. Although this simple-minded protocol will not work in the presence of arbitrary delays, it helps us to pin down a number of useful concepts. The idea of dynamic start times is to use some communication protocol to dynamically start the motions within a narrow time window. To analyze the worst case relative delay in start times, the System-Start-Time (SST) is defined as the leading edge of the time window, that is, the time at which one of the coordinating nodes makes the first move. To compute the local tracking error due to late start, we imagine that the Local Reference Setpoint starts at SST, independent of the time at which the local coordination command is received. This is what the Local Reference Setpoint should have done if there were no delay. We call this idealized setpoint the "SST-Reference Setpoint", and call the trajectory the "SST-Reference Trajectory". The tracking error due to late start is then computed as the difference between the SST-Reference Setpoint and the actual position of the object due to late start.
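A small numerical sketch of this definition (the function and variable names are ours): both setpoints move at the same constant speed along the same trajectory, but the local one starts late, so the tracking error due to late start is simply the distance the SST-Reference Setpoint has covered in the meantime.

def tracking_error_due_to_late_start(t, sst, local_start, speed, length):
    # Position of the idealized SST-Reference Setpoint at time t.
    sst_pos = min(max(t - sst, 0.0) * speed, length)
    # Position of the Local Reference Setpoint, which started late.
    local_pos = min(max(t - local_start, 0.0) * speed, length)
    return sst_pos - local_pos   # non-negative for a late start

# A node that starts 10 sec late on a 70 cm trajectory at 2 cm/sec lags the
# SST-Reference Setpoint by 20 cm once both are moving.
print(tracking_error_due_to_late_start(t=15.0, sst=0.0, local_start=10.0,
                                       speed=2.0, length=70.0))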
Figure 3: Tracking Error Using Dynamic Start Times
Example 2: Let the system start time SST = 0. But the node starts its motion 10 sec later due to the delay in receiving its coordination command. The SST-Reference Trajectory and the Local Reference Trajectory are illustrated in Figure 3. Note that the physical object follows the Local
Reference Setpoint to move along the Local Reference Trajectory. At time t, the local tracking error due to late start is the difference between the positions of the SST-Reference Setpoint and the Local Reference Setpoint. Note that the position of the Local Reference Setpoint and the position of the object are identical under Assumptions 1 to 5. That is, the object follows the local reference setpoint perfectly. The tracking error in Figure 3 is caused by the late start of the object. Next, we illustrate some of the difficulties in the design of a real-time coordination communication protocol using dynamic start times.
Example 3: The Command Node sends a command to nodes N1 and N2. When N1 (N2) receives its command, it immediately sends N2 (N1) a confirmation message that it has received its command with the given command identifier. When N1 (N2) receives its command and the confirmation from N2 (N1) that N2 (N1) has also received the corresponding command, N1 (N2) starts its motion immediately. Unfortunately, the protocol in Example 3 does not work. Suppose that N1 receives its command and the confirmation from N2 at time t, and therefore starts its motion at time t. Unfortunately, N2 receives N1's confirmation message long after time t. Therefore, the large lag leads to coordination failure. A moment's reflection tells us that the consensus based dynamic start time protocol is no better than the fixed start time protocol. It is not possible for distributed nodes to reach a strictly consistent view of a given command within a duration shorter than the maximal communication delay D. To see this point, note that in any consensus protocol, there will be a decision function, which will return the value True if a certain set of messages is received. All the adversary needs to do is to let one of the coordinating nodes receive all the required messages and start its motion. However, the adversary jams the required messages to the other nodes for duration D. Indeed, if we insist on finding a way to ensure that all the nodes receive the same set of commands, it becomes a reliable broadcast problem. It is not possible to guarantee reliable broadcast within a time window that is less than the worst case communication delay. Fortunately, the real-time coordination problem permits a weaker form of consistency due to the existence of tracking error bounds.
3 Protocol Design There are two requirements for the design of communication protocols for real-time coordination. First, under a given communication protocol, the tracking error due to late start must always stay within its bound, no matter how long the communication delay is. Second, it is desirable to shorten the time that the protocol needs to send the reference trajectories to coordinating nodes. As a first step to exploit the weaker form of consistency, we develop the Constant Distance Grid (CDG) - Iterative Command (IC) protocol. A CDG partitions each reference trajectory into a series of k equal distance short segments. Each segment on a trajectory should be no longer than the tracking error bound. Figure 4 illustrates a simple Constant Distance Grid with two parallel reference trajectories. Position 0 on each trajectory marks the starting point of the first segment, Segment 1, of the trajectory. Given a trajectory in the form of a CDG, the Iterative Command (IC) protocol works as follows. The Command Node sends messages to each of the N Maneuver Control Nodes and asks them to move to Position 1 first and wait for further commands. Once the Command Node receives messages that all N Maneuver Control Nodes have reached Position 1, it commands them to move to Position 2 and wait, and so on. The IC protocol is outlined by the following pseudo-code.
Figure 4: A Constant Distance Grid
Definition 1: The IC Protocol
Initialization: Each of the N objects will be at its starting position (Position 0) of its trajectory.
Command Node:
    for j = 1 to k    // k is the last position, the final destination for a trajectory
        Send Message j to each of the N Maneuver Control Nodes to go to Position j;
        Wait for confirmation of reaching Position j from all the N Maneuver Control Nodes;
    end
Each Maneuver Control Node:
    Loop
        Wait for command;
        Move to the commanded Position j;
        Send confirmation to the Command Node immediately after reaching the commanded Position j;
    end Loop
The CDG-IC protocol is just the IC protocol that uses CDG. We now analyze the tracking error. Recall that the worst case tracking error is computed with respect to the SST-Reference Setpoint, which starts to move as soon as an object makes the first move. Let the segment length be d and the tracking error bound be B, with d ≤ B.
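Below is a sequential Python sketch of the IC protocol (our own illustration, not the authors' implementation). Message delays are abstracted into an arbitrary delivery order per round; because the Command Node only issues the command for Position j+1 after every node has confirmed Position j, no object is ever more than one segment, and hence no more than d ≤ B, ahead of any other.

import random

def ic_protocol(num_nodes, k, seed=0):
    rng = random.Random(seed)
    positions = [0] * num_nodes       # every object starts at Position 0
    worst_lag = 0
    for j in range(1, k + 1):
        order = list(range(num_nodes))
        rng.shuffle(order)            # commands for Position j arrive in arbitrary order
        for n in order:
            positions[n] = j          # node n moves to Position j and confirms
            worst_lag = max(worst_lag, max(positions) - min(positions))
        # the next command is issued only after all confirmations for Position j
    return worst_lag

print(ic_protocol(num_nodes=4, k=10))  # worst observed lag: 1 segment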
Theorem 1: Under the CDG-IC protocol, the local tracking error, e, on Trajectory i is bounded by the tracking error bound, B.
Proof: Suppose that Theorem 1 is false, i.e., e > B. Since the segment size d ≤ B, we have e > d. For e > d, the SST-Reference Setpoint and the object must be in two different segments. Let the object be in Segment i and the SST-Reference Setpoint in Segment j, with j > i. For the SST-Reference Setpoint to be in Segment j, a command to go to Position j+1 must have been received. Under the CDG-IC protocol, a command to go to Position j+1 will be issued only if all the objects have reached Position j, the starting position of Segment j. This contradicts the assumption that the object is in Segment i and the SST-Reference Setpoint is in Segment j. Theorem 1 follows.
CDG-IC has the drawback of waiting for all the objects to complete the current command before issuing the next one. However, the movement of electronic messages is much faster than the movement of physical objects. Waiting for the movement of physical objects could waste the windows of opportunity in communication. To speed up the process of sending the trajectories, we have developed the Constant Time Grid - Fast Iterative Command (CTG-FIC) protocol. There are two key ideas in the CTG-FIC protocol:
– To replace the constant distance grid with a constant time grid. In a constant time grid, the distance of a segment is adjusted in such a way that each segment will take the same time to finish with respect to a given reference trajectory (illustrated in the sketch below).
– To send commands to objects to go to Position (j+1) without actually waiting for all the objects to actually reach Position j. As soon as the Command Node receives all the acknowledgements from all the objects that they have received the command to go to Position j, it sends messages commanding them to move to Position (j+1), until the command for the final destination position is successfully received. In other words, we allow an arbitrary number of outstanding, yet to be executed commands. This allows us to capitalize on the windows of opportunity in communication.
Due to the lack of space, we are unable to show the proof of correctness of the CTG-FIC protocol. Readers who are interested in knowing some of the potential pitfalls in designing fast protocols that allow outstanding commands, or interested in the proof of CTG-FIC, may send emails to the authors to request a copy of the report: "Communication Protocols for Distributed Real-Time Coordination in The Presence of Communication Interruptions."
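The sketch below illustrates only the first idea, the construction of a constant time grid (the speed profile, step count and segment count are assumptions chosen for illustration): segment boundaries are placed so that every segment takes the same time to traverse.

def constant_time_grid(total_length, speed_at, k, steps=10000):
    # Numerically accumulate travel time along the trajectory, then place
    # k - 1 interior boundaries at equal increments of travel time.
    dx = total_length / steps
    times = [0.0]
    for i in range(steps):
        times.append(times[-1] + dx / speed_at(i * dx))
    target = times[-1] / k
    bounds, j = [0.0], 0
    for seg in range(1, k):
        while times[j] < seg * target:
            j += 1
        bounds.append(j * dx)
    bounds.append(total_length)
    return bounds

# A trajectory whose second half is covered twice as fast gets longer segments there.
print(constant_time_grid(70.0, lambda x: 2.0 if x < 35.0 else 4.0, k=4))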
4 Summary and Conclusion Internet based instrumentation and controls are an attractive avenue for the development and evolution of distributed control systems. However, one of the challenges is the design of communication protocols for real-time coordination in the presence of communication interruptions. Two protocols were developed to solve the real time coordination problem in the presence of communication interruptions: the Constant Distance Grid - Iterative Command protocol (CDG-IC) and the Constant Time Grid - Fast Iterative Command protocol (CTG-FIC). Both of them can tolerate arbitrarily long communication delays without causing coordination failures. However, the completion time of sending a trajectory to a node under CDG-IC depends on the speed of the physical systems. CTG-FIC can send successive commands to distributed nodes without waiting for the completion of the earlier commands. Thus, the completion time of CTG-FIC is independent of the speed of the physical systems. It can better exploit the windows of opportunity in communication. Due to the limitation of space, only the simpler CDG-IC was described and the CTG-FIC was briefly outlined.
Acknowledgement This work was sponsored in part by the Office of Naval Research, by EPRI and by the Software Engineering Institute, CMU. The authors want to thank Michael Gagliardi, Ted Marz and Neal Altman for their contributions to the design and implementation of the demonstration software, and to thank John Walker for the design and implementation of the hardware. Finally, we want to thank Jane Liu for her helpful comments on an earlier draft.
References:
1. The Proceedings of the NSF/CSS Workshop on New Directions in Control Engineering Education, October 1998, pp. 15-16.
2. The Proceedings of the Workshop on Automated Control of Distributed Instrumentation, April 1999.
3. J. P. Briot, R. Guerraoui and K. P. Lohr, "Concurrency and Distribution in Object-Oriented Programming", ACM Computing Surveys, Vol. 30, No. 3, September 1998.
4. P. M. Melliar-Smith, L. E. Moser and V. Agrawala, "Broadcast Protocols for Distributed Systems", IEEE Transactions on Parallel and Distributed Systems, January 1990.
5. P. Jalote, "Fault Tolerance in Distributed Systems", Prentice Hall, 1994.
A Segmented Backup Scheme for Dependable Real Time Communication in Multihop Networks Gummadi P. Krishna
M. Jnana Pradeep and C. Siva Ram Murthy
Department of Computer Science and Engineering Indian Institute of Technology, Madras - 600 036, INDIA
[email protected], [email protected], [email protected]
Abstract. Several distributed real time applications require fault tolerance apart from guaranteed timeliness. It is essential to provide hard guarantees on recovery delays due to component failures, which cannot be ensured in traditional datagram services. Several schemes exist which attempt to guarantee recovery in a timely and resource efficient manner. These methods center around a priori reservation of network resources, called spare resources, along a backup route. In this paper we propose a method of segmented backups which improves upon the existing methods in terms of resource utilisation, call acceptance rate and bounded failure recovery time. We demonstrate the efficiency of our method using simulation studies.
1
Introduction
Any communication network is prone to faults due to hardware failure or software bugs. It is essential to incorporate fault tolerance into QoS requirements for distributed real time multimedia communications such as video conferencing, scientific visualisation, virtual reality and distributed real time control. Conventional applications which use multihop packet switching easily overcome a local fault but experience varying delays in the process. However, real time applications with QoS guaranteed bounded message delays require a priori reservation of resources (link bandwidth, buffer space) along some path from source to destination. All the messages of a real time session are routed over this static path. In this way the QoS guarantee on timeliness is realised, but it brings in the problem of fault tolerance for failure of components along the predetermined path. Two proactive approaches are in vogue to overcome this problem. The first approach is the forward recovery method [1,2], in which multiple copies of the same message are sent along disjoint paths. The second approach is to reserve resources along a path, called the backup path [3,4], which is disjoint with the primary, in anticipation of a fault in the primary path. The second approach is far more inexpensive than the first if infrequent transient packet losses are tolerable. We focus on the second proactive scheme. Establishment of backup channels saves the time required for reestablishing the channel in reactive methods. Two different schemes have been widely analysed for the establishment of backup channels. In the first, the spare resources in the vicinity of the failed component are used to reroute the channel. This method of local detouring [3,4] leads to inefficient resource utilisation, as after recovery the channel path lengths usually get extended significantly. The second method, end to end detouring, was proposed to solve the problem in a resource efficient manner. But end to end detouring has the additional requirement that the primary and backup paths be totally disjoint except for the source and destination. This might lead to rejection of a call even when there is considerable bandwidth available in the network. Further, this method of establishing backups might be very inefficient for delay critical applications if the delay of the backup is not within the required limits. In this paper we address these problems by proposing
to have segmented backups rather than a single continuous backup path from source to destination, and show that the proposed method not only solves these problems but also is more resource efficient than the end to end detouring methods with resource aggregation through backup multiplexing [5-7]. We now explain our concept of segmented backups. Earlier schemes have used end to end backups, i.e., backups which run from source to destination of a dependable connection, with the restriction that the primary and the backup channels do not share any components other than the source and destination. In our approach of segmented backups, we find backups for only parts of the primary path. The primary path is viewed as made up of smaller contiguous paths, which we call primary segments, as shown in Figure 1. We find a backup path, which we call a backup segment, for each segment independently. Note that successive primary segments of a primary path overlap on at least one link and that any two non consecutive segments are disjoint. The primary channel with 9 links shown has 3 primary segments: the 1st segment spanning the first 3 links, the 2nd spanning link 3 to link 6 and the 3rd the last 4 links, with segments overlapping on the 3rd and 6th links. The backup segments established are also shown. In case of a failure in a component along a primary segment, the message packets are routed through the corresponding backup segment rather than through the original path, only for the length of this primary segment, as illustrated. In case of a fault in any component of a primary path, we give the following method of backup segment activation.
Backup segments
.
~0urc.e
.............
1
2
3
, " Initial path
5
6
.
I
7
8
-
9
Destination
\ Primary channel
Fig. 1. Illustration of Segmented Backups
If only one primary segment contains the failed component, activate the backup segment corresponding to that primary segment, as shown for the failure of link 4. If two successive primary segments contain the failed component, activate any one of the two backup segments corresponding to those primary segments.
Now we illustrate one of the advantages of the segmented approach over the end to end backup approach with a simple example of a 5 x 6 mesh in Figure 2.

Fig. 2. Establishment of Segmented Backup Channels

Suppose the capacity of each link on the mesh is only 1 unit. There are 2 dependable connections to be established: S1 to D1 and S2
to D2. The primary paths (shortest paths) of these connections are shown in the figure. It is not possible to establish end to end backups for both the connections, as both the backups contend for the unit resource along the link between N15 and N16. However, segmented backups can be established as shown in the figure.
2
Spare Resource Allocation
It is very important to address the issue of minimizing the amount of spare resources reserved. The idea is to reduce the amount of backup resources reserved by multiplexing the backups passing through the same link. We explain the method very briefly below. Refer to [5-7] for a more detailed discussion. We note that the resources reserved for backup channels are used only during component failures in their primary channels. We consider the single link failure model for our analysis, under the assumption that the channel failure recovery time, i.e., the time taken for the fault to be rectified, is much smaller than the network's mean time to failure (MTTF). If the primary channels of two connections share no common components and their backup channels with bandwidths b1 and b2 pass through link L, it is sufficient to reserve max(b1, b2) for both the backup channels on the link L in this failure model, as we know that both the backup channels can never be activated simultaneously. This is the idea of multiplexing. We discuss how deterministic multiplexing [5,6] applies to our scheme in comparison to earlier schemes. We use a deterministic failure model and calculate the minimum amount of extra resources that are necessary to be reserved to handle all possible cases of failure. We give below the algorithm we use to calculate the spare resources S_L at link L under the single link failure model. Let G_L denote the set of all primary channels whose backups traverse L. Let R_Ps denote the resource required at each link by the primary segment Ps.

Initialise S_{l,L} = 0 for all l, L
loop for each link l, l ≠ L
    loop for each primary channel segment Ps in G_L
        if Ps contains link l then
            S_{l,L} = S_{l,L} + R_Ps
        endif
    endloop
endloop
S_L = max over all l ≠ L of S_{l,L}

It is worth noting the complexity of this multiplexing algorithm. Its execution time increases steeply with the number of links and connections in the network. At first sight it appears as if the backup segments taken together require reserving more resources than a single end to end backup, because segments overlap over the primary channel. But the backup segments tend to multiplex more as their primary segments' lengths are much shorter. The larger the number of backup segments, the shorter the primary segments, i.e., the smaller the number of components in each primary segment and hence the greater the multiplexing among their backup segments. Our method tends to be more resource efficient since there is a considerable improvement in the backup segments' multiplexing capability over the end to end backup's capability. Therefore, our scheme is expected to be more efficient for large networks when a large number of calls are long distance calls.
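The algorithm above translates directly into the following Python sketch (the data-structure choices and the tiny example are ours): backups[L] is the set of primary segments whose backup segments traverse link L, primaries[Ps] is the set of links used by primary segment Ps, and R[Ps] is its bandwidth.

def spare_resource(L, links, backups, primaries, R):
    # Worst-case bandwidth needed on link L over all single-link failures l != L.
    best = 0
    for l in links:
        if l == L:
            continue
        s = sum(R[ps] for ps in backups.get(L, ()) if l in primaries[ps])
        best = max(best, s)
    return best

# Two unit-bandwidth primary segments are backed up over link 'e'; they share
# link 'a', so a failure of 'a' activates both backups and nothing can be multiplexed.
primaries = {'P1': {'a', 'b'}, 'P2': {'a', 'c'}}
backups = {'e': {'P1', 'P2'}}
R = {'P1': 1, 'P2': 1}
print(spare_resource('e', ['a', 'b', 'c'], backups, primaries, R))  # -> 2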
3
Backup Route Selection
Several elaborate routing methods have been developed which search for routes using various QoS metrics. The optimal routing problem of minimizing the amount of
spare resources while providing the guaranteed fault tolerance level is known to be NP-hard. So we resort to heuristics. Several greedy heuristics for selecting end to end backup paths are discussed in [5]. A shortest path search algorithm like Dijkstra's is enough to find the minimum cost path, where the cost value for a link can be made a function of delay, spare resource reservation needed, etc. The complexity of our problem of selecting segmented backups is far greater, as we have to address additional constraints due to our following design goals.
Improving Call Acceptance Rate: Our scheme tends to improve the call acceptance rate over end to end backups for two main reasons. Firstly, it tends to improve the call acceptance in situations where there exists a primary path but the call gets rejected due to the lack of an end to end disjoint backup path. We have already shown this through a simple example in Figure 2. Secondly, by reserving a smaller amount of resources it allows more calls to be accepted. This method, however, has the problem of choosing the appropriate intermediate nodes (the nodes chosen should not only allow backup segments but should also economize on the resource reservation). Improving Resource Reservation: This sets up two opposing constraints. First, the longer the primary segments, the smaller the number of backup segments required. Too short primary segments can lead to a requirement of large amounts of resources for the large number of backup segments (note that each backup segment requires more resource than the primary segment which it spans). On the contrary, shorter primary segments lead to more multiplexing among their backup segments, as described before. So we have to choose primary segments which are neither too short nor too long. Increase in the Delay Due to Backup: We are interested only in backup segments which do not lead to an unreasonable increase in delay in case of a failure in their primary segment, which constrains the choice of intermediary nodes. Even in the case of end to end detouring we face these constraints, but we have a very simple way out: the shortest path algorithm run on the network with the nodes of the primary path removed should give a very good solution, and if it fails there does not exist any solution. In contrast, for our scheme we do not have the intermediate destinations fixed and we have to choose among the many possible solutions. In our heuristic we run Dijkstra's shortest path algorithm from source to destination removing all links in the primary path. If in the process Dijkstra's search algorithm comes to any node in the primary path, we mark it as an intermediate node. Then, we take the node previous to it in the primary path (in the order of increasing distance from the source) and, using it as the new source, try to find a shortest path to the destination recursively. In order to ensure that the primary segment is not too small we use a parameter MINLEAPLEN which indicates the minimum number of nodes in any primary segment. Thus, we remove the first MINLEAPLEN nodes starting from the new source along the primary path every time before beginning the search for the shortest path to the destination. It is also important that the delay increment for any backup segment is below a threshold Δ for the backup to be of use. This tends to prevent lengthy backups for very small primary segments. In case the destination cannot be reached or the Δ condition is violated, we start Dijkstra's algorithm again from the first segment, this time avoiding the nodes which were chosen as the end of the first segment in previous attempts. The number of times we go back and try again (number of retries) is constant and can be set as a parameter. It is to be noted that our scheme tends to perform better in comparison to the scheme in [6] for large networks, with moderate congestion and for long distance calls.
Further, it is important to note that for small networks with short distance calls this scheme mimics the end to end backup scheme in [6] as we do allow a backup to be made of just one segment. In case of connections with very
short primary paths, our heuristic chooses a backup with a single segment.
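As a rough illustration of the overall idea only (this is not the authors' heuristic: it segments by hop count rather than the overlap-on-a-link rule, omits the Δ check and the retries, and the graph representation is our own), the sketch below repeatedly runs Dijkstra from the current segment start with all primary links removed and ends each backup segment at the first reachable primary node at least MINLEAPLEN positions further along the primary.

import heapq

def dijkstra(adj, src, banned_links):
    # adj: node -> list of (neighbour, weight); banned_links: set of (u, v) pairs.
    dist, prev, pq = {src: 0}, {}, [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float('inf')):
            continue
        for v, w in adj.get(u, ()):
            if (u, v) in banned_links:
                continue
            if d + w < dist.get(v, float('inf')):
                dist[v] = d + w
                prev[v] = u
                heapq.heappush(pq, (d + w, v))
    return dist, prev

def extract_path(prev, src, dst):
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]

def segmented_backups(adj, primary, minleaplen=2):
    banned = {(primary[i], primary[i + 1]) for i in range(len(primary) - 1)}
    banned |= {(v, u) for (u, v) in banned}        # links are bidirectional here
    segments, start = [], 0
    while start < len(primary) - 1:
        dist, prev = dijkstra(adj, primary[start], banned)
        lo = min(start + minleaplen, len(primary) - 1)   # destination is always admissible
        ends = [i for i in range(lo, len(primary)) if primary[i] in dist]
        if not ends:
            return None                            # give up (no retries in this sketch)
        end = ends[0]                              # nearest admissible primary node
        segments.append(extract_path(prev, primary[start], primary[end]))
        start = end
    return segments

# Toy mesh-like graph: primary path A-B-C-D-E plus a parallel row F-G-H-I-J.
adj = {
    'A': [('B', 1), ('F', 1)], 'B': [('A', 1), ('C', 1), ('G', 1)],
    'C': [('B', 1), ('D', 1), ('H', 1)], 'D': [('C', 1), ('E', 1), ('I', 1)],
    'E': [('D', 1), ('J', 1)],
    'F': [('A', 1), ('G', 1)], 'G': [('F', 1), ('H', 1), ('B', 1)],
    'H': [('G', 1), ('I', 1), ('C', 1)], 'I': [('H', 1), ('J', 1), ('D', 1)],
    'J': [('I', 1), ('E', 1)],
}
print(segmented_backups(adj, ['A', 'B', 'C', 'D', 'E'], minleaplen=2))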
4
Failure Recovery
When a fault occurs in a component in the network, all dependable connections passing through it have to be rerouted through their backup paths. This process is called failure recovery. It has three phases: fault detection, failure reporting and backup activation. The restoration time, called failure recovery delay, is crucial to many real time applications and has to be minimized. In our model, we assume that when a link fails, its end nodes can detect the failure. For failure detection techniques and their evaluation refer to [8]. After fault detection, the nodes which have detected the fault report it to the concerned nodes for recovering from the failure. This is called failure reporting. After the failure report reaches certain nodes, the backup is activated by those nodes. Failure reporting and backup activation need to use control messages. For this purpose, we assume a real time control channel (RCC) [6] for sending control messages. In RCC, separate channels are established for sending control messages, and it guarantees a minimum rate of sending messages. Failure Reporting and Backup Activation: The nodes adjacent to a failed component in the primary path of a dependable connection will detect the failure and send failure reports both towards the source and the destination. In the end to end backup scheme, these messages have to reach the source and destination before they can activate the backup path. In our scheme, this is not necessary. Failures can be handled more locally. The end nodes of the primary segment containing the faulty component, on receiving the failure reports, initiate the recovery process. These two nodes send the activation message along the backup segment, and the dependable connection service is resumed. This process is illustrated in Figure 3. If there are k segments in the backup, then this gives about an O(k) improvement in the time for failure reporting.

Fig. 3. Illustration of Failure Recovery

When a fault occurs, not only do we experience a disruption of service for some time, but also packets transmitted during the failure reporting time are lost. Most real time applications cannot tolerate much message loss. In our scheme the message loss is reduced to a considerable extent. When a fault occurs in one segment of the primary, only the packets which have entered that segment between the time of occurrence of the fault and the backup segment activation are lost. Other packets in the segments before and after the failed segment are not affected and will be delivered normally. This is in contrast to the end to end backup case, where all packets in transit in the primary path before the failed component, between occurrence of failure and backup activation, are lost.
5
Delay and Scalability
Delay: In real time communication, it is essential to have the delays along both the primary and the backup channels be as low as possible. Hence, we might have a restriction on the amount by which the delay along the backup exceeds that along the primary path. Let the total delay along the backup path not exceed the delay along the primary by Δ, a specified QoS parameter. Thus, the constraint for choosing an end to end backup is: delay(backup path) - delay(primary path) ≤ Δ. In the case of segmented backups, this constraint is: delay(backup segment i) - delay(primary segment i) ≤ Δ, for all i. We see that in our case we have to minimize the delay increase for each segment independently. Hence the call acceptance rate will be better, since it is easier to find small segments than to find big end to end backups satisfying the Δ constraint. Scalability: The segmented backup scheme scales well since it does not demand global knowledge and does not involve any kind of broadcast. There is no necessity for a network manager and this scheme works well in a distributed network. For backup multiplexing, each node needs to know the primary paths of the channels whose backups pass through it. This is easily accomplished if the information is sent along with the packet requesting the establishment of the backup channel. Upon encountering faults, control messages are not broadcast, but sent only to a limited part of the network affected by the fault. In large networks, the effectiveness of the segmentation increases as the mean path length of connections increases. Since the calculation of spare resources using multiplexing has to be done per segment independently, this scheme scales better than the earlier end to end methods.
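A minimal sketch of these two constraints (the names and sample numbers are ours): the end to end rule applies one inequality to the whole path, whereas the segmented rule applies the same bound to every primary/backup segment pair, so a backup whose total extra delay exceeds Δ can still be acceptable segment by segment.

def end_to_end_ok(primary_delay, backup_delay, delta):
    return backup_delay - primary_delay <= delta

def segmented_ok(primary_segment_delays, backup_segment_delays, delta):
    return all(b - p <= delta
               for p, b in zip(primary_segment_delays, backup_segment_delays))

# Total backup delay exceeds the primary by 5 > delta, yet every segment is within delta.
print(end_to_end_ok(10, 15, delta=4))               # False
print(segmented_ok([4, 3, 3], [6, 5, 4], delta=4))  # True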
6
Performance Evaluation
We evaluated the proposed scheme by carrying out simulation experiments similar to those in [6], on a 12 x 12 mesh. We also implemented the end to end backup scheme [6] for comparative study. In the simulated network, neighbour nodes are connected by two simplex links, one in each direction, and all links have identical bandwidth. For simplicity, the bandwidth requirement of all connections was set equal to 1 unit. The delay of each link was set to 1, thereby making the delay along any path equal to its path length. Primary channels were routed using a sequential shortest-path search algorithm. The end to end backups were also routed using the same shortest-path search algorithm, with the condition that they do not contain any component of the primary other than the source and destination. The amount by which the backup path delay can exceed the primary path delay was used as a parameter, Δ. We find the backup segments as described in Section 3. The number of retries was set to 9 in our simulation experiments. The MINLEAPLEN parameter was set to 4. Connections were requested incrementally, between a source and destination chosen randomly, with the condition that no (source, destination) pair is repeated and the length of the shortest path between them is at least MINPATHLEN. In our simulation studies, connections were only established but not torn down since (i) the computational time required for release of connections is considerably high, and (ii) earlier studies with end to end backups [5,6] also do the same. The results are shown in Table 1. In this table, we show the statistics at different instants of time in the simulation. The number of connections requested is proportional to the time. The network load at that time is also shown. Table 1 shows the average amount of spare bandwidth reserved per connection, both for segmented
(seg) and end to end (end) backups, for different values of Δ. We show the results for MINPATHLEN=6 and for MINPATHLEN=8. The average path lengths in the two cases were 10.8 and 12.3. The bandwidth of the links was chosen as 100 units for MINPATHLEN=6 and 90 units for MINPATHLEN=8. As expected, the spare bandwidth reserved was much lower for segmented backups. Also, the improvement is seen to be more in the second case. This illustrates that as the average length of connections increases, the effectiveness of segmented backups increases. The cumulative number of requests rejected till an instant of time was also noted. The number rejected by the segmented backup scheme was seen to be much smaller than that of the end to end scheme.

Table 1. Average amount of spare bandwidth reserved per connection

MINPATHLEN = 6                                      MINPATHLEN = 8
Time  n/w load  Δ=2 (end/seg)  Δ=4 (end/seg)        Time  n/w load  Δ=2 (end/seg)  Δ=4 (end/seg)
1245  42%       7.55 / 7.06    7.50 / 7.07          1284  53%       8.72 / 8.16    8.71 / 8.16
7 Conclusions
In this paper, we have proposed segmented backups: a failure recovery scheme for dependable real-time communication in distributed networks. This mechanism not only improves resource utilisation and call acceptance rate but also provides for faster failure recovery. We evaluated the proposed scheme through simulations and demonstrated the superior performance of the scheme compared to earlier end to end backup schemes [5-7]. In order to realise the full potential of the method of segmented backups, better routing strategies have to be developed for choosing intermediate nodes optimally. We also need faster algorithms for backup multiplexing.
References
1. P. Ramanathan and K. G. Shin, "Delivery of time-critical messages using a multiple copy approach," ACM Trans. Computer Systems, vol. 10, no. 2, pp. 144-166, May 1992.
2. B. Kao, H. Garcia-Molina, and D. Barbara, "Aggressive transmissions of short messages over redundant paths," IEEE Trans. Parallel and Distributed Systems, vol. 5, no. 1, pp. 102-109, January 1994.
3. Q. Zheng and K. G. Shin, "Fault-tolerant real-time communication in distributed computing systems," in Proc. IEEE FTCS, pp. 86-93, 1992.
4. W. Grover, "The self-healing network: A fast distributed restoration technique for networks using digital crossconnect machines," in Proc. IEEE GLOBECOM, pp. 1090-1095, 1987.
5. S. Han and K. G. Shin, "Efficient spare-resource allocation for fast restoration of real-time channels from network component failures," in Proc. IEEE RTSS, pp. 99-108, 1997.
6. S. Han and K. G. Shin, "A primary-backup channel approach to dependable real-time communication in multihop networks," IEEE Trans. on Computers, vol. 47, no. 1, pp. 46-61, January 1998.
7. C. Dovrolis and P. Ramanathan, "Resource aggregation for fault tolerance in integrated services networks," ACM SIGCOMM Computer Communication Review, 1999.
8. S. Han and K. G. Shin, "Experimental evaluation of failure detection schemes in real-time communication networks," in Proc. IEEE FTCS, pp. 122-131, 1997.
Real-Time Coordination in Distributed Multimedia Systems Theophilos A. Limniotes and George A. Papadopoulos Department of Computer Science University of Cyprus 75 Kallipoleos Str, P.O.B. 20537 CY-1678 Nicosia Cyprus E-mail: {theo,george}@cs.ucy.ac.cy
Abstract. The coordination paradigm has been used extensively as a mechanism for software composition and integration. However, little work has been done for the cases where the software components involved have real-time requirements. The paper presents an extension to a state-of-the-art control- or event-driven coordination language with real-time capabilities. It then shows the capability of the proposed model in modelling distributed multimedia environments
1 Introduction
The concept of coordinating a number of activities, possibly created independently from each other, such that they can run concurrently in a parallel and/or distributed fashion has received wide attention and a number of coordination models and associated languages ([4]) have been developed for many application areas such as high-performance computing or distributed systems. Nevertheless, most of the proposed coordination frameworks are suited for environments where the sub-components comprising an application are conventional ones in the sense that they do not adhere to any real-time constraints. Those few that are addressing this issue of real-time coordination either rely on the ability of the underlying architecture apparatus to provide real-time support ([3]) and/or are confined to using a specific real-time language ([5]). In this paper we address the issue of real-time coordination but with a number of self imposed constraints, which we feel, if satisfied, will render the proposed model suitable for a wide variety of applications. These constraints are: • The coordination model should not rely on any specific architecture configuration supporting real-time response. • The real-time capabilities of the coordination framework should be able to be met in a variety of systems including distributed ones. • Language interoperability should not be sacrificed and the real-time framework should not be based on the use of specific language formalisms.
We attempt to meet the above-mentioned targets by extending a state-of-the-art coordination language with real-time capabilities. In particular, we concentrate on the so-called control- or event-driven coordination languages ([4]), which we feel are particularly suited for this purpose, and more to the point the language Manifold ([1]). We show that it is quite natural to extend such a language with primitives enforcing real-time coordination and we apply the proposed model to the area of distributed multimedia systems.
2 The Coordination Language Manifold
Manifold ([1]) is a control- or event-driven coordination language, and is a realisation of a rather recent type of coordination models, namely the Ideal Worker Ideal Manager (IWIM) one. In Manifold there exist two different types of processes: managers (or coordinators) and workers. A manager is responsible for setting up and taking care of the communication needs of the group of worker processes it controls (non-exclusively). A worker on the other hand is completely unaware of who (if anyone) needs the results it computes or from where it itself receives the data to process. Manifold possess the following characteristics: • Processes. A process is a black box with well-defined ports of connection through which it exchanges units of information with the rest of the world. • Ports. These are named openings in the boundary walls of a process through which units of information are exchanged using standard I/O type primitives analogous to read and write. Without loss of generality, we assume that each port is used for the exchange of information in only one direction: either into (input port) or out of (output port) a process. We use the notation p.i to refer to the port i of a process instance p. • Streams. These are the means by which interconnections between the ports of processes are realised. A stream connects a (port of a) producer (process) to a (port of a) consumer (process). We write p.o -> q.i to denote a stream connecting the port o of a producer process p to the port i of a consumer process q. • Events. Independent of streams, there is also an event mechanism for information exchange. Events are broadcast by their sources in the environment, yielding event occurrences. In principle, any process in the environment can pick up a broadcast event; in practice though, usually only a subset of the potential receivers is interested in an event occurrence. We say that these processes are tuned in to the sources of the events they receive. We write e.p to refer to the event e raised by a source p. Activity in a Manifold configuration is event driven. A coordinator process waits to observe an occurrence of some specific event (usually raised by a worker process it coordinates) which triggers it to enter a certain state and perform some actions. These actions typically consist of setting up or breaking off connections of ports and streams. It then remains in that state until it observes the occurrence of some other event, which causes the preemption of the current state in favour of a new one corresponding to that event. Once an event has been raised, its source generally continues with its activities, while the event occurrence propagates through the
environment independently and is observed (if at all) by the other processes according to each observer’s own sense of priorities. More information on Manifold can be found in [1]; the language has already been implemented on top of PVM and has been successfully ported to a number of platforms including Sun, Silicon Graphics, Linux, and IBM AIX, SP1 and SP2.
3 Extending Manifold with a Real-Time Event Manager
The IWIM coordination model and its associated language Manifold have some inherent characteristics which are particularly suited to the modelling of real-time software systems. Probably the most important of these is the fact that the coordination formalism has no concern about the nature of the data being transmitted between input and output ports, since they play no role at all in setting up coordination patterns. More to the point, a stream connection between a pair of input-output ports simply passes anything that flows within it from the output to the input port. Furthermore, the processes involved in some coordination or cooperation scenario are treated by the coordination formalism (and in return treat each other) as black boxes without any concern being raised as to their very nature or what exactly they do. Thus, for all practical purposes, some of those black boxes may well be devices (rather than software modules) and the information being sent or received by their output and input ports respectively may well be signals (rather than ordinary data). Note also that the notion of stream connections as a communication metaphor captures both the case of transmitting discrete signals (from some device) and that of continuous signals (from, say, a media player). Thus, IWIM and Manifold are ideal starting points for developing a real-time coordination framework. In fact, a natural way to enhance the model with real-time capabilities is by extending its event manager. More to the point, we enhance the event manager with the ability to express real-time constraints associated with the raising of events but also with reacting in bounded time to observing them. Thus, while in the ordinary Manifold system the raising of some event e by a process p and its subsequent observation by some other process q are done completely asynchronously, in our extended framework timing constraints can be imposed regarding when p will raise e but also when q should react to observing it. Effectively, an event is no longer a pair <e,p>, but a triple <e,p,t> where t denotes the moment in time at which the event occurs. With events that can be raised and detected respecting timing constraints, we essentially have a real-time coordination framework, since we can now guarantee that changes in the configuration of some system's infrastructure will be done in bounded time. Thus, our real-time Manifold system goes beyond ordinary coordination to providing temporal synchronization.
3.1 Recording Time
A number of primitives exist for capturing the notion of time, either relative to world time, the occurrence of some event, etc. during the execution of a multimedia
application, which we refer to below as a presentation. These primitives have been implemented as atomic (i.e. not Manifold) processes in C and Unix. In particular: • AP_CurrTime(int timemode) returns the current time according to the parameter timemode. It could be world time or relative. • AP_OccTime(AP_Event anevent, int timemode) returns the time point (in world or relative mode) of an event. A time point represents a single instant in time; two time points form a basic interval of time. • AP_PutEventTimeAssociation(AP_Event anevent) creates a record for every event that is to be used in the presentation and inserts it in the events table mentioned above. • AP_PutEventTimeAssociation_W(AP_Event anevent) is a similar primitive which additionally marks the world time when a presentation starts, so that the rest of the events can relate their time points to it.
3.2 Expressing Temporal Relationships
There are two primitives for expressing temporal constraints among events raised and/or observed. The first is used to specify when an event must be triggered while the second is used to specify when the triggering of an event must be delayed for some time period. • AP_Cause(AP_Event anevent, AP_Event another, AP_Port delay, AP_Port timemode) enables the triggering of the event another based on the time point of anevent. • AP_Defer(AP_Event eventa, AP_Event eventb, AP_Event eventc, AP_Port delay) inhibits the triggering of the event eventc for the time interval specified by the events eventa and eventb. This inhibition of eventc may be delayed for a period of time specified by the parameter delay.
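Read operationally, AP_Cause amounts to deriving one event's trigger time from another event's recorded time point. The small C sketch below is our own illustration of that idea (the events table, the simplified signature and the use of seconds are assumptions made for the sketch; the paper's atomic processes are not shown):

#include <time.h>

typedef int AP_Event;

/* events_table[e] holds the recorded time point of event e (see 3.1);
 * a value of 0 means "not yet occurred".  This table is an assumption. */
static time_t events_table[256];

/* Trigger `another` `delay` seconds after the recorded time point of
 * `anevent` (relative mode). */
void ap_cause_sketch(AP_Event anevent, AP_Event another, int delay)
{
    time_t t = events_table[anevent];
    if (t == 0)
        return;                        /* source event has not occurred yet */
    events_table[another] = t + delay; /* time point at which `another` fires */
    /* a real implementation would arm a timer and raise the event then */
}

AP_Defer works analogously, except that it records an interval during which the triggering of the target event is inhibited.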
4 Coordination of RT Components in a Multimedia Presentation
We show the applicability of our proposed model by modelling an interactive multimedia example with video, sound, and music. A video accompanied by some music is played at the beginning. Then, three successive slides appear with a question. For every slide, if the answer given by the user is correct the next slide appears; otherwise the part of the presentation that contains the correct answer is re-played before the next question is asked. There are two sound streams, one for English and another one for German. For each such medium, there exists a separate manifold process. Each such manifold process is a “building block”. The coordination set up with the stream connections between the involved processes is shown below (the functionality of some of these boxes is explained later on):
[Diagram: the coordination set-up of the presentation, in which the Video Server and the Audio Server (with english and german streams) are connected through the Splitter and Zoom building blocks to the ports zero, one, two and three of the Presentation Server.]
We now show in more detail some of the most important components of our set up. We start with the manifold that coordinates the execution of atomics that take a video from the media object server and transfer it to a presentation server.

manifold tv1() {
  begin: (activate(cause1,cause2,mosvideo,splitter,zoom), cause1, WAIT).
  start_tv1: (cause2, mosvideo -> ( -> splitter), splitter.zoom -> zoom,
              zoom -> ( -> ps.zero), ps.out1 -> stdout, WAIT).
  end_tv1: post(end).
  end: (activate(ts1), ts1).
}

In addition to the begin and end states, which apply at the beginning and the end of the manifold's execution respectively, two more states are invoked by the AP_Cause commands, namely start_tv1 and end_tv1. At the begin state the instances of the atomics cause1, cause2, mosvideo, splitter, and zoom are activated. These activations introduce them as observable sources of events. This state is synchronized to preempt to start_tv1 with the execution of cause1. More to the point, the declaration of the instance cause1

process cause1 is AP_Cause(eventPS,start_tv1,3,CLOCK_P_REL)

indicates that the preemption to start_tv1 should occur 3 seconds (relative time) after the raising of the presentation start event eventPS. Within start_tv1 the other three instances, cause2, mosvideo, and splitter, are executed in parallel. cause2 synchronizes the preemption to end_tv1 and its declaration

process cause2 is AP_Cause(eventPS,end_tv1,13,CLOCK_P_REL)

indicates that the currently running state must execute the other two atomic instances within 13 seconds. So the process for the media object mosvideo keeps sending its data to splitter until the state is preempted to end_tv1. The mosvideo coordinating instance supplies the video frames to the splitter manifold. The role
of splitter here is to process the video frames in two ways: one with the intention to be magnified (by the zoom manifold) and the other at normal size directly to a presentation port. zoom is an instance of an atomic which takes care of the video magnification and supplies its output to another port of the presentation server. The presentation server instance ps filters out the input from the supplying instances, i.e. it arranges the audio language (English or German) and the video magnification selection. At the end_tv1 state the presentation ceases and control is passed to the end state. Finally, at the end state, the tv1 manifold is activated and performs the first question slide manifold ts1. This prompts a question, which if answered correctly prompts in return the next question slide. A wrong answer leads to the replaying of the presentation that relates to the correct answer, before going on with the next question slide. The code for a slide manifold is given below.

manifold tslide1() {
  begin: (activate(cause7), cause7, WAIT).
  start_tslide1: (activate(testslide), testslide, WAIT).
  tslide1_correct: "your answer is correct" -> stdout; (activate(cause8), cause8, WAIT).
  tslide1_wrong: "your answer is wrong" -> stdout; (activate(cause9), cause9, WAIT).
  end_tslide1: (post(end), WAIT).
  start_replay1: (activate(replay1,cause10), replay1, cause10, WAIT).
  end_replay1: (activate(cause11), cause11, WAIT).
  end: (activate(ts2), ts2).
}

The instance cause7 is responsible for invoking the start_tslide1 state. The declaration for the cause7 instance is

process cause7 is AP_Cause(end_tv1,start_slide1,3,CLOCK_P_REL)

Here we specify that start_slide1 will start 3 seconds after the occurrence of end_tv1. Inside that state, the testslide instance is activated and eventually causes preemption to either tslide1_correct or tslide1_wrong, depending on the reply. The tslide1_wrong instance causes a transition to the start_replay1 state, which causes the replay of the required part of the presentation and then preempts, through cause10, to end_replay1. That in turn preempts, through cause11, to end_tslide1, after replaying the relevant presentation. The end_replay1 state marks the end of the repeated presentation and preempts to end_tslide1. The tslide1_correct state also causes the end_tslide1 event, through the instance cause8. The end_tslide1 state simply preempts to the end state, which contains the execution of the next slide's instance. The main program begins with the declaration of the events used in the program.

AP_PutEventTimeAssociation_W(eventPS)
is the first event of the presentation and puts the current time as its time point. For the rest of the events the function AP_PutEventTimeAssociation(event) is used, which leaves the time point empty. Then the implicit instances of the media manifolds are executed in parallel at the end of the block. These are (tv1, eng_tv1, ger_tv1, music_tv1), where tv1 is the manifold for the video transmission, eng_tv1 is the manifold for the English narration transmission, ger_tv1 is the manifold for the German narration transmission, and music_tv1 is the manifold for the music transmission.
5 Conclusions
In this paper we have addressed the issue of real-time coordination in parallel and distributed systems. In particular, we have extended a control- or event-driven coordination language with a real-time event manager that allows expressing timing constraints in the raising, observing, and reacting to events. Thus, state transitions are done in a temporal sequence and affect accordingly the real-time behaviour of the system. We have tested our model with a scenario from the area of multimedia systems where recently issues of coordination and temporal synchronization at the middleware level have been of much interest to researchers ([2]).
References
1. F. Arbab, "The IWIM Model for Coordination of Concurrent Activities", First International Conference on Coordination Models, Languages and Applications (Coordination'96), Cesena, Italy, 15-17 April, 1996, LNCS 1061, Springer Verlag, pp. 34-56.
2. G. Blair, J-B. Stefani, Open Distributed Processing and Multimedia, Addison-Wesley, 1998.
3. IEEE Inc., "Another Look at Real-Time Programming", Special Section of the Proceedings of the IEEE 79(9), September, 1991.
4. G. A. Papadopoulos and F. Arbab, "Coordination Models and Languages", Advances in Computers, Marvin V. Zelkowitz (ed.), Academic Press, Vol. 46, August, 1998, 329-400.
5. M. Papathomas, G. S. Blair and G. Coulson, "A Model for Active Object Coordination and its Use for Distributed Multimedia Applications", LNCS, Springer Verlag, 1995, pp. 162-175.
6. S. Ren and G. A. Agha, "RTsynchronizer: Language Support for Real-Time Specifications in Distributed Systems", ACM SIGPLAN Workshop on Languages, Compilers and Tools for Real-Time Systems, La Jolla, California, 21-22 June, 1995.
Supporting Fault-Tolerant Real-Time Applications using the RED-Linux General Scheduling Framework ?
Kwei-Jay Lin and Yu-Chung Wang
Department of Electrical and Computer Engineering
University of California, Irvine, CA 92697-2625
{klin, [email protected]

Abstract. In this paper, we study the fault-tolerant support for real-time applications. In particular, we study the scheduling issues and kernel support for fault monitors and the primary-backup task model. Using the powerful scheduling framework in RED-Linux, we can support jitterless fault monitoring. We can also provide task execution isolation so that an erroneous runaway task will not take away additional CPU budget from other concurrently running tasks. Finally, we provide a group mechanism to allow the primary and backup jobs of a fault-tolerant task to share both the CPU budget as well as other resources. All these mechanisms make the implementation of fault-tolerant real-time systems easier.
1 Introduction
As more computer-based systems are now used in our daily life, many applications must be designed to meet real-time or response-time requirements, or human safety may be jeopardized. Real-time applications must be fault-tolerant both to timing faults and to logical faults. Timing faults occur when an application cannot produce a result before its expected deadline. Logical faults occur when an application produces a wrong result before or after the deadline. Both types of faults must be handled in a fault-tolerant real-time system. Supporting fault-tolerant mechanisms in real-time systems therefore is a complex issue. Finding a powerful real-time OS to support fault-tolerant applications is even more difficult.
(This research was supported in part by UC/MICRO 98-085, 99-073 and 99-074, Raytheon and GeoSpatial Technologies, and by NSF CCR-9901697.)
We have been working on a real-time kernel project based on Linux. Our real-time kernel project is called RED-Linux (Real-time and Embedded Linux). For efficiency, we have implemented a mechanism that provides a short task dispatch time [18]. To enhance the flexibility, we provide a general scheduling framework (GSF) in RED-Linux [19]. In addition to the priority-driven scheduling, RED-Linux supports the time-driven [7-9] and the share-driven (such as the proportional sharing [14] and approximations [2, 17]) scheduling paradigms. In this paper, we investigate how GSF in RED-Linux may support fault-tolerant real-time systems. We review the primitives for many fault-tolerant real-time
system models and study how to support (or enforce) them in the framework. By adjusting scheduling attribute values and selection criteria in the scheduler, it is possible to implement many fault-tolerant scheduling algorithms in our framework efficiently. In particular, we study the scheduling issues and kernel support for fault monitors and the primary-backup task model. Using the powerful scheduling framework in RED-Linux, we can support jitterless fault monitoring. We can also easily specify the CPU budget for each computation so that an erroneous runaway task will not take away the CPU budget reserved for other concurrently running tasks. Finally, we provide a group mechanism to allow the primary and backup jobs of a fault-tolerant task to share both the CPU budget as well as other resources. All these mechanisms make the implementation of fault-tolerant real-time systems easier. The rest of this paper is organized as follows. Section 2 reviews popular scheduling paradigms used in real-time systems and other real-time OS projects. Section 3 briefly introduces the RED-Linux general scheduling framework. We then study the fault monitoring issues for real-time systems in Section 4. Section 5 presents the design of the task group mechanism in RED-Linux. The paper is concluded in Section 6.
2 Related Work on Fault-Tolerant and Real-Time Support
Several previous works have studied fault-tolerant real-time scheduling issues. Liestman and Campbell [11] propose a scheduling algorithm for frame-based, simply periodic uniprocessor systems. Each task has two versions: primary and backup. Task schedules are dynamically selected from a pre-defined schedule tree depending on the completion status of the primary tasks. Chetto and Chetto [5] present an optimal scheduling strategy based on a variant of the EDF algorithm, called EDL, to generate fault-tolerant schedules for tasks that are composed of primary and alternate jobs. Their method provides the ability to dynamically change the schedule and to account for runtime situations such as successes or failures of primaries. Caccamo and Buttazzo [4] propose a fault-tolerant scheduling model using the primary and backup task model for a hybrid task set consisting of firm and hard periodic tasks on a uniprocessor system. The primary version of a hard task is always scheduled first if it is possible to finish it and the backup task before the deadline. If not, only the backup task is scheduled. Another interesting work related to real-time fault-tolerance is the Simplex architecture [15]. The Simplex architecture is designed for on-line upgrade of real-time software applications by using redundant software components. By allowing different versions of a software component to be executed in sequence or in parallel, real-time application software can be dynamically replaced with negligible down-time. The architecture can also be used for fault tolerance. Our goal in this paper is not to propose a new fault-tolerant model but to study the OS support for those proposed earlier. Using RED-Linux's general
scheduling framework, we hope to be able to support many existing fault-tolerant mechanisms effectively and efficiently. To support the fault-tolerant mechanisms mentioned above, at least two mechanisms are necessary. The first is a way to define the group relationship between related tasks (primary and backup, old and new versions, etc.) to allow them to share the budget for CPU or other resources. The other is a predictable monitoring facility. These fault tolerance supports from RED-Linux will be discussed in this paper.
3 The RED-Linux General Scheduling Framework
The goal of the RED-Linux general scheduling framework (GSF) is to support most well-known scheduling paradigms, including the priority-driven, the time-driven [7-9] and the share-driven [14, 2, 17], so that any application can use RED-Linux for real-time support. Two features have been introduced: the general scheduling attributes used in the framework and the scheduler components used to make scheduling decisions. In our model, the smallest schedulable unit is called a job. For systems with periodic activities, we call a job stream a periodic task. Different scheduling paradigms use different attributes to make scheduling decisions. In order for all paradigms to be supported in GSF, it is important for all useful timing information to be included in the framework so that it can be used by the scheduler. We denote four scheduling attributes for each job in GSF: priority, start time, finish time, budget. Among the four, the start time and the finish time together define the eligible interval of a job execution. The priority specifies the relative order for job execution. The budget specifies the total execution time assigned to a job. These attributes can be used as constraints. However, these timing attributes can also be used as the selection factors when a scheduler needs to select a job to be executed next. RED-Linux uses a two-component scheduling framework. The framework separates the low level scheduler, or dispatcher, from the QOS parameter translator, or allocator. We also design a simple interface to exchange information between these two components. It is the allocator's responsibility to set up the four scheduling attributes associated with each real-time job according to the current scheduling policy. The dispatcher inspects each job's scheduling attribute values, chooses one job from the ready queue and dispatches it to execution. In addition to assigning attribute values, the allocator also determines the evaluation function of scheduling attributes, since each job has multiple scheduling attributes. This is done by producing an effective priority for each job. The allocator uses one or more attributes to produce the effective priority so that the dispatcher will follow a specific scheduling discipline. More details on the GSF implementation and the performance measurement can be found in [19].
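As a rough illustration of how the four attributes and the allocator/dispatcher split might look in C (the names and the EDF-style evaluation function are our own assumptions, not RED-Linux code):

struct gsf_job {
    int  priority;      /* relative execution order */
    long start_time;    /* earliest eligible time */
    long finish_time;   /* latest completion time */
    long budget;        /* total execution time assigned */
    long eff_priority;  /* produced by the allocator */
};

/* Allocator: map the scheduling attributes to an effective priority.
 * An EDF-like policy is assumed here: earlier finish time = higher priority. */
void allocate(struct gsf_job *j) { j->eff_priority = j->finish_time; }

/* Dispatcher: among eligible jobs (start time reached, budget left),
 * pick the one with the smallest effective priority value. */
struct gsf_job *dispatch(struct gsf_job jobs[], int n, long now)
{
    struct gsf_job *best = 0;
    for (int i = 0; i < n; i++) {
        if (jobs[i].start_time <= now && jobs[i].budget > 0 &&
            (!best || jobs[i].eff_priority < best->eff_priority))
            best = &jobs[i];
    }
    return best;
}

Swapping the allocate() function is enough to emulate a different discipline (e.g. using the priority field for fixed-priority scheduling), which is the point of separating the allocator from the dispatcher.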
4 The Design of Fault Monitors
To provide fault tolerance, three facilities can be supported: fault detection, fault avoidance, and fault recovery. Fault-tolerant systems must be able to monitor the system and application status closely and predictably. The earlier a fault can be detected and identified, the easier it may be fixed. Depending on the type and the likelihood of faults to be monitored, cyclic monitoring is often used in systems with safety properties that must always be maintained. For example, many system components send "heartbeat" messages to each other or to a central controller to let them know that the component is still alive and well. Another example is a temperature monitoring facility that constantly reads the temperature sensor and produces a warning if the temperature is too high. Cyclic monitors are scheduled independently from any user applications. Depending on their importance, they must be executed predictably and without jitter so that they do not miss a critical warning window for an important fault. However, the traditional priority driven scheduler may not provide the kind of predictability required by fault-tolerance monitors. There is no guarantee on the execution jitter since the temporal distance between two consecutive executions of a monitor task may be as long as twice the period length [7, 8]. One effective scheduling paradigm in RED-Linux for cyclic monitors is the time-driven (TD) (or clock-driven) paradigm. For embedded systems with steady and well-known input data streams, TD schedulers have been used to provide a very predictable processing time for each data stream [7-9]. Using this scheduling paradigm, the time instants when each task starts, preempts, resumes and finishes are pre-defined and enforced by the scheduler. User applications may specify the exact time and cycle when a monitor should be activated; the Dispatcher will activate the monitor accordingly. However, using the general scheduling framework in RED-Linux, other tasks may use their own schedulers independent of the TD scheduler for monitors. The integration of TD schedules with these application schedulers has many interesting issues. For example, if an application uses fixed priority driven scheduling such as rate monotonic scheduling in the presence of TD schedulers, can we still guarantee that all periodic jobs will meet their deadlines using the schedulability condition for the RM model [12]? Suppose a fault-tolerant system has monitor jobs and priority driven (PD) jobs. The monitor jobs are scheduled using TD at exact times. Therefore the priority-driven jobs are scheduled after the TD jobs are executed. If a PD job is running when a TD job is scheduled to start, the PD job will be preempted. In other words, all TD jobs are considered to have a higher effective priority than all PD jobs. Using the RM scheduling, a system of n tasks is guaranteed to meet their deadlines if the total utilization satisfies the condition:

U = \sum_{i=1}^{n} \frac{c_i}{p_i} \le n(2^{1/n} - 1)
where a task i must be executed for ci time units per pi time interval. However, when an RM system is scheduled after a time-driven scheduler, the execution of a periodic task may be delayed or interrupted by a TD job. To handle this problem, we can treat all TD jobs as "blocking" for PD jobs, just like PD jobs are blocked on accessing critical sections. We can model all TD jobs as critical sections for PD jobs. As long as all TD jobs are short enough, the schedulability of all PD jobs can be guaranteed using this approach. In other words, all PD jobs can meet their deadlines as long as:

\sum_{j=1}^{i-1} \frac{c_j}{p_j} + \frac{c_i + b_i}{p_i} \le i(2^{1/i} - 1)
where b_i is the blocking time by all TD jobs for task i. To reduce b_i, we need to make sure that all monitoring jobs have short execution times and are not clustered together. Another useful facility for fault-tolerant real-time systems is to monitor the timing events when executing user programs. Timing faults, i.e. results produced too late or an application using more than its share of resources, may cause the system to produce erroneous responses. The OS should provide a powerful yet efficient mechanism to detect timing faults. In the original GSF reported in [19], the Dispatcher reports only those events when jobs are terminated (voluntarily or involuntarily). Some fault monitoring algorithms need to know the exact time when a job is executed, suspended or terminated. To support these algorithms, we have extended the original GSF feedback mechanism so that the Dispatcher sends this information about job executions to the Allocator.
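Before moving on, the RM condition with TD blocking given above can be evaluated numerically with a few lines of C. This is our own sketch (not RED-Linux code), assuming the tasks are indexed in rate-monotonic priority order:

#include <math.h>
#include <stdbool.h>

/* c[i], p[i]: execution time and period of task i (0-based, RM priority order);
 * b[i]: worst-case blocking time imposed on task i by all TD jobs. */
bool rm_with_td_blocking_ok(const double c[], const double p[],
                            const double b[], int n)
{
    double u = 0.0;                             /* utilization of tasks 1..i-1 */
    for (int i = 0; i < n; i++) {
        double k = i + 1;                       /* number of tasks considered */
        double bound = k * (pow(2.0, 1.0 / k) - 1.0);
        if (u + (c[i] + b[i]) / p[i] > bound)
            return false;                       /* task i may miss its deadline */
        u += c[i] / p[i];
    }
    return true;
}

The blocking term only charges each task once, which is why keeping the monitor jobs short and spread out keeps b_i, and hence the utilization loss, small.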
5 The Implementation of Task Group in RED-Linux
Many fault-tolerant functions are implemented using the primary-backup model or the N-version model so that a specific functionality can be provided by multiple jobs. For real-time scheduling, it is important for all these jobs to share a common CPU budget so that other tasks in the system will have enough CPU time for their execution. A group structure is thus introduced to distinguish a set of jobs from others. Every job in the set is assigned the same group number. For example, the primary-backup model [11, 5, 4] has two jobs. The two jobs may have different start times and finish times, as well as different priorities. However, the two jobs must share a total budget. Another example is the imprecise computation model [13] where a task may have many optional jobs that can be used to enhance the result quality. The optional jobs should share a common CPU budget for the group. To support the task group, we have implemented a hierarchical task group mechanism in RED-Linux. Similar to the concept of the hierarchical file directory structure, a task group may have other task sub-groups as members. For scheduling purposes, each task group is assigned some specific resource budget to be shared by all tasks and task sub-groups in the group. Moreover, each task
group may use a different scheduling policy to assign its resources. Therefore, the Dispatcher in GSF needs the capability to adopt different scheduling policies to select the next running job from a set of jobs. In RED-Linux, each job is in a group and has a group number. Since each sub-group is like a job in a group, each sub-group has a group number as well. Another parameter, the server number, is used by each sub-group to identify itself as a server (for scheduling). Each server job is associated with a job queue which holds all jobs that will be scheduled using the server job's budget. Each server can use a different policy to schedule the jobs on its queue. Normal real-time jobs do not have a server number; therefore the server number of a normal real-time job is set to be the same as the group number of the group to which it belongs. In other words, a job can be identified as a server job if its server number is different from its group number. A server job is a scheduling unit with an allocated budget but no application job. The Dispatcher will not execute the server job. Instead, it will select another job to be served by the server. The group algorithm implemented in the RED-Linux Dispatcher is as follows:
1. The Dispatcher selects a job K from all eligible jobs in group 0 according to the scheduling policy of group 0.
2. The server job list J is initialized to NULL.
3. If K is a real-time job, execute it; else, if K is a server job with server number i, select a job L from all eligible jobs in group i.
(a) Set a new timer to the minimum of the following values: the current time plus the budget of the server job K, the current time plus the budget of L, the finish time of the server job K, the finish time of L.
(b) Append the server job K to the job list J.
(c) Set K = L and repeat Step 3.
4. When K finishes or is interrupted, reduce the actual execution time used from the budget of all jobs in the server job list J.
The maintenance of the server list is necessary so that the execution of a job will consume the budget in all groups it belongs to under the whole group hierarchy.
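A compact C sketch of the dispatch loop just described is given below. The data-structure names are our own, and select_from_group() and run_until_timer_or_completion() stand in for the per-group policy and for the timer handling of step 3(a); the real RED-Linux implementation differs:

#define MAX_DEPTH 8

struct rjob {
    int  group;    /* group this job belongs to */
    int  server;   /* == group for normal jobs, != group for server jobs */
    long budget;   /* remaining CPU budget */
};

/* Assumed helpers (not shown): apply a group's own scheduling policy, and
 * run a job until it finishes or the step-3(a) timer expires, returning the
 * CPU time actually consumed. */
struct rjob *select_from_group(int group);
long run_until_timer_or_completion(struct rjob *job);

void group_dispatch(void)
{
    struct rjob *servers[MAX_DEPTH];                 /* server job list J */
    int depth = 0;

    struct rjob *k = select_from_group(0);           /* step 1 */
    while (k && k->server != k->group && depth < MAX_DEPTH) {
        servers[depth++] = k;                        /* step 3(b): remember server */
        k = select_from_group(k->server);            /* step 3(c): descend */
    }
    if (!k)
        return;
    long used = run_until_timer_or_completion(k);    /* steps 3(a) and 4 */
    k->budget -= used;
    for (int i = 0; i < depth; i++)                  /* charge every enclosing server */
        servers[i]->budget -= used;
}

Charging every server on the path is what makes a primary-backup pair (or any deeper sub-group) consume a single shared budget regardless of which member actually ran.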
6 Conclusions
In this paper, we present the support for fault-tolerant real-time applications using the general scheduling framework in RED-Linux. The scheduling framework is able to accommodate a variety of scheduling models used in fault-tolerant real-time applications. By using the group mechanism, fault-tolerant primary-backup tasks may share the CPU and other resource budgets efficiently. By using the time-driven scheduling, fault monitors can be executed efficiently and without jitter. By using the budget mechanism, tasks are always given their guaranteed share regardless of the possible ill behavior of other tasks.
References
1. L. Abeni and G.C. Buttazzo. Integrating multimedia applications in hard real-time systems. In Proc. IEEE Real-Time Systems Symposium, Dec 1998.
2. J.C.R. Bennett and H. Zhang. WF2Q: Worst-case fair weighted fair queueing. In Proc. of IEEE INFOCOMM'96, San Francisco, CA, pp. 120-128, March 1996.
3. S. Punnekkat and A. Burns. Analysis of checkpointing for schedulability of real-time systems. In Proceedings of IEEE Real-Time Systems Symposium, pages 198-205, San Francisco, December 1997.
4. M. Caccamo and G. Buttazzo. Optimal Scheduling for Fault-Tolerant and Firm Real-Time Systems. Proceedings of IEEE Conference on Real-Time Computing Systems and Applications, Hiroshima, Japan, Oct 1998.
5. H. Chetto and M. Chetto. An adaptive scheduling algorithm for fault-tolerant real-time systems. Software Engineering Journal, May 1991.
6. A. Demers, S. Keshav, and S. Shenker. Analysis and Simulation of a Fair Queueing Algorithm. In Journal of Internetworking Research and Experience, pp. 3-26, October 1990.
7. Ching-Chih Han, Kwei-Jay Lin and Chao-Ju Hou. Distance-constrained scheduling and its applications to real-time systems. In IEEE Trans. Computers, Vol. 45, No. 7, pp. 814-826, December 1996.
8. Chih-wen Hsueh and Kwei-Jay Lin. An optimal Pinwheel Scheduler Using the Single-Number Reduction Techniques. In Proc. of IEEE Real-Time Systems Symposium, December 1996, pp. 196-205.
9. Chih-wen Hsueh and Kwei-Jay Lin. On-line Schedulers for Pinwheel Tasks Using the Time-Driven Approach. In Proc. of the 10th Euromicro Workshop on Real-Time Systems, Berlin, Germany, June 1998, pp. 180-187.
10. K. Jeffay et al. Proportional Share Scheduling of Operating System Service for Real-Time Applications. In Proc. IEEE Real-Time Systems Symposium, Madrid, Spain, pp. 480-491, Dec 1998.
11. A. L. Liestman and R. H. Campbell. A fault-tolerant scheduling problem. IEEE Transactions on Software Engineering, 12(11):1089-95, November 1986.
12. C.L. Liu and J. Layland. Scheduling Algorithms for Multiprogramming in a Hard Real-Time Environment. Journal of the ACM, 20(1):46-61, 1973.
13. J.W.-S. Liu, K.J. Lin, W.-K. Shih, A.C. Yu, J.Y. Chung and W. Zhao. Imprecise Computation. In Proc. of IEEE, 82:83-94, 1994.
14. A. K. Parekh and R. G. Gallager. A Generalized Processor Sharing Approach to Flow Control in Integrated Services Networks: The Single-Node Case. In IEEE/ACM Trans. Networking, Vol. 1, No. 3, pp. 344-357, June 1993.
15. L. Sha. Dependable system upgrade. In Proc. IEEE Real-Time Systems Symposium, pp. 440-448, Dec 1998.
16. M. Spuri and G.C. Buttazzo. Efficient aperiodic service under the earliest deadline scheduling. In Proc. IEEE Real-Time Systems Symposium, Dec 1994.
17. I. Stoica, H. Zhang, and T.S.E. Ng. A Hierarchical Fair Service Curve Algorithm for Link-Sharing, Real-Time and Priority Services. In Proc. of ACM SIGCOMM'97, Cannes, France, 1997.
18. Y.C. Wang and K.J. Lin. Enhancing the real-time capability of the Linux kernel. In Proc. of 5th RTCSA'98, Hiroshima, Japan, Oct 1998.
19. Y.C. Wang and K.J. Lin. Implementing a general real-time framework in the RED-Linux real-time kernel. In Proc. of RTSS'99, Phoenix, Arizona, Dec 1999.
Are COTS suitable for building distributed fault-tolerant hard real-time systems?
Pascal Chevochot, Antoine Colin, David Decotigny, and Isabelle Puaut
IRISA, Campus de Beaulieu, 35042 Rennes, France
Abstract. For economic reasons, a new trend in the development of distributed hard real-time systems is to rely on the use of Commercial-Off-The-Shelf (cots) hardware and operating systems. As such systems often support critical applications, they must comply with stringent real-time and fault-tolerance requirements. The use of cots components in distributed critical systems is subject to two fundamental questions: are cots components compatible with hard real-time constraints? are they compatible with fault-tolerance constraints? This paper gives the current status of the Hades project, aiming at building a distributed run-time support for hard real-time fault-tolerant applications on top of cots components. Thanks to our experience in the design of Hades, we can give some information on the compatibility between cots components and hard real-time and fault-tolerance constraints.
1 Introduction
Real-time systems differ from other systems by a stricter criterion of correctness of their applications. The correctness of a real-time application not only depends on the delivered result but also on the time when it is produced. Critical applications (e.g. flight control systems, automotive applications, industrial automation systems) often have hard real-time constraints: missing a task deadline may cause catastrophic consequences on the environment under control. It is thus crucial for such systems to use schedulability analysis in order to prove, before the system execution, that all deadlines will be met. Moreover, critical applications exhibit fault-tolerance requirements: the failure of a system component should not cause a global system failure. Mainly due to economic reasons, a new trend in the development of distributed real-time systems is to rely on the use of Commercial-Off-The-Shelf (cots) hardware and operating systems. As such systems often support critical applications (e.g. aircraft control systems), they must comply with stringent real-time and fault-tolerance requirements. The use of cots components in distributed critical systems is subject to two fundamental questions: are cots components compatible with hard real-time constraints? are they compatible with fault-tolerance constraints?
This work is partially supported by the French Department of Defense (DGA/DSP), #98.34.375.00.470.75.65. Extra information concerning this work can be found in the Web page http://www.irisa.fr/solidor/work/hades.html.
The objective of our work is to build a run-time support for distributed fault-tolerant hard real-time applications from cots components (processor and operating system). The run-time support is built as a middleware layer running on top of cots components, and implements scheduling and fault-tolerance mechanisms. This work is achieved in cooperation with the French department of defense (DGA) and Dassault-Aviation. Our work is named hereafter Hades, for Highly Available Distributed Embedded System. This paper gives the current status of the Hades project. Thanks to the experience gained in designing Hades, we can give some information on the compatibility between cots components and hard real-time and fault-tolerance constraints. The remainder of this paper is organized as follows. Section 2 gives our preliminary results concerning the compatibility between cots components and hard real-time constraints. Section 3 then deals with the issue of providing fault-tolerance facilities on cots components. An overview of the prototype of the run-time support of Hades is given in Section 4.
2 COTS and hard real-time constraints

2.1 Methodology
Schedulability analysis is used in hard real-time systems to prove, before the system execution, that all deadlines will be met. Schedulability analysis must have static knowledge about the tasks (e.g. worst-case arrival pattern, worst-case execution time, deadline, synchronizations and communications between tasks). Worst-case execution time analysis (wcet analysis), through the analysis of a piece of code, returns an upper bound for the time required to execute it on a given hardware. Our approach to be able to apply system-wide (application and run-time support) schedulability analysis is to combine the use of a static task model and wcet analysis.
Static task model. Applications must be structured according to a task model that defines off-line:
– The internal structure of tasks: every task is described by a directed acyclic graph whose nodes model synchronization-free computations and edges model precedence constraints and data transfers between nodes.
– A set of attributes related to task execution. Synchronization attributes serve to express exclusion constraints between nodes. Timing attributes express temporal properties for tasks and nodes (task arrival law, deadline, earliest and latest start time). Distribution attributes define the site where each node of a task graph executes. Fault-tolerance attributes specify for each task or portion of task which replication strategy must be applied, and the requested replication degree.
Figure 1 illustrates the task model by depicting two tasks A and B, distributed on two sites named Site1 and Site2. Distribution attributes indicate that nodes a1 to a4, as well as nodes b1 and b2, execute on Site1; nodes a5 and b3 execute on Site2. Synchronization attributes state that a shared resource R is modified by nodes a5 and b3, and thus that there is an exclusion constraint
between these two nodes. Timing attributes specify that both A and B are periodic with respective periods 40 and 50, and give the tasks' deadlines. Finally, fault-tolerance attributes specify that nodes a2 and b1 must be made reliable by using active replication of degree 2 and that the node replicas must be located on Site1 and Site2 (see Section 3). The interest of using the above static task model is that it forces the application developer to provide all the information needed to achieve schedulability analysis.

Figure 1. Illustration of the task model: the graphs of task A (nodes a1-a5) and task B (nodes b1-b3), with attributes Synchronization: Use(a5, R, modified), Use(b3, R, modified); Distribution: Location(a1..a4, Site1), Location(a5, Site2), Location(b1..b2, Site1), Location(b3, Site2); Timing: Periodic(A, 50), Deadline(A, 40), Periodic(B, 40), Deadline(B, 40); Fault tolerance: Active(a2, 2, Site1, Site2), Active(b1, 2, Site1, Site2).
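For concreteness, such a task graph and its attributes could be represented along the following lines in C (purely illustrative; the field names are ours and are not taken from Hades):

#define MAX_SUCC 8

struct node {
    int id;
    int site;                      /* distribution attribute: where it executes */
    long wcet;                     /* filled in later by WCET analysis */
    int nsucc;
    struct node *succ[MAX_SUCC];   /* precedence constraints / data transfers */
};

enum replication { NONE, ACTIVE, PASSIVE, SEMI_ACTIVE };

struct task {
    const char *name;
    long period;                   /* timing attribute: arrival law */
    long deadline;
    struct node *entry;            /* root of the directed acyclic graph */
    enum replication strategy;     /* fault-tolerance attribute */
    int replication_degree;
};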
Use of WCET analysis. To achieve schedulability analysis, the wcets of
tasks have to be known. They can be obtained through the actual execution of tasks, or through the analysis of their code (wcet analysis). We propose to use wcet analysis because it produces a safe estimation of the tasks' wcets. A wcet analysis tool named Heptane (Hades Embedded Processor Timing ANalyzEr) has been developed. It analyses C code and produces timing information for the Pentium processor. Heptane operates on two program representations: the program syntax tree, obtained through the analysis of the program source code, and the program control flow graph, obtained through the analysis of the program assembly code generated by the C compiler. User-provided annotations are used to identify the worst-case execution path for loops. Heptane is used to obtain the wcet of nodes of application graphs, which, by construction, are blocking-free, and to obtain the wcet of the run-time support itself (operating system, and the middleware layer we have developed to provide services for fault-tolerance). In order to be suited to wcet analysis, the middleware layer has been structured into two layers: (i) a low layer, written directly in C and designed so that it never blocks; (ii) a high layer, made of tasks structured using the static task model described above, so that blocking points are statically identified. Applying wcet analysis to systems that use cots components raises a number of issues, which are described in the two following paragraphs.
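The syntax-tree side of such an analysis boils down to a bottom-up combination of node costs. The sketch below shows the usual combination rules (sequence = sum, conditional = worst branch, loop = annotated bound times the body); it is a generic illustration of the idea, not Heptane's actual algorithm:

enum kind { SEQ, ALT, LOOP, BASIC };

struct tree {
    enum kind kind;
    long cost;             /* for BASIC blocks: cost on the target processor */
    long maxiter;          /* for LOOP: user-provided iteration bound */
    int nchild;
    struct tree *child[8];
};

long wcet(const struct tree *t)
{
    long w = 0;
    switch (t->kind) {
    case BASIC:
        return t->cost;
    case SEQ:                                   /* sum of the children */
        for (int i = 0; i < t->nchild; i++) w += wcet(t->child[i]);
        return w;
    case ALT:                                   /* worst branch */
        for (int i = 0; i < t->nchild; i++) {
            long c = wcet(t->child[i]);
            if (c > w) w = c;
        }
        return w;
    case LOOP:                                  /* annotated bound x body */
        return t->maxiter * wcet(t->child[0]);
    }
    return 0;
}

The control flow graph is what allows the per-instruction hardware effects described next (cache, pipeline, branch prediction) to refine the BASIC block costs used by this tree combination.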
2.2 WCET analysis and COTS hardware

cots processors include architectural features, such as instruction caches, pipelines
and branch prediction. These mechanisms, while permitting performance improvements, are sources of complexity in terms of timing analysis. A trivial approach to deal with them is to act as if they were not present (i.e. to suppose
all memory accesses lead to cache misses and assume there is no parallelism between the execution of successive instructions). However, this approach leads to largely overestimated wcets. In order to reduce the pessimism of wcet analysis caused by the processor microarchitecture, Heptane takes into account the effect of instruction cache, pipeline and branch prediction when computing programs' wcets.
– Pipeline. The presence of pipelines is considered by simulating the flow of instructions in the pipelines, in a method similar to the one proposed in [1].
– Instruction cache. Consideration of the instruction cache uses static cache simulation [2]: every instruction is classified according to its worst-case behavior with respect to the instruction cache. The instruction classification process uses both the program syntax tree and control flow graph.
– Branch prediction. An approach similar to the one used for the instruction cache was used to integrate the effect of branch prediction (see [3] for details). Experimental results show that the timing penalty due to wrong branch predictions estimated by the proposed technique is close to the real one, which demonstrates the practical applicability of our method: from 98% to 100% of results of predictions can be known statically on a set of small benchmark programs. To our knowledge, this work is the first attempt to incorporate the effect of branch prediction into wcet analysis.
2.3 WCET analysis and COTS real-time operating systems
The first obstacle to the use of wcet analysis on cots real-time operating systems is obtaining their source code. However, at least for small kernels targeted for embedded applications, more and more operating systems come with their source code at reasonable cost. We have undertaken a study aiming at using wcet analysis to obtain the wcets of the system calls of the rtems real-time kernel, restricted for the sake of the experimentation to act on a monoprocessor architecture. We summarize below the results of the study, showing that, to some extent, the structure of the source code of rtems is suited to wcet analysis (for details, see [4]). The first conclusion of this study is that using wcet analysis to obtain the wcet of the system calls of the rtems real-time kernel is feasible. The central part of rtems was analyzed in less than two months by a student having no a priori knowledge of the internals of rtems or of wcet analysis. During the study, we discovered interesting properties of the code of rtems, that hopefully exist in other real-time operating systems, and that can have an influence on the construction of wcet analyzers suited to the analysis of operating systems: small number of loops, absence of recursion, small number of function calls performed using function pointers. Most dynamic function calls (i.e. function calls through function pointers) could have been replaced by static ones. We observed that finding the maximum number of iterations for 75% of the loops required an in-depth study of the source code of rtems. We found rather pessimistic bounds for a number of loops (dynamic memory allocation routine, scheduler). Bounds for these loops depend on the operational conditions, such
as the arrival law of interrupts or the actual number of active tasks at a given priority.
3 COTS and fault tolerance constraints

3.1 Methodology
Several issues have to be dealt with in order to use cots components (processor and operating systems) to support applications with fault-tolerance requirements. First, cots hardware generally does not include any fault-tolerance mechanism: messages may get lost on the network and processors may crash or produce incorrect results. Thus, some form of redundancy (spatial and/or temporal) must be used to tolerate a faulty hardware component. Second, most cots operating systems themselves are not designed to mask the failure of hardware components (e.g. machine crashes, network omissions). Consequently, any error-masking mechanism must be provided as a software layer outside the operating system. Our approach to support applications with fault tolerance constraints is to combine the use of off-line task replication and basic fault-tolerance mechanisms.
– Off-line task replication (see Section 3.2) transforms the task graphs (see Section 2.1) to make portions of tasks fault-tolerant through the use of replication. Off-line replication relies on a number of properties that are verified by the run-time support (fail-silence assumption, bounded and reliable communications).
– Basic fault-tolerance mechanisms are provided by the run-time support (see Section 3.3) in order to verify these assumptions. Error detection mechanisms have been designed to provide the highest possible coverage of the fail-silence assumption. A set of mechanisms (group membership, clock synchronization, multicast) altogether guarantee bounded and reliable communications.
3.2 Off-line task replication
Application tasks are made fault-tolerant through the replication of parts of their code (transformation of their structure, defined as a graph, see Section 2.1). Task replication is achieved off-line. An extensible set of error treatment strategies based on task replication (currently active, passive and semi-active replication) is provided. Task replication is achieved thanks to a tool we have developed, which is named Hydra [5]. It takes as input parameters the portions of code to be replicated, the replication strategy to be applied and the replication degree. Figure 2 illustrates the application of active replication on task A taken as example in Figure 1, where node a2 is made reliable through the use of active replication of degree 2 on Site1 and Site2. The replication tool transforms the graph of task A through the addition of: (i) two replicas for node a2 that must be made fault-tolerant (nodes aa2 and ab2); (ii) a node that computes a consensus value for the outputs of aa2 and ab2; (iii) new edges in the graph. The main interest of off-line replication is that replicas can be taken into account easily in schedulability analysis, even if only portions of tasks are replicated (see [6]). Off-line replication is correct as long as the underlying run-time support in charge of executing tasks ensures a set of properties (see [5]) like for instance the
exe-fail-silence property, stating that the run-time support of a given site either sends correct results in the time and value domains, or remains silent forever.

Figure 2. Example of application of active replication: the graph of task A before and after transformation (node a2 is replaced by the replicas aa2 and ab2, whose outputs feed a consensus node Cons before nodes a3, a4 and a5).
3.3 Basic fault-tolerance mechanisms and COTS components
Most cots operating systems are not designed to mask faulty hardware components. Thus, we have designed a set of fault-tolerance mechanisms, embedded in a middleware layer running on top of a cots real-time kernel (in our prototype, Chorus and rtems). Mechanisms fall into two categories:
– error detection mechanisms: these mechanisms detect (as far as possible) value and temporal errors and transform them into machine stops in order to reach the fail-silence assumption (property exe-fail-silence given above).
– fault masking mechanisms: these mechanisms allow masking communication faults and site crashes through the provision of fault-tolerant services: a group membership service, a reliable time-bounded multicast service and a clock synchronization service (see [7] for details on these services). The properties of these services are of great importance to ensure the exe-timeliness and exe-agreement properties. These services have been designed using the static task model of Section 2.1 to ease their integration in schedulability analysis.
The design of services such as group membership, multicast and clock synchronization on cots components did not cause any intractable difficulty. However, the problem is that all these services are entirely implemented in software on top of an existing operating system, which leads to modest worst-case performances (approximately two orders of magnitude worse than hardware-implemented solutions like the TTP/C communication chips). An estimation of the influence of these worst-case performances on the range of deadlines that can be supported has still to be achieved. A set of error detection mechanisms has been integrated into the run-time support. To detect timing errors, the run-time support includes monitoring code to (i) detect deadline and WCET exceeding; (ii) check that task arrival laws conform to their expected arrival laws. These checks are possible because the run-time support is aware of semantic information about the tasks it executes, thanks to the use of a static task model. To detect value errors, the run-time support (i) catches every hardware exception (e.g. parity error, alignment error, protection violation, division by zero); (ii) checks if a set of implementation-dependent invariants hold. We are currently estimating the coverage of the fail-silence assumption provided by these mechanisms.
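The timing-error checks described above are essentially per-job comparisons against the static task attributes. A trimmed-down illustration in C follows (the names are ours, not taken from the Hades sources):

struct exec_record {
    long deadline;          /* absolute deadline of the job */
    long wcet;              /* budget derived from WCET analysis */
    long min_interarrival;  /* expected arrival law (periodic/sporadic) */
    long last_arrival;
    long used;              /* CPU time consumed so far */
};

enum timing_error { TIMING_OK, ARRIVAL_VIOLATION, WCET_EXCEEDED, DEADLINE_MISS };

enum timing_error check_timing(const struct exec_record *r, long now, long arrival)
{
    if (arrival - r->last_arrival < r->min_interarrival)
        return ARRIVAL_VIOLATION;   /* task arrived faster than its declared law */
    if (r->used > r->wcet)
        return WCET_EXCEEDED;       /* job ran past its analysed budget */
    if (now > r->deadline)
        return DEADLINE_MISS;
    return TIMING_OK;
}

Detected timing errors, like value errors, are turned into machine stops so that the fail-silence assumption relied upon by off-line replication holds.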
4 Experimental platform
A prototype of the run-time support of Hades has been developed (see [7] for details). It runs on a network of Pentium PCs, and has been ported to Chorus and RTEMS. It is divided into two layers: (i) a kernel, developed in C, in charge of executing tasks that are developed according to the task model of § 2.1, as well as detecting errors; (ii) a set of services, most of them implementing fault-masking mechanisms. Services are developed according to the task model of § 2.1 so that system-wide schedulability analysis can be applied. We are currently porting an avionic application, provided by our industrial partner Dassault Aviation, to our prototype.
5 Concluding remarks
This paper has given the current status of the Hades project, which aims at building a distributed run-time support for applications with hard real-time and fault-tolerance constraints. Concerning the support of hard real-time constraints, we rely heavily on the use of WCET analysis to provide information for schedulability analysis. We have shown that using this class of techniques on COTS hardware and operating systems is feasible. We are currently studying the pessimism induced by the analysis, and are studying the use of WCET analysis on larger (and therefore more realistic) operating systems. Concerning fault-tolerance constraints, we have built on top of a COTS kernel predictable fault-tolerance mechanisms (fault-masking mechanisms such as reliable multicast and clock synchronization, and error detection mechanisms to enforce the fail-silence assumption). Designing fault-masking mechanisms on top of a real-time kernel did not cause any particular problem, but it is not efficient because all these mechanisms are implemented entirely in software. The impact of software-implemented mechanisms on the range of deadlines that can be supported is currently under study. We are currently evaluating the efficiency of the error detection mechanisms that have been integrated into the run-time support.
References
[1] N. Zhang, A. Burns, and M. Nicholson. Pipelined processors and worst case execution times. Real-Time Systems, 5(4):319–343, October 1993.
[2] F. Mueller. Static Cache Simulation and its Application. PhD thesis, Department of Computer Science, Florida State University, July 1994.
[3] A. Colin and I. Puaut. Worst case execution time analysis for a processor with branch prediction. Real-Time Systems, 2000. To appear.
[4] A. Colin and I. Puaut. Worst-case timing analysis of the RTEMS real-time operating system. Technical Report 1277, IRISA, November 1999.
[5] P. Chevochot and I. Puaut. An approach for fault-tolerance in hard real-time distributed systems. Technical Report 1257, IRISA, July 1999. A short version of this paper can be found in the WIP session of SRDS'18, pp. 292–293.
[6] P. Chevochot and I. Puaut. Scheduling fault-tolerant distributed hard real-time tasks independently of the replication strategies. In Proc. of RTCSA'99, Hong Kong, China, December 1999.
[7] E. Anceaume, G. Cabillic, P. Chevochot, and I. Puaut. A flexible run-time support for distributed dependable hard real-time applications. In Proc. of ISORC'99, pages 310–319, St Malo, France, May 1999.
Autonomous Consistency Technique in Distributed Database with Heterogeneous Requirements
Hideo Hanamura, Isao Kaji and Kinji Mori
Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro, Tokyo 125, Japan
Abstract. Recently, diverse types of companies have been trying to cooperate with one another to cope with the dynamic market, and thus the integration of their DBs with heterogeneous requirements is needed. But the conventional centralized approach has problems with fault-tolerance, the real-time property and flexibility. Here, an autonomous consistency technique for distributed DBs and its architecture are proposed. In this architecture, each site has the autonomy to determine its Allowable Volume and to update the DB independently using it. In addition, this volume can be managed dynamically and successfully through autonomous communication among sites, and the system can adapt to unpredictable user requirements. As an experimental result, it is shown that this mechanism can adaptively achieve users' heterogeneous requirements.
1 Background
As Information Technology advances, information sharing among companies is becoming essential for business and the integration of their databases (DBs) becomes more important. Though such DBs have been integrated by a centralized approach [2], this approach has problems with fault-tolerance, the real-time property and flexibility. Therefore, an autonomous approach is proposed to solve these problems. In this paper, the stock management system in Supply Chain Management (SCM) is discussed to explain the new approach.
1.1 Needs in SCM
Only the makers and retailers are considered as the constituents of the SCM. The characteristics of the heterogeneous requirements of makers and retailers are described below.
Retailers  Retailers are considered to be dealing in two kinds of products, regular products and non-regular products. Regular products are usually in stock at retailers in sufficient quantity. When some customers order this kind of product, the retailers ship them from their own stock. If they do not have enough stock, they order them from the makers. On the
other hand, non-regular products are not usually in stock at retailers. When the retailers receive an order from customers, they order the products from the makers and the makers manufacture them at that time.
Makers  Makers deal in both regular products and non-regular products in the same way, namely, they check the current stock and manufacture them, if necessary.
2 Approach
2.1 Assurance
The Assurance is defined as the achievement of user satisfaction with heterogeneous requirements in the integrated system.
Assurance  Usually, a system is constructed to achieve a single user requirement. But the integrated system has heterogeneous requirements, which were realized in each system before the integration. So, these requirements must be realized in the integrated one, even if they are contradictory. When these heterogeneous requirements are realized in the integrated system – if they are contradictory, the system satisfies them fairly –, it is defined that the integrated system realizes the assurance.
2.2 Goal
The following properties are required of the integrated DB system in SCM:
– Real-Time Property
– Fault Tolerance
– Assurance
As mentioned above, the system must realize the assurance. As for non-regular products, the requirements are the same for maker and retailer, and the system realizes the Immediate Update, which propagates the result of an update operation to the whole system immediately. In the case of regular products, on the other hand, the requirements of maker and retailer are contradictory. In this case, the real-time property of updates at the retailer's site is given priority and the result is propagated to the whole system at the earliest opportunity, which is called the Delay Update.
3 Accelerator
3.1 Allowable Volume
It is difficult for the centralized approach to achieve the assurance when conflicting requirements coexist in a system or when user requirements change rapidly, because it does not provide the operational autonomy at each site that is essential in these cases. So, the autonomous consistency mechanism is introduced.
Fig. 1. Example of Allowable Volume
An attribute called the Allowable Volume (AV) is introduced at each site. The AV is defined on each numeric data item in each local DB. Each site can update the numeric data within its AV autonomously – without any communication with the others. The AV is not a fixed volume allocated by some master site but is flexibly managed by communication among sites. The update of the DB with AV is illustrated in Fig. 1, in which the sites have 40, 20 and 40 of AV for product A, 100 in total. The stock data of product A is also 100. If some user applies an update of -30 to product A at site 1, the AV is checked at site 1, and both the AV and the stock data are updated without any communication if the AV is sufficient. Otherwise, the site requests other sites to transfer AV. In this figure, since the AV at site 1 is not sufficient for the update, site 1 requests AV from site 0 and gets +30; the AV and the data at site 1 are then updated to 20 and 70.
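To make the mechanism more concrete, the following Python sketch shows how a site might consume its AV for a local decrease and fall back to requesting a transfer when the AV is insufficient. This is our own illustration under assumed names and a simplified transfer policy (the requested amount is the shortage, and a peer grants at most half of what it holds), not the authors' implementation.

class AVSite:
    """Illustrative sketch of updates bounded by the Allowable Volume (AV)."""

    def __init__(self, site_id, stock, av):
        self.site_id = site_id
        self.stock = stock   # replicated numeric data item, e.g. stock of product A
        self.av = av         # this site's share of the Allowable Volume

    def grant_av(self, requested):
        # Simplified policy: grant at most half of the AV this site currently holds.
        granted = min(requested, self.av // 2)
        self.av -= granted
        return granted

    def update(self, delta, peers):
        """Apply an update locally; a decrease (delta < 0) must be covered by AV."""
        needed = -delta if delta < 0 else 0
        if needed > self.av:
            # Not enough AV: ask other sites, largest holder first, for the shortage.
            for peer in sorted(peers, key=lambda p: p.av, reverse=True):
                self.av += peer.grant_av(needed - self.av)
                if needed <= self.av:
                    break
            if needed > self.av:
                raise RuntimeError("update rejected: AV could not be acquired")
        # Enough AV is now held locally: update data and AV without further communication.
        self.av -= needed
        self.stock += delta

With the Fig. 1 configuration (stock 100 everywhere; AV of 40, 20 and 40 at sites 0, 1 and 2), an update of -30 at site 1 would trigger a transfer request before the local update completes.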
3.2 System Model in SCM
The structure of the proposed system is shown in Fig. 2.
Fig. 2. System Model
In the proposed system, each site has a local DB and these DBs together form a distributed DB. One of the local DBs is called the base DB; it is usually located at the maker and works as the primary copy in the case of Immediate Update. The others are located at the retailers. The contents of all local DBs are the same and include product names and the amount of their stock. In addition, the classification between regular and non-regular products is known. All data are assumed to be delivered initially to all the sites from the base DB.
3.3 Accelerator
The accelerator is proposed to achieve both Immediate Update and Delay Update, realizing the assurance for makers and retailers. It is located at each site with an Allowable Volume management table (AV table) and provides a DBMS function and an AV management function. The AV management function consists of three functions: checking, selecting and deciding. When the accelerator receives a user update request, the checking function determines from the AV table whether the request is an Immediate Update or a Delay Update. The selecting function selects the site to request the transfer of AV from (or to). The deciding function decides the volume of AV to transfer from (or to) the other sites, depending on the situation, to realize effective AV reallocation. The accelerator realizes both Immediate Update and Delay Update using these functions. In the following, the behavior of the accelerator is explained.
Fig. 3. Accelerator
Delay Update  When the accelerator receives an update request, it checks using the AV table whether it is an Immediate Update or a Delay Update. If the AV is defined on the data, the accelerator treats the request as a Delay Update and holds the necessary amount of AV in advance of the update of the local DB. In this case, it is not necessary to lock the AV exclusively until the completion of the whole transaction, because the AV is not seen by end-users and, if a rollback of the transaction occurs, the operation can be recovered by applying the opposite of the update volume that was initially used for the transaction. Thus, the remaining AV can be used by other processes while one process accesses the same data.
If the AV is sufficient at the local site, the accelerator updates the AV and completes the update at the local site. But if the AV is not sufficient for the update, the accelerator holds all the AV at the site and requests extra AV from other sites. The target site for the request is determined according to the strategy of the accelerator, such as the order of the volumes the other sites keep. The site receiving the request, in turn, provides some amount of AV according to its own strategy. When the requesting accelerator receives the AV, it checks whether it is sufficient for the update, and requests other sites again if it is not. Once it has obtained enough volume, it updates both the data and the AV, and the remaining AV is stored in the local AV table. Otherwise, all accumulated AV is stored in the local AV table.
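A minimal sketch of the hold-then-commit behaviour of a Delay Update, reusing the illustrative AVSite from the earlier sketch (again an assumption-laden example rather than the authors' code): the accelerator reserves AV before touching the local DB, and a rollback simply returns the reserved volume, so no exclusive lock on the AV is needed.

class DelayUpdate:
    """Sketch of a Delay Update: reserve AV first, commit the data change later."""

    def __init__(self, site, delta):
        self.site = site
        self.delta = delta
        self.reserved = 0

    def begin(self):
        needed = -self.delta if self.delta < 0 else 0
        # Hold the necessary AV before updating the local DB; the AV is not
        # visible to end-users, so no exclusive lock is required meanwhile.
        if needed > self.site.av:
            raise RuntimeError("AV insufficient: an AV transfer must be requested first")
        self.site.av -= needed
        self.reserved = needed

    def commit(self):
        self.site.stock += self.delta
        self.reserved = 0

    def rollback(self):
        # Compensation: return the held volume instead of undoing DB writes.
        self.site.av += self.reserved
        self.reserved = 0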
Fig. 4. Delay Update with AV
Immediate Update  When the accelerator checks a user request and finds that the AV is not defined in the AV table, it deals with the request as an Immediate Update based on the primary copy scheme. The requesting accelerator works as the coordinator and the update is processed as follows. First, it locks the data in the local DB and simultaneously sends the lock request to the other accelerators. Then the operations for the update are processed at all the sites, and ready and commitment messages are exchanged. The requesting accelerator judges the completion of the update from the message sent by the accelerator at the base DB.
Fig. 5. Immediate Update
3.4 AV management
Each site has its own strategy to manage AV in order to realize the real-time property and the assurance. An optimal AV allocation algorithm using global information is, however, not suitable for AV management, because it is more important to adapt rapidly to changing user requirements. When designing the AV management mechanism, it is essential to calculate the volume of AV to transfer using only local information and to make the AV circulate among the sites. The system then achieves the real-time property and the assurance.
4 Simulation
The simulation of the Delay Update is done with the proposed mechanism. A maker and two retailers are modeled as in Fig. 2. For AV management, the algorithm proposed in the research on an electronic commerce money distribution system [1] is utilized. That is, AV management occurs only when the Allowable Volume is insufficient for an update. The requested site is selected according to the amount of AV the site keeps; this information is collected during the communication needed for AV management and may not be current. In addition, the requested AV is the amount of the shortage needed for the completion of the update, and the allocated AV is half of the amount of AV that the requested site keeps. In the simulation, the number of data items in a local DB is 100. At site 0, data is updated randomly to increase the volume by at most 20% of the initial amount of data. On the other hand, at site 1 and site 2, it is updated randomly to decrease by at most 10%. The results are shown in Fig. 6 and Table 1.
Fig. 6. Number of Updates vs Number of Correspondences (comparing the proposed and conventional approaches)
Number of updates  1000  2000  3000  4000  5000  6000  7000  8000
Site1               195   416   669   973  1441  1516  1614  1997
Site2               195   422   634   905  1335  1473  1591  1996
Table 1. Number of Correspondences for Update
In Fig. 6, the total number of updates in the system is shown on the horizontal axis and the number of correspondences for update on the vertical axis (2 messages are counted as 1 correspondence). The line "conventional" shows the number of correspondences for update under the conventional centralized way. The figure shows that the proposed way decreases the correspondences by 75% and that most updates are completed within the local site. Thus the real-time property is attained. Table 1 shows the number of correspondences for update at each site. In the proposed way, the numbers are almost the same for site 1 and site 2 and increase very slowly. That is, the real-time property is fairly achieved at the retailer sites – site 1 and site 2. As a result, in the case of Delay Update, the proposed mechanism is shown to improve the real-time property by decreasing the number of correspondences and to realize the assurance for retailers.
5 Conclusion
In this paper, we focused on the problem of the heterogeneity of requirements arising from the need for cooperation among companies in SCM. The autonomous approach is then proposed to solve this problem. In the proposal, each site can determine its AV, by autonomous communication among the sites, according to changing situations, and the data can be updated autonomously at the local site within that AV without any communication, which realizes fault tolerance. Through the accelerator, both Delay Update and Immediate Update are realized to achieve the assurance for maker and retailer. By simulation, it is then shown that the real-time property and the assurance for retailers are achieved. As a result, the proposed mechanism is shown to realize the real-time property, fault tolerance and assurance, which attains users' heterogeneous requirements.
References
1. H. Kawazoe, T. Shibuya, and T. Tokuyama. "Optimal on-line algorithms for an electronic commerce money distribution system". In Proceedings of the 10th ACM-SIAM Symposium on Discrete Algorithms (SODA 99), pages 527–536, 1999.
2. Amit P. Sheth and James A. Larson. "Federated database systems for managing distributed, heterogeneous, and autonomous databases". ACM Computing Surveys, 22(3):183–236, 1990.
Real-time Transaction Processing Using Two-stage Validation in Broadcast Disks*
Kwok-wa Lam¹, Victor C. S. Lee¹, and Sang H. Son²
¹ Department of Computer Science, City University of Hong Kong
[email protected]
² Department of Computer Science, University of Virginia
[email protected]
Abstract. Conventional concurrency control protocols are inapplicable
in mobile computing environments due to a number of constraints of wireless communications. In this paper, we design a protocol for processing mobile transactions. The protocol fits the environments such that no synchronization is necessary among the mobile clients. Data conflicts can be detected between transactions at the server and mobile transactions by a two-stage validation mechanism. In addition to relieving the server from excessive validation workload, no to-be-restarted mobile transactions will be submitted to the server for final validation. Such early data conflict detection can save processing and communication resources. Moreover, the protocol allows more schedules of transaction executions such that unnecessary transaction aborts can be avoided. These desirable features help mobile transactions to meet their deadlines by removing any avoidable delays due to the asymmetric property.
1 Introduction

Broadcast-based data dissemination becomes a widely accepted approach to communication in mobile computing environments [1], [2], [6], [8]. The distinguishing feature of this broadcast mode is the communication bandwidth asymmetry, where the "downstream" (server to client) communication capacity is relatively much greater than the "upstream" (client to server) communication capacity. The limited amount of bandwidth available for the clients to communicate with the broadcast server in such environments places a new challenge to implementing transaction processing efficiently. Acharya et al [1], [2] introduced the concept of Broadcast Disks (Bdisks), which uses communication bandwidth to emulate a storage device or a memory hierarchy in general for mobile clients of a database system. It exploits the abundant bandwidth capacity available from a server to its clients by broadcasting data to its clients periodically. Although there is a number of related research work in Bdisks environments [1], [3], [4], only a few of them support transactional semantics [5], [8]. In this
* Supported in part by Direct Allocation Grant no. 7100094 from City University of Hong Kong
paper, we propose a new approach to processing transactions in Bdisks environments where updates can be originated from the clients. In the new protocol, mobile clients share a part of the validation function with the server and are able to detect data conflicts earlier such that transactions are more likely to meet their deadlines [9].
2 Issues of Transaction Processing in Broadcast Environments

Since the bandwidth from mobile clients to the server is limited, concurrency control protocols that require continuous synchronization with the server to detect data conflicts during transaction execution, such as two-phase locking, become handicapped in these environments. This is the reason why almost all the protocols proposed for such environments are based on an optimistic approach. The eventual termination (commit or abort) of mobile transactions submitted to the server will be broadcast to the clients in the following broadcast cycles. If a mobile transaction submitted to the server could not pass the validation, it will take a long time for the client to be acknowledged and to restart the failed transaction. For a huge number of clients, this strategy will certainly cause intolerable delays and clutter the server. Consequently, it will have a negative impact on the system performance in terms of response time and throughput. In addition, there are two major problems with using the conventional optimistic approach in Bdisks environments. First, any "serious" conflict which leads to a transaction abort can only be detected in the validation phase at the server. Therefore, some transactions, which are destined to abort when submitted to the server, are allowed to execute to the end. Such continuation of execution of these to-be-aborted transactions wastes the processing resources of the mobile clients as well as the communication bandwidth. Second, the ineffectiveness of the validation process adopted by the protocols at the server leads to many unnecessary transaction aborts and restarts, because they have implicitly assumed that committed transactions must precede the validating transaction in the serialization order [7].
3 Protocol Design

We assume a central data server that stores and manages all the data objects at the server. Updates submitted from the clients are subject to final validation so that they can be ultimately installed in the central database. Data objects are broadcast by the server and the clients listen to the broadcast channel to perform the read operations. When a client performs a write operation, it pre-writes the value of the data object in its private workspace.
3.1 Broadcasting of Validation Information
We use the timestamp intervals of transactions to reduce the unnecessary aborts and exploit the capability of the server to broadcast validation information to the
clients so that the clients can adjust the timestamp intervals of their respective active transactions. Clearly, this method is not supposed to guarantee that the mobile transactions can be terminated (committed or aborted) locally at the clients, because the clients do not have a complete and up-to-date view of all conflicting transactions. Thus, all transactions have to be submitted to the server for final validation. Our new strategy places part of the validation function at the clients. In this way, the validation is implemented in a truly distributed fashion with the validation burden shared by the clients. One important issue is that the server and the clients should avoid repeating the same part of the validation function. In other words, they should complement each other.
3.2 Timestamp Ordering

We assume that there are no blind write operations. For each data object, a read timestamp (RTS) and a write timestamp (WTS) are maintained. The values of RTS(x) and WTS(x) represent the timestamps of the youngest committed transactions that have read and written data object x respectively. Each active transaction, Ta, at the clients is associated with a timestamp interval, TI(Ta), which is initialised as [0, ∞). The TI(Ta) reflects the dependency of Ta on the committed transactions and is dynamically adjusted, if possible and necessary, when Ta reads a data object or after a transaction is successfully committed at the server. If TI(Ta) shuts out after the intervals are adjusted, Ta is aborted because a non-serializable schedule is detected. Otherwise, when Ta passes the validation at the server, a final timestamp, TS(Ta), selected between the current bounds of TI(Ta), is assigned to Ta. Let us denote by TIlb(Ta) and TIub(Ta) the lower bound and the upper bound of TI(Ta) respectively. Whenever Ta is about to read (pre-write) a data object written (read) by a committed transaction Tc, TI(Ta) should be adjusted such that Ta is serialized after Tc. Let us examine the implication of data conflict resolution between a committed transaction, Tc, and an active transaction, Ta, on the dependency in the serialization order. There are two possible types of data conflicts that can induce the serialization order between Tc and Ta such that TI(Ta) has to be adjusted.
1) RS(Tc) ∩ WS(Ta) ≠ ∅ (read-write conflict)
This type of conflict can be resolved by adjusting the serialization order between Tc and Ta such that Tc → Ta. That is, Tc precedes Ta in the serialization order so that the read of Tc is not affected by Ta's write. Therefore, the adjustment of TI(Ta) should be: TIlb(Ta) > TS(Tc).
2) WS(Tc) ∩ RS(Ta) ≠ ∅ (write-read conflict)
In this case, the serialization order between Tc and Ta is induced as Ta → Tc. That is, Ta precedes Tc in the serialization order. It implies that the read of Ta is placed before the write of Tc though Tc is committed before Ta. The adjustment of TI(Ta) should be: TIub(Ta) < TS(Tc). Thus, this resolution makes it possible for a transaction, which precedes some committed transactions in the serialization order, to be validated and committed after them.
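The two adjustment rules can be summarised by the following Python sketch (our own illustration of the rules above, assuming integer timestamps and representing TI(Ta) as a closed interval (lb, ub); None denotes an interval that has shut out):

import math

def adjust_ti(ti, ts_committed, conflict):
    """Adjust an active transaction's timestamp interval after a commit.

    conflict = "read-write": the committed transaction read an object the active
    one wrote, so the active transaction must be serialized after it (lb > TS(Tc)).
    conflict = "write-read": the active transaction read an object the committed
    one wrote, so it must be serialized before it (ub < TS(Tc)).
    Returns the adjusted interval, or None when the interval shuts out, in which
    case the active transaction is aborted.
    """
    lb, ub = ti
    if conflict == "read-write":
        lb = max(lb, ts_committed + 1)
    elif conflict == "write-read":
        ub = min(ub, ts_committed - 1)
    return (lb, ub) if lb <= ub else None

# Example: TI(Ta) starts as [0, infinity).
ti = (0, math.inf)
ti = adjust_ti(ti, ts_committed=40, conflict="read-write")   # -> (41, inf)
ti = adjust_ti(ti, ts_committed=55, conflict="write-read")   # -> (41, 54)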
4 The New Protocol
4.1 Transaction Processing at Mobile Clients
The clients carry three basic functions: (1) to process the read/write requests of active transactions, (2) to validate the active transactions using the validation information broadcast in the current cycle, and (3) to submit the active transactions to the server for final validation. These three functions are described by the algorithms Process, Validate, and Submit shown below, and the validation information consists of the following components. The Accepted and Rejected sets contain the identifiers of transactions successfully validated or rejected at the server in the last broadcast cycle. The CT_ReadSet and CT_WriteSet contain data objects that are in the read set and the write set of those committed transactions in the Accepted set. A read timestamp RTS(x), and the first and last write timestamps in the last broadcast cycle, FWTS(x) and WTS(x), are associated with each data object x in CT_ReadSet and CT_WriteSet. FWTS(x) is used to adjust TIub(Ta) of an active transaction Ta for the read-write dependency, while WTS(x) is used to adjust TIlb(Ta) for the write-read dependency.

Functions: Process, Validate, and Submit at the Clients

Process(Ta, x, op) {
  if (op = READ) {
    TI(Ta) := TI(Ta) ∩ [WTS(x), ∞);
    if TI(Ta) = [] then abort Ta;
    else { Read(x); TOR(Ta, x) := WTS(x); Final_Validate(Ta) := Final_Validate(Ta) ∪ {x}; }
  }
  if (op = WRITE) {
    TI(Ta) := TI(Ta) ∩ [RTS(x), ∞);
    if TI(Ta) = [] then abort Ta;
    else { Pre-write(x); remove x from Final_Validate(Ta); }
  }
}

Validate {
  // results of previously submitted transactions
  for each Tv in Submitted {
    if Tv ∈ Accepted then { mark Tv as committed; Submitted := Submitted − {Tv}; }
    else if Tv ∈ Rejected then { mark Tv as aborted; restart Tv; Submitted := Submitted − {Tv}; }
  }
  for each active transaction (Ta) {
    if x ∈ CT_WriteSet and x ∈ CWS(Ta) then abort Ta;
    if x ∈ CT_WriteSet and x ∈ Final_Validate(Ta) then {
      TI(Ta) := TI(Ta) ∩ [0, FWTS(x)];
      if TI(Ta) = [] then abort Ta; else remove x from Final_Validate(Ta);
    }
    if x ∈ CT_ReadSet and x ∈ CWS(Ta) then {
      TI(Ta) := TI(Ta) ∩ [RTS(x), ∞);
      if TI(Ta) = [] then abort Ta;
    }
  }
}

Submit(Ta) {
  Submitted := Submitted ∪ {Ta};
  Submit to the server for global final validation with TI(Ta), RS(Ta), WS(Ta),
    New_Value(Ta, x), Final_Validate(Ta), TOR(Ta, x)
    // x of TOR(Ta, x) ∈ (WS(Ta) ∪ Final_Validate(Ta))
}
4.2 The Server Functionality

The server continuously performs the following algorithm until it is time to broadcast the next cycle. In essence, the server performs two basic functions: (1) to broadcast the latest committed values of all data objects and the validation information, and (2) to validate the submitted transactions to ensure serializability. One objective of the validation scheme at the server is to complement the local validation at the clients to determine whether the execution of transactions is globally serializable. Note that the server does not need to repeat the validation of those read operations of the validating transactions that has already been done at the clients. Only the part of the validation that cannot be guaranteed by the clients is required to be performed. At the server, we maintain a validating transaction list that enqueues the validating transactions submitted from the clients, but not yet processed. The server maintains the following information: a read timestamp RTS(x) and a write timestamp WTS(x) for each data object x. Each data object x is associated with a list of k write timestamp versions, which are the timestamps of the k most recently committed transactions that wrote x. For any two versions, WTS(x, i) and WTS(x, j), if i < j, then WTS(x, i) < WTS(x, j). The latest
version is equal to WTS(x). Note that this is not a multiversion protocol, as only one version of the data object is maintained.

Validation at the Server

Global_Validate(Tv) {
  Dequeue a transaction from the validating transaction list.
  for each x in WS(Tv) {
    if WTS(x) > TOR(Tv, x) then { abort Tv; Rejected := Rejected ∪ {Tv}; }
    else {
      TI(Tv) := TI(Tv) ∩ [RTS(x), ∞);
      if TI(Tv) = [] then { abort Tv; Rejected := Rejected ∪ {Tv}; }
    }
  }
  for each x in Final_Validate(Tv) {
    Locate WTS(x, i) = TOR(Tv, x)
    if FOUND then {
      if WTS(x, i+1) exists then TI(Tv) := TI(Tv) ∩ [0, WTS(x, i+1)];
      if TI(Tv) = [] then { abort Tv; Rejected := Rejected ∪ {Tv}; }
    }
    else { abort Tv; Rejected := Rejected ∪ {Tv}; }
  }
  // transaction passes the final validation
  TS(Tv) := lower bound of TI(Tv) + ε    // ε is a sufficiently small value
  for each x in RS(Tv): if TS(Tv) > RTS(x) then RTS(x) := TS(Tv);
  for each x in WS(Tv): WTS(x) := TS(Tv);
  Accepted := Accepted ∪ {Tv};
  CT_WriteSet := CT_WriteSet ∪ WS(Tv);
  CT_ReadSet := CT_ReadSet ∪ {RS(Tv) − WS(Tv)};
}
5 Conclusions and Future Work

In this paper, we first discuss the issues of transaction processing in broadcast environments. No conventional concurrency control protocol fits well in these
environments due to a number of constraints in the current technology in wireless communication and mobile computing equipment. Recent related research on this area is mainly focused on the processing of read-only transactions. Update mobile transactions are submitted to the server for single round validation. This strategy suffers from several deficiencies such as high overhead, wastage of resources on to-be-restarted transactions, and many unnecessary transaction restarts. These deficiencies are detrimental to transactions meeting their deadlines. To address these deficiencies, we have designed a concurrency control protocol in broadcast environments with three objectives. Firstly, data conflicts should be detected as soon as possible (at the mobile clients side) such that both processing and communication resources can be saved. Secondly, more schedules of transaction executions should be allowed to avoid unnecessary transaction aborts and restarts since the cost of transaction restarts in mobile environments is particularly high. Finally, any synchronization or communication among the mobile clients or between the mobile clients and the server should be avoided or minimized due to the asymmetric property of wireless communication. These are very desirable features in real-time applications where transactions are associated with timing constraints.
References
1. Acharya S., M. Franklin and S. Zdonik, "Disseminating Updates on Broadcast Disks," Proc. of 22nd VLDB Conference, India, 1996.
2. Acharya S., R. Alonso, M. Franklin and S. Zdonik, "Broadcast Disks: Data Management for Asymmetric Communication Environments," Proc. of the ACM SIGMOD Conference, U.S.A., 1995.
3. Baruah S. and A. Bestavros, "Pinwheel Scheduling for Fault-Tolerant Broadcast Disks in Real-Time Database Systems," Technical Report TR-1996-023, Computer Science Department, Boston University, 1996.
4. Bestavros A., "AIDA-Based Real-Time Fault-Tolerant Broadcast Disks," Proc. of the IEEE Real-Time Technology and Applications Symposium, U.S.A., 1996.
5. Herman G., G. Gopal, K. C. Lee and A. Weinreb, "The Datacycle Architecture for Very High Throughput Database Systems," Proc. of the ACM SIGMOD Conference, U.S.A., 1987.
6. Imielinski T. and B. R. Badrinath, "Mobile Wireless Computing: Challenges in Data Management," Communications of the ACM, vol. 37, no. 10, 1994.
7. Lam K. W., K. Y. Lam and S. L. Hung, "Real-time Optimistic Concurrency Control Protocol with Dynamic Adjustment of Serialization Order," Proc. of the IEEE Real-Time Technology and Applications Symposium, pp. 174-179, Illinois, 1995.
8. Shanmugasundaram J., A. Nithrakashyap, R. Sivasankaran, K. Ramamritham, "Efficient Concurrency Control for Bdisks Environments," ACM SIGMOD International Conference on Management of Data, 1999.
9. Stankovic, J. A., Son, S. H., and Hansson, J., "Misconceptions about Real-Time Databases," Computer, vol. 32, no. 6, pp. 29-37, 1999.
Using Logs to Increase Availability in Real-Time Main-Memory Database
Tiina Niklander and Kimmo Raatikainen
University of Helsinki, Department of Computer Science
P.O. Box 26 (Teollisuuskatu 23), FIN-00014 University of Helsinki, Finland
{tiina.niklander, [email protected]
Abstract. Real-time main-memory databases are useful in real-time environments. They are often faster and provide more predictable execution of transactions than disk-based databases do. The most reprehensible feature is the volatility of the memory. In the RODAIN Database Architecture we solve this problem by maintaining a remote copy of the database in a stand-by node. We use logs to update the database copy on the hot stand-by. The log writing is often the most dominating factor in the transaction commit phase. With hot stand-by we can completely omit the disk update from the critical path of the transaction, thus providing more predictable commit phase execution, which is important when the transactions need to be finished within their deadlines.
1 Introduction

Real-time databases will be an important part of the future telecommunications infrastructure. They will hold the information needed in operations and management of telecommunication services and networks. The performance, reliability, and availability requirements of data access operations are demanding. Thousands of retrievals must be executed in a second. The allowed unscheduled down time is only a few minutes per year. The requirements originate in the following areas: real-time access to data, fault tolerance, distribution, object orientation, efficiency, flexibility, multiple interfaces, and compatibility [13, 14]. Telecommunication requirements and real-time database concepts are studied in the literature [1–3, 7]. The RODAIN¹ database architecture is a real-time, object-oriented, fault-tolerant, and distributed database management system, which is designed to fulfill the requirements of a modern telecommunications database system. It offers simultaneous execution of firm and soft deadline transactions as well as transactions that do not have deadlines at all. It supports high availability of the data using a hot stand-by, which maintains a copy of the operational database. The hot stand-by is ready to switch to the database server at any time, if the primary server fails. Related systems include ClustRa [4], Dalí [5], and StarBase [6].
¹ RODAIN is the acronym of the project name Real-Time Object-Oriented Database Architecture for Intelligent Networks, funded by Nokia Networks, Solid Information Technology, and the National Technology Agency of Finland.
Fig. 1. The Architecture of RODAIN Database Node.

The rest of the paper is organized as follows. The architecture of the RODAIN Database Management System is presented in section 2. The logging mechanism is presented in detail in section 3. Finally, in section 4 we will summarize the results of our experiments, based on a prototype implementation of the RODAIN database system.
2 RODAIN Database

A database designed to be used as a part of telecommunication services must give quick and timely responses to requests. In the RODAIN Database System (see Fig. 1) this is achieved by keeping time-critical data in the main-memory database and using real-time transactions. Real-time transactions have attributes like criticality and deadline that are used in their scheduling. Data availability is increased using a hot stand-by node to maintain a copy of the main-memory database. The hot stand-by, which we call the Database Mirror Node, can replace the main database server, called the Database Primary Node, in the case of failure. Our main goal in the database design was to avoid as much of the overhead of rollbacks during transaction abort as possible. This was achieved using the deferred write mechanism. In a deferred write mechanism the transaction is allowed to write the modified data to the database area only after it is accepted to
commit by the concurrency control mechanism. This way the aborted transaction can simply discard its modified copies of the data without rolling back. An aborted transaction is either discarded or restarted depending on its properties. For concurrency control, we chose to use an optimistic concurrency control protocol. Such a protocol seems appropriate to our environment with main-memory data and mainly short, read-only transactions with firm deadlines. We combined the features of OCC-DA [8] and OCC-TI [9], thus creating our own concurrency control protocol called OCC-DATI [11], which reduces the number of unnecessary restarts. A modified version of the traditional Earliest Deadline First (EDF) scheduling is used for transaction scheduling. The modification is needed to support a small number of non-realtime transactions that are executed simultaneously with the real-time transactions. Without deadlines the non-realtime transactions get the execution turn only when the system has no real-time transaction ready for execution. Hence, they are likely to suffer from starvation. We avoid this by reserving a fixed fraction of execution time for the non-realtime transactions. The reservation is made on a demand basis. To handle occasional system overload situations the scheduler can limit the number of active transactions in the database system. We use the number of transactions that have missed their deadlines within the observation period as the indication of the current system load level. The synchronization between Primary and Mirror Nodes within the RODAIN Database Node is done by transaction logs, and it is the basis for the high availability of the main-memory database. Transactions are executed only on the Primary Node. For each write and commit operation a transaction redo log record is created. This log is passed to the Mirror Node before the transaction is committed. The Mirror Node updates its database copy accordingly and stores the log records to the disk. The transaction is allowed to commit as soon as the log records are on the Mirror Node, removing the actual disk write from the critical path. It is like the log handling done in [10], except that our processors do not share memory. Thus, the commit time needed for a transaction contains one message round-trip time instead of a disk write. The database durability rests on the assumption that both nodes do not fail simultaneously. If this assumption fails, our model might lose some committed data. This data loss comes from the main idea of using the Mirror Node as the stable storage for the transactions. The storing of data to the disk is not synchronized with the transaction commits. Instead, the disk updates are made after the transaction is committed. A sequential failure of both nodes does not lose data if the time difference between the failures is large enough for the Mirror Node to store the buffered logs to the disk. The risk of losing committed data decreases when the time between node failures increases. As soon as the remaining node has had enough time to store the remaining logs to the disk, no data will be lost. In telecommunication the minor risk of losing committed data seems to be acceptable, since most updates
handle data that has some temporal nature. The loss of temporal data is not catastrophic; it will be updated again at a later time. During a node failure the remaining node, called the Transient Node, will function as the Primary Node, but it must store the transaction logs directly to the disk before allowing the transaction to commit. The failed node will always become a Mirror Node when it recovers. This solution avoids the need to switch the database processing responsibilities from the currently running node to another. The switch is only done when the current server fails and can no longer serve any requests.
3 Log Handling in the RODAIN Database Node

Log records are used for two different purposes in the RODAIN Database Node. Firstly, they are used to maintain an up-to-date copy of the main-memory database on a separate Mirror Node in order to recover quickly from failures of the Primary Node. Secondly, the logs are stored on a secondary medium in the same way as in a traditional database system. These logs are used to maintain the database content even if both nodes fail simultaneously, but they can also be used, for example, for off-line analysis of the database usage. The log records containing database updates, the after images of the updated data items, are generated during the transaction's write phase. At the write phase the transaction is already accepted for commit and it just updates the data items it has modified during its execution. Each update also generates a log record containing the transaction identification, the data item identification and an after image of the data item. All transactions that have entered their write phases will eventually commit, unless the primary database system fails. When the Primary Node fails, all transactions that are not yet committed are considered aborted, and their modifications to the database are not performed on the database copy in the Mirror Node. The communication between the committing transaction and the Log Writer is synchronous. The Log Writer on the Primary Node sends the log records to the Mirror Node as soon as they are generated. When the Mirror Node receives a commit record, it immediately sends an acknowledgment back. This acknowledgment is used as an indication that the logs of this specific transaction have arrived at the Mirror Node. The Log Writer then allows the transaction to proceed to the final commit step. If a Mirror Node does not exist, then the Log Writer (on the Transient Node) must store the logs directly to the disk. The logs are reordered based on transactions before the Mirror Node updates its database copy and stores the logs on disk. The true validation order of the transactions is used for the reordering. This reordering simplifies the recovery process. With the logs already ordered, the recovery can simply pass through the log once from the beginning to the end, omitting only the transactions that do not have a commit record in the log. Likewise, the Mirror Node performs the logged updates to its database only when it has also received the commit record. This way it can be sure that it never needs to undo any changes based on logs.
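As a rough sketch of the commit path just described (our own illustration with assumed interfaces, not the RODAIN code), the Log Writer forwards the redo records to the Mirror Node, waits for the acknowledgment of the commit record, and only then lets the transaction take its final commit step; on a Transient Node the records are forced to disk instead.

def commit_transaction(txn, mirror, disk_log):
    """Illustrative RODAIN-style commit path; txn, mirror and disk_log are assumed interfaces."""
    # Redo log records (after images) are produced during the transaction's write phase.
    records = [("write", txn.id, item, after_image)
               for item, after_image in txn.writes.items()]
    records.append(("commit", txn.id))

    if mirror is not None:
        # Normal mode: ship the records to the Mirror Node; the acknowledgment of
        # the commit record replaces a synchronous disk write on the commit path.
        for record in records:
            mirror.send(record)
        mirror.wait_for_ack(txn.id)
    else:
        # Transient mode: no Mirror Node exists, so the logs must be stored on
        # disk before the transaction is allowed to commit.
        for record in records:
            disk_log.append(record)
        disk_log.flush()

    txn.finalize_commit()   # final commit step on the Primary/Transient Node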
Fig. 2. Comparison of normal mode (both Primary and Mirror Node present) and transient mode (only Transient Node), using true log writes: transaction miss ratio vs. (a) arrival rate (trans/s) at a write ratio of 50% and (b) write fraction at an arrival rate of 300 trans/s.
4 Experimental Study

The current implementation of the RODAIN Database Prototype runs on a Chorus/ClassiX operating system [12]. The measurements were done on computers with a Pentium Pro 200MHz processor and 64 MB of main memory. All transactions arrive at the RODAIN Database Prototype through a specific interface process that reads the load descriptions from an off-line generated test file. Every test session contains 10 000 transactions and is repeated at least 20 times. The reported values are the means of the repetitions. The test database, containing 30 000 data objects, represents a number translation service. The number of concurrently running transactions is limited to 50. If the limit is reached, an arriving lower priority transaction is aborted. Transactions are validated atomically. If the deadline of a transaction expires, the transaction is always aborted. The workload in a test session consists of a variable mix of two transactions, one simple read-only transaction and the other a simple write transaction. The read-only service provision transaction reads a few objects and commits. The write transaction is an update service provision transaction that reads a few objects, updates them and then commits. The relative firm deadline of all real-time transactions is 50 ms and the deadline of all write transactions is 150 ms. We measured the transaction miss ratio, which represents the fraction of transactions that were aborted. The aborts can be either due to the exceeding of a transaction deadline, a concurrency control conflict, or an acceptance denial due to the load limit. In the experiments, the failures in transaction executions were mainly due to system overload. Occasionally a transaction also exceeded its deadline and was, therefore, aborted. We compared the performance of our logging mechanism in its normal use with both the Primary and the Mirror Node to a situation where only a single Transient Node is running (see Fig. 2). When both nodes are up and running the logs are passed from the Primary to the Mirror Node. When the Transient Node is running alone, it stores the logs directly to the log storage. The experiment
Fig. 3. Comparison of optimal (marked as No logs), single node (Transient), and two node systems (Primary and Mirror): transaction miss ratio vs. arrival rate (trans/s) at write ratios of (a) 0%, (b) 20%, and (c) 80%.
shows clearly that the use of a remote node instead of direct disk writes increases the system performance. Since our experiments with disk writing showed that storing the logs to the disk can easily become the bottleneck in the log handling, we ran more tests with the disk writing turned off. This scenario is feasible if the probability of a simultaneous failure of both nodes is acceptable and the system can be trusted to run without any other backups. The omission of the disk writes also emphasizes the overhead from our log handling algorithms with the two nodes. If the storing of logs to the disk system is slower than the median log generation rate, then the system gets thrashed by the buffered logs and must reduce the incoming rate of transactions to the pace of disk storing. This would then remove most of the benefit of using the Mirror Node. For comparison, we also ran tests on a Transient Node where the logging feature was completely turned off. The results from this optimal situation do not differ much from the results of the Transient Node with logging turned off. From Fig. 3 we can see that the factor with the largest effect on the system performance is the transaction arrival rate. At an arrival rate of 200 to 300 transactions per second, depending on the ratio of update transactions, the system becomes saturated and most of the unsuccessfully executed (= missed) transactions are due to abortions by the overload manager. The effect of the ratio of update transactions is relatively small. There are two reasons for this behavior. First, the update transactions modify only a few items. Thus, the number of log records per transaction is not large either. Secondly, the system generates a commit log record also for read-only transactions, thus forcing the commit times of both transaction types to be quite close. The benefits of the use of the hot stand-by are actually seen when the primary database system fails. When that happens, the Mirror Node can almost instantaneously serve incoming requests. If, however, the Primary Node was alone and had to recover from the backup on the disk or in stable memory, like Flash, the database would be down much longer. Such down-times are not allowed in certain application areas such as telecommunication.
5 Conclusion

The RODAIN database architecture is designed to meet the challenge of future telecommunication systems. In order to fulfill the requirements of the next generation of telecommunications systems, the database architecture must be fault-tolerant and support real-time transactions with explicit deadlines. The internals of the RODAIN DBMS described are designed to meet the requirements of telecommunications applications. The high availability of the RODAIN Database is achieved through using a database mirror. The mirror is also used for log processing, which reduces the load at the primary database node and shortens the commit times of transactions, allowing more transactions to be executed within their deadlines.
References
1. I. Ahn. Database issues in telecommunications network management. ACM SIGMOD Record, 23(2):37–43, 1994.
2. R. Aranha et al. Implementation of a real-time database system. Information Systems, 21(1):55–74, 1996.
3. T. Bowen et al. A scale database architecture for network services. IEEE Communications Magazine, 29(1):52–59, January 1991.
4. S. Hvasshovd et al. The ClustRa telecom database: High availability, high throughput, and real-time response. In Proc. of the 21st VLDB Conf., pp. 469–477, 1995.
5. H. Jagadish et al. Dalí: A high performance main memory storage manager. In Proc. of the 20th VLDB Conf., pp. 48–59, 1994.
6. Y. Kim and S. Son. Developing a real-time database: The StarBase experience. In A. Bestavros, K. Lin, and S. Son, editors, Real-Time Database Systems: Issues and Applications, pp. 305–324. Kluwer, 1997.
7. Y. Kiriha. Real-time database experiences in network management application. Tech. Report CS-TR-95-1555, Stanford University, USA, 1995.
8. K. Lam, K. Lam, and S. Hung. An efficient real-time optimistic concurrency control protocol. In Proc. of the 1st Int. Workshop on Active and Real-Time Database Systems, pp. 209–225. Springer, 1995.
9. J. Lee and S. Son. Performance of concurrency control algorithms for real-time database systems. In V. Kumar, editor, Performance of Concurrency Control Mechanisms in Centralized Database Systems, pp. 429–460. Prentice-Hall, 1996.
10. T. Lehman and M. Carey. A recovery algorithm for a high-performance memory-resident database system. In U. Dayal and I. Traiger, editors, Proc. of ACM SIGMOD 1987 Ann. Conf., pp. 104–117, 1987.
11. J. Lindström and K. Raatikainen. Dynamic adjustment of serialization order using timestamp intervals in real-time databases. In Proc. of 6th Int. Conf. on Real-Time Computing Systems and Applications, 1999.
12. D. Pountain. The Chorus microkernel. Byte, pp. 131–138, January 1994.
13. K. Raatikainen. Real-time databases in telecommunications. In A. Bestavros, K. Lin, and S. Son, editors, Real-Time Database Systems: Issues and Applications, pp. 93–98. Kluwer, 1997.
14. J. Taina and K. Raatikainen. Experimental real-time object-oriented database architecture for intelligent networks. Engineering Intelligent Systems, 4(3):57–63, September 1996.
Components are from Mars M.R.V. Chaudron1 and E. de Jong1, 2 1
Technische Universiteit Eindhoven, Dept. of Computer Science P.O. Box 513, 5600 MB Eindhoven, The Netherlands
[email protected] 2 Hollandse Signaalapparaten B.V., P.O. Box 42, 7550 GD Hengelo, The Netherlands
[email protected] Abstract. We advocate an approach towards the characterisation of components where their qualifications are deduced systematically from a small set of elementary assumptions. Using the characteristics that we find, we discuss some implications for components of real-time and distributed systems. Also we touch upon implications for design-paradigms and some disputed issues about components.
1 Introduction
From different perspectives on software engineering, it is considered highly desirable to build flexible systems through the composition of components. However, no method of design exists that is tailored towards this component-oriented style of system development. Before such a method can emerge, we need a clear notion of what components should be. However, although the component-oriented approach can be dated back to the late 1960's (see [McI68]), recent publications list many different opinions about what components should be [Br98], [Sa97], [Sz98]. This abundance of definitions indicates that we do not yet understand what components and component-oriented software engineering are about. The discussion on what components should be is complicated by the absence of an explicit statement of (and agreement on) the fundamental starting points. As a result, the motivations behind opinions are often unknown, implicit or unclear. Also, presuppositions are implicitly made that are unnecessarily limiting. The goal of this paper is twofold: firstly, to make explicit the fundamental starting points of component-based engineering, and secondly, to systematically deduce characteristics of the ideal component.
2 Basic Component Model and Qualification
First we shall introduce a basic model and discuss its consequences for components in general. Next, we consider some implications for components for real-time and distributed systems.
Basic component model

In this section we introduce our basic model for reasoning about components. Our aim is to introduce concepts only when necessary. As a result, a lot of possible aspects of components are intentionally not present in our model. The model we consider consists of the following:
− There are things called components.
− Components may be composed by some composition mechanism.
We use the following terminology:
− A configuration of a number of composed components is called a composition.
− Everything outside a component is called its environment.
A pitfall in reasoning about components is that we presuppose they have features that we are familiar with from programming methodology to such a degree that we cannot imagine that the issues addressed by these features can be approached in another way. Typically, many people endow components with features from the object-oriented paradigm. In order to prevent us from doing so, we will adhere to a strict regime for reasoning about components. We fit our reasoning in the form of a logical theory that has axioms and corollaries. We postulate our basic assumptions about components as axioms. From these axioms we aim to deduce corollaries that qualify components and their composition mechanism. Next, we present our first axioms.

A1  A component is capable of performing a task in isolation; i.e. without being composed with other components. (1)

A2  Components may be developed independently from each other. (2)

A3  The purpose of composition is to enable cooperation between the constituent components. (3)
Axioms A1 and A2 are generally agreed upon. Already in [Pa72], axiom A2 appears explicitly and A1 is close in spirit to Parnas’ observation “.. we are able to cut off the upper levels [of the system] and still have a usable and useful product.” The intention of axiom A1 is more explicitly present in recent formulations such as “[a component is an] independent unit of deployment” [Sz98]. To build larger systems out of smaller ones, we want to combine the effects of components. In order to be able to do so, we need a composition mechanism (axiom A3). Note that axiom A3 does not imply that it is a component’s purpose to cooperate. In fact, for the functioning of a component it should be immaterial whether it is cooperating with other components (cf. A1). It is the designer (composer) of a
composition who attributes meaning to the combined effect of the components. (Meaning [of a composition] is “in the eye of the composer.”) Next, we present a first corollary. C1
A component is capable of acquiring input from its environment and/or of presenting output to its environment.
(4)
This corollary can be motivated in two ways. The first is that performing some task (axiom A1) would be futile without some means to observe its effect. The second can be inferred from A3: in order to achieve cooperation between components, there must be some mechanism that facilitates their interaction. We proceed by deducing some more qualifications of components. C2
A component should be independent from its environment.
(5)
This corollary follows from axiom A1: In order for a component to fulfill its task in isolation, it should have no dependencies on this environment. Put more constructively, a design principle for components is to optimize their autonomy. C3
The addition or removal of a component should not require modification of other components in the composition.
(6)
Corollary C3 follows from C2. Suppose that the opposite of C3 was true; i.e. the addition (or removal) of a component does require modification of other components in the composition. Then, clearly, there is a dependency of the components that require modification on the one that is added to (or removed from) the composition. Corollary C3 expresses the flexibility or openness generally required of componentbased systems. Implications for distributed real-time systems From the preceding general observations we next shift attention to the design of components for real-time and distributed systems. The corollaries that we present follow straightforwardly from C2. To start with timeliness, C2 leads to the following corollary. C4
Timeliness of output of a component should be independent from timeliness of input.
(7)
Again this is a qualification towards the autonomy of components. One possible means to make the timeliness of output independent of timeliness of input is to build in a mechanism that enables a component to generate output when stimuli do not arrive as anticipated. Typically, such an output can be generated only at the cost of a decrease in the quality of the output. The next corollary, C5, is the justification of a principle that is known in the area of parallel and distributed systems as location transparency. Clearly, C5 follows from corollary C2. C5
The functioning of a component should be independent of its location in a composition.
(8)
Corollary C5 is a constraint on the internals of a component (internal location transparency). The counterpart of C5, external location transparency (corollary C6) is a qualification of the composition mechanism. Its justification is analogous to that of C3 (by contradiction of the opposite). C6
The change of location of a component should not require modifications to other components in the composition.
(9)
Next, we present our final corollary of this paper. C7
A component should be a unit of fault-containment.
(10)
The justification of Corollary C7 is as follows: a component cannot assume that some input is normal and some other is faulty, since this implies a dependency on its environment. Hence, a component has to cater for all possible input. Corollary C7 entails the following guideline for the design of components: components should shield their output from any anomalies at their input. (Actually, the term "anomaly" is indicative of an assumption about, and hence a dependency on, the environment.)
3 On Disputed Issues in Component Design
In this section we will discuss some issues in the design of components based on the qualifications that we found in the preceding sections. When this has unexpected
implications we may refer to existing composition systems (e.g. pipe-and-filter [Ri80], or shared tuple-spaces [CG89], [FHA99]) to illustrate that there are systems that do not violate these implications.

Do components have state? Let us assume that, in some composition, the task of a component is to store some state. The openness or flexibility corollary C3 asserts that the removal of a component should not require modifications to other components in the composition. This suggests that using a component to store data that is to be used by other components is a bad idea, since this storage component may be removed arbitrarily and the data it stored will no longer be available for other components in the system. In other words, a storage component induces dependencies on other components. This reasoning suggests that stacks and queues should not be considered good examples of components. Although this is a surprising consequence, we see that neither the pipe-and-filter nor the shared dataspace model requires components that store data. In these cases the composition mechanism deals with the storage of data. The fundamental issue seems to be that of openness versus encapsulation (in the style of abstract data types as encouraged by the object orientation paradigm). Giving priority to openness (as we do here) seems to prohibit encapsulation of storage. However, a component is free to build up a "state" as long as the effect of this state cannot be observed by the environment. For example, a filter that performs a word count on input text clearly computes the output by incrementing some local word counter. However, this local state does not induce a dependency on other components.

Are objects components? Components are often seen as the next logical step in the evolution of software engineering after objects. Be that as it may, this does not mean that components should be an extension of objects. It may turn out that some features of objects that were introduced to facilitate programming may not be suitable for the purpose of composition. The following are examples of features of the object-oriented paradigm that seem to hinder composition:
• The mechanism for cooperation: The object orientation paradigm uses method invocation (based on message passing) as a mechanism for cooperation. This mechanism requires agreement between the invoking and the invoked object on the order in which methods are executed. Such an order is built into the definition of objects. As a result, addition or removal of an object requires modification to other objects in the system (methods may cease to exist or new methods may need to be introduced), contradicting corollary C3.
In the area of coordination models and languages [PA98], this style of interaction is called endogenous. In contrast, in exogenous languages, the interaction between parties is specified outside of (textually separate from) the computational code. An example of an exogenous composition language is the pipe-and-filter mechanism from Unix. The specification of the pattern of interaction outside of the components involved in it allows modification of the interaction pattern without requiring modifications to the components. Also, with method invocation, the initiative for invoking a method may not reside with the object that the method is part of, but with some object in the environment. This is a violation of the independence of components (corollary C2).
•
Encapsulation of data: One argument is given in the previous subsection as answer to the issue of components and state. Another is given by [HO93]. The essence of that argument is that in an evolving system, the future uses of data cannot be predicted; hence an object that encapsulates data cannot provide the methods for which a need may arise in the future.
The above, however, does not imply that object oriented programming should not be used for implementing components – only that this paradigm does not provide the right abstractions for designing component based systems.
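To make the endogenous/exogenous distinction discussed above concrete, the following small sketch (ours, not from the paper; all names are hypothetical and Python is used only for illustration) composes the same two pieces of functionality first by direct method invocation, and then pipe-and-filter style, where the interaction pattern lives outside the components.

# Endogenous composition: the producer knows its partner and invokes a method
# by name, so adding, removing or replacing the consumer means editing the
# producer (cf. corollary C3).
class WordCounter:
    def consume(self, line):
        return len(line.split())

class Reader:
    def __init__(self, consumer):
        self.consumer = consumer              # the partner is wired into the component
    def run(self, lines):
        return [self.consumer.consume(l) for l in lines]

# Exogenous composition: each component is a self-contained filter; the
# pipeline itself is specified separately from the components.
def reader(lines):
    for line in lines:
        yield line

def word_count(stream):
    total = 0                                 # purely local state, not observable outside
    for line in stream:
        total += len(line.split())
    yield total

def compose(*filters):
    def pipeline(source):
        for f in filters:
            source = f(source)
        return source
    return pipeline

text = ["components are from mars", "objects are from venus"]
print(Reader(WordCounter()).run(text))            # endogenous style
print(list(compose(reader, word_count)(text)))    # exogenous, pipe-and-filter style

In the exogenous version, inserting or removing a filter only changes the compose(...) line; no component needs to be modified, which is exactly the openness that corollary C3 asks for.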
4 Concluding Remarks
The fact that currently many different definitions for components are proposed, suggests that we do not yet fully understand the implications of the requirements for component based engineering. In this paper we pursued the implications of these requirements further than is often done. To this end, we presented a rigorous approach to the qualification of components that makes the fundamental assumptions explicit. In this way, we aim to incrementally develop a model for component-based engineering. Our investigations suggest that object-orientation has some features that hamper the composability of software needed for component-based software development. Hence, we should investigate alternative composition mechanisms. We welcome comments and additions to our framework.
Acknowledgements The authors would like to thank Tim Willemse for his critical comments.
References [Br98] Broy M., Deimel A., Henn J., Koskimies K., Plasil F., Pomberger G., Pree W., Szyperski C.: What characterizes a (software) component?, Software Concepts & Tools (vol. 19, no. 1), 1998. [CG89] Carriero, N. and Gelernter, D., Linda in context, Communications of the ACM, vol 32(4), pp. 444-458, April 1989. [FHA99] Freeman, E., Hupfer, S. and Arnold, K., JavaSpaces(TM) Principles, Patterns and Practice (The Jini Technology Series), Addison-Wesley, 1999. [HO93] Harrison, W. and Osher, H., Subject-oriented Programming (a critique of pure objects), in: Proceedings of OOPSLA 1993, pp. 411-428. [McI68] McIlroy, D., Mass Produced Software Components, in "Software Engineering, Report on a conference sponsored by the NATO Science Committee, Garmisch, Germany, 7th to 11th October 1968", P. Naur and B. Randell (eds), Scientific Affairs Division, NATO, Brussels, 1969, 138-155. [Pa72] Parnas, D.L., On the Criteria to be used in Decomposing Systems into Modules, Communications of the ACM, Vol. 15, No. 12, Dec. 1972. [Pa98] Papadopoulos, G.A. and Arbab, F., Coordination Models and Languages. In M. Zelkowitz, editor, Advances in Computers, The Engineering of Large Systems, volume 46. Academic Press, August 1998. [Ri80] Ritchie, D.M., The Evolution of the Unix Time-sharing System, Proceedings of the Conference on Language Design and Programming Methodology, Sydney, 1979, Lecture Notes in Computer Science 79: Language Design and Programming Methodology, Springer-Verlag, 1980 (also at http://cm.bell-labs.com/cm/cs/who/dmr/hist.html). [Sa97] Sametinger, J., Software Engineering with Reusable Components, Springer, 1997. [SG96] Shaw, M. and Garlan, D., Software Architecture: Perspectives on an Emerging Discipline, Prentice Hall, 1996. [Sz98] Szyperski, C., Component Software: Beyond Object-Oriented Programming, AddisonWesley, 1998.
2 + 10 > 1 + 50 !
Hans Hansson, Christer Norström, and Sasikumar Punnekkat
Mälardalen Real-Time Research Centre, Department of Computer Engineering, Mälardalen University, Västerås, Sweden
[email protected],
[email protected],
[email protected] WWW home page: http://www.mrtc.mdh.se
Abstract. In traditional design of computer based systems some effort, say 1, is spent on the early modeling phases, and some very high effort, say 50, is spent on the later implementation and testing phases. It is the conjecture of this paper that the total effort can be substantially reduced if an increased effort, say 2, is spent on the early modeling phases. Such a shift in focus of efforts will also greatly improve the overall effects (both quality and cost-wise) of the systems developed, thereby leading to a better (denoted by ">") design process. In this paper, we specifically consider the design of safety-critical distributed real-time systems.
1 Introduction
Designing safety-critical real-time systems involves assessment of functionality, timing and reliability of the designed system. Though several design methods have been proposed in the literature (such as HRT-HOOD, DARTS, UPPAAL, UML-RT), none of them have been able to gain widespread acceptance due to the range and magnitude of the issues involved and probably due to the restricted focus of these methods. In Figure 1 we present a generic design model for the development of safety-critical distributed real-time systems.
Fig. 1. A generic real-time design model (boxes: application requirements, environment assumptions, requirements capture, architecture design, analysis, implementation & testing, environment model, product, real environment; iteration loops marked 1 and 2)
The architecture design is the highest abstraction level for the design and construction of the system. Here the system is partitioned into components, processes for their realisation are identified, and boundaries for desired quality levels are set. For real-time systems, timing budgets are typically allocated to individual components at this stage. The analysis part in the design process contains both functional analysis (such as temporal behaviour, reliability modelling, safety, performance) and non-functional analysis (such as testability, maintainability, portability, cost, extensibility). To be able to make these analyses, the architecture has to be described by a language that provides a precise syntax and semantics. Such a language should define the computational model with possible extensions for hierarchical and functional decompositions. Experiences from an industrial co-operation [1] have further convinced us of the benefits of performing architecture analysis on temporal requirements, communication and synchronisation. Based on these insights we will be focusing on architecture analysis rather than analysis of the implementation. It is apparent that such a shift in focus from the implementation & testing phase to the architecture design and analysis phases, by adding more resources and efforts in these earlier phases, is absolutely necessary to detect many critical issues before they manifest in the product and necessitate a costly product re-design. In terms of Figure 1, this amounts to iterating more on the inner loop (marked 2) rather than on the outer loop (marked 1) to improve the quality at a lower cost. Using such an approach, one of the major issues, i.e., timing compliance of the system, was achieved in our project at Volvo [1] by applying a time-budgeting and negotiations strategy for individual tasks. We now briefly present two other major issues, viz. fault modelling and testability, representing a functional and a non-functional issue, respectively. An accurate fault modelling and analysis will assist the designer in incorporating sufficient fault-tolerance capabilities into the system, whereas testability analysis can greatly reduce the final testing efforts. It should be noted that both these issues are addressed in conjunction with their effects on the temporal requirements and properties.
2 Fault Modelling and Analysis
Though there has been a sizable amount of research effort in both the fault tolerance and the real-time realms, these two fields have been more or less treading along parallel paths. These two research domains are all the more relevant in the case of safety-critical systems, and their mutual dependencies and interactions need to be analysed for achieving predictable performance. There are very few studies in the literature aimed at bridging the gap between these two areas, and many issues remain open that need to be further investigated. One such important issue is the effect of faults on schedulability analysis and on the timing guarantees provided. The major stumbling block in having an integrated approach is the orthogonal nature of the two factors, viz., the stochastic nature of faults and the deterministic requirements on schedulability analysis. This calls for the development
of more realistic fault models which capture the nuances of the environment, as well as methods for incorporating such models into the timing analysis with ease. In applications such as automobiles, the systems are often subjected to high degrees of Electro Magnetic Interference (EMI). The common causes for such interference include cellular phones and other radio equipment inside the vehicle and electrical devices like switches and relays, as well as radars, radio transmissions from external sources, and lightning in the environment. These interferences may cause errors in the transmitted data. In this context we have recently [3] developed a model for calculating worst-case latencies of messages on the Controller Area Network (CAN) under errors. CAN is a predictable communication network widely used in the automotive and automation industries. The basic CAN analysis assumes an error-free communication bus, which is not always true. To reduce the risk due to errors, CAN designers have provided elaborate error checking and confinement features, which identify the errors and retransmit the affected messages, thus increasing the message latencies and potentially leading to timing violations. Tindell and Burns [2] have proposed a model for calculating worst-case latencies of CAN messages under errors. They define an error overhead function E(t), as the maximum time required for error signaling and recovery in any time interval of length t. Their model is relatively simplistic and assumes an initial error-burst followed by sporadic error occurrences (i.e., errors separated by a known minimum time). Our new fault model [3] is more general, in that it
– models intervals of interference as periods in which the bus is not available
– allows more general patterns of interferences to be specified and from that description derives the effect on message transmissions
– allows the combined effects of multiple sources of interference to be modeled
– considers the potential delay induced by the interference durations
With this fault model it is possible to build parameterised models of different types of interferences originating from different sources. Using these models, realistic worst-case scenarios can be characterised and analysed. We believe that this kind of analysis will be a step towards the future design of adaptive scheduling strategies which take into account the error occurrences and decide on-line issues such as graceful degradation and choosing different policies for different classes of messages.
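As a rough illustration of the kind of analysis referred to above, the sketch below (ours) iterates a Tindell/Burns-style worst-case latency recurrence for a CAN message with an error overhead term E(t) [2]. The simple burst-plus-sporadic error model and all numeric values are assumptions made for illustration only; they are not the generalized fault model of [3].

import math

def error_overhead(t, burst_errors=2, min_error_gap=10.0, recovery_cost=0.6):
    # E(t): assumed maximum time spent on error signalling, recovery and
    # retransmission in any interval of length t (burst + sporadic errors).
    errors = burst_errors + math.ceil(t / min_error_gap)
    return errors * recovery_cost

def worst_case_latency(ci, blocking, higher_prio, tau_bit=0.002, max_iter=1000):
    # Fixed-point iteration on the queueing delay w of a message with
    # transmission time ci, a blocking term, and higher-priority messages
    # given as (Cj, Tj) pairs; all times in milliseconds.
    w = blocking
    for _ in range(max_iter):
        interference = sum(math.ceil((w + tau_bit) / tj) * cj for cj, tj in higher_prio)
        w_next = blocking + interference + error_overhead(w + ci)
        if w_next == w:
            return w + ci          # worst-case latency = queueing delay + own transmission
        w = w_next
    return None                    # no convergence: the message set is overloaded

print(worst_case_latency(ci=0.5, blocking=0.5, higher_prio=[(0.5, 5.0), (0.5, 10.0)]))

The point of the more general model in [3] is precisely that error_overhead above can be replaced by a function derived from several, possibly overlapping, interference sources rather than a single burst-plus-sporadic pattern.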
3 Testability Analysis
A large part of the effort, time and cost in developing safety-critical real-time (and most other) systems is related to testing. Consequently, one of the most important non-functional quality attributes of a design is its testability, i.e. the effort required to obtain a specific coverage in the testing process. High testability means that relatively few tests have to be exercised. The design with highest testability may however not be the preferred one, since testability typically is in conflict with other desired qualities, such as performance and maintainability.
Using testability measures in choosing between alternative designs that are similar in other respects is however highly desirable, and sacrificing other qualities for increased testability may be a good compromise in many situations. An intuitive metric for the testability of a system is its number of distinguishable computations. For a sequential program this is proportional to the number of program paths. For concurrent and distributed systems we must additionally consider the possible interleavings of the program executions (the tasks). Clearly, by limiting the freedom in scheduling and by making synchronization between distributed nodes tighter, we can substantially reduce the number of interleavings, thus increasing testability. Testability is further increased if the variations (jitter) in release and execution times of individual tasks can be reduced. In [4], we introduce a method for identifying the set of task interleavings of a distributed real-time system with a task set having recurring release patterns. We propose a testing strategy which essentially amounts to regarding each of the identified interleavings as a sequential program, and then using sequential techniques for testing it. Due to the large number of interleavings, this in general is a formidable task. We are however convinced that for a sufficiently large class of safety-critical real-time systems this approach is both feasible and desirable.
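The following small sketch (ours, purely illustrative) shows why the number of interleavings dominates the testing effort: for independent task traces the count is a multinomial coefficient, and constraining the schedule shrinks it dramatically.

from math import factorial
from functools import reduce

def interleavings(*trace_lengths):
    # Number of ways to interleave independent traces of the given lengths:
    # (n1 + ... + nk)! / (n1! * ... * nk!)
    total = factorial(sum(trace_lengths))
    return total // reduce(lambda acc, n: acc * factorial(n), trace_lengths, 1)

print(interleavings(5, 5, 5))   # three tasks, five events each: 756756 orderings
print(interleavings(1, 1, 1))   # each task forced to run as one atomic block: 6

Reducing jitter and tightening synchronization plays the same role as shortening the traces here: it collapses many of these orderings into a single distinguishable computation.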
4 Conclusion and future challenges
In this paper, we have described some important issues in the design of safety-critical distributed real-time systems. We emphasize the potential gain of shifting the focus from the implementation & testing phase to the architectural design phase, by obtaining a high effects-efforts ratio. In this context, we also highlighted two of our latest research contributions. The vision and objective of current research in the Systems Design Laboratory at Mälardalen Real-Time Research Centre is to provide engineers with scientific methods and tools for designing safety-critical real-time systems, such that the state of the art and practice for developing such systems is advanced to a mature engineering discipline. This amounts to developing, adopting and applying theory with industrial applications in mind, as well as designing appropriate engineering tools and methods.
References
1. Christer Norström, Kristian Sandström, and Jukka Mäki-Turja: Experiences and findings from the usage of real-time technology in an industrial project, MRTC Technical report, January 2000.
2. Ken W. Tindell, Alan Burns, and Andy J. Wellings: Calculating Controller Area Network (CAN) Message Response Times. Control Engineering Practice, 3(8), 1995.
3. Sasikumar Punnekkat, Hans Hansson, and Christer Norström: Response time analysis of CAN message sets under errors, MRTC Technical report, December 1999.
4. Henrik Thane and Hans Hansson: Towards Systematic Testing of Distributed Real-Time Systems, 20th IEEE Real-Time Systems Symposium, Phoenix, December 1999.
A Framework for Embedded Real-time System Design

Jin-Young Choi¹, Hee-Hwan Kwak², and Insup Lee²

¹ Department of Computer Science and Engineering, Korea University, [email protected]
² Department of Computer and Information Science, University of Pennsylvania, [email protected], [email protected]

Abstract. This paper describes a framework for parametric analysis of real-time systems based on process algebra. The Algebra of Communicating Shared Resources (ACSR) has been extended to ACSR with Value-passing (ACSR-VP) in order to model systems that pass values between processes and change the priorities of events and timed actions dynamically. The analysis is performed by means of bisimulation or reachability analysis. The result of the analysis is predicate equations. A solution to them yields the values of the parameters that satisfy the design specification. We briefly describe the proposed framework in which this approach is fully automated and identify future work.

(This research was supported in part by NSF CCR-9619910, ARO DAAG55-98-10393, ARO DAAG55-98-1-0466, and ONR N00014-97-1-0505.)

1 Introduction

There has been active research on formal methods for the specification and analysis of real-time systems [4, 5] to meet increasing demands on the correctness of embedded real-time systems. However, most of the work assumes that various real-time system attributes, such as execution time, release time, priorities, etc., are fixed a priori, and the goal is to determine whether a system with all these known attributes would meet required timing properties; that is, to determine whether or not a given set of real-time tasks under a particular scheduling discipline can meet all of its timing constraints. Recently, parametric approaches which do not require guessing the values of unknown parameters a priori have been proposed as general frameworks for the design analysis of real-time systems. Gupta and Pontelli [3] proposed a unified framework where timed automata are used as a front-end, and constraint logic programming (CLP) languages as a back-end. We [7] proposed a parametric approach based on the real-time process algebra ACSR-VP (Algebra of Communicating Shared Resources with Value Passing). The scheduling problem is modeled as a set of ACSR-VP terms which contain the unknown variables as parameters. As shown in [7], a system is schedulable when it is bisimilar to a non-blocking process. Hence, to obtain the values for these parameters we
check a symbolic bisimulation relation between a system and a non-blocking process, both described in ACSR-VP terms. The result of the bisimulation relation checking with the non-blocking process is a set of predicate equations whose solutions are the values for parameters that make the system schedulable. In this way, our approach reduces the analysis of scheduling problems to finding solutions of a recursive predicate equation system. We have demonstrated in [7] that CLP techniques can be used to solve predicate equations. Before we explain an extension of our approach [7], we briefly present some background material below. Due to the space limitation we omit the formal definition of ACSR-VP. Instead, we illustrate the syntax and semantics using the following example process P:

P(t) = (t > 0) → (a!t + 1, 1).P(t)

The process P has a free variable t. The instantaneous action (a!t + 1, 1) outputs a value t + 1 on a channel a with priority 1. The behavior of the process P is as follows. It checks the value of t. If t is greater than 0, then it performs the instantaneous action (a!t + 1, 1) and becomes the process P. Otherwise it becomes NIL. For more information on ACSR-VP the reader is referred to [7]. To capture the semantics of an ACSR-VP term, we proposed a Symbolic Graph with Assignment (SGA). An SGA is a rooted directed graph where each node has an associated ACSR-VP term and each edge is labeled by a boolean guard, an action, and an assignment. Given an ACSR-VP term, an SGA can be generated using the rules shown in [7]. The notion of bisimulation is used to capture the semantics of schedulability of real-time systems. The scheduling problem is to determine if a real-time system with a particular scheduling discipline meets all of its deadlines and timing constraints. In ACSR-VP, if no deadline and constraints are missed along any computation of the system, then the process that models the system always executes an infinite sequence of timed actions. Thus, by checking the bisimulation equivalence between the process that models the system and the process that idles infinitely, the analysis of schedulability for real-time systems can be achieved.
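The following toy rendering (ours, not the ACSR-VP tool set) spells out the single guarded edge of the example process P above as the triple of guard, action and continuation that an SGA-style edge carries.

def guard(t):
    return t > 0                          # the boolean condition (t > 0)

def action(t):
    return ("a!", t + 1, 1)               # output value t+1 on channel a with priority 1

def step(t):
    # One unfolding of P(t): fire the edge if the guard holds, otherwise behave as NIL.
    if not guard(t):
        return None                       # NIL: no further behaviour
    return action(t), t                   # the action performed and the parameter of the next P

print(step(3))   # (('a!', 4, 1), 3)
print(step(0))   # None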
2 A Fully Automatic Approach for the Analysis of Real-time Systems

In the approach published in [7] a bisimulation relation plays a key role in finding solutions for parameters. However, the disadvantage of a bisimulation relation checking method is that it requires adding new edges. These new edges will increase the size of the set of predicate equations and the complexity of solving them. To reduce the size of the set of predicate equations, we introduced a parametric reachability analysis technique. As noted in [7], finding conditions that make the system schedulable is equivalent to finding a symbolic bisimulation relation with an infinite idle process. Furthermore, checking the symbolic bisimulation relation with an infinite idle process
is equivalent to finding conditions that guarantee there is always a cycle in an SGA regardless of the path taken. That is, there is no deadlock in the system under analysis. Hence, we can obtain a condition that guarantees there is no deadlock in the system under analysis by checking possible cycles in an SGA for the system under analysis. We illustrate that this reachability analysis can replace a bisimulation relation checking procedure. With a reachability analysis we can avoid adding new edges and reduce the complexity of solving predicate equations. Utilizing existing CLP techniques seems to be a natural way of solving predicate equations. However, it is not possible to determine if a CLP program terminates. This leads us to identify a decidable subset of ACSR-VP terms. This subset can be classified by dividing variables in ACSR-VP terms into two types: control variables and data variables. A control variable is a variable with a finite range. The value of a control variable can be modified while a process proceeds. A data variable is a variable that does not change its value, that is, it just holds values "passively" without modification. Data variables may assume values from infinite domains. A detailed explanation of a decidable subset of ACSR-VP is given in [6]. We use the term "restricted ACSR-VP" to denote a decidable subset of ACSR-VP. With restricted ACSR-VP terms we can reduce a real-time system analysis to solving either a boolean expression or boolean equations with free variables. A decidable subset of ACSR-VP allows us to generate a boolean expression or boolean equations with free variables (BESfv) as a result of reachability analysis or symbolic bisimulation checking. We have developed a BESfv solving algorithm, which is based on maximal fixpoint calculation. Here we explain the overview of our fully automatic approach, which is a refined version of our previous one [7]. A simplified version of the overall structure of our approach is shown in Figure 1.

Fig. 1. Our approach for real-time system analysis (a system described with restricted ACSR-VP is turned into an SGA and then an SG; reachability analysis or bisimulation yields a boolean expression or boolean equations, which are solved by integer programming or an equation solver to give the solution space)

We describe a system with restricted ACSR-VP terms. With a given set of restricted ACSR-VP processes, an SGA is generated from a restricted ACSR-VP term in order to capture the behavior of a system. Once an SGA is generated, we instantiate all the control variables in each SGA node to form a Symbolic Graph (SG). An SG is a directed graph in which every edge is labeled by (b, α), where b is a boolean expression and α is an action. As an analysis, either symbolic bisimilarity is checked on the SG against an SG of the infinite idle process, or reachability analysis can be directly
performed on an SG of the system. The result is a set of boolean equations or a boolean expression. In the case that a boolean expression with free variables is produced, it can be solved by existing integer programming tools such as the Omega Calculator [8]. In the other case, where boolean equations with free variables are generated, an algorithm presented in [6] can be applied. We have applied our framework to several real-time scheduling problems. For real-time scheduling problems, the solution to the boolean expression or the set of boolean equations with free variables identifies, if it exists, under what values of the unknown parameters the system becomes schedulable. For instance, in shortest job first scheduling, we may want to know the periods of certain jobs that guarantee the schedulability of the system. We let those periods be unknown parameters and describe the system in ACSR-VP process terms. Those unknown parameters are embedded in the derived boolean expression or boolean equations, and consequently the solutions represent the values of the unknown parameters that make them satisfiable. These solutions represent the valid ranges of periods (i.e., unknown parameters) of the jobs that make the system schedulable. Our method is expressive enough to model complex real-time systems in general. Furthermore, the resulting boolean formulas can be solved efficiently. For instance, there has been active research [2] on solving boolean expressions efficiently, and there are existing tools such as the Omega Calculator [8] for Presburger formulas. Another significant advantage of our method is the size of the graphs. Due to the abstract nature of SGA, the size of an SGA constructed from an ACSR-VP term is significantly smaller than that of Labeled Transition Systems (LTS), which require all the parameters to be known a priori. Consequently, this greatly reduces the state explosion problem, and thus we can model larger systems and solve problems which were not possible with previous approaches due to state explosion. Furthermore, our approach is decidable whereas other general frameworks such as [3] are not, and thus it is possible to make our approach fully automatic when we generate a set of boolean equations or a boolean expression. Since our approach is fully automatic, it can also be used to check other properties as long as they can be verified by reachability analysis.
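A minimal sketch (ours, not the PARAGON tool set) of the reachability check described above: once the control variables have been instantiated, the system is deadlock-free, and hence schedulable in the sense used here, exactly when no node reachable from the root is left without an outgoing edge.

def deadlock_free(edges, root):
    # edges: dict mapping each node to its list of successors (guards already resolved).
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        succs = edges.get(node, [])
        if not succs:
            return False              # a reachable node with no move: a constraint was missed
        stack.extend(succs)
    return True

# A system that can idle forever versus one that gets stuck after one step.
print(deadlock_free({"s0": ["s1"], "s1": ["s0"]}, "s0"))   # True
print(deadlock_free({"s0": ["s1"], "s1": []}, "s0"))       # False

In the parametric setting the guards contain the free variables, so instead of a yes/no answer the traversal accumulates, per edge, the boolean conditions under which the move is possible; collecting them yields the boolean expression or equations mentioned above.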
3 Conclusion

We have overviewed a formal framework for the specification and analysis of real-time systems design. Our framework is based on ACSR-VP, symbolic bisimulation, and reachability analysis. The major advantage of our approach is that the same framework can be used for scheduling problems with different assumptions and parameters. In other real-time system analysis techniques, new analysis algorithms need to be devised for problems with different assumptions since applicability of a particular algorithm is limited to specific system characteristics. We believe that restricted ACSR-VP is expressive enough to model any real-time system. In particular, our method is appropriate to model many complex
real-time systems and can be used to solve the priority assignment problem, the execution synchronization problem, and the schedulability analysis problem [9]. We are currently investigating how to adapt the proposed framework for embedded hybrid systems, that is, systems with both continuous and discrete components. The novel aspect of our approach is that schedulability of real-time systems can be described formally and analyzed automatically, all within a process-algebraic framework. It has often been noted that scheduling work is not adequately integrated with other aspects of real-time system development [1]. Our work is a step toward such an integration, which helps to meet our goal of making the timed process algebra ACSR a useful formalism for supporting the development of reliable real-time systems. Our approach allows the same specification to be subjected to the analysis of both schedulability and functional correctness. There are several issues that we need to address to make our approach practical. The complexity of an algorithm to solve a set of boolean equations with free variables grows exponentially with respect to the number of free variables. We are currently augmenting PARAGON, the toolset for ACSR, to support the full syntax of ACSR-VP directly and implementing a symbolic bisimulation algorithm. This toolset will allow us to experimentally evaluate the effectiveness of our approach with a number of large-scale real-time systems.
References
1. A. Burns. Preemptive priority-based scheduling: An appropriate engineering approach. In Sang H. Song, editor, Advances in Real-Time Systems, chapter 10, pages 225-248. Prentice Hall, 1995.
2. Uffe Engberg and Kim S. Larsen. Efficient Simplification of Bisimulation Formulas. In Proceedings of the Workshop on Tools and Algorithms for the Construction and Analysis of Systems, pages 111-132. LNCS 1019, Springer-Verlag, 1995.
3. G. Gupta and E. Pontelli. A constraint-based approach for specification and verification of real-time systems. In Proceedings IEEE Real-Time Systems Symposium, December 1997.
4. Constance Heitmeyer and Dino Mandrioli. Formal Methods for Real-Time Computing. John Wiley and Sons, 1996.
5. Mathai Joseph. Real-Time Systems: Specification, Verification and Analysis. Prentice Hall Intl., 1996.
6. Hee Hwan Kwak. Process Algebraic Approach to the Parametric Analysis of Real-time Scheduling Problems. PhD thesis, University of Pennsylvania, 2000.
7. Hee-Hwan Kwak, Jin-Young Choi, Insup Lee, Anna Philippou, and Oleg Sokolsky. Symbolic Schedulability Analysis of Real-time Systems. In Proceedings IEEE Real-Time Systems Symposium, December 1998.
8. William Pugh. The Omega test: a fast and practical integer programming algorithm for dependence analysis. Communications of the ACM, 8:102-114, August 1992.
9. Jun Sun. Fixed-priority End-to-end Scheduling in Distributed Real-time Systems. PhD thesis, University of Illinois at Urbana-Champaign, 1997.
Best-effort Scheduling of (m,k)-firm Real-time Streams in Multihop Networks

A. Striegel and G. Manimaran
Dept. of Electrical and Computer Engineering, Iowa State University, USA
{adstrieg, [email protected]

Abstract. In this paper, we address the problem of best-effort scheduling of (m,k)-firm real-time streams in multihop networks. The existing solutions for the problem ignore scalability considerations because the solutions maintain a separate queue for each stream. In this context, we propose a scheduling algorithm, EDBP, which is scalable (fixed scheduling cost) with little degradation in performance. The proposed EDBP algorithm achieves this by allowing multiplexing of streams onto a fixed number of queues and by using the notion of a look-ahead window. In the EDBP algorithm, at any point of time, the best packet for transmission is selected based on the state of the stream combined together with the laxity of the packet. Our simulation studies show that the performance of EDBP is very close to that of DBP-M (a known algorithm for the problem) with a significant reduction in scheduling cost.

1 Introduction

Packet switched networks are increasingly being utilized for carrying real-time traffic which often requires quality of service (QoS) in terms of end-to-end delay, jitter, and loss. A particular type of real-time traffic is a real-time stream, in which a sequence of related packets arrives at a regular interval with certain common timing constraints [1]. Real-time streams occur in many applications such as real-time video conferencing, remote medical imaging, and distributed real-time applications. Unlike non-real-time streams, packets in a real-time stream have deadlines by which they are expected to reach their destination. Packets that do not reach the destination on time contain stale information that cannot be used. There have been many schemes in the literature to deterministically guarantee the meeting of deadlines of all packets in a stream [2, 3]. The main limitation of these schemes is that they do not exploit the ability of streams that can tolerate occasional deadline misses. For example, in teleconferencing, occasional misses of audio packets can be tolerated by using interpolation techniques to estimate the information contained in tardy/dropped packets. On the other hand, there are schemes that try to exploit the ability of streams to tolerate occasional deadline misses by bounding the steady-state fraction of packets that miss their deadlines [4]. The main problem with these approaches is that the deadline misses need not be adequately spaced, whereas adequately spaced misses are often better than spurts of deadline misses. For example, if a few consecutive audio packets miss their deadlines, a vital portion of the talkspurt may be missing and
the quality of the reconstructed audio signal may not be satisfactory. However, if the misses are adequately spaced, then interpolation techniques can be used to satisfactorily reconstruct the signal [5]. To address this problem, the (m,k)-firm guarantee model was proposed in [1]. A real-time stream with an (m,k)-firm guarantee requirement states that m out of any k consecutive packets in the stream must meet their respective deadlines. When a stream fails to meet this (m,k)-firm guarantee, a condition known as dynamic end-to-end failure occurs. The probability of dynamic end-to-end failure is then used as a measure of the QoS perceived by an (m,k)-firm real-time stream.
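A small sketch (ours) of the condition just described: a dynamic failure occurs as soon as some window of k consecutive packets contains fewer than m deadline meets ('M' marks a met deadline, 'm' a missed one).

def dynamic_failure(history, m, k):
    # True if any k consecutive packets contain fewer than m meets.
    for i in range(len(history) - k + 1):
        if history[i:i + k].count('M') < m:
            return True
    return False

print(dynamic_failure("MMmMMmMM", 2, 3))   # False: the misses are adequately spaced
print(dynamic_failure("MMmmMM", 2, 3))     # True: the window 'Mmm' has only one meet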
Related Work: The message scheduling algorithms, such as Earliest Deadline
First (EDF) and its variants [2, 3] that have been proposed for real-time streams are not adequate for (m,k)-firm streams because they do not exploit the m and k parameters of a stream. For scheduling of (m,k)-firm streams, a best-effort scheme has been proposed in [1] for single hop and has been extended to multihop in [6], with the objective of minimizing the dynamic end-to-end failure.
DBP Algorithms: A scheduling algorithm, Distance Based Priority (DBP),
has been proposed in [1] in which each stream is associated with a state machine and a DBP value which depends on the current state of the stream. The state of a stream captures the meeting and missing of deadlines for a certain number of previous packets of the stream. The DBP value of a stream is the number of transitions required to reach a failing state, where failing states are those states in which the number of meets is less than m. The lower the DBP value of a stream, the higher its priority. The packet from the stream with the highest priority is selected for transmission. Figure 1 shows the state diagram for a stream with a (2,3)-firm guarantee wherein M and m are used to represent meeting a deadline and missing a deadline, respectively. Each state is represented by a three-letter (k-letter) string. For example, MMm denotes the state where the most recent packet missed its deadline and the two previous packets met their deadlines. The edges represent the possible state transitions. Starting from a state, the stream makes a transition to one of two states, depending on whether its next packet meets (denoted by M) or misses (denoted by m) its deadline. For example, if a stream is in state MMm and its next packet meets the deadline, then the stream transits to state MmM. In Figure 1, the failure states are mMm, Mmm, mmM, and mmm. The Modified DBP (DBP-M) [6] is a multihop version of the original DBP algorithm. In DBP-M, for each stream, the end-to-end deadline is split into link (local) deadlines, along the path from source to destination of the stream, such that the sum of the local deadlines is equal to the end-to-end deadline. DBP-M confronts the problem introduced by multihop networks by having packets transmitted onward until they have missed their respective end-to-end deadlines. Thus, although a packet may miss its local deadline, it is still given a chance to meet its end-to-end deadline.
Fig. 1. DBP state diagram of a (2,3) stream
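The sketch below (ours, for illustration) computes the distance that Figure 1 encodes: the DBP value of a stream is the least number of consecutive misses that drives its k-packet history into a failing state, and the EDBP extension introduced later in the paper makes already-failing states negative (one minus the number of consecutive meets needed to recover).

def dbp_value(history, m, k):
    # history: string of the last k outcomes, oldest first ('M' = met, 'm' = missed).
    assert len(history) == k
    if history.count('M') < m:                       # already a failing state
        j = 0
        while history[j:].count('M') + j < m:        # meets needed to recover
            j += 1
        return 1 - j                                 # EDBP-style negative value
    j = 0
    while history[j:].count('M') >= m:               # misses tolerated before failure
        j += 1
    return j

print(dbp_value("MMM", 2, 3))   # 2: two more misses are needed to reach a failing state
print(dbp_value("MMm", 2, 3))   # 1
print(dbp_value("Mmm", 2, 3))   # -1: two consecutive meets are needed to recover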
Motivation for Our Work: DBP and DBP-M use a separate queue for each
stream at every node along the path of a stream (connection). That is, for each stream that is flowing across the network, a separate queue is created and per-stream state information is maintained at each node along the path of the stream. This solution is not scalable as the number of queues increases with the number of streams, which results in high scheduling cost in terms of computational requirements. Similarly, the per-stream state information incurs overhead in terms of computational and memory requirements. The second aspect has been addressed by the Differentiated Services model [9]. In this paper, we address the first aspect by proposing an algorithm that reduces the scheduling cost by maintaining a fixed number of queues. There exists a tradeoff between dynamic failure performance and the scheduling cost involved in achieving that performance. With the DBP and DBP-M extreme, a significant amount of scheduling cost is required to maintain the one queue per one stream ratio. Given a link that has N streams flowing across it, a DBP-M implementation requires N queues and requires O(N) scheduling cost. However, this queue to stream ratio does deliver the best dynamic end-to-end failure performance for a given set of (m,k) streams. In contrast, classical EDF scheduling and its variations require only one (or a fixed number of) queue(s) per link, i.e. the streams are multiplexed onto the queue(s), hence requiring a scheduling cost of O(1). These methods incur the least scheduling cost but deliver the poorest end-to-end dynamic failure performance for (m,k) streams. Therefore, a better algorithm would require less scheduling cost than DBP-M but would provide better dynamic failure performance than classic EDF scheduling. This is the principal motivation for our work; in it, an integrated heuristic is proposed that allows multiplexing of streams while still providing adequate dynamic failure performance.
2 EDBP Scheduling Algorithm
The proposed EDBP algorithm aims at providing the same dynamic failure performance as that of DBP-M with a minimal scheduling cost by allowing queues to have more than one stream multiplexed. EDBP meets this goal by its integrated heuristic (EDBP value) that incorporates the DBP state of a stream together with the laxity of the packet. The EDBP algorithm has two key parts.
The first part deals with selecting the best (highest priority) packet from a window of packets in each queue (Steps 1-4). The second part selects the best packet from those packets chosen in the first part and transmits it (Steps 5-6). For the EDBP algorithm, the following notation is used:
Qi: ith queue; Pj: jth packet in a queue
Sx: stream that produced Pj; w: window size
EDBP(Pj): EDBP value of packet Pj
EDBPS(Sx): EDBP state of stream Sx
The packets in a queue are stored in FIFO order. The cost of the algorithm has two parts: queue insertion cost and scheduling cost. The insertion cost is high for EDF because it uses a priority queue, and is unit cost for DBP and EDBP. EDF has a unit scheduling cost whereas the scheduling costs of DBP and EDBP are N and w·Q, respectively, where N is the number of streams and Q is the number of queues. The EDBP algorithm for transmitting a packet is given in Figure 2 below. Following it, the steps of the algorithm are discussed in detail.
Begin
For each queue Qi perform Steps 1-4:
1) For each Pj from P0 to P(w-1), determine if the packet has missed its end-to-end deadline; such packets are then dropped.
2) LocalDeadline(Pj) = End-to-EndDeadline(Pj) / Number of hops in the path of stream Sx
   Laxity(Pj) = LocalDeadline(Pj) - current time
   BucketWidth = max(|Laxity(P0)|, |Laxity(P1)|, ..., |Laxity(P(w-1))|) + 1
3) Calculate the EDBP value for each packet Pj:
   EDBP(Pj) = BucketWidth * EDBPS(Sx) + Laxity(Pj)
4) Select the Pj that has the lowest EDBP value, called the best packet.
5) Repeat Steps 2-4, treating the best packet from each queue Qi as a packet in an overall queue and with a window size (w) equal to the number of queues available.
6) Schedule the packet with the lowest EDBP value.
End
Fig. 2. EDBP scheduling algorithm for transmitting a packet
Step 1: The EDBP algorithm examines a window of w packets from each queue
starting from P0 (the head packet in the queue) up to P(w-1) to determine if the packet has missed its end-to-end deadline. Thus, if a packet cannot meet its end-to-end deadline, the packet is dropped and the EDBP state of the corresponding stream for the node is adjusted accordingly. As with DBP-M, a packet is not dropped based on its local deadline. The use of the end-to-end deadline as a dropping mechanism is to give the packet a chance to meet its end-to-end deadline by scheduling the packet ahead of time in the downstream nodes across its path.
Step 2: In order to combine the EDBP state of a given packet Pj with the packet's laxity, the EDBP state must be converted to a meaningful value. Therefore, the EDBP algorithm uses the notion of buckets and offsets. The idea of a bucket is to group together the streams that have similar DBP states, and the laxity is used as an offset inside the group (bucket). The local deadline cannot be used for the calculation of the bucket width as it is a relative value. However, the laxity of a packet is an absolute value related to the maximum end-to-end
deadline in the network. In this step, for each queue, a window of packets is examined to determine the packet with the largest absolute laxity value. However, the maximum laxity value itself cannot simply be used to determine the bucket width. Consider the case where all of the packets in the window have missed their local deadline and the maximum laxity value is negative. Because the maximum laxity value is negative, priority inversion would occur, as a lower EDBP heuristic value means a higher priority. To handle this case, the EDBP heuristic uses the maximum absolute laxity value. Thus, the value is always positive and priority inversion cannot occur. Consider a second case where all of the packets have a local deadline of zero. Thus, without further modification, the EDBP state of the respective streams would essentially drop out of the EDBP heuristic. To handle this case, the maximum laxity value is further modified by adding one. This ensures that the modified laxity value will always be greater than or equal to one, thus eliminating the possibility of priority inversion or the elimination of the term corresponding to the EDBP state.

Steps 3, 4: Following the bucket width calculation, the best packet for the queue must be selected. The EDBP heuristic itself is divided into two parts, the bucket calculation and the bucket offset calculation. Each packet is placed into its appropriate bucket by multiplying the value of the EDBP state with the bucket width. After the bucket calculation is complete, each packet is appropriately offset into its bucket by adding the laxity value for that packet. For the EDBP algorithm, a modification of the DBP state calculation is proposed. As with the initial DBP algorithm, the DBP value of a stream in a non-failing state is the number of transitions required to reach a failing state. Consider a (2,3)-firm stream with a previous history of MMM. The DBP value would be 2, representing the two transitions required to reach a failing state. In the EDBP heuristic, the DBP state is expanded to allow negative values, thus allowing the EDBP state to discern between levels of dynamic failure between different streams. When the stream has reached a failing state, EDBP expands upon the initial DBP algorithm by setting the EDBP value equal to one minus the number of transitions required to return to a non-failing state. Under the initial DBP algorithm, a (2,3) stream with a history of Mmm would yield a DBP value of 0. However, when one examines the state diagram for the (2,3) stream, it is discovered that two transitions are required to return to a non-failing state. Under the EDBP algorithm, this stream would receive an EDBP value of -1, thus appropriately placing the packet at a priority level denoting its level of dynamic failure.

Best Packet Selection - Steps 5, 6: Once the best packet has been selected from each queue Qi, the overall best packet is selected among these packets for transmission. To accomplish this, Steps 2-4 are repeated again with the following modifications. First, the queue being examined is now a queue of the best packets from each queue Qi. Second, the window size for the EDBP algorithm is equal to the number of queues available. The best overall packet thus obtained will have the lowest EDBP value and is transmitted.
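The following sketch (ours; the packet representation is hypothetical) condenses Steps 2-4 for a single queue: laxities are derived from the split end-to-end deadlines, the bucket width is the largest absolute laxity plus one, and the packet with the lowest bucketed value wins.

def select_best(window, now):
    # window: list of (end_to_end_deadline, hops, stream_state) tuples, where
    # stream_state is the EDBP state of the packet's stream (negative once the
    # stream is already in dynamic failure).
    laxities = []
    for deadline, hops, state in window:
        local_deadline = deadline / hops              # Step 2: split the end-to-end deadline
        laxities.append(local_deadline - now)
    bucket_width = max(abs(l) for l in laxities) + 1
    # Step 3: bucket by stream state, offset by laxity; Step 4: lowest value wins.
    values = [bucket_width * p[2] + lax for p, lax in zip(window, laxities)]
    return min(range(len(window)), key=values.__getitem__)

# Three queued packets: the packet whose stream is closest to dynamic failure
# (state 0) is chosen even though another packet has a smaller laxity.
window = [(40.0, 4, 2), (30.0, 3, 1), (36.0, 3, 0)]
print(select_best(window, now=8.0))   # -> 2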
3 Performance Study

A network simulator was developed to evaluate and compare the performance of the EDBP algorithm with that of the DBP-M and EDF algorithms. The simulator uses a single queue for EDF, one queue per stream for DBP-M, and a fixed number of queues (which is an input parameter to the simulator) for EDBP. For our simulation studies, we have selected ARPANET as the representative topology. The algorithms were evaluated using the probability of dynamic failure as the performance metric. In our simulation, one millisecond (ms) is represented by one simulation clock tick. Source and destination nodes for a stream were chosen uniformly from the node set. The local deadline for each stream was fixed, with the end-to-end deadline equal to the fixed local deadline times the number of hops in the stream's path. The m and k values of a stream in the network are exponentially distributed with the condition that m < k. The mean inter-arrival times of streams in the network follow a Poisson distribution, and streams stay active for an exponentially distributed duration of time. Packets are assumed to be of fixed size and each link has a transmission delay of one millisecond.

Effect of Number of Queues: In Figure 3, the effect of the number of queues on the probability of dynamic failure is examined for the EDBP algorithm. In the best case, the number of queues is equal to the number of streams; this is exemplified by the DBP-M algorithm. The EDBP algorithm has been split into two versions, one with N/2 queues and the other with N/4 queues (N = 16). Each increase in the number of queues results in an appropriate increase in the dynamic failure performance of the EDBP algorithm. For this figure, the performance of the EDF and DBP-M algorithms remains unchanged as the queue parameter has no effect on these algorithms. From Figure 3, one can deduce that an increase in the number of queues reduces the multiplexing degree, which in turn increases the performance of the EDBP algorithm. The performance of the EDBP algorithm with N/2 queues is extremely close to the performance of the DBP-M algorithm while requiring only half of the scheduling cost of DBP-M.
Fig. 3. Effect of No. of Queues (dynamic failure probability vs. inter-arrival time in ms). Fig. 4. Effect of Window Size (dynamic failure probability vs. inter-arrival time in ms).

Effect of Window Size: However, in a given setting, it may not be practical or even possible to increase the number of queues available. Figure 4 repeats the same settings used in Figure 3, except that the window size is varied instead of
the number of queues. Three versions of the EDBP algorithm are examined with w = 2, 4, 8. As the window size increases, the dynamic failure performance of the EDBP algorithm increases because the window size offsets the penalty imposed by the multiplexing of streams onto a given queue. When the effect of window size is compared to the effect of additional queues in the EDBP algorithm, our experiments show that the increase in queues produces a more profound effect than an increase in window size. The underlying cause is due to the multiplexing of streams onto queues. Consider a scenario in which a stream (Sx) with a small period (high rate) and another stream (Sy) with a large period (low rate) are multiplexed onto the same queue. In this case, Sx will have a higher chance of having its packets inside the window than Sy. This results in more dynamic failure for Sy. However, as the number of available queues increases, the chance of these streams being separated into different queues increases as well, thus explaining the difference in performance. Therefore, to obtain the best performance from the EDBP algorithm, the window size must be appropriately tuned to the degree of multiplexing.
4 Conclusions
In this paper, we have addressed the problem of best-effort scheduling of (m,k)-firm real-time streams in multihop networks. The proposed algorithm, EDBP, allows multiplexing of streams onto a fixed number of queues and aims at maximizing the dynamic failure performance with minimal scheduling cost. Our simulation studies have shown that the performance is close to that of the DBP-M algorithm with a significantly lower scheduling cost.
References
1. M. Hamdaoui and P. Ramanathan, "A dynamic priority assignment technique for streams with (m,k)-firm guarantees," IEEE Trans. Computers, vol. 44, no. 12, pp. 1443-1451, Dec. 1995.
2. D. Ferrari and D.C. Verma, "A scheme for real-time channel establishment in wide-area networks," IEEE JSAC, vol. 8, no. 3, pp. 368-379, Apr. 1990.
3. H. Zhang, "Service disciplines for guaranteed performance service in packet-switching networks," Proc. IEEE, vol. 83, no. 10, pp. 1374-1396, Oct. 1995.
4. D. Yates, D.T.J. Krouse, and M.G. Hluchyj, "On per-session end-to-end delay distributions and call admission problem for real-time applications with QoS requirements," in Proc. ACM SIGCOMM, pp. 2-12, 1993.
5. Y.-J. Cho and C.-K. Un, "Performance analysis of reconstruction algorithms for packet voice communications," Computer Networks and ISDN Systems, vol. 26, pp. 1385-1408, 1994.
6. W. Lindsay and P. Ramanathan, "DBP-M: A technique for meeting end-to-end (m,k)-firm guarantee requirements in point-to-point networks," in Proc. IEEE Conference on Local Computer Networks, pp. 294-303, Nov. 1997.
7. S.S. Panwar, D. Towsley, and J.K. Wolf, "Optimal scheduling policies for a class of queues with customer deadlines to the beginning of service," Journal of the ACM, vol. 35, no. 4, pp. 832-844, Oct. 1988.
8. S. Shenker and L. Breslau, "Two issues in reservation establishment," in Proc. ACM SIGCOMM, pp. 14-26, 1995.
9. W. Weiss, "QoS with Differentiated Services," Bell Labs Technical Journal, pp. 44-62, Oct.-Dec. 1998.
Predictability and Resource Management in Distributed Multimedia Presentations
Costas Mourlas
Department of Computer Science, University of Cyprus, 75 Kallipoleos str., CY-1678 Nicosia, Cyprus
[email protected]
Abstract. Continuous media applications have an implied temporal dimension, i.e. they are presented at a particular rate for a particular length of time, and if the required rate of presentation is not met the integrity of these media is destroyed. We present a set of language constructs suitable for the definition of the required QoS and a new real-time environment that provides low-level support to these constructs. The emphasis of the proposed strategy is on deterministic guarantees, and it can be considered a next step towards the design and implementation of predictable continuous media applications over a network.
1 Introduction
The current interest in network and multimedia technology is focused on the development of distributed multimedia applications. This is motivated by the wide range of potential applications such as distributed multimedia information systems, desktop conferencing, and video-on-demand services. Each such application needs Quality of Service (QoS) guarantees, otherwise users may not accept them, as these applications are expected to be judged against the quality of traditional services (e.g. radio, television, telephone services). Traditional network environments, although they perform well in static information spaces, are inadequate for continuous media presentations such as video and audio. In a distributed multimedia information system (see Figure 1) there is a set of Web-based applications, where each application is allocated on a different node of the network and may require access to media servers for continuous media data retrieval. These continuous media servers can be used by any application running in parallel on a different node of the network. Each such presentation has specific timing and QoS requirements for its continuous media playback. This paper presents a new set of language constructs suitable for the definition of the required QoS and the real-time dimension of the media that participate in multimedia presentations, as well as a runtime environment that provides low-level support to these constructs during execution.
2 The Proposed Language Extensions for QoS Definition
When playing a set of multimedia presentations in a traditional network architecture, two main problems are met. Firstly, the best-effort service model provided by
[Fig. 1 shows two numbered SMIL documents, (L) and (R), each beginning with <smil sync="soft"> and containing a <par id="Presentation_1"> block with audio and video objects (a1, v1, v2) annotated with quality attributes such as sample-rate=8 and sample-size=[16,8]; the objects are retrieved from two media servers, ms1 and ms2.]
Fig. 1. A Distributed Multimedia Information System
the existing systems does not address the temporal dimension of the continuous media data during their retrieval and transmission phases. Resource reservation, even if it is required, is not the final answer for the end-users. The end-users actually care about how to exploit all the available (and reserved) resources in the best way, such that the multimedia application will be presented according to the expected quality requirements. For example, a 10% reservation of the total bandwidth for a video presentation means that the video can be played either colored at a rate of 10 frames per second or grey-scaled at a rate of 18 frames per second. The decision has to be taken by the end-users and the multimedia authors, given high-level language primitives and special annotation for the definition of any quality requirement. This new set of high-level language constructs is presented in the following paragraphs and comes as a continuation of our previous work described in [6]. The language that will be extended is SMIL [9], a language for Web-based multimedia presentations which has been developed by the W3C working group on synchronized multimedia. These extensions are introduced along the lines of SMIL, and there is an attempt to reuse terminology wherever feasible. SMIL describes four fundamental aspects of a multimedia presentation: temporal specifications, spatial specifications, alternative behaviour specifications, and hypermedia support. In this section we introduce and define a fifth aspect of a multimedia presentation, called quality specifications. In our extended SMIL language, the two continuous media objects can be described together with their quality requirements within a document via the following syntax:
– ⟨video cmo-attributes v-qos-attributes⟩, and
– ⟨audio cmo-attributes a-qos-attributes⟩.
The extensions are defined by the two new sets of attributes v-qos-attributes and a-qos-attributes, for video and audio respectively. The set cmo-attributes is currently supported by SMIL to define the location and duration of the media object. The new v-qos-attributes and a-qos-attributes lists describe quality requirements using the following attributes:
fps: The value of fps defines the temporal resolution of a video presentation by giving the number of frames per second. The value of this attribute can be any positive integer or a range of positive integers. For example, giving fps=14-18 as an attribute of a video object means that the accepted values for this video presentation can be any rate between 14 and 18 frames per second (Figure 1, lines L-11, L-12, R-10).
spatial-res: The spatial-res definition of a video presentation specifies the spatial resolution in pixels required for displaying the video object. In our model, the concepts of layout and resolution are separated; the resolution is a quality concept. If an ordered list of resolutions is given (e.g. spatial-res=[180X130, 120X70]), then the video object will be presented with the highest possible spatial resolution according to the availability of system resources, and this can be altered at run time (lines L-11, L-12, R-10).
color-res: This attribute specifies the color resolution in bits required for displaying the video object. Typical values are 2, 8, 24, ... If an ordered list of integer values is given (e.g. color-res=[8,2]), then the video object will be presented with the highest possible color resolution (lines L-11, L-12, R-10).
sample-rate: The value of sample-rate for an audio object defines, in kHz, the rate at which the analog signal is sampled. If we need, for example, telephone quality, the analog signal should be sampled 8000 times per second (i.e. sample-rate=8) (lines L-13, R-11).
sample-size: This attribute of an audio object specifies the sample size in bits of each sample. If an ordered list of integer values is given (e.g. sample-size=[16,8]), then each sample will be represented with a number of bits equal to one of the values given. For telephone quality, each sample of the signal is coded with 8 bits, whereas for CD quality it is coded with 16 bits. The highest value that can be used for every sample is decided at run time according to the availability of resources (lines L-13, R-11).
The above language primitives form a complete set for QoS definition of every distinct continuous medium that participates in a multimedia presentation. If several media streams have to be combined, then inter-media synchronization is another important factor of quality specification, but this subject has been extensively studied and is completely supported by the standard SMIL language.
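To illustrate how the ordered-value attributes are meant to be resolved, the sketch below picks the highest value in such a list that the currently available resources admit. The admission predicate and the surrounding names are our assumptions for illustration; they are not part of SMIL or of the proposed extensions.

```python
def pick_quality(ordered_values, fits):
    """Return the first (highest-quality) value accepted by the predicate
    `fits`, or None if no value in the ordered list is feasible."""
    for value in ordered_values:
        if fits(value):
            return value
    return None

# Example: choose an audio sample size for an 8 kHz stream, assuming
# (hypothetically) that 100 kbit/s of bandwidth is currently available.
spare_kbps = 100
sample_rate_khz = 8
chosen_bits = pick_quality([16, 8], lambda bits: sample_rate_khz * bits <= spare_kbps)
# chosen_bits == 8 here, since 8 kHz * 16 bits = 128 kbit/s exceeds the budget.
```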
3 The Proposed Runtime Environment
We view every different multimedia presentation s_i as a periodic task τ_i with period T_i. Every periodic task τ_i is allocated on a different node of the distributed system and requires in each period the retrieval of a number of media blocks from
the remote disk of the server. CS_j^i is the deterministic disk access time that task τ_i requires in every period to retrieve data for all of its streams from the server S_j (communication delays can be included in the evaluation of every CS_j^i). Every data retrieval section on a remote shared server S is guarded by a lock(S) statement. These locks are released after the data retrieval using the unlock(S) statement. The term "critical section" will be used to denote any data retrieval section of a task defined between a lock(S) and the corresponding unlock(S) statement. We follow a rate monotonic strategy for priority assignments: periodic tasks are assigned priorities inversely to their periods (ties are broken arbitrarily). Hence, task τ_i with period T_i receives higher priority than τ_j with period T_j if T_i < T_j. The period T_i and the computational requirements CS_j^i of every task are determined by the desired QoS of the stream that the task represents as well as by the system resources (processor speed, disk access time). The formal procedure of transforming the set of distributed multimedia presentations with quality of service expectations into a set of periodic tasks is described in our previous work [7, 6]. We note here that the scheduling analysis that follows does not consider ranges of QoS values; this is left as future work. A periodic task can have multiple non-overlapping critical sections, e.g. τ = {... lock(S1) ... unlock(S1) ... lock(S2) ... unlock(S2) ...}, but no nested critical sections. Each task is characterized by two components (CS^i, T_i), 1 ≤ i ≤ n, where CS^i is the set {CS_j^i | j ≥ 1} that includes all the critical sections of task τ_i; CS_j^i is the critical section of task τ_i guarded by the statement lock(S_j). We define C_i, the total deterministic computation requirement of all data retrieval sections of task τ_i, as C_i = Σ_{x ∈ CS^i} x. Each server S_j can be either locked by a task τ_i, if τ_i is within its critical section CS_j^i, or free otherwise. Suppose that a task τ_i requires to lock server S_j and enter its critical section CS_j^i, issuing the operation lock(S_j). Then the following cases can occur:
1. The server S_j is free. Then the server S_j is allocated to task τ_i, the task τ_i proceeds to its critical section, and the state of S_j becomes locked. A server S_j locked by task τ_i cannot be accessed by any other task.
2. If case 1 does not hold, i.e. server S_j is currently locked, then after its release it is allocated to the highest priority task that is asking for its use. The task τ_i will proceed to its critical section if and only if server S_j has been allocated to τ_i.
By the definition of the protocol, a task τ_i can be blocked by a lower priority task τ_j only if τ_j is executing within its critical section CS_l^j when τ_i asked for the use of the shared server S_l. Note also that the proposed synchronization protocol prevents deadlocks, due to the fact that for any task τ_i there is no nested critical section. Thus, τ_i will never ask in its critical section for the use of any other server, and so a blocking cycle (deadlock) cannot be formed. We can easily conclude that a set of n periodic tasks, each one bound to a different node ℘_i of a network, can be scheduled using the proposed synchronization
protocol if the following conditions are satisfied:

∀i, 1 ≤ i ≤ n:  C_i + B_i ≤ T_i    (1)

The term B_i represents the total worst-case blocking time that task τ_i has to wait for the allocation of the required media servers in every period T_i. Once the B_i have been computed for all i, conditions (1) can then be used to determine the schedulability of the set of tasks.
3.1 Determination of Task Blocking Time
Here, we shall compute the worst-case blocking time B_l^i that a task τ_i has to wait for the allocation of server S_l, following a response-time-analysis type of formulation [3]. This longest blocking time occurs at the critical instance for τ_i.
Definition 3.1 A critical instance for task τ_i occurs whenever a request from τ_i to lock a server occurs simultaneously with the requests of all higher-priority tasks to lock this server. At that instance, the lower priority task with the longest critical section is also executing its critical section, holding the lock of that server.
Theorem 3.1 Consider a set of n tasks τ_1, ..., τ_n arranged in descending order of priority. Each task is bound to a different node ℘_i of the network, and the proposed synchronization protocol is used for the allocation of the servers. Let
H_l^i = {CS_l^j | 1 ≤ j < i}, the set of critical sections used by tasks with higher priorities than τ_i accessing the same server S_l;
L_l^i = {CS_l^j | i < j ≤ n}, the set of critical sections used by tasks with lower priorities than τ_i accessing the same server S_l;
β_l^i = max(L_l^i), the blocking time due to lower priority tasks.
Then, the worst-case blocking time B_l^i each time task τ_i attempts to allocate server S_l and execute its critical section is equal to:

B_l^i = Σ_{CS_l^j ∈ H_l^i} ⌈(B_l^i + t)/T_j⌉ · CS_l^j + β_l^i,  0 < t < 1,  if Σ_{CS_l^j ∈ H_l^i} CS_l^j / T_j < 1    (2)
Proof: The smallest integer value that satisfies equation (2) above represents the longest blocking time B_l^i for a task τ_i trying to enter its critical section CS_l^i at its worst-case task set phasing, i.e. at its critical instance. If the worst-case task set phasing occurs at time t_0 = 0, then the right-hand side of the equation represents the sum of the computational requirements for server S_l for all requests from higher priority levels in the time interval [0, B_l^i + t), as well as the duration of one (actually the maximum) critical section of the lower priority tasks in L_l^i, namely β_l^i. Task τ_i will enter its critical section at time B_l^i
when the server S_l becomes free, i.e. after its consecutive use by these tasks during the worst-case phasing. At that time, and during the interval [B_l^i, B_l^i + 1), server S_l becomes free for the first time after t_0, and thus task τ_i will have the opportunity to lock S_l. The fact that server S_l is idle at time t ∈ [B_l^i, B_l^i + 1) leads to the result that the sum of the computational requirements for server S_l over the interval [0, t) equals B_l^i. Notice that an arbitrary value lying between zero and one is actually needed to check the load of the server in the interval [B_l^i, B_l^i + 1), and this value is represented by the term t. In all cases, the sum Σ_{CS_l^j ∈ H_l^i} CS_l^j / T_j should be less than one. This sum represents the workload of server S_l, or the utilization factor of the server due to higher priority tasks; it should be less than one, since otherwise all these higher priority tasks could repeatedly block task τ_i, and in this case B_l^i would be unbounded (the condition of formula (2)). Hence the theorem follows. □
Equations of the form (2) above do not lend themselves easily to analytical solution. However, a solution can be found by iteration. The total worst-case blocking duration B_i experienced by task τ_i is the sum of all these blocking durations, i.e. B_i = Σ_{CS_j^i ∈ CS^i} B_j^i. Once these blocking terms B_i, 1 ≤ i ≤ n, have been determined, conditions (1) give a complete solution for real-time task synchronization and scheduling in the distributed environment.
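Equation (2) has the usual fixed-point form of response-time analysis and can be solved by the iteration mentioned above. The sketch below is ours (names are illustrative and times are assumed to be integer clock ticks): it iterates the right-hand side of (2) starting from β_l^i, and then checks condition (1).

```python
import math

def blocking_time(beta, higher_cs, t=0.5, max_iter=100000):
    """Smallest solution B of Eq. (2) for one server S_l and one task tau_i.

    beta      -- beta_l^i, the longest lower-priority critical section on S_l
    higher_cs -- list of (CS_l^j, T_j) pairs for higher-priority tasks using S_l
    t         -- any value in (0, 1), as required by Eq. (2)
    Returns None if the higher-priority utilization of S_l is >= 1
    (the blocking time is then unbounded).
    """
    if sum(cs / period for cs, period in higher_cs) >= 1.0:
        return None
    b = beta
    for _ in range(max_iter):
        b_next = beta + sum(math.ceil((b + t) / period) * cs
                            for cs, period in higher_cs)
        if b_next == b:                 # fixed point reached
            return b
        b = b_next
    raise RuntimeError("iteration did not converge")

def schedulable(tasks):
    """Condition (1): C_i + B_i <= T_i for every task, where B_i is the sum
    of the per-server blocking times computed above."""
    return all(c + b <= period for c, b, period in tasks)
```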
4 Related Work
A significant amount of work has been carried out on making resource allocations satisfy specific application-level requirements. The Rialto operating system [2] was designed to support simultaneous execution of independent real-time and non-real-time applications. The RT-Mach microkernel [4] supports a processor reserve abstraction which permits threads to specify their CPU resource requirements; if admitted by the kernel, it guarantees that the requested CPU demand is available to the requestor. The Lancaster QoS Architecture [1] provides extensions to existing microkernel environments for the support of continuous media. The QoS Broker [8] model also addresses the requirements for resource guarantees, QoS translation, and admission control, and a new system architecture is proposed which provides all these features. The Nemesis operating system is described in [5] as part of the Pegasus Project, whose goal is to support both traditional and multimedia applications. We note at this point that few of the above efforts address the problem of distributed multimedia applications, and very few of the current multimedia architectures provide any synchronization strategy and a theory for the analysis and predictability of a set of multimedia applications executed in a distributed environment. Many CPU allocation schemes have been presented for multimedia applications based on the restrictive assumption that the applications are independent of one another and do not have access to multiple resources simultaneously.
5 Conclusions
In this paper, we studied a set of language extensions and a runtime environment suitable for creating and playing distributed multimedia information systems with QoS requirements. At the language level, a set of language extensions for SMIL was presented, suitable for the definition of the required QoS and the real-time dimension of the media that participate in a multimedia presentation. The runtime part is mainly focused on the maintenance of real-time constraints across continuous media streams. It is based on a task-oriented model that employs a periodic-based service discipline, which provides the required service rate to a continuous media presentation independent of the traffic characteristics of other presentations. One direction of our future work will be the ability of the runtime environment to support the required quality of service when the required quality lies within a range, given by a lower and an upper bound on the expected quality (e.g. fps=18-22). The runtime system will try to provide the best value in the range and will also be authorised to modify this value at run time towards the upper or the lower bound according to the availability of resources. This adaptation of quality of service will make the best use of the resources currently available to distributed applications and will give a fair solution to the presentation of continuous media applications over a network without sacrificing the ability to execute these applications predictably in time.
References
1. G. Coulson, G.S. Blair, P. Robin, and D. Shepherd. Supporting Continuous Media Applications in a Micro-Kernel Environment. In Otto Spaniol, editor, Architectures and Protocols for High-Speed Networks. Kluwer Academic Publishers, 1994.
2. M.B. Jones, D. Rosu, and M. Rosu. CPU Reservations and Time Constraints: Efficient, Predictable Scheduling of Independent Activities. In Proceedings of the 16th ACM Symposium on Operating Systems Principles, October 1997.
3. M. Joseph and P. Pandya. Finding Response Times in a Real-Time System. The Computer Journal, 29(5):390-395, 1986.
4. C. Lee, R. Rajkumar, and C. Mercer. Experiences with Processor Reservation and Dynamic QOS in Real-Time Mach. In Proceedings of Multimedia Japan 96.
5. I. Leslie, D. McAuley, R. Black, T. Roscoe, P. Barham, D. Evers, R. Fairbairns, and E. Hyden. The Design and Implementation of an Operating System to Support Distributed Multimedia Applications. IEEE Journal on Selected Areas in Communications, 14(7):1280-1297, September 1996.
6. C. Mourlas. A Framework for Creating and Playing Distributed Multimedia Information Systems with QoS Requirements. In Proceedings of the 2000 ACM Symposium on Applied Computing, SAC 2000 (accepted for publication).
7. C. Mourlas, David Duce, and Michael Wilson. On Satisfying Timing and Resource Constraints in Distributed Multimedia Systems. In Proceedings of the IEEE ICMCS'99 Conference, volume 2, pages 16-20. IEEE Computer Society, 1999.
8. Klara Nahrstedt and Jonathan M. Smith. The QoS Broker. IEEE Multimedia, 2(1):53-67, Spring 1995.
9. W3C. SMIL Draft Specification. See: http://www.w3.org/TR/WD-smil.
Quality of Service Negotiation for Distributed, Dynamic Real-time Systems
Charles D. Cavanaugh¹, Lonnie R. Welch², Behrooz A. Shirazi¹, Eui-nam Huh², and Shafqat Anwar¹
¹ Computer Science and Engineering Dept., The University of Texas at Arlington, Box 19015, Arlington, TX 76019-0015, {cavan|shirazi|anwar}@cse.uta.edu
² School of Electrical Engineering and Computer Science, Ohio University, Athens, OH 45701-2979, {welch|ehuh}@ace.cs.ohiou.edu
Abstract. Dynamic, distributed, real-time systems control an environment that varies widely without any time-invariant statistical or deterministic characteristic, are spread across multiple loosely-coupled computers, and must control the environment in a timely manner. In order to ensure that such a system meets its timeliness guarantees, there must be a means to monitor and maintain the quality of service in the system. The QoS manager is a monitoring and diagnosis system for real-time paths, collections of time-constrained and precedence-constrained applications. These applications may be distributed across multiple, heterogeneous computers and networks. This paper addresses the QoS negotiation features of the QoS manager and its interaction with the middleware resource manager. The major contributions of the paper are the negotiation algorithms and protocol that minimize the impact on the other paths’ QoS while maximizing the unhealthy path’s QoS. The approach and algorithms for QoS negotiation are presented.
1 Introduction
Dynamic, distributed, real-time systems possess three characteristics. First, the environment that they control is not deterministic and cannot be characterized by time-invariant statistical distributions. Second, the system is spread across multiple loosely coupled computers. Third, the system must control the environment in a timely manner. Existing solutions for monitoring real-time systems [1] and for real-time scheduling are usually based on the assumption that the processes have worst-case execution times. In dynamic environments, such as air traffic control [2], robotics, and automotive safety, this assumption does not hold [3]. The dynamic real-time path [4][5] (Fig. 1) is a collection of time-constrained and precedence-constrained applications. These applications may be distributed across multiple, heterogeneous computers and networks. The QoS manager's tasks are to monitor path health, diagnose the causes of poor health, and request computation and communication resources to maintain and restore health.
[Fig. 1 diagram: three real-time paths (Initiate, Assess, Guide), each composed of compute subpaths (filter/sense, evaluate & decide, act) and communication subpaths, connecting sensors to actuators with an operator in the loop.]
Fig. 1. Path composition
The problem of mapping applications to resources is to assign resources to consumers such that the delivered QoS meets or exceeds the QoS requirement (if possible). If this is not possible, some of the resources that are in use by a low criticality real-time application may need to be diverted to a high criticality real-time application. The QoS manager and resource manager must negotiate a solution that is mutually acceptable. QoS negotiation is the process of the QoS manager and the resource manager trading off resources for some applications while improving the QoS of the applications having higher criticality. The rest of this paper is organized as follows: the QoS negotiation architecture and approach are explained in Section 2, the negotiation algorithms and protocol are presented in Section 3, a sample experiment using manual techniques to illustrate QoS negotiation is shown in Section 4, related work is summarized in Section 5, and a summary and statement of future work is in Section 6.
2 QoS Negotiation Architecture and Approach
The QoS negotiation architecture is presented in Fig. 2. The QoS monitor's job is to combine the monitored data into QoS metrics for the path and applications and to translate and pass along relevant application load and resource usage information. The analyzer's function is to detect QoS violations and calculate trends for QoS metrics, load, and resource usage. The diagnosis component determines the causes of the QoS violations by recognizing conditions that indicate a particular malfunction. The negotiator has two functions. First, it selects actions that will remedy the malfunctions and requests resources for applications if necessary. Second, it negotiates the highest possible QoS with the resource manager when the resource manager indicates that resource availability does not allow a certain action or resource request to be carried out. Negotiation involves trading off some actions for alternative actions that provide the highest possible QoS assurance under the resource availability constraints. The resource manager obtains current utilization levels for communication and computation resources from host monitors. Moreover, resource unification is required to map heterogeneous resource requirements onto the available target hosts. The RM then finds resources that meet the (unified) resource requirements. If the hosts are feasible, then it predicts queuing delays to analyze schedulability.
QoS prediction will result from candidate reallocation actions. Finally, resource allocation selects and performs a reallocation action (through program control and startup daemons) based on predicted QoS. A new selection technique is used to guarantee the customer’s QoS.
[Fig. 2 diagram: the QoS monitor passes QoS and resource usage data to the analyzer; the analyzer passes violations and QoS/load trends to diagnosis; diagnosis passes the causes of violations to the negotiator; the negotiator exchanges resource requests and responses with the resource manager, which carries out reallocation actions on the applications through program control, startup daemons, and operating system process control.]
Fig. 2. QoS Negotiation Architecture
The three phases of QoS negotiation correspond to the three phases of diagnosing poor path health: path-local, resource-local, and global diagnosis. During phase I, path-local diagnosis, the QoS manager requests allocation actions involving the unhealthy subpaths that it identifies. For example, one application within a path may be unhealthy, and the QoS manager would request that it be scaled up. During phase II, resource-local diagnosis, the QoS manager requests actions involving any software that is sharing resources with the unhealthy path. For example, the QoS manager may request that some competing application program be moved off a host. The QoS manager does not need to know the specific application program that is involved, as it is the resource manager's responsibility to maintain the system resources. During phase III, global diagnosis, the QoS manager requests actions that involve any resource. For example, the QoS manager may request that a less critical application be moved in order to free up space on a host that is not currently in use by the unhealthy subpath. The resource manager is responsible for finding the best host for the application or path while balancing the load among other paths and applications. The three phases of negotiation are illustrated in the following scenario:
QM: Application x on host A is unhealthy and using 20% of CPU. Phase I: can you migrate it to another host? (QM adds action to list of attempted actions.)
RM: No. No combination of host idle times adds up to 20%. Provide QoS information, ranked application actions, and resource usage.
QM: (Marks previous action as unsuccessful.) Phase II: can you move competing application y (also on host A), which uses 15% of CPU, to another host (to free up 15% of CPU on host A)? (QM adds action to list of attempted actions.)
RM: No. No combination of host idle times adds up to 15%. Phase III: I can free up resources on host A by moving a less-critical application to a host with the lowest utility. (RM carries out action.)
QM: (Marks previous action as successful.)
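The escalation implied by this scenario can be summarised in a few lines. The sketch below is only an illustration of the control flow; the action lists and the resource manager's answer (rm_try) are stand-ins of ours, not the actual interfaces of the QoS manager or resource manager.

```python
PHASES = ["path-local", "resource-local", "global"]

def negotiate(phase_actions, rm_try):
    """Offer ranked actions phase by phase until the RM accepts one.

    phase_actions -- dict mapping a phase name to its ranked action list
    rm_try        -- callable standing in for the RM's accept/reject answer
    Returns (phase, action) for the first accepted action, or None.
    """
    for phase in PHASES:
        for action in phase_actions.get(phase, []):
            if rm_try(action):       # RM could carry out the action
                return phase, action
    return None                      # nothing acceptable in any phase
```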
3 QoS Negotiation Algorithm and Protocol
The QoS manager and resource manager maintain high QoS and manage resources, respectively. Whenever there is a conflict between obtaining enough resources to ensure high QoS and providing enough resources to the rest of the software, the QoS manager and resource manager negotiate a solution. To do this, both need algorithms for working toward their goals as well as a protocol for communicating with each other. The flowcharts illustrate the algorithms, and the communication steps show the protocol. The following are the steps that the QoS manager takes once it detects a QoS violation; a flowchart of the process is shown in Fig. 3. First, the QoS manager identifies unhealthy computation and communication subpaths. Depending on the phase of negotiation (path-local, resource-local, or global) and the constraints on allowable actions, the QoS manager then selects actions to remedy the unhealthy subpaths. Each subpath has a resource requirement that is proportional to the slowdown that the subpath is experiencing. The slowdown is the ratio of the current subpath latency to the subpath's minimum latency for the same data stream size while on the same resource. For example, if the current subpath latency is 0.4 seconds at a data stream size of 1,000 on a particular resource, and the lowest latency that it has experienced in that same situation is 0.3 seconds, then the subpath's slowdown is 0.4 / 0.3, or approximately 1.333. This implies that it requires (133% - 100%) or 33% more resources to run at its best. The slowdown is due to contention, so moving the subpath to another resource is a likely solution. The QoS manager ranks the actions based on their resource requirements in descending order and groups actions that involve moving subpaths off a particular host and actions that involve replicating a particular subpath (if it is replicable). The groups are automatically ranked, since the groups are made from the sorted list. Once action selection is complete, the QoS manager requests resources by sending to the resource manager the ranked action requests (one from each group) along with the criticalities, current latencies, and resource usage information. If the resource manager responds that it can carry out the action, then the QoS manager monitors the stability of the system once the actions are carried out, to ensure that QoS is indeed improved. However, if the RM cannot do the action, then the RM sends a negotiation request to the QM. The QoS manager responds by sending out the next ranked action in each group, or it goes to the next phase of negotiation. The RM responds with the level of degradation in QoS that is to be expected by the QM. The QoS manager calculates the slowdown that would be associated with the degradation and derives a
benefit value for the path from it. If the benefit is at least as favorable as the QM requires, then the QoS manager responds to the RM with an acknowledgement; otherwise, the QoS manager proceeds to the next phase of negotiation.
Fig. 3. QoS manager QoS negotiation algorithm
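The slowdown and ranking computation used in the first steps of Fig. 3 can be sketched directly from the description above. The data layout and names below are ours and purely illustrative.

```python
def slowdown(current_latency, best_latency):
    """Ratio of current subpath latency to the best latency observed for the
    same data-stream size on the same resource (e.g. 0.4 / 0.3 ~ 1.333)."""
    return current_latency / best_latency

def extra_resource_need(current_latency, best_latency):
    """Fraction of additional resources implied by the slowdown
    (a slowdown of 1.333 implies roughly 33% more resources)."""
    return slowdown(current_latency, best_latency) - 1.0

def rank_actions(unhealthy_subpaths):
    """Rank candidate actions by resource requirement, largest first.

    `unhealthy_subpaths` is a list of dicts with keys 'subpath',
    'current_latency', and 'best_latency' (an assumed layout)."""
    actions = [{"subpath": s["subpath"],
                "need": extra_resource_need(s["current_latency"], s["best_latency"])}
               for s in unhealthy_subpaths]
    return sorted(actions, key=lambda a: a["need"], reverse=True)
```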
The steps that the resource manager takes to negotiate with the QoS manager and to allocate resources are listed below. A flowchart of the process is shown in Fig. 4.
1. Find a feasible host corresponding to the resource needs.
2. If a host is feasible, then do step 4.
3. Else do step 8.
4. Predict the queuing delay and execution time on the feasible hosts.
5. If the task with the predicted response time is schedulable, then do step 7.
6. Else do step 8.
7. Predict QoS, allocate the task to the best host, and exit.
8. Send "QoS negotiation requests" to all QoS managers.
9. Receive path QoS information, the ranked list of actions, and the applications' resource usage from each QM.
10. Calculate the current utility value of each path.
11. Select negotiable paths based on the minimum utility value.
12. Calculate host utility values and find the host, Hj, with the minimum utility.
13. Select an application, ai, in the ranked list of recommended actions.
14. Test the feasibility of allocating the application, ai, on the host, Hj.
15. If not feasible, then pick the next path and do step 11.
16. If feasible, then recalculate the utility value of the path.
17. If the utility value of each path is less than the threshold utility value of each path, then do step 4.
18. Else allocate the violated application to the host that has the minimum utility value.
Fig. 4. Resource Manager QoS Negotiation Algorithm
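Steps 12 to 16 of the resource manager's algorithm can be pictured as follows. The utility and feasibility functions are deliberately left abstract, since this excerpt does not define them; everything in this sketch is an assumption for illustration.

```python
def pick_target_host(hosts, app, utility, feasible):
    """Examine hosts in order of increasing utility value and return the
    first one on which `app` can feasibly be allocated, or None."""
    for host in sorted(hosts, key=utility):
        if feasible(app, host):
            return host
    return None
```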
4 Experimental Results
Sample experimental results were obtained by specifying two DynBench [6] periodic paths in the spec language: a higher criticality sensing path, D:H:Higher_Sensing, and a lower criticality sensing path, D:L:Lower_Sensing. These paths were started simultaneously, and the experiment generator was used to bring the data stream size (the load) to 1600 tracks for each path. Then, the filter and ED applications of each
Quality of Service Negotiation for Distributed, Dynamic Real-Time Systems
763
path were manually replicated to simulate the QoS manager’s requesting that they be scaled up. The latency was brought down at that point. However, 500 more tracks were added to each path in order to overload the paths again. When no action was taken, the system became unstable, despite the fact that the loaded applications in both paths were already replicated. All four available hosts were in use. This instability is evident on the left-hand side of Fig. 5. Negotiation was simulated by manually moving the higher criticality path’s filter and ED replicas to a more powerful host, named texas. In addition, resources were taken away from the lower criticality path by terminating the additional replicas of the lower criticality path’s filter and ED applications, resulting in a normal QoS for the higher-criticality path (C) and a degraded QoS for the lower-criticality path (D), as shown on the right-hand side of Fig. 5. This combination of manual actions simulates the behavior of QoS negotiation and thus serves as a prototypical experiment. The scenario is a case by which an implementation of the negotiation algorithm should be tested.
Fig. 5. Instability caused by overload (2100 tracks per path), without negotiation (left). The higher (A) and lower (B) criticality paths are fluctuating. Stability restored after negotiation (right). The higher criticality path experiences normal QoS (C); the lower criticality path experiences degraded QoS (D)
5 Previous Work in QoS Negotiation
To summarize, the related work in negotiation is narrowly defined. The DeSiDeRaTa project promotes a broader view of negotiation: maximizing the quality of service provided to the most critical applications while minimizing the impact on other applications. The QuO project [7][8] terms the adaptation of object methods to the load as negotiation. Adaptation is only one aspect of QoS/resource
management in DeSiDeRaTa, with dynamic optimization of system resource utilization and application QoS being other capabilities of DeSiDeRaTa’s QoS negotiation. The University of Colorado DQM’s [9][10] negotiation concept is a means of raising and lowering the operating level (the algorithm’s complexity) based on current CPU usage conditions. This use of the term “negotiation” is similar to QuO’s use of the term. EPIQ’s [11] description of negotiation falls under this description as well, with the switching of regions of feasible quality being done in response to current conditions. The RTPOOL project [12] describes negotiation as the client’s specifying a static deadline for a task with a reward for scheduling the task. The server does a preliminary static schedulability analysis of worst-case timing characteristics, and its algorithm shuffles the tasks to maximize the reward. DeSiDeRaTa is a dynamic system that maintains the required quality of service under dynamic workloads, where worst-case execution times and time-invariant statistical timing characteristics are unknown. Furthermore, it uses the path abstraction.
6 Conclusions and Future Work
Algorithms have been developed that will allow middleware to negotiate for the highest possible quality of service for distributed, dynamic real-time systems. The path abstraction allows QoS management to be decentralized and provides the basis for negotiating for resources for applications of differing criticality and purpose. The supply and demand approach to QoS management is based on the concept that resources (supply space) are limited in quantity and capacity and that the paths' applications are the consumers (demand space) of these resources. If the applications cannot have their desired amounts of resources, then the middleware needs to distribute resources in order to deliver the best QoS possible. The major contributions of the paper are the negotiation algorithms and protocol that minimize the impact on the other paths' QoS while maximizing the unhealthy path's QoS. Future work includes implementation of the negotiation algorithms and integration into the current QoS and resource managers.
References
1. Tsai, J. J. P., and S. J. H. Yang. Monitoring and Debugging of Distributed Real-Time Systems. Los Alamitos, CA: IEEE Computer Society Press, 1995.
2. Cavanaugh, C. D., L. R. Welch, and C. Bruggeman. A Path-Based Design for the Air Traffic Control Problem. Arlington, TX: The University of Texas at Arlington Department of Computer Science and Engineering, 1999. Technical Report TR-CSE-99-001.
3. Harrison, R. D. "Combat System Prerequisites on Supercomputer Performance Analysis." Proceedings of the NATO Advanced Study Institute on Real Time Computing, 1994.
4. Welch, L. R., B. Ravindran, B. Shirazi, and C. Bruggeman. "Specification and Analysis of Dynamic, Distributed Real-Time Systems." Proceedings of the 19th IEEE Real-Time Systems Symposium, Madrid, Spain, December 2-4, 1998.
5. Welch, L. R., P. V. Werme, B. Ravindran, L. A. Fontenot, M. W. Masters, D. W. Mills, and B. A. Shirazi. "Adaptive QoS and Resource Management Using A Posteriori Workload Characterizations." Proceedings of the 5th IEEE Real-Time Technology and Applications Symposium (RTAS '99), May 1999.
6. Welch, L. R., and B. A. Shirazi. "A Dynamic Real-time Benchmark for Assessment of QoS and Resource Management Technology." Proceedings of the 5th IEEE Real-Time Technology and Applications Symposium (RTAS '99), May 1999.
7. Loyall, J. P., R. E. Schantz, J. A. Zinky, and D. E. Bakken. "Specifying and Measuring Quality of Service in Distributed Object Systems." Proceedings of the 1st International Symposium on Object-Oriented Real-Time Distributed Computing (ISORC '98), Kyoto, Japan, April 1998.
8. Zinky, J. A., D. E. Bakken, and R. E. Schantz. "Architectural Support for Quality of Service for CORBA Objects." Theory and Practice of Object Systems, 3(1), 1997.
9. Brandt, S., G. Nutt, T. Berk, and J. Mankovich. "A Dynamic Quality of Service Middleware Agent for Mediating Application Resource Usage." Proceedings of the 19th IEEE Real-Time Systems Symposium (RTSS '98), December 1998.
10. Brandt, S., G. Nutt, T. Berk, and M. Humphrey. "Soft Real-Time Application Execution with Dynamic Quality of Service Assurance." Proceedings of the 6th IEEE/IFIP International Workshop on Quality of Service (IWQoS '98), pp. 154-163, May 1998.
11. Liu, J. W. S., K. Nahrstedt, D. Hull, S. Chen, and B. Li. "EPIQ QoS Characterization Draft Version." http://epiq.cs.uiuc.edu/qo-970722.pdf
12. Abdelzaher, T. F., E. M. Atkins, and K. Shin. "QoS Negotiation in Real-Time Systems and its Application to Automated Flight Control." Accepted to IEEE Transactions on Software Engineering, 1999. (Earlier version appeared in IEEE Real-Time Technology and Applications Symposium, Montreal, Canada, June 9-11, 1997.)
An Open Framework for Real-Time Scheduling Simulation
Thorsten Kramp, Matthias Adrian, and Rainer Koster
Distributed Systems Group, Dept. of Computer Science, University of Kaiserslautern, P.O. Box 3049, 67653 Kaiserslautern, Germany
[email protected]
Abstract. Real-time systems seek to guarantee predictable run-time behaviour to ensure that tasks will meet their deadlines. Optimal scheduling decisions, however, easily impose unacceptable run-time costs for all but the most basic scheduling problems, specifically in the context of multiprocessors and distributed systems. Deriving suitable heuristics then usually requires extensive simulations to gain confidence in the chosen approach. In this paper we therefore present Fortissimo, an open framework that facilitates the development of tailor-made real-time scheduling simulators for multiprocessor systems.
1 Introduction
Real-time systems are defined as those systems in which the correctness of the system depends not only on the logical result of computation, but also on the time at which the results are produced. Predictability is therefore of paramount concern, with the scheduling algorithm being responsible for deciding which activity is allowed to execute at some instant of time so that the maximum number of tasks meet their deadlines. Unfortunately, optimal scheduling decisions easily become prohibitively expensive at run time or even computationally intractable, specifically for multiprocessors and distributed systems [15]. In these cases, heuristics may serve as viable alternatives, providing `good enough' behaviour at acceptable run-time overhead. While certain properties of sophisticated heuristics can be derived analytically, it is often desirable to verify these results or even to find new approaches empirically. Thus, a customisable and extensible testbed is needed for observing the behaviour of a scheduling algorithm under well-controlled conditions. Such a scheduling simulator must provide enough infrastructure to let the real-time researcher concentrate on the details of the scheduling algorithm and yet must be open to new requirements. That is, in addition to a powerful dispatching core, flexible load generators and statistics gathering facilities are needed. By now, however, real-time scheduling simulators have commonly been built with a particular scheduling problem or execution environment in mind [11, 16]. In this paper we therefore present Fortissimo, an open object-oriented framework not exclusively aimed at simulating a particular class of scheduling algorithms but intended to serve as a starting point for the development of tailor-made real-time scheduling simulators for multiprocessor architectures [8]. Consequently,
Fortissimo is not a ready-to-run application, yet offers a frame of ideas to work in. Short of the concrete scheduling policy, the framework consists of a number of ready-to-use components for workload creation, integration with dispatchers, and collecting run-time statistics. These components are realised as well-documented C++ classes and serve as the base from which the adaptation of Fortissimo to specific simulation requirements evolves. Thus, Fortissimo tries to support the real-time architect by coping with various scheduling paradigms rather than forcing him or her into a single notion. Among the scheduling paradigms explicitly considered for hard real-time systems are static table-driven approaches such as cyclic executives [12], static priority-driven and dynamic best-effort policies such as rate monotonic scheduling or earliest deadline first [9], and dynamic planning-based strategies such as the Spring scheduling algorithm [14]. Task semantics, however, is not limited to hard real-time environments. Support for aperiodic and sporadic real-time activities [5], reasoning with value functions [6, 17], as well as requirements derived from techniques such as skip-over scheduling [7], imprecise computation [10], and task-pair scheduling [4] have been included. The remainder of this paper is organised as follows. Section 2 discusses related work that has partially influenced some of our design decisions. Then, in Section 3, the architecture of Fortissimo as well as the communication between the components are described. Section 4 finally summarises our experience and briefly outlines future work on Fortissimo.
2 Related Work
Naturally, concepts of other real-time scheduling simulation projects found their way into Fortissimo. Among the projects that have influenced our design, Spring and STRESS come closest. Spring [14] is a research real-time operating system supporting multiprocessors and distributed systems. A project spin-off [3], the Spring simulation testbed, has influenced the design of the workload generation and scheduling components of Fortissimo. However, the primary focus of the Spring simulator seemingly was to evaluate the planning-based dynamic-priority assignment policy used in Spring. As a consequence, the simulator provides strong support for this kind of scheduling in a distributed environment, yet falls short when it comes to basically different scheduling strategies. STRESS [1], in contrast, is a simulation environment for hard real-time systems consisting of a simulation core that is supplemented by a graphical front-end for control and display. The approach chosen comprises a full-featured simulation language to specify both the system environment and task semantics. The simulation engine is quite elaborate, including some feasibility tests and support for multiprocessing as well as networking; tasks may synchronise via critical sections or message-passing. Since STRESS is targeted at hard real-time systems, there is no built-in support for soft-deadline or value-function scheduling, and it is unclear whether the simulation language is rich enough to cope with imprecise
[Fig. 1. Architecture of the framework: periodic, aperiodic, sporadic, and user-defined task generators feed a scheduler; the scheduler feeds one dispatcher per CPU (CPU 1 to CPU n); an event manager forwards events to log, statistics, and METAPOST secretaries.]
computing or task-pair scheduling, for instance. Task creation in Fortissimo, however, works in a similar way to STRESS.
3 Theory of Operation
As mentioned before, Fortissimo not only provides the basic infrastructure to build a real-time scheduling simulator suited to particular needs, but supports a number of scheduling paradigms right out of the box. Consequently, in order to add a new scheduler or task model, in most cases the real-time architect should need to refine or add only some specific classes rather than to redo everything from scratch. In Fortissimo, each class logically belongs to one of three independent modules, namely workload generation, scheduling and dispatching, and gathering statistics, with tasks and events serving as glue between these modules. The overall architecture is sketched in Fig. 1. The workload component creates tasks according to user-defined patterns. Firstly, as part of the initialisation, the scheduler is allowed to check the feasibility of the specified task set as it will be generated from so-called task generators. Then, during simulation, the task generators create jobs for these tasks (e.g., instances of a periodic task) and place them into a global FIFO arrival queue. After removing a job from this queue, the scheduler can reject the job based on some feasibility test, accept and integrate it into its schedule, or react in a completely different way implemented by the user. An example would be putting the job aside and executing it only if additional execution time becomes available due to jobs that temporarily require less execution time than planned for. As soon as a new job has been successfully scheduled, it is assigned to a system dispatcher, each dispatcher being exclusively responsible for one CPU. Again, scheduler and dispatchers communicate via queues, with one ready queue per
dispatcher, in which tasks are placed by the scheduler. The scheduler, however, retains full access to the ready queues to simply add or remove some task, or to perform complete reschedules if necessary. It is therefore the responsibility of the scheduler to sort the jobs in the ready queues to reflect its policy; the dispatchers simply execute the job that is currently at the front of their queue, automatically performing a context switch if a different task moves to the front at any time. Since CPUs are simply abstractions and time passes by as ticks of a logical clock, the execution of a task merely consists of decrementing an execution counter and updating internal busy/idle statistics. When some job has completed execution, it is handed to the statistics facilities. Because some information of interest is often spread out over the complete lifetime of jobs and tasks, the statistics module also processes events from other components of the framework. Based on this overview, the following sections give a closer look at each component. A longer version of this paper describes in more detail how schedulers can be implemented in Fortissimo and how the framework can be configured [8].
3.1 Task Model and Workload Generation
Workload generation in Fortissimo is split among independent task generators, each one responsible for the generation of a single class of tasks. Readily available are generator classes for periodic tasks whose jobs re-arrive after some fixed amount of time, sporadic tasks whose frequency is limited by some minimum inter-arrival time, and aperiodic tasks whose arrival pattern is modelled by some stochastic assumptions. In addition, a user can create completely new task generators or customise the available ones via inheritance to produce workload patterns currently not explicitly supported. Timing parameters of a task include its average-case computation time, its worst-case computation time, and its deadline; the first invocation of a task may be delayed by some initial offset to construct arbitrary task phasings, in order to prevent or enforce critical instants, for example. Furthermore, a directed precedence graph without cycles may be used to explicitly define predecessor/successor relationships. The basic classes of hard, firm, and soft constraints are employed, categorising a deadline miss as resulting in a catastrophe, in the computation being useless, or in a degraded quality of service, respectively. Whenever this scheme is insufficient, two value functions per task may be used to describe the value of finishing the task up to and after its deadline. Each task may be assigned a base priority during setup, while at run time an additional temporary priority per task can be used to support dual-priority scheduling [2] and priority-inheritance protocols [13], for instance. Besides these fundamental paradigms, skip-over scheduling, the notion of imprecise computations, and task-pair scheduling are also readily supported. While in Fortissimo skip-over scheduling is limited to periodic tasks, support for imprecise computations and task-pair scheduling is available for both periodic and sporadic tasks. To assess the behaviour of scheduling algorithms, many simulation runs with varying load patterns are needed. Hence, virtually all task characteristics may
be chosen randomly by Fortissimo according to given stochastic distributions. Parameters such as arrival patterns and actual computation time may vary for each job. Additionally, for a sequence of simulation runs, changing task-set characteristics may be specified.
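The division of labour between task generators and the rest of the framework can be pictured with a small sketch. The classes below mimic, in Python rather than the framework's C++, how a periodic and an aperiodic generator might place jobs into a global FIFO arrival queue on every clock tick; the names and job fields are ours, not Fortissimo's actual interfaces.

```python
import random

class PeriodicTaskGen:
    """Releases a job every `period` ticks, after an optional initial offset."""
    def __init__(self, name, period, wcet, offset=0):
        self.name, self.period, self.wcet = name, period, wcet
        self.next_release = offset

    def tick(self, now, arrival_queue):
        if now >= self.next_release:
            arrival_queue.append({"task": self.name, "release": now,
                                  "remaining": self.wcet,
                                  "deadline": now + self.period})
            self.next_release += self.period

class AperiodicTaskGen:
    """Releases jobs with exponentially distributed inter-arrival times."""
    def __init__(self, name, mean_interarrival, wcet, rel_deadline):
        self.name, self.wcet, self.rel_deadline = name, wcet, rel_deadline
        self.mean = mean_interarrival
        self.next_release = random.expovariate(1.0 / self.mean)

    def tick(self, now, arrival_queue):
        if now >= self.next_release:
            arrival_queue.append({"task": self.name, "release": now,
                                  "remaining": self.wcet,
                                  "deadline": now + self.rel_deadline})
            self.next_release = now + random.expovariate(1.0 / self.mean)
```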
3.2 Scheduling and Dispatching
Scheduling algorithms are not built into Fortissimo but have to be implemented and linked by the user. Schedulers, however, can be derived from a base class Schedule providing some default behaviour that can be customised selectively. We believe that this approach, besides promising some additional
flexibility, allows analysing the computation time of the scheduler itself, already at the simulation stage. A typical scheduler might work as follows within Fortissimo. The scheduler is invoked every tick of the logical clock and, provided it implements a preemptive algorithm, may perform a reschedule in response. If no new jobs have become ready since the last tick, the scheduler then falls asleep again until its next invocation. If new jobs have arrived, it removes these jobs one by one from the global arrival queue. For algorithms providing guarantees, a run-time admission test is then performed. If the new job cannot be executed without jeopardizing the deadlines of either the new job itself or already guaranteed tasks, it is rejected and usually removed from the system. Otherwise, a new schedule must be constructed comprising the jobs already scheduled as well as the new job. For this, the scheduler typically has to retrieve the jobs already accepted and scheduled from the dispatchers' ready queues. Then, the jobs are sorted and re-inserted into the individual ready queues, possibly causing context switches. Like the scheduler, dispatchers are invoked every tick of the logical clock. At any time, the dispatcher will run the job that is currently at the front of its ready queue, which subsequently becomes the active job until it terminates normally, the scheduler aborts the job for some reason, or the dispatcher's ready queue has changed. Finally, jobs are run on virtual processors. Execution is simulated simply by decrementing the remaining execution time of the running job. In future versions, a more powerful processing model may, for instance, take interrupts and context-switching overhead into account.
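A drastically simplified version of this tick loop is sketched below, reusing the job dictionaries from the previous sketch and earliest deadline first as a stand-in policy; admission testing and the aborting of late jobs are omitted. Fortissimo itself exposes this logic through its C++ Schedule and dispatcher classes, so the Python fragment is only meant to show how the arrival queue, ready queues, and per-CPU dispatchers interact.

```python
from collections import deque

def simulate(task_gens, num_cpus, horizon):
    """Run a tick-driven simulation for `horizon` logical clock ticks."""
    arrival = deque()                           # global FIFO arrival queue
    ready = [[] for _ in range(num_cpus)]       # one ready queue per dispatcher
    finished, missed = [], []

    for now in range(horizon):
        for gen in task_gens:                   # workload generation
            gen.tick(now, arrival)

        while arrival:                          # scheduler: place new jobs
            job = arrival.popleft()
            target = min(range(num_cpus), key=lambda c: len(ready[c]))
            ready[target].append(job)
            ready[target].sort(key=lambda j: j["deadline"])   # EDF order

        for queue in ready:                     # dispatchers: run front job one tick
            if not queue:
                continue
            queue[0]["remaining"] -= 1
            if queue[0]["remaining"] == 0:
                job = queue.pop(0)
                (finished if now + 1 <= job["deadline"] else missed).append(job)

    return finished, missed
```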
3.3 Logging and Statistics
Whenever an important action is executed within the framework, this is signalled by an event. Each event carries the relevant information about the time and cause that lead to its creation, supplemented by additional data as needed. An event manager uniformly collects and distributes these events to so-called secretaries, which are registered with the event manager for certain types of events. Various types of action can be taken by a secretary upon arrival of a new event. Simple log secretaries just write a formatted line onto some output device, other secretaries
[Fig. 2. Skip-over scheduling simulation run: a timeline from 0 to 100 ticks for tasks t1 and t2, with markers for arrival, ready, running, deadline, and abort events.]
may update some kind of statistical analysis data, and even more sophisticated ones may act as a gateway transforming the event into messages for a graphical user display. At the time of writing, secretaries for logging events, for collecting statistical data, and for visualizing a simulation run as a MetaPost figure are implemented. Fig. 2 shows an example run of a skip-over scheduler that tolerates missed deadlines to a certain degree, provided `most' of a task's deadlines are met [7]; a skip parameter s per task denotes the tolerance of that task to missing deadlines, such that at least s - 1 task instances must meet their deadlines after a deadline has been missed. The skip parameters of tasks t1 and t2 are set to 3 and 2, respectively; that is, after one aborted job of t1, two jobs of t1 must be executed in time, and no two successive jobs of t2 may be aborted.
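The skip rule just described can be captured in a few lines. The class below is our illustration of the bookkeeping, not code from Fortissimo; it tracks, per task, how many consecutive deadlines have been met since the last skip.

```python
class SkipState:
    """Track whether the next job of a task may be skipped under skip parameter s:
    after an aborted job, the next s-1 jobs must meet their deadlines before
    another skip is allowed."""
    def __init__(self, s):
        self.s = s
        self.met_since_skip = s - 1    # initially, one skip is allowed

    def may_skip(self):
        return self.met_since_skip >= self.s - 1

    def record(self, skipped):
        self.met_since_skip = 0 if skipped else self.met_since_skip + 1

# For t1 (s = 3): after one skipped job, two jobs must complete in time
# before the next skip becomes permissible.
t1 = SkipState(3)
assert t1.may_skip()
t1.record(skipped=True)
assert not t1.may_skip()
t1.record(skipped=False)
t1.record(skipped=False)
assert t1.may_skip()
```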
4
Conclusions
In this paper we have presented Fortissimo, an open object-oriented framework to simulate the scheduling of real-time tasks. The versatility of Fortissimo has been verified by implementing a wide range of fundamentally different scheduling policies such as rate-monotonic scheduling, earliest deadline first, the sporadic server algorithm, an imprecise computation policy, skip-over scheduling, and task-pair scheduling. Although the task model already provides a sound basis, we intend to add support for critical sections, resource reservation, task semantics including inter-task communication, and more elaborate precedence relations to the scheduling core. Furthermore, in addition to the multiprocessor support already implemented, an infrastructure to simulate real-time scheduling in distributed systems is under development. A graphical user interface, finally, will increase the ease of use of Fortissimo and illustrate the behaviour of scheduling policies at run time; for the latter, the event mechanism already provides the necessary internal hooks. Despite these loose ends, however, we believe that even the scheduling core as described in this paper might already help real-time architects develop tailor-made simulators based on Fortissimo to evaluate their algorithms and heuristics.
772
T. Kramp, M. Adrian, and R. Koster
References [1] N. C. Audsley, A. Burns, M. F. Richardson, and A. J. Wellings. STRESS: A simulator for hard real-time systems. Software Practice and Experience, July 1994. [2] R. Davis and A. Wellings. Dual-priority scheduling. In Proceedings of the Sixteenth Real-Time Systems Symposium, pages 100-109, 1995. [3] E. Gene. Real-time systems: Spring simulators documentation, 1990. http://wwwccs.cs.umass.edu/spring/internal/spring sim docs.html. [4] M. Gergeleit and H. Streich. Task-pair scheduling with optimistic case execution times: An example for an adaptive real-time system. In Proceedings of the Second Workshop on Object-Oriented Real-Time Dependable Systems (WORDS), February 1996. [5] T. M. Ghazalie and T. P. Baker. Aperiodic servers in a deadline scheduling environment. Journal of Real-Time Systems, 7(9):31-67, 1995. [6] E. D. Jensen, C. D. Locke, and H. Tokuda. A time-driven scheduling model for real-time operating systems. In Proceedings of the Sixth IEEE Real-Time Systems Symposium, December 1985. [7] G. Koren and D. Shasha. Skip-over: Algorithms and complexity for overloaded systems that allow skips. In Proceedings of the Sixteenth IEEE Real-Time Systems Symposium. IEEE, 1995. [8] T. Kramp, M. Adrian, and R. Koster. An open framework for real-time scheduling simulation. SFB 501 Report 01/00, Department of Computer Science, University of Kaiserslautern, Germany, January 2000. [9] C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogramming in a hard-real-time environment. Journal of the ACM, 20(1):46-61, 1973. [10] J. W. S. Liu, K.-J. Lin, W.-K. Shih, A. C. Yu, J.-Y. Chung, and W. Zhao. Algorithms for scheduling imprecise computations. IEEE Computer, 24(5):58-68, May 1991. [11] J. W. S. Liu, J. L. Redondo, Z. Deng, T. S. Tia, R. Bettati, A. Silberman, M. Storch, R. Ha, and W. K. Shih. PERTS: A prototyping environment for real-time systems. In Proceedings of the Fourteenth Real-Time Systems Symposium, pages 184-188. IEEE, December 1993. [12] C. D. Locke. Software architectures for hard real-time applications: Cyclic executives vs. fixed-priority executives. Journal of Real-Time Systems, 4(1):37-53, 1992. [13] L. Sha, R. Rajkumar, and J. P. Lehoczky. Priority inheritance protocols: An approach to real-time synchronisation. Technical Report CMU-CS-87-181, Computer Science Department, Carnegie Mellon University, 1987. [14] J. A. Stankovic and K. Ramamritham. The Spring kernel: A new paradigm for hard real-time operating systems. IEEE Software, 8(3):62-72, May 1991. [15] J. A. Stankovic, M. Spuri, M. Di Natale, and G. Buttazzo. Implications of classical scheduling results for real-time systems. IEEE Computer, 28(6):16-25, June 1995. [16] A. D. Stoyenko. A schedulability analyzer for Real-time Euclid. In Proceedings of the Eighth Real-Time Systems Symposium, pages 218-227. IEEE, December 1987. [17] H. Tokuda, J. W. Wendorf, and H.-Y. Wang. Implementation of a time-driven scheduler for real-time operating systems. In Proceedings of the Eighth IEEE Real-Time Systems Symposium, December 1987.
5th International Workshop on Embedded/Distributed HPC Systems and Applications (EHPC 2000) Workshop Co-Chairs Devesh Bhatt Honeywell Technology Center 3660 Technology Drive Minneapolis, MN 55418, USA
[email protected] Lonnie R. Welch Ohio University School of Engineering and Computer Science Athens, OH 45701-2979, USA
[email protected] Preface The International Workshop on Embedded/Distributed HPC Systems and Applications (EHPC) is a forum for the presentation and discussion of approaches, research findings, and experiences in the applications of High Performance Computing (HPC) technology for embedded/distributed systems. Of interest are both the development of relevant technology (e.g.: hardware, middleware, tools) as well as the embedded HPC applications built using such technology. We hope to bring together industry, academia, and government researchers/users to explore the special needs and issues in applying HPC technologies to defense and commercial applications. Topics of Interest • Algorithms and Applications: addressing parallel computing needs of embedded military and commercial applications areas such as signal/image processing, advanced vision/robotic systems, smart-sensor based systems, industrial automation/optimization, vehicle guidance. • Networking Multiple HPC Systems: in-the-large application programming models/API’s, partitioning/mapping, system integration, debugging and testing tools. • Programming Environments: software design, programming, and parallelization methods/tools for DSP-based, reconfigurable, and mixedcomputation-paradigm architectures. • Operating Systems and Middleware: distributed middleware service needs (e.g. QoS, object distribution) of high-performance embedded applications, configurable/optimal OS features needs, static/dynamic resource management needs. J. Rolim et al. (Eds.): IPDPS 2000 Workshops, LNCS 1800, pp. 773-775, 2000. Springer-Verlag Berlin Heidelberg 2000
774
D. Bhatt and L.R. Welch
• Architectures: special-purpose processors, packaging, mixed-computation-paradigm architectures, size/weight/power modeling and management using hardware and software techniques.
EHPC 2000 Contents
The EHPC 2000 workshop will feature technical paper presentations and an open discussion session. This year, we have papers covering several topic areas of interest. The following is a highlight of the papers. In the algorithm and applications area, Yang et al. present a reconfigurable, dynamic load balancing parallel sorting algorithm applicable to information fusion. Hadden et al. present a system health management application domain which would benefit from embedded HPC architectures. In the programming environments area, Janka and Wills present a specification and design methodology for signal-processing systems using high-performance middleware and front-end tools. Patel et al. present a performance comparison of high-performance real-time benchmarks using hand-crafted design versus automated glue-code generation from data-flow specification using their design tool. In the operating systems and middleware area, we have several papers ranging from network load monitoring to communication scheduling for high-performance applications. Islam et al. present a technique for evaluating network load based upon dynamic paths using embedded application benchmarks. Pierce et al. present an architecture for mining of performance data for HPC systems, extending the capabilities of current instrumentation tools. Huh et al. present an approach for predicting the real-time QoS in dynamic heterogeneous resource management systems. VanVoorst and Seidel present the use of a real-time parallel communication benchmark to compare several MPI implementations. West and Antonio present an approach for optimizing the communication scheduling in parallel Space-Time Adaptive Processing (STAP) applications. In the architecture area, we have papers on software and hardware perspectives on power management, as well as a new architecture for embedded applications. Osmulski et al. present a probabilistic power-prediction tool for Xilinx 4000-series reconfigurable computing devices. Unsal et al. present an energy consumption model addressing task assignment and network topology/routing, using replication of shared data structures. Schulman et al. present a system-on-chip architecture containing an array of VLIW processing elements, with reconfiguration times much smaller than FPGA-based architectures.
Program Committee Ashok Agrawala, Univ. of Maryland, USA Bonnie Bennett, Univ. of St. Thomas, USA Bob Bernecky, NUWC, USA Alberto Broggi, Universita‘ di Pavia, Italy Hakon O. Bugge, Scali Computer, Norway Richard Games, MITRE, USA
5th International Workshop on Embedded/Distributed HPC Systems and Applications
Farnam Jahanian, Univ. of Michigan, USA Magnus Jonsson, Halmstad University, Sweden Jeff Koller, USC/Information Sciences Institute, USA Bruce Lewis, US Army AmCom, USA Mark Linderman, USAF Rome Laboratory, USA Craig Lund, Mercury Computer Systems, Inc., USA Stephen Rhodes, Advanced Systems Architectures Ltd., UK Samuel H. Russ, Mississippi State Univ., USA Behrooz Shirazi, University of Texas at Arlington, USA Anthony Skjellum, Mississippi State Univ., USA Brian VanVoorst, Michigan Technological Univ., USA Sudhakar Yalamanchili, Georgia Tech., USA
Advisory Committee Keith Bromley, NRaD, USA Dieter Hammer, Eindhoven Univ. of Technology, The Netherlands David Martinez, MIT Lincoln Laboratory, USA Viktor Prasanna, Univ. of Southern California, USA
775
A Probabilistic Power Prediction Tool for the Xilinx 4000-Series FPGA Timothy Osmulski, Jeffrey T. Muehring, Brian Veale, Jack M. West, Hongping Li, Sirirut Vanichayobon, Seok-Hyun Ko, John K. Antonio, and Sudarshan K. Dhall School of Computer Science University of Oklahoma 200 Felgar Street Norman, OK 73019 Phone: (405) 325-7859
[email protected] Abstract. The work described here introduces a practical and accurate tool for predicting power consumption for FPGA circuits. The utility of the tool is that it enables FPGA circuit designers to evaluate the power consumption of their designs without resorting to the laborious and expensive empirical approach of instrumenting an FPGA board/chip and taking actual power consumption measurements. Preliminary results of the tool presented here indicate that an error of less than 5% is usually achieved when compared with actual physical measurements of power consumption.
1
Introduction and Background
Reconfigurable computing devices, such as field programmable gate arrays (FPGAs), have become a popular choice for the implementation of custom computing systems. For special purpose computing environments, reconfigurable devices can offer a cost-effective and more flexible alternative to the use of application specific integrated circuits (ASICs). They are especially cost-effective compared to ASICs when only a few copies of the chip(s) are needed [1]. Another major advantage of FPGAs over ASICs is that they can be reconfigured to change their functionality while still resident in the system, which allows hardware designs to be changed as easily as software and dynamically reconfigured to perform different functions at different times [6]. Often a device's performance (i.e., speed) is a main design consideration; however, power consumption is of growing concern as the logic density and speed of ICs increase. Some research has been undertaken in the area of power consumption in CMOS (complementary metal-oxide semiconductor) devices, e.g., see [4, 5]. However, most of this past work assumes design and implementation based on the use of standard (basic cell) VLSI techniques, which is typically not a valid assumption for application circuits designed for implementation on an FPGA.
J. Rolim et al. (Eds.): IPDPS 2000 Workshops, LNCS 1800, pp. 776-783, 2000. Springer-Verlag Berlin Heidelberg 2000
A Probabilistic Power Prediction Tool for the Xilinx 4000-Series FPGA
2
777
Overview of the Tool
A probabilistic power prediction tool for the Xilinx 4000-series FPGA is overviewed in this section. The tool, which is implemented in Java, takes as input two files: (1) a configuration file associated with an FPGA design and (2) a pin file that characterizes the signal activities of the input data pins to the FPGA. The configuration file defines how each CLB (configurable logic block) is programmed and defines signal connections among the programmed CLBs. The configuration file is an ASCII file that is generated using a Xilinx M1 Foundation Series utility called ncdread. The pin file is also an ASCII file, but is generated by the user. It contains a listing of pins that are associated with the input data for the configured FPGA circuit. For each pin number listed, probabilistic parameters are provided which characterize the signal activity for that pin. Based on the two input files, the tool propagates the probabilistic information associated with the pins through a model of the FPGA configuration and calculates the activity of every internal signal associated with the configuration [1]. The activity of an internal signal s, denoted a_s, is a value between zero and one and represents the signal's relative frequency with respect to the frequency of the system clock, f. Thus, the average frequency of signal s is given by a_s f. Computing the activities of the internal signals represents the bulk of computations performed by the tool [1]. Given the probabilistic parameters for all input signals of a configured CLB, the probabilistic parameters of that CLB's output signals are determined using a well-defined mathematical transformation [2]. Thus, the probabilistic information for the pin signals is transformed as it passes through the configured logic defined by the configuration file. However, the probabilistic parameters of some CLB inputs may not be initially known because they are not directly connected to pin signals, but instead are connected to the output of another CLB for which the output probabilistic parameters have not yet been computed (i.e., there is a feedback loop). For this reason, the tool applies an iterative approach to update the values for unknown signal parameters. The iteration process continues until convergence is reached, which means that the determined signal parameters are consistent based on the mathematical transformation that relates input and output signal parameter values, for every CLB. The average power dissipation due to a signal s is modeled by (1/2) C_{d(s)} V^2 a_s f, where d(s) is the Manhattan distance the signal s spans across the array of CLBs, C_{d(s)} is the equivalent capacitance seen by the signal s, and V is the voltage level of the FPGA device. The overall power consumption of the configured device is the sum of the power dissipated by all signals. For an N x N array of CLBs, Manhattan signal distances can range from 0 to 2N. Therefore, the values of 2N + 1 equivalent capacitances must be known, in general, to calculate the overall power consumption. Letting S denote the set of all internal signals for a given configuration, the overall power consumption of the FPGA is given by:

P_{avg} = \frac{1}{2} \sum_{s \in S} C_{d(s)} V^2 a_s f = \frac{1}{2} V^2 f \sum_{s \in S} C_{d(s)} a_s .   (1)
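To make Eq. (1) concrete, the sketch below sums the per-signal contributions; the voltage, clock frequency, signal list, and capacitance table are invented example values, not data from the tool.

```python
# Illustrative evaluation of Eq. (1): Pavg = 1/2 * V^2 * f * sum_s C_{d(s)} * a_s.
# All numeric values below are made-up example values.

V = 3.3          # supply voltage (volts), example value
f = 30e6         # clock frequency (Hz), 30 MHz as used in the measurements

# Each signal: (Manhattan distance d(s) in CLBs, activity a_s in [0, 1]).
signals = [(0, 0.50), (1, 0.25), (3, 0.10), (5, 0.80)]

# Equivalent capacitance per Manhattan distance, C_d in farads (example values).
C = {0: 5e-12, 1: 8e-12, 3: 14e-12, 5: 20e-12}

P_avg = 0.5 * V**2 * f * sum(C[d] * a for d, a in signals)
print(f"P_avg = {P_avg:.4f} W")
```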
778
T. Osmulski et al.
The values of the activities (i.e., the as’s) are dependent upon the parameter values of the pin signals defined in the pin file. Thus, although a given configuration file defines the set S of internal signals present, the parameter values in the pin file impact the activity values of these internal signals.
3
Calibration of the Tool
Let S_i denote the set of signals of length i, i.e., S_i = { s \in S | d(s) = i }. So, the set of signals S can be partitioned into 2N + 1 subsets based on the length associated with each signal. Using this partitioning, Eq. 1 can be expressed as follows:

P_{avg} = \frac{1}{2} V^2 f \left[ C_0 \sum_{s \in S_0} a_s + C_1 \sum_{s \in S_1} a_s + \cdots + C_{2N} \sum_{s \in S_{2N}} a_s \right] .   (2)
To determine the values of the tool’s capacitance parameters, actual power consumption measurements are taken from an instrumented FPGA using different configuration files and pin input parameters. Specifically, 2N + 1 distinct measurements are made and equated to the above equation using the activity values (i.e., the as’s) computed by the tool. For the j-th design/data set combination, let Pj denote the measured power and let Aj,k denote the aggregate activity of all signals of length k. The resulting set of equations is then solved to determine the 2N + 1 unknown capacitance parameter values:
\frac{1}{2} V^2 f \begin{bmatrix} A_{0,0} & A_{0,1} & \cdots & A_{0,2N} \\ A_{1,0} & A_{1,1} & \cdots & A_{1,2N} \\ \vdots & & & \vdots \\ A_{2N,0} & A_{2N,1} & \cdots & A_{2N,2N} \end{bmatrix} \begin{bmatrix} C_0 \\ C_1 \\ \vdots \\ C_{2N} \end{bmatrix} = \begin{bmatrix} P_0 \\ P_1 \\ \vdots \\ P_{2N} \end{bmatrix} .   (3)
Solving the above equation for the vector of unknown capacitance values is how the tool is calibrated.
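In outline, the calibration amounts to a linear (least-squares) solve; the footnote in Section 5 notes that a least-squares solution was used when the measured system was rank-deficient. The matrix and measurements in the sketch below are invented example values.

```python
# Solving Eq. (3) for the capacitance vector in a least-squares sense.
# A, P, V and f below are made-up example values, not measured data.
import numpy as np

V, f = 3.3, 30e6                     # volts, Hz (example values)
A = np.array([[10.0, 4.0, 1.0],      # A[j, k]: aggregate activity of all signals of
              [ 6.0, 9.0, 2.0],      # length k in design/data-set combination j
              [ 3.0, 5.0, 7.0],
              [ 8.0, 2.0, 4.0]])
P = np.array([2.4, 2.9, 3.1, 2.7])   # measured power (W) for each combination

# P = 0.5 * V^2 * f * A @ C  =>  solve for C in the least-squares sense.
C, *_ = np.linalg.lstsq(0.5 * V**2 * f * A, P, rcond=None)
print("calibrated capacitances (F):", C)
```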
4
Power Measurements
For this study, a total of 70 power measurements were made using 5 different configuration files and 14 different data sets. Descriptions of these configuration files and data sets are given in Tables 1 and 2, respectively. All of the configuration files listed in Table 1 each take a total of 32-bits of data as input. The first three configurations (fp_mult, fp_add, int_mult) each take two 16-bit operands on each clock cycle, and the last two (serial_fir and parallel_fir) each take one 32-bit complex operand on each clock cycle. The 32 bits of input data are numbered as 0 through 31 in Table 2, and two key parameters are used to characterize these bits: an activity factor, a and a probability factor, p. The activity factor of an input bit is a value
A Probabilistic Power Prediction Tool for the Xilinx 4000-Series FPGA
779
between zero and one and represents the signal’s relative frequency with respect to the frequency of the system clock, f. The probability factor of a bit represents the fraction of time that the bit has a value of one. Fig. 1 shows plots of the measured power for all combinations of the configuration files and data sets described in Tables 1 and 2. For all cases, the clock was run at f = 30 MHz. With the exception of the fp_mult configuration file, the most active data set file (number 6) is associated with the highest power consumption. Also, the least active data set file (number 5) is associated with the lowest power consumption across all configuration files. There is somewhat of a correlation between the number of components utilized by each configuration and the power consumption; however, note that even though the serial_fir implementation is slightly larger than parallel_fir, it consumes less power. This is likely due to the fact that the parallel_fir design requires a high fan-out (and thus high routing capacitance) to drive the parallel multipliers.
Table 1. Characteristics of the configuration files.
Configuration File Name | Description | Component Utilization of Xilinx 4036xla
fp_mult | Custom 16-bit floating point multiplier with 11-bit mantissa, 4-bit exponent, and a sign bit [3]. | 368
fp_add | Custom 16-bit floating point adder with 11-bit mantissa, 4-bit exponent, and a sign bit [3]. | 339
int_mult | 16-bit integer array multiplier; produces 32-bit product [3]. | 509
serial_fir | FIR filter implementation using a serial multiply with a parallel reduction add tree. Input data is 32-bit integer complex. Constant coefficient multipliers and adders from core generator. | 1060
parallel_fir | FIR filter implementation using a parallel multiply with a series of delayed adders. Input data is 32-bit integer complex. Constant coefficient multipliers and adders from core generator. | 1055
780
T. Osmulski et al.
Table 2. Characteristics of the data sets.
Data Set Number | Description
1 | Pins 0 through 15: p = 0.0 and a = 0.0. Pins 16 through 31: p = 0.5 and a = 1.0.
2 | Pins 0 through 15: p = 0.0 and a = 0.0. Pins 16 through 31: p = 0.75 and a = 0.4.
3 | Pins 0 through 15: p = 0.25 and a = 0.45. Pins 16 through 31: p = 0.0 and a = 0.0.
4 | Pins 0 through 15: p = 0.5 and a = 1.0. Pins 16 through 31: p = 0.0 and a = 0.0.
5 | Pins 0 through 31: p = 0.0 and a = 0.0.
6 | Pins 0 through 31: p = 0.5 and a = 1.0.
7 | Even numbered pins: p = 0.0 and a = 0.0. Odd numbered pins: p = 0.5 and a = 1.0.
8 | Even numbered pins: p = 0.3 and a = 0.5. Odd numbered pins: p = 0.7 and a = 0.5.
9 | Even numbered pins: p = 0.5 and a = 1.0. Odd numbered pins: p = 0.0 and a = 0.0.
10 | Even numbered pins: p = 0.8 and a = 0.1. Odd numbered pins: p = 0.2 and a = 0.15.
11 | For all pins, p and a selected at random (different from data set 12).
12 | For all pins, p and a selected at random (different from data set 11).
13 | Pins 0 through 2: p = 0.1 and a = 0.1. Pins 3 through 5: p = 0.2 and a = 0.2, etc.; p's continue to increase in steps of 0.1 and a's increase to 0.5 in steps of 0.1 and then decrease back down to 0.0.
14 | Pin 0: p = 0.1 and a = 0.2. Pin 1: p = 0.2 and a = 0.4. Pin 2: p = 0.3 and a = 0.6, etc.; p's continue to increase to 1.0 in steps of 0.1 (and then decrease) and a's increase to 1.0 in steps of 0.2 (and then decrease).
A Probabilistic Power Prediction Tool for the Xilinx 4000-Series FPGA
781
[Figure: measured power consumption (W) versus data set number for each configuration file (fp_mult, fp_add, int_mult, serial_fir, parallel_fir).]
Fig. 1. Measured power consumption for the configuration files and data sets described in Tables 1 and 2.
5
Experimental Evaluation of the Tool
Because 73 values are used to model all of the internal capacitances of the device used in this study, at least three more measurement scenarios are required to calibrate all capacitance values (by solving the complete set of linear equations defined by Eq. 3). Fortunately, however, we were able to calibrate a subset of capacitance values by considering the power consumption of the two FIR filters (serial_fir and parallel_fir). This was because there turned out to be a total of only 28 non-zero entries for the rows of the matrix of Eq. 3, corresponding to aggregate activities for the two FIR filter designs. Fig. 2 shows the measured power consumption curve along with 29 different prediction curves generated by the tool for the serial FIR filter design. One of the prediction curves corresponds to predicted values based on using all 28 measured values to calibrate the tool's capacitance values (this curve is labeled "all" in the legend of the figure). This curve naturally has excellent accuracy; predicted power consumption values match measured values nearly perfectly.1 The remaining 28 prediction curves are associated with capacitance values determined by using all but one of the measured data values to calibrate the tool (the data set not used is indicated in the legend of the figure). For each of these curves, the data set not used in the
1 The reason the predicted values do not match measured values exactly is because the equations used to determine capacitance values did not have full rank, and thus a least-squares solution was determined.
782
T. Osmulski et al.
[Figure: measured and predicted power (W) versus data set; curves labeled S1-S14 and P1-P14 correspond to calibrations with one serial- or parallel-FIR data set withheld, plus the "all" calibration curve and the measured curve.]
Fig. 2. Measured and predicted power consumption curves using various calibration scenarios for the serial FIR filter implementation.
[Figure: measured and predicted power (W) versus data set; curves labeled P1-P14 and S1-S14 correspond to calibrations with one data set withheld, plus the "all" calibration curve and the measured curve.]
Fig. 3. Measured and predicted power consumption curves using various calibration scenarios for the parallel FIR filter implementation.
A Probabilistic Power Prediction Tool for the Xilinx 4000-Series FPGA
783
calibration of the tool's capacitance values is generally associated with the highest error in the predicted value for that data point. For example, note that when data set number six for the serial FIR (labeled S6 in the figure's legend) was not used in the calibration process, the resulting prediction error for that value was the highest (around 10%). When data sets associated with the parallel FIR design were not included, the prediction curves did not change; thus, those curves are all drawn as solid lines with no symbols. Fig. 3 shows the same type of results as Fig. 2, except for the parallel FIR instead of the serial FIR.
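The leave-one-out evaluation just described can be outlined as follows; calibrate and predict stand in for the calibration of Eq. (3) and the prediction procedure of Section 2, and this is not the tool's actual code.

```python
# Outline of the leave-one-out evaluation: for every measured design/data-set
# combination, recalibrate the tool without it and compare the prediction for the
# held-out point against the measurement.
# `calibrate` and `predict` are placeholders for the steps of Sections 2 and 3.

def leave_one_out(measurements, calibrate, predict):
    """measurements: list of (aggregate_activity_vector, measured_power) pairs."""
    errors = []
    for i, (activities, measured) in enumerate(measurements):
        training = measurements[:i] + measurements[i + 1:]
        capacitances = calibrate(training)            # solve Eq. (3) without point i
        predicted = predict(capacitances, activities)
        errors.append(abs(predicted - measured) / measured)
    return errors   # relative errors; roughly 10% at worst in the reported results
```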
6
Summary
To summarize the results for both filter designs, when all 28 sets of measurements are used to calibrate the tool, the maximum error in predicted versus measured power is typically less than about 5%. With one data set removed, the maximum error increases to about 10%, and the predicted value with this highest error is typically associated with the data set not used in calibrating the tool. This level of error is acceptable for most design environments, and represents a considerable accomplishment in the area of power prediction for FPGA circuits. Thus, these preliminary results indicate that the tool is able to adequately predict power consumption (i.e., for data sets not used in calibrating the tool). By using more data sets to calibrate the tool in the future, it is expected that even greater prediction accuracy and robustness will be achieved.
Acknowledgements This work was supported by DARPA under contract no. F30602-97-2-0297. Special thanks go to Annapolis Micro Systems, Inc. for their support and for providing the instrumented FPGA board that was used to take power measurements.
References 1. T. Osmulski, Implementation and Evaluation of a Power Prediction Model for Field Programmable Gate Array, Master’s Thesis, Computer Science, Texas Tech University, 1998. 2. K. P. Parker and E. J. McClusky, “Probabilistic Treatment of General Combinatorial Networks,” IEEE Trans. Computers, vol. C-24, pp. 668-670, June 1975. 3. B. Veale, Study of Power Consumption for High-Performance Reconfigurable Computing Architectures, Master’s Thesis, Computer Science, Texas Tech University, 1999. 4. T. L. Chou, K. Roy, and S. Prasad, “Estimation of Circuit Activity Considering Signal Correlations and Simultaneous Switching,” Proc. IEEE Int’l Conf. Comput. Aided Design, pp. 300-303, Nov. 1994. 5. A. Nannarelli and T. Yang, “Low-Power Divider,” IEEE Trans. Computers, Vol. 48, No. 1, Jan. 1999, pp. 2-14. 6. Xilinx XC4000E and XC4000X Series Field Programmable Gate arrays, Product Specification, Xilinx Inc., v1.5, http://www.xilinx.com/partinfo/databook.htm#xc4000, 1999.
Application Challenges: System Health Management for Complex Systems 1
1
1
2
George D. Hadden , Peter Bergstrom , Tariq Samad ,Bonnie Holte Bennett , 3 4 George J. Vachtsevanos , and Joe Van Dyke 1
Honeywell Technology Center, 3660 Technology Drive, Minneapolis, MN 55418
[email protected] 2 Knowledge Partners of Minnesota, Inc., 9 Salem Lane, Suite 100, St. Paul, MN 55118-4700
[email protected] 3 The Georgia Institute of Technology, School of Electrical and Computer Engineering, Atlanta, Georgia 30332-0250
[email protected] 4 Systems Analysis and Software Engineering, 253 Winslow Way West, Bainbridge Island, Washington, 98110
[email protected] Abstract. System Health Management (SHM) is an example of the types of challenging applications facing embedded high-performance computing environments. SHM systems monitor real-time sensors to determine system health and performance. Performance, economics, and safety are all at stake in SHM, and the emphasis on health management technology is motivated by all these considerations. This paper describes a project focusing on condition-based maintenance (CBM) for naval ships. Condition-based maintenance refers to the identification of maintenance needs based on current operational conditions. In this project, system architectures and diagnostic and prognostic algorithms are being developed that can efficiently undertake real-time data analysis from appropriately instrumented machinery aboard naval ships and, based on the analysis, provide feedback to human users regarding the state of the machinery – such as its expected time to failure and the criticality of the equipment for current operation.
1
Introduction
Although some aspects of system operation, such as feedback control, are by now widely automated, others such as the broad area of system health management (SHM) still rely heavily on human operators, engineers, and supervisors. In many industries, SHM is viewed as the next frontier in automation. System health management has always been a topic of significant interest to industry. Only relatively recently, however, have the numerous aspects of health J. Rolim et al. (Eds.): IPDPS 2000 Workshops, LNCS 1800, pp. 784-791, 2000. Springer-Verlag Berlin Heidelberg 2000
Application Challenges: System Health Management for Complex Systems
785
management begun to be viewed as facets of one overall problem. The term itself has gained currency only recently. We now understand SHM as encompassing all issues related to off-nominal operations of systems – including equipment, process/plant, and enterprise. As for the capabilities that fall under the SHM label, the following are particularly notable:
• Fault detection: identifying that some element or component of a system has failed.
• Fault identification: identifying which element has failed.
• Failure prediction: identifying elements for which failure may be imminent and estimating their time to failure.
• Modelling and tracking degradation: quantifying gradual degradation in a component or the system.
• Maintenance scheduling: determining appropriate times for preventive or corrective operations on components.
• Error correction: estimating ‘correct’ values for parameters, the measurements of which have been corrupted.
Technologists are seeking to exploit advances in diverse fields for developing SHM solutions. As might be expected, the variety and complexity of problems that SHM encompasses preclude any single-technology answers. Hardware, software, and algorithmic technologies are all required and are being explored. An SHM solution can require a hardware architecture design, integrating sensors, actuators, computational processors, and communication networks. Different algorithmic techniques may be needed for signal processing, including Fourier and wavelet transforms and time series models. Artificial intelligence methods such as expert systems and fuzzy logic can be helpful in allowing human expertise and intuition to be captured. There is also increasing interest in fundamental modelling, especially in failure mode effects analysis (FMEA), a systematic approach for identifying what problems can potentially occur with products and processes. Finally, software architectures are required to manage the multiple devices, data streams, and algorithms. With Internet-enabled architectures, an SHM system can be physically distributed across large distances. 1.1
Challenges in system health management
Our successes in capturing common failure mechanisms have resulted in safer, more reliable, and more available systems. An interesting corollary is that we are now seeing failure modes that were rarely seen before. The lack of empirical data or experiential knowledge in such cases renders many methods unusable. Other types of knowledge must be relied upon in such cases, generally based on a human expert's understanding of system operation. Another failing of many conventional methods for fault identification is that they assume that faults occur singly. Surprising relationships can occur among various
786
G.D. Hadden et al.
failure modes. A fault in one device may cause problems in otherwise unrelated machines that depend on it for their input (perhaps separated by several intervening devices). Compound faults often do not have independent symptoms, and predicting or diagnosing multiple faults is not simply a matter of dealing with each separately. Even when there is a single fault, its symptoms will be masked by any number of additional symptoms generated by logically upstream and downstream subsystems. Also, SHM must deal with the large differences in time scales. Vibration data from a motor may need to be collected at nearly a megahertz for shaft balance problems to be detectable, whereas flooding in a distillation column is a phenomenon that occurs on a time scale of many minutes. System architectures and algorithms that can deal with these extremes of sampling rates are needed but not readily available.
1.2
Condition-Based Maintenance for Naval Ships
This project, supported by the Office of Naval Research of the U. S. Department of Defence, is focusing on condition-based maintenance (CBM) for naval ships. Condition-based maintenance refers to the identification of maintenance needs based on current operational conditions. In this project, system architectures and diagnostic and prognostic algorithms are being developed that can efficiently undertake real-time data analysis from appropriately instrumented machinery aboard naval ships and, based on the analysis, provide feedback to human users regarding the state of the machinery – such as its expected time to failure. Using these analyses, ship maintenance officers can determine which equipment is critical to repair before embarking on their next mission – a mission that could take the better part of a year.
1.2.1 MPROS Architecture The development of the CBM system, called MPROS (for Machinery Prognostic and Diagnostic System), had two phases. The first phase had MPROS installed and running in the lab. During the second phase, we extended MPROS’s capability somewhat and installed it on the Navy hospital ship Mercy in San Diego. MPROS is a distributed, open, extensible architecture for hosting multiple on-line diagnostic and prognostic algorithms. Additionally, our prototype contains four sets of algorithms aimed specifically at centrifugal chilled water plants. These are: 1. PredictDLI’s (a company in Bainbridge Island, Washington, that has a Navy contract to do CBM on shipboard machinery) vibration-based expert system adapted to run in a continuous mode. 2. State-based feature recognition (SBFR), an Honeywell Technology Center (HTC)developed embeddable technique that facilitates recognition of time-correlated events in multiple data streams. 3. Wavelet Neural Network (WNN) diagnostics and prognostics developed by Professor George Vachtsevanos and his colleagues at Georgia Tech. This technique
Application Challenges: System Health Management for Complex Systems
787
is aimed at vibration data; however, unlike PredictDLI's, their algorithm excels at drawing conclusions from transitory phenomena. 4. Fuzzy logic diagnostics and prognostics, also developed by Georgia Tech, that draw diagnostic and prognostic conclusions from nonvibrational data. Since these algorithms (and others we may add later) have overlapping areas of expertise, they may sometimes disagree about what is ailing the machine. They may also reinforce each other by reaching the same conclusions from similar data. In these cases, another subsystem, called Knowledge Fusion (KF), is invoked to make some sense of these conclusions. We use a technique called Dempster-Shafer Rules of Evidence to combine conclusions reached by the various algorithms. It can be extended to handle any number of inputs. MPROS is distributed in the following sense: Devices called Data Concentrators (DCs) are placed near the ship's machinery. Each of these is a computer in its own right and has the major responsibility for diagnostics and prognostics. Except for Knowledge Fusion, the algorithms described above run on the DC. Conclusions reached by these algorithms are then sent over the ship's network to a centrally located machine containing the other part of our system – the Prognostic/Diagnostic/Monitoring Engine (PDME). KF is located in the PDME. Also in the PDME is the Object-Oriented Ship Model (OOSM). The OOSM represents parts of the ship (e.g., compressor, chiller, pump, deck, machinery space) and a number of relationships among them (e.g., part-of, proximity, kind-of). It also serves as a repository of diagnostic conclusions – both those of the individual algorithms and those reached by KF. Communication among the DCs and the PDME is done using the Distributed Component Object Model (DCOM), a standard developed by Microsoft.
1.2.2 Data Concentrator hardware The DC hardware (Figure 1 shows the HTC-installed DC) consists of a PC104 singleboard Pentium PC (about 6 in. x 6 in.) with a flat-screen LCD display monitor, a PCMCIA host board, a four-channel PCMCIA DSP card, two multiplexer (MUX) cards, and a terminal bus for sensor cable connections. The operating system is Windows 95™, and there are connections for keyboard and mouse. Data is stored via DRAM. The DC is housed in a NEMA enclosure with a transparent front door and fans for cooling. Overall dimensions are 10 in. x 12 in. x 4 in. The system was built entirely with commercial off-the-shelf components with the exception of the MUX cards, which are a PredictDLI hardware subcomponent, and the PCMCIA card, which was modified from a commercial two-channel unit to meet the needs of the project.
788
G.D. Hadden et al.
Figure 1 Data concentrator installed at HTC
2
MPROS Software
Figure 2 shows a diagram of the MPROS system. The PDME consists entirely of software and runs on any sufficiently powerful Windows NT machine. A potentially large number (on the order of a thousand) DCs are installed on the ship and report diagnostic and prognostic conclusions to the PDME over the ship’s network. In the following, we describe the various software parts of the system. 2.1
PDME
The PDME is the logical center of the MPROS system. Diagnostic and prognostic conclusions are collected from DC-resident as well as PDME-resident algorithms. Fusion of conflicting and reinforcing source conclusions is performed to form a prioritized list for use by maintenance personnel. The PDME is implemented on a Windows NT platform as a set of communicating servers built using Microsoft’s Component Object Model (COM) libraries and services. Choosing COM as the interface design technique has allowed us to build some components in C++ and others in Visual Basic, with an expected improvement in development productivity as the outcome. Some components were prototyped using Microsoft Excel, and we continue to use Excel worksheets and macros to drive some
Application Challenges: System Health Management for Complex Systems
789
testing of the system. Communications between DC and PDME components depend on Distributed COM (DCOM) services built into Microsoft's operating systems.
[Figure: block diagram of the MPROS system, showing DCs and the PDME PC connected via the ship's network and DCOM interfaces; components include the executive, DLI expert system, SBFR, wavelet neural network, fuzzy logic, OO ship model, KF, database, PDME-resident algorithms, user interface, data acquisition card, MUX units, control signals, and machinery sensors.]
Figure 2 The MPROS system
2.2
Knowledge fusion
Knowledge fusion is the co-ordination of individual data reports from a variety of sensors. It is higher level than pure ‘data fusion,’ which generally seeks to correlate common-platform data. Knowledge fusion, for example, seeks to integrate reports from acoustic, vibration, oil analysis, and other sources, and eventually to incorporate trend data, histories, and other components necessary for true prognostics.
Implementation
To date, two levels of knowledge fusion have been implemented: one for diagnostics and one for prognostics. Our approach for implementing knowledge fusion for diagnostics uses Dempster-Shafer belief maintenance for correlating incoming reports. This is facilitated by use of a heuristic that groups similar failures into logical groups. Dempster-Shafer theory is a calculus for qualifying beliefs using numerical expressions. For example, given a belief of 40% that A will occur and another belief of 75% that B or C will occur, it will conclude that A is 14% likely, B or C is 64% likely, and assign 22% of belief to unknown possibilities. This maintenance of the likelihood
790
G.D. Hadden et al.
of unknown possibilities is both a differentiator and a strength of Dempster-Shafer theory. It was chosen over other approaches (e.g., Bayes nets) because the others require prior estimates of the conditional probability relating two failures – data not yet available for the shipboard domain. Diagnostic knowledge fusion generates a new fused belief whenever a diagnostic report arrives for a suspect component. This updates the belief for that suspect component and for every other failure in the logical group for that component. It also updates the belief of ‘unknown’ failure for the logical group for that component. Prognostic knowledge fusion generates a new prognostic vector for each suspect component whenever a new prognostic report arrives.
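For illustration, Dempster's rule of combination applied to the quoted example (belief 0.40 in A from one source, 0.75 in B or C from another) can be sketched as follows; this is a generic textbook formulation, not the MPROS implementation.

```python
# Dempster's rule of combination for the example in the text:
# source 1 assigns 0.40 to {A}, source 2 assigns 0.75 to {B, C};
# the remainder of each mass goes to the whole frame (the "unknown" set).

def combine(m1, m2):
    """m1, m2: dicts mapping frozensets of hypotheses to mass."""
    combined, conflict = {}, 0.0
    for s1, v1 in m1.items():
        for s2, v2 in m2.items():
            inter = s1 & s2
            if inter:
                combined[inter] = combined.get(inter, 0.0) + v1 * v2
            else:
                conflict += v1 * v2                 # mass assigned to disjoint sets
    return {s: v / (1.0 - conflict) for s, v in combined.items()}

frame = frozenset({"A", "B", "C"})
m1 = {frozenset({"A"}): 0.40, frame: 0.60}
m2 = {frozenset({"B", "C"}): 0.75, frame: 0.25}
for s, v in combine(m1, m2).items():
    # {A}: ~0.14, {B, C}: ~0.64, frame (unknown): ~0.21 (the text rounds this to 22%)
    print(sorted(s), round(v, 2))
```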
3
Validation
A question we are often asked is, ‘How are you going to prove that your system can really predict failures?’ This question, as it turns out, is quite difficult to answer. The problem is that we are developing a system we claim will predict failures in devices, and that in real life, these devices fail relatively rarely. We have several answers to this question:
• We are still going to look for the failure modes. We have a number of installed data collectors both on land and on ships. In addition, PredictDLI is collecting time domain data for several parameters whenever their vibration-based expert system predicts a failure on shipboard chillers.
• As Honeywell upgrades its air conditioning systems to be compliant with new nonpolluting refrigerant regulations, older chillers become obsolete and are replaced. We have managed to acquire one of these chillers and are now constructing a test plan to collect data from this chiller.
• Seeded faults are worth doing. Our partners in the Mechanical Engineering Department of Georgia Tech are seeding faults in bearings and collecting the data. These tests have the drawback that they might not exhibit the same precursors as real-world failures, especially in the case of accelerated tests.
• Honeywell, York, PredictDLI, the Naval Research Laboratory, and WM Engineering have archived maintenance data that we will take advantage of.
Although persuasive, these answers are far from conclusive. The authors would welcome any input on how to validate a failure prediction system.
4
Conclusions
In the not too distant past, automation was employed largely to manage systems under nominal operating conditions. The realm of automation rarely extended to abnormal conditions – people were expected to handle these. Whether it was equipment failure,
Application Challenges: System Health Management for Complex Systems
791
severe environmental disturbances, or other sorts of disruptions, the responsibility for predicting and diagnosing faults and returning the system to normal operation rested squarely on human staff. Developers of control systems and their applications were concerned about these issues only to the extent that they needed to provide the appropriate information and decision support to operators, engineers, and supervisors. The actual prognosis, diagnosis, and remedial actions were generally outside the scope of automation. We have succeeded in our original mission almost too well, and this success has led to a broadening of our ambitions for automation and control systems. This has happened even as the scale and complexity of the physical systems – whether naval ships or commercial buildings or factories – have dramatically increased. As might be expected, problem complexity translates to solution complexity. For instance, the more time we have to plan our response before a failure occurs, the better off we are – catastrophic failures can be avoided, human safety can be maximized, repair actions can be combined, and so on. To increase this time, we must find new ways to access data that we have not sensed before. In addition, we have to construct software that derives prognostic and diagnostic conclusions from increasingly subtle correlations among the sensed data.
5
Acknowledgment
The authors gratefully acknowledge the support of the Office of Naval Research, grant number N00014-96-C-0373. Joe Van Dyke participated in this project while employed at Predict DLI.
References Bennett, B.H. and Hadden, G.D. (1999) Condition-based maintenance: algorithms and applications for embedded high performance computing. Proceedings of the 4th International Workshop on Embedded HPC Systems and Applications (EHPC’99). Bristow, J., Hadden, G.D., Busch, D., Wrest, D., Kramer, K., Schoess, J., Menon, S., Lewis, S. and Gibson, P. (1999) Integrated diagnostics and prognostics systems. Proceedings of the 53rd Meeting of the Society for Machinery Failure Prevention Technology (invited). Hadden, G.D., Bennett, B.H., Bergstrom, P., Vachtsevanos, G. and Van Dyke, J. (1999) Machinery diagnostics and prognostics/condition based maintenance: a progress report. Proceedings of the 53rd Meeting of the Society for Machinery Failure Prevention Technology. Hadden, G.D., Bennett, B.H., Bergstrom, P., Vachtsevanos, G. and Van Dyke, J. (1999) Shipboard machinery diagnostics and prognostics/condition based maintenance: a progress report. Proceedings of the 1999 Maintenance and Reliability Conference (MARCON99).
Accommodating QoS Prediction in an Adaptive Resource Management Framework E. Huh1, L. R. Welch1, B. A. Shirazi2, B. Tjaden1, and C. D. Cavanaugh2 1
339 Stocker Center, School of Electrical Engineering and Computer Science, Ohio University, Athens, OH 45701 2 Department of Computer Science Engineering, The University of Texas at Arlington, Arlington, TX 76019 1 {ehuh|welch|
[email protected]} 2 {shirzai|
[email protected]}
Abstract. Resource management for dynamic, distributed real-time
systems requires handling of unknown arrival rates for data and events; additional desiderata include: accommodation of heterogeneous resources, high resource utilization, and guarantees of real-time quality-of-service (QoS). This paper describes the techniques employed by a resource manager that addresses these issues. The specific contributions of this paper are: QoS monitoring and resource usage profiling; prediction of real-time QoS (via interpolation and extrapolation of execution times) for heterogeneous resource platforms and dynamic real-time environments; and resource contention analysis.
1
Introduction
In [1], real-time systems are categorized into three classes: (1) deterministic systems, which have a priori known worst case arrival rates for events and data, and are accommodated by the Rate Monotonic Analysis (RMA) approach (see [2]); (2) stochastic systems, which have probabilistic arrival rates for events and data, and can be handled using statistical RMA [3] and real-time queuing theory [4]; and (3) dynamic systems, which operate in highly variable environments and therefore have arrival rates that cannot be known a priori. This paper presents a resource management approach for dynamic allocation to handle execution times represented using a time-variant stochastic model. Additionally, we show how to accommodate heterogeneity of resources and QoS prediction. Section 2 provides an overview of the resource manager (RM) approach. Sections 3-5 explain each component used in our RM approach. Section 6 presents experimental assessments of our techniques.
J. Rolim et al. (Eds.): IPDPS 2000 Workshops, LNCS 1800, pp. 792-799, 2000. Springer-Verlag Berlin Heidelberg 2000
Accommodating QoS Prediction in an Adaptive Resource Management Framework
2
793
Overview of RM approach
Our approach to resource management is based on the dynamic path model of the demand space [5], [8], [9]. This demand space model is a collection of dynamic real-time paths, each of which consists of a set of communicating programs with end-to-end QoS requirements. The demand space system model is described in Table 1.
Table 1. Demand space system model
Pi: a name of path "i"
aij: name of application j in path "i"
Hk: a name of host "k"
|Pi.DS| = tl: data stream sizes of path "i" (or workload or tactical load)
T(aij, tl): period of aij in Pi with workload tl
Cobs(aij, tl, Hk): observed execution time of aij at cycle c with tl in path "i" on Hk
Creq(aij, tl, Hk): required execution time of aij at cycle c with tl in path "i" on Hk
Cprof(aij, tl, Hp): profiled execution time of aij at cycle c with tl in path "i" on Hp
Cpred(aij, tl, Hk): predicted execution time of aij at cycle c with tl in path "i" on Hk
Dobs(aij, tl, Hk): observed queuing delay of aij at cycle c with tl in path "i" on Hk
CUPobs(aij, tl, Hk): observed CPU usage on Hk for the aij in Pi with tl
CUPreq(aij, tl, Hk): required minimum CPU usage on Hk for aij in Pi with tl
CUPureq(aij, tl, Hk): required, unified minimum CPU usage on the target Hk for the aij in Pi with tl
MEMreq(aij, tl, Hk): memory usage of aij in path "i" on Hk with tl
λreq(Pi): required latency of Pi (= QoS)
λpred(c+1, Pi): predicted latency of path Pi at cycle c+1
ψ(Pi): required slack interval for each QoS requirement = [ψmin(Pi), ψmax(Pi)]
Table 2. Supply space system model
Hk: host name "k"
SPECint95(Hi): the fixed point operation performance of SPEC CPU95 of Hi
SPECfp95(Hi): the floating point operation performance of SPEC CPU95 of Hi
SPEC_RATE(Hi): the relative host rating of Hi
Threshold_CPU(Hi): the CPU utilization threshold of Hi
Threshold_MEM(Hi): the memory utilization threshold of Hi
CUP(Hi, t): the CPU usage (user + kernel) percentage of Hi at time t
CIP(Hi, t): the idle-percentage of Hi at time t
FAM(Hi, t): the free-available-memory of Hi at time t
MF(Hi, t): the number of page faults on Hi at time t
INT(Hi, t): the number of interrupts on Hi at time t
CALL(Hi, t): the number of system calls on Hi at time t
CTX(Hi, t): the process context switching rate on Hi at time t
CMI(Hi, t): the number of packet-in received on Hi at time t
CMO(Hi, t): the number of packet-out transferred on Hi at time t
COL(Hi, t): the number of collisions occurred on Hi at time t
LMi(Hj, t): the i-th load metric in host j at time t; LMi(Hj, t) ∈ {FAM(Hi,t), MF(Hi,t), INT(Hi,t), CIP(Hi,t), CUP(Hi,t), CALL(Hi,t), CTX(Hi,t)}
We also model the resources or the supply space (described in Table 2), which consists of host features, host resources, and host load metrics. The resource management problem is to map the set of all paths Pi onto the set of hardware resources, such that all λ req(Pi) are satisfied. Since the workloads of the Pi
794
E. Huh et al.
vary, the mapping needs to be adapted dynamically. The flow of our adaptive resource management approach is shown in Fig. 1. Each step is described in detail in the subsequent sections of this document.
[Figure: flow diagram of the resource manager, comprising S/W and H/W Profiling, QoS and Resource Requirements Monitoring, Violation Diagnosis, Resource Needs Estimation, Resource Discovery, Resource Unification, Feasibility Analysis, Contention Analysis, QoS Prediction, and Resource Allocation; decision points test for QoS violations and for scale actions, and new capabilities are marked.]
Fig. 1. Overview of resource manager
3
Software and Hardware Profiling
In order to manage resources in an efficient manner, it is necessary to understand the resource usage characteristics of the members of the demand space and the relative resource capabilities of the members of the supply space. S/W profiling measures an application's execution time, period, CPU usage, and memory usage, which are collected passively by an external process (a monitor) that reads the proc table periodically to obtain process data. Three different techniques are tested as follows: (1) the process calls getrusage once per period, (2) an external monitor reads the ps_info block in the proc table once per second, and (3) an external monitor reads the ps_usage block in the proc table once per second. An exponential moving average is applied to measurements for all techniques for filtering. Initial profiling is done during application development and profiles are refined through dynamic profiling. The accuracy of the exponential moving average of the ps_usage block in the proc table is almost as good as that of getrusage, as shown in Fig. 2. H/W profiling measures the capabilities of hosts relative to a reference host using the Standard Performance Evaluation Corporation (SPEC) benchmarks. SPEC is a standardized set of relevant benchmarks that can be applied to the newest generation of high-performance computers (see [10]). To achieve overall, relative system performance, the mean throughput is compared to a reference machine, a Sun-Sparc-10/40 MHz.
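A minimal sketch of the exponential moving average filtering applied to the profiled samples; the smoothing factor is an assumed value, since the paper does not state one.

```python
# Exponential moving average used to filter the per-second profiling samples.
# The smoothing factor alpha is an assumed value; the paper does not state it.

def ema(samples, alpha=0.3):
    smoothed = []
    for x in samples:
        prev = smoothed[-1] if smoothed else x
        smoothed.append(alpha * x + (1.0 - alpha) * prev)
    return smoothed

# Example: noisy execution-time samples (seconds) read from the proc table.
print(ema([0.051, 0.049, 0.062, 0.048, 0.050]))
```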
Accommodating QoS Prediction in an Adaptive Resource Management Framework
795
[Figure: standard deviation (s) of the profiled execution time versus period (s) for the raw ps_info (external), smoothed ps_usage (external), and getrusage (internal) techniques.]
Fig. 2. Comparison of profiling techniques
We use SPECfp95 (a measure of a host's floating point performance) and SPECint95 (a measure of a host's fixed point performance) to derive the relative, normalized host rating as follows: SPEC_RATE(Hi) = AVG( SPECint95(Hi) / Max_j(SPECint95(Hj)), SPECfp95(Hi) / Max_j(SPECfp95(Hj)) ), where the maxima are taken over all hosts Hj.
4
QoS and Resource Utilization Monitoring
This section discusses our approach to QoS and resource monitoring, resource needs estimation, and resource discovery. This module observes the end-to-end real-time QoS of dynamic, distributed paths, and monitors resource requirements for dynamic software profiling to determine execution time, period, and memory usage. The memory usage (of main memory for allocation of workloads) is observed by taking the process resident set size from the proc table. The execution time of an application consists of the user and the kernel time, which allows accurate computation of the CPU utilization measured for a "move" action as follows: CUPreq(aij, tl, Hk) = Cobs(aij, tl, Hk) / T(aij, tl). Also, the cycle time of the QoS monitor, called the validity interval, is used for the period (T(aij, tl)) of an application to calculate the CPU resource requirement, while conventional approaches use the arrival time of workload for the period, which causes poor utilization in a dynamic environment. Interpolation and extrapolation use profiles to estimate the resource needs of a new replica of a scalable application. When the current path QoS is greater than the minimum slack of the QoS requirement and the QoS Manager (QM) recommends a "scale up" action based on the workload trend, the resource requirements for the new workload tl (tl = current tl / (current replicas + 1)), which will be distributed equally among replicas, need to be recomputed at run-time so that resource needs can be requested from the supply space. Hence, initial profiles of the violated application are the only way to
796
E. Huh et al.
decide the required Creq(aij, tl, Hk) and MEMreq(aij, tl, Hk) for the various workloads, as the bounds on the execution time of an application are not obtainable in dynamic environments. The interpolation and extrapolation of resource needs for a "scale down" action proceeds exactly as for a "scale up" action except for the calculation of workload tl (tl = current tl / (replicas - 1)). The average error between the observed execution times and the execution times estimated by piecewise linear regression using two data points is 12.1 milliseconds (1% CPU usage). Resource discovery determines current utilization levels for communication and computation resources by using vmstat and netstat once per second. These metrics are filtered by an exponential moving average. Communication resource management over broadcast networks (Ethernet/Fast Ethernet) is a hard problem, as contention in such networks depends on the number of communicating nodes, the size of packets, retransmission strategies, and collisions. The network load in terms of delay clearly has a strong relationship with collisions. Hence, in our approach the network load of hosts that are part of a real-time path is computed using the numbers of packets received and transmitted and the number of collisions. For a single host, the network load (net_load_of_host) is computed as follows: net_load_of_host = (1 + COL(Hi, t)) * (CMI(Hi, t) + CMO(Hi, t))
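The interpolation/extrapolation of execution-time profiles and the network-load formula can be sketched as below; the profile points and counter values are invented example values.

```python
# Piecewise-linear interpolation/extrapolation of profiled execution times for a
# new per-replica workload, plus the network-load formula from the text.
# Profile points and counter values below are invented example values.

def estimate_exec_time(profile, workload):
    """profile: at least two (workload, execution_time) pairs from initial profiling."""
    pts = sorted(profile)
    # pick two neighbouring profile points and extend the line through them
    for (w0, c0), (w1, c1) in zip(pts, pts[1:]):
        if workload <= w1 or (w1, c1) == pts[-1]:
            slope = (c1 - c0) / (w1 - w0)
            return c0 + slope * (workload - w0)

def net_load_of_host(collisions, packets_in, packets_out):
    return (1 + collisions) * (packets_in + packets_out)

profile = [(1000, 0.040), (2000, 0.075), (3000, 0.110)]    # (workload, seconds)
print(estimate_exec_time(profile, 2600))                   # interpolated
print(estimate_exec_time(profile, 3500))                   # extrapolated
print(net_load_of_host(collisions=2, packets_in=1200, packets_out=900))
```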
5
Resource Selection
This section explains techniques for resource unification (mapping heterogeneous resource requirements into a canonical form), feasibility analysis, contention analysis, QoS prediction, and resource allocation analysis and enactment. The role of resource unification is to map heterogeneous resource requirements into a canonical form for each resource metric. To allocate the delivered CUPreq(aij, tl, Hk), the RM needs to determine the relative amount of the resources available on the target host. There are two possible approaches, static and dynamic, to resource management that considers the heterogeneity of resources. The static approach uses stable system information such as benchmarks and CPU clock rate. In the Globus project (see [6]), benchmark rates are used as the resource requirement (e.g., 100 Gflops). The dynamic approach uses low-level system parameters (see [7]). In Windows NT, a popular operating system, it is very complicated to access the dynamic system parameters that the operating system provides. Ultimately, as a host-level, global scheduler that must handle any type of system, the RM needs to use general system characteristics instead of dynamic, system-specific parameters in the operating system layer. In our approach, resources are unified into a canonical form using static system information, the SPEC_RATE of Section 3, as follows: CUPureq(aij, tl, Ht) = Cpred(aij, tl, Ht) / T(aij, tl), with Cpred(aij, tl, Ht) = Creq(aij, tl, Hk) * SPEC_RATE(Hk) / SPEC_RATE(Ht), where Ht is the target host and Hk is the host on which the resource requirements were measured. Feasibility analysis finds resources which will meet CUPureq(aij, tl, Ht). The thresholds are used for adaptable resource supply to tolerate the difference between
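A sketch of the unification step under these formulas; the SPEC_RATE values are hypothetical.

```python
# Unifying a profiled CPU requirement onto a heterogeneous target host using the
# SPEC_RATE normalization of Section 3.  SPEC_RATE values are hypothetical.

def unify_cpu_requirement(c_req, period, spec_rate_measured, spec_rate_target):
    """c_req: execution time measured on the profiling host; period: T(a_ij, t_l)."""
    c_pred = c_req * spec_rate_measured / spec_rate_target   # predicted exec. time on target
    return c_pred / period                                   # CUPureq: required CPU fraction

# Example: requirement measured on a host rated 0.42, target host rated 0.65.
print(unify_cpu_requirement(c_req=0.080, period=1.0,
                            spec_rate_measured=0.42, spec_rate_target=0.65))
```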
unified and actual resources. For example, if the available CPU resource is greater than the unified CPU resource requirement plus the threshold, the host becomes a candidate. The contention analysis phase predicts the queuing delays of applications on the candidate hosts. The queuing delay of an application in a path-based real-time system is one of the critical elements RM must examine for schedulability when the periods of applications or paths overlap. Currently, our approach applies observed system load metrics of hosts, LMi(Ht, t), to obtain the delay on heterogeneous hosts. First, we predict the queuing delay of the application on the target host as the observed queuing delay multiplied by the ratio of the monitored load metric on the target host to that on the current host: Dpred(aij, tl, Ht) = Dobs(aij, tl, Hk) * LMi(Ht, t) / LMi(Hk, t). Second, we use the execution time and the current CPU usage on the target host: Dpred(aij, tl, Ht) = Cpred(aij, tl, Ht) * CUP(Ht, t). If either approach approximates the observed queuing delay well, it becomes a generic solution from the viewpoint of a host-level, global RM; both approaches are assessed experimentally in section 6. RM then predicts the real-time QoS (considering contention) that would result from each candidate reallocation action. In general, when a customer requests QoS, this step reports the next-cycle QoS, λpred(c+1, Pi), to the customer in addition to supplying resources. If a single application in a path is violated, the path QoS is computed simply by substituting the predicted latency of that application for its current latency; otherwise, the path QoS is accumulated until the latency of the last application has been predicted. Resource allocation selects and performs a reallocation action based on the predicted QoS, so RM can guarantee the new allocation. By testing whether the predicted path latency λpred(c+1, Pi) stays within the bounds ψmin(Pi) and ψmax(Pi) (the pre-violation test), RM can detect a QoS violation of the path at the next cycle. RM therefore sees QoS in addition to the amount of resources being supplied. Allocation schemes for the violated application, called "QoS Allocation (QA)", are considered based on the QoS slack, QoS slack = λreq(Pi) - λpred(c+1, Pi), of candidates that have passed the pre-violation test. A greedy, heuristic QA scheme selects a host Hi with the minimum λpred(c+1, Pi) that is also in the top 50th percentile of average network load among all candidate hosts.
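The selection steps just described can be summarized in a small C sketch. It is only an illustration under the definitions above (SPEC_RATE-based unification, a feasibility threshold, the two queuing-delay predictors, and the pre-violation test); the function and variable names are assumptions, and ψmin/ψmax are taken to be the path's lower and upper latency bounds.

```c
/* Illustrative sketch, not the paper's implementation. */
#include <stdbool.h>

/* Scale a CPU requirement measured on host Hk to a target host Ht using the
 * static SPEC_RATE of each host: Cpred = Creq * SPEC_RATE(Hk) / SPEC_RATE(Ht). */
double unify_cpu(double c_req_on_hk, double spec_rate_hk, double spec_rate_ht)
{
    return c_req_on_hk * spec_rate_hk / spec_rate_ht;
}

/* Feasibility: the target host is a candidate if its available CPU exceeds
 * the unified requirement plus a tolerance threshold. */
bool feasible(double cpu_available, double cup_unified, double threshold)
{
    return cpu_available > cup_unified + threshold;
}

/* Contention analysis, method 1: scale the observed queuing delay by the
 * ratio of a monitored load metric on the target and current hosts. */
double dpred_by_load_metric(double d_obs, double lm_target, double lm_current)
{
    return d_obs * lm_target / lm_current;
}

/* Contention analysis, method 2: predicted execution time times the current
 * CPU usage (as a fraction) on the target host. */
double dpred_by_cpu_usage(double c_pred, double cup_target)
{
    return c_pred * cup_target;
}

/* Pre-violation test: a violation is predicted for the next cycle if the
 * predicted path latency falls outside the assumed latency bounds. */
bool pre_violation(double lat_pred_next, double psi_min, double psi_max)
{
    return lat_pred_next < psi_min || lat_pred_next > psi_max;
}
```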
6 Experiments
We have used DynBench (see [8]) as the assessment tool and D-SPEC (see [9]) as the specification language for dynamic resource management; the same DynBench scenario is used in all experiments. A CPU load generator was developed to let the user adjust the CPU usage. The profiled execution times are measured on a Sun Ultra 1 (140 MHz). Prediction uses a Sun Ultra 10 (333 MHz) as the source node and a Sun Ultra 10 (300 MHz) as the target node. Experiment 1 shows the predicted latency of a filter application on the target host, starting from 30% CPU usage. In Fig. 3, using general system load metrics, several
[Figure: predicted latency of a filter onto the target host; latency (sec) vs. monitoring cycle for workloads tl = 2100 and 2600, comparing Lobs with Lpred(CTX), Lpred(CUP), Lpred(INT), Lpred(CALL), and Lpred(MF)]
Fig. 3. Predicting latency of an application
[Figure: predicted path latency (sec) vs. monitoring cycle for workloads tl = 2100 and 2600, comparing Lobs with Lpred(CTX), Lpred(CUP), Lpred(INT), Lpred(CALL), and Lpred(MF)]
Fig. 4. Path prediction
methods of predicting latency are tested and compared to the observed latency, which is measured offline on the host with the same scenario. Each method, identified by the metric in parentheses, predicts the latency of the filter application as follows:
• Lobs: observed application latency
• Lpred(CTX) = Cpred(aij, tl, Ht) + Dobs(aij, tl, Hk) * CTX(Ht, t) / CTX(Hk, t)
• Lpred(CUP) = Cpred(aij, tl, Ht) + Cpred(aij, tl, Ht) * CUP(Ht, t)
• Lpred(INT) = Cpred(aij, tl, Ht) + Dobs(aij, tl, Hk) * INT(Ht, t) / INT(Hk, t)
• Lpred(CALL) = Cpred(aij, tl, Ht) + Dobs(aij, tl, Hk) * CALL(Ht, t) / CALL(Hk, t)
• Lpred(MF) = Cpred(aij, tl, Ht) + Dobs(aij, tl, Hk) * MF(Ht, t) / MF(Hk, t)
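For reference, the predictors in the list above share one of two shapes, sketched below in C. The acronyms are presumed to be vmstat-derived rates (context switches, interrupts, system calls, and minor faults); the helper names are illustrative only.

```c
/* Compact rendering (a sketch, not the authors' code) of the predictors above.
 * cpred: unified execution time on the target host;
 * dobs:  queuing delay observed on the measurement host;
 * metric_*: the chosen load metric (CTX, INT, CALL, or MF) on each host. */
double lpred_ratio(double cpred, double dobs,
                   double metric_target, double metric_current)
{
    /* CTX, INT, CALL, MF variants: scale the observed delay by the metric ratio. */
    return cpred + dobs * metric_target / metric_current;
}

double lpred_cup(double cpred, double cup_target)
{
    /* CUP variant: queuing delay approximated by cpred times CPU usage on the target. */
    return cpred + cpred * cup_target;
}
```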
The results of the experiment show that, when a host is overloaded, the queuing delay matters more than the execution time for predicting the latency of an application accurately. The overall average error (Lobs - Lpred(CUP)) is 0.031 seconds.
Experiment 2 compares the predicted path latency (Lpred) with the observed path latency (Lobs). Note that the predicted path latency is the sum of each application's predicted latency on the target host. Fig. 4 shows that the Lpred(CUP) approach is the most accurate when the system is overloaded (CUP(Hk, t) > 70 at workload 2600). To fully utilize the CPU resource in a distributed real-time system, it is therefore very important to measure the queuing delay when the CPU usage is high. The average error between the predicted and the observed latency is 0.084 seconds.
7 Conclusions and Ongoing Work
The experimental results show that our approach achieves good CPU utilization by analyzing system contention and by predicting QoS accurately. The accuracy of the techniques is shown by noting that the predicted CPU resource needs differ from observed ones by no more than 4.5%. Ongoing work includes proactive RM and dynamic QoS negotiation.
References
1. Welch, L.R., Masters, M.W.: Toward a Taxonomy for Real-Time Mission-Critical Systems. Proceedings of the First International Workshop on Real-Time Mission-Critical Systems (1999)
2. Liu, C.L., Layland, J.W.: Scheduling Algorithms for Multiprogramming in a Hard Real-Time Environment. JACM, Vol. 20 (1973) 46-61
3. Atlas, A., Bestavros, A.: Statistical Rate Monotonic Scheduling. Proceedings of the Real-Time Systems Symposium (1998)
4. Lehoczky, J.P.: Real-Time Queueing Theory. Proceedings of the IEEE Real-Time Systems Symposium, IEEE CS Press (1996) 186-195
5. Welch, L.R., Ravindran, B., Harrison, R., Madden, L., Masters, M., Mills, W.: Challenges in Engineering Distributed Shipboard Control Systems. The IEEE Real-Time Systems Symposium (1996)
6. Czajkowski, K., Foster, I., Kesselman, C., Martin, S., Smith, W., Tuecke, S.: A Resource Management Architecture for Metacomputing Systems. Proceedings of the IPPS/SPDP '98 Workshop on Job Scheduling Strategies for Parallel Processing (1998)
7. Chatterjee, S., Strosnider, J.: Distributed Pipeline Scheduling: A Framework for Distributed, Heterogeneous Real-Time System Design. The Computer Journal, British Computer Society, Vol. 38, No. 4 (1995)
8. Welch, L.R., Shirazi, B.A.: A Dynamic Real-Time Benchmark for Assessment of QoS and Resource Management Technology. IEEE Real-Time Technology and Applications Symposium (1999)
9. Welch, L.R., Ravindran, B., Shirazi, B.A., Bruggeman, C.: Specification and Analysis of Dynamic, Distributed Real-Time Systems. Proceedings of the 19th IEEE Real-Time Systems Symposium, IEEE Computer Society Press (1998) 72-81
10. OSG Group: SPEC CPU95. http://www.spec.org
Network Load Monitoring in Distributed Systems

Kazi M. Jahirul Islam*, Behrooz A. Shirazi*, Lonnie R. Welch+, Brett C. Tjaden+, Charles Cavanaugh*, Shafqat Anwar*

* University of Texas at Arlington, Department of CSE, Box 19015, Arlington, TX 76019-0015
{islam|shirazi|cavan}@cse.uta.edu, [email protected]
+ Ohio University, School of Electrical Engineering and Computer Science, Athens, OH 45701-2979
{welch|tjaden}@ohio.edu
Abstract. Monitoring the performance of the network that connects a real-time distributed system is very important. If the system is adaptive or dynamic, the resource manager can use this information when creating or placing new processes. We may want to determine how much load a host is placing on the network, or what the overall network load index is. In this paper, a simple technique for evaluating the current load of a network is proposed. If a computer is connected to several networks, we can obtain the load index of that host for each network, and we can also measure the load index that all hosts together place on a network. The dynamic resource manager of DeSiDeRaTa can use this technique to meet its requirements. We have verified the technique with two benchmarks, LoadSim and DynBench.
1 Introduction The DeSiDeRaTa project is providing innovative resource management technology that incorporates knowledge of resource demands in the distributed, real-time computer control systems domain. This project involves building middleware services for the next generation of ship-board air defense systems being developed by the U.S. Navy. DeSiDeRaTa technology differs from related work in its incorporation of novel features of dynamic real-time systems. The specification language, mathematical model and dynamic resource management middleware support the dynamic path paradigm, which has evolved from studying distributed, real-time application systems. The dynamic path is a convenient abstraction for expressing end-to-end system objectives, and for analyzing timeliness, dependability and scalability. Novel aspects of the dynamic path paradigm include its large granularity and its ability to accommodate systems that have dynamic variability[1]. The resource manager is responsible for making all resource allocation decisions. The resource manager component computes allocation decisions by interacting with the system data repository and obtaining software and hardware system profiles. The allocation decision may involve migrating programs to different hosts, starting additional copies of programs (for scalability), or restarting failed programs (for survivability). The system data repository component is responsible for collecting and maintaining all system information.
The resource management architecture (attached at end of this document) consists of components for adaptive resource management and QoS negotiation, data broker, path monitoring and diagnosis, resource monitoring, and resource management consoles. The adaptive resource management and QoS negotiation component is responsible for making resource management decisions. This component computes the allocation decision by interacting with the data broker and obtaining software and hardware system profiles. The allocation decision may involve migrating programs to different host nodes, starting additional copies of programs (for scalability), or restarting failed programs (for survivability). The resource management component carries out its decisions by communicating with a daemon program (on each host) to start up and control programs on each host[4]. The data broker component is responsible for collecting and maintaining all system information. The data broker reads the system description and requirements expressed using the specification language and builds the data structures that model the system. Dynamically measured software performance metrics, such as path latency and throughput, and resource usage characteristics, such as program page faults and resident size, are collected and maintained by the path monitoring and diagnosis component. The data broker obtains measurements of the dynamic attributes of the software from the monitoring component. Hardware resource profiles are collected and maintained by the resource monitoring component, and fed to the data broker on demand as well as periodically. The data broker thus provides a single interface for all system data. The path monitoring and diagnosis component monitors the performance of software systems at the path-level. This component determines the changing requirements of the software by interacting with the data broker. When a path fails to meet the requirements, this component performs diagnosis of the path, and determines the ``bottleneck'' node of the path. Resource management consoles display system and path status, and allow dynamic attributes, such as deadlines, to be modified. All communication for such consoles is through the data broker[6]. As mentioned earlier, the resource manager utilizes software and hardware system profiles to make allocation decisions. To obtain the system profiles, the resource manager continuously monitors the whole system and calculates various metrics. These metrics provide the guidelines for choosing among different allocation possibilities and optimizing resource usage. One of the components that is monitored by the resource manager is the network which connects the different computers that form the distributed real-time system. Several network parameters are of interest including host-to-host delay, network load index, host load index, etc. Host-to-host delay measures the time required to transmit a message from a specific host to another. Host load index measures the load applied to the network by a specific host. Network load index measures the total amount of load applied to the network by all the hosts that are currently connected or communicating through the network. Furthermore, since computers may be connected to multiple networks, they may have multiple IP addresses. Therefore, we can measure the host load index of a host on a specific network; or we may be interested in the load index of each network[7]. 
Depending on these parameters, the resource manager might select a different host to initiate a new process; or it might send data to a different host to get the result in an acceptable time using the least busy network or route. In a multi-homed network, it might also route data through an alternate network where load index indicates low traffic.
In this paper, we will formulate a host load index and network load index. We will also explain the experimental procedure that was followed to get the results. At the end we will also discuss the limitations of our approach and present some ideas for future work that will allow us to improve our method.

[Figure: hosts Host1, Host2, and Host3 attached to two networks, Net1 and Net2]
Fig. 1. Formal definition of the problem
We will use the above model for the illustration of the problem. Let us assume that there are n hosts Host1, Host2, … Hostn and m networks Net1, Net2, …Netm in the system (Figure 1). We are interested in finding the host load index for each host, Hosti. We are also interested in finding the load index on all the networks through which the hosts are connected. That means, if they are connected through k different networks, we are interested in measuring the load index on all k different networks. We also want to measure the load index of each network Neti. That will help us to select the least loaded network for transmission.
2 Load Simulator The LoadSimulation1, hereafter referred to as LoadSim, can compose and simulate the resource utilization (CPU cycles, network bandwidth, latency, etc.) of a large-scale distributed system that may consist of many interacting processes executing on many networked computers. Simulation of the distributed system load is achieved by means of replicated copies of a configurable LoadSim program, each of which can be initialized with a potentially different host computer and network resource utilization profile. LoadSim replicas are mapped onto a heterogeneous network of computers by a set of support services that allows the user to specify and control the topology and characteristics of the LoadSim configuration under test. LoadSim also provides the ability to collect metrics on the performance of the simulated large-scale system.
1 Some parts of LoadSimulation have been taken from "Requirements for a Realtime Distributed LoadSimulation" written by Timothy S. Drake. He may be contacted at [email protected]
The primary goals of the benchmark are to provide the ability to objectively assess the network communication protocols (e.g. TCP/IP, UDP, etc.), network bandwidth, network latency characteristics and to place additional load on partial implementations of real systems in order to assess the impact of the load that would be placed on the computing resource base by missing components if those components were present. As LoadSim can place additional load on a partial implementations of real systems, we used this tool to apply load on the network. The DynBench benchmark application is modeled after typical distributed real-time military applications such as an air defense subsystem. Figure 2 shows the three dynamic paths from the DynBench benchmark application. The detect path (path 1) is a continuous path that performs the role of examining radar sensor data (radar tracks) and detecting potential threats to a defended entity. The sensor data are filtered by software and are passed to two evaluation components, one is software and the other is a human operator. The detection may be performed manually, automatically, or semi-automatically (automatic detection with manual approval of engagement recommendation). When a threat is detected and confirmed, the transient engage path (path 2) is activated, resulting in the firing of a missile to engage the threat. After a missile is in flight, the quasi-continuous guidance path (path 3) uses sensor data to track the threat, and issues guidance commands to the missile. The guidance path involves sensor hardware, software for filtering/sensing, software for evaluating and deciding, software for acting, and actuator hardware.[5]
[Figure: the three DynBench dynamic paths (Path 1: Assess, Path 2: Initiate, Path 3: Guide), with compute subpaths (filter/sense, evaluate & decide, act) and communication subpaths linking the sensors, the operator, and the actuators]
Fig. 2. The DynBench dynamic paths
A (simulated) radar sensor periodically generates a stream of data samples (representing the positions of moving bodies) based on equations of motion defined in a scenario file. The data stream is provided to the Filter Manager, which distributes the current workload among replicas of the filter program (Figure 3). Each filter uses a least mean square regression algorithm to filter “noise” and to correlate the data points into three equations that describe the motion of a body. The equations of motion for each of the observed bodies are sent to the evaluate and decide manager, which distributes the workload among the evaluate and decide programs. Evaluate and decide processes determine if the current position of an observed body is within a “critical region” defined by a doctrine file.
When a body first enters the critical region, it is passed to the action manager in the initiation path. The action manager distributes the workload among the action programs, which calculate equations of motion to intercept bodies of interest. A simulated actuator initiates the motion of the intercepting body. Whenever engaged objects are present in the sensor data, the evaluate and decide programs transmit the equations of motion of those bodies to the monitor and guide manager, which pairs identified target bodies with their corresponding interceptors. The corresponding pairs of equations are distributed equally among the monitor and guide processes, which track the progress of each interceptor relative to the new positions of its intended target. If necessary, a new flight equation is calculated for the interceptor and sent to the sensor. If an interception occurs, the process sends a request to remove the target and the interceptor from the data stream. A deconflict path has also been added to the DynBench application subsystem. The deconflict path checks, before an interceptor is launched and before it reaches its target track, whether the interceptor's flight path conflicts with other tracks or interceptors. If there is a conflict, deconflict sends a warning message to the Radar Display.
[Figure: DynBench application subsystem, showing the Sensor (driven by a scenario file), the Filter Manager (FM) and Filter replicas, the evaluate and decide manager (EDM) and ED programs with their doctrine file, the Radar display, the action manager (AM) and Action programs, the monitor and guide manager (MGM) and MG processes, the deconflict manager (DM) and DC process, and the Actuator]
Fig. 3. DynBench application subsystem
We have used DynBench as the load generator. We can place a specific number of tracks in the sensor, and they will be processed by the whole benchmark; an increasing number of tracks puts an increasing load on the network.
3 Previous Work Philip M. Irey IV, Robert D. Harrison and David T. Marlow have previously analyzed LAN performance in [2] and have shown an approach for evaluating the applicability of currently available commercial products in real-time distributed systems. They also defined a few parameters for this purpose. Depending on these parameters and
the explained measurement methodologies, we can determine the applicability of the commercial products in real-time systems. Andrej Sostaric, Milan Gabor, and Andreas Gygi developed Mtool [3], which can be used for performance monitoring in networked multi-platform systems. They emphasized a three-tier architecture and used Java technology to achieve platform independence.
4 Experimental Procedure We used netstat to collect our statistics. We are mainly interested in the TCP/IP suite of protocols, and the command 'netstat -i 1' is used to produce a packet transmission summary once per second. We ran this command on the different hosts, which may be connected through one or more networks, to gather the statistics; it reports the number of packets in, the number of packets out, and the number of collisions every second. We then tried to formulate the load index as a linear combination of these three parameters. To generate the load, we used two different tools. One is the load simulator, which places a specified amount of load on the network every 200 milliseconds. The other tool is DynBench, a benchmark for DeSiDeRaTa, with which we sent a specific number of tracks (load) from the Sensor to the Filter Manager for a specific amount of time. Each amount of load was applied on the network for approximately three minutes; we then increased the load and ran the experiment again, and the test continued in this manner. We performed our experiments on Sun workstations connected to a LAN and collected statistics from different hosts in both cases. We are confident that the load was sent from the specified source host to the specified destination host, because the mean and standard deviation of packets in, packets out, and collisions at the source and destination differ significantly from those of the other hosts; both were much higher at the source and destination.

                     Load generator: DynBench       Load generator: Load Simulator
Parameter            Host 1    Host 2    Network    Host 1    Host 2    Network
Out                  0.53306   0.62177   0.57686    0.90124   0.92911   0.91584
Out+Collision        0.54933   0.62177   0.58216    0.94287   0.92911   0.94481
Out+In               0.53596   0.61334   0.58968    0.91077   0.91070   0.91472
In                   0.53698   0.60707   0.59190    0.92778   0.90018   0.91252
Collision            0.55735   N/A       0.56566    0.96815   N/A       0.97252
In+Collision         0.55380   0.60707   0.60402    0.96350   0.90018   0.94345
In+Out+Collision     0.54741   0.61334   0.59450    0.93961   0.91070   0.93282

Table 1. Coefficient of correlation
Table 1 shows the coefficient of correlation of different combinations of the parameters with the load that was applied on the network. Here data is moving from "Host 1" to "Host 2". Two scenarios of load generation are shown for comparison: the first load was produced by DynBench, and the second was generated by the Load Simulator. Figs. 4 and 5 summarize the results. We have also measured the load index of the whole network using the same approach; in the figures,
"Network" denotes the load index of the whole network; it is the sum of the specified parameters over the hosts.
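A table entry of this kind can be reproduced with a straightforward Pearson correlation between the applied load and a candidate combination of the per-second netstat counters. The sketch below is an assumed reconstruction of that calculation, not the authors' code.

```c
/* Correlate an applied-load series with one candidate load-index series
 * (e.g., "In+Collision") built from netstat -i samples. */
#include <math.h>
#include <stddef.h>

double pearson(const double *x, const double *y, size_t n)
{
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (size_t i = 0; i < n; i++) {
        sx += x[i]; sy += y[i];
        sxx += x[i] * x[i]; syy += y[i] * y[i];
        sxy += x[i] * y[i];
    }
    double cov = sxy - sx * sy / n;
    double vx  = sxx - sx * sx / n;
    double vy  = syy - sy * sy / n;
    return cov / sqrt(vx * vy);
}

/* Candidate load index "In+Collision" for one host over n samples. */
void in_plus_collision(const double *in, const double *col, double *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        idx[i] = in[i] + col[i];
}
```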
[Figure: coefficient of correlation with the applied load for each parameter combination (Out, Out+Collision, Out+In, In, Collision, In+Collision, In+Out+Collision), for Host 1, Host 2, and the network, using the Load Simulator as load generator]
Fig. 4.
[Figure: coefficient of correlation with the applied load for each parameter combination (Out, Out+Collision, Out+In, In, Collision, In+Collision, In+Out+Collision), for Host 1, Host 2, and the network, using DynBench as load generator]
Fig. 5.
Several interesting characteristics can be observed here. When a host is transmitting data to another host, the number of collisions gives a better approximation of the load index at the source than at the destination. From the receiver's point of view, the number of packets transmitted by the receiver reflects its load index; here, "out" and "out+collision" are the same, since the experimental data show that during the transfer the receiver does not experience any collisions. Because TCP/IP uses handshaking (acknowledgments) to inform the sender about the reception of data, the number of packets going out of the receiver indirectly reflects the number of packets transmitted to it. This is supported by both the Load Simulator and DynBench. For measuring the network load index, two different parameters are chosen by the two
different load generators: DynBench suggests the sum of "In+Collision" over all the hosts connected to the network, while according to the Load Simulator it is the sum of "Collision".
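Under these observations, the network-wide index reduces to a sum of the chosen counter combination over all hosts on the network. A minimal sketch of that aggregation (with an assumed host_counters structure) follows.

```c
/* Sum the chosen parameter combination over all hosts attached to the network. */
#include <stddef.h>

struct host_counters { double in, out, col; };  /* per-second netstat -i counts */

double network_load_index(const struct host_counters *h, size_t n_hosts,
                          int use_collision_only /* 1: LoadSim choice, 0: DynBench choice */)
{
    double sum = 0.0;
    for (size_t i = 0; i < n_hosts; i++)
        sum += use_collision_only ? h[i].col : (h[i].in + h[i].col);
    return sum;
}
```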
5 Conclusion In this paper, we have proposed a simple non-intrusive technique for measuring the load applied on a network. We have used a simple tool, netstat –i, for that. We generated load using DynBench and LoadSimulator and then measured the load index. We have shown all the combinations of three parameters – packet in, packet out and collision, to determine which one best describes the load index. Another way of measuring load index is to send time-stamped packets from one host to another host. This can measure the delay in the network, which also gives a fair indication of applied load. We can also measure the delay between two hosts through each network they are connected to. This gives the network load index.
References
1. L. R. Welch, B. A. Shirazi, B. Ravindran and C. Bruggeman, "DeSiDeRaTa: QoS Management Technology for Dynamic, Scalable, Dependable, Real-Time Systems", Proceedings of The 15th IFAC Workshop on Distributed Computer Control Systems, September 1998.
2. Philip M. Irey IV, Robert D. Harrison and David T. Marlow, "Techniques for LAN Performance Analysis in a Real-Time Environment", Real-Time Systems, 14, 21-44 (1998), Kluwer Academic Publishers.
3. Andrej Sostaric, Milan Gabor, Andreas Gygi, "Performance Monitoring in Network Systems", 20th Int. Conf. Information Technology Interfaces ITI '98, June 16-19, 1998.
4. L. R. Welch, B. Ravindran, B. Shirazi, and C. Bruggeman, "Specification and analysis of dynamic, distributed real-time systems", Proceedings of the 19th IEEE Real-Time Systems Symposium, 72-81, IEEE Computer Society Press, 1998.
5. L. R. Welch, B. A. Shirazi, "A Dynamic Real-Time Benchmark for Assessment of QoS and Resource Management Technology", RTAS 99.
6. B. Ravindran, L. R. Welch, B. A. Shirazi, Carl Bruggeman, Charles Cavanaugh, "A Resource Management Model for Dynamic, Scalable, Dependable, Real-Time Systems".
7. L. R. Welch, P. Shirolkar, Shafqat Anwar, Terry Sergeant, B. A. Shirazi, "Adaptive Resource Management for Scalable, Dependable, Real-Time Systems".
A Novel Specification and Design Methodology of Embedded Multiprocessor Signal Processing Systems Using High-Performance Middleware

Randall S. Janka1 and Linda M. Wills2

1 Georgia Institute of Technology, Georgia Tech Research Institute, Atlanta, GA 30332-0856 USA
[email protected]
2 Georgia Institute of Technology, School of Electrical and Computer Engineering, Atlanta, GA 30332-0250 USA
[email protected]

Abstract. Embedded signal processing system designers need to be able to prototype their designs quickly and validate them early. This must be done in a manner that avoids premature commitment to the implementation target, especially when that target includes costly COTS parallel multiprocessing hardware. A new specification and design methodology known as MAGIC enables the designer to move from an executable specification through design exploration and on to implementation with minimal loss of specification and design information by leveraging computation middleware (VSIPL) and communication middleware (MPI). Maintaining such information is a quality known as "model continuity," which is established using the MAGIC specification and design methodology.
1 Introduction
Embedded signal processing system designers need to be able to prototype their designs quickly and validate them early. This results in quicker time to market as well as early detection of errors, which is less costly. There is tremendous complexity in the specification and design of these systems even when we restrict the technology space to commercial-off-the-shelf (COTS) multiprocessing (MP) hardware and software. We need a way to manage this complexity and accomplish the following goals:
• Enable the designer to quickly evaluate and validate design prototypes.
• Reduce and manage the level of detail that needs to be specified about the system in order to make sound decisions at each stage of the design process.
• Allow the design space to be explored without committing too early to a particular technology (hardware platform).
• Enable constraints identified and derived in one stage to be applied consistently in other stages of the design process.
In other words, we need to be able to benchmark and validate in early stages (at the appropriate level of detail and without premature commitment) – a process we call
“virtual benchmarking” [1]. We also need to carry information gained (constraints and design rationale) through to later stages, a quality known as “model continuity.” We have developed a new methodology to do this by exploiting computation and communication middleware that are emerging as standards in the embedded real-time COTS multiprocessing domain.
2 The Need for Model Continuity in Specification & Design Methodologies The process of designing embedded real-time multiprocessor signal processing systems is plagued by a lack of coherent specification and design methodology. A canonical waterfall design process is commonly used to specify, design, and implement these systems with COTS MP hardware and software. Powerful frameworks exist for each individual phase of this canonical design process, but no single methodology exists which enables these frameworks to work together coherently, i.e., allowing the output of a framework used in one phase to be consumed by a different framework used in the next phase. This lack of coherence usually leads to design errors that are not caught until well into the implementation phase. Since the cost of redesign increases as the design moves through these three stages, redesign is most expensive when it is not performed until the implementation phase. We have developed design rules and integrated commercial tools in such a way that designs targeting COTS MP technologies can be improved by providing a coherent coupling between these frameworks, a quality known as model continuity. The basic information flow of a COTS MP specification and design methodology (SDM) is shown in Fig. 1. To appreciate how our SDM establishes model continuity, we first illustrate how model continuity is missing in today's COTS MP methodologies, as shown in Fig. 2. Currently, constants such as filter coefficients can be passed from MATLAB .m files into a CASE SDM or a simpler vendor software development environment, but that is the only link from the requirements specification and design specification to the implementation phase in the whole design process. Not having an executable requirements model and a channel for passing it to the design analysis phase leads to model discontinuity, which is the total absence or minimal presence of model continuity.
3 The MAGIC Specification and Design Methodology
We have developed and prototyped a new SDM which we call the MAGIC1 SDM [2]. The means of accomplishing model continuity using the frameworks we chose for the MAGIC SDM is illustrated in Fig. 3. Solid boxes are
MAGIC–Methodology Applying Generation, Integration, and Continuity.
documents or frameworks. Dashed boxes are aggregates of frameworks that contain executable specifications or the design analysis environment. Solid lines are automated channels, where system model information can be passed between frameworks without manual intervention. Dashed lines are semi-automated channels where some human intervention is required to move system model information between frameworks.

[Figure: information flow from the executable requirements specification (modes, environment, equations, algorithms, data, test vectors, constants, data rates, non-performance constraints, SWAP) through design analysis (design architectures that satisfy computational requirements and meet non-performance constraints, selection of the best architecture for each candidate technology, technology and architecture decisions, non-performance constraints checking) to the executable design specification (hardware and software configuration, software-to-hardware map, computation and communication software) and the COTS MP signal processor (executable images, run-time scripts, configuration description)]
Fig. 1. Basic flow of information needed to support model continuity.
[Figure: in current COTS MP SDMs the requirements specification (MATLAB pseudocode, natural language, tables) and the design specification (natural-language application software and software-to-hardware configuration) feed the implementation (CASE framework or software development environment) only through constants; the implementation then produces executable images, run-time scripts, and a configuration description for the COTS MP signal processor]
Fig. 2. How model continuity is currently lacking in current COTS MP SDM.
The executable workbook was fundamental in providing model continuity between specification and design. It was created using Excel with links created between worksheets that contained data (benchmarks, reliability statistics, form factor constraints, etc.) and models (benchmark conversions, process estimates, latency estimates, etc.). The data link to Simulink2 was manual; architectural parameters were computed in Excel and then implemented in Simulink by hand since Simulink does not support scaling for parallelization. VSIPL3 (computation middleware) and MPI4 (communication middleware) functions were “generated” using our code generation rules and entered into our executable workbook. Once in our workbook, we could compute
2 Simulink–simulation and rapid prototyping framework from The MathWorks.
3 Vector Signal Image Processing Library–an open-standards API for computation.
4 Message-Passing Interface–an open-standard API for multiprocessing and parallel processing communication. Its real-time cousin is "MPI/RT."
token delays to be used in eArchitect5 for performance modeling. We would iterate this process for other candidate architectures. We created channels of model continuity between specification and design with the implementation specification. When we decided upon an architecture, we could run Simulink and tap process outputs, dumping them into the MATLAB workspace where we could save them for testing the implementation. VSIPL and MPI code that we generated is available for use in the form of the inner-loop functions and parameter arguments. When design analysis is complete and we have made design decisions, our performance model provides the hardware configuration, software process definition, and software-to-hardware mapping.
4 Model Continuity via Middleware
Model continuity is achieved in large part through the use of middleware for computation and communication. Open standards-based middleware supports computation and communication software portability, which means that middleware written for one vendor's hardware should run on another vendor's platform. Consequently, middleware code that constitutes the inner-loop software implementation can be used on different vendors' platforms for design analysis using performance modeling. Critical to making the use of middleware a strong thread of model continuity is the auto-generation of middleware code, since automating the generation of software by a framework that is correct in specification reduces the chance of error in the design and implementation. A code generator such as Simulink's Real-Time Workshop that could generate middleware for computation using VSIPL, MPI for communication, and/or MPI/RT for communication and control would produce code for both design and implementation. The generated middleware can be used to quantify process delays in the performance modeling framework and as the core of the signal processing implementation's application software. Our reasons for choosing VSIPL and MPI are very similar to our reasons for choosing the frameworks discussed above. They are stated here in order of importance, with the most important reason first:
• Acceptable performance–These middlewares deliver high performance because they are tightly integrated with the vendors' computation and communication libraries.
• Standards-based–Since all the COTS MP vendors in our domain space support these middlewares and actively participate in their standardization processes, frameworks that generate VSIPL and MPI code will be consumable by all of the hardware vendors' SDEs considered in the design phase.
• COTS–They are now becoming commercially available and therefore stable and supported.

5 Performance modeling framework from Viewlogic that supports multiprocessing and high-speed interconnections such as RACEway and Myrinet.
[Figure: MAGIC SDM frameworks and channels: MATLAB (.m, .mat) holds the executable requirements specification (modes, environment, equations, algorithms, data); Simulink/Stateflow (.mdl) and the Excel executable workbook (.xls) exchange architecture parameters, VSIPL and MPI functions, test vectors, and constants; eArchitect (.prj) receives software processes, parameters, timing parameters, and token delays for design analysis; the CASE framework or software development environment receives the VSIPL and MPI functions, the hardware and software configuration, and the software-to-hardware map, and produces executable images, run-time scripts, and the configuration description for the COTS MP signal processor]
Fig. 3. MAGIC SDM information flow and illustration of model continuity.
VSIPL is an API supporting portability for COTS users of real-time embedded multicomputers that has been produced by a national forum of government, academia, and industry participants [3]. VSIPL is computational middleware, which also supports interoperability with interprocessor communication (IPC) middleware such as MPI and MPI/RT. The VSIPL Forum has produced the API, a prototype reference library, and a test suite to verify API compliance. Commercial implementations are just now becoming available (early 2000). Earnest consideration by various defense programs as well as other commercial projects is underway and early adoption has begun. The VSIPL API standard provides hundreds of functions to the application software developer to support computation on scalars, vectors, or dense rectangular arrays. Canonical development of embedded signal processing applications using COTS multiprocessing hardware and software typically consists of partitioning the code into two portions. One portion is the “outer loop” where the setup and cleanup functions are executed, typically memory allocation and coefficient generation, such as FFT twiddle factors and window coefficients. The other portion is the “inner loop” where the time-critical repetitive streaming data transformation functions lie. A VSIPL application will be built similarly, with the outer loop executing heavyweight system functions that allocate memory when creating blocks and parameterized accessors called views. The block creation is substantial, while the view object handles take up very little memory, but do require system support. Message passing is a powerful and very general method of expressing parallelism and can be used to create extremely efficient parallel software applications. It has become the most widely used method of programming many types of parallel computers. High-performance implementations of MPI are now available, including implementations for COTS MP platforms. The leading vendor is MPI Software Technology, Inc. (MSTI) who provides high-performance implementations of MPI under the commercial trademark MPI/PRO for NOWs and SPCs, including two of the three leading COTS MP vendors in our technology space (RACEway and Myrinet). There is another standards effort underway to specify a real-time version of MPI with a guaranteed quality-of-service (QoS) called MPI/RT [4]. Non-QoS beta versions of MPI/RT are just now (early 2000) beginning to appear.
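The outer-loop/inner-loop partitioning described above can be sketched as follows. To avoid guessing at exact VSIPL signatures, plain C arrays stand in for VSIPL block and view objects; in a real VSIPL application the setup phase would instead create blocks, bind views, and build FFT objects, and the commented acquire_frame call is a hypothetical data source.

```c
/* Sketch of the canonical outer-loop / inner-loop structure of an embedded
 * signal processing application; plain C stands in for VSIPL objects. */
#include <stdlib.h>
#include <math.h>

#define N 1024

int main(void)
{
    const double pi = 3.14159265358979323846;

    /* --- Outer loop (setup, run once): allocate memory and precompute
     *     coefficients, e.g., a Hann window. --- */
    double *window = malloc(N * sizeof *window);
    double *frame  = calloc(N, sizeof *frame);
    for (int i = 0; i < N; i++)
        window[i] = 0.5 - 0.5 * cos(2.0 * pi * i / (N - 1));

    /* --- Inner loop (time-critical, runs per data frame): streaming
     *     transformation of each input frame. --- */
    for (int frame_no = 0; frame_no < 100; frame_no++) {
        /* acquire_frame(frame, N);  -- hypothetical data source */
        for (int i = 0; i < N; i++)
            frame[i] *= window[i];    /* windowing; FFT etc. would follow */
        /* process or forward the transformed frame ... */
    }

    /* --- Cleanup (outer-loop epilogue): release resources. --- */
    free(window);
    free(frame);
    return 0;
}
```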
5 Using VSIPL & MPI for Model Continuity
The two most important reasons for choosing VSIPL and MPI are acceptable performance and the fact that they are standards-based. If these middlewares could not deliver performance commensurate with the vendors' native computation and communication libraries, they would be less useful and therefore less acceptable. However, preliminary VSIPL benchmarks recently released by one COTS MP vendor (Mercury Computer Systems) show computational throughput reaching up to 98% of their native algorithm library. MPI benchmarks released by one commercial MPI vendor (MSTI) show bandwidths within 5% of the RACE theoretical maximum for
large block sizes, which is very close to that achieved by the vendor's own native communication library. Being standards-based is the other key characteristic of these middlewares. The participation of researchers, implementers, and users in forming and supporting these standards goes a long way towards assuring their adoption. It is our opinion that there are two types of standards, official and de facto. Being a standard is not a blessing conferred by some official "acronym'd" organization, but something established de facto when companies invest their own resources in products designed to a standard and consumers purchase those products. We are not saying that oversight and management by standards organizations is not worthwhile; we are just saying that real standards are determined by the community. Suffice it to say, MPI and VSIPL are currently establishing themselves in the marketplace as standards, and no doubt "official sanctification" will occur sometime later. Being a genuine de facto standard means that code generated within the MAGIC SDM can be used to estimate communication and computation token delays in performance modeling, as well as for the inner-loop computational code in the implementation. This strengthens the thread of continuity from specification to design (token delays) and implementation (inner-loop code).
6 Conclusion
We have introduced a new specification and design methodology (SDM) in this paper, the MAGIC SDM, that leverages standards-based middleware to achieve model continuity in the specification and design of signal processing systems implemented with COTS hardware and software. This is feasible since middleware generated in the specification and design processes can be used in the physical implementation because of the efficiency of both the VSIPL computation and MPI communication middleware.
References
[1] R. S. Janka and L. M. Wills, "Virtual Benchmarking of Embedded Multiprocessor Signal Processing Systems," submitted to IEEE Design and Test of Computers, 2000.
[2] R. S. Janka, "A Model-Continuous Specification and Design Methodology for Embedded Multiprocessor Signal Processing Systems," Ph.D. dissertation, School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, Georgia, 1999.
[3] VSIPL Forum, "VSIPL v1.0 API Standard Specification," DARPA and the Navy, Draft, http://www.vsipl.org/PubInfo/pubdrftrev.html, 1999.
[4] Real-Time Message Passing Interface (MPI/RT) Forum, "Document for the Real-Time Message Passing Interface (MPI/RT-1.0) Draft Standard," DARPA, Draft, http://www.mpirt.org/drafts.html, February 1, 1999.
Auto Source Code Generation and Run-Time Infrastructure and Environment for High Performance, Distributed Computing Systems

Minesh I. Patel Ph.D.1, Karl Jordan1, Mattew Clark Ph.D.1, and Devesh Bhatt Ph.D.2

1 Honeywell Space Systems-Commercial Systems Operations, 13350 U.S. Highway 19 North, Clearwater, Florida, USA, 33764
{minesh.patel, karl.l.jordan, mathew.clark}@honeywell.com
2 Honeywell Technology Center, Minneapolis, Minnesota
[email protected]

Abstract. With the emergence of inexpensive commercial off-the-shelf (COTS) parts, heterogeneous multi-processor HPC platforms have become more affordable. However, developing real-time applications that require high performance and high input/output bandwidth on these HPC systems is still difficult. Honeywell Inc. has released a suite of tools called the Systems and Applications Genesis Environment (SAGE) that allows an engineer to develop and field applications efficiently on HPCs. This paper briefly describes the SAGE tool suite, followed by a detailed description of the SAGE automatic code generation and run-time components used for COTS-based heterogeneous HPC platforms. Experiments demonstrate that SAGE-generated glue (source) code with the run-time executes comparably to, or within 75% of the efficiency of, hand-coded versions of the Parallel 2D FFT and Distributed Corner Turn benchmarks executed on CSPI, Mercury and SKY compute platforms.
1 Introduction Many military, industrial and commercial systems require real-time, high-performance and high input/output bandwidth processing. Such applications include radar, signal and image processing, computer vision, pattern recognition, real-time controls and optimization. The complexity of high performance computing (HPC) resources has made it difficult to port and fully implement such applications. With the availability of inexpensive HPC systems based on commercial hardware, the high demands of military and industrial applications can be met. However, the potential benefit of using high performance parallel hardware is offset by the effort required to develop the application. Honeywell Inc. has released a set of user-friendly tools that
offer the application and systems engineer ways to use the computing resources for application development. By tuning processes, improving application efficiency and throughput, and automatic mapping, partitioning and glue (source) code generation, the engineer can improve productivity and turn around time, and lower development cost. This paper describes the Systems and Applications Genesis Environment (SAGE) and its auto-glue (source) code generation and run-time components. We first provide a brief overview of Honeywell’s SAGE tool suite. This is followed by a description of the SAGE’s auto glue code generation and run-time components. Finally, the experiments and results describing the comparison between the performance of the autogenerated glue code and hand-coded benchmarking applications, the Parallel 2D-FFT and the distributed corner turn is provided. 1.1 Systems and Applications Genesis Environment (SAGE) Honeywell has developed an integrated tool suite for system design called the 1 Systems and Applications Genesis Environment (SAGE) . The tool suite provides complete lifecycle development through an integrated combination of tools potentially reducing design and development costs. The SAGE approach to application development is to bring together under a common GUI, a set of collaborating tools designed specifically for each phase of a system’s development lifecycle. SAGE consists of the SAGE: Designer, the SAGE: Architecture Trades and Optimization Tool (AToT) and the SAGE: Visualizer. Typically the design process begins with the Designer. The engineer can use the Designer to describe and capture the hardware and software/application architectures of the system and the mapping between application to hardware, which may be refined or narrowed by AToT. In the Designer, application/system and hardware co-design can be performed using the Designer’s three editors, the application editor, data type editor and the hardware editor. The application editor is used to build a graphical view or model of the application by connecting functional or behavioral blocks (hierarchical) in a data flow manner through user defined or COTS functional libraries. The data type editor is used to define the various data types and striping and parallelization relationships for the different functions in the application editor. In the hardware editor, the hardware architecture is built hierarchically from the processor all the way up to the system level. All primitive and hierarchical blocks are stored on software and hardware “shelves” for later reuse. Items on the hardware shelf include workstations, other embedded computers, CPU chips, memory, ASICs, FPGAs, etc. The application and system designs can be refined using the software shelf items such as other COTS functional or user defined blocks. The entire software development environment integrates COTS-supplied components (compilers and run-time system, and libraries), along with custom, user-supplied software and hardware components (application code, libraries, etc.). Combining elements from the hardware shelf, the software shelf, and trade information, the engineer can construct an executable which maps software components onto hardware resources.
Once the performance requirements, application and hardware of the system are captured in the Designer, the information is sent to AToT. AToT will analyze and interpret the captured information, which drives optimization and trade-off activities described in the following section. After the architecture trades process has determined a target hardware architecture, the genetic algorithm based partitioning and mapping capability of AToT assigns the application tasks to the multi-processor, heterogeneous architecture. AToT can be employed for total design optimization, which includes load balancing of CPU resources, optimizing over latency constraints, communication minimization and scheduling of CPUs and busses. When all the details of the system design have been made, the engineer may instrument and auto-generate the actual application code, which can be compiled and executed on certain supported testbed platforms. The SAGE Visualizer is a configurable instrumentation package that enables the designer to visualize the execution of the application through a variety of graphical displays that are fed by probes placed within the generated code. The Visualizer allows the designer to configure the instrumentation probes to measure application performance, and search for problems in the system, such as bottlenecks or violated latency thresholds.
2 Auto-Glue Code Generation and Run-Time Kernel The SAGE glue-code generator is implemented in Alter, a programming language similar to Lisp in its syntax and style, which provides a direct interface to the contents of a SAGE model. Alter is designed to enable the tool developer to traverse the objects and arc connections in a model, collect the relevant information from the various attributes and properties, and then output the information in a particular format for the application. In the context of the glue-code generator, Alter traverses through the SAGE model and generates source code that can be compiled with application function libraries and the SAGE run-time as shown in Figure 1. The basic Alter language provides the constructs to perform the traditional programming tasks, such as procedure encapsulation, conditionals, looping, variable declaration, and recursion. The language also includes a set of standard calls to access certain features in SAGE, such as setting or retrieving a property value from an object.
[Figure: SAGE models (DoME/Smalltalk) feed the glue-code generator (Alter), which emits the run-time source files]
Figure 1.0 The SAGE glue-code generator gains access into the internal SAGE design tool environment, traverses objects in the models to filter relevant information, and then outputs the information in formats particular to the SAGE run-time source files. The SAGE glue-code generator is implemented in Alter, the programming language that facilitates the traversal and manipulation of DoME-based objects and graphs.
The SAGE run-time kernel is responsible for all sequencing of functions, data striping, and buffer management. To better cover the wide range of application domains, it is necessary to capture the notation of complex data distribution between functional software modules. In the data-flow programming model of the SAGE design notation, this requirement is handled by the port striping features. In short, the port striping conventions enable the system designer to define complex data distribution patterns between functions in a multi-threaded environment. A function’s port object is the sending and receiving point for all data-flow communication between functions; the striping characteristics of a data-flow connection are defined on the source and destination ports. As mentioned previously, the glue-code generator develops several SAGE run-time source files, using information generated from the application model. For example, the function table is generated from a list of all function instances in the SAGE design. SAGE Designer orders all function instances and assigns them IDs from 0.. N - 1. The SAGE runtime executes functions based on this ID, which is the index of this descriptor into the function table. Similarly, information is extracted from the model that allows the runtime to perform data striping. A function port can be defined in the model to be of type replicated or striped. Replicated ports represent data-flow communications in which the data is replicated for each thread of the host function. Striped ports represent data-flow communications in which the data is sliced or divided evenly among the threads of the host function. The port striping type applies to both sending (outgoing) and receiving (incoming) ports. The runtime is responsible for striping the data based on the model information specified in the glue-code. It performs this operation using data buffers. Located and shared between each port on the sender and receiver functions is the SAGE notion of a logical buffer. A logical buffer is a logical representation of the data flow between sender and receiver function threads. It contains the striding information, total buffer size (before striding), thread information (number and type), etc. The logical buffer is defined by the glue-code using the application model’s properties. The runtime uses the logical buffer and the striding information to create physical buffers for message transfer.
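The following sketch illustrates the port-striping idea with an assumed logical-buffer descriptor: a striped port divides the data set evenly among the receiving function's threads, while a replicated port hands every thread the whole buffer. It is an illustration of the concept, not the SAGE run-time's actual data structures.

```c
/* Illustrative logical-buffer descriptor and striping computation. */
#include <stddef.h>

enum port_kind { PORT_REPLICATED, PORT_STRIPED };

struct logical_buffer {
    size_t total_bytes;   /* size of the data set before striping        */
    int    n_threads;     /* threads of the host (receiving) function    */
    enum port_kind kind;  /* striped or replicated, taken from the model */
};

/* Compute the offset and length of the physical buffer for one thread. */
void physical_extent(const struct logical_buffer *lb, int thread,
                     size_t *offset, size_t *length)
{
    if (lb->kind == PORT_REPLICATED) {
        *offset = 0;
        *length = lb->total_bytes;          /* every thread sees all the data */
    } else {
        size_t stripe = lb->total_bytes / lb->n_threads;
        *offset = (size_t)thread * stripe;
        /* the last thread absorbs any remainder */
        *length = (thread == lb->n_threads - 1)
                      ? lb->total_bytes - *offset : stripe;
    }
}
```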
3 Experiments In our experiments, we intend to show that SAGE produces executable code that is comparable to hand generated code for the targeted high performance computing platform for a selected set of benchmark applications. It is understood that tools which
can auto generate code that can surpass performance wise hand coded application implementations is still work to be done. It is our intention to show that an application or system engineer can develop an application (conceptual, first cut or final version) using SAGE quickly and that the resulting solution is comparable both in performance and code size to hand coded versions. Additionally, the application can be refined for better performance by using the SAGE visualization software and by adding hand tuned functions to the SAGE reuse library for the target hardware platform. 3.1 Benchmark Applications The benchmark applications chosen are algorithms that have been used by Rome Laboratories and MITRE in their benchmarking efforts of COTS based high performance computing systems such as from Mercury, Sky and CSPI. The applications chosen for our experiments are the parallel 2D FFT and the parallel distributed Corner Turn executing on a 1024x1024 data matrix. The two applications and data set were provided by CSPI. Performance results of the two applications executing on a Mer2 cury, CSPI, SIGI and SKY platforms were obtained from MITRE . For each of the hardware platforms, MITRE performed measurements using several node configurations (node counts). Additionally, high performance-computing vendors developed their own MPI implementation optimized for their hardware. The traditional MPI implementation have a built in function for performing the corner turn operation, namely the MPI_All_to_All function, each vendor implemented their own version tailored to their respective hardware for the most optimal performance. 3.2 Target Machine The target hardware platform for performing the SAGE glue code and run-time experiments was chosen to be a 200 MHz Power PC 603e based high performance computing system provided by CSPI. The target system contained two quad-Power PC boards with the VxWorks operating system housed within a 21 Slot VME chassis. Each Power PC has 64 Mbytes of DRAM and can communicate through 160 MBytes Myrinet fabric interconnect to each other (intra-board) and to the outside world (interboard). CSPI also provided all software including the VxWorks operating system, MPI implementation and the CSPI ISSPL functional libraries. As part of the Honeywell IR&D program and corporate alliance with CSPI, the SAGE tool was ported to CSPI target hardware platform. The term “port” corresponds to the capturing of all knowledge associated with programming to the CSPI hardware. Such knowledge that is captures includes the ISSPL function libraries on to the appropriate shelves, the CSPI board specific run-time software and programming methodology. It is expected that within the year, additional hardware platforms will be folded into the SAGE knowledge repository. It should be noted that SAGE hides the complexities of programming to COTS high performance computing hardware from the application developer. Once an application is developed, that application becomes portable to other SAGE supported platforms.
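Section 3.1 notes that the corner turn is conventionally built on MPI's all-to-all exchange (the standard routine MPI_Alltoall, which the text calls MPI_All_to_All). The sketch below shows one common way such a distributed corner turn (transpose) can be written: pack a tile per destination rank, exchange with MPI_Alltoall, and locally transpose each received tile. It is a generic illustration, not the CSPI or SAGE implementation.

```c
/* Minimal distributed corner turn of an N x N matrix, row-distributed
 * across the ranks; N is assumed divisible by the rank count. */
#include <mpi.h>
#include <stdlib.h>

#define N 1024

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int rows = N / nprocs;                          /* local rows per rank */
    float *local = malloc((size_t)rows * N * sizeof *local);
    float *sendb = malloc((size_t)rows * N * sizeof *sendb);
    float *recvb = malloc((size_t)rows * N * sizeof *recvb);
    for (int i = 0; i < rows * N; i++)
        local[i] = (float)(rank * rows * N + i);    /* dummy data */

    /* Pack: block p holds the rows x rows sub-tile destined for rank p. */
    for (int p = 0; p < nprocs; p++)
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < rows; j++)
                sendb[(size_t)p * rows * rows + i * rows + j] =
                    local[(size_t)i * N + p * rows + j];

    /* Exchange one tile with every other rank. */
    MPI_Alltoall(sendb, rows * rows, MPI_FLOAT,
                 recvb, rows * rows, MPI_FLOAT, MPI_COMM_WORLD);

    /* Unpack with a local transpose of each received tile. */
    for (int p = 0; p < nprocs; p++)
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < rows; j++)
                local[(size_t)j * N + p * rows + i] =
                    recvb[(size_t)p * rows * rows + i * rows + j];

    free(local); free(sendb); free(recvb);
    MPI_Finalize();
    return 0;
}
```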
3.3 Experiments and Test Method The experiments for the SAGE auto glue code generation and run-time components were conducted in four steps. First, the application is modeled using the Designer. Second, the different node configurations and mappings are chosen through the Designer. Third, the glue code is auto-generated for each node configuration and mapping. The fourth step is the actual execution: each node configuration and mapping is executed ten times, where each execution consists of 100 iterations. The final performance number for that configuration is the average of the 100*10 results. When results are reported, a period is defined to be the time between input data sets while latency is the time required to process a single data set. The latency corresponds to the time from when the first data leaves the data source to the time the final result is output to the data sink.
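As a rough illustration of the measurement procedure (ten runs of one hundred iterations each, averaged into a single number), a minimal C++ timing harness might look like the following; the application call and output format are placeholders invented for the example, not part of SAGE or the benchmark suite.

#include <chrono>
#include <cstdio>

// Placeholder for one iteration of the benchmark application (one data set).
static void run_application_once() { /* application work would go here */ }

static double now_seconds() {
    using clock = std::chrono::steady_clock;
    return std::chrono::duration<double>(clock::now().time_since_epoch()).count();
}

int main() {
    const int runs = 10, iterations = 100;
    double total = 0.0;
    for (int r = 0; r < runs; ++r) {
        const double start = now_seconds();
        for (int i = 0; i < iterations; ++i)
            run_application_once();
        total += now_seconds() - start;
    }
    // Average over all 10*100 executions, i.e. the per-data-set period
    // (the latency of a single data set would be measured separately).
    std::printf("average period: %f s\n", total / (runs * iterations));
}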
3.4 Results The results of the experiments are shown in Table 1.0, which gives the actual performance numbers for the two benchmark applications executing on 4- and 8-node configurations with 256, 512 and 1024 data sets. Each entry denotes the average of the 10*100 executions, with cumulative averages shown in the last column. The table shows that the SAGE auto-generated code executed at an average of 86% of the performance of the hand-coded versions on the CSPI hardware. For the distributed corner turn, the SAGE generated code running on the CSPI platform performed comparably to the hand-coded CSPI version, with an average overhead of 20%. For the 2D FFT, SAGE showed, on average, a 17% cost in overhead.

Table 1.0 Comparison of hand-coded and auto-generated code for CSPI
(the columns labeled 4 and 8 give the number of processing nodes)

Application    Array Size     CSPI Hand Coded      SAGE AutoGen         % of Hand Coded    Average
                                 4         8          4         8          4        8
2D FFT         256 x 256       14.8      8.496      15.8      9.4        93.7     90.4      92.0
2D FFT         512 x 512       63.77    33.902      70.22    37.75       90.8     89.8      90.3
2D FFT         1024 x 1024    267       137        312      169          85.6     81.1      83.3
Corner Turn    256 x 256        6.68     4.27        7.786    4.753      85.8     89.8      87.8
Corner Turn    1024 x 1024     86.53    52.2       108.822   65.135      79.5     80.1      79.8
Cumulative average                                                                           86.7
For the distributed corner turn, the SAGE generated code running on the CSPI platform performed comparably to the hand-coded CSPI version, with on average a 25% cost in overhead. A performance hit was taken on a two-node configuration. Here, the SAGE run-time buffer management scheme assigns unique logical buffers to the data
per function, which can cause extra data access times when compared to the CSPI implementation. For the 2D FFT, SAGE showed on average a 20% cost in overhead.
4 Conclusions The SAGE tool suite provides a powerful graphical and interactive interface for the creation of executable systems and applications based on customer defined specifications, with fewer errors and an order of magnitude reduction in development time. The SAGE auto glue code generation and run-time components delivered and executed the two benchmark applications at 77.5% of the performance of the hand-coded versions. Although the performance of the auto-generated code is not equal to that of the hand-coded versions, tools that can generate such code are many years away. Work is currently underway to improve the performance of the glue code generation component so that it will reach levels of 90% of hand-coded performance. The use of SAGE provides the application or systems engineer a way to rapidly develop an application on the target system with reasonable assurances that the performance of the auto-generated code for the application will not be orders of magnitude different from hand-coded versions. And since the current SAGE tool makes the target system transparent to the engineer, the application developed is portable to other SAGE supported hardware platforms. The designer simply needs to re-generate the glue code for the new hardware platform. The time saved by using SAGE can now be more effectively used to perform such tasks as improving the application's performance on the current hardware platform, trading and testing the application on other hardware platforms, and moving on to the next project.
References
1. Honeywell's Systems and Applications Genesis Environment (SAGE™) Product Line, http://www.honeywell.com/sage.
2. Games, Richard, "Cross-Vendor Parallel Performance," Slides Taken from: Real-Time Embedded High Performance Computing State-of-the-Art, MITRE Corporation, Presented at DARPA Embedded Systems PI Meeting, Maui, Hawaii, March 16, 1999.
Developing an Open Architecture for Performance Data Mining
David B. Pierce1 and Diane T. Rover2
1 MS 1C1, Smiths Industries, 3290 Patterson Ave SE, Grand Rapids, MI, 49512
[email protected]
2 Dept. of Elec. and Computer Engineering, Michigan State Univ., E. Lansing, MI, 48824
[email protected]
Abstract. Performance analysis of high performance systems is a difficult task. Current tools have proven successful in analysis tasks but their implementation is limited in several respects. Closed architectures, predefined analysis and views, and specific platforms account for these limitations. Embedded systems are particularly affected by these concerns. This paper presents an open architecture for performance data mining that addresses these limitations. Comparisons of the architecture with current tools show its capabilities address a wider range of system phases and environments.
1
Introduction
Performance analysis of complex systems is a difficult task. As a result, methods and tools to manage and reduce performance data to useable quantities or useful representations are the focus of significant research. Some successful tools include Pablo [1], Paragraph [2], and SPI [3]. These tools receive events from files, from embedded instrumentation, or from an Instrumentation System (IS). They generally have a predefined set of views selected from a menu, some have options to select the data to display, and libraries or executables are provided to compile the tool. Despite their successes in the lab environment, this class of tools is not an integral part of embedded and high performance systems because: • The usage environment is limited to a specific OS or target HW, • The design/source is protected or incomplete, limiting ability for integration, • The views, processing algorithms, and queries (some tools have no query mechanism) are predefined, limiting flexibility for specific problems, • The data sources/sinks are limited, limiting the use of the system and its results. These tools are geared toward a lab environment, but we want to extend performance analysis to other environments. This will support embedded high performance systems, which can utilize performance analysis results for greater efficiency, user directed fault tolerance, and environmental tolerance (the recognition of, and corrective action for, operational conditions exceeding worst-case design scenarios).
J. Rolim et al. (Eds.): IPDPS 2000 Workshops, LNCS 1800, pp. 823-830, 2000. Springer-Verlag Berlin Heidelberg 2000
A solution to these limitations is to define a Performance Data Mining Architecture (PDMA) that: 1) has an open architecture described in a format consistent with a wide range of system design tools, 2) that addresses the data mining capabilities needed for large quantities of data, and 3) is flexible and extensible (concerning views, algorithms, queries, data exchange, and data storage) allowing for a wide range of systems and interfaces, and further development of the individual pieces as specific systems and applications dictate. This paper presents the definition of such an architecture, with comparisons to current tools showing the benefits and advantages of such an approach.
2
Unified Modeling Language
To enable the widespread use of a PDMA in system designs, the development of a PDMA must address the system design environment. System designs are documented, reviewed, and analyzed in the early stages through the use of modeling techniques, such as Structured Analysis and UML. Following stages of system design use the model created to generate requirements, test plans and procedures, and in some cases, to generate source code headers and/or source code. While the most effective technique is a subject for debate [4], the great utility of these methods is not. We have utilized the Unified Modeling Language [5] (UML) for the development of the PDMA. UML is widely accepted within the systems community, and its usage is increasing [6]. By using UML, the design of a PDMA is expressed in the same format as the system design itself, promoting ready implementation into design, analysis, test, and documentation. There are a significant number of tools that can analyze, simulate, and generate code from suitable UML diagrams, securing a spot for data mining at the ground floor of a system, and making it one of the important features of a complete system.
3
A Performance Data Mining Architecture
To begin, we examine current performance analysis tools, which have been successful in at least one phase of a system lifespan. These tools have one or more common tasks: 1) data input (performance events or statistics), 2) computation of statistics or data points, 3) display of data, and 4) user interface. A query function is also present on a few tools. These common tasks summarize a significant portion of the desired system. However, there are four additional tasks that extend the usage environment of these tools. First, a database function to provide more flexibility to queries and more support for long term or relational computations is needed. Second, an output function for an IS, providing the ability to change instrumentation, based on current data. Third, provision for data exchange with system applications is important, and will support
contextual analysis of the system model, requirements, and testing in a wide range of operational environments. These common tasks then comprise the Use Cases within the Use Case diagram and form the basic requirements. The Use Case diagram is shown in Fig. 1. There is much detail underlying these simple use cases that differentiates the desired architecture from existing tools.
Fig. 1. A Use Case diagram showing the use cases, actors, and protocols for the PDMA (use cases: Data Input, Storage, Processing, Queries, Feedback, Views, and User Interface; actors: IS, System Applications, Input Devices, and Output Devices). In this diagram, actors on the system represent classes of outside actors, and not individual items. The <<uses>> relation indicates that a use case uses the functionality of another use case (the end with the arrow)
A UML Class Diagram shows the classes that implement the architecture design and provides a vehicle for describing the details of the PDMA. The PDMA consists of three primary classes, System Interface, Analysis Context, and Data Warehouse. These primary classes are separated from each other to preserve data hiding principles and promote independence among system threads. These two principles provide flexibility for many specific system implementations [7]. The System Interface class (shown in Fig. 2) responds to large numbers of data inputs with short processing routines. Data inputs include performance events and statistics from the IS and configuration and loading data from system applications. This demands a relatively high priority thread to prevent queue overruns. It accepts data, converts to internal format as necessary, and routes the data. These items must be done quickly to prevent stealing too much time from other system threads.
It is also responsible for the output of data to the IS and system applications. ISs accept feedback during operation to control the amount of instrumentation collected from specific instrumentation points. In this case, the data is formatted for output and routed to the IS using the appropriate interface. System applications can also accept feedback to control message routing and priority, system thread priority, and other features that may be determined by current and future research. Flexibility is required to support different input/output sources. Local file access, shared memory, and object brokering from/to other nodes and applications are supported by the classes defined. The classes do not require a specific object broker protocol, but provide an interface for object brokers. Existing object brokers can be utilized when the system platform allows for such. However, systems like embedded high performance systems often require custom solutions for speed and platform. The classes provided support this environment with interfaces designed for this task.
Fig. 2. Class Diagram showing the System Interface primary class (secondary classes: System Proxy, Object Translation, System Thread, and Object Routing)
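A minimal C++ sketch of the System Interface responsibilities described above (accept data, translate it to an internal format, route it, and emit feedback) is shown below; the class and member names are invented for illustration and do not come from the PDMA itself.

#include <functional>
#include <string>
#include <vector>

struct PerformanceEvent { int sourceId; std::string payload; };

class SystemInterface {
public:
    using Route = std::function<void(const PerformanceEvent&)>;

    void addRoute(Route r) { routes_.push_back(r); }

    // Called at high priority for every incoming event; kept short so it does
    // not steal too much time from other system threads.
    void onRawInput(int sourceId, const std::string& raw) {
        PerformanceEvent ev = translate(sourceId, raw);   // convert to internal format
        for (const auto& route : routes_) route(ev);      // route to consumers
    }

    // Feedback path back to the instrumentation system or a system application.
    void sendFeedback(const std::string& msg) { outbox_.push_back(msg); }

private:
    static PerformanceEvent translate(int id, const std::string& raw) {
        return PerformanceEvent{id, raw};   // real code would parse/convert here
    }
    std::vector<Route> routes_;
    std::vector<std::string> outbox_;
};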
The Data Warehouse class (shown in Fig. 3) handles persistent data and responds to data storage requests and data search requests. It is also responsible for agents or database requests that control size, which is important for embedded solutions with fixed memory. Data storage requests may include the computation of relational information that is stored with the performance data. This class is separated to allow the use of a custom database structure or an off-the-shelf database application, whichever is dictated by the specific application. The processing priority of this task is likely to be low, and the task is likely to require more time than the other classes due to the nature of search requests; this is accommodated by its separation from the other classes. The interface to this class is tightly controlled through data storage requests and data query requests, enabling an update on either side of the interface without affecting the other. An important factor for flexibility of the data warehouse implementation is relational information. The interface supports relational data requests and the formation of new relational information. Some key techniques for data mining include the search for new association rules, clustering, classification, sequential patterns, and outlier
detection [8]. These techniques are supported in this design, including the use of relational information. The combined use of system application data and performance data also provides new analysis possibilities for environmental tolerance and corrective action.
Fig. 3. Class Diagram showing the Data Storage primary class (secondary classes: Storage Request, Query Context, System Thread, and Database Agent)
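The narrow storage/query interface can be illustrated with a small C++ sketch; the names and the trivial in-memory store are assumptions made only for the example, not part of the PDMA.

#include <map>
#include <string>
#include <vector>

struct StorageRequest { std::string key; double value; };
struct QueryRequest   { std::string keyPrefix; };

class DataWarehouse {
public:
    void store(const StorageRequest& req) {
        data_[req.key].push_back(req.value);
        enforceSizeLimit();                    // database agent: bound memory use
    }

    std::vector<double> query(const QueryRequest& req) const {
        std::vector<double> result;
        for (const auto& [key, values] : data_)
            if (key.rfind(req.keyPrefix, 0) == 0)          // prefix match
                result.insert(result.end(), values.begin(), values.end());
        return result;
    }

private:
    void enforceSizeLimit() {
        // An embedded system with fixed memory would trim old samples here.
    }
    std::map<std::string, std::vector<double>> data_;
};

Because callers see only store() and query(), either side of this interface can change (for example, swapping in an off-the-shelf database) without affecting the other.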
The Analysis Context class (shown in Fig. 4) contains classes used for a specific analysis problem or context. Each performance analysis request has a specific context and can best be addressed by using an analysis context class instantiation, often running in its own thread. Allowing a separate thread for analysis contexts provides flexibility for assignment and priority of the thread. This supports a wide range of system applications. It is especially important to embedded high performance systems, where changing environmental conditions can be accounted for with dynamic adjustment of analysis threads. Secondary classes under the Analysis Context class include classes for algorithms, display constructs, interfaces and translations to display hardware that is not high resolution CRT, contextual (display) and operator entered query formation, user interface, and others. The criterion for the definition of these classes is to allow the addition of new algorithm, view, and other objects without affecting the existing objects. Further, the specific system implementation, including hardware and software, must not affect the underlying PDMA, only a few classes defining the interface to such items as hardware or system applications. The method for separating these analysis context objects is the interface to each of the objects. The Algorithm class can have many possibilities for computation within the instantiated object, but the interface to the View class, the Query class, etc., is maintained. A View object can then be utilized to display computed data, formatting the computed data in anonymous methods (from the algorithm viewpoint). The Display class receives View object data in a standard interface and transforms or translates the data to the specific hardware device involved in the Analysis Context instantiation. This may involve high resolution CRTs, character screen displays, banks of LEDs, alarms, etc.
Fig. 4. Class Diagram showing the Analysis Context primary class (secondary classes: Query Context Component, Algorithm Component, View Component, Feedback Component, and Display Component)
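The separation of the Algorithm, View, and Display interfaces described above might be sketched in C++ as follows; the concrete classes (including the trivial mean algorithm) are invented purely to show how new objects can be added without touching existing ones.

#include <string>
#include <vector>

struct ViewData { std::string title; std::vector<double> points; };

class Algorithm {               // any computation, same output interface
public:
    virtual ~Algorithm() = default;
    virtual ViewData compute(const std::vector<double>& samples) = 0;
};

class Display {                 // any output device, same input interface
public:
    virtual ~Display() = default;
    virtual void show(const ViewData& view) = 0;
};

class MeanAlgorithm : public Algorithm {
public:
    ViewData compute(const std::vector<double>& s) override {
        double sum = 0.0;
        for (double v : s) sum += v;
        return {"mean", {s.empty() ? 0.0 : sum / s.size()}};
    }
};

class AnalysisContext {         // ties one algorithm to one display, per thread
public:
    AnalysisContext(Algorithm& a, Display& d) : algorithm_(a), display_(d) {}
    void run(const std::vector<double>& samples) {
        display_.show(algorithm_.compute(samples));
    }
private:
    Algorithm& algorithm_;
    Display& display_;
};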
The Feedback class performs a similar function to the Display class, as it accepts the outputs from the View class, but it transforms or translates the data for feedback to the IS or system applications. It is separated from the View class because the necessary interface for each is unique enough to warrant it. This is shown by the types of data required by Display objects and Feedback objects. The previous figures did not show the relationships between the classes. Relationships exist in the form of data objects, including performance data objects, query request objects, view data objects, etc. Two of these relationships are shown in Fig. 5. These objects determine a large part of the interface between the classes. Several more relations have not been shown.
4
Discussion
Current tools handle the presentation of data by providing several displays of data, such as Gantt, histogram, and pie charts, which can be selected. Some tools allow the user to select the data types to be displayed. In the PDMA, this capability is extended in an object-oriented method. A Gantt chart object is a class containing basic parameters such as data orientation, scale, etc. Instantiating a Gantt Chart object accepts the interface parameters and builds a display view (within the object's scope). The internal view of this Gantt Chart is not what is presented to the user, however. Additional modules within the display interface take the display parameters and map them to the display hardware. The display hardware will not always be a high
resolution CRT, the common display hardware in the lab. Embedded high performance systems may utilize character displays, banks of LEDs, klaxons, or some other hardware device. The display class handles this responsibility and allows the use of any views with any display technology.
Fig. 5. Class Diagram for the PDMA showing two of the relationships between the classes (the classes shown are System Interface, Analysis Context, and Data Warehouse; the relationship objects are System Data and Query Request)
Paragraph and other tools limit the user to the predefined selections, since the system does not provide user definition of views, and the source cannot be easily modified. Using the PDMA, a user sets up an analysis context, including an instantiation of the desired view, scaling, the data to be displayed in the view and its orientation, etc. Given this interface, the user can define these during operation. Further, the user can define new objects for views, etc., during operation with the user interface. The analysis context can also be instantiated as a performance monitor. In this case, no display is instantiated until an event of interest appears, at which point the display is created. Additionally, priorities can be assigned to the context, and actions assigned to its results as well. The user can assign priorities or interface to a scheduling algorithm to control the scheduling of tasks to meet the requirements during any specific operating environment. Embedded high performance systems have complex operational environments, which are difficult to accurately predict and design for. Providing capabilities for the operator, coupled with system support, provides a more flexible environment and greater operational success. The displays allow interactive queries, such as entered queries or button clicks in the context of a display. Each of the query types resolves the display context for mouse clicks, or resolves the textual entry of a query. This provides the interface to the Data Warehouse class, maintaining a simple, constant interface to the database.
5
Future PDMA Research
This paper presents the definition of a PDMA considering a wide range of systems from a general point of view. It is purposely designed to promote future analysis research into view and algorithm technology, while allowing that technology to be readily exploited. Research on views, algorithms, data relationships, etc., is expected.
6
Conclusions
A Performance Data Mining Architecture (PDMA) has been presented that objectifies and extends current tools, directly impacting embedded high performance systems. The design of the PDMA matches the design language of other systems, allowing the PDMA to be readily integrated. The PDMA provides support and interfaces for objects such as views and algorithms that do not require redesign of the PDMA. Finally, the PDMA allows for portability because it is not dependent on a specific instrumentation system, a specific operating system, or the hardware and software limitations of a fielded system.
Acknowledgements This work was funded in part by DARPA Contract No. DABT63-95-C-0072 and NSF Grant No. ASC-9624149.
References
1. D. Reed et al., "Virtual Reality and Parallel Systems Performance Analysis", IEEE Computer, pp. 57-67, November 1995.
2. M. Heath and J. Etheridge, "Visualizing the Performance of Parallel Programs", IEEE Software, 8(5), September 1991, pp. 29-39.
3. D. Bhatt, et al., "SPI: An Instrumentation Development Environment for Parallel/Distributed Systems", Proceedings of the 9th International Parallel Processing Symposium, April 1995.
4. R. Agarwal, P. De, and A. Sinha, "Comprehending Object and Process Models: An Empirical Study", IEEE Transactions on Software Engineering, Vol. 25, No. 4, July 1999, pp. 541-544.
5. UML Documentation [Online], available at http://www.rational.com/uml/, April 30, 1999.
6. B. P. Douglass, Real-Time UML: Developing Efficient Objects for Embedded Systems, Addison Wesley Longman, Inc., 1998.
7. L. Bass, P. Clements, and R. Kazman, Software Architecture in Practice, Addison Wesley Longman Inc., 1998.
8. A. Zomaya, T. El-Ghazawi, and O. Frieder, "Parallel and Distributed Computing for Data Mining", IEEE Concurrency, Vol. 7, No. 4, October 1999, pp. 11-13.
A 90k gate “CLB” for Parallel Distributed Computing
Bruce Schulman1 and Gerald Pechanek2
1 BOPS, Inc. Palo Alto, CA
[email protected]
2 BOPS, Inc. Chapel Hill, NC
[email protected]
Abstract. A reconfigurable architecture using distributed logic block processing elements (PEs) is presented. This distributed processor uses a low-cost interconnection network and local indirect VLIW memories to provide efficient algorithm implementations for portable battery operated products. In order to provide optimal algorithm performance, the VLIWs loaded to each PE configure that PE for processing. By reloading the local VLIW memories, each PE is reconfigured for a new algorithm. Different levels of flexibility are feasible by varying the complexity of the distributed PEs in this architecture.
1 Introduction As the complexity of portable products has increased, along with the need to support multiple, evolving standards, processor-based solutions have become a requirement at all levels of product architecture. While a processor provides the needed flexibility, it must do so in an energy efficient and area efficient manner. Since the type of processing required for different products includes communications, video, graphics, and audio functions, multiple data types and algorithmic computational needs must be accommodated. Due to this wide diversity of requirements, many approaches to providing efficient processing capability in each application have been proposed. These solutions include custom designed ASICs, general-purpose processors with DSP packed-data type instruction extensions, different DSPs in each product, and reconfigurable processor designs using FPGAs. ASICs lack flexibility in the face of changing standards and changing product requirements, measured as their high cost to support changes or multiple similar instances. General-purpose processors for embedded applications are inefficient in energy and area. Reconfigurable processors using FPGAs, even with the latest process improvements, are also inefficient in implementation area and energy use. This is especially true for FPGA implementations of arithmetic units, which are still very large and slow, compared to ASIC or custom arithmetic designs [1]. Even so, there is much work being done to combine the advantages of microprocessors and FPGAs for reconfigurable co-processing units, such as DISC [2] and GARP [3]. These systems may mix general control processors, fixed function ASICs, and FPGAs in a final system, such as Pleides [4]. In addition, companies such
J. Rolim et al. (Eds.): IPDPS 2000 Workshops, LNCS 1800, pp. 831-838, 2000. Springer-Verlag Berlin Heidelberg 2000
as Xilinx and Altera provide FPGA’s and design solutions for specific reconfigurable algorithmic use [5, 6]. The difficulty with state-of-the-art FPGA designs is that the area, performance, and power cannot compete with standard cell or custom designed logic. While using FPGAs seems to hold promise, many difficult problems exist that must be solved. Two problems with FPGA designs are the programming model/tools, and consistent and efficient use of silicon area. It is important that each product has a consistent programming model and a common set of development tools across the numerous applications. It is equally important to have a programmable design that can efficiently provide high performance and low power in the intended products. Research attempting to improve the implementation efficiency of FPGA-based reconfigurable processors proposes to increase the complexity of the Configurable Logic Blocks (CLBs), to include circuitry better suited for arithmetic use [7]. These additions attempt to provide application specific improvements to the original CLB definition. The goal is still to solve the basic problem of providing processor-level flexibility in a cost and performance efficient manner. The purpose of a reconfigurable processor is to make effective and efficient use of the available logic for a number of applications by programming the arrangement and interconnection of the logic. We propose to use standard ASIC processes for a set of flexible arithmetic units in a standard PE definition that is programmed through local VLIW memory. Programming the PE can be viewed as a method to optimize the logic make-up of the PE for different algorithms. With our scalable, parallel distributed processing configuration, the use of the available resources can be configured appropriately, cycle-by-cycle, to meet the requirements of each application. Further, these features and capabilities are provided in a single architectural definition using a consistent and standardized tool set. In this paper we present the BOPS ManArray parallel distributed computing architecture and show that by reprogramming the PEs' logic, very high-performance computing can be provided across multiple applications.
2 ManArray Parallel Distributed Computing The BOPS iVLIW PE is based upon the BOPS ManArray architecture, a parallel-distributed computer architecture targeting System-On-Chip applications. The ManArray architecture supports from 1 to 64 iVLIW PEs and a Sequence Processor (SP) for controlling the array of PEs. The SP is uniquely merged into the PE array for maximum efficiency to provide the SP controller with access to the ManArray network. The ManArray network interconnects clusters of PEs to provide contention-free, scalable, single-cycle communications. The distributed processor uses two basic building blocks, as shown in Figure 1. The PE consists of a register file, a set of execution units, a cluster switch as an interface to the ManArray network, local data memory, and local VLIW memory (VIM). The SP adds an instruction fetch unit and uses the same building block PE elements. Various core processors can be developed from these two reusable IP blocks.
Figure 1. The two ManArray building blocks: the SP (Sequence Processor; control and sequential functions) and the PE (Processing Element; slave resource for parallel tasks). Each contains data memory, a VLIW Instruction Memory (VIM), execution units (ALU, MAU, DSU, Load, Store), a register file, and a cluster switch interface; an Address Generation Unit is also shown.
Figure 2. ManArray configurations: 1x1, 1x2, 2x2, 2x4, and 4x4 arrays of PEs with a merged SP/PE.
Figure 2 shows a 1x1, 1x2, 2x2, 2x4, and a 4x4 array processor. Each 2x2 cluster contains an SP control processor, allowing reconfiguration of larger arrays to operate as subset array processors. For example, two 2x2 array processors can be configured in the 2x4 array processor system. The general organization of the BOPS iVLIW PE is shown in Figures 3 and 4, which depict the three main interfaces to each PE. These interfaces are an X-bit instruction bus, Y-bit data busses, and a single-port send/receive interface to the cluster switch that interconnects the PEs in the ManArray topology. The instruction format is typically X=32 bits, but 16-bit and 24-bit formats are not excluded depending upon an application's needs. Internal to the PEs are three storage elements: the local PE data memory, the KxS*X iVLIW Memory (VIM), and a multiported NxM-bit register file. The number of VLIW entries, K, is typically less than 128, although larger iVLIW memories are not precluded. The number of instruction slots, S, can vary from 1 to 8, although typically between 2 and 5 instruction slots would be used (Figures 3 and 4 respectively). Depending on the arithmetic VLIW configuration, the local PE data memory can be a one- or a two-port memory. The two-port local PE memory is configured into two Y-bit banks which support byte, halfword, word, and double word loads with Y=32 bits. With the twin banks, one memory bank can be loading and storing data simultaneously to/from the PE's register file while the DMA unit is loading the other bank. This effectively hides DMA delays, and supports a data streaming approach to processing on the array. Based upon present application evaluations, two banks of 512x32 bits are typically proposed, although there is no architectural limit. In a 5-issue iVLIW, the VIM typically consists of up to 64x160 bits of iVLIW memory with 160-bit read-out capability. The VIM is loaded sequentially, one X-bit instruction at a time, after being primed by a LoadV delimiter instruction. The 160-bit field is made up of 5 Y-bit instruction slots, with each slot associated with an execution unit. In addition, an NxM-bit 8-read-port, 4-write-port register file is available. This register file is split into two banks of 16x32 bits, allowing the architecture to support 64-bit data flows as well as 32-bit data flows. One bank is associated with the even register addresses and the other bank is associated with the odd register addresses. The split register-file design takes full advantage of the instruction set architecture and reduces the number of ports required per register bank. Finally, the ManArray architecture supports up to eight execution units in each PE. The first release uses five execution units: one Load Unit (LU), one Multiply Accumulate Unit (MAU), one Arithmetic Logic Unit (ALU), one Data Select Unit (DSU), and one Store Unit (SU). The execution units support 1-bit, 8-bit, 16-bit, and 32-bit fixed point data types, and 32-bit IEEE floating point data, to meet the requirements of a large number of applications. For high-performance applications, each PE supports 32-bit and 64-bit packed data operations that are interchangeable on a cycle-by-cycle basis. Specifically, the MAU supports quad 16x16 Multiply and Accumulate operations per cycle, and the ALU performs standard adds, subtracts, four 16-bit absolute-value-of-difference operations, and other DSP functions. The DSU performs bit operations, shifts, rotates, permutations, and ManArray network communication operations.
Supporting the computational elements are 64-bit load and store units. It should be noted that there is a bypass path around the VIM
allowing single 32-bit instructions to be executed separately in classical SIMD mode in each PE and consequently on the array. We use a linearly scalable switch fabric to connect the PEs with an interconnect maximum length of 2 for large embedded arrays, and length 1 for orthogonal interconnected PEs [8]. This ManArray network is integrated in the architecture of the PEs such that data movement between PEs can be programmed and overlapped with other arithmetic operations and load/store operations. This interconnect is programmable per cycle to allow many different interconnect patterns to match the current processing task.
Figure 3. A 2-issue iVLIW PE: X-bit instruction bus, Y-bit data buses, local PE memory, iVLIW memory with a bypass path for simplex SIMD operations, NxM register file, parallel decode and execute units (Ex1, Ex2), and the interface to/from the cluster switch.
Figure 4. A 5-issue iVLIW PE, organized as in Figure 3 but with five execution slots (Ex1-Ex5) read from the iVLIW Memory (VIM).
The programmer controls an array of PEs by writing a program for the SP, which includes the personalization of the PEs' VLIW memory for the intended algorithm or algorithms to be executed. In addition, the SP controls the DMA unit to move data through the PEs, while controlling the program flow to perform the desired computation. Depending upon the size of the VLIW memory and the number of VLIWs needed for each algorithm in an application, the optimized set of VLIWs for multiple tasks can be resident in the VIM, allowing instantaneous reconfiguration as tasks change. Even with small VIMs that must be reloaded for each task, loading a five-issue VLIW entry into all the PEs' VIMs at 100 MHz takes only 60 nsec. The loading steps are a Load VLIW instruction followed by the five instructions to be loaded into each VLIW in each PE in parallel. To load 32 VLIWs in all the PEs, sufficient for many tasks, the total load time is 1.92 µsec. A state-of-the-art reconfigurable computer takes approximately 100 µs to reconfigure, a relative factor of 50x [9].
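Assuming one instruction issue per 10 ns cycle at 100 MHz, the quoted figures follow directly (a worked version of the arithmetic, not additional data):

  (1 LoadV instruction + 5 slot instructions) x 10 ns = 60 ns per VLIW entry
  32 entries x 60 ns = 1,920 ns = 1.92 µs
  100 µs / 1.92 µs ≈ 52, i.e. roughly the 50x factor quoted above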
3 Evaluation In a TSMC .25u ASIC flow process, the 5-issue iVLIW PE has a worst-case clock rate of 100 MHz. Higher speeds are available utilizing more custom design methodologies and/or synthesizing the Verilog soft macro cores to different processes. With full capabilities in the PEs, including both fixed- and floating-point MAU, ALU, and DSU, which also includes a state-of-the-art single-precision floating-point divide/square root unit, a 2x2 processor array with DMA, 1 Mbit of SRAM, and system interfaces including PCI 2.0 (32-bit/33 MHz), SDRAM (PC-100, 64-bit), and a host processor interface (MIPS SYSAD bus compatible with the QED5231) is 90 sq mm. A fully featured fixed-point 5-issue iVLIW PE requires 90K gates excluding the register file and memory elements. Depending upon performance needs, the PEs can be reduced in size as indicated in Figure 3 by reducing the number of iVLIW issue slots, which reduces the number of register file access ports, subsetting the instruction set appropriately, and reducing the local PE memory requirements. This scalability provides great flexibility while still maintaining the same instruction set architecture. For a 2x2 array of 5-issue iVLIW PEs with an SP, the computations on 16-bit data per cycle include: 16 multiplies, eight 32-bit sums, eight 40-bit accumulates, 16 absolute differences, 16 rotates, 16 loads and 16 stores. At 100 MHz, this equates to ~10 billion operations per second (bops). The FFT is one application that uses the performance and data flow capability of the ManArray. For Discrete MultiTone (DMT) based ADSL or VDSL, a 256-point FFT is needed. OFDM-based digital terrestrial television will use larger FFTs. The BOPS 2x2 can continuously process 256-point 16-bit complex FFTs in less than 5 us [10]. This includes all the data movement and address reordering (bit reversed or digit reversed). The Xilinx LogiCore data sheet "256-Point Complex FFT/iFFT V1.0.3" [5] states that it takes 1643 logic slices to do a 256-point 16-bit complex FFT in 40 us. To get to 5 us per block, 8 such chips would be needed (8x less compute density). It is generous to not account for off-chip interconnect delays and to forgive the fixed scaling in this estimate.
The basis of the most popular image compression algorithms (JPEG, MPEG) is the 8x8 iDCT. The algorithm takes in 8-bit data, but needs higher dynamic range to meet the S/N requirements of IEEE STD 1190. A 2x2 array can continuously process 8x8 blocks at a rate of 128 MBytes/second [11]. The Altera Discrete Cosine Transform AMPP datasheet [6] shows an 8x8 iDCT processing rate of 17.5 MBytes/second, so eight of them would be needed to keep up.
4 Conclusions BOPS offers the highest performance DSP IP in the industry and targets mass-market applications in 3D graphics, multimedia, Internet, wireless communications, VoIP and digital imaging. With this new level of performance and cost/performance, Embedded High Performance Computing can become a reality in consumer products, from 3G cell phones with streaming video, to broadband Internet, to higher performance 3D graphics in set-top boxes, games, and PCs. The ManArray Architecture, including PEs, SP, DMA and Cluster Switch, delivers the highest performance, scalable, reusable, reconfigurable DSP IP in the industry. Compared to FPGAs, BOPS delivers more than an 8x improvement in performance at 100 MHz in standard ASIC flow parts. Depending upon the array size, BOPS solutions cover the range from 1 to over 100 billion integer math operations/second.
References
1. O. T. Albaharna, P. Y. K. Cheung, and T. J. Clarke, "Area & Time Limitations of FPGA-based Virtual Hardware," Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers and Processors, pp. 184-189, Cambridge, Mass., October 10-12, 1994, IEEE Computer Society Press.
2. M. J. Wirthlin and B. L. Hutchings, "DISC: The Dynamic Instruction Set Computer," Proceedings of the SPIE, Field Programmable Gate Arrays (FPGAs) for Fast Board Development and Reconfigurable Computing, Vol. 2607, pp. 92-102, 1995.
3. J. R. Hauser and J. Wawrzynek, "Garp: A MIPS Processor with a Reconfigurable Coprocessor," Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines (FCCM), Napa, CA, April 1997.
4. Marlene Wan, Hui Zhang, Varghese George, Martin Benes, Arthur Abnous, Vandana Prabhu, Jan Rabaey, "Design Methodology of a Low-Energy Reconfigurable Single-Chip DSP System", Journal of VLSI Signal Processing, 2000.
5. Xilinx website: http://www.xilinx.com/
6. Altera website: http://www.altera.com/
7. Nelson, Brent, "Reconfigurable Computing," HPEC 98 proceedings, September 1998.
8. G. G. Pechanek, S. Vassiliadis, and N. Pitsianis, "ManArray Interconnection Network: An Introduction," EuroPar'99, Toulouse, France, Aug. 31-Sept. 3, 1999.
9. National Semiconductor NAPA 1000 - DARPA ITO Sponsored Research 1988. www.darpa.mil/ito/psum1998/e257-0.html
10. N. P. Pitsianis and G. G. Pechanek, "High-Performance FFT Implementation on the BOPS ManArray Parallel DSP," International Symposium on Optical Science, Engineering, and Instrumentation, Denver, Colorado, July 18-23, 1999.
11. G. G. Pechanek, B. Schulman, and C. Kurak, "Design of MPEG-2 Function with Embedded ManArray Cores," Proceedings DesignCon 2000 IP World Forum section, Jan. 31-Feb. 3, 2000.
Power-Aware Replication of Data Structures in Distributed Embedded Real-Time Systems
Osman S. Unsal, Israel Koren, C. Mani Krishna
Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA 01003
Abstract. In this paper, we study the problem of positioning copies of shared data structures to reduce power consumption in real-time systems. Power-constrained real-time systems are of increasing importance in defense, space, and consumer applications. We describe our energy consumption model and present numerical results linking the placement of data structures to energy consumption.
1
System Model
This paper explores the power ramifications of various task assignment heuristics as well as network topology/routing issues. We study distributed real-time systems, with each node having a private memory and each task having a worst-case execution time and deadline. If two tasks reside on different processors then the communication power cost depends on the routing algorithm and topology. The objective is to study the impact of a particular assignment-topology-routing combination on power consumption. To save energy, part of a remote task's data structure may be replicated closer to the consuming node(s). The aim is to find the ideal degree of replication. Increasing the replication increases local memory size and its energy consumption, while decreasing the volume of network transfers and the associated power consumption. Therefore a "sweet spot" may exist, beyond which increasing the degree of replication increases the energy consumption. More formally, the total energy consumed, denoted by E, is:

E = \sum_{i=1}^{n_{tasks}} E_i    (1)

where

E_i = N_{write_i}\, e_{write} + N_{read_i}\, e_{read} + \left( \sum_{j=1}^{n_{tasks}} N_{net_{ij}} \right) e_{net} + Size_{mem_i}\, e_{static}    (2)
This work is supported in part by DARPA through contract No. F30602-96-1-0341. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Defense Advanced Research Projects Agency, the Air Force or the U.S. Government.
J. Rolim et al. (Eds.): IPDPS 2000 Workshops, LNCS 1800, pp. 839-846, 2000. Springer-Verlag Berlin Heidelberg 2000
Here, E_i is the energy consumed in executing the i-th task, e_{write} and e_{read} are the memory energy consumption per write and read access, e_{static} is the memory static energy consumption, e_{net} is the energy cost of a per-hop data transfer per bit, N_{write_i} and N_{read_i} are the number of local memory write and read accesses of task i respectively, and N_{net_{ij}} is the number of remote accesses from task i to task j. If the two tasks i and j are assigned to the same node then N_{net_{ij}} = 0. Memory consistency is preserved by updating all the replicated copies of a data item when a task writes a shared data item to its private memory. For typical programs the writes are at most 15 percent of the reads. This characteristic facilitates the usefulness of replication. All links are assumed to be of the same type, i.e., the link power consumption to transfer one byte is the same for all links. Various routing strategies such as broadcasting or flooding are also implemented in the model. As for the case of multicasting, efficient multicasting algorithms rely on building and trimming a minimum spanning tree [3]. However, this is not optimal from a power point of view since it builds a minimum spanning tree for all the nodes instead of the subset of nodes in the multicast group. To obtain a better solution, we have developed an energy-saving Steiner tree heuristic for systems with multiple multicasting requirements. Given a weighted graph G, the Steiner tree problem is to find a tree that spans a specified subset of nodes of G with minimal total distance on its edges. Various distinct trees that span the same subset of nodes of G can be constructed, and one can select the tree that has less total edge cost than the other trees in the set. Since the problem is NP-complete, heuristics are needed. We have adapted such a heuristic [5] for our purposes. The heuristic finds a solution with total distance no more than 2(1 - 1/k) times that of the optimal tree in time O(pn^2). Here, n is the number of nodes in G, p the number of Steiner points and k the number of leaves in the optimal Steiner tree. A short description of the Steiner heuristic algorithm is given in Figure 1. For intertask-communication-bound real-time systems, the allocation of tasks to nodes can also have a significant impact on power. We use a steepest-descent heuristic [2] for power-aware task allocation. The heuristic starts from an initial allocation and then reallocates that pair of tasks to the same node which results in the largest decrease in energy consumption from among the set of candidate task pairs. This reallocation is done iteratively until the energy saving is below a given threshold. Thus, the heuristic tends to assign tasks which communicate heavily to the same node.
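Equations (1) and (2) translate directly into code; the C++ sketch below is an illustrative transcription written for these notes (not the authors' simulator), with all per-task counts and energy constants supplied by the caller.

#include <vector>

struct TaskProfile {
    double nWrite;                 // N_write_i: local memory writes
    double nRead;                  // N_read_i:  local memory reads
    std::vector<double> nNet;      // N_net_ij: remote accesses to each task j (0 if co-located)
    double memSize;                // Size_mem_i: local memory size (for static energy)
};

struct EnergyCosts {
    double eWrite, eRead;          // energy per write / read access
    double eNet;                   // energy per hop per bit transferred
    double eStatic;                // static energy per unit of memory
};

// E = sum_i E_i, with E_i as in Eq. (2).
double totalEnergy(const std::vector<TaskProfile>& tasks, const EnergyCosts& c) {
    double total = 0.0;
    for (const TaskProfile& t : tasks) {
        double net = 0.0;
        for (double nij : t.nNet) net += nij;                       // sum over j
        total += t.nWrite * c.eWrite + t.nRead * c.eRead
               + net * c.eNet + t.memSize * c.eStatic;
    }
    return total;
}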
2
Numerical Results
For the results in this section, unless otherwise noted, the number of tasks is 10 and the task execution times, periods, and intertask communication sizes are random. The number of nodes is 4, the write-to-read power ratio is 1.22, 8% of the memory operations are writes and 92% are reads, and the per-hop remote access energy cost is three times that of a local access energy cost.

Step 1. For every multicast group repeat steps 2 through 6.
Step 2. Construct the complete graph H from G and S in such a way that the set of nodes in H is equal to S; for every edge (u, v) in H, the distance of (u, v) is set equal to the shortest path between u and v in G.
Step 3. Find a minimum spanning tree TH of H.
Step 4. Replace each edge (u, v) in TH by the shortest path between u and v in G; the resulting graph R is a subgraph of G.
Step 5. Find a minimum spanning tree TR of R.
Step 6. Delete edges in TR, if necessary, so that all the leaves in TR are elements of S. The resulting tree is returned as the solution.
Fig. 1. The Steiner heuristic algorithm
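A compact C++ illustration of the first steps of the Figure 1 heuristic is given below; it is a sketch written for these notes (Floyd-Warshall for shortest paths, Prim's algorithm for the MST), and steps 4-6 are only indicated in comments.

#include <limits>
#include <vector>

using Matrix = std::vector<std::vector<double>>;
const double INF = std::numeric_limits<double>::infinity();

// All-pairs shortest path distances (Floyd-Warshall). The input matrix is
// expected to have d[i][i] == 0 and d[i][j] == INF where there is no edge.
Matrix shortestPaths(Matrix d) {
    const std::size_t n = d.size();
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                if (d[i][k] + d[k][j] < d[i][j]) d[i][j] = d[i][k] + d[k][j];
    return d;
}

// Prim's MST over a weighted graph; returns the parent of each node.
std::vector<int> minimumSpanningTree(const Matrix& w) {
    const std::size_t n = w.size();
    std::vector<double> key(n, INF);
    std::vector<int> parent(n, -1);
    std::vector<bool> inTree(n, false);
    key[0] = 0.0;
    for (std::size_t it = 0; it < n; ++it) {
        int u = -1;
        for (std::size_t v = 0; v < n; ++v)
            if (!inTree[v] && (u < 0 || key[v] < key[u])) u = static_cast<int>(v);
        inTree[u] = true;
        for (std::size_t v = 0; v < n; ++v)
            if (!inTree[v] && w[u][v] < key[v]) { key[v] = w[u][v]; parent[v] = u; }
    }
    return parent;
}

// Steps 2-3 of Figure 1: build the complete graph H over the multicast group S
// using shortest-path distances in G, and take its MST. Steps 4-6 (expanding
// each MST edge back into its shortest path in G, taking an MST of the result,
// and pruning non-group leaves) would follow but are omitted for brevity.
std::vector<int> steinerBackbone(const Matrix& g, const std::vector<int>& group) {
    Matrix dist = shortestPaths(g);
    Matrix h(group.size(), std::vector<double>(group.size(), 0.0));
    for (std::size_t a = 0; a < group.size(); ++a)
        for (std::size_t b = 0; b < group.size(); ++b)
            h[a][b] = dist[group[a]][group[b]];
    return minimumSpanningTree(h);   // MST of H, indexed by position in 'group'
}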
2.1 Effect of Application Write Ratios We begin by considering a situation in which each node keeps a fraction of the global data structures in its private memory. Figure 2 illustrates the effect of changing write ratios on power. As can be seen from the figure, the optimum energy point shifts towards lower replication as the write percentage gets higher. Another observation is that as the degree of replication gets higher, the energy consumption increases sharply for higher write ratios. This stems from the memory consistency constraint and is caused by the need to update all the replicated copies of a data item that has been modified by the local task.
2.2 Impact of Per-hop Transfer Cost The per-link power cost per bit transferred depends on the interconnection hardware used. For example, a wireless link may consume less power than a twisted-pair link. If the real-time system designer has multiple options for choosing interconnection hardware, he/she can find the optimum degree of memory replication for each option. Figure 3 illustrates this. Here the per-hop energy consumption varies from being equal to the memory energy consumption per operation to four times the memory energy consumption per operation. We observe that the optimum energy point shifts toward higher degrees of replication as the per-link power consumption is increased. Also, the system energy consumption starts converging at higher degrees of replication. This phenomenon is due to the fact
that most of the data is locally replicated, thus decreasing the sensitivity of total energy consumption to the per-link power cost.
Fig. 2. Effect of write ratios on energy (total system energy vs. degree of replication for several application write percentages)
Fig. 3. Effect of per-link power on energy (total system energy vs. degree of replication, with the net power ranging from 1x to 4x the local memory access power)
2.3 Task Allocation and Network Topology Figure 4 shows the energy consumption impact of the task allocation scheme by comparing the previously mentioned power-aware optimization heuristic with a simple, power-blind round-robin scheme. The resulting saving in energy consumption emphasizes the importance of the task allocation step. Network topology and routing are also important design considerations in real-time design. Figure 5 shows the energy comparisons of two different choices, a 16-node mesh topology and a 16-node torus topology. The extra wraparound edges of the torus result in lower energy consumption, but the energy difference between the two topologies is not very large.
2.4 Routing Issues Multicasting has received little attention in real-time systems but is an important problem: sensors providing data to multiple processes and process outputs driving redundant actuators can all benefit from efficient multicasting algorithms. For the baseline case, we implemented the minimal spanning tree truncation scheme [3]. As mentioned in the previous section, we have developed a better multicasting scheme which makes use of a Steiner tree heuristic to find a path with the minimum cost among the multicasting nodes. Figure 6 shows the energy comparison of the two approaches. Here the number of nodes is 16 and the number of tasks is 40.
Fig. 4. Impact of task allocation strategy (total system energy vs. degree of replication for round-robin and optimized allocation)
Fig. 5. Topology choice and energy (total system energy vs. degree of replication for a 16-node mesh and a 16-node torus)
The routing capabilities of a Real-Time Operating System (RTOS) determine the power impact of multicasting tasks. A minimalist micro-kernel RTOS might just supply a simple flooding model. In this model the multicast message is sent to all the nodes in the system. A slightly more sophisticated RTOS would do a broadcast by sending a unique message to each of the multicast nodes. Broadcasting is considered to be more efficient than flooding [4]. However, as seen in Figure 7, flooding surprisingly does better than broadcasting from an energy point of view. This is because only a single copy of the multicast message is sent in flooding.
Fig. 6. Energy ratio of Steiner / minimal spanning tree truncation (vs. degree of replication)
Fig. 7. Flooding versus broadcasting (total system power vs. degree of replication)
2.5 Selective Replication Up to this point, we have considered the task-to-task communication data structures to be fully replicated. This means that for a multicast group, the data
structure of the multicast source task is replicated at all multicast destination tasks. We now relax this requirement and selectively replicate the data structure of the source task only at some of the destination tasks, thus saving energy. Consider the example of Figure 8. This is a 16-node mesh and part of the task assignment is shown in the figure. Our focus is multicasting group A, with task A.1 being the source and the other tasks in group A being the destinations. We selectively replicate task A.1's data structure only at task A.4's node. The result is compared against full replication and no replication in Figure 9. Here, the energy is plotted against the per-hop energy cost and it is normalized with respect to the energy consumption of no replication. As can be seen, selective replication results in significant energy savings.
Fig. 8. Example for selective replication (a 16-node mesh with tasks from multicast groups A, B, C, and D assigned to the nodes)
3
Conclusion
We have constructed a model to gauge the power impact of task assignment, network topology and routing strategies within the context of data structure replication to decrease energy. Our results show that substantial energy savings are possible by careful design. Our model also gives us the ability to calculate the energy impact of new power-aware heuristics. We have adapted a Steiner tree heuristic for multicasting and compared its energy consumption with the baseline case of minimal spanning tree truncation. Currently, we are studying the more general case of heterogeneous data consumption rates at the destination tasks. We are also developing a heuristic which will optimize the memory replication needs of each task.
Fig. 9. Advantage of selective replication (energy of total and selective replication, normalized to no replication, vs. per-hop power cost as a multiple of memory cost)
References
1. Coumeri, S. L., and Thomas, D. E., Memory Modeling for System Synthesis, www.ece.cmu.edu:80/ thomas/research/List.html
2. Press, W. H., Flannery, B. P., Teukolsky, S. A., Vetterling, W. T., Numerical Recipes, Cambridge University Press, 1989
3. Deering, S. E., and Cheriton, D. R., Multicast Routing in Datagram Internetworks and Extended LANs, ACM Trans. on Computer Systems, May 1990
4. Tanenbaum, A. S., Computer Networks, Third Edition, Prentice Hall, 1996
5. Lau, H. T., Combinatorial Heuristic Algorithms with FORTRAN, Springer-Verlag, 1986
Comparison of MPI Implementations on a Shared Memory Machine
Brian VanVoorst1 and Steven Seidel2
1 Honeywell Technology Center, 3660 Technology Drive, Minneapolis, Minn. 55418
brian [email protected]
2 Dept. of Computer Science, Michigan Technological Univ., Houghton, Michigan 49931
[email protected]
Abstract. There are several alternative MPI implementations available to parallel application developers. LAM MPI and MPICH are the most common. System vendors also provide their own implementations of MPI. Each version of MPI has options that can be tuned to best fit the characteristics of the application and platform. The parallel application developer needs to know which implementation and options are best suited to the problem and platform at hand. In this study the RTCOMM1 communication benchmark from the Real Time Parallel Benchmark Suite is used to collect performance data on several MPI implementations for a Sun Enterprise 4500. This benchmark provides the data needed to create a refined cost model for each MPI implementation and to produce visualizations of those models. In addition, this benchmark provides best, worst, and typical message passing performance data which is of particular interest to real-time parallel programmers.
1 Introduction Shared memory platforms can support many different versions of the Message Passing Interface (MPI) [1]. Among the best known MPI implementations are LAM MPI [3] and MPICH [2]. Vendors also provide MPI implementations particularly suited to their platforms. Each implementation has various options for tuning its behavior. This creates several choices for an application developer who is seeking the best possible performance for their application. The work presented here characterizes several MPI implementations and configurations for the Sun Enterprise 4500. These characterizations are based on data obtained from the RTCOMM1 communication benchmark, part of the Real Time Parallel Benchmark Suite [4, 5]. A refined communication cost model for each implementation is obtained by an iterative process of running RTCOMM1, examining the output, and adjusting the input to focus on the behavioral features revealed by the most recent data. This process was performed for the MPI implementations listed in Table 1.
This work is partially supported by NSF grant MRI-9871133.
J. Rolim et al. (Eds.): IPDPS 2000 Workshops, LNCS 1800, pp. 847-854, 2000. Springer-Verlag Berlin Heidelberg 2000
Table 1. MPI variations examined

MPI     Mechanism   Option
Sun     SHMEM
Sun     SHMEM       MPI_SPIN
Sun     SHMEM       MPI_POLLALL
Sun     SHMEM       MPI_EAGER
LAM     TCP/IP
LAM     SHMEM       -O -c2c -nger
MPICH   SHMEM       -O -c2c -nger
Several system configuration options can also be varied in order to reveal their impact on message passing performance. The configuration options available on the E4500 include locking processes to processors, disabling interrupts, and even disabling individual processors. The effects of these options were also investigated.
2 Approach

Three MPI implementations are studied: Sun's MPI provided with HPC 3.0, LAM MPI[3], and MPICH[2]. LAM MPI was built in both its default TCP/IP version and in its shared memory version. These two builds of LAM MPI are compared to determine the amount of additional overhead created by the TCP/IP version compared to its shared memory version. By default, MPICH builds a shared memory version. No attempt at building a TCP/IP version of MPICH was made. These four implementations of MPI are the subject of the characterization work presented here. The platform used for this work is an 11-processor Sun Microsystems Enterprise 4500 symmetric multiprocessor with 8GB of memory running Solaris 2.7. The processors are 400MHz Sparc II's with 4MB of cache. The characterization methodology for this work relies heavily on the use of the RTCOMM1 benchmark. RTCOMM1 takes as input a sequence of message size ranges (e.g., 0-128 bytes, 129-4098 bytes, ...) and for each range produces N sample points. The experiments reported here use N = 20. A large number of ping-pong operations (sending a message back and forth between two processes) are timed at each sample point. The exact number of ping-pongs is not specified by the input to the benchmark. Instead, a total run time is specified. The benchmark performs a ping-pong measurement for each sample point in a round-robin fashion until the run time expires. The benchmark terminates only after completing a full round of sample points. This ensures that all message size ranges are measured an equal number of times and that any interruption of the benchmark (by, for example, an increase in background load) will not significantly bias the measurement of any one sample point. For each sample point RTCOMM1 records the fastest (best), slowest (worst), and typical (median) time to complete a ping-pong. At the completion of the
benchmark RTCOMM1 fits a line to the typical points of each message size range. This line is the communication cost model for that range of message sizes. RTCOMM1 provides as output these cost models and a series of data files suitable for plotting. The initial approach to the characterization of each MPI implementation is to oversample with short message ranges and a dense set of sample points. This provides a fine-grained picture of point-to-point communication performance. These measurements reveal interesting regions in the graph of the performance data. It is usually apparent that there are certain message size ranges for which different underlying protocols, buffering schemes, etc. are used. Transitions in the graph at the boundaries of these ranges illustrate changes in the performance of the MPI implementation. Based on these observations, the input to the benchmark is adjusted so that the selected ranges match the transition points of the oversampled runs. A few iterations of this approach produce an accurate cost model for each MPI configuration.
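The core measurement is easy to picture in code. The following is a minimal sketch (not the RTCOMM1 source; the message size, trial count, and output are illustrative only) of the ping-pong timing idea used by the benchmark: two processes bounce a message back and forth, and the halved round-trip times yield the per-sample best, worst, and median values discussed below.

/* Hedged sketch of a ping-pong timing loop; run with at least two ranks
 * (e.g., mpirun -np 2).  Only the timing idea is shown, not RTCOMM1's
 * round-robin sampling, input handling, or statistics. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define TRIALS 1000

int main(int argc, char **argv)
{
    int rank, size_bytes = 8192;          /* example message size */
    char *buf;
    double t[TRIALS];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(size_bytes);

    for (int i = 0; i < TRIALS; i++) {
        MPI_Barrier(MPI_COMM_WORLD);
        double start = MPI_Wtime();
        if (rank == 0) {
            MPI_Send(buf, size_bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size_bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size_bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, size_bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
        t[i] = (MPI_Wtime() - start) / 2.0;   /* one-way time estimate */
    }

    /* Best, worst, and median values would be extracted from t[] here. */
    if (rank == 0)
        printf("sample one-way time: %g s\n", t[0]);

    free(buf);
    MPI_Finalize();
    return 0;
}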
3 Results

Due to limited space, only graphs of the most interesting characteristics of the MPI implementations are presented. These features appear at a variety of scales. Those that have the most direct impact on performance are discussed here. It is important to remember that each point on a graph represents thousands of individually timed messages. When a "best" point takes slightly longer than its neighboring best points it is not due to chance mis-measurement. It is the result of some artifact in the system that did not allow that message to be transmitted faster. Not all such abnormalities can be explained, but they can be measured and their impact on performance can be revealed. All data points shown are actual measurements of point-to-point communication, not averages, computed by halving the best, typical, or worst observed ping-pong measurements. For all three message passing libraries the typical times are often the same as (or very close to) the best times. This means that out of thousands of trials the median time is usually the same as the best time. Therefore, application developers can be confident that they will usually receive the best possible message passing performance the system has to offer. However, poorer performance sometimes occurs. This is captured by the "worst" observed points. These points often differ by a constant from the best observed times. This constant may be the cost of servicing one interrupt, which might happen only infrequently. Messages were occasionally observed to be slowed down by orders of magnitude, up to 1/10th of a second. This phenomenon can be reproduced by binding the benchmark processes to specific processors, disabling interrupts on those processors, and disabling all other processors except one. This caused many message delays ranging from 0.02 to 0.1 seconds at consistently spaced intervals of 0.01 seconds. It is unclear why this particular combination of circumstances caused this delay.
3.1 Platform Configuration

Experiments showed that binding a process to a processor fostered consistent performance. The processor_bind() system call prevents the operating system from migrating processes among processors. Under these conditions it appeared that the operating system will not schedule these processors for other work if other processors are available. Using the psradm command to disable I/O interrupts on these processors further reduced the possibility of these processors being interrupted while running the benchmark. Experience also indicated that it was necessary to leave more than one processor available for servicing interrupts. The results presented here were collected from two benchmarking tasks bound to processors 0 and 1 on which interrupts were disabled. The remaining nine processors were available for other purposes but no other user jobs were running on the machine.
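For illustration only, the fragment below sketches how a benchmark process could be bound to a processor on Solaris using the processor_bind() call mentioned above; the wrapper name is hypothetical and error handling is minimal. Disabling I/O interrupts on the chosen processors would be done separately by an administrator with a command such as "psradm -i 0 1".

/* Hedged sketch: bind the calling process to one CPU on Solaris.
 * bind_self_to() is an illustrative wrapper, not code from the paper;
 * processor ids 0 and 1 match the configuration described above. */
#include <sys/types.h>
#include <sys/processor.h>
#include <sys/procset.h>
#include <stdio.h>

int bind_self_to(processorid_t cpu)
{
    /* P_PID with P_MYID binds the calling process to the given CPU. */
    if (processor_bind(P_PID, P_MYID, cpu, NULL) != 0) {
        perror("processor_bind");
        return -1;
    }
    return 0;
}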
3.2 Sun's HPC 3.0 MPI

Sun's MPI implementation delivered the fastest overall point-to-point message passing. However, Sun's MPI was the least consistent and hardest to model for larger-sized messages. Figure 1 shows a plot of message sizes in the range of 210KB to 240KB. No explanation can be offered for the illustrated oscillations in message passing times. While the variance is small (5%), it is large relative to the execution time. This shows that for certain message sizes, a message that is a few bytes longer may be transmitted in less time (as much as 50 microseconds) than the shorter message. This effect is reproducible and starts to occur for messages longer than 64KB. A second observable trend (not illustrated here) is that the difference between the worst and best points increases with message length. This is probably due to an increased chance of being interrupted multiple times while sending a longer message. The cost model for Sun MPI is given in Table 2. Due to the variance in measurements seen in Figure 1 it is not possible to present a precise cost model for messages longer than 64KB.
Table 2. Sun MPI cost model (*Imprecise due to large variance)

Message size (bytes)  Latency (µsec)  Bandwidth (MB/sec)
0 - 256               6               41.53
256 - 512             10              236.7
512 - 1K              9               158.2
1K - 16K              11              182.0
16K - 32K             29*             197.9*
32K - 1M              35*             208.6*
1M - 2M               277*            219.7*
2M - 4M               594*            225.1*
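To make concrete how a table like this is read, the sketch below (not part of RTCOMM1; the function name is illustrative) predicts a transfer time as latency plus size divided by bandwidth, using the first four rows of Table 2.

/* Hedged sketch: predicted one-way time = latency + size / bandwidth,
 * using the row of the cost model whose size range contains the message.
 * Values below are the first four (non-starred) rows of Table 2. */
#include <stddef.h>

struct cost_row { size_t max_size; double latency_us; double bandwidth_MBps; };

static const struct cost_row sun_mpi_model[] = {
    {   256,  6.0,  41.53 },
    {   512, 10.0, 236.7  },
    {  1024,  9.0, 158.2  },
    { 16384, 11.0, 182.0  },
};

double predict_time_seconds(size_t size)
{
    size_t n = sizeof sun_mpi_model / sizeof sun_mpi_model[0];
    for (size_t i = 0; i < n; i++) {
        if (size <= sun_mpi_model[i].max_size)
            return sun_mpi_model[i].latency_us * 1e-6
                 + (double)size / (sun_mpi_model[i].bandwidth_MBps * 1e6);
    }
    return -1.0;  /* message size outside the tabulated ranges */
}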
Fig. 1. Sun's MPI performance varies for large message sizes (time in seconds vs. message size in bytes, 210,000-240,000 bytes).

Fig. 2. Oversampled LAM MPI performance (time in seconds vs. message size in bytes, 0-25,000 bytes).
3.3 LAM Shared Memory MPI

The most distinctive feature of LAM MPI is a large jump in latency for messages of length 8KB, as shown in Figure 2. The magnitude of this increase shows that it takes almost twice as long to transmit a message of length 8193 bytes as it does to transmit a message of 8192 bytes. For longer messages LAM message passing costs are modeled well by a straight line, as given in Table 3. Both the shared memory version of LAM and the TCP/IP version of LAM were built and tested in these experiments but the measurements were the same in both cases. Because the bandwidth is so high in each case it must be that the TCP/IP version is making use of shared memory.
Table 3. LAM cost model

Message size (bytes)  Latency (µsec)  Bandwidth (MB/sec)
0 - 256               33              44.48
256 - 8K              35              145.9
8K - 1M               108             141.9
1M - 4M               108             141.7

Fig. 3. MPICH cost model is a step function (best, typical, and worst observed times in seconds vs. message size in bytes, 0-30,000 bytes).
3.4 MPICH

The performance of MPICH is best characterized by a step function. Figure 3 shows the observed message passing times for MPICH for messages of lengths 0 to 30,000 bytes. The interval of the step shown in Figure 3 is about 4900 bytes and it varies slightly as the message size grows. It then changes to an interval of about 9800 bytes when the message size is greater than 130,000 bytes. This interval also changes slightly as message size grows. The cause for this step function and the variation of interval size is not known but it might be a side effect of padding or buffer allocation and usage. The cost model for MPICH is given in Table 4. For messages longer than 100KB MPICH exhibits two "levels" of message passing times for each message length. Figure 4 shows message passing times for messages between the sizes of 200,000 and 210,000 bytes. Note that about a third of the time message passing times are greater by a fixed amount. This behavior is reproducible but no explanation can be offered here.

Fig. 4. MPICH bimodal cost behavior (time in seconds vs. message size in bytes, 200,000-210,000 bytes).

Table 4. MPICH cost model

Message size (bytes)  Latency (µsec)                      Bandwidth (MB/sec)
0 - 256               7                                   48.2
256 - 512             12                                  195.4
512 - 1K              6                                   56.5
1K - 130K             ((Size - 1K) / 4900) * 5            282.3
130K - 4M             ((Size - 130000) / 9800) * 5 + 400  238.7

4 Conclusions

Sun MPI offers the best performance of the three MPI implementations. Figure 5 summarizes the costs of passing long messages using Sun MPI, MPICH, and LAM. The second-order performance characteristics of these message passing interfaces are illustrated in Figures 1-4. Figure 1 shows that for long messages Sun MPI
exhibits large cost fluctuations. Figure 2 shows that with LAM, messages longer than 8KB have a start up cost three times that of messages shorter than 8KB and that LAM performance is best modeled by one cost function for messages shorter than 8KB and by another for messages longer than 8KB. MPICH is best characterized by a step function whose latency increases with message length, as shown in Figure 3. Figure 4 illustrates MPICH's bimodal cost behavior. The best platform configuration across all MPI implementations required locking processes to processors and disabling interrupts on those processors. These steps helped to ensure that processors remained dedicated to the application. MPI implementations on the same machine, using the same shared memory message transport mechanism, have very different performance characteristics. The results presented here illustrate significant differences among cost models, scaling behavior, worst-case performance, and other performance characteristics. These differences stem from implementation decisions made by interface developers. LAM and MPICH are portable MPI implementations that are not tuned for specific platforms. The native implementation has a clear advantage in this case. It is also clear that no single implementation of MPI is best for all applications. This suggests that similar studies should be done for other platforms. It has been shown here that RTCOMM1 can be used to characterize MPI implementations. The communication cost model of a message passing interface and hardware platform is usually described as a linear function determined by
a measured startup cost (latency) and bandwidth. RTCOMM1 was used here to show that this is sometimes an oversimplification. This work also demonstrated an approach for using RTCOMM1 to identify and illustrate performance differences between MPI implementations. While this approach does not reveal the causes underlying those differences, the experimental data does admit the construction of more accurate cost models. In addition, RTCOMM1 provides insight into best and worst case message passing performance which is useful for real-time software development.

Fig. 5. Comparative message passing performance of the LAM, MPICH, and Sun MPI cost models (time in seconds vs. message size in bytes, 100,000 to 1,000,000 bytes).
References

[1] W. Gropp, E. Lusk, N. Doss, and A. Skjellum. A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing, 22(6):789-828, September 1996.
[2] William D. Gropp and Ewing Lusk. User's Guide for mpich, a Portable Implementation of MPI. Mathematics and Computer Science Division, Argonne National Laboratory, 1996. ANL-96/6.
[3] A. Lumsdaine, et al. LAM MPI Home Page. http://www.mpi.nd.edu/lam/. University of Notre Dame.
[4] B. VanVoorst, R. Jha, S. Ponnuswammy, C. Nanvati, and L. Pires. DARPA Real Time Parallel Benchmarks: Final Report. Technical Report (C013), Contract Number F30602-94-C-0084, Rome Laboratory, USAF, 1998.
[5] B. VanVoorst, S. Ponnuswammy, R. Jha, and L. Pires. DARPA Real Time Parallel Benchmarks: Low-Level Benchmark Specifications. Technical Report (C006), Contract Number F30602-94-C-0084, Rome Laboratory, USAF, 1998.
A Genetic Algorithm Approach to Scheduling Communications for a Class of Parallel Space-Time Adaptive Processing Algorithms

Jack M. West and John K. Antonio

School of Computer Science, University of Oklahoma, 200 Felgar Street, Norman, OK 73019, Phone: (405) 325-4624, {west, antonio}@ou.edu
Abstract. An important consideration in the maximization of performance in parallel processing systems is scheduling the communication of messages during phases of data movement to reduce network contention and overall communication time. The work presented in this paper focuses on off-line optimization of message schedules for a class of radar signal processing techniques known as space-time adaptive processing on a parallel embedded system. In this work, a genetic-algorithm-based approach for optimizing the scheduling of messages is introduced. Preliminary results indicate that the proposed genetic approach to message scheduling can provide significant decreases in the communication time.
1 Introduction and Background
For an application on a parallel and embedded system to achieve required performance given tight system constraints, it is important to efficiently map the tasks and/or data of the application onto the processors to reduce inter-processor communication traffic. In addition to mapping tasks efficiently, it is also important to schedule the communication of messages in a manner that minimizes network contention so as to achieve the smallest possible communication time. Mapping and scheduling can both – either independently or in combination – be cast as optimization problems, and optimizing mapping and scheduling objectives can be critical to the performance of the overall system. For parallel and embedded systems, great significance is placed on minimizing execution time (which includes both computation and communication components) and/or maximizing throughput. The work outlined in this paper involves optimizing the scheduling of messages for a class of radar signal processing techniques known as space-time adaptive processing (STAP) on a parallel and embedded system. A genetic algorithm (GA) based approach for solving the message-scheduling problem for the class of parallel STAP algorithms is proposed and preliminary results are provided. The GA-based optimization is performed off-line, and the results of this optimization are static
schedules for each compute node in the parallel system. These static schedules are then used within the on-line parallel STAP implementation. The results of the study show that significant improvements in communication time performance are possible using the proposed approach for scheduling. Performance of the schedules was evaluated using a RACEway network simulator [6].
2 Overview of Parallel STAP
STAP is an adaptive signal processing method that simultaneously combines the signals received from multiple elements of an antenna array (the spatial domain) and from multiple pulses (the temporal domain) of a coherent processing interval [5]. The focus of this research assumes STAP is implemented using an element-space post-Doppler partially adaptive algorithm; refer to [5, 6] for details. Algorithms belonging to the class of element-space post-Doppler STAP perform filtering on the data along the pulse dimension, referred to as Doppler filtering, for each channel prior to adaptive filtering. After Doppler filtering, an adaptive weight problem is solved for each range and pulse data vector. The parallel computer under investigation for this work is the Mercury RACE multicomputer. The RACE multicomputer consists of a scalable network of compute nodes (CNs), as well as various high-speed I/O devices, all interconnected by Mercury's RACEway interconnection fabric [4]. A high-level diagram of a 16-CN RACEway topology is illustrated in Figure 1. The interconnection fabric is configured in a fat-tree architecture and is a circuit switched network. The RACEway interconnection fabric is composed of a network of crossbar switches and provides high-speed data communication between different CNs. The Mercury multicomputer can support a heterogeneous collection of CNs (e.g., SHARCs and PowerPCs); for more details refer to [6].
Fig. 1. Mercury RACE Fat-Tree Architecture configured with 16 CNs.
Achieving real-time performance requirements for STAP algorithms on a parallel embedded system like the Mercury multicomputer largely depends on two major issues. First is determining the best method for distributing the 3-D STAP data cube across CNs of the multiprocessor system (i.e., the mapping strategy). Second is
determining the scheduling of communications between phases of computation. In general, STAP algorithms contain three phases of processing, one for each dimension of the data cube (i.e., range, pulse, channel). During each phase of processing, the vectors along the dimension of interest are distributed as equally as possible among the processors for processing in parallel. An approach to data set partitioning in STAP applications is to partition the data cube into sub-cube bars. Each sub-cube bar is composed of partial data samples from two dimensions while preserving one whole dimension for processing. The work here assumes a sub-cube bar partitioning of the STAP data cube; for further details refer to [6]. Figure 2 shows an example of how sub-cube partitioning is applied to partition a 3-D data cube across 12 CNs.
Fig. 2. Illustration of the sub-cube bar mapping technique for the case of 12 CNs. The mapping of the sub-cube bars to CNs defines the required data communications. (a) Example illustration of the communication requirements from CN 1 to the other CNs (2, 3, and 4) after completion of the range processing and prior to Doppler processing. (b) Example illustration of the communication requirements from CN 1 to other CNs (5 and 9) after the completion of Doppler processing and prior to adaptive weight processing.
During phases of data redistribution (i.e., communication) between computational phases, the number of required communications and the communication pattern among the CNs is dependent upon how the data cube is mapped onto the CNs. For example, in Figure 2(a) the mapping of sub-cube bars to CNs dictates that after range processing, CN 1 must transfer portions of its data sub-cube bar to CNs 2, 3, and 4. (Each of the other CNs, likewise, is required to send portions of their sub-cube bar to CNs on the same row.) The scheduling (i.e., ordering) of outgoing messages at each CN impacts the resulting communication time. For example, in Figure 2(a) note CN 1 could order its outgoing messages according to one of 3! = 6 permutations (i.e., [2,3,4], [3,2,4], etc.). Similarly, a scheduling of outgoing messages must be defined for each CN. Improper schedule selection can result in excessive network contention and thereby increase the time to perform all communications between processing phases. The focus in this paper is on optimization of message scheduling, for a fixed mapping, using a genetic algorithm methodology.
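To illustrate the size of the schedule search space described above, the following sketch (hypothetical helper code, not taken from the simulator of [6, 7]) enumerates every ordering of one CN's outgoing destination list; for the three destinations of CN 1 in Figure 2(a) it produces the 3! = 6 permutations mentioned in the text.

/* Hedged sketch: recursively enumerate every ordering of a CN's outgoing
 * destination list (e.g., {2, 3, 4} for CN 1 after range processing). */
#include <stdio.h>

static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

static void enumerate(int *dest, int k, int n)
{
    if (k == n) {                        /* one complete candidate schedule */
        for (int i = 0; i < n; i++)
            printf("%d ", dest[i]);
        printf("\n");
        return;
    }
    for (int i = k; i < n; i++) {
        swap(&dest[k], &dest[i]);
        enumerate(dest, k + 1, n);
        swap(&dest[k], &dest[i]);
    }
}

int main(void)
{
    int dest[] = { 2, 3, 4 };            /* CN 1's destinations in Fig. 2(a) */
    enumerate(dest, 0, 3);               /* prints all 3! = 6 orderings */
    return 0;
}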
3 Genetic Algorithm Methodology
A GA is a population-based model that uses selection and recombination operators to generate new sample points in the solution space [3]. A GA encodes a potential solution to a specific problem on a chromosome-like data structure and applies recombination operators to these structures in a manner that preserves critical information. Reproduction opportunities are applied in such a way that those chromosomes representing a better solution to the target problem are given more chances to reproduce than chromosomes with poorer solutions. GAs are a promising heuristic approach to locating near-optimal solutions in large search spaces [3]. For a complete discussion of GAs, the reader is referred to [1, 3]. Typically, a GA is composed of two main components, which are problem dependent: the encoding problem and the evaluation function. The encoding problem involves generating an encoding scheme to represent the possible solutions to the optimization problem. In this research, a candidate solution (i.e., a chromosome) is encoded to represent valid message schedules for all of the CNs. The evaluation function measures the quality of a particular solution. Each chromosome is associated with a fitness value, which in this case is the completion time of the schedule represented by the given chromosome. For this research, the smallest fitness value represents the better solution. The "fitness" of a candidate is calculated here based on its simulated performance. In previous work [6, 7], a software simulator was developed to model the communication traffic for a set of messages on the Mercury RACEway network. The simulation tool is used here to measure the "fitness" (i.e., the completion time) of the schedule of messages represented by each chromosome. Chromosomes evolve through successive iterations, called generations. To create the next generation, new chromosomes, called offspring, are formed by (a) merging two chromosomes from the current population together using a crossover operator or (b) modifying a chromosome using a mutation operator. Crossover, the main genetic operator, generates valid offspring by combining features of two parent chromosomes. Chromosomes are combined together at a defined crossover rate, which is defined as the ratio of the number of offspring produced in each generation to the population size. Mutation, a background operator, produces spontaneous random changes in various chromosomes. Mutation serves the critical role of either replacing the chromosomes lost from the population during the selection process or introducing new chromosomes that were not present in the initial population. The mutation rate controls the rate at which new chromosomes are introduced into the population. In this paper, results are based on the implementation of a position-based crossover operator and an insertion mutation operator; refer to [1] for details. Selection is the process of keeping and eliminating chromosomes in the population based on their relative quality or fitness. In most practices, a roulette wheel approach, either rank-based or value-based, is adopted as the selection procedure. In a rank-based selection scheme, the population is sorted according to the fitness values. Each chromosome is assigned a sector of the roulette wheel based on its ranked value and not the actual fitness value. In contrast, a value-based selection scheme assigns roulette wheel sectors proportional to the fitness value of the chromosomes. In this paper, a rank-based selection scheme is used.
Advantages of rank-based fitness
assignment are that it provides uniform scaling across chromosomes in the population and that it is less sensitive to probability-based selections; refer to [3] for details.
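As a concrete illustration of the permutation encoding and the insertion mutation operator described above, the sketch below shows one plausible representation; the structure layout, sizes, and names are assumptions for illustration, not the authors' implementation.

/* Hedged sketch: a chromosome holds one outgoing-message permutation per
 * CN; insertion mutation removes a message from a random position in one
 * CN's schedule and reinserts it at another position. */
#include <stdlib.h>

#define NUM_CNS   32
#define MAX_MSGS  8      /* illustrative upper bound on messages per CN */

typedef struct {
    int schedule[NUM_CNS][MAX_MSGS];  /* destination CN ids, in send order */
    int num_msgs[NUM_CNS];
    double fitness;                   /* simulated completion time (lower is better) */
} Chromosome;

/* Insertion mutation applied to one CN's outgoing-message schedule. */
static void insertion_mutation(Chromosome *c, int cn)
{
    int n = c->num_msgs[cn];
    if (n < 2) return;
    int from = rand() % n;
    int to   = rand() % n;
    int msg  = c->schedule[cn][from];

    /* Remove the message, then reinsert it at the new position. */
    for (int i = from; i < n - 1; i++)
        c->schedule[cn][i] = c->schedule[cn][i + 1];
    for (int i = n - 1; i > to; i--)
        c->schedule[cn][i] = c->schedule[cn][i - 1];
    c->schedule[cn][to] = msg;
}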
4 Numerical Results
In the experiments reported in this section, it is assumed that the Mercury multicomputer is configured with 32 PowerPC compute nodes. For range processing, Doppler filtering, and adaptive weight computation, the 3-D STAP data cube is mapped onto the 32 processing elements based on an 8 × 4 process set (i.e., 8 rows and 4 columns); refer to [2, 6]. The strategy implemented for CN assignment in a process set is raster-order from left-to-right starting with row one and column one for all process sets. (The process sets not only define the allocation of the CNs to the data but also the required data transfers during phases of data redistribution.) The STAP data cube consists of 240 range bins, 32 pulses, and 16 antenna elements. For each genetic-based scenario, 40 random schedules were generated for the initial population. The poorest 20 schedules were eliminated from the initial population, and the GA population size was kept a constant 20. The recombination operators included a position-based crossover algorithm and an insertion mutation algorithm. A rank-based selection scheme was assumed with the angle ratio of sectors on the roulette wheel for two adjacently ranked chromosomes being 1 + 1/P, where P is the population size. The stopping criteria were: (1) the number of generations reached 500; (2) the current population converged (i.e., all the chromosomes have the same fitness value); or (3) the current best solution had not improved in the last 150 generations. Figure 3 shows the simulated completion time for three genetic-based message scheduling scenarios for the data transfers required between range and Doppler processing phases. Figure 4 illustrates the simulated completion time for the same three genetic-based message scheduling scenarios for the data transfers required between Doppler and adaptive weight processing phases. In the first genetic scenario (GA 1), the crossover rate (Pxover) is 20% and the mutation rate (Pmut) is 4%. For GA 2, Pxover is 50% and Pmut is 10%. For GA 3, Pxover is 90% and Pmut is 50%. Figures 3 and 4 provide preliminary indication that for a fixed mapping the genetic-algorithm-based heuristic is capable of improving the scheduling of messages, thus providing improved performance. All three genetic-based scenarios improve the completion time for both communication phases. In each phase, GA 2 records the best schedule of messages (i.e., the smallest completion time).
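The rank-based roulette wheel with adjacent-rank sector ratio 1 + 1/P can be sketched as follows; this is an illustrative implementation consistent with the description above, not the code actually used in the experiments.

/* Hedged sketch: rank-based roulette-wheel selection in which the sector
 * sizes of two adjacently ranked chromosomes are in the ratio r = 1 + 1/P.
 * The population is assumed sorted from worst (rank 0) to best (rank P-1). */
#include <stdlib.h>
#include <math.h>

int select_rank_based(int P)
{
    double r = 1.0 + 1.0 / P;
    double total = 0.0;

    for (int i = 0; i < P; i++)
        total += pow(r, i);              /* sector size for rank i */

    double spin = ((double)rand() / RAND_MAX) * total;
    for (int i = 0; i < P; i++) {
        double w = pow(r, i);
        if (spin < w) return i;          /* index of the selected rank */
        spin -= w;
    }
    return P - 1;
}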
Fig. 3. Simulated completion time of the communication requirements for data redistribution after range processing and prior to Doppler processing for the parameters discussed in Section 4. For GA 1, the crossover rate (Pxover) = 20% and the mutation rate (Pmut) = 4%. For GA 2, Pxover = 50% and Pmut = 10%. For GA 3, Pxover = 90% and Pmut = 50%.
Fig. 4. Simulated completion time of the communication requirements for data redistribution after Doppler processing and prior to adaptive weight computation for the parameters stated in Section 4. For GA 1, the crossover rate (Pxover) = 20% and the mutation rate (Pmut) = 4%. For GA 2, Pxover = 50% and Pmut = 10%. For GA 3, Pxover = 90% and Pmut = 50%.
5 Conclusion
In conclusion, preliminary data demonstrates that off-line GA-based message scheduling optimization can provide improved performance in a parallel system. Future work will be conducted to more completely study the effect of changing parameters of the GA, including crossover and mutation rates as well as the methods used for crossover and mutation. Finally, future studies will be conducted to determine the performance improvement between a randomly selected scheduling solution and the one determined by the GA. In Figures 3 and 4, the improvements shown are conservative in the sense that the initial generations’ performance on the plots represents the best of 40 randomly generated chromosomes (i.e., solutions). It will be interesting to determine improvements of the GA solutions with respect to the “average” and “worst” randomly generated solutions in the initial population.
Acknowledgements This work was supported by DARPA under contract no. F30602-97-2-0297.
References

1. M. Gen and R. Cheng, Genetic Algorithms and Engineering Design, John Wiley & Sons, Inc., New York, NY, 1997.
2. M. F. Skalabrin and T. H. Einstein, "STAP Processing on a Multicomputer: Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor Communication," Proceedings of the Adaptive Sensor Array Processing (ASAP) Workshop, March 1996.
3. L. Wang, H. J. Siegel, V. P. Roychowdhury, and A. A. Maciejewski, "Task Matching and Scheduling in Heterogeneous Computing Environments Using a Genetic-Algorithm-Based Approach," Journal of Parallel and Distributed Computing, Special Issue on Parallel Evolutionary Computing, Vol. 47, No. 1, pp. 8-22, Nov. 25, 1997.
4. The RACE Multicomputer, Hardware Theory of Operation: Processors, I/O Interface, and RACEway Interconnect, Volume I, ver. 1.3.
5. J. Ward, Space-Time Adaptive Processing for Airborne Radar, Technical Report 1015, Massachusetts Institute of Technology, Lincoln Laboratory, Lexington, MA, 1994.
6. J. M. West, Simulation of Communication Time for a Space-Time Adaptive Processing Algorithm Implemented on a Parallel Embedded System, Master's Thesis, Computer Science, Texas Tech University, 1998.
7. J. M. West and J. K. Antonio, "Simulation of the Communication Time for a Space-Time Adaptive Processing Algorithm on a Parallel Embedded System," Proceedings of the International Workshop on Embedded HPC Systems and Applications (EHPC '98), in Lecture Notes in Computer Science 1388: Parallel and Distributed Processing, edited by Jose Rolim, sponsor: IEEE Computer Society, Orlando, FL, USA, Apr. 1998, pp. 979-986.
Reconfigurable Parallel Sorting and Load Balancing on a Beowulf Cluster: HeteroSort

Pamela Yang (1), Timothy M. Kunau (1), Bonnie Holte Bennett (1), Emmett Davis (1), and Bill Wren (2)

(1) University of St. Thomas, Graduate Programs in Software, Mail # OSS 301, 2115 Summit Avenue, Saint Paul, MN 55105, [email protected]
(2) Honeywell Technology Center, 3660 Technology Drive, Minneapolis, MN 55418, [email protected]

Abstract. HeteroSort load balances and sorts within static or dynamic networks using a conceptual torus mesh. We ported HeteroSort to a 16-node Beowulf cluster with a central switch architecture. By capturing global system knowledge in overlapping microregions of nodes, HeteroSort is useful in data dependent applications such as data information fusion on distributed processors.
1 Introduction
Dynamic adaptability, both within an application's immediate distributed environment as well as within future environments to which it will be ported, is a keystone feature for applications implemented on modern networks. Dynamic adaptability is a basis for fault tolerance. A system which is dynamically adaptive strives to withstand the assault of hardware glitches, electrical spikes, and component destruction. The research described in this paper set out to develop a high-speed load balancing algorithm which would balance loads by sorting data across the network of nodes, and it resulted in a reconfigurable system for parallel sorting with dynamic adaptability.

1.1 Dynamic Adaptability
With the increased dependence on distributed and parallel processing to support general as well as safety-critical applications, we must have applications that are fault tolerant. Programs must be able to recognize that current resources are no longer available. Schedulers are employed in the presence of faults to manage resources against program needs using dynamic or fixed priority scheduling for timing correctness of critical application tasks. We have taken a different approach and refocused on the design of elemental processes such as load balancing and sorting. Instead of depending on schedulers, we design process
algorithms where global processes are completed using only local knowledge and recovery resources. This lessens the need for schedulers and eases their workload.

1.2 Beowulf Clusters
Beowulf clusters are one of the most exciting implementations of Linux today. Originating from the Center of Excellence in Space Data and Information Sciences (CESDIS) at the NASA Goddard Space Center in Maryland, the project's mission statement is: Beowulf is a project to produce the software for off-the-shelf clustered workstations based on commodity PC-class hardware, a high-bandwidth internal network and the Linux operating system. The Beowulf project was conceived by Dr. Thomas Sterling, Chief Scientist, CESDIS. One of NASA's imperatives has always been to share technology with universities and industries. With the Beowulf project, NASA has provided the Linux community with the opportunity to spread into scientific areas needing high performance computing power.

1.3 Local Knowledge and Global Processes
An efficient network sort algorithm is highly desirable, but difficult. The problem is that it requires local operations with global knowledge. So, consider a group of data (for example, that of names in a phone directory) which is to be distributed across a number of processors (for example, 26). Then an efficient technique would be for each processor to take a portion of the unsorted data and send each datum to the processor upon which it eventually belongs (A’s to processor 1, B’s to processor 2, … Z’s to processor 26). A significant practical feature of HeteroSort is that in our experiments it load balances before it finishes sorting. Since HeteroSort detects when the system is sorted, it also detects termination of load balancing. Chengzhong Xu and Francis Lau in Load Balancing in Parallel Computers: Theory and Practice (Boston: Kluwer Academic Publishers, 1997) state:
From a practical point of view, the detection of the global termination is by no means a trivial problem because there is a lack of consistent knowledge in every processor about the whole workload distribution as load balancing progresses.[4]

Thus the global knowledge that all names beginning with the same letter belong on a prespecified processor facilitates local operations in sending off each datum. The problem, however, is that this does not adequately balance the load on the system because there may be many A's (Adams, Anderson, Andersen, Allen) and very few Q's or X's. So the optimal loading (Aaa-Als on processor 1, Alb-Bix on processor 2, … Win-Zzz on processor 26) cannot be known until all the data is sorted. Global knowledge (the optimal loading) is unavailable to the local operations (where to send each datum) because it is not determined until all the local operations are finished. HeteroSort combines load balancing within sorting processes. Traditionally, techniques such as hashing have been used to overcome the non-uniform distribution of data. However, parallel hash tables require expensive computational maintenance to upgrade each sort cycle, thus making them less efficient than HeteroSort, which requires no external tables.

1. Examples of these fault tolerant efforts can be found in the work of Jay Strosnider and his colleagues at the Department of Electrical and Computing Engineering, Carnegie Mellon University, in the Fault-Tolerant Real Time Computing Project. Katcher, Daniel I., Jay K. Strosnider, and Elizabeth A. Hinzelman-Fortino, "Dynamic versus Fixed Priority Scheduling: A Case Study," http://usa.ece.cmu.edu/Jteam/papers/abstracts/tse93.abs.html.

2. For more information, see The University of St. Thomas Artificial Intelligence and High Performance Parallel Processing Research Laboratory's Beowulf cluster web page: Kunau, Timothy M.

1.4 Related Work
Much of the work in this area deals with linear arrays.[2,3] The general approach is to take linear sort techniques and use either a row major or a snake-like grid overlaid on a regular grid topology of processors.[1] The snake-like grid is used at times with a shear-sort or shuffle sorting program where there is first a row operation and then an alternating column operation. So, either the row or the column connections are ignored in each cycle.
2 Approach
HeteroSort is our load balancing and sorting algorithm. Our initial approach was to use four-connectedness (as an example of N-connectedness) for load balancing and sorting. In traditional linear sorts data is either high or low for the processor it is on, and is sent up or down the sort chain accordingly. Our approach differs in that we defined data to be very high, high, low, or very low. In order to do this we first defined a sort sequence across an array of processors as depicted in Figure 1. Next we defined the four neighbors. This is easily understood by examining Node 7 in the example of sixteen processors shown in Figure 1. The neighbors for Node 7 are 2, 6, 8, and 10. When Node 7 receives its initial data, it sorts it and splits it into four quarters. The lowest quarter goes to Node 2, the next lowest quarter goes to Node 6, the third quarter goes to Node 8, and the highest quarter goes to Node 10. Thus, the extremely high and low data are shipped on "express pathways" across the coils of the snake network.
Fig. 1. The sort sequence is overlaid in a snake-like grid across the array of processors. The lowest valued items in the sort will eventually reside on processor 1 and the highest valued items on processor 16. Node 7's four-connected trading partners are in bold: 2, 6, 8, and 10. When Node 7 receives its initial data, it sorts and splits the data into four quarters. The lowest quarter goes to Node 2, the next lowest quarter goes to Node 6, the third quarter to Node 8, and the highest quarter goes to Node 10. Thus the extremely high and low data are shipped across the coils of the snake network.
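A minimal sketch of the sort-and-split step just described is given below; it assumes integer keys and a stand-in send_to() routine for whatever message-passing call the platform provides, and it is not the HeteroSort source.

/* Hedged sketch of one HeteroSort exchange step for an interior node:
 * sort the local data and ship each quarter toward the trading partner
 * in the appropriate direction.  send_to() is a hypothetical stand-in. */
#include <stdlib.h>

extern void send_to(int node, const int *data, int count);  /* assumed */

static int cmp_int(const void *a, const void *b)
{
    return (*(const int *)a > *(const int *)b) - (*(const int *)a < *(const int *)b);
}

/* neighbors[0..3]: very-low, low, high, very-high trading partners
 * (e.g., nodes 2, 6, 8, and 10 for Node 7 in Figure 1). */
void split_and_send(int *local, int n, const int neighbors[4])
{
    qsort(local, n, sizeof(int), cmp_int);
    int q = n / 4;
    send_to(neighbors[0], local,         q);            /* lowest quarter  */
    send_to(neighbors[1], local + q,     q);            /* low quarter     */
    send_to(neighbors[2], local + 2 * q, q);            /* high quarter    */
    send_to(neighbors[3], local + 3 * q, n - 3 * q);    /* highest quarter */
}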
The trading neighbors Node 2 and Node 10, which are not adjacent on the sort sequence (transcoil neighbors), provide a pathway for very low or very high data to pass across the coils of the snake network into another neighborhood of nodes. This provides an express pathway for extremely ill sorted data to move quickly across the network. The concept of four-connectedness is easy to understand with an interior node like Node 7, but the remaining nodes in this example are edge nodes, and their implementation differs slightly.

Table 1. Trading partner list. Determining which data is kept at a node depends on how that node falls among the sort order of its neighbors. For example, node 1 falls below all of its neighbors and thus receives the lowest quarter.

Node  Odd Cycle        Even Cycle
1     1 2 4 8 16       1 16 4 8 2
2     1 2 3 7 15       1 2 15 7 3
3     2 3 4 6 14       2 3 14 6 4
4     1 3 4 5 13       3 1 4 13 5
5     4 5 6 8 12       4 5 12 8 6
6     3 5 6 7 11       5 3 6 11 7
7     2 6 7 8 10       6 2 7 10 8
8     1 5 7 8 9        7 5 1 8 9
9     8 9 10 12 16     8 9 16 12 10
10    7 9 10 11 15     9 7 10 15 11
11    6 10 11 12 14    10 6 11 14 12
12    5 9 11 12 13     11 9 5 12 13
13    4 12 13 14 16    12 4 13 16 14
14    3 11 13 14 15    13 11 3 14 15
15    2 10 14 15 16    14 10 2 15 16
16    1 9 13 15 16     15 9 13 1 16
Simply put, we use a torus for full connectivity. So nodes along the "north" edge of the array, which have no north neighbors, are connected (conceptually) to nodes along the "south" edge and vice versa (transedge neighbors). Similarly, nodes along the "east" edge are given nodes along the "west" edge as east neighbors, and so forth. The odd cycle column of Table 1 summarizes all the nodes of a sixteen node network.
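The trading-partner relation in Table 1 follows directly from the snake ordering with torus wrap-around. The sketch below (an illustration under the assumption of an R x C snake-ordered grid, not code from HeteroSort) reproduces, for a 4 x 4 grid, the partners {2, 6, 8, 10} for Node 7 and {1, 5, 7, 9} for Node 8.

/* Hedged sketch: compute the four torus trading partners of a node in a
 * snake-ordered R x C grid (1-based node ids), matching the odd-cycle
 * column of Table 1 when R = C = 4. */
static int id_of(int row, int col, int C)
{
    /* Even rows run left-to-right, odd rows right-to-left (snake order). */
    return (row % 2 == 0) ? row * C + col + 1
                          : row * C + (C - 1 - col) + 1;
}

void torus_partners(int node, int R, int C, int partners[4])
{
    int row = (node - 1) / C;
    int col = (row % 2 == 0) ? (node - 1) % C : C - 1 - (node - 1) % C;

    partners[0] = id_of((row + R - 1) % R, col, C);   /* "north" wrap */
    partners[1] = id_of((row + 1) % R,     col, C);   /* "south" wrap */
    partners[2] = id_of(row, (col + C - 1) % C, C);   /* "west"  wrap */
    partners[3] = id_of(row, (col + 1) % C,     C);   /* "east"  wrap */
}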
Thus, the use of the torus for four-connectedness provides full connectivity. The result is a modified shear-sort where both row and column connections are used with each round of sorting. Furthermore, ill-sorted data is quickly moved across the network via torus connections. The "express pathway" is a conceptual map of the sorting network. Ideally, the operating system supports express pathways, such as in an Intel Paragon system where we first implemented our algorithm. Where this environmental support is missing, the cost of these non-adjacent operations is higher. In those environments where networks have edges, HeteroSort has three strategies. The first is to still implement the conceptual torus at the higher transmission cost. The second is to re-configure itself to the reality of some nodes having only two or three physical neighbors. A third strategy is particularly useful in heterogeneous environments, where we employ a genetic algorithm to determine the optimal network by minimizing transmission costs.
2.1 Beowulf Clusters
The major portion of this Beowulf background section is abstracted from CESDIS material on their web page: http://www.beowulf.org/. The Beowulf class of computers and its architecture are appropriate to the times. The increasing presence of computers in offices, homes, and schools has led to an abundance of mass produced cost effective components. The COTS (Commodity Off The Shelf) industry now provides fully assembled subsystems (microprocessors, motherboards, disks and network interface cards). The pressure of the mass market place has driven the prices down and reliability up. In addition, shareware, freeware, and open source development, in particular the Linux operating system, the GNU compilers and programming tools, and the MPI and PVM message passing libraries, provide hardware independent software. In the taxonomy of parallel computers, Beowulf clusters fall somewhere between MPPs (Massively Parallel Processors, like the Convex SPP, Cray T3D, Cray T3E, CM5, etc.) and NOWs (Networks of Workstations). The Beowulf project benefits from developments in both these classes of architecture. MPPs are typically larger and have a lower latency interconnect network than Beowulf clusters. Most programmers develop their programs in message passing style. Such programs can be readily ported to Beowulf clusters. Programming a NOW is usually an attempt to harvest unused cycles on an already installed base of workstations in a lab or on a campus. Programming in this environment requires algorithms that are extremely tolerant of load balancing problems and large communication latency. These programs will directly run on a Beowulf. A Beowulf class cluster computer differs from a Network of Workstations in that the nodes in the cluster are dedicated to the cluster. This eases load balancing. Also, this allows the Beowulf software to provide a global process ID, enabling signals to be sent from one node to another node of the system.
The challenge for our HeteroSort has been to adapt a conceptual mesh torus to a Beowulf cluster architecture. A trade in benefits has been the increased expense of nearest neighbor transactions. In the Beowulf, all transactions pass through a switch. This expense trades for the benefit that all other transactions do not have to traverse a network, passing through intervening nodes.

2.2 Optimization of HeteroSort
HeteroSort’s distributed approach can provide an efficient control mechanism for a wide variety of algorithms. It also provides “reconfiguration-on-fault” fault tolerance when a node or network error occurs. HeteroSort automatically reconfigures to account for the failed node(s), and the distributed data is not lost. However, efficient operation requires that major sort axis nodes should reside on near neighbor network physical processors. This minimizes communication costs for efficient operation. And, for a heterogeneous topology, or a homogeneous topology made irregular by failed nodes, automatically achieving this near neighbor configuration for the sort nodes is difficult. Figure 2 indicates a homogeneous mesh made irregular by two failed nodes. The numbers in the boxes (nodes) indicate the node’s position in the sort order.
Fig. 2. Sort order is the number of each node. A homogeneous mesh of 25 nodes made irregular by two failed nodes requires a new sort order for efficient performance. The numbers in the boxes (nodes) indicate the node’s position in the new sort order. The lowest valued items in the sort will eventually reside on processor 1 and the highest valued items on processor 23. This new order optimizes near neighbor relations.
We assume that a message cannot be sent across a failed node. To provide for online reconfiguration of the node sort order, we have developed an adaptive online sort order optimizer named the Scaleable Adaptive Load-balancing (SAL) Online Optimizer (SOO). SOO is performed by using a genetic algorithm which minimizes the total path length of the HeteroSort major sort axis, indicated on the figure by the line from one to 23. Note that other possible minimum path sort orders exist. Also note that for some topologies or failure patterns, strict near neighborness is not achievable. For these cases SOO defines the minimum path that includes store-and-forwards or traversals across other nodes. SOO can optimize given any combination of failed nodes and busses.
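One simple form the SOO fitness measure could take is sketched below, under the assumption that each working node has grid coordinates and that the cost between consecutive nodes in the sort order is their Manhattan distance; the actual SOO cost function is not specified in this paper.

/* Hedged sketch: total path length of a candidate sort order, summing the
 * Manhattan distance between consecutive nodes along the major sort axis.
 * The genetic algorithm would search for an order minimizing this value. */
#include <stdlib.h>

struct node_pos { int row, col; };

int path_length(const int *order, int n, const struct node_pos *pos)
{
    int total = 0;
    for (int i = 0; i + 1 < n; i++) {
        int a = order[i], b = order[i + 1];
        total += abs(pos[a].row - pos[b].row) + abs(pos[a].col - pos[b].col);
    }
    return total;
}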
3 Fault Tolerance
The most important aspect of our algorithm is that it does not depend on a regular network topology (as, for example, a traditional shear sort does) because the torus can be superimposed on any physical architecture. This yields fault tolerance because our system can dynamically reconfigure itself, and easily accommodates “holes” in the connection. All that is required is for HeteroSort to change the partitioning schema in the data, and to stop sending data to a node when it is removed. Three other aspects of fault tolerance result from this algorithm. First, since only local knowledge is used in the sort, the system is fault tolerant because it does not require global knowledge. Thus, individual nodes continue to operate regardless of the performance (or even existence) of other non-neighbor nodes. Second, since each node keeps a backup copy of the data it sends off to its neighbors, if a node is eliminated during operation of the load balancing and sorting, its neighbors can make up for the loss of data. Third, the natural load balancing of the data during operation of the sorts adds a degree of fault tolerance. With data evenly distributed across nodes, then the loss of a node means the minimal loss of data to the system. The intent is to build minimum weight spanning trees and to use them in improving sort efficiency.
5.1 Future Directions
We currently have the concept of near (adjacent) neighbors and far neighbors (which exist with the implementation of the torus structure). This has implications for implementations on heterogeneous and distributed networks. Specifically, the far neighbors are metaphors for nodes on another processor in a distributed system. So, one component of the sort, partition,
and send task could be that the data is partitioned not into equal subsets, but into subsets of a size proportional to the speed of the link to that node. Furthermore, in heterogeneous architectures, the subset size could also be related to the speed of the corresponding neighbor node. Thus, future enhancements will include an applications kernel that will be resident on each node of the heterogeneous network. Upon startup, each kernel will negotiate with its near neighbor kernels to adjust the size of the exchange list (to be load balanced and sorted). The negotiated value will be a function of each node's own capacity in memory, processing, and its number of neighbors. Upon a fault, the kernels will re-negotiate the exchange files with the surviving near neighbors.

Acknowledgments

This research was partially supported by a grant from the Defense Nuclear Agency 93DNA-3. We gratefully acknowledge this support.

References

1. Gu, Qian Ping, and Jun Gu: Algorithms and Average Time Bounds of Sorting on a Mesh-Connected Computer. IEEE Transactions on Parallel and Distributed Systems. Vol 5, no 3 (March 1994) 308-315
2. Lin, Yen-Chun: On Balancing Sorting on a Linear Array. IEEE Transactions on Parallel and Distributed Systems. Vol 4, no 5 (May 1993) 566-571
3. Thompson, C.D., and H.T. Kung: Sorting on a Mesh-Connected Parallel Computer. Communications of the ACM. Vol 20, no 4 (April 1977) 263-271
4. Xu, Chengzhong, and Francis Lau: Load Balancing in Parallel Computers: Theory and Practice. Kluwer Academic Publishers, Boston (1997)
7th Reconfigurable Architectures Workshop (RAW 2000)
Workshop Chair Hossam ElGindy, University of New South Wales (Australia)
Steering Chair Viktor K. Prasanna, University of Southern California at Los Angeles (USA)
Program Chair Hartmut Schmeck, University of Karlsruhe (Germany)
Publicity Chair Oliver Diessel, University of South Australia (Australia)
Programme Committee

Jeff Arnold, Independent Consultant (USA)
Peter Athanas, Virginia Tech (USA)
Gordon Brebner, Univ. of Edinburgh (Scotland)
Andre DeHon, Univ. of California at Berkeley (USA)
Carl Ebeling, Univ. of Washington (USA)
Hossam ElGindy, Univ. of New South Wales (Australia)
Reiner Hartenstein, Univ. of Kaiserslautern (Germany)
Brad Hutchings, Brigham Young Univ. (USA)
Mohammed Khalid, Quickturn Design Systems (USA)
Hyoung Joong Kim, Kangwon National Univ. (Korea)
Rainer Kress, Siemens AG (Germany)
Fabrizio Lombardi, Northeastern University (USA)
Wayne Luk, Imperial College (UK)
Patrick Lysaght, Univ. of Strathclyde (Scotland)
William H. Mangione-Smith, Univ. of California, Los Angeles (USA)
Margaret Marek-Sadowska, Univ. of California, Santa Barbara (USA)
William P. Marnane, Univ. College Cork (Ireland)
Margaret Martonosi, Princeton Univ. (USA)
John T. McHenry, National Security Agency (USA)
Alessandro Mei, Univ. of Trento (Italy)
Martin Middendorf, Univ. of Karlsruhe (Germany)
George Milne, Univ. of South Australia (Australia)
Koji Nakano, Nagoya Institute of Technology (Japan)
Stephan Olariu, Old Dominion Univ. (USA)
Bernard Pottier, Univ. Bretagne Occidentale (France)
Ralph Kohler, Air Force Research Laboratory (USA)
Mark Shand, Compaq Systems Research Center (USA)
Jerry L. Trahan, Louisiana State Univ. (USA)
Ramachandran Vaidyanathan, Louisiana State Univ. (USA)
Preface

The Reconfigurable Architecture Workshop series provides one of the major international forums for researchers to present ideas, results, and on-going research on both theoretical and industrial/practical advances in Reconfigurable Computing. The main focus of this year's workshop is "Run Time Reconfiguration - Foundations, Algorithms, Tools": Technological advances in the field of fast reconfigurable devices have created new possibilities for the implementation and use of complex systems. Reconfiguration at runtime is one new dimension in computing that blurs the barriers between hardware and software components. Neither the existing processor architectures nor the hardware/software design tools which are available today can fully exploit the possibilities offered by this new computing paradigm. The potential of run time reconfiguration can be achieved through the appropriate combination of knowledge about foundations of dynamic reconfiguration, the various different models of reconfigurable computing, efficient algorithms, and the tools to support the design of run time reconfigurable systems and implementation of efficient algorithms. RAW 2000 provides the chance of creative interaction between these disciplines. The programme consists of an invited talk by Steven Guccione (Xilinx), technical sessions of refereed papers on various aspects of Run Time Reconfiguration, and a panel discussion on "The Future of Reconfigurable Computing". The 12 paper presentations were selected out of 27 submissions after a careful review process; every paper was reviewed by at least three members of the programme committee. We hope that the workshop will provide again the environment for productive interaction and exchange of ideas. We would like to thank the organizing committee of IPDPS 2000 for the opportunity to organize this workshop, the authors for their contributed manuscripts, and the programme committee for their effort in assessing the 27 submissions to the workshop.

January 2000
Hartmut Schmeck
Programme of RAW 2000:

Invited Talk: Run-Time Reconfiguration at Xilinx (Steven A. Guccione)
JRoute: A Run-Time Routing API for FPGA Hardware (Eric Keller)
A Reconfigurable Content Addressable Memory (Steven A. Guccione, Delon Levi, Daniel Downs)
ATLANTIS - A Hybrid FPGA/RISC Based Reconfigurable System (O. Brosch, J. Hesser, C. Hinkelbein, K. Kornmesser, T. Kuberka, A. Kugel, R. Männer, H. Singpiel, B. Vettermann)
The Cellular Processor Architecture CEPRA-1X and its Configuration by CDL (Christian Hochberger, Rolf Hofmann, Klaus-Peter Volkmann, Stefan Waldschmidt)
Loop Pipelining and Optimization for Run Time Reconfiguration (Kiran Bondalapati, Viktor K. Prasanna)
Compiling Process Algebraic Descriptions into Reconfigurable Logic (Oliver Diessel, George Milne)
Behavioral Partitioning with Synthesis for Multi-FPGA Architectures under Interconnect, Area, and Latency Constraints (Preetham Lakshmikanthan, Sriram Govindarajan, Vinoo Srinivasan, Ranga Vemuri)
Module Allocation for Dynamically Reconfigurable Systems (Xuejie Zhang, Kamwing Ng)
Augmenting Modern Superscalar Architectures with Configurable Extended Instructions (Xianfeng Zhou, Margaret Martonosi)
Complexity Bounds for Lookup Table Implementation of Factored Forms in FPGA Technology Mapping (Wenyi Feng, Fred J. Meyer, Fabrizio Lombardi)
Optimization of Motion Estimator for Run-Time Reconfiguration Implementation (Camel Tanougast, Yves Berviller, Serge Weber)
Constant-Time Hough Transform on a 3D Reconfigurable Mesh Using Fewer Processors (Yi Pan)
Run-Time Reconfiguration at Xilinx (Invited Talk)

Steven A. Guccione

Xilinx Inc., 2100 Logic Drive, San Jose, CA 95124 (USA)
[email protected]

Abstract. Run-Time Reconfiguration (RTR) provides a powerful, but essentially untapped mode of operation for SRAM-based FPGAs. Research over the last decade has indicated that RTR can provide substantial benefits to system designers, both in terms of overall performance and in terms of design simplicity. While RTR holds great promise for many aspects of system design, it has only recently been considered for commercial application. Two factors seem to be converging to make RTR based system design viable. First, silicon process technology has advanced to the point where million gate FPGA devices are commonplace. This permits larger, more complex algorithms to be directly implemented in FPGAs. This alone has led to a quiet revolution in FPGA design. Today, coprocessing using large FPGA devices coupled to standard microprocessors is becoming commonplace, particularly in Digital Signal Processing (DSP) applications. The second factor is software. Until recently, there was literally no software support available for RTR. Existing ASIC-based design flows based on schematic capture and HDL did not provide the necessary mechanisms to allow implementation of RTR systems. Today, the JBits software tool suite from Xilinx provides the direct support for coprocessing and for RTR. The combination of hardware and software for RTR has already begun to show some impressive results on standard system design methodologies and algorithms. Future plans to enhance both FPGA architectures and tools such as JBits should result in a widening acceptance of this technology.
JRoute: A Run-Time Routing API for FPGA Hardware
Eric Keller Xilinx Inc. 2300 55th Street Boulder, CO 80301
[email protected]
Abstract. JRoute is a set of Java classes that provide an application programming interface (API) for routing of Xilinx FPGA devices. The interface allows various levels of control, from connecting two routing resources to automated routing of a net with fanout. This API also handles ports, which are useful when designing object-oriented macro circuits or cores. Each core can define its own ports, which can then be used in calls to the router. Debug support for circuits is also available. Finally, the routing API has an option to unroute a circuit. Built on JBits, the JRoute API provides access to routing resources in a Xilinx FPGA architecture. Currently the Virtex(TM) family is supported.
1 Introduction
JRoute is an API to route Xilinx FPGA devices. The API allows the user to have various levels of control. Using this API along with JBits, the user can create hierarchical and reusable designs through a library of cores. The JRoute API allows a user to perform run-time reconfiguration (RTR) of the routing resources by preserving the elements of RTR that are present in its underlying JBits [1] foundation. RTR systems are different from traditional design flows in that circuit customization and routing are performed at run-time. Since the placement of cores is one of the parameters that can be configured at run-time, the routing is not predefined. This means that auto-routing can be very useful, especially when connecting ports from two different cores. Furthering the development of RTR computing designs, JRoute enables the implementation of non-trivial run-time parameterizable designs. Since JRoute is an API, it allows users to build tools based on it. These can range from debugging tools to extensions that increase functionality. It is important to note that the JRoute API is independent of the algorithms used to implement it. The algorithms discussed in this paper are the initial implementations and serve to further explain the API. This paper is meant to present the features and benefits of the API, not the algorithms.
2 Overview of the Virtex Routing Architecture
The Virtex architecture has local, general-purpose, and global routing resources. Local resources include direct connections between horizontally adjacent configurable logic blocks (CLBs) and feedback to inputs in the same logic block. Each provides a high-speed connection bypassing the routing matrix, as seen in Figure 1. General-purpose routing resources include long lines, hex lines, and single lines. Each logic block connects to a general routing matrix (GRM). From the GRM, connections can be made to other GRMs along vertical and horizontal channels. There are 24 single-length lines in each of the four directions. There are 96 hex-length lines in each of the four directions that connect to a GRM six blocks away. Only 12 in each direction can be accessed by any given logic block. Some hexes are bi-directional, meaning they can be driven from either endpoint. There are also 12 long lines that run horizontally or vertically for the length of the chip. Long lines are buffered, bi-directional lines that distribute signals across the chip quickly. Long lines can be accessed every 6 blocks. Each type of general routing resource can only drive certain types of wires. Logic block outputs drive interconnects of all lengths, longs can drive hexes only, hexes drive singles and other hexes, and singles drive logic block inputs, vertical long lines, and other singles. There are also global resources that distribute high-fanout signals with minimal skew. These include four dedicated global nets with dedicated pins to distribute high-fanout clock signals. The array sizes for Virtex range from 16x24 CLBs to 64x96 CLBs. For a complete description of the Virtex architecture, see [3].
Fig. 1. Virtex routing architecture.
3 JRoute Features
The JRoute API makes routing easier to perform and helps in the development of large systems with reusable libraries. Unlike the standard Xilinx tools, JRoute can perform the routing at run-time. It also provides debugging facilities. Before describing each of the calls, the architecture description file must first be described. There is a Java class in which all of the architecture information is held. In this class each wire is defined by a unique integer. Also in this class the possible template values are defined, along with the template value under which each wire can be classified. A template value is defined as a value describing a direction and a resource type. For example, a template value of NORTH6 describes any hex wire in the north direction, while a template value of NORTH1 describes any single wire in the north direction. Similar values are defined for each resource type in each direction that it can go. Also in this Java class is a description of each wire, including how long it is, its direction, which wires can drive it, and which wires it can drive.
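As an illustration only, a fragment of such an architecture description class might look roughly like the sketch below. The class and member names (ExampleWires, WireInfo, and the constants) are invented for this sketch and are not the actual JBits/JRoute classes.

// Hypothetical shape of an architecture description class; names are illustrative only.
public final class ExampleWires {
    // Each wire is identified by a unique integer.
    public static final int SINGLE_EAST_5 = 37;
    public static final int HEX_NORTH_4   = 112;

    // Template values describe a direction and a resource type.
    public static final int NORTH1 = 1;   // any single wire going north
    public static final int NORTH6 = 2;   // any hex wire going north

    // Per-wire description: length, direction/template class, and connectivity.
    public static final class WireInfo {
        public final int length;
        public final int templateValue;
        public final int[] drivenBy;   // wires that can drive this wire
        public final int[] drives;     // wires this wire can drive

        public WireInfo(int length, int templateValue, int[] drivenBy, int[] drives) {
            this.length = length;
            this.templateValue = templateValue;
            this.drivenBy = drivenBy;
            this.drives = drives;
        }
    }
}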
3.1 Various Levels of Control
The JRoute API was designed with the goal of providing various levels of control. The calls range from turning a single connection on or off to auto-routing a bus connection.

route (int row, int col, int from_wire, int to_wire):

This call allows the user to make a single connection (i.e. the user decides the path). This can be useful in cases where there is a real-time constraint on the amount of time spent configuring the device. However, the user must know what each wire connects to, and which wires are used. This call turns on the connection between from_wire and to_wire in CLB (row,col). The following example shows how to create a route connecting S1_YQ in CLB (5,7) to S0F3 in CLB (6,8), going through Out[1], SingleEast[5], and SingleNorth[0].

router.route(5, 7, S1_YQ, Out[1]);
router.route(5, 7, Out[1], SingleEast[5]);
router.route(5, 8, SingleWest[5], SingleNorth[0]);
router.route(6, 8, SingleSouth[0], S0F3);
route (Path path):
This call allows the user to define a path. A path is an array of specific resources, for example HexNorth[4], that are to be connected. The path also requires a starting location, defined by a row and column. The router turns on all of the connections defined in the path. The following example shows how to construct and route a path for the same route as in the previous example.

int[] p = {S1_YQ, Out[1], SingleEast[5], SingleNorth[0], S0F3};
Path path = new Path(5, 7, p);
router.route(path);
route (Pin start_pin, int end_wire, Template template):

This call allows the user to specify a template, and the router picks the wires. A template is defined as an array of template values, as previously defined. The user does not have to know the wire connections and the resources in use. Using a template can also take advantage of regularity, which occurs, for example, when connecting each output bit of an adder to an input of another core. The cost is longer execution time, and there is no guarantee that an unused path even exists. For this method a starting pin, defined as a wire at a specific row and column, needs to be given, as well as the ending wire and the template to follow. The router begins at the start wire, then goes through each wire that it drives, as defined in the architecture class, and first checks whether the wire's template value matches the template value specified by the user. If so, it then checks to make sure the wire is not already in use. A recursive call is made with the new wire as the starting point and the first element of the template removed. The call fails if no combination of available resources follows the template; in this case a user action is required. The following example shows how to construct a template and route using it. The source and destination are the same as in the previous two examples. However, the specific resources may differ.

int[] t = {OUTMUX, EAST1, NORTH1, CLBIN};
Template template = new Template(t);
Pin src = new Pin(5, 7, S1_YQ);
router.route(src, S0F3, template);
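The recursive, template-following search described above can be modelled roughly as follows. This is only an illustrative sketch of the behaviour; the Architecture interface, its methods, and the in-use check are placeholders, not the shipped implementation.

// Illustrative model of the template-following search described in the text.
public class TemplateSearch {
    interface Architecture {
        int[] drives(int row, int col, int wire);      // wires this wire can drive
        int templateValueOf(int wire);                  // e.g. NORTH1, EAST1, ...
        boolean inUse(int row, int col, int wire);      // already allocated?
        int rowAfter(int row, int wire);                // position reached over 'wire'
        int colAfter(int col, int wire);
    }

    // Returns true if a free path from (row,col,wire) to endWire exists along the template.
    static boolean search(Architecture arch, int row, int col, int wire,
                          int endWire, int[] template, int step) {
        if (step == template.length) {
            return wire == endWire;                     // template exhausted: must be at the sink
        }
        for (int next : arch.drives(row, col, wire)) {
            if (arch.templateValueOf(next) != template[step]) continue;  // wrong direction/type
            if (arch.inUse(row, col, next)) continue;                    // resource already used
            int nr = arch.rowAfter(row, next);
            int nc = arch.colAfter(col, next);
            if (search(arch, nr, nc, next, endWire, template, step + 1)) {
                return true;                            // a real router would also record the wire
            }
        }
        return false;                                   // no free combination follows the template
    }
}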
Finally, there are the auto-routing calls. These cover source to sink, source to many sinks, and a bus connection of many sources to an equal number of sinks.

route (EndPoint source, EndPoint sink):

This single-source to single-sink call allows for auto-routing of point-to-point connections. An EndPoint is either a Pin, defined by a row, column, and wire, or a Port, which is described in the next subsection. Many algorithms can be used to implement this call. One possibility is to use a maze router [4] [5]. Another possibility that would potentially be faster is to define a set of unique, predefined templates that get from the source to the sink and to try each one. If all of them fail, the router could fall back on a maze algorithm. The benefit of defining the templates would be to reduce the search space. The following example shows how to define the end points (Pins) and connect them. The source and sink are the same as in the previous three examples, for the individual connections, the path route, and the template route. The template followed and the resources used may not necessarily be the same as with the other calls.

Pin src = new Pin(5, 7, S1_YQ);
Pin sink = new Pin(6, 8, S0F3);
router.route(src, sink);
route (EndPoint source, EndPoint[] sink):

This is the method for routing a source to several sinks. It decides the best path for the entire collection of sinks. This call should be used instead of connecting each sink individually, since it minimizes the routing resources used. Each sink gets routed in order of increasing distance from the source. For each sink, the router attempts to reuse the previous paths as much as possible. Because it is not timing driven, this algorithm is suitable only for non-critical nets. For critical nets, the user would need to specify the routes at a lower level. In an RTR environment traditional routing algorithms require too much time. Currently long lines are not supported; only hexes and singles are used. Using long lines would improve the routing of nets with large bounding boxes.

route (EndPoint[] source, EndPoint[] sink):

This is a call for bus connections. In a dataflow design, the outputs of one stage go to the inputs of the next stage. As a convenience, the user does not need to write a Java loop to connect each one. If used along with cores, this call can be very useful when connecting ports to other ports. For example, the output ports of a multiplier core could be connected to the input ports of an adder core; using the bus method, the user does not need to connect each bit of the bus individually (see the sketch below). Each of the auto-routing calls described above uses a greedy routing algorithm. This was chosen because of the designs that are targeted: structured and regular designs often have simple and regular routing. Also, in an RTR environment, global routing followed by detailed routing would not be efficient. Furthermore, RTR designs change during execution, which leaves it unclear what global routing would even mean.
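A minimal usage sketch of the two fanout-oriented calls, written in the same snippet style as the earlier examples; the coordinates and wire names are placeholder values reused from those examples.

// Fanout: one source, several sinks (coordinates and wire names are placeholders).
Pin src = new Pin(5, 7, S1_YQ);
EndPoint[] sinks = { new Pin(6, 8, S0F3), new Pin(7, 8, S0F3), new Pin(8, 8, S0F3) };
router.route(src, sinks);

// Bus: source i is connected to sink i; no explicit Java loop is needed.
EndPoint[] busSrc  = { new Pin(5, 7, Out[0]), new Pin(5, 7, Out[1]) };
EndPoint[] busSink = { new Pin(6, 8, S0F3),   new Pin(7, 8, S0F3)   };
router.route(busSrc, busSink);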
3.2 Support for Cores
Another goal when designing the JRoute API was to support a hierarchical and reusable library of run-time parameterizable cores. Before JRoute, the user of a core needed to know each pin (an input or output to a logic resource) that needs to be connected. With JRoute, a core can define ports. Ports are virtual pins that provide input or output points to the core. The core can use the ports in calls to the router instead of specifying a specific pin. To the user there is no distinction between a physical pin, defined as a location and wire, and a logical port, as they are both derived from the EndPoint class. The core can define a connection from internal pins to ports. It can also specify connections from ports of internal cores to its own ports. The router knows about ports, and when one is encountered, it translates it to the corresponding list of pins. When a port gets routed, the source and sinks connected to the port are saved. This information is useful for the unrouter and the debugging features, which are described later. There are routing guidelines that need to be followed when designing a core. First, each port needs to be in a group. For example, if there is an adder with an n-bit output, each bit is defined as a port and put into the same group.
The group can be of any size greater than zero. Second, the router needs to be called for each port defined. This call defines the connections to the port from pins internal to the core. Finally, a getPorts() method must be defined for each group, which returns the array of Ports associated with that group.
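A skeletal core following these guidelines might look like the sketch below. The Port constructor, the JRouter type name, and the internal pin wire are assumptions made for the illustration and may differ from the real API.

// Illustrative-only core skeleton; constructor and type names are assumptions.
public class AdderCore {
    private Port[] sumPorts;                              // one port per output bit, all in one group

    public void implement(JRouter router, int row, int col, int width) {
        sumPorts = new Port[width];                       // group size must be greater than zero
        for (int bit = 0; bit < width; bit++) {
            sumPorts[bit] = new Port();                   // assumed constructor
            Pin internal = new Pin(row, col + bit, S1_YQ);  // placeholder internal pin
            router.route(internal, sumPorts[bit]);        // define the pin-to-port connection
        }
    }

    // Required accessor: one getPorts() method per port group.
    public Port[] getPorts() {
        return sumPorts;
    }
}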
3.3 Unrouter
Run-time reconfiguration requires an unrouter. There may be situations when a route is no longer needed, or the net endpoints change. Unrouting the nets frees up resources. A core may be replaced with the same type of core having different parameters. In this case the user can unroute the core and then replace it. The port connections are removed, but are remembered. If the ports are reused, they will be automatically connected to the new core. For example, consider a constant multiplier. The system connects it to the circuit and later requires a new constant. The core can be removed, unrouted, and replaced with a new constant multiplier without having to specify the connections again. Core relocation is handled in a similar way.

unroute (EndPoint source);

An unrouter can work in either the forward or reverse direction. In the forward direction a source pin is specified. The unrouter then follows each of the wires the pin drives and turns it off. This continues until all of the sinks are found.

reverseUnroute (EndPoint sink);

In the reverse direction a sink pin is specified. The entire net, starting from the source, is not removed. Only the branch that leads to the specified pin is turned off and freed for reuse. The unrouter starts at the sink pin and works backwards, turning off wires along the way, until it comes to a point where a wire is driving multiple wires. It stops there because only the branch to the given sink is to be unrouted.
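The constant-multiplier replacement scenario described above could be driven roughly as follows; the core object and its methods other than getPorts() are placeholders for this sketch.

// Remove the routes fed from the old constant multiplier's output ports, then
// replace the core; reused ports are reconnected automatically by the router.
for (Port p : oldMultiplier.getPorts()) {
    router.unroute(p);                  // forward unroute from each output port
}
// ... configure a new constant multiplier in the same location, reusing the ports ...

// Alternatively, free only the branch of a net that feeds one particular sink:
Pin sink = new Pin(6, 8, S0F3);
router.reverseUnroute(sink);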
3.4 Avoiding Contention
isOn (int row, int col, int wire);
This call checks to see if the wire in CLB (row,col) is currently in use. The Virtex architecture has bi-directional routing resources. This means that a track can be driven at either end, leading to the possibility of contention. The router makes sure that this situation does not occur, and therefore protects the device. An exception is thrown in cases where the user tries to make connections that create contention. In the auto-routing calls, the router checks to see if a wire is already used, which avoids contention.

3.5 Debugging Features
trace (EndPoint source);
A JRoute call traces a source to all of its sinks. The entire net is returned for
the trace. Debugging tools, such as BoardScope [2], can use this to view each sink. reverseTrace (EndPoint sink);
A sink is traced back to its source. Only the net that leads to the sink is returned.
4 JRoute versus Routing with JBits
JRoute uses the JBits low-level interface to Xilinx FPGA configuration bitstreams, which only provides manual routing. The JRoute API extensions provide automated routing support while not prohibiting JBits calls. JRoute facilitates the use of run-time relocatable and parameterizable cores. Using cores and the JRoute API, a user can create designs without knowledge of the routing architecture by using port-to-port connections. The user really only needs a small set of architecture-specific cores to start with. For example, a counter can be made from a constant adder with the output fed back to one input port and the other input set to a value of one.
5 Portability
Currently, JRoute only supports Virtex devices. However, it can be extended to support future Xilinx architectures. The API itself would not need to change; only the architecture description class would need to be created for the new architecture. The algorithms as presented in this paper have some architecture dependencies. For example, when routing a single source to a single sink, defining the set of predefined templates is architecture dependent. However, algorithms can be designed that have no architecture dependencies and could be used with new architectures. These algorithms would use the architecture class to choose wires, check their lengths, and check the connectivity. The path-based router and template-based router have no knowledge of the architecture outside of what the architecture class provides.
6 Future Work
Virtex features such as IOBs and Block RAM will be supported in a future release of JRoute. Also, skew minimization will be addressed, and the use of long lines to improve the routing of certain nets will be examined. Finally, different algorithms are being investigated, such as [6].
7 Conclusions
JRoute is a powerful abstraction of the Xilinx FPGA routing resources. A routing API facilitates the design of object-oriented circuits that are configurable at run-time. JRoute makes many options available, such as connecting two points whose locations are determined dynamically.
Hierarchical core-based design using JRoute permits easier management of design complexity than using JBits alone. JRoute automates much of the routing and reduces the need to understand the routing architecture of the device. JRoute also provides support for large designs by allowing cores to define ports. RTR features include the unrouter, which allows cores to be removed or replaced at run-time without having to reconfigure the entire design. Auto-routing calls allow connections to be specified even if the placement is not known until run-time.
Acknowledgements Thanks to Cameron Patterson for his guidance and help in understanding routing algorithms. This work was supported by DARPA in the Adaptive Computing Systems (ACS) program under contract DABT63-99-3-0004.
References
1. S. A. Guccione and D. Levi, "XBI: A Java-based interface to FPGA hardware," Configurable Computing Technology and its Uses in High Performance Computing, DSP and Systems Engineering, Proc. SPIE Photonics East, J. Schewel (Ed.), SPIE - The International Society for Optical Engineering, Bellingham, WA, November 1998.
2. D. Levi and S. A. Guccione, "BoardScope: A Debug Tool for Reconfigurable Systems," Configurable Computing Technology and its Uses in High Performance Computing, DSP and Systems Engineering, Proc. SPIE Photonics East, J. Schewel (Ed.), SPIE - The International Society for Optical Engineering, Bellingham, WA, November 1998.
3. Xilinx, Inc., The Programmable Logic Data Book, 1999.
4. Naveed A. Sherwani, Algorithms for VLSI Physical Design Automation, Kluwer Academic Publishers, Norwell, Massachusetts, 1993.
5. Stephen D. Brown, Robert J. Francis, Jonathan Rose, and Zvonko G. Vranesic, Field-Programmable Gate Arrays, Kluwer Academic Publishers, Norwell, Massachusetts, 1992.
6. J. Swartz, V. Betz, and J. Rose, "A Fast Routability-Driven Router for FPGAs," ACM/SIGDA International Symposium on Field Programmable Gate Arrays, Monterey, CA, 1998.
A Reconfigurable Content Addressable Memory
Steven A. Guccione, Delon Levi and Daniel Downs
Xilinx Inc. 2100 Logic Drive San Jose, CA 95124 (USA)
[email protected] [email protected] [email protected]
Abstract. Content Addressable Memories or CAMs are popular parallel matching circuits. They provide the capability, in hardware, to search a table of data for a matching entry. This functionality is a high-performance alternative to popular software-based searching schemes. CAMs are typically found in embedded circuitry where fast matching is essential. This paper presents a novel approach to CAM implementation using FPGAs and run-time reconfiguration. This approach produces CAM circuits that are smaller, faster, and more flexible than traditional approaches.
1 Introduction
Content Addressable Memories or CAMs are a class of parallel pattern matching circuits. In one mode, these circuits operate like standard memory circuits and may be used to store binary data. Unlike standard memory circuits, however, a powerful match mode is also available. This match mode permits all of the data in the CAM device to be searched in parallel. While CAM hardware has been available for decades, its use has typically been in niche applications, embedded in custom designs. Perhaps the most popular application has been in cache controllers for central processing units. Here CAMs are often used to search cache tags in parallel to determine if a cache "hit" or "miss" has occurred. Clearly in this application performance is crucial, and parallel search hardware such as a CAM can be used to good effect. A second and more recent use of CAM hardware is in the networking area [3]. As data packets arrive at a network router, processing of these packets typically depends on the network destination address of the packet. Because of the large number of potential addresses, and the increasing performance demands, CAMs are becoming popular for processing network address information.
2 A Standard CAM Implementation
CAM circuits are similar in structure to traditional Random Access Memory (RAM) circuits, in that data may be written to and read from the device [5]. In
addition to functioning as a standard memory device, CAMs have an additional parallel search or match mode. The entire memory array can be searched in parallel using hardware. In this match mode, each memory cell in the array is accessed in parallel and compared to some value. If this value is found in any of the memory locations, a match signal is generated. In some implementations, all that is significant is that a match for the data is found. In other cases, it is desirable to know exactly where in the memory address space this data was located. Rather than producing a simple "match" signal, some CAM implementations also supply the address of the matching data. In some sense, this provides a functionality opposite to that of a standard RAM. In a standard RAM, addresses are supplied to the hardware and the data at that address is returned. In a CAM, data is presented to the hardware and an address is returned. At a lower level, the actual transistor implementation of a CAM circuit is very similar to a standard static RAM. Figure 1 shows transistor-level diagrams of both CMOS RAM and CAM circuits. The circuits are almost identical, except for the addition of the match transistors to provide the parallel search capability.
Fig. 1. RAM versus CAM transistor-level circuits.
In a CMOS static RAM circuit, as well as in the CAM cell, data is accessed via the BIT lines and the cells are selected via the WORD lines. In the CAM cell, however, the match mode is somewhat different. Inverted data is placed on the BIT lines. If any cell contains data which does not match, the MATCH line is pulled low, indicating that no match has occurred in the array. Clearly this transistor-level implementation is efficient and may be used to produce CAM circuits which are nearly as dense as comparable static RAM circuits. Unfortunately, such transistor-level circuits cannot be implemented using standard programmable logic devices.
3 An FPGA CAM Implementation
Of course, a content addressable memory is just a digital circuit, and as such may be implemented in an FPGA. The general approach is to provide an array
of registers to hold the data, and then use some collection of comparators to see if a match has occurred. While this is a viable solution, it suffers from the same sort of inefficiencies that plague FPGA-based RAM implementations. Like RAM, the CAM is efficiently implemented at the transistor level. Using gate-level logic, particularly programmable or reconfigurable logic, often results in a substantial penalty, primarily in size. Because the FPGA CAM implementation relies on flip-flops as the data storage elements, the size of the circuit is restricted by the number of flip-flops in the device. While this is adequate for smaller CAM designs, larger CAMs quickly deplete the resources of even the largest available FPGA.
4 The Reconfigurable Content Addressable Memory (RCAM)
The Reconfigurable Content Addressable Memory or RCAM makes use of run-time reconfiguration to efficiently implement a CAM circuit. Rather than using the FPGA flip-flops to store the data to be matched, the RCAM uses the FPGA Look-Up Tables or LUTs. Using LUTs rather than flip-flops results in a smaller, faster CAM. The approach uses the LUT to provide a small piece of CAM functionality. In Figure 2, a LUT is loaded with data which provides a "match 5" functionality. That is, whenever the binary encoded value "5" is sent to the four LUT inputs, a match signal is generated. Note that using a LUT to implement CAM functionality, or any functionality for that matter, is not unique. An N-input LUT can implement any arbitrary function of N inputs, including a CAM.
Fig. 2. Using a LUT to match 5.
Because a LUT can be used to implement any function of N variables, it is also possible to provide more flexible matching schemes than the simple match described in the circuit of Figure 2. In Figure 3, the LUT is loaded with values which produce a match on any value but binary "4". This circuit demonstrates the ability to embed a mask in the configuration of a LUT, permitting arbitrary disjoint sets of values to be matched within the LUT. This function is important in many matching applications, particularly networking.
Fig. 3. Using a LUT to match all inputs except 4.
This approach can be used to provide matching circuits such as match-all or match-none, or any combination of possible LUT values. Note again that this arbitrary masking only applies to a single LUT. When combining LUTs to make larger CAMs, the ability to perform such masking becomes more restricted. While using LUTs to perform matching is a powerful approach, it is somewhat limited when used with traditional design tools. With schematics and HDLs, the LUT contents may be specified, albeit with some difficulty. And once specified, modifying these LUTs is difficult or impossible. However, modification of FPGA circuitry at run-time is possible using a run-time reconfiguration tool such as JBits [1]. JBits permits LUT values, as well as other parts of the FPGA circuit, to be modified arbitrarily at run time and in-system. An Application Program Interface (API) into the FPGA configuration permits LUTs, for instance, to be modified with a single function call. This, combined with the partial reconfiguration capabilities of new FPGA devices such as Virtex(tm), permits the LUTs used to build the RCAM to be easily modified under software control, without disturbing the rest of the circuit. Finally, using run-time reconfiguration software such as JBits, RCAM circuits may be dynamically sized, even at run-time. This opens the possibility of not only changing the contents of the RCAM during operation, but actually changing the size and shape of the RCAM circuit itself. This results in a situation analogous to dynamic memory allocation in RAM. It is possible to "allocate" and "free" CAM resources as needed by the application.
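To make the LUT-based matching concrete, the following sketch (plain Java, independent of JBits) computes the 16-entry truth table a 4-input LUT would hold for the "match 5" function of Figure 2 and the "match everything except 4" function of Figure 3; bit i of the table is 1 exactly when input value i should produce a match.

public class LutMatch {
    // Truth table for a 4-input LUT that matches exactly one 4-bit value.
    static int matchValue(int value) {
        return 1 << value;               // only entry 'value' produces a 1
    }

    // Truth table that matches every 4-bit input except the given one.
    static int matchAllExcept(int value) {
        return 0xFFFF & ~(1 << value);   // all 16 entries 1 except 'value'
    }

    // Evaluate a LUT truth table for a given 4-bit input.
    static boolean lookup(int table, int input) {
        return ((table >> input) & 1) == 1;
    }

    public static void main(String[] args) {
        int match5  = matchValue(5);       // behaves like Figure 2
        int allBut4 = matchAllExcept(4);   // behaves like Figure 3
        System.out.printf("match5 table  = 0x%04X, lookup(5)=%b, lookup(4)=%b%n",
                          match5, lookup(match5, 5), lookup(match5, 4));
        System.out.printf("allBut4 table = 0x%04X, lookup(4)=%b, lookup(7)=%b%n",
                          allBut4, lookup(allBut4, 4), lookup(allBut4, 7));
    }
}

Replicating such tables across several LUTs, with additional logic ANDing their outputs, yields the wider match words discussed in the next section.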
5 An RCAM Example
One currently popular use for CAMs is in networking. Here data must be processed under demanding real-time constraints. As packets arrive, their routing information must be processed. In particular, destination addresses, typically in the form of 32-bit Internet Protocol (IP) addresses, must be classified. This typically involves some type of search. Current software-based approaches rely on standard search schemes such as hashing. While effective, this approach requires a powerful processor to keep up with the real-time demands of the network. Offloading the computationally demanding matching portion of the algorithms to external hardware permits less powerful processors to be used in the system. This results in savings not only
in the cost of the processor itself, but in other areas such as power consumption and overall system cost. In addition, an external CAM provides networking hardware with the ability to achieve packet processing in essentially constant time. Provided all elements to be matched fit in the CAM circuit, the time taken to match is independent of the number of items being matched. This provides not only good scalability properties, but also permits better real-time analysis. Other software-based matching schemes such as hashing are data-dependent and may not meet real-time constraints, depending on complex interactions between the hashing algorithm and the data being processed. CAMs suffer no such limitations and permit easy analysis and verification.

Fig. 4. Matching a 32-bit IP header.

Figure 4 shows an example of an IP Match circuit constructed using the RCAM approach. Note that this example assumes a basic 4-input LUT structure for simplicity. Other optimizations, including the use of special-purpose hardware such as carry chains, are possible and may result in substantial circuit area savings and clock speed increases. This circuit requires one LUT input per matched bit. In the case of a 32-bit IP address, the circuit requires 8 LUTs to provide the matching, and three additional 4-input LUTs to provide the ANDing for the MATCH signal. This basic 32-bit matching block may be replicated in an array to produce the CAM circuit. Again, note that other non-LUT implementations for generating the MATCH signal are possible. Since the LUTs can be used to mask the matching data, it is possible to put in "match all" conditions by setting the LUTs to all ones. Other more complicated masking is possible, but typically only on groups of four inputs. While this does not provide for the most general case, it appears to cover the popular modes of matching.
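The decomposition of Figure 4 can be modelled in software as eight 4-input LUT truth tables, one per nibble of the IP address, whose outputs are ANDed together. The sketch below illustrates that decomposition only; it is not the JBits code used to configure the device, and the class and method names are made up for the example.

public class IpMatchModel {
    private final int[] nibbleTables = new int[8];       // one 16-entry truth table per nibble

    // Program the model to match one 32-bit IP address exactly.
    public IpMatchModel(int ipAddress) {
        for (int i = 0; i < 8; i++) {
            int nibble = (ipAddress >>> (4 * i)) & 0xF;
            nibbleTables[i] = 1 << nibble;               // LUT i matches only this nibble
        }
    }

    // Mark nibble i as "match all", as described for the all-ones LUT.
    public void maskNibble(int i) {
        nibbleTables[i] = 0xFFFF;
    }

    // The MATCH output: AND of all eight per-nibble LUT outputs.
    public boolean matches(int candidate) {
        for (int i = 0; i < 8; i++) {
            int nibble = (candidate >>> (4 * i)) & 0xF;
            if (((nibbleTables[i] >> nibble) & 1) == 0) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        IpMatchModel m = new IpMatchModel(0xC0A80101);   // 192.168.1.1 as an example
        System.out.println(m.matches(0xC0A80101));       // true
        m.maskNibble(0);                                  // ignore the low nibble
        System.out.println(m.matches(0xC0A8010F));       // true after masking
    }
}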
6 System Issues
The use of run-time reconfiguration to construct, program, and reprogram the RCAM results in some significant overall system savings. In general, both the hardware and the software are greatly simplified. Most of the savings accrue from being able to directly reconfigure the LUTs, rather than having to write them directly as in standard RAM circuits. Reconfiguration, rather than direct access to the stored CAM data, first eliminates all of the read/write access circuitry. This includes the decode logic to decode each address, the wiring necessary to broadcast these addresses, the data busses for reading and writing the data, and the IOBs used to communicate with external hardware. It should be pointed out that this interface portion of the circuitry is substantial, both in its size and its complexity. Busses typically consume tri-state lines, which are often scarce. Depending on the addressing scheme, tens of IOBs will necessarily be consumed. These also tend to be valuable resources. The address decoders are also somewhat problematic circuits and often require special-purpose logic to be implemented efficiently. In addition, the bus interface is typically the most timing-sensitive portion of the circuit and requires careful design and simulation. This is eliminated with the use of run-time reconfiguration. Finally, the system software is simplified. In a standard bus interface approach, device drivers and libraries must be written, debugged, and maintained to access the CAM. And when the system software or processor changes, this software must be ported to the new platform. With the RCAM, all interfacing is performed through the existing configuration port, at no additional overhead. The cost of using the configuration port rather than direct hardware access is primarily one of setup speed. Direct writes can typically be done in some small number of system cycles. Reconfiguration of the RCAM to update table entries may take substantially longer, depending on the implementation. Partial reconfiguration in devices such as Virtex permits changes to be made more rapidly than in older bulk-configuration devices, but the speed may be orders of magnitude slower than direct hardware approaches. Clearly the RCAM approach favors applications with slowly changing data sets. Fortunately, many applications appear to fit into this category.
7 Comparison to Other Approaches
While CAM technology has been in widespread use for decades, there had until recently been little interest in producing commercial CAM devices. Recent interest in CAMs, driven primarily by the high-performance networking market, has resulted in commercially available CAM devices. Music Semiconductor [4] and Net Logic [2] are two companies which provide CAM devices tailored specifically for the networking market. In addition, at least one FPGA manufacturer, Altera, has begun to embed CAM hardware into their Apex(tm) devices. While this circuitry is embedded in
an FPGA, it is special purpose and not part of the general configurable fabric. It is included here for comparison, but it should be pointed out that special-purpose hardware is readily inserted into FPGAs. The cost here is in flexibility. The special-purpose hardware must be used for a specific circuit, at a specific physical location, or not used at all. In this sense, this embedded CAM has more in common with custom solutions than programmable solutions. But the specifications are included here for comparison.

CAM (Virtex V1000)        768 x 32          384 x 64
RCAM (Virtex V1000)       3K x 32           1K x 64
Quality Semiconductor     1K x 64           2K x 64
Net Logic                 16K x 64          8K x 128
Music Semiconductor       2K/4K/6K x 32     2K/4K/6K x 64
Altera APEX               1K-8K x 32        500-4K x 64
Fig. 5. Some commercially available CAM devices. Figure 5 gives some sizes for current commercially available devices. While these are custom CAM implementations and can expected to be denser than FPGA implementations, the RCAM sizes are within the general range of that available from custom implementations. In addition, the RCAM circuits are more
exible and may be placed at any location within the FPGA and may be integrated with other logic in the design. Finally, the RCAM approach is approximately 3-4 times denser than attempting to implement a CAM using an FPGA and traditional design approaches. Optimizations using logic such as the Virtex carry chain also indicate improvements of an additional 40%.
8 Associative Processing Today, advances in circuit technology permit large CAM circuits to be built. However, uses for CAM circuits are not necessarily limited to niche applications like cache controllers or network routers. Any application which relys on the searching of data can bene t from a CAM-based approach. A short list of some potential application areas that can bene t from fast matching are Arti cial Intelligence, Database Search, Computer Aided Design, Graphics Acceleration and Computer Vision. Much of the work in using parallel matching hardware to accelerate algorithms was carried out in the 1960s and 1970s, when several large parallel matching machines were constructed. An excellent survey of so-called Associative Processors can be found in Yau and Fung [7]. With the rapid growth both in size and speed of traditional processors in the intervening years, much of the interest in CAMs has faded. However, as realtime constraints in areas such as networking become impossible to meet with
traditional processors, solutions such as CAM-based parallel search will almost certainly become more prevalent. In addition, the use of parallel matching hardware in the form of CAMs can provide another, more practical benefit. For many applications, the use of CAM-based parallel search can offload much of the work done by the system processor. This should permit smaller, cheaper, and lower-power processors to be used in embedded applications which can make use of CAM-based parallel search.
9 Conclusions
The RCAM is a flexible, cost-effective alternative to existing CAMs. By using FPGA technology and run-time reconfiguration, fast, dense CAM circuits can be easily constructed, even at run-time. In addition, the size of the RCAM may be tailored to a particular hardware design, or even to temporary changes in the system. This flexibility is not available in other CAM solutions. Moreover, the RCAM need not be a stand-alone implementation. Because the RCAM is entirely a software solution using state-of-the-art FPGA hardware, it is quite easy to embed RCAM functionality in larger FPGA designs. Finally, we believe that existing applications, primarily in the field of network routing, are just the beginning of RCAM usage. Once designers realize that simple, fast, flexible parallel matching is available, it is likely that other applications and algorithms will be accelerated using this approach.
10 Acknowledgements Thanks to Kjell Torkellesson and Mario Dugandzic for discussions on networking. And thanks especially to Paul Hardy for early RCAM discussions.
References
1. Steven A. Guccione and Delon Levi. XBI: A Java-based interface to FPGA hardware. In John Schewel, editor, Configurable Computing Technology and its Use in High Performance Computing, DSP and Systems Engineering, Proc. SPIE Photonics East, pages 97-102, Bellingham, WA, November 1998. SPIE - The International Society for Optical Engineering.
2. Net Logic Microsystems. World Wide Web page http://www.netlogicmicro.com/, 1999.
3. R. Neale. Is content addressable memory (CAM) the key to network success? Electronic Engineering, 71(865):9-12, February 1999.
4. Music Semiconductor. World Wide Web page http://www.music-ic.com/, 1999.
5. Neil Weste and Kamran Eshraghian. Principles of CMOS VLSI Design. Addison-Wesley Publishing Company, 1985.
6. Xilinx, Inc. The Programmable Logic Data Book, 1996.
7. S. S. Yau and H. S. Fung. Associative processor architecture - a survey. Computing Surveys, 9(1):3-27, March 1977.
ATLANTIS – A Hybrid FPGA/RISC Based Re-configurable System O. Brosch, J. Hesser, C. Hinkelbein, K. Kornmesser, T. Kuberka, A. Kugel, R. Männer, H. Singpiel, B. Vettermann Lehrstuhl für Informatik V, Universität Mannheim, D-68131 Mannheim, Germany {brosch, hinkelbein, kornmesser, kuberka, kugel, maenner, singpiel}@ti.uni-mannheim.de,
[email protected],
[email protected]
Abstract. ATLANTIS is the result of 8 years of experience with large standalone and smaller PCI based FPGA processors. Dedicated FPGA boards for computing and I/O plus a private backplane for a data rate of up to 1 GB/s support flexibility and scalability. FPGAs with more than 100k gates and 400 I/O pins per chip are used. CompactPCI provides the basic communication mechanism. Current real-time applications include pattern recognition tasks in high energy physics, 2D image processing, volume rendering, and n-body calculations in astronomy. First measurements and estimations show accelerations of up to a factor of 25 compared to a PC workstation or to commercial volume rendering hardware, respectively. Our object-oriented development environment, CHDL, is used for application programming.
1 Introduction 8 years of experience with FPGA based computing machines show that this new class of computers is an ideal concept for constructing special-purpose processors. As processing unit, I/O unit and bus system are implemented in separate modules, this kind of system provides scalability in computing power as well as I/O bandwidth. Enable-1 [1] was the first FPGA processor developed at Mannheim University in 1994, tailored for a specific pattern recognition task. More general machines were introduced at about the same time, e.g. DecPeRLe-1 [2] or Splash-2 [3]. Enable-1 was followed by a general-purpose FPGA processor in 1996, the Enable++ [4] system. In addition to the large scale Enable++ system a small PCI based FPGA coprocessor – microEnable [5] – was developed in late 1997. It turned out that the simplicity together with the tight host-coupling of the smaller system was a significant improvement compared to Enable++. The new FPGA processor ATLANTIS combines advantages of its predecessors Enable-1, Enable++, microEnable and others, and introduces several new features. The first is the ability to combine FPGA and RISC performance. A unique feature is the scalability and the fast data exchange between the different modules due to the CompactPCI and private bus backplane system. Another highlight is the configurable memory system which complements the flexibility of the FPGAs. We use CHDL, an
unique object-oriented software tool-set that was developed at our institute, to create and simulate hybrid applications.
2 ATLANTIS System Architecture
A well-tried means to adjust a hybrid system to different applications is modularity. ATLANTIS implements modularity on different levels. First of all there are the main entities, host CPU and FPGA processor, which allow an application to be partitioned into modules tailored for either target. Next, the architecture of the FPGA processor uses one board-type (ACB) to implement mainly computing tasks and another board-type (AIB) to implement mainly I/O-oriented tasks. A CompactPCI based backplane (AAB) as interconnect system provides scalability and supports an arbitrary mix of the two board-types, thus providing a high-speed interconnect. Finally, modularity is used on the sub-board level by allowing different memory types or different I/O interfaces per board type. Only FPGA devices with a high I/O pin-count and a complexity in the 100k gate range are of interest for the ATLANTIS project. Two additional features are important either for our concept or for some applications: support for read-back/test and asynchronous dual-ported memory (DP-RAM). In particular the partial reconfiguration is of great interest for co-processing applications involving hardware task switches. These features and a relatively low price guided the decision to use the Lucent ORCA 3T125 in the ATLANTIS system. The latest Xilinx family – the VIRTEX series – is also a good choice but was not available on the market at the time the ACB was designed. However, the AIB carries two VIRTEX XCV600 chips. The ACB and the AIB both use a PLX9080 as PCI interface. This chip is compatible with the one used with the microEnable FPGA coprocessor. Furthermore, the entire on-board support logic – like FPGA configuration and clock control – which is implemented in a large CPLD, is derived from microEnable. This high degree of compatibility ensures that virtually all basic software (WinNT driver, test tools, etc.) is immediately available for ATLANTIS. Clock generation and distribution is an important issue for large FPGA processors. The basic approach in ATLANTIS is to provide a central clock from the AAB. Additionally, the I/O ports of all FPGAs on both ACB and AIB have their individual clock sources. Finally, each ACB and AIB provides a local clock which can be used if the main AAB clock is not available or if the application requires an additional clock. All clocks are programmable in the range of a few MHz up to at least 80 MHz. Programming is done under software control from the CPU module.
2.1 ATLANTIS Computing Board (ACB)
The core of the main processing unit of the ATLANTIS system consists of a 2*2 FPGA matrix. Assuming an average gate count of approximately 186k per chip for the ORCA 3T125, this sums up to 744k FPGA gates. Each FPGA has 4 different ports:
· 2 ports @ 72 lines each to a neighboring FPGA in the vertical and horizontal direction,
· 1 logical I/O port @ 72 lines, and
· 1 memory interconnect port @ 206 lines.
These 4 ports use a total of 422 I/O signals per FPGA. The 72 lines of FPGA interconnect provide for high bandwidth as well as multi-channel communication between chips. The memory interconnect port is built from 2 high-density 124-pin mezzanine connectors per FPGA. Depending on the application, memory modules with different architectures can be used to optimize system performance. For example, the HEP TRT trigger (see below) will employ memory modules organized as a single bank of 512k * 176 bit of synchronous SRAM per module, leading to a total of 44 MB per ACB. The 3D-rendering algorithm will use a single module of triple width with 512 MB of SDRAM organized in 8 simultaneously accessible banks. A more generalized module – also used for 2D image processing – will take 9 MB of synchronous SRAM organized in 2 banks of 512k * 72 bits. The I/O port serves different tasks on the 4 FPGAs, depending on the physical connection of the respective chip:
· One FPGA is connected to the PLX9080 PCI interface chip, thus providing the host-I/O functionality.
· Two FPGAs are connected to the private backplane bus.
· One FPGA is attached to two parallel LVDS connectors for external I/O. The connectors can be used to attach I/O modules, e.g. S-Link1, to set up a downscaled or test system without the need to add AAB and AIB modules.
The 2 backplane ports support high-speed I/O of 1 GB/s @ 66 MHz, 2*64 bits. The host interface via PCI is compatible with the one used with microEnable, allowing a maximum data rate of 125 MB/s.
2.2 ATLANTIS I/O Board (AIB)
The task of the ATLANTIS I/O units is to connect the ATLANTIS system to its real-world environment via the private backplane bus. To provide maximum flexibility in connecting to external data sources or destinations, a modular design of the I/O boards was selected. Given the standard CompactPCI card size, every AIB is able to carry up to four mezzanine I/O daughter-boards. Two Xilinx VIRTEX XCV600 FPGAs control the four I/O ports. Interfacing to the AAB and to the local PCI bridge is done in the same fashion as on the ACB. The default capacity of any of the four channels is 32 + 4 data bits @ 66 MHz (or 264 MB/s, ignoring the 4 extra bits). Thus the four I/O channels provide the same bandwidth as the 2 backplane ports: 1 GB/s. To provide a sustained and high I/O bandwidth even at small block sizes, buffering of data can be done in two stages (numbers per I/O channel):
· A 32k * 36 FIFO-style buffer connected directly to the I/O port, implemented with dual-ported memory.
· A 1M * 36 general-purpose buffer implemented with synchronous SRAM.
The fact that both FPGAs are connected to the PLX local bus provides a communication means in case channel synchronization, loop-back or the like is needed.
1 S-Link is a FIFO-like CERN internal standard for point-to-point links.
2.3 ATLANTIS Active Backplane (AAB)
ACBs and AIBs share the same I/O circuit with 160 signal lines. Connections between boards are made using the private bus system of the AAB. The default configuration of the I/O lines will be 4 channels of 32 bits plus control; however, any granularity from 16 channels of a single byte to 2 channels of 64 bits might be useful. Different backplanes can be used in order to scale the ATLANTIS system to the respective application. A simple pipelined, passive, i.e. not configurable, backplane is currently used for system and performance tests. The total bandwidth is 1 GB/s per slot. For example, configuring the backplane as two independent pairs of ACBs and AIBs, an integrated bandwidth of 2 GB/s results for a single ATLANTIS system. Like all other boards, the backplane is controlled by the host CPU via the PCI bus.
2.4 Host CPU
The host computer to be used with ATLANTIS is an industrial version of a standard x86 PC – a CompactPCI computer – that plugs into one of the AAB slots. This industrial computer is equipped with a mobile Intel Pentium-200 MMX or Celeron-450 processor and is thus 100% compatible with a standard PC desktop workstation. All standard operating systems can be used, in particular Windows NT and Linux, without the need to adapt drivers or I/O handlers, etc. The compatibility of ATLANTIS with the small-scale FPGA processor microEnable at the device driver level allows a quick start using the tools already available. The CPU module allows the complete FPGA development tool-set, as well as the application itself, to be run on the target system. The ACB and AIB boards act as coprocessors, accelerating time- and resource-consuming parts of an application, and providing high I/O bandwidth. Moreover, the CPU is needed for control when task switching and re-configuration of FPGAs is desired. Additionally, high-precision floating-point operations that are too resource-consuming on FPGAs may be carried out on the CPU.
2.5 CHDL Development Environment
CHDL (C++ based Hardware Description Language) was designed to support simulation of FPGA coprocessors. The use of commercial VHDL products to simulate FPGA coprocessors shows some insufficiencies:
1. A test bench must be implemented for emulating the FPGA environment using VHDL, while the application operating the FPGA is mostly written in C/C++.
2. The test bench has to emulate the behavior of the microprocessor system exactly, including bus system and DMA controllers at the level of bus signals.
3. Implementing the test bench is redundant work because the application already contains the whole algorithm needed for simulation.
CHDL provides a hardware description based on C++ classes for entering structural designs and state machine definitions. A CHDL design description is a traditional C++ program linked to a class library. This enables the developer to implement complex high-level software which generates the structural CHDL design automatically.
The developer uses the original application to simulate the designs. No traditional hardware oriented test benches are needed. One single language, C++, is sufficient to manage the whole development process. In both the application and the hardware description the features of this powerful programming language can be used. More details can be found in [6].
3 Applications
FPGA processors have been shown to provide superior performance in a broad range of fields, like encryption, DNA sequencing, image processing, rapid prototyping, etc. Very good surveys can be found in [3] and [7]. We are in particular interested in hybrid CPU/FPGA systems for:
· acceleration of computing-intensive pattern recognition tasks in High Energy Physics (HEP) and Heavy Ion Physics,
· subsystems for high-speed and high-frequency I/O in HEP,
· 2-dimensional industrial image processing,
· 3-dimensional medical image visualization, and
· acceleration of multi-particle interaction (e.g. N-Body [8], SPH) calculations in astronomy.
3.1 High Energy Physics
In the field of HEP many FPGA algorithms have been implemented at our institute during the past 5 years. Results show speedup rates in the range from 10 to 1,000² compared to workstation implementations [9]. The most recent HEP pattern matching algorithm tries to find straight or curved tracks in a 2-dimensional input image delivered by a transition radiation tracking detector (TRT) with a repetition rate of up to 100 kHz. The size of the detector image is 80,000 pixels. The number of patterns varies from 240 to more than 2,400 depending on the operating frequency. The working principle of the algorithm is as follows:
· Predefined patterns are stored in a large look-up table (LUT), with every data bit representing one pattern.
· Each pixel in the input image contributes to a number of patterns, defined by the content of the LUT.
· For every pattern a counter increments if its corresponding data bit is set. The total of all counter values builds the track histogram.
· A track is considered valid if its histogram value is above a predefined threshold.
A description of the algorithm and its implementation can be found in [10]; a rough software sketch of the histogramming scheme is given below. In particular, this algorithm is ideally suited for an FPGA implementation because it can be parallelized to an extreme degree. Adjustable memory boards allow RAM access with a width of e.g. 4*176 bits. Therefore, 706 straws can be processed simultaneously on a single ACB board equipped with 4 memory modules, thus providing an enormous speed-up compared to other systems, e.g. a state-of-the-art PC.
² Measured on Enable-1 with parallel histogramming only; no I/O was needed.
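As a rough software model of the LUT-based histogramming scheme described above (not the FPGA implementation itself; the table layout and sizes are placeholders), each active pixel indexes a look-up table whose bits indicate which patterns that pixel belongs to, and the corresponding counters are incremented and thresholded:

import java.util.BitSet;

public class TrtHistogramModel {
    // lut[pixel] holds one bit per pattern: pixel contributes to pattern p iff bit p is set.
    static int[] histogram(BitSet[] lut, int[] activePixels, int numPatterns) {
        int[] counts = new int[numPatterns];
        for (int pixel : activePixels) {
            BitSet patterns = lut[pixel];
            for (int p = patterns.nextSetBit(0); p >= 0; p = patterns.nextSetBit(p + 1)) {
                counts[p]++;                          // one counter per predefined pattern
            }
        }
        return counts;
    }

    // A track candidate is considered valid if its histogram entry reaches the threshold.
    static boolean[] threshold(int[] counts, int threshold) {
        boolean[] valid = new boolean[counts.length];
        for (int p = 0; p < counts.length; p++) valid[p] = counts[p] >= threshold;
        return valid;
    }
}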
3.2 Image processing
Almost all image processing applications involve tasks where image elements (pixels or voxels) have to be processed with local filters. Among others, hardware implementation of algorithmically optimized real-time volume rendering is a current project at our institute in this area. The following rendering - or ray processing - pipeline is assumed:
· Starting from each pixel of the resulting image, rays are cast into the virtual scene.
· At equally distant positions on the rays, sample points are generated by tri-linear interpolation of the neighboring voxel values.
· Sample points are classified with opacity or reflectivity according to gray values and gradient magnitude.
· Finally, the absorption for each voxel is determined. The reflected fraction of the light intensity reaching the sample point is calculated and added to the contributions of all other sample points on that ray.
The new architecture uses algorithmic optimizations: regions with no contribution are skipped, and processing is aborted as soon as the remaining intensity drops under an adjustable threshold. To overcome the resulting data and branch hazards in the rendering pipeline, multi-threading is introduced. Each ray is considered a single thread, and after each sample point the context is switched to the next ray. Our implementation has the same speed-up as software implementations of this algorithm, compared to volume rendering without algorithmic optimizations. However, compared to conventional architectures the number of pipeline stalls is reduced from more than 90% to less than 10% of rendering time. Details of the algorithm and its FPGA implementations can be found in [11].
3.3 Astronomy
Using FPGAs to accelerate complex computations using floating-point algorithms has not been considered a promising enterprise in the past few years. The reason is that general floating-point [12] as well as particular N-Body [13] implementations have shown only poor performance³ on FPGAs. Usually N-Body calculations need a computing performance in at least the Tera-FLOP range and are accelerated with the help of ASIC-based coprocessors [14]. Nonetheless we have recently investigated the performance of a certain sub-task of the N-Body algorithm on the Enable++ system [15]. The results indicate that FPGAs can indeed provide a significant performance increase even in this area.
3.4 Measured and Estimated Performance
HEP. Besides principal parameters like system frequency, the DMA performance plays a dominant role for the execution time of the TRT algorithm. Therefore DMA Read/Write access was the main focus of the measurements. Following are some
³ In 1995 approx. 10 MFLOP per Xilinx chip were reported for 18 bit precision, and 40 MFLOP with 32 bit precision on an 8 chip Altera board.
results showing the data throughput over CPCI for various applications, measured with ATLANTIS, microEnable driver, design speed 40 MHz.

Table 1. ATLANTIS DMA performance

Block size (kByte)    DMA Read perf. (MB/s)    DMA Write perf. (MB/s)
1                     8.8                      7.4
4                     24.6                     21.6
32                    75.3                     54.3
256                   97.7                     65.3
The effect these results suggest for the performance of a distributed system largely depends on the respective application. For the TRT algorithm, the time needed for I/O is indeed the bottleneck in case the ATLANTIS sub-systems are employed as coprocessors and thus receive their data from the host CPU. Measurements of histogramming performance were done using a single-memory ACB (176-bit RAM access) [16]. The execution time on the test system (algorithm plus I/O), 19.2 ms compared to 35 ms using a C++ implementation on a Pentium II/300 standard PC, extrapolates to 2.7 ms using 2 ACBs with 4 memory modules each (1408-bit RAM access). This corresponds to a speed-up by a factor of 13.5.
Volume Rendering. The hardware speed is limited by several factors. One is the memory bandwidth. Assuming 100 MHz devices, simulations have shown that frame rates of 4 Hz for 1024³ data sets can be achieved for typical data with hard surfaces and otherwise empty space in between [17]. With our FPGA solution we will achieve a clock rate of >25 MHz, which reduces the frame rate accordingly. For detailed simulation we used a CT data set with 256*256*128 voxels. This data set is viewed from three different viewing directions, and three different levels of opacity for soft tissue are applied. On average one achieves efficiencies of between 90% and 97%. The number of sample points varies between 10-15% of all voxels if the data set consists mainly of empty space and opaque objects, and 25-40% for semi-transparent opacity levels. The above results correspond to rendering rates from 20 Hz on semi-transparent data sets to 138 Hz for opaque objects and parallel projection. The results are achieved for images of size 256*128. Perspective views reduce the rendering speed by a factor of about 2. Comparing these results with the performance of the only commercially available volume rendering hardware, VolumePro [18], simulations suggest a speed-up by a factor of 10 to 25 when using 1024³ data sets.
4 Summary and Outlook

ATLANTIS is a CompactPCI-based computing machine that combines the advantages of FPGA and RISC architectures. Its unique features are scalability, flexibility with respect to memory, and configurable high-speed I/O, and it comes with a powerful object-oriented development environment, CHDL. ATLANTIS has proven its strength regarding bandwidth and speed in the applications we have investigated so far. An ACB has been available since 09/1999 and is
currently being tested with different memory modules and a simple backplane, with different applications. A second ACB and an AIB will be completed shortly. Though the full system is not yet available (01/2000), it is planned to have an implementation of a HEP trigger application running in a real experiment (FOPI at GSI, Darmstadt, Germany) within this year. Other implementations concern future experiments or have prototype character.
References
[1] Klefenz F., Zoz R., Noffz K.-H., Männer R., "The ENABLE Machine - A Systolic Second Level Trigger Processor for Track Finding", Proc. Comp. in High Energy Physics, Annecy, France; CERN Rep. 92-07 (1992) 799-802
[2] DECPeRLe-1, an FPGA processor containing 16 Xilinx XC3090 FPGAs, http://pam.devinci.fr/hardware.html#DECPeRLe-1
[3] D. Buell, J. Arnold, W. Kleinfelder, "Splash-2 - FPGAs in a Custom Computing Machine", CS Press, Los Alamitos, CA, 1996
[4] H. Hoegl et al., "Enable++: A Second Generation FPGA Processor", Proc. IEEE Symposium on FPGAs for Custom Computing Machines, pp. 45-53, 1995
[5] microEnable, a PCI based FPGA co-processor by Silicon Software GmbH, http://www.silicon-software.com/
[6] K. Kornmesser et al., "Simulating FPGA-Coprocessors Using the FPGA Development System CHDL", Proc. PACT Workshop on Reconf. Comp., Paris (1998) pp. 78-82
[7] J. Vuillemin et al., "Programmable Active Memories: Reconfigurable Systems Come of Age", IEEE Trans. on VLSI Systems, 1996
[8] R. Spurzem, S.J. Aarseth, "Direct Collisional Simulation of 10,000 Particles Past Core Collapse", Monthly Notices Royal Astron. Soc., Vol. 282, 1996, p. 19
[9] V. Dörsing et al., "Demonstrator Results Architecture - A", ATL-DAQ-98-084, CERN, 26 Mar 1998
[10] A. Kugel et al., "50kHz Pattern Recognition on the Large FPGA Processor Enable++", Proc. IEEE Symp. on FPGAs for Custom Computing Machines, CS Press, Los Alamitos, CA, 1998, pp. 1262-3
[11] J. Hesser, B. Vettermann, "Solving the Hazard Problem for Algorithmically Optimized Real-Time Volume Rendering", Int. Workshop on Vol. Graph. 1999, Swansea, UK
[12] W. Ligon et al., "A Re-evaluation of the Practicality of Floating-Point Operations on FPGAs", Proc. IEEE Symp. on FPGAs for Custom Computing Machines, 1998
[13] H.-R. Kim et al., "Hardware Acceleration of N-Body Simulations for Galactic Dynamics", SPIE Conf. on FPGAs for Fast Board Develop. and Reconf. Comp. 1995, pp. 115-126
[14] J. Makino et al., "GRAPE-4: A Massively Parallel Special-Purpose Computer for Collisional N-Body Simulations", Astrophysical Journal, Vol. 480, 1997, p. 432
[15] T. Kuberka, Diploma Thesis, Universität Mannheim, Germany, 1999
[16] C. Hinkelbein et al., "LVL2 Full TRT Scan FEX Algorithm for B-Physics Performed on the FPGA Processor ATLANTIS", to be publ. as ATL-DAQ-Note, CERN
[17] B. Vettermann et al., "Implementation of Algorithmically Optimized Volume Rendering on FPGA Hardware", IEEE Visualization '99, San Francisco, CA (1999)
[18] VolumePro, a PCI based volume rendering coprocessor by Mitsubishi Electronics America, Inc. RTVIZ, http://www.rtviz.com/
The Cellular Processor Architecture CEPRA-1X and its Configuration by CDL

Christian Hochberger (1), Rolf Hoffmann (2), Klaus-Peter Volkmann (2), and Stefan Waldschmidt (2)

(1) University of Rostock, 18059 Rostock, Germany, [email protected]
(2) Darmstadt University of Technology, 64283 Darmstadt, Germany, (hoffmann,voelk,waldsch)@informatik.tu-darmstadt.de
Abstract. The configurable coprocessor CEPRA-1X was developed as a PC plug-in card in order to speed up cellular processing significantly. Cellular processing is an attractive and simple massively parallel processing model. To increase its general acceptance and usability it must be supported by a software environment, an efficient simulator and a special language. For this purpose the cellular description language CDL was defined and implemented. With CDL, complex cellular algorithms can be described in a concise and readable form. A CDL program can automatically be transformed into a logical design for the CEPRA-1X. The design is loaded into field programmable gate arrays for the computation of the state transition of the cells. For time-dependent or complex rules the design may be reconfigured between consecutive generations. An example is presented to show the generation of logic code.
1 Introduction
Cellular Processing is based on the processing model of Cellular Automata. All cells obey the same local rule in parallel, which results in a global transformation of the whole generation. The cells are connected to their adjacent cells only. In the two-dimensional case 4 neighbours (von Neumann neighbourhood) or 8 neighbours (Moore neighbourhood) are considered. In the three-dimensional case up to 26 neighbours can be taken into consideration. Typical applications are: crystal growth, biological growth, simulation of digital logic, neuronal switching, electrodynamic fields, diffusion, temperature distributions, movement and collision of particles, lattice gas models, liquid flow, wave optics, Ising systems, image processing, pattern recognition and numerical applications. Cellular algorithms are described in a concise and readable form in the language CDL (Cellular Description Language). CDL has proved to be very useful for the description of complex cellular algorithms [1]. One version of the compiler generates C or Java code for the software simulator; another version generates a hardware description for the field programmable gate array which we use in our coprocessor CEPRA-1X [2].
Cellular processing on a conventional computer is time consuming, especially for a large number of cells, complex rules and experiments with parameter variations. Special hardware support is necessary to speed up the computation and for realtime visualisation on the fly.

2 Target Architectures
In the course of the cellular processing project at the Technical University of Darmstadt different architectures have been developed, in particular the CEPRA-8L [3], the CEPRA-1X [2], and the CEPRA-3D [4]. A newly designed machine, CEPRA-S, for general purposes is under development. The advantage of the CEPRA processors compared to CAM [5] machines is that complex and probabilistic rules can be computed in one step, whereas the CAM machines must split the problem into cascaded look-up tables.

Coprocessor CEPRA-1X. The CEPRA-1X coprocessor is a plug-in card for the PCI bus. It was designed for 2D cellular processing with visualisation support, but it can be used as a general data stream processor. The cellular field data is stored in the host. For the computation of a new generation the cell states are streamed to the coprocessor, the rule is computed for all the cells in the stream and the new cell states are streamed back to the host.

Fig. 1. Three line FIFOs
The rule is computed by an FPGA (field programmable gate array) which has to be loaded with a configuration describing the logic design of the rule. Because three lines are buffered (implemented as FIFOs), each cell has to be read and written exactly once. With the PCI bus performance of 133 MByte/sec the performance is 30 million 2D 16-bit cell operations per second with 9 neighbours. Considering the Belousov-Zhabotinsky reaction described later, this is a speed-up of about 40 in comparison to a 133 MHz PC. More complex rules will yield higher speed-ups, because we use hardware pipelining in the CEPRA-1X. Therefore the computation time is independent of the rule complexity. The logic design which has to be loaded into the FPGA is generated by the CDL hardware compiler. The compiler generates intermediate logic code
(VERILOG) which is transformed into FPGA configuration data by a tool from XILINX. The logic description of different rules can be reloaded between the computation of the generations. By this technique time-dependent rules can be computed. Complex rules which do not fit into the FPGA can be broken into a sequence of phase rules. The phase rules are loaded between the phases of the generations. The time to reload the FPGA (parallel mode, 8 MHz) is 15% of the computation time for a cell field of size 1024 x 1024.

Fig. 2. CEPRA-1X architecture
Software Simulator. For the evaluation of cellular algorithms we have developed simulator software. Experiments for this simulator consist of three basic parts: the description of the rule, the initial state of the cells in the array and some information about the visualisation. The simulator allows the user to store the cell state in a structured datatype. One of the easiest and most often used visualisation concepts is the assignment of colours to cell states. Thus the simulator provides a visualisation tool that uses one of the cell's components as an index into a colourmap. The rule is written in C or Java and is linked with a kernel which controls the simulation. The kernel provides a neighbour function for the access to the neighbours. The kernel is capable of calling different rules depending on the position within the cellular field. By this technique special rules for borders and corners can be defined.
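For illustration, such a rule might look as follows in C, here for the Belousov-Zhabotinsky reaction used as the running example in Sect. 3. The cell_t layout, the rule() signature and the kernel-provided neighbour() accessor are hypothetical stand-ins for the kernel interface, not the actual simulator API.

/* Hypothetical sketch of a C rule for the software simulator.  neighbour()
 * is assumed to be provided by the kernel and to return a pointer to the
 * state of the addressed neighbour cell. */
#define MAXTIMER 7

typedef struct {
    unsigned active : 1;
    unsigned alarm  : 1;
    unsigned timer  : 3;                               /* 0..MAXTIMER */
} cell_t;

extern const cell_t *neighbour(int dx, int dy);        /* assumed kernel accessor */

void rule(const cell_t *old, cell_t *out)
{
    static const int moore[8][2] = { {-1,0},{1,0},{0,1},{0,-1},
                                     {1,1},{-1,1},{1,-1},{-1,-1} };
    int active_neighbours = 0;
    for (int i = 0; i < 8; i++)
        active_neighbours += neighbour(moore[i][0], moore[i][1])->active;

    out->active = (old->timer == 0);
    out->alarm  = (active_neighbours == 2 || active_neighbours >= 4);
    if (out->active && out->alarm && old->timer == 0)
        out->timer = MAXTIMER;
    else
        out->timer = (old->timer != 0) ? old->timer - 1 : 0;
}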
3 CDL, a Language for Cellular Processing
Until now, cellular algorithms have been programmed in simulator-dependent special languages and data structures. Thus the programmer needs special knowledge of the target architecture, which makes programming a tedious task. The CEPRA-1X processor is programmed in VERILOG, whereas the software simulator is programmed in C or Java. Neither of those languages is convenient and adequate for the programmer to describe cellular algorithms. Also both languages contain
elements that are not required for this purpose (e.g. pointers and dynamic memory allocation in C). The new language CDL was defined with respect to readability, conciseness and portability. While developing a cellular algorithm it is desirable to have short turn-around cycles. Thus the usage of a highly interactive software simulator is recommended during the development process. After having tested the algorithm on the software simulator it can be transferred to the CEPRA-1X for fast execution and realtime visualisation.

Features of the Language. The language CDL is intended to serve as an architecture-independent language for cellular algorithms. The programmer's benefit is obvious: switching the target architecture does not require more than just a new compiler run. Moreover CDL contains special elements that make the description of complex conditions very easy (groups, special loop constructs). These elements allow the description of situations like:
– Is there any neighbour that fulfils a certain condition? (one())
– Do all neighbours fulfil a certain condition? (all())
– How many neighbours are in a certain state? (num())
CDL does not contain conditional loops, which has two positive side effects: (1) it enforces the termination of the rule because it is impossible to write endless loops, and (2) it enables the compiler to unroll all statements, which is extremely important for the synthesis of hardware. CDL allows the user to describe the cell state as a record of arbitrary types. All common data types are available in CDL (integer, boolean, float, etc.). In addition the user can define new types (enumerations and subranges of integers or enumerations).

Example. To give an impression of a CDL program we present the Belousov-Zhabotinsky reaction [6]. It does not show all the special features of CDL, but demonstrates some of the problems that have to be handled quite differently on hardware and software simulators.

(1)  cellular automaton Belousov_Zhabotinsky;
(2)  const
(3)    dimension = 2;           // a two-dimensional grid
(4)    distance  = 1;           // allow/restrict Moore-neighbourhood
(5)    maxtimer  = 7;           // a local constant
(6)    cell = [0,0];            // relative address of actual cell
(7)                             // *[0,0] means the contents of the cell
(8)  type celltype = record     // celltype defines possible states
(9)    active : boolean;
(10)   alarm  : boolean;
(11)   timer  : 0..maxtimer;
(12) end;
(13) // addresses of all 8 Moore-neighbours
(14) group neighbours = {[-1,0],[ 1,0],[0, 1],[ 0,-1],
(15)                     [ 1,1],[-1,1],[1,-1],[-1,-1]};
(16) colour                     // description of visualisation
(17)   [0 , 255, 0]                          ~ *cell.active and *cell.alarm;
(18)   [255, 0, 0]                           ~ *cell.active and not *cell.alarm;
(19)   [*cell.timer * 255 div maxtimer,0,0]  ~ not *cell.active;
(20) var
(21)   neighbour : celladdress;  // local loop variable
(22)
(23) rule begin
(24)   *cell.active := *cell.timer=0;   // is actual timer==0?
(25)   *cell.alarm  :=                  // count neighbours in active state
(26)     num(neighbour in neighbours : *neighbour.active)
(27)       in {2,4..8};
(28)   if *cell.active and *cell.alarm and (*cell.timer=0)
(29)     then *cell.timer := maxtimer
(30)   else if *cell.timer != 0
(31)     then *cell.timer := *cell.timer - 1;
     end;
The type celladdress, as used in line (21), is implicitly defined by the compiler from the two constants dimension and distance. They define how many dimensions the model uses and how far the access to other cells reaches. Both constants must be supplied by the programmer. The type celladdress is a record with as many components as the model has dimensions. Each component can have a value between -distance and +distance. Lines (14) and (15) show the celladdresses of all eight Moore neighbours. The name of this enumeration does not have any meaning for the compiler. The elements are used in the iterative num-loop in line (26).

4 Transformation into a Hardware Description
Even simulators that are based on specialised hardware are supported by CDL. The CEPRA-1X simulator has been chosen as an example during the design phase of CDL. The most important restrictions of a hardware simulator are the limited number of cell states and the limitations in the rule complexity. Although floating point numbers are desired and should be included in a cellular language, they are usually not implemented in a specialised hardware simulator because of hardware costs.

Celltype. In the case of CEPRA-1X the states of the cell must be coded with 16 bits. If the celltype is a record (as in lines (08)-(12)) it is easier to reserve bit groups for the subtypes of this record (one bit for each boolean in lines (09)-(10) and three bits for the integer subrange in line (11)). Usually, this will simplify the logic for the rules, because often the rules access only components of the cell record (e.g. line (28)). On the other hand, this may lead to a state coding where not all 2^16 states can be used (e.g. if the integer subrange does not have a power-of-two number of elements). Enumerating all possible cell states (the power set of the components) will not waste any of the states, but will increase implementation cost. The CDL compiler decides itself which method to use.
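As an illustration, one possible bit-group coding of the example celltype into a 16-bit cell state is sketched below in C; the chosen bit positions are an assumption, not the coding actually selected by the compiler.

/* Hypothetical bit-group coding of the example celltype into the 16-bit cell
 * state: bit 0 = active, bit 1 = alarm, bits 2..4 = timer (0..7); the
 * remaining 11 bits stay unused with this coding. */
typedef unsigned short cellstate_t;                 /* 16-bit cell state word */

#define GET_ACTIVE(s)   ((unsigned)((s) & 0x1u))
#define GET_ALARM(s)    ((unsigned)(((s) >> 1) & 0x1u))
#define GET_TIMER(s)    ((unsigned)(((s) >> 2) & 0x7u))
#define PACK_STATE(a, al, t) \
    ((cellstate_t)((((t) & 0x7u) << 2) | (((al) & 0x1u) << 1) | ((a) & 0x1u)))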
Fig. 3. The implementation of local variables and assignments
Variables. The classical synthesis approach uses registers to represent variables. The data paths between these registers are controlled by a finite state machine. For the CEPRA-1X machine this is not desired, because it would imply the usage of a clock signal. The number of clocks required to complete the calculation would then depend on the data. The varying time could stall the pipeline and slow down the calculation speed. To simulate CDL variables, they are represented by local signals. Because a new value can be assigned to signals only once, new signals must be created for each assignment. The following CDL fragment
(1) a:=10;
(2) if condition then a:=30;
(3) b:=a+1;
produces the local variables a1, a2, and a3, a multiplexer driven by condition, and a following adder which calculates the value of signal b1 (Fig. 3).

Optimisations. The hardware resources inside an FPGA are limited. Therefore optimisation is necessary. The optimisation supported by the VERILOG compiler is good but not sufficient. The CDL compiler should already keep an eye on the complexity of the description. It should not use too many local signals and should avoid generating unused code. To reduce implementation cost, early expression and condition evaluation is necessary and was implemented. The compiler evaluates constant expressions during compilation, taking special properties of the operation into consideration. The or operation, for example, with one operand being the constant true is evaluated during compilation and translated into the constant true. Usually a data type is represented by a fixed number of bytes on common computers. To reduce implementation cost, the compiler should use single bits instead of bytes as the smallest unit. In addition, the size of a data type may vary. For example, a variable of an integer subrange type which is divided by two will need one bit less after the division. Therefore it is useful to know the exact range of possible values for each variable and expression.

Loops. To simulate the behaviour of a loop, hardware must be generated for each iteration. Conditional loops are not available because the number of iterations cannot be determined during compilation. (This is equivalent to the demand that the calculation must always terminate.) The num expression in line (26) can be interpreted as a loop. The constants of the group neighbours are assigned to the variable neighbour one after the
other. After each assignment the expression *neighbour.active is evaluated and the result is assigned to a new local signal. After the eight iterations, the eight signals are connected to a logic which sums up the conditions that are true. The sum is the result of this expression.

Conditional Statements. The only statement which has a permanent effect is the assignment of a value to a variable or the cell state (e.g. line (28)). For this reason the assignment statement is affected by the corresponding condition. Have a look at line (29): only if the condition is true shall the assignment have an effect. Therefore each assignment is implemented as a two-to-one multiplexer, where one input is the old value and the other is the new value. The select signal of this multiplexer is connected to the condition of the surrounding conditional statement. For nested conditional statements their conditions are combined using the logical and operation. An else part can be realized using the inverted condition, and a case statement using different cascaded conditions.
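In software terms, this multiplexer implementation of guarded assignments corresponds to if-conversion; for the timer update of lines (28)-(31) it amounts to the following C sketch (an illustration, not compiler output):

/* If-converted form of the guarded timer update (lines (28)-(31)): every
 * assignment becomes a 2-to-1 selection between the new and the old value,
 * steered by the surrounding condition. */
unsigned next_timer(unsigned active, unsigned alarm, unsigned timer)
{
    return (active && alarm && timer == 0) ? 7u          /* then: maxtimer   */
         : (timer != 0)                    ? timer - 1u  /* else if: timer-1 */
                                           : timer;      /* keep old value   */
}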
Fig. 4. Result of synthesis
Complete Example. The CDL program describing the Belousov-Zhabotinsky reaction from the previous section results in the hardware structure shown in Fig. 4. Obviously program line (24) corresponds with the upper part of the logic. The
middle part corresponds to lines (25)-(27), and the lower part of the logic has been synthesised from lines (28)-(30). Recognise the power of the num() statement: only three lines of code result in the large middle part of the logic.

Colour. The colour definition must be loaded into the CRT controller as a look-up-table. To create this look-up-table during compilation, each possible cell state is associated with the contents of the cell (*[0,0]) and the expressions in the colour definition (lines (17)-(19)) are evaluated.

5 Conclusion
The CEPRA-1X is a configurable coprocessor which speeds up cellular processing significantly. As it processes data streams it can also be used for other applications. The resulting pixel stream can be coloured and visualised in realtime. Complex rules and time-dependent rules can be computed by reloading the FPGA between the generations. CDL is an implemented language for the concise, readable and portable description of cellular algorithms. One version of the compiler generates C/Java code for the software simulator. Another version generates logic equations for the field programmable gate arrays of the CEPRA-1X machine. The logic equations are partly minimised by the compiler and partly by a commercially available design system. Main features of the language are records, unions, groups and the loop construct for testing complex conditions. The language can be used to describe complex cellular algorithms of practical relevance. Based on this experience the language was extended to CDL++ [7] for the description of moving objects.

References
[1] Christian Hochberger, Rolf Hoffmann, Klaus-Peter Volkmann, and Stefan Waldschmidt. Cellular processing environment. In Boguslaw Butrylo, editor, International Conference on Parallel Computing in Electrical Engineering (PARELEC 98), number 1, pages 171-174, Bialystok, Poland, 1998. Technical University of Bialystok.
[2] Christian Hochberger, Rolf Hoffmann, Klaus-Peter Volkmann, and Jens Steuerwald. The CEPRA-1X cellular processor. In Rainer W. Hartenstein and Viktor K. Prasanna, editors, Reconfigurable Architectures, High Performance by Configware. IT Press, Bruchsal, 1997.
[3] Rolf Hoffmann, Klaus-Peter Volkmann, and Marek Sobolewski. The cellular processing machine CEPRA-8L. Mathematical Research, 81:179-188, 1994.
[4] R. Hoffmann and K.-P. Voelkmann. Hardware support for 3D cellular processing. Lecture Notes in Computer Science, 1277:322-??, 1997.
[5] Norman H. Margolus. CAM-8: a computer architecture based on cellular automata. Technical Report 01239, MIT Lab. for Computer Science, December 1993.
[6] A. Zaikin and A. Zhabotinsky. Nature, (225):535-, 1970.
[7] Christian Hochberger. CDL - Eine Sprache für die Zellularverarbeitung auf verschiedenen Zielplattformen. PhD thesis, Darmstadt University of Technology, 1999.
Loop Pipelining and Optimization for Run Time Reconfiguration*

Kiran Bondalapati and Viktor K. Prasanna

Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089-2562, USA.
{kiran, [email protected]
http://maarcII.usc.edu

* This work was supported by DARPA Adaptive Computing Systems program under contract DABT63-99-1-0004 monitored by Fort Huachuca.
Abstract.
Lack of automatic mapping techniques is a significant hurdle in obtaining high performance for general purpose computing on reconfigurable hardware. In this paper, we develop techniques for mapping loop computations from applications onto high performance pipelined configurations. Loop statements with generalized directed acyclic graph dependencies are mapped onto multiple pipeline segments. Each pipeline segment is executed for a fixed number of iterations before the hardware is reconfigured at runtime to execute the next segment. The reconfiguration cost is amortized over the multiple iterations of the execution of the loop statements. This alleviates the bottleneck of high reconfiguration overheads in current architectures. The paper describes heuristic techniques to construct pipeline configurations which have reduced total execution time including the runtime reconfiguration overheads. The performance benefits which can be achieved using our approach are illustrated by mapping an example application loop onto the Virtex series FPGA from Xilinx.
1 Introduction
Reconfigurable computing has demonstrated significant performance gains for several classes of applications [5]. Application mapping onto configurable hardware still necessitates expertise in low-level hardware details. Automatic mapping of applications onto configurable hardware is necessary to deliver high performance for general purpose computing. In this paper we address the issues in mapping application loops onto reconfigurable hardware to optimize the total execution time. Total execution time includes the time spent in actual execution on the hardware and the time spent in reconfiguring the hardware. Configurable hardware can be utilized to execute designs which are larger than the available physical resources. Run Time Reconfiguration (RTR) between computations facilitates dynamic adaptation of the hardware to suit the design area and computational requirements. But, in current devices, reconfiguration time is still significant compared to the execution time. We focus on developing
mapping techniques which exploit RTR but attempt to reduce the reconfiguration overhead. This is accomplished by amortizing the reconfiguration overheads over the execution of a large number of iterations of the loop. Loop statements contribute a significant portion of the execution time of an application. Pipelined designs are well structured and map well onto configurable devices. Most reconfigurable architectures, including FPGA devices, provide excellent support for pipelining with their regular logic block layout and large number of registers [17]. Pipelined designs have reduced and predictable delays because they use mostly local interconnections. Hence, mapping loop computations onto pipelined configurations proves to be very effective on configurable hardware. In this paper, we develop techniques to map computations in a loop onto reconfigurable hardware. The data dependencies in the loop statements constitute a directed acyclic graph (DAG). These loop statements are mapped onto pipelined configurations executing in the reconfigurable hardware. Our mapping techniques attempt to minimize the total execution cost for the computations including the reconfiguration cost. The statements are split into multiple pipeline segments which are executed sequentially for a fixed number of iterations each. Reconfiguration is performed after execution of a pipeline segment to execute the next segment. Generating an optimal schedule from a given task graph is an NP-complete problem. In this paper, heuristic algorithms are utilized to reduce the reconfiguration cost between different pipeline segments. We compare the effectiveness of our heuristics against a greedy heuristic based on list scheduling. Our mapping techniques promise potential performance improvement on several classes of FPGAs. We evaluate the performance of our mapping techniques on the Virtex series FPGA from Xilinx [17]. In Section 2, we describe some related research work which addresses similar issues. Our heuristic based algorithms are described in detail in Section 3 and illustrated by using an example. In Section 4, we evaluate the performance benefits achieved using our approach. We draw conclusions based on our approach in Section 5.

2 Related Work
Pipelining of designs has been studied by several researchers in the configurable computing domain. Cadambi et al. address the issues in mapping virtual pipelines onto a physical pipeline by using incremental reconfiguration in the context of PipeRench [6]. Luk et al. describe pipeline morphing and virtual pipelines as an idea to reduce the reconfiguration costs [11]. A pipeline configuration is morphed into another configuration by incrementally reconfiguring stage by stage while computations are being performed in the remaining stages. Weinhardt describes the generation of pipelined circuits from parallel-FOR loops in a high level programming language [15]. Weinhardt et al. also developed pipeline vectorization techniques [16].
Other research has addressed related issues in mapping circuits onto reconfigurable hardware [2, 7, 10, 12, 14]. Our prior research has also developed other techniques for mapping application loops [1, 3, 4]. In this paper, the focus is on Run Time Reconfiguration at a different granularity. Our approach is to exploit Run Time Reconfiguration to achieve high performance but schedule it infrequently to minimize the overheads. Algorithmic pipeline construction and partial reconfiguration at runtime are exploited to achieve this goal.

3 Pipeline Construction
The speed-up that can be obtained by using configurable logic increases as the computations in a loop increase. But, the configurable resources that are available can be lower than the resources required to pipeline all the computations in the loop. In this case, the pipeline has to be segmented to run some of the pipeline stages and reconfigured to execute the remaining computations. In this paper, we consider loops which do not have loop carried dependencies. Such loops do not have any dependencies between different iterations of the loop. Loop transformations can be applied to remove some existing loop carried dependencies. We also assume that the number of iterations to be executed is significantly larger than the number of pipeline stages. Hence, the cycles involved in filling and emptying the pipeline are insignificant compared to the actual execution cycles of the pipeline stages. The execution of the complete loop can be decomposed into multiple segments, where a fixed number of iterations of each segment are executed in sequence starting from the first segment. Each segment consists of multiple pipeline stages. The logic is reconfigured after each segment to execute the next segment. The intermediate results from each segment execution are stored in memory. The execution of the sequence of segments is repeated until the required number of iterations of the loop are completed. We assume that the reconfiguration of the different segments is controlled by an external controller (e.g. a host processor).
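Conceptually, this execution schedule reduces to a nested loop on the controlling host; the following C sketch illustrates it, where reconfigure() and run_segment() are hypothetical host-side routines rather than an actual device API.

/* Sketch of the segmented execution schedule: each of the p pipeline segments
 * is executed for N_seg iterations before the device is reconfigured for the
 * next segment; the sequence is repeated until all N iterations are done. */
extern void reconfigure(int segment);
extern void run_segment(int segment, long iterations);

void execute_loop(int p, long N, long N_seg)
{
    for (long done = 0; done < N; done += N_seg) {
        long iters = (N - done < N_seg) ? N - done : N_seg;
        for (int seg = 0; seg < p; seg++) {
            reconfigure(seg);           /* runtime reconfiguration between segments */
            run_segment(seg, iters);    /* intermediate results buffered in memory  */
        }
    }
}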
3.1 Definitions

Reconfigurable Architecture. A configurable logic array of size L x W and intermediate memory of size M. One of the basic goals of our approach is to exploit the on-chip memory or fast access local SRAM provided in several reconfigurable architectures. M represents the size of this memory.

Input Task Specification. A dependency graph G(V, E) of the application tasks of the loop to be executed for N iterations. Each task node v_i denotes the operation to be performed on the inputs specified by the incoming edges to the node. The directed edge e_ij from v_i to v_j denotes the data dependency between the two nodes. The weight w_ij on each edge denotes the number of bits of data communicated between the nodes.

Output Pipeline Configuration. A sequence of pipeline segments S_1, S_2, ..., S_p where each segment S_i (1 <= i <= p) consists of q stages s_i1, s_i2, ..., s_iq.
The pipeline stages are the mapping of the computational task nodes V to configurations of the device. Each of the stages s_ij is the configuration which executes a specific task in the input task graph. The size of a pipeline stage is given by its length l_ij and width w_ij. Some of the stages in each segment might be null stages, which are not actual tasks but are just place-holders, as explained later in Section 3.6.

Segment Clock Speed. Each pipeline segment S_i can be executed at a different clock speed f_i depending on the maximum clock speed at which the stages in that segment can operate.

Segment Data Output. A pipeline stage s_ij has global outputs if any of the outgoing edges from a task node are to a node that is not mapped to the same pipeline segment. The size of the segment data output DO_i (1 <= i <= p) of a segment S_i is given by the sum of all the global outputs of the stages in the segment.

Segment Iteration Count. The number of iterations N_seg for which each pipeline segment is executed before reconfiguring to execute the next segment. N_seg depends on the size of the available memory to store the intermediate results. We assume that the initial and final results are communicated from/to external memory.

    N_seg = min_i ( M / (DO_i + DO_{i+1}) ),   1 <= i <= p-1

Reconfiguration Cost. The reconfiguration cost R_loop is the total cost involved in reconfiguring between all the segments of the pipeline configuration. This includes the cost of configuring between the last segment and the first segment if N > N_seg. The reconfiguration cost between any two segments is given by the difference in the two pipeline configurations. Partial reconfiguration of the device in columns is assumed in our computation. We use the number of logic columns in which the configurations are different as the measure of the reconfiguration cost. When the corresponding stages in different segments are dissimilar, the reconfiguration cost accounts for the multiple adjacent stages that need to be reconfigured.

Total Execution Time. The total execution time E is given by the sum of the execution times for each segment and the total reconfiguration time:

    E = N * ( sum_{i=1}^{p} 1/f_i ) + (N / N_seg) * R_loop
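The two cost expressions above can be evaluated directly; the following C sketch, with the per-segment data outputs DO[i] and clock speeds f[i] passed in as hypothetical arrays, restates them in executable form.

#include <math.h>

/* Sketch: evaluate the segment iteration count N_seg and the total execution
 * time E for p pipeline segments (p >= 2 assumed).  DO[i] is the segment data
 * output in bytes, f[i] the clock speed of segment i, M the intermediate
 * memory size, N the total number of loop iterations and R_loop the
 * reconfiguration cost per pass, in the same time units as 1/f[i]. */
double total_execution_time(int p, const double DO[], const double f[],
                            double M, double N, double R_loop, double *n_seg)
{
    double N_seg = HUGE_VAL, cycle_sum = 0.0;
    for (int i = 0; i + 1 < p; i++) {              /* N_seg = min_i M/(DO_i + DO_i+1) */
        double n = floor(M / (DO[i] + DO[i + 1]));
        if (n < N_seg) N_seg = n;
    }
    for (int i = 0; i < p; i++)                    /* sum of per-segment cycle times */
        cycle_sum += 1.0 / f[i];
    *n_seg = N_seg;
    return N * cycle_sum + (N / N_seg) * R_loop;   /* E = N*sum(1/f_i) + (N/N_seg)*R_loop */
}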
3.2 Phase 0: Pre-processing and Mapping

In this phase the computation tasks in the input DAG are mapped onto components of the given logic device. The components are chosen from the set of library components available for executing the given application tasks in the task graph. Different components can have different logic-area/execution-time tradeoffs and could potentially have different degrees of pipelining and footprint on the device after layout. The library component of the highest degree of pipelining which
satisfies the other constraints specified by the task graph (such as precision of inputs) is chosen for a task. Our proposed approach is illustrated using the mapping and scheduling of the N-body simulation application and the FFT butterfly computation. The resulting task graphs after Phase 0 with the dependency edges are shown in Figure 1. In the graph the operations are represented as A - Addition, M - Multiplication, S - Subtraction and Sh - Shift right by 4 bits (divide by 16). The operations in the graph are all 16 bits, so the weights on the edges are not indicated.
3.3 Phase 1: Partitioning

The partitioning phase generates multiple partitions where the size of each partition is smaller than the size of the device. This phase attempts to optimize two criteria: (1) maximize the size of the partition and (2) minimize the weight of the edges between partitions. The first criterion improves the logic utilization and the second criterion reduces the memory required to buffer intermediate results generated by each partition (pipeline segment). A sketch of the partitioning algorithm is given below without the intricate details. A heuristic based multi-way partitioning is used to incrementally generate each of the partitions. The largest size node is chosen from among the list of Ready nodes (whose inputs have been computed) to be added to the current partition. When no more nodes can be added to the current partition, a new partition is initiated. For adding a Ready node v_i to a partition P_j, the heuristic uses the following sums of weights of edges:
– ω1: weight of in-edges to v_i from nodes in P_j
– ω2: weight of in-edges to v_i from nodes not in P_j
– ω3: weight of out-edges from v_i to nodes in P_j
– ω4: weight of out-edges from v_i to neighbours of P_j (v_k is a neighbour of P_j if there is an edge from a node in P_j to v_k and v_k is not in P_j)
– ω5: weight of out-edges from v_i to nodes not in P_j and not neighbours of P_j
The node chosen is the node with maximum value of ω1 + ω3 + ω4 - ω2 - ω5. The primary inputs and outputs are not considered in computing the weights. The largest node which fits in the current partition satisfying the above condition is added to the current partition. Ties are broken by using the height of the node and the different weights of edges listed above. The resulting partitions are illustrated by the partition number on the nodes of the graph in Figure 1.
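The selection criterion can be restated as a small scoring function; the C sketch below is only meant to make the ω-terms (w1..w5 in the code) explicit and assumes hypothetical helper functions for summing edge weights, not the actual partitioner implementation.

/* Sketch of the node-selection score w1 + w3 + w4 - w2 - w5 used when adding
 * a Ready node v to the current partition.  in_weight() and out_weight() are
 * hypothetical helpers assumed to sum the weights of v's in/out edges whose
 * other endpoint falls into the given class with respect to the partition. */
enum endpoint_class { IN_PARTITION, NOT_IN_PARTITION,
                      NEIGHBOUR_OF_PARTITION, NEITHER };

extern int in_weight (int v, int partition, enum endpoint_class c);
extern int out_weight(int v, int partition, enum endpoint_class c);

int selection_score(int v, int partition)
{
    int w1 = in_weight (v, partition, IN_PARTITION);
    int w2 = in_weight (v, partition, NOT_IN_PARTITION);
    int w3 = out_weight(v, partition, IN_PARTITION);
    int w4 = out_weight(v, partition, NEIGHBOUR_OF_PARTITION);
    int w5 = out_weight(v, partition, NEITHER);
    return w1 + w3 + w4 - w2 - w5;
}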
3.4 Routing Considerations

The algorithm for the partitioning of the task graph assumes that there are enough routing resources to communicate between the different pipeline stages and from pipeline stages to the memory. Some of the pipeline stages might have global inputs and outputs.
Fig. 1. (a) N-body simulation task DAG and (b) FFT task DAG with partition numbers
These are data inputs and outputs which are not to adjacent pipeline stages, but from/to either non-adjacent stages or from/to memory. Some of the data outputs from the pipeline stages might have to be buffered (using registers) before they are consumed in the later stages. Routing resources are an important consideration when mapping communication between non-adjacent pipeline stages. In our experiments we have discovered that FPGAs such as Virtex [17] are routing and register rich and can support most pipeline-able designs. The number of bits of data computed in each stage is typically less than or equal to the number of logic cells utilized. Hence, the stage-to-stage communication has enough routing resources by using nearest neighbor interconnect. Extra routing and logic resources (for buffering and multiplexing) have to be utilized for data values communicated across non-adjacent pipeline stages. In the partitioning algorithm, the remaining area in a partition is reduced to reflect the buffering requirements. A limitation of our approach is that partitions might have bad memory performance when the computation is highly irregular or there are a large number of data dependencies in the DAG. The approximation of routing resources results in infeasible designs in some cases. But, for most applications, the circuits were finally mapped within the available logic and routing resources.
3.5 Phase 2: Pipeline Segmentation

The configuration of the pipeline is generated from the partitions that are computed in Phase 1 by the algorithm in Figure 2. Each partition is utilized to generate one segment of the pipeline. The goal in the segmentation phase is to generate permutations of the pipeline stages in each segment to reduce the
reconfiguration costs across segments. We use the heuristic of matching the corresponding stages of the different pipeline segments. In each partition, the nodes of the same height have the flexibility of being mapped in any order onto the pipeline. In addition, once a node has been mapped onto the pipeline, its successors from the same partition can also be mapped. The algorithm proceeds by first identifying the list of tasks from each partition that are Ready to be scheduled. A task node is Ready if all of its predecessors have already been scheduled onto the segment. At the next step, a maximal matching set of task nodes is identified from the set of all Ready lists from all partitions. A maximal matching set corresponds to the task node which occurs in most partitions. This step schedules similar nodes from different partitions onto the different segments. This enables the reduction in the reconfiguration costs at runtime. The Ready lists are updated before scheduling the next set of nodes. The resulting pipeline schedules with the different segments are shown in Table 1(b) and Table 2(b).

Table 1. Schedules for N-body simulation: (a) S0: Greedy scheduling, (b) SI: Schedule after Phase 2

(a) Segment 1: A M M A A        (b) Segment 1: A M M A A
    Segment 2: M A M A A            Segment 2: A M M A A
    Segment 3: S A M * *            Segment 3: A S M * *

Table 2. Schedules for FFT: (a) S0: Greedy scheduling, (b) SI: Schedule after Phase 2

(a) Segment 1: M M M * *        (b) Segment 1: M M S A S
    Segment 2: M S A A A            Segment 2: M M A A S
    Segment 3: S S * * *
3.6 Reconfiguration of null stages

Reconfiguring from a null stage to a computation stage can be accomplished by small modifications to the pipeline design. The data values from the previous computation stage are also communicated directly to the output register, in addition to flowing through the computational units. 2-input multiplexers are utilized at the output registers to latch one of the two values. Run Time Reconfiguration using partial reconfiguration only needs to modify the SRAM bits controlling the configuration of the multiplexers. This reconfiguration cost is significantly lower than reconfiguring the whole datapath.

4 Results
We evaluate the performance of our techniques by comparing them with a greedy heuristic based on list scheduling. The greedy schedule chooses the largest available Ready node as the next stage of the pipeline. A new pipeline segment is initiated when no more nodes can be added to the current segment.
1:  Function Segmentation(G, Partition)
2:    ∀ v_i : Mapped(v_i) ← FALSE
3:    Num_Partitions ← |Partition|
4:    repeat
5:      for i = 1 to Num_Partitions do
6:        Ready[i] ← { v_j | v_j ∈ Partition[i] and
7:                      ∀ v_k : v_k = Predecessor(v_j) and Mapped(v_k) }
8:      end for
9:      for i = 1 to Num_Partitions do
10:       for all v_j ∈ Ready[i] do
11:         Count(v_j) ← sum_{l=1}^{Num_Partitions} |{ v_k | Type(v_k) = Type(v_j) and
12:                       v_k ∈ Ready[l] }|
13:       end for
14:     end for
15:     Vcurr ← null
16:     for i = 1 to Num_Partitions do
17:       v_sel = v_j | v_j ∈ Ready[i] and max{Count(v_j)} and v_j ∈ Vcurr
18:       if v_sel = null then
19:         v_sel = v_j | v_j ∈ Ready[i] and max{Count(v_j)}
20:       end if
21:       Segment[i] ← Segment[i] ∪ v_sel
22:       if v_sel != null then
23:         Vcurr ← Vcurr ∪ v_sel
24:         Mapped(v_sel) ← TRUE
25:       end if
26:     end for
27:   until (∀ i : empty(Partition[i]))

Fig. 2. Algorithm to generate the pipeline segments
The resulting schedule is shown in Table 1(a) and Table 2(a). We utilize the modules and the parameters from the Virtex component libraries [17]. Some of the modules utilized are tabulated below in Table 3. The number of pipelined stages, the precision of the inputs and the size of the module when mapped onto the device are listed in the table. For the N-body simulation and FFT examples, the number of slices to be reconfigured for each schedule is shown in Table 4. This is the reconfiguration cost R_loop as defined in Section 3.1. The heuristic based algorithms have a significant saving in the reconfiguration cost. This translates to a direct reduction in the total execution time of the configuration. In the worst case, our heuristic algorithms generate a schedule which is at least as good as the greedy heuristic. The total execution cost was computed for both applications for a data set size of 4096 data points with an on-chip memory size of 2 KB (M). For the two example applications, reconfiguration cost is the dominant cost in the execution of the application and constitutes more than 95% of the total execution time.
Table 3. Virtex module characteristics

Module     Stages  Input   Slices  Speed
Add        1       16x16   10      173 MHz
Add        1       32x32   20      157 MHz
Subtract   1       16x16   11      141 MHz
Shift      1       16x16   10      180 MHz
Multiply   1       8x8     39       65 MHz
Multiply   4       8x8     48      131 MHz
Multiply   5       12x12   107     117 MHz
Multiply   5       16x16   168     115 MHz
Table 4. Reconfiguration costs in number of Virtex slices

           Greedy  Our Approach  Speedup
N-body     624     228           2.74
FFT        702     110           6.38

The application speedups are of the same order as the speedups in the reconfiguration costs illustrated in Table 4. This shows that our heuristic based approach performs significantly better than the greedy heuristic.
5 Conclusions
Automatic mapping and scheduling of applications is necessary for achieving performance improvement for general purpose computing applications on reconfigurable hardware. These techniques have to address the overheads involved in reconfiguring the hardware. In current architectures the reconfiguration overheads are still significant compared to the execution cost. In this paper, we have proposed algorithmic techniques for mapping and scheduling loops in applications onto reconfigurable hardware. The heuristics we have developed attempt to minimize the reconfiguration overheads by exploiting pipelined designs with partial and runtime reconfiguration. The mapping of example loops from applications illustrates that the proposed algorithms can generate high performance pipelined configurations with reduced reconfiguration cost. In future work, we will explore the interaction of the proposed techniques with other techniques such as parallelization and vectorization. Reconfigurable hardware specific optimizations such as clock disabling for some pipeline stages and runtime modification of the interconnection to reduce the reconfiguration cost are also being examined. The work reported here is part of the USC MAARCII project [9]. This project is developing novel mapping techniques to exploit dynamic and self reconfiguration to facilitate run-time mapping using configurable computing devices and architectures. Moreover, a domain-specific mapping approach is being developed to support instance-dependent mapping. Finally, the concept of "active" libraries is exploited to realize a framework for automatic dynamic reconfiguration [8, 13].
References
1. K. Bondalapati. Modeling and Mapping for Dynamically Reconfigurable Architectures. PhD thesis, University of Southern California. Under Preparation.
2. K. Bondalapati, P. Diniz, P. Duncan, J. Granacki, M. Hall, R. Jain, and H. Ziegler. DEFACTO: A Design Environment for Adaptive Computing Technology. In Reconfigurable Architectures Workshop, RAW'99, April 1999.
3. K. Bondalapati and V.K. Prasanna. Mapping Loops onto Reconfigurable Architectures. In 8th International Workshop on Field-Programmable Logic and Applications, September 1998.
4. K. Bondalapati and V.K. Prasanna. Dynamic Precision Management for Loop Computations on Reconfigurable Architectures. In IEEE Symposium on FPGAs for Custom Computing Machines, April 1999.
5. D. A. Buell, J. M. Arnold, and W. J. Kleinfelder. Splash 2: FPGAs in a Custom Computing Machine. IEEE Computer Society Press, 1996.
6. S. Cadambi, J. Weener, S. C. Goldstein, H. Schmit, and D. E. Thomas. Managing Pipeline-Reconfigurable FPGAs. In Proceedings ACM/SIGDA Sixth International Symposium on Field Programmable Gate Arrays, February 1998.
7. D. Chang and M. Marek-Sadowska. Partitioning sequential circuits on dynamically reconfigurable FPGAs. In IEEE Transactions on Computers, June 1999.
8. A. Dandalis, A. Mei, and V. K. Prasanna. Domain specific mapping for solving graph problems on reconfigurable devices. In Reconfigurable Architectures Workshop, April 1999.
9. MAARCII Homepage. http://maarcII.usc.edu.
10. R. Kress, R.W. Hartenstein, and U. Nageldinger. An Operating System for Custom Computing Machines based on the Xputer Paradigm. In 7th International Workshop on Field-Programmable Logic and Applications, pages 304-313, Sept 1997.
11. W. Luk, N. Shirazi, S.R. Guo, and P.Y.K. Cheung. Pipeline Morphing and Virtual Pipelines. In 7th International Workshop on Field-Programmable Logic and Applications, Sept 1997.
12. K. M. G. Purna and D. Bhatia. Temporal partitioning and scheduling data flow graphs for reconfigurable computers. In IEEE Transactions on Computers, June 1999.
13. R. P. Sidhu, A. Mei, and V. K. Prasanna. Genetic programming using self-reconfigurable FPGAs. In International Workshop on Field Programmable Logic and Applications, September 1999.
14. R. Subramanian, N. Ramasubramanian, and S. Pande. Automatic analysis of loops to exploit operator parallelism on reconfigurable systems. In Languages and Compilers for Parallel Computing, August 1998.
15. M. Weinhardt. Compilation and pipeline synthesis for reconfigurable architectures. In Reconfigurable Architectures Workshop (RAW'97). ITpress Verlag, April 1997.
16. M. Weinhardt and W. Luk. Pipeline vectorization for reconfigurable systems. In IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM '99), April 1999.
17. Xilinx Inc. (www.xilinx.com). Virtex Series FPGAs.
Compiling Process Algebraic Descriptions into Reconfigurable Logic

Oliver Diessel and George Milne

Advanced Computing Research Centre, School of Computer and Information Science, University of South Australia, Adelaide SA 5095
{Oliver.Diessel, [email protected]

Abstract. Reconfigurable computers based on field programmable gate array technology allow applications to be realized directly in digital logic. The inherent concurrency of hardware distinguishes such computers from microprocessor-based machines in which the concurrency of the underlying hardware is fixed and abstracted from the programmer by the software model. However, reconfigurable logic allows the potential to exploit "real" concurrency. We are therefore interested in knowing how to exploit this concurrency, how to model concurrent computations, and which languages allow us to control the hardware most effectively. The purpose of this paper is to demonstrate that behavioural descriptions expressed in a process algebraic language can be readily and intuitively compiled to reconfigurable logic and that this contributes to the goal of discovering appropriate high-level languages for run-time reconfiguration.
1 Introduction

The term reconfigurable computer is currently used to denote a machine based on field programmable gate array (FPGA) technology. This chip technology is programmable at the gate level, thereby allowing any discrete digital logic system to be instantiated. It differs from the classical von Neumann computing paradigm in that a program does not reside in memory but rather an application is realized directly in digital logic. For some computing and electronic control applications we are able to exploit the inherent concurrency of digital logic to directly realize algorithms as custom hardware to gain a performance advantage over software executing on conventional microprocessors. Given this observation, we may ask a wide range of questions, such as: how do we exploit this concurrency? How do we harness it to perform computations? How do we model such computation? And what programming languages should we use to help programmers/designers? This paper demonstrates that we can intuitively and rapidly compile a high-level language that is oriented to describing concurrency and communication into reconfigurable logic. We show how the core features of process algebras
[5, 7, 4] and the Circal process algebra in particular [5, 6] can be mapped into reconfigurable logic. The rationale for focusing on using a process algebra as the basis of a language for specifying reconfigurable logic is that it expresses the behaviour of a design in an abstract, technology-independent fashion and it emphasizes computation in terms of a hierarchical, modular, and interconnected structure. Process algebras have an extensive track record in the expression and representation of highly concurrent systems including digital hardware [1, 6] and are thus a good basis for a high-level language. A high-level language based on process algebra is quite different from classical hardware description languages, such as VHDL and Verilog, that are oriented towards register-transfer and gate-level descriptions. Instead, this approach provides designers with a design paradigm focussed on behavioural process modules and their interconnection. Because of its modular focus, our approach aids the rapid compilation and partial reconfiguration of designs at run-time. Our approach also presents us with the potential for formally verifying the compilation algorithm. Related research on verifiable compilation to FPGAs was performed by Shaw and Milne [9], while Page and Luk [8] also constructed an Occam to FPGA compiler. Circal models emphasize the control of and communication between processes. The rapid compilation of Circal models allows assemblies of interacting finite state machines to be implemented quickly. Apart from logic controllers, we may thus be able to build and quickly modify test pattern generators that function at near hardware speed. This project also aims to support dynamic structures that may facilitate the control of dynamically reconfigurable logic. In the following section we provide an overview of the Circal process algebra and the source language for our compiler. Section 3 introduces our contribution with an overview of the compiler. We describe a technology-independent circuit model of Circal processes in Section 4. The mapping of these circuits to FPGAs, and Xilinx XC6200 chips in particular, is discussed in Section 5. The derivation of the mapping from behavioural Circal descriptions is outlined in Section 6. A summary of the paper and directions for further work are presented in Section 7.
2 The Circal process algebra

Circal is an event-based language; processes interact by participating in events, and sets of simultaneous events are termed actions. For an event to occur, all processes that include the event in their specification must be in a state that allows them to participate in the event. The Circal language primitives are:
– State Definition: P ← Q defines process P to have the behaviour of term Q.
– Termination: Δ is a deadlock state from which a process cannot evolve.
– Guarding: a P is a process that synchronizes to perform event a and then behaves as P. (a b) P synchronizes with events a and b simultaneously and then behaves as P.
– Choice: P + Q is a term that chooses between the actions in process P and those in Q, the choice depending upon the environment in which the process is executed.
– Non-determinism: P & Q defines an internal choice that is determined by the process without influence from its environment.
– Composition: P * Q runs P and Q in parallel, with synchronization occurring over similarly named events.
– Abstraction: P - a hides event set a from P, the actions in a becoming unobservable.
– Relabelling: P[a/b] replaces references to event b in P with the event named a.
3 Overview of compiler operation

This paper describes our efforts to implement a subset of Circal suited to the instantiation of Circal process models as reconfigurable logic circuits. The implementation of the hardware compiler is referred to as HCircal. An HCircal source file consists of a declaration part, a process definition part, and an implementation part. Events and processes must be declared before use. The definition part consists of a sequence of process definitions adhering to the Circal BNF. The implementation part is introduced with the Implement declarative and is followed by a comma-delimited list of process compositions that is to be implemented in hardware. Processes must be defined before they are referred to in an Implement statement. HCircal does not currently allow the user to model non-determinism, abstraction, or relabelling. However, implementations of abstraction and relabelling are straightforward extensions to the current system. In outline, the HCircal compiler operates as follows:
1. The user inputs an HCircal specification of the system to be implemented.
2. A compiler analyses the specification to produce a hardware implementation and a driver program for interacting with the hardware model.
   – The current hardware model is in the form of a Xilinx XC6200 FPGA configuration bitstream [10] suitable for loading onto XC6200-based reconfigurable coprocessors such as the SPACE.2 board [3].
   – The driver program is a C program that executes on the host. The program loads the configuration onto the coprocessor and allows the user to interact with the implemented system.
3. The user runs the driver program and interacts with the hardware model by entering event traces and observing the system response.
The following sections describe the mapping from behavioural descriptions to technology-independent circuits, the decomposition of the circuits into modules for which FPGA configurations are readily generated, and the derivation of the module parameters from the Circal specification. The generation of the host program is a straightforward specialization of a general program that obtains appropriate event inputs, loads the input registers, and reads the process state registers. It is not further discussed.
4 A circuit model of Circal
The aim of the model is to represent, as faithfully as possible, Circal semantics in hardware. The design concentrates on the representation of the Circal composition operator, which is of central importance because it is through the composition of processes that interesting behaviour is established. When processes are composed in hardware they are executed concurrently. The hardware implementation of the Circal system follows design principles that aim to generate fast circuits quickly. The first of these is that, for the sake of speed and scalability, the hardware representation of Circal aims to minimize its dependence upon global computation at the compilation and execution phases. The second principle is that we choose to design for ease of run-time instantiation and computational speed over area minimality. The motivation for these choices is the desire to leverage the speedup afforded by concurrently executing the Circal system in hardware; they are supported by the ability to reconfigure the gate array at run-time in order to provide a limitless circuit area. Finally, we desire a reusable design because we believe that will facilitate design synthesis, circuit reconfiguration, and future investigations into dynamically structured Circal.
4.1 Design outline
A block diagram of a digital circuit that implements a composition of Circal processes in hardware is shown in Figure 1(a). The circuit consists of a set of interconnected processes that respond to inputs from the environment by undergoing state transitions. Processes are implemented as blocks of logic with state. In a given state, each process responds to events according to the Circal process definitions. Individual processes examine the event offered by the environment and produce a "request to synchronize" signal if the event is found to be acceptable. The request signals for all processes are then reduced to a single synchronization signal that each process responds to independently. Implementing Circal in synchronous FPGA circuits leads us to assume that: an event occurs at most once during a clock period; the next state is determined by the events that occurred during the previous clock period; and, if no event occurs between consecutive positive clock edges, then the idling transition P -> P occurs upon the second clock edge by default.
4.2 Process logic design
Process logic blocks are derived from the process definition syntax and represented as compact localized blocks of logic to simplify the placement and routing of the system. A high-level view of a process logic block is given in Figure 1(b). A process is designed to respond to events in the environment that are acceptable to all processes in the composed system. In order to perform this function, the process logic first checks whether the event is acceptable to itself. If
Fig. 1. (a) Circuit block diagram, and (b) Circal process logic block.
all processes find the event acceptable, the event synchronization logic returns a synchronization signal that is used by individual process logic blocks to enable the state transition guarded by the event. The following subsections describe the process logic design in more detail.
Determining the validity of event combinations We construct a combinational circuit that checks whether the events in the sort of the process form a valid guard for the current state. The process also accepts a null event (an event not in its sort) in order to allow other processes to respond to events it does not care about. The current state of the process is recycled if an unacceptable or null event is offered by the environment. Let us assume at most k possibly recursive definitions P_0, P_1, ..., P_{k-1} are necessary to describe the evolution of process P with sort S = {e_0, e_1, ..., e_{n-1}}, and that P_i, with 0 <= i <= k-1, is defined as

    P_i <- g_{i,0} P_{i,0} + ... + g_{i,j} P_{i,j} + ... + g_{i,j_i} P_{i,j_i},

where index i refers to the current state, P_{i,j} is the next state (one of P_0, ..., P_{k-1}) that P_i evolves to under guard g_{i,j} ⊆ S, and g_{i,j} is interpreted as the simultaneous occurrence of the events in g_{i,j}. The definition for P_i consists of j_i + 1 guarded terms in which the g_{i,j} are all distinct. Note that there may be at most k distinct next states but 2^n - 1 distinct guards. If we think of the events and states as boolean variables, then in state P_i the process responds to event combinations in the set {γ_{i,j}} ∪ {γ_S}, where γ_{i,j} = ε_0 ε_1 ... ε_{n-1} with ε_l = e_l or ε_l = ¬e_l, for 0 <= l <= n-1, depending upon whether or not e_l ∈ g_{i,j}, and where γ_S = ¬e_0 ¬e_1 ... ¬e_{n-1} is the null event for sort S. Process P in state P_i therefore accepts the boolean expression of events γ_S + Σ_{0<=j<=j_i} γ_{i,j}. The request for synchronization signal, r_P, is thus formed from the expressions for all states:

    r_P = Σ_{0<=i<=k-1} (γ_S + Σ_{0<=j<=j_i} γ_{i,j}) · P_i.
Checking the acceptability of an event The request signals for all processes are ANDed together in an AND gate tree that is implemented external to the individual process logic blocks. The output of the tree is fed back to each process as the synchronization signal, s.
Enabling state transitions The state of the process is stored in flip-flops, one for each state. Let D_{P_l}, 0 <= l <= k-1, denote the boolean input function of the D-type flip-flop for state P_l. Then we can derive the following boolean equations from the process definitions:

    D_{P_l} = s · (γ_S · P_l + Σ_{0<=i<=k-1} Σ_{P_{i,j} = P_l} γ_{i,j} · P_i) + ¬s · P_l,   for 0 <= l <= k-1.

In the above equations, the terms in parentheses are enabled when the synchronization signal, s, is high. These terms correspond to the guards on state transitions and to state recycling if a null event was offered to this process. The last term in the equations forces the current state to be renewed if the processes could not accept the event combination offered by the environment. By observing the synchronization signal, the environment can determine whether or not an event was accepted and can thus be constrained by the process composition.
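As an illustration of how these acceptance and next-state expressions can be derived mechanically, the following Python sketch represents a process as a mapping from states to guarded transitions and enumerates the corresponding minterms. It is an illustration under our own assumptions: the data layout, function names and the two-state example process are invented here and are not part of the HCircal compiler.

def minterm(sort, guard):
    # Literal vector for one guard: '1' if the event must occur, '0' otherwise.
    return ''.join('1' if e in guard else '0' for e in sort)

def accepted_minterms(sort, transitions):
    # Event combinations accepted in one state: its guards plus the null event.
    null_event = '0' * len(sort)
    return {null_event} | {minterm(sort, g) for g, _ in transitions}

def request_terms(sort, process):
    # Per-state acceptance sets; r_P is their OR, gated by the state flip-flops.
    return {state: accepted_minterms(sort, trans) for state, trans in process.items()}

def dff_input_terms(sort, process):
    # (current_state, minterm) pairs that set the flip-flop of each next state
    # when the synchronization signal s is high; holding the state on ~s is implicit.
    terms = {state: set() for state in process}
    for state, trans in process.items():
        terms[state].add((state, '0' * len(sort)))   # null event recycles the state
        for guard, next_state in trans:
            terms[next_state].add((state, minterm(sort, guard)))
    return terms

if __name__ == '__main__':
    sort = ['a', 'b']
    # Hypothetical process: P0 <- a P1 + (a b) P0,  P1 <- b P0
    process = {'P0': [({'a'}, 'P1'), ({'a', 'b'}, 'P0')],
               'P1': [({'b'}, 'P0')]}
    print(request_terms(sort, process))
    print(dff_input_terms(sort, process))

Each (current state, minterm) pair returned by dff_input_terms corresponds to one term of the parenthesized sum in the flip-flop input equation above.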
4.3 The complete process logic block
The disjunction of the parenthesized terms in the flip-flop input functions implements the same boolean function as that used to obtain the request signal. We therefore use the state selection circuits to form the request signal and use the synchronization signal to enable the selection.
5 Mapping circuits to reconfigurable logic
In this section we consider the placement and routing of the circuits derived in Section 4. The derivation of circuit requirements from the specification is discussed in the next section. Our primary compilation goal is to generate FPGA configurations rapidly. We also want to be able to replace circuitry at run-time to explore changing process behaviours and to overcome resource limitations. For this reason we are interested in mapping to Xilinx XC6200 technology because its open architecture allows us to produce our own tools and because the chip is partially reconfigurable. Difficulties with placing and routing the Circal models satisfactorily with XACTStep, the Xilinx APR tool for the XC6200, led us to consider decomposing the circuits into modules that can be placed and routed under program control. These modules serve as an attractive intermediate form since they are easily
derived from the specification, they completely describe the circuits to be implemented in a hardware-independent manner, and the FPGA configuration can be generated without further analysis. The circuits described in Section 4 are specified in terms of parameterised modules that communicate via adjoining ports when they are abutted on the array surface. To simplify the layout of the circuits, all modules are rectangular in shape. The internal layout of modules is also simplified by using local interconnects only. The module representation of the circuits is readily mapped to a particular hardware technology by suitable module generators. The compiler can thus be ported to a new FPGA type by implementing a new set of module generators. We distinguish between 9 module types. Each module type implements a specific combinational logic function using a particular spatial arrangement. Modules are specified in terms of their location on the array, input and/or output wire bit vectors, and the specific function they are to implement, e.g., minterm number. The interested reader is referred to our technical report for a complete description of the module functions, parameters, and circuit generators [2].
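To make the module abstraction concrete, here is a small hypothetical Python sketch of what a technology-independent module specification and a per-technology module generator might look like. The class names, fields and the dummy back end are assumptions made only for illustration; they are not the interface of the actual compiler or of any Xilinx tool.

from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class ModuleSpec:
    kind: str                      # one of the module types, e.g. 'minterm'
    origin: Tuple[int, int]        # placement on the array (column, row)
    width: int
    height: int
    inputs: List[str]              # input wire bit vector names
    outputs: List[str]             # output wire bit vector names
    function: int = 0              # e.g. the minterm number to implement

class ModuleGenerator:
    # Technology-specific back end: porting the compiler means re-implementing this.
    def generate(self, spec: ModuleSpec) -> Dict[Tuple[int, int], str]:
        raise NotImplementedError

class DummyGenerator(ModuleGenerator):
    # Placeholder target that just fills the module's rectangle with cell tags.
    def generate(self, spec):
        return {(spec.origin[0] + dx, spec.origin[1] + dy): spec.kind
                for dx in range(spec.width) for dy in range(spec.height)}

if __name__ == '__main__':
    spec = ModuleSpec('minterm', (0, 0), 2, 3, ['e0', 'e1'], ['m0'], function=2)
    print(len(DummyGenerator().generate(spec)), 'cells configured')

Porting to a new FPGA family then amounts to supplying a new generator for each of the module types, mirroring the porting argument made above.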
6 Deriving modules from process descriptions
For each unique process that is to be implemented, a process template that consists of the modules comprising the process logic is constructed. The module parameters for a process template are independently calculated using relative offsets. Once the size of the logic for each template is known, a copy with absolute offsets (final placement of modules) is made for each process to be implemented. When all the parameters are known, the FPGA configuration is generated. Currently the compilation is performed off-line and the configurations generated are static. In future implementations we plan to experiment with replacing modules at run-time to overcome resource limitations and implement dynamically changing process behaviours. Minor behavioural changes may simply involve replacing minterms or guard modules, which could be done very quickly. The regular shapes and small sizes of modules may allow us to distribute them and finalize the module positioning at run-time in order to maximize array utilization. For a more detailed description of the steps in the derivation of the module representation please refer to [2].
7 Conclusions
We have shown how to model Circal processes as circuits that can be mapped to blocks of logic on a reconfigurable chip. Modelling system components as independent blocks of logic allows them to be generated independently, to be implemented in a distributed fashion, to operate concurrently, and to be swapped to overcome resource limitations. The model thus exploits the hierarchy and modularity inherent in behavioural descriptions to support virtualization of hardware.
We have shown how to instantiate a circuit by decomposing it into parametric modules that perform functions above the gate level. To simplify the layout, modules are mapped to rectangular regions that are wired together by abutting them on a chip. Since the modules completely describe the circuits to be implemented in a hardware-independent yet readily mapped manner, they could serve as a mobile description of Circal processes that can be transmitted and instantiated remotely. Future work will investigate developing an interpreter that adapts to resource availability and supports dynamic process behaviour. We also intend to assess the usability of process algebraic specifications for a number of applications. A further direction is to enhance the HCircal language to support stream-oriented and data-parallel computations.
Acknowledgements We gratefully acknowledge the helpful comments and suggestions made by Alex Cowie, Martyn George, and Bernard Gunther.
References
1. A. Bailey, G. A. McCaskill, and G. Milne. An exercise in the automatic verification of asynchronous designs. Formal Methods in System Design, 4(3):213-242, 1994.
2. O. Diessel and G. Milne. Compiling HCircal. Draft manuscript, Advanced Computing Research Centre, University of South Australia, Adelaide, Australia, September 24, 1999.
3. B. K. Gunther. SPACE 2 as a reconfigurable stream processor. In N. Sharda and A. Tam, editors, Proceedings of PART'97, the 4th Australasian Conference on Parallel and Real-Time Systems, pages 286-297, Singapore, Sept. 1997. Springer-Verlag.
4. C. A. R. Hoare. Communicating Sequential Processes. Prentice-Hall International Series in Computer Science. Prentice-Hall, Inc., Englewood Cliffs, NJ, 1985.
5. G. Milne. CIRCAL and the representation of communication, concurrency and time. ACM Transactions on Programming Languages and Systems, 7(2):270-298, 1985.
6. G. Milne. Formal Specification and Verification of Digital Systems. McGraw-Hill, London, UK, 1994.
7. R. Milner. Communication and Concurrency. Prentice-Hall, Inc., New York, NY, 1989.
8. I. Page and W. Luk. Compiling Occam into FPGAs. In W. R. Moore and W. Luk, editors, FPGAs, Edited from the Oxford 1991 International Workshop on Field Programmable Logic and Applications, pages 271-283, Abingdon, England, 1991. Abingdon EE&CS Books.
9. P. Shaw and G. Milne. A highly parallel FPGA-based machine and its formal verification. In H. Grunbacher and R. W. Hartenstein, editors, Second International Workshop on Field-Programmable Logic and Applications, volume 705 of Lecture Notes in Computer Science, pages 162-173, Berlin, Germany, Sept. 1992. Springer-Verlag.
10. Xilinx. XC6200 Field Programmable Gate Arrays. Xilinx, Inc., Apr. 1997.
Behavioral Partitioning with Synthesis for Multi-FPGA Architectures under Interconnect, Area, and Latency Constraints * Preetham Lakshmikanthan **, Sriram Govindarajan, Vinoo Srinivasan ***, and Ranga Vemuri {plakshmi, sriram, vsriniva,
[email protected] Department of ECECS, University of Cincinnati, Cincinnati, OH 45221
Abstract This paper presents a technique to perform partitioning and synthesis of behavioral specifications. Partitioning of the design is done under multiple constraints – interconnections and device areas of the reconfigurable architecture, and the latency of the design. The proposed Multi-FPGA partitioning technique (FMPAR) is based on the Fiduccia-Mattheyses (FM) partitioning algorithm. In order to contemplate multiple implementations of the behavioral design, the partitioner is tightly integrated with an area estimator and design space exploration engine. A partitioning and synthesis framework was developed, with the FMPAR behavioral partitioner at the front-end and various synthesis phases (High-Level, Logic and Layout) at the back end. Results are provided to demonstrate the advantage of tightly integrating exploration with partitioning. It is also shown that, in relatively short runtimes, FMPAR generates designs of similar quality compared to a Simulated Annealing partitioner. Designs have been successfully implemented on a commercial multi-FPGA board, proving the effectiveness of the partitioner and the entire design framework.
1 Introduction
Partitioning is essential when designs are too large to be placed on a single device or because of I/O pin limitations. Partitioning of a design can be done at various levels: behavioral, register-transfer level (RTL) or gate-level. Behavioral partitioning is a pre-synthesis partitioning while RTL partitioning is done after high-level synthesis. Various studies [1] show the superiority of behavioral over structural partitioning. A behavioral partitioner has no a priori knowledge about design parameters such as area and latency. The partitioner must be guided by a high-level estimator that provides the required information. Efficient estimation techniques [2, 3] have been developed for this purpose. The approach presented in [2] provides an efficient design space exploration technique that can be performed dynamically with partitioning. A partitioner can effectively control the trade-off between the execution time and the design space explored. We show the effectiveness of integrating the partitioner with a design-space exploration engine in generating constraint satisfying solutions.
There has been a lot of research in multi-FPGA partitioning, as presented in the survey by Alpert and Kahng [4]. In particular, Sanchis [5] extended the FM for multiway partitioning by repeatedly applying standard bi-partitioning. This work attempts to minimize the sum of all the cutsets across all partition segments. For a multi-FPGA RC, it is imperative that the pin constraints of the devices are individually satisfied. Therefore, this method of minimizing a summation of cutsets may not produce a constraint satisfying solution. Our goal is to minimize each cutset individually for pin-constraint satisfaction. We present a technique called FMPAR which is an extension of the Fiduccia-Mattheyses algorithm [6]. The results of partitioning are compared against a Simulated Annealing (SA) partitioner that forms part of the SPARCS [7] framework.
The rest of the paper is organized as follows. Section 2 describes the partitioning and synthesis framework. Section 3 presents the FMPAR algorithm in detail and the interaction of FMPAR with an exploration engine. Finally, Section 4 presents results demonstrating the effectiveness of this work.
* This work is supported in part by the US Air Force, Wright Laboratory, WPAFB, under contract number F33615-96-C-1912, and under contract number F33615-97-C-1043.
** Currently at Cadence Design Systems Inc., MA. Work done at University of Cincinnati.
*** Currently at Intel Corporation, CA. Work done at University of Cincinnati.
2 Partitioning and Synthesis Framework
The framework for partitioning and synthesis is shown in Figure 1. It consists of the FMPAR partitioner at the front-end and various synthesis phases (High-Level, Logic, and Layout) at the back-end. The input behavioral designs are specified in subsets of either VHDL or C. The design descriptions are translated into an equivalent Control-Data Flow Block Graph (CDFG), where the blocks contain a simple data-flow graph that captures computation, and the edges between blocks represent both data and control flow.
The FMPAR partitioner views a block in the CDFG as an atomic element that cannot be partitioned onto multiple FPGAs. The edges between various blocks are the set of cutset constraints for the partitioner. The user can specify any number of logical memory segments modeled as dummy blocks in the CDFG. The FMPAR partitioner automatically maps the logical memory blocks onto the physical memory banks.
Fig. 1. Partitioning and Synthesis Framework
The core of the entire flow is the iterative FMPAR partitioner coupled with an area estimator and exploration engine. The exploration engine performs effective resource sharing across blocks and provides the partitioner with accurate area estimates. The partitioned behavior segments generated by FMPAR are automatically synthesized by an in-house high-level synthesis tool to generate equivalent RTL designs. Further, the RTL designs are taken through commercial logic (Synopsys Design Compiler)
and layout (Xilinx M1) synthesis tools to generate FPGA bitstreams for the target board. Note that the communication signals routed across devices are always registered in the RTL designs to ensure that the board interconnect delay does not affect the clock period of the partitioned design.
3 The FMPAR Partitioner with the Exploration Engine
Like the FM, the FMPAR also allows only one block to be moved at a time, and the locking option of cells in the standard FM is incorporated here. A block can be moved across the FPGAs a user-specified number of times, after which it is locked and cannot be moved. We now present the terminology and details of the FMPAR algorithm.
Fig. 2. Cutsets between FPGAs
Global Cut (GC): This is defined as the cutset between the partitions assigned to two FPGAs. Consider the example shown in Figure 2. The RC board contains 4 FPGAs and it is a fully connected board. There are six global cuts; for example GC14 denotes the global cut between FPGAs 1 and 4, and |GC14| denotes the size of the global cut.
Current Max: the greatest value among all the global cuts. In Figure 2, |GC14| = 40 is the Current Max.
Current Min: the least value among all the global cuts. In Figure 2, |GC12| = 5 is the Current Min.
Net Cut (ni): Each global cut is in turn composed of a set of nets that are cut, {n1, n2, ..., nk}. Consider the Current Max value (|GC14| = 40) in Figure 2. It is contributed to by 2 net-cuts, n1 and n2, of size 30 and 10 respectively.
Priority: The net-cuts are prioritized in decreasing order of their sizes. The size of a net-cut is the bit-width of the net. In trying to reduce any global cut, we attempt to eliminate the net-cuts, one at a time, in the sorted order.
Net-Cut Elimination & Move Types: The moves are contemplated such that the worst GC (Current Max) is reduced. For this purpose, the highest priority net-cut of the worst GC is considered. Moves are contemplated on the blocks connected to this net-cut. For example in Figure 2, n1 is the highest priority net-cut in the worst global cut GC14. Three possible moves can be contemplated to eliminate this net-cut: (1) move the connected blocks in FPGA 1 into FPGA 4 or vice-versa, (2) move the connected blocks in FPGA 1 or 4 into FPGA 2, and (3) move the connected blocks in FPGA 1 or 4 into FPGA 3. We call Option 1 a 1-degree move of a net-cut, Option 2 a 2-degree move of a net-cut and Option 3 a 3-degree move of a net-cut. In general, for 'n' FPGAs, a 1-degree move is between the pair of FPGAs (say Fi and Fj) associated with the highest priority net-cut. The remaining 'n-2' FPGAs (other than Fi and Fj) are sorted in decreasing order of available free space, F2, F3, ..., Fn-1. A k-degree move (2 <= k <= n-1) is defined as one where blocks on either Fi or Fj are moved to the corresponding k'th FPGA. We
1
2
Behavioral Partitioning with Synthesis for Multi-FPGA Architectures
927
define free space as the difference between the device area and the estimated area of the partition segment. 3.1 The FMPAR Algorithm Algorithm 3.1 presents the outline of FMPAR, the proposed multiway FM partitioning technique. The inputs are the design described as a CDFG block graph (BG), the number of FPGAs (Nfpga ) on the board, the size (gc size[][]) of interconnections between each pair of FPGAs, the area of each FPGA (dev area[]), the block locking factor (lock fact), and the design latency. Unlike a standard FM, our algorithm performs a user-specified number of runs (Nruns) from different initial partition solutions. During each run, the FM-loop (outer repeat-until loop) is executed until no improvement in cutset is observed for K successive iterations. During each run of the FM algorithm Algorithm 31 (FMPAR Algorithm) FMPAR(BG, Nfpga , gc size[][], dev area[], a legal initial partition is generated. A parlock fact, latency, K, Nruns ) Begin tition is said to be legal if and only if all Max Prev Max 1; Current Max 0; For FM runs = 1 to Nruns partition segments satisfy the area conNew Partition Generate a legal initial partition; Repeat /* Run FM-loop until no improvement */ straints posed by the individual devices. Calculate GCs for all pairs of FPGAs; If (1 i; j Nfpga , jGCij j gc size(i,j)) Then During each iteration of the FM-loop, all Output (Constraint Satisfying Solution) and Exit; EndIf; GCs are computed and ordered. If a conRepeat /* Until no moves are possible */ Calculate Current Max, Current Min and order all the GCs; straint satisfying solution is obtained the If (Current Max < Max) Then Max Current Max; entire FMPAR algorithm terminates. A conEndIf; Move Choose A Move(Nfpga , Current Max, straint satisfying solution is a legal partiCurrent Min, dev area[], lock fact); tion that satisfies the interconnection conIf (Move = ) Then If (Max < Prev Max) Then straints as well. Current Max is the worst Prev Max Max; Save current partition as best partition; cutset between all FPGA pairs, and is calEndIf; Break; /* Out of inner repeat-until loop */ culated every time a move is made. Max Else Make the move and Increment that block’s move count; represents the least value of Current Max, EndIf; Until(False); over all moves that have been made. Prev Max If (Prev Max hasn’t changed over the last K runs) Then Output (best partition solution obtained); is the least value of Max over all iteraBreak; /* Out of outer repeat-until loop */ EndIf; tions of the FM-loop. Reset move count of all the blocks to zero; New Partition best partition; During each iteration of the FM-loop, Until(False); EndFor; /* Restart FM with a new initial partition */ several legal moves are made until no furEnd. ther moves are possible. A move is legal move only if it leads to a legal partition and does not exceed the locking factor. The locking factor is a user-defined upper limit on the number of times a block can be moved. For selecting a legal move, the algorithm contemplates several possible moves in the procedure Choose A Move(). The contemplated moves are called k , degree moves as explained earlier. The goal is to minimize the worst cutset (Current Max). If none of the moves decrease the worst cutset, then the least cutset violating move is accepted. If no legal move is possible the procedure returns . This terminates the move-making process for one iteration of FM. 
At this point, each block is unlocked (move count is set to zero) and the best partition obtained so far is used as the new partition for the next iteration of the FM. 3.2 Interaction between FMPAR and Exploration Engine The partitioner is tightly integrated with a high-level exploration engine. The partitioner always communicates any change in the partitioned configuration to the exploration
928
P. Lakshmikanthan et al.
engine and both the tools maintain an identical view of the partitioned configuration. The exploration engine effectively uses the partitioning information by dynamically generating implementations that maximize sharing of resources within each partition segment. Further details of the exploration engine can be found in [2]. The partitioner dynamically controls the trade-off between the execution time and the design space ex- Algorithm 32 (FMPAR with dynamic exploration) plored. The exploration technique provides an Explo- FMPAR() Begin Generate random initial partition of blocks; ration Control Interface (ECI) that facilitates tight in- 12 Repeat 3 Unlock all blocks; tegration with the partitioning algorithm. This interface 4 While (9 movable blocks) Do 5 Select a block; consists of a collection of exploration methods that gen6 Estimate Move; erate new implementations, and estimation methods that 7 Make a move and lock; EndWhile; simply re-compute the design estimates for a modified 89 Reset to the best partition; Explore Design for best partition; partition configuration. Algorithm 32 presents the tem- 10 plate for the FMPAR algorithm with calls to the explo- 11End. Until (No Cutset Improvement); ration engine enclosed in boxes. The FMPAR partitioner calls the area estimator and exploration engine at two places : (1) When moves are being evaluated (line-6), and, (2) when the configuration is reset to the best partition (line-10). A detailed study was conducted to make appropriate usage of the ECI functions at crucial points of the partitioning process. The Estimate Move method evaluates the effect of moving a block from a source partition to a destination partition without performing exploration and hence is not expensive in time. Whereas, the Explore Design method attempts to generate area and latency satisfying implementations at the expense of compute time. This way the calls to the exploration engine effectively utilize the trade-off between the exploration time and the amount of design space explored. Essentially, the partitioner takes care of the interconnection constraints, while the area and the latency constraints are handled by the area estimator and exploration engine. Thus, each time the solution is acceptable in terms of interconnection constraint, the exploration engine ensures the best area and latency satisfying solution.
4 Experimental Results We first present results to show the effectiveness of the FMPAR algorithm integrated with the exploration engine. Then, the FMPAR is compared with a simulated annealing partitioner. Finally, we report results obtained for designs that were successfully implemented on the Wildforce [8], a commercial multi-FPGA board. 4.1 Effectiveness of Dynamic Exploration with FMPAR We developed two versions of FMPAR, one performing dynamic exploration and another that does not. In the latter case, the exploration engine is used only to obtain area estimates without exploring multiple implementations. For experimentation, we considered the two large DSP benchmarks - the Discrete Cosine Transform (DCT) and the Fast Fourier Transform (FFT). The FFT benchmark has 18 blocks with 2 loops, 152 operations, 1418 nets (data bits) across the blocks, DCT has 66 blocks with 8 loops, 264 operations and 2401 nets and both examples have an extremely large number possible implementations.
Behavioral Partitioning with Synthesis for Multi-FPGA Architectures
929
We have gathered results by fixing two of the three constraints (design latency (L) and RC interconnection cutset (C)) and varying the third (device area (A)). The results are presented as plots where the x-axis represents the constraint varied (device area) and the y-axis represents the fitness value. Fitness is defined as, 1 F = (1+CutsetP enalty) ; where; CutsetPenalty =
P
UnroutedNets T otalDesignNets
The unrouted nets is the summation of all the nets contributing to GCs that exceed the board cutsize. Fitness is a measure of the solution quality, ranging between 0 and 1. A fitness value of 1 denotes a constraint satisfying solution, while a lower value denotes a poor quality solution because of a violation of cutset constraints. We chose a representation of the Wildforce architecture with four FPGA devices and a cutset constraint of 36 interconnections between each pair of FPGAs. Figure 3 plots the fitness of generated solutions for the DCT benchmark. Both versions of the partitioner (with and without the dynamic exploration) generate constraintsatisfying solutions for all area constraints at and greater than 940 CLBs. As we gradually decrease the design area we see that the FMPAR version with dynamic exploration continues to generate constraint-satisfying solutions (F = 1), while its counterpart fails (F < 1), Fig. 3. Plot for DCT even after running on a large number random initial configurations. We have made similar observations for the FFT benchmark, presented in [2]. This clearly demonstrates the effectiveness of interfacing the partitioner with the area estimator and exploration engine. 4.2 Comparison of FMPAR against a Simulated Annealing Partitioner FMPAR With Exploration FMPAR Without Exploration
1
0.98 0.96 0.94 0.92 0.9
0.88 0.86
940
930
920
910
900
890
DCT Area for L = 700, C = 36
In this section, we compare the results of the FMPAR algorithm to that of a Simulated Annealing (SA) partitioner that a part of the SPARCS [7] design environment. The SA was also interfaced with the area estimator and design space exploration engine. Both algorithms were implemented and run on the same workstation – a twin processor UltraSparc with 384 MB RAM and clocking at 296 Mhz. The table in Fig.4 provides a comparison of designs partitioned by the FMPAR and SA partitioners. For each design example, both partitioners were run on the same set of device area and design latency constraints. The comparison metrics are: (i) the number of unrouted nets (# UN) in the resulting solution and, (ii) the run time for each partitioner. The Ndevs in the first column represents the number of devices on the RC, provided as a constraint to the partitioner. The interconnection constraint (CutSet) between each FPGA pair was fixed at 36. The last column in the table presents the speedup factor of the FMPAR partitioner over the SA partitioner. Both the FMPAR and SA partitioners found constraint-satisfying solutions in five cases (Rows 1,2,4,5 and 7). The designs satisfied the cutset constraints as evidenced by ‘0’ unrouted nets. At the same time, we see that the FMPAR algorithm always has much lesser run times than that of the SA. Both partitioners did not find a constraint-satisfying solution for three designs – ELLIP (Row 3), FFT (Row 6) and DCT4x4 (Row 8). This is because the partitioners
930
P. Lakshmikanthan et al.
failed on the a tight cutset constraint. For the DCT4x4 example which is the largest, the SA was run with a slow cooling schedule for 2 hrs and 24 mins and a solution with 21 unrouted nets was obtained. It is observed that for this example, the FMPAR partitioner produced a higher quality solution (only 14 unrouted nets) in a much lesser time (33.4x speedup). In case of the FFT design and the ELLIP examples, the resulting solutions of both partitioner are comparable, yet the FMPAR finishes quicker. From the results, Design FPGA Dsgn Simulated Annealing FMPAR Name, Area Lat. Partn. # Run Partn. # Run Spd we conclude that FMAreas UN Time Areas UN Time Up PAR produces parti(Ndevs ) (clbs) (clks) (clbs) (h:m:s) (clbs) (m:s) tioned solutions whose ALU 150 18 60 , 123 0 0:00 22 , 147 0 0: 4x (4) 146, 0 :04 146, 43 01 quality is similar to STATS 324 44 287, 60 0 0:00 49, 318 0 0: 5x that of the SA, but, (2) :10 02 ELLIP 450 61 337, 362 7 0:00 441, 252 10 0: 5x in much lesser run(2) :55 12 times. This is because ELLIP 600 61 536, 92 0 0:00 596, 26 0 0: 6x (2) :12 02 the SA is a stochasFIR 290 93 242, 178, 0 0 0:00 23, 86, 288 0 0: 5x tic, hill-climbing ap(3) :15 03 proach as opposed to FFT 540 104 446, 317 4 0:04 387, 494 4 0: 15x (4) 530, 0 :08 500, 484 16 the FMPAR which is FFT 580 104 0 , 580 0 0:00 480, 564 0 0: 3x a move-based algo(4) 550, 0 :31 353, 423 09 DCT4x4 3600 415 3188, 3303 21 2:24 3338, 3534 14 4: 33x rithm that quickly con(4) 3468, 3266 :09 3241, 3531 19 verges to a constraint satisfying solution. Al- Fig. 4. Comparison of SA and FMPAR generated designs though FMPAR is highly dependent on the initial solution and could stop at a local optimum, the results are as good as the SA for the constraint satisfaction problem. 4.3 On-Board Implementations Two designs were executed on the board after logic and layout synthesis. The designs ALU and STATS were successfully implemented and tested on the Wildforce [8], a commercial multi-FPGA board. The ALU is a simple arithmetic unit that has four 16-bit operating modes: addition, subtraction, multiplication and sum of squares of two input operands. The STATS is a statistical analyzer that computes the mean and variance of eight 16-bit numbers. Information about the synDesign Partition Area (CLBs) Latency (Clks) thesized designs are shown Name Number Estimated Actual Constraint Actual in Figure 5. We compare the P1 22 30 estimated area and perforALU P2 147 139 18 19 mance measures against the P3 146 179 actual values after layout synP4 43 54 thesis. Columns 3 and 4 show STATS P1 49 66 44 46 the estimated and actual area P2 318 335 of each partition. In general, we observed in our experFig. 5. Designs down-loaded onto RC boards iments that the estimated areas are within 10-20% of accuracy. Columns 5 and 6 compare the latency constraint to
Behavioral Partitioning with Synthesis for Multi-FPGA Architectures
931
the actual latency of the partitioned design obtained from board-level simulation. We observe that our framework satisfies the latency constraint within a deviation of 5%. In order to check for functional correctness, the results generated on board were verified against the simulation results. The partitioned designs executed successfully and the results matched that of the simulation.
5 Summary This paper presents a framework for multi-FPGA partitioning of behavioral designs and their synthesis onto reconfigurable boards. An FM based multiway partitioner was presented, which is integrated with an area estimator and design space exploration engine. By efficiently performing dynamic exploration with partitioning, the partitioner produces good quality solutions in a reasonable amount of time. A limitation of the partitioner is that it can currently handle only fixed interconnection architectures. In the future, we plan to integrate the partitioner with interconnect estimation techniques [9] that can handle programmable interconnection architectures. Results are provided to demonstrate the advantage of tightly integrating exploration with partitioning. Also, it is shown that the FMPAR produces constraint-satisfying solutions of similar quality to that of the SA, in much lesser run-times. Designs taken down to the Wildforce board proves that the FMPAR algorithm maintains the functionality of the design after partitioning and also shows the effectiveness of the partitioning and synthesis framework.
References 1. F. Vahid. “Functional Partitioning improvements over Structural Partitioning for Packaging Constraints and Synthesis : Tool Performance”. In ACM Transactions on Design Automation of Electronic Systems, volume 3, pages 181–208, April 1998. 2. S. Govindarajan, V. Srinivasan, P. Lakshmikanthan, and R. Vemuri. “A Technique for Dynamic High-Level Exploration During Behavioral-Partitioning for Multi-Device Architectures”. In Proc, of the 13th IEEE Intl. Conf. on VLSI Design, January 2000. 3. F. Vahid and D. Gajski. “Incremental Hardware Estimation During Hardware/Software Functional Partitioning”. In IEEE Transactions on VLSI Systems, volume 3, September 1995. 4. Charles J. Alpert and Andrew B. Kahng. “Recent Directions in Netlist Partitioning”. In Integration, the VLSI Journal, 1995. 5. L. A. Sanchis. “Multiple-way network partitioning”. In IEEE Transactions on Computers, pages 62–81. 38(1), January 1989. 6. C. M. Fiduccia and R. M. Mattheyses. “A Linear Time Heuristic for Improving Network partitions”. In Proceedings of the 19th ACM/IEEE DAC, pages 175–181, 1982. 7. I. Ouaiss, S. Govindarajan, V. Srinivasan, M. Kaul, and R. Vemuri. “An Integrated Partitioning and Synthesis System for Dynamically Reconfigurable Multi-FPGA Architectures”. In Proc. of Reconfig. Arch. Workshop (RAW98), pages 31–36., March 1998. 8. Annapolis micro systems, inc. http://www.annapmicro.com/amshhomep.html. 9. V. Srinivasan, S. Radhakrishnan, R. Vemuri and J. Walrath. “Interconnect Synthesis for Reconfigurable Multi-FPGA Architectures”. In Proc. of RAW99, pages 597–605., April 1999.
Module Allocation for Dynamically Reconfigurable Systems Xue-jie Zhang and Kam-wing Ng Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin, N. T., Hong Kong {xjzhang,
[email protected]
Abstract. The synthesis of dynamically reconfigurable systems poses some new challenges for high-level synthesis tools. In this paper, we deal with the task of module allocation, as this step has a direct influence on the performance of the dynamically reconfigurable design. We propose a configuration bundling driven module allocation technique that can be used for component clustering. The basic idea is to group configurable logic together properly so that a given configuration can do as much work as possible, allowing a greater portion of the task to be completed between reconfigurations. Our synthesis methodology addresses the issue of minimizing reconfiguration overhead by maintaining a global view of the resource requirements at all times during the high-level synthesis process.
1 Introduction
A dynamically reconfigurable system allows hardware reconfiguration while part of the reconfigurable hardware is busy computing, and allows a large system to be squeezed into a relatively small amount of physical hardware [1]. Though very promising, the development of dynamically reconfigurable systems faces many problems. Since the configuration changes over time, one major problem is that there needs to be some way to ensure that the system behaves properly for all possible execution sequences. For this time-multiplexed reconfiguration to be realized, a new temporal partitioning step needs to be added to the traditional design flow. Some researchers have addressed temporal partitioning heuristically, by extending existing scheduling and clustering techniques of high-level synthesis [2][3][4]. In an earlier work [5], we presented a design model for abstracting, analyzing and synthesizing reconfiguration at the operations level. In addition to making sure that a temporal partitioning be done correctly and producing a functionally correct implementation of the desired behavior, another important problem is how to produce the best implementation of functionality. With normal FPGA-based systems, one wants to map the configurable logic spatially so that it occupies the smallest area, and produces results as quickly as possible. In a dynamically reconfigurable system one must also consider the time
to reconfigure the system, and how this affects the performance of the system. Configuration can take a significant amount of time, and thus reconfiguration should be kept to a minimum. This is in general a challenging problem to address, with almost no current solution [6]. In this paper, we present an efficient high-level synthesis technique which can be used to synthesize and optimize dynamically reconfigurable designs. In particular, we concentrate our investigation on the task of module allocation. Dynamic reconfiguration extends the module allocation space by an additional dimension. The optimizing criteria in dynamic resource allocation also shift from a single static netlist to several configurations of the design. We must account not only for temporal partitioning and scheduling effects but global considerations as well, such as the resource requirements of all configurations, reconfiguration overhead, and the combination of all of the above. We have addressed these issues by using a configuration bundling technique that balances the advantages of dynamic reconfiguration against the added cost of configuration time by maintaining a global view of the resource requirements of all temporal partitions at all times during high-level synthesis.
2 Problem Formulation
The contribution of this paper can be seen in the context of our previous work on a design model [5]. Our approach uses an extended control/data flow graph (ECDFG) as the intermediate representation of a design. The CDFG is extended by abstracting the temporal nature of a system in terms of the sensitization of paths in the data flow. An ECDFG is a behavioral-level model. An ECDFG representation of system behavior consists of three major parts: (1) possible execution paths which are described by the product of the corresponding guard variables, (2) temporal templates which lock several configuration compatible operations into temporal segments of relative schedules, (3) a control and data flow graph (CDFG) describing data-dependency or control-dependency between the operations. Interested readers are referred to the original references for the details about the ECDFG. In high-level synthesis, module allocation is an important task which determines the number and types of RTL components to be used in the design. Since we have encoded the temporal nature of synthesizing such systems by temporal templates [5], the module allocation process may be translated into a two-dimensional placement problem of temporal templates. Instead of considering individual CDFG nodes, we restate the dynamic module allocation problem in terms of temporal templates: a given spatial and temporal placement of configurable logic resources used by some temporal templates for a range of time constraints represents a possible configuration. The module allocation problem for dynamically reconfigurable logic involves not only generating the configuration for each of the temporal templates, but also reducing the reconfiguration overhead incurred. Our problem can be formally defined as follows:
Problem 1. Let F = {F1, F2, ..., Fm} be a set of functional units which can be implemented on reconfigurable logic, and C = {C1, C2, ..., Cn} be a set of possible configurations of the configurable logic units. Given an extended CDFG (ECDFG) G with a set of temporal templates in a given order TT = (TT1, TT2, ..., TTp), where TTi ∈ F, find an optimal sequence of configurations CS = (CS1, CS2, ..., CSq) for temporal template TT, where CSi ∈ C, which minimizes the reconfiguration cost R. R is defined as

    R = Σ_{i=2}^{q} δ_i    (1)

where δ_i is the reconfiguration cost in changing configuration from CS_{i-1} to CS_i. In the remaining sections, we use a new configuration bundling driven technique to address the module allocation problem.
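For illustration, the objective in (1) can be written as a few lines of Python; the cost model used here (the number of functional-unit instances that differ between two configurations) is only an assumption standing in for the real per-technology reconfiguration cost.

def reconfiguration_cost(config_sequence, delta):
    # R = sum of delta(CS[i-1], CS[i]) over consecutive configurations, as in (1).
    return sum(delta(a, b) for a, b in zip(config_sequence, config_sequence[1:]))

def changed_units(cfg_a, cfg_b):
    # Assumed cost model: how many functional-unit instances differ.
    units = set(cfg_a) | set(cfg_b)
    return sum(abs(cfg_a.get(u, 0) - cfg_b.get(u, 0)) for u in units)

if __name__ == '__main__':
    cs = [{'add': 3, 'mul': 1, 'sub': 1},
          {'add': 2, 'mul': 2, 'sub': 2},
          {'add': 1, 'mul': 1, 'sub': 3}]
    print(reconfiguration_cost(cs, changed_units))

The three configurations in the example mirror the per-template allocations used in the motivating example of the next section.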
3 Configuration Bundling
The basic idea is to group logic together properly so that a given configuration can do as much work as possible, allowing a greater portion of the task to be completed between reconfigurations. We illustrate our concept with the help of a motivating example. Consider three temporal templates of an extended CDFG shown in Figure 1. Furthermore, assume that all operations finish in a single cycle and that all temporal templates have to be implemented in three clock cycles. If each temporal template is allocated as a single configuration, the first temporal template (shown in Figure 1(a)) requires a module allocation of five functional units, namely {3 adders, 1 multiplier, 1 subtractor}. Similarly, the second and third temporal templates (shown in Figure 1(b)-(c)) can be implemented with module allocations of {2 adders, 2 multipliers, 2 subtractors} and {1 adder, 1 multiplier, 3 subtractors} respectively.
Fig. 1. A Motivating Example ((a)-(c): data-flow graphs of the three temporal templates)
A straightforward approach to optimize the module allocation of the three temporal templates as a dynamically reconfigurable design involves considering the granularity of the reconfiguration. Resource requirements of the temporal templates can be reduced significantly by maintaining a global view of the resource requirements of all temporal templates at all times during the synthesis process. In fact, the three temporal templates can be implemented using a configuration granularity of two adders, two multipliers and two subtractors. In this research, we have developed a configuration bundling technique to reduce the reconfiguration overhead. The concept of configuration bundling can be defined as follows:
Definition 1. Given an extended CDFG (ECDFG) G with a set of temporal templates TT = {TT1, TT2, ..., TTn}, a configuration bundle is a subset of TT such that the hardware resource requirements of the individual temporal templates in this subset {TTi1, TTi2, ..., TTim} can be implemented by one overall resource allocation scheme.
Configuration bundling is a synthesis technique where n temporal templates are bundled into at most m groups so that each temporal template belongs to at least one bundle and the objective function is optimized. Following configuration bundling, each bundle is synthesized into a separate configuration. The basic idea behind our configuration bundling technique is to attempt to identify and bundle temporal templates with similar computation topology and hardware types into compatible groups, such that these groups may be used to determine the choice of granularity for configurations that optimize the reconfiguration overhead. In particular, the following compatibility issues should be considered during the configuration bundling process.
3.1 Bundling Compatibility of Temporal Templates
If two temporal templates with disparate topologies are implemented in temporally consecutive configurations, the attendant configuration overhead will be significant. In the worst case, each functional unit has to be reconfigured and this increases the time of reconfiguration. Therefore, topological similarity between temporal templates should be considered for bundling into the same group. For example, in Figure 2 Temporal Template 2 can be bundled into a configuration implementing Temporal Template 4 with almost no reconfiguration overhead. In addition, resource compatibility is an important issue during configuration bundling. For example, in Figure 2, while Temporal Templates 2, 3 and 4 use subtractors and multipliers, Temporal Template 1 uses adders. Therefore, bundling Temporal Template 1 with either Temporal Template 2 or 3 or 4 does not yield justifiable benefit for reducing the reconfiguration overhead. On the other hand, based on the compatibility of the functional unit types, Temporal Templates 2, 3 and 4 are good candidates to be bundled into the same group.
Fig. 2. Compatibility of Temporal Templates ((1)-(4): four example templates)
3.2 Measure of Configuration Bundling
Configuration bundling should take into account trade-offs between maximizing static resource requirements and minimizing reconfiguration overhead in space. Therefore, a configuration bundle will have the smallest area and the scope for maximum resource usage if the temporal templates in a bundle are compatible with one another. Based on the above observations we have developed a measure to identify bundling compatibility between temporal templates. We first outline the parameters of the function for bundling below.
- B: set of bundles B1, B2, ..., Bk for a given TT that describes a set of possible configuration bundlings.
- N_Fj(TTi): the number of configurations of functional unit Fj for temporal template TTi.
- Area_Fi: the area of a configuration of a functional unit Fi.
Given a temporal template TTi, the following is an estimate of the area of the temporal template:

    Area_TTi = Σ_{f ∈ F} N_f(TTi) · Area_f    (2)

If a bundle Bi has ni temporal templates, then the area of the bundle is estimated as below.

    Area_Bi = Σ_{f ∈ F} max_{TT ∈ Bi} N_f(TT) · Area_f    (3)

The larger the difference between these areas of temporal templates, the more incompatible the temporal template will be with the remaining temporal templates in the bundle. Given a temporal template TTj for consideration for bundling in Bi, the incompatibility can be obtained as follows and is used to weigh the candidate solutions.

    Δ_{Bi,TTj} = Σ_{f ∈ F} |max_{TT ∈ Bi} N_f(TT) − N_f(TTj)| · Area_f    (4)
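Measures (2)-(4) translate directly into code; the Python sketch below also computes the per-unit maxima that Section 4.1 later uses as the initial allocation. Functional-unit names, areas and the per-template requirement counts in the example are illustrative assumptions only.

def template_area(template, unit_area):
    # Equation (2): area estimate of one temporal template.
    return sum(n * unit_area[f] for f, n in template.items())

def bundle_requirements(bundle):
    # Per-unit maxima over the templates in a bundle (also the initial
    # allocation Nj = max over i of Nij in Section 4.1).
    units = {f for t in bundle for f in t}
    return {f: max(t.get(f, 0) for t in bundle) for f in units}

def bundle_area(bundle, unit_area):
    # Equation (3): area of the shared allocation for a bundle.
    return template_area(bundle_requirements(bundle), unit_area)

def incompatibility(bundle, template, unit_area):
    # Equation (4): weighted mismatch of a candidate template against a bundle.
    req = bundle_requirements(bundle)
    units = set(req) | set(template)
    return sum(abs(req.get(f, 0) - template.get(f, 0)) * unit_area[f] for f in units)

if __name__ == '__main__':
    unit_area = {'add': 1, 'mul': 4, 'sub': 1}   # e.g. adder area 1, multiplier area 4
    tt1 = {'add': 3, 'mul': 1, 'sub': 1}
    tt2 = {'add': 2, 'mul': 2, 'sub': 2}
    tt3 = {'add': 1, 'mul': 1, 'sub': 3}
    print(bundle_area([tt1, tt2, tt3], unit_area))
    print(incompatibility([tt2, tt3], tt1, unit_area))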
4 Configuration Bundling Driven Module Allocation Algorithm
Since there are several temporal templates in a range of time and module allocation, simultaneously considering all the temporal templates and their respective constraints is difficult. We propose to allocate the hardware resources from a range of time by considering one temporal template at a time. In particular, the following three issues must be taken into account:
- the allocated hardware resource due to the previously considered temporal templates
- the estimated hardware resource of the remaining temporal templates
- the hardware resource required by the candidate temporal templates
Here, temporal templates are first bundled randomly. Then, a source configuration bundle is randomly chosen. From such a configuration bundle, an incompatible temporal template is selected and moved to another configuration bundle where the temporal templates are compatible with the selected temporal template. The hardware area of all configurations is then computed, and the current bundling configuration is saved if it is the best so far. The process continues until no more improvement is obtained for a given number of iterations. For each configuration bundle Bi ∈ B, the module allocation algorithm is outlined below. An initial module allocation A_Bi for each configuration bundle Bi is first derived. Starting with the temporal template with the most resource requirements, a feasible module allocation for the entire bundle is obtained. From the total resource allocated to the configuration bundle A_Bi, the module allocation R_TTi for each candidate temporal template TTi ∈ Bi is obtained. Then, allocation and scheduling of the design are carried out using this module allocation technique.
4.1 Initial Module Allocation
Let Nij be the maximum bound on the necessary amount of resource of a certain configuration type Cj of functional unit for the temporal template TTi of a configuration bundle. For each resource type Cj and for each temporal template TTi of a configuration bundle Bi, relaxation-based scheduling techniques are used to derive an estimate of Nij. For a configuration bundle Bi, a global minimum bound of resource requirements Nj is used as the initial allocation for the configurable logic Cj:

    Nj = max_{TTi ∈ Bi} (Nij)    (5)

This is based on the fact that there will be at least one temporal template in the configuration bundle that requires at least this many hardware configurations of type Cj.
4.2 Ordering and Allocating Temporal Templates
Within our methodology, the ordering of temporal templates in the same configuration bundle has an impact on the resource usage and reconfiguration overhead of the resulting resource allocation. A good order for module allocation of temporal templates is important because this order has a pronounced impact on the final resource allocation and the overall performance of the system. The proposed algorithm for ordering temporal templates includes two stages, called clustering and scheduling. The objective of the algorithm is to group temporal templates such that they may subsequently be allocated and scheduled. When considering functional locality in the module allocation process, it is better to schedule and allocate together temporal templates contributing to the same join node in the ECDFG, because this could help in the scheduling and allocation of related temporal templates at higher levels. Therefore, clustering temporal templates is the first step in the temporal template ordering process. The cones partitioning algorithm provides the basis for our clustering stage [8][9]. Once temporal templates are partitioned into clusters, the cluster-based list scheduling and allocation algorithm orders the temporal templates in the same configuration bundle. Our algorithm combines scheduling with module allocation into subsequent configurations for temporal templates in the same configuration bundle, while considering the functional locality of the configuration bundle. There are two main steps in our list scheduling algorithm: the formation of clusters and list scheduling of temporal templates.
5 Experimental Results
In this section we present results to illustrate the effectiveness of the configuration bundling technique. In order to experimentally verify the concept of configuration bundling driven module allocation, we used three popular high-level benchmarks - the elliptical wave filter (EWF), the finite impulse response filter (FIR) and the bandpass filter (BF) - for optimizing the overall resource allocation as well as the reconfiguration overhead. We assume the following configurations for addition and multiplication operations: a look-ahead adder (Area = 1, latency = 1) and a two-stage multiplier (Area = 4, latency = 2). Figure 3 shows the component requirement for the static and configuration bundling driven module allocation.
Fig. 3. Bundling to minimize reconfiguration cost
Bundles             | Static module allocation (Area) | Bundling driven module allocation (Area) | Reduction
{EWF, FIR, BF}      | 24 | 11 | 54.2%
{EWF, FIR}, {BF}    | 24 | 17 | 29.2%
{EWF}, {FIR, BF}    | 24 | 13 | 45.8%
We have also combined our front-end algorithms with the existing DRL scheduling algorithms[2] back end for demonstrating our results. DRL scheduling algorithms do not consider the module allocation problem. We compare results of the combined algorithms with a single DRL approach[2] as shown in Figure 4, where te , np , nf and represent the total data-path execution time, the number of partial and the number of full recon gurations and the graph latency respectively. Benchmarks
Benchmark              Total   Combined approach          DRL
                       area    te   np   nf   λ           te   np   nf   λ
Elliptic wave filter   15      15   25   0    15          17   1    2    17
                       10      15   25   0    15          17   4    2    17
                       6       15   24   1    15          17   2    8    17
FIR filter             15      16   12   0    16          17   5    0    17
                       10      18   13   1    17          18   9    0    18
                       6       20   17   0    20          24   19   0    24
Bandpass filter        15      19   3    0    19          18   8    0    18
                       10      26   9    0    26          21   16   0    21
                       6       28   12   0    28          37   1    9    19
Fig. 4. Synthesis result and comparison

The results in Fig. 4 show that the use of the combined algorithm leads to a faster execution time compared with a single DRL scheduling implementation, with considerably smaller area. When DRL scheduling is used alone, more control steps result, but when scheduling is performed together with our module allocation, partial reconfigurations frequently occur instead of full reconfigurations. This is expected, as our algorithm aims at producing a short reconfiguration time by maintaining a global view of the resource requirements of all temporal templates at all times during the synthesis process.
6 Conclusions and Acknowledgments We have presented a new module allocation technique in this paper. It is based on a configuration bundling heuristic that tries to allocate configurable logic resources by maintaining a global view of the resource requirements of all temporal templates. The most important value of the configuration bundling driven module allocation technique is that it enables trade-offs between the granularity of the configuration and the reconfiguration overhead during the high-level synthesis process. The work described in this paper was partially supported by two grants: the Research Grant Council of the Hong Kong Special Administrative Region (RGC Research Grant Direct Allocation - Project ID: 2050196), and a Yunnan Province Young Scholar Grant.
References 1. P. Lysaght and J. Dunlop: Dynamic reconfiguration of FPGAs, More FPGAs, UK: Abingdon EE and CS Books (1994), pp. 82-94, 1994. 2. M. Vasilko and D. Ait-Boudaoud: Architectural Synthesis Techniques for Dynamically Reconfigurable Logic, Field-Programmable Logic, Lecture Notes in Computer Science 1142, pp. 290-296. 3. J. Spillane and H. Owen: Temporal Partitioning for Partially Reconfigurable Field Programmable Gate Arrays, Proceedings of the Reconfigurable Architectures Workshop (RAW'98), 1998. 4. M. Kaul and R. Vemuri: Optimal Temporal Partitioning and Synthesis for Reconfigurable Architectures, Proceedings of Design and Test in Europe (DATE'98), 1998. 5. Kam-wing Ng, Xue-jie Zhang, and Gilbert H. Young: Design Representation for Dynamically Reconfigurable Systems, Proceedings of the 5th Annual Australasian Conference on Parallel And Real-Time Systems (PART'98), pp. 14-23, Adelaide, Australia, September 1998. 6. Scott Hauck and Anant Agarwal: Software Technologies for Reconfigurable Systems, Northwestern University, Dept. of ECE Technical Report, 1996. 7. Ivan Radivojevic and Forrest Brewer: A New Symbolic Technique for Control-Dependent Scheduling, IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 15, no. 1, pp. 45-56, Jan. 1996. 8. D. Brasen, J.P. Hiol and G. Saucier: Finding Best Cones From Random Clusters for FPGA Package Partitioning, IFIP International Conference on VLSI, pp. 799-804, Aug. 1995. 9. Sriram Govindarajan and Ranga Vemuri: Cone-Based Clustering Heuristic for List-Scheduling Algorithms. Proceedings of the European Design and Test Conference, Paris, France, March 1997.
Augmenting Modern Superscalar Architectures with Configurable Extended Instructions Xianfeng Zhou and Margaret Martonosi Dept. of Electrical Engineering Princeton University {xzhou, martonosi}@ee.princeton.edu Abstract. The instruction sets of general-purpose microprocessors are designed to offer good performance across a wide range of programs. The size and complexity of the instruction sets, however, are limited by a need for generality and for streamlined implementation. The particular needs of one application are balanced against the needs of the full range of applications considered. For this reason, one can “design” a better instruction set when considering only a single application than when considering a general collection of applications. Configurable hardware gives us the opportunity to explore this option. This paper examines the potential for automatically identifying application-specific extended instructions and implementing them in programmable functional units based on configurable hardware. Adding fine-grained reconfigurable hardware to the datapath of an out-of-order issue superscalar processor allows 4-44% speedups on the MediaBench benchmarks [1]. As a key contribution of our work, we present a selective algorithm for choosing extended instructions to minimize reconfiguration costs within loops. Our selective algorithm constrains instruction choices so that significant speedups are achieved with as few as 4 moderately sized programmable functional units, typically containing less than 150 look-up tables each.
1 Introduction General-purpose instruction sets are intended to implement basic processing functions while balancing the needs of many applications. Complex instructions that might accelerate one application are often unused by several other applications. Worse, their implementation difficulties may impact all programs by degrading clock rates or using up vital chip area. Configurable hardware allows one to implement complex operations on an as-needed basis, one application at a time. In recent years, configurable computing based on Field-Programmable Gate Arrays (FPGAs) has been the focus of increasing research attention. The circuit being implemented can be changed simply by loading in a new set of configuration bits. Various architectures for FPGA-based computing have been proposed, ranging from co-processor boards accessed via the I/O bus, to relatively fine-grained structures accessed as an integral part of the CPU’s data path. The approach we explore here is closest to the latter architecture. We envision programmable functional units (PFUs) with 150 CLBs or less which are built into the datapath of a modern superscalar processor, and which can access the register file and result bus just like other functional units in the machine. Customized complex or extended instructions have several advantages over traditional instruction sets. First, customization allows one to match the flow of
values within an extended instruction to the needs of the particular operation being performed. Second, one can customize the bitwidth of calculations to tightly match the needs of the particular application. Third, one can improve instruction-level parallelism (ILP) by amortizing the per-instruction cost of fetching, issuing, and committing over more work. While these advantages are compelling, customized extended instructions cannot be applied universally. First, since the PFU is part of the datapath, increasing the number of inputs to a PFU also increases the number of register file ports needed by the processor. This increases machine complexity and may impact the cycle time. Second, reconfiguring a PFU for a particular extended instruction requires fetching configuration bits and sending them to the PFU. This reconfiguration latency warrants care in choosing to implement operations as extended PFU instructions. With this in mind, we devised and modeled T1000, an out-of-order issue, superscalar processor with programmable functional units. Initial performance studies with a simple instruction selection algorithm show 4-44% speedups for the MediaBench suite [1] when ignoring the reconfiguration penalties. To improve speedups under more realistic assumptions, we developed a selective approach for determining which extended instructions to implement and when to use them. The key difference of our work from previous work is that we check many possibilities for converting an instruction sequence into valid extended instructions. The extended instructions chosen by our selective algorithm can typically fit in a PFU composed of fewer than 150 look-up tables.
Fig. 5. Flow chart of the selective algorithm
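As a companion to the flow chart, here is a small sketch of the final gain-based selection step, under the assumption that candidate sequences are summarized by their occurrence count and per-occurrence cycle gain (names are illustrative, not taken from the paper's implementation):

# Sketch of the gain-based selection: pick the #PFU instruction sequences
# with the largest total gain (occurrences x cycles saved per occurrence)
# inside a loop body.
def select_sequences(candidates, num_pfus):
    # candidates: {sequence_name: (occurrences, gain_per_occurrence)}
    total_gain = {s: occ * g for s, (occ, g) in candidates.items()}
    return sorted(total_gain, key=total_gain.get, reverse=True)[:num_pfus]

candidates = {"I": (1, 2), "J": (3, 1)}   # the example discussed in the text below
print(select_sequences(candidates, 1))    # ['J']  (total gain 3 beats 2)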
The final appropriate extended instructions are then selected from the list by comparing their potential gains. For example, sequence J appears a total of 3 times in the loop, each with a potential gain of 1 cycle. By contrast, sequence I appears only once, but with a potential savings of 2 cycles. If we are working with an architecture with only one PFU, then selecting the sequence with the highest total gain across the loop would lead us to choose sequence J.

5.2 Performance Improvements Using the Selective Algorithm Figure 6 shows that the selective algorithm successfully chooses extended instructions that offer speedup by avoiding reconfiguration penalties as much as
possible. Speedups for these benchmarks now range from 2-27%. Since our approach reduces dramatically the number of PFU reconfigurations, the reconfiguration penalties only account for a small fraction of total potential gains. In fact, our experiments show that we retain our excellent speedups even with reconfiguration times as high as 500 cycles.

Fig. 6. Speedups achieved using the selective algorithm (execution time speedup for the superscalar baseline and for T1000 with 2, 4, and unlimited PFUs over the benchmarks unepic, epic, gsm_dec, gsm_enc, g721_dec, g721_enc, mpeg2_dec, and mpeg2_enc). For each benchmark, the second and third bars correspond to T1000 with 2 and 4 PFUs, respectively. The fourth bar models unlimited PFUs. A 10-cycle reconfiguration cost is assumed in all cases.
Our selective algorithm also adjusts itself well to the number of PFUs available. Overall, we find that four PFUs are typically enough to achieve almost the same performance improvement as the optimistic speed-ups presented in Section 4. Figure 6 illustrates the results with 4 PFUs and compares them to the previous optimistic results achieved with unlimited number of PFUs.
6 Configurable Hardware Cost The basic component of the PFU is a configurable logic block consisting of lookup tables (LUTs) and flip-flops. An N-input look-up table can implement any Boolean function of N inputs. The LUT propagation delay is independent of the function implemented. In this paper, we use standard CAD tools to map extended instructions to Xilinx devices in order to estimate the PFU hardware cost. Figure 7 presents the area distribution of instructions chosen by our selective algorithm for the 8 benchmarks. The configurable hardware resources required by an extended instruction depend both on the type of operation and also on the operand widths. Quite a few of the extended instructions need very little hardware, largely due
to profiling that indicates when they can be implemented with narrow-bitwidth inputs. On these examples, the most area-intensive extended instruction needs 105 LUTs.

Fig. 7. Distribution of hardware requirements (number of extended instructions per LUT-count bin, in bins of 10 LUTs up to 110) for the extended instructions extracted from 8 MediaBench benchmarks by our selective algorithm
7 Prior Work There has been a large amount of work on the reconfigurable computing architectures with customizable instruction sets, and an exhaustive summary is difficult. Instead, we present some representative work categorized by the degree of coupling between the configurable hardware resources and the base processor. Coarse-grained architectures include SPLASH1, SPLASH2 [8] and PAM [9]. In these, the configurable hardware resources are connected as a co-processor on the I/O bus of a standard microprocessor. While appropriate for coarse-grained problems, the disadvantage of these board-based systems is that they have high communication latencies and configurable hardware cost. Medium-grained architectures include NAPA [10]. In NAPA, the Adaptive Logic Processor (ALP) can access the same memory space as the Fixed Instruction Processor (FIP), so the communication overhead between the ALP and the FIP is reduced compared with the coarse-grained architectures, but this approach still does not give the ALP full access to the register file. Fine-grained architectures include the PRISC work [2,4]. PRISC was proposed to be a simple, pipelined, single-issue processor augmented with a single PFU. Because of the tight coupling between the PFU and the base CPU, PRISC requires only a small amount of configurable hardware resources and minimizes communication costs. Other representatives of this class include CoMPARE [11] and OneChip [12], etc. CoMPARE explores the impact of multiple PFUs and can execute RISC instructions and customized instructions concurrently. OneChip is an embedded system. It requires more functional modules to be implemented on PFU, which in turn introduces larger communication penalties. All of the above fine-grained architectures were evaluated on simple, in-order-issue, single-issue processors. The impact of PFUs on a superscalar processor’s performance is different from that on a simple processor, and our work has quantified these differences.
8 Conclusions and Future Work This work has explored the use of application-specific instructions in the context of modern superscalar architectures. In particular, we have proposed the T1000
architecture which adds programmable functional units (PFUs) into the datapath of a wide, out-of-order issue processor. These small configurable functional units based on FPGA-like technology have the potential to greatly improve performance. Our initial optimistic studies showed up to 44% performance improvements in some cases. A key issue in using a small number of PFUs effectively is devising a selection algorithm that is both aggressive enough to uncover speedup opportunities, and yet also conservative enough to avoid cases where PFUs “thrash” as they frequently reconfigure back and forth to handle many selected configurable instructions. With the goal of avoiding PFU thrashing, we developed and evaluated a selective algorithm for choosing instruction sequences for configurable implementation. Our choice is guided by the number of PFUs available and simple execution profiles of the program loops. This allows us to aggressively select configurable instructions that offer the largest performance savings with the smallest hardware needs. With this algorithm, we have shown performance improvements of up to 28% with 2 PFUs compared to simple superscalar processors without PFUs. Furthermore, our selective algorithm is so successful at avoiding PFU thrashing that these speedups are largely independent of the PFU’s reconfiguration overhead. We view our work as a proof-of-concept demonstration that PFUs can offer worthwhile performance improvements in modern high-performance superscalar architectures.
References 1. C. Lee, M. Potkonjak, and W. H. Mangione-Smith, MediaBench: A Tool for Evaluating Multimedia and Communications Systems. Proc. Micro 30, 1997 2. R. Razdan, and M.D. Smith: A High-Performance Microarchitecture with HardwareProgrammable Functional Units. Proc. 27th Intl. Symp. On Micro, pp. 172-180, Nov., 1994. 3. Xilinx Inc. The Programmable Logic Data Book, Xilinx 2100 Logic Dr. San Jose, CA 1998 4. R. Razdan, K. Brace, and M. Smith. PRISC Software Acceleration Techniques. Proc.Int. Conf. on Computer Design. Oct.1994. 5. D. Burger, T.M. Austin, and S. Bennett. Evaluating future microprocessors: The SimpleScalar tool set. TR-1308, Univ. of Wisconsin-Madison CS Dept., July 1996 6. G. S. Sohi. Instruction Issue Logic for High-Performance, Interruptible, Multiple Functional Unit, Pipelined Computers. IEEE Trans. on Computers, 39(3): 349-359, March 1990 7. Xilinx Inc. Foundation Series Quick Start Guide 1.5, Xilinx 2100 Logic Drive. San Jose, CA 8. J. Arnold et al. The Splash 2 Processor and Applications. Proc. Int. Conf. on Computer Design, Oct 1993 9. P. Bertin, D. Roncin, and J. Vuillemin. Introduction to Programmable Active Memories. Systolic Array Processors, J. McCanny et al. Eds., Prentice Hall, 1989 10. C. R. Rupp and M. Landguth et al. The NAPA Adaptive Processing Architecture. Proceedings IEEE Symp. on FPGAs for Custom Computing Machines. Napa Valley, CA, USA 15-17, April 1998 11. S. Sawitzki, A. Gratz and R.G. Spallek: Increasing Microprocessor Performance with Tightly-Coupled Reconfigurable Logic Array, Proc. of Field-Programmable Logic and Applications, Tallinn, Estonia, August 1998 12. R. D. Wittig and P. Chow: OneChip: An FPGA Processor With Reconfigurable Logic, Proc. IEEE Symp. on FPGAs for Custom Computing Machines, CA, April 1996
Complexity Bounds for Lookup Table Implementation of Factored Forms in FPGA Technology Mapping Wenyi Feng(1), Fred J. Meyer(2), and Fabrizio Lombardi(2)
(1) FPGA Software Core Group, Lucent Technologies, 1247 S Cedar Crest Blvd, Allentown PA 18103 (2) Electrical & Computer Engineering, Northeastern University, 360 Huntington Avenue, Boston MA 02115
Abstract. We consider technology mapping from factored form (binary leaf-DAG) to lookup tables (LUTs), such as those found in field programmable gate arrays. Polynomial time algorithms exist for (in the worst case) optimal mapping of a single-output function. The worst case occurs when the leaf-DAG is a tree. Previous results gave a tight upper bound on the number of LUTs required for LUTs with up to 5 inputs (and a bound with 6 inputs). The bounds are a function of the number of literals and the LUT size. We extend these results to tight bounds for LUTs with an arbitrary number of inputs.
1 Introduction We view computer-aided synthesis of a logic circuit in two major steps: (1) the optimization of a technology-independent logic representation, using Boolean and/or algebraic techniques, and (2) technology mapping. Logic optimization is used to transform a logic description such that the resultant structure has a lower cost than the original [1]. Technology mapping is the task of transforming an arbitrary multiple-level logic representation into an interconnection of logic elements from a given library of elements. Technology mapping is crucial in the synthesis of semicustom circuits for different technologies, such as sea-of-gates, gate arrays, or standard cells. The quality of the synthesized circuit, both in terms of area and performance, depends heavily on this step. We focus on the problem of technology mapping onto Field-Programmable Gate Arrays (FPGAs). FPGAs are prewired circuits that are programmed by the users to perform the desired functions [13]. In particular, we consider FPGAs where the logic functions are implemented with lookup tables (LUTs). In a LUT-based FPGA, the basic block is a K-input, single-output LUT (K-LUT) that can implement any Boolean function of up to K variables. The technology mapping problem for LUT-based FPGAs is to generate a mapping of a set of Boolean functions onto K-LUTs. Traditional library binding algorithms for
standard cells and Mask-Programmable Gate Arrays (MPGAs) are not applicable to FPGAs because the virtual library of a LUT is too large to enumerate (a K-LUT can realize 2^(2^K) logic functions). Many papers have proposed algorithms for LUT-based technology mapping. They can be divided into 3 categories: (1) minimization of the number of levels of LUTs in the mapped network [5]; (2) minimization of the number of LUTs used in the mapped network [3, 10, 7, 6]; (3) routability of the mapping solution [2, 11]; or combinations of these topics [4, 3]. Minimizing the number of levels is solvable in polynomial time in Flow-Map [5]. The key feature in Flow-Map is to compute a minimal height K-feasible cut in the input network. Minimization of the number of LUTs is a much harder problem. It was shown to be NP-hard even for restricted cases [6]. So, heuristics are used in all mapping systems. In this paper, we restrict our attention to mapping a single-output function onto LUT technology. We specify the input function with a graph, where each node represents a function of 2 inputs. We constrain the problem so that the synthesis must be conducted without being aware of (taking advantage of) the underlying function at each 2-input node.
2 Preliminaries
Definition 1. A leaf-DAG is a general case of a tree: the leaves of the tree (primary inputs) are allowed to fan out. If node i is one of the inputs to node j in a leaf-DAG, we say that i is a child of j and that j is a parent of i.
In mapping, we will not take any special advantage of leaf-DAGs; instead, we will regard the inputs to the various nodes in the DAG as coming from distinct primary inputs, i.e., we will not take advantage of any knowledge of fan-out at the primary inputs. This yields bounds that are applicable in any case and, in particular, in the worst case of a tree.
p(v): Apart from the leaves, each node, v, in a leaf-DAG has a unique parent, p(v).
l(S): The number of literals of the input function, S. This is the sum of the number of inputs to all nodes of the input graph. We simply use l, instead of l(S), whenever S is understood.
Definition 2. The size or complexity, C(S), of a circuit, S, is the number of gates (number of nodes in its DAG). The circuit complexity of a function, f, with respect to a basis is the minimal number of gates from that basis needed to compute f.
K: The LUTs in the technology to be mapped onto have K inputs. We call them K-LUTs. A K-LUT implements the basis BK.
LK(f): The number of LUTs needed to map function f to K-LUTs. We use L(f), LK, and L whenever f and/or K are understood.
CK(l): This is the circuit complexity for leaf-DAGs mapped onto K-LUTs. It is the worst case, over all functions represented by leaf-DAGs with l literals, of the minimal number of K-LUTs required to implement the function.
Definition 3. A factored form of a one-output function is a generalized sum-of-products form allowing nested parentheses and arbitrary binary operations.
A factored form is represented by a binary leaf-DAG (all gates are in B2). For example, the function ab'c' + a'bc' + d can be represented in a factored form with 7 literals as (((ab')c') + ((a'b)c')) + d, and it can be written more compactly in factored form as ((a ⊕ b)c') + d with 4 literals. When all l literals of a factored form are different, its corresponding binary leaf-DAG is a binary tree. The binary tree has l inputs and l − 1 internal nodes. Figure 1 shows a binary tree with l = 7 inputs and l − 1 = 6 internal nodes. If all inputs of a binary leaf-DAG, D, are different, we have a binary tree, B. So, a realization of B would also serve as a realization of D. Perhaps some other realization of D requires fewer LUTs, using some structural information of D.
Lemma 4. Suppose a binary tree, B, is obtained from a binary leaf-DAG, D, by viewing all of D's inputs as different. Then LK(D) ≤ LK(B).
This lemma tells us that, in order to analyze the worst case complexity of binary leaf-DAG mapping, it is enough to analyze binary trees. In [9], some results are provided on the complexity bound of a function, f, given in a factored form. The results are summarized in the following theorem.
Theorem 5. For the class of functions with l literals in factored form,
C2(l) = l − 1 (l ≥ 2)
C3(l) = ⌊(2l − 1)/3⌋ (l ≥ 2)
C4(l) = ⌊(l − 1)/2⌋ (l ≥ 3)
C5(l) = ⌊(2l − 1)/5⌋ (l ≥ 4)
C6(l) ≤ ⌊(l − 1)/3⌋ (l ≥ 6)
Reference [6] presented an optimal algorithm, Tree-Map, for technology mapping where the input is a tree. Tree-Map uses a greedy dynamic programming approach, which happens to guarantee an optimal mapping. Our approach to determining a tight bound for CK (l) for all l is to analyze a technology mapping algorithm that is optimal on trees. We use the Tree-Map algorithm [6], because it is easiest to analyze.
Definition 6. For a tree, T(V, E), its height is the number of nodes on the longest path from an input to the root. The level of the root is the height of the tree. The level of a node (excluding the root) is the level of its parent minus 1.
Definition 7. Consider a tree, T(V, E), with vertex (node) set, V, and edge set, E. Let V1 be a subset of V such that a LUT is assigned to precisely those vertices in V1. Two quantities are defined for each vertex v ∈ V. These quantities are its dependency, d(v), and its contribution, Z(v), defined according to:
– Contribution, Z(v):
For each primary input (or literal), v, Z(v) = 1. For each v ∈ V1, Z(v) = 1. For all other vertices v ∈ V, Z(v) = Z(u1) + ... + Z(uc(v)), where v has c(v) children u1, ..., uc(v).
– Dependency, d(v): For each primary input (or literal), v, d(v) = 1. For all other vertices v ∈ V, d(v) = Z(u1) + ... + Z(uc(v)), where v has c(v) children u1, ..., uc(v).
Definition 8. In a mapping, if a node is assigned a LUT, we say it is a LUT
From Def. 7, we know that, for a free node, its contribution is equal to its dependency, but for a LUT node its contribution is set to 1. Note that d(v) is the summation of the number of inputs or LUTs that directly or indirectly supply input to vertex v, and it represents the number of signals that would need to be placed at v if the signal at v were implemented with a LUT. The quantity Z (v), on the other hand, represents the contribution of vertex v to the dependency of its parent vertex. Figure 1 shows an example of a tree and the assignment of LUTs to its vertices. The shaded vertices in the gure represent the LUT nodes. The dependency and contribution values for each node, v, in the tree are shown with an ordered pair, (d(v); Z (v)). (3,1)
Notation (d,Z) represents the dependency and contribution values for a vertex in the tree.
Fig. 1. The dependency and contribution values for a tree.
The Tree-Map algorithm scans from leaves to the root, assigning LUTs as necessary. Whenever it encounters a node with dependency exceeding K, it must
assign LUTs. It assigns LUTs to that node's children, starting with the child with the largest contribution, until the node's dependency has been sufficiently reduced. This greedy mapping is optimal with respect to the number of LUTs [6]. Our objective is to derive a tight bound for general K-LUT technology mapping. We use Tree-Map [6] as a unified optimal mapping algorithm. Although the dynamic programming algorithm in [7] is also optimal for tree mapping, it is hard to work from it to derive bounds on the circuit complexity. Tree-Map takes a DAG input. In this paper, we constrain it to be a leaf-DAG, i.e., each internal node has fanout 1, while the primary inputs may fan out arbitrarily. This is a generalization of trees [6] through allowing the primary inputs to fan out. Generally, the leaf-DAGs will be in factored form, because that is the worst case in terms of LUT complexity. We do not assume that we know the individual functions used in the formula. For example, if an output of an AND gate goes to another AND gate, we do not allow any inputs to be rearranged between the two gates. In short, the output of the technology mapping must be valid, even if arbitrary functions are substituted for each of the input leaf-DAG's gates.
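The greedy idea can be sketched as follows for a binary tree; this is an illustration based on the dependency/contribution definitions above and on an assumed tree encoding, not the published pseudocode:

# A minimal sketch of the greedy Tree-Map idea on a binary tree.
# A node is ("leaf",) or ("op", left, right); the function returns Z(node)
# and appends every subtree that gets a LUT to the list luts.
def tree_map(node, K, luts):
    if node[0] == "leaf":
        return 1
    kids = [node[1], node[2]]
    z = [tree_map(k, K, luts) for k in kids]
    d = sum(z)                                   # dependency of this node
    while d > K:                                 # too many signals: cut below
        i = z.index(max(z))                      # child with largest contribution
        luts.append(kids[i])                     # assign a LUT to that child
        d -= z[i] - 1                            # its contribution collapses to 1
        z[i] = 1
    return d                                     # free node: Z = d

leaf = ("leaf",)
tree = ("op", ("op", leaf, ("op", leaf, leaf)), ("op", leaf, ("op", leaf, leaf)))
luts = []
tree_map(tree, 4, luts)
luts.append(tree)                                # the root always gets a LUT
print(len(luts), "LUTs for K=4")                 # 2 LUTs

On this 6-literal example with K = 4 the sketch reports 2 LUTs, consistent with C4(6) = ⌊(6 − 1)/2⌋ = 2 from Theorem 5.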
3 Worst Case Mapping to K-LUTs
Tree-Map proceeds from level to level in the tree. When we deal with vertices at level j, all vertices below level j have dependencies less than or equal to K. Tree-Map has processed all nodes at lower levels. Whenever any of them had dependency more than K, it assigned sufficient LUTs to reduce the dependency to at most K.
Lemma 9. If K is even, d(vi) ≥ K/2 + 1, for 1 ≤ i ≤ L − 1. (1)
Proof: i ranges up to L − 1, so it includes all the LUTs, except the one assigned to the root. According to the Tree-Map algorithm, a node i (except the root) is assigned a LUT only because its parent has dependency larger than K before the assignment. Furthermore, it is selected to be assigned a LUT because its dependency is at least as large as that of the (only) other child of its parent. So, its dependency must be at least K/2 + 1.
Lemma 10. If K is odd, d(vi) ≥ (K + 1)/2, for 1 ≤ i ≤ L − 1. (2)
Proof: Similar to the proof for Lemma 9.
Lemma 11. Suppose K is odd, and vi (i ≠ L) is a node with d(vi) = (K + 1)/2. Suppose vj is the first LUT node on the path from vi to the root. Then:
d(vj) ≥ (K + 3)/2 (3)
and we say vj is the pair node of vi.
Proof. Omitted for brevity.
Lemma 12. Suppose K is odd, and vi and vj (i, j ≠ L) both have dependency (K + 1)/2. Then their pair nodes are two different nodes.
Proof. Contrariwise, suppose v1 were the pair node of both vi and vj. According to the proof of Lemma 11, v1 must resolve 2 · (K + 3)/2 dependency, which cannot be true.
Lemma 13. Suppose L > 1 and vL = r. Suppose vi is a LUT node nearest to
r (if there are multiple such nodes, select any one). So, on the path from vi to r, no other LUT node exists. Then:
d(r) + d(vi) ≥ K + 2 (4)
and we say that vi is the pair node of r. Proof. Omitted for brevity.
Now we are able to prove our key theorem.
Theorem 14.
CK(l) = ⌊(2l − 2)/K⌋, if K is even
CK(l) = ⌊(2l − 1)/K⌋, if K is odd
Proof. For brevity, we omit the half of the proof that the bound is always achievable. We only give the half of the proof that the bound is tight. To show tightness, we need to show some trees that meet the upper bound. We consider two cases.
(1) K is even. Figure 2 shows an example. Node va is the root of a binary tree with K/2 + 1 inputs; each of the nodes vb, vd, vf, ... is the root of a binary tree with K/2 inputs. The shaded nodes show the nodes to which a K-LUT should be assigned according to the Tree-Map algorithm. For example, when node vc is visited, the dependency d(vc) is K + 1, and we put a K-LUT at node va, and so on. The value K/2 + 1 beside node va represents the amount of dependency resolved at node va, i.e., d(va). The value K/2 beside node vb represents d(vb). Suppose the number of LUTs in the figure is L. The total number of tree inputs is
l = L(K/2) + 1 (5)
So, the tree needs the upper bound number of K-LUTs. Therefore, the bound is tight in this case.
(2) K is odd. We show two subcases in Fig. 3. In the first (second) case, according to the Tree-Map algorithm, an odd (even) number of LUTs is needed. Suppose the number of K-LUTs needed is L. In the first subcase (L is odd), the number of inputs is:
l = (LK + 1)/2 (6)
Fig. 2. Proof of tightness when K is even.
It meets the upper bound. (In this case, ⌊(2l − 1)/K⌋ is 1 more than ⌊(2l − 2)/K⌋, and we need ⌊(2l − 1)/K⌋ K-LUTs.) Therefore, this also shows that, for each l that makes (2l − 1)/K an (odd) integer, there exists a binary tree that needs (2l − 1)/K K-LUTs. In the second subcase (L is even), the number of inputs is:
l = L(K/2) + 1 (7)
It also meets the upper bound. (In this case, ⌊(2l − 1)/K⌋ is equal to ⌊(2l − 2)/K⌋.)
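As a quick numerical sanity check (not part of the original paper), the general formula of Theorem 14 can be compared against the small-K cases listed in Theorem 5:

# Check that Theorem 14 reproduces the K = 2..5 formulas of Theorem 5
# (and the K = 6 expression, which Theorem 5 states only as an upper bound).
def C(K, l):
    return (2 * l - 2) // K if K % 2 == 0 else (2 * l - 1) // K

for l in range(6, 200):
    assert C(2, l) == l - 1
    assert C(3, l) == (2 * l - 1) // 3
    assert C(4, l) == (l - 1) // 2
    assert C(5, l) == (2 * l - 1) // 5
    assert C(6, l) == (l - 1) // 3
print("Theorem 14 matches the K <= 5 formulas (and the K = 6 expression).")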
4 Conclusion Arbitrary functions can be mapped onto FPGAs that use lookup tables (LUTs). If the input function is in the form of a tree or leaf-DAG [9], a greedy algorithm can process the input in polynomial time. In the case of a tree, the greedy algorithm minimizes the number of LUTs, subject to the constraint that the algorithm is not allowed to exploit any knowledge of the particular functions represented by the nodes in the input graph. In the case of a leaf-DAG, the number of LUTs needed is bounded by that required for an equivalent tree representation using unique literals. We differentiate between LUTs by the number of inputs they handle, K. We considered leaf-DAGs where all nodes are 2-input functions. This is the worst case in terms of how many K-LUTs are required. Previous work [9] had obtained bounds on the worst case number of K-LUTs for K up to 6 (tight bounds up to 5). We extended this to tight bounds for all K.
Fig. 3. Proof of tightness when K is odd (one subcase with an odd number of LUTs, one with an even number of LUTs).
References 1. Brayton, R. K., Rudell, R., Sangiovanni-Vincentelli, A.: MIS: A multiple-level logic optimization system. IEEE Trans. CAD of Int. Circ. and Sys. 6 (1987) 1062-1081 2. Chan, P. K., Zien, J. Y., Schlag, M.: On routability prediction of FPGAs. IEEE/ACM Des. Auto. Conf. (1993) 326-330 3. Chaudhary, K., Pedram, M.: A near optimal technology mapping minimizing area under delay constraints. IEEE/ACM Des. Auto. Conf. (1992) 492-498 4. Cong, J., Ding, Y.: On area/depth trade-off in LUT-based FPGA technology mapping. IEEE/ACM Des. Auto. Conf. (1993) 213-218 5. Cong, J., Ding, Y.: FlowMap: An optimal technology mapping algorithm for delay optimization in look-up table based FPGA designs. IEEE Trans. CAD of Int. Circ. and Sys. 13 (1994) 1-12 6. Farrahi, A. H., Sarrafzadeh, M.: Complexity of the look-up table minimization problem for FPGA technology mapping. IEEE Trans. CAD of Int. Circ. and Sys. 13 (1994) 1319-1332 7. Francis, R. J., Rose, J., Chung, K.: Chortle: A technology mapping algorithm for lookup table based FPGAs. IEEE/ACM Des. Auto. Conf. (1990) 613-619 8. Francis, R. J., Rose, J., Vranesic, Z.: Chortle-crf: Fast technology mapping for lookup table based FPGAs. IEEE/ACM Des. Auto. Conf. (1991) 227-233 9. Murgai, R., Brayton, R. K., Sangiovanni-Vincentelli, A.: Logic Synthesis for FPGAs. Kluwer Academic Publishers (1995) 10. Murgai, R., Nishizaki, Y., Shenoy, N., Brayton, R. K., Sangiovanni-Vincentelli, A.: Logic synthesis algorithms for programmable gate arrays. IEEE/ACM Des. Auto. Conf. (1990) 620-625 11. Schlag, M., Kong, J., Chan, P. K.: Routability driven technology mapping for look-up table FPGAs. IEEE Int. Conf. Comp. Des. (1992) 89-90 12. Wegener, I.: The Complexity of Boolean Functions. Wiley-Teubner (1987) 13. Xilinx Corporation: Xilinx FPGA Data Book. (1996)
Optimization of Motion Estimator for Run-Time-Reconfiguration Implementation Camel Tanougast, Yves Berviller, Serge Weber Laboratoire d’Instrumentation Electronique de Nancy - Université Henri Poincaré Nancy I Faculté des Sciences, BP 239 F-54506 Vandoeuvre-lès-Nancy cedex, France {tanougast, yves.berviller, serge.weber}@lien.u-nancy.fr
Abstract. In this paper, we present a method to estimate the number of reconfiguration steps that a time-constrained algorithm can accommodate. This analysis demonstrates how one would attack the problem of partitioning a particular algorithm into pieces for run-time reconfiguration execution on an Atmel AT40K FPGA. Our method consists in evaluating the execution time of the algorithm's operators from its data flow graph. From this, we deduce the number of reconfigurations and the algorithm partitioning for an RTR implementation. The algorithm used in this work is a qualitative motion estimator in the Log-Polar plane.
1. Introduction. The availability of FPGAs which provide fast and partial reconfiguration possibilities opens the way to dynamically reconfigurable architectures [1]. This new approach enables the successive execution of a sequence of algorithms on the same device [2]. This article proposes an evaluation method for the determination of the number of successive reconfigurations which can be made for a given algorithm. This evaluation is obtained from the data flow graph in order to optimize its implementation on a run-time reconfigurable architecture. This architecture uses Atmel's AT40K FPGAs, which have short configuration times. The evaluation of this number gives us the partitioning of the data flow graph. The aim of this paper is the optimization of hardware resources while satisfying the real-time processing constraint. Performance figures such as the processing time and the resource usage rate of the FPGA are described. The algorithm is an apparent motion estimator in a Log-Polar image sequence, which estimates the normal optical flow. Firstly we describe the algorithm. Secondly, we present the method for the determination of the number of steps for a Run-Time-Reconfiguration (RTR) implementation. Thirdly we give the results compared with a static implementation. Finally we conclude on the contribution of this approach.
2. Qualitative motion estimation in the Log-Polar space. The Log-Polar images are obtained by remapping the Cartesian coordinate images with a Complex Logarithm Mapping [3]. The advantage of this transformation is that the radial and axial motion in the original space becomes mainly horizontal in the new space. Our solution estimates the horizontal displacements of moving objects' edges. The method uses the optical flow constraint (OFC) (1) of moving points in an image sequence.
V · grad I = −∂I/∂t    (1)

V is the apparent velocity vector of an image point and I the intensity of this point. From this Optical Flow Constraint we estimate the normal optical flow by dividing the temporal derivative by the spatial gradient:
Vn = −(∂I/∂t)/(∂I/∂x)    (2)

Vn is an estimate of the normal optical flow in Log-Polar images. Before this computation, two pre-processing steps are necessary. The first is a Gaussian filtering, in order to guarantee the existence of the spatial derivative of the image intensity I(x, y). The second is a time-averaging filter, to reduce the noise. Our apparent motion estimator algorithm in the Log-Polar plane is thus composed of Gaussian and averaging filters, followed by temporal and spatial derivatives and an arithmetic divider. The datapath of this algorithm is given in figure 1.
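For illustration only, a compact NumPy/SciPy sketch of equation (2) on already log-polar frames is given below; the kernel size, the guard against small gradients, and the function names are assumptions, not the authors' implementation:

# Sketch of the estimator of equation (2): Gaussian smoothing, temporal
# averaging, temporal and horizontal derivatives, and a guarded division.
import numpy as np
from scipy.ndimage import gaussian_filter

def normal_flow(frames, sigma=1.0, eps=1e-3):
    sm = [gaussian_filter(f.astype(float), sigma) for f in frames]   # spatial smoothing
    avg = [(a + b) / 2.0 for a, b in zip(sm[:-1], sm[1:])]           # temporal averaging
    it = avg[1] - avg[0]                                             # dI/dt
    ix = np.gradient(avg[0], axis=1)                                 # dI/dx (horizontal)
    ix_safe = np.where(np.abs(ix) < eps, eps, ix)                    # avoid division by ~0
    return np.where(np.abs(ix) < eps, 0.0, -it / ix_safe)

frames = [np.random.rand(64, 64) for _ in range(3)]                  # three dummy frames
print(normal_flow(frames).shape)                                     # (64, 64)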
3. Determination of the possible number of steps for RTR implementation.
3.1. Evaluation of the possible number of steps. The images are acquired at a rate of 25 images per second, which leaves us 40 ms to process the entire image. To satisfy the real-time constraint we need to process at a faster rate than that of pixel acquisition. The algorithm is partitioned into N steps corresponding to N execution-reconfiguration pairs. The working frequency of each step needs to verify the following inequality:
n² × Σ(j=1..N) te_j ≤ Ti − Σ(j=1..N) Trec_j    (3)

where n² is the number of pixels in the image, N the number of reconfigurations, Ti is the duration of an image (40 ms), te_j is the elementary processing time of a pixel in the jth step and Trec_j is the reconfiguration time of the jth step.
The objective is to make an implementation which requires minimal logical resources and satisfies the real-time constraint. From equation (3) we obtain the minimal number of steps that we can surely implement:
N ≥ Nmin = Ti / (n² × K × to_max + krec × Cmax)    (4)
to_max is the maximum execution time of an operator of the data flow graph (without routing), K is a coefficient which takes into account the routing delay between operators, krec is a proportionality constant between the configuration time and the number of logic cells used, and Cmax is the total number of available logic cells. This evaluation is obtained with the maximal configuration time and the execution time of the slowest operator of each step. Our method is based on the analysis of the data flow graph of the algorithm in order to deduce the values of these parameters. The determination of Nmin gives us the number of partitions of the data flow graph, which corresponds to the number of reconfiguration steps.
3.2. Modelling and parameters determination. The AT40K technology enables partial reconfiguration. Each configuration time depends on the quantity of logic cells used for each step [4]. We evaluate the configuration time of the jth step by:
Trec_j = krec × Cj    (5)
where
Cj is the number of cells of the jth step. In our case, the AT40K20's capacity of
819 cells leads to a total reconfiguration time lower than 0.6 ms at 33 MHz with 8 bits of configuration data [5]. We obtain for krec a value of 733 ns/cell. The maximum execution time of an operator depends on the speed grade of the device and the data size to process (number of bits). The following equation gives this time for a cascaded operator:
to_max = Djmax × (Tc + Tr) + Tsetup    (6)
where Djmax is the maximum data size to process, Tc is the logical function path delay, Tr is the propagation delay between logical functions and Tsetup is the setup time. We evaluate these values to Tc = 1.7 ns, Tr = 0.17 ns and Tsetup = 1.5 ns [5].
The maximum working frequency depends on the slowest operator and the routing delays between operators. We determined experimentally that K is constant for a given occupation rate. This coefficient has a value of 1.5 in our application. The study of the cell's structure enables the evaluation of the cell usage for each operator. An n-bit adder or subtractor, latched or not, requires n cells. The same cell count applies for an n-bit multiplexer or register. This allows the evaluation of the logical resources needed for each step of the application from its data flow graph.
4. Results. From the data flow graph (see figure 1), we obtain the size and type of the different operators used (adder, multiplier, multiplexer...). So, in accordance with the technology used, we deduce the slowest operator execution time. With the AT40K, adders are the slowest operators of our datapath if we consider operators of identical size (number of bits). In our application, the slowest operator is a 15-bit latched adder. Then, equation (6) gives us a value of to_max of 29.55 ns. From equation (4) and the parameter determination, we estimate the minimal number of reconfiguration-execution steps as Nmin = 3.27 for our implementation. This result is obtained with an image size of 512 by 512 pixels. We deduce the data in the following table for an RTR-optimized implementation with a constant resource usage rate.

Total estimated number of Cells   Mean Cells / step   Reconfiguration time / step (ms)   te_max (ns)
690                               212                 0.16                               44.3
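The estimate can be reproduced with a few lines (a numerical check of equations (6) and (4) using the values stated above; not the authors' tool):

# Reproducing the estimate of equations (6) and (4) with the stated values.
Tc, Tr, Tsetup = 1.7e-9, 0.17e-9, 1.5e-9     # AT40K delays [5]
Dj_max = 15                                  # slowest operator: 15-bit latched adder
to_max = Dj_max * (Tc + Tr) + Tsetup         # equation (6): 29.55 ns
K = 1.5                                      # routing coefficient
k_rec, C_max = 733e-9, 819                   # s/cell and total cells of the AT40K20
Ti, n = 40e-3, 512                           # frame period and image side
N_min = Ti / (n ** 2 * K * to_max + k_rec * C_max)   # equation (4)
print(round(to_max * 1e9, 2), "ns", round(N_min, 2))  # 29.55 ns 3.27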
The value Nmin is calculated by considering that each step requires a full device configuration and is executed at the slowest working frequency. In fact, after implementation we obtain reconfiguration and execution times less than or equal to the evaluated times. That is why four reconfiguration-executions are possible instead of the theoretical value of 3.27. The partitioning of the data flow graph into four steps is made in the following way:
- first step: Gaussian filter
- second step: averaging filter and temporal and spatial derivatives
- third step: first half of the divider
- fourth step: second half of the divider
Fig. 1. Data flow graph of the motion estimator.
The divider has been split into two parts in order to homogenize the number of resources for each step. The following table shows the results obtained with our implementation.

Operators                  Number of Cells   Reconfiguration time / step (ms)   te_j (ns)
Gaussian filter            106               0.08                               27.1
Average and derivatives    103               0.08                               26.5
Divider 1                  354               0.26                               38.7
Divider 2                  336               0.25                               37.8
We notice that dynamic execution with four steps can be achieved in real time. This is in agreement with our estimation. Indeed, we verify that the maximal execution time (38.7 ns) is less than the evaluated time (44.3 ns). Moreover, we obtain a global reconfiguration time of 0.67 ms. This value is much lower than Nmin multiplied by the full device configuration time (1.96 ms). However, an implementation partitioned into five steps leads to a timing constraint too harsh for real-time operation. Indeed, in our case only 5.22 ms of processing time remain for a supplementary step. If we consider a configuration time of 0.26 ms (same number of cells as for the divider), we obtain a value of te_j lower than 19 ns. This is incompatible with our application. The maximal number of cells per step allows us to determine the functional density gain factor obtained by the RTR implementation [6], [7], [8]. In our example, the gain factor in terms of functional capacity is approximately 2.
5. Conclusion and future work. We have proposed a method to evaluate the minimum number of reconfiguration-executions (Nmin). This value depends on the resource usage rate (K) for a given algorithm. From the analysis of the data flow graph, we deduce the resource requirements and speed of the various operators. This leads to the determination of the total processing time, from which we deduce the optimized partitioning of the data flow graph for an RTR implementation.
We illustrate our method with an apparent motion estimation algorithm on log-polar images. The results obtained are in accordance with our estimation. The differences between our estimation and the experimental results are mainly due to the variations of K (which depends on routing and on the actual resource occupation rate). The performances obtained are compatible with the requirements of real-time processing. A partitioning which does not rely on the algorithm's functions enables an implementation that is very homogeneous in terms of resources used by each step. This would further enhance the functional capacity.
References. 1. D. Demigny, M. Paindavoine, S. Weber: Architecture Reconfigurable Dynamiquement pour le Traitement Temps Réel des Images. Revue Technique et Sciences de l'Information, Numéro Spécial Programmation des Architectures Reconfigurables. (1998). 2. H. Guermoud, Y. Berviller, E. Tisserand, S. Weber: Architecture à base de FPGA reconfigurable dynamiquement dédiée au traitement d'image sur flot de données. 16° colloque GRETSI. (1997). 3. M. Tistarelli, G. Sandini: On the advantage of polar and log-polar mapping for direct estimation of time to impact from optical flow. IEEE Transactions on PAMI, vol. 15. (1993). 401-410. 4. ATMEL IDS AT40K User's Guide. 5. Atmel. AT40K FPGA Data Sheet. 6. M. J. Wirthlin, B. L. Hutchings: Improving functional density through run-time constant propagation. FCCM97 (1997). 7. H. Guermoud: Architecture reconfigurable dynamiquement dédiée aux traitements en temps réel des signaux vidéo. Thèse de l'Université Henri Poincaré, Nancy 1. (1997). 8. J. G. Eldredge, B. L. Hutchings: Density enhancement of a neural network using FPGAs and run-time reconfiguration. FCCM94 (1994).
Constant-Time Hough Transform On A 3D Reconfigurable Mesh Using Fewer Processors Yi Pan Department of Computer Science University of Dayton, Dayton, OH 45469-2160
Abstract. The Hough transform has many applications in image pro-
cessing and computer vision, including line detection, shape recognition and range alignment for moving imaging objects. Many constant-time algorithms for computing the Hough transform have been proposed on reconfigurable meshes [1, 5, 6, 7, 9, 10]. Among them, the ones described in [1, 10] are the most efficient. For a problem with an N × N image and an n × n parameter space, the algorithm in [1] runs in a constant time on a 3D nN × N × N reconfigurable mesh, and the algorithm in [10] runs in a constant time on a 3D n² × N × N reconfigurable mesh. In this paper, a more efficient Hough transform algorithm on a 3D reconfigurable mesh is proposed. For the same problem, our algorithm runs in constant time on a 3D n log² N × N × N reconfigurable mesh.
1 Introduction The Hough transform of binary images is an important problem in image processing and computer vision and has many applications such as line detection, shape recognition and range alignment for moving imaging objects. It is a special case of the Radon transform which deals with gray-level images. The Radon transform of a gray-level image is a set of projections of the image taken from different angles. Specifically, the image is integrated along line contours defined by the equation:
{(x, y) : x cos(θ) + y sin(θ) = ρ}    (1)

where θ is the angle of the line with respect to the positive x-axis and ρ is the (signed) distance of the line from the origin. The computation of the Radon and Hough transforms on a sequential computer can be described as follows. We use an n × n array to store the counts, which are initialized to zero. For each of the black pixels in an N × N image and for each of the n values of θ, the value of ρ is computed based on (1) and the sum corresponding to the particular (θ, ρ) is accumulated as given in the following algorithm. In the algorithm, ρ_res is the resolution along the ρ direction; and gray-value(x, y) is the intensity of the pixel at location (x, y).

for each black pixel at location (x, y) in an image do
  for θ = θ_0, θ_1, ..., θ_{n-1} do
    begin
      (* parameter computation *)
      ρ := (x cos θ + y sin θ)/ρ_res
      (* accumulation *)
      sum[θ, ρ] := sum[θ, ρ] + gray-value(x, y)
    end;

Obviously, for an N × N image and n values of θ, a sequential computer calculates the Radon (Hough) transform in O(nN²) time since the number of black pixels is O(N²). The computation time is too long for many applications, especially for real-time applications, as N and n can be very large. Recently, several constant-time algorithms for computing the Hough transform have been proposed for the reconfigurable mesh model [1, 5, 6, 7, 9, 10]. Among them, the ones described in [1, 10] are the most efficient. For a problem with an N × N image and an n × n parameter space, the algorithm in [1] runs in a constant time on a 3D nN × N × N reconfigurable mesh, and the algorithm in [10] runs in a constant time on a 3D n² × N × N reconfigurable mesh. Besides computing the Hough transform, the algorithm in [10] can also compute the Radon transform in a constant time using the same number of processors. In this paper, a more efficient Hough transform algorithm for binary images on a 3D reconfigurable mesh is proposed. For the same problem, our algorithm runs in constant time on a 3D n log² N × N × N reconfigurable mesh. We also show that the algorithm can be adapted to computing the Radon transform of gray-level images in constant time on a 3D n log³ N × N × N reconfigurable mesh. Clearly, our algorithm uses the fewest number of processors to achieve the same objectives and is the most efficient one compared to existing results in the literature [1, 5, 6, 7, 9, 10].
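A direct, runnable transcription of the sequential accumulation above (for the binary Hough case) might look as follows; the θ sampling over [0, π) and the handling of out-of-range ρ values are assumptions made for illustration:

# Sequential Hough accumulation for a binary image: every black pixel adds 1
# to the (theta, rho) bin given by equation (1).
import math

def hough_binary(image, n, rho_res=1.0):
    N = len(image)
    thetas = [t * math.pi / n for t in range(n)]          # assumed theta sampling
    acc = [[0] * n for _ in range(n)]                     # sum[theta, rho]
    for y in range(N):
        for x in range(N):
            if image[y][x]:                               # black pixel
                for ti, th in enumerate(thetas):
                    rho = int((x * math.cos(th) + y * math.sin(th)) / rho_res)
                    if 0 <= rho < n:                      # simplistic range handling
                        acc[ti][rho] += 1
    return acc

img = [[1 if x == y else 0 for x in range(8)] for y in range(8)]   # a diagonal line
acc = hough_binary(img, n=8)
print(max(max(row) for row in acc))                       # the line shows up as a strong bin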
2 The Computational Model A reconfigurable mesh consists of a bus in the shape of a mesh which connects a set of processors, but which can be split dynamically by local switches at each processor. By setting these switches, the processors partition the bus into a number of subbuses through which the processors can then communicate. Thus the communication pattern between processors is flexible, and moreover, it can be adjusted during the execution of an algorithm. The reconfigurable mesh has begun to receive a great deal of attention as both a practical machine to build, and a good theoretical model of parallel computation. A 2D reconfigurable mesh consists of an N1 × N2 array of processors which are connected to a grid-shaped reconfigurable bus system. Each processor can perform arithmetic and logical operations and is identified by a unique index (i, j), 0 ≤ i < N1, 0 ≤ j < N2. The processor with index (i, j) is denoted by PE(i, j). Each processor can communicate with other processors by broadcasting values on the bus system. We assume that the bus width is O(log N) and each broadcast takes O(1) time. The arithmetic operations in the processors are
performed on O(log N) bit words. Hence, each processor can perform one logical and arithmetic operation on O(1) words in unit time. A high dimensional reconfigurable mesh can be defined similarly. For example, a processor in a 3D N1 × N2 × N3 reconfigurable mesh is identified by a unique index (i, j, k), 0 ≤ i < N1, 0 ≤ j < N2, 0 ≤ k < N3. The processor with index (i, j, k) is denoted by PE(i, j, k). Within each processor, 6 ports are built, with two ports for each of the three directions: i-direction, j-direction, and k-direction. In each direction, a single bus or several subbuses can be established. A subarray is denoted by replacing certain indices by *'s. For example, the ith row of processors in a 2D reconfigurable mesh is represented by ARR(i, *). Similarly, ARR(*, j, k), 0 ≤ j < N2, 0 ≤ k < N3, is a 1-dimensional subarray in a 3D reconfigurable mesh, and these subarrays (one for each (j, k)) can execute algorithms independently and concurrently. Finally, a memory location L in PE(i, j, k) is denoted as L(i, j, k).
3 The Constant-Time Algorithm In this section, we propose a constant time algorithm for computing the Hough transform of an N × N image on a 3D n log² N × N × N reconfigurable mesh. In the following discussion, we partition the image into parallel bands; these bands run at an angle of θ with respect to the horizontal axis, and we then sum the pixel values contained in each band. If a pixel is contained in two or more bands, then it will be counted in only the band that contains its center. If the center of a pixel lies on the boundary between two bands, then it is counted only in the uppermost of the two bands. For example, we have computed a π/4 angle Hough transform for an 8 × 8 pixel array in Figure 1, where the bands are one pixel-width wide. Clearly, there are 10 different ρ's in the figure. In the figure, the number of 1-pixels contained in each band is displayed at the upper-right end of the band. For a particular angle θ and a particular distance ρ, only the values of the pixels lying in the band specified by θ and ρ need to be added together. In our algorithm, since all pixels in an image are used as the input, we can easily exploit the geometric features and relations of pixels in an image. Clearly, for a given pair of θ and ρ, we do not need to consider all the pixels in an image. Instead, only those pixels that are centered in the band will contribute to the count value of that band. In this way, we can improve the efficiency of the algorithm during computation. Before we describe the algorithms, several observations will be made. In order to speed up the computation, we need to connect together all processors which have computed and stored the same ρ values. In order to do so, we rely on several results obtained in [8]. Although the results are made for θ such that 0 ≤ θ ≤ π/4, they can easily be generalized to other θ values. In the following discussion, we assume that 0 ≤ θ ≤ π/4.
Lemma 1. For any j, 0 ≤ j ≤ N − 1, the ρ-distances satisfy ρ_{i,j} ≤ ρ_{i+1,j} for 0 ≤ i ≤ N − 2. It can also be shown that no more than two consecutive values of ρ in row j can be equal.
0 1 5
id th
2
pi xe
l-w
5 1
3 2
4 0
0
0
0
0
0
0
0
0
1
1
1
0
0
0
0
0
1
1
1
0
1
1
0
0
1
1
1
0
1
1
0
0
0
0
0
0
1
1
0
0
0
0
0
0
1
1
0
1
1
0
0
0
1
1
0
1
1
0
0
0
1
1
0
2 1
1.414 pixel-width
1 pixel-width
Fig. 1. Parallel bands for θ = π/4 in an 8 × 8 image.
Lemma 2. The values of ρ computed using equation (1) by two consecutive processors in a row j differ by at most 1. More formally, for all i, j, 0 ≤ j ≤ N − 1 and 0 ≤ i ≤ N − 2, 0 ≤ ρ_{i+1,j} − ρ_{i,j} ≤ 1.
Lemma 3. For all values of i, j, 0 ≤ i, j ≤ N − 2, ρ_{i,j} ≠ ρ_{i+1,j+1}.
Lemma 4. If ρ_{i,j} = ρ_{i,j+2} for 0 ≤ i ≤ N − 1 and 0 ≤ j ≤ N − 3, then ρ_{i,j} = ρ_{i,j+1} = ρ_{i,j+2}. If two ρ-values in a column i are equal and they are placed two rows apart, then the ρ-value in between should have the same value.
The above lemmas will be used in our algorithm to connect related processors together to calculate the number of black pixels in the bands. The following result is also used in our algorithm to compute binary sums efficiently and is due to [?].
Lemma 5. Let a binary sequence of length S be stored in the first row of a 2D S × log² S reconfigurable mesh; the sum of the binary sequence can be computed in a constant time on the array.
For the Radon transform, we need the following result to add integer values. The detailed proof of the lemma is described in [11].
Lemma 6. Given S (log S)-bit integers, these numbers can be added in O(1) time on a 2D S × log³ S reconfigurable mesh.
Assume that the reconfigurable mesh used here is configured as a 3D n log² N × N × N array. The 3D mesh is divided into n layers along the i direction, with each layer having a 3D log² N × N × N array as shown in Figure 2. Each layer
;j
i;j
i;j
i;j
i;j
i;j
i;j
i;j
i
;j
is responsible for computing the Hough transform for a particular projection (corresponding to a θ value). Now, we formally describe the algorithm.

Fig. 2. The 3D mesh is divided into n layers along the i direction.

Input: An N × N image and an n × n parameter space, and a constant θ_res which is the resolution along the θ direction. Assume that each pixel value a(x, y) is stored in processor PE(0, x, y), for 0 ≤ x, y < N, and that θ_res is known to all processors initially. Denote ARR(0, ∗, ∗) as the base submesh. It is clear that the initial image is stored in the base submesh. The algorithm consists of the following steps.
Step 1. In this step, we copy the whole image from the base submesh to all the other submeshes ARR(i, ∗, ∗). All processors PE(0, j, k), 0 ≤ j < N, 0 ≤ k < N, broadcast the image pixels a(j, k) concurrently through their subbuses in direction i, such that each processor PE(i, j, k), 0 ≤ i < n log₂N, receives a pixel from PE(0, j, k). At the end of step 1, all processors in subarray ARR(∗, j, k), where 0 ≤ j < N, 0 ≤ k < N, contain the pixel value a(j, k) at location (j, k) in the original image. Since only local switch settings and broadcast operations are involved in this step, the time used is O(1).

Step 2. As mentioned before, the whole 3D mesh is divided into n layers, with each layer having a 3D log₂N × N × N submesh. Each layer is responsible for computing the Hough transform for a particular projection. Thus, the top log₂N 2D submeshes ARR(i, ∗, ∗), 0 ≤ i < log₂N, are assigned to computing the Hough transform for θ₀. Similarly, the next log₂N 2D submeshes ARR(i, ∗, ∗), log₂N ≤ i < 2 log₂N, are in charge of computing for θ₁, and so on. Thus, each processor can calculate its local θ value easily based on its local index i, since it initially knows the resolutions of θ and ρ. This requires O(1) time.

Step 3. In this step, all processors compute their local ρ values independently and in parallel. Here, layer t uses θ_t for 0 ≤ t < n as shown in Figure 2; i.e., submesh ARR(i, ∗, ∗) uses θ_{⌊i/log₂N⌋}, for 0 ≤ i < n log₂N, to calculate its ρ values. In other words, PE(i, j, k) computes ρ_{j,k} = j cos θ_{⌊i/log₂N⌋} + k sin θ_{⌊i/log₂N⌋}. This step involves only local computations, and hence takes O(1) time.
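To make Steps 2-3 concrete, the following sketch (Python, written for this note and not taken from the paper) shows the purely local computation that each processor PE(i, j, k) would perform; the function name local_theta_rho and the parameter theta_res are illustrative only.

import math

def local_theta_rho(i, j, k, n, N, theta_res):
    """Local work of PE(i, j, k) in Steps 2-3 (illustrative sketch).

    The mesh has n * log2(N) layers of 2D submeshes; every block of
    log2(N) consecutive submeshes handles one projection angle theta_t.
    """
    log2N = int(math.log2(N))
    t = i // log2N              # which projection this submesh belongs to
    assert 0 <= t < n
    theta = t * theta_res       # theta_t, known from the resolution
    # Band (normal distance) of pixel (j, k) for this angle, rounded to
    # the one-pixel-wide bands of Figure 1.
    rho = round(j * math.cos(theta) + k * math.sin(theta))
    return theta, rho

# Example: a processor of layer 0 for an 8x8 image with theta_res = pi/4.
print(local_theta_rho(i=0, j=3, k=5, n=8, N=8, theta_res=math.pi / 4))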
Fig. 3. Possible connections of a processor to neighboring processors for 0 ≤ θ_k ≤ π/4.
Step 4. All processors which have computed the same value of the normal distance ρ in the same layer (for a particular angle) can be connected in a 2D submesh. The idea is to count the number of black pixels in the same band (same ρ value) for a particular θ value. Since all layers perform a similar job, in the following discussion we concentrate on layer 0. This operation requires only local communications and some setup of local switches. More specifically, the possible cases are depicted in Figure 3. The following connection schemes are based on Lemmas 1-4. If PE(i, j, k) computes some value for the normal distance associated with pixel (j, k), and the same value is obtained for pixel (j, k − 1) and/or (j, k + 1), then PE(i, j, k) should be connected to PE(i, j, k − 1) and/or PE(i, j, k + 1), as depicted in Figure 3(a). When two adjacent processors in a row have the same value, the connection can be made as shown in Figure 3(b). In case processors PE(i, j, k + 1) and PE(i, j + 1, k) have to be connected, a third intermediate processor PE(i, j, k) is used as depicted in Figure 3(c). Using the above rule, a processor in submesh ARR(i, ∗, ∗) is connected to at most two buses at a time and no two distinct buses are connected to the same port of a processor in the same submesh. Figure 4 shows the switch and bus configuration for an 11 × 11 mesh for θ = π/6. Since all processors in the same layer have the same θ value, the mesh configuration is the same for all 2D submeshes ARR(i, ∗, ∗) in the same layer. Thus, the log₂N 2D submeshes in layer k will have the same configuration as the one depicted in Figure 5. In effect, many 2D vertical submeshes are established. In Figure 5, we show a vertical submesh formed in a layer after the above configurations, with vertical buses configured along the i direction. In fact, many vertical submeshes exist in the same layer (not shown in the figure). Of course, submeshes in different layers have different shapes. In this step, processors only exchange information with neighboring processors, and then decide on their switch settings. It is obvious that this step also takes constant time.
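The combined effect of Steps 4-5 is to accumulate, per band, the number of 1-pixels. A small sequential sketch of this grouping-and-summing for one angle θ (Python; the function name and the dictionary keyed by ρ are illustrative, not the paper's notation):

import math
from collections import defaultdict

def band_counts(image, theta):
    """Count 1-pixels per band (one-pixel-wide bands at angle theta).

    image is a list of N rows of N values in {0, 1}. Returns a dict mapping
    each band index rho to the number of 1-pixels it contains, i.e. the
    Hough transform values for this projection.
    """
    counts = defaultdict(int)
    N = len(image)
    for j in range(N):
        for k in range(N):
            # Same local rho computation as in Step 3; pixels sharing a rho
            # value are exactly the processors joined by one bus in Step 4.
            rho = round(j * math.cos(theta) + k * math.sin(theta))
            counts[rho] += image[j][k]
    return dict(counts)

# Tiny usage example on a 4x4 image, theta = pi/4.
img = [[0, 1, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 1, 1],
       [0, 0, 1, 0]]
print(band_counts(img, math.pi / 4))

On the reconfigurable mesh the inner double loop disappears: Step 4 wires together all processors with equal ρ, and Step 5 replaces the additions by the constant-time binary sum of Lemma 5.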
Fig. 4. Switch and bus configuration for θ_k = π/6.
Fig. 5. A 2D vertical submesh is established in a layer after bus reconfiguration in step 4.
Step 5. Accumulate all the pixel values in a band using the corresponding submesh established in the last step, in parallel. Notice that each submesh has a size of log₂N × S, where S is not fixed and depends on its position. As shown in Figure 4, many subbuses of different lengths are formed and hence their S values are different. However, S is less than √2·N and is equal to the number of pixels contained in the band. For S binary values, we can use Lemma 5 to add these binary values in O(1) time on a 2D log₂N × S mesh. Since all submeshes satisfy the above condition, and they can perform the accumulation concurrently, this step uses O(1) time.

Step 6. Each submesh elects a leader and the leader stores the local count from the last step. Notice that this step is necessary since not all boundary processors are the last processors in the reconfigured submesh, as indicated in Figure 4. Only those processors with a "*" are leaders. The leaders can be elected easily: each processor simply checks its neighbors and decides whether it should become a leader or not. Clearly, this also takes O(1) time. The final results are stored in the leaders distributed among different submeshes.

Since each step uses O(1) time, the total time used in the algorithm is O(1). To summarize the above discussion, we have:

Theorem 1. For an N × N binary image and an n × n parameter space, the
Hough transform can be computed in constant time on a 3D n log₂N × N × N reconfigurable mesh.

Our result clearly improves the Hough transform algorithms in [1, 10], where a 3D nN × N × N reconfigurable mesh and a 3D n² × N × N reconfigurable mesh are used, respectively, to achieve constant time.
References

1. K.-L. Chung and H.-Y. Lin, "Hough transform on reconfigurable meshes," Computer Vision and Image Understanding, vol. 61, no. 2, 1995, pp. 278-284.
2. P. V. C. Hough, "Methods and means to recognize complex patterns," U.S. Patent 3069654, 1962.
3. H.A.H. Ibrahim, J.R. Kender, and D.E. Shaw, "The analysis and performance of two middle-level vision tasks on a fine-grained SIMD tree machine," Proc. IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, pp. 387-393, June 1985.
4. J. F. Jeng and S. Sahni, "Reconfigurable mesh algorithms for the Hough transform," International Conference on Parallel Processing, vol. III, pp. 34-41, Aug. 12-16, 1991.
5. T.-W. Kao, S.-J. Horng, and Y.-L. Wang, "An O(1) time algorithm for computing histogram and Hough transform on a cross-bridge reconfigurable array of processors," IEEE Transactions on Systems, Man and Cybernetics, Vol. 25, No. 4, April 1995, pp. 681-687.
6. S.S. Lin, "Constant-time Hough transform on the processor arrays with reconfigurable bus systems," Computing, vol. 52, pp. 1-15, 1994.
7. M. Merry and J. W. Baker, "Constant time algorithm for computing Hough transform on a reconfigurable mesh," Image and Vision Computing, Vol. 14, pp. 35-37, 1996.
8. S. Olariu, J. L. Schwing, and J. Zhang, "Computing the Hough transform on reconfigurable meshes," Image and Vision Computing, vol. 11, no. 10, pp. 623-628, Dec. 1993.
9. Y. Pan, "A More Efficient Constant Time Algorithm for Computing the Hough Transform," Parallel Processing Letters, vol. 4, no. 1/2, pp. 45-52, 1994.
10. Y. Pan, K. Li, and M. Hamdi, "An improved constant time algorithm for computing the Radon and Hough transforms on a reconfigurable mesh," IEEE Transactions on Systems, Man, and Cybernetics (Part A), Vol. 29, No. 4, July 1999, pp. 417-421. (A preliminary version also appeared in Proceedings of the 8th International Conference on Parallel and Distributed Computing and Systems, 1996, pp. 82-86.)
11. H. Park, H. J. Kim, and V. K. Prasanna, "An O(1) time optimal algorithm for multiplying matrices on reconfigurable mesh," Information Processing Letters, Vol. 47, August 1993, pp. 109-113.
Fifth International Workshop on Formal Methods for Parallel Programming: Theory and Applications FMPPTA 2000
Program and Organizing Chair’s Message It is our pleasure to welcome you to the Fifth International Workshop on Formal Methods for Parallel Programming: Theory and Applications, FMPPTA’2000. This message pays tribute to the many people who have contributed their time and effort in organizing this meeting and reviewing papers. We are thankful to the IPDPS’2000 committee for accepting the organization of the workshop in cooperation with IPDPS’2000, and especially Viktor K. Prasanna, Mani Chandy and Jose Rolim. We also would like to thank the authors of all submitted papers, the presenters of accepted papers, the session chairs, the invited speakers and the program committee members. We hope that every participant will enjoy the workshop. Beverly Sanders, University of Florida, and Dominique M´ery, Universit´e Henri Poincar´e Nancy I January 2000
Foreword The program of FMPPTA 2000 remains focused on the applications of formal methods, particularly for problems involving parallelism and distribution. Seven papers, four contributed and three invited, will be presented, most illustrating the use of techniques that are based on formal concepts and supported by tools. In addition, the workshop will include two tutorials to show how formal techniques can be useful and effective for developing realistic parallel and distributed solutions, for example in telecommunications applications where guaranteeing safety properties, in particular, seems to require the use of formal techniques. In the first contributed paper, Turner, Argul-Marin, and Laing present the ANISEED method for specifying and analyzing timing characteristics of hardware designs using SDL. Digital hardware is treated as a collection of interacting parallel components. SDL provides a way to validate and to verify digital hardware components. Timing constraints can be studied through SDL specifications.
Non-functional requirements are very important aspects of practical systems. The paper by Rosa, Justo and Cunha presents an approach in which transactional and other non-functional requirements are formally incorporated into a special class of software architectures, namely dynamic software architectures. The ZCL framework based on the Z notation is a formal framework which formally incorporates elements of the CL model, a configuration model. Refinement is a process for developing solutions that satisfy the initial formal specification. In the paper by Filali et al., refinement is used to develop and validate a termination detection algorithm. The use of UNITY as the development formalism is made easier by the use of PVS, a proof assistant. This work presents a non-trivial case study illustrating the use of a formal method together with mechanized support. Branco et al. describe their tool Draco-PUC, which automatically generates an implementation in Java for a distributed system described using their formal description technique MONDEL. This approach allows systems to be designed and analyzed at a higher level of abstraction than the implementation language. The invited presentations will be given by Ganesh Gopalkrishnan, Jean Goubault-Larrecq and Michael Mislove. They will address foundations and applications of formal methods. Ganesh Gopalkrishnan will present verification methods for weak shared memory consistency models; Jean Goubault-Larrecq will address the automatic verification of cryptographic protocols and Michael Mislove will describe the problems encountered in building a semantic model that supports both nondeterministic choice and probabilistic choice. Two tutorials are summarized by the two abstracts included in the proceedings of the workshop. These are The Design of Distributed Programs Using the B-Method by Dominique Cansell, Dominique M´ery and Christophe Tabacznyj, and A Foundation for Composing Concurrent Objects by Jean-Paul Bahsoun. We hope that you will enjoy the talks and papers. Beverly Sanders, University of Florida, and Dominique M´ery, Universit´e Henri Poincar´e Nancy I January 2000
Programme Committee Flemming Andersen, Tele Danmark R&D, Denmark Mani Chandy, Caltech, USA Michel Charpentier, University of New Hampshire, USA Radhia Cousot, LIX-CNRS, Ecole Polytechnique, France Mamoun Filali, IRIT, CNRS, Toulouse, France Pascal Gribomont, Institut MONTEFIORE, Universit´e de LIEGE, Belgium Dominique M´ery, Universit´e Henri Poincar´e & IUF, LORIA, France (CoChair) Lawrence Paulson, Computer Laboratory, Cambridge University, UK
Xu Qiwen, International Institute for Software Technology, United Nations University, Macau Joy Reed, Oxford Brookes University, UK Catalin Roman, Department of Computer Science, Washington University, USA Beverly Sanders, Department of Computer & Information Science & Engineering, University of Florida (CoChair), USA Ambuj Singh, Department of Computer Science, University of California at Santa Barbara, USA David Skillicorn, Department of Computing and Information Science, Queen’s University Kingston Canada
A Method for Automatic Cryptographic Protocol Verification (Extended Abstract)

Jean Goubault-Larrecq, G.I.E. Dyade & Projet Coq, Inria, France ([email protected])
Abstract. We present an automatic, terminating method for verifying confidentiality properties, and to a lesser extent freshness properties of cryptographic protocols. It is based on a safe abstract interpretation of cryptographic protocols using a specific extension of tree automata, _-parameterized tree automata, which mix automata-theoretic techniques with deductive features. Contrary to most model-checking approaches, this method offers actual security guarantees. It owes much to D. Bolignano's ways of modeling cryptographic protocols and to D. Monniaux' seminal idea of using tree automata to verify cryptographic protocols by abstract interpretation. It extends the latter by adding new deductive abilities, and by offering the possibility of analyzing protocols in the presence of parallel multi-session principals, following some ideas by M. Debbabi, M. Mejri, N. Tawbi, and I. Yahmadi.
1 Introduction It is now well-known that secure cryptographic algorithms (see e.g., [17]) do not suffice in providing system-wide security guarantees, and that one has to be careful in designing cryptographic protocols, namely sequences of exchanges of messages purporting to achieve the communication of some piece of data, keeping it confidential or ensuring some level of authentication, to name a few properties of interest [6]. Successful attacks against cryptographic protocols are usually silly, in the sense that they are purely logical and do not exploit any weakness in the underlying cryptographic algorithms (e.g., encryption); they are nonetheless difficult to spot. To avoid logical faults, several methods have been designed, based on modal logics of beliefs ([6] and successors), on complexity theory [3] (for specific protocols), on process-algebraic techniques [2], on type disciplines [1], on model-checking [12, 13], or on deductive techniques [14, 4, 16]. While model-checking techniques are fully automated and have been used to find attacks, they cannot directly give actual security guarantees—although reductions to finite-state cases manage to do so in well-behaved cases [18]. On the other hand, the deductive techniques have been designed to give security guarantees, but mechanization is in general partial, as fully automated proof search in general does not terminate. In any case, abstract interpretation (see [8]) can help prepare the grounds for each style of verification. In fact, abstract interpretation alone suffices to verify protocols, as D. Monniaux shows [15], using tree automata to model the set of messages that intruders may build. F. Klay and T. Genet [10] also propose to use tree automata, this time to model the whole protocol itself. Each of the latter two approaches has advantages and disadvantages, but they are automatic, terminate and aim indeed at giving security guarantees, contrary to standard model-checking tools.
Our goal is to present yet another automated technique for guaranteeing the absence of logical faults in cryptographic protocols, which uses tree automata as well. Our contribution is twofold. First, instead of using standard tree automata, we use a refinement (_PTAs) allowing us to mix enumerative techniques (automata) with deductive techniques (BDDs [5]). The latter will notably help us in modeling freshness and initial states of intruder knowledge. Our _PTAs will also be much smaller than standard tree automata, improving the efficiency of verification markedly. Second, we extend the simulation of protocol runs to the case of parallel multi-session principals, e.g., key servers, an important case of unbounded parallelism, using ideas from [9]. For space reasons, this paper is only an overview. Moreover, we concentrate on secrecy because it is so fundamental; authentication can be dealt with by simple extensions of the framework presented here, following [10] for example. We describe _PTAs in Section 2, and use them to represent and compute states of knowledge in Section 3. We report on practical experience with these techniques in Section 4, showing its practical value, and shedding light on its strengths and weaknesses. We conclude in Section 5.
2 Terms, Formulae, _-Parameterized Tree Automata

Let T be a set of so-called types τ. Let F be a set of so-called function symbols. A first-order signature Σ over F is a map from F to the set of expressions of the form τ₁ × ... × τₙ → τ, where n ∈ ℕ and τ₁, ..., τₙ, τ are types. Let X_τ, for each type τ, be pairwise disjoint non-empty sets, disjoint from F, and let X be (X_τ)_{τ∈T}. The set T_τ(Σ, X) of terms of type τ is the smallest set containing X_τ and such that for each f ∈ F = dom Σ, if Σ(f) = τ₁ × ... × τₙ → τ and t₁ ∈ T_{τ₁}(Σ, X), ..., tₙ ∈ T_{τₙ}(Σ, X), then f(t₁, ..., tₙ) is in T_τ(Σ, X). We write f instead of f(). We use propositional formulae, up to logical equivalence, to represent (some) sets of terms. Let A_τ, for each type τ, be a set of so-called logical variables of type τ. The intent is that each logical variable of type τ denotes a set of terms of type τ. Propositional formulae F of type τ are defined by the grammar:
F ::= A | F ∧ F | F ∨ F | ¬F | 0 | 1

where A ranges over A_τ. Formulae F of type τ are interpreted as sets ⟦F⟧ρτ in environments ρ, where ρ is any family (ρ_τ)_{τ∈T} of maps from A_τ to subsets of T_τ(Σ, X), by interpreting 0 as ∅, 1 as T_τ(Σ, X), ∧ as intersection, ∨ as union, and ¬ as complement.
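As a concrete reading of this interpretation (not code from the paper, and with illustrative names), such formulae can be evaluated directly as set operations once an environment assigns a set of terms to every logical variable:

# Minimal sketch: propositional formulae denoting sets of terms.
# env maps variable names to sets; universe plays the role of T_tau(Sigma, X).

def evaluate(formula, env, universe):
    kind = formula[0]
    if kind == "var":
        return env[formula[1]]
    if kind == "zero":
        return set()
    if kind == "one":
        return set(universe)
    if kind == "and":
        return evaluate(formula[1], env, universe) & evaluate(formula[2], env, universe)
    if kind == "or":
        return evaluate(formula[1], env, universe) | evaluate(formula[2], env, universe)
    if kind == "not":
        return set(universe) - evaluate(formula[1], env, universe)
    raise ValueError(kind)

# Example: variables X and Y over a tiny universe of term names.
universe = {"a", "b", "c", "d"}
env = {"X": {"a", "b"}, "Y": {"b", "c"}}
f = ("and", ("var", "X"), ("not", ("var", "Y")))   # X and not Y
print(evaluate(f, env, universe))                   # {'a'}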
To deal with term structure, we define the following variant of tree automata. Compared to ordinary tree automata [7], ours integrate propositional formulae at states, and the states are typed (the latter helps in practice to limit the size of automata, and does not restrict the generality of the approach). To simplify the following definition, extend Σ to F ∪ ⋃_{τ∈T} X_τ by letting Σ(x) be → τ for every x ∈ X_τ. Let Q be a set of so-called states q. We assume that each state q has a type τ_q, and that Q contains infinitely many states of each type. An _-parameterized tree automaton, or _PTA, of type τ₀, A, is a 4-tuple (Q, F, R, B), where Q is a finite subset of Q, whose elements are the states of A, F ⊆ Q is the set of final states, B maps each state q ∈ Q to a formula of type τ_q, and R is a set of rewrite rules f(q₁, ..., qₙ) → q, the transitions, where f ∈ F ∪ ⋃_{τ∈T} X_τ
is such that (f )= ^ q1 : : : qn ! q (“transitions respect types”)—in case f is a variable of type , this means n = 0, (x) =! . Ordinary tree automata are just _PTAs without the B component (or equivalently, where B maps each state to the class of 0.) The semantics of _PTAs is given by defining when a _PTA A=( ^ Q; F; R; B) recognizes a term t in an environment at a state q; this is so if and only if t 2 [ B(q )]]q , or t is of the form f (t1 ; : : : ; tn ), and there is a transition f (q1 ; : : : ; qn ) ! q in R such that tj is recognized by A in at qj for each j , 1 j n. A term t is recognized by A in if it so at some final state of A. We can compute unions of _PTAs exactly, and give upper approximants of their intersections by a standard automaton product construction. (This construction gives an exact result in the case of normal _PTAs to be described later.) We can always test whether an _PTA is definitely empty, i.e. whether it cannot recognize any term under any environment : create a Boolean variable neq for each state of the _PTA, produce the clause neq if B(q ) is not equivalent to 0 (for each q ), the clause neq1 ^ : : : ^ neqn ) neq for each transition f (q1 ; : : : ; qn ) ! q with B(q) equivalent to 0, and :neq for each final state q ; if the resulting set of clauses is satisfiable, then the given _PTA is definitely empty; to check it, we use BDDs [5] to represent sets B(q ) and unit resolution to solve the resulting set of Horn clauses. We define assumptions to be maps H from types to formulae of type . The environment =( ^ ) 2T satisfies H, written j= H, if and only if [ H( )]] is the set of all terms of type , for every type . For any two formulae F and G of type , we write F \ G = ; the assumption mapping to :(F ^ G) and every other type to 1. Given a finite family of assumptions Hi , i 2 I , their conjunction maps every type to i2I Hi ( ). We reason on _PTAs A modulo assumptions H by reducing A, replacing B(q) by B(q) ^ H(q ) for each state q to get a new _PTA AjH: under any environment satisfying H, A and AjH recognize the same terms, and if AjH is definitely empty, then for no environment satisfying H, A recognizes any term.
V
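The definite-emptiness test described above amounts to propagating "non-empty" facts through the transitions and then checking the final states. A small sketch of that propagation (plain unit propagation on the Horn clauses, without the BDD representation of the B(q) formulae; all names are illustrative):

def definitely_empty(final, transitions, b_nonzero):
    """Sketch of the _PTA definite-emptiness test.

    final       : set of final states
    transitions : list of (args, q) pairs for rules f(q1,...,qn) -> q
                  (the symbol f is irrelevant here)
    b_nonzero   : states q whose formula B(q) is not equivalent to 0
    Returns True when no final state can recognize any term, i.e. when the
    clause set of the construction above is satisfiable.
    """
    nonempty = set(b_nonzero)            # unit facts ne_q
    changed = True
    while changed:                        # unit resolution to a fixpoint
        changed = False
        for args, q in transitions:
            if q not in nonempty and all(a in nonempty for a in args):
                nonempty.add(q)
                changed = True
    # The clauses "not ne_q" for final q are satisfied iff no final state
    # was derived to be non-empty.
    return not (nonempty & set(final))

# Example: a state qa with B(qa) != 0 feeding a final state qf.
print(definitely_empty(final={"qf"},
                       transitions=[(["qa"], "qf")],
                       b_nonzero={"qa"}))      # False: qf is non-empty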
3 Messages, What Intruders Know, and Simulating Protocol Runs

To be more specific, our set T of types contains the type msg of messages; the type msglist of tuples of messages, which we shall use to build argument lists to the tupling operator t below; the type K of raw keys, e.g. integers of some fixed sizes used to build actual keys, of type key, which we assume to be in T as well; the type D of raw data, e.g. integers, reals, strings, etc. T may contain other types, which we do not care about. The basic signature Σ₀ is:
symk : K → key
asymk1 : K → key
asymk2 : K → key
d : D → msg
sk : msg × msg → key
pubk : msg → key
privk : msg → key
k : key → msg
c : msg × key → msg
t : msglist → msg
nil : → msglist
cons : msg × msglist → msglist

together with a special constant of type key (→ key), the hash key mentioned below.
The symk constructor builds symmetric keys from raw keys, asymk1 and asymk2 build the two parts of asymmetric keys; sk returns a long-term session key shared between the two principals in argument, pubk and privk return their argument’s public and
private keys respectively. Any actual key is a message, as represented by the explicit conversion symbol k. Similarly, we use d to convert raw data to messages. The symbol c is used to build ciphertexts: c(M, K) is the result of encrypting the plaintext M with key K. The special key is used to model hashing: the hash code of M is modeled as the encryption of M under the special key. Finally, any list of messages can be made into a message, using the tupling constructor t that takes a list of messages, of type msglist, in argument: the latter are built using the standard Lisp constructors nil and cons. For legibility we shall abbreviate cons(M₁, ..., cons(Mₙ, nil) ...) as [M₁, ..., Mₙ]. We consider as our actual signature any one of the form Σ₀ ⊎ Σ₁, where Σ₁ is an unspecified collection of function symbols of signatures τ₁ × ... × τₙ → τ where τ ∉ {key, msg, msglist}. Leaving T and Σ partly unspecified allows us to deal with extensible types for raw keys and raw data.

We say that, for any keys K and K′ (of type key), K′ is an inverse of K if and only if K = symk(k) or K = sk(M₁, M₂) and K′ = K; or K = asymk1(k) and K′ = asymk2(k); or K = asymk2(k) and K′ = asymk1(k); or K = pubk(M) and K′ = privk(M); or K = privk(M) and K′ = pubk(M). Note that the special key has no inverse.

Intruders can read on any communication line, and collect what they read. Let E be a set of messages that the intruders have collected (this set might be infinite). These intruders can then forge new messages from E and send them to other principals. Following [4], we model intruders as a deductive system. Write E ⊢ M for the predicate "from the set E of messages, the intruders may deduce the message M", defined as follows (E, M denotes the union of E with {M}):

(Ax)      E, M ⊢ M
(CryptI)  from E ⊢ M and E ⊢ k(K), infer E ⊢ c(M, K)
(CryptE)  from E ⊢ c(M, K) and E ⊢ k(K′), with K′ an inverse of K, infer E ⊢ M
(TupleI)  from E ⊢ M₁, ..., E ⊢ Mₙ, infer E ⊢ t([M₁, ..., Mₙ])
(TupleEᵢ) from E ⊢ t([M₁, ..., Mₙ]), infer E ⊢ Mᵢ, 1 ≤ i ≤ n
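On finite sets of concrete messages (rather than on the automata used later), this deductive closure can be computed directly. The sketch below, with an illustrative tagged-tuple encoding and helper names of my choosing, first closes E under the extraction rules and then checks constructibility, the extract-then-construct decomposition justified in the next paragraph.

# Messages as tagged tuples: ("c", M, K), ("t", (M1, ...)), ("k", K),
# ("d", value); keys as ("symk", raw), ("pubk", M), ("privk", M), ...

def inverse(key):
    """Inverse of a key term, following the definition of inverse keys."""
    tag = key[0]
    if tag in ("symk", "sk"):
        return key
    if tag == "asymk1":
        return ("asymk2", key[1])
    if tag == "asymk2":
        return ("asymk1", key[1])
    if tag == "pubk":
        return ("privk", key[1])
    if tag == "privk":
        return ("pubk", key[1])
    return None                 # e.g. the special hash key has no inverse

def ext(E):
    """Extraction closure: (Ax), (CryptE), (TupleE_i)."""
    E = set(E)
    changed = True
    while changed:
        changed = False
        for m in list(E):
            new = set()
            if m[0] == "t":
                new |= set(m[1])                          # field selection
            if m[0] == "c" and ("k", inverse(m[2])) in E:
                new.add(m[1])                             # decryption
            if not new <= E:
                E |= new
                changed = True
    return E

def constructible(M, E):
    """Can M be built from E using only (Ax), (CryptI), (TupleI)?"""
    if M in E:
        return True
    if M[0] == "c":
        return constructible(M[1], E) and constructible(("k", M[2]), E)
    if M[0] == "t":
        return all(constructible(x, E) for x in M[1])
    return False

def deducible(M, E):
    """Membership in Ded(E), using Ded(E) = Con(Ext(E))."""
    return constructible(M, ext(E))

# Example: a ciphertext plus the matching symmetric key lets the intruder
# re-tuple the plaintext with a known nonce.
K = ("symk", 1)
E = {("c", ("d", "secret"), K), ("k", K), ("d", "nonce")}
print(deducible(("t", (("d", "secret"), ("d", "nonce"))), E))   # True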
So intruders may replay messages (Ax), construct messages by encryption and tupling ((CryptI), (TupleI)), and extract messages by decryption and field selection ((CryptE), (TupleEᵢ))—but they cannot crack ciphertexts. Then we may always assume without loss of generality that intruders do all extractions before any construction [4]. That is, let Ded(E) be the set of messages deducible from E, i.e., those M such that E ⊢ M is derivable; let Con(E) be the constructible ones (derivable using only (Ax), (CryptI), (TupleI)), and Ext(E) the extractible ones (derivable using only (Ax), (CryptE), (TupleEᵢ)). Then Ded(E) = Con(Ext(E)). We represent sets of messages E by _PTAs, more precisely by normal _PTAs, whose states q of type msg, msglist or key are such that B(q) is equivalent to 0, and whose transitions f(q₁, ..., qₙ) → q are such that f is in the basic signature Σ₀. In particular, computing intersections can be done exactly on normal _PTAs. A central result is that for every normal _PTA A of type msg, there is a normal _PTA that we call Ded(A) such that, if E is the set of terms recognized by A in ρ,
then Ded(A) recognizes at least the terms of Ded(E) in ρ. The idea is to construct Ded(A) as Con(Ext(A)), where the semantics of Con and Ext are as expected. Building Ext(A) works by saturating the set F of final states by the following two rules: for every transition t(ql) → q where q is in F, add to F all states q′ of type msg reachable from ql by following cons-transitions backwards (rule (TupleEᵢ)); for every transition c(q′, qk) → q with q ∈ F, add q′ to F if for some transition k(qk′) → qf with qf ∈ F, qk′ contains possible inverses of qk (rule (CryptE)): qk′ contains possible inverses of qk when there are transitions f₁(q₁₁, ..., q₁ₙ) → qk and f₂(q₂₁, ..., q₂ₙ) → qk′ such that q₁ⱼ and q₂ⱼ intersect possibly for every 1 ≤ j ≤ n, where f₁ = symk and f₂ = symk, or f₁ = asymk1 and f₂ = asymk2, etc. (see the definition of inverse keys); two states q₁ and q₂ of the same type intersect possibly if and only if the intersection of (Q, {q₁}, R, B) and (Q, {q₂}, R, B) is not definitely empty. To build Con(A), add two fresh states qm of type msg and ql of type msglist to A, mapped to 0 by B. Then for each transition f(q₁, ..., qₙ) → q₀, where q₀ is in the set F of final states of A, add a transition f(q₁, ..., qₙ) → qm, and add transitions nil() → ql, cons(qm, ql) → ql, and t(ql) → qm (rule (TupleI)), and transitions c(qm, q) → qm for every transition k(q) → q′ with q′ final in A (rule (CryptI)).

We simulate protocol runs by describing each principal as a small program. Programs are sequences of instructions, which may either create raw keys, create raw data (nonces), write expressions onto output channels, or read expressions from input channels while pattern-matching them (à la ML). We verify protocols by simulating all possible interleavings (modulo some partial order reductions). The Ded operator handles writes: writing a message M adds M to the set E of messages, and is abstracted by the computation of Ded(A), where A is the normal _PTA abstracting E. A read returns any message M such that E ⊢ M is derivable: we abstract this by having the read instruction return the _PTA A abstracting E itself as abstract value. Note that abstract values associated with each program variable denote sets of concrete messages, and are represented as normal _PTAs again. Pattern-matching is done in the abstract semantics just as in the concrete semantics, replacing equality tests between concrete messages M₁ and M₂ by tests that the _PTAs that abstract M₁ and M₂ have an intersection that is not definitely empty after reduction by the current set of assumptions H. Creating fresh raw data is done as follows. With each instruction creating raw data we associate a freshness variable X ∈ A_D; then we insist that H be the conjunction of all assumptions X ∩ Y = ∅ for every two distinct freshness variables X and Y, and possibly of other assumptions. (H is fixed at the beginning of the simulation and never changes.) Then the abstract value of the variable containing the newly created data is the automaton ({q}, {q}, ∅, {q ↦ X}) recognizing exactly those data in (the semantics of) X. Creating fresh keys is done similarly. Note that propositional variables are really needed here to deal with freshness of nonces and keys. Before we start the simulation, we need to describe the initial set of messages that the intruders know.
So let K₀ and D₀ be propositional variables denoting the sets of raw keys that exist (i.e., have been created already), respectively raw data that exist at the start of the run. Let SSK₀, SAK1₀, SAK2₀ be variables denoting the sets of raw keys k such that symk(k), resp. asymk1(k), resp. asymk2(k) are initially unknown to the intruders. Let SD₀ be a variable denoting the set of raw data d such that d(d) is initially
unknown to intruders. Assuming for simplicity that every key sk(...) or privk(...) is initially unknown to intruders, and that all keys pubk(...) and the special key are known, we build an _PTA A₀ recognizing the greatest set of terms M known to the intruders validating the secrecy assumptions above. Informally, this is done as follows. Create a state qd of all raw data assumed to exist and initially known; a state qk of all keys assumed to exist and initially known; a state qk⁻¹ of all keys assumed to exist but that have no initially known inverse. Then the set E of terms M we look for is given by: M is either d(d) with d recognized at qd, or k(k) with k recognized at qk, or a tuple t([M₁, ..., Mₙ]) where each Mᵢ is in E, or c(M, K), where either M is in E and K is any existing key, or M is any existing message and K is recognized at qk⁻¹. This description can be turned easily into an actual _PTA A₀.

We also extend the simulation to handle an unbounded number of copies of any given group of principals. This handles the case of so-called parallel multi-session principals S, such as key servers, which actually spawn a new thread after each connection request. (They behave as processes !S in the π-calculus, i.e. they run an unbounded number of copies of S in parallel.) To deal with this case, we use an idea from [9]: such principals S are viewed as accomplices to intruders, and we model them by extending the Ded(A) automaton by new states and transitions to account for the added computing power that all the copies of S contribute to intruders. This is technical, but let us give a rough idea. First, we assume that each creation (of raw data, of raw keys) done by each copy of S actually returns some unspecified data in the denotation of the freshness variable associated with the creation instruction; so we confuse every copy of S, as far as freshness is concerned. Then, we assume that each instruction of any copy of S executes in any order. Next, we assume that each read succeeds, and pattern-matching is approximated in a crude way: for example, in a read t([c(x, K), y]) which attempts to read a pair, put the second component in y, decrypt the first component with K and put the resulting plaintext in x, we simply estimate that the value of y will be anything known to intruders, and the resulting value of x will be anything that exists (possibly not known to intruders, because of the enclosing c). We model this by enriching the automaton Ded(A) with two states, qkn recognizing all known messages, and qx recognizing all existing messages. Writes are then coded by merging these states with other states; e.g., writing t([x, c(y, K)]) with the same x and y as above implies that t([x, c(y, K)]) must be recognized at qkn, so that x and c(y, K) are recognized at qkn, because of (TupleEᵢ). As far as x is concerned, this means losing any information on existing but unknown messages (merge the qx and qkn states). For c(y, K), everything depends on whether we assume K to have a known inverse or not: in the first case, y must exist, otherwise it must become known to the intruders; in any case, since y was already assumed to be known, we do nothing here. In general, the problem of knowing whether K has a known inverse or not matters, and is solved by a fixpoint iteration, which converges because we only deal with finitely many key expressions.
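A direct (non-automaton) reading of the greatest initial-knowledge set described above is a recursive membership test; the sketch below, with illustrative parameter names, mirrors the four cases for M and is only meant to clarify what A₀ recognizes.

def initially_known(M, known_data, known_keys, uninvertible_keys, existing_keys):
    """Does message M belong to the greatest set E of initially known terms?

    known_data        : raw data d with d(d) initially known (state qd)
    known_keys        : key terms initially known (state qk)
    uninvertible_keys : existing keys with no initially known inverse (qk^-1)
    existing_keys     : all key terms assumed to exist
    """
    tag = M[0]
    if tag == "d":
        return M[1] in known_data
    if tag == "k":
        return M[1] in known_keys
    if tag == "t":
        return all(initially_known(x, known_data, known_keys,
                                   uninvertible_keys, existing_keys)
                   for x in M[1])
    if tag == "c":
        plaintext, key = M[1], M[2]
        if key in existing_keys and initially_known(
                plaintext, known_data, known_keys,
                uninvertible_keys, existing_keys):
            return True
        # A ciphertext under a key with no known inverse reveals nothing, so
        # it may be "known" even when only the existence of its plaintext
        # is assumed.
        return key in uninvertible_keys
    return False

# Example: an encryption under B's private key, which has no known inverse.
K = ("privk", ("d", "B"))
print(initially_known(("c", ("d", "n"), K), set(), set(), {K}, {K}))   # True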
4 Experimental Results We have implemented these techniques using a bytecode compiler for HimML, a variant of Standard ML incorporating facilities for handling finite sets and maps elegantly and
efficiently [11]. We have then tested this implementation on standard cryptographic protocols [6], on a 166MHz Intel Pentium machine running Linux 2.0.30. Each of these protocols is a three-party protocol, involving two principals A and B that wish to get a secret key Kab by interacting with a key server S. All of these protocols were tested under an empty assumption H. Results and running times are as follows:
Protocol                     | S in mono-session             | S in parallel multi-session
                             | Result  Time (s.)  #Branches  | Result  Time (s.)  #Branches
Needham-Schroeder shared key | p.f.    1.94       4          | p.f.    1.56       3
Otway-Rees                   | OK      1.56       3          | OK      1.56       3
Wide-Mouthed Frog            | p.f.    0.34       2          | p.f.    ?          ?
Yahalom                      | p.f.    1.17       4          | OK      ?          ?
SimplerYahalom               | OK      1.16       3          | OK      ?          ?
Otway-Rees2                  | OK      3.54       4          | ?       ?          ?
In the result column, “OK” means the protocol passed, “p.f.” means that it contains a possible flaw. The “#Branches” column indicates how much non-determinism is involved in checking all relevant interleavings of the protocol. Times are in seconds, and total the whole exploration of all relevant interleavings; in other words, our tool does not just stop after the first possible flaw. Note that the Needham-Schroeder protocol was found to be flawed, and indeed our tool finds the standard attack where the intruder plays the second part of the session alone against B , without A or S participating at all. The Yahalom protocol was found to be flawed, too: whether or not our tool has found an attack remains to be examined; indeed, reading attacks off _PTAs is not an easy task! But, as noticed in [6], the Yahalom protocol is a very subtle one, and requires strong assumptions. (By the way, our tool only detects flaws in B ’s behaviour, so we are guaranteed that A at least cannot be fooled.) On the other hand, the SimplerYahalom protocol (an improved version of the Yahalom protocol given in [6]) is found to be correct by our tool, confirming the opinion of op.cit. that this second version is easier to show correct than the original one. The last line of the table shows a simulation of two sessions of the Otway-Rees protocol in sequence: OtwayRees2 simulates a principal A2 playing the role of A twice in a row (with A’s identity, and trying to communicate with the same B twice), a principal B2 that plays the role of B twice in a row (with B ’s identity, but without checking that its peer is the same A in both sessions), and a server S . The time taken by our tool is still very reasonable, although there should be many more interleavings than for OtwayRees. We are saved by the fact that several interleavings are impossible: our tool discovers that some reads must block (abstract pattern-matching fails). The worst-case complexity of our algorithms is daunting: abstract pattern-matching in particular takes exponential time and produces _PTAs of exponential size. Nonetheless, the nice news is that verification of actual protocols is quite fast on average, while still maintaining a high level of accuracy.
5 Conclusion We hope to have convinced the reader that automatic verification of cryptographic protocols was now possible, including some limited form of deduction, and allowing us to prove properties like “M is definitely secret at program point p, whatever the initial
messages known to the intruder, provided that assumption H is verified”. Our technique is natural, provides actual secrecy guarantees—and to a lesser extent freshness guarantees—, and works fast in practice.
Acknowledgments Many thanks to Dominique Bolignano, David Monniaux, and Mourad Debbabi.
References

1. M. Abadi. Secrecy by typing in cryptographic protocols. Journal of the Association for Computing Machinery, 1998. Submitted.
2. M. Abadi and A. D. Gordon. A calculus for cryptographic protocols: The spi calculus. In Fourth ACM Conference on Computer and Communications Security. ACM Press, 1997.
3. M. Bellare and P. Rogaway. Provably secure session key distribution–the three party case. In 27th ACM Symposium on Theory of Computing (STOC'95), pages 57–66, 1995.
4. D. Bolignano. An approach to the formal verification of cryptographic protocols. In 3rd ACM Conference on Computer and Communication Security, 1996.
5. R. E. Bryant. Graph-based algorithms for boolean functions manipulation. IEEE Transactions on Computers, C35(8):677–692, 1986.
6. M. Burrows, M. Abadi, and R. Needham. A logic of authentication. Proceedings of the Royal Society, 426(1871):233–271, 1989.
7. H. Comon, M. Dauchet, R. Gilleron, F. Jacquemard, D. Lugiez, S. Tison, and M. Tommasi. Tree automata techniques and applications. Available on http://www.grappa.univ-lille3.fr/tata/, 1997.
8. P. Cousot and R. Cousot. Abstract interpretation and application to logic programs. Journal of Logic Programming, 13(2–3):103–179, 1992. Correct version at http://www.dmi.ens.fr/~cousot/COUSOTpapers/JLP92.shtml.
9. M. Debbabi, M. Mejri, N. Tawbi, and I. Yahmadi. Formal automatic verification of authentication cryptographic protocols. In 1st IEEE International Conference on Formal Engineering Methods (ICFEM'97). IEEE, 1997.
10. T. Genet and F. Klay. Rewriting for cryptographic protocol verification (extended version). Technical report, CNET-France Telecom, 1999. Available at http://www.loria.fr/~genet/Publications/GenetKlay-RR99.ps.
11. J. Goubault. HimML: Standard ML with fast sets and maps. In 5th ACM SIGPLAN Workshop on ML and its Applications, 1994.
12. G. Lowe. Breaking and fixing the Needham-Schroeder public-key protocol using FDR. In TACAS'96, pages 147–166. Springer Verlag LNCS 1055, 1996.
13. W. Marrero, E. M. Clarke, and S. Jha. Model checking for security protocols. Technical Report CMU-SCS-97-139, Carnegie Mellon University, 1997.
14. C. A. Meadows. The NRL Protocol Analyzer: An Overview. Journal of Logic Programming, 1995.
15. D. Monniaux. Abstracting cryptographic protocols with tree automata. In 6th International Static Analysis Symposium (SAS'99). Springer-Verlag LNCS 1694, 1999.
16. L. C. Paulson. The inductive approach to verifying cryptographic protocols. Journal of Computer Security, 6:85–128, 1998.
17. B. Schneier. Applied Cryptography. John Wiley and Sons, 1996.
18. S. D. Stoller. A bound on attacks on authentication protocols. Technical Report 526, Indiana University, 1999. Available from http://www.cs.indiana.edu/hyplan/stoller.html.
Verification Methods for Weaker Shared Memory Consistency Models

Rajnish P. Ghughal (1, 2, *) and Ganesh C. Gopalakrishnan (2, **)

1 Formal Verification Engineer, Intel, Oregon. [email protected]
2 Department of Computer Science, University of Utah, Salt Lake City, UT 84112-9205. [email protected]

Abstract. The problem of verifying finite-state models of shared memory multiprocessor coherence protocols for conformance to weaker memory consistency models is examined. We start with W.W. Collier's architectural testing methods and extend them in several non-trivial ways in order to be able to handle weaker memory models. This, our first contribution, presents the construction of architectural testing programs similar to those constructed by Collier (e.g. the Archtest suite) suited for weaker memory models. Our own primary emphasis has, however, been to adapt these methods to the realm of model-checking. In an earlier effort (joint work with Nalumasu and Mokkedem), we had demonstrated how to adapt Collier's architectural testing methods to model-checking. Our verification approach consisted of abstracting executions that violate memory orderings into a fixed collection of automata (called Test Automata) that depend only on the memory model. The main advantage of this approach, called Test Model-checking, is that the test automata remain fixed during the iterative design cycle when different coherence protocols that (presumably) implement a given memory model are being compared for performance. This facilitates `push-button' re-verification when each new protocol is being considered. Our second contribution is to extend the methods of constructing test automata to be able to handle architectural tests for weaker memory models. After reviewing prior work, in this paper we mainly focus on architectural tests for weaker memory models and the new abstraction methods thereof to construct test automata for weaker memory models.
An extended version of this paper is available through www.cs.utah.edu/formal_verification/ under `Publications'
* The author is currently at Intel, Oregon and was at University of Utah during the course of the research work presented here.
** Supported in part by NSF Grant No. CCR-9800928.

1 Introduction

Virtually all high-end CPUs are designed for multiprocessor operation in systems such as symmetric multiprocessor servers and distributed shared memory systems. As processors are getting faster faster than memories are, modern CPUs
employ shared memory consistency models that permit more optimizations at the hardware and compiler levels. As weaker memory models (weaker relative to sequential consistency [7]) permit more hardware/compiler optimizations, virtually all modern processors employ a weak memory model such as total store ordering (TSO, [13]), partial store ordering (PSO, [13]), or the Alpha Shared Memory Model [11]. Most past work in verifying processors for conformance to memory models has, however, focussed on sequential consistency verification. The upshot of these facts is that there is very limited understanding in the formal verification community on verifying conformance to weaker memory models, and to do it in a way that fits in a modern design cycle in which design changes, and hence verification regressions, are very important.
Contribution 1: Architectural tests for Weaker Memory Models

Our first contribution is in formally characterizing several weaker memory models and presenting new architectural tests for them. In our approach, a formal memory model is viewed as a conjunction of elementary ordering "rules" (relations) such as read ordering and write ordering, as defined by Collier [1] in conjunction with architectural testing methods for multiprocessor machines developed by him. For example, sequential consistency can be viewed as a conjunction of computational ordering (CMP), program ordering (PO), and write atomicity (WA). This is written "SC=(CMP,PO,WA)", where the right-hand side of the equation is called a compound rule, with CMP, PO, WA, etc., then called elementary rules. Collier's work was largely geared towards strong memory models, as well as certain atypical weaker memory models. For these memory models, it turns out that it is sufficient to verify for conjunctions of `classical' memory ordering rules such as PO, WA, etc. However, weaker memory models relax these classical ordering rules (often PO and WA) in subtle ways. For example, as we show later, TSO relaxes the write-to-read ordering (WR) aspect of PO. TSO also relaxes WA slightly. Therefore, in a memory system that is supposed to implement TSO, a violation of the classical PO rule does not mean that the memory system is erroneous. The memory system is erroneous with respect to PO only if it violates an aspect of PO other than WR orderings. Specifically, given that PO is made up of four sub-rules, namely RO (read ordering), WO (write ordering), WR (write-read ordering), and RW (read-write ordering), it means we must be prepared to look for violations of RO, WO, or RW. Generalizing this idea, to extend Collier's method to cover practical weaker memory models, pure tests that test for violations of a single elementary architectural rule or limited combinations of elementary rules would be good to have. In this paper, we outline an example pure test. This example presents a test that checks whether (CMP,RO) (the conjunction of CMP and RO) is violated. We have developed several other such pure tests for other rules to facilitate testing for different weak memory models - some of considerably more complexity than the example presented. We will not be presenting all the tests but provide a brief summary of our results at the end of this paper.
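To illustrate the WR relaxation mentioned above, consider the classic store-buffering litmus test: processor 1 performs w(a)1; r(b) and processor 2 performs w(b)1; r(a). Under any interleaving that keeps each processor's program order, at least one read returns 1; TSO, which relaxes write-to-read ordering, also allows both reads to return 0. The small enumeration below (Python, written for this note; it is not part of Archtest or of the paper's test automata) checks which outcomes are reachable when program order is preserved.

from itertools import permutations

# Each event is (processor, kind, location), kind in {"w", "r"};
# writes store 1, all locations start at 0.
P1 = [("p1", "w", "a"), ("p1", "r", "b")]
P2 = [("p2", "w", "b"), ("p2", "r", "a")]

def sc_outcomes(prog1, prog2):
    """Outcomes (r_b, r_a) reachable by interleavings preserving program order."""
    events = prog1 + prog2
    outcomes = set()
    for order in permutations(range(len(events))):
        pos = {i: order.index(i) for i in range(len(events))}
        # keep per-processor program order
        if pos[0] < pos[1] and pos[2] < pos[3]:
            mem = {"a": 0, "b": 0}
            reads = {}
            for i in sorted(range(len(events)), key=lambda i: pos[i]):
                proc, kind, loc = events[i]
                if kind == "w":
                    mem[loc] = 1
                else:
                    reads[(proc, loc)] = mem[loc]
            outcomes.add((reads[("p1", "b")], reads[("p2", "a")]))
    return outcomes

print(sc_outcomes(P1, P2))    # (0, 0) is absent: forbidden when WR is enforced

A TSO execution may defer each write past the following read (the WR relaxation), which is exactly what adds the (0, 0) outcome; a pure test for another rule, say (CMP,RO), must be arranged so that such allowed relaxations never trip it.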
In this paper, we explain the technique by which we arrive at pure tests, and examine various aspects of this process, including many non-obvious special cases as well as a few limitations. As one example, we show that sometimes we need to limit the degree to which we leave out rules from a compound rule. For example, we show that the combination (CMP,WO) (WO is "write ordering") is irrelevant in practice; instead, the minimal pure rule worthy of study is (CMP, UPO, WO), where UPO denotes uniprocessor ordering. As another example, we show that WO is indistinguishable from WOS if CMP and a relaxed write atomicity condition WA-S are provided. The practical implication of these results is that they allow us to explore various tests for a combination of elemental ordering rules and reason about whether an elemental rule is obeyed in the presence of other rules. This also enables us to examine a weaker memory model for all aspects of its behavior, come up with different tests to stress these aspects separately, and to correlate the test results. In our work, we have obtained such characterizations for PSO, the Alpha Shared Memory Model, and the IBM 370 memory model. We investigate various pure tests to facilitate verification of conformance to these weaker memory models. In a nutshell, our contribution allows the Archtest methodology to apply to several practical weaker memory models.
Contribution 2: New Abstraction Methods for Architectural Tests

Our second contribution pertains to new abstraction methods in test model-checking as explained below. In our earlier work [4, 9], we reported our test model-checking approach to verify finite-state models of shared memory systems for conformance to sequential consistency. Test model-checking is basically a reachability analysis technique in which the model of the memory system being verified is closed with test automata playing the roles of the CPUs. The test automata administer a predetermined sequence of write operations involving only a limited range of addresses as well as data values. These writes are interspersed with reads over the same addresses. The test automata were constructed in such a way that when the reads return "unexpected" values, they move to error states,
flagging ordering rule violations. Test model-checking can be carried out in the framework of temporal logic (say, LTL) model-checking by converting each test automaton into a temporal logic formula and checking for the safety property □(¬inErrorState). In a practical setting, however, specialized reachability analysis algorithms may perform better. The fact that the test automata remain the same despite changes in the shared memory system implementation is a significant advantage, as the test model-checking algorithm can be automatically reapplied after each design iteration. In contrast, previous methods required the characterization of the reference specification, namely the desired formal memory model, in terms of very complex temporal logic specifications involving internal details of the memory system under design. This requires the error-prone step of rewriting the temporal logic specification following each design iteration. Many previous efforts also
involved manual proofs which are not needed in our approach. For these reasons, test model-checking is eminently suited for use in actual design cycles. Our earlier reported work on test model-checking [4, 9] serves as the background for the work reported here. Our contributions in these works were the following. We demonstrated that test automata can be derived through sound abstractions of architectural tests similar to Archtest. The abstractions were based on data independence and address semi-dependence. These notions are defined with respect to executions, where executions are shared memory programs with reads annotated with the read data values. Under data independence, executions are closed under function applications to the involved data values; in other words, changing the data values does not affect the behavior of the memory system. Under address semi-dependence [5], no operations may be performed on addresses other than comparison for equality. In our earlier work, we showed that test automata give the effect of running architectural tests for all possible addresses, data values, architectural test-program lengths, and interleavings. The specific contribution we make with regard to test model-checking is in developing additional abstraction methods that help apply test model-checking to more general varieties of architectural tests. To give a few motivating details, the new pure tests we have developed for handling weaker memory models involve architectural tests that examine a finite, unbounded history of read values. To handle these situations, we employ data abstraction in conjunction with properties of Boolean operators to derive a finite summary of these histories. Details of these abstraction methods and soundness proofs appear in [3]. Another related contribution we make is in handling memory barriers. Given that the test automata administer a non-deterministic sequence of memory operations, a question that arises in connection with `membar' instructions is how many membar instructions to consider. We show that under reasonable assumptions (specifically, that the memory system does not decode the number of membar instructions it has seen) we need to consider only a limited number of membar instructions. Details appear in [3].
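As a loose illustration of what such a test automaton observes (not one of the paper's automata, and with illustrative names), consider an Archtest-style read-ordering check: one processor writes increasing values to a location, another keeps reading it, and a monitor enters an error state as soon as the read values ever decrease, i.e. as soon as RO is violated.

class ReadOrderMonitor:
    """Safety monitor: reads of one location by one observer must be
    non-decreasing when the writer only ever increases the value."""

    def __init__(self):
        self.last = None
        self.in_error_state = False

    def observe_read(self, value):
        if self.last is not None and value < self.last:
            self.in_error_state = True      # ordering rule violated
        self.last = value
        return not self.in_error_state

# Checking "always not inErrorState" over a trace of read values:
monitor = ReadOrderMonitor()
trace_ok = [0, 0, 1, 3, 3, 7]
print(all(monitor.observe_read(v) for v in trace_ok))      # True

monitor = ReadOrderMonitor()
trace_bad = [0, 2, 1]                                       # 1 after 2
print(all(monitor.observe_read(v) for v in trace_bad))     # False

In test model-checking, such a monitor is composed with the memory-system model and the writer/reader automata, and the model checker explores all interleavings instead of a single trace.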
2 Summary of Results

We now summarize our key results in the form of tables and provide an overview (details are in [3]). In Table 1, we summarize the results of test model-checking an operational model of TSO implemented in Berkeley VIS Verilog [12]. This operational model is similar to that used in [2], and usually corresponds to the reference specification of TSO. The two `fail' entries in the table correspond to program ordering, (CMP,PO), and write-to-read orderings, (CMP,WR). Since these orderings are not obeyed in TSO, we obtain `fail' correctly. The other architectural tests in the tables indicate `pass', which means that TSO obeys them. These pass/fail results provide added assurance (a `sanity check') that our characterization of weaker memory models is consistent with the popular understanding of weaker memory models.
Table 2 shows various architecture rules and their transition templates. The idea of transition templates introduced in [1] specified a summary of the ordering rule. Many of the entries in this table were specified in [1]. We have defined new architectural rules (MB-RR through WA-S), defined tests and test automata for them, as well as provided more complete tests for many of the previously existing rules. Table 3 shows the architecture rules in our discussion and the sub-rules each of them consists of. In particular, note WA-S, which is a relaxed write-atomicity rule that is one of the central sub-rules of TSO. Briefly, write events become visible to the processor issuing the write first, and then the events become atomically visible to all other processors. In contrast, in sequential consistency, each write becomes atomically visible (at the same time) to all the processors. Table 4 shows the memory models in our discussion and their specification in the Archtest framework. These results provide, to the best of our knowledge, the first formal characterization, in one consistent framework, of several practical weaker memory ordering rules. For example, by contrasting TSO and the Alpha Shared Memory Model, it becomes clear that the latter is much weaker than the former in terms of read/write orderings, but provides more safety-net operations to recover these orderings. The Alpha architecture manual [11] describes a number of executions called Litmus tests to illustrate which shared memory behavior is allowed and not allowed by the Alpha Shared Memory Model. In [3], we show that all these litmus tests are (often trivially) covered by our characterization of the Alpha Shared Memory architectural compound rule. In addition to sanity-checking our results, these results indicate that a developer of a modern memory system can use our architectural rules to debug the memory system focusing on each facet (sub-rule) at a time.
3 Conclusions and Future Work

We formally characterize the problem of verifying finite-state models of shared memory multiprocessor coherence protocols for conformance to weaker memory consistency models in terms of Collier's architectural testing methods. We extend Collier's framework in several non-trivial ways in order to be able to handle weaker memory models.
Table 1. Verification results on an operational model of TSO using VIS

test automata   #states       #bdd nodes   runtime (mn:sec)   status
CMP, RO, WO     3819          4872         < 1s               pass
CMP, PO         6.50875e+06   50051        2:38               fail
CMP, WR         6.50875e+06   50051        1:25               fail
CMP, RW         6.50875e+06   50051        3:02               pass
CMP, RO         10187         2463         0:37               pass
Table 2. Architecture rules and their transition templates.

1/2 as we have assumed, it is possible to show that k(N) grows faster than N^{1/2} so that the expression in the square brackets above is maximized for N = N. Thus, the system linear extent is given by
k(N )(N ) k(N ):
(25) That is, the system size is set by the highest level of interconnections if k(N) is a slowly varying function and p(N) > 1/2. This implies that the choice of interconnection technology for the highest level is the most critical.
2 Discontinuities and the Origin of Rent's Rule Whereas it is observed that the function k(N ) exhibits considerable continuity over large variation of N , it is also observed that it occasionally exhibits sharp
discontinuities. In other words, it no longer becomes possible to predict the value of the function k(N) for certain N by knowing its values at nearby N. For instance, in the context of Rent's rule, it may not be possible to predict the number of pinouts of a VLSI chip by observing its internal structure, or vice versa [13]. However, this does not imply that Rent's rule (in its generalized form, as given by equation 4) is useless. Consider a multiprocessor computer. Rent's rule may be used to predict the wiring requirements internal to each of the processors. It may also be used for similar purposes for the interconnection network among the processors. In fact, the Rent exponent may even be similar in both cases. However, the function k(N) may exhibit a steep discontinuity (often downward), as illustrated in figure 1 [8]. As is usually the case, a finite number of discontinuities in an otherwise smooth function need not inhibit us from piecewise application of our analytical expressions. Such discontinuities are often associated with the self-completeness of a functional unit [12, 13]. Similar examples may be found in nature. For instance, mammalian brains seem to satisfy n > 3 (i.e. p > 2/3), since the volume per neuron has been found to be greater in species with larger numbers of neurons [27]. The human brain has 10^11 neurons, each making about 1000 connections [28]. Thus, we would expect at least 1000·(10^11)^{2/3} ≈ 10^10 "pinouts." However, we have only about 10^6 fibers in the optic nerve and 10^8 fibers in the corpus callosum.
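The pinout estimate in the preceding paragraph is simply Rent's rule k(N) = k·N^p applied with k = 1000 connections per element and p = 2/3; the short computation below (illustrative, with the exponent taken from the text) reproduces it and contrasts it with the observed fiber counts.

def rent_pinouts(k, N, p):
    """Rent's rule estimate of external connections for a block of N elements."""
    return k * N ** p

neurons = 1e11
estimate = rent_pinouts(k=1000, N=neurons, p=2 / 3)
print(f"Rent estimate: ~{estimate:.1e} pinouts")      # ~2.2e+10, i.e. about 10^10
print("optic nerve: ~1e6 fibers, corpus callosum: ~1e8 fibers")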
Fig. 1. k(N) for a system of N = 100 × 1000 primitive elements consisting of 100 processors of 1000 elements each. The number of "pinouts" of the processors bears no relationship to their internal structure. Equation 4 may be used directly for the range 1 < N < 1000, and with a shift of origin for the range 1000 < N < 100 × 1000.
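A minimal sketch of the scenario in figure 1, with made-up parameter values: Rent's rule is applied piecewise, once inside each 1000-element processor and again, with a shifted origin and a much smaller prefactor (the processor pinout count), above the processor level. Only the shape of the resulting downward jump at N = 1000 follows the text; every number here is illustrative.

```python
def k_piecewise(N, k0=3.0, p=0.6, pinouts_per_processor=64,
                elements_per_processor=1000):
    """Illustrative piecewise Rent function with a discontinuity at the processor boundary."""
    if N <= elements_per_processor:
        return k0 * N**p                         # Rent's rule inside one processor
    modules = N / elements_per_processor         # shift of origin: count processors, not gates
    return pinouts_per_processor * modules**p    # Rent's rule applied to the network of processors

for N in (10, 100, 1000, 2000, 10_000, 100_000):
    print(f"N = {N:>7}: k(N) ~ {k_piecewise(N):8.1f}")
```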
In the context of microelectronic packaging, a quote from C. A. Neugebauer offers some insight as to why such discontinuities are observed: "Since the I/O capacity (of the chip carrier) is exceeded, a significant number of chips can be interconnected only if the pin/gate ratio can be drastically reduced, normally well below that predicted by Rent's rule. Rent's rule can be broken at any level of integration. The microprocessor chip is an example of the breaking of Rent's rule in its original form for gate arrays on the chip level. Being able to delay the breaking of Rent's rule until a much higher level is always an advantage
because it preserves many parallel data paths even at very high levels of integration, and thus offers higher systems performance and greater architectural flexibility." [29]
The breaking of Rent's rule seems to be a technological necessity, and is undesirable from a systems viewpoint. We will later discuss studies which indicate that superconducting or optical interconnections may allow a large dimensionality and Rent exponent to be maintained throughout the higher levels of the hierarchy.
The origin of Rent's rule has intrigued many researchers. Donath showed that Rent's rule is a consequence of the hierarchical nature of the logic design process [30, 31]. Some have viewed it merely as an empirical observation obtained from an examination of existing circuits. Others have suggested that it is as natural as the branching of trees or the human lung (a consequence of their growth process), or that it represents the adaptation of computer circuits to serve the needs of percolation of information.
Fractal concepts have been quite successful in describing natural phenomena. However, it is often more challenging to explain why fractal forms come up so often. Why do computer circuits lend themselves to such a description? One suspects that fractal forms may exhibit certain optimal properties. For instance, bitonic (divide-and-conquer) algorithms can be viewed as elementary fractal forms. Is it possible to postulate general principles (such as the principle of least action in mechanics) regarding optimal information
flow or computation that would lead to an inverse-power-law distribution of line lengths (a constant fractal dimension)? Mandelbrot has postulated maximum entropy principles to predict the observed inverse-power-law distributions of word frequencies (linguistics) [19] and monetary income (economics) [20]. Christie has pursued the idea that the wires in a computing system should obey Fermi-Dirac statistics, based on the observation that the wires are indistinguishable (any two wires of the same length can be exchanged) and that they obey an exclusion principle (only one wire need connect two points) [32, 33]. Keyes [27] has shown how the number of distinct ways one can wire up an array of elements increases with average wire length. In [34] we showed that the number of distinct ways one can "wire up" an optical interconnection system increases similarly with a fundamental quantity known as the space-bandwidth product of the optical system, and thus with the average interconnection length.
The author finds the following viewpoint especially illuminating. At the microscopic level, all information processing involves the distributed manipulation and back-and-forth transfer of pieces of information. There is a certain requirement on the amount of information that must flow or percolate, depending on the particular problem we are trying to solve. This requirement can be embodied in an information flow graph. The dimensionality of this graph can then be taken as a measure of the information flow requirements of the problem. For some problems which require little transfer of information, this dimension may be small. For others, it may be large. When the dimensionality associated with the problem exceeds the dimensions of the physical space in which we construct our circuits (often 2 but at most 3), we are faced with the problem of embedding a higher-dimensional graph into a lower-dimensional space. This is what leads
to Rent's rule: the fact that we try to solve problems with inherently higher dimensionality of information flow than the two- or three-dimensional physical spaces we build our computers in. Several structured problems, such as sorting and discrete Fourier transforming, are known to have global information flow requirements leading to separators which are ∝ N, corresponding to large dimensions and nearly unity Rent exponents. The dimensionality associated with general-purpose computing may also be presumed to be large. In any event, it certainly seems that quite a fraction of interesting problems have dimensions higher than two or three, so that the space dilation effect associated with Rent's rule is expected.
Despite these considerations, Rent's rule may not apply to a particular circuit we examine. The challenges involved in dealing with greater numbers of interconnections may lead designers to reduce the number of physical ports and channels, and to shift the "communication burden" to other levels of the computational hierarchy [35]. Careful examination often reveals that the price of reducing the number of wires is paid in terms of computation time, mediated by techniques such as multiplexing or breaking the transfer of information into multiple steps. Clever schemes can reduce the number of wires that are apparently needed, but these often essentially amount to reorganizing the processing of information in such a way that the same information is indirectly sent in several pieces or at different times. Ultimately, a certain flow and redistribution of information must take place before the problem is solved.
Several levels of graphs can come between the n-dimensional graph characterizing the information flow requirements of the problem to be solved and the e ≤ 3 dimensional physical space. These graphs correspond to different levels of the computational hierarchy, ranging from the abstract description of the problem to the concrete physical circuits. The dimensionality of these graphs provides a stepwise transition from n dimensions to e dimensions (figure 2). Level transitions involving large steps (steep slopes) are where the greatest implementation burden is felt. For line a in figure 2, this burden is felt at the relatively concrete level, and for line c at the relatively abstract level. The burden is more uniformly spread for line b. Shifting the burden from one level to the others may be beneficial because of the different physical and technological limitations associated with each level. Techniques such as algorithm redesign, multiplexing, parallelism, use of different kinds of local or global interconnection networks, and use of alternative interconnection technologies such as optics can be used to this end. Better understanding and deliberate exploitation of these concepts and techniques may be expected to translate into practical improvements.
A particular question that may be posed in this context is whether the burden should lean primarily towards the software domain or primarily towards the hardware domain. An embodiment of the first option may be a nearest-neighbor connected mesh-type computer in which the physical interconnect problem is minimized. Global flows of information are realized indirectly as pieces of information propagate from one neighbor to the next. The second option, in contrast, might rely on direct transfer of information through dedicated global lines which
Fig. 2. The dimensionality of graphs corresponding to different levels for a hypothetical system with four levels.
result in heavy physical interconnect congestion. Although determining the proper balance between these two extremes is in general a very complex issue, it has been addressed in a specific context in [36]. The conclusion is that the use of direct global lines is more beneficial than simulating the same information flow on a locally connected system. This conclusion assumes the use of optical lines to overcome the severe limitations associated with resistive interconnections.
Contexts in which the nature of the problem to be solved does require global information flows, but only at a relatively low rate, may result in poor utilization of dedicated global lines, which nevertheless contribute significantly to system area or volume. This situation can be especially common with optical interconnections, which can exhibit very high bandwidths that are difficult to saturate. For this reason, techniques have been developed for organizing information flow such that distinct pairs of transmitters and receivers can share common high-bandwidth channels, to make the most of the area or volume invested in them [37].
3 Free-Space Optical Interconnections
The concepts discussed in this paper are immediately applicable to three-dimensional layouts [38-40], including those based on optical waveguides or fibers. However, the extension of results originally developed for "solid wires" to free-space optics, which can offer much higher density than waveguides and fibers, is not immediate. Since optical beams can readily pass through each other, it has been suggested that optical interconnections may not be subject to area-volume estimation techniques developed for solid wires. However, proper accounting for the effects of diffraction leads to the conclusion that, from a global perspective, optical interconnections can also be treated as if they were solid lines for the purpose of area and volume estimation, so that most of the concepts discussed in this paper are applicable to free-space optical systems as well.
This conclusion is based on the following result [41]: the minimum total communication volume required for an optical system whose total interconnection length is ℓ_total is given by ℓ_total λ². This result is stated globally; it does not imply that each optical channel individually has cross-sectional area λ², but only that the total volume must satisfy this minimum. Indeed, some channels may have larger cross-sectional areas but share the same extent of space with other channels which pass through them. The bottom line is that even with the greatest possible amount of overlap and space sharing, the global result is as if each channel required a cross-sectional area of λ², as if they were solid wires. If the average connection length in grid units is given by r̄ = N^{p-2/3} as before, then the minimum grid spacing d must satisfy N d³ = N k r̄ d λ², leading to a minimum system linear extent of N^{1/3} d = (k N^p)^{1/2} λ, just as would be predicted for solid wires of width λ (equation 20 with e = 3 and the subscript 0 suppressed) [42]. In many optical systems, the devices are restricted to lie on a plane, rather than being able to occupy a three-dimensional grid. Although in general these systems are subject to the same results, certain special considerations apply [43-46].
The above does not imply that there is no difference between optical and electrical interconnections. Optical interconnections allow the realization of three-dimensional layouts. Optical beams can pass through each other, making routing easier. Furthermore, the linewidth and energy dissipation of optical interconnections are comparatively smaller for longer lines. (This latter advantage is also shared by superconducting lines.)
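The following short sketch evaluates the grid-spacing relation quoted earlier in this section and confirms that the resulting linear extent matches the solid-wire prediction (kN^p)^{1/2} λ. All numerical values (k, p, N, and the diffraction-limited width λ) are illustrative placeholders, not values taken from the text.

```python
# Illustrative parameters (assumptions, not from the text)
N = 1_000_000      # number of elements
k = 4              # connections per element
p = 0.75           # Rent exponent
lam = 1e-6         # effective channel width set by diffraction, ~ one wavelength (m)

r_bar = N**(p - 2/3)              # average connection length in grid units
d = (k * r_bar)**0.5 * lam        # from N d^3 = N k r_bar d lam^2  ->  d^2 = k r_bar lam^2
linear_extent = N**(1/3) * d      # system edge length

print(f"average connection length (grid units): {r_bar:.2f}")
print(f"grid spacing d: {d*1e6:.2f} um")
print(f"linear extent N^(1/3) d:        {linear_extent*1e3:.3f} mm")
print(f"solid-wire prediction (kN^p)^0.5 lam: {(k*N**p)**0.5*lam*1e3:.3f} mm")
```

The two printed extents agree because r̄ = N^{p-2/3}, so N^{1/3}(k r̄)^{1/2} λ = (kN^p)^{1/2} λ.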
4 Fundamental Studies of Interconnections
Rent's rule and associated line-length distributions have been of great value in fundamental studies of integrated systems [47-50]. Two considerations are fundamental in determining the minimum layout size, and thus the signal delay: interconnection density and heat removal [51-54]. Both considerations are interrelated since, for instance, the energy dissipation on a line also depends on its length, which in turn depends on the grid spacing, which in turn depends on both the total interconnection length and the total power dissipated. The complex interplay between the microscopic and macroscopic parameters of the system must be analyzed simultaneously. Rent's rule and line-length distributions are indispensable to this end. However, it is necessary to complement these tools with physically accurate models of the interconnection media. Such analytical models for normally conducting, repeatered, superconducting, and optical interconnections, which take into account the skin effect, both unterminated and terminated lines, optimization of repeater configurations, superconducting penetration depth and critical current densities, optical diffraction, and similar effects, have been developed in [43, 44] and subsequently applied to determine the limitations of these interconnection media and their relative strengths and weaknesses [43, 44, 40, 55-57, 36, 58]. Treating inverse signal delay S and bandwidth B as performance
parameters, these studies characterize systems with N elements by surfaces of physical possibility in S-B-N space, which are to be compared with surfaces of algorithmic necessity in the same space. This approach has allowed comparative studies of different interconnection media to move beyond comparisons of isolated electrical and optical lines, to evaluation of the effects of their different characteristics at the system level. These studies clearly show the benefit of optical and superconducting interconnections for larger systems. One of the most striking results obtained is that there is an absolute bound on the total rate of information that can be swapped from one side of an electrically connected system to the other, and that this bound is independent of scaling. Such a bound does not exist for optics and superconductors [43, 59].
An interesting extension is to allow the longer lines in a system to be of greater width, to keep their RC delays within bounds. Use of the calculus of variations has shown that the widths of lines should be chosen proportional to the cube root of their length for two-dimensional layouts and to the fourth root of their length for three-dimensional layouts [60]. Staircase approximations to these analytical expressions can serve as practical design guidelines.
These studies have also been extended to determine how electrical and optical interconnections can be used together. It is generally accepted that optics is favorable for the longer lines in a system, whereas the shorter lines should be electrical. Results based on comparisons of isolated lines may not be of direct relevance in a system context. The proper question to ask is not "Beyond what length must optical interconnections be used?", but "Beyond how many logic elements must optical interconnections be used?". Studies have determined that optical interconnections should take over around the level of 10^4-10^6 elements [61-63].
This body of work has demonstrated that inverse-power-law line-length distributions are very suitable for such studies. This is because distributions which decay faster, such as an exponential distribution, effectively behave like fully local distributions in which connections do not reach out beyond a bounded number of neighbors. Such layouts are essentially similar to nearest-neighbor connected layouts, and are already covered by Rent's rule when we choose n = e. On the other hand, for any layout in which the number of connections per element is bounded, the behavior is at worst similar to that described by a Rent exponent of unity. Thus, although not all systems may exhibit a precise inverse-power-law distribution of line lengths, Rent's rule is nevertheless sufficient to represent the range of general interest.
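As a sketch of the line-width rule cited in the paragraph on RC delays above: widths scale as the cube root of line length in two dimensions and the fourth root in three dimensions [60]. The base width, the sample lengths, and the staircase brackets below are illustrative assumptions, included only to show how a staircase approximation of the analytical rule might look.

```python
base_width_um = 0.5     # width assigned to a unit-length line (illustrative)

def optimal_width(length, dims=2):
    """Width proportional to length^(1/3) in 2-D layouts, length^(1/4) in 3-D layouts."""
    exponent = 1/3 if dims == 2 else 1/4
    return base_width_um * length**exponent

def staircase_width(length, steps=(1, 8, 64, 512)):
    """Practical staircase approximation: one width per length bracket (2-D rule)."""
    bracket = max(s for s in steps if s <= length)
    return optimal_width(bracket, dims=2)

for L in (1, 10, 100, 1000):
    print(f"L={L:>5}: 2-D {optimal_width(L, 2):.2f} um, "
          f"3-D {optimal_width(L, 3):.2f} um, staircase {staircase_width(L):.2f} um")
```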
5 Conclusion
We believe that many criticisms of Rent's rule are the result of not allowing the Rent exponent and dimensionality to vary as we ascend the hierarchy, and of a failure to recognize discontinuities. It seems that in most cases of practical interest the decomposition function k(N) is piecewise smooth with a finite number of
discontinuities. The role of discontinuities in an otherwise smooth decomposition function, and whether it is beneficial to construct systems in the form of a hierarchy of functionally complete entities, are less well understood issues. Is it functionally desirable to construct systems that way, or do physical and technical limitations force us to? Parts of this work appeared in or were adapted from [8].
References
1. B. Bollobas. Graph Theory: An Introductory Course. Springer, Berlin, 1979.
2. G. Strang. Introduction to Applied Mathematics. Wellesley-Cambridge Press, Wellesley, Massachusetts, 1986.
3. H. N. V. Temperley. Graph Theory and Applications. Ellis Horwood Ltd., Chichester, 1981.
4. J. D. Ullman. Computational Aspects of VLSI. Computer Science Press, Rockville, Maryland, 1984.
5. T. C. Hu and E. S. Kuh. VLSI Circuit Layout: Theory and Design. IEEE Press, New York, 1985.
6. S. N. Bhatt and F. T. Leighton. A framework for solving VLSI layout problems. J Computer System Sciences, 28:300-343, 1984.
7. C. E. Leiserson. Area-Efficient VLSI Computation. The MIT Press, Cambridge, Massachusetts, 1983.
8. H. M. Ozaktas. Paradigms of connectivity for computer circuits and networks. Optical Engineering, 31:1563-1567, 1992.
9. W. E. Donath. Placement and average interconnection lengths of computer logic. IEEE Trans Circuits Systems, 26:272-277, 1979.
10. L. Pietronero. Fractals in physics: Introductory concepts. In S. Lundqvist, N. H. March, and M. P. Tosi, eds., Order and Chaos in Nonlinear Physical Systems. Plenum, New York, 1988.
11. B. S. Landman and R. L. Russo. On a pin versus block relationship for partitions of logic graphs. IEEE Trans Computers, 20:1469-1479, 1971.
12. R. L. Russo. On the tradeoff between logic performance and circuit-to-pin ratio for LSI. IEEE Trans Computers, 21:147-153, 1972.
13. D. K. Ferry. Interconnection lengths and VLSI. IEEE Circuits Devices Mag, pages 39-42, July 1985.
14. B. B. Mandelbrot. Fractals: Form, Chance and Dimension. W. H. Freeman, San Francisco, 1977.
15. P. Christie, J. E. Cotter, and A. M. Barrett. Design and simulation of optically interconnected computer systems. In Interconnection of High Speed and High Frequency Devices and Systems, Proc SPIE, 947:19-24, 1989.
16. W. E. Donath. Wire length distribution for placements of computer logic. IBM J Research Development, 25:152-155, 1981.
17. M. Feuer. Connectivity of random logic. IEEE Trans Computers, 31:29-33, 1982.
18. B. B. Mandelbrot. The Fractal Geometry of Nature. W. H. Freeman, New York, 1983.
19. B. B. Mandelbrot. Information theory and psycholinguistics: A theory of word frequencies. In P. F. Lazarsfeld and N. W. Henry, eds., Readings in Mathematical Social Science. The MIT Press, Cambridge, Massachusetts, 1968.
20. B. B. Mandelbrot. The Pareto-Levy law and the distribution of income. Int Economic Review, 1:79-106, 1960.
21. I. E. Sutherland and D. Oestreicher. How big should a printed circuit board be? IEEE Trans Computers, 22:537-542, 1973.
22. W. R. Heller, W. F. Mikhail, and W. E. Donath. Prediction of wiring space requirements for LSI. J Design Automation Fault Tolerant Computing, 2:117-144, 1978.
23. A. El Gamal. Two-dimensional stochastic model for interconnections in master slice integrated circuits. IEEE Trans Circuits Systems, 28:127-134, 1981.
24. A. C. Hartmann and J. D. Ullman. Model categories for theories of parallel systems. In G. J. Lipovski and M. Malek, eds., Parallel Computing: Theory and Experience. Wiley, New York, 1986.
25. W. J. Dally. A VLSI Architecture for Concurrent Data Structures. Kluwer, Norwell, Massachusetts, 1987.
26. R. W. Keyes. The wire-limited logic chip. IEEE J Solid State Circuits, 17:1232-1233, 1982.
27. R. W. Keyes. Communication in computation. Int J Theoretical Physics, 21:263-273, 1982.
28. R. F. Thompson. The Brain. W. H. Freeman and Company, New York, 1985.
29. C. A. Neugebauer. Unpublished manuscript.
30. W. E. Donath. Stochastic model of the computer logic design process. Tech Rep RC 3136, IBM T. J. Watson Research Center, Yorktown Heights, New York, 1970.
31. W. E. Donath. Equivalence of memory to 'random logic'. IBM J Research Development, 18:401-407, 1974.
32. P. Christie and S. B. Styer. Fractal description of computer interconnection distributions. In Microelectronic Interconnects and Packaging: System and Process Integration, Proc SPIE, 1390, 1990.
33. P. Christie. Clouds, computers and complexity. In S. K. Tewksbury, ed., Frontiers of Computing Systems Research, Volume 2, pages 197-238. Plenum, New York, 1991.
34. H. M. Ozaktas, K.-H. Brenner, and A. W. Lohmann. Interpretation of the space-bandwidth product as the entropy of distinct connection patterns in multifacet optical interconnection architectures. J Optical Society America A, 10:418-422, 1993.
35. H. M. Ozaktas. Levels of abstraction in computing systems and optical interconnection technology. In P. Berthome and A. Ferreira, eds., Optical Interconnections and Parallel Processing: Trends at the Interface, chapter 1. Kluwer, Dordrecht, The Netherlands, 1998.
36. H. M. Ozaktas and J. W. Goodman. Comparison of local and global computation and its implications for the role of optical interconnections in future nanoelectronic systems. Optics Communications, 100:247-258, 1993.
37. H. M. Ozaktas and J. W. Goodman. Organization of information flow in computation for efficient utilization of high information flux communication media. Optics Communications, 89:178-182, 1992.
38. A. L. Rosenberg. Three-dimensional VLSI: a case study. J Assoc Computing Machinery, 30:397-416, 1983.
39. F. T. Leighton and A. L. Rosenberg. Three-dimensional circuit layouts. J Computer System Sciences, 15:793-813, 1986.
40. H. M. Ozaktas and M. F. Erden. Comparison of fully three-dimensional optical, normally conducting, and superconducting interconnections. In 2nd Workshop on Optics and Computer Science, April 1, 1997, Geneva. Submitted to Applied Optics.
41. H. M. Ozaktas and J. W. Goodman. Lower bound for the communication volume required for an optically interconnected array of points. J Optical Society America A, 7:2100-2106, 1990.
42. H. M. Ozaktas, Y. Amitai, and J. W. Goodman. A three dimensional optical interconnection architecture with minimal growth rate of system size. Optics Communications, 85:1-4, 1991.
43. H. M. Ozaktas and J. W. Goodman. The limitations of interconnections in providing communication between an array of points. In S. K. Tewksbury, ed., Frontiers of Computing Systems Research, Volume 2, pages 61-124. Plenum, New York, 1991.
44. H. M. Ozaktas. A Physical Approach to Communication Limits in Computation. PhD thesis, Stanford University, California, 1991.
45. H. M. Ozaktas, Y. Amitai, and J. W. Goodman. Comparison of system size for some optical interconnection architectures and the folded multi-facet architecture. Optics Communications, 82:225-228, 1991.
46. H. M. Ozaktas and D. Mendlovic. Multi-stage optical interconnection architectures with least possible growth of system size. Optics Letters, 18:296-298, 1993.
47. R. W. Keyes. The Physics of VLSI Systems. Addison-Wesley, Reading, Massachusetts, 1987.
48. R. W. Keyes. Fundamental limits in digital information processing. Proc IEEE, 69:267-278, 1981.
49. R. W. Keyes. The evolution of digital electronics towards VLSI. IEEE Trans Electron Devices, 26:271-279, 1979.
50. H. B. Bakoglu. Circuits, Interconnections and Packaging for VLSI. Addison-Wesley, Reading, Massachusetts, 1990.
51. H. M. Ozaktas, H. Oksuzoglu, R. F. W. Pease, and J. W. Goodman. Effect on scaling of heat removal requirements in three-dimensional systems. Int J Electronics, 73:1227-1232, 1992.
52. W. Nakayama. On the accommodation of coolant flow paths in high density packaging. IEEE Trans Components, Hybrids, Manufacturing Technology, 13:1040-1049, 1990.
53. W. Nakayama. Heat-transfer engineering in systems integration: outlook for closer coupling of thermal and electrical designs of computers. IEEE Trans Components, Packaging, Manufacturing Technology, Part A, 18:818-826, 1995.
54. A. Masaki. Electrical resistance as a limiting factor for high performance computer packaging. IEEE Circuits Devices Mag, pages 22-26, May 1989.
55. H. M. Ozaktas. Fundamentals of optical interconnections: a review. In Proc Fourth Int Conf Massively Parallel Processing Using Optical Interconnections, pages 184-189, IEEE Computer Society, Los Alamitos, California, 1997. (Invited paper, June 22-24, 1997, Montreal.)
56. H. M. Ozaktas. Toward an optimal foundation architecture for optoelectronic computing. Part I. Regularly interconnected device planes. Applied Optics, 36:5682-5696, 1997.
57. H. M. Ozaktas. Toward an optimal foundation architecture for optoelectronic computing. Part II. Physical construction and application platforms. Applied Optics, 36:5697-5705, 1997.
58. H. M. Ozaktas and J. W. Goodman. The optimal electromagnetic carrier frequency balancing structural and metrical information densities with respect to heat removal requirements. Optics Communications, 94:13-18, 1992.
59. D. A. B. Miller and H. M. Ozaktas. Limit to the bit-rate capacity of electrical interconnects from the aspect ratio of the system architecture. J Parallel Distributed Computing, 41:42-52, 1997.
60. H. M. Ozaktas and J. W. Goodman. Optimal linewidth distribution minimizing average signal delay for RC limited circuits. Int J Electronics, 74:407-410, 1993.
61. H. M. Ozaktas and J. W. Goodman. Elements of a hybrid interconnection theory. Applied Optics, 33:2968-2987, 1994.
62. H. M. Ozaktas and J. W. Goodman. Implications of interconnection theory for optical digital computing. Applied Optics, 31:5559-5567, 1992.
63. A. V. Krishnamoorthy, P. J. Marchand, F. E. Kiamilev, and S. C. Esener. Grain-size considerations for optoelectronic multistage interconnection networks. Applied Optics, 31:5480-5507, 1992.
Optoelectronic-VLSI Technology: Terabit/s I/O to a VLSI Chip

Ashok V. Krishnamoorthy
Bell Labs, Lucent Technologies, Holmdel, NJ 07733
The concept of a manufacturable technology that can provide parallel optical interconnects directly to a VLSI circuit, proposed over 15 years ago in [1], now appears to be a reality. One such optoelectronic-VLSI (OE-VLSI) technology is based on the hybrid flip-chip area-bonding of GaAs/AlGaAs Multiple-Quantum-Well (MQW) electro-absorption modulator devices directly onto active silicon CMOS circuits. The technology has reached the point where batch-fabricated foundry shuttles incorporating multiple OE-VLSI chip designs are now being run [2]. These foundry shuttles represent the first delivery of custom-designed CMOS VLSI chips with surface-normal optical I/O technology. From a systems point of view, this represents an important step towards the entry of optical interconnects in that the silicon integrated circuit is state-of-the-art, the circuit is unaffected by the integration process, and the architecture, design, and optimization of the chip can proceed independently of the placement and bonding of the optical I/O. To date, over 5760 MQW modulator devices have been integrated onto a single CMOS IC with a device yield exceeding 99.95%. Each bonded device has a load capacitance of approximately 50 fF (65 fF including a 15 µm × 15 µm bond pad) and can be driven by a CMOS inverter to accomplish the electrical-to-optical interface. Compact CMOS transimpedance receiver circuits have been developed to perform the photocurrent-to-logic-level voltage conversion. Operation of single-ended receivers [3] (one diode per optical input), fabricated in a 0.35 µm linewidth CMOS technology, has been demonstrated above 1 Gbit/s with a measured bit-error rate below 10^-10. Differential two-beam receivers have similarly been operated to over 1 Gbit/s. The
receiver circuits mentioned above have static power dissipation in the range of 3.5-8 mW per receiver. More recently, arrays of up to 256 active light sources known as Vertical-Cavity Surface-Emitting Lasers (VCSELs) have also been bonded directly to CMOS VLSI chips [4], with each VCSEL capable of over 1 Gbit/s modulation by the CMOS circuits.
Before such a technology can be deployed on a large scale, several issues related to the scalability of the optoelectronic technology and its compatibility with deep-submicron CMOS technologies must be addressed. In terms of the modulator technology, the challenges are to reduce the drive voltages of the modulators to stay compatible with sub-micron CMOS technologies, and to continue to improve the yield in the manufacturing and hybridizing of the MQW diodes. In terms of the VCSELs, the challenge will be to produce arrays of power-efficient VCSELs that can be attached to CMOS circuits with high yield and simultaneously operated at high speeds [5]. In terms of the circuits, the challenges will be to continue to improve receiver sensitivity while reducing power dissipation and cross-talk. A final consideration is that of systems integration, where the challenge will be to package systems that can efficiently transport large arrays of light beams to and from such chips.
Based on relatively conservative assumptions on how these components will evolve, a general conclusion is that this hybrid optical I/O technology appears to have substantial room for continued scaling to large numbers of higher-speed interconnects [6]. Indeed, future OE-VLSI technologies (whether modulator-based or VCSEL-based) can be expected to provide an I/O bandwidth to a chip that is commensurate with the processing power of the chip, even in the finest-linewidth silicon: a task that cannot be expected from conventional electrical interconnect technologies. Initial work on space-division crossbar OE-VLSI switches has suggested that terabit capacities are achievable. The availability of optical access to high-speed RAM [7] will also permit the development of shared-memory (SRAM)-based switches: a goal that cannot be achieved with conventional space-division photonic switching technologies. It is anticipated that the availability of such an OE-VLSI technology
will enable terabit-per-second throughput switches with power dissipations on the order of 20-50mW per Gigabit/s of switch throughput.
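Putting the figures quoted in this paper together gives a back-of-the-envelope sense of the chip- and switch-level totals. The per-channel rate is taken as the demonstrated 1 Gbit/s, the 1 Tbit/s switch size is an assumed example, and the assumption that every bonded device carries traffic (or, for the power estimate, is a receiver) is ours; the remaining numbers are those stated above.

```python
num_devices = 5760              # MQW modulators bonded to one CMOS chip
rate_gbps = 1.0                 # demonstrated per-channel rate (Gbit/s)
rx_power_mw = (3.5, 8.0)        # static receiver dissipation range (mW per receiver)

aggregate_tbps = num_devices * rate_gbps / 1000
print(f"Aggregate optical I/O: {aggregate_tbps:.2f} Tbit/s")          # ~5.8 Tbit/s

rx_lo, rx_hi = (num_devices * p / 1000 for p in rx_power_mw)
print(f"Receiver power if every channel were an input: {rx_lo:.0f}-{rx_hi:.0f} W")

# Projected switch dissipation: 20-50 mW per Gbit/s of throughput,
# applied to an assumed 1 Tbit/s (1000 Gbit/s) switch.
switch_gbps = 1000
sw_lo, sw_hi = (switch_gbps * p / 1000 for p in (20, 50))
print(f"Estimated 1 Tbit/s switch power: {sw_lo:.0f}-{sw_hi:.0f} W")  # 20-50 W
```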
References:
1. J. W. Goodman, F. J. Leonberger, S.-Y. Kung, and R. A. Athale, "Optical interconnections for VLSI systems," Proceedings of the IEEE, vol. 72, no. 7, pp. 850-866, July 1984.
2. A. V. Krishnamoorthy and K. W. Goossen, "Optoelectronic-VLSI: photonics integrated with VLSI circuits," IEEE Jour. Sel. Topics in Quantum Elec., Vol. 4, pp. 899-912, December 1998.
3. A. L. Lentine et al., "Optoelectronic VLSI switching chip with over 1 Tbit/s potential optical I/O bandwidth," Electronics Letters, Vol. 33, No. 10, pp. 894-895, May 1997.
4. A. V. Krishnamoorthy et al., "Vertical cavity surface emitting lasers flip-chip bonded to gigabit/s CMOS circuits," Photonics Technology Letters, Vol. 11, pp. 128-130, January 1999.
5. A. V. Krishnamoorthy et al., "16x16 VCSEL array flip-chip bonded to CMOS," OSA Top. Meet. Optics in Computing (Snowmass), Postdeadline PD3, April 1999.
6. A. V. Krishnamoorthy and D. A. B. Miller, "Scaling Optoelectronic-VLSI circuits into the 21st century: a technology roadmap," IEEE J. Special Topics in Quant. Electr., Vol. 2, pp. 55-76, April 1996.
7. A. V. Krishnamoorthy et al., "CMOS Static RAM chip with high-speed optical read-write," IEEE Photonics Technology Letters, Vol. 9, pp. 1517-1519, November 1997.
Three Dimensional VLSI-Scale Interconnects

Dennis W. Prather
University of Delaware, Department of Electrical and Computer Engineering
Newark, DE 19716
email: [email protected]

Abstract. As processor speeds rapidly approach the Gigahertz regime, the disparity between processing time and memory access time plays an increasing role in the overall limitation of processor performance. In addition, limitations in interconnect density and bandwidth serve to exacerbate current bottlenecks, particularly as computer architectures continue to reduce in size. To address these issues, we propose a 3D architecture based on through-wafer vertical optical interconnects. To facilitate integration into the current manufacturing infrastructure, our system is monolithically fabricated in the silicon substrate and preserves scale of integration by using meso-scopic diffractive optical elements (DOEs) for beam routing and fan-out. We believe that this architecture can alleviate the disparity between processor speeds and memory access times while increasing interconnect density by at least an order of magnitude. We are currently working to demonstrate a prototype system that consists of vertical cavity surface emitting lasers (VCSELs), diffractive optical elements, photodetectors, and processor-in-memory (PIM) units integrated on a single silicon substrate. To this end, we are currently refining our fabrication and design methods for the realization of meso-scopic DOEs and their integration with active devices. In this paper, we present our progress to date, demonstrate vertical data transmission using DOEs, and discuss the application of our architecture, which is a multi-PIM (MPM) system.
Introduction
As modern-day technologies continue to develop, an increasing number of applications are resorting to computation-based simulation as a tool for research and development. However, as simulation tools strive to incorporate more realistic properties, their computational requirements quickly increase and in many cases surpass what is currently available. The result is a seemingly perpetual demand to process more information in shorter time frames. Moreover, while current computer architectures are steadily improving, they are not keeping pace with the requirements of more sophisticated applications, and for some applications they are in fact falling behind. To this end, new-paradigm computer architectures need to be developed.
The current paradigm for addressing this shortcoming is simply to incorporate smaller devices into larger die. However, while this does enable the design and realization of more sophisticated circuits, it also exacerbates an already serious problem, namely the interconnection and packaging of the devices and components within the system. For example, according to the National Technology Roadmap for Semiconductors, processors based on 1 µm fabrication have a transistor-to-interconnect delay ratio of 10:1 (assuming a 1 mm long interconnect), whereas for the same processor based on 0.1 µm fabrication the ratio is 1:100. This represents a shift in emphasis of three orders of magnitude. As a result, alternative interconnect and packaging technologies need to be developed. Therefore, in this paper we report on our work in addressing these technological barriers by designing an embedded processor-in-memory (PIM) architecture realized using an optically interconnected three-dimensional (3D) package.
While conventional 3D packaging increases circuit density, decreases interconnect delay, and reduces critical interconnect path lengths, its full potential has yet to be realized. This is due mainly to the capacitive and inductive loading effects of vertical vias, which reduce bandwidth and allow for only a 1-to-1 interconnect. To overcome these limitations we propose an alternate approach based on recent advances in micro-optical technology.
Our approach uses vertical-cavity surface-emitting lasers (VCSELs) that are flip-chip bonded onto CMOS drivers. The VCSELs have a 1.3 µm wavelength, which is transparent to the silicon wafer. The VCSELs are oriented such that the output beam is directed vertically through the silicon wafer. However, before the beam enters the wafer it is incident on a VLSI-scale diffractive optical element (DOE) that not only focuses the beam onto a subsequent wafer, but also performs a 1-to-N fan-out (N can range from 1 to 50 depending on the area used for the DOE). This allows for nearly real-time data routing and distribution, which is essential to overcome conventional computational bottlenecks. However, before presenting further details of our approach we first motivate our PIM-based architecture.
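The size of the shift in the delay balance quoted from the roadmap can be checked directly; the two ratios below are the ones stated above, for a 1 mm interconnect.

```python
ratio_1um  = 10 / 1     # transistor-to-interconnect delay ratio at 1 um fabrication
ratio_01um = 1 / 100    # same ratio at 0.1 um fabrication

shift = ratio_1um / ratio_01um
print(f"Relative shift in delay balance: {shift:.0f}x")  # 1000x, i.e. three orders of magnitude
```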
PIM Motivation
A current trend in computer system design is to develop architectures based on the integration of a large number of smaller, simpler processing cores that work together in unison. The idea is that such processors can be integrated directly into random access memory (RAM) to simplify the memory hierarchy, i.e., the level-1 and level-2 caches, and thereby streamline processor-to-memory communication. Such systems have been named Intelligent RAM (IRAM), Flexible RAM (FlexRAM), and PIM, as we refer to them. Currently, several high-profile research initiatives sponsored by federal agencies (e.g., the HTMT-PIM project [1,2], the DIVA project [3], and the FlexRAM project [4]) are investigating many of the architectural and system design issues related to the
implementation of PIM-based systems. In fact, IBM recently announced the Blue Gene project [5], which anticipates an industry investment on the order of $100M to produce a petaFLOPS-scale machine based on thousands of PIM components. Therefore, even though PIM-based architectures are not currently used in commercial machines, they promise to overcome the limitations of conventional computer architectures.
However, in general, the amount of memory and the processing capability of individual PIMs is limited; therefore the construction of PIM-based high-performance systems will require the integration of upwards of tens of thousands of PIMs. Thus the integration of multiple PIMs into a single package will be absolutely essential to reduce latencies, increase communication bandwidth between PIMs, reduce power consumption, and reduce the integration cost of the entire system. Therefore the problem addressed by this research is the implementation of a Multiple PIM Module (MPM) to harness the processing capability and the memory storage capability of multiple PIMs in a single computational module. An MPM can be used as the building block to implement mobile computers as proposed by the MIT RAW project. It can be used as the basic building block for computer systems specialized in data-intensive computation, as proposed in the DIVA project. And it can be a building block for the DPIM region of a large-scale, high-performance computer such as the one proposed in the HTMT project.
Some of the open research problems in the implementation of an MPM and in its use in a system architecture are: (1) How do the multiple PIMs that form the MPM communicate and synchronize with each other? (2) Is it possible to design and implement a fast and versatile interconnection between the multiple PIMs in the MPM? (3) How can MPMs be programmed, and how can the interconnection be adapted for new communication pathways? And (4) how does the runtime system control MPMs to ensure that communications and synchronizations are performed in the most efficient way according to the needs of the application program?
To address these issues, we are developing a technology based on the interconnection of multiple PIMs within a single MPM via arrays of vertical-cavity surface-emitting lasers (VCSELs) and SiGe detectors that are vertically interconnected through the silicon wafer using a DOE. This technology allows for fast, abundant, and distributed interconnections among the PIMs in a given module. Also, because this approach allows for data distribution at 2-5 GHz rates, it reduces the latency in communication between PIMs to unprecedented levels, and because optical beams can essentially pass right through each other without exchanging information, it all but eliminates the place-and-route problem. Also, each interconnect link in our design would consume approximately 50 mW of power, which when applied to a full 16 × 16 interconnection would consume on the order of 10 Watts of power. This is nearly an order of magnitude less than current architectures, which are limited to only 4 × 4 interconnections.
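The power budget quoted above follows directly from the per-link figure; the sketch below simply multiplies it out, assuming one optical link per transmitter-receiver pair in the full 16 × 16 interconnection.

```python
links = 16 * 16              # full 16 x 16 interconnection
power_per_link_mw = 50       # approximate power per optical link, as stated

total_w = links * power_per_link_mw / 1000
print(f"{links} links x {power_per_link_mw} mW = {total_w:.1f} W")  # ~13 W, order of 10 W
```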
To realize this architecture, three critical technologies must be used: long-wavelength VCSELs (1.3 µm), high-speed (2-5 Gbit/s) CMOS drivers for the VCSELs, and VLSI-scale DOEs. To this end we have been working with Gore Photonics for the 1.3 µm VCSELs and developing our own high-speed CMOS drivers, VLSI-scale DOEs, and system integration techniques at UD. In the remainder of the paper we report on our progress in this effort. We begin by motivating the 3D architecture and then describe the component optical technologies needed to realize it.
Optoelectronic Technologies
Whereas the use of optical interconnects in long-haul and local-area networks has proven extremely successful, its use on the VLSI scale has been limited. This is due in large part to the continual increase in speed and performance of conventional electronic devices. However, the issues associated with next-generation PIM architectures cannot be adequately addressed with speed alone. Instead, such systems will require not only the ability to share or distribute information among PIM modules (signal fan-out) but also a significant increase in interconnect density. While the issues of increased bandwidth, interconnect density, and signal fan-out are individually compelling reasons for considering optical interconnects, combined they become persuasive. For example, one possible electronic solution to increasing interconnect density is to use flip-chip, or bump, bonds, which require approximately 20 µm² of chip area while offering only a 1-to-1 interconnect. In comparison, we have designed VLSI-scale diffractive optical elements (DOEs) that within the same area provide a 1-to-16 interconnect. We have experimentally demonstrated a 1-to-4 fan-out and are in the process of fabricating the 1-to-16. For this reason we propose the use of an optoelectronic 3D architecture that uses monolithically integrated VLSI-scale DOEs for application to PIM architectures, as shown in Fig. 1.
[Figure 1 labels: SiGe detectors; VCSEL, 1.3 µm (through-wafer transmission); VLSI (drivers and reprogrammable logic); meso-DOEs; silicon wafer motherboard; inter-layer interconnects.]
Fig. 1. Monolithic interconnect architecture that uses 3D diffractive optical interconnects on the VLSI scale for through-wafer fan-out interconnects for data or clock distribution. Various modules of this architecture can also be stacked together to realize more complex systems.
While the notion of 3D architectures is appealing, due to the efficient use of power and increased processing and interconnect densities, few of the systems
proposed in the literature have received widespread use. The reasons depend on the technology being used. For instance, all-electronic architectures suffer from either reduced communication bandwidth, due to routing the inter-layer interconnects through the periphery of the 3D stack, or reduced interconnection density, due to the inability to distribute data between layers using 1-to-1 bump bonds. Along the same lines, optical architectures suffer from input/output coupling inefficiencies for waveguide-based approaches, limited interconnect density and distribution for 1-to-1 emitter-receiver-based approaches, and scale of integration for bulk optical systems. Thus, we believe that in order for an optical interconnect system to be viable it must satisfy the following conditions: (1) It must have a scale of integration comparable to VLSI, to preserve scales of integration. (2) The optical system must be monolithic in the silicon substrate, in order to alleviate alignment issues and improve system reliability. And (3) the fabrication methods and materials used must be compatible with the current manufacturing infrastructure, in order to reduce the cost of implementation. In the design of our architecture we strictly adhere to these conditions.
Our approach is based on our recent progress in the development of suitable design tools, which enable the design of VLSI-scale DOEs for monolithic integration with active devices. As a result, we have been able to significantly increase the interconnection density as compared to all-electronic vertical interconnections, as illustrated in Fig. 2, which shows a DOE that occupies 10 µm² and provides a 1-to-4 fan-out. If this DOE is tiled over a 20 µm² area, equivalent to that of a bump bond, it would provide a 1-to-16 fan-out in comparison to a 1-to-1, which represents more than an order of magnitude increase in interconnect density. In addition to increasing density, this approach significantly simplifies the place-and-route problem because optical beams do not exchange information and can therefore accommodate overlap in the routing process.
In order to realize optical interconnections within a silicon wafer and on a scale comparable with VLSI circuits, one must be able to heterogeneously integrate active and passive optical devices together on a scale comparable to microelectronic devices. This must also be done in such a way that the ability to control and redirect light in a general fashion is preserved, e.g., off-axis focusing, mode shaping, and beam fan-out. Whereas active optical devices, such as emitters, detectors, and modulators, are readily designed and fabricated with dimensions on the micron scale, until recently passive optical elements capable of such general behavior were not. However, recent advances in both the design and fabrication of diffractive structures [6] now enable the integration of active and passive optical devices on the VLSI scale and the ability to efficiently control and redirect light in a general fashion; see Fig. 3. Thus, the integration of VCSELs with wavelength-scale fan-out DOEs on the VLSI scale offers not only an order of magnitude improvement (in terms of density, bandwidth, and power consumption) but also the ability to design architectures that heretofore have not been possible. As a result, new optical interconnect architectures can now be developed.
Fig. 2. Illustration of a three-dimensional subwavelength off-axis lenslet array used for 1-to-4 fan-out on the VLSI scale: (a) DOE, (b) intensity image in the focal plane, and (c) line scan through the focal plane. Results were generated using a 3D FDTD diffraction model.
Fig. 3. Illustration of a VLSI-scale 1-to-5 fan-out DOE, computed using the boundary element method. The width of the DOE is 120 microns and the focal length is 100 microns.
Recently we have fabricated and experimentally validated these elements and are currently preparing them for system-level integration [7]. However, critical to the successful completion of this effort is the ability to fabricate DOEs that have feature sizes on the nanometer scale. Although many fabrication techniques for DOEs exist, by far the most general and widely used is the microelectronics photolithographic process. In this technique the profile of a DOE is realized by etching micro-relief patterns into the surface of either conducting or dielectric substrates. A curved surface profile is realized by using a multi-step process which produces a stair-step approximation. Using this fabrication process, DOEs with diffraction efficiencies on the order of 95% have been fabricated. Unfortunately, as the scale of a DOE is reduced, the alignment process needed for multi-step profiles becomes exceedingly difficult. As a result, alternate fabrication methods based on single-step gray-scale lithography and direct electron-beam (e-beam) exposure have been developed.
In the gray-scale process one wishes to realize continuous profiles, or structures. However, for devices on the VLSI scale, current fabrication technology limits us to a discrete number of levels, typically 4-8. Thus, we can currently fabricate our DOEs using a gray-scale technique which produces multilevel structures from a single processing step, as shown in Fig. 4. To this end, we designed our multilevel masks in the lab and used an outside vendor [8] to provide the gray-scale mask. Once we have the mask, we deposit an initial height of photoresist on the substrate, i.e., the silicon wafer, which can be precisely controlled by adjusting the spin rate at the time of deposition. Through experimentation, we have characterized the response of the photoresist to various degrees of UV exposure. This allows us to precisely designate the correct transmission levels of the mask to create our multi-level DOE profiles in the photoresist. After the gray-scale photolithography, the pattern is transferred into the surface of the silicon substrate using a Plasmatherm 790 series reactive ion etching (RIE) system. Careful calibration of the RIE process is required to achieve structures with smooth surfaces and submicron feature resolution while preserving the height of the initial profile.
[Figure 4 panels: Step 1, glass spin-coated with a thin film of photoresist; Step 2, UV exposure through the gray-scale mask; Step 3, develop; Step 4, etching.]
Fig. 4. Graphical illustration of the gray-scale photolithographic fabrication process.
An alternative fabrication method based on direct e-beam writing can also be used to fabricate VLSI-scale DOEs. In this approach a high-energy electron beam is
used to expose a photoresist-coated substrate. As the substrate is exposed, the energy level of the e-beam is varied in accordance with the desired DOE profile. Once developed, the substrate is etched, using techniques such as reactive ion etching, to transfer the continuous photoresist profile into the substrate; see Fig. 5. This process is capable of fabricating binary DOE profiles that have feature sizes on the order of 60 nm, which is several times smaller than the wavelength of illumination. As a result, efficiencies exceeding those predicted by scalar diffraction theory can be achieved [9]. Through collaboration with Axel Scherer of Caltech we have recently had several DOEs fabricated, as shown in Fig. 6.
[Figure 5 labels: drive signal; direct e-beam exposure of the photoresist-coated substrate; translate; develop and ion gas etch; diffractive element.]
Fig. 5. Fabrication process for continuous profile DOEs based on direct electron beam write.
Fig. 6. Illustration of a mesoscopic diffractive lens having a diameter of 36µm, a focal length of 65µm and a minimum feature size of 60nm. The element was fabricated by Dr. Axel Scherer, of the California Institute of Technology.
In addition to developing the theoretical and experimental framework necessary to design and realize DOEs we have developed a novel system for characterizing their performance.
Our system consists of a microscope objective (20X) and a 1-inch-diameter lens. The system has an overall magnification of 4.2 (based on the ratio of the two focal lengths, f2/f1) and is able to resolve 1 micron minimum features. The entire imaging system is mounted on an x-z translation stage, as shown in Fig. 7. Because the object and image planes in this system are fixed and well defined, they can be used to determine the axial location relative to the DOE, i.e., the reference plane for z = 0. This is achieved by translating the imaging system toward the DOE until the surface is imaged onto the CCD. Subsequently, the translation stage, with the entire imaging system on it, is translated back to the plane of interest, i.e., z = z0. Because the microscope objectives have large numerical apertures, the performance of the imaging system, i.e., its modulation transfer function (MTF), reproduces the intensity profile in the object plane, i.e., the observation plane, with excellent fidelity.
Fig. 7. Micro 4f imaging system for characterizing mesoscopic diffractive optical elements (DOE and object plane, microscope objective, 1-inch lens, image plane with CCD, all mounted on a translation stage).
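A quick numerical sketch of the imaging geometry just described. The individual focal lengths and the CCD pixel pitch below are assumed values chosen for illustration; only their ratio, the stated overall magnification of 4.2, comes from the text.

```python
f1_mm = 10.0     # assumed focal length of the 20x microscope objective (mm)
f2_mm = 42.0     # assumed focal length of the 1-inch relay lens (mm)

magnification = f2_mm / f1_mm        # 4f relay: M = f2 / f1
print(f"Magnification: {magnification:.1f}")        # 4.2, as quoted

ccd_pixel_um = 6.7                   # hypothetical CCD pixel pitch
print(f"Object-plane sampling: {ccd_pixel_um / magnification:.2f} um per pixel")
```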
To validate our electromagnetic design models, we used the system to measure the light diffracted from a precision pin-hole 71 µm in diameter, illuminated by a collimated incident wave at 0.633 µm. We then calculated the diffracted light using both scalar diffraction theory and our electromagnetic model; results for z = 350 µm are shown in Fig. 8. Additional measurements were made along the z-axis and showed the same level of agreement. To illustrate the utility of this system we used it to characterize the diffractive lens shown in Fig. 6; the results are shown in Fig. 9. Once confident that our design and fabrication methods were working, we then applied them to the realization of through-silicon-wafer DOEs [7].
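One standard way to see why a full electromagnetic model is compared against scalar theory for this measurement is the Fresnel number of the geometry, which indicates how far the observation plane is from the far-field regime; the framing in terms of the Fresnel number is ours, but the numbers are those quoted above.

```python
wavelength_um = 0.633      # HeNe wavelength (633 nm)
radius_um = 71 / 2         # pin-hole radius
z_um = 350                 # observation distance behind the pin-hole

fresnel_number = radius_um**2 / (wavelength_um * z_um)
print(f"Fresnel number: {fresnel_number:.1f}")   # ~5.7 -> near-field (Fresnel) regime
```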
Integration
In order to achieve optical interconnects on a single silicon die, we must be able to integrate emitters, detectors, drivers, and DOEs on the VLSI scale. Our
approach toward integration will be to construct a hybrid system using flip-chip bonding. For this part of the project we will use a SEC Omnibonder 860 flip-chip bonding machine to construct a multichip module for the integration of the active and passive optical devices with their electronic counterparts. Figure illustrates the integration of an 8 × 8 CMOS driver array with an 8 × 8 980nm VCSEL array.
Fig. 8. Comparison between experimental results and theoretical predictions for the diffraction from a precision pin-hole that had a diameter of 71 microns at an axial location of 350 microns.
Fig. 9. Overlay of the experimental characterization of a mesoscopic diffractive lens and the results predicted from our electromagnetic models. Data were taken from our system using a 40X magnification objective at the location z = 65 µm, the design focal length.

Ultimately, we plan to use 1.3 micron VCSELs as emitters and a silicon substrate as the medium of propagation. However, such long-wavelength VCSELs are not currently available in die form, so we have begun the construction of a pre-prototype system using 850 nm and 980 nm VCSELs on a glass substrate. In this preliminary system, the VCSEL is bonded to a CMOS driver circuit and directed through the DOE, as shown in Fig. 10. Our main concern associated with bonding the VCSEL over a DOE is the air-gap spacing between the VCSEL and the backside of the glass substrate. Since the VCSEL will be flip-chip bonded to the glass surface, the solder bump size, bond pressure, and bond temperature profile will affect the resultant air gap. Additionally, the proximity of the CMOS driver and the VCSEL will be a guiding parameter of the
Fig. 10. Illustration of a VCSEL flip-chip bonded to a CMOS driver circuit. The VCSELs and CMOS drivers were supplied by the U.S. Army Research Laboratory.
bonding temperature profile, since we do not want the first device bonded to detach during the second bond. Most likely, we will choose to bond the CMOS driver first in order to maximize control over the air-gap spacing. That way, heating during the bonding of the driver will not affect the final VCSEL position.
Summary
We have discussed the motivation for chip-level optical interconnects and proposed a 3D architecture that offers higher-bandwidth interconnect density in comparison to conventional architectures. We have also discussed a potential application for our architecture based on a multi-processor-in-memory system. To this end, we demonstrated through-wafer optical fan-out using VLSI-scale DOEs and long-wavelength VCSELs (courtesy of Gore Photonics). Flip-chip bonding gives us the ability to integrate active and passive devices on a single die, and we are currently building a prototype system to demonstrate this integration. The significance of our approach lies in the ability to design optical elements that efficiently control, or redirect, light on the VLSI scale and can be directly integrated into the current VLSI-based manufacturing infrastructure. As such, this technology lends itself nicely to 3D interconnect schemes and facilitates the trend toward higher levels of parallelism in computer architectures.
References [1] T. Sterling, “Achieving petaflops-scale performance through a synthesis of advanced device technologies and adaptive latency tolerant architectures,” in Supercomputing 99, (Portland, OR), Novermber 1999.
Three Dimensional VLSI-Scale Interconnects
1103
[2] P.M. Kogge, J.B. Brockman, T. Sterling, and G. Gao, “Processing-in-memory: chips to petaflops,” in International Symposium on Computer Architecture, (Denver, CO), June 1997. [3] M. Hall, P. Kogge, J. Koller, P. Diniz, J. Chame, J. Draper, J. LaCoss, J. Granacki, A. Srivastava, W. Athas, J. Brockman, V. Freeh, J. Park, and J. Shin, “Mapping irregular applications to DIVA, a PIM-based data-intensive architecture,” in Supercomputing 99, Portland OR, Novermber 1999. [4] Y. Kang, M. Huang, S.M. Yoo, Z.Ge, D. Keen, V. Lam, P. Pattnaik, and J. Torrellas, “Flexram: toward an advanced intelligent memory system,” in International Conference on Computer Design, October 1999. [5]
IBM, “IBM unveils $100 million research initiative to build world’s fastest Semiseek, December 1999.
[6] D.W. Prather, M.S. Mirotznik, and S. Shi, Mathematical Modeling in Optical Science, Ch. Electromagnetic models for finite aperiodic diffractive optical elements, in print, SIAM Frontier Book Series, Society for Industrial and Applied Mathematics, 2000. [7] M. LeCompte, X. Gao, H. Bates, J. Meckle, S. Shi, and D.W. Prather, Threedimensional through-wafer fan-out interconnects,” in Optoelectronics Interconnects VII, SPIE 3952, The International Society Optical Engineering, Bellingham WA, January 2000. [8] Canyon Materials, Inc., San Diego, CA. [9] J.N.Mait, D.W. Prather, and M.S. Mirotznik, “Binary subwavelength diffractivelens design,” Opt. Lett., 23, pp. 1343-1345, September 1998.
Present and Future Needs of Free-Space Optical Interconnects Sadik Esener and Philippe Marchand Electrical and Computer Engineering Department University of California, San Diego, La Jolla, CA 92093, USA
Abstract. Over the last decade significant progress in optoelectronic devices and their integration techniques have made Free-Space Optical Interconnects (FSOI) one of the few physical approaches that can potentially address the increasingly complex communication requirements at the board-to-board and chip-to-chip levels. In this paper, we review the recent advances made and discuss future research directions needed to bring FSOI to the realm of practice. Keywords: Optical Interconnects, Optical Packaging, Micro-optics, OptoElectronics, Free-Space Optical Interconnects
1
Introduction
Exchanging data at high speed over sufficiently long distances is becoming a bottleneck in high performance electronic processing systems [1,2,3]. New physical approaches to dense and high-speed interconnections are needed at various levels of a system interconnection hierarchy starting from the longest interconnections: board to board, MCM to MCM on a board, chip-to-chip on a multi-chip module (MCM), and on-chip. For the next decade, FSOI when combined with electronics offer a potential solution [4,5,6,7,8,9] at the inter and intra-MCM level interconnects promising large interconnection density, high distance-bandwidth product, low power dissipation, and superior crosstalk performance at high-speeds [10,11,12,13].
2
Present Status of FSOI
Opto-Electronic (OE) devices including Vertical Cavity Surface Emitting Lasers (VCSELs), light modulators, and detectors have now been developed to a point that they can enable high speed and high-density FSOI [14,15,16]. Flip-chip bonding offers a convenient approach to their integration with silicon. For example, members of the 3-D OESP consortium (Honeywell Technology Center and University of California, Santa Barbara) have demonstrated FSOI links operating up to 2.5Gb/s between VCSEL arrays and suitable detector arrays. These developments occurred at an opportune time when high performance workstation manufacturers struggle to resolve communication bottlenecks at the board-to-board level. As a result, high efficiency FSOI links between VCSEL and detector arrays has sparkled the interest of
J. Rolim et al. (Eds.): IPDPS 2000 Workshops, LNCS 1800, pp. 1104-1109, 2000. Springer-Verlag Berlin Heidelberg 2000
Present and Future Needs of Free-Space Optical Interconnects
1105
high performance workstation manufacturers such as Sun Microsystems. While board-to-board interconnect solutions using FSOI are now being evaluated by the computer industry, chip-to-chip interconnects are being investigated at a more fundamental level at several universities including UCSD. One of the key issues that needs to be addressed at this level is packaging. Indeed a packaging architecture and associated technologies need to be developed to integrate OE devices and optical components in a way that is fully compatible with conventional electronic multi-chip packages. Recently at UCSD, we developed and demonstrated the operation of a fully packaged FSOI system for multi-chip interconnections capable of sustaining channel data rates as high as 800Mb/s. A picture of this system is shown in Figure 1. A conventional PCB/ceramic board is populated with silicon and OE chips and mated to a FSOI layer that is assembled separately. Design considerations, packaging approaches as well as testing results indicate that it is now possible to build FSOI electronic systems that are compatible in packaging techniques, physical dimensions and used materials with conventional electronics.
Figure 1. Fully packaged FSOI system
The overall packaging approach consists of the assembly of two different packaging modules: the opto-electronic module (multi-chip carrier and the OE chips (VCSEL, MSM and silicon chips), and the optics (FSOI) module. In our approach both modules are assembled separately then snaped on together. A mechanical pinpinhole technique combined with alignment marks makes the alignment of the two modules a rather straightforward task. The optics module is built out of plastic except for the glass optical lenses that were commercially available. In the current demonstration system, four one-dimensional (1D) proton implanted VCSEL arrays (1´12 elements each) and four 1D Metal-Semiconductor-Metal (MSM) detector arrays (1´12) are used as light sources and photodetectors, respectively. The lasers and detectors are on a 250µm pitch. The VCSELs operate at 850nm with 15o-divergence angle (full angle at 1/e2 ), and the detector aperture is 80´80µm. Laser drivers, receiver (amplifiers), and router circuits are integrated on three silicon chips and included into the system. VCSEL arrays are optically connected to their corresponding detector arrays. Data can be fed electrically to any one of the silicon chips and routed to the VCSELs through driver circuits. The silicon chips also contain receiver circuits directly connected to the detectors; thus, data can also be readout electrically from each silicon chip independently.
1106
S. Esener and P. Marchand
In this FSOI demo system, 48 optical channels each operating up to 800Mb/s with optical efficiencies exceeding 90% and inter-channel crosstalk less than -20dB were implemented in a package that occupied less than 5x5x7 cm3. All channels were operational. This packaging technique is now being applied to demonstrate an FSOI connected board that is populated by three 3-D stacks of silicon chips. Each stack contains 16 silicon chips each hosting a 16x16 crossbar switch. In addition each stack is flip-chip bonded to a 16x16 array of VCSELs and detectors and communicates with other stacks via these devices. Thus with this package of very small footprint, 48 silicon chips will be interconnected via FSOI with each other.
Figure 2. Application of UCSD’s chip-to-chip FSOI packaging technique to 3-D stack-tostack communication
3
Present limitations in FSOI and future directions
Although the demonstrations described above are important milestones in the quest for using optics within the board, it also underlines some of the present limitations of FSOI. These shortcomings include the: • height of the optical package • signal integrity and synchronization issues • thermal stability of the assembly • effective CAD tools • ultra low voltage light modulation • costs associated with FSOI. To reduce the height of the package micro-optical elements compatible with oxide confined VCSELs need to be developed and become commercially available. Presently commercially available micro-optical components do not provide simultaneously the necessary high efficiency, low F# and spatial uniformity. In addition, communication within the box requires very low bit error rates. It is therefore critical to use extensive encoding techniques to minimize the error rates in FSOI. To this end there is a need for more silicon real estate and power consumption.
Present and Future Needs of Free-Space Optical Interconnects
1107
As the power in the package is increased passive alignment techniques may not be sufficient. Active alignment techniques based for example on MEMs components or special alignment facilitating OE Array Chip stack OE Array. Chip stack.optical components must be examined. Also, in order to build more complex optoelectronic systems and packages, it is now clear that powerful CAD systems capturing both electronic circuits and sub systems as well as optoelectronic and optical components and sub-systems must be made available. Such a CAD system is not only essential for the optoelectronics sub-system designer but also for the electronics system designer. Furthermore, with the scaling of CMOS circuits, in order to conserve drive voltage compatibility, optoelectronic devices that require very low drive voltages are required. Finally, the cost associated with FSOI is of prime concern. The main cost factors include the optoelectronic devices and their integration as well as the overall packaging. The device costs can only be reduced with manufacturing volume. Therefore it is critical to direct the use of optoelectronic arrays to markets with large volumes including optical data storage and bio-photonics. Further in the future, flip-chip bonding with its associated parasitics and high cost should be replaced with heterogeneous integration technologies at the device and material levels rather than at the chip level. Such technologies have the potential to relieve present layout constraints and ultimately reduce cost.
4
Conclusions
Significant progress both at the device and sub-system levels has been made in FSOI to the point where FSOI can now be considered to push the envelope in computing hardware at the board to board interconnect level. However, at the chip to chip level considerable amount of research and development effort still needs to be conducted. Some of the promising new directions that are being investigated at UCSD include the use of 3-D silicon stacks in conjunction with MEMs devices, Conical tapered lens arrays for increased alignment tolerance [17] ,Chatoyant as a versatile CAD system for optoelectronics [18], Ultra low drive surface normal light modulators based on the VCSEL structure [19] and Electric-field assisted micro-assembly and pick and place for advanced integration [20].
References 1.
2.
Krishnamoorthy, A.V., Miller, D.A.B. “Firehose architectures for free-space optically interconnected VLSI circuits”. Journal of Parallel and Distributed Computing, vol.41, (no.1), Academic Press,. pp.109-14. 25 Feb. 1997 P. J. Marchand, A. V. Krishnamoorthy, G. I. Yayla, S. C. Esener and U. Efron, "Optically augmented 3-D computer: system technology and architecture." J.
1108
S. Esener and P. Marchand
Parallel Distrib.Comput. Special Issue on Optical Interconnects, vol.41, no.1, pp.20-35, February 1997 3. Betzos, G.A.; Mitkas, P.A. “Performance evaluation of massively parallel processing architectures with three-dimensional optical interconnections,” Applied Optics, vol.37, (no.2), pp.315-25, 10 Jan. 1998. 4. J. W. Goodman, F. J. Leonberger, S. C. Kung, and R. A. Athale, "Optical Interconnections for VLSI Systems, " Proc. IEEE, vol. 72, no. 7, pp. 850-66, Jul. 1984 5. L. A Bergman, W. H. Wu, A. R. Johnston, R. Nixon, S. C. Esener, C.C Guest, P. Yu, T.J. Drabik, M. Feldman, S. H. Lee, "Holographic Optical Interconnects in VLSI," Opt. Eng., vol. 25, no. 10, pp. 1109-18, Oct. 1986 6. W. H. Wu, L. A Bergman, A. R. Johnston, C. C. Guest, S.C Esener, P.K.L Yu,. M. R. Feldman, S. H. Lee, "Implementation of optical Interconnections for VLSI," IEEE Trans. Electron Devices, vol. ED-34, no. 3, pp. 706-14, Mar. 1987 7. R. K. Kostuk, J. W. Goodman, and L. Hesselink, "Optical Imaging Applied to Microelectric Chip-to-Chip Interconnections," Appl. Opt., vol. 24, no. 17, pp. 2851-8, Sep. 1985. 8. D. A. B. Miller, “Physical reasons for optical interconnection,” Intl. J. of Optoelectronics, vol. 11, no.3, pp. 155-68, 1997. 9. A. Krishnamoorthy and D. A. B. Miller, “ Scaling opto-electronic-VLSI circuits into 21st century: a technology roadmap,” IEEE JST in Quantum Opto-electronics, Vol.2, No.1 , pp.55-76, Apr. 1996. 10.M. R. Feldman, S. C. Esener, C. C. Guest, and S. H. Lee, "Comparison between optical and electrical interconnects based on power and speed considerations," Appl. Opt., 27, no.9, pp. 1742-51, May 1988. 11. F. Kiamilev, P. Marchand, A. Krishnamoorthy, S. Esener, and S. H. Lee, “Performance comparison between opto-electronic and VLSI multistage interconnection networks,” IEEE J. Lightwave Technol., vol. 9, no. 12, pp.1674-92, Dec. 1991. 12. A. V. Krishnamoorthy, P. Marchand, F. Kiamilev, K. S. Urquhart, S. Esener, "Grain-size consideration for opto-electronic multistage interconnection network," Appl. Opt., 31 (26), pp. 5480-5507, 1992. 13. G. Yayla, P. Marchand, and S. Esener, "Speed and Energy Analysis of Digital Interconnections: Comparison of On-chip, Off-chip and Free-Space Technologies," Appl. Opt., 37, pp. 205-227, January 1998. 14. Morgan, R.A.; Bristow, J.; Hibbs-Brenner, M.; Nohava, J.; Bounnak, S.; Marta, T.; Lehman, J.; Yue Liu “Vertical cavity surface emitting lasers for spaceborne photonic interconnects,” Proceedings of the SPIE – The International Society for Optical Engineering, vol.2811, (Photonics for Space Environments IV, Denver, CO, USA, 6-7 Aug. 1996.) SPIE-Int. Soc. Opt. Eng,. pp.232-42.1996. 15. A. Krishnamoorthy, “Applications of opto-electronic VLSI technologies,” Optical Computing 1998, Bruges, Belgium , June 1998. 16. A. V. Krishnamoorthy, L. M. F. Chirovsky, W. S. Hobson, R. E. Leibenguth, S. P. Hui, G. J. Zydzik, K. W. Goosen, J. D. Wynn, B. J. Tseng, J. A. Walker, J. E. Cunningham, and L. A. D’Asaro, “Vertical-Cavity Surface-Emitting Lasers FlipChip Bonded to Gigabit-per-Second CMOS Circuits”, IEEE Phot. Tech. Lett., Vol.11, No.1, pp.128-130, 1999.
Present and Future Needs of Free-Space Optical Interconnects
1109
Cornelius Diamond, Ilkan Cokgor, Aaron Birkbeck and Sadik Esener, " Optically Written Conical Lenses for Resonant Structures and Detector Arrays" Optical Society of America, Spatial Light Modulators and Integrated Optoelectronic Arrays, Technical Digest, Salt Lake City, Snowmass, April 1999. 18. S.P. Levitan, T.P. Kurzweg, P. Marchand, M.A. Rempel, D.M. Chiarulli, J.A. Martinex, C. Fan, and F.B. McCormick, “Chatoyant, a Computer-Aided Design Tool for Free-Space Optoelectronic Systems,” Appl. Opt., January 1998. 19. O. Kibar and S. Esener “Sub-threshold operation of a VCSEL structure for ultralow voltage, high speed, high contrast ratio spatial light modulation” Optical Society of America, Spatial Light Modulators and Integrated Optoelectronic Arrays, Technical Digest, Salt Lake City, Snowmass, April 1999. 20. S. C. Esener, D. Hartmann, M. J. Heller and J. M. Cable, " DNA Assisted MicroAssembly: A Heterogeneous Integration Technology For Optoelectronics, " Proc. SPIE Critical Reviews of Optical Science and Technology, Heterogeneous Integration, Ed. A. Hussain, CR70-7, Photonics West 98, San Jose, January-98. 17.
Fast Sorting on a Linear Array with a Reconfigurable Pipelined Bus System? Amitava Datta, Robyn Owens, and Subbiah Soundaralakshmi Department of Computer Science The University of Western Australia Perth, WA 6907 Australia email:fdatta,robyn,
[email protected] Abstract. We present a fast algorithm for sorting on a linear array with a reconfigurable pipelined bus system (LARPBS), one of the recently proposed parallel architectures based on optical buses. Our algorithm sorts numbers in (log log log ) worst-case time using processors. To our knowledge, the previous best sorting algorithm on this architecture has a running time of (log 2 ).
O
N
N
N
N
O
N
1 Introduction Recent advances in optical and opto-electronic technologies indicate that optical interconnects can be used effectively in massively parallel computing systems involving electronic processors [1]. The delays in message propagation can be precisely controlled in an optical waveguide and this can be used to support high bandwidth pipelined communication. Several different opto-electronic parallel computing models have been proposed in the literature in recent years. These models have opened up new challenges in algorithm design. We refer the reader to the paper by Sahni [8] for an excellent overview of the different models and algorithm design techniques on these models. Dynamically reconfigurable electronic buses have been studied extensively in recent years since they were introduced by Miller et al. [3]. There are two related opto-electronic models based on the idea of dynamically reconfigurable optical buses, namely, the Array with Reconfigurable Optical Buses (AROB) and the Linear Array with Reconfigurable Pipelined Bus Systems (LARPBS). The LARPBS model has been investigated in [2, 4–6] for designing fast algorithms from different domains. There are some similarities between these two models. For example, the buses can be dynamically reconfigured to suit computational and communication needs and the time complexities of the algorithms are analyzed in terms of the number of bus cycles needed to perform a computation, where a bus cycle is the time needed for a signal to travel from end to end along a bus. However, there is one crucial difference between these two models. In the AROB model, the processors connected to a bus are able to count optical pulses within a bus cycle, whereas in the LARPBS model counting is not allowed during a bus cycle. In the LARPBS model, processors can set switches at the start of a bus cycle and take no further part during a bus cycle. In other words, the basic assumption of the ? This research is partially supported by an Australian Research Council (ARC) grant.
J. Rolim et al. (Eds.): IPDPS 2000 Workshops, LNCS 1800, pp. 1110-1117, 2000. Springer-Verlag Berlin Heidelberg 2000
Fast Sorting on a Linear Array with a Reconfigurable Pipelined Bus System
1111
AROB model is that the CPU cycle time is equal to the optical pulse time since the processors connected to a bus need to count the pulses. This is an unrealistic assumption in some sense since the pulse time is usually much faster than the CPU time of an electronic processor. On the other hand, the LARPBS model is more realistic since the basic assumption in this model is that the bus cycle time is equal to the CPU cycle time. Sorting is undoubtedly one of the most fundamental problems in computer science and a fast sorting algorithm is often used as a preprocessing step in many other algorithms. The first sorting algorithm on the LARPBS model was designed by Pan et al. [7]. Their algorithm is based on the sequential quicksort algorithm and runs in O(log N ) time on an average and in O(N ) time in the worst case on an N processor LARPBS. To our knowledge, the best sorting algorithm for this model is due to Pan [4]. His algorithm sorts N numbers in O(log2 N ) worst-case time. We present an algorithm for sorting N numbers in O(log N log log N ) time on an LARPBS with N processors. Our algorithm is based on a novel deterministic sampling scheme for merging two sorted arrays of length N each in O(log log N ) time.
2 Fast sorting on the LARPBS We refer the reader to [2, 5, 6] for further details of the LARPBS model. The measure of computational complexity on an LARPBS is the number of bus cycles used for the computation and the amount of time spent by the processors for local computations. A bus cycle is the time needed for end to end message transmission over a bus and assumed to take only O(1) time. In most algorithms on the LARPBS model, a processor performs only a constant number of local computation steps between two consecutive bus cycles and hence the time complexity of an algorithm is proportional to the number of bus cycles used for communication. We use some basic operations on the LARPBS in our algorithm. In a one-to-one communication, a source processor sends a message to a destination processor. In a 1 broadcasting operation, a source processor sends a message to all the other N processors in an LARPBS consisting of N processors. In a multicasting operation, a source processor sends a message to a group of destination processors. In a multiple multicasting operation, a group of source processors perform multicasting operations. A destination processor can only receive a single message during a bus cycle in a multiple multicasting operation. In the binary prefix sum computation, each processor in an LARPBS with N processors stores a binary value, with processor Pi ; 1 i N storing the binary value bi . The aim is to compute the N prefix sums Si ; 1 i N , where Si = ij =1 bj . Suppose each processor in an N processor LARPBS is marked either as active or as inactive depending on whether the processor holds a 1 or a 0 in one of its registers Ri . Also, each processor holds a data element in another of its registers Rj . In the ordered compression problem, the data elements of all the active processors are brought to consecutive processors at the right end of the array, keeping their order in the original array intact. The following lemma has been proved by Li et al. [2] and Pan and Li [5].
P
1112
A. Datta, R. Owens, and S. Soundaralakshmi
Lemma 1. One-to-one communication, broadcasting, multicasting, multiple multicasting, binary prefix sum computation and ordered compression all can be done in O(1) bus cycles on the LARPBS model. Given a sequence of N numbers k1 ; k2 ; : : : ; kN , the sorting problem is to arrange these numbers in nondecreasing order. Our sorting algorithm on the LARPBS is based on the well known sequential merge sort algorithm. We use an algorithm for merging two sorted arrays of length N each in O(log log N ) time on an LARPBS with N processors. We now give some definitions and properties which are necessary for designing our merging algorithm. 2.1 Definitions and properties Suppose we have two arrays L = fl1 ; l2 ; : : : ; lN g and R = fr1 ; r2 ; : : : ; rN g each having N elements and each sorted according to ascending order. We assume for simplicity that all the elements in L [ R are distinct. It is easy to modify our algorithm for the case when an element may occur multiple times. For an element li 2 L, we denote its predecessor and successor in L by pred(li ) and succ(li ). Successors and predecessors are denoted similarly for an element in R. The rank of li in L is its index i in the array L and denoted by rankL(li ). Similarly, the rank of ri in R is its index i in the array R and denoted by rankR (ri ). The rank of li in R, denoted by rankR (li ), is rankR (rj ) of an element rj 2 R such that rj < li and there is no other element rk 2 R such that rj < rk < li . Sometime we will write rankR (li ) = rj by abusing the notation. Similarly, the rank of ri in L, denoted by rankL (ri ), is rankL(lj ) of an element lj in L such that lj < ri and there is no other element lk 2 L such that lj < lk < ri . For an element lm 2 L, the rank of lm in L [ R is denoted by rank(lm ). The following lemma is a direct consequence of definitions of these three kinds of ranks. Lemma 2. For an element lm 2 L; 1 m N , rank(lm ) =rankL(lm )+ rankR (lm ). Similarly, for an element rn 2 R; 1 n N , rank(rn ) = rankR (rn ) + rankL(rn ). It is clear from Lemma 2 that if we compute rankR (li ) for each element li 2 L, we can compute rank(li ). Note that, we already know rankL (li ) since L is already sorted and rankL (li ) is simply the index i. Similarly, if we compute rankL (rj ) for each element rj 2 R, we can compute rank(rj ). We refer to these two problems as ranking of L in R and ranking of R in L. We do the ranking of L in R recursively in several stages. When every element in L is ranked in R, we say that L is saturated. Consider a stage when L is still unsaturated. In other words, some elements in L are already ranked in R and some are yet to be ranked. Definition 3 Consider two consecutive ranked elements lm and ln , m < n. All the elements between lm and ln , i.e., succ(lm ); : : : ; pred(ln ) are unranked and these elements are called the gap between lm and ln and denoted by Gap(lm ; ln ).
Fast Sorting on a Linear Array with a Reconfigurable Pipelined Bus System
1113
Definition 4 Consider two consecutive ranked elements lm and ln in L. Suppose, rankR and rankR (ln ) = rq . The elements succ(rp ); : : : ; rq are collectively called the cover of Gap(lm ; ln ) and denoted as Cover(lm ; ln ). See Figure 1 for an illustration. (lm ) = rp
......
lm succ(lm )
pred(ln ) ln
......
L
Gap(lm ; ln )
......
rp succ(rp )
pred(rq ) rq
......
R
Cover(lm ; ln )
Figure 1. Illustration for gap and cover.
Lemma 5. For an element li 2 Gap(lm ; ln ), either rankR (li ) =rankR (lm ) or rankR (li ) such that, rm 2Cover(lm ; ln ).
= rm
Definition 6 For two ranked elements lm ; ln Gap(lm ; ln ) is non-empty.
2
6
L, if ln = succ(lm ), we say that
Definition 7 Consider a non-empty gap Gap(lm ; ln ) and its Cover(lm ; ln ). We say that Gap(lm ; ln ) has an empty cover if rankR (lm ) = rankR (ln ), i.e., if lm and ln are ranked at the same element in R. The following two lemmas are crucial for our algorithm. Lemma 8. If Cover(lm ; ln ) is the non-empty cover for Gap(lm ; ln ), an element rj Cover(lm ; ln ) must be ranked in Gap(lm ; ln ).
2
Lemma 9. If Gap(lm ; ln ) and Gap(lo ; lp ) are two arbitrary and distinct non-empty gaps in L, then Gap(lm ; ln ) \ Gap(lo ; lp ) = ;. Similarly, if Cover(lm ; ln ) and Cover(lo ; lp ) are two arbitrary and distinct non-empty covers in R, then Cover(lm ; ln ) \ Cover(lo ; lp ) = ;. We assume that thep sorted sequences L and R have N and M elements respectively. First, we choose every N -th element, i.e, the elements lpN ; l2pN ; : : : ; lpN pN from L. We denote the set flpN ; l2pN ; : : : ; lpN pN g as SampleL. Similarly, we choose
1114
A. Datta, R. Owens, and S. Soundaralakshmi
the elements rpM ; r2pM ; : : : ; rpM pM from R and denote this set of elements as p p SampleR . Note that there are N elements in Sample L and M elements in SampleR . p The elements lipN (resp. ripN ), 1 i N in SampleL(resp. SampleR ) impose a block structure on the sequence L(resp. R). Consider two consecutive elements lipN and l(i+1)pN in SampleL. The elements fsucc(lipN ); : : : ; l(i+1)pN g are called the i-th block in L imposed by SampleL and denoted by BlockL i . The superscript L indicates that it is a block in the sorted sequence L. The elements lipN and l(i+1)pN are called th block BlockR imposed by two the sentinels of BlockL i . Similarly, we define the j j p p consecutive elements rj M and r(j +1) M of SampleR . Consider the ranking of SampleL in SampleR . When an element lipN 2SampleL is ranked in SampleR , we denote this rank by a superscript S , i.e., rankSR (lipN ). Note that, rankSR (lipN ) is only an approximation of the true rank rankR (lipN ) of lipN in R. Assume that for two consecutive elements lkpN and l(k+1)pN in SampleL, rankSR p p p p (lkpN ) = rmpM and rankS R (l(k+1) N ) = rn M , where rm M and rn M are two elements in SampleR . In the following lemma, we estimate the true ranks of the elements in BlockL k in R.
p Lemma 10. If an element lr 2 L is in BlockL k , i.e., in between the two elements lk N R R R p and l(k+1) N , lr must be ranked in Blockm [ Blockm+1 [ : : : [ Blockn , i.e., in Cover(lkpN ; l(k+1)pN ). 2.2 An O (log log N ) time merging algorithm on the LARPBS A variant of the following lemma has been proved by Pan et al. [7].
p
Lemma 11. Given two sorted sequences A and B of length N each, all the elements of A can be ranked in B in O(1) bus cycles on an LARPBS with N processors. Our algorithm is recursive and at every level of recursion, our generic task is to set up appropriate subproblems for the next level of recursion. In the following description, we explain how all the subproblems associated with Gap(lm ; ln ) and Cover(lm ; ln ) are 0 set up for the next level of recursion. We assume that Gap(lm ; ln ) has N elements and 0 Cover(lm ; ln ) has M elements. Step 1. p We take a sample from Gap(lm ; ln ) by choosing every N 0 -th element from Gap (lm ; ln ). We denote this sample by SampleL (Gap(lm ; ln )) Similarly, we take a sample p from Cover(lm ; ln ) by choosing every M 0 -th element from Cover(lm ; ln ) and denote it by SampleR (Cover(lm ; ln )). We explain how to take the sample from Gap(lm ; ln ). The sample from Cover(lm ; ln ) is taken in a similar way. First, each processor holding an element in Gap(lm ; ln ) writes a 1 in one of its 0 registers. Next, a parallel prefix computation is done in one bus cycle to get N , the total number of p elements in Gap(lpm; ln0 ) in the processor holding ln . This procesN to all the processors in Gap(lm ; ln ). We assor computes N 0 and broadcasts p sume for simplicity that N 0 is an integer. Eachpprocessor in Gap(lm ; ln ) determines whether its prefix sum is an integer multiple of N 0 and marks itself as a member of
Fast Sorting on a Linear Array with a Reconfigurable Pipelined Bus System
1115
SampleL(Gap(lm ; ln )) accordingly. Note that, SampleL (Gap(lm ; ln )) consists of the sentinels of the blocks in L. Step 2.
p
p
In this step, we assume that N 0 < M 0 and we rank SampleL (Gap(lm ; ln )) in SampleR (Cover(lm ; ln )). This ranking is done by the method in Lemma 11 in O(1) bus cycles. Step 3. After the ranking in Step 2 is over, for every sentinel l we
know BlockR m,
the block of
p
M
0
p 2SampleL( Gap(lm; ln )), N elements in R in which l p should be ranked. k
0
k
N0
Next, we determine all the sentinels in SampleL(Gap(lm ; ln )) ranked in BlockR m in the following way. After the ranking in Step 2 is over, each processor holding a sentinel p 0 ) from its neighbor in the sample through a one-to-one l p 0 gets rankS R (l i
N
(i+1)
N
communication. After this, a group of consecutive sentinels in SampleL(Gap(lm ; ln )) which are ranked at the same block of SampleR (Cover (lm ; ln )) can be determined. We consider two cases depending on whether a single sentinel or multiple sentinels from SampleL(Gap(lm ; ln )) are ranked in the same block of SampleR (Cover (lm ; ln )). Case i. In this case, only one sentinel l p 0 in SampleL(Gap(lm ; ln )) is ranked in k
p
k
N
p to all the processors in BlockRm N p and the processors in BlockR m determine rankR (lk N ). This takes O(1) bus cycles. p ) in BlockRn in a similar way. Note that, the elements We determine rankR (l k N p )) are the elements in succ(rankR (l p )); : : : ; rankR (l Cover(l p ; N k N k N k p ). l k N p ) must be It follows from Lemma 5 that all the elements in Gap(l p ; l k N k N p ). ranked either at rankR (l p ) or among the elements in Cover(l p ; l k N k N k pN ) Similarly, it follows from Lemma 8 that all the elements in Cover (l p ; l k N p ). Hence kweNrecursively must be ranked at the elements in Gap(l p ; l call k N k N p p p ;l ) and elements in Cover(l ; our algorithm with elements in Gap(l k N k N k N p l ). In this recursive call, all the elements from L are within a block of size pk N p ) and the elements N . The processors holding the elements in Gap(l p ; l k N k p ) participate in this recursive call. N in Cover(l p ; l k N k N Case ii. In this case, multiple sentinels l p ; : : : ; l p are ranked in BlockR m . In j N k N p p ) and then rankR (l ) are determined by broadtwo bus cycles, first rankR (l j N k N p p and then l to all the processors in BlockR casting first l m . We then recurj N k N sively call our algorithm with the elements in Gap(l p ; l p ) and the elements in j N k N Cover(l p ; l p ). Note that, all the elements from R are within a block of size p j N k N BlockR m . The processor holding l
N
0
broadcasts l
k
0
0
0
( +1)
0
( +1)
0
( +1)
0
0
0
( +1)
0
0
( +1)
0
0
0
( +1)
0
0
0
( +1)
0
0
0
0
0
0
0
( +1)
0
0
0
0
0
0
0
( +1)
0
0
( +1)
( +1)
0
0
0
M in this recursive call.
These two types of recursive calls are illustrated in Figure 2.
0
1116
A. Datta, R. Owens, and S. Soundaralakshmi BlockL k
lkpN
lj pN I
l(k+1)pN
L
II
R BlockR m
BlockR n
Figure 2. The two types of recursive calls are indicatedpby I and II. In the first type, the elements from R are within the same block ofpsize M 0 . In the second type, the elements from L are within the same block of size N 0 . Note that, the inputs to each level of recursion are disjoint subsets of processors holding elements of L and R and hence all the one-to-one communication, broadcasting and multiple multicasting operations at each level of recursion for each of the subproblems can be done simultaneously in parallel. Once the recursive calls return, an element li 2 L knows rankR (li ) and it knows rankL (li ) since L is already sorted. Hence the processor holding li can compute rank(li ) and sends li to the processor with index rank(li ) through a one-to-one communication. This can be done in one bus cycle. Similarly the overall rank of each element in R can be computed and the elements can be sent to the appropriate processors. Hence each processor Pi will hold the ith element in L [ R after the merging algorithm terminates. This concludes the description of our merging algorithm. Lemma 12. The merging algorithm terminates in O(log log N ) bus cycles with all the elements of L ranked in R and all the elements of R ranked in L. Proof. p (sketch) p Suppose in the ith level of recursion, each block in L and R is of size N and M respectively. Suppose, the input to one of the recursive calls at the (i + 1)th level of recursion are the elements in two groups of processors GL from L and GR from R. From of the algorithm, p it is clear that either GL is within p the description R a block of size N or G is within a block of size M . Hence, due to this recursive call, at the (i + 1)th level of recursion, either we get new blocks of size N 1=p4 in L or we get new blocks of size M 1=4 in Rp. This gives a recurrence of : T (N ) = T ( N ) + O(1) or a recurrence of : T (M ) = T ( M ) + O(1), since each level of recursion takes O(1) bus cycles. Hence, the recursion stops after 2 log log N levels and all the elements in L and R are ranked at that stage. 2.3 The sorting algorithm Phase 1. Initially, each processor in an N processor LARPBS holds one element from the input. The complete LARPBS with N processors is recursively divided in this phase.
Fast Sorting on a Linear Array with a Reconfigurable Pipelined Bus System
1117
Consider a subarray with processors Pi ; Pi+1 ; : : : ; Pj to be divided into two equal parts. Each processor writes a 1 in one of its registers and a prefix computation is done to renumber the processors from 1 to j i. Now, the last prefix sum is broadcast to all the processors and the processor with index b(j + i)=2c splits the bus to divide the original subarray into two subarrays of equal size. This process is repeated for all the subarrays recursively until each subarray contains only one processor and one element which is trivially sorted. This phase can be completed in O(log N ) bus cycles. Phase 2. The merging is done in this phase using the algorithm in Section 2.2. In the generic merging step, a pair of adjacent subarrays of equal size merge their elements to form a larger subarray of double the size. Each subarray participating in this pairwise merging first renumber its processors starting from 1 and then the merging algorithm is applied. At the end, processor Pi ; 1 i N in the original array holds the element with rank i from the input set. Since there are O(log N ) levels in the recursion and the merging at each level can be performed in O(log log N ) bus cycles, the overall algorithm takes O(log N log log N ) bus cycles and hence O(log N log log N ) time since each bus cycle takes O(1) time. Theorem 1. N elements can be sorted in O(log N log log N ) deterministic time on an LARPBS with N processors.
References 1. Z. Guo, R. Melhem, R. Hall, D. Chiarulli, S. Levitan, “Pipelined communication in optically interconnected arrays”, Journal of Parallel and Distributed Computing, 12, (3), (1991), pp. 269-282. 2. K. Li, Y. Pan and S. Q. Zheng, “Fast and processor efficient parallel matrix multiplication algorithms on a linear array with a reconfigurable pipelined bus system”, IEEE Trans. Parallel and Distributed Systems, 9, (8), (1998), pp. 705-720. 3. R. Miller, V. K. Prasanna Kumar, D. Reisis and Q. F. Stout, Parallel computations on reconfigurable meshes. IEEE Trans. Computers, 42, (1993), 678-692. 4. Y. Pan, “Basic data movement operations on the LARPBS model”, in Parallel Computing Using Optical Interconnections, K. Li, Y. Pan and S. Q. Zheng, eds, Kluwer Academic Publishers, Boston, USA, 1998. 5. Y. Pan and K. Li, “Linear array with a reconfigurable pipelined bus system - concepts and applications”, Journal of Information Sciences, 106, (1998), pp. 237-258. 6. Y. Pan, M. Hamdi and K. Li, “Efficient and scalable quicksort on a linear array with a reconfigurable pipelined bus system”, Future Generation Computer Systems, 13, (1997/98), pp. 501-513. 7. Y. Pan, K. Li and S. Q. Zheng, “Fast nearest neighbor algorithms on a linear array with a reconfigurable pipelined bus system”, Journal of Parallel Algorithms and Applications, 13, (1998), pp. 1-25. 8. S. Sahni, “Models and algorithms for optical and optoelectronic parallel computers”, Proc. 1999 International Symposium on Parallel Architectures, Algorithms and Networks, IEEE Computer Society, pp. 2-7.
Architecture description and prototype demonstration of optoelectronic parallel-matching architecture
Keiichiro Kagaw a, Kouichi Nitta, Yusuke Ogura, Jun Tanida, and Yoshiki Ichiok a ??
Department of Material and Life Science, Graduate School of Engineering, Osaka University We propose an optoelectronic parallel-matching architecture (PMA) that provides pow erful processing capabilit y for distributed algorithms comparing with traditional parallel computing architectures. The PMA is composed of a parallel-matching (PM) module and m ultiple processing elements (PE's). The PM module is implemented by a large-fan-out free-space optical interconnection and parallel-matching smart-pixel array (PM-SPA). In the proposed architecture, eac h PE can monitor the other PE's by utilizing several kinds of global processing by the PM module. The PE's can execute concurrent data matching among the others as well as in ter-processor communication. Based on the stateof-the-art optoelectronic devices and a diractive optical element, a prototype of the PM module is constructed. The prototype is assumed to be used in a multiple processor system composed of 4 4 processing elements, whic h are completely connected via 1-bit optical communication channels. On the prototype demonstrator, the fundamental operations of the PM module such as parallel-matching operations and inter-processor communication were viri ed at 15MHz. Abstract.
1
Introduction
P arallel distributed processing is an eective method to accelerate the performance of computing system. In the parallel distributed processing, a task is divided in to a number of processes executable concurrently. The processes are distributed and executed over multiple processing elements (PE's), so that the total processing time can be reduced. A heuristic optimization described by a distributed algorithm is a good application of a parallel computing system. In the algorithm, the solution space is divided into multiple pieces of segments, in which the candidates of the solution are sought concurrently b ymultiple PE's. In the framework of the traditional parallel computing architecture, global processing to calculate multiple data from all the PE's can be a processing bottleneck. Because communication between the PE's and processing are implemented separately, the heavy traÆc occurs on the ??
[email protected] J. Rolim et al. (Eds.): IPDPS 2000 Workshops, LNCS 1800, pp. 1118-1125, 2000. Springer-Verlag Berlin Heidelberg 2000
Architecture Description and Prototype Demonstration
1119
network path to or from the PE that executes the global processing. The bottleneck causes throughput reduction of the whole parallel computing system. This bottleneck can not be eliminated by simply increasing the communication capacity of network. Therefore, the traditional parallel computing architectures are not always suitable for the distributed algorithms. In this paper, we propose an optoelectronic parallel-matching architecture (PMA) which is an eective parallel computing architecture suitable for the distributed algorithms. The PMA is based on an optoelectronic heterogeneous architecture formerly presented by Tanida et al.,[1] which is composed of electronic parallel processors for local processing and an optical network processor for interconnection and global processing between the electronic processors. The optical network processor is assumed to be embodied by the optical interconnection and the smart-pixes[2] for wide communication bandwidth and dense connectivity between the PE's. In the architecture, both electronic and optical processors work in complementary manner. An electronic processor shows high performance in the local processing, whereas an optical processor is good at the global processing. The system based on the PMA also has ability to execute the global processing without degrading the throughput of network. Detection of the PE's satisfying a given condition and summation of absolute dierences over the multiple PE's are typical examples of the global processing. The optical network processor of the PMA is called a parallel-matching (PM) module, which consists of a large-fan-out free-space optical interconnection and a parallel-matching smart-pixel array (PM-SPA). The proposed architecture can reduce the execution time for the fundamental global data processing: global data matching, detection of the maximum (minimum) data, and ranking of the data, compared with the other traditional architectures with photonic networks.
2
Parallel Matching Architecture
We assume a multiple-instruction multiple-data stream (MIMD) parallel computing system consisting of N PE's embodied by the smart-pixel technology. The PE's are connected each other via a photonic network. A heuristic optimization algorithm based on the distributed algorithm is a good application of the parallel computing systems, which can be applied to the problems that do not always have a rigorous solving method. A general procedure of the distributed optimization algorithm is composed of distribution of the data, parallel processing, and integration of the calculated data. First, the candidates of solutions are distributed to the PE's. Second, each PE locally calculates the tness function of candidate. Finally, good candidates are selected among the candidates based on the values of the tness function. Note that this operation is achieved by global processing over the multiple PE's. Figure 1 shows the system compositions for the distributed algorithms by the traditional MIMD parallel computing system and the parallel matching architecture. The traditional architecture has a hierarchy composed of a master PE and multiple slave PE's as shown in Fig. 1(a). The rolls of the master PE are
1120
K. Kagawa et al.
Master PE
Bottleneck
Parallelmatching module
Network hub
Slave PE #1
Slave PE #2
Slave PE #3
Slave PE #4
PE #1
PE #2
PE #3
PE #4
(a) (b)
Con gurations of parallel computing architectures: (a) a traditional masterslave architecture and (b) the parallel-matching architecture (PMA) Fig. 1.
data distribution, integration, and global processing. The master PE distributes the data to the slave PE's and integrates the resulting data from them through the network. After data integration, the master PE executes global processing locally. Because the amount of the network traÆc in the data distribution and integration is very large, these procedures can be processing bottlenecks. This bottleneck can not be eliminated by simply increasing the communication capacity of network, for the total amount of the fanned-in data from N slave PE's to the master PE is N times as large as the bandwidth of the communication path between the network communication module and the PE's. On the other hand, the PMA has a dierent composition as shown in Fig. 1(b). The PMA is composed of the PM module and the multiple PE's. In the PMA, the tness of each candidate is compared with the candidates on the other PE's by using the global processing mechanism of the parallel-matching (PM) module. The PM module oers both networking and global processing, so that the master PE for data distribution and integration is not required. The PE's in the system have the same priority because the global processing is executed inside the PM module; that is the system has no hierarchy. As a result, there is no bottleneck in the proposed architecture in global processing. The PM module consists of large-fan-out free-space optical interconnection and a parallel-matching smart-pixel-array (PM-SPA). The PM module is regarded as a kind of the network hub in which a speci c mechanism for the global processing is built-in. The global processing in the PMA is data comparison among the data sent from the PE's. The PM module monitors the output data from all PE's, and concurrently compares the datum from each PE with the data from the other PE's. When a PE requires the compared result, it is sent back to the PE through the network communication channel. As mentioned above, data distribution and integration increase the network traÆcs and the processing overheads at a PE. However, because the global processing is ex-
Architecture Description and Prototype Demonstration
PE#1 d1
PE#2 d2
PE#3 d3
d1 d2 d3
d2 d1 d3
d3 d1 d2
Matching for PE#1
Matching for PE#2
Matching for PE#3
Reference datum
Objective datum
Reference datum and objective data in the parallel matching. denote the output data of PE#1-#3, respectively Fig. 2.
1121
d1 ; d2
, and
d3
ecuted inside the PM module without occupying the network bandwidth, the throughput of the total system does not become decreased. We de ne the datum from each PE as the reference datum and the one from the other PE's as the objective data as shown in Fig. 2. The reference datum and the objective datum to be compared are called a matching pair. The PM module tests the reference datum and each of the objective data for the following conditions: 1) the reference datum is equivalent to the objective datum, 2) the reference datum is less than the objective datum, and 3) the reference datum is more than the objective datum. The result of the global comparison is expressed by a set of logical values. When the condition is satis ed, the returned value is 1 (true), otherwise 0 (false). These operations are called parallel-matching operations, which are denoted by pEQU, pMORETHAN, and pLESSTHAN, respectively. (The pre x p means `parallel.') We also de ne the fourth parallel-matching operation: summation of the absolute dierences denoted by pDIFF. This operation provides the summation of the absolute dierence between the reference datum and the objective datum. Utilizing the pDIFF operation, each PE can obtain the quantitative value of the dierence. Figure 3 shows a schematic diagram of the parallel matching with 5 PE's. In the gure, PE's A, B, and C obtain 4-bit binary values representing the results of parallel matching: pEQU, pMORETHAN, and pLESSTHAN, respectively. PE-D obtains the result of the pDIFF operation. The numbers in the boxes of PE's are the output data from the PE's. After the output data are fanned out and exchanged, they are concurrently compared by the parallel-matching operations in the PM module. Then, one of the parallel-matching results or the objective datum is selected by the multiplexer on the request from the PE's. In general, for m-bit data format, up to (m + 1) PE's can be compared at the same time. Finally, the selected result is sent back to each PE. In Fig. 3, example values of the parallel-matching results are shown. The operation mode of PE-E
1122
K. Kagawa et al. PE-A
PE-B
PE-C
PE-D
PE-E
96
102
23
102
96 PM module
Large-fan-out Free-space Optical Interconnection BCDE
ACDE
ABDE
ABCE
ABCD
0001
0010
0000
0100
1000
pMORETHAN 0 1 0 0 pLESSTHAN 1 0 1 0 pDIFF 85
11 0 1
0000
1011
0010
0000
1111
0000
0101
91
237
91
pEQU
MUX
MUX
PM-SPA
85 MUX
MUX
0001
1101
1111
91
(pEQU)
(pMORETHAN)
(pLESSTHAN)
(pDIFF)
PE-A
PE-B
PE-C
MUX
C=23 PE-D
PE-E
Fundamental operations of the parallel-matching architecture. MUX means a multiplexer Fig. 3.
is dierent from the others. That is the communication mode in which the data from PE-C is sent to PE-E transparently.
3
Experimental prototype system
We construct a prototype system of the PM module to demonstrate its fundamental operations. In designing the prototype, we assume the parallel computing system shown in Fig. 4. The parallel computing system consists of 4 4 PE', which are completely connected via the PM module. The PE's are located on a two-dimensional grid, and each of them is connected to the PM module with bit-serial optical ber channels. Each PE is embodied by smart-pixels coupled with an optical ber. The data from the PE's are sent to the PM module by the optical bers. As mentioned below, a complete-connection network is implemented by optical data fanning. With the optically fanned-out signals, the parallel-matching operations and the processing for inter-PE communication are executed by the PM-SPA. The resulting data are emitted from the PM-SPA, and returned to the PE's through the optical bers. Figure 5 shows the schematic diagram of the optoelectronic complete-connection. As shown in Fig. 5(a), the optical signals from 4 4 PE's in the bit-serial format are assumed to be aligned on a two-dimensional grid as an input image toward the PM module. Because the whole image of the light signals is required for one PE, 4 4 replica images shown in Fig. 5(b) are prepared for 4 4 PE's. In the prototype, an 8 8-VCSEL array (GigalaseTM ; Micro Optical Devices; emitting wavelength, 850nm; pixel pitch, 250m) is used as a light emitter array. In the prototype, the function of the PM-SPA is emulated by a CPLD
Architecture Description and Prototype Demonstration
1123
PE array for local processing PM module
Fan-out
Return
PM-SPA Optical fibers
Fig. 4.
Large-fan-out free-space optical interconnection
Target prototype system of the PMA
(Model FLASH374i, Cypress) coupled with a 4 4-complementary-metal-oxidesemiconductor photodetector (CMOS-PD) array (Model N73CGD) supplied by United States-Japan Optoelectronic Project (JOP). As shown in Fig. 5(c), one of the replicas is detected by a CMOS-PD array, then transferred to the CPLD, and the fundamental operations of the PMA are executed. For the large-fan-out optical interconnection, a conventional 4f optical correlator was adopted. We constructed a Fourier transform lens system whose focal length is 160.0mm for wavelength 850nm. In designing the lens system, CodeVTM of Optical Research Associates was used. As an optical fan-out element that generates complete-connection pattern shown in Fig. 5(b), we designed a phase-only computer-generated hologram (CGH) lter with two-level phase modulation based on the Gerchberg-Saxton algorithm.[3] Figure 6(a) shows the ideal mapping on the output plane of the interconnection optics. The output pattern contains 16 replicas of the VCSEL image arranged on a grid, in which each quadrant contains 2 2 replicas of the VCSEL image. Each replica corresponds to the optical signals for a single PE. Because the equipments used in fabrication of the CGH lter do not have enough fabrication accuracy to eliminate the 0th light spot, the copied images are located not to be overlapped with the 0th image in the design. The pitch and the margin of adjacent replicas of the VCSEL image are 2.5mm and 1.5mm, respectively. Figure 6(c) shows the lter pattern with two-level phase modulation. The CGH lter was fabricated by the electron beam (EB) lithography. Figure 6(b) shows the reconstructed interconnection pattern of the fabricated CGH lter for 4 4 VCSEL's when the lter was incorporated in the 4f optical correlator. Finally, we operated the prototype system without the CGH lter to verify the fundamental parallel-matching operations and inter-PE communication. The
1124
K. Kagawa et al.
Objective data Optical input signals
Reference datum
1-bit optical signal
Detection by a CMOD-PD array
Fan-out
(a)
Replica of a VCSEL array image (b)
CPLD (c)
Schematic diagram of optoelectronic complete-connection: (a) Output data displayed on the VCSEL array, (b) replica images of the VCSEL array for the completeconnection network, and (c) a replica image of the VCSEL image for one PE Fig. 5.
data transfer was in the bit serial format, and the word length of the data was set to 4. From the experimental results, we have veri ed that the fundamental operations of the prototype were executed exactly at 15MHz. The operational speed was limited by the one of the CMOS-PD array. The bit rate of communication per PE and the total bit rate of the prototype were 15Mbps (bit per second) and 240Mbps, respectively. The frequencies of the parallel-matching operation for each PE and the whole system were 0.68M operations/sec and 11M operations/sec, respectively.
4
Conclusions
We have proposed an optoelectronic parallel-matching architecture (PMA) as an eective parallel computing architecture. The fundamental operations of the PMA, pEQU, pMORETHAN, pLESSTHAN, and pDIFF, have been de ned. This architecture is specialized for the global data processing and has capability to accelerate execution of distributed algorithms, because the PMA has a speci c mechanism for parallel-matching operations over multiple processing elements. The prototype system of the PMA was constructed to demonstrate the fundamental global operations of the PMA based on the state-of-the-art optoelectronic devices and a phase-only CGH lter. In the prototype, the PM-SPA, which was the core module of the PM module, was emulated by the CPLD and the CMOS-PD array. The prototype was assumed to be used with 4 4 PE's that are completely connected via the PM module with 1-bit optical channels. For optical interconnection of the prototype, a Fourier transform lens system was designed. As a fan-out element, the phase-only CGH lter with two-level phase modulation was designed based on the Gerchberg-Saxton algorithm, and was fabricated by the EB lithography. We con rmed that the prototype performed the fundamental parallel-matching operations and the inter-PE communication at 15MHz. For the whole system, the bit rate of inter-PE communication and the
Architecture Description and Prototype Demonstration
1125
Filter pattern
Ideal point spread function
0th 250µm
0th image
1.5mm 2.50mm
2.5mm
Replica of 4x4 VCSEL image (a)
(b)
phase 0 phase π Pixel size, 8.5µm Filter size, 17.408mm (c)
(a) Designed optical interconnection pattern for complete-connect network composed of 4 4 PE's, (b) a part of the obtained CGH lter with two-level phase modulation, and (c) experimental result of the optical interconnection by the CGH lter Fig. 6.
frequency of the parallel-matching operation were 240 Mbps and 11M operations per second, respectively. The operational speed of the prototype was limited by the CMOS-PD array. The performance can be improved by using high-speed photodetectors with high sensitivity such as MSM photodetectors coupled with transimpedance photo-ampli ers.
Acknowledgment This research was supported by the JOP user funding under the Real World Computing Partnerchip (RWCP). The authors would like to appreciate the activities of the JOP. This work was also supported by Development of Basic Tera Optical Information Technologies, Osaka Prefecture Joint-Research Project for Regional Intensive, Japan Science and Technolgy Corporation.
References 1. P. Berthome and A. Ferreira, Optical interconnections and parallel processing: trends at the interface (Kluwer Academic Publishers, London, 1998). 2. T. Kurokawa, S. Matso, T. Nakahara, K. Tateno, Y. Ohiso, A. Wakatsuki, and H. Tsuda, \Design approaches for VCSEL's and VCSEL-based smart pixels toward parallel optoelectronic processing systems," Appl. Opt. 37, 194{204 (1996). 3. R. W. Gerchberg and W. O. Saxton, \A Practical Algorithm for the Determinaion of Phase from Image and Diraction Plane Pictures," OPTIK 35, 237 { 246 (1972).
A Distributed Computing Demonstration System Using FSOI Inter-Processor Communication

J. Ekman1, C. Berger2, F. Kiamilev1, X. Wang1, H. Spaanenburg3, P. Marchand4, S. Esener2

1 University of Delaware, ECE Dept., Newark, DE 19716, USA
2 University of California San Diego, ECE Dept., La Jolla, CA 92093, USA
3 Mercury Computer Systems Inc., Chelmsford, MA 01824, USA
4 Optical Micro Machines, San Diego, CA 92121, USA
Abstract. Presented here is a computational system that uses free-space optical interconnect (FSOI) communication between processing elements to perform distributed calculations. Technologies utilized in the development of this system are integrated two-dimensional Vertical Cavity Surface Emitting Laser (VCSEL) and MSM-photodetector arrays, custom CMOS ASICs, custom optics, wire-bonded chip-on-board assembly, and FPGA-based control. Emphasis is placed on the system architecture, the processing element features which facilitate the system integration, and the overall goals of this system.
1 Introduction

The area of optical interconnects is continually growing, with many advances in optoelectronic devices, integration of CMOS ICs with these devices, and integration of hybrid electrical/optical devices into functional systems. It is clear that the flexibility in terms of scalability and the optical bandwidth that can be achieved by using optical interconnects will lead to changes in system architectures as designers move to take advantage of this flexibility. As a part of the 3-D OptoElectronic Stacked Processor program [1], a demonstration system is being developed which illustrates the ability to construct distributed computational systems that use optical communication for passing data between processing elements. In this system, the distribution takes the form of linear chains of processors with nearest-neighbor communication. Communication between processors in a multiprocessor system quickly becomes the bottleneck and is therefore an ideal target for the integration of optical communication. One of the goals in developing this system was to illustrate the use of optical communication in a low-cost distributed system as a step toward validation of such architectures.
2 System Topology

This demonstration system consists of two linear chains of five processors each. Three processors in each chain are configured to perform computation, and the two remaining (one on each end of the chain) are configured to bring data into and out of each chain. This is accomplished by converting between electrical-domain (digital)
and optical-domain (analog) signals at the ends of each chain (see Figure 1). The two chains operate independently but, based on the available optoelectronic device arrays, share OptoElectronic (OE) chips for communication. In addition to the ability to lengthen each chain, there is flexibility to scale the number of chains to yield a larger system. The optical chip-to-chip communication is achieved through the use of two-dimensional VCSEL and MSM-photodetector arrays provided by Honeywell Technology Center [2] and custom optics designed at UCSD.
Figure 1. System diagram showing five carrier boards placed on the system board. OptoElectronic arrays are shown on the left and right sides of the carrier boards and processing elements in the center of the carrier boards. The upper (light) PEs indicate one chain and the lower (dark) PEs indicate the second chain.
2.1 Carrier Boards

Each unit in the chain is assembled onto a small "carrier board", where each of these carrier boards contains two processing elements (PEs) and two OE arrays. The OE arrays consist of sixteen VCSELs and sixteen photodetectors in an inter-digitated 4 x 4 array. These parts were originally fabricated as a part of the GMU Co-Op program [3]. Each of the chips on the carrier boards is a bare die, wire-bonded to contacts on the carrier board. One PE belongs to each of the two chains, and the OE arrays are shared among the two chains with dedicated array elements for each chain. These carrier boards are then mounted onto a "system board" which also supports the optics, additional chips to provide control and system interface, power connectors, etc. For the system described here, there are five carrier boards mounted onto one system board. This is illustrated in Figure 1. Another goal of this demo system is to experiment with different opto-mechanics in an effort to
demonstrate the ability to scale down what has traditionally been a (physically) large part of such systems through the use of "plug-on-top" optical assemblies [4]. The construction of the carrier board modules facilitates this by allowing independent units to be rotated or moved according to a particular optical arrangement.

2.2 System Board

The purpose of the system board is to serve as a substrate for the entire system, supporting the carrier boards and opto-mechanics as well as providing the necessary control to the processing elements, and interfacing with the "outside world" to provide power, data, and system diagnostics. The board itself is a multi-layer printed circuit board (PCB) fabricated commercially. There is electrical and precision mechanical connection of the carrier boards to the system board. The primary components that perform the control and interfacing tasks are a high-end Xilinx Virtex FPGA and commodity SRAM. The Virtex FPGA was chosen for its high pin count and capacity, allowing control of the entire system from one chip and giving great flexibility to re-configure the system. It provides both the data necessary to configure the processing elements initially and the control of their operation throughout calculations. Additionally, it provides data to the processor chains, gathers results, and monitors the results, checking for errors. This approach helps reduce risk by allowing for reprogramming of the FPGA and also helps during assembly of such a prototype system. An extension of this system would have built-in controllers with the PEs and allow higher-level programming.
3 Processor Interconnection

The processors in this system are connected in two linear chains, with each processor communicating with the one to its left and the one to its right. At the ends of each chain, there is only optical communication in one direction; data is brought in and taken out of the ends of the chain electrically. The interconnection scheme chosen is meant to facilitate construction of this prototype system and serve as a starting point which can lead to more complex connection schemes that may provide additional benefit to specific applications. The logical connection of the processors in this system and the connection to the FPGA control unit is shown in Figure 2. With this connection scheme, all data is brought into the processor chain from the two ends. All data communication within the processor chain is through the FSOI links. This both helps illustrate the viability of optical communication in a multiprocessor system and ensures that the links will be heavily utilized. The impact on the system architecture is of course that data must be passed to processors in the center of the chains before they can begin calculations. This is not seen as a serious drawback in this system, as it adds only some latency to the beginning of calculations. It should be mentioned here that the application chosen for demonstration on this system is a radix-2 butterfly engine as a part of an FFT calculation. With this application, data points are brought into the chain, bounced back and forth between the processors in the chain during calculation, and finally output from the ends of the chain. The two chains of this system are utilized to compute real and imaginary points simultaneously.
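For readers unfamiliar with the application, a radix-2 butterfly simply combines two complex inputs with a twiddle factor. The sketch below is a plain floating-point illustration only; it is not the PE's actual implementation, which operates on the narrow fixed-point datapaths described later and splits real and imaginary parts across the two chains.

    #include <complex>

    // Illustrative radix-2 decimation-in-time butterfly: inputs a and b are
    // combined with the twiddle factor w into the pair (a + w*b, a - w*b).
    void butterfly(std::complex<double>& a, std::complex<double>& b,
                   const std::complex<double>& w)
    {
        const std::complex<double> t = w * b;
        b = a - t;   // "lower" output
        a = a + t;   // "upper" output
    }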
Figure 2. Diagram of the logical connection between the multiple processors, the FPGA controller, and memory. The upper and lower chains illustrate the two independent chains in this system (chips are set up either as processors or as I/O buffers).
4 Processing Element

The processing element itself is a custom ASIC designed and fabricated for this demo system. It is a 0.5-micron CMOS chip of roughly 10,000 transistors comprising both digital and analog circuitry (shown in Figure 3). Some of the design goals for this chip were that it interface with the optoelectronic devices to be used in the system, provide digital signal processing capability, facilitate system construction and debugging, allow for possible changes to the optical system, and provide the capability to use the chip as an electrical/optical interface at the ends of each chain. The design of the processing element is divided into the following functional units: input/output switching, arithmetic and logic units, optical I/O buffering, and a control interface. Input and output switching provides for the re-organization of data as it is received from, or transmitted over, the optical chip-to-chip data links. The arithmetic and logic units provide the calculation capability based on a small instruction set. Translation between the optoelectronic analog domain and the digital domain is accomplished through on-chip receiver and VCSEL driver circuits. The control interface provides for the configuration of the input/output switches and the selection of the function performed by the chip. The input and output switches provide much of the flexibility achieved in this design. The primary function of the input switches is the correction of input data words that may be necessary due to changes in the optical communication between chips or in the system I/O. The chip-to-chip communication links are all eight bits wide and the internal datapaths of the PE are six bits wide. The two remaining links out of every eight are devoted to fault tolerance. In the event that a data link is non-operational for any reason, the data being sent over that link can be diverted to one of these two redundant links. In such a case, the input switch re-assembles the data word before calculation begins. In this manner the calculation is not corrupted or impeded by the loss of a link between chips. This fault tolerance is important in a demo system to ensure that a faulty link does not deteriorate the demonstration, but it will also be important in future systems to provide reliability. The output switches complement the fault tolerance achieved with the input switches by providing the capability to re-route outgoing data onto a redundant link in the event that a link is known to be bad.
Figure 3. Microphotograph of the CMOS ASIC used as the processing element in this multiprocessor system. Eighty-six wire-bonded pads are shown at the chip perimeter. Other unbonded pads are for probe-testing.
Additionally, the output switches are used to select between the outputs of the arithmetic and logic units, the receiver outputs (in order to completely by-pass the processing functionality), and an auxiliary set of inputs which allow the chip to be used simply as a parallel VCSEL driver. Complete by-pass functionality is included in the PE chip to add flexibility and aid in system construction and debugging, as it allows chips to be logically removed from the chain without changes to the optics and also isolates the optical path from the digital functionality. The dataflow through the PE is shown in the diagram of Figure 4. In addition to the possible loss of an optical data link, changes to the optical system may result in a flipping of the data word during transmission. In order to allow different optical systems to be explored with this system, the ability to account for such flipping is included in the input switches. A final feature of the input switches is the ability to interchange the two inputs before sending them to the arithmetic and logic units. The arithmetic and logic unit (ALU) is a custom-developed component which provides the capability to perform addition, subtraction, and multiplication of signed or unsigned numbers as well as a variety of common logic functions and comparisons for maximum/minimum determination. The unit is a three-stage pipeline to increase achievable clock rates, which gives the PE its characteristic three-cycle latency on all instructions except complete by-pass. Scan-chain registers are used in the ALU and include the capability to generate pseudo-random data to provide testability. The on-chip analog receiver and VCSEL driver cells included on the CMOS ASIC are previously verified designs from UCSD and UNCC/UDel, respectively, and were designed to operate with the specific OE elements used in this system. As an additional testability feature, stand-alone copies of these cells have also been placed on the ASIC, connected to probe pads.
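To make the re-routing performed by the switches concrete, the following software sketch mimics the repair described above: a 6-bit word normally travels on six of the eight link lanes, and the bit assigned to a lane known to be bad is diverted to a spare lane. Lane numbering and the choice of spare are illustrative assumptions; in the actual chip this is done by the hardware input/output switches.

    #include <array>
    #include <cstdint>

    // Map a 6-bit data word onto 8 physical link lanes (lanes 0-5 carry data,
    // lanes 6-7 are spares). If bad_lane marks a failed data lane, its bit is
    // diverted onto spare lane 6; the receiving input switch applies the
    // inverse mapping before the word reaches the ALU.
    std::array<std::uint8_t, 8> route_word(std::uint8_t word, int bad_lane)
    {
        std::array<std::uint8_t, 8> lanes{};
        for (int i = 0; i < 6; ++i)
            lanes[i] = (word >> i) & 1u;
        if (bad_lane >= 0 && bad_lane < 6) {
            lanes[6] = lanes[bad_lane];  // re-route the bit over a redundant lane
            lanes[bad_lane] = 0;         // the failed lane carries nothing
        }
        return lanes;
    }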
Figure 4. Architecture diagram of the processor element showing dataflow through the chip (receivers, switches, ALU, and transmitters, with bi-directional analog signals at the optical side and digital signals internally). (Thinner) lines indicate the dataflow pattern when by-passing the computational portion of the chip.
5 Conclusion

The current state of integration of optical communication with digital CMOS logic affords the ability to build functional systems from which new processing architectures can evolve. We have taken advantage of this to build a prototype multiprocessor demonstration system which utilizes FSOI data communication. This system is currently in the final stages of development, and additional results will be presented at the conference.
References
1. 3D-OESP Consortium website: http://soliton.ucsd.edu/3doesp/
2. George Mason University Consortium for Optical and Optoelectronic Technologies in Computing website: http://co-op.gmu.edu/
3. Honeywell Technology Center: http://www.htc.honeywell.com/photonics/
4. C. Berger, J. T. Ekman, P. J. Marchand, F. E. Kiamilev, H. Spaanenburg: Parallel distributed free-space optoelectronic compute engine using flat "plug-on-top" optics package, accepted for presentation at the International Topical Meeting on Optics in Computing, Quebec, Canada, June 2000.

Effort sponsored by the Defense Advanced Research Projects Agency (DARPA) and Air Force Research Laboratory under agreement number F30602-97-2-0122. The US government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation thereon.
Optoelectronic Multi-Chip Modules Based on Imaging Fiber Bundle Structures

Donald M. Chiarulli1, Steven P. Levitan2

1 University of Pittsburgh, Department of Computer Science
2 University of Pittsburgh, Department of Electrical Engineering
Abstract. Recent advances in optoelectronic (OE) devices and in processing technology have focused attention on the packaging of multi-chip optoelectronic systems. Alignment tolerances and geometrical restrictions often make the implementation of free space optics within these systems quite difficult. Critical alignment issues also characterize fiber-per-channel guided-wave systems based on optical ribbon cable or large-core fiber arrays. In this presentation I will describe an alternative packaging technology based on imaging fiber bundles. In an imaging fiber bundle, each optical data channel is carried by multiple fibers. An array of spots imaged at one end of the fiber bundle is correspondingly imaged on the opposite end. In this manner, imaging fiber bundles are capable of supporting the spatial parallelism of free space interconnects with relaxed alignment and geometry constraints. We have demonstrated a 16-channel point-to-point link between two VCSEL arrays that were directly butt-coupled to an imaging fiber bundle. No other optical elements were used in the setup. We have also investigated a number of multi-chip interconnection module designs using both rigid and flexible imaging fiber bundles. Our basic approach to multipoint interconnect is to fabricate structures in which individual regions of the image at the input surface of a fiber bundle (or a fiber bundle array) are passively routed to different output surfaces. Optoelectronic devices, such as flip-chip bonded GaAs on silicon, can be mounted on metal traces plated onto each surface of the module. The resulting network provides spatially resolved bidirectional channels between each of the OE chips.
VCSEL based smart pixel array technology enables chip-to-chip optical interconnect

Yue Liu

Honeywell International, 3660 Technology Drive, Minneapolis, MN 55418
Abstract. This paper describes the most recent development and demonstration of a VCSEL-based smart pixel array (SPA) technology for chip-to-chip interconnect. This technology is based on Honeywell's commercially successful 850 nm VCSEL components, incorporates both monolithic and hybrid integration techniques, and aims to address the anticipated interconnect bottleneck in networking interconnect fabrics and between processors and memories. The following features of this technology make it not only technically feasible but also practically viable for system insertion in the very near future. First, the new generation of oxide VCSEL technology provides the key characteristics that high-density 2D optical interconnect systems desire, such as high speed, high efficiency, low power dissipation, and good array uniformity. Secondly, monolithic integration of VCSELs and photodetectors provides the system with flexible bi-directional optical I/O solutions and advantages in adopting new system architectures. Third, the 2D optoelectronic array can be seamlessly merged with state-of-the-art Si-based VLSI electronics and micro-optics using hybrid integration techniques such as solder bump bonding and wafer-scale integration. Last, and perhaps most importantly, all of our technology implementations follow the guideline of being compatible with mainstream and low-cost manufacturing practices. Device performance characteristics, the integration approach, and results of up to a 34x34 SPA prototype demonstration will be presented.
Run-Time Systems for Parallel Programming
4th RTSPP Workshop Proceedings Cancun, Mexico, May 1, 2000
Organizing Committee
General Chair – Laxmikant V. Kale
Program Chair – Ron Olsson

Program Committee
Pete Beckman, Los Alamos National Laboratory, USA
Greg Benson, University of San Francisco, USA
Luc Bougé, École Normale Supérieure of Lyon (ENS Lyon), France
Matthew Haines, Inktomi, USA
Laxmikant V. Kale, University of Illinois at Urbana-Champaign, USA
Thilo Kielmann, Vrije Universiteit, The Netherlands
Koen Langendoen, Delft University of Technology, The Netherlands
David Lowenthal, University of Georgia, USA
Frank Müller, Humboldt-Universitaet zu Berlin, Germany
Ron Olsson, University of California, Davis, USA
Raju Pandey, University of California, Davis, USA
Alan Sussman, University of Maryland, USA
Preface

Runtime systems are critical to the implementation of parallel programming languages and libraries. They support the core functionality of programming models and provide the glue between such models and the underlying hardware and operating system. As such, runtime systems have a large impact on the performance and portability of parallel programming systems. Despite the importance of runtime systems, there are few forums in which practitioners can exchange their ideas, and these are typically forums for broader areas, such as languages, operating systems, and parallel computing, in which runtime systems are a peripheral topic. RTSPP provides a forum for bringing together runtime system designers from various backgrounds to discuss the state of the art in designing and implementing runtime systems for parallel programming. The RTSPP workshop will take place on May 1, 2000 in Cancun, Mexico, in conjunction with IPDPS 2000. This one-day workshop includes technical sessions of refereed papers and panel discussions. The 8 paper presentations were selected out of 11 submissions after a careful review process; each paper was reviewed by at least four members of the program committee. Based on the reviewers' comments, the authors revised their papers for inclusion in these workshop proceedings. We thank the RTSPP Program Committee (see previous page) and the following additional people for taking part in the review process: Gabriel Antoniu (LIP, ENS Lyon, France), Yves Denneulin (IMAG, Grenoble, France), Emmanuel Jeannot (LaBRI, University of Bordeaux, France), and Loïc Prylli (LIP, ENS Lyon, France). We also thank the previous Organizing Committees for initiating this workshop and the participants in the previous workshops for making this forum successful and lively. We hope that this year's workshop will be equally interesting and exciting.
Ron Olsson Laxmikant V. Kale
A Portable and Adaptative Multi-Protocol Communication Library for Multithreaded Runtime Systems
Olivier Aumage, Luc Bouge, and Raymond Namyst
LIP, ENS Lyon, France*

Abstract. This paper introduces Madeleine II, an adaptive multi-protocol extension of the portable Madeleine communication interface. Madeleine II provides facilities to use multiple network protocols (VIA, SCI, TCP, MPI) and multiple network adapters (Ethernet, Myrinet, SCI) within the same application. Moreover, it can dynamically select the most appropriate transfer method for a given network protocol according to various parameters such as data size or user responsiveness requirements. We report performance results obtained using Fast-Ethernet and SCI.
1 Efficient Communication in Multithreaded Environments
Due to their ever-growing success in the development of distributed applications on clusters of SMP machines, today's multithreaded environments have to be highly portable and efficient on a large variety of architectures. For portability reasons, most of these environments are built on top of widespread message-passing communication interfaces such as PVM or MPI. However, the implementation of multithreaded environments mainly involves RPC-like interactions. This is obviously true for environments providing an RPC-based programming model such as Nexus [2] or PM2 [4], but also for others, which often provide functionalities that can be efficiently implemented by RPC operations. We have shown in [1] that message-passing interfaces such as MPI do not meet the needs of RPC-based multithreaded environments with respect to efficiency. Therefore, we proposed a portable and efficient communication interface, called Madeleine, which was specifically designed to provide RPC-based multithreaded environments with both transparent and highly efficient communication. However, the internals of this first implementation were strongly message-passing oriented. Consequently, the support of non-message-passing network protocols such as SCI or even VIA was cumbersome and introduced some unnecessary overhead. In addition, no provision was made to use multiple network protocols within the same application. For these reasons, we decided to design Madeleine II, a full multi-protocol version of Madeleine, efficiently portable to a wider range of network protocols, including non-message-passing ones.
* LIP, ENS Lyon, 46, Allée d'Italie, F-69364 Lyon Cedex 07, France. Contact: [email protected]
Table 1. Functional interface of Madeleine II.

mad_begin_packing      Initiates a new message
mad_begin_unpacking    Initiates a message reception
mad_end_packing        Finalizes an emission
mad_end_unpacking      Finalizes a reception
mad_pack               Packs a data block
mad_unpack             Unpacks a data block
2 The Madeleine II Multi-Protocol Communication Interface

The Madeleine II programming interface provides a small set of primitives to build RPC-like communication schemes. These primitives actually look like classical message-passing-oriented primitives. Basically, this interface provides primitives to send and receive messages, and several packing and unpacking primitives that allow the user to specify how data should be inserted into/extracted from messages (Table 1). A message consists of several pieces of data, located anywhere in user space. They are constructed (resp. de-constructed) incrementally using packing (resp. unpacking) primitives, possibly at multiple software levels, without losing efficiency. The following example illustrates this need. Let us consider a remote procedure call which takes an array of unpredictable size as a parameter. When the request reaches the destination node, the header is examined both by the multithreaded runtime (to allocate the appropriate thread stack and then to spawn the server thread) and by the user application (to allocate the memory where the array should be stored). The critical point of a send operation is obviously the series of packing calls. Such packing operations simply virtually append the piece of data to a message under construction. In addition to the address of the data and its size, the packing primitive features a pair of flag parameters which specify the semantics of the operation. The available emission flags are the following:

send_SAFER: This flag indicates that Madeleine II should pack the data in a way
that further modifications to the corresponding memory area will not corrupt the message. This is particularly mandatory if the data location is reused before the message is actually sent.

send_LATER: This flag indicates that Madeleine II should not consider accessing the value of the corresponding data until the mad_end_packing primitive is called. This means that any modification of these data between their packing and their sending shall actually update the message contents.

send_CHEAPER: This is the default flag. It allows Madeleine II to do its best to handle the data as efficiently as possible. The counterpart is that no assumption should be made about the way Madeleine II will access the data. Thus, the corresponding data should be left unchanged until the send operation has completed. Note that most data transmissions involved in parallel applications can accommodate the send_CHEAPER semantics.
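As a small illustration of how the emission flags differ in practice, the following sketch is built only from the semantics given above; the channel, destination, and data names are invented for the example.

    conn = mad_begin_packing(channel, dest);

    /* The header buffer is reused immediately after the call, so it must be
       packed with send_SAFER. */
    mad_pack(conn, &header, sizeof(header), send_SAFER, receive_EXPRESS);

    /* The status word may still be updated below; send_LATER defers reading
       it until mad_end_packing. */
    mad_pack(conn, &status, sizeof(status), send_LATER, receive_CHEAPER);

    /* The bulk payload is left untouched until the send completes: the
       default send_CHEAPER lets Madeleine II pick the cheapest strategy. */
    mad_pack(conn, payload, payload_len, send_CHEAPER, receive_CHEAPER);

    status = compute_final_status();   /* still reflected in the message */
    mad_end_packing(conn);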
The following flags control the reception of user data packets:

receive_EXPRESS: This flag forces Madeleine II to guarantee that the
corresponding data are immediately available after the unpacking operation. Typically, this flag is mandatory if the data is needed to issue the following unpacking calls. On some network protocols, this functionality may be available for free. On some others, it may put a high penalty on latency and bandwidth. The user should therefore extract data this way only when necessary.

receive_CHEAPER: This flag allows Madeleine II to possibly defer the extraction of the corresponding data until the execution of mad_end_unpacking. Thus, no assumption can be made about the exact moment at which the data will be extracted. Depending on the underlying network protocol, Madeleine II will do its best to minimize the overall message transmission time. If combined with send_CHEAPER, this flag guarantees that the corresponding data is transmitted as efficiently as possible.

Figure 1 illustrates the power of the Madeleine interface. Consider sending a message made of an array of bytes whose size is unpredictable on the receiving side. Thus, on the receiving side, one has first to extract the size of the array (an integer) before extracting the array itself, because the destination memory has to be dynamically allocated. In this example, the constraint is that the integer must be extracted EXPRESS before the corresponding array data is extracted. In contrast, the array data may safely be extracted CHEAPER, striving to avoid any copies.
Sending side:
conn = mad_begin_packing(...);
mad_pack(conn, &size, sizeof(int), send_CHEAPER, receive_EXPRESS);
mad_pack(conn, array, size, send_CHEAPER, receive_CHEAPER);
mad_end_packing(conn);

Receiving side:
conn = mad_begin_unpacking(...);
mad_unpack(conn, &size, sizeof(int), send_CHEAPER, receive_EXPRESS);
array = malloc(size);
mad_unpack(conn, array, size, send_CHEAPER, receive_CHEAPER);
mad_end_unpacking(conn);

Fig. 1. Sending and receiving messages with Madeleine II.
Madeleine II aims at enabling an efficient and exhaustive use of the underlying communication software and hardware functionalities. It is able to deal with several network protocols within the same session and to manage multiple network adapters (NICs) for each of these protocols. The user application can dynamically and explicitly switch from one protocol to another, according to its communication needs. The multi-protocol support of Madeleine II relies on the concept of channel. Channels in Madeleine II are pretty much like radio channels. They are allocated at run-time. The communication on a given channel does not interfere with the communication on another one. As a counterpart, in-order delivery is not guaranteed among distinct channels.
text_chan = mad_open_channel(TCP_ETH0);
video_chan = mad_open_channel(SISCI_SCI0);
text_conn = mad_begin_packing(text_chan, video_client);
video_conn = mad_begin_packing(video_chan, video_client);
mad_pack(text_conn, text_dataptr, text_len, ...);
mad_pack(video_conn, video_dataptr, video_len, ...);
...

Fig. 2. Example of a video server simultaneously sending video information using a SISCI channel and translation text data using a TCP channel.
In-order delivery is only enforced for point-to-point connections within the same channel. In this respect, channels look like MPI communicators, but different Madeleine II channels can be bound to different protocols as well as different adapters (Fig. 2). Of course, several channels may share the same protocol, and even the same adapter.
3 Inside Madeleine II: From the Application to the Network

Fig. 3. Conceptual view of the data path through Madeleine II's internal modules (application, Switch Module, Generic Buffer Management Layer with its BMMs, Specific Protocol Layer with its Transmission Modules, driver, and adapter).
The transmission of data blocks using Madeleine II involves several internal modules. We illustrate its internals in the case of an implementation on top of VIA (Fig. 3). Protocols such as VIA provide several methods to transfer data, namely regular message passing and remote DMA write (and optionally RDMA-read).
Moreover, there are several ways to use these transfer methods, as VIA requires registering the memory blocks before transmission. It is for instance possible to dynamically register user data blocks, or to copy them into a pool of pre-registered internal buffers. Their relative efficiency crucially depends on the size of the blocks. The current implementation of Madeleine II on top of VIA supports the three following combinations:
– Small blocks: message passing + static buffer pool.
– Medium-sized blocks: message passing + dynamically registered buffers.
– Large blocks: RDMA write + dynamically registered buffers.
Each transfer method is encapsulated in a protocol-specific Transmission Module (TM, see Fig. 3). Each TM is associated with a Buffer Management Module (BMM). A BMM implements a generic, protocol-independent management policy: either the user-allocated data block is directly referenced as a buffer, or it is copied into a buffer provided by the TM. Moreover, each BMM implements a specific scheme to aggregate successive buffers into a single piece of message. Each TM is associated with its optimal BMM. However, observe that several TMs (even from different protocols) may share the same BMM, which results in a significant improvement in development time and reliability. In the case of VIA, one can for instance take advantage of the gather/scatter capabilities of VIA to issue one-step burst data transfers when possible. This strategy is rewarding for medium-size blocks scattered in user space. For small blocks accumulated into static buffers, it is most efficient to immediately transfer buffers as soon as they get full: this enhances pipelining and overlaps the additional copy involved.

Sending Side. One initiates the construction of an outgoing message with a call to mad_begin_packing(channel, remote). The channel object selects the protocol module (VIA in our case) and the adapter to use for sending the message. The remote parameter specifies the destination node. The mad_begin_packing function returns a connection object. Using this connection object, the application can start packing user data into packets by calling mad_pack(connection, ptr, len, s_mode, r_mode). Entering the Generic Buffer Management Layer, the packet is examined by the Switch Module (Step 1 in Fig. 3). It queries the Specific Protocol Layer (Step 2) for the best-suited Transmission Module, given the length and the send/receive mode combination. The selected TM (Step 3) determines the optimal Buffer Management Module to use (Step 4). Finally, the Switch Module forwards the packet to the selected BMM. Depending on the BMM, the packet may be handled as is (and considered as a buffer), or copied into a new buffer, possibly provided by the TM. Depending on its aggregation scheme, the BMM either immediately sends the buffer to the TM or delays this operation until a later time. The buffer is eventually sent to the TM (Step 5). The TM immediately processes it and transmits it to the Driver (Step 6). The buffer is then eventually shipped to the Adapter (Step 7). Special attention must be paid to guarantee the delivery order in the presence of multiple TMs. Each time the Switch Step selects a TM differing from the
previous one, the corresponding previous BMM is flushed (commit in Fig. 3) to ensure that any delayed packet has been sent to the network. A general commit operation is also performed by the mad_end_packing(connection) call to ensure that no delayed packet remains waiting in the BMM.

Receiving Side. Processing an incoming message on the destination side is just symmetric. A message reception is initiated by a call to mad_begin_unpacking(channel), which starts the extraction of the first incoming message for the specified channel. This function returns the connection object corresponding to the established point-to-point connection, which contains the remote node identification among other things. Using this connection object, the application issues a sequence of mad_unpack(connection, ptr, len, s_mode, r_mode) calls, symmetric to the series of mad_pack calls that generated the message. The Switch Step is performed on each unpack and must select the same sequence of TMs as on the sending side. For instance, a packet sent by the DMA Transmission Module of VIA must be received by the same module on the receiving side. The checkout function (dual to the commit one on the sending side) is used to actually extract data from the network to the user application space: indeed, just as packet sending could be delayed on the sending side for aggregation, the actual packet extraction from the network may also be delayed to allow for burst data reception. Of course, the final call to mad_end_unpacking(connection) ensures that all expected packets are made available to the user application.
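The length-based choice among the three VIA transfer combinations listed above can be pictured as a small selection routine. The sketch below is only an illustration of the Switch Module's decision; the threshold constants are invented, since the actual limits used by the Madeleine II VIA driver are not given in the text.

    #include <cstddef>

    // Hypothetical sketch of the size-based transfer-method choice for the
    // VIA driver; the two threshold values are placeholders.
    enum ViaTransferMethod { TM_STATIC_POOL, TM_DYNAMIC_REG, TM_RDMA_WRITE };

    ViaTransferMethod select_transfer_method(std::size_t len)
    {
        const std::size_t kSmallLimit  = 1024;       // placeholder threshold
        const std::size_t kMediumLimit = 32 * 1024;  // placeholder threshold

        if (len <= kSmallLimit)
            return TM_STATIC_POOL;   // copy into the pre-registered buffer pool
        if (len <= kMediumLimit)
            return TM_DYNAMIC_REG;   // register the user block, message passing
        return TM_RDMA_WRITE;        // register the user block, RDMA write
    }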
Discussion. This modular architecture, combined with packet-based message construction, allows Madeleine II to be efficient on top of message-passing protocols as well as put/get protocols. Whatever the underlying protocol used, Madeleine II's generic, flexible buffer management layer is able to adapt itself tightly to its particularities, and hence deliver most of the available networking potential to the user application. Moreover, the task of implementing a new protocol in Madeleine II is considerably alleviated by re-using existing BMMs.
4 Implementation and Performances

We now evaluate Madeleine II on top of several network protocols. All features mentioned above have been implemented. Drivers are currently available for the TCP, MPI, VIA, SISCI [3] and SBP [6] network interfaces.

Testing Environment. The following performance results were obtained using a cluster of dual Intel Pentium II 450 MHz PC nodes with 128 MB of RAM running Linux (kernel 2.1.130 for VIA, and kernel 2.2.10 for TCP and SISCI). The cluster interconnection networks are 100 Mbit/s Fast Ethernet for TCP and VIA, and Dolphin SCI for SISCI. The tests run on the TCP/IP protocol use the standard UNIX sockets. The tests run on the VIA protocol use the MVIA 0.9.2 implementation from NERSC (National Energy Research Scientific Computing Center, Lawrence Berkeley National Labs).
Table 2. Latency (left) and bandwidth (right) on top of TCP and SISCI.

                    Latency              Bandwidth
Protocol            TCP       SISCI      TCP         SISCI
Raw performance     59.8 µs   2.3 µs     11.1 MB/s   76.5 MB/s
Madeleine           77.4 µs   5.9 µs     10.5 MB/s   70.0 MB/s
Madeleine II        67.2 µs   7.9 µs     11.0 MB/s   57.0 MB/s
TCP. Surprisingly enough, Madeleine II outperforms Madeleine (Table 2). Madeleine used to require attaching a short header to each transferred message, whereas Madeleine II gives the user finer control over the message structure. The difference in performance between raw TCP and Madeleine II on top of TCP is the result of the current software overhead of Madeleine II. The bandwidth of Madeleine II on top of TCP is very close to the raw bandwidth of TCP.

SISCI. The new SISCI Specific Protocol Layer of Madeleine II is not yet as optimized as the one used by Madeleine. This is why the bandwidth measured with Madeleine II on top of SISCI is not as good as the one obtained with Madeleine (Table 2). The difference in latency between Madeleine II and Madeleine is due to some additional processing in the internals of Madeleine II. Future optimizations will hopefully solve this problem.

Dynamic Transfer Method Selection. We mentioned above the capability of Madeleine II to dynamically choose the most appropriate transfer paradigm within a given protocol. Figure 4 shows the dramatic influence of dynamic transfer paradigm selection on performance using VIA. VIA requires the memory areas involved in a transfer to be registered. Such dynamic registration operations are expensive. This cost is especially prohibitive for short messages, and using a pool of pre-registered buffers helps circumvent the problem. Instead of registering the memory area where the messages are stored, one can copy the messages into these buffers. This amounts to exchanging registration time for copying time. This is obviously inefficient for long messages. The two curves are plotted in Figure 4. The Multi-Paradigm curve is obtained by activating the dynamic paradigm selection of Madeleine II. It is optimal both for short messages and for long messages!

Fig. 4. Multi-paradigm support: dynamic transfer method selection (Madeleine/VIA), plotting transfer time (µs) versus packet size (bytes) for the multi-paradigm, dynamic registration, and static registration + copy strategies.
5 Related Work

Many communication libraries have recently been designed to provide portable interfaces and/or efficient implementations to build distributed applications.
However, very few of them provide efficient support for RPC-like communication schemes, support for multi-protocol communications, and support for multithreading. Illinois Fast Messages (FM) [5] provides a very simple mechanism to send data to a receiving node that is notified upon arrival by the activation of a handler. Releases 2.x of this interface provide interesting gather/scatter features which allow an efficient implementation of zero-copy data transmissions. However, it is not possible to issue a transmission with the semantics of the receive_CHEAPER Madeleine II flag: only receive_EXPRESS-like receptions are supported, and it is not possible to enforce aggregated transmissions. The Nexus multithreaded runtime [2] features a multi-protocol communication subsystem very close to the one of Madeleine II. The messages are constructed using similar packing operations, except that no "high-level" semantics can be associated with the data: there is no notion of CHEAPER specifications, which allows Madeleine II to choose the best-suited strategy. Also, as for FM, unpacking operations behave like receive_EXPRESS Madeleine II transmissions.
6 Conclusion

In this paper, we have described the new Madeleine II communication interface. This new version features full multi-protocol, multi-adapter support as well as an integrated new dynamic most-efficient transfer-method selection mechanism. We showed that this mechanism gives excellent results with protocols such as VIA. We are now actively working on having Madeleine II running across clusters connected by heterogeneous networks.

References
1. Luc Bougé, Jean-François Méhaut, and Raymond Namyst. Efficient communications in multithreaded runtime systems. In Proc. 3rd Workshop on Runtime Systems for Parallel Programming (RTSPP '99), volume 1586 of Lect. Notes Comp. Science, pages 468–482, San Juan, Puerto Rico, April 1999. Springer-Verlag.
2. I. Foster, C. Kesselman, and S. Tuecke. The Nexus approach to integrating multithreading and communication. Journal on Parallel and Distributed Computing, 37(1):70–82, 1996.
3. IEEE. Standard for Scalable Coherent Interface (SCI), August 1993. Standard no. 1596.
4. Raymond Namyst and Jean-François Méhaut. PM2: Parallel Multithreaded Machine. A computing environment for distributed architectures. In Parallel Computing (ParCo '95), pages 279–285. Elsevier, September 1995.
5. S. Pakin, V. Karamcheti, and A. Chien. Fast Messages: Efficient, portable communication for workstation clusters and MPPs. IEEE Concurrency, 5(2):60–73, April 1997.
6. R. D. Russell and P. J. Hatcher. Efficient kernel support for reliable communication. In 13th ACM Symposium on Applied Computing, pages 541–550, Atlanta, GA, February 1998.
CORBA Based Runtime Support for Load Distribution and Fault Tolerance

Thomas Barth, Gerd Flender, Bernd Freisleben, Manfred Grauer, and Frank Thilo

University of Siegen, Hölderlinstr. 3, D-57068 Siegen, Germany
{barth, grauer, [email protected], {freisleb, [email protected]

Abstract. Parallel scientific computing in a distributed computing environment based on CORBA requires additional services not (yet) included in the CORBA specification: load distribution and fault tolerance. Both of them are essential for long-running applications with high computational demands, as in the case of computational engineering applications. The proposed approach for providing these services is based on integrating load distribution into the CORBA naming service, which in turn relies on information provided by the underlying Winner resource management system developed for typical networked Unix workstation environments. The support of fault tolerance is based on error detection and backward recovery by introducing proxy objects which manage checkpointing and restart of services in case of failures. A prototypical implementation of the complete system is presented, and performance results obtained for the parallel optimization of a mathematical benchmark function are discussed.
1 Introduction
Object-oriented software architectures for distributed computing environments based on the Common Object Request Broker Architecture (CORBA) have started to offer real-life production solutions to interoperability problems in various business applications, most notably in the banking and financial areas. In contrast, most of today's applications for distributed scientific computing traditionally use message passing as the means for communication between processes residing on the nodes of a dedicated parallel multiprocessor architecture. Message passing is strongly related to the way communication is realized in parallel hardware and is particularly adequate for applications where data is frequently exchanged between nodes. Examples are data-parallel algorithms for complex numerical computations, such as in computational fluid dynamics, where essentially algebraic operations on large matrices are performed. The advent of networks of workstations (NOWs) as a cost-effective means for parallel computing and the advances of object-oriented software engineering methods have fostered efforts to develop distributed object-oriented software infrastructures for performing scientific computing applications on NOWs and
also over the WWW [7]. Other computationally intensive engineering applications with different communication requirements, such as simulations and/or multidisciplinary optimization (MDO) problems [3], [5] typically arising in the automotive or aerospace industry, have even strengthened the need for a suitable infrastructure for distributed/parallel computing. Two essential features of such an infrastructure are load distribution and a certain level of fault tolerance. Load distribution improves the effectiveness of the given resources, resulting in reduced computation times. Fault tolerance is especially important for long-running engineering applications like MDO software systems. It is obviously crucial to provide mechanisms to prevent the whole computation from failing due to a single error on the server side. In this paper, CORBA-based runtime support for parallel applications is presented. This support encompasses load distribution as well as fault tolerance for parallel applications using CORBA as communication middleware.
2 Integrating Load Distribution into CORBA
In general, CORBA applications consist of a set of clients (application objects) requesting a set of services. These services can either be other application objects within a distributed application, or commonly available services (object services) providing, e.g., name resolution (naming service) or object persistence (persistence service). There are different approaches to integrating load distribution functionality into a CORBA environment:
– Implementation of an explicit service (e.g. a "trader" [12]) which returns an object reference for the requested service on an available host (centralized load distribution strategy) or references for all available service objects. In the latter case, the client has to evaluate the load information for all of the returned references and has to make a selection by itself (decentralized load distribution strategy).
– Integrating the load distribution mechanism into the ORB itself, e.g. by replacing the default locator by a locator with an integrated load distribution strategy [6] or using an IDL-level approach [13].
The drawbacks of these approaches are either that the source code of clients has to be changed (as in the first approach) or that load distribution depends on a specific ORB implementation or IDL compiler and can thus not be utilized when other ORBs are used (as in the second approach). To integrate load distribution transparently into a CORBA environment, our proposal is based on integrating it into the naming service. This ensures transparency for the client side and allows the reuse of the load distribution naming service in any other CORBA-compliant ORB implementation. The naming service is utilized in almost every CORBA-based implementation. In the case of applications which do not make use of the naming service, it would be useful to implement load distribution as an explicit service.
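The transparency argument can be seen from the client side: the lookup sketched below uses only the standard CosNaming interface (written against the usual CORBA C++ mapping), and it is unchanged whether the standard naming service or the load-distributing one answers the request. The Solver interface and service name are hypothetical, the ORB/stub headers are implementation-specific, and orb is assumed to be an already initialised ORB reference.

    // Client-side name resolution; nothing here changes when the load-aware
    // naming service is substituted for the standard one.
    Solver_ptr find_solver(CORBA::ORB_ptr orb)
    {
        CORBA::Object_var obj = orb->resolve_initial_references("NameService");
        CosNaming::NamingContext_var root = CosNaming::NamingContext::_narrow(obj);

        CosNaming::Name name;
        name.length(1);
        name[0].id   = CORBA::string_dup("Solver");   // hypothetical service name
        name[0].kind = CORBA::string_dup("");

        // With the load-distributing naming service installed, resolve()
        // returns the reference of the instance on the currently
        // least-loaded host reported by Winner.
        CORBA::Object_var ref = root->resolve(name);
        return Solver::_narrow(ref);
    }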
Fig. 1. Schema for the integration of load distribution in a naming service.
Our concept is illustrated in Fig. 1; it relies on the Winner resource management system [1] [2]. Basically, Winner provides load distribution services for a network of Unix workstations. Its components of interest here are the central system manager and the node managers. There is one node manager on each participating workstation, periodically measuring the node's performance and system load, i.e. data like CPU utilization which is collected by the host operating system. This data is sent to the system manager, which has functionality to determine the machine with the currently best performance. Requests from application objects to the naming service are resolved using this load information for the selection of an appropriate server. The naming service is not an integral part of a CORBA ORB but is always implemented as a CORBA service. The OMG specifies the interface of a naming service without making assumptions about implementation details of the service. Therefore, every ORB can interoperate with a new naming service as long as it complies with the OMG specification.
3 Runtime Support for Fault Tolerance in CORBA Based Systems
The CORBA specification as well as the Common Object Services Specification do not yet offer an adequate level of fault tolerance. Due to the need for fault tolerance
in more complex distributed systems, various approaches have been developed. The Piranha system [8], for example, is based on an ORB supporting object groups, failure detection, etc. Using these facilities provided by the ORB, Piranha is implemented as a CORBA Object Service for monitoring distributed applications and managing fault tolerance via active or passive replication. The major drawback of Piranha is its dependency on non-standard ORB features like object groups. Another approach avoids this drawback by complying completely with the CORBA standard: IGOR (Interactive-Group Object-Replication) [9] realizes fault tolerance also by managing groups of objects providing redundant services. In contrast to the Piranha system, IGOR is portable and interoperable with today's ORB implementations. Lately, there has also been a proposal for the integration of redundancy, fault detection, and recovery into the CORBA standard [10]. Unlike the previously mentioned approaches, our concept is not based on replicated services in object groups but on the integration of checkpointing and restarting functionality only. Especially for applications with a maximum degree of parallelism (e.g. scalable optimization algorithms), it is not desirable to use a large amount of the computational resources (i.e. hosts in the network) exclusively for availability purposes, as in the case of active replication. Thus, in the case of parallel, long-running applications it is a good compromise to restrict fault tolerance to checkpointing and restarting. Similar to the concept of passive replication, frequently (i.e. after each method call on the server side) generated checkpoints are used to restart a failed service. Currently, the only way to detect an error on the client side of a CORBA application is the exception CORBA::COMM_FAILURE thrown when a CORBA client tries to call a service which is not available anymore (e.g. due to a network failure, a crashed server process, or a crashed machine). Using the concepts for the naming service already described, it is possible to request a new reference to a service if a call to a server object fails. This approach is sufficient for services without an internal state. In the more general case of services depending on an internal state of the server object, it is inevitable to (a) save the state (checkpoint) of the server object, e.g. after each successful call to a server's method, and (b) have the opportunity to restore this state in a newly created server object. We evaluated the following alternatives to integrate checkpointing and restarting functionality on the client side, assuming that the service object provides a method to create a checkpoint for restarting the service if an error occurs: (a) modification of the client-side code to handle the CORBA::COMM_FAILURE exception and to restart a service, (b) extending the client-side stub code generated by the IDL compiler with exception handling etc., and (c) introduction of proxy classes derived from the stub classes on the client side. The major drawback of alternative (a) is the amount of code to be inserted on the client side: every single call from a client to a method of the server must first get a checkpoint from the server, then handle the exception, and start a new server (using the checkpoint) in case of a failure. It would be useful if the automatically generated stub code comprised this code, as in alternative (b). But this means changing the IDL compiler itself, and thus this solution would be
specific to a certain CORBA implementation providing its own IDL compiler. Alternative (c) is a compromise between the amount of modifications to be made on the client side and the targeted platform independence of the concept: the modifications on the client side are limited to the use of a proxy class instead of the stub class. This proxy class is derived from the stub class and therefore provides all of the methods of the stub class. The additional methods handle the creation of a checkpoint and the restoring of an object's state according to a checkpoint. If a class offers this functionality for checkpointing and restoring a certain internal state, it is in principle possible to migrate a service from one host to another not only when an error has occurred but also due to a changing load situation on a host. With the current implementation, the proxy class for each service class has to be implemented manually. This could easily be automated by parsing the class definition. For each method, code to call the parent class (the stub) method, along with exception handling code and a call to the server object's checkpoint and restore functions, would have to be generated.
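A minimal sketch of the per-method control flow inside such a proxy is given below. The paper derives the proxy from the IDL-generated stub class; since the stub base class name is ORB-specific, the sketch uses a plain wrapper around the object reference to show the same flow. The Solver interface, its solve() operation, and the get_state()/set_state() checkpoint operations are hypothetical, and handling of the case where no checkpoint exists yet is omitted.

    // Pattern repeated for every IDL operation: forward the call, checkpoint
    // on success, and on CORBA::COMM_FAILURE locate a fresh server and
    // restore the last saved state before retrying.
    class SolverProxy {
    public:
        explicit SolverProxy(Solver_ptr s) : target_(Solver::_duplicate(s)) {}

        CORBA::Long solve(CORBA::Long arg) {
            for (;;) {
                try {
                    CORBA::Long result = target_->solve(arg);
                    checkpoint_ = target_->get_state();     // checkpoint after each call
                    return result;
                } catch (const CORBA::COMM_FAILURE&) {
                    target_ = locate_new_server();          // re-resolve via the naming service
                    target_->set_state(checkpoint_.in());   // restore the last checkpoint
                }
            }
        }

    private:
        Solver_var locate_new_server();   // e.g. the lookup sketched in Sect. 2
        Solver_var target_;
        CORBA::Any_var checkpoint_;       // opaque state; the prototype keeps it in a
                                          // simple checkpoint storage service instead
    };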
Fig. 2. Scheme of client, server, and proxy objects (object and request proxies, stubs, checkpoint) and their call relationship.
As a proof of concept, a simple service for storing checkpointing data has been implemented. It simply provides functions to store/retrieve arbitrary values for the server object. No real persistency, like storing checkpoints on disk media, has been implemented yet. Furthermore, the current implementation is rather inefficient. In addition to transparent synchronous method calls, CORBA provides asynchronous method invocations via the DII (Dynamic Invocation Interface). When a client wants to utilize the DII, it does not call the server object's methods directly, but uses so-called request objects instead. These request objects offer methods to asynchronously initiate methods of the server object and fetch the corresponding results at a later time. To enable fault tolerance in this case, request proxies are used just like the object proxies. The relationship between the described objects is shown in Fig. 2.
4 Experimental Results
To investigate the benefits of an integrated load distribution mechanism in CORBA, a test case from mathematical optimization was taken. The well-known Rosenbrock test function [14] is widely used for benchmarking optimization algorithms because of its special mathematical properties. In our experiments, the function is only used to demonstrate the benefits of an adequate placement of computationally expensive processes on nodes of a NOW. It is not intended to present a new approach to the solution of the benchmark problem. To compute the function in parallel, a decomposed formulation of the Rosenbrock function has been taken. In the decomposed formulation, several (sub-)problems with a smaller dimension than the original n-dimensional problem are solved by workers, and the subproblems are then combined for the solution of the original problem in a manager. In Fig. 3, the results of the different test scenarios are compared. All test cases were computed using multiple instances of a sequential implementation of the Complex Box algorithm [4] on a network of 10 workstations. The ORB used was omniORB 2.7.1 [11]. For the comparison of the different implementations of the naming service, a background load was generated on 0, 2, 4, 6, or 8 hosts.
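For readers unfamiliar with it, the Rosenbrock function in its commonly cited n-dimensional form is

    f(x) = sum_{i=1}^{n-1} [ 100 (x_{i+1} - x_i^2)^2 + (1 - x_i)^2 ],

whose long, narrow, curved valley is what makes it a popular stress test for optimization algorithms; the decomposed formulation actually used in the experiments follows the cited literature [14].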
Fig. 3. Different test cases of a decomposed 30- and 100-dimensional Rosenbrock function with 3 and 7 worker problems under different load situations (runtime in seconds versus the number of hosts with background load, for the plain CORBA and the CORBA/Winner naming service).
The two lower curves show the computation times for a 30-dimensional Rosenbrock function with 3 worker problems (problem dimensions 10, 9 and 9) and a 2-dimensional manager problem. In this scenario, 6 workstations were available for the 4 processes. The effect of load distribution is obvious when 2 hosts had background load: the selection of hosts with the new naming service avoided these hosts, and hence the computation time was the same as in the case without
background load. The two upper curves compare the computation times for a 100-dimensional Rosenbrock function with 7 worker problems. With increasing background load, the advantage diminishes because both implementations of the naming service are forced to select services on hosts with background load. To summarize, the benefit of load distribution for the test cases mentioned above can be estimated at roughly 40% runtime reduction in the best case. Even in the worst case it yields at least the same results as the unmodified naming service. The mathematical properties of the test cases mentioned above result in an average reduction of computation time of about 15%. Providing fault tolerance by proxy classes introduces an additional level of indirection. Additionally, storing the state of the server objects upon each method invocation causes some overhead. To quantify to what extent this overhead affects application runtimes, the above experiment has been repeated, this time using fault tolerant proxy classes. In Table 1, computation times for a 100-dimensional Rosenbrock problem are shown for the proxy and non-proxy case, respectively. The measurements have been conducted for different numbers of iterations of the worker's algorithm. An increasing number of iterations results in longer runtimes of the worker problems because it is a stopping criterion of the algorithm. Table 1 demonstrates that fault tolerance comes at quite a cost in this scenario. In the worst case, the application runtime using proxy objects is more than three times that of the plain version. Because the overhead is constant for each method call, the relative slowdown is lower the more time is spent in the called method. It is important to remark that with real-life engineering applications, most method calls will take orders of magnitude longer to finish. Additionally, the checkpoint storage class has not been optimized for speed in any way, as the current implementation is merely a proof of concept.
Table 1. Runtimes for a 100-dimensional Rosenbrock function with 7 worker problems and a varying number of worker iterations.
Iterations   Runtime without proxy [s]   Runtime with proxy [s]   Overhead [%]
10,000                 92                        309                  235.9
20,000                165                        376                  127.8
30,000                232                        445                   91.8
40,000                299                        505                   68.9
50,000                383                        594                   55.1
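The overhead column follows directly from the two runtime columns; for the first row, for example,

\text{overhead} = \frac{t_{\text{proxy}} - t_{\text{plain}}}{t_{\text{plain}}} \times 100\% = \frac{309 - 92}{92} \times 100\% \approx 235.9\% .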
5 Conclusions
The design and implementation of a CORBA naming service providing load distribution and basic fault tolerance services based on proxy objects was presented. These services are essential for long-running computational engineering
applications in distributed computing environments. Experiments demonstrated the feasibility of both concepts. Areas of future work are: (a) improving, optimizing, and stabilizing the prototype implementation of the proposed CORBA load distribution and fault tolerance services, (b) evaluating its benefits in real-life engineering MDO applications, and (c) extending the load measurement and process placement features for wide-area networks to enable CORBA based distributed/parallel meta-computing over the WWW. Additionally, the proposed extensions to the CORBA specification concerning redundancy, fault detection and recovery must be evaluated.
References
1. Arndt, O., Freisleben, B., Kielmann, T., Thilo, F., Scheduling Parallel Applications in Networks of Mixed Uniprocessor/Multiprocessor Workstations, Proc. Parallel and Distributed Computing Systems (PDCS'98), pp. 190-197, ISCA, Chicago, 1998
2. Barth, T., Flender, G., Freisleben, B., Thilo, F., Load Distribution in a CORBA Environment, in: Proc. of Int'l Symposium on Distributed Objects and Applications (DOA'99), pp. 158-166, IEEE Press, Edinburgh, 1999
3. Barth, T., Grauer, M., Freisleben, B., Thilo, F., Distributed Solution of Simulation-Based Optimization Problems on Workstation Networks, Proc. 2nd Int. Conf. on Parallel Computing Systems, pp. 152-159, Ensenada, Mexico, 1999
4. Boden, H., Gehne, R., Grauer, M., Parallel Nonlinear Optimization on a Multiprocessor System with Distributed Memory, in: Grauer, M., Pressmar, D. (eds.), Parallel Computing and Mathematical Optimization, Springer, 1991, pp. 65-78
5. Grauer, M., Barth, T., Cluster Computing for Treating MDO Problems by OpTiX, to appear in: Mistree, F., Belegundu, A. (eds.), Proc. Conference on Optimization in Industry II, Banff, Canada, June 1999
6. Gebauer, C., Load Balancer LB - a CORBA Component for Load Balancing, Diploma Thesis, University of Frankfurt, 1997
7. Livny, M., Raman, R., High-Throughput Resource Management, in: Foster, I., Kesselman, C. (eds.), The GRID: Blueprint for a New Computing Infrastructure, pp. 311-337, Morgan Kaufmann, 1998
8. Maffeis, S., Piranha: A CORBA Tool for High Availability, IEEE Computer, Vol. 30, No. 4, pp. 59-66, April 1997
9. Modzelewski, B., Cyganski, D., Underwood, M., Interactive-Group Object-Replication Fault Tolerance for CORBA, 3rd Conf. on Object-Oriented Techniques and Systems, Portland, Oregon, June 1997, pp. 241-244
10. Fault Tolerant CORBA, Object Management Group TC Document Orbos/99-1208, December 1999
11. omniORB - a Free Lightweight High-Performance CORBA 2 Compliant ORB, AT&T Laboratories Cambridge, 1998 (http://www.uk.research.att.com/omniORB/omniORB.html)
12. Rackl, G., Load Distribution for CORBA Environments, Diploma Thesis, University of Munich, 1997 (http://wwwbode.informatik.tu-muenchen.de/~rackl/DA/da.html)
13. Schiemann, B., Borrmann, L., A New Approach for Load Balancing in High-Performance Decision Support Systems, Future Generation Computer Systems, Vol. 12, Issue 5, April 1997, pp. 345-355
14. Schittkowski, K., Nonlinear Programming Codes, Springer, 1980
Run-time Support for Adaptive Load Balancing
Milind A. Bhandarkar, Robert K. Brunner, and Laxmikant V. Kale
Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign, USA
{milind, rbrunner,
[email protected], WWW home page: http://charm.cs.uiuc.edu/
Abstract. Many parallel scientific applications have dynamic and irregular computational structure. However, most such applications exhibit persistence of computational load and communication structure. This allows us to embed a measurement-based automatic load balancing framework in the run-time systems of parallel languages that are used to build such applications. In this paper, we describe such a framework built for the Converse [4] interoperable runtime system. This framework is composed of mechanisms for recording application performance data, a mechanism for object migration, and interfaces for plug-in load balancing strategy objects. Interfaces for strategy objects allow easy implementation of novel load balancing strategies that could use application characteristics on the entire machine, or only a local neighborhood. We present the performance of a few strategies on a synthetic benchmark and also the impact of automatic load balancing on an actual application.
1 Motivation and Related Work
An increasing number of emerging parallel applications exhibit dynamic and irregular computational structure. Irregularities may arise from modeling of complex geometries and use of unstructured meshes, for example, while the dynamic behavior may result from adaptive refinements and the evolution of a physical simulation. Such behavior presents serious performance challenges. Load may be imbalanced to begin with due to irregularities, and imbalances may grow substantially with dynamic changes. We are participating in physical simulation projects at the Computational Science and Engineering centers of the University of Illinois (Rocket simulation, and Simulation of Metal Solidification), where such behaviors are commonly encountered.
Load balancing is a fundamental problem in parallel computing, and a great deal of research has been done on this subject. However, a lot of this research is focussed on improving the load balance of particular algorithms or applications. General purpose load balancing research deals mainly with process migration in operating systems and, more recently, in application frameworks. C++ libraries such as DOME [1] implement the data-parallel programming paradigm as distributed objects and allow migration of work in response to varying load conditions. Systems such as CARMI [10] simply notify the user program of the load
imbalance, and leave it to the application process to explicitly move its state to a new processor. Multithreaded systems such as PM2 [9] require every thread to store its state in specially allocated memory, so that the system can migrate the thread automatically. An object migration system called ELMO [3], built on top of Charm [6, 7], implements object migration mainly for fault-tolerance. Applications in areas such as VLSI and Computational Fluid Dynamics (CFD) use graph partitioning programs such as METIS [8] to provide initial load balance. However, every such application has to specifically provide code for monitoring load imbalance and to invoke the load balancer periodically to deal with dynamic behavior.
We have developed an automatic measurement-based load balancing framework to facilitate high-performance implementations of such applications. The framework requires that a computation be partitioned into more pieces (typically implemented as objects) than there are processors, letting the framework handle the placement of these pieces. The framework relies on a "principle of persistence" that holds for most physical simulations: the computational load and communication structure of (even dynamic) applications tends to persist over time. For example, even though the load of some object instance changes drastically at an adaptive refinement, such events are infrequent, and the load remains relatively stable between such events. The framework can be used to handle application-induced imbalances as well as external imbalances (such as those generated on a timeshared cluster). It cleanly separates runtime data-collection and object migration mechanisms into a distributed database, which allows optional strategies to plug in modularly to decide which objects to migrate where. This paper presents results obtained using our load balancing framework. We briefly describe the framework, then the strategies currently implemented and how they compare on a synthetic benchmark, and finally results on a crack-propagation application implemented using it.
2 Load Balancing Framework
Our framework [2] views a parallel application as a collection of computing objects which communicate with each other. Furthermore, these objects are assumed to exhibit temporal correlation in their computation and communication patterns, allowing effective measurement-based load balancing without application-specific knowledge. The central component of the framework is the load balancer's distributed database, which coordinates load balancing activities. Whenever a method of a particular object runs, the time consumed by that object is recorded. Furthermore, whenever objects communicate, the database records information about the communication. This allows the database to form an object-communication graph, in which each node represents an object, with the computation time of that object as a weight, and each arc is a communication pathway representing
communication from one object to another object, recording the number of messages and the total volume of communication for each arc. The design of Charm++ [5] offers several advantages for this kind of load balancing. First, parallel programs are composed of many coarse-grained objects, which represent convenient units of work for migration. Also, messages are directed to particular objects, not processors, so an object may be moved to a new location without informing other objects about the change; the run-time system handles the message delivery with forwarding. Furthermore, the message-driven design of Charm++ means that work is triggered by messages, which are dispatched by the run-time system. Therefore, the run-time knows which object is running at any particular time, so the CPU time and message traffic for each object can be deposited with the framework. Finally, the encapsulation of data within objects simplifies object migration. However, the load balancing framework is not limited to Charm++ only. Any language implemented on top of Converse can utilize this framework. For this purpose, the framework does not interact with object instances directly. Instead, interaction between objects and the load balancing framework occurs through object managers. Object managers are parallel objects (with one instance on each processor) that are supplied by the language runtime system. Object managers are responsible for creation, destruction, and migration of language-specific objects. They also supply the load database coordinator with computational loads and communication information of the objects they manage. Object managers register the managed objects with the framework, and are responsible for mapping the framework-assigned system-wide unique object identifier to the language-specific identifier (such as the thread-id in multithreaded systems, the chare-id in Charm++, the processor number in MPI, etc.). We have ported a CFD application written using Fortran 90 and MPI with minimal changes to use our framework, using an MPI library called ArrayMPI on top of the Converse runtime system. The ArrayMPI library allows an MPI program to create a number of virtual processors, implemented as Converse threads, which are mapped by the runtime system to available physical processors. The application program built using this MPI library then executes as if there are as many physical processors in the system as these virtual processors. The LB framework keeps track of the computational load and communication graph of these virtual processors. Periodically, the MPI application transfers control to the load balancer using a special call MPI_Migrate, which allows the framework to invoke a load balancing strategy and to re-map these virtual processors to physical processors, thus maintaining load balance.
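The object-communication graph that the database accumulates can be pictured as a weighted graph over object identifiers. The following C++ sketch is purely illustrative; the type and field names are invented here and are not the framework's actual data structures.

#include <cstdint>
#include <unordered_map>
#include <vector>

using ObjectId = std::uint64_t;  // framework-assigned, system-wide unique id

struct CommEdge {                // one arc of the object-communication graph
    ObjectId to;                 // destination object
    std::uint64_t messages = 0;  // number of messages sent on this arc
    std::uint64_t bytes    = 0;  // total communication volume on this arc
};

struct ObjectRecord {            // one node of the graph
    double cpuTime = 0.0;        // measured computation time (node weight)
    int    currentPe = 0;        // processor the object currently lives on
    std::vector<CommEdge> sends; // outgoing arcs
};

// Per-processor portion of the graph: only locally resident objects are kept,
// matching the distributed data collection described above.
using LocalCommGraph = std::unordered_map<ObjectId, ObjectRecord>;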
3 Load Balancing Strategies
Load balancing strategies are a separate component of the framework. By separating the data collection code common to all strategies, we have simplified the development of novel strategies. For efficiency, each processor collects only a portion of the object-communication graph, that is, only the parts concerning local objects.
Fig. 1. Components of the load balancing framework on a processor.
This gives the strategy the freedom to ignore or locally analyze part of the graph (to minimize load-balancing overhead), or to collect the graph all in one place for a more thorough, centralized analysis. The strategy chooses a number of objects to migrate to improve program efficiency, and those decisions are handed back to the framework, which packs and migrates the objects to their new locations. Once the run-time instrumentation has captured running times and the communication graph, it is necessary to have a re-mapping strategy in place, which will attempt to produce an improved mapping. This is a multi-dimensional optimization problem, as it involves minimizing both the communication times and load imbalances. Producing an optimal solution is not feasible, as it is an NP-hard problem. We have developed and experimented with several preliminary heuristic strategies, which we describe next.
Greedy Strategy: The simplest strategy is a greedy strategy. It organizes all objects in decreasing order of their computation times. All the processors are organized in a min-heap based on their assigned loads. The algorithm repeatedly selects the heaviest unassigned object and assigns it to the least loaded processor, updating the loads and re-adjusting the heap. Although this strategy is capable of taking the communication costs into account while computing processor loads, it does not explicitly aim at minimizing communication. For N objects, this strategy has a re-mapping complexity of O(N log N). Also, since
this strategy does not take into account the current assignments of objects, it may result in a large number of migration requests.
Refinement Strategy: The refinement strategy aims at minimizing the number of objects that need to be migrated, while improving load balance. It only considers the objects on overloaded processors. For each overloaded processor, the algorithm repeatedly moves one of its objects to an underloaded processor, until its load is below an acceptable overload limit. The acceptable overload limit is a parameter specified to this strategy and may vary based on the overhead of migration. Typically this overload limit is between 1.02 and 1.05, and it governs by what factor any processor may exceed the average load.
Metis-based Strategy: Metis [8] is a graph partitioning program and library developed at the University of Minnesota. It is mainly used for partitioning large structured or unstructured meshes, and it provides several algorithms for graph partitioning. The object communication graph that is obtained from the load balancing framework is presented to Metis in order to be partitioned onto the available number of processors. The objective of Metis is to find a reasonable load balance while minimizing the edgecut, where the edgecut is defined as the total weight of edges that cross the partitions, which in our case denotes the number of messages sent across processors.
Figure 2 shows the time taken per iteration of a synthetic benchmark when run with the load balancing strategies described above. This benchmark consists of 32 objects with different loads and relatively low communication, initially mapped in a round-robin fashion to 8 processors. Load balancing is performed after every 500 iterations. All strategies improve performance, with the Metis-based strategy leading to the best performance. A load balancing strategy may improve the performance of a parallel application, but if the load balancing step consumes more time than is gained by load redistribution, it may not be worthwhile. Today's parallel scientific applications run for hours. Thus it may be possible for the load balancers to spend more time in finding a better load distribution. All three load balancing strategies described above take less than 0.5 seconds for load balancing 1024 objects on 8 processors. Thus a moderate decrease in time per iteration justifies the use of any of these strategies. Also, owing to the principle of persistence, load balance deteriorates very slowly, with drastic changes occurring very infrequently. Thus it may be possible to employ multiple strategies in such situations: one thorough load redistribution in case of drastic changes, and a refinement strategy for slower load variations. We are currently experimenting with such combined strategies. Also, note that all the strategies presented above take into consideration the application performance characteristics across all the processors. For ease of implementation, we used a global synchronizing barrier. Thus, all objects are made to temporarily stop computation while the load balancer re-maps them. However, this is usually not necessary. One can use a local barrier (barrier synchronization among objects on a single processor) for the load database update, and another local barrier for performing load redistribution, thus reducing the overheads associated with global synchronization. We are also implementing load
balancing strategies that take only a partial object communication graph (based on a few neighboring processors) into account.
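As an illustration of the greedy strategy described above, the following self-contained C++ sketch assigns objects to processors using a max-first ordering of object loads and a min-heap of processor loads. It is an illustrative reconstruction, not the framework's actual code; names such as ObjLoad and greedy_map are invented here, and object ids are assumed to be 0..N-1.

#include <algorithm>
#include <queue>
#include <utility>
#include <vector>

struct ObjLoad { int id; double load; };   // measured CPU time per object

// Returns new_pe[i] = processor assigned to the object with id i.
std::vector<int> greedy_map(std::vector<ObjLoad> objects, int num_pes) {
    // Heaviest objects first.
    std::sort(objects.begin(), objects.end(),
              [](const ObjLoad& a, const ObjLoad& b) { return a.load > b.load; });

    // Min-heap of (assigned load, processor id).
    using PeEntry = std::pair<double, int>;
    std::priority_queue<PeEntry, std::vector<PeEntry>, std::greater<PeEntry>> pes;
    for (int p = 0; p < num_pes; ++p) pes.push({0.0, p});

    std::vector<int> new_pe(objects.size());
    for (const ObjLoad& obj : objects) {
        PeEntry least = pes.top();           // least loaded processor so far
        pes.pop();
        new_pe[obj.id] = least.second;       // assign heaviest remaining object to it
        least.first += obj.load;             // update that processor's load
        pes.push(least);                     // re-adjust the heap
    }
    return new_pe;
}

The overall cost is O(N log N) for N objects, matching the complexity stated above; like the strategy in the paper, this sketch ignores current placements and may therefore request many migrations.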
Fig. 2. Comparison of various load balancing strategies (time per iteration versus iteration index for no strategy, the refinement strategy, the greedy strategy, and the Metis strategy).
4 Application Performance
In order to evaluate the framework, we implemented a Finite Element application that simulates pressure-driven crack propagation in structures. The physical domain is discretized into a finite set of triangular elements. Corners of these elements are called nodes. In each iteration, displacements are calculated at the nodes from forces contributed by surrounding elements. Typically, the number of elements is very large, and they are split into a number of chunks distributed across processors. In each iteration of the simulation, forces on boundary nodes are communicated across chunks, where they are combined, and new displacements are calculated. To detect a crack in the domain, more elements are inserted between some elements depending upon the forces exerted on the nodes. These added elements, which have zero volume, are called cohesive elements. At each iteration of the simulation, pressure exerted upon the solid structure may propagate cracks, and therefore more cohesive elements may have to be inserted. Thus, the amount of computation for some chunks may increase during the simulation. This results in severe load imbalance. This application, originally written in sequential Fortran90, was converted to a C++-based FEM framework being developed by the authors. This framework
presents a template library, which takes care of all the aspects of parallelization including communication and load balancing. The application developer simply provides the data members of the individual nodes and elements, a function to calculate the values of local nodes, and a way to combine them. Figure 3 presents results of automatic load balancing of the crack propagation simulation on 8 processors of an SGI Origin2000. Immediately after the crack develops (between 10 and 15 seconds) in one of the chunks, the computational load of that chunk increases. Since the other chunks are dependent on node values from that chunk, they cannot proceed with computation until an iteration of the heavy chunk is finished. Thus, the number of iterations per second drops considerably. After this, the Metis-based load balancer is invoked twice (at 28 and 38 seconds). It uses the runtime load and communication information collected by the load database manager to migrate chunks from the overloaded processor to other processors, leading to improved performance. (In Figure 3, this is apparent from the increased number of iterations per second.)
Fig. 3. Crack Propagation with Automatic Load Balancing (iterations per second versus time in seconds). Finite Element Mesh consists of 183K nodes.
5 Conclusion
In this paper, we described a measurement-based automatic load balancing framework implemented in the Converse interoperable runtime system. This framework allows for easy implementation of novel load balancing strategies,
while automating the tasks of recording application performance characteristics as well as load redistribution. A few strategies have been implemented and their performance on a synthetic benchmark has been compared. A real finite element method application was ported to use our load balancing framework, and its performance improvement has been demonstrated. Based on the encouraging results with such real applications, we are currently engaged in developing a more comprehensive suite of load balancing strategies, and in determining the suitability of different strategies for different kinds of applications.
References
1. Jose Nagib Cotrim Arabe, Adam Beguelin, Bruce Lowekamp, Erik Seligman, Mike Starkey, and Peter Stephan. Dome: Parallel programming in a heterogeneous multiuser environment. Technical Report CS-95-137, Carnegie Mellon University, School of Computer Science, April 1995.
2. Robert K. Brunner and Laxmikant V. Kale. Adapting to load on workstation clusters. In The Seventh Symposium on the Frontiers of Massively Parallel Computation, pages 106-112. IEEE Computer Society Press, February 1999.
3. N. Doulas and B. Ramkumar. Efficient Task Migration for Message-Driven Parallel Execution on Nonshared Memory Architectures. In Proceedings of the International Conference on Parallel Processing, August 1994.
4. L. V. Kale, Milind Bhandarkar, Narain Jagathesan, Sanjeev Krishnan, and Joshua Yelon. Converse: An Interoperable Framework for Parallel Programming. In Proceedings of the 10th International Parallel Processing Symposium, pages 212-217, April 1996.
5. L. V. Kale and Sanjeev Krishnan. Charm++: Parallel Programming with Message-Driven Objects. In Gregory V. Wilson and Paul Lu, editors, Parallel Programming using C++, pages 175-213. MIT Press, 1996.
6. L. V. Kale, B. Ramkumar, A. B. Sinha, and A. Gursoy. The CHARM Parallel Programming Language and System: Part I - Description of Language Features. IEEE Transactions on Parallel and Distributed Systems, 1994.
7. L. V. Kale, B. Ramkumar, A. B. Sinha, and V. A. Saletore. The CHARM Parallel Programming Language and System: Part II - The Runtime System. IEEE Transactions on Parallel and Distributed Systems, 1994.
8. George Karypis and Vipin Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. TR 95-035, Computer Science Department, University of Minnesota, Minneapolis, MN 55414, May 1995.
9. R. Namyst and J.-F. Mehaut. PM2: Parallel multithreaded machine. A computing environment for distributed architectures. In E. H. D'Hollander, G. R. Joubert, F. J. Peters, and D. Trystram, editors, Parallel Computing: State-of-the-Art and Perspectives, Proceedings of the Conference ParCo'95, 19-22 September 1995, Ghent, Belgium, volume 11 of Advances in Parallel Computing, pages 279-285, Amsterdam, February 1996. Elsevier, North-Holland.
10. J. Pruyne and M. Livny. Parallel processing on dynamic resources with CARMI. Lecture Notes in Computer Science, 949:259-??, 1995.
Integrating Kernel Activations in a Multithreaded Runtime System on top of Linux
Vincent Danjean¹, Raymond Namyst¹, and Robert D. Russell²
¹ Laboratoire de l'Informatique du Parallélisme, École normale supérieure de Lyon, 46, Allée d'Italie, F-69364 Lyon Cedex 07, France, {Vincent.Danjean, Raymond.Namyst}@ens-lyon.fr
² Computer Science Department, Kingsbury Hall, University of New Hampshire, Durham, NH 03824, USA
[email protected]
Abstract. Clusters of SMP machines are frequently used to perform heavy parallel computations, and the concepts of multithreading have proved suitable for exploiting SMP architectures. Generally, the programmer uses a thread library to write this kind of program. Such a library schedules the threads or asks the OS to do it, but both of these approaches have problems. Anderson et al. have introduced another approach which relies on cooperation between the OS scheduler and the user application using activations and upcalls. We have modified the Linux kernel and adapted the Marcel thread library (from the PM2 programming environment) to use activations. Improved performance was observed and problems caused by blocking system calls were removed.
1 Kernel Support for User Level Thread Schedulers
The increasing popularity of clusters of SMP machines creates a need for multithreaded programming environments able to fully exploit such architectures. Indeed, the thread model naturally helps to make efficient use of all available processors and to overlap I/O operations with computations. Furthermore, threads are often considered as virtual processors and are targeted as such by compilers or runtime support systems for portability purposes. However, these runtime systems are built on top of thread libraries that do not all have the same properties, and thus do not provide the same functionalities. Moreover, these properties directly depend on how much control the thread scheduler has over the architecture's resources. There are two principal kinds of threads: user-level and kernel-level, each with its own advantages and inconveniences.
Efficiency is the main advantage of user-level thread libraries, whose scheduler is completely implemented in user space. Most operations on threads (creations,
context switches, etc.) can be done without any call to the operating system. As a result, some computations utilizing these threads may perform one or two orders of magnitude better than kernel-level threads. Furthermore, user threads are much more efficient in terms of kernel resource consumption, which means there can often be many more of them per application. Finally, since user-level threads are implemented in user space, they can be tailored to each user's application. The disadvantage is that user-level threads are ignored by the OS and thus cannot be scheduled correctly in many cases. For instance, since user threads within the same process cannot be scheduled concurrently on multiple processors, no real parallelism can be achieved. Similarly, when a thread makes a blocking system call (for example, a read() on an empty socket), all the threads in that process are blocked. Obviously, kernel-level threads do not suffer from these drawbacks, since their scheduling is realized within the OS kernel, which handles them the same way it handles processes, except that multiple threads may share the same address space. It is therefore possible on an SMP machine for the kernel to simultaneously assign processors to multiple threads in the same application, thus achieving true parallelism. Furthermore, when one thread makes a blocking system call, the kernel can give control to another thread in the same application. However, even if operations such as thread context switching are more efficient than those related to processes, they still require system calls to be performed.
1.1 The Marcel Mixed Thread Scheduler
To try to obtain the best properties of the two kinds of threads, some libraries mix them together: there are a fixed number of kernel threads, each running a number of user threads. This approach retains the efficient scheduling of user threads, but is able to take advantage of parallelism between threads on SMP machines. One such library is Marcel [5], which was developed for use by PM2 [4] (Parallel Multithreaded Machine), a distributed multithreaded programming environment. Marcel delivers good performance by eliminating some features from the POSIX pthreads specification that are not useful for scientific applications (e.g., per-thread signal handling). In addition, it supports multiple optimizations as well as dynamic thread migration across a homogeneous cluster. Marcel has been ported to a number of different platforms. It utilizes a fixed number of kernel threads, each managing a pool of user-level threads.
1.2 Better Support: Kernel Activations
Although the two-level version of Marcel achieves better performance than the earlier user-level version, it still suffers from some of the problems discussed earlier. The first problem is that when a user thread makes a blocking system call, the underlying kernel thread is stopped too. It is possible with a few blocking user threads to block all the kernel threads, thereby blocking the whole application, even if some other user threads are ready to run. Another problem is that even if Marcel can control the scheduling of user-level threads in each pool, it cannot do anything between the different pools. So, if thread A in pool 1 holds a lock and is preempted by the system, then when thread B in another pool wants the lock, it has to wait for the OS to give control back to pool 1 so that thread A can release the lock.
These problems could be avoided if the OS scheduler reported its scheduling decisions to the application. One mechanism to achieve this cooperation is based on the concept of activations, which was first proposed in an article by Anderson et al. [1] Its authors implemented this mechanism with the FastThread library on the Topaz system. However, this system is no longer running, and the sources were never released. All the terms (activation, upcall, etc.) used in this paper come from that article. This mechanism enables the kernel to notify a user-level process whenever it makes a scheduling decision affecting one of the process's threads. It is implemented as a set of upcalls and downcalls. A traditional system call is a downcall, from the user level down into a kernel-level function. The new idea is a corresponding upcall, from the kernel up into a user-level function. An upcall can pass parameters, just as system calls do. An activation is an execution context (i.e., a task control block in the kernel, similar to a kernel-level thread belonging to the process) that the kernel utilizes to make the upcall. The key point is that each time the kernel takes a scheduling action affecting any of an application's threads, the application receives a report of this fact and can take action to (re)schedule the user-level threads under its control.
We have modified the Linux kernel by adding activations and changing the existing kernel scheduler to use upcalls to report some scheduling events to the Marcel scheduler running in user space. Upcalls are mainly used to report that a new activation has been created, that an activation has blocked in a system call, that a previously blocked activation has just been unblocked, or that an activation has been preempted. We have also modified Marcel to utilize this mechanism efficiently, as discussed in the next section.
2 Marcel on Top of Linux Activations
The user-level Marcel thread scheduler utilizes the new mechanism as follows:
Marcel begins by making an act_new() system call to notify the kernel that it wants to utilize activations. The scheduler provides parameters that include a vector of entry points for a fixed set of user-level management functions to which the kernel will make upcalls. Whenever the kernel makes a scheduling decision affecting any of this process's activations, such as creating, blocking or unblocking it, the kernel informs the process by choosing one of its activations and using it to make the appropriate upcall, such as upcall_new, upcall_block, or upcall_unblock. In order to guarantee exclusive access to management information while executing one of these functions, the kernel maintains an internal mutual exclusion lock that allows only one upcall at a time to be outstanding per process. Therefore, the management function must make an act_resume() system call to release that lock after making its management decision but before executing application-specific code. If the kernel scheduler decides that an activation holding this lock should be preempted, the kernel will preempt another activation instead (via upcall_preempt) and will simply reschedule the original activation without an upcall.
Our implementation of the activations within the Linux kernel is close to the one proposed by Anderson et al. It is described more fully in [2]. The next section presents some general characteristics that are referred to in the following sections. The programming interface provides a few new system calls, and the targeted thread library must be prepared to handle several kinds of upcalls. Table 1 describes the upcall interface used by the kernel to notify the user thread scheduler about certain scheduling events.

Table 1. Upcalls made by the Linux kernel to the user-level thread scheduler

Upcall            Description
upcall_new        a new activation is starting
upcall_block      an activation blocked
upcall_unblock    an activation unblocked. The scheduler has its state, so it can restart the activation's thread when it wants.
upcall_preempt    an activation was preempted. The scheduler has its state, so it can restart the activation's thread when it wants.
upcall_restart    used by the kernel to make an upcall (e.g., in response to an act_send() system call) when it has no scheduling event to report.
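To make the control flow concrete, the following C++ sketch shows how a user-level scheduler might register its upcall entry points and release the upcall lock from within each handler. The structure and the exact signatures of act_new(), act_resume(), and the handler table are assumptions made for illustration; the paper names these calls but does not give their actual prototypes.

#include <cstdio>

// Hypothetical table of upcall entry points registered with act_new().
struct ActUpcalls {
    void (*upcall_new)(int activation);
    void (*upcall_block)(int activation);
    void (*upcall_unblock)(int activation);
    void (*upcall_preempt)(int activation);
    void (*upcall_restart)();
};

// Stand-ins for the new system calls (real ones would trap into the kernel).
static int act_new(const ActUpcalls* table) { (void)table; return 0; }
static int act_resume()                     { return 0; }

// Library-internal scheduling hooks, stubbed out here.
static void mark_blocked(int act)   { std::printf("activation %d blocked\n", act); }
static void mark_ready(int act)     { std::printf("activation %d runnable again\n", act); }
static void run_next_ready_thread() { /* user-level scheduling decision */ }

// Every handler releases the per-process upcall lock with act_resume()
// before doing any application-level work, as described above.
static void on_new(int act)     { (void)act; act_resume(); run_next_ready_thread(); }
static void on_block(int act)   { mark_blocked(act); act_resume(); run_next_ready_thread(); }
static void on_unblock(int act) { mark_ready(act); act_resume(); }
static void on_preempt(int act) { mark_ready(act); act_resume(); }
static void on_restart()        { act_resume(); run_next_ready_thread(); }

void scheduler_init() {
    static const ActUpcalls table = { on_new, on_block, on_unblock, on_preempt, on_restart };
    act_new(&table);   // from now on the kernel reports its scheduling decisions
}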
2.1 How it works
Figure 1 illustrates how Marcel uses activations to keep both processors on a dual-processor SMP platform actively executing application threads, even when some threads are blocked in the kernel.
Fig. 1. A blocking system call with activations.
At time T1, the kernel creates two activations A and B and makes an upcall_new to each. In each activation, the Marcel scheduler will choose a ready application thread and give it control. At time T2, the application thread running in activation A makes a blocking I/O system call. The kernel determines the process to which this activation belongs and creates a third activation, C, into which it makes an upcall_new. In this activation, the Marcel scheduler will choose a third application thread and then call act_resume() to release the mutual exclusion lock. The kernel next chooses one of the activations, say B, and makes an upcall_block to it, providing A as the parameter to indicate which activation was just blocked. The Marcel scheduler uses this information to keep track of the status of the corresponding application thread.
At time T3, the I/O request completes. The kernel then chooses one of the activations, say C, and makes an upcall_unblock to it, providing A as the parameter to indicate which activation was just unblocked. The Marcel scheduler now chooses whether to return the application thread previously assigned to A to the pool and continue running the application thread already assigned to C, or vice versa. In either case, activation A remains idle until needed by the kernel to make another upcall.
2.2 Extensions to the original proposal
Although this work is mainly based on the Scheduler Activation model proposed by Anderson et al., we have developed a number of improvements which extend the set of supported system calls and increase efficiency in some situations. One important point with activations is that the number of running activations at the application level is constant. In Anderson's implementation, this also meant that the number of activation structures for that user in the kernel was constant. This has the advantage of using a constant amount of kernel resources. However, it does not allow the kernel to handle blocking system calls properly, since a kernel activation structure is tied up during the time its thread is blocked, thereby preventing the kernel from running another user-level thread in that activation. Our implementation does not keep constant the number of activation structures for one user within the kernel. This allows us to handle any number of simultaneously blocking system calls, because whenever one activation issues a blocking system call, the kernel will create a new activation structure, if necessary, in order to keep constant the number of concurrently running activations at the application level. The cost of this is the additional kernel resources that are needed for the additional activation structures.
Several optimizations have been made to increase the performance of our implementation. When an activation blocks, we originally needed to make two upcalls: the first (upcall_block) to notify the application that an activation blocked, the second (upcall_new) to launch a new activation. This is now handled by only one upcall to upcall_new, which uses a parameter to tell the application whether another activation has blocked. An additional optimization has been made as far as preemption is concerned. In the original model, when an activation is preempted, an upcall_preempt upcall occurs, and an upcall_new upcall is made when the kernel is ready to restart an activation. Now, the application can tell the kernel at the end of the upcall_preempt upcall (with a parameter to the act_resume() system call) that, instead of calling the upcall upcall_new, it can continue this activation directly.
Marcel
Surprisingly, integration of Linux Activations within the Marcel library required almost no rewriting of existing code. We needed only a few localized extensions. The major issue that we had to address was related to the ready-threads queue. The problem was to opt either for a global pool (as in a user-level version of Marcel) or for a collection of activation-specic local pools (as in the mixed version). We have opted for the global pool implementation because maintaining separate pools introduces a number of synchronization problems. In particular, when an activation gets blocked within the kernel, the other activations must retrieve the running threads that were kept in its ready-threads pool. Such a step requires a costly synchronization scheme and the associated overhead may become important in the presence of frequent I/O operations. The drawback of our strategy is that the global pool may become a bottleneck on a large number of processors.
Marcel uses a special lock to prevent concurrent access to its internal data structures. Our implementation of activations ensures that if the kernel preempts the Marcel thread which is holding this lock, then it is relaunched immediately (instead of the one running on the activation that receives the
upcall_preempt
upcall). This allows us to avoid contention situations in the presence of busy waiting threads. Note that a related problem can occur with the upcall
upcall_new.
Indeed, when a new activation is created, it may not succeed in acquiring the aforementioned lock. Since it is mandatory to run a regular Marcel thread when calling
act_resume(),
the activation must schedule a dummy thread.
To this end, we have added a pool of preallocated dummy threads (together with their stacks) into Marcel.
3
Performance and Evaluation
The new version of Marcel on top of Linux Activations is completely operational, although we did not yet implement all the optimizations we discussed in the previous sections. To investigate the gain or the overhead generated by activations and upcalls, we have compared the new version of Marcel to the two existing versions (one purely user-level, one mixed two-level) as well as to native Linux kernel-level threads [3]. The tests were run on an Intel Pentium II 450 MHz platform running Linux v2
:2:13.
On this platform, we ran a mi-
crobenchmark program to measure the time taken by an upcall from the kernel up to user-space. This test reported an average time of
5s
per upcall.
1166
V. Danjean, R. Namyst, and R.D. Russell Table 2.
Performance of various thread libraries
Library
Single processor Dual processor Basic With I/O With computation Marcel user-level 0.308ms 119.959ms 6932ms Marcel mixed two-level 0.435ms 23.241ms 3807ms Marcel with activations 0.417ms 10.118ms 3551ms LinuxThread (kernel-level) 13.319ms 14.916ms 3566ms
The test programs used to compare these libraries are all based on a common synthetic program. The basic program implements a divide and conquer algorithm to compute the sum of the rst N integers. At each iteration step, two threads are spawned to compute the two resulting sub-intervals concurrently, unless the interval to compute contains one element. The parent of the two threads waits for their completion, gets their results, computes the sum and, in turn, returns it to its own parent. This program generates a tree of threads and involves almost no real computation but a lot of basic thread operations such as creation, destruction and synchronization. In order to evaluate the dierent thread libraries in the presence of blocking calls, we have extended the previous program so as to make extensive use of Unix I/O operations. In this case, we have simply replaced all the thread creation calls by a write into a Unix pipe. At the other end of the pipe, a dedicated server thread simply transforms the corresponding requests into thread creations. Finally, we also extended the basic version of the program by adding some articial computation into each thread so that some speedup can be obtained on a multiprocessor platform.
3.1 Performance Table 2 reports the performance obtained with the three aforementioned program versions for each thread library. The rst two programs were run on a uniprocessor machine whereas the last one was run on a dual-processor. The basic version of the divide and conquer program makes heavy use of thread creations and synchronizations. As one may expect on a uniprocessor, the user-level Marcel library is obviously the most ecient, while the Linux-
Thread library exhibits poor performance, because kernel thread operations are much more inecient than those related to user threads. It is interesting to note that the version using activations achieves good performance. The difference with the user-level version is due to the Marcel lock acquire/release primitives that are a little more complex in the presence of activations. With the version involving many I/O operations, things change signicantly. The most noticeable result is the huge amount of time taken by the program with the user-level version. It is, however, not surprising: each time a user thread makes a blocking call, it blocks the entire Unix process until a timer signal forces a preemption and schedules another thread (in this case, every
20ms).
The activation version has the best execution time. The mixed Marcel library does not behave as well because two underlying kernel threads are needed to handle the blocking calls properly. Thus, it introduces overhead due to additional synchronization and preemption costs. When the program containing substantial computation is executed on a dual-processor machine, we observe that the activation version has approximately the same execution time as the Marcel mixed and LinuxThread versions. It reveals that the activation version is perfectly able to exploit the underlying architecture by using two activations simultaneously within the application. The user-level version obviously performs poorly, because only one processor is used in this case.
4 Conclusion
This work augmented the design of activations, a new technique to handle thread support in an OS, then implemented and tested their use under Linux. We wrote a new version of the Marcel thread library that utilizes activations while preserving the existing user interface, so that existing Marcel programs still work with this new model. We have demonstrated that for applications using threads that make blocking system calls, performance of the new version of
Marcel on both single and dual processor platforms is superior to the best previous version of Marcel and to kernel-level threads. Furthermore, since our new library is implemented in user space, we do not need to change the kernel to add new thread features, such as thread migration. A two-level thread library based on activations seems to be a very attractive way to manage application threads. This work shows that this model is a valid one, in particular for application threads that utilize blocking system calls, which often happens within a communication library, for example.
References
1. T. Anderson, B. Bershad, E. Lazowska, and H. Levy. Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism. ACM Transactions on Computer Systems, 10(1):53-79, February 1992.
2. Vincent Danjean, Raymond Namyst, and Robert Russell. Linux kernel activations to support multithreading. In Proc. 18th IASTED International Conference on Applied Informatics (AI 2000), Innsbruck, Austria, February 2000. IASTED. To appear.
3. Xavier Leroy. The LinuxThreads library. http://pauillac.inria.fr/~xleroy/linuxthreads.
4. R. Namyst and J.-F. Mehaut. PM2: Parallel Multithreaded Machine. A computing environment for distributed architectures. In ParCo'95 (PARallel COmputing), pages 279-285. Elsevier Science Publishers, Sep 1995.
5. R. Namyst and J.-F. Méhaut. Marcel: une bibliothèque de processus légers. Laboratoire d'Informatique Fondamentale de Lille, Lille, 1995.
DyRecT: Software Support for Adaptive Parallelism on NOWs
Etienne Godard, Sanjeev Setia, and Elizabeth White
Department of Computer Science, George Mason University
Abstract. In this paper, we describe DyRecT (Dynamic Reconfiguration Toolkit), a software library that allows programmers to develop adaptively parallel message-passing MPI programs for clusters of workstations. DyRecT provides a high-level API that can be used for writing adaptive parallel HPF-like programs while hiding most of the details of the dynamic reconfiguration from the programmer. In addition, DyRecT provides support for making a wider variety of applications adaptive by exposing to the programmer a low-level library that implements many of the typical tasks performed during reconfiguration. We present experimental results for the overhead of dynamic reconfiguration of several benchmark applications using DyRecT.
1 Introduction
Parallel applications executing on clusters of workstations have to be able to “withdraw” from a workstation if its owner returns. This is because workstation owners are typically unwilling to share their workstation with parallel applications while they are using it for doing interactive tasks. Thus, it is necessary to ensure that parallel applications execute only on idle workstations. To address this issue, several run-time libraries and environments provide mechanisms for process migration [1]. When owner activity is detected on a workstation being used by a parallel application, the process executing on that workstation is migrated to an idle workstation. If no idle workstation is available, the parallel application is either suspended until more resources are available or multiple processes that compose the parallel application are scheduled on the same processor. Several studies [5, 7] have shown that a more desirable approach from the performance viewpoint would be to dynamically reconfigure the parallel application so that its parallelism matched the number of processors available for execution. Such dynamically reconfigurable applications have been referred to as adaptive parallel or malleable parallel applications. Unlike conventional parallel applications, adaptive parallel applications can adapt to changes in the availability of underlying resources by dynamically shrinking or expanding their degree of parallelism. While the performance benefits of supporting adaptively parallel applications seem clear, most parallel programming environments do not provide mechanisms for dynamically changing the degree of parallelism of executing applications. In
this paper, we describe DyRecT (Dynamic Reconfiguration Toolkit), a software library that allows programmers to develop adaptively parallel message-passing MPI programs for clusters of workstations. Ideally, writing adaptive parallel applications should be no more difficult than developing conventional parallel applications. To this end, several run-time systems [1, 6] have been designed that support adaptive parallel applications in a user-transparent fashion. Some of these systems, however, require all applications to be written using a master-slave programming paradigm. This can lead to poor performance for several classes of applications [3]. Other systems support adaptive parallelism for specific classes of applications, e.g., Adaptive Multiblock Parti [2] supports adaptive parallel structured and block-structured parallel applications. Recently two systems have been developed that have a wider applicability than the systems discussed above. DRMS [3] supports adaptive parallelism for grid-based message-passing programs on the IBM SP2, while in [4], Scherer et al describe a system for adaptively parallel shared memory programs that use the OpenMP programming model. The wider applicability of these systems arises from the fact that they support the OpenMP and HPF programming models that are used for several classes of applications. DyRecT resembles DRMS in that one of its goals is to support grid-based message-passing programs. To this end we provide a high level API that can be used by the programmer for writing adaptive parallel HPF-like programs. It differs from DRMS in two important ways. First, we provide support for making a wider variety of applications adaptive by exposing to the programmer a low-level library that implements many of the typical tasks performed during reconfiguration. Second, we support adaptive parallelism on NOWs consisting of potentially heterogeneous workstations by providing support for saving and restoring the stack of an executing process in an architecture-independent fashion. Our approach is motivated by the observation that while the details of the actions that need to be taken during reconfiguration depend upon the application, there are common tasks that typically need to performed, e.g., spawning processes, synchronizing the application, capturing and restoring the stack, exchanging data, etc. For example, to move from the first configuration in Figure 1 to the second, the four starting processes must synchronize at some point in the computation where a consistent grid exists across the processes. At that point, data must be moved so that it is distributed across three of the processes. The process leaving the computation must be terminated. Finally, any required changes to the communication bindings must be made. At this point, the grid computation can continue. In the case of regular grid-based iterative applications, most of these reconfiguration related tasks are performed by our high-level library and are hidden from the programmer. However, the high-level API provided with DyRecT is only suitable for certain classes of grid-based applications. Using the low-level library, discussed in Section 3, a programmer can develop reconfiguration code
Fig. 1. Changing the level of parallelism by moving between configurations in a grid-based parallel application.
for other classes of applications with considerably less effort than if they had to develop the code from scratch.
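To make the data-movement step of Figure 1 concrete, the following sketch computes the row ranges owned by each process under a uniform block distribution, before and after a change in the number of processes; comparing the two immediately yields which rows must move. This illustrates the general idea only and is not DyRecT's internal algorithm.

#include <algorithm>
#include <cstdio>

// Rows [begin, end) owned by 'rank' when 'rows' rows are block-distributed
// over 'nprocs' processes (any remainder spread over the first few ranks).
struct Range { int begin, end; };

static Range block_range(int rows, int nprocs, int rank) {
    int base = rows / nprocs, extra = rows % nprocs;
    int begin = rank * base + std::min(rank, extra);
    int end   = begin + base + (rank < extra ? 1 : 0);
    return {begin, end};
}

int main() {
    const int rows = 1000;
    // Shrinking from 4 to 3 processes, as in Figure 1.
    for (int old_rank = 0; old_rank < 4; ++old_rank) {
        Range oldr = block_range(rows, 4, old_rank);
        for (int new_rank = 0; new_rank < 3; ++new_rank) {
            Range newr = block_range(rows, 3, new_rank);
            int lo = std::max(oldr.begin, newr.begin);
            int hi = std::min(oldr.end, newr.end);
            if (lo < hi && old_rank != new_rank)   // rows that must migrate
                std::printf("P%d -> P%d : rows [%d, %d)\n", old_rank, new_rank, lo, hi);
        }
    }
    return 0;
}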
2 High-Level Primitives
There are several different types of high-level primitives provided by the toolkit: initialization and finalization, synchronization, data distribution registration, runtime data support and reconfiguration data distribution. These primitives hide many of the details that a user would typically have to deal with when making iterative grid-based applications adaptive: describing how the partitioning is related to the number of processes, moving data between processes at reconfiguration time, performing some data initialization, spawning and terminating processes, and synchronizing to ensure that a consistent grid is repartitioned.
As an example, consider a typical iterative grid-based application as shown in Figure 1. For each process, every iteration consists of doing a local computation, exchanging information with neighboring processes, and synchronizing to decide convergence. When the global grid is uniformly distributed across the participating processes, this application can be made adaptive by instrumenting the source code with our high-level primitives. These calls provide to the runtime system basic information about how the grids are partitioned across any number of processes. The code for this, described below, is shown in Figure 2.
The data partitioning high-level primitives allow users to define uniform partition schemes over multi-dimensional data. In the example, the user specifies a block partition along the first (row) dimension (DYR_Block()) combined with a collapsed (non-partitioned) partition (DYR_Collapsed()) for the second (column) dimension. Two grids, one for the current iteration and one for the previous, that are partitioned using this scheme, are registered with the library using the DYR_Register_data() calls. After providing information about the data to be repartitioned, the user decides where in the component source code it is legal for repartitioning to
int main (int argc, char *argv[]) {
    int local_dims[2], dims[2];
    double **mydata, **mydata_next;
    DYR_Disttype dist_types[2];
    DYR_Disthandle strips;
    MPI_Comm Compute_context;

    MPI_Init(&argc, &argv);
    DYR_Init(&Compute_context);                   /* initialize DYRECT */
    DYR_Save((void *) &iter, 1, MPI_INT);         /* save variable(s) needed across all nodes */
    DYR_Block(&dist_types[0]);                    /* globally distributed data */
    DYR_Collapsed(&dist_types[1]);
    DYR_Borders_uniform(1, 1, &dist_borders[0]);
    dist_borders[1] = dist_borders[0];            /* define borders */
    DYR_Create_distribution(2, dist_types, dist_borders, &strips);
    DYR_Register_data(&mydata, 2, dims, MPI_DOUBLE, 0, 0, strips);
    DYR_Register_data(&mydata_next, 2, dims, MPI_DOUBLE, 0, 0, strips);
    if (DYR_Init_node()) {
        DYR_Local_shape(&mydata, local_dimens);   /* new local size */
        /* Put standard initialization calls, etc. from original program */
        iter = 0;
        init_data(local_dimens, mydata_next);     /* initialize data area */
    }
    do {    /* iterate using Jacobi relaxation until block has converged */
        /* check for reconfiguration */
        if (DYR_Check_reconf(0)) {
            DYR_Reconfigure(1, &Compute_context); /* reconfigure */
            DYR_Local_shape(&mydata, local_dimens);
        }   /* if */
        DYR_Update_borders(&mydata_next, 0, 0);
        copy_data(local_dimens, mydata_next, mydata);
        calc_area(local_dimens, mydata, mydata_next);
        iter++;
    } while (cont_iter(local_dimens, mydata, mydata_next));
    MPI_Finalize();
    DYR_Final();
}

cont_iter(. . .) {
    /* compute local norm */
    DYR_Sync_MS(. . ., result, comp_norm, set_flag);
    return result;
}

Fig. 2. Abbreviated source code for the Jacobi Application. Code added for dynamic reconfiguration is shown in boldface.
occur. The start of each iteration is used for the Jacobi application. At that point, the user adds an invocation to DYR_Reconfigure() guarded by a call to DYR_Check_reconf(). The DYR_Reconfigure() function uses the data registration information to take care of all of the repartitioning calculations, data exchange, and process creation and termination required for the new set of processes. The toolkit provides two different synchronization mechanisms, both of which assume that the application is iteration based and that reconfiguration must occur when all processes are at the same iteration. Both synchronization functions are responsible for setting a flag that is used by the DYR_Check_reconf() function. The synchronization mechanism used in the example extends the existing global synchronization at the end of an iteration. In the master process, this function takes over the details of receiving the data and computing convergence using a user-provided function. It determines if a reconfiguration is needed and informs the other processes about both convergence and reconfiguration in the return message. If a parallel application has a variable that needs to hold the same value across all participating processes (such as iter in Figure 2), it is registered with the toolkit using DYR_Save(). If a process joins the application at reconfiguration time, the toolkit ensures that it is initialized appropriately. It is sometimes necessary to transform the control flow of the components depending upon whether or not the process was one of the initial processes. Function DYR_Init_node() only returns true for processes that were part of the application at start time. In Figure 2, this function is used so that the initial processes can initialize their local data and variables. When new processes enter later in the application execution, they skip this code and immediately enter the loop, perform their reconfiguration, and get information about their data set using DYR_Local_shape(). Then they execute normally. This primitive can also be used to guard code that only new processes should execute.
3 Low-Level Primitives
In addition to primitives tailored toward one class of parallel applications, our toolkit also provides to the user a set of low-level primitives. The primary reason for providing these primitives is to allow programmers to more easily handle situations where the standard high-level functionality is not sufficient. We provide several different types of primitives for specialized partitioning, physical resource control, tailoring of the work done at reconfiguration points, and dealing with data on the runtime stack. We have found these types of low-level primitives useful for several different types of applications. As an example, consider the case where there is variation in the relative processor speeds in the workstation cluster. In this situation, it makes sense to give processes on faster processors larger local grids than processes on slower processors. While high-level functions may provide solutions for some aspects of the problem (synchronization, for example), non-uniform partitioning
schemes, e.g., recursive bisection, are not supported by the high-level primitives. However, using the low-level primitives provided by DyRecT, the user can tailor the actions taken during reconfiguration such that non-uniform partitioning schemes can be handled. The default assumption is that the given reconfiguration points are placed in the main program. While this is not atypical of this class of applications, for some members of this class, more efficient reconfiguration can be achieved by placing reconfiguration points in other locations in the source code where they are encountered more frequently. For example, a multigrid V-cycle can be implemented recursively and one logical place for reconfiguration is inside the recursive function. However, this placement of reconfiguration points raises the question of how to create the correct runtime stack for new processes and how to update data (typically variables tied to grid size and pointers to intermediate grids) that may be on the stack in existing processes. Our low-level primitives include functions to deal with these problems and some rudimentary source-to-source transformation tools that deal with some of the difficult issues of the placement of these functions.
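To make the placement issue concrete, the sketch below shows where such a reconfiguration point might sit inside a recursive V-cycle. It reuses only the calls already shown in Figure 2 (DYR_Check_reconf, DYR_Reconfigure, DYR_Local_shape); the Grid structure and the mg_* routines are hypothetical application code, and the low-level primitives that actually capture and rebuild the runtime stack for newly spawned processes are not reproduced here, so this illustrates placement rather than a complete adaptive multigrid.

#include <mpi.h>

/* DYR_* prototypes come from the DyRecT header (not shown); the argument
 * patterns follow Figure 2.  Grid and the mg_* helpers are hypothetical. */
typedef struct Grid {
    double **data;
    int      local_dims[2];
    struct Grid *coarser;             /* next-coarser grid in the hierarchy */
} Grid;

void mg_relax(Grid *g);               /* hypothetical smoother */
void mg_restrict(Grid *fine, Grid *coarse);
void mg_prolong(Grid *coarse, Grid *fine);

void v_cycle(Grid *g, int level, MPI_Comm *ctx)
{
    if (DYR_Check_reconf(0)) {        /* reconfiguration point inside the recursion */
        DYR_Reconfigure(1, ctx);      /* repartition registered grids, spawn/retire processes */
        DYR_Local_shape(&g->data, g->local_dims);   /* refresh sizes/pointers held on the stack */
    }
    mg_relax(g);                      /* pre-smoothing on the current level */
    if (level > 0 && g->coarser != NULL) {
        mg_restrict(g, g->coarser);   /* restrict the residual to the coarser grid */
        v_cycle(g->coarser, level - 1, ctx);
        mg_prolong(g->coarser, g);    /* interpolate the correction back */
    }
    mg_relax(g);                      /* post-smoothing */
}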
4 Performance Results
In this section, we describe the results of experiments in which we measured the cost of dynamically reconfiguring several parallel applications. The main goals of these experiments were to demonstrate the feasibility of using DyRecT for supporting adaptive parallelism on NOWs and to identify the various components that contribute to the overhead of dynamic reconfiguration. Our experimental environment consists of 16 PCs connected by a switched 100 Mbps Ethernet. Each machine has one or two 200 MHz Intel Pentium Pro processors and between 128 and 256 MB RAM. The computers run Linux 2.2.10. Our reconfiguration software was built on top of the LAM (version 6.2b) implementation of MPI. We measured the cost of reconfiguration for five benchmark applications. The first two applications (referred to as Jacobi and RB) use the Jacobi relaxation method to solve Poisson's equation on a square grid. In Jacobi, a strip partitioning scheme is used to distribute the grid among the processors, while RB uses recursive bisection to partition the grid. The third benchmark (BC) employs a block cyclic data decomposition technique to allocate grid data to processors. The next two applications, Multigrid and Integer Sort, are taken from the NAS parallel benchmarks. We reconfigured each application several times and measured the adaptation time under different scenarios. These scenarios are representative of fluctuations in resource availability that can occur in non-dedicated clusters of workstations, such as new nodes joining the computation, nodes leaving the computation, and migration of a process from one node to another. In our experiments, an executing parallel application reconfigures itself when it receives a signal sent via the LAM "doom" command. The delay before the
application resumes execution after reconfiguration consists of two components. The main component is the actual cost of reconfiguration itself (as discussed below). In addition, before the reconfiguration can be initiated, each process in the computation needs to reach the next "safe" point in its execution. This synchronization delay is application-specific since it depends on the location and frequency of occurrence of reconfiguration points. For example, in the case of the Jacobi, RB, and BC benchmarks, the reconfiguration point occurs at the end of each iteration, whereas in the case of multigrid, reconfiguration points occur at each level of the multigrid V-cycle. For our benchmark applications, the synchronization delay varied from 0.07 to 3.77 seconds depending on the number of processors and the data set size of the application. The reconfiguration cost can be broken down into several components corresponding to the different steps involved in the dynamic reconfiguration of parallel applications. These steps are: (i) spawning any new processes, (ii) re-establishing the logical configuration of the application, (iii) figuring out the new logical data partitioning, e.g., by invoking the recursive bisection algorithm, (iv) allocating memory for any newly assigned data, (v) figuring out the overlap of the current data assignment with the future data assignments, and (vi) exchanging data between nodes to account for the new configuration. Figure 3 shows the costs for each benchmark for two reconfiguration scenarios: changing the parallelism from 8 to 16 nodes, and vice versa. The time for steps (i) through (vi) is labeled spawn, init, part, alloc, overlap, and redist, respectively. Our experiments showed that the main component of the total reconfiguration time was the data redistribution time, which is proportional to the amount of data that needs to be redistributed between the processors. The reconfiguration time for our benchmarks ranged from hundreds of milliseconds to around 15 seconds, depending mainly on the data set size of the application. For a more thorough discussion of our performance results, the reader is referred to [8].
5 Conclusion
Efficient and non-intrusive use of NOWs for parallel applications requires easy-to-use mechanisms for providing adaptive behavior. This paper describes research into providing both high- and low-level functionality for achieving this. The high-level primitives, tailored to iterative grid-based applications, provide simple-to-use mechanisms for many of the common tasks in this domain. When this functionality does not capture some required feature of the application, the user can use the provided low-level functions to provide additional flexibility. This work is ongoing in that we are still refining both the API and the functionality provided by the API. One natural next step is to look at how high-level APIs for other classes of applications can be constructed on top of our low-level primitives. Research into efficient algorithms for data exchange within this framework is also of interest.
Fig. 3. The components of the reconfiguration overhead for five benchmark applications. The data set sizes for the benchmarks are as follows: Jacobi, RB, and BC – 144 MB, MG – 55 MB, IS – 24 MB.
References
1. J. Pruyne and M. Livny. Interfacing Condor and PVM to harness the cycles of Workstation Clusters. In Journal of Future Generation Computer Systems, Vol. 12, 1996.
2. G. Edjlali et al. Data Parallel Programming in an Adaptive Environment. Technical Report CS-TR-3350, University of Maryland, 1994.
3. J. Moreira, V. Naik and M. Konuru. Designing Reconfigurable Data-Parallel Applications for Scalable Parallel Computing Environments. Technical Report RC 20455, IBM Research Division, May 1996.
4. A. Scherer, H. Lui, T. Gross, W. Zwaenepoel. Transparent Adaptive Parallelism on NOWs using OpenMP. In Proc. of PPoPP'99, May 1999.
5. A. Acharya, G. Edjlali, J. Saltz. The Utility of Exploiting Idle Workstations for Parallel Computation. In Proc. of ACM Sigmetrics '97, 1997.
6. N. Carriero, E. Freeman, D. Gelernter. Adaptive Parallelism and Piranha. IEEE Computer, pp. 40-49, Jan 1995.
7. A. Chowdhury, L. Nicklas, S. Setia, E. White. Supporting Dynamic Space-sharing on Non-dedicated Clusters of Workstations. In Proc. of ICDCS '97, 1997.
8. E. Godard, S. Setia, E. White. DyRecT: Software Support for Adaptive Parallelism on NOWs. Technical Report GMU-TR00-01, Department of Computer Science, George Mason University, January 2000.
Fast Measurement of LogP Parameters for Message Passing Platforms Thilo Kielmann, Henri E. Bal, and Kees Verstoep Department of Computer Science, Vrije Universiteit, Amsterdam, The Netherlands
[email protected] [email protected] [email protected] Abstract. Performance modeling is important for implementing efficient parallel applications and runtime systems. The LogP model captures the relevant aspects of message passing in distributed-memory architectures. In this paper we describe an efficient method that measures LogP parameters for a given message passing platform. Measurements are performed for messages of different sizes, as covered by the parameterized LogP model, a slight extension of LogP and LogGP. To minimize both intrusiveness and completion time of the measurement, we propose a procedure that sends as few messages as possible. An implementation of this procedure, called the MPI LogP benchmark, is available from our WWW site.
1 Introduction
Performance modeling is important for implementing efficient parallel applications and runtime systems. For example, application-level schedulers (AppLeS) [2] aim to minimize application runtime based on application-specific performance models (e.g., for completion times of given subtasks) which are parameterized by dynamic resource performance characteristics of CPUs and networks. An AppLeS may, for example, determine suitable data distributions and task assignments based on the knowledge of message transfer times and computation completion times. Another example for the use of performance models is our MagPIe library [8, 9] which optimizes MPI's collective communication. Based on a model for the completion times of message sending and receiving, it optimizes communication graphs (e.g., for broadcast and scatter) and finds suitable segment sizes for splitting large messages in order to minimize collective completion time. The LogP model [4] captures the relevant aspects of message passing in distributed-memory systems. It defines the number of processors P, the network latency L, and the time (overhead) o a processor spends sending or receiving a message. In addition, it defines the gap g as the minimum time interval between consecutive message transmissions or receptions at a processor, which is the reciprocal value of achievable end-to-end bandwidth. Because LogP is intended for short messages, o and g are constant. The LogGP model extends LogP to also cover long messages [1]. It adds a parameter G for modeling the gap per byte for long messages, which are typically handled more efficiently. Other variants of LogP have also been proposed where the overhead at the sender and the receiver side is treated separately as o_s and o_r, and where some parameters depend on the message size [5, 7, 8].
For practical use of LogP, the actual parameters of a parallel computing platform have to be measured. Inside a supercomputer or workstation cluster, the network performance characteristics remain constant, except for possible changes in system software. In this case, the respective LogP parameters may be measured off-line, and measurement efficiency hardly matters. Our MagPIe library, however, targets multiple clusters connected via wide-area networks. In this context, off-line measurements are not feasible for two reasons, so measurement efficiency is very important. First, intrusiveness on other ongoing communication has to be kept as small as possible. Second, the performance of wide-area networks may change during application runtime [11], so measurements must also be performed regularly. The main problem with measurement efficiency is how to accurately measure the gap parameter. The measurement methods described in [5, 7] measure the gap by sending large sequences of messages in order to saturate the communication links, in which case the link capacity (as expressed via the gap) can be observed. This measurement procedure has two drawbacks. It is highly intrusive and may disturb other ongoing communication. Also, it is time consuming when measuring long messages, especially when the network has high latency and/or low bandwidth, as is the case with the wide-area connections targeted by MagPIe. In this paper, we present a procedure that measures LogP parameters without saturating the network with long messages. Only for empty messages (with zero bytes of data) does the gap have to be determined by saturating the network. This can be achieved in reasonable time even across wide-area links. For all other message sizes, simple message roundtrips (and the gap for empty messages) are sufficient to determine the corresponding LogP parameters. In the remainder of the paper, we briefly clarify the LogP variant we use (parameterized LogP [8]), then we describe our measurement procedure and compare our measurements with results obtained by saturation-based measurements.
2 Parameterized LogP
The parameterized LogP model defines five parameters, in analogy to LogP. P is the number of processors. L is the end-to-end latency from process to process, combining all contributing factors such as copying data to and from network interfaces and the transfer over the physical network. o_s(m), o_r(m), and g(m) are send overhead, receive overhead, and gap. They are defined as functions of the message size m. o_s(m) and o_r(m) are the times the CPUs on both sides are busy sending and receiving a message of size m. For sufficiently long messages, receiving may already start while the sender is still busy, so o_s and o_r may overlap. The gap g(m) is the minimum time interval between consecutive message transmissions or receptions. It is the reciprocal value of the end-to-end bandwidth from process to process for messages of a given size m. Like L, g(m) covers all contributing factors. Because g(m) covers o_s(m) and o_r(m), it follows that g(m) ≥ o_s(m) and g(m) ≥ o_r(m). A network N is characterized as N = (L, o_s, o_r, g, P). To illustrate how the parameters are used, we introduce s(m) and r(m), the times for sending and receiving a message of size m when both sender and receiver simultaneously
start their operations. s(m) = g(m) is the time at which the sender is ready to send the next message. Whenever the network itself is the transmission bottleneck, o_s(m) < g(m), and the sender may continue computing after o_s(m) time. But because g(m) models the time a message "occupies" the network, the next message cannot be sent before g(m). r(m) = L + g(m) is the time at which the receiver has received the message. The latency L can be seen as the time it takes for the first bit of a message to travel from sender to receiver. The message gap adds the time after the first bit has been received until the last bit of the message has been received. Figure 1 (left) illustrates this modeling. When a sender transmits several messages in a row, the latency will contribute only once to the receiver completion time but the gap values of all messages sum up. This can be expressed as r(m_1, m_2, ..., m_n) = L + g(m_1) + g(m_2) + ... + g(m_n).
Fig. 1. Message transmission as modeled by parameterized LogP (left); fast measurement procedure (right)
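As a small worked illustration of the last equation, the helper below computes the receiver completion time when one sender transmits several messages in a row to the same receiver: the latency L is counted once and the gap of every message is added. This is only a sketch of the cost model, not part of the benchmark; the gap_of callback stands for whatever table of measured g(m) values a user of the model has available.

#include <stddef.h>

/* r(m_1, ..., m_n) = L + g(m_1) + ... + g(m_n) under parameterized LogP. */
double receiver_completion(double L, const size_t *sizes, int n,
                           double (*gap_of)(size_t size))
{
    double t = L;                    /* latency contributes once */
    for (int i = 0; i < n; i++)
        t += gap_of(sizes[i]);       /* gaps of all messages sum up */
    return t;
}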
For completeness, we show that parameterized LogP subsumes the original models LogP and LogGP. In Table 1, LogGP's parameters are expressed in terms of parameterized LogP. We use 1 byte as the size for short messages; any other reasonable "short" size may as well be used instead. Note that neither LogP nor LogGP distinguishes between o_s and o_r. For short messages, they use r = o + L + o to relate the L parameter to receiver completion time, which gives L a slightly different meaning compared to parameterized LogP. We use this equation to derive LogP's L from our own parameters.
3 Fast parameter measurement
Previous LogP micro benchmarks [5, 7] measure the gap values by saturating the link for each message size. Our method has to use saturation only for obtaining g(0). As we use g(0) for deriving other values, we measure it first. We measure the time RTT_n for a roundtrip consisting of n messages sent in a row by measure, and a single, empty reply message sent back by mirror. The procedure starts with n = 10. The number of messages n is doubled until the gap per message changes only by a small tolerance (1%). At this point, saturation is assumed to be reached. We take the time measured for sending the so-far largest number of messages (without reply) as n · g(0). We start with a small number of messages in a row in order to speed up the measurement. So we have to ensure that the messages are sufficiently many such that the roundtrip time is dominated by
bandwidth rather than latency. Therefore, we also keep doubling n until the inequality RTT_1 < RTT_n holds. By waiting for a reply we enforce that the messages are really sent to mirror instead of just being buffered locally.

Table 1. LogGP's parameters expressed in terms of parameterized LogP
  LogP/LogGP   parameterized LogP
  L            = L + g(1) - o_s(1) - o_r(1)
  o            = (o_s(1) + o_r(1))/2
  g            = g(1)
  G            = g(m)/m, for a sufficiently large m
  P            = P
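The saturation loop for g(0) described at the start of this section can be written down in a few lines of MPI. The sketch below is only an outline of the measure side: it assumes the mirror process somehow knows how many zero-byte messages to expect before sending its single empty reply (that handshake, and the additional RTT_1 < RTT_n check, are omitted), and it uses the 1% stopping tolerance mentioned above.

#include <mpi.h>
#include <math.h>

/* Estimate g(0): send n empty messages in a row, wait for one empty reply,
 * and double n until the per-message gap changes by less than 1%. */
double measure_gap0(int mirror_rank, MPI_Comm comm)
{
    int n = 10;
    double prev_gap = -1.0, gap = 0.0;

    for (;;) {
        double t0 = MPI_Wtime();
        for (int i = 0; i < n; i++)
            MPI_Send(NULL, 0, MPI_BYTE, mirror_rank, 0, comm);
        double send_time = MPI_Wtime() - t0;          /* time for the n sends only */
        MPI_Recv(NULL, 0, MPI_BYTE, mirror_rank, 0, comm, MPI_STATUS_IGNORE);

        gap = send_time / n;                          /* current estimate of g(0) */
        if (prev_gap > 0.0 && fabs(gap - prev_gap) / prev_gap < 0.01)
            break;                                    /* saturation reached */
        prev_gap = gap;
        n *= 2;
    }
    return gap;
}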
All other parameters can be determined by the procedure shown in Fig. 1 (right). It starts with a synchronization message by which the so-called mirror process indicates being ready. For each size m, two message roundtrips are necessary from measure to mirror and back. (We use RTT(m) = RTT_1(m).) In the first roundtrip, measure sends an m-byte message and in turn receives a zero-byte message. We measure the time for just sending and for the complete roundtrip. The send time directly yields o_s(m). g(m) and L can be determined by solving the equations for RTT(0) and RTT(m), according to the timing breakdown in Fig. 1 (left):
RTT(0) = 2(L + g(0))
RTT(m) = L + g(m) + L + g(0)
g(m) = RTT(m) - RTT(0) + g(0)
L = (RTT(0) - 2 g(0)) / 2

In the second roundtrip, measure sends a zero-byte message, waits for longer than RTT(m), and then receives an m-byte message. Measuring the receive operation now yields o_r(m), because after waiting longer than RTT(m), the message from mirror is available
at measure immediately, without further waiting. For each message size, the roundtrip tests are initially run a small number of times. As long as the variance of measurements is too high, we successively increase the number of roundtrips. We keep adding roundtrips until the average error falls below a given tolerance, or until an upper bound on the total number of iterations is reached (60 for small messages, 15 for large messages). Initially, measurements are performed for all sizes m = 2^k with k in [0, k_m]. The value of k_m has to be chosen big enough to cover any non-linearity caused by the tested software layer. In our experiments, we used k_m = 18 to cover all changes in send modes of the assessed MPI implementation (MPICH). After measuring the initial set of message sizes, we check whether the gap per byte (g(m)/m) has stabilized for large m. If this is not the case, sending larger messages may achieve lower gaps (and hence higher throughput). So k_m is incremented and the next message size is tested. This process is performed until g(2^{k_m}) is close (within the tolerance) to the value linearly extrapolated from g(2^{k_m-2}) and g(2^{k_m-1}). So far, the "interesting" range of message sizes has been determined. Finally, possible non-linear behavior remains to be detected. For any size m_k, we check whether the measured values for o_s(m_k), o_r(m_k), and g(m_k) are consistent with the corresponding
predicted values for size m_k, extrapolated from the measurements of the previous two (smaller) message sizes, m_{k-1} and m_{k-2}. If the difference is larger than the tolerance, we do new measurements for m = (m_{k-1} + m_k)/2, and repeat halving the intervals until either the extrapolation matches the measurements, or until m_k - m_{k-1} drops below max(32 bytes, the tolerance times m_k).

3.1 Limitations of the method
Except for measuring g(0), all parameters are derived from pairs of single messages sent between the measure and mirror processes. The correctness of timing these messages relies on the independence of the message pairs from each other: the time it takes to send a message from measure to mirror and back must always be the same, whether or not other messages have been exchanged before. Whenever measure issues several messages in a row, sending is slowed down to the rate at which the message pipeline is drained. This is exactly the effect used to measure g(0). For all other measurements, we avoid this effect by always sending messages in pairs from measure to mirror and back. Before measure may send the next message, it first has to receive from mirror. This procedure enforces that pipelines will always be drained between individual message pairs, assuming that message headers carry "piggybacked" flow control information that resets senders to their initial state after each message roundtrip. This assumption may fail for communication protocols which update their flow control information in a more lazy fashion. So far, we found our assumption to be reasonable, as it works both with TCP and with our user-level Myrinet control software LFC [3]. In some cases, our measurements reveal values for the receive overhead such that o_r(m) > g(m), which seems to contradict parameterized LogP. This phenomenon is caused by different behavior of the receive operation depending on whether the incoming message is expected to arrive. Messages are expected to arrive whenever the application called a matching receive operation before the message actually arrives at the receiving host. The treatment of expected messages may be more efficient because unexpected messages, for example, may have to be copied to a separate receive buffer before they can later be delivered to the application. In our measurement procedure, o_r(m) is measured with unexpected messages whereas g(m) is measured while receiving expected messages. Whenever o_r(m) > g(m), g(m) gives an upper bound for processing expected messages. With synchronous receive operations, this measurement setup is unavoidable, because otherwise the measured receive overhead cannot be separated from the time waiting for the message to arrive. (With our MPI-based implementation, we can also measure the receive overhead of expected messages for the asynchronous receive operation, MPI_Irecv, in combination with MPI_Wait.) The measurement procedure described above assumes that network links are symmetrical, such that sending from measure to mirror has the same parameters as for the reverse direction. However, this assumption may not always be true. On wide-area networks, for example, the achievable bandwidth (the gap) and/or the network latency may be different in both directions, due to possibly asymmetric routing behavior or link speed. Furthermore, if the machines running the measure and mirror processes are different (like a fast and a slow workstation), then also the overhead for sending and receiving may depend on the direction in which the message is sent.
In such cases, the parameters o_s, o_r, and g may be measured by performing our procedure twice, while
switching the roles of measure and mirror in between. Asymmetric latency can only be measured by sending a message with a timestamp t_s, and letting the receiver derive the latency from t_r - t_s, where t_r is the receive time. This requires clock synchronization between sender and receiver. Without external clock synchronization (like using GPS receivers or specialized software like the network time protocol, NTP), clocks can only be synchronized up to a granularity of the roundtrip time between two hosts [10], which is useless for measuring network latency. Unfortunately, as we cannot generally assume the clocks of (possibly widely) distributed hosts to be tightly synchronized, we cannot measure asymmetric network latencies within our measurement framework.
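For reference, the per-size roundtrips of the procedure can be sketched as follows on the measure side. The sketch assumes g(0) and RTT(0) have already been obtained, applies the equations given earlier (g(m) = RTT(m) - RTT(0) + g(0) and L = (RTT(0) - 2 g(0))/2), and simplifies the mirror-side protocol and the waiting step; it is an outline of the idea, not the MPI LogP benchmark itself.

#include <mpi.h>

/* One message size m on the measure side: the first roundtrip (m bytes out,
 * 0 bytes back) yields o_s(m), RTT(m), and hence g(m) and L; the second
 * roundtrip (0 bytes out, wait longer than RTT(m), m bytes back) yields
 * o_r(m), since the reply is already available when the receive is timed. */
void measure_size(char *buf, int m, double rtt0, double gap0,
                  int mirror_rank, MPI_Comm comm,
                  double *os_m, double *or_m, double *g_m, double *L)
{
    double t0 = MPI_Wtime();
    MPI_Send(buf, m, MPI_BYTE, mirror_rank, 0, comm);
    *os_m = MPI_Wtime() - t0;                              /* send overhead o_s(m) */
    MPI_Recv(NULL, 0, MPI_BYTE, mirror_rank, 0, comm, MPI_STATUS_IGNORE);
    double rtt_m = MPI_Wtime() - t0;                       /* RTT(m) */

    *g_m = rtt_m - rtt0 + gap0;
    *L   = (rtt0 - 2.0 * gap0) / 2.0;

    MPI_Send(NULL, 0, MPI_BYTE, mirror_rank, 1, comm);
    double t_wait = MPI_Wtime();
    while (MPI_Wtime() - t_wait < 2.0 * rtt_m)
        ;                                                  /* busy-wait for > RTT(m) */
    t0 = MPI_Wtime();
    MPI_Recv(buf, m, MPI_BYTE, mirror_rank, 1, comm, MPI_STATUS_IGNORE);
    *or_m = MPI_Wtime() - t0;                              /* receive overhead o_r(m) */
}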
4 Result evaluation
We implemented the measurement procedure on our experimentation platform called the DAS system, which consists of four cluster computers. Each cluster contains Pentium Pros that are connected by Myrinet. The clusters are located at four Dutch universities and are connected by dedicated 6 Mbit/s ATM networks. (The system is more fully described on http://www.cs.vu.nl/das/.) For the measurements presented in Fig. 2, we have used our MPI message passing system (described in [8, 9]) which can send messages inside clusters over Myrinet and between clusters over the ATM links, using TCP. We implemented the procedure as an MPI application, called the MPI LogP benchmark. We measured the LogP parameters for MPI_Send and MPI_Recv as described above, except for g(m), which was measured both with our fast method and by the link saturation method [5, 7]. The graphs in Fig. 2 show o_s (for comparison) and g, as measured by both methods. In general, on both networks, the curves for g are rather close to each other, confirming the efficacy of our method. There is a general trend that the new, fast method measures slightly larger gaps. This can partially be explained by the systematic error of the saturation method, which has to be stopped heuristically based on the increase rate of the measured gap values, causing part of the gap to be missed. However, there is a region (64 byte-1 Kbyte over TCP, and 128 byte-4 Kbyte over Myrinet) where the saturation method measures significantly less (up to 50%) than the fast method. We could attribute the majority of this effect to a cache sensitivity of the mirror process, which has better data locality with the saturation-based method as it does not send messages while draining the link. So, cache misses occur with somewhat larger messages, compared to the fast, roundtrip-based measurement. Table 2 provides a breakdown of the measurement completion times shown in Fig. 2 for measuring g(0), o_s/o_r (with implicit g(m > 0)), and g(m > 0) (with saturation) over both networks. With our fast measurement procedure, only the first two measurements are necessary, yielding a performance gain of a factor of 10 over Myrinet, and a factor of 17 over the TCP link.
5 Conclusions
We presented a new, fast micro benchmark for measuring LogP parameters for messages of various sizes. We used the parameterized LogP [8] performance model. The
Fig. 2. Measured send overhead and gap; over Myrinet (top) and over TCP (bottom)
major improvement of our measurement procedure is that the minimal gap between two messages can be observed without saturating the network for each message size. Furthermore, our procedure adapts itself to the network characteristics in order to measure parameters for all relevant message sizes. We implemented the new measurement procedure, called the MPI LogP benchmark, for our MPI platform and verified on two different networks that it gets the same results as a saturation-based measurement. The improvements in measurement time are significant. However, the time needed for a full measurement with various message sizes still takes too long to be performed during application runtime. As our ultimate goal is to enable applications to react to changing WAN conditions, we will need to restrict the

Table 2. Breakdown of measurement completion times (seconds)
                                        Myrinet    TCP
  g(0)                                  0.05       12.3
  o_s/o_r (with implicit g(m > 0))      0.16       102.7
  g(m > 0) (with saturation)            1.96       2018.7
measurements to only a few message sizes and extrapolate the others by a technique like the one in [6]. The MPI LogP benchmark is available from http://www.cs.vu.nl/albatross/
Acknowledgements
This work is supported in part by a USF grant from the Vrije Universiteit. The wide-area DAS system is an initiative of the Advanced School for Computing and Imaging (ASCI). We thank Rutger Hofman for his contributions to this research. We thank John Romein for keeping the DAS in good shape, and Cees de Laat (University of Utrecht) for getting the wide area links of the DAS up and running.
References
1. A. Alexandrov, M. F. Ionescu, K. E. Schauser, and C. Scheiman. LogGP: Incorporating Long Messages into the LogP Model — One Step Closer Towards a Realistic Model for Parallel Computation. In Proc. Symposium on Parallel Algorithms and Architectures (SPAA), pages 95–105, Santa Barbara, CA, July 1995.
2. F. Berman, R. Wolski, S. Figueira, J. Schopf, and G. Shao. Application-Level Scheduling on Distributed Heterogeneous Networks. In Proc. Supercomputing'96, Nov. 1996. Online at http://www.supercomp.org/sc96/proceedings/.
3. R. Bhoedjang, T. Rühl, and H. Bal. User-Level Network Interface Protocols. IEEE Computer, 31(11):53–60, 1998.
4. D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: Towards a Realistic Model of Parallel Computation. In Proc. Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 1–12, San Diego, CA, May 1993.
5. D. E. Culler, L. T. Liu, R. P. Martin, and C. O. Yoshikawa. Assessing Fast Network Interfaces. IEEE Micro, 16(1):35–43, Feb. 1996.
6. M. Faerman, A. Su, R. Wolski, and F. Berman. Adaptive Performance Prediction for Distributed Data-Intensive Applications. In Supercomputing'99, Nov. 1999. Online at http://www.supercomp.org/sc99/proceedings/.
7. G. Iannello, M. Lauria, and S. Mercolino. Cross-Platform Analysis of Fast Messages for Myrinet. In Proc. Workshop CANPC'98, number 1362 in Lecture Notes in Computer Science, pages 217–231, Las Vegas, Nevada, January 1998. Springer.
8. T. Kielmann, H. E. Bal, and S. Gorlatch. Bandwidth-efficient Collective Communication for Clustered Wide Area Systems. In Proc. International Parallel and Distributed Processing Symposium (IPDPS 2000), Cancun, Mexico, May 2000.
9. T. Kielmann, R. F. H. Hofman, H. E. Bal, A. Plaat, and R. A. F. Bhoedjang. MagPIe: MPI's Collective Communication Operations for Clustered Wide Area Systems. In Proc. Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 131–140, Atlanta, GA, May 1999.
10. V. Paxson. On Calibrating Measurements of Packet Transit Times. In Proc. SIGMETRICS'98/PERFORMANCE'98, pages 11–21, Madison, Wisconsin, June 1998.
11. R. Wolski. Forecasting Network Performance to Support Dynamic Scheduling Using the Network Weather Service. In Proc. High-Performance Distributed Computing (HPDC-6), pages 316–325, Portland, OR, Aug. 1997. The network weather service is at http://nws.npaci.edu/.
Supporting flexible safety and sharing in multi-threaded environments*
Steven H. Samorodin¹ and Raju Pandey²
Marimba, Inc. Mountain View, Ca.
[email protected] Computer Science Department, University of California at Davis
[email protected]
Abstract. There is increasing interest in extensible systems (such as extensible
operating systems, mobile code runtime systems, Internet browsers and servers) that allow external programs to be downloaded and executed directly within the system. While appealing from system design and extensibility points of view, extensible systems are vulnerable to aberrant behaviors of external programs. External programs can interfere with executions of other programs by reading and writing into their memory locations. In this paper, we present an approach for providing safe execution of external programs through a safe threads mechanism. The approach also provides a novel technique for safe sharing among external programs. The paper also describes the design and implementation of the safe threads.
1 Introduction
There is increasing interest in extensible systems that allow external programs to be downloaded and executed directly within a local system. Examples of such systems include extensible operating systems [3, 7], the Java runtime system [1], mobile code runtime systems [6], Internet browsers and web servers. While appealing from both system design and extensibility points of view, extensible systems are vulnerable to aberrant behaviors of external programs. External programs can interfere with executions of other programs by accessing their memory. They can corrupt system-dependent data, force a program into an inconsistent state, and crash the system. They can write into another program's memory, thereby corrupting system-dependent data, force a program into an
* This work is supported by the Defense Advanced Research Project Agency (DARPA) and Rome Laboratory, Air Force Materiel Command, USAF, under agreement number F30602-97-1-0221. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Defense Advanced Research Project Agency (DARPA), Rome Laboratory, or the U.S. Government.
inconsistent state, and overwrite other programs. Clearly, system software must provide safety against malicious or buggy external programs. The notion of safety has been studied quite extensively in operating system research and, recently, in type-safety based approaches [8, 9]. Most operating systems implement the notion of safety through address containment as in UNIX [12]. Address containment schemes provide safety by ensuring that a program cannot address the memory used by another program. The problem with the address containment-based approaches is that, in general, they enforce a rigid notion of safety and do not adequately support flexible sharing of data between processes. Sharing mechanisms, such as inter-process communication (IPC) or shared memory, are either inefficient (due to data copying) or require coordination of addresses among processes. Work in single address space operating systems (SASOS) [5, 10] has proposed the notion of address spaces that support safety among threads of execution, while providing sharing through address pointers. SASOS provide a nice solution but require a specialized operating system. While the above approaches do provide mechanisms for safety and sharing, the mechanisms are either too inflexible or difficult to use for the kind of application we are building. We are interested in developing a mobile code runtime system that creates a thread of execution for every mobile code. Our focus is on developing an execution environment that protects the runtime system and the mobile programs from each other. Further, since data sharing among mobile programs may be dynamic and flexible, the system software must support sharing mechanisms that can be customized dynamically to reflect these sharing patterns. We, thus, need a protection mechanism that provides protection as well as
flexible and dynamic sharing among threads of execution at the user level. This paper presents such a notion of protection and sharing for threads. We present a threads package, called Safe Threads, that supports the notion of threads whose stacks and data elements are completely protected. The thread package contains a novel mechanism for specifying flexible and dynamic sharing and protection among threads. In this approach, the notion of protection is represented by an abstract entity, called a protected domain. Sharing is defined by permission relationships among protected domains. Applications can bind threads and data elements to different protected domains in order to implement different sharing relationships dynamically. We have implemented the thread package through mprotect system calls, which make thread context switches quite expensive. Performance analysis of the thread package shows that protected thread creation is approximately 1.5 times more expensive. Context switch times are more expensive as well, but vary depending upon the number of protected domains involved. The rest of this paper is organized as follows: In Section 2, we describe the notion of safe threads and sharing among them. We also present an implementation of the threads package in this section. In Section 3, we present the performance characteristics of our system. Section 4 discusses related work and we conclude in Section 5.
2 Safe Threads package
In this section, we present the notion of safety and sharing within a thread package that we have developed. The thread package supports creation of multiple threads, provides fundamental safety guarantees, and supports mechanisms for safe sharing among threads. We first briefly describe the notion of threads and then discuss how safety and sharing are defined in the thread package.

2.1 Support for Threads
User-level threads packages provide creation, deletion, and management of multiple threads of execution. Threads are execution contexts and share an address space and other per-process resources. Unlike processes, which may require a large amount of state information, threads generally need only a program counter, a set of registers, and a stack of activation records. Context switching costs for threads are, therefore, much lower. Typical user-level threads packages, such as Pthreads [4], are implemented by constructing a separate stack for each thread, while sharing the code and heap data segments. In these thread packages, any thread can access any memory location, including code, the scheduler thread stack, other thread stacks, and heap segments. We have developed a thread package that provides for safe execution of external programs. The thread package provides two levels of safety guarantees. The first is an absolute safety guarantee for data that must always be protected. A thread's per-thread data (including stack and code) are completely protected from other threads. The second guarantee concerns data whose safety and sharing properties can be defined dynamically by the threads themselves. The thread package supports this through the notion of protected domains and permission relationships.
2.2 Protected domains and Permission relationships
A protected domain aggregates regions of memory that have similar sharing properties. A thread cannot access a protected domain, and therefore any of the data contained in that protected domain, unless the thread has been bound to the protected domain. A thread can define a binding relationship with a protected domain explicitly or implicitly. An explicit binding between a thread, T1, and a protected domain, P1, denoted T1 → P1, can occur in two ways: (i) When T1 creates P1, T1 is said to be the owner of P1 and can access all entities bound to P1. (ii) When T1, the owner thread of P1, explicitly binds a thread T2 with P1, denoted T1 (T2 → P1). This explicit binding allows T2 to access any data entities associated with P1. Note that such bindings allow T1 to share any data contained within P1 with other threads. Only the owner can change bindings to allow other threads access or permit other protected domains access. Implicit binding occurs as a result of thread bindings and permission relationships among protected domains. A permission relationship ↦ between two
protected domains captures an asymmetric sharing relationship between threads bound to the protected domains. For instance, the relation P1 ↦ P2 (read P1 is permitted by P2) specifies that threads bound to P1 can access data entities bound to P2, but not vice versa. We represent threads, protected domains and permission relationships in terms of a directed graph called a sharing relationship graph, in which a node denotes a thread or a protected domain and an edge denotes a permission relationship. Each permission relationship indicates a chaining of access to the contents of the protected domain for threads bound to the permitting protected domain. The access associated with each permission relationship is labeled read, write, or read/write, indicating the kind of permission that is allowed. Each protected domain has an access list of (thread ID, access type) pairs associated with it. The notions of protected domain and permission relationship allow one to define complex and dynamic sharing relationships between threads and data. An example of such a relationship is the hierarchical notion of trust and safety implemented in many systems. In these systems, a multi-level information sharing specification is created where entities (for instance, workers) at level L can access any information that exists at or below level L. However, they cannot access any information that exists above level L. Such a sharing relationship can be easily represented through protected domains and permission relationships. Thus, protected domains and permission relationships allow one to capture patterns of access and restriction among cooperating threads.
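To picture a two-level version of this example concretely, the fragment below uses invented function names (the excerpt does not reproduce the actual Safe Threads API), so it should be read only as a sketch of the relationships, not of the real interface: the calling thread creates two protected domains, explicitly binds a worker thread to the lower one, and installs a read-only permission so that threads bound to the manager domain can read worker data while the reverse access remains impossible.

/* st_* names, st_domain_t, st_thread_t, and the ST_* flags are hypothetical. */
void *worker_main(void *arg);                       /* worker thread body (application code) */

void setup_domains(void)
{
    st_domain_t *manager = st_domain_create();      /* calling thread becomes the owner */
    st_domain_t *worker  = st_domain_create();

    double *partial = st_domain_alloc(worker, 64 * sizeof(double));  /* data bound to worker */

    st_thread_t *w = st_thread_create(worker_main, partial);
    st_domain_bind(worker, w, ST_READ | ST_WRITE);  /* explicit binding by the owner: w -> worker */

    /* manager |-> worker: threads bound to manager may read data bound to
     * worker, but worker-bound threads cannot see manager's data. */
    st_domain_permit(manager, worker, ST_READ);
}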
2.3 Implementation
We have implemented the Safe Threads package on top of the QuickThreads [11] library on the FreeBSD 2.2.6 operating system. The QuickThreads library supports non-preemptive user-level threads. Safe Threads implements basic threading functionality on top of protection mechanisms. Our current implementation runs inside a single UNIX process virtual address space. Protection is enforced through use of the mprotect(2) system call. mprotect changes the access restrictions for the calling process on specified regions of memory within that process's virtual address space. Utilizing this mechanism allows for the flexibility to protect any page-sized region of memory. One important design decision involved whether a thread's stack should be protected from other threads. In order that a thread truly be safe from other threads, stacks must be inaccessible to other threads. There are, however, two important implications for the performance of the threads package. Firstly, the context switching code cannot be executed on either thread's stack, since at some point in the algorithm each stack is not accessible. This makes context switches more expensive than if the switching code could be executed directly on the stack of the thread that was previously executing. Secondly, and perhaps more importantly, because the stacks are not visible to all threads, parameters passed between threads must be copied.
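The core of such a protected switch can be pictured as follows. This is only a sketch of the idea, assuming page-aligned stacks: the thread_t layout and switch_stacks() are placeholders (the package actually switches through the QuickThreads layer), and, as noted above, this code has to run on a scheduler stack that is never protected, since both thread stacks are inaccessible at some point during the switch.

#include <sys/mman.h>
#include <stddef.h>

typedef struct {
    void  *stack_base;   /* page-aligned base of the thread's stack */
    size_t stack_len;    /* stack length, a multiple of the page size */
    void  *sp;           /* saved stack pointer */
} thread_t;

/* Placeholder for the actual low-level stack switch. */
void switch_stacks(void **save_sp, void *new_sp);

void protected_switch(thread_t *from, thread_t *to)
{
    /* Hide the outgoing thread's stack from every other thread. */
    mprotect(from->stack_base, from->stack_len, PROT_NONE);
    /* Make the incoming thread's stack accessible again. */
    mprotect(to->stack_base, to->stack_len, PROT_READ | PROT_WRITE);
    /* Transfer control; runs on the (never-protected) scheduler stack. */
    switch_stacks(&from->sp, to->sp);
}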
Optimizations: Since system calls require crossing the user/kernel protection boundary, system calls are more expensive than normal procedure calls. The method for implementing the thread context switch described above may potentially require many system calls per thread context switch. Therefore we have developed methods to speed up a protected context switch. There are two kinds of optimization possible: the first reduces the number of memory regions that must be protected and the second reduces the number of times the user/kernel boundary is crossed. The first can be achieved by combining protection domains of a thread if they do not export data to different threads, and by placing protected regions in contiguous regions so that such regions can be protected through one system call. The current version of the package does not include these optimizations yet as we are still formulating a general algorithm for using the sharing relationship graph to generate optimal¹ protected memory region layouts. Further, since the context switching code is usually very small, it is not clear if there are large benefits to be derived from implementing complex memory layout algorithms. The second optimization involves reducing the number of system calls. During a context switch, the threads package determines which protected domains need to be protected and unprotected. In our initial implementation, the package makes one mprotect call for each protected region that needed to be protected or unprotected. This results in O(n) system calls per context switch, where n is the number of protected regions. To reduce this number, we extended the FreeBSD kernel to include a new system call, multiMprotect(), which takes a vector of (address, length, protection type) triples. We, therefore, make one system call per context switch by packing all of the data into an array of triples. multiMprotect is a simple wrapper that takes each argument from the parameter vector and calls mprotect.
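The batching idea can be sketched as follows. The struct layout and the user-level wrapper are assumptions for illustration only: the paper states just that multiMprotect() takes a vector of (address, length, protection type) triples and applies mprotect to each entry inside the kernel, so the loop below shows the equivalent behaviour expressed as one mprotect(2) call per entry.

#include <sys/mman.h>
#include <stddef.h>

/* Assumed layout of one entry of the multiMprotect() vector. */
struct mprot_req {
    void  *addr;    /* start of a page-aligned protected region */
    size_t len;     /* region length */
    int    prot;    /* PROT_NONE, or PROT_READ | PROT_WRITE, ... */
};

/* Reference behaviour: apply each request in turn.  In the extended FreeBSD
 * kernel this loop runs inside a single system call, so a context switch
 * needs one kernel crossing instead of one per region. */
int apply_protections(const struct mprot_req *reqs, int n)
{
    for (int i = 0; i < n; i++)
        if (mprotect(reqs[i].addr, reqs[i].len, reqs[i].prot) != 0)
            return -1;
    return 0;
}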
3 Performance Analysis
In this section, we focus on analyzing the costs associated with providing the safety and sharing model. Two benchmarks were performed: a thread creation benchmark and a context switching benchmark.

3.1 Thread Creation
The thread creation benchmark compares the cost of creating protected and unprotected threads. Beyond what is required to create an unprotected thread, protected thread creation requires creating a protected domain, adding its stack as a data item, and protecting the stack. Table 1 shows results which indicate that for large numbers of threads creating protected threads is about 1.5 times as expensive.
¹ It is our belief that the general algorithm is at least NP-hard, but we have not proven it yet.
Thread creation times on Pentium 120 w/32mb
  # Threads   Time (No Protection)   Time (With Protection)   % Difference
  100         6.85                   11.82                    173
  500         8.89                   13.40                    151
  750         9.45                   13.75                    146
  1000        9.66                   14.01                    145

Thread creation times on Pentium II 300 w/128mb
  # Threads   Time (No Protection)   Time (With Protection)   % Difference
  100         2.27                   4.18                     184
  500         2.72                   4.72                     174
  750         4.16                   6.65                     160
  1000        3.94                   6.37                     162

Table 1. Data for thread creation times is given for two different machines. All times are in microseconds and are the average of 20 runs of creating the number of threads specified. All machines run FreeBSD 2.2.6-STABLE. g++ v.2.7.2.1 with -O2 and -m486 optimizations was used to compile all test programs.
3.2 Context Switch
Context switch times for Safe Threads are highly dependent upon the number of protected domains and the number of data elements contained within those protected domains. Figures 1(a) and 1(b) show the cost for context switches with different numbers of protected domains. The cost of an unprotected context switch is, as expected, a constant value. This number was determined by using the Safe Threads package with protection turned off. As mentioned in Section 2.3, our optimization goal with multiMprotect was to reduce the number of system calls from O(n) to 1 per context switch. In this we were successful, but we found that the additional overhead introduced minimizes the performance advantage gained by reducing the number of system calls. For all but the smallest numbers of protected domains, our new system call multiMprotect outperforms mprotect. However, the performance benefit from using multiMprotect is not as great as we expected. We feel that this is largely due to inefficient implementation. With different data structures and other optimizations these numbers could be significantly reduced. While the times for individual context switches can be very high for large numbers of protected domains, the tests were constructed to show worst case behavior where no protected domains are shared between threads. We believe that many applications will share protected domains and thereby incur lower context switch costs, even for large numbers of context switches.
4 Existing Safety Solutions
Existing solutions to the safety problem function at three levels of abstraction: hardware/OS, software and language. Hardware-based solutions address the problem at the lowest level. These solutions rely upon hardware to enforce
[Figures 1(a) and 1(b) plot execution time in microseconds against the total number of protected domains for the mprotect, multiMprotect, and no-protection configurations.]
(a) Context switch times for various Safe Threads protection options for 5 threads.
(b) Context switch times for various Safe Threads protection options for 20 threads.
Fig. 1. Overhead Cost of context switching for safe threads
safety [12]. Hardware protection has the advantage that it physically guarantees protection. The problem of safely executing untrusted code can also be addressed at the software level. Software safety solutions work at the user level, modifying the compiler, runtime system, and sometimes the untrusted code itself to ensure that software modules do not misbehave. Software Fault Isolation (SFI) [13] and Protected Shared Libraries (PSL) [2] are examples of software safety solutions. Finally, type-safe languages, such as Java, use language semantics to provide safety. Name space encapsulation ensures that private variables and methods cannot be accessed by other classes. Language-based protection schemes have the advantage that often a cross protection domain call can be as inexpensive as a procedure call. Several systems have been built using these languages, including the SPIN extensible operating system [3] and the J-Kernel system [8]. The J-Kernel [8] protection system provides a general framework for supporting multiple protection domains within a single process address space. This work is similar to Safe Threads in that both develop a mechanism for allowing multiple protection domains to exist within a single address space. However, since J-Kernel relies upon Java to enforce its protection, it is limited to creating safety solutions for Java programs. Opal [5], Mungi [10], and other single address space operating systems (SASOS) address many of the same problems as Safe Threads on an operating system level. Specifically, they provide protection and sharing within a single address space.
5 Conclusion
We have presented the design and implementation of a threads package that provides safety among threads. The package supports creation of threads, provides isolation among them, and includes mechanisms for protected sharing among threads. We have implemented the thread package, and initial performance analysis suggests that creating protected threads is approximately 1.5 times more expensive than creating unprotected threads. Context switching times depend upon the number of protected domains involved. We are currently looking at different techniques for optimizing the cost of thread creation and context switching.
References
1. K. Arnold and J. Gosling. The Java Programming Language. Addison Wesley, 1996.
2. A. Banerji, J. M. Tracey, and D. L. Cohn. Protected shared libraries - a new approach to modularity and sharing. In Proceedings of the USENIX 1997 Annual Technical Conference, pages 59-75, Anaheim, CA, January 1997.
3. B. Bershad et al. Extensibility, safety and performance in the SPIN operating system. 15th Symposium on Operating Systems Principles, pages 267-283, December 1995.
4. D. R. Butenhof. Programming with POSIX Threads. Addison Wesley Longman, Inc., 1997.
5. J. Chase, H. Levy, M. Feeley, and E. Lazowska. Sharing and protection in a single address space operating system. ACM Transactions On Computer Systems, 12(4):271-307, May 1994.
6. D. Chess, C. Harrison, and A. Kershenbaum. Mobile Agents: Are they a good idea? In Mobile Object Systems: Towards the Programmable Internet, pages 46-48. Springer-Verlag, April 1997.
7. D. R. Engler, M. F. Kaashoek, and J. O'Toole Jr. Exokernel: An operating system architecture for application-level resource management. In 15th Symposium on Operating Systems Principles, pages 251-266, December 1995.
8. C. Hawblitzel, C. Chang, G. Gzajkowski, D. Hu, and T. von Eicken. Implementing multiple protection domains in Java. In Proceedings of the USENIX 1998 Annual Technical Conference, pages 259-272, New Orleans, La., June 1998.
9. C. Hawblitzel and T. von Eicken. A case for language-based protection. Technical Report 98-1670, Cornell University, Ithaca, NY, 1998.
10. G. Heiser, K. Elphinstone, J. Vochteloo, and S. Russell. Implementation and performance of the Mungi single-address-space operating system. Technical Report UNSW-CSE-TR-9704, The University of New South Wales, Sydney, Australia, June 1997.
11. D. Keppel. Tools and techniques for building fast portable threads packages. Technical Report UWCSE 93-05-06, University of Washington, 1993.
12. U. Vahalia. UNIX Internals: The New Frontiers. Prentice Hall, Upper Saddle River, New Jersey 07458, 1996.
13. R. Wahbe, S. Lucco, T. E. Anderson, and S. L. Graham. Efficient software-based fault isolation. 14th Symposium on Operating Systems Principles, pages 203-216, 1993.
A Runtime System for Dynamic DAG Programming
Min-You Wu¹, Wei Shu¹, and Yong Chen²
¹ Department of ECE, University of New Mexico
² Department of ECE, University of Central Florida
[email protected]
Abstract. A runtime system is described here for dynamic DAG execution. A large DAG which represents an application program can be executed on a parallel system without consuming a large amount of memory space. A DAG scheduling algorithm has been parallelized to scale to large systems. Inaccurate estimation of task execution time and communication time can be tolerated. Implementation of this parallel incremental system demonstrates the feasibility of this approach. Preliminary results show that it is superior to other approaches.
1
Introduction
Task parallelism is essential for applications with irregular structures. With computation partitioned into tasks, load balance can be achieved by scheduling the tasks, either dynamically or statically. Most dynamic algorithms schedule independent tasks, that is, a set of tasks that do not depend on each other. On the other hand, static task scheduling algorithms consider the dependences among tasks. The Directed Acyclic Graph (DAG) is a task graph that models task parallelism as well as dependences among tasks. As the DAG scheduling problem is NP-complete in its general form [4], many heuristic algorithms have been proposed to produce satisfactory performance [6, 3, 9]. Current DAG scheduling algorithms have drawbacks which may limit their usage. Some important issues to be addressed are:
– They are slow since they run on a single processor machine.
– They require a large memory space to store the graph and are therefore not scalable.
– The quality of the obtained schedules relies heavily on the estimation of execution time. Accurate estimation of execution time is required. Without this information, sophisticated scheduling algorithms cannot deliver satisfactory performance.
– The application program must be recompiled for different problem sizes since the number of tasks and the estimated execution time of each task varies with the problem size.
– They are static, as the number of tasks and dependences among tasks in a DAG must be known at compile time. Therefore, they cannot be applied to dynamic problems.
These problems limit applicability of current DAG scheduling techniques and have not yet received substantial attention. Thus, many researchers consider the static DAG scheduling unrealistic. The memory space limitation and the recompiling problem can be eliminated by generating and executing tasks at runtime, as described in PTGDE [2], where a scheduling algorithm runs on a supervisor processor, which schedules the DAG to a number of executor processors. When a task is generated, it is sent to an executor processor to execute. This method solves the memory limitation problem because only a small portion of the DAG is in the memory at a time. However, the scheduling algorithm is still sequential and not scalable. Because there is no feedback from the executor processors, the load imbalance caused by inaccurate estimation of execution time cannot be adjusted. It cannot be applied to dynamic problems either. Moreover, a processor resource is solely dedicated to scheduling. If scheduling runs faster than execution, the supervisor processor will be idle; otherwise, the executor processors will be idle. We have proposed a parallel incremental scheduling scheme to solve these problems [5]. A scheduling algorithm can run faster and is more scalable when it is parallelized. By incrementally scheduling and executing DAGs, the memory limitation can be alleviated and inaccurate weight estimation can be tolerated. It can also be used to solve dynamic problems. This parallel incremental DAG scheduling scheme is based on general static scheduling and is extended from our previous project, Hypertool [6]. The new system is named Hypertool/2. Dierent from runtime incremental parallel scheduling for independent tasks, Hypertool/2 takes care of dependences among tasks and uses DAG as its computation model.
2
DAG and Compact DAG
A DAG, or a macro dataflow graph, consists of a set of nodes {n1, n2, ..., nn} connected by a set of edges, each of which is denoted by ei,j. Each node represents a task, and the weight of node ni, w(ni), is the execution time of the task. Each edge represents a message transferred from node ni to node nj, and the weight of edge ei,j, w(ei,j), is equal to the transmission time of the message. Figure 1 shows a DAG generated from a parallel Gaussian elimination algorithm with partial pivoting, which partitions a given matrix by columns. Node n0 is the INPUT procedure and n19 the OUTPUT procedure. The size of the DAG is proportional to N², where N is the matrix size. In a static system, a DAG is generated from the user program and scheduled at compile time. Then this scheduled DAG is loaded to PEs for execution. In a runtime scheduling system, the DAG is generated incrementally and each time only a part of the DAG is generated. For this purpose, a compact form of the DAG (Compact DAG, or CDAG) is generated at compile time. It is then expanded to the DAG incrementally at runtime. The CDAG is similar to the parameterized task graph in [2]. The size of a CDAG is proportional to the program size while the size of a DAG is proportional to the problem size or the matrix size.
Fig. 1. A DAG (Gaussian elimination).
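As an illustration of the weighted task graph just defined, the following minimal Python sketch (not part of Hypertool/2; the class layout, node-to-task mapping and weight values are illustrative assumptions) stores nodes with execution-time weights w(ni) and edges with transmission-time weights w(ei,j):

# Minimal weighted-DAG representation following the definitions above.
class DAG:
    def __init__(self):
        self.node_weight = {}   # task -> estimated execution time w(n_i)
        self.succ = {}          # task -> {successor: message transmission time w(e_ij)}

    def add_task(self, name, exec_time):
        self.node_weight[name] = exec_time
        self.succ.setdefault(name, {})

    def add_edge(self, src, dst, comm_time):
        self.succ[src][dst] = comm_time

# A few nodes of a Gaussian-elimination-style DAG; only n0 = INPUT is stated in
# the text, the remaining task names and all weights are hypothetical.
g = DAG()
g.add_task("n0", 1.0)            # INPUT
g.add_task("n1", 5.0)            # e.g. a FindMax task
g.add_task("n2", 4.0)            # e.g. an UpdateMtx task
g.add_edge("n0", "n1", 0.5)
g.add_edge("n1", "n2", 0.3)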
A CDAG is defined by its communication rules. A communication rule is in the format of source node → destination node : message name | guard. The communication rules in Figure 2 are generated from an annotated C program of Gaussian elimination. For details, refer to [8]. The corresponding CDAG is shown in Figure 3. The runtime system takes the CDAG as its input.

INPUT → FindMax(i) : vector[0], matrix[0,0] | i = 0
INPUT → UpdateMtx(0,j) : matrix[0,j] | 0 ≤ j ≤ N
FindMax(i) → FindMax(i+1) : vector[i+1] | 0 ≤ i ≤ N−2
FindMax(i) → OUTPUT : vector[N] | i = N−1
FindMax(i) → UpdateMtx(i,j) : vector[i+1] | 0 ≤ i ≤ N−1, i ≤ j ≤ N
UpdateMtx(i,j) → UpdateMtx(i+1,j) : matrix[i+1,j] | 0 ≤ i ≤ N−2, i+1 ≤ j ≤ N
UpdateMtx(i,j) → FindMax(i+1) : matrix[i+1,j] | 0 ≤ i ≤ N−2, j = i+1
UpdateMtx(i,j) → OUTPUT : matrix[i+1,j] | 0 ≤ i ≤ N−1, j = i
UpdateMtx(i,j) → OUTPUT : matrix[N,N] | i = N−1, j = N

Fig. 2. Communication rules for the Gaussian elimination code.
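To make the compile-time/runtime relationship concrete, the sketch below (an illustration, not the Hypertool/2 implementation; the edge representation is our own choice) expands two of the FindMax rules of Figure 2 into explicit DAG edges for a concrete problem size N. A real runtime system would perform this expansion incrementally, generating only the part of the DAG currently needed.

# Expand a small subset of the Figure 2 communication rules into DAG edges for
# a concrete matrix size N.  Tasks are identified by (name, indices) tuples.
def expand_findmax_rules(N):
    edges = []
    # FindMax(i) -> FindMax(i+1) : vector[i+1] | 0 <= i <= N-2
    for i in range(0, N - 1):
        edges.append((("FindMax", i), ("FindMax", i + 1), ("vector", i + 1)))
    # FindMax(i) -> UpdateMtx(i,j) : vector[i+1] | 0 <= i <= N-1, i <= j <= N
    for i in range(0, N):
        for j in range(i, N + 1):
            edges.append((("FindMax", i), ("UpdateMtx", i, j), ("vector", i + 1)))
    return edges

# The number of generated edges grows with the problem size N (DAG size ~ N^2),
# while the rules themselves stay constant in size (CDAG size ~ program size).
print(len(expand_findmax_rules(4)))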
Figure 1: Calibration Test Procedure

maximum accuracy and a high-fidelity testing procedure. At first glance, it may appear that a test tool as simple as Ping [KESSLER] would suffice to measure the network fault-recovery performance. Tools such as this are not adequate, however, as they lack the fidelity to make accurate measurements for the real-time distributed computing environment of interest. Also, many of these tools use transaction-based measurements rather than one-way measurements, which again reduces the fidelity of the measurements as shown in [IREY98]. Figure 1 defines a calibration test procedure for tuning the measurement process to maximize the accuracy of network fault-recovery performance measurements. Several new terms are used in the calibration test procedure: S specifies the size in bytes of each message sent on the test data stream during a test; Th specifies a measurement threshold used to filter out non-event related measurements; N specifies the number of messages transmitted on the data stream during a test; and F specifies the time elapsed between the injection of the first failure into the network and when the network has recovered completely from all injected failures. It is expected that multiple test runs will be conducted to select values for the parameters in each of the steps of the calibration test procedure. When a value is selected for a given step, additional test runs should be conducted to ensure that the new value selected didn't unexpectedly impact values previously selected for other steps. Tests of network performance by the authors (e.g., [IREY98]) in the real-time distributed environment of interest to the Navy have shown that the testing methodology used must support high-fidelity measurements. To satisfy this requirement, the network fault recovery performance tests must: 1) gather a large number of test samples to determine the range of performance; 2) examine the maximum and minimum (e.g., worst case) values rather than only mean values with standard deviations; and 3) provide visualization tools which allow the large data sets to be analyzed and iteratively reduced.
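The inter-arrival analysis that these requirements imply can be sketched as follows. This is only an illustration of the measurement idea (threshold Th filtering, worst-case reporting); it is not the authors' tool code, and the variable names and sample values are ours.

# Given one-way receive timestamps (in seconds) for the N test messages, compute
# inter-arrival times, filter out non-events using the threshold Th, and report
# worst-case values rather than only the mean.
def recovery_events(recv_times, th):
    gaps = [b - a for a, b in zip(recv_times, recv_times[1:])]
    events = [(seq + 1, gap) for seq, gap in enumerate(gaps) if gap > th]
    return gaps, events

recv = [0.000, 0.030, 0.061, 0.512, 0.541, 0.571]    # made-up timestamps
gaps, events = recovery_events(recv, th=0.035)
print("max inter-arrival: %.3f s" % max(gaps))        # worst case, not just the mean
print("events above Th:", events)                     # (sequence number, gap)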
5
Network Fault Recovery Performance Measurement Toolset
A toolset was developed to measure network fault recovery performance using the metrics and testing methodology defined here. The tools in the toolset fall into three classes: test orchestration tools, data collection tools, and analysis/visualization tools. The relationships among these tools are shown in Figure 2.

5.1 Test Orchestration Tools

The main test orchestration tool in the toolset is called nettest. The main functions of nettest are: 1) to activate the data collection tools (e.g., transmitters and
receivers) on specified nodes; 2) to initiate the test data stream at a specified time; 3) to inject faults in the network at specified times; 4) to gather results from the experiment; and 5) to repeat the process from step 1 until a specified number of iterations have been completed. The nettest program is generally run on a control host which is usually a system other than the sending and receiving hosts involved in an experiment. The complete set of results gathered by nettest is passed to the analysis/visualization tools.

Figure 2: Network Fault Recovery Performance Toolset

One of the unique features of nettest specifically related to the measurement of network fault recovery performance is its fault injection capability. A scripting language was developed for nettest which allows the user to invoke actions on network components and to specify faults to be injected into specific network components (e.g., cables, switches, etc.) at specified times. When the power is removed from a media converter, it appears to the network components or hosts connected by the media converter that a cable has been broken. This provides one mechanism that enables the automated testing needed for performing the numerous tests required for the high-fidelity testing previously described. A large library of reusable test scripts has been developed for testing failure scenarios for a variety of fault-tolerant network architectures. For each test iteration, nettest first runs a user-provided reset script to force the components under test into a known state. Next, the test data stream is started in parallel with a fault injection script which injects faults into specified network components at specified times. We have found that very complex reset scripts may be needed for particular architectures to be tested. It is best to assume that the network infrastructure is in an unknown state and to reset the state to a known state before a test run is conducted. Events external to a particular test procedure (e.g., a previous test) may have left the network infrastructure in an undesirable state.
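The iteration structure described above (reset to a known state, start the data stream, inject faults at scheduled times, collect results, repeat) can be summarized by the following sketch. The nettest scripting language itself is not reproduced here; the function names below are placeholders rather than the real tool interfaces, and the timer-based scheduling is a simplification.

import time

def run_iteration(reset_script, fault_schedule, start_stream, collect_results):
    reset_script()                       # force components under test into a known state
    t0 = time.time()
    start_stream()                       # begin the test data stream
    for delay, inject in sorted(fault_schedule, key=lambda x: x[0]):
        time.sleep(max(0.0, delay - (time.time() - t0)))
        inject()                         # e.g., power off a media converter at a set time
    return collect_results()

def run_test(iterations, **steps):
    # Repeat the whole procedure for many iterations, as nettest does.
    return [run_iteration(**steps) for _ in range(iterations)]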
5.2 Data Collection Tools

The main data collection tool used in the toolset is called CAST (Communications Analysis and Simulation Tool). CAST was developed by the authors to support end-to-end network performance analysis for unicast communications. It gathers a large number of performance metrics of which the network fault recovery performance metrics presented in Section 3 are a subset. CAST performs preliminary filtering on the data using Th defined in Section 3.

5.3 Analysis/Visualization Tools

To decrease the time required for a user to analyze various CAST test runs, a set of analysis and visualization tools were developed. The nettest tool generates large data files containing the results of iterative CAST test runs. In most cases, the data is best interpreted through visual plots. A tool was developed called gen_test_results that extracts the data of interest from the raw data files and then plots the extracted data. To aid in reducing the data gathered during a network fault recovery test (i.e., analysis of events), two other tools were developed, edew and eventwindow. Edew computes statistical functions on the set of recorded events, such as the mean and standard deviation. Eventwindow filters the data from a network fault recovery test to obtain the event data. During the analysis of a network fault recovery test, these two tools are used to process and format the data. The output of these tools is provided as input to the gen_test_results tool.
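The post-processing performed by eventwindow and edew can be pictured with the following sketch (our code, not the tools' source): first reduce a raw test run to its events, then summarize the events statistically, including the worst-case values recommended in Section 4.

import statistics

def event_window(inter_arrivals, th):
    # Keep only measurements that exceed the threshold Th (the "events").
    return [g for g in inter_arrivals if g > th]

def summarize(events):
    # Mean and standard deviation plus the worst-case values.
    if not events:
        return {"count": 0}
    return {
        "count": len(events),
        "mean": statistics.mean(events),
        "stdev": statistics.stdev(events) if len(events) > 1 else 0.0,
        "min": min(events),
        "max": max(events),
    }

runs = [[0.45, 0.41], [0.39, 0.80], [0.52, 0.47]]     # recovery times per iteration (s)
print(summarize([e for run in runs for e in run]))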
6
Applying the Metrics, Tools, and Testing Methodology
The previous sections described the metrics, the tools that are used to measure those metrics, and some methods of how those tools can be used to perform an experiment. This section demonstrates the utility of the metrics in evaluating the performance of a survivable network by examining experimental data collected. 6.1
General Test Setup This testing methodology is applied to both an FDDI based network and a survivable Fast Ethernet based network. The FDDI network consists of two host machines and a FDDI concentrator. One of the hosts is dual-homed to the FDDI concentrator. This is achieved by connecting both ports of a dual-attached FDDI NIC, which is installed on the host, to two ports on the FDDI concentrator. The second host uses only one of the ports on its installed FDDI NIC. The dual-homed machine serves as the transmitting host during the test, while the single-homed machine performs the function of the receiving host. In the Fast Ethernet test, two host machines are used in conjunction with two Fast Ethernet switches. Each network host is dual-homed to each of the two Ethernet switches. The two dual-homed interfaces on the hosts may be on the same Network Interface Card (NIC) or may be on multiple NICs depending on the type of network fault recovery scheme being used. In either case, both network ports appear as a single network interface to the applications running on the network host. The two Ethernet switches are directly interconnected by one or more Fast Ethernet links.
6.2 Example FDDI Test Results

The results of performing a network fault recovery performance test on FDDI are located in the plots in Figure 3. The plot on the left shows the distribution of inter-arrival times measured throughout the length of the test. Notice that a majority of the inter-arrival times occur around the expected value of 200 milliseconds, which agrees with the results found by other measurement techniques [HILES95]. All other values of inter-arrival times appear to be well below this range. It is difficult to determine the significance of these values until the plot located on the right in Figure 3 is examined. In this plot, the inter-arrival times are examined with respect to the sequence numbers of the packets being received. This shows at what point in the test the events occurred. Notice that this plot shows there are two distinct sets of events occurring during the test. This result agrees with the expectation that two events occur during each iteration of the test. These expected events are 1) the primary port on the transmitting host machine is brought down, and 2) the primary port is subsequently brought back up again. The second event is expected since FDDI has one port that is the default when both ports are active. Other fault tolerant solutions, including some of the fault tolerant Fast Ethernet solutions, do not exhibit this behavior because they do not have a default port. In this case a failover only occurs if the active port goes down.

Figure 3: Measured FDDI Interarrival Times
6.3 Example Fast Ethernet Test Results

The results of performing network fault recovery performance tests on Fast Ethernet are located in Figures 4-6. The first set of data, represented by the left hand plot of Figure 4, shows measured inter-send times in a test where the inter-send interval has been set to 30 milliseconds. Notice that most of the values are very near the 30-millisecond setting. The outlying values are useful in determining the threshold for the inter-arrival metrics in the absence of network failures. For instance, if an inter-arrival time above 35 milliseconds is considered a network fault, then it is likely that one network failure would be erroneously detected due to the one inter-send value above 37 milliseconds. Normally the variance in the inter-send values will be taken into account if the calibration test procedure described in Section 4 has been performed properly. However, it is wise to check the inter-send data from a test to
ensure that the test results are due to the performance of the network and not a result of an anomaly on the transmitter.
Figure 4: Measured inter-send and inter-arrival times The next few sets of data shown in the right hand plot of Figure 4 and both plots of Figure 5 are from a test in which one of the two Ethernet switches is powered off and the clients have to fail over to the second Ethernet switch. The right hand plot of Figure 4 shows the inter-arrival times measured during this particular test. The data set indicates that the network fault recovery performance of this survivable Ethernet solution is similar to the fault recovery time of FDDI. All the network fault recovery times are less than one second, and all but one of the measurements are below 500 milliseconds. If this were the only type of measurement made during the test, as was the case with [HILES95], then some very undesirable network behavior would have been undetected. The left hand plot of Figure 5 shows the percent of the transmitted data which was received by the receiving host. Since there is a period of time in which the network is inaccessible due to the powering down of the Ethernet switch, one would expect that only a fraction of the total messages would make it to the destination. Instead more messages were received than were transmitted (approximately 1.5 times as many). This indicates that messages are being duplicated by the network fault recovery process. The data set shown in right hand plot of Figure 5 confirms this. This data set shows the actual number of duplicate messages that have been received. In this case, conclusions drawn from the data in the left hand plot of Figure 5 lead to a closer examination of the data in the right hand plot of Figure 5. The order of magnitude of the number of duplicate messages is important to the results of this test. Even if each individual receiving host knows to ignore duplicate packets, the health
of the network as a whole may be in danger. If the switches used in this particular survivable Ethernet configuration had been fully populated with clients, then the duplicate packets could have saturated the bandwidth available on the switches, thereby eliminating any chances of recovering from the network failure in a timely manner. Further analysis would be needed to pinpoint the source of the duplicate packets (e.g., transmitting host NIC, switch, etc.) The plots in Figure 6 show the importance of a high-fidelity testing methodology for measuring network fault recovery performance. The plot on the left shows measurement of inter-arrival times obtained by manually inserting network faults into the test. After performing the measurement five times, it appears that the network fault recovery time is consistently around 400 milliseconds. The right hand plot of Figure 6 shows the results an automated, high-fidelity measurement made using nettest. The network fault recovery performance measured using this testing methodology shows that the network fault recovery times can vary within a broader range of 300 to 800 milliseconds. This shows why an automated, high-fidelity testing methodology is preferable since many more iterations are possible. This leads to a better view of the performance being obtained. Typical nettest runs last for hours or days with hundreds or thousands of test iterations.
Figure 6: Interarrival times measured using manual and automated techniques
7
Conclusions and Ongoing Work
This paper presents a number of metrics for evaluating network fault recovery performance and a testing methodology for applying these metrics. The utility of the application of these metrics and the testing methodology is shown through a number of example experiments. Unlike other network fault recovery metrics which have been defined, the metrics defined here allow network fault recovery performance to be measured in a manner which is independent of the characteristics of the networking components used or the architecture used to interconnect these components. The benefits of using the automated, high-fidelity testing methodology via the capabilities provided by the fault recovery performance measurement toolset are shown. The practical limitations of manual fault injection (e.g., on the order of 10 iterations) that lead to low-fidelity measurements are contrasted with the high-fidelity
measurements possible using the automated fault injection capabilities of the nettest test orchestration tool to run hundreds or thousands of test iterations. Experiments are ongoing to evaluate components to provide a fault tolerant networking alternative that uses readily obtainable COTS components rather than FDDI components that will no longer be produced. The experiments are looking at the network fault recovery performance of Network Interface Cards (NICs) and switching components as well as architectures based on these components. Seven network architectures have been identified and testing on them is in-progress. A number of failure scenarios have been developed which are applicable to one or more of these architectures. The network fault recovery performance toolset presented in Section 5 will be used to perform these experiments based on the scenarios developed. In addition to helping the Navy transition to a new approach for configuring shipboard networks, the lessons learned through this use of the toolset will be applied to improving these tools and testing methodologies.
8
References
[802.1D] IEEE Draft P802.1w/D2, Supplement to ISO/IEC 15802-3 (IEEE Std 802.1D) Information technology – Telecommunications and information exchange between systems – Local and metropolitan area networks – Common specifications – Part 3: Media Access Control (MAC) Bridges: Rapid Reconfiguration. [FDDI] ISO-9314-1, Information Processing Systems – Fibre Distributed Data Interface (FDDI) – Part 1: Token Ring Physical Layer Protocol (PHY). [SMT] ISO-9314-6, Information Processing Systems – Fibre Distributed Data Interface (FDDI) – Part 6: Station Management. [HILES95] Hiles, William S., Marlow, David T., Approximation of FDDI Minimum th Reconfiguration Time, Proceedings of the IEEE Computer Society 20 Conference on Local † Computer Networks, September 1995 . [HUANG] Huang, J., Song, S., Li, L., Kappler, P., Freimark, R., Gustin, J., Kozlik, T., An th Open Solution to Fault-Tolerant Ethernet: Design, Prototyping, and Evaluation, 18 IEEE International Performance, Computing, and Communications Conference, February 1999. [IREY97] Irey, Philip M., Marlow, David T., Harrison, Robert D., Distributing Time Sensitive th Data in a COTS Shared Media Envoronment, 5 International Workshop on Parallel and † Distrbuted Real-Time Systems, pp. 53-62, April 1997 [IREY98] Irey IV, Philip M., Harrison, Robert D., Marlow, David T., Techniques for LAN Performance Analysis in a Real-Time Environment, Real-Time Systems - International Journal of Time Critical Computing Systems, Volume 14, Number 1, pp. 21-44, January † 1998. [KESSLER] Kessler, G., Shepard, S., RFC 1739 - A Primer On Internet and TCP/IP Tools, December 1994. [MILLS] Mills, David L., RFC 1305 - Network Time Protocol (Version 3) Specification, Implementation and Analysis, March 1992. [RALPH] Ralph, Stanley F., Ukrainsky, Orest J., Schellak, Robert H., Weinberg, Leonard, th Alternate Path FDDI Topology, Proceedings of the IEEE Computer Society 17 Conference on Local Computer Networks, September 1992. †
Documents available at http://www.nswc.navy.mil/ITT.
Consensus Based on Strong Failure Detectors: A Time and Message-Efficient Protocol

Fabíola Greve†, Michel Hurfin†, Raimundo Macêdo‡, and Michel Raynal†

† IRISA, Campus de Beaulieu, 35042 Rennes Cedex, France
‡ LaSiD-CPD-UFBA, Campus de Ondina, CEP 40170-110 Bahia, Brazil
[email protected] [email protected]

Abstract. The class of strong failure detectors (denoted S) includes all failure detectors that suspect all crashed processes and that do not suspect some (a priori unknown) process that never crashes. So, a failure detector that belongs to S is intrinsically unreliable as it can arbitrarily suspect correct processes. Several S-based consensus protocols have been designed. Some of them systematically require n computation rounds (n being the number of processes), each round involving n² or n messages. Others allow early decision (i.e., the number of rounds depends on the maximal number of crashes when there are no erroneous suspicions) but require each round to involve n² messages. This paper presents an early deciding S-based consensus protocol each round of which involves 3(n − 1) messages. So, the proposed protocol is particularly time and message-efficient.

Keywords: Asynchronous Distributed System, Consensus, Crash Failure, Perpetual Accuracy Property, Unreliable Failure Detector.
1 Introduction

Several crucial practical problems (such as atomic broadcast and atomic commit) encountered in the design of reliable applications built on top of unreliable asynchronous distributed systems actually belong to a same family: the family of agreement problems. This family can be characterized by a single problem, namely the Consensus problem, that is their "greatest common subproblem". That is why the consensus problem is considered as a fundamental problem. This is practically and theoretically very important. From a practical point of view, this means that any solution to consensus can be used as a building block on top of which solutions to particular agreement problems can be designed. From a theoretical point of view, this means that an agreement problem cannot be solved in systems where consensus cannot be solved. Informally, the consensus problem can be defined in the following way. Each process proposes a value and all correct processes have to decide the same value, which has to be one of the proposed values. Solving the consensus problem in asynchronous distributed systems where processes may crash is far from being a trivial task. It has been shown by Fischer, Lynch and Paterson [3] that
there is no deterministic solution to the consensus problem in those systems as soon as processes (even only one) may crash. This impossibility result comes from the fact that, due to the uncertainty created by asynchrony and failures, it is impossible to precisely know the system state. So, to be able to solve agreement problems in asynchronous distributed systems, those systems have to be "augmented" with additional assumptions that make consensus solvable in such improved systems. A major and determining advance in this direction has been done by Chandra and Toueg who have proposed [1] (and investigated with Hadzilacos [2]) the Unreliable Failure Detector concept. A failure detector can informally be seen as a set of oracles, one per process. The failure detector module (oracle) associated with a process provides it with a list of processes it guesses to have crashed. A failure detector can make mistakes by not suspecting a crashed process, or by erroneously suspecting a correct process. In their seminal paper [1], Chandra and Toueg have defined two types of property to characterize classes of failure detectors. A class is defined by a Completeness property and an Accuracy property. A completeness property is on the actual detection of crashes. The completeness property we are interested in basically states that "every crashed process is eventually suspected by every correct process". An accuracy property limits the mistakes a failure detector can make. In this paper, we are interested in solving the consensus problem in asynchronous distributed systems equipped with a failure detector of the class S. A failure detector of this class suspects all crashed processes (completeness) and guarantees that there is a correct process that is never suspected, but this process is not a priori known (perpetual weak accuracy). Several S-based consensus protocols have been proposed. They all assume f ≤ n − 1 (where f is the maximal number of processes that may crash), and consequently are optimal with respect to the number of crash failures they can tolerate. They all proceed in asynchronous "rounds". The S-based consensus protocol proposed in [1] requires exactly n rounds, each round involving n² messages (each message being composed of n values). The S-based protocols presented in [7, 8] also require n rounds, but each round involves only n messages carrying a single value. It is important to emphasize that these three protocols require n rounds whatever the value of f, the number of actual crashes and the occurrences of erroneous suspicions are. To our knowledge, very few early deciding S-based consensus protocols have been proposed; more precisely, we are only aware of the generic protocol presented in [6]¹. When instantiated with a failure detector ∈ S, this generic protocol

¹ The generic dimension of the protocol introduced in [6] lies in the class of the failure detector it relies on. This generic protocol can be instantiated with any failure detector of S (provided f ≤ n − 1) or ◇S (provided f < n/2). A failure detector that belongs to ◇S: (1) eventually suspects permanently all crashed processes, and (2) guarantees that there is a time after which there is a correct process that is never suspected.
provides an S-based consensus protocol that terminates in at most (f + 1) rounds when there are no erroneous suspicions. So, when the failure detector is tuned to very seldom make mistakes, this protocol provides early decision. Each round of this protocol involves n² messages (each message being made of a proposed value plus a round number) and one or two communication steps. This paper presents an early deciding S-based consensus protocol. When there are no erroneous suspicions, the proposed protocol requires (f + 1) rounds, in the worst case. When there are neither crashes nor erroneous suspicions, it requires a single round, a round being made up of two communication steps. Each round involves 3(n − 1) messages, and each message carries at most three values: a round number, a proposed value and a timestamp (i.e., another round number). So, the protocol is both time and message-efficient. Moreover, a generalization of the protocol exhibits an interesting tradeoff between the number of rounds and the number of messages per round. The paper is made up of five sections. Section 2 introduces the asynchronous system model, the class S of failure detectors, and the consensus problem. Then, Section 3 presents the S-based consensus protocol. Section 4 discusses its cost. Finally, Section 5 concludes the paper.
2 Asynchronous Distributed Systems, Failure Detectors and the Consensus Problem

The system model is patterned after the one described in [1, 3]. A formal introduction to failure detectors is provided in [1].
2.1 Asynchronous Distributed System with Process Crash Failures

We consider a system consisting of a finite set of n > 1 processes, namely, Π = {p1, p2, ..., pn}. A process can fail by crashing (i.e., by prematurely halting). It behaves correctly (i.e., according to its specification) until it (possibly) crashes. By definition, a correct process is a process that does not crash. Let f denote the maximum number of processes that can crash (f ≤ n − 1).
Processes communicate and synchronize by sending and receiving messages through channels. Every pair of processes is connected by a channel. Channels are not required to be FIFO, they may also duplicate messages. They are only assumed to be reliable in the following sense: they do not create, alter or lose messages. This means that a message sent by a process pi to a process pj is assumed to be eventually received by pj, if pj is correct². The multiplicity of processes and the message-passing communication make the system distributed. There is no assumption about the relative speed of
² The "no message loss" assumption is required to ensure the Termination property of the protocol. The "no creation and no alteration" assumptions are required to ensure its Validity and Agreement properties.
processes or the message transfer delays. This absence of timing assumptions makes the distributed system asynchronous.
2.2 The Class S of Unreliable Failure Detectors

Informally, a failure detector consists of a set of modules, each attached to a process: the module attached to pi maintains a set (named suspectedi) of processes it currently suspects to have crashed. Any failure detector module is inherently unreliable: it can make mistakes by not suspecting a crashed process or by erroneously suspecting a correct one. Moreover, suspicions are not necessarily stable: a process pj can be added to and removed from a set suspectedi according to whether pi's failure detector module currently suspects pj or not. As in [1], we say "process pi suspects process pj" at some time t, if at time t we have j ∈ suspectedi. As indicated in the introduction, a failure detector class is defined by two abstract properties, namely a Completeness property and an Accuracy property. In this paper we are interested in the following properties [1]:
– Strong Completeness: Eventually, every crashed process is permanently suspected by every correct process.
– Perpetual Weak Accuracy: Some correct process is never suspected.
The failure detectors that satisfy these properties define the class S (Strong failure detectors). It is important to note that a failure detector ∈ S can make an arbitrary number of mistakes: at any time all (but one) correct processes can be erroneously suspected. Moreover, a process can alternatively suspect and not suspect some correct processes.
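For readers who prefer an operational view, a failure detector module of class S can be pictured with the interface sketched below. This is only an illustration of the two properties (the paper defines S purely axiomatically); the class layout and method names are ours, and the mechanism that drives suspicions is left unspecified.

# A per-process failure detector module: suspected_i is simply a queryable,
# possibly changing, set of process identities.
class StrongFailureDetectorModule:
    def __init__(self, process_ids):
        self.processes = set(process_ids)
        self.suspected = set()          # may gain and lose members over time

    def suspect(self, p):               # driven by timeouts, heartbeats, ...
        self.suspected.add(p)

    def unsuspect(self, p):
        self.suspected.discard(p)

    def query(self):
        # Class S only constrains the histories of these answers:
        #  - Strong Completeness: every crashed process is eventually, permanently,
        #    in the suspected set of every correct process.
        #  - Perpetual Weak Accuracy: some correct process is never returned here
        #    by any module at any time.
        return set(self.suspected)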
2.3 The Consensus Problem
In the consensus problem, every correct process pi proposes a value vi and all correct processes have to decide on some value v, in relation with the set of proposed values. More precisely, the Consensus problem is defined by the three following properties [1, 3]:
– Termination: Every correct process eventually decides on some value.
– Validity: If a process decides v, then v was proposed by some process.
– Agreement: No two correct processes decide differently.
The agreement property applies only to correct processes. So, it is possible that a process decides on a distinct value just before crashing. Uniform Consensus prevents such a possibility. It has the same Termination and Validity properties plus the following agreement property:
– Uniform Agreement: No two processes (correct or not) decide differently.
In the following we are interested in the Uniform Consensus problem.
3 The S-Based Consensus Protocol

3.1 The Protocol

3.2 Underlying Principles

As other failure detector-based consensus protocols, the proposed protocol uses the rotating coordinator paradigm and processes proceed in asynchronous rounds [1]. There are at most n rounds. Each round r (1 ≤ r ≤ n) is managed by a predetermined coordinator, namely, pr. Moreover, during r, the coordinator of the next round (namely, pr+1) also plays a particular role. Each process pi manages three local variables: the current round number (ri), its current estimate of the decision value (esti), and a timestamp (tsi) that indicates the round number during which it adopted its current estimate esti. As in [6-8], during a round, the current coordinator tries to impose its current estimate as the decision value. To attain this goal, each round r is made of two steps (see Figure 1).
– During the first step (lines 4-6) the current coordinator pr broadcasts a message carrying its current estimate, namely, the message phase1(r, estr). When a process pi receives such a phase1(r, v) message, it adopts v as its new estimate and consequently updates tsi to the current round number (line 6). If pi suspects pr, its "state variables" esti and tsi keep their previous values.
– During the second phase (lines 7-13), each process sends its "current state" to the current round coordinator (pr) and the next round coordinator (pr+1). This "state" is carried by a phase2 message (line 8). The triple (r, esti, tsi) indicates that during r, (1) the estimate of the decision value considered by pi is esti and (2) this value has been adopted during the round tsi. Then, pr and pr+1 follow the same behavior: each waits until it has received phase2 messages from all the processes it does not suspect (let us note that due to the completeness property of the underlying failure detector, all crashed processes are eventually suspected). If all the phase2 messages the process pr (resp. pr+1) has received have a timestamp equal to the current round number (r), pr (resp. pr+1) decides on its current estimate (lines 11-12). This means that the current estimate of pr has been imposed as decision value. Whether the process pr+1 decides during r or proceeds to r + 1, it must have a correct estimate (in order not to violate the consensus agreement property). This is ensured by requiring it to update its local estimate (estr+1) to the estimate it has received with the highest timestamp in a phase2(r, est, ts) message (line 10).
The protocol is fully described in Figure 1. A process p starts a consensus execution by invoking Consensus(v ), where v is the value it proposes. The protocol i
i
i
Consensus Based on Strong Failure Detectors
1263
terminates for p when it executes the statement return which provides it with the decided value (at line 12 or 15). It is possible that distinct processes do not decide during the same round. To prevent a process from blocking forever (i.e., waiting for a value from a process that has already decided), a process that decides, uses a reliable broadcast [5] to disseminate its decision value. To this end, the Consensus function is made of two tasks, namely, T 1 and T 2. T 1 implements the previous discussion. Line 12 and T 2 implement the reliable broadcast. i
Function Consensus(vi)

Task T1:
(1)  ri ← 0; esti ← vi; tsi ← 0;
(2)  while ri < n do
(3)    ri ← ri + 1;      % p_ri is the coordinator of the current round %
                         % p_ri+1 is the coordinator of the next round %
       ------ Phase 1 of round ri: p_ri proposes esti to all ------
(4)    if (ri = i) then ∀j: send phase1(ri, esti) to pj endif;
(5)    wait until (phase1(ri, v) has been received from p_ri ∨ ri ∈ suspectedi);
(6)    if (phase1(ri, v) received from p_ri) then esti ← v; tsi ← ri endif;
       ------ Phase 2 of round ri: each process replies to p_ri and p_ri+1 ------
(7)    let X = {ri, ri + 1} if (ri < n), X = {ri} otherwise;
(8)    ∀j ∈ X: send phase2(ri, esti, tsi) to pj;
(9)    if (i ∈ X) then wait until (phase2(ri, est, ts) messages have been received
                                   from all non suspected processes);
(10)     if i = ri + 1 then esti ← est rec. with highest ts endif;
(11)     if (all phase2 messages are such that ts = ri)
(12)     then ∀j ≠ i: send decision(esti) to pj; return(esti)
(13)     endif
       endif
(14) endwhile

Task T2:
(15) upon the reception of decision(est) from pj: ∀k ≠ i, j: send decision(est) to pk; return(est)

Fig. 1. The S-Based Consensus Protocol
3.4 Proof
The protocol satisfies the Termination, Validity, and Uniform Agreement properties defining the Consensus problem (these properties have been stated in Section 2.3). The reader interested in the proof is referred to [4].
4 Cost of the Protocol
Time complexity. The number of rounds of the protocol is n. Differently from the S-based consensus protocols described in [1, 7, 8] that always require n rounds, the actual number of rounds of the proposed protocol depends on failure occurrences and erroneous suspicion occurrences. So, to analyze the time complexity of the protocol, we consider the length of the sequence of messages (number of communication steps) exchanged during a round. Moreover, as we do not master the quality of service offered by the underlying failure detector, but as in practice failure detectors can be tuned to very seldom make mistakes, we do this analysis considering the underlying failure detector behaves reliably. In such a context, the time complexity of a round is characterized by a pair of integers [6]. Considering the most favorable scenario that allows to decide during the current round, the first integer measures its number of communication steps (without counting the cost of the reliable broadcast implemented by the task T2). The second integer considers the case where a decision cannot be obtained during the current round and measures the minimal number of communication steps required to progress to the next round. Let us consider these scenarios.
– The first scenario is when the current round coordinator is correct and is not suspected. In that case, 2 communication steps are required to decide. During the first step, the current coordinator broadcasts a phase1 message (line 4). During the second step, each process sends a phase2 message (line 8). So, in the most favorable scenario that allows to decide during the current round, the round is made up of two communication steps.
– The second scenario is when the current round coordinator has crashed and is suspected by all processes. In that case, as processes correctly suspect the coordinator (line 5), the first communication step is actually skipped. Processes only send phase2 messages (line 8) and proceed to the next round. So, in the most favorable scenario to proceed to the next round, the round is made up of a single communication step.
So, when the underlying failure detector behaves reliably, according to the previous discussion, the time complexity of a round is characterized by the pair (2, 1) of communication steps.

Message complexity of a round. During each round, the round coordinator broadcasts a phase1 message and each process sends two phase2 messages. Hence, the message complexity of a round is bounded by 3(n − 1).

Message type and size. There are three types of message: phase1, phase2 and decision. A decision message carries only a proposed value. A phase1 message carries a proposed value plus a round number. A phase2 message carries a proposed value plus two round numbers. As the number of rounds is bounded by n, the size of a round number is bounded by log₂(n). Let |v| be the bit size of a proposed value. According to the previous discussion, 3n(n − 1)(|v| + log₂ n) is an upper bound of the bit complexity of the protocol.
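As a quick numeric illustration of these bounds (our arithmetic; the value size is an arbitrary assumption), the following sketch evaluates the per-round message count and the overall bit-complexity bound:

import math

def round_messages(n):
    # One phase1 broadcast plus two phase2 messages per process: 3(n - 1).
    return 3 * (n - 1)

def bit_complexity_bound(n, value_bits):
    # 3n(n - 1)(|v| + log2 n), rounding the round-number size up to whole bits.
    return 3 * n * (n - 1) * (value_bits + math.ceil(math.log2(n)))

print(round_messages(8))               # 21 messages per round for n = 8
print(bit_complexity_bound(8, 32))     # upper bound in bits for 32-bit proposed values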
5 Conclusion

The paper has studied the consensus problem in the setting of asynchronous distributed systems equipped with a failure detector of the class S. This class includes all the failure detectors that suspect all crashed processes and that do not suspect some (a priori unknown) correct process. The proposed protocol proceeds in asynchronous rounds and allows early decision. If there are neither failures nor false suspicions the decision is obtained in a single round. A round is made up of two communication steps and involves 3(n − 1) messages. The proposed protocol compares very favorably to the previous S-based consensus protocols, as those require n rounds or n² messages per round.
References
1. Chandra T. and Toueg S., Unreliable Failure Detectors for Reliable Distributed Systems. Journal of the ACM, 43(2):225-267, March 1996.
2. Chandra T., Hadzilacos V. and Toueg S., The Weakest Failure Detector for Solving Consensus. Journal of the ACM, 43(4):685-722, July 1996.
3. Fischer M.J., Lynch N. and Paterson M.S., Impossibility of Distributed Consensus with One Faulty Process. Journal of the ACM, 32(2):374-382, April 1985.
4. Greve F., Hurfin M., Macêdo R. and Raynal M., Consensus Based on Strong Failure Detectors: Time and Message-Efficient Protocols. Tech Report #1290, IRISA, Université de Rennes, France, January 2000. http://www.irisa.fr/EXTERNE/bibli/pi/1290/1290.html.
5. Hadzilacos V. and Toueg S., Reliable Broadcast and Related Problems. In Distributed Systems, ACM Press (S. Mullender Ed.), New York, pp. 97-145, 1993.
6. Mostefaoui A. and Raynal M., Solving Consensus Using Chandra-Toueg's Unreliable Failure Detectors: a General Quorum-Based Approach. Proc. 13th Int. Symposium on Distributed Computing (DISC'99) (formerly, WDAG), Springer-Verlag LNCS 1693, pp. 49-63, (P. Jayanti Ed.), Bratislava (Slovakia), September 1999.
7. Mostefaoui A. and Raynal M., Consensus Based on Failure Detectors with a Perpetual Weak Accuracy Property. Proc. Int. Parallel and Distributed Processing Symposium (IPDPS 2000), (14th IPPS/11th SPDP), Cancun (Mexico), May 2000.
8. Yang J., Neiger G. and Gafni E., Structured Derivations of Consensus Algorithms for Failure Detectors. Proc. 17th ACM Symposium on Principles of Distributed Computing, Puerto Vallarta (Mexico), pp. 297-308, 1998.
Implementation of Finite Lattices in VLSI for Fault-State Encoding in High-speed Networks

Andreas C. Döring, Gunther Lustig
Medizinische Universität zu Lübeck, Institut für Technische Informatik
Ratzeburger Allee 160, 23538 Lübeck, Germany
{doering,lustig}@iti.mu-luebeck.de
Abstract. In this paper the propagation of information about fault states and its implementation in high-speed networks is discussed. The algebraic concept of a lattice (a partially ordered set with supremum and infimum) is used to describe the necessary operation. It turns out that popular algorithms can be handled this way. Using the properties of lattices, efficient implementation options can be found.
1
Introduction
The constantly decreasing prices of computer hardware have made the building of ever larger computer systems more attractive. The interconnection between the components like storage media, memory, processing elements and graphics system has consequently migrated from busses to networks. Even inside a single computer the I/O-subsystem will consist of a high-speed network in the near future. The high bandwidth and low latency of modern network switches require an operating mode without or only with small software protocols. To achieve this, the network has to act very reliably. Hence, some of the actions to maintain operation in the presence of faults of some network components (routers or 'links', i.e. wires) have to be performed in the network itself. Toward this aim intensive research has been done. The task of allowing a network to circumnavigate faults consists mainly of two independent problems:
1. Detect the faults and propagate the knowledge about them through the network, and
2. use this information to select fault-free routes for the messages.
In this paper the first problem is considered from a general point of view. The high variety of applications of networks makes a universal router desirable that can be configured for a given problem. In particular, both tasks mentioned have to be implemented in a configurable way. The investigation of hardware structures fulfilling this leads to a description and implementation method for a wide range of routing algorithms [DOLM98]. Though failure of a network component is expected to be rare, the reaction of the network has to be very fast. This is because a new fault leaves the network for some time in an inconsistent state. This problem has been excluded in most algorithms by assuming a separation of the diagnosis phase (item 1 above) and the message transport. Many applications may not allow this procedure, requiring additional methods like transporting affected messages to the nearest node and retransmitting them. The faster the fault state propagation in the network is, the fewer messages will have to be dealt with this way. Hence in this paper the necessary operation is considered with the aim of a hardware implementation.
With respect to the knowledge base needed to provide connectivity in a faulty network, fault-tolerant routing algorithms can be divided into local- and global-information-based. Global-information-based algorithms rest their routing decision upon fault information that contains the location of any faulty component in the network. Hence optimal routing decisions can be made at intermediate nodes in the presence of faults. The problem of those algorithms is to ensure a consistent information base among all nodes of the network. This leads mostly to an off-line re-configuration of the whole system after failure detection. The routing decision of local-information-based algorithms uses only knowledge about the router's own and the neighbors' fault states. This could result in a backtracking for messages trapped in a dead-end [TW98, CS90]. In order to avoid backtracking despite the topological irregularities resulting from faulty components, some algorithms rely on a convex fault model. To achieve this with only local knowledge a router is allowed to pretend a fault although it is fault-free. Each connected set of faults is completed in this way to a convex shape (e.g. rectangular in 2D meshes/tori or cubic in 3D meshes/tori). This transforms the remaining topology to a more regular one by adding fault states over a certain distance and enables messages to bypass the faulty regions on detour routes. Alternative approaches, e.g. [Wu98, CW96, CA95], use limited global information in their routing decision. Beside the local node state, the routing decision uses information from neighbor nodes that allows conclusions about the reachability of even non-adjacent regions in the network. To generate this knowledge, each router calculates its own state dependent on local faults and its neighbors' states. Clearly, it takes some time until all nodes have adjusted their own state to a new fault, and several state exchanges and updates are needed. The larger the set of possible states a router can take is, the better the knowledge about the distribution of faults in the whole network can be. Better knowledge allows tolerating more faults, keeping the performance in presence of faults higher, or reducing the number of nodes that cannot be reached by the routing algorithm (like those in the convex fault model that have to be marked faulty). A larger state set requires more hardware (memory) to store it, bandwidth to exchange it, and processing power to handle it. The processing of state information is the topic of this paper. It appears that no general method or framework dedicated to the fault state propagation has been developed, because routers implemented so far are either done entirely in software, or use no or only one fault tolerance method [AGSY94]. The flexible implementation of the fault state propagation requires special methods due to the strong speed requirements. A large number of routing algorithms working with limited global knowledge reveals the observation that the set of states can be described by the mathematical concept called lattice. Especially the update of a node's state is essentially the computation of the supremum operation in the lattice. This concept is described in section 2. To illustrate the approach, the fault state propagation of two algorithms is used in section 3. Some proposals for a hardware implementation are given in section 4, followed by the conclusion (section 5).
2
Lattices and Fault-Tolerance
The notion of a lattice is well known in algebra for a long time. For different purposes advanced theories have been created. There are two meanings for the term lattice, namely a discrete subset of a vector space (a grid) and a set with a partial ordering. In this paper the second meaning is understood. Fault-tolerant systems are usually modeled as construction of components which are all or partially vulnerable to defects (faults). Hence, the state with respect to the
faults is described by a combination of fault states from the individual components. The individual fault states have different implications for the functionality of the whole system. With respect to this influence and the methods applied in which the system reacts to the state, a partial order can be defined. If some situation is strictly "worse" than another, it has a higher order of "defectness". More precisely, if all consequences for malfunction or poorer performance from the first situation apply to the second and if at least the same actions (repair etc.) have to be taken, then both situations can be compared. Of course there are situations where two different properties of the whole system are affected and where the necessary actions also vary. This notion of "worse" justifies the application of the concept of lattices to the handling of fault states. Since a certain fault state of a certain router is less critical for its neighbors, the transmitted value has to be converted. To incorporate geometric information all neighbors get different information, see for instance figure 3 in section 3. The state of total fault of a node is transformed into North-, South-, East- or West-Failure respectively. These mappings are lattice homomorphisms. Since they are less complex than the generation of the new fault state they are not further considered in this paper.
Definition 1. A lattice is a tuple (M, ∨, ∧) of a set M and two functions ∨, ∧ : M × M → M which fulfill
∀ m, n ∈ M : ∧(m, n) = ∧(n, m), ∨(m, n) = ∨(n, m)
∀ l, m, n ∈ M : ∧(l, ∧(m, n)) = ∧(∧(l, m), n), ∨(l, ∨(m, n)) = ∨(∨(l, m), n)
∀ m, n ∈ M : ∨(m, ∧(m, n)) = m, ∧(m, ∨(m, n)) = m
The two functions are also called supremum sup and infimum inf. The intuitive meaning is that the elements of a set M are partially ordered, where all pairs of elements (m, n) have a unique smallest upper bound ∨(m, n) and greatest lower bound ∧(m, n). In the following only lattices with a finite set M are considered. This is motivated by a system model with a finite number of points of failure. In every finite lattice a unique smallest element ⊥ and a unique largest element ⊤ exist. Furthermore the supremum ∨S and infimum ∧S are well defined for arbitrary subsets S of M. A chain in a lattice C is a sequence of distinct elements m1, ..., mk where ∨(mi, mi+1) = mi+1. It is usual to illustrate partially ordered sets as a Hasse diagram, which is a directed graph (M, E). The edges E in this graph are given by the directly comparable relations:
(m, n) ∈ E ⟺ ∧(m, n) = n and ∀ l ∈ M : (∧(m, l) = l and ∧(l, n) = n) ⟹ l = n
Edges of the Hasse diagram can be viewed as generators of the lattice. The nodes can be labeled (a : M → ℕ) with the distance to the top (⊤), which results in a layered representation. From ∨(m, n) = m it follows that m is in a higher level than n. More generally, a(∧(m, n)) ≥ max(a(m), a(n)). For the application to fault-tolerance a special class of lattices is of interest, lowest-level generated lattices:

Definition 2. A lattice (M, ∨, ∧) with Hasse diagram (M, E) is called lowest-level generated (llg) iff M = {⊥} ∪ G ∪ {∨S | S ⊆ G}, where G = {g ∈ M | (⊥, g) ∈ E}.
From an algebraic point of view this means that the set of points which are immediately larger than the bottom element already generates the whole lattice by the supremum operation, of course except the bottom element. The importance of this class is that it exactly reflects the situation of fault states sketched before. The single faults of components in the system represent the generating set G. The only better situation than two different single faults is a system in order. Furthermore, if the set of faults is the only parameter that induces the fault state of the system, all elements of the lattice have to be generated by the set G. There is a simple criterion whether a lattice is llg or not.
Lemma 1. A lattice $\mathcal{L} = (M, \vee, \wedge)$ with Hasse diagram $(M, E)$ is llg iff the indegree (number of edges to a node) of all elements of $M$ except $G$ and $\bot$ is greater than one:
$\forall x \in (M \setminus G) \setminus \{\bot\}: |\{(y, x) \in E\}| > 1$

Some important examples are given now, where all but the second one are llg:

1. $\mathfrak{B}(n) := (\{\mathrm{True}, \mathrm{False}\}^n, |, \&)$. Here $\&$ denotes the conjunction and $|$ the disjunction, i.e. digit-wise AND respectively OR. The top element is the vector with all digits True and the bottom element the one with all digits False. Clearly, this lattice is llg, since the lowest level consists of all vectors with exactly one digit True. It has $2^n$ elements.

2. $\mathfrak{Z}(n) := (\{0, \ldots, n-1\}, \max, \min)$. Every finite linearly ordered set with $n$ elements is isomorphic to this lattice. If $n > 2$ it is not llg, since the generating set consists of only a single element.

3. $\mathfrak{M}_k(n) := (\{T \subseteq \{1, \ldots, n\} : |T| \le k\} \cup \{\{1, \ldots, n\}\}, \cap, \cup_k)$ where $S_1 \cup_k S_2 := S_1 \cup S_2$ if $|S_1 \cup S_2| \le k$, and $\{1, \ldots, n\}$ otherwise. This is the frequently used max-$k$-faults failure model: more than $k$ faults make the whole system faulty.

Though there has been intensive research on lattices, the only information about their occurrence frequency seems to be found in [Kyu79]. In that paper the number of different lattices with up to 9 elements is determined by algorithmic enumeration and isomorphism checking. We re-implemented this algorithm in order to obtain further information. Surprisingly, some more graphs have been found. The results are given in the following table. It appears that the number of llg lattices is only a small fraction of the lattices of a given size.
Up to $l = 4$ elements in the lattice there are only the Boolean lattices which are llg. Hence, there is no llg lattice with three elements. Though the table does not suggest it, the number of llg lattices may also grow strongly. Any lattice can be made llg by adding some elements. The resulting lattices have too many elements to be found in the table. How many of them are non-isomorphic remains open.
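As a concrete illustration of Definition 2 and Lemma 1, the following Python sketch represents a finite lattice by its Hasse-diagram covering pairs and tests the lowest-level-generated property by checking whether every element other than the bottom and the atoms is the supremum of a subset of atoms. The dictionary/set representation and the helper names are illustrative choices, not taken from the paper.

```python
from itertools import combinations

def order_from_covers(elements, covers):
    """Reflexive-transitive closure of the covering relation: (a, b) in leq iff a <= b."""
    leq = {(a, a) for a in elements} | set(covers)   # covers: set of (lower, upper) pairs
    changed = True
    while changed:                                    # naive transitive closure
        changed = False
        for (a, b) in list(leq):
            for (c, d) in list(leq):
                if b == c and (a, d) not in leq:
                    leq.add((a, d))
                    changed = True
    return leq

def join(xs, elements, leq):
    """Least upper bound of the set xs, or None if it does not exist."""
    uppers = [u for u in elements if all((x, u) in leq for x in xs)]
    least = [u for u in uppers if all((u, v) in leq for v in uppers)]
    return least[0] if least else None

def is_llg(elements, covers):
    """Definition 2: every element is the bottom, an atom, or a supremum of atoms."""
    leq = order_from_covers(elements, covers)
    bottom = next(e for e in elements if all((e, x) in leq for x in elements))
    atoms = {u for (lo, u) in covers if lo == bottom}          # generating set G
    generated = {bottom} | atoms
    for r in range(1, len(atoms) + 1):
        for subset in combinations(atoms, r):
            generated.add(join(set(subset), elements, leq))
    return generated == set(elements)

# Example: the max-1-fault lattice M_1(2) (a diamond) is llg,
# while a three-element chain 0 < 1 < 2 is not.
square = ({"0", "a", "b", "1"}, {("0", "a"), ("0", "b"), ("a", "1"), ("b", "1")})
chain3 = ({"0", "1", "2"}, {("0", "1"), ("1", "2")})
print(is_llg(*square))   # True
print(is_llg(*chain3))   # False
```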
3 Application to Selected Fault-Tolerant Routing Algorithms
By exchanging status information with its neighbors, a router node determines whether it is part of a faulty region. Figure 1 illustrates the lattice for routing algorithms which allow the overlapping of detour routes on the border of fault regions. The router is a neighbor node of a fault region if it detects only one of its neighbors to be faulty or in the state $\top$. A combination of two faults in the same dimension leads to a state that indicates the router to be adjacent to distinct fault regions. The case that two or more failures are detected in different dimensions results in the supremum. This means that the node itself is part of a fault region and reaches the state $\top$. The approach of convex fault regions offers a simple solution for low-dimensional topologies. In higher-dimensional topologies the link redundancy enables
Fig. 1. Fault-lattice for convex regions in a 2D-mesh (top element: router itself is part of a fault region; bottom element: router is neither part nor neighbor of any fault region).
fault-tolerance concepts that allow the avoidance of detours in advance. As an example, the formation of a fault lattice in hypercubes according to the concept of [Wu98] is used. Within this approach the fault information is captured in a safety vector of n bits. A node where the k-th bit of the safety vector is set guarantees message transportation on a minimal path for all destinations with Hamming distance k. To calculate the vector within each node, an information exchange of n − 1 rounds is necessary. Based on a topological property of the binary hypercube, the k-th bit of the safety vector (denoted S in the following) is determined from the (k − 1)-th bit of the neighbors' safety vectors. The first bit of each safety vector is initialized with respect to the node's own state. Figure 2 shows the resulting lattice for determining the second bit of a safety vector in a fault-free node. In order to concentrate on the concept, a hypercube with only three dimensions has been chosen.
Fig. 2. Fault-lattice for determination of safety vectors 2nd Bit.
For calculating the second bit of the safety vector only the first bit of the neighbors' vectors (0, ...) or (1, ...) is essential. The safety vector of the i-th neighbor is indicated with the index i. As neighbors' safety vectors with the first bit set to one do not change the status of the own safety vector, these transitions are dispensable for the fault-lattice. After receiving one safety vector containing a zero in its first bit, the resulting vector does not differ from its initial state. Only the receipt of two
or three safety vectors whose first bit is set to zero results in the second bit of the node's own safety vector being unset.
Fig. 3. Fault-lattice for determination of a safety vector's 3rd Bit.
Calculation of the third safety bit in Figure 3 is similar to that of the second one, except that the result is zero only if all adjacent nodes have set their second bit to zero. Many other routing algorithms can be treated this way, but as the resulting lattices are larger, an abstract approach to their description is needed which allows automated processing. The rule-based approach presented in [DOLM98] serves this purpose.
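To make the round-based computation concrete, the following Python sketch computes safety vectors for a small hypercube. The update rule used here (bit k is set iff at least n − k + 1 neighbors have bit k − 1 set) is an assumption chosen to reproduce the behaviour described above for the 3-cube; the exact rule is defined in [Wu98].

```python
def safety_vectors(n, faulty):
    """Round-based safety-vector computation on an n-dimensional hypercube.

    faulty: set of node ids (0 .. 2**n - 1) that are faulty.
    Returns a dict node -> list of n bits (index 0 = first bit).
    """
    nodes = range(2 ** n)
    neighbors = {v: [v ^ (1 << d) for d in range(n)] for v in nodes}

    # First bit: initialised from the node's own state (1 = node is fault-free).
    s = {v: [0 if v in faulty else 1] + [0] * (n - 1) for v in nodes}

    # n - 1 rounds: bit k of a node is derived from bit k-1 of its neighbours.
    # Assumed rule: bit k is set iff at least n - k + 1 neighbours have bit k-1 set.
    for k in range(1, n):                      # k is the 0-based index of bit k+1
        for v in nodes:
            if v in faulty:
                continue
            have_prev = sum(s[u][k - 1] for u in neighbors[v])
            s[v][k] = 1 if have_prev >= n - k else 0   # n - (k+1) + 1 = n - k
    return s

# 3-cube example with node 0 faulty.
print(safety_vectors(3, faulty={0}))
```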
4 Implementation
In this section the configurable implementation of finite lattices $\mathcal{L} = (L, \vee, \wedge)$ in hardware is considered. The description of the implementation is scalable with respect to the size of the lattice $l = |L|$. Furthermore, only the implementation of one function, say $\vee: L \times L \to L$, is discussed, because this is sufficient for fault-tolerant routing algorithms. All proposed implementations can easily be extended to also include $\wedge$. Another assumption is that the encoding of the elements of $L$ into a bit vector can be freely chosen. Only few results seem to be known about the complexity of implementing the supremum in lattices. For the two trivial cases – the Boolean lattice and sets of integers – the implementation is straightforward: $n$-bit OR and MAX respectively. Both need $O(\log l)$ gates and $O(1)$ respectively $O(\log\log l)$ depth. Layout questions are widely discussed since both are frequently used functions. However, the two circuits are quite different and there is no obvious configurable super-circuit which could implement both.
4.1 Simple Table-Based Method
A reference design is given by a lookup table with $l^2$ entries. It has an area of $O(l^2 \log l)$ and a depth of $O(\log l)$. Reflexivity ($\vee(x,x) = x$) and commutativity
($\vee(x,y) = \vee(y,x)$) can be used to reduce the size of the table $t$. For the latter, any order $<$ can be used to "reflect on the matrix diagonal":
$$\vee(x,y) = \begin{cases} x & \text{if } x = y \\ t(x, y) & \text{if } x > y \text{ and } l/2 > y \\ t(l - x, l - y) & \text{if } x > y \text{ and } l/2 \le y \\ t(y, x) & \text{if } y > x \text{ and } l/2 > x \\ t(l - y, l - x) & \text{otherwise} \end{cases}$$
The indexing of the table $t$ assumes that the encoding of the elements uses a continuous sequence of binary encoded integers. In this way, the table $t$ has dimensions $0, \ldots, l-1$ for the first and $0, \ldots, \lceil (l-1)/2 \rceil$ for the second index. How these two indices can be combined into a single address for a memory in a chip is a more general topic beyond the scope of this paper. The resulting circuit is dominated by the $l(l-1)/2 \cdot \lceil \log l \rceil$ memory cells. In a limited fan-in model the depth of this circuit is $O(\log\log l)$ for the comparisons and the memory addressing. Since the order relation of $\mathcal{L}$ can be represented as an adjacency matrix, $l(l-1)/2$ memory bits would be sufficient. However, the computation of the $\vee$ function is rather difficult in such a representation. Since it does not yet seem certain that the asymptotic number of lattices for a given $l$ is considerably smaller than $O(c^{l^2})$, this might already be an asymptotically optimal implementation up to constant factors. Hence, in the following only implementations which do not necessarily cover all lattices for a given $l$ are considered.
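As a software analogue of this table-based method, the following Python sketch stores only the entries with x > y (exploiting reflexivity and commutativity) and answers sup queries by ordering the arguments before the lookup; the further folding of the second index described above is omitted for clarity. The class name and dictionary representation are illustrative, not from the paper.

```python
class TableSup:
    """Supremum of a finite lattice via a half-size lookup table.

    Elements are encoded as integers 0 .. l-1; sup_pairs maps (x, y) with
    x > y to their supremum, so only l*(l-1)/2 entries are stored.
    """
    def __init__(self, sup_pairs):
        self.t = dict(sup_pairs)

    def sup(self, x, y):
        if x == y:                 # reflexivity: no table entry needed
            return x
        if x < y:                  # commutativity: reflect on the diagonal
            x, y = y, x
        return self.t[(x, y)]

# Example: the four-element lattice bottom=0, a=1, b=2, top=3 (a, b incomparable).
lat = TableSup({(1, 0): 1, (2, 0): 2, (3, 0): 3,
                (2, 1): 3, (3, 1): 3, (3, 2): 3})
print(lat.sup(1, 2))   # 3  (a v b = top)
print(lat.sup(0, 2))   # 2
```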
4.2 Implementation with Boolean Lattice
The Boolean lattice is universal, which means that every lattice can be represented as a sub-lattice of a Boolean lattice. However, this can be quite inefficient: the chain $\mathfrak{Z}(n)$ with $n$ elements would have to be implemented by the Boolean lattice with $2^{n-1}$ elements. Hence, a reduction is desirable for an implementation on the basis of $\mathfrak{B}(l-1)$. For most lattices there already exists an embedding into a smaller Boolean lattice $\mathfrak{B}(m)$. The circuit consists of three parts:
1. Two encoders which realize an injective mapping $\iota: L \to \mathfrak{B}(m)$ from $L$ into the Boolean lattice of $m$ bits (configurable).
2. The circuit for $\vee$ in the Boolean lattice ($m$ OR gates).
3. A decoder $\rho: \mathfrak{B}(m) \to L$ which maps back to elements of $L$ (also configurable).

Since the correspondence of the top and bottom elements is given, it is clear that $m = l - 1$ suffices if all possible lattices with $l$ elements shall be implementable. For doing this, the digits of the vectors in $\mathfrak{B}(l-1)$ are indexed with the elements of $\mathcal{L}$ except $\top$. The image $\iota(x)$ of an element $x$ of $\mathcal{L}$ is the vector in which the digit $d_y$ is True iff $x \vee y = y$, i.e. for all elements greater than or equal to $x$ in $\mathcal{L}$. Hence conjunction results again in an element of $\iota(M)$. The encoding of the elements of $M$ can be chosen in a way that minimizes the size of the encoder circuit. Since the codes for the top and bottom elements are fixed, there are $l - 2$ codes to be chosen. One further element can be fixed: there has to be at least one element $s$ in $M$ which is immediately larger than bottom, i.e. no other element but bottom is smaller than $s$. Hence, this element can be mapped to the vector $(1, 0, \ldots, 0)$ in $\mathfrak{B}(m)$. Looking at the two lattices $\mathfrak{Z}(n)$ and $\mathfrak{M}_1(n)$ it is obvious that there is no further similar simplification, since in the first case all other elements are comparable to each other while in the second example no two of the remaining elements are comparable.
The implementation of $\rho$ uses a standard circuit, namely a priority encoder. This is possible because every element of $M \setminus \{\top\}$ corresponds to a digit in $\mathfrak{B}(l-1)$. The order of $M$ induces an order on the digits. Finding the right element in $\iota(M)$ is identical with finding the lowest digit in the resulting vector which is True. In consequence the decoder (consisting of a priority encoder) does not need to be configurable. Since every partial order can be embedded in a total order, this is always possible. Because the bottom element has no corresponding digit, the highest input of the priority encoder (of $l$ bits) is fed with constant True, reflecting that $\top$ is larger than any element of $M$. A similar simplification can be done for $\bot$ if $l > 2$. In this case the test for the result $\vee(\bot, \bot)$ saves another bit per vector. It remains to discuss options for the implementation of $\iota$. If two tables are used (with diagonal reflection), the resulting circuit needs $(l-2)(l-3)/2$ memory bits. The memory has to be dual-ported, that is, the multiplexers for output selection are needed twice. However, for many applications, including fault-tolerant routing algorithms, a much smaller $m$ can be sufficient because typical applications have short chains. Especially llg lattices can be embedded in $\mathfrak{B}(g)$, where $g$ is the number of generating elements. The interesting question is the implementation of the encoder and decoder circuit. For instance, the top and bottom elements of the lattice can be assigned fixed codes, removing any need for configuration. Though the encoder occurs twice in the design, the problematic part is the decoder because it has to map from the much larger $\mathfrak{B}(m)$ onto $L$. The mapping has to be consistent with the order of $L$, i.e. it has to be order-preserving. In most cases $C = \vee(\iota(a), \iota(b))$ will be no element of $\iota(M)$ for arbitrary $a, b \in M$. More precisely, if $C$ is in $\iota(M)$ for all pairs $a, b$ then $\iota$ is a lattice homomorphism, which is an unnecessarily strong restriction for computing $\vee$ alone. Hence, in the more general case the implementation of $\rho(C)$ has to find the smallest element of $\iota(M)$ which is larger than $C$ with respect to the order in $\mathfrak{B}(m)$. For an efficient implementation $\iota$ has to be chosen in a way which makes this step easy. Fortunately there are only the two restrictions of monotonicity and injectivity.
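The following Python sketch illustrates one way to realize such an encoder/decoder pair in software for an llg lattice: each element is encoded by the set of generators below it (a monotone, injective map into $\mathfrak{B}(g)$), the supremum is computed digit-wise by OR, and the decoder returns the smallest lattice element whose code covers the result. The concrete encoding and the dictionary-based decoder are illustrative assumptions; the hardware version described above uses configurable encoders and a priority encoder instead.

```python
class BooleanEmbeddingSup:
    """Supremum via an embedding into the Boolean lattice B(g).

    codes: dict mapping each lattice element to a frozenset of generators
    below it (monotone and injective for llg lattices).
    """
    def __init__(self, codes):
        self.codes = dict(codes)

    def sup(self, a, b):
        c = self.codes[a] | self.codes[b]          # OR in the Boolean lattice
        # Decoder: smallest element whose code contains c (exists for llg lattices).
        candidates = [x for x, cx in self.codes.items() if cx >= c]
        return min(candidates, key=lambda x: len(self.codes[x]))

# Example: max-1-fault lattice M_1(3) with generators {1}, {2}, {3}.
codes = {
    "ok":   frozenset(),
    "f1":   frozenset({1}),
    "f2":   frozenset({2}),
    "f3":   frozenset({3}),
    "fail": frozenset({1, 2, 3}),   # top: more than one fault
}
lat = BooleanEmbeddingSup(codes)
print(lat.sup("f1", "f2"))   # fail
print(lat.sup("ok", "f3"))   # f3
```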
4.3 Hybrid Implementations
Since lattices are a category of universal algebras, constructions like direct products are available. These constructions carry over to the implementation: the supremum in a direct product of lattices can be computed by independent calculations in its factors. A direct product represents the combined state of two independent systems. A more interesting option is the identification of sub-lattices, because this occurs much more frequently. In this context two different ways of embedding a sub-lattice are distinguished. A sub-lattice $\mathfrak{A} = (A, \vee, \wedge)$ is said to be included point-like in a lattice $\mathcal{L}$ if the only edges in the Hasse diagram to nodes outside the sub-lattice are adjacent to $\mathfrak{A}$'s top and bottom elements. Formally:
$\forall x \in M \setminus A, \forall a \in A: \vee(a, x) = \vee(\top_A, x) \text{ and } \wedge(a, x) = \wedge(\bot_A, x)$
In Figure 4 two sub-lattices are shaded, where only one (denoted "Lattice B") is embedded point-like. At least small point-like sub-lattices can be found in every lattice. Especially with respect to fault-tolerance it can be expected that many typical point-like included sub-lattices occur. This is interesting from an implementation point of view because optimized sub-circuits for the special sub-lattice can be used. To allow more lattices to exploit these fixed implementations by adding only a small amount of hardware, some selected points with edges out of the sub-lattice can be permitted.
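A direct way to test this condition is to compare, for every outside element, the joins and meets against those of the sub-lattice's top and bottom. The Python sketch below does exactly that; the sup and inf functions are assumed to be given (e.g. as table lookups such as TableSup.sup above), and the function name is an illustrative choice.

```python
def is_point_like(A, M, sup, inf):
    """Check whether the sub-lattice A is included point-like in the lattice M.

    A, M : collections of elements (A a subset of M); sup(x, y) and inf(x, y)
    are the lattice operations of M (assumed given, e.g. as table lookups).
    """
    top_a = max(A, key=lambda a: sum(1 for b in A if sup(a, b) == a))   # top of A
    bot_a = max(A, key=lambda a: sum(1 for b in A if inf(a, b) == a))   # bottom of A
    return all(sup(a, x) == sup(top_a, x) and inf(a, x) == inf(bot_a, x)
               for x in M if x not in A
               for a in A)
```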
For large lattices with a sufficient fraction of typical sub-lattices, a configurable implementation consisting of two parts suggests itself:
– A pool of circuits for the implementation of typical sub-lattices, e.g. $\mathfrak{M}_k(n)$, $\mathfrak{B}(n)$, etc. These components can be configured only in a very restricted way, e.g. the constants $n$, $k$ and the selection of some special elements.
– An implementation of the "super-lattice" by a universal method like those described before.

A configured implementation of the example lattice can be seen at the right in Figure 4. It is interesting to observe that the resulting architecture strongly resembles the implementation approach for the whole routing algorithm given in [DOL98].
Fig. 4. A composed lattice and its implementation (the super-lattice is realized with the table-based method).
5 Conclusion
In this paper it has been shown that the fault-state propagation of many routing algorithms can be described by a lattice. A special class of lattices has been identified which is especially important for this application. In a discussion of implementation methods some VLSI architectures have been proposed. Of course, the method can also be used in a software implementation. In the latter context it is interesting to look for special machine instructions for a network processor to support this kind of problem. Since most routing algorithms do not provide complexity measures for the state propagation, the overhead implied by the interpretation as a lattice cannot be quantified directly. Conversely, properties of the lattice (largest point-like sub-lattice, number of chains, level sizes etc.) can be used to judge various approaches. The main disadvantage of the approach in comparison to an ad-hoc solution is that the sequence of arriving fault information can hardly be exploited. On the other hand this gives robustness when some of the exchanged messages get lost. Furthermore, different routing algorithms may profit from the placement of the "reducing homomorphism" which transforms the state of one node into an appropriate value for the neighbor's state calculation. Further questions remaining open are:
– the asymptotic frequency of lattices and of lowest-level generated lattices;
– For networks with some hundreds of nodes, routing algorithms with global knowledge can still be practical. Does the better behavior justify the higher demand on bandwidth, memory and processing effort in contrast to limited-global routing algorithms? Furthermore, routing algorithms with global knowledge tend to do a lot of redundant calculations.
– For many topologies only a few schemes of reduced fault-state representation are known. More detailed methods like the one in NAFTA ([CA95]) could not be found in the literature for many important topologies, like the star graph.
The common framework of lattices can be integrated into tools for the evaluation, simulation and optimization of routing algorithms and routers. This is simplified by using results, algorithms and tools from algebra. In the project RuBIN ("Rule Based Intelligent Networks") the authors have implemented a translation tool from a high-level description method onto dedicated hardware structures. These hardware structures are implemented in an FPGA-based prototype using 1 million gates of programmable logic [Xil98] and Myrinet link technology [BCF+95].
References

[AGSY94] James D. Allen, Patrick T. Gaughan, David E. Schimmel, and Sudhakar Yalamanchili. Ariadne – an adaptive router for fault-tolerant multicomputers. In Proceedings of the 21st Annual International Symposium on Computer Architecture, pages 278–288, Chicago, Illinois, April 18–21, 1994. IEEE Computer Society TCCA and ACM SIGARCH.
[BCF+95] Nanette J. Boden, Danny Cohen, Robert E. Felderman, Alan E. Kulawik, Charles L. Seitz, Jakov N. Seizovic, and Wen-King Su. Myrinet: A gigabit-per-second local-area network. IEEE Micro, 15(1), February 1995.
[CA95] Chris M. Cunningham and Dimiter R. Avresky. Fault-Tolerant Adaptive Routing for Two-Dimensional Meshes. In Proceedings of the First International Symposium on High-Performance Computer Architecture, pages 122–131, Raleigh, North Carolina, January 1995. IEEE Computer Society.
[CS90] Ming-Syan Chen and Kang G. Shin. Depth-First Search Approach for Fault-Tolerant Routing in Hypercube Multicomputers. IEEE Transactions on Parallel and Distributed Systems, 1(2):152–159, April 1990.
[CW96] Ge-Ming Chiu and Shui-Pan Wu. A Fault-Tolerant Routing Strategy in Hypercube Multicomputers. IEEE Transactions on Computers, 45(2):143–155, February 1996.
[DOL98] A. C. Doering, W. Obeloer, and G. Lustig. Programming and Implementation of Reconfigurable Routers. In Proc. Field Programmable Logic and Applications, 8th International Workshop, FPL '98, volume 1482 of Lecture Notes in Computer Science, pages 500–504, 1998.
[DOLM98] Andreas C. Doering, Wolfgang Obeloer, Gunther Lustig, and Erik Maehle. A Flexible Approach for a Fault-Tolerant Router. In Parallel and Distributed Processing – Proceedings of 10 IPPS/SPDP '98 Workshops, volume 1388 of Lecture Notes in Computer Science, pages 693–713, 1998.
[Kyu79] Shoji Kyuno. An Inductive Algorithm to Construct Finite Lattices. Mathematics of Computation, 33(145):409–421, January 1979.
[TW98] Ming-Jer Tsai and Sheng-De Wang. A fully adaptive routing algorithm. IEEE Transactions on Parallel and Distributed Systems, 9(2):163–174, February 1998.
[Wu98] Jie Wu. Adaptive Fault-Tolerant Routing in Cube-Based Multicomputers Using Safety Vectors. IEEE Transactions on Parallel and Distributed Systems, 9(4):321–334, April 1998.
[Xil98] Xilinx, Inc. XC4000XV Family Field Programmable Gate Arrays. Data Sheet, May 1998.
Building a Reliable Message Delivery System Using the CORBA Event Service

Srinivasan Ramani1, Balakrishnan Dasarathy2, and Kishor S. Trivedi1

1 Center for Advanced Computing and Communication, Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708-0291, USA
{sramani, kst}@ee.duke.edu
2 Telcordia Technologies, 445 South Street, Morristown, NJ 07960, USA
[email protected]

Abstract. In this paper we study the suitability of the CORBA Event Service as a reliable message delivery mechanism. We first show that products built to the CORBA Event Service specification will not guarantee against loss of messages or guarantee order. This is not surprising, as the CORBA Event Service specification does not deal with Quality of Service (QoS) and monitoring issues. The CORBA Notification Service, although it provides many of the QoS features, may not be an option. Therefore, we examine application-level reliability schemes to build a reliable communication means over the existing CORBA Event Service. Our end-to-end reliability schemes are applicable to management applications where state resynchronization is possible and sufficient. The reliability schemes proposed provide resilience in the face of failures of the supplier, consumer, and the Event Service processes.
1 Introduction

CORBA [1], spearheaded by the Object Management Group (OMG), provides a basis for portable distributed object-oriented computing applications [2] and is a marriage of the object-oriented paradigm with a client/server architecture [3]. Before the advent of CORBA, clients had to resort to sockets or remote procedure calls such as those provided by DCE or Sun RPC to communicate with objects on remote servers [6, 7]. CORBA abstracts away the complexities of distributed object communication between clients and servers. The Event Service, one of the Common Object Services (COS) [1] provided by CORBA, is intended to support de-coupled, asynchronous communication among objects. For many applications, a synchronous communications model will not scale well. An elegant solution for these applications is to have the server push messages into a "channel" and let any interested client connect to the channel and pick up the messages. In Event Service parlance, the server is called a supplier, the clients are called consumers, and the channel is called the event channel.
The Event Service, however, is not reliable. This is because the CORBA Event Service specification does not deal with QoS (Quality of Service) or run-time management issues. There exist applications, such as Telcordia's mission-critical network management system, for which reliable, in-order message delivery is important. In addition, these applications should also be resilient to failures of the supplier, consumer and the Event Service itself. In an attempt to remedy this situation, OMG has come out with the Notification Service specification [5] that explicitly deals with QoS issues. But the Notification Service implementation is not widely available for all ORBs (Object Request Brokers), and there is no commitment from vendors to make an implementation available for several platforms. Moreover, the Notification Service does not guarantee against loss of messages, as there is no proactive monitoring. Fault tolerance via replicated objects, such as in [4], provides some resilience in the face of event channel failures. In this approach, suppliers retain a copy of messages pushed into the event channel in a backup queue. But this scheme could still drop messages, as it does not prevent queue overflows. Also the use of ORB-specific APIs prevents the adaptability of the scheme to other ORBs.
2 Log files and retry policies – Are they adequate?

Vendors have sought to provide resilience by providing additional features in their Event Service implementations. These include a retry mechanism that uses a log file to deal with failed message deliveries or messages displaced from overflowing queues. The ordering of messages is no longer guaranteed. This scheme is also not guaranteed to work, as the log files might reach their maximum size limit. There is no programming or administrative interface to monitor the queues. Some vendors have mechanisms to deal with object failures (for example, the Visibroker [8] ORB has an Object Activation Daemon to provide some resiliency). But this is not a CORBA standard and hence implementation-dependent.
3 Application-level reliability mechanism to provide resilience

In an experiment that we set up, we found that the maximum supplier throughput is 49.9 messages per second if no message loss is to occur. Having identified the lack of adequate guarantees in the CORBA Event Service specification, we investigated application-level reliability mechanisms at the supplier and the consumer.

3.1 Model for reliability: Resynchronization

Many applications for which the Event Service is applicable need to propagate messages that resemble status updates. If and when messages are detected to be missing, what is required is not the history of changes but the current status. The same is true after failures of the supplier, consumer or the Event Service lead to message
loss. Our reliability scheme, illustrated in Figure 1, exploits this property and is based on resynchronization of states.
Fig. 1. Scheme to provide resilience to the CORBA Event Service. (The supplier exports its resynchronization service and pushes messages into the event channel, which passes them on to the consumer; the consumer looks up the resynchronization service via the trader and sends a resynchronization request on startup, on a dropped message, or after a failed ping; a consumer-side timer drives a ping of the event channel, and a daemon pings the Event Service.)
With this scheme, no log/retry mechanism of the Event Service is used, nor is the rebinding of the restarted objects dependent on any implementation-dependent facility. The highlights of our application-level reliability scheme are:
• When the supplier connects to the event channel for the first time or is restarted after a failure, it does a resynchronization to bring all the consumers up to date. The supplier puts some sequencing information in each message.
• The supplier provides a "resynchronization request service" that it announces through the Trader Service1. This service is to be used by consumers to send a resynchronization request to the supplier whenever required.
• When a consumer connects to the event channel – either for the first time or when it comes back up after a failure – it requests a resynchronization.
• When the consumer notices that a message has been lost (using the sequence information), it requests a resynchronization.
• A daemon "pings" the Event Service periodically and restarts it if necessary.
We now explain how our scheme deals with failure scenarios.
Supplier failure: When the supplier reboots, the recovery is simple – it simply resynchronizes.
Consumer failure: When a consumer goes down, because of the de-coupled nature of the link between the supplier and the consumer, the supplier remains unaware of the failure. When the consumer comes up again, it reconnects to the event channel and requests a resynchronization. A consumer may, of course, choose not to act on the status updates if it already has the latest status.
1 The Trader Service (one of the COS) acts as the "yellow pages" for object services. The provider of a service exports the location and description of the service to a "trader". A client requiring a service can look up the trader by supplying the description.
Event Service failure: To detect the failure of the Event Service, a daemon is used to “ping” the Event Service periodically. If the daemon notices that the Event Service has died, it restarts the Event Service. In a “push” model, a publisher notices that the event channel it is connected to has died when it tries to push a message, and tries to rebind with the new Event Service using the trader. The consumer can be unaware of the failure forever since in the push model, the Event Service initiates a message transfer. In our scheme, we make the consumer “ping” the event channel whenever a timer times out. The timer restarts whenever a message reaches the consumer. If the ping operation fails, the consumer tries to rebind to the event channel. The resynchronization strategy also obviates the need for mechanisms to deal with messages lost in the event channel. Queue overflow: Finally, if there is a queue overflow, the consumer will detect the dropped message(s) and request a resynchronization.
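The consumer-side logic described above – tracking sequence numbers, requesting resynchronization on gaps, and pinging the event channel when no message arrives before a timeout – can be summarized in a few lines. The sketch below is an illustration of that control flow in Python with placeholder collaborators; it does not use any real CORBA/ORB API, and all names (request, ping, rebind) are hypothetical.

```python
import time

class ReliableConsumer:
    """Application-level reliability on top of an unreliable event channel."""

    def __init__(self, channel, resync_service, ping_timeout=5.0):
        self.channel = channel                  # object with ping()/rebind() (assumed)
        self.resync = resync_service            # looked up via the trader (assumed)
        self.expected_seq = None
        self.ping_timeout = ping_timeout        # seconds without traffic before pinging
        self.last_msg_time = time.monotonic()
        self.resync.request()                   # resynchronize on (re)connect

    def on_message(self, seq, payload):
        self.last_msg_time = time.monotonic()   # restart the "no traffic" timer
        if self.expected_seq is not None and seq != self.expected_seq:
            self.resync.request()               # gap detected: ask for current status
        self.expected_seq = seq + 1
        self.apply(payload)

    def on_timer(self):
        # No message for a while: check that the event channel is still alive.
        if time.monotonic() - self.last_msg_time < self.ping_timeout:
            return
        if not self.channel.ping():
            self.channel.rebind()               # daemon has restarted the Event Service
            self.resync.request()               # channel failure may have lost messages

    def apply(self, payload):
        """Apply a status update; stale updates can simply be ignored."""
        pass
```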
4 Effectiveness of the reliability mechanism – Experiments

In our earlier experiments, we determined that a supplier throughput of 49.9 messages per second or lower was needed to avoid message loss. We use this rough estimate while doing a resynchronization. Of course, if a resynchronization message were dropped, then the consumer would request a new resynchronization. The results in Table 1 report the mean and standard deviation for each measure, obtained by repeating each experiment 25 times.

Table 1. Effect of Supplier Speed on Resynchronization Overhead
Supplier throughput (#messages per sec)        % of normal messages       #resynchronization messages pushed
Effective throughput      Normal messages      received by consumer       per 10000 normal messages
Mean       S.Dev          Mean                 Mean       S.Dev           Mean        S.Dev
265.6      29.8           247.8                26.3       10.5            580.9       224.7
 98.1       1.3            96.9                96.1        3.9            122.4       127.0
 49.9       0.0            49.9                99.9        0.1              2.0         5.0

Table 1 shows that when the supplier attempts a throughput higher than 49.90 messages per second, there are several resynchronizations before the supplier finishes pushing 10000 messages. Because of the additional resynchronization messages that are pushed and the overheads associated with the reliability mechanism, it now took longer to push 10000 normal messages. In Table 1, the column titled "normal messages" lists the supplier throughput for normal messages alone (that is, without considering the resynchronization messages). This is calculated as 1 / (total time to push 10000 normal messages (in sec) / 10000). If the supplier throughput is substantially higher than 49.90 messages per second, then it can be seen from the table that a low percentage of these normal messages eventually make it to the consumer. This means that the (slower) resynchronization process is doing most of
the status updates. As the supplier throughput approaches the recommended value, the effective throughput of the supplier matches the throughput of normal messages from the supplier. This is because almost all the normal messages are successfully delivered to the consumers and the resynchronization messages are rarely required. The general guideline for effectively putting into practice our reliability scheme is as follows:
• Determine the supplier throughput that almost always delivers messages without loss in the event channel. Use a much lower value as the throughput for the messages. For the particular application and hardware platform on hand, optimization studies can be done to establish safe and most efficient operating ranges.
5 Summary and Concluding Remarks

In this paper we have studied the suitability of the CORBA Event Service as a reliable message delivery mechanism. To deal with its lack of reliability, we examined application-level schemes to build a reliable communication means over the existing CORBA Event Service. The reliability schemes, which are general enough to be applicable to any Event Service implementation, deal with message losses and also provide resilience in the face of failures of the supplier, consumer, and the Event Service processes. Our end-to-end reliability scheme is suitable for applications for which status resynchronization is possible and sufficient.
References

1. Object Management Group. "The Common Object Request Broker: Architecture and Specification". Revision 2.3, 1998. ftp://ftp.omg.org/pub/docs/formal/98-12-01.pdf.
2. M. Henning and S. Vinoski. Advanced CORBA Programming with C++. Addison Wesley, Reading, Massachusetts, 1999.
3. D. Pedrick, J. Weedon, J. Goldberg and E. Bleifield. Programming with Visibroker – A Developer's Guide to Visibroker for Java. John Wiley & Sons, 1998.
4. T. Luo et al. "A Reliable CORBA-Based Network Management System". Int. Conf. on Communications, ICC99, Vancouver, BC, Canada, June 1999.
5. Object Management Group. "Notification Service: Joint Revised Submission". January 25, 1999. http://www.omg.org/cgi-bin/doc?telecom/98-11-01.pdf.
6. J. Shirley, W. Hu and D. Magid. Guide to Writing DCE Applications. O'Reilly & Associates, Inc., May 1994.
7. W. R. Stevens. UNIX Network Programming, Vol. 1, Second Edition. Prentice-Hall, Upper Saddle River, New Jersey, 1998.
8. Visibroker. http://www.inprise.com/visibroker
Acknowledgements

Our thanks to Mark Segal, Brian Coan, Michael Skurkay, Neal Bickford and Sarah Tisdale of Telcordia Technologies for their support and review of this paper.
Network Survivability Simulation of a Commercially Deployed Dynamic Routing System Protocol

Abdur Chowdhury1,2, Ophir Frieder1, Paul Luse2, Peng-Jun Wan1
{abdur, wan, ophir}@cs.iit.edu, [email protected]
Department of Computer Science, Illinois Institute of Technology1 and IIT Research Institute2

Abstract. With the ever-increasing demands on server applications, many new server services are distributed in nature. We evaluated one hundred deployed systems and found that over a one-year period, thirteen percent of the hardware failures were network related. To provide end-user services, the server clusters must guarantee server-to-server communication in the presence of network failures. In prior work, we described a protocol to provide proactive dynamic routing for server cluster architectures. We now present a network survivability simulation of the Dynamic Routing System (DRS) protocol. We show that with the DRS the probability of success for server-to-server communication converges to 1 as N grows for a fixed number of failures. The DRS's proactive routing policy performs better than traditional routing systems by fixing network problems before they affect application communication.
INTRODUCTION

Traditional supercomputers are becoming scarce and distributed server clusters are becoming the solution of choice. These smaller computers are coupled by networks to achieve the same objective at a substantially lower cost. The Berkeley NOW (Network Of Workstations) project was one of the first projects pushing this solution [2]. PVM (Parallel Virtual Machine) [3] and MPI (Message Passing Interface) [4] libraries provide messaging and synchronization constructs that are needed for distributed parallel computing with NOW solutions. Projects like Beowulf [5] for Linux are continuing the distributed computing approach. All of these approaches have one common resource: the network. While the network is very important, no strong push has been made to provide fault tolerance for network failures in a server cluster solution. We developed a network routing algorithm to provide fault tolerance for server-to-server communication by proactively monitoring network communication links between servers. This is different from reactive routing techniques [6] that wait for a failure to occur and then react by finding an alternative route. Our proactive algorithm constantly looks for errors via continuous ICMP echo requests. When a failure is identified, a new route is selected around the failed portion of the network. This new
route is often found in the time of a TCP retransmit, so server applications are unaware that a network failure has occurred. Our algorithm, the Dynamic Routing System (DRS) [1], improves reliability by providing a second network interface card for each server, thus providing an alternate method of physical communication in the case of hardware failure. The DRS works by frequent link checks between all pairs of nodes to determine if the link between pairs of computers is valid. This algorithm uses the redundant network link between two servers to provide multiple communication channels. When one link fails, the second direct link is checked and used. However, if no link exists, a broadcast is made to identify whether or not some other server is able to act as a router to create a new path between the sender and the proposed recipient. Our algorithm discovers the failure before server-to-server communication is affected. The essential goal of our algorithm is to hide network failures from distributed applications. The DRS was deployed in 27 local voice mail server clusters by MCI WorldCom; each cluster contains between 8 and 12 servers. Thus understanding the reliability supported is not only of theoretical interest but of practical interest as well. In prior work [1] we showed that, over a one-year period, 13% of hardware failures for 100 compute servers were network related, i.e., network interface cards, hubs, etc. This likelihood of failure provides motivation to improve the resilience of server clusters where services need to be guaranteed. We show that, with the DRS, the probability of success of server-to-server communication converges to 1 as N grows for a fixed number of failures. The proactive routing policy of the DRS performs better than traditional routing systems by fixing network problems before they affect the server-to-server communication.
DRS ALGORITHM

RIP [7], OSPF [8], EGP and BGP [9] are routing solutions to many different routing problems; however, they do not address the needs of a high-availability server cluster environment [11]. Their primary goal is to provide routing updates to other routers on the network to find alternative routes to the same network. The general design is based on reactively rerouting when a specified timeout period has been reached. So if a destination network does not respond to a route query after some time quantum, it is considered down and a new route is sought. The DRS works with IP networks, unlike some telecommunication approaches using specialized hardware [10], and improves fault tolerance via proactive failure recognition and the use of a redundant network. Thus, each computer has two network interface cards connected to two separate networks. It is the task of the DRS routing demons to monitor the connections between two servers. If a failure occurs, the demons set up new point-to-point routes around the problem before network applications are aware that a problem occurred. The DRS runs on every node in the server array. Each DRS demon is configured to monitor hosts on the networks and executes a two-stage run process. In the first phase, the communication links between the local host and all other hosts that it has been configured to monitor are checked. These checks are accomplished using
the ICMP (Internet Control Message Protocol) [13] echo request. Host "A" sends an ICMP echo request to host "B" via the first network. If the echo is returned, the DRS can assume that the hub, wiring, network interface card, device driver, network protocol stack and host kernel are operational. The DRS continues to test all known hosts on all known networks in the same manner. Each demon keeps track of which hosts to monitor and the state that they are in (i.e., "up", "down"). If a failure occurs, the DRS demon must determine a new route of communication between host "A" and "B". The DRS demon loops through a cycle of monitoring communication links, answering requests, and fixing problems as they occur, for the life of the server cluster. The DRS algorithm avoids routing loops and other issues involved in distributed routing. For a detailed presentation and proof of correctness see [1].
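The two-phase cycle described above can be sketched as follows. The Python sketch below is an illustration only: the injected callables icmp_ping, install_route, and broadcast_route_query stand in for the real ICMP and routing-table operations, and the actual DRS loop (including its loop-avoidance rules) is specified in [1].

```python
def drs_cycle(monitored_hosts, networks, state,
              icmp_ping, install_route, broadcast_route_query):
    """One monitoring cycle of a DRS demon.

    monitored_hosts: hosts this demon watches; networks: e.g. ("net1", "net2");
    state: dict (host, net) -> "up" / "down"; the remaining arguments are
    placeholder callables for ICMP checks and routing-table changes.
    """
    for host in monitored_hosts:
        for net in networks:
            if icmp_ping(host, via=net):              # phase 1: link check
                state[(host, net)] = "up"
                continue
            state[(host, net)] = "down"               # phase 2: fix the problem
            other = networks[1] if net == networks[0] else networks[0]
            if icmp_ping(host, via=other):
                install_route(dest=host, via=other)   # use the redundant direct link
            else:
                relay = broadcast_route_query(host)   # some node that still reaches host
                if relay is not None:
                    install_route(dest=host, via=relay)
```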
DRS PROACTIVE COST

The DRS's proactive monitoring of network links comes at a cost of network bandwidth. To find errors before they affect network communication, the links must be checked frequently. If the links were not checked frequently, the DRS would become equivalent to a reactive routing protocol. As the number of nodes increases, the bandwidth required to support the frequent checks likewise increases. In Figure 1, we present the maximum number of servers in the cluster that the DRS supports, given a requirement for error resolution in X time units and the percentage of network bandwidth usable by the DRS. As shown in Figure 1, ninety hosts are supported in less than 1 second with only 10% of the bandwidth usage.
Figure 1: 100Mb Network Performance. Response time vs. number of nodes for a 100 Mb/s network (x-axis: number of nodes, 2–252; y-axis: time in seconds, logarithmic scale from 0.00001 to 10; one curve each for 10%, 15%, 20%, and 25% of the bandwidth used by the DRS).
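The shape of these curves follows from a simple back-of-the-envelope model: every node pings every other monitored node each cycle, so the check traffic grows roughly quadratically with the cluster size, and the cycle time (the error-resolution time) is that traffic divided by the bandwidth share given to the DRS. The sketch below computes such an estimate; the packet size, the single-network assumption, and the neglect of processing overhead are illustrative choices of ours, not parameters from the paper.

```python
def drs_cycle_time(nodes, bandwidth_bps=100e6, drs_fraction=0.10,
                   ping_bytes=84, packets_per_check=2):
    """Rough estimate of the time to complete one full DRS check cycle.

    Assumes every node pings every other node once per cycle (nodes*(nodes-1)
    checks), each check exchanging packets_per_check packets of ping_bytes.
    """
    checks = nodes * (nodes - 1)
    bits = checks * packets_per_check * ping_bytes * 8
    return bits / (bandwidth_bps * drs_fraction)

for n in (30, 60, 90):
    print(n, round(drs_cycle_time(n), 3), "s")   # grows roughly with n**2
```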
NETWORK SURVIVABILITY ANALYSIS

We now present a conditional probability model like [12] to quantitatively evaluate networking systems with a given number of network failures occurring at any given instance. This model yields the probability of success, independent of time, of a system with N nodes and f failures. We assume that in a system with N nodes, there are exactly 2N interface connections and two non-meshed backplanes, each with equal probability of failure,
say q, for $0 \le q \le 1$. Therefore, the probability of 2 failures in any system will be $q^2$, the probability of 3 failures will be $q^3$, and the probability of f failures will be $q^f$. It follows that $\lim_{f \to \infty} q^f = 0$. Therefore, the probability of multiple failures in a system decreases exponentially. Now we develop the equation for the probability of success by counting the number of failure combinations that break server-to-server communication in a system with N nodes and f failures. We represent this number by the combinatorial function F(N, f), which is obtained by summing binomial coefficients over the possible placements of the f failures among the 2N interfaces and the two backplanes. Because the total number of combinations of f failures in a networking system is $\binom{2N+2}{f}$, the probability of success can be written as shown in Equation 1:

$$P[\mathrm{Success}] = \frac{\binom{2N+2}{f} - F(N,f)}{\binom{2N+2}{f}} \qquad \text{(Equation 1: Probability of Success)}$$

By graphing Equation 1 for fixed values of f, it is evident that as the number of nodes in a system increases, the probability of that system maintaining a successful connection between any two nodes at any given time will approach 1 using the DRS. More specifically, for f = 2 the P[Success] surpasses 0.99 at 18 nodes, for f = 3 it surpasses 0.99 at 32 nodes, and for f = 4 it surpasses 0.99 at 45 nodes. Given that $\lim_{f \to \infty} q^f = 0$ and that $\lim_{N \to \infty} P[\mathrm{Success}] = 1$, a system implementing the DRS has a high probability of resilience to network failure, as shown in Figure 2.
Figure 2: Convergence of P[Success] to 1 (DRS simulation; y-axis: P[Success] from 0.80 to 1.00; one curve for each number of failures from 2 to 10).
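P[Success] can also be estimated by a small Monte-Carlo experiment of the kind described below. The Python sketch that follows is our illustrative reading of the component model (2N interface cards plus two backplanes; a pair of servers can communicate when they share a network whose backplane and both endpoint interfaces are up, or when the DRS routes through a single relay node); the authors' exact simulator assumptions are not spelled out here.

```python
import random

def direct(a, b, nic_ok, bp_ok):
    """a and b share a network whose backplane and both interfaces are up."""
    return any(bp_ok[net] and nic_ok[(a, net)] and nic_ok[(b, net)] for net in (0, 1))

def pair_connected(a, b, n_nodes, nic_ok, bp_ok):
    """Direct link on either network, or a route through one relay node (DRS)."""
    if direct(a, b, nic_ok, bp_ok):
        return True
    return any(direct(a, r, nic_ok, bp_ok) and direct(r, b, nic_ok, bp_ok)
               for r in range(n_nodes) if r not in (a, b))

def estimate_p_success(n_nodes, f, trials=20000, seed=1):
    """Monte-Carlo estimate of P[Success] for one server pair with f failed components."""
    rng = random.Random(seed)
    components = [("nic", v, net) for v in range(n_nodes) for net in (0, 1)] \
               + [("bp", net) for net in (0, 1)]                  # 2N + 2 components
    ok = 0
    for _ in range(trials):
        failed = set(rng.sample(components, f))
        nic_ok = {(v, net): ("nic", v, net) not in failed
                  for v in range(n_nodes) for net in (0, 1)}
        bp_ok = {net: ("bp", net) not in failed for net in (0, 1)}
        ok += pair_connected(0, 1, n_nodes, nic_ok, bp_ok)
    return ok / trials

print(estimate_p_success(18, f=2))   # roughly 0.99, cf. the analysis above
```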
To validate our probability model, we have developed a computer simulation of a networking system with N nodes and f failures implementing the DRS algorithm. Given a specified number of iterations and a fixed f, the simulation output consists of randomly generated success probability values for f