PARALLEL COMPUTING: Software Technology, Algorithms, Architectures and Applications
ADVANCES IN PARALLEL COMPUTING, VOLUME 13
Series Editor:
Gerhard R. Joubert (Technical University of Clausthal)
Managing Editor
Aquariuslaan 60, 5632 BD Eindhoven, The Netherlands
2004
ELSEVIER
Amsterdam - Boston - Heidelberg - London - New York - Oxford - Paris - San Diego - San Francisco - Singapore - Sydney - Tokyo
PARALLEL COMPUTING: Software Technology, Algorithms, Architectures and Applications
Edited by

G.R. Joubert (Clausthal, Germany)
W.E. Nagel (Dresden, Germany)
F.J. Peters (Eindhoven, The Netherlands)
W.V. Walter (Dresden, Germany)
2004
ELSEVIER
Amsterdam - Boston - Heidelberg - London - New York - Oxford - Paris - San Diego - San Francisco - Singapore - Sydney - Tokyo
ELSEVIER B.V., Sara Burgerhartstraat 25, P.O. Box 211, 1000 AE Amsterdam, The Netherlands
ELSEVIER Inc., 525 B Street, Suite 1900, San Diego, CA 92101-4495, USA
ELSEVIER Ltd, The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, UK
ELSEVIER Ltd, 84 Theobalds Road, London WC1X 8RR, UK
© 2004 Elsevier B.V. All rights reserved.

This work is protected under copyright by Elsevier B.V., and the following terms and conditions apply to its use:

Photocopying
Single photocopies of single chapters may be made for personal use as allowed by national copyright laws. Permission of the Publisher and payment of a fee is required for all other photocopying, including multiple or systematic copying, copying for advertising or promotional purposes, resale, and all forms of document delivery. Special rates are available for educational institutions that wish to make photocopies for non-profit educational classroom use. Permissions may be sought directly from Elsevier's Rights Department in Oxford, UK: phone (+44) 1865 843830, fax (+44) 1865 853333, e-mail: [email protected]. Requests may also be completed on-line via the Elsevier homepage (http://www.elsevier.com/locate/permissions). In the USA, users may clear permissions and make payments through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA; phone: (+1) (978) 7508400, fax: (+1) (978) 7504744, and in the UK through the Copyright Licensing Agency Rapid Clearance Service (CLARCS), 90 Tottenham Court Road, London W1P 0LP, UK; phone: (+44) 20 7631 5555; fax: (+44) 20 7631 5500. Other countries may have a local reprographic rights agency for payments.

Derivative Works
Tables of contents may be reproduced for internal circulation, but permission of the Publisher is required for external resale or distribution of such material. Permission of the Publisher is required for all other derivative works, including compilations and translations.

Electronic Storage or Usage
Permission of the Publisher is required to store or use electronically any material contained in this work, including any chapter or part of a chapter. Except as outlined above, no part of this work may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written permission of the Publisher. Address permissions requests to: Elsevier's Rights Department, at the fax and e-mail addresses noted above.

Notice
No responsibility is assumed by the Publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Because of rapid advances in the medical sciences, in particular, independent verification of diagnoses and drug dosages should be made.
First edition 2004

Library of Congress Cataloging in Publication Data: A catalog record is available from the Library of Congress.
British Library Cataloguing in Publication Data: A catalogue record is available from the British Library.
ISBN: 0-444-51689-1

∞ The paper used in this publication meets the requirements of ANSI/NISO Z39.48-1992 (Permanence of Paper).

Printed in The Netherlands.
PREFACE

Dresden, a city of science and technology, of fine arts and baroque architecture, of education and invention, location of important research institutes and high-tech firms in IT and biotechnology, and gateway between Western and Eastern Europe, attracted 175 scientists for the international conference on parallel computing ParCo2003 from 2 to 5 September 2003. It was the tenth in the biennial ParCo series, the longest running European conference series covering all aspects of parallel and high performance computing. ParCo2003 was once again a milestone in gauging the status quo of research and the state of the art in the development and application of parallel and high performance computing techniques, highlighting both current and future trends.

The conference was hosted by the Center for High Performance Computing (ZHR) of the Technical University of Dresden. Since its foundation in 1828, the TU Dresden has undergone tremendous transformations from engineering school to technical school of higher education to full university. Today, the TU Dresden offers a broad range of subjects and specialisations in a wide variety of fields to about 33000 students. In the tradition of many inventions in the development of mechanical calculators and early computers, the Center for High Performance Computing was founded in 1997 and has played an important role in the development of modern methods and tools to support high performance computing at the university and beyond.

Nowadays, many aspects of parallel computing have become part of mainstream computing. It is now commonplace to buy commodity off-the-shelf computers for home and office use that incorporate parallel techniques such as superscalarity, hyper-threading, VLIW (Very Long Instruction Word) and even cluster technologies that were considered advanced a mere decade ago. Quite apart from the speed with which new parallel technologies find their way into new products, these developments underline the importance of parallel computing research and development for the advancement of computer science and IT in general.

In view of the rapid technology transfer taking place, one could be led to conclude that parallel computing research and development has passed its zenith since it has become standard computing practice. ParCo2003 showed that such a conclusion is invalid and that many complex research issues remain to be investigated. Thus it is clear - and this has been the case for a number of years - that future research in parallel computing will have to concentrate increasingly on all aspects of software engineering. In addition, the development of new architectures, especially those based on new technologies such as nanotechnologies and biocomputing, as well as improved methods for performance evaluation, advanced algorithms, etc., must continue to receive appropriate attention.
Historical aspects and current trends were highlighted by three invited talks:
- Friedel Hoßfeld (Germany): Parallel Machines and the "Digital Brain" - An Intricate Extrapolation on Occasion of JvN's 100-th Birthday
- Manfred Zorn (USA): Computational Challenges in the Genomics Era
- Charles D. Hansen (USA): High-Performance Visualisation: So Much Data, So Little Time
Furthermore, there were more than 80 contributed papers broadly grouped under three main topics: Applications, Algorithms, and Software & Technology. Most of the papers received three and sometimes even four reviews, and we want to thank all members of the Programme Committee for their diligent work.

In contrast to previous ParCo conferences, this conference put a strong emphasis on minisymposia, which constituted two entire parallel tracks of the conference. The topics of the seven minisymposia and their organisers were:

- Grid Computing: Franz-Josef Pfreundt
- Bioinformatics: Manfred Zorn and Craig Stewart
- Performance Analysis: Allen Malony
- OpenMP: Barbara Chapman
- Parallel Applications: Tor Sorevik
- Cluster Computing: Anne C. Elster
- Mobile Agents: Alessandro Genco

We wish to thank the organisers of these minisymposia for their tremendous work and support in attracting excellent speakers and in reviewing the papers.

The invited speakers, authors of contributed papers and participants in the industrial session as well as in the various minisymposia highlighted many challenging application areas of parallel computing, such as bioinformatics and genomics, visualisation, image and video processing, modelling and simulation, mobile agents and data mining. Due to the complexity of the problems encountered, parallel computing paradigms often provide the only feasible approach. The overall picture conveyed by the conference was thus one of consolidation of parallel computing technologies and transfer of these into off-the-shelf products on the one hand, and of emerging new areas of research and development on the other.

The editors are greatly indebted to the members of the International Programme Committee as well as of the Steering, Organising, Finance and Exhibition Committees for the time they spent in making this conference such a successful event. Many thanks are due to the staff of the Center for High Performance Computing for their enthusiastic support. This was a major factor in making this conference a great success. Our special thanks go to Claudia Schmidt (general organisation), Heike Jagode (overall design), Thomas Blümel (web administration), Stefan Pflüger (exhibition), and Guido Juckeland (proceedings).
Gerhard R. Joubert (Germany)
Wolfgang E. Nagel (Germany)
Frans J. Peters (Netherlands)
Wolfgang V. Walter (Germany)

February 2004
SPONSORS
AMD GmbH
Cray Computer Deutschland GmbH
Hewlett-Packard GmbH, Geschäftsstelle Berlin
IBM Deutschland GmbH, Fachb. Lehre u. Forschung
Megware Computer GmbH
NEC High Performance Computing Europe GmbH
Pallas GmbH
Silicon Graphics GmbH
SUN Microsystems GmbH
EXHIBITORS / PARTICIPANTS IN THE INDUSTRIAL TRACK
Cray Computer Deutschland GmbH
Hewlett-Packard GmbH, Geschäftsstelle Berlin
IBM Deutschland GmbH, Fachb. Lehre u. Forschung
Megware Computer GmbH
NEC High Performance Computing Europe GmbH
Silicon Graphics GmbH
SUN Microsystems GmbH
CONFERENCE COMMITTEE
Gerhard R. Joubert (Conference Chair, Germany/Netherlands) Wolfgang E. Nagel (Germany) Frans J. Peters (Netherlands) Wolfgang V. Walter (Germany)
STEERING COMMITTEE Frans J. Peters (Chair, Netherlands) Friedel Hossfeld (Germany) Paul Messina (USA) Masaaki Shimasaki (Japan) Denis Trystram (France) Marco Vanneschi (Italy)
ORGANISING COMMITTEE Wolfgang V. Walter (Chair, Germany) Thomas Blümel (Germany) Uwe Fladrich (Germany) Heike Jagode (Germany) Guido Juckeland (Germany) Claudia Schmidt (Germany) Bernd Trenkler (Germany) Andrea Walther (Germany)
EXHIBITION SUB-COMMITTEE Stefan Pflüger (Chair, Germany) Norbert Attig (Germany) Hubert Busch (Germany) Wolf-Dietrich Harz (Germany) Matthias Müller (Germany)
FINANCE COMMITTEE Frans J. Peters (Chair, Netherlands)
PROGRAMME COMMITTEE Wolfgang E. Nagel (Chair, Germany) Nikolaus Adams (Germany) Hamid Arabnia (USA) Norbert Attig (Germany) Eduard Ayguadé (Spain) Achim Basermann (Germany) Christian Bischof (Germany) Petter E. Bjørstad (Norway) Arndt Bode (Germany) Thomas Brandes (Germany) Mats Brorsson (Sweden) Helmar Burkhart (Switzerland) Barbara Chapman (USA) Michel Cosnard (France) Pasqua D'Ambra (Italy) Luisa D'Amore (Italy) Erik D'Hollander (Belgium) Koen De Bosschere (Belgium) Luiz DeRose (USA) Andreas Deutsch (Germany) Beniamino Di Martino (Italy) Michael Eiermann (Germany) Rüdiger Esser (Germany) Thomas Fahringer (Austria) Afonso Ferreira (France) Salvatore Filippone (Italy) Michael Gerndt (Germany) Lucio Grandinetti (Italy) Andreas Griewank (Germany) John Gurd (UK) Volker Gülzow (Germany) Bianca Habermann (Germany) Rolf Hempel (Germany) Hans-Christian Hoppe (Germany) Lennart Johnsson (USA) Dieter Kranzlmüller (Austria) Norbert Kroll (Germany) Herbert Kuchen (Germany) Keqin Li (USA) Thomas Lippert (Germany) Thomas Ludwig (Germany) Allen Malony (USA)
Tomas Margalef (Spain) Djordje Maric (Switzerland) Federico Massaioli (Italy) Arndt Meyer (Germany) Bernd Mohr (Germany) Almerico Murli (Italy) Per Öster (Sweden) Jean-Louis Pazat (France) Franz-Josef Pfreundt (Germany) Wilfried Philips (Belgium) Michael J. Quinn (USA) Rolf Rabenseifner (Germany) Thomas Rauber (Germany) Alexander Reinefeld (Germany) Michael Resch (Germany) Richard Reuter (Germany) Jean Roman (France) Mathilde Romberg (Germany) Dirk Roose (Belgium) Hanns Ruder (Germany) Gudula Rünger (Germany) Tor Sorevik (Norway) Jens Simon (Germany) Horst Simon (Germany) Henk J. Sips (Netherlands) Erich Strohmaier (USA) Vaidy Sunderam (USA) Mateo Valero (Spain) Marco Vanneschi (Italy) Jeffrey Vetter (USA) Heinrich Voss (Germany) Martin Walker (Switzerland) Wolfgang Walter (Germany) Helmut Weberpals (Germany) Roland Wismüller (Germany) Gabriel Wittum (Germany) Rüdiger Wolff (Germany) Emilio Zapata (Spain) Hans Zima (Austria) Manfred Zorn (USA)
CONTENTS

Preface v
Sponsors, Exhibitors / Participants in the industrial track vii
Committees viii

Invited Papers 1

Parallel Machines and the "Digital Brain" - An Intricate Extrapolation on Occasion of JvN's 100-th Birthday
F. Hossfeld 3

So Much Data, So Little Time...
C. Hansen, S. Parker, C. Gribble 13

Software Technology 21
On Compiler Support for Mixed Task and Data Parallelism
T. Rauber, R. Reilein, G. Rünger 23

Distributed Process Networks - Using Half FIFO Queues in CORBA
A. Amar, P. Boulet, J.-L. Dekeyser, F. Theeuwen 31

An efficient data race detector backend for DIOTA
M. Ronsse, B. Stougie, J. Maebe, F. Cornelis, K. De Bosschere 39

Pipelined parallelism for multi-join queries on shared nothing machines
M. Bamha, M. Exbrayat 47

Towards the Hierarchical Group Consistency for DSM systems: an efficient way to share data objects
L. Lefèvre, A. Bonhomme 55

An operational semantics for skeletons
M. Aldinucci, M. Danelutto 63

A Programming Model for Tree Structured Parallel and Distributed Algorithms and its Implementation in a Java Environment
H. Moritsch 71

A Rewriting Semantics for an Event-Oriented Functional Parallel Language
F. Loulergue 79

RMI-like communication for migratable software components in HARNESS
M. Migliardi, R. Podesta 87

Semantics of a Functional BSP Language with Imperative Features
F. Gava, F. Loulergue 95

The Use of Parallel Genetic Algorithms for Optimization in the Early Design Phases
E. Slaby, W. Funk 103

An Integrated Annotation and Compilation Framework for Task and Data Parallel Programming in Java
H.J. Sips, K. van Reeuwijk 111

On The Use of Java Arrays for Sparse Matrix Computations
G. Gundersen, T. Steihaug 119

A Calculus of Functional BSP Programs with Explicit Substitution
F. Loulergue 127

JToe: a Java API for Object Exchange
S. Chaumette, P. Grange, B. Métrot, P. Vignéras 135

A Modular Debugging Infrastructure for Parallel Programs
D. Kranzlmüller, Ch. Schaubschläger, M. Scarpa, J. Volkert 143

Toward a Distributed Computational Steering Environment based on CORBA
O. Coulaud, M. Dussere, A. Esnard 151

Parallel Decimation of 3D Meshes for Efficient Web-Based Isosurface Extraction
A. Clematis, D. D'Agostino, M. Mancini, V. Gianuzzi 159
Parallel Programming 167

MPI on a Virtual Shared Memory
F. Baiardi, D. Guerri, P. Mori, L. Ricci, L. Vaglini 169

OpenMP vs. MPI on a Shared Memory Multiprocessor
J. Behrens, O. Haan, L. Kornblueh 177

MPI and OpenMP implementations of Branch-and-Bound Skeletons
I. Dorta, C. León, C. Rodriguez, A. Rojas 185

Parallel Overlapped Block-Matching Motion Compensation Using MPI and OpenMP
E. Pschernig, A. Uhl 193

A comparison of OpenMP and MPI for neural network simulations on a SunFire 6800
A. Strey 201

Comparison of Parallel Implementations of Runge-Kutta Solvers: Message Passing vs. Threads
M. Korch, T. Rauber 209
Scheduling 217

Extending the Divisible Task Model for Workload Balancing in Clusters
U. Rerrer, O. Kao, F. Drews 219

The generalized diffusion method for the load balancing problem
G. Karagiorgos, N. Missirlis, F. Tzaferis 225

Delivering High Performance to Parallel Applications Using Advanced Scheduling
N. Drosinos, G. Goumas, M. Athanasaki, N. Koziris 233
Algorithms 241

Multilevel Extended Algorithms in Structural Dynamics on Parallel Computers
K. Elssel, H. Voss 243

Parallel Model Reduction of Large-Scale Unstable Systems
P. Benner, M. Castillo, E.S. Quintana-Ortí, G. Quintana-Ortí 251

Parallel Decomposition Approaches for Training Support Vector Machines
T. Serafini, G. Zanghirati, L. Zanni 259

Fast parallel solvers for fourth-order boundary value problems
M. Jung 267

Parallel Solution of Sparse Eigenproblems by Simultaneous Rayleigh Quotient Optimization with FSAI preconditioning
L. Bergamaschi, A. Martinez, G. Pini 275

An Accurate and Efficient Selfverifying Solver for Systems with Banded Coefficient Matrix
C. Hölbig, W. Krämer, T.A. Diverio 283

3D parallel calculations of dendritic growth with the lattice Boltzmann method
W. Miller, F. Pimentel, I. Rasin, U. Rehse 291

Distributed Negative Cycle Detection Algorithms
L. Brim, I. Černá, L. Hejtmánek 297

A Framework for Seamlessly Making Object Oriented Applications Distributed
S. Chaumette, P. Vignéras 305

Performance Evaluation of Parallel Genetic Algorithms for Optimization Problems of Different Complexity
P. Köchel, M. Riedel 313

Extensible and Customizable Just-In-Time Security (JITS) Management of Client-Server Communication in Java
S. Chaumette, P. Vignéras 321

Applications & Simulation 329

An Object-Oriented Parallel Multidisciplinary Simulation System - The SimServer
U. Tremel, F. Deister, K.A. Sorensen, H. Rieger, N.P. Weatherill 331
Computer Simulation of Action Potential Propagation on Cardiac Tissues: An Efficient and Scalable Parallel Approach
J.M. Alonso, J.M. Ferrero (Jr.), V. Hernández, G. Moltó, M. Monserrat, J. Saiz 339

MoDySim - A parallel dynamical UMTS simulator
M.J. Fleuren, H. Stüben, G.F. Zegwaard 347

apeNEXT: a Multi-TFlops Computer for Elementary Particle Physics
F. Bodin, Ph. Boucaud, N. Cabibbo, F. Di Carlo, R. De Pietri, F. Di Renzo, H. Kaldass, A. Lonardo, M. Lukyanov, S. de Luca, J. Micheli, V. Morenas, N. Paschedag, O. Pene, D. Pleiter, F. Rapuano, L. Sartori, F. Schifano, H. Simma, R. Tripiccione, P. Vicini 355

The Parallel Model System LM-MUSCAT for Chemistry-Transport Simulations: Coupling Scheme, Parallelization and Applications
R. Wolke, O. Knoth, O. Hellmuth, W. Schröder, E. Renner 363

Real-time Visualization of Smoke through Parallelizations
T. Vik, A.C. Elster, T. Hallgren 371

Parallel Simulation of Cavitated Flows in High Pressure Systems
P.A. Adamidis, F. Wrona, U. Iben, R. Rabenseifner, C.-D. Munz 379

Improvements in black hole detection using parallelism
F. Almeida, E. Mediavilla, A. Oscoz, F. de Sande 387

High Throughput Computing for Neural Network Simulation
J. Culloty, P. Walsh 395

Parallel algorithms and data assimilation for hydraulic models
C. Mazauric, V.D. Tran, W. Castaings, D. Froehlich, F.X. Le Dimet 403

Multimedia Applications 413

Parallelization of VQ Codebook Generation using Lazy PNN Algorithm
A. Wakatani 415
A Scalable Parallel Video Server Based on Autonomous Network-attached Storage
G. Tan, S. Wu, H. Jin, F. Xian 423

Efficient Parallel Search in Video Databases with Dynamic Feature Extraction
S. Geisler 431

Architectures 439

Introspection in a Massively Parallel PIM-Based Architecture
H.P. Zima 441

Time-Transparent Inter-Processor Connection Reconfiguration in Parallel Systems Based on Multiple Crossbar Switches
E. Laskowski, M. Tudruj 449

SIMD design to solve partial differential equations
R.W. Schulze 457
Caches 465

Trade-offs for Skewed-Associative Caches
H. Vandierendonck, K. De Bosschere 467

Cache Memory Behavior of Advanced PDE Solvers
D. Wallin, H. Johansson, S. Holmgren 475

Performance 483

A Comparative Study of MPI Implementations on a Cluster of SMP Workstations
G. Rünger, S. Trautmann 485

MARMOT: An MPI Analysis and Checking Tool
B. Krammer, K. Bidmon, M.S. Müller, M.M. Resch 493

BenchIT - Performance Measurement and Comparison for Scientific Applications
G. Juckeland, S. Börner, M. Kluge, S. Kölling, W.E. Nagel, S. Pflüger, H. Röding, S. Seidl, T. William, R. Wloch 501

Performance Issues in the Implementation of the M-VIA Communication Software
Ch. Fearing, D. Hickey, P.A. Wilsey, K. Tomko 509

Performance and performance counters on the Itanium 2 - A benchmarking case study
U. Andersson, P. Ekman, P. Öster 517

On the parallel prediction of the RNA secondary structure
F. Almeida, R. Andonov, L.M. Moreno, V. Poirriez, M. Pérez, C. Rodriguez 525
Clusters 533

MDICE - a MATLAB Toolbox for Efficient Cluster Computing
R. Pfarrhofer, P. Bachhiesl, M. Kelz, H. Stögner, A. Uhl 535

Parallelization of Krylov Subspace Methods in Multiprocessor PC Clusters
D. Picinin Jr., A.L. Martinotto, R.V. Dorneles, R.L. Rizzi, C. Hölbig, T.A. Diverio, P.O.A. Navaux 543

First Impressions of Different Parallel Cluster File Systems
T.P. Boenisch, P.W. Haas, M. Hess, B. Krischok 551

Fast Parallel I/O on ParaStation Clusters
N. Eicker, F. Isaila, T. Lippert, T. Moschny, W.F. Tichy 559

PRFX: a runtime library for high performance programming on clusters of SMP nodes
B. Cirou, M.C. Counilh, J. Roman 569

Grids 577

Experiences about Job Migration on a Dynamic Grid Environment
R.S. Montero, E. Huedo, I.M. Llorente 579

Security in a Peer-to-Peer Distributed Virtual Environment
J. Köhnlein 587

A Grid Environment for Diesel Engine Chamber Optimization
G. Aloisio, E. Blasi, M. Cafaro, I. Epicoco, S. Fiore, S. Mocavero 599

A Broker Architecture for Object-Oriented Master/Slave Computing in a Hierarchical Grid System
M. Di Santo, N. Ranaldo, E. Zimeo 609

A framework for experimenting with structured parallel programming environment design
M. Aldinucci, S. Campa, P. Ciullo, M. Coppola, M. Danelutto, P. Pesciullesi, R. Ravazzolo, M. Torquati, M. Vanneschi, C. Zoccolo 617

Minisymposium - Grid Computing 625

Considerations for Resource Brokerage and Scheduling in Grids
R. Yahyapour 627

Job Description Language and User Interface in a Grid context: The EU DataGrid experience
G. Avellino, S. Beco, F. Pacini, A. Maraschini, A. Terracina 635
On Pattern Oriented Software Architecture for the Grid
H. Prem, N.R. Srinivasa Raghavan 643

Minisymposium - Bioinformatics 651

Green Destiny + mpiBLAST = Bioinfomagic
W. Feng 653

Parallel Processing on Large Redundant Biological Data Sets: Protein Structures Classification with CEPAR
D. Pekurovsky, I. Shindyalov, P. Bourne 661

MDGRAPE-3: A Petaflops Special-Purpose Computer System for Molecular Dynamics Simulations
M. Taiji, T. Narumi, Y. Ohno, A. Konagaya 669

Structural Protein Interactions: From Months to Minutes
P. Dafas, J. Gomoluch, A. Kozlenkov, M. Schroeder 677

Spatially Realistic Computational Physiology: Past, Present and Future
J.R. Stiles, W.C. Ford, J.M. Pattillo, T.E. Deerinck, M.H. Ellisman, T.M. Bartol, T.J. Sejnowski 685

Cellular automaton modeling of pattern formation in interacting cell systems
A. Deutsch, U. Börner, M. Bär 695

Numerical Simulation for eHealth: Grid-enabled Medical Simulation Services
S. Benkner, W. Backfrieder, G. Berti, J. Fingberg, G. Kohring, J.G. Schmidt, S.E. Middleton, D. Jones, J. Fenner 705

Parallel computing in biomedical research and the search for peta-scale biomedical applications
C.A. Stewart, D. Hart, R.W. Sheppard, H. Li, R. Cruise, V. Moskvin, L. Papiez 719

Minisymposium - Performance Analysis 727
Big Systems and Big Reliability Challenges
D.A. Reed, C. Lu, C.L. Mendes 729

Scalable Performance Analysis of Parallel Systems: Concepts and Experiences
H. Brunst, W.E. Nagel 737

CrossWalk: A Tool for Performance Profiling Across the User-Kernel Boundary
A.V. Mirgorodskiy, B.P. Miller 745

Hardware-Counter Based Automatic Performance Analysis of Parallel Programs
F. Wolf, B. Mohr 753

Online Performance Observation of Large-Scale Parallel Applications
A.D. Malony, S. Shende, R. Bell 761

Deriving analytical models from a limited number of runs
R.M. Badia, G. Rodriguez, J. Labarta 769

Performance Modeling of HPC Applications
A. Snavely, X. Gao, C. Lee, L. Carrington, N. Wolter, J. Labarta, J. Gimenez, P. Jones 777

Minisymposium - OpenMP 785
Thread based OpenMP for nested parallelization
R. Blikberg, T. Sorevik 787

OpenMP on Distributed Memory via Global Arrays
L. Huang, B. Chapman, R.A. Kendall 795

Performance Simulation of a Hybrid OpenMP/MPI Application with HESSE
R. Aversa, B. Di Martino, M. Rak, S. Venticinque, U. Villano 803

An environment for OpenMP code parallelization
C.S. Ierotheou, H. Jin, G. Matthews, S.P. Johnson, R. Hood 811

Hindrances in OpenMP programming
F. Massaioli 819

Wavelet-Based Still Image Coding Standards on SMPs using OpenMP
R. Norcen, A. Uhl 827

Minisymposium - Parallel Applications 835

Parallel Solution of the Bidomain Equations with High Resolutions
X. Cai, G.T. Lines, A. Tveito 837

Balancing Domain Decomposition Applied to Structural Analysis Problems
P.E. Bjorstad, J. Koster 845

Multiperiod Portfolio Management Using Parallel Interior Point Method
L. Halada, M. Lucka, I. Melichercik 853

Performance of a parallel split operator method for the time dependent Schrödinger equation
T. Matthey, T. Sorevik 861
Minisymposium - Cluster Computing 869

Design and implementation of a 512 CPU cluster for general purpose supercomputing
B. Vinter 871

Experiences Parallelizing, Configuring, Monitoring, and Visualizing Applications for Clusters and Multi-Clusters
O.J. Anshus, J.M. Bjorndalen, L.A. Bongo 879

Cluster Computing as a Teaching Tool
O.J. Anshus, A.C. Elster, B. Vinter 887

Minisymposium - Mobile Agents 895

Mobile Agents Principles of Operation
A. Genco 897

Mobile Agent Application Fields
F. Agostaro, A. Genco, S. Sorce 905

Mobile Agents and Grid Computing
F. Agostaro, A. Chiello, A. Genco, S. Sorce 913

Mobile Agents, Globus and Resource Discovery
F. Agostaro, A. Genco, S. Sorce 919

A Mobile Agent Tool for Resource Discovery
F. Agostaro, A. Genco, S. Sorce 927

Mobile Agents and Knowledge Discovery in Ubiquitous Computing
A. Genco 935

Author & Subject Index 943

Author Index 945
Subject Index 951
Invited Papers
Parallel Computing: Software Technology, Algorithms, Architectures and Applications
G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors)
© 2004 Elsevier B.V. All rights reserved.
Parallel Machines and the "Digital Brain" - An Intricate Extrapolation on Occasion of JvN's 100-th Birthday

F. Hossfeld

Chair for Technical Informatics and Computer Sciences, University of Technology (RWTH) Aachen; Central Institute for Applied Mathematics, Research Centre Juelich, Germany

On 28 December 2003, the scientific community will celebrate the 100-th anniversary of John von Neumann's birthday. On this occasion, we are reminded of his achievements as an outstanding mathematician and creator of Game Theory, but even more that he laid the very conceptual foundations of the digital computer. His concept had the fortune - contrary to Konrad Zuse's in Germany - that the transistor was invented at Bell Labs in 1947, just when JvN wrote his famous reports on the digital computer, thus giving rise to the extraordinary technological development of microelectronics, pushed further by other inventions like photolithography and integrated circuits in the years to come. For decades, the exponential growth of the power of microchips - every 18 months, the integration density of transistors on the chips doubles - has been driving the equally exponential growth of computer power. The top computers have surpassed the teraflops level by far and are now targeting 100 teraflops or even petaflops. The computer has become ubiquitous, and protagonists of robotics and artificial intelligence are tempted to attribute to it omnipotent capabilities which will lead to autonomous "humanoids" (Moravec, Kurzweil) on the one hand and to threatening horror scenarios on the other (Joy). The predictions of the semiconductor industry tell us that "Moore's Law", describing the exponential evolution in microelectronics, might remain valid for another 10 to 15 years. Beyond Moore's Law, quantum effects will definitely end the orderly functioning of "classical" circuits. In 1982, Richard Feynman pointed out that certain quantum mechanical systems could not be simulated efficiently on a "classical" digital computer, however powerful it may become. This led to speculations that computation in general could be done, in principle and even more efficiently, if a novel computer could make thorough use of quantum effects, thus providing a challenging option for parallel computing by exploiting the exponential speedup through quantum parallelism. Peter Shor's factorization algorithm of 1994 showed that a quantum computer could efficiently solve a computationally hard problem for which no efficient classical algorithm is known. Experimentalists work on different physical concepts to realize quantum computation; for instance, quantum dots, trapped ions, superconducting devices, and NMR technology have been shown to provide the principles and the technology to build quantum computers.

1. LEGACIES

In the early 1950s, John von Neumann once said that three scientific achievements had changed the view of the world in the 20th century: (1) Einstein's Theory of Relativity, (2)
Heisenberg's Quantum Mechanics, and (3) Gödel's Incompleteness Theorem. In the second half of the 20th century, the history of science and technology added a then unforeseen fourth one: von Neumann's Digital Computer.

John von Neumann was born in December 1903 in Budapest as the son of a Hungarian Jewish banker. JvN was a mathematical infant prodigy with a kind of photographic memory. Of course, he was later disposed to study mathematics; however, since his father sought advice from the famous von Kármán, then at the University of Technology in Aachen, JvN was to study chemical engineering, which he humbly did - obviously in Budapest, Berlin, and Zurich. He finished these studies with the diploma in 1926. In the same year, however, he also received his PhD in mathematics with a thesis on set theory, certainly influenced, during his visits to Göttingen, by Hilbert and his program. After lecturing in Berlin and Hamburg, he received an invitation as visiting professor from Princeton University, which he accepted in 1930. In 1933, after the foundation of the Institute for Advanced Study in Princeton in 1930, he became, together with Alexander and Veblen, one of the first professors of mathematics at the institute, which afterwards attracted Einstein, Gödel, Church, Dyson, Oppenheimer and many other famous scientists as well as, in 1937, Turing as a guest scientist. Hence, JvN was well informed not only about the progress of physics but also about the Hilbert-program-deconstructing work of Gödel and about Turing's and Church's ideas of computation and computability. The concept of the Turing machine and algorithmics had a definite, major impact on his later plans and projects of designing and building the digital computer. Highly decorated, JvN died of bone cancer on February 8, 1957.

Already in 1921, he received the Award of the Best Mathematician in Hungary. In 1927, he published five papers: three on quantum mechanics, which had just been created by Heisenberg and Schrödinger, one on a mathematical problem, and one which is considered the foundation of game theory. In 1932, he published the important monograph - in German - on the "Mathematical Foundations of Quantum Mechanics", where he set this theory on solid mathematical ground, especially with his ideas on algebras, which are attracting renewed attention in modern theoretical physics. From 1940 to 1945, he was deeply involved in the US Government's Manhattan Project, continuing his involvement as an esteemed advisor until his death in Washington. Simultaneously, however, he focussed his mathematical interests on grand challenges arising from the stalemate in the analysis of partial differential equations, which again guided him to the definition, design, and construction of the digital computer to overcome the stagnation of the analytic mathematical treatment of complex problems like climate and weather forecast, to indicate only a few out of his broad spectrum of interdisciplinary challenges. This led him to establish computer simulation as the third category of scientific methodology in addition to theory and experiment, a discipline which was later named Computational Science (& Engineering) by the Nobel Prize Winner Kenneth Wilson (1982).
Honoured with the Medal of Freedom, the Albert Einstein Commemorative Award and the Enrico Fermi Award, as well as membership in the American Academy of Sciences, the American Academy of Arts and Science, the American Philosophical Society and many other academies abroad, he died too early to complete, besides his diverse projects in mathematics and computing, his thorough analysis of the potential of the digital computer and its "organs" - as he called its various components - and of the similarities of the "digital brain" with, or rather its fundamental distinction from, the human brain, as he elaborated in his last publication, the fragmentary analysis of "The Computer and the Brain", which he worked out for the invited Silliman Lectures of 1956 but could never deliver at Yale University [1, 2, 3, 4, 5].
2. COMPUTER TECHNOLOGY AND SUPERCOMPUTER ARCHITECTURE
While Konrad Zuse in Germany [6] - the definitely, and at last also in the US accepted, first digital computer pioneer - struggled since the 1930s with the limited technical and technological resources available, JvN was lucky enough to design his digital computer concept at a time when the transistor was invented by John Bardeen, Walter Brattain and William Shockley in 1947 at Bell Labs, earning them the Nobel Prize in 1956. The transistor gave a tremendous push to the development of digital devices, encouraging the STRETCH Program at IBM in 1955 in parallel with the development of the ferrite core memory by Jay Forrester, and stimulating John Backus to create the first compiler for the high-level language Fortran in 1957 to harness the IBM 704 computer with its 32,768-word (36-bit) ferrite core memory and a magnetic drum as secondary storage. In 1959, Jack Kilby at Texas Instruments developed the first integrated circuit (then on germanium substrate) and Robert Noyce at Fairchild Camera invented photolithography. These giant technological steps created the incredible explosion of microelectronics, which resulted in 1970 in the road-paving achievement of the first microprocessor, the Intel 4004. From then on, we can follow an exponential growth of digital computer power in parallel with the exponential growth of the density of transistors on the chips, which is acknowledged today as "Moore's Law", allowing for the doubling of microprocessor power and memory size within about 18 months. Personal computers, workstations, servers, and supercomputers have followed this "Law" for decades, enhanced even more by innovations in computer architecture and operating system concepts [7] leading to vector-processing functions and parallelism in modern supercomputers, which nowadays reach far beyond teraflops performance even with large microprocessor clusters; since early 2002, the Japanese "Earth Simulator" - a powerful combination of vector-processing and parallel structures built on NEC technology - has been heading the TOP500 list with more than 40 teraflops peak and about 35 teraflops sustained Linpack performance [8], causing shock waves - "Computnik" - within the political and technocratic circles of the USA similar to those caused by the Sputnik launched then by the late Soviet Union.

This tremendous explosion of the digital computer leads back, in almost all aspects, to the roots of computer architecture and programming as laid down by JvN in the 1940s. Scanning the history since the very birthday of Computational Science and Engineering - which, as mentioned, may be dated back to 1946, when JvN formulated the strategic program in his famous report on the necessity and future of digital computing together with H. H. Goldstine - complex systems at that time were primarily involved with fluid dynamics. JvN expected that really efficient high-speed digital computers would "break the stalemate created by the failure of the purely analytical approach to nonlinear problems" and suggested fluid dynamics as a source of problems through which a mathematical penetration into the area of nonlinear partial differential equations could be initiated. JvN envisioned computer output as providing scientists with those heuristic hints needed in all parts of mathematics for genuine progress, and as breaking the deadlock - the "present stalemate" - in fluid dynamics by giving clues to decisive mathematical ideas. In a sense, his arguments sound very young and familiar.
As far as fluid dynamics is concerned, in his John von Neumann Lecture at the SIAM National Meeting in 1981, Garrett Birkhoff came to the conclusion that it was unlikely that computational fluid dynamics (CFD) would become a truly mathematical science in the near future, although computers might soon rival windtunnels in their capabilities; both, however, would remain essential for research [9, 10].
The various strategic position papers of the 1980s [11, 12, 13, 14, 15] and the government technology programs in the USA, in Europe, and in Japan in the early 1990s claimed that the timely provision of supercomputers to science and engineering, and the ambitious development of innovative supercomputing hardware and software architectures as well as new algorithms and effective programming tools, were an urgent research-strategic response to the grand challenges arising from these huge scientific and technological barriers. The solutions of complex problems are critical to the future of science, technology, and society. Supercomputing will be a crucial factor for industry as well, in order to meet the requirements of international economic competition, especially in the area of high-tech products. Despite remarkable investments in research centers and universities in building up supercomputing power and skills, and also some sporadic industrial efforts concerning supercomputing in Europe, it took until the 1990s before the U.S. Government and the European Union, as well as several national European governments, started non-military strategic support programs [16, 17, 18]. Their goals were also to enhance supercomputing by stimulating the technology transfer from universities and research institutions into industry and by increasing the fraction of the technical community which gets the opportunity to develop the skills required to efficiently access high-performance computing resources.

In recent years, computer simulation has reached even the highest political level, since, in 1996, the United Nations voted to adopt the Comprehensive Test-Ban Treaty banning all nuclear testing for military and peaceful purposes. Banning physical nuclear testing created a need for full-physical modeling and high-confidence computer simulation and, hence, for unprecedented steps in supercomputer power. DoE's Accelerated Strategic Computing Initiative (ASCI) [19, 20], aiming to replace physical nuclear-weapons testing by computer simulation, and NSF's Partnerships for Advanced Computational Infrastructures [21] in the US, targeting the advancement of new computing and communication infrastructures for grid computing [22, 23] intended to revolutionize science and engineering - as well as business - are definitely establishing computer simulation as a fundamental methodology in science and engineering. The dedication of the Nobel Prize for Chemistry in 1998 to Computational Chemistry, in addition, confirmed its significance in the scientific community as well as in industry and politics.

In the early 1990s, Cray vector-supercomputers with shared memory architecture and proprietary bipolar CPU technology foreseeably ran into difficulties in providing the giant steps in compute power needed by greedy users, then particularly in physics. Supercomputer architectures with massive parallelism and distributed memory emerged as the future, also more cost-effective, line for the very top end. During these years, nearly thirty companies were offering massively parallel systems and others were planning to enter the market with new products, although many experts predicted that the market would not be able to sustain that many vendors [24]. In [25], a chronological compilation of high performance computer history illuminates the "Darwinistic" forces affecting the supercomputer evolution lines.
It demonstrates that the expected shake-out in the computer industry did take place, calling into question the health and the future potential of this industry as a whole. Some went out of the parallel computer business for quite different reasons; others ended up in mergers. The dramatic survival battle in the supercomputer industry also did severe damage to the users and the customers in the supercomputing arena. Quite often, the critical situation of parallel computing has been rigorously analyzed with respect to the possible negative impacts on the future perspectives and the progress of this scientific discipline. Already Goldstine said that the history of computing is
littered with "australopithecines" - short computer lines which do not lead anywhere [3]. Following in the wake of DoE's ASCI program, powerful new supercomputers have in recent years been brought onto the market by the manufacturers participating in the high-performance computing race again. It seemed that only those supercomputer manufacturing companies would have a realistic chance to survive in the years to come who had the potential, capabilities, and favour to get involved in ASCI or other significant US Government supported programs. However, in early 2002, the Japanese took the lead with the "Earth Simulator", which was justified by the climate problems arising around the world.

Parallel architectures have become quite successful on the TOP500 list, recently extending the parallel computer concept towards clustered symmetric multiprocessor (SMP) nodes, mainly based on commodity components, tying together possibly tens of thousands of processing elements [26]. Hence, the Central Institute for Applied Mathematics at the Research Centre Juelich, running the National Supercomputer Centre "John von Neumann Institute for Computing", just inaugurated its 8.9 teraflops SMP-based new supercomputer "JUMP", now No. 1 in Europe, with 41 IBM p690 nodes of 32 processors each, adding up to 1312 processors and surpassing its MPP system CRAY T3E/1200 by more than a factor of 10 in performance [27, 28]. The simulation of extremely complex systems may also shape future large-scale computing by interconnecting supercomputers of diverse architectures into giant supercomputer complexes via the new paradigm of grid computing. These developments will challenge not only system reliability, availability and serviceability at novel levels, but also the interactivity of concurrent algorithms and, in particular, the adaptivity, accuracy, and stability of parallel numerical methods [29]. In any case, however, a new technology will be necessary to target at and beyond the petaflops barrier in supercomputing, since further expansion of the floor space of several thousands of square-meters needed by today's SMP-clustered systems, with only tens of teraflops peak performance, will be neither reasonable nor technically feasible. IBM's Blue Gene project may give some guidance and, more aggressively, nanotechnology may provide innovative transistor concepts which will shift the limit of Moore's Law a bit further into the next decades [30].
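To make the growth rates behind these extrapolations concrete, here is a small back-of-the-envelope check (ours, not the author's) of how the 18-month doubling of "Moore's Law" compounds over time:

```python
# "Moore's Law": integration density doubles every 18 months (1.5 years),
# i.e. it grows by a factor of 2**(t / 1.5) after t years.
for years in (1.5, 3.0, 10.0, 15.0):
    print(f"{years:4.1f} years -> factor {2 ** (years / 1.5):7.0f}")
# 1.5 years -> factor 2; 10 years -> factor ~102 (two orders of magnitude);
# 15 years -> factor 1024, which is why "another 10 to 15 years" of
# validity matters so much for the petaflops targets discussed above.
```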
3. FROM BITS TO QUBITS

Nevertheless, the end of Moore's Law has been predicted for the time around 2020 at the latest, since extrapolation leads into nanoscales where the quantum regime will totally govern and thus, by noise, disturb the transistor functions. While the applications of "ubiquitous" computing and its specific requirements may be satisfied by the currently available and further improved technology, high-performance computing targeting multi-petaflops performance may severely suffer from the end of this exponential growth. Therefore, alternative computing concepts breaking those performance barriers should be intensively explored in order to provide the means to deal with complex problems in future science, industry, and society.

In 1982, Richard Feynman explained that certain quantum physical systems cannot be simulated efficiently by "classical" computers; he pointed out, however, that quantum mechanics could possibly provide the principles to develop a fundamentally new computing paradigm: quantum computing, based on technologically and logically different computer architectures [31]. As early as 1984, David Deutsch designed a quantum computer model [32] which has since been elaborated theoretically in remarkable depth [33], yielding, as an expansion of the Church-Turing computational concept to quantum Turing machines, deep complexity-theoretical results and quantum algorithms which meet the expectation of exponential computational speedup due to the inherent exponential superposition of quantum states, which, in analogy to classical Boolean bit-based logics, are represented by "qubits", thus exploiting the so-called quantum parallelism. Like the NAND and NOR gates in Boolean switching algebra, universal gates (single-qubit gates and the Conditional NOT: CNOT) are available in quantum computing, besides versatile gates to build up logical circuits; and remarkably - like in classical computation - the Fourier Transform turned out to be a fundamental algorithmic element to exploit the exponential speedup in quantum algorithms.
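The state-vector picture behind these notions can be made concrete in a few lines of NumPy. The following sketch (ours, not from the paper) prepares two qubits, applies a Hadamard gate to put the first qubit into superposition, and then applies the CNOT gate mentioned above; the resulting 4-component vector carries amplitudes for all basis states |00>, |01>, |10>, |11> at once, which is the "quantum parallelism" the text refers to:

```python
import numpy as np

H = np.array([[1, 1],
              [1, -1]]) / np.sqrt(2)   # Hadamard gate (creates superposition)
I = np.eye(2)
CNOT = np.array([[1, 0, 0, 0],         # conditional NOT: flips the second
                 [0, 1, 0, 0],         # qubit iff the first qubit is |1>
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]])

state = np.array([1.0, 0.0, 0.0, 0.0])  # start in |00>
state = np.kron(H, I) @ state           # superpose the first qubit
state = CNOT @ state                    # entangle the pair
print(np.round(state, 3))               # [0.707 0. 0. 0.707]
# The result is the Bell state (|00> + |11>)/sqrt(2): one state vector
# holding two classical bit patterns simultaneously. This vector doubles
# in size with every added qubit, which is exactly why such systems are
# hard to simulate on a "classical" computer.
```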
On this basis, Peter Shor succeeded in developing a quantum factorization algorithm [34] which not only provides exponential speedup compared with the best known classical sequential algorithms for this well-known computationally hard problem, but also indicated that quantum computing may be fundamentally capable of efficiently solving at least a significant subset of problems for which no efficient classical algorithms are known [35]. Although there are diverse physical methods which provide the principal means to establish quantum computer technology and architecture (like NMR technology, quantum dots, trapped ions, superconducting devices and others), and which in some cases have already been successfully demonstrated with small numbers of qubits, it will definitely take many years to establish, if at all, a sound quantum computer technology. There are many sources of limited stability and of fundamental perturbations of entangled quantum states (decoherence). However, as the history of science has shown elsewhere, once the idea is born, it will definitely be pursued, by scientific curiosity about innovative computational concepts and by the need of greedy applications as well. Thus, quantum computing represents a grand challenge for the years to come; it opens a new space for future intellectual adventures in research and development in physics, mathematics, and computer science.

4. TOWARDS THE "DIGITAL BRAIN"?

Along with considerations on the future of computation and the understanding of the human brain, Roger Penrose, in his popular books, also addressed quantum mechanics and possibly emerging new theories in physics, like String Theory and its future extensions, as potential scientific vehicles to explain human brain functions and to understand mind and consciousness [36, 37, 38]. As indicated above, JvN discussed in his preliminarily sketched Silliman Lectures, published as "The Computer and the Brain" [5], the similarities and differences of the computer, as he defined it, and the brain, as he understood it at his time. Interestingly enough, in his famous reports defining the digital computer in the 1940s he talked about the major computer components as "organs", thus showing that he in some sense entertained the long European tradition of ideas to design and construct humanoids, which boomed in the 17th and 18th centuries, giving space to many, as seen today, ridiculous appearances of flute-playing monkeys, chess-playing fakes and other curiosities.
However, as a great mathematician he certainly also stood in the tradition of occidental philosophy, which for centuries up to today has been challenged by the mind-body problem, reaching back to the ancient philosophers Thales of Miletus ("the four elements"), Anaxagoras of Clazomenae (introducing "nous", the mind), Leukippos, Demokrit and, last not least, Aristoteles and Platon, via the Renaissance with Locke ("Essay concerning human understanding"), Leibniz ("Nouveaux essais sur l'entendement humain"), de La Mettrie ("L'homme plus que machine"), Descartes ("Discours de la méthode" and "Meditationes de prima philosophia"), towards Eccles ("The self and its brain", together with Popper) and the brain researchers of today like Edelman, Chalmers, Singer, Roth and many others [39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51], pointing out that the human brain may well be the most complex system in the - known - universe.

The more one learns about the brain, the greater the distance to the digital computer seems to grow, although this statement does not match the expectations of Artificial Intelligence (AI). "Strong" Artificial Intelligence still maintains its postulate that the exponentially growing complexity of the digital computer will automatically create mind and (self-)consciousness in the digital machine, and Ray Kurzweil [52] and Hans Moravec [53] draw dramatic pictures of a future of autonomous intelligent robots and humanoids which might take over control already during this century, keeping us humans as pets if we behave. While Moravec is focussed mainly on robotics, he predicts that the digital computer has reached, or is not far from, the capacity and capability to simulate living beings and, pointing at recent computer chess matches with Kasparov, even to compete with human intelligence. Based on simplistic calculations comparing vertebrate retina performance and robot vision requirements, he states that 100 tera-instructions per second of computer power will be sufficient to match human behaviour by simulation. This computer performance will indeed soon be available, if it is not already today. On the other hand, IT experts such as Bill Joy, philosophers and social scientists, as well as brain researchers, consider Kurzweil's predictions either horror scenarios or fundamental impossibilities [54].

However, as long as we do not know what human intelligence and its essence are, and what the possibly - if not, as a scientist, to say: certainly - materialistic ground of human mind and consciousness is, what they are and how they emerge, there will be no final answer to the mind-body problem and - vice versa - no understanding of the human brain; thus, it is not possible to make true statements about the "intelligent" evolution of the digital computer either.

Penrose discussed - in "Shadows of the Mind" [38] - four viewpoints on the question whether computing equals thinking and thinking equals computation:

- Position A: All thinking is computation ("computation" in the sense of Church and Turing, i.e. based on algorithms and representable by Turing machines); in particular, feelings of conscious awareness are evoked merely by executing appropriate computations. ("Strong" AI)
- Position B: Awareness is a feature of the brain's physical action; and whereas any physical action can be simulated computationally, computational simulation cannot by itself evoke awareness. ("Weak" AI)
- Position C: Appropriate physical action of the brain evokes awareness, but this physical action cannot even be properly simulated computationally.
- Position D: Awareness cannot be explained in physical, computational, or any other scientific terms.

It seems quite natural and consequent that Penrose, as a mathematician and scientist rather than a strong-AI promoter or a theologically locked esoteric, discarded Positions A, B, and D as inadequate answers to the question, and in strengthening Position C he referred to the evolution of physical theories extending quantum mechanics which will favour and finally
confirm Position C. In this respect, it is interesting that the Nobel Prize winner John Eccles, too, seems to have pursued during his last scientific activities (still vague) ideas of involving quantum mechanical processes in the description of the functions of the human brain concerning the interaction of mind and matter.

Looking back to JvN's first monograph, defining and rigorously clarifying the mathematical foundations of quantum mechanics, and to his last work, comparing the computer and the brain, we must say that his early death caused a great loss also for this challenging and future-determining field of research. If we listen to the predictions of the Kurzweils and Moravecs, we may be justified in stating that JvN, as one of the last geniuses of the 20th century, would have activated his exceptional intellectual capabilities and sharp theoretical instruments to shape this field in his strong scientific manner, as he did in his relatively short life in other areas, rather than by vague speculation. The science community has indeed good reasons to celebrate his 100-th birthday!

REFERENCES
[1] von Neumann, J., Collected Works, Vol. I-VI, Pergamon Press, 1961-1963.
[2] Aspray, W., John von Neumann and the Origins of Modern Computing, MIT Press, 1990.
[3] Macrae, N., John von Neumann, Pantheon Books, 1992.
[4] von Neumann, J., Mathematical Foundations of Quantum Mechanics, Princeton University Press, 1955.
[5] von Neumann, J., The Computer and the Brain, 2nd Edition (with a Foreword by Paul M. Churchland and Patricia S. Churchland), Yale University Press, Yale Nota Bene Book, 2000.
[6] Ceruzzi, P. E., The Early Computers of Konrad Zuse, 1935 to 1945, Ann. Hist. Comp. Vol. 3 (1981), No. 3, 241-262; Ritchie, D., The Computer Pioneers, Chapter 3, Simon & Schuster, 1986.
[7] Hwang, K., Advanced Computer Architecture - Parallelism, Scalability, Programmability, McGraw-Hill, 1993.
[8] www.TOP500.org/lists/2003/11.
[9] Birkhoff, G., Numerical Fluid Dynamics, SIAM Review Vol. 25 (1983), 1-34.
[10] Roache, P. J., Fundamentals of Computational Fluid Dynamics, Hermosa Publishers, 1998.
[11] Special Double Issue: Grand Challenges to Computational Science, Future Generation Computer Systems 5 (1989), No. 2/3.
[12] Committee on Physical, Mathematical, and Engineering Sciences, Federal Coordinating Council for Science, Engineering, and Technology, Grand Challenges 1993: High Performance Computing and Communications, The FY 1993 U.S. Research and Development Program, Office of Science and Technology Policy, Washington, 1992.
[13] Board on Mathematical Sciences of the National Research Council (USA), The David II Report: Renewing U.S. Mathematics - A Plan for the 1990s, in: Notices of the American Mathematical Society, May/June 1990, 542-546; September 1990, 813-837; October 1990, 984-1004.
[14] Commission of the European Communities, Report of the EEC Working Group on High-Performance Computing (Chairman: C. Rubbia), February 1991.
[15] Trottenberg, U., et al., Situation und Erfordernisse des wissenschaftlichen Höchstleistungsrechnens in Deutschland - Memorandum zur Initiative High Performance Scientific Computing (HPSC), Februar 1992; published in: Informatik-Spektrum Band 15 (1992), H. 4, 218.
[16] The Congress of the United States, Congressional Budget Office, Promoting High-Performance Computing and Communications, Washington, June 1993.
[17] High-Performance Computing Applications Requirements Group, High-Performance Computing and Networking, Report, European Union, April 1994; High-Performance Networking Requirements Group, Report, European Union, April 1994.
[18] Bundesministerium für Forschung und Technologie, Initiative zur Förderung des parallelen Höchstleistungsrechnens in Wissenschaft und Wirtschaft, BMFT, Bonn, Juni 1993.
[19] www.llnl.gov/asci/.
[20] www.llnl.gov/asci-pathforward/.
[21] Smarr, L., Toward the 21-st Century, Comm. ACM Vol. 40 (1997), No. 11, 28-32; Smith, Ph. L., The NSF Partnerships and the Tradition of U.S. Science and Engineering, Comm. ACM Vol. 40 (1997), No. 11, 35-37.
[22] Foster, I., and C. Kesselman (eds.), The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann Publishers, 1999.
[23] Atkins, D. E., et al., Revolutionizing Science and Engineering Through Cyberinfrastructure, National Science Foundation Report, Blue Ribbon Advisory Panel on Cyberinfrastructure, USA, January 2003.
[24] The Superperformance Computing Service, Palo Alto Management Group, Inc., Massively Parallel Processing: Computing for the 90s, SPCS Report 25, Second Edition, Mountain View, California, June 1994.
[25] Strohmaier, E., et al., The marketplace of high-performance computing, Special Anniversary Issue (G. Joubert et al., eds.), Parallel Computing Vol. 25 (1999), No. 13/14, 1517-1544.
[26] Hossfeld, F., et al., Gekoppelte SMP-Systeme im wissenschaftlich-technischen Hochleistungsrechnen - Status und Entwicklungsbedarf (GoSMP), Analyse im Auftrag des BMBF, Förderkennzeichen 01 IR 903, Dezember 1999.
[27] www.fz-juelich.de/zam.
[28] http://jumpdoc.fz-juelich.de/.
[29] Hossfeld, F., Teraflops Computing: A Challenge to Parallel Numerics?, in: Zinterhof, P., et al. (eds.), Parallel Computation, Proceedings of the 4-th International ACPC Conference 1999, Salzburg, Austria, pp. 1-12; Feeding Greedies on Meager Roadmaps, in: Bubak, M., et al. (eds.), Proceedings of SGI'2000, Krakow, Poland, pp. 11-22.
[30] Semiconductor Industry Association (SIA), International Technology Roadmap for Semiconductors, 2003 Edition (ITRS 2003).
[31] Feynman, R. P., Simulating physics with computers, Int. J. Theor. Phys. Vol. 21 (1982), 467.
[32] Deutsch, D., Quantum Theory, the Church-Turing Principle and the universal quantum computer, Proc. Roy. Soc. Lond. A Vol. 400 (1985), 97.
[33] Nielsen, M. A., and Chuang, I. L., Quantum Computation and Quantum Information, Cambridge University Press, 2000.
[34] Shor, P. W., Algorithms for quantum computation: discrete logarithms and factoring, in: Proceedings 35-th Annual Symposium on Foundations of Computer Science, IEEE Press, 1994; Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer, SIAM J. Comp. Vol. 26 (1997), No. 5, 1484-1509.
[35] Moret, B. M. E., and Shapiro, H. D., Algorithms from P to NP - Volume 1: Design & Efficiency, The Benjamin/Cummings Publishing Company, 1991.
[36] Penrose, R., The Emperor's New Mind, Oxford University Press, 1989.
[37] Penrose, R., Shadows of the Mind - A Search for the Missing Science of Consciousness, Oxford University Press, 1994.
[38] Penrose, R., The Large, the Small, and the Human Mind, Cambridge University Press, 1997.
[39] Locke, J., An Essay Concerning Human Understanding, Great Books in Philosophy, Prometheus Books, 1995.
[40] de La Mettrie, J. O., Machine Man (L'homme plus que machine), in: Cambridge Texts in the History of Philosophy (ed. by Ann Thomson): La Mettrie, Machine Man and Other Writings, Cambridge University Press, 1996.
[41] Descartes, R., Discours de la méthode pour bien conduire sa raison, et chercher la verité dans les sciences; Meditationes de prima philosophia, in: Philosophische Schriften in einem Band, Felix Meiner Verlag, 1996.
[42] Leibniz, G. W., Nouveaux Essais sur L'Entendement Humain, Livre I-IV, Philosophische Schriften Band III, Wissenschaftliche Buchgesellschaft, 1985; Schriften zur Logik und zur philosophischen Grundlegung von Mathematik und Naturwissenschaft (orig. latin), Philosophische Schriften Band IV, Wissenschaftliche Buchgesellschaft, 1992.
[43] Popper, K., and Eccles, J., The Self and Its Brain - An Argument for Interactionism, Springer Verlag, 1977.
[44] Chalmers, D. J., The Conscious Mind - In Search of a Fundamental Theory, Oxford University Press, 1996.
[45] Edelman, G. M., and Tononi, G., A Universe of Consciousness - How Matter Becomes Imagination, Basic Books, 2000.
[46] Pinker, St., How the Mind Works, Penguin Books, 1998.
[47] Singer, W., Der Beobachter im Gehirn, Essays zur Hirnforschung, Suhrkamp Taschenbuch Wissenschaft Band 1571, Suhrkamp Verlag, 2002.
[48] Roth, G., Das Gehirn und seine Wirklichkeit - Kognitive Neurobiologie und ihre philosophischen Konsequenzen, Suhrkamp Taschenbuch Wissenschaft Band 1275, Suhrkamp Verlag, 1997.
[49] Pauen, M., und Roth, G. (Hrsg.), Neurowissenschaften und Philosophie - Eine Einführung, Wilhelm Fink Verlag, 2001.
[50] Pauen, M., Grundprobleme der Philosophie des Geistes - Eine Einführung, Fischer Taschenbuch Verlag, 2001.
[51] Zoglauer, Th., Geist und Gehirn - Das Leib-Seele-Problem in der aktuellen Diskussion, Vandenhoeck & Ruprecht, 1998.
[52] Kurzweil, R., The Age of Spiritual Machines; in German: Homo s@piens, Kiepenheuer & Witsch, 1999.
[53] Moravec, H. P., Mind Children: The Future of Robots and Human Intelligence, Harvard University Press, 1990; Robot: Mere Machine to Transcendent Mind, Oxford University Press, 2000.
[54] Schirrmacher, F. (Hrsg.), Die Darwin AG - Wie Nanotechnologie, Biotechnologie und Computer den neuen Menschen träumen, Kiepenheuer & Witsch, 2001.
So Much Data, So Little Time...

C. Hansen, S. Parker, and C. Gribble

Scientific Computing and Imaging Institute, University of Utah, Salt Lake City, UT 84112, USA

Massively parallel computers have been around for the past decade. With the advent of such powerful resources, scientific computation rapidly expanded the size of computational domains. With the increased amount of data, visualization software strove to keep pace through the implementation of parallel visualization tools and parallel rendering that leverage the computational resources. Tightly coupled ccNUMA parallel processors with attached graphics adapters have shifted visualization research towards exploiting the more unified memory architecture. Our research at the Scientific Computing and Imaging (SCI) Institute at the University of Utah has focused on innovative, scalable techniques for large-scale 3D visualization. Real-time ray tracing for isosurfacing has proven to be the most interactive method for large-scale scientific data. We have also investigated cluster-based volume rendering leveraging multiple nodes of commodity components.

1. INTRODUCTION

In recent years, scalable architectures and algorithms have led to unprecedented growth in computational data. The effectiveness of advanced hardware that produces large amounts of high-resolution data hinges upon the ability of human experts to interact with their data and extract useful information. Needless to say, the interactive analysis of such large, and sometimes unwieldy, data sets has become a computational bottleneck in its own right. To address this challenge, the field of visualization and computer graphics has made significant advances in the area of parallel methods for both scientific visualization and image generation.

2. SCALABLE ISOSURFACING

Many applications generate scalar fields ρ(x, y, z) which can be viewed by displaying isosurfaces where ρ(x, y, z) = ρ_iso. Ideally, the value for ρ_iso is interactively controlled by the user. When the scalar field is stored as a structured set of point samples, the most common technique for generating a given isosurface is to create an explicit polygonal representation for the surface using a technique such as Marching Cubes [1, 10]. This surface is subsequently rendered with attached graphics hardware accelerators such as the SGI Infinite Reality. Marching Cubes can generate an extraordinary number of polygons, which take time to construct and to render. For very large surfaces (i.e., greater than several million polygons), the isosurface extraction and rendering times limit the interactivity. In this paper, we generate images of isosurfaces directly
with no intermediate surface representation through the use of ray tracing. Ray tracing for isosurfaces has been used in the past (e.g., [9, 12, 21]), but we apply it to very large datasets in an interactive setting. It is well understood that ray tracing is accelerated through two main techniques [18]: accelerating or eliminating ray/voxel intersection tests, and parallelization. Acceleration is usually accomplished by a combination of spatial subdivision and early ray termination [8, 5, 20]. Ray tracing has been used for volume visualization in many contexts (e.g., [8, 19, 23]) and naturally lends itself towards parallel implementations [11, 14]: the computation for each pixel is independent of all other pixels, and the data structures used for casting rays are usually read-only. These properties have resulted in many parallel implementations; a variety of techniques have been used to make such systems parallel, and many successful systems have been built. These techniques are surveyed by Whitman [24].

Simple implementations work, but achieving scalability on large parallel resources requires careful attention to synchronization costs and resource assignment. To reduce synchronization overhead we can assign groups of rays to each processor. The larger these groups are, the less synchronization is required. However, as they become larger, more time is potentially lost due to poor load balancing, because all processors must wait for the last job of the frame to finish before starting the next frame. We address this through a load-balancing scheme that uses a static set of variable-size jobs dispatched from a queue in which jobs linearly decrease in size. The implementation of the work-queue assignment uses the hardware fetch-and-op counters on the Origin architecture, which allow efficient access to the central work-queue resource. This approach to dividing the work between processors scales very well. Rendering a scene with a large memory footprint (rendering of isosurfaces from the visible female dataset [17]) uses only 2.1 to 8.4 MB/s of main memory bandwidth. These statistics were gathered using the SGI perfex utility, benchmarked with 60 processors.
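To make the job-queue scheme concrete, the following is a minimal sketch of a per-frame work list whose job sizes decrease linearly, handed out through an atomic counter; std::atomic stands in for the Origin's hardware fetch-and-op counters, and all names here (Job, buildJobs, renderBand) are illustrative, not the system's actual code.

    #include <algorithm>
    #include <atomic>
    #include <vector>

    struct Job { int firstRow, numRows; };          // a horizontal band of pixels

    std::vector<Job> buildJobs(int imageRows, int numJobs) {
        std::vector<Job> jobs;
        const long total = (long)numJobs * (numJobs + 1) / 2;  // sum of weights numJobs..1
        int row = 0;
        for (int w = numJobs; w >= 1 && row < imageRows; --w) {
            // Weight w decreases linearly, so early jobs are large and the last
            // jobs of a frame are small, limiting end-of-frame load imbalance.
            int size = std::min((int)std::max(1L, (long)imageRows * w / total),
                                imageRows - row);
            jobs.push_back({row, size});
            row += size;
        }
        if (row < imageRows) jobs.push_back({row, imageRows - row});  // leftovers
        return jobs;
    }

    extern void renderBand(int firstRow, int numRows);  // traces the rays of one band

    std::atomic<int> nextJob{0};                    // shared work-queue cursor

    void workerLoop(const std::vector<Job>& jobs) { // executed by every processor
        for (int i; (i = nextJob.fetch_add(1)) < (int)jobs.size(); )
            renderBand(jobs[i].firstRow, jobs[i].numRows);
    }

Because the counter increment is the only shared operation, synchronization cost stays constant per job regardless of the number of processors.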
2.1. Rectilinear Isosurfacing

Our algorithm has three phases: traversing a ray through cells which do not contain an isosurface, analytically computing the isosurface when intersecting a voxel containing the isosurface, and shading the resulting intersection point [16]. This process is repeated for each pixel on the screen. Since each ray is independent, parallelization is straightforward. An additional benefit is that adding incremental features to the rendering has only incremental cost. For example, if one is visualizing multiple isosurfaces with some of them rendered transparently, the correct compositing order is guaranteed since we traverse the volume in a front-to-back order along the rays. Additional shading techniques, such as shadows and specular reflection, can easily be incorporated for enhanced visual cues. Another benefit is the ability to exploit texture maps which are much larger than texture memory (typically 32MB to 256MB).

3. PARALLEL ISOSURFACING RESULTS

We applied ray-traced isosurface extraction to interactively visualize the Visible Woman dataset, which is available through the National Library of Medicine as part of its Visible Human Project [15]. We used the computed tomography (CT) data, which
was acquired in 1mm slices with varying in-slice resolution. The data is composed of 1734 slices of 512x512 images at 16 bits; the complete dataset is 910 MBytes. Rather than downsample the data with a loss of resolution, we utilize the full-resolution data in our experiments. As previously described, our algorithm has three phases: traversing a ray through cells that do not contain an isosurface, analytically computing the isosurface when intersecting a voxel containing the isosurface, and shading the resulting intersection point.
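As a rough illustration of these phases, and only that, since the actual system intersects the trilinear interpolant analytically cell by cell rather than marching at fixed steps, the following sketch samples a scalar volume along a ray, detects a sign change of ρ - ρ_iso, and refines the hit by bisection. The Volume layout, step size, and all names are illustrative assumptions.

    #include <vector>

    struct Volume {                                  // illustrative point-sampled grid
        int nx, ny, nz;
        std::vector<float> v;                        // nx*ny*nz scalar samples
        float at(int x, int y, int z) const { return v[(z * ny + y) * nx + x]; }
        float sample(float x, float y, float z) const {  // trilinear interpolation;
            int i = (int)x, j = (int)y, k = (int)z;      // caller keeps coords in-bounds
            float fx = x - i, fy = y - j, fz = z - k;
            auto lerp = [](float a, float b, float t) { return a + t * (b - a); };
            float c00 = lerp(at(i, j,   k  ), at(i+1, j,   k  ), fx);
            float c10 = lerp(at(i, j+1, k  ), at(i+1, j+1, k  ), fx);
            float c01 = lerp(at(i, j,   k+1), at(i+1, j,   k+1), fx);
            float c11 = lerp(at(i, j+1, k+1), at(i+1, j+1, k+1), fx);
            return lerp(lerp(c00, c10, fy), lerp(c01, c11, fy), fz);
        }
    };

    // Phase 1: march until the field crosses the isovalue; Phase 2: refine the
    // crossing by bisection. Returns the hit distance, or -1 on a miss; the
    // caller's Phase 3 (shading) evaluates a normal and color at that point.
    float hitIsosurface(const Volume& vol, const float o[3], const float d[3],
                        float iso, float tMax, float dt = 0.5f) {
        float prev = vol.sample(o[0], o[1], o[2]) - iso;
        for (float t = dt; t < tMax; t += dt) {
            float cur = vol.sample(o[0]+t*d[0], o[1]+t*d[1], o[2]+t*d[2]) - iso;
            if (prev * cur <= 0.0f) {                // isosurface crossed in [t-dt, t]
                float lo = t - dt, hi = t;
                for (int it = 0; it < 16; ++it) {
                    float mid = 0.5f * (lo + hi);
                    float f = vol.sample(o[0]+mid*d[0], o[1]+mid*d[1],
                                         o[2]+mid*d[2]) - iso;
                    if (f * prev <= 0.0f) hi = mid; else lo = mid;
                }
                return 0.5f * (lo + hi);
            }
            prev = cur;
        }
        return -1.0f;
    }

The front-to-back ordering of this loop is what guarantees the correct compositing order for transparent isosurfaces mentioned above.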
    # processors   Frame rate/Speedup for five view and isovalue settings (a)-(e)
                   (a)            (b)            (c)            (d)            (e)
    1              0.427/1.00     0.084/1.00     0.155/1.00     0.304/1.00     0.568/1.00
    2              0.84/1.97      0.17/1.99      0.31/2.00      0.60/1.98      1.13/1.98
    3              1.26/2.94      0.25/2.95      0.46/2.96      0.89/2.93      1.68/2.96
    4              1.67/3.91      0.33/3.96      0.62/3.97      1.19/3.92      2.24/3.94
    6              2.45/5.73      0.50/5.97      0.93/5.96      1.76/5.77      3.29/5.80
    8              3.20/7.50      0.67/7.94      1.23/7.93      2.32/7.61      4.36/7.67
    12             4.81/11.26     1.00/11.89     1.84/11.88     3.44/11.30     6.51/11.47
    16             6.38/14.93     1.33/15.84     2.45/15.80     4.59/15.08     8.64/15.21
    24             9.54/22.33     1.98/23.54     3.65/23.49     6.84/22.48     12.92/22.76
    32             12.65/29.61    2.63/31.38     4.88/31.47     9.12/29.96     17.09/30.10
    48             18.85/44.13    3.92/46.72     7.30/47.02     13.52/44.39    25.27/44.50
    64             24.73/57.90    5.18/61.78     9.64/62.14     17.72/58.19    32.25/56.80
    96             35.38/82.82    7.67/91.38     14.28/92.02    25.04/82.23    45.50/80.14
    124            43.06/100.79   9.73/115.88    18.17/117.08   30.28/99.45    57.70/101.63
Figure 1. Variation in framerate as the viewpoint and isovalue change (frame rate in frames per second plotted against frame number, i.e., time).
The interactivity of our system allows exploration of the data by interactively changing either the isovalue or the viewpoint. For example, one could view the entire skeleton and interactively zoom in and modify the isovalue to examine the detail in the toes, all at about ten frames per second. The variation in framerate is shown in Figure 1.

4. INTERACTIVE VOLUME RENDERING WITH SIMIAN

Simian is a scientific visualization tool that utilizes the texture processing capabilities of consumer graphics accelerators to produce direct volume rendered images of scientific datasets. The true power of Simian is its rich user interface. Simian employs direct manipulation widgets for defining multi-dimensional transfer functions; for probing, clipping, and classifying the data; and for shading and coloring the resulting visualization [6]. A complete discussion of direct volume rendering using commodity graphics hardware is given in [2]. For more on using multi-dimensional transfer functions in interactive volume visualization, see [7]. All of the direct manipulation widgets provided by Simian are described thoroughly in [6].

The size of a volumetric dataset that Simian can visualize interactively is largely dependent on the size of the texture memory provided by the local graphics hardware. For typical commodity graphics accelerators, the size of this memory ranges from 32MB to 256MB. However, even small scientific datasets can consume hundreds of megabytes, and these datasets are rapidly growing larger as time progresses. Although Simian provides mechanisms for swapping smaller portions of a large dataset between the available texture memory and the system's main memory (a process similar to virtual memory paging), performance drops significantly and interactivity disappears. Moreover, because the size of the texture memory on commodity graphics hardware is not growing as quickly as the size of scientific datasets, using the graphics accelerators of many nodes in a cluster-based system is necessary to interactively visualize large-scale datasets. Naturally, cluster-based visualization introduces many challenges that are of little or no concern when rendering a dataset on a single node, and there are many techniques for dealing with the problems that arise. Our goal was to create an interactive volume rendering tool that provides a full-featured interface for navigating and visualizing large-scale scientific datasets. Using Simian, we examine two approaches to cluster-based interactive volume rendering: (1) a "cluster-aware" version of the application that makes explicit use of remote nodes through a message-passing interface (MPI), and (2) the unmodified application running atop the Chromium clustered rendering framework.

5. CHROMIUM CLUSTERED RENDERING FRAMEWORK

Chromium is a system for manipulating streams of OpenGL graphics commands on commodity-based visualization clusters [3]. For Linux-based systems, Chromium is implemented as a set of shared-object libraries that export a large subset of the OpenGL application programming interface (API). Extensions for parallel synchronization are also included [4]. Chromium's crappfaker libraries operate as the client-side stub in a simple client-server model. The stub intercepts graphics calls made by an OpenGL application, filters them through a user-defined chain of stream processing units (SPUs) that may modify the command stream, and finally redirects the stream to designated rendering servers. The Chromium rendering servers,
or crservers, process the OpenGL commands using the locally available graphics hardware; the results may be delivered to a tiled display wall, returned to the client for further processing, or composited and displayed using specialized image composition hardware such as Lightning-2 [22]. Depending on the particular system configuration, it is possible to implement a wide variety of parallel rendering systems, including the common sort-first and sort-last architectures [13]. For the details of the Chromium clustered rendering framework, see [3].

In principle, Chromium provides a very simple mechanism for hiding the distributed nature of clustered rendering from OpenGL applications. By simply loading the Chromium libraries rather than the system's native OpenGL implementation, graphics commands can be processed by remote hardware without modifying the calling application. However, for the Simian volume rendering application, Chromium does not currently provide features that sufficiently mask the underlying distributed operation and still enable the application to realize the extended functionality that we seek. OpenGL applications may still require significant modifications to effectively utilize a cluster's resources, even when employing the Chromium framework. It was necessary to implement some OpenGL functionality that was not supported by Chromium. First, we implemented the subset of OpenGL calls related to 3D textures, including glTexImage3D, which is the workhorse of the Simian volume rendering application. In addition, we added limited support for the NVIDIA GL_NV_texture_shader and GL_NV_texture_shader2 extensions, implementing only those particular features of the extensions that are explicitly used by Simian.
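To make the 3D-texture path concrete, here is a minimal sketch of uploading one volume brick via glTexImage3D, the call named above. It assumes an OpenGL 1.2-capable context (on some platforms the entry point must be loaded via the extension mechanism), and the helper name, filtering choices, and three-byte vgh layout are illustrative rather than Simian's actual code.

    #include <GL/gl.h>
    #include <vector>

    // Upload a w x h x d brick whose voxels hold three bytes each:
    // scalar value, gradient, and Hessian components ("vgh").
    GLuint uploadVghBrick(const std::vector<unsigned char>& vgh,
                          int w, int h, int d) {
        GLuint tex;
        glGenTextures(1, &tex);
        glBindTexture(GL_TEXTURE_3D, tex);
        glTexParameteri(GL_TEXTURE_3D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
        glTexParameteri(GL_TEXTURE_3D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
        glTexImage3D(GL_TEXTURE_3D, 0, GL_RGB8, w, h, d, 0,
                     GL_RGB, GL_UNSIGNED_BYTE, vgh.data());
        return tex;
    }

Swapping a brick amounts to re-issuing such an upload, which is exactly the traffic that becomes expensive once Chromium must forward it to remote crservers.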
6. CLUSTER-BASED VOLUME RENDERING RESULTS

We experimented with two C-SAFE fire-spread datasets simulating a heptane pool fire, h300_0075 and h300_0130, that were produced by LES codes. Both of the original datasets were 302x302x302-byte volumes storing scalar values as unsigned chars. Simian is capable of rendering volumes using gradient and Hessian components in addition to scalar values, so each of the fire-spread datasets was pre-processed to include these additional components and then padded to a power of 2, resulting in two 512x512x512x3-byte volumes (384 MB each) that store the value, gradient, and Hessian components ("vgh") as unsigned chars.¹

We first attempted to visualize both the original and vgh versions of each fire-spread dataset using the stand-alone version of Simian and a fairly powerful workstation equipped with an Intel Xeon 2.66GHz processor, 1GB of physical memory, and an NVIDIA Quadro4 900 XGL graphics accelerator with 128MB of texture memory. On this machine, Simian was able to load and render the original datasets using the swapping mechanisms inherent in OpenGL. As expected, the need to swap severely penalized the application's performance, resulting in rates of 0.55 frames per second. Moreover, the OpenGL drivers could not properly handle the large-scale vgh datasets, crashing the application as a result.

Having firmly established the need for a cluster-based approach, we tested both the cluster-aware version and the Simian/Chromium combination with the vgh datasets using 8, 16, and 32 of the C-SAFE cluster nodes. The results are summarized in Table 1. Note that, with 8 and 16 rendering nodes, the vgh subvolumes distributed by the cluster-aware version are 256x256x256 bytes and 256x256x128 bytes, respectively.

¹For the remainder of this discussion, it is assumed that the number of bytes consumed by a raw volumetric dataset is the given size multiplied by three, accounting for each of the three 1-byte components in a vgh dataset.
    Rendering approach            Number of nodes   h300_0075 vgh   h300_0130 vgh
    Cluster-aware Simian                 8               0.52            0.58
                                        16               2.15            1.47
                                        32               3.44            2.87
    Simian/Chromium combination          8               0.05            0.07
                                        16               0.04            0.05
                                        32               0.02            0.03

Table 1. Average frame rates (in frames per second) using various cluster configurations.
These subvolumes exceed the maximum "no-swap" volume size permitted by the cluster's graphics hardware (128x128x256 bytes), so in either of these configurations even the cluster-aware Simian must invoke its texture swapping facilities. However, with 16 rendering nodes the swapping is much less frequent, resulting in reasonably interactive frame rates. With 32 rendering nodes, the subvolume size is reduced to 128x128x256 bytes, so no texture swapping occurs and interactive frame rates are restored.

Chromium is not able to distribute subvolumes among the rendering nodes. As a result, Simian must rely on its texture swapping mechanism, and when the application calls for a new block to be swapped into texture memory, Chromium must transmit the block to the appropriate rendering nodes. The resulting delays impose severe performance penalties that grow with the number of rendering nodes. This behavior is reflected in the low frame rates given in Table 1.

7. CONCLUSIONS

The architecture of the parallel machine plays an important role in the success of visualization techniques. In parallel isosurfacing, since any processor can randomly access the entire dataset, the dataset must be available to each processor. Nonetheless, there is fairly high locality in the dataset for any particular processor. As a result, a shared memory or distributed shared memory machine, such as the SGI Origin, is ideally suited for this application. The load balancing mechanism also requires a fine-grained, low-latency communication mechanism for synchronizing work assignments and returning completed image tiles. With an attached graphics engine, we can display images at high frame rates without network bottlenecks. We have implemented a similar technique on a distributed memory machine, which proved challenging; frame rates were, of course, lower and scalability was reduced. We have shown that ray tracing can be a practical alternative to explicit isosurface extraction for very large datasets. As data sets get larger, and as general-purpose processing hardware becomes more powerful, we expect this to become a very attractive method for visualizing large-scale scalar data, both in terms of speed and rendering accuracy.

For cluster-based volume rendering, we have demonstrated that a cluster-aware MPI version performs far better than simply replacing the underlying shared graphics library with a parallelized library. There are several advantages to developing cluster-aware visualization tools. First, these tools can exploit application-specific knowledge to reduce overhead and enhance performance and efficiency when running in a clustered environment. Second, cluster-aware applications that are built upon open standards such as MPI are readily ported to a wide variety
of hardware and software platforms. While frameworks that mask the clustered environment may provide certain advantages, using standard interfaces allows an application's components to be reused or combined in new, more flexible ways. Third, cluster-aware applications are not dependent upon the functionality provided by a clustered rendering package. Finally, because they do not access remote resources via a lower-level framework, cluster-aware applications can exploit the capabilities of the underlying system directly.
REFERENCES
[1] G. Wyvill, C. McPheeters, and B. Wyvill. Data structures for soft objects. The Visual Computer, 2:227-234, 1986.
[2] M. Hadwiger et al. High-quality volume graphics on consumer PC hardware. IEEE Visualization 2002 Course Notes.
[3] G. Humphreys, M. Houston, Y.-R. Ng, R. Frank, S. Ahern, P. Kirchner, and J. T. Klosowski. Chromium: A stream-processing framework for interactive rendering on clusters. In Proceedings of SIGGRAPH, 2002.
[4] H. Igehy, G. Stoll, and P. Hanrahan. The design of a parallel graphics interface. In SIGGRAPH Computer Graphics, 1998.
[5] A. Kaufman. Volume Visualization. IEEE CS Press, 1991.
[6] J. Kniss, G. Kindlmann, and C. Hansen. Interactive volume rendering using multi-dimensional transfer functions and direct manipulation widgets. In Proceedings of IEEE Visualization, 2001.
[7] J. Kniss, G. Kindlmann, and C. Hansen. Multi-dimensional transfer functions for interactive volume rendering. IEEE Transactions on Visualization and Computer Graphics, pages 270-285, July 2002.
[8] M. Levoy. Display of surfaces from volume data. IEEE Computer Graphics & Applications, 8(3):29-37, 1988.
[9] C.-C. Lin and Y.-T. Ching. An efficient volume-rendering algorithm with an analytic approach. The Visual Computer, 12(10):515-526, 1996.
[10] W. E. Lorensen and H. E. Cline. Marching cubes: A high resolution 3D surface construction algorithm. Computer Graphics, 21(4):163-169, July 1987. ACM SIGGRAPH '87 Conference Proceedings.
[11] K.-L. Ma, J. S. Painter, C. D. Hansen, and M. F. Krogh. Parallel volume rendering using binary-swap compositing. IEEE Computer Graphics and Applications, 14(4):59-68, July 1994.
[12] S. Marschner and R. Lobb. An evaluation of reconstruction filters for volume rendering. In Proceedings of Visualization '94, pages 100-107, October 1994.
[13] S. Molnar, M. Cox, D. Ellsworth, and H. Fuchs. A sorting classification of parallel rendering. IEEE Computer Graphics and Applications, July 1994.
[14] M. J. Muuss. rt and remrt: shared memory parallel and network distributed ray-tracing programs. In USENIX: Proceedings of the Fourth Computer Graphics Workshop, October 1987.
[15] National Library of Medicine (U.S.) Board of Regents. Electronic imaging: Report of the Board of Regents. U.S. Department of Health and Human Services, Public Health Service, National Institutes of Health, NIH Publication 90-2197, 1990.
[16] S. Parker, M. Parker, Y. Livnat, P.-P. Sloan, C. Hansen, and P. Shirley. Interactive ray tracing for volume visualization. IEEE Transactions on Visualization and Computer Graphics, 5(3):238-250, July 1999.
[17] S. Parker, P. Shirley, Y. Livnat, C. Hansen, and P.-P. Sloan. Interactive ray tracing for isosurface rendering. In Proceedings of Visualization '98, October 1998.
[18] E. Reinhard, A. G. Chalmers, and F. W. Jansen. Overview of parallel photo-realistic graphics. In Eurographics '98, 1998.
[19] P. Sabella. A rendering algorithm for visualizing 3D scalar fields. Computer Graphics, 22(4):51-58, July 1988. ACM SIGGRAPH '88 Conference Proceedings.
[20] L. Sobierajski and A. Kaufman. Volumetric ray tracing. In 1994 Workshop on Volume Visualization, pages 11-18, October 1994.
[21] M. Sramek. Fast surface rendering from raster data by voxel traversal using chessboard distance. In Proceedings of Visualization '94, pages 188-195, October 1994.
[22] G. Stoll, M. Eldridge, D. Patterson, A. Webb, S. Berman, R. Levy, C. Caywood, M. Taveira, S. Hunt, and P. Hanrahan. Lightning-2: A high-performance display subsystem for PC clusters. In SIGGRAPH Computer Graphics, 2001.
[23] C. Upson and M. Keeler. V-buffer: Visible volume rendering. Computer Graphics, 22(4):59-64, July 1988. ACM SIGGRAPH '88 Conference Proceedings.
[24] S. Whitman. A survey of parallel algorithms for graphics and visualization. In High Performance Computing for Computer Graphics and Visualization, pages 3-22, Swansea, July 3-4, 1995.
Software Technology
On Compiler Support for Mixed Task and Data Parallelism

T. Rauber a, R. Reilein b, and G. Rünger b

a Department of Mathematics, Physics, and Computer Science, University of Bayreuth. E-mail: [email protected]

b Department of Computer Science, Chemnitz University of Technology. E-mail: {reilein, ruenger}@cs.tu-chemnitz.de

The combination of task and data parallelism can lead to an improvement of speedup and scalability for parallel applications on distributed memory machines. To support a systematic design of mixed task and data parallel programs, the TwoL model has been introduced. A key feature of this model is the development support for applications using multiprocessor tasks on top of data parallel modules. In this paper we discuss implementation issues of the TwoL model as an open framework. We focus on the design of the framework and its internal algorithms and data structures. As examples, fast parallel matrix multiplication algorithms are presented to illustrate the applicability of our approach.

1. INTRODUCTION

Parallel applications in the area of scientific computing are often designed in a data parallel SPMD (Single Program Multiple Data) style based on the MPI standard. The advantage of this method is a clear programming model, but on large parallel platforms or cluster systems the speedup and scalability can be limited, especially when collective communication operations are used frequently. The combination of task and data parallelism can improve the scalability of many applications but requires a more intricate program development. The adaptation of complex program code to the characteristics of a specific parallel machine may be quite time consuming and often results in a code structure which causes a high reprogramming effort when porting the software to another parallel system. To support the systematic development of mixed task and data parallel programs, the TwoL model has been introduced. The model provides a stepwise development process which is subdivided into several phases. Applications are hierarchically composed of predefined or user-supplied basic data parallel modules. The development starts with a specification of the parallelism inherent in the algorithm to be implemented. The specification is transformed stepwise into a coordination program by applying scheduling, static load balancing, and data distribution algorithms. During this transformation process the code is adapted to the characteristics of a specific parallel system.

In this paper we present an implementation of the TwoL model as an open compiler framework (TwoL-OF). The open framework implements the core concepts of the TwoL model and additionally provides several intermediate program representations produced within the
transformation process. The intermediate representations are produced in specific TwoL-OF formats to provide access interfaces for compiler users. The advantage is that the framework can be used both by application programmers and by algorithm developers. Application programmers can use the framework as a transformation tool by providing a specification of the problem, which is transformed into an efficient parallel program. Developers can test new scheduling and load balancing techniques by exploiting the access interfaces of the intermediate formats.

The remainder of the paper is structured as follows. Section 2 gives a short overview of the TwoL model. In Section 3 the TwoL-OF compiler is introduced and the compilation process is illustrated. Section 4 presents first results with fast matrix multiplication algorithms, and Section 5 concludes.

2. THE TwoL MODEL

The TwoL (Two Level) model has been developed to exploit the combination of task and data parallelism [6, 7]. It defines a transformation process from a specification of the parallelism inherent in an algorithm into a coordination program for a specific parallel system. The parallelism is exploited on two levels: an upper task parallel level consisting of hierarchically structured multiprocessor tasks, and a lower data parallel level. On the lower level the user provides module specifications and function implementations which declare and realize multiprocessor tasks. On the upper level consecutive transformation steps generate a coordination program from the specification of potential task parallelism. Figure 1 illustrates the derivation of the coordination program.
Figure 1. Derivation of the parallel coordination program (stepwise transformation, annotation, and selection steps lead from the non-executable specification to MPI-based coordination code).

The specification program declares multiprocessor tasks as modules and defines their data and control dependencies. It uses a special type concept to declare data types and data distribution types. The transformation applies scheduling, load balancing, and data distribution techniques, and the decisions made are included as annotations. The final coordination program contains all information about execution order, processor group sizes, and data distributions. The entire transformation process is accompanied by a parallel cost model to guide the transformations and to obtain an efficient parallel program for a specific parallel machine.

3. TwoL-OF COMPILER STRUCTURE AND IMPLEMENTATION

The TwoL open framework compiler implements the core concepts of the TwoL model, supports the guided development of mixed task and data parallel programs, and the porting to
different parallel machines. The emphasis lies on fast and easy coding support utilizing the activation of basic modules written in different imperative languages, especially C or Fortran. It offers specific interfaces to control the transformation process and to manually revise decisions made by the transformation algorithms. Therefore the intermediate code representation is explicitly stored and open for transformations and annotations.
3.1. Short overview

The compilation process starts with a specification language file describing the inherent parallelism of an algorithm. The basic components of a specification are modules, which can either be declared (basic modules) or defined (coordination modules). For basic modules, a module name and three different types for each parameter to be passed have to be declared. Each module parameter has a data type, which corresponds to a data type in the high level target language, and a data distribution type, which declares how the data are distributed over a processor group and which is used by redistribution functions to establish the data distribution required before starting the execution of a basic module. The third type is the I/O access mode, which defines the kind of access to the parameters. Basic modules can be implemented in C or Fortran and are linked to the generated program code in the final compilation step. Coordination modules have to be declared before they can be defined by a coordinating expression, which consists of module calls, data dependence operators, and control statements. These expressions are translated into syntax trees which are extended and transformed in the following compilation phases. Based on the coordinating expressions, the coordination program is generated by the compiler framework as C code augmented with MPI functions and calls to a data redistribution library interface. The resulting code can be compiled and linked with the basic module implementations using a standard C compiler.

The TwoL-OF compiler consists of two consecutive compiler stages. The first stage reads a specification program, generates the intermediate code representation, and outputs two control files for the next stage: one for type definitions, and a second file to control tree transformations and annotations. The intermediate code representation comprises different kinds of symbol tables, specification syntax trees for the coordinating expressions, and parallel data dependence graphs to attach data flow information to the syntax trees. Between both stages the control files are augmented with additional specifications supplied by automatic tools or by the user. In the second stage, after the intermediate code is read, the type definition and transformation control files are processed. The contents are used to amend symbol table entries and to transform the specification syntax trees into coordination syntax trees. C code with calls to MPI and a data redistribution library is generated after compiler passes to update the data flow information and to introduce variable copying and data redistribution. The compilation process is accompanied by the output of internal data structures as program files which can be postprocessed using the graphviz package by AT&T to generate a graphical visualization. In the next subsections we introduce the specification language and give an example of a parallel specification to illustrate the compilation process.
3.2. Specification language for multiprocessor task programs

The specification language allows the expression of data dependencies between module calls by two operators: the >--operator for data dependence and the ||-operator for data independence. Both operators can be used in infix form or as k-ary operators. Furthermore there are several statements to control the program flow, especially loops and conditions. The following grammar
defines the essential parts for expressing data dependencies and control flow.

    cexpr       ->  cexpr >- cexpr | cexpr || cexpr
                 |  || (cexpr_list) | >- (cexpr_list)
                 |  stat ( cond ) { cexpr } | call
    stat        ->  loop | while | if
    cexpr_list  ->  cexpr_list , cexpr | cexpr
A coordination module defines one coordinating expression (cexpr) which is recursively built up of sub-expressions. To illustrate the compilation process, the standard four-block matrix multiplication is used here as an example. The algorithm is defined by the following equation:
    ( a11 a12 ) ( b11 b12 )   ( a11 x b11 + a12 x b21   a11 x b12 + a12 x b22 )
    ( a21 a22 ) ( b21 b22 ) = ( a21 x b11 + a22 x b21   a21 x b12 + a22 x b22 )

The corresponding specification expresses the potential parallelism of the eight block multiplications and four block additions:

    mpart(a,a11,a12,a21,a22,n) >- mpart(b,b11,b12,b21,b22,n)
    >- || (dmm(a11,b11,c11,n/2), dmm(a12,b21,d11,n/2),
           dmm(a11,b12,c12,n/2), dmm(a12,b22,d12,n/2),
           dmm(a21,b11,c21,n/2), dmm(a22,b21,d21,n/2),
           dmm(a21,b12,c22,n/2), dmm(a22,b22,d22,n/2))
    >- || (mma(c11,d11,c11,n/2), mma(c12,d12,c12,n/2),
           mma(c21,d21,c21,n/2), mma(c22,d22,c22,n/2))
    >- mjoin(c11,c12,c21,c22,c,n)
The module mpart subdivides a matrix into four quadratic submatrices and mjoin reverses this process. Addition and multiplication of submatrices are performed by mma and dmm, respectively. This specification is processed by the first compiler stage, and the coordinating expression is stored as a specification syntax tree with a linked parallel data dependence graph for expressing data flow. Figure 2 illustrates the parallel data dependence graph for the specification example as output from the framework.

Together with the intermediate representation, which is stored as a binary file by the first stage, two text files are generated: one for type definitions and one to control transformations. The type definition file is used to specify a C or Fortran data type for the abstract types used in the specification program. If memory allocation is required for a specific data type, the corresponding statements can be supplied. Within the second file the user can control transformations and annotations of the specification syntax trees. In its unmodified form, the transformation control file expresses the maximum degree of parallelism specified. The following code fragment shows a part of the transformation control file for the block matrix multiplication example:

    par[ID34][4] {
      [0]: ID30  # mma(c11,d11,c11,n/2)
      [1]: ID31  # mma(c12,d12,c12,n/2)
      [2]: ID32  # mma(c21,d21,c21,n/2)
      [3]: ID33  # mma(c22,d22,c22,n/2)
    }
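To make the target of this transformation concrete, the following is a hypothetical sketch of the kind of MPI coordination code the second stage could emit for the par statement above: four processor groups, each executing one mma task. The signature of mma, the helper name, and the equal group split are illustrative assumptions, not the framework's actual output.

    #include <mpi.h>

    /* Basic module, implemented by the user and linked in the final step. */
    extern void mma(double* x, double* y, double* z, int n, MPI_Comm group);

    /* Coordination fragment for the par[ID34][4] statement; assumes the
       number of processes in comm is a multiple of four. */
    void run_par_ID34(double* c[4], double* d[4], int n, MPI_Comm comm) {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        int color = rank / (size / 4);      /* four equally sized groups */
        MPI_Comm group;
        MPI_Comm_split(comm, color, rank, &group);
        /* Each group runs "its" mma task data-parallel on an n/2 x n/2 block. */
        mma(c[color], d[color], c[color], n / 2, group);
        MPI_Comm_free(&group);
    }

In the actual framework the group sizes and the placement of redistribution operations are chosen by the scheduling and load balancing passes rather than fixed as here.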
Figure 2. Part of the parallel data dependence graph for block matrix multiplication.

The labels enclosed in brackets and prefixed with 'ID' are unique identifiers for the nodes of the corresponding specification syntax tree. They are also displayed in the visualization of the trees and data dependence graphs. The second parameter of the par statement specifies the number of processor groups or multiprocessor tasks, respectively. The task statement groups tasks together to be executed consecutively on a specific processor group given as first parameter; tasks would be separated by commas in that case. The number sign (#) marks the start of a comment; the comments here are generated for the sake of clarity. The following subsection presents the internal structure of the compiler framework in more detail.

3.3. Algorithms and data structures

The first compiler stage of the TwoL open framework comprises a syntax-oriented translation scheme to build up specification syntax trees, the construction of symbol tables for data types, data distribution types, variables, and modules, and the creation of parallel data dependence graphs as representations of the data flow. The second stage updates the symbol tables and transforms the syntax trees using information in specific control files provided by automatic tools or directly by the user. After the insertion of the required data redistribution operations, the final coordination code is generated.

A specification program which contains definitions of coordination modules by coordinating expressions is translated into syntax trees which are realized as C data structures. A tree node comprises a type, a unique identifier, pointers to child nodes, and a union to maintain type-specific data. Leaf nodes, which represent module calls, have an additional pointer to a data structure for the parallel data dependence graph. The nodes of this graph represent parameters of module calls, and an edge of the graph defines a true data dependence between two accesses to a specific variable. A parallel data dependence graph is constructed by a depth-first sweep over the corresponding specification syntax tree. If the algorithm detects an input parameter reading a particular variable, it creates a new node for the parallel data dependence graph and traverses backwards a list of previously visited nodes to find an appropriate output for that variable. The search procedure takes the data dependence information stored in the specification syntax tree into account. When an output is found which does not already have a graph node, a new one is constructed and linked with the current one by pointers. These links define a true data dependency between two accesses to a variable.

Specification syntax trees are also used to generate a text representation of transformation
control structures that are output to the transformation control file. For each ||-operator node the file contains a specific program structure; see for example the control file in Section 3.2. According to these structures, the following transformation of the syntax trees is applied by the translation scheme of the second compiler stage. Each ||-operator node is replaced by a fork node that is used to insert activations of multiprocessor tasks: the data independence expressed by the ||-operator is replaced by a fork into multiprocessor tasks. A fork node stores the number of processor groups and has new task nodes as children to define multiprocessor tasks. The former children of the ||-operator node are linked to the particular task nodes representing the processor group which will execute them in the final code. After this transformation the syntax tree defines the parallel execution order of a coordination function. The second compiler stage maintains variables and introduces copies if a variable is accessed in parallel. Based on the information contained in the parallel data dependence graphs, data redistribution operations are inserted in the trees as additional nodes. The resulting trees comprise all information needed to generate the final coordination code. Figure 3 illustrates the internal structure of the TwoL-OF compiler and the main compilation steps:

1a. Parsing the specification program to build up specification syntax trees and symbol tables.
1b. Construction of the parallel data dependence graphs.
1c. Generation of control files and write-out of intermediate code.
2a. Processing of intermediate code and parsing of transformation control and type definition files to transform syntax trees and to update symbol table entries.
2b. Maintenance of variable copies and data exchange operations.
2c. Generation of the coordination program as C code.
Figure 3. TwoL-OF structure (left) and main compilation steps (right).
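As an illustration of the C data structures described in Section 3.3, here is a minimal sketch of a syntax-tree node and a dependence-graph node; all type and field names are invented for this sketch and do not reproduce the framework's actual declarations.

    /* Sketch of the specification syntax tree and dependence graph nodes. */
    struct PDGNode;                      /* parallel data dependence graph node */

    enum NodeType { CALL, DEP_OP, INDEP_OP, FORK, TASK, LOOP, WHILE, IF };

    struct TreeNode {
        enum NodeType     type;
        int               id;            /* unique identifier, e.g. ID34 */
        struct TreeNode** children;      /* pointers to child nodes */
        int               numChildren;
        union {                          /* type-specific data */
            struct { const char* moduleName; } call;
            struct { int numGroups; }          fork;
            struct { int group; }              task;
        } u;
        struct PDGNode*   pdg;           /* leaf (module call) nodes only */
    };

    struct PDGNode {                     /* one node per module-call parameter */
        struct TreeNode* call;           /* owning module call */
        int              paramIndex;
        struct PDGNode*  definedBy;      /* link to the producing output: a
                                            true data dependence */
    };

The ||-to-fork transformation then amounts to rewriting an INDEP_OP node into a FORK node whose TASK children carry the former sub-expressions.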
4. EXPERIMENTS
To evaluate the TwoL-OF compiler, two versions of the four-block matrix multiplication C = A x B have been implemented: the standard algorithm with 8 block multiplications and 4 additions, and the Strassen scheme with 7 multiplications and 18 additions [8]. The inner multiplications of two block matrices are done using the Fox algorithm. To determine the overhead, both schemes have been implemented by hand and as a specification program supplemented with appropriate basic modules. All four versions have been measured on a Linux cluster (CLiC) built of 800 MHz
Pentium III PCs interconnected with switched Fast Ethernet. Figure 4 shows the overhead of the compiler-generated version, given as the ratio between its runtime and the runtime of the hand-tuned version.
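For reference, the Strassen scheme mentioned above trades one block multiplication for extra block additions by forming seven products of block combinations; in the usual textbook notation (this is the standard formulation, not code from the paper):

    M1 = (A11 + A22)(B11 + B22)        M5 = (A11 + A12) B22
    M2 = (A21 + A22) B11               M6 = (A21 - A11)(B11 + B12)
    M3 = A11 (B12 - B22)               M7 = (A12 - A22)(B21 + B22)
    M4 = A22 (B21 - B11)

    C11 = M1 + M4 - M5 + M7            C12 = M3 + M5
    C21 = M2 + M4                      C22 = M1 - M2 + M3 + M6

Counting the additions and subtractions above gives exactly the 18 block additions cited for the Strassen scheme.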
[Figure 4: two line plots of the overhead factor (roughly 1.0-1.15) versus the processor number (16-256), with one curve per matrix size 2400, 3600, and 4800.]
Figure 4. Overhead factors for the standard scheme (left) and the Strassen scheme (right) on CLiC.

The results show a maximum overhead factor of 1.12 for these examples. This corresponds to first experiences made with other codes and applications. It has been observed that hand-tuned programs usually require fewer variable copies and can organize data redistribution more compactly. On the other hand, they require a higher development effort and cannot benefit from the automatic detection and insertion of data redistribution operations.
5. RELATED WORK AND CONCLUSION

Many research groups have developed compiler tools and parallel environments which can integrate task and data parallelism. Most of them extend Fortran or High Performance Fortran (HPF) with task parallel constructs [3, 9, 2]; see [1] for a good overview. Newer approaches include the coordination of data parallel programs written in skeleton-like or data parallel languages [5], parallel Haskell programming [10], and task and data parallel programming in Java [4].
In this paper the TwoL open framework compiler, an implementation of the core concepts of the TwoL model, was introduced as an open platform for algorithm and application developers. We took a closer look at the structure of the framework and presented early compilation results. The average overhead factor of compiler-generated programs is reasonably small and is caused by a higher number of variable copies in combination with a slight communication overhead. This factor is expected to be reduced by the development of optimization algorithms for message packing and for an efficient placement of data redistribution operations. Future work will include the definition of an interface to the binary intermediate representation and support for runtime prediction.
REFERENCES

[1] H. Bal and M. Haines. Approaches for Integrating Task and Data Parallelism. IEEE Concurrency, 6(3):74-84, 1998.
[2] P. Banerjee, J. Chandy, M. Gupta, E. Hodge, J. Holm, A. Lain, D. Palermo, S. Ramaswamy, and E. Su. The Paradigm Compiler for Distributed-Memory Multicomputers. IEEE Computer, 28(10):37-47, 1995.
[3] I. Foster and K.M. Chandy. Fortran M: A Language for Modular Parallel Programming. Journal of Parallel and Distributed Computing, 25(1):24-35, 1995.
[4] F. Kuijlman, H.J. Sips, C. van Reeuwijk, and W.J.A. Denissen. A Unified Compiler Framework for Work and Data Placement. In Proc. of ASCI 2002, pages 109-115, 2002.
[5] S. Pelagatti and D.S. Skillicorn. Coordinating Programs in the Network of Tasks Model. Journal of Systems Integration, 10(2):107-126, 2001.
[6] T. Rauber and G. Rünger. The Compiler TwoL for the Design of Parallel Implementations. In Proc. of the 4th International Conference on Parallel Architectures and Compilation Techniques (PACT'96), pages 292-301. IEEE Computer Society Press, 1996.
[7] T. Rauber and G. Rünger. A Transformation Approach to Derive Efficient Parallel Implementations. IEEE Transactions on Software Engineering, 26(4):315-339, 2000.
[8] V. Strassen. Gaussian Elimination is not Optimal. Numerische Mathematik, 13:354-356, 1969.
[9] J. Subhlok and B. Yang. A New Model for Integrated Nested Task and Data Parallel Programming. In Proc. of ACM SIGPLAN PPoPP '97, pages 1-12, 1997.
[10] P.W. Trinder, H.-W. Loidl, and R.F. Pointon. Parallel and Distributed Haskells. Journal of Functional Programming, 12(4/5):469-510, 2002.
Distributed Process Networks - Using Half FIFO Queues in CORBA

A. Amar a*, P. Boulet a, J.-L. Dekeyser a, and F. Theeuwen b

a Laboratoire d'Informatique Fondamentale de Lille, Université de Lille 1, Cité Scientifique, Bât. M3, 59655 Villeneuve d'Ascq cedex, France

b Philips ED&T / Synthesis, WAY 3.13, Prof. Holstlaan 4, 5656 AA Eindhoven, The Netherlands

* This work has been supported by the ITEA 99038 project, Sophocles.

Process networks are networks of sequential processes connected by channels behaving like FIFO queues. They are used in signal and image processing applications that need to run in bounded memory for infinitely long periods of time, dealing with possibly infinite streams of data. This paper is about a distributed implementation of this computation model. We present the implementation of a distributed process network using distributed FIFOs to build the distributed application. The platform used to support this is the CORBA middleware.

1. INTRODUCTION

Kahn process networks [6, 7] are well adapted to model many parallel applications, especially dataflow applications (signal processing, image processing). In this model, processes communicate only via unbounded first-in first-out (FIFO) queues. This model has a dataflow flavor and can express a high degree of concurrency, which makes it particularly well suited to model intensive signal processing applications or complex scientific applications. The model makes no assumption on the computation load of the different processes and thus is heterogeneous by nature. Distributed architectures provide an attractive alternative to supercomputers in terms of computation power and cost to execute such complex and computation intensive applications. The two main weak points of these architectures are their communication capabilities (relatively high latency) and often the heterogeneity of their hardware.

We present in this paper a distributed implementation of the process network model on heterogeneous distributed hardware. The different processing power of the connected computers is a good support for the different computation needs of the networked processes. We have chosen to use the Common Object Request Broker Architecture (CORBA) [9] middleware to handle the communications for its interoperability properties. Indeed, each process of the process network can be written in a different language and run on different hardware, provided that these are supported by the chosen Object Request Broker (ORB). In addition to the heterogeneity, our implementation presents the following characteristics:
32 9 automation of data transfer between distributed processes 9 dynamic and interactive linking of the processes to form the data flow 9 hybrid data-driven, demand-driven data transfer protocol, with thresholds for load balancing 9 the implementation was carried out such as to enable a distributed or local execution without any change to the program source. This paper is organized as follows. In section 2, we motivate our approach and we present our implementation. Section 3 describes a process network deployment and distributed execution. The transfer strategies (demand and data driven) are detailed in section 4. And we finally outline our conclusions and plans for future work in section 5. 2. DESIGN AND IMPLEMENTATION 2.1. Related work
The Kahn process network model has been proposed by Kahn and MacQueen [6, 7] to easily express concurrent applications. Processes communicate only through unidirectional FIFO queues. Read operations are blocking. The number of tokens produced and their values are completely determined by the definition of the network and do not depend on the scheduling of the processes. The choice of a scheduling of a process network only determines if the computation terminates and the sizes of the FIFO queues. Some networks do not allow a bounded execution. Parks [ 10] studies these scheduling problems in depth. He compares three classes of dynamic scheduling: data-driven, demand-driven or a combination of both with respect to two requirements: 1. Complete execution (the application should execute completely, in particular if the program is non-terminating, it should execute forever). 2. Bounded execution (only a bounded number of tokens should accumulate on any of the queues). These two properties are shown undecidable by Buck [4] on boolean dataflow graph which are a special case of process networks. Thus they are also undecidable for the general case of process networks. Data-driven schedules respect the first requirement, but not always the second one. Demand-driven schedules may cause artificial deadlocks. A combination of the two is proposed by Parks [ 10] to allow a complete, unbounded execution of process networks when possible. In the context of a distributed execution of a process network, the process execution is inherently asynchronous. We have thus chosen a completely asynchronous scheduling: each process runs in its own thread that is scheduled by the underlying operating system. As explained by Parks and Roberts in [ 11 ], who use a similar scheduling technique, using bounded communication queues allow for a fair execution of the process network. This blocking write when the output queue is full can lead to deadlocks. Determining a priori if the queue length is large enough to avoid such deadlocks is undecidable. We provide a way for the user to modify this length at runtime.
Several implementations of process networks are used for different purposes: for heterogeneous modeling with Ptolemy II [8], for signal processing application modeling with YAPI [5], and for metacomputing in the domain of Geographical Information Systems with Jade/PAGIS [13]. To our knowledge, only the Jade/PAGIS implementation and the one by Parks and Roberts [11] are distributed. Parks and Roberts use Java object serialization to automate the distribution of the network processes, while we use a central console to deploy the processes. In Jade, all communications proceed through a central communication manager, while in Parks and Roberts' and our implementation the processes communicate directly, which allows a greater scalability. Only our implementation allows the coupling of processes written in different languages.

2.2. Design directions

The design of our distributed process network implementation was done so as to:

• enable users to simulate their network model quickly and effectively;
• keep source compatibility with the Yapi library (this library, developed by Philips [5], implements the process network model for a local execution; Yapi is a C++ library which focuses on signal processing applications);
• enable distributed and local execution without any change to the program source.

The idea of the Yapi syntax is to group processes into process networks. The processes communicate via ports. These ports are linked by point-to-point unidirectional FIFO queues. A process network can be seen as a process and used in the same way. This hierarchical construction allows an easy modeling of complex applications. For this study, we have completely reengineered Yapi to be able to distribute any application over a CORBA bus without any change to the application code. In our implementation the communications are hidden from the programmer, who can nevertheless configure the data transfer parameters. The reader can refer to [3] for more implementation details.

3. PROCESS NETWORK DISTRIBUTION

3.1. Deployment

To control the distributed process networks, a console has been developed. It consists of a program which controls the process connection and execution by the use of a simple language. It also provides a frontend used for monitoring. The presence of a manager program is contrary to the peer-to-peer character of component systems; however, the console is minimal and serves only as collaboration control. All the communications between the components are done without involving the console, through the distributed FIFO queues presented in Section 3.2. The FIFO links that form the process network are made interactively (or via a script) by this console. The use of a console allows more flexibility in the connection choice and a dynamic control of the components and the communication parameters. The use of an interactive console and the fact that the FIFOs are bounded also allow for an incremental development where computations can start even if the application is not complete. When all the output queues are full, the computation is suspended and can resume as soon as a consuming component is attached to the not-yet-connected output queues.
3.2. Distributed FIFOs

The FIFO queues are completely distributed, and distributed process networks communicate directly, without a central point, contrary to what is done in Jade [13]. These queues implement the blocking read needed by the process network model but, as they are bounded, writes may also block when the FIFO queue is full. A deadlock can appear but, as the execution is fully distributed, deadlock detection is difficult and has not been implemented. To guarantee code reuse with our implementation, the distributed FIFOs must work without programmer intervention or code change. This was done by encapsulating the distributed FIFOs (the CORBA objects) in the FIFOs. Figure 1 shows the structure of the FIFO objects. For the programmer, no difference exists between the local and the distributed FIFO queue. To determine whether a FIFO is distributed or not, the runtime uses its ports. When a FIFO is local to the program, it has both an input and an output port. The distributed FIFO, on the other hand, is represented by two half FIFOs: the output FIFO queue (producer side) and the input FIFO queue (consumer side). Each half FIFO has one port (an input port for the input FIFO, and an output port for the output FIFO) and must be linked to the other half FIFO. The runtime uses this property to activate the CORBA object only on the distributed FIFOs, and thus the FIFO distribution is transparent for the programmer.
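The port-based dispatch can be sketched as follows (class and method names are ours and hypothetical, not the actual reengineered Yapi API):

    // Hypothetical sketch of the FIFO structure described above.
    interface Fifo<T> {
        void write(T token) throws InterruptedException;   // may block when full
        T read() throws InterruptedException;              // blocks when empty
    }

    // A FIFO local to one program: both ports are bound in the same process.
    class LocalFifo<T> implements Fifo<T> {
        private final java.util.concurrent.ArrayBlockingQueue<T> queue;
        LocalFifo(int capacity) { queue = new java.util.concurrent.ArrayBlockingQueue<>(capacity); }
        public void write(T token) throws InterruptedException { queue.put(token); }
        public T read() throws InterruptedException { return queue.take(); }
    }

    // Producer-side half of a distributed FIFO: tokens written locally are
    // forwarded to the consumer-side half, which here stands in for the
    // CORBA object reference of the real implementation.
    class OutputHalfFifo<T> implements Fifo<T> {
        private final Fifo<T> consumerSide;
        OutputHalfFifo(Fifo<T> remote) { consumerSide = remote; }
        public void write(T token) throws InterruptedException { consumerSide.write(token); }
        public T read() { throw new UnsupportedOperationException("read on producer-side half"); }
    }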
Figure 1. Structure of the FIFO objects: a local FIFO with an input and an output port, and a distributed FIFO split into two half FIFOs (producer side and consumer side) linked through a CORBA object reference, with operations such as ask, offer, satisfyRequest, sync_ask, last_request, notify4Send, getLength, link and unlink.
a) (State diagram of the asynchronous node algorithm: get data, update master, R and Q, compute feasibility, solve the linear program, send the solution and cuts; branches on [objVal > master.objVal], [else] and [termination], with cardinality = #slaves.)

b)
    Solution master;        // solution from master
    Vector Q[], R[];        // cuts from slaves
    Coordination coord;     // coordination object
    Accessible from;        // source of get operation
    boolean synchronous;    // is synchronous version
    boolean forward;        // sweep direction
    LP lp;                  // linear program object

    // loop body of the decomposition algorithm
    public void iteration(int iterationNumber) {
        Solution solution;  // solution of LP
        Cut cut;            // cut for master
        from = (synchronous ?
                (forward ? coord.getPredecessor() : coord.getSuccessors())
                : coord.getAll());
        from.get(synchronous);
        // update master, R, Q ...
        lp = new LP(master, Q, R);             // create and solve LP ...
        coord.getSuccessors().put(solution);   // send solution ...
        coord.getPredecessor().put(cut);       // send cut
    }
Figure 2. a) Node algorithm for nested Benders decomposition b) DAT implementation
Minimize   Σ_{n∈𝒩} f_n(x_n)

∀(n ∈ 𝒩):   T_n x_{pred(n)} + A_n x_n = b_n,   x_n ∈ S_n
where 𝒩 denotes the set of nodes in the tree. A node n ∈ 𝒩 is associated with a local objective function f_n with respect to decision variables x_n. A_n, b_n and T_n describe the constraints representing the dependency on the decision variables at the predecessor node pred(n); S_n describes constraints local to node n, for example budget constraints. In the case of the root node, T_n = 0. In the nested Benders' decomposition method, every node performs an iterative procedure, acting as a master, a slave, or both [1]. The master solves a linear program, sends the solution to the slaves (the successor nodes), and receives from them additional constraints (cuts) which will improve the master's solution in the next iteration. In the synchronous version of the method, each node receives the solution from its master, builds and solves the local problem, then sends the solution to its slaves and waits for cuts from every slave. The solution process is a sequence of forward and backward sweeps over the whole tree. In the asynchronous version, every node waits until it has received data from at least one of its slaves or from the master [9]. Figure 2 shows the asynchronous node algorithm and fragments of the algorithm layer code for both versions. Note that the Accessible variable from is used to express the different cases of getting data from neighbor nodes. Figure 3 shows the results of initial experiments with a multistage financial portfolio optimization problem and tree sizes from 127 nodes to 255 nodes. Every tree node runs as a separate thread; the nested Benders decomposition code calls, within a synchronized block, via JNI and C, the E04MFF NAG Fortran LP routine. For the distribution of the tree, the root node and its descendants, up to a specific depth in the tree, are mapped to the "root" processor. The subtrees emanating from the nodes at that depth are mapped as a whole to the remaining compute nodes; they require communication with the root processor only.
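To make the exchange of solutions and cuts concrete, a textbook form of the node subproblem may be written as follows (our illustration of standard nested Benders machinery, not necessarily the exact formulation used here): given the current solution x̂_{pred(n)} of its predecessor, node n repeatedly solves

    min_{x_n, θ}   f_n(x_n) + θ
    s.t.   A_n x_n = b_n − T_n x̂_{pred(n)},   x_n ∈ S_n,
           θ ≥ e_k + E_k x_n   for every cut k received from its successors,

and the dual multipliers of the coupling constraints yield the cut that the node sends back to its predecessor.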
Figure 3. Execution times on a) a network of Sun workstations and b) a Beowulf cluster (tree sizes 127, 255 and 511 nodes, synchronous and asynchronous versions, over 1, 3 and 5 compute nodes)

The initial size of the local constraint matrices was 7 × 6; the values for the depth parameter d of the distribution of n_t tree nodes onto n_c compute nodes were chosen as (n_t/n_c, d) = (127/5, 4), ({127/3, 255/5}, 5), and ({255/3, 511/5}, 6). The results expose some properties of the algorithm. (1) In the synchronous version, the tree nodes perform on average fewer iterations than in the asynchronous one; this is reflected in shorter execution times on a single compute node. (2) In the asynchronous version, the tree nodes spend less time waiting for new data. When running in parallel, a compute node as a whole is idle when all of its tree nodes are waiting; shorter processor idle times result in larger speedups for the asynchronous version. Still, these are rather small due to small node problem sizes, resulting in weak computation/communication ratios; we expect better numbers for larger problems (which currently suffer from numerical instabilities). (3) The number of iterations per tree node also increases with the tree size. The additional increase in the number of communication operations is seen as one reason for the longer execution times of the asynchronous version with larger trees in parallel. In addition, low-level effects such as the thread scheduling overhead of the particular runtime system (JVM and operating system) have to be taken into account.

5. CONCLUSIONS

In this paper we have presented the Distributed Active Tree programming model, which allows the application programmer to express tree-structured iterative algorithms at a high level in a natural way. The model can be implemented with a variety of protocols and communication mechanisms, including web services and grid technology. We described an implementation on top of Java/RMI, with the sole use of Java's multithreading, communication and synchronization mechanisms. As a case study, a parallel decomposition technique for solving large-scale stochastic optimization problems has been implemented, in a synchronous and in an asynchronous version. Within the Aurora project, the nested Benders decomposition has been parallelized using OpusJava [6], and the DAT with the coordination implemented on top of JavaSymphony [5]. An alternative optimization algorithm is described in [2], parallel decomposition techniques
and their implementation in [4, 10, 13, 14]. For writing high-performance applications in Java, language extensions have been defined. Spar [12] provides extensive support for arrays, such as multidimensional arrays, specialized array representations, and tuples; it supports data-parallel programming and allows, via annotations, for an efficient parallelization. Titanium [15] is a Java dialect with support for multidimensional arrays; it provides an explicitly parallel SPMD model with a global address space and global synchronization primitives. HPJava [3] adds SPMD programming and collective communication to Java. DAT does not change the syntax or semantics of Java and is specifically targeted at a high-level formulation of tree-structured iterative algorithms and a highly modular architecture. A classification of language extensions, libraries, and JVM modifications for high-performance computing in Java is given in [8]. The implementation of the nested Benders decomposition algorithm is subject to optimization along various dimensions, such as the tree distribution (including dynamic pruning and rebalancing of the tree), the loop scheduling strategy, the underlying communication mechanism, LP solution techniques (warm start of the LP solver), and the mapping of scenario tree nodes to local problems. The Distributed Active Tree is a tool to implement and combine variants of all contributing parts, to study the interplay of effects and to achieve a high-performance application.

REFERENCES
[1] J.F. Benders. Partitioning procedures for solving mixed-variable programming problems. Numer. Math., 4:238-252, 1962.
[2] S. Benkner, L. Halada, M. Lucka. Parallelization Strategies of Three-Stage Stochastic Program Based on the BQ Method. In Parallel Numerics'02, Theory and Applications, R. Trobec, P. Zinterhof, M. Vajtersic, A. Uhl (eds.), pp. 77-86, October 23-25, 2002.
[3] B. Carpenter, G. Zhang, G. Fox, X. Li, Y. Wen. HPJava: data parallel extensions to Java. Concurrency: Practice and Experience, 10(11-13):873-877, 1998.
[4] M.A.H. Dempster, R.T. Thompson. Parallelization and aggregation of nested Benders decomposition. Annals of Operations Research, 81:163-187, 1998.
[5] T. Fahringer, A. Jugravu, B. Di Martino, S. Venticinque, H. Moritsch. On the Evaluation of JavaSymphony for Cluster Applications. In Proceedings of the IEEE International Conference on Cluster Computing (Cluster2002), Chicago, Illinois, September 2002.
[6] E. Laure, H. Moritsch. Portable Parallel Portfolio Optimization in the Aurora Financial Management System. In Proceedings of SPIE ITCom 2001 Conference: Commercial Applications for High-Performance Computing, Denver, Colorado, August 2001.
[7] D. Lea. Concurrent Programming in Java. Addison-Wesley, Reading, Mass., 1997.
[8] M. Lobosco, C. Amorim, O. Loques. Java for high-performance network-based computing: a survey. Concurrency and Computation: Practice and Experience, 14:1-31, 2002.
[9] H. Moritsch, G.Ch. Pflug, M. Siomak. Asynchronous nested optimization algorithms and their parallel implementation. In Proceedings of the International Software Engineering Symposium, Wuhan, China, March 2001.
[10] S.S. Nielsen, S.A. Zenios. Scalable parallel Benders decomposition for stochastic linear programming. Parallel Computing, 23:1069-1088, 1997.
[11] G.Ch. Pflug, A. Świętanowski, E. Dockner, H. Moritsch. The AURORA Financial Management System: Model and Parallel Implementation Design. Annals of Operations Research, 99:189-206, 2000.
[12] C. van Reeuwijk, F. Kuijlman, H.J. Sips. Spar: a set of extensions to Java for scientific computation. Concurrency and Computation: Practice and Experience, 15(3-5):277-297, 2003.
[13] A. Ruszczynski. Parallel decomposition of multistage stochastic programming problems. Math. Programming, 58:201-228, 1993.
[14] H. Vladimirou, S.A. Zenios. Scalable parallel computations for large-scale stochastic programming. Annals of Operations Research, 90:87-129, 2000.
[15] K. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Liblit, A. Krishnamurthy, P. Hilfinger, S. Graham, D. Gay, P. Colella, A. Aiken. Titanium: a high-performance Java dialect. Concurrency and Computation: Practice and Experience, 10(11-13):825-836, 1998.
A Rewriting Semantics for an Event-Oriented Functional Parallel Language

F. Loulergue

Laboratory of Algorithms, Complexity and Logic, 61, avenue du Général de Gaulle, 94010 Créteil cedex, France

This paper presents the design of the core of a parallel programming language called CDS*. It is based on explicitly-distributed concrete data structures and features compositional semantics, higher-order functions and explicitly distributed objects. The denotational semantics is outlined, the (equivalent) operational semantics is presented and a new realization of the latter is given as a rewriting system.

1. INTRODUCTION

Resource-aware programming tools and programs with measurable utilization of network capacity, where control parallelism can be mixed with data parallelism, are advocated. In [8, 5] we proposed semantic models for languages whose semantics is functional and whose programs are explicitly parallel. Such languages address the above-stated requirements by expressing data placement, and hence communications, explicitly, allowing higher-order functions to take placement strategies and generic computations as arguments, allowing higher-order functions to monitor communications within other functions, and yet avoiding the complexity of concurrent programming. We have also introduced the elements of an explicitly-parallel functional language CDS*: denotational semantics, operational semantics and a full abstraction result. It is inspired by Berry and Curien's sequential language CDS [1] but uses Brookes and Geva's generalized concrete data structures (gcds) [2], for compatibility with parallel execution. Here we present a detailed rewriting semantics. Unlike Multilisp [6] or CD-Scheme [10], CDS* causes no dynamic process creation, in order to facilitate performance prediction. Moreover, all the possible events of a CDS* program are declared statically, together with their physical processor (or static process) location. We call this feature explicit processes and share it with Caml-Flight [4] and the proposed BSP library standard [7]. CDS* improves on Caml-Flight and BSP by its compositional semantics. User-defined functions define dependencies between (explicitly-located) events and thus prescribe the communications generated by their application. As observed by Berry and Curien, a program of functional type on concrete structures can observe event dependencies inside a functional argument and thus compare different algorithms before applying them. In our context, this means that a second-order function can compare first-order functions' load-balancing and communication properties before applying them. This goes one step beyond the proposal of Mirani and Hudak [9], which was to use explicit and programmable schedules but to obtain cost information from system calls. Compared with data-flow programming, CDS* is a generalization
by its inclusion of higher-order functions, but it is beyond the scope of this paper to make an exact comparison. Let us simply observe that connecting deterministic programs with streams is a special case of programming with sections. The following sections present the language's denotational and operational semantics and then the rewriting semantics.

2. CDS* AND GENERALIZED CONCRETE DATA STRUCTURES WITH INDICES

A CDS* program is made of type definitions, followed by term definitions and a request to evaluate a given term. The definition of a type inference system, of parameterized and polymorphic types, is an important open problem. As a result, our (strong) typing is explicit and monomorphic. A CDS* type is a gcds with explicit process indices. A gcds is a set of cells with allowed values for each one, seen as a game to be played. Certain cells can be filled at any time and others can only be filled after being enabled by specific finite sets of filling events. A valid gcds configuration is called a state and the object of the game is to compute states monotonically: never erasing or changing the value given to a cell. As such, states are generalized traces/streams. A program of functional type τ → τ′ describes a continuous function from states of gcds τ to states of gcds τ′. Program syntax is almost standard for a functional language except for one crucial difference: there is no λ-binding. Elementary terms are simply an enumeration of finite states, as sets of events. Now, unlike standard functional programs, which denote abstract functions, a CDS* program denotes a function between concrete domains which can be (concretely) encoded by the state of a special-purpose exponential gcds. As a result, elementary terms of type τ → τ′ enumerate so-called exponential events, which are in fact functional dependencies between (sufficient) input events and (necessary) output events. A generalized concrete data structure is a tuple ⟨C, …

(Rules of the operational semantics: (CL1), (P1), (CUR), (UNC), (AP1), (AP2), (FIX), (C1), (C2), (C3), (TRN), (E).)
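As a small illustration of these notions (our example, not one taken from the paper): a gcds of booleans can be given a single initial cell B with allowed values tt and ff; its states are ∅, {B = tt} and {B = ff}. A gcds of pairs of booleans has two initial cells B.1 and B.2, and a state such as {B.1 = tt} describes a pair whose second component has not (yet) been computed. An exponential event of type bool → bool may then record, for instance, that the input event B = tt is sufficient to produce the output event B = ff, which is exactly a functional dependency in the above sense.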
4. TOWARDS AN ABSTRACT MACHINE: REWRITING SEMANTICS

Like the natural semantics of the usual functional languages, the operational semantics cannot be used in practice. Thus we investigate a new semantics which is closer to an implementation. The
rewriting semantics is defined by figures 2, 3 and 4 below, representing possible transitions from left columns to right columns. The global data structure is a multi-set of tasks together with a function (a set of pairs) mapping syntactic nodes of the term being evaluated to tables. In the figures, the columns contain only the tasks relevant to the rules; it must be understood that the remaining tasks of the multi-set are left unchanged. There are two main forms of tasks. The first component of a task is the name of a cell together with the term whose type contains it. The last component of a task is always a mode: either a value computed for the given cell, a failure marker ⊥ to indicate that the cell could not be filled, or a request mode. A simple request mode "?" means a general request to fill the cell. An indexed request mode ?_c means a request with the added information that the value sought is also that of cell c. As an example of this last kind of mode, see rule CL2 in the first figure. A last kind of mode, ?_tr, indicates that a slice of state (A(x) in the operational semantics) is currently being filled. A task with this mode has a third, middle, component. This middle component stores a set of cells remaining to fill and the current result of the attempt at filling the slice. This result is either failure on all cells (⊥) or success on at least one cell (!). A special kind of task is also used (in figure 4) for filling the so-called F-tables when evaluating a composition term. The following definition is needed:

V(x, c) = v if cv ∈ x, and V(x, c) = ⊥ if cv ∉ x.
In general, transition rules come in pairs: a call for evaluation and a return with the result, for example CL1 with CL2. The rules in figure 2 merely follow the syntactic structure of terms. For example, rule ST must be read as: if the multi-set of tasks contains the task (c_x, ?), then it will be replaced by the task (c, v) if cv ∈ x, and by the task (c, ⊥) otherwise.
Figure 2. Rules without table (rules ST, CL1, CL2, CR1, CR2, PL1, …; each rule replaces a task such as (c_x, ?) by a task carrying the computed value or ⊥)

let a = ref 0 in
let danger = mkpar (fun pid -> a := pid; (!a) mod 2 = 0) in
if danger at !a then E1 else E2
First, this expression creates a location a at each processor, initialized at 0 everywhere. For the BSMLlib library, each processor has this value in its memory. Second, a boolean parallel vector danger is created, which is trivially true if the processor number is even and false otherwise. Thus, from the BSMLlib point of view, the location a now has a different value at each processor. After the ifat construct, some processors would execute E1 and others E2. But the ifat is a global synchronous operation, and all the processors need to execute the same branch of the conditional. If this expression had been evaluated with the BSMLlib library, we would have obtained an incoherent result and a crash of the BSP machine. The goal of our new semantics is to dynamically reject this kind of problem (and to have an exception raised in the implementation).
3. DYNAMIC SEMANTICS OF BSML WITH IMPERATIVE FEATURES

This section introduces the syntax and dynamic semantics of a core language, together with some conventions, definitions and notations that are used in the paper.

3.1. Syntax

The expressions of mini-BSML, written e, have the following abstract syntax:

e ::= x | (e e) | fun x → e | c | op | let x = e in e | (e, e) | ℓ | if e then e else e | if e at e then e else e

In this grammar, x ranges over a countable set of identifiers. The form (e e′) stands for the application of a function or an operator e to an argument e′. The form fun x → e is the lambda-abstraction that defines the function whose parameter is x and whose result is the value of e. Constants c are the integers and the booleans, and we assume a unique value () of type unit; this is the result type of assignment (as in Objective Caml). The set of primitive operations op contains arithmetic operations, the fixpoint operator fix, the test function isnc of nc (which plays the role of Objective Caml's None), our parallel operations (mkpar, apply, put) and our store operations ref, ! and :=. We write e1 := e2 for :=(e1, e2). Locations are written ℓ, and pairs (e, e). We also have two conditional constructs: the usual conditional if then else and the global conditional if at then else. We write F(e) for the set of free variables of an expression e; let and fun are the binding operators and the set of free variables of a location is empty. It is defined by trivial structural induction on e. Before presenting the dynamic semantics of the language, i.e., how the expressions of mini-BSML are computed to values, we present the values themselves. There is one semantics per value of p, the number of processes of the parallel machine. In the following, ∀i means ∀i ∈ {0, …, p−1}, and the expressions are extended with enumerated parallel vectors ⟨e, …, e⟩ (nesting of parallel vectors is prohibited; our type system enforces this restriction [4]). The values of mini-BSML are defined by the following grammar:

v ::= fun x → e    functional value
   |  c            constant
   |  op           primitive
   |  ⟨v, …, v⟩    p-wide parallel vector value
   |  (v, v)       pair value
   |  ℓ            location
3.2. Rules

The dynamic semantics is defined by an evaluation mechanism that relates expressions to values. To express this relation, we use a small-step semantics. It consists of a predicate between an expression and another expression, defined by a set of axioms and rules called steps. The small-step semantics describes all the steps of the calculus from an expression to a value. We suppose that we evaluate only expressions that have been type-checked [4]. Unlike on a sequential computer with a sequential language, a unique store is not sufficient: we need to express the stores of all our processors. We assume a finite set 𝒩 = {0, …, p−1} which represents the set of processor names; we write i for these names and N for the whole network. Now we can formalize the location and the store for each processor and for the network. We write s_i for the store of processor i, with i ∈ 𝒩. We assume that each processor has a store and an infinite set of addresses which are different at each processor (we could distinguish them by the name of the processor). We write S = [s_0, …, s_{p−1}] for the sequence of all the stores of our parallel machine. The imperative version of the small-step semantics has the form e / S ⇀ e′ / S′. We also write e / s ⇀ e′ / s′ when only one store of the parallel machine can be modified. We write ⇀* for the transitive closure of ⇀, and e_0/S_0 ⇀* v/S for e_0/S_0 ⇀ e_1/S_1 ⇀ e_2/S_2 ⇀ … ⇀ v/S. We begin the reduction with a set of empty stores {∅_0, …, ∅_{p−1}}, written ∅_𝒩. To define the relation ⇀, we begin with some axioms for two kinds of reductions:
1. e/s_i ⇀_i e′/s′_i, which can be read as "in the initial store s_i, at processor i, the expression e is reduced to e′ in the store s′_i".
2. e/S ⇀_𝒩 e′/S′, which can be read as "in the initial network store S, the expression e is reduced to e′ in the network store S′".

We write s + {ℓ ↦ v} for the extension of s with the mapping of ℓ to v. If, before this operation, we have ℓ ∈ Dom(s), we replace the range by the new value for the location ℓ. To define these relations, we begin with some axioms for the relation of head reduction. We write e_1[x ← e_2] for the expression obtained by substituting all the free occurrences of x in e_1 by e_2.
For a single processor:
    (fun x → e) v / s_i ⇀_i e[x ← v] / s_i        (β^i_fun)
For the whole parallel machine:
    (fun x → e) v / S ⇀_𝒩 e[x ← v] / S           (β^𝒩_fun)
Rules (β^i_let) and (β^𝒩_let) are the same, but with let x = v in e instead of fun. For the primitive operators and constructs we have some axioms, the δ-rules. For each classical δ-rule, we have two new reduction rules: e/s_i ⇀_{δ_i} e′/s′_i and e/S ⇀_{δ_𝒩} e′/S′. Indeed, these reductions do not change the stores and do not depend on the stores (we omit these rules for lack of space; we refer to [3]). Naturally, for the parallel operators we also have some δ-rules, but those δ-rules do not exist on a single processor, only for the network (figure 1). A problem appears with the put operator. The put operator is used for the exchange of values, and in particular of locations. But a location can be seen as a pointer into memory (a location is a memory address). If we send a local location to a processor that does not have this location in its store, there is no reduction rule to apply and the program stops with an error (the famous segmentation fault of the C language) if it dereferences this location (if it reads "outside" the memory). A dynamic solution is to communicate the value contained in the location and to create a new location for this value (as in the Marshal module of Objective Caml). This solution implies the renaming of locations that are communicated to other processors. For this,
we define Loc, the set of locations of a value; it is defined by trivial structural induction on the value. We define how to add a sequence of pairs of locations and values to a store by s + ∅ = s and s + [ℓ_0 ↦ v_0, …, ℓ_n ↦ v_n] = (s + {ℓ_0 ↦ v_0}) + [ℓ_1 ↦ v_1, …, ℓ_n ↦ v_n]. We write φ = {ℓ_0 ↦ ℓ′_0, …, ℓ_n ↦ ℓ′_n} for a substitution, i.e., a finite map from locations ℓ_i to locations ℓ′_i whose domain is {ℓ_0, …, ℓ_n}.
(mkpar v) / S ⇀_{δ_𝒩} ⟨(v 0), …, (v (p−1))⟩ / S        (δ_mkpar)

apply(⟨v_0, …, v_{p−1}⟩, ⟨v′_0, …, v′_{p−1}⟩) / S ⇀_{δ_𝒩} ⟨(v_0 v′_0), …, (v_{p−1} v′_{p−1})⟩ / S        (δ_apply)

if ⟨…, true, …⟩ at v then e_1 else e_2 / S ⇀_{δ_𝒩} e_1 / S    if v = n, the position of the true shown        (δ_ifatT)

if ⟨…, false, …⟩ at v then e_1 else e_2 / S ⇀_{δ_𝒩} e_2 / S    if v = n, the position of the false shown        (δ_ifatF)

put(⟨fun dst → e_0, …, fun dst → e_{p−1}⟩) / S ⇀_{δ_𝒩} ⟨r_0, …, r_{p−1}⟩ / S′        (δ_put)

where S = [s_0, …, s_{p−1}] and S′ = [s′_0, …, s′_{p−1}] with ∀j. s′_j = s_j + h′_0 + … + h′_{p−1}, where h′_j = [ℓ′_0 ↦ v_0, …, ℓ′_n ↦ v_n] and h_j = {(ℓ_0, v_0), …, (ℓ_n, v_n)} with ℓ_k ∈ Loc(e_j) and {ℓ_k ↦ v_k} ∈ s_j, where φ_j = {ℓ_0 ↦ ℓ′_0, …, ℓ_n ↦ ℓ′_n} and e′_j = φ_j(e_j), and where ∀i, r_i = (let v_i^0 = e′_0[dst ← i] in … let v_i^{p−1} = e′_{p−1}[dst ← i] in f_i) with f_i = fun x → if x = 0 then v_i^0 else … if x = (p−1) then v_i^{p−1} else nc()

Figure 1. Parallel δ-rules

Now we complete our semantics by giving the δ-rules of the operators on the stores and the references. We need two kinds of reductions. First, for a single processor, the δ-rules are (δ_ref), (δ_!) and (δ_:=) (given in figure 2). These operations work on the store of the processor where the operation is executed. The ref operation creates a new allocation in the store of the processor, the ! operation gives the value contained in a location of the store, and the := operation replaces this value by another one.
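For instance (a worked sequence of ours, at a single processor i, using the store rules of figure 2):

    ref(5) / ∅ ⇀_{δ_i} ℓ / {ℓ ↦ 5}        !(ℓ) / {ℓ ↦ 5} ⇀_{δ_i} 5 / {ℓ ↦ 5}        ℓ := 7 / {ℓ ↦ 5} ⇀_{δ_i} () / {ℓ ↦ 7}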
ref(v) / s_i ⇀_{δ_i} ℓ / s_i + {ℓ ↦ v}    if ℓ ∉ Dom(s_i)        (δ_ref)

!(ℓ) / s_i ⇀_{δ_i} s_i(ℓ) / s_i    if ℓ ∈ Dom(s_i)        (δ_!)

ℓ := v / s_i ⇀_{δ_i} () / s_i + {ℓ ↦ v}    if ℓ ∈ Dom(s_i)        (δ_:=)

ref(v) / S ⇀_{δ_𝒩} ℓ^𝒩 / S′    where S′ = [s_0 + {ℓ^0 ↦ 𝒫_0(v)}, …, s_{p−1} + {ℓ^{p−1} ↦ 𝒫_{p−1}(v)}] and ∀i. ℓ^i ∉ Dom(s_i)        (δ^𝒩_ref)

!(ℓ^𝒩) / S ⇀_{δ_𝒩} v / S    where ∃v ∀i s_i(ℓ^i) = v, or ∃v v = 𝒫^{−1}(s_i(ℓ^i))        (δ^𝒩_!)

ℓ^𝒩 := v / S ⇀_{δ_𝒩} () / S′    where S′ = [s_0 + {ℓ^0 ↦ 𝒫_0(v)}, …, s_{p−1} + {ℓ^{p−1} ↦ 𝒫_{p−1}(v)}]        (δ^𝒩_:=)

ℓ^𝒩 / s_i ⇀_{δ_i} ℓ^i / s_i    if ℓ^i ∈ Dom(s_i)        (δ^𝒩_proj)

Figure 2. "Store" δ-rules

For the whole network, we have to distinguish between the name of a location created outside
a mkpar, which is used in expressions, and its "projections" in the stores of each process. We write ℓ^𝒩 in the first case and ℓ^i for its projection in the store of process i. When an expression outside a mkpar creates a new location, each process creates a new location (an address) in its store (rule (δ^𝒩_ref), figure 2), where 𝒫_i is a trivial projection function for the parallel vectors of a value. With this function, we ensure that there is no data from other processes in a store. The assignment to a location ℓ^𝒩 (rule (δ^𝒩_:=), figure 2) modifies the values of the locations ℓ^i, also using the projection function (which has no effect if the value does not contain any parallel vector). This rule is only valid outside a mkpar. But a reference created outside a mkpar can be assigned and dereferenced inside a mkpar. For assignment, the value can be different on each process. To allow this, we need to introduce a rule (δ^𝒩_proj) (figure 2) which transforms (only inside a mkpar and at process i) the common name ℓ^𝒩 into its projection ℓ^i. Notice that the assignment or the dereferencing of a location ℓ^𝒩 cannot be done inside a mkpar with rules (δ_:=) and (δ_!), since the condition ℓ ∈ Dom(s_i) does not hold; the rule (δ^𝒩_proj) must be used first. The dereferencing of ℓ^𝒩 outside a mkpar can only occur if the value held by its projections at each process is the same, or if this value is the projection of a value which contains a parallel vector (a value which cannot be modified by any process, since nesting of parallelism is forbidden [4]). This verification is done by rule (δ^𝒩_!), where 𝒫^{−1} is a trivial de-projection function for the values stored on the processes; this de-projection does not need any communication. The complete definitions of our reductions are ⇀ = (⋃_i ⇀_i) ∪ ⇀_𝒩 and ⇀_δ = (⋃_i ⇀_{δ_i}) ∪ ⇀_{δ_𝒩}. It is easy to see that we cannot always make a head reduction: we have to reduce in depth in the sub-expressions. To define this deep reduction, we need to define two kinds of contexts; we refer to [3] for their definitions.
!(~N) / si
~-~ 6~
si(g~) / si
(6~ ) to suppress the comparison: !(gN) / S
~~N
if ~ C
Dom(s~) (6~N)
7~-1(s~(/~)) / S
and modify the rule
ifg~ c
Dom(s~) (3,N,) " "
This rule is not deterministic but since assignment of a 1N location is not allowed inside a
mkpar the projections of a 1N location always contain the same value. The cost model is now compositional since the new (~'~) does not need communications and synchronization. 4. CONCLUSIONS AND FUTURE W O R K The Bulk Synchronous Parallel ML allows direct mode Bulk Synchronous Parallel programming. The semantics of BSML were pure functional semantics. Nevertheless, the current implementation of BSML is the BSMT.1• library for Objective Caml which offers imperative features. We presented in this paper semantics of the interaction of our bulk synchronous operations with imperative features. The safe communication of references has been investigated, and for this particular point, the presented semantics conforms to the implementation. To ensure safety, communications may be needed in case of assignment (but in this case the cost model is no longer compositional) or references may contain additional information used dynamically
to ensure that dereferencing references pointing to local values will give the same value on all processes. We are currently working on a typing of effects [10] to avoid this problem statically.

REFERENCES
[1] M. Cole. Algorithmic Skeletons: Structured Management of Parallel Computation. MIT Press, 1989.
[2] F. Gava. Formal Proofs of Functional BSP Programs. Parallel Processing Letters, 2003. To appear.
[3] F. Gava and F. Loulergue. Semantics of a Functional BSP Language with Imperative Features. Technical Report 2002-14, University of Paris Val-de-Marne, LACL, October 2002.
[4] F. Gava and F. Loulergue. A Polymorphic Type System for Bulk Synchronous Parallel ML. In Parallel Computing Technologies (PaCT 2003), LNCS. Springer Verlag, 2003.
[5] F. Loulergue. Implementation of a Functional Bulk Synchronous Parallel Programming Library. In 14th IASTED PDCS Conference, pages 452-457. ACTA Press, 2002.
[6] F. Loulergue, G. Hains, and C. Foisy. A Calculus of Functional BSP Programs. Science of Computer Programming, 37(1-3):253-277, 2000.
[7] F.A. Rabhi and S. Gorlatch, editors. Patterns and Skeletons for Parallel and Distributed Computing. Springer, 2002.
[8] D. Rémy. Using, Understanding, and Unravelling the OCaml Language. In G. Barthe et al., editors, Applied Semantics, number 2395 in LNCS, pages 413-536. Springer, 2002.
[9] D.B. Skillicorn, J.M.D. Hill, and W.F. McColl. Questions and Answers about BSP. Scientific Programming, 6(3):249-274, 1997.
[10] Jean-Pierre Talpin and Pierre Jouvelot. The Type and Effect Discipline. Information and Computation, 111(2):245-296, June 1994.
[11] Leslie G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103, August 1990.
The Use of Parallel Genetic Algorithms for Optimization in the Early Design Phases

E. Slaby and W. Funk

Waldorfer Str. 8, D-50969 Köln, Germany, slaby@unibw-hamburg.de

Institute of Machine Design and Production Technology, University of the Federal Armed Forces, D-22039 Hamburg, Germany

This paper deals with the use of genetic algorithms, which are integrated into a commercial CAE-system in such a way that all of the CAE-system's functions can be used for evaluation and data processing. To reduce runtime, a parallelization of the fitness evaluation routine based on CORBA technology is presented.

1. INTRODUCTION

These days, the use of CAE-systems (Computer Aided Engineering) in the design process is quite common in many development departments. A great number of different computerized support systems for the engineering design process are currently available, and particularly in recent years more and more computation and dimensioning tools have been integrated into the systems. But most of these tools are applicable only in the later design phases. The early phases like "planning" and "conceptual design" are still relatively unexplored territory for computer-aided systems. Optimization algorithms and technologies from the field of Computational Intelligence can hardly be found as integrated components of the systems [3]. Therefore the use of these algorithms is quite rare in practice. Among the reasons for this are long training periods for the user, difficult handling, and the high programming effort which is necessary for each new task. On the other hand, some projects ([5], [7], [2]) show the large potential of these technologies for the design process, even in supporting the engineer in those areas which are considered a domain of human intuition and creativity. Bentley points this out in [1], where evolutionary algorithms in particular are used not only for the optimization of already existing construction units, but also for the generation of new concept variants. On closer inspection of such projects it is noticeable that, besides the engineer as the user, a computer scientist is often needed for the software-technical realization. Such a time-consuming effort with additional personnel expenditure cannot be afforded in practice during the already time-limited design process. For this reason a software system has been developed that combines an evolutionary algorithm and a commercial 3D-CAE-system (I-DEAS Master Series) in such a way that no programming has to be done by the engineer.
First, this paper gives a short introduction to evolutionary algorithms (EA) and considers the disadvantages of existing applications which use EAs in section three. Section 4 points out the intention of the developed software system for the integration of EAs into an existing CAE-system (section 5); section 6 deals with the parallelization, followed by the use of a neural network for the EA fitness evaluation routine in section 7.

2. EVOLUTIONARY ALGORITHMS

Evolutionary strategies, especially genetic algorithms, represent promising tools for the engineering design process. Evolutionary strategies imitate the principles of natural evolution: reproduction, mutation and selection. Starting with an initial random population, the fitness of each individual is evaluated under consideration of the given constraints. The strings representing the individuals are crossed over and mutated. The individuals of such a generation are tested and given a probability to survive according to their fitness. After a sufficient number of generations, optimal or nearly optimal solutions can be found. By randomly mutating some strings, areas of the solution space are examined which may not have been considered with the initial population [4]. Some fundamental differences to conventional deterministic and stochastic search methods are [8]:
• Evolutionary algorithms search in parallel in a population of points, not only from one individual point.
• No derivatives of the objective function or other auxiliary information are needed by evolutionary algorithms; only the objective function value is used as a basis for the search.
• Evolutionary algorithms can offer a number of possible solutions for a problem.
• Evolutionary algorithms can be applied to problems with different representations of the variables (continuous and discrete variables).
• Evolutionary algorithms are simple and flexible in use, and there are no restrictions on the definition of the objective function.

Evolutionary algorithms thus represent versatile, robust and efficient optimization procedures, which can be used with strongly nonlinear, non-continuous objective functions and for problems with variables of different representation (binary, integer, real). It is not recommended to use evolutionary algorithms if the computation of the objective function is very complex or time-consuming, because a large number of objective function computations is necessary [8]. The basis of the implemented optimization module is the software package ECJ (an Evolutionary Computing system written in Java [6]). A modified genetic algorithm is used that fits a broad range of design problems and leads to good results without being adapted individually to each problem by the user. Because of the parallel way in which they step through the solution space, EAs are very well suited to a parallel fitness evaluation; this will be discussed in section 6.
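A minimal generational loop of this kind can be sketched in Java (our sketch; the actual module builds on ECJ and is far more general):

    import java.util.Random;

    // Minimal generational GA: binary strings, roulette-wheel selection,
    // one-point crossover and bit mutation. The fitness function is a stand-in.
    public class SimpleGA {
        static final int POP = 50, LEN = 32, GENERATIONS = 100;
        static final Random rnd = new Random();

        static double fitness(boolean[] ind) {      // stand-in objective: count 1-bits
            int ones = 0;
            for (boolean b : ind) if (b) ones++;
            return ones;
        }

        public static void main(String[] args) {
            boolean[][] pop = new boolean[POP][LEN];
            for (boolean[] ind : pop)                // random initial population
                for (int j = 0; j < LEN; j++) ind[j] = rnd.nextBoolean();

            for (int g = 0; g < GENERATIONS; g++) {
                double[] fit = new double[POP];
                double total = 0;
                for (int i = 0; i < POP; i++) { fit[i] = fitness(pop[i]); total += fit[i]; }

                boolean[][] next = new boolean[POP][];
                for (int i = 0; i < POP; i++) {
                    boolean[] a = select(pop, fit, total), b = select(pop, fit, total);
                    int cut = rnd.nextInt(LEN);      // one-point crossover
                    boolean[] child = new boolean[LEN];
                    for (int j = 0; j < LEN; j++) child[j] = (j < cut) ? a[j] : b[j];
                    if (rnd.nextDouble() < 0.05)     // mutation: flip one bit
                        child[rnd.nextInt(LEN)] ^= true;
                    next[i] = child;
                }
                pop = next;
            }
        }

        static boolean[] select(boolean[][] pop, double[] fit, double total) {
            double r = rnd.nextDouble() * total;     // fitness-proportional selection
            for (int i = 0; i < pop.length; i++) { r -= fit[i]; if (r <= 0) return pop[i]; }
            return pop[pop.length - 1];
        }
    }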
3. EXISTING APPLICATIONS USING EAs IN THE DESIGN PROCESS

Existing examples which use genetic algorithms in the early design phases show the great potential that lies in those algorithms for the optimization of design concepts and for the possibility to create new designs from scratch. On the other hand, a lot of programming work is necessary to develop the algorithm, the necessary input/output interfaces and the fitness evaluation routine. In many cases, a software developer has to support the engineer. The resulting software often fits only a small class of problems and has to be adapted to a different task with a lot of work. The input of the optimization objectives, as well as the interpretation of the results of an optimization run, is often very difficult because of a lacking linkage to a CAE-system. The EA's output data must usually be converted in an additional step or entered manually. Some applications contain a special geometry model for computation and representation. Such a development is connected with a great programming effort and does not reach the functionality of a modern CAE-system (e.g. free-form surface modelling). For many of the examined applications, computation routines are specially programmed for the evaluation of the solution variants, instead of using the already existing computation tools of a CAE-system.

4. INTENTION OF "EasyEvoOpt-3D"

The intention of the developed system "EasyEvoOpt-3D" is to offer the engineer a tool which gives him the possibility to use the above-mentioned technologies during the product design process via the 3D-CAD-system's graphical user interface, without the need for deeper knowledge of the used algorithms. In detail, the following points have been considered for the development:
• Optimization algorithms have to be integrated into a CAE-system to offer the possibility to input all relevant information with the help of the CAE-system's input functions.
• The computation routines of the CAE-system should be usable by the optimization algorithms, in order to be able to evaluate individual solutions.
• The results of the optimization run have to be available in the CAE-system without further action or data conversion.
• The use of the optimization procedures should be possible for a wide range of different tasks without further adjustment of the optimization algorithms.
• If necessary, there should be an option to adapt the optimization algorithm specifically to the problem.
• A modular extensibility has to be planned, so that the integration of further external applications for the evaluation of solution variants is possible.

A main point for the development of the system is that an engineer should have the possibility to let the system generate solution variants at an early phase of the design process, without programming effort and without a deeper knowledge of the optimization algorithms.
5. INTEGRATION INTO THE CAE-SYSTEM I-DEAS

The geometry model of the CAE-system is regarded as the basic element of the optimization. Thus the computation tools of the CAE-system are available, along with the possibilities for visualization. The definition of the restrictions should be made not by mathematical descriptions, but by graphic elements (e.g. restricted areas).
Figure 1. User inputs at the CAE-system recorded by the optimization module
All the above-mentioned points, even the input of the restrictions and of the objective function, have to be available without programming by the user. With this demand, the communication of the optimization module with the CAE-system acquires a central meaning for the further procedure. Two different cases of communication between the CAE-system and the optimization module can be pointed out. The optimization module has to record the inputs, such as variables, optimization objective and restrictions, made by the user via the CAE-system, and convert them in such a way that the optimization module can replay them without any user interaction. This means that the variables have to be set according to the outputs of the evolutionary algorithm, and that the fitness evaluation routine is able to use the CAD-system's tools. The possibilities to enter the problem, the restrictions and the fitness evaluation with user-friendly graphical tools of the CAD-system are shown in figure 1. The inputs are recorded and processed in such a way that the tools of the CAD-system can be used for fitness evaluation without any user interaction, as shown in figure 2. The communication module for communication with the CAE-system I-DEAS is represented by a Java class named OI_Com (Open I-DEAS Communication). The task of this class is to execute the entire internal CORBA communication and to perform the necessary conversions for communication with the CAE-system I-DEAS. Low-level communication with I-DEAS has been developed with the Open I-DEAS programming interface. This CAE-system library is based on CORBA technology. It offers a number of methods to access internal data of I-DEAS. In the first phase of the optimization application, the user has to define which parameters of a construction unit are variable.
Figure 2. Fitness evaluation without any user interaction

For the user's comfort, this can be done by interactively picking the dimensions or other defining parameters in the graphics region of the CAE-system. The Open I-DEAS API includes a method to get the 3D coordinates of any picked point, which is not usable for later use because, by changing some parameters of the object, their position in space will change and the system could not pick the element afterwards. For this reason the OI_Com class contains different methods which offer the possibility to convert the selection of an element in the graphics region of I-DEAS into the system-internal labels. The OI_Com class also includes methods to access all computation results in the CAE-system, even if these routines are not supported by the Open I-DEAS API. This is possible with a specially developed parser which can access and filter all internal result lists and pass them to the optimization module.

6. PARALLEL FITNESS EVALUATION

The disadvantage of the use of these CAE-system tools lies in an unsatisfying runtime. Additionally, the convergence of genetic algorithms is generally slow because of the huge number of fitness evaluations. Each individual in a generation encodes a possible solution that is independent of the information of other individuals. Therefore the fitness of the individuals can be evaluated in parallel. As fitness evaluation with the tools of the CAE-system is the most time-consuming part of the process, a significant reduction of execution time can be realized through the parallelization. The implemented parallelization scheme had to be applicable on distributed-memory systems with different operating systems (Windows, Unix) connected via a LAN, and on multiprocessor supercomputers. Therefore the main focus of this section lies on developing a parallel fitness evaluation routine which is able to use many processes of the CAD-system on a heterogeneous cluster of workstations, such as can be found in the development department of an engineering company. The parallel fitness evaluation routine is able to integrate workstations with different operating systems (Windows, Unix) into the cluster, as well as multiprocessor systems on which many CAD-system processes can be executed. The parallelization routine uses Java threads and the Common Object Request Broker
Architecture (CORBA). This architecture offers a possibility to communicate with different processes of the CAE-system, even on different operating systems. It is not necessary to change the CAE-system's source code. An advantage of this approach is the fact that the developed system offers the possibility to use all available CAE workstations for the parallelization, without having to modify the system or even boot a different operating system. The parallelization takes place in a master thread, which is responsible for the distribution of the individual computations as well as for the error handling, and slave threads, which can be instantiated several times and perform the direct communication with the CAE-system. The master thread communicates via messages with the slave threads. Each slave thread has a separate message register. Via this register, defined messages concerning the status (READY, CALCULATION_FINISHED, RESULT_RECEIVED, ERROR) and for calling subroutines (INIT, CALCULATE_FITNESS, SLEEP, DONE) can be passed. To prevent slower computers from slowing down the whole process, load balancing is included. If all individuals of one generation have been sent to a slave for evaluation and some slaves are already waiting for new individuals, some of the not-yet-evaluated individuals are sent again to another slave. If a result has been returned, the other slaves evaluating the same individual are able to stop their run.
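The scheme can be sketched in Java as follows (our sketch: a thread pool replaces the hand-written message registers, and the duplicated evaluation used for load balancing is omitted):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class MasterSlaveEvaluator {
        private final ExecutorService slaves;

        public MasterSlaveEvaluator(int nSlaves) {
            slaves = Executors.newFixedThreadPool(nSlaves);   // one worker per CAE process
        }

        // Distribute one generation over the slaves and collect the fitness values.
        public double[] evaluate(double[][] generation) throws Exception {
            List<Future<Double>> pending = new ArrayList<>();
            for (double[] individual : generation)
                pending.add(slaves.submit(() -> evaluateWithCae(individual)));
            double[] fitness = new double[generation.length];
            for (int i = 0; i < fitness.length; i++)
                fitness[i] = pending.get(i).get();            // blocks until the slave answers
            return fitness;
        }

        // Stand-in for the CORBA round trip to one CAE-system process:
        // set the model parameters, run the computation tools, parse the results.
        private double evaluateWithCae(double[] individual) {
            double sum = 0;
            for (double x : individual) sum += x * x;
            return sum;
        }
    }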
7. EVALUATION OF FITNESS VALUES WITH A NEURAL NETWORK

A second approach for a reduction of the runtime of an optimization run consists in the use of a neural network for the evaluation of the individuals' fitness values. The basis of this concept is to train a neural network with the data which describe the solution variants and with the associated fitness values computed by the CAE-system. If the neural network has achieved a given quality after many learning procedures, the determination of the fitness values can be accomplished by the neural network, which is very much faster than the complex computation of the CAE-system. A neural network must first be trained with example data in order to obtain the desired behavior. Therefore the neural net is trained with the individuals for which the CAE-system has already computed a fitness value. If, for a given number of new data sets, a demanded accuracy is reached by the neural network, the system changes over to the second phase, the evaluation of the fitness values by the neural network. In this second phase there is a mixture of fitness values estimated by the neural network and the computation of selected individuals made by the CAE-system. A closer view of the genetic algorithm is necessary at this point. The best individuals of a generation are preferred for reproduction. For the remaining individuals, only their rank within the population is important. The exact fitness value must be available only for the best individuals. Therefore the fitness values of all individuals of one generation are evaluated by the neural network and, additionally, the fitness values of the best individuals, as well as of some randomly selected ones, are computed by the CAE-system. The neural network module was developed with MatLAB and converted with the MatLAB/C++ compiler to C++ source code. The CORBA methods were implemented so that the neural network module can be integrated into the optimization system via a network. Figure 3 gives an overview of all components of the optimization system and the integration of the CAE-systems for input/output and fitness evaluation.
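The mixed evaluation phase can be sketched as follows (our sketch; Scorer, nBest and nRandom are hypothetical names standing for the neural network, the CAE computation and the selection sizes):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Random;
    import java.util.Set;

    class SurrogateEvaluation {
        interface Scorer { double score(double[] individual); }

        // Score the whole generation with the cheap surrogate, then recompute
        // the apparently best individuals plus a random sample exactly; the
        // exact pairs can be fed back into the network's training set.
        static double[] evaluate(double[][] pop, Scorer neuralNet, Scorer cae,
                                 int nBest, int nRandom) {
            final double[] fit = new double[pop.length];
            Integer[] order = new Integer[pop.length];
            for (int i = 0; i < pop.length; i++) { fit[i] = neuralNet.score(pop[i]); order[i] = i; }
            Arrays.sort(order, (a, b) -> Double.compare(fit[b], fit[a]));   // best first

            Set<Integer> exact = new HashSet<>();
            for (int k = 0; k < Math.min(nBest, pop.length); k++) exact.add(order[k]);
            Random rnd = new Random();
            while (exact.size() < Math.min(pop.length, nBest + nRandom))
                exact.add(rnd.nextInt(pop.length));                         // random re-checks

            for (int i : exact) fit[i] = cae.score(pop[i]);                 // exact fitness
            return fit;
        }
    }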
(Components shown: graphical user interface, evolutionary algorithm, data recording and conversion, database, multithreaded parallelization module and CORBA communication module, connected to CAD/CAE-system processes on workstations and a multiprocessor.)
Figure 3. Overview of all components
8. RESULTS AND CONCLUSION
The optimization system "EasyEvoOpt-3D" has been successfully applied to the search for a non-uniform transmission for a wicket and to an ergonomics study for a bench [9]. The flexibility with respect to user changes and the possibility to follow the optimization steps generation by generation on screen have shown the advantages of using the CAE-system. The disadvantage of the quite time-consuming CAE tools for the fitness evaluation could be compensated by the parallelization and by the use of a neural network for the fitness evaluation. The system shows that it is possible to use optimization methods at an early stage of the design process, using only the already available hardware and without having to consult an optimization or software specialist.

REFERENCES
[1] Bentley, Peter John: Generic Evolutionary Design of Solid Objects using a Genetic Algorithm; Dissertation, University of Huddersfield, 1996
[2] Dasgupta, D.; Michalewicz, Z. (Hrsg.): Evolutionary Algorithms in Engineering Applications; Berlin, Heidelberg, New York, u.a.: Springer 1997
[3] Figel, Klaus: Optimieren beim Konstruieren: Verfahren, Operatoren und Hinweise für die Praxis; München, Wien: Hanser 1988
[4] Goldberg, David E.: Genetic Algorithms in Search, Optimization & Machine Learning; Massachusetts, Harlow, Menlo Park, u.a.: Addison Wesley Longman, 1989
[5] Hafner, S.; Kiendl, H.; Kruse, R.; Schwefel, H.-P.: Computational Intelligence im industriellen Einsatz; Tagung Baden-Baden, Mai 2000, VDI/VDE-Gesellschaft Mess- und Automatisierungstechnik; Düsseldorf: VDI-Verlag 2000
[6] Luke, Sean: ECJ, an Evolutionary Computing system written in Java. http://www.cs.umd.edu/users/seanl/, University of Maryland
[7] Miettinen, K.; Neittaanmäki, P.; Mäkelä, M.M.; Periaux, J. (Hrsg.): Evolutionary Algorithms in Engineering and Computer Science; Chichester, Weinheim, New York u.a.: Wiley 1999
[8] Pohlheim, Hartmut: Evolutionäre Algorithmen: Einsatz von Optimierungsverfahren, CAD und Expertensystemen; Berlin, Heidelberg, New York, u.a.: Springer 2000
[9] Slaby, Emanuel: Einsatz Evolutionärer Algorithmen zur Optimierung im frühen Konstruktionsprozess. Fortschr.-Ber. VDI Reihe 20 Nr. 361. Düsseldorf: VDI Verlag 2003
An Integrated Annotation and Compilation Framework for Task and Data Parallel Programming in Java*

H.J. Sips and K. van Reeuwijk

Delft University of Technology, (sips,reeuwijk)@its.tudelft.nl
1. INTRODUCTION

For most applications Fortran has been replaced by languages such as ANSI C, C++, and Java. Yet Fortran has a number of features that still make it particularly useful for scientific programs, namely multi-dimensional arrays, complex numbers, and, in later versions, array expressions. Moreover, Fortran compilers tend to produce highly efficient code compared to compilers for other languages. We have developed a set of language constructs, called Spar, that augment Java with a set of language constructs for scientific programs. The set consists of multi-dimensional arrays; complex numbers; a 'toolkit' to build specialized array representations such as block, symmetric, or sparse arrays; annotations; and parallelization. In this paper we concentrate on the constructs for parallel programming; the other extensions have been described in [9]. In this paper, we present a unified model to describe data and task parallelism in an object-oriented language. The approach allows compile-time and run-time techniques to be combined.

2. THE FOREACH CONSTRUCT

In a sequential program the execution order of all statements is specified exactly. This is often an over-specification: the programmer may not care about the execution order of statements, even if the observable results differ. For example, the code for( int i=0;
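The idea can be illustrated in plain Java (our illustration, not Spar syntax): when the iterations of a loop are independent, any execution order, including a concurrent one, is acceptable.

    // Independent loop iterations distributed over several threads.
    public class IndependentIterations {
        public static void main(String[] args) throws InterruptedException {
            final double[] a = new double[1000];
            int nThreads = 4;
            Thread[] workers = new Thread[nThreads];
            for (int t = 0; t < nThreads; t++) {
                final int start = t * a.length / nThreads;
                final int end = (t + 1) * a.length / nThreads;
                workers[t] = new Thread(() -> {
                    for (int i = start; i < end; i++) a[i] = Math.sin(i);  // independent updates
                });
                workers[t].start();
            }
            for (Thread w : workers) w.join();   // all orders give the same array
        }
    }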
From the semibandwidths u_i and l_i the column indexes are i − l_i
(true ⊃ a_1, a_2) → a_1        (IfTrue)
(false ⊃ a_1, a_2) → a_2        (IfFalse)
⟨a_0, …, a_{p−1}⟩ ⟨b_0, …, b_{p−1}⟩ → ⟨a_0 b_0, …, a_{p−1} b_{p−1}⟩        (ParApp)
⟨a_0, …, a_{p−1}⟩ ? ⟨n_0, …, n_{p−1}⟩ → ⟨a_{n_0}, …, a_{n_{p−1}}⟩        (Get)
(⟨…, true, …⟩ & a_1, a_2) → a_1        (IfAtTrue)
(⟨…, false, …⟩ & a_1, a_2) → a_2        (IfAtFalse)
(Get) is not a rule but a set of rules such that, for all i in {0, …, p−1}, n_i is an integer constant between 0 and p−1.
Figure 2. Rules of the term rewriting system BS
The rules for the usual operators are defined by rules like 3 + 2 → 5 or (true or false) → true. Equality is only defined on integer and boolean constants. The value of a_1 ? a_2 at processor name i is the value of a_1 at the processor name given by the value of a_2 at i. Notice that, in practical terms, this represents an operation whereby every processor receives one and only one value from one and only one other processor. This restriction can be lifted by defining a "put" operation (which is also a pure functional operation), but it is not given here for the sake of conciseness. Next, the global conditional is defined by two rules. The above two cases generate the following bulk-synchronous computation: first a pure computation phase where all processors evaluate the local term, yielding n; then processor n evaluates the parallel vector of booleans; if at processor n the value is true (resp. false) then processor n broadcasts the order for global evaluation of a_1 (resp. a_2); otherwise the computation fails. These two rules are necessary to express algorithms of the form:
Iteration
UntilMax
of
local
errors
< e
because without them, the global control can not take into account data computed locally, ie global control can not depend on data. Additional rules bare also needed to propagate substitutions through the symbols of BS. For every symbol f of BS with arity n we have the rule : f(al,...,an)[s]
, f(al[s],...,an[s])
If n = 0 this rule means
f[s]
,
f.
3.2. Confluence of the calculus

We will use the following theorem due to Pagano [20]:

Theorem 1. Let (F, R) be a term rewriting system such that:
• F and Λσ (the set of terms of the λσ-calculus) share at most the application symbol;
• R is confluent;
• R is left linear;
• R does not contain variable-applicative rules.
Then the λFσ-calculus is confluent.

Remark 1. If F and Λσ do not share the application symbol, then the last condition is not required.

In our case (F, R) = BS. There are no critical pairs, and the rules are left linear, so the TRS BS is confluent. By Theorem 1, the BSλσ-calculus is confluent.

3.3. Examples

The first example shows that a parallel vector can also be expressed by an intensional construction, as in the BSλ-calculus [17]. The π or parallel vector constructor of BSλ can be defined as:

    π := λ⟨1 0, …, 1 (p−1)⟩
A vector usually used is this, defined as π(λ1). It can be reduced as follows (using the rules indicated on the right):

    this ≡ (λ⟨1 0, …, 1 (p−1)⟩) (λ1)
      →  ⟨1 0, …, 1 (p−1)⟩[(λ1)·id]                                (Beta)
      →* ⟨(1 0)[(λ1)·id], …, (1 (p−1))[(λ1)·id]⟩                   (fσ)
      →* ⟨1[(λ1)·id] 0[(λ1)·id], …, 1[(λ1)·id] (p−1)[(λ1)·id]⟩     (App)
      →* ⟨(λ1) (0[(λ1)·id]), …, (λ1) ((p−1)[(λ1)·id])⟩             (FVarCons)
      →* ⟨(λ1) 0, …, (λ1) (p−1)⟩                                   (fσ)
      →* ⟨1[0·id], …, 1[(p−1)·id]⟩                                 (Beta)
      →* ⟨0, …, p−1⟩                                               (FVarCons)
The second example is the direct broadcast algorithm, which broadcasts the value held at the processor given as first argument: bcast := λλ (1 ? π(2)). If applied to 1 (the root of the broadcast) and to an expression e which evaluates to a parallel vector ⟨a0, …, a_{p−1}⟩, it can be reduced as follows (we omit the steps similar to the previous example):

    bcast 1 e →* ⟨a0, …, a_{p−1}⟩ ? ⟨1, …, 1⟩ → ⟨a1, …, a1⟩   by rule (Get)
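To make the operational reading of these rules concrete, the following is a minimal sequential OCaml model of the BS constructs (OCaml being the implementation language targeted by the project). A parallel vector is modelled as a plain array; mkpar, get, ifat and bcast are illustrative names chosen here, not functions of any released library, and the model ignores the distributed-execution aspect entirely.

    (* A parallel vector <a0, ..., a_{p-1}> over p processors,
       modelled sequentially as an array of length p. *)
    let p = 4

    type 'a pvec = 'a array

    (* pi(f): the vector whose component at processor i is (f i). *)
    let mkpar (f : int -> 'a) : 'a pvec = Array.init p f

    (* a1 ? a2, rule (Get): processor i receives the value of a1
       held at the processor whose name is given by a2 at i. *)
    let get (a1 : 'a pvec) (a2 : int pvec) : 'a pvec =
      Array.init p (fun i -> a1.(a2.(i)))

    (* Global conditional, rules (IfAtTrue)/(IfAtFalse): the whole
       computation branches on the boolean held at processor n. *)
    let ifat (b : bool pvec) (n : int) (a1 : unit -> 'a) (a2 : unit -> 'a) : 'a =
      if b.(n) then a1 () else a2 ()

    (* bcast root v = v ? pi(root): direct broadcast of the value
       held at processor root, as in the second example above. *)
    let bcast (root : int) (v : 'a pvec) : 'a pvec =
      get v (mkpar (fun _ -> root))

    let () =
      let this = mkpar (fun i -> i) in          (* <0, ..., p-1> *)
      let b = bcast 1 this in                   (* <1, 1, 1, 1>  *)
      Array.iter (Printf.printf "%d ") b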
4. CONCLUSIONS AND FUTURE WORK

The BSλσ-calculus is a confluent calculus for functional bulk synchronous parallel programs. Being an extension of the λσ-calculus, it shares its advantages, such as being closer to implementations of functional languages than our BSλ-calculus [17]. Another interesting feature of this calculus is the possibility to express a weak reduction strategy (i.e., no reduction under a λ-abstraction) by removing rules from the calculus. This was not possible in the λ-calculus: removing the context rules which allow reduction under a λ-abstraction leads to a non-confluent calculus. The ease of expressing the reduction strategies used in real functional languages thus allows one to prove the correctness of the abstract machines used in the implementations of those languages [11]. We have designed a Bulk Synchronous Parallel ZINC Abstract Machine [8] (BSP ZAM), which is an extension of the ZINC Abstract Machine used in the Objective Caml implementation. The next phases of the project will be the proof of correctness of this machine with respect to the BSλσ-calculus, and the parallel implementation of this abstract machine. This BSP ZAM implementation will be the basis of a parallel programming environment developed from the Caml-light language and environment. It will include our type inference [9] and will thus provide a very safe parallel programming environment.
REFERENCES
[1] H. P. Barendregt. Functional programming and lambda calculus. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science (vol. B), pages 321-364. Elsevier, 1990.
[2] S. Boutin. Proving correctness of the translation from mini-ML to the CAM with the Coq proof development system. Technical Report 2536, INRIA, 1995.
[3] L. Cardelli. Compiling a functional language. In Conference Record of the 1984 ACM Symposium on Lisp and Functional Programming, pages 208-217, Austin, Texas, August 1984. ACM.
[4] G. Cousineau, P.-L. Curien, and M. Mauny. The categorical abstract machine. Science of Computer Programming, 8:173-202, 1987.
[5] G. Cousineau and M. Mauny. The Functional Approach to Programming. Cambridge University Press, 1998.
[6] P.-L. Curien, T. Hardin, and J.-J. Lévy. Confluence properties of weak and strong calculi of explicit substitutions. Journal of the ACM, 1996.
[7] N. G. De Bruijn. Lambda-calculus notation with nameless dummies, a tool for automatic formula manipulation, with application to the Church-Rosser theorem. Indag. Math., 34:381-392, 1972.
[8] F. Gava and F. Loulergue. A Parallel Virtual Machine for Bulk Synchronous Parallel ML. In Peter M. A. Sloot et al., editors, International Conference on Computational Science (ICCS 2003), Part I, number 2657 in LNCS. Springer Verlag, June 2003.
[9] F. Gava and F. Loulergue. A Polymorphic Type System for Bulk Synchronous Parallel ML. In Seventh International Conference on Parallel Computing Technologies (PaCT 2003), LNCS. Springer Verlag, 2003.
[10] A. V. Gerbessiotis and L. G. Valiant. Direct Bulk-Synchronous Parallel Algorithms. Journal of Parallel and Distributed Computing, 22:251-267, 1994.
[11] T. Hardin, L. Maranget, and B. Pagano. Functional runtime systems within the lambda-sigma calculus. Journal of Functional Programming, 8(2):131-176, 1998.
[12] J. W. Klop. Term rewriting systems. In S. Abramsky, D. M. Gabbay, and T. S. E. Maibaum, editors, Handbook of Logic in Computer Science, volume 2, chapter 1, pages 1-117. Oxford University Press, Oxford, 1992.
[13] P. J. Landin. The mechanical evaluation of expressions. The Computer Journal, 4(6):308-320, 1964.
[14] X. Leroy. The ZINC experiment: An economical implementation of the ML language. Rapport Technique 117, 1991.
[15] X. Leroy. The Objective Caml System 3.07, 2003. Web pages at www.ocaml.org.
[16] F. Loulergue. Implementation of a Functional Bulk Synchronous Parallel Programming Library. In 14th IASTED International Conference on Parallel and Distributed Computing Systems, pages 452-457. ACTA Press, 2002.
[17] F. Loulergue, G. Hains, and C. Foisy. A Calculus of Functional BSP Programs. Science of Computer Programming, 37(1-3):253-277, 2000.
[18] A. Merlin and G. Hains. La Machine Abstraite Catégorique BSP. In Journées Francophones des Langages Applicatifs. INRIA, 2002.
[19] A. Merlin, G. Hains, and F. Loulergue. An SPMD Environment Machine for Functional BSP Programs. In Proceedings of the Third Scottish Functional Programming Workshop, August 2001.
[20] B. Pagano. Des calculs de substitution explicite et de leur application à la compilation des langages fonctionnels. PhD thesis, Université Pierre et Marie Curie, 1997.
[21] D. Rémy. Using, Understanding, and Unraveling the OCaml Language. In G. Barthe, P. Dybjer, L. Pinto, and J. Saraiva, editors, Applied Semantics, number 2395 in LNCS, pages 413-536. Springer, 2002.
[22] D. B. Skillicorn, J. M. D. Hill, and W. F. McColl. Questions and Answers about BSP. Scientific Programming, 6(3):249-274, 1997.
[23] M. Snir and W. Gropp. MPI: The Complete Reference. MIT Press, 1998.
[24] L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103-111, August 1990.
JToe: a Java* API for Object Exchange

S. Chaumette, P. Grange, B. Métrot, and P. Vigneras

Laboratoire Bordelais de Recherche en Informatique, Université Bordeaux 1, 351, cours de la Libération, 33405 Talence Cedex, France. email: {Serge.Chaumette, Pascal.Grange, Benoit.Metrot, Pierre.Vigneras}@labri.fr

*Java and all Java-based marks are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. The authors are independent of Sun Microsystems, Inc.

This paper presents JToe, an API dedicated to the exchange of Java objects in the context of high performance computing. Even though Java RMI provides a good framework for distributed objects in general, it is known to be quite inefficient, mainly due to the Java serialization process. Many projects have already improved RMI, either by redesigning and reimplementing it or by reimplementing the serialization process. We claim that both approaches are missing a clear and high-level API for the exchange of objects. JToe proposes a new simple API that focuses on the exchange of objects. This API is flexible enough to allow, for instance, direct copy of the byte stream representation of JVM objects over a specialized transport layer such as Myrinet or LAPI. Remote method invocation frameworks such as Java RMI can then be implemented over JToe with good performance enhancement perspectives.

1. INTRODUCTION

Whereas remote procedure call and remote method invocation have given developers a good abstraction of the underlying transport layers for client/server communication, high performance computing has not adopted these programming models - message passing is still preferred - mainly because of the performance penalty they suffer from. It is acknowledged that the main drawback of RMI is its inefficiency for object serialization and de-serialization. A lot of developments exist that improve this serialization mechanism in the context of high performance computing. Some of them consist in a whole RMI reimplementation [1, 2, 4]. In these cases, the problem of the exchange of objects over the network has to be addressed simultaneously with the problems of distributed garbage collection, remote method invocation, thread management, registration management, and so on. Our contribution consists in defining the notion of exchange of objects as simply as possible through the JToe API. This allows one to specifically address this problem separately from the other RMI challenges. Other works (or parts of previously cited works) address only the serialization problem in the context of high performance computing [1, 3, 9]. From a technical point of view, they generally rely [1, 3] on the already defined ObjectOutputStream and ObjectInputStream
classes. We argue in section 2 that these APIs are low-level ones. What we propose is a new high-level API dedicated to the exchange of Java objects in the context of high performance computing. The rest of this paper is organized as follows. In section 2 we discuss why we introduce a new API. We describe this API in section 3. Existing implementations and performances are presented in section 4.

2. A NEW API
We need an API that allows one to get all the benefits of an efficient object exchange layer that can be used to implement a distributed application or more general frameworks such as RMI. For this purpose, a widely used API already exists in Java: Object(Output/Input)Stream [5]. However, we believe that, in the context of high performance computing, a new distinct API for the exchange of objects is needed.

First, it is acknowledged that standard Java serialization is not efficient. Faster implementations have been proposed [1, 2, 3, 4]. Of course, none of them relies on, nor respects, the Java serialization specification: technically speaking, one can inherit ObjectOutputStream and still deeply break the compatibility with the standard serialization process, as long as the corresponding ObjectInputStream class is provided. However, when inheriting these classes, depending on how the new classes will be used, we sometimes need to respect the standard serialization process. For instance, how should one deal with the useProtocolVersion method when the standard protocol is not respected for efficiency reasons? Moreover, these classes, ObjectOutputStream and ObjectInputStream, may evolve with the protocol, and they have in the past. Such evolutions may lead to incompatibilities with legacy sub-classes (the best example being the writeObjectOverride method, which allows one to define a new serialization process and only exists since JDK 1.2).

Second, we consider that Object(Output/Input)Stream is a low-level API, since it does not hide the stream management of object serialization. Such a stream-oriented API is not well suited for transport layers that do not rely on stream-based hardware or libraries [6, 7]. Of course, one can inherit ObjectOutputStream - and this is the common approach - and give a brand new implementation for non-stream-based hardware. Even if we believe this is unnatural and error-prone, it is not a technical issue. The real problem is that, in order to receive an object, one must use an instance of the ObjectInputStream class, and especially the readObject method. This implies that a thread must be waiting on this blocking method for an object to be received. This prevents one from getting the benefits of special architectures and/or libraries that allow, for instance, one-sided initiator data transfers [6].

For all these reasons, we claim that a new, non-stream-oriented API has to be defined. One may argue that RMI could already be implemented to take advantage of one-sided initiator data transfers. Nevertheless, this implies re-implementing RMI, which, in turn and as stated in section 1, implies dealing with a lot of other high-level challenges instead of just focusing on the exchange of objects: distributed garbage collection, method invocation, registration management, etc. This is why an object exchange oriented API has to be defined, independently of RMI.

The problem of defining adapted APIs for the easy replacement of the transport layer or of the serialization process has already been addressed. However, it generally leads to an unfortunate dividing line between the serialization process and the transport layer. For instance, in KaRMI [1] a notion of technology is defined. The serialization is performed by KaRMI and the result of this serialization is sent using a given transport technology. A new technology can be defined to enhance the network transfer or to provide a new type of network support. However, in some improvement approaches [3, 9], the data to be transferred may depend on the remote/local JVM and/or on the available communication mechanisms: for instance, the assumption that the two communicating computers are running similar JVMs allows one to directly send memory regions. It may also be the case that the transport layer makes it possible not to buffer the data to be transferred. In such a case, separating the serialization process from the transport process prevents one from providing improvements based on these characteristics. This is why we believe that this separation is cumbersome.
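As an illustration of the subclassing route criticized above, the following sketch shows the hook the standard API actually offers: a subclass that wants to replace the wire format entirely must call the protected no-argument constructor and override writeObjectOverride, a mechanism that only exists since JDK 1.2. The class name is hypothetical.

    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.io.OutputStream;

    // Hypothetical fast serializer: inherits ObjectOutputStream but
    // deliberately abandons the standard serialization protocol.
    class FastObjectOutputStream extends ObjectOutputStream {
        private final OutputStream out;

        FastObjectOutputStream(OutputStream out) throws IOException {
            super();   // protected constructor: enables writeObjectOverride
            this.out = out;
        }

        protected void writeObjectOverride(Object obj) throws IOException {
            // custom, non-standard encoding of obj written to 'out';
            // a matching ObjectInputStream subclass must decode it.
        }
    }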
3. JTOE: THE API
Program 1. The JToe API

    public interface Node {
        void copy(Serializable object) throws JToeException;
    }

    public interface CopyListener {
        void copied(Serializable object);
    }
From the previous observations, we defined a simple yet powerful API: JToe. It is mainly composed of two interfaces: Node and CopyListener (see Program 1). When one wants to send a copy of an object to a remote node, one has to get the Node object representing the node to communicate with, and then invoke the copy method, passing it the object to send as an argument. (The problem of retrieving a Node is not managed by JToe, since it is not an object transfer problem but rather a registry one; general naming services such as JNDI [8] may be used.) On the server side, the JToe layer will inform the application, by means of a call-back mechanism, that a new object has arrived, using the copied method of the CopyListener interface. The action of copying an object to a remote node is a one-sided action. No user thread has to wait for an object to be received. It is the responsibility of the implementation to accept and receive any new object and then signal this arrival to the application using an event-driven programming model through CopyListener.copied. This approach is not only easy to understand and to use, but also allows one to really take advantage of one-sided communication libraries [6]. Moreover, as shown in figure 1, there is no limitation on the way the Node interface can be implemented. The copy method is responsible for the whole process of copying an object to another address space, that is, the construction of the data to send (serialization) and the transfer of these data through a transport layer. This way, no artificial dividing line between the serialization process and the transport layer is introduced. We claim that this allows any serialization improvement approach to be implemented.
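As a usage illustration, the following sketch shows both sides of an exchange through this API. The way the Node reference and the listener registration are obtained is deliberately left abstract, since JToe delegates naming to external services such as JNDI.

    import java.io.Serializable;

    // Sender side: one-sided copy of an object to a remote node.
    class Sender {
        void send(Node remote) throws JToeException {
            double[] field = new double[100000];   // any Serializable object
            remote.copy(field);                    // returns without waiting
        }
    }

    // Receiver side: the JToe implementation invokes the call-back
    // when the object has arrived; no user thread blocks on a read.
    class ArrivalListener implements CopyListener {
        public void copied(Serializable object) {
            System.out.println("received a " + object.getClass().getName());
        }
    }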
Figure 1. General JToe behavior.
4. JTOE IMPLEMENTATIONS

We have already developed three implementations of JToe [10]. Two of them are totally written in Java, one relying on Java RMI, the other using TCP sockets and the standard Java serialization. These allow any JToe application to be 100% Java compatible and portable. They also serve as reference implementations for regression tests. The third is a JVM-level implementation, i.e., at a level where we can directly access the memory representation of objects. The goal of this implementation of JToe is to provide high performance in clusters of homogeneous computers and homogeneous JVMs by directly sending memory data instead of going through the serialization process. This implementation uses the JikesRVM virtual machine [14]. JikesRVM allows one to directly access memory and interact with the garbage collector from Java code. It is an interesting experimentation platform and allowed us to implement a prototype in a reasonable time. In this implementation our concern is to perform efficiently and to interact smartly with the garbage collector. JikesRVM supports various garbage collection policies, and parts of our implementation are garbage-collector dependent. This implementation mainly behaves as follows: when an object is being copied, the corresponding graph of objects is computed. Then the data of the objects of the graph are sent with zero copy (by zero copy we mean that our code does not perform any copy, even if the actual communication layer will, since we rely on TCP; future releases based on zero-copy communication layers will actually achieve real zero copy). On the receiver side, memory is allocated in a garbage-collector dependent space - mainly the nursery - and data are directly written into this area. The pointers are then updated to reflect the original structure.

Figure 2 shows the performances of the JikesRVM-specific implementation of JToe compared to the 100% Java implementation. Both are using TCP as their communication layer. The values represented by the curves are the average time of one round-trip for the specified object in a ping-pong-like application. The RVM_RVM curve shows the performances of the JikesRVM-dedicated JToe implementation when run with JikesRVM - note that it cannot be run with another JVM. The TCP_RVM curve shows the performances of the 100% Java implementation of JToe running with JikesRVM without any optimization. The TCP_SUN and TCP_IBM curves are for the same implementation running on, respectively, the Sun Java virtual machine version
1.4.1_01-b01 and the IBM Java virtual machine version 1.4.0. All the experimentations were achieved using two Linux 2.4.18 computers with Intel 1.7 GHz processors and 512 MB of RAM connected with 100 Mbps Ethernet.

Figure 2. Ping-pong performances: average round-trip time versus object size for (a) TreeSets, (b) Vectors, (c) arrays of ints, and (d) arrays of doubles, comparing the RVM_RVM, TCP_RVM, TCP_SUN and TCP_IBM configurations.

We can see that, generally speaking, JikesRVM does not perform as well as the two other virtual machines for our ping-pong application. However, for the ping-pong of Vectors, we see that our JikesRVM-specific JToe implementation outperforms the 100% Java one by 30% to 40% when both are running on JikesRVM. This is a really promising result, since it suggests that the same sort of enhancement may lead to the same sort of performance improvement with the other virtual machines. For the TreeSets, our implementation does not give interesting results; it performs even worse than the default serialization. Our code performs a suboptimal graph browsing to collect the objects to send, whereas the ad hoc serialization of the java.util.TreeSet class directly and linearly writes the objects it contains. We believe that this is the reason for this poor performance. Future releases of JToe for JikesRVM will improve the graph browsing algorithm by marking visited objects with JikesRVM-specific techniques. Finally, for arrays of int and double, our implementation totally outperforms the standard serialization on JikesRVM. The round-trip times are close to those obtained with the Sun or IBM virtual machines, despite the poor performances of JikesRVM. This suggests that similarly optimized implementations of JToe for the Sun or IBM virtual machines
will lead to comparable improvements.

The issue of fast transfer of objects has already been addressed in other works. Expresso [11] is a framework aimed at transferring Java objects efficiently using zero-copy mechanisms. Expresso relies on the Kaffe [12] virtual machine. Expresso uses the notion of cluster to avoid graph exploration and pointer update. A cluster is a contiguous memory area where objects can be allocated. To transfer an object to a remote node, one actually transfers the cluster containing that object. All the other objects in the cluster are also transferred. With Expresso, objects must explicitly be allocated in the correct cluster to be transferred, and the references between objects in different clusters are not preserved. Ibis [13] is a grid computing environment for Java. It defines the IPL, an API on top of which higher-level distributed environments such as RMI can be implemented. Ibis differs from our work since the IPL not only defines the way objects are sent between computers but also how to access topology information, monitoring data, etc. Moreover, object serialization optimization in Ibis relies on byte code rewriting, the aim of which is to add the serialization code to classes to avoid dynamic type inspection.

5. CONCLUSION

In this paper we have presented a new API, which we call JToe, that is dedicated to the exchange of objects between JVMs. This API makes it possible to take advantage of the knowledge we have of both the source and target JVMs and of the underlying network and the associated communication libraries. This leads to very efficient exchange of objects between JVMs. We have three implementations running. Two are 100% Java (one over RMI and one over TCP); the third is a low-level one dedicated to JikesRVM. This last implementation is still a work in progress. We are currently working to enhance the graph browsing algorithm using JikesRVM-specific mechanisms. We also plan to release a LAPI and a Myrinet version of JToe.

REFERENCES
[1] Christian Nester, Michael Philippsen and Bernhard Haumacher. A More Efficient RMI for Java. In Java Grande, pages 152-159, 1999.
[2] Jason Maassen, Rob van Nieuwpoort, Ronald Veldema, Henri E. Bal, Thilo Kielmann, Ceriel J. H. Jacobs and Rutger F. H. Hofman. Efficient Java RMI for parallel programming. Programming Languages and Systems, 23(6):747-775, 2001.
[3] Fabian Breg and Constantine D. Polychronopoulos. Java virtual machine support for object serialization. In Proceedings of the 2001 joint ACM-ISCOPE Conference on Java Grande, pages 173-180. ACM Press, 2001.
[4] Fabian Breg and Dennis Gannon. A Customizable Implementation of RMI for High Performance Computing. In Proc. of the Workshop on Java for Parallel and Distributed Computing of IPPS/SPDP 99, pages 733-747, 1999.
[5] Sun Microsystems. Serialization specification. http://java.sun.com/
[6] G. Shah, J. Nieplocha, C. Mirza, R. Harrison, R. K. Govindaraju, K. Gildea, P. DiNicola and C. Bender. Performance and experience with LAPI: a new high-performance communication library for the IBM RS/6000 SP. In International Parallel Processing Symposium, pages 260-266, 1998.
[7] Myricom. Myrinet software. http://www.myri.com/scs/
[8] Sun Microsystems. Java Naming and Directory Interface. http://java.sun.com/products/jndi/
[9] K. Kono and T. Masuda. Efficient RMI: Dynamic Specialization of Object Serialization. In Proc. of the IEEE Int'l Conf. on Distributed Computing Systems (ICDCS), pages 308-315, 2000.
[10] Pascal Grange and Pierre Vignéras. The JToe project. http://jtoe.sf.net
[11] L. Courtrai, Y. Mahéo and F. Raimbault. Expresso: a Library for Fast Transfers of Java Objects. In Myrinet User Group Conference, Lyon, 2000.
[12] The Kaffe Java virtual machine. http://www.kaffe.org/
[13] Rob van Nieuwpoort, Jason Maassen, Rutger Hofman, Thilo Kielmann and Henri E. Bal. Ibis: an Efficient Java-based Grid Programming Environment. In Joint ACM Java Grande - ISCOPE 2002 Conference, pages 18-27, 2002.
[14] B. Alpern, C. R. Attanasio, A. Cocchi, D. Lieber, S. Smith, T. Ngo, J. J. Barton, S. F. Hummel, J. C. Sheperd, and M. Mergen. Implementing Jalapeño in Java. In Proceedings of the 1999 ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages & Applications (OOPSLA 99), Denver, Colorado, November 1-5, 1999, volume 34(10) of ACM SIGPLAN Notices, pages 314-324. ACM Press, Oct. 1999.
A Modular Debugging Infrastructure for Parallel Programs

D. Kranzlmüller, Ch. Schaubschläger, M. Scarpa, and J. Volkert

GUP, University of Linz, Altenbergerstr. 69, 4040 Linz, Austria

Debugging parallel and distributed programs is a difficult activity due to multiple concurrently executing and communicating tasks. One major obstacle is the amount of debugging data which needs to be analyzed for detecting errors and their causes. The debugging tool DeWiz addresses this problem by partitioning the analysis activities into different, independent modules, and distributing these modules on the available, possibly distributed, computing environment. The target of the analysis activities is the event graph, which represents a program's behavior in terms of program state changes and corresponding relations. In DeWiz we distinguish three different kinds of modules, namely Event Graph Generation Modules, which generate an event graph stream from the collected trace data, (Automatic) Analysis Modules, whose task is to perform various analysis techniques on the event graph, and Data Access Modules, which present the results of the event graph analysis to the user in a meaningful way. The interconnection between the different modules is established by a dedicated protocol, which is currently based on TCP/IP or uses shared memory. This allows any user of DeWiz to easily integrate arbitrary software analysis tools with DeWiz.

1. OVERVIEW OF PROGRAM ANALYSIS WITH DEWIZ

1.1. Motivation and related work

Software development for High-Performance Computers (HPC) faces several serious challenges. One example is the possible scale of the programs, determined by long execution times and large numbers of participating processes. As a result, huge amounts of program state data have to be processed during program analysis activities, such as testing and tuning. This problem is addressed by the debugging tool DeWiz (Debugging Wizard), which evolved out of the Monitoring and Debugging environment MAD [6] as a possible solution. The original approach of MAD was repeatedly affected by the amount of debugging data. Long waiting times during interactive debugging sessions and the impracticability of certain analysis techniques represented substantial drawbacks, especially for real-world applications. Similar approaches to MAD are provided by commercial products, such as VAMPIR [1], or other related tools in this area, e.g. AIMS [18], Paje [2], Paradyn [11], ParaGraph [3], and PROVE [4]. Each of these tools employs some kind of space-time diagram as a representation of program behavior. By analyzing these related approaches in the field, some basic characteristics and differences are obtained:

• Analysis goals: performance analysis vs. error debugging. Most software tools support only one analysis area, focusing either on performance aspects of a program or on the correctness of its behavior. Some tools provide additional support for other tasks of the software life-cycle, e.g. visual parallel programming.

• Means of communication/execution: shared memory vs. message passing. Depending on the underlying hardware architecture, runtime system, or development environment, software tools usually support only one kind of programming paradigm. An additional distinction is the parallel execution itself, e.g. whether threads or processes are used.

• Levels of abstraction: high-level code vs. machine instructions. Only few tools provide program analysis activities on different levels of abstraction, while most are pre-determined by the chosen level of instrumentation.

• Approaches to instrumentation: static source vs. dynamic binary instrumentation. Instrumentation of programs allows to distinguish between various different approaches, with static instrumentation of source code on one side of the spectrum, and dynamic instrumentation of binary machine code on the other side.

• Connection between monitor and analysis tool: on-line vs. post-mortem. Another related characteristic of software tools is the connection between the monitoring part, which extracts the behavioral data, and the analysis part, which investigates the program's behavior. In brief, on-line program analysis is required whenever changes to the execution behavior should be applicable on the fly, while post-mortem analysis is chosen if the analysis technique requires information about the complete execution history of a program.

The reasons for all these distinctions are often given by the particular interests of the involved software tool developers. Besides that, there are also some tools supporting characteristics that seem converse at first. For example, the VAMPIR tool supports both shared-memory OpenMP programs [14] and message-passing programs using MPI [13]. This originates from the fact that some of today's architectures are best utilized by using a hybrid MPI/OpenMP programming style [15], and more tools have already been proposed to follow this mixed-mode programming style. During our work on the MAD environment, some more evidence occurred: the event manipulation technique originally developed for message-passing programs required only minor changes to be useful for shared-memory codes [5]. With little extension, most of it in the monitoring code, MAD was also applicable for performance analysis activities. These experiences motivated us to extend the originally conservative approach of using the event graph for message-passing codes only into a more universal solution, whose result is the program analysis tool DeWiz. The ideas of DeWiz, namely

• the representation of a program's behavior as an event graph,
• the modular, hence flexible and extensible approach, and
• the graphical representation of a program's behavior

will be described in more detail in the following sections.
2. EVENT GRAPH

The theoretical fundament of the DeWiz debugging environment is the event graph, which has been defined as follows:

Definition 1 (Event Graph [8]). An event graph is a directed graph G = (E, →), where E is the non-empty set of events e ∈ E, while → is a relation connecting events, such that x → y means that there is an edge from event x to event y in G with the "tail" at event x and the "head" at event y.

The events e ∈ E are the recorded events observed during a program's execution, like for example send or receive events in a message-passing program, or semaphore locks and unlocks in a shared memory environment, respectively. The relation connecting the events of an event graph is the so-called happened-before relation, which is defined as follows:
Definition 2 (Happened-before relation [10]). The happened-before relation → is defined as → = →s ∪ →c, where →s is the sequential order of events relative to a particular responsible object, while →c is the concurrent order relation connecting events on arbitrary responsible objects.

The sequential order relation →s simply states that if two events e_p^i and e_p^j occur on the same process p and event e_p^i occurred before event e_p^j, then e_p^i →s e_p^j. The concurrent order relation →c defines inter-process relations of events. In the case of message-passing programs it means that, if event e_p^i occurs on process p and is a send event, and event e_q^j occurs on process q and is the corresponding receive event, then e_p^i →c e_q^j. Based on these definitions we have implemented the DeWiz protocol, which allows to propagate an event graph stream through a DeWiz system.
3. DEWIZ PROTOCOL AND FRAMEWORK

The DeWiz protocol defines the communication of event graph streams between different DeWiz modules. Based on this protocol, we use the abstract concepts of interfaces and communication channels. One channel connects exactly two modules, channels are unidirectional, and each module has exactly one interface (incoming or outgoing) per channel. There are two approaches to how these channels and interfaces are implemented. The first one is to use TCP/IP and BSD sockets [16]. This approach enables the tool to be distributed across different computing resources, or even to use grid infrastructures, which are currently deployed all over the world. For example, in a DeWiz debugging session it might be feasible to execute an event graph generation module on the same computer where the application under observation is executed. The event graph generated by this module could be forwarded to a (probably resource intensive) analysis module which is executed on a different computer, in order to provide as much computing power as possible to the observed application (and not having to share resources with analysis tasks). The analysis module could then be connected to a visualization module, which resides on a third computer, e.g. the workstation of the DeWiz user.

However, in some cases this approach might not be useful, e.g. when the amount of trace data, or the size of the event graph respectively, exceeds certain limits. In this case it is not efficient, or even impossible, to send the event graph stream over the network, simply because of its size. In such a case the event graph generation module and the analysis module(s) must reside on the same computer to avoid copying of the event graph. For that purpose DeWiz provides shared memory interfaces in addition to TCP/IP. When two modules are connected via such a shared memory interface, the "producer" module writes the event graph data into a shared memory segment, while possible "consumer" modules can read the data from that segment, thus following the "zero copy" paradigm.

To enable the propagation of an event graph with the DeWiz protocol, several data structures have been defined. Simplified, an event graph stream consists of two kinds of such data structures:

    event:                      e_p^i = (p, i, type, data)
    concurrent order relation:  e_p^i →c e_q^j

The event data structure corresponds to a particular state change observed in a program's execution. The variables p and i identify the responsible object on which the event occurred and its relative sequential order, respectively. The variable type determines the kind of observed event, e.g. in message-passing programs a send or receive. The data field is used for optional attributes, which describe the event in more detail depending on the intended analysis activities. The concurrent order relation represents a subset of the happened-before relation stated above. It is used to mark corresponding events on distinct processes whose operations are somehow connected, e.g. a corresponding pair of send and receive operations. (Please note that events on the same host object p are already ordered by their sequential identifier i.)

The DeWiz Framework offers the required functionality to implement DeWiz modules in a particular programming language. At present, the complete functionality of DeWiz is supported in Java, while smaller fragments of the analysis pipeline are already available in C and C++. Each module must implement the following functionality:

• Open event graph stream interface
• Filter relevant events for processing
• Process event graph stream as desired
• Close event graph stream interface

The functions to open and close interfaces are used to establish and destroy interconnections between modules. The interfaces transparently implement the DeWiz protocol, while filtering and processing events is performed within the main loop of each module, as sketched below.
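The following sketch illustrates these four steps in plain Java. The Event record mirrors the (p, i, type, data) structure above, but all class and method names, including the EventStream interface, are illustrative assumptions, not the real signatures of the DeWiz framework.

    // Event record mirroring e_p^i = (p, i, type, data).
    class Event {
        int p;          // responsible object (e.g. process) identifier
        int i;          // relative sequential order on p
        int type;       // kind of state change, e.g. SEND or RECV
        byte[] data;    // optional attributes
        static final int SEND = 0, RECV = 1;
    }

    // Assumed abstraction of an incoming or outgoing stream interface.
    interface EventStream {
        void open();
        Event next();            // null at end of stream
        void forward(Event e);
        void close();
    }

    // Skeleton of an analysis module in the pipeline.
    class AnalysisModule {
        void run(EventStream in, EventStream out) {
            in.open(); out.open();                   // open stream interfaces
            Event e;
            while ((e = in.next()) != null) {
                if (e.type == Event.SEND || e.type == Event.RECV) {  // filter
                    process(e);                      // process as desired
                }
                out.forward(e);                      // pass the stream on
            }
            in.close(); out.close();                 // close stream interfaces
        }
        void process(Event e) { /* analysis specific to this module */ }
    }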
Figure 1. DeWiz system during event graph stream processing. Four monitors generate the event graph stream - presumably through on-line observation of 4 processes. A merger module and 2 buffer modules combine and cache the event graph stream. A pattern matching module and a group detection module perform automatic analysis, while the results of these analysis activities are presented in a visualization tool.
4. DEWIZ MODULES

The processing of DeWiz is performed on the above-mentioned data structures within DeWiz modules. The modules are assembled into a DeWiz system, which represents the intended program analysis pipeline for a particular debugging or performance analysis strategy. In Figure 1 an example DeWiz system is shown. Depending on their processing capabilities, different types of modules can be distinguished:
4.1. Event graph generation modules

At least one module is required for generating an event graph stream corresponding to a program's execution. The events can be generated either on-line with a monitoring tool, or post-mortem by reading corresponding trace files. In case DeWiz is utilized as a plug-in for existing program analysis tools, the event graph stream is generated by converting the specific data structures of the host tool into the event graph stream protocol. Example modules already provided by the DeWiz framework include an on-line interface to OCM (OMIS Compliant Monitor) [17] and OPARI (OpenMP Pragma and Region Instrumentor) [12, 5], as well as a trace reader for the monitoring tool NOPE [7].

4.2. Automatic analysis modules

The actual operations on the event graph are performed by automatic analysis modules. These modules extract the desired information and try to detect interesting characteristics of the program's behavior. Example modules are already provided for error detection, e.g. to determine communication errors in message-passing programs or race conditions at semaphore operations in shared memory programs. More elaborate examples include a pattern matcher for repeated behavioral patterns and a module for the detection of process grouping and subsequent process isolation [9].
4.3. Data access modules
After or during processing of the event graph stream, detected program characteristics can be presented to the user. Different kinds of data access modules support a variety of user interfaces. In most cases, DeWiz forwards the results to a visualization module that represents the event graph as a kind of space-time diagram. Example modules include ATEMPT, the visualization tool of the MAD environment, a failure notification mechanism for cellular phones, and a Java applet for arbitrary web browsers.

5. EXAMPLES

The analysis functionality already implemented in DeWiz is described with the following two examples, namely extraction of communication failures, and pattern matching and loop detection. Communication failures can be detected by pairwise analysis of communication partners. A set of analysis activities for message-passing programs is described in [8]. An example is the detection of different message lengths at send and receive operations, which may formally be defined as follows: two events e_p^i and e_q^j are called events with different message length if e_p^i →c e_q^j and messageLength(e_p^i) ≠ messageLength(e_q^j). This formal definition can easily be mapped onto a DeWiz module with the available Java framework (a sketch is given at the end of this section). During analysis, the module detects every communication pair where the send operation transmits more or fewer bytes than the receive operation expects. A more complex analysis activity compared to the extraction of communication failures is pattern matching and loop detection. The goal of the corresponding DeWiz modules is to identify repeated process interaction patterns in the event graph. Some example event graph patterns are given in Figure 2. The leftmost pattern is called a simple exchange pattern, which describes the situation where two arbitrary processes mutually exchange some kind of data item. The existence of this simple event graph pattern can easily be verified within a DeWiz module. More complex patterns can be specified and provided in a pattern database according to the needs of users and the characteristics of their programs. Vice versa, the user may even specify expected patterns of a program, and the tool tries to locate them in the event graph of the observed execution. This allows to decrease the complexity of the analysis data, if patterns are detected in a parallel program, or to detect incorrect behavior, if expected patterns are missing. Some more details about pattern matching can be found in [8].
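Using the illustrative Event type sketched in section 3, the core of such a module reduces to a predicate over matched send/receive pairs; how the message length is extracted from the event's optional attributes is an assumption made here for illustration.

    // e_p^i ->c e_q^j is assumed to have been established by matching the
    // concurrent order relation; the test itself is then a one-liner.
    class MessageLengthCheck {
        boolean differentMessageLength(Event send, Event recv) {
            return messageLength(send) != messageLength(recv);
        }
        int messageLength(Event e) {
            return e.data.length;   // illustrative encoding of the attribute
        }
    }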
6. SUMMARY

The modular approach of DeWiz allows parallel and distributed program analysis based on a set of independent analysis modules. The target representation of program behavior is the abstract event graph model, which allows a wide variety of analysis activities for different kinds of programs on different levels of abstraction. The analysis modules may be arbitrarily distributed across a set of available resources, which allows to process even large amounts of program state data in reasonable time. The latter is especially interesting with respect to grid computing, which may be just the right environment for a future DeWiz grid debugging service.
Figure 2. Process interaction patterns. The leftmost pattern is called a simple exchange pattern. In the middle a round robin pattern is shown, while the right screenshot shows a tree-like communication pattern.
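A matcher for the simple exchange pattern of Figure 2 only needs to find two concurrent-order edges running in opposite directions between one pair of processes. The edge record below is an illustrative encoding of e_p^i →c e_q^j, not an actual DeWiz class.

    // One edge of the concurrent order relation: (srcP, srcI) ->c (dstP, dstI).
    class OrderEdge {
        int srcP, srcI;
        int dstP, dstI;
    }

    class SimpleExchangeMatcher {
        // Two edges form a simple exchange when each process is both
        // a sender to and a receiver from the other one.
        boolean isSimpleExchange(OrderEdge e1, OrderEdge e2) {
            return e1.srcP == e2.dstP && e1.dstP == e2.srcP
                && e1.srcP != e1.dstP;
        }
    }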
REFERENCES
[1] H. Brunst, H.-Ch. Hoppe, W.E. Nagel, and M. Winkler. Performance Optimization for Large Scale Computing: The Scalable VAMPIR Approach. Proc. ICCS 2001, Intl. Conference on Computational Science, Springer-Verlag, LNCS, Vol. 2074, San Francisco, CA, USA (May 2001).
[2] J. Chassin de Kergommeaux, B. Stein. Paje: An Extensible Environment for Visualizing Multi-Threaded Program Executions. Proc. Euro-Par 2000, Springer-Verlag, LNCS, Vol. 1900, Munich, Germany, pp. 133-144 (2000).
[3] M.T. Heath, J.A. Etheridge. Visualizing the Performance of Parallel Programs. IEEE Software, Vol. 13, No. 6, pp. 77-83 (November 1996).
[4] P. Kacsuk. Performance Visualization in the GRADE Parallel Programming Environment. Proc. HPC Asia 2000, 4th Intl. Conference/Exhibition on High Performance Computing in the Asia-Pacific Region, Peking, China, pp. 446-450 (2000).
[5] R. Kobler, D. Kranzlmüller, and J. Volkert. Debugging OpenMP Programs using Event Manipulation. Proc. WOMPAT 2001, Intl. Workshop on OpenMP Applications and Tools, Springer-Verlag, LNCS, Vol. 2104, West Lafayette, Indiana, USA, pp. 81-89 (July 2001).
[6] D. Kranzlmüller, S. Grabner, and J. Volkert. Debugging with the MAD Environment. Parallel Computing, Vol. 23, No. 1-2, pp. 199-217 (Apr. 1997).
[7] D. Kranzlmüller, J. Volkert. Debugging Point-To-Point Communication in MPI and PVM. Proc. EuroPVM/MPI 98, Intl. Conference, Liverpool, GB, pp. 265-272 (Sept. 1998).
[8] D. Kranzlmüller. Event Graph Analysis for Debugging Massively Parallel Programs. PhD thesis, GUP, Joh. Kepler Univ. Linz (Sept. 2000), http://www.gup.uni-linz.ac.at/~dk/thesis.
[9] D. Kranzlmüller. Scalable Parallel Program Debugging with Process Isolation and Grouping. Proc. IPDPS 2002, 16th International Parallel & Distributed Processing Symposium, Workshop on High-Level Parallel Programming Models & Supportive Environments (HIPS 2002), IEEE Computer Society, Ft. Lauderdale, Florida (April 2002).
[10] L. Lamport. Time, Clocks, and the Ordering of Events in a Distributed System. Communications of the ACM, Vol. 21, No. 7, pp. 558-565 (July 1978).
[11] B.P. Miller, M.D. Callaghan, J.M. Cargille, J.K. Hollingsworth, R.B. Irvin, K.L. Karavanic, K. Kunchithapadam, and T. Newhall. The Paradyn Parallel Performance Measurement Tool. IEEE Computer, Vol. 28, No. 11, pp. 37-46 (November 1995).
[12] B. Mohr, A.D. Malony, S. Shende, and F. Wolf. Design and Prototype of a Performance Tool Interface for OpenMP. Proc. LACSI Symposium 2001, Los Alamos Computer Science Institute, Santa Fe, New Mexico, USA (October 2001).
[13] Message Passing Interface Forum. MPI: A Message Passing Interface Standard - Version 1.1. http://www.mcs.anl.gov/mpi/ (June 1995).
[14] OpenMP Architecture Review Board. OpenMP C and C++ Application Program Interface - Version 2.0. http://www.openmp.org/ (March 2002).
[15] R. Rabenseifner. Communication and Optimization Aspects on Hybrid Architectures. Proc. EuroPVM/MPI 2002, Springer-Verlag, LNCS, Vol. 2474, Linz, Austria, pp. 410-420 (2002).
[16] W. Richard Stevens. UNIX Network Programming. Prentice Hall (1990).
[17] R. Wismüller, J. Trinitis, and T. Ludwig. OCM - a Monitoring System for Interoperable Tools. Proc. SPDT 98, 2nd SIGMETRICS Symposium on Parallel and Distributed Tools, ACM Press, Welches, Oregon, USA, pp. 1-9 (August 1998).
[18] J.C. Yan, H.H. Jin, and M.A. Schmidt. Performance Data Gathering and Representation from Fixed-Size Statistical Data. Technical Report NAS-98-003, http://www.nas.nasa.gov/Research/Reports/Techreports/1998/nas-98-003.pdf, NAS System Division, NASA Ames Research Center, Moffett Field, California, USA (February 1999).
Toward a Distributed Computational Steering Environment based on CORBA

O. Coulaud, M. Dussere, and A. Esnard

Projet ScAlApplix, INRIA Futurs et LaBRI UMR CNRS 5800, 351, cours de la Libération, F-33405 Talence, France

This paper presents the first step toward a computational steering environment based on CORBA. This environment, called EPSN¹, allows the control, the data exploration and the data modification of numerical simulations involving an iterative process. In order to be as generic as possible, we introduce an abstract model of steerable simulations. This abstraction allows us to build steering clients independently of a given simulation. This model is described with an XML syntax and is used in the simulation by some source code annotations. EPSN takes advantage of the CORBA technology to design a communication infrastructure with portability, interoperability and network transparency. In addition, the in-progress parallel CORBA objects will give us a very attractive framework for extending the steering to parallel and distributed simulations.

1. INTRODUCTION

Thanks to the constant evolution of computational capacity, numerical simulations are becoming more and more complex; it is not uncommon to couple different models in different distributed codes running on heterogeneous networks of parallel computers (e.g. multi-physics simulations). For years, the scientific computing community has expressed the need for new computational steering tools to better grasp the complexity of the underlying models. Computational steering is an effort to make the typical simulation work-flow (modeling, computing, analyzing) more efficient, by providing on-line visualization and interactive steering over the on-going computational processes. The on-line visualization appears very useful to monitor and detect possible errors in long-running applications, and the interactive steering allows the researcher to alter simulation parameters on-the-fly and immediately receive feedback on their effects. Thus, the scientist gains a better insight into the simulation regarding the cause-and-effect relationship. A computational steering environment is defined in [1] as a communication infrastructure coupling a simulation with a remote user interface, called the steering system. This interface usually provides on-line visualization and user interaction. Over the last decade, many steering environments have been developed; they distinguish themselves by some critical features such as the simulation integration process, the communication infrastructure and the steering system design. A first solution for the integration is the problem solving environment (PSE) approach, like in SCIRun [2]. This approach allows the scientist to construct a steering application according to a visual programming model. At the opposite, CAVEStudy [3] only interacts with
¹The EPSN project (http://www.labri.fr/epsn) is supported by the French ACI-GRID initiative.
the application through its standard input/output. Nevertheless, the majority of the steering environments, such as the well-known CUMULVS [4], are based on the instrumentation of the application source code. We have chosen this approach as it allows fine-grain steering functionalities and achieves good runtime performances. Regarding the communication infrastructure, there are many underlying issues, especially when considering parallel and distributed simulations: heterogeneous data transfer, network communication protocols and data redistributions. In VIPER [5], RPCs and the XDR protocol are used to implement the communication infrastructure. Magellan & Falcon [6] communicates over heterogeneous environments through an event system built upon DataExchange. CUMULVS uses a high-level communication infrastructure based on PVM and allows to collect data from parallel simulations with HPF-like data redistributions. In the EPSN project, we intend to explore the capabilities of the CORBA technology and the currently under development parallel CORBA objects for the computational steering of parallel and distributed simulations. In this paper, we first describe the basis and the architecture of EPSN. Then, we illustrate the integration process of a simulation. Finally, we present preliminary results on the EPSN prototype, called Epsilon.
2. THE EPSN ENVIRONMENT
2.1. Principles

EPSN is a steering platform for the instrumentation of numerical simulations. In order to be as generic as possible, we introduce an abstract model of steerable simulations. This model intends to clearly identify where, when and how a remote client can safely access data and control the simulation. We consider numerical simulations as iterative processes involving a set of data and a hierarchy of computation loops modifying these data. Each loop is associated with a single counter, and the association of all these counters enables EPSN to precisely follow the global time-step evolution. The "play/stop" control operations imply the definition of breakpoints at some stable states of the simulation. In practice, we can easily steer most simulations by simply instrumenting their main loop. The basic access operations consist in extracting and modifying the data. As these data alternate between coherent and incoherent states during the computation, this implies the definition of restricted areas where data are not accessible. Several data can be logically associated to define a group, which enables the end-user to efficiently manipulate correlated data together. Moreover, we have extended the coherency definition for groups to guarantee that all group members are accessed at the same iteration. This model, which fits parallel applications (SPMD) well, needs the definition of a global time extension to maintain the data coherency over coupled or distributed simulations (MPMD). Such a mechanism implies an explicit association of the loops and breakpoints of the different simulation components. The representation in the abstract model is obtained by pointing out the relevant information on the simulation. First, the user describes the simulation elements in an XML file; then he connects the simulation with its representation. To do so, he annotates the source code with the EPSN API in order to locate the elements that he has identified and to mark their evolution through the simulation process. The XML description also intends to lighten and clarify this annotation phase.
Figure 1. (a) Detail of the EPSN architecture. (b) EPSN parallel architecture with PaCO++.
2.2. Architecture and communication infrastructure

EPSN is a distributed and dynamic infrastructure. It is based on a client/server relationship between the steering system (the client) and the simulation (the server), which both use EPSN libraries. The clients are not tightly coupled with the simulation. Actually, they can interact on-the-fly with the simulation through asynchronous and concurrent requests. According to this model, different steering systems can concurrently access the same simulation and, reciprocally, a steering system can simultaneously access different simulations. These characteristics make the EPSN environment very flexible and dynamic. The communication infrastructure of EPSN is based on CORBA, but it is completely hidden to the end-user. CORBA enables applications to communicate in a distributed heterogeneous environment with network transparency, according to an RPC programming model. It also provides to EPSN the interoperability between applications running on different machine architectures. Although CORBA is often criticized for its performance, some implementations are very efficient [7]. The principle of the EPSN infrastructure is to run a permanent thread attached to each simulation process. This thread contains a multi-threaded CORBA server dedicated to the communications between the steering clients and the simulation. As shown in figure 1(a), the simulation thread consists of different CORBA objects corresponding to EPSN functionalities. For being fully asynchronous, EPSN uses oneway CORBA calls, and the client thread also implements a callback object to receive data from the simulation. In other steering environments, like CUMULVS, the simulation is in charge of the communications, which occur during a single blocking subroutine call. In EPSN, the thread accesses a data directly through the shared memory of the process, without any copy. Between a process and the EPSN thread, the communications use standard inter-thread synchronization mechanisms based on semaphores and signals. This strategy, combined with the asynchronous CORBA calls, allows to overlap the communications. In order to maintain a single representation of the simulation, we use a specific CORBA object, the proxy, running on the first simulation process. This object provides the description of the whole simulation and all the CORBA references needed by both the client and the other simulation processes. As the proxy is registered to the CORBA naming service, remote clients can easily connect to it. In order to achieve coherent steering operations on SPMD simulations, we have developed some protocols to synchronize the simulation processes. This synchronization implies to broadcast the request to all the involved processes and to synchronize on the first
This synchronization requires broadcasting the request to all the involved processes and synchronizing on the first breakpoint before performing the parallel request. To reduce the synchronization cost, the parallel processes can also synchronize once at the beginning and then keep running synchronously. Parallel extensions of CORBA objects, like PaCO++ [8], reduce the synchronization cost thanks to a better use of the parallel infrastructure. PaCO++ typically exploits an internal communication layer based on MPI (Fig. 1(b)). It also greatly eases the extension of EPSN to parallel clients and helps with the inherent problem of data redistribution. On-going work focuses on the integration of PaCO++ in EPSN and especially on the full support of regular data decompositions.

2.3. Functionalities

EPSN consists of two C/Fortran libraries: the first one provides functions to build a steerable simulation, and the second one makes it possible to build a remote steering system.

Control. The insertion of barrier function calls, acting as debugger breakpoints, in the simulation source code allows the user to control the execution flow of the simulation. The breakpoints can be remotely set "up" or "down" (setbarrier command), and they allow classical control commands (play, step, stop). Moreover, the calls to the iterate function mark the evolution of the simulation through the loop hierarchy.

Data Extraction. On the client side, the user can remotely extract data from the simulation by calling get functions. The client manages such a request with wait and test MPI-like functions. On the simulation side, data access is protected by lock/unlock functions, which delimit the "coherence areas" in the source code. Therefore, data can be sent immediately when a get-request is received, or delayed if the data is not accessible yet. Once the data is received by the client, it can be automatically copied into the client memory within the same lock/unlock areas as on the simulation side, or it can trigger a treatment defined by the user through a callback function. The user can also request a data item permanently, with the getp/cancelp functions, in order to continually receive new data releases and produce "on-line" visualization. In this case, an acknowledgment system automatically regulates the data transfer from the simulation to the client, by voluntarily ignoring some data releases to avoid congesting the client. Nevertheless, a flush function can be used to force the data to be sent at each timestep. This prevents the client from missing any release, but it can slow down the simulation depending on the client load.

Data Modification. In the same way, the client can modify a data item by calling the put function, which transfers data from the client memory to the simulation.
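To make the client-side workflow concrete, the following C fragment sketches how the control, extraction and modification functions described above could be combined. The prototypes below are assumptions made for illustration; they do not reproduce the documented EPSN API.

    /* Hypothetical client-side sketch; all prototypes are invented for
     * illustration and are not the real EPSN signatures. */
    typedef int epsn_request_t;

    void epsn_setbarrier(const char *name, int up);
    void epsn_play(void);
    void epsn_get(const char *data, void *buf, epsn_request_t *req);
    void epsn_wait(epsn_request_t req);
    void epsn_put(const char *data, const void *buf);

    void steer_once(void) {
        static double energy[1024];
        epsn_request_t req;

        epsn_setbarrier("begin", 0);      /* set the breakpoint "down"       */
        epsn_play();                      /* classical control command       */
        epsn_get("Energy", energy, &req); /* asynchronous extraction request */
        epsn_wait(req);                   /* MPI-like completion wait        */
        energy[0] *= 2.0;                 /* steer: modify the field ...     */
        epsn_put("Energy", energy);       /* ... and write it back           */
    }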
3. BUILDING A STEERABLE APPLICATION

In this section, we present the integration of the EPSN steering functionalities into a parallel fluid flow simulation software, FluidBox [9]. This MPI Fortran code is based on a finite volume approximation of the Euler equations on unstructured meshes and simulates a two-fluid spray injection. We detail the XML description of this simulation in the abstract model, the instrumentation of the source code, and the different solutions proposed in EPSN to construct steering systems.

3.1. XML description of the simulation

The first phase in the construction of a steerable simulation consists of its description through an XML file. This description is the representation of the simulation shared by the simulation thread and by the client thread.
<simulation name="spray" context="parallel">
  <scalar id="nbTri" type="long" access="readonly" location="replicated"/>
  <sequence id="NodesCoord" type="double" location="replicated">
  <sequence id="Cells" type="long" location="replicated">
  <sequence id="Energy" type="double" location="distributed">
Figure 2. FluidBox short XML description.
! --- Simulation initialization ---
CALL ReadMesh(Mesh, MeshFile)
CALL Init(Data, Mesh, Var)
! --- Epsilon initialization ---
CALL epsilon_init('simu.xml', numproc, nbprocs, ierr)
CALL epsilon_publish('nbNodes', Mesh%Npoint, ierr)
CALL epsilon_publish('nbTri', Mesh%Nelemt, ierr)
CALL epsilon_publish('NodesCoord', Mesh%coor(1,1), ierr)
CALL epsilon_publish('Cells', Mesh%nu(1,1), ierr)
CALL epsilon_publish('Energy', Var%Ua(1,1), ierr)
CALL epsilon_publishgroup('mesh', ierr)
IF (numproc.EQ.0) THEN
   CALL epsilon_barrier('begin', ierr)   ! Barrier on master process
END IF
CALL MPI_Barrier(MPI_COMM_WORLD, ierr)
CALL epsilon_unlockall(ierr)
DO kt = kt0+1, ktmax                     ! Simulation loop
   CALL Inject(Data, Mesh, Var)          ! Fluid injection
   CALL epsilon_barrier('mid', ierr)     ! Barrier distributed on all processes
   ! --- Modify field values inside locked area ---
   CALL epsilon_lock('Energy', ierr)
   Var%Ua(:,:) = Var%Un(:,:)
   CALL epsilon_unlock('Energy', ierr)
   CALL Post(Mesh, Var)
   CALL epsilon_iterate('loop', ierr)
END DO
! --- Simulation ending ---
CALL WriteResult(Data, Mesh, Var)
CALL epsilon_exit(ierr)
Figure 3. FluidBox instrumented pseudo-code.
On the simulation side, the XML is parsed at the initialization of EPSN to build all the necessary structures and to parameterize the instrumentation. The clients dynamically get this description from the simulation, so they do not need direct access to the XML file. As shown in Figure 2, the XML file mainly contains the simulation name, a description of the simulation scheme with loops and breakpoints (the control XML element) and a description of all the published data (the data XML element). Scalar data and sequences (arrays) are precisely described with their type, their access permission and their location (on the master process, replicated on all processes, or distributed over all processes). Each dimension of a sequence must be detailed with its size, its offset and, in the distributed case, its decomposition (block, cyclic, etc.). Moreover, the XML group elements allow the user to logically associate different data (e.g. the FluidBox mesh group).

3.2. Instrumentation

As mentioned above, the integration of an existing simulation is done through source code annotations with a few function calls. Figure 3 presents the instrumented pseudo-code of FluidBox with the three classical phases: the initialization, the simulation loop and the ending phase. The initialization of the whole EPSN infrastructure is simply done by calling the init function with the XML file name as argument. When considering parallel simulations, each process must also indicate its rank and the number of processes involved in the simulation. Then, each data item described in the XML has to be pointed at in the process memory and published (publish). Within the simulation body itself, one has to mark the loop evolution (iterate) and to locate the breakpoints (barrier) defined in the XML file. One also has to place the data access areas (lock/unlock) and to explicitly signal new data releases (release).
Figure 4. EPSN generic client.
Figure 5. Semi-generic client (FluidBox).
In the example, only the energy field is modified during the iteration, so the other mesh components remain accessible during the whole computation. At the end of the process, the exit call properly terminates the whole infrastructure. After that, if a client tries to connect or to send a request to the simulation, the EPSN call gets a CORBA exception and returns an error status. Finally, when one runs an instrumented simulation, it starts the EPSN CORBA server, which is accessible through the CORBA naming service and waits for client requests.
3.3. Visualization and steering system
EPSN proposes three different strategies to construct a remote steering system. One could implement a steering system directly on the CORBA interface (IDL), but it is more convenient to use the EPSN client API. The functionalities of this API (see Section 2.3) allow the user to build a specific client precisely adapted to a given simulation. One can also use the EPSN generic client, implemented in Java/Swing (Fig. 4). This tool can control any EPSN simulation and access its data. Data are presented through simple numerical datasheets or through basic visualization plug-ins. An intermediate way consists in implementing semi-generic clients using generic EPSN modules dedicated to the control, the data access and the visualization of complex objects (unstructured meshes, molecules, regular grids). This approach fits well with visualization environments (e.g. AVS/Express, Fig. 5).

4. PRELIMINARY RESULTS

We have implemented a prototype of the EPSN platform, called Epsilon. This prototype is written in C++ and is based on omniORB4 (http://omniorb.sourceforge.net), a high-performance implementation of CORBA, and the associated thread library (omniThread). The results of this section come from experiments realized on two PCs (Pentium IV) linked by a Fast Ethernet network (100 Mbps). Figure 6 presents the mean time needed by Epsilon to send data of different sizes (from 1 KB to 4 MB) to both a local and a remote client at each iteration (a getp request), without any computation performed. Remote Epsilon transfers are compared with the raw TCP/IP communications upon which omniORB is built; the figure shows that the Epsilon data transfer performance is almost as good as TCP/IP's. Figure 7 presents the same experiment, except that 80 ms of computation are performed per iteration, half of this time being in an unlocked access area.
Figure 6. Epsilon communication benchmark.
" "
Simulation + Epsiiion~i'r;:mote i cilenti----~---?
450
...............................
~
... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
~-::::"-~7
50 0
8
64 size (Kb)
512
4096
Figure 7. Data extraction from a simulation.
These results are compared with the computation time added to the TCP/IP sending time of the data. The figure demonstrates the overlapping capabilities of a fully asynchronous approach such as the one used in Epsilon: the remote Epsilon time remains below the sum of the computation time and the TCP/IP sending time, so the Epsilon steering overhead is fully overlapped by the computation. Finally, we have evaluated the instrumentation cost in the second experiment. It is quite negligible (less than 1%) and does not depend on the clients, since they are clearly decoupled from the simulation.

5. CONCLUSION AND PROSPECTS

As shown in this paper, the EPSN architecture provides a flexible and dynamic approach to computational steering. It proposes a low-cost instrumentation of existing simulations and capitalizes heavily on CORBA features. Moreover, parallel CORBA objects provide a suitable solution for most SPMD cases (with regular data distributions). The Epsilon prototype proves to be a lightweight, easy-to-use steering platform with strong interaction capabilities. Epsilon validates the integration model based on both source code annotation and an XML description. It also shows that CORBA features are of great interest for steering applications with good performance. The development of EPSN is now oriented towards the integration of parallel and distributed simulations with irregular data distributions.

REFERENCES
[1] Jurriaan D. Mulder, Jarke J. van Wijk, and Robert van Liere. A survey of computational steering environments. Future Generation Computer Systems, 15(1):119-129, 1999.
[2] S.G. Parker, M. Miller, C. Hansen, and C.R. Johnson. An integrated problem solving environment: the SCIRun computational steering system. In Hawaii International Conference on System Sciences, pages 147-156, 1998.
[3] Luc Renambot, Henri E. Bal, Desmond Germans, and Hans J. W. Spoelder. CAVEStudy: An infrastructure for computational steering and measuring in virtual reality environments. Cluster Computing, 4(1):79-87, 2001.
[4] J. A. Kohl and P. M. Papadopoulos. CUMULVS: Providing fault-tolerance, visualization, and steering of parallel applications. Int. J. of Supercomputer Applications and High Performance Computing, pages 224-235, 1997.
[5] S. Rathmayer and M. Lenke. A tool for on-line visualization and interactive steering of parallel HPC applications. In Proceedings of the 11th IPPS'97, pages 181-186, 1997.
[6] Weiming Gu, Greg Eisenhauer, Karsten Schwan, and Jeffrey Vetter. Falcon: On-line monitoring for steering parallel programs. Concurrency: Practice and Experience, 10(9):699-736, 1998.
[7] Alexandre Denis, Christian Pérez, and Thierry Priol. Towards high performance CORBA and MPI middlewares for grid computing. In Proceedings of the 2nd IWGC, pages 14-25, 2001.
[8] Alexandre Denis, Christian Pérez, and Thierry Priol. Portable parallel CORBA objects: an approach to combine parallel and distributed programming for grid computing. In Proc. of the 7th Intl. Euro-Par'01 Conference (EuroPar'01), pages 835-844, 2001.
[9] B. Nkonga and P. Charrier. Generalized parcel method for dispersed spray and message passing strategy on unstructured meshes. Parallel Computing, 28:369-398, 2002.
Parallel Computing: Software Technology, Algorithms, Architectures and Applications
G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors)
© 2004 Elsevier B.V. All rights reserved.
Parallel Decimation of 3D Meshes for Efficient Web-Based Isosurface Extraction

A. Clematis (a), D. D'Agostino (a), M. Mancini (a), and V. Gianuzzi (b)
(a) IMATI-CNR Genova, {clematis,dago,mancini}@ge.imati.cnr.it
(b) DISI-University of Genova, [email protected]

Isosurface extraction is a basic operation that makes it possible to implement many types of queries on volumetric data. The result of a query for a particular isovalue is a Triangulated Irregular Network (TIN) that may contain a huge number of points and triangles, depending on the size of the original data set. In a distributed environment, due to limited bandwidth and/or to the characteristics of the client, it may be necessary to visualize the result of the query at a lower resolution. The simplification process is costly, especially for huge data sets. In this paper we address the problem of efficiently parallelizing this process on a cluster of COTS PCs for a Web-based parallel isosurface extraction system.

1. INTRODUCTION

Nowadays large collections of 3D and volumetric data are available, and many people with expertise in different disciplines access these data through the Web. Isosurface extraction [2] is a basic operation that makes it possible to implement many types of queries on volumetric data. The product of an isosurface extraction is a Triangulated Irregular Network (TIN) containing a more or less large number of topological elements (triangles, edges and vertices), depending on the original data set. In order to efficiently transmit results from a Web server to a remote client, it is useful to simplify the TIN and to compress it [1]. In Figure 1 a typical scenario is depicted, where a client accesses a Web server in order to study a 3D data set, as discussed in [9]. Looking at the architecture of the system it is possible to point out three main components: the client, the interconnection network and the server. We assume that the client does not provide large computing resources (it may be, for instance, a Personal Digital Assistant), but that it is able to perform basic rendering and visualization operations on 3D data. The interconnection network in turn may have variable characteristics, but it often represents the bottleneck of the system. In order to obtain acceptable performance it is important to reduce the amount of transmitted data or to use suitable strategies, like progressive transmission, which reduce the effect of transmission delay on the visualization process. The server is expected to be powerful enough to support multiple requests arriving from potential clients. Here we suppose the server to have a multi-tier organization with a Web front end, which provides the user interface, and one or more local clusters to execute the required processing. In this configuration most of the computation is executed on the server side, where
parallel processing is a viable way to provide high performance. The computational pipeline executed on the server consists of an isosurface extraction step and possibly of a simplification and compression step. The resulting isosurface can be huge, so a user may ask for a simplification at different levels, depending on the characteristics of the client device. The server may benefit from parallel computing at different levels. In this paper we deal with the parallelization of the simplification process using a cluster of COTS workstations. The remaining part of the paper is organized in the following way: in Section 2 we discuss the problem of mesh simplification; in Section 3 we provide a review of related works; in Section 4 we describe our proposal; in Section 5 conclusions are discussed.

2. SIMPLIFICATION OF TIN

The surface simplification process is an important topic in visualization: meshes are often composed of so many triangles that rendering is very difficult. Isosurfaces are typically no exception, because they are composed of hundreds of thousands (if not millions) of triangles, and even today's common workstations have problems rendering models of this size. Moreover, this process is of particular interest for the interrogation of remote data, since it may contribute to reducing the amount of data that must be transmitted. We can define mesh simplification as the process of reducing the number of primitives in a mesh M, obtaining a mesh M' which is a good approximation of the original surface. To better clarify this concept let us consider the structure of a mesh. A mesh is a polygonal (in our case triangular) model composed of a set of vertices and a set of triangles. It provides a single fixed-resolution representation of one or more objects. A simplification of the mesh M is a mesh M' resulting from the elimination, following an appropriate criterion, of a subset of the initial vertices and triangles. Several algorithms have been proposed; for a survey see [1].
Figure 1. A Web system for 3D data analysis
We are interested in algorithms that preserve the original topology and a good approximation of the original geometry. For these reasons we initially focused our attention on Schroeder's algorithm, called "Decimation" [3], which was originally applied to isosurfaces. The algorithm reduces the number of triangles by deleting some vertices using local operations on geometry and topology. It is an iterative algorithm composed of three steps:
- classification of the vertices, on the basis of their topology, as simple, non-manifold, or boundary;
- evaluation and ordering of the simple and boundary vertices;
- removal of the least important vertex and of the triangles that use it; the resulting hole is patched by forming a new local triangulation.
The process is repeated until some termination condition is met (e.g. the number of remaining triangles falls below a threshold). In [5] the vertex removal operation is replaced with the edge collapse operation. A weight is assigned to each edge, the edges are ordered, and the least important one is removed by contracting it to a point, with the consequent elimination of the triangles that share it. In this manner many expensive consistency checks on the new triangulation resulting from vertex removal can be avoided. Furthermore, a new evaluation criterion, the quadric error metric, is proposed. The cost assigned to each edge represents the amount of error introduced in the mesh by its elimination. This criterion best preserves the quality of the resulting mesh because, at each iteration, the cost assigned to each edge modified by a contraction accounts for the error with respect to the original mesh rather than the current mesh. This error accumulation produces higher quality meshes. For these reasons we based our parallel implementation on Garland's algorithm.

3. RELATED WORKS

Several works aim at efficiently performing mesh simplification in parallel. In [6] a combination of the message passing and shared memory paradigms is used. The algorithm uses the master-worker scheme and a data parallel approach. Masters (called priority queue handle processes) are associated with priority queues related to the connected components. They maintain the topology and the priority queue order, providing the associated set of workers with edges to collapse. In [7] the approach is based on a greedy partition of the mesh made by the master and an independent simplification of each resulting set made by the workers. In this work the problem of the borders resulting from the mesh subdivision among workers is solved by a post-processing step. In [8] a similar scheme is adopted, with minimized worker communications. In [4] the concept of super independent set is the basis of the proposed technique. The parallel removal of a set of vertices is possible only if they are "super independent": two vertices are super independent if they do not share any element of the boundary of the hole that would result from their elimination, so their removal operations do not influence each other.
This parallel implementation of Schroeder's algorithm is based on a master-worker approach. Workers evaluate the vertices, order them and send the result to the master. The master creates an ordered list of super independent vertices, partitioning it among the workers, which remove them. The process is repeated until the target simplification is achieved. Even though we also exploit data parallelism and a master-worker scheme, our approach is slightly different. In our system not only the simplification algorithm but also the isosurface extraction is performed in parallel. In this way the data are already partitioned among the workers, relying on the data subdivision performed in order to extract the isosurface in parallel, as proposed in [10]. Our aim is to exploit the data distribution resulting from the previous isosurface extraction, avoiding costly redistribution, minimizing master-worker communications and producing a high-quality mesh. For this reason we base our parallel computation on Garland's algorithm.

4. PARALLEL DECIMATION OF ISOSURFACES

Our parallel algorithm is designed to be executed in a pipeline after a parallel isosurface extraction step. Our parallel system uses the master-worker scheme, and the pipeline is composed of three steps: the parallel isosurface extraction, the parallel simplification and the creation of an output file. The first and third steps are explained in [10].
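Since the parallel computation is based on Garland's algorithm, the following self-contained C sketch may help to recall how the quadric error metric assigns a cost to a contraction: each incident plane contributes the outer product of its coefficient vector, and the cost of a candidate position is the resulting quadratic form. This is only an illustration of the metric, not Garland's full code.

    /* Quadric error metric sketch: Q accumulates p p^T over the planes
     * p = (a,b,c,d) of the incident triangles; the cost of placing the
     * contracted vertex at v = (x,y,z,1) is v^T Q v. */
    #include <stdio.h>

    typedef struct { double q[4][4]; } quadric_t;

    /* add the outer product p p^T of a plane ax+by+cz+d = 0, |(a,b,c)| = 1 */
    static void quadric_add_plane(quadric_t *Q, const double p[4]) {
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                Q->q[i][j] += p[i] * p[j];
    }

    /* squared distance error of placing the contracted vertex at v3 */
    static double quadric_cost(const quadric_t *Q, const double v3[3]) {
        double v[4] = { v3[0], v3[1], v3[2], 1.0 }, cost = 0.0;
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                cost += v[i] * Q->q[i][j] * v[j];
        return cost;
    }

    int main(void) {
        quadric_t Q = { 0 };
        double plane[4] = { 0.0, 0.0, 1.0, -2.0 };  /* the plane z = 2      */
        quadric_add_plane(&Q, plane);
        double v[3] = { 0.0, 0.0, 5.0 };            /* 3 units off the plane */
        printf("cost = %f\n", quadric_cost(&Q, v)); /* prints 9 = 3^2       */
        return 0;
    }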
4.1. Parallel algorithm description

At the beginning of the simplification step the representation of the isosurface is stored in the main memory of the cluster workstations. The sequential algorithm first traverses the vertices and evaluates them. A cost is then assigned to each edge, and the edges are sorted using a priority queue. We chose to preserve mesh boundaries, because they may be relevant, e.g. in medical image analysis, so we do not collapse border edges. These two phases are proportional to the number of vertices and edges. With huge meshes the heap construction may require a critical amount of memory, while in the parallel version we significantly reduce the memory requirements, and we may achieve a linear speed-up on this phase because it is executed independently by the workers. In fact, the only overlap between the workers' assignments is due to the few edges that belong to the boundaries resulting from the data partitioning. These edges are presently treated as border edges to simplify the construction of a single VRML file for the result, but we plan to treat them as internal edges. The maximum number of triangles to be eliminated is known in advance by the workers, but it may not be reached because of the constraints that preserve mesh quality. Denoting by S the global number of triangles to eliminate, we have to globally delete S/2 edges, because the collapse of a non-border edge deletes two triangles. Having the purpose of achieving a good quality of the simplified mesh, we do not want each worker to delete the same percentage of its triangles, because some workers may have less regular parts of the mesh, which should be preserved, while others may have large planar regions where a hard simplification is compatible with very little degradation. For this reason we split the edge removal operation into three parts: the local selection of candidates, done by the workers; a global sorting of the candidates, done by the master; and the effective removal, done again by the workers. Each worker extracts its S/2 lowest cost edges, considering them as candidates and updating the cost of the neighboring edges in the heap. These modifications are made on auxiliary structures and do not involve the model, which will actually be modified in the effective removal step.
During this selection a worker recalculates the weights of the edges involved in the "simulated" contractions, marking the affected edges and the related vertices and faces as "candidate" and rearranging the heap accordingly. Each worker also creates another list, containing "number-cost" pairs, that summarizes how many candidate edges have a given cost, because the identification of the candidate edges is not important for the master. This allows a lighter list, ordered on the cost value, to be sent to the master. At this point the master examines the lists until it reaches the target number of edges, sending back to each worker the number of edges to remove. On the basis of the master's reply, each worker collapses the given number of edges and adjusts the topology. In particular, considering that each edge collapse operation means the deletion of a vertex, it renames its vertices taking into account the number of collapses made by the other workers, in order to produce a representation of its part of the isosurface that can be directly concatenated with the others by the master in the following output creation step. The pseudo-code for the algorithm is shown in Figure 2.

##Workers
/* Isosurface extraction step ... */
evaluate_edges();
build_heap();
while (triangles > target_reduction) {
    select_candidate();
    update_heap();
}
build_candidate_list();
send(master, candidate_list);
restore_model();
receive(master, new_target_reduction);
while (triangles > new_target_reduction)
    remove_edge();
/* Output creation step ... */

##Master
/* Isosurface extraction step ... */
for (i < num_workers)
    receive(worker[i], candidate_list[i]);
sort_lists();
for (i < num_workers)
    send(worker[i], new_target_reduction[i]);
/* Output creation step ... */
4.2. Timing analysis and load balancing The sequential time to simplify a mesh is
Tseq - Theap 4- rcollapse 4- Tupd
(1)
where Theap is the time to evaluate edges and to build the heap, Tcouapseis the sum of the time spent modifying the model and Tupdis the sum of the time spent updating the heap after an edge collapse operation. The parallel time is
Tpar -- Theap 4- rsel 4- rtupd 4- Tcomm 4- Tsorting 4- rdel
(2)
164 where Tsez corresponds to the selection of the candidates, Taez the effective edges removing, T~omm the communication between master and workers and Tsort~ng the lists examination made by the master. T~omm and T~ort~ngare measured on the master, the other on the workers. Considering the sequential time we have that: - for very large data sets the amount of required memory may become critical. In particular Theap and Tupd may grow sharply due to the use of virtual memory; - for data sets that fit in main memory the dominant time is Tcollapse,followed by Th~ap and Tupd. In many cases we may consider the contribution of T~,pdnegligible. The parallel algorithm permits first of all to handle very large data sets because of the availability of a larger memory. Considering Equations 1 and 2 and our experiments we have that: - Tcom~ and T~o~t~ngrepresents an overload with respect to the sequential algorithm but their contribution is very limited; - the time reduction for heap construction is nearly linear (Th~ap vs. T~eap). This is possible because the original mesh is evenly distributed among workers at the end of the isosurface extraction step. In fact, as explained in [ 10], workers receive a part of dataset that will produce quite the same number of vertices (and consequently edges), with respect to the others; the time reduction for edge collapse operations is nearly linear. This time is identified by Tcollapse in the sequential algorithm and Tad in the parallel; there is no time reduction in the update phase, apart for the reduction due to better use of memory hierarchies because of smaller data sets. The update contribution for the sequential algorithm is Tupd and for the parallel algorithm it includes T~d and T~pd. For these considerations the algorithm should scale well over a greater number of processors, but the quality of the result may suffer because of the problem of the border edges introduced by data partitioning. The reduction can produce a (normally) little work unbalance between workers, but the effective edge collapse is not as relevant as in the sequential algorithm, in particular when edges number is huge. The edge removing consists into the deletion of an edge and into the replacement of its vertices with a new vertex, whose position was previously calculated. This unbalancing could be less relevant considering that workers must provide their part of the isosurface to the master moving files into a shared NFS partition. In this manner if workers terminate the operation in different time there is less traffic on the network and so a performance improvement. We have experimented our algorithm using a data set representing a CT scan of a bonsai using a cluster of four PC, equipped with 1.7 GHz Pentium IV processor and 256 MB RAM, connected via Fast Ethernet and running Linux. For isovalue 180 we obtained an isosurface made by 286,954 triangles, that we can reduce to, at minimum, 106,650 triangles with simplification. The sequential version takes 3:32 min. (Theap = 2:10 min. TcoIlapse= 28 sec. Tupd = 44 sec.) while with three workers we obtain a speed-up of about 2.6 times. Results are showed in Figure 3. -
-
5. CONCLUSIONS AND FUTURE W O R K S The main contribution of this paper is a new parallel algorithm for TIN simplification. The algorithm is based over [5] and exploits the knowledge of mesh structure, obtained during the isosurface extraction process. The advantages of this approach are due to the minimization of the communications number and size between processes and the reduction of the sequential part
165
Figure 3. The original (on the left) and simplified isosurfaces (on the right) representing a bonsai pot. of the algorithm, performed by the master. An improvement can be achieved by a progressive send of the candidate ordered list, in order to overlap the time spent by the master examining the lists and the creation of them. Futhermore we would reduce the number of candidates, considering for each of the N workers s edges and we would allow the elimination of edges belonging to borders resulting from f(N) data subdivision. ACKNOLEDGEMENTS
This work has been supported by CNR Agenzia 2000 Programme "An Environment for the Development of Multiplatform and Multilanguage High Performance Applications based on the Object Model and Structured Parallel Programming", by FIRB strategic Project on Enabling Technologies for Information Society, Grid.it, and by MIUR programme L. 449/97-99 SP3 "Grid Computing: enabling Technologies and Applications for eScience". REFERENCES
[i]
C. Gotsman, S. Gumhold, and L. Kobbelt, Simplification and compression of3D-meshes. Tutorials on multiresolution in geometric modeling, A. Iske, E. Quak, M. Floater (eds.), Springer, 2002 [21 W.E. Lorensen, and H.E. Cline, Marching Cubes: a high resolution 3D surface reconstruction algorithm. Computer Graphics, vol. 21, no. 4, 1987, pp. 163-169. [31 W. J. Schroeder, J. A. Zarge, and W. E. Lorensen, Decimation of triangle meshes. Computer Graphics, vol.26 n.2, 1992, pp.65-70. [41 M. Franc, and V. Skala, Parallel Triangular Mesh Reduction. In ALGORITHM 2000 proceedings, pp. 357-367, 2000. [5] M. Garland, and Paul S. Heckbert, Surface Simplification Using Quadric Error Metrics, Computer Graphics, vol. 31, 1997, pp. 209-216.
166 O. Schmidt, and M. Rasch, Parallel Mesh Simplification. In PDPTA 2000 proceedings, pp. 1801-1807. [7] C. Langis, G. Roth, and F. Dehne, Mesh Simplification in Parallel. In ICA3PP 2000 proceedings, pp. 281-290. [8] D. Brodsky, and B.A. Watson, Parallelization, small memories, and model simplification. In 11th Westem Canadian Computer Graphics Symposium proceedings, 2000, pp. 75-83. [9] A. Clematis, D. D'Agostino, W. De Marco and V. Gianuzzi, A Web-Based Isosurface Extraction System for Heterogeneous Clients. In 29th Euromicro Conference proceedings, 2003, pp. 148-156. [10] A. Clematis, D. D'Agostino, V. Gianuzzi, An Online Parallel Algorithm for Remote Visualization oflsosurfaces. In 10th EuroPVM/MPI proceedings, 2003. [6]
Parallel Programming
This Page Intentionally Left Blank
Parallel Computing: Software Technology, Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
169
MPI on a Virtual Shared Memory F. BaiardP, D. Guerri ~, P. MorP, L. RiccP, and L. VaglinP ~Dipartimento di Informatica Universit/t di Pisa Via F.Buonarroti, 56125 - PISA To show the advantages of an implementation of M P I on top of a distributed shared memory, this paper describes MPIs14, an implementation of M P I on top of D V S A , a package to emulate a shared memory on a distributed memory architecture. D V S A structures the shared memory as a set of variable size areas and defines a set of operations each involving a whole area. The various kind of data to implement M P I , i.e. to handle a communicator, a point to point or a collective communication, are mapped onto these areas so that a large degree of concurrency, and hence a good performance, can be achieved. Performance figures show that the proposed approach may achieve better performances than more traditional implementations of collective communication. 1. INTRODUCTION Almost any M P I [6] implementation supports M P I primitives through a low level communication library. In most cases, a first layer implementing point to point operations is implemented on top of proprietary libraries and collective operations are implemented on the top of this layer. A few proposals [8, 7, 9] define a M P I run time support on the top of a shared memory layer. [7] exploits light-weight threads to execute M P I applications. Each M P I node is executed by a distinct thread. Point to point communications are implemented through a message queue shared between the partners of the communication. To guarantee that any M P I node can be safely executed as a thread, [7] defines a set of complex compile-time transformations. Since the correctness of these transformations can be proved only for programs not invoking external functions, the approach cannot be considered completely general. [8] proposes a M P I implementation on a shared memory multiprocessor where, for each pair of applicative processes, a distinct queue implements point-to-point communication between the processes. A key issue in the definition of a shared memory support of M P I is the reduction of the overhead introduced to preserve the consistency of shared data. Any optimised implementation of point to point communication primitives tries to minimise this overhead. For instance, [8] exploits lock-flee buffers to implement point to point communications, [7] assumes one writermultiple readers queues to simplify the lock-flee algorithm. A further challenging issues of a shared memory support of M P I is how a shared memory can simplify the complex protocols to implement M P I collective communications. Furthermore, no existing proposal considers a distributed virtual shared memory architecture where the cost of accessing shared data may be high, because data may be stored in the local memory of a remote node. For this reason, the data are clustered into pages, generally of the
170 same size, and a page is the basic data transfer unit to/from the shared memory. In the case of a virtual shared memory, the definition of M P I poses a new set of problems. The first one is the definition of a proper mapping of the M P I support data into the shared pages. This mapping should minimize the overhead due to synchronizations not required by the semantics of the M P I operations. For instance, data supporting M P I point to point communications between different partners or collective communications executed in different communicators should be mapped into distinct pages to minimize conflicting requests to the same data. Furthermore, any caching strategy to support M P I should be coherent with those to support the virtual memory. This paper present MPIsH, a run time support for M P I developed on the top of DVSA, Distributed Shared Areas, [ 1,3], a distributed shared memory abstract machine currently implemented on a Meiko CS2 architecture and on a cluster of Linux workstations. MPIsH supports M P I communicators as well as point to point and collective communications. D V S A structures the shared space as a set of areas where the size of each area is freely chosen within an architectural dependent range, when the area is declared at program startup. D V S A defines a set of functions to manage the areas. The Notification functions allow a process to declare all and only the areas it is going to share and to notify the termination of its operations on each area. The Synchronization functions set includes operations to acquire exclusive access to an area, i.e. they implement locks. The Access Class includes operations to read/write an area. To enhance the portability of the M P I support across distinct physical architectures supporting DVSA, the implementation of MPIsH exploits the D V S A constructs only. As an example, M P I non blocking communications are defined through D V S A non blocking primitives even if a thread mechanism supported by the architecture might be more efficient. The performance figures of MPIsH show that M P I collective communications can benefit of an implementation on top of a shared memory abstract machine. From another point of view, these figures confirm that one of the major problem of current implementation of M P I , on top of general purpose or special purpose message passing libraries, is an efficient strategy to support both M P I point to point and collective communications. The efficiency of collective communications cannot be neglected [4] because, while data parallel algorithms with static data allocation can be easily implemented through point to point communications only, most complex algorithms require some form of data re-mapping that heavily exploits collective communications. Our results suggest the adoption of a hybrid approach where M P I point to point communications could be directly implemented on top of the communication library of the considered architecture, while M P I collective communications could be implemented on top of a distributed memory system. The additional overhead due to this layer may be recovered if the implementation of each M P I primitive is simplified by properly exploiting the operations of the distributed memory. The overall implementation of MPIxH is presented in Section 2. Section 3 shows the strategy to support MPI communicators. The implementations of point to point and of collective communications are shown, respectively, in Section 4 and in Section 5. 
Section 2 shows some experimental results and draws the conclusions.
171 2. O V E R A L L S T R U C T U R E OF T H E I M P L E M E N T A T I O N An important assumption that has driven our implementation is that an effective M P I implementation should minimize the contention on an area due to synchronization operations issued by distinct processes. Furthermore, it should properly map the areas into the local memories of the processing nodes. Hence, the first step in the implementation of MPIsI4 has defined the overall structure of the areas to implement message exchange and process synchronization. These areas records both the data exchanged among processes and the state of the processes involved in an ongoing communication. According to our initial assumption, the data structures required to implement MPI communications are mapped onto the areas so that: 9 an area is locked through synchronization functions only if this is the only way to preserve the M P I semantics. Hence, data structures that do not require the invocations of synchronization functions should be mapped onto areas distinct from those recording data requiring this operation. 9 the address of an area A should be known to the processes exploiting the data stored in A only. Hence, an area implementing a M P I communicator C should be shared among the processes belonging to C only. In the same way, only the communication partners should access an area storing a point to point message. These principles can be satisfied by a dynamic allocation of the areas to the processes. A static allocation is not possible, because of the M P I semantics that does not support the definition of a static analysis that returns, for each process P, the communications it is involved in and the corresponding communicators. On the other hand, for efficiency reasons, D V S A does not support a dynamic management of the areas and each process defines the areas it is going to share at the beginning of the execution of its program. For this reason, the dynamic management of the areas is explicitly implemented by MPIsH. Before starting the processes execution, MPIsH defines a pool of areas shared by the processes. The size of the pool depends upon the number of the applicative processes and the maximum number of communicators to be supported. M P I s u primitives fetch areas from this pool and assign them to the requesting processes. In this way, an area A is fetched when a process starts a point to point communication and the address of A is notified to the communication partner when it executes the corresponding primitive. The addresses of the areas shared by a process are dynamically stored in local tables. Each process can access only the areas whose addresses are stored in its local tables. MPIsI4 structures the areas into a hierarchy, where each level of the hierarchy is characterized by the number of processes sharing an area of this level. At the top of the hierarchy we find areas always shared among all the processes of the application. These areas record global information, for instance a global counter to assign unique contexts to communicators. They also store a set of pointers to a pool of free areas to be allocated for communicators descriptors. Next we find areas shared by all the processes of the communicator. These areas record either the information on the communicator or to implement collective communications occurring within a communicator. The next levels of the hierarchy include areas to implement point to point communications that are shared by the two partners of the communications only.
172
DVSA Mother Page ............................................................................................... iMPI_COMM WORD IC~176 [ Descriptor ]
I C O ~ NICATOR........ 1
@
........
n
1
II ......II
SYNC_IN
........
SYNCjN
~
AREA_IN
........
AREA_IN
........
SYNC_OUTI
!
@
n
I 1 ..... I ISYNC-~
II
1
n
.....
11 n
I ..... I I
................................................................................................
Figure 1. MPI_COMM_WORLD Environment and Collective Areas
This structuring results in a better memory utilization, because the size of an area can be chosen according to the level it belongs to. Furthermore, the number of processes sharing an area decreases as the level of the area increases and better allocation strategies can be adopted for areas of the highest level. As an example, each area of the highest level is always allocated in the local memory of one partner of the communication. 3. IMPLEMENTATION OF MPI COMMUNICATORS The execution environment of MPIsH is set up by the function MPI_InitsH that allocates a pool of free areas, initially shared by all the processes and that supports the creation of a communicator. To avoid the bottleneck due to a single pool and to concurrently allocate areas to distinct communicators, MPIsH partitions the pool among the applicative processes. Each process stores in a local table the addresses of the free areas it has been assigned. Furthermore, the areas are partitioned according to both the communicator they are associated with and the semantics of the data they record. Each process taking part in the creation of a communicator assigns some of its areas to the new communicator. MPI_Initsu initialises Levelo areas, the top level of the hierarchy, and creates the MPI_COMM_WORD communicator. A further communicator C can be explicitly created through MPI_Comm_CreatesH. The creation of C is implemented through two successive phases. In the first phase, each process informs its partners of its participation in the creation of C by decreasing a counter in the descriptor of C allocated by the first process invoking MPI_Comm_Createsn. This process initialises the descriptor by storing a new context, the number of processes of the communicator and their identifiers. The context is produced through a global context counter allocated in a Levelo area. The descriptor of C also stores pointers to a set of free areas to support the communications within C. As an example, the left part of Fig 1 shows the areas after MPI_InitsH has initialized both the descriptor of the MPI_COMM_WORLD communicator and the execution environment. A communicator descriptor includes a pointer to a pool of free Levela and Level4 areas, which will be dynamically allocated during the communicator lifetime and a set of Level2
173 areas, one for each process in the communicator, shared among all the processes of the communicator during its lifetime. Free areas are used to support point to point communications; other areas are partitioned according to their use. Collective areas support collective communications, Point to Point areas implement point to point synchronizations and Process areas store the address of dynamically allocated channel areas. These areas will be described in more detail in the following together with the implementation of different kinds of communications. Each process invoking MPI_Comm_CreatesH allocates a subset of the areas of the communicator by selecting their address from its local tables. The second phase synchronizes all the processes involved in the creation of the communicator. This synchronization is required by the M P I semantics that states that a communicator can be used only after all the involved processes have completed its creation. This synchronization is implemented through an explicit barrier after MPI_Comm_Creates14. However, processes can be loosely synchronized so that they are delayed only when they execute the first communication that refers to the communicator. When executing this communication, each process accesses the communicator descriptor and compares the number of processes that have completed the creation against that of the communicator processes. This guarantees that any communication starts only after all the areas of the new communicator have been allocated. At the end of the second phase, each process can copy the addresses of the areas shared within the communicator into a set of local tables. 4. POINT TO POINT C O M M U N I C A T I O N S This section describes the implementation of a subset of point to point communications: [2] describes the whole set of primitives including the non deterministic ones. Point to point communications are implemented through Channel, Buffer and Point to Point areas. Channel areas store information about pending communications between two partners and Buffer areas store the corresponding messages. Point to Point areas implement process synchronization. The partition of the areas supports the implementation of several strategies to optimise the allocation and the accesses to the areas. As far as concerns the allocation, MPIsI-I allocates the pool of areas managed by the process P in the local memory of the node executing P. In this way, the channel and the buffer areas are always allocated either in the memory of the sender or in that of the receiver. A proper caching strategy further optimizes the accesses to the areas. When a process P access a channel to receive a message, it copies into its local memory any information regarding any pending communication. The receiver process has been chosen, because the number of pending sends is generally larger than that of the receives. To receive a message, at first a process checks the pending communications in the cache and, only if the cache does not include any matching pending communication, it accesses the possibly remote channel area. When a process accesses a channel area, it copies into the area the updated information from the cache. As shown in section 2 this may largely improve the overall efficiency. 5. C O L L E C T I V E C O M M U N I C A T I O N S Collective communications exploit Collective areas in the communicator descriptor for both message transmission and process synchronization. 
Since distinct areas are used for each communicator, communications involving distinct communicators are concurrently executed. Since M P I collective communications are blocking, but not synchronous, a process may start a new collective communication before the previous one has been completed by all the
5. COLLECTIVE COMMUNICATIONS

Collective communications exploit the Collective areas in the communicator descriptor for both message transmission and process synchronization. Since distinct areas are used for each communicator, communications involving distinct communicators are executed concurrently. Since MPI collective communications are blocking, but not synchronous, a process may start a new collective communication before the previous one has been completed by all the other processes. Hence, MPI_SH further partitions the set of Collective areas so that a different set of areas is associated with each process of the communicator. In this way, collective communications with distinct root processes can be simultaneously active within the same communicator, because they exploit different areas. For the moment, let us assume that the data exchanged in a collective communication always fit in one area. MPI_SH pairs two data areas, AREA_IN and AREA_OUT, with each process P (see the right part of Fig. 1). The former is exploited when P is the root receiver of a communication, the latter when P is the root sender. Furthermore, each data area is paired with two areas, SYNC_IN and SYNC_OUT, to synchronize the communicating processes. Each synchronization area includes n binary semaphores, one for each process of the communicator. A semaphore enables a receiver to check whether the data it is waiting for is present in the corresponding data area. A sender, instead, can check whether the data area is free or still records the data of a previous communication. Let us now consider the implementation of a broadcast. The root process P checks whether its AREA_OUT area is free by inspecting the corresponding SYNC_OUT area. When all the semaphores are equal to 0, all the processes involved in a previous operation with the same root sender have fetched the message from the AREA_OUT area. P then writes the message into AREA_OUT and sets all the semaphores to 1. The i-th process involved in the communication checks the presence of the message by inspecting the i-th semaphore of SYNC_OUT. When the value of this semaphore is 1, the data can be read and the semaphore is reset. This implementation of collective communication does not require the invocation of a synchronization function before accessing the areas. Hence, it exploits at best the MPI semantics and the partitioning of the areas into synchronization and data areas. To handle messages that do not fit into one area, the support defines k AREA_OUT and k AREA_IN areas for each process, each one associated with a corresponding synchronization area. This allows the sender to partition a message into packets and to store each packet in a distinct area. This strategy further increases concurrency, because the sender can write the i-th packet of a message while the other processes are still reading the previous one; it can be adopted because synchronization is implemented through distinct areas.
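The broadcast protocol described above can be modeled by the following self-contained C sketch, where the per-process binary semaphores of SYNC_OUT are represented by atomic flags and the processes by threads on a single node. It is a simplified illustration of the hand-shake, not the MPI_SH implementation.

    /* Simplified single-node model of the AREA_OUT / SYNC_OUT broadcast. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NPROC 4
    #define AREA_LEN 8

    static double area_out[AREA_LEN];   /* AREA_OUT of the root            */
    static atomic_int sync_out[NPROC];  /* SYNC_OUT: one flag per process  */

    static void root_broadcast(const double *msg) {
        for (int i = 0; i < NPROC; i++)          /* area free?             */
            while (atomic_load(&sync_out[i]) != 0) ;
        for (int j = 0; j < AREA_LEN; j++)
            area_out[j] = msg[j];                /* write the message      */
        for (int i = 0; i < NPROC; i++)
            atomic_store(&sync_out[i], 1);       /* message available      */
    }

    static void *receiver(void *arg) {
        int rank = (int)(long)arg;
        double local[AREA_LEN];
        while (atomic_load(&sync_out[rank]) != 1) ;  /* wait for message   */
        for (int j = 0; j < AREA_LEN; j++)
            local[j] = area_out[j];                  /* fetch the message  */
        atomic_store(&sync_out[rank], 0);            /* release the area   */
        printf("rank %d got %g\n", rank, local[0]);
        return NULL;
    }

    int main(void) {
        pthread_t t[NPROC];
        double msg[AREA_LEN] = { 3.14 };
        for (long i = 0; i < NPROC; i++)
            pthread_create(&t[i], NULL, receiver, (void *)i);
        root_broadcast(msg);
        for (int i = 0; i < NPROC; i++)
            pthread_join(t[i], NULL);
        return 0;
    }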
6. EXPERIMENTAL RESULTS AND CONCLUSIONS

We consider preliminary performance figures of MPI_SH on a Meiko CS2 and compare them against an implementation developed on top of a native communication library. In Fig. 2 we show the execution times of, respectively, (a) scatter, (b) broadcast and (c) all-to-all primitives in the case of 8 processes, as well as (d) the execution time of a barrier primitive for a varying number of processes. The performance of the collective communications largely benefits from an implementation on a distributed virtual shared memory. The curve in Fig. 2(e) shows the effectiveness of the caching strategy in MPI_SH: in the considered program, two processes first send each other k messages (the number of pending messages labels the x coordinate in the figure), then they execute a barrier synchronization and receive the messages. Fig. 2(f) compares the performance of MPI_SH point to point communication against that of the MPI implementation on a message passing library. The performance of MPI_SH point to point communications is worse than the one that can be achieved by a message passing library. A large amount of the overhead of MPI_SH in the implementation of
[Figure 2: execution times of MPI_SH and of the native MPI implementation for (a) scatter, (b) broadcast and (c) all-to-all with varying message sizes (bytes), (d) barrier for a varying number of processes, (e) the caching strategy for a varying number of pending messages, and (f) point to point communication.]
Figure 4. Speedup measured on a SunFire 6800 for the simulation of an m-h-n RBF network with OpenMP (left) and MPI (right)

The lower performance of the OpenMP implementations was unexpected, because the SunFire 6800 is a symmetric multiprocessor and OpenMP is claimed to be a programming interface for such architectures. The advantage of the recomputation method, which requires a lot of redundant arithmetic operations, was also surprising. Therefore, a detailed performance analysis was undertaken to find the performance bottlenecks. To study the OpenMP overhead for process synchronization and scheduling, the OpenMP microbenchmark was applied [2]. Some results are shown in Fig. 5 (left). It can be seen that the synchronization overhead of most OpenMP directives is low and rather independent of the number of threads. The costly parallel directive is applied only once at the beginning. The performance of the collective MPI operations was analyzed with the Pallas MPI benchmark suite [7]. From Fig. 5 (right) it becomes evident that on the SunFire 6800 the MPI function Allreduce is by far more efficient than its OpenMP counterpart. Furthermore, the function Allreduce also allows the reduction of arrays by a
single call, whereas the OpenMP reduction clause is, according to the OpenMP C/C++ specification version 1.0, not available for arrays [6]. Therefore the recomputation method, the only data partitioning method that requires no reduction at all, achieves the highest performance.
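A common workaround, which is also the alternative encoding evaluated below, is to accumulate per-thread partial sums in private arrays and merge them once inside a critical section. The following self-contained C/OpenMP sketch (with made-up array size and loop body) illustrates the pattern.

    /* Array reduction without the OpenMP reduction clause:
     * per-thread local sums merged in one critical section. */
    #include <omp.h>
    #include <stdio.h>

    #define N 8   /* length of the array to reduce */

    int main(void) {
        double total[N] = { 0.0 };
        #pragma omp parallel
        {
            double local[N] = { 0.0 };   /* private partial sums */
            /* each thread accumulates its share of the iterations */
            #pragma omp for
            for (int k = 0; k < 1000; k++)
                for (int j = 0; j < N; j++)
                    local[j] += 1.0;     /* stands for the real computation */
            /* merge all local sums into the shared totals exactly once */
            #pragma omp critical
            for (int j = 0; j < N; j++)
                total[j] += local[j];
        }
        printf("total[0] = %f\n", total[0]);   /* prints 1000.000000 */
        return 0;
    }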
[Figure 5 plots: synchronization time of the OpenMP parallel, single, for and barrier directives over the number of parallel threads (left), and execution time of MPI Allreduce for 4-, 64-, 256- and 512-byte messages versus the OpenMP reduction over the number of parallel processes or threads (right); the plotted data is not recoverable from the extraction.]
Figure 5. Synchronization time for several OpenMP directives (left) and execution time for a reduction with OpenMP and MPI (right)
To avoid the costly OpenMP reduction clause, several encoding alternatives with other OpenMP directives have been implemented and analyzed. The best performance was achieved by first calculating all local sums in each thread and then adding the local sums to the total sums in one OpenMP critical section; a sketch of this pattern follows below. Fig. 6 (left) shows the performance of the accordingly modified methods A' and C'. They are faster than the original versions A and C based on the reduction clause (compare Fig. 4). The simplified method C' is now, for up to 4 threads, even slightly faster than the recomputation method B. However, for more than 4 threads the recomputation method still achieves higher performance. Replacing the dynamically allocated data arrays by statically declared arrays of known size has led to some surprising results: whereas for the sequential reference code and for all MPI-based implementations the performance remained unchanged, the OpenMP-based implementations ran significantly faster (even if only one thread was employed). Even a superlinear speedup can be achieved here (see the dashed curve in Fig. 6, right). This effect is caused by the Sun C compiler, which can better exploit the OpenMP directives in loops with static arrays and generate faster code by using enhanced loop optimizations. As a final experiment, OpenMP directives were also inserted in the MPI-based implementation (with static data arrays) and the number of threads per process was varied. The upper curve in Fig. 6 (right) shows that the mixed MPI/OpenMP implementation delivered a higher performance than the pure MPI or OpenMP solutions. However, generating two or more threads per process proved not to be advantageous here.
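The following is a minimal sketch (ours, not the authors' code) of the critical-section pattern just described: each thread accumulates into a private local array, and the per-thread sums are merged into the shared totals inside one critical section, sidestepping the reduction clause that OpenMP C/C++ 1.0 does not offer for arrays. Array sizes and names are illustrative.

    #define NSUM 100     /* length of the sum arrays (illustrative)   */
    #define NPAT 4096    /* number of summands per array slot (ditto) */

    double total[NSUM];          /* shared result arrays */
    double data[NPAT][NSUM];     /* illustrative input   */

    void array_sum(void)
    {
        #pragma omp parallel
        {
            double local[NSUM] = {0.0};      /* private per-thread sums */

            #pragma omp for nowait
            for (int i = 0; i < NPAT; i++)
                for (int j = 0; j < NSUM; j++)
                    local[j] += data[i][j];

            /* merge the thread-local sums into the shared totals */
            #pragma omp critical
            for (int j = 0; j < NSUM; j++)
                total[j] += local[j];
        }
    }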
[Figure 6 plots: speedup of the modified methods A' and C' and of the recomputation method B for the 40-1000-10 network over the number of parallel threads (left), and the effect of static arrays and the mixed MPI/OpenMP version over the number of parallel processes (right); the plotted data is not recoverable from the extraction.]
… ≥ 0, i = 1, 2, . . . , s. Given a distribution of the initial load V1, V2, . . . , Vs with Σ_{i=1}^{s} V_i = V, we can formulate our problem as an inhomogeneous linear equation system E · s = b, as we did in previous work [8]. There we also showed that this problem can be solved in polynomial time by means of linear programming. The linear equation system can be solved for each feasible initial distribution of the load V1, V2, . . . , Vs (i.e., a distribution with Σ_{i=1}^{s} V_i = V).

3. APPLICATION OF THE MODIFIED DTM
The modified DTM is applied to the cluster-based image database Cairo. This database allows the user to select regions of interest and to search all archived images for similar objects or persons. The image analysis and comparison require large computational resources, thus we designed a suitable cluster-based architecture. The divisible task model is well suited for modeling the query processing, as the processing can be separated into independent components (processing per image) and into any desirable granularity (a task can be a number of images, a single image or a number of image subsections). Initially, all available images are distributed equally over the computing nodes and processed in parallel. However, many queries consider only a small image portion, which is created by a much simpler evaluation of a-priori extracted features. The result is an unbalanced system, where overloaded nodes prolong the response time of the entire system. Thus a workload balancing strategy for this NP-complete problem is necessary. Due to the described modifications of the DTM, the application of this model for workload balancing is now straightforward. We implemented parts of it, and the model showed a considerable increase in performance, since we reduced the length of busy time intervals on a single node due to a reduction of the communication transmission times. On the other hand, the approach of providing several originators increased the flexibility to use other interconnection topologies. We made some measurements as well as comparisons to existing strategies such as LTF [3] and RBS [4]. The results seem promising, but more work is to be done.

4. CONCLUSIONS AND FUTURE WORK

We presented two modifications to the classical DTM model to adapt it to cluster-based systems and thus make it applicable to a broader class of practical applications. The possibility to use several originators increases the flexibility of our approach by covering a broader class of network topologies and allows a recursive modeling. Our second modification, an alternate distribution of the initial data volume on the PEs, results in a considerable increase in performance, since we reduce the length of busy time intervals on the single nodes due to a reduction of the communication transmission times.

REFERENCES
[1] Y. C. Cheng and T. G. Robertazzi, "Distributed Computation with Communication Delay." In IEEE Transactions on Aerospace and Electronic Systems 24, pp. 700-712, 1988.
[2] T. Bretschneider, S. Geisler and O. Kao, "Simulation-based assessment of parallel architectures for image databases." In Proceedings of the International Conference on Parallel Computing (ParCo 2001), pp. 401-408, 2002.
[3] O. Kao, G. Steinert and F. Drews, "Scheduling aspects for image retrieval in cluster-based image databases." In IEEE/ACM Symposium on Cluster Computing and Grid, pp. 329-336, IEEE Society Press, 2001.
[4] F. Drews and O. Kao, "Randomised block size scheduling strategy for cluster-based image databases." In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, pp. 2116-2123, 2001.
[5] J. Blazewicz and M. Drozdowski, "Scheduling Divisible Jobs with Communication Startup Costs." Discrete Applied Mathematics 76, pp. 21-24, 1997.
[6] J. Blazewicz, M. Drozdowski and M. Markiewicz, "Divisible task scheduling - concept and verification." In Parallel Computing, Volume 25, Number 1, pp. 87-98, 1999.
[7] P. Wolniewicz and M. Drozdowski, "Experiments with scheduling divisible tasks in clusters of workstations." Euro-Par 2000, LNCS 1900, Springer, pp. 311-319, 2000.
[8] F. Drews, O. Kao, U. Rerrer and K. Ecker, "Extending the Divisible Task Model for Load Balancing in Parallel and Distributed Systems." In Proceedings of the 2003 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2003), CSREA Press, Vol. 1, pp. 493-497, 2003.
[9] J. Blazewicz, M. Drozdowski and K. Ecker, "Management of Resources in Parallel Systems." In Handbook on Parallel and Distributed Processing, Springer, pp. 264-339, 2000.
The generalized diffusion method for the load balancing problem*

G. Karagiorgos^a, N. Missirlis^a and F. Tzaferis^a

^aDepartment of Informatics and Telecommunications, University of Athens, Panepistimioupolis 157 84, Athens, Greece

This paper defines the Generalized Diffusion (GDF) method by the introduction of a parameter τ in the Diffusion (DF) method and studies its convergence analysis for a weighted network graph. In particular, it is proved that GDF converges if and only if τ ∈ (0, 1/‖A‖∞), where A is the adjacency matrix of the network graph G = (V, E). This is a more relaxed condition than the one required by the DF method. Next, we consider a multiparametric version of GDF, which involves a set of parameters τ_i, i = 1, 2, . . . , |V|, instead of a single parameter τ. By applying local Fourier analysis we are able to find a closed form formula which produces optimum values for the set of parameters τ_i, in the sense that the rate of convergence of GDF is maximized for ring and 2D-torus network graphs.

1. INTRODUCTION

One of the most fundamental problems to solve on a distributed network is to balance the workload among the processors in order to use them efficiently. Load balancing schemes can be classified as either static or dynamic. In static load balancing schemes it is possible to make a priori estimates of the load distribution, in contrast to dynamic load balancing, where the workload is distributed among the processors in a time-varying process.

The load balancing problem. We consider the following abstract distributed load balancing problem. We are given an arbitrary, undirected, connected graph G = (V, E), where the node set V represents the set of processors and the edge set E describes the connections between processors. Each node v_i ∈ V contains a real variable u_i which represents the current workload of processor i, and a real variable c_ij which represents the weight of edge (i, j). The current workload is the sum of the computational work of its tasks. Tasks are independent and can be executed on any other processor. Most of the existing iterative load balancing algorithms [5, 7, 12, 15] involve two steps. The first step calculates a balancing flow. This flow is used in the second step, in which load balancing elements are migrated accordingly. This paper focuses on the first step. The performance of a balancing algorithm can be measured in terms of the number of iterations it requires to reach a balanced state and in terms of the amount of load moved over the edges of the underlying processor graph.

Our objective. The aim of this paper is to study a generalized form of the DF method by introducing a new set of parameters τ_i, i = 1, 2, . . . , |V|, for edge-weighted network graphs, whose role is to maximize the rate of convergence of the DF method.

*Research is supported by the National and Kapodistrian University of Athens (No. 70/4/4917).
Related work. In the Diffusion (DF) method of Cybenko [4] and Boillat [3], a processor simultaneously sends workload to its neighbors with lighter workload and receives from its neighbors with heavier workload. It is assumed that the system is synchronous, homogeneous, and the network connections are of unbounded capacity. Under the synchronous assumption, the diffusion method has been proved to converge in polynomial time for any initial workload [3]. If new workload can be generated or existing workload completed during the execution of the algorithm, it has been proved that the variance of the workload is bounded [4]. The convergence of the asynchronous version of the diffusion method has also been proved by Bertsekas and Tsitsiklis [1].

Our contribution. We introduce the GDF method, a generalized version of DF. We study the convergence analysis of the GDF method in the case that the edges of the network graph have a weight. In particular, we show that GDF converges under more relaxed conditions than DF. Next, we consider a multiparametric version of GDF, which involves a set of parameters τ_i. By applying local Fourier analysis we are able to find a closed form formula which produces optimum values for the set of parameters τ_i, in the sense that the rate of convergence of GDF is maximized for ring and 2D-torus weighted network graphs. In addition, the values of τ_i depend only upon local information, hence their computation requires only local communication.

The rest of the paper is organized as follows. In section 2, we introduce the GDF method. In section 3, we examine the properties the GDF method must have for its workload to be invariant. In section 4, we study the convergence analysis of the GDF method and we introduce its local version involving a set of parameters τ_i. In section 5, we determine the optimum values of the parameters τ_i such that the convergence rate of the local GDF method is maximized. Finally, our conclusions and future work are stated in section 6.
2. THE GENERALIZED DIFFUSION (GDF) METHOD

The Generalized Diffusion (GDF) method for load balancing has the form:

u_i^(n+1) = u_i^(n) − τ Σ_{j∈A(i)} c_ij (u_i^(n) − u_j^(n)),    (1)
where τ is a parameter that plays an important role in the convergence of the whole system to the equilibrium state, A(i) is the set of the nearest neighbors of node i, n is the step index, n = 0, 1, 2, . . ., and u_i^(n) (1 ≤ i ≤ |V|) is the total workload of processor i at step n. The overall workload distribution at step n, denoted by u^(n), is the transpose of the vector (u_1^(n), u_2^(n), . . . , u_|V|^(n)); u^(0) is the initial workload distribution. In matrix form, (1) becomes
u^(n+1) = M u^(n),    (2)
where M is called the diffusion matrix. The elements m_ij of M are equal to τ c_ij if j ∈ A(i), to 1 − τ Σ_{j∈A(i)} c_ij if i = j, and to 0 otherwise, where c_ij is the weight of the edge (i, j). With this formulation, the features of diffusive load balancing are fully captured by the iterative process (2) governed by the diffusion matrix M [14, 2]. Also, (2) can be written as u^(n+1) = (I − τL) u^(n), where L = B W B^T is the weighted Laplacian matrix of the graph, W is a diagonal matrix of size |E| × |E| consisting of the coefficients c_ij, and B is the vertex-edge incidence matrix. At this point, we note that if τ = 1, then we obtain the DF method proposed
by Cybenko [4] and Boillat [3], independently. If W = I, then we obtain the special case of the DF method with a single parameter τ (non-weighted Laplacian). In the non-weighted case and for network topologies such as the chain, 2D-mesh, nD-mesh, ring, 2D-torus, nD-torus and nD-hypercube, optimal values for the parameter τ that maximize the convergence rate have been derived by Xu and Lau [16]. However, there are no analogous results in the case of the weighted Laplacian. This paper will attempt to answer the following questions in the case of the weighted Laplacian: 1) Under which conditions is (2) convergent, and 2) What is the optimum value of τ such that the convergence rate of (2) is maximized?

3. THE CHARACTERISTICS OF THE GDF MATRIX
The diffusion matrix of GDF can be written as

M = I − τL,   L = D − A,    (3)
where D = diag(L) and A is the weighted adjacency matrix. Because of (3), (2) becomes u^(n+1) = (I − τD) u^(n) + τ A u^(n), or in component form
u_i^(n+1) = (1 − τ Σ_{j∈A(i)} c_ij) u_i^(n) + τ Σ_{j∈A(i)} c_ij u_j^(n),   i = 1, 2, . . . , |V|.    (4)
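As an illustration of the component form (4), the following minimal C sketch (ours, not from the paper) performs one GDF step on a weighted graph stored in a CSR-like adjacency layout; the layout and all names are illustrative assumptions.

    /* One GDF step in the component form (4).  The neighbors of node i
       are adj[off[i]] .. adj[off[i+1]-1], with edge weights in w at the
       same positions. */
    void gdf_step(int nv, const int *off, const int *adj, const double *w,
                  double tau, const double *u, double *u_next)
    {
        for (int i = 0; i < nv; i++) {
            double wsum = 0.0;  /* sum of c_ij over j in A(i)       */
            double flow = 0.0;  /* sum of c_ij * u_j over j in A(i) */
            for (int k = off[i]; k < off[i + 1]; k++) {
                wsum += w[k];
                flow += w[k] * u[adj[k]];
            }
            /* u_i <- (1 - tau * sum_j c_ij) u_i + tau * sum_j c_ij u_j */
            u_next[i] = (1.0 - tau * wsum) * u[i] + tau * flow;
        }
    }

Note that the nonnegativity condition of Lemma 1 below (0 < τ ≤ 1/‖A‖∞) guarantees that every coefficient of this update is nonnegative.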
The diffusion matrix M must have the following properties: nonnegative, stochastic and symmetric, for the workload to be invariant [4, 3, 2]. In the sequel, we examine under which conditions the diffusion matrix M possesses the aforementioned properties.

• Nonnegative. For the matrix M to be nonnegative we must have M ≥ 0, or, because of (3), I − τD + τA ≥ 0; for this it suffices that I − τD ≥ 0 and τA ≥ 0. From τA ≥ 0 we have τ > 0, because A ≥ 0 (c_ij > 0), and from I − τD ≥ 0 we have 1 − τ Σ_{j∈A(i)} c_ij ≥ 0 for all i ∈ V. Thus, it suffices that τ ≤ 1/max_i Σ_{j∈A(i)} c_ij, or τ ≤ 1/‖A‖∞. Therefore, we have proved the following.

Lemma 1. If 0 < τ ≤ 1/‖A‖∞ then M ≥ 0.

Remark 1. Note that ‖A‖∞ = ‖D‖∞.

Corollary 1. If c_ij = c (non-weighted case) and 0 < τ̂ ≤ 1/Δ(G), where τ̂ = τc and Δ(G) is
the maximum degree of the graph G, then M ≥ 0.

Proof. If c_ij = c, we have the non-weighted Laplacian matrix of the graph G, and the inequalities of Lemma 1 yield 0 < τ ≤ 1/(c d_i) or 0 < τ̂ ≤ 1/d_i for each i, where τ̂ = τc and d_i is the degree of node i. Clearly, Corollary 1 follows from the last inequalities.
• Stochastic. For the matrix M to be stochastic we must have M ū = ū, where ū_i = (1/|V|) Σ_{i=1}^{|V|} u_i, or, because of (3), (I − τL) ū = ū, which is valid since L ū = 0.

• Symmetric. Due to (3) the matrix M is symmetric since the Laplacian matrix L is symmetric.

Remark 2. If the inequalities of Lemma 1 hold, then the matrix M is doubly stochastic.
4. THE CONVERGENCE ANALYSIS OF THE GDF METHOD
In this section we present the basic convergence theorem for the GDF method.

Theorem 1. The GDF method converges to the uniform distribution if and only if the network graph is connected and either (or both) of the following conditions hold: (i) 0 < τ < 1/‖A‖∞, (ii) the network graph is not bipartite.

Proof. The diffusion matrix M can take the form

M = [ 0    K^T
      K    0   ],

where 0's are used to denote square zero block matrices on the diagonal of M and K is a rectangular nonnegative matrix, if and only if the graph is bipartite and I − τD = 0, i.e., τ = 1/Σ_{j∈A(i)} c_ij for all i ∈ V. If the above holds, then −1 is an eigenvalue of the matrix M, hence its convergence factor γ(M) = 1 and the method does not converge [2]. If the graph G is bipartite, then for …
x'(t) = A x(t) + B u(t),  t > 0,
y(t) = C x(t) + D u(t),  t ≥ 0,    (1)
with A ∈ R^{n×n}, B ∈ R^{n×m}, C ∈ R^{p×n} and D ∈ R^{p×m}. For simplicity we assume that the spectrum of A is dichotomic with respect to the imaginary axis, i.e., Re(λ) ≠ 0 for all eigenvalues λ of A. The case with eigenvalues on the imaginary axis could be treated as well with the method described in this paper, but this would add some distracting technicalities. Throughout this paper we will denote the spectrum of A by Λ(A). The number of state variables n is called the order of the system. We are interested in finding a reduced-order LTI system,
x̂'(t) = Â x̂(t) + B̂ u(t),  t > 0,
ŷ(t) = Ĉ x̂(t) + D̂ u(t),  t ≥ 0,    (2)
of order r, r ≪ n, … σ_j ≥ σ_{j+1} > 0 for all j, and σ_r > σ_{r+1}. The so-called square-root (SR) BT algorithms determine the reduced-order model as

Â = T_l A T_r,  B̂ = T_l B,  Ĉ = C T_r,  D̂ = D,    (12)

using the projection matrices

T_l = Σ_1^{-1/2} V_1^T R  and  T_r = S^T U_1 Σ_1^{-1/2}.    (13)
Due to space limitations, we refer the reader to [10, 11, 12, 13] for a survey of parallel model reduction methods for stable systems based on state-space truncation. Serial implementations of the model reduction algorithms discussed here can be found in the Subroutine Library in Control Theory - SLICOT (available from http://www.win.tue.nl/niconet/NIC2/slicot.html).

4. IMPLEMENTATION DETAILS
The additive decomposition and the model reduction methods basically require matrix operations such as the solution of linear systems and linear matrix equations (Lyapunov and Sylvester), and the computation of matrix products and matrix decompositions (QR, SVD, etc.). The iterative algorithms for efficiently solving linear matrix equations derived from the matrix sign function (see the previous sections and [8, 9]) only require operations like matrix products and matrix inversion. All these operations are basic dense matrix algebra kernels parallelized in ScaLAPACK and PBLAS. Thus, the parallel model reduction routines, integrated into the PLiCMR library (visit http://spine.act.uji.es/~plicmr), heavily rely on the use of the available parallel infrastructure in ScaLAPACK, the serial computational libraries LAPACK and BLAS, and the communication library BLACS. In order to improve the performance of our parallel model reduction routines we have designed, implemented, and employed two specialized parallel kernels that outperform the parallel kernels in ScaLAPACK with an analogous purpose: the QR factorization with partial pivoting is computed in our codes by using a parallel BLAS-3 version instead of the traditional BLAS-2 approach [14]. Also, our matrix inversion routine is based on a Gauss-Jordan elimination procedure [15].
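For reference, the classical Newton iteration for the matrix sign function [5, 6], from which these solvers are derived, needs per step exactly the two kernels named above, one matrix inversion and a scaled matrix sum:

A_{k+1} = (1/2) (A_k + A_k^{-1}),   A_0 = A,

and it converges quadratically to sign(A) whenever A has no eigenvalues on the imaginary axis, which is the dichotomy assumed for the systems considered here.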
Details of the contents and parallelization aspects of the model reduction routines for stable systems are given, e.g., in [10]. A standardized version of the library is integrated into the subroutine library PSLICOT, with parallel implementations of a subset of SLICOT. It can be downloaded from the URI ftp://ftp.esat.kuleuven.ac.be/pub/WGS/SLICOT. However, it is recommended to obtain the version of the library from http://spine.act.uji.es/~plicmr as it might, at some stages, contain more recent updates than the version integrated into PSLICOT. The library can be installed on any parallel architecture where the above-mentioned computational and communication libraries are available. The efficiency of the parallel routines will depend on the performance of the underlying libraries for matrix computation (BLAS) and communication (usually, MPI or PVM).
5. NUMERICAL EXPERIMENTS
All the experiments presented in this section were performed on a cluster of 16 nodes using IEEE double-precision floating-point arithmetic (ε ≈ 2.2204 × 10^-16). Each node consists of an Intel Pentium-IV processor at 1.8 GHz with 512 MBytes of RAM running the Linux (SuSE 7.3) operating system. We employ a BLAS library, specially tuned for the Pentium-IV processor, that achieves around 2.8 Gflops (billions of floating-point operations per second) for the matrix product (routine DGEMM). The nodes are connected via a Myrinet multistage network; the communication library BLACS is based on an implementation of the communication library MPI specially developed and tuned for this network. The performance of the interconnection network was measured by a simple loop-back message transfer, resulting in a latency of 61 μs and a bandwidth of 280 Mbit/s. We made use of the LAPACK, PBLAS, and ScaLAPACK libraries whenever possible. In this section we report the performance of the parallel routine pab09ex for the computation of the additive decomposition of a TFM. In order to mimic a real case, we employ a random single-input single-output LTI system with a single unstable pole. Our first experiment reports the execution time of the parallel routine on a system of order n = 2500. This is about the largest size we could evaluate on a single node of our cluster, considering the number of data matrices involved, the amount of workspace necessary for computations, and the size of the RAM per node. The left-hand plot in Figure 1 reports the execution time of the parallel routine using np = 1, 2, 4, 6, 8, and 10 nodes. The execution of the parallel algorithm on a single node is likely to require a higher time than that of a serial implementation of the algorithm (using, e.g., LAPACK and BLAS); however, at least for such large scale problems, we expect this overhead to be negligible compared to the overall execution time. The figure shows reasonable speed-ups when a reduced number of processors is employed. Thus, when np = 4, a speed-up of 2.11 is obtained for routine pab09ex. As expected, the efficiency decreases as np gets larger (as the system dimension is fixed, the problem size per node is reduced), so that using more than a few processors does not achieve a significant reduction in the execution time for such a small problem. We next evaluate the scalability of the parallel routine when the problem size per node is constant. For that purpose, we fix the problem dimensions to n/√np = 2500, and report the Gigaflops per node. The right-hand plot in Figure 1 shows the Gigaflop rate per node of the
[Figure 1 plots: execution time of the additive decomposition of a random unstable system of order n = 2500 over the number of processors (left), and Gflops per node for the scaled problem of order n = 2500·√np over the number of processors (right); the plotted data is not recoverable from the extraction.]
parallel routine. These results demonstrate the scalability of our parallel kernels, as there is only a minor decrease in the performance of the algorithms when np is increased while the problem dimension per node remains fixed. A thorough analysis of the performance of the model reduction routines for stable systems is given in [ 10]. 6. CONCLUSIONS We have presented an efficient approach for model reduction of large-scale unstable systems on parallel computers. All computational steps in the approach are solved using iterative algorithms derived from the matrix sign function, which has been shown to offer a high degree of parallelism. Numerical experiments on a cluster of Intel Pentium-IV processors confirm the efficiency and scalability of our methods. Analogous routines for model reduction of unstable large-scale discrete-time systems are under current development. REFERENCES
[1] [2]
[3]
A. Antoulas, Lectures on the Approximation of Large-Scale Dynamical Systems, SIAM Publications, Philadelphia, PA, to appear. G. Obinata, B. Anderson, Model Reduction for Control System Design, Communications and Control Engineering Series, Springer-Verlag, London, UK, 2001. A. Varga, Task II.B.1 - selection of software for controller reduction, SLICOT Working Note 1999-18, The Working Group on Software (WGS), available from
http://www, win. tue.nl/niconet/NIC2/reports .html (Dec. 1999).
[4] M. Safonov, E. Jonckheere, M. Verma, D. Limebeer, Synthesis of positive real multivariable feedback systems, Internat. J. Control 45 (3) (1987) 817-842.
[5] C. Kenney, A. Laub, The matrix sign function, IEEE Trans. Automat. Control 40 (8) [6]
(1995) 1330-1348. J. Roberts, Linear model reduction and solution of the algebraic Riccati equation by use of the sign function, Internat. J. Control 32 (1980) 677-687, (Reprint of Technical Report No. TR- 13, CUED/B-Control, Cambridge University, Engineering Department, 1971).
258 [7] [8] [9] [ 10] [ 11]
[12]
[ 13] [14] [15]
R Lancaster, M. Tismenetsky, The Theory of Matrices, 2nd Edition, Academic Press, Orlando, 1985. P. Benner, E. Quintana-Orti, G. Quintana-Orti, Solving linear matrix equations via rational iterative schemes, in preparation. P. Benner, E. Quintana-Orti, Solving stable generalized Lyapunov equations with the matrix sign function, Numer. Algorithms 20 (1) (1999) 75-100. P. Benner, E. Quintana-Orti, G. Quintana-Orti, State-space truncation methods for parallel model reduction of large-scale systems, Parallel Comput. To appear. P. Benner, E. Quintana-Orti, G. Quintana-Orti, Balanced truncation model reduction of large-scale dense systems on parallel computers, Math. Comput. Model. Dyn. Syst. 6 (4) (2000) 383-405. P. Benner, E. Quintana-Orti, G. Quintana-Orti, Singular perturbation approximation of large, dense linear systems, in: Proc. 2000 IEEE Intl. Symp. CACSD, Anchorage, Alaska, USA, September 25-27, 2000, IEEE Press, Piscataway, NJ,, 2000, pp. 255-260. P. Benner, E. Quintana-Orti, G. Quintana-Orti, Efficient numerical algorithms for balanced stochastic truncation, Int. J. Appl. Math. Comp. Sci. 11 (5) (2001) 1123-1150. G. Quintana-Orti, X. Sun, C. Bischof, A BLAS-3 version of the QR factorization with column pivoting, SIAM J. Sci. Comput. 19 (1998) 1486-1494. E. Quintana-Orti, G. Quintana-Orti, X. Sun, R. van de Geijn, A note on parallel matrix inversion, SIAM J. Sci. Comput. 22 (2001) 1762-1771.
Parallel Decomposition Approaches for Training Support Vector Machines*

T. Serafini^a, G. Zanghirati^b, and L. Zanni^a

^aDepartment of Mathematics, University of Modena and Reggio-Emilia, via Campi 213/b, 41100 Modena, Italy. E-mail: serafini.thomas@unimo.it, zanni.luca@unimo.it.

^bDepartment of Mathematics, University of Ferrara, via Machiavelli 35, 44100 Ferrara, Italy. E-mail: g.zanghirati@unife.it.

We consider parallel decomposition techniques for solving the large quadratic programming (QP) problems arising in training support vector machines. A recent technique is improved by introducing an efficient solver for the inner QP subproblems and a preprocessing step useful to hot start the decomposition strategy. The effectiveness of the proposed improvements is evaluated by solving large-scale benchmark problems on different parallel architectures.

1. INTRODUCTION

Support Vector Machines (SVMs) are an effective learning technique [11] which received increasing attention in the last years. Given a training set of labelled examples

D = {(z_i, y_i), i = 1, . . . , n,
z_i ∈ R^m, y_i ∈ {−1, 1}},
the SVM learning methodology performs classification of new examples z ∈ R^m by using a decision function F : R^m → {−1, 1} of the form
F(z) = sign( Σ_{i=1}^{n} x_i* y_i K(z, z_i) + b* ),    (1)
where K : R^m × R^m → R denotes a special kernel function (linear, polynomial, Gaussian, . . .) and x* = (x_1*, . . . , x_n*)^T is the solution of the convex quadratic programming (QP) problem
min  f(x) = (1/2) x^T G x − x^T 1
sub. to  y^T x = 0,
0 ≤ x ≤ C·1, … n_sp ≥ n_c > 0 and set i = 0. Arbitrarily split the indices {1, . . . , n} into the set B of basic variables, with #B = n_sp, and the set N = {1, . . . , n} \ B of nonbasic variables. Arrange the arrays x^(i), y and G with respect to B and N:
x^(i) = [ x_B^(i) ; x_N^(i) ],   y = [ y_B ; y_N ],   G = [ G_BB  G_BN ; G_NB  G_NN ].
2. Compute (in parallel) the solution x_B^(i+1) of

min_{x_B ∈ Ω_B}  f_B(x_B) = (1/2) x_B^T G_BB x_B − x_B^T (1 − G_BN x_N^(i))    (3)

where Ω_B = { x_B ∈ R^{n_sp} | y_B^T x_B
= −y_N^T x_N^(i), 0 ≤ x_B ≤ C·1 }. … T(x_min). The start value is η_0 = 2, because the system should consist of at least 2 processors.
Example: H = 256 × 256; t_b = 7.5 μs, c = 50 μs, q = 1.1 μs resp. 0.55 μs; μ = 10; Δ = 2
[Drawing for Figure 4: segments of function values within the discrete integration area G*, with the movement direction of the local operator marked; see the caption below.]
Figure 4. Discrete integration area G*, consisting of segments, for the approximate solution of Laplace's differential equation 0 = Δu(x, y) through a difference equation. Each segment in G* consists of ξ grid points (fig. 3) and is allocated to a processing unit of the field processor system from altogether n processing units. The approximate values u_{p,q} emerge in the catchment area of a local operator, which is led over a segment in an autonomous phase in the presented movement direction. Segments adjacent to each other exchange information with the given permeate depth Δ in a communication phase. It applies 0 ≡ M mod ξ and 0 ≡ N mod ξ.

Figure 5 shows the development of the generation time T(x) for all function values in G* on the basis of this parameter set over x = 3, . . . , 17 processors. In a system of 4 × 4 processors, that is x = 4 in the horizontal as well as the vertical direction, T ≈ 400 ms holds (point A) with an assumed transfer time of q = 1.1 μs. Halving the transfer time to q = 0.55 μs reduces T(x) to approximately 360 ms without influence on the autonomous phases τ_auto(T(x)) (points A', B'). The duration of the communication phases (point B'') is also relatively slight. Up to this point, the reduction of the data transmission time has had no considerable influence on the generation time T(x) at a constant number of processing units. If the number of processors is adjusted to the data transmission time, T(x) decreases. Equation 6 yields x_min = 8 for the adjusted number of processors with q = 0.55 μs. In a system of 8 × 8 processors, T(x = x_min) = 200 ms holds (line D, D') with the given parameter set, with an autonomous time τ_auto(T(x)) = 75 ms of a processor (line D', D'') and a communication time τ_komm(T(x_auto)) = 125 ms (line D'', D''') between all processors, in consideration of μ = 10. As shown, the generation time T(x) of the function values in G* can be minimized by adjusting the system parameters.
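As a quick consistency check of these figures: τ_auto + τ_komm = 75 ms + 125 ms = 200 ms, which matches the stated generation time T(x = x_min).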
[Figure 5 plots: generation time T(x) in ms over x = 3, . . . , 17, split into τ_auto and τ_komm, for transfer times q = 1.1 μs and q = 0.55 μs; the plotted data is not recoverable from the extraction.]
Figure 5. Development of the generation time T(x), split into the autonomous time τ_auto(T(x)) of a segment and into the communication time τ_komm(τ_auto) between all segments, for the parameter set H = 256 × 256, t_b = 7.5 μs, μ = 10, c = 50 μs, and Δ = 2. The transfer time varies between q = 1.1 μs and q = 0.55 μs.
Caches
Trade-offs for Skewed-Associative Caches

H. Vandierendonck^a* and K. De Bosschere^a†

^aDept. of Electronics and Information Systems, Ghent University, Sint-Pietersnieuwstraat 41, 9000 Gent, Belgium.

The skewed-associative cache achieves low miss rates with limited associativity by using inter-bank dispersion, the ability to disperse blocks over many sets in one bank if they map to the same set in another bank. This paper formally defines the degree of inter-bank dispersion and argues that high inter-bank dispersion conflicts with common micro-architectural designs in which the skewed-associative cache is embedded. Various trade-offs for the skewed-associative cache are analyzed and a skewed-associative cache organization with reduced complexity and a near-optimal cache miss rate is proposed.

1. INTRODUCTION

Cache memories hide the rapidly increasing memory latency by storing the most recently accessed data on-chip. Most implementations of caches are organized as a direct mapped or set-associative structure. In such an organization, every block of data may be stored in only a few locations of the cache. These locations form one set of the cache, hence the name "set-associative". The benefit of this approach is a fast cache access time, but the down-side is that many cache misses occur when many blocks are accessed that all map to the same set. These misses are called conflict misses and result from the set-associative cache organization. These misses occur frequently in scientific workloads and can significantly deteriorate performance [1, 2]. It was shown that the skewed-associative cache can remove these misses and improve the performance predictability [3].

The skewed-associative cache is an organization that combines a fast cache access time with few conflict misses [3, 4, 5]. A 2-way skewed-associative cache has a miss rate comparable to a 4-way set-associative cache [4]. An n-way skewed-associative cache has n direct mapped banks. Each bank is indexed using a different hash function. When many blocks map to the same set in one bank, then it is very unlikely that all of them map to the same set in all banks. The ability to spread blocks that map to the same set in one bank over multiple sets in the other banks is called inter-bank dispersion [4]. Ideally, inter-bank dispersion should be maximum, i.e., it should be possible to spread blocks over all sets in another bank. This paper shows that programs typically do not require such maximum inter-bank dispersion, but that a moderate amount of inter-bank dispersion suffices to

*Hans Vandierendonck is sponsored by the Flemish Institute for the Promotion of Scientific-Technological Research in the Industry (IWT). He can be reached at hvdieren@elis.UGent.be
†Koen De Bosschere can be reached at kdb@elis.UGent.be
remove nearly all conflict misses. Furthermore, we give some reasons why a moderate amount of inter-bank dispersion is desirable, from a micro-architectural point of view.

The remainder of this paper is organized as follows. In section 2, we present mathematical models of hash functions and formally define the degree of inter-bank dispersion. Then we describe in section 3 how various parameters of the skewed-associative cache affect the complexity of a skewed-associative cache organization. Section 4 presents a technique to construct conflict-avoiding hash functions with a pre-specified degree of inter-bank dispersion, number of hashed address bits and inputs per XOR. Using this technique, we evaluate trade-offs for the skewed-associative cache and propose a near-optimal configuration in section 5. Section 6 concludes this paper.

2. MATHEMATICAL TREATMENT

This section describes a mathematical model of a hash function and uses it to define the degree of inter-bank dispersion. These definitions are treated in detail in [6].
2.1. Hash functions

We represent an n-bit block address a by a bit vector [a_{n-1} a_{n-2} . . . a_0], with a_{n-1} the most significant bit and a_0 the least significant bit. A hash function mapping n to m bits is represented as a binary matrix H with n rows and m columns. The bit on row r and column c is 1 when address bit a_r is an input to the XOR computing the c-th set index bit. Consequently, the computation of the set index s can be expressed as the vector-matrix multiplication over GF(2), denoted by s = a H. GF(2) is the domain {0, 1} where addition is computed as XOR and multiplication is computed as logical AND. Every function in the design space of XOR-based set index functions can be represented by the null space of its matrix [7]. The null space N(H) of a matrix H is the set of all vectors that are mapped to the zero vector:
N(H) = { x ∈ {0, 1}^n | x H = 0 }.
The null space is a vector space with dimensionality dim N(H) = n − m. Conflict misses occur when x H = y H or, equivalently, (x ⊕ y) ∈ N(H), by noting that the XOR is its own inverse.
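As a concrete illustration of computing s = a H over GF(2) with XOR gates, the following sketch (ours, not from the paper) stores each column of H as a bit mask over the address bits; the layout and names are illustrative assumptions.

    #include <stdint.h>

    /* Parity of the set bits of x (addition over GF(2)). */
    static inline unsigned parity(uint32_t x)
    {
        x ^= x >> 16; x ^= x >> 8; x ^= x >> 4; x ^= x >> 2; x ^= x >> 1;
        return x & 1u;
    }

    /* Set index s = a H over GF(2).  Column c of H is the bit mask
       hcol[c]: bit r of hcol[c] is 1 when address bit a_r feeds the
       XOR that computes index bit c. */
    static uint32_t hash_index(uint32_t addr, const uint32_t *hcol, int m)
    {
        uint32_t s = 0;
        for (int c = 0; c < m; c++)
            s |= (uint32_t)parity(addr & hcol[c]) << c;
        return s;
    }

For example, the function H1 of section 2.3 corresponds, on a 4-bit address, to the two column masks 0x5 (computing b0 ⊕ b2) and 0xA (computing b1 ⊕ b3), with the assignment of the two index bits to columns being a layout choice.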
2.2. Set refinement and the lattice of hash functions

Set refinement is a relation between two set-associative caches. The relation holds when all addresses that map to the same set in one cache also map to the same set in the other cache [8]. This is expressed for XOR-based hash functions using null spaces as follows:

Definition 1. For matrices H and G, H refines G (H ⊑ G) iff N(H) ⊆ N(G).

E.g., if H maps two addresses x and y to the same set, then (x ⊕ y) ∈ N(H). If set refinement holds, then (x ⊕ y) ∈ N(G) as well and, by considering all pairs of addresses, it follows that
N(H) ⊆ N(G).
469 Value of H2 00
11
0C xO0OO xOlOl x1111 xlo10 xOllO xOOll 11 ,~ x1001 x1100 --V---_= 01 / 10 st ~t
a) Illustration of a lattice
01
10 ,=.,=,,~
impossible combination
x0100 x0001 . ~ set 1 of x1011 x1110 supremum "
x0OlO xO111 x11Ol xlOOO
set 0 of supremum
b) Illustration of Hs~p and inter-bank dispersion
Figure 1. Illustration of the lattice of hash functions and an example. Every function is a refinement of the largest element and is itself refined by the smallest element. The set refinement relation does not order every pair of functions. This happens in particular for the functions used in a skewed-associative cache. The situation can be understood from a fictitious lattice (Figure la)). Each node in the lattice corresponds to a function and the arrows show where the set refinement relation holds. For every function, there is a path from the smallest element to the largest element, passing through that function. The functions labeled G and H are not directly comparable to each other, as there is no path that passes through both G and H. We can however express their similarity by quantifying how much the paths from smallest to largest element for G and H overlap. In the graph, the paths diverge at the hash function I and converge again at S. These functions are the infinum (greatest lower bound), respectively supremum (least upper bound). Two hash functions are equal when their supremum equals their infinum (i.e., the paths from smallest to largest element do not diverge at all). The functions are as different as possible when the infinum equals the smallest element and/or the supremum equals the largest element.
2.3. Inter-bank dispersion The supremum hash function and its relation to inter-bank dispersion is illustrated for two set index functions, taken from a family of functions defined in [3]. The index functions H1 and /-/2 are defined by [b3, b2,
b0]H1
=
[bl @ b3, bo 9 b2]
[b3, b2, bl, bo]H2
=
[bo @ b3, bl @ b21
bl,
Every address in main memory is mapped to a set in bank 1 by H1 and to a set in bank 2 by//2. These mappings are illustrated in a 2-dimensional plot (Figure l b)). Each axis is labeled with the possible set indices for that bank. Every address is displayed in the grid in a position that corresponds to its set indices in each bank. The part x in the address bears no relevance to the value of the index functions. Both the addresses z0000 and z0101 map to set 00 in bank 1. The addresses are dispersed in bank 2::cO000 maps to set 00 and :c0101 maps to set 11. Inter-bank dispersion is limited to 2 of the 4 sets: there are no blocks that map to set 00 in bank 1 and either set 01 or 10 in bank 2. This is a consequence of the similarity of the functions H1 and/-/2 and it is described
470 mathematically by the supremum sup(H1, H2). The supremum is an imaginary hash function that places the addresses in imaginary sets. In the example, the supremum maps addresses to one of two sets and is defined by: [b3, b2, 51, bo]sup(H1,/-/2)
--
[bo | bl G b2 | b3]
i.e., all addresses with an even number of ones are mapped to one set and those with an odd number of ones are mapped to another set. The supremum is, by definition, refined by both H1 and/-/2. This is shown graphically in Figure l b). Set 0 of the supremum corresponds to the upper left-hand square. When refined by H1, it falls apart into sets 00 and 11 in bank 1. When it is refined by//2, it splits into sets 01 and 10 in bank 2. Inter-bank dispersion is always limited to one set of the supremum function. We define the degree of inter-bank dispersion as the 2-logarithm of the number of sets in a bank that have their addresses mapped to the same set of the supremum. It is limited to the range 0 to m. Definition 2. The degree of inter-bank dispersion (IBD) equals dim N ( s u p ( H 1 , H 2 ) ) dim N(H1).
It is assumed here that H1 and//2 have the same dimensions. For a-way skewed-associative caches (a > 2), inter-bank dispersion is defined only for every pair of banks [6]. Using the above definition, one can prove that inter-bank dispersion is maximal ( I B D = m) if and only if dim U(sup(H1, H2)) = n, i.e., the supremum has only one set. The definition implies an upper bound on inter-bank dispersion: I B D < n - m, as dim N(sup(H1, H2)) _ 2m. 3. MOTIVATION F O R L I M I T E D I N T E R - B A N K D I S P E R S I O N
Ideally, a skewed-associative cache should always have as much inter-bank dispersion as possible, because this minimizes conflict misses. There are, however, situations where putting a limitation on inter-bank dispersion is called for. Most processors access the level-1 cache and the TLB in parallel. Therefore, the hash functions operate on the virtual address. In order to avoid aliases in the cache, it is necessary that only untranslated address bits (i.e., bits in the page offset) are hashed. Therefore, the page size places a limitation on the number of available address bits n, which in turn limits the inter-bank dispersion ( I D B ~,~z~.e-,..9~ ~" ...................... ::::::::::::::::::::::::::: 10 2
10 4
10 2
10 s
message size in bytes
10 4
message size in bytes
b) Communication MPIm Isend-MPI Recv ent nodes of a cluster.
LAM MPICH ..........
!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!!!!!!{!!!!!!!!!!::!!!!!!:::::/::!!:::::.!!:: :/:::!!!::! ::!:!ii ::!
I::::; ........ :::::::::::::::::::::::::::::::::::::::: ...............
.--
10 s ............ ScaMPI I................................................. :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::
._O
1
t=
14oo _ _
I..AM
.......~...... M P I C H
12oo ~
MP-MPICH
...........ScaMPI ..... ~, I ~ . . ~ . ~ -= 9
.
.
.
.
10 ~
times for between differ-
a) Memory copy operation compared to MP ISend-MP I R e cv on one SMP node using a network interface. Figure 1. Measurements of point-to-point communication. 10 6 ......................................................................................
!
.....................~~:~ ~........... ~.~....................................
.
.............................................. ;: