PARALLEL COMPUTATIONAL FLUID DYNAMICS TOWARDS TERAFLOPS, OPTIMIZATION AND NOVEL FORMULATIONS
PARALLEL COMPUTATIONAL FLUID DYNAMICS: TOWARDS TERAFLOPS, OPTIMIZATION AND NOVEL FORMULATIONS
Proceedings of the Parallel CFD'99 Conference

Edited by

D. KEYES, Old Dominion University, Norfolk, U.S.A.
A. ECER, IUPUI, Indianapolis, Indiana, U.S.A.
J. PERIAUX, Dassault-Aviation, Saint-Cloud, France
N. SATOFUKA, Kyoto Institute of Technology, Kyoto, Japan

Assistant Editor
P. FOX, IUPUI, Indianapolis, Indiana, U.S.A.

2000
ELSEVIER
Amsterdam - Lausanne - New York - Oxford - Shannon - Singapore - Tokyo
ELSEVIER SCIENCE B.V.
Sara Burgerhartstraat 25
P.O. Box 211, 1000 AE Amsterdam, The Netherlands

© 2000 Elsevier Science B.V. All rights reserved.
This work is protected under copyright by Elsevier Science, and the following terms and conditions apply to its use: Photocopying Single photocopies of single chapters may be made for personal use as allowed by national copyright laws. Permission of the Publisher and payment of a fee is required for all other photocopying, including multiple or systematic copying, copying for advertising or promotional purposes, resale, and all forms of document delivery. Special rates are available for educational institutions that wish to make photocopies for non-profit educational classroom use. Permissions may be sought directly from Elsevier Science Global Rights Department, PO Box 800, Oxford OX5 IDX, UK; phone: (+44) 1865 843830, fax: (+44) 1865 853333, e-mail:
[email protected]. You may also contact Global Rights directly through Elsevier's home page (http://www.elsevier.nl), by selecting 'Obtaining Permissions'. In the USA, users may clear permissions and make payments through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA; phone: (978) 7508400, fax: (978) 7504744, and in the UK through the Copyright Licensing Agency Rapid Clearance Service (CLARCS), 90 Tottenham Court Road, London W1P 0LP, UK; phone: (+44) 171 631 5555; fax: (+44) 171 631 5500. Other countries may have a local reprographic rights agency for payments. Derivative Works Tables of contents may be reproduced for internal circulation, but permission of Elsevier Science is required for external resale or distribution of such material. Permission of the Publisher is required for all other derivative works, including compilations and translations. Electronic Storage or Usage Permission of the Publisher is required to store or use electronically any material contained in this work, including any chapter or part of a chapter. Except as outlined above, no part of this work may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written permission of the Publisher. Address permissions requests to: Elsevier Global Rights Department, at the mail, fax and e-mail addresses noted above. Notice No responsibility is assumed by the Publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Because of rapid advances in the medical sciences, in particular, independent verification of diagnoses and drug dosages should be made.
First edition 2000

Library of Congress Cataloging in Publication Data
A catalog record from the Library of Congress has been applied for.

ISBN: 0-444-82851-6

The paper used in this publication meets the requirements of ANSI/NISO Z39.48-1992 (Permanence of Paper).
Printed in The Netherlands.
PREFACE

Parallel CFD'99, the eleventh in an international series of meetings featuring computational fluid dynamics research on parallel computers, was convened in Williamsburg, Virginia, from the 23rd to the 26th of May, 1999. Over 125 participants from 19 countries converged for the conference, which returned to the United States for the first time since 1995.

Parallel computing and computational fluid dynamics have evolved not merely simultaneously, but synergistically. This is evident, for instance, in the awarding of the Gordon Bell Prizes for floating point performance on a practical application, from the earliest such awards in 1987 to the latest in 1999, when three of the awards went to CFD entries. Since the time of Von Neumann, CFD has been one of the main drivers for computing in general, and the appetite for cycles and storage generated by CFD together with its natural concurrency have constantly pushed computing into greater degrees of parallelism. In turn, the opportunities made available by high-end computing, even as driven by other forces such as fast networking, have quickly been seized by computational fluid dynamicists. The emergence of departmental-scale Beowulf-type systems is one of the latest such examples. The exploration and promotion of this synergistic realm is precisely the goal of the Parallel CFD international conferences.

Many diverse realms of phenomena in which fluid dynamical simulations play a critical role were featured at the 1999 meeting, as were a large variety of parallel models and architectures. Special emphases included parallel methods in optimization, non-PDE-based formulations of CFD (such as lattice-Boltzmann), and the influence of deep memory hierarchies and high interprocessor latencies on the design of algorithms and data structures for CFD applications.

There were ten plenary speakers representing major parallel CFD groups in government, academia, and industry from around the world. Shinichi Kawai of the National Space Development Agency of Japan described that agency's 30 Teraflop/s "Earth Simulator" computer. European parallel CFD programs, heavily driven by industry, were presented by Jean-Antoine Désidéri of INRIA, Marc Garbey of the University of Lyon, Trond Kvamsdal of SINTEF, and Mark Cross of the University of Greenwich. Dimitri Mavriplis of ICASE and James Taft of NASA Ames updated conferees on the state of the art in parallel computational aerodynamics in NASA. Presenting new results on the platforms of the Accelerated Strategic Computing
Initiative (ASCI) of the U.S. Department of Energy were John Shadid of Sandia-Albuquerque and Paul Fischer of Argonne. John Salmon of Caltech, a two-time Gordon Bell Prize winner and co-author of "How to Build a Beowulf," presented large-scale astrophysical computations on inexpensive PC clusters. Conferees also heard special reports from Robert Voigt of The College of William & Mary and the U.S. Department of Energy on research taking place under the ASCI Alliance Program and from Douglas McCarthy of Boeing on the new CFD General Notation System (CGNS). A pre-conference tutorial on the Portable Extensible Toolkit for Scientific Computing (PETSc), already used by many of the participants, was given by William Gropp, Lois McInnes, and Satish Balay of Argonne. Contributed presentations were given by over 50 researchers representing the state of parallel CFD art and architecture from Asia, Europe, and North America.

Major developments at the 1999 meeting were: (1) the effective use of as many as 2048 processors in implicit computations in CFD, (2) the acceptance that parallelism is now the "easy part" of large-scale CFD compared to the difficulty of getting good per-node performance on the latest fast-clocked commodity processors with cache-based memory systems, (3) favorable prospects for Lattice-Boltzmann computations in CFD (especially for problems that Eulerian and even Lagrangian techniques do not handle well, such as two-phase flows and flows with exceedingly multiply-connected domains or domains with a lot of holes in them, but even for conventional flows already handled well with the continuum-based approaches of PDEs), and (4) the nascent integration of optimization and very large-scale CFD.

Further details of Parallel CFD'99, as well as other conferences in this series, are available at http://www.parcfd.org.

The Editors
ACKNOWLEDGMENTS

Sponsoring Parallel CFD'99 were:
• the Army Research Office
• the IBM Corporation
• the National Aeronautics and Space Administration
• the National Science Foundation
• the Southeastern Universities Research Association (SURA).

PCFD'99 is especially grateful to Dr. Stephen Davis of the ARO, Dr. Suga Sugavanam of IBM, Dr. David Rudy of NASA, Dr. Charles Koelbel and Dr. John Foss of NSF, and Hugh Loweth of SURA for their support of the meeting.

Logistics and facilities were provided by:
• the College of William & Mary
• the Institute for Computer Applications in Science and Engineering (ICASE)
• the NASA Langley Research Center
• Old Dominion University
• the Purdue School of Engineering and Technology at IUPUI.
The Old Dominion University Research Foundation waived standard indirect cost rates in administering Parallel CFD'99, since it was recognized as a deliverable of two "center" projects: an NSF Multidisciplinary Computing Challenges grant and a DOE ASCI Level II subcontract on massively parallel algorithms in transport. Pat Fox of IUPUI, the main source of energy and institutional memory for the Parallel CFD conference series, was of constant assistance in the planning and execution of Parallel CFD'99.
Emily Todd, the ICASE Conference Coordinator, was of immeasurable help in keeping conference planning on schedule and making arrangements with local vendors. Scientific Committee members Akin Ecer, David Emerson, Isaac Lopez, and James McDonough offered timely advice and encouragement. Manny Salas, Veer Vatsa, and Bob Voigt of the Local Organizing Committee opened the doors of excellent local facilities to Parallel CFD'99 and its accompanying tutorial. Don Morrison of the A/V office at NASA Langley kept the presentations, which employed every variety of presentation hardware, running smoothly. Ajay Gupta, Matt Glaves, Eric Lewandowski, Chris Robinson, and Clinton Rudd of the Computer Science Department at ODU and Shouben Zhou, the ICASE Systems Manager, provided conferees with fin-de-siècle network connectivity during their stay in Colonial-era Williamsburg. Jeremy York of IUPUI, Kara Olson of ODU, and Jeanie Samply of ICASE provided logistical support to the conference. Most importantly to this volume, the copy-editing of David Hysom and Kara Olson improved dozens of the chapters herein. The Institute for Scientific Computing Research (ISCR) and the Center for Applied Scientific Computing (CASC) at Lawrence Livermore National Laboratory hosted the lead editor during the production of this proceedings.

David Keyes, Conference Chair
Parallel CFD'99
SCIENTIFIC COMMITTEE of the Parallel CFD Conferences 1998-1999

Ramesh K. Agarwal, Wichita State University, USA
Boris Chetverushkin, Russian Academy of Science, Russia
Akin Ecer, IUPUI, USA
David R. Emerson, Daresbury Laboratory, UK
Pat Fox, IUPUI, USA
Marc Garbey, Université Claude Bernard Lyon I, France
Alfred Geiger, HLRS, Germany
Carl Jenssen, Statoil, Norway
David E. Keyes, Old Dominion University, USA
Chao-An Lin, National Tsing Hua University, Taiwan
Isaac Lopez, Army Research Laboratory, NASA Lewis Campus, USA
Doug McCarthy, Boeing Company, USA
James McDonough, University of Kentucky, USA
Jacques Périaux, Dassault Aviation, France
Richard Pelz, Rutgers University, USA
Nobuyuki Satofuka, Kyoto Institute of Technology, Japan
Pasquale Schiano, Centro Italiano Ricerche Aerospaziali CIRA, Italy
Suga Sugavanam, IBM
Marli E.S. Vogels, National Aerospace Laboratory NLR, The Netherlands
David Weaver, Phillips Laboratory, USA
LIST OF PARTICIPANTS in Parallel CFD'99

Ramesh Agarwal, Wichita State University
Rakhim Aitbayev, University of Colorado, Boulder
Hasan Akay, IUPUI
Giorgio Amati, CASPUR
W. Kyle Anderson, NASA Langley Research Center
Rustem Aslan, Istanbul Technical University
Mike Ashworth, CLRC
Abdelkader Baggag, ICASE
Satish Balay, Argonne National Laboratory
Oktay Baysal, Old Dominion University
Robert Biedron, NASA Langley Research Center
George Biros, Carnegie Mellon University
Daryl Bonhaus, NASA Langley Research Center
Gunther Brenner, University of Erlangen
Edoardo Bucchignani, CIRA SCPA
Xing Cai, University of Oslo
Doru Caraeni, Lund Institute of Technology
Mark Carpenter, NASA Langley Research Center
Po-Shu Chen, ICASE
Jiadong Chen, IUPUI
Guilhem Chevalier, CERFACS
Stanley Chien, IUPUI
Peter Chow, Fujitsu
Mark Cross, University of Greenwich
Wenlong Dai, University of Minnesota
Eduardo D'Azevedo, Oak Ridge National Laboratory
Anil Deane, University of Maryland
Ayodeji Demuren, Old Dominion University
Jean-Antoine Désidéri, INRIA
Boris Diskin, ICASE
Florin Dobrian, Old Dominion University
Akin Ecer, IUPUI
David Emerson, CLRC
Karl Engel, Daimler Chrysler
Huiyu Feng, George Washington University
Paul Fischer, Argonne National Laboratory
Randy Franklin, North Carolina State University
Martin Galle, DLR
Marc Garbey, University of Lyon
Alfred Geiger, HLRS
Aytekin Gel, University of West Virginia
Omar Ghattas, Carnegie Mellon University
William Gropp, Argonne National Laboratory
X. J. Gu, CLRC
Harri Hakula, University of Chicago
Xin He, Old Dominion University
Paul Hovland, Argonne National Laboratory
Weicheng Huang, UIUC
Xiangyu Huang, College of William & Mary
David Hysom, Old Dominion University
Cos Ierotheou, University of Greenwich
Eleanor Jenkins, North Carolina State University
Claus Jenssen, SINTEF
Andreas Kahari, Uppsala University
Boris Kaludercic, Computational Dynamics Limited
Dinesh Kaushik, Argonne National Laboratory
Shinichi Kawai, NSDA
David Keyes, Old Dominion University
Matthew Knepley, Purdue University
Suleyman Kocak, IUPUI
John Kroll, Old Dominion University
Stefan Kunze, University of Tübingen
Chia-Chen Kuo, NCHPC
Trond Kvamsdal, SINTEF
Lawrence Leemis, College of William & Mary
Wu Li, Old Dominion University
David Lockhard, NASA Langley Research Center
Josip Loncaric, ICASE
Lian Peet Loo, IUPUI
Isaac Lopez, NASA Glenn Research Center
Li Shi Luo, ICASE
James Martin, ICASE
Dimitri Mavriplis, ICASE
Peter McCorquodale, Lawrence Berkeley National Laboratory
James McDonough, University of Kentucky
Lois McInnes, Argonne National Laboratory
Piyush Mehrotra, ICASE
N. Duane Melson, NASA Langley Research Center
Razi Nalim, IUPUI
Eric Nielsen, NASA Langley Research Center
Stefan Nilsson, Chalmers University of Technology
Jacques Périaux, INRIA
Alex Pothen, Old Dominion University
Alex Povitsky, ICASE
Luie Rey, IBM Austin
Jacques Richard, Illinois State University, Chicago
Jacqueline Rodrigues, University of Greenwich
Kevin Roe, ICASE
Cord Rossow, DLR
David Rudy, NASA Langley Research Center
Jubaraj Sahu, Army Research Laboratory
John Salmon, California Institute of Technology
Widodo Samyono, Old Dominion University
Nobuyuki Satofuka, Kyoto Institute of Technology
Punyam Satya-Narayana, Raytheon
Erik Schnetter, University of Tübingen
Kara Schumacher Olson, Old Dominion University
John Shadid, Sandia National Laboratory
Kenjiro Shimano, Musashi Institute of Technology
Manuel Soria, Universitat Politecnica de Catalunya
Linda Stals, ICASE
Andreas Stathopoulos, College of William & Mary
Azzeddine Soulaimani, Ecole Supérieure
A. (Suga) Sugavanam, IBM Dallas
Samuel Sundberg, Uppsala University
Madhava Syamlal, Fluent Incorporated
James Taft, NASA Ames Research Center
Danesh Tafti, UIUC
Aoyama Takashi, National Aerospace Laboratory
Ilker Tarkan, IUPUI
Virginia Torczon, College of William & Mary
Damien Tromeur-Dervout, University of Lyon
Aris Twerda, Delft University of Technology
Ali Uzun, IUPUI
George Vahala, College of William & Mary
Robert Voigt, College of William & Mary
Johnson Wang, Aerospace Corporation
Tadashi Watanabe, JAERI
Chip Watson, Jefferson Laboratory
Peter Wilders, Technological University of Delft
Kwai Wong, University of Tennessee
Mark Woodgate, University of Glasgow
Paul Woodward, University of Minnesota
Yunhai Wu, Old Dominion University
Shishen Xie, University of Houston
Jie Zhang, Old Dominion University
TABLE OF CONTENTS

Preface ... v
Acknowledgments ... vii
List of Scientific Committee Members ... ix
List of Participants ... x
Conference Photograph ... xiii

PLENARY PAPERS

J.-A. Désidéri, L. Fournier, S. Lanteri, N. Marco, B. Mantel, J. Périaux, and J. F. Wang
Parallel Multigrid Solution and Optimization in Compressible Flow Simulation and Design ... 3

P. F. Fischer and H. M. Tufo
High-Performance Spectral Element Algorithms and Implementations ... 17

M. Garbey and D. Tromeur-Dervout
Operator Splitting and Domain Decomposition for Multiclusters ... 27

S. Kawai, M. Yokokawa, H. Ito, S. Shingu, K. Tani, and K. Yoshida
Development of the Earth Simulator ... 37

K. McManus, M. Cross, C. Walshaw, S. Johnson, C. Bailey, K. Pericleous, A. Sloan, and P. Chow
Virtual Manufacturing and Design in the Real World - Implementation and Scalability on HPCC Systems ... 47

D. Mavriplis
Large-Scale Parallel Viscous Flow Computations Using an Unstructured Multigrid Algorithm ... 57

CONTRIBUTED PAPERS

R. Agarwal
Efficient Parallel Implementation of a Compact Higher-Order Maxwell Solver Using Spatial Filtering ... 75

R. Aitbayev, X.-C. Cai, and M. Paraschivoiu
Parallel Two-Level Methods for Three-dimensional Transonic Compressible Flow Simulations on Unstructured Meshes ... 89

T. Aoyama, A. Ochi, S. Saito, and E. Shima
Parallel Calculation of Helicopter BVI Noise by a Moving Overlapped Grid Method ... 97

A. R. Aslan, U. Gulcat, A. Misirlioglu, and F. O. Edis
Domain Decomposition Implementations for Parallel Solutions of 3D Navier-Stokes Equations ... 105

A. Baggag, H. L. Atkins, and D. E. Keyes
Parallel Implementation of the Discontinuous Galerkin Method ... 115

J. Bernsdorf, G. Brenner, F. Durst, and M. Baum
Numerical Simulations of Complex Flows with Lattice-Boltzmann Automata on Parallel Computers ... 123

G. Biros and O. Ghattas
Parallel Preconditioners for KKT Systems Arising in Optimal Control of Viscous Incompressible Flows ... 131

E. Bucchignani, A. Matrone, and F. Stella
Parallel Polynomial Preconditioners for the Analysis of Chaotic Flows in Rayleigh-Benard Convection ... 139

X. Cai, H. P. Langtangen, and O. Munthe
An Object-Oriented Software Framework for Building Parallel Navier-Stokes Solvers ... 147

D. Caraeni, C. Bergstrom, and L. Fuchs
Parallel NAS3D: An Efficient Algorithm for Engineering Spray Simulations using LES ... 155

Y. P. Chien, J. D. Chen, A. Ecer, and H. U. Akay
Dynamic Load Balancing for Parallel CFD on NT Networks ... 165

P. Chow, C. Bailey, K. McManus, D. Wheeler, H. Lu, M. Cross, and C. Addison
Parallel Computer Simulation of a Chip Bonding to a Printed Circuit Board: In the Analysis Phase of a Design and Optimization Process for Electronic Packaging ... 173

W. Dai and P. R. Woodward
Implicit-explicit Hybrid Schemes for Radiation Hydrodynamics Suitable for Distributed Computer Systems ... 179

A. Deane
Computations of Three-dimensional Compressible Rayleigh-Taylor Instability on SGI/Cray T3E ... 189

A. Ecer, E. Lemoine, and I. Tarkan
An Algorithm for Reducing Communication Cost in Parallel Computing ... 199

F. O. Edis, U. Gulcat, and A. R. Aslan
Domain Decomposition Solution of Incompressible Flows using Unstructured Grids with pP2P1 Elements ... 207

D. R. Emerson, Y. F. Hu, and M. Ashworth
Parallel Agglomeration Strategies for Industrial Unstructured Solvers ... 215

M. Galle, T. Gerhold, and J. Evans
Parallel Computation of Turbulent Flows Around Complex Geometries on Hybrid Grids with the DLR-TAU Code ... 223

A. Gel and I. Celik
Parallel Implementation of a Commonly Used Internal Combustion Engine Code ... 231

W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith
Towards Realistic Performance Bounds for Implicit CFD Codes ... 241

W. Huang and D. Tafti
A Parallel Computing Framework for Dynamic Power Balancing in Adaptive Mesh Refinement Applications ... 249

E. W. Jenkins, R. C. Berger, J. P. Hallberg, S. E. Howington, C. T. Kelley, J. H. Schmidt, A. K. Stagg, and M. D. Tocci
A Two-Level Aggregation-Based Newton-Krylov-Schwarz Method for Hydrology ... 257

D. E. Keyes
The Next Four Orders of Magnitude in Performance for Parallel CFD ... 265

M. G. Knepley, A. H. Sameh, and V. Sarin
Design of Large-Scale Parallel Simulations ... 273

S. Kocak and H. U. Akay
An Efficient Storage Technique for Parallel Schur Complement Method and Applications on Different Platforms ... 281

S. Kunze, E. Schnetter, and R. Speith
Applications of the Smoothed Particle Hydrodynamics Method: The Need for Supercomputing ... 289

I. M. Llorente, B. Diskin, and N. D. Melson
Parallel Multigrid with Blockwise Smoothers for Multiblock Grids ... 297

K. Morinishi and N. Satofuka
An Artificial Compressibility Solver for Parallel Simulation of Incompressible Two-Phase Flows ... 305

E. J. Nielsen, W. K. Anderson, and D. K. Kaushik
Implementation of a Parallel Framework for Aerodynamic Design Optimization on Unstructured Meshes ... 313

S. Nilsson
Validation of a Parallel Version of the Explicit Projection Method for Turbulent Viscous Flows ... 321

T.-W. Pan, V. Sarin, R. Glowinski, J. Périaux, and A. Sameh
Parallel Solution of Multibody Store Separation Problems by a Fictitious Domain Method ... 329

A. Povitsky
Efficient Parallel-by-Line Methods in CFD ... 337

J. N. Rodrigues, S. P. Johnson, C. Walshaw, and M. Cross
An Automatable Generic Strategy for Dynamic Load Balancing in a Parallel Structured Mesh CFD Code ... 345

J. Sahu, K. R. Heavey, and D. M. Pressel
Parallel Performance of a Zonal Navier-Stokes Code on a Missile Flowfield ... 355

N. Satofuka and T. Sakai
Parallel Computation of Three-dimensional Two-phase Flows by Lattice-Boltzmann Method ... 363

P. Satya-narayana, R. Avancha, P. Mucci, and R. Pletcher
Parallelization and Optimization of a Large Eddy Simulation Code using OpenMP for SGI Origin2000 Performance ... 371

K. Shimano, Y. Hamajima, and C. Arakawa
Calculation of Unsteady Incompressible Flows on a Massively Parallel Computer Using the B.F.C. Coupled Method ... 381

M. Soria, J. Cadafalch, R. Consul, K. Claramunt, and A. Oliva
A Parallel Algorithm for the Detailed Numerical Simulation of Reactive Flows ... 389

A. Soulaimani, A. Rebaine, and Y. Saad
Parallelization of the Edge-based Stabilized Finite Element Method ... 397

A. Twerda, R. L. Verweij, T. W. J. Peeters, and A. F. Bakker
The Need for Multigrid for Large Computations ... 407

A. Uzun, H. U. Akay, and C. E. Bronnenberg
Parallel Computations of Unsteady Euler Equations on Dynamically Deforming Unstructured Grids ... 415

G. Vahala, J. Carter, D. Wah, L. Vahala, and P. Pavlo
Parallelization and MPI Performance of Thermal Lattice Boltzmann Codes for Fluid Turbulence ... 423

T. Watanabe and K. Ebihara
Parallel Computation of Two-Phase Flows Using the Immiscible Lattice Gas ... 431

P. Wilders
Parallel Performance Modeling of an Implicit Advection-Diffusion Solver ... 439

M. A. Woodgate, K. J. Badcock, and B. E. Richards
A Parallel 3D Fully Implicit Unsteady Multiblock CFD Code Implemented on a Beowulf Cluster ... 447
PLENARY PAPERS
Parallel Computational Fluid Dynamics Towards Teraflops, Optimization and Novel Formulations D. Keyes, A. Ecer, J. Periaux, N. Satofuka and P. Fox (Editors) 2000 Elsevier Science B.V. All rights reserved.
Parallel multigrid solution and optimization in compressible flow simulation and design

J.-A. Désidéri, L. Fournier, S. Lanteri and N. Marco, INRIA Projet Sinus, 2004 Route des Lucioles, 06902 Sophia-Antipolis Cedex, B. Mantel, J. Périaux, Dassault Aviation, France, J.F. Wang, Nanjing Institute of Aeronautics and Astronautics, China.

This paper describes recent achievements regarding the development of parallel multigrid (PMG) methods and parallel genetic algorithms (PGAs) in the framework of compressible flow simulation and optimization. More precisely, this work is part of a broad research objective aimed at building efficient and robust optimization strategies for complex multi-disciplinary shape-design problems in aerodynamics. Ultimately, such a parallel optimization technique should combine the following ingredients:

I1. parallel multigrid methods for the acceleration of the underlying flow calculations;
I2. domain decomposition algorithms for an efficient and mathematically well posed distribution of the global optimization process on a network of processors;
I3. robust parallel optimization techniques combining non-deterministic algorithms (genetic algorithms in the present context) and efficient local optimization algorithms to accelerate the convergence to the optimal solution;
I4. distributed shape parametrization techniques.

In this contribution, we address mainly topics I1 and I3 in the context of optimal airfoil design.

1. PARALLEL STRATEGIES IN GENETIC ALGORITHMS

1.1. Introduction
Genetic algorithms (GAs) are search algorithms based on mechanisms simulating natural selection. They rely on the analogy with Darwin's principle of survival of the fittest. John Holland, in the 1970's, introduced the idea according to which difficult optimization problems could be solved by such an evolutionary approach. The technique operates on a population of potential solutions represented by strings of binary digits (called chromosomes or individuals) which are submitted to several semi-stochastic operators (selection, crossover and mutation). The population evolves during the generations according to the fitness value of the individuals; then, when a stationary state is reached, the population has converged to an/the optimized solution (see [3] for an introduction to the
subject). GAs differ from classical optimization procedures, such as the steepest descent or conjugate gradient method, in many ways:
- the entire parameter set is coded;
- the iteration applies to an entire population of potential solutions, in contrast to classical algorithms, in which a single candidate solution is driven to optimality by successive steps; the iteration is an "evolution" step, or new generation, conducted by semi-stochastic operators;
- the search space is investigated (more) globally, enhancing robustness.
Two keywords are linked to GAs: exploration and exploitation. Exploration of the search space is important at the beginning of the GA process, while exploitation is desirable when the GA process is close to the global optimum. GAs have been introduced in aerodynamic shape design problems for about fifteen years (see Kuiper et al. [4], Périaux et al. in [9], Quagliarella in [10] and Obayashi in [8], who present 3D results for a transonic flow around a wing geometry). The main concern related to the use of GAs for aerodynamic design is the computational effort needed for the accurate evaluation of a design configuration that, in the case of a crude application of the technique, might lead to unacceptable computer time if compared with more classical algorithms. In addition, hard problems need larger populations and this translates directly into higher computational costs. It is a widely accepted position that GAs can be effectively parallelized and can in principle take full advantage of (massively) parallel computer architectures. This point of view is above all motivated by the fact that within a generation (iteration) of the algorithm, the fitness values associated with each individual of the population can be evaluated in parallel. In this study, we have developed a shape optimum design methodology that combines the following ingredients:
- the underlying flow solver discretizes the Euler or full Navier-Stokes equations using a mixed finite element/finite volume formulation on triangular meshes. Time integration to steady state is achieved using a linearized Euler implicit scheme which results in the solution of a linear system for advancing the solution to the next time step;
- a binary-coded genetic algorithm is used as the main optimization kernel. In our context, the population of individuals is represented by airfoil shapes. The shape parametrization strategy is based on Bézier curves. (A schematic sketch of one GA generation is given below.)
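To make the preceding description concrete, the following minimal Python sketch shows one generation of a binary-coded GA with binary tournament selection, one-point crossover and bit-flip mutation. It is an illustrative sketch only, not the authors' code: the evaluate() function is a placeholder for the expensive CFD-based fitness evaluation (e.g. the drag/lift functional used in Section 1.3), and the chromosome length, population size and probabilities are assumptions.

```python
import random

POP_SIZE = 30
CHROM_LEN = 7 * 2 * 8          # assumption: 7+7 control points, 8 bits each
P_CROSS, P_MUT = 0.8, 1.0 / CHROM_LEN

def evaluate(chromosome):
    """Placeholder for the expensive fitness evaluation.

    In the paper this would decode the chromosome into an airfoil shape,
    run the flow solver, and return e.g. J = CD + 10*(CL - CL_ref)**2
    (smaller is better)."""
    return sum(chromosome)     # dummy value so the sketch is runnable

def tournament(pop, fit):
    """Binary tournament: pick two individuals, keep the fitter (lower J)."""
    a, b = random.sample(range(len(pop)), 2)
    return pop[a] if fit[a] < fit[b] else pop[b]

def crossover(p1, p2):
    """One-point crossover with probability P_CROSS."""
    if random.random() < P_CROSS:
        cut = random.randint(1, CHROM_LEN - 1)
        return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
    return p1[:], p2[:]

def mutate(c):
    """Independent bit-flip mutation."""
    return [bit ^ 1 if random.random() < P_MUT else bit for bit in c]

def next_generation(pop):
    fit = [evaluate(c) for c in pop]       # independent -> parallelizable
    new_pop = []
    while len(new_pop) < len(pop):
        c1, c2 = crossover(tournament(pop, fit), tournament(pop, fit))
        new_pop.extend([mutate(c1), mutate(c2)])
    return new_pop[:len(pop)]

population = [[random.randint(0, 1) for _ in range(CHROM_LEN)]
              for _ in range(POP_SIZE)]
for generation in range(50):
    population = next_generation(population)
```

Since evaluate() dominates the cost when a flow solver is behind it, the fitness evaluations inside next_generation() are the natural target for the parallelization strategies discussed in Section 1.2.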
1.2. Parallelization strategy
Several possible strategies can be considered for the parallelization of the GA-based shape design optimization described above:
- a first strategy stems from the following remark: within a given generation of a GA, the evaluation of the fitness values associated with the population of individuals defines independent processes. This makes GAs particularly well suited for massively parallel systems; we also note that a parent/child approach is a standard candidate for the implementation of this first level of parallelism, especially when the size of the populations is greater than the available number of processors;
- a second strategy consists of concentrating the parallelization efforts on the process underlying a fitness value evaluation, here the flow solver. This approach finds its main motivation in the fact that, when complex field analysers are used in conjunction with a GA, the aggregate cost of fitness value evaluations can represent between 80 and 90% of the total optimization time. A SPMD paradigm is particularly well suited for the implementation of this strategy;
- the third option combines the above two approaches and clearly yields a two-level parallelization strategy which has been considered here and which will be detailed in the sequel.
Our choice has been motivated by the following remarks: (1) a parallel version of the two-dimensional flow solver was available and adapted to the present study; (2) we have targeted a distributed memory SPMD implementation and we did not want the resulting optimization tool to be limited by memory capacity constraints, especially since the present study will find its sequel in its adaptation to 3D shape optimization problems, based on more complex aerodynamical models (Navier-Stokes equations coupled with a turbulence model); and (3) we believe that the adopted parallelization strategy will define a good starting-point for the construction and evaluation of sub-population-based parallel genetic algorithms (PGAs).
In our context, the parallelization strategy adopted for the flow solver combines domain partitioning techniques and a message-passing programming model [6]. The underlying mesh is assumed to be partitioned into several submeshes, each one defining a subdomain. Basically, the same "base" serial code is executed within every subdomain. Applying this parallelization strategy to the flow solver results in modifications occurring in the main time-stepping loop in order to take into account one or several assembly phases of the subdomain results. The coordination of subdomain calculations through information exchange at artificial boundaries is implemented using calls to functions of the MPI library. The parallelization described above aims at reducing the cost of the fitness function evaluation for a given individual. However, another level of parallelism can clearly be exploited here and is directly related to the binary tournament approach and the crossover operator. In practice, during each generation, individuals of the current population are treated pairwise; this applies to the selection, crossover, mutation and fitness function evaluation steps. Here, the main remark is that for this last step, the evaluation of the fitness functions associated with the two selected individuals defines independent operations. We have chosen to exploit this fact using the notion of process groups, which is one of the main features of the MPI environment. Two groups are defined, each of them containing the same number of processes; this number is given by the number of subdomains in the partitioned mesh. Now, each group is responsible for the evaluation
of the fitness function for a given individual. We note in passing that such an approach based on process groups will also be interesting in the context of sub-population-based PGAs (see [1] for a review on the subject); this will be considered in a future work.
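The sketch below illustrates, with mpi4py, how two process groups of the kind described above could be carved out of a global communicator so that each group evaluates the fitness of one individual of a selected pair, while the subdomain-level flow solve proceeds inside each group. This is an illustrative reconstruction under stated assumptions, not the authors' implementation; solve_flow_on_subdomain() is a placeholder for the domain-decomposed solver, and the round-robin split of ranks into groups is an arbitrary choice.

```python
from mpi4py import MPI

def solve_flow_on_subdomain(individual, subcomm):
    """Placeholder for the domain-decomposed flow solve.

    Each rank of `subcomm` would own one submesh; halo data at artificial
    boundaries would be exchanged with point-to-point MPI calls inside the
    time-stepping loop. Here we just return a dummy local contribution."""
    return float(sum(individual)) / subcomm.Get_size()

world = MPI.COMM_WORLD
n_groups = 2                                  # one group per individual of a pair
group_id = world.Get_rank() % n_groups        # assumption: simple round-robin split
group = world.Split(color=group_id, key=world.Get_rank())

# Pair of individuals selected by the GA (dummy chromosomes for the example).
pair = [[0, 1, 0, 1], [1, 1, 0, 0]]
individual = world.bcast(pair, root=0)[group_id]

# Each group computes "its" fitness; subdomain contributions are combined
# with a reduction over the group's own communicator.
local = solve_flow_on_subdomain(individual, group)
fitness = group.allreduce(local, op=MPI.SUM)

# Gather the two fitness values back on the global root for the GA operators.
all_fitness = world.gather((group_id, fitness), root=0)
if world.Get_rank() == 0:
    print(sorted(set(all_fitness)))
```

Run with, for example, mpiexec -n 10 python script.py to mimic the Ng = 2, Np = 10 configuration whose timings are reported in Tables 2 and 3.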
1.3. An optimum shape design case
The method has been applied to a direct optimization problem consisting in designing the shape of an airfoil, symbolically denoted by γ, to reduce the shock-induced drag, C_D, while preserving the lift, C_L, at the reference value, C_L^{RAE}, corresponding to the RAE2822 airfoil, immersed in an Eulerian flow at 2° of incidence and a freestream Mach number of 0.73. Thus, the cost functional was given the following form:

J(\gamma) = C_D + 10 \, \left( C_L - C_L^{RAE} \right)^2 \qquad (1)
The non-linear convergence tolerance has been fixed to 10^{-6}. The computational mesh consists of 14747 vertices (160 vertices on the airfoil) and 29054 triangles. Here, each "chromosome" represents a candidate airfoil defined by a Bézier spline whose support is made of 7+7 control points at prescribed abscissas for the upper and lower surfaces. A population of 30 individuals has been considered. After 50 generations, the shape has evolved and the shock has been notably reduced; the initial and final flows (iso-Mach values) are shown on Figure 1. Additionally, initial and final values of C_D and C_L are given in Table 1. The calculations have been performed on the following systems: an SGI Origin 2000 (equipped with 64 MIPS R10000/195 MHz processors) and an experimental Pentium Pro (P6/200 MHz, running the LINUX system) cluster where the interconnection is realized through FastEthernet (100 Mbits/s) switches. The native MPI implementation has been used on the SGI Origin 2000 system while MPICH 1.1 has been used on the Pentium Pro cluster. Performance results are given for 64 bit arithmetic computations.
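As an illustration of the chromosome encoding just described, the sketch below decodes a binary string into the ordinates of 7+7 Bézier control points (abscissas held fixed) and evaluates the resulting upper and lower surface curves with the de Casteljau algorithm. The bit width, ordinate bounds and fixed abscissas are assumptions chosen for the example; the authors' actual parametrization details are not reproduced here.

```python
import random

BITS = 8                                         # assumed bits per ordinate
X_CTRL = [0.0, 0.05, 0.2, 0.4, 0.6, 0.8, 1.0]    # assumed fixed abscissas
Y_MIN, Y_MAX = -0.15, 0.15                       # assumed ordinate bounds

def decode(chromosome):
    """Split a flat bit list into 7 upper + 7 lower control-point ordinates."""
    def to_real(bits):
        value = int("".join(map(str, bits)), 2)
        return Y_MIN + (Y_MAX - Y_MIN) * value / (2**BITS - 1)
    genes = [chromosome[i*BITS:(i+1)*BITS] for i in range(2 * len(X_CTRL))]
    ys = [to_real(g) for g in genes]
    return ys[:len(X_CTRL)], ys[len(X_CTRL):]

def de_casteljau(points, t):
    """Evaluate the Bezier curve defined by `points` at parameter t in [0, 1]."""
    pts = list(points)
    while len(pts) > 1:
        pts = [((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
               for (x0, y0), (x1, y1) in zip(pts[:-1], pts[1:])]
    return pts[0]

def airfoil_surfaces(chromosome, n_samples=50):
    """Return sampled (x, y) points of the upper and lower surfaces."""
    y_up, y_lo = decode(chromosome)
    upper = [de_casteljau(list(zip(X_CTRL, y_up)), i / (n_samples - 1))
             for i in range(n_samples)]
    lower = [de_casteljau(list(zip(X_CTRL, y_lo)), i / (n_samples - 1))
             for i in range(n_samples)]
    return upper, lower

# Example usage with a random chromosome of the assumed length.
chrom = [random.randint(0, 1) for _ in range(2 * len(X_CTRL) * BITS)]
upper, lower = airfoil_surfaces(chrom)
```

In an actual optimization, the sampled surfaces would be fed to the mesh deformation and flow solver to produce the C_D and C_L values entering the cost functional (1).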
Figure 1. Drag reduction: initial and optimized flows (steady iso-Mach lines)
We compare timing measurements for the overall optimization using one and two process groups. Timings are given for a fixed number of generations (generally 5 optimization iterations). In Tables 2 and 3 below, Ng and Np respectively denote the number of process groups and the total number of processes (Ng = 2 and Np = 4 means 2 processes for each of the two groups), "CPU" is the total CPU time, "Flow" is the accumulated flow solver time, and "Elapsed" is the total elapsed time (the distinction between the CPU and the elapsed times is particularly relevant for the Pentium Pro cluster). Finally, S(Np) is the parallel speed-up ratio Elapsed(Ng = 1, Np = 5)/Elapsed(Ng, Np), the case Ng = 1, Np = 5 serving as a reference. For the multiple processes cases, the given timing measures ("CPU" and "Flow") always correspond to the maximum value over the per-process measures.
Table 1
Drag reduction: initial and optimized values of the C_D and C_L coefficients

        initial   optimized
C_L     0.8068    0.8062
C_D     0.0089    0.0048
Table 2
Parallel performance results on the SGI Origin 2000

Ng   Np   Elapsed     CPU         Flow (Min)   Flow (Max)   S(Np)
1     5   2187 sec    2173 sec    1934 sec     1995 sec     1.0
2    10   1270 sec    1261 sec    1031 sec     1118 sec     1.7
1    10   1126 sec    1115 sec     900 sec      953 sec     1.9
Table 3
Parallel performance results on the Pentium Pro cluster

Ng   Np   Elapsed      CPU          Flow (Min)   Flow (Max)   S(Np)
1     5   18099 sec    14974 sec    13022 sec    14387 sec    1.0
2    10    9539 sec     8945 sec     7291 sec     8744 sec    1.85
1    10   10764 sec     8866 sec     7000 sec     7947 sec    1.7
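As a quick reading of the two tables above, the snippet below recomputes the speed-up column from the elapsed times, using the Ng = 1, Np = 5 run as the reference, exactly as S(Np) is defined above. The timings are transcribed from Tables 2 and 3; the recomputed ratios agree with the printed S(Np) values to within a few percent (the printed values may have been derived from slightly different raw timings or rounding).

```python
# Elapsed times in seconds, keyed by (Ng, Np), transcribed from Tables 2 and 3.
elapsed = {
    "SGI Origin 2000":     {(1, 5): 2187,  (2, 10): 1270, (1, 10): 1126},
    "Pentium Pro cluster": {(1, 5): 18099, (2, 10): 9539, (1, 10): 10764},
}

for machine, runs in elapsed.items():
    reference = runs[(1, 5)]
    for (ng, np_), t in sorted(runs.items()):
        speedup = reference / t      # S(Np) = Elapsed(1, 5) / Elapsed(Ng, Np)
        print(f"{machine}: Ng={ng}, Np={np_:2d}  S = {speedup:.2f}")
```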
For both architectures, the optimal speed-up of 2 is close to being achieved. For the Pentium Pro cluster, the communication penalty is larger, and this favors the usage of fewer processors and more groups. For the SGI Origin 2000, the situation is different: communication only involves memory access (shared memory), and parallelization remains
Figure 2. Geometry of multi-element airfoil including slat, main body and flap (shape and position definition)
efficient as the number of processors increases; moreover, additional gains are achieved due to the larger impact of cache memory when subdomains are small.

1.4. High-lift multi-element airfoil optimization by GAs and multi-agent strategies
In this section, we report on numerical experiments conducted to optimize the configuration (shape and position) of a high-lift multi-element airfoil by both conventional GAs and more novel ones, based on multi-agent strategies better fit to parallel computations. These experiments have also been described in [7] in part. The increased performance requirements for high-lift systems as well as the availability of (GA-based) optimization methods tend to renew the emphasis on multi-element aerodynamics. High-lift systems, as depicted on Figure 2, consist of a leading-edge device (slat) whose effect is to delay stall angle, and a trailing-edge device (flap) to increase the lift while maintaining a high L/D ratio. The lift coefficient C_L of such an airfoil is very sensitive to the flow features around each element and its relative position to the main body; in particular, the location of the separation point can change rapidly due to the wake/boundary-layer interaction. As a result, the functional is non-convex and presents several local optima, making use of a robust algorithm necessary to a successful optimization. Here, the 2D flow computation is conducted by the Dassault-Aviation code "Damien" which combines an inviscid flow calculation by a panel method with a wake/boundary-layer interaction evaluation. This code, which incorporates data concerning transition criteria, separated zones, and wake/boundary-layer interaction, has been thoroughly calibrated by assessments and global validation through comparisons with ONERA wind-tunnel measurements. As a result, viscous effects can be computed, and this provides at a very low cost a fairly accurate determination of the aerodynamic coefficients. As a
Figure 3. Initial (top) and final (bottom) configurations in position optimization problem
counterpart of this simplified numerical model, the flow solver is non differentiable and can only be treated as a black box in the optimization process. Evolutionary algorithms are thus a natural choice to conduct the optimization in a situation of this type. Figure 3 relates to a first experiment in which only the 6 parameters determining the positions relative to the main body (deflection angle, overlap and gap) of the two high-lift devices (slat and flap) have been optimized by a conventional GA, similar to the one of the previous section. Provided reasonable ranges are given for each parameter, an optimum solution is successfully found by the GA, corresponding to an improved lift coefficient of 4.87. The second experiment is the first step in the optimization of the entire configuration consisting of the shapes of the three elements and both positions of slat and flap. More precisely, only the inverse problem consisting in reconstructing the pressure field is considered presently. If γ_S, γ_B and γ_F denote the design variables associated with the slat, main body and flap respectively, and γ = (γ_S, γ_B, γ_F) represents a candidate configuration, one minimizes the following functional:
J(\gamma) = J_S(\gamma) + J_B(\gamma) + J_F(\gamma) \qquad (2)
in which, for example:
J_S(\gamma) = \int_{\gamma_S} \left( p(\gamma) - p_t \right)^2 \, d\sigma \qquad (3)
J_S(γ) is a positive integral extending over one element only (slat) but reflecting in the integrand the interaction of the whole set of design variables. Here, p_t is the given target pressure; similar definitions are made for J_B(γ) and J_F(γ). In order to reduce the number of design variables and enforce smoothness of the geometry, shapes are represented by Bézier curves. This inverse problem has been solved successfully by the GA first by a "global" algorithm in which the chromosomes contain the coded information associated with all design variables indiscriminately. The convergence history of this experiment is indicated on Figure 4 for the first 200 generations.

An alternative to this approach is provided by optimization algorithms based on (pseudo) Nash strategies in which the design variables are a priori partitioned into appropriate subsets. The population of a given subset evolves independently at each new generation according to its own GA, with the remaining design variables being held fixed and equal to the best elements in their respective populations found at the previous generation. Evidently, in such an algorithm, the computational task of the different GAs, or "players" to use a term of game theory, can be performed in parallel [11]. Many different groupings of the design variables can be considered, two of which are illustrated on Figure 4: a 3-player strategy (slat shape and position; body shape; flap shape and position) and a 5-player strategy (slat, body and flap shapes, slat and flap positions). The two algorithms achieve about the same global optimum, but the parameters of the GAs (population size) have been adjusted so that the endpoints of the two convergence paths correspond to the same number of functional evaluations as 100 generations of the global algorithm. Thus, this experiment simulates a comparison of algorithms at equal "serial" cost. This demonstrates the effectiveness of the multi-agent approach, which achieves the global optimum by computations that could evidently be performed in parallel.

We terminate this section by two remarks concerning Nash strategies. First, in a preliminary experiment related to the slat/flap position inverse problem, a 2-player game in which the main body (γ_B) is fixed, an attempt was made to let the population of γ_S evolve at each generation according to an independent GA minimizing the partial cost functional j_S(γ_S) = J_S(γ_S, γ_B, γ_F) only, (the flap design variables being held fixed to the best element γ_F found at the previous generation,) and symmetrically for γ_F being driven by j_F(γ_F) = J_F(γ_S, γ_B, γ_F). Figure 5 indicates that in such a case, the algorithm fails to achieve the desired global optimum. Second, observe that in the case of a general cost function, a Nash equilibrium in which a minimum is found with respect to each subgroup of variables, the other variables being held fixed, does not necessarily realize a global minimum. For example, in the trivial case of a function f(x, y) of two real variables, a standard situation in which the partial functions:
\varphi(x) = f(x, y^*), \qquad \psi(y) = f(x^*, y) \qquad (4)
achieve local minima at x* and y* respectively, is realized typically when:
\varphi'(x^*) = 0, \quad \psi'(y^*) = 0, \qquad \varphi''(x^*) > 0, \quad \psi''(y^*) > 0 \qquad (5)
[Convergence plot: cost functional versus number of generations (0-200) for the global GA (population size 30), the Nash 3-player strategy and the Nash 5-player strategy.]
Figure 4. Optimization of shape and position design variables by various strategies all involving the same number of cost functional evaluations
and this does not imply that the Hessian matrix be positive definite. However, in the case of an inverse problem in which each individual positive component of the cost function is driven to 0, the global optimum is indeed achieved.

2. PARALLEL MULTIGRID ACCELERATION

2.1. Introduction
Clearly, reducing the time spent in flow calculations (to the minimum) is crucial to make GAs a viable alternative to other optimization techniques. One possible strategy to achieve this goal consists in using a multigrid method to accelerate the solution of the linear systems resulting from the linearized implicit time integration scheme. As a first step, we have developed parallel linear multigrid algorithms for the acceleration of compressible steady flow calculations, independently of the optimization framework. This is justified by the fact that the flow solver is mainly used as a black box by the GA. The starting point consists of an existing flow solver based on the averaged compressible Navier-Stokes equations, coupled with a k-ε turbulence model [2]. The spatial discretization combines finite element and finite volume concepts and is designed for unstructured triangular meshes. Steady state solutions of the resulting semi-discrete equations are obtained by using an Euler implicit time advancing strategy which has the following features: linearization (approximate linearization of the convective fluxes and exact differentiation of the viscous terms); preconditioning (the Jacobian matrix is based on a first-order Godunov
[Convergence plot: slat and flap position parameters versus number of generations (0-40) for the two fitness definitions D1 and D2.]
Figure 5. Effect of the fitness definition on the convergence of slat/flap position parameters
scheme); and local time stepping and CFL law (a local time step is computed on each control volume). Each pseudo time step requires the solution of two sparse linear systems (respectively, for the mean flow variables and for the variables associated with the turbulence model). The multigrid strategy is adopted to gain efficiency in the solution of the two subsystems. In the present method, the coarse-grid approximation is based on the construction of macro-elements, more specifically macro control-volumes by "volume-agglomeration". Starting from the finest mesh, a "greedy" coarsening algorithm is applied to generate automatically the coarse discretizations (see Lallemand et al. [5]). Parallelism is introduced in the overall flow solver by using a strategy that combines mesh partitioning techniques and a message passing programming model. The MPI environment is used for the implementation of the required communication steps. Both the discrete fluxes calculation and the linear systems solution are performed on a submesh basis; in particular, for the basic linear multigrid algorithm, which is multiplicative (i.e. the different levels are treated in sequence with inter-dependencies between the partial results produced on the different levels), this can be viewed as an intra-level parallelization which concentrates on the smoothing steps performed on each member of the grid hierarchy. A necessary and important step in this adaptation was the construction of appropriate data structures for the distribution of coarse grid calculations. Here, this has been achieved by developing a parallel variant of the original "greedy" type coarsening algorithm, which now includes additional communication steps for a coherent construction of the communication data structures on the partitioned coarse grids.

2.2. Laminar flow around a NACA0012 airfoil
The test case under consideration is given by the external flow around a NACA0012 airfoil at a freestream Mach number of 0.8, a Reynolds number equal to 73 and an angle of incidence of 10°. The underlying mesh contains 194480 vertices and 387584 triangles. We are interested here in comparing the single grid and the multigrid approaches when solving the steady laminar viscous flow under consideration. Concerning the single grid algorithm, the objective is to choose appropriate values for the number of relaxation steps and the tolerance on the linear residual so that a good compromise is obtained between the number of non-linear iterations (pseudo time steps) to convergence and the corresponding elapsed time. For both algorithms, the time step is calculated according to CFL = min(500 × it, 10^6) where it denotes the non-linear iteration. Table 4 compares results of various simulations performed on a 12 node Pentium Pro cluster. In this table, ∞ means that the number of fine mesh Jacobi relaxations (νf) or the number of multigrid V-cycles (Nc) has been set to an arbitrarily large value such that the linear solution is driven until the prescribed residual reduction (ε) is attained; ν1 and ν2 denote the number of pre- and post-smoothing steps (Jacobi relaxations) when using the multigrid algorithm. We observe that the non-linear convergence of the single grid is optimal when driving the linear solution to a two decade reduction of the normalized linear residual. However, the corresponding elapsed time is minimized when fixing the number of fine mesh relaxations to 400. However, one V-cycle with 4 pre- and post-smoothing steps is sufficient for an optimal convergence of the multigrid algorithm. Comparing the two entries of Table 4 corresponding to the case ε = 10^-1, it is seen that the multigrid algorithm yields a non-linear convergence in 117 time steps instead of 125 time steps for the single grid algorithm. This indicates that when the requirement on linear convergence is the same, the multigrid non-linear solution demonstrates somewhat better convergence due to a more uniform treatment of the frequency spectrum. We conclude by noting that the multigrid algorithm is about 16 times faster than the single grid algorithm for the present test case, which involves about 0.76 million unknowns.
Table 4
Simulations on a FastEthernet Pentium Pro cluster, Np = 12

Single grid algorithm
Ng   νf    ε        Niter   Elapsed      CPU          % CPU
1    ∞     10^-1    125     9 h 28 mn    8 h 24 mn    88
1    ∞     10^-2    117     9 h 40 mn    8 h 48 mn    91
1    350   10^-10   178     9 h 10 mn    8 h 17 mn    90
1    400   10^-10   157     9 h 06 mn    8 h 14 mn    90
1    450   10^-10   142     9 h 28 mn    8 h 20 mn    88

Multigrid algorithm
Ng   Nc   [ν1, ν2]   ε        Niter   Elapsed      CPU          % CPU
6    ∞    4/4        10^-1    117     57 mn        50 mn        88
6    ∞    4/4        10^-2    116     1 h 56 mn    1 h 42 mn    88
6    1    4/4        10^-10   117     33 mn        29 mn        87
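To illustrate the kind of V-cycle referred to in Table 4 (ν1 pre- and ν2 post-smoothing Jacobi sweeps per level), here is a self-contained Python sketch of a linear multigrid V-cycle for a 1D Poisson model problem. It only mirrors the structure of the method; the paper's solver works on agglomerated unstructured control volumes and on the linearized flow Jacobian, none of which is reproduced here, and all numerical parameters below are assumptions.

```python
import numpy as np

def jacobi(u, f, h, sweeps, omega=2.0/3.0):
    """Weighted Jacobi smoothing for -u'' = f on a uniform grid (Dirichlet BCs)."""
    for _ in range(sweeps):
        u_new = u.copy()
        u_new[1:-1] = (1 - omega) * u[1:-1] + omega * 0.5 * (
            u[:-2] + u[2:] + h * h * f[1:-1])
        u = u_new
    return u

def residual(u, f, h):
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] - (2 * u[1:-1] - u[:-2] - u[2:]) / (h * h)
    return r

def restrict(r):
    """Full-weighting restriction, fine grid -> coarse grid."""
    interior = 0.25 * r[1:-2:2] + 0.5 * r[2:-1:2] + 0.25 * r[3::2]
    return np.concatenate(([0.0], interior, [0.0]))

def prolong(e):
    """Linear-interpolation prolongation, coarse grid -> fine grid."""
    fine = np.zeros(2 * (e.size - 1) + 1)
    fine[::2] = e
    fine[1::2] = 0.5 * (e[:-1] + e[1:])
    return fine

def v_cycle(u, f, h, nu1=4, nu2=4):
    """One V-cycle with nu1 pre- and nu2 post-smoothing sweeps per level."""
    if u.size <= 3:                      # coarsest level: smooth "exactly"
        return jacobi(u, f, h, 50)
    u = jacobi(u, f, h, nu1)             # pre-smoothing
    r_coarse = restrict(residual(u, f, h))
    e_coarse = v_cycle(np.zeros_like(r_coarse), r_coarse, 2 * h, nu1, nu2)
    u = u + prolong(e_coarse)            # coarse-grid correction
    return jacobi(u, f, h, nu2)          # post-smoothing

# Example: solve -u'' = pi^2 sin(pi x) on [0, 1]; exact solution is sin(pi x).
n = 129
x = np.linspace(0.0, 1.0, n)
f = np.pi**2 * np.sin(np.pi * x)
u = np.zeros(n)
for cycle in range(10):
    u = v_cycle(u, f, x[1] - x[0])
print("max error:", np.max(np.abs(u - np.sin(np.pi * x))))
```

In the paper's setting the coarse levels come from greedy volume agglomeration of the unstructured fine mesh rather than from uniform coarsening, and the cycle is applied to the Jacobian systems arising at each pseudo time step.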
3. CONCLUSIONS AND PERSPECTIVES
Cost-efficient solutions to the Navier-Stokes equations have been computed by means of (multiplicative) multigrid algorithms made parallel via domain-decomposition methods (DDM) based on mesh partitioning. Current research is focused on additive formulations in which the (fine-grid) residual equations are split into a high-frequency and a low-frequency subproblem that are solved simultaneously, the communication cost also being reduced (since longer vectors are transferred at fewer communication steps). Genetic algorithms have been shown to be very robust in complex optimization problems such as shape design problems in aerodynamics. In their base formulation, these algorithms may be very costly since they rely on functional evaluations only. As a counterpart, their formulation is very well suited for several forms of parallel computing by (i) DDM in the flow solver; (ii) grouping the fitness function evaluations; (iii) considering subpopulations evolving independently and migrating information regularly [11] (not shown here); (iv) elaborating adequate multi-agent strategies based on game theory. Consequently, great prospects are foreseen for evolutionary optimization in the context of high-performance computing.
REFERENCES
1. E. Cantfi-Paz. A summary of research on parallel genetic algorithms. Technical Report 95007, IlliGAL Report, University of Illinois at Urbana-Champaign, Illinois Genetic Algorithms Laboratory, Urbana, (1995). 2. G. CarrY. An implicit multigrid method by agglomeration applied to turbulent flows. Computers ~ Fluids, (26):299-320, (1997). 3. D.E. Goldberg. Genetic algorithms in search, optimization and machine learning. Addison-Wesley Company Inc., (1989). 4. H. Kuiper, A.J. Van der Wees, C.F. Hendriks, and T.E. Labrujere. Application of genetic algorithms to the design of airfoil pressure distribution. NLR Technical publication TP95342L for the ECARP European Project. 5. M.-H. Lallemand, H. Steve, and A. Dervieux. Unstructured multigridding by volume agglomeration : current status. Computers ~ Fluids, (21):397-433, (1992). 6. S. Lanteri. Parallel solutions of compressible flows using overlapping and nonoverlapping mesh partitioning strategies. Parallel Comput., 22:943-968, (1996). 7. B. Mantel, J. P~riaux, M. Sefrioui, B. Stoufflet, J.A. D~sid~ri, S. Lanteri, and N. Marco. Evolutionary computational methods for complex design in aerodynamics.
AIAA 98-0222. 8. S. Obayashi and A. Oyama. Three-dimensional aerodynamic optimization with genetic algorithm. In J.-A. D~sid~ri et al., editor, Computational Fluid Dynamics '96, pages 420-424. J. Wiley & Sons, (1996). 9. J. P~riaux, M. Sefrioui, B. Stouifiet, B. Mantel, and E. Laporte. Robust genetic algorithms for optimization problems in aerodynamic design. In G. Winter et. al., editor, Genetic algorithms in engineering and computer science, pages 371-396. John Wiley & Sons, (1995). 10. D. Quagliarella. Genetic algorithms applications in computational fluid dynamics.
15
In G. Winter et. al., editor, Genetic algorithms in engineering and computer science, pages 417-442. John Wiley & Sons, (1995). 11. M. Sefrioui. Algorithmes Evolutionnaires pour le calcul scientifique. Application l'~lectromagn~tisme e t ~ la m~canique des fluides num~riques. Th~se de doctorat de l'Universitd de Paris 6 (Spdcialitd : Informatique), 1998.
This Page Intentionally Left Blank
Parallel ComputationalFluid Dynamics Towards Teraflops, Optimizationand Novel Formulations D. Keyes,A. Ecer, J. Periaux,N. Satofukaand P. Fox (Editors) 2000 Elsevier ScienceB.V. All rightsreserved.
High-Performance
17
Spectral Element Algorithms and Implementations*
Paul F. Fischer a and Henry M. Tufo b ~Mathematics and Computer Science Division, Argonne National Laboratory, Argonne IL, 60439 USA f i s c h e r ~ m c s , a n l . gov (http://www.mcs.anl.gov/~fischer) bDepartment of Computer Science, The University of Chicago, 1100 East 58th Street, Ryerson 152, Chicago, IL 60637 USA hmt 9 u c h i c a g o , edu (http://www.mcs.anl.gov/~tufo) We describe the development and implementation of a spectral element code for multimillion gridpoint simulations of incompressible flows in general two- and three-dimensional domains. Parallel performance is presented on up to 2048 nodes of the Intel ASCI-Red machine at Sandia National Laboratories. 1. I N T R O D U C T I O N We consider numerical solution of the unsteady incompressible Navier-Stokes equations, 0u
0t
+ u. Vu - -Vp
1
2
+ - a - V u, /le
-V-u
= 0,
coupled with appropriate boundary conditions on the velocity, u. We are developing a spectral element code to solve these equations on modern large-scale parallel platforms featuring cache-based nodes. As illustrated in Fig. 1, the code is being used with a number of outside collaborators to address challenging problems in fluid mechanics and heat transfer, including the generation of hairpin vortices resulting from the interaction of a flat-plate boundary layer with a hemispherical roughness element; modeling the geophysical fluid flow cell space laboratory experiment of buoyant convection in a rotating hemispherical shell; Rayleigh-Taylor instabilities; flow in a carotid artery; and forced convective heat transfer in grooved-flat channels. This paper discusses some of the critical algorithmic and implementation features of our numerical approach that have led to efficient simulation of these problems on modern parallel architectures. Section 2 gives a brief overview of the spectral element discretization. Section 3 discusses components of the time advancement procedure, including a projection method and parallel coarse-grid solver, which are applicable to other problem *This work was supported by the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, U.S. Department of Energy, under Contract W-31-109-Eng-38; by the Department of Energy under Grant No. B341495 to the Center on Astrophysical Thermonuclear Flashes at University of Chicago; and by the University of Chicago.
18
Figure 1. Recent spectral element simulations. To the right, from the top: hairpin vortex generation in wake of hemispherical roughness element (Re~ = 850); spherical convection simulation of the geophysical fluid flow cell at Ra = 1.1 x 105, Ta = 1.4 x 108; twodimensional Rayleigh-Taylor instability; flow in a carotid artery; and temporal-spatial evolution of convective instability in heat-transfer augmentation simulations.
classes and discretizations. Section 4 presents performance results, and Section 5 gives a brief conclusion. 2. S P E C T R A L
ELEMENT
DISCRETIZATION
The spectral element method is a high-order weighted residual technique developed by Patera and coworkers in the '80s that couples the tensor product efficiency of global spectral methods with the geometric flexibility of finite elements [9,11]. Locally, the mesh is structured, with the solution, data, and geometry expressed as sums of Nth-order tensor product Lagrange polynomials based on the Gauss or Gauss-Lobatto (GL) quadrature points. Globally, the mesh is an unstructured array of 14 deformed hexahedral elements and can include geometrically nonconforming elements. The discretization is illustrated in Fig. 2, which shows a mesh in IR2 for the case ( K , N ) = (3,4). Also shown is the reference (r, s) coordinate system used for all function evaluations. The use of the GL basis for the interpolants leads to efficient quadrature for the weighted residual schemes and greatly simplifies operator evaluation for deformed elements.
19
r
9T c~.
Figure 2. Spectral element discretization in 1R2 showing GL nodal lines for (K, N) = (3, 4).
For problems having smooth solutions, such as the incompressible Navier-Stokes equations, exponential convergence is obtained with increasing N, despite the fact that only C o continuity is enforced across elemental interfaces. This is demonstrated in Table 1, which shows the computed growth rates when a small-amplitude Tollmien-Schlichting wave is superimposed on plane Poiseuille channel flow at Re = 7500, following [6]. The amplitude of the perturbation is 10 -s, implying that the nonlinear Navier-Stokes results can be compared with linear theory to about five significant digits. Three error measures are computed: errorl and error2 are the relative amplitude errors at the end of the first and second periods, respectively, and erro% is the error in the growth rate at a convective time of 50. From Table 1, it is clear that doubling the number of points in each spatial direction yields several orders of magnitude reduction in error, implying that only a small increase in resolution is required for very good accuracy. This is particularly significant because, in three dimensions, the effect on the number of gridpoints scales as the cube of the relative savings in resolution.
Table 1 Spatial convergence, Orr-Summerfeld problem: K = 15, At = .003125 N 7 9 11 13 15
E(tl) 1.11498657 1.11519192 1.11910382 1.11896714 1.11895646
errOrl 0.003963 0.003758 0.000153 0.000016 0.000006
E(t2) 1.21465285 1.24838788 1.25303597 1.25205855 1.25206398
error2
errorg
0.037396 0.003661 0.000986 0.000009 0.000014
0.313602 0.001820 0.004407 0.000097 0.000041
20 The computational efficiency of spectral element methods derives from their use of tensor-product forms. Functions in the mapped coordinates are expressed as N
u(xk( r, s))lak -
N
E E u~jhN(r)h~(s),
(1)
i=0 j = 0
where u~j is the nodal basis coefficient; h N is the Lagrange polynomial of degree N based on the GL quadrature points, {~y}N=0; and xk(r, s) is the coordinate mapping from the reference domain, Ft "- [-1, 1]2, to f~k. With this basis, the stiffness matrix for an undeformed element k in IR2 can be written as a tensor-product sum of one-dimensional operators,
Ak -
By @~--~+
(2)
/iy |
where A. and/3, are the one-dimensional stiffness and mass matrices associated with the respective spatial dimensions. If _uk _ uijk is the matrix of nodal values on element k, then a typical matrix-vector product required of an iterative solver takes the form N
(Akuk)im -
-_
N
~ E
^
^
k +
mj
(3)
i=O j = O __
~ _ "'T --~ukBy
+
"
_
A~
.
Similar forms result for other operators and for complex geometries. The latter form illustrates how the tensor-product basis leads to matrix-vector products (Au__)being recast as matrix-matrix products, a feature central to the efficiency of spectral element methods. These typically account for roughly 90% of the work and are usually implemented with calls to D G E M M , unless hand-unrolled F77 loops prove faster on a given platform. Global matrix products, Au__,also require a gather-scatter step to assemble the elemental contributions. Since all data is stored on an element-by-element basis, this amounts to summing nodal values shared by adjacent elements and redistributing the sums to the nodes. Our parallel implementation follows the standard message-passing-based SPMD model in which contiguous groups of elements are distributed to processors and data on shared interfaces is exchanged and summed. A stand-alone MPI-based utility has been developed for these operations. It has an easy-to-use interface requiring only two calls: handle=gs_init(global_node_numbers, n)
and
i e r r = g s - o p ( u , op, handle),
where global-node-numbers 0 associates the n local values contained in the vector u 0 with their global counterparts, and op denotes the reduction operation performed on shared elements of u() [14]. The utility supports a general set of commutative/associative operations as well as a vector mode for problems having multiple degrees of freedom per vertex. Communication overhead is further reduced through the use of a recursive spectral bisection based element partitioning scheme to minimize the number of vertices shared among processors [12].
21
3.
TIME
ADVANCEMENT
AND
SOLVERS
The Navier-Stokes time-stepping is based on the second-order operator splitting methods developed in [1,10]. The convective term is expressed as a material derivative, and the resultant form is discretized using a stable second-order backward difference formula fin-2 _ 4fi_n-1 _~_3u n = S(u~),
2At where S(u n) is the linear symmetric Stokes problem to be solved implicitly, and fi___~-qis the velocity field at time step n - q computed as the explicit solution to a pure convection problem. The subintegration of the convection term permits values of At corresponding to convective CFL numbers of 2-5, thus significantly reducing the number of (expensive) Stokes solves. The Stokes problem is of the form H
-D
T
un
_
-D
(
_
0-)
and is also treated by second-order splitting, resulting in subproblems of the form
H~j - s
EF
- g/,
for the velocity components, u~, (i - 1 , . . . , 3 ) , and pressure, pn. Here, H is a diagonally dominant Helmholtz operator representing the parabolic component of the momentum equations and is readily treated by Jacobi-preconditioned conjugate gradients; E := D B - 1 D T is the Stokes Schur complement governing the pressure; and B is the (diagonal) velocity mass matrix. E is a consistent Poisson operator and is effectively preconditioned by using the overlapping additive Schwarz procedure of Dryja and Widlund [2,6,7]. In addition, a high-quality initial guess is generated at each step by projecting the solution onto the space of previous solutions. The projection procedure is summarized in the following steps: l
(~)
--
p_ -
X;
~,~_,, ~, -
~r
n
f_, g_ 9
i=1
(ii)
Solve 9 E A p -
g~ - E~ l
(~)
to tolerance ~.
(4)
l
g+, - (zXp_- E 9&)/ll/V_- E 9&ll~, ~ - ~_TEzXp_. i=1
i=1
The first step computes an initial guess, ~, as a projection of pn in the E-norm (IIP__I[E"= (p__TEp_)89 onto an existing basis, (~1"'" ,~)" The second computes the remaining (orthogonal) perturbation, Ap_, to a specified absolute tolerance, e. The third augments the approximation space with the most recent (orthonormalized) solution. The approximation space is restarted once 1 > L by setting ~1 "-- P~/IIP~IIE" The projection scheme (steps (i) and (iii)) requires two matrix-vector products per timestep, one in step (ii) and one in step (iii). (Note that it's not possible to use gff - Ep__in place of E A p in (iii) because (ii) is satisfied only to within e.)
22
Spherical
Bouyant Convection.
Spher'|ca| B o u y a n t
n=1658880
60 l , i , , l i , , i l i , i , l i , , , l i , J , l i i l , l , , , , l i i i , l i i i i l i i , , l , , , i l , , , , l i , , , l , , ,
Cenvectlon,
n=165888~
_,,,llVlV,l,,l,l,,,,l,llllll,,l,,,,ll,,,ll,l,l,l|,l,,,,
I , ,,11,,,I,,I
L=0 35 30
10-
25
L - 26
L - 26 0
5
10
15
20
25
30
35
S t e p Number
40 -
45
50
55
60
~
65
70
4~I-'
0
llllllllllllllllllllllll,ll,ll,llll,llll,lll,lll|ll,
5
10
15
20
25
30
35
S t e p Number
40 -
45
50
lllll,lll'llll,
55
60
65
70
m
Figure 3. Iteration count (left) and residual history (right) with and without projection for the 1,658,880 degree-of-freedom pressure system associated with the spherical convection problem of Fig. 1.
As shown in [4], the projection procedure can be extended to any parameter-dependent problem and has many desirable properties. It can be coupled with any iterative solver, which is treated as a black box (4ii). It gives the best fit in the space of prior solutions and is therefore superior to extrapolation. It converges rapidly, with the magnitude of the perturbation scaling as O(At ~)+ O(e). The classical Gram-Schmidt procedure is observed to be stable and has low communication requirements because the inner products for the basis coefficients can be computed in concert. Under normal production tolerances, the projection technique yields a two- to fourfold reduction in work. This is illustrated in Fig. 3, which shows the reduction in residual and iteration count for the buoyancydriven spherical convection problem of Fig. 1, computed with K = 7680 elements of order N = 7 (1,658,880 pressure degrees of freedom). The iteration count is reduced by a factor of 2.5 to 5 over the unprojected (L = 0) case, and the initial residual is reduced by two-and-one-half orders of magnitude. The perturbed problem (4ii) is solved using conjugate gradients, preconditioned by an additive overlapping Schwarz method [2] developed in [6,7]. The preconditioner, K M-1 "- RTAo 1Ro + E RT-4k-IRk, k=l
requires a local solve (~;1) for each (overlapping)subdomain, plus a global solve (Ao 1) for a coarse-grid problem based on the mesh of spectral element vertices. The operators R k and R T are simply Boolean restriction and prolongation matrices that map data between the global and local representations, while R 0 and RoT map between the fine and coarse grids. The method is naturally parallel because the subdomain problems can be solved
23
independently. Parallelization of the coarse-grid component is less trivial and is discussed below. The local subdomain solves exploit the tensor product basis of the spectral element method. Elements are extended by a single gridpoint in each of the directions normal to their boundaries. Bilinear finite element Laplacians, Ak, and lumped mass matrices,/)k, are constructed on each extended element, hk, in a form similar to (2). The tensor-product construction allows the inverse of ~-1 to be expressed as
~1
__ (Sy @ Sx)[I @ A:c "1I" A u @ I]-I(sT @ sT),
where S, is the matrix of eigenvectors, and A, the diagonal matrix of eigenvalues, solving the generalized eigenvalue problem A,z__ = A/),z_ associated with each respective spatial direction. The complexity of the local solves is consequently of the same order as the matrix-vector product evaluation (O(KN 3) storage and O ( K N 4) work in IR3) and can be implemented as in (3) using fast matrix-matrix product routines. While the tensor product form (2) is not strictly applicable to deformed elements, it suffices for preconditioning purposes to build Ak on a rectilinear domain of roughly the same dimensions as ~k [7]. The coarse-grid problem, z_ = Aol_b, is central to the efficiency of the overlapping Schwarz procedure, resulting in an eightfold decrease in iteration count in model problems considered in [6,7]. It is also a well-known source of difficulty on large distributedmemory architectures because the solution and data are distributed vectors, while Ao 1 is completely full, implying a need for all-to-all communication [3,8]. Moreover, because there is very little work on the coarse grid (typ. O(1) d.o.f, per processor), the problem is communication intensive. We have recently developed a fast coarse-grid solution algorithm that readily extends to thousands of processors [5,13]. If A 0 E IRnxn is symmetric positive definite, and X := (2_1,..., 2__~) is a matrix of A0-orthonormal vectors satisfying 2_/TA0~j -- 5ij, then the coarse-grid solution is computed as 9-
XX
b_,
(5)
i=1
Since ~_ is the best fit in Tg(X) - IRn, we have 2 _ - z_ and X X T - Ao 1. The projection procedure (5) is similar to (4/), save that the basis vectors {~i} are chosen to be sparse. Such sparse sets can be readily found by recognizing that, for any gridpoint i exterior to the stencil of j, there exist a pair of A0-conjugate unit vectors, ~i and _~j. For example, for a regular n-point mesh in IR2 discretized with a standard five-point stencil, one can immediately identify half of the unit vectors in 1Rn (e.g., those associated with the "red" squares) as unnormalized elements of X. The remainder of X can be created by applying Gram-Schmidt orthogonalization to the remainder of IR~. In [5,13], it is shown that nested dissection provides a systematic approach to identifying a sparse basis and yields a factorization of Ao I with O(na~ ~-) nonzeros for n-point grid problems in IRd, d _> 2. Moreover, the required communication volume on a P-processor machine is bounded by 3 n @ log 2 P, a clear gain over the O(n) or O(n log 2 P) costs incurred by other commonly employed approaches. The performance of the X X y scheme on ASCI-Red is illustrated in Fig. 4 for a (63 x 63) and (127 • 127) point Poisson problem (n - 3069 and n - 16129, respectively) discretized by a standard five-point stencil. Also shown are the times for the commonly used approaches of redundant banded-LU solves and row-distributed Ao 1. The latency,21og P
24 I
i
i
i
2-
2
le-O1
i Red.
LU -
le-O1 -
5
5
2 2 le-02
xx T le-02
5
5
2 le-03 i 2 le-04
'I
le-03
latency
5latency * 2log(P)
_
I
l
/
5
2
J
le-04 2
_
le-05
_
5
_
,
.
latency
5
latency
,
* 2log(P)
.
.
P
.
.
.
.
.
,
P
Figure 4. ASCI-Red solve times for a 3969 (left) and 16129 (right) d.o.f, coarse grid problem.
curve represents a lower bound on the solution time, assuming that the required all-to-M1 communication uses a contention free fan-in/fan-out binary tree routing. We see that the X X T solution time decreases until the number of processors is roughly 16 for the n = 3969 case, and 256 for the n = 16129 case. Above this, it starts to track the latency curve, offset by a finite amount corresponding to the bandwidth cost. We note that X X T approach is superior to the distributed Ao 1 approach from a work and communication standpoint, as witnessed by the substantially lower solution times in each of the workand communication-dominated regimes.
4. P E R F O R M A N C E RESULTS We have run our spectral element code on a number of distributed-memory platforms, including the Paragon at Caltech, T3E-600 at NASA Goddard, Origin2000 and SP at Argonne, ASCI-Blue at Los Alamos, and ASCI-Red at Sandia. We present recent timing results obtained using up to 2048 nodes of ASCI-Red. Each node on ASCI-Red consist of two Zeon 333 MHz Pentium II processors which can be run in single- and dual-processor mode. The dual mode is exploited for the matrix-vector products associated with H, E, and z~lk1 by partitioning the element lists on each node into two parts and looping through these independently on each of the processors. The timing results presented are for the time-stepping portion of the runs only. During production runs, usually 14 to 24 hours in length, our setup and I/O costs are typically in the range of 2-5%. The test problem is the transitional boundary layer/hemisphere calculation of Fig. 1 at Re~ = 1600, using a Blasius profile of thickness ~ = 1.2R as an initial condition. The mesh is an oct-refinement of the production mesh with (K, N) = (8168, 15) corresponding to 27,799,110 grid points for velocity and 22,412,992 for pressure.
25 I
I
I
I
I
I
I
400
3.0f/ -
-~ .0
150 100
f . . . . . ., . . . . . . . . . .,. . . . . . . . . .,. ........................... ..... , .
~;
1;
1;
Step
210
25
5
10
15
.
20
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
,_ 25
Step
Figure 5. P - 2048 ASCI-Red-333 dual-processor mode results for the first 26 time steps for (K, N) - (8168, 15)" solution time per step (left) and number of pressure and (z-component) Helmholtz iterations per step (right).
Figure 5 shows the time per step (left) and the iteration counts for the pressure and (xcomponent) Helmholtz solves (right) over the first 26 timesteps. The significant reduction in pressure iteration count is due to the difficulty of computing the initial transients and the benefits gained from the pressure projection procedure. Table 2 presents the total time and sustained performance for the 26 timesteps using a combination of unrolled f77 loops and assembly-coded D G E M M routines. Two versions of D G E M M were considered: the standard version (csmath), and a specially tuned version (perf) written by Greg Henry at Intel. We note that the average time per step for the last five steps of the 361 GF run is 15.7 seconds. Finally, the coarse grid for this problem has 10,142 distributed degrees of freedom and accounts for 4.0% of the total solution time in the worst-case scenario of 2048 nodes in dual-processor mode.
Table 2 ASCI-Red-333" total time and GFLOPS, K - 8168, N - 15. Single (csmath) Dual (csmath) Single (perf) Dual (perf) P Time(s) GFLOPS Time(s) GFLOPS Time(s) GFLOPS Time (s) G F L O P S 512 6361 47 4410 67 4537 65 3131 94 1024 3163 93 2183 135 2242 132 1545 191 2048 1617 183 1106 267 1148 257 819 361
26 5. C O N C L U S I O N We have developed a highly accurate spectral element code based on scalable solver technology that exhibits excellent parallel efficiency and sustains high MFLOPS. It attains exponential convergence, allows a convective CFL of 2-5, and has efficient multilevel elliptic solvers including a coarse-grid solver with low communication requirements. REFERENCES
1. J. BLAIR PEROT, "An analysis of the fractional step method", J. Comput. Phys., 108, pp. 51-58 (1993). 2. M. DRYJA AND O. B. WIDLUND, "An additive variant of the Schwarz alternating method for the case of many subregions", Tech. Rep. 339, Dept. Comp. Sci., Courant Inst., NYU (1987). 3. C. FARHAT AND P. S. CHEN, "Tailoring domain decomposition methods for efficient parallel coarse grid solution and for systems with many right hand sides", Contemporary Math., 180, pp. 401-406 (1994). 4. P . F . FISCHER, "Projection techniques for iterative solution of Ax - __bwith successive right-hand sides", Comp. Meth. in Appl. Mech., 163 pp. 193-204 (1998). 5. P. F. FISCHER, "Parallel multi-level solvers for spectral element methods", in Proc. Intl. Conf. on Spectral and High-Order Methods '95, Houston, TX, A. V. Ilin and L. R. Scott, eds., Houston J. Math., pp. 595-604 (1996). 6. P . F . FISCHER, "An overlapping Schwarz method for spectral element solution of the incompressible Navier-Stokes equations", J. of Comp. Phys., 133, pp. 84-101 (1997). 7. P . F . FISCHER, N. I. MILLER, AND H. ~/i. TUFO, "An overlapping Schwarz method for spectral element simulation of three-dimensional incompressible flows," in Parallel Solution of Partial Differential Equations, P. Bjrstad and M. Luskin, eds., SpringerVerlag, pp. 159-181 (2000). 8. W. D. GRoPP,"Parallel Computing and Domain Decomposition", in Fifth Conf. on Domain Decomposition Methods for Partial Differential Equations, T. F. Chan, D. E. Keyes, G. A. Meurant, J. S. Scroggs, and R. G. Voigt, eds., SIAM, Philadelphia, pp. 349-361 (1992). 9. Y. ~/IADAY AND A. T. PATERA, "Spectral element methods for the Navier-Stokes equations", in State of the Art Surveys in Computational Mechanics, A. K. Noor, ed., ASME, New York, pp. 71-143 (1989). 10. Y. 1VIADAY,A. T. PATERA, AND E. 1VI. RONQUIST, "An operator-integration-factor splitting method for time-dependent problems" application to incompressible fluid flow", J. Sci. Comput., 5(4), pp. 263-292 (1990). 11. A. T. PATERA, "A spectral element method for fluid dynamics: Laminar flow in a channel expansion", J. Comput. Phys., 54, pp. 468-488 (1984). 12. A. POTHEN, H. D. SIMON, AND K. P. LIOU, "Partitioning sparse matrices with eigenvectors of graphs", SIAM J. Matrix Anal. Appl., 11 (3) pp. 430-452 (1990). 13. H. M. TUFO AND P. F. FISCHER, "Fast parallel direct solvers for coarse-grid problems", J. Dist. Par. Comp. (to appear). 14. H. M. TUFO, "Algorithms for large-scale parallel simulation of unsteady incompressible flows in three-dimensional complex geometries", Ph.D. Thesis, Brown University (1998).
Parallel Computational Fluid Dynamics Towards Teraflops, Optimizationand Novel Formulations D. Keyes,A. Ecer, J. Periaux,N. Satofukaand P. Fox (Editors) 2000 Elsevier Science B.V. All rights r e s e r v e d .
27
O p e r a t o r S p l i t t i n g and D o m a i n D e c o m p o s i t i o n for Multiclusters * M. Garbey, D. Tromeur-Dervout CDCSP - University Lyon 1, Bat ISTIL, Bd Latarjet, 69622 Villeurbanne France
{garbey, dtromeur}@cdcsp,univ-lyonl, fr http://cdcsp.univ-lyonl.fr
We discuss the design of parallel algorithms for multiclusters. Multiclusters can be considerate as two-level architecture machines, since communication between clusters is usually much slower than communication or access to memory within each of the clusters. We introduce special algorithms that use two levels of parallelism and match the multicluster architecture. Efficient parallel algorithms that rely on fast communication have been extensively developed in the past: we intend to use them for parallel computation within the clusters. On top of these local parallel algorithms, new robust and parallel algorithms are needed that can work with a few clusters linked by a slow communication network. We present two families of two-level parallel algorithms designed for multicluster architecture: (1) new time-dependent schemes for operator splitting that are exemplified in the context of combustion, and (2) a new family of domain decomposition algorithms that can be applied, for example, to a pressure solve in Navier-Stoke's projection algorithm. Our implementation of these two-level parallel algorithms relies on a portable inter program communication library developped by Guy Edjlali et al [Parallel CFD 97]. 1. I N T R O D U C T I O N We discuss the design of two-level parallel algorithms for multicluster CFD computation. Section 2 gives the numerical scheme for adaptive operator splitting on multiclusters and discusses an application of this algorithm on two geographically separated parallel computers linked by a slow network. Section 3 presents a new family of domain decomposition algorithms designed for the robust and parallel computation of elliptic problems on multiclusters. Section 4 presents our conclusions. 2. A D A P T I V E
COUPLING
ALGORITHM
FOR MULTICLUSTERS
We consider a system of two coupled differential equations -
F(X, Y),
-
a(X, Y),
*This work was supported by the R4gion RhOne Alpes.
28 where the dot represents the time derivative. We consider second-order schemes of the form 3Xn+l _ 4X - + X n-1
=
F(Xn+l, y,,n+l)
(1)
G ( x *,n+l Y"+l).
(2)
2At 3yn+l _ 4 y n + y n - 1
=
2At Our goal is to compute (1) and (2) in parallel and, consequently, to use weak coupling in time marching; we therefore introduce a prediction of X "+1 (resp. y , + l ) in (2) (resp. (1)). Let p be an integer; we suppose that (1) is computed on machine I and (2) is computed on machine II. Let TI be the elapsed time needed to compute X "+1 when X n, X n-l, y,,n+~ are available in the memory of machine I. We make a similar hypothesis for machine II and assume further, for simplicity, that z - 7-1 - TII. We suppose that the speed of the network that links these two machines is such that the elapsed time needed to send X *'"+1 (resp. y , , , + l ) from machine I (resp. II) to machine II (resp. I) is bounded by pT. In an ideal world p should be at most 1, but realistically we anticipate p to be large. We use a second or third-order extrapolation formula to predict X *'"+~ or y,,n+l We denote such a scheme C(p, 1, j) with j - 2 or 3 as the order of extrapolation. A difficulty with this scheme from the hardware point of view is that machine I and machine II have to exchange two messages every time step. The network will consequently be very busy, and the buffering of the messages may affect the speed of communication. To relax this constraint, we restrict ourselves to communication of messages every q time steps. The same information X "-'+1 and X n-p then is used to predict X *'"+k for q consecutive time steps. The second-order extrapolation formula used on machine II is given by
X *'n+k = (p + k ) X n - p + l - (p + k -
1)X n-p, k = 1..q.
(3)
Accuracy constraints may lead us to use third-order extrapolation as follows:
X,,n+k
_--
(p + k)(P + k2 - 1 + 1 ) X " - p + I +
(p+ k-
1) 2 + ( p + k 2
-
((p + k - 1) 2 + 2 ( p + k -
1 ) ) X n-p
1)xn-p-i.
We denote such a scheme C(p, q, j) with j = 2 or 3 as the order of extrapolation. It is straightforward to show that the truncation error of the scheme is of order two. The explicit dependence on previous time steps supposed by the predictors X *'n+l and y , , , + l imposes a stability constraint on the time step. As shown in [8], this stability constraint is acceptable in the case of weak coupling of the two ODEs. Further, it is important to notice that the scheme should be adaptive in time, in particular, when the solution of the system of ODE goes through oscillations relaxations. Many techniques have been developed to control the error in time for ODE solvers; see [2] and its references. The adaptivity criterion limits the number of time steps q that the same information can be reused, although p + q should be such that the accuracy of the approximation as well as the stability of the time marching is satisfied. A more flexible and efficient way of using the network between the two machines is to use asynchronous communication [1], that is,
29 to let the delay p evolve in time marching in such a way that, as soon as the information arrives, it is used. This is a first step in adaptive control of communication processes. Let us consider now systems of PDEs. For example, we take
OU = AU + b VII, Ot OV =AV+cVU. Ot This system in Fourier space is given by Uk,m = =
( - k 2 - m2)Uk,m + b i (k + m) Vk,~n,
(4)
(-k
(5)
-
+ c i (k +
5 ,m,
where k (resp. m) is the wave number in x direction (resp. y direction). It is clear that these systems of ODEs are weakly coupled for large wave numbers k or m. One can show that the time delay in the C(p,p,j) scheme does not bring any stability constraint on the time step as long as the wave number m or k is large enough. This analysis leads us to introduce a second type of adaptivity in our time-marching scheme, based on the fact that the lower frequencies of the spectrum of the coupling terms need to be communicated less often than the high frequencies. A practical way of implementing this adaptivity in Fourier space is the following. Let ~ m = - M . . M Um be the Fourier expansion of U. We compute the evolution of U (resp. V) on machine I (resp. II), and we want to minimize the constraint on communication of U (resp. V) to machine II (resp. I). Let ~.,n+l be the prediction used in the C(p,p, j) scheme for the Fourier mode ,,n+l
/Jm, and let U be the prediction used in the C(2p, p, j) scheme. Let a be the filter of order 8 given in [9], Section 3, p. 654. We use the prediction u * ' n + l --
E m=-M..M
?gt
5n+l
+
~
m
: n+l
(1 - c r ( a [ ~ l ) ) U m
m=-M..M
with ~ > 2. The scheme using this prediction is denoted cr~C(p/2p, p,j). This way of splitting the signal guarantees second-order consistency in time and smoothness in space. For lower-order methods in space, one can simply cut off the high modes [3]. For high accuracy, however, we must keep the delayed high-frequency correction. This algorithm has been implemented successfully in our combustion model, which couples Navier-Stokes equations (NS) written in Boussinesq approximation to a reaction diffusion system of equations describing the chemical process (CP) [6]. In our implementation NS's code (I) and CP's code (II) use a domain decomposition method with Fourier discretisation that achieves a high scalable parallel efficiency mainly because it involves only local communication between neighboring subdomains for large Fourier wave number and/or small value of the time step [7]. The two codes exchange the temperature from (II) to (I), and the stream function from (I) to (II). The nonblocking communications that manage the interaction terms of the two physical models on each code are performed by the Portable Inter Codes Communication Library developed in [4]. Our validation of the numerical scheme has included comparison of the accuracy of the method with a classical second-order time-dependent
30 scheme and sensitivity of the computation with the C(p, q, j) scheme with respect to bifurcation parameters [8]. We report here only on the parallel efficiency of our scheme. For our experiment, we used two parallel computers about 500 km apart: (II) runs on a tru cluster 4100 from DEC with 400 MHz alpha chips located in Lyon, (I) runs on a similar parallel computer DEC Alpha 8100 with 440 MHz alpha chips located in Paris. Each parallel computer is a cluster of alpha servers linked by a FDDI local network at 100 Mb/s. Thanks to the SAFIR project, France Telecom has provided a full-duplex 10 Mb/s link between these two parallel computers through an ATM Fore interface at 155 Mb/s. The internal speed of the network in each parallel computer is about 80 times faster with the memory channel and 10 times faster when one uses the FDDI ring. ATM was used to guarantee the quality of service of the long-distance 10 Mb/s connection. To achieve good load balancing between the two codes, we used different data distributions for the chemical process code and the Navier-Stokes code. We fixed the number of processors for code (II) running in Lyon to be 2 or 3 and we used between 2 to 9 processors for Navier-Stokes code (I) in Paris. The data grid was tested at different sizes 2 x Nz x 2 x Nx, where N z (resp. N x ) represents the number of modes in the direction of propagation of the flame (resp. the other direction). The number of iterations was set to 200, and the computations were run several times, since the performance of the SAFIR network may vary depending on the load from other users. Table 1 summarizes for each grid size the best elapsed time obtained for the scheme o,,C(6/12, 6, 2) using the SAFIR network between the different data distribution configurations tested. We note the following: 9 A load balancing of at least 50% between the two codes (73.82% for Nx=180) has been achieved. The efficiency of the coupling with FDDI is between 78% and 94%, while the coupling with SAFIR goes from 64% to 80%. Nevertheless, we note that the efficiency of the coupling codes may deteriorate when the size of the problem increases. 3. D O M A I N D E C O M P O S I T I O N
FOR MULTICLUSTER
Let us consider a linear problem (6)
L[U] = f in ~t, Uioa = O.
We split the domain f~ into two subdomains f~ = ~-~1 U ~~2 and suppose that we can solve each subproblem on f~i, i = 1 (resp. i=2) with an efficient parallel solver on each cluster I (resp. II) using domain decomposition and/or a multilevel method and possibly different codes for each cluster. As before, we suppose that the network between cluster I and II is much slower than access to the memory inside each cluster. Our goal is to design a robust and efficient parallel algorithm that couples the computation on both subdomains. For the sake of simplicity, we restrict ourselves to two subdomains and start with the additive Schwarz algorithm: L[u? +11 - f in
al,
n+l n "ttliF1 - - lt21F1 , n+l
n
L[u~ +1] - f in f~2, a21r~ - ullr~-
(7) (8)
31 We recall that this additive Schwarz algorithm is very slow for the Laplace operator, and therefore is a poor parallel algorithm as a pressure solver, for example, in projection schemes for Navier-Stokes. Usually, coarse-grid operators are used to speed up the computation. Our new idea is to speed up the robust and easy-to-implement parallel additive Schwarz algorithm with a posteriori Aitken acceleration [11]. It will be seen later that our methodology applies to other iterative procedures and to more than two subdomains. We observe that the operator T, n i -- Uri --+ ailr ~ + 2/ - Uv~ uilr
(9)
is linear. Let us consider first the one-dimensional case f~ - (0, 1)- the sequence u~l~,~ is now a n+2 sequence of real numbers. Note that as long as the operator T is linear, the sequence uilr~ n+2 has pure linear convergence (or divergence); that is, it satisfies the identity o.ailr~ -Uir ~= (~(uinlr~ - Uir~), where 5 is the amplification factor of the sequence. The Aitken acceleration procedure therefore gives the exact limit of the sequence on the interface Fi based on three successive Schwarz iterates u~tr~, j - 1, 2, 3, and the initial condition u/~ namely,
ur~ = u~,r ~ _ u~lr ' _ uljr ' + u/Or '
(10)
An additional solve of each subproblem (7,8) with boundary conditions ur~ gives the solution of (6). The Aitken acceleration thus transforms the Schwarz additive procedure into an exact solver regardless of the speed of convergence of the original Schwarz method. It is interesting that the same idea applies to other well-known iterative procedures such as the Funaro-Quarteroni algorithm [5], regardless of its relaxation parameter, and that the Aitken acceleration procedure can solve the artificial interface problem whether the original iterative procedure converges or diverges, as long as the sequence of solutions at the interface behaves linearly! Next, let us consider the multidimensional case with the discretized version of the problem (6): (11)
Lh[U] = f in ~, Uioa = O.
Let us use E h to denote some finite vector space of the space of solutions restricted to the artificial interface Fi. Let g , j = 1..N be a set of basis functions for this vector space and P the corresponding matrix of the linear operator T. We denote by u e.~,~,j - 1,.., N the components of u~]r~, and we have then ~+2 -- U j l F i ) j = I , . . , N ('{ti,j
n -_- P(?.l,i,j
VjlFi)j=l
(12)
.... N .
We introduce a generalized Aitken acceleration with the following formula: P-
/
2(j+l)
(uk,i
-
.2j~-I
~k,i]i=l,..,N,j=O,..,g-l(Uk,i
/
2(j+l)
2j -- ?-tk,i)i=l .... g , j = l , . . , g ,
(13)
and finally we get u ~ k,i, i - 1 ~ "'~ N the solution of the linear system
(Id -
c~
P)(Uk,i)i=
1 .... U =
~' 2 N + 2
(~k,i
)i=I,..,N
-- P
2N
(Uk,i)i=1
.... N -
(14)
32 I d denotes the matrix of the identity operator. We observe that the generalized Aitken procedure works a priori independently of the spectral radius of P, that is, the convergence of the interface iterative procedure is not needed. In conclusion, 2N + 2 Schwarz iterates produce a priori enough data to compute via this generalized Aitken accelera/ 2(j+1) tion the interface value Uir ~, k - 1, .., 2. However, we observe that the matrix (uk,~ 2j )~=1....N,j=0....N-~ is ill-conditioned and that the computed value of P can be very sensiltk,i tive to the data. This is not to say that the generalized Aitken acceleration is necessarily a bad numerical procedure. But the numerical stability of the method and the numerical approximation of the operator P should be carefully investigated depending on the discretization of the operator. We are currently working on this analysis for several types of discretization (finite differences, spectral and finite elements). Here, we show that this algorithm gives very interesting results with second-order finite differences. Let us consider first the Poisson problem u ~ + uyy = f in the square (0, 1) 2 with Dirichlet boundary conditions. We partition the domain into two overlapping strips (0, a) • (0, 1)U(b, 1) • (0, 1) with b > a. We introduce the regular discretization in the y direction yi = ( i - 1)h, h - N ~ , and central second-order finite differences in the y direction. Let us denote by 5i (resp. j~) the coefficient of the sine expansion of u (resp. f). The semi-discretized equation for each sinus wave is then
4/h
h
(15)
and therefore the matrix P for the set of basis functions bi - sin(i~) is diagonal. The algorithm for this specific case is the following. First, we compute three iterates with the additive Schwarz algorithm. Second, we compute the sinus wave expansion of the trace of the iterates on the interface Fi with fast transforms. Third, we compute the limit of the wave coefficients sequence via Aitken acceleration as in the one-dimensional case. We then derive the new numerical value of the solution at the interface Fi in physical space. A last solve in each subdomain with the new computed boundary solution gives the final solution. We have implemented with Matlab this algorithm for the Poisson problem discretized with finite differences, a five-point scheme, and a random rhs f. Figure 1 compares the convergence of the new method with the basic Schwarz additive procedure. Each subproblem is solved with an iterative solver until the residual is of order 10 -1~ The overlap between subdomains in the x direction is just one mesh point, a = b + h . The result in our experiment is robust with respect to the size of the discretized problem. Note however that this elementary methods fails if the grid has a nonconstant space step in the y direction or if the operator has coefficients depending on the y variable, that is L = (el(x, y)ux)x + (a2(x, y)uy)y, because P is no longer diagonal. For such cases P becomes a dense matrix, and we need formally 2N + 2 Schwarz iterates to build up the limit. Figure 2 gives a numerical illustration of the method when one of the coefficients a2 of the second-order finite difference operator is stiff in the y direction. We checked numerically that even if P is very sensitive to the data, the limit of the interface is correctly computed. In fact we need only to accelerate accurately the lower modes of the solution, since the highest modes are quickly damped by the iterative Schwarz procedure itself. Further, we can use an approximation of P that neglects the coupling between
33 the sinus waves, and apply iteratively the previous algorithm. This method gives good results when the coefficients of the operator are smooth, that is the sine expansions in the y variable of the coefficients of the operator converge quickly. In our last example, we consider a nonlinear problem that is a simplified model of a semiconductor device [10]. In one space dimension the model writes
Au f
=
-
e -u
+ f, in(O, d),
x 1 tanh(20(-~-~)),
-
u(0)
e~
-
(16) (17)
x e (0, d),
a s inh t--~-) "f(O) + Uo, u ( d ) -
asinh(
)
(18)
The problem is discretized by means of second-order central finite differences. We apply consecutively several times the straightforward Aitken acceleration procedure corresponding to the Laplace operator case. Figure 3 shows the numerical results for 80 grid points and one mesh point overlap. Notice that the closer the iterate gets to the final solution, the better the result of the Aitken acceleration. This can be easily explained in the following way: the closer the iterate gets to the solution, the better the linear approximation of the operator. Similar results have been obtained in the multidimensional case for this specific problem. So far we have restricted ourselves to domain decomposition with two subdomains. The generalized Aitken acceleration technique however can be applied to an arbitrary number of subdomains with strip domain decomposition. 2
,
,
,
Convergence ,
,
0 --1 -
2 --3--4--5--6--7~
-81
1'.5
3
3.5
4
s'.s
6
Figure 1. Solid line (resp. -o- line) gives the loglo (error in maximum norm) on the discrete solution with Schwarz additive procedure (resp. new method)
34
convergence
U
0 ....... i .......
:
.... i ......
-2 -4 -6 -80
-
io
2o
3o
4o
30
5o
a2
al 1 ....... i
i
.
3 ....... i
' ........ .
i
. 9
2 ....... ! ........ :.....
0 ........ i. . . . . . .
20
0"~~
.....
~ 10
1 ....... i ...... i
".....
i ........ : ...... '
i ......
30
20
Figure 2. Application of the m e t h o d to the nonconstant coefficient case convergence
i
i 0
-20
~4-
0
-6' -8
-1( ) N=80
0.5
-0...= . . . . . . . . . . . .
..,=,=,.1 9 ...........
I. . . . . . . . . . . .
I . . . . . . . . . .
1
2
~., .....
I
3
Figure 3. application of the m e t h o d to a non linear problem Let us consider, for the sake of simplicity, the one-dimensional case with q > 2 subdomains. P is then a p e n t a d i a g o n a l m a t r i x of size 2 ( q - 1). Three Schwarz iterates provide enough
35 information to construct P and compute the limit of the interfaces. However, we have some global coupling between the interface's limits solution of the linear system associated to matrix I d - P. Since P is needed only as an acceleration process, local approximation of P can be used at the expense of more Schwarz iterations. The hardware configuration of the multicluster and its network should dictate the best approximation of P to be used. 4. C O N C L U S I O N We have developed two sets of new ideas to design two level parallel algorithms appropriate for multicluster architecture. Our ongoing work generalizes this work to an arbitrary number of clusters. REFERENCES
1. D . P . Bertsekas and J. N. Tsitsiklis, Parallel and distributed computation, numerical methods, Prentice Hall, Englewood Cliffs, New Jersey, 1989. 2. M. Crouzeix and A.L. Mignot, Analyse Numdrique des dquations diffdrentielles, 2nd ed. Masson, 1992. 3. A. Ecer, N. Gopalaswamy, H. U. Akay and Y. P. Chien, Digital filtering techniques for parallel computation of explicit schemes, AIAA 98-0616, Reno, Jan. 12-15, 1998. 4. G. Edjlali, M. Garbey, and D. Tromeur-Dervout, Interoperability parallel programs approach to simulate 3D frontal polymerization process, J. of Parallel Computing, 25 pp. 1161-1191, 1999. 5. D. Funaro, A. Quarteroni and P. Zanolli, An iterative procedure with interface relaxation for domain Decomposition Methods, SIAM J. Numer. Anal. 25(6) pp. 1213-1236, 1988. 6. M. Garbey and D. Tromeur-Dervout, Massively parallel computation of stiff propagating combustion front, IOP J. Comb. Theory Modelling 3 (1): pp. 271-294, 1997. 7. M. Garbey and D. Tromeur-Dervout, Domain decomposition with local fourier bases applied to frontal polymerization problems, Proc. Int. Conf. DD11, Ch.-L. Lai & al Editors, pp. 242-250, 1998 8. M. Garbey and D. Tromeur-Dervout, A Parallel Adaptive coupling Algorithm for Systems of Differential Equations, preprint CDCSP99-01, 1999. 9. D. Gottlieb and C. W. Shub, On the Gibbs phenomenon and its resolution, SIAM Review, 39 (4), pp. 644-668, 1998. 10. S. Selberherr, Analysis and simulation of semiconductor devices, Springer Verlag, Wien, New York, 1984. 11. J. Stoer and R. Burlish, Introduction to numerical analysis, TAM 12 Springer, 1980.
36
Nz=64, Nx=120 Max(s)[Min(s)]Average(s) 221.04 2 1 2 . 8 9 216.19 PNS=4 210.82 2 0 0 . 1 3 204.56 Pcp=2 208.24 1 9 5 . 6 8 202.05 Pys=4 192.87 180.23 187.68 Pcp-2 208.45 2 0 5 . 1 3 206.98 PNS=4 109.43 108.41 108.90 Pcp-2
200 Iterations SAFIR Coupled FFDI Coupled Non Coupled
Processors
200 Iterations SAFIR Coupled FFDI Coupled Non Coupled
Processors
200 Iterations SAFIR Coupled FFDI Coupled Non Coupled
Processors
200 Iterations SAFIR Coupled FFDI Coupled Non Coupled
Processors
PNs=6 Pcp=3 PNS=6 Pcp-3 PNs=6 Pcp=3
Nz=64, Nx=180 Max(s) Min(s) Average(s) 2 7 2 . 2 0 2 5 3 . 0 9 260.50 259.25 2 4 0 . 0 2 247.86 2 0 9 . 3 3 1 9 9 . 4 6 204.76 194.66 1 8 5 . 2 8 189.69 2 1 4 . 3 8 2 0 5 . 9 7 210.82 158.92 1 4 9 . 2 4 155.64
Nz=128, Nx=120 Max(s) Min(s) Average(s) 247.79 Pys=4 2 5 3 . 9 7 241.22 235.38 Pcp-2 2 4 1 . 5 0 228.89 205.08 PNS=4 2 0 8 . 0 8 199.38 190.78 Pcp-2 1 9 7 . 1 7 184.73 208.26 PNS=4 208.95 206.89 104.84 Pcp=2 105.32 104.49 Nz=128, Nx=180 Max(s) Min(s) Average(s) PNS=6 3 0 6 . 7 6 2 8 5 . 0 9 298.23 Pcp=3 2 9 4 . 2 6 271.71 284.95 PNS=6 2 3 8 . 6 0 2 1 5 . 0 0 224.62 Pcp=3 2 2 2 . 7 3 199.75 209.35 PNS=6 2 4 1 . 9 4 2 2 8 . 5 6 236.57 Pcp-3 107.97 1 0 6 . 5 8 107.26
Efficiency
80.60% 91.24% lOO.O %
Efficiency 73.87% 93.98% 100.0%
Efficiency 71.33% 86.76%
lOO.O%
Efficiency 65.85% 88.13%
100% Table 1 Elapsed time for 20 runs of the coupling codes with different hardware configuration
Parallel ComputationalFluid Dynamics Towards Teraflops, Optimizationand Novel Formulations D. Keyes,A. Ecer, J. Periaux,N. Satofukaand P. Fox (Editors) 92000 Elsevier Science B.V. All rights reserved.
37
D e v e l o p m e n t of t h e " E a r t h S i m u l a t o r " Shinichi Kawai a, Mitsuo Yokokawa b, Hiroyuki Ito a, Satoru Shingu b, Keiji Tani b and Kazuo Yoshida ~ ~Earth Simulator Research and Development Center, National Space Development Agency of Japan, Sumitomo Hamamatsu-cho bldg. 10, 1-18-16, Hamamatsu-cho, Minato-ku, Tokyo, 105-0013, Japan bEarth Simulator Research and Development Center, Japan Atomic Energy Research Institute, Sumitomo Hamamatsu-cho bldg. 10, 1-18-16, Hamamatsu-cho, Minato-ku, Tokyo, 105-0013, Japan "Earth Simulator" is a high speed vector processor based parallel computer system for computational earth science. The goal of the "Earth Simulator" is to achieve at least 5 Tflop/s sustained performance, which should be about 1,000 times faster than the commonly used supercomputers, in atmospheric general circulation model (AGCM) program with the resolution of 5-10 km grid on the equator. This computer system consists of 640 processor nodes connected by a fast single-stage crossbar network. Each processor node has 8 arithmetic processors sharing 16 Gbytes main memory. Total main memory capacity and peak performance are 10 TBytes and 40 Tflop/s, respectively. Application software on the "Earth Simulator" and software simulator to evaluate the performance of the program on the "Earth simulator" are also described. 1. I N T R O D U C T I O N "Earth Simulator" is a high speed parallel computer system for computational earth science, which is a part of earth science field for understanding the Earth and its surroundings such as atmosphere, ocean and solid earth through computer simulation. Computational earth science is useful for weather forecast, prediction of global change such as global warming and E1 Nifio event, and some earthquake related phenomena such as mechanisms of earthquake or disaster prediction. The Science and Technology Agency of Japan promotes global change prediction research through process study, earth observation, and computer simulation. Development of the "Earth Simulator" is a core project to achieve this objective. "Earth Simulator" is being developed by Earth Simulator Research and Development Center, or ESRDC, which is a joint team of NASDA, National Space Development Agency of Japan, JAERI, Japan Atomic Energy Research Institute and JAMSTEC, Japan Marine
38
Science and Technology Center. The goal of the "Earth Simulator" is to achieve at least 5 Tflop/s sustained performance in atmospheric general circulation model (AGCM) program with the resolution of 5-10 km grid on the equator, in other words, about 4000 grid points for the longitudinal or eastwest direction, 2000 grid points for the latitudinal or north-south direction, and i00 grid points for the vertical or altitudinal direction. The total number of grid points is about 800 million for a i0 km grid on the equator. For the most commonly used supercomputers in computational earth science, sustained performance is 4-6 Gflop/s with a resolution of 50-100 km grid and 20-30 layers. Sustained performance of the "Earth Simulator" should be about 1,000 times faster than this. There are two types of parallel computer systems, vector processor based parallel and micro processor based massively parallel system. The execution performance of one of the major AGCM codes with various computer systems shows that efficiency of the vector processor based system is about 30~ Efficiency of the micro processor based systems is much less than 10% (Hack, et al. [2]). If we assume that the efficiency of the "Earth Simulator" is 10-15%, 30-50 Tflop/s peak performance is needed to achieve 5 Tflop/s sustained performance for an AGCM program. We think this is possible by using a vector processor based parallel system with distributed memory and a fast communication network. If we tried to achieve 5 Tflop/s sustained performance by a massively parallel system, more than 100 Tflop/s peak performance would be needed. We think this is unlikely by early 2002. So we adopt the vector processor based parallel system. The total amount of main memory is 10 TBytes from the requirement of computational earth science. 2. BASIC DESIGN OF THE "EARTH SIMULATOR"
i
I
IIIIIIIIIIIIIIIIIIIII.IIIIIIII~IlIII//LI/ilI//IiiiiZ/I .........IIIIIII:IiiiLIL.I. :i.~[;//iiii LII//.II:II................................................................IIILIIIIII/://///.I/.I/.I;I/.I/I]I:IS:///I:///I/.I//// ....
ii.iii iiiiiiiii iiii!".! iiiiiiii iii,!iii
ii:.iii ii. ii iii iiii
ii~
~iL~ ...............................................i!~
Processor Node #1
Processor Node #639
~ii Processor Node #0
iii~
Figure i. Configuration of the "Earth Simulator"
The basic design of the "Earth Simulator" is as follows* (Yokokawa, et al. [6], [7]). *This basic design might be changed as development proceeds.
39 The "Earth Simulator" is a distributed memory parallel system that consists of 640 processor nodes connected by a fast single-stage crossbar network. Each processor node has 8 arithmetic processors sharing 16 Gbytes main memory (Figure 1). The total number of arithmetic processors is 5,120 and total main memory capacity is 10 TBytes. As the peak performance of one vector processor is 8 Gflop/s, the peak performance of one processor node is 64 Gflop/s and total peak performance is 40 Tflop/s. Power consumption will be very high, so air cooling technology and power saving technology such as CMOS technology will be used.
AP RCU lOP MMU
Arithmetic 9 Processor Remote 9 A c c e s s Control Unit 1/O 9 Processor Main 9 M e m o r y Unit
From/To Crossbar Network
i
LAN=
/l i
User/Work Disks
Shared Main M e m o r y (MM) 16GB
Figure 2. Processor Node (PN)
Each processor node consists of 8 arithmetic processors, one remote access control unit (RCU), one I/O processor (lOP) and 32 sets of main memory unit (MMU). Each processor or unit is connected to all 32 MMUs by cables. RCU is also connected to crossbar network to transfer data from/to the other processor node. I/O processor (IOP) is also connected to LAN and/or disks. (Figure 2) Each arithmetic processor consists of one scalar unit, 8 units of vector pipelines, and a main memory access control unit. These units are packaged on one chip. The size of the chip is about 2 cm x 2 cm. There are more than 40 million transistors in one chip. The clock frequency of the arithmetic processor is 500 MHz. The vector unit consists of 8 units of vector pipelines, vector registers and some mask registers. Each unit of vector pipelines has 6 types of functional units, add/shift, multiply, divide, logical, mask, and load/store. 16 floating point operations are possible in one clock cycle time (2 ns), so
40
/ ili!|
:~i~......~
H
......iiii
,
~ii~i~i~!~ii~i~i~:~:~:~:i:~:i:~:i~:!~:.iiiiiii ~i:~~m~ :i~J!i~ii~i~ii~ii~
iii i?iii iii i ili i iiiii i?'i i i i 'i ili......... 'i i i1iiiil
' ,~:~:~::' :::iiiii~~iiiii{ii~iiii iiiiiiiii~]]
7 Figure 3. Arithmetic Processor (AP)
the peak performance of one arithmetic processor is 8 Gflop/s. The scalar unit contains an instruction cache, a data cache, scalar registers and scalar pipelines (Figure 3). For the main memory we adopt a newly developed DRAM-based 128-Mbit fast RAM which has 2,048 banks. The access time of SSRAM is very fast, but it is very expensive; the access time of the fast RAM lies between those of SSRAM and SDRAM, and its cost is much less than that of SSRAM. The memory bandwidth is 32 GBytes per second for each arithmetic processor and 256 GBytes per second for one processor node. The ratio between operation capability and memory throughput is 2:1. The interconnection network unit consists of a control unit called XCT and a 640x640 crossbar switch called XSW. Each XCT and XSW is connected to all 640 processor nodes through cables; more than 80,000 cables are used to connect the interconnection network unit to the 640 processor nodes. The XCT is connected to the RCU in each processor node. The data transfer rate is 16 GBytes per second for both the input and output paths (Figure 4). For any two processor nodes, the distance for data transfer from one node to another is always the same, and any two different data paths do not interfere with each other, so we do not have to care in which node data are placed. For each processor node, the input port and output port are independent, so it is possible to input and output data at the same time (Figure 5). We are planning to treat 16 processor nodes as one cluster to operate the "Earth Simulator". There are thus 40 clusters, and each cluster has a Cluster Control Station (CCS). We assign one cluster as a front-end for processing interactive jobs and small-scale batch jobs; the other 39 clusters are treated as back-end clusters for processing large-scale batch jobs (Figure 6). System software is under development. Basically, we will use a commonly used operating system and languages with some extensions for the "Earth Simulator". For parallelization
Figure 4. Interconnection Network (IN)
Figure 5. Conceptual image of 640x640 crossbar switch
environment, MPI, HPF and OpenMP will be supported.

3. APPLICATION SOFTWARE ON THE "EARTH SIMULATOR"
In the "Earth Simulator" project, we will develop some parallel application software for simulating the global change of the atmosphere, ocean and solid earth. For atmosphere and ocean, we are developing a parallel program system, called N JR [5]. For solid earth, we are developing a program called Geo FEM [3], which is a parallel finite element analysis system for multi-physics/multi-scale problems. We will only explain the N JR program system. N JR is the large-scale parallel climate model developed by NASDA, National Space Development Agency of Japan, JAMSTEC, Japan Marine Science and Technology Center,
Figure 6. Cluster system for operation
and RIST, the Research Organization for Information Science and Technology; NJR is named after the first letters of these three organizations. The NJR system includes an Atmospheric General Circulation Model (AGCM), an Ocean General Circulation Model (OGCM), an AGCM-OGCM coupler and pre/post processors. Two kinds of numerical methods are applied for the AGCM: a spectral method, called NJR-SAGCM, and a grid point method, called NJR-GAGCM. NJR-SAGCM is developed with reference to the CCSR/NIES AGCM (Numaguti, et al. [4]) developed by CCSR, the Center for Climate System Research of the University of Tokyo, and NIES, the National Institute for Environmental Studies. There are also two kinds of numerical methods in the OGCM, a grid point method and a finite element method. Since the AGCM and OGCM are developed separately, a connection program between them, the AGCM-OGCM coupler, is needed when simulating the influence of ocean currents on the climate. A pre-processing system is used to input data such as topographic data and climate data like temperature, wind velocity and ground surface pressure at each grid point. A post-processing system is used to analyze and display the output data. We deal with NJR-SAGCM in detail. Atmospheric general circulation models are used for applications ranging from short-term weather forecasting to long-term climate change prediction such as global warming. In general, an atmospheric general circulation model is separated into a dynamics part and a physics part. In the dynamics part, we numerically solve the primitive equations that describe the atmospheric circulation, for example the equations of motion, thermodynamics, vorticity and divergence. Hydrostatic approximation is assumed for the present resolution of the global model. In the physics part, we parameterize physical phenomena smaller than the grid size, for example cumulus convection and radiation. In NJR-SAGCM, the computational method is as follows. In the dynamics part, a spectral method with spherical harmonic functions is applied in the horizontal direction and a finite difference method is applied in the vertical direction. In the physics part, since the scale of the physics parameters is much smaller than the grid size, we compute the physics parameters at each vertical one-dimensional column independently. A semi-implicit method
and a leap-frog method are applied for the time integration. A domain decomposition method is applied for the parallelization of the data used in NJR-SAGCM.
[Figure content: grid space (4000 x 2000 x 100 grid points: longitude x latitude x altitude) -> 1D FFT -> Fourier space (about 1333 retained wavenumbers) -> data transpose -> Legendre transform -> spectrum space.]
Figure 7. Parallelization technique in spectrum transform
For the dynamics part, the spectral method is used in the horizontal direction. Figure 7 shows the parallelization technique for the spectral transform. First, the grid space is decomposed equally along the latitude direction into as many slabs as there are processor nodes, and the data are distributed to the processor nodes. Then the grid space is transformed to Fourier space by one-dimensional (1D) FFTs whose size is the number of grid points in the longitude direction. Since these 1D FFTs are independent of the latitude and altitude directions, all of them are calculated independently and no data transpose is needed in this step. In Fourier space, only about 1/3 of the wavenumbers are needed and the high-frequency components are discarded; this avoids aliasing when calculating second-order quantities such as the convolution of two functions. Before the Legendre transform, a domain decomposition of Fourier space along the longitude direction is needed, so a data transpose occurs over all processor nodes. Since triangular truncation is used in the one-dimensional Legendre transform in the latitude direction, wavenumber components are mixed across the processor nodes when transposing the data in order to avoid load imbalance among the nodes. The inverse transforms are analogous to the forward transforms. In NJR-SAGCM, four data transposes occur per time step while the forward and inverse transforms are executed. For the physics part, the parallel technique is simpler. First, as in the dynamics part, decompose the domain into the number of processor nodes and distribute the data to
each processor node. Since each one-dimensional data set in the vertical direction is mutually independent, these sets can be processed in parallel, and no data transpose occurs among the processor nodes. Within each processor node, the data are divided into eight parts for parallel processing by microtasking on the eight arithmetic processors, and within each arithmetic processor vectorization is used. NJR-SAGCM is being optimized for the "Earth Simulator", which has three levels of parallelism: vector processing within one arithmetic processor, shared-memory parallel processing within one processor node, and distributed-memory parallel processing among processor nodes using the interconnection network, toward the achievement of 5 Tflop/s sustained performance on the "Earth Simulator".
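The slab decomposition, wavenumber truncation and transpose pattern described above can be illustrated with a small NumPy sketch (array sizes are scaled down; the transpose between processor nodes is emulated here by reshaping a single array rather than by real communication):

```python
import numpy as np

# Scaled-down illustration of the NJR-SAGCM spectral-transform data layout.
nlon, nlat, nlev, nodes = 120, 64, 8, 4       # longitude, latitude, altitude, processor nodes
field = np.random.rand(nlat, nlon, nlev)      # grid-space data

# 1) Decompose along latitude: each "node" holds a slab of nlat//nodes latitude lines.
slabs = np.split(field, nodes, axis=0)

# 2) On each node, 1D FFTs along longitude; these are independent of latitude and
#    altitude, so no communication is needed for this step.
fourier_slabs = [np.fft.rfft(s, axis=1) for s in slabs]

# 3) Keep only about 1/3 of the wavenumbers (truncation to avoid aliasing).
ntrunc = nlon // 3
truncated = [s[:, :ntrunc, :] for s in fourier_slabs]

# 4) Transpose so that each node holds all latitudes for a subset of wavenumbers,
#    as required before the Legendre transform.
full = np.concatenate(truncated, axis=0)      # shape (nlat, ntrunc, nlev)
wavenumber_slabs = np.split(full, nodes, axis=1)
print(wavenumber_slabs[0].shape)              # (nlat, ntrunc // nodes, nlev)
```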
4. PERFORMANCE ESTIMATION ON THE "EARTH SIMULATOR"
We have developed a software simulator, which we call GSSS, to estimate the sustained performance of programs on the "Earth Simulator" (Yokokawa, et al. [6]). This simulator traces the behavior of the principal parts of the "Earth Simulator" and of computers with similar architectures such as the NEC SX-4. We can estimate the accuracy of GSSS by executing a program both on an existing computer system and on GSSS and comparing the results. Once the accuracy of GSSS is confirmed, we can estimate the sustained performance of programs on the "Earth Simulator" by setting the hardware parameters of GSSS to those of the "Earth Simulator". The GSSS system consists of three parts: GSSS_AP, the timing simulator of the arithmetic processor; GSSS_MS, the timing simulator of memory access from the arithmetic processors; and GSSS_IN, the timing simulator of asynchronous data transfer via the crossbar network (Figure 8). We execute a target program on a reference machine, in our case an NEC SX-4, obtain the instruction trace data, and feed this instruction trace file into the GSSS system. We also supply a data file of hardware parameters such as the number of vector pipelines, latencies and so on. These hardware parameters must be taken from a machine of the same architecture as the reference one: if we want to trace the behavior of the SX-4, we supply the hardware parameters of the SX-4; if we want to trace the behavior of the "Earth Simulator", we supply the hardware parameters of the "Earth Simulator". The output of the GSSS system is the estimated performance of the target program on the SX-4 or the "Earth Simulator", depending on the hardware parameter file. The sustained performance of a program is usually measured as the flop count of the program divided by the total processing time; in the case of performance estimation by GSSS, the total processing time is estimated by GSSS. We have prepared three kinds of benchmark programs. The first is a set of kernel loops from the present major AGCM and OGCM codes, used to check the performance of the arithmetic processor. These kernel loops are divided into three groups: Group A contains simple loops, Group B contains loops with IF branches and intrinsic functions, and Group C contains loops with indirect array accesses. The second kind of benchmark consists of the FT and BT programs of the NAS parallel benchmarks (Bailey, et al. [1]), which involve transposes of arrays when executed in parallel, used to check the performance of the crossbar network. The third is NJR-SAGCM itself, used to check the performance of the application
[Figure components: target program and reference machine -> instruction trace data file; data file for hardware parameters (e.g. the number of vector pipelines, latency); GSSS system -> estimated performance of the target program.]
Figure 8. GSSS system
software. The average absolute relative error of the processing time estimated by GSSS with respect to that measured on the NEC SX-4 is about 1% for the Group A kernel loops, so the estimation accuracy of GSSS is quite good. The estimated sustained performance for Group A on one arithmetic processor of the "Earth Simulator" is, on average, almost half of the peak performance, even though Group A includes various types of kernel loops.
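The two quantities used in this evaluation, sustained performance and average absolute relative error, are simple ratios; the sketch below shows how they would be computed (the timing numbers are made-up placeholders, not measured values):

```python
# Sustained performance and GSSS accuracy metrics (illustrative, with made-up numbers).
def sustained_gflops(flop_count, seconds):
    """Sustained performance = flop count of the program / total processing time."""
    return flop_count / seconds / 1e9

def avg_abs_relative_error(estimated, measured):
    """Average absolute relative error of estimated vs. measured processing times."""
    return sum(abs(e - m) / m for e, m in zip(estimated, measured)) / len(measured)

measured_sx4  = [0.52, 1.10, 0.75]     # seconds, hypothetical kernel-loop timings on the SX-4
estimated_gss = [0.53, 1.09, 0.76]     # seconds, hypothetical GSSS estimates of the same loops
print(f"error = {avg_abs_relative_error(estimated_gss, measured_sx4):.1%}")
print(f"sustained = {sustained_gflops(4.0e9, 1.0):.1f} Gflop/s (4 Gflop executed in 1 s)")
```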
5. FUTURE WORK

The design of the "Earth Simulator" is evaluated with the GSSS software simulator described in the fourth section, using several benchmark programs, and the evaluation results are fed back into the design in order to achieve 5 Tflop/s sustained performance when executing the NJR-SAGCM program described in the third section. Figure 9 shows the development schedule of the "Earth Simulator". The conceptual design, the basic design described in the second section, and the research and development for manufacturing are finished. We will carry out the detailed design this year and begin manufacturing and installation next year. The building for the "Earth Simulator" will be located in Yokohama, and the computer system will begin operation in early 2002.
[Figure content: development timeline 1997-2002; hardware system: conceptual design, basic design, R&D for manufacturing, detailed design, manufacture and installation; software: design, development and test, operation supporting software; installation of peripheral devices; completion and start of operation; "We are here" marker.]
Figure 9. Development schedule
REFERENCES
1. D. Bailey, et al., "The NAS Parallel Benchmarks", RNR Technical Report RNR-94-007, March (1994).
2. J. J. Hack, J. M. Rosinski, D. L. Williamson, B. A. Boville and J. E. Truesdale, "Computational design of the NCAR community climate model", Parallel Computing, Vol. 21, No. 10, pp. 1545-1569 (1995).
3. M. Iizuka, H. Nakamura, K. Garatani, K. Nakajima, H. Okuda and G. Yagawa, "GeoFEM: High-Performance Parallel FEM for Geophysical Applications", in C. Polychronopoulos, K. Joe, A. Fukuda and S. Tomita (eds.), High Performance Computing: Second International Symposium, ISHPC'99, Kyoto, Japan, May 26-28, 1999, pp. 292-303, Springer (1999).
4. A. Numaguti, M. Takahashi, T. Nakajima and A. Sumi, "Description of CCSR/NIES Atmospheric General Circulation Model", CGER's Supercomputer Monograph Report, Center for Global Environmental Research, National Institute for Environmental Studies, No. 3, pp. 1-48 (1997).
5. Y. Tanaka, N. Goto, M. Kakei, T. Inoue, Y. Yamagishi, M. Kanazawa and H. Nakamura, "Parallel Computational Design of NJR Global Climate Models", in C. Polychronopoulos, K. Joe, A. Fukuda and S. Tomita (eds.), High Performance Computing: Second International Symposium, ISHPC'99, Kyoto, Japan, May 26-28, 1999, pp. 281-291, Springer (1999).
6. M. Yokokawa, S. Shingu, S. Kawai, K. Tani and H. Miyoshi, "Performance Estimation of the Earth Simulator", Proceedings of the ECMWF Workshop, November (1998).
7. M. Yokokawa, S. Habata, S. Kawai, H. Ito, K. Tani and H. Miyoshi, "Basic Design of the Earth Simulator", in C. Polychronopoulos, K. Joe, A. Fukuda and S. Tomita (eds.), High Performance Computing: Second International Symposium, ISHPC'99, Kyoto, Japan, May 26-28, 1999, pp. 269-280, Springer (1999).
Virtual Manufacturing and Design in the Real World - Implementation and Scalability on HPPC Systems
K. McManus, M. Cross, C. Walshaw, S. Johnson, C. Bailey, K. Pericleous, A. Slone and P. Chow†
Centre for Numerical Modelling and Process Analysis, University of Greenwich, London, UK. †FECIT, Uxbridge, UK
Virtual manufacturing and design assessment increasingly involve the simulation of interacting phenomena, i.e. multi-physics, an activity which is very computationally intensive. This paper describes one attempt to address the parallel issues associated with a multi-physics simulation approach based upon a range of compatible procedures operating on one mesh using a single database - the distinct physics solvers can operate separately or coupled on sub-domains of the whole geometric space. Moreover, the finite volume unstructured mesh solvers use different discretisation schemes (and, in particular, different 'nodal' locations and control volumes). A two-level approach to the parallelisation of this simulation software is described: the code is restructured into parallel form on the basis of the mesh partitioning alone, i.e. without regard to the physics. At run time, however, the mesh is partitioned to achieve a load balance by considering the load per node/element across the whole domain; this load is of course determined by the problem-specific physics at each location.
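As an illustration of the idea of weighting the partition by per-element load, the following sketch assigns each element a cost according to which physics procedures are active on it and reports the imbalance of a given partition (the weights and function names are purely hypothetical, not the actual solver's cost model):

```python
# Hypothetical per-element load weights for a physics-aware partition check.
PHYSICS_COST = {"flow": 5.0, "heat": 1.0, "stress": 3.0}   # assumed relative costs

def element_weight(active_physics):
    """Cost of one element = sum of the costs of the physics active on it."""
    return sum(PHYSICS_COST[p] for p in active_physics)

def imbalance(partition):
    """Ratio of the most loaded sub-domain to the average load (1.0 = perfect balance)."""
    loads = [sum(element_weight(e) for e in part) for part in partition]
    return max(loads) / (sum(loads) / len(loads))

# Two sub-domains: one mostly flow+heat elements, one mostly stress-only elements.
part_a = [("flow", "heat")] * 100
part_b = [("stress",)] * 100
print(f"imbalance = {imbalance([part_a, part_b]):.2f}")
```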
1. INTRODUCTION

As industry moves inexorably towards a simulation-based approach to manufacturing and design assessment, the tools required must be able to represent all the active phenomena together with their interactions (increasingly referred to as multi-physics). Conventionally, most commercial tools focus upon one main phenomenon (typically 'fluids' or 'structures') with others supported in a secondary fashion - if at all. However, the demand for multi-physics has brought an emerging response from the CAE sector - the 'structures' tools ANSYS (1) and ADINA (2) have both recently introduced flow modules into their environments. However, these 'flow' modules are not readily compatible with their 'structures' modules with regard to the numerical technology employed. Thus, although such enhancements facilitate simulations involving loosely coupled interactions between fluids and structures, closely coupled situations remain a challenge. A few tools are now emerging into the
community that have been specifically configured for closely coupled multi-physics simulation, see for example SPECTRUM (3), PHYSICA (4) and TELLURIDE (5). These tools have been designed to address multi-physics problems from the outset, rather than as a subsequent bolt-on. Obviously, multi-physics simulation involving 'complex physics', such as CFD, structural response, thermal effects, electromagnetics and acoustics (not necessarily all simultaneously), is extremely computationally intensive and is a natural candidate to exploit high performance parallel computing systems. This paper highlights the issues that need to be addressed when parallelising multi-physics codes and provides an overview of one approach to the problem.
2. THE CHALLENGES

Three examples serve to illustrate the challenges in multi-physics parallelisation.

2.1. Dynamic fluid-structure interaction (DFSI)
DFSI finds its key application in aeroelasticity and involves the strong coupling between a dynamically deforming structure (e.g. the wing) and the fluid flow past it. Separately, these problems are no mean computational challenge - coupled, however, they also involve the dynamic adaption of the flow mesh. Typically, only a part of the flow mesh is adapted; this may well be done by using the structures solver acting on a sub-domain with negligible mass and stiffness. Such a procedure is then three-phase

> 0 is a preselected number, $T_i$ is a control volume centered at node $i$, $h_i$ is its characteristic size, $c_i$ is the sound speed and $u_i$ is the velocity vector at node $i$. Then, the proposed scheme has the general form
$$\frac{U_i^{n+1} - U_i^n}{\Delta t_i^n} + \Psi_i(U_h^{n+1}) = 0, \qquad i = 1, 2, \ldots, N, \quad n = 0, 1, \ldots \qquad (2)$$
We note that the finite volume scheme (2) has first order approximation in the pseudo-temporal variable and second order approximation in the spatial variable. On $\Gamma_w$, the no-slip boundary condition is enforced. On $\Gamma_\infty$, a non-reflective version of the flux splitting of Steger and Warming [12] is used. We apply a DeC-Krylov-Schwarz type method to solve (2); that is, we use the DeC scheme as a nonlinear solver, the restarted FGMRES algorithm as a linear solver, and the restricted additive Schwarz algorithm as the preconditioner. At each pseudo-temporal level $n$, equation (2) represents a system of nonlinear equations for the unknown variable $U_h^{n+1}$. This nonlinear system is linearized by the DeC scheme [1], formulated as follows. Let $\tilde{\Psi}_h(U_h)$ be the first-order approximation of the convective fluxes $\nabla \cdot \mathcal{F}$ obtained in a way similar to that of $\Psi_h(U_h)$, and let $\partial\tilde{\Psi}_h(U_h)$ denote its Jacobian. Suppose that for fixed $n$, an initial guess $U_h^{n+1,0}$ is given (say $U_h^{n+1,0} = U_h^n$). For $s = 0, 1, \ldots$, solve for $U_h^{n+1,s+1}$ the following linear system
$$\left(D_h^n + \partial\tilde{\Psi}_h(U_h^{n+1,s})\right)\left(U_h^{n+1,s+1} - U_h^{n+1,s}\right) = -\left[D_h^n\left(U_h^{n+1,s} - U_h^n\right) + \Psi_h(U_h^{n+1,s})\right], \qquad (3)$$
where $D_h^n = \mathrm{diag}(1/\Delta t_1^n, \ldots, 1/\Delta t_N^n)$ is a diagonal matrix. The DeC scheme (3) preserves the second-order approximation in the spatial variable of (2). In our implementation, we carry out only one DeC iteration at each pseudo-temporal iteration, that is, we use the scheme
4. LINEAR SOLVER AND PRECONDITIONING
Let the nonlinear iteration $n$ be fixed and denote the resulting linear system by
$$A x = b. \qquad (4)$$
Matrix $A$ is nonsymmetric and indefinite in general. To solve (4), we use two nested levels of restarted FGMRES methods [10], one at the fine mesh level and one at the coarse mesh level inside the additive Schwarz preconditioner (AS) to be discussed below.
4.1. One-level AS preconditioner

To accelerate the convergence of the FGMRES algorithm, we use an additive Schwarz preconditioner. The method splits the original linear system into a collection of independent smaller linear systems which can be solved in parallel. Let $\Omega_h$ be subdivided into $k$ non-overlapping subregions $\Omega_{h,1}, \Omega_{h,2}, \ldots, \Omega_{h,k}$. Let $\Omega'_{h,1}, \Omega'_{h,2}, \ldots, \Omega'_{h,k}$ be overlapping extensions of $\Omega_{h,1}, \Omega_{h,2}, \ldots, \Omega_{h,k}$, respectively, and also subsets of $\Omega_h$. The size of the overlap is assumed to be small, usually one mesh layer. The node ordering in $\Omega_h$ determines the node orderings in the extended subregions. For $i = 1, 2, \ldots, k$, let $R_i$ be a global-to-local restriction matrix that corresponds to the extended subregion $\Omega'_{h,i}$, and let $A_i$ be a "part" of matrix $A$ that corresponds to $\Omega'_{h,i}$. The AS preconditioner is defined by
$$M_{AS}^{-1} = \sum_{i=1}^{k} R_i^T A_i^{-1} R_i.$$
For certain matrices arising from the discretizations of elliptic partial differential operators, an AS preconditioner is spectrally equivalent to the matrix of a linear system with the equivalence constants independent of the mesh step size h, although, the lower spectral equivalence constant has a factor 1/H, where H is the subdomain size. For some problems, adding a coarse space to the AS preconditioner removes the dependency on 1/H, hence, the number of subdomains [11].
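As an illustration of how a one-level additive Schwarz preconditioner of the form defined above can be applied inside a Krylov iteration, here is a small SciPy sketch (a 1D model matrix with contiguous index blocks and one layer of overlap; this is purely illustrative and is not the flow solver's implementation):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Model problem: 1D Laplacian with n unknowns.
n, k, overlap = 200, 4, 1
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

# Overlapping index sets Omega'_i (contiguous blocks extended by one layer).
size = n // k
subsets = [np.arange(max(0, i * size - overlap), min(n, (i + 1) * size + overlap))
           for i in range(k)]
local_solvers = [spla.splu(A[idx, :][:, idx].tocsc()) for idx in subsets]

def apply_AS(r):
    """M^{-1} r = sum_i R_i^T A_i^{-1} R_i r  (one-level additive Schwarz)."""
    z = np.zeros_like(r)
    for idx, lu in zip(subsets, local_solvers):
        z[idx] += lu.solve(r[idx])    # restrict, solve locally, prolong and add
    return z

M = spla.LinearOperator((n, n), matvec=apply_AS)
x, info = spla.gmres(A, b, M=M, restart=30)
print("info =", info, " residual =", np.linalg.norm(A @ x - b))
```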
4.2. One-level RAS preconditioner

It is easy to see that, in a distributed memory implementation, multiplications by the matrices $R_i^T$ and $R_i$ involve communication overheads between neighboring subregions. It was recently observed [4] that a slight modification of $R_i^T$ allows half of these communications to be saved. Moreover, the resulting preconditioner, called the restricted AS (RAS) preconditioner, provides faster convergence than the original AS preconditioner for some problems. The RAS preconditioner has the form
$$M_{RAS}^{-1} = \sum_{i=1}^{k} (R'_i)^T A_i^{-1} R_i,$$
where $(R'_i)^T$ corresponds to the extrapolation from $\Omega_{h,i}$. Since it is too costly to solve linear systems with the matrices $A_i$, we use the following modification of the RAS preconditioner:
$$M_1^{-1} = \sum_{i=1}^{k} (R'_i)^T B_i^{-1} R_i, \qquad (5)$$
where $B_i$ corresponds to the ILU(0) decomposition of $A_i$. We call $M_1$ the one-level RAS preconditioner (ILU(0) modified).

4.3. Two-level RAS preconditioners

Let $\Omega_H$ be a coarse mesh in $\Omega$, and let $R_0$ be a fine-to-coarse restriction matrix. Let $A_0$ be a coarse mesh version of the matrix $A$ defined by (4). Adding a scaled coarse mesh component to (5), we obtain
$$M_2^{-1} = (1 - \alpha) \sum_{i=1}^{k} (R'_i)^T B_i^{-1} R_i + \alpha\, R_0^T A_0^{-1} R_0, \qquad (6)$$
where $\alpha \in [0, 1]$ is a scaling parameter. We call $M_2$ the global two-level RAS preconditioner (ILU(0) modified). Preconditioning by $M_2$ requires solving a linear system with the matrix $A_0$, which is still computationally costly if the linear system is solved directly and redundantly. In fact, an approximation to the coarse mesh solution can be sufficient for better preconditioning. Therefore, we solve the coarse mesh problem in parallel using again a restarted FGMRES algorithm, which we call the coarse mesh FGMRES, with a modified RAS preconditioner. Let $\Omega_H$ be divided into $k$ subregions $\Omega_{H,1}, \Omega_{H,2}, \ldots, \Omega_{H,k}$ with the extended counterparts $\Omega'_{H,1}, \Omega'_{H,2}, \ldots, \Omega'_{H,k}$. To solve the coarse mesh problem, we use FGMRES with the one-level ILU(0) modified RAS preconditioner
$$M_{0,1}^{-1} = \sum_{i=1}^{k} (R'_{0,i})^T B_{0,i}^{-1} R_{0,i}, \qquad (7)$$
where, for $i = 1, 2, \ldots, k$, $R_{0,i}$ is a global-to-local coarse mesh restriction matrix, $(R'_{0,i})^T$ is a matrix that corresponds to the extrapolation from $\Omega_{H,i}$, and $B_{0,i}$ is the ILU(0) decomposition of the matrix $A_{0,i}$, the part of $A_0$ that corresponds to the subregion $\Omega'_{H,i}$. After $r$ coarse mesh FGMRES iterations, $A_0^{-1}$ in (6) is approximated by $\tilde{A}_0^{-1} = \mathrm{poly}_l(M_{0,1}^{-1} A_0)$ for some $l \le r$, where $\mathrm{poly}_l(x)$ is a polynomial of degree $l$ whose explicit form is often not known. We note that $l$ may differ between fine mesh FGMRES iterations, depending on the stopping condition; therefore, FGMRES is more appropriate than the regular GMRES. Thus, the actual preconditioner for $A$ has the form
$$\tilde{M}_2^{-1} = (1 - \alpha) \sum_{i=1}^{k} (R'_i)^T B_i^{-1} R_i + \alpha\, R_0^T \tilde{A}_0^{-1} R_0. \qquad (8)$$
For the fine mesh linear system, we also use a preconditioner obtained by replacing $A_0^{-1}$ in (6) with $M_{0,1}^{-1}$ defined by (7):
$$M_3^{-1} = \sum_{i=1}^{k} \left( (1 - \alpha)\, (R'_i)^T B_i^{-1} R_i + \alpha\, R_0^T (R'_{0,i})^T B_{0,i}^{-1} R_{0,i} R_0 \right). \qquad (9)$$
We call $M_3$ a local two-level RAS preconditioner (ILU(0) modified), since the coarse mesh problems are solved locally and there is no global information exchange among the subregions. We expect that $M_3$ works better than $M_1$ and that $M_2$ does better than $M_3$. Since no theoretical results are available at present, we test the described preconditioners $M_1$, $\tilde{M}_2$, and $M_3$ numerically.

5. NUMERICAL EXPERIMENTS
We computed a compressible flow over a NACA0012 airfoil on a computational domain with non-nested coarse and fine meshes. First, we constructed an unstructured coarse mesh $\Omega_H$; then the fine mesh $\Omega_h$ was obtained by refining the coarse mesh twice. At each refinement step, each coarse mesh tetrahedron was subdivided into 8 tetrahedrons. After each refinement, the boundary nodes of the fine mesh were adjusted to the geometry of the domain. The sizes of the coarse and fine meshes are given in Table 1.
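The fine/coarse ratio of 64 in Table 1 follows from the refinement rule just described; a one-line check:

```python
# Each refinement splits a tetrahedron into 8; two refinements give a factor of 8**2 = 64.
coarse_tets, refinements = 9_300, 2
print(coarse_tets * 8**refinements)   # 595200, the fine-mesh tetrahedron count in Table 1
```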
Table 1
Coarse and fine mesh sizes
                Coarse     Fine        Fine/coarse ratio
Nodes           2,976      117,525     39.5
Tetrahedrons    9,300      595,200     64
Figure 1. Comparison of the one-level, local two-level, and global two-level RAS preconditioners in terms of the total numbers of linear iterations (left picture) and nonlinear iterations (right picture). The mesh has 32 subregions.
For parallel processing, the coarse mesh was partitioned, using METIS [8], into 16 or 32 submeshes, each having nearly the same number of tetrahedrons. The fine mesh partition was obtained directly from the corresponding coarse mesh partition. The size of the overlap in both the coarse and the fine mesh partition was set to one, that is, two neighboring extended subregions share a single layer of tetrahedrons. In (8) and (9), $R_0^T$ was set to a matrix of piecewise linear interpolation. Multiplications by $R_0^T$ and $R_0$, solving linear systems with $M_1$, $\tilde{M}_2$, and $M_3$, and both the fine and the coarse FGMRES algorithms were implemented in parallel. The experiments were carried out on an IBM SP2. We tested the convergence properties of the preconditioners defined in (5), (8), and (9) with $\alpha = N_c/N_f$, where $N_c$ and $N_f$ are the numbers of nodes in the coarse and fine meshes, respectively. We studied a transonic case with $M_\infty$ set to 0.8. Some of the computational results are presented in Figures 1 and 2. The left picture in Figure 1 shows the residual reduction in terms of total numbers of linear iterations. We see that the algorithms with two-level RAS preconditioners give significant improvements compared to the algorithm with the one-level RAS preconditioner. The improvement of the global two-level RAS preconditioner over the local two-level RAS preconditioner is not very large; recall that in the former case the inner FGMRES is used, which can increase the CPU time. In Table 2, we present a summary from the figure. We see that the reduction percentages in the numbers of linear iterations drop as the nonlinear residual decreases (or as the nonlinear iteration number increases). This is seen even more clearly in the right picture in Figure 1. After
Table 2
Total numbers of linear iterations and the reduction percentages compared to the algorithm with the one-level RAS preconditioner (32 subregions).
            One-level RAS    Local two-level RAS        Global two-level RAS
Residual    Iterations       Iterations   Reduction     Iterations   Reduction
10^-2       859              513          40%           371          57%
10^-4       1,205            700          42%           503          58%
10^-6       1,953            1,397        28%           1,245        36%
10^-8       2,452            1,887        23%           1,758        28%
[Figure curves: 1-level RAS, local 2-level RAS and global 2-level RAS, each with 16 and 32 subregions; nonlinear residual plotted against the total number of linear iterations.]
Figure 2. Comparison of the one-level RAS preconditioner with the local two-level RAS (left picture) and the global two-level RAS preconditioner (right picture) on the meshes with 16 and 32 subregions.
approximately 80 nonlinear iterations, the three algorithms give essentially the same number of linear iterations at each nonlinear iteration. This suggests that the coarse mesh may not be needed after some number of initial nonlinear iterations. In Figure 2, we compare the algorithms on the meshes with different numbers of subregions, 16 and 32. The left picture shows that the algorithms with the one-level and local two-level RAS preconditioners increase their total numbers of linear iterations as the number of subregions increases from 16 to 32. On the other hand, we see in the right picture of Figure 2 that the increase in the number of subregions has little effect on the convergence of the algorithm with the global two-level RAS preconditioner. These results suggest that the algorithm with the global two-level RAS preconditioner scales well with the number of subregions (processors) while the other two do not. In both pictures we observe a decrease in the total number of linear iterations towards the end of the computations. This is due to the fact that only 4 or 5 linear iterations were carried out at each nonlinear iteration in both cases, with 16 and 32 subregions (see the right picture in Figure 1), the linear systems in the case of 32 subregions being solved just one iteration faster than those in the case of 16 subregions.
6. CONCLUSIONS

When both the fine and the coarse mesh are constructed from the domain geometry, it is fairly easy to incorporate a coarse mesh component into a one-level RAS preconditioner. The application of the two-level RAS preconditioners gives a significant reduction in the total numbers of linear iterations. For our test cases, the coarse mesh component seems not to be needed after some initial number of nonlinear iterations. The algorithm with the global two-level RAS preconditioner is scalable with the number of subregions (processors). The sizes of the fine and coarse meshes should be well balanced; that is, if the coarse mesh is not coarse enough, the application of a coarse mesh component could result in an increase in CPU time.

REFERENCES
1. K. Böhmer, P. Hemker, and H. Stetter, The defect correction approach, Comput. Suppl., 5 (1985), pp. 1-32.
2. X.-C. Cai, The use of pointwise interpolation in domain decomposition methods with non-nested meshes, SIAM J. Sci. Comput., 16 (1995), pp. 250-256.
3. X.-C. Cai, W. D. Gropp, D. E. Keyes, R. G. Melvin, and D. P. Young, Parallel Newton-Krylov-Schwarz algorithms for the transonic full potential equation, SIAM J. Sci. Comput., 19 (1998), pp. 246-265.
4. X.-C. Cai and M. Sarkis, A restricted additive Schwarz preconditioner for general sparse linear systems, SIAM J. Sci. Comput., 21 (1999), pp. 792-797.
5. C. Farhat and S. Lanteri, Simulation of compressible viscous flows on a variety of MPPs: computational algorithms for unstructured dynamic meshes and performance results, Comput. Methods Appl. Mech. Engrg., 119 (1994), pp. 35-60.
6. W. D. Gropp, D. E. Keyes, L. C. McInnes, and M. D. Tidriri, Globalized Newton-Krylov-Schwarz algorithms and software for parallel implicit CFD, Int. J. High Performance Computing Applications, (1999). Submitted.
7. C. Hirsch, Numerical Computation of Internal and External Flows, John Wiley and Sons, New York, 1990.
8. G. Karypis and V. Kumar, A fast and high quality multilevel scheme for partitioning irregular graphs, SIAM J. Sci. Comput., 20 (1998), pp. 359-392.
9. D. K. Kaushik, D. E. Keyes, and B. F. Smith, Newton-Krylov-Schwarz methods for aerodynamics problems: Compressible and incompressible flows on unstructured grids, in Proc. of the Eleventh Intl. Conference on Domain Decomposition Methods in Scientific and Engineering Computing, 1999.
10. Y. Saad, A flexible inner-outer preconditioned GMRES algorithm, SIAM J. Sci. Stat. Comput., 14 (1993), pp. 461-469.
11. B. F. Smith, P. E. Bjørstad, and W. D. Gropp, Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations, Cambridge University Press, 1996.
12. J. Steger and R. F. Warming, Flux vector splitting of the inviscid gasdynamic equations with application to finite-difference methods, J. Comp. Phys., 40 (1981), pp. 263-293.
13. B. van Leer, Towards the ultimate conservative difference scheme V: a second order sequel to Godunov's method, J. Comp. Phys., 32 (1979), pp. 361-370.
Parallel Calculation of Helicopter BVI Noise by the Moving Overlapped Grid Method

Takashi Aoyama*, Akio Ochi#, Shigeru Saito*, and Eiji Shima†
*National Aerospace Laboratory (NAL), 7-44-1, Jindaijihigashi-machi, Chofu, Tokyo 182-8522, Japan
†Advanced Technology Institute of Commuter-helicopter, Ltd. (ATIC), 2 Kawasaki-cho, Kakamigahara, Gifu 504-0971, Japan

The progress of a prediction method for helicopter blade-vortex interaction (BVI) noise, developed under the cooperative research between the National Aerospace Laboratory (NAL) and the Advanced Technology Institute of Commuter-helicopter, Ltd. (ATIC), is summarized. This method consists of an unsteady Euler code using a moving overlapped grid method and an aeroacoustic code based on the Ffowcs Williams-Hawkings (FW-H) formulation. The present large-scale calculations are performed on a vector parallel supercomputer, the Numerical Wind Tunnel (NWT), in NAL. Therefore, a new search and interpolation algorithm suitable for vector parallel computation was developed for the efficient exchange of the flow solution between grids. The calculated aerodynamic and aeroacoustic results are in good agreement with the experimental data obtained by the ATIC model rotor test at the German-Dutch Wind Tunnel (DNW). The distinct spikes in the waveform of BVI noise are successfully predicted by the present method.

1. INTRODUCTION

Helicopters have the great capability of hovering and vertical takeoff and landing (VTOL). The importance of this capability has been recognized again, especially in Japan after the great earthquake in Kobe, where helicopters were shown to be effective as a means of disaster relief. It is worth mentioning that an international meeting of the American Helicopter Society (AHS) on advanced rotorcraft technology and disaster relief was held in Japan in 1998. However, it cannot be said that helicopters are widely used as a means of civil transportation. Although their capability would be valuable in civil transportation, noise is a major problem. Helicopters produce many kinds of noise, such as blade-vortex interaction (BVI) noise, high-speed impulsive (HSI) noise, engine noise, transmission noise, tail rotor noise, blade-wake interaction (BWI) noise, main-rotor/tail-rotor interaction noise, and so on. BVI noise is the most severe for civil helicopters operating in densely populated areas because it is mainly generated in descending flight towards heliports and radiates mostly below the helicopter's tip-path plane in the direction of forward flight. What makes it even worse is that its acoustic signal generally lies in the frequency range to which human subjective response is most sensitive (500 to 5000 Hz).
Many researchers have been devoting themselves to developing prediction methods for BVI noise. Tadghighi et al. developed a procedure for BVI noise prediction [1], based on the coupling of a comprehensive helicopter trim code, a three-dimensional unsteady full potential code, and an acoustic code using Farassat's 1A formulation of the Ffowcs Williams-Hawkings (FW-H) equation. The National Aerospace Laboratory (NAL) and the Advanced Technology Institute of Commuter-helicopter, Ltd. (ATIC) also developed a combined prediction method [2] consisting of a comprehensive trim code (CAMRAD II), a three-dimensional unsteady Euler code, and an acoustic code based on the FW-H formulation. The method was used effectively in the design of a new blade [3] in ATIC. However, one of its disadvantages is that users must specify indefinite modeling parameters such as the core size of the tip vortex. The recent progress of computer technology allows us to analyze the complicated phenomenon of BVI directly by CFD techniques. The great advantage of direct calculations by Euler or Navier-Stokes codes is that they capture the tip vortex generated from the blades without using indefinite parameters. Ahmad et al. [4] predicted the impulsive noise of the OLS model rotor using an overset grid Navier-Stokes/Kirchhoff-surface method. Although the calculated waveforms of high-speed impulsive (HSI) noise were in reasonable agreement with experimental data, the distinct spikes in the acoustic waveform of blade-vortex interaction noise could not be captured, because the intermediate and background grids used in their method are too coarse to maintain the strength of the tip vortex. In order to solve this problem, NAL and ATIC developed a new prediction method [5] for BVI noise. The method combines an unsteady Euler code using a moving overlapped grid method and an aeroacoustic code based on the FW-H formulation. After some efforts on the refinement of grid topology and numerical accuracy [6-8], we have successfully predicted the distinct spikes in the waveform of BVI noise. We validated our method by comparing numerical results with experimental data [9-11] obtained by ATIC. Our calculations are conducted on a vector parallel supercomputer in NAL; a new search and interpolation algorithm suitable for vector parallel computation was developed for the efficient exchange of the flow solution between grids. The purpose of this paper is to summarize the progress of the prediction method developed under the cooperative research between NAL and ATIC.

2. CALCULATION METHOD

2.1 Grid System

Two types of CFD grids were used to solve the Euler equations in the first stage of our moving overlapped grid method [6]. The blade grid wraps each rotor blade using boundary-fitted coordinates (BFC), and the Cartesian background grid covers the whole computational region including the entire rotor. In the grid system presently used in our method, the background grid consists of inner and outer background grids, as shown in Figure 1, which increases the grid density only near the rotor.

2.2 Numerical Method in Blade Grid Calculation

The numerical method used to solve the Euler equations in the blade grid is an implicit finite-difference scheme [12]. The Euler equations are discretized in delta form using Euler backward time differencing. A diagonalized approximate factorization method, which
utilizes an upwind flux-split technique, is used for the implicit left-hand side of the spatial differencing. In addition, an upwind scheme based on the TVD scheme of Chakravarthy and Osher is applied to the explicit right-hand-side terms. Each operator is decomposed into the product of lower and upper bi-diagonal matrices using diagonally dominant factorization. In unsteady forward flight conditions, Newton iterations are added in order to reduce the residual in each time step; the number of Newton iterations is six. The typical number of steps along the azimuthal direction is about 9000 per revolution, which corresponds to an azimuth angle of about 0.04° per step. The unsteady calculation is impulsively started from an undisturbed condition at an azimuth angle of 0°.
2.3 Numerical Method in Background Grid Calculation

A higher-accuracy explicit scheme is utilized in the background Cartesian grid. The compact TVD scheme [13] is employed for the spatial discretization, and the MUSCL cell-interface value is modified to achieve 4th-order accuracy. The Simple High-resolution Upwind Scheme (SHUS) [14] is employed to obtain the numerical flux. SHUS is one of the Advection Upstream Splitting Method (AUSM) type approximate Riemann solvers and has small numerical diffusion. The time integration is carried out by an explicit method; the four-stage Runge-Kutta method is used in the present calculation. The free stream condition is applied at the outer boundary of the outer background grid.
2.4 Search and Interpolation

The flow solution obtained by a CFD code is exchanged between grids in the moving overlapped grid approach. The search and interpolation needed to exchange the flow solution between the blade grid and the inner background grid, and between the inner background grid and the outer background grid, are executed in each time step. The computational time spent on search and interpolation is one of the disadvantages of the moving overlapped grid approach. In our computation, this problem is severe because a vector parallel computer is used. Therefore, a new algorithm suitable for vector parallel computation was developed. Only the detailed procedure of the solution transfer from the blade grid to the inner background grid is described in this paper, because the other transfers are easily understood. The procedure flow of the new search and interpolation algorithm is shown in Figure 2. In the first step, the grid indexes (i, j, k) of the inner background grid points that might be inside blade grid cells are listed. In the second step, the listed indexes are checked to determine whether they are located inside a grid cell. Because trilinear interpolation is utilized in the present algorithm, the position of a point within a cell is expressed by three scalar parameters s, t, and u; in this step, the values of s, t, and u for each index are calculated. When s, t, and u are all between zero and one, the point is judged to be located inside the grid cell. Then the grid points outside the cell are removed from the list and the flow solution is interpolated into temporary arrays. Each processing element (PE) of the NWT performs these procedures in parallel. Finally, the interpolated values are exchanged between the processing elements.
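A serial sketch of the inside/outside test and trilinear interpolation described above is given below (a NumPy illustration only: the (s, t, u) parameters of a point in a hexahedral cell are found by a few Newton iterations of the trilinear mapping, and the point is accepted when all three lie in [0, 1]; the vectorized NWT implementation is of course more involved):

```python
import numpy as np

def trilinear_weights(s, t, u):
    """Weights of the 8 cell corners, ordered 000,100,010,110,001,101,011,111."""
    return np.array([(1-s)*(1-t)*(1-u), s*(1-t)*(1-u), (1-s)*t*(1-u), s*t*(1-u),
                     (1-s)*(1-t)*u,     s*(1-t)*u,     (1-s)*t*u,     s*t*u])

def find_stu(corners, p, iters=10):
    """Invert the trilinear map x(s,t,u) for point p by Newton iteration; corners is (8,3)."""
    stu = np.full(3, 0.5)
    for _ in range(iters):
        s, t, u = stu
        r = trilinear_weights(s, t, u) @ corners - p          # residual x(s,t,u) - p
        # Jacobian columns: dx/ds, dx/dt, dx/du
        dws = np.array([-(1-t)*(1-u), (1-t)*(1-u), -t*(1-u), t*(1-u), -(1-t)*u, (1-t)*u, -t*u, t*u])
        dwt = np.array([-(1-s)*(1-u), -s*(1-u), (1-s)*(1-u), s*(1-u), -(1-s)*u, -s*u, (1-s)*u, s*u])
        dwu = np.array([-(1-s)*(1-t), -s*(1-t), -(1-s)*t, -s*t, (1-s)*(1-t), s*(1-t), (1-s)*t, s*t])
        J = np.column_stack([dws @ corners, dwt @ corners, dwu @ corners])
        stu = stu - np.linalg.solve(J, r)
    return stu

def interpolate(corners, values, p, tol=1e-9):
    """Return the trilinearly interpolated value if p is inside the cell, else None."""
    s, t, u = find_stu(corners, p)
    if all(-tol <= c <= 1.0 + tol for c in (s, t, u)):        # inside test: s, t, u in [0, 1]
        return trilinear_weights(s, t, u) @ values
    return None

# Unit-cube cell with corner values x + 2y + 3z (exactly representable by trilinear interpolation).
corners = np.array([[i, j, k] for k in (0, 1) for j in (0, 1) for i in (0, 1)], dtype=float)
values = corners @ np.array([1.0, 2.0, 3.0])
print(interpolate(corners, values, np.array([0.2, 0.5, 0.9])))   # -> 3.9 (inside)
print(interpolate(corners, values, np.array([1.5, 0.5, 0.5])))   # -> None (outside)
```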
[Flowchart steps of the search and interpolation: list grid point indexes -> check whether inside or outside -> remove outside grid points from the index list -> interpolate values -> exchange interpolated values between PEs; the first four steps are vector computation and the exchange is parallel computation.]
Figure 1. Blade grids and inner and outer background grids. Figure 2. Procedure flow of search and interpolation.

2.5 Aeroacoustic Code

The time history of the acoustic pressure generated by the blade-vortex interaction is calculated by an aeroacoustic code [15] based on the Ffowcs Williams-Hawkings (FW-H) formulation. The pressure and its gradient on the blade surface obtained by the CFD calculation are used as input data. The FW-H formulation used here does not include the quadrupole term, because strong shock waves are not generated in the flight condition considered here.

2.6 Numerical Wind Tunnel in NAL

The large computations presented here were performed on the Numerical Wind Tunnel (NWT) in NAL. The NWT is a vector parallel supercomputer consisting of 166 processing elements (PEs). The performance of an individual PE is equivalent to that of a supercomputer, 1.7 GFLOPS, and each PE has a main memory of 256 MB. A high-speed crossbar network connects the 166 PEs. The total peak performance of the NWT is 280 GFLOPS and the total capacity of the main memory is as much as 45 GB. The CPU time per revolution is about 20 hours using 30 processing elements, and periodic solutions are obtained after about three revolutions. NAL follows a strategy of continuously updating its high performance computers in order to promote research on numerical simulation as a common basis of the aeronautical and astronautical research fields. The replacement of the present NWT by a TFLOPS machine is now under consideration. Research on helicopter aerodynamics and aeroacoustics will be further stimulated by the new machine. On the other hand, the challenging problem of helicopter aerodynamics and aeroacoustics, which includes rotational flow, unsteady aerodynamics, vortex generation and convection, noise generation and propagation, aeroelasticity, and so on, will itself promote the development of high performance parallel supercomputers. In addition, its multi-disciplinary aspect makes the problem even more challenging.

3. RESULTS AND DISCUSSION

3.1. Aerodynamic Results

The calculated aerodynamic results are compared with experimental data. Figure 3 shows the comparisons between measured and calculated pressure distributions on the blade surface
in a forward flight condition. The experimental data were obtained by the ATIC model rotor tests [9-11] at the German-Dutch Wind Tunnel (DNW). The comparisons are performed at 12 azimuth-wise positions at r/R = 0.95, where r and R are the span-wise station and the rotor radius, respectively. The agreement is good at every azimuth position. The tip vortices are visualized by an iso-surface of the vorticity magnitude in the inner background grid in Figure 4. The tip vortices are distinctly captured and the interactions between blade and vortex are clearly observed. Figure 5 shows the wake visualized by particle traces. The formation of the tip vortex and the roll-up of the rotor wake are observed. The appearance of the rotor wake is similar to that of a fixed-wing wake; this result gives us a good reason to regard a rotor macroscopically as an actuator disk. In Figure 6, the tip-vortex locations calculated by the present method are compared with the experimental data measured by the laser light sheet (LLS) technique and with the result calculated by CAMRAD II. Figures 6 a) and b) show the horizontal view and the vertical view (at y/R = 0.57), respectively; the origin of the coordinates in Figure 6 b) is the leading edge of the blade. All the results are in good agreement.

3.2. Aeroacoustic Results
The time history of the calculated sound pressure level (SPL) is compared with the experimental data of ATIC in Figure 7. The observer position is at #1 in Figure 8, which is on the horizontal plane 2.3 m below the center of rotor rotation. A lot of distinct spikes generated by BVI phenomena are seen in the measured SPL; this type of impulsiveness is typically observed in the waveform of BVI noise. Although the amplitude of the waveform is over-predicted by the present method because it does not include the effect of aeroelasticity, the distinct spikes in the waveform are reasonably predicted. This result may be the first case in which the phenomenon of BVI is clearly captured directly by a CFD technique. Figure 9 shows the comparison between predicted and measured carpet noise contours on the horizontal plane in Figure 8. In this figure, the open circle represents the rotor disk. Two BVI lobes are observed in the measured result: the stronger one is caused by the advancing-side BVI and the other is the result of the retreating-side BVI. Although the calculation under-predicts the advancing-side lobe and over-predicts the retreating-side lobe, it successfully predicts the existence of the two types of BVI lobes.
Figure 3. Comparison between measured and calculated pressure distributions on the blade surface (r/R=0.95). Lines: calculation; symbols: experiment.
Convergence check: when $r^{n+1} \approx 0$, $p^{n+1}$ is the solution; $B^{-1}$ is the preconditioner.
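As an illustration of a preconditioned conjugate-gradient iteration of the kind referred to above (this is a generic textbook form, not the paper's exact scheme: A stands for the interface operator, B for the preconditioner, and the names p, F, g, w mirror those used in the text):

```python
import numpy as np

def pcg(A, b, apply_Binv, tol=1e-10, maxiter=200):
    """Generic preconditioned conjugate gradient: solve A p = b with preconditioner B."""
    p = np.zeros_like(b)            # iterate (e.g. the interface unknowns)
    F = b - A @ p                   # residual
    g = apply_Binv(F)               # preconditioned residual, g = B^{-1} F
    w = g.copy()                    # search direction
    for _ in range(maxiter):
        Aw = A @ w
        alpha = (F @ g) / (w @ Aw)
        p = p + alpha * w
        F_new = F - alpha * Aw
        if np.linalg.norm(F_new) < tol:        # convergence check: residual ~ 0
            return p
        g_new = apply_Binv(F_new)
        s = (F_new @ g_new) / (F @ g)          # conjugacy coefficient
        w = g_new + s * w                      # w^{n+1} = g^{n+1} + s^n w^n
        F, g = F_new, g_new
    return p

# Small SPD test system with a diagonal (Jacobi) preconditioner.
A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
x = pcg(A, b, lambda r: r / np.diag(A))
print(np.allclose(A @ x, b))        # True
```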
4. PARALLEL IMPLEMENTATION

During the parallel implementation, in order to advance the solution by a single time step, the momentum equation is solved explicitly twice, Eqns. (3) and (4). At each solution, interface values are exchanged between the processors working on domains having common boundaries. Solving Eqn. (4) gives the intermediate velocity field, which is used on the right hand side of Poisson's equation (5) or (5a) to obtain the auxiliary potential or the pressure, respectively. The solution of the Poisson equation is obtained with domain decomposition, where an iterative solution is also necessary at the interface. Therefore, the computations, involving an inner iterative cycle and outer time step advancement, have to be performed in a parallel manner on each processor communicating with its neighbors. The master-slave process technique is adopted: the slaves solve the N-S equations over their designated domains while the master mainly handles the domain decomposition iterations. All interfaces are handled together.
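A minimal sketch of this master-slave organization with mpi4py is given below (an illustration of the communication pattern only, assuming rank 0 acts as the master; solve_subdomain and update_interface are hypothetical placeholders for the solver steps described above, not routines from the paper's code):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

def solve_subdomain(interface):           # placeholder for the slave's N-S / Poisson solve
    return interface * 0.9 + 0.1

def update_interface(values):             # placeholder for the master's interface update
    return np.mean(values, axis=0)

n_interface, n_steps = 8, 5
interface = np.zeros(n_interface)

for step in range(n_steps):
    if rank == 0:
        # Master: gather interface contributions from all slaves, update, broadcast back.
        contributions = [comm.recv(source=r) for r in range(1, nprocs)]
        interface = update_interface(contributions) if contributions else interface
    else:
        # Slave: solve on its own sub-domain and send its interface values to the master.
        comm.send(solve_subdomain(interface), dest=0)
    interface = comm.bcast(interface, root=0)

if rank == 0:
    print("final interface values:", interface)
```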
5. RESULTS AND DISCUSSION

The two DDM methods are used to solve the N-S equations with FEM. First, as a test case, an 11x11x11 cubic cavity problem is selected. The Reynolds number, based on the lid length and the lid velocity, is 1000. The solution is advanced 1000 time steps up to the dimensionless time level of 30, where the steady state is reached. The four cases investigated are: 1. potential based solution with initial φ = 0; 2. pressure based solution with initial p = 0; 3. pressure based solution with p = p_old for initialization and finalization; 4. pressure based solution with one shot.
For the above test case the first method gave the desired result in the least CPU time. Therefore, a more complex geometry, namely the Re = 1000 laminar flow over a wing-winglet configuration, is then analysed with this method. Four and six domain solutions are obtained. The number of elements and grid points in each domain are given in Table 1, while the grid and domain partition for the 6-domain case is shown in Fig. 1. Solutions using both the pressure based and the auxiliary potential based Poisson's equation formulations are obtained. The solution is advanced 200 time steps up to the dimensionless time level of 1. The tolerance for convergence of the DDM iterations is selected as 5x10^-5, while for EBE/PCG convergence the tolerance is 1x10^-6.
Table 1.
             4 Domain case                   6 Domain case
             Number of     Number of         Number of     Number of
             Elements      grid points       Elements      grid points
Domain 1     7872          9594              5904          7462
Domain 2     7872          9594              4920          6396
Domain 3     7872          9466              4920          6396
Domain 4     7872          9450              4920          6316
Domain 5     -             -                 4920          6300
Domain 6     -             -                 5904          7350
The pressure based formulation gave much smoother results, particularly in the spanwise direction, compared to the potential based formulation. The pressure isolines around the wing-winglet configuration are shown in Fig. 2. The effect of the winglet is to weaken the effect of the tip vortices on the wing. This effect on the cross flow about the winglet is seen in Fig. 3, at the mid-spanwise plane of the winglet. Table 2 shows the CPU times and the related speed-up values for the 4 and 6 domain solutions of the wing-winglet problem.
Table 2.
            φ based solution (CPU seconds)         p based solution (CPU seconds)
Process     4 domain    6 domain    Speed-up       4 domain    6 domain    Speed-up
Master      466         1430        0.33           899         1302        0.69
Slave-I     50943       57689       0.89           95619       56242       1.76
Slave-II    56773       50594                      100443      58817
Slave-III   56115       52826                      103802      50756
Slave-IV    56402       53224                      97030       51396
Slave-V     -           51792                      -           47074
Slave-VI    -           63527                      -           58817
The speed-up value is 0.89 for the 6-domain potential based solution and 1.76 for the 6-domain pressure based solution, where normalization is done with respect to the CPU time of the 4-domain solution. The potential based solution thus exhibits a speed-down while the pressure based solution gives a super-linear speed-up. In the EBE/PCG iterative technique used here, the number of operations is proportional to the square of the number of unknowns of the problem, whereas in domain decomposition the size of the problem is reduced linearly by the number of
Figure 1. External view of the grid about the wing-winglet configuration and the 6 domain partition.
Figure 2. Pressure isolines about the wing-winglet configuration. Re=1000 and Time=1.
112
Figure 3. Wing-winglet spanwise mid plane cross flow distribution. into a larger number of subdomains. Pressure and potential based solutions are compared in Table 3 in terms of iteration counts.
6. CONCLUSION A second order accurate FEM together with two matching nonoverlapping domain decomposition techniques is implemented on a cluster of WS having very low cost hardware configuration. Poisson's equation is formulated both in terms of velocity potential and pressure itself. Flow in a cubic cavity and over a wing-winglet configuration are analysed. Using method 1, parallel super-linear speed-up is achieved with domain decomposition technique applied to pressure equation. Method 2 (one shot DDM) requires a good preconditioner to achieve super-linear speed-ups. For future work, three-dimensional computations will continue with direct pressure formulations and better load balancing for optimized parallel efficiencies. A variety of preconditioners will be empoyed to speed up the one shot DDM.
REFERENCES
[1] R. Glowinski and J. Periaux, "Domain Decomposition Methods for Nonlinear Problems in Fluid Dynamics", Research Report 147, INRIA, France, 1982.
Table 3.
4 Domain solution
              Total number of        Minimum number of      Maximum number of      Average number of
              Pressure Iterations    Pressure Iterations    Pressure Iterations    Pressure Iterations
Formulation   φ based    p based     φ based    p based     φ based    p based     φ based    p based
Process 1     1164143    2099256     3577       7838        13006      16278       5821       10496
Process 2     1217973    2202983     3729       8288        13626      17003       6090       11015
Process 3     1225096    2217662     3752       8325        13749      17216       6125       11088
Process 4     1129044    2051901     3467       7671        12707      15946       5645       10259

6 Domain solution
Process 1     2079923    1767713     5741       6138        15290      13432       10400      8839
Process 2     2145158    1813694     5871       6293        15812      13752       10726      9068
Process 3     2269113    1910254     6205       6649        16704      14471       11346      9551
Process 4     2397005    2007897     6558       6979        17691      15177       11985      10039
Process 5     2070968    1744059     5642       6064        15231      13261       10355      8720
Process 6     1998353    1696447     5497       5898        14732      12886       9992       8482

Domain Decomposition Iterations
              Total      Minimum     Maximum    Average
φ, 4 process  4175       12          49         21
p, 4 process  7871       29          62         39
φ, 6 process  8017       21          60         40
p, 6 process  6739       23          52         34
[2] R. Glowinski, T.W. Pan and J. Periaux, "A one shot domain decomposition/fictitious domain method for the solution of elliptic equations", Parallel Computational Fluid Dynamics, New Trends and Advances, A. Ecer et al. (Editors), 1995.
[3] A. Suzuki, "Implementation of Domain Decomposition Methods on Parallel Computer ADENART", Parallel Computational Fluid Dynamics, New Algorithms and Applications, N. Satofuka, J. Periaux and A. Ecer (Editors), 1995.
[4] A.R. Aslan, F.O. Edis, U. Gulcat, "Accurate incompressible N-S solution on cluster of workstations", Parallel CFD '98, May 11-14, 1998, Hsinchu, Taiwan.
[5] U. Gulcat, A.R. Aslan, International Journal for Numerical Methods in Fluids, 25, 985-1001, 1997.
[6] Q.V. Dinh, A. Ecer, U. Gulcat, R. Glowinski, and J. Periaux, "Concurrent Solutions of Elliptic Problems via Domain Decomposition, Applications to Fluid Dynamics", Parallel CFD 92, May 18-20, Rutgers University, 1992.
This Page Intentionally Left Blank
Parallel ComputationalFluid Dynamics Towards Teraflops, Optimizationand Novel Formulations D. Keyes,A. Ecer, J. Periaux,N. Satofukaand P. Fox (Editors) 92000 Elsevier Science B.V. All rights reserved.
Parallel Implementation
115
of t h e D i s c o n t i n u o u s G a l e r k i n M e t h o d *
Abdelkader Baggag ~, Harold Atkins b and David Keyes c ~Department of Computer Sciences, Purdue University, 1398 Computer Science Building, West-Lafayette, IN 47907-1398 bComputational Modeling and Simulation Branch, NASA Langley Research Center, Hampton, VA 23681-2199 ~Department of Mathematics & Statistics, Old Dominion University, Norfolk, VA 23529-0162, ISCR, Lawrence Livermore National Laboratory, Livermore, CA 94551-9989, and ICASE, NASA Langley Research Center, Hampton, VA 23681-2199 This paper describes a parallel implementation of the discontinuous Galerkin method. The discontinuous Galerkin is a spatially compact method that retains its accuracy and robustness on non-smooth unstructured grids and is well suited for time dependent simulations. Several parallelization approaches are studied and evaluated. The most natural and symmetric of the approaches has been implemented in an object-oriented code used to simulate aeroacoustic scattering. The parallel implementation is MPI-based and has been tested on various parallel platforms such as the SGI Origin, IBM SP2, and clusters of SGI and Sun workstations. The scalability results presented for the SGI Origin show slightly superlinear speedup on a fixed-size problem due to cache effects. 1. M o t i v a t i o n The discontinuous Galerkin (DG) method is a robust and compact finite element projection method that provides a practical framework for the development of high-order accurate methods using unstructured grids. The method is well suited for large-scale time-dependent computations in which high accuracy is required. An important distinction between the DG method and the usual finite-element method is that in the DG method the resulting equations are local to the generating element. The solution within each element is not reconstructed by looking to neighboring elements. Thus, each element may be thought of as a separate entity that merely needs to obtain some boundary data from its neighbors. The compact form of the DG method makes it well suited for parallel computer platforms. This compactness also allows a heterogeneous treatment of problems. That is, the element topology, the degree of approximation and even the choice *This research was supported by the National Aeronautics and Space Administration under NASA contract No. NAS1-97046 while Baggag and Keyes were in residence at the Institute for Computer Applications in Science and Engineering (ICASE), NASA Langley Research Center, Hampton, VA 236812199.
116 of governing equations can vary from element to element and in time over the course of a calculation without loss of rigor in the method. Many of the method's accuracy and stability properties have been rigorously proven [15] for arbitrary element shapes, any number of spatial dimensions, and even for nonlinear problems, which lead to a very robust method. The DG method has been shown in mesh refinement studies [6] to be insensitive to the smoothness of the mesh. Its compact formulation can be applied near boundaries without special treatment, which greatly increases the robustness and accuracy of any boundary condition implementation. These features are crucial for the robust treatment of complex geometries. In semi-discrete form, the DG method can be combined with explicit time-marching methods, such as Runge-Kutta. One of the disadvantages of the method is its high storage and high computational requirements; however, a recently developed quadrature-free implementation [6] has greatly ameliorated these concerns. Parallel implementations of the DG method have been performed by other investigators. Biswas, Devine, and Flaherty [7] applied a third-order quadrature-based DG method to a scalar wave equation on a NCUBE/2 hypercube platform and reported a 97.57% parallel efficiency on 256 processors. Bey et al. [8] implemented a parallel hp-adaptive DG method for hyperbolic conservation laws on structured grids. They obtained nearly optimal speedups when the ratio of interior elements to subdomain interface elements is sufficiently large. In both works, the grids were of a Cartesian type with cell sub-division in the latter case. The quadrature-free form of the DG method has been previously implemented and validated [6,9,10] in an object-oriented code for the prediction of aeroacoustic scattering from complex configurations. The code solves the unsteady linear Euler equations on a general unstructured mesh of mixed elements (squares and triangles) in two dimensions. The DG code developed by Atkins has been ported [11] to several parallel platforms using MPI. A detailed description of the numerical algorithm can be found in reference [6]; and the description of the code structure, parallelization routines and model objects can be found in reference [11]. In this work, three different parallelization approaches are described and efficiency results for the selected approach are reported. The next section provides a brief description of the numerical method and is followed by a discussion of parallelization strategies, a citation of our standard test case, and performance results of the code on the Origin2000 and several other computing platforms. 2. D i s c o n t i n u o u s G a l e r k i n M e t h o d The DG method is readily applied to any equation of the form
OU - - - t - V . F(U) = O. Ot
(1)
on a domain that has been divided into arbitrarily shaped nonoverlapping elements f~i that cover the domain. The DG method is defined by choosing a set of local basis functions B = {b~, 1 _< l _< N(p, d)} for each element, where N is a function of the local polynomial order p and the number of space dimensions d, and approximating the solution in the
117 element in terms of the basis set N(p,d)
Ua, ~ Vi -
~
vi,l bl.
(2)
l=l
The governing equation is projected onto each member of the basis set and cast in a weak form to give
OVi
bkFn(~, Vjj) gijds
O,
(3)
where Vi is the approximate solution in element ~i, Vj denotes the approximate solution in a neighboring element ~j, 0ghj is the segment of the element boundary that is common to the neighboring element gtj, gij is the unit outward-normal vector on O~ij, and V~ and Vj denote the trace of the solutions on O~ij. The coefficients of the approximate solution vi,z are the new unknowns, and the local integral projection generates a set of equations governing these unknowns. The trace quantities are expressed in terms of a lower dimensional basis set bz associated with O~ij. fir denotes a numerical flux which is usually an approximate Riemann flux of the Lax-Friedrichs type. Because each element has a distinct local approximate solution, the solution on each interior edge is double valued and discontinuous. The approximate Riemann flux ffR(Vi, Vj) resolves the discontinuity and provides the only mechanism by which adjacent elements communicate. The fact that this communication occurs in an edge integral means the solution in a given element V~ depends only on the edge trace of the neighboring solution Vj, not on the whole of the neighboring solution Vj. Also, because the approximate solution within each element is stored as a function, the edge trace of the solution is obtained without additional approximations. The DG method is efficiently implemented on general unstructured grids to any order of accuracy using the quadrature-free formulation. In the quadrature-free formulation, developed by Atkins and Shu in [6], the flux vector ff is approximated in terms of the basis set bz, and the approximate Riemann flux/~R is approximated in terms of the lower basis set bz: N(p,d)
fi(V~) ~
E
f,,
N(p,d-1)
b,,
fiR(V~, Vj). ~ -
/=1
E
fR,' 6,.
(4)
/=1
With these approximations, the volume and boundary integrals can be evaluated analytically, instead of by quadrature, leading to a simple sequence of matrix-vector operations
(5)
O[vi,l] = (M_IA) [fi,l] - E (M-1BiJ)[fiR,l], Ot (j} where
M-
bk bl d~ ,
A =
Vbk bl dfl ,
Bij =
bk bz ds .
(6)
118
The residual of equation (5) is evaluated by the following sequence of operations:
[~ij,,] = Ti.~[vi,t] } = .F(Vi).n-'ij
Vai,
[fij,,]
v
R
V Of~i~,
O[vi,l] Ot
= =
(M-1A)[fi,t]-
E
/
-R (M-1BiJ)[fij,t]
V f2~
where T~j is the trace operator, and [()~j,z] denotes a vector containing the coefficients of an edge quantity on edge j. 3. Parallel Computation In this section, three different possible parallelization strategies for the DG method are described. The first approach is symmetric and easy to implement but results in redundant flux calculations. The second and third approaches eliminate the redundant flux calculations; however, the communication occurs in two stages making it more difficult to overlap with computation, and increasing the complexity of the implementation. The following notation will be used to describe the parallel implementation. Let f~ denote any element, instead of f2~, and let 0 f2p denote any edge on the partition boundary, and 0 f2/ denote any other edge. The first approach is symmetric and is easily implemented in the serial code in reference [11]. It can be summarized as follows: 1. Compute [~j,l] and If j,1]
V f2
2. Send [vj,t] and [/j,l] on 0f~p to neighboring partitions --R
3. Compute [fl] and (M-1A)[fl] Vf~, and [f~,t] V0f~/ 4. Receive [v~,l] and --R
[fj,z] on
0f2p from neighboring partitions --R
5. Compute [fj,z] V0f2p and (M-lB,)[f3,t]
Vf2
In this approach, nearly all of the computation is scheduled to occur between the --R nonblocking send and receive; however, the edge flux [fj,l] is doubly computed on all 0f2p. It is observed in actual computations that redundant calculation is not a significant --R factor. The calculation of [fiLl] on all Ofhj represents only 2% to 3% of the total CPU time. The redundant computation is performed on only a fraction of the edges given by
0a
/(0a, u 0a ).
The above sequence reflects the actual implementation [11] used to generate the results to be shown later; however, this approach offers the potential for further improvement. By collecting the elements into groups according to whether or not they are adjacent
119
to a partition boundary, some of the work associated with the edge integral can also be performed between the send and receive. Let Ftp denote any element adjacent to a partition boundary and FtI denote any other element. The following sequence provides maximal overlap of communication and computation. 1. Compute [Vj:] and
If j:]
yap
w
2. Send [Vj,z] and [fj,z] on 0Ftp to neighboring partitions --
3. Compute [vj:] and [fj,z] Vai, and 4. Compute [fd and (M-1A)[fl]
[fj,~] VOFti --R
--R
and (M-1Bj)[fjj]
Vai
VFt
5. Receive [vj:] and [fj,d on 0ap from neighboring partitions --R
--R
6. Compute [fj,l] VOFtp and (M-lB,)[fj,l]
VFtp
3.1. O t h e r P a r a l l e l i z a t i o n S t r a t e g i e s
Two variations of an alternative parallelization strategy that eliminates the redundant --R flux calculations are described. In these approaches, the computation of the edge flux [fj,z] on a partition boundary is performed by one processor and the result is communicated to the neighboring processor. The processor that performs the flux calculation is said to "own" the edge. In the first variation, all edges shared by two processors are owned by only one of the two processors. In the second variation, ownership of the edges shared by two processors is divided equally between the two. Let OFt(a) denote any edge owned by processor A, Oft(pb) denote any edge owned by adjacent processor B, and 0Ft(ab) denote any edge shared by processors A and B. For the purpose of illustration, let ownership of all Oa (ab) in the first variation be given to processor A. Thus { 0Ft(a) } V1{ Off(b) } = { ~ } in both variations, and { O Ft(p~b) } gl{ O f~(pb) } = { 0 } in the first variation. Both computation can be summarized in the following steps: Process A 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12.
Compute [vj,l] and [fj,L] V Ftp Send [Vj,t] and [f-j:] V 0Ftp \ 0Ft(p~) Compute [vj,,] and [fj,z] V a I Compute [fz] V Ft Receive [Vj,z] and [fj,l] V OFt~a) --R Ft(pa) Compute [fy,l] VO --R ~-~(pC) Send [fj:] V0 --R Compute [fj,l] V 0 f~I Compute (M-1A)[fd V Ft --R Compute (M-1Bj) [fj,t] VQI Receive [f- nj,,] V 0 t2p \ 0 f~(a) C o m p u t e ( M - 1 B j ) [ j,z] V~p
Process B Compute [Vj,l] and [fj,l] V Ftp Send [~j,l] and [f~:] V 0 Ft, \ 0 Ft(pb) Compute [vj,z] and [fj,l] V~']I Compute [ft] V Ft Receive [vy,z] and [f-~,,] V O fl(pb) --R V0 Ft(pb) Compute [fj,l] --R
~(b)
Send [f.~:] V O --R Compute [fj,l] V O t2i Compute (M-1A)[fl] V ft Compute (M- 1By) [L,,] --R va, ~ Receive [fy,,] V 0Ftp \ 0 gt(ps) Compute (M-1Bj) [fj:] --R
V ~p
120
It is clear that under these strategies, there are no redundant flux calculations. In both variations, the total amount of data sent is actually less than that of the symmetric approach presented earlier. However, because the sends are performed in two stages, it is more difficult to overlap the communication with useful computation. Also, the unsymmetric form of edge ownership, in the first variation, introduces a difficulty in balancing the work associated with the edge flux calculation. In the second variation, the number of sends is twice that of the symmetric approach; however, because { 0 gt(ab) } r { 0 Ft(b) } = { ~ } in the first variation, the total number of sends in this approach is the same as in the symmetric approach presented earlier. 4. P h y s i c a l P r o b l e m The parallel code is used to solve problems from the Second Benchmark Problems in Computational Aeroacoustics Workshop [12] held at Flordia State University in 1997. The physical problem is the scattering of acoustic waves and is well represented by the linearized Euler equations written in the form of equation (1). Details of the problem can be found in reference [12]. Figure (1.a) shows a typical partitioned mesh.
0
(a)
1
2
3 4 5 Number of Processors)
6
7
(b)
Figure 1. Partitioned mesh (a), Performance on SP2 and workstations dusters (b)
5. R e s u l t s and Discussion Performance tests have been conducted on the SGI Origin2000 and IBM SP2 platforms, and on clusters of workstations. The first test case applied a third-order method on a coarse mesh of only 800 elements. Near linear speedup is obtained on all machines until the partition size becomes too small (on more than 8 processors) and boundary effects dominate. The SP2 and clusters are profiled in Figure (1.b). This small problem was also run on two clusters of, resp., SGI and Sun workstations in an FDDI network. The two clusters consisted of similar but not identical hardware. The network was not dedicated to the cluster but carried other traffic. As the domain is divided over more processors, the
121
140 120
! i ..
! ! ! # Elements = ~1.0,392 -~.§ # Elements = 98,910 --+~-;~
.................................................. #.E!ements.-~,.~54;000...;~:, ~1 and p - /91 - 1 f o r z .:b'.'% ":%:~"-
Asc~a,ue
9 /e . . . . g . : ....
, o ....
......A:~5 : ~ j +
00
5;0
Asci Blue
1000 15'00 2000
. ..
2100 3000 3100 4000
102 102
\
~ Asci Red
103
Figure 1. Flop rates (left) and execution time reduction (right) for an Euler flow problem on three machines. Dashed lines show the ideal for each performance curve, based on a left endpoint of 128 processors.
operation whose time-complexity is sublinear in the number of processors. However, the cost-effectiveness of this brute-force approach towards petaflop/s is highly sensitive to frequency and latency of global reduction operations, and to modest departures from perfect load balance. As popularized with the 1986 Karp Prize entry of Benner, Gustafson & Montry, Amdahl's law can be defeated if serial (or bounded concurrency) sections make up a decreasing fraction of total work as problem size and processor count scale - - true for most explicit or iterative implicit PDE solvers. A simple, back-of-envelope parallel complexity analysis [4] shows that processors can be increased as fast, or almost as fast, as problem size, assuming load is perfectly balanced An important caveat relative to Beowulf-type systems is that the processor network must also be scalable (in protocols as well as in hardware). Therefore, all remaining four orders of magnitude could be met by hardware expansion. However, this this does not mean that fixed-size applications of today would run 104 times faster; these Gustafson-type analyses are for problems that are correspondingly larger. As encouraging evidence that even fixed-size CFD problems scale well into thousands of processors, we reproduce from [1], in Fig. 1, flop/s scaling and execution time curves for Euler flow over an ONERA M6 Wing, on a tetrahedral grid of 2.8 million vertices, run on up to 1024 processors of a 600 MHz T3E, 768 processors of IBM's ASCI Blue Pacific, and 3072 dual-processor nodes of Intel's ASCI Red. (The execution rate scales better than the execution time for this fixed-size problem, since as the subdomains get smaller with finer parallel granularity more of the work is redundant, on ghost regions.) 3. S O U R C E
~ 2 - M O R E E F F I C I E N T U S E OF F A S T E R P R O C E S S O R S
Looking internal to a processor, we argue that there are only two intermediate levels of the memory hierarchy that are essential to a typical domain-decomposed PDE simu-
268 Table 1 Flop/s rate and percent utilization, as a function of dense point-block size, which varies in "Incomp." and "Comp." formulations.
Processor Application Actual Mflop/s Pet. of Peak
II II
[[ T3E-900 Origin 2000 II SP R10000 II P2SC (4-card) II Alpha 21164 Incomp. Comp. Incomp. Comp. Incomp. Comp. 75 82 126 137 117 124 8.3 9.1 25.2 27.4 24.4 25.8
lation, and therefore that most of the system cost and performance cost for maintaining a deep multilevel memory hierarchy could be better invested in improving access to the relevant workingsets, associated with individual local stencils (matrix rows) and entire sub domains . Improvement of local memory bandwidth and multithreading together with intelligent prefetching, perhaps through processors in memoryto exploit it could contribute approximately an order of magnitude of performance within a processor relative to present architectures. Sparse problems will never have the locality advantages of dense problems, but it is only necessary to stream data at the rate at which the processor can consume it, and what sparse problems lack in locality, they can make up for by scheduling. With statically discretized PDEs, the schedule is periodic and predictable. The usual ramping up of processor clock rates and the width or multiplicity of instructions issued are other obvious avenues for per-processor computational rate improvement, but only if memory bandwidth is raised proportionally. Improvement of the low effciencies of most current sparse codes through regularity of reference is an active area of research that yields strong dividends for PDEs. PDEs have a simple, periodic workingset structure that permits effective use of prefetch/dispatch directives, and they have a luxurious amount of "slackness" (potential process concurrency in excess of hardware concurrency). Combined with intelligent processors-in-memory (PIM) features to do gather/scatter cache transfers and multithreading for latency that cannot be amortized by sufficiently large block transfers, PDEs can approach full utilization of processor cycles. An important architectural caveat is that high bandwidth is critical to support these other advanced features, since PDE algorithms do only (9(N) work for O(N) gridpoints worth of loads and stores. One to two orders of magnitude can be gained by catching up to the clock, through such advanced features, and by following the clock into the few-GHz range. Even without PIM, multithreading, and bandwidth (in words per second) equal to the processor clock rate times the superscalarity, one can see the advantage in blocking in the comparisons in Table i. For the same Euler flow system considered above, the problem was run incompressibly (with 4 • 4 blocks at each point) and compressibly (with 5 x 5 blocks at each point). On three different architectures, this modest improvement in reuse of cached data leads to a corresponding improvement in efficiency. We briefly consider the workingsets that are relevant to PDE solvers. The smallest consists of the unknowns, geometry data, and coefficients at a single multicomponent stencil, of size Ns. (N2c + Nc + Na). The largest consists of the unknowns, geometry
269 Data Traffic vs. Cache Size
stencil i s in cac e
mostvertices maximallyreused ......................
ilPiilTi.and i ONFLICTMISSES ....
-~ ,,,2n o~o,o / .....
in
COMPULSORYMISSES
Figure 2. Idealized model of cache traffic for fixed computation as cache size increases, showing two extreme knees and one gradual "knee."
data, and coefficients in an entire subdomain, of size (Nx/P). (N2c + Nc + Na). Most practical caches will be sized in between these two. The critical workingset to consider in relation to cache size is the intermediate one of the data in neighborhood collection of gridpoints/cells that is reused when the group of corresponding neighboring stencils is updated. As successive workingsets "drop" into a level of memory, capacity (and with effort conflict) misses disappear, leaving only compulsory misses, as sketched in the illustration of memory traffic generated from a fixed computation versus varying cache size in Fig. 2. There is no performance value in memory levels larger than subdomain, and little performance value in memory levels smaller than subdomain but larger than required to permit full reuse of most data within each subdomain subtraversal (middle knee, Fig. 2). The natural strategy based on this simple workingset structure is therefore, after providing an L1 cache large enough for smallest workingset (and multiple independent copies up to desired level of multithreading, if applicable), all additional resources should be invested in large L2. Furthermore, L2 should be of write-back type and its population should be under user-assist with prefetch/dispatch directives. Tables describing grid connectivity should be built (within each quasi-static grid phase) and stored in PIM used to pack/unpack dense-use cache lines during subdomain traversal. The costs of this greater per-processor efficiency are the programming complexity of managing the subdomain traversal, the space to store the gather/scatter tables in PIM, the time to (re)build the gather/scatter tables, and the memory bandwidth commensurate with peak rates of all processors. Unfortunately, current shared-memory machines have disappointing memory bandwidth for PDEs; the extra processors beyond the first sharing a memory port are often not useful.
270 Table 2 Experiments in grid edge reordering and data structure interlacing on various uniprocessors for the Euler flow problem. In Mflop/s. Final column shows relative speedups. clock Interlacing, Interlacing (MHz) Edge Reord. (only) Original Speedup Processor 120 43 13 P2SC (2-card) 97 7.5 126 26 250 74 4.8 R10000 15 4.4 332 66 34 604e 42 18 4.2 Ultra II 300 75 44 Alpha 21164 91 33 2.8 600 32 Pentium II (Linux) 400 84 48 2.6
4. S O U R C E ~ 3 : M O R E A R C H I T E C T U R E - F R I E N D L Y A L G O R I T H M S
Besides the two just considered classes of architectural improvements more and we consider two classes of algorithmic improvements: some that improve the raw flop rate and some that increase the scientific value of what can be squeezed out of the average flop. In this section, we mention higher-order discretization schemes, especially of discontinuous or mortar type, orderings that improve data locality, and iterative methods that are less synchronous than today's. Algorithmic practice needs to catch up to architectural demands, and several "one-time" gains remain to be contributed that could improve data locality or reduce synchronization frequency, while maintaining required concurrency and slackness. "One-time" refers to improvements by small constant factors, nothing that scales in N or P. Complexities are already near information-theoretic lower bounds for some CFD solvers, and we reject increases in flop rates that derive from less efficient algorithms, as defined by parallel execution time. A caveat here is that the remaining algorithmic performance improvements may cost extra space or may bank on stability shortcuts that occasionally backfire, making performance modeling less predictable. Perhaps an order of magnitude of performance remains here. Raw performance improvement from algorithms include: (1) spatial reorderings that improve locality, such as interlacing of all related grid-based data structures and ordering gridpoints and grid edges for L1/L2 reuse; (2) discretizations that improve locality, such as higher-order methods (which lead to larger denser blocks at each point than lower-order methods) and vertex-centering (which, for the same tetrahedral grid, leads to denser blockrows than cell-centering); (3) temporal reorderings that improve locality, such as block vector algorithms (these reuse cached matrix blocks; vectors in block are independent), and multi-step vector algorithms (these reuse cached vector blocks; vectors have sequential dependence); (4) temporal reorderings that reduce synchronization penalty, such as less stable algorithmic choices that reduce synchronization frequency (deferred orthogonalization, speculative step selection) and less global methods that reduce synchronization range by replacing a tightly coupled global process (e.g., Newton) with loosely coupled sets of tightly coupled local processes (e.g., Schwarz); and (5) precision better-suited processor/memory elements
271 reductions that make memory bandwidth seem larger, such as lower precision representation of preconditioner matrix coefficients or poorly known coefficients (arithmetic is still performed on full precision extensions). Table 2 (from data in [1]) shows some experimental improvements from spatial reordering on the same unstructured-grid Euler flow problem described earlier. 5. SOURCE #4: ALGORITHMS PACKING MORE "SCIENCE PER FLOP"
It can be argued that this last category of algorithmic improvements does not belong in a discussion focused on computational rates, at all. However, since the ultimate purpose of computing is insight, not petaflop/s, it must be mentioned as part of a balanced program, especially since it is not conveniently orthogonal to the other approaches. We therefore include a brief pitch for revolutionary improvements in the practical use of problem-driven algorithmic adaptivity in PDE solvers not just better system software support for well understood discretization-error driven adaptivity, but true polyalgorithmic and multiplemodel adaptivity. To plan for a "bee-line" port of existing PDE solvers to petaflop/s architectures and to ignore the demands of the next generation of solvers will lead to petaflop/s platforms whose effectiveness in scientific and engineering computing might be equivalent to less powerful but more versatile platforms. The danger of such a pyrrhic victory is real. Some algorithmic improvements do not improve flop rate, but lead to the same scientific end in the same time at lower hardware cost (less memory, lower operation complexity). A caveat here is that such adaptive programs are more complicated and less thread-uniform than those they improve upon in quality/cost ratio. They are not daunting, conceptually, but they put an enormous premium on dynamic load balancing. An order of magnitude or more can be gained here for many problems. Some examples of adaptive opportunities are: (1) spatial discretization-based adaptivity, in which discretization type and order are varied to attain required approximation to the continuum everywhere without over-resolving in smooth, easily approximated regions; (2) fidelity-based adaptivity, in which the continuous formulation is varied to accommodate physical complexity without enriching physically simple regions; and (3) "stiffness"-based adaptivity, in which the solution algorithm is changed to provide more powerful, robust techniques in regions of space-time where discrete problem is linearly or nonlinearly stiff, without extra work in nonstiff, locally well-conditioned regions. What are the status and prospects for such advanced adaptivity? Appropriate metrics to govern the adaptivity and procedures to exploit them are already well developed for some discretization techniques, including method-of-lines ODE solutions to stiff IBVPs and DAEs, and FEA for elliptic BVPs. This field is fairly wide open for other types of numerical analyses. Fidelity-based multi-model methods have been used in ad hoc ways in numerous commercially important engineering codes, e.g., Boeing TRANAIR [5]. Polyalgorithmic solvers have been demonstrated in principle e.g., [3], but rarely in the "hostile" environment of high-performance multiprocessing. These advanced adaptive approaches demand sophisticated software approaches, such as object-oriented programming. Management of hierarchical levels of synchronization (within a region and between regions) is also required. User-specification of hierarchical priorities of different threads would also
272 be desirable - - so that critical-path computations can be given priority, while subordinate computations fill up unpredictable idle cycles with other subsequently useful work. An experimental example of new opportunities for localized algorithmic adaptivity is described surrounding Figs. 5 and 6 in [2]. For transonic full potential flow over a NACA airfoil, solved with Newton's method, excellent progress in residual reduction is made for the first few steps and the last few steps. In between, a shock develops and creeps downwing until it "locks" into its final location, while the rest of flow field is "held hostage" to this slowly converging local feature, whose stabilization completely dominates execution time. Resources should be allocated differently before and after shock location stabilizes. 6. S U M M A R Y
To recap in reverse order, the performance improvement possibilities that suggest that petaflop/s is within reach for PDEs are: (1) algorithms that deliver more "science per flop" possibly large problem-dependent factor, through adaptivity (though we won't count this towards rate improvement); (2) algorithmic variants that are more architecturefriendly, which we expect to contribute half an order of magnitude, through improved locality and relaxed synchronization; (3) more efficient use of processor cycles, and faster processor/memory from which we expect one-and-a-half orders of magnitude, through memory-assist language features, PIM, and multithreading; and (4) an expanded number of processors to which we look for the remaining two orders of magnitude. The latter will depend upon more research in dynamic balancing and extreme care in implementation. 7. A C K N O W L E D G M E N T S The author would like to thank his direct collaborators on computational examples reproduced in this chapter from earlier published work: Kyle Anderson, Satish Balay, Xiao-Chuan Cai, Bill Gropp, Dinesh Kaushik, Lois McInnes, and Barry Smith. Computer resources were provided by DOE (Argonne, LLNL, NERSC, Sandia), and SGI-Cray. REFERENCES
1. W . K . Anderson, W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith. Achieving high sustained performance in an unstructured mesh CFD application. In Proceedings of SC'99 (CDROM), November 1999. 2. X.-C. Cai, W. D. Gropp, D. E. Keyes, R. G. Melvin, and D. P. Young. Parallel NewtonKrylov-Schwarz algorithms for the transonic full potential equation. SIAM J. Sci. Comput., 19:246-265, 1998. 3. A. Ern, V. Giovangigli, D. E. Keyes, and M. D. Smooke. Towards polyalgorithmic linear system solvers for nonlinear elliptic systems. SIAM Y. Sci. Comput., 15:681-703, 1994. 4. D.E. Keyes. How scalable is domain decomposition in practice? In C.-H. Lai et al., editor, Proceedings of the 11th International Conference on Domain Decomposition Methods, pages 286-297. Domain Decomposition Press, Bergen, 1999. 5. D.P. Young, R. G. Melvin, M. B. Bieterman, F. T. Johnson, S. S. Samant, and J. E. Bussoletti. A locally refined rectangular grid finite element method: Application to computational fluid dynamics and computational physics. J. Cornp. Phys., 92:1-66, 1991.
Parallel Computational Fluid Dynamics Towards Teraflops, Optimizationand Novel Formulations D. Keyes,A. Ecer, J. Periaux,N. Satofukaand P. Fox (Editors) 92000 Elsevier Science B.V. All rights reserved.
273
D e s i g n of L a r g e - S c a l e P a r a l l e l S i m u l a t i o n s Matthew G. Knepley Ahmed H. Sameh Vivek Sarin Computer Science Department, Purdue University, West Lafayette, IN 47906-1398 {knepley, sameh, sarin}~cs.purdue.edu We present an overview of the design of software packages called Particle Movers that have been developed to simulate the motion of particles in two and three dimensional domains. These simulations require the solution of nonlinear Navier-Stokes equations for fluids coupled with Newton's equations for particle dynamics. Furthermore, realistic simulations are extremely computationally intensive, and are feasible only with algorithms that can exploit parallelism effectively. We describe the computational structure of the simulation as well as the data objects required in these packages. We present a brief description of a particle mover code in this framework, concentrating on the following features: design modularity, portability, extensibility, and parallelism. Simulations on the SGI Origin2000 demonstrate very good speedup on a large number of processors. 1. O V E R V I E W
The goal of our KDI effort is to develop high-performance, state-of-the-art software packages called Particle Movers that are capable of simulating the motion of thousands of particles in two dimensions and hundreds in three dimensions. Such large scale simulations will then be used to elucidate the fundamental dynamics of particulate flows and solve problems of engineering interest. The development methodology must encompass all aspects of the problem, from computational modeling for simulations in Newtonian fluids that are governed by the Navier-Stokes equations (as well as in several popular models of viscoelastic fluids), to incorporation of novel preconditioners and solvers for the nonlinear algebraic equations which ultimately result. The code must, on the one hand, be highlevel, modular, and portable, while at the same time highly efficient and optimized for a target architecture. We present a design model for large scale parallel CFD simulation, as well as the PM code developed for our Grand Challenge project that adheres to this model. It is a true distributed memory implementation of the prototype set forth, and demonstrates very good scalability and speedup on the Origin2000 to a maximum test problem size of over half a million unknowns. Its modular design is based upon the GVec package [14] for PETSc, which is also discussed, as it forms the basis for all abstractions in the code. PM has been ported to the Sun SPARC and Intel Pentium running Solaris, IBM SP2 running AIX, Origin2000 running IRIX, and Cray T3E running Unicos with no explicit
274 code modification. The code has proved to be easily extensible, for example in the rapid development of a fluidized bed experiment from the original sedimentation code. The code for this project was written in Mathematica and C. The C development used the PETSc framework developed at Argonne National Laboratory, and the GNU toolset. The goal of our interface development is mainly to support interoperability of scientific software. "Interoperability" is usually taken to mean portability across different architectures or between different languages. However, we propose to extend this term to cover what we call algorithmic portability, or the ability to express a mathematical algorithm in several different software paradigms. The interface must also respect the computational demands of the application to enable high performance, and provide the flexibility necessary to extend its capabilities beyond the original design. We first discuss existing abstractions present in most popular scientific software frameworks. New abstractions motivated by computational experience with PDEs arising from fluid dynamics applications are then introduced. Finally, the benefits of this new framework is illustrated using a novel preconditioner developed for the Navier-Stokes equations. Performance results on the Origin2000 are available elsewhere [11].
2. E X I S T I N G
ABSTRACTION
The existing abstractions in scientific software frameworks are centered around basic linear algebra operations necessary to support the solution of systems of linear equations. The concrete examples from this section are taken from the PETSc framework [18], developed at Argonne National Laboratory, but are fairly generic across related packages [I0]. These abstractions differ from the viewpoint adopted for the BLAS libraries [4] in the accessibility of the underlying data structure. BLAS begins by assuming a data format and then formulates a set of useful operations which can be applied to the data itself. Moreover, the data structure itself is typically exposed in the calling sequence of the relevant function. In the data structure neutral model adopted by PETSc, the object is presented to the programmer as an interface to a certain set of operations. Data may be read in or out, but the internal data structure is hidden. This data structure independence allows many different implementations of the same mathematical operation. The ability to express a set of mathematical operations in terms of generic interfaces, rather than specific operations on data, is what we refer to as algorithmic portability. This capability is crucial for code reuse and interoperability [5]. The operations of basic linear algebra have been successfully abstracted in to the Vector and Matrix interfaces present in PETSc. A user need not know anything about the internal representation of a vector to obtain its norm. Krylov solvers for linear systems are almost entirely based upon these same operations, and therefore have also been successfully abstracted from the underlying data structures. Even in the nonlinear regime, the Newton-Krylov iteration adds only further straightforward vector operations to the linear solve. These abstractions fail when we discretize a continuous problem, or utilize information inherent in that process to aid in solving the system. A richer set of abstractions is discussed in the next section.
275 3. H I G H E R
LEVEL ABSTRACTIONS
To obtain a generic expression of more modern algorithms, we must supplement our set of basic operations and raise the level of abstraction at which we specify algorithmic components [3,2]. For instance, in most multilevel algorithms there is some concept of a hierarchy of spaces, and in finite element based code this hierarchy must be generated by a corresponding hierarchy of meshes [7,1,17,19]. Thus the formation of this hierarchy should in some sense be a basic operation. The detailed nature of the meshes affects the performance of an algorithm but not its conceptual basis; however, it might be necessary to specify nested meshes or similar restrictions. Raising the level of abstraction, by making coarsening a basic operation, allows the specification of these algorithms independently of the detailed operations on the computational mesh. We propose a set of higher level abstractions to encapsulate solution algorithms for mesh-based PDEs that respects the current PETSc-like interface for objects such as vectors and matrices. A Mesh abstraction should allow the programmer to interrogate the structure of the discrete coordinate system without exposing the details of the storage or manipulation. Thus it should allow low level queries about the coordinates of nodes, elements containing a given node or generic point, and also support higher level operations such as iteration over a given boundary, automatic partitioning and automatic coarsening or refinement. In the GVec libraries, a Partition interface is also employed for encapsulating the layout of a given mesh over multiple domains. A mesh combined with a discretization prescription on each cell provides a description of the discrete space which we encapsulate in the Grid interface. The Grid provides access to both the Mesh and a Discretization. The Discretization for each field which understands how to form the weak form of a function or operator on a given cell of the mesh. In GVec, Operator abstractions are used to encapsulate the weak form of continuous operators on the mesh, and provide an interface for registration of user-defined operators. This capability is absolutely essential as the default set of operators could never incorporate enough for all users. Finally, the Grid must maintain a total ordering of the variables for any given discretized operator or set of equations. In GVec, the VarOrdering interface encapsulates the global ordering and ties it to the nodes in the mesh, while the Local VarOrdering interface describes the ordering of variables on any given node. The Grid must also encapsulate a given mathematical problem defined over the mesh. Thus it should possess a database of the fields defined on the mesh, including the number of components and discretization of each. Furthermore, each problem should include some set of operators and functions defining it which should be available to the programmer through structured queries. This allows runtime management of the problem which can be very useful. For example, in the Backward-Euler integrator written for GVec, the identity operator is added automatically with the correct scaling to a steady-state problem to create the correct time-dependence. The only information necessary from the programmer is a specification of the the fields which have explicit time derivatives in the equation. This permits more complicated time-stepping schemes to be used without any change to the original code. 
The Grid should also maintain information about the boundary conditions and constraints. In GVec, these are specified as functions defined over a subset of mesh nodes identified by a boundary marker, which is given to each node during
276 mesh generation. This allows automatic implementation of these constraints with the programmer only specifying the original continuous representation of the boundary values. 3.1. N o d e Classes In order to facilitate more complex distributions of variables over the mesh, GVec employs a simple scheme requiring one extra integer per mesh node. We define a class as a subset of fields defined on some part of the mesh. Each node is assigned a class, meaning that the fields contained in that class are defined on that node. For example, in a P2/P1 discretization of the Stokes equation on a triangular mesh, the vertices might have class 0, which includes velocity and pressure, whereas the midnodes on edges would have class 1, including only velocity. Thus this scheme easily handles mixed discretizations. However, it is much more flexible. Boundary conditions can be implemented merely by creating a new class for the affected nodes that excludes the constrained fields, and the approach for constraints is analogous. This information also permits these constraints to be implemented at the element level so that explicit elimination of constrained variables is possible, as well as construction of operators over only constrained variables. This method has proven to be extremely flexible while incurring minimal overhead or processing cost. 4. M U L T I L E V E L P R E C O N D I T I O N I N G
FOR NAVIER-STOKES
As an example of the utility of higher level abstractions in CFD code, we present a particulate flow problem [9,11] that incorporates a novel multilevel preconditioner [17] for the Navier-Stokes equations augmented by constraints at the surface of particles [13]. The fluid obeys the Navier-Stokes equations and the particles each obey Newton's equations. However, the interior forces need not be explicitly calculated due to the special nature of our finite element spaces [8,9,15]. These equations are coupled through a no-slip boundary condition at the surface of each particle. The multilevel preconditioner is employed to accelerate the solution of the discrete nonlinear system generated by the discretization procedure. 4.1. P r e c o n d i t i o n i n g The nonlinear system may be formulated as a saddle-point problem, where the upper left block A is a nonlinear operator, but the constraint matrix B is still linear: ( A
B
u)=
(f).
(1)
Thus we may approach the problem exactly as with Stokes, by constructing a divergenceless basis using a factorization of B. The ML algorithm constructs the factorization
pT B V
=
(D
vTv
=
I,
'
(2)
(3)
where V is unitary, D is diagonal, but P is merely full rank. If P were also unitary we would have the SVD, however this is prohibitively expensive to construct. Using this factorization, we may project the problem onto the space of divergenceless functions spanned by P2. Thus we need only solve the reduced nonlinear problem (4) in the projected space
P~APj~2 = pT (f _ APe D - T v T9)
(4)
277 The efficiency issues for this scheme are the costs of computing, storing, and applying the factors, as well as the conditioning of the resulting basis. The ML algorithm can store the factors in O(N) space and apply them in O(N) time. The conditioning of the basis can be guaranteed for structured Stokes problems, and good performance is seen for two and three dimensional unstructured meshes. The basis for the range and null space of B are both formed as a product of a logarithmic number of sparse matrices. The algorithm proceeds recursively, coarsening the mesh, computing the gradient at the coarse level B, and forming one factor at each level. The algorithm terminates when the mesh is coarsened to a single node, or at some level when an explicit QR decomposition of B can be accomplished. In a parallel setting, the processor domains are coarsened to a single node and then a QR decomposition is carried out along the interface. N
4.1.1. Software Issues The ML algorithm must decompose the initial mesh into subdomains, and in each domain form the local gradient operator. Thus if ML is to be algorithmically portable, we must have basic operations expressing these actions. The ability of the Mesh interface to automatically generate this hierarchy allows the programmer to specify the algorithm independently of a particular implementation, such as a structured or unstructured mesh. In fact, the convergence of ML is insensitive to the particular partition of the domain so that a user may select among mesh coarsening algorithms to maximum other factors in the performance of the code. Using the Grid interface to form local gradients on each subdomain frees the programmer from the details of handling complex boundary conditions or constraints, such as those arising in the particulate flow problem. For example, the gradient in the particulate flow problem actually appears as
B-(
Bx
)'
(5)
where B I denotes the gradient on the interior and outer boundary of the domain, BF is the gradient operator at the surface of the particles, and P is the projector from particle unknowns to fluid velocities at the particle surface implementing the no-slip condition. This new gradient operator has more connectivity and more complicated analytic properties. However, the code required no change in order to run this problem since the abstractions used were powerful enough to accommodate it. 4.1.2. Single Level R e d u c t i o n We begin by grouping adjacent nodes into partitions and dividing the edges into two groups: interior edges connecting nodes in one partition, and boundary edges connecting
partitions.
The gradient matrix may be reordered to represent this division, ( BI ) BF " The upper matrix Be is block diagonal, with one block for each partition. Each block represents the restriction of the gradient operator to that cluster. Furthermore, we may
factoreachblockindependentlyusingtheSVD, sothatifUiBiViT-( Di 0 )
we may factor
each domain independently. We now use these diagonal matrices to reduce the columns in by block row reduction. A more complete exposition may be found in [13].
BFVr
278 4.1.3. R e c u r s i v e F r a m e w o r k
We may now recursively apply this decomposition to each domain instead of performing an exact SVD. The factorization process may be terminated at any level with an explicit QR decomposition, or be continued until the coarsest mesh consists of only a single node. The basis P is better conditioned with earlier termination, but this must be weighed against the relatively high cost of QR factorization. Thus, we have the high level algorithm 1. Until numNodes < threshold do: m
(a) Partition mesh (b) Factor local operator (c) Block reduce interface (d) Coarsen mesh 2. QR factor remaining global operator 5. C O N C L U S I O N S The slow adoption of modern solution methods in large CFD codes highlights the close ties between interoperability and abstraction. If sufficiently powerful abstractions are not present in a software environment, algorithms making use of these abstractions are effectively not portable to that system. Implementations are, of course, possible using lower level operations, but these are prone to error, inflexible, and very time-consuming to construct. The rapid integration of the ML preconditioner into an existing particulate flow code[16] demonstrates the advantages of these more powerful abstractions in a practical piece of software. Furthermore, in the development and especially in the implementation of these higher level abstractions, the architecture must be taken into account. It has become increasingly clear that some popular computational kernels, such as the sparse matrix-vector product, may be unsuitable for modern RISC cache-based architectures[17]. Algorithms such as ML which possess kernels that perform much more work on data before it is ejected from the cache should be explored as an alternative or supplement to current solvers and preconditioners. REFERENCES
1. Achi Brandt, Multi-Level Adaptive Solutions to Boundary-Value Problems, Mathematics of Computation 31 (1977) 333--390. 2. David L. Brown, William D. Henshaw, and Daniel J. Quinlan, Overture: An Object Oriented Framework for Solving Partial Differential Equations, in: Scientific Computing in Object-Oriented Parallel Environments, Lecture Notes in Computer Science 1343 (Springer, 1997). Overture is located at http://www, l l n l . g o v / c a s c / 0 v e r t u r e . 3. H.P. Langtangen, Computational Partial Differential Equations--Numerical Methods and Diffpack Programming (Springer-Verlag, 1999). Diffpack is located at http://www, nobj e c t s . corn/Product s/Dif fpack. 4. Jack Dongarra, J. DuCroz, S. Hammarling, and R. Hanson, A proposal for an extended set of Fortran basic linear algebra subprograms, Technical Memo 41, Mathematics and Computer Science Division, Argonne National Laboratory, December, 1984.
279 The ESI Forum is located at h t t p : / / z , ca. sandia.gov/esi. 6. William D. Gropp, Dinesh K. Kaushik, David E. Keyes, and Barry F. Smith, Cache Optimization in Multicomponent Unstructured-Grid Implicit CFD Codes, in: Proceedings of the Parallel Computational Fluid Dynamics Conference (Elsevier, 1999). Ami Harten, Multiresolution representation of data: General framework, SIAM Journal on Numerical Analysis 33 (1996) 1205-1256. Todd Hesla, A Combined Fluid-Particle Formulation, Presented at a Grand Challenge Group Meeting (1995). Howard Hu, Direct simulation of flows of solid-liquid mixtures, International Journal of Multiphase Flow 22 (1996). 10. Scott A. Hutchinson, John N. Shadid, and Ray S. Tuminaro, Aztec User's Guide Version 1.0, Sandia National Laboratories, TR Sand95-1559, (1995). 11. Matthew G. Knepley, Vivek Sarin, and Ahmed H. Sameh, Parallel Simulation of Particulate Flows, in: Solving Irregularly Structured Problems in Parallel, Lecture Notes in Computer Science 1457 (Springer, 1998). 12. Denis Vanderstraeten and Matthew G. Knepley, Paralell building blocks for finite element simulations: Application to solid-liquid mixture flows, in: Proceedings of the Parallel Computational Fluid Dynamics Conference (Manchester, England, 1997). 13. Matthew G. Knepley and Vivek Sarin, Algorithm Development for Large Scale Computing: A Case Study, in: Object-Oriented Methods for Interoperable Scientific and Engineering Computing (Springer, 1999). 14. Matthew G. Knepley, GVec Beta Release Documentation, available at .
http ://www. cs. purdue, edu/home s/knep i ey/comp_ f luid/gvec, nb. ps.
15. Matthew G. Knepley, Masters Thesis, University of Minnesota, available at http: //www. cs. purdue, edu/home s/knep 1e y / i t er_meth. The Mathematica software and notebook version of this paper may be obtained at h t t p ://www. cs. purdue, edu/homes/knepley/iter_meth. 16. Vivek Sarin, An efficient iterative method for Saddle Point problems, PhD thesis, University of Illinois, 1997. 17. Vivek Sarin and Ahmed H. Sameh, An efficient iterative method for the generalized Stokes problem, SIAM Journal on Scientific Computing, 19 (1998) 206-226. 18. Barry F. Smith, William D. Gropp, Lois Curman McInnes, and Satish Balay, PETSc 2.0 Users Manual, Argonne National Laboratory, TR ANL-95/11, 1995, available via ftp://www .mcs. anl/pub/pet sc/manual, ps. 19. Shang-Hua Teng, Coarsening, Sampling, and Smoothing: Elements of the Multilevel Method, Unpublished, 1999.
This Page Intentionally Left Blank
Parallel ComputationalFluid Dynamics Towards Teraflops, Optimizationand Novel Formulations D. Keyes,A. Ecer, J. Periaux,N. Satofukaand P. Fox (Editors) 92000 Elsevier Science B.V. All rights reserved.
281
A n Efficient Storage T e c h n i q u e for Parallel Schur C o m p l e m e n t M e t h o d and Applications on Different Platforms S. Kocak* and H.U. Akay Computational Fluid Dynamics Laboratory, Department of Mechanical Engineering, Purdue School of Engineering and Technology, IUPUI, Indianapolis, IN 46202, USA.
A parallel Schur complement method is implemented for solution of elliptic equations. After decomposition of the total domain into subdomains, interface equations are expressed in a coupled form with the subdomain equations. The solution of subdomains is performed after solving the interface equations. In this paper, we present an efficient formation of interface and subdomain equations in parallel environments using direct solvers which take into account sparse and banded structures of the subdomain coefficient matrices. With such an approach, we can solve larger system of equations than in our earlier work. Test cases are presented on different platforms to illustrate the performance of the algorithm. 1. INTRODUCTION Parallel solution of large-scale problems attracted attention of many researchers more than a decade for which efficient solution techniques are being sought continuously. Due to advances in computer hardware, either new techniques are being proposed or many of the existing techniques are modified to exploit the advantages of new computer architectures. One of the existing methods used to solve large-scale problems is a substructuring technique known as the Schur complement method in the literature. Although the substructuring method dates back to 1963 on limited memory and sequential computers [ 1], the adaptation of this technique to parallel processors appears to be first introduced by Farhat and Wilson [2]. In spite of previous developments, speedup and efficient storage implementation of substructuring technique poses some major challenges. In this technique, after decomposition of the total domain into subdomains, the interface equations are formed in a Schur complement form. The solution of subdomains is performed after solving the interface equations. The size of interface equations increases with the number of subdomains used to represent the domain. While the subdomain matrices are sparse and banded, the interface matrix is dense. As a result, storage and solution of the interface equations require excessive memory and CPU time, respectively. Hence, special care must be given to efficient use of storage and CPU time. In this study, we extend our previously developed parallel algorithm [3] for solution of larger size systems by forming Schur complement equations more efficiently and test it on different platforms. More specifically, we present results on UNIX based workstations and Pentium PCs using WINDOWS/NT and LINUX operating systems.
*Visiting from Civil Engineering Department, Cukurova University, Adana, Turkey.
282
2. P A R A L L E L SCHUR C O M P L E M E N T A L G O R I T H M In substructuring method, the whole structure (domain) is divided into substructures (subdomains) and the global solution is achieved via coupled solution of these substructures. Here, for sake of simplicity, we will assume that each substructure is assigned to a single processor, referred to as subprocessor. After finite element discretizations, we can express system of equations of a domain with N subdomains in the following familiar form: -
-Kll
0
0
Klr -Pl
K22
0
K2F
-fl
P2
f2
(1)
0 _Krl
0 Kr2
KNN KNr PN] fN 9 9 KrN
Krr
prJ
fr
where K ii, (i = 1..... N), K r r , and KiF = KT i denote, respectively, coefficient matrices of the subdomains ~i, the interface F, and the coupling between subdomains and the interface. The same is true for subdomain vectors P i and fi, and the interface vectors PF and fF. The interface equation of the system can be written in the form: Krr=9
KFi K~ 1KiF P F = f F - - ~ K F i K~ 1 fi /=1
(2)
The equation system given in Eq. (2) is known as the Schur complement equation. Here, it must be noted that the coefficient matrix of the interface is a large and dense matrix. The dimension of this matrix depends on the number of unknowns on the interface, where the number of interface unknowns increases with the number of subdomains. In Figure 1, various subdomains of a square domain are shown for different number of subdivisions. As can be seen from the figure, the interface of the subdomains is not regular. With a greedy-based divider algorithm we have used here [2], the interface sizes of subdomains are not balanced even though nearly equal number of grid points is obtained in subdomains. Depending upon the finite element mesh used, the number of unknowns of interface equations may become very large and unbalanced. For the solution of Schur complement matrix equations, direct or iterative solvers may be applied. Here, we concentrate on assembling the Schur complement equation, which requires solution of a series of equations involving coupling between subdomains and interfaces. We can further express Eq. (2) in compact form as (K r r - GdT)Pr = f r - g r
(3)
where N Grr=~Kr/K~IKir i=1
N
,
gr=~KriKi-ilfi
(4)
i=1
Contributions to G r r to g r are computed by each subdomain processor without any message
283 passing, and then the interface coefficient matrix K r r = (K rr -C, rr) and the source vector f-r" = ( f r - g r ) are assembled via message passing. The solution of interface matrix may be implemented either using direct or iterative parallel solvers. For both of these techniques, the interface matrix is separated into blocks and distributed to processors, and the solution is achieved concurrently.
Figure 1. Domain divided into two, four, five, and ten subdomains. The two terms of Eqs. (4) can be written as
Air'=K~lKir,
A i = K ~ 1 fi
(5)
and to eliminate inversion of subdomain matrices, Eqs. (5) are expressed as
K ii A iF =K iF ,
K ii A i-- fi
(6)
Here, this system of equations is solved for every column of Air and column of A i . The number of columns of Air is governed by the number of unknowns on interface of that subdomain. In this case, we encounter a repeated right hand side system, which can be solved by using a direct solver very efficiently without any message passing. For the parallel solution of interface equations it is possible to use direct or iterative solvers. Both direct and iterative solvers are proposed in the literature, e.g., [2, 4-6]. Here, following the work of Farhat and Wilson [2], we use a direct solver approach, in which each row of the interface matrix is assigned to a processor as schematically illustrated in Figure 2, where we assume that there are three processors. In implementation of the interface solver, the Schur complement matrix has to be formed and distributed to related processors. As can be seen from Eqs. (3) and (4), dimensions of the Schur complement matrix may become very large in large-scale systems. Since contributions to G r r and g r are computed in each subprocessor to assemble the interface matrices of subprocessors, message passing between subprocessors is required. Once the subdomain and coupling terms given in Eqs. (4) are assembled by each subprocessor using Eq. (6), contributions to G r r and g r are computed using a direct solver. Here, we will present the algorithm for assembling the interface matrices without assembling the whole Schur complement matrix. The algorithm given in Table 1 is identical for all subprocessors.
284
-X
X
Processor 1 Processor 2
X
~d
Processor 3
X
X
X
X
X
X
X X
Processor 1
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
x
X
Processor 2 Processor 3
X
~d
X
Figure 2. Distribution of interface matrix to subprocessors for a direct solver.
Table 1. Algorithm for assembling Schur complement matrix for each subprocessor.
do k=l, np (np= number of subprocessors) do I=1, neqs (neqs = number of equations on interface) if contribute to k th subprocessor then form R i which is k th column of Kit, solve KiiUi = R i for U i compute Pi = KFi U i which is contribution to G r r pack Pi to buffer to be send to k th subprocessor endif enddo form R i which is equal to fi solve KiiUi = R i for U i compute Pi = K r'i U i which is contribution to g r pack Pi to buffer to be send to k th subprocessor enddo m
In our earlier study [3], we assembled the entire Schur complement matrix K r r o n a single processor by receiving contributions to G r-r, and g r from subprocessors and then distributed it to subprocessors. Since K r r is a dense matrix, a lot of memory is needed for its storage. This limited the size of the problems which could be solved compared to the ones this new storage algorithm allows. Direct linear solvers are used here for the solution of both subdomain and interface equations of Poisson pressure equations arising in incompressible viscous flows. For the sake
285 of savings in storage, every subdomain has a locally numbered interface. Moreover, contribution to a subprocessor's interface terms is determined by the corresponding global equation number. Therefore, through the use of the algorithm given in Table 1 for the solution of interface equations, one does not need to assemble the Schur complement matrix K r r which might consume a lot of memory space. Banded nature of the subdomain equations are accounted for by using a skyline direct solution technique [7]. More details for the complete substructuring algorithm used in this study are given in [3]. For parallelization, PVM [8] (Parallel Virtual Machine) is used as the message passing interface. 3. E X A M P L E P R O B L E M S AND RESULTS
The computer algorithm developed here is implemented on MIMD distributed memory architectures with UNIX, WlNDOWS/NT, and LINUX operating systems. Timings at different steps of the program are taken to assess the algorithm and the platform. The platforms used have different CPU speeds and message passing interfaces. Since distributed memory architectures are used, message passing interfaces (switches) play an important role in the efficiency studies. The computer program is developed using a master-slave approach where one processor is reserved for master which divides the domain into subdomains and assign a processor (subprocessor) for each subdomain. Before discussing the results, we list below some of the important features of the platforms we used in our tests: IBM RS6K: UNIX/OS, 16 processors with 160 MHz CPU speed and two different switches with (a) 100 Mb/sec and (b) 10 Mb/sec bandwidths. In the remainder, we will identify these switches as fast and slow, corresponding to 100 and 10 MHz bandwidths, respectively. This system is located at IBM Research Center in Kingston, New York. Pentium II: LINUX/OS, 32 dual processor machines with 400 MHz CPU speed. In networking, 4 machines are connected to routers via 100 Mb/sec switch and total 8 routers are connected via 1000 Mb/sec switches. This system is located at NASA Glenn Research Center, Cleveland, Ohio. Pentium II: WINDOWS/NT, 13 processors with 400 MHz speed and 100 Mb/sec switch. This system is located in our CFD laboratory at IUPUI, Indianapolis. For the examples presented here, each subdomain is assigned to a different subprocessor, i.e., each subprocessor deals with only one subdomain so that, for message passing, switches are used instead of CPU. Therefore, the number of equations of subdomains (blocks) and interface changes with the number of blocks or subprocessors used in the analyses. In Figure 3, the elapsed speedup diagram of the program tested on an IBM RS6K UNIX workstation with fast ethernet is depicted for three example problems having 201,000 (Case 1), 120,600 (Case 2), and 30,300 (Case 3) equations, in a square computational domain as shown in Figure 1. It is seen from the diagram that the speedup of the algorithm is good for all the cases. It also can be seen from the diagram that for larger problems the speedup is better, which can be explained with the problem becoming more computation bound, since for the platform used in this case, CPU speed is faster than that of message passing. The speedups, greater than 100% efficiency are attributed to the efficient memory allocation and efficient direct solver operations for reduced size matrices (i.e., increased locality). The
286 fluctuations in speedups are explained by the varying computational loads of processors as well as the variations in bandwidth sizes. The greedy algorithm-based divider we have used is able to balance the number of elements in a domain but not the bandwidth and interface sizes of the subdomains. Shown in Figure 4 are the variations of the total communication and computation times with respect to the number of processors for Case 3 (30,300 equations) using the fast-switch RS6K. It is noted here that the delays which are caused by unbalanced work load of processors are added into the communication time since during the waiting times no CPU is used. The ratio of communication to computation time reaches to 72% for 15 processors. This ratio is however, smaller for larger systems (23% for Case 2 and 15% for Case 1). Figure 5 shows speedup diagrams of LINUX system for the same three cases. In this diagram, it is seen that the speedup is also better for larger systems, though not as good as in the fast-switch RS6K system. The speedups for Cases 2 and 3 decrease after ten processors, which may be attributed to network properties and the operating system used. Figure 6 shows the speedup diagrams for elapsed times on three different platforms with UNIX, LINUX and WlNDOWS/NT operating systems, for Case 3 (30,300 equations). The purpose of this diagram is to illustrate the effect of platform on speedup. Case 3 is chosen, since it is the most communication-bound problem among the three cases considered in this study. It is seen that, for the UNIX platform with the fast switch, the speedup is better than the others. The speedup curves of the remaining 3 platforms decrease for the 15 processor case which means the message passing interfaces of those platforms slow down the program due to their low speed message passings. The difference between that of the UNIX system with the fast and the slow switches illustrates the effect of message passing to speedup. As can be seen in the figure, the speedup value of the UNIX system for the 15 processors drops drastically when a slow switch is used. The difference between the elapsed and CPU speedup is attributed to message passing and delays between the processors. Since message passing is directly related to the size of the interface used, we expect that balancing the work load among subprocessors will minimize the time delays substantially. Here, it should be noted that besides the message passing interface, operating system and hardware properties of different platforms also affect the speed of the programs being tested. In Table 2, subdomain and interface data indicating matrix storage requirements as well as equation and bandwidth sizes are summarized with varying subdomains for Case 2. For the interface coefficient matrices, global and local terms are used to refer assembling the whole interface matrix versus the distributed interface matrices, respectively. Since the interface matrix is distributed to different processors without a global assembly, considerable amount of memory space is saved, which makes the solution of large size problems possible. The values given in the same table correspond to the memory requirements of the skyline solution technique [7] which is used here. The variations in maximum and minimum bandwidth sizes explain the fluctuations in speedups and efficiencies in different cases. Fluctuations would have been less if banded nature of the equations were not accounted for. 
However, that would have resulted with much longer elapsed times. 4. CONCLUSIONS The Schur complement algorithm presented here provides both memory and speedup efficiencies due to reduction in matrix sizes and operation counts. Using this algorithm,
287
large-scale problems, which could not be solved in our earlier study, can now be solved. The proposed algorithm gives better efficiencies for larger systems. The choice of an algorithm for dividing a domain into subdomains is crucial. It directly affects the number of unknowns on the interface and computational loading of subdomains, and hence the overall efficiency. The platform properties strongly affect the efficiency of the algorithms, therefore to make a general statement about the efficiency of an algorithm, by testing it only on a single platform might lead to unrealistic results. To avoid this, the properties of the platform and algorithm must be studied carefully.
2000
- - = Case 3 --Case 2 Case 1 ............. T h e o r e t i c a l
- -
--
~ f 8
l
computation
12 ,x:l o
- - c o m m u n i c a t i o n + delays
1600
~
"~ -
1200
. . . . .
~. ~oo
/
400 ..,.._,, . - , , - - - .,.= .-,. 0
,
0
,
2
,
4
6
,
|
8
10
,
12
,
14
|
16
,
,
,
|
,
2
4
6
8
10
12
,
i
14
16
18 N u m b e r o f Processors
Number
of Processors
Figure 3. Elapsed time speedups with the FastSwitch RS6K system.
Figure 4. Communication and computation times for Case 3. 20-
--
3
--Case
16-
1
Case
"-- -- LINUX R S 6 K FS - - " " R S 6 K SS = = = W I N D O W S N T FS ............ T h e o r e t i c a l
= = = Case 2
~.1~
~
12 ~9 8
,,..~
8"
0
,
,
,
,
,
,
,
,
2
4
6
8
10
12
14
16
18
Number of Processors
0
|
i
|
i
|
|
|
i
2
4
6
8
10
12
14
16
Number
1
of Processors
Figure 5. Elapsed time speedups with the LINUX
Figure 6. Elapsed speedups on various
system.
platforms (Case 3).
ACKNOWLEDGEMENTS The authors would like to gratefully acknowledge the computer accesses provided by the NASA Glenn Research Center in Cleveland, Ohio and the IBM Research Center in Kingston, New York. The financial support provided to the first author by the Scientific and Technical Research Council of Turkey (TUBITAK) under a NATO Scientific Fellowship program is also gratefully acknowledged.
288
Table 2. Subdomain and interface data for Case 2 (120,600 equations).
Subdomains
Interface
Number of Processors
Coefficient Matrix Size
Number of Equations
Max/Min Bandwidth
Global Matrix Size (Previous)
Local Matrix Size (Present)
Number of Equations
3.25E6
60,000
427/427
9.10F.A
4.56E4
1,698
1.31E6
24,000
333/228
5.46E5
1.09E5
4,284
10
4.86E5
12,000
476/173
1.16E6
1.17E5
6,606
15
2.51E5
8,000
283/143
1.48E6
9.90E4
8,841
REFERENCES
1. J.S. Przemieniecki, "Matrix Structural Analysis of Substructures," AIAA Journal, 1963, Vol. 1,138-147. 2. C. Farhat and E. Wilson, "A New Finite Element Concurrent Computer Program Architecture," International Journal for Numerical Methods in Engineering, Vol. 24, 1771-1792, 1987. 3. S. Kocak, H.U. Akay, and A. Ecer, "Parallel Implicit Treatment of Interface Conditions in Domain Decomposition Algorithms," Proceedings of Parallel CFD '98, Edited by C.A. Lin, et al., Elsevier Science, Amsterdam, 1999, (in print). 4. Nour-Omid, A. Raefsky, and G. Lyzenga, "Solving Finite Element Equations on Concurrent Computers," Parallel Computations and Their Impact on Mechanics, ASME, New York, AMD-VOL 86, 209-227, 1986. 5. J. Favensi, A. Daniel, J. Tomnbello, and J. Watson, "Distributed Finite Element Analysis Using a Transputer Network," Computing Systems in Engineering, Vol. 1, 171-182, 1990. 6. A.I. Khan and B.H.V. Topping, "Parallel Finite Element Analysis Using JacobiConditioned Conjugate Gradient Algorithm," Advances in Engineering Software, Vol. 25, 309-319, 1996. 7. E.L. Wilson and H.H. Dovey, "Solution of Reduction of Equilibrium Equations for Large Structural Systems, "Adv. Engng. Sofiw., 1978, Vol. 1, 19-25. 8. A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sundream, PVM: A Users' Guide and Tutorial for Networked Parallel Machines, The MIT Press, 1994.
Parallel Computational Fluid Dynamics Towards Terafiops, Optimization and Novel Formulations D. Keyes, A. Ecer, J. Periaux, N. Satofuka and P. Fox (Editors) 92000 Elsevier Science B.V. All rights reserved.
289
Applications of the Smoothed Particle Hydrodynamics method" The Need for Supercomputing Stefan Kunze 1, Erik Schnetter 2 and Roland Speith 3 Institut fiir Theoretische Astrophysik, Universit~it Tiibingen Auf der Morgenstelle 10C, 72076 Tiibingen, Germany http://www, t at. physik, uni-t uebingen, de/- sp eith/particle.ht ml
We shortly describe the numerical method Smoothed Particle Hydrodynamics (SPH) and report on our parallel implementation of the code. One major application of our code is the simulation of astrophysical problems. We present some recent results of simulations of accretion disks in close symbiotic binary stars.
1. S m o o t h e d Particle H y d r o d y n a m i c s Smoothed Particle Hydrodynamics (SPH; [11], [5], [2], [13]; [7], [15]) is a meshless Lagrangian particle method for solving a system of hydrodynamic equations for compressible fluids. SPH is especially suited for problems with free boundaries, a commonplace situation in astrophysics. Rather than solving the equations on a grid, the equations are solved at the positions of the so-called particles, each representing a mass packet with a certain density, velocity, temperature etc. and moving with the flow. The principle of the SPH method is to transform a system of coupled partial differential equations into a system of coupled ordinary differential equations which can be solved by a standard integration scheme. This is achieved by a convolution of all variables with an appropriate smoothing kernel W and an approximation of the integral by a sum over particle quantities: f(x)
'~ f d3x' f ( x ' ) W ( x - x')
~ y~ V i f i W ( x - xi) i
Then all spatial derivatives can be computed as derivatives of the analytically known kernel function. Thus only the derivatives in time are left in the equations. The main advantage of SPH is that it is a Lagrangian formulation where no advection terms are present. Furthermore conservation of mass comes for free, and the particles can l k u n z e ~ t a t . p h y s i k , u n i - t u e b i n g e n , de 2 s c h n e t t e r @ t a t , p h y s i k , u n i - t u e b i n g e n , de 3 spe i t h @ t a t . phys i k . u n i - t u e b i n g e n , de
290
be almost arbitrarily distributed which removes the need for a computational grid. By varying the kernel function W in space or time one can adapt the resolution, if necessary. The Euler equation, for example, in its SPH form reads dvi _ Pj + Pi dt - - ~ m~ V W (xi - x~)
which has been derived as outlined above and then antisymmetrized. In order to efficiently evaluate this equation it is important to use a kernel function W that has compact support, thus reducing the number of non-zero contributions to the sum which runs over all particles. Finding interacting particles is an important part of every SPH implementation; this is done using well-known grid- or tree structures [6]. In contrast to many other flavors of SPH used in astrophysics, in our approach the viscous stress tensor is not a rather arbitrary artificial viscosity. Instead it is implemented according to the Navier-Stokes equation to describe the physical viscosity correctly [4].
2. The parallel implementation The usual approach of using high level languages (such as High Performance Fortran, HPF) for a parallelization of the code proved not feasible. The irregular particle distributions create irregular data structures, and nowaday's compilers unfortunately cannot create efficient code in this situation. We instead decided to use the low level MPI library as it is available for all common platforms (compare to [3]). 2.1. Straightforward D o m a i n Decomposition
The main principle we settled for was using a domain decomposition where a modified serial version of the code runs on every node. The communication across domain boundaries is taken care of by a special kind of boundary condition, akin to periodic boundaries. This way the communication routines are separated from the routines implementing the physics. We hope that this will make future additions to the physics easier, because people adding new physical features will need only a basic knowledge of the way communication is handled. This inter-domain boundary condition takes care of (almost) all necessary communication and sets up ghost particles for the SPH routines. The same approach had already successfully been implemented for periodic boundaries, only that now the ghost particles come from other nodes. Of course particle interactions that cross domain boundaries are calculated on only one node. The disadvantage of this method is that a low number of particles cannot efficiently be distributed onto many nodes. The ghost particle domain of each node has the size of the interaction range, and for increasing numbers of nodes the ghost particles eventually outnumber the real particles. Although the numerical workload stays the same, managing
291
the particles becomes more expensive. The common remedy is to increase the number of particles proportionally to the number of nodes.
2.2. Not Wasting Memory An SPH code needs at least three passes over all particles, computing the density, the viscous stress tensor, and the acceleration, respectively. If those passes are run one after the other, then the interaction information for all particles has to be kept in memory. Given that there are about 100 interactions per particle this information requires by far (a factor of 10) the largest amount of memory of the overall simulation. This severely limits the total number of particles that fit into a given computer system. In order to save on memory we run these passes in parallel, where each particle begins the next pass as soon as it and all its neighbours have finished the previous one. This can be realized with only negligible overhead by calculating the interactions by sweeping through the simulation domain combining all three passes. The interactions of a particle are determined on the fly when the particle is first encountered and are dropped from the interaction list as soon as the particles have finished the third pass. This sweeping happens (almost) independently on all nodes. In between the passes information about the particles may have to be exchanged between nodes, which is taken care of by the boundary condition module.
2.3. Simple Load Balancing
0.25 fltlillilili
0.2
-----t--
livUlilluiiuil
0.15
'
0.1
......
0.05
0.05
0
0
-0.05
..0.05
,.
,
.... i!i i!
-0.1 .0.15
-0.15 -0.2 -0.25
~,.'
0.15
0.1
-0.1
'~'~..' : l ~
0.2
-0.2 | -0.2
J -0.15
J -0.1
| -0.05
i 0
i 0.05
i 0.1
i 0.15
i 0.2
0.25
-0.25
I
|
I
I
I
I
|
I
I
I
-0.2
-0.15
-0.1
-0.05
0
0.05
0.1
0.15
0.2
0.25
Fig. 1. Two examples of domain decompositions. On the left for 15 nodes, on the right for 32 nodes.
292
Load balancing, although of vital importance, has only been implemented in an ad hoc fashion. The domains are cuboid and of different sizes. They do not form a grid but rather a tree structure. The initial domains are chosen by distributing the particles evenly; this distribution is refined during the simulation by monitoring computing time and resizing the domains after every time step. In order to keep the domain shapes as cubic as possible, nodes may be transferred to a different subtree, thus reorganizing the overall domain structure. Two examples of domain decompositions are shown in Figure 1. Thanks to MPI our code runs on many different platforms. It has been tested on our workstation cluster, a Beowulf cluster, the IBM SP/2, and the Cray T3E. It performs reasonably well on all those platforms; the load balancing takes only a negligible amount of time. The typical overall time spent waiting is less than about 12 %. Typical runs have 300 000 particles on about 50 nodes, where one evaluation of the right hand side takes about one second. 3. A c c r e t i o n Disks in S y m b i o t i c B i n a r y S t a r s
Fig. 2. Pole-on and edge-on view of a simulation of the accretion disk of the Dwarf Nova OY Car. Mass of the White Dwarf: 0.696 Mo, mass of the donor star: 0.069 Mo. The scales indicate 0.1 solar radii. The donor star is on the left. Greycoded is the dissipated energy.
Symbiotic binary stars are so close to each other that their evolution and appearance changes dramatically compared to single stars. Dwarf Novae are a class of variable symbiotic binary stars where mass transfer from one star to the other occurs. The donor is a light main sequence star, the accretor a more massive, but much smaller White Dwarf (WD). Due to its intrinsic angular momentum the overflowing gas cannot be accreted by the WD right away, instead a thin gaseous disk around the WD forms and the subsequent
293
accretion is governed by viscous processes in the disk, [18]. The physics of these accretion disks is far from being well understood. Existing models of long term outburst behaviour are essentially 1D and neglect the tidal influence of the donor star [12]. Observationally, the disks show variability on timescales from minutes to decades, occasionally increasing in brightness up to 5 magnitudes. Numerical simulations, especially in 3D, require enormous amounts of grid points - - or particles in our case - - to achieve the necessary resolution. Since the problem size is so large, and the integration time so long, parallel programs on supercomputers are the only possible way to go [14]. 3.1. 3D-SPH Simulation of the Stream-Disk Interaction in a Dwarf Nova One aspect of Dwarf Nova disks is the impact of the overflowing gas stream onto the rim of the accretion disk. Both flows are highly supersonic and two shock regions form [10]. The shocked gas becomes very hot, a bright spot develops, which sometimes can be brighter than the rest of the disk. The relative heights of the stream and the rim of the disk are unclear. If the stream is thicker than the disk, a substantial portion of the infaUing gas could stream over and under the disk and impact at much smaller radii [1], [8]. Figure 2 shows a snapshot of the simulation of the accretion disk of the Dwarf Nova OY Carinae. Grey-coded is the energy release due to viscous dissipation. One can clearly see the bright spot where the stream hits the disk rim. Furthermore, on the far side of the donor star, a secondary bright spot is visible where overflowing stream material finally impacts onto the disk. In this simulation, about 10 to 20 % of the stream material can flow over and under the disk. 3.2. Superhumps in A M CVn AM Canem Venaticorum stars are thought to be the helium counterparts to dwarf novae. AM CVn stars are believed to consist of two helium white dwarfs, a rather massive primary and a very light, Roche-lobe filling secondary. Roche-lobe overflow feeds an accretion disk around the primary. Tsugawa & Osaki [16] showed that such helium disks undergo thermal instabilities similar to the hydrogen disks in Dwarf Novae. In three AM CVn stars, Dwarf Nova-like outbursts indeed have been observed. In order to investigate whether AM CVn exhibit superhumps we performed 3D-SPH simulations of the accretion disk. Initially, there was no disk around the primary. Particles were inserted at the inner Lagrangian point according to the mass transfer rate. Already after about 30 orbital periods the disk grew to a point where it was subject to the 3:1 inner Lindblad resonance [19]. Subsequently, the disk became more and more tidally distorted and started to precess rapidly in the frame of reference corotating with the stars (see Figure 3), which translates to a slow prograde precession in the observers' frame. Every time the bulk of the disk passes the secondary, the tidal stresses and hence the viscous heating are strongest, giving rise to modulations in the photometric lightcurve,
294
.,,,~:'
..... , w r ~ j : .
o, ~ _ . . , : , . w ' . . . . , ~ . . r , " ' ? , ~.
o.,,~
,."~" . . . . . :
~
.........
::LL,.?-
.J
Fig. 3. A series of snapshots of the disk of Am CVn, 0.2 orbital phases apart. Upper panel: density distribution; lower panel: dissipated energy. The parameters used are: M1 = 1 Mo, M2 -- 0.15 Mo, mass transfer rate 10-1~ Mo/yr. A ploytropic equation of state with .y = 1.01 was used. One can see how the precession of the tidally distorted disk leads periodically to higher dissipation, resulting in superhumps in the lightcurve; see Figure 4.
the superhumps. A Fourier transform of the obtained lightcurve reveals a superhump period excess of 4.4 %. This is in good agreement with the periods given by Warner [17], which differ by 3.8 %. A former study of the superhump phenomenon by Kunze et al. [9] showed that the period excess is a function of the mass transfer rate, the mass ratio of the stars, and the kinematic viscosity of the disk. These parameters are not well known for AM CVn. 4. C o n c l u s i o n s The SPH method is very well suited for solving astrophysical problems with compressible flow and free boundaries. An efficient parallel implementation requires some effort but allows three-dimensional long-term simulations. This is especially helpful for exploring and validating theoretical models where the underlying parameters are not well known. Global properties of the system can be reproduced quite accurately. References [1] P. J. Armitage and M. Livio, Astrophys. J., 470 (1996) 1024 [2] W. Benz, in: Numerical Modelling of Stellar Puslations: Problems and prospects, J. R. Buchler (ed.), Kluwer Academic Press, Dordrecht, 1990 [3] T. Bubeck, M. Hipp, S. Hiittemann, S. Kunze, M. Ritt, W. Rosenstiel, H.Ruder and R.
295
Superhumps of AM CVn i
i
i
i
i
I
!
l
i
i
E~ (D f: (D 13
Q.
.m (D (D 13
J
~
5
10
15 20 25 orbital periods
I
[
30
35
40
Fig. 4. Shown is the total dissipated energy of the disk over a time span of 40 orbital periods. A Fourier transform of the dissipated energy reveals a superhump period excess of 4.4 %.
Speith, in: High Performance Computing in Science and Engineering '98, E. Krause and W. J~iger (eds.), Springer, Berlin, 1999 [4] O. Flebbe, S. Miinzel, H. Herold, H. Pdffert and H. Ruder, Astrophys. J., 431 (1994) 754 [5] R. A. Gingold and J. J. Monaghan, Mon. Not. R. Astr. Soc., 181 (1977) 375 [6] L. Greengard, V. Rokhlin, J. Comp. Phys., 73 (1987) 325 [7] L. Hernquist, Astrophys. J., 404 (1993) 717 [8] F. V. Hessman, Astrophys. J., 510 (1999) 867 [9] S. Kunze, R. Speith and H. mffert, Mon. Not. R. Astr. Soc., 289 (1997) 889 [10] S. H. Lubow and F.H. Shu, Astrophys. J., 198 (1975) 383 [11] L. S. Lucy, Astron. J., 82 (1977) 1013 [12] F. Meyer and E. Meyer-nofmeister, Astron. Astrophys., 132 (1983) 143 [13] J. J. Monaghan, Ann. Rev. Astron. Astrophys., 30 (1992) 543 [14] H. Riffert, H. Herold, O. Flebbe, H. Ruder, in: CPC Topical Issue: Numerical Methods in Astrophysics, W. J. Duschl and W. M. Wscharnuter (eds.), 89 (1995) 1 [15] M. Steinmetz, E. Mfiller, Astron. Astrophsy., 268 (1993) 391 [16] M. Wsugawa, Y. Osaki, Publ. Astron. Soc. Japan, 49 (1995) 75 [17] B. Warner, Astron. & Space Sci., 255 (1995) 249 [18] B. Warner, Cataclysmic Variable Stars, Cambridge University Press, Cambridge, 1995 [19] R. Whitehurst, Mon. Not. R. Astr. Soc., 232 (1988) 35
This Page Intentionally Left Blank
Parallel ComputationalFluidDynamics Towards Teraflops, Optimizationand Novel Formulations D. Keyes,A. Ecer, J. Periaux,N. Satofukaand P. Fox (Editors) 92000 ElsevierScienceB.V. All rights reserved.
297
P a r a l l e l m u l t i g r i d s o l v e r s w i t h b l o c k - w i s e s m o o t h e r s for m u l t i b l o c k g r i d s * Ignacio M. Llorente ~ , Boris Diskin b and N. Duane Melson c ~Departamento de Arquitectura de Computadores y AutomAtica Universidad Complutense, 28040 Madrid, Spain UInstitute for Computer Applications in-Science and Engineering Mail Stop 403, NASA Langley Research Center, Hampton, VA 23681-2199 ~Computational Modeling and Simulation Branch Mail Stop 128, NASA Langley Research Center, Hampton, VA 23681-2199 One of the most efficient approaches to yield robust methods for problems with anisotropie discrete operators is the combination of standard coarsening with alternating direction plane relaxation. However, this approach may be difficult to implement in codes with multiblock structured grids because there may be no natural definition of global lines or planes. This paper studies the behavior of blockwise plane smoothers in order to provide guidance to engineers who use block-structured grids. 1. I N T R O D U C T I O N It is known that standard multigrid smoothers are not well suited for solving problems involving anisotropic discrete operators. Several methods have been proposed in the multigrid literature to deal with anisotropic operators. Alternating-direction plane smoothers in combination with full coarsening have been found to be highly efficient and robust on single-block grids because of their optimal work per cycle and low convergence factor[I]. Single-block structured grids with stretching are widely used in many areas of computational physics. However, multiblock grids are needed to deal with complex geometries and/or to facilitate parallel processing. The range of a plane-implicit smoother on blocked grids is naturally limited to only a portion (one block) of the computational domain since no global lines or planes are assumed. Thus, the plane smoother becomes a block-wise plane smoother. The purpose of the current work was to study whether the optimal properties of planeimplicit smoothers deteriorate for general multiblock grids. Notice that the multigrid method with a block-wise smoother is a priori more efficient than a domain decomposition method because the multigrid algorithm is applied over the whole domain. Its efficiency *This research was supported by the National Aeronautics and Space Administration under NASA Contract No. NAS1-19480 while the first authors were in residence at the Institute for Computer Applications in Science and Engineering (ICASE), NASA Langley Research Center, Hampton, VA 236812199
298
and an excellent parallelization potential were already demonstrated for isotropic elliptic problems [2] where point-wise red-black smoothers were used. We have analysed two sources of anisotropy. The first case is anisotropic coefficients in the equation discretized on uniform grids (Section 3) and the second case is the isotropic coefficient equation discretized on stretched grids (Section 4). Some conclusions about the convergence behavior of block-wise smoothers and their relation with parallel methods for multigrid are outlined in Section 5. 2. D E S C R I P T I O N
OF T H E P R O B L E M
A parallel 3-D code to study the behavior of block-wise plane-smoothers has been developed. The code solves the nonlineaa diffusion equation by using a full multigrid approach and the full approximation scheme (FAS) with V-cycles. The equation is solved on a multiblock grid using a cell-centered formulation. The grid can be stretched. Each block can overlap with the neighboring blocks. For simplicity, the analysis and test problems are done for rectangular grids although the results can be extrapolated to more general multiblock grids. The code is implemented in Fortran77 and has been parallelized using the standard OpenMP directives for shared-memory parallel computing. The behavior of block-wise smoothers when the number of subdomains spanned by a strong anisotropy is low (up to 4) was studied. It is unlikely that regions of very strong anisotropies will traverse too many blocks, especially for stretched grids, in computational fluid dynamics (CFD) simulations. 3. A N I S O T R O P I C
E Q U A T I O N S D I S C R E T I Z E D ON U N I F O R M G R I D S
The smoothing properties of block-wise plane smoothers are analyzed for a model equation discretized on a rectangular uniform grid. e-'t'(Ui.. - 1,iu,iz -- 2Ui= iu,i= nt- Uia:+l,iy,iz)"~ h~ + h2 ( ?-tiz ,iy ,iz -1
,iy ,iz + ?-tix ,iy ,iz + l ) -
(1) f/'
where i~ = 1, ..., n~, iy = 1, ..., ny, iz = 1, ..., nz; el and e2 are the anisotropy coefficients; f ' is a known discrete function; and h:, hy, and hz are the meshsizes in the x, y and z directions, respectively. This is a finite volume cell-centered discretization. We assume a global grid with N 3 partitioned into m 3 blocks (cubic partitioning). Each subgrid overlaps with neighboring blocks. (See Figure 1.) Our aim is to obtain the dependence of the convergence rate p(m, n, 5, e) on the following parameters: the number of blocks per side of the domain (m), the number of cells per block side (n - N) the m ~ overlap parameter (5), and the anisotropy coefficients (el and e2). For simplicity, we assume q >_ e2 >_ 1. The (x,y)-plane smoother is applied inside each block and the blocks are updated in lexicographic order. We use a volume-weighted summation for the restriction operators. Trilinear interpolation in the computational space is applied as the prolongation operator. The 2-D problems defined in each plane are approximately solved by one 2-D V(1,1)-cycle with x-line smoothing.
299
1
[ !,f~J!"I!.l'.!';ti,:l
,~"~1 ~"22 ~'23
~24
o
epp g
i
Artlfleiel boundary Inner
aetl=
True
boundary
cell8
Figure 1. Data structure of a subgrid with overlap: n is the number of cells per side in every block, m is the number of blocks per side in the split and ~7is the semi width of the overlap.
Two sets of calculations were performed to determine the experimental asymptotic convergence rate as a function of the anisotropy coefficients. 9 Both q and e2 are varied for the single-block case (m = 1) and two multiblock cases (m = 2 and m = 4). Different overlaps between blocks (5 = 0, 2 and 4) are examined. The upper graphs in Figure 2 show the results for a 128 a (N = 128) grid. 9 Only el is varied for the single-block case (m = 1), and two multiblock cases (m = 2 and m = 4). Different overlaps between blocks (~ = 0, 2 and 4) are examined. The lower graphs in Figure 2 show these results for a 128 a (N = 128) grid. All the graphs exhibit a similar behavior with respect to ~ and e. We can distinguish three different cases: 9 In the single-block case {m - 1), the convergence rate decreases quickly for an anisotropy larger than 100, tending to (nearly) zero for very strong anisotropies. In fact, the convergence rate per V(1,1)-cycle decreases quadratically with increasing anisotropy strength, as was predicted in [1]. 9 If the domain is blocked with the minimal overlapping (5 = 0), the convergence rate for small anisotropies (1 < el _< 100) is similar to that obtained in a singleblock solver with a point-wise smoother on the whole domain (i.e., about 0.552 per cycle). It increases (gets worse) for larger anisotropies and is bounded above by the convergence rate of the corresponding domain decomposition algorithm. The convergence rate for strong anisotropies approaches one as the grid is refined. 9 If the domain is blocked with larger overlapping (~ > 2), the convergence rate for small anisotropies is similar to that obtained for a single-block grid spanning the whole domain (i.e., about 0.362 per cycle) and increases to the domain decomposition rate for very strong anisotropies. The asymptotic value for strong anisotropies gets closer to one for smaller overlaps and finer grids.
300
0,8
0.8
p 0,6
.o 0 , 6
0,4
0,4
0,2 0
0,2 . lO
.
. ioo
.
I ooo
~'l ----.6 2
1oooo
: -IOOOOO lOOOOOO
m=2
I
=
0,8
"
~
o
lO
1oo
lOOO 10000 61 --~ 6 2
100000
lOOOOO,
m----.,l
0.8
o,6 p,, o,4 o,2
0 1
0,4 _
~
9
io
0,2
~
,
I oo
IOOO
ioooo
~
~
IOOOOO IOOOOOO
0
I
IO
IOO
,~, (~'z = i ) single
I
I ooo
l OOOO iooooo
i oooooo
~', ( s z = I ) block
4-overlap
~
O-overlap
+
8-overlap
~
2-overlap
'
Figure 2. Experimental asymptotic convergence factors, p~, of one 3-D V(1,1)-cycle with block-wise plane-implicit smoother versus anisotropy strength on a 1283 grid for different values of overlap (5) and number of blocks per side (m).
Numerical results show that for strong anisotropies, the convergence rates are poor on grids with minimal overlap (5 = 0), but improve rapidly with larger than the minimal overlap (~ > 2). For multiblock applications where a strong anisotropy crosses a block interface, the deterioration of convergence rates can be prevented by an overlap which is proportional to the strength of the anisotropy. Even when an extremely strong anisotropy traverses four consecutive blocks, good convergence rates are obtained using a moderate overlap. 3.1. Analysis The full-space Fourier analysis is known as a simple and very efficient tool to predict convergence rates of multigrid cycles in elliptic problems. However, this analysis is inherently incapable of accounting for boundary conditions. In isotropic elliptic problems where boundary conditions affect just a small neighborhood near the boundary, this shortcoming does not seem to be very serious. The convergence rate of a multigrid cycle is mostly defined by the relaxation smoothing factor in the interior of the domain, and therefore, predictions of the Fourier analysis prove to be in amazing agreement with the results of numerical calculations. However, in strongly anisotropic problems on decomposed domains, boundaries have a large effect on the solution in the interior (e.g., on the relaxation stage), making the full-space analysis inadequate. In the extreme case of a very large anisotropy, the behavior of a plane smoother is similar to that exhibited by a one-level overlapping Schwartz method [3]. Below we propose an extended Fourier analysis that is applicable to strongly anisotropic problems on decomposed domains. In this analysis, the discrete equation (1) is considered on a layer (ix, iy, iz) : 0 1.
(4)
IA,(Oy, O=)] >_ a t -
I;,(0,=/~)1
The right-infinite problem is similarly solved providing ]Ar(0y,0z) t e2 >__ 1,
where Ar(0y, 0z) is a root of equation (4) satisfying lal _< 1, If the anisotropy is strong (el = O(h-2)), both boundaries (far and nearby) affect the error amplitude reduction factor. If the number of blocks m is not too large, then
302 Table 1 Experimental, Pe, and analytical, pa, convergence factors of a single 3-D V(1,1)-cycle with blockwise (x,y)-plane Gauss-Seidel smoother (2 • 2 x 2 partition) versus anisotropy strength, width of the overlap and the block size II
i ~ II ~
n = 64
1~
n = 128
]~
10612 ll ~176176 I 4 II 0"32110"31610"867
i s II
I
!~
I ~ II ~}87 I ~
I~
1 4 II ~ 1 8 I!
I ~ I
I ~ I 0.27
I ~ II ~ 10212 II ~
I~ !~
I~ [~
10412 I! 0.51 i 0.51 I 0.66
[4 II 0 " 1 2 1 0 " 1 4 1 0 " 1 4 18 II I I 0.14
0.939 0.729 0.566 0.340 0.92 0.67 0.49 0.26 0.56 0.14 0.i4 0.14
the corresponding problem includes m coupled homogeneous problems (like 3). This multiblock problem can directly be solved. For the two-block partition it results in
_ A~-~-t 2 RF=(A~-6-~ A ~ - A ~ + ~ )"
(5)
3.2. Comparison with numerical tests For simplicity, we consider the V(1,1)-cycle with only z-planes used in the smoothing step. The assumption that ~1 _> e2 >_ 1 validates this simplification. In numerical calculations for isotropic problems on single-block domains, the asymptotic convergence rate was 0.14 per V(1,1)-cycle, which is very close to the value predicted by the two-level Fourier analysis (0.134). In the case of the domain decomposed into two blocks in each direction, the reduction factor R F can be predicted by means of expression (5). Finally, the formula for the asymptotic convergence rate pa is
0.14).
(6)
Table 1 exhibits a representative sample of experiments for the case cl >> e2 = 1. In this table, pe corresponds to asymptotic convergence rates observed in the numerical experiments, while Pa is calculated by means of formula (6). The results demonstrate nearly perfect agreement.
303
4. I S O T R O P I C E Q U A T I O N D I S C R E T I Z E D ON S T R E T C H E D G R I D S We have also analyzed the case of an isotropic equation discretized on stretched grids. This case is even more favorable, as the convergence rate obtained for the single-block case is maintained on multiblock grids with a very small overlap (5 = 2). Numerical simulations were performed to obtain the experimental convergence rate with respect to the stretching ratio, c~. The single-block and multiblock grids (m = 1, 2, and 4) with different overlaps (~ = 0, 2, and 4) were tested. Figure 3 shows the results for a 1283 grid. The results can be summarized in the following two observations: 9 With a 2 a partitioning, even the 0-overlap (~ = 0) is suficient for good convergence rates. The results for a multiblock grid with overlap of 5 - 2 match with the results obtained for the single-block anisotropic case. That is, the convergence rate tends towards zero as the anisotropy increases. 9 With a 4 a partitioning, results are slightly worse. With the minimal overlap (5 = 0), the convergence rate degrades for finer grids. However, with a larger overlap (5 = 2), the convergence rate again tends towards the convergence rate demonstrated in single-block grid and anisotropic cases.
~,o I j
o,8i
~'~
0,6 i I 0,4 i
0.6 } joe 0,4
o,8 p.
0,2
0,0
1
1,1
1,2
1,3
1,4
--~ non-blocking
2-0verlap
0,0 ~ I
"
13
O-ovedap
-,z-- 4-overlap
1,2
9 1,3
i
Figure 3. Experimental asymptotic convergence factors, Pc, of one 3-D V(1,1)-cycle with block-wise plane-implicit smoother in a 128 a grid with respect to the stretching ratio (c~) for different values of the overlap (5) and the number of blocks per side (m).
5. B L O C K S M O O T H E R S
TO FACILITATE PARALLEL PROCESSING
Block-wise plane-implicit relaxation schemes are found to be robust smoothers. They present much better convergence rates than domain decomposition methods. In fact, their convergence rates are bounded above by the convergence rate of a corresponding domain decomposition solver. In common multiblock computational fluid dynamics simulations, where the number of subdomains spanned by a strong anisotropy is low (up to four), textbook multigrid convergence rates can be obtained with a small overlap of cells between neighboring blocks.
304 Block-wise plane smoothers may also be used to facilitate the parallel implementation of a multigrid method on a single-block (or logically rectangular) grid. In this case there are global lines and planes and block-wise smoothers are used only for purposes of parallel computing. To get a parallel implementation of a multigrid method, one can adopt one of the following strategies (see, e.g., [6]). 9 D o m a i n decomposition: The domain is decomposed into blocks which are indepen-
dently solved using a multigrid method. A multigrid method is used to solve the problem in the whole grid but the operators to perform this solve are performed on grid partitioning.
9 Grid partitioning:
Domain decomposition is easier to implement and implies fewer communications (better parallel properties), but it has a negative impact on the convergence rate. On the other hand, grid partitioning implies more communication but it retains the convergence rate of the sequential algorithm (better numerical properties). Therefore, the use of block-wise smoothers is justified to facilitate parallel processing when the problem does not possess a strong anisotropy spanning the whole domain. In such a case, the expected convergence rate (using moderate overlaps at the block interfaces crossed by the strong anisotropy) is similar to the rate achieved with grid partitioning, but the number of communications is considerably lower. Block-wise plane smoothers are somewhere between domain decomposition and grid partitioning and appear to be a good tradeoff between architectural and numerical properties. For the isotropic case, the convergence rate is equal to that obtained with grid partitioning, and it approaches the convergence rate of a domain decomposition method as the anisotropy becomes stronger. Although higher than in domain decomposition, the number of communications is lower than in grid-partitioning algorithms. However, it should be noted that due to the lack of definition of global planes and lines, grid partitioning is not viable in general multiblock grids. REFERENCES
1. I.M. Llorente and N. D. Melson, ICASE Report 98-37, Robust Multigrid Smoothers for Three Dimensional Elliptic Equations with Strong Anisotropies, 1998. 2. A. Brandt and B. Diskin, Multigrid Solvers on Decomposed Domains, Domain Decomposition Methods in Science and Engineering, A. Quarteroni, J. Periaux, Yu. A. Kuznetsov and O.Widlund (ed.), Contemp. Math., Amer. Math. Soc.,135-155, 1994 3. B. Smith, P. Bjorstad and W. Gropp, Domain Decomposition, Parallel Multilevel Methods for Elliptic Partial Differential Equations, Cambridge University Press, 1996 4. A. Brandt, GMD-Studien 85, Multigrid Techniques: 1984 Guide with Applications to Fluid Dynamics, 1984 5. P. Wesseling, An Introduction to Multigrid Methods,John Wiley & Sons, New York, 1992 6. I.M. Llorente and F. Tirado, Relationships between efficiency and execution time of full multigrid methods on parallel computers, IEEE Trans. on Parallel and Distributed Systems, 8, 562-573, 1997
Parallel ComputationalFluidDynamics Towards Teraflops, Optimizationand NovelFormulations D. Keyes,A. Ecer,J. Periaux,N. Satofukaand P. Fox (Editors) 92000 Elsevier ScienceB.V.All rightsreserved.
305
An Artificial Compressibility Solver for Parallel Simulation of Incompressible Two-Phase Flows K. Morinishi and N. Satofuka Department of Mechanical and System Engineering, Kyoto Institute of Technology Matsugasaki, Sakyo-ku, Kyoto 606-8585, Japan This paper describes an artificial compressibility method for incompressible laminar two-phase flows and its implementation on a parallel computer. The full two-fluid model equations without phase change are simultaneously solved using a pseudotime variable that is added to the continuity equation of incompressible two-phase flows. Numerical results, obtained for a test problem of solution consistency, agree well with analytic solutions. Applicability of the method is demonstrated for a three dimensional two-phase flow simulation about a U-bend. Efficiency is examined on the Hitachi SR2201 parallel computer using domain decomposition and up to 16 processors.
1. I N T R O D U C T I O N The artificial compressibility method was first introduced by Chorin [1] for obtaining steady-state solutions of the incompressible Navier-Stokes equations. In the method, a pseudotime derivative of pressure is added to the continuity equation of incompressible flows, so that the pressure and velocity fields are directly coupled in a hyperbolic system of equations. The equations are advanced in pseudotime until a divergent-free velocity field is obtained. The method has been successfully used by many authors [2,3] for unsteady incompressible flow simulations as well as steady-state flow simulations. The purposes of this study are to develop an artificial compressibility type method for obtaining steady-state solutions of incompressible laminar two-phase flows and to examine its efficiency and reliability on a parallel computer. The pseudotime derivative of pressure is added to the continuity equation derived from the two mass conservation equations of incompressible two-phase flows. The full two-fluid model equations are simultaneously solved with the pseudotime variable. Several numerical experiments are carried out in order to examine the efficiency and reliability of the method. The consistency of the numerical solution is examined for a simple test problem for which numerical results can be compared with analytic solutions. Applicability of the method is demonstrated for a three dimensional flow simulation in a rectangular cross-section U-bend. Efficiency is examined on the Hitachi SR2201 parallel computer using domain decomposition and up to 16 processors.
306
2. BASIC EQUATIONS In this study, we restrict our attention to numerical solutions for the incompressible two-fluid model equations without phase change. The equations can be written as: conservation of mass 0
~-~(0r + V" (Or
--0
(1)
and
~
+ v. (~u~)
0;
(2)
conservation of momentum O
al _ 1 1 4- - - V p - - M 1 + o~lg + - - V . (O~lSl) Pl Pl /91
(3)
A O~2 1 0t (a2U2) -[- V" (O12U2U2)-~- - - V p -- 1 M 2 + a2g + - - V . (a2s2); P2 P2 P2
(4)
o~(Or
+
V.
(Or
and
where ak denotes the volume fraction of phase k, uk the velocity, p the pressure, pk the density, sk the viscous shear tensor, Mk the interracial momentum transfer, and g the acceleration due to gravity. The volume fractions of two phases must be: a l + a2 = 1.
(5)
For the interracial momentum transfer, simple drag forces of bubbly flows are assumed. The drag forces are computed using drag coefficients CD derived from the assumption of spherical particles with uniform radiuses [4]. These are obtained for the dispersed phase
2 as: 3 Ploqoz2CDlUl _ U21(Ul _ U2) M2 = 4db
(6)
and 24
c~ = g/-;~ (~ + 0.~5R~~
(7)
where db is the bubble diameter and Reb the bubble Reynolds number. The interfacial momentum transfer for the continuous phase 1 is obtained with: M1 -- - M 2 .
(8)
For laminar two phase flows of Newtonian fluids, the average stress tensors may be expressed as those for laminar single-phase flows:
skq - ~k
(Ouk~ Ouk~ 2 Ox~ + Ox~ ] - 5~k~v'uk'
where #k is the molecular viscosity of phase k and
(9) 5ij is the Dirac delta.
307
3. A R T I F I C I A L C O M P R E S S I B I L I T Y
By combining the mass conservation equations (1) and (2) with the volume fraction constraint of Eq. (5), the continuity equation for the incompressible two-phase flows is obtained as:
(lo)
V " (CelUl + 0~2U2) -- 0.
The artificial compressibility relation is introduced by adding a pseudotime derivative of pressure to the continuity equation so that the pressure and velocity fields are directly coupled in a hyperbolic system of equations:
ap
(11)
0-'~ -[-/~V" (Cel u I -[- Ce2U2) -- 0,
where 0" denotes the pseudotime variable and/3 an artificial compressibility parameter. As the solution converges to a steady-state, the pseudotime derivative of pressure approaches zero, so that the original continuity equation of the incompressible two-phase flows is recovered. Since the transitional solution of the artificial compressibility method does not satisfy the continuity equation, the volume fraction constraint may be violated in the progress of pseudotime marching. Therefore further numerical constraint is introduced in the mass conservation equations as" 0 + v.
( 1u1) -
+ v.
-
mV. ( 1Ul +
(12)
and 0
(
,Ul +
If the constraint of volume fraction is initially satisfied, the pseudotime derivatives of al and a2 satisfy the following relation: 0
0
O-~(a,) + ~--~(a2) = O.
(14)
Thus the volume fraction constraint is always satisfied in the progress of pseudotime marching. Once the artificial compressibility solution converges to a steady-state solution, Eq. (i0)is effective so that Eqs. (12) and (13) result in their original equations (I) and (2), respectively. Equations (i i)-(I 3) and (3)- (4) are simultaneously solved using cell vertex non-staggered meshes. The convection terms are approximated using second order upwind differences with minmod limiters. The diffusion terms are approximated using second order centered differences. The resultant system of equations are advanced in the pseudotime variable until the normalized residuals of all the equations drop by four orders of magnitude. An explicit 2-stage rational Runge-Kutta scheme [5] is used for the pseudotime advancement. Local time stepping and residual averaging are adopted for accelerating the pseudotime convergence process to the steady-state solution.
308
4. P A R A L L E L I M P L E M E N T A T I O N
Numerical experiments of the artificial compressibility method for the incompressible two-phase flows were carried out on the Hitachi SR2201 parallel computer of Kyoto Institute of Technology. The system has 16 processors which are connected by a crossbar network. Each processor consists of a 150MHz PA-RISC chip, a 256MB memory, 512KB data cache, and 512KB instruction cache. The processor achieves peak floating point operations of 300 MFLOPS with its pseudo-vector processing system. For implementation of the artificial compressibility method on the parallel computer, the domain decomposition approach is adopted. In the approach, the whole computational domain is divided into a number of subdomains. Each subdomain should be nearly equal size for load balancing. Since the second order upwind differences are used for the convection terms of the basic equations, two-line overlaps are made at the interface boundaries of subdomains. The message passing is handled with express Parallelware.
Figure 1. Consistency test case.
5. N U M E R I C A L
RESULTS
5.1. C o n s i s t e n c y Test of Solution A simple test case proposed in [6] is considered to check the consistency of the artificial compressibility method for incompressible two-phase flows. A square box initially filled with 50% gas and 50% liquid evenly distributed within the total volume, is suddenly put in a gravitate environment as shown in Fig. 1. The flow field is modeled using a 21 • 21 uniform mesh. Free-slip boundary conditions are used at the walls. For simplicity, the following nondimensional values of density and viscosity are used for the two fluids:
Pl = 1.0 /tl-l.0
P2 = 0.001 /z2=0.01
The solution is advanced until the residuals of all the equations drop by four orders of magnitude. Steady state distributions obtained for the volume fraction al and normalized
309
1.0
1.0
Numerical Analytical
t-C:D
o..,
"1"13
N9
0.5
0
'o Numericai'-
om,.
-'toN
E
FI'
e-
0.5
E
Z
O
O
Z
0.0
0.0 0.5 1.0 Volume Fraction of Fluid 1 I
I
I
Figure 2. Volume fraction compared with analytical solution.
0.0
0.0 J
0.5 I
1.0 !
Normalized Pressure
Figure 3. Normalized pressure data compared with analytical solution.
pressure are plotted in Figs. 2 and 3, respectively. Analytic solutions are also plotted in the figures for comparison. The numerical results agree well with the analytic solutions. 5.2. 2-D P l a n e U - d u c t Flow The solution method is applied to a two-phase flow through a plane U-duct. Figure 4 shows the model geometry of the plane U-duct. The ratio of the radius curvature of the duct centerline and the duct width is 2.5. The Reynolds number based on the duct width is 200. The flow field is modeled without gravitate effect using a 129 x 33 H-mesh. At the inlet, fully developed velocity profiles with a constant pressure gradient are assigned for a mixed two-fluid flow with the following nondimensional values: Ol I =
0.8
p] = 1.0 #] = 1 . 0
O~2 - - 0 . 2
p2 = 0.001 #2=0.01
These conditions are also used for the initial conditions of the pseudotime marching. The solution is advanced until the residuals of all the equations drop by four orders of magnitude. Flow rates obtained for both phases are plotted in Fig. 5. The conservation of the flow rate is quite good throughout the flow field. Figure 6 shows the volume fraction contours of the heavy fluid al. Within the bend, phase separation is observed due to centrifugal forces that tend to concentrate the heavy fluid toward the outside of the bend. Parallel performance on the SR2201 is shown in Fig. 7. The domain decomposition approach in streamwise direction is adopted for the parallel computing. About 7 and 9 times speedups are attained with 8 and 16 processors, respectively. The performance with 16 processors is rather poor because the number of streamwise mesh points is not enough to attain the high performance. ( About 13 times speedup is attained with 16 processors using a 257 x 33 H-mesh. )
310 5.3. 3-D U - d u c t F l o w
The numerical experiment is finally extended to a three dimensional flow through the rectangular cross-section U-duct. The flow conditions are similar to those of the two dimensional flow case. The flow field is modeled without gravitate effect using a 129 • 33 x 33 H-mesh. Figure 8 shows the volume fraction contours of the heavy fluid c~1. The volume fraction contours at the 45 ~ and 90 ~ cross sections of the bend are shown in Figs. 9 and 10, respectively. Within the bend, secondary flow effects in addition to the phase separation due to centrifugal forces are observed. The parallel performance of this three dimensional case is shown in Fig. 11. The speedup ratios with 8 and 16 processors are 6.7 and 9.2, respectively. Again, the performance with 16 processors is rather poor because the number of streamwise mesh points is not enough to attain the high performance. 6. CONCLUSIONS
The artificial compressibility method was developed for the numerical simulations of incompressible two-phase flows. The method can predict phase separation of two-phase flows. The numerical results obtained for the consistency test agree well with the analytic solutions. The implementation of the method on the SR2201 parallel computer was carried out using the domain decomposition approach. About 9 times speedup was attained with 16 processors. It was found that the artificial compressibility solver is effective and efficient for the parallel computation of incompressible two-phase flows. 7. A C K N O W L E D G E M E N S
This study was supported in part by the Research for the Future Program (97P01101) from Japan Society for the Promotion of Science and a Grant-in-Aid for Scientific Research (09305016) from the Ministry of Education, Science, Sports and Culture of the Japanese Government. REFERENCES
1. Chorin, A.J., A Numerical Method for Solving Incompressible Viscous Flow Problems, Journal of Computational Physics, Vol. 2, pp. 12-26 (1967). 2. Kwak, D., Chang, J.L.C., Shanks, S.P., and Chakravarthy, S.R., Three-Dimensional Incompressible Navier-Stokes Flow Solver Using Primitive Variables, AIAA Journal, Vol. 24, pp. 390-396 (1986). 3. Rogers, S.E. and Kwak, D., Upwind Differencing Scheme for the Time-Accurate Incompressible Navier-Stokes Equations, AIAA Journal, Vol. 28, pp. 253-262 (1990). 4. Issa, R.I. and Oliveira, P.J., Numerical Prediction of Phase Separation in Two-Phase Flow through T-Junctions, Computers and Fluids, Vol. 23, pp. 347-372 (1994). 5. Morinishi, K. and Satofuka, N., Convergence Acceleration of the Rational RungeKutta Scheme for the Euler and Navier-Stokes Equations, Computers and Fluids, Vol. 19, pp. 305-313 (1991). 6. Moe, R. and Bendiksen, K.H., Transient Simulation of 2D and 3D Stratified and Intermittent Two-Phase Flows. Part I: Theory, International Journal for Numerical Methods in Fluids, Vol. 16, pp. 461-487 (1993).
311
u
_t"~
! !
1.0
L 4L
J b,,
0
Q1
o
\
n- 0.5o
1.1..
Q2
i
n
0"00.0
0.5 S/Sa
Sa Figure 4. Geometry of the U-duct.
1.0
Figure 5. Flow rate distributions.
16.0
0,o0 ~ oo* o~
o 8.0
og
0, find {u(t),p(t),VG(t), G(t),w(t),)~(t)} such that u(t) 6 Wgo(t ), p(t) 6 L~)(f2), VG(t) 6 IRd, G(t) 6 IRa, to(t) 6 IR3, A(t) e A(t)
and pf
-0-~-vdx+p;
(u. V ) u - v d x -
+eu/fD ( u ) " D(v)dx
-
Jr2
+(1 - P-Z)[M dVG. Y +
p~
dt
< )~, v - Y
pV-vdx -
0 x Gx~ >A(t)
dto + w x Iw). 0] (I -d-t-
-
(1 - PA)Mg .ps Y + p~~g. vdx, Vv e/-/~ (a)", VY e Ia ~, VO e Ia ~,
f
q V - u ( t ) d x - 0, Vq 6 L2(f2),
dG
dt
= v~,
(1)
(2) (3)
< tt, u(t) - VG(t) -- w(t) • G(t)x~ >A(,)-- 0, Vtt 6 A(t),
(4)
v~(o)
(5)
- v ~ ~(o)=
~o
G(o) - G~
u(x, 0) -- no(x), Vx 6 f2\B(0) and u(x, 0 ) - V ~ + w ~ x G~ ~, Vx 6 B(0).
(6)
331
In (1)-(6) , u ( = {ui} d i=1) and p denote velocity and pressure respectively, ,k is a Lagrange multiplier, D(v) = ( V v + Vvt)/2, g is the gravity, V ~ is the translation velocity of the mass center of the rigid body B, w is the angular velocity of B, M is the mass of the rigid body, I is the inertia tensor of the rigid body at G, G is the center of mass of B; w ( t ) - {cdi(t) }ia=l and 0 - {0i }i=1 a if d - 3, while co(t)- {0, 0,w(t)} and 0 - {0, 0,0} P
if d - 2. From the rigid body motion of B, go has to s a t i s f y / g o " ndF - 0, where
n
Jl
denotes the unit vector of the outward normal at F (we suppose the no-slip condition on OB). We also use, if necessary, the notation r for the function x --+ g)(x, t).
Remark 1. The hydrodynamics forces and torque imposed on the rigid body by the fluid are built in (1)-(6) implicitly (see [2] for detail), hence we do not need to compute them explicitly in the simulation. Since in (1)-(6) the flow field is defined on the entire domain f~, it can be computed with a simple structured grid. Then by (4), we can enforce the rigid body motion in the region occupied by the rigid bodies via Lagrange multipliers. Remark 2. In the case of Dirichlet boundary conditions on F, and taking the incompressibility condition V - U = 0 into account, we can easily show that
D(v)dx-
Vvdx, Vv
w0,
(7)
which, from a computational point of view, leads to a substantial simplification in (1)-(6).
2. A P P R O X I M A T I O N Concerning the space approximation of the problem (1)-(6) by finite element methods, we use PlisoP2 and P1 finite elements for the velocity field and pressure, respectively (see [6] for details). Then for discretization in time we apply an operator-splitting technique la Marchuk-Yanenko [7] to decouple the various computational difficulties associated with the simulation. In the resulting discretized problem, there are three major subproblems: (i) a divergence-free projection subproblem, (ii) a linear advection-diffusion subproblem, and (iii) a rigid body motion projection subproblem. Each of these subproblems can be solved by conjugate gradient methods (for further details, see ref. [2]).
3. PARALLELIZATION For the divergence-free projection subproblems, we apply a conjugate gradient algorithm preconditioned by the discrete equivalent o f - A for the homogeneous Neumann boundary condition; such an algorithm is described in [8]. In this article, the numerical solution of the Neumann problems occurring in the treatment of the divergence-free condition is achieved by a parallel multilevel Poisson solver developed by Sarin and Sameh [9]. The advection-diffusion subproblems are solved by a least-squares/conjugate-gradient algorithm [10] with two or three iterations at most in the simulation. The arising linear
332 systems associated with the discrete elliptic problems have been solved by the Jacobi iterative method, which is easy to parallelize. Finally, the subproblems associated with rigid body motion projection can also be solved by an Uzawa/conjugate gradient algorithm (in which there is no need to solve any elliptic problems); such an algorithm is described in [1] and [2]. Due to the fact that the distributed Lagrange multiplier method uses uniform meshes on a rectangular domain and relies on matrix-free operations on the velocity and pressure unknowns, this approach simplifies the distribution of data on parallel architectures and ensures very good load balancing. The basic computational kernels comprising of vector operations such as additions and dot products, and matrix-free matrix-vector products yield nice scalability on distributed shared memory computers such as the SGI Origin 2000.
4. NUMERICAL RESULTS In this article, the parallelized code of algorithm (1)-(6) has been applied to simulate multibody store separation in a 2D channel with non-spherical rigid bodies. There are three NACA0012 airfoils in the channel. The characteristic length of the fixed NACA0012 airfoil is 1.25 and those of the two moving ones are 1. The xl and x2 dimensions of the channel are 16.047 and 4 respectively. The density of the fluid is pf = 1.0 and the density of the particles is Ps = 1.1. The viscosity of the fluid is vf = 0.001. The initial condition for the fluid flow is u = 0. The boundary condition on 0~ of velocity field is
u/xlx2 l~(
0 (1.0 - e-5~
) - x~/4)
if x 2 - - 2 ,
or, 2,
if x l - - 4 ,
or, 16.047
for t >_ 0. Hence the Reynolds number is 1000 with respect to the characteristic length of the two smaller airfoils and the maximal in-flow speed. The initial mass centers of the three NACA0012 airfoils are located at (0.5, 1.5), (1, 1.25), and (-0.25, 1.25). Initial velocities and angular velocities of the airfoils are zeroes. The time step is A t - 0.0005. The mesh size for the velocity field is hv - 2/255. The mesh size for pressure is hp - 2hr. An example of a part of the mesh for the velocity field and an example of mesh points for enforcing the rigid body motion in NACA0012 airfoils are shown in Figure 2. All three NACA0012 airfoils are fixed up to t - 1. After t - 1, we allow the two smaller airfoils to move freely. These two smaller NACA0012 airfoils keep their stable orientations when they are moving downward in the simulation. Flow field visualizations and density plots of the vorticity obtained from numerical simulations (done on 4 processors) are shown in Figures 3 and 4. In Table 1, we have observed overall algorithmic speed-up of 15.08 on 32 processors, compared with the elapsed time on one processor. In addition, we also obtain an impressive about thirteen-fold increase in speed over the serial implementation on a workstation, a DEC alpha-500au, with 0.5 GB RAM and 500MHz clock speed.
333 ~\
\\ \~\ "~
~xh,\ \ . % \
\
~,,N \ ' , N ~
\
%rx\ x~\\
j" \.:.
\\\'\x~\ \ \ \
\ ,.r,,
\r,, \ ,.r,, ,~, ,N
\ \\\ \ \\ .\ \\ \
\ 9
\\ \
\\
.,x. \ \ \-\
\\ 9\ \ \ \~\
~
\\
\- \ \'x\
\ \ \
.K ,r,. cr,, 4-, \ . a xr\ -~
\\\ \ \ \
\ \ \
x" +
\
N" \ "~2 ,i-x \ . ~
\
\
,.r,, \r-, \ , N @',@
\N.\\
\5,(\
\
\ \\\\
9
~" ~,'
"x]\\ \rx\
\.,.
\
\ \
~\x x
\
\ ~\\
\ ,,>,>,, \\,
\
%\\x~,
\~
~\\
\~x\
\ \ \\\
"4_', \ -,~.:
\ \-,\\
\\ \~>,\\ \ \\\\
\\ \ ,~\, \ \~ . \ \ \
,~ \
, ,',~,~,' \ , ,' \ \ ,
~>, ~,'~','
~ ~ ~,'~,~
x
~ , , x,,\
\
N
\b,\
\
\
\
\
\\ 9
\
~,~ N" \ N" "J2 "~."
~\xx
N" \ N \ N\
:.\
Figure 2. Part of the velocity mesh and example of mesh points for enforcing the rigid body motion in the NACA0012 airfoils with hv - 3/64.
5. CONCLUSION We have presented in this article a distributed Lagrange multiplier based fictitious domain method for the simulation of flow with moving boundaries. Some preliminary experiments of parallelized code have shown the potential of this method for the direct simulation of complicated flow. In the future, our goal is to develop portable 3D code with the ability to simulate large scale problems on a wide variety of architectures. Table 1" Elapsed time/time step and algorithmic speed-up on a SGI Origin 2000 Elapsed Time
Algorithmic speed-up
1 processor*
146.32 sec.
1
2 processors
97.58 sec.
1.50
4 processors
50.74 sec.
2.88
8 processors
27.25 sec.
5.37
16 processors
15.82 sec.
9.25
32 processors
9.70 sec.
15.08
* The sequential code took about 125.26 sec./time step on a DEC alpha-500au.
334
..............................
!iiii!ii!iii!iiiiii!{!{![ !ii !i
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
iiiiiiiiiiiiiii iiiiiiiiiiiiiiiiiiiiii
!
.
.
.
.
:ZZZZ=ZZZZ=ZZZZZZZ=2Z=ZZ=IIZI/=:: .............. ............................................... ...........................................
|
l
,
i
!
.
-I
- - 2
F
-3
.
.
.
.
i
-2
,
i
,
i
i
-i
i
i
i
I
i
0
,
,
i
,
i
1
i
,
,
i
i
2
i,
i
,,
i
i
I
3
Figure 3. Flow field visualization (top) and density plot of the vorticity (bottom) around the NACA0012 airfoils at t - 1 . 5 .
335
iiii;iiiiiiiiii]
-i
-2 -3
-2
-i
0
1
2
3
Figure 4. Flow field visualization (top) and density plot of the vorticity (bottom) around the NACA0012 airfoils at t =2.
336
6. A C K N O W L E D G M E N T S We acknowledge the helpful comments and suggestions of E. J. Dean, V. Girault, J. He, Y. Kuznetsov, B. Maury, and G. Rodin, and also the support of the department of Computer Science at the Purdue University concerning the use of an SGI Origin 2000. We acknowledge also the support of the NSF (grants CTS-9873236 and ECS-9527123) and Dassault Aviation.
REFERENCES [1] R. Glowinski, T.-W. Pan, T. Hesla, D.D. Joseph, J. P~riaux, A fictitious domain method with distributed Lagrange multipliers for the numerical simulation of particulate flows, in J. Mandel, C. Farhat, and X.-C. Cai (eds.), Domain Decomposition Methods 10, AMS, Providence, RI, 1998, 121-137. [2] R. Glowinski, T.W. Pan, T.I. Hesla, and D.D. Joseph, A distributed Lagrange multiplier/fictitious domain method for particulate flows, Internat. J. of Multiphase Flow, 25 (1999), 755-794. [3] H.H. Hu, Direct simulation of flows of solid-liquid mixtures, Internat. J. Multiphase Flow, 22 (1996), 335-352. [4] A. Johnson, T. Tezduyar, 3D Simulation of fluid-particle interactions with the number of particles reaching 100, Comp. Meth. Appl. Mech. Eng., 145 (1997), 301-321. [5] B. Maury, R. Glowinski, Fluid particle flow: a symmetric formulation, C.R. Acad. Sci., S~rie I, Paris, t. 324 (1997), 1079-1084. [6] M.O. Bristeau, R. Glowinski, J. P~riaux, Numerical methods for the Navier-Stokes equations. Applications to the simulation of compressible and incompressible viscous flow, Computer Physics Reports, 6 (1987), 73-187. [7] G.I. Marchuk, Splitting and alternating direction methods, in P.G. Ciarlet and J.L. Lions (eds.), Handbook of Numerical Analysis, Vol. I, North-Holland, Amsterdam, 1990, 197-462. [8] R. Glowinski, Finite element methods for the numerical simulation of incompressible viscous flow. Introduction to the control of the Navier-Stokes equations, in C.R. Anderson et al. (eds), Vortex Dynamics and Vortex Methods, Lectures in Applied Mathematics, AMS, Providence, R.I., 28 (1991), 219-301. [9] V. Sarin, A. Sameh, An efficient iterative method for the generalized Stokes problem, SIAM J. Sci. Comput., 19 (1998), 206-226. 2, 335-352. [10] R. Glowinski, Numerical methods for nonlinear variational problems, SpringerVerlag, New York, 1984.
Parallel ComputationalFluidDynamics Towards Teraflops,Optimizationand Novel Formulations D. Keyes,A. Ecer, J. Periaux,N. Satofukaand P. Fox (Editors) 2000 ElsevierScienceB.V. All rightsreserved.
337
E f f i c i e n t p a r a l l e l - b y - l i n e m e t h o d s in C F D A. Povitsky ~ ~Staff Scientist, ICASE, NASA Langley Research Center, Hampton, VA 23681-2199, e-mail:
[email protected]. We propose a novel methodology for efficient parallelization of implicit structured and block-structured codes. This method creates a parallel code driven by communication and computation schedule instead of usual "creative programming" approach. Using this schedule, processor idle and communication latency times are considerably reduced. 1. I n t r o d u c t i o n Today's trend in Computational Science is characterized by quick shift from the yon Neumann computer architecture representing a sequential machine executed scalar data to MIMD computers where multiple instruction streams render multiple data. For implicit numerical methods, computations at each grid point are coupled to other grid points belonging to the same grid line in a fixed direction but can be done independently of the computations at grid points on all other lines in that direction. The coupling arises from solution of linear banded systems by Gaussian Elimination and leads to insertion of communications inside the computational sub-routines, idle stage of processors and large communication latency time. Traditional ADI methods, high-order compact schemes and methods of lines in multigrid solvers fall into the category of implicit numerical methods. A natural way to avoid far-field data-dependency is to introduce artificial boundary conditions (ABC) at inter-domain interfaces. Nordstrom and Carpenter [1] have shown that multiple interface ABC lead to decrease of a stability range and accuracy for highorder compact schemes. Povitsky and Wolfshtein [2] came to similar conclusions about ADI schemes. Additionally, the theoretical stability analysis is restricted to linear PDEs. Therefore, we do not use ABC for parallelization of a serial code unless ABC arise due to a multi-zone approach (see below). Other methods to solve banded linear systems on MIMD computers include transposed algorithms, concurrent solvers and the pipelined Thomas algorithms (PTA). Matrix transpose algorithms solve the across-processor systems by transposing the data to be node local when solving the banded linear systems. Povitsky (1999) [3] compared estimated parallelization penalty time for transposed algorithms with measurements for PTAs on MIMD computers and concluded that PTAs are superior unless the number of grid nodes per processor is small. Hofhaus and van de Velde [4] investigated parallel performance of several concurrent solvers (CS) for banded linear systems and concluded that the floating-point count is
338 2 - 2.5 greater than that for the PTA. Additionally, implementation of CS in CFD codes requires coding of computational algorithms different from the Thomas algorithm. Thus, we confine ourselves with the Thomas algorithm that has a lowest computational count and widely used in CFD community. However, a parallel emciency for the PTA degrades due to communication latency time. The size of packet of lines solved per message is small due to trade-off between idle and latency times [5], [6]. To reduce this difficulty, we drive processors by schedule and not by waiting of information from neighbors. This schedule allows to use processors for other computations while they are idle from the Thomas algorithm computations in a current spatial direction. 2. M e t h o d o l o g y The proposed methodology of parallelization include the following stages 1. Partition indexes in all spatial directions 2. Compute optimal number of lines solved per message 3. Create processor schedule 4. Run computations by this schedule For Step 1 automatized parallelization tools [11] may be recommended. Otherwise, the annoying partitioning by hand should be done. The optimal number of lines (Step 2) is computed from the theoretical model of parallelization efficiency and represents the trade-off between latency time and parallelization penalty time. This models have been developed for the standard pipelined Thomas algorithms [6]. For the standard PTA, the latency and the processor idle time tradeoff for sets of linear banded systems leads to the following expression [6]: K1--
V/
p ( / ~~--1)'
K2-
V/ P - I '
(1)
where K1 and K2 are the numbers of lines solved per message on the forward and backward steps of the Thomas algorithm, respectively, 7 = bo/g2 is the ratio of the communication latency and the backward step computational time per grid node, and p = gl/g2 is the ratio of the forward and the backward step computational times. For the novel version of PTA, denoted as the Immediate Backward PTA (IB-PTA), the optimal size of packet is given by: N2 KI = 2 ( N d _ I ) , K2 - pKI.
(2)
More details about derivation of the theoretical mode] of parallelization, refinements with regards to KI and /('2 and estimation of para]]e]ization penalty time in terms of O(N) are presented in our another study [8]. The forward and backward steps of the Thomas algorithm include recurrences that span over processors. The main disadvantage of its para]lelization is that during the pipelined process processors will be idle waiting for completion of either the forward or the backward step computations by other processors in row.
339 The standard PTA does not complete lines before processors are idle. Moreover, even data-independent computations cannot be executed using the standard PTA as processors are governed by communications and are not scheduled for other activities while they are idle. The main difference between the IB-PTA and the PTA is that the backward step computations for each packet of lines start immediately after the completion of the forward step on the last processor for these lines. After reformulation of the order of treatment of lines by the forward and backward steps of the Thomas algorithm, some lines are completed before processors stay idle. We use these completed data and idle processors for other computational tasks; therefore, processors compute their tasks in a time-staggered way without global synchronization. A unit, that our schedule addresses, is defined as the treatment of a packet of lines by either forward or backward step computations in any spatial direction. The number of time units was defined on Step 1. At each unit a processor either performs forward or backward step computations or local computations. To set up this schedule, let us define the "partial schedules" corresponding to sweeps in a spatial direction as follows: +1 0 -1
J ( p , i , dir) -
forward step computations processor is idle backward step computations,
(3)
where dir = 1,2, 3 denotes a spatial direction, i is a unit number, p is a number of processor in a processor row in the dir direction. To make the IB-PTA feasible, the recursive algorithm for the assignment of the processor computation and communication schedule is derived, implemented and tested [3]. Partial directional schedules must be combined to form a final schedule. For example, for compact high-order schemes processors should be scheduled to execute the forward step computations in the y direction while they are idle between the forward and the backward step computations in the x direction. The final schedule for a compact parallel algorithm is set by binding schedules in all three spatial directions. Example of such schedule for the first outermost processor (1, 1, 1) is shown in Table 1. In this binded
Table 1 Schedule of processor communication and computations, where i is the number of time unit, T denotes type of computations, (2, 1, 1), (1, 2, 1) and (1, 1, 2) denote communication with corresponding neighbors i
T (2,1,1) (1,2,1) (1,1,2)
...
6
7
8
1 1 0 0
1 0 0 0
-1 3 0 0
9 10 2-1 0 3 0 0 0 0
11 2 0 1 0
12 -1 2 0 0
...
43
44
45
46
47
4 0 0 0
-3 0 0 2
4 0 0 0
-3 0 0 2
4 0 0 0
schedule, computations in the next spatial direction are performed while processors are idle from the Thomas algorithm computations in a current spatial direction, and the
340 for i=l,...,I
( for dir=l,3
( if (Com(p,i, right[dir]) = 1) send FS coefficients to right processor; if (Corn(p, i, right[dir]) = 3) send FS coefficients to right processor and receive BS solution from right processor; if (Corn(p, i, left[dir])= 1) send BS solution to left processor; if (Corn(p, i, left[dir]) = 3) send BS solution to left processor and receive FS coefficients from left processor; if (Corn(p, i, right[dir]) = 2) receive BS solution from right processor; if (Corn(p, i, left[dir]) = 2) receive FS coefficients from left processor;
} for dir=l,3
( if (T(p, i) = dir) do FS computations if (T(p, i) = - d i r ) do BS computations if (T(p, i) = local(dir)) do local directional computations
} if (T(p, i ) = local) do local computations
} Figure 1. Schedule-governed banded linear solver, where right = p + 1 and l e f t = p - 1 denote left and right neighbors, dir - 1, 2, 3 corresponds to x, y, and z spatial directions, T governs computations, Corn controls communication with neighboring processors, p is the processor number, and i is the number of group of lines (number of time unit).
Runge-Kutta computations are executed while processors are idle from the Thomas algorithm computations in the last spatial direction. After assignment of the processor schedule on all processors, the computational part of the method (Step 4) runs on all processors by an algorithm presented in Figure 1. 3. P a r a l l e l c o m p u t a t i o n s The CRAY T3E MIMD computer is used for parallel computations. First, we test a building block that includes PTAs solution of the set of independent banded systems (grid lines) in a spatial direction and then either local computations or forward step computation of the set of banded systems in the next spatial direction. We consider standard PTA, IB-TPA, "burn from two ends" Gaussian elimination (TW-PTA) and the combination of last two algorithms (IBTW-PTA). The parallelization penalty and the size of packet versus the number of grid nodes per processor per direction are presented in Figure 2. For the pipeline of eight processors (Figure 2a) the most efficient IBTW-PTA has less than 100% parallelization penalty if the total number of grid nodes per processor is more than 143 . For the pipeline of sixteen processors (Figure2b) the critical number of grid nodes per processor is 163. For these numbers of nodes, the parallelization penalty for the standard PTA is about 170% and
341
250% for eight and sixteen processors, respectively.
17o I~" 160 ~.150 ~.140 ~-
120 110 100 90 _
i,oor
[ 7o
~9 80 .N
~
70~-
"6 6o
""..3
;~ ~oI o
4""
.~ 8o
N so 4o
"-._
.s
30
S
2 3
1
20 10
15
20
25
Number of grid nodes
0
30
i::I
'~'~
....
~o . . . .
~', . . . .
Number of grid nodes
~'o'
120 ~ 110 ~'100 ~'9OI=80
1ool
~
4"J
50
40 30
s SS
2
s
3
. . . .
20
'~I
10
is
Num~)~r of grid noc~s
30
0
15
20
25
Number of grid nodes
30
Figure 2. Parallelization penalty and size of packet of lines solved per message for pipelined Thomas algorithms; (a) 8 processors; (b) 16 processors; 1-PTA, 2-IB-PTA, 3-TW-PTA, 4-IBTW-PTA
Next example is the 3-D aeroacoustic problem of initial pulse that includes solution of linearized Euler equations in time-space domain. Parallel computer runs on 4 • 4 • 4 = 64 processors with 103- 203 grid nodes per processor show that the parallel speed-up increases 1 . 5 - 2 times over the standard PTA (Figure 3a). The novel algorithm and the standard one are used with corresponding optimal numbers of lines solved per message. The size of packet is substantially larger for the proposed algorithm than that for the standard PTA (Figure 3b). 4. C o n c l u s i o n s We described our proposed method of parallelization of implicit structured numerical schemes and demonstrated its advantages over standard method by computer runs on an MIMD computer. The method is based on processor schedule that control inter-processor
342
~40r 50
50 45 e~40 "6 ~35
64processor~
~30 s_.
o-20
~30
C
25
2,10n
12 14 16 18 Number of Nodes
20
10
12 14 16 18 Number of Nodes
20
Figure 3. Parallelization efficiency for compact 3-D aeroacoustic computations, (a) Speedup; (b) Size of packet. Here 1 denotes the standard PTA, and 2 denotes the binded schedule
communication and order of computations. The schedule is generated only once before CFD computations. The proposed style of code design fully separates computational routines from communication procedures. Processor idle time, resulting from the pipelined nature of Gaussian Elimination, is used for other computations. In turn, the optimal number of messages is considerably smaller than that for the standard method. Therefore, the parallelization efficiency gains also in communication latency time. In CFD flow-field simulations are often performed by so-called multi-zone or multi-block method where governing partial differential equations are discretized on sets of numerical grids connecting at interfacial boundaries by ABC. A fine-grain parallelization approach for multi-zone methods was implemented for implicit solvers by Hatay et al. [5]. This approach adopts a three-dimensional partitioning scheme where the computational domain is sliced along the planes normal to grid lines in a current direction. The number of processors at each zone is arbitrary and can be determined to be proportional to the size of zone. For example, a cubic zone is perfectly (i.e, in load-balanced way) covered by cubic sub-domains only in a case that the corresponding number of processors is cube of an integer number. Otherwise, a domain partitioning degrades to two- or one- directional partitioning with poor surface-to-volume ratio. However, for multi-zone computations with dozens of grids of very different sizes a number of processors associated with a zone may not secure this perfect partitioning. Hatay et al. [5] recommended to choose subdomains with the minimum surface-to-volume ratio, i.e., this shape should be as close to a cube as possible. Algorithms that hold this feature are not available yet and any new configuration requires ad hos partitioning and organization of communication between processors.
343 The binding procedure, that combines schedules corresponding to pipelines in different spatial directions, is used in this study. This approach will be useful for parallelization of multi-zone tasks, so as a processor can handle subsets of different grids to ensure load balance. REFERENCES
1. J. Nordstrom and M. Carpenter, Boundary and Interface Conditions for High Order Finite Difference Methods Applied to the Euler and Navier-Stokes Equations, ICASE Report No 98-19, 1998. 2. A. Povitsky and M. Wolfshtein, Multi-domain Implicit Numerical Scheme, International Journal for Numerical Methods in Fluids, vol. 25, pp. 547-566, 1997. 3. A. Povitsky, Parallelization of Pipelined Thomas Algorithms for Sets of Linear Banded Systems, ICASE Report No 98-48, 1998 (Expanded version will appear in, Journal of Parallel and Distributed Computing, Sept. 1999). 4. J. Hofhaus and E. F. Van De Velde, Alternating-direction Line-relaxation Methods on Multicomputers, SIAM Journal of Scientific Computing, Vol. 17, No 2, 1996,pp. 454-478. 5. F. Hatay, D.C. Jespersen, G. P. Guruswamy, Y. M. Rizk, C. Byun, K. Gee, A multilevel parallelization concept for high-fidelity multi-block solvers, Paper presented in SC97: High Performance Networking and Computing, San Jose, California, November 1997. 6. N.H. Naik, V. K. Naik, and M. Nicoules, Parallelization of a Class of Implicit Finite Difference Schemes in Computational Fluid Dynamics, International Journal of High Speed Computing, 5, 1993, pp. 1-50. 7. C.-T. Ho and L. Johnsson, Optimizing Tridiagonal Solvers for Alternating Direction Methods on Boolean Cube Multiprocessors, SIAM Journal of Scientific and Statistical Computing, Vol. 11, No. 3, 1990, pp. 563-592. 8. A. Povitsky, Parallel Directionally Split Solver Based on Reformulation of Pipelined Thomas Algorithm,ICASE Report No. 98-45, http://www, icase, edu/library/reports/rdp/1998, html#98-45 9. A. Povitsky and M.Visbal, Parallelization of ADI Solver FDL3DI Based on New Formulation of Thomas Algorithm, in Proceedings of HPCCP/CAS Workshop 98 at NASA Ames Research Center, pp. 35-40, 1998. 10. A. Povitsky and P. Morris, Parallel compact multi-dimensional numerical algorithm with application to aeroacoustics, AIAA Paper No 99-3272, 14th AIAA CFD Conference, Norfolk, VA, 1999. II. http://www.gre.ac.uk/-captool 12. T. H. Pulliam and D. S. Chaussee, A Diagonal Form of an Implicit Approximate Factorization Algorithm, Journal of Computational Physics, vol. 39, 1981, pp. 347363. 13. S. K. Lele, Compact Finite Difference Schemes with Spectral-Like Resolution, Journal of Computational Physics, vol. 103, 1992, pp. 16-42.
This Page Intentionally Left Blank
Parallel Computational Fluid Dynamics Towards Teraflops, Optimization and Novel Formulations D. Keyes, A. Ecer, J. Periaux, N. Satofuka and P. Fox (Editors) 92000 Elsevier Science B.V. All rights reserved.
345
A n A u t o m a t a b l e G e n e r i c S t r a t e g y for D y n a m i c L o a d B a l a n c i n g in P a r a l l e l S t r u c t u r e d M e s h C F D Code J. N. Rodrigues, S. P. Johnson, C. Walshaw, and M. Cross Parallel Processing Research Group, University of Greenwich, London, UK.
In order to improve the parallel efficiency of an imbalanced structured mesh CFD code, a new dynamic load balancing (DLB) strategy has been developed in which the processor partition range limits of just one of the partitioned dimensions uses non-coincidental limits, as opposed to coincidental limits. The 'local' partition limit change allows greater flexibility in obtaining a balanced load distribution, as the workload increase, or decrease, on a processor is no longer restricted by the 'global' (coincidental) limit change. The automatic implementation of this generic DLB strategy within an existing parallel code is presented in this paper, along with some preliminary results.
1. I N T R O D U C T I O N
Parallel computing is now widely used in numerical simulation, particularly for application codes based on finite difference and finite element methods. A popular and successful technique employed to parallelise such codes onto large distributed memory systems is to partition the mesh into sub-domains that are then allocated to processors. The code then executes in parallel, using the SPMD methodology, with message passing for inter-processor interactions. These interactions predominantly entail the updating of overlapping areas of the subdomains. The complex nature of the chemical and physical processes involved in many of these codes, along with differing processor performance in some parallel systems (such as workstation clusters), often leads to load imbalance and significant processor idle time. DLB can be used to overcome this by migrating the computational load away from heavily loaded processors, greatly improving the performance of the application code on the parallel system. For unstructured mesh application codes (including finite element codes), the flexibility of the partition in terms of the allocation of individual cells to processors allows a fairly straightforward technique to be used [1]. Graph partitioning tools such as JOSTLE [2] can be used to take an existing partition and re-partition it in parallel by moving cells when necessary to reduce the load imbalance. For structured mesh application codes (such as finite difference codes), the nature of
346 the coding techniques typically used forces rectangular partitions to be constructed if efficient parallel execution is to be achieved (since this allows loop limit alteration in the co-ordinate directions to be sufficient). This work investigates the DLB problem for structured mesh application codes and devises a generic strategy, along with library utilities, to simplify the implementation of DLB in such codes, leading to the automatic implementation of the technique within the Computer Aided Parallelisation Tools (CAPTools) environment [3,4]. The aims of this research are threefold: 9 Develop a generic, minimally intrusive, effective load balancing strategy for structured mesh codes 9 Develop utilities (library calls) to simplify the implementation of such a strategy 9 Automate this strategy in the CAPTools environment to transform previously parallelised message passing codes
2. P O S S I B L E LOAD BAI,ANCING S T R A T E G I E S
Three different load balancing strategies can be used in the context of structured mesh codes, each trying to achieve a good load balance without incurring high communication latencies or major alterations to the source code. In Figure 1, Case 1 forces all partition limits to coincide with those on neighbouring processors, greatly restricting the load balance possible. Case 2, although allowing good load balance, suffers from complicated communications and difficulties in constructing the partition. Case 3, where partition limits are forced to coincide on all but one dimension, allows for good load balance, fairly simple and neat communication patterns, and is relatively straightforward to construct and is therefore selected for the generic strategy. Original .
.
.
.
.
.
Case 1
Case 2
Case3
1 Change globally No change Moderate
2 Change locally Complex Good
3 Mix Relatively simple Good
.
Case: Limits: Communication: Balance:
[[
Figure 1: Alternative DLB partitioning strategies and their implications on parallel performance.
347
3. C O M M U N I C A T I O N S W I T H N O N - C O I N C I D E N T P A R T I T I O N L I M I T S To avoid the parallel code from becoming unwieldy w h e n using this DLB strategy, the details of communications involving d a t a from several processors (due to the staggering of the partition limits) is dealt with w i t h i n library routines. Figure 2 shows a 3D mesh decomposed onto 27 processors with staggered partition limits used in one dimension to allow DLB. To u p d a t e the overlap a r e a shown for processor 6 requires d a t a from processors 2, 5 and 8, w h e r e the details of this communication can be hidden within the new communication call. In the original communication call, a variable of a p a r t i c u l a r length a n d type is received from a specified direction [5]. To h a n d l e the staggered limits in this DLB strategy, the original communication call n a m e is changed (as shown in Figure 2), and two extra p a r a m e t e r s are added on, which d e t e r m i n e the a m o u n t to communicate and whom to communicate with. Internally, c a p _ d l b _ r e c e i v e performs three communications as determined by examining the partition limits on the neighbouring processors. Matching communications (i.e. c a p _ d l b _ s e n d ' s ) on the neighbour processors use the same algorithm to determine which d a t a items are to be sent to their neighbours. C h a n g i n g all of the communications in the application code t h a t are orthogonal to the staggered partition limits into dlb communications, will correctly handle inter-processor communication whilst only slightly altering the application code.
/ / / /
Original communication call: call cap_receive(var,length,type,dir) .....
,D ~m~
/ ""~ / i /
/
/ New dlb communication call:
call cap_dlb_receive(var,length,type,dir, first,stag_stride)
Figure 2: Update of overlap on processor 6 with contributions from processors 2, 5 and 8. Also, the original and new dlb communication calls, the latter of which receives a variable of a particular length and type from a specified direction.
348 4. LOAD MIGRATION TO SET-UP THE NEW PARTITION Another key section of a DLB algorithm is efficient load migration, particularly since this will typically involve migrating data relating to a very large n u m b e r of arrays t h a t represent geometric, physical and chemical properties. The algorithm devised here performs partition limit alteration and data migration one dimension at a time, as shown in Figure 3. Figure 3 shows the data migration process for a 2D problem in which the Up/Down limits appear staggered (i.e. use local processor partition range limits). The load is first migrated in the non-staggered dimension using the old processor limits. This involves calculating the average idle time on each column of processors in Figure 3 and then re-allocating (communicating) the columns of the structured mesh to neighbouring processors to reduce this idle time. Note t h a t the timing is for a whole column of processors r a t h e r t h a n any individual processor in the column. In subsequent re-balances, this will involve communications orthogonal to the staggered partition limits and therefore the previously described dlb communication calls are used. Once all other dimensions have undergone migration, the staggered dimension is adjusted using the new processor limits for all other dimensions, ensuring a minimal movement of data. The new partition limits are calculated by adjusting the partition limits locally within each column of processors to reduce the idle time of those processors. Note t h a t now the individual processor timings are used when calculating the staggered limits. Obviously, the timings on each processor are adjusted during this process to account for migration of data in previous dimensions. The actual data movement is performed again using utility routines to minimise the impact on the appearance of the application code, and has dlb and non-dlb versions in the same way as the communications mentioned in the previous section.
tu
t~,,
t~,,
tl~
t~
t~
t~
t~
t~
TI,~
Tla
T~
T~,
~l ~
~.-~'.:~,
J
TT
Data migration Left/Right (Dim=l)
Data migration Up/Down (Dim=2)
Figure 3: Data migration for a 2D processor topology, with global Left/Right processor partition range limits, and staggered Up/Down processor partition range limits.
349
5. D Y N A M I C L O A D B A I ~ A N C I N G A L G O R I T H M
The previous sections have discussed the utility routines developed to support DLB; this section discusses the DLB approach that is to be used, based on those utilities. The new code that is to be inserted into the original parallel code should be understandable and unobtrusive to the user, allowing the user to m a i n t a i n the code without needing to know the underlying operations of the inserted code in detail. Since m a n y large arrays may be moved, the calculation and migration of the new load should be very fast, meaning that there should only be a m i n i m u m amount of data movement whilst preserving the previous partition. In addition, communications should be cheap, as the cost of moving one array may be heavy. The algorithm used to dynamically load balance a parallel code is: 9 Change existing communication calls, in the non-staggered dimensions, into the new dlb communication calls 9 Insert dynamic load balancing code: 9 Start/Stop timer in selected DLB loop (e.g. time step loop) 9 For each dimension of the processor topology 9 Find the new processor limits 9 Add migration calls for every relevant array 9 Assign new partition range limits 9 Duplicate overlapping communications to ensure valid overlaps exist before continuing
6. R E S U L T S
When solving a linear system of equations using a 200x300 Jacobi Iterative Solver on a heterogeneous system of processors, each processor has the same workload but their speeds (or number of users) differ. In this instance, the load imbalance is associated with the variation between processors, which shall be referred to as 'processor imbalance'. This means that when a fast processor gains cells from a slow processor, then these shall be processed at the processor's own weight, since the weight of the cells are not transferred. Figure 4a shows the timings for a cluster of nine workstations used to solve the above problem, where the load was forcibly balanced after every 1000 th iteration (this was used to test the load balancing algorithm rather than when to rebalance the load). It can be seen t h a t the maximum iteration time has been dramatically reduced, and the timings appear more 'balanced' after each rebalance, improving the performance of the code. Figure 4b shows that the load on the middle processor (the slowest) is being reduced each time, indicating that the algorithm is behaving as expected. When running on a homogeneous system of processors, the timings should be adjusted differently before balancing subsequent dimensions, as the load imbalance is no longer associated with the variation between the processors since their speeds are the same.
350
P r o c e s s o r T i m i n g s b e f o r e Load B a l a n c i n g
123
130
113
1111
131
100
110
119
60
45'
74
91
$1'
99
93
241
1171 127I
113
98,~
200
E 150
100
I-
," 100 o
m 50 0
E
o 0
85
1
2
3
4 5 6 7 Processor Number
8
9
Figure 4a. The processor times used to solve 3000 iterations of the Jacobi Iterative Solver on a 3x3 workstation cluster, in which the load is redistributed after every 1000th iteration.
76
39
138 7
.
26
Block 4 /
.................
I : / .....................
Block 3 /
/_ ........................ 2z6 Boundary Conditions: Streamwise, x : periodic Spanwise, z : periodic Wall normal,y : no slip
V
y
2/
V
Block 1 / Blocking direction Values at block interfaces are obtained from a global array
Figure 4. Channel geometry showing partitioned blocks tasked to processors domain decomposition can of decomposition from the posed, the same sequential multiple sub-domains. The
result in good scalability, it does transfer the responsibility computer to the user. Once the problem domain is decomalgorithm is followed, but the program is modified to handle data that is local to a sub-domain is specified as PRIVATE or THREADPRIVATE. THREADPRIVATE is used for sub-domain data that need file scope or are used in common blocks. The THREADPRIVATE blocks are shared among the subroutines but private to the thread itself. This type of programming is similar in spirit to message passing in that it relies on domain decomposition. Message passing is replaced by shared data that can be read by all the threads thus avoiding communication overhead. Synchronization of writes to shared data is required. For a Cartesian grid, the domain decomposition and geometry are shown in Figure 4. Data initialization is parallelized using one parallel region for better data locality among active processors. This method overcomes some of the drawbacks of first-touch policy adopted by the compiler. If the data is not distributed properly, the first-touch policy may distribute the data to a remote node, incurring a remote memory access penalty. The" main computational kernel is embedded in the time advancing loops. The time loops are treated sequentially due to obvious data dependency, and the kernel itself is embedded in a second parallel region. Within this parallel region, the computational domain is divided into blocks in the z-direction, as shown in Figure 4, which allows each block to be tasked to a different processor. Several grid sizes were considered; the scalability chart is shown in Figure 5 and a typical load balance chart (using MELOPS as an indicator) is shown in Figure 6. We see performance degradation near 8 processors for the 32 • 32 • 32 grid and 16 processors for the 81 • 81 x 81 grid. Less than perfect load balancing is seen due to the remnant serial component in two subroutines. We observe linear speedup up to 4 processors across all the grid sizes and the large memory case levels off at 8 processors. The SPMD style of parallelization shows an encouraging trend in scalability. A detailed analysis with fully
378 3000
,
D
-e- 32/OMP
I-+
2500
2000 r
E
'~' 28 I I I I
13..
27
26
II
D1500 13. 0 1000
64/OMP I
25
I
\
\
24 if) ~) 23
\
'~
J
22
500 '~ ...... .............................. i
2
i
4
i
6
Y
8
i
10
l
12
i
14
Number of processors
Figure 5. Scalability chart for 323, 643 and 81 a grids
16
20 0
2
4
6 8 10 12 Processornumber
14
16
Figure 6. SGI MFLOPS across all processors for OpenMP LES code for 813 grid
cache optimized and parallelized code will be presented elsewhere. 4. S U M M A R Y
In summary, the CRAY C90/T90 vector code is optimized and parallelized for Origin2000 performance. A significant portion of our time is spent in optimizing c s i p 5 v . f , an in-house LU decomposition solver, which happens to be the most expensive subroutine. The FORTRAN subroutine is modified by changing the order of nested do loops so that the innermost index is the fastest changing index. Several arrays in c s i p 5 v . f are redefined for data locality, and computations are rearranged to optimize cache reuse. Automatic parallelization, PFA, scales comparably to SPMD style OpenMP parallelism, but performs poorly for larger scale sizes and when more than 8 processors are used. SPMD style OpenMP parallelization scales well for the 813 grid, but shows degradation due to the serial component in still unoptimized subroutines. These subroutines contain data dependencies and will be addressed in a future publication. Finally, we report an important observation, for the 32 x 32 x 32 grid presented here, that cache optimization is crucial for achieving parallel efficiency on the SGI Origin2000 machine. 5. A C K N O W L E D G M E N T
The current research was partially supported by the Air Force Office of Scientific Research under Grant F49620-94-1-0168 and by the National Science Foundation under grant CTS-9414052. The use of computer resources provided by the U.S. Army Research Laboratory Major Shared Resource Center and the National Partnership for Advanced
379 Computational Infrastructure at the San Diego Supercomputing Center is gratefully acknowledged. REFERENCES
1. James Taft, Initial SGI Origin2000 tests show promise for CFD codes, NAS News, Volume 2, Number 25,July-August 1997. 2. Pletcher, R. H. and Chen, K.-H., On solving the compressible Navier-Stokes equations for unsteady flows at very low Mach numbers, AIAA Paper 93-3368, 1993. 3. Wang, W.-P., Coupled compressible and incompressible finite volume formulations of the large eddy simulation of turbulent flows with and without heat transfer, Ph.D. thesis, Iowa State University, 1995. 4. Jin, H., M. Haribar, and Jerry Yah, Parallelization of ARC3d with Computer-Aided Tools, NAS Technical Reports, Number NAS-98-005, 1998. 5. Frumkin, M., M. Haribar, H. Jin, A. Waheed, and J. Yah. A comparison Automatic Parallelization Tools/Compilers on the SGI Origin2OOO,NAS Technical Reports. 6. KAP/Pro Toolset for OpenMP, http://www.k~i.com 7. OpenMP Specification. http://www.openmp.org, 1999. 8. Ramesh Menon, OpenMP Tutorial. SuperComputing, 1999. 9. Optimizing Code on Cray PVP Systems, Publication SG-2912, Cray Research Online Software Publications Library. 10. Guide to Parallel Vector Appllcations, Publication SG-2182, Cray Research Online Software Publications Library. 11. Satya-narayana, Punyam, Philip Mucci, Ravikanth Avancha, Optimization and Parallelization of a CRAY C90 code for ORIGIN performance: What we accomplished in 7 days. Cray Users Group Meeting, Denver, USA 1998. 12. Origin 2000(TM) and Onyx2(TM) Performance Tuning and Optimization Guide. Document number 007-3430-002. SGI Technical Publications.
This Page Intentionally Left Blank
Parallel Computational Fluid Dynamics Towards Teraflops, Optimization and Novel Formulations D. Keyes, A. Ecer, J. Periaux, N. Satofuka and P. Fox (Editors) 92000 Elsevier Science B.V. All rights reserved.
381
C a l c u l a t i o n of U n s t e a d y I n c o m p r e s s i b l e Flows on a M a s s i v e l y P a r a l l e l C o m p u t e r U s i n g t h e B.F.C. C o u p l e d M e t h o d K.Shimano a, Y. Hamajima b, and C. Arakawa b
a
b
Department of Mechanical Systems Engineering, Musashi Institute of Technology, 1-28-1 Tamazutsumi, Setagaya-ku, Tokyo 158-8557, JAPAN Department of Mechanical Engineering, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, JAPAN
The situation called "fine granularity" is considered, which means that the number of grid points allocated to one processor is small. Parallel efficiency usually deteriorates in fine granularity parallel computations therefore it is crucial to develop a parallel Navier-Stokes solver of high parallel efficiency. In this regard, the authors have focused on the coupled method. In this study the coupled method is extended to the boundary fitted coordinate and is applied to the unsteady flow around a circular cylinder. High parallel efficiency of the B.F.C coupled method is demonstrated. 1.
INTRODUCTION
Fluid flow simulation of much larger problems will be necessary without doubt. However, a considerable jump-up of single-processor performance cannot be expected while the computational time will become longer as the number of grid points increases. To overcome this problem, the simplest solution is to use as many processors as possible while still keeping parallel efficiency high. In other words, the number of grid points per processor should be decreased. This is called massively parallel computing of very fine granularity. In this study, the authors deal with this situation. Above everything else, it is crucial to develop a parallel Navier-Stokes solver of high efficiency even for fine granularity. The authors[l-2] have focused on the high parallel efficiency of the coupled method originally proposed by Vanka[3] and showed that the coupled method achieved relatively high parallel efficiencies in Cartesian coordinates. Its simple numerical procedures are another advantage.
382
The existing coupled method is characterized by use of the staggered grid in Cartesian coordinates. However, in terms of practical application, it should be applicable to problems in the Boundary Fitted Coordinate (B.F.C.) system. In this paper, extension of the coupled method to B.F.C. is presented and computational results of the unsteady flow around a circular cylinder are shown. High parallel efficiency of the B.F.C. coupled method is demonstrated: for example, even when only 32 control volumes are allocated to one processor in the 3-D calculation, the parallel efficiency reaches 31%. At the end of the paper, accuracy of the B.F.C. coupled method is also discussed.
2.
COUPLED METHOD IN THE CARTESIAN
Vanka[3] proposed the original coupled method as the SCGS method. It is characterized by use of the staggered grid in the Cartesian coordinates. In the 2dimensional case, four velocity components and one pressure component corresponding to one staggered cell are implicitly solved. In practice, the 5 • 5 matrix is inverted for each cell. In the 3-dimensional case, the size of the matrix is 7 • 7. In the iteration process, the momentum equations and the continuity equation are simultaneously taken into account so that mass conservation is always locally satisfied. See references [1-3] for details. In Figure 1, the coupled method is compared with the SIMPLE method in terms of parallel efficiency. The results were obtained by calculation of the lid-driven square cavity flow at Re=100 on a Hitachi SR2201. The number of control volumes (CVs) is 128 • 128. Obviously the efficiency of the coupled method is higher than that of the SIMPLE method. There are several reasons that the coupled method is suitable for fine granularity 120 100 >,
o
E O ._ O
UJ
i!i!iiii
80 60
i SIMPLE 1 D Coupled i
-ii,ir
40 k.
a.
20
2x2
4x4
8x8
Number of Processors
Figure 1. Parallel Efficiency of SIMPLE and Coupled methods
383
parallel computing: 1. Velocity components and pressure values are simultaneously sent and received by message-passing, therefore the overhead time for calling the message-passing library is reduced. 2. Numerical operations to invert the local matrix hold a large part in the code and can be almost perfectly parallelized. 3. Numerical procedures are quite simple.
3. EXTENSION OF COUPLED METHOD TO B.F.C.
The boundary fitted coordinate (B.F.C.) system is often used for engineering applications. A B.F.C. version of the coupled method developed by the authors is presented in this section. In the B.F.C. version, unknowns and control volumes (CVs) are located as shown in Figure 2. This arrangement is only for the 2-D calculation. For 3-D flows, pressure is defined at the center of the rectangular parallelepiped. In the iteration process of the 2-D calculation, eight velocity components and one pressure component constituting one CV for the continuity equation are simultaneously updated; that is, the following 9 × 9 system is inverted for each cell:

\[
\mathbf{A}
\begin{pmatrix}
u_{i,j} \\ u_{i+1,j} \\ u_{i,j+1} \\ u_{i+1,j+1} \\
v_{i,j} \\ v_{i+1,j} \\ v_{i,j+1} \\ v_{i+1,j+1} \\
p_{i,j}
\end{pmatrix}
=
\begin{pmatrix}
f^{u}_{i,j} \\ f^{u}_{i+1,j} \\ f^{u}_{i,j+1} \\ f^{u}_{i+1,j+1} \\
f^{v}_{i,j} \\ f^{v}_{i+1,j} \\ f^{v}_{i,j+1} \\ f^{v}_{i+1,j+1} \\
0
\end{pmatrix}
\qquad (1)
\]
In Eq. (1), A denotes the 9 × 9 coefficient matrix whose non-zero elements (shown as black circles in the original paper) couple the eight velocity components and the pressure of the cell. Since the velocity components u and v are defined at the same points, the contribution of the advection-diffusion terms is the same for both components. Therefore, the computational cost to invert the 9 × 9 matrix is reduced to the order of a 5 × 5 inversion. For 3-D flows, the cost is only of order 9 × 9 for the same reason, though the size of the matrix to be inverted is 25 × 25. The matrix inversion mentioned above is applied to all the cells according to an odd-even ordering. For example, in the 2-D calculation, the cells are divided into four groups and Eq. (1) is solved in one group after another; the number of groups is eight in the 3-D calculation. If this grouping is not adopted, relaxation will be less effective.
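The grouping and cell-wise inversion just described can be sketched as follows. This is only an illustration, not the authors' code: assemble_local_system() and scatter_update(), which build the 9 × 9 local system from the B.F.C. discretization and write the updated unknowns back to the grid, are assumed to exist, and plain Gaussian elimination stands in for whatever local solver the original program uses.

/* Sketch of one inner iteration of a cell-wise coupled relaxation on a 2-D
 * grid, visiting the cells in four groups (odd/even in i and j) so that
 * cells updated at the same time never share an unknown. */
#include <math.h>

#define NLOC 9   /* 8 velocity components + 1 pressure per CV (2-D B.F.C.) */

/* Assumed problem-specific routines (not given in the paper). */
extern void assemble_local_system(int i, int j, double A[NLOC][NLOC], double b[NLOC]);
extern void scatter_update(int i, int j, const double x[NLOC]);

/* Solve A x = b by Gaussian elimination with partial pivoting (in place). */
static void solve_dense(double A[NLOC][NLOC], double b[NLOC], double x[NLOC])
{
    for (int k = 0; k < NLOC; ++k) {
        int piv = k;
        for (int r = k + 1; r < NLOC; ++r)
            if (fabs(A[r][k]) > fabs(A[piv][k])) piv = r;
        if (piv != k) {
            for (int c = 0; c < NLOC; ++c) { double t = A[k][c]; A[k][c] = A[piv][c]; A[piv][c] = t; }
            double t = b[k]; b[k] = b[piv]; b[piv] = t;
        }
        for (int r = k + 1; r < NLOC; ++r) {
            double m = A[r][k] / A[k][k];
            for (int c = k; c < NLOC; ++c) A[r][c] -= m * A[k][c];
            b[r] -= m * b[k];
        }
    }
    for (int r = NLOC - 1; r >= 0; --r) {
        double s = b[r];
        for (int c = r + 1; c < NLOC; ++c) s -= A[r][c] * x[c];
        x[r] = s / A[r][r];
    }
}

/* One sweep over the grid: the four cell groups are visited one after another. */
void relax_groups_2d(int ni, int nj)
{
    for (int group = 0; group < 4; ++group) {
        int oi = group % 2, oj = group / 2;
        for (int j = oj; j < nj; j += 2)
            for (int i = oi; i < ni; i += 2) {
                double A[NLOC][NLOC], b[NLOC], x[NLOC];
                assemble_local_system(i, j, A, b);
                solve_dense(A, b, x);
                scatter_update(i, j, x);
            }
    }
}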
When the arrangement in Figure 2 is used, a checkerboard-like pressure oscillation sometimes appears. However, pressure differences such as P_C - P_A in Figure 2 are always correctly evaluated and the pressure oscillation does not spoil stability or accuracy. In order to eliminate the oscillation, a pressure P' is newly defined as shown in Figure 2 and is calculated according to the equation:

\[
P' = \frac{1}{4}\left(P_A + P_B + P_C + P_D\right) \qquad (2)
\]

Figure 2. Arrangement of Variables and Control Volumes for 2-D B.F.C. (control volumes for the continuity equation and for the momentum equations)
Figure 3. Communication Patterns (Most Frequent, 2-D B.F.C., 3-D B.F.C.)
4. PARALLELIZATION STRATEGY
The B.F.C. coupled method is implemented on a parallel computer by domain decomposition techniques. The communication pattern is an important factor that determines the efficiency of parallel computation. Three communication patterns are compared in Figure 3. Communication between processors via message-passing is expressed by the gray oval. If the most frequent communication is chosen, data are transferred each time after relaxation on one cell group. This pattern was adopted in the authors' Cartesian coupled method [1-2]. It has the merit that the convergence property is not considerably changed by domain decomposition. However, if this pattern is adopted for the B.F.C. coupled method, the message-passing library is called many times in one inner iteration; at least 4 times in the 2-D calculation and 8 times in the 3-D calculation. This leads to increased overhead time. To avoid that, the patterns shown in the center and on the right of Figure 3 are adopted in this study. In the 2-D calculation, processors communicate after relaxation on every two groups. In the 3-D calculation, data are exchanged only once, after relaxation on all eight groups is completed. A less frequent communication pattern is chosen because the 3-D communication load is much heavier. However, when using this communication pattern, the convergence property of the 3-D calculation is expected to become worse. To compensate for that, cells located along internal boundaries are overlapped as shown in Figure 4.
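A minimal MPI sketch of the 3-D pattern chosen here (one exchange per inner iteration, after all eight cell groups have been relaxed) is given below. It is not the authors' code: relax_group() and the neighbour and buffer bookkeeping are assumed, and the velocities and pressure are supposed to be packed into one buffer per neighbour, as discussed above.

/* "Least frequent" communication pattern: relax all eight cell groups,
 * then exchange interface values once with each neighbouring subdomain. */
#include <mpi.h>

extern void relax_group(int group);          /* assumed: one cell-group relaxation */
extern int  n_neighbours;                    /* assumed: number of adjacent subdomains (<= 6) */
extern int  neighbour_rank[6];               /* assumed: MPI ranks of the neighbours */
extern int  halo_count[6];                   /* assumed: number of doubles per interface */
extern double *send_buf[6], *recv_buf[6];    /* assumed: packed u, v, w, p values */
extern void pack_halo(int n);                /* assumed */
extern void unpack_halo(int n);              /* assumed */

void inner_iteration_3d(void)
{
    MPI_Request req[12];
    int nreq = 0;

    for (int group = 0; group < 8; ++group)  /* eight groups in 3-D */
        relax_group(group);

    /* a single exchange per inner iteration */
    for (int n = 0; n < n_neighbours; ++n) {
        pack_halo(n);
        MPI_Irecv(recv_buf[n], halo_count[n], MPI_DOUBLE,
                  neighbour_rank[n], 0, MPI_COMM_WORLD, &req[nreq++]);
        MPI_Isend(send_buf[n], halo_count[n], MPI_DOUBLE,
                  neighbour_rank[n], 0, MPI_COMM_WORLD, &req[nreq++]);
    }
    MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);

    for (int n = 0; n < n_neighbours; ++n)
        unpack_halo(n);
}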
Figure 4. Overlapping of Cells (ordinary decomposition vs. overlapping)
Gray cells in Figure 4 belong to both subdomains and are calculated on two processors at the same time. Using this strategy, solutions converge quickly even though the parallel efficiency of one outer-iteration is reduced.
5. RESULTS
The B.F.C. coupled method was implemented on a Hitachi SR2201 to calculate the flow around a circular cylinder. The Hitachi SR2201 has 1024 processing elements (PEs), each with a performance of 300 Mflops. PEs are connected by a 3-D crossbar switch and the data transfer speed between processors is 300 Mbytes/sec. In this study, PARALLELWARE was adopted as the software. Widely used software such as PVM is also available. Computational conditions and obtained results are shown in Table 1. Calculations were performed 2-dimensionally and 3-dimensionally. The Reynolds number was 100 in the 2-D calculation and 1000 in the 3-D calculation. The same number of CVs, 32768, was adopted to compare the 2-D and 3-D cases at the same granularity level. The item "CVs per PE" in Table 1 represents the number of grid points that one processor handles; namely, it indicates the granularity of the calculation. The number of CVs per PE is very small; the smallest is only 32 CVs per processor. One may think that too many processors were used to solve too small a problem, but there was no other option. As mentioned in Chapter 1, quite fine granularity was investigated in this study. If such fine granularity had been realized for much larger problems, a bigger parallel computer with many more processors would have been required. However, such a big parallel computer is not available today. Fine granularity is a situation that one will see in the future. The authors had to deal with this future problem using today's computers and therefore made this compromise.

Table 1. Computational Conditions and Results

              2-D (256 × 128 = 32768 CVs)      3-D (64 × 32 × 16 = 32768 CVs)
PE            256      512      1024            256      512      1024
Subdomain     16×16    32×16    32×32           8×8×4    16×8×4   16×8×8
CVs per PE    128      64       32              128      64       32
ε_ope         95.0%    76.9%    51.3%           50.1%    35.6%    23.8%
ε_iter        97.0%    95.1%    93.6%           126.3%   129.5%   130.9%
ε             92.1%    73.2%    48.0%           63.3%    46.1%    31.1%
Speed-up      236      375      491             162      236      319
Time-dependent terms were discretized by the first-order backward difference and implicit time marching was adopted. Performance of the parallel computation is shown in the last four rows of Table 1. There are three kinds of efficiency. ε_ope expresses how efficiently the operations in one outer iteration are parallelized. The second efficiency, ε_iter, represents the convergence property: if its value is 100%, the number of outer iterations is the same as that in the single-processor calculation. The third efficiency, ε, is the total efficiency, expressed as the product of ε_ope and ε_iter (see [4]). Judging from ε_ope, the outer iteration is not well parallelized in the 3-D calculation. These values are roughly half of those of the 2-D calculation because of the heavy load of 3-D communication and the overlapped relaxation explained in Section 4. In the 2-D calculation, the convergence property does not greatly deteriorate because the efficiency ε_iter is kept at more than 93%. This means that the communication pattern for the 2-D calculation is appropriate enough to keep ε_iter high. In the 3-D calculation, the efficiency ε_iter exceeds 100%. This means that the number of outer iterations is smaller than in the single-processor calculation. This is caused by the overlapped relaxation. If the most frequent communication pattern had been used instead, the iteration efficiency would have stayed around 100% at best. Overlapped relaxation is one good option to obtain fast convergence. Using 1024 processors, the total efficiency ε reaches 48% in the 2-D calculation and 31% in the 3-D calculation. Considering the number of CVs per PE, these values of efficiency are very high. It can be concluded that the B.F.C. coupled method is suitable for massively parallel computing of quite fine granularity. Further work is necessary to obtain more speed-up. One idea is a combination of the coupled method and an acceleration technique; such a combined acceleration technique must also be suitable for massively parallel computing. Shimano and Arakawa [1] adopted the extrapolation method as an acceleration technique for fine granularity parallel computing. Finally, the accuracy of the B.F.C. coupled method is discussed. In the 3-D calculation, the length of the computational domain in the z-direction was 4 times the diameter. The grid of 64 × 32 × 16 CVs was too coarse to capture the exact 3-D structure of the flow. The authors tried 3-D calculations using eight times as many CVs, namely 128 × 64 × 32 (= 262144) CVs. In the computational results on the fine grid, the appearance of four cyclic pairs of vortices in the z-direction was successfully simulated (see Figure 5), as had been reported in the experimental study by Williamson [5]. The estimated drag coefficient and Strouhal number are 1.08 and 0.206 respectively, which are in accordance with the experimental data of 1.0 and 0.21. From these facts, the high accuracy of the B.F.C. coupled method is confirmed.
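As a quick consistency check (added here, not part of the original text), the definition of the total efficiency can be verified against the 1024-PE columns of Table 1:

\[
\varepsilon = \varepsilon_{ope}\,\varepsilon_{iter}:\qquad
0.513 \times 0.936 \approx 0.480 \ (2\text{-D}),\qquad
0.238 \times 1.309 \approx 0.311 \ (3\text{-D}),
\]

and the corresponding speed-ups, \(0.480 \times 1024 \approx 491\) and \(0.311 \times 1024 \approx 319\), agree with the last row of the table.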
Figure 5. Flow around a 3-D Circular Cylinder, Isosurfaces of Vorticity (X-Axis), Re = 1000, 128 × 64 × 32 (= 262,144) CVs, 64 PE
6. CONCLUSIONS
The B.F.C. coupled method was proposed and applied to 2-D and 3-D calculations of flow around a circular cylinder. The overlapped relaxation adopted in the 3-D calculation was effective for fast convergence. Obtained parallel efficiency was high even when granularity was quite fine. Suitability of the coupled method to massively parallel computing was demonstrated. For example, even when only 32 control volumes were allocated to one processor, the total parallel efficiency reached 48% in the 2-D calculation and 31% in the 3-D calculation. The computational results by the B.F.C. coupled method were compared with experimental data and its high accuracy was ascertained.
REFERENCES
1. K. Shimano and C. Arakawa, Comp. Mech., 23 (1999) 172.
2. K. Shimano et al., Parallel Computational Fluid Dynamics, (1998) 481, Elsevier.
3. S.P. Vanka, J. Comp. Phys., 65 (1986) 138.
4. K. Shimano and C. Arakawa, Parallel Computational Fluid Dynamics, 189 (1995), Elsevier.
5. C.H.K. Williamson, J. Fluid Mech., 243 (1992) 393.
A parallel algorithm for the detailed numerical simulation of reactive flows
M. Soria, J. Cadafalch, R. Consul, K. Claramunt and A. Oliva
Laboratori de Termotecnia i Energetica, Dept. de Maquines i Motors Termics, Universitat Politecnica de Catalunya, Colom 9, E-08222 Terrassa, Barcelona (Spain), manel@labtie.mmt.upc.es
The aim of this work is to use parallel computers to advance in the simulation of laminar flames using finite rate kinetics. To this end, a parallel version of an existing sequential, subdomain-based CFD code for reactive flows has been developed. In this paper, the physical model is described. The advantages and disadvantages of different strategies to parallelize the code are discussed. A domain decomposition approach with communications only after outer iterations is implemented. It is shown, for our specific problem, that this approach provides good numerical efficiencies on different computing platforms, including clusters of PCs. Illustrative results are presented.
1. INTRODUCTION
The aim of this work is to use parallel computers to advance in the simulation of laminar flames using finite rate kinetics. A parallel version of an existing sequential, subdomain-based CFD code for reactive flows has been developed [1,6,7]. Although industrial combustors work under turbulent conditions, studies assuming laminar flow conditions are a common issue in their design. Furthermore, a good understanding of laminar flames and their properties constitutes a basic ingredient for the modelling of more complex flows [4]. The numerical integration of PDE systems describing combustion involves exceedingly long CPU times, especially if complex finite rate kinetics are used to describe the chemical processes. A typical example is the full mechanism proposed by Warnatz [21], with 35 species and 158 reactions. In addition to the momentum, continuity and energy equations, a convection-diffusion transport equation has to be solved for each of the species, as well as the kinetics. The objective of these detailed simulations is to improve the understanding of combustion phenomena, in order to be able to model it using less expensive models.
*This work has been financially supported by the Comision Interministerial de Ciencia y Tecnologia, Spain, project TIC724-96. The authors acknowledge the help provided by David Emerson and Kevin Maguire from Daresbury Laboratory (UK).
2. PHYSICAL MODEL
The governing equations for a reactive gas (continuity, momentum, energy, species and state equation) can be written as follows:
\[
\frac{\partial \rho}{\partial t} + \nabla\cdot(\rho\,\mathbf{v}) = 0 \qquad (1)
\]
\[
\rho\,\frac{\partial \mathbf{v}}{\partial t} + (\rho\,\mathbf{v}\cdot\nabla)\,\mathbf{v} = \nabla\cdot\tau_{ij} - \nabla p + \rho\,\mathbf{g} \qquad (2)
\]
\[
\frac{\partial (\rho h)}{\partial t} + \nabla\cdot(\rho\,\mathbf{v}\,h) = \nabla\cdot(k\,\nabla T) - \nabla\cdot\left(\rho \sum_{i=1}^{N} h_i\,Y_i\,(\mathbf{v}_i - \mathbf{v})\right) \qquad (3)
\]
\[
\frac{\partial (\rho Y_i)}{\partial t} + \nabla\cdot(\rho\,Y_i\,\mathbf{v}_i) = \dot{\omega}_i \qquad (4)
\]
\[
\rho = \frac{p\,M}{R\,T} \qquad (5)
\]
where t is time; ρ the mass density; v the average velocity of the mixture; τ_ij the stress tensor; p the pressure; g the gravity; N the total number of chemical species; h the specific enthalpy of the mixture; h_i the specific enthalpy of species i; T the temperature; k the thermal conductivity of the mixture; M the molecular weight of the mixture; and R the universal gas constant. The diffusion velocities are evaluated considering both mass diffusion and thermal diffusion effects:
\[
\mathbf{v}_i = \mathbf{v} - D_{im}\,\nabla Y_i - \frac{D^{T}_{im}}{\rho}\,\nabla(\ln T) \qquad (6)
\]
where D_im and D^T_im are respectively the diffusivity and thermal diffusivity of species i into the mixture. The evaluation of the net rate of production of each species, due to the J reactions, is obtained by summing up the individual contributions of each reaction:
\[
\dot{\omega}_i = M_i \sum_{j=1}^{J} \left(\nu''_{ij} - \nu'_{ij}\right)
\left( k_{f,j} \prod_{l=1}^{N} [m_l]^{\nu'_{lj}} - k_{b,j} \prod_{l=1}^{N} [m_l]^{\nu''_{lj}} \right) \qquad (7)
\]
Here, [m_i] are the molar concentrations and M_i the molecular weights of the species, ν'_{ij} and ν''_{ij} the stoichiometric coefficients of species i appearing as a reactant and as a product, respectively, in reaction j, and k_{f,j}, k_{b,j} the forward and backward rate constants. The transport and thermophysical properties have been evaluated using CHEMKIN's database. More information on the model can be found in [6,7].
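As an illustration of Eq. (7) (an example added here, not taken from the paper), consider a single reversible reaction A + B ⇌ C. For species C one has ν'_{C1} = 0 and ν''_{C1} = 1, so the net production rate reduces to

\[
\dot{\omega}_C = M_C\left(k_{f,1}\,[A][B] - k_{b,1}\,[C]\right),
\]

which is the familiar law of mass action for that reaction.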
3. PARALLEL ASPECTS
3.1. Parallelization strategy
Different parallel computing approaches, or combinations of them, can be used to obtain the numerical solution of a set of PDEs solved using implicit techniques. Functional decomposition [13] is based on assigning different tasks to different processors. In our case,
the tasks would be the different PDEs. In the case of reactive flows, this method is more attractive due to the higher number of scalar equations to be solved. Domain decomposition in time [18] is based on the simultaneous solution of the discrete nonlinear equations for different time steps on different processors. Both approaches require the transfer of all the unknowns after each outer iteration. As our code is targeted at clusters of workstations, they have been discarded in favour of the domain decomposition [17,14] approach. In the domain decomposition method, the spatial domain to be solved is divided into a number of blocks or subdomains which can be assigned to different CPUs. As the PDEs express explicitly only spatially local couplings, domain decomposition is perhaps the most natural strategy for this situation. However, it has to be kept in mind that the PDEs to be solved in CFD are, in general, elliptic: local couplings are propagated to the entire domain. Thus, a global coupling involving all the discrete unknowns is to be expected, and a domain decomposition approach has to be able to deal effectively with this problem if it is to be used for general-purpose CFD solvers. Different variants of the domain decomposition method can be found in the literature. In a first approach, each subdomain is treated as an independent continuous problem with its own boundary conditions, which use information generated by the other subdomains where necessary. The boundary conditions are updated between iterations until convergence is reached. Another approach is to consider only a single continuous domain and to use each processor to generate the discrete equations related to its part of the domain; the solution of the linear systems is then done using a parallel algorithm, typically a Krylov subspace method. The first approach has been used here, because (i) as linear equations are not solved in parallel, it requires less communication between the processors, and only after each outer iteration, and (ii) it allows almost all the sequential code to be reused without changes. The first advantage is especially relevant in our case, as our code is to be used mainly on clusters of PCs. It is important to point out that the iterative update of the boundary conditions, without any other strategy to enforce global coupling of the unknowns, behaves essentially as a Jacobi algorithm with as many unknowns as subdomains. Thus, the method does not scale well with the number of processors, unless the special circumstances of the flow help the convergence process. This is our case, as the flows of our main interest (flames) have a quasi-parabolic behaviour. The domain decomposition method is known to behave well for parabolic flows: for each subdomain, as the guessed values at the downstream region have no effect over the domain, the information generated at the upstream region is quickly propagated from the first to the last subdomain [15].
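A schematic of the first approach (interface boundary conditions updated between outer iterations) might look as follows. The solver and exchange routines are placeholders, not the actual code described in the paper.

/* Domain decomposition driver, conceptually: each processor solves its own
 * subdomain with frozen interface boundary conditions, then the interface
 * values are exchanged and the outer iteration is repeated until a global
 * convergence test is met. */
#include <mpi.h>

extern void solve_subdomain(void);           /* assumed: one outer iteration of the sequential code */
extern void exchange_interface_values(void); /* assumed: overlapping-cell data to/from neighbours */
extern double local_residual_norm(void);     /* assumed: residual of the local discrete equations */

void outer_loop(double tol, int max_outer)
{
    for (int it = 0; it < max_outer; ++it) {
        solve_subdomain();                 /* local PDE solve with current interface BCs */
        exchange_interface_values();       /* communication only after the outer iteration */

        double rloc = local_residual_norm(), rglob;
        MPI_Allreduce(&rloc, &rglob, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
        if (rglob < tol) break;            /* all subdomains converged */
    }
}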
The code for the solution of a singledomain problem remains virtually identical to the previous sequential code. In fact, it
392 can still be compiled without MPI library and invoked as a sequential code. The parallel implementation of the code had two goals: allow maximum portability between different computing platforms and keep the code as similar as possible to the sequential version. To achieve them, message passing paradigm has been used (MPI) and all the calls to message passing functions have been grouped on a program module. A set of input-output functions has been implemented. The code for the solution of a singledomain problem remains virtually identical to the previous sequential code. In fact, it can still be compiled without MPI and invoked as a sequential code.
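One way to obtain this "compiles with or without MPI" property is to hide all message passing behind a tiny module; the sketch below only illustrates the idea and is not the authors' implementation.

/* comm.c - minimal communication module: all low-level message-passing
 * calls are grouped here, so the rest of the code never includes mpi.h.
 * Compiled without -DUSE_MPI, the same interface degenerates to a
 * sequential (single-subdomain) run. */
#ifdef USE_MPI
#include <mpi.h>
#endif

void comm_init(int *argc, char ***argv, int *rank, int *nproc)
{
#ifdef USE_MPI
    MPI_Init(argc, argv);
    MPI_Comm_rank(MPI_COMM_WORLD, rank);
    MPI_Comm_size(MPI_COMM_WORLD, nproc);
#else
    (void)argc; (void)argv;
    *rank = 0; *nproc = 1;
#endif
}

void comm_sum_double(double *x)   /* global sum, e.g. for residual norms */
{
#ifdef USE_MPI
    double local = *x;
    MPI_Allreduce(&local, x, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
#endif
}

void comm_finalize(void)
{
#ifdef USE_MPI
    MPI_Finalize();
#endif
}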
Figure 1. Illustrative results. From left to right: Temperature distribution (K), OH mass fraction and CH3 mass fraction.
4. NUMERICAL ASPECTS
In our implementation of the domain decomposition method, the meshes of the individual subdomains are overlapped and non-coincident. The second feature allows more geometrical flexibility, which is useful to refine the mesh in the sharp-gradient areas at the edges of the flames, but the information to be transferred between the subdomains has to be interpolated. This has to be done while fulfilling the well-posedness conditions; i.e. the adopted interpolation scheme and the disposition of the subdomains should not affect the result of the differential equations. For instance, methods that would be correct for a single second-order PDE [5] are not valid for the full Navier-Stokes set. If the well-posedness condition is not satisfied, more outer iterations are needed and slightly wrong solutions
can be obtained. Here, conservative interpolation schemes that preserve local fluxes of the physical quantities between the subdomains are used [2,3]. The governing equations are spatially discretized using the finite control volume method. An implicit scheme is used for time marching. A two-dimensional structured and staggered Cartesian or cylindrical (axisymmetric) mesh has been used for each domain. The high-order SMART scheme [10] and the central difference scheme are used to approximate the convective and diffusive terms at the control volume faces. The SMART scheme is implemented in terms of a deferred correction approach [8], so the computational molecule for each point involves only five neighbours. The solution of the kinetics and of the transport terms is segregated. Using this approach, the kinetic terms form an ODE for each control volume, which is solved using a modified Newton's method with different techniques to improve the robustness [19]. To solve the continuity-momentum coupling, two methods can be used: (i) the Coupled Additive Correction Multigrid [16], in which the coupled discrete momentum and continuity equations are solved using the SCGS algorithm; (ii) the SIMPLEC algorithm [9] with an Additive Correction Multigrid solver for the pressure correction equation [11]. In both cases, correction equations are obtained from the discrete equations.
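The deferred-correction treatment of the convective flux mentioned above can be written generically as follows (this is the standard form of the approach; the paper does not give the exact expression used in the code):

\[
F_f = F_f^{\mathrm{UDS}}\Big|_{\text{implicit}} +
\left(F_f^{\mathrm{SMART}} - F_f^{\mathrm{UDS}}\right)\Big|_{\text{previous outer iteration}},
\]

so that only the compact upwind (UDS) stencil enters the implicit coefficient matrix, which is why the computational molecule of each point involves only five neighbours in 2-D, while the high-order SMART contribution is carried to the source term and converges together with the outer iterations.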
5. ILLUSTRATIVE RESULTS
The premixed methane/air laminar flat flame studied by Sommers [20] is considered as an illustrative example and as a benchmark problem. A stoichiometric methane-air homogeneous mixture flows through a drilled burner plate into an open domain. The mixture is ignited above the burner surface. The boundary conditions at the inlet are a parabolic velocity profile with a maximum value of 0.78 m/s, T = 298.2 K and concentrations of N2, O2 and CH4 of 0.72, 0.22 and 0.0551 respectively. At the sides, ∂/∂x = 0 is imposed for all the unknowns except v_x = 0. The dimensions of the domain are 0.75 × 4 mm. In Fig. 1, the results obtained with the skeletal mechanism by Keyes and Smooke [19], with 42 reactions and 15 species, are presented. For the benchmarks, the four-step global reaction mechanism by Jones and Lindstedt [12] has been used.
6. PARALLEL PERFORMANCE
Before starting the parallel implementation, the sequential version of the code was used to evaluate its numerical efficiency for this problem, from one to ten subdomains. A typical result obtained with our benchmark problem can be seen in Fig. 2. The number of outer iterations remains roughly constant from one to ten subdomains. However, the total CPU time increases due to the extra cost of the interpolations and the time to solve the overlapping areas. Even prescribing the same number of control volumes for each of the processors, there is a load imbalance, as can be seen in Fig. 3 for a situation with 10 subdomains. It is due to three reasons: (i) the time to solve the kinetics in each of the control volumes differs; (ii) there are solid areas where there is almost no computational effort; (iii) the inner subdomains have two overlapping areas while the outer subdomains have only one. The following systems have been used for benchmarking the code: (i) O2000: SGI Origin 2000, a shared memory system with R10000 processors; (ii) IBM SP2, a distributed memory
Figure 2. Time to solve each of the subdomains for a ten subdomains problem.
Figure 3. Number of iterations and CPU time as a function of the number of subdomains.
system with 160 MHz thin nodes; (iii) Beowulf: a cluster of Pentium II (266 MHz) PCs (the system at Daresbury Laboratory (UK) was used; all the information can be found at http://www.dl.ac.uk/TCSC/disco/Beowulf/config.html). For the benchmark, each subdomain has one processor (the code also allows each processor to solve a set of subdomains). The speed-ups obtained in the different systems (evaluated in relation to the respective times for one processor) are shown in Fig. 4. They are very similar on the different platforms. This is because the algorithm requires little communication. For instance, for the most unfavourable situation (O2000 with 10 processors),
only a single packet of about 11.25 Kbytes is exchanged between neighbouring processors approximately every 0.15 seconds. So, the decrease in efficiency is mainly due to the load imbalance and to the extra work done (overlapping areas and interpolations).
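To put these numbers in perspective (a back-of-the-envelope estimate added here), the sustained communication demand per neighbour is only about

\[
\frac{11.25\ \text{Kbytes}}{0.15\ \text{s}} \approx 75\ \text{Kbytes/s},
\]

which is negligible compared with the bandwidth of any of the interconnects used, consistent with the observation that the loss of efficiency comes from load imbalance and extra work rather than from communication.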
Figure 4. Speed-ups obtained in the different systems.
7. CONCLUSIONS
A domain decomposition approach has been used to develop a parallel code for the simulation of reactive flows. The method implemented requires little communication between the processors, and it allowed the authors to reuse almost all the sequential code (even while using a message passing programming model). The speed-ups obtained in different shared and distributed memory systems with up to 10 processors are reasonably good. It is remarkable that the speed-up obtained with this code on a cluster of PC computers with Linux is as good as the speed-up obtained on two commercial parallel computers. To be able to use a higher number of processors while maintaining a reasonable efficiency, the load balance has to be improved and the interpolation process has to be optimised.
REFERENCES
1. J. Cadafalch et al., Domain Decomposition as a Method for the Parallel Computing of Laminar Incompressible Flows, Third ECCOMAS Computational Fluid Dynamics Conference, ed. J.A. Desideri et al., pp. 845-851, John Wiley and Sons, Chichester, 1996.
2. J. Cadafalch et al., Fully Conservative Multiblock Method for the Resolution of Turbulent Incompressible Flows, Proceedings of the Fourth European Computational Fluid Dynamics Conference, Vol. I, Part 2, pp. 1234-1239, John Wiley and Sons, Athens, Greece, October 1998.
3. J. Cadafalch et al., Comparative study of conservative and nonconservative interpolation schemes for the domain decomposition method on laminar incompressible flows, Numerical Heat Transfer, Part B, vol. 35, pp. 65-84, 1999.
4. S. Candel et al., Problems and Perspectives in Numerical Combustion, Comp. Meth. in App. Sci. '96, J. Wiley and Sons Ltd, 1996.
5. G. Chesshire and W.D. Henshaw, A Scheme for Conservative Interpolation on Overlapping Grids, SIAM J. Sci. Comput., vol. 15, no. 4, pp. 819-845, 1994.
6. R. Consul et al., Numerical Studies on Laminar Premixed and Diffusion Flames, 10th Conf. on Num. Meth. in Ther. Prob., pp. 198-209, Swansea, 1997.
7. R. Consul et al., Numerical Analysis of Laminar Flames Using the Domain Decomposition Method, The Fourth ECCOMAS Computational Fluid Dynamics Conference, Vol. 1, Part 2, pp. 996-1001, John Wiley and Sons, Athens, Greece, October 1998.
8. M.S. Darwish and F. Moukalled, Normalized Variable and Space Formulation Methodology for High-Resolution Schemes, NHTB, v. 26, pp. 79-96, 1994.
9. J.P. Van Doormal and G.D. Raithby, Enhancements of the SIMPLE method for predicting incompressible fluid flows, NHT, v. 7, pp. 147-163, 1984.
10. P.H. Gaskell and K.C. Lau, Curvature-Compensated Convective Transport: SMART, a New Boundedness-Preserving Transport Algorithm, Int. J. Numer. Meth. Fluids, vol. 8, pp. 617-641, 1988.
11. B.R. Hutchinson and G.D. Raithby, A Multigrid Method Based on the Additive Correction Strategy, NHT, v. 9, pp. 511-537, 1986.
12. W.P. Jones and R.P. Lindstedt, Global Reaction Schemes for Hydrocarbon Combustion, Comb. and Flame, v. 73, pp. 233-249, 1988.
13. A.J. Lewis and A.D. Brent, A Comparison of Coarse and Fine Grain Parallelization Strategies for the Simple Pressure Correction Algorithm, IJNMF, v. 16, pp. 891-914, 1993.
14. M. Peric and E. Schreck, Analysis of Efficiency of Implicit CFD Methods on MIMD Computers, Parallel Computational Fluid Dynamics: Algorithms and Results Using Advanced Computers, pp. 145-152, 1995.
15. N.R. Reyes et al., Subdomain Method in Both Natural and Forced Convection, Application to Irregular Geometries, in Numerical Methods in Laminar and Turbulent Flow: Proceedings of the 8th International Conference, ed. C. Taylor, 1993.
16. P.S. Sathyamurthy, Development and Evaluation of Efficient Solution Procedures for Fluid Flow and Heat Transfer Problems in Complex Geometries, Ph.D. Thesis, 1991.
17. M. Schafer, Parallel Algorithms for the Numerical Simulation of Three-Dimensional Natural Convection, Applied Numerical Mathematics, v. 7, pp. 347-365, 1991.
18. V. Seidl et al., Space- and Time-Parallel Navier-Stokes Solver for 3D Block-Adaptive Cartesian Grids, Parallel Computational Fluid Dynamics: Algorithms and Results Using Advanced Computers, pp. 577-584, 1995.
19. M.D. Smooke et al., Numerical Solution of Two-Dimensional Axisymmetric Laminar Diffusion Flames, Comb. Sci. and Tech., 67: 85-122, 1989.
20. L.M.T. Sommers, PhD Thesis, Technical University of Eindhoven, 1994.
21. J. Warnatz et al., Combustion, Springer-Verlag, Heidelberg, 1996.
Parallelization of the Edge Based Stabilized Finite Element Method
A. Soulaimani, A. Rebaine and Y. Saad
Department of Mechanical Engineering, Ecole de technologie supérieure, 1100 Notre-Dame Ouest, Montreal, PQ, H3C 1K3, CANADA (A. Soulaimani, A. Rebaine); Computer Science Department, University of Minnesota, 4-192 EE/CSci Building, 200 Union Street S.E., Minneapolis, MN 55455 (Y. Saad)
This paper presents a finite element formulation for solving multidimensional compressible flows. This method has been inspired by our experience with the SUPG, the Finite Volume and the discontinuous-Galerkin methods. Our objective is to obtain a stable and accurate finite element formulation for multidimensional hyperbolic-parabolic problems with particular emphasis on compressible flows. In the proposed formulation, the upwinding effect is introduced by considering the flow characteristics along the normal vectors to the element interfaces. Numerical tests performed so far are encouraging. It is expected that further numerical experiments and a theoretical analysis will lead to more insight into this promising formulation. The computational performance issue is addressed through a parallel implementation of the finite element data structure and the iterative solver. The PSPARSLIB and MPI libraries are used for this purpose. The implicit parallel solver developed is based on the nonlinear version of GMRES and Additive Schwarz algorithms. Fairly good parallel performance is obtained.
1. INTRODUCTION
This work discusses the numerical solution of the compressible multidimensional Navier-Stokes and Euler equations using the finite element methodology. The standard Galerkin variational formulation is known to generate numerical instabilities for convection dominated flows. Many stabilization approaches have been proposed in the literature during the last two decades, each introducing in a different way an additional dissipation to the original centered scheme. The Streamline Upwinding Petrov-Galerkin method of Hughes (SUPG) is commonly used in finite element based formulations [1-4] while Roe-Muscl schemes are used for finite volume methods [5]. A new stabilized finite element formulation, referred to as the Edge Based Stabilized finite element method (EBS), has been recently introduced by Soulaimani et al. [6,7], which lies between SUPG and finite volume formulations.* This formulation seems to embody good properties of both of the above methods: high order accuracy and stability in solving high speed flows. Preliminary numerical results in 2D were encouraging; here we present further developments and more numerical experiments in 3D. In the following, the SUPG and EBS methods are briefly reviewed, then the Edge Based Stabilizing method is described. The parallel data structure and the solution algorithms are discussed. Finally, some numerical simulations are presented along with parallel efficiency results.
*This work has been financed by grants from NSERC and Bombardier.
2. GOVERNING EQUATIONS
Let Ω be a bounded domain of R^nd (with nd = 2 or nd = 3) and Γ = ∂Ω be its boundary. The outward unit vector normal to Γ is denoted by n. The nondimensional Navier-Stokes equations written in terms of the conservation variables V = (ρ, U, E)^t are written in vector form as
\[
\mathbf{V}_{,t} + \mathbf{F}^{adv}_{i,i}(\mathbf{V}) - \mathbf{F}^{diff}_{i,i}(\mathbf{V}) = \mathcal{F} \qquad (1)
\]
where V is the vector of conservative variables, F^adv_i and F^diff_i are respectively the convective and diffusive fluxes in the i-th space direction, and 𝓕 is the source vector. Lower commas denote partial differentiation and repeated indices indicate summation. The diffusive fluxes can be written in the form F^diff_i = K_ij V,_j while the convective fluxes can be represented by diagonalizable Jacobian matrices A_i = F^adv_i,V. Any linear combination of these matrices has real eigenvalues and a complete set of eigenvectors.
3. SEMIDISCRETE SUPG FORMULATION
Throughout this paper, we consider a partition of the domain Ω into elements Ω^e where piecewise continuous approximations for the conservative variables are adopted. It is well known that the standard Galerkin finite element formulation often leads to numerical instabilities for convection dominated flows. In the SUPG method, the Galerkin variational formulation is modified to include an integral form depending on the local residual 𝓡(V) of equation (1), i.e. 𝓡(V) = V,_t + F^adv_i,i(V) - F^diff_i,i(V) - 𝓕, which is identically zero for the exact solution. The SUPG formulation reads: find V such that for all weighting functions W,
\[
\sum_{e}\int_{\Omega^{e}}\Big[\mathbf{W}\cdot\big(\mathbf{V}_{,t}+\mathbf{F}^{adv}_{i,i}-\mathcal{F}\big)+\mathbf{W}_{,i}\cdot\mathbf{F}^{diff}_{i}\Big]\,d\Omega
-\sum_{e}\int_{\Gamma^{e}}\mathbf{W}\cdot\mathbf{F}^{diff}_{n}\,d\Gamma
+\sum_{e}\int_{\Omega^{e}}\big(\mathbf{A}^{t}_{i}\mathbf{W}_{,i}\big)\cdot\boldsymbol{\tau}\cdot\mathcal{R}(\mathbf{V})\,d\Omega = 0 \qquad (2)
\]
In this formulation, the matrix τ is referred to as the matrix of time scales. The SUPG formulation is built as a combination of the standard Galerkin integral form and a perturbation-like integral form depending on the local residual vector. The objective is to reinforce the stability inside the elements. The SUPG formulation involves two
important ingredients: (a) it is a residual method, in the sense that the exact continuous regular solution of the original physical problem is still a solution of the variational problem (2); this condition highlights its importance for obtaining high order accuracy; and (b) it contains the following elliptic term: Σ_e ( ∫_{Ω^e} (A_i^t W,_i)·τ·(A_j V,_j) dΩ ), which enhances the stability provided that the matrix τ is appropriately designed. For multidimensional systems, it is difficult to choose τ in such a way as to introduce the additional stability in the characteristic directions, simply because the convection matrices are not simultaneously diagonalizable. For instance, Hughes and Mallet [8] proposed τ = (B_i B_i)^{-1/2}, where B_i = (∂ξ_a/∂x_i) A_a and the ∂ξ_a/∂x_j are the components of the element Jacobian matrix.
4. EDGE BASED STABILIZATION METHOD (EBS)
Let us first take another look at the SUPG formulation. Using integration by parts, the integral ∫_{Ω^e} (A_i^t W,_i)·τ·𝓡(V) dΩ can be transformed into ∫_{Γ^e} W·(A_n τ·𝓡(V)) dΓ - ∫_{Ω^e} W·(A_i τ·𝓡(V)),_i dΩ, where Γ^e is the boundary of the element Ω^e, n^e is the outward unit normal vector to Γ^e, and A_n = n^e_i A_i. If one can neglect the second integral, then

\[
\sum_{e}\int_{\Omega^{e}}\big(\mathbf{A}^{t}_{i}\mathbf{W}_{,i}\big)\cdot\boldsymbol{\tau}\cdot\mathcal{R}(\mathbf{V})\,d\Omega
\approx
\sum_{e}\int_{\Gamma^{e}}\mathbf{W}\cdot\big(\mathbf{A}_{n}\,\boldsymbol{\tau}\cdot\mathcal{R}(\mathbf{V})\big)\,d\Gamma .
\]
The above equation suggests that τ could be defined, explicitly, only at the element boundary. Hence, a natural choice for τ is given by τ = (h/2) A_n^{-1}. Since the characteristic lines are well defined on Γ^e for the given direction n^e, the above definition of τ is not completely arbitrary. Furthermore, the stabilizing contour integral term becomes

\[
\sum_{e}\int_{\Gamma^{e}} \frac{h}{2}\,\mathbf{W}\cdot\mathcal{R}(\mathbf{V})\,d\Gamma .
\]
For a one-dimensional hyperbolic system, one can recognize the upwinding effect introduced by the EBS formulation. Here, we would like to show how more upwinding can naturally be introduced in the framework of the EBS formulation. Consider the eigendecomposition of A_n: A_n = S_n Λ_n S_n^{-1}. Let Pe_i = λ_i h/(2ν) be the local Peclet number for the eigenvalue λ_i, h a measure of the element size on the element boundary, ν the physical viscosity, and β_i = min(Pe_i/3, 1.0). We define the matrix B_n by B_n = S_n L S_n^{-1}, where L is a diagonal matrix given by L_i = (1 + β_i) if λ_i > 0; L_i = -(1 - β_i) if λ_i < 0; and L_i = 0 if λ_i = 0. The proposed EBS formulation can be summarized as follows: find V such that for all weighting functions W,
\[
\sum_{e}\int_{\Omega^{e}}\Big[\mathbf{W}\cdot\big(\mathbf{V}_{,t}+\mathbf{F}^{adv}_{i,i}-\mathcal{F}\big)+\mathbf{W}_{,i}\cdot\mathbf{F}^{diff}_{i}\Big]\,d\Omega
-\sum_{e}\int_{\Gamma^{e}}\mathbf{W}\cdot\mathbf{F}^{diff}_{n}\,d\Gamma
+\sum_{e}\int_{\Gamma^{e}}\mathbf{W}\cdot\big(\boldsymbol{\tau}^{e}_{n}\,\mathcal{R}(\mathbf{V})\big)\,d\Gamma = 0 \qquad (3)
\]

with τ_n^e the matrix of intrinsic length scales given by τ_n^e = (h/2) B_n.
We would like to point out the following important remarks:
- As for SUPG, the EBS formulation is a residual method, in the sense that if the exact solution is sufficiently regular then it is also a solution of (3). Thus, one may expect high-order accuracy. Note that the only assumption made on the finite element approximations for the trial and weighting functions is that they are piecewise continuous. Equal-order or mixed approximations are possible in principle. Further theoretical analysis is required to give a clearer answer.
- The stabilization effect is introduced by computing on the element interfaces the difference between the residuals, while considering the direction of the characteristics. The parameter β_i is introduced to give more weight to the element situated in the upwind characteristic direction. It is also possible to make β dependent on the local residual norm, in order to add more stability in high gradient regions.
- For a purely hyperbolic scalar problem, one can see some analogy between the proposed formulation and the discontinuous-Galerkin method of Lesaint-Raviart and the Finite Volume formulation.
- The formula given above for the parameter β is introduced in order to make it vanish rapidly in regions dominated by the physical diffusion, such as the boundary layers.
- The length scale h introduced above is computed in practice as the distance between the centroids of the element and the edge (or the face in 3D), respectively.
- The EBS formulation plays the role of adding an amount of artificial viscosity in the characteristic directions. In the presence of high gradient zones, as for shocks, more dissipation is needed to avoid undesirable oscillations. A shock-capturing artificial viscosity depending on the discrete residual 𝓡(V) is used, as suggested in [9].
5. PARALLEL IMPLEMENTATION ISSUES
Domain decomposition has emerged as a quite general and convenient paradigm for solving partial differential equations on parallel computers. Typically, a domain is partitioned into several sub-domains and some technique is used to recover the global solution by a succession of solutions of independent subproblems associated with the entire domain. Each processor handles one or several subdomains in the partition and then the partial solutions are combined, typically over several iterations, to deliver an approximation to the global system. All domain decomposition methods (d.d.m.) rely on the fact that each processor can do a big part of the work independently. In this work, a decomposition-based approach is employed using an Additive Schwarz algorithm with one layer of overlapping elements. The general solution algorithm used is based on a time marching procedure combined with the quasi-Newton and the matrix-free version of GMRES algorithms. The MPI library is used for communication among processors and PSPARSLIB is used for preprocessing the parallel data structures.
5.1. Data structure for Additive Schwarz d.d.m. with overlapping
In order to implement a domain decomposition approach, we need a number of numerical and non-numerical tools for performing the preprocessing tasks required to decompose a domain and map it onto processors. We also need to set up the various data structures,
and solve the resulting distributed linear system. PSPARSLIB [10], a portable library of parallel sparse iterative solvers, is used for this purpose. The first task is to partition the domain using a partitioner such as METIS. PSPARSLIB assumes a vertex-based partitioning (a given row and the corresponding unknowns are assigned to a certain domain). However, it is more natural and convenient for FEM codes to partition according to elements. The conversion is easy to do by setting up a dual graph which shows the coupling between elements. Assume that each subdomain is assigned to a different processor. We then need to set up a local data structure in each processor that makes it possible to perform the basic operations such as computing local matrices and vectors, the assembly of interface coefficients, and preconditioning operations. The first step in setting up the local data structure mentioned above is to have each processor determine the set of all other processors with which it must exchange information when performing matrix-vector products, computing the global residual vector, or assembling matrix components related to interface nodes. When performing a matrix-by-vector product or computing a global residual vector (as actually done in the present FEM code), neighboring processors must exchange values of their adjacent interface nodes. In order to perform this exchange operation efficiently, it is important to determine the list of nodes that are coupled with nodes in other processors. These local interface nodes are grouped processor by processor and are listed at the end of the local nodes list. Once the boundary exchange information is determined, the local representations of the distributed linear system must be built in each processor. If it is needed to compute the global residual vector or the global preconditioning matrix, we need to compute first their local representation on a given processor and move the interface components from remote processors for the operation to complete. The assembly of interface components for the preconditioning matrix is a non-trivial task. A special data structure for the interface local matrix is built to facilitate the assembly operation, in particular when using the Additive Schwarz algorithm with geometrically non-overlapping subdomains. The boundary exchange information contains the following items (see the sketch after this list):
1. nproc - the number of all adjacent processors.
2. proc(1:nproc) - the list of the nproc adjacent processors.
3. ix - the list of local interface nodes, i.e. nodes whose values must be exchanged with neighboring processors. The list is organized processor by processor using a pointer-list data structure.
4. vasend - the trace of the preconditioning matrix at the local interface, computed using local elements. This matrix is organized in a CSR format, each element of which can be retrieved using the arrays iasend and jasend. Rows of the matrix vasend are sent to the adjacent subdomains using the arrays proc and ix.
5. jasend and iasend - the Compressed-Sparse-Row arrays for the local interface matrix vasend, i.e. jasend is an integer array storing the column positions, in global numbering, of the elements in the interface matrix vasend, and iasend is a pointer array, the i-th entry of which points to the beginning of the i-th row in jasend and vasend.
6. varecv - the assembled interface matrix, i.e. each subdomain assembles in varecv the interface matrix elements received from adjacent subdomains. varecv is also stored in CSR format using the two arrays jarecv and iarecv.
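A possible C layout of the boundary-exchange information above is sketched below; the field names follow the paper's nomenclature, but the layout itself is only an illustration, not the actual PSPARSLIB or application data structure.

/* Per-subdomain boundary-exchange information (CSR = Compressed Sparse Row). */
typedef struct {
    int     nproc;      /* number of adjacent processors                          */
    int    *proc;       /* proc[0..nproc-1]: ranks of the adjacent processors     */

    int    *ix;         /* local interface nodes, grouped processor by processor  */
    int    *ix_ptr;     /* ix_ptr[p]..ix_ptr[p+1]-1: nodes shared with proc[p]    */

    double *vasend;     /* trace of the preconditioning matrix at the interface   */
    int    *jasend;     /* CSR column indices (global numbering) for vasend       */
    int    *iasend;     /* CSR row pointers for vasend                            */

    double *varecv;     /* assembled interface matrix received from neighbours    */
    int    *jarecv;     /* CSR column indices for varecv                          */
    int    *iarecv;     /* CSR row pointers for varecv                            */
} boundary_exchange;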
5.2. Algorithmic aspects
The general solution algorithm employs a time marching procedure with local time-stepping for steady state solutions. At each time step, a nonlinear system is solved using a quasi-Newton method and the matrix-free GMRES algorithm. The preconditioner used is the block-Jacobian matrix, computed and factorized using the ILUT algorithm every 10 time steps. Interface coefficients of the preconditioner are computed by assembling contributions from all adjacent elements and subdomains, i.e. the varecv matrix is assembled with the local Jacobian matrix. Another aspect worth mentioning is the fact that the FEM formulation requires a continuous state vector V in order to compute a consistent residual vector. However, when applying the preconditioner (i.e. multiplication of the factorized preconditioner by a vector) or at the end of the Krylov iterations, a discontinuous solution at the subdomain interfaces is obtained. To circumvent this inconsistency, a simple averaging operation is applied to the interface coefficients of the solution.
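Inside a matrix-free GMRES iteration, the Jacobian-vector products can be approximated by a one-sided finite difference of the residual. The sketch below shows one common way of doing this; the residual routine and the choice of the perturbation size are assumptions, not the authors' exact implementation.

/* Matrix-free approximation of the Jacobian-vector product J(V)*w used by
 * GMRES: J(V)*w ~ (R(V + eps*w) - R(V)) / eps, so the Jacobian is never
 * stored. residual() is the (assumed) nonlinear FEM residual evaluation. */
#include <math.h>
#include <stdlib.h>

extern void residual(const double *V, double *R, int n);   /* assumed */

void jacobian_times_vector(const double *V, const double *R0,
                           const double *w, double *Jw, int n)
{
    double normV = 0.0, normw = 0.0;
    for (int i = 0; i < n; ++i) { normV += V[i] * V[i]; normw += w[i] * w[i]; }
    normw = sqrt(normw);
    if (normw == 0.0) { for (int i = 0; i < n; ++i) Jw[i] = 0.0; return; }

    /* one common choice of the perturbation size */
    double eps = sqrt(1.0e-14) * (1.0 + sqrt(normV)) / normw;

    double *Vp = malloc((size_t)n * sizeof *Vp);
    double *Rp = malloc((size_t)n * sizeof *Rp);
    for (int i = 0; i < n; ++i) Vp[i] = V[i] + eps * w[i];
    residual(Vp, Rp, n);
    for (int i = 0; i < n; ++i) Jw[i] = (Rp[i] - R0[i]) / eps;

    free(Vp); free(Rp);
}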
6. NUMERICAL RESULTS
The EBS formulation has been implemented in 2D and 3D, and tested for computing viscous and inviscid compressible flows. Only 3D results will be shown here. Also, the EBS results are compared with those obtained using the SUPG formulation (the definition of the stabilization matrix employed is τ = (Σ_i |B_i|)^{-1}) [3] and with some results obtained using a Finite Volume code developed at INRIA (France). Linear finite element approximations over tetrahedra are used for the 3D calculations. A second-order time-marching procedure is used, with nodal time steps for steady solutions. Three-dimensional tests have been carried out for computing viscous flows over a flat plate and inviscid as well as turbulent flows (using the one-equation Spalart-Allmaras turbulence model) around the ONERA-M6 wing. All computations are performed using a SUN Enterprise 6000 parallel machine with 165 MHz processors. The objective of the numerical tests is to assess the stability and accuracy of the EBS formulation as compared to the SUPG and FV methods and to evaluate the computational efficiency of the parallel code. For the flat plate, the computation conditions are Re = 400 and Mach = 1.9. Figures 1-3 show the Mach number contours (in a vertical plane). For the ONERA-M6 wing, an Euler solution is computed for Mach = 0.8447 and an angle of attack of 5.06 degrees. The mesh used has 15460 nodes and 80424 elements. Figures 4-7 show the Mach number contours on the wing. It is clearly shown that the EBS method is stable and less diffusive than the SUPG method and the first-order Finite Volume method (Roe scheme). Figure 8 shows the convergence history for different numbers of processors. Interesting speedups have been obtained using the parallelization procedure described previously, since the iteration counts have not increased significantly as the number of processors increased (Table 1). Under the same conditions (Mach number, angle of attack and mesh), a turbulent flow is computed for a Reynolds number of Re = 11.7 × 10^6 and for a distance δ = 10^-4 using the same coarse mesh. Again, the numerical results show clearly that the SUPG and first-order FV codes give a smeared shock (the figures are not included because of space, see [7]). It is fairly well captured by the EBS method. However, the use of the second-order FV method results in a much stronger shock.
All numerical tests performed so far show clearly that the EBS formulation is stable, accurate and less diffusive than SUPG and the finite volume methods. Table 1 shows speed-up results obtained in the case of Euler computations around the Onera-M6 wing for the EBS formulation using the parallel implementation of the code. The additive Schwarz algorithm, with only one layer of overlapping elements, along with ILUT factorization and the GMRES algorithm, seems well suited as a numerical tool for parallel solutions of compressible flows. Further numerical experiments in 3D using finer meshes are in progress along with more analysis of the parallel performance.
REFERENCES
1. A.N. Brooks and T.J.R. Hughes, Computer Methods in Applied Mechanics and Engineering, 32, 199-259 (1982).
2. T.J.R. Hughes, L.P. Franca and Hulbert, Computer Methods in Applied Mechanics and Engineering, 73, 173-189 (1989).
3. A. Soulaimani and M. Fortin, Computer Methods in Applied Mechanics and Engineering, 118, 319-350 (1994).
4. N.E. ElKadri, A. Soulaimani and C. Deschenes, to appear in Computer Methods in Applied Mechanics and Engineering.
5. A. Dervieux, Von Karman Lecture Note Series 1884-04, March 1985.
6. A. Soulaimani and C. Farhat, Proceedings of the ICES-98 Conference: Modeling and Simulation Based Engineering, Atluri and O'Donoghue editors, 923-928, October 1998.
7. A. Soulaimani and A. Rebaine, Technical paper AIAA-99-3269, June 1999.
8. T.J.R. Hughes and Mallet, Computer Methods in Applied Mechanics and Engineering, 58, 305-328 (1986).
9. G.J. Le Beau, S.E. Ray, S.K. Aliabadi and T.E. Tezduyar, Computer Methods in Applied Mechanics and Engineering, 104, 397-422 (1993).
10. Y. Saad, Iterative Methods for Sparse Linear Systems, PWS Publishing Company (1996).
Table 1. Performance of parallel computations with the number of processors, Euler flow around the Onera-M6 wing

                 SUPG                     EBS
Processors   Speedup   Efficiency    Speedup   Efficiency
2            1.91      0.95          1.86      0.93
4            3.64      0.91          3.73      0.93
6            5.61      0.94          5.55      0.93
8            7.19      0.90          7.30      0.91
10           9.02      0.90          8.79      0.88
12           10.34     0.86          10.55     0.88
Figure 1. 3D viscous flow at Re = 400 and M = 1.9 - Mach contours with EBS.
Figure 2. 3D viscous flow at Re = 400 and M = 1.9 - Mach contours with SUPG.
Figure 3. 3D viscous flow at Re = 400 and M = 1.9 - Mach contours with FV.
Figure 4. Euler flow around Onera-M6 wing- Mach contours with SUPG.
Figure 5. Euler flow around Onera-M6 wing- Mach contours with EBS
Figure 6. Euler flow around Onera-M6 wing- Mach contours with 1st order F.V.
Figure 7. Euler flow around Onera-M6 wing - Mach contours with 2nd order F.V.
Figure 8. Euler flow around Onera-M6 wing - Convergence history with EBS (normalized residual norm vs. time steps for different numbers of processors).
The Need for Multi-grid for Large Computations
A. Twerda, R.L. Verweij, T.W.J. Peeters, A.F. Bakker
Department of Applied Physics, Delft University of Technology, Lorentzweg 1, 2628 CJ Delft, The Netherlands.
In this paper an implementation of a multi-grid algorithm for simulation of turbulent reacting flows is presented. Results are shown for a turbulent backward facing step and a glass-melting furnace simulation. The multi-grid method proves to be a good acceleration technique for these kinds of computations on parallel platforms.
1 INTRODUCTION
In the last decade considerable efforts have been spent in porting Computational Fluid Dynamics codes to parallel platforms. This enables the use of larger grids with finer resolution and reduces the vast amount of computer time needed for numerical computations of complex processes. In this paper, heat transfer, combustion, and fluid flow in industrial glass melting furnaces are considered. Simulations of these processes require advanced models of turbulence, combustion, and radiation, in conjunction with sufficiently fine numerical grids to obtain accurate predictions of furnace efficiency and concentrations of various toxic gases. The need for very fine grids in turbulent combustion simulations stems from the huge range in time and length scales occurring in the furnace. For example, the predictions for the concentrations of O2 and NOx still vary 10 to 12 % when the grid is refined from a 32 × 48 × 40 to a 32 × 72 × 60 grid [1]. Furthermore, most solvers do not scale with the number of grid points. To keep turnaround times for the large systems involved moderate, efficient techniques have to be used. The aim of this paper is to show that multi-grid provides a powerful tool in conjunction with domain decomposition and parallelisation. Parallel computing yields the resources for accurate prediction of turbulent reacting flows in large scale furnaces.
2 MATHEMATICAL AND PHYSICAL MODEL
Figure (1) shows a typical furnace geometry (0.9 × 3.8 × 1 m³), where the pre-heated air (T = 1400 K, v = 9 m/s) and the gas (T = 300 K, v = 125 m/s) enter the furnace separately.
Figure 1: Artist's impression of a furnace geometry with flame (gas inlet, air inlet, flame and flue gases)
The turbulence, mainly occurring because of the high gas-inlet velocity, improves the mixing that is essential for combustion of the initially non-premixed fuel and oxidiser into products. The products exit the furnace at the opposite side. The maximum time-averaged temperatures encountered in the furnace are typically 2200 K. At this temperature, most of the heat transfer to the walls occurs through radiation. The conservation equations of mass, momentum, energy and species are applied to describe the turbulent reacting flow. As often done in turbulent reacting flow simulation, Favre averaging is applied to these equations: the quantities are averaged with respect to the density. The Favre averaged incompressible (variable density) Navier-Stokes equations are solved for conservation of momentum and mass. The standard high-Reynolds number k-ε model with wall functions is applied to account for the turbulence [10]. The transport equation for the enthalpy is solved for the conservation of energy. The conservation of species is modelled with a conserved scalar approach. The concentrations of all species are directly coupled to the mean mixture fraction. To close the set of equations, the ideal gas law is used. For radiative heat transfer the Discrete Transfer Model is applied. The chemistry is modelled with a constrained equilibrium model. A β-probability density function is used for computing the mean values of the thermo-chemical quantities. Additional details can be found in [9] and [1].
3 NUMERICAL MODEL
The set of equations described in the previous section is discretised using the Finite Volume Method (FVM). The computational domain is divided into a finite number of control volumes (CVs). A cell-centered colocated Cartesian grid arrangement is applied [3]. Diffusive fluxes are approximated with the central difference scheme. For laminar flows, the convective fluxes are approximated with the central difference scheme. For turbulent (reacting) flows, an upwind scheme with the 'van Leer flux limiter' is used. For pressure-velocity coupling the SIMPLE scheme is applied [7, 3]. The porosity method is used to match the Cartesian grid to the geometry of the furnace [9]. GMRES or SIP are used for solving the set of linear equations [15, 11].
3.1 PARALLEL IMPLEMENTATION
To increase the number of grid points and keep simulation times reasonable, the code is parallelised. Domain Decomposition (DD) with minimal overlap is used as the parallelisation technique. This technique has proven to be very efficient for creating a parallel algorithm [13]. A grid embedding technique, with static load balancing, is applied to divide the global domain into sub-domains [2]. Every processor is assigned one subdomain. The message-passing library MPI, or SHMEM GET/PUT on the Cray T3E, is used to deal with communication between processors [4].
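As an illustration of the interface communication implied by DD with minimal overlap, the sketch below exchanges one layer of overlap cells between neighbouring processors in a 1D decomposition using MPI_Sendrecv. The array size, the field name and the 1D layout are illustrative assumptions; the actual code works on 3D sub-domains.

```c
#include <mpi.h>
#include <stdlib.h>

#define NLOC 64   /* interior cells per subdomain (illustrative) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* local array with one ghost cell on each side: [0] and [NLOC+1] */
    double *phi = calloc(NLOC + 2, sizeof(double));
    for (int i = 1; i <= NLOC; ++i) phi[i] = rank;   /* dummy field */

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* send first interior cell to the left, receive ghost from the right */
    MPI_Sendrecv(&phi[1],      1, MPI_DOUBLE, left,  0,
                 &phi[NLOC+1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* send last interior cell to the right, receive ghost from the left */
    MPI_Sendrecv(&phi[NLOC],   1, MPI_DOUBLE, right, 1,
                 &phi[0],      1, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    free(phi);
    MPI_Finalize();
    return 0;
}
```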
3.2 MULTI-GRID METHOD
One of the drawbacks of DD is that, when implicit solvers are used to solve the discretised set of equations, the convergence deteriorates as the number of domains increases [8]. To overcome this problem, a multi-grid (MG) method is applied over the set of equations. That is, when a grid level is visited during an MG-cycle, several iterations of all transport equations are performed. Cell-centred coarsening is used to construct the sequence of grids; this means that, in 3D, one coarse-grid CV comprises eight fine-grid CVs. The V-cycle with a full approximation scheme is implemented, using tri-linear interpolation for the restriction and prolongation operators. Since the multi-grid tends to be slow at the beginning of the iteration, it is efficient to start with a good approximation. For this approximation a converged solution on a coarser grid is used and interpolated onto the fine grid. This leads to the Full Multi-Grid method (FMG) [14]. Special care is taken to obtain coarse-grid values of variables which are not solved with a transport equation. These variables (e.g. density and turbulent viscosity) are only calculated on the finest grid and interpolated onto the coarser grids.
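The control flow of the FMG/V-cycle strategy can be sketched on a model problem. The C code below applies a linear-correction V-cycle with Gauss-Seidel smoothing to a 1D Poisson equation and drives it in FMG fashion, starting from the coarsest grid and interpolating each solution upward as the initial guess for the next level. This is a deliberately simplified stand-in: the paper's solver uses a full approximation scheme over the coupled transport equations in 3D, which is not reproduced here.

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Model problem: -u'' = f on (0,1), u(0) = u(1) = 0, n intervals, h = 1/n. */

static void smooth(double *u, const double *f, int n, int sweeps) {
    double h2 = 1.0 / ((double)n * n);
    for (int s = 0; s < sweeps; ++s)
        for (int i = 1; i < n; ++i)                 /* Gauss-Seidel sweep */
            u[i] = 0.5 * (u[i - 1] + u[i + 1] + h2 * f[i]);
}

static void residual(const double *u, const double *f, double *r, int n) {
    double ih2 = (double)n * n;
    r[0] = r[n] = 0.0;
    for (int i = 1; i < n; ++i)
        r[i] = f[i] - ih2 * (2.0 * u[i] - u[i - 1] - u[i + 1]);
}

static void restrict_fw(const double *fine, double *coarse, int nc) {
    coarse[0] = coarse[nc] = 0.0;                   /* full weighting */
    for (int i = 1; i < nc; ++i)
        coarse[i] = 0.25 * fine[2*i - 1] + 0.5 * fine[2*i] + 0.25 * fine[2*i + 1];
}

static void prolong_add(const double *coarse, double *fine, int nc) {
    for (int i = 0; i < nc; ++i) {                  /* linear interpolation */
        fine[2*i]     += coarse[i];
        fine[2*i + 1] += 0.5 * (coarse[i] + coarse[i + 1]);
    }
    fine[2*nc] += coarse[nc];
}

static void vcycle(double *u, const double *f, int n) {
    if (n == 2) { u[1] = f[1] / (2.0 * n * n); return; }   /* coarsest grid: exact solve */
    int nc = n / 2;
    double *r  = calloc(n + 1,  sizeof(double));
    double *rc = calloc(nc + 1, sizeof(double));
    double *ec = calloc(nc + 1, sizeof(double));
    smooth(u, f, n, 2);                             /* pre-smoothing */
    residual(u, f, r, n);
    restrict_fw(r, rc, nc);
    vcycle(ec, rc, nc);                             /* coarse-grid correction */
    prolong_add(ec, u, nc);
    smooth(u, f, n, 2);                             /* post-smoothing */
    free(r); free(rc); free(ec);
}

int main(void) {
    const double PI = 3.14159265358979323846;
    const int nfine = 64;
    double *u = NULL;
    /* FMG: solve on the coarsest grid, interpolate upward as the initial
       guess, and apply one V-cycle per level up to the finest grid.      */
    for (int n = 2; n <= nfine; n *= 2) {
        double *f    = calloc(n + 1, sizeof(double));
        double *unew = calloc(n + 1, sizeof(double));
        for (int i = 0; i <= n; ++i)
            f[i] = PI * PI * sin(PI * i / (double)n);   /* exact solution: sin(pi x) */
        if (u != NULL) prolong_add(u, unew, n / 2);
        vcycle(unew, f, n);
        free(u); free(f);
        u = unew;
    }
    double err = 0.0;
    for (int i = 0; i <= nfine; ++i) {
        double e = fabs(u[i] - sin(PI * i / (double)nfine));
        if (e > err) err = e;
    }
    printf("max error on the %d-interval grid after FMG: %.3e\n", nfine, err);
    free(u);
    return 0;
}
```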
4 RESULTS

4.1 TURBULENT BACKWARD-FACING STEP FLOW

[Figure 2: Schematic picture of the backward-facing step (step height h, inlet above the step, domain length 10h downstream).]

The code has been validated on this standard CFD test problem. The geometry of this test case is simple, but the flow is complex; the recirculation zone behind the step is especially difficult to predict. See Figures (2) and (3). Le and Moin provided direct numerical simulation data, and Jakirlić calculated this test case with the k-ε turbulence
model [6, 5]. Our results agree well with these data, as shown by Verweij [12].

A different configuration was used to perform the MG calculations. The Reynolds number based on the inlet velocity and the step height is Re = 10⁵. A constant inlet velocity of v_inlet = 1 m/s was used, and the step height was h = 0.02 m. The coarsest grid level consisted of 6 x 24 uniform CVs, which were needed to capture the step height in one CV along a vertical line. In Figure (4) the convergence histories of the turbulent kinetic energy (k) equation are plotted for two grids. Three and four grid levels, respectively, were used during the computation, resulting in 24 x 96 and 48 x 192 grid points. The convergence of the FMG is much better than for the single-grid case, and is almost independent of the number of CVs. These results are in agreement with results presented by Ferziger and Perić [3].

[Figure 3: Recirculation zone of the backward-facing step.]
[Figure 4: Convergence history of the k equation for the backward-facing step: residual versus number of iterations for FMG and single-grid (SG) computations on the 24 x 96 and 48 x 192 grids.]
4.2 PARALLEL EFFICIENCY
The parallel efficiency was also analysed, using 1 to 16 processors on a CRAY T3E. Two grids were used: a coarse grid consisting of 6 x 24 points and a fine grid with 48 x 196
points. On the coarse grid no MG was applied; four grid levels were used in the MG algorithm on the fine grid. The results are shown in Table 1. The left number shows the number of MG-cycles needed for convergence (residual less than 10⁻¹⁶), and the right number is the time used per MG-cycle.

Table 1: Number of iterations and time per iteration in seconds

# processors    coarse grid                 fine grid
                # cycles     time (s)       # cycles     time (s)
1               80           1.30           -            -
2               80           1.32           150          12.95
4               > 500        1.34           180          10.05
8               > 500        1.42           170          8.51
16              -            -              350          6.39

For the coarse grid the number of
cycles increases dramatically when the number of processors increases. This is due to the implicit solver that was employed. This is not the case for the MG algorithm on the fine grid, and this is the main reason for using the algorithm. The CPU time consumed per iteration is nearly constant for the coarse grid and decreases for the fine grid. The small problem size is the main reason for the poor speed-up on the coarse grid. The MG algorithm on the finer grid also scales poorly with the number of processors. This is explained by the fact that some iterations are still needed on the coarse grid; these iterations do not scale with the number of processors, as already mentioned.

4.3 FULL FURNACE SIMULATION

[Figure 5: Contour plot of the enthalpy in the center plane of the flame.]

Finally, we present some results of a full furnace simulation. The furnace simulated is the IFRF glass-melting furnace at IJmuiden; a more detailed description of the furnace can be found in [1]. The computations are done on two grids: a medium (34 x 50 x 42)
and a fine (66 x 98 x 82) grid, on 16 processors of a CRAY T3E. On both grids two levels of MG are used. A typical contour plot of the enthalpy is shown in Figure (5). The flame, with the peak in the enthalpy, is clearly visible near the gas inlet. In Figure (6) the fuel balance, or fuel inlet-outlet ratio, of the computations on the medium and fine grids is shown. This value should of course go to zero, and is therefore a useful parameter for checking the convergence. Usually, the computation is considered converged if the value is below 10⁻¹ %; other convergence criteria must also be met. The number of iterations on the x-axis refers to the finest grid. As these simulations are in 3D, the number of grid points on the coarser grid is 8 times smaller, so this is also a good measure for the computing time consumed by the program. Also for this type of simulation the MG algorithm converges
faster than its single-grid counterpart. The computations on the fine grid have not fully converged yet, but it is clear that the MG algorithm performs better. The convergence rate is not independent of the number of CVs applied, because on the fine grid physics is resolved that is not captured by the coarser grids.

[Figure 6: Fuel balance for furnace simulations using a medium and a fine grid, comparing single-grid (SG) and multi-grid (MG) computations as a function of the number of iterations.]
5 CONCLUSIONS
The multi-grid method, as described in this paper, has proven to be a good acceleration technique for turbulent reacting flow simulations. The method also improves the convergence behaviour when domain decomposition is applied, and the number of iterations does not change significantly when more processors are used. Using MG allows the use of finer grids, which increases the accuracy of turbulent combustion simulations while keeping simulation times acceptable.
ACKNOWLEDGEMENTS

The authors would like to thank the HPαC-centre for the computing facilities.
REFERENCES

[1] G.P. Boerstoel. Modelling of gas-fired furnaces. PhD thesis, Delft University of Technology, 1997.
[2] P. Coelho, J.C.F. Pereira, and M.G. Carvalho. Calculation of laminar recirculating flows using a local non-staggered grid refinement system. Int. J. Num. Meth. Fl., 12:535-557, 1991.
[3] J.H. Ferziger and M. Perić. Computational Methods for Fluid Dynamics. Springer-Verlag, New York, 1995.
[4] William D. Gropp and Ewing Lusk. User's Guide for mpich, a Portable Implementation of MPI. Mathematics and Computer Science Division, Argonne National Laboratory, 1996. ANL-96/6.
[5] S. Jakirlić. Reynolds-Spannungs-Modellierung komplexer turbulenter Strömungen. PhD thesis, Universität Erlangen-Nürnberg, 1997.
[6] H. Le, P. Moin, and J. Kim. Direct numerical simulation of turbulent flow over a backward-facing step. In Ninth Symposium on Turbulent Shear Flows, pages 1321-1325, 1993.
[7] S.V. Patankar. Numerical Heat Transfer and Fluid Flow. McGraw-Hill, London, 1980.
[8] M. Perić, M. Schäfer, and E. Schreck. Computation of fluid flow with a parallel multigrid solver. Parallel Computational Fluid Dynamics, pages 297-312, 1992.
[9] L. Post. Modelling of flow and combustion in a glass melting furnace. PhD thesis, Delft University of Technology, 1988.
[10] W. Rodi. Turbulence models and their application in hydraulics - a state of the art review. IAHR, Karlsruhe, 1984.
[11] H.L. Stone. Iterative solution of implicit approximations of multi-dimensional partial differential equations. SIAM J. Numer. Anal., 5:530-558, 1968.
[12] R.L. Verweij. Parallel computing for furnace simulations using domain decomposition. PhD thesis, Delft University of Technology, 1999.
[13] R.L. Verweij, A. Twerda, and T.W.I. Peeters. Parallel computing for reacting flows using adaptive grid refinement. In Tenth International Conference on Domain Decomposition Methods, Boulder, Colorado, USA, 1998.
[14] P. Wesseling. An Introduction to Multigrid Methods. John Wiley & Sons, Chichester, 1992.
[15] Y. Saad and M.H. Schultz. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comput., 7:856-869, 1986.
Parallel Computations of Unsteady Euler Equations on Dynamically Deforming Unstructured Grids

A. Uzun, H.U. Akay, and C.E. Bronnenberg

Computational Fluid Dynamics Laboratory, Department of Mechanical Engineering, Purdue School of Engineering and Technology, IUPUI, Indianapolis, IN 46202, USA.

A parallel algorithm for the solution of unsteady Euler equations on unstructured deforming grids is developed by modifying USM3D, a flow solver originally developed at the NASA Langley Research Center. The spatial discretization is based on the flux vector splitting method of Van Leer. The temporal discretization involves an implicit time-integration scheme that utilizes a Gauss-Seidel relaxation procedure. Movement of the computational mesh is accomplished by means of a dynamic grid algorithm. Detailed descriptions of the parallel solution algorithm are given, and computations of the airflow around a full aircraft configuration are presented to demonstrate the applications and efficiency of the current parallel solution algorithm.

1. INTRODUCTION

Problems associated with unsteady flows and moving boundaries have long been of interest to aerodynamicists. Examples of such problems include external flows around oscillating airfoils and other three-dimensional configurations in motion. The current research is based on developing a parallel algorithm for the solution of unsteady aerodynamics and moving boundary problems using unstructured dynamic grids and the Arbitrary Lagrangian-Eulerian (ALE) method [1]. The sequential version of the flow solver, USM3D, originally developed at the NASA Langley Research Center, has been modified here to solve unsteady aerodynamics and moving boundary problems. The version of USM3D modified here was primarily developed to solve steady-state Euler equations on unstructured stationary grids [2]. The algorithm is an implicit, cell-centered scheme in which the fluxes on cell faces are obtained using the Van Leer flux vector splitting method [3]. The dynamic grid algorithm that is coupled with the flow solver moves the computational mesh to conform to the instantaneous position of the moving boundary. The solution at each time step is updated with an implicit algorithm that uses the linearized backward-Euler time-differencing scheme.

2. FLUID FLOW EQUATIONS

The Arbitrary Lagrangian-Eulerian formulation of the three-dimensional time-dependent inviscid fluid-flow equations is expressed in the following form [4]:
\[
\frac{\partial}{\partial t}\int_{\Omega} Q \, dV + \oint_{\partial\Omega} \bar{F}\cdot\hat{n} \, dS = 0 \qquad (1)
\]

where Q = [ρ, ρu, ρv, ρw, e]ᵀ is the vector of conserved flow variables,

\[
\bar{F}\cdot\hat{n} = \left((\bar{u}-\bar{a})\cdot\hat{n}\right)\,[\rho,\ \rho u,\ \rho v,\ \rho w,\ e+p]^{T} + p\,[0,\ n_x,\ n_y,\ n_z,\ a_t]^{T} \qquad (2)
\]

is the convective flux vector, n̂ = [n_x, n_y, n_z]ᵀ is the normal vector on the boundary ∂Ω, ū = [u, v, w]ᵀ and ā = [x_t, y_t, z_t]ᵀ are the fluid and grid velocity vectors, respectively, and a_t = x_t n_x + y_t n_y + z_t n_z = ā·n̂ is the contravariant face speed. The pressure p is given by the equation of state for a perfect gas, p = (γ − 1)[e − ½ρ(u² + v² + w²)].
The above equations have been nondimensionalized by the freestream density ρ∞ and the freestream speed of sound a∞. The domain of interest is divided into a finite number of tetrahedral cells and Eq. (1) is applied to each cell.

3. DYNAMIC MESH ALGORITHM
The current work models the unsteady aerodynamic response that is caused by oscillations of a configuration. Hence, the mesh movement is known beforehand. The dynamic mesh algorithm moves the computational mesh to conform to the instantaneous position of the moving boundary at each time step. Following the work of Batina [5], the algorithm treats the computational mesh as a system of interconnected springs. This system is constructed by representing each edge of each triangle by a tension spring. The spring stiffness for a given edge i-j is taken as inversely proportional to the length of the edge:

\[
k_{ij} = \frac{1}{\sqrt{(x_j - x_i)^2 + (y_j - y_i)^2 + (z_j - z_i)^2}} \qquad (3)
\]
Grid points on the outer boundary of the mesh are held fixed, while the instantaneous location of the points on the inner boundary (i.e., the moving body) is given by the body motion. At each time step, the static equilibrium equations in the x, y, and z directions that result from a summation of forces are solved iteratively at each interior node i of the mesh for the displacements Δx_i, Δy_i and Δz_i, respectively. After the new locations of the nodes are found, the new metrics (i.e., new cell volumes, cell face areas, face normal vectors, etc.) are computed. The nodal displacements are divided by the time increment to determine the velocity of the nodes. It is assumed that the velocity of a node is constant in magnitude and direction during a time step. Once the nodal velocities are computed, the velocity of a triangular cell face is found by taking the arithmetic average of the velocities of the three nodes that constitute the face. The face velocities are used in the flux computations in the solution algorithm.
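The spring-analogy relaxation described above can be sketched as follows. The C fragment below uses a hypothetical five-node 2D patch: boundary nodes carry prescribed displacements, the edge stiffnesses follow Eq. (3), and the interior displacement is obtained by Jacobi iterations of the static force balance. Node numbering, displacement values and iteration count are illustrative assumptions, not taken from the paper.

```c
#include <stdio.h>
#include <math.h>

#define NNODE 5
#define NEDGE 8

/* hypothetical little 2D patch: node 4 is interior, nodes 0-3 are boundary */
static const double x[NNODE] = {0.0, 1.0, 1.0, 0.0, 0.5};
static const double y[NNODE] = {0.0, 0.0, 1.0, 1.0, 0.5};
static const int edge[NEDGE][2] = {{0,1},{1,2},{2,3},{3,0},{0,4},{1,4},{2,4},{3,4}};
static const int interior[NNODE] = {0, 0, 0, 0, 1};

int main(void) {
    /* prescribed boundary displacements (e.g. the moving inner body);
       here nodes 0 and 1 move, nodes 2 and 3 are held fixed            */
    double dx[NNODE] = {0.10, 0.10, 0.0, 0.0, 0.0};
    double dy[NNODE] = {0.05, 0.05, 0.0, 0.0, 0.0};

    /* spring stiffness of edge i-j: inverse of the edge length, Eq. (3) */
    double k[NEDGE];
    for (int e = 0; e < NEDGE; ++e) {
        int i = edge[e][0], j = edge[e][1];
        k[e] = 1.0 / sqrt((x[j]-x[i])*(x[j]-x[i]) + (y[j]-y[i])*(y[j]-y[i]));
    }

    /* Jacobi iterations of the static force balance at interior nodes */
    for (int it = 0; it < 200; ++it) {
        double num_x[NNODE] = {0}, num_y[NNODE] = {0}, den[NNODE] = {0};
        for (int e = 0; e < NEDGE; ++e) {
            int i = edge[e][0], j = edge[e][1];
            num_x[i] += k[e]*dx[j];  num_y[i] += k[e]*dy[j];  den[i] += k[e];
            num_x[j] += k[e]*dx[i];  num_y[j] += k[e]*dy[i];  den[j] += k[e];
        }
        for (int n = 0; n < NNODE; ++n)
            if (interior[n]) { dx[n] = num_x[n]/den[n]; dy[n] = num_y[n]/den[n]; }
    }
    printf("interior node displacement: (%f, %f)\n", dx[4], dy[4]);
    return 0;
}
```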
4. TIME INTEGRATION

In the cell-centered finite volume approach used here, the flow variables are volume-averaged values, hence the governing equations are rewritten in the following form:

\[
\frac{V^{n+1}\,\Delta Q^n}{\Delta t} = -\oint_{\partial\Omega} F(Q^{n+1})\cdot\hat{n}\, dS \;-\; Q^n\,\frac{\Delta V^n}{\Delta t} \qquad (4)
\]

where Vⁿ is the cell volume at time step n, Vⁿ⁺¹ is the cell volume at time step n+1, ΔQⁿ = Qⁿ⁺¹ − Qⁿ, ΔVⁿ = Vⁿ⁺¹ − Vⁿ is the change in cell volume from time step n to n+1, and Δt is the time increment. Since an implicit time-integration scheme is employed, fluxes are evaluated at time step n+1. The flux vector is linearized according to

\[
R^{n+1} = R^n + \frac{\partial R^n}{\partial Q}\,\Delta Q^n \qquad (5)
\]

Hence the following system of linear equations is solved at each time step:

\[
A\,\Delta Q^n = R^n - Q^n\,\frac{\Delta V^n}{\Delta t} \qquad (6)
\]

where

\[
A = \frac{V^{n+1}}{\Delta t}\, I - \frac{\partial R^n}{\partial Q}
\qquad \text{and} \qquad
R^n = -\oint_{\partial\Omega} F(Q^n)\cdot\hat{n}\, dS .
\]
The flow variables are stored at the centroid of each tetrahedron. Flux quantities across cell faces are computed using Van Leer's flux vector splitting method [3]. The implicit time integration used in this study was originally developed by Anderson [6] in an effort to solve steady-state Euler equations on stationary grids. The system of simultaneous equations that results from the application of Eq. (6) to all the cells in the mesh can be solved by a direct inversion of a large matrix with large bandwidth. However, the direct inversion technique demands a huge amount of memory and extensive computer time to perform the matrix inversions for three-dimensional problems. Therefore, the direct inversion approach is computationally very expensive for practical three-dimensional calculations. Instead, a Gauss-Seidel relaxation procedure has been used in this study for the solution of the system of equations. In the current relaxation scheme, the solution is obtained through a sequence of iterations in which an approximation of ΔQⁿ is continually refined until acceptable convergence.
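The Gauss-Seidel relaxation idea can be illustrated on a small scalar system. The sketch below refines an approximation of the unknown vector (standing in for ΔQ) by repeated sweeps until the change falls below a tolerance; the actual solver sweeps cell by cell over 5-component blocks, which is not reproduced here, and the matrix and right-hand side are made-up values.

```c
#include <stdio.h>
#include <math.h>

#define N 4

int main(void) {
    /* hypothetical diagonally dominant system standing in for A*dQ = RHS */
    double A[N][N] = {{ 4,-1, 0, 0},
                      {-1, 4,-1, 0},
                      { 0,-1, 4,-1},
                      { 0, 0,-1, 4}};
    double b[N] = {1, 2, 3, 4};
    double x[N] = {0, 0, 0, 0};          /* initial approximation */

    for (int it = 0; it < 100; ++it) {
        double diff = 0.0;
        for (int i = 0; i < N; ++i) {    /* one Gauss-Seidel sweep */
            double s = b[i];
            for (int j = 0; j < N; ++j)
                if (j != i) s -= A[i][j] * x[j];
            double xnew = s / A[i][i];
            diff = fmax(diff, fabs(xnew - x[i]));
            x[i] = xnew;
        }
        if (diff < 1e-12) {              /* acceptable convergence reached */
            printf("converged after %d sweeps\n", it + 1);
            break;
        }
    }
    for (int i = 0; i < N; ++i) printf("x[%d] = %f\n", i, x[i]);
    return 0;
}
```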
5. PARALLELIZATION

5.1. Mesh Partitioning

A domain decomposition approach has been implemented to achieve parallelization of both the dynamic mesh algorithm and the flow solution scheme. The computational domain that is discretized with an unstructured mesh is partitioned into subdomains or blocks using a program named General Divider (GD). General Divider is a general unstructured mesh-partitioning code developed at the CFD Laboratory at IUPUI [7]. It is currently capable of partitioning both structured (hexahedral) and unstructured (tetrahedral) meshes in three-dimensional geometries. The interfaces between partitioned blocks are of the matching and overlapping type [8]; they exchange data between the blocks.
5.2. Parallelization of the Dynamic Mesh Algorithm

During a parallel run of the dynamic mesh algorithm, the code executes on each of the partitioned blocks to solve the relevant equations for that particular subdomain. The dynamic mesh algorithm is an iterative solver. After every iteration step, the blocks communicate via interfaces to exchange the node displacements on the block boundaries. The nodes on the interfaces that are flagged as sending nodes send their data to the corresponding nodes in the neighboring block. Similarly, the nodes on the interfaces that are flagged as receiving nodes receive data from the corresponding nodes in the neighboring block. In this research, the communication between blocks is achieved by means of the message-passing library PVM (Parallel Virtual Machine) [9].
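A minimal sketch of such a PVM-based interface exchange is given below: each block packs the displacements of its sending nodes, sends them to the neighbouring block, and unpacks the values received for its receiving nodes. The task id, message tag, array length and function name are illustrative assumptions, and error handling is omitted; for demonstration the "neighbour" in main() is the task itself.

```c
#include <pvm3.h>
#include <stdio.h>

#define NIFACE 16        /* interface nodes shared with the neighbour (illustrative) */
#define TAG_DISP 10

/* Exchange displacement arrays with a neighbouring block whose PVM task id
   is nbr_tid: pack and send our sending-node values, then receive and
   unpack the values for our receiving nodes.                              */
static void exchange_displacements(int nbr_tid, double *send_disp, double *recv_disp)
{
    pvm_initsend(PvmDataDefault);
    pvm_pkdouble(send_disp, NIFACE, 1);
    pvm_send(nbr_tid, TAG_DISP);

    pvm_recv(nbr_tid, TAG_DISP);
    pvm_upkdouble(recv_disp, NIFACE, 1);
}

int main(void)
{
    int mytid = pvm_mytid();
    double send_disp[NIFACE], recv_disp[NIFACE];
    for (int i = 0; i < NIFACE; ++i) send_disp[i] = 0.01 * i;

    /* in the real code the neighbour task ids come from the block
       connectivity; here we simply send to ourselves as a demonstration */
    exchange_displacements(mytid, send_disp, recv_disp);
    printf("received %d interface displacements\n", NIFACE);

    pvm_exit();
    return 0;
}
```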
5.3. Parallelization of the Flow Solver

The implicit time integration used in this study is also an iterative solver in which an approximation of ΔQ for each cell is continually refined until acceptable convergence. In a given block, there are three types of cells, as follows:
• Type I (Interior). Interior cells: these are cells that do not have a face on any kind of boundary. All four faces of such cells are interior faces.
• Type PB (Physical Boundary). Boundary cells that are not adjacent to an outer interface boundary: these are cells that have a face on a physical boundary such as an inviscid wall, far-field boundary, etc., and not on an outer interface boundary.
• Type IB (Interface Boundary). Boundary cells that are adjacent to an outer interface boundary: these are cells that have a face on an outer interface boundary.

During the course of parallel computations, the time integration scheme runs on each of the blocks. However, the time integration scheme computes the relevant ΔQ values only for Type I and Type PB cells and not for Type IB cells in a given block. The matching and overlapping interface type used in this study ensures that the Type IB cells in a given block will be of either Type I or Type PB in the corresponding neighboring block. Hence, ΔQ values in Type IB cells in a given block are taken from the corresponding neighboring block as part of the data exchange during the course of parallel computations. The ΔQ values in Type IB cells in a given block are needed while computing the ΔQ values in Type I and Type PB cells that share a common face with the Type IB cells. This approach is very effective for the implicit time integration scheme used in this study. The main advantage of this approach is the fact that it eliminates the need for applying boundary conditions explicitly on outer interface boundaries. Furthermore, the continuity of the solution across the interfaces is guaranteed.
5.4. Parallel Computing Environment

In this research, a Linux cluster located at the NASA Glenn Research Center has been used as the parallel computing environment. The cluster has 32 Intel Pentium II 400 MHz dual-processor PCs connected via a Fast Ethernet network that has a transmission rate of 100 Mbps. Each PC has 512 MB memory and runs the Linux operating system. Message passing among the machines was handled by means of the PVM software. The main differences between the steady and the unsteady flow solution schemes are as follows:
• The dynamic grid algorithm is only used in unsteady flow problems. It is not needed in steady-state computations since the computational grid remains stationary.
• A local time stepping strategy accelerates the convergence to steady state, hence this strategy is used while solving steady-state problems.
• For unsteady problems, a global time increment has to be defined since the unsteady flow solution has to be time accurate. All cells in the computational grid use the same time increment during the course of unsteady flow computations.
• For both steady and unsteady problems, the flow solver performs 20 nonlinear iterations in each time step to solve the system of equations. Since time accuracy is not required in steady-state problems, the flow solver first performs the 20 nonlinear iterations and then communicates once in each time cycle while solving steady-state problems. This approach reduces the communication time requirements of the flow solver while solving steady-state problems.
• In unsteady problems, the flow solver needs to communicate once after every nonlinear iteration. In other words, the flow solver has to communicate 20 times in each time step in order to maintain time accuracy. If this is not done, errors are introduced (see the sketch following this list).
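The two communication strategies in the list above can be summarised by the following schematic loop. The functions nonlinear_subiteration() and exchange_interface_dQ() are placeholders for the real solver operations; only the placement of the communication call differs between the steady and unsteady modes.

```c
#include <stdio.h>

#define NSUBIT 20   /* nonlinear iterations per time step, as in the paper */

/* placeholders standing in for the real solver operations */
static void nonlinear_subiteration(void) { /* refine dQ on local cells   */ }
static void exchange_interface_dQ(void)  { /* block-interface data exchange */ }

static void advance_one_step(int unsteady)
{
    for (int k = 0; k < NSUBIT; ++k) {
        nonlinear_subiteration();
        if (unsteady)
            exchange_interface_dQ();   /* once per subiteration: time accurate */
    }
    if (!unsteady)
        exchange_interface_dQ();       /* once per time step suffices for steady state */
}

int main(void)
{
    for (int step = 0; step < 5; ++step) advance_one_step(0);  /* steady mode   */
    for (int step = 0; step < 5; ++step) advance_one_step(1);  /* unsteady mode */
    printf("done\n");
    return 0;
}
```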
6. TEST CASES

The program was first tested for accuracy by using the well-known oscillating and plunging cases for the NACA0012 airfoil. The accuracy of the results was verified by comparison with experiments [10] and other numerical solutions [11]. To demonstrate the applicability of the current method to complex problems, an aircraft configuration has been considered as the next test case. Due to space limitations, we will present the results of a generic aircraft case for testing the parallel efficiency of the code. Both steady and unsteady computations were done on the aircraft configuration. The steady-state computations were performed for a freestream Mach number of M∞ = 0.8 and zero degrees angle of attack. For unsteady computations, the aircraft was oscillated sinusoidally with an amplitude of 5 degrees. The reduced frequency was chosen as k = 0.50 for the unsteady calculations. The computations were performed on a coarse and a fine grid. The coarse grid has 319,540 cells and 57,582 nodes whereas the fine grid has 999,135 cells and 178,201 nodes. Figure 1 shows the surface triangulation for the coarse grid and Figure 2 shows the partitioned surface grid for the 16-block case. The steady-state computations were started at a CFL number of 5 on both grids. The CFL number was then linearly increased to 150 over the first 20 time steps. A local time stepping strategy was used while solving for the steady-state solution. 20 subiterations were needed in each time step and the blocks were communicated once at the end of the 20 subiterations in each time step. The steady-state solution on the coarse grid is reached in approximately 250 time steps whereas the steady solution on the fine grid needs about 300 time steps to converge.
Figure 1. Surface triangulation for the coarse aircraft grid.
Figure 2. Partitioned surface triangulation for the 16 block coarse grid partition.
Figure 3a shows the speedup curves of the steady-state solutions for the different cases on the coarse and the fine grids. The speedups on the coarse and the fine grid are nearly identical, and they are quite close to the ideal speedup curve. This shows that the current parallel steady-state flow solution scheme remains quite efficient even for a large number of blocks. In the unsteady test case, the aircraft configuration is pitching sinusoidally about the mid-chord with an amplitude of 5 degrees. For this test case, the reduced frequency, freestream Mach number, mean angle of pitching and the chord length are k = 0.50, M∞ = 0.80, α_m = 0°, and c = 8.76, respectively, where the chord length corresponds to the full length of the aircraft. The initial condition for this case is the previously obtained steady-state solution. The results were obtained using 1500 steps per cycle of motion. 20 subiterations were done in each time step and the blocks were communicated once every subiteration in order to maintain the time accuracy. Figure 3b shows the speedup curves of the unsteady solutions for the coarse and the fine grids and compares them with the ideal speedup curve. The speedup curve for the fine grid appears to be slightly better than the speedup curve for the coarse grid. The parallel efficiency of the unsteady algorithm decreases rapidly as the number of blocks is increased since the communication time requirements become more significant with an increasing number of blocks. This behavior can be attributed to two factors. The first factor is the dynamic grid algorithm that is embedded in the unsteady flow solution scheme. The dynamic grid algorithm is an iterative solver, hence an error is computed after every iteration of the dynamic grid algorithm. This error has to be less than a predefined tolerance, usually 10⁻⁵, in order for the dynamic grid algorithm to stop the iterations. The number of iterations necessary for convergence depends on the total number of nodes in the grid and on how much the moving boundaries are displaced. For example, for the unsteady flow solution on the coarse aircraft grid using 1500 time steps per cycle of motion, the number of iterations to deform the mesh at each time step ranges from about 50 to about 100. The second factor
responsible for the behavior of the total communication time is the fact that the unsteady flow solution scheme communicates 20 times in each time step of the nonlinear iterations in order to maintain the time accuracy.
[Figure 3: Speedup curves for different cases on the coarse and the fine grid, compared with the ideal speedup curve: (a) steady, (b) unsteady; speedup versus number of blocks.]

The time accuracy of the unsteady computations was established by comparing multi-block solutions with the single-block solution as well as by varying the time step Δt. Figures 4 and 5 show the deformed fine grid at the maximum and minimum angle of attack positions, respectively, of the unsteady computations.
Figure 4. Deformed coarse grid at maximum angle of attack position.
Figure 5. Deformed coarse grid at minimum angle of attack position.
7. CONCLUSIONS
In this research, a sequential program, USM3D, has been modified and parallelized for the solution of steady and unsteady Euler equations on unstructured grids. The solution algorithm was based on a finite volume method with an implicit time-integration scheme. Parallelization was based on domain decomposition and the message passing between the parallel processes was achieved using the Parallel Virtual Machine (PVM) library. Steady
and unsteady problems were analyzed to demonstrate the possible applications of the current solution method. Based on these test cases, the following conclusions can be made:
• The parallel steady-state solution scheme showed good efficiency for all multi-block cases. The speedup of the steady-state solution scheme was quite close to the ideal speedup curve.
• The parallel unsteady flow solution scheme showed less efficiency as the number of blocks was increased. The reason for the inefficiency for cases involving a large number of blocks is the increased communication time requirement of the iterative dynamic grid algorithm.
• Reasonable efficiencies are achieved for up to 32 processors while solving steady flows (90%) and 16 processors while solving unsteady flows (70%).

ACKNOWLEDGEMENTS

The permission provided by the NASA Langley Research Center for use of the flow code USM3D and its grid generator VGRID is gratefully acknowledged. The access to a Linux computer cluster provided by the NASA Glenn Research Center for parallel computations is also gratefully acknowledged.

REFERENCES
1. Trepanier, J.Y., Reggio, M., Zhang, H., Camarero, R., "A Finite Volume Method for the Euler Equations on Arbitrary Lagrangian-Eulerian Grids," Computers and Fluids, Vol. 20, No. 4, pp. 399-409, 1991.
2. Frink, N.T., Parikh, P., Pirzadeh, S., "A Fast Upwind Solver for the Euler Equations on Three-Dimensional Unstructured Meshes," AIAA Paper 91-0102, 1991.
3. Van Leer, B., "Flux Vector Splitting for the Euler Equations," Lecture Notes in Physics, Vol. 170 (Springer-Verlag, New York/Berlin, 1982), pp. 507-512.
4. Singh, K.P., Newman, J.C., Baysal, O., "Dynamic Unstructured Method for Flows Past Multiple Objects in Relative Motion," AIAA Journal, Vol. 33, No. 4, 1995.
5. Batina, J.T., "Unsteady Euler Algorithm with Unstructured Dynamic Mesh for Complex-Aircraft Aerodynamic Analysis," AIAA Journal, Vol. 29, No. 3, 1991.
6. Anderson, W.K., "Grid Generation and Flow Solution Method for Euler Equations on Unstructured Grids," Journal of Computational Physics, Vol. 110, pp. 23-38, 1994.
7. Bronnenberg, C.E., "GD: A General Divider User's Manual - An Unstructured Grid Partitioning Program," CFD Laboratory, IUPUI, 1999.
8. Akay, H.U., Blech, R., Ecer, A., Ercoskun, D., Kemle, B., Quealy, A., Williams, A., "A Database Management System for Parallel Processing of CFD Algorithms," Parallel CFD '92, Edited by R.B. Pelz, et al., Elsevier, Amsterdam, pp. 9-23, 1993.
9. Geist, G.A., Beguelin, A.L., Dongarra, J.J., Jiang, W., Manchek, R., Sunderam, V., "PVM 3 User's Guide and Reference Manual," Oak Ridge National Laboratory, ORNL/TM-12187, 1993.
10. Landon, R.H., "NACA 0012. Oscillating and Transient Pitching," Compendium of Unsteady Aerodynamic Measurements, Data Set 3, AGARD-R-702, Aug. 1982.
11. Kandil, O.A. and Chuang, H.A., "Computation of Steady and Unsteady Vortex-Dominated Flows with Shock Waves," AIAA Journal, Vol. 26, pp. 524-531, 1988.
Parallelization and MPI Performance of Thermal Lattice Boltzmann Codes for Fluid Turbulence

George Vahala (a), Jonathan Carter (b), Darren Wah (a), Linda Vahala (c) and Pavol Pavlo (d)

(a) Department of Physics, William & Mary, Williamsburg, VA 23187
(b) NERSC, Lawrence Berkeley Laboratory, Berkeley, CA
(c) Department of Electrical & Computer Engineering, Old Dominion University, Norfolk, VA 23529
(d) Institute of Plasma Physics, Czech Academy of Science, Praha 8, Czech Republic
The Thermal Lattice Boltzmann Model (TLBM) is presented for the solution of complex two-fluid systems of interest in plasma divertor physics. TLBM is a mesoscopic formulation that solves nonlinear macroscopic conservation equations in kinetic phase space, but with a minimal amount of discrete phase-space velocity information. Some simulations are presented for the two-fluid interaction of perpendicular double vortex shear layers. It is seen that TLBM has almost perfect scaling as the number of PE's increases. Certain MPI optimizations are described and their effects on the speed-up of the various computational kernels are tabulated.
1. INTRODUCTION

One of the goals of divertor physics [1] is to model the interaction between neutrals and the plasma by a coupled UEDGE/Navier-Stokes system of equations [2]. An inverse statistical mechanics approach is to replace these highly nonlinear two-fluid macroscopic equations by two coupled linear lattice BGK kinetic equations, which, in the Chapman-Enskog limit, will permit recovery of the original two-fluid system. While the complexity of phase space has been increased from (x, t) to the kinetic phase space (x, ξ, t), the TLBM approach now seeks to minimize (and discretize) the required degrees of freedom that must be preserved in ξ-space. As a result, instead of solving for the fluid species density n_s(x, t), mean velocity v_s(x, t), and temperature θ_s(x, t), we solve for the species distribution function N_s,pi(x, t), where p is the number of speeds and i is the directional index for the chosen velocity lattice. For example, on a hexagonal (2D) lattice, one can recover the nonlinear conservation equations of mass, momentum and energy by choosing p = 2, i = 6; i.e., a 13-bit model (for each species) can recover the 4-variable n, v_x, v_y, θ macroscopic system [3-10]. The storage requirements for the linear kinetic representation are increased over those needed for conventional CFD by not more than a factor of 2, because of the auxiliary storage arrays needed in the CFD approach. However, there are substantial computational gains achieved by this imbedding: (i) Lagrangian kinetic codes with local operations, ideal for scalability with PE's, and (ii) the avoidance of the nonlinear Riemann problem of CFD. There is also a sound physics reason for pursuing this kinetic imbedding. In the tokamak divertor, one encounters (time-varying) regimes in which the neutral collisionalities range from the highly collisional (well treated by fluid equations) to the weakly collisional (well treated by Monte Carlo methods). The coupling of fluid-kinetic codes is a numerically stiff problem. However, by replacing the fluid representation by a TLBM, we will be coupling two kinetic codes with non-disparate time and length scales.

In Sec. 2, we will present a two-fluid system of equations that we will represent by two linear kinetic BGK equations. Simulations will be presented for the case of the turbulent interaction between two perpendicular double-shear velocity layers. In Sec. 3, we will discuss the parallelization and MPI optimization of our TLBM code.
2. TWO-FLUID TRANSPORT EQUATIONS
We will determine a kinetic representation for the following coupled set of nonlinear conservation equations (species mass m_s, density n_s, mean velocity v_s, mean temperature θ_s):

\[
\partial_t (m_s n_s) + \partial_\alpha (m_s n_s v_{s,\alpha}) = 0
\]
\[
\partial_t (m_s n_s v_{s,\alpha}) + \partial_\beta \left( \Pi_{s,\alpha\beta} + m_s n_s v_{s,\alpha} v_{s,\beta} \right)
= -\frac{m_s n_s}{\tau_{ss'}} \left( v_{s,\alpha} - v_{ss',\alpha} \right)
\qquad (1)
\]
\[
\partial_t \left( 3 n_s \theta_s + m_s n_s v_s^2 \right)
+ \partial_\alpha \left( 2 q_{s,\alpha} + v_{s,\alpha}\left[ 3 n_s \theta_s + m_s n_s v_s^2 \right] + 2 v_{s,\beta}\, \Pi_{s,\alpha\beta} \right)
= -\frac{1}{\tau_{ss'}} \left[ 3 n_s \left( \theta_s - \theta_{ss'} \right) + m_s n_s \left( v_s^2 - v_{ss'}^2 \right) \right]
\]

where Π_s is the stress tensor (with viscosity μ_s = τ_ss n_s θ_s)

\[
\Pi_{s,\alpha\beta} = n_s \theta_s \delta_{\alpha\beta}
- \mu_s \left( \partial_\alpha v_{s,\beta} + \partial_\beta v_{s,\alpha} - \delta_{\alpha\beta}\, \partial_\gamma v_{s,\gamma} \right)
+ \frac{m_s n_s \tau_{ss}}{\tau_{ss'}} \left[ \frac{\theta_{ss'} - \theta_s}{m_s}\, \delta_{\alpha\beta}
+ \left( v_{ss',\alpha} - v_{s,\alpha} \right)\left( v_{ss',\beta} - v_{s,\beta} \right)
- \tfrac{1}{2}\, \delta_{\alpha\beta} \left( v_{ss'}^2 - v_s^2 \right) \right]
\]

with τ_ss the s-species relaxation time and τ_ss' the cross-species relaxation rate. The heat flux vector (with conductivity κ = 5μ/2) is

\[
q_{s,\alpha} = -\kappa\, \partial_\alpha \theta_s
+ \frac{5 \tau_{ss}}{2 \tau_{ss'}}\, n_s \left( \theta_{ss'} - \theta_s \right) \left( v_{ss',\alpha} - v_{s,\alpha} \right)
+ \frac{m_s n_s}{2 \tau_{ss'}} \left( v_s - v_{ss'} \right)^2 \left( v_{s,\alpha} - v_{ss',\alpha} \right)
\]
2.1 Thermal Lattice Boltzmann Model of Eq. (1)

In the Chapman-Enskog limit, one can recover the above nonlinear macroscopic moment equations from the following linearized BGK kinetic equations [11,12]

\[
\partial_t f_s + \partial_\alpha \left( \xi_\alpha f_s \right)
= -\frac{f_s - g_{ss}}{\tau_{ss}} - \frac{f_s - g_{ss'}}{\tau_{ss'}} \qquad (2)
\]

for species s interacting with species s'. Here g_ss is the s-species relaxation Maxwellian distribution function, while g_ss' is the cross-species Maxwellian distribution

\[
g_{ss'} = n_s \left( \frac{m_s}{2\pi\,\theta_{ss'}} \right)^{3/2}
\exp\!\left[ -\frac{m_s \left( \xi - v_{ss'} \right)^2}{2\,\theta_{ss'}} \right] \qquad (3)
\]
"Css ns ~. Os for some constant r (dependent on the interactions); while the second term corresponds to the cross-species collisional relaxation at rate 1/~
Xss, = ~E
ms+ ms'
,
where
~E =
-2
(ms + ms
nsns'
+
ms'
and 7 is another constant dependent on the type of collisional interaction. The cross-species parameters vss' and 0ss. are so chosen that the equilibration rate of each species mean velocity and temperature to each other will proceed at the same rate for these linear BGK collision operators as for the full nonlinear Boltzmann collision integrals [11,12]. The only restriction on the parameter ~ is: ~ > -1. In TLBM [3-10], one performs a phase velocity discretization, keeping a minimal representation that will still recover Eq. (1) in the Chapman-Enskog limit. For a hexagonal (2D) lattice, we require at each spatial node 13 bits of information in ~-space [3-10]
fs(X,~,t)
r
Ns,pi(X,t ) , i=1 ....6, p=0,1,2
Performing a Lagrangian discretization of the linearized BGK equations,
Us, pi (x, t )- U'se~i(x et~ Ns, pi(X +epit + 1)-Ns, p i ( X , t ) = -
T, ss
Ns, pi (x,
t) -
Nseq pi ( x, t)
T,ss'
(4)
426 where the N eq a r e appropriate Taylor-expanded form of the corresponding Maxwellians [c.f. Eq. (2)]. r are the lattice velocities.
2.2 Simulation for 2-Species Turbulence : double vortex layers We now present some simulations for a 2-species system in 2D, with double vortex layers perpendicular to each other: fluid #1 has a double vortex layer in the x-direction, while fluid #2 has a double vortex layer in the y-direction. The two fluids are coupled through a weak inter-species collisionality, with Zss' >> %, ":s'. In Fig.1 we plot the evolution of the mass density weighted resultant vorticity of the 2-fluid system, in which mz - ml, n2 - 2 nl. The first frame is after 1000 TLBE time steps. The vortex layers for fluid #1 (in the x-direction) and for fluid #2 (in the y-direction) are still distinct as it is too early in the evolution for much interaction between the two species. The vortex layers themselves are beginning to break up, as expected for 2D turbulence with its inverse cascade of energy and the formation of large spacing-filling vortices. After 5K time steps (which corresponds to about 6 eddy turnover times), larger structures are beginning to form as the vortex layers break up into structures exhibiting many length scales. After 9K time steps, the dominant large vortex and counter-rotating vortex are becoming evident. The smaller structures are less and less evident, as seen after 13K time steps. Finally, by 17K time steps one sees nearly complete inter-species equilibration. More details can be found in Ref. 13.
3. P A R A L L E L I Z A T I O N OF T L B M CODES
3.1 Algorithm For simplicity, we shall restrict our discussion of the parallelization of TLBM code to a single species system [i.e., in Eq. (4), let "Css'---) r The numerical algorithm to advance from time t --)t + 1 is: (a) at each lattice site x, free-stream Npi along its phase velocity Cpi :
(5)
Npi(X) ~ Npi(X + Cpi) (b) recalculate the macroscopic variables n, v, 0 so as to update N eq = Neq(n, v, 0) (c) perform collisional relaxation at each spatial lattice node:
Npi(x)N p i ( X ) --
(x)
--~ Npi (X), at time t + 1
T,
It should be noted that this algorithm is not only the most efficient one can achieve but also has a (kinetic) CFL - 1, so that neither numerical dissipation or diffusion is introduced into the simulation.
427 3.2 P e r f o r m a n c e on the C R A Y C90 Vector S u p e r c o m p u t e r On a dedicated C90 with 16 PE's at 960 MFlops/sec and a vector length of 128, the TLBM code is almost ideally vectorized and parallelized giving the following statistics on a 42minute wallclock run 9
Table 1 Timing for TLBM code on a dedicated C90 Floating Ops/sec avg. conflict/ref CPU/wallclock time ratio Floating Ops/wal! sec Average Vector Length for all operations Data Transferred
603.95 M 0.15 15.52 9374.54 M 127.87 54.6958 MWords
3.3 Optimization of the M P I Code on T 3 E With MPI, one first subdivides the spatial lattice using simple domain decomposition. The TLBM code consists of 2 computational kernels, which only act on local data: 9"integrate" - in which one computes the mean macroscopic variables, (b)" and 9 "collision" - in which one recomputes N eq using the updated mean moments from "integrate", and then performs the collisional relaxation, (c). 9 The "stream" operation, (a), is a simple shift operation that requires message passing only to distribute boundary data between PE's. With straightforward MPI, the execution time for a particular run was 5830 sec - see Table 2 - of which 80% was spent in the "collision" kernel. For single PE optimization, we tried to access arrays with unit stride as much as possible (although much data was still not accessed in this fashion), to reuse data once in cache, and to try to precompute expensive operations (e.g., divide) which were used more than once. This tuning significantly affected "collision", increasing the Flop-rate for this kernel from 24 MFlops to 76 MFlops. The "integrate" kernel stayed at 50 MFlops. As the "stream" operation is simply a sequence of data movements, both on and off PE, it seemed more appropriate to measure it in terms of throughput. Initially, "stream" propagated information at an aggregate rate of 2.4 MB/sec, using mainly stride-one access. We next investigated the effect of various compiler flags on performance. Using fg0 flags: -Ounroll2, pipeline3,scalar3,vector3,aggress-lmfastv further increased the performance of the "collision" kernel to 86 MFlops and "stream" to 3.4 MB/sec. The overall effect can be seen in column 2 of Table 2. We then looked at optimizing the use of MPI. By using user-defined datatypes, defined through MPI_TYPE_VECTOR, data that are to be passed between PE's which are separated by a constant stride can be sent in one MPI call. This eliminates the use of MPI_SEND and MPI_RECV calls within do-loops and reduces the total wait time/PE from 42 secs. (33% of total time) to just 5 secs. (5% of total time). The overall effects of this optimization are seen in column 3 of Table 2. It should be noted that the computational kernels, "collision" and "integrate", and the data propagation routine "stream" access Npi in an incompatible manner. We investigated whether
428
further speed-up could be achieved by interchanging array indices so as to obtain optimal unit stride in "collision" and "integrate" - such a change resulted in non stride-one access in "stream". To try and mitigate this access pattern, we made use of the "cache_bypass" feature of the Cray T3E, where data moves from memory via E-registers to the CPU, rather than through cache. Such an approach is more efficient for non stride-one access. This flipped index approach resulted in a speed-up in "collision" from 86 to 96 MFlops, and in "integrate" from 52 to 70 MFlops, while "stream" degraded in performance from 3.4 to 0.9 MB/sec. The total effect was slower than our previous best, so this strategy was abandoned.
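The derived-datatype optimisation described above can be sketched as follows: a strided column of a row-major array is described once with MPI_Type_vector and then exchanged in a single MPI_Sendrecv call instead of a do-loop of individual sends and receives. The array sizes, the ring-neighbour pattern and the choice of which column is exchanged are illustrative assumptions.

```c
#include <mpi.h>
#include <stdio.h>

#define NX 8
#define NY 6

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double a[NX][NY];
    for (int i = 0; i < NX; ++i)
        for (int j = 0; j < NY; ++j)
            a[i][j] = rank;

    /* one column of a row-major array: NX blocks of 1 double, stride NY */
    MPI_Datatype column;
    MPI_Type_vector(NX, 1, NY, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    /* send our last column to the right neighbour and receive the left
       neighbour's column into our first column, one call each instead of
       NX separate MPI_Send/MPI_Recv calls inside a loop                   */
    MPI_Sendrecv(&a[0][NY-1], 1, column, right, 0,
                 &a[0][0],    1, column, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}
```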
Table 2: Performance of various routines under optimization. The total execution time was reduced by a factor of 4.42, while the "collision" routine was optimized by nearly a factor of 6. MPI optimization increased the performance of "stream" by a factor of 5.8, while the time spent in the SEND, RECV calls was reduced by a factor of over 10.

                  Initial code         Single-PE optim.     Single-PE + MPI optim.
Total             5830 sec             2050 sec             1320 sec
COLLISION         4670 sec   80.1%     893 sec    43.6%     788 sec    59.7%
INTEGRATE         359 sec     6.2%     360 sec    17.6%     369 sec    28.0%
STREAM            791 sec    13.6%     780 sec    38.1%     134 sec    10.2%
MPI SEND          396 sec     6.8%     400 sec    19.2%     -
MPI RECV          294 sec     5.0%     278 sec    13.6%     -
MPI SENDRECV      -                    -                    63 sec      5.0%

In Table 3, we present the nearly optimal scaling of the MPI code with the number of PE's, from 16 to 512 processors. In this comparison, we increased the grid dimensionality in proportion to the increase in the number of PE's so that each PE (in the domain decomposition) worked on arrays of the same dimensions.

Table 3: Scaling with PE's on the T3E

# PE's             GRID            CPU/PE (sec)
4 x 4    (16)      1024 x 1024     999.3
8 x 4    (32)      2048 x 1024     1003.6
8 x 8    (64)      2048 x 2048     1002.8
16 x 16  (256)     4096 x 4096     1002.2
32 x 16  (512)     8192 x 4096     1007.2
Some benchmarked runs were also performed on various platforms, and the results are presented in Table 4. In these runs, the dimensionality was fixed at 512 x 512 while the number of PE's was increased. The scalings are presented in brackets, showing very similar scalings between the T3E and the SP 604e PowerPC.

[Figure 1: Evolution of the composite mass-density vorticity for a 2-fluid system with perpendicular double vortex layers. The frames are separated by 4000 TLBE time steps (with an eddy turn-over time corresponding to 900 TLBE time steps). The approach to the time-asymptotic 2D space-filling state of one large vortex and one large counter-rotating vortex is clearly seen by 20 eddy turnover times (last frame).]
Table 4: Benchmarked timings (512 x 512 grid)

MACHINE              16 PE's      32 PE's              64 PE's
CRAY T3E             1360         715   [ 0.526 ]      376   [ 0.276 ]
SP POWER2 SUPER      1358         834   [ 0.614 ]      463   [ 0.341 ]
SP 604e POWERPC      1487         800   [ 0.538 ]      424   [ 0.285 ]
REFERENCES

1. J.A. Wesson, Tokamaks, Oxford Science Publ., 2nd Ed., 1997.
2. D.A. Knoll, P.R. McHugh, S.I. Krasheninnikov and D.J. Sigmar, Phys. Plasmas 3 (1996) 422.
3. F.J. Alexander, S. Chen and J.D. Sterling, Phys. Rev. E47 (1993) 2249.
4. G.R. McNamara, A.L. Garcia and B.J. Alder, J. Stat. Phys. 81 (1995) 395.
5. P. Pavlo, G. Vahala, L. Vahala and M. Soe, J. Comput. Phys. 139 (1998) 79.
6. P. Pavlo, G. Vahala and L. Vahala, Phys. Rev. Lett. 80 (1998) 3960.
7. M. Soe, G. Vahala, P. Pavlo, L. Vahala and H. Chen, Phys. Rev. E57 (1998) 4227.
8. G. Vahala, P. Pavlo, L. Vahala and N.S. Martys, Intl. J. Modern Phys. C9 (1998) 1274.
9. D. Wah, G. Vahala, P. Pavlo, L. Vahala and J. Carter, Czech J. Phys. 48(S2) (1998) 369.
10. Y. Chen, H. Ohashi and H. Akiyama, Phys. Rev. E50 (1994) 2776.
11. T.F. Morse, Phys. Fluids 6 (1964) 2012.
12. J.M. Greene, Phys. Fluids 16 (1973) 2022.
13. D. Wah, Ph.D. Thesis, William & Mary (June 1999).
Parallel computation of two-phase flows using the immiscible lattice gas

Tadashi Watanabe and Ken-ichi Ebihara

Research and Development Group for Numerical Experiments, Center for Promotion of Computational Science and Engineering, Japan Atomic Energy Research Institute, Tokai-mura, Naka-gun, Ibaraki-ken, 319-11, Japan
The two-dimensional two-phase flow simulation code based on the immiscible lattice gas, which is one of the discrete methods using particles to simulate two-phase flows, is developed and parallelized using the MPI library. Parallel computations are performed on a workstation cluster to study the rising bubble in a static fluid, the phase separation in a Couette flow, and the mixing of two phases in a cavity flow. The interfacial area concentration is evaluated numerically, and its dependencies on the average density, wall speed, and region size are discussed.

1. INTRODUCTION

Two-phase flow phenomena are complicated and difficult to simulate numerically since two-phase flows have interfaces between phases. Numerical techniques for simulating two-phase flows with interfaces have recently progressed significantly. Two types of numerical methods have been developed and applied: a continuous fluid approach, in which partial differential equations describing fluid motion are solved, and a discrete particle approach, in which motions of fluid particles or molecules are calculated. It is necessary in a continuous fluid approach to model the coalescence and disruption of the interface. On the other hand, in the discrete particle approach, the interface is automatically calculated as the boundary of the particle region. Among particle simulation methods, the lattice gas automata (LGA) is one of the simple techniques for simulation of phase separation and free-surface phenomena, as well as macroscopic flow fields. The LGA introduced by Frisch, Hasslacher and Pomeau (FHP) asymptotically simulates the incompressible Navier-Stokes equations [1]. In the FHP model, space and time are discrete, and identical particles of equal mass populate a triangular lattice. The particles travel to neighboring sites at each time step, and obey simple collision rules that conserve mass and momentum. Macroscopic flow fields are obtained by coarse-grain averaging in space and time. Since the algorithm and programming are simple and complex boundary geometries are easy to represent, the LGA has been applied to numerical simulations of hydrodynamic flows [2]. In the LGA, only seven bits of information are needed in order to specify the state of a site, and only the information of neighboring sites is used for
updating the state of a site. The LGA is thus efficient in memory usage and appropriate for parallel computations [3]. The immiscible lattice gas (ILG) is a two-species variant of the FHP model [4]. Red and blue particles are introduced, and the collision rules are changed to encourage phase segregation while conserving momentum and the number of red and blue particles. In this study, the two-dimensional two-phase flow simulation code based on the ILG is developed and parallelized using the MPI library. Parallel computations are performed on a workstation cluster to simulate typical two-phase flow phenomena. The interfacial area concentration, which is one of the important parameters in two-phase flow analyses, is evaluated numerically, and its dependencies on the density, wall speed, and region size are discussed.

2. THE IMMISCIBLE LATTICE GAS MODEL

In the ILG, a color field is introduced to the LGA [4]. Red and blue particles are introduced, and a red field is defined as the set of seven Boolean variables:
\[
r(x) = \{ r_i(x) \in \{0, 1\},\ i = 0, 1, \ldots, 6 \}, \qquad (1)
\]
where r_i(x) indicates the presence or absence of a red particle with velocity c_i at lattice site x: c_0 = 0 and c_1 through c_6 are unit vectors connecting neighboring sites on the triangular lattice. The blue field is defined in a similar manner. Red and blue particles may simultaneously occupy the same site but not with the same velocity. The phase segregation is generated by allowing particles from nearest neighbors of site x to influence the output of a collision at x. Specifically, a local color flux q[r(x), b(x)] is defined as the difference between the net red momentum and the net blue momentum at site x:
(2)
i
The local color field f ( z ) is defined to be the direction-weighted sum of the differences between the number of red particles and the number of blue particles at neighboring sites: f(x)
+
-
i
- bj(
+ c )l.
(3)
j
The work performed by the color flux against the color field is W(r, b) - - f - q(r, b).
(4)
The result of a collision at site z, r ~ r t and b ~ bt, is then with equal probability any of the outcomes which minimize the work,
W(r', b') - m i n W ( r , b),
(5)
subject to the constraints of colored mass conservation,
i
- Z
Z
i
i
-
(6) i
433
and colorblind momentum conservation,
c,(,-.; + b;) - Z c;(fi + b',). i
(7)
i
3. P A R A L L E L C O M P U T A T I O N
ON A WORKSTATION
CLUSTER
The ILG code is parallelized, and parallel computations are performed on a workstation cluster. The domain decomposition method is applied using the MPI library, and the two-dimensional simulation region is divided into four small domains in the vertical direction. The workstation cluster consists of four COMPAQ au600 (Alpha 21164A, 600MHz) workstations connected through 100BaseTx (100 Mb/s) network and switching hub. As an example of parallel computations, the rising bubble in a static fluid is shown in Figs. 1 and 2. Figure 1 is obtained by a single-processor calculation, while Fig. 2 is by the parallel calculation using four workstations. The simulation region of the sample problem is 256 x 513 lattices, and 32 x 64 sampling cells are used to obtain flow variables. The number of particles is about 0.67 million. A circular red bubble with a radius of 51 lattice units is placed in a blue field in the initial condition. Periodic conditions are applied at four boundaries of the simulation region. The external force field is applied to the particles by changing the direction of the particle motion. The average amount of momentum added to the particles is 1/800 in the upward direction for red and 1/10000 for blue in the downward direction. A flow field is obtained by averaging 100 time steps in each sampling cell. The 20th flow field averaged from 1901 to 2000 time steps is shown in Figs. 1 and 2.
256 X
cyclic
ratio
:~!,?~:i~!~!i)i~i?ii!iii:~i!Ui84i ! 84184184 il, ~ ii~ii!iiillil ~:ii i~,~!ii i~~!i i i l,!ili i i!i !i ;i i!;~ii~!i !i i i!!ii~j Zii~!iiiiiii i~i ~i i~i!! i;i:~? 84
5t3
ioo
dec
sam pting Cells
c~i!ie: Figure 1: Rising bubble obtained by single-processor calculation.
.....
i
.....
......
Figure 2: Rising bubble obtained by fourprocessor calculation.
434 The shape of the rising bubble is shown in Figs. 1 and 2 along with the stream lines in the flow field. It is shown in these figures that the rising bubble is deformed slightly due to the vortex around the bubble. The pressure in the blue field near the top of the bubble becomes higher due to the flow stagnation. The downward flow is established near the side of the bubble and the pressure in the blue fluid becomes lower than that in the bubble. The bubble is thus deformed. The downstream side of the bubble is then concaved due to the upward flow in the wake. The deformation of the bubble and the vortex in the surrounding flow field are simulated well in both calculations. In the lattice-gas simulation, a random number is used in collision process of particles in case of two or more possible outcomes. The random number is generated in each domain in the four-processor calculations, and the collision process is slightly different between single-processor and four-processor calculations. The flow fields shown in Figs. 1 and 2 are thus slightly affected. The speedup of parallel computations is shown in Fig. 3. The speedup is simply defined as the ratio of the calculation time: T1/Tn, where T1 and T~ are the calculation time using single processor and n processors, respectively. In Fig. 3, the speedup on the parallel server which consists of four U14.0 traSparc (300MHz) processors cono--o COMPAQ;100Mb/s / ~ nected by a 200MB/s high-speed network is also shown. The calculation speed of the COMPAQ 3.0 600au workstation is about two "O times faster than that of the U1traSparc processor. The network c~ (/) 2.0 speed is, however, 16 times faster for the UltraSparc server. Although the network speed is much slower, the speedup of parallel computation is 1.0 better for the workstation cluster. 2 3 4 The difference in parallel processNumber of processors ing between the workstation cluster and the parallel server may affect the Figure 3: Speedup of parallel computations. speedup. I
4. I N T E R F A C I A L
I
AREA CONCENTRATION
The interracial area concentration has been extensively measured and several empirical correlations and models for numerical analyses were proposed [5]. Various flow parameters and fluid properties are involved in the correlations according to the experimental conditions. The exponents of parameters are different even though the same parameters are used, since the interracial phenomena are complicated and an accurate measurement is difficult. The interfacial area concentration was recently shown to be measured numerically by using the ILG and correlated with the characteristic variables in the flow field [6]. The interfacial area was, however, overestimated because of the definition of the
435
interfacial region on the triangular lattice. The calculated results in the small simulation region (128 x 128) and the average density from 0.46 to 0.60 were discussed in Ref [6]. In this study, the effects of the characteristic velocity and the average density on the interracial area concentration are discussed by performing the parallel computation. The definition of the interfacial region is slightly modified. The effect of the region size is also discussed. In order to evaluate the interfacial area concentration, red and blue regions are defined as the lattice sites occupied by red and blue particles, respectively. The lattice sites with some red and blue particles are assumed to be the interfacial lattice sites. In contrast to Ref. [6], the edge of the colored region is not included in the interfacial region. The interfacial area concentration is thus obtained as the number of interfacial lattice sites divided by the number of all the lattice sites. The interfacial area concentration is measured for two cases: the phase separation in a Couette flow and the mixing of two phases in a cavity flow. The simulation region is 128 x 128 or 256 x 256, and the average density is varied from 0.46 to 0.73. The no-slip and the sliding wall conditions are applied at the bottom and top boundaries, respectively. At the side boundaries, the periodic boundary condition is applied in the Couette flow problem, while the no-slip condition is used in the cavity flow problem. The initial configuration is a random mixture of red and blue particles, with equal probabilities for each, for the Couette flow and a stratified flow of red and blue phases with equal height for the cavity flow. The interfacial area concentration, Ai, in the steady state of the Couette flow is shown in Fig. 4 as a function of the wall speed, Uw. 1.25 i c~ 9d=O.46(128x128) d=O.46(256x256) ' The error bar indicates the standard = d=O.53(128x128) deviation of the fluctuation in the --r1.20 ': - ~ d=O.53(256x256) steady state. The phase separation --A d=O.60(128x128) -7J, --V in the steady state obtained by the 1.15 1 --A d=0.60(256x256) - . ~ _ ~ ' ~ ~'r/ ~_ _ parallel computation in this study .% are almost the same as shown in Ref < 1.10 [6]. The interface is fluctuating due to the diffusivity of one color into the 1.05 other, and the interfacial region has a thickness even tbr Uw = 0. In order to see the effect of the wall speed 0.95 clearly, the interfacial area concen0.0 0.1 0.2 0.3 tration for the case with Uw = 0 is Wall used for normalization in Fig. 4. It is shown that Ai*(Ai/Ai(u~=o)) Figure 4: Effect of the wall speed on Ai ~ in the increases with an increase in the wall steady state Couette flow. speed. This is because the detbrmation of phasic region is large due to the increase in phasic momentum. = =
speed
436
It was, however, shown for the Couette flow that Ai* did not increase largely with an increase in the wall speed [6]. The interfacial region was estimated to be large in Ref. [6] due to the definition of the interface and the phasic region. The effect of the wall speed was thus not shown clearly in Ref. [6]. The interracial areal concentration was shown to be correlated with the wall speed and increased with U w 1/2 [6]. The same tendency is indicated in Fig. 4. The increase in Ai* is seen to be small for d = 0.46 in comparison with other cases. The surface tension is relatively small for low densities and the interfacial region for U w = 0 is large in this case. The effect of the wall speed is thus small for d = 0.46. The effect of the wall speed is large for higher densities in case of the small region size (128 x 128). The increase in A i ~ is larger for d=0.60 than for d=0.53. For the large region size (256 x 256), however, the increase in A i ~ is slightly larger for d = 0.53 than for d = 0.60. This is due to the difference in the momentum added by the wall motion. The interracial area is determined by the surface tension and the momentum added by the wall motion. The average momentum given to the particle is small for the case with the large region size when the wall speed is the same, and the increase in Ai* is small for the region size of 256 x 256 as shown in Fig. 4. The effect of the wall speed is thus slightly different for the different region size. The interfacial area concentration in the steady state of the cavity flow is shown in Fig. 5. The steady state flow fields obtained by the parallel computation are almost the same as shown in Ref [6]. Although the effect of the wall speed for higher densities in case of the small region size is different from that in Fig. 4, the dependency of d=O.46(128x128) t 1.25 =o---o 9d=O.46(256x256) Ai* on the wall speed is almost the ' =- - -- d=O.53(128x128) same as for the Couette flow. 1.20 ~- - ~ d=O.53(256x256) It is shown in Figs. 4 and 5 that the increase in Ai* is not so different 1.15 quantitatively between the Couette flow and the cavity flow. The dif< 1.10 ference in the flow- condition is the 1.05 boundary condition at the side wall: periodic in the Couette flow and no1 00~ slip in the cavity flow. The effect of the side wall is, thus, found to be 0.95 . . . . . small in these figures. 0.0 0.1 0.2 0.3 The effect of the wall speed was Wall discussed for the densities from 0.46 to 0.60 in Figs. 4 and 5, since this Figure 5: Effect of the wall speed on Ai* in the density range was studied in Ref. [6]. steady state cavity flow. The dependency of Ai* on the wall speed in the cavity flow is shown in Fig. 6 for higher density conditions.
I
--
.~
====
speed
437
1.25
0.42
e, e, d=O.65(128x128) o ~ o d=O.65(256x256) u - - i d=O.73(128x128) ~--- ~ d=O.73(256x256)
1.20 1.15
0.38
1.10
9Uw=O.O5(128x128) o ~ - o Uw=O.O5(256x256) t-~ Uw=O.15(128x128) c- -4::] 1Uw=O. 5 ( 2 5 6 x 2 . _ 56)
~ ,
0.34
\
1.05 0.30
1.00-~
=,==
0.95 0.00
0.10
0.20
Wall speed
0.30
Figure 6: Effect of the wall speed on A i ~ in the steady state cavity flow for higher densities.
0.26
0.4
.
. . 0.5
.
. 0.6
0.7
0.8
Density Figure 7: Dependency of interfacial area concentration on the density.
It is shown that the increase in A i ~ becomes negative as the wall speed increases. In the steady state cavity flow, large regions and small fragments of two phases are seen in the flow field [6]. The decrease in A i ~ indicates that the large phasic region increases and the small fragments of one phase decreases. The mean free path is becomes shorter and the collisions of particles increases for higher density conditions. The coalescence of phasic fragments may occur frequently as the wall speed increases. The dependency of A i on the average density, d, in the cavity flow is shown in Fig. 7. The interfacial area concentration is not normalized in this figure. It is seen, when the wall speed is small ( U w = 0.05), that A i becomes minimum at around d = 0.6. This density almost corresponds to the maximum surface tension [7]. The interracial area is thus found to be smaller as the surface tension becomes larger, since the deformation of phasic region is smaller. The value of the density which gives the minimum A i is slightly shifted to higher values as the wall speed increases as shown in Fig. 7. The effect of the coalescence of small fragments may be dominant for higher density conditions as shown in Fig. 6. The interracial area for Uw = 0.15, thus, becomes small as the density increases. 5. S U M M A R Y In this study, the ILG has been applied to simulate the rising bubble in a static fluid, the phase separation in a Couette flow and the mixing of two phases in a cavity flow. The ILG code was parallelized using the MPI library and the parallel computation were performed on the workstation cluster. The interface was defined as the intert'acial lattice sites between two phases and the interfacial area concentration was evaluated numerically.
438 It was shown in the steady state that the interfacial area concentration increased with the increase in the wall speed for relatively lower density conditions (d = 0 . 4 6 - 0.60). In the higher density conditions, however, the increase in the interracial area concentration was negative as the wall speed increased. It was shown that the coalescence of phasic fragments might be important for higher density conditions. Phase separation and mixing with a change of interface are complicated and difficult to simulate by conventional numerical methods. It was shown in Ref. [6] and this study that the interracial area concentration was evaluated systematically using the ILG, which is one of the discrete methods using particles to simulate multi-phase flows. The colored particles used in the ILG have, however, the same properties, and the two-phase system with large density ratio cannot be simulated. Although several models based on the LGA have been proposed to simulate various two-phase flows, the original ILG was used in this study to test the applicability of the LGA. Our results demonstrate that interracial phenomena in multi-phase flows can be studied numerically using the particle simulation methods. REFERENCES
[1] U. Frisch, B. Hasslacher, and Y. Pomeau, Phys. Rev. Lett. 56, 1505(1986). [2] D. H. Rothman and S. Zaleski, Ref. Mod. Phys. 66, 1417(1994). [3] T. Shimomura, G. D. Doolen, B. Hasslacher, and C. Fu, In: Doolen, G. D. (Ed.). Lattice gas methods for partial differential equations. Addison-Wesley, California, 3(1990). [4] D. H. Rothman and J. M. Keller, J. Stat. Phys. 52, 1119(1988). [5] G. Kocamustafaogullari, W. D. Huang, and J. Razi, Nucl. Eng. Des. 148, 437(1994). [6] T. Watanabe and K. Ebihara, Nucl. Eng. Des. 188, 111(1999). [7] C. Adler, D. d'Humieres and D. Rothman, J. Phys. I France 4, 29(1994).
Parallel ComputationalFluid Dynamics Towards Teraflops, Optimizationand Novel Formulations D. Keyes,A. Ecer, J. Periaux, N. Satofukaand P. Fox (Editors) 2000 Elsevier Science B.V. All rights reserved.
439
Parallel performance modeling of an implicit advection-diffusion solver P. Wilders ~* ~Departement Information Technology and Systems, Delft University, P.O. Box 5031, 2600 GA Delft, The Netherlands,p.wilders~its.tudelft.nl.
A parallel performance model for a 2D implicit multidomain unstructured finite volume solver is presented. The solver is based upon the advection-diffusion equation and has been developed for modeling tracer transport. The parallel performance model is evaluated quantitatively. Moreover, scalability in the isoefficiency metric is discussed briefly. 1. I N T R O D U C T I O N We focus on an implicit advection-diffusion solver with domain decomposition as the basic tool for parallelism. Experiments on the SP2 show acceptable efficiencies up to 25 processors [1]. The analysis of communication patterns and sequential overhead did not explain the observed efficiencies completely [2]. Therefore, we decided to develop a full parallel performance model and to evaluate this model in some detail. Generally speaking, our interests are towards large scale modeling. For this reason the emphasis is on linearly scaled problems and, more generally, on scalability in the isoefficiency metric. Related studies on parallel performance analysis of Krylov-Schwarz domain decomposition methods have been undertaken in [3], [4], [5] and [6] for fixed size problems. Load balancing effects are absent or assumed to be absent and the authors introduce analytical expressions relating the iterative properties to the dimensions of the problem. The consequence is that the emphasis in these studies is on qualitative issues. Instead, our study includes quantitative aspects and some extra care is needed at these points. The present advection-diffusion solver has been developed for modeling the transport of passive scalars (tracers) in surface and subsurface environmental engineering [7], [8]. In this paper we present parallel results for a basic problem in this field, i.e. an injection/production problem in a porous medium (quarter of five spots). 2. S O M E B A S I C I S S U E S
We consider a scalar conservation law of the form
Oc p~--~+V.[vf(c)-DVc]-0
,
xEftE~2
,
t>0
.
(1)
*This work has been supported by the Dutch Ministry of Economic Affairs as part of the HPCN-TASC project.
440 In this paper, only tracer flows are considered. The coefficients 9~, v and D are assumed to be time-independent and f(c) = c. Our interests are on the advection-dominated case, i.e. the diffusion tensor D depends on small parameters. For the spatial discretization we employ cell-centered triangular finite volumes, based upon a 10-point molecule [9]. This results in the semi-discrete system dc
L?-/-
,
(2)
for the centroid values of the concentration. L is a diagonal matrix, containing the cell values of the coefficient q~ multiplied with the area of the cell. Moreover, F is a nonlinear differentiable function of c. We set
J-
OF Oc
"
(3)
The Jacobian J represents a sparse matrix with a maximum of 10 nonzero elements in each row. The linearly implicit trapezoidal rule is used for the time integration:
( ~-~L 2Jn)cn+l __ ( ~--~L ~1 j n ) c " + F,~ .
(4)
Here, Tn denotes the time step. The scheme is second-order accurate in time. The linear system (4) is solved iteratively by means of a one-level Krylov-Schwarz domain decomposition m e t h o d - in our case a straightforward nonoverlapping additive Schwarz preconditioner with GMRES as the Krylov subspace method. ILU-preconditioned BiCGSTAB is used for the approximate inversion of the subdomain problems. Experimental studies have been conducted for tracer flows in reservoirs. The velocity v is obtained from the pressure equation, using strongly heterogeneous permeability data provided by Agip S.p.A (Italian oil company). In this paper we consider the quarter of five spots, a well-known test problem for porous medium applications. The total number of unknows is N. As soon as N varies, either in the analysis or in experiments, it is important to take application parameters into account [10]. From the theory of hyperbolic difference schemes it is known that the Courant number is the vital similarity parameter. Therefore, we fix the Courant number. This means that both the spatial grid size h (N = O(1/h2)) and the time step ~- vary with v/h constant. It is possible to reduce the final system to be solved in the domain decomposition iteration to a small system formulated in terms of interface variables [11]. This makes it possible to include the GMRES routine in the sequential task, which is executed by the master. As a consequence, all communication is directed to and coming from the master (see Figure 1). 3. T I M I N G
AND
PERFORMANCE
MEASURES
Our concern is the computional heart of the code, the time stepping loop. For ease of presentation we consider a single time step. Let p denote the number of processes (each process is executed on a separate processor). There are three phases in the program: computation as a part of the sequential task 0, computation as a part of the distributed tasks
441 gather: interface v a r i a b l e ~ , h ~ scatter: halo variables gat / ~ ~ @ - ~ / sc:tter
~
S broadcast
process 1
Figure 1. Communication patterns.
1, .., p and communication. Inversion of subdomain problems is part of the distributed tasks. For the sake of analysis we consider the fully synchronized case only. The different processes are synchronized explicitly each time the program switches between two of the three phases. The elapsed time in a synchronized p processor run is denoted with Tp. It follows that (5)
with T (s) the computational time spent in the sequential task 0, Tp(d) the maximal time spent in the distributed tasks and T(~c) the communication time. Of course, T[ c) - O. Let us define T , ( p ) - r(s) + pTJ d)
(6)
.
F o r p = 1, it holds that TI(p) = T1. For p > 1, rl(p) is the elapsed time of a single processor shadow run, carrying out all tasks in serial, while forcing the distributed tasks to consume the same amount of time. It is clear that TI(p) - T1 presents a measure of idleness in the parallel run, due to load balancing effects. It easily follows that
T,(p)
-
T1
pT~") ~)
-
-
(7)
.
The (relative) efficiency Ep, the (relative) costs Cp and the (relative) overhead @ are defined by
T1
EP = pTp
,
1 pTp Cp - Ep = T1
,
Op - Cp - 1
(8)
.
Note that in (8) we compare the parallel execution of the algorithm (solving a problem with Q subdomains) with the serial execution of the same algorithm (relative). We introduce the following factorization of the costs Cp:
Cp-C(')C(S)
,
Cp(')- 1 + O q)
,
C(p)- 1 + O (p)
,
(9)
442 where o(l) = T1 (p) - T1
TI
O(p) _ pTp - T1 (p)
'
(10)
TI(p)
It follows that Op - O q) + O (p) + O q)O (p)
(11)
.
The term 0(/) is associated with load balancing effects. Of) is referred to as the parallel overhead. O(pp) combines sequential overhead and communication costs. Let us introduce 0p - 0 (/) + 0 (p) + 0 ( 0 0 (v)
,
(12)
with
O(pl) _ pT(a;-)T[a)
O(v) _ p- 1 T (~) ,
-
P
T(a)
T(c) }
T(a)
(13)
.
Using (5), (6) and (10)it can be seen that the relative differences between Oq), O(p) and, respectively, 0q), 0(v) can be expressed in terms of the fraction of work done in the sequential task; this fraction is small (< 1% in our experiments). Therefore, we expect Op, 0(') and ()(P)to present good approximations of, respectively, Op, 0 q) and 0 (v). 4. T I M I N G
AND
PERFORMANCE
MODEL
FOR A SQUARE
DOMAIN
A fully balanced blockwise decomposition of the square domain into Q subdomains is employed; the total number of unknowns is N, N/Q per subdomain. The total number of edges on internal boundaries is B, on the average B/Q per subdomain. B is given by B - v/Sv/~(V/~ - 1 )
(14)
.
Both the total number of interface variables as well as the total number of halo variables depend on linearly on B. We assume that either p - 1, i.e. a serial execution of the Q subdomains, or p - Q, i.e. a parallel execution with one subdomain per processor. The proposed timing model reads" N B N PT(~)Q - C~1 ~ ~ -+M(a2 Q +a3 Ip)
,
(15)
with M
1 m~ 'p__
Q ~1 E I(m,q)
=1 q=l M 1 m~ m a x / ( m =1 q=l .... ,Q
, p-1 q)
,
p-
(16) Q
443
T (~) - / 3 M 2 B
,
(17)
B T(p~) - M ( T l f ( p ) + 7 2 ( p - 1 ) ~ )
,
f(p) - m i n ( [ x / ~ , F21ogp])
.
(18)
Here, m = 1,..., M counts the number of matrix-vector multiplications in the domain decomposition iteration (outer iteration) and I ( m , q) denotes the number of matvec calls associated with the inner iteration in subdomain q for the m-th outer matvec call. The first term on the rhs of (15) corresponds with building processes (subdomain matrix, etc.). The third term on the rhs of (15) reflects the inner iteration and the second term comes from correcting the subdomain rhs. Relation (17) is mainly due to the GramSchmidt orthogonalization as a part of the GMRES outer iteration. Finally, (18) models the communication pattern depicted in Figure 1 for a SP2. The first term on the rhs models the broadcast operations and the second term the gather/scatter. Here, [ stands for taking the smallest integer above. The term [2log Pl is associated with a 'treelike' implementation of communication patterns and the term [x/~ with a 'mesh-like' implementation [12]. It is now straightforward to compute analytical approximations 0p, 0p(1), 0p(;) of the overhead functions using (12), (13) and (15), (17), (18). The explanatory variables in the resulting performance model are N, Q, P, M and Ip. These variables are not completely independent; in theory it is possible to express the latter two variables as a function of the first variables. Such expressions are either difficult to obtain or, if available, too crude for quantitative purposes. Therefore, we will rely upon measured values of M and Ip. Note that these values are already available after a serial run. 5.
EXPERIMENTS
Experiments have been carried out on a SP2 (160 Mhz) with the communication switch in user mode. The parallel runs are characterized by N = NoQ, Q = 4, 9, 16, p = Q with No = 3200 (linearly scaled) and N = 12800, 28800, 51200, Q = 4, p = Q (number of processors fixed); in total, five parallel runs. Of course, every parallel run is accompanied by a serial run (see (8)). Physical details on the experiments can be found in [1]. First, it is necessary to obtain approximations of the unknown coefficients in (15), (17), (18). The number of available measurements are ten for (15), five for (17) and five for (18) and a least-squares estimation leads to 0~ 1
3 -
1 6 , 1 0 .4
-.
.68
9 10 -7
71-.62"10-3
,
ct2 - . 81 9 10 -5
,
,
c~3-.86.10
-6
,
(19)
(20)
,
72-.44"10
-6
9
(21)
Let O denote one of the original overhead functions (see (8), (10)) and d its analytical approximation obtained in section 4. In order to evaluate the performance model we introduce Dev-
1
_ x/" r/ j = l
Oj - Oj
Oj
9 lOO% ,
(22)
444
Table 1 Dev, deviations between overhead functions and their analytical approximations.
Dev:
8.7
1
11.5
2.0
,
,
--'Op
,
0.5 --:0
. *
....
90 (1)p
_..
0 (p)
P
0.4
0.8
(I)
. - - - - -
.... : 0 P
t~
~0.3
oc-0 . 6 O > O
4(
/
-
(P) P
~0.2
..
/
-.'0
0 > 0
.•
/
~0.4
0.2
,, * "
• ..""
...........
o
0.1 ----._._.
, -~o ..........................
5
Figure 2. scaled.
10 P
15
Overhead functions, linearly
N
-o
x 10 4
Figure 3. Overhead functions, fixed p.
with n denoting the number of measurements (n=5). Table 1 presents the values of Dev for the different overhead functions and it can be seen that the analytical predictions are quite accurate. ~'inally, Figure 2 and Figure 3 present the (total) overhead Op and its components Oq), O(pp). The analytical approximations have not been plotted because, within plotting accuracy, they coincide more or less with the true values. Figure 2 presents the overhead functions as a function of p for the linearly scaled experiments. Figure 3 plots the overhead functions as a function of N for the experiments with a fixed p (p = 4). Rather unexpectedly, it can be seen that the load imbalance leads to a significant amount of overhead. A close inspection of the formulas presented in section 4 shows that 0(/) depends linearly upon ( I p - I1), which means that the load imbalance is due to variations in the number of BICGSTAB iterations over the subdomains. 6. I S O E F F I C I E N C Y In the isoefficiency metric, a parallel algorithm is scalable if it is possible to find curves in the (p, N) plane on which the efficiency Ep and the overhead Op are constant [13]. A sufficient condition for the existence of isoefficiency curves is lira @ < o o N--+ co
p fixed
.
(23)
445
Table 2 Iterative properties for 4 subdomains (Q = 4). N M Ii,p=l Ip,p=4 3200 12800 51200
11.5 11.5 11.7
5.0 5.0 5.0
5.8 5.8 5.7
This condition is equivalent with (see (11)) lim O q ) < o o
N--+ c~ p fixed
,
lira O (p) < ~
N-~c~ p fixed
.
(24)
In order to be able to discuss these limits, some knowledge on M and Ip is needed. Table 2 presents some measured values for the experiments leading to Figure 3. It seems to be reasonable to assume constant (or slowly varying) iterative properties in the limiting process. This enables some preliminary conclusions on the basis of the analytical performance model for a square domain. From (13), (14), (15), (17) and (18)it follows that lim 0~ p ) - 0
N--~ oo p fixed
.
(25)
The parallel overhead approaches zero with a speed determined by the ratio B / N O(~/r From (13), (14), (15) and (16)it follows that O(t) ~
M ( Ip - 1 1 ) (OZl/O~3)-71- i f
X -+ oo I
,
p fixed
.
=
(26)
'
This means that the load balancing effect approaches a constant. Table 2 and (19) show that this constant is close to 0.1 for p = O = 4. From the analysis it follows that (23) is satisfied. In fact, the total overhead approaches a constant value in the limit. Isoefficiency curves with an efficiency lower than 0.9 exist and efficiencies close to 1.0 cannot be reached.
7. C O N C L U D I N G
REMARKS
We have presented a parallel performance model of an implicit advection-diffusion solver based upon domain decomposition. The model was evaluated quantitatively and a good agreement between the model and the results of actual runs was established. As such the model presents a good starting point for a more qualitative study. As a first step we have used the model to investigate the scalability in the isoefficiency metric. Although the solver is scalable, it is clear from the experimentally determined efficiencies that the solver is only attractive for machines with sufficient core-memory up to approximately 25-50 processors. To keep the overhead reasonable, it is necessary to have
446 subdomains of a moderate/large size. The size No = 3200 taken in the linearly scaled experiments turns out to be somewhat small. Beyond 50 processors the load balancing effects will become too severe. Because the latter are due to variable iterative properties over the subdomains, it is not straightforward to find improvements based upon automatic procedures. REFERENCES
1. C. Vittoli, P. Wilders, M. Manzini, and G. Potia. Distributed parallel computation of 2D miscible transport with multi-domain implicit time integration. Or. Simulation Practice and Theory, 6:71-88, 1998. 2. P. Wilders. Parallel performance of domain decomposition based transport. In D.R. Emerson, A. Ecer, J. Periaux, T. Satofuka, and P. Fox, editors, Parallel Computational Fluid Dynamics '97, pages 447-454, Amsterdam, 1998. Elsevier. 3. W.D. Gropp and D.E. Keyes. Domain decompositions on parallel computers. Impact of Computing in Science and Eng., 1:421-439, 1989. 4. E. Brakkee, A. Segal, and C.G.M. Kassels. A parallel domain decomposition algorithm for the incompressible Navier-Stokes equations. Simulation Practice and Theory, 3:185-205, 1995. 5. K.H. Hoffmann and J. Zou. Parallel efficiency of domain decomposition methods. Parallel Computing, 19:137/5-1391, 1993. 6. T.F. Chan and J.P. Shao. Parallel complexity of domain decomposition methods and optimal coarse grid size. Parallel Computing, 21:1033-1049, 1995. 7. G. Fotia and A. Quarteroni. Modelling and simulation of fluid flow in complex porous media. In K. Kirchgassner, O. Mahrenholtz, and R. Mennicken, editors, Proc. IUIAM'95. Akademic Verlag, 1996. 8. P. Wilders. Transport in coastal regions with implicit time stepping. In K.P. Holz, W. Bechteler, S.S.Y. Wang, and M. Kawahara, editors, Advances in hydro-sciences and engineering, Volume III, MS, USA, 1998. Univ. Mississippi. Cyber proceedings on CD-ROM. 9. P. Wilders and G. Fotia. Implicit time stepping with unstructured finite volumes for 2D transport. J. Gomp. AppI. Math., 82:433-446, 1997. 10. J.P. Singh, J.L. Hennessy, and A. Gupta. Scaling parallel programs for multiprocessors: methodology and examples. IF,RE Computer, July:42-50, 1993. 11. P. Wilders and E. Brakkee. Schwarz and Schur: an algebraical note on equivalence properties. SIAM J. Sci. Uomput., 20, 1999. to be published. 12. Y. Saad and M.N. Schultz. Data communication in parallel architectures. Parallel Computing, 11:131-150, 1989. 13. A. Gupta and V. Kumar. Performance properties of large scale parallel systems. Or. Par. Distr. Comp., 19:234-244, 1993.
Parallel Computational Fluid Dynamics Towards Teraflops, Optimization and Novel Formulations D. Keyes,A. Ecer, J. Periaux,N. Satofukaand P. Fox (Editors) e 2000 Elsevier Science B.V. All rights reserved.
447
A Parallel 3D Fully Implicit Unsteady Multiblock CFD Code Implemented on a Beowulf Cluster M. A. Woodgate, K. J. Badcock, B.E. Richards ~ ~Department of Aerospace Engineering, University of Glasgow, Glasgow, G12 8QQ, United Kingdom
We report on the development of an efficient parallel multi-block three dimensional Navier-Stokes solver. For robustness the convective terms are discretised using an upwind TVD scheme. The linear system arising from each implicit time step is solved using a preconditioned Krylov subspace method. Results are shown for the ONERA M6 wing, NLR-F5 wing with launcher and missile as well as a detailed look into parallel efficiency.
1. I N T R O D U C T I O N Navier-Stokes solvers for complete aircraft configurations have been used for a number of years. Geometric complexities are tackled through the use of either unstructured or block structured meshes. For multi-disciplinary work such as shape optimisations, where repetitive calculations on the same or similar mesh are required, the efficiency of the flow solver is more important than the cost of generating the mesh. This makes the use of block structured grids viable. The current paper describes the development of an implicit parallel method for solving the three dimensional unsteady Navier-Stokes equations. The work builds on developments in two dimensions. The features of the method are the iterative solution [1] of an unfactored linear system for the flow variables, approximate Jacobian matrices [2], an effective preconditioning strategy for good parallel performance [3], and the use of a Beowulf cluster to illustrate possible performance on commodity machines [4]. The method has been used on a wide range of applications which include unsteady aerofoil flows and multielement aerofoils [5], an unsteady Delta wing common exercise [6] and the unsteady NLR F-5 CFD validation exercise [7]. The following work will focus on how a fast, low memory and communication bandwidth method has been developed for solving unsteady problems. This method will be used to demonstrate that a few commodity machines can be used to obtain unsteady solutions over geometrically complex configurations in a matter of hours.
448
2. G O V E R N I N G
EQUATIONS
The three-dimensional Cartesian Navier-Stokes equations can be written in conservative form as 0W Ot
0(F i + F ~) t
Ox
O(G i + G ~) +
Oy
+
O(H i + H ~) Oz
=0,
(1)
where W = (p, pu, pv, pw, p E ) T denotes the vector of conservative variables. The inviscid flux vectors F i, G i, H i and viscous flux vectors F v, G v, H v are the direct 3D extension to the fluxes given in [5]. The laminar viscosity is evaluated using Sutherland's law whilst the turbulent eddy viscosity is given by the Baldwin-Lomax turbulent model [8]. Finally, the various flow quantities are related to each other by the perfect gas relations. 3. N U M E R I C A L
METHOD
The unsteady Navier-Stokes equations are discretised on a curvilinear multi-block body conforming mesh using a cell-centred finite volume method that converts the partial differential equation into a set of ordinary differential equations, which can be written as d (lfn+lT~y~Tn+lh D /~lrn+l d---t~*i,j,k "i,j,k J + ,~i,j,kt ,, i,j,k) - 0.
(2)
The convective terms are discretised using either Oshers's [9] or Roe's [10] upwind methods. MUSCL variable extrapolation is used to provide second-order accuracy with the Van Albada limiter to prevent spurious oscillations around shock waves. The discretisation of the viscous terms requires the values of flow variables and derivatives at the edge of cells. Cell-face values are approximated by the straight average of the two adjacent cell values, while the derivatives are obtained by applying Green's formula to an auxiliary cell. The choice of auxiliary cell is guided by the need to avoid odd-even decoupling and to minimise the amount of numerical dissipation introduced into the scheme. Following Jameson [11], the time derivative is approximated by a second-order backward difference and equation (2) becomes . a
( ~ i T n + 1 ~ __ i,j,k \ 9 9 i,j,k ]
ovi,j,kql/'n+llTtTn+l'' i,j,k _
_~_ l / - n - 1 I : ~ T n - 1 vi,j,k ' ' i,j,k -Jr- a i
vin n 4--'j'kWi'j'k 2/~ t
j k ~{ w 9n +i l, j , k ) - - O. ' '
(3)
xrn+l To get an expression in terms of known quantities the term D,~* i,j,kt[ ~vv i,j,k) is linearized w.r.t, the pseudo-time variable t*. Hence one implicit time step takes the form
[( v 3v) XF + G-/
0a
( W r e + l - W m)
-
-R*(W
TM)
(4)
where the superscript m denotes a time level m a t * in pseudo-time. Equation (4) is solved in the following primitive form V + 3V
0 W _~ OR
_ pro) _ _ R , ( W m)
(5)
449
4. I M P L E M E N T A T I O N To obtain a parallel method that converges quickly to the required answer two components are required. First is a serial algorithm which is efficient in terms of CPU time and second is the efficient parallelisation of this method. With a limited number of computers no amount of parallel efficiency is going to overcome a poor serial algorithm. A description of the Beowulf cluster used in performing the computations can be found in [4]. 4.1. Serial P e r f o r m a n c e In the present work, the left hand side of equation (5) is approximated with a first order Jacobian. The Osher Jacobian approximation and viscous approximation is as in [2], however, instead of using the Roe approximation of [12] all terms of O(Wi+l,j,k-- Wi,j,k) are neglected from the term 0oR ow. this forms a much more simplified matrix reducing W 0P the computational cost of forming the Roe Jacobian by some 60%. This type of first order approximation reduces the number of terms in the matrix from 825 to 175 which is essential as 3D problems can easily have over a million cells. Memory usage is reduced further by storing most variables as floats instead of doubles; this reduces the memory bandwidth requirements, and results in the present method requiring 1 MByte of memory per 550 cells. The right hand side of equation (5) is not changed in order to maintain second order spatial accuracy. A preconditioned Krylov subspace algorithm is used to solve the linear system of equations using a Block Incomplete Lower-Upper factorisation of zeroth order. Table 1 shows the performance of some of the key operations in the linear solver. The matrix vector and scalar inner product are approximately 25% of peak while the saxpy only obtains about half this. This is because saxpy needs two loads per addition and multiply while the others only need one. The drop off in performance is even more pronounced in the Pentium Pro 200 case. This may arise from the 66MHz bus speed, which is 33% slower than a PII 450.
Table 1 Megaflop performance of key linear solver subroutines Processor Matrix x vector Pre x vect PII 450 115 96 PPro 200 51 44
IIx.xll 108 44
saxpy 58 18
4.2. Parallel P e r f o r m a n c e The use of the approximate Jacobian also reduces the parallel communication since only one row of halo cells is needed by the neighbouring process in the linear solver instead of two. This coupled with only storing the matrix as floats reduces the communication size by 75%.To minimize the parallel communication between processes further, the BILU(0) factorisation is decoupled between blocks; this improves parallel performance at the expense of not forming the best BILU(0) possible. This approach does not seem to have a major impact on the effectiveness in 2D [3], however more testing is required in 3D.
450 Since the optimal answer depends on communication costs, different approaches may be required for different machine architectures.
5. R E S U L T S 5.1. O N E R A M 6 W i n g The first test case considered is flow around the ONERA M6 wing with freestream Mach number of 0.84 at angle of attack of 3.06 ~ and a Reynolds number of 11.7 million [13]. An Euler C-O grid was generated containing 257 x 65 x 97 points and a viscous C-O grid of 129 x 65 x 65 points. Table 2 shows the parallel and computational efficiency of the Euler method. It can be seen that the coarse grid is not sufficient to resolve the flow features but the difference between the medium and fine grids are small and are limited to the shockwaves. Both the medium and fine grids use multi level startups to accelerate the convergence giving nearly O(N) in time dependence on the number of grid points when comparing the same number of processors. The second to last column contains the wall clock time and shows that the 1.6 million cell problem converged 6 orders of magnitude in just under 30 minutes on 16 Pentium Pro machines. Taking the medium size grid time of under 4 minutes, which includes all the file IO, the use of commodity computers is shown to be very viable for unsteady CFD calculations. The last column contains the parallel efficiencies for the code, these remain high until the runtimes are only a few minutes. This is due to the totally sequential nature of the file IO which includes the grid being read in once per processor. The times also show that the new Roe approximate Jacobians produce slightly faster runtimes at the expense of some parallel efficiency. This is because although it is faster to form the Roe Jacobian the linear system takes about 10% longer to solve so increasing the communication costs.
Table 2 Parallel performance and a grid density study for Grid CL Co Procs Total 0.0114 16 257 x 65 x 97 0.2885 0.0125 2 129 x 33 x 49 0.2858 4 8 16 0.0186 1 65 x 17 x 25 0.2726 2 (Osher) 4 8 0.0189 1 65 x 17 x 25 0.2751 2 (Roe) 4 8
the euler ONERA M6 wing CPU time Wall time Efficiency 432 29.1 N/A 46.4 24.0 100 47.0 12.5 96 48.4 6.6 90 50.9 3.7 81 5.15 5.25 100 5.27 2.80 94 5.43 1.52 87 5.79 0.92 72 4.87 4.97 100 5.00 2.63 94 5.19 1.45 86 5.54 0.88 70
451
Figure 1. Fine grid pressure contours for the ONERA M6 Wing
The wing upper surface seen in Figure 1 clearly shows the lambda shock wave captured. The predicted pressure coefficient distributions around the wing agree well with experimential data for both the Euler and Navier-Stokes calculations (see Figure 2 for the last 3 stations). In particular, the lower surface is accurately predicted as well as the suction levels even right up to the tip. However, both calculations fail to capture the small separation region at the 99% chord. 5.2. N L R F-5 W i n g 4- T i p Missile The final test cases are on the NLR F-5 wing with launcher and tip missile [7]. The computational grid is split into 290 blocks and contains only 169448 cells. The steady case is run 320, (angle of attack 0 ~ and a Mach number of 0.897) while the unsteady case is run 352 (mean angle of attack -0.002 ~, the amplitude of the pitching oscillation 0.115 ~, the reduced frequency 0.069 and Mach number 0.897) . This is a challenging problem for a multiblock method due to the geometrical complexity of the problem. This gives rise to a large number of blocks required as well as a large variation in block size (the largest contains 5520 cells while the smallest has only 32 cells). The run time of the code was 40 minutes on 8 processors to reduce the residual by 5 orders. Figure 3 shows the upper surface where the shock wave is evident on the wing from root to tip due to the absence of viscous effects. The unsteady calculation was again performed on 8 processors taking 40 minutes per cycle for 10 real timesteps per cycle and 60 minutes per cycle for 20 real timesteps per
452
Figure 1. Fine grid pressure contours for the ONERA M6 Wing
The wing upper surface seen in Figure 1 clearly shows the lambda shock wave captured. The predicted pressure coefficient distributions around the wing agree well with experimential data for both the Euler and Navier-Stokes calculations (see Figure 2 for the last 3 stations). In particular, the lower surface is accurately predicted as well as the suction levels even right up to the tip. However, both calculations fail to capture the small separation region at the 99% chord.
5.2. N L R F-5 Wing + Tip Missile The final test cases are on the NLR F-5 wing with launcher and tip missile [7]. The computational grid is split into 290 blocks and contains only 169448 cells. The steady case is run 320, (angle of attack 0~ and a Mach number of 0.897) while the unsteady case is run 352 (mean angle of attack -0.002 ~ the amplitude of the pitching oscillation 0.115 ~ the reduced frequency 0.069 and Mach number 0.897) . This is a challenging problem for a multiblock method due to the geometrical complexity of the problem. This gives rise to a large number of blocks required as well as a large variation in block size (the largest contains 5520 cells while the smallest has only 32 cells). The run time of the code was 40 minutes on 8 processors to reduce the residual by 5 orders. Figure 3 shows the upper surface where the shock wave is evident on the wing from root to tip due to the absence of viscous effects. The unsteady calculation was again performed on 8 processors taking 40 minutes per cycle for 10 real timesteps per cycle and 60 minutes per cycle for 20 real timesteps per
453
O N E R A M6 Wing - Run 2308 - Eta = 0.90 1.5 I 1
! . ~ ....................................... i
0.5
i ..... i.........
!
i
i ...........
i. . . . . . . . . . .
-0.5
O N E R A M6 Wing - Run 2308 - Eta = 0.95 1.5 I
1
:) ,,. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
O"I
........
~1- - T u r b u l e n t calculation -0.5 {~1- - Inviscid calculation
......
o o Experimental data
-10
1.5
0.2
0.4
WC
0.6
0.8
1
0
0.2
0.4
X/C
0.6
0.8
1
O N E R A M6 Wing - Run 2308 - Eta = 0.99
,O
1 0.5 .
.
.
.
.
.
.
~ - Turbulent calculation -0.5 ][1- - Inviscid calculation II o o E x p e r i m e n t a l d a t a 0
0.2
0.4
X/c
0.6
0.8
1
Figure 2. Pressure distribution for Navier-Stokes &c Euler equations ONERA M6 Wing, Moo - 0 . 8 4 , c ~ - 3.06 ~ Re=ll.7106
cycle. As can be seen from Figure 4, the effect of doubling the real number of timesteps per cycle is minimal.
6. C O N C L U S I O N S An unfactored implicit time-marching method for solving the three dimensional unsteady Navier-Stokes equations in parallel has been presented. The ONERA M6 wing has been used to evaluate the characteristics of the method as well as the parallel performance. The simplistic parallel implementation leads to only good parallel efficiency on small problem sizes, but this is offset by a highly efficient serial algorithm. Future work includes an improved parallel implementation of the method for small data sets and a detailed study of the performance of the preconditioner with the advent of fast ethernet communication speeds.
454
Figure 3. Pressure contours for F5 Wing with launcher and missile Upper Surface, M~ = 0.897, a - 0~ Run 320
REFERENCES
1. Badcock, K.J., Xu, X., Dubuc, L. and Richards, B.E., "Preconditioners for high speed flows in aerospace engineering", Numerical Methods for Fluid Dynamics, V. Institute for Computational Fluid Dynamics, Oxford, pp 287-294, 1996 2. Cantariti, F., Dubuc, L., Gribben, B., Woodgate, M., Badcock, K. and Richards, B., "Approximate Jacobians for the Solution of the Euler and Navier-Stokes Equations", Department of Aerospace Engineering, Technical Report 97-05, 1997. 3. Badcock, K.J., McMillan, W.S., Woodgate, M.A., Gribben, B., Porter, S., and Richards, B.E., "Integration of an impilicit multiblock code into a workstation cluster environment", in Parallel CFD 96, pp 408. Capri, Italy, 1996 4. McMillan, W.,Woodgate, M., Richards, B., Gribben, B., Badcock, K., Masson, C. and Cantariti, F., "Demonstration of Cluster Computing for Three-dimensional CFD Simulations", Univeristy of Glasgow, Aero Report 9911, 1999 5. Dubuc, L., Cantariti, F., Woodgate, M., Gribben, B., Badcock, K. and Richards, B.E., "Solution of the Euler Equations Using an Implicit Dual-Time Method", AIAA Journal, Vol. 36, No. 8, pp 1417-1424, 1998. 6. Ceresola, N., "WEAG-TA15 Common Exercise IV- Time accurate Euler calculations of vortical flow on a delta wing in pitch motion", Alenia Report 65/RT/TR302/98182, 1998 7. Henshaw, M., Bennet, R., Guillemot, S., Geurts, E., Pagano, A., Ruiz-Calavera, L. and Woodgate, M., "CFD Calculations for the NLR F-5 Data- Validation of CFD Technologies for Aeroelastic Applications Using One AVT WG-003 Data Set", presented at CEAS/AIAA/ICASE/NASA Langley International forum on aeroelasticty and structural dynamics, Williamsburg, Virginia USA, June 22-25, 1999
455 0.001084
!
1() Timesteps/cycle 20 Timesteps/cycle
0.001082 0.00108 E 0.001078
(D .m
0 .m
o 0.001076
0 C~
a 0.001074 0.001072 0.00107 0.001068 -0.15
-
0'.1
I
I
-0.05 0 0.05 Incidence in degrees
0.1
0.15
Figure 4. Effect of Steps/Cycle on unscaled Drag Mo~ - 0.897, am - -0.002 ~ a0 =0.115 k - 0.069. Run 352
10. 11. 12.
13.
Baldwin, B. and Lomax H., "Thin-Layer Approximation and Algebraic Model for Separated Turbulent Flows", AIAA Paper 78-257,1978. Osher, S. and Chakravarthy, S., "Upwind Schemes and Boundary Conditions with Applications to Euler Equations in General Geometries", Journal of Computational Physics, Vol. 50, pp 447-481, 1983. Roe, P.L., "Approximate Riemann Solvers, Parameter Vectors and Difference Schemes", Journal of Computational Physics, vol. 43, 1981. Jameson, A. "Time dependent calculations using multigrid, with applications to unsteady flows past airfoils and wings", AIAA Paper 91-1596, 1991 Feszty, D., Badcock, B.J. and Richards, B.E., "Numerically Simulated Unsteady Flows over Spiked Blunt Bodies at Supersonic and Hypersonic Speeds" Proceedings of the 22nd International Symposium on Shock Waves, 18-23 July 1999, Imperial College, London, UK, Schmitt, V. and Charpin, F., "Pressure Distributions on the ONERA-M6-Wing at Transonic Mach Numbers", in "Experimental Data Base for Computer Program Assessment", AGARD-AR-138, 1979.
This Page Intentionally Left Blank