Parallel Processing for Scientific Computing
SOFTWARE • ENVIRONMENTS • TOOLS The SIAM series on Software, Environments, and Tools focuses on the practical implementation of computational methods and the high performance aspects of scientific computation by emphasizing in-demand software, computing environments, and tools for computing. Software technology development issues such as current status, applications and algorithms, mathematical software, software tools, languages and compilers, computing environments, and visualization are presented.
Editor-in-Chief
Jack J. Dongarra University of Tennessee and Oak Ridge National Laboratory
Editorial Board
James W. Demmel, University of California, Berkeley
Dennis Gannon, Indiana University
Eric Grosse, AT&T Bell Laboratories
Ken Kennedy, Rice University
Jorge J. Moré, Argonne National Laboratory
Software, Environments, and Tools
Michael A. Heroux, Padma Raghavan, and Horst D. Simon, editors, Parallel Processing for Scientific Computing
Gerard Meurant, The Lanczos and Conjugate Gradient Algorithms: From Theory to Finite Precision Computations
Bo Einarsson, editor, Accuracy and Reliability in Scientific Computing
Michael W. Berry and Murray Browne, Understanding Search Engines: Mathematical Modeling and Text Retrieval, Second Edition
Craig C. Douglas, Gundolf Haase, and Ulrich Langer, A Tutorial on Elliptic PDE Solvers and Their Parallelization
Louis Komzsik, The Lanczos Method: Evolution and Application
Bard Ermentrout, Simulating, Analyzing, and Animating Dynamical Systems: A Guide to XPPAUT for Researchers and Students
V. A. Barker, L. S. Blackford, J. Dongarra, J. Du Croz, S. Hammarling, M. Marinova, J. Wasniewski, and P. Yalamov, LAPACK95 Users' Guide
Stefan Goedecker and Adolfy Hoisie, Performance Optimization of Numerically Intensive Codes
Zhaojun Bai, James Demmel, Jack Dongarra, Axel Ruhe, and Henk van der Vorst, Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide
Lloyd N. Trefethen, Spectral Methods in MATLAB
E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, LAPACK Users' Guide, Third Edition
Michael W. Berry and Murray Browne, Understanding Search Engines: Mathematical Modeling and Text Retrieval
Jack J. Dongarra, Iain S. Duff, Danny C. Sorensen, and Henk A. van der Vorst, Numerical Linear Algebra for High-Performance Computers
R. B. Lehoucq, D. C. Sorensen, and C. Yang, ARPACK Users' Guide: Solution of Large Scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods
Randolph E. Bank, PLTMG: A Software Package for Solving Elliptic Partial Differential Equations, Users' Guide 8.0
L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley, ScaLAPACK Users' Guide
Greg Astfalk, editor, Applications on Advanced Architecture Computers
Francoise Chaitin-Chatelin and Valerie Fraysse, Lectures on Finite Precision Computations
Roger W. Hockney, The Science of Computer Benchmarking
Richard Barrett, Michael Berry, Tony F. Chan, James Demmel, June Donato, Jack Dongarra, Victor Eijkhout, Roldan Pozo, Charles Romine, and Henk van der Vorst, Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods
E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen, LAPACK Users' Guide, Second Edition
Jack J. Dongarra, Iain S. Duff, Danny C. Sorensen, and Henk van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers
J. J. Dongarra, J. R. Bunch, C. B. Moler, and G. W. Stewart, LINPACK Users' Guide
Parallel Processing for Scientific Computing

Edited by

Michael A. Heroux
Sandia National Laboratories
Albuquerque, New Mexico

Padma Raghavan
Pennsylvania State University
University Park, Pennsylvania

Horst D. Simon
Lawrence Berkeley National Laboratory
Berkeley, California
Society for Industrial and Applied Mathematics Philadelphia
Copyright © 2006 by the Society for Industrial and Applied Mathematics.

10 9 8 7 6 5 4 3 2 1

All rights reserved. Printed in the United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA 19104-2688.

Trademarked names may be used in this book without the inclusion of a trademark symbol. These names are used in an editorial context only; no infringement of trademark is intended.

BlueGene/L was designed by IBM Research for the Department of Energy/NNSA's Advanced Simulation and Computing Program. Linux is a registered trademark of Linus Torvalds. MATLAB is a registered trademark of The MathWorks, Inc. For MATLAB product information, please contact The MathWorks, Inc., 3 Apple Hill Drive, Natick, MA 01760-2098 USA, 508-647-7000, Fax: 508-647-7101,
[email protected], www.mathworks.com

POWER3 and POWER4 are trademarks of IBM Corporation. X1E is a trademark of Cray, Inc.

Figure 8.2 is reprinted from Advances in Engineering Software, 29, W. J. Barry, M. T. Jones, and P. E. Plassmann, "Parallel adaptive mesh refinement techniques," 217-229, 1998, with permission from Elsevier. Figure 8.4 is reprinted from A. Wissink et al., "Enhancing Scalability of Parallel Structured AMR Calculations." 17th ACM Int'l. Conf. on Supercomputing (ICS'03), pp. 336-347. © 2003 ACM, Inc. Reprinted by permission. Figure 8.6 is reprinted from "Mesh generation," M. W. Bern and P. E. Plassmann, in Handbook of Computational Geometry, J. Sack and J. Urrutia, eds., 291-332, 2000, with permission from Elsevier. Figure 17.1 is reprinted from G. Lancia et al., "101 optimal PDB structure alignments: A branch-and-cut algorithm for the maximum contact map overlap problem." Proceedings of 5th Int'l. Conference on Computational Biology (RECOMB'01), pp. 193-202. © 2001 ACM, Inc. Reprinted by permission.
Library of Congress Cataloging-in-Publication Data

Parallel processing for scientific computing / edited by Michael A. Heroux, Padma Raghavan, Horst D. Simon.
p. cm. — (Software, environments, tools)
Includes bibliographical references and index.
ISBN-13: 978-0-89871-619-1 (pbk.)
ISBN-10: 0-89871-619-5 (pbk.)
1. Parallel processing (Electronic computers) I. Heroux, Michael A. II. Raghavan, Padma. III. Simon, Horst D.
QA76.58.P37645 2006
004'.35-dc22
2006045099
Royalties from the sale of this book are placed in a fund to help students attend SIAM meetings and other SIAM-related activities. This fund is administered by SIAM, and qualified individuals are encouraged to write directly to SIAM for guidelines.
SIAM is a registered trademark.
List of Contributors

Volkan Akcelik Ultrascale Simulation Laboratory Department of Civil and Environmental Engineering Carnegie Mellon University Pittsburgh, PA 15213 USA
[email protected] George Almasi IBM Thomas J. Watson Research Center Yorktown Heights, NY 10598 USA
[email protected] Srinivas Aluru Department of Electrical and Computer Engineering Iowa State University Ames, IA 50011 USA
[email protected] Nancy Amato Parasol Laboratory Department of Computer Science Texas A&M University College Station, TX 77843 USA
[email protected] Charles Archer IBM Systems and Technology Group Rochester, MN 55901 USA
[email protected] Rob Armstrong Sandia National Laboratories Livermore, CA 94551 USA
[email protected] Scott B. Baden CSE Department University of California San Diego La Jolla, CA 92093 USA [email protected] David A. Bader College of Computing Center for the Study of Systems Biology Center for Experimental Research in Computer Systems Georgia Institute of Technology Atlanta, GA 30332 USA
[email protected] David Bailey Computational Research Division Lawrence Berkeley National Laboratory Berkeley, CA 94720 USA
[email protected] Suchindra Bhandarkar Computer Science University of Georgia Athens, GA 30602 USA
[email protected] Gyan Bhanot IBM Thomas J. Watson Research Center Yorktown Heights, NY 10598 USA
[email protected] George Biros Department of Mechanical Engineering and Applied Mechanics and Department of Computer and Information Science University of Pennsylvania Philadelphia, PA 19104 USA [email protected] Rupak Biswas NASA Advanced Supercomputing (NAS) Division NASA Ames Research Center Moffett Field, CA 94035 USA
[email protected] Erik G. Boman Department of Discrete Algorithms and Mathematics Sandia National Laboratories Albuquerque, NM 87185 USA [email protected] Randall Bramley Computer Science Indiana University Bloomington, IN 47405 USA [email protected] Sharon Brunett Center for Advanced Computing Research California Institute of Technology Pasadena, CA 91125 USA
[email protected] Bor Chan Lawrence Livermore National Laboratory Livermore, CA 94550 USA
[email protected] Leonardo Bachega LARC-University of Sao Paulo Sao Paulo, Brazil
[email protected]
Sid Chatterjee IBM Thomas J. Watson Research Center Yorktown Heights, NY 10598 USA
[email protected] Jack Dongarra Computer Science Department University of Tennessee Knoxville, TN 37996
[email protected] Alan Gara IBM Thomas J. Watson Research Center Yorktown Heights, NY 10598 USA
[email protected] Edmond Chow D. E. Shaw & Co. 120 W. 45th St. New York, NY 10036 USA [email protected]
Jonathan Eckstein RUTCOR, Room 155 640 Bartholomew Road Busch Campus Rutgers University Piscataway, NJ 08854 USA
[email protected] Luis G. Gervasio Department of Computer Science Rensselaer Polytechnic Institute Troy, NY 12180 USA
[email protected] Giri Chukkapalli San Diego Supercomputer Center La Jolla, CA 92093 USA
[email protected] Alessandro Curioni IBM Zurich Research Laboratory CH-8803 Ruschlikon Switzerland
[email protected] Bruce Curtis Lawrence Livermore National Laboratory Livermore, CA 94550 USA
[email protected] Kostadin Damevski Scientific Computing and Imaging Institute The University of Utah Salt Lake City, UT 84112 USA
[email protected] Travis Desell Department of Computer Science Rensselaer Polytechnic Institute Troy, NY 12180 USA
[email protected] Karen D. Devine Department of Discrete Algorithms and Mathematics Sandia National Laboratories Albuquerque, NM 87185 USA
[email protected] Lori Diachin Center for Applied Scientific Computing Lawrence Livermore National Laboratory P.O. Box 808, L-561 Livermore, CA 94551 USA [email protected]
Victor Eijkhout Computer Science Department University of Tennessee Knoxville, TN 37996 USA
[email protected] Kaoutar El Maghraoui Department of Computer Science Rensselaer Polytechnic Institute Troy, NY 12180 USA
[email protected] Jamal Faik Department of Computer Science Rensselaer Polytechnic Institute Troy, NY 12180 USA
[email protected] Robert D. Falgout Center for Computational Sciences and Engineering Lawrence Livermore National Laboratory Livermore, CA 94551 USA
[email protected] Joseph E. Flaherty Department of Computer Science Rensselaer Polytechnic Institute Troy, NY 12180 USA
[email protected] Ian Foster Mathematics and Computer Science Division Argonne National Laboratory 9700 South Cass Avenue Building 221 Argonne, IL 60439 USA
[email protected] Omar Ghattas Departments of Geological Sciences, Mechanical Engineering, Computer Sciences, and Biomedical Engineering Institute for Computational Engineering and Sciences and Institute for Geophysics University of Texas at Austin Austin, TX 78712 USA [email protected] Judit Gimenez European Center for Parallelism of Barcelona Technical University of Catalonia 08034 Barcelona Spain
[email protected] William D. Gropp Argonne National Laboratory Argonne, IL 60439 USA
[email protected] John Gunnels IBM Thomas J. Watson Research Center Yorktown Heights, NY 10598 USA [email protected] Manish Gupta IBM Thomas J. Watson Research Center Yorktown Heights, NY 10598 USA
[email protected] Robert Harkness San Diego Supercomputer Center La Jolla, CA 92093 USA
[email protected]
William Hart Sandia National Laboratories Mail Stop 1110 P.O. Box 5800 Albuquerque, NM 87185-1110 USA
[email protected] Patricia D. Hough Computational Sciences and Mathematics Research Department Sandia National Laboratories Livermore, CA 94551 USA
[email protected] Michael T. Heath Computation Science and Engineering Department University of Illinois Urbana, IL 61801 USA
[email protected] Victoria E. Howie Computational Sciences and Mathematics Research Department Sandia National Laboratories Livermore, CA 94551 USA
[email protected] Bruce Hendrickson Department of Discrete Algorithms and Mathematics Sandia National Laboratories Albuquerque, NM 87185 USA and Department of Computer Science University of New Mexico Albuquerque, NM 87131 USA
[email protected] Amy Henning IBM Thomas J. Watson Research Center Yorktown Heights, NY 10598 USA [email protected] Michael A. Heroux Sandia National Laboratories Albuquerque, NM 87185 USA
[email protected] Judith Hill Optimization & Uncertainty Estimation Department Sandia National Laboratories Albuquerque, NM 87185 USA
[email protected] Richard Hornung Center for Applied Scientific Computing Lawrence Livermore National Laboratory P.O. Box 808, L-561 Livermore, CA 94551 USA [email protected]
Jonathan J. Hu Sandia National Laboratories Livermore, CA 94551 USA
[email protected] Xiangmin Jiao Computational Science and Engineering University of Illinois Urbana, IL 61801 USA
[email protected] Chris R. Johnson Scientific Computing and Imaging Institute The University of Utah Salt Lake City, UT 84112 USA
[email protected] David Keyes Department of Applied Physics and Applied Mathematics Columbia University New York, NY USA
[email protected] Jesus Labarta CEPBA-UPC Jordi Girona 1-2, Modulo D-6 08034 Barcelona Spain [email protected] Sebastien Lacour IRISA/INRIA Rennes, France
[email protected] Julien Langou Department of Computer Science University of Tennessee Knoxville, TN 37996 USA
[email protected] Andrew Lumsdaine Computer Science Indiana University Bloomington, IN 47405 USA
[email protected] Dan C. Marinescu School of Computer Science University of Central Florida Orlando, FL 32816 USA
[email protected] Laxmikant Kale Department of Computer Science University of Illinois Urbana-Champaign Urbana, IL 61801 USA
[email protected] Lois McInnes Argonne National Laboratory Argonne, IL 60439 USA
[email protected] Nicholas Karonis Department of Computer Science Northern Illinois University DeKalb, IL 60115 USA
[email protected] Jose E. Moreira IBM Thomas J. Watson Research Center Yorktown Heights, NY 10598 USA
[email protected] George Karypis Department of Computer Science and Engineering University of Minnesota Minneapolis, MN 55455 USA [email protected]
Esmond G. Ng Lawrence Berkeley National Laboratory Berkeley, CA 94720 USA
[email protected]
Leonid Oliker Lawrence Berkeley National Laboratory Berkeley, CA 94720 USA
[email protected] James Sexton IBM Thomas J. Watson Research Center Yorktown Heights, NY 10598 USA
[email protected] Manish Parashar The Applied Software Systems Laboratory Department of Electrical and Computer Engineering Rutgers, The State University of New Jersey Piscataway, NJ 08855 USA
[email protected] Horst D. Simon Lawrence Berkeley National Laboratory Berkeley, CA 94720 USA
[email protected] Steven G. Parker Scientific Computing and Imaging Institute The University of Utah Salt Lake City, UT 84112 USA
[email protected] Wayne Pfeiffer San Diego Supercomputer Center La Jolla, CA 92093 USA
[email protected] Cynthia A. Phillips Sandia National Laboratories Albuquerque, NM 87185 USA [email protected] Ali Pinar High Performance Computing Research Department Lawrence Berkeley National Laboratory Berkeley, CA 94720 USA
[email protected] Paul Plassmann Electrical and Computer Engineering 302 Whittemore (0111) Virginia Tech Blacksburg, VA 24061 USA
[email protected] Padma Raghavan Computer Science and Engineering The Pennsylvania State University University Park, PA 16802 USA [email protected]
Allan Snavely CSE University of California, San Diego La Jolla, CA 92093 USA
[email protected] Matt Sottile Los Alamos National Laboratory Los Alamos, NM 87545 USA
[email protected] Rick Stevens Mathematics and Computer Science Division Argonne National Laboratory 9700 South Cass Avenue Building 221 Argonne, IL 60439 USA
[email protected] Valerie E. Taylor Department of Computer Science Texas A&M University College Station, TX 77842 USA
[email protected] James D. Teresco Department of Computer Science Williams College Williamstown, MA 01267 USA [email protected] Raymond S. Tuminaro Computation, Computers and Mathematics Center Sandia National Laboratories Livermore, CA 94551 USA
[email protected] Bart van Bloemen Waanders Optimization and Uncertainty Estimation Department Sandia National Laboratories Albuquerque, NM 87185 USA
[email protected] Rob Van der Wijngaart Computer Sciences Corporation NASA Ames Research Center Moffett Field, CA 94035 USA [email protected] Carlos A. Varela Department of Computer Science Rensselaer Polytechnic Institute Troy, NY 12180 USA
[email protected] Bob Walkup IBM Thomas J. Watson Research Center Yorktown Heights, NY 10598 USA [email protected] Andy Wissink NASA Ames Research Center MS 215-1 Moffett Field, CA 94035 USA
[email protected] Xingfu Wu Department of Computer Science Texas A&M University College Station, TX 77843 USA
[email protected] Ulrike Meier Yang Center for Applied Scientific Computing Lawrence Livermore National Laboratory Livermore, CA 94551 USA [email protected] Keming Zhang Scientific Computing & Imaging Institute The University of Utah Salt Lake City, UT 84112 USA [email protected]
Contents

List of Figures
List of Tables
Preface

1 Frontiers of Scientific Computing: An Overview
Michael A. Heroux, Padma Raghavan, and Horst D. Simon
1.1 Performance modeling, analysis, and optimization
1.2 Parallel algorithms and enabling technologies
1.3 Tools and frameworks for parallel applications
1.4 Applications of parallel computing
1.5 Conclusions and future directions
Bibliography

I Performance Modeling, Analysis, and Optimization

2 Performance Analysis: From Art to Science
Jesus Labarta and Judit Gimenez
2.1 Performance analysis tools
2.2 Paraver
2.3 Analysis methodology
2.4 Scalability
2.5 Models: Reference of what to expect
2.6 Conclusions
Acknowledgments
Bibliography

3 Approaches to Architecture-Aware Parallel Scientific Computation
James D. Teresco, Joseph E. Flaherty, Scott B. Baden, Jamal Faik, Sebastien Lacour, Manish Parashar, Valerie E. Taylor, and Carlos A. Varela
3.1 Prophesy: A performance analysis and modeling system for parallel and distributed applications (Valerie E. Taylor, with Xingfu Wu and Rick Stevens)
3.2 Canonical variant programming and computation and communication scheduling (Scott B. Baden)
3.3 A multilevel topology-aware approach to implementing MPI collective operations for computational grids (Sebastien Lacour, with Nicholas Karonis and Ian Foster)
3.4 Dynamic load balancing for heterogeneous environments (Jamal Faik, James D. Teresco, and Joseph E. Flaherty, with Luis G. Gervasio)
3.5 Hierarchical partitioning and dynamic load balancing (James D. Teresco)
3.6 Autonomic management of parallel adaptive applications (Manish Parashar)
3.7 Worldwide computing: Programming models and middleware (Carlos A. Varela, with Travis Desell and Kaoutar El Maghraoui)
Acknowledgments
Bibliography

4 Achieving High Performance on the BlueGene/L Supercomputer
George Almasi, Gyan Bhanot, Sid Chatterjee, Alan Gara, John Gunnels, Manish Gupta, Amy Henning, Jose E. Moreira et al.
4.1 BG/L architectural features
4.2 Methods for obtaining high performance
4.3 Performance measurements
4.4 Conclusions
Bibliography

5 Performance Evaluation and Modeling of Ultra-Scale Systems
Leonid Oliker, Rupak Biswas, Rob Van der Wijngaart, David Bailey, and Allan Snavely
5.1 Modern high-performance ultra-scale systems
5.2 Architecture evaluation using full applications
5.3 Algorithmic and architectural benchmarks
5.4 Performance modeling
5.5 Summary
Acknowledgments
Bibliography

II Parallel Algorithms and Enabling Technologies

6 Partitioning and Load Balancing for Emerging Parallel Applications and Architectures
Karen D. Devine, Erik G. Boman, and George Karypis
6.1 Traditional approaches
6.2 Beyond traditional applications
6.3 Beyond traditional approaches
6.4 Beyond traditional models
6.5 Beyond traditional architectures
6.6 Conclusion
Bibliography

7 Combinatorial Parallel and Scientific Computing
Ali Pinar and Bruce Hendrickson
7.1 Sparse matrix computations
7.2 Utilizing computational infrastructure
7.3 Parallelizing irregular computations
7.4 Computational biology
7.5 Information analysis
7.6 Solving combinatorial problems
7.7 Conclusions
Acknowledgments
Bibliography

8 Parallel Adaptive Mesh Refinement
Lori Freitag Diachin, Richard Hornung, Paul Plassmann, and Andy Wissink
8.1 SAMR
8.2 UAMR
8.3 A comparison of SAMR and UAMR
8.4 Recent advances and future research directions
8.5 Conclusions
Acknowledgments
Bibliography

9 Parallel Sparse Solvers, Preconditioners, and Their Applications
Esmond G. Ng
9.1 Sparse direct methods
9.2 Iterative methods and preconditioning techniques
9.3 Hybrids of direct and iterative techniques
9.4 Expert approaches to solving sparse linear systems
9.5 Applications
Acknowledgments
Bibliography

10 A Survey of Parallelization Techniques for Multigrid Solvers
Edmond Chow, Robert D. Falgout, Jonathan J. Hu, Raymond S. Tuminaro, and Ulrike Meier Yang
10.1 Sources of parallelism
10.2 Parallel computation issues
10.3 Concluding remarks
Acknowledgments
Bibliography

11 Fault Tolerance in Large-Scale Scientific Computing
Patricia D. Hough and Victoria E. Howie
11.1 Fault tolerance in algorithms and applications
11.2 Fault tolerance in MPI
11.3 Conclusions
Acknowledgments
Bibliography

III Tools and Frameworks for Parallel Applications

12 Parallel Tools and Environments: A Survey
William D. Gropp and Andrew Lumsdaine
12.1 Software and tools for building and running clusters
12.2 Tools for computational science
12.3 Conclusion
Bibliography

13 Parallel Linear Algebra Software
Victor Eijkhout, Julien Langou, and Jack Dongarra
13.1 Dense linear algebra software
13.2 Sparse linear algebra software
13.3 Support libraries
13.4 Freely available software for linear algebra on the Internet
Reading List
Bibliography

14 High-Performance Component Software Systems
Randall Bramley, Rob Armstrong, Lois McInnes, and Matt Sottile
14.1 Current scientific component systems
14.2 Mathematical challenges
14.3 Conclusion
Acknowledgments
Bibliography

15 Integrating Component-Based Scientific Computing Software
Steven G. Parker, Keming Zhang, Kostadin Damevski, and Chris R. Johnson
15.1 SCIRun and BioPSE
15.2 Components for scientific computing
15.3 Metacomponent model
15.4 Distributed computing
15.5 Parallel components
15.6 Conclusions and future work
Acknowledgments
Bibliography

IV Applications of Parallel Computing

16 Parallel Algorithms for PDE-Constrained Optimization
Volkan Akcelik, George Biros, Omar Ghattas, Judith Hill, David Keyes, and Bart van Bloemen Waanders
16.1 Algorithms
16.2 Numerical examples
16.3 Conclusions
Acknowledgments
Bibliography

17 Massively Parallel Mixed-Integer Programming: Algorithms and Applications
Cynthia A. Phillips, Jonathan Eckstein, and William Hart
17.1 Basic branch and bound for MIP
17.2 Applications
17.3 A scalable parallel MIP solver
Bibliography

18 Parallel Methods and Software for Multicomponent Simulations
Michael T. Heath and Xiangmin Jiao
18.1 System overview
18.2 Integration framework and middleware services
18.3 Parallel computational methods
18.4 Results of coupled simulations
18.5 Conclusion
Acknowledgments
Bibliography

19 Parallel Computational Biology
Srinivas Aluru, Nancy Amato, David A. Bader, Suchindra Bhandarkar, Laxmikant Kale, and Dan C. Marinescu
19.1 Assembling the maize genome
19.2 An information theoretic approach to genome reconstruction
19.3 High-performance computing for reconstructing evolutionary trees
19.4 Scaling classical molecular dynamics to thousands of processors
19.5 Cluster and grid computing for 3D structure determination of viruses with unknown symmetry at high resolution
19.6 Using motion planning to study protein folding
Bibliography

20 Opportunities and Challenges for Parallel Computing in Science and Engineering
Michael A. Heroux, Padma Raghavan, and Horst D. Simon
20.1 Parallel computer systems
20.2 Robust and scalable algorithms
20.3 Application development and integration
20.4 Large-scale modeling and simulation
20.5 Concluding remarks
Acknowledgments
Bibliography

Index
List of Figures 2.1 2.2 2.3 2.4 2.5 2.6 2.7
Performance tools approaches The CEPBA-tools environment 2D analysis Load balancing at different scales Communication bandwidth Parallelism profile of a 128-processor run Drill scalability
3.1
Prophesy framework 36 Broadcast using a binomial tree: processes are numbered from 0 (root) through 9; communication steps are circled 41 Topology-unaware broadcast using a binomial tree: three intercluster messages (bold arrows) and six intracluster messages 41 Topology-aware broadcast: only one intercluster message (bold arrow). . 41 Tree constructed by DRUM to represent a heterogeneous network 43 Ideal and achieved (using DRUM) relative changes in execution times compared to homogeneous partitioning for an adaptive calculation using PHAML on different processor combinations 44 Hierarchical balancing algorithm selection for two four-way SMP nodes connected by a network 45 Conceptual overview of GridARM 47 A modular middleware architecture as a research testbed for scalable high-performance decentralized distributed computations 50
3.2 3.3
3.4 3.5 3.6
3.7
3.8 3.9 4.1
4.2
12 14 16 21 23 27 28
Performance of daxpy on a BG/L node is shown as a function of vector length. LI and L3 cache edges are apparent. For data in the LI cache (lengths < 2,000), the performance doubles by turning on SIMD instructions (440d) and doubles again when using both processors on the node 64 The performance speed-up using virtual node mode is shown for the class C NAS parallel benchmarks. The speed-up is denned as the ratio of Mops per node in virtual node mode to Mops per node using coprocessor mode. 65 xv
xvi
List of Figures 4.3
4.4 4.5
4.6
4.7
4.8 6.1
6.2
6.3
6.4
6.5
6.6
6.7
Unpack performance in BG/L is shown as a function of the number of compute nodes. Performance is indicated as a fraction of the theoretical peak. Results for three different strategies are included: using a single processor on each node, offloading computation to the coprocessor with the model, and using virtual node mode to run with two tasks per node. . Performance of UNPACK on large-scale BG/L systems Comparison of the default mapping and optimized mapping for N AS BT on up to 1,024 processors in virtual node mode. Mapping provides a significant performance boost at large task counts The computational performance for sPPM is shown for systems including IBM p655 at 1.7 GHz (top curve) and BG/L at 700 MHz using virtual node mode (middle curve) or coprocessor mode with a single computational task per node (lower curve). The x-axis indicates the number of BG/L nodes or the number of p655 processors Weak scaling results for UMT2K on BG/L and an IBM p655 cluster. The x-axis indicates the number of BG/L nodes or the number of p655 processors, and the y-axis indicates overall performance relative to 32 nodes of BG/L in coprocessor mode Results for CPMD with 138,000 atoms on BG/L
66 66
67
68
70 71
Cutting planes (left) and associated cut tree (right) for geometric recursive bisection. Dots are objects to be balanced; cuts are shown with dark lines and tree nodes 101 SFC partitioning (left) and box assignment search procedure (right). Objects (dots) are ordered along the SFC (dotted line). Partitions are indicated by shading. The box for box assignment intersects partitions 0 and 2 102 Use of multiconstraint graph partitioning for contact problems: (a) the 45 contact points are divided into three partitions; (b) the subdomains are represented geometrically as sets of axis-aligned rectangles; and (c) a decision tree describing the geometric representation is used for contact search 104 Comparing the nonzero structure of matrices from (a) a hexahedral finite element simulation, (b) a circuit simulation, (c) a density functional theory simulation, and (d) linear programming shows differences in structure between traditional and emerging applications 107 Example of communication metrics in the graph (left) and hypergraph (right) models. Edges are shown with ellipses; the partition boundary is the dashed line 108 Row (left) and column (right) distribution of a sparse matrix for multiplication « = Av. There are only two processors, indicated by dark and light shading, and communication between them is shown with arrows. In this example, the communication volume is three words in both cases. (Adapted from [9, Ch. 4].) 109 Irregular matrix distribution with two processors. Communication between the two processors (shaded dark and light) is indicated with arrows. 110
List of Figures 7.1
7.2 7.3 7.4 8.1
8.2
8.3 8.4
8.5 8.6
8.7
Permuting large entries to the diagonal. Dark edges in the graph correspond to edges in the matching in the bipartite graph of the matrix on the left. The matrix on the right is the permuted matrix with respected to the matching where columns are reordered as mate of the first row, mate of the second row, etc Directed graph for the sweep operation Branch-and-bound algorithm Cutting planes close the gap between IP (Integer Program) and LP feasible regions
xvii
129 134 137 138
Examples of AMR using structured and unstructured grids. The left figure shows fine detail in an impulsively sheared contact surface computed using patch-based structured AMR [3]. The right figure shows the accurate suface and volume representation of the fuselage and engine cowl of an RAH-66 Comanche helicopter with an unstructured AMR grid [37]. . 144 On the left is a comparison of maximum element error as a function of the number of grid vertices in an unstructured, tetrahedral mesh calculation [8]. The AMR computation requires significantly fewer points to achieve a desired accuracy. On the right is an image of the two-dimensional version of this problem showing refinement around the transition region and areas of high elastic stress 145 An outline of a parallel adaptive solution method for PDEs 145 Scaling properties of a three-level scaled SAMR simulation of a moving advecting sinusoidal front [68]. Remeshing occurs every two timesteps. Although the problem scales reasonably, adaptive gridding costs are clearly less scalable than numerical operations. Work to improve scaling of adaptive gridding operations is ongoing 148 The bisection algorithm 150 The process of the bisection algorithm is shown from left to right. In the initial mesh the shaded elements are refined; subsequently the shaded elements are refined because they are nonconforming [14] 150 Examples of new AMR applications. The left image shows a continuumatomistic hybrid coupling using AMR [65]. The right image shows an embedded boundary SAMR mesh around buildings in Manhattan used for flows in urban environments [41] 154
10.1
Full domain partitioning example
183
13.1
2D block-cyclic distribution of a matrix of order n with parameters (nprows= n/8,npcols= «/8,bs_i= 2,bs__j= 3). On the left, the original data layout, the matrix is partitioned with blocks of size n/8; on the right, the data is mapped on a 2 x 3 processor grid
236
Ccaffeine's graphical builder: the components' uses ports appear on their right, while provides ports appear on their left
252
14.1
xviii
List of Figures
14.2
14.3 14.4 15.1 15.2 15.3
15.4
15.5
15.6
15.7 15.8
Different codes in fusion energy and potential couplings among categories. (Courtesy of Stephen Jardin, Princeton Plasma Physics Laboratory.) The synchronization problem in PRMI Comparison of two component architectures with a single builder that works interchangeably via model-specific glue code
256 259 263
The SCIRun PSE, illustrating a 3D finite element simulation of an implantable cardiac defibrillator. 272 BioPSE neural source localization network. The optimal dipole source is recovered using a multistart optimization algorithm 273 Visualization of the iterative source localization. The voltages of the true solution (disks) and the computed solution (spheres) are qualitatively compared at the electrode positions as the optimization (shown as arrows) converges on a neural source location. The solution misfit can be qualitatively interpreted by pseudocolored voltages at each electrode. . . 274 BioPSE dataflow interface to a forward bioelectric field application. The underlying dataflow network implements the application with modular interconnected components called modules. Data are passed between the modules as input and output parameters to the algorithms. While this is a useful interface for prototyping, it can be nonintuitive for end users; it is confusing to have a separate user interface window to control the settings for each module. Moreover, the entries in the user interface windows fail to provide semantic context for their settings. For example, the text-entry field on the SampleField user interface that is labeled "Maximum number of samples" is controlling the number of electric field streamlines that are produced for the visualization 275 The BioFEM custom interface. Although the application is the functionality equivalent to the data flow version shown in Figure 15.4, this PowerApp version provides an easier-to-use custom interface. Everything is contained within a single window. The user is lead through the steps of loading and visualizing the data with the tabs on the right; generic control settings have been replaced with contextually appropriate labels, and application-specific tooltips (not shown) appear when the user places the cursor over any user interface element 276 The BioTensor PowerApp. Just as with BioFEM, we have wrapped up a complicated data flow network into a custom application. In the left panel, the user is guided through the stages of loading the data, coregistering MRI diffusion weighted images, and constructing diffusion tensors. On the right panel, the user has controls for setting the visualization options. In the rendering window in the middle, the user can render and interact with the dataset 277 Components of different models cooperate in SCIRun2 279 A more intricate example of how components of different models cooperate in SCIRun2. The application and components shown are from a realistic (albeit incomplete) scenario 281
List of Figures 15.9
15.10 16.1
16.2
16.3
17.1
18.1 18.2 18.3 18.4 18.5 18.6 18.7 18.8 18.9
xix
MxN method invocation, with the caller on the left and the callee on the right. In the left scenario, the number of callers is fewer than the number of callees, so some callers make multiple method calls. In the right, the number of callees is fewer, so some callees send multiple return values. . 283 Components of different models cooperate in SCIRun2 285 Reconstruction of hemipelvic bony geometry via solution of an inverse wave propagation problem using a parallel multiscale reduced (Gauss) Newton conjugate gradient optimization algorithm with TV regularization An optimal boundary control problem to minimize the rate of energy dissipation (equivalent here to the drag) by applying suction or injection of a fluid on the downstream portion of a cylinder at Re = 40. The left image depicts an uncontrolled flow; the right image depicts the optimally controlled flow. Injecting fluid entirely eliminates recirculation and secondary flows in the wake of the cylinder, thus minimizing dissipation. The optimization problem has over 600,000 states and nearly 9,000 controls and was solved in 4.1 hours on 256 processors of a Cray T3E at PSC Solution of a airborne contaminant inverse problem in the Greater Los Angeles Basin with onshore winds; Peclet number =10. The target initial concentration is shown at left and reconstructed initial condition on the right. The measurements for the inverse problem were synthesized by solving the convection-diffusion equation using the target initial condition and recording measurements ona21 x21 x21 uniform array of sensors. The mesh has 917,301 grid points; the problem has the same number of initial condition unknowns and 74 million total space-time unknowns. Inversion takes 2.5 hours on 64 AlphaServer processors at PSC. CG iterations are terminated when the norm of the residual of the reduced space equations is reduced by five orders of magnitude Example contact map alignment with isomorphic subgraphs with seven nodes and five edges corresponding to an alignment of seven amino acid residues with five shared contacts (bold edges) [27]
304
309
313
332
Overview of Rocstar software components 342 AMPI minimizes idle times (gaps in plots) by overlapping communication with computation on different virtual procs. Courtesy of Charm group. . 344 Windows and panes 346 Abstraction of data input 346 Scalability of data transfer on Linux cluster. 348 Speed-ups of mesh optimization and surface propagation 349 Example of remeshing and data transfer for deformed star grain 350 Initial burn of star slice exhibits rapid expansion at slots and contraction at fins. Images correspond to 0%, 6%, and 12% burns, respectively. . . .351 Titan IV propellant deformation after 1 second 352
XX
List of Figures
18.10 18.11 18.12 18.13 18.14
RSRM propellant temperature at 175 ms Hardware counters for Rocfrac obtained by Rocprof. Absolute performance of Rocstar on Linux and Mac Scalability with Rocflo and Rocsolid on IBM SP. Scalability with Rocflu and Rocfrac on Linux cluster.
352 352 353 353 353
19.1 19.2 19.3
Parallel clustering framework Speed-up on PSC LeMieux (a) 3D reconstruction of Sindbis at 10 A resolution, (b) The speed-up of one 3D reconstruction algorithm for several virus structures
360 367 369
List of Tables 4.1
The performance for CPMD using a 216-atom SiC supercell is listed for IBM p690 (Power4 1.3 GHz, Colony switch) and BG/L (700 MHz) systems. The performance metric is the elapsed time per time step in the simulation. Values marked n.a. were not available
71
4.2
Performance of Enzo for 256**3 unigrid on BG/Land IBMp655 (1.5GHz Power4, Federation switch) relative to 32 BG/L nodes in coprocessor mode. 72
8.1
A comparison of SAMR and UAMR methods
152
13.1
Support routines for numerical linear algebra
243
13.2
Available.software.for.dense matrix
243
13.3
Sparse direct solvers
244
13.4
Sparse eigenvalue solvers
244
13.5
Sparse iterative solvers
245
16.1
Fixed-size scalability on a Cray T3E-900 for a 262,144-grid point problem corresponding to a two-layered medium
306
Algorithmic scaling by LRQN, RNCG, and PRNCG methods as a function of material model resolution. For LRQN, the number of iterations is reported, and for both LRQN solver and preconditioner, 200 L-BFGS vectors are stored. For RNCG and PRNCG, the total number of CG iterations is reported, along with the number of Newton iterations in parentheses. On all material grids up to 653, the forward and adjoint wave propagation problems are posed on 653 grid x 400 time steps, and inversion is done on 64 PSC AlphaServer processors; for the 1293 material grid, the wave equations are on 1293 grids x 800 time steps, on 256 processors. In all cases, work per iteration reported is dominated by a reduced gradient (LRQN) or reduced-gradient-like (RNCG, PRNCG) calculation, so the reported iterations can be compared across the different methods. Convergence criterion is lO"5 relative norm of the reduced gradient. * indicates lack of convergence;T indicates number of iterations extrapolated from converging value after 6 hours of runtime
307
16.2
xxi
xx ii
List of Tables 16.3 16.4
16.5
19.1 19.2 19.3
19.4
Algorithmic scalability for Navier-Stokes optimal flow control problem on 64 and 128 processors of a Cray T3E for a doubling (roughly) of problem size 310 Fixed size scalability of unpreconditioned and multigrid preconditioned inversion. Here the problem size is 257 x 257 x 257 x 257 for all cases. We use a three-level version of the multigrid preconditioner. The variables are distributed across the processors in space, whereas they are stored sequentially in time (as in a multicomponent PDE). Here hours is the wall-clock time, and rj is the parallel efficiency inferred from the runtime. The unpreconditioned code scales extremely well since there is little overhead associated with its single-grid simulations. The multigrid preconditioner also scales reasonably well, but its performance deteriorates since the problem granularity at the coarser levels is significantly reduced. Nevertheless, wall-clock time is significantly reduced over the unpreconditioned case 313 Isogranular scalability of unpreconditioned and multigrid preconditioned inversion. The spatial problem size per processor is fixed (stride of 8). Ideal speed-up should result in doubling of wall-clock time. The multigrid preconditioner scales very well due to improving algorithmic efficiency (decreasing CG iterations) with increasing problem size. Unpreconditioned CG is not able to solve the largest problem in reasonable time. . .314 Assembly statistics and runtime on a 64-processor Pentium III 1.26 GZH Myrinet Cluster The increase of the amount of data and the corresponding increase in memory requirements for very-high-resolution reconstruction of the reo virus with a diameter of about 850 A The time for different steps of the orientation refinement for reo virus using4,422 views with 511 x 511 pixels/view. DFTsizeis512x512x512. Refinement steps of 1°, 0.1°, and 0.01°. The refinement time increases three to five times when the refinement step size decreases from 0.1° to 0.01 ° because of a larger number of operations and also due to the memory access time to much larger data structures. The orientation refinement time is the dominant component of the total execution time, hundreds of minutes, as compared with the computation of the 3D DFT, the reading time, and the DFT analysis, which take a few hundred seconds A comparison of protein folding models
360 369
370 371
Preface

Scientific computing has often been called the third approach to scientific discovery, emerging as a peer to experimentation and theory. Historically, the synergy between theory and experimentation has been well understood. Experiments give insight into possible theories, theories inspire experiments, experiments reinforce or invalidate theories, and so on. As scientific computing (also known as or strongly related to computational science and engineering; computer modeling and simulation; or technical computing) has evolved to increasingly produce computational results that meet or exceed the quality of theoretical and experimental results, it has become an indispensable third approach. The synergy of theory, experimentation, and computation is very rich. Scientific computing requires theoretical models and often needs input data from experiments. In turn, scientists hoping to gain insight into a problem of interest have yet another basic tool set with which to advance ideas and produce results. That scientific computing is recognized as important is evidenced by the large research investment in this area. As one example, we point to the Scientific Discovery through Advanced Computing Program (SciDAC) sponsored by the U.S. Department of Energy. Although the Internet and related computing technologies have enabled tremendous growth in business and consumer computing, computing and science have been intimately connected from the very beginning of computers, and scientific computing has an insatiable demand for high-performance calculations. Although science is not the dominant force it once was in the computing field, it remains a critical area of computing for strategic purposes, and increasingly scientific computing is essential to technology innovation and development.

Parallel processing has been an enabling technology for scientific computing for more than 20 years. Initial estimates of the cost and length of time it would take to make parallel processing broadly available were admittedly optimistic. The impact of parallel processing on scientific computing varies greatly across disciplines, but we can strongly argue that it plays a vital role in most problem domains and has become essential in many.

This volume is suitable as a reference on the state of the art in scientific computing for researchers, professionals, and application developers. It is also suitable as an overview and introduction, especially for graduate and senior-level undergraduate students who are interested in computational modeling and simulation and related computer science and applied mathematics aspects.

This volume reflects the themes, problems, and advances presented at the Eleventh SIAM Conference on Parallel Processing for Scientific Computing held in San Francisco in 2004. This series of SIAM conferences is a venue for mathematicians, computer scientists, and computational scientists to focus on the core enabling technologies that make parallel processing effective for scientific problems. Going back nearly 20 years,
this conference series is unique in how it complements other conferences on algorithms or applications, sponsored by SIAM and other organizations. Most of the chapters in this book are authored by participants of this conference, and each chapter provides an expository treatment of a particular topic, including recent results and extensive references to related work. Although progress is made each year in advancing the state of parallel processing, the demands of our target problems require ever more capabilities. This is illustrated by the titles we chose for the four parts of this book, which could well have described the categories of interest back at the beginning of the parallel processing conference series. Our hope is that the reader will not only look at the specifics of our current capabilities but also perceive these perennial issues. By doing so, the reader will gain knowledge that has lasting value.

Michael A. Heroux
Padma Raghavan
Horst D. Simon
Chapter 1
Frontiers of Scientific Computing: An Overview
Michael A. Heroux, Padma Raghavan, and Horst D. Simon
Scientific computing is a broad discipline focused on using computers as tools for scientific discovery. This book describes the present state and future directions of scientific computing over the next 19 chapters, organized into four main parts. The first part concerns performance modeling, analysis, and optimization. The second focuses on parallel algorithms and software for an array of problems that are common to many modeling and simulation applications. The third part emphasizes tools and environments that can ease and enhance the process of application development. The fourth provides a sampling of applications that require parallel computing for scaling to solve large and realistic models that can advance science and engineering. The final chapter of this volume discusses some current and upcoming challenges and opportunities in parallel scientific computing.

In this chapter, we provide an overview of this edited volume on scientific computing, which in broad terms concerns algorithms, their tuned software implementations on advanced computing systems, and their use and evaluation in modeling and simulation applications. Some distinguishing elements of scientific computing are a focus on high performance, an emphasis on scalable parallel algorithms, the development of advanced software tools and environments, and applications to computational modeling and simulation in diverse disciplines. These aspects are covered in the four main parts of the volume (Chapters 2-19). The final chapter, Chapter 20, contains a brief discussion of new and emerging trends and challenges in scientific computing. The remainder of this chapter provides a brief overview of the contents of the book. We discuss each of the four main parts and the chapters within these parts in the order in which they occur.
1.1 Performance modeling, analysis, and optimization
The first part of this volume focuses on one of the most prominent themes in scientific computing, namely, that of achieving high performance. This emphasis on high performance arises primarily from the need for scientists and engineers to solve realistic models of ever-increasing size at the limits of available computing resources. In this context, optimizations for higher performance often can make the difference between solving and not solving a specific problem. To enable such performance optimizations, it is critical to have tools and instruments with which to identify and measure the sources of inefficiency across all levels, from the processor architecture and system software to the algorithm and its implementation. This topic is discussed in Chapter 2 with a focus on the challenges in developing such performance analysis tools and the issues that are relevant to improving current practices. Chapter 3 discusses how parallel multiprocessors and distributed hierarchical and heterogeneous systems further complicate the process of achieving acceptably high levels of performance. It is emphasized that parallel scientific computation must necessarily take into account architectural characteristics for performance optimizations. In addition, there is a sampling of architecture-aware optimizations which can be applied across levels, from compilers that can reorder loops to high-level algorithms that are tuned to achieve a favorable trade-off between computation and communication costs. Chapter 4 provides an overview of the BlueGene/L, which is currently the fastest supercomputer in existence, with sustained execution rates of more than 280 teraops/sec for dense matrix computations [2]. This chapter focuses on the design of the BlueGene/L processor architecture, interconnection network, and the tuning of applications and benchmarks to achieve a large fraction of the peak execution rates while scaling to tens of thousands of processors. Chapter 5, the fourth and final chapter in this part, concerns the growing gap between sustained and peak performance for realistic scientific applications on conventional supercomputers. The authors discuss the need for high-fidelity performance modeling to understand and predict the interactions among hardware, software, and the characteristics of a diverse set of scientific applications.
1.2 Parallel algorithms and enabling technologies
The second part of this volume focuses on parallel algorithms for core problems in scientific computing, such as partitioning, load balancing, adaptive meshing, sparse linear system solution, and fault tolerance. Taken together, these techniques are critical for the scaling of a variety of scientific applications to large numbers of processors while maintaining acceptable levels of performance. The scalability goal in scientific computing is typically that of maintaining efficiency while increasing problem size and the number of processors. This is achieved by developing algorithms that limit the overheads of parallelization, such as communication costs. If $T_1$ and $T_P$ are, respectively, the observed serial and parallel execution times using $P$ processors, the efficiency is $E = \frac{T_1}{P\,T_P}$. Now the total overhead is $T_O = P\,T_P - T_1$, and thus $E = \frac{1}{1 + T_O/T_1}$. Typically, for problem size $N$, $T_1 = f(N)$, where $f$ is a function representing computational costs, while $T_O$ depends both on $N$ and $P$, i.e., it is of the form $g(N, P)$. Now $E = \frac{1}{1 + g(N, P)/f(N)}$ can be maintained at a fixed acceptable level by choosing appropriate values of $N$ for a given $P$.
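As a minimal illustration of this efficiency relation (a sketch, not taken from the chapter), the Python fragment below computes $E$ both from measured times and from the overhead form. The cost model $f(N) = N$ and $g(N, P) = C\,P\log_2 P$, the constant $C$, the target efficiency of 0.8, and the timing and processor-count values are all hypothetical, chosen only to show how $N$ must grow with $P$ to hold $E$ fixed, which is the isoefficiency idea.

import math

def efficiency_from_times(t_serial, t_parallel, p):
    # E = T_1 / (P * T_P), computed from measured execution times.
    return t_serial / (p * t_parallel)

def efficiency_from_overhead(f_n, g_np):
    # Equivalent form E = 1 / (1 + T_O / T_1), with T_1 = f(N), T_O = g(N, P).
    return 1.0 / (1.0 + g_np / f_n)

# Assumed toy cost model: serial work f(N) = N, total parallel
# overhead g(N, P) = C * P * log2(P).
C = 100.0

def f(n):
    return float(n)

def g(n, p):
    return C * p * math.log2(p)

# From hypothetical timings: T_1 = 100 s, T_P = 3.2 s on P = 40 processors.
print(f"Measured-time form: E = {efficiency_from_times(100.0, 3.2, 40):.2f}")

# Holding E at 0.8 requires g(N, P) / f(N) = 0.25, i.e.,
# N = 4 * C * P * log2(P): the problem size must grow with P.
for p in (4, 16, 64, 256):
    n = 4 * C * p * math.log2(p)
    print(f"P = {p:4d}  N = {n:10.0f}  E = {efficiency_from_overhead(f(n), g(n, p)):.2f}")

In this toy model the printed efficiency stays at 0.80 for every processor count, while the required $N$ grows roughly like $P \log P$; a slower-growing overhead function would permit a smaller growth in problem size.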