Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen University of Dortmund, Germany Madhu Sudan Massachusetts Institute of Technology, MA, USA Demetri Terzopoulos New York University, NY, USA Doug Tygar University of California, Berkeley, CA, USA Moshe Y. Vardi Rice University, Houston, TX, USA Gerhard Weikum Max-Planck Institute of Computer Science, Saarbruecken, Germany
3349
Barbara M. Chapman (Ed.)
Shared Memory Parallel Programming with OpenMP 5th International Workshop on OpenMP Applications and Tools, WOMPAT 2004 Houston, TX, USA, May 17-18, 2004 Revised Selected Papers
Volume Editor
Barbara M. Chapman
University of Houston, Department of Computer Science
Houston, TX 77204-3010, USA
E-mail: [email protected]

Library of Congress Control Number: 2004118420
CR Subject Classification (1998): C.1-4, D.1-4, F.1-3, G.1-2
ISSN 0302-9743
ISBN 3-540-24560-X Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

Springer is a part of Springer Science+Business Media
springeronline.com

© Springer-Verlag Berlin Heidelberg 2005
Printed in Germany

Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
SPIN: 11386162 06/3142 543210
Preface
This book contains the Proceedings of the 5th Workshop on OpenMP Applications and Tools (WOMPAT 2004), which took place at the University of Houston, Houston, Texas on May 17 and 18, 2004. Previous workshops in this series took place in Toronto, Canada, Fairbanks, Alaska, Purdue, Indiana, and San Diego, California. The purpose of the workshop was to bring together users and developers of the OpenMP API for shared memory parallel programming to disseminate their ideas and experiences and discuss the latest developments in OpenMP and its application. To support this aim, the program comprised a mixture of invited talks from research and industry, experience reports, and submitted papers, the last of which are presented in this volume.

A tutorial introduction to OpenMP was held at the same location on May 18 by Ruud van der Pas from Sun Microsystems. Further, a two-day lab session called OMPlab was held immediately following the workshop and the tutorial on May 19 and 20, and was attended by both novice and advanced users. Many of the hardware vendors and several researchers gave in-depth tutorials on their software and made their systems available to both novice and advanced attendees during OMPlab. Contributors to the WOMPAT 2004 OMPlab included IBM, Intel, Sun, the University of Tennessee, NASA, the University of Greenwich, Cornell University, the University of Oregon and the University of Houston.

The OpenMP API is a widely accepted standard for high-level shared memory parallel programming that was put forth by a consortium of vendors in 1997. It is actively maintained by the OpenMP Architecture Review Board, whose members consist of most hardware vendors, high-performance compiler vendors, and several research organizations including cOMPunity, which works to engage researchers in the standardization efforts and to ensure the continuation of the WOMPAT series of events. OpenMP is still evolving to ensure that it meets the needs of new applications and emerging architectures, and WOMPAT workshops, along with their Asian and European counterparts (WOMPEI and EWOMP, respectively) are among the major venues at which users and researchers propose new features, report on related research efforts, and interact with the members of the OpenMP Architecture Review Board.

The papers contained in this volume were selected by the WOMPAT 2004 Program Committee from the submissions received by the organizers. They include experience reports on using OpenMP (sometimes in conjunction with proposed extensions) in large-scale applications and on several computing platforms, considerations of OpenMP parallelization strategies, and discussions and evaluations of several proposed language features and compiler and tools technologies.

I would like to thank the members of the Program Committee for their efforts in reviewing the submissions and in identifying a variety of excellent speakers for the remainder of the program. Thanks go also to those who gave tutorials and provided software at the accompanying events. I would also like to thank the
members of the Local Organizing Committee, who not only helped prepare for this event but also supported the OMPlab by installing software and providing technical support on a variety of platforms. Last, but not least, I would like to thank our sponsors, who helped make this an enjoyable and memorable occasion.

October 2004
Barbara Chapman
Organization
WOMPAT 2004 was organized by cOMPunity Inc., the Association of OpenMP Researchers, Developers and Users, in conjunction with the Department of Computer Science and the Texas Learning and Computation Center, TLC2, at the University of Houston.
Program Committee
Barbara Chapman, University of Houston, USA (Workshop Chair)
Dieter an Mey, Technical University of Aachen, Germany
Eduard Ayguade, Universitat Politecnica de Catalunya, Spain
Beniamino di Martino, University of Naples, Italy
Rudolf Eigenmann, Purdue University, USA
Constantinos Ierotheou, University of Greenwich, UK
Olin Johnson, University of Houston, USA
Gabriele Jost, NASA Ames Research Center, USA
Ricky Kendall, Iowa State University, USA
Larry Meadows, Intel Corporation, USA
Mitsuhisa Sato, University of Tsukuba, Japan
Sanjiv Shah, KSL, Intel Corporation, USA
Martin Schulz, Cornell University, USA
Danesh Tafti, Virginia Tech, USA
Andreas Uhl, University of Salzburg, Austria
Sponsors WOMPAT 2004 was sponsored by the OpenMP Architecture Review Board (OpenMP ARB), the OpenMP users’ group cOMPunity, Intel Corporation, the Portland Group, and the Texas Learning and Computation Center at the University of Houston.
Table of Contents
Parallelization of General Matrix Multiply Routines Using OpenMP
Jonathan L. Bentz, Ricky A. Kendall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

Performance Analysis of Hybrid OpenMP/MPI N-Body Application
Rocco Aversa, Beniamino Di Martino, Nicola Mazzocca,
Salvatore Venticinque . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

Performance and Scalability of OpenMP Programs on the Sun Fire™ E25K Throughput Computing Server
Myungho Lee, Brian Whitney, Nawal Copty . . . . . . . . . . . . . . . . . . . . . . . 19

What Multilevel Parallel Programs Do When You Are Not Watching: A Performance Analysis Case Study Comparing MPI/OpenMP, MLP and Nested OpenMP
Gabriele Jost, Jesús Labarta, Judit Gimenez . . . . . . . . . . . . . . . . . . . . . . 29

SIMT/OMP: A Toolset to Study and Exploit Memory Locality of OpenMP Applications on NUMA Architectures
Jie Tao, Martin Schulz, Wolfgang Karl . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

Dragon: A Static and Dynamic Tool for OpenMP
Oscar Hernandez, Chunhua Liao, Barbara Chapman . . . . . . . . . . . . . . . 53

The ParaWise Expert Assistant - Widening Accessibility to Efficient and Scalable Tool Generated OpenMP Code
Stephen Johnson, Emyr Evans, Haoqiang Jin,
Constantinos Ierotheou . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

Automatic Scoping of Variables in Parallel Regions of an OpenMP Program
Yuan Lin, Christian Terboven, Dieter an Mey, Nawal Copty . . . . . . . . 83

An Evaluation of Auto-Scoping in OpenMP
Michael Voss, Eric Chiu, Patrick Man Yan Chow, Catherine Wong,
Kevin Yuen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

Structure and Algorithm for Implementing OpenMP Workshares
Guansong Zhang, Raul Silvera, Roch Archambault . . . . . . . . . . . . . . . . . 110

Efficient Implementation of OpenMP for Clusters with Implicit Data Distribution
Zhenying Liu, Lei Huang, Barbara Chapman, Tien-Hsiung Weng . . . . 121

Runtime Adjustment of Parallel Nested Loops
Alejandro Duran, Raúl Silvera, Julita Corbalán, Jesús Labarta . . . . . . 137

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Parallelization of General Matrix Multiply Routines Using OpenMP

Jonathan L. Bentz and Ricky A. Kendall

Scalable Computing Laboratory, Ames Laboratory, U.S. DOE
Department of Computer Science, Iowa State University, Ames, IA 50011
Abstract. An application programmer interface (API) is developed to facilitate, via OpenMP, the parallelization of the double precision general matrix multiply routine called from within GAMESS [1] during the execution of the coupled-cluster module for calculating physical properties of molecules. Results are reported using the ATLAS library and the Intel MKL on an Intel machine, and using the ESSL and the ATLAS library on an IBM SP.
1 Introduction
Matrix multiply has been studied extensively with respect to high performance computing, including analysis of the complexity of parallel matrix multiply (see Ref. [2] and Refs. therein). The Basic Linear Algebra Subprograms [3, 4, 5, 6] are subroutines that perform linear algebraic calculations on vectors and matrices with the aim of being computationally efficient across all platforms and architectures. The Level 3 Basic Linear Algebra Subprograms [7] (BLAS) are subroutines that perform matrix-matrix operations. The double precision general matrix multiply (DGEMM) routine is a member of the Level 3 BLAS and has the general form of

C ← αAB + βC ,    (1)
where A, B and C are matrices and α and β are scalar constants. In the double precision case of real numbers, matrices A and B can either be transposed or not transposed upon entry into DGEMM. This work uses OpenMP to facilitate the parallelization of the general matrix multiply routines consistently encountered in high-performance computing. Specifically, we are working in conjunction with the ab initio quantum chemistry software suite GAMESS (General Atomic and Molecular Electronic Structure System) [1], developed by Dr. Mark Gordon and his group at Iowa State University. Many of the modules in GAMESS are implemented to run in parallel (via an asynchronous one-sided distributed memory model), but the module used in calculating physical properties via the coupled-cluster (CC) method currently has
only a serial implementation [8]. The CC code [9] has numerous calls to DGEMM and to improve the performance of this code when run on shared memory systems, we are developing an intelligent application programmer interface (API) for the DGEMM routine which is called from within GAMESS during its execution. Our wrapper routine (hereafter referred to as ODGEMM) uses OpenMP to parallelize the matrix multiplication. Currently, GAMESS comes with a vanilla source code BLAS (VBLAS) library built in and one can optionally link with any available BLAS library instead. It is not sufficient to simply link all of GAMESS to a multi-threaded BLAS library because then the modules (other than CC) which have previously been parallelized will create numerous threads when the parallelization has already been taken care of at a different level. Because of this difference between the CC module and the rest of GAMESS, shared memory parallelization of DGEMM within the CC module is facilitated by ODGEMM which is written to work specifically within the CC module. Our API is designed to be directly called by GAMESS, partition the matrices properly and then call a supplied BLAS routine to perform the individual multiplications of the partitioned patches. We have tested a number of different BLAS libraries where the library DGEMM is used as a subroutine in our ODGEMM. The ODGEMM routine also calls different BLAS libraries based on the system software infrastructure. We have tested our routine with the Automatically Tuned Linear Algebra Software library (version 3.6.0) [10] (ATLAS), which is freely available and compiles on many platforms. ATLAS provides a strictly serial library and a parallel library with the parallelization implemented using POSIX threads. The number of threads chosen in the parallel libraries of ATLAS is determined when the library is compiled and is commonly the number of physical processors. The number of threads cannot be increased dynamically but ATLAS may use less if the problem size does not warrant using the full number. The multiplication routines in ATLAS are all written in C, although ATLAS provides a C and FORTRAN interface from which to call their routines. GAMESS is written in FORTRAN and as such our wrapper routine is a FORTRAN callable routine. Because our routine is written in C, we can use pointer arithmetic to manipulate and partition the matrices. This allows us to avoid copying matrix patches before calling the multiplication routine. Testing has also been performed with the Intel Math Kernel Library (version 6.1) [11] (MKL) and the IBM Engineering and Scientific Subroutine Library (version 3.3.0.4) [12] (ESSL). The MKL is threaded using OpenMP so the threading can be controlled by an environment variable similarly to ODGEMM. The ESSL number of threads can also be changed through the use of an environment variable. The testing with GAMESS has been run with ATLAS ODGEMM, ATLAS PTDGEMM, MKL ODGEMM, ESSL ODGEMM and ESSLSMP DGEMM (see Table 1). MKL does come with a threaded library but it cannot be called with multiple threads from GAMESS currently because of thread stacksize problems. These tests have been performed on a variety of computational resources and associated compiler infrastructure.
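As an aside on the mixed-language interface just described, the fragment below is a minimal sketch of our own (the routine name, the trailing-underscore name mangling, and the simplified body are illustrative assumptions, not the actual ODGEMM source): a FORTRAN caller passes every argument by reference, so the C wrapper receives pointers into the column-major arrays and can select a patch of complete rows with pointer arithmetic instead of copying it.

/* Hypothetical sketch of a FORTRAN-callable DGEMM wrapper written in C.
   The trailing underscore is a common (compiler-dependent) convention for
   routines called from FORTRAN; all arguments arrive as pointers because
   FORTRAN passes by reference. */
void odgemm_(const char *transa, const char *transb,
             const int *m, const int *n, const int *k,
             const double *alpha, const double *a, const int *lda,
             const double *b, const int *ldb,
             const double *beta, double *c, const int *ldc)
{
    /* In column-major storage, element (i,j) of A is a[i + (size_t)j * (*lda)],
       so a block of complete rows starting at row r0 is simply a + r0 with the
       same leading dimension; no copy of the patch is required. */
}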
Table 1. Acronyms for the libraries used

Library           Platform   Description
VBLAS DGEMM       Intel      Serial vanilla BLAS
ATLAS PTDGEMM     Intel      Pthread built-in implementation using ATLAS
ATLAS ODGEMM      Intel      OpenMP implementation using ATLAS
MKL ODGEMM        Intel      OpenMP implementation using MKL
ESSLSMP DGEMM     IBM        Vendor threaded implementation of ESSL
ESSL ODGEMM       IBM        OpenMP implementation using ESSL
IBMATLAS ODGEMM   IBM        OpenMP implementation using ATLAS
2 Outline of ODGEMM Algorithm
The ODGEMM algorithm uses coarse-grained parallelism. Consider, for brevity, that the A and B matrices are not transposed in the call to the ODGEMM routine. In this case, A is an M × K matrix, B is a K × N matrix, and the resultant C is an M × N matrix. (In subsequent tables of data, M, N and K are also defined in this manner.) Upon entry to ODGEMM, A is partitioned into n blocks, where n is the number of threads. The size of each block is M/n by K, such that each block is a patch of complete rows of the original A matrix. If n does not divide M evenly, then some blocks may receive one more row of A than others. Matrix B is partitioned into blocks of size K by N/n. In a similar fashion each block of B is a patch of N/n full columns of B, and again if n does not divide N evenly, some blocks will receive one more column of B than others. After this partitioning occurs, the calls to a library DGEMM (e.g., ATLAS, MKL, etc.) are made. Each thread works with one block of A and the entire B. If Ai is the ith block of A and Bj is the jth block of B, then the multiplication of Ai by Bj produces the Cij block. Furthermore, since the ith thread works with Ai and the entire B, the ith thread is computing the Ci block of C, a block of M/n complete rows of C. Each thread computes an independent patch of C and as a result there is no dependence among executing threads for the storage of C.
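A minimal sketch of this partitioning scheme is given below (our own simplification, not the actual ODGEMM source: the function name and argument list are illustrative, only the non-transposed column-major case is shown, and the serial CBLAS interface is assumed for the per-thread library call). Each OpenMP thread multiplies its block of complete rows of A by the entire B, writing an independent patch of C.

#include <omp.h>
#include <cblas.h>

/* Illustrative row-block DGEMM wrapper: thread i computes rows [r0, r0+mi)
   of C = alpha*A*B + beta*C by calling a serial library DGEMM on its patch. */
void odgemm_rowblock(int M, int N, int K, double alpha,
                     const double *A, int lda,
                     const double *B, int ldb,
                     double beta, double *C, int ldc)
{
    #pragma omp parallel
    {
        int nthr = omp_get_num_threads();
        int tid  = omp_get_thread_num();
        int base = M / nthr, rem = M % nthr;
        int mi   = base + (tid < rem ? 1 : 0);           /* rows for this thread */
        int r0   = tid * base + (tid < rem ? tid : rem); /* first row owned      */

        if (mi > 0)
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        mi, N, K, alpha,
                        A + r0, lda,    /* row block of A (column-major)      */
                        B, ldb,         /* every thread reads the whole of B  */
                        beta, C + r0, ldc);
    }
}

With this decomposition no two threads write the same elements of C, so no synchronization is needed beyond the implicit barrier at the end of the parallel region.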
3 Results

3.1 Matrix Multiply Testing
We have tested our routine with the libraries mentioned above and provide some results of our testing in Table 2. These tests were performed by first generating analytical A, B, and C matrices with double precision elements, performing the multiplication using DGEMM or ODGEMM, and comparing the resultant C matrix with the analytical C matrix. All of our preliminary testing was performed on two machines: an SMP machine (named Redwing) with 4 GB of memory and 4 Intel Xeon 2.00 GHz processors, and one node of the IBM SP (named Seaborg) provided by NERSC (see Acknowledgments) which has 16 GB of memory and 16 POWER3 375 MHz processors.
Table 2. Matrix multiplication execution times. Results are reported in seconds of wall-clock time. The numbers in the column headings indicate the number of threads used. The PT column heading indicates the execution time upon calling the ATLAS PTDGEMM threaded routine directly.

Library           M      K      N        1      2      4      16      PT
ATLAS ODGEMM      2000   2000   2000    5.05   2.69   1.64    -      1.41
VBLAS DGEMM       2000   2000   2000   99.29    -      -      -       -
MKL ODGEMM        2000   2000   2000    4.62   2.50   1.58    -       -
ATLAS ODGEMM      7000   7000   7000   210.9  108.6  57.98    -     58.5
MKL ODGEMM        7000   7000   7000   196.0  101.8  56.87    -       -
ESSLSMP DGEMM     7000   7000   7000   538.6    -      -     39.9     -
ESSL ODGEMM       7000   7000   7000   538.6    -      -     34.8     -
IBMATLAS ODGEMM   7000   7000   7000   531.3    -      -     49.2    51.4
ATLAS ODGEMM      8000   8000   8000   315.9  162.8   85.5    -     85.74
MKL ODGEMM        8000   8000   8000   293.3  150.4  83.06    -       -
ESSLSMP DGEMM     8000   8000   8000   966.5    -      -     60.69    -
ESSL ODGEMM       8000   8000   8000   966.5    -      -     53.11    -
IBMATLAS ODGEMM   8000   8000   8000   795.7    -      -     78.69   80.42
ATLAS ODGEMM     10000  10000   1000   63.21  30.88  48.56    -     17.27
MKL ODGEMM       10000  10000   1000   57.42  31.21  16.97    -       -
ESSL ODGEMM      10000  10000   1000   163.1    -      -     11.85    -
IBMATLAS ODGEMM  10000  10000   1000   214.3    -      -     15.78   14.26
ATLAS ODGEMM      1000  10000  10000   61.53  31.66  17.18    -     16.67
MKL ODGEMM        1000  10000  10000   58.32  32.22  20.99    -       -
ESSL ODGEMM       1000  10000  10000   153.3    -      -     11.18    -
IBMATLAS ODGEMM   1000  10000  10000   156.9    -      -     15.12   14.10
The data in Table 2 exhibit a number of interesting features. The VBLAS DGEMM time is included for comparison to the more sophisticated BLAS libraries used in this work. For a relatively small matrix size, the VBLAS DGEMM time is almost 2 orders of magnitude larger than either the single-threaded ATLAS ODGEMM or MKL ODGEMM results. Viewing all results from Redwing, the ATLAS ODGEMM results are comparable to the MKL ODGEMM results, and in a few cases, the ATLAS PTDGEMM routine actually runs faster than the MKL ODGEMM routine. Viewing the results from Seaborg, one notices that the ESSLSMP DGEMM and ESSL ODGEMM threaded routines are consistently faster than the IBMATLAS ODGEMM. On the other hand, when only one thread is used, IBMATLAS ODGEMM runs faster for the largest matrix size tested. A rather striking result is that of the last two sections of the table, where the dimensions of the matrices are not equal. Considering the Redwing results, when M is 10000 and 4 threads are used, the ATLAS ODGEMM time is quite high, but when M is 1000, the ATLAS ODGEMM time is quite reasonable. The only difference between these two test cases is that the dimensions of M and N are swapped. Recall that the algorithm partitions the rows of A, so as the number of rows of A changes, that should affect the outcome somewhat. However, MKL
ODGEMM does not show a similar difference between the non-square matrix sizes. These are simply general results so that one can see how these matrix multiply routines compare with one another.

3.2 GAMESS
The results reported in this section are from execution of the GAMESS CC module performing energy calculations.1 Figures 1, 2 and 3 show timing data vs. number of threads from GAMESS execution runs. In all three figures, the ATLAS PTDGEMM result is simply a single point, since the number of threads used by the ATLAS PTDGEMM threaded routine cannot be changed by the user. The compile-time thread number (set to the number of physical processors) is really a maximum thread number, since ATLAS may use fewer threads if the matrices are sufficiently small.
[Figure 1: curves for ATLAS ODGEMM, MKL ODGEMM, ESSLSMP DGEMM, ESSL ODGEMM, and ATLAS PTDGEMM; y-axis: Time (s), x-axis: Number of threads.]

Fig. 1. Execution time vs. number of threads for the molecule HNO using the cc-pVTZ basis set. ATLAS and MKL calculations were performed on Redwing. ESSL calculations were performed on one node of Seaborg. The basis set cc-pVTZ uses 85 basis functions for this calculation. The x-axis scale is logarithmic
Figure 1 shows execution time vs. number of threads for the HNO molecule using the cc-pVTZ basis set. The ESSL ODGEMM and ESSLSMP DGEMM results are almost identical for 1 and 2 threads, but they diverge quickly when 4 or more threads are used. This is probably due to the fact that in these

1 The basis sets were obtained from the Environmental Molecular Sciences Laboratory at Pacific Northwest National Laboratory, http://www.emsl.pnl.gov/forms/basisform.html. For an explanation of the basis sets see ref. [13].
[Figure 2: curves for ATLAS ODGEMM, MKL ODGEMM, ESSLSMP DGEMM, ESSL ODGEMM, and ATLAS PTDGEMM; y-axis: Time (s), x-axis: Number of threads.]

Fig. 2. Execution time vs. number of threads for the molecule HNO using the cc-pVQZ basis set. ATLAS and MKL calculations were performed on Redwing. ESSL calculations were performed on one node of Seaborg. The basis set cc-pVQZ uses 175 basis functions for this calculation. The x-axis scale is logarithmic
[Figure 3: curves for ATLAS ODGEMM, MKL ODGEMM, and ATLAS PTDGEMM; y-axis: Time (s), x-axis: Number of threads.]

Fig. 3. Execution time vs. number of threads for the molecule glycine using the cc-pVDZ basis set for the hydrogens and the cc-pVTZ basis set for all other atoms for a total of 200 basis functions. Calculations were performed on Redwing
testing runs, ESSL ODGEMM always uses the specified number of threads, and with small matrix sizes this can cause an unnecessary amount of overhead. With ESSL ODGEMM there is no performance improvement after 2 threads but
with ESSLSMP DGEMM the performance continues to increase. The ATLAS ODGEMM and MKL ODGEMM results are more sporadic. With 1 thread, MKL ODGEMM is faster, with 2 threads, ATLAS ODGEMM is faster, and then with 3 and 4 threads, MKL ODGEMM is faster. Both curves, with the exception of ATLAS ODGEMM using 2 threads, show an increase in execution time as the number of threads increases. Again this is most likely due to the thread overhead of partitioning matrices which may not be large enough to require parallelization. The fastest overall execution time is obtained by calling ATLAS PTDGEMM. As a reference point, on Redwing, the wall time for this calculation using the VBLAS DGEMM that comes packaged with GAMESS is 705 seconds.

To investigate this behavior further, some test cases were calculated using matrix sizes that occur frequently in this GAMESS calculation. The results of these test cases are shown in Table 3. The three test cases shown account for about 75% of the matrix multiplications found in this GAMESS calculation. The first thing to note when viewing these results is that the dimensions of the matrices are quite unequal. The results show good agreement with what was shown in Fig. 1. As the ESSLSMP DGEMM thread number increases, the execution time decreases, while the ESSL ODGEMM execution time is at its lowest using either 2 or 4 threads, then stays flat and even increases in some cases. A similar analysis of the ATLAS ODGEMM and MKL ODGEMM results shows good agreement with the timing data of the actual GAMESS executions and shows that the execution of GAMESS is dependent on the repeated multiplication of only a few different sizes of matrices.

Figure 2 shows execution time vs. number of threads for the HNO molecule using the cc-pVQZ basis set with 175 basis functions. The results using ESSL show that ESSL ODGEMM is slightly faster than ESSLSMP DGEMM for 1 and 2 threads, but when more than 2 threads are used ESSLSMP DGEMM continues to decrease in execution time while ESSL ODGEMM decreases more slowly and even increases for 12 and 16 threads. The increase when using ESSL ODGEMM is again probably attributable to the overhead of partitioning and using more threads than necessary on some matrix sizes. The MKL ODGEMM results in particular are striking. The execution time decreases slightly from 1 to 2 threads, but using 3 and 4 threads increases the execution time. The ATLAS ODGEMM results are as one would expect, namely that as the thread number increases, the time decreases, until the fastest time is obtained when 4 threads are used. Also note that the direct call of the threaded ATLAS PTDGEMM is essentially the same as that of ATLAS ODGEMM when 4 threads are used. As a reference, the wall time for this calculation on Redwing when using the default VBLAS DGEMM in GAMESS is 35907 seconds.

As in the earlier case, Table 4 was prepared with three test cases using matrix sizes that occur frequently in this GAMESS calculation. The results are similar to the earlier case in that the execution time of GAMESS with respect to the matrix multiplication is dominated by the multiplication of relatively few different sizes of matrices. The three test cases shown in Table 4 account for
Table 3. Matrix multiplication execution times using test cases where the matrix sizes are equal to the matrix sizes used on the HNO molecule with the cc-pVTZ basis set. Results are reported in 10^-3 seconds of wall-clock time. The PT column results were obtained by calling ATLAS PTDGEMM directly. The numbers in the column headings indicate the number of threads. Frequency is the number of times DGEMM is called with matrices of that size in the GAMESS execution. Total number of calls to DGEMM in the GAMESS execution is ≈ 3060.

Library          M      K      N      1     2     4     8     12    16    PT    Frequency
ATLAS ODGEMM     77     77     5929   26.1  20.1  18.4   -     -     -    14.2  420
MKL ODGEMM       77     77     5929   44.4  35.7  56.3   -     -     -     -
ESSLSMP DGEMM    77     77     5929   70.8  39.9  20.4  12.8  11.4  11.9   -
ESSL ODGEMM      77     77     5929   87.6  52.4  39.6  33.8  35.4  37.2   -
ATLAS ODGEMM     5929   6      77     7.67  9.63  8.44   -     -     -    4.84  420
MKL ODGEMM       5929   6      77     8.13  9.03  7.89   -     -     -     -
ESSLSMP DGEMM    5929   6      77     48.4  11.9  8.55  6.23  5.51  5.42   -
ESSL ODGEMM      5929   6      77     34.9  18.8  17.7  17.3  20.5  27.9   -
ATLAS ODGEMM     36     5929   77     28.0  20.5  64.0   -     -     -    31.9  1461
MKL ODGEMM       36     5929   77     26.2  19.6  31.5   -     -     -     -
ESSLSMP DGEMM    36     5929   77     43.8  19.3  13.1  13.1  12.1  11.2   -
ESSL ODGEMM      36     5929   77     39.1  21.1  28.7  32.3  37.2  35.9   -
Table 4. Matrix multiplication times using test cases where the matrix sizes are equal to the matrix sizes used on the HNO molecule with the cc-pVQZ basis set. Results are in 10^-2 seconds of wall-clock time. The PT column results were obtained calling ATLAS PTDGEMM directly. The numbers in the column headings indicate the number of threads. Frequency is the number of times matrices of that size are called in the GAMESS execution. Total number of calls to DGEMM in the GAMESS execution is ≈ 4860.

Library          M      K      N       1     2     4     8     12    16    PT    Frequency
ATLAS ODGEMM     167    167    27889   54.7  32.6  23.0   -     -     -    21.7  420
MKL ODGEMM       167    167    27889   44.6  44.1  57.1   -     -     -     -
ESSLSMP DGEMM    167    167    27889   132.  66.0  33.7  33.2  14.2  12.6   -
ESSL ODGEMM      167    167    27889   139.  77.9  47.9  33.1  28.9  27.6   -
ATLAS ODGEMM     27889  6      167     8.01  8.46  7.84  7.89  9.05  8.21  5.45  420
MKL ODGEMM       27889  6      167      -     -     -     -     -     -     -
ESSLSMP DGEMM    27889  6      167     21.8  11.3  5.88  3.09  2.71  2.63   -
ESSL ODGEMM      27889  6      167     29.0  18.7  13.5  11.6  11.2  11.1   -
ATLAS ODGEMM     36     27889  167     28.4  20.5  17.6   -     -     -    17.8  3172
MKL ODGEMM       36     27889  167     22.7  19.1  32.1   -     -     -     -
ESSLSMP DGEMM    36     27889  167     28.9  15.5  8.95  6.22  5.29  5.91   -
ESSL ODGEMM      36     27889  167     28.9  18.9  13.9  15.2  27.4  32.3   -
about 80% of the matrices multiplied. The matrix dimensions are again quite unequal, and it is clear that the execution time with respect to the matrix
multiplication is dominated by the repeated multiplication of only a few different sizes of matrices.

Figure 3 shows execution time as a function of the number of threads for GAMESS using glycine as the input molecule. An interesting feature of this graph is that the ATLAS ODGEMM timings decrease monotonically and there is a difference of over 500 seconds between one thread and four threads of execution. The MKL ODGEMM does much better than ATLAS ODGEMM when using one thread, and is slightly faster using two threads. But then the total time increases for three and four threads using MKL ODGEMM. ATLAS ODGEMM shows a consistent decrease in execution time upon the addition of processors, but MKL ODGEMM actually increases its execution time for three and four threads. Note that the ATLAS PTDGEMM result is almost exactly the same as the ATLAS ODGEMM result using 4 threads.
4 Conclusions and Future Work
One conclusion of this work is that if one wants to improve the performance of the CC module of GAMESS, an external BLAS library should be used. With unsophisticated testing on Redwing it was shown that almost two orders of magnitude improvement can be gained by linking with ATLAS or MKL. With respect to results calculated on Redwing, when using 1 thread, the MKL ODGEMM is faster than ATLAS ODGEMM. When 2 threads are used, the results are mixed with MKL ODGEMM running faster in some cases and ATLAS ODGEMM running faster in other cases. When 3 and 4 threads are used, especially in the GAMESS executions, ATLAS ODGEMM is consistently faster than MKL ODGEMM. When one looks at GAMESS execution time irrespective of the number of threads used, ATLAS PTDGEMM is almost always the fastest, and in Fig. 2 and Fig. 3 the ATLAS ODGEMM times are comparable. An unexpected result is that of the MKL ODGEMM when multiple threads are used. When more than 2 threads are used in the GAMESS execution using MKL ODGEMM, the times actually increase. The results obtained using Seaborg are worthy of note as well. When considering the results of the generic matrix multiplication of Section 3.1, the ESSL ODGEMM actually runs faster than the built in ESSLSMP DGEMM library. When compared to ATLAS, the threaded versions of ESSL are faster than the threaded ATLAS versions. The results from the GAMESS tests show that for thread numbers less than 4, ESSL ODGEMM and the ESSLSMP DGEMM give similar results. When adding more threads to the calculation, the ESSLSMP DGEMM consistently yields faster results than the ESSL ODGEMM, especially at high numbers of threads. This seems to suggest that the ESSLSMP DGEMM library has some built in mechanisms to determine better partitioning of the matrices. After testing on two different architectures, it is clear that calling ODGEMM with the number of threads equal to the number of processors does not always yield the fastest execution times. Based on whether the system is an Intel system
or IBM, our results show that the fastest execution time varies significantly based on the number of threads chosen and the BLAS library used. This information will be incorporated into the ODGEMM routine. The ODGEMM library will then be able to choose the appropriate call mechanism to the library DGEMM with properly partitioned matrices and an appropriate number of threads. For future work we wish to incorporate some matrix metrics into ODGEMM to make decisions about parallelization. For example, considering GAMESS executions, it is quite probable that some calls to DGEMM will run faster in parallel while some calls may be fastest running on one thread only, or with the number of threads less than the number of processors. These decisions will depend on factors such as matrix size, dimensional layout, and processor speed. Especially when using the vanilla BLAS which has no parallel implementation, the opportunity for performance enhancement is available via ODGEMM properly parallelizing the matrix multiplication. We are also going to perform testing and hence provide portability to more architectures, as GAMESS is currently available for a wide range of architectures and systems. The use of parallel DGEMM in GAMESS is a beginning for exploiting the advantages of SMP machines in GAMESS calculations. We will investigate the utilization of OpenMP at other levels within the chemistry software as well.
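One possible shape for the matrix-metric decision mentioned above is sketched below. This is purely illustrative and of our own construction: the thresholds and the flop-count work estimate are invented for the example, not taken from ODGEMM or from any measurements in this paper.

#include <omp.h>

/* Illustrative heuristic: pick a thread count for a given DGEMM call from a
   rough estimate of the work (2*M*N*K floating-point operations). */
static int choose_threads(int M, int N, int K)
{
    double flops = 2.0 * (double)M * (double)N * (double)K;
    int    max_t = omp_get_max_threads();

    if (flops < 1e7)  return 1;                    /* too small to parallelize  */
    if (flops < 1e9)  return max_t > 2 ? 2 : max_t; /* modest problems: 2 threads */
    return max_t;                                   /* large problems: all CPUs  */
}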
Acknowledgements

This work was performed under the auspices of the U.S. Department of Energy under contract W-7405-Eng-82 at Ames Laboratory, operated by the Iowa State University of Science and Technology. Funding was provided by the Mathematical, Information and Computational Science division of the Office of Advanced Scientific Computing Research. This material is based on work supported by the National Science Foundation under Grant No. CHE-0309517. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC03-76SF00098. Basis sets were obtained from the Extensible Computational Chemistry Environment Basis Set Database, Version 12/03/03, as developed and distributed by the Molecular Science Computing Facility, Environmental and Molecular Sciences Laboratory, which is part of the Pacific Northwest Laboratory, P.O. Box 999, Richland, Washington 99352, USA, and funded by the U.S. Department of Energy. The Pacific Northwest Laboratory is a multiprogram laboratory operated by Battelle Memorial Institute for the U.S. Department of Energy under contract DE-AC06-76RLO 1830. Contact David Feller or Karen Schuchardt for further information. The authors wish to thank Ryan Olson for his helpful comments with respect to implementation details of GAMESS.
References

1. M. W. Schmidt et al.: General Atomic and Molecular Electronic Structure System. J. Comput. Chem. 14 (1993) 1347-1363.
2. E. E. Santos: Parallel Complexity of Matrix Multiplication. J. Supercomp. 25 (2003) 155-175.
3. C. L. Lawson, R. J. Hanson, D. R. Kincaid and F. T. Krogh: Basic Linear Algebra Subprograms for Fortran Usage. ACM Trans. Math. Soft. 5 (1979) 308-323.
4. C. L. Lawson, R. J. Hanson, D. R. Kincaid and F. T. Krogh: ALGORITHM 539, Basic Linear Algebra Subprograms for Fortran Usage. ACM Trans. Math. Soft. 5 (1979) 324-325.
5. J. J. Dongarra, J. Du Croz, S. Hammarling and R. J. Hanson: An Extended Set of FORTRAN Basic Linear Algebra Subprograms. ACM Trans. Math. Soft. 14 (1988) 1-17.
6. J. J. Dongarra, J. Du Croz, S. Hammarling and R. J. Hanson: ALGORITHM 656, An Extended Set of Basic Linear Algebra Subprograms: Model Implementation and Test Programs. ACM Trans. Math. Soft. 14 (1988) 18-32.
7. J. J. Dongarra, J. Du Croz, S. Hammarling and I. Duff: A Set of Level 3 Basic Linear Algebra Subprograms. ACM Trans. Math. Soft. 16 (1990) 1-17.
8. GAMESS User's Guide. (http://www.msg.ameslab.gov/GAMESS/GAMESS.html).
9. P. Piecuch, S. A. Kucharski, K. Kowalski and M. Musial: Efficient computer implementation of the renormalized coupled-cluster methods: The R-CCSD[T], R-CCSD(T), CR-CCSD[T], and CR-CCSD(T) approaches. Comput. Phys. Commun. 149 (2002) 71-96.
10. R. C. Whaley, A. Petitet and J. J. Dongarra: Automated Empirical Optimization of Software and the ATLAS project. Parallel Computing 27 (2001) 3-35. Also available as University of Tennessee LAPACK Working Note #147, UT-CS-00-448, 2000 (http://www.netlib.org/lapack/lawns/lawn147.ps).
11. Intel Corporation: Intel Math Kernel Library, Reference Manual. (http://www.intel.com/software/products/mkl/docs/mklman61.htm).
12. IBM Corporation: Engineering and Scientific Subroutine Library for AIX Version 3 Release 3: Guide and Reference. (http://publib.boulder.ibm.com/doc_link/en_US/a_doc_lib/sp34/essl/essl.html).
13. T. H. Dunning Jr.: Gaussian basis sets for use in correlated molecular calculations. I. The atoms boron through neon and hydrogen. J. Chem. Phys. 90 (1989) 1007-1023.
Performance Analysis of Hybrid OpenMP/MPI N-Body Application

Rocco Aversa, Beniamino Di Martino, Nicola Mazzocca, and Salvatore Venticinque

DII, Seconda Università di Napoli, via Roma 29, 81031 Aversa (CE), Italy
Abstract. In this paper we show, through a case study, how the adoption of the MPI model for the distributed parallelism and of OpenMP parallelizing compiler technology for the inner shared memory parallelism allows us to obtain a hierarchical distributed-shared memory implementation of an algorithm presenting multiple levels of parallelism. The chosen application solves the well-known N-body problem.
1 Introduction
Clusters of bus-based shared memory multiprocessor systems (SMPs), where moderately sized multiprocessor workstations and PCs are connected with a high-bandwidth interconnection network, are gaining more and more importance for High Performance Computing. They are increasingly established and used to provide high performance computing at a low cost. Current parallel programming models are not yet designed to take hierarchies of both distributed and shared memory parallelism into account within a single framework. Programming hierarchical distributed-shared memory systems is currently achieved by means of the integration of environments, languages and libraries individually designed for either the shared or the distributed address space model. They range from explicit message-passing libraries such as MPI (for the distributed memory level) and explicit multithreaded programming (for the shared memory level), at a low abstraction level, to high-level parallel programming environments and languages, such as High Performance Fortran (HPF) [1] (for the distributed memory level) and OpenMP [2] (for the shared memory level). In this paper we report our experience in parallelizing a scientific application and in integrating its MPI [3] implementation and an OpenMP compiler for programming hierarchical distributed-shared memory multiprocessor architectures, in particular homogeneous clusters of SMP nodes.
2 A Case-Study: The N-Body Application
The chosen case-study is a sequential algorithm that solves the N-body problem by computing, during a fixed time interval, the positions of N bodies moving under their mutual attraction. The algorithm is based on a simple approximation: the force on each particle is computed by agglomerating distant particles into
groups and using their total mass and center of mass as a single particle. The program repeats, during a fixed time interval, three main steps: building an octal tree (octree) whose nodes represent groups of nearby bodies; computing the forces acting on each particle through a traversal of the octree; and updating the velocities and the positions of the N particles.

During the first step, the algorithm, starting from the N particles, builds the octree, storing its nodes in an array using a depth-first strategy. At each level in the octree all nodes have the same number of particles, until each leaf remains with a single particle. The subsequent level in the octree can be obtained by repeatedly splitting the particles in each node into two halves on the basis of the three spatial coordinates (x,y,z). At the end of this stage a tree node represents a box in space characterized by its position (the two distant corners of the box) and by the total mass and the center of mass of all the particles enclosed in it. The second stage of the algorithm computes the forces on each of the N particles by traversing the octree built above. Each particle traverses the tree. If the nearest corner of the box representing the current node is sufficiently distant, the force due to all the particles in the box is computed from the total mass and center of mass of the box. Otherwise the search descends to the lower levels of the tree, if necessary down to the leaves. Once the force components acting on the particles have been obtained, their velocities and positions can be updated. The last step computes the new velocities and the new positions of the N particles using a simple Euler step.

The basis for the development of the hybrid code is a sequential version of a conventional Barnes-Hut tree algorithm [6]. This can be described by the following pseudo-code:

/* main loop */
while (time < end) {
    /* build the octree */
    build_tree(p,t,N);
    /* compute the forces */
    forces(p,t,N);
    /* compute the minimal time step */
    delt = tstep(p,N);
    /* update the positions and velocities of the particles */
    newv(p,N,delt);
    newx(p,N,delt);
    /* update simulation time */
    time = time + delt;
}
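As a reading aid, the following sketch (our own; the field names are invented, and the paper actually stores each node as a flat array of 11 double values rather than a C struct, the extra slot presumably holding bookkeeping such as a particle count) summarizes the per-node data implied by the description above.

/* Illustrative per-node layout: bounding box, total mass and center of mass
   of the enclosed particles (10 doubles of physical data per node). */
typedef struct {
    double corner_lo[3];   /* one corner of the bounding box                */
    double corner_hi[3];   /* the opposite, distant corner                  */
    double mass;           /* total mass of the particles in the box        */
    double com[3];         /* center of mass of the particles in the box    */
} octree_node;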
3 Parallelizing the Code
There is a wide body of knowledge on the parallelization of this problem, and a lot of highly parallel and optimized algorithms and implementations have been devised through the years (see for example [4], [8], [9]). A first analysis of the sequential algorithm suggests some preliminary considerations that can drive the
parallelization strategy. The first step of the algorithm (the building of the octree) appears to be a typical master-slave computation: a master reads the input data and starts building the first levels of the octal tree until the elaboration can proceed in parallel. The force computation step can be classified as a data parallel computation, since for each particle we can calculate the force acting on it in an independent way. However, it is worth noting that this task, which is also the most compute-intensive one in the algorithm, cannot be statically balanced, since the computational requirements are strongly data-dependent (for different particles the traversal can stop at different levels of the octree). Finally, the computation steps that update the particles' velocities and positions are still a data parallel portion of the code, but one that can be parallelized and statically balanced in a trivial way.

The hybrid version of the application has been implemented by integrating the OpenMP directives in the MPI parallel code. The MPI model allows us to exploit coarse-grain parallelization, creating the processes on different nodes where different subproblems are solved. As the program data are stored in two different arrays, it is almost trivial to assign different chunks to each worker. The size of each chunk is computed statically once at the beginning of the program. According to the data distribution strategy, each worker has to compute the new position and velocity of the particles assigned to it. We are able to distribute the data among the workers by exploiting the MPI collective communication primitives. Each communication phase becomes a barrier because, in order to complete each step of the algorithm, it is necessary to get the complete results from the previous one. Hence we have three communication phases: the update of the tree, the reduction of the time step and the update of the particles. The first and the third are realized by an All_gatherv primitive that allows the transmission of the results from each worker to all the others. The second communication is realized by an All_reduce primitive that returns the minimum value among the ones sent by all the workers. The SPMD algorithm can be described by the following pseudo-code:

/* main loop */
while (time < end) {
    /* build the octree */
    build_tree(p,t,N,my_first_node,my_last_node);
    All_gatherv(my_subtree, tree);
    /* compute the forces */
    forces(p,t,N,my_first_particle,my_last_particle);
    /* compute the minimal time step */
    delt = tstep(p,N,my_first_particle,my_last_particle);
    All_Reduce(delt, new_delt);
    /* update the positions and velocities of the particles */
    newv(p,N,delt,my_first_particle,my_last_particle);
    newx(p,N,delt,my_first_particle,my_last_particle);
    All_gatherv(my_particles, allparticles);
    /* update simulation time */
    time = time + delt;
}
As we can see, the execution of the sequential routines is performed on just a subset of elements of the original data, and the results are distributed to all the workers. The OpenMP directives allow parallelization at a finer level: the inner loops computed by a single worker can be distributed to different threads on an SMP node. Starting from the analysis of the sequential code, it is straightforward to exploit the OMP parallel for directive when long and heavy iterations have to be computed. Recall that the program data are stored in arrays and that the same computation often has to be performed for each element. The most compute-intensive phase of the algorithm is the computation of the forces acting on all the particles. It is implemented as a loop with as many iterations as there are particles. Each iteration lasts a different time, as the tree can be traversed to a greater or lesser depth according to the distance of the considered particle from the other ones. This consideration led us to choose a dynamic schedule of the chunks. Elsewhere a static allocation of the iterations is better suited; for example, updating the positions and velocities requires the same time for each particle. Finally, some service routines, which are employed to compute the maximum or the minimum among a set of values, can be parallelized in the same way.

The use of a cluster of SMPs and of a hybrid shared-memory and message-passing programming model based on MPI and OpenMP provides the programmer with a wide range of computational alternatives that must be explored in order to obtain high performance from the system [5, 7]. A good integration of the shared memory and distributed memory paradigms should show up as a performance improvement in communications, since threads on the same node communicate through shared memory. Hence we chose to allocate one MPI process on each node. The OpenMP sections are executed at run time by two threads (the number of processors on each node).
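The fragment below is a minimal sketch of how the two levels fit together (the routine, type and variable names are our own illustrations, not the authors' source code): each MPI worker applies an OpenMP parallel for with a dynamic schedule to its own particle range, and the locally computed time steps are then combined with an MPI reduction.

#include <mpi.h>
#include <omp.h>

/* assumed types and helpers, declared here only to keep the sketch compilable */
typedef struct particle particle_t;
typedef struct node node_t;
void   compute_force(particle_t *p, node_t *tree, int nnodes);
double tstep_range(particle_t *p, int first, int last);

/* Hybrid force step for one MPI worker: the dynamic schedule balances the
   uneven octree traversals among the node's threads, and MPI_Allreduce
   selects the smallest time step over all workers. */
void force_step(particle_t *p, node_t *tree, int nnodes,
                int my_first, int my_last, double *delt)
{
    double local_delt;
    int i;

    #pragma omp parallel for schedule(dynamic)
    for (i = my_first; i <= my_last; i++)
        compute_force(&p[i], tree, nnodes);  /* traversal depth varies per particle */

    local_delt = tstep_range(p, my_first, my_last);

    MPI_Allreduce(&local_delt, delt, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);
}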
4 Experimental Results
The system used as a testbed for the chosen application is the Cygnus cluster of the Parsec Laboratory at the Second University of Naples. This is a Linux cluster with 4 SMP nodes, each equipped with two Pentium Xeon 1 GHz processors, 512 MB RAM and a 40 GB HD, and with a dual Pentium III 300 frontend. The application was executed in three different conditions: MPI-only (i.e., not hybrid) decomposition into four tasks, MPI-only into eight tasks, and Hybrid (i.e., MPI-OpenMP) into four tasks with two threads each. It should be noted that the first decomposition is expected to lead to the poorest figures, since only four of the eight available CPUs are exploited. On the other hand, as the first set of results showed, it is not trivial to foresee the performance behavior of the parallel program. The amount of computation and communication, and the pattern of the latter, in each phase of the program affect the final performance figure of the application. We obtained a first set of results by always executing the application with the same set of input data. Table 1 shows, in the first two columns, the speed-up we get for the total execution and for the most compute-intensive phase (the forces computation).
Table 1. Application speedup

Code Section   Total speed-up   Forces speed-up   MP overhead (%)
MPI-4          3.588            3.645             1.22%
MPI-8          7.019            7.293             3.66%
Hybrid         6.649            6.87              2.57%

Table 2. Application speedup with MPICH-G2

Configuration   Forces Speedup   MP overhead (%)
MPI 2/2         1.97             17.68%
MPI 2/1         1.957            10.3%
MPI 4/4         3.805            25.4%
MPI 4/2         3.898            24.4%
MPI 8/4         7.817            32.90%
Hyb 2/2         3.727            17.68%
Hyb 4/4         7.30             27.5%
The third column shows the relative overhead of the message passing communication compared with the total execution time. The measure shown in the second column is the mean time required by each worker just to compute the forces on its group of particles; it does not take into account the communication overhead. As we can see, the total speedup is not far from it. The third column shows that the communication overhead is not relevant for the final performance figure. However, we were able to appreciate an improvement of the communication time in the hybrid version of the program, which is very close to the MPI-4 version. The better behavior of the pure MPI implementation is due to the very small amount of communication and to the presence of some race conditions in the parallel for variables. On the other hand, we need to reserve much more memory in order to run many MPI processes rather than a smaller number of hybrid processes.

With a second set of experiments we configured the cluster as a distributed system composed of four nodes, each of which hosts a Globus gatekeeper [10] that is able to run MPICH-G2 [11] jobs. The building and installation of the GT2 implementation of the Globus toolkit, using the Intel compilers, did not prove to be a trivial task. However, the Intel compiler allowed us to exploit its better performance and its OpenMP support for SMP programming. We submitted the MPICH-G2 requests from the frontend node, spreading the MPI processes over the different nodes. We obtained comparable speedups even if the real user time increases a lot, up to 34 sec. for the execution of eight MPI processes, because of the Globus overhead. In fact, each gatekeeper needs to authenticate the user before launching the job. Table 2 shows the speedup for the most significant code section and the amount of communication compared to the total execution time. Each line of the table refers to a different configuration of processes/nodes. We also tested the application changing the input data. We kept the number of particles the same and changed the mass, velocities and distribution of the
Table 3. Real Execution Timing

Code Section   Execution time (one column per MPI process)
t tree         401.5    397.2    388.2    391.1    388.9    388.9    386.4    389.54
t gattree      0.5889   4.91     13.82    10.97    13.17    13.17    15.7     12.6
t forces       501.3    330      0.380    0.380    0.377    0.377    0.377    0.375
t reduce       0.0330   170.8    500.8    500.7    500.8    500.8    500.8    500.8
t update       0.0136   0.0133   0.0135   0.0134   0.0134   0.0134   0.0134   0.0134
t gatpar       0.305    0.327    0.274    0.295    0.287    0.287    0.274    0.296
particles in space. In this case the execution time for the sequential version increases a lot, but the parallel implementation does not scale with the number of MPI processes. Table 3 shows that the application is very unbalanced. The first and the second processes are overloaded because their particles are very close in space and they need to traverse the whole tree in order to compute the forces. The communication cost is very small (0.1%), but the fast processes are blocked in the reduction phase, waiting for the results from the other workers. In this case we need a dynamic load balancing mechanism to overcome the loss of performance. The update step is purely data parallel and its execution time grows only with the number of particles. The tree-building step is never as unbalanced as the force computation.

Comparing the hybrid implementation with the pure MPI version, we must note that, since every worker needs the entire tree in the force computation phase, the data are replicated. If the number of particles is equal to NP, then the number of tree nodes becomes NODES = (NP*D - 1)/(D - 1), where D equals 8 for our octal tree. As each node is an array of 11 double values, the memory requirement is equal to NODES*11*8 bytes. For the described experiment the number of particles is equal to 32768, so NODES = (32768*8 - 1)/7 = 37449 and the memory used to store just the tree is more than 3 Mbytes. We will have problems with memory requirements when increasing the number of workers that execute on the same node or when increasing the problem size. Using a single multithreaded process per node, we can relax the memory constraints. As for communication, on a local network the overhead is due above all to the number of messages rather than to their size. However, we should expect that on a geographically distributed network the message size could also heavily affect performance. Future work will study this kind of behavior both on local and wide area networks, in cluster and grid environments.
5 Conclusions
The use of a hybrid MPI-OpenMP decomposition is an interesting solution for developing high performance codes for clusters of SMPs. We described the use of this kind of parallel programming model for a well-known application. We presented the parallel program and the results obtained using a cluster of SMP nodes, each of them equipped with two processors. We tested the application using both the MPICH and the MPICH-G2 libraries, obtaining comparable speedups with the same input data. The experimental results also showed how irregular the presented application is: different sets of input data can produce a loss of performance because of load imbalance in the central code section.
References

1. High Performance Fortran Forum: High Performance Fortran Language Specification, Version 2.0, Rice University, 1997.
2. OpenMP Architecture Review Board: OpenMP Fortran Application Program Interface, ver. 1.0, October 1997.
3. Message Passing Interface Forum: MPI: A Message-Passing Interface standard. International Journal of Supercomputer Applications 8(3-4):165-414, 1994.
4. A. Grama, V. Kumar and A. Sameh: Scalable parallel formulations of the Barnes-Hut method for n-body simulations. Parallel Computing 24, No. 5-6 (1998) 797-822.
5. T. Boku et al.: Implementation and performance evaluation of SPAM particle code with OpenMP-MPI hybrid programming. Proc. EWOMP 2001, Barcelona (2001).
6. J. Barnes and P. Hut: A Hierarchical O(N log N) Force Calculation Algorithm. Nature 324 (1986) 446-449.
7. Rolf Rabenseifner: Hybrid Parallel Programming on HPC Platforms. Proc. EWOMP '03 (2003) 185-194.
8. T. Fahringer, A. Jugravu, B. Di Martino, S. Venticinque, H. Moritsch: On the Evaluation of JavaSymphony for Cluster Applications. IEEE Intl. Conf. on Cluster Computing, CLUSTER 2002, Chicago, Illinois, Sept. 2002.
9. R. Aversa, B. Di Martino, N. Mazzocca, M. Rak, S. Venticinque: Integration of Mobile Agents and OpenMP for programming clusters of Shared Memory Processors: a case study. In Proc. of PaCT 2001 Conference, 8-12 Sept. 2001, Barcelona, Spain.
10. I. Foster, C. Kesselman: Globus: A Metacomputing Infrastructure Toolkit. Intl. J. Supercomputer Applications, 11(2):115-128, 1997. Provides an overview of the Globus project and toolkit.
11. N. Karonis, B. Toonen, and I. Foster: MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface. Journal of Parallel and Distributed Computing, 2003.
Performance and Scalability of OpenMP Programs on the Sun Fire™ E25K Throughput Computing Server

Myungho Lee (1), Brian Whitney (2), and Nawal Copty (3)

(1) Compiler Performance Engineering, (2) Strategic Application Engineering, (3) SPARC Compilation Technology
Scalable Systems Group, Sun Microsystems, Inc.
{myungho.lee, brian.whitney, nawal.copty}@sun.com
Abstract. The Sun Fire™ E25K is the first generation of high-end servers built to the Throughput Computing strategy of Sun Microsystems. The server can scale up to 72 UltraSPARC® IV processors with Chip MultiThreading (CMT) technology, and execute up to 144 threads simultaneously. The Sun Studio™ 9 software includes compilers and tools that provide support for the C, C++, and Fortran OpenMP Version 2.0 specifications, and that fully exploit the capabilities of the UltraSPARC IV processor and the E25K server. This paper gives an overview of the Sun Fire E25K server and OpenMP support in the Sun Studio 9 software. The paper presents the latest world-class SPEC OMPL benchmark results on the Sun Fire E25K, and compares these results with those on the UltraSPARC III Cu based Sun Fire 15K server. Results show that base and peak ratios for the SPEC OMPL benchmarks increase by approximately 50% on the Sun Fire E25K server, compared to a Sun Fire 15K server with the same number of processors and with higher clock frequency.
1 Introduction

OpenMP [1] has become the standard paradigm for parallelizing applications for shared memory multiprocessor (SMP) systems. One advantage of OpenMP is that it requires a relatively small amount of coding effort when parallelizing existing sequential programs. As the demand for more computing power by high performance technical computing applications grows, achieving high performance and scalability using OpenMP becomes increasingly important. The Sun Fire E25K [2] is the flagship, massively scalable, highly available server from Sun Microsystems based on the UltraSPARC IV processor with Chip MultiThreading (CMT) technology. Compared with existing UltraSPARC III Cu based high-end servers, such as the Sun Fire 15K, the E25K server can offer up to twice the compute power or throughput. In order to fully utilize the computing power of the E25K server for computationally demanding OpenMP applications, sophisticated OpenMP support from the compiler, run-time system, tools, and the Solaris™ 9 Operating System [3] is exploited.
The Sun Studio 9 software [4] provides support for the C, C++, and Fortran OpenMP Version 2.0 specifications, and delivers world class performance for high performance technical computing applications written using OpenMP directives. The software includes (a) compilers that apply advanced optimizations to best use the capabilities of the UltraSPARC IV processor and memory subsystem; (b) the OpenMP runtime library that supports efficient management, synchronization, and scheduling of OpenMP threads; and (c) tools for checking, debugging, and analyzing the performance of OpenMP applications. All these features in the Sun Studio 9 software enable the user to exploit large scale parallelism while maximizing the utilization of the Sun Fire E25K system. The rest of the paper is organized as follows. Section 2 describes the Sun Fire E25K server architecture and new features in the Solaris 9 Operating System. Section 3 gives an overview of the compiler, runtime, and tools support for OpenMP in the Sun Studio 9 software. Section 4 describes the SPEC OMPL benchmark suite [5] and how OpenMP applications can be tuned to obtain high performance and scalability on the Sun Fire E25K server. In addition, Section 4 presents the latest SPEC OMPL benchmark results on the Sun Fire E25K server and compares these results with those on the Sun Fire 15K server. Finally, Section 5 presents some conclusions and future directions.
2 The Sun Fire E25K Server The Sun Fire E25K server is the first generation of the high-end servers built to the Throughput Computing strategy of Sun Microsystems, which aims to dramatically increase application throughput via Chip MultiThreading (CMT) technology. The server is based on the UltraSPARC IV processor and can scale up to 72 processors executing 144 threads simultaneously. The new system offers up to twice the compute power of existing UltraSPARC III Cu based high-end systems. The basic computational component of the Sun Fire E25K server is the UniBoard. Each UniBoard consists of up to four UltraSPARC IV processors, their L2 caches, and associated memory. The UltraSPARC IV is the first general-purpose, dual-threaded CMT processor from Sun Microsystems (see Figure 1). It contains two enhanced UltraSPARC III Cu based Thread Execution Engines (TEEs), a memory controller, and the necessary cache tag memory for 8 MB of external L2 cache per TEE. The off-chip L2 cache is 16 MB in size (8 MB per TEE). The two TEEs share the Fireplane System Interconnect, as well as the L2 cache bus. The line size of L2 cache in the UltraSPARC IV is 128 bytes (in two 64-byte subblocks). Performance improvements were made in both the FP Adder and the Write Cache of the integrated pipelines. The new FP Adder circuitry handles more NaN and underflow cases in hardware than the FP Adder of the UltraSPARC III Cu processor. The Write Cache is enhanced with a hashed index to handle multiple write streams more efficiently. The Solaris 9 Operating System provides scalability and high performance for the Sun Fire E25K server. Features in the Solaris 9 OS, such as Memory Placement Optimizations (MPO) and Multiple Page Size Support (MPSS), are very effective in achieving high performance for OpenMP programs:
• MPO allows the Solaris 9 OS to allocate pages close to the processors that access those pages. The default MPO policy, called first-touch, allocates memory on the UniBoard containing the processor that first touches the memory. The first-touch policy can greatly improve the performance of applications where data accesses are made mostly to the memory local to each processor (a small code sketch follows this list).
• MPSS allows a program to use different page sizes for different regions of virtual memory. The UltraSPARC IV processor supports four different page sizes: 8 KB, 64 KB, 512 KB, and 4 MB. With MPSS, programs can request one of these four page sizes. With large page sizes, the number of Translation-Lookaside Buffer (TLB) misses can be significantly reduced; this can lead to improved performance for applications that use a large amount of memory.
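The first-touch behavior described above suggests a simple programming idiom. The C sketch below is our own illustration (the array name, size, and schedule are invented, and it is not taken from Sun documentation): data is initialized inside a parallel loop so that each page is first touched by the thread that will later compute on it, and a later loop with the same static schedule then finds its data in local memory.

```c
#include <stdlib.h>

#define N (1 << 24)

int main(void)
{
    double *x = malloc((size_t)N * sizeof *x);
    if (x == NULL)
        return 1;

    /* Parallel first-touch initialization: under the default MPO policy
     * each page is allocated on the UniBoard of the thread that touches
     * it first, so this static distribution decides where pages live. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        x[i] = 0.0;

    /* A later loop with the same static schedule finds its portion of x
     * in local memory (and partly in the local caches). */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        x[i] = x[i] + 1.0;

    free(x);
    return 0;
}
```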
Fig. 1. High-Level View of the UltraSPARC IV Processor
3 OpenMP Support in the Sun Studio 9 Software The Sun Studio 9 software includes compilers and tools that provide support for the C, C++, and Fortran OpenMP Version 2.0 specifications. These compilers and tools fully exploit the capabilities of the UltraSPARC IV processor and the E25K server. In this section, we give an overview of this support. 3.1 Compiler and Runtime Support The Sun Studio 9 compilers process OpenMP directives and transform these directives to code that can be executed by multiple threads. The underlying execution
model is the fork-join model of execution where a master thread executes sequentially until a parallel region of code is encountered. At that point, the master thread forks a team of worker threads. All threads participate in executing the parallel region concurrently. At the end of the parallel region (the join point), the master thread alone continues sequential execution. The OpenMP runtime library, libmtsk, provides support for thread management, synchronization, and scheduling of parallel work. The runtime library supports multiple user threads; that is, the library is thread-safe. If a user program is threaded via explicit calls to the POSIX or Solaris thread libraries, then libmtsk will treat each of the user program threads as a master thread, and provide it with its own team of worker threads. OpenMP parallelization incurs an overhead cost that does not exist in sequential programs. This includes the cost of creating threads, synchronizing threads, accessing shared data, allocating copies of private data, bookkeeping of information related to threads, and so on. Sun Studio 9 compilers and libmtsk employ a variety of techniques to reduce this overhead. The overhead of synchronization is minimized by using fast locking mechanisms, padding of lock variables, and using tree structures for barriers and for accumulating partial reduction results. The cost of accessing shared data is minimized by making a copy of read-only shared data on each thread stack. The cost of allocating private data is reduced by allocating small arrays on the thread stacks, instead of the heap. The cost of accessing threadprivate data is reduced by allocating threadprivate data such that false sharing is minimized. The cost of thread creation is alleviated by not destroying the worker threads at the end of a parallel region, but reusing the worker threads to execute subsequent parallel regions. To improve the performance of OpenMP programs, the user can set a number of Sun-specific environment variables that control the execution environment:
• The SUNW_MP_PROCBIND environment variable allows the user to bind lightweight processes (LWPs) of an OpenMP program to processors. This prevents the threads from migrating to different processors and can result in improved cache locality.
• The SUNW_MP_THR_IDLE environment variable allows the user to specify whether the worker threads spin or sleep while waiting for work. Putting the threads to sleep improves performance on over-subscribed systems, where the number of threads running a program is greater than the number of TEEs available.
• The SUNW_MP_GUIDED_SCHED_WEIGHT environment variable allows the user to specify the weight factor used in computing the chunk sizes for a loop with the GUIDED schedule. By carefully choosing an appropriate weight factor, the user can achieve improved load balance.
3.2 Tools Support The Sun Studio 9 software provides various tools for parallelizing a program, as well as tools for debugging, error checking, and analyzing the performance of OpenMP programs. We describe some of these tools below.
Automatic Scoping of Variables. This is a Sun-specific extension to OpenMP in the Sun Studio 9 Fortran 95 compiler [6]. This feature allows the user to specify which variables in a given parallel region should be scoped automatically by the compiler. The compiler determines the appropriate scopes for these variables by analyzing the program and applying a set of autoscoping rules. The scoping results are displayed in an annotated source code listing as compiler commentary. This automatic scoping feature offers a very attractive compromise between automatic and manual parallelization. OpenMP Debugging. The dbx debugger tool can be used to debug C and Fortran OpenMP programs. All of the dbx commands that operate on threads can be used for OpenMP debugging. dbx allows the user to single-step into a parallel region, set breakpoints in the body of an OpenMP construct, as well as print the values of SHARED, PRIVATE, THREADPRIVATE, etc., variables for a given thread. Static and Runtime Error Checking. Sun Studio 9 software provides compiler options and runtime environment variables that enable the static and runtime checking of OpenMP programs. Problems reported include inconsistencies in the use of OpenMP directives, data race conditions, errors in alignment, disagreement in the number or type of procedure arguments, etc. Collector and Performance Analyzer. The Collector and Performance Analyzer [7] are a pair of tools that can be used to collect and analyze performance data for an application. The Collector collects performance data using a statistical method called profiling and by tracing function calls. The data can include call-stacks, micro-state accounting information, thread-synchronization delay data, hardware-counter overflow data, memory allocation data, summary information for the process, etc. The Performance Analyzer processes the data recorded by the Collector, and displays various metrics of performance at program, function, caller-callee, source-line, and assembly instruction levels. The Performance Analyzer can also display the raw data in a graphical format as a function of time. Information related to OpenMP programs, such as the time spent at barriers, is presented in a way that corresponds to the original source code. This information can aid the programmer in finding performance bottlenecks in the source code where performance tuning is needed. For detailed information on the Collector and Performance Analyzer, and their use in tuning applications performances, see [8]. 3.3 Other Compiler Optimizations Sun Studio 9 compilers perform extensive analysis and code optimizations to improve the performance of a program. Optimizations that enhance the performance of OpenMP programs include automatic parallelization, scalar optimizations, loop transformations, inter-procedural optimizations, advanced data prefetching, memory optimizations, inlining, cloning, loop unrolling, skewed tiling, among many others. In what follows, we give a brief overview of some of the optimizations that are performed.
Scalar Optimizations. The compilers implement many scalar optimizations, such as arithmetic re-association, condition elimination, forward substitution, branch combination, loop invariant hoisting, induction variable-based optimizations, dead code elimination, common sub-expression elimination, etc. Loop and Cache Optimizations. The compilers take advantage of the memory hierarchy and improve cache locality by implementing many loop and memory-oriented transformations. These include loop distribution, loop fusion, loop interchange, loop tiling, loop unimodular transformations, array contraction, etc. Prefetching. To alleviate memory latency, the compilers analyze the code to determine which references are likely to incur cache misses at runtime. The compilers then insert prefetch instructions in the code for these references. Prefetch candidates are determined based on reuse information. Sun's compilers can prefetch indirect array references and handle complex array subscripts. Optimizations for Instruction-Level Parallelism. Modern processors have many functional units and are able to issue and execute several instructions in any given cycle. To utilize these capabilities, the compilers perform optimizations that improve instruction level parallelism, such as unroll-and-jam, scalar replacement, instruction scheduling, etc. Interprocedural Analysis and Optimizations. The compilers perform interprocedural side-effect analysis, alias analysis, and whole program analysis. These analyses are used for various optimizations, including cache-conscious data layout, procedure inlining, procedure cloning, and constant propagation. Automatic Parallelization. The compilers perform data dependence analysis and various loop transformations to enable automatic parallelization of loops. In particular, the compilers perform loop distribution to generate parallelizable loops, loop fusion to increase granularity of parallelizable loops, and loop interchange to generate parallelizable outer loops. The compilers also recognize reduction operations for both scalar and array variables. Directive-based parallelization and automatic parallelization can be freely mixed in a program.
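As a generic illustration of one of the loop transformations listed above (this is textbook loop interchange, not output of the Sun compilers), swapping the loops of a strided C array traversal yields a stride-1 inner loop that consumes each cache line fully before moving on:

```c
#include <stdio.h>

#define N 1024
static double a[N][N], b[N][N];

/* Original nest: the inner loop walks down a column, so successive
 * iterations touch elements N*sizeof(double) bytes apart. */
static void copy_strided(void)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] = b[i][j];
}

/* After loop interchange: the inner loop walks along a row, so each
 * cache line (128 bytes in the UltraSPARC IV L2) is consumed completely
 * before the next line is fetched. */
static void copy_interchanged(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = b[i][j];
}

int main(void)
{
    copy_strided();
    copy_interchanged();
    printf("%f\n", a[N - 1][N - 1]);
    return 0;
}
```

Loop tiling, fusion, and distribution restructure the iteration space in an analogous way to improve locality or expose parallelism without changing the computed result.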
4 Tuning and Performance of the SPEC OMPL Benchmarks on the E25K Server In this section we describe the SPEC OMPL benchmark suite and give an overview of our performance tuning of OpenMP applications. We also present the latest Sun SPEC OMPL results on the Sun Fire E25K server. 4.1 SPEC OMPL Benchmarks
The SPEC OMPL benchmark suite consists of nine application programs written in C and Fortran, and parallelized using OpenMP. These benchmarks are representative of high performance technical computing applications from the areas of chemistry, mechanical engineering, climate modeling, and physics (see Table 1). Each benchmark requires a memory size of up to 6.4 GB when running on a single processor. The benchmarks, therefore, target large-scale systems with a 64-bit address space.
Table 1. List of SPEC OMPL Benchmarks

Benchmark       Application Area                Language
311.wupwise_l   Quantum chromodynamics          Fortran
313.swim_l      Shallow water modeling          Fortran
315.mgrid_l     Multi-grid solver               Fortran
317.applu_l     Partial differential equations  Fortran
321.equake_l    Earthquake modeling             C
325.apsi_l      Air pollutants                  Fortran
327.gafort_l    Genetic algorithm               Fortran
329.fma3d_l     Crash simulation                Fortran
331.art_l       Neural network simulation       C
4.2 Performance Tuning Performance tuning of an OpenMP application requires first identifying bottlenecks in the program. These bottlenecks can be related to OpenMP directives used in the program or to user code. As mentioned earlier, OpenMP support in the compilers and the runtime library, libmtsk, incurs an overhead cost which is not present if the program is not parallelized. In addition, to maintain program correctness, the compilers may disable certain optimizations on code enclosed by OpenMP directives. Sun Studio 9 compilers minimize the overheads associated with OpenMP. Bottlenecks in user code not related to OpenMP directives can be removed by advanced compiler optimizations such as scalar optimizations, loop transformations, memory hierarchy optimizations, advanced prefetching, and inter-procedural optimizations. These optimizations are activated by compiler options specified by the user. Automatic parallelization by the compiler, beyond user-specified parallelization, can further improve performance. OpenMP loops with the GUIDED schedule can achieve better load balancing and improved performance by using the SUNW_MP_GUIDED_SCHED_WEIGHT environment variable. If there are parts of the code where the compiler can do limited optimizations, the user can rewrite the code to enable more compiler optimizations. For example, in the latest submission of SPEC OMPL results by Sun Microsystems, source changes were proposed and used for the 325.apsi_l benchmark to improve initial data distributions. The proposed changes resulted in improved scalability and performance of the 325.apsi_l benchmark. The SPEC HPG committee reviewed and approved the proposed source changes. OpenMP threads can be bound to processors using the environment variable SUNW_MP_PROCBIND. Processor binding, when used along with static scheduling, benefits applications that exhibit a certain data reuse pattern where data accessed by a thread in a parallel region will either be in the local cache from a previous invocation of a parallel region, or in local memory due to the OS first-touch memory allocation policy. Processor binding greatly improves the performance of
such applications because of better cache locality and local memory accesses. Processor binding also leads to consistent timing results over multiple runs of the same program, which is important in benchmarking. Efficiently utilizing Memory Placement Optimizations (MPO) in the Solaris 9 Operating System can significantly improve the performance of programs with intensive data accesses to localized regions of memory. With MPO, memory accesses can be kept on the local UniBoard most of the time, whereas, without MPO, those accesses would be distributed over the UniBoards (both local and remote) which can become very expensive. Finally, using large pages for programs that require a large amount of memory can significantly reduce the number of TLB entries needed for the program and the number of TLB misses, thus significantly improving the performance. 4.3 SPEC OMPL Results In February 2004, Sun announced new world-record SPEC OMPL results using 143 threads on the Sun Fire E25K server with 72 UltraSPARC IV processors running at 1050 MHz [9]. Sun Studio 9 compilers and the Solaris 9 Operating System were used for the announced results.

Table 2. Performance Comparison of SPEC OMPL on Sun Fire 15K and Sun Fire E25K

                 Sun Fire 15K                       Sun Fire E25K                    Improvement Factor on
                 (72 x 1200 MHz US III Cu Procs)    (72 x 1050 MHz US IV Procs)      Sun Fire E25K vs. Sun Fire 15K
                 71 Threads                         143 Threads
Benchmark        Base Ratio    Peak Ratio           Base Ratio    Peak Ratio         Base      Peak
311.wupwise_l    243196        303373               375539        487262             1.54      1.61
313.swim_l       205754        270239               223826        297689             1.09      1.10
315.mgrid_l      177695        237237               214577        306847             1.21      1.29
317.applu_l      176695        177501               235603        244551             1.33      1.38
321.equake_l     61752         105899               70107         114639             1.14      1.08
325.apsi_l       93757         93757                130228        179767             1.39      1.92
327.gafort_l     184157        184157               228430        244119             1.24      1.33
329.fma3d_l      155668        198462               158909        285845             1.02      1.44
331.art_l        321045        734672               1920101       2017639            5.98      2.75
Geometric Mean   163554        213466               240622        316182             1.47      1.48
Comparing these results with the earlier 71-thread results on the Sun Fire 15K server with 72 UltraSPARC III Cu processors running at 1200 Mhz using Sun Studio 8 compilers, we find that the overall base ratio increased by 1.47x (240622 vs. 163554), and the overall peak ratio increased by 1.48x (316182 vs. 213466). Table 2 compares the results obtained on the Sun Fire 15K and the Sun Fire E25K. This increase in the overall base and peak ratios is significant, considering that the UltraSPARC IV processor uses the same pipeline as the UltraSPARC III Cu processor, and is clocked lower at 1050 MHz instead of 1200 MHz. The projected increases in the overall base and peak ratios on a Sun Fire E25K with the same clock frequency (1200 MHz) are 1.68x and 1.69x, respectively. Table 2 shows that most benchmarks show good improvements in base and peak ratios, ranging from 1.09x to 1.92x. 331.art_l scales super-linearly with the advanced memory optimizations performed by the compiler. 325.apsi_l also shows impressive scalability for peak. Memory-intensive benchmarks such as 313.swim_l and 321.equake_l do not scale as well. Since the two TEEs on each UltraSPARC IV processor share the same memory and L2 cache bus, the effective memory bandwidth for each TEE is reduced. This results in low scalabilities for these benchmarks. Figure 2 gives a graphical representation of the improvements in base and peak ratios of the SPEC OMPL benchmarks on the Sun Fire E25K server, compared to the Sun Fire 15K server.
Fig. 2. Improvements in Base and Peak Ratios of SPEC OMPL Benchmarks on Sun Fire E25K Compared to Sun Fire 15K
5 Conclusions and Future Directions In this paper, we gave an overview of the Sun Fire E25K throughput computing server and OpenMP support in the Sun Studio 9 software. We presented results that demonstrated the superior performance and scalability of the SPEC OMPL benchmarks on the E25K server. The results show that the base and peak ratios of the
SPEC OMPL benchmarks increased by approximately 50% on a Sun Fire E25K server, compared to a Sun Fire 15K server with the same number of processors and with higher clock frequency. At the same clock frequencies, the projected increase is approximately 70%. Further enhancements are currently underway in the Sun Studio compilers, runtime libraries, and tools. These include improved static and runtime error checking, enhanced automatic scoping of OpenMP programs, support for nested parallelism, and support for multi-threaded architectures.
Acknowledgments The authors extend their thanks to Eric Duncan, Richard Friedman, Yonghong Song, and Partha Tirumalai for their helpful and insightful comments on this paper.
References
1. OpenMP Architecture Review Board, http://www.openmp.org
2. Sun Fire E25K server, http://www.sun.com/servers/highend/sunfire_e25k/index.xml
3. Solaris 9 Operating System, http://www.sun.com/software/solaris
4. Sun Studio 9 Software, http://www.sun.com/software/products/studio/index.html
5. The SPEC OMP benchmark suite, http://www.spec.org/omp
6. Yuan Lin, Christian Terboven, Dieter an Mey, and Nawal Copty, Automatic Scoping of Variables in Parallel Regions of an OpenMP Program, WOMPAT 2004.
7. Sun Studio 8 Performance Analyzer, http://docs.sun.com/db/doc/817-5068
8. Brian Wylie and Darryl Gove, OMP AMMP analysis with Sun ONE Studio 8, EWOMP 2003.
9. SPEC OMPL results on the Sun Fire E25K server, http://www.spec.org/omp/results/res2004q1/omp2001-20040209-00126.asc
What Multilevel Parallel Programs Do When You Are Not Watching: A Performance Analysis Case Study Comparing MPI/OpenMP, MLP, and Nested OpenMP

Gabriele Jost¹,∗, Jesús Labarta², and Judit Gimenez²

¹ NAS Division, NASA Ames Research Center, Moffett Field, CA 94035-1000 USA
[email protected]
² European Center for Parallelism of Barcelona-Technical University of Catalonia (CEPBA-UPC), cr. Jordi Girona 1-3, Modul D6, 08034 Barcelona, Spain
{jesus, judit}@cepba.upc.es
Abstract. In this paper we present a performance analysis case study of two multilevel parallel benchmark codes implemented in three different programming paradigms applicable to shared memory computer architectures. We describe how detailed analysis techniques help to differentiate between the influences of the programming model itself and other factors, such as implementation specific behavior of the operating system or architectural issues.
1 Introduction For applications exhibiting multiple levels of parallelism the current most commonly used multilevel parallel programming methods are based on hybrid parallelization techniques. These are paradigms that employ multiple processes, each of which runs with multiple threads. Examples of this approach are the combination of MPI [8] and OpenMP [10] or the MLP paradigm proposed by Taft [14]. An approach which has not been widely used but is applicable for shared memory computer architectures is the nesting of OpenMP directives as discussed in [3]. Which programming paradigm is the best will depend on the nature of the given problem, the hardware architecture, and the available software. When comparing the performance of applications based on different programming paradigms, it is important to differentiate between the influence of the programming model itself and other factors, such as implementation specific behavior of the operating system (OS) or architectural issues. Rewriting a large scientific application in order to employ a new programming paradigm is usually a time consuming and error prone task. Before embarking on such an endeavor it is important to determine that there is really a potential gain that would not be possible with the current implementation. A detailed performance analysis is crucial to clarify these issues. The goal of this study is not to evaluate and rate different programming paradigms, but rather to demonstrate a number of analysis techniques which help to de-mangle the factors that influence the program’s performance and enable the programmer to make sensible optimization decisions. ∗
∗ The author is an employee of Computer Sciences Corporation.
In our study we use three different implementations of the multi-zone versions of the multi-zone NAS Parallel Benchmarks NPB-MZ [15]. These codes have been designed to capture the multiple levels of parallelism inherent in many full scale CFD applications. We chose the Paraver [11] performance analysis system for our study. Paraver is being developed and maintained at CEPBA-UPC. It consists of a graphical user interface to obtain a qualitative view of the program execution and an analysis module for a detailed quantitative performance analysis. We chose Paraver since it is the only available performance analysis system which supports all of the programming paradigms under consideration. Furthermore, Paraver provides a high degree of flexibility for trace file examination and performance metric calculation, which is crucial for our investigations. The rest of this paper is structured as follows: Section 2 gives an overview of the programming paradigms and the benchmark codes under consideration. In Section 3 we present timing results and describe performance analysis techniques to explain performance differences. In Section 4 we discuss related work and draw our conclusions in Section 5.
2 Programming Paradigms and Benchmark Code Implementations
The multilevel programming paradigms considered in this study are hybrid MPI/OpenMP, MLP, and nested OpenMP. The hybrid MPI/OpenMP approach is based on using MPI [8] for coarse grained parallelization and OpenMP [10] for fine grained loop level parallelism. The MPI programming paradigm assumes a private address space for each process. Data is transferred by explicitly exchanging messages via calls to the MPI library. This model was originally designed for distributed memory architectures but is also suitable for shared memory systems. The second paradigm under consideration is MLP which was developed by Taft [14]. The approach is similar to MPI/OpenMP, using a mix of coarse grain process level parallelization and loop level OpenMP parallelization. As is the case with MPI, a private address space is assumed for each process. The MLP approach was developed for ccNUMA architectures and explicitly takes advantage of the availability of shared memory. A shared memory arena which is accessible by all processes is required. Communication is done by copying to and from the shared memory. Libraries supporting the MLP paradigm usually provide routines for process creation, shared memory allocation, and process synchronization [4]. The third paradigm employed in our study is the usage of nested OpenMP directives. Even though the nesting of parallelization directives is permitted by the OpenMP standard, it is not supported by many compilers. The NanosCompiler [3] was developed to show the feasibility of exploiting nested parallelism in OpenMP and is used in our study. The NanosCompiler accepts Fortran-77 code containing OpenMP directives and generates plain Fortran-77 code with calls to the NanosCompiler thread library NTHLib [7]. NTHLib supports multilevel parallel execution such that inner parallel constructs are not being serialized. The programming model supports several extensions to the OpenMP standard allowing the user to create groups of threads and to control the allocation of work to the participating threads.
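To fix ideas, a minimal nested OpenMP example in C is sketched below; it uses only standard OpenMP features (omp_set_nested and the num_threads clause) rather than the NanosCompiler group extensions, and the zone and point counts are invented. An outer parallel loop distributes coarse work units, and each outer thread opens an inner parallel loop, which is only executed in parallel if the compiler and runtime support nesting.

```c
#include <stdio.h>
#include <omp.h>

#define NZONES  16
#define NPOINTS 10000

static double zone_data[NZONES][NPOINTS];

int main(void)
{
    omp_set_nested(1);   /* allow inner parallel regions to create threads */

    /* Outer level: one thread per coarse work unit (e.g. per zone). */
    #pragma omp parallel for num_threads(4) schedule(static)
    for (int z = 0; z < NZONES; z++) {
        /* Inner level: loop-level parallelism inside the work unit.
         * Without runtime support for nesting this region is simply
         * executed by the single outer thread. */
        #pragma omp parallel for num_threads(2) schedule(static)
        for (int p = 0; p < NPOINTS; p++)
            zone_data[z][p] += 1.0;
    }

    printf("%f\n", zone_data[0][0]);
    return 0;
}
```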
The benchmark codes used in this study are multi-zone versions of the NAS Parallel Benchmarks (NPB-MZ [15]). The purpose of NPB-MZ is to capture the multiple levels of parallelism inherent in many full scale CFD applications. Multi-zone versions of the well known NAS Parallel Benchmarks [2] LU, BT, and SP were developed by dividing the discretization mesh into a two-dimensional tiling of three-dimensional zones. Within all zones the LU, BT, and SP problems are solved to advance the time-dependent solution. The same kernel solvers are used in the multi-zone codes as in the single-zone codes. Exchange of boundary values takes place after each time step. The general structure of the NPB-MZ is shown in Figure 1. Reference implementations employing the MPI/OpenMP and the MLP programming paradigm are part of the benchmarks distribution. A discussion of the performance characteristics of these codes is presented in [6]. The nested OpenMP implementation we used in our study is based on NanosCompiler extensions and is discussed in [1]. An overview of the parallelization strategies is shown in Figure 2.

Fig. 1. General structure of the NPB-MZ
                      MPI/OpenMP       MLP              Nested OpenMP
time step             sequential       sequential       sequential
inter-zone            MPI processes    MLP processes    OpenMP
exchange boundaries   call MPI         data copy+sync   OpenMP
intra-zone            OpenMP           OpenMP           OpenMP

Fig. 2. Three multi-level parallelization strategies for the NPB-MZ
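To make the inter-zone/intra-zone split of Fig. 2 concrete, the following schematic C code (a heavily simplified sketch with invented names such as solve_zone; it is not the NPB-MZ source) uses MPI processes for the coarse, inter-zone level, MPI messages for the boundary exchange, and an OpenMP loop for the fine, intra-zone level:

```c
#include <mpi.h>

#define ZONESIZE 100000

static double zone[ZONESIZE];   /* this process's share of the zones */

/* Placeholder for the per-zone solver; the real LU/BT/SP kernels are of
 * course far more involved.  Intra-zone level: OpenMP loop parallelism. */
static void solve_zone(double *z, int n)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        z[i] = 0.5 * (z[i] + 1.0);
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int step = 0; step < 20; step++) {
        /* Inter-zone level: exchange boundary values between the MPI
         * processes owning neighboring zones (simplified to a ring). */
        double send = zone[ZONESIZE - 1], recv = 0.0;
        MPI_Sendrecv(&send, 1, MPI_DOUBLE, (rank + 1) % size, 0,
                     &recv, 1, MPI_DOUBLE, (rank + size - 1) % size, 0,
                     MPI_COMM_WORLD, &status);
        zone[0] = recv;

        solve_zone(zone, ZONESIZE);  /* each process advances its zone(s) */
    }

    MPI_Finalize();
    return 0;
}
```

In the MLP variant the explicit messages would be replaced by copies to and from a shared memory arena, while the nested OpenMP variant replaces the processes by an outer team of threads, as sketched earlier.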
3 Timings and Performance Analysis Our tests are executed on an SGI Origin 3000 located at the NASA Ames Research Center. The SGI Origin 3000 is a ccNUMA architecture with 4 CPUs per node. The CPUs are of type R12K with a clock rate of 400 MHz, 2 GB of local memory per node, and 8 MB of L2 cache. The MLP implementations use the SMPlib library as
described in [4]. The MIPSpro 7.4 Fortran Compiler is used to compile the hybrid codes and the NanosCompiler for the nested OpenMP code. The compiler options -mp -O3 -64 are set in all cases. We ran the different implementations of LU-MZ and BT-MZ for various problem classes. The problem classes vary by the number of zones and the aggregate number of grid points. The aggregate sizes for all benchmarks are:
• Class W: 64x64x8 grid points
• Class A: 128x128x16 grid points
• Class B: 304x208x17 grid points
The timings in our experiments are based on executing 20 time steps. In Figure 3 we show some performance numbers for Class A and Class W on different numbers of CPUs. The reported performance is the best that was achieved over different combinations of processes and threads or thread groups in case of nested OpenMP. The MLP implementation shows the best performance for both benchmark classes, particularly for a larger number of CPUs. Figure 4 shows the scalability for benchmark Class B measured by the total number of Gigaflops per second with an increasing number of CPUs. For LU-MZ the scalability of the MLP based implementation is clearly superior to MPI/OpenMP and nested OpenMP. For BT-MZ, MLP and nested OpenMP show similar scalability while MPI/OpenMP lags behind for a large number of CPUs. We will now investigate the reasons for the performance differences.
Fig. 3. Performance of LU-MZ and BT-MZ in terms of the total number of Gigaflops per second. The horizontal axis shows the number of CPUs as the number of processes or thread groups (NP) times the number of threads (NT) per process
Why does MLP scale better than MPI/OpenMP? We will first look into LU-MZ. The number of zones for LU-MZ Class B is limited to 16, which requires the exploitation of fine grained parallelism in order to scale beyond 16 CPUs. We note that the performance difference increases with an increasing number of threads per process.
Fig. 4. Scalability of the LU-MZ and BT-MZ implementations for Class B. The number of CPUs is indicated as the number of processes or thread groups (NP) times the number of threads (NT)
The performance difference between MPI/OpenMP and MLP on 128 CPUs can also be observed on 64 CPUs, employing 8 processes with 8 threads each. We obtained traces for the smaller configuration and computed various statistics for the major time consuming subroutines, which are jacld, jacu, blts, and buts. Summaries of some of the calculated statistics are shown in Tables 1 and 2.

Table 1. Performance Statistics for LU-MZ MLP

         Duration (ms)   Instr.   L2 misses
jacld    52.3            72 B     800 M
blts     44.7            64 B     835 M
jacu     37.1            65 B     700 M
buts     49.0            65 B     675 M

Table 2. Performance Statistics for LU-MZ MPI/OpenMP

         Duration (ms)   Instr.   L2 misses
jacld    68.1            72 B     725 M
blts     62.7            64 B     758 M
jacu     53.7            65 B     701 M
buts     63.4            65 B     684 M
The average duration of a routine call and the time spent in communication (Com) is given in milliseconds (ms), the total number of instructions during the time of useful calculations is given in billions (B), the number of L2 cache misses is reported in millions (M). It is important to note that we base the instruction and L2 miss count on the time spent in useful calculations, which means we exclude time spent in waiting for work, synchronizing, or busy waits implemented by the user. If this time would be
included in the measurements, which is what a typical profiler would do, then the measurements would be tainted by factors such as OS scheduling or interrupt servicing. Our very precise measurements are a fair indicator of the computational complexity and L2 cache behavior during the actual computation time of both versions of the code. All of the time consuming routines are called from within parallel regions and do not contain any communication. This fact in addition to the information contained in Tables 1 and 2 indicates the following:
• The routines taking more time in MPI/OpenMP do not contain any communication. This rules out communication as the reason for the performance differences.
• Some of the subroutines take considerably (50%) more time in MPI/OpenMP than in MLP.
• There is no significant difference (< 0.001%) in the number of issued instructions between MPI/OpenMP and MLP in the time consuming routines. Computational complexity is therefore not an issue for the time difference.
• The difference in the number of L2 cache misses is slightly in favor of MPI/OpenMP, which has about 1% fewer misses.
After ruling out communication, computational complexity, and L2 cache misses as reasons for the time difference, we now examine the possibility of memory placement. The duration of OpenMP work sharing constructs can be used for this purpose. An example is shown in Figure 5, which displays a zoom into a timeline view of the duration of work shares for 2 processes with 8 threads each. The duration of the work shares is indicated by a gradient color scale. The scale ranges from 30 ms (light shading) to 60 ms (dark shading) to make the difference between the different threads apparent. The MPI/OpenMP code clearly shows a bimodal pattern, depending on the thread number. Four of the eight threads of each process show shorter and four show longer durations. Those with short durations are very similar to the MLP case. Those with the long durations are responsible for the increased amount of time in certain subroutines in the MPI/OpenMP implementations. We compared the number of instructions and L2 misses across the threads. The differences in these hardware counters across the threads in the MPI/OpenMP code are less than 3% and can not account for the time difference. When jointly looking at the time taken by the OpenMP work shares within a given subroutine, the number of instructions and the number of L2 misses across threads we can estimate that the same number of L2 misses take a fairly different time on different threads. This indicates that the thread to processor mapping and page to node mapping do result in fairly different access times for each of the processor groups. It is important to mention how the analysis of a single trace with hardware counters can give indications of memory/processor placement problems. A problem with the approach described above is that the analyst has to mentally correlate the views of time, instructions and misses. The question arises whether a single view can help to identify the problem. We use a simple performance model which, even though not being very accurate, can help to point the analyst to the potential problem. We model the cost of the L2 cache misses according to equation (1) which is applied to each interval between hardware counter samples:
Fig. 5. Timeline view of the work sharing duration within MPI/OpenMP (top) and MLP (bottom) code for LU-MZ. The view zooms in on 2 processes with 8 threads each. The thread number is indicated on the vertical axis, the time on the horizontal axis. Light shading indicates OpenMP work shares with a short duration, darker shading indicates longer work shares
\[
\mathit{L2missCost} \;=\; \frac{\mathit{ElapsedTime} \;-\; \dfrac{\mathit{Instr}}{\mathit{IdealMIPS}}}{\mathit{L2misses}} \qquad (1)
\]
The values for the elapsed time, the number of instructions and the number of L2 cache misses are contained in the trace file. The IdealMIPS parameter is the instruction rate that would be achieved by the code assuming an ideal memory subsystem with zero L2 miss costs. Our current approach is to use an estimate of 500 MIPS. We also assume that all routines have the same IdealMIPS rate. Figure 6 shows a snippet of a Paraver analysis view of the profile of the average L2 miss cost for different threads and subroutines.
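For concreteness, the small C helper below (our own restatement of equation (1), not Paraver code) evaluates the model for one interval between hardware counter samples, using the 500 MIPS estimate quoted above for IdealMIPS:

```c
/* Estimated average cost of one L2 miss in one sampling interval,
 * following equation (1): the part of the elapsed time that is not
 * explained by executing instructions at the ideal rate is attributed
 * to the L2 misses observed in the same interval. */
double l2_miss_cost(double elapsed_time,   /* seconds in the interval       */
                    double instructions,   /* instructions in the interval  */
                    double l2_misses,      /* L2 misses in the interval     */
                    double ideal_mips)     /* e.g. 500.0, the estimate used */
{
    if (l2_misses <= 0.0)
        return 0.0;                                   /* nothing to attribute */
    double compute_time = instructions / (ideal_mips * 1.0e6);
    return (elapsed_time - compute_time) / l2_misses;
}
```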
Fig. 6. Paraver analysis view of estimated L2 miss cost for the MPI/OpenMP code, zoomed in on processes 7 and 8 for an 8x8 run of LU-MZ. Darker shading indicates a higher number. The view shows a bimodal pattern of higher and lower values, depending on the thread number
Fig. 7. Time line view of LU-MZ using nested OpenMP. The view shows the time spent in different compiler generated parallel functions. The vertical axis indicates the thread number. Different shadings indicate different parallel functions. Black indicates time spent outside of parallel regions. The dashed lines mark the first two iterations, the solid lines mark 5 of the following iterations
The view is zoomed in on processes 7 and 8 of the LU-MZ MPI/OpenMP code. It clearly identifies the different costs of L2 misses for two groups of four threads within each process. The difference between threads is consistently detected across routines and the actual ratio is dependent on the routine. This may be due to our assumption that the ideal MIPS rate is the same for all routines. Another possible reason is the
difference in the memory access pattern across routines with different local and remote access ratios. Nevertheless, for each of the routines the statistics indicate that four of the threads are placed correctly on consecutive CPUs, but the other four are placed further away. This effect does not occur in the MLP code. The reason for the different behavior is as follows: When combining MPI and OpenMP on the SGI Origin, care has to be taken on how to place the threads onto the CPUs. By default the OS will place the MPI processes onto consecutive CPUs. When new threads are being forked they may end up running on CPUs far away from the master MPI process, which can potentially decrease the performance due to increased remote memory access time. The performance analysis of the LU-MZ run indicates that four of the threads are placed correctly onto consecutive CPUs, but the other four are placed further away. The MLP programming paradigm was designed for ccNUMA architectures. During the start-up phase the MLP library issues a system call which pins a thread to a particular CPU for the duration of the run to assure efficient memory access. Using the same system call within the MPI/OpenMP implementations of LU-MZ and BT-MZ, while leaving the rest of the code unchanged, resulted in a performance improvement of the MPI/OpenMP code, making it very similar to the MLP code performance (see Figure 8).

Fig. 8. Performance of LU-MZ and BT-MZ Class B

Why does LU-MZ using nested OpenMP perform worse than MLP? Next we investigate the lack of scalability of LU-MZ using nested OpenMP. We gathered a trace employing 16 groups of 8 threads. Figure 7 shows a Paraver timeline view that identifies the parallel functions executed by each thread. We can identify that 16 threads execute at the outer level of parallelism, each of them generating work for 8 threads. The first two iterations are very long and imbalanced. The following iterations are much faster and are relatively balanced. There is no algorithmic reason for a workload imbalance within the first two iterations, nor for the fact that they should consume more time than the following iterations. We suspect that due to the interaction between the NTHLib and the OS it takes several iterations until the data is placed
onto the appropriate memory modules. The BT-MZ benchmark implementation differs from LU-MZ in that it executes one time step before the actual iteration loop. The purpose is to place the data appropriately before the timing begins. We have added the execution of two time steps before the timed iteration loop to the LU-MZ code, which resulted in increased performance of the nested OpenMP implementation (see Figure 8). The results are discussed in the next subsection. Summary of the performance analysis results. We repeated the timings after the modifications described above and obtained a similar performance for all three implementations. As an example we show in Figure 8 the scalability of the Class B benchmarks. The chart contains the performance numbers from Figure 4 and in addition results that were obtained after the adjustments to the code. The performance of MPI/OpenMP after insertion of the system call that pins a thread to a particular CPU is noted as “MPI/OpenMP (pin-to-node)”. The performance of the nested OpenMP code after adding a pre-time-step-loop iteration is noted as “nested OpenMP (mod)”. The charts show that after removing OS and runtime library effects, the performance of the three implementations is very similar. The performance increase was obtained with only minor changes to the user code.
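A minimal sketch of the warm-up modification applied to LU-MZ is shown below; the routine names and timer are placeholders of our own, not the actual benchmark code. Untimed warm-up time steps run before the timer starts, so that first-touch page placement has settled when measurement begins.

```c
#include <stdio.h>
#include <time.h>

/* Placeholder stubs; in the real benchmark these are the NPB-MZ solver
 * and boundary-exchange routines. */
static void solve_time_step(void)     { /* ... numerical work ...     */ }
static void exchange_boundaries(void) { /* ... copy zone borders ...  */ }

static clock_t t0;
static void start_timer(void) { t0 = clock(); }
static void stop_timer(void)  { printf("timed: %.3f s\n",
                                       (double)(clock() - t0) / CLOCKS_PER_SEC); }

void run_benchmark(int timed_steps)
{
    /* Warm-up: two untimed time steps let the OS place pages near the
     * threads that use them (first-touch), as done for BT-MZ and added
     * here for LU-MZ. */
    for (int step = 0; step < 2; step++) {
        exchange_boundaries();
        solve_time_step();
    }

    start_timer();
    for (int step = 0; step < timed_steps; step++) {
        exchange_boundaries();
        solve_time_step();
    }
    stop_timer();
}
```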
4 Related Work There are many published reports on the comparison of different programming paradigms. We can only name a few of them. In [14] Taft discusses the performance of a large CFD application. He compares the scalability of message passing versus MLP. A comparison of message passing versus shared memory access is given in [12] and [13]. The studies use the SGI SHMEM library for SMA programming. Our current work differs from these reports in that we are exploiting performance analysis tools to obtain detailed information about program behavior. Rather than evaluating the pros and cons about different programming paradigms our study aims to differentiate between the influence of the programming paradigm per se and its runtime support on a particular hardware platform. We have previously conducted a detailed performance study of hybrid codes [5], however the codes under consideration were single zone codes and not very well suitable for hybrid parallelization.
5 Conclusions In this paper we have looked at the performance of different programming models. We have conducted a performance analysis case study of two multilevel parallel benchmark codes implemented in three different programming paradigms applicable to shared memory computer architectures. The use of detailed analysis techniques helped to determine that initially observed performance differences were not due to the programming models themselves but rather to other factors. The point of our paper is to emphasize the need to perform detailed quantitative analysis in order to properly validate the assumptions about the causes of the performance differences.
Our first conclusion is that a high degree of flexibility for creating trace file views and calculating performance metrics is essential for discovering patterns that point to potential performance problems. Extracting such information can be tricky, for example not counting the instructions in busy wait loops. It may also be necessary to quantify the potential impact of architectural parameters such as memory access cost. The Paraver system proved capable of supporting the calculation of very precise metrics and of conducting an in-depth analysis. Secondly, we conclude that spending the time for a detailed performance analysis, while it might seem tedious, is worth the effort if it avoids rewriting the source code in order to switch to a different programming paradigm.
Acknowledgements This work was supported by NASA contract DTTS59-99-D-00437/A61812D with Computer Sciences Corporation/AMTI, by the Spanish Ministry of Science and Technology, by the European Union FEDER program under contract TIC2001-0995C02-01, and by the European Center for Parallelism of Barcelona (CEPBA).
References
1. E. Ayguade, M. Gonzalez, X. Martorell, and G. Jost, Employing Nested OpenMP for the Parallelization of Multi-Zone Computational Fluid Dynamics Applications, Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS04), Santa Fe, NM, USA, April 2004
2. D. Bailey, T. Harris, W. Saphir, R. Van der Wijngaart, A. Woo, and M. Yarrow, The NAS Parallel Benchmarks 2.0, RNR-95-020, NASA Ames Research Center, 1995
3. M. Gonzalez, E. Ayguade, X. Martorell, J. Labarta, N. Navarro and J. Oliver. Nanos Compiler: Supporting Flexible Multilevel Parallelism in OpenMP, Concurrency: Practice and Experience. Special issue on OpenMP. vol. 12, no. 12. pp. 1205-1218. October 2000
4. H. Jin, G. Jost, Performance Evaluation of Remote Memory Access Programming on Shared Memory Parallel Computer Architectures, NAS Technical Report NAS-03-001, NASA Ames Research Center, Moffett Field, CA, 2003
5. G. Jost, H. Jin, J. Labarta, J. Gimenez, J. Caubet, Performance Analysis of Multilevel Parallel Programs, Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS03), Nice, France, April 2003
6. H. Jin, R. F. Van der Wijngaart, Performance Characteristics of the Multi-Zone NAS Parallel Benchmarks, to appear in the Proceedings of IPDPS04, Santa Fe, New Mexico, USA, April 2004
7. X. Martorell, E. Ayguadé, N. Navarro, J. Corbalan, M. Gonzalez and J. Labarta. Thread Fork/join Techniques for Multi-level Parallelism Exploitation in NUMA Multiprocessors. 13th International Conference on Supercomputing (ICS'99), Rhodes (Greece). pp. 294-301. June 1999
8. MPI 1.1 Standard, http://www-unix.mcs.anl.gov/mpi/mpich
9. OMPItrace User's Guide, https://www.cepba.upc.es/paraver/manual_i.htm
10. OpenMP Fortran Application Program Interface, http://www.openmp.org/
11. Paraver, http://www.cepba.upc.es/paraver
12. H. Shan, J. Pal Singh, A comparison of MPI, SHMEM, and Cache-Coherent Shared Address Space Programming Models on a Tightly-Coupled Multiprocessor, International Journal of Parallel Programming, Vol. 29, No. 3, 2001.
13. H. Shan, J. Pal Singh, Comparison of Three Programming Models for Adaptive Applications on the Origin 2000, Journal of Parallel and Distributed Computing 62, 241-266, 2002
14. J. Taft, Achieving 60 GFLOP/s on the Production CFD Code OVERFLOW-MLP, Parallel Computing, 27 (2001) 521
15. R. F. Van Der Wijngaart, H. Jin, "NAS Parallel Benchmarks, Multi-Zone Versions," NAS Technical Report NAS-03-010, NASA Ames Research Center, Moffett Field, CA, 2003
SIMT/OMP: A Toolset to Study and Exploit Memory Locality of OpenMP Applications on NUMA Architectures

Jie Tao¹, Martin Schulz², and Wolfgang Karl¹

¹ Institut für Rechnerentwurf und Fehlertoleranz, Universität Karlsruhe, 76128 Karlsruhe, Germany
{tao, karl}@ira.uka.de
² School of Electrical and Computer Engineering, Ithaca NY 14853, USA
[email protected]

Abstract. OpenMP has become the dominant standard for shared memory programming. It is traditionally used for Symmetric Multiprocessor Systems, but has more recently also found its way to parallel architectures with distributed shared memory like NUMA machines. This combines the advantages of OpenMP's easy-to-use programming model with the scalability and cost-effectiveness of NUMA architectures. In NUMA (Non Uniform Memory Access) environments, however, OpenMP codes suffer from the longer latencies of remote memory accesses. This can be observed for both hardware and software DSM systems. In this paper we present SIMT/OMP, a simulation environment capable of modeling NUMA scenarios and providing comprehensive performance data about the inter-connection traffic. We use this tool to study the impact of NUMA on the performance of OpenMP applications and show how the memory layout of these codes can be improved using a visualization tool. Based on these techniques, we have achieved performance increases of up to a factor of five on some of our benchmarks, especially in larger system configurations.
1 Introduction

Over the last years, OpenMP has become the dominant standard for shared memory programming. It is traditionally used on SMP systems with UMA (Uniform Memory Access) properties. In these systems, all data is stored in a central memory and is equally accessible by all processors. This avoids the complex issue of data locality or memory distribution and thereby significantly eases the use of these machines. The scalability of these platforms, however, is restricted. To provide more scalability, OpenMP is therefore increasingly used also on NUMA (Non Uniform Memory Access) architectures [4, 10], despite the problems arising from the non-uniformity of memory accesses. In addition, compiler developers are extending their compilers to provide OpenMP on more scalable platforms such as cluster environments using software DSM systems [14] or to explicitly deal with NUMA issues. Well-known examples are the Nanos Compiler [3, 11], the
Polaris parallelizing compiler [1], the ParADE OpenMP programming environment [8], or the Omni Fortran/C compiler [15]. Due to the inherent distinction between local and remote memories, severe performance problems exist in NUMA environments. We have measured a set of applications from standard benchmark suites on a PC cluster using Omni/SCASH OpenMP [15]; most applications show a low parallel performance, especially running on larger systems. Similar results have also been obtained by other researchers. For example, the ParADE [8] team has reported a low performance for the CG code from the NAS parallel benchmarks, with a speedup of 1.3 on a 4-node and 1.06 on an 8-node SMP cluster using 1-thread/1-cpu mode. The main reasons for these results are excessive remote memory accesses and the connected latency penalty, which can be up to two or three orders of magnitude longer than a local access. The work presented in this paper aims at studying the impact of NUMA architectures on the performance of OpenMP applications using a comprehensive simulation environment. In addition, we propose an approach using memory access visualization to present the simulation results to the user and to identify improved memory distributions for critical data structures. This is done by analyzing the memory access patterns of applications in the form of memory access histograms. In particular, we make the following contributions:
– We present SIMT/OMP, a simulation environment for OpenMP applications running on top of NUMA architectures (Section 2). It models a two-level cache hierarchy, as well as a distributed memory with varying remote memory access latencies. This allows us to gather comprehensive performance data and to study the performance problems of OpenMP codes on NUMA systems. This in turn provides us with the basis for locality optimizations.
– Using SIMT/OMP we study the performance impact of OpenMP in NUMA environments using a set of shared memory benchmarks and varying architectural parameters (Section 3).
– We discuss an approach to optimize the memory layout of OpenMP codes using a visualization tool based on SIMT/OMP. This visualizer, which is based on our prior work for general shared memory optimization [12], allows users to identify data structures with critical behavior and provides guidance towards an improved memory locality (Section 4).
By using SIMT/OMP and visualizing the simulation results using our tool, we have been able to optimize all test codes by providing a better memory layout. Almost all codes show a significant improvement in runtime, in some cases of up to a factor of five over the baseline.
2 SIMT/OMP: Simulating OpenMP Codes on NUMA Architectures
The performance of OpenMP codes on NUMA architectures depends on the application’s memory access pattern and its interaction with the architectural parameters. To study this interaction and to provide a tool to optimize the execution behavior of applications,
we have developed a simulation environment, SIMT/OMP. It allows the user to run any OpenMP code while varying all critical design parameters including system size, cache configurations, and remote memory access latency. For SIMT/OMP we modify the backend of the OMNI OpenMP compiler [9] and provide a new OpenMP runtime library to map the OpenMP programming model to the simulation platform. For the latter we use SIMT [16], an event-driven multiprocessor simulator, as the simulation platform. SIMT itself takes its front-end from Augmint [13] and provides a detailed memory backend with a two-level memory hierarchy, various cache coherence protocols, and a configurable NUMA network. Our new OpenMP compiler backend maps the application's OpenMP threads to the thread model of the SIMT/Augmint simulator. This thread model simulates the existence of multiple CPUs by switching the thread or CPU context for each memory access stall based on a global, ordered event list. Hence, OpenMP constructs using busy waits drastically reduce the simulation performance and can even lead to deadlock scenarios. To avoid these problems, we replace all synchronization constructs with SIMT internal routines and re-implement ordering and exclusion constructs (such as ORDERED or ATOMIC) with non-blocking versions that yield execution to waiting simulation threads if necessary. In addition, SIMT/OMP directly handles scheduling and work distribution decisions in order to guarantee a correct work distribution among the simulated threads. Currently, SIMT only supports static scheduling, which statically assigns parallel work to threads in a round-robin fashion, but we plan on extending this in the future. Like most simulators, SIMT/OMP provides extensive statistics about the execution of an application during the simulation process. This includes elapsed simulation time, simulated processor cycles, number of total memory references, and number of local and remote memory accesses. This information can give the user a global overview about the memory performance, but is not sufficient for locality optimizations which require a deep insight into the runtime data layout. For this, SIMT/OMP uses a monitoring simulator to trace all memory references and deliver detailed information in the form of memory access histograms. This component models a hardware monitor designed to observe the internode communications over SCI (Scalable Coherent Interface), an IEEE-standardized interconnection technology with global physical access space [5]. The memory access histogram records the accesses of all processor nodes to the whole virtual address space throughout the execution of an application. This can be based on a user-given granularity, e.g., pages, sections, or words. For each entry in the histogram, the number of accesses performed by each processor is recorded. This enables the user to detect access hot spots, i.e., address ranges in which data is dominantly accessed remotely. It also provides the basis for determining a better data distribution leading to less inter-node communication.
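As a hedged illustration of the bookkeeping behind such a histogram (the data structure shown is our own sketch at page granularity, not SIMT/OMP's internal implementation), one counter is kept per (page, accessing node) pair and updated for every simulated reference; a per-page remote-access fraction then exposes the hot spots mentioned above:

```c
#include <stdint.h>

#define PAGE_SHIFT 12          /* 4 KB histogram granularity (assumed)   */
#define MAX_PAGES  (1 << 16)   /* pages tracked in the observed range    */
#define MAX_NODES  32

/* accesses[p][n] counts how often node n touched page p. */
static uint64_t accesses[MAX_PAGES][MAX_NODES];

/* Called by the memory-system model for every simulated reference. */
void record_access(uint64_t vaddr, int node)
{
    uint64_t page = (vaddr >> PAGE_SHIFT) % MAX_PAGES;
    accesses[page][node]++;
}

/* A page is an access hot spot for locality purposes when most of its
 * accesses come from nodes other than the node holding the page. */
double remote_fraction(uint64_t page, int home_node, int num_nodes)
{
    uint64_t total = 0, remote = 0;
    for (int n = 0; n < num_nodes; n++) {
        total += accesses[page][n];
        if (n != home_node)
            remote += accesses[page][n];
    }
    return total ? (double)remote / (double)total : 0.0;
}
```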
3 Locality Problem on NUMA
We use SIMT/OMP to study the locality problem of OpenMP executions on NUMA architectures. For this study, we use a variety of codes from several standard benchmark suites. Short descriptions of all selected programs and their working set sizes are shown in Table 1. The first four applications are chosen from the NAS parallel benchmark suite [2, 7].
Table 1. Description of selected applications and working set sizes

Application   Description                                  Working set size
FT            Fast Fourier Transformations                 64×64×32
MG            Multigrid solver                             32×32×32
CG            Grid computation and communication           1400
LU            LU-decomposition for dense matrices          32×32×32
RADIX         Integer radix sort                           262144 keys, 1024 radix
OCEAN         Simulation of large-scale ocean movements    258×258
WATER         Evaluation of water molecule systems         343 molecules
MULM          Matrix multiplication                        2048×2048
SOR           Successive Over-Relaxation                   4096×4096
Table 2. Number and latency of memory accesses on different systems

                 4 processors                       32 processors
         L2 M.    RemMem.   Lat.     oLat.    L2 M.    RemMem.   Lat.      oLat.
FT       1.6%     66%       22.52    2.28     1.2%     90%       23.92     2.08
MG       1.3%     74.9%     20.97    1.82     1.1%     98.2%     21.59     1.73
CG       0.12%    64.4%     2.98     1.24     0.6%     77.2%     12.86     1.34
LU       18%      74.7%     273.7    11.47    6.2%     94.9%     386.6     7.55
RADIX    1.7%     61.6%     21.85    2.07     3.5%     93.8%     67.1      2.07
OCEAN    0.64%    75.5%     10.9     1.43     0.2%     96.6%     12.98     1.39
WATER    0.015%   74.7%     1.3      1.0      0.08%    96.4%     2.52      1.06
MULM     1.6%     64.8%     23.76    3.38     6.9%     68%       98.53     5.69
SOR      5.8%     75%       89.3     3.9      5.8%     96.8%     113.46    2.85
The following three are SPLASH-2 [17] codes ported to OpenMP, and the remaining two are self-coded small kernels. By default all codes use a round-robin page allocation scheme to implement a transparent page distribution. We configure SIMT/OMP to model a two-level cache hierarchy with a 16K 2-way L1 cache and a 512K 4-way L2 cache. We assume 1-cycle latencies for all L1 cache accesses, 5 cycles for L2 cache accesses, and 50 cycles for main memory. This configuration corresponds to a 500 MHz Pentium-II system and matches the underlying processor model of Augmint [13], the base of SIMT/OMP. The cost of remote memory accesses can vary between 50 cycles and 10000 cycles to model a large variety of system configurations, from UMA systems with equal latencies for all memory locations to widely distributed, loosely coupled systems with large latencies. In our first study we choose a remote latency of 2000 cycles (or 4µs), which corresponds to a typical configuration of a NUMA cluster¹. We vary the number of processors between four and 32². The experimental results are summarized in Table 2.
¹ The number of 2000 cycles is based on measurements using SCI clusters [6], which provide NUMA capabilities for PC clusters.
² This is currently the limit of Augmint, but we are working on an extension.
For each system configuration, four access metrics are presented: the proportion of main memory accesses (not resolved by the caches, i.e., all L2 misses) to the total references (L2 M.); the proportion of remote memory accesses to total main memory accesses (RemMem.); the average latency of all memory accesses, including accesses to cached data (Lat.); and the optimal latency in case all main memory accesses could be satisfied by the local memory (oLat.). The difference between the last two columns shows the maximal optimization potential that could be achieved with an optimized data distribution.
Examining the second column, it can be seen that all applications, except the LU and SOR codes, show a low main memory access ratio, with less than 2% of all references reaching (local or remote) main memory. For systems with 32 processors this value can be greater or smaller depending on the individual code, but a similarly low proportion can be observed. However, as shown in the third column, these accesses are mostly performed remotely, and the situation is even worse on 32-node systems. This directly results in the high average read latency shown in the Lat. column. This scenario changes drastically, however, if all the remote accesses could be performed locally. The values in the oLat. columns show a considerable improvement (up to a factor of 20); even WATER, with only 0.015% of its references going to main memory, shows an average read latency reduction of 23% on four processors and 58% on 32 processors.
Overall, the data shows that all codes incur a high rate of remote memory accesses. In addition, this number grows with the number of processors, since more processors also mean more memory modules within the overall system and hence more data that is potentially stored remotely. The second observation is that, despite the very low number of accesses that miss in the L2 cache and potentially cause remote accesses, these accesses have a large impact on the overall application performance. This can be seen by comparing the average latency with remote accesses to the average latency of the optimal case with only local accesses. This indicates that the access latency could be improved by reducing the number of remote accesses.
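As a rough way to see how these columns relate, the average latency can be modeled as a weighted sum over the memory hierarchy. The sketch below is our own simplified cost model (using the cycle counts assumed above: 1 for L1, 5 for L2, 50 for local memory, and a configurable remote latency), not the exact accounting performed by SIMT/OMP; oLat. then corresponds to the same access mix with the remote fraction forced to zero:

    /* Simplified average-latency model; all costs are the assumed cycle
     * counts from the configuration described above. */
    typedef struct {
        double l1_hit;       /* fraction of references served by the L1 cache */
        double l2_hit;       /* fraction served by the L2 cache               */
        double l2_miss;      /* fraction going to main memory ("L2 M.")       */
        double remote_frac;  /* share of main memory accesses that are remote
                                ("RemMem.")                                   */
    } access_mix_t;

    static double avg_latency(access_mix_t m, double remote_cycles)
    {
        double mem = (1.0 - m.remote_frac) * 50.0 + m.remote_frac * remote_cycles;
        return m.l1_hit * 1.0 + m.l2_hit * 5.0 + m.l2_miss * mem;
    }

    /* The optimal latency (oLat.) charges every L2 miss at the local cost. */
    static double optimal_latency(access_mix_t m)
    {
        m.remote_frac = 0.0;
        return avg_latency(m, 0.0);
    }

The gap between Lat. and oLat. is therefore driven almost entirely by the remote fraction, which is exactly the effect visible in Table 2.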
4 Tool Support for Locality Optimization
The results above show the importance of locality awareness for OpenMP programming in NUMA architectures. At the same time, they indicate a large potential that can be gained by fine tuning codes with locality in mind. This, however, can be a complex and tedious process and hence requires extensive tool support. Using the data provided by SIMT/OMP, we have developed a visualization tool that presents the information provided by the simulator and relates it to the application source code. It aids the programmer in understanding the memory access behavior of the target application and provides the basis for locality optimizations. This tool, which is based on our earlier research for SCI-based shared memory clusters [12], presents all memory accesses in the form of memory access histograms. These histograms show the number of accesses from each individual node for each address or memory block. Figure 1 shows two examples of such histograms, one aggregated across the entire execution time (top) and one split into application phases (bottom). In the former the Y-axis shows the address space aggregated to pages, the X-axis shows the number of accesses to this page, and the color indicates which node has accessed the
Fig. 1. Sample views of the visualization tool
corresponding page. In the 3D figure, the X-axis shows the address space, the Y-axis the number of accesses, and the Z-axis the phase number.
The simple example in Figure 1 (top) shows the memory access characteristics of SOR (512×512) on a 4-node system. This concrete figure only presents the pages located on node 2. It shows that SOR has a very regular access pattern with a block-like structure, where a processor node mainly accesses a single block of data pages. If the pages in a block were placed on the respective nodes, virtually all remote memory accesses could be eliminated. This is an extreme example, but in the general case the user can use this 2D view to find the dominating node for each virtual page.
For a deeper analysis and optimization, the visualization tool provides 3D views to exhibit the per-phase access behavior with temporal information. Phases are identified by barriers, locks, or other synchronization mechanisms. Figure 1 (bottom) shows a sample view for FT. As can be seen, the 3D views show the memory accesses performed within each program phase. Each page is represented by a colored bar showing all accesses from
all nodes. The higher the bar, the more accesses were performed by the corresponding node. This enables easy detection of the dominating node for each virtual page. In addition, the viewpoint is adjustable, enabling a change of focus to an individual phase or observation of the changes in access behavior between phases. For the concrete example, it can be observed that FT shows different access patterns in different phases. While most pages are dominantly accessed by a single node in the first, second, and fourth phases, they are also accessed by other nodes in the remaining phases. This indicates that for some applications separate optimizations should be performed for each phase. An accumulated view across all phases alone would have failed to show this behavior and would have led to incorrect conclusions.
SIMT/OMP provides an ALLOCATE macro to enable explicit data allocation. This macro allows the user to specify the placement of a single word, page, or block on a particular node. For example, for the SOR code we manually allocate the data according to the block structure presented in the 2D views for all nodes (see Figure 1 for node 2):

    ALLOCATE(0-30,1)
    ALLOCATE(31-60,2)
    ALLOCATE(61-90,3)
    ALLOCATE(91-120,0)

The first parameter specifies the page range (or an individual page); the second one identifies the processor on which these pages are allocated. This kind of optimization results in a data placement where the blocks assigned to and processed by a processor are also located in the memory of that processor. This leads to a significant reduction of remote accesses, which can be seen in Figure 2. As illustrated in the top diagram of this figure, the access behavior of the original application results in a large number of remote memory accesses, and only a small proportion of all accesses are performed locally. This behavior changes drastically with the optimized version shown in the bottom of Figure 2, where only a few remote accesses can be observed. These accesses are needed by neighboring processors to exchange boundary rows and therefore represent true and necessary communication.
To apply these optimizations in an application, the user has to know how data structures are mapped to virtual addresses. To aid this process, our tool provides pop-up windows listing all data structures mapped to a selected virtual address. In addition, the visualization tool provides views displaying the access characteristics of arrays. For one-dimensional arrays and arrays with more than two dimensions, it uses small squares to show the dominating accesses performed on each element of an array. The elements are placed one after another within a diagram according to their order within the arrays. More diagrams may be required for large arrays containing more elements than a single diagram can hold. For two-dimensional arrays, the tool maps the data to a visual representation of a matrix of the same size. Figure 3 depicts such an example with the LU code on an eight-node system. Each block in this diagram represents an element of a two-dimensional matrix. The color of a block indicates the node that performs the most accesses to the corresponding element. This diagram hence helps the user to distribute the working set across the processor nodes such that each node dominates the accesses to the data located on the same node.
Fig. 2. Data accesses before (top) and after (bottom) optimization (SOR)
Fig. 3. Accesses to data structures (LU on eight nodes)
For the concrete example in Figure 3, we see that most matrix elements follow a block access pattern, with consecutive lines predominantly accessed by the same node. For example, the first few lines are predominantly required by node 1, while the next lines are mostly accessed by node 3 and the following several lines by node 0. This means that a reallocation placing all lines on their dominating nodes can lead to an optimized data distribution and hence better runtime performance. Such data structure views are also provided for individual phases using a 3D representation. In summary, our tool enables the user to gather detailed information about the memory characteristics of an application and to understand the interactions between the application and the execution environment. It is important to note that the results are therefore not restricted to the concrete working set simulated, but can be extrapolated to any working set of the same application.
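In principle, the placement directives shown above can be derived mechanically from the histogram data: pick the node with the most accesses to each page and emit an ALLOCATE range whenever the dominating node changes. The sketch below is our own illustration of that idea, reusing the hypothetical histogram layout from the earlier sketch; it is not part of the released toolset:

    #include <stdio.h>
    #include <stdint.h>

    #define MAX_NODES 32

    typedef struct { uint64_t accesses[MAX_NODES]; } hist_entry_t;

    static int dominating_node(const hist_entry_t *e, int nodes)
    {
        int best = 0;
        for (int n = 1; n < nodes; n++)
            if (e->accesses[n] > e->accesses[best])
                best = n;
        return best;
    }

    /* Emit ALLOCATE(first-last,node) for every run of consecutive pages
     * that is dominated by the same node. */
    static void suggest_allocation(const hist_entry_t *hist, int pages, int nodes)
    {
        int start = 0, owner = dominating_node(&hist[0], nodes);
        for (int p = 1; p <= pages; p++) {
            int cur = (p < pages) ? dominating_node(&hist[p], nodes) : -1;
            if (cur != owner) {
                printf("ALLOCATE(%d-%d,%d)\n", start, p - 1, owner);
                start = p;
                owner = cur;
            }
        }
    }

For a block-structured histogram like the SOR example of Figure 1, this would reproduce the four block-wise ALLOCATE calls listed above; for phase-dependent patterns such as FT, a per-phase variant would be needed.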
5 Optimization Result
Using the visualization tool, we have studied the access pattern of all test codes and optimized their static data distribution using the newly introduced macro in SIMT/OMP. All changes are global for the whole execution time of the application. Table 3 documents the changes in the execution behavior. This table contains two blocks of data. The first block shows the absolute number of remote accesses before (Trans) and after (Opt) the optimization, and the resulting reduction. It shows that these results are very application-dependent, but most of the codes achieve a reduction of more than 50%. This directly results in a decrease of the average read latency, which is presented in the second block of the table. For comparison, the optimal latency (oLat.) in the case of only local memory accesses is also included. The resulting performance is shown in Figure 4 in terms of the speedup of the optimized against the transparent version for various system configurations. In general, all applications show significant performance improvements after their optimizations. The only exceptions are WATER, which has an extremely low L2 cache miss rate and therefore exhibits only a few remote accesses during its execution, and MULM, whose access pattern is difficult to optimize using a single, static memory layout.

Table 3. Changing access behavior caused by the optimization (M = million), 32 processors

                Remote Accesses                 Average Read Latency
         Trans      Opt       Reduction     Trans      Opt       oLat.
FT       1.04M      0.45M     56%           23.92      7.3       2.08
MG       5.5M       1.43M     74%           21.59      6.49      1.73
CG       17128      11134     38%           12.86      9.52      1.34
LU       144M       27M       81%           386.6      41.3      7.55
RADIX    153M       29M       81%           67.1       6.38      2.07
OCEAN    2.9M       0.91M     69%           12.98      4.63      1.39
WATER    27219      9798      36%           2.52       1.94      1.06
MULM     434        213       51%           98.53      43.68     5.69
SOR      0.18M      0.03M     83%           113.46     11.9      2.85
Fig. 4. Speedup of the optimized over the transparent version (left: varying number of CPUs at a constant remote access latency of 2000 cycles; right: varying remote access latency at a constant system size of 32 CPUs)
Nevertheless, the latencies are still larger than in the optimal case, indicating potential for further optimization, e.g., using dynamic, phase-specific data distribution information. In addition, larger systems (with more CPUs) and more loosely coupled systems (with longer remote access latencies) benefit more from locality optimization. This is expected, since in larger systems each CPU holds only a smaller percentage of the total memory locally, and since the remote memory access latency directly impacts the observed mean latency. The use of memory layout optimizations therefore directly impacts the scalability of OpenMP codes on NUMA architectures and is crucial for larger systems.
6 Conclusions
Non-Uniform Memory Access machines are well suited for high-performance computing because they combine cost-effectiveness and scalability. However, OpenMP codes in NUMA environments often suffer from large remote memory access latencies, since (1) accesses to remote memories on a NUMA system can take up to two or three orders of magnitude longer than local accesses, and (2) OpenMP codes exploit fine-grained parallelism, which normally causes more remote memory accesses than other programming models. Locality optimizations are therefore crucial for achieving an efficient execution. We have presented a toolset based on simulation and memory access visualization to analyze the memory access behavior of applications and to guide locality optimization. Using this method, we have achieved significant performance gains of up to a factor of five for some applications. Especially on larger configurations and on systems with longer remote access latencies, i.e., systems that scale to larger numbers of CPUs, applications profit from such optimizations.
Acknowledgments We thank Peter Szwed, Cornell University, for providing the OpenMP ports of the SPLASH-2 benchmarks.
References
1. A. Basumallik, S.-J. Min, and R. Eigenmann. Towards OpenMP Execution on Software Distributed Shared Memory Systems. In Proceedings of the 4th International Symposium on High Performance Computing (ISHPC 2002), pages 457-468, 2002.
2. D. Bailey et al. The NAS Parallel Benchmarks. Technical Report RNR-94-007, NASA Ames Research Center, March 1994.
3. Marc González, Eduard Ayguadé, Xavier Martorell, Jesús Labarta, Nacho Navarro, and José Oliver. NanosCompiler: Supporting Flexible Multilevel Parallelism in OpenMP. Concurrency: Practice and Experience, 12(12):1205-1218, 2000.
4. T. S. Grbic, S. Brown, S. Caranci, G. Grindley, M. Gusat, G. Lemieux, K. Loveless, N. Manjikian, S. Srbljic, M. Stumm, Z. Vranesic, and Z. Zilic. Design and Implementation of the NUMAchine Multiprocessor. In Proceedings of the 1998 Conference on Design Automation, pages 66-69, Los Alamitos, CA, June 1998.
5. H. Hellwagner and A. Reinefeld, editors. SCI: Scalable Coherent Interface: Architecture and Software for High-Performance Computer Clusters, volume 1734 of Lecture Notes in Computer Science, chapter 3. Springer-Verlag, 1999.
6. H. Hellwagner and A. Reinefeld, editors. SCI: Scalable Coherent Interface: Architecture and Software for High-Performance Computer Clusters, volume 1734 of Lecture Notes in Computer Science, chapter 3. Springer-Verlag, 1999.
7. H. Jin, M. Frumkin, and J. Yan. The OpenMP Implementation of NAS Parallel Benchmarks and Its Performance. Technical Report NAS-99-011, NASA Ames Research Center, October 1999.
8. Y.-S. Kee, J.-S. Kim, and S. Ha. ParADE: An OpenMP Programming Environment for SMP Cluster Systems. In Proceedings of Supercomputing (SC2003), Phoenix, USA, November 2003.
9. K. Kusano, S. Satoh, and M. Sato. Performance Evaluation of the Omni OpenMP Compiler. In Proceedings of the International Workshop on OpenMP: Experiences and Implementations (WOMPEI), volume 1940 of LNCS, pages 403-414, 2000.
10. J. Laudon and D. Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. In Proceedings of the 24th International Symposium on Computer Architecture, pages 241-251, May 1997.
11. Xavier Martorell, Eduard Ayguadé, Nacho Navarro, Julita Corbalán, Marc González, and Jesús Labarta. Thread Fork/Join Techniques for Multi-Level Parallelism Exploitation in NUMA Multiprocessors. In Proceedings of the 1999 International Conference on Supercomputing, pages 294-301, Rhodes, Greece, June 1999.
12. T. Mu, J. Tao, M. Schulz, and S. A. McKee. Interactive Locality Optimization on NUMA Architectures. In Proceedings of the ACM Symposium on Software Visualization, San Diego, USA, June 2003.
13. A.-T. Nguyen, M. Michael, A. Sharma, and J. Torrellas. The Augmint Multiprocessor Simulation Toolkit for Intel x86 Architectures. In Proceedings of the 1996 International Conference on Computer Design, pages 486-491. IEEE Computer Society, October 1996.
14. B. Nitzberg and V. Lo. Distributed Shared Memory: A Survey of Issues and Algorithms. IEEE Computer, pages 52-59, August 1991.
15. M. Sato, H. Harada, and Y. Ishikawa. OpenMP Compiler for the Software Distributed Shared Memory System SCASH. In Proceedings of the Workshop on OpenMP Applications and Tools (WOMPAT 2000), 2000.
16. J. Tao, M. Schulz, and W. Karl. A Simulation Tool for Evaluating Shared Memory Systems. In Proceedings of the 36th Annual Simulation Symposium, pages 335-342, Orlando, Florida, April 2003.
17. Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 24-36, June 1995.
Dragon: A Static and Dynamic Tool for OpenMP Oscar Hernandez, Chunhua Liao, and Barbara Chapman Computer Science Department, University of Houston {oscar, liaoch, chapman}@cs.uh.edu
Abstract. A program analysis tool can play an important role in helping users understand and improve OpenMP codes. Dragon is a robust interactive program analysis tool based on the Open64 compiler, an open source OpenMP, C/C++/Fortran77/90 compiler for Intel Itanium systems. We developed the Dragon tool on top of Open64 to exploit its powerful analyses in order to provide static as well as dynamic (feedback-based) information which can be used to develop or optimize OpenMP codes. Dragon enables users to visualize and print essential program structures and obtain runtime information on their applications. Current features include static/dynamic call graphs and control flow graphs, data dependence analysis and interprocedural array region summaries, that help understand procedure side effects within parallel loops. On-going work extends Dragon to display data access patterns at runtime, and provide support for runtime instrumentation and optimizations.
1 Introduction Scientific and industrial OpenMP [10] applications are becoming increasingly large and complex. They are generally difficult to understand, parallelize, and optimize. A lot of programmer effort is thus expended in the task of investigating complex code structures. Extensive information on often low-level details of data structures and control flow is necessary for any kind of code reengineering, as well as for creating parallel applications under OpenMP or an alternative paradigm. The optimization of parallel programs is particularly challenging as it requires information pertaining to multiple processes or threads. For shared memory parallel codes, optimization strategies may include the privatization of scalar and array variables to improve data locality, and code modifications that reduce synchronizations, all of which require considerable insight into a given code. The automated provision of structural information (call graphs, procedure control flow graphs), interprocedural information (procedure side effects), information on variable usage (data dependences, use-def chains, aliases) and runtime profiling information, can speed up the development process and reduce the likelihood that errors are introduced in the process. Dynamic compiler feedback has proved to be useful for optimization as it can help programmers discover how an application is behaving, including detecting program hotspots, runtime anomalies, and other behavior which is unknown statically. *
This work was partially supported by the DOE under contract DE-FC03-01ER25502 and by the Los Alamos National Laboratory Computer Science Institute (LACSI) through LANL contract number 03891-99-23.
Dragon is a tool that gathers information on the structure and variable usage of a potentially multi-language program, saves this data and displays the results to a user via a graphical menu-driven interface. If dynamic information is desired, it automatically instruments the code to gather additional data during one or more runs and stores these results also. It includes features that are commonly needed to study and change both sequential and parallel codes, with specific support for OpenMP code. Current features include static and dynamic call graphs and control flow graphs, data dependence graphs to help the user find parallel loops and detect race conditions, and summaries of array accesses across procedure boundaries, to support the parallelization of loops with procedure calls and enable an understanding of procedure side effects within parallel loops. We decided to use the publicly available Open64 compiler infrastructure to create this tool, since Open64 is robust, has powerful analyses, supports the programming languages that are of interest to us and is open source. However, Open64 is undergoing further development by a number of different researchers and so this also constrains us to perform our work in such a way that we can easily adapt to newer versions of the compiler infrastructure. Dragon is a robust tool that can analyze large real-world applications and visualize large graphs, and is freely available (http://www.cs.uh.edu/~dragon). This paper presents the design and implementation of Dragon, and its organization is as follows. Section 2 describes the Open64 compiler [9] and its support for OpenMP. The following section introduces Dragon, its structure, functionalities and interfaces to Open64. We next show how Dragon supports OpenMP and gives dynamic feedback. Finally, we reach conclusions and discuss future work.
2 An Overview of the Open64 Compiler The Open64 compiler infrastructure was originally developed and put into the public domain by Silicon Graphics Inc.; it is currently maintained by Intel. Written in C++, Open64 accepts Fortran 77/90 and C/C++, as well as the shared memory API OpenMP. The system targets Intel’s IA-64 processors. Open64 is a well-written compiler performing state-of-the-art analyses that can be exploited by tools such as Dragon. They include interprocedural analysis, data flow analysis, data dependence analysis and array region analysis. A number of different research groups already base their compiler research on this open source system. 2.1 IR and Optimizations The intermediate representation (IR) for the Open64 compiler, called WHIRL, has five different levels, starting with Very High Level (VHL) WHIRL, and serves as the common interface among all the front end and back end components. Each optimization phase is designed to work at a specific level of WHIRL. Our tool gathers information primarily from the VHL and High Level (HL) WHIRL phases, which preserve high level control flow constructs, such as do and for loops, as well as OpenMP directives, which are preserved as compiler pragmas. HL WHIRL can be translated back to C and Fortran source code with only a minor loss of semantics, and OpenMP can be lowered to threaded code at this level.
Fig. 1. The Open64 compiler modules: front ends (FE) for C/C++/F77/F90, IPL, IPA, LNO, WOPT, CG, and the assembler (AS)
The Open64 compiler basically consists of five modules as shown in Fig. 1, with multiple front ends (FE) that parse C/C++/Fortran programs and translate them into VHL WHIRL. Additional "special-purpose" modules include automatic parallelization (APO). If interprocedural analysis is invoked, IPL (the local part of interprocedural analysis) first gathers data flow analysis and procedure summary information from each compilation unit, and the information is summarized and saved into files. Then, the main IPA module gathers all the IPL files to generate the call graph and performs interprocedural analysis and transformations based on the call graph. Array region analysis describes or summarizes the sub-regions of array accesses and supports the transformations performed in later modules. The back end module is invoked next; it consists of the Loop Nest Optimizer (LNO), the global optimizer (WOPT), and the Code Generator (CG). WOPT performs aggressive data flow analysis and optimizations based on the SSA form. LNO calculates a dependence graph for all array statements inside each loop of the program, and performs loop transformations. CG creates assembly code, which is finally transformed to binaries by AS (the assembler). Via a feedback strategy, it is possible to collect execution frequency counts for program constructs such as branches, loops and procedure calls. The interprocedural analyzer uses the feedback to optimize the call graph by deciding when to clone procedures with constant parameters, and for inlining purposes. Feedback is also used to optimize branch prediction.
2.2 OpenMP Support
Open64 was delivered with support for OpenMP in the Fortran front end; in the meantime, the C front end has been enhanced and both are OpenMP 1.0 compliant. The Open64 C/C++ front ends are based on GNU technology and the OpenMP pragmas are parsed to GNU High Level IR; the Fortran directives are initially converted to Cray IR. After parsing, the directives and pragmas of both (Crayf90, GNU IR) are translated to VHL WHIRL. During this process the parallel directives are lowered to regions of code, which reflect the scope of the directives. The clauses of the directives are annotated within the regions for further processing. After that, the OpenMP constructs are lowered. Subroutines are spawned to implement the parallel regions, and the OpenMP directives are translated to runtime calls to a threading library implemented via Pthreads. For example, loops are broken down into chunks, based on the specified schedule, and assigned to individual threads, and the respective runtime system calls are inserted. The compiler handles the memory for SHARED, PRIVATE, FIRSTPRIVATE and LASTPRIVATE variables but currently lacks support for THREADPRIVATE variables and directives such as COPYIN, which copies data from sequential regions to the parallel region.
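As an illustration of this lowering step, the fragment below sketches, in C, how a compiler might outline the body of a parallel loop into a separate routine, chunk the iteration space statically, and hand the outlined routine to a small pthread-based runtime. All names and the runtime itself are invented for this sketch; this is not the Open64 runtime interface or its generated code:

    #include <pthread.h>

    #define NUM_THREADS 4   /* assumed fixed team size for the sketch */

    struct loop_args { double *a, *b, *c; int n; };

    struct thread_ctx { void (*body)(void *, int, int); void *args; int tid; };

    static void *thread_entry(void *p)
    {
        struct thread_ctx *ctx = p;
        ctx->body(ctx->args, ctx->tid, NUM_THREADS);
        return 0;
    }

    /* Minimal stand-in for the runtime "fork" call inserted by the compiler:
     * start a team of threads, run the outlined body, then join (the implicit
     * barrier at the end of the parallel region). */
    static void rt_fork(void (*body)(void *, int, int), void *args)
    {
        pthread_t threads[NUM_THREADS];
        struct thread_ctx ctx[NUM_THREADS];
        for (int t = 0; t < NUM_THREADS; t++) {
            ctx[t].body = body;
            ctx[t].args = args;
            ctx[t].tid  = t;
            pthread_create(&threads[t], 0, thread_entry, &ctx[t]);
        }
        for (int t = 0; t < NUM_THREADS; t++)
            pthread_join(threads[t], 0);
    }

    /* Outlined body of "for (i = 0; i < n; i++) a[i] = b[i] + c[i];":
     * each thread works on one statically computed chunk. */
    static void outlined_loop(void *argp, int tid, int nthreads)
    {
        struct loop_args *arg = argp;
        int chunk = (arg->n + nthreads - 1) / nthreads;
        int lo = tid * chunk;
        int hi = (lo + chunk > arg->n) ? arg->n : lo + chunk;
        for (int i = lo; i < hi; i++)
            arg->a[i] = arg->b[i] + arg->c[i];
    }

    /* What the original parallel loop is replaced with. */
    void parallel_vadd(double *a, double *b, double *c, int n)
    {
        struct loop_args arg = { a, b, c, n };
        rt_fork(outlined_loop, &arg);
    }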
3 The Dragon Tool In order to implement Dragon, we directly modified the Open64 compiler to extract the relevant information on an input program, and created a separate menu-driven graphical interface with which the user may retrieve a variety of facts about their code.
Fig. 2. The Dragon Tool Architecture: the Open64 modules (FE with the new CFG_IPL component, IPL, IPA, WOPT/LNO/CG) feed a program analysis database holding the control flow graph, call graph, feedback, dependence graph and array regions, which the Dragon tool browser displays on screen or prints via VCG (.vcg, .ps, .bmp)
Fig. 2 shows the architecture of Dragon. Dashed lines surround the Open64 modules: FE, IPL, IPA, WOPT, LNO and CG. We added an extra module, CFG_IPL, which can be turned on or off via compiler flags, to export control flow analysis results into a program database. Call graph, data dependence and array regions information are extracted from later compiler modules and exported also in this manner. After compilation, Dragon exploits the information that has been stored in the program database, determining the mapping of results to the corresponding source code and displaying it in graphical and text forms as required. If dynamic information is required, the program must also be executed prior to invoking Dragon. The program information provided by Dragon can be displayed on the screen or saved in printable formats (.vcg, .ps, or .bmp) using VCG. We describe the most important features in Dragon in more detail below. 3.1 The Static and Dynamic Call Graph A call graph represents the calling relationships in a program. It contains a node for each procedure in the program, and a directed edge linking a pair of nodes if and only if the procedure corresponding to the source node may invoke the sink node's procedure at run time. The graph can be cyclic where recursion is allowed. In Open64 and Dragon, the caller and callee may be in different languages. For instance, a Fortran program can call a function written in C, and vice versa. Our call graph display
distinguishes the root procedure (main program), leaf procedures (that do not call any other routines), dead procedures (that are never called), and procedures containing OpenMP code via node coloring. Clicking on a node or selecting a procedure from a text list of procedure names will cause the corresponding source code to be displayed in a separate browser window. Fig. 3 shows the call graph for an OpenMP version of NAS BT [7]. The static call graph can also display an estimated “procedure weight” for each node, defined as the weighted sum of statement count, basic block count and call count within a procedure. The weight, originally designed for use by the inlining heuristic, can give an insight to the programmer on the size of procedures. The number of nodes containing OpenMP constructs can also be displayed.
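The weight mentioned here can be pictured as a simple linear combination of the three size metrics. The paper does not give the individual coefficients, so the values in this small sketch are placeholders only:

    /* Estimated "procedure weight": a weighted sum of statement count,
     * basic block count and call count.  The coefficients are assumed,
     * not taken from Open64. */
    typedef struct { int statements, basic_blocks, call_sites; } proc_metrics_t;

    static double procedure_weight(proc_metrics_t m)
    {
        const double w_stmt = 1.0, w_bb = 1.0, w_call = 1.0;  /* placeholder weights */
        return w_stmt * m.statements + w_bb * m.basic_blocks + w_call * m.call_sites;
    }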
Fig. 3. The call graph of BT from NAS parallel benchmarks
Using the compiler feedback functionality, Dragon can also display the dynamic call graph, which shows the call chains and the number of times that functions or procedures were invoked at runtime. If the application was run several times, feedback from the different runs is collected and merged into the call graph, to show the frequencies with which procedures were invoked in different runs. This information, plus the cycle count for each procedure, can help to detect hot spots in the application, especially for procedures containing OpenMP. This may help the programmer decide when to inline a procedure, or where to focus manual optimization efforts.
3.2 The Static and Dynamic Control Flow Graph The control flow graph represents the detailed structure of an individual subroutine or function. It is the basis for dataflow analysis and many standard optimizations. Open64 constructs the control flow graph from HL WHIRL, doing so multiple times in different phases of the compiler while translating a program. Dragon retrieves the control flow graph for each procedure in the IPL module. A node in the graph (Fig. 4) represents a basic block, a maximal segment of straight-line code with a single entry and single exit; each node is linked to all the possible successor nodes in the procedure by directed edges, so that each possible execution path within the procedure is represented. In the display, nodes are colored to distinguish the entry points and exit points of a procedure, OpenMP regions, branches and loops. In the same manner as with the call graph, a dynamic control flow graph can be displayed in Dragon using feedback information from previous runs. Users may locate the most frequently executed path in the control flow graph and the compiler may also use this information to enhance its optimizations such as branch prediction.
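One simple way such a most-frequently-executed path could be extracted from the dynamic edge counts is a greedy walk from the entry block, always following the successor edge with the highest frequency. The representation and the algorithm below are our own sketch, not Dragon's internal implementation:

    #define MAX_BLOCKS 512
    #define MAX_SUCCS  8

    typedef struct {
        int  nsucc[MAX_BLOCKS];
        int  succ[MAX_BLOCKS][MAX_SUCCS];        /* successor block ids        */
        long edge_count[MAX_BLOCKS][MAX_SUCCS];  /* dynamic frequency per edge */
    } cfg_t;

    /* Writes the block ids of the hot path into path[] and returns its length;
     * stops at an exit block or when a block would be revisited. */
    static int hot_path(const cfg_t *cfg, int entry, int *path, int max_len)
    {
        char visited[MAX_BLOCKS] = { 0 };
        int len = 0, b = entry;
        while (len < max_len && !visited[b]) {
            visited[b] = 1;
            path[len++] = b;
            if (cfg->nsucc[b] == 0)
                break;                           /* procedure exit reached */
            int best = 0;
            for (int k = 1; k < cfg->nsucc[b]; k++)
                if (cfg->edge_count[b][k] > cfg->edge_count[b][best])
                    best = k;
            b = cfg->succ[b][best];
        }
        return len;
    }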
Fig. 4. The control flow graph of NAS BT benchmark
Fig. 4 shows the control flow graph of a main function: the user can easily understand the structure of the function and the actual execution path through it, or click on a node to examine the related source locations. The control flow graph is particularly useful for understanding the function structure and navigating the source code when there are complex nested loop structures. We developed a flow graph class, an edge class and a basic block class and plugged them into the WOPT module in IPL to save information on the flow graph, basic
blocks, and flow graph edges of each procedure, along with the corresponding source code location (directory name, file name and line number), in Dragon’s database. Dragon requires an exact mapping between the control flow graph (which is generated directly from a level of IR) and the source code. For example, a contiguous sequence of statements in the source code display will be highlighted if a basic block node in the control flow graph is selected. However, this one-to-one correspondence with the source code is only retained in VHL WHIRL. Most compiler analyses, including the construction of the control flow graph, are performed at HL WHIRL or lower levels. Thus some constructs in the source code that are directly represented in VHL WHIRL have been translated to a lower level representation before the control flow information is derived. In particular, Fortran 90 has been translated to Fortran 77 as part of this lowering. Without additional work, loops would appear in the control flow graph in place of array statements, leading to a source code mapping problem. One possible solution to this is to record the translation and mapping from VHL to HL WHIRL, and recover the VHL WHIRL after the analyses are performed. This would generally help us deal with mappings between the source code and subsequent WHIRL and related analysis, but it requires non-trivial programming effort, since Open64 was not designed to keep such an exact mapping. A simpler strategy was to deal with the control flow graph mapping problem separately by adding code to construct the control flow graph before VHL WHIRL is lowered. Our current system includes the CFG_IPL module that does so by invoking the pre-optimizer and storing the results in the Dragon database. It does not affect Open64 analyses because the flow graph is rebuilt in other modules as required. Since the original Open64 code did not handle the features of VHL WHIRL that are lowered, our method required us to extend the existing flow graph construction code, primarily to deal with Fortran 90 features such as array statements, array sections, and the WHERE construct. There are a few limitations at present; for example, the SELECT-CASE structure is replaced by IF constructs. 3.3 The Data Dependence Graph Dragon can display data dependence information for each loop nest. This is essential in any effort to reorganize a sequential program containing loops or to obtain a parallel counterpart. However, even though this concept is well understood by many application developers, it is notoriously difficult to accurately detect the dependences that exist in even small regions of code. Dragon extracts this information from the Loop Nest Optimizer (LNO) in the Open64 compiler. The dependence graph (in Fig. 5) displays are based on results from the Omega test [22], the most accurate integer symbolic test available. Our tool can map the data dependences back to the source code, showing the array accesses involved. We also show the dependence distance and direction vector, to provide a better understanding of the opportunities for program modification. Programmers may use this information to locate possible race conditions within an OpenMP program. We have a detailed example in section 4 to show how the display of the dependence graph information can be used in parallelizing a sequential loop.
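To illustrate what the distance and direction vectors express, consider the generic (textbook-style) loop nest below, which is not taken from the BT code:

    #define N 256   /* arbitrary problem size for the example */

    /* The write to a[i][j] and the read of a[i-1][j] form a true (flow)
     * dependence carried by the outer i loop:
     *   distance vector (1, 0), direction vector (<, =).
     * The i loop therefore cannot simply be run in parallel, while the
     * inner j loop carries no dependence and can be parallelized. */
    void sweep(double a[N][N], const double b[N][N])
    {
        for (int i = 1; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = a[i - 1][j] + b[i][j];
    }

Dragon presents exactly this kind of information, mapped back to the array references in the source, so that the user can judge whether a dependence really prevents parallelization.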
Fig. 5. The Data Dependence Graphs of NAS BT benchmark
3.4 Interprocedural Array Region Information Whenever the user needs to decide whether a loop containing procedure calls can be parallelized, interprocedural analysis is needed to determine whether the procedure’s side effects permit this. Dragon can display interprocedural analysis results (see Fig. 6.) in the form of procedure side effects on global variables, arrays and actual parameters.
Fig. 6. Interprocedural Array Region Analysis for NAS BT benchmark
The display shows which variables are accessed within a procedure. In the case of arrays, it shows the regions of the arrays accessed, and if they were used or modified. This requires that the effect of multiple references to an array must be summarized, which can lead to inaccuracies. The array regions are summarized in triplet notation forms and can be mapped back to the source code.
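As a small, hypothetical illustration of such a summary (the routine and the concrete regions are our own, following the lower:upper:stride triplet convention mentioned above):

    /* A caller that wraps this routine in a loop can consult the summarized
     * side effects instead of re-analyzing the body:
     *   MOD region for a:  a(0 : n-1 : 1)   -- every element may be written
     *   USE region for b:  b(0 : n-1 : 2)   -- only even-indexed elements read
     */
    void scale_even(double *a, const double *b, int n)
    {
        for (int i = 0; i < n; i += 2)   /* writes the even elements of a */
            a[i] = 2.0 * b[i];
        for (int i = 1; i < n; i += 2)   /* writes the odd elements of a  */
            a[i] = 0.0;
    }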
4 Case Study: Creating an OpenMP Program with Dragon
Dragon facilitates the task of writing OpenMP programs by helping a programmer determine which functions or loop nests should be parallelized and whether there are any data dependences that may prevent this. Although the user must decide whether it is possible to overcome such dependences, information on the variable accesses involved exposes their details. A basic strategy for parallelizing a code using Dragon is the following:
Step 1: Modify the Makefile of the application to use the Open64 compiler with interprocedural analysis, loop nest optimization and profiling options selected, as well as the "-dragon" flag.
Step 2: Compile the application and run it. Steps 1 and 2 may be repeated several times, since some options for different analysis information cannot be used at the same time.
Step 3: Invoke Dragon and use its menus to locate important functions or procedures in the code, inspect the call graph to find the highest-level opportunity for exploiting parallelism, and use the control flow graph to find loop nests; data dependence information can then be requested for loops to determine whether they can be parallelized. This will give details of any variable accesses that may prevent parallelization, so that opportunities for manual modification may be explored.
The NAS Parallel Benchmarks (NPB) is a suite of eight codes designed to test the performance of parallel supercomputers. We use the serial version of BT (Block Tridiagonal solver) from NPB 3.1 [21] to illustrate these steps. In the first step, we modify BT's make.def file to use the Open64 Fortran compiler with the following options:

    F77        = openf90
    FFLAGS     = -fb_create myfeedback -fb_type=1 -fb_phase=0
    FLINKFLAGS = -fb_opt myfeedback -dragon -ipa -O2

After compilation and execution of BT with test data, a new set of options is used in both FFLAGS and FLINKFLAGS to recompile the application and dump the feedback information: -fb_opt myfeedback -dragon -ipa -O2. As a result, the call graph, control flow graph and feedback information will be stored in Dragon's database for subsequent querying. It is also easy to obtain data dependence information: we use another compiler option, "-dragon -O3", in the Makefile to recompile the application, ensuring that the loop nest optimization procedures in the compiler are carried out and the data dependence information generated there is exported to the Dragon database.
Fig. 7. CPU cycles (in millions) for the top 10 functions in BT (binvcrhs, matmul_sub, x_solve, y_solve, z_solve, compute_rhs, matvec_sub, lhsinit, binvrhs, add)
Fig. 8. The control flow graph of x_solve
Figure 7 shows part of the results of profiling all functions from BT. As we can see, function binvcrhs consumes most of the execution cycles but it does not contain loops. Therefore, it cannot be parallelized at this level. In general, superior performance is obtained with large parallel regions, so we need to use the call graph to determine
where this procedure is invoked. The call graph tells us that the callers of binvcrhs are x_solve, y_solve and z_solve, where binvcrhs is invoked within several loops. Matmul_sub is the second most time-consuming routine and it contains a loop; however, this procedure is a leaf in the call graph. It is also called by x_solve, y_solve and z_solve. These are the 3rd, 4th and 5th most time-consuming functions, and their flow graphs (e.g., Figure 8) show good opportunities for parallelization. Moreover, the caller of x_solve, y_solve and z_solve contains no loop structures, so nothing can be done at that level. The situation is now clear: we focus our attention on the parallelization of the three subroutines x_solve, y_solve and z_solve.
Fig. 9. Data dependence information for x_solve
The code of x_solve contains many complex nested loops and is thus difficult to read. The control flow graph can be very helpful in this case. From the graph (Fig. 8), the programmer can easily see that x_solve (y_solve and z_solve are the same) has two branches with a triply-nested loop, while the innermost loop contains three statement blocks plus three more loops, the last of which is also triply-nested. Despite the complexity of the control flow structure, the data dependences in these loops are quite simple. Fig. 9 shows the dependence graph and its corresponding source position. Associating the source position with the control flow graph in Fig. 8, we find that the true dependence is located in the last triply-nested loop and that it does not prevent us from inserting OpenMP directives to parallelize the outermost loop. We obtain the following for the outermost loop:

    !$omp parallel do default(shared) shared(isize)
    !$omp& private(i,j,k,m,n)
          do k = 1, grid_points(3)-2
            do j = 1, grid_points(2)-2
              do i = 0, isize
                ...
The same procedure can be repeated for the remaining functions to quickly get an initial OpenMP version of the original serial code. Although Dragon considerably speeds up this process, the potential for further improvements can be seen. For example, a user might desire to obtain profiling information at the loop level also (even though this may distort procedure-level information if gathered at the same time). In addition, the data dependence analysis is limited to individual loop nests. It may sometimes be necessary to analyze the data dependences between two consecutive loops; programmers can use this information to insert a NOWAIT clause between them. Another problem is that the burden of finding out which variables should be shared and which are private remains with the user. Dragon is currently being interfaced with the automatic parallelizer of Open64, which may provide insight into this.
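For instance, when the analysis shows that two consecutive loops touch disjoint data, the implicit barrier between them can be dropped. The generic C fragment below (arrays and names invented for illustration) shows the resulting use of the NOWAIT clause:

    /* Safe only because the two loops access disjoint arrays, so no thread
     * needs the results of the first loop before starting the second. */
    void two_independent_loops(double *x, double *y,
                               const double *a, const double *b, int n)
    {
        #pragma omp parallel
        {
            #pragma omp for nowait
            for (int i = 0; i < n; i++)
                x[i] = 2.0 * a[i];

            #pragma omp for
            for (int i = 0; i < n; i++)
                y[i] = b[i] + 1.0;
        }
    }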
5 Related Work There are a number of other tools that analyze source code, or provide support for parallelization. They include CAPO [6] for automatic OpenMP generation which supports Fortran 77 and some F90 extensions, Intel Thread Checker [17] which focuses on OpenMP semantic checking, GPE [16] which uses a static hierarchical task graph to help users create OpenMP. Foresys [4] is an early product that supports the maintenance of Fortran code and helps with parallelization. Profile tools such as TAU [18], Expert [19], and Guide [20] focus on performance issues, and do not perform program analysis. None of these program analysis tools integrate both dynamic and static analysis information for OpenMP or provide similar features for multilanguage codes. Our tool also differs from SUIF Explorer [23] because there is no OpenMP support from SUIF explorer.
6 Conclusions and Future Work Dragon is an interactive tool that provides detailed information about a C/Fortran77/Fortran90 program that may contain OpenMP/MPI constructs. It takes advantage of Open64’s analysis capabilities. Much of the information displayed in this graphical tool is general-purpose and could be employed in many situations, from analyzing legacy sequential code to helping users reconstruct parallel code. Dragon has been tested on many large applications including the POP code [11], ASCI Sweep3d and UMT98 [1]. GenIDLEST [14] and the NAS OpenMP parallel benchmarks [7] have also been analyzed successfully by the Dragon tool. Both POP beta 2.0 and GenIDLEST are large, real-world applications containing MPI and OpenMP. We are currently enhancing Dragon and Open64 to provide dynamic (runtime) array region information, in order to better understand the actual array access patterns for different OpenMP threads. This functionality may help privatize data in an OpenMP code, or restructure code to achieve SPMD style OpenMP. Each time an array region is accessed dynamically, we will record the information necessary to instantiate that region (which may have been described in symbolic terms), and determine which thread has accessed it. We will combine this information with the dynamic call graph and control flow graph of the application to provide different views of array region summaries at the procedure, basic block, statement and thread level. One of our research goals is to provide further support for developing SPMD style OpenMP, which promises to provide good performance on shared and distributed shared memory systems. To do this we need to develop additional functionality, including the ability to analyze and present information on explicitly parallel OpenMP programs, such as the parallel data flow graph and parallel task graph.
References
1. The ASCI Blue Benchmark Codes, http://www.llnl.gov/asci_benchmarks/
2. B. M. Chapman, T. H. Weng, O. Hernandez, Z. Liu, L. Huang, Y. Wen, L. Adhianto. "Cougar: Interactive Tool for Cluster Computing," Proceedings of the 6th World Multiconference on Systemics, Cybernetics and Informatics, Orlando, Florida, July 14-18, 2002.
3. Dragon tool, http://www.cs.uh.edu/~dragon
4. Foresys, http://www.simulog.fr/is/2fore1.htm
5. O. Hernandez, "Dragon Analysis Tool", Master's Thesis, Department of Computer Science, University of Houston, December 2002.
6. C. S. Ierotheou, S. P. Johnson, M. Cross, and P. Legget, "Computer Aided Parallelisation Tools (CAPTools) - Conceptual Overview and Performance on the Parallelisation of Structured Mesh Codes," Parallel Computing, 22 (1996) 163-195.
7. H. Jin, M. Frumkin, J. Yan, "The OpenMP Implementation of NAS Parallel Benchmarks and its Performance," NASA Technical Report NAS-99-011, 1999.
8. Z. Liu, B. Chapman, T.-H. Weng, O. Hernandez, "Improving the Performance of OpenMP by Array Privatization," WOMPAT 2002, Workshop on OpenMP Applications and Tools, The University of Alaska Fairbanks, Fairbanks, Alaska, August 5-7, 2002.
9. The Open64 compiler, http://open64.sourceforge.net/
10. The OpenMP Application Program Interface, http://www.openmp.org
11. Parallel Ocean Program (POP), http://www.acl.lanl.gov/climate/models/pop/
12. Source Navigator, http://sourcenav.sourceforge.net/
13. M. Satoh, Y. Aoki, K. Wada, T. Iitsuka, and S. Kikuchi, "Interprocedural Parallelizing Compiler WPP and Analysis Information Visualization Tool Aivi," Second European Workshop on OpenMP (EWOMP 2000), 2000.
14. D. K. Tafti. "GenIDLEST - A Scalable Parallel Computational Tool for Simulating Complex Turbulent Flows," Proceedings of the ASME Fluids Engineering Division, FED 256, ASME-IMECE, Nov. 2001, New York.
15. T.-H. Weng, B. M. Chapman. "Implementing OpenMP Using Dataflow Execution Model for Data Locality and Efficient Parallel Execution," Proceedings of the 7th Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS-7), IEEE, Ft. Lauderdale, April 2002.
16. M. Giordano and M. Mango Furnari. GPE: An OpenMP Program Development Environment. 2004, to appear.
17. Intel Thread Checker, http://www.intel.com/software/products/threading/tcwn/
18. Tuning and Analysis Utilities (TAU), http://www.cs.uoregon.edu/research/paracomp/tau
19. F. Wolf, B. Mohr. "Automatic Performance Analysis of Hybrid MPI/OpenMP Applications," Euro PDP 2003, pp. 13-22.
20. Guide Reference Manual, http://www.intel.com
21. The NAS Parallel Benchmarks, http://www.nas.nasa.gov/Software/NPB/
22. W. Pugh, "The Omega Test: A Fast and Practical Integer Programming Algorithm for Dependence Analysis," Proceedings of the 1991 ACM/IEEE Conference on Supercomputing, pp. 4-13, November 18-22, 1991, Albuquerque, New Mexico.
23. S.-W. Liao, A. Diwan, R. P. Bosch, Jr., A. Ghuloum, M. S. Lam, "SUIF Explorer: An Interactive and Interprocedural Parallelizer," Proceedings of the 7th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'99), May 1999.
The ParaWise Expert Assistant - Widening Accessibility to Efficient and Scalable Tool Generated OpenMP Code
Stephen Johnson¹, Emyr Evans¹, Haoqiang Jin², and Constantinos Ierotheou¹
¹ Parallel Processing Research Group, University of Greenwich, Old Royal Naval College, Park Row, Greenwich, London SE10 9LS, UK
{s.johnson, e.w.evans, c.ierotheou}@gre.ac.uk
² NAS Division, NASA Ames Research Center, Moffett Field, CA 94035-1000, USA
[email protected] Abstract. Despite the apparent simplicity of the OpenMP directive shared memory programming model and the sophisticated dependence analysis and code generation capabilities of the ParaWise/CAPO tools, experience shows that a level of expertise is required to produce efficient parallel code. In a real world application the investigation of a single loop in a generated parallel code can soon become an in-depth inspection of numerous dependencies in many routines. The additional understanding of dependencies is also needed to effectively interpret the information provided and supply the required feedback. The ParaWise Expert Assistant has been developed to automate this investigation and present questions to the user about, and in the context of, their application code. In this paper, we demonstrate that knowledge of dependence information and OpenMP are no longer essential to produce efficient parallel code with the Expert Assistant. It is hoped that this will enable a far wider audience to use the tools and subsequently, exploit the benefits of large parallel systems.
1 Introduction
The OpenMP standard [1] was devised to simplify the exploitation of parallel hardware, making it accessible to a far wider audience than most alternative techniques. Standardization allowed a single application code to be portable to a wide range of parallel machines and compilers. The shared memory model followed by OpenMP allowed simpler implementation than the explicit message passing model that has proved successful for distributed memory systems. Despite this, creating efficient, scalable OpenMP code still requires a significant degree of expertise as effective determination of, in particular, parallelism and variable privatization is essential. The overheads inherent in an OpenMP parallel region limit performance when inner loops in a nest operate in parallel, requiring
outer loop parallelization in many cases to facilitate efficiency. This often adds the complexity of investigating interprocedural variable references to the process as loops containing call sites need to operate in parallel. Other key factors to efficient parallelization involve reduction of OpenMP overheads by having parallel regions containing many loops and avoiding unnecessary synchronization. The necessary expertise and the volume of information that needs to be considered have limited the use and success of OpenMP parallelization. To ease these problems, automatic parallelizing compilers from vendors and research groups [2, 3, 18], and interactive parallelization tools [13, 17, 19] have been developed to detect parallelism based on some form of dependence analysis of the application. Obviously, the ideal scenario for a user who wishes to parallelize their code is the automatic parallelizing compiler, however, in practice, the resultant performance is typically limited, particularly in terms of scalability. There are a number of reasons for this, with one fundamental problem being related to unknown input information such as the value ranges of variables that are read into the application code. This type of information can often be critical in accurate dependence determination and so the compiler is forced to conservatively assume a dependence exists, potentially preventing parallelism detection or valid privatization. Additionally, accurate analysis can be restricted by the time constraints a compiler must meet for commercial acceptability where indepth analysis may be sacrificed to allow rapid operation. The alternative approach is to use interactive parallelization tools to allow user involvement in the parallelization process whilst still exploiting the power of dependence analysis. The vital role of the user leads to the need for an accurate dependence analysis to only involve the user when necessary. A detailed interprocedural, value based dependence analysis can be used to produce a more accurate dependence graph, at the cost of a greater analysis time, but the consequent reduction in user effort is often deemed to warrant this expensive analysis. The Parawise/CAPO parallelization tools [4, 5, 6, 13] have been developed based on such a philosophy. Unfortunately, as the user is involved, some degree of expertise is again required to interpret information and provide the necessary feedback. This expertise may well include the understanding of the OpenMP parallelization strategy, including variable privatization, and also some level of understanding of dependence information. As a result, the tools may not provide increased accessibility to the production of efficient and scalable OpenMP code. To overcome the limitations described above in the use of parallelization tools, we are developing an environment that will meet the needs of a wide range of potential users. For users with expertise in OpenMP parallelization, the clear presentation of complex interprocedural information and the automation of the vast majority of the parallelization process are provided to greatly improve their productivity, allowing them to focus on fine tuning application codes for high parallel efficiency. For users with some experience with OpenMP, the environment is intended to relieve them of almost all parallelization decisions and avoid the trial-and-error approach to OpenMP parallelization often taken (e.g. where an assumption is made and tested, as a thorough interprocedural investigation of
all the required issues proved overwhelming). Most importantly, the environment also aims to cater for users with no experience, and often no interest, in OpenMP parallelization. Such users typically have expertise in other areas of science and only require the computational power of parallel processing, without much interest in how this is achieved. To make the environment accessible to these potential users, all aspects of OpenMP parallelization and dependence analysis must be hidden from the user, with the interaction concentrating instead on information related to the application code being parallelized. If the information required to enable efficient, scalable parallelization can be obtained from a user by only asking questions about their application code, then the number of users of tools, and therefore of OpenMP, could dramatically increase. In this paper we focus on the development of an Expert Assistant, a key component of the environment, that attempts to close the gap between what the tool needs to know to improve the parallelization and what the user knows about the application code.
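As a minimal sketch of the overhead-reduction techniques mentioned above (invented loops, not taken from any of the applications discussed), two independent loops can share one parallel region, and the barrier after the first worksharing loop can be removed once it is known that no thread uses data written by another thread in that loop:

!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(I)
!$OMP DO
      do i = 1, n
        a(i) = b(i) + c(i)
      enddo
!$OMP ENDDO NOWAIT
!$OMP DO
      do i = 1, n
        d(i) = e(i)*f(i)
      enddo
!$OMP ENDDO
!$OMP END PARALLEL

Compared with two separate PARALLEL DO constructs, this pays the region start-up cost once instead of twice and removes one of the two barrier synchronizations.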
2 ParaWise/CAPO OpenMP Code Generation
ParaWise is a parallelization tool from PSP Inc. [4, 6, 7, 12] that performs an in-depth interprocedural, value-based dependence analysis of Fortran applications. It has been developed over many years and has numerous techniques to improve the quality of its analysis [4]. The CAPO module [13, 14] uses dependence information provided by ParaWise to generate effective OpenMP parallel code. The CAPO module includes many algorithms to enhance the quality of the generated code, including automatic merging of parallel regions in an interprocedural framework, routine duplication to allow differing levels of parallelism for different call sites, and automatic insertion of NOWAIT directives to reduce synchronization, as well as parallelism detection and variable privatization determination.

The power of the CAPO OpenMP code generation is illustrated by the example, taken from an application code, shown in figure 1. In the original serial code (figure 1(i)) there are two calls to routine r2r, one inside the j loop at S1, which is parallel, and the other outside that loop at S2. Inside routine r2r is another loop (k) that is also parallel. The ParaWise dependence analysis and CAPO algorithms automatically identify that these loops are parallel and determine that for the call site of r2r inside the j loop parallelism in that loop should be exploited, whilst for the call site at S2 parallelism in the contained k loop should be exploited (in this case, not considering multiple levels of parallelism in the j and k loops [14]). To allow this, routine r2r is automatically cloned (routine cap_r2r) and is referenced at S1 inside the j loop (figure 1(ii)). The parallel region for the parallel k loop in the original routine r2r is then migrated to surround the call site at S2 and immediately follows the parallel region for the j loop. The parallel regions are then merged to form a single parallel region, reducing the region start and stop overheads incurred, and additional algorithms determine that there is no inter-thread interaction between the two loops, so thread synchronization at the end of the first loop is not required and a NOWAIT can be generated. In this case, the interprocedural operation of
(i) Original serial code:

          do j=2,jmax
S1          call r2r(j,kmax)
          enddo
S2        call r2r(1,kmax)

          subroutine r2r(j,kmax)
          do k=1,kmax
            . . .
          enddo

(ii) Generated OpenMP code:

!$OMP PARALLEL DEFAULT(SHARED),PRIVATE(J)
!$OMP DO
          do j=2,jmax
            call cap_r2r(j,kmax)
          enddo
!$OMP ENDDO NOWAIT
          call r2r(1,kmax)
!$OMP END PARALLEL

          subroutine r2r(j,kmax)
!$OMP DO
          do k=1,kmax
            . . .
          enddo
!$OMP ENDDO

          subroutine cap_r2r(j,kmax)
          do k=1,kmax
            . . .
          enddo

[Panels (iii) and (iv) are performance plots of time (secs) against number of processors (1-32); only the caption is reproduced here.]

Fig. 1. Example of code generation techniques used in CAPO OpenMP code generation. Timings on Origin 2000 for (iii) RJET code from Ohio Aerospace Institute (iv) OVERFLOW from NASA Ames Research Center (using log scale for time)
the dependence analysis and code generation are essential for efficient parallel execution, as hundreds of lines of code in the sub-call graph called within the k loop are executed in parallel. Together, these tools can generate efficient, scalable OpenMP code and have been used successfully on many real-world application codes [15, 16]. Figure 1(iii) shows timings for the RJET code from the Ohio Aerospace Institute, which studies vortex dynamics and breakdown in turbulent jets, and figure 1(iv) shows timings for the OVERFLOW code from the NASA Ames Research Center, used for aircraft simulations. Both codes show significant scalability up to and beyond 32 processors.

Both ParaWise and the CAPO module have been designed with user interaction as a vital component, providing browsers that allow the user to inspect the parallelization and to add information about the application code. For OpenMP parallelizations, the Directive Browser shows loops that will operate in parallel and, more importantly, those loops that will execute in serial, including an explanation of the inhibitors to parallelization. These inhibitors include variables that are assigned in one iteration of the loop and subsequently used in a later iteration, and variables that are assigned or used in several iterations where, commonly, the assigned values appear to be used after the loop has completed. From this, the relevant dependencies can be investigated in another browser and, if possible, information added about variables in the application code that proves the dependencies do not exist (or, alternatively, the user can explicitly delete dependencies). In every real-world application code that we have investigated, some amount of user interaction has been essential in the production of efficient and scalable parallel code.

These tools enable expert users of ParaWise/CAPO to produce efficient OpenMP code; however, an understanding of the application code, of dependence analysis and of OpenMP parallelization is needed to make the most effective use of the tools. To reduce the effort and skill required by an
expert, and allow non-experts to exploit the power of automatic parallelization techniques, the ParaWise Expert Assistant has been developed to identify the information that is required to enable effective parallelization and ask the user questions in the context of their application code.
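As a small invented illustration of the two classes of parallelization inhibitor the Directive Browser reports (these fragments are not from any of the codes discussed):

!     a value assigned in one iteration is used in a later iteration:
!     a loop-carried true dependence keeps the i loop serial
      do i = 2, n
        a(i) = a(i-1) + c(i)
      enddo

!     w is rewritten in every iteration and so could be privatized, but its
!     value also appears to be used after the loop, which blocks simple
!     privatization unless the post-loop use is proven to take its value
!     from elsewhere or a LASTPRIVATE clause can be used
      do i = 1, n
        w = 2.0*b(i)
        d(i) = w + c(i)
      enddo
      x = w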
3 The ParaWise OpenMP Expert Assistant
Ideally, both parallelization experts who specialize in producing OpenMP parallel codes and code authors (or whoever wishes to parallelize an application code) should be able to use the ParaWise/CAPO tools. Our experience of parallelizing application codes clearly indicates that the information required to improve the efficiency of a parallel version will be known to the code author. The information required by the tool is the same information that parallelization experts undertaking a manual parallelization would need, and that they currently obtain by asking the code author about certain variables in the code, although the techniques used to determine what information is essential are different.

The dependence information provides a rigid framework in which parallelism and privatization can be determined. Parallelism is determined by checking for loop-carried dependencies that enforce an execution order in which one iteration of a loop must precede a later iteration of that loop. If a loop-carried dependence is a true (dataflow) dependence (i.e. an assignment followed by a later usage) then investigation of this dependence is required; if any loop-carried true dependence cannot be proven non-existent (or proven to be part of a reduction operation) then the loop must remain serial. If a loop-carried dependence is a pseudo dependence (memory location re-use, e.g. a usage followed by a later assignment, known as an anti dependence, or an assignment followed by a later re-assignment, known as an output dependence) then, to allow parallel execution, either this dependence must be proven non-existent or all true dependencies of this variable into or out of the loop (e.g. from an assignment in the loop to a usage following the loop) must be proven non-existent to enable the required privatization of this variable (except where FIRSTPRIVATE or LASTPRIVATE can be used).

For a pseudo dependence, the investigation focuses only on the memory locations of the references involved. For a true dependence, both the memory locations and any assignments between the assignment and usage of that dependence are investigated. The memory location test is often interprocedural, as the assignment and usage can both be deep inside a call tree from call sites inside the loop. For the intermediate assignment test, we need to determine whether all values passed from the assignment to the usage of the original dependence are overwritten (i.e. no value is passed). Again, this often involves interprocedural operation, as the intermediate assignments can be in any number of different routines in the call tree. Additionally, for some loop-out dependencies, an interprocedural investigation of all possible combinations of caller sites may need to be made, where other call sites that lead to usages of the subject variable must be
considered. This leads to several types of question being asked by the Expert Assistant.

Firstly, conditions are presented to the user for inspection, and the user can select and confirm any condition as always TRUE. These conditions are extracted from the indices of arrays in the assignment and usage of the dependence under investigation, from the control of these statements throughout the associated call paths and, in the case of true dependencies, from the indices and control for any intermediate assignments [4]. In most cases, this leads to fairly complex conditions for true dependencies, which take the form

    Ar ∩ Ac ∩ Ur ∩ Uc ∩ ¬((Ir1 ∩ Ic1) ∪ (Ir2 ∩ Ic2) ∪ ... ∪ (Irn ∩ Icn))

where Ar is the range of the array that is assigned in the source of the dependence under investigation, Ac represents the conditions under which that assignment occurs, Ur and Uc are the index range and control for the usage at the sink of the dependence, and Irj and Icj are the index range and control for each intermediate assignment identified. These conditions are extracted from the heart of the ParaWise dependence analysis to encourage substitution of variables by those read into the application code (as these are often more familiar to a user than variables calculated as part of an algorithm in the application) and also to automatically prove many cases, so that the user only sees cases that need their attention. The algorithms in the dependence analysis substitute sets of constraints through conditional assignments and multiple call sites to produce a set of fully substituted constraints. If a set is presented to the user and they can confirm some constraint, then that set is proven impossible and the next unresolved set is processed. If the user cannot provide information for a set, then that set remains unresolved and the overall test terminates, indicating that the dependence concerned cannot be proven non-existent.

Secondly, true dependencies for which intermediate assignments exist are presented to the user, both graphically and in list form, showing all intermediate assignments in the context of the application code. Such information can enable the user to understand what is being asked and, with their understanding of the algorithm implemented in the code section concerned, to determine whether the dependence in question carries any values from its source to its sink. The Expert Assistant also identifies other situations, such as when an I/O statement is inside a loop (searching in call sites within the loop), and asks whether it is essential to the operation of the code or just for debugging purposes.

Questions are asked either for a selected individual loop or for all loops listed in the Directive Browser. Once a question-and-answer session is complete, the added information and the deletions of dependencies or I/O statements etc. are applied and the parallelization process is automatically replayed. In some cases the dependence analysis will need to be repeated, in which case a replay file is created containing all actions in the previous parallelization process with the new actions added. In other cases, the previously constructed dependence graph can be altered to include the effect of the new information.
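As a small invented illustration of how such a constraint set arises (this is not actual ParaWise output):

      do i = 1, n
        a(i) = b(i)        ! source of the true dependence: Ar = {1..n}, Ac = TRUE
      enddo
      if (m .ge. 1) then
        do i = 1, m
          a(i) = 0.0       ! intermediate assignment: Ir1 = {1..m}, Ic1 = (m .ge. 1)
        enddo
      endif
      do i = 1, n
        c(i) = a(i)        ! sink of the dependence: Ur = {1..n}, Uc = TRUE
      enddo

A value is carried from the first loop to the third only for indices that are assigned and used but not covered by the intermediate assignment, so confirming a condition such as m ≥ n (expressed in variables read into the code) makes the constraint set impossible and proves the dependence non-existent.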
4 Examples of the Power of the Expert Assistant

4.1 Simple Constraints Needed to Allow Privatization
A common cause of serialization is when a workspace variable that needs to be privatized to allow parallelization (due to a loop-carried pseudo dependence) cannot be privatized because of usages after the loop has completed. An example of this is shown in figure 2(i), where the loop i1 writes to array b (amongst many others) in several iterations, causing an output dependence for instances of statement S1 between iterations of loop i1. An investigation of this dependence using the Expert Assistant reveals that no possible solutions exist that would allow its deletion, so the process then determines whether privatization is possible. Here, it is not possible, as usages of the variable are made after the loop has completed, including the usage at S3 shown in figure 2(i) (assuming that other code not shown prevents a LASTPRIVATE clause being used for b). The investigation therefore focuses on the causes of these post-loop usages of values of array b. This can involve interprocedurally tracing many chains of dependencies relating to routine starts, routine stops, call sites and also the actual statements containing usages (i.e. in figure 2(i), loop i2 will often reside in another routine).

The investigation of the dependence from S1 to S3 uses the ParaWise dependence analysis engine test for value-based dependencies, identifying intermediate assignments of the values between S1 and S3. In this case, statement S2 is included and the constraints that result lead to the question "is nj always ≥ 2?". The variable nj is read at runtime and should be familiar to a user who is knowledgeable about the application code. Typically, the operation of the code makes no sense if such constraints are not met (i.e. the code was written under the implicit assumption that nj will be ≥ 2), allowing the user to answer with confidence. Additionally, since parallel processing is only needed for computationally expensive executions of the application code, the user may be able to indicate that the value of nj will always be at least 2 for any use of the parallel version. In figure 2(i), the assignment to b at S2 only occurs if nj ≥ 2 (otherwise the surrounding j2 loop is not entered). If nj were less than 2 then the values used at S3 would come from other assignments, potentially taking values from throughout the application code.

In this case, as in all the other cases highlighted in this section, it is not a deficiency in the analysis that has caused this dependence: the conservative nature of dependence analysis and the lack of information about variable nj force it to be set. Also, the reason that array b in loop i1 could not be privatized was not related to loop i1; the automatic investigation of the Expert Assistant traced the problem to loop i2 elsewhere in the application code. By asking a question about the constraint on a variable, the user never needs to know where the problem was traced to, which avoids one of the major tasks in the use of previous versions of CAPO. Because the knowledge is added in this general form, the same information can also prove crucial in proving that other dependencies can be removed.
(i)
          do i1=1,ni
            do j1=1,nj
S1            b(j1) = . . .
            enddo
            . . .
          enddo
          . . .
          do i2=1,ni
            do j2=2,nj
S2            b(j2) = . .
            enddo
S3          . . = b(nj)
          enddo

(ii)
          subroutine eleinfo(type,nnodes)
          if (type.eq.TRIANG) then
            nnodes=3
          endif
          if (type.eq.QUAD) then
            nnodes=4
          endif
          if (type.eq.BRICK) then
            nnodes=8
          endif
          end

          do elem=1,num_elements
            call eleinfo(eletyp(elem),nnodes)
S1          do i=1,nnodes
              . . .
            enddo
            . . .
          enddo

(iii)
          read*,il,ih
          do k=1,nk
            do i=il,ih,2
S1            b(i) = . . .
            enddo
            . . .
            do i=ih,il,-2
S2            . . = b(i)
            enddo
            . . .
          enddo

Fig. 2. Simple case of (i) a privatization-preventing post-loop dependence (ii) code fragment where element type is related to a number of nodes for each element (iii) potentially covering assignment for the usage under certain conditions
4.2 More Complex Cases for the Finite Element Code Example
A finite element application code provides further examples of the power of the Expert Assistant. This code follows the fairly standard finite element algorithm where a number of different element types are used to form the mesh. Stress and strain components are calculated for the nodes of each element and are based on shape functions and boundary conditions. The code provides three types of finite elements: 2D triangular elements (3-node), 2D quadrilateral elements (4-node) and 3D brick elements (8-node). It has four modes of operation: plane stress, plane strain, axisymmetric and full three-dimensional analysis. As is fairly typical with such codes, the mesh is read in from a file that has been produced by a mesh generation package, and it will therefore conform to certain constraints. In this case, these include that the elements are of one of the three types above, that the analysis mode is one of the four above, and that brick elements are only used with the three-dimensional analysis mode. As these constraints are always met by meshes read from file, they were not checked in the application code, and therefore this information can only be obtained from the user.

A simple case caused by the implicit constraint on finite element types is shown in figure 2(ii). The variable nnodes is assigned for all valid element types of the code; however, as the array eletyp is read from a file and no explicit constraints appear in the code, the analysis must assume that eletyp can take any value, and not necessarily one of the three required. As a result of the uncertainty in the assignment of variable nnodes, dependencies for the usage of nnodes at S1 are set to assignments of nnodes elsewhere in the application code, including dependencies to assignments in earlier iterations of the elem loop over finite elements, serializing that loop. The Expert Assistant automatically investigates the serializing dependence shown in figure 2(ii), where the assignments to nnodes in the same iteration of the elem loop as the usage are intermediate assignments, so their control is collected for the questions to be asked of the
user. The Expert Assistant interface lists all three conditions, along with a number of other conditions, such that if any single condition is known to be TRUE the dependence does not exist. In this case, no single condition is TRUE; however, the interface allows constraints to be combined. To provide the necessary information, the user selects the constraints and combines them to produce:

    (eletyp(elem).eq.TRIANG.or.eletyp(elem).eq.QUAD.or.eletyp(elem).eq.BRICK)
where the user knows that this is always true since eletyp is read from the input file. This information allows the replayed dependence analysis to determine that such dependencies do not exist, as the value of nnodes used at S1 in figure 2(ii) is definitely from one of the three assignments shown earlier in the same iteration of loop elem. This may then allow parallelism in loop elem to be detected.

The form of presentation of the required constraint, and the variables involved, can assist user recognition and enable a positive response. An example of this is when two constraints are identified for a loop-serializing dependence and an if-then presentation is more appropriate to the user's understanding of the application code, i.e. the following facts are identical:

    "(eletyp(elem).ne.TRIANG.or.mode.ne.threed)"
    "if (eletyp(elem).eq.TRIANG) then (mode.ne.threed)"
where the user can far more easily understand the constraint when it states that if a triangular element is used then the analysis mode cannot be three-dimensional. An example of the Expert Assistant window for this scenario is shown in figure 3(i). For the user to interpret such questions, the variables involved must be familiar to the user. Additionally, for the information to be applicable to many situations throughout the application code, it should, as far as possible, not relate to locally computed variables. In figure 2(ii), for example, it may be difficult for a user to answer questions involving variable nnodes as it is computed locally. If the possible values of nnodes are exploited through conditional substitution then nnodes is eliminated from any questions and variable eletyp is introduced. As eletyp is read into the application code, its values are fundamental to the finite element method used and should be understandable to the user, and since this variable is used throughout the application code, any added information can potentially be exploited to resolve many other situations.

There are also cases in the Finite Element code where dependencies inhibit parallelism but no answerable question is presented to the user. The main cause of this is error checking, particularly in the finite element stiffness matrix calculation phase of the code. As part of this process, a small matrix is inverted, so a check is used to avoid divisions by zero when the matrix is singular. This check can have a significant effect on the parallelization. Firstly, it may contain I/O and execution termination if a singular matrix is encountered. Secondly, if execution is not stopped, the possibility of data being used from the previously processed finite element (as it is not assigned for the element with the singular matrix) forces the serialization of the finite element loop. The Expert Assistant deals with such cases by detecting the I/O and program termination statements when they are inside a potentially parallel loop (in this case
Fig. 3. Examples of the Expert Assistant window; (i) asking for confirmation of constraints (ii) asking if the set of intermediate assignments provides all values to a usage
in a routine called from inside such a loop) and presenting the statement in the context of the application code, asking the user if it is essential to the operation of the application code, or was just for debugging purposes during development. In this case, the generated finite element mesh read into the application code guarantees that no finite element will have a singular matrix so the user can recognize that these statements are just for debugging and can be removed for a production version of the parallel code. The lack of an assignment of the inverted matrix still leads to serializing dependencies as the analysis does not know that no singular matrices will be encountered. As the detection of a singular matrix is based on a non-zero value of the matrix determinant, the questions posed by the Expert Assistant involving constraints cannot be answered by the user. Instead, the associated dependence is presented to the user in both graphical and textual form as shown in figure 3(ii). The dependence that is causing the problem is highlighted in bold and the intermediate assignments that could overwrite all values carried by that dependence are also shown. Several other browsers are provided to offer more information about what is being asked (and all other ParaWise browsers are also available) where all these browsers show the statements involved in the context of the application source code. With their knowledge of the algorithms implemented in the application code, the user should be able to know that all values of the inverted matrix used for a finite element are calculated earlier in the same iteration of the finite element loop. This should allow them to answer with confidence that the dependence in question does not carry any values and can therefore be ignored in parallelism determination.
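As a minimal sketch of the error-checking pattern described above (invented code, not the actual finite element application):

      subroutine invert2(a, ainv, det)
      real a(2,2), ainv(2,2), det
      det = a(1,1)*a(2,2) - a(1,2)*a(2,1)
      if (det .eq. 0.0) then
        print *, 'singular element matrix'   ! I/O and termination inside a loop
        stop                                 ! that is a candidate for parallelism
      else
        ainv(1,1) =  a(2,2)/det
        ainv(1,2) = -a(1,2)/det
        ainv(2,1) = -a(2,1)/det
        ainv(2,2) =  a(1,1)/det
      endif
      end

When such a routine is called from inside the loop over finite elements, the Expert Assistant flags the PRINT and STOP and asks whether they are essential or only for debugging. Even if they are removed, ainv is only assigned when det is non-zero, so without the user's assurance that no singular matrix occurs the analysis must assume that a value of ainv could be carried over from the previously processed element, serializing the element loop.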
4.3 Other Essential User Interaction
Implicit constraints and error checking are often the cause of serialization in application codes; however, other scenarios also exist that can only be addressed by the user. Consider the example shown in figure 2(iii), where a usage at S2 of values assigned before the code fragment shown could prevent privatization, and therefore parallelism, in other loops in the code. The values used at S2 could also have been provided by the assignment at S1; however, the loop step indicates that the values of the loop bounds il and ih, as read into the application code, are vital in determining which dependence exists. If the user knows that il and ih are both even, then the even indices of array b assigned at S1 will be used at S2 in every iteration of loop k. Similarly, if both il and ih are odd then the odd indices of array b will be assigned and used in each k iteration. If, however, il is even and ih is odd (or vice versa) then no value used at S2 is assigned at S1, so the dependence from before this code fragment does exist (and the dependence from S1 to S2 does not exist). If the user knows nothing about il and ih, the worst-case scenario must be taken to ensure conservative analysis, with both dependencies being set. In this case, the graphical and textual displays of the application code shown in figure 3(ii) should enable the user to interpret and, if possible, resolve such situations.
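For instance (an invented numerical illustration): with il=2 and ih=10, S1 assigns b(2), b(4), ..., b(10) and S2 reads b(10), b(8), ..., b(2), so every value read at S2 in a given k iteration was assigned at S1 in that same iteration; with il=2 and ih=9, S1 assigns only even indices while S2 reads only odd indices, so no value flows from S1 to S2 and the values used at S2 must come from assignments before the fragment.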
4.4 Experience of Using the Expert Assistant
We have used the Expert Assistant on a large number of application codes and found that it does address the issues a parallelization expert would need to investigate to improve the parallelization quality, but does so in terms that should be understandable to a code author. Results for parallelizations using the Expert Assistant are shown in figure 4 for (i) the NAS benchmark BT and (ii) the Finite Element code discussed in section 4.2.

For BT, the initial performance, when no user information was added, was very poor, with inner loops operating in parallel, so an investigation using the Expert Assistant was required. The questions asked by the Expert Assistant focused on constraints for the mesh size variables NX, NY and NZ, where lower bounds on their values were of interest. Confirming that such constraints were always true for an execution of BT and performing a replay produced a parallel version with far superior performance, exhibiting high speedup and scalability to reasonably large numbers of processors, as shown in figure 4(i). In this version the outermost parallel loops were exploited in most cases.

For the Finite Element code, the initial parallelization with no user information was forced to focus mainly on inner loops due to dependencies between iterations of outer loops. After a session with the Expert Assistant providing the sort of information discussed in section 4.2, parallelism at outer loops was detected, enabling some speedup and scalability to be achieved, as shown in figure 4(ii). A significant number of loop-carried dependencies and dependencies into and out of loops were proven non-existent in the replay, allowing effective parallelism and privatization directives to be inserted. This version was able to exploit parallelism between iterations of the loop over finite elements, attaining the most effective level of
Fig. 4. Timing of (i) BT using log scale and (ii) the Finite Element code on 1-64 400MHz processors of an SGI Origin 3000
parallelization possible in those cases. Although there is a marginal increase in performance from 8 to 32 processors, scalability is still limited. The main cause of this is that one serial code section (containing a single loop) still exists in the parallel code; although this section seems relatively unimportant on a single processor, once larger numbers of processors are used the speedup of the parallel loops leaves this serial loop dominating the runtime.
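For illustration (invented figures, not measurements from the paper): if the remaining serial loop accounted for 5% of the single-processor runtime and everything else scaled perfectly, the overall speedup would be bounded by 1/(0.05 + 0.95/p), i.e. about 5.9 on 8 processors and only about 12.5 on 32, so a section that looks negligible on one processor quickly comes to dominate the runtime.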
5 The Evolving Parallelization Environment
The Expert Assistant, together with ParaWise and CAPO, forms a powerful set of tools for parallelization; however, these are just part of the overall environment that is under development. The environment consists of a number of interrelated components, as shown in figure 5, aimed at both OpenMP and message passing parallelizations.

The exploitation of user knowledge is vital for effective parallelization; however, it inevitably opens the parallelization process to the danger of incorrect user assertions and resultant incorrect parallel execution. To address this problem, another component, based on relative debugging [8], is being developed to automatically identify where incorrect parallel calculations originate and to indicate the related, potentially incorrect, user decisions identified in the parallelization history. The relative debugger contains many features, including tolerance settings for absolute and relative value comparisons and visualization of differences across threads or processors.

The final component is the parallel profiler and trace visualizer. A number of these tools already exist, including Paraver from CEPBA [9], the SUN Studio 9 Performance Analyzer [10] and Intel VTune [11]. The aim in this work is to extract simple metrics (loop speedup etc.) to enable inexperienced users to focus their effort, as well as to allow skilled users to exploit the full power of these tools with the additional information provided by the other environment components.
[Diagram (not reproduced) showing the components of the environment: user, serial code, dependence analysis engine, code generation (partitioning, masking, communications, CAPO directive insertion), Expert Assistant, parallelization history, relative debugger controller, 1-processor and N-processor parallel debugger, parallel profiler and trace visualizer, and parallel code, with feedback of loop and routine speedup, OpenMP runtime overheads and communication overheads.]

Fig. 5. Components of the evolving parallelization environment
6 Related Work
It is well known that performing a parallelization requires information about the application code that is not statically available. The attempts to address this issue range from runtime testing to fully manual parallelization decisions (such as those written by hand with OpenMP). A number of tools have been developed to assist users in shared memory, directive-based parallelization, and only a subset are cited here [17, 18, 19, 20]. Most focus on providing information to the user, relying on the user to have the skill to correctly interpret and exploit the provided, often detailed, information. For parallelization experts, such tools provide a means to overcome some of the most difficult problems in code parallelization, which relate to understanding the application code and investigating interprocedural accesses of data, both of which can be significant manual tasks.

A number of techniques have been developed that involve checking the validity of parallel execution at runtime. The inspector/executor technique has been used to determine the legality of parallel execution [21]. The inspector loop, a lightweight duplicate of the computation that mimics the data accesses of the original loop, uses these accesses to determine whether parallel execution is legal. Both parallel and serial versions of the loop are generated and the appropriate version is executed after the test is made. Speculative computation [22] executes a loop in parallel while the references that potentially cause serializing
dependencies are monitored. As soon as a serializing dependence is detected, the speculative parallel execution can be terminated and the serial version is then used. Some form of checkpointing is required to preserve the state of associated program data before the speculative parallel loop is executed, to enable that state to be reset if serial execution is forced. Both these methods require an accurate, value-based, interprocedural dependence analysis as a prerequisite, so that they are only applied to the loops that need them and so that the number of variables whose accesses are monitored in those loops, and the associated runtime overheads, are kept as low as possible. In addition, these methods do not overcome the prevention of privatization by, in particular, loop-out dependencies where a LASTPRIVATE directive cannot be used, as the runtime detection of output and anti dependencies for these shared variables will force serial execution. These runtime techniques are focused on an automatic compiler approach to parallelization with little or no user interaction, so any serial code sections encountered will greatly restrict scalability. Presentation of such cases to the user is still essential if scalability is required.
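A schematic sketch of the inspector/executor idea (invented code, not taken from the cited papers) for a loop whose write locations idx(i) are only known at run time:

!     inspector: a lightweight duplicate of the loop that only records the
!     write accesses and checks whether any element is written twice
      mark(1:asize) = .false.
      conflict = .false.
      do i = 1, n
        if (mark(idx(i))) conflict = .true.
        mark(idx(i)) = .true.
      end do

!     executor: run the parallel version only if the inspector found no conflict
      if (.not. conflict) then
!$omp parallel do default(shared) private(i)
        do i = 1, n
          a(idx(i)) = a(idx(i)) + b(i)
        end do
!$omp end parallel do
      else
        do i = 1, n
          a(idx(i)) = a(idx(i)) + b(i)
        end do
      end if

Here a, b, idx, mark, asize and n are hypothetical names used only for illustration.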
7 Conclusion
The Expert Assistant achieves its goal of greatly simplifying the user's role in the parallelization of real-world application codes with the ParaWise/CAPO tools. Our experience demonstrates that the Expert Assistant would be a major benefit to experienced users of the tools, and we are confident that parallelization experts and those familiar with OpenMP will also find it invaluable. An important issue that remains to be answered is whether the Expert Assistant can encourage and enable the vast number of novice users to use the tools, so that they can take advantage of parallel machines with many processors. Improvements that are currently underway include ordering questions based on the likelihood that they can be answered and will be profitable, and removing questions where a related variable is indicated as unanswerable by the user. Future versions will be more closely coupled with the CAPO algorithms to allow investigation of other cases, such as where a NOWAIT could not be used or where ParaWise code transformations are advantageous. Additionally, information from the profiling and relative debugging components of the environment will be used to identify crucial cases for user investigation.
Acknowledgements

The authors would like to thank their colleagues who have assisted in the many different aspects of this work, including Gabriele Jost, Greg Matthews and Bob Hood (NASA Ames), and Peter Leggett and Jacqueline Rodrigues (Greenwich). This work was partly supported by the AMTI subcontract No. Sk-03N-02 and NASA contract DDTS59-99-D-99437/A61812D.
References

1. OpenMP home page, http://www.openmp.org
2. Wilson R.P., French R.S., Wilson C.S., Amarasinghe S.P., Anderson J.M., Tjiang S.W.K., Liao S., Tseng C., Hall M.W., Lam M. and Hennessy J., SUIF: An Infrastructure for Research on Parallelizing and Optimizing Compilers, Stanford University, CA, 1996.
3. Blume W., Eigenmann R., Fagin K., Grout J., Lee J., Lawrence T., Hoeflinger J., Padua D., Tu P., Weatherford S., Restructuring Programs for High Speed Computers with Polaris, ICPP Workshop on Challenges for Parallel Processing, pp. 149-162, 1996.
4. Johnson S.P., Cross M., Everett M.G., Exploitation of Symbolic Information in Interprocedural Dependence Analysis, Parallel Computing, 22, pp. 197-226, 1996.
5. Leggett P.F., Marsh A.T.J., Johnson S.P. and Cross M., Integrating User Knowledge with Information from Parallelisation Tools to Facilitate the Automatic Generation of Efficient Parallel FORTRAN Code, Parallel Computing, Vol. 22, 2, pp. 197-226, 1996.
6. Evans E.W., Johnson S.P., Leggett P.F. and Cross M., Automatic and Effective Multi-dimensional Parallelisation of Structured Mesh Based Codes, Parallel Computing, 26(6), pp. 677-703, 2000.
7. Johnson S.P., Ierotheou C.S. and Cross M., Computer Aided Parallelisation of Unstructured Mesh Codes, Proceedings of PDPTA Conference, Las Vegas, Volume 1, CSREA, pp. 344-353, 1997.
8. Matthews G., Hood R., Jin H., Johnson S. and Ierotheou C., Automatic Relative Debugging of OpenMP Programs, Proceedings of EWOMP, Aachen, Germany, 2003.
9. Paraver, http://www.cepba.upc.es/paraver
10. SUN Studio Performance Analyzer, http://developers.sun.com/prodtech/cc/analyzer_index.html
11. VTune Performance Analyzer, http://developer.intel.com/software/products/vtune/index.htm
12. Parallel Software Products Inc., http://www.parallelsp.com
13. Jin H., Frumkin M., Yan J., Automatic Generation of OpenMP Directives and Its Application to Computational Fluid Dynamics Codes, International Symposium on High Performance Computing, Tokyo, Japan, 2000; LNCS 1940, pp. 440-456.
14. Jin H., Jost G., Yan J., Ayguade E., Gonzalez M. and Martorell X., Automatic Multilevel Parallelization Using OpenMP, Proceedings of EWOMP 2001, Barcelona, Spain, September 2001; Scientific Programming, Vol. 11, No. 2, pp. 177-190, 2003.
15. Ierotheou C.S., Johnson S.P., Leggett P., Cross M., Evans E., The Automatic Parallelization of Scientific Application Codes Using a Computer Aided Parallelization Toolkit, Proceedings of WOMPAT 2000, San Diego, USA, July 2000; Scientific Programming Journal, Vol. 9, No. 2+3, pp. 163-173, 2003.
16. Jin H., Jost G., Johnson D., Tao W., Experience on the Parallelization of a Cloud Modelling Code Using Computer Aided Tools, Technical Report NAS-03-006, NASA Ames Research Center, March 2003.
17. FORGE, Applied Parallel Research, Placerville, California 95667, USA, 1999.
18. KAI/Intel, http://www.kai.com/
19. Zima H.P., Bast H-J. and Gerndt H.M., SUPERB: A Tool for Semi-Automatic MIMD/SIMD Parallelisation, Parallel Computing, 6, 1988.
20. The Dragon Analysis Tool, http://www.cs.uh.edu/~dragon
21. Rauchwerger L., Amato N. and Padua D., A Scalable Method for Run-Time Loop Parallelization, International Journal of Parallel Processing, Vol. 26, No. 6, pp. 537-576, July 1995.
22. Rauchwerger L. and Padua D., The LRPD Test: Speculative Run-Time Parallelization of Loops with Privatization and Reduction Parallelization, IEEE Transactions on Parallel and Distributed Systems, Vol. 19, No. 2, February 1999.
Automatic Scoping of Variables in Parallel Regions of an OpenMP Program

Yuan Lin1, Christian Terboven2, Dieter an Mey2, and Nawal Copty1

1 Sun Microsystems, U.S.A.
  {yuan.lin, nawal.copty}@sun.com
2 RWTH Aachen University, Center for Computing and Communication,
  Seffenter Weg 23, 52074 Aachen, Germany
  {terboven, anmey}@rz.rwth-aachen.de
Abstract. The process of manually specifying scopes of variables when writing an OpenMP program is both tedious and error-prone. To improve productivity, an autoscoping feature was proposed in [1]. This feature leverages the analysis capability of a compiler to determine the appropriate scopes of variables. In this paper, we present the proposed autoscoping rules and describe the autoscoping feature provided in the Sun Studio™ 9 Fortran 95 compiler. To investigate how much work can be saved by using autoscoping and the performance impact of this feature, we study the process of parallelizing PANTA, a 50,000-line 3D Navier-Stokes solver, using OpenMP. With pure manual scoping, a total of 1389 variables have to be explicitly privatized by the programmer. With the help of autoscoping, only 13 variables have to be manually scoped. Both versions of PANTA achieve the same performance.
1 Introduction

Efficient parallelization of large codes using OpenMP can be difficult, especially when the program does not contain substantial hot spots. Parallelization may require the manual introduction of hundreds of directives and data scoping clauses, which can be quite tedious and error-prone. Many users have found this to be the hardest part of using OpenMP.

To alleviate the difficulty of specifying the scope of variables in an OpenMP parallel region, an additional data scope attribute of the DEFAULT clause has been proposed to the OpenMP Architecture Review Board by one of the authors [1]. The new data scope attribute is named AUTO, and can be used as follows with the PARALLEL directive:

    !$OMP PARALLEL DEFAULT(AUTO) ...

An OpenMP compiler that supports the proposed autoscoping feature offers a very attractive compromise between automatic and manual parallelization. The need for such a feature was previously expressed in [2], when loop-level parallelization of the 3D Navier-Stokes solver, PANTA, was first investigated.
A compiler that supports autoscoping determines the appropriate scopes of variables referenced in a parallel region, based on its analysis of the program and on a set of autoscoping rules. Special knowledge that the programmer has about certain variables can be specified explicitly by using additional data-sharing attribute clauses. This relieves the compiler from complicated analyses which are difficult or impossible because of lack of information at compile-time, such as information about memory aliasing. If the compiler is not able to identify the scope of all variables that are to be autoscoped in a given parallel region, it adds an IF(.FALSE.) clause to the PARALLEL directive, causing the parallel region to execute serially. This paper is organized as follows. Section 2 presents the proposed autoscoping rules and the autoscoping feature in the Sun Studio™ 9 Fortran 95 compiler. Section 3 gives some background information about PANTA. Section 4 describes the experiences we gained by applying the new autoscoping feature to PANTA. Section 5 compares related work. Finally, Section 6 presents our conclusions.
2 Autoscoping in OpenMP

The proposed autoscoping feature allows the user to specify which variables in a parallel region should be automatically scoped by the compiler. The compiler analyzes the execution and synchronization pattern of the parallel region and determines the scoping of variables based on a set of autoscoping rules.

2.1 Enabling Autoscoping

A user can specify a variable to be autoscoped by using one of the following two methods.

2.1.1 Using the AUTO Data-Sharing Attribute Clause

This clause has the following format:

    AUTO(list-of-variables)

The AUTO clause specifies that the variables listed in list-of-variables be automatically scoped by the compiler. If a variable is listed in the AUTO clause, then it cannot be listed in any other data-sharing attribute clause.

2.1.2 Using DEFAULT(AUTO)

A user can also specify which variables are to be automatically scoped by the compiler using the DEFAULT(AUTO) clause. In fact, we expect this to be the most common way in which users will use the autoscoping feature. This clause has the following format:

    DEFAULT(AUTO)

Both the AUTO(list-of-variables) clause and the DEFAULT(AUTO) clause can appear on a PARALLEL, PARALLEL DO, PARALLEL SECTIONS, or PARALLEL WORKSHARE directive.
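As a small invented sketch of the first form (the variables and loop are not from the paper), selected temporaries can be left to the compiler while the remaining variables are scoped explicitly:

C$OMP PARALLEL DO AUTO(T, S) SHARED(X, Y, N) PRIVATE(I)
      DO I = 1, N
         T = Y(I)
         S = T*T
         X(I) = S
      END DO
C$OMP END PARALLEL DO

Here T and S would be scoped by the compiler (both PRIVATE under rule S2 of Section 2.2), while X, Y, N and I are scoped explicitly.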
2.2 Autoscoping Rules

In automatic scoping, the compiler applies a set of rules to determine the scopes of variables in a parallel region. The rules should be simple enough so that most users can understand them, and should be complete enough so that they can cover most useful cases. The rules do not apply to variables scoped implicitly according to the OpenMP Specification, such as index variables of worksharing DO loops.

2.2.1 Scoping Rules for Scalar Variables

The following rules are used to determine the scopes of scalar variables in a parallel region:

• S1: If the use of the variable in the parallel region is free of data race¹ conditions for the threads in the team executing the region, then the variable is scoped SHARED.
• S2: If in each thread executing the parallel region, the variable is always written before being read by the same thread, then the variable is scoped PRIVATE. The variable is scoped LASTPRIVATE, if the variable can be scoped PRIVATE and is read before it is written after the parallel region, and the construct is either PARALLEL DO or PARALLEL SECTIONS.
• S3: If the variable is used in a reduction operation that can be recognized by the compiler, then the variable is scoped REDUCTION with that particular type of operation.

2.2.2 Scoping Rules for Array Variables

The following rules are used to determine the scopes of array variables in a parallel region:

• A1: If the use of the variable in the parallel region is free of data race conditions for the threads in the team executing the region, then the variable is scoped SHARED.
• A2: If in each thread executing the parallel region, any element of the array read by a thread is always written by the same thread first, then the variable is scoped PRIVATE. The variable is scoped LASTPRIVATE, if the variable can be scoped PRIVATE and any element written in the parallel region is read before it is written after the parallel region, and the construct is either PARALLEL DO or PARALLEL SECTIONS.
• A3: If the variable is used in a reduction operation that can be recognized by the compiler, then the variable is scoped REDUCTION with that particular type of operation.
¹ A data race exists when two threads can access the same shared variable at the same time and at least one of the threads modifies the variable. To remove a data race condition, accesses to the variable should be protected by a critical section or the threads should be synchronized.
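As an invented illustration of rule S2 (this is not one of the paper's examples):

C$OMP PARALLEL DO DEFAULT(AUTO)
      DO I = 1, N
         T = A(I) + B(I)
         C(I) = T*T
      END DO
C$OMP END PARALLEL DO
      D = T

In every iteration T is written before it is read, so it can be privatized; because T is also read after the parallel region before being written again, rule S2 scopes it LASTPRIVATE rather than PRIVATE, so that after the loop T holds the value from the last iteration, just as in the serial code.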
2.3 General Comments About Autoscoping

If a user specifies the following variables to be autoscoped by AUTO(list-of-variables) or DEFAULT(AUTO), the compiler will scope the variables according to the implicit scoping rules in the OpenMP Specification:

• A THREADPRIVATE variable.
• A Cray pointee.
• A loop iteration variable used only in sequential loops in the lexical extent of the region or PARALLEL DO loops that bind to the region.
• Implied DO or FORALL indices.
• Variables used only in work-sharing constructs that bind to the region, and specified in a data scope attribute clause for each such construct.

If a variable is specified to be autoscoped in a parallel region and it is scoped by the user as FIRSTPRIVATE, LASTPRIVATE or REDUCTION in any worksharing construct that binds to the parallel region, then the variable is scoped as SHARED.

If a variable is specified to be autoscoped in a parallel region, and the above implicit scoping rules do not apply to it, then the compiler will attempt to automatically scope the variable. The compiler checks the use of the variable against the rules in 2.2.1 or 2.2.2, in the order shown. If a rule matches the use of a variable, the compiler will scope the variable according to that rule. If a rule does not match, the compiler will check the next rule. If it cannot find any matching rule, then the compiler is unable to autoscope the variable. In such cases, the variables that the compiler cannot autoscope will be scoped SHARED and the binding parallel region will be serialized as if an IF (.FALSE.) clause were specified.

There are two reasons why a compiler is unable to autoscope some variables. One is that the use of a variable does not match any of the rules. The other is that the source code is too complex for the compiler to do a sufficient analysis. Function calls, complicated array subscripts, memory aliasing, and user-implemented synchronizations are some typical causes.

2.4 Examples

Example 1

      INTEGER X(100), Y(100), I, T
C$OMP PARALLEL DO DEFAULT(AUTO)
      DO I=1, 100
         T = Y(I)
         X(I) = T*T
      END DO
C$OMP END PARALLEL DO
      END
In the above code, scalar variable T will be scoped PRIVATE and array variables X and Y will be scoped SHARED.

Example 2

 1.       REAL FUNCTION FOO (N, X, Y)
 2.       INTEGER N, I
 3.       REAL X(*), Y(*)
 4.       REAL W, MM, M
 6.       W = 0.0
 7. C$OMP PARALLEL DEFAULT(AUTO)
 8. C$OMP SINGLE
 9.       M = 0.0
10. C$OMP END SINGLE
11.       MM = 0.0
12. C$OMP DO
13.       DO I = 1, N
14.          T = X(I)
15.          Y(I) = T
16.          IF (MM .GT. T) THEN
17.             W = W + T
18.             MM = T
19.          END IF
20.       END DO
21. C$OMP END DO
22. C$OMP CRITICAL
23.       IF ( MM .GT. M ) THEN
24.          M = MM
25.       END IF
26. C$OMP END CRITICAL
27. C$OMP END PARALLEL
28.       FOO = W - M
29.       RETURN
30.       END
The function FOO() contains a parallel region which contains a SINGLE construct, a work-sharing DO construct, and a CRITICAL construct. If we ignore all the OpenMP constructs, we find that the code in the parallel region does the following:
1. Copy the value in array X to array Y.
2. Find the maximum positive value in X, and store it in M.
3. Accumulate the value of some elements of X into variable W.

The following variables are used in the parallel region: I, N, MM, T, W, M, X, and Y. Upon analyzing the uses of these variables in the parallel region, the compiler determines the following:

• Scalar I is the index of the work-sharing DO loop. The OpenMP specification mandates that I be scoped PRIVATE.
• Scalar N is only read in the parallel region and therefore will not cause a data race, so it is scoped as SHARED following rule S1.
• Any thread executing the parallel region will execute statement 11, which sets the value of scalar MM to 0.0. This write will cause a data race, so rule S1 does not apply. The write happens before any read of MM in the same thread, so MM is scoped as PRIVATE according to S2. Similarly, scalar T is scoped as PRIVATE.
• Scalar W is read and then written at statement 17 by all threads, so rules S1 and S2 do not apply. The addition operation is both associative and commutative; therefore, W is scoped as REDUCTION(+:W) according to rule S3.
• Scalar M is written in statement 9, which is inside a SINGLE construct. The implicit barrier at the end of the SINGLE construct ensures that the write in statement 9 will not happen concurrently with either the read in statement 23 or the write in statement 24, while the latter two will not happen at the same time because both are inside the same CRITICAL construct. No two threads can access M at the same time. Therefore, the writes and reads of M in the parallel region do not cause a data race. So, following rule S1, M is scoped SHARED.
• Array X is only read. So it is scoped as SHARED by rule A1.
• The writes to array Y are distributed among the threads, and no two threads will write to the same elements of Y. As there is no data race, Y is scoped SHARED according to rule A1.
2.5 Autoscoping in the Sun Studio 9 Fortran 95 Compiler

In the Sun Studio 9 Fortran 95 compiler, we introduced a new autoscoping feature based on the above proposal as a Sun-specific extension to OpenMP. To specify a variable to be autoscoped by the compiler, a user can use either __AUTO(list-of-variables) or DEFAULT(__AUTO). These correspond to AUTO(list-of-variables) and DEFAULT(AUTO) in the above proposal, respectively.

A user can use compiler commentary ([7]) to check the detailed autoscoping results, including whether any parallel region is serialized because some variables cannot be autoscoped. The compiler produces an inline commentary with the source code when compiled with the -g option. The generated commentary can be viewed with the er_src command or using the Sun Studio 9 Performance Analyzer ([7]) GUI interface. In addition, when a program is compiled with the -vpara option, a warning message will be printed out if any variable cannot be autoscoped.
The following is an example showing the use of er_src to display the autoscoping results for the code in Example 1.

Example 3

>f95 -xopenmp -g -c t.f
>er_src t.o
Source file: ./t.f
Object file: ./t.o
Load Object: ./t.o

     1.         INTEGER X(100), Y(100), I, T

        Variables autoscoped as PRIVATE in OpenMP construct below: i, t
        Variables autoscoped as SHARED in OpenMP construct below: y, x
     2.   C$OMP PARALLEL DO DEFAULT(__AUTO)
        Loop below parallelized by explicit user directive
     3.         DO I=1, 100
     4.            T = Y(I)
     5.            X(I) = T*T
     6.         END DO
     7.   C$OMP END PARALLEL DO
     8.         END
There are some limitations to the autoscoping feature in the Sun Studio 9 Fortran 95 compiler. Only OpenMP directives are recognized and used in the analysis. OpenMP API function calls, such as OMP_SET_LOCK() and OMP_UNSET_LOCK(), are not recognized. Only synchronizations specified via OpenMP synchronization directives, such as BARRIER and MASTER, are recognized and used in the analysis. User-implemented synchronizations, such as busy-waiting, are not recognized. Interprocedural analysis and array subscript analysis are limited. Arrays that satisfy rules A2 or A3 will be scoped SHARED with the enclosing parallel region serialized. Despite the above limitations, our experiments show that the autoscoping feature in the Sun Studio 9 Fortran 95 compiler can be very useful in improving the productivity of writing OpenMP programs. Our experimental results are reported below.
3 The PANTA Navier-Stokes Solver

PANTA is a well vectorized 3D solver that is extensively used in the modeling of turbomachinery ([3], [4]). A system of 7 coupled partial differential equations originating from the Reynolds- and Favre-averaged Navier-Stokes equations is solved with a cell-centered finite volume method with implicit time integration. The many fat
loop nests in the compute-intensive program kernel are well suited for shared memory parallelization. The PANTA package used in our experiments consists of about 50,000 lines of Fortran 90 code. The compute-intensive program kernel contains about 7,000 lines of code in 11 subroutines. The original OpenMP version includes some 200 OpenMP directives in about 370 lines of code. The default scope used is SHARED. The sheer number of OpenMP directive continuation lines reflects the large number of variables that have to be explicitly privatized with scoping clauses.
4 Comparing Automatic Parallelization, Manual Parallelization, and Semi-automatic Parallelization Using the Autoscoping Feature

An early attempt at automatic and manual parallelization of the PANTA code with OpenMP was described in [2]. Now that the first version of an OpenMP compiler² offering the autoscoping feature is available, we would like to investigate how much programming work would have been saved if this feature had been available earlier and whether this feature has any impact on performance.

Table 4.1 gives some statistical information about the OpenMP versions of the code using pure manual scoping versus autoscoping with the DEFAULT(__AUTO) clause. We did not count the variables that have SHARED scope, since they normally do not reflect the amount of work involved in writing OpenMP codes. Nevertheless, the autoscoping compiler has to analyze and scope these variables.

Table 4.1. Manual parallelization with OpenMP. Pure manual scoping versus autoscoping: Code statistics
Routine   Lines of   Number of   Number of   Number of     Number of serialized     Number of
          parallel   parallel    parallel    explicitly    parallel regions, due    variables which
          code       regions     loops       privatized    to unsuccessful          could not be
                                             variables     autoscoping              autoscoped
vscchr    2984       2           2           880           0                        0
inisfc    244        4           7           16            4                        4
simpeq    386        9           10          48            0                        0
vscflx    466        2           2           10            2                        2
roeflx    879        2           4           220           0                        0
geodif    132        1           1           15            1                        3
lubsol    149        3           3           5             0                        0
eulflx    660        7           10          96            4                        4
jacroe    422        2           4           42            0                        0
boumat    147        1           2           3             0                        0
lupibs    693        1           43          54            0                        0
Total     7162       34          88          1389          11                       13

² An early access version of the Sun Studio 9 Fortran 95 compiler was used in the experiments reported in this paper. The results using the final version of the compiler may vary.
In the manual scoping version, a total of 1389 variables had to be explicitly privatized. In the autoscoping version, the compiler was able to scope all but 13 variables in the parallel regions. 11 out of 34 parallel regions were serialized because some variables could not be autoscoped. The following code shows one typical pattern of the parallel regions ([1]) in PANTA.

!$omp parallel default(__auto)
      do n = 1,7
        do m = 1,7
!$omp do
          do l = LSS(itsub),LEE(itsub)     ! large loop trip count
            i = IG(l)
            j = JG(l)
            k = KG(l)
            lijk = L2IJK(l)
            RHS(l,m) = RHS(l,m) -                                &
               FJAC(lijk,lm00,m,n)*DQCO(i-1,j,k,n,NB)*FM00(l) -  &
               FJAC(lijk,lp00,m,n)*DQCO(i+1,j,k,n,NB)*FP00(l) -  &
               FJAC(lijk,l0m0,m,n)*DQCO(i,j-1,k,n,NB)*F0M0(l) -  &
               FJAC(lijk,l0p0,m,n)*DQCO(i,j+1,k,n,NB)*F0P0(l)
          end do
!$omp end do nowait
        end do
      end do
!$omp end parallel
The loop trip count of the inner-most loop is larger than those of the two outer loops. Therefore, to improve scalability and load balance, the programmer chooses to make the inner-most loop a worksharing loop with no terminating barrier and extends the parallel region to the outer-most loop. By using DEFAULT(__AUTO), the programmer is able to steer the parallelization and leave the tedious work of scoping to the compiler. The Sun Studio 9 Fortran 95 compiler correctly scopes the variables in the above case automatically. A typical example of a parallel region that is serialized because the compiler cannot autoscope some variables is shown below. This one is taken from routine INISFC.
177. !$omp parallel do DEFAULT(__AUTO)
178.   do k = MAX(ks,KSTA(ib)),MIN(ke+1,KEND(ib))
179.     do j = MAX(js,JEND(ib)),MIN(je,JEND(ib))
180.       do i = MAX(is,IPERBI),MIN(ie,IPERTI-1)
181.         ADDFLG(l3(i,j+1,k),2) = 0
182.       end do
183.     end do
184.   end do
185. !$omp end parallel do
The variable that could not be autoscoped in this region is ADDFLG. In the program, the following statement function is used to calculate the first index:

66. integer :: l3
67. l3(n1,n2,n3) = n1 + n2*D23 + n3*D33 + D43
In order for the parallel accesses to ADDFLG to be free of data races, no two threads may write to the same element of ADDFLG. This implies, among other conditions, that l3(i,j,k) has to be different from l3(i',j',k') for different k and k' within the index range, since the k-loop is a parallel loop. In the statement function l3, the values of the coefficient variables D23, D33 and D43 are read in at run time and are unknown at compile time. Because the compiler is unable to perform a precise array subscript analysis, it has to assume a data race and conservatively serialize the parallel region. In this example, however, the programmer has additional knowledge of the application: the variables D23, D33 and D43 are carefully chosen depending on the input dataset, so there is no data race and the k-loop can be parallelized. After explicitly declaring ADDFLG as SHARED by changing line 177 to

177. !$omp parallel do DEFAULT(__AUTO) SHARED(ADDFLG)
the autoscoping mechanism is able to correctly scope all remaining variables and the parallel region is no longer serialized. The other cases in which some variables could not be autoscoped are similar to the above one: one dimension in an array subscript uses a statement function, and the current compiler is unable to determine that the parallel access is free of data races because of imprecise array subscript analysis.
To study the performance impact of using autoscoping, we measured the execution time of the following four versions of parallelized PANTA:
1. automatic parallelization,
2. manual parallelization with OpenMP (autoscoping),
3. manual parallelization with OpenMP (manual scoping), and
4. manual parallelization with OpenMP (mixed autoscoping and manual scoping).
In the first version, the program was parallelized by using the automatic parallelization feature provided in the Sun Studio 9 Fortran 95 compiler. For the last version, manual scoping was performed on the variables that the compiler could not autoscope. The execution times of each routine using 1 to 4 threads were measured. The measurements were made on a Sun Fire™ 6800 system equipped with 24 UltraSPARC® III Cu chips running at 900 MHz. Table 4.2 presents the timing results. The numbers in square brackets ([]) denote the number of threads used for each run. The last two columns in Table 4.2 are copied from Table 4.1.
The automatically parallelized version (column labeled "auto-parallel") has the lowest speedup overall. The autoscoping version (column labeled "autoscoping only") achieves better speedup, because in the autoscoping version a better selection of parallel regions is made by the programmer. Because 11 parallel regions (out of 34) are serialized by the compiler due to autoscoping failure, the autoscoping version does not exhibit the best speedup. Even with this constraint, the autoscoping version is overall about 1.7 times faster than the automatically parallelized version when running with four threads.
The execution time for all the routines, except for INISFC, is smaller in the autoscoping version than in the automatically parallelized version. In INISFC, there are seven OpenMP worksharing loops contained in four parallel regions. There is one loop in each parallel region that the current compiler has difficulty analyzing, and this causes the whole parallel region to be serialized. In the automatically parallelized version, the compiler treats each worksharing loop as a different parallel region and automatically parallelizes some of these loops. Therefore, the automatically parallelized version results in a shorter execution time when run in parallel.
The best speedups are obtained with the manually scoped OpenMP version (column labeled "explicit scoping only") and the mixed-scoped version (column labeled "autoscoping plus explicit scoping"). The performance of these two versions is the same. However, in the latter version the programmer needs to manually scope only 13 variables, while in the former version the programmer needs to manually scope 1389 variables, a difference of more than 100x.

Table 4.2. Execution time with 1 to 4 threads. (a) automatic parallelization, (b) autoscoping with DEFAULT(__AUTO) only, (c) explicit scoping with the private clause, (d) autoscoping plus explicit scoping of the remaining 13 variables

Routine | auto-parallel | autoscoping only | explicit scoping only (original OpenMP) | autoscoping plus explicit scoping | # of explicitly privatized variables | # of variables which could not be autoscoped
vscchr | 42.820[1] 50.145[2] 51.896[3] 52.887[4] | 44.210[1] 29.971[2] 21.315[3] 18.023[4] | 46.463[1] 24.087[2] 16.882[3] 13.660[4] | 45.162[1] 24.147[2] 16.522[3] 12.869[4] | 880 | 0
inisfc | 1.861[1] 1.171[2] 0.921[3] 0.831[4] | 1.841[1] 1.991[2] 2.192[3] 2.252[4] | 1.961[1] 1.171[2] 0.921[3] 0.881[4] | 1.831[1] 1.091[2] 0.951[3] 0.871[4] | 16 | 4
simpeq | 58.501[1] 24.177[2] 14.960[3] 11.388[4] | 57.770[1] 23.647[2] 14.690[3] 10.597[4] | 58.251[1] 24.497[2] 14.520[3] 10.437[4] | 57.640[1] 24.377[2] 14.740[3] 10.487[4] | 48 | 0
vscflx | 74.142[1] 68.368[2] 66.056[3] 64.765[4] | 70.369[1] 58.301[2] 49.134[3] 45.222[4] | 74.504[1] 41.489[2] 29.150[3] 22.756[4] | 72.441[1] 40.368[2] 28.390[3] 22.065[4] | 10 | 2
roeflx | 34.944[1] 46.473[2] 44.201[3] 43.290[4] | 35.615[1] 24.457[2] 17.983[3] 16.762[4] | 35.195[1] 22.045[2] 16.912[3] 16.021[4] | 34.444[1] 21.575[2] 17.652[3] 17.242[4] | 220 | 0
geodif | 0.640[1] 0.610[2] 0.650[3] 0.580[4] | 0.490[1] 0.680[2] 0.620[3] 0.710[4] | 0.560[1] 0.400[2] 0.420[3] 0.280[4] | 0.590[1] 0.470[2] 0.400[3] 0.410[4] | 15 | 3
lubsol | 10.047[1] 4.853[2] 3.452[3] 2.452[4] | 10.167[1] 5.514[2] 3.693[3] 3.002[4] | 10.667[1] 5.354[2] 3.773[3] 2.932[4] | 10.217[1] 5.744[2] 3.623[3] 2.732[4] | 5 | 0
eulflx | 56.980[1] 62.714[2] 58.491[3] 56.750[4] | 57.330[1] 43.040[2] 34.814[3] 33.413[4] | 57.991[1] 35.705[2] 27.469[3] 25.548[4] | 56.880[1] 36.706[2] 28.220[3] 26.479[4] | 96 | 4
jacroe | 10.077[1] 18.003[2] 17.082[3] 17.152[4] | 9.897[1] 6.475[2] 4.633[3] 4.793[4] | 9.727[1] 6.264[2] 5.224[3] 4.423[4] | 9.327[1] 6.254[2] 5.324[3] 5.384[4] | 42 | 0
boumat | 7.575[1] 5.644[2] 4.813[3] 4.613[4] | 6.685[1] 5.444[2] 5.054[3] 4.553[4] | 6.905[1] 5.144[2] 4.893[3] 4.513[4] | 6.715[1] 5.094[2] 4.713[3] 4.423[4] | 3 | 0
lupibs | 10.517[1] 5.764[2] 4.033[3] 3.042[4] | 8.396[1] 4.483[2] 3.062[3] 2.202[4] | 8.506[1] 4.373[2] 3.022[3] 2.342[4] | 8.616[1] 4.333[2] 3.012[3] 2.282[4] | 54 | 0
Total | 249.354[1] 196.768[2] 174.892[3] 165.746[4] | 240.498[1] 170.699[2] 140.628[3] 129.100[4] | 258.241[1] 145.522[2] 111.968[3] 97.588[4] | 242.299[1] 145.742[2] 112.038[3] 97.969[4] | 1389 | 13
5 Related Work
A parallelizing compiler takes a sequential program as its input, detects the embedded parallelism and transforms the program into its parallel form. The related techniques have been studied extensively in academia for many years (for example, see [5], [6], [7], [8], [9], and [10]), and some of these techniques have been adopted by the industry. Nowadays, the automatic parallelization feature can be found in production compilers/tools from many companies and organizations, such as Intel/KSL, Portland
Group (STMicroelectronics), SGI, IBM, Sun Microsystems, HP, Intel, and NASA, just to name a few. Some take a source-to-source approach and generate parallel programs using either OpenMP directives or some vendor-specific directives, while the others integrate the transformation in the compilation process. Similar techniques have been used in this work to detect the proper scope of variables. However, there is one major difference between automatic parallelization and autoscoping. All previous work on automatic parallelization deals with sequential programs and loops only. For autoscoping in OpenMP, the input is an explicitly parallel OpenMP program rather than a sequential program. The parallel program may contain any type of OpenMP constructs/directives, such as SECTIONS, SINGLE, MASTER, BARRIER, etc. In our study, we find that OpenMP programmers like to merge parallel regions and use all kinds of features provided by OpenMP. The compiler, therefore, needs to analyze not only loops but also general regions. The compiler also needs to analyze the data flow in the parallel program and understand synchronization patterns. A set of new techniques has been developed in the Sun Studio 9 Fortran compiler to accomplish these tasks. Michael Voss et al. [14] extend Polaris, a research parallelizing compiler, to do OpenMP autoscoping. Because their method is based purely upon techniques from automatic parallelization, they can handle only PARALLEL DO regions that contain no other OpenMP constructs/directives. The interactive CAPO (Computer-Aided Parallelizer and Optimizer) tool from NASA Ames Research Center [15] takes a similar approach and therefore has the same limitation. Some vendor-specific parallelization directives also provide certain forms of a default scoping feature. For example, in SGI's Power Fortran parallelization directives ([11]), loop iteration variables are LASTLOCAL by default, and all other variables are SHARED by default. In Cray Fortran ([12]), there is an AUTOSCOPE directive. When used, a scalar or an array is scoped PRIVATE if it is 'written to and read from', or (for an array) the loop iteration variable does not appear in the subscript expression; otherwise, it is scoped SHARED. In Sun parallelization directives ([13]), scalars are PRIVATE and arrays are SHARED by default. OpenMP provides more kinds of constructs and is more flexible than any of the existing vendor-specific parallelization directives. It is, therefore, more challenging to design the autoscoping rules for OpenMP. The autoscoping rules for OpenMP proposed in this work are more sophisticated than those found in current vendor-specific directives.
6 Summary
The autoscoping feature offers a very attractive compromise between automatic and manual parallelization. In the PANTA code, the vast majority (1376 out of 1389) of variables that were manually privatized before can be automatically scoped by the new autoscoping feature in the Sun Studio 9 Fortran 95 compiler. After explicitly scoping the remaining 13 variables, the version using autoscoping performs as well as
the original OpenMP version. The availability of an autoscoping feature in an OpenMP compiler is definitely a tremendous help for the programmer, and can drastically reduce the probability of scoping errors.
Acknowledgments The authors extend their thanks to Eric Duncan and Prashanth Narayanaswamy for reviewing the paper and providing insightful comments, and to Iain Bason and Val Donaldson for their work on the Fortran 95 compiler front-end.
References 1. http://www.compunity.org, see section “New Ideas for OpenMP” with a link to http://support.rz.rwth-aachen.de/public/AutoScoping.pdf 2. Dieter an Mey, S. Schmidt: “From a Vector Computer to an SMP-Cluster - Hybrid Parallelization of the CFD Code PANTA”, 2nd European Workshop on OpenMP, Edinburgh, 2000 3. Institute for Jet Propulsion and Turbomachinery, RWTH Aachen University http://www.ist.rwth-aachen.de[/en/forschung/instationaere_li.html] 4. Volmar, T., Brouillet, B., Gallus, H.E., Benetschik, H.: Time Accurate 3D Navier-Stokes Analysis of a 1½ Stage Axial Flow Turbine, AIAA 98-3247, 1998. 5. Hans Zima, Barbara Chapman. Supercompilers for Parallel and Vector Computers, ACM Press, 1991 6. Wolfe, Michael J., Optimizing Supercompilers for Supercomputers, Ph. D. Thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, 1982. 7. Uptal Banerjee, Rudolf Eigenmann, Alexandru Nicolau and David A. Padua. Automatic program parallelization, in proceedings of the IEEE, Vol. 81, No. 2, Feb 1993 8. W. Blume, R. Doallo, R. Eigenmann, J. Grout, J. Hoeflinger, T. Lawrence, J. Lee, D. Padua, Y. Paek, W. Pottenger, L. Rauchwerger, and P. Tu. Parallel Programming with Polaris. IEEE Computer. Vol. 29, No. 12, pp 78-82, Dec. 1996. 9. M. W. Hall, J. M. Anderson, S. P. Amarasinghe, B. R. Murphy, S.-W. Liao, E. Bugnion and M. S. Lam. Maximizing Multiprocessor Performance with the SUIF Compiler. IEEE Computer, December 1996. 10. Constantine Polychronopoulos, Milind B. Girkar, Mohammad R. Haghighat, Chia L. Lee, Bruce P. Leung, Dale A. Schouten. “Parafrase-2: An Environment for Parallelizing, Partitioning, Synchronizing, and Scheduling Programs on Multiprocessors", Proceedings of the International Conference on Parallel Processing, St. Charles IL, August 1989, pp. II39-48 11. Fortran 77 Programmer's Guide, document number: 007-0711-060, Silicon Graphics, Inc., 1994 12. CF90™ Commands and Directives Reference Manual – S-3901-35, Cray Inc. 13. Sun Studio 8: Program Performance Analysis Tools, Sun Microsystems, Inc., http://docs.com/db/doc/817-5068
14. Michael Voss, Eric Chiu, Patrick Chow, Catherine Wong and Kevin Yuen. “An Evaluation of Auto-Scoping in OpenMP”, WOMPAT'2004: Workshop on OpenMP Applications and Tools, Houston, Texas, USA, 2004 15. H. Jin, M. Frumkin and J. Yan. “Automatic Generation of OpenMP Directives and Its Application to Computational Fluid Dynamic Codes”, WOMPEI 2000: International Workshop on OpenMP: Experiences and Implementations, Tokyo, Japan, 2000
An Evaluation of Auto-Scoping in OpenMP Michael Voss, Eric Chiu, Patrick Man Yan Chow, Catherine Wong, and Kevin Yuen Edwards S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, Toronto, Ontario, Canada phone: 416-946-8031, fax: 416-971-2326
[email protected] {chiue, chowp, wongc, yuenke}@ecf.toronto.edu
Abstract. In [1], Dieter an Mey proposes the addition of an AUTO attribute to the OpenMP DEFAULT clause. A DEFAULT(AUTO) clause would cause the compiler to automatically determine the scoping of variables that are not explicitly classified as shared, private or reduction. While this new feature would be useful and powerful, its implementation would rely on automatic parallelization technology, which has been shown to have significant limitations. In this paper, we implement support for the DEFAULT(AUTO) clause in the Polaris parallelizing compiler. Our modified version of the compiler will translate regions with DEFAULT(AUTO) clauses into regions that have explicit scopings for all variables. An evaluation of our implementation on a subset of the SPEC OpenMP Benchmark Suite shows that with current automatic parallelization technologies, a number of important regions cannot be statically scoped, resulting in a significant loss of speedup. We also compare our compiler’s performance to that of an Early Access version of the Sun Studio 9 Fortran 95 compiler [2].
1
Introduction
The OpenMP API [3, 4] is designed to allow users to flexibly write explicitly parallel applications for shared-memory systems. To parallelize an application using OpenMP, a user identifies parallel worksharing constructs and categorizes all variables accessed within these constructs using data scope attributes, e.g. SHARED, PRIVATE and REDUCTION. For example, Figure 1(a) shows a parallel loop with two shared arrays and two private scalars. In [1], Dieter an Mey argues that classifying variables as shared, private and reduction is tedious and error-prone. Therefore, he proposes a new data scope attribute for the OpenMP DEFAULT clause. As described in the OpenMP specifications [3, 4], the DEFAULT clause is used to specify the default data scoping for variables that do not appear explicitly in data scope clauses. Users may set
This work is supported in part by the Canadian National Science and Engineering Research Council, the Canada Foundation for Innovation, the Ontario Innovation Trust, the Connaught Foundation and the University of Toronto.
C$OMP PARALLEL DO SHARED(A,B)
C$OMP&PRIVATE(I,J)
      DO I = 1, 100
        DO J = 1, 100
          A(I,J) = A(I,J) + B(I,J)
        ENDDO
      ENDDO
C$OMP END PARALLEL DO
(a)

C$OMP PARALLEL DO DEFAULT(AUTO)
      DO I = 1, 100
        DO J = 1, 100
          A(I,J) = A(I,J) + B(I,J)
        ENDDO
      ENDDO
C$OMP END PARALLEL DO
(b)

Fig. 1. An example of a parallel loop with (a) all variables explicitly scoped and (b) with a DEFAULT(AUTO) clause
the default scoping to SHARED, PRIVATE or NONE. NONE requires that all variables be explicitly listed in a data scope attribute. Providing no DEFAULT clause is equivalent to DEFAULT(SHARED). In [1], a new attribute AUTO is proposed. The AUTO scope attribute indicates that variables not explicitly included in a data scope clause should be automatically assigned a scope by the compiler. In Figure 1(b), the DEFAULT(AUTO) clause would cause the compiler to automatically determine that arrays A and B are shared and scalars I and J are private. The proposed DEFAULT(AUTO) clause would be a powerful and useful addition to OpenMP, relieving users from the time-consuming process of explicitly scoping variables. However, as recognized in [1], this clause is a compromise between automatic and explicit parallelization. To determine the proper scoping attribute for the variables in a parallel loop, the compiler must do a similar analysis to that required to automatically parallelize the loop. Unfortunately, automatic parallelization is known to be effective in only approximately 50% of scientific benchmark applications [5]. In [1], it is suggested that if the compiler cannot determine proper data scopes for a parallel loop, the loop should be serialized. Given the known limitations of parallelizing compilers, this implies that a large number of parallel loops may need to be serialized. In this paper, we evaluate the efficacy of the DEFAULT(AUTO) clause by implementing it in the Polaris Fortran 77 compiler. Polaris is an advanced parallelizing research compiler that applies a range of techniques including the Range Test [6], array and scalar privatization, array and scalar reduction recognition, induction variable substitution and inter-procedural constant propagation [5]. In [7], the Polaris compiler was extended to parse as well as generate OpenMP directives. In this work, we further extend Polaris to be a full-fledged OpenMP-to-OpenMP translator and include support for the DEFAULT(AUTO) clause. This paper makes the following contributions:
– We extend the Polaris compiler to automatically scope variables when presented with a DEFAULT(AUTO) clause. This is one of the first implementations of this clause in an OpenMP compiler [2].
– We evaluate the efficacy of the DEFAULT(AUTO) clause by auto-scoping the Fortran 77 benchmarks in the SPEC OpenMP Benchmark Suite [8]. We
also compare the performance of our implementation with an Early Access version of the Sun Microsystems compiler [2].
– We discuss the use of speculative parallelization to cover cases where autoscoping fails.
Organization of the Paper
In Section 2, we present related work. Our modifications to the Polaris compiler are outlined in Section 3. An evaluation of auto-scoping on a subset of the SPEC OpenMP Benchmark Suite is presented in Section 4. The use of runtime tests for covering cases where static scoping fails is discussed in Section 5. In Section 6, we present our conclusions.
2
Related Work
OpenMP can be the output language of parallelizing compilers that perform source-to-source conversion of sequential applications into parallel code [9, 7]. Both the CAPO [9] and PCOMP [7] projects use OpenMP as a target language. This paper addresses auto-scoping in OpenMP, where users explicitly identify parallel loops and some scoping attributes. A DEFAULT(AUTO) clause is used to force the compiler to automatically scope the non-explicitly scoped variables. If users provide some scoping attributes in addition to the DEFAULT(AUTO), an autoscoping compiler may then be able to parallelize more loops than a traditional parallelizer. The DEFAULT(AUTO) clause was proposed by Dieter an Mey in [1]. Sun Microsystems plans to release a version of auto-scoping in their next release of the Sun Studio 9 Fortran 95 compiler [2]. In Section 4, we compare the performance of our implementation with an Early Access version of this compiler. Our implementation and that in [2] represent the first implementations of DEFAULT(AUTO) in OpenMP compilers. We use the Polaris parallelizing compiler for our work [5]. Automatic parallelization has been a widely studied area and there are a number of commercial and research automatic parallelizers available [10, 11, 5]. As discussed previously, auto-scoping of variables requires a similar analysis to automatic parallelization, and therefore we are able to leverage the advanced dependence testing capabilities of Polaris to support DEFAULT(AUTO). Runtime dependence testing can be used when traditional static dependence tests fail [12, 13, 14]. In Section 5, we discuss using a runtime test to perform scoping for variables that cannot be statically scoped. These tests would be similar in nature to the runtime dependence tests discussed in [12, 13, 14].
3
Supporting DEFAULT(AUTO) in Polaris
To evaluate the DEFAULT(AUTO) directive, we modified the Polaris compiler so that a parallel region containing DEFAULT(AUTO) would be translated into a
parallel region where all variables are explicitly scoped. If Polaris cannot determine the scoping for any variable in a parallel region, the region is serialized and all OpenMP directives are removed from the region. If, in addition to a DEFAULT(AUTO) clause, explicit scopings for some variables are provided, these explicit scopings will be used. This added information may allow Polaris to parallelize loops for which it cannot automatically determine the scope of all variables.
Implementation Details
Polaris is a source-to-source restructurer and, as described in [7], was extended so that it can both parse and generate OpenMP directives. However, the OpenMP support in Polaris was designed either (1) to express the parallelism detected through automatic parallelization of a sequential program or (2) to take an explicitly parallel application that uses OpenMP directives and generate an explicitly threaded program with calls to a runtime library (MOERAE). OpenMP to OpenMP translation was not supported. Figure 2 shows the original two paths that were supported through the Polaris compiler, as well as the needed OpenMP to OpenMP path. As shown in Figure 2, when Polaris is executed on a parallel program with OpenMP directives, its analysis passes are deactivated (the program is already explicitly parallel). In order to auto-scope parallel regions, these passes must still be executed. We therefore needed to ensure that all of the parallelization passes in Polaris would properly respond to and modify OpenMP directives that were parsed by the front-end. This task required small modifications to several passes and utilities in the compiler. Next, we needed to add support for DEFAULT(AUTO). This required modifications to the OpenMP parser and intermediate representation, as well as changes
[Figure: the two original Polaris paths — Fortran 77 → parser → ddtest, reduction and privatization passes → OpenMP backend → Fortran 77 + OpenMP, and Fortran 77 + OpenMP → parser → MOERAE backend → Fortran 77 + MOERAE calls — plus the added OpenMP to OpenMP path]
Fig. 2. Using Polaris as an OpenMP to OpenMP translator. The solid lines show paths originally supported by Polaris. The dotted path shows the OpenMP to OpenMP path that we added
!$OMP PARALLEL DEFAULT(AUTO)
      DO N = 1,7
        DO M = 1,7
!$OMP DO
          DO L = LSS(itsub),LEE(itsub)
            I = IG(L)
            J = JG(L)
            K = KG(L)
            LIJK = L2IJK(L)
            RHS(L,M) = RHS(L,M)
     +        - FJAC(LIJK,LM00,M,N)*DQCO(i-1,j,k,n,NB)*FM00(L)
     +        - FJAC(LIJK,LP00,M,N)*DQCO(i+1,j,k,n,NB)*FP00(L)
     +        - FJAC(LIJK,L0M0,M,N)*DQCO(i,j-1,k,n,NB)*F0M0(L)
     +        - FJAC(LIJK,L0P0,M,N)*DQCO(i,j+1,k,n,NB)*F0P0(L)
          ENDDO
!$OMP END DO NOWAIT
        ENDDO
      ENDDO
!$OMP END PARALLEL
(a)

!$OMP PARALLEL
!$OMP+DEFAULT(SHARED)
!$OMP+PRIVATE(M,L,N)
      DO n = 1, 7, 1
        DO m = 1, 7, 1
!$OMP DO
          DO l = lss(itsub), lee(itsub), 1
            rhs(l, m) = rhs(l, m)+(-dqco(ig(l), (-1)+jg(l), kg(l), n, nb))*
     *f0m0(l)*fjac(l2ijk(l), l0m0, m, n)+(-dqco(ig(l), 1+jg(l), kg(l), n
     *, nb))*f0p0(l)*fjac(l2ijk(l), l0p0, m, n)+(-dqco((-1)+ig(l), jg(l)
     *, kg(l), n, nb))*fjac(l2ijk(l), lm00, m, n)*fm00(l)+(-dqco(1+ig(l)
     *, jg(l), kg(l), n, nb))*fjac(l2ijk(l), lp00, m, n)*fp00(l)
          ENDDO
!$OMP END DO NOWAIT
        ENDDO
      ENDDO
!$OMP END PARALLEL
(b)

Fig. 3. An example loop (a) before and (b) after auto-scoping by Polaris. In this example no explicit scopings are provided by the user in (a)
to the dependence tester (the ddtest pass). Ddtest was modified to only operate on loops that are explicitly parallel (i.e. had an OpenMP DO directive) and are contained in a parallel region with a DEFAULT(AUTO) clause. All other regions and loops are passed to the OpenMP backend unchanged, with the original user directives preserved.
      SUBROUTINE RECURSION(n,k,a,b,c,d,e,f,g,h,s)
      REAL*8 A(*),B(*),C(*),D(*),E(*),F(*),G(*),H(*)
      REAL*8 T,S
      INTEGER N,K,I
      S = 0.0D0
C$OMP PARALLEL SHARED(D)
C$OMP+DEFAULT(AUTO)
C$OMP DO
      DO I = 1,N
        T = F(I) + G(I)
        A(I) = B(I) + C(I)
        D(I+K) = D(I) + E(I)
        H(I) = H(I) * T
        S = S + H(I)
      END DO
C$OMP END DO
C$OMP END PARALLEL
      END
(a)

      SUBROUTINE recursion(n, k, a, b, c, d, e, f, g, h, s)
      DOUBLE PRECISION a, b, c, d, e, f, g, h, s, t
      INTEGER*4 i, k, n
      DIMENSION a(*), b(*), c(*), d(*), e(*), f(*), g(*), h(*)
      s = 0.0D0
!$OMP PARALLEL
!$OMP+DEFAULT(SHARED)
!$OMP+PRIVATE(T,I)
!$OMP DO
!$OMP+REDUCTION(+:s)
      DO i = 1, n, 1
        t = f(i)+g(i)
        a(i) = b(i)+c(i)
        d(i+k) = d(i)+e(i)
        h(i) = h(i)*t
        s = h(i)+s
      ENDDO
!$OMP END DO
!$OMP END PARALLEL
      RETURN
      END
(b)

Fig. 4. An example loop (a) before and (b) after auto-scoping by Polaris. In this example explicit scopings are provided for the array D. Without the explicit scoping of D, Polaris would serialize the loop
The loops that are to be auto-scoped are passed through all of the parallelization passes in the compiler, including dependence testing, reduction recognition
and privatization. If Polaris can determine the scope for all variables in the loop, the appropriate directives are added to the region, otherwise the region is serialized. If the user includes explicit scoping for some variables in a region, these scopings will override those determined by Polaris, including the cases where Polaris cannot determine a scoping. The use of explicit scoping may allow Polaris to successfully auto-scope loops that it would normally need to serialize. Figures 3 and 4 show the first two example loops provided in [1] before and after auto-scoping by Polaris. In Figure 3, Polaris would have been able to automatically parallelize this loop without the OpenMP directives, and therefore is able to auto-scope it successfully. The OpenMP directives in Figure 3 therefore serve as only an indicator that this loop should be automatically parallelized, but offer no assistance in scoping. In Figure 4, Polaris would have been unable to automatically parallelize the loop without the OpenMP directives. However, since D is explicitly classified as SHARED by the user, the loop can be automatically scoped. It should be noted that Polaris is a loop-level parallelizer. It therefore will only be able to auto-scope regions that have the semantics of a !$OMP PARALLEL DO, not general single-program multiple-data (SPMD) PARALLEL regions. The Sun Compiler uses an array data race detection method to address general parallel regions, instead of using traditional loop-based dependence analysis [2].
4
Evaluation of DEFAULT(AUTO)
To evaluate our implementation of DEFAULT(AUTO) in the Polaris compiler, we auto-scoped all of the Fortran 77 benchmarks from the SPEC OpenMP Benchmark Suite [8]. We hand-modified applu, apsi, mgrid, swim and wupwise so that all regions included a DEFAULT(AUTO) directive and did not explicitly scope any variables. We auto-scoped these benchmarks using our modified version of Polaris. We also used the Early Access version of the Sun Studio 9 Fortran 95 compiler to auto-scope the same programs. To evaluate the speedup lost by auto-scoping with Polaris, we ran the original parallel code and the auto-scoped code on a 4-processor Xeon server. The system has four 1.6 GHz Xeon processors and 16 GBytes of main memory. We use the Omni OpenMP compiler [15] with -O2 as our back-end compiler. To evaluate the performance of the Sun compiler, we auto-scoped the same applications using the -vpara flag to obtain feedback on which regions it successfully auto-scoped. When compiling for SPARC, we used -O3 with the Sun Studio 9 Early Access version F95 compiler.1 Table 1 shows the results from our experiments. For both Swim and Mgrid, Polaris is able to auto-scope all of the parallel loops, resulting in identical speedups for the original and auto-scoped versions.
1 We were unable to secure dedicated time on a SPARC-based server prior to submission to determine the speedups for the Sun compiler. However, the Sun compiler in all cases is able to auto-scope less regions than Polaris. We hope to collect the speedups for the Sun compiler as future work.
Table 1. Performance of auto-scoping in Polaris and the early access version of the Sun Studio 9 Fortran 95 compiler. The speedups are calculated from runs on the Xeon server using the original code and the auto-scoped code produced by Polaris

Benchmark | Regions | Serialized by Polaris | % Parallel by Polaris | Serialized by Sun | % Parallel by Sun | Explicit Speedup | Auto-Scope Speedup
Applu   | 33 | 2  | 94%  | 9  | 72% | 5.8 | 1.1
Apsi    | 32 | 17 | 47%  | 24 | 25% | 2.8 | 1
Mgrid   | 12 | 0  | 100% | 2  | 83% | 3.6 | 3.6
Swim    | 8  | 0  | 100% | 3  | 63% | 1.9 | 1.9
Wupwise | 10 | 6  | 40%  | 8  | 20% | 3.0 | 1.25
The Sun compiler is unable to fully auto-scope these benchmarks, serializing some important parallel regions. The Sun compiler's array data race detection method is unable to auto-scope these regions, whereas the full loop-based dependence analysis of Polaris succeeds. In the remaining benchmarks, however, both Polaris and the Sun compiler are able to successfully auto-scope only a subset of the regions. Both compilers must serialize the most time-consuming regions, sacrificing almost all speedup in the process. In Applu, two regions must be serialized due to the inability of the compiler to classify variables. One of these regions is the outermost parallel region in which the SSOR iterations are performed. Polaris is designed to detect parallel loops; however, this main loop does not contain a parallel loop, but instead uses an SPMD model of parallelism. The Sun compiler, even with its more general analysis, is also unable to auto-scope this most important region. Because the compilers cannot analyze this loop, speedup suffers considerably. In both Apsi and Wupwise, a number of regions must be serialized. Polaris performs inter-procedural optimization through inlining. In both Apsi and Wupwise, there are a number of subroutine calls which cannot be inlined by the compiler, and therefore the loops that contain them cannot be fully analyzed. In both applications, speedup is significantly limited by these serialized parallel regions. In Apsi, the major parallel regions cannot be effectively scoped, resulting in no speedup. For Wupwise, only a small speedup is obtained. The Sun compiler is also unable to auto-scope the important regions in these applications. However, it is able to auto-scope some regions that Polaris cannot. For example, the Sun compiler succeeds on the region from Wupwise shown in Figure 5. Polaris must serialize this region because it can only analyze simple PARALLEL DO regions and therefore cannot prove that LSCALE and LSSQ are private to the region. Table 1 suggests that even advanced parallelizing compilers are often unable to auto-scope important regions. In addition, the performance of auto-scoping will be highly dependent on the capabilities of the OpenMP compiler. For example, the number of regions auto-scoped successfully by the Sun compiler and Polaris was different for all benchmarks tested. The number of auto-scoped regions
C$OMP PARALLEL DEFAULT(AUTO)
      LSCALE = ZERO
      LSSQ = ONE
C$OMP DO
      DO IX = 1, 1 + (N - 1)*INCX, INCX
        IF (DBLE(X(IX)) .NE. ZERO) THEN
          TEMP = ABS(DBLE(X(IX)))
          IF (LSCALE.LT.TEMP) THEN
            LSSQ = ONE + LSSQ*(LSCALE / TEMP)**2
            LSCALE = TEMP
          ELSE
            LSSQ = LSSQ + (TEMP / LSCALE)**2
          END IF
        END IF
        DITMP = DIMAG(X(IX))
        IF (DITMP .NE. ZERO) THEN
          TEMP = ABS(DIMAG(X(IX)))
          IF (LSCALE.LT.TEMP) THEN
            LSSQ = ONE + LSSQ*(LSCALE / TEMP)**2
            LSCALE = TEMP
          ELSE
            LSSQ = LSSQ + (TEMP / LSCALE)**2
          END IF
        END IF
      END DO
C$OMP END DO
C$OMP CRITICAL
      IF (SCALE .LT. LSCALE) THEN
        SSQ = ((SCALE / LSCALE)**2)*SSQ + LSSQ
        SCALE = LSCALE
      ELSE
        SSQ = SSQ + ((LSCALE / SCALE)**2)*LSSQ
      END IF
C$OMP END CRITICAL
C$OMP END PARALLEL
Fig. 5. An example of code that can be auto-scoped by the Sun compiler but not by Polaris. The current implementation in Polaris cannot analyze regions that are not semantically equivalent to an !$OMP PARALLEL DO
will therefore not be portable across compilers that differ in analysis capabilities, leading to potentially significant performance differences.
5
Runtime Support for Auto-Scoping
When static dependence tests cannot determine whether a loop is parallel or not, runtime dependence testing may be used [12, 13, 14]. Similarly, runtime tests
!$OMP PARALLEL
!$OMP+DEFAULT(SHARED)
!$OMP+PRIVATE(U51K,U41K,U31K,Q,U21K,M,K,I,U41,U31KM1,U51KM1,U21KM1)
!$OMP+PRIVATE(U41KM1,TMP,J)
!$OMP+SPECULATE(UTMP,RTMP)
!$OMP DO
!$OMP+LASTPRIVATE(FLUX2)
      DO j = jst, jend, 1
        ...
      ENDDO
!$OMP END DO
!$OMP END PARALLEL

Fig. 6. Using a SPECULATE directive in the RHS subroutine of Applu. Polaris detected that there might be cross-iteration dependencies for the UTMP and RTMP arrays
might be used to determine scoping for variables that cannot be automatically scoped at compile time. However, to perform runtime dependence testing, the compiler must be able to capture all accesses to arrays in the region. For example, in the LRPD test [14], all reads and writes are marked at runtime in shadow arrays that are later used to determine if the loop has cross-iteration dependencies. Likewise, for an inspector-executor model, such as described in [12], the compiler must construct an inspector loop that captures an array's access patterns. Since the compiler must add the code to do this marking or testing, all accesses must be visible to the compiler. In most of the benchmarks tested in Section 4, if Polaris was unable to auto-scope a region, it was due to the inability to analyze all accesses to the variables, i.e. inlining failed. Therefore, it seems that the compiler would have similar difficulties in generating the code required for runtime dependence testing. To quantify the number of regions that might be executed using an LRPD-style runtime test, we further modified Polaris to generate a SPECULATE directive for any region that could be fully analyzed, but still had to be serialized. In these regions, Polaris was able to see all accesses to all variables, but limitations in dependence testing caused the loop to be serialized. For the benchmarks in Section 4, only 2 regions met this criterion. Both of these regions are small and would not significantly improve the resulting speedup. Figure 6 shows an example of a region from Applu after the SPECULATE directive has been added by the compiler.
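As an illustration of what such a runtime check could look like, the C sketch below (hypothetical code, not part of Polaris or the actual LRPD implementation) records the last iteration that wrote each array element, in the spirit of the shadow-array marking described above. It only checks write-write conflicts and assumes one write per iteration through an index function such as the l3(...) statement function; a full LRPD test also tracks reads, reductions and privatization.

#include <stdlib.h>

/* idx(i) is an assumed callback returning the element written by iteration i. */
int writes_are_disjoint(int n_iters, size_t array_len, size_t (*idx)(int))
{
    int *last_writer = malloc(array_len * sizeof *last_writer);
    if (last_writer == NULL)
        return 0;                          /* be conservative if we cannot test */
    for (size_t e = 0; e < array_len; e++)
        last_writer[e] = -1;

    int disjoint = 1;
    for (int i = 0; i < n_iters && disjoint; i++) {
        size_t e = idx(i);
        if (e >= array_len || last_writer[e] != -1)
            disjoint = 0;                  /* element written by two iterations */
        else
            last_writer[e] = i;
    }
    free(last_writer);
    return disjoint;       /* 1: the loop may run in parallel for this array */
}

If the check fails at runtime, the region would simply fall back to serial execution, just as a failed speculative run does under the LRPD scheme.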
6
Conclusion
In this paper, we have evaluated an implementation of auto-scoping in an advanced parallelizing compiler. In Section 3 we presented an overview of DEFAULT(AUTO) in the Polaris source-to-source translator. In Section 4, we autoscoped several benchmarks from the SPEC OpenMP Benchmark Suite and compared the performance of our implementation with the Early Access version
of the Sun Studio 9 Fortran 95 compiler. Our evaluation demonstrates that auto-scoping is limited by the automatic parallelization capabilities of the target OpenMP compiler. For two of the five tested benchmarks, our compiler was able to auto-scope all of the parallel regions. However for the remaining applications, some regions had to be serialized and the resulting speedups were significantly degraded. The Sun compiler was able to auto-scope fewer regions than Polaris for each benchmark. While the DEFAULT(AUTO) clause would be a powerful and useful addition to the OpenMP specification, it might lead to a strong dependence on automatic parallelization technology. The performance of an application that is auto-scoped will therefore not be portable across compilers. Even if a user confirms that their compiler is able to auto-scope the regions in their code that contain a DEFAULT(AUTO) clause, there is no guarantee that a different compiler will likewise be able to auto-scope the same region. However, auto-scoping could be a useful tool for developers as they manually parallelize an application. A user might explicitly include directives detected by an auto-scoping compiler, relieving them of some of the tedious and error-prone work of manually classifying variables.
Acknowledgments We would like to thank Yuan Lin for pointing us to the Early Access version of the Sun Studio 9 F95 compiler, and for his feedback on this paper.
References 1. Dieter an Mey. A compromise between automatic and manual parallelization: Auto-scoping (proposal for an additional data scope attribute of the omp parallel directive’s default clause). http://support.rz.rwthaachen.de/public/AutoScoping.pdf, September 2001. 2. Yuan Lin, Christian Terboven, Dieter an Mey, and Nawal Copty. Sun Microsystems Automatic Scoping of Variables in Parallel Regions of an OpenMP Program. In WOMPAT’2004: Workshop on OpenMP Applications and Tools, Houston, Texas, USA, May 2004. 3. The OpenMP Architecture Review Board. OpenMP C and C++ Application Program Interface, 2.0 edition, March 2002. 4. The OpenMP Architecture Review Board. OpenMP Fortran Application Program Interface, 2.0 edition, November 2000. 5. William Blume, Ramon Doallo, Rudolf Eigenmann, John Grout, Jay Hoeflinger, Thomas Lawrence, Jaejin Lee, David Padua, Yunheung Paek, Bill Pottenger, Lawrence Rauchwerger, and Peng Tu. Parallel Programming with Polaris. IEEE Computer, pages 78–82, December 1996. 6. William Blume and Rudolf Eigenmann. The Range Test: A Dependence Test for Symbolic, Non-linear Expressions. Proceedings of Supercomputing ’94, Washington D.C., pages 528–537, November 1994.
7. Seon Wook Kim, Seung Jai Min, Michael Voss, Sang Ik Lee, and Rudolf Eigenmann. Portable compilers for openmp. In Proc. of WOMPAT 2001: Workshop on OpenMP Applications and Tools, West Lafayette, Indiana, July 2001. 8. Vishal Aslot, Max Domeika, Rudolf Eigenmann, Greg Gaertner, Wesley B. Jones, and Bodo Parady. SPEComp: A New Benchmark Suite for Measuring Parallel Computer Performance. In Proceedings of the Workshop on OpenMP Applications and Tools, pages 1–10, Lafayette, Indiana, July 2001. 9. Michael Frumkin Haoqiang Jin and Jerry Yan. Parallelization with CAPO – A User Manual. Technical Report NAS-01-008, NASA Ames Research Center, August 2001. 10. Francis E. Allen. An overview of the ptran analysis system for multiprocessing. Journal of Parallel and Distributed Computing, pages 617–640, Oct 1988. 11. Mary W. Hall, Jennifer M. Anderson, Saman P. Amarasinghe, Brian R. Murphy, Shih-Wei Liao, Edouard Bugnion, and Monica S. Lam. Maximizing multiprocessor performance with the SUIF compiler. Computer, pages 84–89, December 1996. 12. J. Saltz, R. Mirchandaney, and K. Crowley. Run time parallelization and scheduling of loops. IEEE Transactions on Computers, 40(5):603–612, May 1991. 13. Lawrence Rauchwerger and David Padua. The PRIVATIZING DOALL Test: A Run-Time Technique for DOALL Loop Identification and Array Privatization . Proceedings of the 8th ACM International Conference on Supercomputing, Manchester, England, pages 33–43, July 1994. 14. L. Rauchwerger and D. Padua. The LRPD Test: speculative run-time parallelization of loops with privatization and reduction parallelization. In Proceedings of the SIGPLAN 1995 Conference on Programming Languages Design and Implementation, pages 218–232, June 1995. 15. RWCP. The Omni OpenMP Compiler. http://phase.hpcc.jp/Omni/, 2004.
Structure and Algorithm for Implementing OpenMP Workshares Guansong Zhang, Raul Silvera, and Roch Archambault IBM Toronto Lab, Toronto, ON, L6G 1C7, Canada
Abstract. Although OpenMP has become the leading standard in parallel programming languages, the implementation of its runtime environment is not well discussed in the literature. In this paper, we introduce some of the key data structures required to implement OpenMP workshares in our runtime library and also discuss considerations on how to improve its performance. This includes items such as how to set up a workshare control block queue, how to initialize the data within a control block, how to improve barrier performance and how to handle implicit barrier and nowait situations. Finally, we discuss the performance of this implementation focusing on the EPCC benchmark. Keywords: OpenMP, parallel region, workshare, barrier, nowait.
1
Introduction
Clearly OpenMP [1][2] has become the leading industry standard for parallel programming on shared memory and distributed shared memory multiprocessors. Although there are different hardware architectures to support the programming model, the implementation of an OpenMP programming language usually can be separated into two parts: the compilation process and runtime support, as shown in Figure 1. In this figure, the language-dependent frontends translate user source code to an intermediate representation (IR), to be processed by an optimizing compiler into a resulting node program, which interacts with a runtime environment that starts and controls the multi-threaded execution. There are already papers introducing the compilation part of this structure [3], yet little discussion has been dedicated to the runtime environment itself. In fact, as the OpenMP standard continues evolving, there are still many open issues in this area, such as the mechanisms required to support nested parallelism, threadprivate, etc. In this paper, we will focus our attention on the runtime environment implementation. The runtime system we have developed is based on the pthreads library available on most major operating systems, including Linux, Mac OS and AIX. We try to avoid the controversial parts of the OpenMP runtime environment by concentrating mainly on the structures and algorithms used to implement OpenMP workshares, a well-understood yet fundamental concept in parallel programming. We will also discuss the performance considerations when we use standard benchmarks to test our implementation.
The opinions expressed in this paper are those of the authors and not necessarily of IBM.
[Figure: Fortran and C/C++ source code pass through the Fortran and C/C++ frontends to an IR with directives; high-level optimization produces the node program IR, which uses the SMP runtime library; machine-level optimization then produces machine code]
Fig. 1. Compilation structure
2
Classifying OpenMP Workshares
The parallelism in an OpenMP application program is expressed through parallel regions, worksharing constructs, and combined parallel constructs. A combined parallel construct can be considered as a parallel region with only one worksharing construct inside it, although its implementation may be more efficient from a performance point of view. In any case, a parallel region is the basic concept in OpenMP programming. When a parallel region starts, multiple threads will execute the code inside the region concurrently[4]. In Figure 2, a master thread starts up a parallel region executed by 8 threads. Through parallel regions, multiple threads accomplish worksharing in an OpenMP program.
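To make this equivalence concrete, the short C/OpenMP sketch below (illustrative code, not taken from the paper) shows a combined parallel construct next to the parallel region with a single worksharing loop that it abbreviates; an implementation may nevertheless generate more efficient code for the combined form.

/* Combined parallel construct: one directive expresses both the parallel
   region and the worksharing loop inside it. */
void scale_combined(double *a, int n, double f)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] *= f;
}

/* Semantically equivalent form: a parallel region containing exactly one
   worksharing construct. */
void scale_region(double *a, int n, double f)
{
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < n; i++)
            a[i] *= f;
    }
}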
Fig. 2. Parallel region (a master thread starts a parallel region executed by a team of threads, which perform workshares and then join back to the master thread)
Fig. 3. Common workshare: (a) a worksharing DO with numbered chunks of iterations, (b) SECTIONS
Fig. 4. Synchronization workshare: (a) SINGLE, (b) explicit BARRIER
The simplest form of worksharing is replicated execution of the same code segment on different threads. It is more useful to divide work among multiple threads — either by having different threads operate on different portions of a shared data structure, or by having different threads perform entirely different tasks. Each cooperation of this kind is considered a workshare1 in this paper. In the figure, we use a round-cornered rectangular block to represent one workshare. The most common forms of workshares in the OpenMP specification are the worksharing DO and SECTIONS, as shown in Figure 3 (a) and (b) respectively. We use different numbers to mark the sequential order of the blocks in an OpenMP DO construct, but this does not mean the implementation will ensure that the corresponding threads work on those blocks. In fact, the mapping of work to threads is determined by the schedule type of the DO construct as specified by the OpenMP standard. Similarly, the mapping between the sections in a SECTIONS construct and the threads is also decided by the schedule type. The fundamental difference between a DO construct and a SECTIONS construct is that in SECTIONS the code segments executed by threads could be entirely different.
Besides the common workshares introduced above, we can consider other OpenMP structures as workshares too. The typical examples are SINGLE constructs and explicit barriers, as in Figure 4 (a) and (b). A SINGLE construct is semantically equivalent to a SECTIONS construct with only one SECTION inside, while an explicit barrier is like a SECTIONS construct with neither NOWAIT clauses nor any SECTION inside. For a SINGLE construct, the first thread that encounters the code will execute the block. This is different from a MASTER construct, where the decision can be made simply by checking the thread ID. Worksharing DO, SECTIONS and SINGLE constructs have an important common feature: they have an implicit barrier at the end of the construct, which may be disabled by using the optional NOWAIT clause. An explicit BARRIER is semantically equivalent to an empty SINGLE section without the NOWAIT clause. This observation leads us to classify explicit barriers in the same category as workshare constructs. Different implementations may have a different view of this kind of classification. The real advantage of considering them all as workshares is for practical coding. From an implementation point of view, their common behaviors will lead to a common code base, thus improving the overall code quality. From now on, we will use workshare to refer to any one of the workshares we have mentioned above.
1 The concept of workshare here is different from the parallel construct, WORKSHARE, in the Fortran OpenMP specification 2.0, which can be treated semantically as the combination of OMP DO and OMP SINGLE constructs.
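The following C/OpenMP fragment (an illustrative sketch, not code from the paper) shows the equivalence concretely: a SINGLE construct next to the SECTIONS construct with one SECTION that it corresponds to. In both cases one thread executes init_tables() (an assumed helper) and the implicit barrier at the end of the construct holds the team until it is done.

void init_tables(void);        /* assumed helper: one-time setup work */

void setup_single(void)
{
    #pragma omp parallel
    {
        #pragma omp single     /* first thread to arrive runs the block */
        init_tables();
        /* implicit barrier here unless a nowait clause is given */
    }
}

void setup_sections(void)      /* semantically equivalent formulation */
{
    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            init_tables();
        }
        /* implicit barrier at the end of the SECTIONS construct */
    }
}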
3
Key Data Structures
In this section, we will examine the techniques for implementing workshares in a parallel region.
3.1 General Requirement
The specific ways of implementing workshares in a parallel region may differ from one implementation to another, but from the analysis in the previous sections one can see that at least the following data segments should be in a control block for workshares:
– Structure to hold workshare-specific information. This is needed to store information regarding an OpenMP DO construct or SECTIONS construct: the initial and final values of the loop induction variable, the schedule type, etc.
– Structure to complete possible barrier synchronization. This is used to implement any barriers that may be needed during the lifetime of the workshare construct.
– Structure to control access to the workshare control block. This is typically a lock to ensure that only one thread modifies the information in the shared control block, for example, to mark the workshare as started or to show that a particular section of code is already done.
Since different workshare constructs can be put into the same parallel region, the first item must be a "per workshare" value, i.e. for each workshare in the parallel region there should be a corresponding one. Because of the existence of NOWAIT clauses, multiple workshares can be active at the same time — executed by different threads simultaneously. So, multiple instances of this data structure must exist simultaneously, one for each active workshare. The same observation can be made for the third item in the list. Each thread will access its own active workshare control block, which may or may not differ from the others. In general, there has to be a queue of workshare control blocks for each parallel region. A minimal sketch of such a control block and its queue is given below.
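The C declarations below are a minimal sketch of the three requirements just listed, not the actual IBM runtime; all type and field names are invented, and the lock is shown as a pthreads mutex because the runtime described here is built on the pthreads library.

#include <pthread.h>

struct barrier_state;                  /* barrier structure, see Section 3.4   */

typedef struct workshare_cb {
    /* (1) workshare-specific information, e.g. for a DO or SECTIONS construct */
    long  lower, upper, stride;        /* loop bounds of a worksharing DO      */
    int   sched_kind, chunk;           /* schedule type and chunk size         */
    int   started;                     /* has some thread started this one?    */

    /* (2) structure used to complete a possible (implicit) barrier */
    struct barrier_state *barrier;

    /* (3) structure controlling exclusive access to this control block */
    pthread_mutex_t lock;
} workshare_cb_t;

typedef struct workshare_queue {
    workshare_cb_t *blocks;            /* per-parallel-region queue of blocks  */
    int             length;            /* number of blocks currently allocated */
} workshare_queue_t;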
3.2 The Control Block Queue Length
Once we understand the basic structures needed for a parallel region, we have to consider their construction, destruction and initialization. At the same time, we need to minimize the overhead of creating and manipulating such structures, as noted in [5]. It cannot be statically predicted how many parallel regions a program may explore and how many of them will be active at the same time, because of nested parallelism or explicit usage of threads in user code. Therefore, the workshare control block queue will be allocated dynamically. In addition to the discussion in Section 3.1, a workshare control block queue has to be constructed whenever a parallel region is encountered, and it will be destroyed when the parallel region ends. Before we go any further, we need to decide the length of this queue. For example, if we map each workshare to its own control block in a straightforward manner, the following parallel region

!$OMP PARALLEL
!$OMP DO
      DO i = istart, iend
      ...
      END DO
!$OMP END DO
!$OMP DO
      DO i = istart, iend
      ...
      END DO
!$OMP END DO NOWAIT
!$OMP DO
      DO i = istart, iend
      ...
      END DO
!$OMP END DO
!$OMP DO
      DO i = istart, iend
      ...
      END DO
!$OMP END DO
!$OMP END PARALLEL
will need the workshare control block queue shown in Figure 5, where there are four block elements in total.

Fig. 5. Workshare control block queue (the maximum number of active workshares versus the length of the workshare control block queue)
A simple way of handling the queue is to create a new control block whenever a workshare is encountered, or more accurately, whenever a workshare instance is encountered. If a workshare is defined inside a sequential loop body, we have as many workshare instances as the number of iterations in that loop. Since the workshare control block is a shared construct for multiple threads, constantly increasing its size adds to the runtime overhead of creating parallel regions. Moreover, doing so increases the memory footprint of the overall program execution. A better solution is to allocate a chunk of workshare control blocks at once, use them as a block pool, and try to reuse the pool whenever possible, only increasing its size when absolutely necessary. In the queue, we set up extra fields, such as a next-available workshare control block pointer, and set it back to the beginning whenever an explicit or implicit barrier is met. Thus we only need two block structures instead of the four in Figure 5 for the code snippet above. When the first workshare instance finishes, a barrier synchronization occurs, and the next-available workshare control pointer is set back to the first block structure. Then, at most two block structures are needed because of the NOWAIT clause. After that, another barrier sets the pointer back to the beginning again. Furthermore, resizing the pool may need extra book-keeping structures and work. If we artificially introduce a barrier on a nowait workshare when the queue becomes full, we do not need to increase the size at all.2 In real applications, a special use of NOWAIT clauses is to form a coarse-grained software pipeline, such as the one in APPLU of SPECOMP2001 [6]; in that case one cannot degrade any nowait workshare. But in such cases, the length of the pipeline is often set to the number of hardware processors to get the best performance. Choosing a proper fixed-size queue to avoid extra overhead is still a valid consideration. Regardless of whether the queue is resized or not, the number of hardware processors or, to be conservative, a multiple of it, is a good candidate for the initial chunk size of the pool. Then we can resize the queue if needed. A sketch of this pool-reuse policy is given below.
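The following hypothetical C sketch illustrates the pool-reuse policy just described; it builds on the control block types sketched in Section 3.1 and is not the actual runtime code. Each thread keeps its own cursor into the shared pool: because every thread encounters the same sequence of workshares and barriers, all cursors select the same blocks, and a barrier rewinds the cursor so the fixed-size pool can be reused.

void thread_barrier(void);                     /* assumed helper: see Section 3.4 */

/* Pick the control block for the next workshare instance this thread meets. */
workshare_cb_t *next_workshare(workshare_queue_t *q, int *cursor)
{
    if (*cursor == q->length) {
        /* Pool exhausted: instead of resizing, artificially introduce a
           barrier on a nowait workshare so all earlier blocks are finished
           and the pool can be rewound. */
        thread_barrier();
        *cursor = 0;
    }
    return &q->blocks[(*cursor)++];
}

/* Called at every explicit or implicit barrier. */
void barrier_and_rewind(int *cursor)
{
    thread_barrier();      /* all threads synchronize here, so every earlier */
    *cursor = 0;           /* workshare is done and the pool can be reused   */
}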
3.3 Fine-Grained Optimization
From the analysis in Section 3.1, a lock has to be defined within the control block structure. Whether it is hand-coded as in [7] or taken directly from the pthreads library [8], the lock itself needs to be initialized, along with the other structures for the workshare-specific information. The master thread can initialize all the structures at once when they are allocated at setup time. But for parallel regions where the maximum number of active workshares is small (only one when no NOWAIT clause is present), this process brings extra overhead. It is more desirable to defer the initialization of the block items until their use becomes necessary.
2 We do need to mark the head and tail position of the active workshares in the queue, so the next available pointer can be set, from the position next to the head, back to its tail when a barrier is encountered.
Algorithm 1, used to execute a workshare, achieves this by always keeping one extra block in the queue ready during the initialization phase. In this scheme, the master thread initializes the first control block, thus making the queue ready to use for the first workshare. Then, when the first thread exclusively accesses the available control block through the lock, it makes the next control block available by initializing it.

Algorithm 1: Execute a workshare in a parallel region
Data: Workshare control block queue
Data: Next available block pointer
begin
  if this workshare is not locked then
    lock it;
    if this workshare is already started then
      release the lock;
    else
      mark this workshare as started;
      initialize the next available control block if it is new;
    end
  end
  execute the corresponding workshare and release the lock if I have locked it;
  if a barrier is needed then
    do barrier synchronization and reset the next available control block pointer;
  end
end
The next control block may never be used. But by keeping it ready, we do not need another locking step to initialize a shared item: the initialization shares the lock taken for exclusive access to the previous control block. That exclusive access is needed anyway, since each thread needs to know whether the workshare has already been worked on. A rough C rendering of this scheme is sketched below.
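The fragment below is a rough, illustrative rendering of Algorithm 1 under the same assumptions as the earlier control-block sketch; it is not the shipped library code. Every thread briefly takes the block's lock to learn whether the workshare has been started, and the thread that claims the block also initializes the next block while it still holds that lock, so no separate initialization lock is needed. The helpers init_block_if_new, do_my_share_of_work and thread_barrier are assumed.

void init_block_if_new(workshare_cb_t *cb);     /* "init ... if it is new"      */
void do_my_share_of_work(workshare_cb_t *cb);   /* chunks per the schedule type */
void thread_barrier(void);

void execute_workshare(workshare_queue_t *q, int *cursor, int need_barrier)
{
    workshare_cb_t *cb   = &q->blocks[*cursor];
    workshare_cb_t *next = &q->blocks[(*cursor + 1) % q->length];

    pthread_mutex_lock(&cb->lock);
    if (!cb->started) {                /* I am the first thread to arrive:  */
        cb->started = 1;               /* claim the block and, under the    */
        init_block_if_new(next);       /* same lock, get the next one ready */
    }
    pthread_mutex_unlock(&cb->lock);

    do_my_share_of_work(cb);           /* all threads share the actual work */

    if (need_barrier) {
        thread_barrier();              /* wait for the team, then reset the */
        *cursor = 0;                   /* next-available pointer            */
    } else {
        (*cursor)++;
    }
}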
3.4 Barrier Implementation
Many papers have already been dedicated to the implementation of barrier synchronization; here we use Algorithm 2, described in [9]. Since both the distributed counter and the local sensor are padded cache lines, we would like to avoid allocating two cache-line arrays for every barrier, so all the barriers in a parallel region share the same pair of counter and sensor. Before a barrier starts, all the elements of the counter array are set to one, as are those of the local sensor array. One thread in the group, for instance the master thread, acts as if it were the last thread: it decreases its own element of the distributed counter array and then spins, checking whether all of the counter elements are zero. The remaining threads decrease their own counter elements and then spin on their own local sensors. When the designated thread finds that the counter elements are all zero, it sets all the counter elements back to one and then zeroes all of the elements in the local sensor array.
Algorithm 2: Barrier with distributed counter and local sensor
  Data: Distributed counter with each element as one
  Data: Local sensor with each element as one
  begin
    Decrease my own distributed counter element;
    if I am the master thread then
      repeat
        foreach element in distributed counter do check if it is zero
      until all distributed counter elements are zero;
      foreach element in distributed counter do set it back to one end
      foreach element in local sensor do set it to zero end
    else
      repeat check my local sensor element until it is zero;
    end
    Set my own local sensor element back to one;
  end
Finally, when all of the threads leave the barrier after their local sensor is zeroed, they reset their local sensor back to one.
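For illustration, a compact C sketch of this scheme follows. The padding size, the thread limit and the use of plain volatile spinning are simplifications of what a production runtime would do (which would also add explicit memory fences); the names are ours, not the library's.

    #define CACHE_LINE  128
    #define MAX_THREADS 64

    typedef struct {
        volatile int v;
        char pad[CACHE_LINE - sizeof(int)];
    } padded_int;

    static padded_int counter[MAX_THREADS];    /* one padded cache line per thread */
    static padded_int sensor [MAX_THREADS];

    void dc_barrier_init(int nthreads)         /* set every element to one         */
    {
        for (int t = 0; t < nthreads; t++) {
            counter[t].v = 1;
            sensor[t].v  = 1;
        }
    }

    void dc_barrier(int my_id, int nthreads)
    {
        counter[my_id].v = 0;                  /* decrease my own counter element  */

        if (my_id == 0) {                      /* designated "last" thread         */
            for (;;) {                         /* spin until all counters are zero */
                int done = 1;
                for (int t = 0; t < nthreads; t++)
                    if (counter[t].v != 0) { done = 0; break; }
                if (done) break;
            }
            for (int t = 0; t < nthreads; t++) counter[t].v = 1;   /* re-arm       */
            for (int t = 0; t < nthreads; t++) sensor[t].v  = 0;   /* release all  */
        } else {
            while (sensor[my_id].v != 0)       /* spin on my own local sensor      */
                ;
        }
        sensor[my_id].v = 1;                   /* reset my sensor on the way out   */
    }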
4 Other Techniques and Performance Gain
We used the EPCC microbenchmarks [5] to show the performance gains coming from the optimization considerations we discussed in previous sections. The hardware systems we tested on are a 16-way 375 MHz POWER3™ and a 32-way 1.1 GHz POWER4™.
Fig. 6. Performance difference (parallel region overhead for 16 threads, in microseconds, comparing initialization at startup with initialization on demand on POWER3 and POWER4)
Fig. 7. POWER4 barrier overhead (barrier overhead in microseconds for 4 to 32 processors, comparing a primitive fetch-and-add barrier with the distributed-counter barrier)
Fig. 8. EPCC improvement (overhead in microseconds before and after the optimizations for the PAR, PDO, BAR, DO, SINGLE and REDUC tests)
The EPCC benchmarks have two parts, the synchronization suite and the schedule suite. Since we are concentrating on workshare implementation, we have focused on the synchronization benchmark. It includes separate test cases for parallel region, workshares (which are OMP DO, SINGLE and explicit barrier), and reductions. For a parallel region, as we show in Figure 6, the “on demand” initialization method introduced in section 3.3 reduces its overhead on both POWER3 and POWER4 systems. For a barrier operation, we compare a primitive fetch-and-add barrier with the one using algorithm 2 and the results are shown in Figure 7. The test was done on the POWER4 system. As the number of threads increases, the advantage of using distributed counters becomes more apparent.
In fact, the whole sub-suite improved significantly with the methods introduced in the previous sections. Again, the data in Figure 8 was collected on the POWER4 system. Note that the reduction case in this figure is special: aside from the improvements to the parallel region and the synchronization method, we also implemented a partial-sum mechanism for the reduction itself, which allows us to minimize the required synchronization.
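To illustrate the partial-sum idea at the user level (this is a generic sketch, not the runtime's internal code), each thread can accumulate into its own padded slot and the partial results can be combined once, under a single synchronization, instead of updating a shared result for every contribution:

    #include <omp.h>

    #define CACHE_LINE  128
    #define MAX_THREADS 256                    /* assumed upper bound on team size */

    typedef struct { double v; char pad[CACHE_LINE - sizeof(double)]; } padded_dbl;

    double reduce_sum(const double *a, int n)
    {
        static padded_dbl partial[MAX_THREADS];
        double sum = 0.0;

        #pragma omp parallel
        {
            int id = omp_get_thread_num();
            partial[id].v = 0.0;
            #pragma omp for
            for (int i = 0; i < n; i++)
                partial[id].v += a[i];         /* no atomics or critical sections  */
            /* the implicit barrier of the for loop guarantees all partial sums
             * are complete before one thread combines them                        */
            #pragma omp single
            for (int t = 0; t < omp_get_num_threads(); t++)
                sum += partial[t].v;
        }
        return sum;
    }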
5 Summary and Future Work
In this paper, we listed the important data structures needed in our runtime library to support workshares in the OpenMP standard, and the corresponding algorithms used to reduce the runtime overhead. We explained that by carefully choosing the length of the workshare control block queue, and by reusing the queue whenever a barrier synchronization occurs, we do not need to map control block structures to memory for each workshare instance. Also, by preparing the next workshare control block in advance, we need neither a separate locking phase to initialize a control block nor an initialization of the whole queue at startup, which saves parallel region overhead in most real application cases. Although these are just "small techniques", they have a significant impact on performance benchmarks for runtime libraries. We also introduced our view of an OpenMP workshare, and algorithms to implement barrier synchronization, which we treat as a special case of a workshare. All these are commercially available in our XL Fortran and VisualAge® C/C++ compilers. We are further improving the barrier performance by taking advantage of more accurate cacheline alignment of our internal data structures.
6 Trademarks and Copyright
AIX, IBM, POWER3, POWER4, and VisualAge are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. Other company, product and service names may be trademarks or service marks of others. © Copyright IBM Corp. 2004. All rights reserved.
References
1. OpenMP Architecture Review Board. OpenMP specification, FORTRAN version 2.0, 2000. http://www.openmp.org.
2. OpenMP Architecture Review Board. OpenMP specification, C/C++ version 2.0, 2002. http://www.openmp.org.
3. Michael Voss, editor. OpenMP Shared Memory Parallel Programming, International Workshop on OpenMP Applications and Tools, WOMPAT 2003, Toronto, Canada, June 26-27, 2003, Proceedings, volume 2716 of Lecture Notes in Computer Science. Springer, 2003.
4. Rohit Chandra et al. Parallel Programming in OpenMP. Morgan Kaufmann Publishers, 2001.
5. J. M. Bull. Measuring synchronization and scheduling overheads in OpenMP. In First European Workshop on OpenMP, October 1999.
6. Standard Performance Evaluation Corporation. SPECOMP2001 benchmarks. http://www.spec.org/omp2001.
7. John M. Mellor-Crummey and Michael L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Trans. on Computer Systems, 9(1):21–65, February 1991.
8. David R. Butenhof. Programming with POSIX Threads. Addison-Wesley, 1997.
9. Guansong Zhang et al. Busy-wait barrier synchronization with distributed counter and local sensor. In Michael Voss, editor, WOMPAT, volume 2716 of Lecture Notes in Computer Science, pages 84–99. Springer, 2003.
Efficient Implementation of OpenMP for Clusters with Implicit Data Distribution
Zhenying Liu¹, Lei Huang¹, Barbara Chapman¹, and Tien-Hsiung Weng²
¹ Department of Computer Science, University of Houston, {zliu, leihuang, chapman}@cs.uh.edu
² Department of Computer Science and Information Management, Providence University
This work was supported by the DOE under contract DE-FC03-01ER25502.
[email protected]

Abstract. This paper discusses an approach to implement OpenMP on clusters by translating it to Global Arrays (GA). The basic translation strategy from OpenMP to GA is described. GA requires a data distribution; we do not expect the user to supply this; rather, we show how we perform data distribution and work distribution according to OpenMP static loop scheduling. An inspector-executor strategy is employed for irregular applications in order to gather information on accesses to potentially non-local data, group non-local data transfers and overlap communications with local computations. Furthermore, a new directive, INVARIANT, is proposed to provide information about the dynamic scope of data access patterns. This directive can help us generate efficient code for irregular applications using the inspector-executor approach. Our experiments show promising results for the corresponding regular and irregular GA codes.
1 Introduction

A recent survey of hardware vendors [7] showed that they expect the trend toward use of clustered SMPs for HPC to continue; systems of this kind already dominate the Top500 list [20]. The importance of a parallel programming API for clusters that facilitates programmer productivity is increasingly recognized. Although MPI is a de facto standard for clusters, it is error-prone and too complex for non-experts. OpenMP is a programming model designed for shared memory systems that does emphasize usability, and we believe it can be extended to clusters as well. The traditional approach to implementing OpenMP on clusters is based upon translating it to software Distributed Shared Memory systems (DSMs), notably TreadMarks [9] and Omni/SCASH [18]. Unfortunately, the management of shared data based on pages incurs high overheads. Software DSMs perform expensive data transfers at explicit and implicit barriers of a program, and suffer from false sharing of data at page granularity. They typically impose constraints on the amount of shared memory that can be allocated, which effectively prevents their application to large problems.
The work in [6] translates OpenMP to a hybrid MPI+software DSM in order to overcome some of the associated performance problems. This is a difficult task, and the software DSM could still be a performance bottleneck. In contrast, we propose a translation from OpenMP to Global Arrays (GA) [10]. GA is a library that provides an asynchronous one-sided, virtual shared memory programming environment for clusters. A GA program consists of a collection of independently executing processes, each of which is able to access data declared to be shared without interfering with other processes. The translation requires the selection of a data distribution. This is a major challenge, since OpenMP lacks data distribution support. We also need to take data locality and load balancing problems into account when we decide how to distribute the computation to the GA processes. In this paper, we discuss a translation of OpenMP to GA that transparently uses the simple block-based data distributions provided by GA, and discuss additional user information needed for irregular access applications. The remainder of this paper is organized as follows. We first give an overview of the translation from OpenMP to GA. In section 3, we discuss language extensions and compiler strategies that are needed to implement OpenMP on clusters. Experiments, related work and conclusions are described in the subsequent sections.
2 Translation from OpenMP to GA In this section, we show our translation from OpenMP to GA [10], and describe the support for data distribution in GA. The appropriate work distributions are also discussed. 2.1 A Basic Translation to GA GA [15] was designed to simplify the programming methodology on distributed memory systems by providing a shared memory abstraction. It does so by providing routines that enable the user to specify and manage access to shared data structures, called global arrays, in a FORTRAN, C, C++ or Python program. GA permits the user to specify block-based data distributions for global arrays, corresponding to HPF BLOCK and GEN_BLOCK distributions which map the identical length chunk and arbitrary length chunks of data to processes respectively. Global arrays are accordingly mapped to the processors executing the code. Each GA process is able to independently and asynchronously access these distributed data structures via get or put routines. It is largely straightforward to translate OpenMP programs into GA programs because both have the concept of shared data and the GA library features match most OpenMP constructs. However, before the translation occurs, we may perform global privatization and transform OpenMP into the so-called SPMD style [13] in order to improve data locality. If each OpenMP thread consistently accesses a region of a shared array, the array may be privatized by creating a private data structure per thread, corresponding to the region it accesses. New shared data structures may need to be inserted to act as buffers, so that elements of the original shared array may be
exchanged between threads as necessary. Our translation [10] strategy follows OpenMP semantics. OpenMP threads correspond to GA processes and OpenMP shared data are translated to global arrays that are distributed among the GA processes; all variables in the GA code that are not global arrays are called private variables. OpenMP private variables will be translated to GA private variables as will some OpenMP shared data. OpenMP scalars and private variables are replicated to each GA process. Small or constant shared arrays in OpenMP will also be replicated; all other shared OpenMP arrays must be given a data distribution and they will be translated to global arrays. Shared pointers in OpenMP must also be distributed. If a pointer points to a shared array, we need to analyze whether the contents of the current pointer is within the local portion of the shared memory. Otherwise, get or put operations are required to fetch and store non-local data and the pointer will need to point to the local copy of the data. If a shared pointer is used directly, we need to substitute it by an array and distribute the array according to the loop schedule because GA does not support the distribution of C pointers. Our basic translation strategy assigns loop iterations to each process according to OpenMP static loop schedules. For this, we must calculate the iteration sets of the original OpenMP parallel loops for each thread. Furthermore, we compute the regions of shared arrays accessed by each OpenMP thread when performing its assigned iterations. After this analysis, we determine a block-based data distribution and insert the corresponding declarations of global arrays. The loop bounds of each parallel loop are modified so that each process will work on its local iterations. Elements of global arrays may only be accessed via get and put routines. Prior to computation, the required global array elements are gathered into local copies. If the local copies of global arrays are modified during the subsequent computation, they must be written back to their unique “global” location after the loop has completed. GA synchronization routines replace OpenMP synchronization to ensure that all computation as well as the communication to update global data have completed before work proceeds.
(a)
!$OMP PARALLEL SHARED(a)
      do k = 1, MAX
!$OMP DO
        do j = 1, SIZE_J
          do i = 1, SIZE
            a(i,j) = a(i,j) ...
          enddo
        enddo
!$OMP END DO
      enddo
!$OMP END PARALLEL

(b)
      OK = ga_create(MT_DBL, SIZE_X, SIZE_Y, 'A', SIZE_X, SIZE_Y/nproc, g_a)
      do k = 1, MAX
        ! compute new lower bound and upper bound for each process
        (new_low, new_upper) = ...
        ! compute the remote array region read for each thread
        (jlo, jhi) = ...
        call ga_get(g_a, 1, SIZE, jlo, jhi, a, ld)
        call ga_sync()
        do j = new_low, new_upper
          do i = 1, SIZE
            a(i,j) = a(i,j) ...
          enddo
        enddo
        ! compute remote array region written
        call ga_put(g_a, 1, SIZE, jlo, jhi, a, ld)
        call ga_sync()
      enddo

Fig. 1. (a) An OpenMP program and (b) the corresponding GA program
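The "(new_low, new_upper)" and "(jlo, jhi)" computations elided in Fig. 1(b) amount to a block partition of the iteration space among the processes. A C sketch of one way to compute such bounds is given below; the function and its 1-based convention are illustrative and are not part of GA or of the translator.

    /* Block partition of iterations 1..n among nproc processes (0-based ids).
     * On return, [*lo, *hi] is the (possibly empty) range owned by process me. */
    static void block_bounds(int n, int nproc, int me, int *lo, int *hi)
    {
        int chunk = (n + nproc - 1) / nproc;        /* ceiling division */
        *lo = me * chunk + 1;
        *hi = (*lo + chunk - 1 < n) ? *lo + chunk - 1 : n;
        if (*lo > n) { *lo = 1; *hi = 0; }          /* empty range      */
    }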
We show an example of an OpenMP program in Fig. 1(a) and its corresponding GA program in Fig. 1(b). The resulting code computes iteration sets based on the process ID. Here, array A has been given a block distribution in the second dimension, so that each processor is assigned a contiguous set of columns. Non-local elements of global array A in Fig. 1(b) are fetched using a get operation followed by synchronization. The loop bounds are replaced with local ones. Afterwards, the non-local array elements of A are put back via a put operation with synchronization. Unfortunately, the translation of synchronization constructs (CRITICAL, ATOMIC, and ORDERED) and sequential program sections (serial regions outside parallel regions, OpenMP SINGLE and MASTER) may become nontrivial. When translating into GA programs, we will insert synchronization among GA processes. GA has several features for synchronization; however, the translated codes may not be efficient.

2.2 Implementing Sequential Regions

We use several different strategies to translate the statements enclosed within a sequential region of OpenMP code including I/O operations, control flow constructs (IF, GOTO, and DO loops), procedure calls, and assignment statements. A straightforward translation of sequential sections would be to use exclusive master process execution, which is suitable for some constructs including I/O operations. Although parallel I/O is permitted in GA, it is a challenge to transform OpenMP sequential I/O into GA parallel I/O. The control flow in a sequential region must be executed by all the processes if the control flow constructs enclose or are enclosed by any parallel regions. Similarly, all the processes must execute a procedure call if the procedure contains parallel regions, either directly or indirectly. We categorize the different GA execution strategies for an assignment statement in sequential parts of an OpenMP program based on the properties of data involved:

1. If a statement writes to a variable that will be translated to a GA private variable, this statement is executed redundantly by each process in a GA program; each process may fetch the remote data that it will read before executing the statement. A redundant computation can remove the requirement of broadcasting results after updating a GA private variable.

2. If a statement writes to an element of an array that will be translated to a global array in GA (e.g. S[i]=...), this statement is executed by a single process. If possible, the process that owns the shared data performs the computation. The result needs to be put back to the "global" memory location.
Data dependences need to be maintained when a global array is read and written by different processes. Our strategy is to insert synchronization after each write operation to global arrays during the translation stage; at the code optimization stage, we may remove redundant get or put operations, and aggregate the communications for neighboring data if possible.
2.3 Data and Work Distribution in GA

GA only provides simple block-based data distributions and supplies features to make them as efficient as possible. There are no means for explicit data redistribution. GA's asynchronous one-sided communication transfers the required array elements, rather than pages of data, and it is optimized to support the transfer of sets of contiguous or strided data, which are typical for HPC applications. These provide performance benefits over software DSMs. With block distributions, it is easy to determine the location of an arbitrary array element. However, since these distributions may not provide maximum data locality, they may increase the amount of data that needs to be gathered and scattered before and after execution of a code region, respectively. In practice, this tends to work well if there is sufficient computation in such code regions. In our translation, GA only requires us to calculate the regions of global arrays that are read or written by a process to complete the communication; GA handles the other details. It is fast and easy for GA to compute the location of any global data element. We may optimize communication by minimizing the number of get/put operations and by grouping small messages into bigger ones. A user has to determine and specify the distribution of global data in a GA program; thus our translation process must decide on appropriate block-based data distributions when converting OpenMP programs to the corresponding GA ones. We determine data distributions for a GA program based upon the following simple rules:

1. If most loop index variables in those loop levels immediately enclosed by PARALLEL DO directives sweep over the same dimension of a shared array in an OpenMP program, we perform a one-dimensional distribution for the corresponding array in this dimension;

2. If different dimensions of a shared array are swept over almost evenly by parallel loops, we may perform a multi-dimensional distribution for this array;

3. If parallel loops always work on a subset of a shared array, we may distribute this shared array using a GEN_BLOCK distribution; otherwise, a BLOCK distribution may be deployed. In the former case, the working subset of the shared array is distributed evenly to each thread; the first and last thread will be assigned any remaining elements of the array at the start and end, respectively.
We believe that an interactive tool could collaborate with the user to improve this translation in many cases. One improvement is to perform data distribution based on the most time-consuming parallel loops. Statically, we may estimate the importance of loops. However, user information or profile results, even if based on a partial execution, is likely to prove much more reliable. For example, if an application contains many parallel loops, user information about which ones are the most time-consuming can help us determine the data distribution based upon these specified parallel loops only. We are exploring ways to automate the instrumentation and partial execution of a code with feedback directly to the compiler: such support might eliminate the need for additional sources of information.
Note that it is possible to implement all forms of OpenMP loop schedule including OpenMP static, dynamic and guided loop scheduling. OpenMP static loop scheduling distributes iterations evenly. When the iterations of a parallel loop have different amount of work, dynamic and guided loop scheduling can be deployed to balance the workload. We can perform the work assignment in GA corresponding to dynamic and guided loop scheduling; however, the equivalent GA program may have unacceptable overheads, as it may contain many get and put operations transferring small amounts of data. Other work distribution strategies need to be explored that take data locality and load balancing into account. In the case of irregular applications, it may be necessary to gather information on the global array elements needed by a process; whenever indirect accesses are made to a global array, the elements required in order to perform its set of loop iterations cannot be computed. Rather, a so-called inspector-executor strategy is needed to analyze the indirect references at run time and then fetch the data required. The resulting data sets need to be merged to minimize the number of required get operations. We enforce static scheduling and override the user-given scheduling for OpenMP parallel loops that include indirect accesses. The efficiency of the inspector-executor implementation is critical. In a GA program, each process can determine the location of data read/written independently and can fetch it asynchronously. This feature may substantially reduce the inspector overhead compared with a message passing program or with a paradigm that provides a broader set of data distributions. Our inspector-executor implementation distributes the loop iterations evenly to processes, assigning each process a contiguous chunk of loop iterations. Then each process independently executes an inspector loop to determine the global array elements (including local and remote data) needed for its iteration set. The asynchronous communication can be overlapped with local computations, if any. 2.4 Irregular Computation Case Study We studied one of the programs in the FIRE™ Benchmarks [1]. FIRE™ is a fully interactive fluid dynamics package for computing compressible and incompressible turbulent fluid. gccg is a parallelizable solver in the FIRE package that uses orthomin and diagonal scaling. gccg which contains irregular data accesses is explored in order to understand how to translate such codes to GA. We show the process of translation of OpenMP gccg program into the corresponding GA program. Fig. 2 displays the most time-consuming part of the gccg program. In our approach, we perform array region analysis to determine how to handle the shared arrays in OpenMP. Shared arrays BP, BS, BW, BL, BN, BE, BH, DIREC2 and LCC are privatized during the initial compiler optimization to improve locality of OpenMP codes, since each thread performs work only on an individual region of these shared arrays. In the subsequent translation, they will be replaced by GA private variables. In order to reduce the overhead of the conversion between global and local indices, we may preserve the global indices for the list of arrays above when declaring them and allocate the memory for array regions per process dynamically if the number of processes is not a constant. Shared array DIREC1 is distributed via global arrays according to the work distribution in the two parallel do loops in Fig. 2. 
A subset of array DIREC1 is swept by all the threads in the first parallel loop; the second parallel
loop accesses DIREC1 indirectly via LCC. We distribute DIREC1 using a GEN_BLOCK distribution according to the static loop schedule in the first parallel loop in order to maximize data locality, as there is no optimal data distribution for DIREC1 in the second loop. The array region DIREC1[NINTCI:NINTCF] is mapped evenly to the processes in order to balance the work. Since DIREC1 is declared as [1:N], the array regions [1:NINTCI] and [NINTCF:N] must be distributed as well. We distribute these two regions to process 0 and the last process, respectively, for contiguity. Therefore it is not an even distribution, and a GEN_BLOCK distribution is employed as shown in Fig. 3, assuming four processors are involved.

!$OMP PARALLEL
      DO I = 1, iter
!$OMP DO
      DO 10 NC=NINTCI,NINTCF
         DIREC1(NC)=DIREC1(NC)+RESVEC(NC)*CGUP(NC)
10    CONTINUE
!$OMP END DO
!$OMP DO
      DO 4 NC=NINTCI,NINTCF
         DIREC2(NC)=BP(NC)*DIREC1(NC)
     X            - BS(NC) * DIREC1(LCC(1,NC))
     X            - BW(NC) * DIREC1(LCC(4,NC))
     X            - BL(NC) * DIREC1(LCC(5,NC))
     X            - BN(NC) * DIREC1(LCC(3,NC))
     X            - BE(NC) * DIREC1(LCC(2,NC))
     X            - BH(NC) * DIREC1(LCC(6,NC))
4     CONTINUE
!$OMP END DO
      END DO
!$OMP END PARALLEL
Fig. 2. An OpenMP code segment in gccg with irregular data accesses
Fig. 3. GEN_BLOCK distribution for array DIREC1 (the index range 1..N is divided among processes P0-P3: the region [NINTCI, NINTCF] is split evenly, with P0 additionally holding [1, NINTCI] and P3 additionally holding [NINTCF, N])
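As a sketch of how such a GEN_BLOCK map could be built (map[p] being the first global index owned by process p, the form an irregular-distribution creation routine would typically consume), consider the following C fragment; the names and the exact map format are illustrative, and it assumes nproc does not exceed NINTCF-NINTCI+1.

    void genblock_map(int nintci, int nintcf, int nproc, int map[])
    {
        int work  = nintcf - nintci + 1;
        int chunk = work / nproc;
        int rem   = work % nproc;
        int start = nintci;

        for (int p = 0; p < nproc; p++) {
            /* process 0 also absorbs the leading region [1, NINTCI-1] */
            map[p] = (p == 0) ? 1 : start;
            start += chunk + (p < rem ? 1 : 0);
        }
        /* the last process implicitly owns everything up to N,
         * including the trailing region [NINTCF+1, N] */
    }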
As before, we perform work distribution according to OpenMP loop scheduling. In the first loop of Fig. 2, all data accesses are local: loop iterations are divided evenly among the threads and the data are referenced contiguously. If a thread reads or writes non-local data in a parallel loop, array region analysis enables us to calculate the contiguous non-local data for regular accesses, and we can fetch all the non-local data in one communication before they are read or written. Fortunately, no communication is needed for the first loop in Fig. 2, since all its accesses are local. But in the second loop of Fig. 2, some data accesses are indirect and
thus the actual data references are unknown at compile time. Therefore we cannot generate efficient communication based upon static compiler analysis, and at least one communication per iteration would have to be inserted, which would incur very high overheads. Hence, an inspector-executor strategy [11] is employed to overcome this. The inspector-executor approach [17] generates an extra inspector loop preceding the actual computational loop. Our inspector is a parallel loop, as shown in Fig. 4. We detect the values of each indirection array in the allocated iterations of each GA process. We use a hash table to save the indices of non-local accesses and generate a list of communications for remote array regions. Each element in the hash table represents a region of a global array, which is the minimum unit of communication. Using a hash table removes the duplicated data communications that would otherwise arise when indirect array accesses from different iterations refer to the same array element. We need to choose the optimal size of the global array region represented by a hash table element. This will depend on the size of the global array, the data access patterns and the number of processes, and needs to be further explored. The smaller the array regions, the more small communications are generated; but if we choose a larger array region, the generated communication may include more unnecessary non-local data. Another task of the inspector is to determine which iterations access only local data, so that we may overlap non-local data communication with local data computation.

   DO iteration = local_low, local_high
      IF (this iteration contains non-local data) THEN
         Store the indices of non-local array elements into a hash table
         Save the current iteration number in a non-local list
      ELSE
         Save the current iteration number in a local list
      ENDIF
   ENDDO
   Merge the contiguous communications recorded in the hash table

Fig. 4. An inspector pseudo-code
One optimization of the inspector used in our experiment is to merge neighboring regions into one large region in order to reduce the number of communications. The inspector loop only needs to be performed once during execution of the gccg program, since the indirection array remains unmodified throughout the program. Our inspector is lightweight because: 1) the location of global arrays is easy to compute in GA due to the simplicity of GA's data distributions; 2) the hash table approach enables us to identify and eliminate redundant communications; 3) all the computations of the inspector are carried out independently by each process. These factors imply that the overheads of this approach are much lower than is the case in other contexts, and that it may be viable even when data access patterns change frequently, as in adaptive applications. For example, an inspector implemented using MPI is less efficient than our approach, as each process has to generate communications for both sending and receiving, which rely on the intervention of other processes.
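To make the region-based bookkeeping concrete, the following self-contained C sketch records each accessed non-local region at most once; the region size, the table size and all names are illustrative, the local/non-local iteration lists are not built here, and the merging of contiguous regions is omitted.

    #include <stdbool.h>
    #include <string.h>

    #define REGION 64              /* global-array elements per communication unit */
    #define TABLE  1024            /* hash-table slots (assumed large enough)      */

    typedef struct { bool used; int region; } slot_t;

    static void record_region(slot_t table[], int global_index)
    {
        int region = global_index / REGION;
        int h = (unsigned)region % TABLE;
        while (table[h].used && table[h].region != region)
            h = (h + 1) % TABLE;                   /* linear probing               */
        if (!table[h].used) {                      /* first access to this region  */
            table[h].used = true;
            table[h].region = region;
        }
    }

    /* Inspector loop: my_lo..my_hi are this process's iterations, lcc is the
     * indirection array, and [local_lo, local_hi] is the locally owned block.     */
    static void inspect(const int *lcc, int my_lo, int my_hi,
                        int local_lo, int local_hi, slot_t table[])
    {
        memset(table, 0, TABLE * sizeof(slot_t));
        for (int nc = my_lo; nc <= my_hi; nc++) {
            int idx = lcc[nc];
            if (idx < local_lo || idx > local_hi)
                record_region(table, idx);         /* remember a non-local region  */
        }
        /* contiguous regions left in the table would then be merged into
         * larger get operations (not shown)                                        */
    }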
The executor shown in Fig. 5 performs the computation in a parallel loop following the iteration order generated by the inspector. It prefetches non-local data via non-blocking communication, here using the non-blocking get operation ga_nbget() in GA. Simultaneously, the iterations that do not need any non-local data are executed, so that they are performed concurrently with the communication. ga_nbwait() is used to ensure that the non-local data is available before we perform the corresponding computation. We need to optimize the GA codes translated under this strategy. In particular, GA get and put operations created before and after each parallel construct may be redundant. If a previous put includes the data of a subsequent get operation, we may remove this get operation; if the data in a put operation contains the content of a following put operation, the latter put operation may be eliminated. It is advantageous to move get and put operations as early as possible in order to make data available. We intend to develop array region analysis and parallel control flow analysis [3] for parallel programs in order to provide the context-sensitive array region communication information.

   ! non-local data gathering
   call ga_nbget(...)
   DO iteration1 = 1, number_of_local_data
      Obtain the iteration number from the local list
      Perform the local computation
   ENDDO
   ! wait until the non-local data is gathered
   call ga_nbwait()
   DO iteration2 = 1, number_of_nonlocal_data
      Obtain the iteration number from the non-local list
      Perform the computation using non-local data
   ENDDO

Fig. 5. An executor pseudo-code
3 OpenMP Extensions for Data and Work Distributions

We do not attempt to introduce data distribution extensions to OpenMP. Data distribution extensions contradict the OpenMP philosophy in two ways: first, in OpenMP the work assignment or loop schedule decides which portion of data is needed per thread, and second, the most useful data distributions are those that assign data to processes elementwise, and these would require rules for argument passing at procedure boundaries and more. This would create an added burden for the user, who would have to determine data distributions in procedures that may sometimes be executed in parallel and sometimes sequentially, along with a variety of other problems. In contrast to the intended simplicity of OpenMP, existing data distribution ideas for OpenMP are borrowed from HPF and are quite complicated. For example, TEMPLATE and ALIGN are intricate concepts for non-experts. Providing equivalent C/C++ and Fortran OpenMP syntax is a goal of OpenMP; it is hard to make data distributions work for C pointers, although distributing Fortran or C arrays is achievable.
Finally, data distribution does not make sense for SMP systems, which are currently the main target of this API. We rely on OpenMP static loop scheduling to decide data and work distribution for a given program. We determine a data distribution for those shared objects that will be defined as global arrays by examining the array regions that will be accessed by the executing threads. Data locality and load balancing are two major concerns for work distribution. Ensuring data locality may increase load imbalance, and vice versa. Our optimization strategy gives data locality higher priority than load balancing. We may need additional user information to minimize the number of times the inspector loop is executed. If a data access pattern in a certain code region has not changed, it is not necessary to perform the expensive inspector calculations. Both runtime and language approaches have been proposed to optimize the inspector-executor paradigm in a given context. One of the approaches that we are experimenting with is the use of INVARIANT and END INVARIANT directives to inform compilers of the dynamic scope of an invariant data access pattern. It is noncompliant to branch into or out of the code region enclosed by INVARIANT directives. The loop bounds of parallel loops and the indirection arrays are candidate variables to be declared in INVARIANT directives, if they indeed do not change. For example, the loop bounds NINTCI and NINTCF and the indirection array LCC are declared as INVARIANT in Fig. 6. Therefore, the communications generated by an inspector loop can be reused.
!$OMP INVARIANT ( NINTCI:NINTCF, LCC )
!$OMP PARALLEL
      DO I = 1, iter
      ......
!$OMP DO
      DO 4 NC=NINTCI,NINTCF
         DIREC2(NC)=BP(NC)*DIREC1(NC)
     X            - BS(NC) * DIREC1(LCC(1,NC))
     X            - BW(NC) * DIREC1(LCC(4,NC))
     X            - BL(NC) * DIREC1(LCC(5,NC))
     X            - BN(NC) * DIREC1(LCC(3,NC))
     X            - BE(NC) * DIREC1(LCC(2,NC))
     X            - BH(NC) * DIREC1(LCC(6,NC))
4     CONTINUE
!$OMP END DO
      END DO
!$OMP END PARALLEL
!$OMP END INVARIANT

Fig. 6. A code segment of gccg with INVARIANT directives
In a more complex case, an application may have several data access patterns and use each of them periodically. We are exploring language extensions proposed for HPF to give each pattern a name so that the access patterns can be identified and reused, including using them to determine the required communication for different loops if these loops have the same data access pattern.
4 Experiments

Our initial experiments on translating regular, small OpenMP codes to the corresponding GA ones achieved encouraging results on a UH Itanium2 cluster and a NERSC IBM SP RS/6000 cluster and were reported in [10]. The UH Itanium2 cluster has twenty-four 2-way SMP nodes and a single 4-way SMP node at the University of Houston; each of the 24 nodes has two 900 MHz CPUs and 4 GB memory. The Scali interconnect has a system bus bandwidth of 6.4 GB/s and a memory bandwidth of 12.8 GB/s. The NERSC IBM SP RS/6000 cluster is composed of 380 nodes, each of which consists of sixteen 375 MHz POWER 3+ CPUs and 16 GB to 64 GB memory. These nodes are connected to an IBM "Colony" high-speed switch via two "GX Bus Colony" network adapters. OpenMP programs can be run on a maximum of 4 processors of the UH cluster and 16 processors of the NERSC IBM cluster due to their SMP configuration.
Fig. 7. The performance of a Jacobi OpenMP program and its corresponding GA program (speedup on the UH Itanium2 cluster for up to 40 processors and on the NERSC IBM cluster for up to 56 processors)
Fig. 8. The performance of the OpenMP LBE program and the corresponding GA program (execution time in seconds and speedup for up to 40 processors)
We employ an implicit block data distribution for two regular OpenMP codes (Jacobi and LBE [8]) according to OpenMP loop scheduling, and our experiments show that this is feasible. Fig. 7 displays the performance of the well-known Jacobi code with a 1152 by 1152 matrix on these two clusters. Both the Jacobi OpenMP program and the corresponding GA program achieved a linear speedup because of the data locality inherent in the Jacobi solver. Fig. 8 displays the performance of the LBE OpenMP program and its corresponding GA program with a 1024 by 1024 matrix on the NERSC IBM cluster. LBE is a computational fluid dynamics code that solves the Lattice Boltzmann equation. The numerical solver employed by this code uses a 9-point stencil. Unlike the Jacobi solver, the neighboring elements are updated at each iteration. Therefore, the performance of the LBE programs is lower than that of the Jacobi programs due to the writes to non-local global arrays. Note that our LBE program in GA was optimized to remove a large amount of synchronization; otherwise, the performance does not scale. We experiment with the FIRE benchmarks as examples of irregular codes. We use gccg (see also section 2.4) to explore the efficiency of our simple data distribution, work distribution and inspector-executor strategies. Fig. 9 depicts the performance of the OpenMP and corresponding GA programs for gccg on the NERSC IBM SP cluster. The performance of the OpenMP version is slightly better than that of the corresponding GA program within one node, but the GA program with a large input data set achieves a speedup of 26 with 64 processors in 4 nodes. The reason for this performance gain is that the inspector is only computed once and reused throughout the program, and the communication and computation are well overlapped.
Fig. 9. The performance of a gccg OpenMP program and its corresponding GA program (execution time in seconds and speedup for up to 64 processes)
5 Related Work A variety of data and work distribution extensions have been proposed for OpenMP in the literature and the syntax and semantics of these differ. Both SGI [19] and Compaq [2] support DISTRIBUTE and REDISTRIBUTE directives with BLOCK, CYCLIC and * (no distribution) options for each dimension of an array, in page or element
granularity. However, their syntax varies. In the Compaq extensions, ALIGN, MEMORIES and TEMPLATE directives are borrowed from HPF and further complicate OpenMP. Both SGI and Compaq also supply directives to associate computations with locations of data storage. Compaq’s ON HOME directives and SGI’s DATA AFFINITY directives indicate that the iterations of a parallel loop are to be assigned to threads according to the distribution of the data. SGI also provides THREAD affinity. The Portland Group Inc. has proposed data distribution extensions for OpenMP including the above and GEN_BLOCK, along with ON HOME directives for clusters of SMP systems [14]. INDIRECT directives were proposed in [11-12] for irregular applications. The idea is to create inspectors to detect data dependencies at runtime and to generate executors to remove the unnecessary synchronization; the overhead of the inspector can be amortized if a SCHEDULE reuse directive [11-12] is present. We transparently determine a data distribution which fits into the OpenMP philosophy and search for ways to enable good performance even if this distribution is suboptimal. GA helps in this respect by providing efficient asynchronous communication. The simplicity of the data distributions provided by GA implies that the calculation of the location of shared data is easy. In particular, we do not need to maintain a replicated or distributed translation table consisting of the home processors and local index of each array element, as was needed in the complex parallel partitioner approach taken in the CHAOS runtime library for distributed memory systems [4]. Our strategy for exploiting the inspector-executor paradigm also differs from the previous work, since we are able to fully parallelize the inspector loops and can frequently overlap the resulting communication with computation. Also, the proposed INVARIANT extension for irregular codes is intuitive for users and informative for the compiler implementation. Our approach of implementing OpenMP for distributed memory systems has a number of features in common with approaches that translate OpenMP to Software DSMs [9, 18] or Software DSM plus MPI [6] for cluster execution. All of these methods need to recognize the data that has to be distributed across the system, and must adopt a strategy for doing so. Also, the work distribution for parallel and sequential regions has to be implemented, whereby it is typically the latter that leads to problems. Note that it is particularly helpful to perform an SPMD privatization of OpenMP shared arrays before translating codes via any strategy for cluster execution due to the inherent benefits of reducing the size and number of shared data structures and obtaining a large fraction of references to (local) private variables. On the other hand, our translation to GA is distinct from other approaches in that ours promises higher levels of efficiency via the construction of precise communication sets. The difficulty of the translation itself lies somewhere between the translation to MPI and the translation to Software DSMs. First, the shared memory abstraction is supported by GA and Software DSMs, but is not present in MPI. It enables a consistent view of variables and a non-local datum is accessible if given a global index. In contrast, only the local portion of the original data can be seen by each process in MPI. Therefore manipulating non-local variables in MPI is inefficient since the owner process and the local indices of arrays have to be calculated. 
Furthermore, our GA approach is portable, scalable and does not have the limitation of the shared memory
spaces. An everything-shared SDSM is presented in [4] to overcome the relaxation of the coherence semantic and the limitation of the shared areas in other SDSMs. It does solve the commonly existing portability problem in SDSMs by using an OpenMP runtime approach, but it is hard for such a SDSM to scale with sequential consistency. Second, the non-blocking and blocking one-sided communication mechanisms offered in GA allow for flexible and efficient programming. In MPI-1, both sender process and receiver process must be involved in the communication. Care must be taken with the ordering of communications in order to avoid deadlocks. Instead, get and/or put operations can be handled within a single process in GA. A remote datum can be fetched even if we do not know the owner and local index of that datum. Third, extra messages may occur in Software DSMs due to the fact that data is brought in at a page granularity. GA is able to efficiently transfer sets of contiguous or strided data, which avoids the overheads of the transfers at page granularity of Software DSMs. Besides, for the latter, the different processes have to synchronize when merging their modifications to the same page, even if those processes write to distinct locations of that page. In contrast, extra messages and synchronization are not necessary in our GA translation scheme. Our GA approach relies on compiler analyses to obtain precise information on the array regions accessed; otherwise, conservative synchronization is inserted to protect the accesses to global arrays.
6 Conclusions and Future Work Clusters are increasingly used as compute platforms for a growing variety of applications. It is important to extend OpenMP to clusters, since for many users a relatively simple programming paradigm is essential. However, it is still an open issue whether language features need to be added to OpenMP for cluster execution, and if so, which features are most useful. Any such decision must consider programming practice, the need to retain relative simplicity in the OpenMP API, and must take C pointers into consideration. This paper describes a novel approach to implementing OpenMP on clusters by translating OpenMP codes to equivalent GA ones. This approach has the benefit of being relatively straightforward. We show the feasibility and efficiency of our translation. Since part of the task of this conversion is to determine suitable work and data distributions for the code and its data objects, we analyze the array access patterns in a program’s parallel loops and use them to obtain a block-based data distribution for each shared array in the OpenMP program. The data distribution is thus based upon the OpenMP loop schedule. This strategy has the advantage of relative simplicity together with reasonable performance, without adding complexity to OpenMP for both SMP and non-SMP systems. Some optimizations are performed to improve the data locality of the resulting code and to otherwise reduce the cost of communications. Although load imbalance may sometimes be increased, in general the need for data locality is greater. The inspector-executor approach is applied to enable us to handle indirect accesses to shared data in irregular OpenMP codes, since we are unable to determine the re-
quired accesses to global arrays at compile time. Furthermore we propose a new directive called INVARIANT to specify the dynamic scope of a program region in which a data access pattern remains invariant, so as to take advantage of the accesses calculated by an inspector loop. In future, we plan to provide a solid implementation of our translation from OpenMP to GA in the Open64 compiler [16], an open source compiler that supports OpenMP and which we have enhanced in a number of ways already. The rich set of analyses and optimizations in Open64 may help us create efficient GA codes.
Acknowledgements

We are grateful to our colleagues in the DOE Programming Models project, especially Ricky Kendall who helped us understand GA and discussed the translation from OpenMP to GA with us.
References
1. Bachler, G., Greimel, R.: "Parallel CFD in the Industrial Environment," Unicom Seminars, London, 1994.
2. Bircsak, J., Craig, P., Crowell, R., Cvetanovic, Z., Harris, J., Nelson, C. A. and Offner, C. D.: "Extending OpenMP for NUMA Machines," Scientific Programming, 8(3), 2000.
3. Chakrabarti, S., Gupta, M. and Choi, J.-D.: "Global Communication Analysis and Optimization," SIGPLAN Conference on Programming Language Design and Implementation, pp. 68-78, 1996.
4. Costa, J. J., Cortes, T., Martorell, X., Ayguade, E., and Labarta, J.: "Running OpenMP Applications Efficiently on an Everything-Shared SDSM," Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS '04), IEEE, 2004.
5. Das, R., Uysal, M., Saltz, J., and Hwang, Y.-S.: "Communication Optimizations for Irregular Scientific Computations on Distributed Memory Architectures," Journal of Parallel and Distributed Computing, 22(3): 462-479, September 1994.
6. Eigenmann, R., et al.: "Is OpenMP for Grids?" Workshop on Next-Generation Systems, Int'l Parallel and Distributed Processing Symposium (IPDPS'02), May 2002.
7. Fagerström J., Faxen, Münger P., Ynnerman A. and Desplat J-C.: "High Performance Computing Development for the Next Decade, and its Implications for Molecular Modeling Applications," http://www.enacts.org/hpcroadmap.pdf. Daily News and Information for the Global Grid Community, vol. 1, no. 20, October 28, 2002.
8. He, X., and Luo, L.-S.: "Theory of the Lattice Boltzmann Method: From the Boltzmann Equation to the Lattice Boltzmann Equation," Phys. Rev. Lett. E, 6(56), pp. 6811, 1997.
9. Hu, Y. C., Lu, H., Cox, A. L., and Zwaenepoel, W.: "OpenMP for Networks of SMPs," Journal of Parallel and Distributed Computing, vol. 60, pp. 1512-1530, 2000.
10. Huang, L., Chapman, B. and Kendall, R.: "OpenMP for Clusters," Proceedings of the Fifth European Workshop on OpenMP (EWOMP'03), Aachen, Germany, September 22-26, 2003.
11. Hwang, Y.-S., Moon, B., Sharma, S. D., Ponnusamy, R., Das, R. and Saltz, J. H.: "Runtime and Language Support for Compiling Adaptive Irregular Problems on Distributed Memory Machines," Software Practice and Experience, 25(6): 597-621, June 1995.
12. Labarta, J., Ayguadé, E., Oliver, J. and Henty, D.: "New OpenMP Directives for Irregular Data Access Loops," 2nd European Workshop on OpenMP (EWOMP'00), Edinburgh, UK, September 2000.
13. Liu, Z., Chapman, B. M., Weng, T.-H., Hernandez, O.: "Improving the Performance of OpenMP by Array Privatization," WOMPAT 2002, pp. 244-259, 2002.
14. Merlin, J.: "Distributed OpenMP: Extensions to OpenMP for SMP Clusters," 2nd European Workshop on OpenMP (EWOMP'00), Edinburgh, UK, September 2000.
15. Nieplocha, J., Harrison, R. J., and Littlefield, R. J.: "Global Arrays: A Non-Uniform Memory Access Programming Model for High-Performance Computers," The Journal of Supercomputing, 10, pp. 197-220, 1996.
16. Open64 Compiler Tools. http://open64.sourceforge.net/
17. Saltz, J., Berryman, H., Wu, J.: "Multiprocessors and Run-Time Compilation," Concurrency: Practice and Experience, 3(6): 573-592, December 1991.
18. Sato, M., Harada, H., Hasegawa, A., and Ishikawa, Y.: "Cluster-Enabled OpenMP: An OpenMP Compiler for the SCASH Software Distributed Shared Memory System," Scientific Programming, Special Issue: OpenMP, 9(2-3), pp. 123-130, 2001.
19. Silicon Graphics Inc.: "MIPSpro 7 FORTRAN 90 Commands and Directives Reference Manual," Chapter 5: Parallel Processing on Origin Series Systems. Documentation number 007-3696-003. http://techpubs.sgi.com.
20. Top 500 Supercomputer Sites. http://www.top500.org/.
Runtime Adjustment of Parallel Nested Loops
Alejandro Duran¹, Raúl Silvera², Julita Corbalán¹, and Jesús Labarta¹
¹ CEPBA-IBM Research Institute, Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya, Jordi Girona 1-3, Barcelona, Spain, {aduran, juli, jesus}@ac.upc.es
² IBM Toronto Lab, 8200 Warden Ave, Markham, ON, L6G 1C7, Canada, [email protected]

Abstract. OpenMP allows programmers to specify nested parallelism in parallel applications. In the case of scientific applications, parallel loops are the most important source of parallelism. In this paper we present an automatic mechanism to dynamically detect the best way to exploit the parallelism when nested parallel loops are present. This mechanism is based on the number of threads, the problem size, and the number of iterations of the loop. To do that, we claim that programmers must specify the potential application parallelism and give the runtime the responsibility to decide the best way to exploit it. We have implemented this mechanism inside the IBM XL runtime library. Evaluation shows that our mechanism dynamically adapts the parallelism generated to the application and runtime parameters, reaching the same speedup as the best static parallelization (with a priori information).
1 Introduction
OpenMP [1] is the standard programming model in shared-memory multiprocessor systems. This programming model provides programmers with a set of directives to explicitly define parallel regions in applications. These directives are translated by the compiler that generates code for the runtime parallel library. The runtime parallel library is in charge of managing the application parallelism. It can try to optimize decisions that can only be resolved at runtime such as the number of processors used to execute a parallel region or the scheduling of a parallel loop. In the case of numerical applications, loops are the most important source of parallelism. The typical structure of these parallel loops is several nested loops that cover several matrix dimensions (i,j,k,...). Often, these inner loops are also parallel. A typical practice of parallel programmers is to insert the parallel directive in only one loop, usually the outer loop, serializing the rest of the nested loops. This approach is sufficient if the number of iterations of the outer loop is high and the iteration's granularity is coarse enough. However, in some cases the number of iterations of the outer parallel loop is not enough to distribute between the available number of processors, even though the total computational work is very high.
With this behavior, programmers decide the loop parallelization a priori, ignoring runtime characteristics such as the number of threads available at the moment of opening parallelism. In this paper we claim that programmers must use OpenMP directives just to specify parallelism, annotating all of the application parallelism, and give the runtime library the responsibility of selecting the best way to exploit the available parallelism. This would be done as a function of the application characteristics and resource availability. Instead of exploiting either the outer or the inner loop, we propose to apply a mixed approach, opening parallelism in several iterations of the outer loop and exploiting the parallelism of the inner loop in the rest of the iterations. The runtime gathers information about the parallel loops and decides whether to open parallelism at the outer loop, at the inner one, or to run some iterations of the outer loop in parallel and some iterations of the inner loop in parallel. This last scheme minimizes the overheads by executing as much as possible at the outer level while still obtaining the granularity needed for a good load balance. We have modified the IBM XL runtime library to include the proposed mechanism to dynamically decide the most convenient way to exploit parallelism in nested parallel loops. It uses simple heuristics based on the number of iterations of the loops and the number of threads available. We have executed several applications and measured the speedup achieved with fixed parallelizations (outer, inner, and nested), compared with the speedup achieved with our proposal. Results show that applications executed with the new runtime reach the same or even better speedup than with the best static (a priori) parallelization. The rest of the paper is organized as follows: section 2 discusses work related to our proposal. Section 3 describes the available ways to exploit parallel nested loops and the proposed semantics. Section 4 describes our runtime mechanism to choose between the different options. Section 5 presents an evaluation of the proposal on several applications, and section 6 concludes the paper and discusses future lines of research.
2 Related Work
One important way to reduce the overhead associated with parallelization is choosing an appropriate granularity, as shown by Chen et al. [2]. Harrison et al. [3] proposed to dynamically select between a serial and a parallel version of a loop based on the measured granularity, using a threshold to decide whether the sequential or the parallel version is run. Moreira et al. [4] proposed a dynamic granularity control mechanism that tries to find the best granularity in a hierarchical task graph. Jin et al. [5] show the different overheads of parallelizing the outer or the inner loop of a Cloud Modeling Code. Nested parallelism is an active area of research in the OpenMP community. Gonzalez et al. [6] discussed some extensions to the OpenMP standard to create teams of threads for nested execution. Another approach to nested parallelism was presented by Shah et al. [7], using work queues for nested parallel dynamic work. Different studies have been
Several studies have shown the benefit of using nested parallelism, mainly on large SMP machines [8, 9, 10, 11, 12]. In [13] the authors propose to convert parallel calls into parallel-ready sequential calls and to execute the excess parallelism as sequential code. In our work, we also exploit the idea of treating the programmer's specification as a hint, covering not only the potential parallelism but also the way the application should exploit it: how many levels of parallelism to use, which level to exploit, and the thread distribution inside them. In [14], the authors propose a forall statement to unify the existing constructions for specifying parallelism; at present, the OpenMP standard unifies the specification of parallelism in a set of directives. That paper focuses on the semantics of one level of parallelism, whereas our work focuses on multiple levels of parallelism and on how to exploit them efficiently as a function of the application characteristics and resource availability.
3 Nested Loops Parallelization
When a programmer finds a nested loop such as the one shown in Figure 1, he has to decide how to parallelize it. The current OpenMP standard supports three different approaches, obtained by simply introducing a !$omp parallel do directive in the appropriate place(s). The most common option is to parallelize only the outer loop. This is a good choice if the number of iterations of the outer loop is large enough, compared to the number of available processors, to reach a good work distribution. However, if the iterations are computationally intensive and there are few of them, this option can result in heavy load imbalance. The second option is to parallelize only the inner loop. This potentially provides smaller pieces of work that can be distributed evenly among the available threads, but it incurs more overhead due to work distribution and synchronization between threads. These overheads can become prohibitive if the loop granularity is too fine. A third option is to parallelize both loops using nested parallelism. This choice suffers from the same problems as parallelizing the inner loop, plus the overhead of distributing the iterations of the outer loop, but it has been shown to be a good choice for systems with a large number of processors [8].
do i = 1,N
   do j = 1,M
      Compute_1()
   end do
end do

Fig. 1. Simple nested loops
!$omp parallel do
do i = 1,N
!$omp parallel do
   do j = 1,M
      Compute_1()
   end do
end do

Fig. 2. Proposed parallelization for nested loops
Fig. 3. Mixed parallelization approach for two nested loops (with four and three iterations, respectively) and three threads: threads 0, 1 and 2 each execute one outer iteration (i = 0, 1, 2) together with all its inner iterations (j = 0..2), and then share the inner iterations (j = 0, 1, 2) of the remaining outer iteration i = 3
The last possible configuration is a mixed approach, which cannot be explicitly specified with the OpenMP programming model. The idea is to parallelize the maximum number of iterations of the outer loop that does not lead to load imbalance, and then execute the remaining iterations sequentially while parallelizing their inner loop, so as to provide enough work to balance the overall execution (see Figure 3). Since this scheme is not supported by the standard, it has to be coded manually in the source code, which goes against the idea of simplicity and portability behind the OpenMP standard.

In this paper, we claim that parallel programmers should use OpenMP directives to declare the application parallelism (as shown in Figure 2) and should not decide or impose which level of parallelism is exploited. As we have seen, this decision can be difficult because it depends on many parameters that may only be known at run time. What we propose is to dynamically classify nested parallel loops and decide which configuration is most suitable depending on loop granularity, number of iterations and number of available threads. Depending on these three conditions, the runtime parallelizes the outer loop, the inner loop, or both, in a nested or a mixed approach. The classification and configuration are done automatically by the runtime, transparently to the programmer. By giving this responsibility to the runtime (which is already possible through the OMP_DYNAMIC environment variable), applications become more portable, as the runtime adjusts them to the specific environment where they are executed. Applications will also benefit from future optimizations of the runtime parallelization mechanisms without having to be modified.
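As an illustration, the following is a minimal sketch, not taken from the paper, of how the mixed scheme of Figure 3 could be coded by hand for the loop nest of Figure 1, assuming iterations of equal weight; the subroutine name and the use of omp_get_max_threads() are our own choices.

subroutine mixed_loop(N, M)
  use omp_lib                        ! for omp_get_max_threads()
  implicit none
  integer, intent(in) :: N, M
  integer :: i, j, nthreads, nouter

  nthreads = omp_get_max_threads()
  ! Largest multiple of nthreads that fits in the outer loop: these
  ! iterations are perfectly balanced when parallelized at the outer level.
  nouter = (N / nthreads) * nthreads

!$omp parallel do private(j)
  do i = 1, nouter
     do j = 1, M
        call Compute_1()             ! loop body from Figure 1
     end do
  end do
!$omp end parallel do

  ! Remaining outer iterations: execute them serially, but balance their
  ! work across threads by parallelizing the inner loop instead.
  do i = nouter + 1, N
!$omp parallel do
     do j = 1, M
        call Compute_1()
     end do
!$omp end parallel do
  end do
end subroutine mixed_loop

The first loop exploits the outer level for as many iterations as divide evenly among the threads; the leftover iterations fall back to inner-level parallelism, mirroring the schedule of Figure 3. Having to write this by hand is precisely the burden the proposed runtime mechanism removes.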
4 Runtime Support
To decide the kind of parallelism (outer, inner, mixed or nested) to be used for a given loop, the runtime has to (1) discover the structure of the application's parallel loops and their characteristics, and (2) use that information to decide how to proceed and which iterations are assigned to each level in a mixed approach. We have implemented a mechanism for both tasks inside IBM's XL runtime library for OpenMP. The mechanism is fairly general and holds for multiple levels of nesting. Our prototype implementation works for iterations of the same weight; to extend it to any kind of loop, some other runtime balancing technique, such as the Adjust schedule [15], should be integrated.
4.1 Discovering the Application's Structure
Each time the application wants to start a parallel loop, it calls the appropriate runtime service, passing several parameters including the address of the outlined function that contains the parallelized code. This address is used to identify the parallel loop, as each loop has a different outlined function. For each loop, the runtime allocates a descriptor where the relevant information is stored.
A stack of the calls to the runtime for opening parallelism is also maintained. When a new loop is encountered and the stack is not empty, the loop is linked to the loop at the top of the stack. In this way, a tree is obtained that represents the call paths of the parallel loops in the application. Although compiler analysis could also be used to detect the application's structure, we did not use it because we wanted to push runtime detection to its limit and because static analysis cannot always detect nested loops.
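The following is a minimal sketch, with hypothetical module, type and field names, of what such loop descriptors and the stack of open parallel calls might look like; it is not the actual XL runtime data structure.

module loop_registry
  use iso_c_binding, only: c_funptr
  implicit none

  ! One descriptor per distinct parallel loop, identified by the
  ! address of its outlined function.
  type loop_desc
     type(c_funptr)           :: outlined_fn
     integer                  :: niters    = 0       ! iterations seen last time
     logical                  :: has_inner = .false. ! a nested parallel loop was found
     type(loop_desc), pointer :: parent    => null() ! enclosing loop, if any
  end type loop_desc

contains

  ! Called when a parallel loop is opened: if another loop is already open
  ! (top of the stack), link the new loop under it, building the tree of
  ! parallel loops, and mark the enclosing loop as having inner parallelism.
  subroutine push_loop(desc, stack_top)
    type(loop_desc), target,  intent(inout) :: desc
    type(loop_desc), pointer, intent(inout) :: stack_top
    if (associated(stack_top)) then
       desc%parent => stack_top
       stack_top%has_inner = .true.
    end if
    stack_top => desc
  end subroutine push_loop

  ! Called when the loop finishes: restore the previous stack top.
  subroutine pop_loop(stack_top)
    type(loop_desc), pointer, intent(inout) :: stack_top
    if (associated(stack_top)) stack_top => stack_top%parent
  end subroutine pop_loop

end module loop_registry

Each distinct outlined function gets one descriptor; pushing a loop while another one is open links it under the enclosing loop, so repeated executions progressively build the tree of parallel loops described above.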
4.2 Deciding the Kind of Parallelization to Apply
When the application requests to open parallelism for a given loop, the runtime uses the information stored in the loop's descriptor to decide whether to serialize the loop (i.e., only inner loops will be parallelized), open parallelism for some of the iterations (i.e., mixed parallelization), or open it for all the iterations (i.e., outer or nested parallelization). The decision is straightforward:
– If the loop has no inner loops, open full parallelism at this level, since it is the only one available.
– Otherwise, consider the ratio of the total number of iterations to the number of threads:
  • If the ratio is an integer, open full parallelism at this level. Inner levels will be serialized.
  • If it is a real number greater than one, execute in parallel trunc(total iterations / number of threads) iterations per thread at this level (i.e., the largest multiple of the number of threads that fits). The remaining iterations are executed serially afterwards, parallelizing the inner level.
  • If it is a real number lower than one, open full parallelism at this level and allow nested parallelism in the inner loops.
Note that this decision mechanism lacks granularity control based on runtime measurements or compiler information; such control should be integrated in order to avoid opening excessively fine levels of parallelism. The decision is saved in the loop descriptor and reused afterwards. If the stored information changes (a nested loop is found, the number of iterations or threads changes, etc.), the decision is recalculated.
When the chosen method is not nested parallelism, some (or all) iterations of the inner or outer loop must be serialized. This is achieved by taking a special path inside the runtime, with almost no overhead, that avoids all the work distribution and synchronization code. In this way there is no need to keep two different versions of the code (one parallel and one serial), which would increase the size of the executable and lead to greater memory consumption.
When nested parallelism is used, the number of teams of threads at the outer level is also chosen, as this allows better utilization than an all-versus-all approach to nested parallelism. This is done by computing the greatest common divisor (gcd) of the number of iterations of the outer level and the number of threads available: the gcd is the number of teams spawned at the outer level, and at the inner level all available processors are distributed evenly across the teams. The runtime also records which threads were part of each team and tries to reuse them in the following executions, which yields improvements due to data affinity.
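For concreteness, the following minimal sketch, assuming iterations of equal weight, captures the decision rule and the team computation described above. The module name, function names and return codes are ours, not the actual XL runtime interface.

module parallelization_choice
  implicit none
  ! Hypothetical return codes, not part of the XL runtime.
  integer, parameter :: CHOICE_OUTER  = 1   ! full parallelism at this level, serialize inner
  integer, parameter :: CHOICE_MIXED  = 2   ! part of the outer iterations, rest via inner loop
  integer, parameter :: CHOICE_NESTED = 3   ! parallelize this level and the inner ones

contains

  ! Decision rule of Section 4.2 for equal-weight iterations.
  integer function decide(niters, nthreads, has_inner)
    integer, intent(in) :: niters, nthreads
    logical, intent(in) :: has_inner
    if (.not. has_inner) then
       decide = CHOICE_OUTER                  ! only level available
    else if (mod(niters, nthreads) == 0) then
       decide = CHOICE_OUTER                  ! ratio is an integer
    else if (niters > nthreads) then
       decide = CHOICE_MIXED                  ! ratio greater than one but not an integer
    else
       decide = CHOICE_NESTED                 ! ratio lower than one: too few outer iterations
    end if
  end function decide

  ! For a mixed parallelization: number of outer iterations executed at the
  ! outer level (the largest multiple of nthreads that fits).
  integer function outer_chunk(niters, nthreads)
    integer, intent(in) :: niters, nthreads
    outer_chunk = (niters / nthreads) * nthreads
  end function outer_chunk

  ! For nested parallelism: number of teams at the outer level, computed as
  ! gcd(outer iterations, available threads), using Euclid's algorithm.
  integer function num_teams(niters, nthreads)
    integer, intent(in) :: niters, nthreads
    integer :: a, b, t
    a = niters; b = nthreads
    do while (b /= 0)
       t = mod(a, b); a = b; b = t
    end do
    num_teams = a
  end function num_teams

end module parallelization_choice

For the example of Figure 3 (four outer iterations and three threads), decide returns the mixed choice and outer_chunk returns 3, matching the schedule shown in the figure.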
The first time a loop is executed, the runtime does not know whether it will contain a nested loop, so this first execution is handled slightly differently. There are two possible choices: execute the loop assuming that there is a nested loop inside, or execute it assuming there is none. With the latter choice, the runtime executes the loop as if no nested loop were inside; if a nested loop is later discovered, a new decision can be applied in subsequent executions. When the algorithm performs many steps this is a simple and adequate solution, but otherwise such an initial decision can heavily reduce the benefits of the mechanism. We therefore propose to assume that there will be an inner nested loop and to compute the normal decision strategy for the loop. If this decision leads to a mixed parallelization approach, the runtime has the opportunity to detect inner parallelism during the execution of the initial chunk of outer iterations. If no inner parallelism is detected, the tail of the iteration space is also parallelized afterwards.
5 Evaluation

5.1 Environment
The evaluation was done on a dedicated 16-way 375 MHz Power3 system with 4 GB of memory running AIX 5.2. The compiler was IBM's xlf with the -qsmp=omp and -O3 flags; for the nested versions, -qsmp=nested was also used. The OpenMP runtime was IBM's XL support library, to which our proposal was added. For all the applications we show the speedup (taking the serial execution time as the baseline) obtained with the different parallelization approaches (parallelizing the inner level, parallelizing the outer level, parallelizing both in a mixed way as explained in section 3, or parallelizing both with nested parallelism) and the speedup achieved by our runtime (described in section 4). The mixed parallelization was coded manually.
5.2 Synthetic Application
First, a synthetic application was evaluated as a proof of concept. It contains two nested parallel loops, each with a computational delay. Two cases are shown: (1) one where the number of iterations of the outer loop is large (500 iterations), and (2) one where it is small and not a multiple of the number of threads (17 iterations). As can be seen, the best way to parallelize the loops differs between the two scenarios. For the case with 500 iterations (see Figure 4(a)) the best choice is to parallelize the outer loop, while for the case with 17 iterations (see Figure 4(b)) the best speedup is achieved by parallelizing both loops in a mixed approach. In both cases, the runtime is able to decide an appropriate way of parallelizing the loops.
5.3 LU Kernel
We have also evaluated an LU kernel. It uses blocking (the size of each block is 150 elements) and the STATIC schedule in all loops. The outermost parallel loop decreases its number of iterations by one at each step, thereby changing the amount of parallelism available in each step. The speedups of the LU computation for different matrix sizes are shown in Figures 5(a), 5(b) and 5(c).
Fig. 4. Speedups for the synthetic cases: (a) outer loop with 500 iterations; (b) outer loop with 17 iterations
As can be seen, the performance of the different approaches varies significantly depending on the matrix size and the number of threads available. The runtime decisions are close to the best ones most of the time. It can also be seen that when the number of processors is large, the nested parallelization approach outperforms the others, including the runtime. In this case the runtime actually chooses nested parallelism, but the number of teams it decides to use is not optimal, leading to some performance degradation.
5.4 MBLOCK Benchmark
The mblock benchmark is a multi-block algorithm that simulates the propagation of heat from a constant source inside an object; the output of the benchmark is the temperature at each point of the object. The heat propagation is computed using the Laplace equation. The object is modeled as a multi-block structure composed of a number of rectangular blocks, connected through a set of links at specific positions. After an initialization phase, an iterative solver computes the temperature of each point in the structure. The computation of each block can be done in parallel, and parallelism also exists inside each block; the propagation of the temperature between blocks can be done in parallel as well. Three different inputs have been used: a set of eight large blocks (128x128x128 elements), a set of sixteen medium-sized blocks (40x40x64 elements), and a set of eight medium-sized blocks. The nested version used to evaluate this benchmark is a modified one that allows fixing the number of teams at the outer level and distributing the threads equally among the teams. Figure 6(a) shows how the best method changes with the number of threads. From two to eight threads all the methods have similar speedups.
Fig. 5. Speedups for the LU kernel: (a) matrix size 1500; (b) matrix size 2550; (c) matrix size 3000
Fig. 6. Speedups for the MBLOCK benchmark: (a) 8 large blocks; (b) 16 medium blocks; (c) 8 medium blocks
With 12 threads, parallelizing the inner loop results in a gain of 38% with respect to the other methods, but with sixteen threads, using nested parallelism gives a 9% gain with respect to the inner method. When the input is changed to sixteen medium blocks (see Figure 6(b)), the inner method performs poorly while the outer and nested methods perform very well. When the input is only eight medium blocks (see Figure 6(c)), the outer method scales only up to eight threads because there is not enough work to feed more threads, so in this case the best choice is to use nested parallelism.
Fig. 7. Speedups for the SP-MZ benchmark: (a) Class A; (b) Class B
In all cases, the runtime chooses a method that works close to the best one or even better. The difference in performance between the runtime and the nested method with 12 threads, seen in Figures 6(b) and 6(c), is because the runtime uses a better distribution of the threads across the teams than the one used in the nested version.
5.5 SP-MZ Benchmark
The SP-MZ benchmark is part of the NPB-MZ benchmark suite [16]. It is a modified version of SP from the NAS Parallel Benchmarks [17]. The original discretization mesh is divided into a two-dimensional tiling of three-dimensional zones, and inside each zone a regular SP problem is solved. The computation of the zones can be done in parallel, as can the computation inside each zone. The predefined classes A (16 zones of 32x32x16 elements) and B (64 zones of 38x26x17 elements) were evaluated. Figures 7(a) and 7(b) show that, for any number of threads, this benchmark works best with outer-level parallelism. The runtime obtains the same performance in all cases. It is interesting to note that with twelve processors in Figure 7(a) the runtime chooses a mixed parallelization scheme, yet it still obtains the same performance as parallelizing the outer level. We believe this is because the outer level benefits more from locality (which is why we see superlinear behavior), while the mixed approach, not having such good locality, achieves the same time by avoiding the imbalance present in the outer parallelization.
6 Conclusions and Future Work
OpenMP allows the specification of nested parallel loops. In that case, deciding which of those loops should be parallelized is a hard task that, in addition, depends on runtime parameters such as the number of available threads or the problem size.
In this work we claim that the programmer should use OpenMP directives to specify the existing parallelism and let the runtime library decide how to exploit it. We also present an approach to exploiting nested parallel loops that applies to those cases where exploiting only one of the levels of parallelism is not enough to reach good scalability. The idea is to exploit the outer level of parallelism as much as possible and execute the remaining iterations serially, exploiting the inner level of parallelism within them. We have implemented our proposal inside IBM's XL library and executed several benchmarks. Our results show that the runtime, with a simple set of heuristics, is able to determine the appropriate way of parallelizing, choosing between the outer, inner, nested or mixed parallelization approaches, in most cases.

The heuristics used by the decision algorithm may be improved, for example by using rules based on feedback measurements that would allow a better choice of the type of parallelism to use. Different ways to find the optimal number of teams also need to be explored to fully exploit this kind of technique. The present work will be extended to other worksharing constructs, such as sections or workshare, and the current mechanism will be extended to cope with inhomogeneous iteration distributions; measurements taken by the runtime will be useful in those scenarios. Further, we will explore the potential of using compile-time information inside the runtime. This information could include not only the application structure but also the size of data structures or the estimated execution time of parallel regions. The runtime should not rely on that information, but it could take advantage of it when available, for example to reduce the initial overhead.
Acknowledgements

The authors want to thank Marc Gonzàlez for his help with nested parallelism concepts. This work has been supported by the IBM CAS program, the POP European Future and Emerging Technologies project under contract IST-2001-33071, and by the Spanish Ministry of Science and Education under contract TIC2001-0995-C02-01.
References

1. OpenMP Organization. OpenMP Fortran application interface, v. 2.0. www.openmp.org, June 2000.
2. D. Chen, H. Su, and P. Yew. The impact of synchronization and granularity in parallel systems. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 239–248, 1990.
3. W. Harrison III and J. H. Chow. Dynamic control of parallelism and granularity in executing nested parallel loops. In Proceedings of the IEEE Third Symposium on Parallel and Distributed Processing, pages 678–685, 1991.
4. J. E. Moreira, D. Schouten, and C. Polychronopoulos. The performance impact of granularity control and functional parallelism. In C. H. Huang, P. Sadayappan, U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, editors, Languages and Compilers for Parallel Computing, pages 481–597. Springer, Berlin, Heidelberg, 1995.
5. H. Jin, G. Jost, D. Johnson, and W. Tao. Experience on the parallelization of a cloud modelling code using computer-aided tools. Technical report NAS-03-006, NASA Ames Research Center, March 2003.
6. M. González, J. Oliver, X. Martorell, E. Ayguadé, J. Labarta, and N. Navarro. OpenMP extensions for thread groups and their run-time support. Lecture Notes in Computer Science, 2017:324–??, 2001.
7. S. Shah, G. Haab, P. Petersen, and J. Throop. Flexible control structures for parallelism in OpenMP. In 1st European Workshop on OpenMP, September 1999.
8. E. Ayguadé, M. González, X. Martorell, and G. Jost. Employing nested OpenMP for the parallelization of multi-zone computational fluid dynamics applications. In Proceedings of the International Parallel and Distributed Processing Symposium, April 2004.
9. E. Ayguadé, X. Martorell, J. Labarta, M. González, and N. Navarro. Exploiting multiple levels of parallelism in OpenMP: A case study. In Proceedings of the 1999 International Conference on Parallel Processing, September 1999.
10. R. Blikberg and T. Sørevik. Nested parallelism: Allocation of processors to tasks and OpenMP implementation. In Proceedings of the 1999 International Conference on Parallel Processing, September 1999.
11. Y. Tanaka, K. Taura, M. Sato, and A. Yonezawa. Performance evaluation of OpenMP applications with nested parallelism. In Sandhya Dwarkadas, editor, Languages, Compilers, and Run-Time Systems for Scalable Computers, 5th International Workshop, LCR 2000, volume 1915 of Lecture Notes in Computer Science. Springer, 2000.
12. R. Blikberg and T. Sørevik. Nested parallelism in OpenMP. In ParCo 2003, 2003.
13. S. C. Goldstein, K. E. Schauser, and D. E. Culler. Enabling primitives for compiling parallel languages. In Third Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers, May 1995.
14. P. Dechering, L. Breebaart, F. Kuijlman, and K. van Reeuwijk. Semantics and implementation of a generalized forall statement for parallel languages. In 11th International Parallel Processing Symposium (IPPS '97), April 1997.
15. E. Ayguadé, B. Blainey, A. Duran, J. Labarta, F. Martínez, R. Silvera, and X. Martorell. Is the schedule clause really necessary in OpenMP? In M. J. Voss, editor, Proceedings of the International Workshop on OpenMP Applications and Tools 2003, volume 2716 of Lecture Notes in Computer Science, pages 69–83, June 2003.
16. R. F. Van der Wijngaart and H. Jin. NAS Parallel Benchmarks, multi-zone versions. Technical report NAS-03-010, NASA Ames Research Center, July 2003.
17. D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, D. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga. The NAS Parallel Benchmarks. The International Journal of Supercomputer Applications, 5(3):63–73, Fall 1991.
Author Index

an Mey, Dieter 83
Archambault, Roch 110
Aversa, Rocco 12
Bentz, Jonathan L. 1
Chapman, Barbara 53, 121
Chiu, Eric 98
Copty, Nawal 19, 83
Corbalan, Julita 137
Di Martino, Beniamino 12
Duran, Alejandro 137
Evans, Emyr 67
Gimenez, Judit 29
Hernandez, Oscar 53
Huang, Lei 121
Ierotheou, Constantinos 67
Jin, Haoqiang 67
Johnson, Stephen 67
Jost, Gabriele 29
Karl, Wolfgang 41
Kendall, Ricky A. 1
Labarta, Jesús 29, 137
Lee, Myungho 19
Liao, Chunhua 53
Lin, Yuan 83
Liu, Zhenying 121
Man Yan Chow, Patrick 98
Mazzocca, Nicola 12
Schulz, Martin 41
Silvera, Raúl 110, 137
Tao, Jie 41
Terboven, Christian 83
Venticinque, Salvatore 12
Voss, Michael 98
Weng, Tien-Hsiung 121
Whitney, Brian 19
Wong, Catherine 98
Yuen, Kevin 98
Zhang, Guansong 110