This page intentionally left blank
Proceedings of the Eleventh ECMWF Workshop on the
Use of High performance computing in Meteorology
Reading, UK 25- 29 October 2004
Editors
Walter Zwieflhofer
George Mozdzynski European centre for Medium-Range Weather Forecasts, UK
World Scientific NEW JERSEY LONDON SINGAPORE BEIJING SHANGHAI HONGKONG TAIPEI CHENNAI
Published by
World Scientific Publishing Co. Pte. Ltd. 5 Toh Tuck Link, Singapore 596224 USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601 UK oftice: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
USE O F HIGH PERFORMANCE COMPUTING IN METEOROLOGY Proceedings of the 11th ECMWF Workshop Copyright 0 2005 by World Scientific Publishing Co. Re. Ltd All rights reserved. This book, or parts thereoJ may not be reproduced in any form or by any means, electronic or mechanical, includingphotocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 981-256-354-7
Printed in Singapore by Mainland Press
PREFACE
The eleventh workshop in the series on the “Use of High Performance Computing in Meteorology” was held in October 2004 at the European Centre for Medium Range Weather Forecasts. This workshop received talks mainly from meteorological scientists, computer scientists and computer manufacturers with the purpose of sharing their experience and stimulating discussion. It is clear from this workshop that numerical weather prediction (NWP) and climate science continue to demand the fastest computers available today. For NWP in particular this ‘need for speed’ is mainly a consequence of the resolution of weather models used, increasing complexity of the physics, and growth in the amount of satellite data available for data assimilation. Today, the fastest commercially available computers have 1,000’s of scalar processors or 100’s of vector processors, achieving 10’s of teraflops sustained performance for the largest systems. These systems are mainly programmed using Fortran 90/95 and C++ with MPI for interprocessor communication and in some cases OpenMP for intra-node shared memory programming. Can this programming model continue to be used to run our community models to achieve orders of magnitude increases in performance? Or do we now need to look to new languages and/or hardware features to address known problem areas of space/power and memory latency? Chip roadmaps show increasing numbers of cores per chip with modest increases in clock frequency. The implication of this is that our applications need to scale on increasing numbers CPUs with greater dependence on fast switch networks. Will our applications rise to this challenge? During the week of this workshop a number of talks considered these issues while others presented on areas of grid computing, interoperability, use of Linux clusters, parallel algorithms, and updates from meteorological organisations. The papers in these proceedings present the state of the art in the use of parallel processors in the fields of meteorology, climatology and oceanography.
George Mozdzynski
Walter Zwieflhofer
V
This page intentionally left blank
CONTENTS Preface
V
Early Experiences with the New IBM P690+ at ECMWF Deborah Salmond, Sami Saarinen
1
Creating Science Driven System Architectures for Large Scale Science William TC. Kramer
13
Programming Models and Languages for High-productivity Computing Systems Hans P. Zima
25
Operation Status of the Earth Simulator Atsuyu Uno
36
Non-hydrostatic Atmospheric GCM Development and Its Computational Performance Keiko Takahashi, Xindong Peng, Kenji Komine, Mitsuru Ohduira, Koji Goto, Masayuki Yamada, Fuchigami Hiromitsu, Takeshi Sugimura
50
PDAF - The Parallel Data Assimilation Framework: Experiences with Kalman Filtering L. Nerger, W. Hiller, J. Schroter
63
Optimal Approximation of Kalman Filtering with Temporally Local 4D-Var in Operational Weather Forecasting H. Auvinen, H. Huario, 'I: Kauranne
84
Intel Architecture Based High-performance Computing Technologies Herbert Cornelius
100
Distributed Data Management at DKRZ Wolfgang Sell
108
Supercomputing Upgrade at the Australian Bureau of Meteorology I. Bermous, M. Naughton, W Bourke
131
4D-Var: Optimisation and Performance on the NEC SX-6 Stephen Oxley
143
vii
viii
The Weather Research and Forecast Model: Software Architecture and Performance J. Michalakes, J. Dudhia, D. Gill, T. Henderson, J. Klemp, W Skamarock, W Wang
156
Establishment of an Efficient Managing System for NWP Operation in CMA Jiangkai, Hu, Wenhai, Shen
169
The Next-generation Supercomputer and NWP System of the JMA Masami Narita
178
The Grid: An IT Infrastructure for NOAA in the 21'' Century Mark W Govett, Mike Doney, Paul Hyder
187
Integrating Distributed Climate Data Resources: The NERC Datagrid A . Woo& B. Lawrence, R. Lowry, K. Kleese van Dam, R. Cramer, M. Gutierrez, S. Kondapalli, S. Lathan, K. O'Neill, A . Stephens
215
Task Geometry for Commodity Linux Clusters and Grids: A Solution for Topology-aware Load Balancing of Synchronously Coupled, Asymmetric Atmospheric Models I. Lumb, B. McMillan, M. Page, G. Carr
234
Porting and Performance of the Community Climate System Model (CCSM3) on the Cray X1 G.R. Carr Jr., I.L. Carpenter, M.J. Cordery, J.B. Drake, M.W. Ham, F.M. Hoffman, P.H. Worley
259
A Uniform Memory Model for Distributed Data Objects on Parallel Architectures V Balaji, Robert W. Numrich
272
Panel Experience on Using High Performance Computing in Meteorology - Summary of the Discussion George Mozdzynski
295
List of Participants
299
EARLY EXPERIENCES WITH THE NEW IBM P690+ AT ECMWF DEBORAH SALMOND &
SAMISAARINEN ECMWF
This paper describes the early experiences with the large IBM p690+ systems at ECMWF. Results are presented for the IFS (Integrated Forecasting System) which has been padelized with MPI message passing and OpenMP to obtain good scalability on large numbers of processors. A new pmfiler called Dr.Hook, which has been developed to give detailed performance information for IFS, is described. Finally the code optimizations for IFS on the lBM p69W are briefly outlined.
1. Introduction
This paper describes the performance of the IFS on the IF3M p690+ systems at ECMWF. The IFS has been designed and optimized to perform efficiently on vector or scalar processors and parallelized for shared and distributed memory using a hybrid of MPI and OpenMP to scale well up to O( 1000) processors. The performance of the IFS is measured and compared on high performance computer systems using the RAPS (Real Applications for Parallel Systems) benchmark.
1.1. High performance Computer Systemsfor Numerical Weather prediction at ECMWF
In 2004 ECMWF installed two IBM p690+ clusters known as hpcc and hpcd. Each cluster is configured with 68 shared memory nodes, each node having 32 IBM Power4+ processors with a peak performance of 7.6 Gflops per processor. These replaced two IBM p690 clusters known as hpca and hpcb. These were configured with 120 nodes, each node having 8 IBM Power4 processors with a peak performance of 5.2 Gflops per processor. Figure 1 shows the key features of these IBM systems for IFS. The main improvement &om hpcahpcb to hpcc/hpcd was the introduction of the new Federation switch. This significantly reduced the time spent in message passing communications. 1
4. Dr.Hook
Dr.Hook is an instrumentation library which was initially Written to catch runtime errors in IFS and give a calling tree at the time of failure. Dr. Hook also gathers profile information for each instrumented subroutine or called sub-tree. This profile information can be in terms of Wall-clock time, CPU-time, Mflops, MIPS and Memory usage. Dr.Hook can also be used to create watch points and so detect when an m a y gets altered. Dr.Hook can be called ffom Fortran90 or C. The basic feature is to keep track of the current calling tree. This is done for each MPI-task and OpenMP-thread and on error it tries to print the calling trees. The system's own traceback can also be printed. The Dr.Hook library is portable allowing profiles on different computer systems to be compared and has very low overhead (-1% on IBM). Figure 10 shows how a program is instrumented with Dr.Hook. Calls to DR-HOOK must be inserted at the beginning and just before each return from the subroutine. SUBROUTINE SUB USE YOMHOOK, ONLY IMPLICIT NONE
:
LHOOR, DR-HOOK
REAL(8) ZHOOK-HANDLE ! Must b e a l o c a l
(stack) variable
f i r s t statement i n the subroutine IF (LHOOK) CALL DR_HOOK('SUB',O,ZHOOK_~LE)
!- The v e r y
! - - - Body o f the r o u t i n e goes here !
-
---
J u s t b e f o r e RETURNing from the subroutine IF (LHOOR) CALL DR__HOOK( 'SUB' ,l.ZHOOK--HANDLE)
END SUBROUTINE SUB
Figure ZU. How to instrument a Fortran9Uprogram with Dr.Hook.
Figure 11 shows an example of a trace-back obtained after a failure - in this case due to CPU-limit being exceeded. The IBM's system trace-back is also shown. The IBM trace-back will usually give the line number of the failure but in some cases, for example CPU-limit exceeded no useful information is given. Figure 12
I
10
shows and example of a profile obtained from IFS for a T511 forecast run on hpca - the communications routines are highlighted in red.
Figure 11. Example of DrHook trace-back.
#
%Time (self) 1
2 3 4 5 6 7 8 9
10 11
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
7.43 3.67 3.65 3.64 3.63 3.51 2.76 2.51 2.41 2.40 2.39 2.39 2.36 2.31 2.30 2.29 1.94 1.94 1.92 1.92 1.85 1.85 1.83 1.83 1.82 1.80
CUmul (sec)
(sec)
Self
Total (sec)
35.027 52.349 52.349 52.349 52.349 68.918 81.935 93.763 105.145 105.145 105.145 105.145 116.296 116.296 116.296 116.296 125.448 125.448 125.448 125.448 134.173 134.173 134.173 134.173 142.737 151.219
35.02’1 17.322 17.204 17.181 17.138 16.569 13.017 11.829 11.382 11.336 11.274 11.267 11.150 10.897 10.832 10.816 9.152 9.130 9.073 9.045 8.725 8.724 8.654 8.621 8.565 8.482
40.573 17.367 17.287 17.289 17.202 16.584 18.260 11.831 30.536 30.436 30.394 30.072 11.185 10.940 10.920 10.310 9.327 9.263 9.256 9.220 8.785 8.777 8.741 8.658 8.580 69.102
#calls MIPS MFlops bDiv
49 5824 5791 5769 5770 54 51 54 11540 11538 11582 11648 3492 3502 3474 3484 27785 27980 27715 27750 5563 5596 5541 5546 51 13
961 1113 1116 1118 1117 783 926 742 1106 1112 1110 1113 2135 2218 2216 2224 1433 1434 1432 1440 985 987 986 989 782 581
273 546 548 549 549 0
1 0 88 88 88 86 2172 2259 2258 2266 682 679 682 686 592 593 593 595 0 22
Figure 12. Example of DrHookproJilefor T511forecast.
2.9 3.6 3.6 3.6 3.6 27.6 2.8 24.8 3.4 3.4 3.4 3.4 0.0 0.0 0.0 0.0
0.0 0.0
0.0 0.0 2.2 2.2 2.2 2.2 21.6 10.6
Routine
WVCOITPLEBl I567.11 *CLOWSCBl 15.41 CLOWScB4 15.41 CLOm)Sc82 15.41 CLOUDSC@3 15.41 T m O L COMMSa1 [ 5 2 5 , 1 1 TROTOM31 1520.11 TRLTOO_COMMSBl (523.11 *CUAS-3 110.41 WAS-2 130.41 WAS-4 130.41 CUASCNB1 130.41 *MXUAOP@l 1166.41 MXUAOP@2 1166.41 MXUAOPOI 1166.41 MXUAOPO3 1166.41 *LATTQ-3 1138.41 LArTQX@l 1138.41 LAITQ-4 1138.41 LAITQ-2 1138.41 * S L T B m 4 1297.41 SLT-1 I297.41 SLTB-2 1297.41 SLTBNDB3 1237.41 TRLTOM-COMMSBl I524.11 RhDINTrWl L207.11
11
Figure 13 shows and example of a Dr.Hook memory profile. This keeps trace of how much is data is allocated and de-allocated by each instrumented subroutine and its child subroutines. Also a total for the heap and stack is recorded.
1 i 3 4
5 6 7 8
9 10 11 ia 13 14 15 16 17
40784 440483a
20.00 9.95 8.01 6.51 6.15 4.91 4.n 4.59
587569164 a9iisio88 135681718 i91547a80 183480071 144638640 138637064 134851148
4.39
US847896
13056
iao 14 110 113 14 1 1 1 1
4.19 1.79 1.71 1.68 1.39 1.18 1.63 1.56
ll3051080 81091910
~90aeo
113
134631140 10530944 3900704 a0593680 18431 19391
79457488
78809176 70318856
9691368 135193904
111 iao
4408480
a
10530944 13014 17115456 17184
64001688
47978864 45774496
713 168 1910
113 1 1 1
34
710 as8 1917 144 518 10 31
1
0
50 241
a41
a44
518 10
141 360 6
111 8
1 113
49 141 360 6 la1 3 1 1ll
0 s 1
-1.10.1 CAr&-#Ul '~B-~TTM
1rnB-LmLIO
Smz1m.l amv_w.LI LmpupI.1
.mer ~ -Ul 1,nB-rTTM
mm.1 m - 1 1-
Figure 13. Example of DrHook memory profile.
Figure 14 shows an extracts from Dr.Hook profiles for and IFS T5 11 forecast on two different systems - the Cray X1 and the IBM p690+. The performance of the different subroutines on the scalar and vector systems can easily be compared.
IBM p690+ (7.6 Gf lops peak)
CRAY X1 (3.2 Gf lops peak)
MFlops per CPU
Routine
Mf lops
609
CUADJTQ
851
548
CLOUDSC
486
a12
I
LAITQM
I
1185
LASCAW VDFEXCU
704
Figure 14. Example of DrHook for comparing different systems for T511 forecast
m
~
~
12 5. Optimisation of IFS
Today’s Fortran compilers do most of the basic code tuning. This depends on the optimization level that has been chosen. At ECMWF we use ‘-03 -qstrict’ flags on the IBM Fortran compiler. The ‘-qstrict’ flag ensures that operations are performed in a strictly defined order. With higher optimization levels we found that we did not get bit-reproducible results when NFROMA was changed. The ‘-qstrict’ flag does however stop certain optimizations like replacing several divides by the same quantity, by one reciprocal and corresponding multiplies. The main optimization activities done for IFS on the IBM p690 system have been as follows:
- Add timings (at subroutine level and also for specific sections using &.Hook)
- MPI (remove buffered mpi-send and remove overlap of communications & CPU) - Add more OpenMP parallel regions - ‘By hand’ optimization of divides - necessary because of ‘-qstrict’
- Use IBM ‘vector’ functions - Remove copies and zeroing of arrays where possible - Optimize data access for cache - group different uses of the same array together. - Remove allocation of arrays in low-level routines. - Basic Fortran optimizations, e.g. ensuring inner loops are on first array index. The IBM has scalar processors, but as we have worked on optimization we have tried to keep a code that performs well on scalar and vector processors. This has meant that in a few places we have had to put in alternate code for vector and scalar versions.
References 1. D.Dent and G.Mozdzynski: ‘ECMWF operational forecasting on a distributed memory platform’, Proceedings of the 7” ECMWF workshop on the use of Parallel Processors in Meteorology, World Scientific, pp 36-51, (1996). 2. D.Dent, M.Hamrud, G.Mozdzynski, D.Salmond and C.Temperton: ‘IFS Developments’, Proceedings of the 9’hECMWF workshop on the use of Parallel Processors in Meteorology, World Scientific, pp 36-52, (2000). 3. M.Hamrud, S.Saarinen and D.Salmond: ‘Implementation of IFS on a Highly Parallel Scalar System’. Proceedings of the 10” ECMWF workshop on the use of Parallel Processors in Meteorology, World Scientific,pp 74-87, (2002).
CREATING SCIENCE DRIVEN SYSTEM ARCHITECTURES FOR LARGE SCALE SCIENCE WILLIAM T.C. KRAMER Lawrence Berkeley National Laboratory MS Sob-4230, One Cyclotron Road Berkeley, California 94720 (510) 486-7577
[email protected] Application scientists have been frustrated by a trend of stagnating application performance, despite dramatic increases in claimed peak performance of highperformance computing systems. This trend is sometimes referred to as the “divergence problem” and often has been assumed that the ever-increasing gap between theoretical peak and sustained performance was unavoidable. However, recent results from the Earth Simulator (ES) in Japan clearly demonstrate that a close collaboration with a vendor to develop a science-driven architectural solution can produce a system that achieves a significant fraction of peak performance for critical scientific applications. This papa discusses the issues contributing to divergence problem, suggests a new approach to address the problem and documents some early successes for this approach.
1. The “Divergence Problem”
Application scientists have been frustrated by a trend of stagnating application performance, despite dramatic increases in claimed peak performance of highperformance computing systems. This trend is sometimes referred to as the “divergence problem”’ and often has been assumed that the ever-increasing gap between theoretical peak and sustained performance was unavoidable. Other people have described this situation as the memory wall. Figure 1 shows this divergence comparing peak theoretical performance and the measured (sustained) performance for an application workload. Due to business pressures, most computational system components are designed for applications other than high performance computing (e.g. web servers, desktop applications, databases, etc.) In the early 1990’s, in an attempt to increase price performance for HPC applications, many organizations in the US government, most notably the High Performance Computing and Communications (HPCC) Initiative, encouraged the use of COTS (Commodity Off The Shelf) components for large scale computing. Very effective systems like the Cray T3E and the IBM SP were created during this. This effort was 13
15
1.1. Applications Have Diverse Algorithmic Needs Tables 1 and 2 show disciplines require the use of many different algorithmic methods. It is very challenging for architectures to address all these algorithmic areas at the same time. Data parallel algorithms perform well on systems with high memory bandwidth such as vector or super scalar systems. Table 1. Significant scientific breakthroughs are. waiting increased in computational performance. Science Areas
Goals
Computational Methods Breakthrough Target (5OTflop/s sustained) Simulate nanostructures with hundreds to thousands of atoms as well as transport and optical properties and other parameters Explicit finite difference Simulate laboratory Implicit finite difference scale flames with high Zero-dimensional physics fidelity representations of governing physical Adaptive mesh processes refinement Lagrangian particle methods
Nanoscience
Simulate the synthesis and predict the properties of multi-component nanosystems
Quantum molecular dynamics Quantum Monte Carlo Iterative eigensolvers Dense linear algebra Parallel 3D FFTs
Combustion
Predict combustion processes to provided efficient, clean and sustainable energy
Fusion
Understand high-energy density plasmas and develop an integrated simulate of a fusion reactor
Multi-physics, multi-scale Simulate the lTER Particle methods reactor Regular and irregular access Nonlinear solvers Adaptive mesh refinement
Climate
Accurately detect and attribute climate change, predict future climate and engineer mitigations strategies
Finite difference methods Perform a full FFk oceadatmospheres climate model with Regular & irregular access 0.125 degree spacing, Simulation ensembles with an ensemble of 810 NllS
Astrophysics
Use simulation and analysis of observational data to determine the origin, evolution and fate of the universe, the nature of matter and energy, galaxy and stellar evolutions
Multi-physics, multi-scale Simulate the explosion Dense linear algebra of a supernova with a full 3D model Parallel 3D FFT's Spherical transforms Particle methods Adaptive Mesh Refinement
16
Irregular control flow methods require excellent scalar performance and spectral and other applications require high bisection bandwidth and low latency for interconnects. Yet large scale computing requires architectures capable of achieving high performance across the spectrum of state of the art applications. Hence, for HPC computing, balanced system archtectures are extremely important. Table 2. State-of-the-art computational science requires increasingly diverse and complex algorithms. Science Areas
Nanoscience Combustion Fusion Climate Astrophysics
Multi Dense Physics and Linear Multi-scale Algebra
d
FFTs Particle A M R Data Irregular Methods Parallelism Control Flow
.i
. i d
d
d
d d
d .i
d
d
d d
d d
4
d
d d
d v
4 4
4
d
d
d d
Some systems and sites focus on particular application areas. For example, a country may have one or several sites focused on weather modeling and/or climate research. Other focused systems may be for aeronautical applications, for military design, for automotive design or for drug design. Because a more focused set of applications run at these sites, they are more concerned about supporting a smaller set of methods. However, many sites, such as the National Energy Research Scientific Computing (NERSC) Facility, the flagship computing facility for the US Department of Energy Office of Science, support a wide range of research disciplines. Figure 1 shows the FY 03 breakdown of usage by scientific discipline at NERSC. Clearly, combining Table 1 and 2 with Figure 1 indicates only a balanced system will operate well across all disciplines. Sites such as NERSC are important bell weathers for large scale system design because of their workload diversity. In other words, if a system performs well at a site with a diverse workload, it will perform well at many sites that have a more focused workload.
2. The Science Driven System Architecture Approach Recent results from the Earth Simulator (ES) in Japan clearly demonstrate that a close collaboration with a vendor to develop a science-driven architectural solution can produce a system that achieves a significant fraction of peak performance for critical scientific applications. The key to the ES success was
18
Computing Revitalization Task Force (HECRTF) Workshop3, the Federal Plan for High-End Computing4 and the DOE SCaLeS Workshop’ and has been demonstrated successhlly by the Earth Simulator. the initial Berkeley Lab/IBM Blue Planet effort, and the SandidCray Red Storm effort6. The HECRTF report states: “We must develop a government-wide. coordinated method for influencing vendors. The HEC injluence on COTS components is small, but it can be maximized by engaging vendors on approaches and ideas five years or more before commercial products are created. Given these time scales. the engagement must also be focused and sustained. We recommend that academic and government HEC groups collect and prioritize a list of requested HEC-speciJicchangesfor COTS components,focusing on an achievable set. ”
The SDSA is in part an implementation of this recommendation. The SDSA process requires close relationships with vendors who have resources and an R&D track record and a national collaboration of laboratories, computing facilities, universities, and computational scientists.
3. Goals of the Science Driven System Architecture Process
Even for applications that have high computational intensity, scalar performance is increasingly important. Many of the largest scale problems can not use dense methods because of N3 memory scaling that drives applications to use sparse and adaptive computational methods with irregular control flow. Specialized architectures typically are optimized for a small subset of problems and do not provide effective performance across the broad range of methods discussed above7. Additionally, one of the main drivers for large, more accurate simulations is to incorporate complex microphyics into inner loops of computations. This is demonstrated by the following quote from a NERSC computational scientist. “It would be a major step backward to acquire a new plarform that could reach the 100 Tflop levelfor only a few applications that had ‘clean’microphysics. Increasingly realistic models usually mean increasingly complex microphysics.’‘
Thus, the goal for SDSA is to create systems that provide excellent sustained price performance for the broadest, large scale applications. At the same time, SDSA should allow applications that do well on specialized architectures such as vector systems to have performance relatitively near optimal on a SCSA system. In order to be cost effective, the SDSA systems will not likely leverage COTS technology while adding a limited amount of focused technology accelerators.
19
In order to achieve thls goal, the entire large-scale community must engage in technology discussions that influence technology development to implement the science-driven architecture development. This requires serious dialogue with vendor partners who have the capability to develop and integrate technology. To facilitate this dialogue, large-scale facilities must establish close connections and strategic collaborations with computer science programs and facilities funded by federal agencies and with universities. There needs to be strong, collaborative investigations between scientists and computer vendors on sciencedriven architectures in order to establish the path to continued improvement in application performance. In order to do this, staff and users at large scale facilities and the computational science research community must represent their scientific application communities to vendors to the Science-Driven Computer Architecture process. The long-term objective: integrate the lessons of the large scale systems, such as the Blue Gene/L and the DARPA HPCS experiments with other commodity technologies into hybrid systems for petascale computing. There are significant contributions that large-scale facilities can provide into this effort. They include Workload Analysis which is difficult to do for diverse science but critically important to provide data to computer designers. Explicit scaling analysis including performance collection, system modeling and performance evaluation. Provide samples of advanced algorithms, particularly the ones that may be in wide use five or ten years into the future provide a basis for design of systems that require that long lead times to develop. Provide computational science successes stories, these are important because they provide motivation to system developers and stakeholder funders. Numerical and System Libraries such as SuperLU, Scalapack, MPI-2, parallel NETcdf, as well as applications, tools and even languages that quickly make use of advanced features.
3.1. An Example - Sparse Matrix Calculations Sparse matrix calculations are becoming more commonly driven by the growing memory needs of dense methods, particularly as simulations move from one or two dimensions to three dimensional analysis. Many applications that have in the past or currently use dense linear algebra have to use sparse matrix methods in the future. Current cache based computing systems are designed to handle applications that have excellerit temporal and spatial memory locality but do not operate efficiently for sparse cases.
20
Consider the following simple problem, y = A*x, where A is a two dimension array that stores matrix elements in uncompressed form. Nrow and Ncol are the number of rows and columns in the array, respectively, and s is the interim sum of the multiplication. The equation above can be carried out with the following psuedo-code. do j=l,Nrow s=o do i=l, Ncol s = s+A(j,i)*x(i) a d do y(j) = s end do
This methods works well when the elements in A have good spatial locality, which is when the matrix is dense. However, if the matrix is sparse, then many of the values are zero and do not need to be computed. Figure 4, shows the impact of sparsity in the calculation. Because memory is loaded into cache according to cache lines (typically eight to sixteen 64-bit words), as the matrix gets sparser, contiguous memory locations no longer hold meaninghl data. It soon evolves into an application using only a small fiaction of the data moved to cache - in the extreme, using only one word per cache line. In many 3D simulations of scale, at least one dimension uses a sparse approach. In this case. to optimize computational effort and to conserve memory as problems grow, sparse methods are used. The psuedo-code for a sparse matrix implementation would look as follows. A is row compressed sparse matrix. int row-start() is the starting index of each row and int col-ind0 is the column index for each row. A ( ) is a single dimension array that stores sparse matrix elements in a concentrated form. Nrow is the number of rows in the array, and s is the interim sum of the row multiplication. The equation above can be carried out with the following psuedo-code. do j=l,Nrow jj=row-start ( j ) row-length = row-start(j+l) jj s=o do i d , row-length s s+A(jj+i)*x(col_ind(jj+i)) end do y(j) = s end do
-
22
specialized “accelerator” hardware to commodity CPUs that would increase efficiency for applicationsthat can typically do well on vector architectures 2. There is a possibility to add a “pre-load” feature that takes user input into account to provide more flexibility than pre-fetch engines. Other ideas are also being explored. 4. Early Results for Science Driven Architecture
Realizing that effective large-scale system performance cannot be achieved without a sustained focus on application-specific architectural development, NERSC and IBM have led a collaboration since 2002 that involves extensive interactions between domain scientists, mathematicians, computer experts, as well as leading members of IBM’s R&D and product development teams. The goal of this effort is to adjust IBM’s archtectural roadmap to improve system balance and to add key architectural features that address the requirements of demanding leadership-class applications - ultimately leading to a sustained Petaflop/s system for scientific discovery. The first products of this multi-year effort have been a redesigned Power5-based HPC system known as Blue Planet and a set of proposed architectural extensions referred to as ViVA (Virtual Vector Architecture). This collaboration has already had a dramatic impact on the archtectural design of the ASCI Purple system’ and the Power 5 based SP configuration. As indicated in the following quote from the “A National Facility for Advanced Computational Science: A Sustainable Path to Scientific Discovery” technical report, Dona Crawford, the Associate Laboratory Director for Computation at Lawrence Livermore National Laboratory wrote, “The Blue Planet node conceived by NERSC and IBM I...] features high internal bandwidth essential ,for successful scientific computing. LLNL elected early in 2004 to modifv its contract with IBM to use this node as the building block of its 100 TF Purple system. This represents a singular benefit to LLNL and the ASC program. and LLNL is indebted to LBNL for this effort. ’’
The Blue Planet node is an 8 way SMP that is balanced with regard to memory bandwidth. That is, if each processor operates at its full potential memory bandwidth, the system can keep all processors supplied with data at the sum of the individual rate. If more processors are added, the node memory bandwidth would be oversubscribed. The implication of this change meant that the interconnection architecture had to scale to more than was originally planned, yet the overall sustained system performance improved. It is important to point out that this approach has the potential for additional improvements, not just with IBM but with other providers. The Blue Planet design is incorporated into the new generation of IBM Power microprocessors that are the building blocks of future configurations. These
23
processors break the memory bandwidth bottleneck, reversing the recent trend towards architectures poorly balanced for scientific computations. The Blue Planet design improved the original power roadmap in several key respects: dramatically improved memory bandwidth; 70% reduction in memory latency; eight-fold improvement in interconnect bandwidth per processor; and ViVA Virtual Processor extensions, which allow all eight processors within a node to be effectively utilized as a single virtual processor. 5. Summary
This paper explains a new approach to making the COTS based large scale system more effective for scientific applications. It documents some of the problems with the existing approach, including the need for systems to be balanced and perform well on multiple computational methods. The paper proposes a more interactive design and development process where computational science experts interact directly with system designers to produce better, more productive systems. The goals for SDSA systems include excellent price-performance, consistent performance across a range of applications, and pointed hardware improvement that add high value. The SDSA process has already resulted in improvements to IBM systems that are available now.
Acknowledgments This work is supported by the Office of Computational and Technology Research, Division of Mathematical, Information and Computational Sciences of the U S . Department of Energy, under contract DE-AC03-76SF00098. In addition, this work draws on many related efforts at NERSC and throughout the HPC community, who we acknowledge for their efforts to improve the impact of scientific computing.
24
References 1 C. William McCurdy, Rick Stevens, Horst Simon, et al., “Creating Science-Driven Computer
Architecture: A New Path to Scientific Leadership,” Lawrence Berkeley National Laboratory report LBNUPUB-5483 (2002), http://www.nersc.gov/news/ArchDevProposal.5.01.pdf. 2 Gordon E. Moore, “Cramming more components onto integrated circuits”. Electronics, Volume 38, Number 8, April 19, 1965. Available at
ftp://download.intel.com/research/silicon/moorespaper.pdf 3 Daniel A. Reed, ed.,“Workshop on the Roadmap for the Revitalization of High-End Computing,” June 16-18.2003 (Washington, D.C.: Computing Research Association). . 4 “Federal Plan for High-End Computing: Report of the High-End Computing Revitalization Task
Force (HECRTF),” (Arlington, VA: National Coordination Ofice for Information Technology Research and Development, May 10.2004), http://www.house.gov/science/hearings/fullO4/may13/ hecrtf.pdf 5 Phillip Colella, Thom H. Dunning, Jr., William D. Gropp, and David E. Keyes, eds., “A ScienceBased Case for Large-Scale Simulation” (Washington, D.C.: DOE Ofice of Science, July 30, 2003). 6 “Red Storm System Raises Bar on SupercomputerScalability” (Seattle: Cray Inc., 2003). http://www .cray.comlcompany/RedStorm-flyer.pdf.
7 h n i d Oliker, Andrew Canning, Jonathan Carter, John Shalf, and Stephane Ethier. Scientific Computations on Modem Parallel Vector Systems (draft). In SC2004 High Performance Computing, Networking and Storage Conference; Pittsburgh. P A Nov 6 - 12,2004. - Available at
http://www.nersc.gov/news/reports/vectorbench.pdf 8 Simon, Horst, et. al. “A National Facility for Advanced Computational Science: A Sustainable Path to Scientific Discovq”. LBNL Report #5500, April 2.2004 - http://www-
library.lbl.gov/docs/PUB/55OOiPDF/PUB-55OO.pdf 9
“Facts on ASCI Purple,” Lawrence Livermore National Laboratoly report UCRLTB-I 50327
(2002); http://www.sandia.gov/supercomp/sc2002/fly~/SCO2ASC~u~~ev4.pdf.
PROGRAMMING MODELS AND LANGUAGES FOR HIGH-PRODUCTIVITY COMPUTING SYSTEMS*
HANS P. ZIMA JPL, California Institute of Technology, Pasadena, CA 91109, USA and Institute of Scientific Computing, University of Vienna, Austria E-mail:
[email protected] High performance computing (HPC) provides the superior computational capability required for dramatic advances in key areas of science and engineering such as DNA analysis, drug design, or structural engineering. Over the past decade, progress in this area has been threatened by technology problems that pose serious challenges for continued advances in this field. One of the most important problems has been the lack of adequate language and tool support for programming HPC architectures. In today’s dominating programming paradigm users are forced to adopt a low level programming style similar to assembly language if they want to fully exploit the capabilities of parallel machines. This leads to high cost for software production and error-prone programs that are difficult to write, reuse, and maintain. This paper discusses the option of providing a high-level programming interface for HPC architectures. We summarize the state of the art, describe new challenges posed by emerging peta-scale systems, and outline features of the Chapel language developed in the DARPA-funded Cascade project.
1. Introduction
Programming languages determine the level of abstraction at which a user interacts with a machine, playing a dual role both as a notation that guides thought and by providing directions for program execution on an abstract machine. A major goal of high-level programming language design has traditionally been enhancing human productivity by raising the level of abstraction, thus reducing the gap between the problem domain and the level at which algorithms are formulated. But this comes at a cost: the success *This paper is based upon work supported by the Defense Advanced Research Projects Agency under its Contract No. NBCH3039003. The research described in this paper was partially carried out at the Jet Propulsion Laboratory, California Institute of Technology, under contract with the National Aeronautics and Space Administration.
25
26
of a language depends on acceptable performance of the generated target programs, a fact that has influenced high-level language design since the very beginning a. The necessity to find a viable compromise between the dual goals of high level language abstraction and target code performance is a key issue for the design of any programming language. For HPC systems this is particularly critical due to the necessity of achieving the best possible target code performance for the strategically important applications for which these architectures are being designed: the overarching goal is to make scientists and engineers more productive by increasing programming language usability and time-to-solution, without sacrificing performance. Within the sequential programming paradigm, a steady and gradual evolution from assembly language to higher level languages took place, triggered by the initial success of FORTRAN, COBOL and Algol in the 1960s. Together with the development of techniques for their automatic translation into machine language, this brought about a fundamental change: although the user lost direct control over the generation of machine code, the progress in building optimizing compilers led to wide-spread acceptance of the new, high-level programming model whose advantages of increased reliability and programmer productivity outweighed any performance penalties. Unfortunately, no such development happened for parallel architectures. Exploiting current HPC architectures, most of which are custom MPPs or clusters built from off-the-shelf commodity components, has proven to be difficult, leading to a situation where users are forced to adopt a low-level programming paradigm based upon a standard sequential programming language (typically Fortran or C/C++), augmented with message passing constructs. In this model, the user deals with all aspects of the distribution of data and work to the processors, and controls the program’s execution by explicitly inserting message passing operations. Such a programming style is error-prone and results in high costs for software production. Moreover, even if a message passing standard such as MPI is used, the portability of the resulting programs is limited since the characteristics of the target architectures may require extensive restructuring of the code.
&John Backus(1957): “...It was our belief that if FORTRAN ... were to translate any reasonable “scientific” source program into an object program only half as fast as its hand-coded counterpart, then acceptance of our system would be in serious danger ...”
27
In this paper, we address the issue of providing a high-level programming interface for HPC systems. After discussing previous work in Section 2 we outline the requirements posed by emerging peta-scale systems in Section 3. This is followed by a description of some key features of the language Chapel in Section 4,which focuses on the support provided for multithreading and locality management. Section 5 concludes the paper.
2, Languages for Scientific Parallel Programming
With the emergence of distributed-memory machines in the 1980s the issue of a suitable programming paradigm for these architectures, in particular for controlling the tradeoff between locality and parallelism, became important. The earliest (and still dominant) approach is represented by the fragmented programming model: data structures and program state are partitioned into segments explicitly associated with regions of physical memory that are local to a processor (or an SMP); control structures have to be partitioned correspondingly. Accessing non-local state is expensive. The overall responsibility for the management of data, work, and communication is with the programmer. The most popular versions of this explicitly parallel approach today use a combination of C, C++, or Fortran with MPI. It soon became clear that a higher-level approach to parallel programming was desirable, based on data-parallel languages and the SingleProgram-Multiple-Data (SPMD) paradigm, with a conceptually single thread of control coupled with user-specified annotations for data distribution, alignment, and datalthread affinity. For such languages, many low-level details can be left to the compiler and runtime system. High Performance Fortran (HPF) became the trademark for a class of languages and related compilation and runtime system efforts that span more than a decade. Some of the key developments leading to HPF included the Kali language and compiler the SUPERB restructuring system 2 , and the Fortran D and Vienna Fortran languages. HPF-1 5, completed in 1993, was soon recognized as being too constrained by relying exclusively on regular data distributions, resulting in performance drawbacks for important classes of applications. HPF-2 and HPF+ extended the power of the distribution mechanism to accommodate dynamic and irregular applications '; HPF+ took the additional step of providing high-level support for communication management. A related approach based on the Japanese HPF version JA-HPF, reached in 2002 a performance of 12.5 Teraflops for
28
a plasma code on the Earth Simulator g . As the interest of the user community in HPF waned due to shortcomings of the language and weak compiler support, a number of languages rose in its place, commonly referred to as partitioned global address space languages. The best-known examples are Co-Array Fortran lo, Unified Parallel C 11, and Titanium 1 2 . These languages are similar due to their support for regular data distributions which are operated on in SPMD style; they have the advantage of being easier to compile than HPF, but achieve this by shifting some of that burden back t o programmers by requiring them to return to the fragmented programming model. In summary, these languages offer a compromise between MPI and HPF in the tradeoff between programmability and performance. Other language developments include OpenMP l 3 and ZPL 14. OpenMP is one of the few current parallel programming providing a non-fragmented, global view of programming; however its execution model assumes a uniform shared address space, which severely limits its scalability. ZPL supports parallel computation via user-defined index sets called regions 14, which may be multidimensional, sparse, or strided and are used both t o declare distributed arrays and t o operate on them in parallel. 3. New Challenges and Requirements A new generation of HPC architectures is currently being designed in the USA and Japan, with the goal of reaching Petaflops performance by 2010. The rising scale and complexity of these emerging systems, together with large-scale applications, pose major new challenges for programming languages and environments. These include the massive parallelism of future systems consisting of tens of thousands processing and memory components, the extreme non-uniformities in latencies and bandwidths across the machine, and the increase in size and complexity of HPCS applications, as researchers develop full-system simulations encompassing multiple scientific disciplines. Such applications will be built from components written in different languages and using different programming paradigms. These and related challenges imply a set of of key requirements that new languages must address: (1) Multithreading: Languages must provide explicit as well as implicit mechanisms for creating and managing systems of parallel threads that exploit the massive parallelism provided by future ar-
29
chitectures. (2) Locality-Aware Computation: Future architectures are expected to provide hardware-supported global addressing; however, the discrepancies between latencies and bandwidths across different memory levels will make locality-aware computation a necessary requirement for achieving optimum target code performance. (3) Programming-in-the-Large: Emerging advanced application systems will be characterized by a dramatic increase in size and complexity. It is imperative that HPC languages support componentbased development allowing the seamless interaction of different languages, programming paradigms, and scientific disciplines in a single application. (4) Portability and Legacy Code Integration: Taking into account the huge investment in applications, their long lifetime, and the frequent change in the supporting hardware platforms makes the automatic or semi-automatic porting and integration of legacy codes a high-priority requirement. Under the necessary constraint of retaining high performance this is a demanding task requiring a sophisticated approach and intelligent tools. There are a number of additional key requirements, which are beyond the scope of this paper. They include fault tolerance and robustness and a range of intelligent tools supporting debugging, testing, static analysis, model checking, and dynamic monitoring. 4. Chapel
4.1. Overview
In this section we discuss some design aspects of Chapel, a new language developed in the Cascade project led by Cray Inc. Cascade is a project in the DARPA-funded High Productivity Computing Systems (HPCS) program. Chapel pushes the state-of-the-art in programming for HPC systems by focusing on productivity. In particular Chapel combines the goal of highest possible object code performance with that of programmability by supporting a high level interface resulting in shorter time-to-solution and reduced application development cost. The design of Chapel is guided by four key areas of programming language technology: multithreading, locality-awareness in the sense of HPF, object-orientation, and generic programming supported by type inference and cloning. The language provides a global name space.
30
In the following we will outline those features of Chapel that are related to the control of parallelism and distributions. A more complete discussion of Chapel can be found in 15. 4.2. Domains, Indez Sets, and A m y s
The key feature of Chapel for the control of parallelism and locality is the domain. Domains are first-class entities whose primary component is a named index set that can be combined with a distribution and associated with a set of arrays. Domains provide a basis for the construction of sequential and parallel iterators. Index sets in Chapel are far more general than the Cartesian products of integer intervals used in conventional languages; they include enumerations] sets of object references associated with a class, and sparse subsets. This leads to a powerful generalization of the standard array concept. Distributions map indices to locales, which are the virtual units of locality provided t o executions of a Chapel program. The language provides a powerful facility for specifying user-defined distributions in an objectoriented framework. In addition] alignment between different data structures, and afinity between data and threads can be expressed at a high level of abstraction. Arrays associated with a domain inherit the domain’s index set and distribution. The index set determines how many component variables are allocated for the array, and the distribution controls the allocation of these variables in virtual memory. The example below illustrates some of these basic relationships: D is a one-dimensional arithmetic domain with index set (1, . . . , n } and a cyclic distribution; A1 and A2 are two arrays associated with D . c o n s t D: var var
domain (1) = [I. .n] d i s t (cyclic); Al: a r r a y CD] of float; A2: a r r a y [D] of integer;
... forall
i i n D on D.locale(i) { Al(i) = . . . }
Both arrays have n component variables and are distributed in a roundrobin manner to the locales assigned to the program execution. The forall loop is a parallel loop with an iteration range specified by the index set of
31
D. Iteration i is executed on the locale to which element i of that index set is mapped. 4.3. Specification of Distributions in Chapel
Standard distributions such as block or cyclic have been used in many highlevel language extensions. Most of these languages rely on a limited set of built-in distributions, which are based on arrays with Cartesian product index sets. Chapel extends existing approaches for the specification and management of data distributions in major ways: 0
0
0
0
Distributions in Chapel support the general approach to domains and their index sets in the language. Thus, distributions can not only be applied to arithmetic index sets but also, for example, to indices that are sets of object references associated with a class. Domains can be dynamic and allow their index set to change in a program-controlled way, reflecting structural changes in associated data structures. Distributions can be adjusted as well in an incremental manner. Chapel provides a powerful object-oriented mechanism for the specification of user-defined distribution classes. An instance of such a class, created in the context of a domain, results in a distribution for the index set of the domain. As a consequence, Chapel does not rely on a pre-defined set of built-in distributions. Each distribution required in a Chapel program is created via the user-defined specification capability. In addition t o defining a mapping from domain indices t o locales, the distribution writer can also specify the layout of data within locales. This additional degree of control can be used to further enhance performance by exploiting knowledge about data and their access patterns.
4.3.1. User-Defined Distributions: An Example User-defined distributions are specified in the framework of a class hierarchy, which is rooted in a common abstract class, denoted by distribution. All specific distributions are defined as subclasses of this class, which provides a general framework for constructing distributions. Here we illustrate the
32
basic idea underlying the user-defined distribution mechanism by showing how to implement a cyclic distribution. class c y c l i c
{
implements d i s t r i b u t i o n ;
blocksize: integer;
/*
The constructor arguments sd and td respectively specify the index set to be distributed and the
index set of the target locales */ constructor c r e a t e ( s d : d o m a i n , t d : d o m a i n , c : i n t e g e r ) {
t h i s . SetDomain(sd) ; t h i s . SetTarget (td) ; if (c, using Fortran notation for clarity, to the local portion of the distributed array. The index s=lbound(h)+w marks the start of the local computational domain and the index e=ubound(h)-w marks the end of the local computational domain. We briefly describe how this memory allocation might be implemented on different architectures. On pure distributed memory, each PET gets an array h(s-w ,e+w), where the extra points on either side store a local copy of data from neighbouring PETs (the “halo”). The user performs operations (“halo updates”) at appropriate times to ensure that the halo contains a correct copy of the remote data. On flat shared memory, one PET requests memory for all nx points while the others wait. Each PET is given a pointer into its local portion of this single array. On distributed shared memory (DSM or ccNUMA: Lenoski and Weber, 1995) PETs can share an address space beyond the flat shared portion. This means access is non-uniform: the processor-to-memory distance varies across the address space and this variation must be taken into account in coding for performance. Most modern scalable systems fall into the DSM category. We call the set of PETs sharing flat (uniform) access to a block of physical memory an mNode, and the set of PETs sharing an address space an aNode. Whether it is optimal to treat an entire aNode as shared memory, or distributed memory, or some tunable combination thereof depends on the platform (hardware), efficiency of the underlying parallelism semantics for shared and distributed memory (software), and even problem size. Memory request procedures must retain the flexibility to optimize distributed array allocation under a variety of circumstances.
289
5.2. Memory access
The array pointer h and the indices s and e offer a uniform semantic view into a distributed array, with the following access rules: 0
0
0
0
0
Each array element h (s :e) is exclusively “owned” by one and only one PET. A PET always has READ access to the array elements h(s :e ) , which it owns. A PET can request READ access to its halos, h(s-w:s-I) and h(e+l :e+w) . A PET can acquire WRITE access to the array elements h(s:e). No other PET can ever acquire WRITE access to this portion of the distributed array. A PET never has WRITE access to its halos, h(s-w:s-l) and h(e+l :e+w) .
In practice, it is usually incumbent on the user to exercise restraint in writing outside the computational domain. There may be no practical means to enforce the last rule. Also, there may be occasions when it is advantageous to perform some computations in the halo locally, rather than seeking to share remote data. But for this discussion we assume the last rule is never. Under these rules, we extend the access states and access operations, which we defined earlier for a single variable, to apply to a single object called a DistributedArray. These access semantics are the key step in defining a uniform programming interface f o r distributed and shared m e m ory, and we urge close reading. We again define two access states: 0
0
READ access for a DistributedArray always refers to the halo points. READ access to interior points is always granted. WRITE access always refers to the interior points. It is never granted for the halo points.
We also define three access operations: 0
0
A REQUEST for WRITE access to a DistributedArray must be posted, and fulfilled, before modifying the array contents, i.e., before it appears on the LHS of an equation. A REQUEST for READ access must be posted, and fulfilled, before an array’s halo appears on the RHS of an equation. A REQUIRE for READ or WRITE access follows a matching REQUEST.
290 0
A RELEASE of WRITE access must be posted immediately following completion of updates of the computational domain of an array. A RELEASE of READ access must be posted following completion of a computation using an array’s halo points.
Only the REQUIRE call is blocking; REQUEST and RELEASE are non-blocking. These rules are specialized for arrays representing grid fields, where data dependencies are regular and predictable. These constitute a vast simplification over rules developed for completely general and unpredictable data access, where read and write permission require more complicated semantics, as one sees for example in pthreads or any other general multithreading library. We summarize the main points of our model. Each array element is owned by a PET, which is the only one that can ever have WRITE access to it. WRITE access on a PET is to the interior elements of the array and READ access is to the halo. It is taken for granted that WRITE access to the halo is always forbidden, and READ access to the interior is always permitted. Simultaneous READ and WRITE access to an array is never permitted. If a PET has WRITE access to its computational domain, its neighbours will have WRITE access to its halo. Hence, a PET cannot have READ access to its halos when if has WRITE access to its interior points. To put it another way, one cannot write a loop of the following form,
(h(i) = a*( h(i+l)
-
h(i-1) ) FORALL il
(4)
is not well-defined on distributed arrays. This is not an onerous restriction. As is easily seen, this is the same rule that must be obeyed to avoid vector dependencies: the rules for writing distributed array loops under the model outlined here are exactly those that must be followed for vectorization as well. READ access and WRITE access form a pair of orthogonal mutual exclusion locks. When a PET owns WRITE access to a DistributedArray, its neighbours cannot have READ access to their halo points, because those points may be updated by the PET that owns them. Conversely, when a PET owns READ access to its halo, its neighbours must wait for the array to be released before they can get WRITE access. At this point in the development of the argument, we return to the unison assumption introduced in Section 2.2. Each distributed array goes through an access cycle, typically once per iteration. The unison assumption requires that no PET can ever be more than one access cycle ahead, in the instruction sequence, of any other PET with which it has a data dependency. The following code block illustrates how our uniform memory model might be applied to the shallow water equations.
BEGIN TIME LOOP
   RequireREAD( u )
   RequireWRITE( h )
   h(i) = h(i) - (0.5*H*dt/dx)*( u(i+1) - u(i-1) )   FORALL i
   Release( u )
   Release( h )
   RequestREAD( h )
   RequestWRITE( u )
   RequireREAD( h )
   RequireWRITE( u )
   u(i) = u(i) - (0.5*g*dt/dx)*( h(i+1) - h(i-1) )   FORALL i
   Release( h )
   Release( u )
   RequestREAD( u )
   RequestWRITE( h )
END TIME LOOP
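For concreteness, the two FORALL statements can also be written out serially. The sketch below is a plain NumPy transcription on a single periodic domain; the grid size, constants and number of steps are illustrative choices of ours, not values from the paper, and the comments mark where the access calls above would sit in a distributed run.

import numpy as np

# Illustrative setup: 64-point periodic domain for the shallow-water update above.
nx, H, g, dt, dx = 64, 10.0, 9.81, 0.01, 1.0
h = 1.0 + 0.1 * np.exp(-((np.arange(nx) - nx / 2.0) ** 2) / 20.0)    # height field
u = np.zeros(nx)                                                      # velocity field

for step in range(100):                                          # BEGIN TIME LOOP
    # RequireREAD( u ); RequireWRITE( h )
    h -= (0.5 * H * dt / dx) * (np.roll(u, -1) - np.roll(u, 1))   # h(i) update, FORALL i
    # Release( u ); Release( h ); RequestREAD( h ); RequestWRITE( u )
    # RequireREAD( h ); RequireWRITE( u )
    u -= (0.5 * g * dt / dx) * (np.roll(h, -1) - np.roll(h, 1))   # u(i) update, FORALL i
    # Release( h ); Release( u ); RequestREAD( u ); RequestWRITE( h )
                                                                  # END TIME LOOP

Here np.roll(u, -1) plays the role of u(i+1) and np.roll(u, 1) that of u(i-1); in a distributed run the shifted values at the ends of each PET's range are exactly the halo points governed by the READ semantics above.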
6. Summary

This article reviews several parallel programming models and demonstrates their essential unity. All memory models require semantics to know when a load from memory or a store to memory is allowed. We have extended this to include remote memory. Our access semantics are couched entirely in terms of data dependencies. We have shown that the implied high-level syntax can be implemented in existing distributed memory and shared memory programming models.
While the uniform memory model of Section 3 is in terms of a single memory location, it can be applied to any distributed data object, as we demonstrated by extending the approach to arrays in Section 5. Distributed arrays have always been a topic of interest in scalable programming. Language extensions like Co-Array Fortran (Numrich and Reid, 1998) propose this as an essential extension to programming languages, and general-purpose class libraries for arrays and matrices have been built on this basis (Numrich, 2005). The Earth System Modeling Framework (ESMF: Hill et al., 2004) also proposes a general distributed array class for grid codes. We have demonstrated a general method to describe data dependencies in distributed arrays that is agnostic about the underlying transport, but is described entirely in terms of the algorithmic stencils. For typical (e.g. nearest-neighbour) data dependencies, we show that the coding restrictions imposed by the memory model are exactly equivalent to those needed to ensure vectorization of loops. However, the model can be generalized to arbitrarily complex data dependencies, such as those resulting from arrays discretized on unstructured grids. High-level expressions of distributed data objects, we believe, are the way forward: they allow scientists and algorithm developers to describe data dependencies in a natural way, while at the same time allowing scalable library developers considerable breadth of expression of parallelism in conventional and novel parallel architectures.
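As a small illustration of describing a data dependency purely through the algorithmic stencil, the helper below is a hypothetical sketch (the function name and interface are ours, not part of ESMF, Co-Array Fortran, or the framework described here): it derives the halo a PET must import from nothing more than the stencil offsets and the index range the PET owns, leaving the transport entirely unspecified.

def halo_indices(local_start, local_end, stencil_offsets, global_n):
    """Global indices outside [local_start, local_end) that the stencil touches."""
    needed = set()
    for i in range(local_start, local_end):
        for off in stencil_offsets:
            j = (i + off) % global_n              # periodic global domain assumed
            if not (local_start <= j < local_end):
                needed.add(j)
    return sorted(needed)

# For the centred difference used in the shallow-water example, the PET owning
# indices [32, 64) of a 64-point periodic domain needs points 31 and 0 in its halo:
print(halo_indices(32, 64, (-1, +1), 64))          # -> [0, 31]

The same description carries over to an unstructured grid, where the fixed offsets are simply replaced by each element's list of neighbouring indices.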
Acknowledgements

Balaji is funded by the Cooperative Institute for Climate Science (CICS) under award number NA17RJ2612 from the National Oceanic and Atmospheric Administration, U.S. Department of Commerce. The statements, findings, conclusions, and recommendations are those of the author and do not necessarily reflect the views of the National Oceanic and Atmospheric Administration or the Department of Commerce. Numrich is supported in part by grant DE-FC02-01ER25505 from the U.S. Department of Energy as part of the Center for Programming Models for Scalable Parallel Computing and by grant DE-FG02-04ER25629 as part of the Petascale Application Development Analysis Project, both sponsored by the Office of Science. He is also supported in part by the NASA Goddard Earth Sciences and Technology Center, where he holds an appointment as a Goddard Visiting Fellow for the years 2003-2005.
References
Balaji, V., J. Anderson, I. Held, Z. Liang, S. Malyshev, R. Stouffer, M. Winton, and B. Wyman, 2005a: FMS: the GFDL Flexible Modeling System: Coupling algorithms for parallel architectures. Mon. Wea. Rev., in preparation.
Balaji, V., J. Anderson, and the FMS Development Team, 2005b: FMS: the GFDL Flexible Modeling System. Part II: The MPP parallel infrastructure. Mon. Wea. Rev., in preparation.
Balaji, V., T. L. Clune, R. W. Numrich, and B. T. Womack, 2005c: An architectural design pattern for problem decomposition. Workshop on Patterns in High Performance Computing.
Balay, S., W. D. Gropp, L. C. McInnes, and B. F. Smith, 2003: Software for the scalable solution of partial differential equations. Sourcebook of Parallel Computing, Morgan Kaufmann Publishers Inc., pp. 621-647.
Barriuso, R., and A. Knies, 1994: SHMEM User's Guide: SN-2516. Cray Research Inc.
Blackford, L. S., et al., 1996: ScaLAPACK: A portable linear algebra library for distributed memory computers - design issues and performance. Proceedings of SC'96, p. 5.
Carlson, W. W., J. M. Draper, D. E. Culler, K. Yelick, E. Brooks, and K. Warren, 1999: Introduction to UPC and language specification. Center for Computing Sciences, Tech. Rep. CCS-TR-99-157, http://www.super.org/upc/.
Chamberlain, B. L., S.-E. Choi, E. C. Lewis, C. Lin, L. Snyder, and D. Weathersby, 2000: ZPL: A machine independent programming language for parallel computers. IEEE Transactions on Software Engineering, 26(3), 197-211.
Chandra, R., R. Menon, L. Dagum, D. Kohr, D. Maydan, and J. McDonald, 2001: Parallel Programming in OpenMP. Morgan Kaufmann, Inc.
Ciotti, R. B., J. R. Taft, and J. Petersohn, 2000: Early experiences with the 512 processor single system image Origin2000. Proceedings of the 42nd International Cray User Group Conference, SUMMIT 2000, Cray User Group, Noordwijk, The Netherlands.
Gropp, W., S. Huss-Lederman, A. Lumsdaine, E. Lusk, B. Nitzberg, W. Saphir, and M. Snir, 1998: MPI: The Complete Reference. Vol. 2: The MPI-2 Extensions. MIT Press.
Gropp, W., E. Lusk, and A. Skjellum, 1999: Using MPI: Portable Parallel Programming with the Message Passing Interface. 2nd ed., MIT Press.
Haltiner, G. J., and R. T. Williams, 1980: Numerical Prediction and Dynamic Meteorology. John Wiley and Sons, New York.
Hill, C., C. DeLuca, V. Balaji, M. Suarez, A. da Silva, and the ESMF Joint Specification Team, 2004: The Architecture of the Earth System Modeling Framework. Computing in Science and Engineering, 6(1), 1-6.
HPF Forum, 1997: High Performance Fortran Language Specification V2.0.
Lenoski, D. E., and W.-D. Weber, 1995: Scalable Shared-Memory Multiprocessing. Morgan Kaufmann.
Nieplocha, J., and B. Carpenter, 1999: ARMCI: A portable remote memory copy library for distributed array libraries and compiler run-time systems. Lecture Notes in Computer Science, 1586, 533-546.
Numrich, R. W., 1997: F--: A parallel extension to Cray Fortran. Scientific Programming, 6(3), 275-284.
Numrich, R. W., 2005: Parallel numerical algorithms based on tensor notation and Co-Array Fortran syntax. Parallel Computing, in press.
Numrich, R. W., and J. K. Reid, 1998: Co-array Fortran for parallel programming. ACM Fortran Forum, 17(2), 1-31.
Quadrics, 2001: Shmem Programming Manual. Quadrics Supercomputing World Ltd.
Smith, L., and M. Bull, 2001: Development of mixed mode MPI/OpenMP applications. Scientific Programming, 9(2-3), 83-98. Presented at Workshop on OpenMP Applications and Tools (WOMPAT 2000), San Diego, Calif., July 6-7, 2000.
Supercomputing Technologies Group, 1998: Cilk 5.2 Reference Manual. Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA.
Taft, J. R., 2001: Achieving 60 GFLOP/s on the production code OVERFLOW-MLP. Parallel Computing, 27, 521-536.
Yelick, K., et al., 1998: Titanium: A high-performance Java dialect. Proceedings of the Workshop on Java for High-Performance Network Computing, Stanford, California.
PANEL EXPERIENCE ON USING HIGH PERFORMANCE COMPUTING IN METEOROLOGY
SUMMARY OF THE DISCUSSION

GEORGE MOZDZYNSKI
European Centre for Medium-Range Weather Forecasts, Shinfield Park, Reading, Berkshire, RG2 9AX, U.K.
E-mail: George.Mozdzynski@ecmwf.int
As has become customary at these workshops, the final session was dedicated to a panel where brief statements raising fundamental or controversial issues were made to open the discussions.

New Languages and New Hardware Features

The following statements introduced this topic:
- Many of the sites with large application codes have tried for years to keep codes highly portable (standard FORTRAN, MPI, OpenMP).
- There have been plenty of examples where the indiscriminate use of vendor-specific features stored up problems for the future.
- The academic argument for better languages (programming paradigms) is compelling; but what would it take before any of the major meteorological codes is developed in one?
- There are promising ideas on the hardware front (PIM, vector features in the router chip, etc.) - will these allow new programming paradigms to make a breakthrough?
A user representative started the discussion by commenting that the stability of their application codes does not stress manufacturers and that only new applications would do this. He added that the cost of developing new applications was a major concern, as was portability. Another user said that his organisation could see opportunities for systems with 3-4 orders of magnitude more computing capability and suggested that commodity clusters could play a part in realising this. A vector manufacturer representative commented that the more sites employ these clusters, the less manufacturers can provide for the future, as development budgets would be reduced.
The discussion moved on to the Top500 report and the use of Linpack to rank performance. Some user representatives commented that the Linpack benchmark was not representative of their applications. A vector manufacturer representative stated that their vector processor was about 10x faster than some scalar processors but that it was also about 10x more expensive. He added that manufacturers are more interested in strong new science on systems than in Linpack. Somebody suggested that the NWP community should provide its own report based on meteorological applications. A user representative added that his organisation did not buy systems based on their Top500 ranking and that price/performance is the most important consideration. He added that we could judge this aspect by the number and size of contracts awarded to manufacturers. Would the meteorological community be willing to develop new codes to realise performance increases of between 2 and 3 orders of magnitude? A hardware expert stated that reconfigurable logic could provide a 2 orders of magnitude increase in performance and that this was not an issue of cost. A vector manufacturer representative commented that climate/NWP codes are very diverse and do not have any significant peaks that could be handled by special hardware. A user representative endorsed this argument, and commented that his organisation's current NWP application started development in 1987, adding that this application has been in use longer than a number of computer architectures. A vendor representative stated that most NWP applications he had experience of were simply not large enough to scale well on today's supercomputers.
Another user saw a need for running NWP applications on 100 nodes today to be followed by many ensemble runs to study small scales.
Linux Clusters

The following statements introduced this topic:

- Small scale clusters seem to work well.
- Large clusters for time-critical operations?
  - Maintenance: have your own kernel programmers?
  - Applications: what can we not run on this type of system?
It was broadly agreed that the Linux O/S was as reliable as any other operating system and, given the number of systems installed, Linux was probably one of the best supported. A user commented that in his experience Linux clusters require more support than vendor-supplied systems, as the site is normally responsible for maintaining facilities such as compilers, MPI libraries and batch sub-systems. The user added that there was no obvious need for kernel programmers for small clusters. A user commented that the rate of development for Linux clusters could be viewed as a problem if you want to upgrade your system. He added that it was probably more cost effective to totally replace a cluster (e.g. faster clock, more memory) after a couple of years than to upgrade the existing system. The chairman asked the question: does the NWP community have applications that cannot be satisfied by Linux clusters? There was no clear answer. A user representative said that I/O is a problem for his NWP applications. He added that Linux clusters would require installation of parallel file system software such as Lustre to get acceptable performance. A user asked whether the success of Linux clusters could be viewed as a consequence of the fact that the HPC market is too small to cover the development costs of special hardware. A user representative replied that future model and data assimilation workloads might still need to run on special hardware to meet operational deadlines. A hardware expert expressed the opinion that government should help to fund the development of new technologies. As an example he stated that early work on Beowulf clusters (leading to Linux clusters) was government funded.
Frameworks

A user introduced this topic by expressing concern about the effort required to support ESMF in their NWP model. He said that he expected some loss of model efficiency to support ESMF and asked what should be acceptable. An ESMF expert suggested that the number of lines of code would most likely be just a few hundred and that this should have a negligible impact on performance. Another user commented that frameworks are a necessary component for coupling models, as software such as GRIB is not that portable. There was some disagreement on how intrusive frameworks would be for this model. However, it was agreed that this could provide a useful case study and topic for the next workshop at ECMWF in 2006.
LIST OF PARTICIPANTS
Mr. Eugenio Almeida
INPE/CPTEC, Rod. Presidente Dutra, km 39, Cachoeira Paulista, SP, CEP 12630-000, Brazil eugenio@cptec.inpe.br
Dr. Mike Ashworth
CCLRC Daresbury Laboratory, Warrington, WA4 3BZ, United Kingdom m.ashworth @ dl.ac.uk
Dr. Venkatramani Balaji
Princeton University/GFDL, PO Box 308, Princeton, N.J. 08542, USA v.balaji @noaa.gov
Mr. Stuart Bell
Met Office, FitzRoy Road, Exeter, EX1 3PB, United Kingdom
[email protected] Dr. Ilia Bermous
Australian Bureau of Meteorology, GPO Box 1289K, Melbourne, VIC 3001, Australia i.bermous @bom.gov.au
Mr. Tom Bettge
NCAR, 1850 Table Mesa Drive, Boulder, CO 80305, USA
[email protected] Mr. David Blaskovich
Deep Computing Group, IBM, 519 Crocker Ave., Pacific Grove, CA 93950, USA
[email protected] Mr. Jan Boerhout
NEC HPC Europe, Antareslaan 65, 2132 JE Hoofddorp, Netherlands
[email protected] Dr. Ashwini Bohra
National Centre for Medium-Range Weather Forecasts, A-50, Sector-62, Phase 11, Noida-201307, U.P., India
[email protected] 299
Mr. Reinhard Budich
Max Planck Institute for Meteorology, 20146 Hamburg, Germany budich @dkrz.de
Mr. Arry Carlos Buss Filho
INPE/CPTEC, Rod. Pres. Dutra, km 40, Cachoeira Paulista - SP, Brazil arry@cptec.inpe.br
Dr. Ilene Carpenter
SGI, 2750 Blue Water Rd., Eagan, MN 55121, USA
[email protected] Mr. George Carr Jr.
NCAR/CGD, PO Box 3000, Boulder, CO 80307-3000, USA gcarr@ucar.edu
Dr. Bob Carruthers
Cray (UK) Ltd., 2 Brewery Court, High Street, Theale, Reading RG7 5AH, United Kingdom crjrc @cray.com
Mrs. Zoe Chaplin
University of Manchester, Manchester Computing, Oxford Road, Manchester, M13 9PL, United Kingdom
[email protected] Mr. Peter Chen
World Meteorological Organization, 7 bis Ave. de la Paix, 1211 Geneva, Switzerland pchen @ wrno.int
Dr. Gerardo Cisneros
Silicon Graphics, SA de CV, Av. Vasco de Quiroga 3000, P.0.-lA, Col. Sante Fe, 01210 Mexico, DF, Mexico Gerardo @sgi.com
Dr. Thomas Clune
NASA/Goddard Space Flight Center, Code 931, Greenbelt, MD 20771, USA
[email protected] 301
Dr. Herbert Cornelius
Intel GmbH, Dornachcr Str. 1, 85622 FeldkirchedMunchen, Germany
[email protected] Mr. Franqois Courteille
NEC HPC Europe, "Le Saturne", 3 Parc Ariane, 78284 Guyancourt Cedex, France
[email protected] Mr. Jason Cowling
Fujitsu Systems Europe, Hayes Park Central, Hayes End Road, Hayes, Middx., UB4 8FE,United Kingdon
[email protected] Ms. Marijana Crepulja
Rep. Hydrometeorological Service of Serbia, Kneza Viseslava 66, 11030 Belgrade, Serbia and Montenegro maja @ meteo.yu
Mr. Jacques David
DSI, CEA-Saclay, 91191 Gif sur Yvette Cedex, France
[email protected] Mr. David Dent
NEC High Performance Computing Europe, Pinewood, Chineham Business Park, Basingstoke, RG24 8AL, United Kingdom ddent @ hpce.nec.com
Mr. Michel Desgagnk
Environment Canada, 2121 North Service Road, Trans-Canada Highway, Suite, 522, Dorval, Quebec, Canada H9P 1J3 michel.desgagne 0 cmc.ec.gc.ca
Mr. Martin Dillmann
EUMETSAT,Am Kavalleriesand 3 1, 64295 Darmstadt, Germany dillmann @eumetsat.de
Mr. Vladimir DimitrijeviC
Rep. Hydrometeorological Service of Serbia, Kneza Viseslava 66, 11030 Belgrade, Serbia and Montenegro vdimitrijevic @meteo.yu
Mr. Douglas East
Lawrence Livermore National Laboratory, PO Box 808, MS-L-73, Livermore, CA 94551, USA dre @llnl.gov
Mr. Ben Edgington
Hitachi Europe Ltd., Whitebrook Park, Lower Cookham Road, Maidenhead, SL6 SYA, United Kingdom
[email protected] Mr. Jean-Franqois Estrade
METEO-FRANCE, DSI, 42 Av. G. Coriolis, 31057 Toulouse Cedex, France jean-francokestrade @ meteo.fr
Dr. Juha Fagerholm
CSC-Scientific Computing, PO Box 4305, FIN-02101 Espoo, Finland juha.fagerholm @ csc.fi
Mr. Torgny Faxtn
National Supercomputer Center, NSC, SO58183 Linkoping, Sweden faxen @nsc.liu.se
Dr. Lars Fiedler
EUMETSAT, Am Kavalleriesand 3 1, 64295 Darmstadt, Germany Fiedler @eumetsat.de
Dr. Enrico Fucile
Italian Air Force Met Service, Via Pratica di Mare, Aer. M. de Bernardi, Pomezia (RM) Italy
[email protected] Mr. Toshiyuki Furui
NEC HPC Marketing Division, 7-1, Shiba 5-chome Minatoku, Tokyo 1088001, Japan t-furui @ bq.jp.nec.com
Mr. Fabio Gallo
Linux Networx (LNXI), Europaallee 10,67657 Kaiserslautern, Germany
[email protected] Mr. Jose A. Garcia-Moya
Instituto Nacional de Meteorologia (INM), Calle Leonardo Prieto Castro 8, 28040 Madrid, Spain
[email protected]
Dr. Koushik Ghosh
NOANGFDL, Princeton University Forrestal Campus, Princeton, N.J., USA
[email protected] Dr. Eng Lim Goh
Silicon Graphics, 1500 Crittenden Lane, MS 005, Mountain View, CA 94043, USA
[email protected] Mr. Mark Govett
NOAA Forecast Systems Lab., 325 Broadway, W S L , Boulder, CO 80305, USA mark.w.govett@ noaa.gov
Dr. Don Grice
IBM, 2 Autumn Knoll, New Paltz, NY 12561, USA
[email protected] Mr. Paul Halton
Met Eireann, Glasnevin Hill, Dublin 9, Ireland
[email protected] Dr. James Hamilton
Met Eireann, Glasnevin Hill, Dublin 9, Ireland
[email protected] Mr. Detlef Hauffe
Potsdam Institut fur Klimafolgenforschung, Telegrafenberg A31, D-14473 Potsdam, Germany hauffe @ pik-potsdam.de
Mr. Pierre Hildebrand
IBM Canada, 1250 blvd. ReneLevesque West, Montreal, Quebec, Canada H3B 4W2 pierreh @ca.ibm.com
Mr. Chris Hill
Massachusetts Insitute of Technology, 54-1515, M.I.T. Cambridge, MA 02139, USA cnh @mit.edu
Dr. Richard Hodur
Naval Research Laboratory, Monterey, CA 93943-5502, USA hodur @ nrlmry.navy.mil
Mr. Jure Jerman
Environmental Agency of Slovenia, SI-1000 Ljubljana, Slovenia j ure.jerman @ rzs-hm.si
Mr. Hu Jiangkai
National Meteorological Center, Numerical Weather Prediction Div., 46 Baishiqiao Rd., Beijing 10008, P.R. of China
[email protected] Dr. Zhiyan Jin
Chinese Academy of Meteorological Sciences, 46 Baishiqiao Road, Beijing 100081, P.R. of China j inzy @ cma.gov.cn
Mr. Jess Joergensen
University College Cork, College Road, Cork, CO. Cork, Ireland jesjoergensen @ wanadoo.dk
Mr. Bruce Jones
NEC HPC Europe, Pinewood, Chineham Business Park, Basingstoke, RG24 8AL, United Kingdom bjones @ hpce.nec.com
Mr. Dave Jursik
IBM Corporation, 3600 SE Crystal Springs Blvd., Portland, Oregon 97202, USA
[email protected] Mr. Tuomo Kauranne
Lappeenranta University of Technology, POB 20, FIN-53851 Lappeenranta, Finland
[email protected] Dr. Crispin Keable
SGI, 1530 Arlington Business Park, Theale, Reading, RG7 4SB, United Kingdom
[email protected] 305
Mr. Al Kellie
NCAR, 1850 Table Mesa Drive, Boulder, CO 80305, USA kellie @ucar.edu
Mrs. Kerstin Kleese-van Dam
CCLRC-Daresbury Laboratory, Warrington, WA4 4AD, United Kingdom
[email protected] Ms. Maryanne Kmit
Danish Meteorological Institute, Lyngbyvej 100, DK-2100 Copenhagen Ø, Denmark kmit@dmi.dk
Dr. Luis Kornblueh
Max-Planck-Institute for Meteorology, D-20146 Hamburg, Germany kornblueh @dkrz.de
Mr. William Kramer
NERSC, 1 Cyclotron Rd., M/S 50B4230, Berkeley, CA 94550, USA wtkramer@ lbl. gov
Dr. Elisabeth Krenzien
Deutscher Wetterdienst, POB 10 04 65,63004 Offenbach, Germany
[email protected] Mr. Martin Kucken
Potsdam Institut fur Klimafolgenforschung, Telegrafenberg A3 1, D- 14473 Potsdam, Germany kuecken @pik-potsdam.de
Mr. Kolja Kuse
SCSuperComputingServices Ltd., Oberfohringer Str. 175a, 81925 Munich, Germany kolja.kuse @ terrascale.de
Mr. Christopher Lazou
HiPerCom Consultants Ltd., 10 Western Road, London, N2 9HX, United Kingdom chris @lazou.demon.co.uk
Ms. Vivian Lee
Environment Canada, 2121 N. Trans Canada Hwy, Dorval, Quebec, Canada, H9P 1J3
[email protected] Mr. John Levesque
Cray Inc., 10703 Pickfair Drive, Austin, TX 78750, USA levesque @cray.com
Mr. RenB van Lier
KNMI, PO Box 201,3730 AE De Bilt, The Netherlands rene.van.lier @ knmi.nl
Dr. Rich Loft
NCAR, 1850 Table Mesa Drive, Boulder, CO 80305, USA loft @ucar.edu
Mr. Thomas Lorenzen
Danish Meteorological Institute, Lyngbyvej 100, DK-2100 Copenhagen Ø, Denmark tl@dmi.dk
Mr. Ian Lumb
Platform Computing, 3760 14" Avenue, Markham, Ontario L3R 3T7, Canada ilumb @platform.com
Mr. Wai Man Ma
Hong Kong Observatory, 134A, Nathan Road, Tsim Sha Tsui, Kowloon, Hong Kong
[email protected] Dr. Alexander MacDonald
NOAA Forecast Systems Lab., 325 Broadway, W S L , Boulder, CO 80305, USA Alexander.e.macdonald@ noaa.gov
Mr. Moray McLaren
Quadrics Ltd., One Bridewell St., Bristol, BS1 2AA, United Kingdom moray@quadrics.com
Mr. John Michalakes
NCAR, 3450 Mitchell Lane, Boulder, CO 80301, USA
[email protected] 307
Mr. Aleksandar Miljkovid
“Coming-Computer Engineering”, Tosejovanovica 7, 1030 Belgrade, Serbia and Montenegro aleksandar.miljkovic @coming.co.yu
Prof. Nikolaos Missirlis
University of Athens, Dept. of Informatics, Panepistimiopolis, Athens, Greece
[email protected] Mr. Chuck Morreale
Cray Inc., 6 Paddock Dr., Lawrence, N.J. 08648, USA
[email protected] Mr. Guy de Morsier
MeteoSwiss, Krahbuhlstr. 58, CH8044 Zurich, Switzerland gdm @ meteoswiss.ch
Mr. Masami Narita
Japan Meteorological Agency, 1-3-4 Ote-machi, Chiyoda-ku, Tokyo, Japan
[email protected] Dr. Lars Nerger
Alfred-Wegener-Institut fur Polarund Meeresforschung, Am Handelshafen 12,27570 Bremerhaven, Germany lnerger @awi-bremerhaven.de
Mr. Wouter Nieuwenhuizen
KNMI, PO Box 201,3730 AE De Bilt, The Netherlands wouter.nieuwenhuizen@ knmi.nl
Mr. Dave Norton
HPFA, 2351 Wagon Train Trail, South Lake Tahoe, CA 96150-6828, USA norton@hpfa.com
Mr. Per Nyberg
Cray Inc., 3608 Boul. St-Charles, Kirkland, Quebec, H9H 3C3, Canada nyberg @cray.inc
Mr. Michael O’Neill
Fujitsu Systems Europe, Hayes Park Central, Hayes End Road, Hayes, Middx., UB4 8FE, United Kingdom
Mike.O'[email protected]
Mr. Yuji Oinaga
Fujitsu, 1-1, Kamikodanaka 4-chome, Nakahara-ku, Kawasaki-shi, Kanagawa, 21 1-8588, Japan
[email protected] Dr. Stephen Oxley
Met Office, FitzRoy Road, Exeter, EX1 3PB, United Kingdom stephen.oxley @metoffice.gov.uk
Mr. Jairo Panetta
INPE/CPTEC, Rod. Presidente Dutra, Km 40, Cx. Postal 01, 12630-000 Cachoeira Paulista - SP, Brazil
[email protected] Dr. Hyei-Sun Park
Korea Institute of Science and Technology Information (KISTI), 52 Yuseong, Daejeon, S. Korea hsunpark@kisti.re.kr
Mr. Simon Pellerin
Meteorological Service of Canada, 2121 Trans-Canada Highway North, Suite 504, Dorval, Quebec, Canada, H9P 153 simon.pellerin@ ec.gc.ca
Mr. Kim Petersen
NEC High Performance Computing Europe, Prinzenallee 11,40549 Dusseldorf, Germany kpetersen @ hpce.nec.com
Dr. Jean-Christophe Rioual
NEC HPCE, 1 Emperor Way, Exeter Business Park, Exeter, EX1 3GS, United Kingdom jrioual @hpce.nec.com
Dr. Ulrich Schattler
Deutscher Wetterdienst, POB 10 04 65,63004 Offenbach, Germany ulrich.schaettler @ dwd.de
Dr. Joseph Sela
US National Weather Service, 5200 Auth Rd., Room 207, Suitland, MD 20746,USA
[email protected] 309
Mr. Wolfgang Sell
Deutsches Klimarechenzentrum GmbH (DKRZ), Bundesstr. 55, D20146 Hamburg Germany
[email protected] Dr. Paul Selwood
Met Office, FitzRoy Road, Exeter, EX1 3PB, United Kingdom
[email protected] Mr. Robin Sempers
Met Office, FitzRoy Road, Exeter, EX1 3PB, United Kingdom robimempers @metoffice.gov.uk
Mr. Eric Sevault
METEO-FRANCE, 42 Av. Coriolis, 31057 Toulouse Cedex, France eric.sevault@meteo.fr
Dr. Johan Sile’n
Finnish Meteorological Institute, Helsinki, Finland johansilen @ fmi.fi
Dr. Roar SkHlin
Norwegian Meteorological Institute, PO Box 43, Blindern, 0313 Oslo, Norway
[email protected] Mr. Niko Sokka
Finnish Meteorological Institute, PO Box 503,00101 Helsinki, Finland
[email protected] Mr. Jorg Stadler
NEC High Performance Computing Europe, Prinzenallee 11,40549 Diisseldorf, Germany
[email protected] Mr. Alain St-Denis
Meteorological Service of Canada, 2121 North Service Road, Transcanada Highway, Dorval, Quebec, Canada H9P 1J3
[email protected] Dr. Lois Steenman-Clark
University of Reading, Dept. of Meteorology, PO Box 243, Reading, RG6 6BB, United Kingdom
[email protected] 310
Mr. Thomas Sterling
Caltech, CACR, MC 158-79, 1200 E. California Blvd., Pasadena, CA, USA tron@cacr.caltech.edu
Dr. Conor Sweeney
Met Eireann, Glasnevin Hill, Dublin 9, Ireland conor.Sweeney @ met.ie
Dr. Mark Swenson
FNMOC, 7 Grace Hopper Ave., Monterey, CA 95943, USA mark.swenson@ fnmoc.navy.mil
Dr. Keiko Takahashi
The Earth Simulator Center/JAMSTEC, 3173-25 Showamachi, Kanazawa-ku, Yokohama, Kanagawa 236-0001, Japan takahasi @jamstec.go.jp
Mr. Naoya Tamura
Fujitsu Systems Europe, Hayes Park Central, Hayes End Road, Hayes, Middx., UB4 8FE, United Kingdon
[email protected] Dr. Ulla Thiel
Cray Europe, Waldhofer Str. 102, D69123 Heidelberg, Germany
[email protected] Dr. Mikhail Tolstykh
Russian Hydrometeorological Research Centre, 9/11 B. Predtecenskii per., 123242 MOSCOW, Russia tolstykh@ mecom.ru
Ms. Simone Tomita
CPTECIINPE, Rod. Pres. Dutra, km 40, CP 01, 12630-000 Cachoeira Paulista - SP, Brazil
[email protected] Mr. Joseph-Pierre Toviessi
Meteorol. Service of Canada, 2121 Trans-Canada Highway, Montreal, Que. Canada H9P 1J3 joseph-pierre.toviessi @ec.gc.ca
Mr. Eckhard Tschirschnitz
Cray Computer Deutschland GmbH, Wulfsdorfer Weg 66, D-22359 Hamburg, Germany
[email protected] Mr. Robert obelmesser
SGI GmbH, Am Hochacker 3,85630 Grasbrunn, Germany
[email protected] Dr. Atsuya Uno
Earth Simulator Center, JAMSTEC, 3 173-25 Showa-machi, Kanazawa-ku, Yokohama, Kanagawa 236-0001, Japan
[email protected] Dr. Ole Vignes
Norwegian Meteorological Institute, PO Box 43, Blindern, 0313 Oslo, Norway ole.vignes @met.no
Mr. John Walsh
NEC HPC Europe, Pinewood, Chineham Business Park, Basingstoke, RG24 8AL, United Kingdom j walsh @ hpce.nec.com
Mr. Bruce Webster
NOAA/National Weather Service, National Centers for Environmental Prediction, 5200 Auth Road, Camp Springs, MD 20746, USA
[email protected] Dr. Jerry Wegiel
HQ Air Force Weather Agency, 106 Peacekeeper Drive, STE 2N3, Offutt AFB, NE 681 13-4039, USA jerry.wegiel @afwa.af.mil
Mr. Jacob Weismann
Danish Meteorological Institute, Lyngbyvej 100, DK-2100 Copenhagen Ø, Denmark jwp@dmi.dk
Mr. Shen Wenhai
National Meteorological Information Center, 46 Baishiqiao Rd., Beijing, 100081, PR of China
[email protected] 312
Dr. Gunter Wihl
ZAMG, Hohe Warte 38, A-1191 Vienna, Austria gunter.wihl @ zamg.ac.at
Mr. Tomas Wilhelmsson
Swedish Meteorological & Hydrological Institute, SE-601 76 Norrkoping, Sweden
[email protected] Dr. Andrew Woolf
CCLRC, Rutherford Appleton Laboratory, Chilton, Didcot, OX1 1 OQX, United Kingdom
[email protected] Prof. Hans Zima
University of Vienna, Austria, & Jet Propulsion Laboratory, California Institute of Technology, 4800 Oak Grove Drive, Pasadena, CA 91109-8099, USA zima@jpl.nasa.gov
ECMWF: Erik Anderson Sylvia Baylis Anton Beljaars Horst Bottger Philippe Bougeault Paul Burton Jens Daabeck Matteo Dell’Acqua Mithat Ersoy Richard Fisker Mats Hamrud Alfred Hofstadler Mariano Hortal Lars Isaksen Peter Janssen Norbert Kreitz Franqois Lalaurette Dominique Marbouty Martin Miller
Head, Data Assimilation Section Head, Computer Operations Section Head, Physical Aspects Section Head, Meteorological Division Head, Research Department Numerical Aspects Section Head, Graphics Section Head, Networking & Security Section Servers & Desktop Group Head, Servers & Desktops Section Data Assimilation Section Head, Meteorological Applications Section Head, Numerical Aspects Section Data Assimilation Section Head, Ocean Waves Section User Support Section Head, Meteorological Operations Section Director Head, Model Division
Stuart Mitchell Umberto Modigliani George Mozdzynski Tim Palmer Pam Prior Luca Romita Sami Saarinen Deborah Salmond Adrian Simmons Neil Storer Jean-Noel Thepaut Saki Uppala Nils Wedi Walter Zwieflhofer
Servers & Desktop Group Head, User Support Section Systems Software Section Head, Probabilistic Forecasting & Diagnostics Division User Support Section Servers & Desktop Group Satellite Data Section Numerical Aspects Section Head, Data Division Head, Systems Software Section Head, Satellite Data Section ERA Project Leader Numerical Aspects Section Head, Operations Department