DEVELOPMENTS IN TERACOMPUTING
Proceedings of the Ninth ECMWF Workshop on the Use of High Performance Computing in Meteorology
Reading, UK, November 13-17, 2000
Editors
Walter Zwieflhofer, Norbert Kreitz
European Centre for Medium-Range Weather Forecasts, Reading, UK
World Scientific
New Jersey • London • Singapore • Hong Kong
Published by World Scientific Publishing Co. Pte. Ltd. P O Box 128, Farrer Road, Singapore 912805. USA office: Suite 1B, 1060 Main Street, River Edge, NJ 07661. UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE.
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
DEVELOPMENTS IN TERACOMPUTING Proceedings of the Ninth ECMWF Workshop on the Use of High Performance Computing in Meteorology Copyright © 2001 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 981-02-4761-3
Printed in Singapore by Mainland Press
PREFACE
In an earlier Workshop of this series, TeraComputing was defined as achieving a sustained performance of at least one teraflop (10^12 floating-point operations per second) in a production environment. In November 2000, around 140 researchers in the fields of meteorology and climatology and experts in high-performance computing gathered at the European Centre for Medium-Range Weather Forecasts (ECMWF) to discuss the latest developments in the quest for TeraComputing.

Such high levels of computational performance are required in order to realise the potential for prediction of weather and climate on all timescales. Effective high-performance computing is needed for accurate numerical modelling of the earth's atmosphere and of conditions at its ocean, ice and land boundaries, and for the assimilation of the wealth of in-situ and satellite-based observations that provide the high-quality initial conditions essential for accurate forecasts. It is also needed for running the ensembles of predictions that characterize the probability of occurrence of events and thereby enable informed decision-making and exploitation of the value inherent in the forecasts. The challenges and opportunities for medium- and extended-range weather prediction were discussed comprehensively by ECMWF's Deputy Director, Anthony Hollingsworth, in his keynote lecture published in the Proceedings of the Eighth Workshop on the Use of Parallel Processors in Meteorology, and reiterated by Adrian Simmons and Martin Miller in their introductory lectures to this Ninth Workshop.

As the week progressed, the lectures provided an insight into what various high-performance computing sites around the world consider the most appropriate algorithmic techniques and parallelisation strategies to make effective use of the available computer architectures. The round-table discussion on the last day of the Workshop showed that the relative merits of vector parallel systems versus SMP systems continue to be hotly debated and that the performance gains expected of techniques such as combining OpenMP and MPI have yet to be demonstrated.

The next Workshop in this series is planned for November 2002 and one can confidently expect significant progress towards effective TeraComputing between now and then.

Walter Zwieflhofer
CONTENTS
Preface
v
Research and development of the earth simulator K. Yoshida and S. Shingu
1
Parallel computing at Canadian Meteorological Centre J. -P. Toviessi, A. Patoine, G. Lemay, M. Desgagne, M. Valin, A. Qaddouri, V. Lee and J. Cote
14
Parallel elliptic solvers for the implicit global variable-resolution grid-point GEM model: Iterative and fast direct methods A. Qaddouri and J. Cote
22
IFS developments D. Dent, M. Hamrud, G. Mozdzynski, D. Salmond and C. Temperton
36
Performance of parallelized forecast and analysis models at JMA Y. Oikawa
53
Building a scalable parallel architecture for spectral GCMs T. N. Venkatesh, U. N. Sinha and R. S. Nanjundiah
62
Semi-implicit spectral element methods for atmospheric general circulation models R. D. Loft and S. J. Thomas
73
Experiments with NCEP's spectral model J. -F. Estrade, Y. Tremolet and J. Sela
92
The implementation of I/O servers in NCEP's ETA model on the IBM SP J. Tuccillo
100
Implementation of a complete weather forecasting suite on PARAM 10 000 S. C. Purohit, A. Kaginalkar, J. V. Ratnam, J. Raman and M. Bali
104
Parallel load balance system of regional multiple scale advanced prediction system J. Zhiyan
110
Grid computing for meteorology G.-R. Hoffmann
119
The requirements for an active archive at the Met Office M. Carter
127
Intelligent support for high I/O requirements of leading edge scientific codes on high-end computing systems — The ESTEDI project K. Kleese and P. Baumann
134
Coupled marine ecosystem modelling on high-performance computers M. Ashworth, R. Proctor, J. T. Holt, J. I. Allen and J. C. Blackford
150
OpenMP in the physics portion of the Met Office model R. W. Ford and P. M. Burton
164
Converting the Halo-update subroutine in the Met Office unified model to co-array Fortran P. M. Burton, B. Carruthers, G. S. Fischer, B. H. Johnson and R. W. Numrich
177
Parallel ice dynamics in an operational Baltic Sea model T. Wilhelmsson
189
Parallel coupling of regional atmosphere and ocean models S. Frickenhaus, R. Redler and P. Post
201
Dynamic load balancing for atmospheric models G. Karagiorgos, N. M. Missirlis and F. Tzaferis
214
HPC in Switzerland: New developments in numerical weather prediction M. Ballabio, A. Mangili, G. Corti, D. Marie, J. -M. Bettems, E. Zala, G. de Morsier and J. Quiby
228
The role of advanced computing in future weather prediction A. E. MacDonald
240
The scalable modeling system: A high-level alternative to MPI M. Govett, J. Middlecoff, L. Hart, T. Henderson and D. Schaffer
251
Development of a next-generation regional weather research and forecast model J. Michalakes, S. Chen, J. Dudhia, L. Hart, J. Klemp, J. Middlecoff and W. Skamarock
269
Parallel numerical kernels for climate models V. Balaji
277
Using accurate arithmetics to improve numerical reproducibility and stability in parallel applications Y. He and C. H. Q. Ding
296
Parallelization of a GCM using a hybrid approach on the IBM SP2 S. Cocke and Z. Christidis
318
Developments in high performance computing at Fleet Numerical Meteorology and Oceanography Center K. D. Pollak and R. M. Clancy
327
The computational performance of the NCEP seasonal forecast model on Fujitsu VPP5000 at ECMWF H. -M. H. Juang and M. Kanamitsu
338
Panel experience on using high performance computing in meteorology — Summary of the discussion P. Prior
348
List of participants
353
RESEARCH & DEVELOPMENT OF THE EARTH SIMULATOR
KAZUO YOSHIDA
Earth Simulator Research and Development Center, Global Change Research Division, National Space Development Agency of Japan (NASDA)
SATORU SHINGU
Earth Simulator Development Team, Japan Atomic Energy Research Institute (JAERI)
A high-speed parallel computer system based on vector processors for research into global environmental changes, called the Earth Simulator (ES; previously also known as GS40), has been studied and developed at the Earth Simulator Research and Development Center (ESRDC). This project is promoted by the Science and Technology Agency of Japan (STA). In addition to facilitating research into global environmental changes, the ES will contribute to the development of computational science and engineering. The target sustained performance for the ES is 5 Tflop/s on our Atmospheric General Circulation Model (CCSR/NASDA-SAGCM). Two years ago the design of the ES and its estimated performance were introduced at the Eighth ECMWF Workshop on the Use of Parallel Processors in Meteorology [1] and there has been little change in this area since. Therefore, in this paper we describe the present status of development of the hardware system, operation support software, application software and facilities. The current schedule is to complete the ES by February 2002 and we plan to bring it into service in March 2002.
1 Overview
Earth observation, research on physical processes and computer simulation are very important elements in global environmental change studies, but global change mechanisms are tightly coupled and typically form a complex system. Large-scale simulation is the best tool to understand these kinds of complicated phenomena. As part of the Earth Simulator project, we are developing a high-speed parallel computer system based on vector processors named the Earth Simulator (ES), which we previously called GS40. The Science and Technology Agency of Japan (STA) proposed the ES project in 1996. STA promotes studies of global change through an integrated, three-in-one research and development approach: earth observation, basic research into physical modelling, and numerical simulation by means of a high-performance computer system. The National Space Development Agency of Japan (NASDA) wants to improve the usefulness of earth observation satellite data. The Japan Atomic Energy Research Institute (JAERI) wants to study the effectiveness of using atomic energy and its effect on environmental changes by applying computer science and simulation. The Japan Marine Science and Technology Center (JAMSTEC) wants to do research on marine observation data and do global change
studies. So, STA requested NASDA, JAERI and JAMSTEC to establish the ES project and the cooperative organization of the Earth Simulator Research and Development Center (ESRDC) in 1997. The main purposes of the ES are the understanding and prediction of global climate changes, the understanding of plate tectonics and the development of advanced parallel software. The requirements for the resolution of the Atmospheric General Circulation Model (AGCM) are shown in Table 1.
Table 1. Resolution requirements for the ES.
                 Global model        Regional model      Vertical layers
Present          50 - 100 km mesh    20 - 30 km mesh     -
ES target        5 - 10 km mesh      1 km mesh           100 - 200
At ESRDC, NASDA, JAERI and JAMSTEC are cooperating in the research and development of the ES system. Each of us will contribute about 1/3 of the ES system. NASDA and JAERI will develop the operation support software. NASDA and JAMSTEC will develop the AGCM and the Ocean General Circulation Model (OGCM) respectively. The ES facilities are constructed by JAMSTEC, and JAMSTEC alone will be responsible for the operation and management of the ES system after its completion, in order to avoid questions of competence. NASDA has launched the Tropical Rainfall Measuring Mission (TRMM) satellite in cooperation with the National Aeronautics and Space Administration (NASA). Furthermore, NASDA will launch the Advanced Earth Observation Satellite II (ADEOS-II) and other Global Change Observation Mission (GCOM) satellites in the future, and is going to provide the data to the ES for use by all global climate change researchers. For example, Sea Surface Temperature (SST) data retrieved from TRMM can show the details of climatic phenomena more clearly, and if satellite data are used effectively within the ES, simulations may predict future global environmental change more precisely.
2 Hardware system
The ES is a distributed-memory parallel system, which consists of 640 Processor Nodes (PNs) connected by a single-stage crossbar network. Each PN is a shared-memory system composed of 8 Arithmetic Processors (APs), 16 GBytes of shared memory, a Remote Control Unit (RCU), an Input-Output Processor (IOP) and system disk space of about 576 GBytes. The peak performance of each AP is 8 Gflop/s. Therefore, the total number of APs is 5,120 and the total peak performance, main memory and system disk are 40 TFlop/s, 10 TBytes and about 360 TBytes respectively. The Interconnection Network (IN) is a single-stage
crossbar network, which connects PN to PN independently. The IN makes the system appear completely flat, and the bandwidth of the IN is 16 GBytes/s x 2 for a node-to-node transfer. To achieve these specifications, we reduced the size of the cabinets containing PNs. For comparison, the present SX-5 supercomputer with a peak performance of 128 GFlop/s houses 16 APs in 1 PN, whereas the ES has 16 APs in 2 PNs. Both the SX-5 and the ES adopt an air-cooling system. Main memory is 128 GBytes compared to 32 GBytes, which reflects our budgetary limitation, but power consumption is about 90 KVA compared to 20 KVA. The sizes are 3.2 m (L) x 6.8 m (W) x 1.8 m (H) as opposed to 1.4 m (L) x 1 m (W) x 2 m (H). So, the volume of the ES is reduced to about 1/14 of that of the SX-5. We also adopt many new technologies for the ES. The important ones are discussed in the following.
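The aggregate figures quoted above follow directly from the per-node specification (640 PNs, 8 APs per PN, 8 Gflop/s per AP, 16 GBytes of memory and 576 GBytes of system disk per PN); the short program below is only a restatement of that arithmetic, not part of the ES software.

```fortran
program es_aggregate
  implicit none
  integer, parameter :: n_pn = 640, ap_per_pn = 8
  real, parameter :: gflops_per_ap = 8.0       ! peak Gflop/s of one AP
  real, parameter :: mem_gb_per_pn = 16.0      ! shared memory per PN
  real, parameter :: disk_gb_per_pn = 576.0    ! system disk per PN
  integer :: n_ap

  n_ap = n_pn * ap_per_pn
  print '(a,i6)',     'total number of APs : ', n_ap
  print '(a,f6.1,a)', 'peak performance    : ', n_ap * gflops_per_ap / 1024.0, ' TFlop/s'
  print '(a,f6.1,a)', 'total main memory   : ', n_pn * mem_gb_per_pn / 1024.0, ' TBytes'
  print '(a,f6.1,a)', 'total system disk   : ', n_pn * disk_gb_per_pn / 1024.0, ' TBytes'
end program es_aggregate
```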
Figure 1. Arithmetic Processor Package
2.1 LSI technology
In semiconductor fabrication, exposure technology has currently achieved a line width of 0.35 microns. This technology is expected to advance through the
development of 0.25-micron technology to eventually reach 0.15-micron technology at the beginning of the 21st century. In a trial run we manufactured CMOS LSI and decided to use 0.15-micron technology for the ES. Large LSI chips can achieve high efficiency, but they are very difficult to manufacture and cause heat and pin-connection problems. Currently sizes of approximately 10 mm x 10 mm are usual, but we need high efficiency, so we decided to use about 21 mm x 21 mm CMOS LSI for the AP and 18 mm x 18 mm CMOS LSI for the Main Memory Control (MMC). The clock frequency of the ES is 500 MHz (2 nsec) for the AP, but 1 GHz (1 nsec) for the vector units.
Figure 2. Main Memory Unit Package
2.2 Packaging technology
Multi-layered Printed Circuit Boards (PCBs) are very useful for compact packaging. At present a maximum of 12 layers is used elsewhere, but the ES uses build-up PCBs of 12 or more layers and flip-chip mounting with 5,185
solder bumps for the AP instead of pins. So, the AP, MMC and some other chips are not moulded but mounted as bare chips.
2.3 Cooling technology
Reducing the supply voltage and increasing the clock speed of semiconductors lead to reduced power efficiency, increased power response time and larger currents. High-density LSI usually causes heat problems, and this limits the performance of the AP. The ES uses an air-cooled heat sink with a heat pipe for the AP to overcome this limitation. We also use a wind funnel to cool the AP more effectively.
2.4 Interconnection technology
The network within a parallel computer must provide maximum data transfer capacity and minimum access latency in order to accommodate various types of parallel software. We estimated the performance of inter-node communication: the latency of one-sided communication is about 7 microseconds, and that of point-to-point communication is about 10 microseconds. These values are slightly worse than we estimated before, but still within our specification. The ES uses 1.25 GHz balanced serial transmission and achieves a maximum transmission distance of 40 m.
We finished the design and trial manufacturing of key elements of the ES in 1999, and fixed the detailed design early in 2000. Now we are conducting manufacturing tests of the AP and MMU and integration tests of the PN and IN. We will then start to manufacture PNs and INs, and hope to go into full production in early 2001. One PN consists of 8 APs, a Main Memory Unit (MMU), a Remote access Control Unit (RCU), an I/O Processor (IOP), a Diagnostic Processor (DGP), an AC-DC converter, a serial/parallel converter, half a Power Supply Unit (PSU), cooling fans and more. In order to reduce the overall size, our design accommodates two PNs in one cabinet. The cabinet size for PNs is about 1.4 m (L) x 1 m (W) x 2 m (H). An IN consists of a Crossbar Switch (XSW), an AC-DC converter, a DC-DC converter, a serial/parallel converter, half a power supply unit, cooling fans and more. We also reduced the size of the IN cabinet and housed two INs in one cabinet. The cabinet size for INs is about 1.3 m (L) x 1.2 m (W) x 2 m (H). The ES will have system disks, user disks, a Cartridge Tape Library (CTL) of about 12 PBytes for data archiving, a file server, user workstations (WS), user graphic workstations (GWS), a 3-dimensional graphic display (3D) and more. The ES will have 3 networks connecting the peripherals and the system itself. The first is the
Interconnection Network, which connects the TSS cluster and the batch clusters. The second is the Data Transfer Network, which connects the system disks, the CTL and the file server. The third is the Backbone Network, which connects the TSS cluster, the file server, the WSs, the GWSs, the 3D display system and the Internet through a firewall. The ES will occupy an area of 40 m x 40 m and the power consumption of the main system will be between 5,000 and 6,000 KVA in total. We will place 65 IN cabinets at the centre, with 320 PN cabinets surrounding these INs. The magnetic disk system and the network equipment will be situated around the PNs. The CTL will be placed near the magnetic disk system.
3 Operating system environment
Figure 3. Peripheral Devices and Network
The operating system (OS) of the ES is based on the vendor's Unix system (SUPER-UX), providing the usual Unix environment. However, some functions will be newly developed for the ES, such as large scalability of distributed-memory parallel processing, large-scale parallel I/O, interfaces to the operation support software and so on.
In order to reduce the difficulty of operating the ES, we divide the 640 nodes into 40 groups which we call "clusters". Each cluster consists of 16 nodes, a Cluster Control Station (CCS), an I/O Control Station (IOCS) and system disks. With the ES system we employ parallel I/O techniques for both disks and mass-storage systems to enhance data throughput. That is, each node has an I/O processor with access to a disk system via high-speed networks, and each cluster is connected to a drive of the mass-storage system via its IOCS. Each cluster is controlled by a CCS and the total system is controlled by the Super Cluster Control Station (SCCS). To the users, on the other hand, all the clusters appear as a single transparent system.
Figure 4. Batch Job Queuing System
The ES is basically a batch-job system and most of the clusters are called batch clusters. Batch clusters are mainly for multi-node batch jobs and have system disks. These system disks are used for system files used by the OS and for staging user files. However, we are preparing a very special cluster called the TSS cluster. In the TSS cluster, one or two nodes will be used for interactive processing and the rest of the nodes will be used for small-scale (single-node) batch jobs. Note that user disks are connected only to the TSS cluster in order to save the cost of peripheral devices. Therefore, most of the user files which will be used on the batch clusters are to be stored in the mass-storage system. All batch clusters and the TSS cluster are
connected directly to the mass-storage system. A file management system automatically retrieves and migrates data between the system disks and the mass-storage system for batch jobs. To support the efficient operation of the ES, we are developing operation support software. The operation support software will comprise a job scheduler, a file management system, a broadcasting system, an automatic power control system and an accounting system. The job scheduler will manage node allocation, assigning free nodes to a job. The file management system will provide the data movement such that data will be available on the nodes where a job will run. The job scheduler will run a batch job queuing system. If a large-scale job, such as a multi-node batch job, is submitted, the job scheduler will pass it to a batch cluster under the control of the job scheduler and the Network Queuing System (NQS) and will allocate nodes and time. When a small-scale job, such as a single-node batch job, is submitted, the job scheduler will pass it to a batch cluster controlled by NQS only and will allocate APs and time. The file management system will dynamically move user files from/to the CTL. It will retrieve specified files before the job runs and will save specified files after the job has finished. Users will have to declare such files in the job script. The file management system will maintain a virtual file tree in a database, so users see a unified file system.
4 Parallel vector processing
Table 2. Three levels of parallel vector processing.

  Level 1   inter-node (distributed memory)   PN   640                         MPI, HPF
  Level 2   intra-node (shared memory)        AP   8                           OpenMP, microtasking, MPI, HPF
  Level 3   within an AP                      vector register length 256       vectorization
Possible programming languages for the ES are Fortran90, C and C++. Of these, we consider Fortran90 the most effective for achieving good vector performance and we recommend using it. The following parallel programming models will be supported: MPI-2 for message-passing programming, HPF2.0/JA for data parallelism, and OpenMP and
microtasking for shared-memory applications, together with vectorization. MPI-2 is the standard Message Passing Interface including one-sided communication and parallel I/O. HPF2.0/JA is High Performance Fortran V2 plus the extended JAHPF specifications. MPI and HPF are available for both distributed and shared memory. OpenMP is the standard shared-memory programming Application Program Interface (API). Microtasking is a shared-memory programming API similar to OpenMP for the NEC SX series, and the Fortran90 compiler provides an automatic parallelization feature. There are 3 levels of parallel vector processing on the ES (see Table 2). The first level is parallel processing among distributed PNs via the crossbar network (inter-node). The second level is parallel processing with shared memory within a PN (intra-node). The third level is vector processing within an AP. Users will have to select a suitable combination of these environments in order to achieve good performance. For example, one recommended combination is MPI for level 1, microtasking for level 2 and vectorization for level 3; this combination is used with CCSR/NASDA-SAGCM. A user who wants relatively simple programming might prefer MPI or HPF for both levels 1 and 2. A user who wants both portability and performance might choose MPI for level 1 and OpenMP for level 2. A sketch of such a three-level structure is given below.
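As an illustration of the three-level combination, the following Fortran sketch distributes blocks of latitudes over MPI tasks (level 1), threads the loop over levels with OpenMP, standing in here for microtasking (level 2), and leaves a long innermost longitude loop for the vector units (level 3). The grid sizes, array names and the averaging operation are illustrative assumptions and do not reproduce any ES application code.

```fortran
program hybrid_sketch
  use mpi
  implicit none
  ! illustrative grid sizes, not those of any particular ES model
  integer, parameter :: nlon = 512, nlat = 256, nlev = 32
  integer :: ierr, rank, ntasks, nj, j1, j2, i, j, k
  real, allocatable :: field(:,:,:), tend(:,:,:)

  ! level 1: MPI over distributed nodes (here, blocks of latitudes)
  call mpi_init(ierr)
  call mpi_comm_rank(mpi_comm_world, rank, ierr)
  call mpi_comm_size(mpi_comm_world, ntasks, ierr)
  nj = (nlat + ntasks - 1) / ntasks
  j1 = rank*nj + 1
  j2 = min(nlat, j1 + nj - 1)
  allocate(field(nlon, j1:j2, nlev), tend(nlon, j1:j2, nlev))
  field = 1.0

  ! level 2: shared-memory threads over the level loop
!$omp parallel do private(j, i)
  do k = 1, nlev
    do j = j1, j2
      ! level 3: long innermost longitude loop left to the vector units
      do i = 1, nlon
        tend(i, j, k) = 0.5 * (field(i, j, k) + field(mod(i, nlon) + 1, j, k))
      end do
    end do
  end do
!$omp end parallel do

  call mpi_finalize(ierr)
end program hybrid_sketch
```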
5 Application software
To enable easy and efficient use of the ES, we are going to prepare an Atmospheric General Circulation Model (AGCM) and an Ocean General Circulation Model (OGCM) as standard application software. We found it difficult to develop software for general use that achieves high performance combined with high resolution. Therefore, we decided to create 3 software categories: one for common use, a second for performance evaluation and a third for high-resolution experiments and future models. For common use, we are developing CCSR/NASDA-SAGCM, previously known as NJR-SAGCM. This is a global climate spectral model based on the CCSR model developed by the University of Tokyo. CCSR/NASDA-SAGCM is Fortran95 compliant, providing readability, portability, modularity and expandability for users. Also, JAMSTEC is developing and optimizing two types of OGCM for the ES; one is based on MOM3 and the other is based on POM. For performance evaluation, we are developing CCSR/NASDA-SAGCM (TF5) in the pursuit of high performance on the ES. The target of its sustained performance is higher than 5 TFlop/s. The original program is the same as CCSR/NASDA-SAGCM, but CCSR/NASDA-SAGCM (TF5) is optimized to get the best performance in higher-resolution experiments compared to the common-use version. For high-resolution and future models, we are developing, for instance, a global 10 km resolution version of the AGCM in collaboration with many scientists. Also, we are working on a high-resolution, non-hydrostatic global coupled atmosphere-
ocean model. Furthermore, we are developing a regional model, which will implement a new algorithm, the Cubic Interpolated Pseudo-particle (CIP) method, etc. We measured the performance of CCSR/NASDA-SAGCM (T159L20) on a single node of an SX-4 with SDRAM. Parallel programming in this version uses MPI only. Its speed-up ratios are 1.977 for 2 APs, 3.908 for 4 APs and 7.604 for 8 APs. The effective parallel ratios are 99.85%, 99.21% and 99.26%, respectively.
Figure 5. Performance of CCSR/NASDA-SAGCM
In order to estimate the performance of the ES, consider Amdahl's law in the following form: S = 1/(1 - P + P/M), where S is the speed-up ratio, P is the parallel ratio, and M is the number of processors. For the ES, the situation is more complex and we have to think about parallel processing on two levels. The first level is the inter-node (max 640 nodes) and the second level is the intra-node (max 8 APs) parallelism. The updated Amdahl's law for a hybrid parallel programming model may then be formulated as:
S = 1/(1 - P + (P/M)(1 - t + t/m) + α + β), where S is the speed-up ratio, P is the parallel ratio for inter-node parallelism, t is the parallel ratio for intra-node parallelism, M is the number of PNs, m is the number of APs, α is the overhead for MPI (for example message transfer throughput and latency, load imbalance, etc.), and β is the overhead for microtasking. When a user runs a parallel program using a large number of nodes, P may become the most significant factor determining the speed-up. In addition, maintaining the length of the vector loops will also be a significant factor.
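A small program like the one below can be used to work with these formulas: it first inverts the single-level law to recover an effective parallel ratio from one of the measured speed-ups quoted above, and then evaluates the two-level form for a set of parameter values that are purely illustrative assumptions (P, t and the overhead terms are not measured ES figures).

```fortran
program hybrid_amdahl
  implicit none
  real :: p, t, alpha, beta, s, s_meas, p_eff
  integer :: bigm, m

  ! invert S = 1/(1 - P + P/M) for P, using the measured single-node
  ! speed-up S = 3.908 on M = 4 APs quoted in the text
  s_meas = 3.908
  m      = 4
  p_eff  = (1.0 - 1.0/s_meas) / (1.0 - 1.0/real(m))
  print '(a,f7.4)', 'effective parallel ratio on one node: ', p_eff

  ! evaluate the two-level (hybrid) form for illustrative parameters
  p     = 0.9999   ! inter-node parallel ratio (assumed)
  t     = 0.99     ! intra-node parallel ratio (assumed)
  alpha = 0.0002   ! MPI overhead fraction (assumed)
  beta  = 0.0002   ! microtasking overhead fraction (assumed)
  bigm  = 640      ! number of PNs
  m     = 8        ! APs per PN
  s = 1.0 / (1.0 - p + (p/real(bigm))*(1.0 - t + t/real(m)) + alpha + beta)
  print '(a,f10.1)', 'predicted hybrid speed-up: ', s
end program hybrid_amdahl
```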
6 Facility and development schedule
It has been decided to locate the ES facilities in Yokohama City, Kanagawa Prefecture, near Tokyo, Japan. The ES facility, which consists of the ES building itself, the research building, the chiller plant, etc., is under construction and will be completed by December 2000.
Figure 6. Scale Model of ES Building
The ES building is a two-storey building, 50 m wide and 65 m long, which looks like a gymnasium but is equipped with a seismic isolation system, electromagnetic shielding, a double floor for cables, a powerful air-conditioning system, etc. We will begin the cabling for networks and power in January 2001.
Figure 7. Construction of ES Building (September, 2000)
As mentioned before, the detailed hardware design of the ES components has been finished and manufacturing started early in 2000. PNs and INs will be manufactured continuously in a factory, so we will install and set them up in the ES building unit by unit from spring 2001 onwards. Peripheral systems will be installed at the same time, and we will begin to test and evaluate individual PNs, including the operation support software, and then combine them into clusters. When all clusters are installed and working, we will begin to test and tune the entire ES system. This should be completed by the end of February 2002. JAMSTEC plans to start operational use of the ES at the beginning of March 2002. NASDA, JAERI and JAMSTEC, including the Frontier Research System for Global Change (FRSGC), will begin to use the ES thereafter.
We do not expect to complete the development of application software such as the AGCM and OGCM before the end of 2002, so we will have to continue their development after March 2002. Joint research with institutes from around the world interested in climate change will be welcome. Furthermore, JAMSTEC plans to issue a Research Announcement of Opportunity in the summer of 2001. Following the evaluation of the applications, selected participants should be able to carry out their programs after October 2002.
References
1. Mitsuo Yokokawa, Satoru Shingu, et al., Performance Estimation of the Earth Simulator, Towards Teracomputing (1998), pp. 34-53.
2. Kazuo Yoshida and Shinichi Kawai, Development of the Earth Simulator, New Frontiers of Science and Technology (1999), pp. 439-444.
3. Keiji Tani, Earth Simulator Project in Japan, High Performance Computing (2000), pp. 33-42.
4. Ian Foster, Designing and Building Parallel Programs (1995), Addison-Wesley Publishing Company.
5. http://www.hlrs.de/organization/par/services/models/index.html
6. http://www.openmp.org
7. http://www.gaia.jaeri.go.jp
PARALLEL COMPUTING AT CANADIAN METEOROLOGICAL CENTRE
JOSEPH-PIERRE TOVIESSI, ALAIN PATOINE, GABRIEL LEMAY
Canadian Meteorological Centre, Meteorological Service of Canada, 2121 Trans-Canada Highway, 4th floor, Dorval, QC, Canada, H9P 1J3
MICHEL DESGAGNÉ, MICHEL VALIN, ABDESSAMAD QADDOURI, VIVIAN LEE, JEAN CÔTÉ
Recherche en prévision numérique, Meteorological Service of Canada, 2121 Trans-Canada Highway, 5th floor, Dorval, QC, Canada, H9P 1J3
The Canadian Meteorological Centre has four NEC SX supercomputers (2 SX-4 and 2 SX-5) used for operational, development and research activities in the areas of data assimilation, forecasting, emergency response, etc. One of the major components is the Global Environmental Multiscale (GEM) model. The GEM model can be configured to run with a uniform-resolution grid to support long-range forecasting, as well as with a variable-resolution grid to support limited-area forecasting. First designed as a shared-memory model, it is being migrated to a distributed-memory environment. This paper concentrates on some techniques used in the distributed GEM model. We also discuss the performance of the model on NEC SX-5 supercomputer nodes.
1 Introduction
A few years ago the Canadian Meteorological Centre (CMC) undertook a major transition in its operational suite. The main reason was the advent of a unified model, the Global Environmental Multiscale (GEM) model, which can be used for both global and regional forecasting. Originally designed to run in a shared-memory environment, in the context of the OpenMP paradigm or NEC SX multitasking programming, it now runs operationally on a single 32-processor NEC SX-4 node. The development of the distributed-memory version of the GEM model (GEM-DM) is in its final stage. This paper outlines the special features of the GEM-DM model and presents the performance obtained so far.
2 Hardware infrastructure
A multi-tiered networking topology is used to closely couple the various components of the CMC supercomputer infrastructure, as represented in Fig. 1. There are currently four shared-memory NEC vector supercomputer nodes at CMC. The two SX-4s have been in service for some time, while the two SX-5s have been acquired recently.
Figure 1. CMC supercomputers: two SX-4-32 nodes (asa and hiru) and two SX-5-16 nodes (yonaka and kaze), connected by IXS links and a HIPPI switch.
In Table 1 below, a comparison is presented between one SX-4 node and one SX-5 node. Note that the SX-5 has slower memory, but the ratio of total peak Gigaflops to central memory capacity is one, allowing each processor to access memory with little conflict. Table 1. Supercomputer architecture.
                                            SX-4      SX-5
Number of processing elements (PE)            32        16
Clock (MHz)                                  125       250
Peak Gigaflops/PE (total)                  2 (64)   8 (128)
Vector pipes/PE (total)                   8 (256)  16 (256)
Type of Main Memory Unit (MMU)              SSRAM     SDRAM
Main Memory Unit (GBytes)                      8       128
Memory bandwidth (GBytes/s)                  128      1024
Extended Memory Unit (XMU) (GBytes)           16      none
XMU bandwidth (GBytes/s)                       8       N/A
IXS cross-bar bandwidth (GBytes/s)             8        16
Presently, CMC uses only one SX-4 to run its shared-memory operational model. The distributed-memory version of the operational GEM will run on two SX-5 nodes, which will allow an increase in resolution as well as the inclusion of more sophisticated physics schemes.
3 The unified Global Environmental Multiscale (GEM) model
Many papers have described the design, the characteristics and the features of the GEM model. Côté et al. [1, 2] presented some design considerations and results. This section outlines the main features of the operational GEM model and the impact of those characteristics on the programming strategies in the distributed implementation. The GEM system uses an arbitrarily rotated latitude-longitude mesh to focus resolution on any part of the globe. Figure 2 shows the grid of the current operational regional model, which has a resolution of 24 km over Canada and uses 28 levels. The future regional configuration will instead have a resolution of about 16 km and 35-45 levels. The performance data presented later in this paper are based on this new grid and set of levels.
Figure 2. Grid of the operational regional model (resolution of 24 km in the uniform area).
The computational domain is discretised into Nx x Ny x Nz points. For distributed-memory computation, a regular block partitioning of the global computational domain across a P = Px x Py logical processor mesh is employed. Since no partitioning of the vertical domain is performed, each processor computes on a sub-domain covering Nx/Px x Ny/Py x Nz points (Fig. 3). The choice of Px and Py for a particular problem size should in principle be made according to the best achievable single-processor performance on a particular machine architecture [4]. For example, on the NEC SXs, given that the code is largely vectorised along the X-axis, we tend to choose Px such that Nx/Px is either 256 or slightly below, in order to take full advantage of the vector registers. A halo (or ghost) region used for communication with neighbouring processors surrounds most computing arrays within the dynamical core of the model. These
communications are performed using a local application programming interface (API) that uses MPI primitives to exchange data between processors [5, 6]. The need for such communications arises mainly from the finite-difference stencil used by the model along with the number of intermediate results that must be stored and consequently exchanged before being used. A careful analysis and redesign of the data flow have therefore been performed in order to limit the need for communication as much as possible.
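The Fortran sketch below illustrates one way such a halo exchange can be organised with MPI, using a Cartesian process topology and MPI_SENDRECV to swap one-point-deep halo columns along the X-axis only (the Y-direction exchange is analogous). The sub-domain sizes, halo depth and buffer layout are illustrative assumptions; the GEM model's own communication API is not reproduced here.

```fortran
program halo_sketch
  use mpi
  implicit none
  ! illustrative local sub-domain sizes and a halo depth of one point
  integer, parameter :: lni = 128, lnj = 156, nk = 35, halo = 1
  integer :: comm2d, ierr, nprocs, rank, west, east, dims(2), count
  logical :: periods(2)
  real :: f(1-halo:lni+halo, 1-halo:lnj+halo, nk)
  real :: sbuf(1-halo:lnj+halo, nk), rbuf(1-halo:lnj+halo, nk)

  call mpi_init(ierr)
  call mpi_comm_size(mpi_comm_world, nprocs, ierr)
  dims = 0
  call mpi_dims_create(nprocs, 2, dims, ierr)
  periods = .false.
  call mpi_cart_create(mpi_comm_world, 2, dims, periods, .true., comm2d, ierr)
  call mpi_comm_rank(comm2d, rank, ierr)
  call mpi_cart_shift(comm2d, 0, 1, west, east, ierr)   ! X-axis neighbours

  f = real(rank)
  count = (lnj + 2*halo) * nk

  ! send the easternmost owned column east, receive the western halo column
  sbuf = f(lni, :, :)
  call mpi_sendrecv(sbuf, count, mpi_real, east, 1, &
                    rbuf, count, mpi_real, west, 1, comm2d, mpi_status_ignore, ierr)
  if (west /= mpi_proc_null) f(0, :, :) = rbuf

  ! send the westernmost owned column west, receive the eastern halo column
  sbuf = f(1, :, :)
  call mpi_sendrecv(sbuf, count, mpi_real, west, 2, &
                    rbuf, count, mpi_real, east, 2, comm2d, mpi_status_ignore, ierr)
  if (east /= mpi_proc_null) f(lni+1, :, :) = rbuf

  call mpi_finalize(ierr)
end program halo_sketch
```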
Figure 3. Memory layout (regular block partitioning).
Apart from grid considerations, the methods used to solve the dynamical problem include two items that necessitate heavy communications:
• an implicit time treatment of the non-advective terms responsible for the fastest-moving acoustic and gravitational oscillations, which leads to the solution of an elliptic problem;
• a semi-Lagrangian treatment of advection to overcome the stability limitations encountered in Eulerian schemes due to the convergence of the meridians and to strong jet streams; the semi-Lagrangian handling of the polar areas is particularly delicate, as physically close points may lie on distant processors.
4 Performance results
For any operational centre there is a constraint on the time available to run the different models, so that the products are disseminated to the users before a specific deadline. At CMC, the most restrictive deadline is imposed on the regional integration, which must be completed (including its outputs) in a time window of about 50 minutes (wall time). For that reason the performance results we present are those of the anticipated regional implementation of the GEM-DM model at 16 km. All the experiments are performed using two SX-5 nodes.
4.1 Timing one time step
A one time step execution of the new regional model on two SX-5 nodes gives 5.1 seconds for the dynamics and 4.1 seconds for the physics. This execution uses 16 PEs in a 4 x 4 processor mesh. The dynamics sustains 2.7 Gigaflops per PE while the physics reaches 1.3 Gigaflops per PE. The details of the timing of the dynamical core are listed in Table 2. Table 2. Timing the dynamical core on a two-node SX-5, (4 x 4) PE mesh.
Step     Number of calls   WT (s)   GF/PE   % of WT
RHS              1            0.06     1.04     1.17
ADV              2            2.50     1.13    48.73
PREP             2            0.06     2.06     1.17
NLIN             4            0.24     1.70     4.68
SOL              4            1.52     6.02    29.63
BAC              4            0.20     1.50     3.90
DIFF             1            0.55     0.34    10.72
Total                         5.13      N/A   100.00
A dynamical time step is composed (Table 2) of one right-hand-side computation (RHS), two advection computations (ADV), two linear preparations (PREP), four non-linear updates (NLIN), four solver calls (SOL), four back substitutions (BAC) and one horizontal diffusion (DIFF). The advection, solver and horizontal diffusion calls make up 89 percent of the dynamics timings. Qaddouri [3] presents a faster version of the solver and horizontal diffusion. The CMC/RPN GEM-DM team is rewriting the advection subroutine and its communication strategy, to improve its
performance for the operational model. In the following section, the results presented use the current subroutines only.
4.2 Timing a 48-h forecast
Figure 4 displays timings of the GEM-DM model on two SX-5 nodes for three families of domain decompositions; each run is a 48-hour integration, the length of the operational forecast. The (P x Q) domain decomposition divides the X-axis among P processors and the Y-axis among Q processors, for a total of P x Q PEs for the model integration. The number of PEs along the X-axis is fixed for each member of a family, to obtain the same vectorisation optimisation for all the members. In general, the longer the X-axis vectors, the better the vector pipes are utilised. This is seen in Fig. 4, where the three families give results that are nearly parallel to each other. The perfect speed-up curve is represented by the dotted line. The solid line, the (1 x n) decomposition, is the closest to the ideal speed-up. That the other curves lie above this line reflects the decrease in the optimal usage of the vector pipes for the other types of domain decomposition. On the SX-5, the X-axis vector length should be close to a multiple of 256; a small enumeration of candidate decompositions illustrating this criterion is sketched below.
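The following sketch simply enumerates the (P x Q) factorizations of a few PE counts for the 509 x 625 grid and reports the local X-axis vector length together with its distance from the nearest multiple of 256; it is an illustration of the selection criterion only, not the procedure used operationally.

```fortran
program choose_decomposition
  implicit none
  integer, parameter :: ni = 509, nj = 625   ! 16-km grid quoted in the text
  integer :: npes, p, q, lni, lnj, dist

  print '(a)', '  PEs    P x Q    Ni/P   Nj/Q   distance from multiple of 256'
  do npes = 16, 24
    do p = 1, npes
      if (mod(npes, p) /= 0) cycle
      q    = npes / p
      lni  = (ni + p - 1) / p               ! local X extent (vector length)
      lnj  = (nj + q - 1) / q               ! local Y extent
      dist = min(mod(lni, 256), 256 - mod(lni, 256))
      print '(i5,3x,i2,a,i2,2i7,i10)', npes, p, ' x', q, lni, lnj, dist
    end do
  end do
end program choose_decomposition
```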
Figure 4. GEM-DM on two SX-5 nodes: wall-clock time of a 48-h integration as a function of the number of PEs for the (1 x n), (2 x n) and (3 x n) PE decompositions (509 x 625 grid at 16 km, 35 levels, no output); the dotted line is the perfect speed-up.
The number of PEs needed to meet the operational deadline for this version of the GEM-DM model in its 16-km regional configuration is 18. This implies that the model must run on two nodes. When the number of PEs is above 20, the parallel performance diminishes as a result of the "regular block reduction effect".
5 Conclusion
The 48-hr integration of the distributed-memory GEM model on a two-node SX-5 cluster already displays acceptable performance. With further optimisation, we expect a regional model at 16 km to be feasible within the operational deadline.
References
1. Côté, J., S. Gravel, A. Méthot, A. Patoine, M. Roch, and A. Staniforth, 1998: The Operational CMC-MRB Global Environmental Multiscale (GEM) Model. Part I: Design Considerations and Formulation. Monthly Weather Review, 126, 1373-1395.
2. Côté, J., J.-G. Desmarais, S. Gravel, A. Méthot, A. Patoine, M. Roch, and A. Staniforth, 1998: The Operational CMC-MRB Global Environmental Multiscale (GEM) Model. Part II: Results. Monthly Weather Review, 126, 1397-1418.
3. Qaddouri, A., 2000: Parallel elliptic solvers for the implicit global variable-resolution grid-point GEM model: iterative and fast direct methods. Proceedings of the Ninth ECMWF Workshop on the Use of High Performance Computing in Meteorology, Reading, UK, November 13-17 (this volume).
4. Thomas, S. J., M. Desgagné, and R. Benoit, 1999: A real-time North American Forecast at 10-km Resolution with the Canadian MC2 Meso-LAM. Journal of Atmospheric and Oceanic Technology, 16, 1092-1101.
5. Gropp, W., E. Lusk, and A. Skjellum, 1994: Using MPI: Portable Parallel Programming with the Message-Passing Interface. The MIT Press, 307 pp.
6. Snir, M., S. Otto, S. Huss-Lederman, D. Walker, and J. Dongarra, 1996: MPI: The Complete Reference. The MIT Press, 336 pp.
PARALLEL ELLIPTIC SOLVERS FOR THE IMPLICIT GLOBAL VARIABLE-RESOLUTION GRID-POINT GEM MODEL: ITERATIVE AND FAST DIRECT METHODS
ABDESSAMAD QADDOURI AND JEAN CÔTÉ
Recherche en prévision numérique, Meteorological Service of Canada, 2121 Trans-Canada Highway, 5th floor, Dorval, QC, Canada, H9P 1J3
The direct solution of separable elliptic boundary value problems in spherical geometry arising in numerical weather prediction involves, in the variable-mesh case, a full-matrix multiplication. To decrease the execution time of this direct solution we investigate two alternative ways of performing the matrix product: the Strassen method, and the exploitation of the mirror symmetry of the mesh. An iterative Preconditioned Conjugate Gradient (PCG) solution has also been implemented and compared to the direct solution. The direct and iterative solvers have been tested on the NEC SX-4 and SX-5.
1 Introduction
Most models in operational numerical weather prediction (NWP) use either an implicit or a semi-implicit time discretisation on tensor-product spatial grids. This gives rise to the need to solve a separable 3D elliptic boundary value (EBV) problem at each model time step. This is the case for the Global Environmental Multiscale (GEM) model in operation at the Canadian Meteorological Centre [1]. In the GEM model the separable EBV problem is currently solved with a direct method which is very efficiently implemented on Environment Canada's NEC SX-4 system of vector multiprocessors. A parallel distributed-memory implementation with explicit message passing was recently described in Qaddouri et al. [2]. This direct solver, see section 3 below, can be implemented either with a fast or with a slow Fourier transform. In the case of the slow transform, a full-matrix multiplication, the cost per grid point rises linearly with the number of grid points along the transform direction. The object of the present paper is to report on attempts to improve the performance of the EBV problem solution, either by accelerating the full-matrix multiplication in the direct solver or by using a PCG iterative solver. The paper is organised as follows: section 2 presents the problem; section 3 describes the direct method, its parallelisation, its adaptation to solve a high-order horizontal diffusion problem, and its acceleration by using either the Strassen method or the mirror symmetry of the mesh; section 4 presents the iterative method, section 5 the performance results and finally section 6 the conclusion.
2 Elliptic problem
The problem to solve is a 3D separable positive-definite EBV problem, which after vertical separation reduces to a set of horizontal Helmholtz problems, each of the form (η - Δ)φ = R, η ≥ 0, where η is a real constant and φ is the solution.
where power is an integer ranging from 1 to 4 and is half the order of the horizontal diffusion. The method used for the direct solution of this generalised Helmholtz problem proceeds similarly to Lindzen [5], and leads to almost the same algorithm as the previous one, except that in the second step we now have to solve block tridiagonal problems of size NJ x power.
3.3 Parallelisation of the direct solution algorithm
Observing that the communication bandwidth on our systems is large enough that the volume of data communicated is less important than the communication latency (from software and hardware) in determining the communication cost of an algorithm, minimising the number of communication steps should minimise the communication cost. This means that the remapping (or transposition) method is likely to be a good choice. In this method the algorithm reads, for a P x Q processor grid:
1. transpose the right-hand-sides to bring all the λ-direction grid points into each processor,
2. analyse the right-hand-sides,
3. transpose the data to bring all the φ-direction grid points into each processor,
4. solve the (block) tridiagonal problems along the φ-direction,
5. invert step 3 with the solutions of step 4,
6. synthesise the solution,
7. invert step 1; the solutions are again distributed in λ-φ subdomains.
This algorithm necessitates 4 global communication steps (1, 3, 5, 7); a sketch of one such transposition is given below. The data partitioning used in the GEM model is described in Toviessi et al. [6].
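The following Fortran sketch corresponds to step 1 above: each of the np processors holding a λ-slab of the right-hand-sides scatters it with MPI_ALLTOALL so that afterwards every processor holds complete λ-lines for a subset of the φ points. The array shapes are illustrative (and are assumed divisible by the number of processors); the sketch does not reproduce the GEM data structures.

```fortran
program transpose_sketch
  use mpi
  implicit none
  ! ni and nj are assumed divisible by the number of processors for brevity
  integer, parameter :: ni = 512, nj = 512
  integer :: np, me, ierr, ni_loc, nj_loc, p, j
  real, allocatable :: rhs(:,:), rhs_t(:,:), sendbuf(:,:,:), recvbuf(:,:,:)

  call mpi_init(ierr)
  call mpi_comm_size(mpi_comm_world, np, ierr)
  call mpi_comm_rank(mpi_comm_world, me, ierr)
  ni_loc = ni / np
  nj_loc = nj / np
  allocate(rhs(ni_loc, nj), rhs_t(ni, nj_loc))
  allocate(sendbuf(ni_loc, nj_loc, np), recvbuf(ni_loc, nj_loc, np))
  rhs = real(me)

  ! reorder the local slab so that the block destined for processor p is contiguous
  do p = 1, np
    do j = 1, nj_loc
      sendbuf(:, j, p) = rhs(:, (p-1)*nj_loc + j)
    end do
  end do

  call mpi_alltoall(sendbuf, ni_loc*nj_loc, mpi_real, &
                    recvbuf, ni_loc*nj_loc, mpi_real, mpi_comm_world, ierr)

  ! assemble complete lambda-lines for the local phi points
  do p = 1, np
    do j = 1, nj_loc
      rhs_t((p-1)*ni_loc+1:p*ni_loc, j) = recvbuf(:, j, p)
    end do
  end do

  call mpi_finalize(ierr)
end program transpose_sketch
```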
3.4 Acceleration of the direct solution algorithm for the variable-mesh case
To improve the speed of the direct solution algorithm on variable meshes, we need better algorithms for the full-matrix multiplication. One known possibility is the Strassen method of matrix multiplication [7]; another is to take advantage of symmetries to reduce the number of operations. We have tried both separately. We have replaced the ordinary matrix product MXMA by the Strassen method. The operation count in the matrix product is then reduced to NK x NJ x NI x ((7/8)^x x NI), where x is the number of recursive calls to the Strassen subroutine. We always construct variable λ-meshes that are symmetrical about π, i.e. if λ is a grid point then 2π - λ is also a grid point. A consequence of this mirror symmetry is that the eigenvectors used for the separation are either odd or even under this symmetry operation, and can be represented with about half the number of degrees of freedom (NI/2). Furthermore, the orthogonality of these modes implies that only the even (odd) part of the right-hand-side projects onto the even (odd) modes. Since the separation of the right-hand-side into even and odd parts is almost free, exploiting this symmetry leads to a reduction by a factor of two in the cost of the projection step, the NI x NI projection matrix being replaced by 2 matrices of size NI/2 x NI/2 each. We obtain the same reduction for the synthesis (inverse projection) step. The operation count in the matrix product is then reduced to NK x NJ x NI x (NI/2); a sketch of the even/odd splitting is given below.
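The even/odd splitting can be written compactly with Fortran array syntax. In the sketch below, two NI/2 x NI/2 matrices are applied to the even and odd parts of a right-hand-side vector instead of one NI x NI matrix, which is where the factor-of-two saving comes from; the matrix entries are random placeholders rather than the true eigenvector projections, and the index pairing assumes that mirror-image grid points are stored at positions i and NI+1-i.

```fortran
program parity_projection
  implicit none
  integer, parameter :: ni = 512, nh = ni/2
  real :: rhs(ni), even(nh), odd(nh), ce(nh), co(nh)
  real :: pe(nh, nh), po(nh, nh)      ! projections onto the even / odd modes
  integer :: i

  call random_number(rhs)
  call random_number(pe)
  call random_number(po)

  ! split the right-hand side into its even and odd parts under the mirror symmetry
  do i = 1, nh
    even(i) = 0.5 * (rhs(i) + rhs(ni + 1 - i))
    odd(i)  = 0.5 * (rhs(i) - rhs(ni + 1 - i))
  end do

  ! two half-size matrix products replace one full NI x NI product
  ce = matmul(pe, even)
  co = matmul(po, odd)

  ! the synthesis (inverse projection) step recombines the two parts in the same way
  print *, 'even/odd projection coefficients computed:', ce(1), co(1)
end program parity_projection
```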
4 Iterative solution
Since our problem is symmetric positive definite, the Preconditioned Conjugate Gradient (PCG) method is the natural choice of iterative method. The problem is then to choose an adequate preconditioner. It should make the scheme converge rapidly while requiring only a few operations per grid point, two conflicting requirements, and it should vectorise, because of the large discrepancy between vector and scalar performance on the SX-4/5. Preconditioning with the diagonal would have a low cost per iteration but would require too many iterations because of two difficulties: the metric factors reduce the diagonal dominance in the polar regions, and the variable resolution induces large variations in the mesh sizes. Both difficulties are alleviated by non-diagonal preconditioners, which are more effective at transporting global information at each iteration. We have examined preconditioners that have the same complexity as ILU(0), the simplest incomplete LU factorisation, and we have developed code for the preconditioning step (triangular solve) that exploits the data regularity and yields vector performance.
Writing A as A = L_A + D_A + L_A^T, where L_A is the strict lower triangular part of A and D_A its diagonal, and writing the preconditioner M as M = (I + L_A D^-1) D (I + D^-1 L_A^T), we obtain different preconditioners depending on the choice of the diagonal matrix D. Choosing D = D_A gives the symmetric Gauss-Seidel (SGS) preconditioner. The ILU(0) factorisation, which gives a preconditioner with the same sparsity structure as the matrix A, is obtained by requiring that the diagonal of M be the same as the diagonal of A, viz. diag(M) = D_A, from which D is determined recursively. Note that M is also symmetric and positive definite [8]. The modified incomplete LU(0) factorisation [MILU(0)] satisfies the constraint that the row sum of M equals the row sum of A, thus M e = A e, e = (1, 1, ..., 1)^T. We have experimented with these three preconditioners: SGS, ILU(0), and MILU(0). The set-up of each factorisation is different, but the triangular solve can proceed with the same subroutine. It is worth emphasising that our particular application requires the solution of several linear systems with the same coefficient matrix but different right-hand sides. In this case the cost of the set-up is amortised over many solutions and can be neglected. The operation count per iteration is NI x NJ x NK x (2+3+5+5), where the various constants are for the two scalar products, the three vector updates, the matrix-vector product and the preconditioning step, respectively, in our implementation.
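A compact Fortran sketch of a PCG iteration with an SGS-type preconditioner of the form just described, M = (I + L D^-1) D (I + D^-1 L^T) with D = D_A, is given below for a 1-D model problem (a symmetric positive-definite tridiagonal matrix). It contains the scalar products, vector updates, matrix-vector product and forward/backward triangular sweeps of each iteration, but it is only an illustration and not the vectorised production code.

```fortran
program pcg_sgs_sketch
  implicit none
  integer, parameter :: n = 200
  real(8) :: a_diag(n), a_off(n-1)          ! SPD tridiagonal test matrix
  real(8) :: x(n), b(n), r(n), z(n), p(n), q(n)
  real(8) :: rz, rz_old, alpha, beta
  integer :: it

  a_diag = 2.5d0
  a_off  = -1.0d0
  b      = 1.0d0
  x      = 0.0d0
  r      = b                                ! r = b - A*x with x = 0
  call precond(r, z)
  p  = z
  rz = dot_product(r, z)

  do it = 1, 200
    call matvec(p, q)
    alpha = rz / dot_product(p, q)
    x = x + alpha * p
    r = r - alpha * q
    if (sqrt(dot_product(r, r)) < 1.0d-10) exit
    call precond(r, z)
    rz_old = rz
    rz     = dot_product(r, z)
    beta   = rz / rz_old
    p = z + beta * p
  end do
  print *, 'iterations:', it, '  residual norm:', sqrt(dot_product(r, r))

contains

  subroutine matvec(v, av)                  ! av = A v for the tridiagonal A
    real(8), intent(in)  :: v(n)
    real(8), intent(out) :: av(n)
    av = a_diag * v
    av(1:n-1) = av(1:n-1) + a_off * v(2:n)
    av(2:n)   = av(2:n)   + a_off * v(1:n-1)
  end subroutine matvec

  subroutine precond(v, mv)                 ! solve M mv = v with the SGS choice D = diag(A)
    real(8), intent(in)  :: v(n)
    real(8), intent(out) :: mv(n)
    integer :: k
    mv(1) = v(1)
    do k = 2, n                             ! forward sweep: (I + L D^-1) y = v
      mv(k) = v(k) - a_off(k-1) * mv(k-1) / a_diag(k-1)
    end do
    mv = mv / a_diag                        ! diagonal scaling: w = D^-1 y
    do k = n-1, 1, -1                       ! backward sweep: (I + D^-1 L^T) mv = w
      mv(k) = mv(k) - a_off(k) * mv(k+1) / a_diag(k)
    end do
  end subroutine precond
end program pcg_sgs_sketch
```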
5 Numerical results
Uniform-mesh experiments were presented in Qaddouri et al. [2]; here we present results of variable-mesh experiments carried out on the SX-4 and SX-5. Table 1 summarises the three test problems considered; they are denoted P1, P2 and P3 respectively.
Table 1. Test problems.

Problem   Resolution    NI    NJ    NK
P1        0.22°        353   415    28
P2        0.14°        509   625    28
P3        0.09°        757   947    28

5.1 Performance of the direct solution algorithm
Figure 1 gives the execution time per grid point for the three test problems on a single SX-4 processor for the basic code (MXMA, power=1), while Fig. 2 gives the equivalent SX-5 results. As predicted by the operation count, and since all the operations involved in the direct solution algorithm are well vectorised, we obtain a straight line. Problem P1 shows some overhead on the SX-5, related to the fact that the SX-5 is more sensitive to the vector length than the SX-4. We have added a fourth point (NI=255, NJ=360) in Fig. 2, and we note that this point lies on the predicted line, which confirms the greater sensitivity of the SX-5 performance to the vector length.
Figure 1. Execution time/(NI x NJ x NK) in microseconds, on one SX-4 processor.
Figure 2. Execution time/(NI x NJ x NK) in microseconds, on one SX-5 processor.
Figure 3 displays the execution time on a single SX-5 node for problem P2 as a function of the number of processors, for the three variants of the direct solver (MXMA, Strassen, Parity). We note that the use of the Strassen method decreases the execution time, and that using the parity nearly halves the execution time, as expected. The execution time scales very well in the MXMA and Parity cases. We do not have a highly optimised code for the Strassen matrix product, but in principle the algorithm could be implemented using MXMA, and we would expect a gain of about 20% with respect to the basic MXMA version. Furthermore, we could easily obtain the combined benefits of the Strassen and parity methods if needed.
Figure 3. Execution time on one SX-5 node for problem P2 as a function of the number of processors, for the MXMA, Strassen and Parity variants of the direct solver.
R_u = 2Δt [ (φ^(n-1), ∇·w) + (F_u^n, w) ],
R_φ = 2Δt [ -Φ_0 (q, ∇·u^(n-1)) + (F_φ^n, q) ],
where X = H^1(Ω) and M = L^2(Ω) are the velocity and pressure approximation spaces with inner products, for f, g, v, w ∈ L^2(Ω),
(f, g) = ∫_Ω f(x) g(x) dx,    (v, w) = ∫_Ω v(x)·w(x) dx.
The weak variational formulation of the Stokes problem can lead to spurious 'pressure' modes when the Ladyzhenskaya-Babuska-Brezzi (LBB) inf-sup condition is violated. For spectral elements, solutions to this problem are summarized in Bernardi and Maday (1992). To avoid spurious modes, the discrete velocity X_h ⊂ X and geopotential M_h ⊂ M approximation spaces are chosen to be subspaces of polynomial degree N and N - 2 over an element Ω_K:
X_h = X ∩ P_(N,K)(Ω),    M_h = M ∩ P_(N-2,K)(Ω),
P_(N,K)(Ω) = { f ∈ L^2(Ω) : f|_(Ω_K) ∈ P_N(Ω_K) }.
For a staggered mesh, two integration rules are defined by taking the tensor product of Gauss and Gauss-Lobatto quadrature rules on each element,
(f, g)_GL = Σ_(k=1..K) Σ_(i=0..N) Σ_(j=0..N) ρ_i ρ_j (f g)(ξ_i, ξ_j),
(f, g)_G = Σ_(k=1..K) Σ_(i=1..N-1) Σ_(j=1..N-1) σ_i σ_j (f g)(ζ_i, ζ_j),
with Gauss-Lobatto weights and points ρ_i, ξ_i and Gauss weights and points σ_i, ζ_i taken in Λ x Λ on element Ω_k. The discrete form of (5)-(6) can now be given as follows:
(δu_h, w)_GL - Δt (δφ_h, ∇·w)_G = R_u,
(δφ_h, q)_G + Δt Φ_0 (q, ∇·δu_h)_GL = R_φ.
Eliminating the velocity increment leads to a Helmholtz problem for the geopotential increment,
H δφ = R_φ - Δt Φ_0 D B^(-1) R_u.    (10)
Once the change in the geopotential δφ is computed, the velocity difference δu is computed from (9). The Helmholtz operator H = B + Δt^2 Φ_0 D B^(-1) D^T represents the pressure variation within each element. The coarse grid solution is mapped to the fine grid by injection (prolongation) of a constant value onto each pressure point within an element using the operator I. The coarse grid solution is obtained from the fine grid by application of the averaging (restriction) operator I^T. Equation (10) then becomes
H (φ_f + I φ_c) = g,
where g is the discrete right-hand-side vector. The coarse and fine grid Helmholtz operators are defined by H_c = I^T H I and H_f = H - H I H_c^(-1) I^T H. The coarse and fine grid problems are therefore given by H_f
Figure 5: Total sustained Gflops. Top: C56L30. Bottom: C168L30.
Figure 6: Top: Boundary exchange time (semi-implicit vs. explicit). Bottom: CG solver core time.
Acknowledgments
The authors would like to thank Einar Ronquist of the Norwegian University of Science and Technology for helpful discussions regarding the theory and implementation of spectral element methods.
References
1. Bernardi, C. and Y. Maday, 1992: Approximations spectrales de problemes aux limites elliptiques. Mathematiques et Applications, vol. 10, Springer-Verlag, Paris, France, 242p.
2. D'Azevedo, E., V. Eijkhout, and C. Romine, 1992: Reducing communication costs in the conjugate gradient algorithm on distributed memory multiprocessors. LAPACK working note 56, University of Tennessee.
3. Held, I. H., and M. J. Suarez, 1994: A proposal for the intercomparison of the dynamical cores of atmospheric general circulation models. Bull. Amer. Met. Soc., 75, 1825-1830.
4. Iskandarani, M., D. B. Haidvogel, and J. P. Boyd, 1995: A staggered spectral element model with application to the oceanic shallow water equations. Int. J. Numer. Meth. Fluids, 20, 394-414.
5. Karniadakis, G. M., and S. J. Sherwin, 1999: Spectral/hp Element Methods for CFD. Oxford University Press, Oxford, England, 390p.
6. Patera, A. T., 1984: A spectral element method for fluid dynamics: Laminar flow in a channel expansion. J. Comp. Phys., 54, 468.
7. Rancic, M., R. J. Purser, and F. Mesinger, 1996: A global shallow-water model using an expanded spherical cube: Gnomonic versus conformal coordinates. Q. J. R. Meteorol. Soc., 122, 959-982.
8. Rivier, L., L. M. Polvani, and R. Loft, 2000: An efficient spectral general circulation model for distributed memory computers. To appear.
9. Robert, A., 1969: The integration of a spectral model of the atmosphere by the implicit method. In Proceedings of WMO/IUGG Symposium on NWP, VII, pages 19-24, Tokyo, Japan.
10. Ronquist, E. M., 1988: Optimal Spectral Element Methods for the Unsteady Three Dimensional Navier Stokes Equations. Ph.D Thesis, Massachusetts Institute of Technology, 176p.
11. Ronquist, E. M., 1991: A domain decomposition method for elliptic boundary value problems: Application to unsteady incompressible fluid flow. Proceedings of the Fifth International Symposium on Domain Decomposition Methods for Partial Differential Equations, SIAM, Philadelphia, 545-557.
12. Taylor, M., J. Tribbia, and M. Iskandarani, 1997a: The spectral element method for the shallow water equations on the sphere. J. Comp. Phys., 130, 92-108.
13. Taylor, M., R. Loft, and J. Tribbia, 1997b: Performance of a spectral element atmospheric model (SEAM) on the HP Exemplar SPP2000. NCAR Technical Note 439+EDD.
14. Williamson, D. L., J. B. Drake, J. J. Hack, R. Jakob, and P. N. Swarztrauber, 1992: A standard test set for numerical approximations to the shallow water equations in spherical geometry. J. Comp. Phys., 102, 211-224.
Experiments with NCEP's Spectral Model
Jean-François Estrade, Yannick Trémolet, Joseph Sela
December 14, 2000
Abstract
The spectral model at the National Center for Environmental Prediction was generalized to scale from one processor to multiples of the number of levels. The model has also been 'threaded' for OpenMP. We present experiments with NCEP's operational IBM SP using similar resources in different mixes of MPI and OpenMP. We also present experiments with different platforms, assessing the performance of vector machines vs. cache-based machines ranging from Linux PCs and IBM SPs in different configurations to Fujitsu's VPP5000. The scalability properties of the model on the various platforms are also discussed.
Introduction
The National Center for Environmental Prediction has been running a global spectral model since 1980 on a variety of platforms. The code has continually evolved to reflect optimization features of the different computer architectures used in operations during these last twenty years. With the advent of parallel computing, the model was rewritten to conform to the distributed memory of the new machines. This code was developed and implemented on a T3D and T3E and then ported to the IBM SP, which is currently the operational machine at the U.S. National Weather Service. There were two fundamental issues during the development of the parallel code: parallelization of the algorithm and optimization for cache-based computing. The parameterizations of the physics and radiation processes were cloned from the Cray C90 code that was in operations just before the SP implementation. The data layout for parallel computations was completely redesigned. The grid-point computations of dynamics, physics and radiation were originally designed for vector machines. In view of the cache's restrictive size on some parallel machines, the parameterization codes were devectorized for better performance. This issue will be discussed in more detail later in this paper. The parallel code originally used SHMEM for communication but later switched to MPI for the IBM SP. Recently the OpenMP feature was incorporated in the code, and results from tests using the same resources with and without OpenMP are studied in this paper.
In view of the shorter lifetime of present-day computer systems, the issue of portability is gaining importance. To test portability, the model was first generalized to run on any number of PEs. The new parallelization permits execution of the model on a variety of machines, ranging from a Linux-based PC through an Origin, an SP or any MPI platform, up to Fujitsu's VPP architecture. It must be emphasized that the effort invested in devectorization for optimal performance on our operational machine will be shown to be counterproductive when executing on a vector processor such as a VPP computer.
Parallelization and Data Layout
The model's data-parallel distribution is as follows.
Spherical harmonic space: We distinguish between three possibilities. If the number of processors is less than the number of vertical levels, we distribute the levels among the processors. If the number of processors is equal to the number of vertical levels, we assign one level to each processor. In the 'massively parallel' case we require the number of processors to be a multiple of the number of layers and assign one level to each group of processors; the number of groups equals the number of PEs divided by the number of levels. In addition we spread the spectrum among the groups when doing purely spectral computations such as diffusion, filtering and semi-implicit tasks.
Fourier space: In this space we split the data among the levels and latitudes to facilitate efficient Fourier transform computations, with all longitudes present in-processor for a given field.
Gridpoint space: In this space, when we are "between Fourier transforms", we further parallelize over longitudes, since we need data for all levels at each point of the Gaussian grid on each processor.
Communication: In order to shuttle from one space to another, a set of encapsulated communication routines is used, and the transition from MPI to other communication packages is thereby isolated from the computational part of the code. We have found that collective MPI routines perform more efficiently than point-to-point communication on the IBM SP.
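As a minimal sketch of the layer/group assignment just described (the program and variable names are ours, not taken from the NCEP code), the mapping from an MPI rank to a vertical level and a processor group in the 'massively parallel' case could be computed as follows:

    ! Sketch of the spherical-harmonic-space decomposition described above.
    ! Assumes the number of PEs is an exact multiple of the number of levels
    ! ("massively parallel" case); all names and values are illustrative.
    program decomposition_sketch
      implicit none
      integer :: npes, nlevels, ngroups, rank, mygroup, mylevel

      npes    = 168      ! total number of MPI tasks (hypothetical)
      nlevels = 42       ! vertical levels, as in T170L42
      ngroups = npes / nlevels

      do rank = 0, npes - 1
         mygroup = rank / nlevels           ! group of PEs this rank belongs to
         mylevel = mod(rank, nlevels) + 1   ! the single level assigned to this rank
         print '(a,i4,a,i3,a,i3)', 'rank ', rank, ' -> group ', mygroup, ', level ', mylevel
      end do
      ! purely spectral work (diffusion, filtering, semi-implicit tasks) is then
      ! further spread over the ngroups groups
    end program decomposition_sketch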
OpenMP Threading
The code contains MPI communication and OpenMP directives. OpenMP is used for threading the FFTs and the gridpoint computations of dynamics, physics and radiation. MPI is used whenever off-node data is required. Given the memory per node on the SP, we are unable to execute a purely OpenMP code on one PE (for measurement purposes only, of course; the computational power of one PE would not suffice for operational prediction). Invocation of OpenMP is achieved at execution time. The choice of the number of nodes, MPI tasks and threads is made explicit only in the execution script, and no changes to the source code are required when reconfiguration of the resources is desired. The compilation, however, requires the appropriate compiler and associated thread libraries.
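The hybrid structure described above can be illustrated with the following sketch; it is not the NCEP source, and the routine names are assumptions. The number of threads would be chosen in the execution script (e.g. via the OMP_NUM_THREADS environment variable), with no change to the source.

    ! Illustrative hybrid MPI + OpenMP pattern: MPI handles off-node data,
    ! OpenMP threads the gridpoint loop within each MPI task.
    program hybrid_sketch
      use mpi
      implicit none
      integer :: ierr, rank, nrows, j

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

      nrows = 64                       ! gridpoint rows owned by this task (hypothetical)
    !$omp parallel do private(j)
      do j = 1, nrows
         call gridpoint_work(j)        ! dynamics/physics/radiation for one row
      end do
    !$omp end parallel do

      call MPI_FINALIZE(ierr)

    contains

      subroutine gridpoint_work(j)
        integer, intent(in) :: j
        ! placeholder for the threaded gridpoint computations
      end subroutine gridpoint_work

    end program hybrid_sketch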
Platforms and Tests
Performance evaluation was carried out at two resolutions: T62L28, a triangular truncation at zonal wave 62 with 28 levels, and T170L42, the operational truncation, a triangular truncation at zonal wave 170 with 42 levels.
IBM SP at NCEP
The NCEP SP phase one system consists of 32 silver nodes and 384 winterhawk nodes. Each silver node has 2 GB of memory and 4 CPUs. Each winterhawk has 512 MB of memory and 2 CPUs. Winterhawk nodes are faster than silver nodes, despite their slower clock (200 MHz vs. 333 MHz); the winterhawks deliver superior floating-point performance because of better pipelining in the CPUs and cache prefetch for large loops. Silver node CPUs are faster for character manipulations and scalar codes that do not optimize well. The AIX operating system requires about 150 MB, so only about 360 MB of user space on a winterhawk and 1.8 GB on a silver node are available. Since AIX is a virtual memory operating system, if the workload exceeds these values portions of its address space will page to disk. Only winterhawk nodes were used in the tests presented in this report. Load distribution, throughput and scheduling are important issues in an operational environment, and the following configuration information seems pertinent. There are 4 classes of user access; a class is a group of nodes from which users can request job resources. Two of these classes are restricted to production use only and two are for general use. Additionally, two non-user classes are configured to support scheduling of threaded applications on the winterhawk nodes:
Class 1: 17 4-way silver nodes, supporting batch and interactive POE jobs, with up to 68 available scheduling slots.
Class 2: 3 4-way silver nodes for production, interactive or batch.
Class dev: 384 2-way winterhawk nodes for development, batch only, 2 tasks per node.
Class prod: 384 2-way winterhawk nodes for production, batch only, 2 tasks per node.
In the prod class, even when we are alone in the class, the timing results can vary from 9.6% for a T170L42 case with 42 processors to 16% with 84 processors, 25% with 126 and 27% with 168 processors. This large variation required running the same cases several times in order to choose the minimum time for each set of experiments. The variation in timing results is larger when the dev queue is used.
VPP5000 at Météo-France
The VPP5000 system consists of 32 vector nodes. Each node has 8 GB of memory and one CPU of 9.6 Gflops, with 7.9 GB of memory available to the user. The nodes are linked by a crossbar with a bi-directional speed of 1.6 GB/s. This is an operational machine on which the operational suite of Météo-France runs four times a day. There is only one partition, where development, research and the operational system coexist. To manage all the different configurations, Météo-France has installed a system of queues with different priorities which depend on the memory use, the number of processors and the time requested. We were able to run alone only once and were able to test the model on 21 processors. If we compare the time of this test with the times measured when other jobs are running, the difference is less than 1%!
A Linux PC
This is a 450 MHz PC with 384 MB of memory and 256 MB of swap space, running Linux. We used Fortran 90 from Lahey/Fujitsu and ran alone in all experiments. We have also installed MPICH on this machine for future experiments with PC clusters. At this time we are engaged in Fujitsu's PC cluster project, and performance results will be published as they become available.
Performance and Scalability
In order to make a three-way comparison between the different machines it was necessary, because of memory constraints, to run the T62L28 MPI version of the model. Since the only machine supporting OpenMP that was available to us was the SP, we used the model's operational truncation of T170L42. The timings of the different sections of the code in the three-way comparison are shown in Table 1. In these runs the IBM SP used 21 processors, the VPP5000 used eight and the PC one processor. The entries in Table 1 display the percentage of time spent in the different sections of the code.
Table 1. Percentage of time spent in the different sections of the code.

Algorithm     % IBM SP (T170L42)   % VPP5000 (T170L42)   % PC (T62L28)
Transpose            33.9                 11.4                 7.7
Semi-imp              4.4                  2.6                 1.0
FFT                   3.2                  9.3                 5.3
Forw. Leg             7.2                  2.1                 4.9
Inv. Leg              4.5                  1.1                 2.4
Semi-imp              1.1                  0.5                 0.8
Grid-dyn              0.8                  2.3                 0.9
Physics              11.0                 52.4                10.9
Radiation            27.6                 15.7                51.4
I/O                   3.4                  0.9                 1.1
Prints                0.7                  0.6                 0.5
We note that almost 40% of the time on the IBM SP and 14% of the time on the VPP are spent in communication. It should also be noted that these communication times are influenced by the relatively long time required to distribute the initial conditions, which is not accounted for in this table. In addition, the time charged to communication, which is done in 32-bit precision, includes the in-processor reorganization of the data. We measured the relative time spent in this reorganization and, using the T62L28 version, found that on the VPP with only one processor the reorganization time is approximately 0.3% of the total, while on the PC it is 8.7%. This big difference is partly attributable to the vectorization of this segment of the code on the Fujitsu. It is observed that on the vector machine more time is spent in the physics than in the radiation because the physics computations are poorly vectorized; for the scalar machines the opposite is true. We note again that the original codes for the grid dynamics, physics and radiation were devectorized for the SP. Regarding the code's scalability, we observe that on the Fujitsu VPP5000 the scalability is reasonable even though the code is poorly vectorized. The vectorization percentage was measured to be less than 30%, and it is not necessarily independent of the number of processors; depending on the number of processors, vector lengths can become too small for good vector speeds. If we run 'almost alone' using the production queue, the same observations can be made about the IBM SP, and the scalability remains reasonable up to 168 processors, the number used in operations.
Figure: VPP5000 runs with a T170L42 12-hour forecast (execution time vs. number of processors).
Figure: IBM SP runs with a T170L42 12-hour forecast (execution time vs. number of processors).
OpenMP versus MPI
The OpenMP aspect of the code was tested on the IBM SP. We ran tests using MPI alone and a combination of MPI and OpenMP. On the phase one IBM SP winterhawks there can be only two threads. These tests were executed in the prod class. We observe that the pure MPI runs (dashed line) are faster than the mixed OpenMP+MPI runs (full line). These performance results are probably due to the OpenMP overhead, in spite of the rather large-grain computational tasks assigned to the OpenMP threads. We also note good scalability of the code with the use of OpenMP.
Figure: Tests with MPI alone (dashed line) or MPI+OpenMP (full line), T170L42 runs, 12-hour forecast.
Scalar versus Vector
If we compare the execution times of the VPP5000 and the IBM SP on a T170L42 case (12-hour forecast) on 21 processors, we get a ratio of 1.97. This ratio is rather disappointing for the VPP5000 but can be explained by the less than 30% vector nature of the code. In order to measure the vectorization advantage of the VPP5000 with respect to the SP we selected the simplest vectorizable part of the code, namely the grid
dynamics, for the adiabatic version of the model. In the adiabatic case the code is more vectorized (72%) but the subroutine which consumes most of the time is still the same. Using the profiling tools from Fujitsu, we observe that the second most time-consuming subroutine is the grid dynamics computation. In this subroutine, we compute the contribution of one longitude point at each call. If we increase the computation to include several points per call and increase the amount of vectorization inside this code, we increase the vectorization of the entire code to 75% and execute in 805 s instead of 907 s on four processors (9% better). With the IBM SP, the same test with no modification runs in 640 s with 22 processors. After vectorizing the grid dynamics computations, the ratio between the IBM and the VPP is improved to around 4.3. If we run the new, modified vector code on the IBM SP and on the PC, we do not notice any degradation in performance except when we use the 2-D arrays with the second dimension set to one (only one longitude per call). It is possible that this problem arises from the data prefetching mechanism of the SP. When the vectorized code is executed on the SP with several longitudes per call we get slightly improved performance, contrary to the experience with the Cray T3E; this is most likely a result of the larger SP cache. Similar remarks apply to the execution and performance on the PC.
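The kind of restructuring described above can be sketched as follows (the routine, its arguments and the arithmetic are illustrative, not the actual NCEP grid-dynamics code): instead of computing one longitude point per call, a block of longitudes is passed in, so that the innermost loop is a long stride-1 loop that vectorizes on the VPP and also streams well through the SP cache.

    ! Illustrative grid-dynamics kernel operating on a block of longitudes.
    ! Calling it with nlon = 1 reproduces the degenerate "one longitude per call"
    ! case that performed poorly on the SP.
    subroutine grid_dynamics_block(nlon, nlev, u, tend)
      implicit none
      integer, intent(in)  :: nlon, nlev
      real,    intent(in)  :: u(nlon, nlev)
      real,    intent(out) :: tend(nlon, nlev)
      integer :: i, k

      do k = 1, nlev
         do i = 1, nlon                      ! innermost loop over longitude vectorizes
            tend(i, k) = 0.5 * u(i, k)       ! placeholder for the real dynamics terms
         end do
      end do
    end subroutine grid_dynamics_block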
IBM Phase II
Phase II consists of 2 systems, one for production and the other for development. Each system contains 276 Winterhawk-2 nodes: 256 are reserved for computation, and 20 nodes are reserved for filesystems, interactive access, networking and various system functions. At this writing only 128 nodes are available on one system. The Winterhawk-2 nodes have a clock speed of 375 MHz (200 MHz for Winterhawk-1) and an L2 cache size of 8 MBytes (4 MBytes on Winterhawk-1). Each node contains 4 processors and 2 GB of main memory (usable memory 1.8 GBytes). Theoretically, this system is 5 times faster than phase I.
Comparison between phase I and phase II
We compare T170L42 12-hour forecasts on the two machines. On phase I, the run with 21 processors requires 1606 sec; on phase II this run requires 863 sec, so the ratio between the two machines is roughly 1.86. We get exactly the same ratio with 42 processors, but with 84 processors this ratio becomes 2.22. Given the variability in measurements of running times, we consider the ratio to be around 2.
Threading vs MPI
The results in the next graph indicate improved scaling on phase II. We still find that the pure MPI version is more efficient than the thread+MPI version.
Figure: Tests with MPI and threads on the IBM phase II system, T170L42 runs, 12-hour forecast; legend: mpi (mpxlf), mpi + 2 threads, mpi + 3 threads, mpi + 4 threads (mpxlf_r).
One MPI process per node vs four MPI processes per node
The most efficient MPI configuration consists of one MPI process per node. If we compare a run with one MPI process per node with a run with 4 MPI processes per node, the former is approximately 1.6 times faster than the latter. This ratio is 1.1 if we compare one MPI process per node with 2 MPI processes per node.
Conclusions
The quantitative comparison of cache and vector architectures, especially in terms of price vs. performance, is not straightforward. It is clear from measurements that vector machines with well-vectorized code will outperform scalar computers, and will do so with fewer processors. Given the rather low percentage of peak performance on scalar machines, and the quite high sustained percentage of peak for vector computers, the cost effectiveness of the two architectures should be measured with fully operational codes using the guidelines of operational timeliness. Regarding the OpenMP implementation in an MPI application, it would appear that the pure MPI version is still preferable. The finding that large-cache machines can efficiently execute slab codes could be considered in planning and maintaining operational codes with long life expectancy. In conclusion we would also like to express our regret that there are so few vector MPPs available for use in numerical weather prediction.
THE IMPLEMENTATION OF I/O SERVERS IN NCEP'S ETA MODEL ON THE IBM SP
JAMES J. TUCCILLO
International Business Machines Corporation, 415 Loyd Road, Peachtree City, GA 30269, USA
E-mail: [email protected]
Asynchronous I/O servers have recently been introduced into NCEP's Eta-coordinate model. These servers are additional MPI tasks responsible for handling preliminary post-processing calculations as well as performing the I/O of the files for post-processing and/or model restart. These activities can occur asynchronously with the MPI tasks performing the model integration and effectively reduce the model integration time.
1 Introduction
The NCEP Eta-coordinate model [1] is an operational, limited-area, short-range weather prediction model used by the US National Weather Service for numerical guidance over the North American continent. The important meteorological characteristics of the model are as follows:
• Eta vertical step-coordinate
• Arakawa E-grid for horizontal grid structure
• Hydrostatic or non-hydrostatic options
• Comprehensive physical parameterization package
• Horizontal boundary conditions from the NCEP Global Spectral Model
• One-way nesting option
• Initial conditions from an assimilation cycle
The code characteristics are as follows:
• Fortran90 with MPI-1 for message passing
• 2-D domain decomposition
• Supports 1-N MPI tasks where N is the product of any two integers
• Pure MPI or combined MPI/OpenMP
• MPI infrastructure is "hidden"
• Dummy MPI library provided for support on systems without an MPI library
• I/O and preliminary post-processing handled asynchronously through I/O servers
It is the final code characteristic, the asynchronous I/O servers, that will be discussed in this paper.
2 Motivation for I/O Servers
With the original MPI implementation of the Eta model by Tom Black at NCEP, the global computational domain was two-dimensionally decomposed over a number of MPI tasks that is the product of any two integers. Typically, the number of MPI tasks would be on the order of 10s to 100s and a square aspect ratio would be used (i.e. the numbers of tasks in the x and y directions would be equal or nearly equal). For the purposes of post-processing and/or restarting, the model needs to output the state variables plus several diagnostic quantities every hour of simulated time. This was handled by having each MPI task write a separate file to a shared file system. Each file contained the state variables and diagnostic quantities for the MPI task that wrote the file. These files were subsequently read by another program (Quilt) that patched them together, performed some preliminary post-processing computations, and finally wrote out a single "restart" file that could be used for restarting the model or for post-processing. The net effect of this approach was that the model integration was delayed while each MPI task wrote its data to disk. In
addition, the data for restarting and post-processing was written to disk twice and read from disk once before a final "restart" file was generated. While this methodology was acceptable when the Eta model was first implemented operationally on the IBM SP system with a relatively low horizontal resolution of 32 km, a different approach was needed for the anticipated increases in resolution to 22 km (Fall of 2000) and 12 km (Fall of 2001). The new approach was the introduction of asynchronous I/O servers to offload the I/O responsibility from the MPI tasks performing the model integration. In addition, I/O to disk could be significantly reduced by eliminating a write and a read to disk during the Quilt processing.
3 Design of the I/O Servers
The I/O servers are essentially the incorporation of the Quilt functionality into the Eta model. Quilt, when run as a separate program, read the files written by the MPI tasks performing the model integration, "quilted" the data together, computed temperatures for model layers below ground in the Eta vertical coordinate via an elliptic equation (for use in building down mean sea level pressure), and finally wrote out a single "restart" file. With the larger computational domains anticipated for the 22 km and 12 km implementations, the Quilt functionality was developed as a separate MPI code and incorporated into the Eta model. The tasks performing the model integration perform a non-blocking MPI_ISEND to the I/O servers and then immediately return to the model integration. The net result is that the tasks performing the model integration are no longer delayed by writes to disk and the wall time is reduced. Essentially, the I/O is completely hidden, as it is offloaded to additional MPI tasks (the I/O servers) whose sole responsibility is preliminary post-processing and I/O. When the Eta model is started, additional MPI tasks are created above and beyond what is needed for the model integration. These additional MPI tasks are the I/O servers. There can be one or more groups of I/O servers and each group can have one or more I/O servers. The I/O servers that are part of a group work collectively on receiving the data from the tasks performing the model integration, perform the preliminary post-processing, and write the "restart" file to disk. Multiple output times can be processed at once if there are multiple groups. In other words, if the time for a group of I/O servers to finish processing an output time is greater than the model integration time between output times, then multiple groups of servers can be initiated. Again, the idea is to completely hide the I/O time. Figure 1 shows an example of the relationship between the tasks performing the model integration and two groups of servers with 2 servers each.
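The hand-off described above can be sketched as follows (a simplified illustration, not the Eta source; the buffer, tag and communicator names are assumptions):

    ! An integration task posts a non-blocking send of its packed output fields to
    ! an I/O server over the intercommunicator of the chosen server group, then
    ! returns immediately to the time-stepping loop.
    subroutine send_output_to_server(fields, nwords, server_rank, intercomm, request)
      use mpi
      implicit none
      integer, intent(in)  :: nwords, server_rank, intercomm
      real,    intent(in)  :: fields(nwords)
      integer, intent(out) :: request
      integer :: ierr

      call MPI_ISEND(fields, nwords, MPI_REAL, server_rank, 0, intercomm, request, ierr)

      ! the send buffer must not be reused until MPI_WAIT(request, ...) has
      ! completed, e.g. just before the next output time
    end subroutine send_output_to_server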
Figure 1. An example of the relationship between the MPI tasks performing the model integration and two groups of I/O servers with two I/O servers each. The integration tasks (ranks 0-35) share the intracommunicator MPI_COMM_COMP and an intercommunicator MPI_COMM_INTER to each of Server_Group_0 and Server_Group_1 within MPI_COMM_WORLD. In this example, the model integration uses 36 MPI tasks and there are two groups of I/O servers, each with two I/O servers; with this arrangement, the model would be initiated with a total of 40 MPI tasks.
Each group of I/O servers will process every other output time; in other words, the tasks performing the model integration will round-robin through the groups of I/O servers. The servers within a group work in parallel and perform their own MPI communication, which is independent of the communication taking place between the tasks performing the model integration, between the tasks of other groups of I/O servers, and between the tasks performing the model integration and other groups of servers. This separation of the communication patterns is accomplished through the use of the intracommunicator and intercommunicators depicted in Figure 1. The intracommunicator (MPI_COMM_COMP) is created by using MPI_COMM_SPLIT to split MPI_COMM_WORLD. The intercommunicators are created with MPI_INTERCOMM_CREATE and provide the mapping between the tasks performing the model integration and the I/O servers of the different I/O server groups. Since the I/O servers use a one-dimensional decomposition strategy, the communication between the tasks performing the model integration and the I/O servers maps a two-dimensional decomposition onto a one-dimensional decomposition. The tasks performing the model integration actually use an array of intercommunicators, as they must be able to toggle between the groups of servers they send data to. Each group of I/O servers needs only a scalar intercommunicator, since it only needs to map to the tasks performing the model integration.
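A minimal sketch of this communicator setup is given below, assuming that the first ncomp ranks of MPI_COMM_WORLD integrate the model and the remaining ranks form the I/O server groups; the layout, constants and variable names are assumptions for illustration, not the Eta source.

    ! Integration tasks get colour 0; I/O server group g gets colour g.
    ! MPI_COMM_SPLIT yields the intracommunicator, MPI_INTERCOMM_CREATE the
    ! intercommunicator(s) between the integration tasks and each server group.
    program comm_setup_sketch
      use mpi
      implicit none
      integer, parameter :: ncomp = 36, ngroups = 2, nserv = 2
      integer :: ierr, world_rank, color, g, remote_leader
      integer :: comm_comp                    ! plays the role of MPI_COMM_COMP
      integer :: comm_inter(ngroups)          ! plays the role of MPI_COMM_INTER

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, world_rank, ierr)

      if (world_rank < ncomp) then
         color = 0                                    ! model integration tasks
      else
         color = 1 + (world_rank - ncomp) / nserv     ! I/O server group 1..ngroups
      end if
      call MPI_COMM_SPLIT(MPI_COMM_WORLD, color, world_rank, comm_comp, ierr)

      if (color == 0) then
         ! integration tasks hold an array of intercommunicators, one per group
         do g = 1, ngroups
            remote_leader = ncomp + (g - 1) * nserv   ! first rank of group g in MPI_COMM_WORLD
            call MPI_INTERCOMM_CREATE(comm_comp, 0, MPI_COMM_WORLD, remote_leader, &
                                      g, comm_inter(g), ierr)
         end do
      else
         ! each server group needs only one intercommunicator back to the integration tasks
         call MPI_INTERCOMM_CREATE(comm_comp, 0, MPI_COMM_WORLD, 0, color, &
                                   comm_inter(1), ierr)
      end if

      call MPI_FINALIZE(ierr)
    end program comm_setup_sketch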
4 Configuring the I/O Servers
How does one decide how many groups of servers and how many servers per group are needed? Basically, the number of I/O servers per group is controlled by the size of the final "restart" file, since the entire file is buffered by the I/O servers. On the SP at NCEP, each node contains 4 CPUs (4 MPI tasks are typically initiated per node) and 2 Gbytes of memory. Although the virtual-memory operating system allows disk to be used as an extension of real memory via paging, it is best, from a performance point of view, not to use more than the amount of real memory per node. For the current operational 22 km configuration, each "restart" file is approximately 600 Mbytes, so the memory of a single node is sufficient and 4 servers are used to occupy the 4 CPUs on a node. For the 12 km implementation, the "restart" file will be in excess of 2 Gbytes, so the memory of 2 nodes will be required, i.e. 8 I/O servers. In selecting the number of groups of I/O servers, you need to know the time it takes a group of I/O servers to complete the processing of an output time plus the amount of wall time for the model to integrate to the next output time. At NCEP, the goal is to produce 1 simulated hour per minute in operations, regardless of the resolution of the model. With an output frequency of 1 hour of simulated time, a group of I/O servers must be able to finish its processing in less than one minute, otherwise a second (or third, or fourth) group of servers is needed. For the current 22 km operational configuration, one group of I/O servers is sufficient. For the 12 km configuration, 3 groups of I/O servers will be required. The number of I/O server groups and the number of I/O servers per group are specified at run time, and the code monitors whether sufficient I/O server groups have been specified.
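As a worked example of this sizing rule (the restart-file size, memory figure and processing time below are hypothetical inputs, not NCEP measurements), the configuration could be derived as follows:

    ! Hypothetical sizing calculation for the I/O servers following the rules above.
    program io_server_sizing
      implicit none
      real    :: restart_mb, node_mem_mb, t_process, t_between_outputs
      integer :: nodes_needed, servers_per_group, ngroups

      restart_mb        = 2200.0   ! assumed size of one 12 km "restart" file (MB)
      node_mem_mb       = 1800.0   ! usable real memory per 4-CPU node (MB)
      t_process         = 150.0    ! assumed time for one group to process an output (s)
      t_between_outputs = 60.0     ! wall time between outputs at 1 simulated hour per minute (s)

      nodes_needed      = ceiling(restart_mb / node_mem_mb)   ! whole file buffered in memory
      servers_per_group = 4 * nodes_needed                    ! one server per CPU on those nodes
      ngroups           = ceiling(t_process / t_between_outputs)

      print '(a,i2,a,i2)', 'servers per group: ', servers_per_group, ',  groups: ', ngroups
    end program io_server_sizing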
5 Performance Impact of the I/O Servers
The impact of the transfer of data from the tasks performing the model integration to the I/O servers has been examined. The tasks performing the model integration are able to perform MPI_ISENDs to the I/O servers and immediately resume the model integration without affecting the model integration time, as determined by comparing with the model integration time measured with the I/O servers disabled. With a properly configured number of I/O server groups, the I/O is completely hidden from the model integration time, with the exception of the output for the final simulated hour. In operations at NCEP, the Eta model post-processor is initiated as each "restart" file is written, so the post-processing is completely overlapped with the model integration except for the last output time.
6 Acknowledgements
This work was based on many discussions with Tom Black, Geoff Dimego, Brent Gordon, and Dave Michaud of NCEP.
7 References
1. Black, Thomas L., 1994: The New NMC Mesoscale Eta Model: Description and Forecast Examples. Weather and Forecasting, Vol. 9, No. 2, pp. 265-284.
IMPLEMENTATION OF A COMPLETE WEATHER FORECASTING SUITE ON PARAM 10000
Sharad C. Purohit, Akshara Kaginalkar, J. V. Ratnam, Janaki Raman, Manik Bali
Center for Development of Advanced Computing, Pune University Campus, Ganesh Khind, Pune-411007
E-mail: [email protected]
The entire weather forecasting suite based on the T80 global spectral model, which consists of decoders, quality control, analysis and forecasting programs, has been ported onto the PARAM 10000 supercomputer from the CRAY X-MP. PARAM 10000 is a distributed-shared memory system scalable up to 100 GFlops. The most time-consuming portions of the suite, the analysis code (ssi80) and the forecasting code (T80), were parallelized. Due to inherent dependencies in the ssi80 code, it was parallelized using compiler directives, which take advantage of the shared memory architecture of the PARAM 10000 nodes. The T80 forecasting code was parallelized using a data decomposition method with message passing libraries. The analysis and forecasting codes show good speedups. The fault tolerance required for the forecast cycle was implemented at the script level.
1 Introduction
The prediction of weather using numerical models involves many complicated steps. The models require initial conditions to be specified for predicting the weather. These initial conditions are prepared from observations obtained from observatories spread over the whole globe. The observations are passed through quality control checks and then interpolated to the model grid using different methods. The National Centre for Medium Range Weather Forecasting, New Delhi, issues a medium-range weather forecast based on the T80 global spectral model. The observations are received through the GTS and decoded using ECMWF decoders. The data is checked for quality using a number of quality control methods. It is then interpolated to the model grid using the spectral statistical interpolation method (ssi80). The prepared initial conditions are used for forecasting the weather for five days using the global spectral model T80. The entire weather forecasting suite was ported onto the PARAM 10000 supercomputer from the CRAY X-MP. The porting involved rigorous checking of the results, optimization of the codes for best performance on PARAM 10000 and parallelization of the programs. It was found that the analysis code (ssi80) and the forecasting code (T80) were the most time-consuming portions of the weather forecasting suite. An attempt was made to parallelize the codes to take advantage of the distributed-shared memory architecture of the PARAM 10000. The T80 code was parallelized by the data decomposition method using message passing libraries (MPI Forum, 1995) and the proprietary C-DAC MPI (Mohanram et al., 1999). The ssi80 code, because of its inherent dependencies, was parallelized using compiler directives and pragmas. To make the whole weather forecasting suite fault tolerant to system failures, a script was developed to create a data image after the execution of each component of the suite. In this paper, section 2 describes the architecture of the PARAM 10000 supercomputer, followed by descriptions of the ssi80 and T80 models (section 3). The parallelization strategy is discussed in section 4. Finally, the results are discussed.
2 PARAM 10000
The PARAM 10000 supercomputer was indigenously developed by the Center for Development of Advanced Computing (C-DAC). It is a cluster of workstations based on the UltraSPARC family of microprocessors, running Solaris 2.6. The workstations are configured as compute nodes, file servers, graphics nodes and an Internet server node. PARAM 10000 consists of thirty-six compute nodes. Each compute node is a high-performance, shared-memory, symmetric-multiprocessing UltraEnterprise 450 server from Sun Microsystems with four CPUs. The nodes are interconnected with Myrinet and Fast Ethernet. C-DAC has developed High Performance Computing and Communication (HPCC) software (Mohanram et al., 1999) for PARAM 10000. This software architecture allows the ensemble of workstations to be used as independent workstations, as a cluster of workstations, or as a massively parallel processing system connected through a scalable high-bandwidth network. The HPCC suite provides many tools for developing parallel codes using message passing and for debugging them.
3a Description of the ssi80 model
The Spectral Statistical Interpolation program (Rizvi et al., 1995) is used for interpolation of the observed data to the forecast model grid. The objective of this program is to minimize an objective function defined in terms of the deviations of the desired analysis from the guess field, which is taken as the six-hour forecast, and from the observations, weighted by the inverse of the forecast and observation errors respectively. The objective function used by ssi is given by
    J = (1/2) [ (x - x_b)^T B^(-1) (x - x_b) + (y_obs - R(x))^T O^(-1) (y_obs - R(x)) ]

where
x      is the N-component vector of analysis variables,
x_b    is the N-component vector of background variables (forecast or guess field),
y_obs  is the M-component vector of observations,
B      is the N x N forecast error covariance matrix,
O      is the M x M observational error covariance matrix,
R      is the (nonlinear) transformation operator that converts the analysis variables to the observation type and location,
N      is the number of degrees of freedom in the analysis, and
M      is the number of observations.
The data flow of the programme is given in Fig. 1. The ssi program calls three main sets of routines: the first set reads the input data (decoded satellite data); the second set sets up the right-hand side of the optimization function; and the third set carries out the interpolation.
Fig. 1. Flow chart of the ssi80 algorithm: read input; set the right-hand side; read the spectral variables and convert them to grid space; perform advection and time integration in grid space (parallelizable loops along K, J and I); convert the grid variables back to spectral space; check the stopping criterion; finalise.
The ssi80 program was developed for a Cray vector processor. The porting of this code onto PARAM 10000 involved rearranging some of the data structures and removal of the asynchronous I/O which was used for writing intermediate files. The performance bottlenecks of this programme were found to be in a) file I/O and b) the subroutines which convert variables from spectral to grid space and back to spectral. For performance tuning, the intermediate files were handled using low-level system I/O (Solaris pread and pwrite). This modification gave a performance improvement of 25% in the total execution time. The rearrangement of the subroutines, which were tuned for the vector processor, involved the removal of unwanted test conditions and loop rearrangements. This improved the timing of the sequential code by 10%.
Fig. 2. Speedup of the parallelized ssi80 analysis code on varying numbers of processors within a node.
3b Description of the T80 global spectral model
The global spectral model was developed at NCEP and was modified by the National Centre for Medium Range Weather Forecasting. A detailed description of the model can be found in Purohit et al., 1996. The model has eighteen levels in the vertical and a horizontal resolution of 1.5° x 1.5°. The model can broadly be divided into three modules: a) the physics module, b) the dynamics module and c) the radiation module. These modules carry out the computations along latitude loops, i.e. the variables for all longitudes and all levels are computed at a particular latitude at a particular time. The time step used in the model was fifteen minutes.
4a Parallelization of ssi80
On analyzing the sequential ssi80 code, we found that the conversion from the spectral to the grid domain and vice versa is carried out independently for each level. The model (Fig. 1) requires about a hundred iterations for the calculation of the final values. The iteration loop is highly dependent and is not suitable for explicit parallelization using message passing across the nodes. However, to take advantage of the independence of the model along the levels, explicit parallelization using compiler directives and pragmas was carried out within the node. This type of parallelization takes advantage of the SMP architecture of the node. Parallelization using pragmas and compiler directives required us to identify the loops which can be parallelized, and requires the user to carry out the data and loop dependency analysis. On carrying out the loop analysis it was found that some subroutines required rearrangement to remove dependencies and become amenable to parallelization. After compilation, the code was run on different numbers of processors of a node by specifying the number of processors as an environment variable.
4b Parallelization of T80
The global spectral forecast model T80 was parallelized using a data decomposition method (Purohit et al., 1996). The code was parallelized along the Gaussian latitudes. A master-worker paradigm was used for the parallelization. Initially the master program reads the input files and distributes the data to the worker programs. The workers, along with the master program, carry out the computations. The global spectral variables are updated in both the master and worker programs every time step using the mpi_allreduce call.
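The per-time-step update mentioned above can be sketched with an mpi_allreduce call of the following form (an illustration under our own naming, not the actual T80 source):

    ! Each task computes partial spectral sums from its own Gaussian latitudes;
    ! mpi_allreduce makes the global sum available on the master and all workers.
    subroutine update_spectral(partial, global, nspec)
      use mpi
      implicit none
      integer, intent(in)  :: nspec
      real,    intent(in)  :: partial(nspec)   ! contribution from this task's latitudes
      real,    intent(out) :: global(nspec)    ! summed spectral coefficients, everywhere
      integer :: ierr

      call MPI_ALLREDUCE(partial, global, nspec, MPI_REAL, MPI_SUM, MPI_COMM_WORLD, ierr)
    end subroutine update_spectral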
At the end of the program, the master program receives all the data required for writing the final output file and writes it.
Fig. 3. One-day forecast timings using public-domain MPI and C-DAC MPI (using active messaging) on 1, 2, 4 and 8 processors of PARAM 10000.
5 Fault Tolerance
The hardware of PARAM 10000 has fault tolerance at the node level. In case of a failure in a node, the node performs a proper shutdown without corrupting the data on the node. At the system level, fault tolerance is implemented to avoid rerunning the codes after system failures. The script writes intermediate files of input data, with an appropriate time stamp, before the execution of each module of the forecasting suite. In case of a system failure the user can run a restart script which will resume execution from the point where the interruption took place, reading the intermediate files.
6 Results
The entire forecasting suite, which consists of about twenty-five programs, was ported onto PARAM 10000 from the CRAY X-MP. The results of the individual programs were verified for correctness by running the programs on both systems and comparing the output files. The performance of the entire forecasting suite was dependent on the performance of the analysis code and the forecasting code; hence the analysis code and the forecast model were parallelized. Here we present the speedup of the analysis code and the forecast model on varying numbers of processors. The performance of the parallelized analysis code ssi80 and the forecasting code was measured for varying numbers of processors. The analysis code was compiled using the compiler flag -explicitpar. The stack size for the subroutines containing pragmas was increased to improve the performance of the code. The code was run on varying numbers of processors of a node, specifying the number as an environment variable. Each node of PARAM 10000 has four processors. The speedup of the code is shown in Fig. 2.
From the figure it can be seen that the code has good speedup up to three processors, after which the performance degrades. The good speedup can be attributed to the even distribution of the work over all the processors and the very small overhead in the creation of threads by the compiler.
The parallelized forecast model was run on varying numbers of processors over different networks. The code was run on the Fast Ethernet and Myrinet networks. The results are shown in Fig. 3. The proprietary C-MPI was used when the code was run on Myrinet. C-MPI improves upon the public-domain MPI (MPI Forum, 1995); the point-to-point communication calls were optimized in C-MPI. From the figure it can be seen that the model has good speedup up to eight processors, after which its performance degrades. An analysis of the model showed that the communication overheads dominate as the number of processors increases beyond eight.
Conclusions
The entire forecasting suite was successfully ported and optimized on the PARAM 10000 supercomputer. The most time-consuming programs of the suite, the analysis code and the forecast code, were parallelized. It is found that the parallelized analysis code shows good speedup for varying numbers of processors within a node. The good speedups are due to the load balancing of the work and also to the very small overheads in the creation of the threads by the compiler. The forecast model is found to speed up to eight processors, after which performance degradation was observed because of increased communication overheads. A script was developed to provide fault tolerance in the execution of the weather forecasting suite.
References
1. Message Passing Interface Forum, 1995: MPI: A Message Passing Interface Standard, Version 1.1. Available at http://www.mcs.anl.gov/mpi
2. Mohanram, N., V. P. Bhatkar, R. K. Arora and S. SasiKumar, 1999: HPCC Software: A Scalable Parallel Programming Environment for Unix Clusters. In Proceedings of the 6th International Conference on Advanced Computing, edited by P. K. Sinha and C. R. Das. New Delhi: Tata McGraw-Hill Publications, pp. 265-273.
3. Rizvi, S. R. H. and D. F. Parrish, 1995: Documentation of the Spectral Statistical Interpolation Scheme. Technical Report 1/1995, New Delhi: National Centre for Medium Range Weather Forecasting, 38pp.
4. Purohit, S. C., P. S. Narayanan, T. V. Singh and A. Kaginalkar, 1996: Global Spectral Medium Range Weather Forecasting Model on PARAM. Supercomputer, 65, 27-36.
PARALLEL LOAD BALANCE SYSTEM OF REGIONAL MULTIPLE SCALE ADVANCED PREDICTION SYSTEM
JIN ZHIYAN
National Meteorological Center of China, 46# Baishiqiao Rd., Beijing, 100081, P.R. China
E-mail: jinzy@rays.cma.gov.cn
We have designed a Parallel Load Balance System for our new meso-scale model R_MPS; it is the parallel processing driver layer of the model. Its main functions include automatic domain decomposition and dynamic load balancing. Non-rectangular sub-domain shapes are used to achieve better load balance. Dynamic load balancing is implemented through a global view of the load balance and data re-mapping. Preliminary simulation results are presented.
1 Introduction
This work is part of the sub-project to develop a regional multiple-scale non-hydrostatic model, the Regional Multiple scale Advanced Prediction System (R_MPS). It is supported by the national 973 project. The model is under development by the numerical division of the National Meteorological Center of China and will be the production and research model of the center. It will be a completely redesigned code, targeted at the 10-1 km grid scale. The principal aim of this work is to produce software, named the Parallel Load Balance System (PLBS), to take care of all of the parallel processing of the model, to separate all of the other scientific code from the parallel processing code, and to make the model portable across machines of different architectures. The performance of any parallel application depends on good load balance. For a low-resolution model, the biggest load imbalance is in the radiation [1], which can be estimated at run time. In a high-resolution model, microphysics can easily destroy the static load balance [2] and the load distribution cannot be estimated before the forecast. Fortunately, the load of each grid point does not vary too much, and the load of the previous time step is a good prediction of that of the next time step [3]. Previous work [3][4][5] has studied some techniques of load balancing. In this paper, we use another approach to deal with the problem. Section 2 is an overview of the system. In sections 3 and 4, we compare some domain decomposition methods and describe the data layout of the sub-domains. Section 5 discusses the load balance strategy. Section 6 presents the simulation results and section 7 is the conclusion.
2 Overview of PLBS
PLBS works between the model and the Message Passing Interface (MPI). It is very portable to any machine that supports MPI, whether a shared- or distributed-memory parallel machine or a distributed-memory machine with SMP nodes. The software handles all of the distributed parallel processing and I/O for the model. At present, parallel I/O has not been implemented in PLBS: for any I/O, one node reads the input file from disk and distributes the proper data to the other nodes, and collects the output data from all of the nodes and writes it to disk.
Figure 1. Overview of PLBS: initialization, partitioning, index maintenance, data distribution, data exchange and load balancing.
Figure 1 shows an overview of the parallel processing part of PLBS. It is composed of six parts: initialization, partitioning, index maintenance, data distribution, data exchange and load balancing. The initialization is the first part, which should be called before any other part of PLBS. It initializes MPI, gets the environment parameters and the command-line options, and tells PLBS the size of the model domain and how many processors are available for the model. The partitioning part then calculates the domain decomposition with the method specified by the user. After that, the index maintenance part establishes the local table of the neighbors of the grid points by reference to the global indices of the grid points, and establishes the lists for data exchange, input and output, etc. At the end of each time step, the load balancing module times the running time of each node and tests whether the load is balanced or not. If a certain condition is met, it launches the partitioning module to decompose the domain according to the new load distribution, and then tests whether the load balance of the new decomposition is better or not
and, when the load balance is improved, redistributes the data and establishes the new index tables according to the new decomposition. It discards the new decomposition if there is only a little or no improvement, and continues the data exchange like any other time step. The user does not need to do anything and the whole process is transparent to the user of PLBS.
Figure 2. The four partitioning methods; the shaded area is the high-load area.
3 Domain Decomposition Methods
The model is intended to be run at resolutions of 10-1 km, where the heavy load of the microphysics can easily destroy the static load balance, and performance will suffer greatly from the load imbalance of the model physics. The method of domain decomposition is crucial to balancing the load: it should be capable of distributing the load of the model as evenly as possible among the processors for any given load distribution. Four methods have been tested, which are shown in Figure 2. In the first method, the processors are divided into Na by Nb groups in the x and y directions, and the grid points in each direction are divided as evenly as possible by Na and Nb. This method is based on the assumption that each grid point has roughly the same workload, which is true in many lower-resolution limited-area models. It has no capability to deal with the load imbalance of the physics in the model, and the static load balance is poor when the number of processors does not match the number of grid points very well. We use it as a baseline to evaluate the other methods. In the second method, the whole load is divided as evenly as possible into Na columns and Nb rows. The area of each sub-domain can be
changed according to the load of each grid point; each sub-domain keeps four neighbors, which do not change during the calculation. The first step of the third method is the same as in the second one, but in the second step, instead of dividing the whole load of the domain, it divides the load of each column into Nb rows separately. Each sub-domain keeps a rectangular shape, and the neighbors change with the changing load of each grid point. The fourth method is the same as the third one but introduces small steps when dividing the columns and rows, which is quite similar to the IFS [6]. The steps give the sub-domains an irregular shape, with which the load can be distributed more evenly among processors. The numbers of rows in each column do not need to be the same in the third and fourth methods. This gives the advantage that the model can use any number of processors properly. Figure 2 shows the four methods. We used the second-order linear diffusion equation to test the performance of each method. In order to evaluate the load balancing capability, we added a simulation of physics to the code, in which the workload of the grid points in the middle area is ten times higher than in the other areas. Figure 3 shows the speedup of each method; obviously the fourth method is the best, and we use it in PLBS.
Figure 3. Speedup of the second-order linear diffusion equation test for the four partitioning methods.
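The load-based one-dimensional splitting that underlies the second, third and fourth methods can be sketched as follows (the interface and names are ours, not the PLBS source); in the third and fourth methods the same procedure is applied first to the columns of the whole domain and then to the rows within each column.

    ! Assign ncols grid columns to na groups so that each group receives roughly
    ! 1/na of the total measured load.
    subroutine split_by_load(ncols, load, na, owner)
      implicit none
      integer, intent(in)  :: ncols, na
      real,    intent(in)  :: load(ncols)    ! measured load of each column
      integer, intent(out) :: owner(ncols)   ! group index (1..na) owning each column
      real    :: total, target_load, acc
      integer :: i, g

      total       = sum(load)
      target_load = total / real(na)
      acc         = 0.0
      g           = 1
      do i = 1, ncols
         owner(i) = g
         acc = acc + load(i)
         if (acc >= real(g) * target_load .and. g < na) g = g + 1   ! cross to the next group
      end do
    end subroutine split_by_load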
It is possible to use other domain decomposition methods in PLBS: users can specify their own method in the system or simply supply the sub-domain map to the system.
4 Indexes Maintenance
A sub-domain is divided into three parts: the area that overlaps with neighbors, called the outer boundary; the area that needs the data of the outer boundary, called the inner boundary;
and the area that does not need any data of the outer boundary, called the inner area. We use a one-dimensional array to hold all of the data, as shown in Fig. 4. The index in the local array has nothing to do with the grid position, so a lookup table relating the local index to the global index must be introduced, together with a neighbor table that tells where the neighbors are. The data exchange list for the overlap area and the I/O lists must also be maintained.
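The bookkeeping just described might look as follows (a minimal sketch with our own names, not the PLBS data structures):

    ! Lookup tables for a sub-domain stored as a one-dimensional array: each local
    ! index carries its global grid position, its four neighbours (by local index,
    ! or a negative code for an off-processor point) and the halo-exchange list.
    module index_tables
      implicit none
      integer, allocatable :: glob_i(:), glob_j(:)   ! local index -> global (i,j)
      integer, allocatable :: neigh(:,:)             ! (4, npts) neighbour table
      integer, allocatable :: exch_list(:)           ! local indices packed for data exchange
    contains
      subroutine allocate_tables(npts, nexch)
        integer, intent(in) :: npts, nexch
        allocate(glob_i(npts), glob_j(npts), neigh(4, npts), exch_list(nexch))
        ! the partitioning module fills these from the sub-domain map
      end subroutine allocate_tables
    end module index_tables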
Figure 4. Data layout of a sub-domain.
5 Dynamic Load Balance Strategy
The load imbalance function is defined as

    imbalance = (T_max - T_average) / T_average

where T_max is the maximum wall-clock time among the processors and T_average is the average wall-clock time of the processors. We measure the load of every grid point and time every node at each time step. At the end of each time step, before the normal data exchange of the overlapped area, the load imbalance module is called. We define a threshold K_threshold. If the imbalance function is less than the threshold, the normal data exchange is performed. If the imbalance function is greater than the threshold, we re-partition the whole domain according to the load distribution of the last time step. After the re-partition, we test the load imbalance function again to make sure that the new partitioning is much better than the old one (although this check could probably be removed, since the new partition is always much better than the old one). If the new partition is adopted, data re-mapping is performed to move the data from the old buffers to the new ones. Each node scans its old sub-domain to make a list of where each grid point should be sent to, and scans its new sub-domain to determine where and how many grid points of its data should
be received from. Each node sends the list and the data to the nodes to which its old data must go, receives from the nodes from which its new data must come, and copies the data from the receiving buffer into the model buffer according to the list. If the data stays on the same node, it is simply copied from the old buffer to the new one. The index tables of the new decomposition must also be rebuilt. We found that the performance is very sensitive to the threshold: when the threshold is small the adjustments happen too often and the overhead is very high, and sometimes the imbalance function was very high at random places, due to system noise, for unknown reasons. We therefore changed the dynamic load balancing strategy slightly: the re-partition is called only when the load imbalance function has exceeded the threshold a certain number of times, to avoid unnecessary adjustments and to control the frequency of re-partitioning. The load of every grid point is measured only at the last time step before the re-partition.
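The end-of-time-step decision described in this section can be sketched as follows (the threshold handling and the persistence counter are our illustration of the strategy, not the PLBS source):

    ! Compute the imbalance function from the per-node timings and decide whether
    ! to trigger a re-partition, requiring the imbalance to persist for several
    ! steps so that random system noise does not cause unnecessary adjustments.
    subroutine check_load_balance(t_node, nnodes, threshold, nhits_needed, nhits, do_repartition)
      implicit none
      integer, intent(in)    :: nnodes, nhits_needed
      real,    intent(in)    :: t_node(nnodes)   ! wall-clock time of each node this step
      real,    intent(in)    :: threshold        ! K_threshold
      integer, intent(inout) :: nhits            ! consecutive steps above the threshold
      logical, intent(out)   :: do_repartition
      real :: t_max, t_avg, imbalance

      t_max     = maxval(t_node)
      t_avg     = sum(t_node) / real(nnodes)
      imbalance = (t_max - t_avg) / t_avg

      if (imbalance > threshold) then
         nhits = nhits + 1
      else
         nhits = 0
      end if

      do_repartition = (nhits >= nhits_needed)
      if (do_repartition) nhits = 0
    end subroutine check_load_balance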
6 Experiment results
Again, we use the second-order linear diffusion equation to test the system. The domain is a 103 by 103 square with 100 levels. We added the simulated physics to it so that there is a heavy-load area in the domain. The load of this area is 10 times higher than elsewhere, and the area moves from the north-west to the south-east of the domain, as shown in Figure 5.
Figure 5. A heavy-load area goes through the domain.
The system is an IBM SP with 10 nodes; in each node, 8 processors share 2 Gbytes of memory. In the experiment, each processor was used as a node. Figure 6 shows the result of the experiment.
Figure 6. Result of the experiment: speedup curves for the ideal, static, improved, Threshold(0.1) and Threshold(0.2) cases.
Internally, RasDaMan employs a storage structure which is based on the subdivision of an MDD object into arbitrary tiles, i.e., possibly non-aligned sub-arrays, combined with a spatial index [3]. Optionally, tiles are compressed. In the course of ESTEDI, RasDaMan will be enhanced with intelligent mass storage handling and optimised towards HPC; among the research topics are complex imaging and statistical queries and their optimisation.
3.3 CLRC's contribution to ESTEDI
CLRC - Daresbury Laboratory is focussing on the support of climate research applications during this project, but aims to apply the technology to other scientific disciplines where appropriate. We are collaborating with the UK Global Atmospheric Modelling Project (UGAMP) group at the University of Reading, namely Lois Steenman-Clark and Paul Berrisford. The system itself will be installed and tested at the national supercomputing service at Manchester (CSAR). In a first step we have installed a climate modelling meta data system on the Oracle database at Manchester. After some deliberation we decided to use the CERA meta data system, which has been developed by the German Climate Computing Centre (DKRZ) and the Potsdam Institute for Climate Research (PIK). The CERA meta data system will allow us to catalogue, search and retrieve private and public data files stored at Manchester based on their content. The system holds not only information about the stored data, but also about the person who submitted the data, related projects and the storage location (disk, tape, etc.). The usage of the CERA model will increase the organisation and accessibility of the data (currently 4 TB) significantly. In parallel we are installing the RasDaMan database. This will use the CERA meta data system to locate requested files, but itself holds additional information on the organisation of the contents of the data files (e.g. data format, properties, field length). This will allow RasDaMan to extract only the requested information from the file, thereby cutting down dramatically on retrieval and transmission times.
For the UGAMP consortium we are planning to implement a range of services. We will develop an interface which will connect their analysis program to RasDaMan, allowing them to retrieve the data they require from a running application. We are also planning to implement a parallel I/O interface which will allow certain fields to be written in parallel into the database, which will in turn reassemble the separate outputs into one data file. If time permits we will also try to implement on-the-fly data format conversion; converters from GRIB to NetCDF and HDF are currently planned. We are in the implementation phase and expect a first prototype by mid 2001.
4 Summary
Our scientific research data is one of our most valuable assets; unfortunately much of it is inaccessible due to outdated hardware and lack of organisation. Databases could play a major role in changing the current situation, helping to organise the data, make it accessible and prepare it for the application of state-of-the-art exploration tools. However, we need to apply database technology that is well suited to the multidimensional nature of our scientific data. Standardisation and the usage of generic technology will help to make these tools easier to install, maintain and use, allowing fast uptake by wide areas of the scientific community. It is important that all these developments are carried out in close collaboration with the scientific community to ensure that their requirements are met. The ESTEDI project will provide a major building block in this development by delivering a field-tested open platform with flexible, contents-driven retrieval of multi-terabyte data in heterogeneous networks.
References
1. P. Baumann: A Database Array Algebra for Spatio-Temporal Data and Beyond. Proc. Next Generation Information Technology and Systems NGITS '99, Zikhron Yaakov, Israel, 1999, pp. 76-93.
2. P. Baumann, P. Furtado, R. Ritsch, and N. Widmann: Geo/Environmental and Medical Data Management in the RasDaMan System. Proc. VLDB'97, Athens, Greece, 1997, pp. 548-552.
3. P. Furtado and P. Baumann: Storage of Multidimensional Arrays Based on Arbitrary Tiling. Proc. ICDE '99, Sydney, Australia, 1999, pp. 480-489.
4. K. Kleese: Requirements for a Data Management Infrastructure to support UK High-End Computing. Technical Report, CLRC - Daresbury Laboratory (DL-TR99-004), UK, November 1999.
5. K. Kleese: A National Data Management Centre. Data Management 2000, Proc. 1st Int'l Workshop on Advanced Data Storage/Management Techniques for High Performance Computing, Eds. Kerstin Kleese and Robert Allan, CLRC - Daresbury Laboratory, UK, May 2000.
6. G. Konstandinidis and J. Hennessy: MARS - Meteorological Archival and Retrieval System User Guide. ECMWF Computer Bulletin B6.7/2, Reading, 1995.
7. L. Libkin, R. Machlin, and L. Wong: A query language for multidimensional arrays: Design, implementation, and optimization techniques. Proc. ACM SIGMOD'96, Montreal, Canada, 1996, pp. 228-239.
8. A. P. Marathe and K. Salem: Query Processing Techniques for Arrays. Proc. ACM SIGMOD '99, Philadelphia, USA, 1999, pp. 323-334.
9. A. O'Neill and Lois Steenman-Clark: Modelling Climate Variability on HPC Platforms. High Performance Computing, R. J. Allan, M. F. Guest, A. D. Simpson, D. S. Henty, D. A. Nicole (Eds.), Plenum Publishing Company Ltd., London, 1998.
10. S. Sarawagi, M. Stonebraker: Efficient Organization of Large Multidimensional Arrays. Proc. ICDE'94, Houston, USA, 1994, pp. 328-336.
List of Relevant Links
Active Knowledge: www.active-knowledge.de
CINECA: www.cineca.it
CLRC: www.clrc.ac.uk
CSCS: www.cscs.ch
Data Management 2000 workshop: www.dl.ac.uk/TCSC/datamanagement/conf2.html
DAMP: www.cse.clrc.ac.uk/Activity/DAMP
DIRECT: www.epcc.ed.ac.uk/DIRECT/
DKRZ: www.dkrz.de
DLR: www.dfd.dlr.de
DMC: www.cse.ac.uk/Activity/DMC
ERCOFTAC: imhefwww.epfl.ch/Imf7ERCOFTAC
ESTEDI: www.estedi.org
IHPC&DB: www.csa.ru
NUMECA: www.numeca.be
RasDaMan: www.rasdaman.com
UKHEC: www.ukhec.ac.uk
WOS: www.woscommunity.org
COUPLED MARINE ECOSYSTEM MODELLING ON HIGH-PERFORMANCE COMPUTERS
M. ASHWORTH
CLRC Daresbury Laboratory, Warrington WA4 4AD, UK
E-mail: [email protected]
R. PROCTOR, J. T. HOLT
Proudman Oceanographic Laboratory, Bidston Observatory, Birkenhead CH43 7RA, UK
E-mail: [email protected]
J. I. ALLEN, J. C. BLACKFORD
Plymouth Marine Laboratory, Prospect Place, West Hoe, Plymouth PL1 3DH, UK
E-mail: jia@pml.ac.uk
1 The POLCOMS shelf-wide model
The hydrodynamic model used in this study is the latest in a series of developments at the Proudman Oceanographic Laboratory (POL). The model solves the three-dimensional Shallow Water form of the Navier-Stokes equations. The equations are written in spherical polar form with a sigma vertical co-ordinate transformation and solved by an explicit forward time-stepping finite-difference method on an Arakawa B-grid. The equations are split into barotropic (depth-mean) and baroclinic (depth-fluctuating) components, enabling different time-steps to be used, typically with a ratio of ten between the shorter barotropic and the baroclinic time-steps. The model is a development of that of James [9] and is capable of resolving ecologically important physical features, such as stratification, frontal systems and river plumes, through the use of a sophisticated front-preserving advection scheme, the Piecewise Parabolic Method (PPM). This method has been shown to have excellent structure-preserving properties in the context of shelf sea modelling [10]. Vertical mixing is determined by a simple turbulence closure scheme based on Richardson-number stability. A sub-model is included for the simulation of suspended particulate matter (SPM) including transport, erosion, deposition and settling [6]. The computational domain covers the shelf seas surrounding the United Kingdom from 12°W to 13°E and from 48°N to 63°N with a resolution of about 12 km. The computational grid has 150 x 134 points in the horizontal and 20 vertical levels; a total of 402,000 gridpoints. The typical depth over most of the domain is around 80 m, but the western extremity includes the shelf edge where depths increase rapidly to around 2000 m. Previous models include a fine resolution model of the southern North Sea [16], which was parallelized for execution on the Cray T3D [12]. This model was larger (226 x 346 x 10 = 781,960 gridpoints), covering a smaller region at finer resolution. The achieved performance was 1.0 Gflop/sec on 128 Cray T3D processors, or around 5% of peak performance.
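As a purely schematic illustration of the split time-stepping described above (this is not the POL code; the subroutine names, step lengths and loop counts are invented for the sketch), the barotropic depth-mean mode is sub-cycled with a shorter step inside each baroclinic step:

  program split_stepping
    implicit none
    integer, parameter :: nsteps = 100        ! number of baroclinic steps (assumed)
    integer, parameter :: nsub   = 10         ! typical barotropic/baroclinic ratio
    real :: dt_baroclinic, dt_barotropic
    integer :: n, m
    dt_baroclinic = 600.0                     ! assumed baroclinic step length [s]
    dt_barotropic = dt_baroclinic / nsub      ! shorter depth-mean step
    do n = 1, nsteps
       do m = 1, nsub
          call step_barotropic(dt_barotropic) ! explicit depth-mean (free surface) update
       end do
       call step_baroclinic(dt_baroclinic)    ! 3-D advection, diffusion and tracers
    end do
  contains
    subroutine step_barotropic(dt)
      real, intent(in) :: dt                  ! placeholder for the depth-mean solver
    end subroutine step_barotropic
    subroutine step_baroclinic(dt)
      real, intent(in) :: dt                  ! placeholder for the baroclinic solver
    end subroutine step_baroclinic
  end program split_stepping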
2 The ERSEM ecosystem model
The European Regional Seas Ecosystem Model (ERSEM) [4] was developed by a number of scientists at several institutes across Europe through projects under the MAST programme of the European Community. Many features and applications of the ERSEM model are described in a special issue of the Journal of Sea Research. ERSEM is a generic model that describes both the pelagic and benthic ecosystems and the coupling between them in terms of the significant bio-geo-chemical processes affecting the flow of carbon, nitrogen, phosphorus and silicon. It uses a 'functional group' approach to describe the ecosystem whereby biota are grouped together according to their trophic level and sub-divided according to size and
feeding method. The pelagic food web describes phytoplankton succession (in terms of diatoms, flagellates, picoplankton and inedible phytoplankton), the microbial loop (bacteria, heterotrophic flagellates and microzooplankton) and predation by mesozooplankton (omnivores and carnivores). The benthic sub-model contains a food web capable of describing nutrient and carbon cycling, bioturbation/bioirrigation and the vertical transport in sediment of particulate matter due to the activity of benthic biota. The model has been successfully run in a wide variety of regimes and on a variety of spatial scales. All studies illustrate the importance of the ecological model being coupled with fine-scale horizontal and physical processes in order for the system to be adequately described. ERSEM was coupled with a simple 2D depth-averaged transport model in a modelling study of the Humber plume [1]. The study successfully simulated much of the behaviour of the plume ecosystem, with primary production controlled by limited solar radiation between March and October and by nutrient availability during the rest of the year. This allowed exploration of the causal linkages between land-derived nutrient inputs, the marine ecosystem and man's influence.
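The functional-group decomposition described above can be pictured as the per-gridpoint state carried by the pelagic sub-model; the derived type below is only an illustrative sketch of that grouping (the field names are invented, and the real ERSEM state also carries nutrient and benthic variables not shown here):

  type :: pelagic_state
     ! phytoplankton succession
     real :: diatoms, flagellates, picoplankton, inedible_phytoplankton
     ! microbial loop
     real :: bacteria, heterotrophic_flagellates, microzooplankton
     ! mesozooplankton predation
     real :: omnivorous_mesozoo, carnivorous_mesozoo
  end type pelagic_state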
3 The coupled model
The models were coupled by installing the ERSEM code as a component within the main time-stepping loop of the POL code. Coupling is unidirectional: ERSEM requires input describing the physical conditions (namely temperature, salinity and transport processes) from the hydrodynamic model, but there is no feedback in the opposite direction. The PPM advection subroutines from the POL code, which are used for temperature and salinity, were also used to perform advection of the 36 ecological state variables. Both components take information from meteorological data (solar heat flux, wind speed and direction, cloud cover, etc.).
4 Coupled model validation
The coupled model has been run to simulate conditions in 1995 (driven by ECMWF meteorology, tides and daily mean river inflows of freshwater, nutrients and SPM) and validated against collected observations [2]. One output of the model is an estimate of total primary production (Fig. 1). Area-averaged modelled production compares well with estimates of observed production taken from the literature (not necessarily for 1995) (Table 1).
Figure 1. Total Primary Production for 1995 (g C m-2).
Table 1. Primary Production, modelled and observed
Region               Primary Production (g C m-2)   Reference
Southern North Sea   150-250                        Anon 1993
Northern North Sea   54-127                         [14]
English Coast        79                             [11]
Dogger Bank          112
German Bight         261
Using this simulation as a baseline, the model has been used to predict estimates of the effect of climate change on plankton blooms on the NW European shelf. In these first calculations the predictions of the Hadley Centre HADCM2 climate model [8] for the mid-21st century have been simplified to very basic climate change effects, i.e. an increase in air temperature of 2% and an increase in wind speed of 2%. The 1995 simulation was repeated with these simple meteorological changes and the magnitude and timing of the spring plankton bloom compared. Fig. 2 shows the effect of the changed climate relative to the baseline 1995 simulation. It can be seen that this simplified climate change can result in spatial differences, with increases and decreases in the amplitude of the spring bloom of as much as 50%. The timing of the bloom is also affected, generally advancing the bloom in the south and delaying the bloom in the north, by up to 30 days, but the result is patchy. Although not yet quantified, these changes seem to lie within the bounds of spring bloom variability reported over the past 30 years.
[Figure 2 panels: fractional amplitude of spring bloom; shift in peak of spring bloom (days)]
Figure 2. Impact of simple climate change on amplitude and timing of the spring bloom
5 Extension of the Shelf Model into Atlantic deep water
The western boundary of the present Shelf model (at 12°W) intersects the shelf edge west of Ireland. This causes difficulties in modelling the flow along the shelf edge (e.g. [18]), effectively breaking up the 'continuous' shelf edge flow. Additionally, the present model area does not include the Rockall Trough west of Scotland, an area of active exploration drilling by the offshore industry (as is the Faeroe-Shetland Channel). To fully resolve the shelf edge flows and to forecast temperature and current profiles in these regions, the model area has been extended to include the whole of the Rockall Trough. The larger area Atlantic Margin model (Fig. 3) extends from 20°W to 13°E and from 40°N to 65°N and has a grid of 198 x 224 x 34 (1,507,968 gridpoints) at the same resolution. At present, boundary conditions are provided by the 1/3rd degree UK Meteorological Office Forecast Ocean Atmosphere Model (FOAM) in the form of 6-hourly three-dimensional temperature and salinity (used in a relaxation condition) and depth-averaged currents and elevation (both used in a radiation condition). FOAM is a rigid-lid model, without tides, so tidal forcing has to be added separately to the radiation condition. Also, in order to fully resolve the small-scale current variability in both the Rockall Trough and Faeroe-Shetland Channel, models with grid resolution of the order of 2 km (to 'resolve' the baroclinic Rossby radius of deformation, i.e. eddy-resolving) are required. Such models are being nested into the larger area Atlantic Margin model (also shown in Fig. 3). Extending the domain into significantly deeper water required a modification to the vertical coordinate system. The disadvantage of the simple sigma coordinate when the model domain includes deeper water is that the thickness of the model surface layer can be greater than the ocean surface mixed layer, giving poor results in deeper water. To overcome this a hybrid co-ordinate can be specified, the 'S' coordinate [17], which combines the terrain-following co-ordinate at lower model levels with near-horizontal surface layers, giving improved resolution of the model surface layer in deeper water. At each gridpoint the relationship between S and σ is defined as:
σ_k = S_k   for h ≤ h_c,   and
field converts the Fortran 77 code to Co-Array Fortran. At the same time, recording the dimension information into the structure,

z%size(1) = lons; z%size(2) = lats; z%size(3) = levs

makes it available to other domains that may need it. This solves the problem of how one domain knows how to find data on remote memory stacks and to find the size and shape of that data. This problem is very difficult to solve in other SPMD programming models, but it is not a problem for CAF. Knowing how to find the information follows naturally from the design of the language. For example, to find the dimensions of a field on domain [r,s], one simply writes

dims(:) = z[r,s]%size(:)

The compiler generates code, appropriate for the target hardware, to read the
array component size(:) of the structure z on the remote domain. In the same way, to obtain actual field data from a remote domain, one writes

x = z[r,s]%field(i,j,k)

In this case, the compiler first reads the pointer component from the remote domain, which contains the address associated with the pointer on the remote domain, then uses that remote address to read the data from the heap on the remote domain. These deceptively simple assignment statements open all the power of CAF to the programmer. First, consider updating halo cell zero by reading all the latitudes from all the levels from the neighbor to the west. If we assume periodic boundary conditions, then the co-dimension index to the west is

west = p - 1; if (west < 1) west = np

For example, the halo-update from the west becomes a single line of code,

field(0,1:lats,1:levs) = z[west,q]%field(lons,1:lats,1:levs)

Similarly, with the index to the east given by

east = p + 1; if (east > np) east = 1

the update from the east is a similar line of code,

field(lons+1,1:lats,1:levs) = z[east,q]%field(1,1:lats,1:levs)

In the north-south direction, the mapping is a bit more complicated since the domains must be folded across the poles. We do not give the specific mapping here, but point out that any mapping to another domain, be it simple and regular or complicated and irregular, may be represented as a function that returns the appropriate co-dimension. For example,

north = neighbor('N')
might represent a function that accepts character input and returns the result of a complicated relationship between the current domain and some remote domain.
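To make the pieces above concrete, here is a minimal, self-contained sketch of the same halo-exchange idea in coarray Fortran as later standardized in Fortran 2008. It is not the UM code: the module and type names, the 2 x 2 processor grid, the array sizes and the use of an allocatable component in place of a pointer component are all assumptions made for the sketch.

  module halo_mod
    implicit none
    integer, parameter :: np = 2, nq = 2          ! assumed processor grid
    type :: field_box
       real, allocatable :: field(:,:,:)          ! (0:lons+1, lats, levs), incl. halo columns
       integer :: size(3)                         ! lons, lats, levs
    end type field_box
    type(field_box), save :: z[np,*]              ! one box per image (domain)
  contains
    subroutine init_field(lons, lats, levs)
      integer, intent(in) :: lons, lats, levs
      allocate(z%field(0:lons+1, lats, levs))
      z%field = real(this_image())                ! dummy data: each image stores its number
      z%size  = [lons, lats, levs]
    end subroutine init_field

    subroutine swap_east_west()
      integer :: p, q, west, east, lons, lats, levs
      p = this_image(z, 1);  q = this_image(z, 2)
      lons = z%size(1);  lats = z%size(2);  levs = z%size(3)
      west = p - 1;  if (west < 1)  west = np     ! periodic east-west boundaries
      east = p + 1;  if (east > np) east = 1
      sync all                                    ! neighbours must have finished writing
      z%field(0,      1:lats, 1:levs) = z[west,q]%field(lons, 1:lats, 1:levs)
      z%field(lons+1, 1:lats, 1:levs) = z[east,q]%field(1,    1:lats, 1:levs)
      sync all
    end subroutine swap_east_west
  end module halo_mod

  program demo
    use halo_mod
    implicit none
    if (num_images() /= np*nq) error stop 'run with np*nq images'
    call init_field(10, 8, 5)
    call swap_east_west()
    if (this_image() == 1) print *, 'west halo on image 1 now holds', z%field(0,1,1)
  end program demo

The two assignment statements in swap_east_west correspond directly to the one-line halo updates quoted above.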
5 Experimental Results
We performed our experiments using the Met Office Unified Model (UM) 10. The Met Office has developed this large Fortran77/Fortran90 code for all of its operational weather forecasting and climate research work. The UM is a grid point model, capable of running in both global and limited area configurations. The next major release of the UM will have a new semi-implicit non-hydrostatic dynamical formulation for the atmospheric model, incorporating a semi-Lagrangian advection scheme. The code used in this paper is a stand-alone version of the model, which incorporates the new dynamical core. This was used for developing the scheme prior to insertion into the UM. The timing results are based around climate resolution (3.75 degree) runs. We performed two experiments, one to compare Co-Array Fortran with SHMEM and the other to compare it with MPI. The UM uses a lightweight interface library called GCOM 11 to allow interprocessor communication to be achieved by either MPI or SHMEM, depending on which version of the library is linked. Additionally, the halo exchange routine swapBounds() exists in two versions, one using GCOM for portability and another using direct calls to SHMEM for maximum performance on the CRAY-T3E. We converted the swapBounds() subroutine from the SHMEM version to Co-Array Fortran, as outlined in Section 4, and then substituted it into the UM, this being the only change made to the code. This was then run with the two different versions of the GCOM library (MPI and SHMEM) to measure the performance. Table 1 shows that, with four domains in a 2x2 grid, the SHMEM version, times shown in column 2, runs about five percent faster than the pure MPI version, times shown in column 5. Although an improvement of five percent may not seem like much, because of Amdahl's Law this small difference has a large effect on scalability for large numbers of processors. This can be seen from the fact that at 32 domains the difference in times is already about fifteen percent. The times for the SHMEM code with the CAF subroutine inserted are shown in column 3 of Table 1, and the times for the MPI version with CAF inserted are shown in column 4. Comparison of the results shows what one would expect: CAF is faster than MPI in all cases but slower than SHMEM. The explanation of the timing comparisons is clear. CAF is a very lightweight model. The compiler generates inline code to move data directly from one memory to another with no intermediate buffering and with very little overhead, unlike the high overhead one encounters in something like the heavyweight MPI library, which strives more for portability than for performance. The SHMEM library, on the other hand, is a lightweight library, which
also moves data directly from memory to memory with very little overhead, but, in addition, its procedures have been hand-optimized, sometimes even written in assembler, to obtain performance as close to peak hardware performance as possible. Hence, the compiler-generated code produced for CAF is not as fast as SHMEM, but considering that the compiler alone generated the code with no hand optimization, the performance of the CAF version of the code is quite impressive. With more work to put more optimization into the compiler to recognize CAF syntax and semantics, there is no reason why the CAF version will not eventually equal the SHMEM version.

Table 1. Total Time (s)
Domains   SHMEM   SHMEM with CAF   MPI with CAF   MPI
2x2       191     198              201            205
2x4       95.0    99.0             100            105
2x8       49.8    52.2             52.7           55.5
4x4       50.0    53.7             54.4           55.9
4x8       27.3    29.8             31.6           32.4

6 Summary
This paper has outlined a way to convert the halo-update subroutine of the Met Office Unified Model to Co-Array Fortran. It demonstrated how to add CAF features incrementally so that all changes are local to one subroutine. Hence the bulk of the code remained untouched. It incidentally demonstrated the compatibility of three programming models, not a surprising result since they all assume the same underlying SPMD programming model. The communication patterns in this application are very regular, but the Co-Array Fortran model is not restricted to such patterns. In some cases, such as encountered in a semi-Lagrangian advection scheme, it is necessary to gather a list of field elements from another domain using a line of code such as

tmp(:) = z[myPal]%field(i,list(:),k)

Or it might be necessary to gather field elements from a list of different domains using a line such as

tmp(:) = z[list(:)]%field(i,j,k)
And at times it may be necessary to modify data from another domain before using it or storing it to local memory,
field(i,j,k) = field(i,j,k) + scale*z[list(:)]%field(i,j,k)

No matter what is required, the programmer writes code that clearly describes the requirements of the application, and the compiler maps that description onto the hardware, using it in the most efficient way it can. Such a programming model allows the programmer to concentrate on solving a physical or numerical problem rather than on solving a computer science problem. In the end, the Co-Array Fortran code is simpler than either the SHMEM code or the MPI code. It is easier to read as well as easier to write, which means it is easier to modify and easier to maintain. In addition, its performance is as good as or better than the performance of library-based approaches, which have had much more optimization effort put into them. With a similar amount of work devoted to compiler development, the performance of the CAF model will approach or exceed that of any other model.

References
1. Message passing toolkit. CRAY Online Software Publications, Manual 007-3687-002.
2. William Gropp, Ewing Lusk, and Anthony Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface. The MIT Press, 1994.
3. J. Nieplocha, R. J. Harrison, and R. J. Littlefield. Global Arrays: A nonuniform memory access programming model for high-performance computers. The Journal of Supercomputing, 10:197-220, 1996.
4. Robert W. Numrich and John K. Reid. Co-Array Fortran for parallel programming. ACM Fortran Forum, 17(2):1-31, 1998. http://www.coarray.org.
5. William W. Carlson, Jesse M. Draper, David E. Culler, Kathy Yelick, Eugene Brooks, and Karen Warren. Introduction to UPC and language specification. Technical Report CCS-TR-99-157, Center for Computing Sciences, 17100 Science Drive, Bowie, MD 20715, May 1999. http://www.super.org/upc/.
6. Katherine Yelick, Luigi Semenzato, Geoff Pike, Carleton Miyamoto, Ben Liblit, Arvind Krishnamurthy, Paul Hilfinger, Susan Graham, David Gay, Phillip Colella, and Alexander Aiken. Titanium: A high-performance Java dialect. Concurrency: Practice and Experience, 10:825-836, 1998.
7. OpenMP Fortran application program interface. http://www.openmp.org.
8. Eric T. Kapke, Andrew J. Schultz, and Robert W. Numrich. A parallel 'Class Library' for Co-Array Fortran. In Proceedings Fifth European SGI/Cray MPP Workshop, Bologna, Italy, September 9-10, 1999.
9. Robert W. Numrich, John Reid, and Kieun Kim. Writing a multigrid solver using Co-Array Fortran. In Bo Kågström, Jack Dongarra, Erik Elmroth, and Jerzy Wasniewski, editors, Applied Parallel Computing: Large Scale Scientific and Industrial Problems, pages 390-399. 4th International Workshop, PARA98, Umeå, Sweden, June 1998, Springer, 1998. Lecture Notes in Computer Science 1541.
10. T. Davies, M.J.P. Cullen, M.H. Mawson, and A.J. Malcolm. A new dynamical formulation for the UK Meteorological Office Unified Model. In ECMWF Seminar Proceedings: Recent developments in numerical methods for atmospheric modelling, pages 202-225, 1999.
11. J. Amundsen and R. Skaalin. GC user's guide release 1.1. Technical report, SINTEF Applied Mathematical Report STF42 A96504, 1996.
PARALLEL ICE DYNAMICS IN AN OPERATIONAL BALTIC SEA MODEL
TOMAS WILHELMSSON
Department of Numerical Analysis and Computer Science, Royal Institute of Technology, SE-100 44 Stockholm, Sweden
E-mail: [email protected]
HIROMB, a 3-dimensional baroclinic model of the North Sea and the Baltic Sea, has delivered operational forecasts to the SMHI since 1995. The model is now parallelized and runs operationally on a T3E, producing a 1 nautical mile forecast in the same time as a 3 nm forecast took on a C90. During the winter season, a large fraction of CPU time is spent on ice dynamics calculations. Ice dynamics include ice drift, freezing, melting, and changes in ice thickness and compactness. The equations are highly nonlinear and are solved with Newton iterations using a sequence of linearizations. A new equation system is factorized and solved in each iteration using a direct sparse solver. This paper focuses on the efforts involved in parallelizing the ice model.
1 Introduction
The Swedish Meteorological and Hydrological Institute (SMHI) makes daily forecasts of currents, temperature, salinity, water level, and ice conditions in the Baltic Sea. These forecasts are based on data from a High Resolution Operational Model of the Baltic Sea (HIROMB). Within the HIROMB project,1 the German Federal Maritime and Hydrographic Agency (BSH) and SMHI have developed an operational ocean model, which covers the North Sea and the Baltic Sea region with a horizontal resolution from 3 to 12 nautical miles (nm). This application has been parallelized 2,3,4 and ported from a CRAY C90 vector computer to the distributed memory parallel CRAY T3E-600. The memory and speed of the T3E allow the grid resolution to be refined to 1 nm, while keeping the execution time within limits. Figure 1 shows output from a 1 nm resolution forecast.
2 Model description
HIROMB gets its atmospheric forcing from the forecast model HIRLAM. 5 Input includes atmospheric pressure, wind velocity, wind direction, humidity and temperature, all at sea level, together with cloud coverage. Output includes sea level, currents, salinity, temperature, ice coverage, ice thickness and ice drift velocity. HIROMB is run once daily and uses the latest 48-hour
[Figure 1 plot title: HIROMB (1 nm) Surface salinity and current, TUE 26 SEP 2000 00Z +48]
Figure 1. This is output from a 1 nautical mile resolution forecast centering on the South Baltic Sea. Arrows indicate surface current and colors show salinity. Blue is fresh water and red is salt water.
forecast from HIRLAM as input. There are plans to couple the models more tightly together in the future. The 1 nm grid (see Figure 2) covers the Baltic Sea, Belt Sea and Kattegat. Boundary values for the open western border at 10° E are provided by a coarser 3 nm grid which extends out to 6° E and includes Skagerrak. Boundary values for the 3 nm grid are provided by a 12 nm resolution grid which covers the whole North Sea and Baltic Sea region. All interaction between the grids takes place at the western edge of the finer grid where values for flux, temperature, salinity, and ice properties are interpolated and exchanged. Water level at the 12 nm grid's open boundary is provided by a storm surge model covering the North Atlantic. Fresh water inflow is given at 70 major river outlets. In the
Figure 2. The 1 nm resolution grid is shown to the left. Colors indicate depth, with maximum depth south of the Norwegian coast around 600 meters. The grid has 1,126,607 active points of which 144,449 are on the surface. The 3 nm grid, on the upper left, provides boundary values for the 1 nm grid and has 154,894 active grid points of which 19,073 are on the surface. The 12 nm grid, on the lower right, extends out to the whole North Sea and has 9,240 active points and 2,171 on the surface. It provides boundary values for the 3 nm grid. The sea level at the open North Sea boundary of the 12 nm grid is given by a storm surge model.
vertical, there is a variable resolution starting at 4 m for the mixed layer and gradually increasing to 60 m for the deeper layers. The maximum number of layers is 24. In the future, we plan to replace the 3 nm and 12 nm grids with one 6 nm grid, and also to double the vertical resolution. HIROMB is very similar to an ocean model described by Backhaus.6,7 Three different components may be identified in the HIROMB model: the baroclinic part, the barotropic part, and ice dynamics. In the baroclinic part, temperature and salinity are calculated for the whole sea at all depth levels. Explicit two-level time-stepping is used for horizontal diffusion and advection. Vertical exchange of momentum, salinity, and temperature is computed implicitly. In the barotropic part, a semi-implicit scheme is used for the vertically
Table 1. Matrix sizes used in ice dynamics iterations from a May 3, 2000 forecast.
Grid    Equations   Non-zeros
12 nm   1,012       4,969
3 nm    16,109      84,073
1 nm    144,073     762,606
integrated flow, resulting in a system of linear equations (the Helmholtz equations) over the whole surface for water level changes. This system is sparse and slightly asymmetric, reflecting the 9-point stencil used to discretize the differential equations over the water surface. It is factorized with a direct solver once at the start of the simulation and then solved for a new right-hand side in each time step. During mid-winter, ice dynamics dominates the total computation time, and its parallelization will be the focus of the rest of this paper.
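The factorize-once, solve-per-step pattern used for the Helmholtz system can be illustrated with a small dense example; the sketch below uses LAPACK's dgetrf/dgetrs as a stand-in for the sparse direct solver actually used in HIROMB, and the matrix size, step count and random data are invented for the illustration.

  program factor_once_solve_many
    implicit none
    integer, parameter :: n = 4, nsteps = 3
    real(8) :: a(n,n), b(n)
    integer :: ipiv(n), info, i, step
    call random_number(a)
    do i = 1, n
       a(i,i) = a(i,i) + real(n, 8)           ! make the matrix safely non-singular
    end do
    call dgetrf(n, n, a, n, ipiv, info)       ! LU factorization, done once
    if (info /= 0) stop 'factorization failed'
    do step = 1, nsteps
       call random_number(b)                  ! a new right-hand side each "time step"
       call dgetrs('N', n, 1, a, n, ipiv, b, n, info)   ! reuse the stored factors
       print *, 'step', step, ' first solution component =', b(1)
    end do
  end program factor_once_solve_many

For the ice system described next, the matrix values change in every Newton iteration, so only the ordering and symbolic factorization can be reused while the numeric factorization must be repeated.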
3 Ice dynamics
Ice dynamics occurs on a very slow time scale. It includes ice drift, freezing, melting, and changes in thickness and compactness. The ice cover is regarded as a granular rubble of ice floes and modeled as a 2D continuously deformable and compressible medium. The viscous-viscoplastic constitutive law is based on Hibler's model.8 The dynamic state (constant viscosity or constant yield stress) may change locally at any time, creating a highly nonlinear equation system. The system is solved with Newton iterations using a sequence of linearizations of the constitutive law until quasi-static equilibrium is reached. In each iteration a new equation system is factorized and solved using a direct sparse solver. Convergence of the nonlinear system is achieved after at most a dozen iterations. The ice model is discussed in detail by Kleine and Sklyar.9 Profiling shows that almost all computation time for ice dynamics is spent in the linear solver. The linearized ice dynamics equation system contains eight unknowns per ice-covered grid point: two components of ice drift velocity, three components of strain rate and three components of stress. Hence, even with a small fraction of the surface frozen, the ice dynamics system may become much larger than the water level system, which has only one equation per grid point. The system is unsymmetric and indefinite, with some diagonal elements being very small: max_i |a_ij| > 10^6 |a_jj|. It is also strongly ill-conditioned. The matrix from the 12 nm grid in Table 1 has condition number κ_2 ≈ 6.6 · 10^15,
which is close to the machine precision limit. On parallel computers large linear systems are most commonly solved with iterative methods. But for ill-conditioned matrices an iterative solver needs a good preconditioner in order to have rapid and robust convergence. Kleine and Sklyar 9 found it difficult to apply common iterative solvers to the ice equation system and they instead turned to a direct solver. It was therefore decided to use a direct solver also for parallel HIROMB.* Lack of diagonal dominance, indefiniteness, and a large condition number suggest that partial pivoting would be necessary to get an accurate solution with a direct solver. Serial HIROMB, however, successfully makes use of the YSMP 10 solver, which does no pivoting. We have compared results with and without partial pivoting using the direct solver SuperLU.11 Both the residuals and the solution differences indicate that pivoting is not needed for these matrices. Furthermore, as the linear system originates from Newton iterations of a nonlinear system, an exact linear solution is not necessary for the nonlinear iterations to converge.

* Some tests with Matlab's GMRES and incomplete factorization as preconditioner have been done with the systems in Table 1. At least 45% of the non-zero entries in L and U had to be retained for the algorithm to converge.
4 Parallel Direct Sparse Matrix Solvers
At the start of the parallelization project, there were few direct sparse solvers available for distributed memory machines. At the time we only had access to a solver written by Bruce Herndon, then at Stanford University.12 Like the YSMP solver it handles unsymmetric matrices, and it does not pivot for stability. Later the MUMPS 13 solver from the European PARASOL project 14 was added as an alternative. The solution of a linear system with a direct method can be divided into four steps: (i) ordering, where a fill-reducing reordering of the equations is computed; multiple minimum degree (MMD) and nested dissection (ND) are two common algorithms for this; (ii) symbolic factorization, which is based solely on the matrix's non-zero structure; (iii) numeric factorization, where the actual LU factors are computed; and (iv) solving, which involves forward and back substitution using the computed LU factors. In HIROMB's ice dynamics a sequence of matrices is factorized and solved until convergence in each time step. During the iterations the non-zero structure of the matrices stays the same, so the ordering and symbolic factorization steps need only be done once per time step. Herndon's solver performs all four solution steps in parallel, whereas version 4.0.3 of the MUMPS
solver has only parallelized the two latter numerical factorization and solve stages. Both Herndon's solver and MUMPS use the multi-frontal method 15 to factorize the matrix. The multi-frontal method seeks to take a poorly structured sparse factorization and transforms it into a series of smaller dense factorizations. These dense eliminations exhibit a well understood structure and can be made to run well using techniques already developed for dense systems. In Herndon's solver, processors factorize their own local portions of the matrix independently and then cooperate with each other to factorize the shared portions of the matrix. The shared equations are arranged hierarchically into an elimination tree, where the number of participating processors is halved at each higher level in the hierarchy. Thus, if the fraction of shared equations is large, performance will suffer due to lack of parallelism. Due to this arrangement, the Herndon solver also requires the number of participating processors to be a power of two. The MUMPS solver is able to parallelize the upper levels of the elimination tree further by using standard dense parallel solvers to factorize the equations remaining near the root of the elimination tree.
5 Matrix decomposition
Since MUMPS does the equation reordering and symbolic factorization serially, the whole matrix is first moved to a master processor. During symbolic factorization an optimal load balanced matrix decomposition is computed. The matrix is then distributed onto the processors according to the computed decomposition for subsequent numerical factorizations. Herndon's solver accepts and uses the original matrix decomposition directly. No data redistribution is necessary before calling the solver. But performance will suffer if the initial matrix decomposition is not load balanced and suitable for factorization. As can be seen in Figure 3, HIROMB's ice distributions are in general very unbalanced. The performance of Herndon's solver for HIROMB's ice matrices may be substantially improved by first redistributing the matrix using ParMETIS. ParMETIS 16 is a parallel version of the popular METIS package for graph partitioning and fill-reducing matrix ordering. It can compute a matrix decomposition which is optimized for parallel factorization. The matrix is first decomposed using multilevel recursive bisection and then each partition is ordered locally with a multiple minimum degree (MMD) algorithm.
Figure 3. The picture on the left shows the ice distribution in the 1 nm grid on May 3, 2000. About 13% of the grid points (18,251 out of 144,449) were ice covered. The picture on the right shows an example of how this grid would be decomposed into 74 blocks and distributed onto 16 differently colored processors.
6 Solver performance
We have measured solver performance on a set of equation systems from a late spring forecast (May 3, 2000). On this date 13% of the 1 nm grid was ice covered, as shown in Figure 3. Table 1 gives the matrix sizes. We limit the discussion below to the matrices of the two finer grids, 3 nm and 1 nm. Although none of the solvers show good speedup for the 12 nm grid, the elapsed times are negligible in relation to the time for a whole time step. Times for solving ice matrices from the 3 nm grid are shown in the upper graph of Figure 4. The time to compute a fill-reducing equation ordering is included under the symbolic factorization heading. The graph shows that the MUMPS solver is the best alternative for all processor counts up to 16. For 32 processors and above, Herndon's solver combined with ParMETIS redistribution gives the best performance. The lower graph of Figure 4 gives times to factorize and solve the matrix from the 1 nm grid. This matrix is 9 times larger than the 3 nm matrix and at least 8 processors were necessary to generate it due to memory constraints. Here, MUMPS is slowest and shows little speedup. Herndon's solver with ParMETIS gives by far the best performance in all measured cases.
Figure 4. The top graph gives measured time to factorize and solve an ice matrix from the 3 nm grid. For each number of processors a group of three bars is shown. The left bar gives times for the MUMPS solver, the middle bar times for Herndon's solver, and the right bar gives times for Herndon's solver with ParMETIS redistribution. Timings for Herndon's solver on 64 processors are missing. The lower graph gives measured time spent to factorize and solve the ice matrix from the 1 nm grid. At least 8 processors were necessary for this matrix due to memory constraints.
In this forecast, on average three iterations were necessary to reach convergence for the nonlinear system in each time step. So in order to get all time spent in the solver, the times for the numerical factorization and solve phases should be multiplied by three. However, doing this does not change the mutual relation in performance between the solvers. The differences in time and speedup between the solvers are clearly revealed in Figure 5. Here a time line for each processor's activity is shown using the VAMPIR tracing tool. MUMPS is slowest because it does not parallelize symbolic factorization, which accounts for most of the elapsed time. MUMPS would be more suitable for applications where the matrix structure does not change, e.g. the Helmholtz equations for water level in HIROMB. Herndon's solver is faster because it parallelizes all solution phases, but it suffers from bad load balance as the matrix is only distributed over those processors that happen to have the ice-covered grid points. By redistributing the matrix with ParMETIS, all processors can take part in the computation, which substantially improves performance.
6.1 ParMETIS optimizations
Initially ParMETIS had low speedup and long execution times. For the 1 nm matrix in Table 1 the call to ParMETIS took 5.9 seconds with 32 processors. The local ordering within ParMETIS is redundant, as Herndon's solver has its own MMD ordering step. By removing local ordering from ParMETIS, the time was reduced to 5.0 seconds, and another 0.7 seconds was also cut from the solver's own ordering time. HIROMB originally numbered the ice drift equations first, then the strain rate equations and finally the stress equations. This meant that equations referring to the same grid point would end up far apart in the matrix. Renumbering the equations so that all equations belonging to a grid point are held together reduced the ParMETIS time by 3.8 seconds, down to 1.2 seconds. Now the whole matrix redistribution, factorization and solve time, 4.7 seconds, is lower than the initial ParMETIS time. HIROMB's ice matrices may become arbitrarily small when the first ice appears in fall and the last ice melts in spring. Due to a bug in the current version 2.0 of ParMETIS, it fails to generate a decomposition when the matrix is very small. By solving matrices smaller than 500 equations serially, without calling ParMETIS, the problem is avoided.
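The renumbering can be pictured with a toy index calculation; the sketch below is illustrative only (the point and component counts are arbitrary) and simply contrasts the original variable-type-major numbering with the gridpoint-major numbering that keeps the eight unknowns of one point adjacent.

  program renumber_demo
    implicit none
    integer, parameter :: npoints = 1000       ! ice-covered grid points (arbitrary)
    integer, parameter :: ncomp   = 8          ! unknowns per point: drift, strain rates, stresses
    integer :: ip, ic
    do ip = 1, 2                               ! show the first two grid points
       do ic = 1, ncomp
          print '(4(a,i0))', 'point ', ip, '  component ', ic, &
               '  old eq ', (ic-1)*npoints + ip, '  new eq ', (ip-1)*ncomp + ic
       end do
    end do
  end program renumber_demo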
Figure 5. VAMPIR traces of MUMPS (top), Herndon (middle) and Herndon with ParMETIS (bottom) when solving the 1 nm grid ice matrix from the May 3, 2000 dataset on 16 processors. Yellow represents active time spent in the solver, green is ParMETIS time and brown is wasted waiting time due to load imbalance. Initialization, shown in light blue, is not part of the solver time. Compare with Figure 4, lower part.
7 Conclusion
The ice dynamics model used in HIROMB necessitates an efficient direct sparse solver in order to make operational 1 nm forecasts feasible. The performance achieved with Herndon's solver and ParMETIS was absolutely essential. SMHI's operational HIROMB forecasts today run on 33 processors of a T3E-600. The water model (barotropic and baroclinic) accounts for 12 seconds per time step. A full summer (ice free) 48-hour forecast is produced in 1.0 hours with a time step of 10 minutes. The May 3, 2000 forecast with 13% ice coverage (see Figure 3) would with the MUMPS solver take 6.8 hours. Using Herndon's solver with ParMETIS the forecast is computed in 1.8 hours. The difference between the solvers grows even larger with more ice coverage.

Acknowledgments
The parallelization of HIROMB was done together with Dr Josef Schüle at the Institute for Scientific Computing in Braunschweig, Germany. The author would like to thank Lennart Funkquist at SMHI and Dr Eckhard Kleine at BSH for their help with explaining the HIROMB model. Computer time was provided by the National Supercomputer Centre in Linköping, Sweden. Financial support from the Parallel and Scientific Computing Institute (PSCI) is gratefully acknowledged.

References
1. Lennart Funkquist and Eckhard Kleine. HIROMB, an introduction to an operational baroclinic model for the North Sea and Baltic Sea. Technical report, SMHI, Norrköping, Sweden, 200X. In manuscript.
2. Tomas Wilhelmsson and Josef Schüle. Running an operational Baltic Sea model on the T3E. In Proceedings of the Fifth European SGI/CRAY MPP Workshop, CINECA, Bologna, Italy, September 9-10 1999. URL: http://www.cineca.it/mpp-workshop/proceedings.htm.
3. Josef Schüle and Tomas Wilhelmsson. Parallelizing a high resolution operational ocean model. In P. Sloot, M. Bubak, A. Hoekstra, and B. Hertzberger, editors, High-Performance Computing and Networking, number 1593 in LNCS, pages 120-129, Heidelberg, 1999. Springer.
4. Tomas Wilhelmsson and Josef Schüle. Fortran memory management for parallelizing an operational ocean model. In Hermann Lederer and Friedrich Hertweck, editors, Proceedings of the Fourth European SGI/CRAY MPP Workshop, pages 115-123, IPP, Garching, Germany, September 10-11 1998. URL: http://www.rzg.mpg.de/mpp-workshop/papers/ipp-report.html.
5. Nils Gustafsson, editor. The HIRLAM 2 Final Report. HIRLAM Tech. Rept. 9, available from SMHI, S-60176 Norrköping, Sweden, 1993.
6. Jan O. Backhaus. A three-dimensional model for the simulation of shelf sea dynamics. Deutsche Hydrographische Zeitschrift, 38(4):165-187, 1985.
7. Jan O. Backhaus. A semi-implicit scheme for the shallow water equations for application to sea shelf modelling. Continental Shelf Research, 2(4):243-254, 1983.
8. W. D. Hibler III. Ice dynamics. In N. Untersteiner, editor, The Geophysics of Sea Ice, pages 577-640. Plenum Press, New York, 1986.
9. Eckhard Kleine and Sergey Sklyar. Mathematical features of Hibler's model of large-scale sea-ice dynamics. Deutsche Hydrographische Zeitschrift, 47(3):179-230, 1995.
10. S. C. Eisenstat, H. C. Elman, M. H. Schultz, and A. H. Sherman. The (new) Yale sparse matrix package. In G. Birkhoff and A. Schoenstadt, editors, Elliptic Problem Solvers II, pages 45-52. Academic Press, 1994.
11. Xiaoye S. Li and James W. Demmel. A scalable sparse direct solver using static pivoting. In 9th SIAM Conference on Parallel Processing for Scientific Computing, 1999.
12. Bruce P. Herndon. A Methodology for the Parallelization of PDE Solvers: Application to Semiconductor Device Physics. PhD thesis, Stanford University, January 1996.
13. Patrick R. Amestoy, Iain S. Duff, and Jean-Yves L'Excellent. MUMPS multifrontal massively parallel solver version 2.0. Technical Report TR/PA/98/02, CERFACS, 1998.
14. Patrick R. Amestoy, Iain S. Duff, Jean-Yves L'Excellent, and Petr Plechac. PARASOL: An integrated programming environment for parallel sparse matrix solvers. Technical Report RAL-TR-98-039, Department of Computation and Information, Rutherford Appleton Laboratory, Oxon, UK, May 6 1998.
15. I. S. Duff, A. M. Erisman, and J. K. Reid. Direct Methods for Sparse Matrices. Oxford University Press, London, 1986.
16. George Karypis and Vipin Kumar. A coarse-grain parallel formulation of multilevel k-way graph partitioning algorithm. In 8th SIAM Conference on Parallel Processing for Scientific Computing, 1997.
PARALLEL COUPLING OF REGIONAL ATMOSPHERE AND OCEAN MODELS
STEPHAN FRICKENHAUS
Alfred-Wegener-Institute for Polar and Marine Research, Columbusstrasse, 27568 Bremerhaven, Germany
E-mail: [email protected]
RENE REDLER AND PETER POST
Institute for Algorithms and Scientific Computing, German National Research Center for Information Technology, Schloss Birlinghoven, D-53754 Sankt Augustin, Germany
In coupled models the performance of massively parallel model components strongly suffers from sequential coupling overhead. A coupling interface for parallel interpolation and parallel communication is urgently required to work out this performance dilemma. Performance measurements for a parallel coupling of parallel regional atmosphere and ocean models are presented for the CRAY-T3E-1200 using the coupling library MpCCI. The different rotated grids of the models MOM2 (ocean-seaice) and PARHAM (atmosphere) are configured for the arctic region. In particular, the underlying MPI implementations CRAY-MPI and metaMPI are compared in their performance for some relevant massively parallel configurations. It is demonstrated that an overhead of 10% for coupling, including interpolation and communication, can be achieved. Perspectives for a common coupling specification are given, enabling the modeling community to easily exchange model components as well as coupling software, making model components reusable in other coupling projects and on next-generation computing architectures. Future applications of parallel coupling software in parallel nesting and data assimilation are discussed.
1 Introduction
The climate modeling community produces a growing number of model components, e.g., for simulations of atmosphere, ocean and seaice. Currently, more and more model codes are parallelized for running on massively parallel computing hardware, driving numerical performance to an extreme, mostly with the help of domain decomposition and message passing techniques. Given the undoubted need for investigation of coupled high-performance models, a new performance bottleneck appears: the necessary interpolation and communication of domain-decomposed data between the model components 1. In particular, scalability of coupled massively parallel models is strongly bound when using a sequential coupling scheme, i.e., gathering distributed data from processors computing the sending model component, interpolating, communicating and scattering data to processors computing the receiving
model component. The alternative to such an external coupling approach is the internal coupling approach: mixing the codes of model components to operate on the same spatial domains, for convenience with the same spatial resolution. Thus, internal coupling puts strong limits on the flexibility of the model components. In external coupling, the performance of the coupled model can be optimized by running model components in parallel, each on an optimal, load-balancing number of processors. Furthermore, external coupling allows for an easy replacement of model components, at least when a certain standard for coding the coupling is followed. To overcome the bottleneck of sequential coupling, a set of parallel coupling routines is required, capable of parallel interpolation of data between partly overlapping domains and of managing all required communication in parallel, e.g., by a message passing technique. As an implementation of such a functionality, the mesh-based parallel code coupling interface MpCCI 3,4 is used in the following. MpCCI can be considered as a medium-level application programming interface, hiding the details of message passing and interpolation in a library, while offering a small set of subroutines and extensive flexibility through the use of input configuration files. It is advantageous for the integration into a certain class of model codes to encapsulate calls to the library in a high-level interface, the model interface, allowing, for example, for an easy declaration of regular domain-decomposed grids and for a simple call to a coupling routine. The details of the interface developed for the presented models are not the subject of this paper. Instead, performance measurements for the specified arctic regional model in different massively parallel configurations and an outline for further applications as well as for standardization of model interfaces are presented. Concerning the programming effort of coupling, it may be impractical to mix model components: memory per processor is limited, and file unit numbers or the names of variables and/or common blocks may coincide. Making model components compatible to work in a single executable (SPMD) by using a common I/O library and a common memory allocation scheme may be achievable for model codes of low complexity. However, such a procedure must be repeated for every new model component and also for model code updates; furthermore, the reusability of coupled model components is better without references to special I/O-managing libraries and naming conventions. The approach of leaving model component codes in separate binaries (MPMD, i.e., Multiple Program Multiple Data) seems much more practical. However, on certain computing architectures this requires a metacomputing
library for message passing. For example, on a CRAY-T3E using CRAY-MPI, two different executables cannot be launched in one MPI context; it is also not possible with CRAY-MPI or CRAY-shmem to establish message passing communication between separately launched groups of MPI processes, i.e., between application teams. This is worked around with metacomputing MPI implementations, such as metaMPI or PACX 2. Furthermore, a metacomputing MPI allows for coupling model components across different computing architectures, even in different locations, provided that a high-bandwidth, low-latency network connection is installed. In the following presentation of performance measurements the potentials of metaMPI and CRAY-MPI are investigated in detail.
2 MpCCI and the Model Interface for Domain Decomposed Data
MpCCI is designed as a library. It enables loose coupling of different (massively) parallel or sequential simulation codes. This software layer realizes the exchange of data, which includes neighborhood search and interpolation between the different arbitrary grids of any two codes that take part in the coupled problem. In parallel applications the coupling interfaces of each code can be distributed among several processors. In this case communication between pairs of processes is only realized where data exchange is necessary due to the neighborhood relations of the individual grid points. In the codes themselves the communication to MpCCI is invoked by simple calls to the MpCCI library that syntactically follow the MPI nomenclature as closely as possible. On a lower level, and hidden from the user, message passing between each pair of codes in a coupled problem is performed by subroutine calls that follow the MPI standard precisely. Therefore the underlying communication library can be a native MPI implementation (usually an optimized communication library tuned to the hardware by the vendor), MPICH or any other library that obeys the MPI standard, like, e.g., metaMPI. It must be noted that for coupling of domain decomposed data by interpolation on the nodes, elements must be defined spanning the processor boundaries of data domains. Otherwise, gridpoints of the receiving model lying between gridpoints of the domain boundaries of the sending model do not receive data. This also requires the introduction of ghostpoint data that must be updated before sending data. Such a functionality is easily implemented in the model interface to MpCCI. Furthermore, due to the rather simple, but very precise, conservative
Figure 1. The different rotated grids of the arctic atmosphere model HIRHAM and the ocean-seaice model MOM.
interpolation of fluxes in MpCCI, the received fluxes show artificial patterns with strong deviations from a smooth structure. These deviations must be smoothed out locally, either by calculation of local mean values or by a more sophisticated local smoother that may be based on an anisotropic diffusion operator. Such a smoother, with local diffusion coefficients calculated from the interpolation error of a constant flux, is currently under development. Alternatively, one might use the non-conservative interpolation also for the fluxes and rescale the received data such that global conservativity is restored.
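The rescaling alternative mentioned above can be sketched in a few lines: after a non-conservative interpolation, the received flux is scaled so that its area integral matches the integral computed on the sending grid. The routine below is a hedged illustration, not MpCCI code; the array names, the use of simple cell areas and the choice of a single global factor are assumptions.

  program rescale_demo
    implicit none
    real :: flux(4,4), area(4,4), total_sent
    area = 1.0                                  ! toy cell areas of the receiving grid
    call random_number(flux)                    ! stands for the interpolated flux field
    total_sent = 10.0                           ! area-integrated flux on the sending grid
    call rescale_flux(flux, area, total_sent)
    print *, 'integral after rescaling =', sum(flux*area)   ! equals total_sent
  contains
    subroutine rescale_flux(flux_recv, area_recv, total_sent)
      real, intent(inout) :: flux_recv(:,:)
      real, intent(in)    :: area_recv(:,:), total_sent
      real :: total_recv
      total_recv = sum(flux_recv * area_recv)
      if (total_recv /= 0.0) flux_recv = flux_recv * (total_sent / total_recv)
    end subroutine rescale_flux
  end program rescale_demo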
3 Measuring parallel coupling performance
The ¼° arctic ocean-seaice model MOM2 has 243x171 horizontal gridpoints on 30 levels. The ½° atmosphere model HIRHAM works on 110x100 horizontal gridpoints on 19 levels. In Figure 1 the rotated grids of the models are sketched over the arctic. The atmosphere model communicates 6 scalar fluxes and 2 scalar fields to the ocean-seaice model, making a total of 0.08 MW (1 Megaword = 8 Megabyte) coupling data on the sender side, and 0.33 MW for the receiver after interpolation. In the reverse direction 4 scalar fields are sent, summing up to 0.2 MW coupling data in total for the sender and 0.06 MW for the
Table 1. Performance measurements of MpCCI for the configuration of the coupled model of the arctic; bandwidth data is given in Megawords per second [MW/s] (one word = one 8 byte double precision number), see text. OS: ocean send; OR: ocean receive; AR: atm. receive; AS: atm. send.

                 std MPI [MW/s]                 metaMPI-local [MW/s]
PEs (ocn, atm)   OS     OR     AR     AS        OS      OR      AR      AS
20, 80           1.15   1.28   1.97   0.20      0.092   0.057   0.026   0.013
30, 110          1.06   0.92   1.62   0.16      0.044   0.025   0.006   0.012
1, 100           0.29   0.37   0.066  5.53      0.387   0.342   0.135   0.091

                 std MPI / metaMPI
PEs (ocn, atm)   OS     OR     AR     AS
20, 80           14     20     8      151
30, 110          24     37     27     135
1, 100           0.7    1.0    0.5    61
receiver. Here gridpoints from non-overlapping domains were included in the counting. The lower block of Table 1 shows the ratio of CRAY-MPI bandwidths over metaMPI bandwidths. Since the timed routines contain, besides the communication routines, MpCCI-implicit interpolation routines, the increase in bandwidth does not depend linearly on the achievable increase in point-to-point bandwidth between processors of the two models when switching from metaMPI to CRAY-MPI. It is noteworthy that metaMPI has almost the same communication performance between the processors within the model components as CRAY-MPI, i.e., the performance of uncoupled models is unchanged. It is seen that in the case of coupling a single MOM process (holding the full-size arrays of boundary data) with 100 HIRHAM processes, the use of metaMPI has a noteworthy influence only on the HIRHAM-to-MOM send bandwidth (the CRAY-MPI bandwidth is 61 times higher). In the setups with parallel MOM coupling (upper two rows) the reduction of the bandwidth for the HIRHAM-to-MOM send is also dominant. In Figure 2 the timing results for a set of communication samples are displayed for 20 MOM processors coupled to 80 HIRHAM processors. The upper graph displays results from CRAY-MPI, the lower graph from metaMPI. The displayed samples are a sequence of 20 repeated patterns. The points in the patterns represent the timings of the individual processors. It is observed in the upper graph that the receiving of data in MOM
(MOM-RCV) takes the longest times (up to 0.225 seconds), while the corresponding send operation from parallel HIRHAM (PH-SND) is much faster (up to 0.05 seconds). In contrast, the communication times for the reverse direction are more balanced. In the lower graph, displaying the results for metaMPI usage, communication times appear more balanced. The times of up to 4 seconds are a factor 18 above the corresponding CRAY-MPI measurements. In this massively parallel setup, coupling communication times would almost dominate the elapsed time of model runs, since the pure computing time for a model time interval between two coupling communication calls (typically one hour of model time) is of the same order of magnitude (data not shown). In Figure 3 the timing results for communication are displayed for 30 MOM processors and 110 HIRHAM processors. Qualitatively the same behavior is seen as in Figure 2. For the usage of CRAY-MPI (upper graph) comparable timings are measured. However, for metaMPI, the maximum times are 8 seconds for certain HIRHAM receive operations, which is a factor 2 longer than in Figure 2, bottom graph. Clearly the ratio of communication times to computation times is even worse than in the setup used for Figure 2. Figure 4 depicts the timing results for coupling communication between one MOM processor and 100 HIRHAM processors. It is seen in the upper graph that the MOM receive operations dominate the coupling communication times (about 0.85 seconds at maximum). This characteristic is also found for metaMPI usage (lower graph). Interestingly, in this setup the coupling times are also nearly unchanged. Furthermore, the four displayed operations are performed partially in parallel. The net time of 1.33 seconds used for one coupling communication call is also found for metaMPI (data not shown). This corresponds well to the bandwidth ratios given in the lower block of Table 1.
208
MOM-SND MOM-RCV PH-RCV PH-SND
0,25
0,2
%
• V- 4
:
•
¥
• ***/: **
\
-
•V; u0,i5 - .
• ;
.
• • • ^^^^^33K:=s&^.::tf^^t2£Ei&:£i&£^^Kfc^^
0,1
0,05
samples —. * *v»*^A^i?
^*-^
-*te
ifeil^fSftSslSSiS
jjf^f^w^imw"
*« * ^ * ^ t • * j i { * i ^ *r»
. / ,
_»*»«*+*,
*
.
.
*
.
' !*•
*r:;**';;^^*t?* t ::^::^**:^rt:*ft"*:***v?**
*£!*
0-5"(iiWt|QioOiti^;-.aM«ii^io>iii^i,i»Vii^i^ita''