PERFORMANCE-ORIENTED APPLICATION DEVELOPMENT FOR DISTRIBUTED ARCHITECTURES
Performance-Oriented Application Development for Distributed Architectures: Perspectives for Commercial and Scientific Environments
Edited by
M. Gerndt, Department of Informatics, Technical University Munich, Munich, Germany
IOS Press / Ohmsha
© 2002, IOS Press. All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior written permission from the publisher. ISBN 1 58603 267 4 (IOS Press). ISBN 4 274 90514 4 C3055 (Ohmsha). Library of Congress Control Number: 2002106946. This is the book edition of the journal Scientific Programming, Volume 10, No. 1 (2002). ISSN 1058-9244.
Publisher: IOS Press, Nieuwe Hemweg 6B, 1013 BG Amsterdam, The Netherlands. Fax: +31 20 620 3419. E-mail: [email protected]
Distributor in the UK and Ireland: IOS Press/Lavis Marketing, 73 Lime Walk, Headington, Oxford OX3 7AD, England. Fax: +44 1865 750079
Distributor in the USA and Canada: IOS Press, Inc., 5795-G Burke Centre Parkway, Burke, VA 22015, USA. Fax: +1 703 323 3668. E-mail: [email protected]
Distributor in Germany, Austria and Switzerland: IOS Press/LSL.de, Gerichtsweg 28, D-04103 Leipzig, Germany. Fax: +49 341 995 4255
Distributor in Japan: Ohmsha, Ltd., 3-1 Kanda Nishiki-cho, Chiyoda-ku, Tokyo 101-8460, Japan. Fax: +81 332332426
LEGAL NOTICE The publisher is not responsible for the use which might be made of the following information. PRINTED IN THE NETHERLANDS
Contents

Guest-editorial: PADDA 2001, M. Gerndt  1
Performance engineering, PSEs and the GRID, T. Hey and J. Papay  3
NINJA: Java for high performance numerical computing, J.E. Moreira, S.P. Midkiff, M. Gupta, P. Wu, G. Almasi and P. Artigas  19
Dynamic performance tuning supported by program specification, E. Cesar, A. Morajko, T. Margalef, J. Sorribes, A. Espinosa and E. Luque  35
Memory access behavior analysis of NUMA-based shared memory programs, J. Tao, W. Karl and M. Schulz  45
SKaMPI: a comprehensive benchmark for public benchmarking of MPI, R. Reussner, P. Sanders and J.L. Traff  55
Efficiently building on-line tools for distributed heterogeneous environments, G. Rackl, T. Ludwig, M. Lindermeier and A. Stamatakis  67
Architecting Web sites for high performance, A. Iyengar and D. Rosu  75
Mobile objects in Java, L. Moreau and D. Ribbens  91
Author Index

Almasi, G.  19
Artigas, P.  19
Cesar, E.  35
Espinosa, A.  35
Gerndt, M.  1
Gupta, M.  19
Hey, T.  3
Iyengar, A.  75
Karl, W.  45
Lindermeier, M.  67
Ludwig, T.  67
Luque, E.  35
Margalef, T.  35
Midkiff, S.P.  19
Morajko, A.  35
Moreau, L.  91
Moreira, J.E.  19
Papay, J.  3
Rackl, G.  67
Reussner, R.  55
Ribbens, D.  91
Rosu, D.  75
Sanders, P.  55
Schulz, M.  45
Sorribes, J.  35
Stamatakis, A.  67
Tao, J.  45
Traff, J.L.  55
Wu, P.  19
Scientific Programming 10 (2002) 1-2 IOS Press
Guest-Editorial
PADDA 2001
This special issue is devoted to programming models, languages, and tools for performance-oriented program development in commercial and scientific environments. The included papers are based on presentations given at the workshop PADDA 2001 (Performance-Oriented Application Development for Distributed Architectures - Perspectives for Commercial and Scientific Environments), held at Technische Universität München on April 19-20, 2001. The workshop was funded by KONWIHR (Competence Network for Scientific High Performance Computing in Bavaria, konwihr.in.tum.de). The goal of the workshop was to identify common interests and techniques for performance-oriented program development in scientific and commercial environments. Distributed architectures currently dominate the field of highly parallel computing. Highly parallel machines range from clustered shared-memory systems like the Hitachi SR8000 and SGI Origin 2000, to PC-based clusters connected via Myrinet/Beowulf, SCI or Fast Ethernet. In addition, meta-computing applications are executed in parallel on high-performance systems and special purpose systems, which are connected via a wide-area network. Distributed architectures, based on Internet and mobile computing technologies, are important target architectures in the domain of commercial computing too. Program development is facilitated by distributed object standards such as CORBA, DCOM, and Java RMI. Those standards provide support for handling remote objects in a client-server fashion, but must also ensure certain guarantees on the quality of service. A large number of applications, such as data mining, video on-demand, and e-commerce, require powerful server systems, which are quite similar to the parallel systems used in high performance computing applications. In both areas performance tuning plays a very important role. For high performance computing, both individual applications and co-operating applications in meta-computing environments have to be tuned to
improve speedup and enable new complex simulations to be undertaken. In commercial data processing, performance is crucial to enable the processing of huge data sets and customer requests. For example, e-commerce applications must support rapid response times to customers, which may become a deciding factor in whether customers remain with a given service provider. Similarly, such electronic businesses may need to sustain a large number of transactions and manage large data sets. A similar requirement is found in emerging applications in scientific computing which make use of distributed visualization and data management. On one side, there is an uptake of programming standards from the commercial arena into high performance computing, since these standards facilitate meta-computing and since programmers coming from universities are trained in these programming styles. On the other hand, a wealth of experience in performance analysis exists in the area of high performance computing that might be applicable to the commercial arena. The papers included in the special issue come from both areas, scientific computing and commercial computing. The first five papers are devoted to scientific computing, while the other three papers present aspects and techniques for commercial computing. The first paper, written by Tony Hey et al. on Performance Engineering, PSEs and the Grid, gives an introduction to requirements and standard techniques for performance analysis in scientific environments. It also presents new requirements and techniques in the areas of problem solving environments and grid computing. The second paper, written by Jose Moreira et al. on NINJA: Java for High Performance Numerical Computing, discusses the challenges in using Java for scientific programs. It presents the motivation for using Java for numerical applications as well as the current major performance problems in doing so. It also presents recent techniques for improving the performance of numerical codes written in Java.
The third paper, written by Eduardo Cesar et al. on Dynamic Performance Tuning Supported by Program Specification, presents an environment for automatic performance analysis and program tuning of message passing codes. Kappa-Pi analyzes traces from message passing programs, detects performance bottlenecks automatically, and gives recommendations for improving the code. The environment is currently being extended for on-line tuning of applications based on abstract program structures, such as SPMD or divide and conquer. The fourth paper, written by Jie Tao et al. on Memory Access Behavior Analysis of NUMA-based Shared Memory Programs, presents a new monitoring infrastructure for SCI-based NUMA machines. It is based on a low-level hardware monitoring facility in SCI interfaces and includes software layers and tools that allow the user to investigate the remote access performance in the context of the program's control flow and data structures. The fifth paper, written by Ralf Reussner et al. on SKaMPI: A Comprehensive Benchmark for Public Benchmarking of MPI, describes a benchmark for MPI programs which incorporates well-accepted mechanisms for ensuring accuracy and reliability. The benchmarking results are maintained in a public performance database covering different hardware platforms and MPI implementations. It thus assists users in developing performance portable MPI applications. The second group consists of three papers presenting aspects of performance-oriented programming in commercial environments. The first paper, written by Gunther Rackl et al. on Efficiently Building Online Tools for Distributed Heterogeneous Environments, presents a methodology for developing on-line tools based on MIMO, a novel middleware monitor. The MIMO environment implements an event-based online approach in the context of CORBA applications frequently used in commercial environments. The second paper, written by Arun Iyengar et al. on
Architecting Web Sites for High Performance, gives an overview of recent solutions concerning the architectures and the software infrastructures used in building Web site applications. The challenge for such Web sites is the specific combination of high variability in workload characteristics and of high performance demands regarding the service level, scalability, availability, and costs. The third paper in this group, written by Luc Moreau et al. on Mobile Objects in Java, discusses a novel algorithm for routing method calls. Mobile objects have emerged as a powerful paradigm for distributed applications. Important for the success of that paradigm is the performance of transparent method invocation and the availability of distributed garbage collection. This article compares different implementations based on synthetic benchmarks. The broad spectrum of topics covered by the papers in this special issue will, hopefully, enable readers to benefit from techniques developed in both areas. New research and development projects should take advantage of the similarities of the performance requirements in both areas and thus broaden the opportunities for exploitation of the results. Certainly, the workshop was very successful in steering discussions and triggering new ideas for joint research. I would like to thank the members of the PADDA 2001 program committee, Marios Dikaiakos (University of Cyprus), Wolfgang Gentzsch (SUN Microsystems), John Gurd (University of Manchester), Gabriele Kotsis (Universität Wien), Frank Muller (North Carolina State University), Jim Laden (SUN Microsystems), and Omer F. Rana (University of Wales-Cardiff), for setting up an exciting program and for selecting these excellent papers.
Michael Gerndt
Technische Universität München, Germany
Scientific Programming 10 (2002) 3-17 IOS Press
Performance engineering, PSEs and the GRID
Tony Hey (a) and Juri Papay (b)
(a) Director, UK e-Science Programme, EPSRC, Polaris House, North Star Avenue, Swindon SN2 1ET, UK. E-mail: [email protected]
(b) Department of Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, UK. E-mail:
[email protected] Abstract: Performance Engineering is concerned with the reliable prediction and estimation of the performance of scientific and engineering applications on a variety of parallel and distributed hardware. This paper reviews the present state of the art in 'Performance Engineering' for both parallel computing and meta-computing environments and attempts to look forward to the application of these techniques in the wider context of Problem Solving Environments and the Grid. The paper compares various techniques such as benchmarking, performance measurements, analytical modelling and simulation, and highlights the lessons learned in the related projects. The paper concludes with a discussion of the challenges of extending such methodologies to computational Grid environments.
1. Introduction

Performance has been a central issue in computing since the earliest days [3]: 'As soon as the Analytical Engine exists, it will necessarily guide the future course of science. Whenever any result is sought by its aid, the question will then arise - by what course of calculation can these results be arrived at by machine in the shortest time?' Performance engineering may be defined as a systematic approach in which components of both the application and the computer system are modelled and validated. Although performance is probably one of the most frequently used words in the vocabulary of computing, paradoxically it is evident that there is a substantial "knowledge gap" between the software development process and actual performance estimation and optimisation. It is still often the case that programmers and software system designers have insufficient knowledge of the performance implications of their design choices. Indeed, it is clear that systematic performance engineering is not yet an integral part of the software development process and that performance issues often arise very late in the process. As a result it is not surprising that performance problems are a frequent cause of failure of large software development projects.
One possible reason why performance issues do not feature explicitly in current software methodologies is Moore's Law. Up to now, the exponential growth in microprocessor performance has usually enabled users to avoid hitting any serious 'performance wall'. However, in the relatively near future, it is likely that growth in processor performance will slow down and begin to deviate from Moore's Law. Software developers will then be forced to pay more attention to the efficient use of the available silicon real-estate. Furthermore, if the Grid becomes a reality and computer resource 'marketplaces' begin to emerge, software performance on different hardware platforms will be directly related to real costs. In such a 'computational economy', it is clear that performance engineering and reliable performance estimation will play a pivotal role in the establishment of realistic 'performance contracts'. A performance contract is the product of the negotiating process between the suppliers and customers of computing resources. It contains information about the resource demand of applications and the available computing capacity. At present such contracts are not available, mainly because of the lack of reliable performance estimation techniques. The Grid is assumed to be 'an infrastructure that enables flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions and resources' [13]. The resources accessible
via the Grid include computational systems, data storage and specialized facilities and are thus a richer set of 'informational utilities' than the Web. In this context it is helpful to consider the Grid as providing the global middleware infrastructure that will enable the establishment of transient "virtual organizations" on a transparent and routine basis. There will be many different types of applications for the Grid and in many cases it is likely that Problem Solving Environments (PSEs) generalized to the Grid will play an important role. A PSE is an application-specific environment that provides the user with support for all stages of the problem solving process from program design and development to compilation and performance optimisation. Such an environment also provides access to libraries and integrated toolsets, as well as support for visualization and collaboration. It may also implement some form of automated workflow management. Performance engineering - including estimation, monitoring and measurement - will be an integral component of any Grid PSE since reliable models of performance prediction will be required for any realistic Grid scheduling and accounting packages. The paper is organized as follows. Section 2 reviews several different approaches that have been attempted for performance engineering and gives a short account of some performance benchmarking, monitoring and simulation techniques. Section 3 takes a brief look at Problem Solving Environments in the context of performance and presents a short account of two recent UK-based PSE projects. The next section outlines some of the challenges represented by the Grid for the performance evaluation community and reviews some EU experiments on Europe-wide meta-computing. Finally we offer some conclusions and challenges for the performance engineering community.
2. Performance engineering approaches

2.1. Benchmarks

The goal of benchmarking is to understand and predict the key parameters that determine the performance of computing platforms and full-scale applications. There have been numerous benchmarking efforts undertaken in the past, but no general agreement on how to conduct the measurements and how to interpret the results. Examples of benchmarking efforts include the Livermore Loops [38], the NAS Kernels [4] and the ParkBench initiative [16]. It is worthwhile for us to
summarize the objectives and achievements of ParkBench. The main objectives of the ParkBench initiative were:

1) To establish a comprehensive set of parallel benchmarks that is generally accepted by both users and vendors of parallel systems.
2) To provide a focus for parallel benchmark activities and avoid unnecessary duplication of effort and proliferation of benchmarks.
3) To set standards for benchmarking methodology and result-reporting, and to establish a database/repository for both benchmarks and the results.
4) To make parallel and sequential versions of the benchmarks and results freely available in the public domain.

As a result of this effort a benchmark suite was developed which contains sequential and message passing versions of the following codes:
- 5 Low Level Communication codes
- 5 Low Level Sequential codes
- 5 Parallel Linear Algebra Kernels
- 2 NAS Parallel Benchmark Kernels
- 3 NAS Compact Application codes
- ORNL Shallow Water Model Application code
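The low-level communication codes measure point-to-point message latency and asymptotic bandwidth in a ping-pong style. A minimal sketch of such a measurement, written here with mpi4py for brevity (an assumption made for this illustration; the ParkBench codes themselves are Fortran and C), might look like:

# Ping-pong latency/bandwidth measurement in the spirit of a low-level
# communication benchmark such as COMMS1. Run on two ranks, e.g.:
#   mpirun -n 2 python pingpong.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
reps = 1000

for nbytes in (1, 1024, 1024 * 1024):
    buf = np.zeros(nbytes, dtype='b')            # message payload
    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(reps):
        if rank == 0:
            comm.Send(buf, dest=1)
            comm.Recv(buf, source=1)
        elif rank == 1:
            comm.Recv(buf, source=0)
            comm.Send(buf, dest=0)
    t1 = MPI.Wtime()
    if rank == 0:
        one_way = (t1 - t0) / (2 * reps)         # time for a single transfer
        print(f"{nbytes:>8} B  latency {one_way * 1e6:8.1f} us  "
              f"bandwidth {nbytes / one_way / 1e6:8.1f} MB/s")

Fitting the measured times against message length yields the startup latency and asymptotic bandwidth parameters discussed in Section 2.3.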
OpenMP versions of some of these codes are now available [18]. By providing three tiers of benchmark complexity - low-level, kernel and compact application - it was hoped that the performance of real applications could be understood. The low-level codes provide basic machine parameters, the kernels provide information about compute-intensive algorithms and the compact applications add the complexities of start-up, I/O and so on. In the event, ParkBench was only partially successful: lack of dedicated funding for such a benchmark evaluation programme prevented its full exploration and realization. Nonetheless, there were significant achievements - a serious programming methodology was defined, parallel versions of the NAS Parallel Benchmarks were made available in the public domain and a repository for results with a graphical interface was established. In addition, an electronic journal for the rapid publication of performance results was established [19], and this is now a special section of the journal Concurrency and Computation: Practice and Experience. Two examples will illustrate the type of data that was made available by these benchmarks. Figure 1 shows the performance of the parallel LU kernel benchmark
on distributed memory (DM) machines. The data is from 1994 and shows, perhaps for the first time, a parallel DM system outperforming the largest vector supercomputer of the time.

[Fig. 1. Performance of vector vs. distributed memory machines on the LU kernel (Cray C-90, T3D, SP-1, SP-2 and Paragon OSF 1.2; performance vs. number of processors).]

Figure 2 shows results of the low-level communication benchmark COMMS1 for the Intel Delta and the Intel iPSC/860 systems.

[Fig. 2. Benchmarking of communication latency (latency vs. message length in bytes).]

The main concern of performance engineering is to develop techniques that can predict the performance and resource requirements of applications; in this sense benchmarking has little to offer. The result of a benchmark is usually expressed as a single number, which is sufficient for comparing and rating various computers but provides little information about the key parameters governing the resource requirements of real applications.

2.2. Performance measurements

Performance measurements are based on event profiling and tracing. These techniques assume an event model of program execution. During program execution, profiling tools accumulate summary data for significant events such as function calls, cache misses and communications. Typically, this approach has a low overhead, since profiling is usually implemented with simple event counters. Using this method, users can obtain statistical information on the percentage of time spent by their application program in performing various functions and can use this information to identify potential problem areas. Trace tools, on the other hand, provide the user with much more detailed information on the sequence of events as they happen during program execution. Trace files record time-ordered events and can constitute a large volume of data. The information recorded in a trace can represent various levels of abstraction. In the case of parallel platforms, the traces recorded on the different nodes need to be collected, sorted according to time stamps and merged into a global event trace. Although trace tools provide more detailed information about parallel program execution than profiling tools, there can be a significant overhead for trace generation. Furthermore, the large volumes of trace data generated can be overwhelming. There are numerous commercial and academic tracing and profiling tools available. Examples include Apprentice [8], gprof, Vampir [41] and Paradyn [5]. Apprentice is a product of Cray Research which uses source code instrumentation, through compiler switches, to provide statistics at the level of functions and basic blocks. The advantage of source code instrumentation is that the results of monitoring can be easily interpreted in terms of programming language statements and give very direct feedback to the programmer. A problem with this approach is that libraries cannot generally be monitored at the same level of detail, since they are usually only available in binary format.
Unlike the Cray product, Vampir is a commercial graphical event trace browser from Pallas that is available on many different platforms. Vampir also provides visualization and statistical analysis of trace files. Measurement and profiling tools have achieved a high level of maturity; however, their use for reliable performance estimation is still an open issue. These tools can generate a large volume of detailed data, but in order to gain some understanding of the application's runtime behaviour this data needs to be processed and interpreted. The interpretation is not a simple and straightforward process: it requires user intervention and considerable knowledge of the problem domain. An important aspect of these tools is the level of intrusion, which affects the accuracy of measurements and can even alter the behaviour of the system; this is often not mentioned or not quantified. The amount of data that is collected during the measurement is related to the level of intrusion, therefore it is vital to find the right balance between the volume of data and the acceptable level of intrusion.

2.3. Analytical models

Historically, Hockney's n_1/2 and r_inf 'pipeline' model provided a useful abstraction of vector supercomputer architecture [15]. This pipeline model has
been extended to characterize communication performance in parallel DM message-passing systems. In this case the pipeline parameters captured the communication latency and asymptotic communication bandwidth. Typically, the nodes of such systems are scalar processors but it is also possible to use Hockney-style pipeline parameters to provide a simple characterization of the memory hierarchy of the node. These are basic hardware parameters. It is also possible to characterize the 'computational intensity' of an application in terms of the ratio of the number of arithmetic operations performed per off-chip memory access. The parallel program is represented as an alternation of nonoverlapping computation and communication stages. The application program is described in terms of the number of scalar floating point operations, the amount of data transferred and the number of messages. The output of the model is assumed to be the sum of processing and communication times. The model assumes perfect load balance and the timing formula derived for a single processor is used for the performance characterisation of the whole parallel program. A weakness of this model is that it is only valid for the performance analysis of parallel algorithms with regular structures and good load balancing. In recent years, several other cost models have been developed. These include the BSP (Bulk Synchronous
Parallel) model [46,36,37] and the LogP model [9]. Both these approaches attempt to get beyond the usual (unrealistic) assumptions made in 'classical' PRAM complexity analysis. In particular, an attempt is made to take into account the limitations of real systems such as network bandwidth and communication and synchronisation overheads. These models aim to provide a machine independent framework for parallel algorithm design and performance prediction. The LogP model characterises the parallel system by the following parameters:

- the upper bound on communication latency from source to target (L),
- the overhead of send and receive (o),
- the minimum time interval between consecutive transmissions or receptions (g),
- the number of processor/memory modules (P).

The model assumes an asynchronous execution mechanism and finite network capacity, and specifies the work (W) between communications. The BSP model, on the other hand, has an equally simplistic and unrealistic cost model. In fact, the Oxford 'BSP model' bears almost no resemblance to Valiant's actual BSP complexity analysis. In order to prove any useful results, in his original BSP model Valiant requires parallel slackness at the nodes to hide communication delays, two-phase random routing of messages to avoid possible network congestion, and data hashed randomly across the processors with a sufficiently random class of hash functions to avoid memory 'hot spots'. All that is left of Valiant's BSP analysis in the Oxford BSP model is the programming methodology of the bulk synchronous programming style! Both the BSP and LogP models are similar in the sense that they attempt to provide an abstraction of the performance of the communication network and processing nodes using a minimal number of 'average' performance parameters. Details of the network topology and memory hierarchy are ignored. Such models represent, at best, a "back-of-the-envelope" approach to performance prediction. Nevertheless, it must be said that in some cases, as the experiments on the CM-5 showed, the LogP model was able to provide a close match to actual performance measurements [11]. As we will see below, a similar 'average' speed analysis of the Livermore Loops ignoring any effects of the memory hierarchy gives performance results that can be over 100% under- or over-estimated. Other interesting approaches to performance modelling include Carter's Parallel Memory Hierarchy Model [2] and the Manchester group's Overhead Analysis approach [40,44].
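To make the level of abstraction concrete, the standard forms of these models can be written as follows (stated here for orientation rather than taken from the experiments cited above; all parameter values are machine dependent). Under Hockney's pipeline model, the time to transfer a message of $n$ bytes is approximately

$$t(n) \approx t_0 + \frac{n}{r_\infty}, \qquad n_{1/2} = r_\infty\, t_0,$$

so that the achieved bandwidth $r(n) = r_\infty / (1 + n_{1/2}/n)$ reaches half of the asymptotic rate $r_\infty$ at message length $n_{1/2}$. The same style of model gives a whole-program estimate of the form $T \approx W/s + m\, t_0 + D/r_\infty$ for $W$ floating point operations executed at speed $s$, $m$ messages and $D$ bytes of communicated data. Under LogP, a single small point-to-point message costs roughly $L + 2o$, and a processor can issue successive messages no faster than one every $g$ time units.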
Analytical modelling is in fact complexity analysis, which involves an abstract model of program execution and cost models of computation and communication operations. This technique attempts to combine the parameters of the application and the computer in order to produce a mathematical expression for performance estimation. This approach involves numerous assumptions which approximate the system's and the application's behaviour. Although these approximations make the performance evaluation analytically tractable, they significantly reduce the accuracy of predictions and limit the applicability of analytical models to certain classes of applications or computer architectures.

2.4. Simulation

Simulation can provide very detailed information about both the computer system and the application program. This information can be at various levels: in terms of hardware architecture, ranging from simulation of only the main components of the architecture right down to simulations at the gate level, and on the application side, from programming language statements down to machine code. The simulation model characterizes the system by a number of state variables that are updated as the simulation progresses. Simulation techniques can be classified according to four basic types: instruction driven, trace driven, execution driven and event driven. Instruction driven simulation is based on interpreting instructions of the target machine. This technique gives high accuracy, achieved by step-by-step simulation of each instruction, but requires long simulation times. Full instruction level simulation is too time consuming for practical use on real applications and complex machine architectures. Trace driven simulation uses records of measurements obtained from the real system, or synthetic traces generated by a trace generator. The trace is a sequence of user defined events generated by an instrumented program. This technique can require large amounts of memory and processing time to produce reliable results. The DIMEMAS tool is an example of a trace-driven performance prediction tool for message-passing parallel programs [34]. This tool uses the Vampir trace file and scales the CPU time spent in each block of code and the parameters of communication events according to the target machine parameters. Event driven simulation maintains a global queue of events. The operation cycle consists of event fetching, the simulation step and an update of the data structure representing the simulated system [42]. The inputs of
event driven simulation models are probability distributions of response times, request arrivals and delays. The main disadvantages of probabilistic workload models are that they do not directly represent the parameters of specific application programs, and that the results of the simulation require in-depth statistical processing in order to determine the accuracy of the model. Finally, execution-driven simulation models interleave the execution of an application with the simulation of the target system. The main advantages of execution driven simulation are its speed and the use of actual programs for the simulation of parallel architectures, rather than distribution-based or trace driven workloads. That such an execution driven approach is a feasible solution for the simulation of parallel systems has been demonstrated by systems such as the Rice Parallel Processing Testbed (RPPT) [7] and the Wisconsin Wind Tunnel [43]. The disadvantage of this technique is that it is more difficult to implement than the trace driven approach, due to the complex interactions between the application program and the simulator. A key lesson learned from previous work is that any simulation method must take account of the memory hierarchy for realistic performance estimation. Simple static statement analysis of the source code is well known to be unpredictably unreliable, as illustrated by Fig. 3, which shows the difference between predicted and actual execution times of Livermore Fortran kernels on SPARC 1 and SPARC 5 workstations.

[Fig. 3. Prediction error of static statement analysis.]
The PERFORM system developed by Dunlop at Southampton is an execution-driven simulation tool that uses a novel 'Fast Simulation Method' that attempted to improve the accuracy of prediction and overcome some of the problems with a full simulation of the memory hierarchy [10,14]. The model uses the "program slicing" technique to isolate the control variables and array indices of the source code, retaining sufficient information to simulate data movement within the memory hierarchy. The sliced program is then augmented with calls to the PERFORM simulator, which models the effects of the memory hierarchy (cache memory), computation and message passing. Fast simulation is achieved by providing feedback between the simulator and the source and curtailing loop execution when the cache behaviour of iterations is reliably estimated. The main stages of the Fast Simulation Method used in the PERFORM tool are illustrated in Fig. 4.

[Fig. 4. Key stages of the fast simulation method.]

An example of the accuracy of predictions that can be achieved by the PERFORM tool is shown in Fig. 5, which compares actual and predicted performance on a SPARC system. As can be seen, the predicted lower bound provided by PERFORM captures the detailed cache effects very accurately.

[Fig. 5. Predictive power of the PERFORM tool: optimised benchmark results vs. predicted lower bound execution time, as a function of data size (MByte).]

Simulation is a useful technique for the performance evaluation of systems at the design stage of development. Concerns with such simulation-based approaches are the level of detail of the simulation model, the accuracy and the simulation time. Simulation models are usually large programs; their development is expensive and, in the case of instruction-level simulations of the architecture, they require long run times in order to provide meaningful results.
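As a minimal sketch of the event-driven operation cycle described above (a global time-ordered event queue driven by probability distributions of arrivals and service demands), the following illustration simulates a single server; the workload, distributions and parameters are assumptions made purely for the example:

# Event-driven simulation in miniature: a global event queue, event fetch,
# a simulation step, and an update of the system state.
import heapq
import random

def simulate(n_requests=1000, mean_interarrival=1.0, mean_service=0.8, seed=0):
    rng = random.Random(seed)
    events = []                          # global event queue: (time, kind)
    t = 0.0
    for _ in range(n_requests):          # schedule request arrivals
        t += rng.expovariate(1.0 / mean_interarrival)
        heapq.heappush(events, (t, "arrival"))

    busy_until = 0.0                     # state of the single simulated server
    total_response = 0.0
    while events:                        # fetch next event and advance the state
        now, kind = heapq.heappop(events)
        if kind == "arrival":
            start = max(now, busy_until)
            busy_until = start + rng.expovariate(1.0 / mean_service)
            total_response += busy_until - now
    return total_response / n_requests

print(f"mean response time: {simulate():.2f}")

As noted above, the output of such a probabilistic model characterises the workload class rather than a specific application program, which is precisely the limitation discussed in the text.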
3. Problem Solving Environments

A Problem Solving Environment (PSE) is an integrated computing environment which incorporates all stages of the problem solving process, such as problem specification, computation, analysis and optimization. The key issues associated with the architecture design of PSEs are interoperability, modularity and reusability of components. The problem of interoperability of the different software packages stems from the different (and often proprietary) file formats produced by the various components. This is often the case, for example, in mechanical engineering, where we frequently need to implement data exchange between various CAD packages and CAE tools. Several recent papers [20,21] highlight the need to adopt a universal file format based
on XML that will simplify data exchange and structure specification. Many of these pleas are from users with real industrial applications, but vendors of component software packages see little commercial incentive to make their software easily interoperable with packages from other vendors. At present there is no generally accepted methodology for the specification, analysis and design of modular reusable systems. In recent years there has been significant progress in the development of middleware technologies that provide support for system integration based on objects. Examples of these technologies include CORBA, DCOM and, in the context of Web Services, the recently proposed SOAP protocol. Although these technologies share many common features, there is no universally accepted definition of objects and consequently their claims to provide full interoperability within the same system are somewhat questionable. At present, CORBA is the dominant middleware technology in the PSE world. Key advantages of CORBA are the existence of a single specification document and the participation of more than 800 companies in the consortium. Nevertheless, despite the existence of the IIOP inter-ORB protocol, there is still a problem for applications that attempt to mix two or more of the large number of vendor specific implementations of CORBA. An unwelcome result of this diversity is the problem of interoperability between competing middleware products.
It should also be emphasised that programming using any of the CORBA implementations is not trivial and the C++ syntax is rather complex. By contrast, DCOM is Microsoft's answer to CORBA and this has the definite advantage that there is one specification, one implementation and one vendor. A limitation is that DCOM runs only on Windows.
[Fig. 6. A PSE architecture built from Cardiff/Southampton PSE components.]
Arguably, two of the most successful PSEs are the commercial products Matlab and Mathematica. Both of these mature software environments have successfully combined good usability with functionality. Perhaps, strictly speaking, these products do not qualify as genuine PSEs, since they both provide rather generic environments for a broad range of application areas rather than an environment targeted at a single application. Nevertheless, these products have a wide user base and demonstrate what is possible in principle, though neither has seen the development of a version for parallel systems as a high commercial priority. Apart from these examples, there are many 'research' PSEs either in existence or under development in many universities and research institutes around the world. Examples include GasTurbnLab from Purdue [22], BIOPSE from Utah [23] and Autobench from Stuttgart [24]. Unfortunately, it is not clear whether any of these PSE systems are much used by real users. As two examples of PSE projects, we shall briefly describe two ongoing UK projects: one a project with considerable direct industrial involvement and the other a purely 'academic' project. These are the Swansea/BAE Systems project [49] and the Cardiff/Southampton PSE project [47]. The Swansea/BAE Systems project represents a complete industrial environment for multi-disciplinary computational fluid dynamics, electro-magnetics and structural mechanics simulations. This PSE includes a geometry builder, mesh repair, unstructured grid generation, grid quality analysis, post-processing and data analysis,
execution on remote/parallel platforms, help facilities and application integration. The system is based on CORBA and uses a parallel architecture (VIPar) for image processing. This PSE has been further developed in the CAESAR and JULIUS EU projects. A key problem for the implementation of their 'Computational Science Pipeline' is the data transfer between the different components. The aim of the Cardiff/Southampton PSE project is to leverage modern software technologies such as CORBA, Java and XML and to develop key modules which can be used for the rapid prototyping of application-specific PSE environments [47]. The main components developed in this project are a Visual Component Composition Environment, an Intelligent Resource Manager based on the Southampton Intrepid Scheduler [1], and a Software Component Repository. The two applications targeted by this project are Molecular Dynamics and Photonic Crystal Structures simulations. A PSE architecture incorporating the developed components is presented in Fig. 6. The system is based on the object-web concept, where the services are represented as network objects. The problem is formulated as an XML request by the user and the response, also in XML, is produced by the Web server. The user interface is embedded in the browser environment and enables visual programming by allowing "drag-and-drop" of objects in a task-graph design area. The interaction with the user is implemented as a sequence of Web pages. As the commercial middleware ORB, the ORBACUS implementation of CORBA
was selected; it is a mature product and provides numerous services for naming, trading and an interface repository. The main task of the Monitor is to collect and store information about machines and tasks running in the system. The information about machines includes data about the available resources, such as memory, processors, disk and load. The task information is a representation of the resources used by the given task, such as the size of occupied memory, communication and I/O traffic, and disk volume used. The information collected by the Monitor is stored in a database and used by the Scheduler for task allocation and load balancing. The interactions of the Monitor with the other components of the system are represented in Fig. 7. On each computer an Object Server is deployed which instantiates the Reporter object. The Reporter registers with the Name Server, which maintains a list of remote object addresses. The Monitor at regular intervals queries the Reporter objects and updates the Machine and Task tables in the database. The Scheduler provides task allocation to resources, run-time forecasting and dynamic load-balancing. The scheduling is based on machine independent application load models for CPU, memory size, I/O traffic and disk volume. These algebraic expressions are included in the description of each task and represent the task's resource requirements. The scheduling algorithm performs the following steps for each task in the task-graph:
- check the availability of licenses, memory and disk
- generate a list of candidate machines
- compute the time components
- select the minimum execution time
- include the task-machine binding in the schedule
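A minimal sketch of this selection loop is given below; the class fields and the simple time model are illustrative assumptions made for this example and are not the actual Cardiff/Southampton scheduler interface:

# Greedy task-to-machine selection in the style described above. Each task
# carries estimated resource demands; each machine advertises its capacity.
from dataclasses import dataclass, field

@dataclass
class Machine:
    name: str
    free_memory: int                      # MB
    free_disk: int                        # MB
    licenses: set = field(default_factory=set)
    speed: float = 1.0                    # relative CPU speed

@dataclass
class Task:
    name: str
    memory: int                           # estimated memory demand (MB)
    disk: int                             # estimated disk demand (MB)
    license: str                          # required license
    cpu_load: float                       # estimated work (e.g., flop count)
    io_load: float                        # estimated I/O volume

def schedule(tasks, machines, io_rate=100.0):
    """Bind each task to the machine with the minimum estimated run time."""
    bindings = {}
    for task in tasks:
        candidates = [m for m in machines
                      if task.license in m.licenses
                      and m.free_memory >= task.memory
                      and m.free_disk >= task.disk]
        if not candidates:
            raise RuntimeError(f"no machine can run {task.name}")
        best = min(candidates,
                   key=lambda m: task.cpu_load / m.speed + task.io_load / io_rate)
        bindings[task.name] = best.name
    return bindings

The quality of such a schedule clearly depends on the accuracy of the per-task load models, which is exactly where the performance engineering techniques discussed earlier enter.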
The three-year project is nearing completion and a full evaluation with performance measurements will be available soon. Preliminary indications are that such a component-based approach can bring real advantages in terms of software development and deployment. In the final analysis, however, it will be the reaction of users to such an environment that will provide the real measure of success or failure! On the whole, the lessons learnt from many of these PSE experiments are not very encouraging. There are major problems in automating the data flow between CAD and CAE tools. Furthermore, there is little incentive for vendors of legacy codes to make their products interoperable with tools from other vendors. In addition, present PSEs focus almost entirely on the design
and simulation part of the engineering process. There is a real need to incorporate the experimental validation and testing part of the process. Thus a complete PSE would offer support for the recording and analysis of experimental as well as simulation data. Incorporation of databases of experimental measurements and simulation results will allow the development of data mining and knowledge discovery components of the PSE. PSEs have a long way to go to prove their worth in real engineering environments! PSEs often represent large scale meta-applications which require massive computing resources. In this case the role of performance engineering is to predict the resource requirements of the application, to ensure that there is sufficient computing capacity available, and to assign the individual tasks to the most appropriate computers.
4. Grids

4.1. The Grid as a new paradigm

The significant investments currently being made in Grid research show that governments around the world are taking the development of such infrastructure middleware very seriously. With the recent announcement of IBM's support for the Grid, it is not unreasonable to expect that the Grid will eventually become the key middleware not only for science and engineering but also for industry and commerce. In the US several agencies are funding major Grid initiatives. Examples include:
- NASA Information Power Grid [25]
- NSF Science Grid [26]
- NSF GriPhyN Project [27]
- DOE PPGrid [28]
- NSF NVO [29]
- NSF NEESGrid [30]
Most of these Grid infrastructure and application projects make use of the Globus Toolkit [12] as the basic platform on which to provide Grid services. In addition, the NSF-funded GrADS project [6] identifies many important research issues for Grid computing. Europe has also been active in Grid R&D. In addition to two initial EU Grid projects, DataGrid [17] and EuroGrid [31], several new EU Grid-centred projects are currently under negotiation. National governments in the EU have also recognised the potential strategic importance of the Grid. For example, under its new 'e-
Science Programme', the Office of Science and Technology (OST) in the UK has allocated £120M for the deployment of e-Science Grids spanning a wide range of application areas and for the development, with industry, of the associated Grid middleware. The Computational Grid is perhaps best envisaged as an infrastructure that integrates computing resources, data, knowledge, instruments and people. The construction of such an environment will enable the sharing of computational resources, data repositories and facilities in a routine way, just as the Web now allows us to
share information. In cartoon form, this Grid vision is depicted in Fig. 8.

[Fig. 8. Grid vision.]

There are many genuine Computer Science research challenges to be overcome before we can realize this vision. In the context of this paper, an obvious issue is the need for realistic performance estimation. Together with mechanisms for monitoring and accounting, reliable performance estimation will allow the creation of global marketplaces for Grid resources. As a starting point for a discussion of Grid performance estimation, it is worthwhile to review results from some recent
meta-computing projects.

4.2. Meta-computing experiments

There have been numerous meta-computing projects involving performance engineering. Here we shall restrict our discussion to several EU-funded meta-computing projects - Promenvir [35], Toolshed [32] and HPC-VAO [33]. The main application focus of these projects is engineering design optimisation by simulation. In these simulations, the parameter space of key design parameters is explored to find a set of optimal values: simulations must be performed for every new set of parameter values. The applications were drawn from a variety of engineering domains including satellite alignment analysis, surface accuracy analysis, reflector deployment, crash analysis, vibro-acoustic optimization and CFD computation. Optimisation by simulation is computationally expensive, and the user needs these simulations to execute in the shortest possible time or within a set period or resource cost. There is a clear economic incentive to achieve efficient utilisation of the available resources with as little intervention as possible. Several of these meta-computing experiments utilized Europe-wide computing resources. An illustration of the results of the PROMENVIR project, performed by connecting up the resources of project partners across Europe, is given in Table 1. Table 1 contains statistics of the resource usage obtained by running a large-scale Monte Carlo simulation of satellite deployment. The program and associated data were small enough that each simulation could run on a single workstation or node: nontrivial parallelism of the application code was not required. In this simulation a thousand-'shot' (parameter set) computational experiment was performed.

Table 1. WAN experiment statistics

Partner                      CPUs (Nproc)  In PVC  Accessible  Used   Shots successful  Shots failed  Shots total
Southampton (PAC)                      15      15          15    14                150             1          151
Southampton University                 10      10           9     6                 40             0           40
Barcelona (UPC)                        16      16          16    16                275             1          276
Stuttgart (RUS)                        12      12          11     9                104             0          104
Madrid (CASA)                          15      15          15    14                184            15          199
Bilbao (CEIT)                          12      12          12     8                 98             0           98
Torino (ItalDesign)                    11      11           6     5                 63             2           65
Torino (Blue Engineering)              11      11          11     7                 61             6           67
Grand totals                          102     102          95    79                975            25         1000

Total CPUs installed: 102; defined in the PVC: 102; available: 95; used in the WAN: 79.
Elapsed execution time: 4:39:16; approximate single-CPU time: 250 hrs.
Initially all machines listed above were specified as comprising the Parallel Virtual Computer (PVC). During the actual run, however, some of them were either not available or not used due to the pre-existing high load on them. As can be seen, the experiment was very successful: it utilized nearly 100 processors and resulted in a very significant improvement in exploration of the design space. The key module of a distributed computing environment is the scheduler. This must perform task allocation to resources, run-time prediction and dynamic load-balancing. Resource management decisions must be made using platform independent application load, or resource demand, models for CPU, memory size, I/O traffic and disk volume. In the above-mentioned meta-computing projects, such models were developed by benchmarking large industrial codes such as NASTRAN and Sysnoise. The process of obtaining load models and using them in the scheduler is presented in Fig. 9. The accuracy of performance predictions obtained by this technique is illustrated by the case study of a static analysis with NASTRAN [39]. It is important to stress that, as is commonly the case for commercial codes, the source code was not available for instrumentation. A series of 2D and 3D test problems were used for benchmarking on two different architectures: a distributed memory IBM SP2 and a shared memory SGI Power Challenge. During benchmarking, the run-time, memory, disk traffic and disk space parameters were measured. These measurements were then used for the development of analytical performance models. At the first stage a machine independent model of the application is derived. Figure 10 illustrates that the derived CPU-load models (number of floating point operations) for the SP2 and the Power Challenge show a close match, so
that, as might be expected, the number of FPU operations in both cases is approximately the same. The derived load models are used for the development of an analytic expression that incorporates the key application parameters, such as degrees of freedom, front size, number of extracted eigenvalues, etc. The accuracy of the analytical model of CPU-load for the NASTRAN static analysis code is illustrated in Fig. 11.

[Fig. 9. Development and utilisation of platform independent load models.]

[Fig. 10. Derived CPU-loads of static analysis in NASTRAN for 2D and 3D problems on the Power Challenge and SP2.]
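As an illustration of how such a machine-independent load model can be derived from benchmark measurements, the sketch below fits a power-law CPU-load model by least squares in log-log space; the measurements and the single-parameter functional form are assumptions made for the example and are not the NASTRAN data discussed here:

# Illustrative only: fit  W(N) ~ a * N**b  to benchmark measurements, where N
# is a problem-size parameter (e.g., degrees of freedom) and W the measured
# floating point operation count.
import numpy as np

# hypothetical benchmark measurements (problem size, measured flop count)
N = np.array([100, 300, 1000, 3000, 10000, 30000], dtype=float)
W = np.array([2.1e6, 1.9e7, 2.2e8, 2.0e9, 2.3e10, 2.1e11])

b, log_a = np.polyfit(np.log(N), np.log(W), 1)   # least squares in log-log space
a = np.exp(log_a)

def cpu_load(n):
    """Machine-independent estimate of the work for problem size n."""
    return a * n ** b

# predicted run time on a machine sustaining 0.5 Gflop/s
print(f"estimated time: {cpu_load(50000) / 0.5e9:.1f} s")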
Similar models to the one presented in Fig. 11 have been developed for I/O load, disk volume and memory size. The advantage of this approach is that it provides simple mathematical expressions that include the key parameters governing the performance of the application. The main drawback is that the development of these models requires substantial benchmarking effort and also some knowledge of the algorithm and the application.
[Fig. 11. CPU-load model for static analysis in NASTRAN.]

Nevertheless, these experiments demonstrate the level of accuracy that can be obtained in an industrially relevant environment in which the source code of the application package is unavailable.

4.3. Performance and the Grid

Performance estimation and forecasting will be vital ingredients of the future Grid environment, as has been emphasised in several US projects such as GrADS [6], AppLeS [45] and the Network Weather Service [48]. The GrADS project envisages a "performance contract" framework as the basis for a dynamic negotiation process between resource providers and consumers. The Network Weather Service monitors the available performance capacity of distributed resources, forecasts future performance levels using statistical forecasting models, and reports monitoring data to client schedulers, applications and visual interfaces. Such a service is important for the Grid environment but needs to be scalable, portable and secure. There are many open issues that need further investigation, such as the balance between intrusiveness of sensors and accuracy of measurements, fault diagnosis and adaptive sensors.

5. Concluding remarks

In this paper, various techniques used for performance engineering on parallel and distributed systems have been reviewed. We conclude with two remarks:
(1) The national and international levels of investment in Grid computing make it clear that performance estimation, modelling and measurement on the Grid will assume an increasingly important role in any future computational Grid economy. Over the last decade, we have seen a shift in the software industry towards an object-oriented, component-based software methodology. At present, although the programming interfaces and functionality of these components are exposed, there is no methodology for expressing performance trade-offs in the software development process. We therefore suggest that, in addition to specifying interfaces and functions, software methodologies need to incorporate some form of "performance metadata". Such metadata would contain information about the performance and resource requirements of software constructs and components. Only with the availability of such performance metadata will the construction of truly intelligent schedulers become possible. An internationally coordinated effort to define a common format for performance metadata seems long overdue.

(2) As we have seen, performance models range from simple algebraic models that attempt to identify a few key parameters, to complex simulation models with many parameters and involving powerful mathematical techniques such as queuing theory. However, the key to realistic performance prediction lies in understanding the interaction between the application and the computer architecture. It is also important to note that in a typical industrial application users will not have access to the source code of a software package or library routine. These requirements highlight the need for performance model abstractions that are relatively simple and easy to use yet are sufficiently accurate in their predictions to be useful as input to a scheduler or intelligent agent. Reliable performance estimation becomes even more relevant when we consider payment for services in a computational Grid economy. Users will require answers to questions such as best value for money as well as guarantees for specified turn-around times. Finally, we have seen that there are many existing tools for performance monitoring, some of which have a non-negligible user community. When it comes to performance estimation, there are few tools and few users. Although the computer science community has
been researching performance for a long time, we believe that such research needs to become more systematic and scientific. A common approach to performance metadata together with a methodology that allows independent verification and validation of performance results would be a good start.
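As a sketch of what such performance metadata might contain, attached to a component alongside the interface description it already exposes, consider the following; the field names and the example component are illustrative assumptions, not a proposed standard:

# Illustrative only: one possible shape for per-component performance
# metadata, holding algebraic resource-demand models of the kind used by the
# schedulers discussed in Sections 3 and 4.
lu_solver_metadata = {
    "component": "dense_lu_solver",
    "interface": "solve(A: n x n matrix, b: n-vector) -> n-vector",
    "cpu_load": "2/3 * n**3 + 2 * n**2 flop",   # machine-independent work model
    "memory_bytes": "8 * n**2",
    "io_bytes": "0",
    "communication": "none (single node)",
    "measured_on": ["SGI Power Challenge", "IBM SP2"],
}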
Scientific Programming 10 (2002) 19-33 IOS Press
NINJA: Java for high performance numerical computing
Jose E. Moreira (a), Samuel P. Midkiff (a), Manish Gupta (a), Peng Wu (a), George Almasi (a) and Pedro Artigas (b)
(a) IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598-0218, USA. Tel: +1 914 945 3018; Fax: +1 914 945 4270; E-mail: {jmoreira,smidkiff,mgupta,pengwu,gheorghe}@us.ibm.com
(b) School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213-3891, USA. E-mail:
[email protected]

Abstract: When Java was first introduced, there was a perception that its many benefits came at a significant performance cost. In the particularly performance-sensitive field of numerical computing, initial measurements indicated a hundred-fold performance disadvantage between Java and more established languages such as Fortran and C. Although much progress has been made, and Java now can be competitive with C/C++ in many important situations, significant performance challenges remain. Existing Java virtual machines are not yet capable of performing the advanced loop transformations and automatic parallelization that are now common in state-of-the-art Fortran compilers. Java also has difficulties in implementing complex arithmetic efficiently. These performance deficiencies can be attacked with a combination of class libraries (packages, in Java) that implement truly multidimensional arrays and complex numbers, and new compiler techniques that exploit the properties of these class libraries to enable other, more conventional, optimizations. Two compiler techniques, versioning and semantic expansion, can be leveraged to allow fully automatic optimization and parallelization of Java code. Our measurements with the NINJA prototype Java environment show that Java can be competitive in performance with highly optimized and tuned Fortran code.
1. Introduction

When Java(TM) was first introduced, there was a perception (properly founded at the time) that its many benefits, including portability, safety and ease of development, came at a significant performance cost. In few areas were the performance deficiencies of Java so blatant as in numerical computing. Our own measurements, with second-generation Java virtual machines, showed differences in performance of up to one hundred-fold relative to C or Fortran. The initial experiences with such poor performance caused many developers of high performance numerical applications to reject Java out-of-hand as a platform for their applications. The JavaGrande forum [11] was organized to facilitate cooperation and the dissemination of information among those researchers and applications writers wanting to improve the usefulness of Java in these environments.
Much has changed since those early days. More attention to optimization techniques in the just-in-time (JIT) compilers of modern virtual machines has resulted in performance that can be competitive with popular C/C++ compilers [4]. Figure 1(a), with data from a study described in [4], shows the performance of a particular hardware platform (a 333 MHz Sun Sparc10) for different versions of the Java Virtual Machine (JVM). The results reported are the aggregate performance for the SciMark [16] benchmark. We note that performance has improved from 2 Mflops (with JVM version 1.1.6) to better than 30 Mflops (with JVM version 1.3). However, as Fig. 1(b) with data from the same study shows, the performance of Java is highly dependent on the platform. Often, the better hardware platform does not have a virtual machine implementing the more advanced optimizations. Despite the rapid progress that has been made in the past few years, the performance of commercially available Java platforms is not yet on par with state-of-the-art Fortran and C compilers.
Fig. 1. Although Java performance on numerical computing has improved significantly in the past few years (a), that performance is inconsistent across platforms (b) and still not up to par with state-of-the-art C and Fortran compilers. (Data courtesy of Ron Boisvert and Roldan Pozo, of the National Institute of Standards and Technology.)
Programs using complex arithmetic exhibit particularly bad performance [21]. Furthermore, current Java platforms are incapable of automatically applying important optimizations for numerical code, such as loop transformations and automatic parallelization [20]. Nevertheless, our thesis is that there are no technical barriers to high performance computing in Java. To prove this thesis, we have developed a prototype Java environment, called Numerically INtensive JAva (NINJA), which has demonstrated that Fortran-like performance can be obtained by Java on a variety of problems. We have successfully addressed issues such as dense and irregular matrix computations, calculations with complex numbers, automatic loop transformations, and automatic parallelization. Moreover, our techniques are straightforward to implement, and allow reuse of existing optimization components already deployed by software vendors for other languages [17], lowering the economic barriers to Java's acceptance. The primary goal of this paper is to convince virtual machine and application developers alike that Java can deliver both on the software engineering and performance fronts. The technology is available to make Java perform as well for numerical computing as highly tuned Fortran or C code. Once it is accepted that Java performance is only an artifact of particular implementations of Java, and that there are no technical barriers to Java achieving excellent numerical performance, our techniques will allow vendors and researchers to
quickly deliver high performance Java platforms to program developers. The rest of this paper is organized as follows. Section 2 describes the main sources of difficulties in optimizing Java performance for numerical computing. Section 3 covers the solutions that we have developed to overcome those difficulties. Section 4 discusses how those solutions were implemented in our prototype Java environment and provides various results that validate our approach to deliver high performance in numerical computing with Java. Finally, Section 5 presents our conclusions. Two appendices provide further detail on technologies of importance to numerical computing in Java: Appendix A gives the flavor of a multidimensional array package and Appendix B discusses a library for numerical linear algebra. A note about the examples in this paper: the Java compilation model involves a Java source code to Java bytecode translation step, with the resulting bytecode typically compiled into native, or machine, code using a dynamic (i.e., just-in-time) compiler. The NINJA compiler performs its optimizations during this bytecode to machine code compilation step, but we present our examples using source code for readability.
2. Java performance difficulties

Among the many difficulties associated with optimizing numerical code in Java, we identify three characteristics of the language that are, in a way, unique: (i)
exception checks for null-pointer and out-of-bounds array accesses, combined with a precise exception model, (ii) the lack of regular-shaped arrays, and (iii) weak support of complex numbers and other arithmetic systems. We discuss each of these in more detail.

2.1. The Java exception model

Java requires all array accesses to be checked for dereferencing via null-pointer and out-of-bounds indices. An exception must be thrown if either violation happens. Furthermore, the precise exception model of Java states that when the execution of a piece of code throws an exception, all the effects of those instructions prior to the exception must be visible, and no effect of instructions after the exception should be visible [8]. This has a negative impact on performance in two ways: (i) checking the validity of array references contributes to runtime overhead, and (ii) code reordering in general, and loop iteration reordering in particular, is prohibited, thus preventing almost all optimizations for numerical codes. The first of these problems can be alleviated by aggressive hardware support that masks the direct cost of the tests. The second problem is more serious and requires compiler support.

2.2. Arrays in Java

Unlike Fortran and C, Java has no direct support for truly rectangular multidimensional arrays. Java allows some simulation of multidimensional arrays through arrays of arrays, but that is not an ideal solution. Arrays of arrays have two major problems. First, arrays of arrays are not necessarily rectangular. Determining the shape of an array of arrays is, in general, an expensive runtime operation. Even worse, the shape of an array of arrays can change during computation. Figure 2(a) shows an array of arrays being used to simulate a rectangular two-dimensional array. In this case, all rows have the same length. However, arrays of arrays can be used to construct far more complicated structures, as shown in Fig. 2(b). We note that such structures, even if unusual for numerical codes, may be natural for other kinds of applications. When a compiler is processing a Java program, it must assume the most general case for an array of arrays unless it can prove that a simpler structure exists. Determining rectangularity of an array of arrays is a difficult compiler analysis problem, bound to fail in many cases. One could advocate the use of pragmas to help identify rectangular arrays. However, to maintain the overall safety
of Java, a virtual machine must not rely on pragmas that it cannot independently verify, and we are back to the compiler analysis problem. It would be much simpler to have data structures that make this property explicit, such as the rectangular two-dimensional arrays of Fig. 2(c). Knowing the shape of a multidimensional array is necessary to enable some key optimizations that we discuss below. As can be seen in Fig. 2(b), the only way to determine the minimum length of a row is to examine all rows. In contrast, determining the size of a true rectangular array, as shown in Fig. 2(c), only requires looking at a small number of parameters. Second, arrays of arrays may have complicated aliasing patterns, with both intra- and inter-array aliasing. Again, alias disambiguation - that is, determining when storage locations are not aliased - is a key enabler of various optimization techniques, such as loop transformations and loop parallelization, which are so important for numerical codes. The aliasing problem is illustrated in Fig. 2. For the arrays of arrays shown in Fig. 2(b), two different arrays can share rows, leading to inter-array aliasing. In particular, row 4 of array X and row 3 of array Y refer to the same storage, but with two different names. Furthermore, intra-array aliasing is possible, as demonstrated by rows 0 and 1 of array X. For the true multidimensional arrays shown in Fig. 2(c) (Z and T), alias analysis is easier. There can be no intra-array aliasing for true multidimensional arrays, and inter-array aliasing can be determined with simpler tests [20].

2.3. Complex numbers in Java

From a numerical perspective, Java only has direct support for real numbers. Fortran has direct support for complex numbers also. For even more versatility, both Fortran and C++ provide the means for efficiently supporting other arithmetic systems. Efficient support for complex numbers and other arithmetic systems in Fortran and C++ comes from the ability to represent low-cost data structures that can be efficiently allocated on the stack or in registers. Java, in contrast, represents any non-primitive data type as a full-fledged object. Complex numbers are typically implemented as objects of a class Complex, and every time an arithmetic operation generates a new complex value, a new Complex object has to be allocated. That is true even if the value is just a temporary, intermediate result. We note that an array of n complex numbers requires the creation of n objects of type Complex, further complicating alias analysis and putting more pressure on the memory allocation and garbage collection system.
Fig. 2. Examples of (a) array of arrays simulating a two-dimensional array, (b) array of arrays in a more irregular structure, and (c) rectangular two-dimensional array.
We have observed the largest differences in performance between Java and Fortran when executing code that manipulates arrays of complex numbers. Because Complex objects are created to hold the result of each arithmetic operation, almost all of the execution time of an application with complex numbers is spent creating and garbage collecting Complex objects used to hold intermediate values. In that case, even modern virtual machines may perform a hundred times slower than equivalent Fortran code. The three difficulties described above are at the core of the performance deficiencies of Java. They prevent the application of mature compiler optimization technology to Java and, thus, prevent it from being truly competitive with more established languages such as Fortran and C. We next describe our approach to eliminating these difficulties, and we will show that, with the proper technology, the performance of Java numerical code can be as good as with any other language.
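As a concrete illustration of difficulty (ii) above, the following small sketch (ours, not part of the original text) shows how an array of arrays can be ragged and can even share rows within and across arrays, so that a compiler cannot assume a rectangular, alias-free layout without expensive analysis:

// Illustrative sketch only: arrays of arrays need not be rectangular,
// and rows can be shared within and across arrays.
double[][] X = new double[5][];
for (int i = 0; i < 5; i++) {
    X[i] = new double[i + 1];          // ragged: row i has length i+1
}
double[][] Y = new double[3][];
Y[0] = new double[4];
Y[1] = X[2];                           // inter-array aliasing: Y[1] is X[2]
Y[2] = Y[0];                           // intra-array aliasing: Y[2] is Y[0]

// Even determining the minimum row length requires scanning every row:
int minRowLength = Integer.MAX_VALUE;
for (double[] row : X) {
    minRowLength = Math.min(minRowLength, row.length);
}

None of this is possible with the true rectangular arrays of Fig. 2(c), whose shape is a small set of immutable parameters and whose storage is a single object.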
3. Java performance solutions

Our research showed that the performance difficulties of Java could be solved by a careful combination of language and compiler techniques. We developed new class libraries that "enrich" the language with some important constructs for numerical computing. Our compiler techniques take advantage of these new constructs to perform automatic optimizations. Above all, we were able to overcome the Java performance problems mentioned earlier while maintaining full portability of Java across all virtual machines. The performance results on a particular virtual machine, however, depend on the extent to which that virtual machine (more precisely, its Java bytecode to machine code compiler) implements the automatic optimizations we describe below.
3.1. The Array package and semantic expansion

To attack the absence of truly multidimensional arrays in Java, we have defined an Array package with multidimensional arrays (denoted in this text as Arrays, with a capital A) of various types and ranks (e.g., doubleArray2D, ComplexArray3D, ObjectArray1D). This Array package introduces true multidimensional arrays in Java through a class library. See Appendix A, The Array package for Java, for further discussion. Element accessor methods (get and set methods for individual array elements), sectioning operations, gather and scatter operations, and basic linear algebra subroutines (BLAS) are some of the operations defined for the Array data types. By construction, the Arrays have an immutable rectangular and dense shape, which simplifies testing for aliases and facilitates the optimization of runtime checks. The Array classes are written in fully compliant Java code, and can be run on any JVM. This ensures that programs written using the Array package are portable. When Array elements are accessed via the get and set element operations, each element access will be encumbered by the overhead of a method invocation, which is unacceptable for high performance computing. This problem is avoided by a compiler technique known as semantic expansion. In semantic expansion, the compiler looks for specific method calls, and substitutes efficient code for the call. This allows programs using the Array package to have high performance when executed on JVMs that recognize the Array package methods. As an example, consider the operation of computing C(i,j) = A(i,j) + B(j,i), for all elements of n x n Arrays A, B, and C. The code for that operation would look something like:

doubleArray2D A, B, C;
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        C.set(i, j, A.get(i, j) + B.get(j, i));
    }
}
which requires three method calls (two gets and one set) in every loop iteration. If the compiler knows that A, B, and C are multidimensional arrays, it can generate code that directly accesses the elements of the Arrays, much like a Fortran compiler generates code for the source fragment

do i = 1, n
   do j = 1, n
      C(i,j) = A(i,j) + B(j,i)
   end do
end do
Note that this is different from the important, but more conventional, optimization of inlining. The compiler does not replace the invocation of get and set by their library code. Instead, the compiler knows about them: it knows the semantics of the classes and of the methods. Semantic expansion is an escape mechanism for efficiently extending a programming language through standard class libraries.

3.2. The complex class and semantic expansion

A complex number class is also defined as part of the Array package, along with methods implementing arithmetic operations on complex numbers. (See Fig. 3.) Again, semantic expansion is used to convert calls to these methods into code that uses a value-object version of Complex objects (containing only the primitive values, not the full Java object representation). Figure 3 illustrates the differences between value-objects and regular objects. A value-object version of Complex contains only fields for the real and imaginary parts of the complex number represented, as shown in Fig. 3(b). It is akin to a C struct, and can be easily allocated on the stack and even in registers. For Complex to behave as a true Java object, a different representation is necessary, shown in Fig. 3(c). In particular, every Java object requires an object header, which can represent a significant fraction of the object size. (For example, a Complex object of double-precision real and imaginary parts occupies 32 bytes in modern virtual machines, even though only 16 bytes are dedicated to the numerical fields.) Even worse is the overhead of creating and destroying objects, which typically are allocated on the heap. Any computation involving the arithmetic methods can be semantically expanded to use complex values. Conversion to Complex objects is done in a lazy manner upon encountering a method or primitive operation that truly requires object-oriented functionality. Thus, the programmer continues to treat complex numbers as
objects (maintaining the clean semantics of the original language), while our compiler transparently transforms them into value-objects for efficiency. We illustrate those concepts with an example. Consider the computation of y(i) = a*x(i) for all n elements of arrays x and y of complex numbers. This operation would typically be coded as

ComplexArray1D x, y;
Complex a;
for (i = 0; i < n; i++) {
    y.set(i, a.times(x.get(i)));
}
A straightforward execution of this code would require the creation of 2n temporary objects. For every iteration, an object has to be created to represent x(i). A second object is created to hold the result of a*x(i). The cost of creating and destroying these objects completely dominates execution. If the compiler knows the semantics of Complex and ComplexArrays, it can replace the method calls by code that simply manipulates values. Only the values of the real and imaginary parts of x(i) are generated by x.get(i). Only the values of the real and imaginary parts of a*x(i) are computed by a.times(x.get(i)). Finally, those values are used to update y(i). As a result, the object code generated would not be significantly different from that produced by a Fortran compiler for the source fragment

complex*16 x(n), y(n)
complex*16 a
do i = 1, n
   y(i) = a * x(i)
end do
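For illustration only, the effect of semantic expansion on this loop can be pictured roughly as follows. The scalarized accessors and the two-part set shown here are schematic assumptions of ours, not the actual Array package API or the code emitted by NINJA:

// Schematic sketch of the effect of semantic expansion: the complex values
// are kept in primitive doubles and no Complex objects are allocated in the loop.
// The accessors re(), im(), getRe(i), getIm(i) and the two-part set are assumed.
double are = a.re(), aim = a.im();
for (int i = 0; i < n; i++) {
    double xre = x.getRe(i);
    double xim = x.getIm(i);
    double yre = are * xre - aim * xim;   // real part of a*x(i)
    double yim = are * xim + aim * xre;   // imaginary part of a*x(i)
    y.set(i, yre, yim);
}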
3.3. Versioning for safe and alias-free regions

For Java programs written with the Array package, the compiler can perform simple transformations that eliminate the performance problems caused by Java's precise exception model. The idea is to create regions of code that are guaranteed to be free of exceptions. Once these exception-free (also called safe) regions have been created, the compiler can apply traditional code-reordering optimizations, constrained only by data and control dependences [20]. The safe regions are created by versioning of loop nests. For each optimized loop nest, the compiler creates two versions - safe and unsafe - guarded by a runtime test. This runtime test establishes whether all Arrays in the loop nest are valid (not null), and whether all the indexing operations
public final class Complex {
    private double re, im;
    ...
    public Complex minus(Complex z) {
        return new Complex(re - z.re, im - z.im);
    }
    public Complex times(Complex z) {
        return new Complex(re*z.re - im*z.im, im*z.re + re*z.im);
    }
}
(a) partial code for Complex class
(b) Complex value-object representation: re and im fields only
(c) Complex object representation: object descriptor (header) plus re and im fields
Fig. 3. A Java class for complex numbers.
inside the loop will generate in-bound accesses. If the tests pass, the safe version of the loop is executed. If not, the unsafe version is executed. Since the safe version cannot throw an exception, explicit runtime checks can be omitted from the code. We take the versioning approach a step further. Application of automatic loop transformation (and parallelization) techniques by a compiler requires, in general, alias disambiguation among the various arrays referenced in a loop nest. We rely on a key property of Java that two object references (the only kind of pointers allowed in Java) must either point to identical or completely non-overlapping objects. Use of the Array package facilitates checking for aliasing by representing a multidimensional array as a single object. Therefore, we can further specialize the safe version of a loop nest into two variants: (i) one in which all multidimensional arrays are guaranteed to be distinct (no aliasing), and (ii) one in which there may be aliasing between arrays. The safe and alias-free version is the perfect target for compiler optimizations. The mature loop optimization techniques, including loop parallelization, that have been developed for Fortran and C programs can be easily applied to the safe and alias-free region.
We note that the "no aliasing" property between two Arrays is invariant to garbage collection activity. Garbage collection may remove aliasing, but it will never introduce it. Therefore, it is enough to verify once that two Arrays are not aliased to each other. We have to make sure, however, that there are no assignments to Array references (e.g., A = B) in a safe and alias-free region, as that can introduce new aliasing. Assignments to the elements of an Array (e.g., A[i] = B[j]) never introduce aliasing. An example of the versioning transformation to create safe and alias-free regions is shown in Fig. 4. Figure 4(a) illustrates the original code for computing A[i] = F(B[i+1]) for n-element arrays A and B. Figure 4(b) explicitly shows all null pointer and array bounds runtime checks that are performed when the code is executed by a Java virtual machine. The check chknull(A) verifies that Array reference A is not a null-pointer, whereas check chkbounds(i) verifies that the index i is valid for that corresponding Array. Figure 4(c) illustrates the versioned code. A simple test for the values of the A and B pointers and a comparison between loop bounds and array extents can determine if the loop will be free of exceptions or not. If
for (i = 0; i < n; i++) {
    A[i] = F(B[i + 1])
}
(a) original code

for (i = 0; i < n; i++) {
    /* code for A[i] = F(B[i + 1]) with explicit checks */
    chknull(A)[chkbounds(i)] = F(chknull(B)[chkbounds(i + 1)])
}
(b) original code with explicit runtime checks

if ((A != null) && (B != null) && (n - 1 < A.length) && (n < B.length)) {
    /* This region is free of exceptions */
    if (A != B) {
        /* This region is free of aliases */
        for (i = 0; i < n; i++) { A'[i] = F(B'[i + 1]) }
    } else {
        /* This region may have aliases */
        for (i = 0; i < n; i++) { A[i] = F(B[i + 1]) }
    }
} else {
    /* This region may have exceptions and aliases */
    for (i = 0; i < n; i++) {
        chknull(A)[chkbounds(i)] = F(chknull(B)[chkbounds(i + 1)])
    }
}
(c) code after safe and alias-free region creation

Fig. 4. Creation of safe and alias-free regions.
the test passes, then the safe region is executed. Note that the array references in the safe region do not need any explicit checks. The array references in the unsafe region, executed if the test fails, still need all the runtime checks. One more comparison is used to disambiguate between the storage areas for arrays A and B. A successful disambiguation will cause execution of the alias-free version. Otherwise, the version with potential aliases must be executed. At first, there seems to be no difference between the alias-free version and the version with potential aliases. However, the compiler internally annotates the symbols in the alias-free region as not being aliased with each other. We denote these new, alias-free symbols, by A' and B'. This information is later used to enable the various loop transformations. We note that the representation shown in Fig. 4(c) only exists as a compiler internal intermediate representation, after the versioning is automatically performed and before object code is generated. Neither the Java language, nor the Java bytecode, can directly represent that information. The concepts illustrated by the example of Fig. 4 can be extended to loop nests of arbitrary depth operating on multidimensional arrays. The tests for safety and
aliasing are much simpler (and cheaper) if the arrays are known to be truly multidimensional (rectangular), as in Fig. 2(c). The Arrays from the Array package have this property.

3.4. Libraries for numerical computing

Optimized libraries are an important vehicle for achieving high performance in numerical applications. In particular, libraries provide the means for delivering parallelism transparently to the application programmer. There are two main trends in the development of high-performance numerical libraries for Java. In one approach, existing native libraries are made available to Java programmers through the Java Native Interface (JNI) [5]. In the other approach, new libraries are developed entirely in Java [3]. Both approaches have their merits, with the right choice depending on the specific goals and constraints of an application. Using existing native libraries through JNI is very appealing. First, it provides access to a large body of existing code. Second, that code has already been debugged and its performance tuned by previous programmers.
Third, in many cases (e.g., BLAS, MPI, LAPACK) the same native library is available for a variety of platforms, properly tuned by the vendor of each platform. However, using libraries that are themselves written in Java also has its advantages. First, those libraries are truly portable, and one does not have to worry about idiosyncrasies that typically occur in versions of a native library for different platforms, such as maintaining Java floating point semantics. Second, Java libraries typically fit better with Java applications. One does not have to worry about parameter translation and data representations that can cause performance problems and/or unexpected behavior. Third, and perhaps most importantly, by writing the libraries in Java, the more advanced optimization and programming techniques that are being developed, and will be developed, for Java will be exploited in the future without the additional work of performing another port. Appendix B describes one technique that is easier to implement with Java and that can lead to improved performance. The Array package itself is a library for numerical computing. In addition to focusing on properties that enable compiler optimizations, we also designed the Array package so that most operations could be performed in parallel. We have implemented a version of the Array package which uses multiple Java threads to exploit multiprocessor parallelism inside some key methods. This is a convenient approach for the application developer. The application code itself can be kept sequential, and parallelism is exploited transparently inside the methods of the Array package. We report results with this approach in the next section. For further information on additional library support for numerical computing in Java, see Appendix B, Numerical linear algebra in Java.

3.5. A comment on our optimization approaches

We want to close this section by emphasizing that the class libraries and compiler optimizations that we presented are strictly Java compliant. They do not require any changes to the base language or the virtual machines, and they do not change existing semantics. The Array and complex classes are just tools for developing numerical applications in a style that is familiar to scientific and technical programmers. The compiler optimizations (versioning and semantic expansion) are exactly that: optimizations that can improve performance of code significantly (by orders of magnitude, as we will see in the next section) without changing the observed behavior.
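As a rough illustration of the intra-method parallelism mentioned in Section 3.4, an element-wise Array operation might partition its rows across Java threads along the following lines. This is a simplified sketch of ours and not the actual implementation of the parallel Array package:

// Simplified sketch: an element-wise add over an m x n data set,
// with blocks of rows assigned to separate Java threads. The calling
// application still sees a single, apparently sequential method call.
static void parallelAdd(final double[][] a, final double[][] b,
                        final double[][] c, int numThreads)
        throws InterruptedException {
    final int m = a.length;
    Thread[] workers = new Thread[numThreads];
    for (int t = 0; t < numThreads; t++) {
        final int lo = t * m / numThreads;
        final int hi = (t + 1) * m / numThreads;
        workers[t] = new Thread(new Runnable() {
            public void run() {
                for (int i = lo; i < hi; i++) {
                    for (int j = 0; j < a[i].length; j++) {
                        c[i][j] = a[i][j] + b[i][j];
                    }
                }
            }
        });
        workers[t].start();
    }
    for (int t = 0; t < numThreads; t++) {
        workers[t].join();
    }
}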
4. Implementation and results

We have implemented our ideas in the NINJA prototype Java environment, based on the IBM XL family of compilers. Figure 5 shows the high-level organization of these compilers. The front-ends for different languages transform programs to a common intermediate representation called W-Code. The Toronto Portable Optimizer (TPO) is a W-Code to W-Code transformer which performs classical optimizations, like constant propagation and dead code elimination, and also high-level loop transformations based on aggressive dataflow analysis. TPO can also perform both directive-assisted and automatic parallelization of loops and other constructs. Finally, the transformed W-Code is converted into optimized machine code by an architecture-specific back-end. The particular compilation path for Java programs is illustrated in the top half of Fig. 5. Java source code is compiled by a conventional Java compiler (e.g., javac) into bytecode for the Java Virtual Machine. We then use the IBM High Performance Compiler for Java [19] (HPCJ) to statically translate bytecode into W-Code. In other words, HPCJ plays the role of front-end for bytecode. Once W-Code for Java is generated, it follows the same path through TPO and back-ends as W-Code generated from other source languages. Semantic expansion of the Array package methods [2] is implemented within HPCJ, as it is Java specific. Safe region creation and alias versioning have been implemented in TPO and those techniques can be applied to W-Code from any other language. We note that the use of a static compiler - HPCJ - represents a particular implementation choice. In principle, nothing prevents the techniques described in this article from being used in a dynamic compiler. Moreover, by using the quasi-static dynamic compilation model [18], the more expensive optimization and analysis techniques employed by TPO can be done off-line, sharply reducing the impact of compilation overhead. We should also mention that our particular implementation is based on IBM products for the RS/6000 family of machines and the AIX operating system. However, the organization of our implementation is representative of typical high-performance compilers [15] and it is adopted by other vendors. Obviously, a reimplementation effort is necessary for each different platform, but the approach we followed serves as a template for delivering high-performance solutions for Java. We used a suite of eight real and five complex arithmetic benchmarks to evaluate the performance impact
Fig. 5. Architecture of the IBM XL compilers.
of our techniques. We also applied our techniques to a production data mining application. These benchmarks and the data mining application are described further in [2,13,14]. The effectiveness of our techniques was assessed by comparing the performance produced by the NINJA compiler with that of the IBM Development Kit for Java version 1.1.6 and the IBM XLF Fortran compiler on a variety of platforms.

4.1. Sequential execution results

The eight real arithmetic benchmarks are matmul (matrix multiply), microdc (electrostatic potential computation), lu (LU factorization), cholesky (Cholesky factorization), shallow (shallow water simulation), bsom (neural network training), tomcatv (mesh generation and solver), and fft (FFT with explicit real arithmetic). Results for these benchmarks, when running in strictly sequential (single-threaded) mode, are summarized in Fig. 6(a). Measurements were made on an RS/6000 model 260 machine, with a 200 MHz POWER3 processor. The height of each bar is proportional to the best Fortran performance achieved in the corresponding benchmark. The numbers at the top of the bars indicate actual Mflops. For the Java 1.1.6 version, arrays are implemented as double[][]. The NINJA version uses doubleArray2D Arrays from the Array package and semantic expansion. For six of the benchmarks (matmul, microdc, lu, cholesky, bsom, and shallow) the performance of the Java version (with the Array package and our compiler) is 80% or more of the performance of the Fortran version. This high performance is due to well-known loop trans-
formations, enabled by our techniques, which enhance data locality. The Java version of tomcatv performs poorly because one of the outer loops in the program is not covered by a safe region. Therefore, no further loop transformations can be applied to this particular loop. The performance of fft is significantly lower than its Fortran counterpart because our Java implementation does not use interprocedural analysis, which has a big impact in the optimization of the Fortran code.

4.2. Results for complex arithmetic benchmarks

The five complex benchmarks are matmul (matrix multiply), microac (electrodynamic potential computation), lu (LU factorization), fft (FFT with complex arithmetic), and cfd (two-dimensional convolution). Results for these benchmarks are summarized in Fig. 6(b). Measurements were made on an RS/6000 model 590 machine, with a 67 MHz POWER2 processor. Again, the height of each bar is proportional to the best Fortran performance achieved in the corresponding benchmark, and the numbers at the top of the bars indicate actual Mflops. For the Java 1.1.6 version, complex arrays are represented using a Complex[][] array of Complex objects. No semantic expansion was applied. The NINJA version uses ComplexArray2D Arrays from the Array package and semantic expansion. In all cases we observe significant performance improvements between the Java 1.1.6 and NINJA versions. Improvements range from a factor of 35 (1.7 to 60.5 Mflops for cfd) to a factor of 75 (1.2 to 89.5 Mflops for matmul). We achieve Java performance that ranges from 55% (microac) to 85% (fft and cfd) of fully optimized Fortran code.
Fig. 6. Performance results of applying our Java optimization techniques to various cases.
4.3. Parallel execution results

Loop parallelization is another important transformation enabled by safe region creation and alias versioning. We report speedup results from applying loop parallelization to our eight real arithmetic Java benchmarks. All experiments were conducted using the Array package version of the benchmarks, compiled with our prototype compiler with automatic parallelization enabled. Speedup results, relative to the single processor performance of the parallel code optimized with NINJA, are shown in Fig. 6(c). Measurements were made on a machine with four 200 MHz POWER3 processors. The compiler was able to parallelize some loops in each of the eight benchmarks. Significant
speedups were obtained (better than 50% efficiency on 4 processors) in six of those benchmarks (matmul, microdc, lu, shallow, bsom, and fft).

4.4. Results for parallel libraries

We further demonstrate the effectiveness of our solutions by applying NINJA to a production data mining code [14]. In this case, we use a parallel version of the Array package which uses multithreading to exploit parallelism within the Array operations. We note that the user application is a strictly sequential code, and that all parallelism is exploited transparently to the application programmer. Results are shown in Fig. 6(d). Measurements were made with an RS/6000 model F50
machine, with four 332 MHz PowerPC 604e processors. The conventional (Java arrays) version of the application achieves only 26 Mflops, compared to 120 Mflops for the Fortran version. The single-processor Java version with the Array package (bar Array x 1) achieves 109 Mflops. Furthermore, when run on a multiprocessor, the performance of the Array package version scales with the number of processors (bars Array x 2, Array x 3, and Array x 4 for execution on 2, 3, and 4 processors, respectively), achieving almost 300 Mflops on 4 processors.
5. Conclusions

Our results show that there are no serious technical impediments to the adoption of Java as a major language for numerically intensive computing. The techniques we have presented are simple to implement and allow existing compiler optimizers to be exploited. The Java-specific optimizations are relatively simple and most of the benefits accrue from leveraging well understood language-independent optimizations that are already implemented in current compilers. Moreover, Java has many features, like simpler pointers and flexibility in choosing object layouts, which facilitate application of the optimization techniques we have developed. The impediments to high-performance computing in Java are instead economic and social - an unwillingness on the part of vendors of Java compilers to commit the resources to develop product-quality compilers for technical computing; the reluctance of application developers to make the transition to new languages for developing new codes; and finally, the widespread belief that Java is simply not suited for technical computing. The consequences of this situation are severe: a large pool of programmers is being underutilized, and millions of lines of code are being developed using programming languages that are inherently more difficult and less safe to use than Java. The maintenance of these programs will be a burden on scientists and application developers for decades. We have already engaged with companies that are interested in doing numerical computing in Java, which represents a first step towards wider adoption of Java in that field. Java already has a strong user base in commercial computing. For example, IBM's WebSphere suite is centered around Java and is widely used in the industry. However, the characteristics of the commercial computing market are significantly different,
in both size and requirements, from the technical computing market. It is our hope that the concepts and results presented in this paper will help overcome the difficulties of establishing Java as a viable platform for numerical computing and accelerate the acceptance of Java, positively impacting the technical computing community in the same way that Java has impacted the commercial computing community.

Appendix A. The Array package for Java

The Array package for Java (provisionally named com.ibm.math.array) provides the functionality and performance associated with true multidimensional arrays. The difference between arrays of arrays, directly supported by the Java Programming Language and Java Virtual Machine, and true multidimensional arrays is illustrated in Fig. 2. Multidimensional arrays (Arrays) are rectangular collections of elements characterized by three immutable properties: type, rank, and shape. The type of an Array is the type of its elements (e.g., int, double, or Complex). The rank (or dimensionality) of an Array is its number of axes. For example, the Arrays in Fig. 2 are two-dimensional. The shape of an Array is determined by the extent of its axes. The dense and rectangular shape of Arrays facilitates the application of automatic compiler optimizations. Figure 7 illustrates the class hierarchy for the Array package. The root of the hierarchy is an Array abstract class (not to be confused with the Array package). From the Array class we derive type-specific abstract classes. The leaves of the hierarchy correspond to final concrete classes, each implementing an Array of specific type and rank. For example, doubleArray2D is a two-dimensional Array of double-precision floating-point numbers. The shape of an Array is defined at object creation time. For example, intArray3D A = new intArray3D(m,n,p); creates an m x n x p three-dimensional Array of integer numbers. Defining a specific concrete final class for each Array type and rank effectively binds the semantics to the syntax of a program, enabling the use of mature compiler technology that has been developed for languages like Fortran and C. Arrays can be manipulated element-wise or as aggregates. For instance, if one wants to compute a two-dimensional Array C of shape m x n in which each element is the sum of the corresponding elements of Arrays A and B, also of shape m x n, then one can write either
Fig. 7. Simplified partial class hierarchy chart for the Array package.

Fig. 8. Performance results for ESSL and Java BLAS for SGEMM operation. (Plot: ESSL and Java BLAS performance for SGEMM on RS/6000 260, as a function of problem size.)
the two forms can be aggressively optimized as with for (int i=0; i<m; state-of-the-art Fortran compilers. for (int j=0; j