Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis and J. van Leeuwen
1595
Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Singapore Tokyo
Kevin Hammond Tony Davie Chris Clack (Eds.)
Implementation of Functional Languages 10th International Workshop, IFL’98 London, UK, September 9-11, 1998 Selected Papers
Series Editors
Gerhard Goos, Karlsruhe University, Germany
Juris Hartmanis, Cornell University, NY, USA
Jan van Leeuwen, Utrecht University, The Netherlands

Volume Editors
Kevin Hammond
Tony Davie
University of St Andrews, Division of Computer Science
North Haugh, St Andrews KY16 9SS, UK
E-mail: {kh,ad}@dcs.st-and.ac.uk
Chris Clack
University College London, Department of Computer Science
Gower Street, London WC1E 6BT, UK
E-mail:
[email protected] Cataloging-in-Publication data applied for Die Deutsche Bibliothek - CIP-Einheitsaufnahme Implementation of functional languages : 10th international workshop ; selected papers / IFL ’98, London, UK, September 9 - 11, 1998. Kevin Hammond . . . (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Hong Kong ; London ; Milan ; Paris ; Singapore ; Tokyo : Springer, 1999 (Lecture notes in computer science ; Vol. 1595) ISBN 3-540-66229-4
CR Subject Classification (1998): D.3, D.1.1, F.3 ISSN 0302-9743 ISBN 3-540-66229-4 Springer-Verlag Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. © Springer-Verlag Berlin Heidelberg 1999 Printed in Germany Typesetting: Camera-ready by author SPIN: 10704703 06/3142 – 5 4 3 2 1 0
Printed on acid-free paper
Preface and Overview of Papers

This volume contains a refereed selection of papers presented at the 1998 International Workshop on the Implementation of Functional Languages (IFL’98), held at University College, London, September 9–11, 1998. This was the tenth in a series of international workshops that have been held at locations in The Netherlands, Germany, Sweden, and the UK, and the third to be published in the Springer-Verlag series of Lecture Notes in Computer Science (selected papers from IFL’96 and IFL’97 are published in LNCS volumes 1268 and 1467 respectively). The workshop has grown over the years, and the 1998 meeting attracted 64 researchers from the international functional language community, many of whom presented papers at the workshop. We are pleased that, after due revision and rigorous peer review, we are able to publish 15 of those papers in this volume.

The papers selected from the workshop cover a wide range of topics including work on parallel process co-ordination (Aßmann; Klusik, Ortega, and Peña), parallel profiling (Charles and Runciman; King, Hall, and Trinder), compilation (Grelck) and semantics of parallel systems (Hall, Baker-Finch, Trinder, and King), programming methodology (Koopman and Plasmeijer; Scholz), interrupt handling (Reid), type systems (McAdam; Pil), strictness analysis (Pape), concurrency and message passing (Holyer and Spiliopoulou; Serrarens and Plasmeijer) and inter-language working (Reinke).

Some of the work developed out of research reported at previous IFL workshops. In Implementing Eden, or: Dreams Become Reality, pp 103–119, Klusik, Ortega, and Peña show how expressions in the parallel functional language, EDEN, are compiled, work arising from their previous operational specification of the distributed abstract machine, DREAM (IFL’97). Grelck (Shared Memory Multiprocessor Support for SAC, pp 38–53) continues work on Single Assignment C (IFL’96), describing the compilation of SAC programs for multithreaded execution on multiprocessor systems. Scholz also describes work on SAC (A Case Study: Effects of WITH-Loop-Folding on the NAS MG Benchmark in SAC, pp 216–228), showing how the WITH-Loop-Folding optimization supports high-level array operations efficiently. Much work has been done in recent years on optimization of functional programming compilers. This paper shows that code generated from the SAC program not only reaches the execution time of code generated from FORTRAN, but even outperforms it by a significant amount. Aßmann (Preliminary Performance Results for an Implementation of the Process Coordination Language K2, pp 1–19) builds on earlier work on K2 (IFL’97), analysing the efficiency of a number of benchmarks. Pil (Dynamic Types and Type Dependent Functions, pp 169–185) describes work on type dependent functions, which will be added to the Clean language using overloading. These will facilitate communication between independent (possibly persistent) functional programs using a minimum of dynamic typing.

Two papers continue work on profiling parallel functional programs. King, Hall, and Trinder (A Strategic Profiler for Glasgow Parallel Haskell, pp 88–102) show how the GRANSIM-SP profiler attributes costs to the abstract evaluation
strategies that created the various threads. Charles and Runciman (An Interactive Approach to Profiling Parallel Functional Programs, pp 20–37) concentrate on the user interface to profiling tools, describing a system that interactively combines graphical information with a query interface. Finally, Koopman and Plasmeijer (Efficient Combinator Parsers, pp 120–136) describe how elegant combinators for constructing parsers, which, however, have exponential time complexity, can be made more efficient using continuations.

Other papers represent work on new lines of research. Reinke (Towards a Haskell/Java Connection, pp 200–215) describes preliminary work on a system to support inter-language working between functional and object-oriented languages, a timely coverage of a vitally important issue. Reid (Putting the Spine Back in the Spineless Tagless G-Machine: An Implementation of Resumable Black-Holes, pp 186–199) shows how to tackle interrupts in lazy languages, an issue that has been problematic for some years. Serrarens and Plasmeijer (Explicit Message Passing for Concurrent Clean, pp 229–245) describe a language extension providing efficient, flexible communication in a concurrent system. Holyer and Spiliopoulou also present concurrency research (Concurrent Monadic Interfacing, pp 72–87), in which the monadic style of interfacing is adapted to accommodate deterministic concurrency.

Three papers describe more theoretical work, albeit for use in practical implementations. Pape (Higher Order Demand Propagation, pp 153–168) gives a new denotational semantics mapping function definitions into demand propagators, allowing the definition of a backward strictness analysis of the functions even when they are higher order and/or transmit across module boundaries. McAdam (On the Unification of Substitutions in Type Inference, pp 137–152) describes a polymorphic type inference algorithm which reduces the confusion which often arises when type errors are reported to users. Finally, Hall, Baker-Finch, Trinder, and King (Towards an Operational Semantics for a Parallel Non-strict Functional Language, pp 54–71) present a semantics for a simple parallel graph reduction system based on Launchbury’s natural semantics for lazy evaluation.

The 15 papers published in this volume were selected using a rigorous a posteriori refereeing process from the 39 papers that were presented at the workshop. The reviewing process was shared among the program committee, which comprised:

Chris Clack          University College London      UK
Tony Davie           University of St. Andrews      UK
John Glauert         University of East Anglia      UK
Kevin Hammond        University of St. Andrews      UK
Werner Kluge         University of Kiel             Germany
Pieter Koopman       University of Nijmegen         The Netherlands
Rita Loogen          University of Marburg          Germany
Greg Michaelson      Heriot-Watt University         UK
Markus Mohnen        RWTH Aachen                    Germany
Marko van Eekelen    University of Nijmegen         The Netherlands
We would like to thank the many additional anonymous referees who provided us with timely, high-quality reports on which our decisions were based. The overall balance of the papers is representative, both in scope and technical substance, of the contributions made to the London workshop as well as to those that preceded it. Publication in the LNCS series is not only intended to make these contributions more widely known in the computer science community but also to encourage researchers in the field to participate in future workshops, of which the next one will be held in Nijmegen, The Netherlands, September 7-10, 1999 (see http://www.cs.kun.nl/~pieter/ifl99/index.htm). Significantly, this year the workshop attracted industrial sponsorship from both Andersen Consulting and Ericsson. This indicates the growing importance of functional programming languages and their implementation within two key commercial spheres (IT Consultancy and Telecommunications, respectively). We thank our sponsors for their generous contributions. March 1999
Kevin Hammond, Tony Davie, and Chris Clack
Table of Contents

Performance Results for an Implementation of the Process Coordination Language K2
   Claus Aßmann ........................................................... 1

An Interactive Approach to Profiling Parallel Functional Programs
   Nathan Charles and Colin Runciman ..................................... 20

Shared Memory Multiprocessor Support for SAC
   Clemens Grelck ........................................................ 38

Towards an Operational Semantics for a Parallel Non-Strict Functional Language
   Jon G. Hall, Clem Baker-Finch, Phil Trinder and David J. King ......... 54

Concurrent Monadic Interfacing
   Ian Holyer and Eleni Spiliopoulou ..................................... 72

A Strategic Profiler for Glasgow Parallel Haskell
   David J. King, Jon G. Hall and Phil Trinder ........................... 88

Implementing Eden, or: Dreams Become Reality
   Ulrike Klusik, Yolanda Ortega and Ricardo Peña ....................... 103

Efficient Combinator Parsers
   Pieter Koopman and Rinus Plasmeijer .................................. 120

On the Unification of Substitutions in Type Inference
   Bruce J. McAdam ...................................................... 137

Higher Order Demand Propagation
   Dirk Pape ............................................................ 153

Dynamic Types and Type Dependent Functions
   Marco Pil ............................................................ 169

Putting the Spine Back in the Spineless Tagless G-Machine: An Implementation of Resumable Black-Holes
   Alastair Reid ........................................................ 186

Towards a Haskell/Java Connection
   Claus Reinke ......................................................... 200
A Case Study: Effects of With-Loop-Folding on the NAS Benchmark MG in SAC
   Sven-Bodo Scholz ..................................................... 216

Explicit Message Passing for Concurrent Clean
   Pascal R. Serrarens and Rinus Plasmeijer ............................. 229

Author Index ............................................................ 247
Performance Results for an Implementation of the Process Coordination Language K2 Claus Aßmann Department of Computer Science, University of Kiel, 24105 Kiel, Germany
[email protected]

Abstract. This paper is on the process coordination language K2, which is based on a variant of high-level Petri nets. It strictly separates the specification of processes and their communication structures from (functional) specifications of the computations to be performed by the processes. The purpose of this separation is to facilitate both correct program construction and formal reasoning about essential safety and liveness properties. More specifically, the paper presents and analyzes performance measurements obtained from a K2 implementation based on the message passing system PVM. Two simple K2 benchmark programs are used to determine the overhead inflicted by the K2 runtime system relative to a direct PVM implementation of the same programs. A third benchmark program, the implementation of a relaxation algorithm for the solution of PDEs, is used to investigate the efficiency of distributing some given workload on a multiprocessor system by recursively expanding process (sub-)structures. The measurements show that coarse-grain K2 programs are only marginally slower than their PVM counterparts, and that good scalability of recursive programs can be achieved on multiple computing sites.

Keywords: coordination language, concurrent computing, Petri nets, performance results
1 Introduction
Functional programs, due to the absence of side effects, are considered perfect candidates for non-sequential execution. They specify, in the form of function equations and function applications, just the desired problem solutions but (ideally) leave it completely to compilation or interpretation, and to some extent to the runtime system, to figure out how these solutions are to be computed. If this is to be done non-sequentially, it involves partitioning a program (recursively) into concurrently executable pieces, creating tasks for them, scheduling these tasks on available processing sites, handling communication and synchronization among tasks, and (recursively) assembling problem solutions from partial results produced by terminating tasks. Crucial with respect to attainable performance gains are a fairly high ratio of useful computations within the tasks versus the overhead inflicted by the task management on the one hand and a fairly balanced workload distribution over the processing sites on the other hand. It has been recognized fairly early in the game that this cannot be had without at least some program annotations which identify terms that may be
executed concurrently, control job granularities and the degree to which terms must be evaluated (which primarily relates to lazy languages), put upper bounds on (or throttle) the number of tasks that participate in a computation, and even allocate terms to specific (virtual) processing sites for execution [16,15].

Concurrently executable program parts may also be made explicit by means of so-called skeletons [8]. They define a few standardized control and communication structures, e.g., for divide-and-conquer or pipelined computations, which are complemented by compilation schemes that produce efficient code.

Another approach that makes concurrency explicit uses language constructs for the specification of processes (tasks) and for the communication structures that must be established among them. These constructs may either be integrated into a functional language [5] or provided by a separate so-called process coordination language [6]. Cleanly separating the specification of the process system from the computations to be performed by the individual processes, as is the underlying idea in the latter case, considerably facilitates both correct program construction and formal analysis of the process system wrt the existence of essential safety and liveness properties. It also has the pragmatic advantage that, proper interfacing with the process level provided, different languages, functional or procedural, may be used to specify what the individual processes are supposed to compute. This not only considerably enhances the re-usability of existing (sequential) code but to some extent also portability to different, possibly heterogeneous machinery.

To be a useful tool for the design of non-trivial systems of communicating processes, a coordination language should
– be compositional to allow for the systematic construction of complex process systems from simpler subsystems;
– provide only a small set of fairly simple, well-understood process interaction schemes, including controlled forms of nondeterminism (to realize, say, client/server interactions), which ensure an orderly system behavior largely by construction;
– support some concept of recursively expandable process structures which, at runtime, may be dynamically adapted to actual problem sizes (workload) or available resources (processing sites);
– be based on some process calculus which provides computationally tractable formal methods for analyzing and reasoning about the existence of basic safety and liveness properties;
– lend itself to some easily comprehensible form of high-level debugging, e.g., by step-wise execution of selected process instances and inspection of intermediate system states, and also facilitate performance analysis.

Two approaches to concurrent computing in the area of functional languages which in one form or another use process specifications are Eden [5] and Caliban [7], both being extensions of Haskell.
Eden provides explicit process abstractions which may be called from within Haskell code to create processes and communication channels at runtime. Neither communication structures nor the number of processes are fixed at compile time, but may rather freely be changed during execution. Higher-order functions can be used to construct complex systems from simpler subsystems. Nondeterminism may be introduced into programs only by primitive merge processes. However, as complete freedom of building and modifying process structures at runtime tends to invite chaos, it appears to be extremely difficult to ensure basic safety and liveness properties for non-trivial programs. Other difficulties may arise from intertwining communications with computations, as the effects of flaws in the former may be hard to isolate from flaws in the latter, particularly if nondeterministic phenomena are involved. Mixing computations and communication also runs somewhat counter to the concept of modular programming. As programmers are not forced to cleanly separate one from the other, it may be difficult to migrate existing code written either in Haskell or any other language, or to re-use Eden code in other contexts.

Caliban is a flexible annotation language for Haskell that describes which expressions should be evaluated on the same computing sites. The annotations are transformed into specifications of static process networks, i.e., their structures are fixed at compile-time. Neither dynamically changing process structures nor nondeterminism are supported. Conceptually, Caliban programs from which the annotations are removed are semantically fully equivalent to their Haskell counterparts. Caliban can therefore not really be considered a process coordination language that is in the same league as Eden or K2. Beyond these conceptual shortcomings, there are also some problems with the Caliban implementation. For instance, since streams are evaluated in order, a stream element whose value is not needed but gets trapped in a nonterminating loop causes the entire program to loop without end if a later element in the stream is needed. Similarly, the Caliban compiler itself may not terminate since it may have to evaluate expressions that are not needed.

The functional coordination language K2 [2] has been developed with the requirements listed above as design goals in mind. It derives from a variant of colored Petri nets [12,13] in which programs are composed of processes and streams, corresponding to transitions inscribed with functions and to places that carry tokens inscribed with argument objects, respectively. The choice of colored Petri nets as a conceptual basis for K2 was motivated by the fact that they are a perfect match for the operational semantics of functional computations: both the rewrite rules for function applications and the firing rules for transitions in Petri nets follow the concept of context-free substitutions, i.e., both are characterized by the orderly consumption and (re)production of computational objects, and thus are completely free of side-effects. A deterministic (functional) behavior of entire K2 process systems is ensured by restricting the communication structures among processes to so-called synchronization graphs, in which each stream connects to at most one producing and one consuming process. However, good-natured forms of nondeterminism which essentially relate to the
splitting and merging of streams are supported by special primitive process types. When properly applied, these primitives produce only locally contained nondeterminism, e.g., within client-server subsystems, but have no global effects.

K2 also provides abstraction mechanisms for subsystems of cooperating processes which are similar to code abstractions as they are available in conventional programming languages, including recursively expandable process specifications. Processes in K2 may, in a first step, simply be specified as empty containers which just provide interfaces to incoming and outgoing streams. These containers may, in a second step, be inscribed with function specifications, written in some functional or even in a procedural language, which must be made compatible with the respective interfaces. Existing program modules may thus be readily migrated from sequential processing.

The strict separation of the process level from the program specification level, in conjunction with the restrictions imposed on the communication structures, facilitates the application of analysis methods that are well-known from Petri net theory to verify the existence of structural invariance properties which are essential pre-requisites for safety and liveness of the process system (see [10,11,1]). The correctness of the process inscriptions may be independently verified or validated by other means. Other than that the computations carried out by the processes must terminate eventually, they have no impact on the dynamic behavior of the process system.

K2 is implemented on widely available (de-facto) standard message passing software (PVM [9]) for easy portability to different platforms. K2 specifications are compiled to C code for efficient execution.

The remainder of the paper is organized as follows: it first gives a brief introduction to K2, followed by an overview of its PVM implementation. The main part presents and analyzes performance results obtained from three benchmark programs which are to determine the overhead inflicted by the K2 runtime system relative to direct PVM implementations of the same programs, and to investigate the efficiency of recursive process structures wrt scaling workload distribution to available processing sites. The last section summarizes the current system status and discusses further work.
2 The Coordination Language K2
This section briefly introduces K2 to the extent necessary to follow the remainder of the paper; a detailed description may be found in [2]. K2 process systems are composed of processes and of streams which communicate objects of the underlying computation language(s). Streams are realized as FIFO-queues of finite capacities which establish finite synchronic distances (1) within which producing and consuming processes may operate concurrently.

(1) A synchronic distance generally defines the number of occurrences by which one of two transactions can get ahead of the other before the latter must catch up.

Conceptually, processes conform exactly to the firing semantics of transitions in ordinary Petri-nets. A process instance consumes at least one object (or token)
from every input stream, performs some operation on them, and produces at least one result token in each of its output streams. It is therefore enabled iff it finds sufficiently many tokens in its input streams and sufficient free capacities in its output streams. Each stream connects just one producing and one consuming process, thus excluding conflict situations (which may cause unintended nondeterminism), as enabled processes can never be de-activated by the firing of other processes.

The basic building blocks of K2 programs are atomic processes whose computations need to be specified by inscriptions with functional or procedural pieces of code. The current implementation only requires that the code of the inscriptions can be called from C, because K2 has been implemented in that language; no other requirements exist. C has been chosen for the implementation because it is very portable and PVM routines can be called directly from it.

Controlled forms of (local) nondeterminism, which are, for instance, necessary to realize client-server (sub-)systems, are supported by the primitive process types split and spread which pass tokens from one input stream to one of several output streams, and by select and merge which route tokens from one of several input streams to one output stream. In the case of split and select primitives the selection depends on the value of a control token supplied via a designated control stream, while spread and merge primitives perform a nondeterministic, fair selection.

K2 also provides a general abstraction mechanism for the construction of process hierarchies and of recursive processes. Hierarchical processes may be recursively assembled from subsystems composed of atomic, primitive and hierarchical processes which, for reasons of compositionality, interface with other processes in exactly the same way as atomic processes, i.e., they consume sets of tokens from input streams, perform computations on them, and produce sets of result tokens to be placed in output streams. Hierarchical processes may also be operated as pipelines of some pre-specified depths (or finite synchronic distances) between input and output. They specify the maximum number of token sets which, in their order of arrival, can be processed simultaneously.

The programming environment of K2 currently consists of a graphical editor, a compiler, and a runtime library. The user may specify process systems with the help of the editor, whose graphical interface is shown in Fig. 1. It depicts a simple master/slave program. The equivalent textual representation, which is used as input for the K2 compiler, looks like this (2):

  extern C int mastergen(int i,int *o);
  extern C int slavecomp(int i,int *o);

  atomic master(IN int r, OUT int t) { mastergen(r,t); }
  atomic slave(IN int i, OUT int o) { slavecomp(i,o); }

  void main() {
    int r,t,t1,t2,r1,r2;
    master(r,t);
    spread(t,t1,t2);
    slave(t1,r1);
    slave(t2,r2);
    merge(r,r1,r2);
  }

(2) This is one of the benchmark programs used in Sect. 4.
Fig. 1. The graphical interface of the K2 system, showing a simple master/slave process system
The first two lines define the signatures of the C interfaces of the functions that are used as process inscriptions. Next are two atomic process definitions which have parameter lists that define the types of input and output streams. The main process declaration is similar to main() in a C program. It defines a hierarchical process that (indirectly) contains all process applications. Stream declarations are denoted as stream name, where n specifies the stream capacity (with used as abbreviation for a capacity of one).

To illustrate what a simple program inscription for an atomic process looks like, here is a C wrapper for a function called hitc_slave() which does the actual computing of the slave process:

  int slavecomp(int t,int *o) { if (t

sumEuler :: Int -> Int
sumEuler n = let eulerList = map euler [1..n]
             in  seq (parList eulerList) (sum eulerList)

euler :: Int -> Int
euler n = let relPrimes = filter (relprime n) [1..(n-1)]
          in  par (spine relPrimes) (length relPrimes)

parList :: [a] -> ()
parList = foldr par ()

spine :: [a] -> ()
spine [] = ()
spine (_:xs) = spine xs

hcf :: Int -> Int -> Int
hcf x 0 = x
hcf x y = hcf y (rem x y)

relprime :: Int -> Int -> Bool
relprime x y = (hcf x y==1)
Fig. 1. The sumEuler program
Fig. 2. Activity graph for Version 1 of the program. The bands from bottom to top represent the number of running, runnable, fetching, and blocked threads.
Fig. 3. Per-thread graph for Version 1 of the program. The lifetime of each thread is represented by a horizontal line, shaded grey when the thread is running, light grey when it is runnable, and black when it is blocked.
2.3 Using Queries to Extract Information
The activity graph for sumEuler shows that not all the processors are being used during the whole computation. To find out why, we begin our investigation by asking how much parallelism each par-site generated. This simple question is hard to answer using current tools: the programmer has to search through the log-file, which in this case is 2876 lines long! Our tool provides a structured way to ask such questions using SQL (1) [18] queries. The relational tables and underlying model on which the queries are based are briefly described in Sect. 3. To ask one interpretation of the question How much parallelism did each par-site generate? the programmer enters the following query:
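Written against the tables described in Sect. 3, such a query might look roughly as follows; the table and field names used here (Spark, Thread, Activity, Run and their columns) are assumptions based on that data model rather than the tool's exact schema:

  -- Sketch only (assumed schema): average length of the running periods,
  -- grouped by the par-site whose spark created the thread.
  SELECT   Spark.Par_Site, AVG(Run.End_Time - Run.Start_Time)
  FROM     Spark, Thread, Activity, Run
  WHERE    Thread.Was_Spark       = Spark.Spark_Id
  AND      Activity.Undertake_For = Thread.Thread_Id
  AND      Run.Activity_No        = Activity.Activity_No
  GROUP BY Spark.Par_Site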
In plain English: What is the average length of time of all the running periods, for threads created by each par-site? The answer is shown in Fig. 4. The par-site inside the euler function generates smaller threads than the other par-site and the main thread. The cost of creating a thread and controlling communication can be high. Where a thread does not perform much evaluation, the costs can outweigh the benefit of parallelism. To test whether this is the case in this example, we remove the par-site within the euler function.
(1) We do not impose a restriction on the SQL queries that can be expressed apart from those inherited from the database system we use.
Fig. 4. Average run-time for each par-site. The left column contains the par-site names: 0 is the main thread; euler refers to the par-site inside the euler function, and similarly for parList. The right column shows average length of running periods for each par-site.
  euler :: Int -> Int
  euler n = length (filter (relprime n) [1..(n-1)])

Call this Version 2 of the program. Using GranSim with the same set-up as the previous run, the run-time drops by more than 30% to 182791 cycles; so the par inside the euler function was wasteful.

2.4 Using Graphs to Extract Information
Fig. 5 shows the Per-Thread graph for Version 2 of the program. The dominant solid grey shading reassures us that the threads spend most of their time running. Lifetimes of the threads increase over time. This means that at the beginning of the execution all the processors were hungry for tasks, most of which resulted in little computation. Might it be better if some of the larger tasks were created at the beginning of the execution, so that the processors’ idleness is distributed across the computation? One way to do this is to reverse the order of the list of euler function applications. In Version 3 of the program sumEuler is defined accordingly:

  sumEuler :: Int -> Int
  sumEuler n = let eulerList = map euler [n,n-1..1]
               in  seq (parList eulerList) (sum eulerList)

The reader may wonder why creating larger tasks at the beginning and small tasks at the end could be an improvement. The simplest way to understand this is to look at a mini per-thread graph. If we assume that a program creates four tasks, lasting 1, 1.5, 2, and 2.5 ticks respectively, and the main thread takes 0.5 ticks to evaluate the expressions that create the parallelism, then we would get per-thread graphs (omitting the main thread) resembling those in Fig. 6.

Running Version 3 of the program results in a run-time of 158497: 15% faster than Version 2, and 45% faster than Version 1. This part of the analysis has not used any novel feature of our tool. But it serves to illustrate the usefulness of supporting the traditional graphs in connection with the query interface.
Fig. 5. Per-thread graph for Version 2 of the program.
Fig. 6. Mini per-thread graphs showing short euler applications evaluated first (left, original order) and longer euler applications evaluated first (right, reverse order).
2.5 Using Graphs to Compose Queries
Fig. 7 shows the Activity graph for Version 3 of the program. During the first quarter of the computation the level of parallelism rises steadily. However, in the period after this there is a drop in parallelism, followed by a pick-up in parallelism, and another peak and trough. So, why do we get the two drops in the level of parallelism? The graph in Fig. 8 is the per-thread graph for the same run. To work out why we get the tidal effect we use an important feature of our tool: the ability to click on regions of the graph to help fill in a query. The user can click on any part of any bar in a per-thread graph to generate an activity reference that is sent to the query editor (see Sect. 3 for details of the data model). Fig. 8 shows us that there are no runnable threads during the drop in parallelism, but we do not know whether there are any unevaluated sparks (2) waiting to be evaluated, and if so, on what processors these sparks lie. To answer this question we click on the end of one of the threads on the graph at a point where
(2) When a par application is evaluated a spark is created to evaluate an expression in parallel. When this evaluation begins, the spark is turned into a thread.
Fig. 7. Activity graph for Version 3 of the program.
the level of parallelism is low. This sends a reference to this point to the Query editor (shown in Fig. 8 as the highlighted 72). We fill in the rest of the query to ask how many sparks lie on the different processors at the point we clicked. Submitting this query returns the table:
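A sketch of what such a query might look like follows; the Lie_On table and the column names used here are assumptions in the spirit of the data model of Sect. 3 rather than the tool's exact schema, and 72 is the activity reference obtained by clicking:

  -- Sketch only (assumed schema): count the sparks lying on each processor
  -- at the time of the clicked activity (reference 72).
  SELECT   Lie_On.Pe_Number, COUNT(*)
  FROM     Lie_On, Activity
  WHERE    Activity.Activity_No = 72
  AND      Lie_On.From_Time <= Activity.End_Time
  AND      Lie_On.To_Time   >= Activity.End_Time
  GROUP BY Lie_On.Pe_Number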
There are 25 sparks but they all lie on processor 0. Clicking on other areas with low parallelism results in much the same table being produced. This information holds the key to understanding what is happening: one reason for the drop in parallelism is that all the threads terminate at approximately the same time. This would cause no problems if each of the idle processors had some sparks waiting to be evaluated, but because all the parallelism is sparked by the main thread, all the sparks lie on the processor evaluating the main thread. It takes time for each of the processors to obtain more work. As soon as each of the processors imports a spark the level of parallelism picks up again.

Had there been no limit on the number of processors, the peak and trough effect would not have emerged. The activity graph for an unlimited processor machine would be approximately the same as the activity graph for a 32 processor machine with the peaks stacked on top of each other. This would make the per-thread graph look similar to the right-hand mini per-thread graph in Fig. 6.

We have shown that the lack of parallelism in some regions of the graph is due to the lack of distribution of sparks across the processors. We leave it as an exercise for the reader to work out how to distribute the sparks more effectively.
Fig. 8. A query about Version 3: how many sparks lie on each processor at the point where the user clicks?
2.6 Using Queries to Generate Graphs
The final, and probably most important, feature of the current prototype of our tool is the ability to select what information is displayed in a graph using a query. Selective display is useful because a display of the full graph is too crowded. As a simple example consider the very first question we asked at the beginning of this paper: How much parallelism does each par-site generate? In Sect. 2.3 we answered this question using a query, and the result was displayed in a table. An alternative way to answer this question is to filter the information so that only the data for selected par-sites is displayed. Fig. 9 shows the per-thread graph for only those threads created by par-site parList. A simple query (shown in the query editing box) was used to filter the information displayed. A similar query can be used to display only the threads created by par-site euler, for which the threads are considerably shorter.

Of course it is not reasonable to enter any query and expect a graphical result, so we limit the type of queries that can be displayed. Currently we do this by locking the SELECT, GROUP BY, and ORDER BY fields so that the user only has control over the filtering WHERE clause. Even with such a restricted class of queries, it is possible to create a highly complex filter. For example, we could choose to display only those threads that were created by processor 0 by parList, lasted longer than the average thread, and never blocked.
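To give a feel for what such a filter could look like, a query in this spirit (omitting the processor condition) might read as follows; the table and column names are assumptions based on the data model of Sect. 3, not the tool's actual schema:

  -- Sketch only (assumed schema): threads sparked by par-site parList that
  -- ran longer than the average thread and were never blocked.
  SELECT Thread.Thread_Id
  FROM   Thread, Spark
  WHERE  Thread.Was_Spark = Spark.Spark_Id
  AND    Spark.Par_Site   = 'parList'
  AND    Thread.Ends_At - Thread.Created_At >
           (SELECT AVG(t.Ends_At - t.Created_At) FROM Thread t)
  AND    NOT EXISTS
           (SELECT * FROM Activity, Block
            WHERE Activity.Undertake_For = Thread.Thread_Id
            AND   Block.Activity_No      = Activity.Activity_No)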
Fig. 9. Per-Thread graph for Version 1 of the program, filtered to display only threads created by par-site parList.
If we combine the ability to click on graphs to help compose queries with the ability to generate graphs from queries, we are able to dynamically change a graph, circumventing one of the major weaknesses of current tools that use static graphs.
3 Implementation
Fig. 10 shows the five main components of our tool.
Fig. 10. Structure of the tool: 1. GranSim (takes the source code and produces the log-file), 2. Builder, 3. DBMS (database management system), 4a. Query Evaluator (exchanges sql-query and results), 4b. GUI (takes user input and displays tables/graphs).
GranSim: As we said in the previous section, GranSim [10,15,16] is a parallel simulator. We use GranSim to compile and run our parallel programs to produce a log-file.

Builder: The Builder takes a GranSim-generated log-file and processes it to produce relational tables. These relational tables are derived from the entity-relationship [5] diagram in Fig. 11. A brief description of the entity-relationship diagram follows, with an example query. We describe the development of this data model and the importance of data modelling for profiling tools elsewhere [4].

In Fig. 11 the boxes represent entities, the connecting arcs relationships, and the bullets attributes of entities. The diagram can be summarised: a thread has many activities which take place on different processors; during a single activity period a thread may run and create sparks, block on a piece of graph being evaluated by another thread, queue in a runnable state, migrate to another processor, or fetch a section of graph from another processor; and before a spark is turned into a thread it may lie-on different processors.

The diagram can be used as a template to compose queries. For example, to ask Which thread t1 created thread t2? we use the following path through the entity-relationship diagram:

  t2 = THREAD(thread_id) --was_spark--> SPARK(spark_id) --sparked_by--> RUN(activity_no)
       --activity--> ACTIVITY(activity_no) --undertake_for--> THREAD(thread_id) = t1

We apply standard transformations [26] to the entity relationship diagram to generate the format for the relational tables. In general, each of the entities becomes a table, and its attributes and connecting arcs become fields. For example, the Thread table has the following fields: Thread_Id, Created_At, Ends_At, and Was_Spark. There are five main tables (Spark, Thread, Activity, Lie-On and Processor) and five tables for the entities that are subtypes of ACTIVITY (Run, Migrate, Queue, Fetch, and Block) (3). The TIME entity is replaced by attributes of each of the other entities to which it is related. It is easy to translate a query written using the entity relationship diagram into a query over the relational tables. For example, using SQL as the relational query language the above query Which thread t1 created thread t2? is written as:

  SELECT t1.Thread_Id
  FROM   Thread t1, Thread t2, Spark, Run, Activity
  WHERE  t2.Was_Spark           = Spark.Spark_Id
  AND    Spark.Sparked_By       = Run.Activity_No
  AND    Run.Activity_No        = Activity.Activity_No
  AND    Activity.Undertake_for = t1.Thread_Id
(3) It is also necessary to have an extra table Pruned, as there is no simpler way to translate an either-or relationship into relational tables.
In Sect. 2.6 we said that when the user clicks on a graph an activity reference is sent to the query editor. Such a reference is an Activity_No (an attribute of the Activity table). It can be used to form queries about individual activities (say, their start and end times), and also about other entities (such as threads and processors) that are associated with the activity.
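For instance, a clicked reference such as 72 could be used in a query of the following shape; the field names are again assumptions in the style of the data model in Fig. 11 rather than the tool's exact schema:

  -- Sketch only (assumed schema): details of the clicked activity.
  SELECT Activity.Start_Time, Activity.End_Time, Activity.Undertake_For
  FROM   Activity
  WHERE  Activity.Activity_No = 72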
[Fig. 11 is an entity-relationship diagram. Its entities and attributes include SPARK (spark_id, par_site, pruned_at), THREAD (thread_id, created_at, ends_at), ACTIVITY (activity_no) with the subtypes RUN, MIGRATE, QUEUE, FETCH and BLOCK, PROCESSOR (pe_number), LIE-ON and TIME (time); the connecting relationships include was_spark, sparked_by, undertake_for, hosted_by, lie_from/lie_to, block_on, to_processor/from_processor and start_time/end_time.]

Fig. 11. The data model used by our tool.
Building the relational tables: To build the relational tables there are four phases. All but the first are implemented in Haskell.

1. Remove redundant data and sort – GranSim log-files are quite verbose and contain a lot of redundant data. We use the UNIX utilities awk and sort to provide quick removal of redundant data and to sort the file.
2. Parse the log-file.
3. Create relational tables from the parse tree.
4. Submit tables to the database engine using the GreenCard pre-processor [20] to generate the routines that communicate with the database server.

DBMS: We could have written our own query retrieval system. However, to save on production time we decided to use a popular database server and client.
This has the advantage that file storage, and the implementation of clever algorithms to retrieve the data efficiently, are left to the database server. The builder and query evaluator (both Haskell applications) communicate with the database server using a C API and GreenCard 2 [20] – a foreign language interface pre-processor.

We use PostgreSQL [28], a freely distributable relational database system supporting a fairly large subset of SQL, as the database server. Originally we used the Mini-SQL database engine but bugs effectively prevented any use of indexes. One thing that surprised us was the ease of converting from one database server to another. Despite our tool being designed around the C API that accompanies Mini-SQL, it took no more than a couple of hours to change it over to use PostgreSQL. One reason for this was the similarity of the C APIs of the two database servers. However, the main reason we managed to convert with relative ease was how well the GreenCard functions integrate into Haskell. Changes to the C API meant little or no changes to the Haskell code, and only moderate changes to the GreenCard sources.

Query Evaluator: The query evaluator exchanges messages between the GUI and the database server (4). It communicates with the GUI using socket streams, implemented using GreenCard [20], and with the database server using PostgreSQL's C API and GreenCard. In general we have found GreenCard to be very suitable for interfacing with the database server. It was useful to get Haskell to garbage collect C data structures automatically, and convenient that side-effect free C functions could be treated as pure functions in Haskell. For example, consider the function getValue:

  %fun getValue :: PG_Result_Ptr -> Int -> Int -> String
  %call (pg_Result_Ptr x) (int col) (int row)
  %code res = PQgetvalue(x,col,row);
  %result (string res)

This function takes a pointer to a database result table, a column and row number, and returns the value at that position by calling the C function PQgetvalue. To return the whole table we can use the function getTable:

  getTable :: PG_Result_Ptr -> [[String]]
  getTable ptr = map fetchRow [1..(numRows ptr)]
    where fetchRow n = map (getValue ptr n) [1..(numFields ptr)]

As getValue is a pure function (because PQgetvalue is side-effect free) the result of getValue is only translated into Haskell on demand. So given an expression such as (getTable ptr !! 3), that returns the fourth row of the table, only
(4) If HBC (the compiler used for the GUI) had a GreenCard pre-processor, then this part of the system would not have been required as the GUI could have communicated directly with the C API.
the entries in the fourth row are translated into Haskell. As some tables returned from SQL queries are very large, this can save a lot of time and space.

One problem we found with GreenCard (and as far as we know it is present in other foreign language interfaces to Haskell [6]) is that there is no standard way to access Haskell file handles within C across different Haskell implementations. So there is no portable way to read/write a file opened in C using the standard Haskell readFile and writeFile routines.

GUI: We shall first give an overview of how the GUI is implemented. Then, for the reader with experience using the Fudget library, we outline a problem we found writing our program using Fudgets and a possible solution to this problem.

Implementation overview: We wanted to implement as much of our tool in Haskell as possible, including the GUI. We chose to use the Fudget Library [7] because it was both a purely functional solution and had previously been used to write substantial applications, such as a web browser [3]. Fudgets are stream processors with a high-level and a low-level stream. The low-level stream is connected to the I/O system and is usually transparent to the programmer. Fudgets communicate by sending and receiving messages along high-level streams connected by the programmer. The GUI is implemented as a parallel composition of fudgets wrapped in a loop (so that the fudgets can communicate with each other):

  mainF = loopRightThroughF handler $
            graphOutputF >+<   -- Display graphs
            graphInputF  >+<   -- Handle input from query evaluator
            sqlEditorF   >+<   -- SQL editing box
            statusF      >+<   -- Status bar
            buttonsF     >+<   -- Request buttons
            ...

When the user clicks on a button to evaluate a query, buttonsF handles the request, passing an appropriate message to graphInputF that sends a request to the query evaluator. When the query evaluator returns the result it is passed to graphOutputF for display as a graph/table.

Communication between fudgets: In a parallel composition of two fudgets (e.g. fud1F >+< fud2F) an incoming message is sent to the appropriate Fudget by tagging the message with Left or Right. As anyone who has written a program using Fudgets will know, when there are several fudgets in a parallel composition the tags become long and can be confusing to program. For example, to send a message to sqlEditorF above, the tag Right . Right . Left is used. As our program increased in size, arranging communication between fudgets became tricky. After we discussed this problem with Thomas Hallgren, he wrote a new fudget to arrange communication between n fudgets using n one-level tags.
Fudgets are composed in parallel with a new combinator >&+&<:

  ... tagGraphInputF >&< tagSqlEditorF >&< ...

In this definition the TagF structure has three components: the composed fudget combF; insideH, which is required but whose role is not of concern to us; and a dictionary of message tags (toGraphOutputF :&: toGraphInputF :&: ...) whose structure matches the parallel composition on the right-hand side. Now to send a message to sqlEditorF the one-level tag toSqlEditorF is used. Using this new fudget increased the speed of development and made our code much more understandable.
4 Current Problems and Bottlenecks
Speed: One of the first questions potential users may ask is How fast is it?, especially as the tool is written in Haskell. One of the aims of the tool is to speed up the development of parallel programs. If the tool were considerably slower than current tools then we might not achieve this objective. Table 1 gives the time in seconds taken to compile, run, process, and display both the activity graph and the per-thread graph for the example in this paper (1) using the tools that accompany GranSim (5) and (2) using our tool. Each time is a minimum over three runs (6). Each experiment was performed on a 270 MHz Sun Ultra 5 Server, 192MB of RAM, GUI running with 25MB heap, Query evaluator with 500KB heap.

The table shows that the main increase in time using our tool is the time taken to process the data to build the database. It turns out that most of this time is spent creating the indexes. In the current version of the database indexes are created on every primary key for every table. Some of these tables are used rarely. We plan to look carefully at what indexes are useful. However, even with the indexes as they are, the time taken to process the data is not necessarily longer using our new tool. Table 1 only shows the times taken to process the data for two graphs. Our tool processes the data only once. Other tools need to process the data every time a new graph or view of a graph is required. So, a more realistic estimate of the time taken to process the data for this example is (q × 12.3)/2 seconds, where q is the number of queries. The time taken to process the data using our tool is constant, 31.0 seconds. If more than five queries are required then for this example our tool will be faster.
(5) David King has written faster versions of these tools in C rather than Perl and Bash, but our measurements are against the standard tools.
(6) It was necessary to measure the real times rather than the user and system times because otherwise any work performed by the database server would not have been accounted for.
Method     Compile  Run  Process data                     Run browser      Display graphs  Total
Current    21.8     1.0  12.3 (Create PostScript graphs)  1.0 (Ghostview)  6.0             41.8
New Tool   21.8     1.0  31.0 (Build database)            7.0 (GUI)        7.1             67.9

Table 1. Time (in seconds) taken to compile, run, and display both the Activity graph and Per-Thread graph using GranSim's tools and our tool.
Space usage: One problem we have found with the GUI part of the tool is that it is sometimes necessary to allocate a very large heap. The main reason is that some of our code is too strict. We aim to make future versions more lazy.

Composing queries: Writing SQL queries can be laborious. One solution to this problem is to extend the query editor so that common queries are quicker to write. Another is to use a more concise functional query language (e.g. [17]), translated automatically into SQL.

Recording information: The current version of GranSim does not record information such as the relation between threads blocking on a closure and the thread that evaluates that closure. This restricts the class of data models that can be used as a basis for the relational tables. In [4] we highlight some inconsistencies with the implicit data models used by current tools.
5 Summary and Conclusions
We have described a prototype tool for profiling parallel programs. It has the advantage over current tools that it is possible both to view high-level graphical views of a computation and to pose queries over lower-level information in a structured way. With current tools the only link between the log-file and the graphs is a hidden translation. Our tool allows the programmer to click on the graphs to fill in parts of a query, and to use queries to control and filter the information that is displayed.

An initial version of our tool was purely text-based and only allowed the user to enter SQL queries and view the answers in a table. This was almost unusable: one did not know where to start. Even with the PostScript graphs at hand we still found it difficult to use. It was not until we linked together both the query evaluator and the graphical charts that we were able to use the tool effectively.

Although we have had little experience using our tool, the small examples we have worked with so far have been encouraging. We have had success in understanding them and making them run faster. Clearly, having a query language to retrieve information is going to be a lot more productive than searching through
the log-file manually. However, with the examples we have analysed, the main strength identified is the ability to interact with the graphs both to compose queries and to filter the information displayed in graphs.

The main disadvantage of our tool over current tools is that it takes longer to process the log-file and view the charts. However, when the programmer needs more information than standard charts provide, the query interface is significantly faster than searching through the log-file. In the end the time taken to write a good parallel program should be less than using conventional tools.

We have also found that it is possible to implement profiling tools largely within a lazy functional language. Most other profiling tools written for functional languages are implemented in other languages. Development in Haskell had its disadvantages, such as the extra demand on resources. However, it would have taken a lot longer to develop the tool in a conventional procedural language.

5.1 Related Work
The most similar work to ours is that of Lei et al. [14]. They have implemented a performance tuning tool for message-passing parallel programs. As with our tool they use a database query system. The data model they use is at a lower level than ours, and is designed mainly to extract communication and synchronisation costs. The results of a query can be displayed in either tables or graphical visualisations. However, one of the most important features of our tool, the ability to interact with the graphs to help fill in the queries, is not implemented in their tool. One interesting feature of their tool is the incorporation of a 3-D spreadsheet to analyse data between program runs. For example, it is possible to use the 3-D spreadsheet package to compare the run-times for different program runs and display them in a bar chart.

Halstead's Vista system [8] (based on an earlier system called Pablo [21]) provides a generic browser that can be used to analyse different types of performance data. In particular, Halstead has used Vista to analyse parallel symbolic programs [9]. Vista presents the programmer with different views of the parallel execution which resemble the activity and per-thread graphs generated by our tool. The programmer can interact with the graphs, to choose what information to display, and how to display it. Vista provides a wide range of options to filter the information displayed but does not allow the complex filtering of data that our query interface provides.

Our tool currently uses the par-site as a reference to the source-code. Two independent projects are working on extending GranSim to record alternative source-code references. Hammond et al. [11] have extended GranSim with the cost-centre mechanism. King et al. [13] have developed a profiler for programs that introduce parallelism using evaluation strategies [24]. It should not take much work to adapt our tool to handle the alternative source-code references and display the adapted graphical views used by these two tools.
5.2 Future Work
The current prototype of our tool only allows the programmer to click on the per-thread graph to help compose queries. We plan to extend this so that other graphs can also be used. We have tried to make the tool as generic as possible. It should not be too difficult to extend the tool to handle data from heap profiling and time profiling. The advent of standard log-files [12] would promote this development. Writing SQL queries can be a chore. We plan to implement a more functional query language on top of SQL. One way to do this would be to use something similar to list comprehensions [2].
Acknowledgements. Our thanks go to Thomas Hallgren for his help with the more tricky aspects of Fudget programming, and for his quick responses to the occasional bug report. We also thank Malcolm Wallace for guidance on GreenCard and nhc.
References

1. L. Augustsson and T. Johnsson. Parallel Graph Reduction with the ⟨ν, G⟩-machine. In Proc. 1989 ACM Conference on Functional Programming Languages and Computer Architecture (FPCA '89), pages 202–212, 1989.
2. P. Buneman, L. Libkin, V. Tannen, and L. Wong. Comprehension Syntax. SIGMOD Record, 23(1), 1994.
3. M. Carlsson and T. Hallgren. Fudgets – Purely Functional Processes with Applications to Graphical User Interfaces. PhD thesis, Department of Computer Science, Chalmers University of Technology, Sweden, 1998.
4. N. Charles and C. Runciman. Performance Monitoring. In Research Directions in Parallel Functional Programming. Springer-Verlag, 1999. In preparation.
5. P. P. Chen. The Entity-Relationship Model: Towards a Unified View of Data. ACM Transactions on Database Systems, 1(1), 1976.
6. S. Finne, D. Leijen, E. Meijer, and S. Peyton Jones. H/Direct: A Binary Foreign Language Interface for Haskell. In Proc. 3rd International Conference on Functional Programming (ICFP '98), pages 153–162. ACM Press, 1998.
7. T. Hallgren and M. Carlsson. Programming with Fudgets. In Advanced Functional Programming, 1st International School on Functional Programming Techniques, Båstad, Sweden, volume 925 of LNCS, pages 137–182. Springer-Verlag, 1995.
8. R. Halstead. Self-Describing Files + Smart Modules = Parallel Program Visualization. In Theory and Practice of Parallel Programming, volume 907 of LNCS. Springer-Verlag, 1994.
9. R. Halstead. Understanding the Performance of Parallel Symbolic Programs. In Parallel Symbolic Languages and Systems (Workshop Proceedings), volume 1068 of LNCS. Springer-Verlag, 1996.
10. K. Hammond, H-W. Loidl, and A. Partridge. Visualising Granularity in Parallel Programs: A Graphical Winnowing System for Haskell. In A.P.W. Bohm and J.T. Feo, editors, Proc. HPFC'95 – High Performance Functional Computing, pages 208–221, 1995.
11. K. Hammond, H-W. Loidl, and P. Trinder. Parallel Cost Centre Profiling. In Proc. 1997 Glasgow Workshop on Functional Programming, 1997.
12. S. Jarvis, S. Marlow, S. Peyton Jones, and E. Wilcox. Standardising Compiler/Profiler Log Files. In K. Hammond, A.J.T. Davie, and C. Clack, editors, Draft Proc. 10th International Workshop on the Implementation of Functional Languages (IFL '98), London, England, pages 429–446, September 1998.
13. D. King, J. Hall, and P. Trinder. A Strategic Profiler for Glasgow Parallel Haskell. In This Proceedings, 1998.
14. S. Lei, K. Zang, and K.-C. Li. Experience with the Design of a Performance Tuning Tool for Parallel Programs. Journal of Systems and Software, 39(1):27–37, 1997.
15. H-W. Loidl. GranSim User's Guide, 1996. http://www.dcs.glasgow.ac.uk/fp/software/gransim/.
16. H-W. Loidl. Granularity in Large Scale Parallel Functional Programs. PhD thesis, Department of Computer Science, University of Glasgow, 1998.
17. R. Nikhil. An Incremental, Strongly-Typed, Database Query Language. PhD thesis, University of Pennsylvania, 1984.
18. Oracle Corporation, Belmont, California, USA. SQL Language, Reference Manual, Version 6.0, 1990.
19. J. C. Peterson, K. Hammond, L. Augustsson, B. Boutel, F. W. Burton, J. Fasel, A. D. Gordon, R. J. M. Hughes, P. Hudak, T. Johnsson, M. P. Jones, E. Meijer, S. L. Peyton Jones, A. Reid, and P. L. Wadler. Report on the Non-Strict Functional Language, Haskell, Version 1.4. Yale University, 1997. Available at http://haskell.org.
20. S. L. Peyton Jones, T. Nordin, and A. Reid. Green Card: a Foreign-Language Interface for Haskell. In Proc. 2nd Haskell Workshop, Amsterdam, 1997. Oregon Graduate Institute, 1997. http://www.dcs.gla.ac.uk/fp/software/green-card/.
21. D. Reed, R. Aydt, T. Madhyastha, R. Noe, K. Shields, and B. Schwartz. Scalable Performance Environments for Parallel Systems. In 6th Distributed Memory Computing Conference (IEEE), 1991.
22. P. Roe. Parallel Programming using Functional Languages. PhD thesis, Glasgow University, UK, 1991.
23. C. Runciman and D. Wakeling. Profiling Parallel Functional Computations (Without Parallel Machines). In Proc. 1993 Glasgow Workshop on Functional Programming, pages 236–251. Springer-Verlag, 1993.
24. P. Trinder, K. Hammond, H.-W. Loidl, and S. L. Peyton Jones. Algorithm + Strategy = Parallelism. Journal of Functional Programming, 8(1):23–60, 1998.
25. P. W. Trinder, K. Hammond, J. S. Mattson, A. S. Partridge, and S. L. Peyton Jones. GUM: a Portable Parallel Implementation of Haskell. In Proc. PLDI '96, Philadelphia, Penn., pages 79–88, 1996.
26. R. P. Whittington. Logical Design. In [27], chapter 10, 1988.
27. R. P. Whittington. Database Systems Engineering. Clarendon Press, Oxford, 1988.
28. A. Yu and J. Chen. The POSTGRES95 User Manual, Version 1.0, 1995.
Shared Memory Multiprocessor Support for SAC

Clemens Grelck

University of Kiel, Dept. of Computer Science and Applied Mathematics, D-24105 Kiel, Germany
[email protected]

Abstract. Sac (Single Assignment C) is a strict, purely functional programming language primarily designed with numerical applications in mind. Particular emphasis is on efficient support for arrays, both in terms of language expressiveness and in terms of runtime performance. Array operations in Sac are based on elementwise specifications using so-called With-loops. These language constructs are also well-suited for concurrent execution on multiprocessor systems. This paper outlines an implicit approach to compile Sac programs for multi-threaded execution on shared memory architectures. Besides the basic compilation scheme, a brief overview of the runtime system is given. Finally, preliminary performance figures demonstrate that this approach is well-suited to achieve almost linear speedups.
1 Introduction
Sac (Single Assignment C) is a strict, first-order, purely functional programming language primarily designed with numerical applications in mind. Particular emphasis is on efficient support for array processing. Efficiency concerns are essentially twofold. On the one hand, Sac offers the opportunity of defining array operations on a high level of abstraction, including dimension-invariant program specifications, which generally improves productivity in program development. On the other hand, sophisticated compilation schemes ensure efficiency in program execution. Extensive performance evaluations on a single though important kernel application (3-dimensional multigrid relaxation from the Nas benchmark [5]) show that Sac clearly outperforms its functional rival Sisal [17] both in terms of memory consumption and in terms of wallclock execution times [23]. Even the Fortran reference implementation of this benchmark is outperformed by about 10% with respect to execution times. Although numerical computations represent just one application domain, it is certainly a very important one, with many applications in computational sciences. In these fields, the runtime performance of programs is the most crucial issue. However, numerical applications are often well-suited for non-sequential program execution. On the one hand, underlying algorithms expose a considerable amount of concurrency; on the other hand, the computational complexity
can be scaled easily with the computational power available. So, multiprocessor systems allow substantial reductions of application runtimes, and, consequently, computational sciences represent a major field of application for parallel processing. Therefore, sufficient support for concurrent program execution is particularly important for a language like Sac. Due to the Church-Rosser property, purely functional languages are often considered well-suited for implicit non-sequential program execution, i.e., the language implementation is solely responsible for exploiting concurrency in multiprocessor environments. However, it turns out that determining where concurrent execution actually outweighs the administrative overhead inflicted by communication and synchronization is nearly as difficult as detecting where concurrent program execution is possible in imperative languages [25]. Many high-level features found in popular functional languages like Haskell or Clean, e.g. higher-order functions, polymorphism, or lazy evaluation, make the necessary program analysis even harder. As a consequence, recent developments are often in favour of explicit solutions for exploiting concurrency. Special language constructs allow application programmers to specify explicitly how programs are to be executed on multiple processors. Many different approaches have been proposed, ranging from simple parallel map operations to full process management capabilities and even pure coordination languages [26, 19, 4, 9, 12, 16]. Although the actual degree of control varies significantly, explicit solutions have in common that, in the end, application programmers themselves are responsible for the efficient utilization of multiprocessor facilities. Programs have to be designed specifically for execution in multiprocessor environments and, depending on the level of abstraction, possibly even for particular architectures or concrete machine configurations. However, typical Sac applications spend most of their execution time in array operations. In contrast to load distribution on the level of function applications, elementwise defined array operations are a source of concurrency that is rather well-suited for implicit exploitation, as an array's size and structure can usually be determined in advance, often even at compile time. This allows for effective load distribution and balancing schemes. Implicit solutions for parallel program execution offer well-known advantages: being not polluted with explicit specifications, a program's source code is usually shorter, more concise, and easier to read and understand. Also, programming productivity is generally higher since no characteristics of potential target machines have to be taken into account, which also improves program portability. Functional languages like Sisal [17], Nesl [6], or Id [3] have already demonstrated that, following the so-called data parallel approach, good speedups may well be achieved without explicit specifications [11, 7, 13]. Successfully reducing application runtimes through non-sequential program execution makes it necessary to consider at least basic design characteristics of intended target hardware architectures. Having a look at recent developments in this area, two trends can be identified. Up to a modest number of processing facilities (usually ≤ 32) symmetric shared memory multiprocessors dominate. If
a decidedly larger number of processing sites is to be used, the trend is towards networks of entire workstations or even personal computers. Both approaches are characterized by reusing standard components for high performance computing, which is not only more cost-effective than traditional supercomputers but also benefits from an apparently higher annual performance increase. In our current approach for Sac, we focus on shared memory multiprocessors. Machines like the Sun Ultra Enterprise Series, the HP 3000/9000 series, or the SGI Origin have become wide-spread as workgroup or enterprise servers and already dominate the lower part of the Top500 list of the most powerful computing facilities worldwide [10]. Although their scalability is conceptually limited by the memory bottleneck, processor-private hierarchies of fast and sufficiently large caches help to minimize contention on the main memory. Theoretical considerations like Amdahl's law [2], however, show that an application itself may be limited with respect to scalability anyway. Our current approach may also serve as a first step to be integrated into a more comprehensive solution covering networks of shared memory multiprocessors in the future. As a low-level programming model, multi-threading seems to be tailor-made for shared memory architectures. It allows for different (sequential) threads of control within the single address space of a process. Each thread has its private execution stack, but all threads share access to the same global data. This programming model exactly coincides with the hardware architecture of shared memory multiprocessors, which is characterized by multiple execution facilities but uniform storage. To ensure portability between different concrete machines within the basic architectural model, the current implementation is based on Posix threads [18] as the major standard. The rest of the paper is organized as follows: after a short introduction to Sac in Sect. 2, the basic concepts of our shared memory multiprocessor implementation are outlined in Sect. 3. Preliminary performance figures are presented in Sect. 4. Finally, Sect. 5 draws conclusions and discusses future work.
2 SAC — Single Assignment C
This section gives a very brief overview of Sac. A more detailed introduction to the language may be found in [21, 24]; its strict, purely functional semantics is formally defined in [20]. The core language of Sac may be considered a functional subset of C, ruling out global variables and pointers to keep the language free of side effects. It is extended by the introduction of arrays as first-class objects. An array is represented by two vectors: a data vector which contains the elements of the array, and a shape vector which provides structural information. The length of the shape vector specifies the dimensionality of the array, whereas its elements define the array's extension in each dimension. Built-in functions allow determination of an array's dimension or shape as well as extraction of array elements. Complex array operations may be specified by means of so-called With-loops, a versatile language construct similar to the array comprehensions of
Haskell or Clean and to the For-loops of Sisal. It allows the dimension-invariant, elementwise definition of operations on entire arrays as well as on subarrays selected through index ranges or strides.

WithExpr  ⇒  with ...
Generator ⇒  Expr Relop Identifier Relop Expr
Relop     ⇒  < | <=

thenST :: ST s a -> (a -> ST s b) -> ST s b

All external state-based facilities, i.e. I/O and foreign procedure calls, are captured into a single state type called RealWorld, and all access to these external facilities occurs in a single global state thread, so that a linear ordering is put on all procedure calls. A simple approach to I/O could use the type:
type IO a = ST RealWorld a

All internal facilities such as incremental arrays are bundled into a polymorphic state type s which notionally represents the program's heap. All access to internal state types happens via references. These are provided as pointers to ordinary values or to special state types such as arrays. A reference of type MutVar s a denotes an updatable location in a state of type s, which contains values of type a, introduced by:

newVar :: a -> ST s (MutVar s a)

The state type s, as well as representing the heap, is also used for encapsulation to ensure safety of state thread accesses. For internal facilities, the primitive runST has been provided, which has a rank-2 polymorphic type:

runST :: (forall s. ST s a) -> a

This takes a complete action sequence as its argument, creates an initially 'empty' state, obeys the actions and returns the result, discarding the final state. Each internal state thread has a separate polymorphic state variable and the references in it are 'tagged' with this variable. For each state thread, the enclosing runST limits its scope and prevents its references from escaping, or references belonging to other state threads from being used within it. By forcing the state argument to be polymorphic, runST can only be used for internal facilities, since they refer to a polymorphic state type, and not for I/O, which is applied to the specific state type RealWorld. This enforces a single I/O state thread, taken to exist when the program begins, by preventing further unrelated I/O state threads from being created, which could destroy referential transparency. Both lazy and strict versions of thenST have been proposed:

thenST :: ST s a -> (a -> ST s b) -> ST s b
thenST m k s = k x s' where (x, s') = m s

strictThenST :: ST s a -> (a -> ST s b) -> ST s b
strictThenST m k s = case m s of (x, s') -> k x s'

The strict one has been suggested for use with I/O, for efficiency. Indeed, without it, an infinite printing loop might never produce any output, for example. Lazy state threads [10] have also been considered important because they preserve laziness in the presence of imperative actions. Additionally, lazy state threads seem to be the right choice in the case of internal state, where the state is discarded at the end of the computation. Their use with I/O, although it has been hinted at in the case of interleaved I/O operations [13,11], has so far been neglected. It turns out that they play an important role in deterministic concurrency, where evaluation of one thread should not force the main thread to terminate in order to see any results.
Fig. 1. (a) State Splitting. (b) State Forking
3 The Brisk Monadic Framework
The main issue that has permeated the design of the Brisk monadic framework is to allow states to have independent state components or substates, with separate threads acting on those substates, possibly concurrently. Specific functions are provided to create and act on substates. Each substate is associated with a reference and every access to a substate happens via this reference, as in Fig. 1. This implies that internal state types such as arrays are separated from the reference mechanism as described in [11] and they no longer need to be accessed via references, unless they happen to be substates of some other state. Thus, in the Brisk monadic framework there is no longer a distinction between internal (i.e. arrays) and external (i.e. I/O) state types. Instead, each facility has its own concrete type, so that both internal and external facilities are treated uniformly. Moreover, each state type can be defined separately in its own module. Similarly, a group of related foreign procedure calls can be encapsulated in its own module by defining a suitable state type and presenting the procedures as suitable actions. In the next sections, the Brisk monadic framework is outlined and all the facilities that derive from it are illustrated. First, we show how all the facilities mentioned in the GHC monadic framework [11] are reproduced in the Brisk one and then we present additional features that derive from our approach.

3.1 Generic Primitives
Any stateful computation is expressed via the type Act t s a, together with operations returnAct and thenAct for accessing it:

type Act t s a = s -> (a, s)

returnAct :: a -> Act t s a
thenAct   :: Act t s a -> (a -> Act t s b) -> Act t s b
run       :: s -> (forall t. Act t s a) -> a

The type Act t s a is very similar to ST s a, but we have changed the name to avoid confusion between the two. The returnAct and thenAct functions have
the same definitions as returnST and the lazy version of thenST. As with ST, Act is made abstract for protection. An action of type Act t s a transforms a state of type s and delivers a result of type a together with a new state of type s, as can be seen in Fig. 2. However, s will now normally be instantiated to a specific state type. The second purpose of s in ST s a, namely to act as a tag identifying a specific state thread, is separated out into the extra type variable t, which is never instantiated.

Fig. 2. Monadic Operators

As an example, we will show later that I/O facilities can be defined essentially as in Haskell with:

type IO t a = Act t RealWorld a

The IO type now has a tag t on it. It is possible to make this tagging less visible to the programmer, but we do not pursue that here. We will see later that specialised state-splitting facilities can be provided for safe handling of concurrent I/O threads. The run function takes an initial value for the state and an action to perform. It carries out the given action on the initial state, discards the final state and returns the result. As with runST, the tag variable t identifies a particular state thread at compile-time. Forcing the action to be polymorphic in t encapsulates the state thread, preventing references from escaping from their scope or being used in the wrong state. To prevent run from being used in conjunction with IO, the IO module does not export any useful state value which can be used as an initial state. The generic facilities for handling substates and references are:

type Ref t u s = ...

new :: s1 -> (forall u. Ref t u s1 -> Act t s a) -> Act t s a
at  :: Ref t u s1 -> Act u s1 a -> Act t s a

One of the main changes in the Brisk monadic framework concerns the philosophy of references. They are treated as being independent of any specific
substate type. References can be thought of as pointers to substates, see also Fig. 1. The reference type has two tags. The first refers to the main state and ensures that the reference is used in the state where it was created. The second tag refers to the substate, and ensures that the reference can only be used while the substate actually exists.

Fig. 3. Operators for Handling Substates

The new function is introduced to implement state splitting: it creates a new substate of type s1 with a given initial value. It does this for any desired type s of main state. It also creates a reference to the new substate, as can be seen in Fig. 3. The second argument to new is a function which takes the new reference and performs the actions on the new extended main state. This contrasts with definitions such as the one for openFile where a reference is returned directly and the main action sequence continues. However, in a context where there may be substates and subsubstates and so on, this encapsulation of the actions on the extended state is needed to provide a scope for the new tag variable u and to protect references from misuse. The function at, see also Fig. 3, carries out an action on a substate. The call at r f performs action f on the substate corresponding to the reference r. This is regarded as an action on the main state in which the rest of the state (the main state value and the other substates) is unaltered. In the main state, the substate accessible via r is replaced by the final state after the primitive actions in f have been performed. The actions in f are performed lazily, i.e. driven by demand on the result of f, and may be interleaved with later actions in the main thread. This assumes that the substate is independent of the main state so that the order of interleaving does not matter. Consider a sequence of actions:

a1 >> at r a2 >> a3 >> at r a4 >> a5 ...

As in Haskell, we use the overloaded functions >>=, >> and return for monadic operations, in this case referring to thenAct etc. The laziness of thenAct means that the actions in the sequence are driven only by the demand for results. The
usual convention that primitive actions should be strict in the state ensures that the actions happen in sequence. However, main state actions such as a3 are not strict in the substates of the main state. The action at r a2 replaces the substate referred to by r by an unevaluated expression; it is the next action a4 on the substate which causes a2 to be performed. Thus, the sequence of actions a1, a3, ... on the main state, and the sequence a2, a4, ... on the substate, may be executed independently. We are assuming here that the substate is independent of the main state, so that the order of interleaving does not matter. The laziness of the thenAct operator allows concurrency effects to be obtained in this way. It is possible that a main state action such as a5 may cause the evaluation of the result of some previous action on a substate such as a4. In this case, there is effectively a synchronisation point, forcing all previous actions on the substate to be completed before the main state action is performed. Now, each individual state type and its primitive actions can be defined separately in its own module. It is easy to add new state types, e.g. out of values or procedural interfaces. Furthermore, this modularised point of view can be used for providing separate interfaces to separate C libraries, instead of using a primitive ccall.
3.2 Semantics
The semantics of our proposed system does not require anything beyond normal lazy functional semantics. In particular, although state splitting is designed to work alongside concurrency, it is independent of concurrency and can be described in sequential terms. Thus, instead of presenting the semantics directly with a mathematical formalism, we choose to present it using a prototype implementation in Haskell. For the prototype implementation, we use Hugs, which supports 2nd-order polymorphism, a common extension to Haskell at the time of writing. The principal difficulty in producing the prototype is not in capturing the correct semantics, but rather in getting around Haskell's type system. When substates are added to a state, it would seem that the type of the main state needs to change, but the type changes are too dynamic to be captured by the usual polymorphism or overloading tricks. In the prototype implementation presented in this paper, the state type State s is polymorphic and describes a state with a value of type s and a list of substates. To get around Haskell's type system, substates are not described using type State but instead are given the substate type Sub. This is not polymorphic, but is a union of specific instantiations of the State type. Each type which can form a substate is made an instance of the Substate class, with functions sub and unsub which convert the generic state representation to and from the specific substate representation.

data State s = S s [Sub]

data Sub = SI Int [Sub]
         | SA (Array Int Int) [Sub]
         | SF File [Sub]
         | SW World [Sub]
         | ...
class Eval s => Substate s where
  sub   :: State s -> Sub
  unsub :: Sub -> State s

instance Substate Int where
  sub (S n ss)    = SI n ss
  unsub (SI n ss) = S n ss

instance Substate (Array Int Int) where
  sub (S ar ss)    = SA ar ss
  unsub (SA ar ss) = S ar ss
...

This use of overloading breaks the desired modularisation of state types, but it is only necessary in this prototype. In a real implementation where state facilities are built in, none of the above is necessary. State types do not need the State wrapper and substates can be treated directly. In the code below, calls to sub and unsub can be ignored. The Act type and the generic operations returnAct, thenAct and run are defined in the usual way. The definition of run uses seq to evaluate the initial state before beginning in order to prevent error values from being used as initial state values. (This is not essential, but in a real implementation, it allows compiler optimisations in which the state type is implicit rather than explicit.) In a real implementation, Act t s a would be a builtin abstract type, the three operations would also be builtin, with the types shown here.

newtype Act t s a = Act (State s -> (a, State s))

returnAct :: a -> Act t s a
returnAct x = Act (\st -> (x, st))

thenAct :: Act t s a -> (a -> Act t s b) -> Act t s b
thenAct (Act f) g = Act (\st -> let (x, st1) = f st; Act h = g x in h st1)

run :: Eval s => s -> (forall t . Act t s a) -> a
run v act = seq v (fst (f (S v []))) where Act f = act

A reference is implemented as an integer for simplicity, used as an index into the list of substates. The Ref type is tagged with t and u tags respectively, so that actions on both the main state and the substate can be encapsulated safely. The implementation of new uses seq to prevent error values being used as initial substates, adds the new substate to the end of the list of substates, carries out the actions which use the new reference and then removes the final version of the substate. The at function replaces the relevant substate by the result of applying the given action on it. In a real implementation, Ref t u s and the two
operations would be builtin; the types would be the same, with the omission of the Substate restriction.

data Ref t u s = R Int

new :: Substate s1 => s1 -> (forall u . Ref t u s1 -> Act t s a) -> Act t s a
new v1 f = Act g
  where g (S v ss) = seq v1 (x, S v' (init ss'))
          where (x, S v' ss') = h (S v (ss ++ [sub (S v1 [])]))
                Act h         = f (R (length ss))

at :: Substate s1 => Ref t u s1 -> Act u s1 a -> Act t s a
at (R r) (Act f) = Act g
  where g (S v ss) = (x, st')
          where subst       = unsub (ss !! r)
                (x, subst') = f subst
                st'         = S v (take r ss ++ [sub subst'] ++ drop (r+1) ss)

This completes the generic facilities in the prototype. A number of examples can be demonstrated by extending the prototype.
4 Examples of State Types
In this section we demonstrate the way in which different state types can be developed in a modular way and discuss some of the issues that arise.

4.1 Values as States
State types are usually special abstract types representing updatable data structures or interfaces to procedural facilities. However, it is often convenient to carry ordinary values around in a stateful way, e.g. as substates. This can be illustrated by extending the prototype, showing how to treat Int as a state type. The get and put actions get and put an Int value held in the StInt state type ("stateful Int"), and inc increments it.

newtype StInt = St Int

get :: Act t StInt Int
get = action f where f (St n) = (n, St n)

put :: Int -> Act t StInt ()
put m = action f where f (St n) = ((), St m)

inc :: Act t StInt ()
inc = get >>= \n -> put (n+1)

action :: (s -> (a,s)) -> Act t s a
action f = Act g where g (S v ss) = (x, S v' ss) where (x, v') = f v
The function action converts an ordinary function of a suitable type into an action on the corresponding state type. It is a convenience function included just for the purposes of the prototype, and we will also use it in later examples (it should not be made available in a real implementation because it is unsafe). The following small example creates an StInt main state and an StInt substate of that main state.

eg = run (St 20) (new (St 21) (\r ->
       at r inc >> at r get >>= \n ->
       get >>= \m -> return (m+n)))

In the prototype, the example is restricted to specific types such as StInt in order to insert them individually into the substate union. In a real implementation, the facilities shown here can be polymorphic, using newtype St a = St a. The St a type is used rather than just a to prevent illegal access to, e.g., I/O states. The get action cannot be used during a normal I/O sequence because the type of get is not compatible with I/O actions. At best it has type Act t (St RealWorld) RealWorld. It can then only be used as part of an action sequence with state type St RealWorld, and that sequence must have been started up with run or new. The lack of any useful values of type RealWorld with which to initialise run or new prevents such a sequence. As already explained, rank-2 polymorphic types safely encapsulate actions on the state thread by forcing actions to be polymorphic in a separate type variable t. Actions on substates are encapsulated as well by adding an additional polymorphic type variable u. The following illegal example illustrates that rank-2 polymorphic types enable the secure encapsulation of state thread accesses.

bad = run 0 (put 41 >> new 0 (\r -> return r) >>= \r1 -> return r1)

The second argument to new, i.e. the expression (\r -> return r), is of type Ref t u s1 -> Act t s (Ref t u s1). This is disallowed because the tag u is not polymorphic, since it appears in both the Ref and Act types. The second argument to run also breaks the rank-2 polymorphic type restriction. Similarly, expressions such as the following are illegal, and thus references are prevented from being imported into run or new.

... \r -> run (... at r act ...)
... \r -> new (\r1 -> ... at r act ...)

Another useful facility along similar lines would be to convert any ordinary stream function f :: [X] -> [Y] into a monadic form ([X] -> [Y]) -> Act t s a, for some suitable s, where items can be both given to f and taken from f one at a time. This generalises the emulation of Dialogues in terms of IO as described by Peyton Jones and Wadler [13]. There is a well-known builtin implementation which avoids space leaks.
4.2 Arrays
As another example, we show here how to provide incremental arrays. They are described as state types in their own right, not via references. The implementation shown here uses Haskell's monolithic array type Array i e as the state type for illustration, where i is the index type and e is the element type. In practice, a built-in implementation would be provided with a separate state type, supporting update-in-place. A minimal set of facilities might be:

getItem :: Int -> Act t (Array Int Int) Int
getItem n = action f where f arr = (arr ! n, arr)

putItem :: Int -> Int -> Act t (Array Int Int) ()
putItem n x = action f where f arr = ((), arr // [(n,x)])

The getItem function finds the array element at a given position and putItem updates the element at a given position with a given value. If an array is used as a substate via reference r, then its elements are accessed using calls such as at r (getItem n) or at r (putItem n v). Once again, specific index and element types are given for illustration in the prototype. The getItem and putItem actions can be more general in a real implementation.
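As a small usage sketch of our own (not taken from the paper), an action that swaps two elements of an array substate through its reference might be written as follows, in terms of the getItem and putItem actions above (the Substate constraint required by the prototype is omitted):

-- Swap the elements at positions i and j of the array substate behind r.
swap :: Ref t u (Array Int Int) -> Int -> Int -> Act t s ()
swap r i j = at r (getItem i) >>= \x ->
             at r (getItem j) >>= \y ->
             at r (putItem i y) >>
             at r (putItem j x)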
4.3 File Handling
For handling files piecemeal rather than as a whole, a state type File is introduced to represent an open file.

newtype File = F String

getC :: Act t File Char
getC = action f where f (F (c:cs)) = (c, F cs)

putC :: Char -> Act t File ()
putC c = action f where f (F s) = ((), F (s++[c]))

eof :: Act t File Bool
eof = action f where f (F s) = (null s, F s)

contents :: Act t File String
contents = action f where f (F s) = (s, F s)

This is intended to be used as a substate of an input/output state.

4.4 Input and Output
In this section, we show the implementation of some common I/O facilities using the state-splitting ideas described so far. Issues such as error handling are ignored
and facilities involving true concurrency are left until later. We treat I/O actions as actions on a global World state consisting of a number of files, each with an indication whether it is currently open. The type Handle is regarded as a reference to a File substate. This is like the corresponding standard Haskell type, except that in our setting it has two tag variables attached, one representing the main state and the other the File substate.

newtype World   = W [(FilePath, File, Bool)]
type IO t a     = Act t World a
type Handle t u = Ref t u File
data IOMode     = ReadMode | WriteMode | AppendMode
The openFile function is one which creates a new File substate of the I/O state, returning a handle as a reference to it, properly encapsulated. Its implementation ensures that one cannot open a file that does not exist for reading. If a file is opened for writing, its contents are set to the null string.

openFile :: FilePath -> IOMode -> (forall u . Handle t u -> IO t a) -> IO t a
openFile name mode io = Act g
  where g (S w ss) =
          if isLocked w name then (error "locked", S w ss)
          else (x, S w2 (init ss'))
          where (x, S w1 ss') = h (S (lock w name) (ss ++ [sub (S file [])]))
                S file' []    = unsub (last ss')
                Act h         = io (R (length ss))
                file          = if mode == WriteMode then F "" else getFile w name
                w2            = unlock (putFile w name file') name

getFile :: World -> FilePath -> File
getFile w@(W ps) f = let i = find f w in thrd3 (ps !! i)
...

The getFile and putFile functions get and put the contents of a file with a given name. The definition of putFile is similar to that of getFile above. The lock, unlock and isLocked functions set, unset and test the boolean flag associated with a file. In this prototype, a file cannot be reopened while already open, even for reading; in a real implementation, multiple readers could be allowed. The openFile function can be thought of as a specialised form of the new function. Instead of creating a new substate from nothing, it uses a portion of the main state (a file) to form a new substate. While the substate is in use, the corresponding part of the main state must be locked in whatever way is necessary to ensure that the main state and the substate are independent. For example, the following is not allowed:

openFile "data" (\handle -> ... >> openFile "data" ...)
When openFile returns, the actions on the substate may not necessarily have been completed, since they are driven by a separate demand. However, an action which re-opens the same file after this acts as a synchronisation mechanism, forcing all the previous actions on the substate to be completed. For example, the following is allowed:

openFile "data" (\handle -> ...) >> openFile "data" ...

The synchronisation may mean reading the entire contents of a file into memory, for example. In detecting when the same file is re-opened, care has to be taken over filename aliasing. This is where the operating system allows the same physical file to be known by two or more different names, or to be accessed via two or more different routes. The readFile function now opens a file and extracts all its contents as a lazy character stream:

readFile :: FilePath -> IO t String
readFile name = openFile name ReadMode (\handle -> at handle getAll)

getAll :: Act u File String
getAll = eof >>= \b ->
         if b then return ""
         else getC >>= \c -> getAll >>= \s -> return (c:s)

This contrasts with the approach of Launchbury and Peyton Jones [11] in which unsafe generic interleaving is used. Here, we rely on the fact that openFile is implemented as a safe state-splitting primitive. Consider the following program in standard Haskell:

writeFile "file" "x\n" >> readFile "file" >>= \s ->
writeFile "file" "y\n" >> putStr s

One might expect the final putStr to print x as the result. However, the contents s of the file are read in lazily and are not needed until putStr is executed, which is after the second writeFile. On all the Haskell systems tested at the time of writing, putStr prints y, or a file locking error occurs (because the file happens to be still open for reading at the time of the second writeFile). This example is not concurrent; there are two state threads, but only one execution thread. The behaviour of the program is unacceptable in a concurrent setting because of the dependency of the behaviour on timings. Under the state splitting approach, this program does indeed give deterministic results. The second writeFile forces the old version of the file to be completely read into memory (or, in a cleverer implementation, moved to a temporary location so that it is not affected by the writeFile).
5 Concurrency
The state-splitting proposals we have discussed are independent of the concurrency issue. However, concurrency was the main motivation for studying the problem, so we show briefly here how concurrency is implemented. A further generic primitive fork can be introduced. This is the concurrent counterpart of at and has the same type signature:

fork :: Ref t u s1 -> Act u s1 a -> Act t s a
The call fork r f has the same effect as at r f, carrying out an action on a substate of the main state while leaving the remainder of the main state undisturbed. However, fork creates a new execution thread to drive the action f by evaluating the final substate it produces. This new demand is additional to any other demands on f via its result value, and the new thread executes independently of, and concurrently with, the subsequent actions on the main state, as illustrated in Fig. 1. Under the deterministic concurrency approach, a processor can run one or more I/O-performing operations without introducing any user-visible non-determinism. As an example of the use of fork, here is a lazy version of the standard Haskell function writeFile. The standard version, unlike readFile, is hyperstrict in that it fully evaluates the contents and writes them to the file before continuing with the main state actions. The lazy version shown here evaluates the contents and writes them to the file concurrently with later actions. A separate execution thread is required to drive this, otherwise there might be no demand to cause it to happen.

writeFile :: FilePath -> String -> IO t ()
writeFile name s = openFile name WriteMode (\handle -> fork handle (putAll s))

putAll :: String -> Act u File ()
putAll s = if null s then return ()
           else putC (head s) >> putAll (tail s)

In general, a concurrent program can be viewed as having multiple independent demands from the outside world. Each external device or service such as a file or window implicitly produces a continuous demand for output. Care needs to be taken in the implementation of concurrency if referential transparency and determinism are to be preserved. Output devices or services are regarded as substates of the outside world and their independence must be ensured. In addition, the multiple execution threads may cause independent demands on inputs to the program. To ensure fairness, a centralised approach to I/O is needed where all the inputs are merged, in the sense that the first item to arrive on any input wakes up the appropriate execution thread. However, this merging is not made available to the programmer at the language level and so cannot be used to obtain non-deterministic effects.
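As a further illustration of our own (not taken from the paper), the lazy readFile of Sect. 4.4 and the lazy writeFile above compose into a file copy in which reading and writing proceed concurrently, each driven by its own demand:

-- Copy one file to another; the contents are read and written lazily,
-- with the forked thread inside writeFile driving the output.
copyFile :: FilePath -> FilePath -> IO t ()
copyFile from to = readFile from >>= \s -> writeFile to s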
6 Related Work
Previous work on purely functional operating systems (see Holyer and Carter [8] for a detailed overview) faced the problem of adopting non-deterministic merge operators, which destroy referential transparency. Alternatively, Stoye [16] proposed a model, adopted also by Turner [17], in which functions communicate by sending messages to a central service, called the sorting office, to which non-determinism is confined. Although individual programs remain referentially transparent, the sorting office itself is non-deterministic, and so is the system as a whole. Kagawa [9] proposed compositional references in order to bridge the gap between monadic arrays [14] and incremental arrays. Compositional references can also be viewed as a means of allowing multiple state types so that a wide range of mutable data structures can be manipulated. On the other hand, they concern only internal facilities and not I/O. The Clean I/O system [1] provides a mechanism similar to state splitting, called the multiple environment passing style. This is only possible by using the uniqueness type system of Clean, whereas state splitting does not require any extension to the usual Hindley-Milner type system apart from the second-order polymorphism mechanism to ensure safety.
7 Conclusions and Future Work
Deterministic concurrency emerged from the need to provide a purely declarative model of concurrency, without non-deterministic operators. This model has proven expressive enough to build a single-user multitasking system [3] and a deterministic GUI [15] without any non-deterministic feature being available to the programmer. In these systems, concurrent threads do not communicate via shared state, which would immediately introduce non-determinism, but only via result streams. In this paper we presented the Brisk monadic framework, a proposal which adapts the usual monadic style of interfacing to accommodate a deterministic form of concurrency. Its main innovation has been to introduce state splitting, a technique which assigns, to each new state thread, a substate to act upon. Deterministic concurrency arises as a natural extension of this monadic framework. In the Brisk compiler, to complement the state splitting approach, we are investigating a freer form of communication via a safe form of message passing with timed deterministic merging of messages from different sources. This is particularly important for an object-oriented, event-driven style of programming. As another application of deterministic concurrency we are investigating distribution [7]. Within the semantic framework that has been developed for deterministic concurrency, distribution is almost an orthogonal issue. Systems can be distributed in a trustable way, without any change to the meaning of programs – the only observable difference is the speed at which the overall system generates results. Computation, including both code and data, can be moved in response to demand.
A further question that arises in the Brisk monadic framework is the impact of using the lazy version of thenST with I/O. It affects the efficiency of I/O and disallows infinite I/O, and its space behaviour needs to be investigated. On the other hand, it is essential in our setting, e.g. for writing a functional definition that reads the contents of a file lazily, which we consider more important.
References

1. P. Achten and M.J. Plasmeijer. The Ins and Outs of Clean I/O. Journal of Functional Programming, 5(1):81–110, 1995.
2. L. Augustsson, B. Boutel, et al. Report on the Functional Programming Language Haskell, Version 1.4, 1997.
3. D. Carter. Deterministic Concurrency. Master's thesis, Department of Computer Science, University of Bristol, September 1994.
4. S.O. Finne and S.L. Peyton Jones. Programming Reactive Systems in Haskell. In Proc. 1994 Glasgow Functional Programming Workshop, Ayr, Scotland, 1994. Springer-Verlag WiCS.
5. S.O. Finne and S.L. Peyton Jones. Composing Haggis. In Proc. 5th Eurographics Workshop on Programming Paradigms in Graphics, Maastricht, September 1995.
6. S.L. Peyton Jones, A.D. Gordon, and S.O. Finne. Concurrent Haskell. In Proc. 23rd ACM Symposium on Principles of Programming Languages (POPL '96), pages 295–308, St Petersburg Beach, Florida, January 1996. ACM Press.
7. I. Holyer, N. Davies, and E. Spiliopoulou. Distribution in a Demand Driven Style. In 1st International Workshop on Component-based Software Development in Computational Logic, September 1998.
8. I. Holyer and D. Carter. Concurrency in a Purely Declarative Style. In J.T. O'Donnell and K. Hammond, editors, Proc. 1993 Glasgow Functional Programming Workshop, pages 145–155, Ayr, Scotland, 1993. Springer-Verlag WiCS.
9. K. Kagawa. Compositional References for Stateful Functional Programs. In Proc. 1997 ACM International Conference on Functional Programming (ICFP '97), Amsterdam, The Netherlands, June 1997. ACM Press.
10. J. Launchbury. Lazy Imperative Programming. In ACM SIGPLAN Workshop on State in Programming Languages, June 1993.
11. J. Launchbury and S.L. Peyton Jones. Lazy Functional State Threads. In Proc. 1994 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '94), Orlando, 1994. ACM Press.
12. S.L. Peyton Jones and J. Launchbury. State in Haskell. Lisp and Symbolic Computation, 8(4):293–341, 1995.
13. S.L. Peyton Jones and P.L. Wadler. Imperative Functional Programming. In Proc. 20th ACM Symposium on Principles of Programming Languages (POPL '93), Charleston, January 1993. ACM Press.
14. P.L. Wadler. Comprehending Monads. In Proc. 1990 ACM Conference on Lisp and Functional Programming (LFP '90), pages 61–78, June 1990.
15. P. Serrarens. BriX – A Deterministic Concurrent Functional X Windows System. Master's thesis, Computer Science Department, Bristol University, June 1995.
16. W.R. Stoye. A New Scheme for Writing Functional Operating Systems. Technical Report 56, Cambridge University Computer Laboratory, 1984.
17. D. Turner. Functional Programming and Communicating Processes. In J.W. de Bakker, A.J. Nijman, and P.C. Treleaven, editors, PARLE II, volume 259 of LNCS, pages 54–74, Eindhoven, The Netherlands, June 1987. Springer-Verlag.
18. P.L. Wadler. The Essence of Functional Programming. In Proc. 19th ACM Symposium on Principles of Programming Languages (POPL '92), Albuquerque, New Mexico, USA, January 1992. ACM Press.
A Strategic Profiler for Glasgow Parallel Haskell

David J. King1, Jon Hall1, and Phil Trinder2

1 Faculty of Maths and Computing, The Open University, Walton Hall, Milton Keynes MK7 6AA
{d.j.king, j.g.hall}@open.ac.uk, http://mcs.open.ac.uk/{djk26,jgh23}
2 Department of Computing and Electrical Engineering, Heriot-Watt University, Riccarton, Edinburgh EH14 4AS
[email protected], http://www.cee.hw.ac.uk/~trinder/
Abstract. Execution profiling plays a crucial part in the performance-improving process for parallel functional programs. This paper presents the design, implementation, and use of a new execution time profiler (GranSim-SP) for Glasgow Parallel Haskell (GpH). Parallelism is introduced in GpH by using evaluation strategies, which provide a clean way of co-ordinating parallel threads without destroying a program's original structure. The new profiler attributes the cost of evaluating parallel threads to the strategies that created them. A unique feature of the strategic profiler is that the call-graph of evaluation strategies is maintained, allowing the programmer to discover the sequence of (perhaps nested) strategies that were used to create any given thread. The strategic profiler is demonstrated on several examples, and compared with other profilers.
1 Introduction
Profiling tools that measure and display the execution behaviour of a program are well established as one of the best ways of tracking down code inefficiencies. Profiling tools are particularly useful for very high-level languages, where the relationship between source program and execution behaviour is far from apparent to a programmer, compared to conventional languages like C, for instance. Moreover, parallelism and laziness in a language magnify the disparity between source program and execution behaviour by the sheer number of activities carried out in a chaotic order. With all this in mind, a new profiler has been developed (which we're calling GranSim-SP) for the non-strict parallel functional language Glasgow Parallel Haskell (GpH), an established parallel derivative of Haskell. The profiler allows for a direct correspondence between source program and a visualisation of its execution behaviour. This is done by allowing programmers to augment the parallel co-ordination parts of their code (known as evaluation strategies) with labels. Each label corresponds to a unique colour in a graphical visualisation of a program's execution behaviour, thus allowing the threads created by different strategies to be distinguished.
This paper has the following structure. In Sect. 2, GpH and evaluation strategies are briefly explained. Then in Sect. 3 a description is given of the GpH runtime system for GranSim that GranSim-SP is built on top of. In Sect. 4, the new combinator for annotating strategies is given, and a direct comparison is made with the GranSim profile in the previous section. In Sect. 5 the implementation of GranSim-SP is described, and in Sect. 6 motivating examples are given justifying why GranSim-SP is useful. In Sect. 7 GranSim-SP is compared and contrasted with GranSim-CC, another profiler for GpH, and some imperative profilers are described. Finally, Sect. 8 discusses the paper and gives future work.
2 Glasgow Parallel Haskell
Glasgow Parallel Haskell provides parallelism by adding to Haskell a combinator for parallel composition, par, that is used together with sequential composition, seq, to control parallelism. The denotational behaviour of these combinators is described by the following equations:

x ‘seq‘ y = ⊥,  if x is ⊥
          = y,  otherwise

x ‘par‘ y = y

Operationally, the combinator seq arranges for its arguments to be evaluated in sequence, x then y, and par arranges for its arguments to be evaluated in parallel, x on a separate processor from y. These combinators are explained in more detail elsewhere, for example, see Loidl [8].
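As a small illustrative sketch of our own (not taken from the paper), the usual idiom is to spark one subexpression with par while forcing another with seq, as in this parallel Fibonacci function:

fib :: Int -> Int
fib n
  | n < 2     = n
  | otherwise = x `par` (y `seq` x + y)  -- spark x, evaluate y, then combine
  where
    x = fib (n - 1)
    y = fib (n - 2)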
2.1 Evaluation Strategies
Adding par and seq combinators to programs can easily obfuscate the code, mixing computation with parallel co-ordination. In view of this, Trinder et al. [15] introduced evaluation strategies, which are lazy higher-order polymorphic functions that allow expressions to remain intact, but with suffixes that suggest their parallel behaviour. For example, an expression that maps a function over a list may be parallelised in the following way:

map f xs ‘using‘ strategy

The function strategy describes how to parallelise map f xs, and using is a composition function with the following definition:

using :: a -> Strategy a -> a
e ‘using‘ s = s e ‘seq‘ e

Evaluation strategies can be user-defined, but a lot of commonly occurring strategies are provided in a library module bundled with GpH.
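For concreteness, here is a sketch of the kind of definitions involved, following the usual formulation of evaluation strategies [15] rather than anything specific to this paper:

type Strategy a = a -> ()

rwhnf :: Strategy a              -- reduce to weak head normal form
rwhnf x = x `seq` ()

parList :: Strategy a -> Strategy [a]   -- apply a strategy to each element in parallel
parList strat []     = ()
parList strat (x:xs) = strat x `par` parList strat xs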
2.2 The Need for GranSim-SP
When parallelising a program there are a number of different choices that can be made. Evaluation strategies make it easier to focus on the parallel co-ordination routines, but a programmer still has to choose which expressions to parallelise, and which strategies to use. The purpose of GranSim-SP is to give direct feedback to the programmer on the effectiveness of different choices in strategies.
3 The Runtime System for GpH
The compiler for GpH that GranSim-SP has been built on top of is an extension of the Glasgow Haskell compiler, GHC (Peyton Jones et al. [11]). The compiler supports parallelism using the GUM runtime system (Trinder et al. [16]) and a parallel simulator called GranSim (Loidl [7,8]). Our strategic profiler GranSim-SP is an extension of GranSim, and hence its name. The runtime system that provides parallelism and parallel simulation is based on the sequential threaded runtime system for GHC that was developed to provide concurrency (Peyton Jones et al. [10]). Threads are best thought of as virtual processors. They are represented in the runtime system by TSOs (Thread State Objects), which hold a thread's registers together with various statistics about the thread. Threads are created by the par combinator, and die when they have completed their work. At any instant a thread will be in one of five states: running, runnable (awaiting resources to become available), blocked (awaiting the result of evaluating another thread), fetching (awaiting the arrival of a result from another processor), or migrating (moving from a busy processor to an idle processor).
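A possible encoding of these five states (ours, purely for illustration; the runtime system represents them differently inside each TSO):

data ThreadState = Running | Runnable | Blocked | Fetching | Migrating
  deriving (Eq, Show)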
3.1 Recording Events
The parallel simulator GranSim (Loidl [7,8]) records information in a log file about each thread when it changes from one state to another (see Fig. 1).

PE 18 [612386]: SCHEDULE  39
PE 17 [612288]: BLOCK     15
PE 17 [612398]: REPLY     1f
PE 17 [612410]: START     88
PE 18 [614185]: FETCH     39
PE 18 [614335]: SCHEDULE  16
PE  1 [614172]: FETCH     6a
PE  1 [614322]: SCHEDULE  68
PE 17 [614254]: RESUME(Q) 15
PE 20 [614481]: END       3c

Fig. 1. An extract from a GranSim log file.
Each line of the log file contains the processor number, the time (given in number of instructions), the state that a thread is changing to, followed by the thread number. Some other statistics about threads are also recorded, but they have been omitted here since they are of less interest to us.
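A hedged sketch of our own (not part of GranSim or its tools) of how one such log line could be parsed into its four fields:

import Data.Char (isDigit)

data LogEvent = LogEvent
  { pe     :: Int      -- processor number
  , time   :: Integer  -- time, in instructions
  , what   :: String   -- new state, e.g. "SCHEDULE", "BLOCK", "END"
  , thread :: String   -- thread identifier (printed in hex)
  }

-- Parse a line such as "PE 18 [612386]: SCHEDULE 39".
parseLine :: String -> LogEvent
parseLine s = case words s of
  ("PE" : p : t : e : tid : _) -> LogEvent (read p) (read (filter isDigit t)) e tid
  _                            -> error ("unrecognised log line: " ++ s)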
3.2 Graphical Visualisations of Log Files
Log files of programs can contain so much data that it is almost impossible to discover useful information from them without a tool, such as one that can give a graphical representation of the data. GranSim provides tools for doing this, the most useful of which shows how many threads there are, and what state they are in (running, runnable, fetching, blocked, or migrating) at every point in time during execution. An example visualisation of a GranSim log file is given in Fig. 2 for the following program.

Example 1.
main = print ((a, b) ‘using‘ parPair rnf rnf)
  where a = expA ‘using‘ stratA
        b = expB ‘using‘ stratB

Note in Example 1 that the definitions for expA, expB, stratA, and stratB have been left out for brevity, and that parPair rnf rnf is a strategy to reduce the components of a pair to normal form in parallel. The graph in Fig. 2 plots tasks (y-axis) against time (x-axis). Two colours are shown in the graph: black representing blocked threads, and grey representing running threads. Other colours representing runnable, fetching, and migrating do not appear in the profile because the simulation was done with GranSim-Light (which models an idealised machine with an unlimited number of processors and zero communication cost). Note that these graphical profiles are usually viewed in colour, with detail improving with size.
4   Attributing Threads to Strategies

With the basic par and seq combinators GranSim records the number of threads and the state they are in, but does not attribute the created threads to any part of the source program. There are some more advanced combinators, however, such as parGlobal (see Loidl [7,8]), that provide a limited capability for associating sparks with a numerical value which is carried through to the log files. With GranSim-SP a more sophisticated combinator is provided, namely markStrat, for annotating all the threads created by a strategy; it has the following type:

    markStrat :: String -> Strategy a -> Strategy a
[Figure: a GranSim overall activity profile of the parpair program (run with +RTS -bP -bp0); average parallelism = 12.6; up to 42 tasks are plotted against time, with bands for running, runnable, fetching, blocked, and migrating threads.]

Fig. 2. A GranSim profile for Example Program 1.
[Figure: a GranSim-SP profile of the parpair program (run with +RTS -bP -bp0); parallelism = 12.6; runtime = 42507 cycles; threads are plotted against time, with the running threads broken down by strategy: ROOT (14.1%), ROOT . A (43.2%), ROOT . B (42.8%), plus blocked, fetching, and runnable threads.]

Fig. 3. A GranSim-SP profile for Example Program 2.
The first argument of markStrat is the string label that is used to annotate all the threads created by its second argument, which must be a strategy. For example, taking the same program (Example 1), the strategies stratA and stratB can be annotated as follows:

Example 2.
    main = print ((a, b) ‘using‘ parPair rnf rnf)
      where
      a = expA ‘using‘ markStrat "A" stratA
      b = expB ‘using‘ markStrat "B" stratB

Three strategies are used in this program, and two have been annotated. The graph in Fig. 3 was generated from the log file for the above program. The strategies labelled A and B are distinguished by different colours (albeit shades of grey in this paper).

4.1   Recording the Relationships between Threads
In addition to recording a strategy name for each thread, GranSim-SP also records the child-to-parent relationships between threads. This information has several uses: one is to use the call-path for thread names; another, less obvious, use is to allow different call-paths of a strategy to be distinguished. But perhaps the most important use is when a thread blocks waiting for a child thread; in this case the blocked thread can trace its parentage, making it possible to discover where it was created in the source program. This is useful information, since using strategies that avoid thread blocking can often give speedups. Morgan and Jarvis [9] have also had success in adding similar relationships to sequential cost centre profiling. They record call-stacks of names that correspond to the function call-path of cost centres. The following is an example of a program where two strategies labelled F and G both use a further strategy labelled H.

Example 3.
    main = print ((f 32, g 16) ‘using‘ markStrat "parPair" (parPair rnf rnf))

    f x = map h [x, 2] ‘using‘ markStrat "F" (parList rnf)
    g x = map h [x, 3] ‘using‘ markStrat "G" (parList rnf)

    h x = x + sum (map loop (replicate 8 64) ‘using‘ markStrat "H" (parList rnf))
      where loop n = if n==0 then 0 else loop (n-1)

In Example Program 3 the thread relationships that are recorded are described in Fig. 4. The relationships are also used to build the names in the graphical profiles.
[Figure: the thread tree recorded for Example Program 3 — the ROOT thread at the top, two parPair threads below it, F and G threads below those, and groups of eight H threads at the leaves.]

Fig. 4. The thread structure recorded by GranSim-SP for Example Program 3.

See Fig. 5 for a per-thread activity profile, and Fig. 6 for an overall activity profile of Example Program 3. In Fig. 4 each node corresponds to a parallel thread. The arrows point upwards, since GranSim-SP only records a parent name for each thread. Maintaining the reverse parent-to-child relationship would require more storage space, and it can be reproduced anyway.

Per-Thread Activity Profiles. A per-thread activity profile plots threads against time, where the threads are coloured depending on their state and on the strategy they correspond to. The threads are also sorted by their creation time and by strategy. Fig. 5 is a per-thread activity profile for Example Program 3. Threads are in only two states in this example: running (shown in shades of grey) and blocked (shown in black). The two blocks of threads at the top of the graph represent the threads created by strategy H, and are named ROOT.parPair.F.H and ROOT.parPair.G.H, which describes the two call-paths of strategy H. Per-thread activity profiles are more useful with GranSim-SP than with GranSim because the threads are sorted by strategy and are labelled, making it easier to interpret the profile.

Overall Activity Profiles. An overall activity profile plots thread activity against time, and is perhaps the profile most often used with GranSim. Fig. 6 depicts an overall activity profile of Example Program 3. In an overall activity graph the threads are separated into the five thread states, with the running state plotted below the other states. With GranSim-SP the running state is broken down into differently coloured areas representing each strategy, and again these are sorted by strategy name. The advantage of an overall activity profile over a per-thread activity profile is that it is easier to see the amount of a particular type of activity (e.g. parallelism, or blocking) at any given point in time. The disadvantage is that it is not possible to see which threads are blocking at any point in time.
[Figure: a GranSim-SP per-thread activity profile of example3 (run with +RTS -H20M -bP -bp0); parallelism = 14.1; 38 threads are plotted against time (cycles), broken down as ROOT (3.2%), ROOT . parPair (1.4%), ROOT . parPair . F (3.2%), ROOT . parPair . F . H (44.6%), ROOT . parPair . G (3.7%), ROOT . parPair . G . H (43.9%), plus blocked threads.]

Fig. 5. A GranSim-SP per-thread activity profile for Example Program 3.
[Figure: a GranSim-SP overall activity profile of example3 (run with +RTS -H20M -bP -bp0); parallelism = 14.1; 39 threads are plotted against time (cycles), with the running state broken down as ROOT (3.2%), ROOT . parPair (1.4%), ROOT . parPair . F (3.2%), ROOT . parPair . F . H (44.6%), ROOT . parPair . G (3.7%), ROOT . parPair . G . H (43.9%), plus blocked, fetching, and runnable threads.]

Fig. 6. A GranSim-SP profile for Example Program 3.
5   Implementation
Our implementation of markStrat is a small addition to the on-going GpH project. From a user's point of view the function markStrat has been added, whose behaviour is to label threads with a given strategy name. When running a program the log files are the same as before (see Fig. 1), but with a name and parent identifier on START events. The parent identifier is the thread name (a unique hexadecimal number) of the parent thread. For example, the start of the log file for the program in Sect. 4.1 is given in Fig. 7:

    PE 0 [0]:     START   0  (ROOT, -1)
    PE 0 [10519]: START   1  (parPair, 0)
    PE 0 [10540]: START   2  (parPair, 0)
    PE 0 [11790]: BLOCK   0
    PE 0 [13632]: START   3  (G, 2)
    PE 0 [13653]: START   4  (F, 1)
    PE 0 [14040]: START   5  (F, 1)
    PE 0 [14563]: RESUME  0
    PE 0 [14693]: BLOCK   2
    PE 0 [14040]: START   6  (G, 2)
    PE 0 [15333]: BLOCK   0
    PE 0 [14765]: BLOCK   1
    PE 0 [17094]: START   7  (H, 4)
    PE 0 [17669]: START   8  (H, 4)
    PE 0 [18081]: START   9  (H, 5)
    PE 0 [18656]: START   a  (H, 5)

Fig. 7. A log file extract for Example Program 3 that is used by GranSim-SP.
GranSim-SP was implemented by changing the runtime system for GpH, and by writing new visualisation tools. What follows is a sketch of the implementation in the runtime system. First of all, extra fields were added to the data structures representing sparks and threads for (i) the strategy name and (ii) the parent identifier. Whenever markStrat is evaluated the accompanying label is recorded in the currently active thread's data structure (i.e. the TSO described in Sect. 3). When a spark is created, the strategy name and thread identifier of the current thread are saved in the spark data structure. At the point when sparks are turned into threads, the name and thread identifier held in the spark data structure are passed to the new thread; this thread identifier is that of the parent thread. Finally, the strategy name and parent identifier are output on the START event, which is the most convenient place for the visualisation tools. The overheads incurred by our implementation are negligible. Indeed, when simulating, no overheads are recorded in the log file. The implementation has not yet been tested on a real parallel machine, but again the overheads should be minor and consistent (i.e. the same overhead is incurred for each thread).
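Purely as an illustration of the information added, the extended structures can be pictured as the following Haskell records. The real runtime system stores these fields in C structures inside the spark pool and the TSO; all field names and the HeapAddr placeholder below are invented for exposition.

    type StrategyName = String
    type ThreadId     = Int           -- unique (hexadecimal) thread name
    type HeapAddr     = Int           -- placeholder for a heap pointer

    data Spark = Spark
      { sparkClosure :: HeapAddr      -- the expression to be evaluated
      , sparkName    :: StrategyName  -- (i)  strategy label set by markStrat
      , sparkParent  :: ThreadId      -- (ii) identifier of the sparking thread
      }

    data TSO = TSO
      { tsoId     :: ThreadId
      , tsoName   :: StrategyName     -- copied from the spark when the thread starts
      , tsoParent :: ThreadId         -- emitted, with the name, on the START event
      -- registers, stacks and statistics as before
      }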
6   Is GranSim-SP Useful?
The bottom line is: does GranSim-SP help a programmer to track down inefficiencies? A question like this can only be answered properly after many programmers have used GranSim-SP on many different applications over a reasonable period of time. So far, however, the signs are good. GranSim-SP has proved useful in understanding the behaviour of a number of non-trivial examples including: a Canny edge detector, a ray tracer for spheres, a noughts-and-crosses program, and a word-search program. The word-search program is a good example to look at closely, because it has been well studied by Runciman and Wakeling ([12], Ch. 10), and it uses non-trivial forms of parallelism. The program searches for a number of words in a grid of letters. The words may appear horizontally, vertically, or diagonally, and either forwards or backwards. The program is made parallel by producing different orientations of the grid in parallel, and then by searching these orientations in parallel. A GranSim profile of the (version 6) program produced by Runciman and Wakeling is given in Fig. 8. Originally, Runciman and Wakeling used the HBC-PP compiler, which is a version of the Chalmers HBC compiler that provides a parallel simulator. The profile in Fig. 8 is not identical to the original, since GranSim was used to produce it, but it is pleasingly similar, which demonstrates a consistency between the simulators GranSim and HBC-PP. The original soda program was rewritten to use strategies, and then profiled with GranSim-SP. Using GranSim-SP revealed that the peaks in Fig. 8 corresponded to searching the grid in the different orientations in turn, i.e. sequentially. This was perhaps not the intention of Runciman and Wakeling, but was caused by a subtlety in their code. By searching all the orientations in parallel it was possible to increase the parallelism and reduce the runtime by approximately 25%, as depicted in Fig. 9. One feature of our profiler that is particularly useful is the recording of parent/child relationships. A common inefficiency is that there is more blocking going on than there needs to be. Blocking is caused by a thread waiting for a result from a child thread, and this can sometimes be avoided by changing a program's evaluation order. Since the parent/child relationships between threads are recorded by GranSim-SP, it is possible to locate precisely where the blocking occurs.
7   Related Work

In this section a comparison is given with GranSim-CC, a parallel cost centre profiler, and some imperative profilers are discussed.

7.1   Comparing GranSim-SP with GranSim-CC
Independently, Hammond et al. [5] have been working on GranSim-CC, a parallel cost-centre profiler. Their work combines sequential cost centre profiling (Sansom and Peyton Jones [13]) with the Glasgow Parallel Haskell runtime system.
[Figure: a GranSim overall activity profile of the original soda-6 program (run with +RTS -bP -bp0); average parallelism = 98.9; runtime = 162008; up to 381 tasks are plotted against time, with bands for running, runnable, fetching, blocked, and migrating threads.]

Fig. 8. A GranSim profile for the original soda program.
[Figure: a GranSim-SP profile of the improved soda-7 program (run with +RTS -bp0 -bP); parallelism = 163.6; up to 891 threads are plotted against time (cycles), with the running state broken down as ROOT (0.2%), ROOT . d_||_dl_|_ul (1.1%), ROOT . find (0.3%), ROOT . find . dirs (1.1%), ROOT . find . dirs . (++) (0.9%), ROOT . find . dirs . (++) . back (47.9%), ROOT . find . dirs . forw (48.6%), plus blocked, fetching, and runnable threads.]

Fig. 9. A GranSim-SP profile for an improved version of the soda program.
GranSim-CC uses the set-cost-centre primitive _scc_ to annotate expressions. When an expression with _scc_ is evaluated, the given label is recorded in the log file. It is then possible to produce a profile that divides up activity by the cost of these expressions, and attributes it to the expression labels. An example of a GranSim-CC profile for Example Program 3 is shown in Fig. 10. This profile should be compared with Fig. 6, which is a GranSim-SP profile of the same example. The program had to be modified for GranSim-CC, which involved using _scc_ around expressions instead of markStrat on strategies. In this instance there is a similarity between the GranSim-CC and GranSim-SP profiles, which shows a consistency between the two systems. There is a difference, however, which is that GranSim-SP gives more detail because it is able to distinguish between two separate call-paths to the same function, whereas GranSim-CC is unable to. This highlights one advantage of GranSim-SP over GranSim-CC, which is that it records parent-to-child relationships between threads, and makes good use of them. GranSim-CC is also a much more complex system, and as noted by Hammond et al. [5], it has not been fully developed. The main advantage of GranSim-CC is that any expression can be annotated, whereas GranSim-SP only allows strategies to be annotated. Therefore, both systems are useful in their own right, and both can provide different and useful information.
[Figure: a GranSim-CC activity profile for example3.gr; average parallelism = 17.3; tasks are plotted against time (instructions), with activity attributed to the cost centres Main.h (90.8%), MAIN.MAIN (3.3%), Main.CAFs_in_... (2.1%), Main.f (1.8%), Main.g (1.8%), Main.DICTs_in_... (0.2%), PRELUDE.CAFs_in_... (0.1%), MAIN.DONT_CARE (0.0%), and ???.??? (0.0%).]

Fig. 10. A GranSim-CC profile for Example Program 3.
Lexical Scoping Versus Evaluation Scoping. Another difference between using markStrat in GranSim-SP and _scc_ in GranSim-CC lies in their scoping rules. For example, in the following expression:

    map f xs ‘using‘ _scc_ "A" strat

the cost of evaluating map f xs would be attributed to the enclosing cost centre, and not to A, because a form of lexical scoping is used with GranSim-CC (see Clack et al. [2] for a full explanation of lexical profiling). Essentially, lexical scoping attributes costs to the lexically enclosed expression. If markStrat were used in the above example instead of _scc_, then all the threads that are sparked when the strategy strat is applied to the map expression would be labelled. This form of scoping is evaluation scoping, and it is precisely the semantics one would want for markStrat. See the cost centre work of Sansom and Peyton Jones [13] for a discussion of different scoping rules; they found that a hybrid of lexical scoping was the most appropriate for the sequential cost centre profiling of Haskell.

7.2   Imperative Profiling Tools
There are several profiling tools available for imperative parallel languages. One example is Jumpshot, which is provided with mpich (Gropp and Lusk [3]), an implementation of the Message Passing Interface (MPI). Jumpshot is an interactive tool written in Java that animates a program's execution, displaying parallel time lines with process states (e.g. running, or communicating) for each processor. Another tool is XPVM (Kohl and Geist [6]), which provides a graphical monitor and console for PVM. XPVM is written in Tcl/Tk and is similar to Jumpshot in that it provides an interactive environment showing an animation of the activity on each processor. It is also possible to use XPVM to trace which instruction caused a particular activity. XPVM and Jumpshot differ from GranSim-SP in that they do not relate a section of program source to the displayed histogram. This is perhaps because with imperative parallel languages the evaluation order is static, and the parallelism is explicit, so it is easier to relate source to profile.
8   Discussion and Future Work

Discussion. This paper has described GranSim-SP, an addition to the available profiling tools for Glasgow Parallel Haskell. Its purpose is to help programmers understand, and thereby improve, their parallel Haskell programs. The implementation of the profiler was straightforward and fits comfortably into the runtime system for the parallel simulator GranSim. GranSim-SP enables the programmer to annotate threads, giving a direct correspondence between program and profile. The usefulness of GranSim-SP was demonstrated on several simple examples, as well as on some moderately sized ones.
Future Work. One obvious improvement is to get GranSim-SP working with the parallel runtime system; so far our efforts have been concentrated on the parallel simulator GranSim. There should not be any major obstacles to this. GranSim-SP has been tested on several examples, and this experimentation plays a crucial role in uncovering shortcomings and ideas for improvement. One improvement is to provide a compiler option to annotate strategies automatically. Automatic annotation has proved useful with sequential cost centre profiling, especially for large examples, where the programmer wants to get results quickly without having to make lots of annotations. The only real advantage of using markStrat instead of an automatic name generator is that the programmer is given the freedom to use any desired name, and he or she may want to use a common name for more than one strategy. One unique feature of the sequential cost centre profiling work of Sansom and Peyton Jones [14] is that they give a formal specification for attributing costs to cost centres, and show it equivalent to an operational semantics. The same could be done to justify the behaviour of markStrat. First of all, though, an operational semantics is needed to model GpH, which is far from trivial. Hall et al. [4] are currently working on such a semantics. Producing better visualisations of profiles is also of interest to us. The two forms of profile shown here, namely the per-thread activity profile and the overall activity profile, are useful, but why restrict ourselves to these? For example, it may be useful to mask away some strategies, or to zoom in on one part of a profile. This may be of particular value for large-scale parallel programs, where there are thousands of active threads. It would also be useful to provide the programmer with the actual call-graph. The call-graph information is currently provided in the log files, and it is used to generate the profiles, but the complete graph is not displayed. All the above features would best be provided by an interactive graphical interface like those provided by Jumpshot and XPVM. Charles and Runciman [1] have begun work on such a project for GpH. Their particular interest is to translate log files to a database format, and then to query the information. This high-level interactive approach appears to be a good way forward, and may be of use for profiling other languages, sequential or imperative.
Acknowledgements. This work has been supported by the Research Development Fund at The Open University under the project title APSET (A Parallel Software Engineering Tool). We would also like to thank everyone who has helped to improve our work, especially the GpH gurus Kevin Hammond and Hans-Wolfgang Loidl.
References

 1. N. Charles and C. Runciman. An Interactive Approach to Profiling Parallel Functional Programs. In this volume.
 2. C. Clack, S. Clayman, and D. Parrott. Lexical Profiling: Theory and Practice. Journal of Functional Programming, 5(2):225–277, Apr. 1995.
 3. W. Gropp and E. Lusk. User's Guide for mpich, a Portable Implementation of MPI. Mathematics and Computer Science Division, Argonne National Laboratory, University of Chicago, July 1998.
 4. J. G. Hall, C. Baker-Finch, P. Trinder, and D. J. King. Towards an Operational Semantics for a Parallel Non-Strict Functional Language. In this volume.
 5. K. Hammond, H.-W. Loidl, and P. Trinder. Parallel Cost Centre Profiling. In Proc. 1997 Glasgow Workshop on Functional Programming, Ullapool, Scotland, Sept. 1997.
 6. J. A. Kohl and G. A. Geist. XPVM 1.0 User's Guide. Oak Ridge National Laboratory, Nov. 1996.
 7. H.-W. Loidl. GranSim's User Guide. Department of Computing Science, University of Glasgow, 0.03 edition, July 1996.
 8. H.-W. Loidl. Granularity in Large-Scale Parallel Functional Programming. PhD thesis, Department of Computing Science, University of Glasgow, Mar. 1998.
 9. R. G. Morgan and S. A. Jarvis. Profiling Large-Scale Lazy Functional Programs. Journal of Functional Programming, 8(3), May 1998.
10. S. L. Peyton Jones, A. Gordon, and S. Finne. Concurrent Haskell. In Proc. 23rd ACM Symposium on Principles of Programming Languages (POPL '96), pages 295–308, St Petersburg Beach, Florida, Jan. 1996.
11. S. L. Peyton Jones, C. Hall, K. Hammond, W. Partain, and P. Wadler. The Glasgow Haskell Compiler: A Technical Overview. In Proc. UK Joint Framework for Information Technology Technical Conference (JFIT '93), pages 249–257, Keele, Mar. 1993.
12. C. Runciman and D. Wakeling. Applications of Functional Programming. UCL Press Ltd, 1995.
13. P. M. Sansom and S. L. Peyton Jones. Time and Space Profiling for Non-Strict, Higher-Order Functional Languages. In Proc. 22nd ACM Symposium on Principles of Programming Languages (POPL '95), pages 355–366, San Francisco, California, Jan. 1995. ACM SIGPLAN-SIGACT.
14. P. M. Sansom and S. L. Peyton Jones. Formally-Based Profiling for Higher-Order Functional Languages. ACM Transactions on Programming Languages and Systems, 19(1), Jan. 1997.
15. P. W. Trinder, K. Hammond, H.-W. Loidl, and S. L. Peyton Jones. Algorithm + Strategy = Parallelism. Journal of Functional Programming, 8(1):23–60, Jan. 1998.
16. P. W. Trinder, K. Hammond, J. S. Mattson, A. S. Partridge, and S. L. Peyton Jones. GUM: A Portable Parallel Implementation of Haskell. In Proc. 1996 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '96), pages 79–88, 1996.
Implementing Eden – or: Dreams Become Reality (*)

Ulrike Klusik (1), Yolanda Ortega (2), and Ricardo Peña (2)

(1) Philipps-Universität Marburg, D-35032 Marburg, Germany
    [email protected]
(2) Universidad Complutense de Madrid, E-28040 Madrid, Spain
    {yolanda,ricardo}@sip.ucm.es

(*) Work partially supported by German-Spanish Acción Integrada HA1997-0107 and the Spanish projects CAM-06T/033/96 and CICYT-TIC97-0672.
Abstract. The parallel functional programming language Eden was specially designed to be implemented in a distributed setting. In a previous paper [3] we presented an operational specification of DREAM, the distributed abstract machine for Eden. In this paper we go a step further and present the imperative code generated for Eden expressions and how this code interacts with the distributed RunTime System (RTS) for Eden. This translation is done in two steps: first Eden is translated into PEARL (Parallel Eden Abstract Reduction Language), the parallel functional language of DREAM, and then PEARL expressions are translated into imperative code.
1   Introduction
The parallel functional language Eden [6,5], with its explicit notion of process, represents an alternative to annotated parallel functional languages [14,16] and to the automatic parallelization of functional programs [9]. The advantages of having explicit processes have been extensively discussed elsewhere [4]. Moreover, Eden processes are autonomous closed entities producing results to be consumed by other processes without any notion of central manager or coordinator, or common shared memory. Thus, the language facilitates the programming of distributed systems. Correspondingly, Eden is specially suited for a distributed implementation. Actually, Eden extends the lazy functional language Haskell [10] with a coordination language to explicitly express parallelism. There exist already very good and efficient compilers for Haskell. Thus, we decided to use the Glasgow Haskell compiler (GHC) [12] as a basis for the implementation of the computational language, and to concentrate our efforts on the distributed implementation of the coordination features. Our first step towards an efficient distributed implementation for Eden has been the definition of DREAM (DistRibuted Eden Abstract Machine), consisting of a set of multithreaded abstract machines, each of which executes an Eden
process. In [3] a formal definition of the operational semantics of DREAM was given in terms of a transition system, in the same spirit as the STG machine description in [11]. Our next step has been the construction of a distributed runtime system (RTS), i.e. a compiled version of DREAM. The RTS is explained in [2]. The main contribution of this paper is the translation of each Eden feature into imperative code calling the RTS routines. This translation is done in two phases: first, Eden expressions are desugared and translated into a much simpler parallel functional language we call PEARL (Parallel Eden Abstract Reduction Language); then, PEARL expressions are translated into imperative code. We discuss in detail the interaction between the compiled code and the RTS by using scenarios describing the main activities taking place at runtime (i.e. how processes are instantiated, etc.).

Outline of the paper: we begin with a short presentation of the coordination features of Eden. Then, in Sect. 3, we explain the characteristics of PEARL and give translation schemes from Eden expressions into PEARL expressions. Sect. 4 contains a brief overview of the runtime system. Translation schemes from PEARL expressions into a sequence of RTS routine calls are given in the next section. Then, the most important scenarios are presented. We conclude with some references to related work.
2   An Overview of Eden
In this section, the language Eden is presented very briefly. For details, the interested reader is referred to [6] and [5]. Eden extends the lazy functional language Haskell with a set of coordination features aimed at expressing parallel algorithms.

Coordination Features of Eden. These are based on two principal concepts: explicit management of processes and implicit communication. Functional languages distinguish between function definitions and function applications. Similarly, Eden offers process abstractions, i.e. abstract schemes for process behaviour, and process instantiations for the actual creation of processes and of the corresponding communication channels. A process mapping input variables x1, ..., xn to output expressions exp1, ..., expk can be specified by the following process abstraction:

    process (x1, ..., xn) -> (exp1, ..., expk)
      where equation1 ... equationr

The output expressions can reference the input variables, as well as the auxiliary functions and common subexpressions defined in the optional where part. A process can communicate with other processes through tuples of channels and arbitrary data structures of channels. The latter are explained in [1]. A process instantiation expression has the form:

    (y1, ..., yk) = p # (exp1, ..., expn)
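As a toy illustration of these two constructs (the example is not taken from the paper; the names addMul, e1, and e2 are invented), a process delivering the sum and the product of its two inputs could be written and instantiated as follows:

    addMul = process (x, y) -> (s, p)
               where s = x + y
                     p = x * y

    (s, p) = addMul # (e1, e2)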
From now on, we will refer to the newly instantiated process as the child process, while the process where the instantiation takes place is called the parent. The process abstraction bound to p is applied to a tuple of input expressions, yielding a tuple of output variables. The child process uses k independent threads of control in order to produce these outputs. Correspondingly, the parent process creates n additional threads for evaluating exp1, ..., expn. Communication is unidirectional, from one producer to exactly one consumer. In order to provide control over where expressions are evaluated, only fully evaluated data objects are communicated. This also facilitates a distributed implementation of Eden. But, retaining the lazy nature of the underlying computational functional language, lists are transmitted in a stream-like fashion, i.e. element by element. Concurrent threads trying to access input that is not yet available will be suspended. This is the only way of synchronizing Eden processes.

New communication channels can be established at runtime by means of dynamic reply channels, which are specially useful for the description of reactive concurrent systems. A process may generate a new input channel by executing the following expression:

    new (varcn, varc) exp

where the value varc of the created channel can be used in the expression exp, and the channel name varcn can be sent to another process. A process receiving varcn can use it to output a value through the channel(1) referenced by this name by executing:

    varcn !* exp1 par(2) exp2

Before exp2 is evaluated, a new concurrent thread for the evaluation of exp1 is generated, whose result is transmitted via the received reply channel.

Nondeterminism constitutes a controversial point in language design. On the one hand, it is an essential feature for some applications but, on the other hand, it spoils the purity of functional languages. In Eden, nondeterminism is gently introduced by means of a predefined process abstraction merge, which is used to instantiate nondeterministic processes, each one fairly merging a list of input channels into a single list.

The following table summarizes Eden's coordination constructs:

    exp → process pattern -> exp where equations    process abstraction
        | exp1 # exp2                               process instantiation
        | new (varcn, varc) exp                     dynamic channel creation
        | varcn !* exp1 par exp2                    dynamic channel connection
        | merge                                     merge process abstraction

(1) To ensure that channels connect unique writers to unique readers, it is checked at runtime that no other process is already connected (see Sect. 5.3).
(2) This construction in Eden is not related to the homonymous function in [14].
The Evaluation Model for Eden Process Systems. The evaluation of an Eden process is driven by the evaluation of its output expressions, each of which constitutes an independent thread of execution. Although this overrules lazy evaluation to some extent, all top-level process instantiations are evaluated immediately, in order to speed up the generation of new processes and the distribution of the computation. Therefore, if the demand-driven evaluation of inner process instantiations hinders the unfolding of the parallel process system, the programmer has the possibility to lift them to the top level, where they are immediately evaluated. Nevertheless, the system evolution is still controlled by the demand for data, so that a process without output terminates immediately and its input channels are eliminated. This implies the abolition of the corresponding output channels in sender processes, and the propagation of process termination.
Example: The following Eden program specifies a parallel mergesort algorithm, where a parameter h controls the granularity by specifying the desired height of the process tree to be generated (h=0 for one process, h=1 for three processes, etc.):

    parMsort h = process xs -> ys
      where ys | h == 0 = seqMsort xs
               | h > 0  = ordMerge (parMsort (h-1) # x1, parMsort (h-1) # x2)
                   where (x1,x2) = unshuffle xs

    seqMsort []  = []
    seqMsort [x] = [x]
    seqMsort xs  = ordMerge (seqMsort x1, seqMsort x2)
      where (x1,x2) = unshuffle xs
Notice the difference between the parameter h, which must be known at instantiation time, and the input xs, i.e. the list of values to be sorted, which will be gradually provided during the execution of the process. Ignoring the height parameter h, the similarity between the sequential and parallel versions is remarkable. While the former makes two function applications —function seqMsort—, the latter produces two process instantiations —process abstraction parMsort. Sequential functions unshuffle (splitting a list into two by putting odd position elements into one list and even position ones into another) and ordMerge (merging two ordered lists into a single ordered list) are written in plain Haskell and are not shown here.
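For concreteness, one possible plain-Haskell definition of the two omitted helper functions is sketched below; this is only an illustration, since the authors' actual code is not shown in the paper.

    unshuffle :: [a] -> ([a], [a])
    unshuffle []       = ([], [])
    unshuffle [x]      = ([x], [])
    unshuffle (x:y:zs) = (x:xs, y:ys) where (xs, ys) = unshuffle zs

    ordMerge :: Ord a => ([a], [a]) -> [a]
    ordMerge ([], ys) = ys
    ordMerge (xs, []) = xs
    ordMerge (x:xs, y:ys)
      | x <= y    = x : ordMerge (xs, y:ys)
      | otherwise = y : ordMerge (x:xs, ys)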
3   First Translation Phase: Eden into PEARL

PEARL is the language of DREAM, the abstract machine for Eden. Before going into the details of the translation from Eden into PEARL, we give a brief description of the syntax of PEARL.
    prog  → binds
    binds → bind1 ; ... ; bindn                  n ≥ 1
    bind  → var = lf
    lf    → varsf \π varsa -> expr
    π     → u | n
    expr  → literal
          | constr atoms                         constructor application
          | primOp atoms                         primitive application
          | var atoms                            application
          | let binds in expr                    local definition
          | letrec binds in expr                 local recursive definition
          | case expr of alts                    case expression
    vars  → {var1, ..., varn}                    n ≥ 0, variable list
    atoms → {atom1, ..., atomn}                  n ≥ 0, atom list
    atom  → var | literal

    bind  → varso = varp # vari                  process instantiation
    lf    → varsf \proc varsa -> expr            process abstraction
    expr  → {|var1 || ... || vark|}              k ≥ 1, parallel expression
          | new {varcn, varc} vare               dynamic channel creation
          | write varcn vare1 vare2              dynamic channel use
    primOp → sendVal | sendHead | closeStrm      communication primitives

Fig. 1. Syntax of STGL (above) and PEARL extensions (below)
3.1   PEARL
PEARL extends the language of GHC, called STGL (Spineless Tagless G-machine Language), by incorporating constructs for the coordination features in Eden. A summary of the syntax of STGL, as it has been defined in [11], is presented in the upper part of Fig. 1, while PEARL extensions are shown in the lower part. An STGL program is a non-empty sequence of bindings of variables to lambda forms: varsf \π varsa -> expr, i.e. lambda abstractions with additional information about the free variables varsf appearing in the body, and a flag π indicating whether the corresponding closure should be updated or not after its evaluation. PEARL introduces a new binding expression for defining process instantiations: varso = varp # vari, where varso denotes the list of variables receiving the output of the process, varp is bound to a process abstraction, and vari is bound to the input expression. Process instantiations can only occur in the binding part of let and letrec expressions.
Lambda forms in PEARL can be either original STGL lambda forms defining functions, or process abstractions defining Eden processes. In the latter case, the arguments varsa represent the formal input variables of the process, and the process body is an expression which evaluates the tuple of output expressions. As outputs are evaluated in parallel, the whole body eventually evaluates to a parallel expression {|var1 || ... || vark|}, where each component expression will give rise to a different concurrent thread. Parallel expressions are also used when instantiating new processes, because the actual inputs for the child process are also evaluated by independent threads in the parent process. Both situations are reflected in the translation schemes of Sect. 3.2. Other PEARL expressions are introduced for dynamic channel handling: new for creating a channel and write for using it. In the expression for creating a channel, varcn and varc represent, respectively, the channel name and the channel value, and vare is bound to a lambda abstraction ChanName a -> a -> b expecting the channel name and the channel value to produce a result of type b. In the write expression, varcn corresponds to the channel name, and vare1 and vare2 are supposed to be bound, respectively, to the message expression and to the continuation expression.

At the PEARL level, communication is handled more explicitly. The primitive function sendVal is used to send a single value of any type, sendHead sends the next value of a stream, and closeStrm closes a stream. However, the actual process intercommunication remains implicit, i.e. the use of these primitives is not required to provide a particular channel or receiver. These primitive functions form the basis of the overloaded function sendChan, shown below, which transmits values on a channel. In the translation schemes of the next subsection, the compiler inserts an application of sendChan to any expression to be evaluated by a parallel thread. The class Transmissible contains all the types whose values can be transmitted from one process to another. The compiler provides derived instances of this class for any user-defined type.

    class NFData a => Transmissible a where
      sendChan :: a -> ()
      sendChan x = rnf x ‘seq‘ sendVal x                 -- default definition

    instance Transmissible [a] where
      sendChan []     = closeStrm
      sendChan (x:xs) = rnf x ‘seq‘ sendHead x ‘seq‘ sendChan xs

sendChan uses the overloaded function rnf :: a -> () to reduce elements to normal form. This is defined in the implementation of Glasgow Parallel Haskell [14]. The class definition follows:

    class Eval a => NFData a where
      rnf :: a -> ()
      rnf a = a ‘seq‘ ()                                 -- default definition

The function seq :: a -> b -> b evaluates its first argument to head normal form and then returns its second argument.

    x ‘seq‘ y = case x of _ -> y
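As an aside, for a user-defined data type the corresponding NFData instance simply forces all components. A sketch follows; the type Tree and its instance are invented here purely as an illustration of the class just shown.

    data Tree a = Leaf | Node (Tree a) a (Tree a)

    instance NFData a => NFData (Tree a) where
      rnf Leaf         = ()
      rnf (Node l x r) = rnf l ‘seq‘ rnf x ‘seq‘ rnf r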
3.2   Translating Eden Expressions into PEARL
We present the translation schemes for the new expressions introduced by Eden with respect to Haskell, in the form of a recursive function tr1 :: Eden → PEARL. For the rest of the computational language, i.e. normal Haskell expressions, this function provides the original Haskell to STGL translation, and it is not shown here. Process Abstraction def tr1 ( process (x1 , . . . , xn) -> e where eqns) = let f = {frees} \n {x1 , . . . , xn} -> tr1 (let eqns in e) in let p = {f} \proc {x1 , . . . , xn } -> case f {x1 , . . . , xn } of (y1 , . . . , yk ) -> let v1 = {y1 } \ u {} -> sendChan y1 in ··· let vk = {yk } \ u {} -> sendChan yk in {|v1 || . . . ||vk |} in p We assume the type of the process abstraction to be Process (t1 , . . . , tn ) (t01 , . . . , t0k ) and frees to be the free variables in the body e and in the auxiliary equations eqns. The whole expression reduces to a variable p bound to a process abstraction which evaluates its outputs in parallel. The body is just a lambda abstraction applied to the inputs. Process Instantiation def
tr1 ( e1 # e2 ) = letrec p = {frees1 } \u {} -> tr1 (e1 ) v = {frees2 } \u {} -> case tr1 (e2 ) of (x1 , . . . , xn ) -> let v1 = {x1 } \ u {} -> sendChan x1 in ··· let vn = {xn } \ u {} -> sendChan xn in {|v1 || . . . ||vn |} (y1 . . . , yk ) = p # v in (y1 , . . . , yk ) We assume the type of e1 to be Process (t1 , . . . , tn ) (t01 , . . . , t0k ), the type of e2 to be (t1 , . . . , tn ), and freesi to be the free variables in ei (i = 1, 2). The translation specifies that a process abstraction expression p will be evaluated and that its inputs will be fed in parallel with the result of evaluating the tuple e2 . The result of the whole evaluation is the tuple of outputs produced by the instantiated process. Actually, process instantiations are not translated one by one. Instead, they are collected in a single letrec expression. This feature is crucial for the eager distribution of the computation. Dynamic Channel Creation and Use def
tr1 ( new (cn, c) e) = let v = {frees}\n {cn, c} -> tr1 (e) in new {cn0 , c0 } v
def
tr1 ( cn !* e1 par e2 ) = let v1 = {frees1 } \u {} -> tr1 (e1 ) v2 = {frees2 } \u {} -> tr1 (e2 ) in write cn v1 v2 The translation in these two last cases is quite straightforward and requires little explanation. As usual, frees, frees1 and frees2 are the free variables in their respective expressions. Let us note, in the channel creation case, that the free variables cn and c in Eden’s expression are changed to lambda-bound variables in the translation. This facilitates the further translation of this expression into imperative code, as we only have to apply v to the fresh variables cn0 and c0 .
4   A Distributed Runtime System for Eden

The environment to execute an Eden program consists of a set of distributed abstract machines, each machine executing an instance of an Eden process abstraction. DREAM refers to the whole set of machines, while each single abstract machine is referred to as a Dream. Recall that an Eden process consists of one or more concurrent threads of control. These evaluate different output expressions which are independent of each other, and they use a common heap that contains shared information. The destination of the value produced by a thread is called an outport. In essence, an outport is a global address consisting of a destination processing element together with a unique local identifier in this processor, called an inport. The state of a Dream includes information common to all threads, and the individual state of each thread. As input is shared among all threads, the shared part of a Dream includes an inport table as well as the heap. The inport table maps inport identifiers to the respective heap addresses where received messages are to be stored. These heap addresses point to queueMe closures, which represent messages that have not yet been received. Should a thread enter a queueMe closure, it is blocked in a queue where all threads needing the not yet received value wait until it arrives. The state of a thread comprises the state of a sequential abstract machine and a specification of the associated outport referencing the connected inport. Therefore, the interface of a process is visible in the following way: the inports are enumerated in the inport table and the outports are associated with threads (either runnable or blocked). Communication channels are only represented by inports and outports which are connected to each other. There are no explicit channel buffers. Instead, the data is transferred from the producer's heap to the consumer's heap, using the inport table to find the location where the data should be stored. Let us now obtain a compilable version of our DREAM abstract machine in terms of a distributed runtime system consisting of a set of processor elements (PE), each containing a pool of runnable threads, each represented by a heap-allocated Thread State Object (TSO) containing the thread's registers and pointers to the heap-allocated argument and return stacks.
Mapping of Several Dreams to One PE: As we allow more Eden processes than there are PEs available, several Eden processes need to share a PE. We can easily map several Dreams onto one abstract machine by joining their interface representations, threads, heaps, and code. This enables a cheap dynamic creation of Eden processes, whereas the process system of the underlying message passing system is static: one abstract machine process per available PE.

Runtime Tables: In order to be able to reference inports and outports of other PEs, we need immutable global references of the form (PEId, Id), where Id is a locally unique identifier. We cannot simply use the heap addresses of the queueMe closures and TSOs — for inports and outports, respectively — because they may change under garbage collection. Therefore, each PE keeps one inport and one outport runtime table, where the current heap addresses are stored, and indexes into these tables are used as local Ids. For termination purposes, the global reference to the sender thread is saved, together with the queueMe closure address, in the inport table, while outport identifiers are mapped to TSOs. As several Eden processes share the same inport and outport tables, each PE also needs a separate process table, with the lists of the current inports and outports of each process.

Communication Unit and Protocol: The communication unit is responsible for communicating with other PEs, providing sending and receiving routines for the following messages (a sketch modelling them as a data type is given after this list).

Data messages concerning the communication of values through channels:
– VAL for sending a value over a one-value channel.
– HEAD for sending an element of a stream.
– CLOSE-STRM for closing a stream.

System messages implying modifications of the runtime tables of a remote PE. Messages of this kind always include the identification of the sending PE, to be used appropriately at the receiver PE.
– CREATE-PROCESS: Sent by the parent to a remote PE(3) to instantiate a new process there. The process abstraction and the new local inports and outports of the parent are included in the message.
– ACK: The response to a CREATE-PROCESS message, acknowledging the creation of the child process and including information to properly connect the inports and outports of the child in the parent process.
– DCH: When a process uses a dynamic channel, it must communicate the new connection to the creator of the channel.
– TERM-THREAD: For terminating a remote thread, represented by its associated outport.

(3) The specific remote PE will be decided by the RTS of the parent PE, depending on the global scheduling and load balancing policies.
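The message protocol just described can be modelled, purely for illustration, by the following Haskell data type; the field layouts follow the textual descriptions above and in Sect. 5.1, but all type names are invented, and the real RTS encodes these messages as low-level packets rather than Haskell values.

    type PE      = Int     -- processor element identifier
    type Inport  = Int     -- local inport index
    type Outport = Int     -- local outport index
    type Packed  = ()      -- placeholder for a packed value or closure

    data Message
      = Val           Inport Packed                          -- one-value channel
      | Head          Inport Packed                          -- next stream element
      | CloseStrm     Inport                                 -- close a stream
      | CreateProcess PE Packed Outport [Inport] [Outport]   -- abstraction + parent ports
      | Ack           PE Outport [Inport] [Outport] [Inport] [Outport]
      | Dch           PE Inport Outport                      -- dynamic channel connection
      | TermThread    Outport                                -- terminate a remote thread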
Termination: When a thread has completed its computation and has sent its output value, it terminates, closing the corresponding outport. If the last outport of a process is closed — this can be looked up in the process table — the whole process is terminated by closing all its inports. When an inport is closed, the sending thread is informed that its results are no longer needed. This is done via a TERM-THREAD message. Inports may also be closed at garbage collection, when it is detected that the input is not referenced anymore. Only a brief overview of the runtime system for Eden has been given here. It suffices to understand the second translation phase described in the next section and its interaction with the RTS. For a more detailed presentation, and a discussion of topics not addressed here, e.g. scheduling, process distribution, garbage collection, etc., the interested reader is referred to [2].
5   Second Translation Phase and Runtime Scenarios
In this section we present the translation of PEARL expressions into imperative code, i.e. into a sequence of calls to the procedures of the RTS. As we did in Sect. 3.2 for the first translation phase from Eden to PEARL, we present the translation schemes in the form of a recursive function tr2 :: PEARL → RTS. The function tr2 is only described here for the new expressions introduced by PEARL with respect to STGL. In order to better understand the runtime behaviour of Eden programs, we present the translation schemes grouped into two “scenarios”: process creation and dynamic channel handling. In each scenario, we explain the interaction between the code generated from PEARL expressions and the code executed by the RTS upon the reception of system messages. In this way, the reader can get a whole picture of what happens at runtime.

In the following, we will assume that process abstractions always have type Process (t1, ..., tn) (t'1, ..., t'k), i.e. they have n inports and k outports. The following notation and conventions will also be used:

inport (outport) identifiers: Local inport (outport) identifiers will be denoted by i1, ..., in (o1, ..., ok) in the child process and by i'1, ..., i'k (o'1, ..., o'n) in the parent process. The destination of a thread (origin of an inport) is its corresponding remote inport (outport), and it is represented by a global inport (outport) identifier, i.e. a tuple (idp, i) ((idp, o)) where idp is a remote PE and i (o) is a local inport (outport) in that PE. Destinations (origins) may also be denoted by D (O).
closure addresses: a, b, p, q, e, ...
thread state objects: are distinguished from other closure addresses and denoted by tso.
procedure calls: are of the form o1, ..., or ← name(i1, ..., is), where o1, ..., or are the output parameters, name is the name of the called procedure and i1, ..., is are the input parameters.
5.1   Interface between the Imperative Code and the RTS
The self-explanatory procedure names and the accompanying text will, in general, suffice to understand the translation. The less obvious procedures offered by the RTS are explained in advance.

tso ← createThread(e, D): Creates a thread tso, where e is the heap address of the expression to be executed, and D is the destination of the thread. If this is not yet known, or if the thread has no destination, the value undefD is used.

idp ← sendCreateProcess(p, otso, i'1, ..., i'k, o'1, ..., o'n): The RTS will compose and send a CREATE-PROCESS message to some remote PE, and return the identifier idp of the chosen remote machine. The parameter p represents a closure that will eventually evaluate to a process abstraction, and otso is the outport associated with the thread (in the parent's PE) responsible for spawning the threads feeding the inports of the child process. It is passed around in order to have a handle on this thread and to be able to activate it after the corresponding acknowledgement message arrives. The rest of the parameters are the inports — (i'1, ..., i'k) — and the outports — (o'1, ..., o'n) — at the parent's side, which correspond to outports, respectively inports, in the child's PE.

sendAck(idp, o, i'1, ..., i'k, o'1, ..., o'n, i1, ..., in, o1, ..., ok): Sends an ACK message to the thread matching outport o in PE idp, acknowledging that the process instantiated from that thread has been created with inports i1, ..., in and outports o1, ..., ok. The rest of the parameters are just a repetition of the remote inports and outports received in the corresponding CREATE-PROCESS message.
5.2   Scenario 1: Process Creation
A graphical description of this scenario is shown in Fig. 2. The parent and the child side, and the messages exchanged between them, are represented there. As we will see in the generated code, newly created inports on both sides are initially bound to queueMe closures, so that threads trying to consume values not yet produced by the remote process will be suspended. Outports should be matched to their corresponding threads, but in many cases these are not yet known at the moment of creating the outport. Thus, outports are usually initially matched to undefT threads.

Translation of the Process Instantiation Expression. The environment at the parent's side has to be created in order to be able to feed the inports of the remote process and to receive the values delivered by it through its outports. Therefore, k inports and n outports are created. However, at instantiation time of a new process not all the information needed to create the new outports is at hand at the parent's side. In fact, the output expressions which will feed each outport and the destinations of the new threads are not yet known.
[Figure: the process creation scenario — on the parent PE, the parent's thread calls sendCreateProcess and a CREATE_PROCESS message is sent to the child PE; there the RTS creates (createThread) and schedules the child's initial thread, which replies with an ACK message; the RTS of the parent then schedules the spawning thread, and both the spawning thread and the child's initial thread spawn their threads.]
Fig. 2. Process creation scenario.

Therefore, instead of directly matching the outports to their corresponding threads, a spawning thread is created that will eventually reduce to a parallel expression. When this happens, threads are spawned and matched to the previously created outports (see the translation of the parallel expression below). Notice that, at this moment, this spawning thread is created but not yet activated. Then, the remote process creation is initiated — by sending the corresponding message — and the tuple of new input channels is delivered to the continuation stack as the result of the whole instantiation. The addresses of the newly created queueMe closures form this tuple.
def
tr2 (p#e) =
    a1 ← createQM(); ... ; ak ← createQM();
    i'1 ← createInport(a1); ... ; i'k ← createInport(ak);
    o'1 ← createOutport(undefT); ... ; o'n ← createOutport(undefT);
    tso ← createThread(e, undefD);
    otso ← createOutport(tso);
    idpchild ← sendCreateProcess(p, otso, i'1, ..., i'k, o'1, ..., o'n);
    returnTuple(a1, ..., ak)

Arrival of the Process Creation Message. When the RTS of the PE where the child process must be instantiated receives the message CREATE-PROCESS(idpparent, p, otso, i'1, ..., i'k, o'1, ..., o'n), it first unpacks the closure p and copies it to the local heap — let p' denote the address of the copied closure — and then it creates the initial thread:

    tso ← createThread(p', undefD);
    initArg(tso, idpparent, otso, i'1, ..., i'k, o'1, ..., o'n);
    scheduleThread(tso)
Before scheduling the initial thread, all parameters in the message, except p, are saved in the argument stack, so that the code generated for the process abstraction (see below) can use them.

Translation of the Process Abstraction Expression. The code generated for a process abstraction assumes that the executing thread's stack contains the arguments mentioned above. This is safe, as we call it only in this setting. Now the environment of the newly instantiated process must be established: n inports and k outports are created. As the origins of the inports are already known, the connection is completed. Thereafter, the destinations and the outports of the future parallel threads are saved — in a table associated with the spawning thread state — an ACK message is sent, and the body of the process abstraction is executed. The arguments x1, ..., xn are free in the body of the process abstraction, and the values bound to them are supposed to be in the argument stack. That is the reason for pushing there the addresses of the queueMe closures. def
tr2 ({frees} \proc {x1 , . . . , xn } → body) = idpparent ← popArg(); otso ← popArg(); i01 ← popArg(); . . . ; i0k ← popArg(); o01 ← popArg(); . . . ; o0n ← popArg(); b1 ← createQM(); . . . ; bn ← createQM(); i1 ← createInport(b1 ); . . . ; in ← createInport(bn ); o1 ← createOutport(undefT); . . . ; ok ← createOutport(undefT); connectInport(i1 , (idpparent , o01 )); . . . ; connectInport(in , (idpparent , o0n )); tso ← ownTSO() : saveDestinations(tso, (idpparent , i01 ), . . . , (idpparent, i0k )); saveOutports(tso, o1 , . . . , ok ); sendAck(idpparent , otso, i01 , . . . , i0k , o01 , . . . , o0n , i1 , . . . , in , o1 , . . . , ok ); pushArg(bn ); . . . ; pushArg(b1 ); tr2 (body) Arrival of the Acknowledgement Message When a message ACK(idpchild , otso , i01 , . . . , i0k , o01 , . . . , o0n , i1 , . . . , in , o1 , . . . , ok ) arrives to the parent’s PE, the outports and destinations of the threads, which will be later created by the parallel expression, are stored in the state of the spawning thread. Inports are connected, and the spawning thread is scheduled: tso ← getT hread(otso ); saveDestinations(tso, (idpchild , i1 ), . . . , (idpchild , in )); saveOutports(tso, o01 , . . . , o0n ); connectInport(i01 , (idpchild , o1 )); . . . ; connectInport(i0k , (idpchild , ok )); scheduleT hread(tso)
Translation of the Parallel Expression The executing thread will spawn a set of m parallel threads and thereafter die. The translation below is valid both for parallel expressions in process instantiations (parent’s side) and for parallel expressions in process abstractions (child’s side). It is assumed that the destinations and the outports associated to the spawned threads have been previously saved in the state of the executing thread. def
tr2 ({|e1 || ... || em|}) =
    D1, ..., Dm ← getDestinations();
    o1, ..., om ← getOutports();
    tso1 ← createThread(e1, D1); ... ; tsom ← createThread(em, Dm);
    matchOutportThread(o1, tso1); ... ; matchOutportThread(om, tsom);
    scheduleThread(tso1); ... ; scheduleThread(tsom);
    exitThread()
5.3   Scenario 2: Dynamic Channel Handling
Translation of the Channel Creation Expression A new inport is created in the current machine, bound to a queueMe closure and the global address of this inport is saved in a ChanName closure which, in essence, is a tuple. Then, the main expression is entered, having previously pushed in the stack the arguments of this lambda abstraction. def
tr2 (new {cn, c} e) = idpcreator ← ownPE(); q ← createQM(); i ← createInport(q); a ← createCHN(idpcreator , i); pushArg(q); pushArg(a); enter(e) Let us note that the binding variables c and cn are ignored and replaced by the fresh heap addresses q and a. The first one, i.e. q, is bound to a queueMe closure, representing the new channel, whereas the second one, a, is bound to a ChanName closure containing the global address of the channel. Translation of the Channel Use Expression The message expression v1 and the main expression v2 are evaluated by independent threads. Therefore, a new thread is created and scheduled for the message evaluation, whereas the current thread continues with the evaluation of the main expression. A new local outport o matching the new thread is created. Then a DCH message is sent to communicate the new connection to the creator of the dynamic channel.
def
tr2 (write cn v1 v2 ) =
    idpcreator , i ← accessCHN(cn);
    tso ← createThread(v1 , (idpcreator , i));
    o ← createOutport(tso);
    sendDch(idpcreator , i, o);
    scheduleThread(tso);
    enter(v2 )
Arrival of the Dynamic Channel Connection Message When the message DCH (idpuser , i, o) arrives at the dynamic channel creator's PE, the RTS will just connect the local inport i to its newly established origin (idpuser , o), by executing: connectInport(i, (idpuser , o)); The fact that no other process has connected to this channel before can be checked by looking it up in the inport table.
6 Related and Future Work
The specialized code generation defined here instantiates Eden processes, each one consisting of a number of threads and a complex communication structure, whereas languages like Concurrent ML [13] or Facile [8] use explicit message passing primitives at the language level, and they treat these primitives as ordinary functions during code generation. Other systems like GpH [14] and Concurrent Clean [16] handle communication implicitly by relying on a virtually shared memory. Their parallel annotations are also implemented in a straightforward manner by primitive operations. Closer to our approach is Caliban [7], which extracts topology information from the program text and needs special code generation, but it is restricted to static process networks. The starting point for the implementation of Eden has been the Glasgow Haskell Compiler (GHC) [12]. Only small modifications to the parser were needed in order to accept the new expressions that Eden adds to Haskell. As a consequence, our abstract machine DREAM is an extension of the STG machine [11] on which GHC is based. The extensions concern multithreading and communication features. The current runtime system for Eden has profited much from the runtime system GUM for Glasgow Parallel Haskell [15]. The implementation status is as follows: a compiler for the basic features of Eden is running and producing C code; PEARL constructions have been implemented by a set of predefined C functions called from this code; the RTS has been coded and the whole system is being integrated. Soon we will be able to present efficiency results obtained with this compiler. Preliminary figures were obtained with a first distributed prototype directly calling MPI from Haskell, and they were published elsewhere [2]. They promise good speedups for the actual implementation. Eden features not covered up to now are channel structures and implementation optimizations such as channel bypassing.
Acknowledgements We thank Luis Antonio Galán, Rita Loogen, Cristóbal Pareja, Steffen Priebe, Fernando Rubio and Clara Segura for useful discussions and comments provided while preparing this paper.
References 1. S. Breitinger, U. Klusik, and R. Loogen. Channel Structures in the Parallel Functional Language Eden. In Proc. 1997 Glasgow Workshop on Functional Programming, 1998. revised version. 2. S. Breitinger, U. Klusik, and R. Loogen. From (Sequential) Haskell to (Parallel) Eden: An Implementation Point of View. In Proc. Programming Languages: Implementations, Logics and Programs (PLILP’98), volume 1490 of LNCS, pages 318–334, Springer-Verlag, 1998. 3. S. Breitinger, U. Klusik, R. Loogen, Y. Ortega-Mall´en, and R. Pe˜ na. DREAM - the DistRibuted Eden Abstract Machine. In C. Clack, A.J.T. Davie and K. Hammond, editors, Proc. 9th. International Workshop on the Implementation of Functional Languages (IFL ’97), St Andrews, Scotland, September 1997, volume 1467 of LNCS, pages 250–269. Springer-Verlag, 1998. 4. S. Breitinger, U. Klusik, R. Loogen, and Y. Ortega-Mall´en. Concurrency in Functional and Logic Languages. In Proc. Fuji International Workshop on Functional and Logic Programming, Japan. World Scientific Publishing Company, 1995. ISBN 981-02-2437-0. 5. S. Breitinger, U. Klusik, R. Loogen, Y. Ortega-Mall´en, and R. Pe˜ na. Eden — Language Definition and Operational Semantics. Technical Report 96-10, PhilippsUniversit¨ at Marburg, 1996. 6. S. Breitinger, R. Loogen, Y. Ortega-Mall´en, and R. Pe˜ na. The Eden Coordination Model for Distributed Memory Systems. In Proc. HIPS ’97 — High-Level Parallel Programming Models and Supportive Environments. IEEE Press, 1997. 7. S. Cox, S.-Y. Huang, P.H.J. Kelly, J. Liu, and F. Taylor. An Implementation of Static Functional Process Networks. In Proc. PARLE ’92 — Parallel Architectures and Languages Europe, pages 497–512. Springer-Verlag, 1992. 8. A. Giacalone, P. Mishra, and S. Prasad. Facile: A Symmetric Integration of Concurrent and Functional Programming. Journal of Parallel Programming, 18(2), 1989. 9. G. Hogen, A. Kindler, and R. Loogen. Automatic Parallelization of Lazy Functional Programs. Proc. 1992 European Symposium on Programming (ESOP ’92), Springer-Verlag LNCS, 1992. 10. J.C. Peterson, K. Hammond, L. Augustsson, B. Boutel, F. W. Burton, J. Fasel, A. D. Gordon, R. J. M. Hughes, P. Hudak, T. Johnsson, M. P. Jones, E. Meijer, S. L. Peyton Jones, A. Reid, and P. L. Wadler. Report on the Non-Strict Functional Language, Haskell, Version 1.4, 1997. 11. S.L. Peyton-Jones. Implementing lazy functional languages on stock hardware: the Spineless Tagless G-machine, version 2.5. Journal of Functional Programming, 2(2):127–202, April 1992. 12. S. L. Peyton Jones, C. Hall, K. Hammond, W. Partain, and P. Wadler. The Glasgow Haskell compiler: A technical overview. In Proc. UK Joint Framework for Information Technology, Technical Conference (JFIT ’93), pages 249–257, Keele, Mar. 1993.
13. J.H. Reppy. CML: A higher-order concurrent language. In Proc. 1991 ACM SIGPLAN Conf. on Programming Language Design and Implementation (PLDI 91), pages 293–305, 1991. 14. P.W. Trinder, K. Hammond, H.-W. Loidl, and S.L. Peyton Jones. Algorithm + Strategy = Parallelism. Journal of Functional Programming, 8(1):23–60, January 1998. 15. P.W. Trinder, K. Hammond, J.S. Mattson, A.S. Partridge, and S.L. Peyton Jones. GUM: a Portable Parallel Implementation of Haskell. In Proc. 1996 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’96), pages 79–88, 1996. 16. M.C.J.D. van Eekelen and M.J. Plasmeijer. Functional Programming and Parallel Graph Rewriting, Addison-Wesley, 1993.
Efficient Combinator Parsers Pieter Koopman? and Rinus Plasmeijer Computer Science Nijmegen University, The Netherlands
[email protected],
[email protected] Abstract. Parser combinators enable the construction of recursive descent parsers in a very clear and simple way. Unfortunately, the resulting parsers have a polynomial complexity and are far too slow for realistic inputs. We show how the speed of these parsers can be improved by one order of magnitude using continuations. These continuations prevent the creation of intermediate data structures. Furthermore, by using an exclusive or-combinator instead of the ordinary or-combinator the complexity for deterministic parsers can be reduced from polynomial to linear. The combination of both improvements turns parser combinators from a beautiful toy into a practically applicable tool which can be used for real world applications. The improved parser combinators remain very easy to use and are still able to handle ambiguous grammars.
1 Introduction
Parser combinators [3,6,5,8] are a beautiful illustration of the use of higher order functions and currying. By using a small set of parser combinators it becomes possible to construct parsers for ambiguous grammars in a very elegant and clear way. The basis of parser combinators is the list of successes method introduced by Wadler [13]. Each parser yields a list of results: all successful parsings of the input. When the parser fails this list is empty. In this way it is very easy to handle ambiguous parsers that define multiple ways to parse a given input. Despite the elegant formulation and the ability to handle ambiguous grammars, parser combinators are rarely used in practice. For small inputs these parsers work nicely and smoothly. For realistically sized inputs the parsers consume extraordinary amounts of time and space due to their polynomial complexity. In this paper we show that the amounts of time and memory required by the combinator parsers can be drastically reduced by improving the implementation of the parser combinators and providing a little more information about the grammar in the parser that is written. This additional information can reduce the complexity of the parser from polynomial to linear. Although the implementation of the parser combinators becomes more complex, their use in combinator parsers remains as simple as in the original setting. This paper starts with a short review of classical parser combinators. The proposed improvements are presented hereafter. We use a running parsing example to measure the effect of the improvements. ?
Sponsored by STW project NWI.4411
2 Conventional Parser Combinators
There are basically two approaches to constructing a parser for a given grammar [1]. The first approach is based upon the construction of a finite state automaton determining the symbols that can be accepted. This approach is used in many parser generators like yacc [7,1,10], Happy [4] and Ratatosk [9]. In general the constructed parsers are efficient, but they are cumbersome to obtain and there may be a considerable distance between the automaton accepting input tokens and the original grammar. If we use a generator there is a barrier between parsing and using the parsed items in the rest of the program. The second approach is to create a recursive descent parser. Such a parser follows the rules of the grammar directly. In order to ensure termination the grammars should not be left-recursive. Parser combinators are a set of higher order functions that are convenient in the construction of recursive descent parsers. The parser combinators provide primitives to recognize symbols and the sequential or alternative composition of parsers. The advantages of parser combinators are that they are easy to use, elegant and clear. Due to the fact that the obtained parsers directly correspond to the grammar, no separate parser generator is needed. Since parser combinators are ordinary functions they are easy to understand and use. It is easy to extend the set of parser combinators with new handy combinators whenever this is desired. The full power of the functional programming language is available to construct parsers; this implies, for instance, that it is possible to use second order grammars. Finally, there are no problems in transferring parsed items from the parser to the manipulation functions. Conventional parser combinators are described at many places in the literature, e.g. [3,6,5,8]. Here we follow the approach outlined in [8], using the functional programming language Clean [11]. We restrict ourselves to a small, but complete set of parser combinators to illustrate our improvements. A Parser is a function that takes a list of input symbols as argument and produces a ParsResult. A ParsResult is the list of successes. Each success is a tuple containing the rest of the list of input symbols and the item found. The types Parser and ParsResult are parameterized by the type of symbols to be recognized, s, and the type of the result, r.
:: Parser s r :== [s] -> ParsResult s r
:: ParsResult s r :== [([s],r)]
In the examples in this paper we will use characters, Char, as symbols in the lists of elements to be parsed, but in general they can be of any datatype.
2.1 Basic Parser Combinators
The basic combinator to recognize a given symbol in the input is symbol. This parser combinator takes the symbol to be recognized as its argument. When the first token in the input is equal to this symbol there is a single success. In all
other situations the list of successes is empty, indicating that the parser failed. It is of course required that equality is defined for the type of symbols to be recognized; this is indicated by the phrase | == s in the type definition.
symbol :: s -> Parser s s | == s
symbol sym = p
where
    p [s:ss] | s==sym = [(ss,sym)]
    p _               = []
A related combinator is called satisfy. This combinator takes a predicate on the first input symbol as its argument. When the first symbol satisfies this predicate the parser combinator succeeds; otherwise it fails.
satisfy :: (s->Bool) -> Parser s s
satisfy pred = p
where
    p [s:ss] | pred s = [(ss,s)]
    p _               = []
To complete the set of basic parser combinators there is a combinator, called fail, that always fails (i.e. yields the empty list of successes) and the combinator yield that always succeeds with the given item. fail :: Parser s r fail = \_ -> [] yield :: r -> Parser s r yield r = \ss -> [(ss,r)]
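For instance, a parser that accepts a single decimal digit can be obtained directly from satisfy (a small usage sketch, not part of the original set of combinators; isDigit comes from the standard library):
digit :: Parser Char Char
digit = satisfy isDigit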
2.2 Combining Parsers
Given two parsers it is possible to create new parsers by combining them using parser combinators. For instance, the or-combinator indicates a choice between two parsers. Using the list of successes, the implementation just concatenates the results of both parsers. To enhance readability this parser combinator is defined as an infix operator named <|>.
(<|>) infixr 4 :: (Parser s r) (Parser s r) -> Parser s r
(<|>) p1 p2 = \ss -> p1 ss ++ p2 ss
There are several ways to compose parsers sequentially. Here we restrict attention to the so-called monadic style, where the result of the first parser is given as an argument to the second parser. This and-combinator, named <&=>, is defined as:
(<&=>) infixr 6 :: (Parser s r) (r -> Parser s t) -> Parser s t
(<&=>) p1 p2 = \ss -> [ tuple \\ (ssRest,result1) <- p1 ss, tuple <- p2 result1 ssRest ]
The repeat-combinator <*> applies a parser as often as possible and collects the results in a list; the variant <+> requires at least one element to be recognized:
<*> :: (Parser s r) -> Parser s [r]
<*> p = (p <&=> \r -> <*> p <@ \rs -> [r:rs]) <|> yield []

<+> :: (Parser s r) -> Parser s [r]
<+> p = p <&=> \r -> <*> p <@ \rs -> [r:rs]
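The repeat-combinators use an apply-combinator that transforms the result of a parser with a function. A minimal list-of-successes sketch of such a combinator (the name <@ and its fixity are assumptions here) is:
(<@) infixl 5 :: (Parser s r) (r -> t) -> Parser s t
(<@) p f = \ss -> [ (rest, f result) \\ (rest, result) <- p ss ]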
2.3 Examples and Measurements
In order to illustrate the use of parser combinators we will show some examples. Our first parser, aANDbORc, recognizes a character 'a' followed by a 'b' or a 'c'. The result of the parser is a tuple containing the parsed items.
aANDbORc :: Parser Char (Char,Char)
aANDbORc = symbol 'a' <&=> \x -> (symbol 'b' <|> symbol 'c') <@ \y -> (x,y)
We illustrate the use of this parser by applying it to some inputs.

The program                      yields
Start = aANDbORc ['abc']         [(['c'],('a','b'))]
Start = aANDbORc ['acb']         [(['b'],('a','c'))]
Start = aANDbORc ['cba']         []
Our next example is the ambiguous parser alphaORhex. It recognizes characters from the alphabet and hexadecimal characters.
alphaORhex :: Parser Char Char
alphaORhex = alpha <|> hex
alpha :: Parser Char Char alpha = satisfy (\c -> isMember c ([’a’..’z’]++[’A’..’Z’])) hex :: Parser Char Char hex = satisfy (\c -> isMember c ([’0’..’9’]++[’A’..’F’]))
Again we illustrate the use of the parser by applying it to some inputs.

The program                      yields
Start = alphaORhex ['abc']       [(['bc'],'a')]
Start = alphaORhex ['ABC']       [(['BC'],'A'),(['BC'],'A')]
Start = alphaORhex ['123']       [(['23'],'1')]
The characters from the range ['A'..'F'] are accepted by the parsers hex and alpha. Hence alphaORhex is ambiguous and yields both results. Here the parser alphaORhex recognizes an alpha, an alpha and a hex, and a hex respectively. As a more serious example we turn to a parser that recognizes a sentence. A sentence is defined as a sequence of words separated by white space and/or commas and terminated by a full stop. The result of the parser sentence is the list of words in the sentence. Each word is represented by a string.
word :: Parser Char String
word = <+> (satisfy isAlpha) <@ toString

sep :: Parser Char [Char]
sep = <+> (satisfy (\c -> isSpace c || c == ','))

sentence :: Parser Char [String]
sentence = word <&=> \w -> <*> (sep <&=> \_ -> word) <&=> \r -> symbol '.' <@ \_ -> [w:r]
The function isAlpha from the standard library checks whether a character is an element of the alphabet. Similarly, the test for white space (space, tab, newline, ..) is done by isSpace. Note that the parser word is ambiguous. It does not only find the entire word, but also all non-empty initial parts.

The program
Start = word ['Hello world']
yields
[([' world'],"Hello")
,(['o world'],"Hell")
,(['lo world'],"Hel")
,(['llo world'],"He")
,(['ello world'],"H")
]
The parser sentence looks clear and simple. It is not ambiguous since trailing parts of a word are not recognized as separators and the sentence has to end in a full stop. To investigate its efficiency we apply it to a sentence of 50,000 characters (the size of a reasonable file). As a comparison we use a hand-written
ad-hoc parser which is described in Appendix A. This ad-hoc parser uses a state automaton and accumulators to construct the words and is pretty efficient. As a consequence it is harder to determine which syntax is accepted by the parser by looking at the definition. Measurements in this paper are done on a 266 MHz Pentium II, using 10 Mb of heap space and 500 Kb of stack space. The execution time includes the generation of the input list and the consumption of the result.

Parser used    execution time (s)    garbage collect (s)    total time (s)    minimal heap (Kb)
sentence       120                   345                    465               7070
ad-hoc         0.06                  0.03                   0.09              625

These measurements show that the performance of the parser sentence is rather disappointing. It consumes one order of magnitude more memory than the ad-hoc parser and is about 5000 times slower. This disappointing performance is the reason that many people turn to parser generators, like yacc, Happy and Ratatosk mentioned above, forbid ambiguity [12], or use an ad-hoc parser. In this paper we will investigate how we can improve the performance of parsers within the given framework of potentially ambiguous parser combinators. In general the time and memory needed by a parser are dependent on the parser and its input. We will use the parser sentence as running example.
3 Avoiding Intermediate Data Structures
As a first step to increase the speed of parser combinators we focus on the traditional implementation of these combinators. Although the list of successes method is conceptually nice and clear, it introduces a very large overhead. Each result must be packed in a list of tuples. Usually, this list of tuples is immediately taken apart in the next action of the parser. Using continuations [2] in the implementation of the parser combinators it is possible to avoid these superfluous intermediate lists and tuples. The parser combinators will handle all continuations. The only thing that changes for the user of these parser combinators is that a parser should be initiated by begin parser input, instead of parser input. The parser combinators using continuations can replace the combinators introduced in the previous section completely. Nevertheless, in this paper we use new names for the versions of the combinators using continuations so that we can identify the different implementations of the combinators. The type of a continuation parser reflects the fact that there are two continuations. The first continuation handles the item recognized by the parser. This first continuation is very similar to the second argument of the and-combinator. The second continuation contains all remaining results.
:: ParserC s r t :== (Success s r t) (NextRes s t) -> Parser s t
:: Success s r t :== r (NextRes s t) -> Parser s t
:: NextRes s t   :== ParsResult s t
The type ParserC has three arguments. The first argument, s, denotes the type of symbols parsed. The next argument, r, is the type of the result of parsing. The success continuation transforms this to the type t of the total result. One can regard this success continuation as a function that is applied to the result of the parser, much like the apply-combinator applies a function to it.
The parser combinator yieldC, that yields the given element r applies the success continuation to this element, the next continuation and the input. yieldC :: r -> ParserC s r t yieldC r = \succ next ss -> succ r next ss
The parser that recognizes the given symbol s, checks whether the first element of the input stream is equal to this symbol. If it is, the success continuation is applied to the symbol, the next continuation and the rest of the input stream. If the first element is not equal to the given symbol, the parser fails and hence yields the next continuation. symbolC :: s -> ParserC s s t | == s symbolC sym = p where p sc nc [s:ss] | s==sym = sc sym nc ss p sc nc _ = nc
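The remaining basic combinators carry over in the same way. The following are sketches (the names failC and satisfyC are assumptions; the text only shows yieldC and symbolC): the failing parser simply returns the next continuation, and the predicate-based parser tests the first input symbol.
failC :: ParserC s r t
failC = \sc nc ss -> nc

satisfyC :: (s -> Bool) -> ParserC s s t
satisfyC pred = p
where
    p sc nc [s:ss] | pred s = sc s nc ss
    p sc nc _               = nc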
The new definitions of the and-combinator and or-combinator are trivial. In the and-combinator the second parser is used as success continuation. This parser is applied to the item found by the first parser and the original success continuation.
(>&=<) infixr 6 :: (ParserC s u t) (u -> ParserC s v t) -> ParserC s v t
(>&=<) p1 p2 = \sc -> p1 (\t -> p2 t sc)
In the or-combinator the next continuation is replaced by applying the second parser to the success continuation, the original next continuation and the input.
(>|<) infixr 4 :: (ParserC s r t) (ParserC s r t) -> ParserC s r t
(>|<) p1 p2 = \sc nc ss -> p1 sc (p2 sc nc ss) ss
The repeat- and apply-combinators are adapted in the same way; their continuation versions are named >*<, >+< and @< respectively. Finally we introduce the function begin that provides the initial continuations for a parser. As success continuation we provide a function that constructs an element in the list of successes. The next continuation is initially the empty list.
begin :: (ParserC s t t) -> Parser s t begin p = p (\r nc ss -> [(ss,r):nc]) []
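As a small usage sketch (not taken from the text above), the first example parser of Sect. 2.3 can be rebuilt from the continuation combinators and run with begin:
Start = begin (symbolC 'a' >&=< \x -> symbolC 'b' >&=< \y -> yieldC (x,y)) ['abc']
// yields [(['c'],('a','b'))]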
To determine the effect of using continuations we apply a modified parser sentence again to the same input. Inside the definition of the parser we have
replaced all parser combinators by the corresponding versions using continuations.

Parser used                   execution time (s)   garbage collect (s)   total time (s)   minimal heap (Kb)
original sentence             120                  345                   465              7070
sentence with continuations   9.6                  36.6                  46.2             5940
ad-hoc                        0.06                 0.03                  0.09             625

From these measurements we see that using continuations improves the execution speed by a factor of ten. Also the minimal amount of memory needed to run this program is reduced a little. This effect is caused by omitting the construction of intermediate data structures. On the other hand it is clear that this improvement is not enough: the difference between this parser and the hand-written parser is still a factor of 500.
4 Limit Backtracking
Since it looks impossible to speed up the parser combinators much more, we must turn our focus to the behaviour of the parsers. As we have seen in the word ['Hello world'] example in Sect. 2.3 above, the parser word is ambiguous. Apart from the whole word, it also yields all initial fragments of this word. This implies that there are n results for a word of n characters. The same holds for the separations of words recognized by the parser sep. In the current application we are only interested in the most eager parsing of words and separations. All other results will not yield a successful parsing of a sentence. This behaviour is determined by the or-combinator, <|>, used in the definition of the repeat-combinator, <*>. The or-combinator yields the results of both alternatives inside the definition of <*>. Even when p <&=> \r -> <*> p <@ \rs -> [r:rs] succeeds, the or-combinator also yields the empty list. In the current situation we require an exclusive or-combinator that only applies the second parser when the first one fails. In the plain list of successes approach the definition of the xor-combinator <!> is very direct. The parser p2 is only applied to the input ss if applying p1 to the input yields the empty list of successes. If applying p1 to the input produces a non-empty result, this is the result of the entire construct.
(<!>) infixr 4 :: (Parser s r) (Parser s r) -> Parser s r
(<!>) p1 p2 = \ss -> case p1 ss of
                         [] -> p2 ss
                         r  -> r
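In the plain list-of-successes setting the eager repeat-combinators can be built from this xor-combinator in exactly the same way as the ordinary ones are built from the or-combinator; the following is a sketch (the names <!*> and <!+> are assumptions, and the apply-combinator sketched earlier is used):
<!*> :: (Parser s r) -> Parser s [r]
<!*> p = (p <&=> \r -> <!*> p <@ \rs -> [r:rs]) <!> yield []

<!+> :: (Parser s r) -> Parser s [r]
<!+> p = p <&=> \r -> <!*> p <@ \rs -> [r:rs]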
Using continuations we can achieve the same effect by replacing the next continuation of the new success continuation by the original next continuation. This excludes the results of p2 from the result when p1 succeeds.
(>!<) infixr 4 :: (ParserC s r t) (ParserC s r t) -> ParserC s r t
(>!<) p1 p2 = \sc nc ss -> p1 (\r _ -> sc r nc) (p2 sc nc ss) ss
Now we can define eager versions of the repeat-combinators by replacing the ordinary or-combinator in the body of <*> by the xor-combinator. For the version using continuations we obtain:
>!*< :: (ParserC s r t) -> ParserC s [r] t
>!*< p = ( p >&=< \r -> >!*< p @< \rs -> [r:rs]) >!< yieldC []

>!+< :: (ParserC s r t) -> ParserC s [r] t
>!+< p = p >&=< \r -> >!*< p @< \rs -> [r:rs]
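As an illustration (not from the original text), the word parser of Sect. 2.3 could now be written with the eager combinators, assuming a continuation version of satisfy such as the satisfyC sketched earlier and that @< applies a function to the parse result:
wordC :: ParserC Char String t
wordC = >!+< (satisfyC isAlpha) @< toString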
It is important to note that the previous optimization, using continuations, does not have any semantic influence on the parser combinators. In the current optimization we turn an ambiguous parser for words into an eager and deterministic parser for words. In the context of parsing a sentence this is allowed, but there might be (rare?) occurrences where the ambiguous nature of the parser word is used. So, it depends on the context whether this optimization can be applied. In general we need the deterministic parser combinators >!< and >!*< as well as the ambiguous combinators >|< and >*<.

Parser used                   execution time (s)   garbage collect (s)   total time (s)   minimal heap (Kb)
<!*> everywhere               0.30                 0.19                  0.49             4712
sentence with continuations   9.6                  36.6                  46.2             5940
>!*< in word and sep          1.19                 1.12                  2.31             1872
>!*< everywhere               0.28                 0.14                  0.43             1740
ad-hoc                        0.06                 0.03                  0.09             625

These measurements show that using xor-combinators instead of ordinary or-combinators makes a tremendous difference. The difference in speed between the ad-hoc parser and the best combinator parser is less than a factor of 5. Even when the xor-combinator is used only at a limited number of places it has a huge effect. In fact we obtain the effect of an old-fashioned tokenizer when we use the eager repeat combinators in word and sep.
It is mainly the memory usage that makes it preferable to use the continuation version of the parser combinators when we use xor-combinators and eager repeats everywhere. If there is ambiguity in the parser, i.e. when the ordinary or-combinator is used, the version using continuations is an order of magnitude more efficient.
5 Reducing the Memory Requirements
As a last improvement we try to reduce the memory requirements of the parser. As a side effect of the other optimizations introduced above, the memory requirements are already reduced by a factor of four. We can improve this further by observing that in many applications of the xor-combinator we can decide that the second alternative will never be applied well before the first alternative has completely succeeded. Currently, however, the second alternative is not thrown away before the first alternative succeeds. Consider again the repeat combinator. The version using continuations is:
>!*< :: (ParserC s r t) -> ParserC s [r] t
>!*< p = ( p >&=< \r -> >!*< p @< \rs -> [r:rs]) >!< yieldC []
As soon as the parser p is successfully applied, it is clear that the alternative yieldC [] will not be used. However, this alternative is kept until the first alternative of >!< succeeds. This alternative succeeds after the completion of >!*< p, which can be a huge amount of work. By adding primitives to throw the xor-alternatives away the programmer can improve the memory requirements of the parser. In order to do this we introduce a new continuation, XorCont s t, containing the xor-alternatives. :: CParser s r t :== (SucCont s r t) (XorCont s t) (AltCont s t) -> Parser s t :: SucCont s r t :== r (XorCont s t) (AltCont s t) -> Parser s t :: XorCont s t :== (AltCont s t) -> ParsResult s t :: AltCont s t :== ParsResult s t
The xor-continuation is a function that takes the ordinary alternative continuation as argument and yields the remaining parse results. Of course we have to adapt the definitions of all parser combinators to this new continuation. Again, the parsers which use these combinators remain unchanged. In order to distinguish the various versions of the combinators we introduce again new names. The basic parser combinators are:
Cfail :: CParser s r t
Cfail = \sc xc ac ss -> xc ac

Cyield :: r -> CParser s r t
Cyield x = \sc -> sc x
Csymbol :: s -> CParser s s t | == s Csymbol sym = p where p sc xc ac [s:ss] | s==sym = sc sym xc ac ss p sc xc ac _ = xc ac
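A predicate-based combinator in this style follows the same pattern as Csymbol; the following is a sketch (the name Csatisfy is an assumption):
Csatisfy :: (s -> Bool) -> CParser s s t
Csatisfy pred = p
where
    p sc xc ac [s:ss] | pred s = sc s xc ac ss
    p sc xc ac _               = xc ac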
In the and-combinator the second parser, p2, is inserted in the first continuation, the success continuation sc, of the first parser, p1. () infixr 6 :: (CParser s u t) (u->CParser s v t) -> CParser s v t () p1 p2 = \sc -> p1 (\t -> p2 t sc)
For the xor-combinator we insert the second parser in the xor-continuation, xc, of the first parser. () infixr 4 :: (CParser s r t) (CParser s r t) -> CParser s r t () p1 p2 = \sc xc ac ss -> p1 (\x xc2 -> sc x xc) (\ac3 -> p2 sc xc ac3 ss) ac ss
For the ordinary or-combinator we insert the second parser in the alternative continuation, ac, of the first parser. () infixr 4 :: (CParser s r t) (CParser s r t) -> CParser s r t () p1 p2 = \sc xc ac ss -> p1 sc (\ac2 -> ac2) (p2 sc xc ac ss) ss
Again there is a function Begin that provides the initial continuations to construct a parse result for the successful parsings. Begin :: (CParser s t t) -> Parser s t Begin p = p (\x xc ac ss -> [(ss,x):xc ac]) (\ac -> ac) []
After the definition of the basic combinators we can define the combinator cut that removes xor-alternatives by replacing the corresponding continuation, xc, by an identity function. cut :: (CParser s r t) -> CParser s r t cut p = \sc xc ac -> p sc (\ac -> ac) ac
The scope of the cut is all xor-combinators up to the first ordinary orcombinator. Multiple cuts, or cuts in a context without xor-alternatives are harmless. It appears to be convenient to combine the cut-combinator with an and-combinator. () infixr 6 :: (CParser s u t) (u->CParser s v t) -> CParser s v t () p1 p2 = \sc -> p1 (\t ac2 -> p2 t sc (\ac -> ac))
The semantics of this combinator reads: iff p1 succeeds, the xor-alternatives are thrown away and we continue parsing with p2. If p1 fails, we turn to the xor- and or-alternatives as usual. An obvious application of this combinator is the eager repeat-combinator discussed above. When p succeeds we can remove the xor-alternative Cyield []. We use the combinator defined above to achieve this. The definition of the eager repeat-combinator doing this is:
:: (CParser s r t) -> CParser s [r] t p = ( p \r -> p
To determine the effects of this optimization we repeat our measurements.

Parser used             execution time (s)   garbage collect (s)   total time (s)   minimal heap (Kb)
>!*< everywhere         0.28                 0.14                  0.43             1740
with cut everywhere     0.27                 0.11                  0.38             937
ad-hoc                  0.06                 0.03                  0.09             625

Although the improvements (12% for speed, 46% for minimal memory) are of a smaller order than the previous optimizations, we consider them useful. The gain of this optimization depends on the amount of storage that is claimed by the alternative that is thrown away and the moment it is thrown away. For huge alternatives that are thrown away after a very complex (time and memory consuming) first alternative, the gain will be larger.
6 Using Accumulators
Finally, we can consider implementing the repeat-combinator outside the framework of parser combinators. For the implementation of the combinator , it might be useful to employ an accumulator like we have used in the ad-hoc parser. This can look like: :: (CParser s r t) -> CParser s [r] t p = ClistP p [] :: (CParser s r t) -> CParser s [r] t p = p \r -> ClistP p [r] ClistP :: (CParser s r t) [r] -> CParser s [r] t ClistP p l = clp l where clp l sc xc ac ss = p (\r xc2 -> clp [r:l] sc (\ac3 -> ac3)) (\ac4 -> sc (reverse l) xc ac4 ss) ac ss
This new implementation is transparent for the user of the repeat combinators. We measure the value of this modification in our running example.

Parser used                    execution time (s)   garbage collect (s)   total time (s)   minimal heap (Kb)
with cut everywhere            0.27                 0.11                  0.38             937
with accumulators (ClistP)     0.26                 0.08                  0.34             887
ad-hoc                         0.06                 0.03                  0.09             625
Using this implementation shows again a small improvement. Due to the use of accumulators and the changed form of recursion, this version also requires less stack space. It appears that the xor-continuation is necessary to implement this efficiently. For the repeat-combinator such an implementation might be used. However, we cannot expect a user of the parser combinators to write such an implementation. Fortunately, the price to be paid by using parser combinators instead of tailor-made functions using accumulators is limited (about 10%).
7 Complexity
Parsers constructed by using conventional parser combinators only work well for small inputs. For larger inputs the execution time increases very rapidly. This is a strong indication that there is a problem with the complexity of the parsers. It is impossible to make useful statements about all, potentially ambiguous, parsers created by parser combinators. However, we can investigate the complexity of a specific parser constructed using parser combinators. We will again take the parser for sentences as example. A large part of the discussion is based on the behaviour of the repeat-combinators and the recognition of tokens. This part of the discussion will apply to almost any parser written. As shown in Sect. 2.3 parsing words is ambiguous. The ambiguity is caused by using the standard repeat-combinator . Apart from the longest possible word, all initial fragments of this word are also recognized as a word. The recognition of the characters in the word is shared among the various results. But, for each result a spine of the list of characters has to be constructed. These lists of characters are later transformed to strings. For each individual result the parser continues with the remaining part of the input. In our example, parsing the rest of the input will fail on the first input character for all but the most eager result of word. Nevertheless, there are n results for parsing a word of length n. For each of these results a list of characters of average length n/2 is constructed and the next character after that result is parsed. This indicates that parsing words, and hence parsing a sentence, in this way is O(n2 ). In Sect. 4 we showed how we can get rid of partial words by introducing an xor-combinator. When we construct the repeat-combinator using this xorcombinator, the parser word has only a single result. Parsing a word of length n implies parsing n characters and constructing a single list of length n to hold these characters. The parser is now linear, O(n), in the length of the word. When we also apply the eager repeat-combinator in the parser sentence, the entire parser will be linear in the length of the input. This holds for all implementations of the parser combinators. Various implementations of the same combinator result in parsers of the same complexity, only the constants are different. When we apply the ordinary or-combinators the parser sentence will be polynomial. Using xor-combinators the parser will be linear. In order to verify this we measure the time consumed by our best implementation of the combinators (Sect. 5) as function of the length of the input.
length     or-combinators                     xor-combinators                    speed-up
           exec (s)   gc (s)    total (s)     exec (s)   gc (s)    total (s)     (total time)
1000       0.02       0.01      0.01          0.01       0.00      0.01          1
2000       0.05       0.02      0.07          0.01       0.00      0.01          7
4000       0.12       0.04      0.16          0.02       0.01      0.03          5
10,000     0.45       0.27      0.72          0.04       0.03      0.07          10
20,000     1.46       1.73      3.19          0.09       0.05      0.15          21
40,000     5.54       16.87     22.41         0.21       0.09      0.30          75
100,000    heap full                          0.46       0.35      0.81
200,000    heap full                          0.96       1.15      2.11
400,000    heap full                          1.92       5.56      7.48

These measurements show clearly that the time consumed by the version using the ordinary or-combinator grows polynomially. Also for the xor-combinator version there seems to be some slight super-linear effect, especially in the garbage collection times. To understand this it is important to note that the parser cannot yield any result before the full stop indicating the end of the sentence is encountered. If the full stop is not found, it is not a proper sentence and the parser should yield an empty result. Since parsers of larger inputs have to handle larger data structures, the garbage collector will reclaim less memory. Hence we need a larger number of garbage collections. This explains the super-linear growth of the garbage collection time in the parser using xor-combinators. The changed complexity caused by using the eager repeat-combinators instead of the usual repeat-combinators explains the enormous increase of efficiency for the running example of this paper. In fact the efficiency gain of more than a factor of 1000 is just an example. If we increase the size of the input, the speed-up figure grows rapidly.
8 Conclusion
Parser combinators are a nice and clear way to define ambiguous recursive descent parser in functional programming languages. Unfortunately, the parsers that are usually constructed are not suited for large inputs since they consume enormous amounts of time and memory. This is caused by a polynomial complexity and an implementation that generates many intermediate data structures. In this paper we have introduced a new implementation technique for parser combinators based on continuations. This improves the speed of the parser by one order of magnitude and reduces the amount of memory required significantly by eliminating intermediate data structures. By introducing an exclusive or-combinator we are able to reduce the complexity of deterministic parsers from polynomial to linear. This makes it possible to use the parsers constructed by parser combinators for large inputs. However, we stay within the framework of ambiguous parser combinators. This implies that it is always possible to use ambiguity, introduced by the ordinary or-combinator, whenever this is desired.
As a result the example parser in this paper is only four times slower for large inputs than an efficient ad-hoc parser for the same problem. Using the standard parser combinators it was 5000 times slower. This indicates that the proposed changes turn the parser combinators from a nice toy for small problems into a beautiful and useful tool for real-world parsing problems.
9 Related and Future Work
Others have also experienced that parser combinators are attractive, but not suited for large inputs. Doaitse Swierstra and his group follow a different approach to tackle the problem. They exclude ambiguity from the beginning and construct parser combinators for deterministic grammars. Our work is more general since it allows ambiguity whenever the user wants it. The user can indicate that the parser is not ambiguous by using the xor-combinator instead of the or-combinator. If the parser is known to be non-ambiguous the complexity decreases from polynomial to linear. Our xor-combinator allows alternatives to overlap partially. In their approach the alternatives should be distinguished at the first symbol. Moreover, our combinator satisfy enables the use of a conditional function on the input symbols, while the Swierstra combinators require that all allowed symbols are listed explicitly by a symbol combinator. Hence, our combinators are easier to use than Swierstra's unambiguous grammars. The Swierstra combinators enable the generation of error messages. It is simple and cheap to extend our combinators in a similar way. In order to compare the speed of both approaches, we implemented their combinators in Clean and applied them to the running example, 2000 simple function alternatives and a huge expression.

parser        old-combinators             new-combinators             Swierstra-combinators
              ex (s)   gc (s)   tot (s)   ex (s)   gc (s)   tot (s)   ex (s)   gc (s)   tot (s)
sentence      120      345      465       0.26     0.08     0.34      0.55     0.28     0.83
functions     5.5      50.5     56.0      0.43     0.22     0.65      0.98     0.37     1.35
expression    73       104      177       0.36     0.19     0.55      0.51     0.33     0.84
As a conclusion we state that our combinators are more powerful (since they allow ambiguity), easier to use (since they allow alternatives to overlap partially and conditions on input symbols) and seem to be more efficient. Currently the user has to indicate whether the or-combinator or the xor-combinator has to be used. It would be convenient if the parser used the xor-combinator whenever this is possible. In principle it is possible to detect this automatically in a number of situations. If the branches of the or-combinators exclude each other (i.e. the parser is unambiguous) the or-combinator can be replaced by the xor-combinator. It looks feasible and seems attractive to construct such an automatic parser optimizer.
References 1. A. Aho, R. Sethi and J.D. Ullman. Compilers: Principles, Techniques and Tools. Addison-Wesley, 1986. 2. A. Appel. Compiling with Continuations. Cambridge University Press. 1992. 3. J. Fokker. Functional Parsers. In Advanced Functional Programming, 1st. International School on Functional Programming Techniques, B˚ astad, Sweden, volume 925 of LNCS, pages 1–23. Springer-Verlag, 1995. 4. A. Gill and S. Marlow. The Parser Generator for Haskell. University of Glasgow. 1995. 5. S. Hill. Combinators for Parsing Expressions. Journal of Functional Programming, 6(3):445–463, 1996. 6. G. Hutton. Higher Order Functions for Parsing. Journal of Functional Programming, 2:323–343, 1992. 7. S.C. Johnson. Yacc: Yet Another Compiler Compiler. UNIX On-Line Documentation. 1978. 8. P. Koopman. Parser Combinators. Chapter II.5 of Functional Programming in Clean In preparation (draft available at http://www.cs.kun.nl/~{}clean). 9. T. Mogensen. Ratatosk: A Parser Generator and Scanner Generator for Gofer. University of Copenhagen (DIKU), 1993. 10. S.L. Peyton Jones. Yacc in SASL, an Exercise in Functional Programming. Software: Practice and Experience, 15(8):807–820, 1995. 11. M.J. Plasmeijer and M.C.J.D. van Eekelen. The Concurrent Clean Language Report, Version 1.3. Nijmegen University, The Netherlands. 1998. http://www.cs.kun.nl/~{}clean. 12. D. Swierstra and L. Duponcheel. Deterministic, Error-Correcting Combinators Parsers. In Advanced Functional Programming. volume 1129 of LNCS, pages 185– 207, Springer-Verlag, 1996 13. P.L. Wadler. How to Replace Failure by a List of Successes: a Method for Exception Handling, Backtracking, and Pattern Matching in Lazy Functional Languages. In J.P. Jouannaud, editor Proc. 1985 Conference on Functional Programming Languages and Computer Architecture (FPLCA ’85). volume 201 of LNCS, pages 113– 128, Springer-Verlag, 1985.
A The Ad-Hoc Parser
For the sake of completeness we include the ad-hoc parser used in the measurements listed above. The parser has three states called word, sep and dot. In the state word the parser accepts characters for the current word. When the character cannot be part of this word we go to that state sep. In this state we read separations. When the separation is empty and the input character is not part of the separation the parser enters state dot. Otherwise the parser returns to word. The parser uses accumulators to construct the current word w and the sentence s. The accumulators are constructed in reversed order to avoid polynomial complexity. When accumulation is finished the accumulator is reversed to obtain the right order. manParse :: [Char] -> [([Char],[String])] manParse input = word [] [] input where word w s [c:r] | isAlpha c = word [c:w] s r word w s input = sep [] [toString (reverse w):s] input sep l s [c:r] | isSpace c || c == ’,’ = sep [c:l] s r sep [] s input = dot s input sep _ s input = word [] s input dot [] input = [] dot s [’.’:r] = [(r,reverse s)] dot _ _ = []
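A small usage sketch (not part of the original appendix): applied to a short sentence, manParse returns the remaining input together with the recognized words.
Start = manParse ['Hello world.']
// yields [([],["Hello","world"])]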
On the Unification of Substitutions in Type Inference Bruce J. McAdam Laboratory for Foundations of Computer Science The University of Edinburgh
[email protected] Abstract. The response of compilers to programs with type errors can be unpredictable and confusing. In part this is because the point at which type inference fails may not be the point at which the programmer has made a mistake. This paper explores a way of making type inference algorithms fail at different locations in programs so that clearer error messages may then be produced. Critical to the operation of type inference algorithms is their use of substitutions. We will see that the way in which substitutions are applied in type inference algorithm W means that errors are detected towards the right-hand side of expressions. This paper introduces a new operation — unification of substitutions — which allows greater control over the use of substitutions so that this bias can be removed.
1 Introduction
One benefit of having a type system for a programming language is that programs which do not have types can be rejected by the compiler. This is of help to programmers as it assures them of one aspect of program correctness; however, a difficulty many programmers find in programming is that when their program is rejected by a compiler they are not given enough information to easily locate and repair the mistakes they have made. This paper deals with Hindley-Milner type systems such as that of Standard ML. Programmers using these type systems frequently complain about the poor quality of type error messages. We will see in Sect. 2 that one complaint is that the part of the program highlighted by the compiler is generally not the part of the program in which the programmer made the mistake. In Sect. 3 we examine the type inference algorithm and see that it has a left-to-right bias towards detecting problems late in the code and this bias is caused by the way unification and substitutions are used. The solution to this problem with the conventional inference algorithm is a new type inference algorithm designed from a pragmatic perspective. The key idea here is that the algorithm should be symmetric, treating subexpressions identically so that there is no bias causing errors to tend to be reported in one part of a program rather than another. The new algorithm rests upon the novel concept of the unification of substitutions to allow the symmetric treatment of
subexpressions. Sect. 4 introduces the operation of unifying substitutions and discusses how it will remove the left-to-right bias from type inference. Sect. 5 presents a variation of the classic type inference algorithm for Hindley-Milner type systems which uses this substitution unifying procedure. Further uses of unifying substitutions are given in Sect. 6. Issues in implementing these ideas are discussed in Sect. 7. A summary of the conclusions of this paper is in Sect. 8.
2 Motivation
Type systems play two important roles in software development. The programmer has simple static properties of programs machine-checked, and the compiler can optimise on the basis of types. We are concerned with the first of these roles, and in particular with the fact that the compiler rejects untypeable programs as they are not guaranteed to have a meaningful interpretation in the language semantics. The problem with this is that the burden of correcting the program lies in the programmer's hands and often there is little help the compiler can give. Compilers typically tell the programmer at which point in the program the error was found, but generally this is not the part of the program in which the programmer made the mistake and it is unclear why type checking failed at this point.
2.1 An Inherent Problem with Hindley-Milner
Two key features of Hindley-Milner type systems are of particular interest to this paper. Implicit typing means that the programmer need not say what type an expression or identifier should have, the type system (and inference algorithm) can infer most types with only a small number of typing assertions necessary. Polymorphic typing means that types can be flexible. For example, a function might take a value of any type and return a value of the same type. In this case the function can have type τ → τ for any type τ . We denote this with the polymorphic type scheme ∀α.α → α. It is not necessary for the programmer to specify that a function is polymorphic, type inference will discover this fact. From the point of view of finding the location of mistakes in a program, these features are weaknesses. The only way to detect a mistake is to find an inconsistency between two parts of the program. This is in contrast to explicit type systems in which the inconsistencies are typically between the use of identifiers and the declarations of their types. So with Hindley-Milner based languages, rather than being able to establish that either the use of an identifier or its declaration is incorrect, there are three possibilities: the first expression may be incorrect, the second may be incorrect, or the problem may lie in the context. Because of polymorphism some program fragments which contain mistakes
will still have a type — but not the type the programmer expected. This can lead to a cascading effect: spurious errors are announced later when newly derived types are found to be in conflict with the type derived for the mistake.
2.2 Examples of Untypeable Expressions
Let us first consider a simple λ-calculus example of a Hindley-Milner untypeable expression. This function should take a real number and return the sum of the original number and its square root:
λa.add a (sqrt a)
A typical error message from a type checker would read:
Cannot apply sqrt : real → real to a : int.
A programmer, seeing this message, may be confused as a is intended to be real. So the mistake cannot be that sqrt is being applied to a; it must be that something else causes a to appear to be an int. The source of the problem will become apparent if we look at the type environment the expression is checked inside:
add : int → int → int
sqrt : real → real
The mistake has been to use integer addition where real number addition is required. Clearly in this case the error message is inappropriate as there is as much evidence that a should be a real as there is that it should be an int. The type checker has incorrectly given priority to the information derived from the leftmost subexpression — it has a left-to-right bias. It would have been more informative in the example if the type checker had pointed out that there was an inconsistency between the two subexpressions, instead of falsely claiming that either was internally inconsistent. A classic example used to illustrate the monomorphism of function arguments is
λI.(I 3, I true)
The programmer has written a function which expects the identity function as an argument; this is not possible in Hindley-Milner type systems as arguments cannot be used polymorphically. When a compiler is faced with this expression it type checks from left to right, first establishing that I must have a type of the form int → α and then finding that I cannot, therefore, be applied to true of type bool. The programmer will be given an error message of the form
Cannot apply I : int → α to true : bool.
This message implies that there is an inconsistency inside I true. In fact there is an inconsistency in the use of I between the two subexpressions. The algorithm in this paper will find this inconsistency in the uses of I between the subexpressions of (I 3, I true) instead of finding an apparent mistake in I true.
2.3 Alternative Approaches to the Problem
A number of other authors have looked into the problem of generating error messages about the appropriate parts of programs. Johnson and Walz [7] looked at the difficulty in working out how unification fails. They applied their method to the classic inference algorithm, so it still suffers from a left-to-right bias. It is likely that their ideas could be applied to unification of substitutions to both eliminate the left-to-right bias in the point at which unification fails, and generate improved error messages about why unification failed. Duggan and Bent [6] provided another system for recording why substitutions were made during unification. Their interest is in producing meaningful messages after some unification fails, whereas this paper is interested in making unification fail at the appropriate point. Beaven and Stansifer [1] provided another explanation facility. Bernstein and Stark [2] provide a novel type inference algorithm based on the unification of type environments. Like the algorithm in this paper, their’s is symmetric and has no left-to-right bias. Lee and Yi [9] discuss a folklore algorithm. They prove that it can reject programs sooner than the classic algorithm. One might suppose that this will eliminate left-to-right bias, but this is not the case. Given an untypeable application, e0 e1 , the folklore algorithm may reject during the recursive call for e1 when e0 and e1 are incompatible. In order to remove the left-to-right bias rejection must be delayed until the entire expression e0 e1 is being examined.
3 Type Systems and the Classic Inference Algorithm
This section begins by summarising the language and type system used in the remainder of the paper. An introductory discussion of type systems and their inference algorithms can be found in [3]. Proofs of algorithm W's correctness can be found in [4].
3.1 Types, Type Schemes and Type Environments
In this paper we will consider types which are built from type variables (α, β, . . .) (hence types are polymorphic); type constants (int, real and others); and the function type constructor →. The form of polymorphism in Hindley-Milner type systems arises from the use of type schemes. These are types with some (or all) of their type variables universally quantified, for example ∀α.α → β. A type, τ′, is a generic instance of a type scheme, ∀α1 · · · αn.τ, if τ′ can be obtained from τ by substituting types for the type variables α1 · · · αn. In this paper, we are not particularly concerned with type schemes as substitutions refer to type variables which are not universally quantified. Type inference starts from a type environment, Γ, which is a map from identifiers to type schemes. Γ can be augmented with new terms, for example
after a declaration, and can have terms removed from it (Γx is Γ with any term for x removed). Type schemes are obtained from types by closing a type under a type environment. Γ(τ) (the closure of τ under Γ) is the type scheme ∀α1 · · · αn.τ where α1 · · · αn are the type variables occurring in τ but which do not occur free in Γ. In particular, closing a type under a type environment with no free type variables results in every type variable in the type being universally quantified.

Component            Values
Type Variables       α, β, . . .
Types                τ ::= α | τ0 → τ1 | int | real | . . .
Type Schemes         σ = ∀α0 · · · αn.τ
Type Environments    Γ = {x0 : σ0, · · · , xn : σn}
Table 1. Components of type systems.
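For instance, if Γ = {x : α} then Γ(α → β) = ∀β.α → β: here β is universally quantified, but α is not, since α occurs free in Γ.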
3.2 Language and Type System
Hindley-Milner type systems are formulated as non-deterministic transition systems. In this paper, we will look at a simple λ-with-let-calculus (as in Fig. 1) and will be particularly interested in the rule for deriving types of applications. The type system is in Fig. 2.
e ::= x | e0 e1 | λx.e | let x = e0 in e1
Fig. 1. Syntax of the λ-with-let-calculus.
The type inference rule App states that if (given the type environment Γ) subexpression e0 has type τ′ → τ (it is a function), and similarly e1 has type τ′ under the same type environment (it is a suitable argument for the function), then the application of e0 to e1 has type τ. The non-determinism in this case arises from the function argument type τ′. If we are attempting to show Γ ⊢ e0 e1 : τ, there is no way of knowing what τ′ to use in the sub-derivation for each of e0 and e1. Note that the inference rule tells us that in order for e0 e1 to be typeable, three conditions must be satisfied: e0 must be typeable, e1 must be typeable
Γ(x) > τ
---------------  (Var)
Γ ⊢ x : τ

Γ ⊢ e0 : τ′ → τ    Γ ⊢ e1 : τ′
--------------------------------  (App)
Γ ⊢ e0 e1 : τ

Γx ∪ {x : τ0} ⊢ e : τ1
--------------------------------  (Abs)
Γ ⊢ λx.e : τ0 → τ1

Γ ⊢ e0 : τ0    Γx ∪ {x : Γ(τ0)} ⊢ e1 : τ1
-------------------------------------------  (Let)
Γ ⊢ let x = e0 in e1 : τ1
Fig. 2. Type derivation rules. and the types of the two must be compatible for application. It is, therefore, desirable for error messages to describe a mistake as being in e0 or e1 , or as an incompatibility between them. We saw in Sect. 2.2 that this does not happen with current type checkers. Sometimes they announce an incompatibility as if it was a problem inside e1 . 3.3
3.3 Substitutions
In addition to types, type schemes and type environments, type inference algorithms make extensive use of substitutions. A substitution is a finite mapping from type variables to types. Substitutions are denoted by a set of mappings, {α1 ↦ τ1, · · · , αn ↦ τn}. A substitution represents a means of refining types. If we know that a certain type (containing type variables) is associated with an expression, and that a substitution is also associated with it, then we can apply the substitution to the type to refine it. Both the classic algorithm and the new algorithm in this paper work by refining types using substitutions. All substitutions must meet a well-formedness criterion.

Definition 1 (Well-formedness). A substitution S is said to be well-formed whenever dom(S) ∩ FV(S) = {}.

This restriction prevents us from getting 'infinite' types (types which contain themselves). Substitutions can be composed: S1S0 is the substitution which has the same effect as applying first S0 and then S1. S1S0 exists iff it is well-formed, so as well as both substitutions being well-formed it is necessary that FV(S1) ∩ dom(S0) = {}. We define an ordering on types: τ > τ′ iff ∃S : Sτ = τ′. Also, a type environment, Γ, has an instance, Γ′, iff ∃S : SΓ = Γ′.
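As an illustration, a substitution can be represented as a finite map over the Type sketch above. The following hedged Haskell fragment (all names are invented for illustration) gives application, the free-variable and well-formedness checks, and composition S1S0:

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set

type Subst = Map.Map TyVar Type

-- Apply a substitution to a type (a single pass suffices for well-formed
-- substitutions, since dom(S) ∩ FV(S) = {}).
applySubst :: Subst -> Type -> Type
applySubst s (TVar a)     = Map.findWithDefault (TVar a) a s
applySubst s (TFun t0 t1) = TFun (applySubst s t0) (applySubst s t1)
applySubst _ t            = t

freeVars :: Type -> Set.Set TyVar
freeVars (TVar a)     = Set.singleton a
freeVars (TFun t0 t1) = freeVars t0 `Set.union` freeVars t1
freeVars _            = Set.empty

-- FV(S): the free type variables of the range of S.
fvSubst :: Subst -> Set.Set TyVar
fvSubst = Set.unions . map freeVars . Map.elems

-- Definition 1: dom(S) ∩ FV(S) = {}.
wellFormed :: Subst -> Bool
wellFormed s = Set.null (Map.keysSet s `Set.intersection` fvSubst s)

-- S1 S0: apply S0 first, then S1.
compose :: Subst -> Subst -> Subst
compose s1 s0 = Map.map (applySubst s1) s0 `Map.union` s1
```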
3.4 The Classic Inference Algorithm
The classic inference algorithm, W , is a deterministic simulation of the derivation rules. For a particular type environment and expression, it attempts to find a type for the expression and a substitution of types for type variables such that the
expression has the type under the substituted type environment. The algorithm traverses the structure of the expression building up substitutions and types as shown in Fig. 3. This paper is concerned with the case of the algorithm which handles function applications.
W(Γ, x) = let ∀α1 · · · αn.τ = Γ(x)
          in ({}, τ[β1/α1, · · · , βn/αn])   for new β1 · · · βn

W(Γ, e0 e1) = let (S0, τ0) = W(Γ, e0)
                  (S1, τ1) = W(S0Γ, e1)
                  τ0′ = S1τ0
                  V = U(τ0′, τ1 → β)   for new β
              in (V S1S0, V β)

W(Γ, λx.e) = let (S, τ) = W(Γx ∪ {x : β}, e)   for new β
             in (S, Sβ → τ)

W(Γ, let x = e0 in e1) = let (S0, τ0) = W(Γ, e0)
                             (S1, τ1) = W(S0Γx ∪ {x : S0Γ(τ0)}, e1)
                         in (S1S0, τ1)
Fig. 3. The classic type inference algorithm, W .
The most significant part of this case is the use of unification. The algorithm U returns the most general substitution which, when applied to each of its argument types, will produce the same type, for example U(int → α, β → real) = {α ↦ real, β ↦ int}. A survey of applications and techniques for unification can be found in [8]. Type inference fails if no unifier exists. When inference fails because of a failure to unify, implementations produce error messages indicating a problem with the subexpression of current interest. The result of type inference shown to the programmer (when it succeeds) is a polymorphic type scheme. If W returns (S, τ) then the type scheme is SΓ(τ). Since Γ typically does not have any free type variables, all type variables in the result type will normally be universally bound.
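For concreteness, a Robinson-style unifier for the Type sketch above might look as follows. This is an illustrative fragment, not the paper's own code; it reuses applySubst, compose and freeVars from the earlier sketch.

```haskell
-- unify t0 t1 returns the most general substitution V with V t0 = V t1,
-- or Nothing if no unifier exists.
unify :: Type -> Type -> Maybe Subst
unify (TVar a) t = bindVar a t
unify t (TVar a) = bindVar a t
unify (TFun a0 b0) (TFun a1 b1) = do
  s0 <- unify a0 a1
  s1 <- unify (applySubst s0 b0) (applySubst s0 b1)
  return (s1 `compose` s0)
unify TInt  TInt  = Just Map.empty
unify TReal TReal = Just Map.empty
unify _     _     = Nothing

bindVar :: TyVar -> Type -> Maybe Subst
bindVar a t
  | t == TVar a               = Just Map.empty
  | a `Set.member` freeVars t = Nothing               -- occurs check
  | otherwise                 = Just (Map.singleton a t)
```

For example, `unify (TFun TInt (TVar "a")) (TFun (TVar "b") TReal)` yields the substitution {α ↦ real, β ↦ int} quoted in the text.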
The action of the algorithm satisfies a soundness theorem and a completeness theorem.

Theorem 1 (Soundness of W). If W(Γ, e) succeeds with (S, τ) then there is a derivation of SΓ ⊢ e : τ.

Theorem 2 (Completeness of W). Given (Γ, e), let Γ′ be an instance of Γ and η be a type scheme such that Γ′ ⊢ e : η. Then
1. W(Γ, e) succeeds;
2. if W(Γ, e) = (P, π) then for some R: Γ′ = RPΓ, and η is a generic instance of RPΓ(π).

Proofs of these theorems can be found in [4]. We will revisit these theorems after the algorithm is modified later in the paper.
3.5 Source of Left-to-Right Bias
The left-to-right bias arises in application expressions because the first substitution, S0, is applied to the type environment, Γ, before type checking e1. This means that if an identifier is used incorrectly in e0 and used correctly in e1, inference on e1 could fail and wrongly imply that e1 contains an error. Table 2 shows the different ways in which the application case of W can fail, and how these can be interpreted. The concern of this paper is that it is not possible to differentiate inconsistencies between e0 and e1 from inconsistencies inside e1.
Point of failure                 Possible meanings
Recursive call W(Γ, e0)          There is an internal inconsistency in e0.
                                 e0 is incompatible with Γ.
Recursive call W(S0Γ, e1)        There is an internal inconsistency in e1.
                                 e1 is incompatible with Γ.
                                 There is an inconsistency between e0 and e1.
Unification U(S1τ0, τ1 → β)      e0 cannot be applied to e1.

Table 2. Ways in which W can fail to type check an application expression.
It is the third cause of failure of W(S0Γ, e1) which we wish to eliminate. The solution proposed in this paper is to delay applying the substitutions to Γ by having some means for combining substitutions. Such a means is described in the next section.
4 An Algorithm for Unifying Substitutions
We have already seen some examples demonstrating the left-to-right bias of W. We have also seen how the algorithm works and know why the bias arises in the case of application. The objective of the new algorithm is to allow us to infer types and substitutions for each subexpression independently. The new algorithm US deals with combining substitutions; the next section shows how to modify W to make use of it, and Sect. 6 shows how to further extend the algorithm and apply it to other type inference algorithms and other type systems.
4.1 The Idea — Unifying Substitutions
To treat the subexpressions e0 and e1 independently in a modified version of W, the recursive calls must be W(Γ, e0) and W(Γ, e1). This will yield two result pairs: (S0, τ0) and (S1, τ1). It is necessary then to
– check that the two substitutions are consistent,
– apply terms from S0 to τ1 and from S1 to τ0 so that the resulting types have no free type variables in the domain of either substitution, and
– return a well-formed substitution containing entries from both S0 and S1.
The second of these requirements is not simply S0τ1 and S1τ0, because these could have unwanted free type variables. Likewise the third of these is not simply S1S0 or S0S1. The essence of these three operations can be summarised in these two requirements:
– check the substitutions are consistent, and if they are
– create a substitution which contains the effect of both.
We must unify the two substitutions.
4.2 Examples
Before we look at the algorithm for unifying substitutions, it will be worthwhile seeing some examples. The simplest case is where the two substitutions are completely independent:

S0 = {α ↦ int}    S1 = {β ↦ γ}    US(S0, S1) = {α ↦ int, β ↦ γ}

If the domains of S0 and S1 contain a common element, we must unify the relevant types:

S0 = {α ↦ int → β}    S1 = {α ↦ γ → real}    US(S0, S1) = {β ↦ real, γ ↦ int}
Note that equivalent results cannot be obtained simply by composing the substitutions (S0S1 or S1S0). The previous example would have occurred inside the lambda term λf.λx.(f 1) + ((f x) + 0.1): S0 is the substitution produced from f 1 and S1 comes from (f x) + 0.1 (α is the type variable for f). Substitution unification can fail, for example with

S0 = {β ↦ α → real}    S1 = {β ↦ real → real, α ↦ int}

There is an inconsistency between the instantiations of α in this case. Unification can also fail with an occurs error:

S0 = {α ↦ int → β}    S1 = {β ↦ int → α}

Clearly the two substitutions here imply that α and β should map to infinite types, hence the two substitutions cannot be unified.
4.3 Formal Definition
A substitution, S′, unifies substitutions S0 and S1 if S′S0 = S′S1. In particular, the most general unifier of a pair of substitutions is S′ such that

(S′S0 = S′S1) ∧ (∀S″ : (S″S0 = S″S1) ⇒ (∃R : S″ = RS′))

i.e. S′ unifies S0 and S1, and S′ can be augmented to be equivalent to any other unifying substitution. The unified substitution, S′S0, has the effect of both S0 and S1, since S′S0α < S0α and S′S0α < S1α, for all α.
4.4 Algorithm US
Algorithm US computes the most general unifier of a pair of substitutions. To see how the algorithm works, note that the domain of the result consists of three parts, as shown in Fig. 4. The algorithm to be introduced here deals with each of the three parts of the domain separately. The free variables in the range of the unifier are free in either S0 or S1 and are in the domains of neither. The algorithm, commented in italics, is in Fig. 5. Note that it can terminate with an occurs error as indicated, or if U fails.
4.5 Verification of US
It must be shown that US does indeed compute the most general unifier of a pair of substitutions. Two theorems define this property.

Theorem 3. For any pair of substitutions, S0 and S1, if US(S0, S1) succeeds then it returns a unifying substitution.
Fig. 4. The domain of US(S0, S1) consists of three parts (shaded): the disjoint parts of the domains of S0 and S1, and the free variables in their ranges where their domains overlap.
US(S0, S1) = let
    First split the domains into three disjoint parts:
    D0 = dom(S0) − dom(S1)        T0 = {α ↦ S0α : α ∈ D0}
    D1 = dom(S1) − dom(S0)        T1 = {α ↦ S1α : α ∈ D1}
    D∩ = dom(S0) ∩ dom(S1)
    Note: FV(T0) ∩ dom(S0) = {}, similarly for T1.
    Start with T0 and add terms for D1 one at a time, always producing
    well-formed substitutions:
    S′0 = T0
    {α1 ↦ τ1, · · · , αn ↦ τn} = T1
    S′i+1 = let
        Remove elements of dom(S′i) from τi+1, and remove αi+1 from S′i:
        τ′i+1 = S′i τi+1
        If αi+1 ∈ FV(τ′i+1) terminate (occurs error)
      in {αi+1 ↦ τ′i+1} S′i
    S′n is the unifier for T0 and T1.
    Now deal with items in D∩ = {β1 · · · βm}:
    V0 = S′n
    Vi+1 = let
        τ0 = Vi S0 βi+1
        τ1 = Vi S1 βi+1
        If βi+1 ∈ FV(τ0) ∪ FV(τ1) terminate (occurs error)
      in U(τ0, τ1) Vi
  in Vm

Fig. 5. Algorithm US commented in italics.
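A compact way to realise US in the Haskell sketch built up so far is to observe that, by the definition in Sect. 4.3, S′ must unify S0γ with S1γ for every γ in dom(S0) ∪ dom(S1). The hedged fragment below therefore unifies the two substitutions pointwise rather than transcribing Fig. 5 literally; it computes a most general unifier in the sense of Sect. 4.3, with occurs errors surfacing as unification failures.

```haskell
-- Simultaneous unification of two lists of types.
unifyList :: [Type] -> [Type] -> Maybe Subst
unifyList []       []       = Just Map.empty
unifyList (t0:ts0) (t1:ts1) = do
  s  <- unify t0 t1
  s' <- unifyList (map (applySubst s) ts0) (map (applySubst s) ts1)
  return (s' `compose` s)
unifyList _        _        = Nothing

-- US(S0, S1): a most general S′ with S′S0 = S′S1, if one exists.
unifySubst :: Subst -> Subst -> Maybe Subst
unifySubst s0 s1 =
  unifyList (map (applySubst s0) dom) (map (applySubst s1) dom)
  where dom = map TVar (Set.toList (Map.keysSet s0 `Set.union` Map.keysSet s1))
```

On the first example of Sect. 4.2 this gives {α ↦ int, β ↦ γ}, and on the second {β ↦ real, γ ↦ int}, matching the results quoted there.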
Theorem 4. If S″ unifies S0 and S1 then
1. US(S0, S1) succeeds, returning S′, and
2. there is some R such that S″ = RS′.

Proofs of both these propositions can be found in the extended version of this paper [10].
5 The New Version of W
Now that we know what it means to unify two substitutions and have seen that this is possible, let us look at the new type inference algorithm, W′, in Fig. 6. This differs from W only in the case for applications.
W′(Γ, e0 e1) = let (S0, τ0) = W′(Γ, e0)
                   (S1, τ1) = W′(Γ, e1)
                   S′ = US(S0, S1)
                   τ0′ = S′τ0
                   τ1′ = S′τ1
                   V = U(τ0′, τ1′ → β)   for new β
               in (V S′S0, V β)

Fig. 6. Algorithm W′, case for application.

As stated earlier, the algorithm treats e0 and e1 symmetrically and features US in an analogous manner to (and in addition to) U.
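The difference between the two application cases can be seen in the following Haskell sketch. It is illustrative only: it reuses the earlier fragments, assumes the recursive checker is passed in as `w`, and takes the fresh variable β as an argument to avoid threading a name supply.

```haskell
type Infer = TypeEnv -> Expr -> Maybe (Subst, Type)

-- Apply a substitution to every scheme in Γ (bound variables are assumed
-- to be disjoint from dom(S) in this sketch).
applySubstEnv :: Subst -> TypeEnv -> TypeEnv
applySubstEnv s = map (\(x, Forall as t) -> (x, Forall as (applySubst s t)))

-- Classic W, application case (Fig. 3): e1 is checked under S0Γ.
appW :: Infer -> TyVar -> TypeEnv -> Expr -> Expr -> Maybe (Subst, Type)
appW w beta gamma e0 e1 = do
  (s0, t0) <- w gamma e0
  (s1, t1) <- w (applySubstEnv s0 gamma) e1
  v <- unify (applySubst s1 t0) (TFun t1 (TVar beta))
  return (v `compose` s1 `compose` s0, applySubst v (TVar beta))

-- Modified W′, application case (Fig. 6): both subexpressions are checked
-- under Γ and the resulting substitutions are unified afterwards.
appW' :: Infer -> TyVar -> TypeEnv -> Expr -> Expr -> Maybe (Subst, Type)
appW' w beta gamma e0 e1 = do
  (s0, t0) <- w gamma e0
  (s1, t1) <- w gamma e1
  s' <- unifySubst s0 s1
  v  <- unify (applySubst s' t0) (TFun (applySubst s' t1) (TVar beta))
  return (v `compose` (s' `compose` s0), applySubst v (TVar beta))
```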
5.1 Correctness of W′
Algorithm W′ should produce the same results as W. To verify this it is necessary to prove the soundness and completeness theorems for W′. These theorems are analogous to those which Damas proved for W. The algorithm is sound if every answer it gives is a type for the parameter expression under the type environment obtained from applying the substitution to the original type environment.

Theorem 5 (Soundness of W′). If W′(Γ, e) succeeds with (S, τ) then there is a derivation of SΓ ⊢ e : τ.

This theorem is proved in [10]. The completeness result states that if a type exists for an expression, then W′ returns a type which is at least as general as the type known to exist.
Theorem 6 (Completeness of W′). Given Γ and e, let Γ′ be an instance of Γ and η be a type scheme such that Γ′ ⊢ e : η. Then
1. W′(Γ, e) succeeds;
2. if W′(Γ, e) = (P, π) then for some R: Γ′ = RPΓ, and η is a generic instance of RPΓ(π).

The proof of this theorem can also be found in [10]. Because W′ satisfies the same soundness and completeness theorems as W, and we know that the solutions of these theorems are unique (from the principal type scheme theorem of [5]), we know that W′ always produces the same results as W.

Corollary 1 (W′ and W are equivalent). For any pair (Γ, e), W(Γ, e) succeeds and returns (S, τ) if and only if W′(Γ, e) succeeds and returns (S, τ).
5.2 Interpreting the Failure of W′
The argument for using W′ is that it is possible to create better error messages when it fails than is possible with W. First, recall the ways in which the application case of W can fail and the possible causes of this, as given in Table 2. In particular, when W(S0Γ, e1) fails this may be caused by e1 being incompatible with a mistake in e0. This is the sort of error message which programmers find so frustrating, as it is not easy to find the original source of the problem from it. Now consider the possible causes of failure of the application case of W′, given in Table 3.

Point of failure                      Possible meanings
Recursive call W(Γ, e0)               There is an internal inconsistency in e0.
                                      e0 is incompatible with Γ.
Recursive call W(Γ, e1)               There is an internal inconsistency in e1.
                                      e1 is incompatible with Γ.
Substitution unification US(S0, S1)   There is an inconsistency between e0 and e1.
Unification U(S1τ0, τ1 → β)           e1 is not a suitable argument for e0.

Table 3. Ways in which W′ can fail to type check an application expression.
Clearly from this, type checkers using W′ can produce more informative error messages than those using W.
6 Other uses of US
This section explores further uses of US . It can be used to type check larger expressions than simple applications symmetrically (for example curried expressions and tuples); and it can be used in other type inference algorithms.
6.1 Larger Syntactic Structures
The type inference algorithm given earlier treated application expressions symmetrically. Similarly, the treatment of tuples and records is typically asymmetric, and US can be used to eliminate this asymmetry. Not only can US be used to treat simple application expressions symmetrically — it can also be used for curried applications of the form e0 e1 · · · en. Each of the subexpressions must be type checked, then all the resulting substitutions are unified and the type of the curried expression is found. The advantage of this technique is that it allows type checking to follow the structure of the program in the way the programmer views it.
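A hedged sketch of the curried case e0 e1 · · · en, building on the fragments above. The pairwise `combine` helper, which merges a substitution with the effect of another via US, is an illustrative device and not taken from the paper.

```haskell
import Control.Monad (foldM)

-- Merge Sb into Sa: the unified substitution S′Sa has the effect of both.
combine :: Subst -> Subst -> Maybe Subst
combine sa sb = do
  s' <- unifySubst sa sb
  return (s' `compose` sa)

-- e0 e1 ... en: check every subexpression under the same Γ, unify all the
-- substitutions, then match e0's type against τ1 → ... → τn → β.
appCurried :: Infer -> TyVar -> TypeEnv -> Expr -> [Expr] -> Maybe (Subst, Type)
appCurried w beta gamma e0 args = do
  (s0, t0) <- w gamma e0
  results  <- mapM (w gamma) args
  sAll     <- foldM combine s0 (map fst results)
  let wanted = foldr TFun (TVar beta) (map (applySubst sAll . snd) results)
  v <- unify (applySubst sAll t0) wanted
  return (v `compose` sAll, applySubst v (TVar beta))
```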
6.2 Another Type Inference Algorithm, M
An alternative type inference algorithm for Hindley-Milner type systems is M [9]. This is a top-down algorithm which attempts to check that an expression has a type suitable for its context. The algorithm takes a type environment, Γ, an expression, e, and a target type, τ, and returns a substitution, S, such that SΓ ⊢ e : Sτ. The case of the algorithm for application expressions is shown in Fig. 7.
M(Γ, e0 e1, τ) = let S0 = M(Γ, e0, β → τ)
                     S1 = M(S0Γ, e1, S0β)
                 in S1S0

Fig. 7. The application case of algorithm M.

It is clear that this algorithm suffers from the same left-to-right bias as W, but it is a simple matter to change M to remove the bias, as can be seen from Fig. 8.
M′(Γ, e0 e1, τ) = let S0 = M′(Γ, e0, β → τ)
                      S1 = M′(Γ, e1, β)
                  in (US(S1, S0))S0

Fig. 8. The application case for the modified M.

If the inference M′(Γ, e0, β → τ) fails, this implies that e0 is not a function, or does not have the correct return type. The inference M′(Γ, e1, β) will fail if
and only if Γ and e1 are inconsistent (there is no typing for Γ ⊢ e1). If the unification fails then either e1 is not a suitable argument for e0, or there is some other inconsistency between them.
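Rendered in the same Haskell sketch, with the target type passed down and `m` standing for the recursive checker (illustrative only, with the same fresh-β convention as before), the two application cases of Figs. 7 and 8 might look like this:

```haskell
type InferM = TypeEnv -> Expr -> Type -> Maybe Subst

-- M, application case (Fig. 7): left-to-right, e1 is checked under S0Γ
-- against the refined target type S0β.
appM :: InferM -> TyVar -> TypeEnv -> Expr -> Expr -> Type -> Maybe Subst
appM m beta gamma e0 e1 tau = do
  s0 <- m gamma e0 (TFun (TVar beta) tau)
  s1 <- m (applySubstEnv s0 gamma) e1 (applySubst s0 (TVar beta))
  return (s1 `compose` s0)

-- M′, application case (Fig. 8): both checks run under Γ, then US combines
-- the resulting substitutions.
appM' :: InferM -> TyVar -> TypeEnv -> Expr -> Expr -> Type -> Maybe Subst
appM' m beta gamma e0 e1 tau = do
  s0 <- m gamma e0 (TFun (TVar beta) tau)
  s1 <- m gamma e1 (TVar beta)
  s' <- unifySubst s1 s0
  return (s' `compose` s0)
```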
7 Implementation
The modified version of W has been implemented for a simple λ-with-let calculus. The implementation has also been extended to deal with curried expressions as described in Sect. 6.1. One difficulty in implementing type inference using US for a full programming language is that it prevents substitutions from being implemented using references. In most compilers, instead of storing explicit substitutions, a type variable, α, is represented as a reference which can refer to a type, τ, to implement the substitution term α ↦ τ. It has been suggested [11] that this style of implementation is more efficient for large projects (though [12] suggests that explicit substitutions are equally efficient for smaller implementations). Because updating global references represents a greedy strategy, it is in conflict with the cautious approach taken in this paper of waiting until as late as possible to apply substitutions. The ideal way of further investigating the ideas in this paper is an implementation in a full-scale compiler. This should be experimentally tested by comparing how programmers deal with problems using modified and unmodified versions of the compiler. An issue not dealt with in this paper which will become relevant in doing this is the actual generation of error messages. Most compilers use ad hoc methods which lead to idiosyncratic messages. It is not clear how this would affect experimental testing of this paper's ideas, but it seems that the choice of compiler would be important.
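For illustration, the reference-based representation alluded to here can be sketched as follows. This is a hypothetical fragment, not from the paper: a type variable is a mutable cell, and instantiating it is a destructive, global update, which is exactly the eager commitment that delaying substitutions tries to avoid.

```haskell
import Data.IORef

-- Types whose variables are mutable cells: Nothing means still a variable,
-- Just t means already instantiated to t.
data RType
  = RVar (IORef (Maybe RType))
  | RFun RType RType
  | RInt

-- Binding α ↦ τ updates the cell in place; every part of the program that
-- shares this cell sees the instantiation immediately.
bindRef :: IORef (Maybe RType) -> RType -> IO ()
bindRef r t = writeIORef r (Just t)
```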
8 Conclusions
This paper introduced the concept of unifying substitutions in type inference. An algorithm, US, for this was defined. Proofs of properties of US can be found in [10]. The classic inference algorithm, W, was modified so that the recursive calls when type checking application expressions are independent of each other. This was achieved using US. Proofs that the modified version of W is correct can be found in [10]. The new type inference algorithm has the advantage that it has no left-to-right bias in the locations in which type errors are found (when type checking application expressions). This means that a type inconsistency between the function and argument will never be reported as if it were an internal inconsistency in the argument. Thus, one form of error message confusing to programmers has been removed. The new inference algorithm can be extended to deal with more complex syntactic structures such as curried expressions and tuples. US can also be used to remove the left-to-right bias in other type inference algorithms.
The algorithm has been implemented for a toy language, and issues for a large-scale implementation have been discussed.

Acknowledgements. Thanks to Ian Stark and Stephen Gilmore for commenting on early versions of this paper. This work is supported by an EPSRC research studentship.
References
1. M. Beaven and R. Stansifer. Explaining Type Errors in Polymorphic Languages. ACM Letters on Programming Languages and Systems, 2(1):17–30, March 1993.
2. K.L. Bernstein and E.W. Stark. Debugging Type Errors (Full version). Technical report, State University of New York at Stony Brook, Computer Science Department, November 1995. http://www.cs.sunysb.edu/~stark/REPORTS/INDEX.html.
3. L. Cardelli. Basic Polymorphic Type-Checking. Science of Computer Programming, 8(2):147–172, 1987.
4. L.M.M. Damas. Type Assignment in Programming Languages. PhD thesis, Department of Computer Science, The University of Edinburgh, April 1985. CST–33–85.
5. L.M.M. Damas and A.J.R.G. Milner. Principal Type-Schemes for Functional Programs. In Proc. 9th ACM Symposium on Principles of Programming Languages (POPL '82), pages 207–212, 1982.
6. D. Duggan and F. Bent. Explaining Type Inference. Science of Computer Programming, 27:37–83, 1996.
7. G.F. Johnson and J.A. Walz. A Maximum-Flow Approach to Anomaly Isolation in Unification-Based Incremental Type-Inference. In Proc. 13th ACM Symposium on Principles of Programming Languages (POPL '86), pages 44–57, 1986.
8. K. Knight. Unification: A Multidisciplinary Survey. ACM Computing Surveys, 21(1):93–124, 1989.
9. O. Lee and K. Yi. Proofs About a Folklore Let-Polymorphic Type Inference Algorithm. ACM Transactions on Programming Languages and Systems, 1999. To appear.
10. B.J. McAdam. On the Unification of Substitutions in Type-Inference. Technical Report ECS–LFCS–98–384, Laboratory for Foundations of Computer Science, The University of Edinburgh, March 1998.
11. S.L. Peyton Jones and P.L. Wadler. Imperative Functional Programming. In Proc. 20th ACM Symposium on Principles of Programming Languages (POPL '93), pages 71–84, January 1993.
12. M. Tofte. Four Lectures on Standard ML. Technical Report ECS–LFCS–89–73, Laboratory for Foundations of Computer Science, The University of Edinburgh, March 1989.
Higher Order Demand Propagation

Dirk Pape

Department of Computer Science, Freie Universität Berlin
[email protected]

Abstract. A new denotational semantics is introduced for realistic non-strict functional languages, which have a polymorphic type system and support higher order functions and user definable algebraic data types. It maps each function definition to a demand propagator, which is a higher order function that propagates context demands to function arguments. The relation of this "higher order demand propagation semantics" to the standard semantics is explained, and it is used to define a backward strictness analysis. The strictness information deduced by this analysis is very accurate, because demands can actually be constructed during the analysis. These demands conform better to the analysed functions than abstract values which are constructed with respect to types alone, as in other existing strictness analyses. The richness of the semantic domains of higher order demand propagation makes it possible to express generalised strictness information for higher order functions even across module boundaries.
1 Introduction

Strictness analysis is one of the major techniques used in optimising compilers for lazy functional programs [15]. If a function is identified to be strict in some arguments, these arguments can safely (without changing the lazy semantics) be calculated prior to the function call, saving the effort of building a suspension and later entering it. Strictness information can also be used to identify concurrent tasks in a parallel implementation, since referential transparency guarantees that function arguments can be reduced independently. Generalised strictness analysis can in addition derive an amount of evaluation which is safe for a larger data structure (e.g. evaluating the spine of a list argument). Such information can be used to find larger tasks in a parallel implementation [2,7], hence reducing granularity. Though it is not always recommended to use generalised strictness information for a sequential implementation – because of the danger of introducing space leaks – it can in general improve simple strictness analysis results even in the sequential setting. In this paper a new approach to generalised strictness analysis is proposed. The analysis is called demand propagation because evaluation demands are propagated from composed expressions to their components (backward w.r.t. function application). In contrast to other backward analyses like projection analysis [19,3] or abstract demand propagation [17], this analysis is applicable for a higher order and polymorphic language and works with infinite domains in the non-standard semantics. In Sect. 2 a simple but realistic functional language based on the polymorphically typed lambda calculus is introduced together with its standard semantics.
Demands, demand propagators and the demand propagation semantics are defined in Sect. 3. The utilisation of demand propagators for generalised strictness analysis is explained in Sect. 4. The soundness proof for the semantics can be read in [13]. Higher order demand propagation departs from other strictness analyses: its abstract values are higher order functions, which calculate the generalised strictness information via backward propagation of demands. A serious implication of this is that the abstract domains are in general infinite. Because of this, more accurate strictness information of a function can be expressed and made available to functions which use it in their definition, even across module boundaries. This can be seen by looking at the examples in the appendix and in [12]. How the analysis can deal with those infinite domains is briefly sketched in Sect. 5 and more thoroughly in [12]. As a bonus it is possible to trade accuracy against speed, making the analysis applicable for different stages of program development. Further research topics and related work are discussed in Sects. 6 and 7.
2 Core Language Syntax and Semantics

The simple but realistic functional language defined in this section is based on the polymorphically typed lambda calculus. The language is higher order and provides user definable algebraic data types. It is a module language and the standard semantics of a module is a transformation of environments. The module's semantics transforms a given import environment into an export environment by adding to it the semantics of the functions which are defined in the module.
2.1 Syntax of the Core Language
The syntax of the core language and its associated type language is given by: type :== [ " { tvar } . ] monotype monotype :== tvar | INT | monotype fi monotype | tcons { monotype } usertype :== [ " { tvar } . ] algebraic algebraic :== [ algebraic + ] cons { monotype } expr :== var | cons | expr expr | l var . expression | let module in expr | case expr of { cons { var } expr } var expr module :== { tcons = usertype } { var = expr } It consists of user type declarations and function declarations. Expressions can be built of variables, constructors, function applications, lambda abstractions, case- and letexpressions. All expressions are assumed to be typed, but for simplicity all type annotations are omitted in this paper. In a real implementation the principal types can be inferred by Hindley-Milner type inference. Nevertheless we define a type language which is used to index the semantic domains. Polymorphic types are defined by universal quantification of all free type variables at the outermost level of the type, hence an expression’s type must not contain a free type variable. The same holds for the user defined types. This is a common approach, that is e.g. followed in the definition of the functional language Haskell [14]. Constructors of an algebraic data type are assumed to be unique over all types.
The usual constants 0, 1, -1, …, +, *, … appear in the language as variables with their semantics assigned to them in a prelude environment. Integer numbers also represent constructors, which identify the components of the infinite sum Z⊥ ≅ ({0} ⊕ {1} ⊕ {-1} ⊕ …)⊥ and which can be used in a pattern of a case-alternative. A case-expression must always contain a default alternative, to eliminate the necessity to handle pattern match errors in the semantics. An ordinary case-expression without a default alternative can easily be transformed by adding one with the expression bot, where bot is defined for all types and has the semantics ⊥.
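A standalone Haskell sketch of this abstract syntax follows. The representation and all names are invented for illustration and are not taken from the paper.

```haskell
type TVarName  = String
type TConsName = String
type ConsName  = String
type VarName   = String

data MonoType
  = TyVar TVarName
  | TyInt
  | TyFun MonoType MonoType             -- monotype → monotype
  | TyCons TConsName [MonoType]         -- tcons { monotype }

data Type = ForAll [TVarName] MonoType  -- [ ∀ tvars . ] monotype

-- usertype: a (possibly polymorphic) sum of constructor applications
data UserType = UserType [TVarName] [(ConsName, [MonoType])]

data Expr
  = Var VarName
  | Cons ConsName
  | App Expr Expr
  | Lam VarName Expr                    -- λ var . expr
  | LetMod Module Expr                  -- let module in expr
  | Case Expr [(ConsName, [VarName], Expr)] (VarName, Expr)  -- with default

data Module = Module [(TConsName, UserType)] [(VarName, Expr)]
```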
2.2 Standard Semantics of Core Language Modules
The definition of the standard semantics is given in Fig. 1 and consists of four parts:
1. The Type Semantics D: The semantic domains associated to the types of the type language are complete partial ordered sets (cpos). They are constructed out of the flat domain for INT by sum-of-products construction for algebraic data types and continuous function space construction for functional types. These constructions yield again cpos, as shown e.g. in [5]. Function domains are lifted here to contain a least element ⊥ for functional expressions which do not have a weak head normal form. A domain for a polymorphic type is defined as a type indexed family of domains. User type declarations can be polymorphic and mutually recursive. User defined types are represented by type constructors, which can be applied to the appropriate number of types, yielding the domains for recursive data types, which themselves are defined as fixpoints of domain equations in the usual way [5]. Because we focus on the expression semantics here, we handle the type constructors as if they are predefined in the type environment for the type semantics.
2. The Expression Semantics E: Syntactic expressions of a type t are mapped to functions from an environment – describing the bindings of free variables to semantic values – into the domain belonging to t. The semantics of polymorphic expressions are families of values for all instances of the polymorphic type.
3. The User Type Declaration Semantics U: The sum-of-products domains for user defined data types come together with unique continuous injection functions inC into the sum and with continuous projection functions projC,i to the i-th factor of the summand corresponding to C. The user type definition semantics implicitly defines a constructor function for each constructor with the corresponding injection function as its semantics. For notational convenience we also define a function tag, which maps any non-bottom value of a sum to the constructor of the summand it belongs to. For v ≠ ⊥ it holds that v ∈ Image(intag(v)).
4. The Module Semantics M: The semantics of a core language module is a transformation of environments. Declarations in a core language module can refer to the semantics defined in the import environment. The semantics of the set of mutually recursive declarations in a module for a given import environment is the (always existing) minimal fixpoint of an equation describing the environment enrichment.
Fig. 1. Standard Semantics of the Core Language (cont.)

E : ∀t:Type. Exprt → Env → D[[t]]   where Env = ∀t. Vart ⊕ Constt → D[[t]]
E[[x]] ρ = ρ x   and   E[[C]] ρ = ρ C
E[[e1 e2]] ρ = E[[e1]] ρ (E[[e2]] ρ); resp. ⊥, if E[[e1]] ρ = ⊥
E[[λx. e]] ρ = λv. E[[e]] ρ[v/x]
E[[case e of C1 v11…v1a1 → e1; …; Cn vn1…vnan → en; v → edef]] ρ
  = ⊥, if ε = ⊥
  = E[[ei]] ρ[projCi,1(ε)/vi1, …, projCi,ai(ε)/viai], if tag(ε) = Ci
  = E[[edef]] ρ[ε/v], otherwise
  where ε = E[[e]] ρ
E[[let m in e]] ρ = E[[e]] (M[[m]] ρ)

U : Usertype → Env → Env
U[[∀α1…αm. C1 t11…t1a1 + … + Cn tn1…tnan]] = [c1/C1, …, cn/Cn]
  where ci = λv1…vai. inCi(v1, …, vai)

M : Module → Env → Env
M[[T1 = u1; …; Tm = um; fdefs]] ρ = F[[fdefs]] (ρ ++ U[[u1]] ++ … ++ U[[um]])
  where F[[v1 = e1; …; vn = en]] ρ = μR. ρ[E[[e1]] R/v1, …, E[[en]] R/vn]

M[[m]] ρ f is the semantics of a variable f in a module m with import env. ρ.
Prelude semantics are provided for integer numbers, + and bot.
3 Higher Order Demand Propagation Semantics

The aim of higher order demand propagation semantics is to express how evaluation demands on a given function application propagate to evaluation demands on the function arguments (lazy evaluation is presumed). The definition follows the structure of the definition of the standard semantics but maps the syntactical function definitions to so-called demand propagators, which are defined below.
3.1 The Abstract Semantic Domains for Higher Order Demand Propagation
We now define the notions of demand and demand propagator. Note that demands and demand propagators are polymorphically typed. This is the reason why higher order demand propagation works for a polymorphic language.

Definition 1 (Demand). A demand D of type t is a continuous function from the standard semantic domain D[[t]] to the two-point cpo 2 = {1}⊥ with ⊥