Go-faster Haskell
Or: Data-intensive Programming in Parallel Haskell

DRAFT

P.W. Trinder, University of Glasgow, Glasgow, Scotland
K. Hammond, University of St Andrews, St Andrews, Scotland
H-W. Loidl, University of Glasgow, Glasgow, Scotland
S.L. Peyton Jones, University of Glasgow, Glasgow, Scotland
J. Wu, Centre for Transport Studies, University College London, London, England
Abstract

We have recently constructed an integrated programming environment to support programming in Glasgow Parallel Haskell (GpH). This paper describes the construction of several data-intensive parallel programs using the environment. It focuses on a road-traffic accident application because it is a real problem with real data, and is the first non-trivial GpH program to achieve wall-clock speedups: a factor of 10 on 16 processors. Using our portable runtime system, the programs have been measured on two machines with fundamentally different architectures: a shared-memory and a distributed-memory machine.
1 Introduction

One of the war-cries of the functional programming research community has been that functional programming offers a new approach to parallel programming [18]. Functional languages appear to offer substantial implicit parallelism, without adding new language constructs or complicating the language semantics. In particular, the non-determinism often associated with parallel programs is absent: if the program works when run sequentially, it will deliver the same results when run in parallel. Motivated by this potential, a tremendous amount of effort has been invested in the design and implementation of parallel functional languages. Despite this investment, the results have been mostly disappointing. It has proved very hard to get absolute speedups from parallelism, compared to the same program compiled by a state-of-the-art compiler in a sequential setting. (The outstanding exception is NESL; see Section 6.) The big difficulty is that the road to convincingly good performance is a long slog:
- Functional languages are very high-level, so their compilers are correspondingly complex. A parallel functional language has to cope with a new layer of complexity: resource scheduling. The run-time system has to manage a distributed global heap, decide where and when to run a thread, and move data from one processor to another when necessary. As a result, the sheer effort of building a correct parallel functional-language implementation (i.e. one that works) is often all that a research group can manage. The additional effort of making a fast implementation often falls off the end of the project. In particular, integrating the parallel system with a competitive compiler is a significant challenge.

- Profiling tools are extremely important for very high-level programming languages, because they give feedback to the programmer about where time and space resources are actually being spent [24, 26]. In a parallel setting profiling tools are more important still: a programmer concerned with parallelism must be given detailed information about the parallel behaviour of the program.

- Lastly, considerable effort must be devoted to building real parallel application programs. Without gaining experience from writing realistic parallel applications there is little to support the claim that writing parallel programs in a functional style is easier or less error-prone. However, real applications naturally stress the implementation a great deal more than toy programs, which in turn leads to more implementation work.

On top of all this, multiprocessor architectures are still changing quite rapidly.

In an earlier paper [28] we reported on our efforts to solve the first two of these problems, a full-scale parallel implementation of the lazy, purely-functional language Haskell. The implementation, called GpH, has three key components:
- The Glasgow Haskell Compiler (GHC), a state-of-the-art compiler for Haskell [19]. Our parallel implementation uses GHC almost unchanged, except for the addition of a few extra synchronisation barriers and run-time system calls. Our goal was that parts of the program that have no parallelism should execute at or near the speed at which they would have run when compiled by GHC for a sequential platform. GHC comes equipped with a time and space profiler based on so-called "cost centres" [26].

- The GUM (Graph Reduction on a Uniform Model) run-time system [28]. GUM supports the distributed global heap, thread placement and scheduling, load management, and other resource-management tasks. GUM does not require shared memory between processors; rather, it requires only point-to-point message-passing capability.
Our current implementation sits on top of the highly-portable PVM substrate, but MPI or something even lower level would do equally well.

- The GranSim simulator [8], which allows us to simulate the behaviour of running GHC-compiled programs on GUM. Simulation is extremely useful. First, it means that one does not need access to a real parallel machine. Second, it allows the parameters of the simulated hardware to be altered at will. One very common set of parameters is an idealised machine with infinite bandwidth and zero communication delay. If the program does not run fast on that machine, then there isn't much point in trying it on real hardware!

Based on this suite of tools, the focus of this paper is on the third task identified above, that of gaining concrete experience of parallel functional programming by building real applications. We focus on one application in particular, a data-intensive traffic-accident analysis program that was in use at the Centre for Transport Studies at University College London. The paper makes the following contributions:

- We are one of the few groups ever to report a significant wall-clock speed-up (a factor of 10 on 16 processors) from a realistic parallel functional-language application (Section 4.5.2). Furthermore, this speedup is relative to a challenging baseline: the same program compiled by a state-of-the-art compiler for a sequential platform. This baseline is challenging firstly because it includes the costs incurred by compiling for a parallel model, and secondly because communication costs and other parallelism overheads show up much more starkly when the cost of executing straight-line sequential code is small than when it is large.

- We report good scale-up numbers as well (Section 4.5.4). For example, the runtime is multiplied by only 1.4 when both the data size and the number of processors are multiplied by 16. (The ideal would be 1.0.)

- We report results for running the identical program on both shared-memory and distributed-memory machines. This is unusual: it is more common for an implementation to work on one or the other, but not both.

- Based on our experience, we articulate a stepwise approach to developing parallel functional programs, starting with a sequential implementation and moving to the parallel machine via a series of increasingly realistic simulations (Section 2.2). A key part of this approach is the use of evaluation strategies to separate the algorithmic and pragmatic aspects of the program (Section 2.1). These strategies make essential use of laziness.

- We are able to validate some claims about parallel functional programming, and refute others:

  - Parallel functional programming is not a panacea. Writing parallel programs remains difficult, and profiling tools are essential. Wholesale restructuring of the algorithm to gain parallelism is as necessary as ever.
  - However, parallel functional programming really is high-level. We were able to experiment with many more ways of parallelising the program than would have been possible using a lower-level technique. In the traffic-accident program we explored four major variants of the parallelism structure, with many short-lived iterations within each.

  - GpH supports dynamic process scheduling, so that tasks can be matched with available processors at run-time. This carries greater run-time costs, but allows programs with unpredictable task sizes and structure to be run effectively. We find that such claims do in fact hold water (Section 4.5.3).
2 Parallel Functional Programming

The essence of the problem facing the parallel programmer is that, in addition to specifying what value the program should compute, explicitly parallel programs must also specify how the machine should organise the computation. There are many aspects to the parallel execution of a program: threads are created, execute on a processor, transfer data to and from remote processors, and synchronise with other threads. Managing all of these aspects on top of constructing a correct and efficient algorithm is what makes parallel programming so hard. One extreme is to rely on the compiler and runtime system to manage the parallel execution without any programmer input. Unfortunately, this purely implicit approach is not yet fruitful for the large-scale functional programs we are interested in. The approach used in GpH is less radical: the runtime system manages most of the parallel execution, only requiring the programmer to indicate those values that might usefully be evaluated by parallel threads and, since our basic execution model is a lazy one, perhaps also the extent to which those values should be evaluated. We term these programmer-specified aspects the program's dynamic behaviour.

Parallelism is introduced in GpH by the par combinator, which takes two arguments that are to be evaluated in parallel. The expression p `par` e (here we use Haskell's infix operator notation) has the same value as e, and is not strict in its first argument, i.e. ⊥ `par` e has the value of e. Its dynamic behaviour is to indicate that p could be evaluated by a new parallel thread, with the parent thread continuing evaluation of e. We say that p has been sparked, and a thread may subsequently be created to evaluate it if a processor becomes idle. Since the thread is not necessarily created, p is similar to a lazy future [15].

Since control of sequencing can be important in a parallel language [23], we introduce a sequential composition operator, seq. If e1 is not ⊥, the expression e1 `seq` e2 has the value of e2; otherwise it is ⊥. The corresponding dynamic behaviour is to evaluate e1 to weak head normal form (WHNF) before returning e2.
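As a concrete illustration, the sketch below uses par and seq in exactly the way just described. (In GpH the combinators are built in; present-day GHC exports par from Control.Parallel, which is assumed here, and nfib is merely a stand-in for expensive work.)

    import Control.Parallel (par)

    -- A stand-in for an expensive computation.
    nfib :: Int -> Int
    nfib n | n < 2     = 1
           | otherwise = nfib (n - 1) + nfib (n - 2) + 1

    -- Spark x for possible parallel evaluation, evaluate y in the
    -- parent thread, then combine the results.
    -- (Modern GHC also offers pseq, which guarantees this ordering;
    -- seq alone permits the compiler to reorder.)
    parAdd :: Int -> Int -> Int
    parAdd a b = x `par` (y `seq` x + y)
      where x = nfib a
            y = nfib b

    main :: IO ()
    main = print (parAdd 30 30)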
2.1 Evaluation Strategies
Even with the simple parallel programming model provided by par and seq, we find that more and more code must be inserted in order to obtain better parallel performance. In realistic programs the algorithm can become entirely obscured by this dynamic-behaviour code. Evaluation strategies use lazy higher-order functions to separate the two concerns of specifying the algorithm and specifying the program's dynamic behaviour. A function definition is split into two parts, the algorithm and the strategy, with values defined in the former being manipulated in the latter. The algorithmic code is consequently uncluttered by details relating only to the parallel behaviour. In fact the driving philosophy behind evaluation strategies is that it should be possible to understand the semantics of a function without considering its dynamic behaviour.

Because evaluation strategies are written in the same language as the algorithm, they have several other desirable properties. Strategies are powerful: simple strategies can be composed, or passed as arguments, to form more elaborate strategies. Strategies are extensible: the user can define new application-specific strategies. Strategies can be defined over all types in the language. Strategies are type safe: the normal type system applies to strategic code. Strategies have a clear semantics, which is precisely that used by the algorithmic language. Evaluation strategies are used to specify the dynamic behaviour of the programs described in this paper. A complete description and discussion of strategies can be found in [29].
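To make this concrete, here is a minimal sketch of the strategy combinators in the style of [29]. The definitions below are simplified reconstructions for illustration, not the module as distributed; par is taken from present-day Control.Parallel.

    import Control.Parallel (par)

    -- A strategy specifies dynamic behaviour; it returns no useful value.
    type Strategy a = a -> ()

    -- Apply a strategy to a value: the algorithm builds x,
    -- the strategy says how (and how much) to evaluate it.
    using :: a -> Strategy a -> a
    using x s = s x `seq` x

    -- Reduce to weak head normal form only.
    rwhnf :: Strategy a
    rwhnf x = x `seq` ()

    -- Spark every element of a list for parallel evaluation.
    parList :: Strategy a -> Strategy [a]
    parList _     []     = ()
    parList strat (x:xs) = strat x `par` parList strat xs

    -- Algorithm (map f) and dynamic behaviour (parList rwhnf) stay separate:
    parMap :: (a -> b) -> [a] -> [b]
    parMap f xs = map f xs `using` parList rwhnf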
2.2 Parallelisation Guidelines

From our experience engineering these, and other, programs we are developing guidelines for parallelising large non-strict functional programs. Our approach relies on an integrated suite of tools, including time profiling [26] and the GranSim simulator [8]. A crucial property of GranSim is that it can be parameterised to simulate both real architectures and an idealised machine with, for example, zero-cost communication and an infinite number of processors. Here is the approach we have developed:

1. Sequential implementation. Start with a correct implementation of an inherently-parallel algorithm.

2. Parallelise and tune.
   - Seek top-level parallelism. Often a program will operate over independent data items, or the program may have a pipeline structure.
   - Time-profile the sequential application to discover the "big eaters", i.e. the computationally intensive pipeline stages.
   - Parallelise the big eaters using evaluation strategies.
   - Idealised simulation. Simulate the parallel execution of the program on an idealised execution model, i.e. with an infinite number of processors, no communication latency, no thread-creation costs, etc. This is a "proving" step: if the program isn't parallel on an idealised machine it won't be on a real machine.
   - Realistic simulation. GranSim can be parameterised to closely resemble the GUM runtime system for a particular machine, forming a bridge between the idealised and real machines.

3. Real machine. The GUM runtime system supports some of the GranSim performance visualisation tools. This seamless integration helps in understanding real parallel performance.
The following sections describe the application of the guidelines to several data-intensive programs. The guidelines are discussed in detail in [14, 29].
3 Demonstrator

To begin with, we discuss an extremely simple data-intensive program: unless we can achieve wall-clock speedups for simple demonstrator programs we have failed. The demonstrator also allowed us to gain experience of parallelising Haskell programs, to uncover bugs in the GUM runtime system, and helped drive the development of evaluation strategies.

Date poses the following bill-of-material, or parts-explosion, problem [5]. Given the relation below, write a program to list all the component parts of a given part, to all levels. For example, the components required to make part P1 are {P2, P3, P6}. The program is inherently data parallel because the explosion of one part is not dependent on the explosion of any other part.

  The parts relation:

  Main Component   Sub-Component   Quantity
  P1               P2              2
  P1               P3              4
  P5               P3              1
  P5               P6              8
  P2               P6              3

In the examples presented here the bill of material contains 2000 elements; each part has 3 sub-components, some of which may in turn have sub-components. 80 parts are exploded, and the runtime of the optimised sequential code on a Sun SPARC 10 is 25.1 seconds.

Because of its simplicity, the bill-of-material program does not need to use all of the steps in the methodology outlined in Section 2.2. There is no top-level pipeline, and the computational costs of the program are simple enough to understand without time profiling. Figure 1 shows the GranSim activity profile for an idealised machine. The initial sequential segment is the bill construction, representing less than 1% of the work. When the parts are exploded in parallel, maximum parallelism is rapidly attained, as there are no dependencies between the explosions. Some explosions contain more parts, and so take longer to complete, and this leads to the raggedly-declining parallelism that we observe. The two threads running between 32 × 10^6 and 44 × 10^6 cycles are firstly a printing thread and secondly a part explosion that contains 116 components. The final sequential tail is due to printing the output. These conclusions can be verified using a per-thread visualisation of the execution.
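For concreteness, the core of such a program might be written as follows. This is a hypothetical sketch (the real program and data are larger, and a full forcing strategy rather than a bare spark would be used); the relation is held as a list of triples, and par is taken from present-day Control.Parallel.

    import Control.Parallel (par)
    import Data.List (nub)

    type Part = String
    type BOM  = [(Part, Part, Int)]   -- (main component, sub-component, quantity)

    parts :: BOM
    parts = [ ("P1","P2",2), ("P1","P3",4), ("P5","P3",1)
            , ("P5","P6",8), ("P2","P6",3) ]

    -- Direct sub-components of a part.
    subs :: BOM -> Part -> [Part]
    subs bom p = [ s | (m, s, _) <- bom, m == p ]

    -- All components of a part, to all levels.
    explode :: BOM -> Part -> [Part]
    explode bom p = nub (concat [ s : explode bom s | s <- subs bom p ])

    -- Explode many parts, sparking each explosion: pure data parallelism.
    -- (A spark evaluates only to WHNF; the real program forces further.)
    explodeAll :: BOM -> [Part] -> [[Part]]
    explodeAll _   []     = []
    explodeAll bom (p:ps) = let e = explode bom p
                            in e `par` (e : explodeAll bom ps)

    main :: IO ()
    main = print (explodeAll parts ["P1", "P5"])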
[Figure 1: Bill of Material GranSim activity profile (idealised simulation), showing running and blocked tasks over time. Average parallelism = 15.5; runtime = 52,710,984 simulated cycles.]
[Figure 2: Bill of Material GranSim activity profile (realistic simulation of the target machine), showing running, runnable, fetching, blocked and migrating tasks. Average parallelism = 3.8; runtime ≈ 1.23 × 10^9 simulated cycles.]
Figure 2 shows the activity profile when GranSim is parameterised to resemble the target machine. The average parallelism of 3.8 indicates that the program is adequately parallelised for a 4-processor machine.

The parallel machine used to execute this program is a Sun SPARCserver with six SPARC 10 processors, communicating via shared-memory segments under Solaris 2. The machine is heavily loaded, seldom having fewer than 30 users, or a load average below 3. Figure 3 shows the wall-clock speedups for the bill-of-material program. GUM does not support thread migration, and hence execution is highly susceptible to scheduling accidents [28]. To ameliorate the effects of such accidents, each point on the graph is the median of three separate execution times. The parallel code has some overheads that are not present in the sequential Glasgow Haskell implementation, such as the need to test every closure on entry to determine whether it is already being evaluated by another task. As a result, the parallel code on a single processor slows down by 20%, making the parallel system only 80.5% as efficient as the sequential implementation. A careful analysis of the efficiency of GUM can be found in [28].
Because of the load on this machine, it is not possible to use all six processors for any significant length of time. Hence the first plot shows wall-clock speedups for just four processors on a lightly loaded machine. The second plot shows the speedups achieved when the program is run on up to six processors; in this second case, the machine is more heavily loaded (approximately 30 users). A wall-clock speedup of 2.2, and a relative speedup of 2.6, on a busy four-processor machine is encouraging for a program with a significant sequential tail. It is clear, however, that the machine used for the performance tests is too heavily loaded to give a realistic indication of the potential parallel performance that could be achieved. We would expect results on an unloaded machine to be significantly better than those reported here.
[Figure 3: Bill of Material wall-clock speedups, plotted for a lightly loaded ("quiet") and a heavily loaded ("busy") machine, up to 6 processors.]
4 Accident Blackspots
4.1 Description

Given a set of police accident records (modified to preserve privacy), the task is to discover accident blackspots: locations where two or more accidents have occurred. A number of criteria can be used to determine whether two accident reports are for the same location. Two accidents may be at the same location if they occurred at the same junction number, at the same pair of roads, at the same grid reference, or within a small radius of each other. The radius is determined by the class of the roads, the type of the junction, etc.

The problem amounts to combining several partitions of a set into a single partition. For example, if the partition on road pairs is {{2,4,5},{3},{6,7}} and on grid references is {{2,5},{3},{4,6},{7}}, the combined partition is {{2,4,5,6,7},{3}}. The problem of unioning disjoint sets, union-find, has been much studied by algorithm designers as it has an interesting sequential complexity. For n union and m find operations, an algorithm with an amortised complexity of O(n + F(m,n)) can be given, where F is a very small function (the inverse of the Ackermann function) [27]. These RAM algorithms are not directly applicable in our application because not all of a large data set may be randomly accessed in memory. We have adopted an index-, or tree-, based solution with complexity O(n log n), where n is the number of elements in the sets. The motivation for this choice is that for very large data sets not all of the tree need be memory-resident at any one time.
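For illustration, a compact way to combine partitions is a map-based union-find. This is a sketch of the classical RAM approach only, not the index-based algorithm the program actually uses:

    import qualified Data.Map.Strict as M
    import Data.Map.Strict (Map)

    -- Follow parent pointers to the representative; absent keys are roots.
    find :: Map Int Int -> Int -> Int
    find parent x = maybe x (find parent) (M.lookup x parent)

    -- Link the representatives of two elements.
    union :: Int -> Int -> Map Int Int -> Map Int Int
    union a b parent
      | ra == rb  = parent
      | otherwise = M.insert ra rb parent
      where ra = find parent a
            rb = find parent b

    -- Combine several partitions by unioning adjacent members of each class.
    combine :: [[[Int]]] -> Map Int Int
    combine partitions =
      foldl (\m (a, b) -> union a b m) M.empty
            [ (a, b) | p <- partitions, cls <- p, (a, b) <- zip cls (drop 1 cls) ]

    -- Read the combined partition back off as equivalence classes.
    classes :: [Int] -> Map Int Int -> [[Int]]
    classes xs m = M.elems (M.fromListWith (++) [ (find m x, [x]) | x <- xs ])

    main :: IO ()
    main = print (classes [2..7]
                  (combine [ [[2,4,5],[3],[6,7]], [[2,5],[3],[4,6],[7]] ]))

Running this on the example above yields the classes {2,4,5,6,7} and {3}, as required.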
4.2 Sequential Implementations

4.2.1 The PFL Implementation

The application was originally written at the Centre for Transport Studies [31] in PFL, and has subsequently been rewritten in Haskell. PFL is an interpreted functional language [21], designed specifically to handle large deductive databases. Unusually for a functional language, PFL provides a uniform persistent framework for both data and program. The PFL program uses a slightly different algorithm to the Haskell program, and comprises approximately 500 lines.
4.2.2 The Haskell Implementation

The Haskell implementation constructs a binary relation, sameSite, containing an element for each pair of accidents that match under one of the four conditions. The combined partition is formed by repeatedly finding all of the accidents reachable in sameSite from a given accident. The program has four major phases: reading and parsing the file of accidents; building indices over the accident data; constructing sameSite, and indices over sameSite; and forming the partition. The program is a 300-line module, together with 3 library modules totalling 1300 lines.

The results and runtimes (RT) of the PFL and Haskell programs are given in the table below. The runtimes are only for broad comparison purposes, as the programs were run on similar, but not identical, machines: PFL on a Sun ELC, and Haskell on a Sun Classic. There are 7310 accidents, runtimes are given in minutes:seconds, and we abbreviate multiple-accident sites as MA sites.

  Sequential Results

  Partition by                                 MA Sites   Acc./MA Sites   PFL RT   Haskell RT
  Junction No. only                            208        1617            2:41     0:47
  Road-pair only                               1310       4653            14:41    0:52
  Grid ref. only                               1324       4434            14:16    0:52
  Junction No., Road-pair & Grid ref.          1228       5430            17:44    1:42
  Junction No., Road-pair, Grid ref. & nearby  1229       5450            18:25    2:03
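The matching condition used to build sameSite is itself straightforward. The rendering below is hypothetical: the record fields and the radius rule are invented for illustration, whereas the real program derives the radius from road class and junction type.

    -- Hypothetical accident record; the real data has many more fields.
    data Accident = Accident
      { accId    :: Int
      , junction :: Maybe Int          -- junction number, if any
      , roadPair :: (String, String)   -- the two roads at the site
      , gridRef  :: (Int, Int)         -- grid reference in metres
      , radius   :: Int                -- matching radius for this record
      }

    -- Two accidents may be at the same site if any one criterion holds.
    sameSite :: Accident -> Accident -> Bool
    sameSite a b =
         (junction a == junction b && junction a /= Nothing)
      || roadPair a == roadPair b
      || gridRef a == gridRef b
      || dist (gridRef a) (gridRef b) <= fromIntegral (min (radius a) (radius b))
      where
        dist (x1, y1) (x2, y2) =
          sqrt (fromIntegral ((x1 - x2)^2 + (y1 - y2)^2)) :: Double

    main :: IO ()
    main = print (sameSite a1 a2)
      where
        a1 = Accident 1 Nothing ("A30", "B3074") (100, 200) 50
        a2 = Accident 2 Nothing ("A30", "B3074") (130, 220) 50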
We expect Haskell compiled with the Glasgow Haskell Compiler (ghc) to be one or two orders of magnitude faster than the interpreted PFL for CPU-intensive problems. However, in contrast to PFL's carefully designed persistence, ghc's lazy file I/O is poor: amongst other things, it requires the data to be parsed from text on input. The Haskell runtimes for the single-condition partitions are very similar; most of the time is probably absorbed in reading and parsing the file of accidents. To avoid creating an unfair impression from this set of timings, it is important to note that this is not a typical PFL application, being essentially memory-resident once the initial data is read. PFL's primary application domain concerns problems where the data is normally too large to fit into primary memory.
4.3 Parallel Variants

Following our guidelines, we initially investigated the application's parallelism using an idealised simulation. Once adequate parallelism was obtained, we used a realistic simulation of our first 4-processor shared-memory target machine. The tables below report the results obtained from the simulators when just 1000 accidents are partitioned; runtimes and work are in units of 10^6 GranSim machine cycles (MCycles).

  Idealised Simulation

  Parallel Variant                           Work (MCycles)   Average Parallelism   Run Time (MCycles)
  Pipeline only                              327              1.2                   273
  Par. Pipeline Stages                       327              2.8                   124
  Par. Pipeline Stages & preconstructed Ixs  304              3.5                   87
  Geographically Partitioned                 389              3.7                   105

  Realistic SPARCserver Simulation

  Parallel Variant                           Work (MCycles)   Average Parallelism   Run Time (MCycles)
  Par. Pipeline Stages & preconstructed Ixs  393              2.3                   171
  Geographically Partitioned                 394              3.7                   105

Pipeline only. The first version simply converted the 4 phases of the program outlined in Section 4.2 into a pipeline. The speedup of 1.2 is disappointingly low, because the pipeline is blocked by the trees passed between stages.

Parallel Pipeline Stages. The next version introduces parallelism within each pipeline stage, using a variety of paradigms. The file reading and parsing stage is made data parallel by partitioning the data and reading from n files. Control parallelism is used to construct the accident indices. The stages constructing the same-site relation and the partition both use benign speculative parallelism.
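To indicate how the pipeline variants are expressed, here is a toy sketch. The stage bodies are invented stand-ins; the point is that sparking the full evaluation of each intermediate structure lets producer and consumer stages overlap.

    import Control.Parallel (par)

    -- Force a list's spine and elements to WHNF; a crude per-stage strategy.
    forceList :: [a] -> ()
    forceList []     = ()
    forceList (x:xs) = x `seq` forceList xs

    -- Invented stand-ins for the four phases of the accident program.
    parse :: String -> [Int]
    parse = map read . lines

    index :: [Int] -> [(Int, Int)]
    index = zip [0..]

    match :: [(Int, Int)] -> [(Int, Int)]
    match = filter (even . snd)

    partition' :: [(Int, Int)] -> Int
    partition' = length

    -- Spark each intermediate structure so the stages run as a pipeline.
    pipeline :: String -> Int
    pipeline input =
      let as = parse input
          is = index as
          ms = match is
      in forceList as `par` forceList is `par` forceList ms `par` partition' ms

    main :: IO ()
    main = print (pipeline "1\n2\n3\n4\n")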
Parallel Pipeline Stages and Preconstructed Indices. Parallelism is further improved by merging the first two pipeline stages. That is, the indices on the accident data are constructed before the program is run, and the program reads the indices rather than constructing them. The resulting parallelism is satisfactory on an idealised simulation of a 4-processor machine, but poor under a realistic simulation. The poor realistic results are due to the fine grain of the parallelism and the volume of data being communicated.
Geographically Partitioned. A very different, coarse-grained, parallel structure can be obtained by splitting the accident data into geographical areas. Each area, or tile, can be partitioned in parallel before aggregating the results. Accidents occurring near the edge of a tile must be treated specially. In fact this approach is only feasible because every accident has a grid reference, and we assume that accidents occurring more than 200m apart cannot be at the same site. Accidents occurring within 100m of the nominal edge between two tiles are duplicated in both tiles. Splitting the original data into 4 tiles results in a 4% increase in data volume. As a result of the duplicated border accidents, some multiple-accident sites may be discovered in more than one tile; these sites must be combined in the aggregate result. A precise result can be obtained by partitioning the small border area during aggregation. However, for our purposes we obtain a good, but imprecise, result by simply excluding any multiple-accident sites in the North and East borders of a tile unless the site extends into the body of the tile. The North and East border coordinates form the header of each file of accidents, making each tile self-describing.

Breaking the data into tiles reduces the work required to form a partition, as long as the border is much smaller than the body of the tile. Less work is required because each accident is compared with fewer accidents: the trees constructed during the partition are smaller. The table below shows the runtimes for a sequential partition of the original (monolithic) set of accidents, a sequential partition of the data in 4 tiles, and a parallel partition of the 4 tiles. More formally, for the n accidents in the monolithic data the algorithm is O(n log n). The number of accidents in each tile is n/4 + b, where b is the number of replicated border accidents. If we assume that b is much smaller than n/4, then the tiled algorithm is O(4 × (n/4) log(n/4)), i.e. O(n log(n/4)).

The simulator and strategies have allowed us to carry out low-cost experiments with several possible parallel variants of the program. Most of the program, the partitioning algorithm in particular, remained substantially unchanged while different parallel strategies were investigated. The tiled variant is selected for execution on the real machine because it delivers good coarse-grained parallelism under both idealised and realistic simulation.
  Monolithic and Tiled Runtimes

  Program Variant        Work (MCycles)   Average Parallelism   Run Time (MCycles)
  Sequential Monolithic  498              1.0                   498
  Sequential Tiled       394              1.0                   394
  Parallel Tiles         394              3.7                   105
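The coarse-grained tile parallelism then amounts to a simple strategy over the list of tiles. The sketch below is hypothetical: the tile partitioning stand-in merely groups accidents sharing an exact grid reference, which is far simpler than the real algorithm, and the record layout is invented.

    import Control.Parallel (par)
    import Data.List (sortOn, groupBy)
    import Data.Function (on)

    type Accident = (Int, Int, Int)   -- (accident id, x, y): invented layout
    type Tile     = [Accident]
    type Site     = [Accident]

    -- Stand-in for the real O(n log n) partitioning of one tile.
    partitionTile :: Tile -> [Site]
    partitionTile = filter ((> 1) . length)
                  . groupBy ((==) `on` grid)
                  . sortOn grid
      where grid (_, x, y) = (x, y)

    -- One spark per tile: an idle PE picks up the next tile's work.
    partitionTiles :: [Tile] -> [[Site]]
    partitionTiles []     = []
    partitionTiles (t:ts) = let sites = partitionTile t
                            in sites `par` (sites : partitionTiles ts)

    main :: IO ()
    main = print (partitionTiles [ [(1,0,0), (2,0,0), (3,5,5)]
                                 , [(4,9,9), (5,9,9)] ])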
4.4 Apparatus
Machines. The program is measured on two very different machines, making use of the portability of the GUM runtime system. One is a shared-memory architecture and the other distributed-memory. The shared-memory machine is a Sun SPARCserver with 4 SPARC 10 processors and 256Mb of RAM; the machine is shared with other users, but measurements are performed when it is very lightly loaded. The distributed-memory machine is a network of up to 16 Sun 4/15 workstations, each with 24Mb of RAM, connected on a single Ethernet segment.
Data. The original data set of 7400 accident records occupies 0.3Mb and is split into tiles of 2 different sizes: 8 small tiles, each containing approximately 1000 accidents and occupying 37Kb, and 4 large tiles, each containing approximately 2000 accidents and occupying 73Kb. Both architectures use a shared file system, i.e. any PE (processing element) can access any file. On the network of workstations the files are stored on a single file server and accessed via NFS. On both machines the program is warm started, i.e. it is run at least once before measurements are taken. Warm starts reduce runtime because the data is preloaded into RAM disk caches in the file system.

Program. One change is required to the GranSim version of the program to enable it to run under GUM. GUM processors don't inherit file handles from the main thread, and hence, to permit them to read files simultaneously, the program uses the 'unsafe' C-interface supported by Glasgow Haskell [11].
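By way of illustration only: the effect is similar to reading each file through an unsafe I/O escape, as in the modern sketch below. The paper's program used the GHC C-interface of the time [11], not this function.

    import System.IO.Unsafe (unsafePerformIO)

    -- Read a tile's accident file outside the IO monad, so that the PE
    -- that demands the tile performs the read. Tolerable here only
    -- because the files are read-only input data.
    readTile :: FilePath -> [String]
    readTile path = lines (unsafePerformIO (readFile path))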
4.5 Measurements

4.5.1 Original Data Set

For these measurements the data is in 8 small tiles on the workstations, and in 4 large tiles on the SPARCserver. Figure 4 shows the runtimes of the program under GUM with up to 8 processors, together with the runtime of the optimised program under the standard sequential runtime system. The runtime under GUM on a single processor is 28% longer than the runtime under the sequential runtime system because it has some additional overheads [28]; we say that GUM is 78% efficient. Efficiency varies depending on both program and machine: the efficiency of the blackspots program on the SPARCserver is 77%.

[Figure 4: Runtimes on the network of workstations, original data: parallel runtimes under GUM against the sequential runtime, up to 8 processors.]

Parallelism results are more often reported as speedup graphs like Figure 5. The top line is ideal, or linear, speedup. The second line is the relative speedup, i.e. compared to a single processor running the parallel program. The third line is the absolute, or wall-clock, speedup, i.e. compared to a single processor running the optimised sequential code. These results are the first wall-clock speedups obtained for a GpH program solving a real problem using real data. On the SPARCserver the speedup is good at 2 processors, but poor thereafter. On the workstations the speedups are good up to 4 processors (3.6 relative speedup), but fall off thereafter. This indicates that the initial data set is too small to get good results on the larger parallel machine configurations.

[Figure 5: Speedups on the original data: ideal, relative and absolute speedups on the network of Sun 4/15s (up to 8 processors) and on the SPARCserver (up to 4 processors).]
4.5.2 Multiple Tiles per Processor

The data set can be increased by using larger tiles or by using more tiles. The latter approach is taken, for three reasons:

Efficiency. Section 4.3 showed that, as long as the tiles are large relative to the border area, many smaller tiles are more efficient than a few large tiles.

Reduced Peak Resources. If there is one tile per PE then all of the file reading occurs at the start of the program, inducing intense network traffic. With multiple tiles per PE the file reading is spread through the program's execution.

Automatic Good Load Management. A small number of large tiles could be statically allocated to PEs. However, it is a tedious task to maintain the allocation as the number of tiles and PEs change. Further, even with careful allocation, machine utilisation may be poor because the time to partition a tile is unpredictable: the number of accidents in a tile is a rough indication of the partition time, but the actual time is dependent on the exact distribution of the accidents. GUM supports dynamic load management, i.e. when a PE becomes idle it locates the next work item, in this case a tile. With many tiles each processor can be kept busy. Moreover, the program is independent of the number of PEs and the number of tiles.
In the following measurements there are 30 small tiles occupying 1.2Mb. Figure 6 shows the SPARCserver and workstation speedups. With more data the workstation speedups are good: relative and absolute speedups of 11.3 and 9.7 on 16 workstations. The 4-processor SPARCserver speedups have improved to a relative speedup of 2.6 and an absolute speedup of 1.9. As the next section shows, speedups can be further improved on the SPARCserver with more data.

The workstation speedup graphs show a number of phase changes, e.g. between 7 and 8 PEs and between 14 and 16 PEs. These are due to the coarse-grained parallelism. If the number of tiles, and hence threads, is an exact multiple of the number of processors, all of the processors process the same number of tiles and finish at approximately the same time. For example, with 6 processors each PE processes 5 tiles; adding a 7th PE doesn't reduce the runtime because, although some PEs process just 4 tiles, others must still process 5. Runtime falls again with 8 PEs because each PE then processes at most 4 tiles.

[Figure 6: Speedups on 30 homogeneous tiles, network of Sun 4/15s: ideal, relative and absolute speedups for up to 16 processors. (SPARCserver graph to be supplied.)]
4.5.3 Heterogeneous Tiles

An additional advantage of dynamic load management is that it automatically manages threads of varying lengths. As a result the GpH blackspots program can obtain good speedups with tiles of varying sizes. Figure 7 shows the speedups obtained partitioning 1.8Mb of data in 40 tiles: 32 small and 8 large. As before, the larger data set gives an improved speedup on both architectures: 12 relative and 10 absolute on 16 workstations, and 2.8 relative and 2.2 absolute on the SPARCserver.

[Figure 7: Speedups on 40 heterogeneous tiles: ideal, relative and absolute speedups on the Sun network (up to 16 processors) and the SPARCserver (up to 4 processors).]
4.5.4 Scaling
In addition to speedups, an important measure for data-intensive applications is scaleup: can a system twice the size process twice the volume of data in the same time? Figure 8 shows the scaleup for the two machines; there are as many large tiles as there are processors. The scaleup of the workstations is satisfactory: a 44% increase in runtime between 1 and 16 processors. Scaleup on the SPARCserver is not nearly so impressive: a 32% increase in runtime with just 4 processors.

[Figure 8: Blackspots scaleup with 75Kb tiles, showing user, user+system, elapsed and sequential runtimes for up to 4 processors. (Workstations graph to be supplied.)]
5 Lolita

We have cooperated with the DEAR project (EPSRC GR/J99292) to parallelise the Lolita natural-language processor. Lolita has been developed at Durham University over the past 10 years and is large, comprising 60K lines of Haskell and a few K lines of C. Lolita is data-intensive in accessing large, read-only data structures that typically occupy around 16Mb. It also generates large internal data structures, e.g. for each sentence in the text to be analysed Lolita generates a forest of possible parses and a forest of possible meanings. Surprisingly, parallelising Lolita entails minimal changes to the 300 modules of code. A realistic simulation gives an average parallelism of 3.1, which is satisfactory in view of an initial sequential segment of C-parsing.
Wall-clock speedups obtained to date on the Sun SPARCserver are disappointing because of Lolita's heavy memory requirements. The current machine has just 256Mb of physical memory, and the sequential program requires 100Mb. In consequence, even with just two processors physical memory is exhausted, and an average parallelism of 1.4 obtains a slowdown of approximately 10%. We aim to achieve better results by using a dedicated machine with 512Mb of physical memory, by making the C re-entrant and hence parallelising it, and by arranging for the processors to share a single copy of the read-only data structures.
6 Related Work

The FLARE project [22] studied a number of realistic functional applications, yielding results both for simulations and for the GRIP multiprocessor. Programs included a computational fluid dynamics solver, a theorem prover, a telephone-network routing application and a graphical user interface. The results obtained for these programs showed that it was possible for non-expert programmers to develop serious parallel functional programs, capable of yielding some performance gain.
Experience gained with the FLARE applications and others (including a linear equation-solving package [12] and a very large natural-language system [16]) has paved the way for the methodology which we have expounded in this paper, particularly the use of the initial simulation phase. The FLARE project used a Haskell-based idealised simulator, hbcpp [30]. In its incarnation as GranSim-Light, GranSim yields results which are comparable to those provided by hbcpp, though it is much more accurate in costing both raw reduction and system overheads (this is important since, as we have shown, I/O often causes serialisation). GranSim is, of course, also capable of providing several levels of more detailed simulation, as necessary for the development of the parallel application. There is, naturally, some cost associated with obtaining this extra information, but this cost is proportional to the information gained.

Like GUM, DIGRESS [3] provides a parallel implementation of Haskell running on distributed workstations, with dynamic scheduling and global load-balancing. DIGRESS is, however, designed specifically for this environment, and so provides features such as automatic system reconfiguration as new workstation resources become available. We are not aware of comparative performance figures for the two systems; however, DIGRESS appears to be designed primarily as a platform for experimentation with parallelism
rather than for ultimate performance.

pH is another well-known parallel Haskell, developed at MIT [17]. It differs from Glasgow Parallel Haskell in being entirely implicit, and in modifying the Haskell language to support dataflow ideas such as I-structures and k-bounding for loops. The pHluid implementation [7] uses dataflow techniques developed for Id (essentially, it is a pH front-end with an Id back-end), so performance is not yet as good as for Glasgow Haskell.

Several authors have suggested higher-order programming techniques to control parallelism. The best-known of these techniques are probably based on algorithmic skeletons [4]. Skeletons differ from strategies in that they tightly integrate control and value. In effect, good patterns of parallel behaviour are associated with the use of certain higher-order functions, and the corresponding behaviours are invoked automatically at runtime when these functions are called (normally based on static compilation techniques). The approach is more implicit than that suggested here, in that behaviours are chosen based only on the functions that are used. Programmers must, however, cast their programs to suit the skeletons they intend to use, and it is not usually possible for the system to react to dynamic changes by choosing a different behaviour.

Evaluation transformers, as suggested by Burn [2], are also related to strategies in that they specify an evaluation degree on a data structure, and can be used to control parallel behaviour. Unlike strategies, evaluation transformers specify a fixed degree of evaluation, implicitly invoked on the basis of strictness information to reduce values to the extent required. Evaluation transformers are intended to be automatically generated by the compiler rather than programmable, as here.

Like strategies, Caliban's moreover clause [10] also allows control to be separated from value under programmer control. Unlike strategies, moreover-clauses are statically evaluated with a view to creating a static map of tasks to processors.

Probably the most successful purely-functional parallel language to date is NESL [1]. NESL's war-cry is nested data parallelism, supported by built-in operations on one-dimensional arrays. NESL is not as general as GpH, but it is implemented on a variety of big parallel machines (Unix workstations, SP-2, CM-5, Cray C90 and J90, MasPar MP-2, and Intel Paragon), and delivers remarkable performance.
7 Discussion

It has been well worth the effort to address a real problem. The work is still in progress, but we have already learnt much from it, and from the other large parallel programs being developed at Glasgow. The experience can be classified under functional, data-intensive and parallel programming.
7.1 Experiences

Functional programming. Lazy time profiling helps identify the computation-intensive parts of the program worth parallelising. Lazy file reading, and parsing data from text, are excruciatingly slow. System libraries greatly reduce the implementation effort; unfortunately, in order to define strategies over library types it is necessary to take private copies of the library modules.

Data-intensive programming. Coping with null values in a language without explicit support requires care. Multiple bulk types, in particular sets, lists and trees, are useful, and both comprehension notation and uniform function names would be a great help. There are also no built-in consistency checks, nor the concise and efficient access to bulk types provided by PFL selectors. Compared to PFL, Haskell execution is fast, except, of course, for the crucial aspect of I/O operations.

Parallel programming. A methodology for parallelising large programs is emerging. It makes use of a parameterised simulator with heavy instrumentation and many ways of visualising the execution. We have also developed strategies, a mechanism for separately specifying a program's algorithm and its dynamic behaviour. Benign speculation again proves to be a useful means of parallelising functional programs.
7.2 Future Work

Work is continuing on the traffic-accident program. Once good simulated speedups have been obtained, we intend to run the program on the Sun SPARCserver. To get good parallelism results, however, we plan to port GUM to a less heavily-loaded machine. Strategies are being used at Glasgow, St Andrews, and Durham to parallelise several large programs; a paper describing strategies, and our experiences using them, is nearing completion. In the longer term we would like to develop a parallel cost-centre profiling tool that relates a program's dynamic behaviour to its strategies.
References

[1] Blelloch G.E. and Greiner J. "A Provable Time and Space Efficient Implementation of NESL", Proc. ICFP '96, Philadelphia, Pennsylvania, (1996), pp. 213-225.

[2] Burn G.L. "Evaluation Transformers - A Model for the Parallel Evaluation of Functional Languages (Extended Abstract)", Proc. FPCA '87, Springer-Verlag LNCS 274, (1987), pp. 446-470.

[3] Clack C. "Painless Parallel Programming", Parallel Processing in Engineering Community Club, (1995).

[4] Cole M.I. Algorithmic Skeletons: Structured Management of Parallel Computation, Research Monographs in Parallel and Distributed Computing, Pitman, (1989).

[5] Date C.J. An Introduction to Database Systems, 4th Edition, Addison Wesley, (1976).
[6] Davis K. "MPP Parallel Haskell", Proc. IFL '96, Bonn, Germany, (September 1996), Springer-Verlag (in press).

[7] Flanagan C. and Nikhil R.S. "pHluid: the Design of a Parallel Functional Language Implementation", Proc. ICFP '96, Philadelphia, Pennsylvania, (May 1996), pp. 169-179.

[8] Hammond K., Loidl H-W., and Partridge A.S. "Visualising Granularity in Parallel Programs: A Graphical Winnowing System for Haskell", Proc. HPFC '95 - High Performance Functional Computing, Denver, Colorado, (April 1995), pp. 208-221.

[9] Halstead R.H. "Multilisp - a Language for Concurrent Symbolic Computation", TOPLAS 7(4), (October 1985), pp. 501-538.

[10] Kelly P.H.J. Functional Programming for Loosely-coupled Multiprocessors, Research Monographs in Parallel and Distributed Computing, Pitman, (1989).

[11] Launchbury J. and Peyton Jones S.L. "State in Haskell", Lisp and Symbolic Computation 8(4), (December 1995), pp. 293-342.

[12] Loidl H-W. and Hammond K. "Solving Systems of Linear Equations Functionally: a Case Study in Parallelisation", Technical Report, University of Glasgow, (1994).

[13] Loidl H-W. and Hammond K. "A Sized Time Analysis for a Parallel Functional Language", Proc. 1996 Glasgow Workshop on Functional Programming, (1996).

[14] Loidl H-W. and Trinder P.W. "Engineering Parallel Functional Programs", Proc. IFL '97, St Andrews, Scotland, (September 1997).

[15] Mohr E., Kranz D.A., and Halstead R.H. "Lazy Task Creation - a Technique for Increasing the Granularity of Parallel Programs", IEEE Transactions on Parallel and Distributed Systems 2(3), (July 1991), pp. 264-280.

[16] Morgan R.G. and Jarvis S.A. "Profiling Large-Scale Lazy Functional Programs", Proc. HPFC '95 - High Performance Functional Computing, Denver, Colorado, (April 1995), pp. 222-234.

[17] Nikhil R.S., Arvind, Hicks J., Aditya S., Augustsson L., Maessen J.-W., and Zhou Y. "pH Language Reference Manual, Version 1.0 - Preliminary", Computation Structures Group Memo 369, Laboratory for Computer Science, MIT, (January 1995).

[18] Peyton Jones S.L. "Parallel Implementations of Functional Programming Languages", Computer Journal 32(2), (April 1989), pp. 175-186.

[19] Peyton Jones S.L. "Compilation by Transformation: a Report from the Trenches", Proc. European Symposium on Programming (ESOP '96), Linkoping, Sweden, Springer-Verlag LNCS 1058, (January 1996), pp. 18-44.

[20] Poulovassilis A.P. and Small C. "A Functional Programming Approach to Deductive Databases", Proc. 17th Intl. Conf. on Very Large Databases (VLDB '91), G. Lohman et al. (eds), (1991), pp. 491-500.
[21] Poulovassilis A.P. and Small C. "A Domain-Theoretic Approach to Logic and Functional Databases", Proc. 19th Intl. Conf. on Very Large Databases (VLDB '93), (1993), pp. 415-426.

[22] Runciman C. and Wakeling D. (eds.), Functional Languages Applied to Realistic Exemplars: the FLARE Project, UCL Press, (1994).

[23] Roe P. "Parallel Programming with Functional Languages", PhD thesis, CSC 91/R3, Department of Computing Science, University of Glasgow, (April 1991).

[24] Runciman C. and Wakeling D. "Heap Profiling of Lazy Functional Programs", Journal of Functional Programming 3(2), (April 1993), pp. 217-246.

[25] Rushall D. Task Exposure in the Parallel Implementation of Functional Programming Languages, PhD thesis, Dept. of Comp. Sci., University of Manchester, (1995).

[26] Sansom P.M. and Peyton Jones S.L. "Time and Space Profiling for Non-Strict, Higher-Order Functional Languages", Proc. POPL '95, (1995), pp. 355-366.

[27] Tarjan R.E. "Efficiency of a Good, but not Linear, Set Union Algorithm", J. ACM 22, (1975), pp. 215-225.

[28] Trinder P.W., Hammond K., Mattson J., Partridge A., and Peyton Jones S.L. "GUM: a Portable Parallel Implementation of Haskell", Proc. Programming Language Design and Implementation (PLDI '96), Philadelphia, Pennsylvania, (May 1996).

[29] Trinder P.W., Hammond K., Loidl H-W., and Peyton Jones S.L. "Algorithm + Strategy = Parallelism", to appear in the Journal of Functional Programming.

[30] Wakeling D. and Runciman C. "Profiling Parallel Functional Computations (Without Parallel Machines)", Proc. 1993 Glasgow Workshop on Functional Programming, Springer-Verlag WiCS, (1993), pp. 235-248.

[31] Wu J. and Harbird L. "A Functional Database System for Road Accident Analysis", Advances in Engineering Software 26(1), (1996), pp. 29-43.