2008 International Symposium on Parallel and Distributed Processing with Applications
Parallelism without Pain: Orchestrating Computational Algebra Components into a High-Performance Parallel System

A. D. Al Zain, P. W. Trinder, K. Hammond, A. Konovalov, S. Linton, J. Berthold

School of Mathematics and Computer Sciences, Heriot-Watt University, Edinburgh, UK
School of Computer Science, University of St Andrews, St Andrews, UK
Philipps-Universität Marburg, Fachbereich Mathematik und Informatik, Hans-Meerwein-Straße, Marburg, Germany

[email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
Abstract

This paper describes a very high-level approach that aims to orchestrate sequential components written using high-level domain-specific programming into high-performance parallel applications. By achieving this goal, we hope to make parallel programming more accessible to experts in mathematics, engineering and other domains. A key feature of our approach is that parallelism is achieved without any modification to the underlying sequential computational algebra systems, or to the user-level components: rather, all orchestration is performed at an outer level, with sequential components linked through a standard communication protocol, the Symbolic Computing Software Composability Protocol, SCSCP. Despite the generality of our approach, our results show that we are able to achieve very good, and even, in some cases, super-linear, speedups on clusters of commodity workstations: up to a factor of 33.4 on a 28-processor cluster. We are, moreover, able to parallelise a wider variety of problems, and to achieve higher performance, than typical specialist parallel computational algebra implementations.

1 Introduction

Despite the availability of standard message-passing libraries, such as PVM/MPI, and other modern programming tools, writing effective parallel programs is still hard, even for those with a good background in computer science. For domain experts in other specialisms, the difficulty is compounded. At the same time, support for parallelism in domain-specific programming environments is often poor or even completely lacking. It is fair to say that, for the average non-specialist, while the situation has certainly improved compared with the experiences of pioneer parallel programmers, effective parallel programming still involves considerable pain.

In this paper, we describe the design and implementation of a new system, SymGrid-Par, which aims to orchestrate sequential computational algebra components from a variety of computational algebra systems into coherent parallel programs. A wide variety of computational algebra systems exist today: commercial examples include Maple [8], Mathematica [29] and MuPAD [31], while free examples include Kant [12] and GAP [21]. The programmer base for these systems is large (for example, it is estimated that there are several million users of Maple worldwide), diverse (comprising mathematicians, engineers, scientists and economists), and may lack the technical skills that are necessary to exploit widely-used packages such as PVM/MPI. Although many computational algebra applications are computationally intensive, and could, in principle, make good use of the cheap and widely available parallel architectures of cluster machines, relatively few parallel implementations of computational algebra systems have been produced. Those that are available can be unreliable and difficult to use. Indeed, in at least one case [10], the underlying computational algebra engine has been explicitly optimised to be single-threaded, rendering parallelisation a major and daunting task.

By providing an external mechanism that is capable of orchestrating individual sequential components into a coherent parallel program, we aim to facilitate the parallelisation of a variety of computational algebra systems in a way that can be exploited by their normal users. The key advantages of our approach are: i) by using external middleware, it is not necessary to change the sequential computational algebra system in any way; ii) we can exploit the same middleware for multiple computational algebra systems, so gaining benefits of scale; iii) we can support heterogeneous applications, orchestrating components from more than one system; and iv) we can benefit from future advances in parallelisation, without needing to change the end-user application, and with minimal consideration in the application of how that parallelism is achieved. The main disadvantages that we have identified are: i) there is a potential loss of performance compared with hand-coded parallel applications; and ii) the parallel application depends on our external middleware. We believe that the advantages of painless parallelism for the average user of a computational algebra system more than outweigh these disadvantages. Moreover, as we shall see, because we are able to concentrate purely on parallelism aspects, the resulting performance can be highly competitive even with specialised parallel implementations.
1.1 Novelty

Our work exploits the SymGrid-Par middleware for parallelising computational algebra systems (Section 2). The primary research contributions made by this paper are:

1. we present the SymGrid-Par middleware for the first time;

2. we give results showing that the SymGrid-Par middleware delivers:
   (a) flexibility – we parallelise three test problems that capture typical features of real computational algebra problems (Section 3.2); and
   (b) performance – we show that our generic system can outperform the specialised ParGAP [11] implementation, for example (Sections 3.1 and 3.2);

3. we consider one problem that is the subject of ongoing research in mathematical computation, the summatory Liouville function, demonstrating that our approach can deliver accurate results in less time than other parallel approaches (Section 3.3); and

4. we demonstrate that we are capable of orchestrating two different computational algebra systems (Maple and GAP) into a coherent single parallel application (Section 4), so demonstrating our ability to orchestrate parallel symbolic computing systems from heterogeneously-defined components.

2 SymGrid-Par

The work described here uses our new SymGrid-Par middleware, which orchestrates (generally sequential) computational algebra components into a parallel application, where components inter-communicate using the Symbolic Computation Software Composability Protocol, SCSCP [14]. SCSCP is defined as an OpenMath [33] Content Dictionary, where OpenMath is an XML-based data description format designed specifically to represent computational mathematical objects. As its name suggests, SymGrid-Par forms a key part of the SymGrid framework [22], which provides high-performance Grid services for symbolic computing systems. This paper goes beyond [22] by describing SymGrid-Par for the first time; moreover, here we focus on parallelism in a local cluster, rather than in a geographically-distributed Grid. In doing so, we exploit the capabilities of SymGrid-Par to provide a simple parallel API for heterogeneous symbolic components.

Figure 1. SymGrid-Par Design Overview (computational algebra systems such as GAP, Maple and Kant interact through the CAG interface with GpH/GUM, which in turn invokes the computational algebra systems through the GCA interface)

SymGrid-Par (Figure 1) builds on and extends our implementation of the GUM system [41, 3], a parallel implementation of the widely-used purely-functional language Haskell [36]. SymGrid-Par comprises two generic interfaces, which are described below: the "Computational Algebra system to Grid middleware" (CAG) interface links Computational Algebra Systems (CASs) to GUM, and the "Grid middleware to Computational Algebra system" (GCA, Figure 2) interface conversely links GUM to these systems. The CAG interface is used by computational algebra systems to interact with GUM. GUM then uses the GCA interface to invoke remote computational algebra system functions, to communicate with the computational algebra systems, etc. In this way, we achieve a clear separation of concerns: GUM deals with issues of thread creation/coordination and orchestrates the computational algebra system engines to work on the application as a whole, while each instance of the computational algebra system engine deals solely with the execution of individual algebraic computations.
Figure 2. The SymGrid-Par GCA Interface (on each processing element, a GpH/GUM process communicates through a communication layer, e.g. MPICH, and through the GCA interface with a CAS process such as GAP, Maple or KANT)

Figure 2 shows how pairs of GUM and CAS processes are statically spawned onto each available processing element (PE). We have fully implemented the GCA interface. We consider its parallel performance in Section 3 and its orchestration capabilities in Section 4. Work on the implementation of the CAG interface is still in progress.

2.1 The CAG Interface

The CAG interface comprises an API for each symbolic system that provides access to a set of common (and potentially parallel) patterns of symbolic computation. These patterns form a set of dynamic algorithmic skeletons [9], which may be called directly from within the computational algebra system, and which may be used to orchestrate a set of sequential components into a parallel computation. In general (and unlike most skeleton approaches), these patterns may be nested and can be dynamically composed to form the required parallel computation. Also, in general, they may mix components taken from several different computational algebra systems. We can identify two classes of pattern: standard patterns that apply both to computational algebra problems and to other problem types; and domain-specific patterns that arise particularly when dealing with computational algebra problems.

Standard Parallel Patterns. The standard patterns we have identified are listed below. The patterns are based on commonly-used sequential higher-order functions that can be found in functional languages such as Haskell. Similar patterns are often defined as algorithmic skeletons. Here, each argument to a pattern is separated by an arrow (->), and may operate over lists of values ([..]) or pairs of values ((..,..)). All of the patterns are polymorphic: i.e. a, b etc. stand for (possibly different) concrete types.

parMap       :: (a -> b) -> [a] -> [b]
parZipWith   :: (a -> b -> c) -> [a] -> [b] -> [c]
parReduce    :: (a -> b -> b) -> b -> [a] -> b
parMapReduce :: (c -> [(d,a)]) -> (d -> [a] -> b) -> [c] -> [(d,b)]
masterWorker :: (a -> ([a],b)) -> [a] -> [b]

The first argument in each case is a function of either one or two arguments that is to be applied in parallel. So, for example, parMap is a pattern taking two arguments and returning one result. Its first argument (of type a -> b) is a function from some type a to some other type b, and its second argument (of type [a]) is a list of values of type a. It returns a list of values, each of type b. Operationally, parMap applies its function argument to each element of a list, in parallel, returning the list of results, e.g.

parMap double [1,4,9,16]  ==  [2,8,18,32]
  where double x = x + x

It thus implements a parallel version of the commonly found map function, which applies a function to each element of a list. Such patterns occur frequently in parallel applications, and we therefore make extensive use of the parMap pattern to orchestrate the examples presented in this paper. The parZipWith pattern similarly applies a function, but in this case to two arguments, one taken from each of its list arguments. Each application is performed in parallel. For example:

parZipWith add [1,4,9,16] [3,5,7,9]  ==  [4,9,16,25]
  where add x y = x + y

Again, this implements a parallel version of the zipWith function that is found in Haskell and some other functional languages. Finally, parReduce reduces its third argument (a list) by applying a function between pairs of elements, ending with the value supplied as its second argument; parMapReduce combines features of both parMap and parReduce; and masterWorker (written masterSlaves in some of the example code below) is used to introduce a set of tasks, each of which is under the control of a coordinating master task. The parReduce and parMapReduce patterns are often used to construct parallel pipelines, where the elements of the list will themselves be lists, perhaps constructed using other parallel patterns. In this way, we can achieve nested parallelism.
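As a small illustration of such nesting (our example, not taken from the paper), the patterns compose directly; here each inner list is reduced in parallel, and the partial results are themselves combined by a further parallel reduction:

sumOfProducts :: [[Int]] -> Int
sumOfProducts xss = parReduce (+) 0 (parMap (parReduce (*) 1) xss)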
Domain-Specific Parallel Patterns. We have identified a number of new domain-specific patterns that may arise in a variety of computational algebra problems, and incorporated these into the design of SymGrid-Par. These patterns include:

1. orbit calculation: generate unprocessed neighbouring states from a task queue of states to process.

2. duplicate elimination: given two lists of values, merge them while eliminating duplicates.

3. completion algorithm: given a set of objects to process, generate new objects from any pair of objects, and then reduce these new objects against the existing objects. Continue until no new objects can be found.

Each of these patterns produces highly parallel computations, with complex irregular task sizes, and mixes of task- and data-parallelism.
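To make the first of these concrete, the following sequential sketch (ours, not the paper's API) captures the semantics of the orbit pattern: starting from an initial task queue of states, unprocessed neighbouring states are generated until no new states remain; the parallel pattern distributes the processing of the queue across PEs.

import Data.List (nub)

orbit :: Eq a => (a -> [a]) -> [a] -> [a]
orbit neighbours initial = go initial initial
  where
    go seen []         = seen
    go seen (s : todo) =
      let new = nub [ t | t <- neighbours s, t `notElem` seen ]
      in  go (seen ++ new) (todo ++ new)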
2.2 GUM - A Parallel Haskell Runtime Environment

SymGrid-Par internal coordination constructs are implemented using GUM [23, 41, 2], a portable, parallel runtime environment for the highly-optimising Glasgow Haskell Compiler (GHC) [19] that has been successively refined since its inception in 1995. GUM implements a specific distributed shared-memory model of parallel execution, namely graph reduction on a distributed, but virtually shared, graph. Graph segments are communicated in a message-passing architecture designed to provide an architecture-neutral and portable runtime environment. The GUM implementation delivers good performance for a range of parallel benchmark applications written using the GpH parallel dialect of Haskell. These benchmarks have been tested on a variety of parallel architectures, including shared- and distributed-memory architectures [26], and a wide-area computational Grid [3].

The basic unit of computation in GUM is a lightweight thread. Each logical processing element (PE) is represented by an operating-system process that schedules multiple lightweight threads, as outlined below. Threads are automatically synchronised using the graph structure, and each PE maintains a pool of runnable threads. Parallelism is introduced by sparking potential threads.

Figure 3. Load Distribution in GUM (each PE holds a spark pool, a thread pool and blocking queues; sparks are activated into threads, and idle PEs send FISH messages that are answered with SCHEDULE messages)

GUM supports dynamic and decentralised load management through a tried-and-tested work-stealing model, which works as follows. If (and only if) a PE has no runnable threads, it chooses a spark from its local spark pool and turns this into a thread, by allocating a local stack and other thread state. If there are no local sparks, then the PE sends a FISH message to another PE. If the recipient of a FISH has a spark, it returns it to the originating PE as a SCHEDULE message. The originating PE unpacks the graph describing the spark, and adds the newly-acquired spark to its local spark pool. An ACK message is then sent to the PE that provided the spark to indicate its new location. In this way, the distributed shared graph structure is preserved. If the recipient of a FISH message has an empty spark pool, it simply forwards the FISH to another PE. If no spark is found after a specified number of hops, the FISH is returned to the originator, which backs off for a short period and then retries the FISH, perhaps using a different set of target PEs. A sequence of messages initiated by a FISH is shown in Figure 3.
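For concreteness, sparks are created in GpH source code with the standard par combinator; the following idiom (a generic GpH example of ours, not SymGrid-Par code) sparks the evaluation of one component of a pair while the other is evaluated:

import Control.Parallel (par, pseq)

parPair :: (a -> b) -> (a -> c) -> a -> (b, c)
parPair f g x = y `par` (z `pseq` (y, z))
  where y = f x
        z = g x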
2.3 The GCA Interface

The GCA interface (Figure 2) connects our middleware to computational algebra systems (CASs), linking to a small interpreter that allows the invocation of arbitrary computational algebra system functions, marshalling and unmarshalling data as required. The interface comprises both C and Haskell components. The C component is mainly used to invoke operating system services that are needed to initiate the computational algebra process, to establish communication channels, and to send and receive commands/results from the computational algebra system process. It also provides support for static memory that can be used to maintain state between calls. The Haskell component provides interface functions to the user program and implements the communication protocol with the computational algebra process. The main GpH functions are:

casEval        :: String -> [casObject] -> casObject
casEvalN       :: String -> [casObject] -> [casObject]
string2CASExpr :: String -> casObject
casExpr2String :: casObject -> String

Here, casEval and casEvalN allow Haskell programs to invoke functions in the computational algebra system by giving a function name as a string, plus a list of parameters as casObjects; casEvalN is used to invoke computational algebra system functions that return more than one object (a list of casObjects is returned); and string2CASExpr/casExpr2String convert computational algebra system objects to/from internal Haskell data formats, respectively.
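As an illustration (our example, not from the paper), a GpH program could use these functions to invoke GAP's built-in GcdInt through the GCA interface:

gcdViaCAS :: String -> String -> String
gcdViaCAS a b =
  casExpr2String (casEval "GcdInt" [string2CASExpr a, string2CASExpr b])

-- e.g. gcdViaCAS "12" "18" yields "6" when the underlying CAS is GAP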
3 Parallel Performance Results

We consider three simple programs that all display irregular parallelism: sum-Euler computes the sum of the values of the Euler totient function over a list of integers (Section 3.1); smallGroup searches for finite groups of given orders such that the average order of their elements is an integer (Section 3.2); and the summatory Liouville function, L(n) = Σ_{k=1}^{n} λ(k), is the sum of the values of the Liouville function λ(n) = (−1)^{r(n)}, where r(n) is the number of prime factors of n, counted with their multiplicity, and r(1) = 0 (Section 3.3).

We have currently only implemented SymGrid-Par for x86-class machines running PVM under Linux. In this paper, we measure the performance of our implementation on a 28-node Beowulf cluster, located at Heriot-Watt University (bwlf). Each node is constructed from a 3GHz single-core Intel Pentium 4, with a 533MHz front-side bus, and 512MB of standard DIMMs. All nodes run the same version of Fedora Linux (kernel version 2.6.10-1). Nodes on the Beowulf cluster are connected using 100Mb/s Ethernet. In order to reduce the impact of operating system and other effects, all runtimes are given as the mean of three measured times.

3.1 sumEuler

3.1.1 Problem Outline and Implementation

The sum-Euler program represents a simple example that enables us to demonstrate that GCA delivers good parallel performance for a typical irregularly-parallel computational algebra problem, to compare the parallel performance of alternative computational algebra algorithms, and to compare the performance of GCA with a specialist parallel computational algebra system implementation, ParGAP. The main sum-Euler function, sumTotient, computes the sum of applying the euler function in parallel (via parMap) to a list of integers ranging from lower to upper. In order to improve parallelism, the list is split into sub-lists, each of size c. This gives a data-parallel implementation, with a fairly cheap combination phase involving only a small amount of communication. Despite using data parallelism, sum-Euler introduces highly irregular parallelism, since task granularity varies significantly depending on the input parameter values.

sumTotient :: Int -> Int -> Int -> Int
sumTotient lower upper c =
  sum (parMap (sum . map euler) (splitAtN c [lower..upper]))

euler :: Int -> Int
euler n = length (filter (relprime n) [1..n-1])

relprime :: Int -> Int -> Bool
relprime x y = hcf x y == 1
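The helper functions hcf and splitAtN used above are not listed in the paper. Plausible definitions (hcf mirrors the recursive GAP version given below, and splitAtN divides the input list into sub-lists of length c) are:

hcf :: Int -> Int -> Int
hcf x 0 = x
hcf x y = hcf y (x `mod` y)

splitAtN :: Int -> [a] -> [[a]]
splitAtN _ [] = []
splitAtN n xs = take n xs : splitAtN n (drop n xs)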
The corresponding GCA implementation calls the GAP euler function directly:

sumTotient_GAP :: Int -> Int -> Int -> Int
sumTotient_GAP lower upper c =
  sum (parMap (sum . map euler_GAP) (splitAtN c [lower..upper]))

euler_GAP :: Int -> Int
euler_GAP n = gapObject2Int (gapEval "euler" [int2GAPObject n])

For comparison, we show an implementation of sum-Euler in ParGAP. Here, ParList is equivalent to parMap.

sumEuler_ParGAP := function(lower, upper, c)
  local result, x;
  result := Sum(ParList([lower..upper], x -> euler(x), c));
  return result;
end;

We have defined two versions of euler in GAP: a naïve recursive implementation, and a straightforward imperative implementation. In the recursive implementation, a direct translation of the GpH code, relprime_rec calls the hcf function to calculate the highest common factor of x and y recursively. In the straightforward implementation, relprime_direct instead uses the built-in GAP function GcdInt to identify arguments which are relatively prime. To maintain comparability between the GAP and Haskell implementations, we do not use the well-known formula for the Euler function (e.g. [1]), but rather define it directly from the underlying mathematical equations.

# recursive version
hcf := function(x, y)
  local m;
  if y = 0 then
    return x;
  else
    m := x mod y;
    return hcf(y, m);
  fi;
end;;

relprime_rec := function(x, y)
  local m;
  m := hcf(x, y);
  return m = 1;
end;;

# direct version
relprime_direct := function(x, y)
  local m;
  m := GcdInt(x, y);
  return m = 1;
end;

euler := function(n)
  local x;
  x := Number(Filtered([1..n], x -> relprime(x, n)));
  return x;
end;;

3.1.2 Results for sum-Euler

Table 1. Sequential sum-Euler 32000

Runtime:  GpH/GUM 164s  |  GAP direct 500s  |  GAP recursive 2298s

Table 1 shows sequential results for sum-Euler. Here, the GpH/GUM (pure Haskell) implementation is significantly more efficient than either of the GAP implementations, and the direct GAP solution is significantly faster than the recursive implementation. Overall, the GpH/GUM version is a factor of 2-3 faster than the direct GAP version, and a factor of 8-17 faster than the recursive GAP version.

Table 2. Parallel sum-Euler [1..32000]

PE  | GCA Recur | Spd  | GCA Dir | Spd  | ParGAP Recur | Spd  | ParGAP Dir | Spd  | GUM  | Spd
1   | 3006s     | 1    | 500s    | 1    | 3288s        | 1    | 686s       | 1    | 164s | 1.0
2   | 1139s     | 2.6  | 320s    | 1.5  | 2117s        | 1.5  | 481s       | 1.4  | 121s | 1.3
4   | 542s      | 5.5  | 154s    | 3.2  | 1175s        | 2.8  | 233s       | 2.9  | 69s  | 2.3
6   | 365s      | 8.2  | 107s    | 4.6  | 690s         | 4.7  | 137s       | 4.8  | 45s  | 3.6
8   | 267s      | 11.2 | 84s     | 5.9  | 490s         | 6.7  | 104s       | 6.6  | 38s  | 4.3
12  | 174s      | 17.2 | 59s     | 8.4  | 310s         | 10.6 | 62s        | 11.0 | 39s  | 4.2
16  | 141s      | 21.3 | 51s     | 9.8  | 223s         | 14.7 | 44s        | 15.5 | 35s  | 4.6
20  | 115s      | 26.1 | 45s     | 11.1 | 166s         | 19.8 | 34s        | 20.1 | 30s  | 5.4
28  | 95s       | 31.6 | 40s     | 12.5 | 115s         | 28.5 | 25s        | 27.4 | 23s  | 7.1

Table 2 shows the performance of sum-Euler applied to the range [1..32000] on the 28-node bwlf cluster described earlier. The first column shows the number of processors (PEs). The second and third columns show runtimes and speedups for GCA using the recursive euler, and the fourth and fifth columns show the same results for the direct version. The sixth to ninth columns show corresponding runtimes and speedups for ParGAP. The last two columns show runtimes and speedups for GUM. As is typical for parallel implementations, the parallel versions of GCA and ParGAP are slower on a single processor than the corresponding sequential programs, e.g. 3006s compared with 2998s. This is because the parallel versions add additional synchronisation and marshalling; detailed analyses of the sequential efficiency of GUM can be found in [41]. Figures 4 and 5 plot comparative speedup and runtime curves for GUM and GCA-GAP running sum-Euler 32,000 using, respectively, the recursive and direct algorithms. Table 3 shows the corresponding performance of sum-Euler applied to [1..80000]. No results could be recorded for the recursive version.

Figure 4. GUM vs. GCA-GAP, recursive impl. of sum-Euler (runtime and speedup against number of PEs)

Figure 5. GUM vs. GCA-GAP, direct impl. of sum-Euler (runtime and speedup against number of PEs)

Table 3. Parallel sum-Euler [1..80000]

PE  | GCA   | Spdup | GUM   | Spdup
1   | 3265s | 1     | 1088s | 1
2   | 1677s | 1.9   | 633s  | 1.7
4   | 818s  | 3.9   | 330s  | 3.3
8   | 406s  | 8.0   | 165s  | 6.6
12  | 277s  | 11.7  | 126s  | 8.6
16  | 207s  | 15.7  | 95s   | 11.4
20  | 183s  | 17.8  | 78s   | 13.9
24  | 159s  | 20.5  | 76s   | 14.3
28  | 139s  | 23.4  | 76s   | 14.3

The Recursive GCA Version. Table 2 shows that the recursive GCA version delivers near-linear speedup, yielding a maximum speedup of 31.6 on 28 PEs. It also shows better performance than recursive ParGAP, by a factor of 2.1 on 4 PEs and 1.2 on 28 PEs. Despite this, GCA is still slower than GpH/GUM alone, by a factor of 18.3 on 1 PE and a factor of 4.1 on 28 PEs. This difference can be attributed to the lack of memory management optimisations for recursive programming in GAP [11]. In contrast, the GUM memory management and garbage collectors are better optimised for recursion [25]. The modest super-linearity in the GCA program can again be attributed to reduced memory management costs.

The Direct GCA Version. Table 2 also shows that the direct version displays worse speedup than its recursive counterpart. However, it is faster, by a factor of 6 on 2 PEs and by a factor of 2.3 on 28 PEs. It is clear that, for this example, the direct algorithm is implemented more efficiently by the GAP kernel. The corresponding GCA implementation yields some, but not exceptional, parallel performance, with a speedup of 1.5 on 2 PEs and 12.5 on 28 PEs. While ParGAP shows a better speedup (1.4 on 2 PEs and 27.4 on 28 PEs), GCA shows better runtimes up to 12 PEs. Beyond this, the small problem size means that the overhead costs of communication, marshalling/unmarshalling etc. dominate the total time, and ParGAP performs better. Although the direct version delivers better speedup than the GUM version, GUM is still faster than either GCA or GAP, by a factor of 3 and 4, respectively, on one PE, and by a factor of 1.7 and 1.1, respectively, on 28 PEs. This is mainly because our system is based on the highly-optimised sequential GHC implementation [19, 35]. The poor speedup we observe on 28 PEs with an input of 32,000 for both GUM and the direct GCA version is another effect of the small problem size: increasing the input size to 80,000 (Table 3) improves speedup by almost a factor of two in each case, to 14.3 and 23.4 on 28 PEs, respectively.
3.2 smallGroup

The smallGroup program searches a range of group orders for finite groups satisfying some given property (in this case, that the average order of their elements is an integer). There are two levels of irregularity: i) as shown by Figure 6, the number of groups of a given order varies enormously (by 5 orders of magnitude); and ii) there are variations in the cost of computing the conjugacy classes of each group. The kernel of the smallGroup program is divided into a coordination part, which uses the GCA interface from Haskell:

smGrpSearch :: Int -> Int -> [(Int,Int)]
smGrpSearch lo hi = concat (map (ifmatch predSmGrp) [lo..hi])

predSmGrp :: (Int,Int) -> (Int,Int,Bool)
predSmGrp (i,n) =
  (i, n, (gapObject2String
            (gapEval "IntAvgOrder" [int2GapObject n, int2GapObject i])) == "true")

ifmatch :: ((Int,Int) -> (Int,Int,Bool)) -> Int -> [(Int,Int)]
ifmatch predSmGrp n = [ (i,n) | (i,n,b) <- ..., b ]
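The top-level search above is written with the sequential map; a parallel variant using the parMap pattern from Section 2.1 would look as follows (our sketch, not the paper's code):

smGrpSearchPar :: Int -> Int -> [(Int,Int)]
smGrpSearchPar lo hi = concat (parMap (ifmatch predSmGrp) [lo..hi])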
Figure 8. Runtime and speedup for smallGroup [600 .. 1000]

3.3 The Summatory Liouville Function

3.3.1 Problem Outline and Implementation

The Pólya conjecture asserts that L(n) ≤ 0 for all n ≥ 2. Haselgrove proved that the conjecture is false, i.e. that there exist infinitely many x with L(x) > 0, but he neither presented such an x nor gave an upper bound for the minimal x with L(x) > 0. A computer-aided search reported in [24] found the counterexample to be L(906180359) = +1, without claiming its minimality. This search used some inventive techniques, as described in the paper, e.g. preliminary computation of some data and "caching" them on magnetic tape, and was performed on an IBM 704 at the University of California, Berkeley. Only in 1980 was it reported that the smallest counterexample is n = 906150257 [39]. In fact, the Pólya conjecture fails to hold for most values of n in the range 906150257 ≤ n ≤ 906488079. In this region, the function reaches a maximum value of 829 at n = 906316571. To the best of our knowledge, as of April 2008, the highest value up to which the conjecture has been tested is 7.5 × 10^13 [6].

The coordination part of the kernel of the parallel liouville function is:

L :: Integer -> Integer -> Int -> [(Integer,Integer)]
L lower upper c = sumL (myMakeList c lower upper)

sumL :: [(Integer,Integer)] -> [(Integer,Integer)]
sumL mylist = mySum ((masterSlaves liouville) mylist)

liouville :: (Integer,Integer) -> [((Integer,Integer),(Integer,Integer))]
liouville (lower,upper) =
  let l = map gapObject2Integer
            (gapEvalN "gapLiouville"
               [integer2GapObject lower, integer2GapObject upper])
  in ((head l, last l), (lower,upper))
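(The helpers myMakeList and mySum used above are not listed in the paper. A plausible definition of myMakeList, assuming it simply divides the range [lower..upper] into sub-ranges of width c, is sketched below; mySum would combine the per-chunk results returned by the workers.)

myMakeList :: Int -> Integer -> Integer -> [(Integer,Integer)]
myMakeList c lower upper
  | lower > upper = []
  | otherwise     = (lower, hi) : myMakeList c (hi + 1) upper
  where
    hi = min upper (lower + fromIntegral c - 1)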
and the GAP computational part is:

LiouvilleFunction := function( n )
  if n = 1 then
    return 1;
  elif Length(FactorsInt(n)) mod 2 = 0 then
    return 1;
  else
    return -1;
  fi;
end;

gapLiouville := function( n, x )
  local total, max, s, list;
  s := LiouvilleFunction(n);
  total := s;
  max := s;
  n := n + 1;
  while n ...

3.3.2 Performance Results

This section presents two sets of performance results.

Parallel performance. Table 7 shows the performance of liouville applied to the range [1..906150257] on the 28-node bwlf cluster described earlier. The first column in the table shows the number of processors (PEs); the second and third columns show runtimes and speedups. The accompanying graph plots the relative speedup and runtime for GCA-GAP running liouville on the same range. These results strengthen the earlier claim that GCA continues to provide significant parallel performance on different real applications, here with a speedup of 26.1 on 28 PEs.

Table 7. Runtimes and speedups for liouville [1..906150257]

PE  | Runtime | Spd
1   | 101750  | 1
2   | 53553   | 1.9
4   | 27013   | 3.8
8   | 13409   | 7.5
12  | 8993    | 11.3
16  | 6738    | 15.1
20  | 5409    | 18.8
24  | 4580    | 22.2
28  | 3899    | 26.1
Comparative performance. Borwein has recently described a number of new algorithms to calculate the summatory Liouville function [6]. Using a cluster of around 30 PowerMac dual-processor 2.5GHz G5 workstations at Simon Fraser University, each with 2GB of memory (Borwein does not give his exact configuration, but we are able to estimate the likely size of his cluster from the timing figures he provides [6]), he evaluated the summatory Liouville function up to 7.5 × 10^13. The job was run over thirty nights, requiring about 2.5 years of CPU time in total. In comparison, using our GCA interface, we obtained the same results in 13 days and 20 hours on our 28-processor bwlf cluster, despite using a relatively naive algorithm and a much slower processor with less memory (Apple Computer report a SPEC int_rate base2000 of 17.2 for a dual 2GHz G5, versus 10.3 for a single-core Intel Pentium IV; http://www.apple.com/lae/powermac/performance/). This demonstrates that the GCA interface is capable of undertaking challenging computational examples and is more than competitive with the state-of-the-art in specialised implementations. It also demonstrates the robustness and correctness of our implementation.

4 Orchestration

We have shown that we can obtain good parallel performance by coordinating components from a single computational algebra system, GAP. By using a common protocol, an early version of SCSCP [14], we are, in fact, able to link and orchestrate components written using several different computational algebra systems. This increases expressive power, meaning that new problems can be solved. Our overall aim is to orchestrate components written using GAP, Maple, MuPAD, KANT, and possibly other systems. This section shows how GCA can orchestrate multiple computational algebra systems, using two examples that combine GAP and Maple components.

4.1 Totient Inverse

The Totient Inverse program computes, sequentially, the totient of a positive integer. It then finds all numbers which have the same totient value, i.e. all numbers x which have the same value of Phi(x). While GAP provides a built-in totient (Phi) function, there is no convenient way of computing the inverse totient. In contrast, Maple has a built-in invphi function. In the implementation of Totient Inverse, the totient is calculated using the gapPhi function, which invokes GAP's Phi function on value. The totient inverse is the list of integers given by maplePosiInt, which invokes Maple's numtheory[invphi] on totient.

main = do
    .........
    totient     = gapPhi value
    finalResult = maplePosiInt totient
    .........
    return ()

gapPhi :: Integer -> Integer
gapPhi n = gapObject2Integer (gapEval "Phi" [integer2GapObject n])

maplePosiInt :: Integer -> [Int]
maplePosiInt n =
  map mapleObject2Int (mapleEvalN "numtheory[invphi]" [integer2MapleObject n])
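As a concrete illustration (our example, not from the paper): for value = 10, gapPhi returns Phi(10) = 4, and maplePosiInt then returns every x with Phi(x) = 4:

gapPhi 10       ==  4
maplePosiInt 4  ==  [5,8,10,12]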
4.2 Finite State Automata

This example arose from a research problem [4]. Finite state automata comprise a finite number of states, transitions between those states, and a set of actions. A finite state automaton defines a regular language. One interesting feature of such a language is its growth rate, which can be found as the second largest real eigenvalue of the transition matrix of the automaton. A GAP package provides efficient functions for constructing and manipulating automata and can compute the transition matrix but, owing to the lack of floating point support in GAP, cannot find the actual growth rate. Maple, by contrast, has excellent real number facilities, but no support for automata. In this example we combine the two systems to determine the growth rates of some languages of interest. We show the computation for one specific automaton. FSAtoMatrix converts the automaton, aut, into its transition matrix, and CharPolyOfFSA gives the characteristic polynomial of matrix m.

CharPolyOfFSA := function()
  local aut, m, c;
  aut := Automaton("Det", 4, 2, [[3,4,3,4],[3,3,4,4]], [1], [4]);;
  m := FSAtoMatrix(aut);;
  c := CharacteristicPolynomial(m);
  return c;
end;

FSAtoMatrix := function(a)
  local mat, s, l, s2;
  mat := NullMat(a!.states, a!.states, 0);
  for s in [1..a!.states] do
    for l in [1..a!.alphabet] do
      s2 := a!.transitions[l][s];
      mat[s2][s] := mat[s2][s] + 1;
    od;
  od;
  return mat;
end;

The GCA program calls two functions, gapAutomaton and mapleFsolve. gapAutomaton calls the GAP CharPolyOfFSA function to build an automaton and return its characteristic polynomial. The mapleFsolve function calls Maple's fsolve function to compute the roots of the given polynomial.

main = do
    .........
    vFromGAP    = gapAutomaton
    finalResult = mapleFsolve vFromGAP
    .........
    return ()

gapAutomaton :: String
gapAutomaton = gapObject2String (gapEval "CharPolyOfFSA" [])

mapleFsolve :: String -> [String]
mapleFsolve str =
  map mapleObject2String (mapleEvalN "fsolve" [string2MapleExpr str])
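The growth rate is then the second largest real eigenvalue, as noted above. A sketch of that final selection step (ours, and assuming the roots come back from Maple as plain decimal strings) is:

import Data.List (sort)

growthRate :: [String] -> Double
growthRate roots =
  case reverse (sort (map read roots :: [Double])) of
    (_ : second : _) -> second
    _                -> error "fewer than two real roots"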
5 Related Work

5.1 Parallel Symbolic Computation

Work on parallel symbolic computation dates back to at least the early 1990s – Roch and Villard [37] provide a good general survey of early research. Within this general area, significant research has been undertaken on parallelising specific computational algebra algorithms, notably term rewriting and Gröbner basis completion (e.g. [5, 7]). A number of one-off parallel programs have also been developed for specific algebraic computations, mainly in representation theory [30].

There have been several attempts to provide support for parallelism in specific computational algebra systems. For example, ParGAP [11] is an extension of GAP which defines a similar set of parallel patterns to our basic set; the new Threads package for Maple [27] provides user-level routines for multi-threaded programming on multicore computers; the Distributed Maple environment [38] allows the construction of Maple applications across distributed systems; Maple2g [34] uses Globus to link Maple to computational Grids; and the rather misleadingly-named gridMathematica and Mathematica Personal Grid Edition similarly support parallel cluster computing in Mathematica. In contrast to these attempts, very few production parallel algorithms have been produced. This is partly due to the complexities involved in programming such algorithms using explicit parallelism, which are accidental for many programmers in this domain. It is also partly due to the lack of generalised support for communication, distribution etc. in these systems. By abstracting over such issues, by providing system-independent orchestration of parallel programs and, especially, through the identification of the domain-specific patterns noted above, we anticipate that SymGrid-Par will considerably simplify the construction of effective parallel computational algebra computations.

5.2 Other Systems Linking Parallel Functional Languages and Computer Algebra Systems

There have also been some previous attempts to link parallel functional programming languages with computer algebra systems, for example, the GHC-Maple interface [20] and the Eden-Maple system [28]. None of these systems is in widespread use at present, however; none supports the broad range of computational algebra applications we are targeting, or has the support of the developers of those systems; none has such an ambitious goal in terms of orchestrating legacy sequential components; and none has achieved comparable results to those reported here.

5.3 Parallel Coordination Languages

In a coordination language, behavioural control is achieved by the use of a meta-language (the coordination language) into which is embedded the normal algorithmic language (the computation language). The coordination language approach is typified by Linda [15] or PCN [13]. Linda uses a shared tuple space to express communication between sequential processes written in the computation language, whereas PCN uses three composition operators to link pairs of communication ports sequentially, in parallel, or through choice.

Clearly, the approach we have described here can be seen as an example of a coordination language. Our approach differs from these earlier approaches in three main ways: firstly, by providing a set of standard parallel patterns that can be called directly by the applications programmer from within the host computational algebra system, we avoid the need for significant parallel programming or a paradigm shift in many cases; secondly, we provide very high-level abstraction over parallelism issues such as task migration, placement etc.; and finally, our approach generalises to computational Grids as well as parallel systems.

5.4 Orchestrating Parallel Computational Algebra Systems

Several other projects aim to support computational algebra systems as either Web or Grid services. These include GENSS [17], GEMLCA [16], NetSolve/GridSolve [32], Geodise [18], MapleNet [27], webMathematica [29] and MathGridLink [40]. However, none of these systems is capable of composing and linking heterogeneous systems, as we have done. Unlike these approaches, SymGrid-Par is designed from the outset to support parallelism for a variety of symbolic systems. SymGrid-Par already unites two major systems, and the work required to add a new system mainly consists of adding simple SCSCP and GCA support. SymGrid-Par allows symbolic components to be orchestrated instead of developing specific tools for each system. While this may seem to carry performance penalties, we have demonstrated above that, in practice, we can achieve very good speedup on real distributed systems.

6 Conclusions and Future Work

We have presented SymGrid-Par, a generic interface for constructing parallel applications from multiple computational algebra systems, using the SCSCP protocol. We have demonstrated the flexibility of our approach by parallelising three test problems which capture typical features of real computational algebra problems, including a problem with multiple levels of irregular parallelism that is not readily solved by other parallel computational algebra systems. In all cases, we achieve very good absolute speedups over the corresponding sequential implementations. We have further demonstrated the usability of our system by comparing our performance with recent ongoing research aimed at calculating the values of the summatory Liouville function: we obtain both accurate results and better absolute performance than our rivals. Finally, we have used two examples to show how we increase the expressive power of computational algebra systems by orchestrating different computational algebra systems, in this case GAP and Maple, into a coherent parallel application.

Despite being both extremely generic and requiring absolutely no changes to the computational algebra kernel systems, our approach is also highly competitive with ParGAP, a specialised parallel implementation of GAP. In fact, in most cases SymGrid-Par outperforms ParGAP in terms of both relative speedup and absolute execution time. Moreover, our approach is both scalable and portable: for the applications we have considered here, we obtain speedups over a sequential GAP system of up to 33.4 on a 28-processor cluster. We must now extend our results to cover a wider variety of computational algebra systems, firstly on parallel clusters and then, by extending preparatory work that we have done to map GUM to the Grid [3], on wide-area computational Grids. The implementors of the Kant [12] and MuPAD [31] systems are partners in the EU-funded SCIEnce project that has sponsored our work, and we will shortly integrate these systems into SymGrid-Par. We anticipate delivering similar parallel performance to the users of these other computational algebra systems, again without needing to alter the stable, reliable and widely-used sequential kernels of these systems.

The benefits of lightweight orchestration using our approach are clear: it is possible to achieve good parallel performance without needing major system rewrites, and without the need for end-users to learn painful low-level parallel programming techniques. This is a major gain for developers and users of complex legacy systems wishing to take rapid and straightforward advantage of the upcoming availability of cheap commodity parallel hardware. It is our conclusion that, by using our coordination-based approach, not only is "parallelism without pain" possible for both programmers and system developers, but that it can also be highly practical. Moreover, despite the genericity of our solution, the resulting performance can even, in some cases, exceed that of much more limited, specialised parallel implementations.

Acknowledgments

This research is supported by European Union Framework 6 grant RII3-CT-2005-026133 SCIEnce: Symbolic Computing Infrastructure in Europe. We would like to thank Hans-Wolfgang Loidl for his constructive comments on a previous draft of this paper, and Gene Cooperman for his advice on domain-specific patterns.
References

[1] M. Abramowitz and I.A. Stegun. Handbook of Mathematical Functions. Dover Publications, New York, 1964.
[2] A. Al Zain, P. Trinder, H-W. Loidl, and G. Michaelson. Supporting high-level grid parallel programming: the design and implementation of grid-gum2. In Simon J. Cox, editor, UK e-Science Programme All Hands Meeting (AHM), pages 182–189, Nottingham, UK, September 2007. National e-Science Centre.
[14] S. Freundt, P. Horn, A. Konovalov, S. Linton, and D. Roozemond. Symbolic Computation Software Composability Protocol (SCSCP) specification. Version 1.1, 2008 (http://www.symbolic-computation.org/scscp).

[15] D. Gelernter and N. Carriero. Coordination Languages and Their Significance. Comm. ACM, 32(2):97–107, February 1992.
[3] A. Al Zain, P. Trinder, H-W. Loidl, and G. Michaelson. Evaluating a high-level parallel language (gph) for computational grids. IEEE Transactions on Parallel and Distributed Systems, 19(2):219–233, 2008.
[16] GEMLCA, http://www.cpc.wmin.ac.uk/gemlca/.

[4] M. H. Albert and S. A. Linton. Growing at a perfect speed. Combinatorics, Probability and Computing, 2007. Submitted.
[17] GENSS, http://genss.cs.bath.ac.uk/. [18] Geodise, http://www.geodise.org/.
[5] B. Amrheim, O. Gloor, and W. K¨uchlin. A Case Study of Multithreaded Gr¨obner Basis Completion. In Proc. ISSAC ’96: International Symposium on Symbolic and Algebraic Computation, pages 95–102. ACM Press, 1996.
[19] GHC. http://www.haskell.org/ghc/.

[20] The GHC-Maple Interface, http://www.risc.uni-linz.ac.at/software/ghc-maple/.

[21] The GAP Group. GAP – Groups, Algorithms, and Programming, 2007. http://www.gap-system.org.
[6] P. Borwein, R. Ferguson, and M.J. Mossinghoff. Sign changes in sums of the liouville function. Mathematics of Computation, 77(263):1681–1694, January 2008.
[22] K. Hammond, A. Al Zain, G. Cooperman, D. Petcu, and P. Trinder. SymGrid: a Framework for Symbolic Computation on the Grid. In Proc. EuroPar’07, LNCS, Rennes, France, August 2007.
[7] R. B¨undgen, M. G¨obel, and W. K¨uchlin. Multithreaded AC term re-writing. In Proc. PASCO’94: Intl. Symp. on Parallel Symbolic Computation, volume 5, pages 84–93. World Scientific, 1994.
[23] K. Hammond, J.S. Mattson Jr., A.S Partridge, S.L. Peyton Jones, and P.W. Trinder. GUM: a Portable Parallel Implementation of Haskell. In IFL’95 — Intl Workshop on the Parallel Implementation of Functional Languages, September 1995.
[8] B. W. Char et al. Maple V Language Reference Manual. Maple Publishing, Waterloo Canada, 1991. [9] M.I. Cole. Algorithmic Skeletons. In K. Hammond and G. Michaelson, editors, Research Directions in Parallel Functional Programming, chapter 13, pages 289–304. Springer-Verlag, 1999.
[24] R. S. Lehman. On Liouville’s function. Math. Comp., 14:311–320, October 1960. [25] H-W. Loidl and P. W. Trinder. Engineering Large Parallel Functional Programs. In Implementation of Functional Languages, 1997, LNCS, St. Andrews, Scotland, September 1997. Springer.
[10] G. Cooperman. GAP/MPI: Facilitating parallelism. In Proc. DIMACS Workshop on Groups and Computation II, volume 28 of DIMACS Series in Discrete Maths. and Theoretical Comp. Sci., pages 69–84. AMS, 1997.
[26] H-W. Loidl, P. W. Trinder, K. Hammond, S. B. Junaidu, R. G. Morgan, and S. L. Peyton Jones. Engineering Parallel Symbolic Programs in GPH. Concurrency — Practice and Experience, 11:701–752, 1999.
[11] G. Cooperman. Parallel GAP: Mature interactive parallel. Groups and computation, III (Columbus, OH,1999), 2001. de Gruyter, Berlin.
[27] Maple 11. http://www.maplesoft.com/.
[12] M. Daberkow, C. Fieker, J. Kl¨uners, M. Pohst, K. Roegner, M. Sch¨ornig, and K. Wildanger. Kant v4. J. Symb. Comput., 24(3/4):267–283, 1997.
[28] R. Mart´ınez and R. Pena. Building an Interface Between Eden and Maple. In Proc. IFL 2003, SpringerVerlag LNCS 3145, pages 135–151, 2004.
[13] I. Foster and S. Taylor. A Compiler Approach to Scalable Concurrent-Program Design. ACM TOPLAS, 16(3):577–604, 1994.
[29] Mathematica. http://www.wolfram.com/products/mathematica/.
[30] G. O. Michler. High performance computations in group representation theory. Preprint, Institut für Experimentelle Mathematik, Universität GH Essen, 1998.

[31] K. Morisse and A. Kemper. The Computer Algebra System MuPAD. Euromath Bulletin, 1(2):95–102, 1994.

[32] NetSolve/GridSolve, http://icl.cs.utk.edu/netsolve/.
[33] The OpenMath Standard, Version 2.0, http:// www.openmath.org/. [34] D. Petcu, M. Paprycki, and D. Dubu. Design and Implementation of a Grid Extension of Maple. Scientific Programming, 13(2):137–149, 2005. [35] S.L. Peyton Jones, C.V. Hall, K. Hammond, W.D. Partain, and P.L. Wadler. The Glasgow Haskell Compiler: a Technical Overview. In Proc. JFIT (Joint Framework for Information Technology) Technical Conference, pages 249–257, Keele, UK, March 1993. [36] S.L. Peyton Jones (ed.), J. Hughes, L. Augustsson, D. Barton, B. Boutel, W. Burton, J. Fasel, K. Hammond, R. Hinze, P. Hudak, T. Johnsson, M. Jones, J. Launchbury, E. Meijer, J. Peterson, A. Reid, C. Runciman, and P. Wadler. Haskell 98 Language and Libraries. The Revised Report. Cambridge University Press, April 2003. [37] L. Roch and G. Villard. Parallel computer algebra. In Proc. ISSAC ’97: International Symposium on Symbolic and Algebraic Computation. Preprint IMAG Grenoble France, 1997. [38] W. Schreiner, C. Mittermaier, and K. Bosa. Distributed maple: parallel computer algebra in networked environments. J. Symbolic Computation, 35(3):305–347, 2003. [39] M. Tanaka. A numerical investigation on cumulative sum of the Liouville function. Tokyo J. Math., 3(1):187–189, 1980. [40] D. Tepeneu and T. Ida. MathGridLink – Connecting Mathematica to the Grid. In Proc. IMS ’04, Banff, Alberta, 2004. [41] P.W. Trinder, K. Hammond, J.S. Mattson Jr., A.S Partridge, and S.L. Peyton Jones. GUM: a Portable Parallel Implementation of Haskell. In Proc. PLDI’96, pages 79–88, Philadelphia, PA, USA, May 1996.