Implementing VCODE with static processes

Mostafa Bamha and Gaétan Hains

fbamha,[email protected]

LIFO, Université d'Orléans, BP 6759, 45067 Orléans Cedex 2, France

Abstract

The NESL parallel functional language developed at CMU supports a combination of data- and control parallelism through so-called nested parallelism. The designers of NESL have defined the portable intermediate language VCODE into which NESL is compiled. VCODE realises nested parallelism by data structures called segmented vectors, akin to lists of lists. Arbitrary trees of calls to parallel procedures are thus implemented by VCODE primitives on segmented vectors. The simplicity of those primitives enhances the portability of NESL, but their efficiency depends strongly on the action of algorithms on the layout, shape and size of the segmented vectors. CMU's current implementation uses algorithms oblivious to the number of processors and hence to the exact data layout. Our experiment improves performance by programming VCODE primitives with explicit static processes.

1 Introduction

The experiment described here identifies a weakness in the implementation of the nested-parallel programming language NESL. Our conclusion supports the use of explicit static processes in the intermediate parallel language, as realised by Caml Flight [3], to make the source language's performance less data-dependent. NESL is an ML-like parallel functional language and combines the control parallelism of concurrent procedure calls with the data-parallelism of vector operations. Its principles were laid out in Blelloch's study of "flattening nested parallelism" [1]. The NESL system [2] is based on the intermediate language VCODE and a library of vector routines called the C Vector Library, or CVL.

Despite NESL's efficient compiler, the VCODE interpreter does not take into account the size of vectors or the number of processors. Our goal is to improve the performance of VCODE, and hence of NESL, by implementing it with explicit static processes, i.e. by associating a fixed number of processes with each routine call. This aspect of algorithms is important to restrict the amount of physical parallelism within the limits of the resources available for computation and communication. It is well known that an excessive number of virtual processes saturates the network bandwidth, cancelling the parallel acceleration. As a result, the language will better exploit hardware resources if its runtime routines adapt to the number of available processors. Following this observation we have written a new implementation of VCODE primitives in C and MPI and tested their performance on a 128-processor AP1000 configuration at the IFPC in London.

2 Nested parallelism in NESL

NESL is a high-level language for the concise and simple expression of nested data-parallel algorithms. Data-parallelism results from the evaluation of operations on sequences and, in the presence of nested sequences, from recursive calls to every element of a sequence. For example the usual divide-and-conquer form of Quicksort is programmed as in figure 1. The instruction which defines the value of result creates nested parallelism by calling quicksort concurrently on the two elements less and grt of a nested sequence. The implementation [2] of NESL involves the intermediate language VCODE, a compiler, a VCODE interpreter and a portable library of parallel vector routines called CVL (figure 2). There also exists a VCODE compiler performing length and access inference, optimisations, and producing multi-threaded C code.


function quicksort(S):
  if (#S < 2) then S
  else let
    pivot  = S[0];
    less   = {e in S | e < pivot};
    eql    = {e in S | e = pivot};
    grt    = {e in S | e > pivot};
    result = {quicksort(v): v in [less, grt]};
  in result[0] ++ eql ++ result[1];

Figure 1: Quicksort in NESL

NESL (typed, functional nested data-parallel language)
  -> NESL Compiler (flatten nested parallelism; mark last use of variables; type inference and specialisation)
  -> VCODE (stack-based intermediate language; operations on segmented vectors)
  -> VCODE Interpreter (memory management; runtime length checking; serial I/O)
  -> CVL (C library of parallel functions)
  -> Multiple hardware platforms

Figure 2: CMU's implementation of NESL

3 Algorithms for VCODE primitives

Some of the most important primitives are the well-known [1] scan operations. Their effect on a non-segmented vector or a 1-segment vector v is to return a vector of the same length whose i-th entry is the reduction of v's first i elements. If v has multiple segments, the operation has the above effect independently on each segment. The reduction uses one of several binary associative operations, whence the existence of several scan primitives, e.g. scan+ with addition.

Given an algorithm for non-segmented scan+, there are two possible techniques for adapting it to the general case. The first is to apply the non-segmented algorithm to every segment, one after the other. This is efficient if the segments are long and few in number. The second technique is to flatten the vector into a 1-segment one, then apply the given algorithm and finally reconstruct the segments by subtraction. Given for example v = [[1 2 4] [1 3 8 2] [7 0 2 9]] we would then flatten it to [1 2 4 1 3 8 2 7 0 2 9], apply the flat version of scan+ to yield [1 3 7 8 11 19 21 28 28 30 39], then recover segment boundaries as [[1 3 7] [8 11 19 21] [28 28 30 39]] and finally, in parallel for every element not in the first block, subtract the last value of the previous block: [[1 3 7] [1 4 12 14] [7 7 9 18]]. This technique is efficient if segment lengths are relatively uniform.
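As a concrete illustration of the flattening technique, here is a small sequential C sketch; it is our own illustration, not code from CVL or from our MPI implementation, and the representation of segment lengths as a separate array is an assumption made for the example.

#include <stdio.h>

/* Inclusive segmented scan+ by flattening (sequential sketch).
   data holds the flattened elements of all segments, seg[s] is the
   length of segment s, nseg the number of segments, n the total size. */
void seg_scan_plus(int *data, int n, const int *seg, int nseg)
{
    /* 1. apply the flat scan+ to the flattened vector */
    for (int i = 1; i < n; i++)
        data[i] += data[i - 1];

    /* 2. recover the segment boundaries and, for every element not in
       the first segment, subtract the last flat-scan value of the
       previous segment */
    int start = seg[0];
    int prev_last = (start > 0) ? data[start - 1] : 0;
    for (int s = 1; s < nseg; s++) {
        int next_last = data[start + seg[s] - 1];   /* save before adjusting */
        for (int i = 0; i < seg[s]; i++)
            data[start + i] -= prev_last;
        prev_last = next_last;
        start += seg[s];
    }
}

int main(void)
{
    int data[] = {1, 2, 4,  1, 3, 8, 2,  7, 0, 2, 9};
    int seg[]  = {3, 4, 4};
    seg_scan_plus(data, 11, seg, 3);
    for (int i = 0; i < 11; i++)
        printf("%d ", data[i]);          /* prints 1 3 7 1 4 12 14 7 7 9 18 */
    printf("\n");
    return 0;
}

On the example above this reproduces [[1 3 7] [1 4 12 14] [7 7 9 18]]; in the parallel setting, steps 1 and 2 are a flat scan and an element-wise subtraction, both of which distribute over the processors.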

Returning now to the non-segmented case, we will consider two algorithms. First remark that the computation of n values using binary operations on a CREW PRAM of p processors requires at least time Ω(n/p + log(p)). The first algorithm partitions the n elements into n/p arrays of p elements and uses n/p times the following standard algorithm:

Let n' = p;
For i := 1 to log_2(n') do
  Parallel for k in current array do
    If k >= 2^(i-1) then
      x[k] := x[k] + x[k - 2^(i-1)];
End.

after having added the p-th value of the previous result array. Each step requires (1) one time unit to add the p-th value of the previous result to every element of the current array, and then (2) log(p) time units to apply the standard algorithm to the current array. So in total the time is O((n/p) log(p)). When p is much smaller than n this algorithm is far from optimal, because it then requires close to linear time. It is better adapted to situations where p is close to n, in which case it is almost logarithmic.
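The standard log-step algorithm above can be simulated sequentially in C as follows; this is our own sketch, not the VCODE or CVL code, and the inner loop over k models the parallel-for.

#include <stdlib.h>
#include <string.h>

/* Sequential simulation of the log-step scan+ on p elements (sketch).
   A copy of the previous step is kept so that all updates of one step
   read the same values, as they would in a synchronous parallel step. */
void logstep_scan(int *x, int p)
{
    int *old = malloc(p * sizeof(int));
    for (int d = 1; d < p; d *= 2) {            /* d = 2^(i-1) */
        memcpy(old, x, p * sizeof(int));
        for (int k = d; k < p; k++)             /* simulated parallel step */
            x[k] = old[k] + old[k - d];
    }
    free(old);
}

Each pass over d corresponds to one of the log_2(p) parallel steps, each of which costs one time unit in the analysis above.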

The second algorithm we consider follows a common technique of increasing granularity and applying locally the best possible sequential method. Each of the p processors is given a sub-array of size n/p and they begin by independently computing the additive prefix of their local array. A global prefix computation on the local sums (last local results) requires time O(log(p)) and provides each processor with the sum of the elements stored on preceding processors. In time n/p, each processor then independently adds this to its local results. The total time is optimal, O(n/p + log(p)), and so this algorithm is preferred.

Let us illustrate its execution on n = 16 elements and p = 4 processors. Initially the local arrays are

  P0: [2,4,5,6]    P1: [7,2,0,5]    P2: [4,1,2,3]    P3: [8,6,7,2]

then local sequential prefix computations yield

  P0: [2,6,11,17]  P1: [7,9,9,14]   P2: [4,5,7,10]   P3: [8,14,21,23]

and the global parallel scan of the local totals produces a new value on each processor:

  P0: 0            P1: 17           P2: 31           P3: 41

then finally the new values are broadcast locally (and added to the local results). The concatenated result is
[2 6 11 17 24 26 26 31 35 36 38 41 49 55 63 64].
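A minimal C and MPI rendering of this second algorithm might look as follows; this is our own sketch for illustration (the function and variable names are ours, not those of the actual MPI-VCODE sources), using MPI_Scan on the local totals and subtracting the local total to obtain the exclusive offset.

#include <stdio.h>
#include <mpi.h>

/* Non-segmented inclusive scan+, second algorithm (sketch):
   each of the p processes holds a block of n/p elements in x. */
void block_scan_plus(double *x, int local_n, MPI_Comm comm)
{
    /* 1. local sequential prefix sums */
    for (int i = 1; i < local_n; i++)
        x[i] += x[i - 1];

    /* 2. global scan of the local totals; subtracting the local total
          leaves the sum of the preceding blocks only */
    double local_total = (local_n > 0) ? x[local_n - 1] : 0.0;
    double incl = 0.0;
    MPI_Scan(&local_total, &incl, 1, MPI_DOUBLE, MPI_SUM, comm);
    double offset = incl - local_total;

    /* 3. add the offset to every local result */
    for (int i = 0; i < local_n; i++)
        x[i] += offset;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* the 4-processor example from the text; run with 4 processes */
    double blocks[4][4] = {{2,4,5,6}, {7,2,0,5}, {4,1,2,3}, {8,6,7,2}};
    double x[4];
    for (int i = 0; i < 4; i++) x[i] = blocks[rank % 4][i];

    block_scan_plus(x, 4, MPI_COMM_WORLD);
    printf("P%d: %g %g %g %g\n", rank, x[0], x[1], x[2], x[3]);

    MPI_Finalize();
    return 0;
}

Launched on 4 processes (for instance with mpirun -np 4), each process prints its block of the result, reproducing the worked example above.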

4 Performance on the Fujitsu AP1000

The above technique has been adapted to most of the VCODE operations, which we have implemented for non-segmented vectors. For general segmented vectors the flattening method was used, leading to better load balancing than methods which distribute segments. The execution time of the algorithms consists of the time for sequential processing on sets of size n/p, plus the cost of log(p) communications; the latter depends on the size and type of data transferred.

To study the performance of our MPI-VCODE we used the problem of the least-squares linear fit. The following table (taken from [2]) compares the performance of NESL and of major parallel languages on various architectures. Values are in units of 10^-4 seconds for computing the least-squares linear fit; n is the problem size.

     n      Alpha    Alpha     C90     C90     CM-2    CM-2    CM-5    CM-5
            (C)      (NESL)    (F77)   (NESL)  (CMF)   (NESL)  (CMF)   (NESL)
   2^14        7       29        1       12      18      61       8      39
   2^16        -        -        4       18      19      61      11      40
   2^18      137      468       58      122      37     133      57     101
   2^22     2869     9506      927     1551     322    1283    1473    3251
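For reference, fitting a line y = a + bx to n points by least squares reduces to a few +-reductions over the point coordinates, which is what makes it a natural benchmark for scan and reduction primitives. The formulas below are the standard ones, not reproduced from [2]:

  b = \frac{n\sum_i x_i y_i - \sum_i x_i \sum_i y_i}{n\sum_i x_i^2 - \left(\sum_i x_i\right)^2},
  \qquad
  a = \frac{\sum_i y_i - b\sum_i x_i}{n}.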

We have implemented the same computation on the AP1000 using MPI procedures for our algorithms. The timings of those tests are summarised in the following table, where p is the number of processors, n the problem size, and the time t is in units of 10^-4 seconds.

   Fujitsu AP1000 (MPI)
      p        n        t
    128     2^10       17
    128     2^14       26
    128     2^16       77
    128   100 000     111
     64     2^10       17
     64     2^14       46
     64     2^16      147
     64   100 000     218
     32     2^10       18
     32     2^14       80
     32     2^16      284
     32   100 000     422

Beyond a certain limit, increasing p provides no speedup. For example on vectors of size 2^10, p = 32 processors is sufficient, which justifies an implementation sensitive to p. Although the AP1000's performance on those configurations is small compared to the CM-2 and CM-5 architectures, our timings are comparable to those of NESL on the CM-2 and are within a small factor of those on the CM-5.

5 Conclusion

Our timing analysis of the implementation of VCODE primitives confirms that communication costs place natural limits on the possible speedup for a given vector size. The number of processors is an important dynamic parameter for optimising the language's implementation. This kind of p-adaptive implementation can optimise processor usage and avoid the possibility of creating an excessive number of virtual processes. The procedures we wrote were simplified by the many portable primitives available in MPI.

Acknowledgements

We thank Martin Kohler for assistance with the AP1000 implementation of MPI and Imperial College's IFPC for free access to its systems.

References

[1] G. E. Blelloch. Vector Models for Data-Parallel Computing. MIT Press, 1990.

[2] G. E. Blelloch, S. Chatterjee, J. C. Hardwick, J. Sipelstein, and M. Zagha. Implementation of a portable nested data-parallel language. Technical Report CMU-CS-93-112, School of Computer Science, Carnegie Mellon University, February 1993.

[3] C. Foisy and E. Chailloux. Caml Flight: a portable SPMD extension of ML for distributed memory multiprocessors. In A. W. Böhm and J. T. Feo, editors, Workshop on High Performance Functional Computing, Denver, Colorado, April 1995. Lawrence Livermore National Laboratory, USA.

[4] Alan Gibbons and Wojciech Rytter. Efficient Parallel Algorithms. Cambridge University Press, 1988.
