
High level BSP programming: BSML and BS

Olivier Ballereau, Frédéric Loulergue and Gaétan Hains
LIFO, Université d'Orléans
BP6759, 45067 Orléans Cedex 2, France
{olivier,loulergu,hains}[email protected]

Abstract

A functional data-parallel language called BSML is designed for programming bulk-synchronous parallel (BSP) algorithms in so-called direct mode. Its aim is to combine the generality of languages like NESL with the predictable performance of direct-mode BSP algorithms. The BSML operations are motivated and described. Experiments with a library implementation of BSML show the possibility and limitations of parallel performance prediction in this framework.

1 INTRODUCTION

This paper is concerned with the possibility of writing so-called direct-mode parallel BSP algorithms as purely functional programs. A parallel algorithm is said to be in direct mode [4] when its physical process structure is made explicit. This makes it less convenient to express but more efficient in many cases [4]. On the other hand, existing functional parallel languages like NESL [2] support nested parallelism, where physical process structure is implicit, at the expense of efficiency [1] and/or predictability of performance. We propose as a solution BSML, a purely(1) functional programming language for direct-mode BSP [13] algorithms.

2 EXPLICIT PROCESSES + FLAT PARALLELISM = DIRECT MODE

Among researchers interested in declarative parallel programming, there is a growing interest in execution cost models taking into account global hardware parameters like the number of processors and bandwidth. With similar motivations we are designing an extension of ML called BSML for which the BSP cost model facilitates performance prediction. Its main advantage in this respect is the use of explicit processes: the map from processors to data is programmed explicitly and does not have to be recovered by inverting the semantics of layout directives.

The BSP execution model [15] represents a parallel computation on p processors as an alternating sequence of computation supersteps (p asynchronous computations) and communication supersteps (data exchanges between processors) with global synchronization. The BSP cost model estimates execution times by a simple formula. A computation superstep takes as long as its longest sequential process, a global synchronization takes a fixed, system-dependent time L, and a communication superstep is completed in time proportional to the arity h of the data exchange: the maximal number of words sent or received by a processor during that superstep.
The system-dependent constant g, measured in time/word, is multiplied by h to obtain the estimated communication time. It is useful to measure times in multiples of a Flop so as to normalize g and L w.r.t. the sequential speed of processor nodes. On current architectures, g is always greater than 2 and often above 10 Flops while L is often greater than 1000 Flops.

In BSML, a parallel value is built from an ML function from processor numbers to local data. A computation superstep results from the pointwise application of a parallel functional value to a parallel value. A communication and synchronization superstep is the application of a communication template (a parallel value of processor numbers) to a parallel value. A crucial restriction on the language's constructors is that parallel values are not nested. Such nesting would imply either dynamic process creation or some non-constant dynamic costs for mapping parallel values to the network of processors, both of which would contradict our goal of direct-mode BSP programming.

(1) We currently ignore the imperative aspects of ML so as to clarify the interaction of BSP with functional programming.
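As an illustration of the cost model described above, the cost of one superstep is the longest local computation plus h*g for communication plus L for the barrier. The following OCaml sketch (the function name and example parameter values are ours, not the paper's) computes this estimate:

```ocaml
(* Estimated BSP cost of one superstep, in Flops.
   w : per-processor local computation times (Flops)
   h : maximal number of words sent or received by any processor
   g : communication gap (Flops/word), l : synchronization cost L (Flops) *)
let superstep_cost ~g ~l w h =
  let w_max = List.fold_left max 0 w in
  w_max + (h * g) + l

let () =
  (* e.g. 4 processors, h = 20 words, g = 10 Flops/word, L = 1000 Flops *)
  let c = superstep_cost ~g:10 ~l:1000 [300; 500; 450; 200] 20 in
  Printf.printf "estimated superstep cost: %d Flops\n" c
```

With the hypothetical values above, the estimate is 500 + 20*10 + 1000 = 1700 Flops, showing how the barrier cost L dominates small exchanges.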

3 BSML

The popular style of SPMD programming in a sequential language augmented with a communication library has some advantages due to its explicit processes and explicit messages. In it, the programmer can write BSP algorithms and control the parameters that define execution time in the cost model. However, programs written in this style are far from being pure-functional: they are imperative and even non-deterministic. There is also an irregular use of the pid (Processor ID i.e. processor number) variable which is bound outside the source program. Consider for example p static processes (we refer to processes as processors without distinction) given an SPMD program E to execute. The meaning of E is then

[[ E ]]_SPMD = [[ E_0 || ... || E_(p-1) ]]_CSP

where E_i = E[pid <- i] and [[ E ]]_CSP refers to the concurrent semantics defined by the communication library, for example the meaning of a CSP process [7]. This scheme has two major disadvantages. First, it uses a concurrent semantics to express parallel algorithms, which are deterministic and whose purpose is to execute predictably fast. Secondly, the pid variable is used without explicit binding. As a result there is no syntactic support for escaping from a particular processor's context to make global decisions about the algorithm. The global parts of the SPMD program are those which do not depend on any conditional using the pid variable. This dynamic property is thus given the role of defining the most elementary aspect of a parallel algorithm, namely its local vs global parts.

We propose to eliminate both of these problems by using a minimal set of algorithmic operations having a BSP interpretation. Our parallel control structure is analogous to the PAR of Occam [9] but without the possibility of nesting. The pid variable is replaced by a normal argument to a function within a parallel constructor. The property of being a local expression is then visible in the syntax and types.

BSML is based on the following elements. There is an externally-bound variable nprocs whose value is p, the static number of processors. There is also a polymorphic type constructor Par such that ('a Par) represents the type of p-wide vectors of values of type 'a, one per processor. The nesting of Par types is prohibited, and it is understood that a compiler can enforce this restriction. This improves on an earlier language DPML / Caml Flight [3, 6] in which the global parallel control structure sync had to be prevented dynamically from nesting [14]. Parallel values are created by (mkpar: (int -> 'a) -> 'a Par) so that (mkpar f) evaluates (f i) on processor i for i = 0, 1, ..., (p-1). Asynchronous phases are programmed with apply-par:

('a -> 'b) Par -> 'a Par -> 'b Par

so that (apply-par (mkpar f) (mkpar e)) evaluates ((f i) (e i)) on processor i. The communication and synchronization phases are expressed by get:

'a Par -> (int list) Par -> (Hash(int,'a)) Par

so that (get (mkpar x) (mkpar y)) evaluates, on processor i, to a hash table with contents {(j, x j) | j in (y i)}: the type Hash(int,'a) is that of hash tables with integer keys and values of type 'a. There is a put operation that is dual to get: processors indicate target destinations for their local data. put is not necessary from a purely declarative point of view but saves a synchronization barrier, since get requires two in our implementation. For convenience, a variant of get (resp. put) called get_one (resp. put_one) is provided for the frequent special case where every processor requests (resp. sends) a value from (resp. to) a single other processor. Both are defined with get and put. BSML also contains a synchronous conditional if-par such that (if-par b i (v1,v2)) will evaluate to v1 or v2 depending on the value of b at processor i. The type of if-par is:

bool Par -> int -> ('a Par) * ('a Par) -> 'a Par
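The primitives above can be given a simple sequential semantics. The following is our own minimal simulation sketch, not the BSMLlib implementation: a parallel vector is modelled as an array of length nprocs, hyphens in the paper's names become underscores (apply_par, if_par), and association lists stand in for the Hash(int,'a) tables.

```ocaml
let nprocs = 4

type 'a par = 'a array  (* one value per processor; nesting is prohibited *)

(* (mkpar f) holds (f i) at processor i *)
let mkpar (f : int -> 'a) : 'a par = Array.init nprocs f

(* pointwise application: a computation superstep *)
let apply_par (fs : ('a -> 'b) par) (xs : 'a par) : 'b par =
  Array.init nprocs (fun i -> fs.(i) xs.(i))

(* communication: processor i receives the pairs (j, x j) for j in (y i);
   an association list models the hash table of the real language *)
let get (xs : 'a par) (ys : int list par) : (int * 'a) list par =
  Array.init nprocs (fun i -> List.map (fun j -> (j, xs.(j))) ys.(i))

(* synchronous conditional: choose v1 or v2 according to b at processor i *)
let if_par (b : bool par) (i : int) ((v1 : 'a par), (v2 : 'a par)) : 'a par =
  if b.(i) then v1 else v2
```

For example, (apply_par (mkpar (fun i x -> x + i)) (mkpar (fun i -> i * 10))) yields the vector holding i * 10 + i at each processor i, and (get xs (mkpar (fun i -> [0; i]))) gives every processor the value at processor 0 together with its own.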

Readers familiar with BSPlib [8] will observe that we express communications collectively and ignore the distinction between a communication request and its realization at the barrier.

4 THE BSMLLIB EXPERIMENT

To test the practical feasibility of the BSML concept, we have programmed an interface between ocaml (Objective Caml [10]) and BSPlib [8] called BSMLlib. Ease of programming in BSMLlib is relatively high: undergraduate students familiar with Caml could write simple parallel algorithms within a few hours. For example, the following simple BSMLlib programs realise a one-phase ("direct") broadcast and a (log p)-phase broadcast by binary replication.

(* val bcast_direct : int -> 'a Bsmlcore.par -> 'a Bsmlcore.par *)
let bcast_direct from vec = get_one vec (replicate from)
(* replicate from = make_par (fun i -> from) *)
;;

(* val bcast_logp : 'a Bsmlcore.par -> 'a Bsmlcore.par *)
let bcast_logp vc =
  let rec aux n vc = if n ... if (n/2 ...
  (* the remainder of this definition was lost in extraction; each of its
     ceil(log2 p) phases doubles the set of processors holding the value *)
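The two broadcasts trade communication volume against barriers, which the cost model makes precise. In the direct broadcast the source sends n words to each of the p - 1 other processors in one superstep (h = (p-1)*n), while binary replication performs ceil(log2 p) supersteps with h = n each. The following back-of-the-envelope sketch (function names and parameter values are ours) compares the two estimates:

```ocaml
(* Rough BSP cost (in Flops) of broadcasting n words among p processors,
   ignoring local computation.  g : Flops/word, l : barrier cost L (Flops) *)
let cost_direct ~g ~l p n = (p - 1) * n * g + l

let cost_logp ~g ~l p n =
  (* ceil(log2 p) supersteps, each exchanging n words and paying one barrier *)
  let rec phases k acc = if k >= p then acc else phases (2 * k) (acc + 1) in
  phases 1 0 * (n * g + l)

let () =
  let g = 10 and l = 1000 in
  Printf.printf "direct: %d Flops, log p: %d Flops\n"
    (cost_direct ~g ~l 64 100) (cost_logp ~g ~l 64 100)
```

With these hypothetical parameters (p = 64, n = 100, g = 10, L = 1000), the direct broadcast costs 64000 Flops against 12000 for binary replication; for small n or small p the extra barriers can instead make the direct version cheaper.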
