A Modular Implementation of Bulk Synchronous Parallel ML

Frédéric Loulergue, Frédéric Gava and David Billiet

Laboratory of Algorithms, Complexity and Logic
University Paris XII, Val-de-Marne
61, avenue du Général de Gaulle
94010 Créteil cedex – France
http://bsmllib.free.fr

Abstract. The BSMLlib is a library for parallel programming with the functional language Objective Caml. It is based on an extension of the λ-calculus by parallel operations on a parallel data structure named parallel vector. The execution time can be estimated, and deadlocks and indeterminism are avoided. Programs are written as usual functional programs (in Objective Caml) using a small set of additional functions. The provided functions are used to access the parameters of the parallel machine and to create and operate on parallel vectors. The library follows the execution and cost model of the Bulk Synchronous Parallel model. This paper presents the latest implementation of the library and experiments on performance prediction.

1 Introduction

The design of parallel programs and parallel programming languages is a trade-off. On the one hand, programs should be efficient; on the other hand, this efficiency should not come at the price of non-portability and unpredictable performance. The portability of code is needed to allow code reuse on a wide variety of architectures and to allow the existence of legacy code. The predictability of performance is needed to guarantee that the efficiency will always be achieved, whatever architecture is used. Another very important characteristic of parallel programs is the complexity of their semantics: deadlocks and indeterminism often hinder the practical use of parallelism by a large number of users. To avoid these undesirable properties, a trade-off has to be made between the expressiveness of the language and its structure, which could decrease the expressiveness. Bulk Synchronous Parallelism [34,32] (BSP) is a model of computation which offers a high degree of abstraction like PRAM models and yet a realistic cost model based on structured parallelism: deadlocks are avoided and indeterminism is limited to very specific cases in the BSPlib library [18]. BSP programs are portable across many parallel architectures. Our research aims at combining the BSP model with functional programming. We obtained the Bulk Synchronous Parallel ML language (BSML), based

on a confluent extension of the λ-calculus [27]. Thus BSML is deadlock free and deterministic. Being a high-level language, programs are easier to write, to reuse and to compose. It is even possible to certify the correctness of BSML programs [10] with the help of the Coq proof assistant [1,3]. The performance prediction of BSML programs is possible [24]. BSML has been extended in many ways throughout the years and the papers related to this research are available at http://bsml.free.fr. In section 2 we present the core of BSML, which is currently a library, the BSMLlib library, for the Objective Caml language [22]. Section 3 gives an overview of the new modular implementation of the current BSMLlib library. Performance prediction is considered in section 4: we first describe how the BSP parameters of a parallel machine can be benchmarked and then we present a small experiment. Related work is presented in section 5. We conclude and give future research directions in section 6.

2 The BSMLlib Library

There is currently no implementation of a full BSML language but rather a partial implementation as a library for the Objective Caml language [22,8,30].

2.1 The BSP Model

A BSP computer contains a set of uniform processor-memory pairs, a communication network allowing inter-processor delivery of messages, and a global synchronization unit which executes collective requests for a synchronization barrier (for the sake of conciseness, we refer to [32,4] for more details). In this model, a parallel computation is divided into super-steps, at the end of which the routing of the messages and a barrier synchronization are performed; after the barrier, all requests for data posted during the preceding super-step have been fulfilled. The performance of the machine is characterized by 3 parameters expressed as multiples of the local processing speed r: p is the number of processor-memory pairs, l is the time required for a global synchronization and g is the time for collectively delivering a 1-relation (a communication phase where every processor receives/sends at most one word). The network can deliver an h-relation in time g × h for any arity h. The execution time of a super-step is thus the sum of the maximal local processing time, of the data delivery time and of the global synchronization time. The execution time of a program is the sum of the execution times of its super-steps.
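In symbols, the cost of a super-step in which processor i performs w_i local operations and the communication phase realizes an h-relation is max_i w_i + h × g + l.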

2.2 The Core BSMLlib Library

BSML does not rely on SPMD programming. Programs are usual “sequential” Objective Caml programs but work on a parallel data structure. Among the advantages are a simpler semantics and better readability: the execution order

follows (or at least the result is such that the execution order seems to follow) the reading order. The core of the BSMLlib library is based on the following elements:

bsp_p : unit → int
mkpar : (int → α) → α par
apply : (α → β) par → α par → β par
type α option = None | Some of α
put   : (int → α option) par → (int → α option) par
at    : α par → int → α

It gives access to the BSP parameters of the underlying architecture. In particular, bsp_p() is p, the static number of processes. There is an abstract polymorphic type α par which represents the type of p-wide parallel vectors of objects of type α, one per process. The nesting of par types is prohibited. Our type system enforces this restriction [12].

The BSML parallel constructs operate on parallel vectors. Parallel vectors are created by mkpar, so that (mkpar f) stores (f i) on process i, for i between 0 and (p − 1). We usually write f as fun pid→e to show that the expression e may be different on each processor. This expression e is said to be local. The expression (mkpar f) is a parallel object and it is said to be global.

A BSP algorithm [32] is expressed as a combination of asynchronous local computations and phases of global communication with global synchronization. Asynchronous phases are programmed with mkpar and apply. The expression (apply (mkpar f) (mkpar e)) stores ((f i) (e i)) on process i. Let us consider the following expression:

let vf = mkpar (fun i → (+) i)
and vv = mkpar (fun i → 2 * i + 1) in
apply vf vv

The two parallel vectors are respectively equivalent to:

⟨ fun x→x+0 , fun x→x+1 , ... , fun x→x+(p−1) ⟩

and

⟨ 1 , 3 , ... , 2 × (p − 1) + 1 ⟩

The expression apply vf vv is then evaluated to:

⟨ 1 , 4 , ... , 3 × (p − 1) + 1 ⟩

Readers familiar with BSPlib [18] will observe that we ignore the distinction between a communication request and its realization at the barrier. The communication and synchronization phases are expressed by put. Consider the expression:

put (mkpar (fun i → fs_i))     (1)

To send a value v from process j to process i, the function fs_j at process j must be such that (fs_j i) evaluates to Some v. To send no value from process j to process i, (fs_j i) must evaluate to None. Expression (1) evaluates to a parallel vector containing a function fd_i of delivered messages on every process. At process i, (fd_i j) evaluates to None if process j sent no message to process i, or evaluates to Some v if process j sent the value v to process i.

The full language would also contain a synchronous projection operation at, where (at vec n) returns the nth value of the parallel vector vec. at expresses communication and synchronization phases. Without it, the global control cannot take into account data computed locally. A global conditional is necessary to express algorithms like:

Repeat Parallel Iteration Until Max of local errors < ε

The projection should not be evaluated inside the scope of a mkpar. This is enforced by our type system [12].
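As a small illustration of put (a sketch using only the primitives above; the helper name shift_right is ours and not part of the library), the following function sends the value held at process i to process i+1:

(* val shift_right : α par → α option par
   the value of the last process is sent nowhere, process 0 receives None *)
let shift_right vec =
  let msg = mkpar (fun i v dst → if dst = i + 1 then Some v else None) in
  (* each process builds its message function, put performs the exchange *)
  let recv = put (apply msg vec) in
  (* keep only the message coming from the left neighbour, if any *)
  apply (mkpar (fun i recv_i → if i = 0 then None else recv_i (i - 1))) recv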

2.3 Examples

Very Often Used Functions. Some useful functions can be defined using only the primitives. For example, the function replicate creates a parallel vector which contains the same value everywhere. The primitive apply can be used only with a parallel vector of functions which take one argument. To deal with functions which take two arguments we define the apply2 function:

let replicate x = mkpar (fun pid → x)
let apply2 vf v1 v2 = apply (apply vf v1) v2

It is also very common to apply the same sequential function at each process. It can be done using the parfun functions; they differ only in the number of arguments of the function to apply:

let parfun  f v        = apply (replicate f) v
let parfun2 f v1 v2    = apply (parfun f v1) v2
let parfun3 f v1 v2 v3 = apply (parfun2 f v1 v2) v3

applyat n f1 f2 v applies function f1 at process n and function f2 at the other processes:

let applyat n f1 f2 v =
  apply (mkpar (fun i → if i = n then f1 else f2)) v

Communication Functions. The semantics of the total exchange function is given by:

totex ⟨ v0, ..., v(p−1) ⟩ = ⟨ f, ..., f, ..., f ⟩   where ∀i. (0 ≤ i < p) ⇒ (f i) = vi

The code is as follows, where noSome just removes the Some constructor and compose is the usual function composition:

(* val totex : α par → (int → α) par *)
let totex vv =
  parfun (compose noSome) (put (parfun (fun v dst → Some v) vv))
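For instance (a small usage sketch; the names v, doubled and tagged are ours, not part of the library):

let v = mkpar (fun i → i)                  (* ⟨ 0, 1, ..., p−1 ⟩        *)
let doubled = parfun (fun x → 2 * x) v     (* ⟨ 0, 2, ..., 2×(p−1) ⟩    *)
let tagged = applyat 0 (fun x → x + 100) (fun x → x) doubled
(* ⟨ 100, 2, ..., 2×(p−1) ⟩ : f1 is applied at process 0, f2 elsewhere *)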

Its parallel cost is (p − 1) × s × g + L, where s denotes the size in words of the biggest value held at some process. From it we can obtain a version which returns a parallel vector of lists:

(* val totex_list : α par → α list par *)
let totex_list v =
  parfun2 List.map (totex v) (replicate (procs ()))

where

List.map : (α→β) → α list → β list
List.map f [v0; ...; vn] = [(f v0); ...; (f vn)]
procs () = [0; ...; bsp_p()−1]
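As a small usage sketch (hypothetical values, assuming p = 3):

let all = totex_list (mkpar (fun i → 10 * i))
(* all = ⟨ [0; 10; 20], [0; 10; 20], [0; 10; 20] ⟩ :
   every process obtains the list of all the local values *)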

The semantics of the broadcast is:

bcast ⟨ v0, ..., v(p−1) ⟩ r = ⟨ vr, ..., vr ⟩

The direct broadcast function which realizes this semantics is given in figure 1, where noSome is such that noSome (Some x) = x.

exception Bcast

let bcast_direct root vv =
  if root < 0 || root >= bsp_p () then raise Bcast
  else
    let mkmsg = mkpar (fun i v dst → if i = root then Some v else None) in
    parfun noSome (apply (put (apply mkmsg vv)) (replicate root))

Fig. 1. Direct Broadcast

In section 4 we will use the following parallel reduction function:

let fold_direct op vec =
  let fold (h :: t) = List.fold_left op h t in
  parfun fold (totex_list vec)

Its BSP cost formula is (assuming op has a small and constant cost):

2 × p − 1 + (p − 1) × s × g + L
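For instance (a usage sketch of ours), the global sum of a parallel vector, replicated at every process, can be written:

(* every process ends up holding 0 + 1 + ... + (p−1) *)
let sums = fold_direct (+) (mkpar (fun i → i))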

3 An Overview of the Implementation

In the implementation of the BSMLlib library version 0.3, the Bsmllib module, which contains the primitives presented in section 2.2, is implemented in SPMD style using a lower-level communication library. This module, called Comm, is based on the main elements given in figure 2:

val pid    : unit → int
val nprocs : unit → int
val send   : α option array → α option array

Fig. 2. The Comm Module

There are several implementations of Comm based on MPI [33], PVM [13], BSPlib [18], PUB [6], TCP/IP, etc. The implementation of all the other modules of the BSMLlib library, including the core module, is independent of the actual implementation of Comm and only depends on its interface.

The meaning of pid and nprocs is obvious: they give respectively the process identifier and the number of processes of the parallel machine. send takes, on each process, an array of size nprocs() containing optional values. These input values are obtained by applying, at each process i, the function f_i (the argument of the put primitive) to the integers from 0 to (bsp_p()−1). If at process j the value contained at index i is (Some v), then the value v will be sent from process j to process i. If the value is None, nothing will be sent. In the result, which is also an array, None at index j at process i means that process j sent no value to i, and (Some v) at index j means that process j sent the value v to i. A global synchronization occurs inside this communication function. put and at are implemented using send.

The implementation of the abstract type, mkpar and apply is as follows:

type α par = α
let mkpar f = f (pid ())
let apply f v = f v

For the MPI version, the pid function does not call the MPI_Comm_rank function each time. The MPI function is called only once at initialization and the value is stored in a reference; pid only reads this reference.

Let us see how the put primitive works. Consider a parallel machine with 4 processors and functions f_i of type int → α option such that (f_i (i + 1)) = Some v_i for i = 0, 1, 2 and (f_i j) = None otherwise. The expression put (mkpar (fun i → f_i)) would be evaluated as follows:

1. First, at each process the function is applied to all process identifiers to produce p values, the messages to be sent by that process. In the following table, a column represents the values produced at one process and the lines are ordered by destination (the first line represents the messages to be sent to process 0, etc.):

   None     None     None     None
   Some v0  None     None     None
   None     Some v1  None     None
   None     None     Some v2  None

2. Then the exchange of messages is actually performed. If we think of the previous table as a matrix, the resulting matrix is obtained by transposition:

   None  Some v0  None     None
   None  None     Some v1  None
   None  None     None     Some v2
   None  None     None     None

   This operation is performed by the send function of the Comm module.

3. Finally, the parallel vector of functions is produced. Each process i holds an array a_i of size p (a column of the previous matrix) and the resulting function is fun x → a_i.(x). In our example, at process 3, (f_3 0) = None, which means that process 3 received no message from process 0, and (f_3 2) = Some v2, which means that process 2 sent the value v2 to process 3.

Note that the implementation relies on the powerful module mechanism of Objective Caml; in particular, the Bsmllib module is a parametric module [23].
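Putting the three steps together, a simplified sketch of how put can be expressed on top of Comm.send could look as follows (an illustration of the scheme above, not the actual BSMLlib source):

(* Sketch of put on top of Comm.send, following steps 1–3 above.
   Assumes type α par = α and the Comm interface of figure 2. *)
let put fs =
  let p = Comm.nprocs () in
  (* step 1: build the local array of messages, one slot per destination *)
  let tosend = Array.init p (fun dst → fs dst) in
  (* step 2: global exchange with synchronization (the transposition) *)
  let received = Comm.send tosend in
  (* step 3: the function of delivered messages at this process *)
  fun src → received.(src)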

4 Performance Prediction

One of the main advantages of the BSP model is its cost model: it is rather simple yet accurate. Several papers (for example [20]) have demonstrated this fact using the BSPlib library [18] or the Paderborn University BSP Library (PUB) [5]. In his book [4], Bisseling presents in the first chapter the BSP model, programming with the BSPlib library, and examples. Among these examples, the “probe” program is a benchmark used to determine the parameters of a BSP machine.

4.1 Benchmarking the BSP Parameters

There are four BSP parameters. Of course, the number of processors does not need to be benchmarked. The g and L parameters are expressed as multiples of the processor speed r. This parameter is therefore the first to be determined; it is also the most difficult to determine accurately. In [4], the target applications belong to the scientific computing area, so r is obtained by timing several computations using floating-point operations. The first version of our bsmlprobe program does almost the same, because we would like to perform comparisons with the C version (we use both an average of the measured speeds, called r, and the best of the speeds, called r′). But of course, arrays are not the most commonly used data structure in Caml: for applications with rich data structures, this value of r may not be very accurate. This way of obtaining r is also more favorable to C, as more complex data structures, or even function calls [14], are more efficient in Objective Caml than in C. Thus in practice, for general usage, the way to determine r should not rely only on floating-point arithmetic.
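A minimal sketch of such a timing (our illustration, not the actual bsmlprobe code) measures a dot-product loop and derives a rate in flops per second:

(* estimate r by timing floating-point array operations;
   Unix.gettimeofday comes from the unix library *)
let measure_r n =
  let a = Array.make n 1.0 and b = Array.make n 2.0 in
  let s = ref 0.0 in
  let t0 = Unix.gettimeofday () in
  for i = 0 to n - 1 do
    s := !s +. a.(i) *. b.(i)          (* 2 flops per iteration *)
  done;
  let t1 = Unix.gettimeofday () in
  ignore !s;
  float_of_int (2 * n) /. (t1 -. t0)   (* flops per second *)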

Then g and L are determined as follows: several super-steps are timed with an increasing h-relation. In these super-steps, each process sends to each other process n/p or (n/p) + 1 words. Then the least-squares method is used to obtain g and L. Figure 3 presents the average results obtained by running the probe (C+MPI) and bsmlprobe (BSMLlib 0.3 with the MPI implementation of the Comm module) programs 10 times on a cluster of 6 Pentium IV nodes interconnected with a Gigabit Ethernet network.
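The fit itself is an ordinary linear least-squares regression of the measured times t against the realized h-relations, t ≈ g × h + L; a sketch of this step (ours, not the probe code) is:

(* least-squares fit of t = g*h + L from measured (h, t) pairs *)
let fit_g_l measurements =
  let n = float_of_int (List.length measurements) in
  let sh  = List.fold_left (fun s (h, _) → s +. h) 0.0 measurements in
  let st  = List.fold_left (fun s (_, t) → s +. t) 0.0 measurements in
  let shh = List.fold_left (fun s (h, _) → s +. h *. h) 0.0 measurements in
  let sht = List.fold_left (fun s (h, t) → s +. h *. t) 0.0 measurements in
  let g = (n *. sht -. sh *. st) /. (n *. shh -. sh *. sh) in
  let l = (st -. g *. sh) /. n in
  (g, l)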

            C+MPI                BSMLlib (MPI) Avg       BSMLlib (MPI) Max
 Run      r     g      L          r     g      L          r     g      L
  1      477  44.9  449167       334  21.6  161713       468  29.3  227258
  2      477  44.9  449167       331  31.9  154985       462  40.2  218733
  3      478  19.4  673227       335  30.4  156961       469  44.5  217905
  4      480  23.5  680942       332  24.0  156889       469  31.3  223746
  5      479  28.5  652335       327   7.8  164817       454  23.0  220677
  6      478  20.0  685116       328   9.4  164678       479  16.0  239412
  7      476  15.7  672573       330  14.9  162151       471  28.5  227467
  8      477  20.1  660288       331  16.3  162916       473  23.7  233947
  9      476  18.2  643489       333  19.6  161171       458  16.1  229192
 10      478  16.8  665109       328  25.1  157177       484  27.7  236784
 Avg     478  25.2  623141       331  20.1  160346       469  28.0  227512

r in Mflops/s, g and L in flops

Fig. 3. Benchmark of BSP Parameters (p = 6)

4.2 Experiments

We then performed some experiments on a program which computes the inner product of two vectors. There are two versions: one using arrays to store the vectors and one using lists. The code is given in figure 4. It uses the fold_direct function from the BSMLlib standard library (presented in section 2.3). The overall BSP cost formula of inprod is thus:

n + 2 × p + (p − 1) × g + L

These programs were run 15 times each for increasing sizes of vectors (from 5,000 to 100,000) and the average was taken. Figure 5 summarizes the timings. The performances predicted using the parameters obtained in the previous section are also given.

let inprod_array v1 v2 =
  let s = ref 0. in
  for i = 0 to (Array.length v1) - 1 do
    s := !s +. (v1.(i) *. v2.(i))
  done;
  !s

let inprod_list v1 v2 =
  List.fold_left2 (fun s x y → s +. x *. y) 0. v1 v2

let inprod seq_inprod v1 v2 =
  let local_inprod = parfun2 seq_inprod v1 v2 in
  Bsmlcomm.fold_direct (+.) local_inprod

Fig. 4. Inner Product in BSML
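From the cost formula above and benchmarked values of r, g and L, the predicted time of inprod follows directly; a small sketch (ours, using the parameters of figure 3 only as an example) is:

(* predicted time in seconds: (n + 2*p + (p-1)*g + L) / r,
   with r in flops/s and g, L in flops *)
let predicted_time n p r g l =
  (float_of_int n +. 2.0 *. float_of_int p
   +. float_of_int (p - 1) *. g +. l) /. r

(* e.g. predicted_time 100000 6 331e6 20.1 160346. for the average
   BSMLlib (MPI) parameters of figure 3 *)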

Fig. 5. Timings of the inner product in BSML: curves for the list and array versions and for the predictions (avg and max); t in ms against n, the size of the vectors

5 Related Work

For C or Fortran, there are two libraries for BSP programming. The BSPlib library is the implementation of the standard proposal presented in [18]. It is no longer supported and its compilation fails on recent Linux operating systems. There is a library for Java which is very similar to the BSPlib library [15]. [31] is based on BSP-like operations but does not follow the BSP model; it is used for global computing, and in this respect it is closer to the BSPGRID [35] and Dynamic BSP [28] models. The PUB library [5] is another implementation of the BSPlib standard proposal. It offers additional features with respect to the standard, which follow the BSP* model [2] and the D-BSP model [9]. Nested BSP algorithms for minimum spanning trees [7] have been implemented using these features.

For high-level languages, there are other libraries based on the BSλ framework. They are based either on the functional language Haskell [29] or on the object-oriented language Python [19]. They propose flat operations similar to ours; only the latter is available for download. [36] is an algorithmic skeletons language based on the BSP model and offers divide-and-conquer skeletons. Nevertheless, its cost model is not really the BSP model but the D-BSP model [9], which allows subset synchronization; we follow [16] in rejecting such a possibility. Another algorithmic skeletons language based on the BSP model [17] does not offer divide-and-conquer skeletons.

6 Conclusions and Future Work

The BSMLlib library allows declarative parallel programming in a safe environment. Being implemented in Objective Caml, with a modular approach which allows the communications to be performed by various communication libraries, it is portable and efficient on a wide range of architectures. The basic parallel operations of BSMLlib are Bulk Synchronous Parallel operations and thus allow accurate and portable performance prediction.

Future work includes the development of new Comm modules based on low-level communication libraries such as, for example, the one described in [21]. A future benchmark program will also determine the EM-BSP parameters for parallel I/O [11]. We will also include the new parallel composition operations we designed [25,26] (but without allowing subset synchronization, strictly following the BSP model). The development of several applications implemented using the BSMLlib library is in progress.

Acknowledgements This work is supported by the ACI Grid program from the French Ministry of Research, under the project Caraml (http://www.caraml.org).

References

1. The Coq Proof Assistant (version 8.0). Web pages at coq.inria.fr, 2004.
2. A. Bäumker, W. Dittrich, and F. Meyer auf der Heide. Truly efficient parallel algorithms: c-optimal multisearch for an extension of the BSP model. In 3rd European Symposium on Algorithms (ESA), pages 17–30, 1995.
3. Y. Bertot and P. Castéran. Interactive Theorem Proving and Program Development. Springer, 2004.
4. R. Bisseling. Parallel Scientific Computation. A structured approach using BSP and MPI. Oxford University Press, 2004.
5. O. Bonorden, B. Juurlink, I. von Otte, and I. Rieping. The Paderborn University BSP (PUB) Library – Design, Implementation and Performance. In Proc. of 13th International Parallel Processing Symposium & 10th Symposium on Parallel and Distributed Processing (IPPS/SPDP), San Juan, Puerto Rico, April 1999.
6. O. Bonorden, B. Juurlink, I. von Otte, and O. Rieping. The Paderborn University BSP (PUB) library. Parallel Computing, 29(2):187–207, 2003.
7. O. Bonorden, F. Meyer auf der Heide, and R. Wanka. Composition of Efficient Nested BSP Algorithms: Minimum Spanning Tree Computation as an Instructive Example. In International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA), 2002.
8. E. Chailloux, P. Manoury, and B. Pagano. Développement d'applications avec Objective Caml. O'Reilly France, 2000. Freely available in English at http://caml.inria.fr/oreilly-book.
9. P. de la Torre and C. P. Kruskal. Submachine locality in the bulk synchronous setting. In L. Bougé, P. Fraigniaud, A. Mignotte, and Y. Robert, editors, Euro-Par'96. Parallel Processing, number 1123–1124 in Lecture Notes in Computer Science, Lyon, August 1996. LIP-ENSL, Springer.
10. F. Gava. Formal Proofs of Functional BSP Programs. Parallel Processing Letters, 13(3):365–376, 2003.
11. F. Gava. Parallel I/O in Bulk Synchronous Parallel ML. In M. Bubak, D. van Albada, P. Sloot, and J. Dongarra, editors, The International Conference on Computational Science (ICCS 2004), Part III, LNCS, pages 339–346. Springer Verlag, 2004.
12. F. Gava and F. Loulergue. A Static Analysis for Bulk Synchronous Parallel ML to Avoid Parallel Nesting. Future Generation Computer Systems, 2004. To appear.
13. A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. PVM Parallel Virtual Machine. A User's Guide and Tutorial for Networked Parallel Computing. Scientific and Engineering Computation Series. MIT Press, 1994.
14. S. Gilmore. Advances in programming languages: Efficiency. www.inf.ed.ac.uk/teaching/modules/apl, 2003.
15. Yan Gu, Bu-Sung Lee, and Wentong Cai. JBSP: A BSP programming library in Java. Journal of Parallel and Distributed Computing, 61(8):1126–1142, August 2001.
16. G. Hains. Subset synchronization in BSP computing. In H. R. Arabnia, editor, PDPTA'98 International Conference on Parallel and Distributed Processing Techniques and Applications, volume I, pages 242–246, Las Vegas, July 1998. CSREA Press.
17. Y. Hayashi. Shaped-based Cost Analysis of Skeletal Parallel Programs. PhD thesis, University of Edinburgh, 2001.

18. J. M. D. Hill, W. F. McColl, et al. BSPlib: The BSP Programming Library. Parallel Computing, 24:1947–1980, 1998.
19. K. Hinsen. High-Level Parallel Software Development with Python and BSP. Parallel Processing Letters, 13(3):461–472, 2003.
20. S. A. Jarvis, J. M. D. Hill, C. J. Siniolakis, and V. P. Vasilev. Portable and architecture independent parallel performance tuning using BSP. Parallel Computing, 28:1587–1609, 2002.
21. Y. Kee and S. Ha. An Efficient Implementation of the BSP Programming Library for VIA. Parallel Processing Letters, 12(1):65–77, 2002.
22. X. Leroy, D. Doligez, J. Garrigue, D. Rémy, and J. Vouillon. The Objective Caml system release 3.07. Web pages at www.ocaml.org, 2003.
23. Xavier Leroy. A modular module system. Journal of Functional Programming, 10(3):269–303, 2000.
24. F. Loulergue. Implementation of a Functional Bulk Synchronous Parallel Programming Library. In 14th IASTED International Conference on Parallel and Distributed Computing Systems, pages 452–457. ACTA Press, 2002.
25. F. Loulergue. Parallel Juxtaposition for Bulk Synchronous Parallel ML. In H. Kosch, L. Böszörményi, and H. Hellwagner, editors, Euro-Par 2003, number 2790 in LNCS, pages 781–788. Springer Verlag, 2003.
26. F. Loulergue. Parallel Superposition for Bulk Synchronous Parallel ML. In Peter M. A. Sloot et al., editors, International Conference on Computational Science (ICCS 2003), Part III, number 2659 in LNCS, pages 223–232. Springer Verlag, June 2003.
27. F. Loulergue, G. Hains, and C. Foisy. A Calculus of Functional BSP Programs. Science of Computer Programming, 37(1-3):253–277, 2000.
28. J. M. R. Martin and A. V. Tiskin. Dynamic BSP: towards a flexible approach to parallel computing over the grid. In I. East et al., editors, Communicating Process Architectures: Proceedings of CPA, pages 219–226. IOS Press, 2004.
29. Q. Miller. BSP in a Lazy Functional Context. In Trends in Functional Programming, volume 3. Intellect Books, May 2002.
30. D. Rémy. Using, Understanding, and Unraveling the OCaml Language. In G. Barthe, P. Dybjer, L. Pinto, and J. Saraiva, editors, Applied Semantics, number 2395 in LNCS, pages 413–536. Springer, 2002.
31. L. Sarmenta. An Adaptive, Fault-tolerant Implementation of BSP for Java-based Volunteer Computing Systems. In IPPS'99 Workshop on Java for Parallel and Distributed Computing, number 1586 in LNCS, pages 763–780. Springer Verlag, April 1999.
32. D. B. Skillicorn, J. M. D. Hill, and W. F. McColl. Questions and Answers about BSP. Scientific Programming, 6(3):249–274, 1997.
33. M. Snir and W. Gropp. MPI the Complete Reference. MIT Press, 1998.
34. Leslie G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103–111, August 1990.
35. Vasil P. Vasilev. BSPGRID: Variable Resources Parallel Computation and Multiprogrammed Parallelism. Parallel Processing Letters, 13(3):329–340, 2003.
36. A. Zavanella. Skeletons and BSP: Performance Portability for Parallel Programming. PhD thesis, Università degli studi di Pisa, 1999.