AUTOMATIC LOOP PARALLELIZATION IN THE BSP MODEL

RADU CALINESCU
Computing Laboratory, University of Oxford
Wolfson Building, Parks Road, Oxford OX1 3QD, UK

Abstract. This paper introduces a new scheme for the scheduling of generic, untightly nested loops on distributed-memory systems. Because it targets the bulk-synchronous parallel (BSP) model of computation, the new parallelization scheme yields parallel code which is scalable, portable, and whose performance can be analytically evaluated.

Key words: automatic parallelization, BSP model

1. INTRODUCTION

The prohibitive costs of parallel software design have led to an ever increasing interest in the automatic parallelization of existing sequential code. As a result, remarkable advances have been made in areas such as data dependence analysis [4], code transformation [3], and potential parallelism identification [2, 14]. Based on these theoretical advances, many parallelizing compilers and tools have been devised within the last decade or so [12]. While the parallel code generated by these automatic parallelizers has been shown to be effective in many cases, it nevertheless lacks portability and performance predictability.

Unlike the above-mentioned approaches, the new parallelization strategy described in the current paper succeeds in automatically generating parallel code which is scalable, portable, and whose performance is analytically predictable. This is achieved by employing the bulk-synchronous parallel (BSP) model of computation [11, 9] as a target platform for the parallelization and scheduling of generic loop nests.

The BSP model of computation provides a unifying framework for the development of both scalable parallel architectures and portable parallel software. A BSP computer comprises a set of processor-memory pairs, a communication network providing point-to-point message delivery, and a mechanism for the efficient barrier-style synchronisation of the processors. Three parameters fully characterise a BSP machine: p, the number of processor-memory components; L, the synchronisation periodicity, or the cost of a barrier synchronisation of all processors; and g, the ratio between the computation throughput of the processors and the communication throughput of the router. A BSP program consists of a sequence of supersteps, each ended by a barrier synchronisation of all processors. In any superstep, the processors independently execute operations on locally held data, and/or initiate read/write requests for non-local data. These requests are guaranteed to be satisfied after the barrier synchronisation that ends the superstep.
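As an illustration (not part of the original paper), the following minimal sketch shows the superstep structure of a BSP program, written against the Oxford BSPlib primitives; the parallel sum it computes, the problem size N, and the choice of processor 0 as collector are illustrative assumptions.

    #include <stdio.h>
    #include <stdlib.h>
    #include "bsp.h"                       /* Oxford BSPlib primitives (assumed available) */

    #define N 1024

    int main(void) {
        bsp_begin(bsp_nprocs());           /* start the SPMD computation on all processors */
        int p = bsp_nprocs(), pid = bsp_pid();

        double *partial = malloc(p * sizeof(double));
        bsp_push_reg(partial, p * sizeof(double));
        bsp_sync();                        /* superstep 1: registration becomes visible */

        /* superstep 2: purely local computation on locally held data ...           */
        double sum = 0.0;
        for (int i = pid; i < N; i += p) sum += (double)i;
        /* ... plus communication: one word put into processor 0's memory           */
        bsp_put(0, &sum, partial, pid * sizeof(double), sizeof(double));
        bsp_sync();                        /* barrier ending the superstep; puts now visible */

        if (pid == 0) {                    /* superstep 3: processor 0 combines the results */
            double total = 0.0;
            for (int i = 0; i < p; i++) total += partial[i];
            printf("sum of 0..%d = %.0f\n", N - 1, total);
        }
        bsp_pop_reg(partial);
        bsp_sync();
        free(partial);
        bsp_end();
        return 0;
    }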

The BSP cost model is compositional: the cost of a BSP program is simply the sum of the costs of its constituent supersteps. The cost of an individual superstep S is obtained by adding up the costs of synchronisation, computation, and communication:

    cost(S) = L + w + g·h,    (1)

where w denotes the maximum number of local operations executed by any processor during superstep S, and h is the maximum number of words received or sent by any processor during superstep S.

Following the successful design and implementation of many BSP algorithms and applications [9, 11], recent research work studied the scheduling of perfect loop nests on BSP computers [5, 7, 10]. The current paper extends this work by introducing a scheme for the BSP parallelization of generic, untightly nested loops whose loop bounds and referenced array subscripts are affine functions of the surrounding loop indices. The new parallelization scheme comprises three stages: data dependence analysis and potential parallelism identification, data and computation partitioning, and synchronisation and communication generation. In the remainder of the paper, we describe each of these stages and devise new algorithms for their implementation.
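To make the compositional cost model concrete, here is a small sketch (not from the paper) that evaluates equation (1) over a sequence of supersteps; the machine parameters L and g and the per-superstep (w, h) profiles are hypothetical values chosen purely for illustration.

    #include <stdio.h>

    /* per-superstep profile: w = max local operations on any processor,
       h = max words sent or received by any processor                   */
    typedef struct { double w, h; } superstep;

    /* equation (1): cost(S) = L + w + g*h */
    static double superstep_cost(double L, double g, superstep s) {
        return L + s.w + g * s.h;
    }

    int main(void) {
        const double L = 100.0, g = 2.0;            /* hypothetical BSP parameters */
        superstep prog[] = { {1000, 50}, {400, 200}, {2500, 0} };
        double total = 0.0;
        for (int i = 0; i < 3; i++)
            total += superstep_cost(L, g, prog[i]); /* compositional: a plain sum  */
        /* 1200 + 900 + 2600 = 4700 time units for this hypothetical program */
        printf("predicted cost = %.0f time units\n", total);
        return 0;
    }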

2. DATA DEPENDENCE ANALYSIS AND POTENTIAL PARALLELISM IDENTIFICATION

The first stage of the parallelization scheme aims at rewriting the loop nest in a form that reveals the potential parallelism of the sequential code. Initially, the dependence graph [4] of the loop nest is built based on the direction vector abstraction [13] of data dependences. As shown in Fig. 1, this is a directed graph comprising a vertex for each statement in the loop nest, and an edge between each pair of vertices that stand for interdependent statements. The edges are labeled with the type of the associated data dependence(s) (flow, anti-, or output) and with their direction vector(s).

l1:  for i = 1, n-2 do
l2:    for j = 0, n-1 do
S1:      a[i,j] = f1(a[i-1,j], b[i-1,i,j])
l3:    for j = 1, n-1 do
l4:      for k = 0, n-2 do
S2:        b[i,j,k] = f2(b[i+1,j-1,k], b[i-1,j,k+1])

[Fig. 1: (a) an untightly nested loop; panel (b), the dependence graph of this loop nest, is not recoverable from the extraction.]
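Since the dependence-graph panel of Fig. 1 did not survive extraction, the following sketch (not part of the original paper) encodes the dependence edges obtainable by inspecting the array subscripts of the loop nest above; the statement names match the figure, while the enum, struct, and direction-vector strings are illustrative choices.

    #include <stdio.h>

    typedef enum { FLOW, ANTI, OUTPUT } dep_type;

    typedef struct {
        const char *src, *snk;    /* source and sink statements of the edge  */
        dep_type    type;
        const char *dirvec;       /* direction vector over the common loops  */
    } dep_edge;

    static const dep_edge fig1_edges[] = {
        /* a[i,j] written by S1 is read as a[i-1,j] one i-iteration later        */
        { "S1", "S1", FLOW, "(<,=)"   },
        /* b written by S2 is read by S1 as b[i-1,i,j]; only loop i is common    */
        { "S2", "S1", FLOW, "(<)"     },
        /* b[i,j,k] written by S2 is read back as b[i-1,j,k+1]                   */
        { "S2", "S2", FLOW, "(<,=,>)" },
        /* b[i+1,j-1,k] is read before iteration (i+1,j-1,k) overwrites it       */
        { "S2", "S2", ANTI, "(<,>,=)" },
    };

    int main(void) {
        static const char *name[] = { "flow", "anti", "output" };
        for (unsigned i = 0; i < sizeof fig1_edges / sizeof fig1_edges[0]; i++)
            printf("%s -> %s : %s %s\n", fig1_edges[i].src, fig1_edges[i].snk,
                   name[fig1_edges[i].type], fig1_edges[i].dirvec);
        return 0;
    }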
