AUTOMATIC LOOP PARALLELIZATION IN THE BSP MODEL

RADU CALINESCU
Computing Laboratory, University of Oxford
Wolfson Building, Parks Road, Oxford OX1 3QD, UK

Abstract. This paper introduces a new scheme for the scheduling of generic, untightly nested loops on distributed-memory systems. Because it targets the bulk-synchronous parallel (BSP) model of computation, the new parallelization scheme yields parallel code which is scalable, portable, and whose performance can be analytically evaluated.

Key words: automatic parallelization, BSP model

1. INTRODUCTION

The prohibitive costs of parallel software design have led to an ever increasing interest in the automatic parallelization of existing sequential code. As a result, remarkable advances have been made in areas such as data dependence analysis [4], code transformation [3], and potential parallelism identification [2, 14]. Based on these theoretical advances, many parallelizing compilers and tools have been devised within the last decade or so [12]. While the parallel code generated by these automatic parallelizers has been shown to be effective in many cases, it nevertheless lacks portability and performance predictability.

Unlike the above-mentioned approaches, the new parallelization strategy described in the current paper succeeds in automatically generating parallel code which is scalable, portable, and whose performance is analytically predictable. This is achieved by employing the bulk-synchronous parallel (BSP) model of computation [11, 9] as a target platform for the parallelization and scheduling of generic loop nests.

The BSP model of computation provides a unifying framework for the development of both scalable parallel architectures and portable parallel software. A BSP computer comprises a set of processor-memory pairs, a communication network providing point-to-point message delivery, and a mechanism for the efficient barrier-style synchronisation of the processors. Three parameters fully characterise a BSP machine: p, the number of processor-memory components; L, the synchronisation periodicity, or the cost of a barrier synchronisation of all processors; and g, the ratio between the computation throughput of the processors and the communication throughput of the router. A BSP program consists of a sequence of supersteps, each ended by a barrier synchronisation of all processors. In any superstep, the processors independently execute operations on locally held data, and/or initiate read/write requests for non-local data. These requests are guaranteed to be satisfied after the barrier synchronisation that ends the superstep.
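As an illustration (not part of the original paper), the following minimal sketch shows the superstep structure of a BSP program, written against the Oxford BSPlib primitives; the parallel sum it computes, the problem size N, and the choice of processor 0 as collector are illustrative assumptions.

    #include <stdio.h>
    #include <stdlib.h>
    #include "bsp.h"                       /* Oxford BSPlib primitives (assumed available) */

    #define N 1024

    int main(void) {
        bsp_begin(bsp_nprocs());           /* start the SPMD computation on all processors */
        int p = bsp_nprocs(), pid = bsp_pid();

        double *partial = malloc(p * sizeof(double));
        bsp_push_reg(partial, p * sizeof(double));
        bsp_sync();                        /* superstep 1: registration becomes visible */

        /* superstep 2: purely local computation on locally held data ...           */
        double sum = 0.0;
        for (int i = pid; i < N; i += p) sum += (double)i;
        /* ... plus communication: one word put into processor 0's memory           */
        bsp_put(0, &sum, partial, pid * sizeof(double), sizeof(double));
        bsp_sync();                        /* barrier ending the superstep; puts now visible */

        if (pid == 0) {                    /* superstep 3: processor 0 combines the results */
            double total = 0.0;
            for (int i = 0; i < p; i++) total += partial[i];
            printf("sum of 0..%d = %.0f\n", N - 1, total);
        }
        bsp_pop_reg(partial);
        bsp_sync();
        free(partial);
        bsp_end();
        return 0;
    }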

The BSP cost model is compositional: the cost of a BSP program is simply the sum of the costs of its constituent supersteps. The cost of an individual superstep S is obtained by adding up the costs of synchronisation, computation, and communication:

    cost(S) = L + w + g·h,    (1)

where w denotes the maximum number of local operations executed by any processor during superstep S, and h is the maximum number of words received or sent by any processor during superstep S.

Following the successful design and implementation of many BSP algorithms and applications [9, 11], recent research work studied the scheduling of perfect loop nests on BSP computers [5, 7, 10]. The current paper extends this work by introducing a scheme for the BSP parallelization of generic, untightly nested loops whose loop bounds and referenced array subscripts are affine functions of the surrounding loop indices. The new parallelization scheme comprises three stages: data dependence analysis and potential parallelism identification, data and computation partitioning, and synchronisation and communication generation. In the remainder of the paper, we describe each of these stages and devise new algorithms for their implementation.
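To make the compositional cost model concrete, here is a small sketch (not from the paper) that evaluates equation (1) over a sequence of supersteps; the machine parameters L and g and the per-superstep (w, h) profiles are hypothetical values chosen purely for illustration.

    #include <stdio.h>

    /* per-superstep profile: w = max local operations on any processor,
       h = max words sent or received by any processor                   */
    typedef struct { double w, h; } superstep;

    /* equation (1): cost(S) = L + w + g*h */
    static double superstep_cost(double L, double g, superstep s) {
        return L + s.w + g * s.h;
    }

    int main(void) {
        const double L = 100.0, g = 2.0;            /* hypothetical BSP parameters */
        superstep prog[] = { {1000, 50}, {400, 200}, {2500, 0} };
        double total = 0.0;
        for (int i = 0; i < 3; i++)
            total += superstep_cost(L, g, prog[i]); /* compositional: a plain sum  */
        /* 1200 + 900 + 2600 = 4700 time units for this hypothetical program */
        printf("predicted cost = %.0f time units\n", total);
        return 0;
    }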

2. DATA DEPENDENCE ANALYSIS AND POTENTIAL PARALLELISM IDENTIFICATION

The first stage of the parallelization scheme aims at rewriting the loop nest in a form that reveals the potential parallelism of the sequential code. Initially, the dependence graph [4] of the loop nest is built based on the direction vector abstraction [13] of data dependences. As shown in Fig. 1, this is a directed graph comprising a vertex for each statement in the loop nest, and an edge between each pair of vertices that stand for interdependent statements. The edges are labeled with the type of the associated data dependence(s) (flow, anti-, or output) and with their direction vector(s).

l1:  for i = 1, n-2 do
l2:    for j = 0, n-1 do
S1:      a[i,j] = f1(a[i-1,j], b[i-1,i,j])
l3:    for j = 1, n-1 do
l4:      for k = 0, n-2 do
S2:        b[i,j,k] = f2(b[i+1,j-1,k], b[i-1,j,k+1])

[Fig. 1: (a) an untightly nested loop; panel (b), the dependence graph of this loop nest, is not recoverable from the extraction.]
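Since the dependence-graph panel of Fig. 1 did not survive extraction, the following sketch (not part of the original paper) encodes the dependence edges obtainable by inspecting the array subscripts of the loop nest above; the statement names match the figure, while the enum, struct, and direction-vector strings are illustrative choices.

    #include <stdio.h>

    typedef enum { FLOW, ANTI, OUTPUT } dep_type;

    typedef struct {
        const char *src, *snk;    /* source and sink statements of the edge  */
        dep_type    type;
        const char *dirvec;       /* direction vector over the common loops  */
    } dep_edge;

    static const dep_edge fig1_edges[] = {
        /* a[i,j] written by S1 is read as a[i-1,j] one i-iteration later        */
        { "S1", "S1", FLOW, "(<,=)"   },
        /* b written by S2 is read by S1 as b[i-1,i,j]; only loop i is common    */
        { "S2", "S1", FLOW, "(<)"     },
        /* b[i,j,k] written by S2 is read back as b[i-1,j,k+1]                   */
        { "S2", "S2", FLOW, "(<,=,>)" },
        /* b[i+1,j-1,k] is read before iteration (i+1,j-1,k) overwrites it       */
        { "S2", "S2", ANTI, "(<,>,=)" },
    };

    int main(void) {
        static const char *name[] = { "flow", "anti", "output" };
        for (unsigned i = 0; i < sizeof fig1_edges / sizeof fig1_edges[0]; i++)
            printf("%s -> %s : %s %s\n", fig1_edges[i].src, fig1_edges[i].snk,
                   name[fig1_edges[i].type], fig1_edges[i].dirvec);
        return 0;
    }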
