A BSP Approach to the Scheduling of Tightly-Nested Loops

Radu Calinescu
Oxford University Computing Laboratory
Wolfson Building, Parks Road, Oxford OX1 3QD, UK
[email protected]
Abstract

This paper addresses the scheduling of uniform-dependence loop nests within the framework of the bulk-synchronous parallel (BSP) model. Two broad classes of tightly-nested loops are identified in the paper and scheduled according to the BSP discipline, and the resulting schedules are analysed in terms of the BSP cost model.
for i1 = 1, n1 do
  for i2 = 1, n2 do
    for i3 = 1, n3 do
      a[i1, i2, i3] = f(a[i1, i2 - 1, i3], b[i1, i2, 2i2], c[i1 + 1, i3])

Figure 1. A 3-level loop nest of type I. The only distance vector of the loop is (0, 1, 0).
1. Introduction

One of the major aims currently focusing the efforts of the parallel computing community is the development of a realistic basis for the design and programming of general-purpose parallel computers. The advent of the bulk-synchronous parallel (BSP) model [10, 8] has offered a solution to this problem, providing an underlying framework for devising both scalable parallel architectures and portable parallel software. However, when proposing BSP programming as the standard approach to portable parallel code design, one must provide effective methods for exploiting the existing body of sequential and PRAM code. In this paper, we address the scheduling of regular patterns of computation for concurrent execution on a BSP computer. As loop nests are the most common structure for describing regular computations in imperative programs, this paper approaches the BSP scheduling of tightly-nested loops with uniform data dependences. For the purpose of our approach, we distinguish two classes of uniform-dependence loop nests. The distinction between the two classes is given by the structure of the set of distance vectors encoding the data dependences of the considered loop. Thus, a nested loop belongs to the first class (and is called a loop nest of type I) if its set of distance vectors does not span the entire iteration space of the loop (see the example in Figure 1). Similarly, a tightly-nested loop is called a loop nest of type II if its set of distance vectors contains a basis for the iteration space of the loop (Figure 2). The BSP scheduling of the two classes of loops is essentially different, and is addressed separately in the paper.
for i1 = 0, n - 1 do
  for i2 = 0, n - 1 do
    a[i1, i2] = f(a[i1 - 1, i2])
    b[i1, i2] = g(a[i1, i2], b[i1 - 1, i2 - 2])

Figure 2. A 2-level loop nest of type II. The set of distance vectors D = {(1, 0), (1, 2)} spans the entire iteration space of the loop.

The remainder of the paper is organised as follows. After briefly introducing the BSP model in section 2, the paper reviews related scheduling approaches in section 3. Then, in sections 4 and 5, techniques for the BSP scheduling of uniform-dependence loops of type I and II, respectively, are devised and analysed in terms of the BSP cost model. A final section comprising a short summary and directions for further work concludes the paper.
2. The BSP programming and cost model

A bulk-synchronous parallel computer [10, 8] consists of a set of processor-memory pairs, a communication network providing uniformly efficient non-local memory access, and a mechanism for efficient barrier synchronisation of all processors or of a subset of processors. No specialised broadcasting or combining facilities are available. The performance of a BSP computer is fully characterised by a set of three parameters: p, the number of processor-memory units; L, the minimal number of time steps between successive synchronisation operations, or the synchronisation periodicity; and g, the ratio between the total number of local operations performed by all processors in one second and the total number of words delivered by the communication network in one second. The parameter L is a measure of the network latency, whereas the parameter g is related to the time required to complete a so-called h-relation, i.e., a routing problem where any processor has at most h packets to send to various processors in the network, and where at most h packets are to be received by any processor. In practice, g is the value such that gh is an upper bound for the number of time steps required to perform an h-relation.

A BSP computation consists of a sequence of supersteps, each of which is followed by a barrier synchronisation of all processors. In any superstep, the processors are allowed to independently execute operations on locally held data and/or to initiate read/write requests for non-local data. However, the non-local memory accesses initiated during a superstep are guaranteed to take effect only after all the processors reach the barrier synchronisation that ends that superstep. The BSP cost model is compositional: the cost of a BSP program is simply the sum of the costs of its constituent supersteps. As for the cost of an individual superstep S, it is given by the sum of three independent costs: L, the synchronisation cost; w, the computation cost, i.e., the maximum number of local computation steps executed by any processor during S; and g·max{h_s, h_r}, the communication cost, with h_s, h_r representing the maximum number of messages sent, respectively received, by any processor during S:
cost(S) = L + w + g·max{h_s, h_r}.   (1)
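For concreteness, here is a minimal C sketch of this cost model; the function names and the per-superstep measurements are hypothetical, introduced only to show how (1) composes over a program.

#include <stdio.h>
#include <stddef.h>

/* Cost of one superstep according to (1): L + w + g*max{h_s, h_r}. */
static double superstep_cost(double L, double g,
                             double w, double h_s, double h_r)
{
    double h = h_s > h_r ? h_s : h_r;
    return L + w + g * h;
}

/* The cost model is compositional: a BSP program costs the sum of the
 * costs of its supersteps; w[i], hs[i], hr[i] are hypothetical
 * per-superstep measurements. */
static double program_cost(double L, double g, size_t nsteps,
                           const double w[], const double hs[],
                           const double hr[])
{
    double total = 0.0;
    for (size_t i = 0; i < nsteps; i++)
        total += superstep_cost(L, g, w[i], hs[i], hr[i]);
    return total;
}

int main(void)
{
    /* Two supersteps with made-up measurements, on a machine with
     * L = 50 and g = 4 (illustrative values only). */
    double w[]  = {1000.0, 2500.0};
    double hs[] = {64.0, 0.0};
    double hr[] = {64.0, 32.0};
    printf("total cost: %.0f\n", program_cost(50.0, 4.0, 2, w, hs, hr));
    return 0;
}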
Accordingly, the performance of a BSP algorithm depends not only on the problem size and on the number of processors, but also on the BSP parameters L and g.

3. Related work

Over the last two decades remarkable advances have been made in the development of reliable techniques for the extraction of potential parallelism from sequential code [1]. The availability of these techniques has triggered an ever increasing interest in the scheduling of the resulting virtual parallelism. As a result, scheduling has been successfully applied to derive either architecture-specific parallel code or special-purpose parallel designs (e.g., systolic arrays). However, so far little effort has been directed towards the derivation of portable schedules based on a realistic general-purpose parallel computing model. In fact, we are aware of only two such approaches, both targeted at the BSP model. The former approach [7] tackles the scheduling of regular-structure two- and three-dimensional dags of computations in the context of triangular linear system solving and LU decomposition, respectively. Both patterns of computation can be represented as loop nests of type II (of two and three levels, respectively), and scheduled using the technique presented in section 5, which is a generalisation of the scheduling method proposed in [7]. The second approach, due to Ding and Stefanescu [5], addresses the BSP scheduling of a particular loop nest of type I, namely a perfect loop nest consisting of k outermost parallel loops and K − k innermost sequential loops (where K is the depth of the loop nest), and having a reduction operation as the loop body. Two possible decompositions of the iteration space (i.e., two BSP schedules) are proposed in [5], the block_k-serial_{K−k} decomposition and the block_K decomposition (the terminology is from [1]), and the one yielding the lowest cost is eventually preferred. Since both decompositions are chosen to be computationally optimal, the selection of the best schedule is done by taking into account the associated communication overheads. This scheduling method was tested against a couple of manually coded BSP programs for matrix multiplication, producing identical schedules.

4. BSP scheduling of loop nests of type I

Consider a generic K-level loop nest of type I, and let K − k be the maximum number of linearly independent distance vectors of the loop. Then, according to the definition of a loop nest of type I given in the introductory section, k > 0. Furthermore, it is immediate to see that an affine transformation exists that maps the iteration space of the loop to an iteration space in which the distance vectors have null elements along k directions of the iteration space. Consequently, the k loops of the transformed loop nest that correspond to these directions can be made outermost loops and executed in parallel [1]. This result permits us to consider the loop nest in Figure 3 as the normalised form of a loop nest of type I.

forall i1 = 1, n1; i2 = 1, n2; ...; ik = 1, nk do in parallel
  for ik+1 = 1, nk+1; ik+2 = 1, nk+2; ...; iK = 1, nK do
    (b1[...], b2[...], ..., br[...]) = f(a1[...], a2[...], ..., aq[...])

Figure 3. The normalised form of a loop nest of type I; f : R^q → R^r (the loop body consists of r assignments), a_j, 1 ≤ j ≤ q, is an m_{I_j}-dimensional input array, and b_j, 1 ≤ j ≤ r, is an m_{O_j}-dimensional output array, not necessarily distinct from the input arrays.

The loop nest in Figure 1, for instance, can be normalised by bringing the loops indexed by i1 and i3 into the outermost position and executing them in parallel, as shown in the sketch below. It is worth noticing, however, that in the general case the transformed loop nest may not be rectangular, in which case slight modifications must be made to the results presented in this section.
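As an illustration, here is a C sketch of this normalised form of the Figure 1 loop nest; the element type, the array extents, the body function f, and the OpenMP pragma standing in for the forall of Figure 3 are all assumptions introduced here.

#include <stddef.h>

/* Hypothetical loop-body function standing in for the f of Figure 1. */
static double f(double x, double y, double z) { return x + y + z; }

/* Normalised form of the Figure 1 loop nest: the distance vector (0,1,0)
 * has null components along i1 and i3, so these two loops can be made
 * outermost and run in parallel; i2 must remain sequential because it
 * carries the dependence. Index 0 slices hold initial data. */
void loop_nest_type_I(size_t n1, size_t n2, size_t n3,
                      double a[n1 + 1][n2 + 1][n3 + 1],
                      double b[n1 + 1][n2 + 1][2 * n2 + 1],
                      double c[n1 + 2][n3 + 1])
{
    #pragma omp parallel for collapse(2)   /* the two parallel loops */
    for (size_t i1 = 1; i1 <= n1; i1++)
        for (size_t i3 = 1; i3 <= n3; i3++)
            for (size_t i2 = 1; i2 <= n2; i2++)   /* sequential loop */
                a[i1][i2][i3] = f(a[i1][i2 - 1][i3],
                                  b[i1][i2][2 * i2],
                                  c[i1 + 1][i3]);
}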
Unlike in the particular case studied in [5] and briefly described in section 3, in the general case only a tiling of the iteration space along the k directions corresponding to the parallel loops is possible. This tiling yields the two-superstep BSP schedule in Figure 4, with F_{T_i}(a_j), 0 ≤ i < p, 1 ≤ j ≤ q, representing the footprint of array a_j across tile T_i, i.e., the set of elements of a_j required for the computation of the loop body at the iteration points in T_i (this definition is from [5]).

forall i = 0, p - 1 do in parallel
  assign to processor P_i a rectangular tile T_i of size x1 × ... × xk × n_{k+1} × ... × n_K, where ∏_{j=1}^{k} x_j = (∏_{j=1}^{k} n_j)/p
  begin superstep
    P_i fetches F_{T_i}(a1), F_{T_i}(a2), ..., F_{T_i}(aq)
  end superstep
  begin superstep
    P_i computes f(a1[...], ..., aq[...]) for all (i1, ..., iK) in T_i
  end superstep

Figure 4. The schedule of a loop nest of type I.
It is easy to see that the footprint of a pure-input array (i.e., of an array which is used only as input within the loop body) is exactly the set of array elements accessed during the computation of the considered tile. The situation is more complex for an input/output array, since some of the array values used within the loop body are not external, but are produced while computing the tile. Assuming a uniform distribution of data prior to executing the loop, the amount of data exchanged by any processor during the first superstep can be approximated by the amount of data fetched by that processor. Consequently, the cost of the BSP schedule is

2L + (∏_{j=1}^{K} n_j)·cost(f)/p + g·∑_{j=1}^{q} #F_T(a_j),   (2)
where cost(f) is the cost of computing f at a single point of the iteration space, T is a generic rectangular tile of size x1 × ... × xk × n_{k+1} × ... × n_K, and #F_T(a_j), 1 ≤ j ≤ q, denotes the size of the footprint of array a_j across tile T. Finding the optimal BSP schedule corresponds in this case to finding a tiling which minimises the communication overhead g·Comm(x1, x2, ..., xk) = g·∑_{j=1}^{q} #F_T(a_j) under the load balancing constraint ∏_{j=1}^{k} x_j = (∏_{j=1}^{k} n_j)/p.

To assess the footprint size of a pure-input array, consider a generic K-level loop nest L, and let a be a single-referenced pure-input array of L, with the array subscripts depending on exactly l ≤ K of the loop nest indices. Let also T be a rectangular tile of size x1 × x2 × ... × xK of the iteration space of L. Then, if the l loop indices that appear in the subscripts of a are i_{j_1}, i_{j_2}, ..., i_{j_l},

#F_T(a) ≤ x_{j_1}·x_{j_2}· ... ·x_{j_l}.   (3)
Indeed, in the worst case, a different element of a is used for each value taken by (i_{j_1}, i_{j_2}, ..., i_{j_l}) across T. When the subscripts of a are arbitrary expressions of the l loop indices, the upper bound in (3) is the best we can do. The computation of a better upper bound for many particular cases, as well as the computation of the footprint size of a multiple-referenced pure-input array, are studied in [4].

The footprint size of an input/output array can be computed as follows. First, if the input and output references to the array are related through a loop-independent flow dependence, the footprint of the array is empty. Second, if the input and output references are totally independent or are related through an antidependence, the footprint size must be computed as for a pure-input array. Finally, if the input and output references are linked through a loop-carried true dependence, an upper bound for the footprint size of the array is provided by the following theorem.

Theorem 1 Let a be an input/output array inducing a loop-carried true dependence, and d = (d1, d2, ..., dK) be the associated distance vector. Let also T be a rectangular tile of size x1 × x2 × ... × xK. Then, if x_j ≥ |d_j| for any 1 ≤ j ≤ K,

#F_T(a) ≤ ∏_{j=1}^{K} x_j − ∏_{j=1}^{K} (x_j − |d_j|).   (4)
Proof Let (c1, c2, ..., cK) be the coordinates of T in the tile space. Then, if (i1, i2, ..., iK) is a generic iteration point in T, c_j·x_j ≤ i_j < (c_j + 1)·x_j for any 1 ≤ j ≤ K, and the computation of the loop body for this iteration point uses the element of a corresponding to the subscripts given by i1, i2, ..., iK. According to the hypothesis of the theorem, if (i1 − d1, i2 − d2, ..., iK − dK) is an iteration point from T, this element of a is modified in a previous iteration also belonging to T, and the reference to the input/output array implies no use of loop-external data. The problem now is to compute the maximum number of loop-external elements of a which are needed for the computation of the iteration points in T. In the worst case, i.e., when a different element of a is used for each value taken by (i1, i2, ..., iK) across T, #F_T(a) is upper bounded by

#( {(i1, ..., iK) | ∀j ∈ 1..K: c_j·x_j ≤ i_j < (c_j + 1)·x_j} \ {(i1, ..., iK) | ∀j ∈ 1..K: c_j·x_j ≤ i_j, i_j − d_j < (c_j + 1)·x_j} ).

Since the second set is a subset of the first one, we obtain #F_T(a) ≤ ∏_{j=1}^{K} x_j − ∏_{j=1}^{K} (x_j − |d_j|). □
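The bounds (3) and (4) are straightforward to evaluate; the C sketch below does so for a hypothetical tile and for the distance vector of Figure 1 (the helper names and the example sizes are ours, not the paper's).

#include <stdio.h>
#include <stdlib.h>

/* Upper bound (3): footprint of a pure-input array whose subscripts
 * depend on the l tile dimensions listed in dims[]. */
static long footprint_pure_input(const long x[], const int dims[], int l)
{
    long bound = 1;
    for (int i = 0; i < l; i++)
        bound *= x[dims[i]];
    return bound;
}

/* Upper bound (4): footprint of an input/output array with a loop-carried
 * true dependence of distance vector d, for a K-dimensional tile of size
 * x[0] x ... x x[K-1]; requires x[j] >= |d[j]| for every j. */
static long footprint_true_dep(const long x[], const long d[], int K)
{
    long all = 1, internal = 1;
    for (int j = 0; j < K; j++) {
        all *= x[j];
        internal *= x[j] - labs(d[j]);
    }
    return all - internal;  /* points whose predecessor lies outside the tile */
}

int main(void)
{
    /* Illustrative 8x8x8 tile and the distance vector (0, 1, 0) of the
     * Figure 1 loop nest. */
    long x[] = {8, 8, 8};
    long d[] = {0, 1, 0};
    int dims02[] = {0, 2};  /* subscripts using i1 and i3 only, as for c in Figure 1 */

    printf("bound (3): %ld\n", footprint_pure_input(x, dims02, 2)); /* 8*8 = 64       */
    printf("bound (4): %ld\n", footprint_true_dep(x, d, 3));        /* 512 - 448 = 64 */
    return 0;
}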
5. BSP scheduling of loop nests of type II

For a loop nest of type II, whose distance vectors span the entire iteration space, the strategy of section 4 no longer applies; instead, the iteration space of a fully permutable loop nest is partitioned into hypercubic tiles, which are scheduled along hyperplanes of the tile space, as stated by the following theorem.

Theorem 2 Let L be a K-level fully permutable loop nest whose loops iterate from 0 to n − 1, and let its iteration space be partitioned into hypercubic tiles of size (n/p^{1/(K−1)})^K, p > 1. Assume that the tile size is large enough for the dependences not internalised by the tiling to occur only between neighbour tiles, and that any tile T_{c1,c2,...,cK} depends (at least) on each of the tiles T_{c1−1,c2,...,cK}, T_{c1,c2−1,...,cK}, ..., T_{c1,c2,...,cK−1}. Then, the set of (K − 1)-dimensional hyperplanes given by {c1 + c2 + ... + cK = t | 0 ≤ t ≤ K·p^{1/(K−1)} − K} defines a minimum-length legal schedule for L. □

x = p^{1/(K−1)} (the number of tiles along each dimension)
for t = 0, Kx − K do
  forall c1, ..., cK = 0, x − 1 with c1 + ... + cK = t do in parallel
    Processor c1·x^{K−2} + c2·x^{K−3} + ... + c_{K−1}:
    begin superstep
      (1) compute tile T_{c1,c2,...,cK}:
            for i1 = c1·(n/x), (c1 + 1)·(n/x) − 1 do
              ...
              for iK = cK·(n/x), (cK + 1)·(n/x) − 1 do
                loop body
      (2) send data required by other tile computations
    end superstep

Figure 5. The BSP schedule of a K-level loop nest of type II.
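For K = 2 the schedule of Figure 5 specialises to a simple wavefront over a p × p tile space; the following C sketch enumerates which tile each processor computes in each superstep (the function and its output are illustrative assumptions, and the tile computation and sends are left as comments).

#include <stdio.h>

/* Wavefront (hyperplane) schedule of Figure 5 for K = 2: the p processors
 * form a one-dimensional array, x = p^{1/(K-1)} = p tiles per dimension,
 * and tile (c1, c2) is computed by processor c1 in superstep t = c1 + c2. */
static void schedule_type_II_K2(int p)
{
    int x = p;
    for (int t = 0; t <= 2 * x - 2; t++) {        /* Kx - K = 2x - 2 supersteps - 1 */
        printf("superstep %d:\n", t);
        for (int c1 = 0; c1 < x; c1++) {
            int c2 = t - c1;      /* the unique c2 on the hyperplane c1 + c2 = t */
            if (c2 < 0 || c2 >= x)
                continue;         /* processor c1 is idle in this superstep */
            printf("  processor %d computes tile (%d,%d)\n", c1, c1, c2);
            /* (1) compute the tile: iterate over its points (omitted)        */
            /* (2) send boundary data needed by tiles (c1+1,c2) and (c1,c2+1) */
        }
    }
}

int main(void) { schedule_type_II_K2(4); return 0; }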
The BSP schedule induced by the family of hyperplanes in Theorem 2 requires K·p^{1/(K−1)} − K + 1 supersteps, with the following operations taking place in each superstep t, 0 ≤ t ≤ K·p^{1/(K−1)} − K: (1) each tile T_{c1,c2,...,cK} with c1 + c2 + ... + cK = t is computed by processor c1·p^{(K−2)/(K−1)} + c2·p^{(K−3)/(K−1)} + ... + c_{K−1}; (2) data computed in the current superstep and required by subsequent tile computations is sent to the appropriate processor(s). The pseudocode for this schedule is presented in Figure 5; an initial superstep in which input data is provided for the boundary tiles must precede the whole computation. Since the tile coordinates satisfy 0 ≤ c1, c2, ..., cK < p^{1/(K−1)}, any tile is scheduled for execution on one of the p processors (i.e., 0 ≤ c1·p^{(K−2)/(K−1)} + c2·p^{(K−3)/(K−1)} + ... + c_{K−1} < p). This approach is equivalent to arranging the p processors in a (K − 1)-dimensional array of size p^{1/(K−1)} × p^{1/(K−1)} × ... × p^{1/(K−1)}, with processor (c1, c2, ..., c_{K−1}) being assigned the computation of the tiles T_{c1,c2,...,cK}, 0 ≤ cK < p^{1/(K−1)}. Moreover, the equation c1 + c2 + ... + cK = t has at most one solution when c1, c2, ..., c_{K−1} are fixed, so any processor is assigned at most one tile per superstep.

To evaluate the cost of the BSP schedule, we need to know the amount of data sent/received by a processor during a superstep. This result is provided by the following theorem.

Theorem 3 Let L be a K-level fully permutable loop nest whose loops iterate from 0 to n − 1, and D the set of distance vectors encoding the flow data dependences of L. Then, the maximum amount of data sent or received by a processor in any superstep of the p-processor BSP schedule of L is

Comm = n^{K−1}·(∑_{d∈D} ∑_{j=1}^{K−1} d_j + o(1))/p.   (5)
Consequently, the cost of a (computational) superstep of the schedule is L + (n/p^{1/(K−1)})^K·Comp + g·Comm, with Comm given by (5), and Comp representing the computational cost for a single point of the iteration space. After simplifications, the cost of the entire schedule is
K·p^{1/(K−1)}·L + (K + o(1))·(n^K/p)·Comp + (g·K·n^{K−1}/p^{(K−2)/(K−1)})·∑_{d∈D} ∑_{j=1}^{K−1} d_j,   (6)

with D the set of distance vectors encoding true data dependences. Thus, K-optimality (remember that K is rarely
larger than 2 or 3) is obtained if the problem size is large enough for the computation cost to dominate the synchronisation and communication overheads in (6), i.e., if
n ≥ max{ p^{1/(K−1)}·(L/Comp)^{1/K}, (g·p^{1/(K−1)}/Comp)·∑_{d∈D} ∑_{j=1}^{K−1} d_j }.   (7)
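The thresholds in (7) are easy to evaluate numerically; the short C program below does so with hypothetical helper names, using the parameter values of the example that follows.

#include <math.h>
#include <stdio.h>

/* Evaluates the K-optimality threshold (7):
 * n >= max{ p^{1/(K-1)} (L/Comp)^{1/K}, (g p^{1/(K-1)}/Comp) sum_d sum_j d_j }. */
static double optimality_threshold(int K, double p, double L, double g,
                                   double Comp, double dist_sum)
{
    double root_p = pow(p, 1.0 / (K - 1));
    double n_sync = root_p * pow(L / Comp, 1.0 / K);   /* synchronisation bound */
    double n_comm = g * root_p * dist_sum / Comp;      /* communication bound   */
    return fmax(n_sync, n_comm);
}

int main(void)
{
    /* The loop nest of Figure 2: K = 2, distance vectors (1,0) and (1,2),
     * so sum_{d in D} d_1 = 2; BSP machine p = 8, L = 600, g = 10, and
     * Comp = cost(f) + cost(g) = 12, as in the example below. */
    printf("n >= %.1f\n", optimality_threshold(2, 8, 600, 10, 12, 2));
    /* prints n >= 56.6, i.e., the n >= max{56, 13} = 56 of the text */
    return 0;
}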
Accordingly, the cost of the BSP schedule for the loop nest in Figure 2 is 2pL + (2 + o(1))·n^2·(cost(f) + cost(g))/p + 2gn·∑_{d∈{(1,0),(1,2)}} d_1 = 2pL + (2 + o(1))·n^2·(cost(f) + cost(g))/p + 4gn, with 2-optimality attained for n ≥ max{p·(L/(cost(f) + cost(g)))^{1/2}, 2pg/(cost(f) + cost(g))}. For a BSP computer with p = 8, L = 600 and g = 10 (i.e., for a typical BSP computer), and for cost(f) + cost(g) = 12, this condition translates into n ≥ max{56, 13} = 56.

Since not all loop nests of type II are fully permutable, the BSP schedule proposed in this section would be really useful only if a procedure existed to normalise a generic loop nest of type II. As proved by Wolf and Lam [11], such a procedure does exist in the general case; indeed, the authors show in [11] that any perfect loop nest with computable distance vectors can be converted into a fully permutable loop nest by an affine transformation of the iteration space. Formally, if D is the K × q matrix whose columns are the distance vectors of the original loop nest, an O(K^2·q)-time algorithm that finds a unimodular lower triangular transformation matrix T = (t_{ij})_{1≤i,j≤K} such that D+ = T·D ≥ 0 (where the inequality is applied component-wise) is developed in [11]. However, Wolf and Lam pay no attention to choosing a transformation that would result in a minimum communication cost. Still, if the BSP schedule in this section is to be extended to generic loop nests of type II, one needs to consider this communication cost for the transformed loop.

Theorem 4 The maximum amount of data exchanged by a processor computing a size x^K hypercubic tile of the transformed iteration space is

Comm′ = x^{K−1}·(∑_{i=1}^{K−1} ∑_{j=1}^{q} ∑_{k=1}^{K} t_{ik}·d_{kj} + o(1)).   (8)
Several approaches to finding a transformation T that minimises (8) have been proposed so far. Ramanujam and Sadayappan [9] have formulated a linear programming problem whose solution is an optimal unimodular lower triangular transformation T: find (t_{ij})_{1≤i,j≤K} which minimises ∑_{i=1}^{K−1} ∑_{j=1}^{q} ∑_{k=1}^{K} t_{ik}·d_{kj} subject to t_{ii} = 1, 1 ≤ i ≤ K, and ∑_{k=1}^{K} t_{ik}·d_{kj} ≥ 0, 1 ≤ i ≤ K, 1 ≤ j ≤ q.
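To illustrate, the C sketch below evaluates the communication term of (8) for the distance matrix D of the Figure 2 loop nest under the identity transformation, which is legal here because that nest is already fully permutable; the code and its names are our own illustration, not the linear programming formulation of [9].

#include <stdio.h>

#define K 2  /* depth of the loop nest                    */
#define Q 2  /* number of distance vectors (columns of D) */

/* Communication term of (8): the sum of the entries in the first K-1 rows
 * of T*D (all non-negative when T makes the loop nest fully permutable). */
static int comm_term(const int T[K][K], const int D[K][Q])
{
    int sum = 0;
    for (int i = 0; i < K - 1; i++)
        for (int j = 0; j < Q; j++)
            for (int k = 0; k < K; k++)
                sum += T[i][k] * D[k][j];
    return sum;
}

int main(void)
{
    /* Columns of D are the distance vectors (1,0) and (1,2) of Figure 2. */
    int D[K][Q] = { {1, 1},
                    {0, 2} };
    int T[K][K] = { {1, 0},
                    {0, 1} };
    /* Prints 2, recovering the value sum_{d in D} d_1 = 2 used in the
     * Figure 2 cost example; for K = 2, lower triangularity and t_11 = 1
     * fix the first row of T, so (8) is determined by D alone. */
    printf("comm term of (8): %d\n", comm_term(T, D));
    return 0;
}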