
Efficient Runtime Support for Parallelizing Block Structured Applications

Gagan Agrawal    Alan Sussman    Joel Saltz
Dept. of Computer Science
University of Maryland
College Park, MD 20742
{gagan, als, saltz}@cs.umd.edu

This work was supported by ARPA under contract No. NAG-1-1485, by NSF under grant No. ASC 9213821 and by ONR under contract No. SC 292-1-22913. The authors assume all responsibility for the contents of the paper.

Abstract

Scientific and engineering applications often involve structured meshes. These meshes may be nested (for multigrid codes) and/or irregularly coupled (called multiblock or irregularly coupled regular mesh problems). In this paper, we describe a runtime library for parallelizing these applications on distributed memory parallel machines in an efficient and machine-independent fashion. This runtime library is currently implemented on several different systems. This library can be used by application programmers to port applications by hand and can also be used by a compiler to handle communication for these applications. Our experimental results show that our primitives have low runtime communication overheads. We have used this library to port a multiblock template and a multigrid code. Effort is also underway to port a complete multiblock computational fluid dynamics code using our library.

1 Introduction

There are two major reasons why distributed memory parallel machines are not yet popular among developers of scientific and engineering applications. First, it is very difficult to parallelize application programs on these machines. Second, it is not easy to get good speed-ups and efficiency on communication intensive applications. We believe that specialized effort is required for providing efficient runtime support for applications in which data access patterns are not necessarily regular. One such class of scientific and engineering applications involves structured meshes. These meshes may be nested (as in multigrid codes) or may be irregularly coupled (called Multiblock or Irregularly Coupled Regular Mesh Problems). Multigrid is a common technique for accelerating the solution of partial differential equations [14]. Multigrid codes employ a number of meshes at different levels of resolution. The restriction and prolongation operations for shifting between different multigrid levels require moving regular array sections with non-unit strides. In multiblock problems, the data is divided into several interacting regions (called blocks or subdomains). There are computational phases in which regular computation is performed on each block independently. Boundary updates require communication between blocks, which is restricted to moving regular array sections (possibly including non-unit strides). Multiblock grids are frequently used for modeling geometrically complex objects which cannot be easily

modeled using a single regular mesh. Multiblock applications are used in important grand-challenge applications like air quality modeling [13], computational fluid dynamics [18], structure and galaxy formation [16], simulation of high performance aircraft [4], large scale climate modeling [6], reservoir modeling for porous media [6], simulation of propulsion systems [6], computational combustion dynamics [6] and land cover dynamics [10]. In Figure 1, we show how the area around an aircraft wing has been modeled with a multiblock grid.

In this paper, we present runtime support that we have designed and implemented for parallelizing these applications on distributed memory machines in an efficient, convenient and machine independent manner. We have experimented with one multiblock template [18] and one multigrid code [15]. Our experimental results show that the primitives have low runtime communication overheads. In separate work, we have also developed methods for integrating this runtime support with compilers for HPF style parallel programming languages [1].

Several other researchers have also developed runtime libraries or programming environments for multiblock applications. Baden [11] has developed a Lattice Programming Model (LPAR). This system, however, achieves only coarse grained parallelism, since a single block can only be assigned to one processor. Quinlan [12] has developed P++, a set of C++ libraries for grid applications. While this library provides a convenient interface, the libraries do not optimize communication overheads. Our library, in contrast, reduces communication costs by using message aggregation.

The rest of this paper is organized as follows. In Section 2, we discuss the nature of multiblock and multigrid applications and the runtime support requirements for parallelizing them on distributed memory parallel machines. In Section 3, we introduce the runtime library that we have developed and discuss how it can be used. In Section 4, we describe the regular section analysis required for efficiently generating schedules for regular section moves, one of the communication primitives in our library. In Section 5, we present experimental results to study the effectiveness of our primitives.

2 Block Structured Applications

In this section we discuss the nature of computation and communication in multiblock and multigrid codes. We also discuss the runtime requirements for parallelizing these applications on distributed memory machines.

For a typical multiblock application, the main body of the program consists of an outer sequential (time step) loop and an inner parallel loop. The inner loop iterates over the blocks of the problem, after applying boundary conditions to all the blocks (including updating interfaces between blocks). Applying the boundary conditions involves interaction (communication) between the blocks. In the inner loop over the blocks, the computation in each block is typically a sweep over a mesh in which each mesh point interacts only with its nearby neighbors. Since, in these applications, there are computational phases that involve interactions only within each block, communication overheads are reduced if each block is not divided across a large number of processors. So, blocks may have to be distributed onto subsets of the processor space. Since the number of blocks is typically quite small (i.e. at most a few dozen), at least some of the blocks will have to be distributed across multiple processors.

Figure 1: Block structured grid around a wing, showing an interface between blocks. (The figure labels the wing region as subdomain 1 and the control surface as subdomain 2, with adjacent cells on either side of the interface between the two subdomains.)

Now, we briefly discuss the runtime support required for parallelizing multiblock applications. First, there must be a means for expressing data layout and organization on the processors of the distributed memory parallel machine. We need compiler and runtime support for mapping blocks (arrays) to subsets of the processor space. Second, there must be methods for specifying the movement of data required. Two types of communication are required in multiblock applications. The interaction between different blocks requires the movement of regular array sections. The inner loop involves interactions among neighboring elements of the grids; since blocks may be partitioned across processors, this also requires communication. Third, there must be some way of distributing loop iterations among the processors and converting global distributed array references to local references.

Multigrid is a common technique for accelerating the solution of partial differential equations. Multigrid codes employ a number of meshes at different levels of resolution. Multigrid codes have phases of restriction (in which a coarse grid is initialized based upon a finer grid), prolongation (in which a coarse grid is copied into a finer grid with non-unit stride and then the other elements on the fine grid are computed by interpolation) and sweeps over individual grids. Coarse meshes may be obtained from a fine grid by coarsening by the same factor or different factors along different dimensions. Accordingly, each grid may be distributed over the entire set of processors, or some grids may have to be distributed over parts of the processor space.

The runtime support requirements for the multigrid codes are as follows. As with multiblock codes, we need to be able to map grids over subsets of the processor space. The restriction and prolongation steps require regular section moves between grids at different levels of resolution. Again, communication arises because each grid may be distributed across multiple processors and computation within each grid requires near neighbor interactions. Similarly, the interpolation required during the prolongation step also involves interaction among neighboring elements. Also, support for distributing loop iterations and transforming global distributed array references to local references is required.

3 Runtime Support

In this section we present the details of the runtime support library that we have designed for parallelizing multiblock and multigrid codes on distributed memory machines. We also discuss how this library can be used by application programmers and compilers.

The set of runtime routines that we have developed is called the Multiblock Parti library [17]. In summary, these primitives allow an application programmer or a compiler to

- Lay out distributed data in a flexible way, to enable good load balancing and minimize interprocessor communication,
- Give high level specifications for performing data movement, and
- Distribute the computation across the processors.

We have designed the primitives so that communication overheads are significantly reduced (by using message aggregation). These primitives provide a machine-independent interface to the compiler writer and applications programmer. We view these primitives as forming a portion of a portable, compiler independent, runtime support library. This library is currently implemented on the Intel iPSC/860, the Thinking Machines' CM-5 and the PVM message passing environment for networks of workstations. The design of the library is architecture independent and therefore it can be easily ported to any distributed memory parallel machine or any environment that supports message passing (e.g. Express). The library primitives can currently be invoked from Fortran or C programs. Programmers can port their Fortran or C programs to distributed memory machines by manually inserting calls to the library routines. The resulting program has a Single Program Multiple Data (SPMD) model of parallelism.

3.1 Multiblock Parti Library

We now discuss the details of our runtime library primitives [17]. Since, in typical multiblock and multigrid applications, the number of blocks and their respective sizes are not known until runtime, the distribution of blocks onto processors is done at runtime. The distributed array descriptors (DAD) [3] for the arrays representing these blocks are, therefore, generated at runtime. Distributed array descriptors contain information about the portions of the arrays residing on each processor, and are used at runtime for performing communication and distributing loop iterations. We will not discuss the details of the primitives which allow the user to specify data distribution; for more details, see [17].

Two types of communication are required in both multiblock and multigrid applications. We need intra-block communication because a single block may be partitioned across the processors of the distributed memory parallel machine. Inter-block communication is required because of boundary conditions between blocks (in multiblock codes) and restrictions and prolongations between grids at different levels of resolution (in multigrid codes). Since the data that needs to be communicated is always a regular section of an array, this can be handled by primitives for regular section moves. A regular section move copies a regular section of one distributed array into a regular section of another distributed array, potentially involving changes of offset, stride and index permutation.

Intra-block communication is required because of the partitioning of blocks or grids across processors. The data access pattern in the computation within a block or grid is regular. This implies that the interaction between grid points is restricted to nearby neighbors. The interpolation required during the prolongation step in multigrid codes also involves interaction among the neighboring array elements. Such communication is handled by allocation of extra space at the beginning and end

of each array dimension on each processor. These extra elements are called overlap, or ghost, cells [7].

In our runtime system, communication is performed in two phases. First, a subroutine is called to build a communication schedule that describes the required data motion, and then another subroutine is called to perform the data motion (sends and receives on a distributed memory parallel machine) using a previously built schedule. Such an arrangement allows a schedule to be used multiple times in an iterative algorithm. The communication primitives include a procedure Overlap Cell Fill Sched, which computes a schedule that is used to direct the filling of overlap cells. The primitive Regular Section Copy Sched carries out the preprocessing required for performing the regular section moves. In Section 4, we discuss the details of the regular section analysis required for efficiently generating the schedule for a regular section move.

Now, we briefly discuss the functionality and implementation of the Overlap Cell Fill Sched primitive for filling in ghost or overlap cells in the presence of nested loops and multidimensional arrays. Consider a loop nest which reads a multidimensional array A. If the loop nest does not contain any loop carried dependencies, then all overlap cells can be updated by a single data move. The schedule for this data move can be generated by the primitive Overlap Cell Fill Sched. This primitive takes as input an array fills. The ith element of the array fills is the fill along the ith dimension; it specifies the number and direction of ghost cells that need to be filled along that dimension. A positive value of fill means that the ghost cells need to be filled along the higher indices, and a negative value means that the ghost cells need to be updated along the lower indices.

Suppose that the array A is m-dimensional and the array fills has n non-zero entries. Our implementation breaks the communication arising from this into disjoint sets of (m-1)-, (m-2)-, ..., (m-n)-dimensional objects. (We consider corners to be zero dimensional objects.) For example, if the array A is 3-dimensional and the array fills is {1, 1, 1}, then the communication required includes filling in 3 planes (1 along each dimension), 3 edges (1 for each combination of 2 out of 3 dimensions) and one corner. If, for the same 3-dimensional array A, the array fills is {1, 1, 0}, then the communication required includes filling in 2 planes (1 along each of the first two dimensions) and 1 edge.

The schedules produced by Overlap Cell Fill Sched and Regular Section Copy Sched are employed by a primitive called Data Move that carries out both interprocessor communication (sends and receives) and intra-processor data copying.
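To make the plane/edge/corner decomposition concrete, here is a small, self-contained C program (a sketch written for this presentation, not code from the Multiblock Parti library) that counts the objects implied by a fills array: every nonempty subset of the dimensions with a nonzero fill contributes one object whose dimensionality is m minus the subset size. For the two 3-dimensional cases above it prints 3 planes, 3 edges and 1 corner for fills = {1,1,1}, and 2 planes and 1 edge for fills = {1,1,0}.

    #include <stdio.h>

    /* Count the objects (planes, edges, corners, ...) that must be filled for an
     * m-dimensional array (m <= 16 assumed), given a fills array as described in
     * the text.  A nonzero fills[i] means ghost cells are needed along dimension
     * i; every nonempty subset of the marked dimensions yields one object of
     * dimension m minus the subset size. */
    static void count_fill_objects(int m, const int *fills)
    {
        int marked = 0;                      /* number of dimensions with nonzero fill */
        for (int i = 0; i < m; i++)
            if (fills[i] != 0)
                marked++;

        int count[16] = {0};                 /* count[d] = number of d-dimensional objects */
        for (unsigned s = 1; s < (1u << marked); s++) {
            int k = 0;                       /* subset size = number of bits set in s */
            for (unsigned t = s; t != 0; t >>= 1)
                k += (int)(t & 1u);
            count[m - k]++;
        }
        for (int d = m - 1; d >= 0; d--)
            if (count[d] > 0)
                printf("  %d object(s) of dimension %d\n", count[d], d);
    }

    int main(void)
    {
        int fills_a[3] = {1, 1, 1};          /* expect 3 planes, 3 edges, 1 corner */
        int fills_b[3] = {1, 1, 0};          /* expect 2 planes, 1 edge            */
        printf("fills = {1,1,1}:\n"); count_fill_objects(3, fills_a);
        printf("fills = {1,1,0}:\n"); count_fill_objects(3, fills_b);
        return 0;
    }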

3.2 Other Applications

While the design of the library was initially motivated by multiblock and multigrid applications, the library primitives can be used for supporting communication in a number of cases for regular applications as well. An example of this is compiling a High Performance Fortran (HPF) forall loop when either the loop bounds are not compile time constants or when exact data distribution information is not available at compile time. The forall loop in HPF has the following form:

    forall (i_1 = lo_1 : hi_1 : st_1, ..., i_m = lo_m : hi_m : st_m)
        A(f_1, f_2, ..., f_j) = ... B(g_1, g_2, ..., g_n) ...

i_k (k = 1..m) are the loop variables associated with the forall statement. lo_k, hi_k and st_k are respectively the lower bound, upper bound and the stride for each loop variable. For the left hand side array A, f_1, f_2, ..., f_j are the subscripts. Similarly, for the right hand side array B, g_1, g_2, ..., g_n are the subscripts. HPF specifications allow the lower bound, upper bound and stride along each dimension (i.e. for each loop variable) to be evaluated in any order. Consequently, the lower bound, upper bound and stride for a loop variable cannot be a function of any other loop variable. If the subscripts f_1, ..., f_j of the array A and the subscripts g_1, ..., g_n of the array B are all either a linear function of a loop variable or an invariant scalar, then the elements of the array A (and B) accessed by the forall statement form a regular section. If the loop bounds, subscript functions and data distributions are known at compile time, then the exact communication requirements associated with the forall loop can be determined by the compiler. Current prototype Fortran D/HPF compilers compile under the assumption that loop bounds, subscript functions and data distributions are always known at compile time [9]. However, this assumption often does not hold while compiling many real applications. Runtime resolution is required for determining exact communication and loop partitioning in such cases. The compiler can handle such cases easily by inserting calls to our primitives.

In Figure 2, we show how an HPF forall loop can be compiled when the loop bounds are not compile-time constants. In this example, the values of a, b, c, d and n are not known at compile time. In executing this loop, the Data Move primitive copies a regular section from array B into array Temp using the communication schedule Sched. Each processor determines the loop bounds for executing the loop using the runtime primitives Local Lower Bound and Local Upper Bound. The DAD (Distributed Array Descriptor) is used at runtime for determining information about the distribution of arrays A, B, C and Temp. The runtime calls for constructing the DAD are not shown in the example. Even if the loop bounds are known at compile time, runtime resolution may be required if complete information about the distribution of arrays A, B and C is not available to the compiler. This can happen primarily for two reasons: use of dynamic data distributions [5], or the compiler not having sufficient support for inter-procedural information propagation [8]. Again, in this case the compiler can handle the loop by inserting calls to our runtime primitives, as shown in Figure 2.

    C     ORIGINAL F90D CODE
    C     Arrays A, B and C are distributed identically
          forall (i = a:b:2, j = c:d)  A(i,j) = B(n*j,i) + C(i,j)

    C     TRANSFORMED CODE
    C     DAD is dist. array desc. for arrays A, B, C and Temp
          NumSrcDim = 2          NumDestDim = 2
          SrcDim(1) = 2          DestDim(1) = 1
          SrcDim(2) = 1          DestDim(2) = 2
          SrcLos(1) = n*c        DestLos(1) = a
          SrcLos(2) = a          DestLos(2) = c
          SrcHis(1) = n*d        DestHis(1) = b
          SrcHis(2) = b          DestHis(2) = d
          SrcStr(1) = n          DestStr(1) = 2
          SrcStr(2) = 2          DestStr(2) = 1
          Sched = Regular_Section_Move_Sched(DAD, DAD,
         &          NumSrcDim, NumDestDim, SrcDim, SrcLos, SrcHis, SrcStr,
         &          DestDim, DestLos, DestHis, DestStr)
          Call Data_Move(B, Sched, Temp)
          L1 = Local_Lower_Bound(DAD, 1, a)
          L2 = Local_Lower_Bound(DAD, 2, c)
          H1 = Local_Upper_Bound(DAD, 1, b)
          H2 = Local_Upper_Bound(DAD, 2, d)
          do 10 i = L1, H1, 2
             do 10 j = L2, H2
                A(i,j) = Temp(i,j) + C(i,j)
    10    continue

Figure 2: Compiling an HPF loop when loop bounds are not compile time constants

4 Regular Section Analysis

In this section we discuss the regular section analysis required for efficiently generating schedules for regular section moves (i.e. for implementing the primitive Regular Section Copy Sched). By regular section analysis we mean how each processor can determine, for each other processor, the exact parts of the distributed array it needs to send and receive, given the source and the destination regular sections in global coordinates. For each processor owning part of the source regular section, we want to determine the set of local elements that it will be sending to each processor owning part of the destination regular section. We call these sets of elements send sets. Similarly, for each processor owning part of the destination, we want to determine the set of local elements that it will be receiving from each processor owning part of the source. We call these sets of elements receive sets. Here we just discuss the analysis that a particular source processor does to compute the send sets. The analysis for determining the receive sets is completely analogous and is therefore not described.

For simplicity of description, we assume that both the source and the destination arrays have r dimensions. The analysis can be performed even when the number of dimensions in the source and the destination arrays is not equal. The source regular section, denoted by S, is part of the source distributed array:

    S = {(s_lo_1 : s_hi_1 : s_str_1), (s_lo_2 : s_hi_2 : s_str_2), ..., (s_lo_r : s_hi_r : s_str_r)}

s_lo_i, s_hi_i and s_str_i are respectively the lowest index, highest index and the stride along the ith dimension (in global indices). We assume that for all i, 1 <= i <= r, s_lo_i <= s_hi_i. The regular section S defines a set of array elements. An array index e_i along the ith dimension is said to be a part of the regular section iff there exists an integer l_i such that

    e_i = s_lo_i + l_i * s_str_i,   where   0 <= l_i <= (s_hi_i - s_lo_i) / s_str_i        (4.0.1)

An array element whose indices are (e_1, e_2, ..., e_r) belongs to the regular section S iff, for all i, 1 <= i <= r, the array index e_i along the ith dimension belongs to the regular section S.
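As a small illustration of this membership condition (again a sketch for this presentation, not library code), the C test below checks whether a global index e belongs to a one-dimensional regular section (lo : hi : str): the index must lie between the bounds and its offset from lo must be a multiple of the stride, which is equivalent to the existence of the integer l_i in Equation 4.0.1.

    #include <stdio.h>

    /* Returns 1 iff index e belongs to the regular section (lo : hi : str),
     * i.e. e = lo + l*str for some integer l with 0 <= l <= (hi - lo)/str. */
    static int in_section(int e, int lo, int hi, int str)
    {
        return e >= lo && e <= hi && (e - lo) % str == 0;
    }

    int main(void)
    {
        /* Section (10 : 60 : 2): indices 10, 12, ..., 60. */
        printf("%d %d %d\n", in_section(50, 10, 60, 2),   /* 1 */
                             in_section(51, 10, 60, 2),   /* 0: not on the stride */
                             in_section(62, 10, 60, 2));  /* 0: beyond the upper bound */
        return 0;
    }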

We will use this format to describe all regular sections. Array indices always start from 0 (as in the C programming language). The processors owning the source (or destination) array can be viewed as forming an r-dimensional virtual processor grid. A processor p in this processor grid has coordinates {p_1, p_2, ..., p_r}. We assume that the numbering starts from zero in each dimension of the processor grid. The destination regular section, denoted by D, is a part of the destination array:

    D = {(d_lo_1 : d_hi_1 : d_str_1), (d_lo_2 : d_hi_2 : d_str_2), ..., (d_lo_r : d_hi_r : d_str_r)}

We denote by im(i) the destination dimension which is aligned to the ith source dimension. The steps we follow for computing the send sets are as follows. For a processor p, we determine the part of the regular section S that it owns; that is, we restrict the section S to the processor p (the result is denoted by S'(p)). Next, we apply a transformation to the section S'(p) to map it from the source regular section S to the destination regular section D. The resulting section is denoted by D'(p). We next determine the set of destination processors that own part of the section D'(p), i.e. the destination processors to which the processor p will be communicating. For each such processor q, we restrict the section D'(p) to determine the part that q owns, calling it D''(p,q). In the last step, the section D''(p,q) is mapped back to the source; the resulting section is denoted by S''(p,q). S''(p,q) is the send set that the source processor p will be sending to the destination processor q. Now, we present the details of these steps. In this paper, we just discuss the details of the analysis when the data is block distributed in both the source and destination arrays. Details of the analysis when the data distribution is cyclic are available in [2]. We consider a particular processor p which owns part of the source array, of which the regular section S is a part. Let llo_i^s(p) and lhi_i^s(p) be the lowest and the highest points along the ith dimension (in global indices) that the processor p owns. Since the data distribution is block, the processor p owns a contiguous set of indices from llo_i^s(p) to lhi_i^s(p).

4.1 Restricting S to the Processor p

We now compute the part of the regular section S that the processor p owns. This is denoted by S'(p). Throughout our discussion, all regular sections will always be described in global indices. Given the global coordinates, any individual processor can always determine the corresponding local (on-processor) indices.

    S'(p) = {(s_lo'_1(p) : s_hi'_1(p) : s_str'_1(p)), ..., (s_lo'_r(p) : s_hi'_r(p) : s_str'_r(p))}

where

    s_lo'_i(p) = s_lo_i + max(0, ceil((llo_i^s(p) - s_lo_i) / s_str_i)) * s_str_i        (4.1.1)
    s_hi'_i(p) = min(lhi_i^s(p), s_hi_i)                                                 (4.1.2)
    s_str'_i(p) = s_str_i                                                                (4.1.3)

Now, we briefly argue the correctness of the above equations. Since the data is block distributed, s_str'_i(p) is s_str_i. s_lo'_i(p) is the first index along the ith dimension which is part of the regular section S and is owned by the processor p. For the purpose of our discussion, assume that the processor p owns at least some part of the regular section. This implies that llo_i^s(p) <= s_hi_i and lhi_i^s(p) >= s_lo_i. In the calculation of s_lo'_i(p) there are two cases, depending upon whether s_lo_i >= llo_i^s(p) or s_lo_i < llo_i^s(p). If s_lo_i >= llo_i^s(p) then the index s_lo_i is on the processor p (since we assumed that the processor p owns at least a part of the regular section). Therefore, s_lo'_i(p) = s_lo_i. (The expression for s_lo'_i(p) in Equation 4.1.1 also reduces to s_lo_i when s_lo_i >= llo_i^s(p).) Alternatively, if s_lo_i < llo_i^s(p), then the expression for s_lo'_i(p) reduces to s_lo_i + ceil((llo_i^s(p) - s_lo_i) / s_str_i) * s_str_i. We need to establish three things to ensure that this is the correct starting index along the ith dimension.

1. The index s_lo'_i(p) along the ith dimension belongs to the regular section S. This follows from our requirement for an index to belong to the regular section (Equation 4.0.1), with l_i = ceil((llo_i^s(p) - s_lo_i) / s_str_i). If s_lo_i < llo_i^s(p) and if p owns at least some part of the regular section S, then this value of l_i satisfies 0 <= l_i <= (s_hi_i - s_lo_i) / s_str_i. This establishes that the index s_lo'_i(p), along the ith dimension, belongs to the regular section S.

2. The index s_lo'_i(p) must be greater than or equal to the smallest index along the ith dimension on the processor p. This follows from the observation that if s_lo_i < llo_i^s(p), then s_lo_i + ceil((llo_i^s(p) - s_lo_i) / s_str_i) * s_str_i >= llo_i^s(p).

3. No index less than s_lo'_i(p) is a member of the regular section S and is on processor p. Any member of the regular section S less than s_lo'_i(p) has to be smaller by at least the stride s_str_i. To show that no index less than s_lo'_i(p) is a member of the regular section S and is on processor p, we need to show that s_lo'_i(p) - llo_i^s(p) < s_str_i. If s_lo_i < llo_i^s(p), then s_lo'_i(p) is s_lo_i + ceil((llo_i^s(p) - s_lo_i) / s_str_i) * s_str_i. It can be shown that s_lo_i + ceil((llo_i^s(p) - s_lo_i) / s_str_i) * s_str_i - llo_i^s(p) < s_str_i. This means any member of the regular section S less than s_lo'_i(p) will also be less than llo_i^s(p) and will therefore not reside on the processor p.

s_hi'_i(p) is not necessarily a member of the regular section S. This is because if a for loop accesses the indices from s_lo'_i(p) to s_hi'_i(p) with stride s_str_i, then it will terminate at the correct index if two requirements are met. First, s_hi'_i(p) needs to be less than or equal to the highest index on the processor p. Second, s_hi'_i(p) needs to be greater than or equal to the highest index in the regular section owned by the processor p. In establishing the correctness of the expression for s_hi'_i(p) (Equation 4.1.2), there are two cases depending upon whether lhi_i^s(p) <= s_hi_i or lhi_i^s(p) > s_hi_i. If lhi_i^s(p) <= s_hi_i then the expression for s_hi'_i(p) reduces to lhi_i^s(p). This satisfies the two requirements. Alternatively, if lhi_i^s(p) > s_hi_i then the expression for s_hi'_i(p) reduces to s_hi_i. Again, this satisfies the two requirements.

If the processor p does not own any part of the regular section along the ith dimension, then the above expressions will give a value of s_lo'_i(p) which is greater than the value of s_hi'_i(p). In general, if there exists an i, 1 <= i <= r, with s_lo'_i(p) > s_hi'_i(p), then the processor p does not own any part of the regular section S.

4.2 Mapping S'(p) to the Destination

Next, we determine the corresponding section in the destination array (i.e. the part of the destination regular section that will be received from the processor p). This is denoted by D'(p).

    D'(p) = {(d_lo'_1(p) : d_hi'_1(p) : d_str'_1(p)), ..., (d_lo'_r(p) : d_hi'_r(p) : d_str'_r(p))}

where

    j = im(i)                                                              (4.2.1)
    d_lo'_j(p) = d_lo_j + ((s_lo'_i(p) - s_lo_i) / s_str_i) * d_str_j      (4.2.2)
    d_hi'_j(p) = d_lo_j + floor((s_hi'_i(p) - s_lo_i) / s_str_i) * d_str_j (4.2.3)
    d_str'_j(p) = d_str_j                                                  (4.2.4)

Since the array is distributed by blocks, d_str'_j(p) is d_str_j. The expressions for d_lo'_j(p) and d_hi'_j(p) follow from finding the indices in the destination regular section which correspond to the indices s_lo'_i(p) and s_hi'_i(p), respectively, in the source regular section. The term (s_lo'_i(p) - s_lo_i) / s_str_i is the number of indices in the regular section S which are less than s_lo'_i(p). (By Equation 4.1.1, s_lo'_i(p) - s_lo_i is a multiple of s_str_i.) Multiplying this term by d_str_j and adding it to d_lo_j gives the index in the destination which corresponds to s_lo'_i(p). Similarly, in determining d_hi'_j(p), the term floor((s_hi'_i(p) - s_lo_i) / s_str_i) is the number of indices in the regular section S which are less than s_hi'_i(p). Multiplying this term by d_str_j and adding it to d_lo_j gives the index in the destination which corresponds to s_hi'_i(p).
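As a concrete check of the two steps presented so far (a sketch for this presentation, not library code), the C fragment below restricts one source dimension (10 : 60 : 2) to a processor owning global indices 50 through 99 (Equations 4.1.1-4.1.3) and maps the result onto an aligned destination dimension with d_lo_j = 5 and d_str_j = 3 (Equations 4.2.2-4.2.3); it prints the restricted section (50 : 60 : 2) and the destination range 65..80.

    #include <stdio.h>

    static int ceil_div(int a, int b) { return (a + b - 1) / b; }   /* a >= 0, b > 0 */
    static int min_int(int a, int b)  { return a < b ? a : b; }

    int main(void)
    {
        /* Source section (s_lo : s_hi : s_str) along one dimension, the block of
         * global indices llo..lhi owned by processor p, and the aligned
         * destination dimension (example values only). */
        int s_lo = 10, s_hi = 60, s_str = 2;
        int llo = 50, lhi = 99;
        int d_lo = 5, d_str = 3;

        /* Equations 4.1.1 and 4.1.2: restrict S to processor p. */
        int k = (llo > s_lo) ? ceil_div(llo - s_lo, s_str) : 0;
        int r_lo = s_lo + k * s_str;
        int r_hi = min_int(lhi, s_hi);           /* stride is unchanged (Eq 4.1.3) */

        /* Equations 4.2.2 and 4.2.3: map the restricted indices to the destination. */
        int m_lo = d_lo + ((r_lo - s_lo) / s_str) * d_str;
        int m_hi = d_lo + ((r_hi - s_lo) / s_str) * d_str;

        printf("S'(p) along this dimension: (%d : %d : %d)\n", r_lo, r_hi, s_str);
        printf("D'(p) along the aligned dimension: %d .. %d (stride %d)\n", m_lo, m_hi, d_str);
        return 0;
    }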

4.3 Restricting D'(p) to Processor q

We first determine the set of processors to which the source processor p will send data. The processors owning parts of the destination array form an r-dimensional virtual processor grid. A processor q, owning part of the destination array, has coordinates {q_1, q_2, ..., q_r} in this processor grid. We assume that each processor owns size_i indices along the ith dimension. In our implementation, the first and last processors along each dimension can own a different number of indices from the other processors (because of ghost cells, or because the number of processors does not evenly divide the total number of elements, etc.). However, for simplicity of presentation, we do not discuss this case here. We denote by p_min_i(p) and p_max_i(p) the lowest and highest coordinates along the ith dimension of the processors which own part of the regular section D'(p).

    p_min_i(p) = ceil((d_lo'_i(p) + 1) / size_i) - 1        (4.3.1)
    p_max_i(p) = ceil((d_hi'_i(p) + 1) / size_i) - 1        (4.3.2)

A processor q having coordinates {q_1, q_2, ..., q_r} will receive data from the source processor p iff

    for all i, 1 <= i <= r,   p_min_i(p) <= q_i <= p_max_i(p)

Consider a particular processor q which will receive data from the source processor p. Suppose that the start and end points along the ith dimension on this processor are llo_i^d(q) and lhi_i^d(q) respectively. We denote the part of the destination regular section that the processor q will receive from the processor p by D''(p,q).

    D''(p,q) = {(d_lo''_1(p,q) : d_hi''_1(p,q) : d_str''_1(p,q)), ..., (d_lo''_r(p,q) : d_hi''_r(p,q) : d_str''_r(p,q))}

where

    d_lo''_i(p,q) = d_lo'_i(p) + max(0, ceil((llo_i^d(q) - d_lo'_i(p)) / d_str_i)) * d_str_i    (4.3.3)
    d_hi''_i(p,q) = min(lhi_i^d(q), d_hi'_i(p))                                                 (4.3.4)
    d_str''_i(p,q) = d_str_i                                                                    (4.3.5)

The reasoning behind the correctness of the above expressions is the same as that used in determining S'(p) from S, as discussed in Section 4.1.

4.4 Mapping D''(p,q) to the Source

Next, we determine the equivalent part of the regular section D''(p,q) on the source side (i.e. the part of the source regular section which the processor p sends to the processor q). We denote this by S''(p,q).

    S''(p,q) = {(s_lo''_1(p,q) : s_hi''_1(p,q) : s_str''_1(p,q)), ..., (s_lo''_r(p,q) : s_hi''_r(p,q) : s_str''_r(p,q))}

where

    j = im(i)                                                                    (4.4.1)
    s_lo''_i(p,q) = s_lo_i + ((d_lo''_j(p,q) - d_lo_j) / d_str_j) * s_str_i      (4.4.2)
    s_hi''_i(p,q) = s_lo_i + floor((d_hi''_j(p,q) - d_lo_j) / d_str_j) * s_str_i (4.4.3)
    s_str''_i(p,q) = s_str_i                                                     (4.4.4)

The reasoning behind the correctness of these expressions is the same as that used in determining D'(p) from S'(p) in Section 4.2.

4.5 Discussion

All the calculations described above are performed by the processor p locally and do not involve communication with any other processor. Therefore, communication schedules can be generated efficiently. Based on the calculation of S'', the processor p knows the contents of the message that it must send to processor q. However, when processor q receives this message, it does not have any information about which local memory locations each element of the message must be copied into. To facilitate this, each destination processor computes the set of (local) elements that it will receive from each source processor. The calculations for computing these receive sets are completely analogous to the computations for the send sets. Therefore, we do not describe the computation of the receive sets here.

The source processor p always sends the set of elements it needs to send to the processor q in a single message, packed in column major fashion. Processor q can then use the receive set information to copy the elements in the received message into the appropriate local elements. An alternative to this scheme is that the message sent by the source processor p also contains information about what local memory location at the destination processor q each of the elements packed in the message needs to be copied to. The destination processor q can then copy elements of the message into its local elements based upon this information. This approach does save some computation at the destination processors. However, the size of the messages increases significantly because of the extra information that needs to be sent. In our implementation, we have chosen to compute both the send and receive sets, since on current distributed memory machines this is less expensive than communicating the receive set information.

4.6 Example

Consider a regular section move that involves a source array of size 100 x 100 and a destination array of size 50 x 100. The source array is block distributed over a 2 x 2 virtual processor grid and the destination array is block distributed over a 1 x 4 virtual processor grid. The source and the destination regular sections are as follows:

    S = {(10 : 60 : 2), (10 : 70 : 3)}
    D = {(10 : 30 : 1), (5 : 80 : 3)}

The first dimension of the source regular section is aligned to the second dimension of the destination regular section, and the second dimension of the source regular section is aligned to the first dimension of the destination regular section, i.e. im(1) = 2 and im(2) = 1. In Figure 3, we show the steps for computing these send sets.

We consider the source processor p with coordinates {1, 0}. The part of the global array that this processor owns is given by llo_1^s(p) = 50, lhi_1^s(p) = 99, llo_2^s(p) = 0 and lhi_2^s(p) = 49. The part of the source regular section that processor p owns (S'(p)) is given by

    S'(p) = {(50 : 60 : 2), (10 : 49 : 3)}

The corresponding section on the destination side (D'(p)) is given by

    D'(p) = {(10 : 23 : 1), (65 : 80 : 3)}

Next, the processor p determines the set of destination processors with which it will be communicating. For the destination array we have size_1 = 50 and size_2 = 25. This gives p_min_1(p) = 0, p_max_1(p) = 0, p_min_2(p) = 2, and p_max_2(p) = 3. The destination processors the source processor p will communicate with are therefore the ones with grid coordinates {0, 2} and {0, 3}.

Consider the destination processor q with coordinates {0, 3}. The part of the destination array that processor q owns is given by llo_1^d(q) = 0, lhi_1^d(q) = 49, llo_2^d(q) = 75 and lhi_2^d(q) = 99. The part of the destination regular section which the processor p will be sending to the processor q (D''(p,q)) is given by

    D''(p,q) = {(10 : 23 : 1), (77 : 80 : 3)}

The corresponding source section (S''(p,q)) is now given as

    S''(p,q) = {(58 : 60 : 2), (10 : 49 : 3)}

Figure 3: Example for computing send set with block distributed arrays. (The figure shows the sections S, S'(p), D, D'(p), D''(p,q) and S''(p,q) on the 2 x 2 source processor grid and the 1 x 4 destination processor grid.)
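The example above can be checked mechanically. The following self-contained C program (a sketch written for this presentation, not code from the Multiblock Parti library) implements the four steps of Sections 4.1-4.4 for block distributions, using the source processor p = {1,0} and the destination processor q = {0,3}; its output reproduces the sections S'(p), D'(p), D''(p,q) and S''(p,q) and the destination processor range derived above.

    #include <stdio.h>

    static int ceil_div(int a, int b) { return (a + b - 1) / b; }   /* a >= 0, b > 0 */
    static int min_int(int a, int b)  { return a < b ? a : b; }

    /* A one-dimensional regular section (lo : hi : str) in global indices. */
    typedef struct { int lo, hi, str; } Dim;

    int main(void)
    {
        /* Source regular section S and destination regular section D (Section 4.6). */
        Dim S[2] = { {10, 60, 2}, {10, 70, 3} };
        Dim D[2] = { {10, 30, 1}, { 5, 80, 3} };
        int im[2] = { 1, 0 };          /* im(i): source dim i is aligned to dest dim im[i] (0-based) */

        /* Block distribution: source processor p = {1,0} of the 2 x 2 grid over the
         * 100 x 100 array owns rows 50..99 and columns 0..49; destination processor
         * q = {0,3} of the 1 x 4 grid over the 50 x 100 array owns rows 0..49 and
         * columns 75..99. */
        int llo_s[2] = {50, 0},  lhi_s[2] = {99, 49};
        int llo_d[2] = { 0, 75}, lhi_d[2] = {49, 99};
        int size_d[2] = {50, 25};      /* destination block sizes along each dimension */

        Dim Sp[2], Dp[2], Dpq[2], Spq[2];

        for (int i = 0; i < 2; i++) {
            /* Step 1 (Eqs 4.1.1-4.1.3): restrict S to processor p. */
            int k = (llo_s[i] > S[i].lo) ? ceil_div(llo_s[i] - S[i].lo, S[i].str) : 0;
            Sp[i].lo  = S[i].lo + k * S[i].str;
            Sp[i].hi  = min_int(lhi_s[i], S[i].hi);
            Sp[i].str = S[i].str;

            /* Step 2 (Eqs 4.2.1-4.2.4): map S'(p) onto the aligned destination dimension. */
            int j = im[i];
            Dp[j].lo  = D[j].lo + ((Sp[i].lo - S[i].lo) / S[i].str) * D[j].str;
            Dp[j].hi  = D[j].lo + ((Sp[i].hi - S[i].lo) / S[i].str) * D[j].str;
            Dp[j].str = D[j].str;
        }

        for (int i = 0; i < 2; i++) {
            /* Owning destination processors along dimension i (Eqs 4.3.1-4.3.2). */
            int p_min = ceil_div(Dp[i].lo + 1, size_d[i]) - 1;
            int p_max = ceil_div(Dp[i].hi + 1, size_d[i]) - 1;
            printf("dest dim %d: processors %d..%d own part of D'(p)\n", i + 1, p_min, p_max);

            /* Step 3 (Eqs 4.3.3-4.3.5): restrict D'(p) to processor q. */
            int k = (llo_d[i] > Dp[i].lo) ? ceil_div(llo_d[i] - Dp[i].lo, D[i].str) : 0;
            Dpq[i].lo  = Dp[i].lo + k * D[i].str;
            Dpq[i].hi  = min_int(lhi_d[i], Dp[i].hi);
            Dpq[i].str = D[i].str;
        }

        for (int i = 0; i < 2; i++) {
            /* Step 4 (Eqs 4.4.1-4.4.4): map D''(p,q) back to the source. */
            int j = im[i];
            Spq[i].lo  = S[i].lo + ((Dpq[j].lo - D[j].lo) / D[j].str) * S[i].str;
            Spq[i].hi  = S[i].lo + ((Dpq[j].hi - D[j].lo) / D[j].str) * S[i].str;
            Spq[i].str = S[i].str;
        }

        printf("S'(p)    = {(%d:%d:%d), (%d:%d:%d)}\n", Sp[0].lo, Sp[0].hi, Sp[0].str, Sp[1].lo, Sp[1].hi, Sp[1].str);
        printf("D'(p)    = {(%d:%d:%d), (%d:%d:%d)}\n", Dp[0].lo, Dp[0].hi, Dp[0].str, Dp[1].lo, Dp[1].hi, Dp[1].str);
        printf("D''(p,q) = {(%d:%d:%d), (%d:%d:%d)}\n", Dpq[0].lo, Dpq[0].hi, Dpq[0].str, Dpq[1].lo, Dpq[1].hi, Dpq[1].str);
        printf("S''(p,q) = {(%d:%d:%d), (%d:%d:%d)}\n", Spq[0].lo, Spq[0].hi, Spq[0].str, Spq[1].lo, Spq[1].hi, Spq[1].str);
        return 0;
    }

Note that, as pointed out in Section 4.5, all of these computations use only globally known quantities (section bounds, strides, the alignment and the block sizes), so each processor can compute its send sets without communicating with any other processor.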

5 Experimental Results

In this section, we present experimental results to demonstrate the effectiveness of our primitives. We have performed experiments to study the runtime overhead of the primitives, the scalability of the primitives with problem size and the effect of data distribution.

5.1 Overhead of Primitives

We measure the runtime overhead incurred in using our library primitives as compared to the bare cost of communication associated with the best possible hand parallelized codes. The use of the library primitives involves runtime overheads because of generating schedules, copying data to be communicated into buffers at source processors, and similarly copying the received data into appropriate memory locations at the destination processors. The possible advantage (in terms of efficiency) of the library primitives is that, for each invocation of a data move, each processor sends at most one message to each other processor. It may, in general, be very difficult for an application programmer, parallelizing the code by hand, to do such message aggregation. However, the best performance that an application programmer can ever achieve will only have the cost of actual communication and computation, assuming that messages have been aggregated to reduce the effect of communication latencies. We will study the overheads incurred in a code parallelized using our primitives as compared to the best possible performance of hand parallelized code, which incurs the minimum communication costs (assuming maximum message aggregation).

We consider a simple code executed on 2 processors in which a regular section move involves moving data from processor 0 to processor 1. We vary the number of bytes involved in the regular section moves and measure three sets of timings: the time required just for communication, the time required for communication when the library primitives are used (excluding the cost of schedule building), and the total time required when the library primitives are used (including time for schedule building). The first set of timings represents the best performance that hand parallelization can achieve if all the data elements to be communicated are laid out contiguously. If the data elements to be communicated are not contiguous, then the application programmer will need to do copying to aggregate messages; the second set of timings represents this case. The third set of timings represents the performance with the use of library primitives. The timings presented are for 100 iterations of the regular section move.

The performance results on an iPSC/860 are presented in Figure 4. The results show that the cost of copying (the difference between Set I and Set II) is typically a small fraction (less than 5%) of the cost of communication for most of the cases. Also, if the schedule built is used over a large number of iterations, the cost of building the schedule is also a small fraction of the cost of communication.

Figure 4: Performance of Primitives on iPSC/860 (100 iterations). Set I: communication only (maximum message aggregation); Set II: communication and copying; Set III: communication, copying and schedule building. (The original figure plots time in ms against bytes per iteration x 10^3.)

5.2 Scalability of Primitives

To maintain low runtime overheads in using the library, it is important that the primitives for building communication schedules are efficient. An interesting feature of the primitive Regular Section Copy Sched is that the time it takes for generating schedules does not increase as the problem size (the sizes of the regular sections) increases. In Figure 5, we show the execution time of the primitive Regular Section Copy Sched when the regular section move involved sections of varying sizes. In all the cases, the source section with a stride of 1 along all dimensions was copied into a destination section having a stride of 4 along the first dimension and a stride of 1 along the other two dimensions. The source section was distributed in block fashion over 4 processors and the destination section was distributed in block fashion over all the processors. The experiments were performed on an Intel iPSC/860. The figure shows that an increase in problem size has no effect on the execution time of the primitive.

Figure 5: Scalability of Primitive with Problem Size (iPSC/860). Source array is distributed on 4 processors; destination array is distributed on all the processors. All timings in ms.

    Size of Source Regular Section    Time taken, 16 proc. (ms)    Time taken, 32 proc. (ms)
    2 x 2 x 2                         9.17                         9.12
    4 x 4 x 4                         9.16                         9.32
    8 x 8 x 8                         9.21                         9.32
    16 x 16 x 16                      9.15                         9.30
    32 x 32 x 32                      9.19                         9.33

The performance of the primitives is also not affected much by the number of processors on which the program is executed. In Figure 6, we show the time taken by the primitive when sections of the same size are distributed over a varying number of processors. The execution time of the primitive depends upon the number of processors each processor communicates with. In the regular section move performed for this experiment, the number of processors with which each processor communicates decreases in going from 4 processors to 8 processors, but increases as we go to 16 and then 32 processors. Accordingly, we see a small decrease in the execution time in going from 4 processors to 8 processors and then an increase as we go to the 16 and 32 processor cases. This difference, however, is very small.

Figure 6: Effect of number of processors on primitives (iPSC/860). Source array is distributed on 4 processors; destination array is distributed on all the processors. All timings in ms.

    No. of Processors    Time taken, 8 x 8 x 8 Source Section    Time taken, 32 x 32 x 32 Source Section
    4                    9.18                                    9.20
    8                    9.10                                    9.14
    16                   9.21                                    9.19
    32                   9.32                                    9.33

Figure 7: Effect of Data Distribution on iPSC/860. TWO BLOCK: 49 x 17 x 9 Mesh (50 Iterations). All timings in ms.

    Number of Processors    Blocks Mapped to Entire Proc. Space    Blocks Mapped to Disjoint Proc. Spaces
    4                       8.99                                   7.59
    8                       5.14                                   4.74
    16                      3.24                                   2.83
    32                      2.41                                   1.87

5.3 Effect of Data Distribution

As we discussed earlier, one of the features of our runtime library is the ability to map arrays (or templates) to subsets of the processor space. In the current definition of HPF (and hence in HPF compilers), this is not possible. In block structured codes, this feature allows us to keep the communication overheads low while maintaining the load balance. To study the benefit of this feature, we experimented with the multiblock template [18] for a two block input case. We ran the parallelized code, once distributing both blocks over the entire processor space and then distributing each block over disjoint processor spaces. The results on the Intel iPSC/860, shown in Figure 7, show that the latter scheme improves the performance by nearly 10 to 25%. Since there is no difference in the net computation performed at each processor in either of the two cases, this difference comes from the increased amount of communication required when each block is distributed across the entire processor space. Mapping a block over a large number of processors increases communication arising from near neighbor interactions during the regular computation within blocks. Note that a 10 to 25% degradation in performance occurs when there are only two blocks. We expect that with a larger number of blocks, the difference in performance would be much more severe.

6 Conclusions

In this paper we have addressed the problem of efficient runtime support for parallelizing multiblock and multigrid applications on distributed memory machines. We have

designed and implemented a set of runtime primitives for parallelizing these applications in an efficient, convenient and machine independent manner. The runtime primitives give the ability to specify data distributions, perform communication and distribute loops based on data distributions specified at runtime. One of the communication primitives in our library is the regular section move, which can copy a rectilinear part of a distributed array onto a rectilinear part of another distributed array, potentially involving index permutations, change of strides and change in offsets. The design of this runtime primitive involves a runtime regular section analysis. The experimental results show that these primitives have low communication overheads. We have so far used our runtime support for parallelizing one multiblock template and a multigrid code. Effort is also underway to port a complete multiblock code.

Acknowledgements

We are grateful to V. Vatsa and M. Sanetrik at NASA Langley for giving us access to the multiblock TLNS3D application code. We would also like to thank John van Rosendale at ICASE and Andrea Overman at NASA Langley for making their sequential and hand parallelized multigrid code available to us. We thank Jim Humphries for creating a portable version of the runtime library. We acknowledge the continuing relevance of the initial work done by Kay Crowley, Craig Chase and S. Gupta on this project.

References

[1] Gagan Agrawal, Alan Sussman, and Joel Saltz. Compiler and runtime support for structured and block structured applications. In Proceedings Supercomputing '93, pages 578-587. IEEE Computer Society Press, November 1993. An extended version is available as University of Maryland Technical Report CS-TR-3052 and UMIACS-TR-93-29.

[2] Gagan Agrawal, Alan Sussman, and Joel Saltz. On efficient runtime support for multiblock and multigrid applications: Regular section analysis. Technical Report CS-TR-3140 and UMIACS-TR-93-92, University of Maryland, Department of Computer Science and UMIACS, September 1993.

[3] Z. Bozkus, A. Choudhary, G. Fox, T. Haupt, and S. Ranka. Fortran 90D/HPF compiler for distributed memory MIMD computers: Design, implementation and performance results. In Proceedings Supercomputing '93, pages 351-360. IEEE Computer Society Press, November 1993.

[4] Kalpana Chawla and William R. Van Dalsem. Numerical simulation of a powered-lift landing. In Proceedings of the 72nd Fluid Dynamics Panel Meeting and Symposium on Computational and Experimental Assessment of Jets in Cross Flow, Winchester, UK. AGARD, April 1993.

[5] D. Loveman (Ed.). Draft High Performance Fortran language specification, version 1.0. Technical Report CRPC-TR92225, Center for Research on Parallel Computation, Rice University, January 1993.

[6] Survey of principal investigators of grand challenge applications: Workshop on grand challenge applications and software technology, May 1993.

[7] Michael Gerndt. Updating distributed variables in local computations. Concurrency: Practice and Experience, 2(3):171-193, September 1990.

[8] M.W. Hall, S. Hiranandani, K. Kennedy, and C.-W. Tseng. Interprocedural compilation of Fortran D for MIMD distributed-memory machines. In Proceedings Supercomputing '92, pages 522-534. IEEE Computer Society Press, November 1992.

[9] Seema Hiranandani, Ken Kennedy, and Chau-Wen Tseng. Compiling Fortran D for MIMD distributed-memory machines. Communications of the ACM, 35(8):66-80, August 1992.

[10] J.R.G. Townshend, C.O. Justice, W. Li, C. Gurney, and J. McManus. Global land cover classification by remote sensing: present capabilities and future possibilities. Remote Sensing of Environment, 35:243-256, 1991.

[11] Scott R. Kohn and Scott B. Baden. An implementation of the LPAR parallel programming model for scientific computations. In Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific Computing, pages 759-766. SIAM, March 1993.

[12] Max Lemke and Daniel Quinlan. P++, a C++ virtual shared grids based programming environment for architecture-independent development of structured grid applications. Technical Report 611, GMD, February 1992.

[13] Rohit Mathur, Leonard K. Peters, and Rick D. Saylor. Sub-grid representation of emission source clusters in regional air quality monitoring. Atmospheric Environment, 26A(17):3219-3237, 1992.

[14] S. McCormick. Multilevel Projection Methods for Partial Differential Equations. SIAM, 1992.

[15] Andrea Overman and John Van Rosendale. Mapping robust parallel multigrid algorithms to scalable memory architectures. To appear in Proceedings of the 1993 Copper Mountain Conference on Multigrid Methods, April 1993.

[16] J.M. Stone and M.L. Norman. Zeus-2D: A radiation magnetohydrodynamics code for astrophysical flows in two space dimensions: I. The hydrodynamic algorithms and tests. The Astrophysical Journal Supplements, 80(753), 1992.

[17] Alan Sussman, Gagan Agrawal, and Joel Saltz. A manual for the multiblock PARTI runtime primitives, revision 4.1. Technical Report CS-TR-3070.1 and UMIACS-TR-93-36.1, University of Maryland, Department of Computer Science and UMIACS, December 1993.

[18] V.N. Vatsa, M.D. Sanetrik, and E.B. Parlette. Development of a flexible and efficient multigrid-based multiblock flow solver; AIAA-93-0677. In Proceedings of the 31st Aerospace Sciences Meeting and Exhibit, January 1993.
