Coordinating Programs in the Network of Tasks Model

Susanna Pelagatti ([email protected])

Dipartimento di Informatica, Università di Pisa, Corso Italia 40, 56125 Pisa, Italy

David B. Skillicorn ([email protected])

Department of Computing and Information Science, Queen's University, Kingston, Canada

Abstract. The Network of Tasks (NOT) model allows adaptive node programs written in a variety of parallel languages to be connected together in an almost acyclic task graph. The main difference between NOT and other task graphs is that it is designed to make the performance of the graph predictable from knowledge of the performance of the component node programs and the visible structure of the graph. It can therefore be regarded as a coordination language that is transparent about performance. For large-scale computations that are distributed to geographically-distributed compute servers, the NOT model helps programmers to plan, assemble, schedule, and distribute their problems.

Keywords: task graph, parallel computation, adaptive computation, heterogeneous computation, program transformation, cost modelling, coordination languages

1. Introduction

The Network of Tasks (NOT) model is a directed graph coordination language for node programs that are heterogeneous, adaptive, and properly costed. This means that node programs may be written in different parallel languages, may be transformed to use different resources, and that their performance can be predicted from knowledge of their structure and a few architectural parameters. In practice, this means that they are written in a skeleton-like language such as P3L (Danelutto et al, 1997), Calypso (Baratloo et al, 1995), BSP (Skillicorn et al, 1997) or any of a number of data-parallel languages. When node programs exist in functionally-equivalent but textually-different forms, or when node programs are adaptive, that is, can be transformed to use different resources, NOT is able to select a global implementation that preserves the work cost (product of time and processors) visible to the programmer. This makes transparent design possible in a richer framework than was previously possible.

Considerable progress has been made in designing parallel models that have predictable costs on parallel architectures, but models have achieved this predictability by limiting program composition operations, usually to sequential composition of operations that use all of the processors of the target machine. This is too restrictive: programs forced into this style may lack sufficient parallelism to fully use a target machine, or may require artificial and opaque structuring. At the other extreme, coordination languages such as Linda (Ahuja et al, 1988) allow flexible arrangement of program components, but it is hard to predict the performance of the resulting program.

The NOT model allows node programs to be connected by directed edges modelling data dependencies. Its primary goal is to preserve predictability of performance: knowing the cost of the component node programs, it is possible to predict the cost of the NOT program. To do this requires reliable architectural parameters as inputs to the cost model.

The NOT model may be used to describe computation on parallel computers, especially those with complex internal structures such as NUMA clumps of shared-memory multiprocessors. However, it is also useful as a way of planning, describing, implementing, and scheduling computations that are not appropriate for execution on a single target system. This may happen because they are too resource-intensive for an available system; because the use of a single system is not cost-effective; or because different parts of the computation are able to exploit different hardware or software features that are only available at particular sites. The NOT model plays the role of a coordination language for building and scheduling such complex computations on widely-distributed compute servers.

In Section 2 we discuss the problems created by large-scale computations which may be executed by heterogeneous, geographically-distributed compute servers. In Section 3 we discuss the different styles and levels of adaptivity possible for node programs; in Section 4 we define the NOT model; in Section 5 we describe the implementation strategy for NOT. Section 6 presents some small examples of programs that can be naturally expressed as networks of tasks and illustrates their implementation; Section 7 extends previous work to include pipelining and farming; Section 8 discusses some related work.

2. Coordinating Large-Scale Computations

High-performance applications are increasingly overflowing the limits of a single compute server, as the amount of computation or data they require exceeds what can be provided at a single site. For example, a typical application may use input data of terabyte size. It is attractive to process such data as close to its stored location as possible if the result size is smaller than the input size. Certain phases of a computation may benefit from specialized hardware, such as high-performance visualization systems. Whenever a variety of compute servers are available, some will be more heavily loaded than others, and there are benefits to being able to choose which one to use, or even to change from one to another in mid-computation.

These applications create new research problems, including (a) how to describe computations in ways that show where they may reasonably be separated into independent pieces, (b) how to design programs so that they will perform well when executed in this distributed fashion, (c) how to schedule the tasks that make up an entire computation onto a set of compute servers to achieve the best balance between performance and resource costs, and (d) how to distribute the component tasks to the compute servers that will execute them. All of these are complicated by the fact that, since the entire system is distributed, knowledge about the dynamic properties of compute servers is always stale. These problems are difficult, and are receiving intensive study, e.g. (Foster & Kesselman, 1998).

The NOT model addresses the planning and design aspects of this problem. It can be regarded as a way of coordinating existing modules or building blocks in a way that makes it clear what the effects of different arrangements will be on performance. For example, the NOT model provides a way of gluing existing programs together using the concept of a residual (Skillicorn & Pelagatti, 2000). Given an existing program and a specification to be satisfied, it is possible to derive the specification of the `missing' piece. This process can be repeated with each of the existing programs until the remaining `missing' piece is empty; now the entire program satisfies the specification and uses all the existing programs. Alternatively, the NOT model can be used to derive a complete, potentially-distributed program from a given specification by refinement, in the style of Morgan's Refinement Calculus (Morgan, 1990).

The NOT model does more than just help to design programs, however. It also allows their cost to be discussed. This is because the model includes an evaluation strategy that is designed to make programs' costs transparent. Of course, in physically-distributed systems, there are always uncertainties about what is happening at remote locations; but within these limits, it is possible to predict the effective cost of a NOT program.

Programs designed to execute on remote compute servers will complete their execution earlier and more cheaply if they are able to adapt to the specific local properties of the server at the time they are being executed. For example, the server may be able to start executing a program earlier if it can be altered to require fewer processors. The issue of adaptive behavior is thus naturally associated with that of scheduling large distributed computations. We consider this aspect in the next section.

3. Adaptivity

An increasing number of parallel programming models and languages are addressing the issue of adaptivity: the decision about how many resources to allocate to a program can be postponed until after the program has been written, and may perhaps even be altered during the program's execution. Making such a property possible increases a program's portability, and allows it to execute effectively in environments that are more variable than conventionally considered. For example, such programs can continue to execute in the presence of transient faults in the target system without any special programming effort.

It is not necessary for an adaptive programming model to have an associated cost model, but in practice they often do. One reason for adaptivity is to use a complex resource well, and it is convenient to be able to predict what the effect on, for example, running time of reducing the number of available processors will be. Thus cost modelling and adaptable programming models often exist together.

The cost of a parallel program depends on the operations it executes and the communication actions it performs (and increasingly on its memory accesses as well). All parallel programs are adaptive based on their operation count, for Brent's theorem (Brent, 1974) allows fewer processors to be used, with a corresponding increase in execution time, in a simple reordering that is independent of semantics. One of the contributions of the BSP cost model (Skillicorn et al, 1997; Donaldson et al, 1998) has been to show that a similar tradeoff exists in the communication behavior of parallel programs.

BSP models communication globally, rather than on a message-by-message basis. It makes two assumptions: that message traffic is dense, which makes the effects of congestion predictable; and that the bottleneck for communication is at the boundary of the network. The first assumption is surprisingly realistic for many programs, but it is ensured for the NOT model by the choice of execution strategy. The second assumption has been shown to hold by experiment: transmission times are small compared to protocol overheads, even for recent network interface card protocols, and most delays are in fact encountered when messages arrive simultaneously at the same destination.


BSP models each architecture's communication performance using a single parameter, its permeability, g (in units of time per word transmitted). If each processor transmits h_i words during a communication phase, the total time for the phase is close to max_i h_i · g. The model has been investigated experimentally and is accurate to within a few percent across a range of architectures and applications (Hill et al, 1996; Reed et al, 1996; Goudreau et al, 1996; Hill et al, 1997).

Suppose that a program executing on p processors executes a total exchange in which it sends data of size x to each other processor. The BSP communication cost of this operation is pxg. If this program is now emulated on a smaller number of processors, say p/2, then each new processor must send 2px data, since each processor holds twice as much data as before. Thus the product of communication cost and processors remains constant.

A node program contains internal parallelism and provides a cost formula, parameterized by the number of processors on which it executes. This cost formula captures its computation cost and its internal communication cost. Each node program also consumes inputs and produces results which must be communicated to other node programs: external communication. The node program abstract machine may define transformations and their effects on costs (for example, the Bird-Meertens formalism (Bird, 1989), or P3L (Danelutto et al, 1997)). If it does not, Brent's theorem can be used to trade processors for time. The BSP communication cost model can similarly be used to trade processors for time for external communication. The NOT model execution strategy relies on transforming nodes so that those in the same `layer' of the graph complete at approximately the same time by allocating an appropriate number of processors to each.

There is another way in which node programs may be adaptive. Some languages, particularly skeleton-like ones, provide multiple different implementations of a given functionality. Often these different implementations are best considered as different algorithms, rather than different arrangements of the same algorithm. Deciding which of these distinct implementations is preferable may require understanding the context in which it will execute; for example, the way an upstream operation leaves its results distributed across processors may determine which implementation of a downstream operation is cheapest. This use of adaptivity must take place within the context of an entire NOT program; the execution strategy makes decisions about node programs using different implementations before it makes final processor allocations.
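To make the processor-time tradeoff concrete, the following sketch computes the BSP cost of a total exchange and shows that emulating the program on half the processors preserves the processor-time product. It is a minimal illustration only; the helper names and the parameter values are assumptions, not part of the NOT implementation.

    # Minimal sketch (helper names assumed, not from the paper): BSP cost of a
    # total exchange in which each of p processors sends x words to every other
    # processor, using permeability g and synchronization cost l.
    def total_exchange_cost(p, x, g, l):
        h = p * x                  # words leaving each processor (approximately)
        return h * g + l

    # Emulation on p/2 processors: each new processor holds the data of two old
    # ones, so roughly 2*p*x words leave each of them.
    def emulated_cost(p, x, g, l):
        return 2 * p * x * g + l

    if __name__ == "__main__":
        p, x, g, l = 32, 1000, 0.04, 26.8            # T3E-like parameters
        full = p * total_exchange_cost(p, x, g, l)   # processors times elapsed cost
        half = (p // 2) * emulated_cost(p, x, g, l)
        print(full, half)   # equal up to the small l terms: work is preserved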


4. The Network of Tasks Model

A NOT program is a static acyclic directed graph of blocks, which may be individual node programs or encapsulated pieces of graph, in both cases with defined input and output edges. Each edge has a name associated with it. The graph structure may be described visually, or by using a single-assignment syntax based on these edge names. There is a single loop structure of this form:

    first  block1
    while  predicate on edge values
           block2

The block block2 has the same names for its input and output edges, which therefore form cycles. No state passes from one block to another, except by explicit communication.

Nodes are connected by single edges which encapsulate bundles of data passed from the processors executing the upstream block to the processors executing the downstream block. In other words, a single NOT edge denotes a potentially-large set of (one-way) communications, with both parallel and stream structure. A NOT edge is of one of the following three types, which describe when communication takes place:

- File of: The whole stream of values to traverse an edge must have been computed before any transfer takes place. This is the basic edge type.
- Block of: The data stream is pipelined in blocks, determined either by the compiler or by the programmer.
- Bag of: The ordering of the data in the stream is not significant, and it may be generated and consumed in any order (but is available in a pipelined fashion).

The NOT model was originally described in (Skillicorn, 1999). The NOT model does not really address programs whose behavior is dynamic, since the task graph structure itself is static, except for the loop construct. There is always a difficult trade-off between dynamic behavior and cost modelling, since complex decisions at run-time are hard to model accurately at design time.
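Purely as an illustration of these definitions, a NOT graph could be represented as named blocks connected by typed edges, as in the following sketch. The class and field names are assumptions made for this example; they are not the paper's syntax.

    from dataclasses import dataclass, field
    from enum import Enum

    class EdgeType(Enum):      # the three NOT edge types
        FILE_OF = "file of"    # transfer only after the whole stream is computed
        BLOCK_OF = "block of"  # stream pipelined in blocks
        BAG_OF = "bag of"      # unordered, pipelined stream

    @dataclass
    class Edge:
        name: str              # single-assignment edge name
        kind: EdgeType
        producer: str          # name of the upstream block
        consumer: str          # name of the downstream block

    @dataclass
    class Block:
        name: str
        inputs: list[str] = field(default_factory=list)   # edge names
        outputs: list[str] = field(default_factory=list)
        cost: object = None    # cost formula, parameterized by processor count

    # A NOT program is then a static DAG of blocks plus its typed edges; the
    # single first/while loop would wrap a sub-graph whose block2 has identical
    # input and output edge names.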


5. Implementation

The implementation of the NOT model is motivated by the need for a useful cost model as well as by concerns with portability and effectiveness. It is done in two phases. In the first, appropriate choices of implementation for each node program are made. In the second, processors are allocated to each node program using a technique called work-based allocation. We assume throughout that each node program has been written to express the maximal available (virtual) parallelism.

5.1. Global Optimization of Node Implementations

When different implementations of each node are possible, finding the choices which optimize both total computation and total communication is difficult. When the variable associated with an edge of the task graph is non-scalar, it is computed by code that is distributed over the set of processors executing its producer (upstream) node, and destined for the set of processors executing its consumer (downstream) node. Each graph edge is a bundle of edges from a set of processors on which the producer node is executing to a set of processors on which the consumer node is executing. When the programming language in which the node program is written is based on a model that allows transformations between semantically-equivalent textual forms, these different forms may have different distributions for variables. The best form for a node program can only be computed by including the cost of communication, including redistributing input and output data structures.

This problem can be solved using a shortest-path algorithm for searching alternate implementation structures. The paper (Skillicorn et al, 1998), which shows how to solve the problem for sequential compositions of nodes, can be straightforwardly extended to graphs. The graph version of the algorithm works, layer by layer, through the graph. The algorithm assumes that the total number of processors to be used is known; a sensible strategy is to assume the largest number required by any part of the graph, since it is always possible to use fewer processors. Suppose for simplicity that every one of the n layers of the graph consists of w tasks, and that there are a alternative implementations of each task. Then there are a^w possible configurations of the first layer. Each of these configurations can be extended to the next layer by choosing one of the a^w possible configurations for the next layer. Only a^w of the composed configurations need to be considered for further extension; the rest are now known to be more expensive. All a^{2w} possibilities must be checked at each stage, however, so the total work required is a^{2w}·n. Although this is exponential, it is quite practical since w and n are small relative to the total computation performed, and a is likely to be small in practice.

Task graphs whose nodes are adaptive in this strong sense of possessing multiple implementations for nodes have now been transformed to a form that is optimal with regard to implementation choice. It now remains to transform the task graph as a whole to best use the available processors. The way this is done depends on the relationship between the available parallelism of the task graph and the number of available processors on the target.

5.2. Allocating Processors

Consider the sets of nodes in the graph that are disjoint in the sense that there is no path connecting any pair of them. Such sets represent nodes that might be simultaneously active in some execution of the graph. The graph width is the cardinality of the largest such set. The total width is the total number of processors required by the node programs from such a set. If all of the node programs are regarded as identical, then the graph can be arranged in layers by placing each node as early (or late) as possible. The length of a task graph is the number of layers (that is, the number of nodes) along the longest path from source to sink. Roughly speaking, there are three situations:

- If p < graph width, then there is little point in trying to exploit intra-node parallelism, and not even the parallelism of the task graph itself can be fully exploited.
- If graph width < p < total width, then the parallelism of the task graph can be fully exploited, and there may be some opportunity to exploit intra-node parallelism as well. In this case, work-based allocation is used as the implementation technique.
- If total width × length < p, then it becomes sensible to use pipelining to execute all nodes of the graph simultaneously. It may also be possible to use farming to increase the effective parallelism of single nodes in the graph, or indeed of the entire graph.

It is not the total width that is important so much as the effective total width, which is usually smaller and depends on how much slack there is to move computation forward or backward in the graph, effectively reducing the parallelism needed at particular times. Of course, these are general guidelines. There are programs in which a single node forms a bottleneck for which a farm may be useful, and programs with a path on which much of the computation lies for which pipelining may be useful.

When the available p is small, the overhead of communication dominates. The best strategy is probably to execute each node sequentially, and partition the task graph into regions with minimal cutsets. This can be done effectively in time linear in the number of nodes, which is reasonable in practice.

For larger target architectures, both the parallelism within the task graph nodes and the parallelism between graph nodes can be exploited. For this, work-based allocation, which exploits inter-node parallelism first, then intra-node parallelism, while maintaining predictable performance, is used. It works in two phases: first each node is treated as identical, and its earliest and latest layer in the graph determined. Nodes whose earliest and latest layers are different have their work divided so that they execute an equal fraction of it in each layer. In the second phase, the slowest node program in each layer is found. Its elapsed time cannot be improved by allocating more processors to it than it claims to use, since we assumed that every node program is expressed with maximal virtual parallelism. Accordingly, all of the other nodes in its layer are slowed so that they will complete at (approximately) the same time by giving them fewer processors than their virtual parallelism could exploit. The net effect is that the cost of the executing NOT program (the product of processors used and elapsed time) preserves the sum of the work of the individual nodes, and the communication cost between each pair of nodes in adjacent layers is the same. The NOT graph therefore executes in a layered, lock-step way. A detailed description of work-based allocation can be found in (Skillicorn, 1999).
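The layer-by-layer search of Section 5.1 can be phrased as a small dynamic program. The sketch below is an illustration under simplifying assumptions (every layer has w tasks, and the caller supplies a cost function that includes redistribution); the function and parameter names are not taken from the paper.

    import itertools

    def choose_implementations(layers, step_cost):
        """Select one implementation per task so that the total cost is minimal.

        layers    : list of layers; each layer is a list of tasks; each task is a
                    list of alternative implementations (a layer has a^w
                    configurations).
        step_cost : step_cost(prev_cfg, cfg) -> cost of running configuration cfg
                    after the previous layer ran prev_cfg (None for the first
                    layer); this is where data-redistribution costs enter."""
        # best[cfg] = (cheapest cost of any path ending in cfg, chosen configs)
        best = {cfg: (step_cost(None, cfg), [cfg])
                for cfg in itertools.product(*layers[0])}
        for layer in layers[1:]:
            nxt = {}
            for cfg in itertools.product(*layer):            # a^w candidates
                cost, path = min(((prev_cost + step_cost(prev_cfg, cfg), path)
                                  for prev_cfg, (prev_cost, path) in best.items()),
                                 key=lambda e: e[0])
                nxt[cfg] = (cost, path + [cfg])               # keep a^w of the a^2w
            best = nxt
        return min(best.values(), key=lambda e: e[0])         # (total cost, configs)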
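Similarly, the second phase of work-based allocation (slowing every node in a layer to match its slowest member) can be sketched as follows. This is an interpretation of the description above, assuming each node supplies a time(processors) cost function and a maximum virtual parallelism; it is not code from the paper, and for brevity it ignores the global constraint that the counts must sum to at most p (the paper scales allocations down in that case).

    def allocate_layer(nodes, p):
        """Give processors to the nodes of one layer so they finish together.

        nodes : list of (time_fn, max_width) pairs; time_fn(q) is the node's
                predicted elapsed time on q processors, max_width its maximal
                virtual parallelism.
        p     : processors available to the whole layer."""
        widths = [min(p, w) for _, w in nodes]
        # The slowest node, run at its full virtual parallelism, sets the pace.
        target = max(t(q) for (t, _), q in zip(nodes, widths))
        alloc = []
        for (t, _), q in zip(nodes, widths):
            # Slow the other nodes down: shrink their allocation while they
            # still finish within the target time.
            while q > 1 and t(q - 1) <= target:
                q -= 1
            alloc.append(q)
        return alloc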

6. Examples

In this section we consider some typical applications coordinating several parallel tasks and discuss how they can be specified and implemented using NOT.

6.1. Multi-baseline stereo

In this section, we consider a multi-baseline stereo application proposed in (Dinda et al, 1994). The corresponding task graph is shown in Fig. 1. Input consists of a sequence of three n × q images acquired from three cameras. One image is called the reference image and the other two are called match images. Task C is responsible for getting the three images from the cameras and broadcasting them to tasks D_0, ..., D_15. Each D_i shifts the first match image by i pixels and the second by 2i pixels. Then, it computes a difference image, which is given by the sum of the squared differences between the corresponding pixels of the reference image and the shifted match images. Then D_i passes the difference image to E_i, which forms an error image by replacing each pixel in the difference image by the sum of the pixels in a surrounding 13 × 13 window.

Figure 1. Task graph for the multi-baseline stereo application: C feeds D_0, ..., D_15; each D_i feeds E_i; the E_i feed R, which feeds D. All edges are block of.

Task R combines all the error images to form a disparity image. In the disparity image each pixel is set equal to the disparity which minimizes the error. Finally, task D displays each pixel in the disparity image as a simple function of its value.

Now, supposing that all tasks are implemented using BSP, we can give them quite intuitive costs, as shown in Table I. All parameters, except the number of processors, p, are assumed to be known before reasoning about the implementation. m = n × q is the number of pixels in each image.

Table I. BSP costs for multi-baseline stereo tasks.

  Task  Cost function                                    Condition  Version
  C     3·t_r·m + t_bcs1(3·m, p)                         p ≤ 16     bcast1
        3·t_r·m + t_bcs2(3·m, p)                         p ≤ 16     bcast2
        3·m·t_r + t_bcs1(3·m, 16) + t_sct(3·m, p/16)     p > 16     bcast1+scatt
        3·m·t_r + t_bcs2(3·m, 16) + t_sct(3·m, p/16)     p > 16     bcast2+scatt
        3·m·(t_r + 16·(1 - 1/p)·g) + l                   p > 16     mcast
  D_i   2·t_shift + t_distance·(m/p)                     m > p
        2·t_shift + t_distance                           m ≤ p
  E_i   (√(m/p) + 13)·13·4·g + l + t_sum·(m/p)           m > p
        13·13·4·g + t_sum                                m ≤ p
  R     t_disp·(m/p) + 16·(m/p)·g + l                    m > p
        t_disp + 16·g + l                                m ≤ p
  D     t_display·m + m·g + l

C accepts three images from the three cameras, costing 3·t_r·m, and distributes them to the D_i's. If each D_i is not internally parallel, this corresponds to broadcasting each image. There are several different ways to implement broadcasting; their relative efficiency depends on the relationship between the cost of data transmission and synchronization, that is on the values of g and l. Their cost, along with the cost of other collective communications in BSP, is shown in Table II.

Table II. BSP costs for collective communications. n is the size of data, p the number of processors, f the maximum amount of data reaching/leaving a node.

  Collective communication   Cost function
  bcast1                     t_bcs1(n, p) = n·(p - 1)·g + l
  bcast2                     t_bcs2(n, p) = 2·n·(1 - 1/p)·g + 2·l
  scatt                      t_sct(n, p) = n·(1 - 1/p)·g + l
  total-exchange             t_te(f) = f·g + l

The first technique (bcast1) broadcasts n data on p processors by just sending p - 1 messages of size n. The second technique (bcast2) splits the data into p equally-sized pieces, and sends one to each of the p processors. Then it executes a total exchange of each piece, which gathers all of the data onto all of the processors. A total exchange costs f·g + l, where f is the maximum size of data reaching/leaving a node. When the D_i's are sequential, we consider two versions of C using these two different techniques for broadcasting.

If the D_i's are executed in parallel, we assume that each one is assigned the same number of processors (p/16). The distribution needs to broadcast the three images to the 16 groups and scatter them among each group, a communication pattern usually known as multicast. There are three choices to implement C in this second case. We can broadcast the images to one processor in each group and then scatter the images in parallel on all groups of p/16 processors. Scattering n data on p processors costs n·(1 - 1/p)·g + l (Table II). The broadcast step can be implemented using bcast1 or bcast2, yielding the two versions in the table. Alternatively, we can choose to send the final data directly to each node (mcast version), which has cost 3·16·(m - m/p)·g + l. For now, we keep all these versions alive. We will decide which one is best during the optimization phase.

Tasks D_i and E_i have uniform costs. D_i spends 2·t_shift + t_distance to compute the distance of each pixel. If D_i is executed on p processors, each processor computes m/p pixels in parallel. When each E_i is executed on p processors it gathers the window pixels resident on different processors ((√(m/p) + 13)·13·4·g + l) and computes the sums in time t_sum·(m/p). R reduces the elements of all error images by computing their disparity, t_disp·(m/p), after having received the relevant blocks of the 16 images, 16·(m/p)·g + l. Finally, D is a sequential task that gathers the disparity matrix and displays it, with a cost of t_display·m + m·g + l.

6.1.1. Optimization

We consider three parallel targets: an SGI Origin, a Cray T3E, and a network of workstations using Pentium II processors interconnected by a 100 Mbps switched Ethernet segment. Values for s, g and l have been derived in (Skillicorn et al, 1997) and are summarized in Table III. Here, g and l have been normalized to s, the average number of operations per second.

Table III. BSP parameters for the three target machines, from (Skillicorn et al, 1997). g and l have been normalized to s = 1.

                     SGI Origin   Pentium cluster   Cray T3E
  g (cycles/byte)    0.45         33.5              0.04
  l (cycles)         158.5        5654              26.8
  s (Mflops/s)       101          88                47

C is the only task for which an implementation needs to be selected in this phase. The choice between different implementations can be made using the linear search algorithm discussed in (Skillicorn et al, 1998). However, this is a particularly simple case, as it reduces to choosing which of the five implementations is faster for a given number of processors, p. The behavior of the different versions on a T3E is shown in Fig. 2 for images of m = 100 × 100 pixels. Clearly bcast2 and bcast2+scatt outperform the others in all cases. Similar results can be obtained for the cluster and SGI platforms. This is always true for images greater than 1000 pixels, which is the range of interest of our application.

Figure 2. Behavior of the five implementations of C on a Cray T3E (completion time against number of processors), m = 10000.

6.1.2. Allocation

We are now ready to allocate our task graph. We fix m = 10,000. C and D have constant costs and can only be executed sequentially on a single processor. The width of the other tasks is as follows: the D_i's can be executed on at most 32 processors, the E_i's on 26 and R on 20. The graph width is 16, so we first consider the allocation on p ≤ 16 processors. In this case, we do not have enough processors even to exploit the parallelism among different tasks. If we decide to partition the task graph among different processors and execute all tasks sequentially, we can use one of the algorithms for DAG scheduling with communications (see for instance (Papadimitriou & Yannakakis, 1990)). A possible scheduling, where R has been executed on p processors, is shown in Fig. 3.

Figure 3. Allocation of the multi-baseline stereo application on p = 16 processors (processors 0 to 15 against time: C first, then the D_i's, the E_i's, R and D).

When 16 < p ≤ total width = 512, we can use work-based allocation. We first compute the levels: C has level 1, the D_i's level 2, the E_i's level 3, R level 4 and D level 5. Then we stretch all the tasks in each level so that they have the same height. This is not difficult here, as tasks belonging to the same level are identical and have the same cost, so we can evenly distribute the available processors among them. This means first stretching the D_i's to a width of p/16 and each E_i to a width of min(p/16, 26). Finally, R gets a width of min(p/16, 20). The allocation proceeds one level at a time, to obtain a layout as shown in Fig. 3. This time, each task has been assigned more than one processor, according to its width.
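To illustrate the selection step of Section 6.1.1, the sketch below evaluates the five cost formulas for task C (Tables I and II) with the Cray T3E parameters of Table III and picks the cheapest version for a given processor count. It is an illustration only: the value of t_r is an assumption, and the helper names are not from the paper.

    # BSP collective costs from Table II.
    def t_bcs1(n, p, g, l): return n * (p - 1) * g + l
    def t_bcs2(n, p, g, l): return 2 * n * (1 - 1 / p) * g + 2 * l
    def t_sct(n, p, g, l):  return n * (1 - 1 / p) * g + l

    def cost_C(version, p, m, t_r, g, l):
        """Cost of task C for each of the five versions in Table I."""
        read = 3 * t_r * m
        if version == "bcast1":
            return read + t_bcs1(3 * m, p, g, l)
        if version == "bcast2":
            return read + t_bcs2(3 * m, p, g, l)
        if version == "bcast1+scatt":
            return read + t_bcs1(3 * m, 16, g, l) + t_sct(3 * m, p / 16, g, l)
        if version == "bcast2+scatt":
            return read + t_bcs2(3 * m, 16, g, l) + t_sct(3 * m, p / 16, g, l)
        if version == "mcast":
            return 3 * m * (t_r + 16 * (1 - 1 / p) * g) + l
        raise ValueError(version)

    def best_version(p, m=10_000, t_r=1.0, g=0.04, l=26.8):  # T3E g, l; t_r assumed
        versions = (["bcast1", "bcast2"] if p <= 16
                    else ["bcast1+scatt", "bcast2+scatt", "mcast"])
        return min(versions, key=lambda v: cost_C(v, p, m, t_r, g, l))

    # best_version(32) picks bcast2+scatt, matching the trend shown in Fig. 2.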


6.2. Iterated Runge-Kutta methods

In this section, we discuss the parallelization of iterated Runge-Kutta (RK) methods, a family of solution methods for ordinary differential equations. The solutions evaluated in this section have been proposed and formalized for a different coordination model in (Rauber & Runger, 1999). We will see how the different schemas can be expressed in NOT and how cost prediction allows correct discrimination among the different alternatives.

Iterated Runge-Kutta methods work on ordinary differential equations of the form dy(x)/dx = f(x, y(x)), with y(x_0) = y_0, x_0 ≤ x ≤ x_e. y : R → R^n is the unknown function and f : R × R^n → R^n is an application-specific function, which can be nonlinear. y_0 ∈ R^n specifies the initial condition at x_0. An RK method computes a sequence of approximation vectors y_k for y(x_k) at discrete intervals x_k = x_{k-1} + h with step h. There is a large variety of RK methods. We focus on s-stage iterated RK methods working as follows. In each time step k, the method performs a fixed number, m, of iterations to compute s stage vectors v_1, ..., v_s. Stage vectors computed at iteration i are used to compute stage vectors for iteration i + 1. Stage vectors computed at iteration m are used to compute the approximation vector y_{k+1}. At iteration j, the computation of stage vectors can be executed in parallel by s independent processes, provided that they know all the stage vectors computed in the previous iteration (the v_{j-1}'s).

We concentrate on the two NOT graphs of Fig. 4 and Fig. 5. In both pictures, the solid-line DAG is iterated until the approximation point x falls outside the interval (x ≤ x_e). Node I computes the initial stage vectors v_0^1, ..., v_0^s. At each iteration, the computation of stage vectors can be executed in parallel, exploiting task parallelism, by s independent processes SV_1, ..., SV_s (Fig. 4), or can be computed executing each SV_i in sequence (Fig. 5). All SV_i are identical solvers working on different data. After the computation of v_m^1, ..., v_m^s, the graphs in the two solutions are identical. Node CA uses the stage vectors v_m^1, ..., v_m^s to compute the new approximation vector y_{k+1}. Finally, SC checks the approximation vector and fixes a new value for x.

Figure 4. Task graph for parallel iterated Runge-Kutta methods exploiting task parallelism between stage vector computations (TKg): inside the while loop, I feeds SV_1, ..., SV_s, which iterate m times before feeding CA and SC. Edges are file of.

Figure 5. Task graph for iterated Runge-Kutta methods exploiting data parallelism only (DPg): inside the while loop, a single chain of SV nodes iterates m × s times before CA and SC. Edges are file of.

The costs for the two implementations are summarized in Table IV.

Table IV. BSP costs for iterated RK methods. Notice that s represents the number of SV stages.

  Task  Cost function TKg                               Cost function DPg
  I     n·s·t_in + t_bcs2(n·s, s) + t_sct(n·s, p/s)     n·s·t_in + t_sct(n·s, p)
  SV_i  t_f(n/p) + t_te(n·s^2/p)                        t_f(n/p)
  CA    t_ca·s·n/p + t_te(n·s/p)                        t_ca·s·n/p
  SC    t_sc·(n/p)                                      t_sc·(n/p)

Node I is a sequential node and computes the initial stage vectors (the v_0's). Initialization costs a certain initialization time t_in for each element of the s n-sized vectors. Then, in TKg, the v_0's are broadcast to SV_1, ..., SV_s. We use the two-stage broadcast (bcast2) introduced in Section 6.1 to distribute n·s data on s processors. Then, assuming that the s solvers SV_i are all allocated on the same number of processors, p/s, the stage vectors are scattered across each solver in parallel. In DPg, we only need to scatter n·s data on p processors, without broadcasting.

Consider now the SV_i's. If each SV_i in TKg is executed on p/s processors, its cost consists of the computation of the new stage vector plus the collection of the new vectors from the other SV processors to get ready for the next stage. The cost of evaluating f is modeled by the function t_f and depends on the characteristics of the system to be solved (e.g. sparse or dense). We will consider linear and quadratic t_f functions to model the behavior of the RK schema under different load conditions. After the v_j's have been computed, a total exchange is executed in which each SV_i receives n·s/(p/s) elements. In DPg, if we assume that all the SV_i's are executed in sequence on the same p processors, no data redistribution is required, as stage vectors are produced and consumed using the same distribution.

In TKg, CA is executed on p processors. It gathers s vectors of length n (a total exchange with n·s/p incoming data) and computes the new approximation vector of size n. Again, communication is not needed in DPg as CA consumes only local data. Finally, the cost of SC is the same for both solutions. We assume both CA and SC are allocated on the same processors, so there is no need to redistribute data and the cost of SC is the cost of computing the new step value.

With work-based allocation, we can allocate the two task graphs and compare the cost of their execution. Assuming all tasks but I have a width larger than 32, the behavior of the two implementations is summarized in Fig. 6. The graphs describe the predicted runtime cost scaled to 10,000·s, the unitary BSP operation cost, on a 32-processor Cray T3E. The plot on the left is for a linear evaluation function t_f, while the one on the right is for a quadratic t_f. Both curves clearly show the performance gain of the data parallel solution. This makes perfect sense as: (1) the SV tasks are able to scale up to 32 processors, removing the advantage of exploiting task parallelism among them; and (2) the task parallel execution needs a large amount of communication due to data redistribution among levels. The qualitative shape of the curves depicted here is very close to the ones presented in (Rauber & Runger, 1999) for the same two versions of the iterated RK method. In that case, the curves represented the runtime of actual MPI implementations on a 32-node SP2. As happens quite often with numerical applications, the MPI code was probably largely synchronous, making the two versions show similar behavior. The important thing to note is that NOT costs have pointed clearly to the better design choice before writing a single line of code.

Figure 6. Runtimes in 10,000·s Mflops/s on a Cray T3E for the task parallel (TKg) and data parallel (DPg) Runge-Kutta implementations, plotted against system size. We consider sparse systems (left), with linear t_f costs, and dense systems (right), with quadratic costs.
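The comparison behind Fig. 6 can be reproduced from Table IV alone. The sketch below is illustrative: only the formulas come from Tables II and IV, while the values chosen for t_in, t_ca, t_f, the number of stages and the system size are assumptions.

    def t_bcs2(n, p, g, l): return 2 * n * (1 - 1 / p) * g + 2 * l
    def t_sct(n, p, g, l):  return n * (1 - 1 / p) * g + l
    def t_te(f, g, l):      return f * g + l

    # Per-task costs transcribed from Table IV; t_in, t_ca and t_f are supplied
    # by the caller because their values are application specific.
    def cost_I(version, n, s, p, g, l, t_in):
        if version == "TKg":
            return n * s * t_in + t_bcs2(n * s, s, g, l) + t_sct(n * s, p / s, g, l)
        return n * s * t_in + t_sct(n * s, p, g, l)

    def cost_SV(version, n, s, p, g, l, t_f):
        base = t_f(n / p)
        return base + t_te(n * s**2 / p, g, l) if version == "TKg" else base

    def cost_CA(version, n, s, p, g, l, t_ca):
        base = t_ca * s * n / p
        return base + t_te(n * s / p, g, l) if version == "TKg" else base

    if __name__ == "__main__":
        g, l = 0.04, 26.8                 # Cray T3E parameters (Table III)
        n, s, p = 10_000, 4, 32           # assumed system size, stages, processors
        t_f = lambda x: 10 * x            # assumed linear evaluation cost
        extra = sum(cost("TKg", n, s, p, g, l, arg) - cost("DPg", n, s, p, g, l, arg)
                    for cost, arg in ((cost_I, 1.0), (cost_SV, t_f), (cost_CA, 1.0)))
        # The difference consists only of broadcast and redistribution terms, the
        # overhead that makes the data-parallel graph (DPg) preferable in Fig. 6.
        print(extra)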


7. Pipelining and Farming

Pipelining can improve the performance of a task graph if there are more processors available than can be used by the effective width of the graph. In effect, pipelining allows the depth of the task graph to be used as well. There are two requirements, however, for pipelining to be practical:

1. The modules at the nodes of the task graph must have semantics that allow them to work on blocks of data at a time. Roughly speaking, this means that modules must have an outer loop that can be unrolled into blocks of different sizes. For several of the models discussed, implementations of this kind are possible.

2. There must be enough processors available to allow the entire task graph to be pipelined, since pipelining the first few stages and then reverting to nonpipelined execution for the remainder does not provide any overall performance improvement.

Given the semantics above, it is reasonable to expect that the cost of a node module becomes

    pipelined cost = cost × (block size / file size)

that is, each block costs the module's full cost divided by the number of blocks (a small illustrative sketch of this accounting appears at the end of this passage). Work-based allocation can still be used to schedule processors for pipelined execution. Work-based allocation has the property that each `layer' of the task graph finishes at approximately the same time, by construction. If the same allocation of processors is made, this means that the pipelined versions of each layer will have the same behavior as the nonpipelined versions. In other words, the modules in each layer will produce their next block of outputs at about the same time.

It is less clear that communication cost modelling remains the same. We have ignored any costs due to start-up effects (BSP's l term). This is not unreasonable when file of semantics are used for edges, since the total amount of data moved is large and start-up effects accordingly small. In the pipelined setting, the blocks actually transmitted may be quite small, and start-up effects significant. The cost of these extra effects need not affect the allocation used, since every message presumably pays the same penalty on start-up. They will, however, add to the overall cost of the program, which will see a cost of (number of blocks × start-up penalty) between each pair of layers.

In a sequential composition of modules, farming can be used to balance a pipeline by introducing parallelism into the slowest module.
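A minimal sketch of the per-block accounting just described, including the start-up term; the function name and the way the penalty is charged per pair of adjacent layers are assumptions made for illustration.

    def pipelined_cost(node_cost, file_size, block_size, startup_penalty, layer_pairs):
        """Estimate the cost of streaming a file through a pipelined graph.

        node_cost       : cost of a module on the whole file (file-of semantics)
        block_size      : size of each pipelined block
        startup_penalty : per-message start-up cost (BSP's l term)
        layer_pairs     : number of adjacent layer pairs the blocks cross."""
        num_blocks = file_size / block_size
        per_block = node_cost * (block_size / file_size)   # cost of one block
        startup = num_blocks * startup_penalty             # extra cost per layer pair
        return num_blocks * per_block + layer_pairs * startup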


In our task graph setting, we can instead vary the performance of a module by increasing or decreasing the number of physical processors allocated to execute it. However, there is still one situation in which farming can be useful, and that is to increase the virtual parallelism of a module. Using a farm allows us to get around the limitation of Brent's theorem in a situation where we would like to allocate more physical processors to a module than its virtual parallelism allows, in order to shorten the execution time of the layer in which it appears. Once again, there are two necessary conditions: the semantics of the module must allow its input and output edges to have bag of semantics; and there must be extra physical processors that could not otherwise be effectively used.

There is one other situation in which farming is commonly used: when the cost of processing a block of input varies significantly with the values of that input. At the level of cost modelling we have been assuming, such cases are not easy to extract, since we have assumed cost expressions that are upper bounds, determinable at compile-time. This use of farming can be exploited in systems that use exact cost expressions and minimize over every element of a data stream, but cannot naturally be handled by the sorts of cost model we have been advocating. There is one exception: it is possible that when a data stream is unpacked from file into blocks, the terms of the nested cost expression may have different sizes. A farmed implementation would enable a maximum (for an upper bound) to be replaced by an average if the semantics of the module allowed it.

7.1. Earlier examples revisited

We can use pipelining and farming to speed up the examples discussed in Section 6. Consider the multi-baseline stereo application. If the number of processors available, p, is greater than the total graph width, we can start to pipeline the stream of images through the tasks, exploiting parallelism among different task levels. In particular, if p is large enough to accommodate both groups of processes in Fig. 7, the average cost of the computation of a triple of images becomes

    max{ T_C, T_D, T_R, max_i T_{D_i}, max_i T_{E_i} }

where T_x is the cost of executing the module node x. It is extremely important that the resources allocated to each level in the graph are balanced, that is, that they achieve similar heights for different graph levels. This is because the cost of computing an image triplet will be dominated by the time taken by the slowest stage, no matter how fast all the others are.


This suggests a simple optimization algorithm for pipelined NOT tasks. We first compute the minimum cost MC_i achievable by each graph level i, assuming an unbounded number of processors. This is easy, as all levels have been adjusted to have similar height. Then we compute MC = max_i MC_i. MC is a lower bound on the cost achievable when pipelining, as it gives a lower bound on the speed of the bottleneck stage. Finally, we adjust the resources devoted to all the stages to achieve a cost as close as possible to MC. At this point the pipeline is completely balanced and we can allocate it to the available processors, scaling down resources as usual.

Notice that pipelined allocation can be advantageous even if the total number of processors does not allow for the simultaneous execution of all processes at the maximum parallelism. For instance, executing groups 1 and 2 in Fig. 7 in an alternate way on p processors leads to a global cost of T_C + T_{D_i}, which can be smaller than the cost obtained by packing all the D_i's on the available processors.

Figure 7. Pipelined allocation of the multi-baseline stereo application in two alternate groups of processors (group 1: C and D; group 2: the D_i's, the E_i's and R).
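The balancing step described before Fig. 7 can be sketched as follows, under the assumption that each graph level exposes a cost function of the processors assigned to it; the names are illustrative, not the paper's.

    def balance_pipeline(levels, total_p):
        """levels: list of (cost_fn, max_width) per graph level, where cost_fn(q)
        is the level's cost on q processors.  Returns processor counts per level
        so that every level runs as close as possible to the bottleneck cost MC."""
        # Minimum cost of each level with (effectively) unbounded processors.
        mc = [cost(w) for cost, w in levels]
        MC = max(mc)                      # lower bound: the bottleneck level
        alloc = []
        for cost, w in levels:
            q = w
            # Give back processors while the level still meets the bound MC.
            while q > 1 and cost(q - 1) <= MC:
                q -= 1
            alloc.append(q)
        # If the counts still exceed total_p, they are scaled down as in the
        # usual work-based allocation; omitted here for brevity.
        return alloc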

Now assume p' is the number of processors needed by the balanced pipeline with cost MC. When p > p' we can use farming to improve costs in two ways: (1) replicating bottleneck stages which already exploit their maximal parallel width; and (2) replicating a whole task graph when costs cannot be decreased any more using (1). Replicating a pipeline stage with cost T k times allows k input data to be handled in parallel. This does not affect the latency of the stage, that is the time taken to produce the results of a given input, which is still T. However, the average service time is reduced to T/k, as k data can be processed in the same time T.

Stage replication can be systematically applied to improve service time according to the following strategy. We first use the minimum costs MC_i to compute MC = MC_k = min_i MC_i. MC is a lower bound on the cost improvements achievable when replicating, and k is the fastest stage. We can reach a cost of MC by replicating each stage i ≠ k a number of times r_i such that

    MC_i / r_i ≤ MC

Consider again the multi-baseline example, assuming bag of semantics for the task graph edges. The solid boxes in Fig. 8 show a pipelined implementation with maximum parallelism in each stage. Just as a specific example, we assume MC_{D_i} = MC_{E_i} = MC, MC_C = 4·MC, MC_D = 3·MC and MC_R = 2·MC. Replicating the three stages C, D and R as shown in the picture, we obtain an average cost of MC using 25 extra processors. At this point replication of a single stage is no longer appealing, as it will lead to unbalanced processor utilization. If still more processors are available we can use them to replicate the whole implementation, as in Figure 8. This could lead, in principle, to an improvement of MC/r where r is the number of replicas. However, in practice, farming also requires some overhead to coordinate the replicas and to arrange for the proper delivery/collection of results, and this will reduce the expected ideal benefits.

Figure 8. Stage farming to achieve cost lower bound MC in multi-baseline stereo (the C, D and R stages are replicated, together with farmed replicas of the whole graph).
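The replication factors described above can be computed directly. The following sketch (helper name assumed) chooses r_i = ceil(MC_i / MC), the smallest count satisfying MC_i / r_i ≤ MC.

    import math

    def replication_factors(stage_costs):
        """stage_costs: minimum cost MC_i of each pipeline stage at its maximal
        parallel width.  Returns how many replicas each stage needs so that its
        average service time matches the fastest stage."""
        MC = min(stage_costs)                       # cost of the fastest stage
        return [math.ceil(mc / MC) for mc in stage_costs]

    # For the example in the text (MC_C = 4*MC, MC_D = 3*MC, MC_R = 2*MC,
    # MC_Di = MC_Ei = MC) this gives replication factors 4, 3 and 2 for C, D
    # and R, and 1 for the D_i and E_i stages.
    replication_factors([4.0, 3.0, 2.0, 1.0, 1.0])   # -> [4, 3, 2, 1, 1]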

8. Related Work

Several other task graph formulations exist. They can be classified into two kinds: those based on extensions of data-parallel languages, and those that start from task graphs with simple nodes and increase the possibilities for nodes.

Three models of the first kind are COLT (Orlando & Perego, 1999), SkIE (Bacci et al, 1999), and Rauber and Runger's Task Graphs (Rauber & Runger, 1999). Each of these models allows stateless nodes written in an imperative language, connected by edges representing variables that are implicitly distributed, if the variables concerned are non-scalar, and pipelined. The set of graph substructures differs slightly: COLT allows pipelines and farms, while SkIE extends this with a loop construct. Task Graphs allow seq and par constructs and nested loops. COLT and SkIE use names to associate the heads and tails of communications, while Task Graphs use position as well. All three models use a global heuristic to optimize the distribution of data and the choice of implementation for each node.

Bal and Haines (Bal & Haines, 1998) give a good survey of models of the second kind. One well-known example is Opus (Chapman et al, 1994). Programs are collections of parallel objects that interact via shared data abstractions which manage the invocation of remote methods, synchronously or asynchronously. Objects are statically allocated, but data are dynamically redistributed at calls. Fx (Subhlok et al, 1994; Subhlok & Yang, 1997) is an extension of HPF in which subsets of data-parallel code can behave as if they were independent processes running on processor groups. This allows a monolithic data-parallel program to be decomposed into a hierarchy of smaller data-parallel programs with a graph structure connecting them. The connections have copy-to and copy-from semantics.

Another direction for coordination languages has weakened the requirement to explicitly express communication. Task graphs require an explicit partitioning of the program into pieces, and the data flow between pieces is marked by communication actions such as messages. In coordination languages such as Linda (Ahuja et al, 1988; Carriero & Gelernter, 1988), the partitioning is still explicit, but communication is not. Program pieces communicate by placing data into and extracting data from a shared region usually called tuple space. This reduces the complexity of program construction since sends and receives do not have to be matched, and timing of data movement is less critical. Proponents of Linda-like coordination argue that this increased abstraction makes it easier to put large programs together. The weakness of this approach, however, is that it becomes hard to predict, even approximately, what the performance of the resulting system will be. This is partly because it is not clear from the program what the eventual flow of data will be, and partly because managing an associative store is inherently expensive. Even the limiting case of poor performance, deadlock, cannot be detected in such systems.

9. Discussion

The NOT model is a coordination language for nodes written in arbitrary parallel languages. When the node programs (a) provide a cost expression in terms of processors allocated, (b) can be transformed to use more or fewer resources, and (c) have multiple semantically-equivalent textual forms, the NOT execution technique preserves the apparent costs of the node programs at the NOT level. NOT can therefore be regarded as an extremely powerful program composition technique or coordination language which is both semantically clean and transparent about performance. Software developers can reliably predict the performance of their programs from the performance of the nodes and the graph structure. This is achieved by using a work-preserving implementation technique, and by building on the known accuracy of the BSP model of communication cost.

References

S. Ahuja, N. Carriero, D. Gelernter, and V. Krishnaswamy. Matching languages and hardware for parallel computation in the Linda machine. IEEE Transactions on Computers, 37(8):921-929, August 1988.

B. Bacci, M. Danelutto, S. Pelagatti, and M. Vanneschi. SkIE: an heterogeneous HPC environment. Parallel Computing, 25(13-14):1827-1852, December 1999.

H.E. Bal and M. Haines. Approaches for integrating task and data parallelism. IEEE Concurrency, 6(3):74-84, July-August 1998.

A. Baratloo, P. Dasgupta, and Z. Kedem. Calypso: A novel software system for fault-tolerant parallel processing on distributed platforms. In Proceedings of the 4th IEEE International Symposium on High Performance Distributed Computing, 1995.

R.S. Bird. Algebraic identities for program calculation. The Computer Journal, 32(2):122-126, February 1989.

R.P. Brent. The parallel evaluation of general arithmetic expressions. Journal of the ACM, 21(2):201-206, April 1974.

N. Carriero and D. Gelernter. Application experience with Linda. In ACM/SIGPLAN Symposium on Parallel Programming, volume 23, pages 173-187, July 1988.

B. Chapman, P. Mehrotra, J. Van Rosendale, and H. Zima. A software architecture for multidisciplinary applications: Integrating task and data parallelism. Technical Report 94-18, NASA Institute for Computer Applications ICASE, March 1994.

M. Danelutto, F. Pasqualetti, and S. Pelagatti. Skeletons for data parallelism in P3L. In C. Lengauer, M. Griebl, and S. Gorlatch, editors, Proceedings of Europar'97, volume 1300 of LNCS, pages 619-628. Springer-Verlag, August 1997.

P. Dinda, T. Gross, D. O'Hallaron, E. Segall, J. Stichnoth, J. Subhlok, J. Webb, and B. Yang. The CMU task parallel program suite. Technical Report CMU-CS-94-131, School of Computer Science, Carnegie Mellon University, March 1994.

S.R. Donaldson, J.M.D. Hill, and D.B. Skillicorn. BSP clusters: High-performance, reliable, and very low cost. Technical Report PRG-TR-5-98, Oxford University Computing Laboratory, 1998.

I. Foster and C. Kesselman. The Globus project: A status report. In IPPS/SPDP'99 Workshop on Heterogeneous Computing, pages 4-18, April 1998.

M. Goudreau, K. Lang, S. Rao, T. Suel, and T. Tsantilas. Towards efficiency and portability: Programming the BSP model. In Proceedings of the 8th Annual Symposium on Parallel Algorithms and Architectures, pages 1-12, June 1996.

J.M.D. Hill, P.I. Crumpton, and D.A. Burgess. The theory, practice, and a tool for BSP performance prediction applied to a CFD application. Technical Report TR4-96, Programming Research Group, Oxford University Computing Laboratory, Wolfson Building, Parks Road, Oxford, England, OX1 3QD, February 1996.

J.M.D. Hill, S.R. Donaldson, and D.B. Skillicorn. Stability of communication performance in practice: from the Cray T3E to networks of workstations. Technical Report PRG-TR-33-97, Oxford University Computing Laboratory, October 1997.

C. Morgan. Programming from Specifications. Prentice-Hall International, 1990.

S. Orlando and R. Perego. COLT_HPF, a run-time support for the high-level coordination of HPF tasks. Concurrency: Practice and Experience, 11(8):407-434, 1999.

C.H. Papadimitriou and M. Yannakakis. Towards an architecture-independent analysis of parallel algorithms. SIAM Journal on Computing, 19(2):322-328, April 1990.

T. Rauber and G. Runger. A coordination language for mixed task and data parallel programs. In Proceedings of the ACM Symposium on Applied Computing SAC'99, Feb/Mar 1999.

J. Reed, K. Parrott, and T. Lanfear. Portability, predictability and performance for parallel computing: BSP in practice. Concurrency: Practice and Experience, 8(10):799-812, December 1996.

D.B. Skillicorn. The Network Of Tasks model. In Proceedings of Parallel and Distributed Computing and Systems 1999, Boston, MA, November 1999. IASTED. To appear.

D.B. Skillicorn, M. Danelutto, S. Pelagatti, and A. Zavanella. Optimising data-parallel programs using the BSP cost model. In Europar'98, pages 698-703, September 1998.

D.B. Skillicorn, J.M.D. Hill, and W.F. McColl. Questions and answers about BSP. Scientific Programming, 6(3):249-274, 1997.

D.B. Skillicorn and S. Pelagatti. Building programs in the network of tasks model. In Proceedings of the ACM Symposium on Applied Computing, April 2000.

J. Subhlok, D. O'Hallaron, and T. Gross. Task parallel programming in Fx. Technical Report CMU-CS-94-112, School of Computer Science, Carnegie Mellon University, January 1994.

J. Subhlok and B. Yang. A new model for integrated nested task and data parallel programming. In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 1-12, Las Vegas, NV, June 1997.
