Overlapping and Shortcutting Techniques in Loosely Synchronous Irregular Problems

Ernesto Gomez (1) and L. Ridgway Scott (2)

1 Department of Computer Science and Texas Center for Advanced Molecular Computation, University of Houston, Texas, USA
2 Department of Mathematics and Texas Center for Advanced Molecular Computation, University of Houston, Texas, USA
Abstract. We show shortcutting and overlapping techniques that, separately and in combination, show promise of speedup for parallel processing of problems with irregular or asymmetric computation. Methodology is developed and demonstrated on an example problem. ...
1 Introduction

We are interested in parallel programs for irregular and asymmetric problems. We have the same program on all processors, but data dependent logic may evaluate differently at different nodes, causing different code to execute, or resulting in different amounts of work by the same code. Such problems present challenges for parallel processing because synchronization delays can be of the order of the computation time.

There are different kinds of irregularity in parallel codes. Data structure irregularity is one common case that has been addressed previously [1]. In other applications, data among parallel tasks can be regular, but access unpredictable [2] or asymmetric [3]. Another large class of problems are very regular in sequential execution, but when executed in parallel exhibit irregularities in time. We propose a technique we call shortcutting to reduce the execution time of such problems. These codes can have parallel efficiency exceeding one in some cases.

Shortcutting gains are typically of the same order as the synchronization delays in the problems to which they may be applied. We propose a protocol to tolerate or hide synchronization delays by overlapping intervals between variable definition and use at each parallel process. Both techniques should be employed together, to prevent losses from synchronization offsetting gains from shortcutting.
2 Irregular Problems and Synchronization

For any non-trivial computation carried out in parallel there is some communication cost added to the actual computational cost. This is at least the cost of moving data between processes. In asymmetric or irregular computation, however, we have an added cost due to time or control asymmetry. Note that this
time is in addition to the communications required to synchronize processes. In the following we will refer to this waiting time as "synchronization time", and include any message passing required to synchronize in the communications time.

Suppose two processes, p1 and p2, are initially synchronized and must communicate after performing some computation. Suppose that p1 executes n_1 instructions and p2 executes n_2 instructions, and that each instruction takes a time t to execute. Then there will be a synchronization cost T_s = t(n_2 - n_1) added to the actual message times. One measure of the irregularity or asymmetry of the problem is given by:

    |n_2 - n_1| / max_i(n_i)                                  (1)

Let T_c be the time it takes for signals to travel between processes (including any signals required for process coordination), and T_n be the computation time, equal to t max_i(n_i). Normally we would want to compute in parallel in situations where T_n >> T_c. In highly skewed, irregular problems, the difference |n_2 - n_1| can be of the same order as max_i(n_i); the synchronization time T_s is therefore much greater than T_c, and the irregularity T_s/T_n can be a large fraction of unity. We want to emphasize that we are not talking about badly load-balanced codes. We have in mind codes that are well balanced at a coarser scale but irregular at a fine scale.
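To make the scale concrete, the following toy calculation (with arbitrary, assumed instruction counts and timing, not measured values) evaluates the synchronization cost and the irregularity measure of Eq. (1) in Python:

```python
# Toy illustration of Eq. (1); the instruction counts and per-instruction
# time below are arbitrary assumptions, not measurements.
t = 1e-9                          # seconds per instruction (assumed)
n1, n2 = 6_000_000, 10_000_000    # instructions executed by p1 and p2 (assumed)

Ts = t * abs(n2 - n1)             # synchronization cost: 4.0e-3 s
Tn = t * max(n1, n2)              # computation time:     1.0e-2 s
irregularity = abs(n2 - n1) / max(n1, n2)   # Eq. (1): 0.4
print(Ts, Tn, irregularity)
```

Here T_s is already a large fraction of T_n even though the two processes differ by well under a factor of two in work.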
3 Shortcutting

In some problems, a parallel execution may give dramatic improvement over its serial counterpart. For example, suppose the computation involves testing various cases and terminates as soon as any one case passes a test. In a parallel execution, as soon as one process finds a "passing" case, it can "shortcut" the other processes by telling them to quit. The parallel algorithm may do less work if it finds the answer on one node before finishing the computation on other nodes. We have so far applied this technique to a minimization problem, but we believe its applicability can be extended to more general cases.

Consider a backtracking algorithm. Conceptually we are performing different computations on the same data, by following first one possible computational branch, then another if the first fails, and so on. Suppose we parallelize this algorithm by going down multiple branches simultaneously. (This is a parallel emulation of a non-deterministic choice algorithm, except that, lacking unlimited parallelism, we may either have to sequence multiple branches on each node or limit the approach to problems with a fixed branching factor.) Suppose further that we have some criterion that tells us when a solution has been reached. Since each branch follows a different computational path, the amount of work to be done along each branch is different.

Let us consider a simple case in which we have P processors and P branches must be explored. Let T_i be the work required on branch i. If all branches are explored one after another, the total amount of work done serially is:

    T_s = Σ_i T_i + O_s                                       (2)

where O_s is the serial overhead. Let T_avg = Avg(T_i). Then we may replace this by:

    T_s = P T_avg + O_s                                       (3)

Assume that not all branches lead to solutions, but that out of P branches there are s < P solutions, and that we have some criterion that tells us when a solution has been reached along a particular branch without comparing to other branches (for example, this could be the case for a search with a fixed goal state where all we care about is the goal, not the path). Then we can halt as soon as a solution has been reached. When this will happen depends on the order in which we take the branches. Assume for the sake of simplicity that there is a single solution, s = 1. Then on the average we would expect to find a solution after trying about half the branches, and we should have an average case sequential time:

    T_avg.case = 0.5 (P T_avg + O_s)                          (4)

(Here we also assume that if we can stop after taking half the branches, we only incur half the overhead. We are also assuming that the algorithm halts on all branches; we can always force this by some artificial criterion like a counter.)

Now consider the parallel case. In this situation, we can halt as soon as any one of our processes has found a solution. We would then have that the parallel work done on P branches is:

    T_p = P (min(T_i) + O_p)                                  (5)

Take the case of small overhead. Then we have:

    T_avg.case = 0.5 P T_avg                                  (6)

and

    T_p = P min(T_i) = P T_min                                (7)

The average case parallel efficiency would be:

    E_p = (0.5 P T_avg) / (P T_min) = T_avg / (2 T_min)       (8)

where we know that T_min < T_avg. If it should happen that T_min < 0.5 T_avg, then we would have a parallel efficiency greater than 1 in the average case. In the worst case for parallel efficiency, we would find the solution on the first sequential trial. Then the total sequential work would be T_min, much less than the parallel work of P T_min. But the parallel task would still complete at the same clock time. In the best case, the sequential algorithm finds the solution on the last branch; this would give us a best case parallel efficiency greater than 2.
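The shortcutting pattern itself is easy to express. The following is a minimal Python sketch (an illustration, not the authors' implementation): threads stand in for parallel processes, each branch does an irregular, randomly chosen amount of work, and the first branch to finish sets an event that tells the others to quit.

```python
# Minimal sketch of shortcutting: the first branch to satisfy its criterion
# (here, simply finishing its work) shortcuts all the others.
import random
import threading
import time

def explore_branch(branch_id, stop_event, result):
    steps = random.randint(10, 100)      # irregular amount of work T_i
    for _ in range(steps):
        if stop_event.is_set():          # shortcut received: abandon this branch
            return
        time.sleep(0.001)                # one unit of work
    result[branch_id] = steps            # criterion met on this branch
    stop_event.set()                     # shortcut everyone else

def parallel_shortcut_search(P=4):
    stop_event, result = threading.Event(), {}
    threads = [threading.Thread(target=explore_branch,
                                args=(i, stop_event, result))
               for i in range(P)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result                        # branch(es) that finished first

if __name__ == "__main__":
    print(parallel_shortcut_search())
```

The elapsed time of a run is roughly T_min (so the total parallel work is about P T_min, as in Eq. (7)) rather than the sum of the T_i.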
4 The Overlap Protocol

The possible gains from shortcutting are of the order of the computation time. As we have seen, this is also the case for synchronization costs in highly irregular problems. It is therefore possible that synchronization delays will cancel out all or most of the potential gains from shortcutting. Synchronization cost is due to the properties of the code, or to random factors in the runtime environment, and therefore cannot in general be eliminated.

A standard way to hide communications cost is to overlap computation with communication. However, this will not benefit processes that finish their execution in less time than others, and in any case the benefits of such overlap are only possible if special hardware allows communications to be executed in parallel. We propose therefore to separate sends and receives and overlap the periods between generation and use of data in different processes. If this period at a producer process overlaps the period between references to data at a consumer process, and the data can be sent during this overlap, then neither process will have to wait for the other. Depending on which process is faster, we generally only need to have either the sender or the receiver wait, but rarely both at once. It is difficult for the programmer to implement such overlap, so we propose here a protocol for overlapping communications that may be implemented automatically at the level of a compiler or preprocessor.

Consider a simple case of point to point communication between sender P1 and receiver P2 for data item X. The timing possibilities are summarized in Table 1.
Table 1. Sender (P1) - receiver (P2) timing

                                          time:  1          2          3
  case 1: sender is slightly faster   P1:       write X.1  |          write X.2
                                      P2:       |          read X.1   |
  case 2: sender is much faster       P1:       write X.1  wait X.2   write X.2
                                      P2:       |          read X.1   |
  case 3: receiver is faster          P1:       |          write X.1  write X.2
                                      P2:       wait X.1   wait X.1   read X.1

(A "|" entry indicates no action on X at that time step.)
X.1, X.2, ..., X.n are the first, second and nth values in memory location X. In general we have statements such as:

S1: X = (expression)
...
S2: Z = (expression using X)
...
(S3: X at P2 = X at P1     ; original communication statement)
...
S4: Y = (expression using X)
...
S5: X = (expression)       ; maybe next iteration of a loop
...
S6: Y = (expression using X)

All references to X are in S1 ... S6. Referring to Table 1, S1 is a write at P1 and S4 is a read at P2. Since the processes are not synchronized, later statements at one process can occur before earlier statements at another; e.g. case 3, t=1 is the situation where P2 reaches S4 before P1 reaches S1; this forces P2 to wait before it can read. Recall that both P1 and P2 are executing the above sequence of statements, but in the given case we need the value of Y only in P2, and we are only using the value of X produced at P1. P1 has data which it may send at S1; it must send it before S5, where it needs to update the local value of X. P2 may read X any time after S2; it must read it at S4. Similarly, the value of X that P2 uses in statement S6 must be read after S5 (where X is once more updated - P2 executes this statement even though this value of X is not used). X must be read at S6 at the latest.

P1 is in a state MAY-SEND (with respect to a given data item and communication statement) as soon as it calculates the value of X at S1; it is in a state MUST-SEND at S5, because here it cannot proceed to recalculate X until it sends the old value. P2 is in a state MAY-READ after S2 and before S4. It cannot read X before S1, because there it executes code to recalculate X, even if we do not want this value. Once P2 reaches S4, it is in a state MUST-READ, and here it waits until P1 is ready to send. In case 3, if we start at t=2, the sender writes X.1 just when the receiver wants it; that is, the two are synchronized. If the receiver is much faster than the sender, it simply starts waiting for X.1 earlier; the end result is the same as case 3.

Processes in MAY states (MAY-SEND, MAY-READ) can continue processing until they reach a statement that puts them in a MUST state (MUST-SEND, MUST-READ). In a MUST state we either need to use a value from another process to calculate something, or we need to replace a value that some other process needs. In either case, we have to wait for the data transfer to take place. We consider a process to be in a QUIET state if it is neither ready to send nor to receive. Data is transferred in states 11, 12, 21 and 22 of Table 2. In cases 12 and 21, one process must wait for one data copy, but the other (the one in the MAY state) can continue. In case 11, neither process has to wait; computation can completely overlap communication. In case 22, both processes wait, but they wait at the same time, so the total wait is only for one copy. All states are with respect to some specific data item. For example, a process may have calculated X and so be in MAY-SEND with respect to it, and need a value Z from someplace else in the current statement being executed, which puts it in MUST-READ with respect to Z.
Table 2. States of a sender-receiver pair

       P1 state    P2 state    result
  00   QUIET       QUIET       both processes continue
  01   QUIET       MAY-READ    both processes continue
  02   QUIET       MUST-READ   P2 waits, P1 continues
  10   MAY-SEND    QUIET       both processes continue
  11   MAY-SEND    MAY-READ    P1 sends to P2, both continue
  12   MAY-SEND    MUST-READ   P1 sends to P2, P2 waits for data
  20   MUST-SEND   QUIET       P1 waits, P2 continues
  21   MUST-SEND   MAY-READ    P1 sends to P2, P1 waits for data
  22   MUST-SEND   MUST-READ   P1 sends to P2, both wait
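Table 2 collapses to two simple rules: data moves whenever the sender is in MAY-SEND or MUST-SEND and the receiver is in MAY-READ or MUST-READ, and a process waits exactly when it is in a MUST state. A minimal Python sketch of this resolution (illustrative only, not the authors' implementation):

```python
# Resolve a sender/receiver state pair into actions, following Table 2.
from itertools import product

SENDER_STATES = ("QUIET", "MAY-SEND", "MUST-SEND")
RECEIVER_STATES = ("QUIET", "MAY-READ", "MUST-READ")

def resolve(p1_state, p2_state):
    transfer = p1_state != "QUIET" and p2_state != "QUIET"  # states 11,12,21,22
    p1_waits = p1_state == "MUST-SEND"                      # P1 blocked: 20,21,22
    p2_waits = p2_state == "MUST-READ"                      # P2 blocked: 02,12,22
    return transfer, p1_waits, p2_waits

if __name__ == "__main__":
    for i, (p1, p2) in enumerate(product(SENDER_STATES, RECEIVER_STATES)):
        transfer, w1, w2 = resolve(p1, p2)
        print(f"{i // 3}{i % 3}: {p1:9s} {p2:9s} "
              f"transfer={transfer} P1 waits={w1} P2 waits={w2}")
```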
Figure 1 defines a Finite State Machine (FSM) that implements our protocol, with added states to permit shortcutting.

Fig. 1. Overlapping and Shortcutting FSM. States: Start (0), mayS (1), mayR (2), mayA (3), mustS (4), mustR (5), mustA (6), ShortCut (7), shRoot (8), Quiet (9), error (10); any unspecified message jumps to the error state. Messages: S: send - define and load, set count; R: receive - define and load, set count; A: all - define and load, set count; RH: variable used; LH: variable defined or updated; CS: clear to send and decrement count; MSG: received msg, decrement count; Z: count is 0 - communication complete (Z is sent by the counter; all send/receive states check the count); DC: declare shortcut, at shortcut proc.; C: shortcut - interrupt; J: join shortcut group; N: do not join group.

We assign a separate FSM to each variable being communicated at each node. Group communication, to be efficient, requires cooperative action, in which some nodes will pass information on, both sending and receiving. We use a message count to determine completion, implemented by adding a counter
to our FSA, which emits a "zero" message (Z) when there are no pending operations. Messages define the transitions in the FSM: each message corresponds to a particular event, and is a letter in the language accepted. States in Fig. 1 are divided into three general paths corresponding to sends, receives and group communications. Any particular communication is defined as a set of point to point sends between nodes, such that the receiver sends a CS message when ready to receive, and the sender then sends the data message.

One or more sends from a node may start with Send, allow multiple RH and CS (each CS corresponds to a message send and a decrement of the count; each RH is a use of the variable without redefinition), terminating in Z or blocking at a redefinition (LH), followed by at least one CS (send and decrement), terminating in Z. (That is, S (RH|CS)* Z or S (RH|CS)* LH CS CS* Z is a word accepted by the FSM.)

One or more receives at a node allow multiple message (MSG) arrivals (each decrementing the count), terminating in Z or blocking on the first use or definition (RH or LH), followed by at least one MSG (and decrement), terminating in Z. (That is, R MSG* Z or R MSG* (LH|RH) MSG MSG* Z.)

A group communication allows multiple message (MSG, decrement) arrivals and multiple CS (message sends, decrement), terminating in Z or blocking on the first use or definition (RH or LH), followed by at least one MSG (and decrement) or CS (send and decrement), terminating in Z. (That is, A (MSG|CS)* Z or A (MSG|CS)* (LH|RH) (MSG|CS) (MSG|CS)* Z.)

We have defined protocols for overlap between definition and next definition at the sender, and use and next use at the receiver; processes are in a may send/receive state during the overlap period, and enter a must send/receive state just before the next use/definition. Note, however, that even if we are in an overlap section, communication cannot occur until the senders and receivers are identified. Therefore we consider the "may communicate" interval to begin when senders and receivers are defined, and to end (at the sender) before the next definition and (at the receiver) before the next use.

We have a group of processes P = {P_i | i in G = {1..n}}. We are presently able to handle the following cases (others are treated as compositions of these):

1. Point to point: send x at j, receive y at i; a single send matched by a single receive (count of 1).
2. Send to a group G (count of n = |G| at the sender, count of 1 at each receiver). It is possible to define this as a collaborative broadcast, using e.g. a spanning tree rooted at j. Doing this costs implicit synchronization and ordering.
3. Gather: y at i is the result of applying some operator to pairs of values x at all nodes in G (count of n = |G| at the receiver, count of 1 at each sender). Order is arbitrary, therefore correctness requires the operator to be commutative and associative. This can also be defined as a collaborative operation involving sets of sends and receives at particular nodes.

All communications are resolved into sets of point to point sends and receives (note that this does not preclude collaborative communications, since a particular node may receive a message from another and forward it to yet another node, optionally adding information of its own).
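As an illustration of the sender path just described (a sketch under assumed simplifications, not the authors' FSA), the following Python class tracks one variable at the sending node; it accepts words of the form S (RH|CS)* Z or S (RH|CS)* LH CS CS* Z, with Z standing for the counter reaching zero.

```python
# Sketch of a per-variable sender-side FSM with a message counter.
class SenderFSM:
    def __init__(self, receivers):
        self.count = receivers     # S: send declared; one message per receiver
        self.state = "mayS"

    def on_cs(self):
        """A receiver signalled clear-to-send; one message may now be shipped."""
        self.count -= 1            # decrement the pending-message count
        if self.count == 0:
            self.state = "Quiet"   # Z: communication for this variable complete
        return True                # a CS never blocks the sender

    def on_rh(self):
        """Local use of the variable: allowed in mayS, computation continues."""
        return True

    def on_lh(self):
        """Local redefinition: may not proceed while sends are still pending."""
        if self.count > 0:
            self.state = "mustS"   # block; wait for the remaining CS events
            return False
        return True
```

A process whose FSM is in mayS keeps computing through further uses; when on_lh returns False, the process must wait, handling incoming CS events until the count reaches zero, before it overwrites the variable. The receiver and allgather paths are analogous, with MSG arrivals in place of (or in addition to) CS events.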
We know the communication pattern before communicating (that is, we know the group of processes involved, and what the initial and final situation should be). We can therefore resolve any communication into a pattern of point to point sends ahead of time. Therefore for every send of a variable at a given process, there is a matching receive (of possibly a different variable) at the destination process, and vice versa. Assume the message includes an identifier that indicates what variable is being sent (this is necessary because, if several messages are received by a given node, it must know which buffers to use for the received data).

In analyzing the correctness of the protocol, we have our three basic cases to consider, and we must show that each send is paired with the correct receive, and that each receive has a correct matching send. Note that communication proceeds by alternating CS (receiver -> sender) and MSG (sender -> receiver). A sender state may not accept without receiving a correct count of CS messages; a receiver state may not accept without receiving a correct count of MSG messages; and an allgather state may not accept without receiving a correct count of both CS and MSG. We assume a correct program in which communication statements express a transfer of information and include senders and receivers; this is the style of MPI [4] group communications and of Pfortran [5], to name two examples. Processes each see the communications in which they participate in the same order.

It is not possible that there will be more than one pending receive of a specific variable. For this to happen, the variable in question would have to be used twice, in two different lines of code. But then RH would appear twice, once for each use. In order to receive, the FSA corresponding to the received variable must be parsing a word that starts with R; but if an RH is encountered in such a word, it blocks until the corresponding message count is Z. Therefore the first receive has completed before the second begins.

It is not possible that the send corresponding to a particular receive corresponds to the wrong definition of the variable in question. For this to happen, there are two possibilities: either the send transmits a later definition of the variable, or a previous one. Suppose it transmits a later definition. Then the corresponding FSA must pass the first definition (LH) and send after the next LH. But LH causes a sender to block until it receives an appropriate number of CS messages from the intended receiver and the count is Z. Therefore the send must complete on the first LH, and it is not possible that it will transmit a later definition. So the only way we could transmit the wrong value is if we transmit an earlier value, corresponding to an earlier definition. But by the same argument this is not possible, since the send of an earlier version of the variable has to have completed at the previous LH. Therefore sends transmit the correct version of the variable definition.

An allgather can be considered as a set of matched sends and receives, where each process does both. A word that starts with A will block before the next definition or use of a variable, so it has the properties of both R and S. Therefore an allgather must complete before the next communication of the variable(s) in question, by the same reasoning above.

Therefore communications are ordered with respect to each variable, each communication completes before the next communication of the given variable is required, and the correct value is transmitted. If we have collaborative communication, we must prove the correctness of the group algorithm; once this is done, the above arguments apply and therefore the communication is correct.
5 Implementation

The test program was written in Fortran77 for serial processing. It was parallelized using Pfortran and MPI, and a one node version without communications was used for speed benchmarks running on a single node. We chose to implement a real algorithm rather than a synthetic benchmark; although a test on one algorithm is clearly not an exhaustive proof, it should be more indicative of real world performance than an artificial benchmark program, which we could write to respond arbitrarily well to our overlap synchronization strategy.

We picked a multidimensional function minimization technique, the Downhill Simplex Method in Multidimensions from [6], for use in our test. We parallelized by projecting through all faces of the simplex at the same time rather than just through a single face: speedup was obtained through convergence in fewer steps. The parallelism and branching factor in this case is limited to the number of dimensions in the problem. At each stage we do the following:

- Input: simplex
- P: parallel tasks on each vertex:
    P1: project through opposite base
    P2: if medium good, project farther
    P3: if nothing worked, contract simplex
- compare all nodes and pick best simplex
- C: if simplex has converged, stop; else go back to Input.
The problem is irregular because the evaluation at each node is different. We then modified the algorithm by shortcutting as follows: if at any stage P1, P2 or P3 we have a better simplex than the original (possibly by some adjustable criterion of enough better), then go directly to C, the convergence test, interrupting all the other parallel tasks. We then have the following loop (sketched in code after the list):

- Input: simplex
- P: parallel tasks on each vertex:
    P1: project through opposite base; if good, goto C
    P2: if medium good, project farther; if good, goto C
    P3: if nothing worked, contract simplex
- compare all nodes and pick best simplex
- C: halt parallel tasks
- if simplex has converged, stop; else go back to Input.
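The following is a minimal Python sketch of one shortcutting iteration (an illustration only: threads stand in for nodes, and the reflection/expansion/contraction geometry and the objective function f are toy stand-ins, not the paper's Fortran/Pfortran implementation of the downhill simplex method):

```python
# One shortcutting iteration: each vertex task tries project/expand/contract,
# and the first task to find a good enough point shortcuts the others.
import threading

def f(x):                                    # toy objective (stand-in)
    return sum(v * v for v in x)

def reflect(simplex, i, factor):
    """Move vertex i through the centroid of the other vertices by 'factor'."""
    n = len(simplex[0])
    others = [simplex[j] for j in range(len(simplex)) if j != i]
    centroid = [sum(p[k] for p in others) / len(others) for k in range(n)]
    return tuple(centroid[k] + factor * (centroid[k] - simplex[i][k])
                 for k in range(n))

def vertex_task(i, simplex, threshold, shortcut, results):
    for factor in (1.0, 2.0, -0.5):          # P1 project, P2 expand, P3 contract
        if shortcut.is_set():                # some other node already shortcut
            return
        trial = reflect(simplex, i, factor)
        results[i] = trial
        if f(trial) < threshold:             # good enough: go directly to C
            shortcut.set()
            return

def iteration(simplex, threshold):
    shortcut, results = threading.Event(), {}
    threads = [threading.Thread(target=vertex_task,
                                args=(i, simplex, threshold, shortcut, results))
               for i in range(len(simplex))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return min(results.values(), key=f)      # C: compare all results, pick best

if __name__ == "__main__":
    print(iteration([(3.0, 3.0), (0.0, 4.0), (4.0, 0.0)], threshold=5.0))
```

In a full minimizer this iteration would be repeated, replacing the worst vertex and testing convergence, until the stop criterion at C is met.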
Potentially, at each stage some process could shortcut the computation after doing only 1/3 of the work, at P1. Therefore it is possible that the criterion T_min < 0.5 T_avg is met, and there is a potential in the modified algorithm for suprascalar parallel efficiency.

Although in the serial version it is sufficient to take any good enough new value as the base for the next iteration, in the parallel version we must consider the possibility that more than one node has a good enough value at the same time (traces taken while running the program showed that this circumstance in fact occurred). If two nodes think they have a good value, and both try to broadcast in the same place, the program will deadlock. The protocol definition allows for definition of separate streams of execution corresponding to each of multiple shortcutting nodes, but this is not fully implemented. The present implementation simply forces a jump to the final comparison section when either a node has a good enough value or it has received a message that tells it some other node has one. The evaluation then proceeds as in the standard algorithm.

Note that the shortcutting version of the program is nondeterministic but not random; the shortcutting process is the one that first (by clock time) finds a good enough solution, and this may vary from one run to the next. It may be argued that the parallel shortcut works better in this case than the serial shortcut because it evaluates the best partial solution, whereas the serial algorithm merely takes the first one it finds that is good enough. But in fact, the first good enough solution found will be the best value the serial algorithm has; in order to find a better value it must continue trials, which would cause it to evaluate more functions.
6 Tests

For testing we used a simple seven dimensional paraboloid requiring fifteen floating point operations to compute, embedded within a loop that repeated it a number of times to simulate a more complicated function. We took an initial simplex with a corner at (-209,-209,-209,-209,-209,-209,-209) and then changed the starting point along a diagonal in increments of +80 up to 451; the minimum was at (3,0,0,0,0,0,0). We ran the following variations of the program:

A0 - the original serial algorithm. From a starting point of -209 this required 11,513 function evaluations.
A1 - the parallelized algorithm, run serially. From the same starting point, 9,899 function calls.
A2 - the shortcutting algorithm, run serially. 9,543 function calls.
A3 - the parallel algorithm. 9,899 function calls.
A4 - the parallel shortcutting algorithm. 8,000-11,000 function calls.
A5 - the parallel overlapping algorithm. 9,899 function calls.
A6 - the parallel overlapping and shortcutting algorithm. 8,000-11,000 function calls.

Figure 2 shows the work required for convergence of the deterministic algorithm (det - corresponds to A1 serially or A5 in parallel), for the serial shortcutting algorithm A2 (short), and for several runs of the non-deterministic shortcutting algorithm in parallel, A4 (10 runs, n0 .. n9, equivalent to A6). Although the shortcutting algorithm executes differently each time, it converged correctly to the same value.
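The exact test function is not reproduced in the paper; the following Python sketch shows the assumed form (a seven dimensional paraboloid with its minimum at (3,0,0,0,0,0,0), wrapped in a repetition loop so each evaluation can be made arbitrarily expensive), purely as an illustration of how the synthetic workload was constructed:

```python
# Assumed form of the synthetic test function (illustrative, not the paper's
# Fortran code): about fifteen floating point operations per pass, repeated
# to simulate a more expensive function.
def test_function(x, repeats=1_000_000):
    val = 0.0
    for _ in range(repeats):
        val = (x[0] - 3.0) ** 2 + sum(v * v for v in x[1:])
    return val

if __name__ == "__main__":
    corner = (-209.0,) * 7    # corner of the initial simplex used in the tests
    print(test_function(corner, repeats=1))
```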
Fig. 2. Function Evaluations Required for Convergence

The level of irregularity in the program may be seen in the variation in the number of function evaluations when different paths are taken to a solution. There is variation of up to about one tenth of the function evaluations, indicating an irregularity on the order of 0.1.

Timings were done on the SP2 for A3, the standard parallel algorithm, A4, the shortcutting algorithm, and A5, the shortcutting algorithm with overlap, generating a starting simplex from (-209,-209,-209,-209,-209,-209,-209). We found that the SP2, using LoadLeveler, exhibited large variations in time depending on what other jobs were running on it (differences by factors of as much as 2 or 3). The following results were taken when the machine was lightly loaded.

A2 deterministic: Average = 48.6 sec, range between 47.9-49.8 sec.
A3 shortcutting: Average = 47.8 sec, range between 46.2-48.9 sec.
A4 shortcutting with overlap: Average = 46.9 sec, range between 46.3-48.3 sec.

In the tests, each function evaluation was set to 15 million floating point operations, by setting the loop index to 1,000,000. Reducing the loop index to 1, program run times were about 10 seconds; we took this to correspond to the base communication time, so the actual computation time is about 40 seconds. The time irregularity should be about 4 seconds, and the combination of shortcutting and overlapping appears to be capturing almost half of the possible savings. Some of the time variation is probably due to differences in the SP2 environment. The two non-deterministic shortcutting programs exhibit more variation
in time, as expected. We see that the shortcutting program has a slight advantage over the standard program, and the overlapping program performs better than shortcutting alone.
7 Conclusions

We have shown two techniques that, separately and in combination, show promise of speedup for parallel processing of irregular programs. The overlapping of regions between data definition and use in different processes lightens the synchronization requirements by spreading the work of synchronization over an interval of time. This is particularly important for irregular problems, in which we want to not only hide the communications cost, but also minimize the time processes must spend waiting for others to reach synchronization points.

Where applicable, the shortcutting method allows the implementation of parallel algorithms that are inherently better than their serial counterparts in that they do less work to reach the same solution. Note that this means increasing advantage with increasing problem size, since there will be a greater difference between the work done in parallel and that done serially. In addition, shortcutting dynamically load balances parallel execution of irregular problems, since all processes take the same clock time when shortcutted. Problems to which shortcutting is applicable are extreme cases of irregular problems; this in our view justifies the development and application of overlapping technology together with shortcutting, for maximum benefits.
References

1. R. Ponnusamy, J. Saltz, and A. Choudhary: Runtime-Compilation Techniques for Data Partitioning and Communication Schedule Reuse. Proceedings of Supercomputing '93, pages 361-370, Portland, Oregon, November 15-19, 1993.
2. L. R. Scott and Xie Dexuan: Parallel Linear Stationary Iterative Methods. Research Report 239, University of Houston, 1997.
3. G. A. Geist and C. H. Romine: LU Factorization Algorithms on Distributed-Memory Multiprocessor Architectures. SIAM J. Sci. Stat. Comput., 9:639-649, 1988.
4. Message Passing Interface Forum: MPI: A Message Passing Interface Standard. Version 1.1, June 12, 1995. http://www.mcs.anl.gov/mpi/
5. B. Bagheri, T. W. Clark and L. R. Scott: Pfortran: A Parallel Dialect of Fortran. Fortran Forum, ACM Press, 11, pp. 3-20, 1992.
6. W. H. Press, B. P. Flannery, S. A. Teukolsky, W. T. Vetterling: Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, 1986.