HYPERTOOL: A PROGRAMMING AID FOR MESSAGE-PASSING SYSTEMS

Min-You Wu and Daniel D. Gajski

Abstract: As both the number of processors and the complexity of problems to be solved increase, programming multiprocessing systems becomes more difficult and error-prone. This paper discusses programming assistance and automation concepts and their application to a program development tool for message-passing systems called Hypertool. Hypertool performs scheduling and inserts communication primitives automatically. Two algorithms, based on the critical-path method, are presented for scheduling processes statically. Hypertool also generates performance estimates and other program quality measures to help programmers improve their algorithms and programs.
I. Introduction

Many commercial message-passing systems have been introduced, such as the Intel iPSC [1], Ametek System/14 [2], Ncube/ten [3], FPS T Series [4], and the Connection Machine [5]. The recent approach to programming these machines is the single program multiple data (SPMD) style. In this style, each processing element (PE) runs the same program but executes different code segments depending on its PE id and the data in its local memory [6]. Programmers find it easy to develop parallel programs in this style. However, achieving a balanced load among PEs in the SPMD style is not a trivial task, since the computation load of each PE may differ, especially for problems with irregular structures. In these cases, computation load on overloaded PEs must be moved to underloaded PEs.
There are two extreme approaches to developing programs in the SPMD style. One school of thought believes that these problems are complex and should be left to programmers [7]. However, programmers are prone to errors when handling tedious chores such as communication primitive insertion. For example, system deadlock is the most common problem, and it is difficult to detect once the program has been developed. Also, the developed programs are not portable, since each machine has different communication primitives [8]. The other school of thought believes in restructuring compilers that extract parallelism and restructure sequential programs into parallel programs automatically [9]. However, the complete analysis required to exploit parallelism is very complex. Furthermore, the parallelism revealed in this way is restricted by the algorithms embodied in the sequential programs. We advocate an approach that falls between these two extremes. We believe that parallelization is a very complex problem which can only be fully solved by a human expert. Program development tools can assist users with creative chores, such as algorithm design. On the other hand, some of the parallelization chores can be automated, especially scheduling and synchronization [10]. Program development tools can also generate performance estimates and quality measures to guide programmers in improving their programs and algorithms. In this way, optimal performance can be obtained with increased productivity. In this paper, we describe a programming aid, named Hypertool, for automatic scheduling and synchronization on message-passing systems. Hypertool takes a partitioned program as input and generates parallel code for execution on message-passing machines. Hypertool also generates performance estimates and quality measures for the parallel code. We will discuss static scheduling problems and present new scheduling algorithms. The algorithms for mapping and synchronization will be discussed as well. In particular, several
issues for programming message-passing systems and computer-aided programming tools are discussed in section II. Hypertool and its programming model are described in section III. Following that, scheduling, mapping, and synchronization insertion are discussed in sections IV, V, and VI, respectively. Several annotated examples are given in section VII to illustrate the method and demonstrate the usefulness of the programming aid.

II. Computer-Aided Programming

A programmer writing programs for a message-passing system must partition each problem into processes, group these processes into tasks, assign each task to a PE, and insert synchronization primitives for proper execution [11], [12]. A program segment that is not partitioned further is called a process, and is counted as one unit of computation. When two processes executing on different PEs must exchange data, the data communication time becomes overhead that slows down the computation. If this overhead is sizable, the maximum parallelism found in the program may not generate the optimum speedup. For this reason, several processes are merged together to form a task, which can be thought of as one unit of allocation. The number of operations in each task defines the operation granularity, and the number of data items in each task defines the data granularity. The best speedup is obtained with the proper granularity. A scheduler can be used for this purpose; it modifies granularity by merging processes into tasks [13]. After merging processes into tasks, each task is assigned to a PE in a given topology. To reduce network traffic, two tasks that exchange data should be mapped to neighboring PEs. Since this is not always possible, the assignment heuristic should attempt to minimize the total communication traffic in the network. For proper communication of tasks on
different PEs, synchronization primitives must be inserted into a program. The tasks of partitioning, scheduling, mapping, and synchronization can be automated through an interactive programming aid. Several research efforts have demonstrated the usefulness of program development tools for multiprocessing. There are two types of tools. One type provides a software development environment and debugging facilities. POKER [14] is a parallel programming environment for message-passing systems, which provides a graphic representation of the communication structure. DAPP [15] accepts program code with inserted synchronization primitives and produces a report of parallel access anomalies, that is, pairs of statements that can access the same location simultaneously. The other type of tool performs some program transformation. Most tools of this type are based on the theory of program restructuring [9]. PTOOL [16] performs sophisticated dependency analysis, including advanced interprocedural flow analysis. It identifies parallel loops, extracts global variables, and provides a simple explanation facility. This information can be used to obtain more parallelism, eliminate some dependencies, and reduce efficiency losses. It also transforms control dependencies into data dependencies. However, PTOOL only tests loops for independence and does not provide partitioning and synchronization mechanisms for non-parallel loops. CAMP [17] partitions both parallel and non-parallel loops, and reduces dependencies by using process alignment and minimum-distance algorithms. Since it extracts more parallelism and eliminates many dependencies, the efficiency loss from processor suspension is reduced. CAMP also inserts synchronization primitives, and estimates performance for each partitioning strategy.
III. Hypertool

The programming aid presented here aims to increase programming productivity by taking over tedious tasks that computers perform better than humans. It takes a user-partitioned program as its input, automatically allocates the partitions to PEs, and inserts proper synchronization primitives where needed. The program development methodology to be used with Hypertool is shown in Figure 1. First, a designer develops a proper algorithm, performs partitioning, and writes a program as a set of procedures. The program looks like a sequential program and can be run on a sequential machine for the purpose of debugging. This program is automatically converted into a parallel program for a message-passing target machine by parallel code synthesis and optimization. Hypertool then generates performance estimates, including execution time, communication time, and suspension time for each PE, and network delay for each communication channel. The explanation facility displays data dependencies between PEs, as well as parallelism and load distribution [18]. If the designer is not satisfied with the result, he/she can redefine the partitioning strategy and the size of partitions using the information provided by the performance estimator and the explanation facility. Figure 2 shows the organization of the program synthesis and optimization module. The lexer and the parser recognize data dependencies and user-defined partitions. The graph generation submodule generates a macro data flow graph, in which each node represents a process. The scheduling submodule assigns processes to tasks by minimizing the execution time for the graph. The mapping submodule maps each task to a physical PE in a given topology by minimizing network traffic. Mapping has little effect on the execution time of a data flow graph. However, improper mapping does increase network traffic, which increases network contention. After scheduling and mapping are completed, the synchronization
module inserts the communication primitives. Finally, the code generator generates target machine code for each PE. To facilitate automation of program development, we use a programming style in which a program is composed of a set of procedures called from the main program. A procedure is an indivisible unit of computation to be scheduled on one processor. The grain sizes of procedures are determined by the programmer and can be modified with Hypertool. Figure 3 shows an example, a parallel Gaussian elimination algorithm, which partitions a given matrix by columns. The procedures FindMax and UpdateMtx are called several times. The control dependencies can be ignored, so that a procedure call can be executed whenever all input data of the procedure are available. Data dependencies are defined by the single assignment of parameters in procedure calls. Communications are invoked only at the beginning and the end of procedures. In other words, a procedure receives messages before it begins execution, and it sends messages after it has finished the computation. For a static program, the number of procedures is known before program execution. Data dependencies among the procedural parameters define a macro data flow graph. A macro data flow graph, which is generated directly from the main program, is a directed graph with a start and an end point. For example, Figure 4 shows the macro data flow graph of the program in Figure 3. Note that only the parallel parts of Figure 3 and the messages transferred among these procedures are shown in Figure 4. A macro data flow graph consists of a set of nodes {n1, n2, ..., nn} and a set of edges {e1, e2, ..., ee}. Each node corresponds to a procedure, and the node weight is represented by the procedure execution time. For example, nodes n1, n7, n12, and n16 in Figure 4 correspond to the procedure FindMax, while the other nodes represent the procedure UpdateMtx. Each edge corresponds to a message transferred from one procedure to another procedure, and the weight of the edge is equal to the transmission time of the message. In Figure 4, for example, the edges connecting n1 with n2, n3, n4, n5, n6, and n7 correspond to a message called "vector" in the first iteration, and the edge connecting n4 and n9 to a message called "matrix". When two nodes are scheduled to a single PE, the weight of the edge connecting them becomes zero. In static scheduling, the number of nodes is known before program execution. The execution time of a node is obtained by running the corresponding procedure. The transmission time of a message is estimated by using the message startup time, the message length, and the communication channel bandwidth. We assume that the message transmission time is that for neighbor communication; non-neighbor communication takes a little more time. We also assume that network traffic is not too heavy. Therefore, network contention is ignored in our model.
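To make the graph model concrete, the following sketch (in Python, for illustration only; it is not Hypertool's internal representation) builds a small macro data flow graph in the style just described. The startup time, bandwidth, message length, and UpdateMtx execution time are assumed values, chosen so that a message sent by n1 arrives 60 μs after n1 finishes, matching the 80 μs and 140 μs figures quoted later for the ASAP binding of Figure 5; only the n1-to-n2..n7 structure and the 80 μs weight of n1 are taken from the text.

    # Illustrative representation of a macro data flow graph (not Hypertool's own
    # data structure).  Node weight = procedure execution time; edge weight =
    # estimated transmission time of the message carried by the edge, assuming
    # neighbor communication.

    STARTUP = 50.0        # assumed message startup time (microseconds)
    BANDWIDTH = 10.0      # assumed channel bandwidth (bytes per microsecond)

    def transmission_time(message_bytes):
        """Estimated neighbor-to-neighbor transmission time of one message."""
        return STARTUP + message_bytes / BANDWIDTH

    class MacroDataFlowGraph:
        def __init__(self):
            self.weight = {}   # node -> execution time of the corresponding procedure
            self.succ = {}     # node -> {successor: edge weight}
            self.pred = {}     # node -> {predecessor: edge weight}

        def add_node(self, n, exec_time):
            self.weight[n] = exec_time
            self.succ.setdefault(n, {})
            self.pred.setdefault(n, {})

        def add_edge(self, src, dst, message_bytes):
            w = transmission_time(message_bytes)
            self.succ[src][dst] = w
            self.pred[dst][src] = w

    # First iteration of the graph of Figure 4: FindMax node n1 (80 microseconds)
    # sends the "vector" message to the UpdateMtx nodes n2..n7.  The UpdateMtx
    # execution time and the message length are assumptions for illustration.
    g = MacroDataFlowGraph()
    g.add_node("n1", 80.0)
    for i in range(2, 8):
        g.add_node("n%d" % i, 20.0)
        g.add_edge("n1", "n%d" % i, message_bytes=100)

Edge weights computed this way correspond to neighbor communication; as noted above, non-neighbor messages take slightly longer, and contention is ignored.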
This programming style has good modularity, which is necessary for developing general application programs. It is also system independent, since communication primitives are not specified within the program. Such a program can be executed sequentially or in parallel. Furthermore, dynamic scheduling becomes simpler, since the programming style partitions a program into many procedures.

IV. Scheduling

In this section, we discuss methods for scheduling a macro data flow graph. We describe static nonpreemptive scheduling algorithms for homogeneous multiprocessors. Critical-path scheduling has been addressed by Hu [19]. The critical-path algorithm has been proved to be near optimal [20], [21]. Basically, the critical-path algorithm assigns a label to each node according to the longest path from this node to the end point. It
performs well for a bounded number of PEs. Ramamoorthy et al. developed algorithms to determine the minimum number of PEs required to process a program in the shortest possible time [22]. They used exhaustive search, which is not acceptable for large programs. Bussell et al. proposed an alternative method to reduce the number of PEs [23], but the efficiency of the algorithm still depends on an estimate of the number of processors. More importantly, these algorithms did not model transmission time; that is, they assumed that data transmission between PEs does not take any time. This is not true, however, for most message-passing systems. The data transmission time is a significant factor that affects the overall performance of a system and must be considered in the model. Kruatrachue and Lewis have recently presented a model which assigns data communication times as the weights of edges [24]. However, their scheduling algorithm is more suitable for fine granularity. There are two objective functions to be considered for scheduling. The first is the time taken to process a graph on a bounded number of PEs. The second is the number of PEs required to process a graph in the shortest possible time. We first modify the critical-path algorithm for a bounded number of PEs. Since this modified critical-path (MCP) algorithm is not able to efficiently determine the minimum number of PEs required to execute a program, a new algorithm, called mobility-directed (MD) scheduling, is then presented. This algorithm generates solutions for the number of PEs required. To define the scheduling algorithms succinctly, we first define several concepts. We describe the as-soon-as-possible (ASAP) and the as-late-as-possible (ALAP) bindings [25] of a macro data flow graph in order to determine mobilities. In an ASAP binding, the earliest possible execution time is assigned to each node. The ASAP binding is created by moving forward through the macro data flow graph. For example, the ASAP binding of
the macro data flow graph in Figure 4 is shown in Figure 5. The node n1 is bound to the time interval 0 to 80 μs. It then sends a message to nodes n2, n3, n4, n5, n6, and n7. When the messages arrive, at time 140 μs, these nodes are bound to the time interval starting at 140 μs. After performing the ASAP binding, each node is assigned an ASAP start time, denoted TS(ni). Conversely, the latest possible start time is assigned to each node in the ALAP binding, which is obtained by moving backward through the macro data flow graph. The ALAP binding of Figure 4 is shown in Figure 6. Similarly, after performing the ALAP binding, each node is assigned an ALAP time, denoted TL(ni).

The moving range of a node ni is defined as the time interval from TS(ni) to TL(ni), in which node ni may start its execution without delaying the execution of any node on the critical path. The length of this range is defined as the mobility of node ni, M(ni) = TL(ni) - TS(ni). The relative mobility of a node is defined by Mr(ni) = M(ni)/w(ni), where w(ni) is the weight of node ni, that is, the execution time of the node. Moreover, we define TF(ni) = TL(ni) + w(ni) as the latest finishing time of node ni. The moving interval of a node ni is defined as the time interval from TS(ni) to TF(ni). As an example, the ASAP time of node n8 in Figure 5 is 340, and its ALAP time in Figure 6 is 630. Therefore, the moving range of n8 is 340 to 630, and its mobility is 290. The relative mobility of the node is Mr(n8) = 290/30 = 9.7. The moving ranges, mobilities, and relative mobilities of the nodes in Figure 4 are listed in Table 1, sorted by relative mobility.
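As a concrete reading of these definitions, the following sketch (again illustrative Python, not Hypertool code) computes TS, TL, the mobility, and the relative mobility with one forward and one backward pass over the graph; the dictionary-based encoding of node weights and edge weights, and the topological-order node list, are assumptions carried over from the earlier sketch.

    # Sketch: ASAP and ALAP bindings and mobilities for a macro data flow graph.
    # weight[n] is the execution time of node n; pred[n] and succ[n] map each
    # neighbor to the connecting edge weight; `topo` lists the nodes in
    # topological order.

    def asap_times(topo, weight, pred):
        """Earliest possible start time TS(n): forward pass through the graph."""
        ts = {}
        for n in topo:
            ts[n] = max((ts[p] + weight[p] + w for p, w in pred[n].items()),
                        default=0.0)
        return ts

    def alap_times(topo, weight, succ, schedule_length):
        """Latest possible start time TL(n): backward pass from the end point."""
        tl = {}
        for n in reversed(topo):
            tl[n] = min((tl[s] - w for s, w in succ[n].items()),
                        default=schedule_length) - weight[n]
        return tl

    def mobilities(topo, weight, pred, succ):
        ts = asap_times(topo, weight, pred)
        length = max(ts[n] + weight[n] for n in topo)   # critical-path length
        tl = alap_times(topo, weight, succ, length)
        return {n: {"TS": ts[n], "TL": tl[n],
                    "M": tl[n] - ts[n],                  # mobility M(n) = TL - TS
                    "Mr": (tl[n] - ts[n]) / weight[n]}   # relative mobility
                for n in topo}

Nodes whose mobility comes out as zero are exactly the critical-path nodes discussed next.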
All nodes with mobility equal to zero identify the critical path through the graph. The length of the critical path is the minimum time required for program execution. Obviously, there is at least one critical path in a macro data flow graph, although several critical paths are possible. Moreover, if two nodes on the critical path are scheduled to a single PE, the weight of the edge connecting them becomes zero, and the critical path is shortened. Note that shortening a critical path may generate several new critical paths. Now we describe a scheduling algorithm for a macro data flow graph on a given number of PEs. This algorithm is based on the critical path algorithm presented by Sethi [26].
Algorithm 1 Modified Critical-Path (MCP) scheduling.

Step 1: Perform the ALAP binding and assign the resulting ALAP time TL(ni) (i = 1, 2, ..., n) to each node in the graph.

Step 2: For each node ni, create a list l(ni) which consists of the TL's of ni and all its descendants, in increasing order. Sort these lists lexicographically in increasing order. According to this order, create a node list L.

Step 3: Schedule the first node in L to a PE that allows its earliest execution.

Step 4: Delete the node from L and repeat Steps 3 and 4 until L is empty.
The complexity of Step 1 is O(n²), since the ALAP binding is performed by depth-first search. Step 2 requires O(n log n) operations to generate each list, and there are n lists, so it requires O(n² log n) time. Sorting all the lists also requires O(n² log n) operations. Therefore, the complexity of Step 2 is O(n² log n). The complexity of Steps 3 and 4 is O(n²). Thus, the complexity of Algorithm 1 is O(n² log n).
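The following sketch is one straightforward reading of Algorithm 1; the tie handling, the back-to-back packing of nodes on each PE, and the graph encoding are assumptions of the sketch rather than details taken from Hypertool.

    # Sketch of MCP scheduling (Algorithm 1) on a bounded number of PEs.

    def mcp_schedule(topo, weight, pred, succ, num_pes):
        # Step 1: ALAP start times TL(n), from an ASAP forward pass plus a
        # backward pass (as in the earlier sketch).
        ts = {}
        for n in topo:
            ts[n] = max((ts[p] + weight[p] + w for p, w in pred[n].items()),
                        default=0.0)
        length = max(ts[n] + weight[n] for n in topo)
        tl = {}
        for n in reversed(topo):
            tl[n] = min((tl[s] - w for s, w in succ[n].items()),
                        default=length) - weight[n]

        # Step 2: for each node, the increasing list of TL's of the node and all
        # of its descendants; order the nodes lexicographically by these lists.
        def descendants(n, seen):
            for s in succ[n]:
                if s not in seen:
                    seen.add(s)
                    descendants(s, seen)
            return seen

        order = sorted(topo, key=lambda n: sorted([tl[n]] +
                       [tl[d] for d in descendants(n, set())]))

        # Steps 3 and 4: take nodes in list order and place each on the PE that
        # allows its earliest start; messages between co-located nodes cost zero.
        pe_free = [0.0] * num_pes
        start, place = {}, {}
        for n in order:
            best = None
            for pe in range(num_pes):
                ready = max((start[p] + weight[p] +
                             (0.0 if place[p] == pe else w)
                             for p, w in pred[n].items()), default=0.0)
                t = max(ready, pe_free[pe])
                if best is None or t < best[0]:
                    best = (t, pe)
            start[n], place[n] = best
            pe_free[place[n]] = start[n] + weight[n]
        return start, place

Because the smallest entry of each node's list is its own TL, every node appears in the list after all of its predecessors, so the start times of the predecessors are always available when a node is placed.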
Before discussing the scheduling method that minimizes the number of PEs required to execute a graph, we give a condition for scheduling a node to a PE to which several other nodes have already been scheduled.

Fact 1 Necessary and sufficient condition for scheduling a node to a PE.

Assume a node np is to be scheduled to PE m, to which l nodes, n_{m_1}, n_{m_2}, ..., n_{m_l}, have already been scheduled. If the moving intervals of these nodes do not intersect with the moving interval of np, then np can be scheduled on PE m. Otherwise, assume the moving intervals of nodes n_{m_i}, n_{m_{i+1}}, ..., n_{m_j} (1 ≤ i ≤ j ≤ l) intersect the moving interval of np. Then np can be scheduled to PE m if there exists a k (i ≤ k ≤ j+1) such that

    w(np) ≤ min(TF(np), TL(n_{m_k})) - max(TS(np), TS(n_{m_{k-1}}) + w(n_{m_{k-1}})),

where TF(np) and TS(np) are calculated as if node np were scheduled on PE m, and w(ni) is the weight of ni; if n_{m_{i-1}} does not exist, TS(n_{m_{i-1}}) = 0 and w(n_{m_{i-1}}) = 0; and if n_{m_{j+1}} does not exist, TL(n_{m_{j+1}}) = ∞. Otherwise, the node cannot be scheduled to PE m.
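The condition of Fact 1 translates almost directly into code. In the sketch below, the nodes already on the PE whose moving intervals intersect that of np are supplied in execution order as (TS, TL, w) triples, and TS(np) and TF(np) are taken as given; this representation is an assumption made for the sketch, not Hypertool's.

    INF = float("inf")

    def can_schedule(np_ts, np_tf, np_w, intersecting):
        """True if n_p fits into some gap k = i, ..., j+1 of the PE's node sequence."""
        if not intersecting:
            return True
        for k in range(len(intersecting) + 1):
            # TL(n_{m_k}), taken as infinity when n_{m_{j+1}} does not exist.
            upper = min(np_tf, intersecting[k][1] if k < len(intersecting) else INF)
            # TS(n_{m_{k-1}}) + w(n_{m_{k-1}}), taken as 0 when n_{m_{i-1}} does not exist.
            prev_end = (intersecting[k - 1][0] + intersecting[k - 1][2]) if k > 0 else 0.0
            if np_w <= upper - max(np_ts, prev_end):
                return True
        return False

    # Example: a node with moving interval [340, 660] and weight 30, checked against
    # two nodes already on the PE (illustrative numbers only).
    print(can_schedule(340.0, 660.0, 30.0, [(140.0, 340.0, 200.0), (360.0, 630.0, 80.0)]))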
We now describe a mobility-directed algorithm for scheduling a macro data flow graph on an unbounded number of PEs.

Algorithm 2 Mobility-Directed (MD) scheduling.

Step 1: Calculate relative mobilities for all nodes. Let L include all nodes initially.

Step 2: Let L' be the group of nodes in L with minimum relative mobility. Let ni be a node in L' that does not have any predecessors in L'. Check whether ni can be scheduled to the first PE using Fact 1. If ni can be scheduled to the PE, ni is inserted before the first node in the node sequence of the PE that satisfies the inequality of Fact 1. If ni cannot be scheduled to the first PE, try the second PE, and so on. When ni is scheduled to PE m, the weights of all edges connecting ni to nodes already scheduled to PE m are changed to zero.

Step 3: If ni is scheduled before node nj on PE m, add an edge with weight zero from ni to nj in the graph. If ni is scheduled after node nj on the PE, add an edge with weight zero from nj to ni in the graph. This ensures that there is no deadlock. Then check whether the added edges form a cycle. If so, schedule ni to the next available slot instead.

Step 4: Recalculate relative mobilities for the modified graph. Delete ni from L and repeat Steps 2, 3, and 4 until L is empty.
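Step 3 is the part of the MD algorithm that guards against deadlock. The fragment below sketches only that step: adding a zero-weight serialization edge between two nodes placed on the same PE, and rejecting the placement if the edge would close a cycle. The helper names and the dictionary graph encoding are assumptions; the surrounding bookkeeping of Algorithm 2 is not shown.

    def would_form_cycle(succ, src, dst):
        """Would adding the edge src -> dst create a cycle?  (search from dst for src)"""
        stack, seen = [dst], set()
        while stack:
            n = stack.pop()
            if n == src:
                return True
            if n in seen:
                continue
            seen.add(n)
            stack.extend(succ.get(n, {}))
        return False

    def add_serialization_edge(succ, pred, earlier, later):
        """Add a zero-weight edge earlier -> later; report failure on a cycle."""
        if would_form_cycle(succ, earlier, later):
            return False          # caller must try the next available slot instead
        succ.setdefault(earlier, {})[later] = 0.0
        pred.setdefault(later, {})[earlier] = 0.0
        return True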
The time for Step 1 includes O(n²) each for the ASAP and ALAP bindings and O(n) for the mobility calculations; thus the complexity of Step 1 is O(n²). It takes O(n) operations to schedule a node on a PE and O(n²) operations to recalculate the relative mobilities. Therefore, the complexity of Algorithm 2 is O(n³).
V. Mapping

The mapping algorithm presented in this section maps the virtual PEs used in the previous section to physical PEs connected in a predefined topology. A good mapping algorithm generates the minimum amount of communication traffic, reducing network contention. For best results, a traffic scheduling algorithm that balances the network traffic should be used [27]. However, traffic scheduling requires flexible-path routing, which generates larger overhead. If network traffic is not too heavy, simpler algorithms which minimize the total network traffic may be used. Since most problems being solved on message-passing systems are in this category, we use an algorithm that minimizes the total network traffic. The mapping problem may be described as follows. We are given a virtual PE graph consisting of Nt nodes and Et edges. A virtual PE graph is also called a task graph; it is generated by scheduling. Each node in this graph corresponds to a task, and each edge corresponds to the messages transferred between two tasks. The weight of an edge, w(ei) (i = 1, 2, ..., Et),
is the sum of the transmission times of all messages between the two tasks. This task graph is to be mapped to a system graph. A system graph consists of Ns nodes and Es edges, where Ns ≥ Nt. Each node corresponds to a physical PE, and each edge to a connection between two PEs, with weight 1. If the task graph can be mapped to the system graph such that all communications are nearest-neighbor communications, no routing is necessary and the mapping is optimal. Otherwise, certain pairs of tasks connected by an edge will be mapped to two non-neighboring PEs. The corresponding messages will be routed through the shortest path between the two PEs. The distance d between two PEs is defined as the number of hops on the shortest path from one PE to the other. Our objective function is

    F = Σ_{i=1..Et} w(ei) di,

where di is the distance between the two PEs to which the two tasks connected by ei are mapped. Therefore, F stands for the total communication traffic. As an example, Figure 7 (a) and (b) show a task graph and a system graph, respectively. When the task graph is mapped to the system graph as shown in Figure 7 (c), F = 20. A better mapping is shown in Figure 7 (d), with F = 16. We could use algorithms for the quadratic assignment problem to obtain an optimal mapping. The algorithm used in Hypertool is a heuristic algorithm presented by Hanan and Kurtzberg to minimize the total communication traffic [28]. It generates an initial assignment by a constructive method, which is then improved iteratively to obtain a better solution. Lee and Aggarwal described some experimental results for hypercube topologies and showed that the algorithm works well enough, even though it does not always guarantee an optimum solution [29].
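The objective function is easy to evaluate for a candidate assignment. The sketch below computes F for a small made-up instance (four tasks on a four-PE ring; these are not the graphs of Figure 7), and exhaustive search over all assignments merely stands in for the constructive-plus-iterative heuristic of Hanan and Kurtzberg that Hypertool actually uses.

    from itertools import permutations

    def total_traffic(task_edges, dist, assign):
        """F: total communication traffic of a task-to-PE assignment."""
        return sum(w * dist[assign[a]][assign[b]]
                   for (a, b), w in task_edges.items())

    # Illustrative task graph (edge weights = summed message transmission times)
    # and hop-distance matrix of a 4-PE ring 0-1-2-3-0.
    task_edges = {("t0", "t1"): 4, ("t1", "t2"): 1, ("t2", "t3"): 4, ("t0", "t3"): 1}
    ring_dist = [[0, 1, 2, 1],
                 [1, 0, 1, 2],
                 [2, 1, 0, 1],
                 [1, 2, 1, 0]]

    tasks = ["t0", "t1", "t2", "t3"]
    best = min((dict(zip(tasks, p)) for p in permutations(range(4))),
               key=lambda a: total_traffic(task_edges, ring_dist, a))
    print(best, total_traffic(task_edges, ring_dist, best))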
VI. Synchronization

Synchronization on message-passing systems is carried out by communication primitives. The basic communication primitives are send and receive. Some message-passing systems supply synchronous communication primitives only, while others supply both synchronous and asynchronous primitives. Asynchronous primitives usually use communication co-processors to handle network activities, which reduces the load on the main processors. The communication primitives are used to exchange messages between processors. They must be used properly to ensure the correct sequence of computation. In first-generation message-passing systems, communication primitives are inserted by programmers. It is possible to insert these primitives automatically, reducing the programmer's load and eliminating insertion errors. Since our programming model partitions programs into procedures and message exchanges take place only before and after the procedures, primitive insertion is relatively easy. The communication primitive insertion may be described as follows. After scheduling and mapping, each node in the macro data flow graph has been allocated to a PE. If an edge exits from a node to another node which belongs to a different PE, a send primitive is inserted after the node. Similarly, if an edge comes from another node in a different PE, a receive primitive is inserted before the node. However, if a message has already been sent to a particular PE, the same message does not need to be sent to that PE again. If a message is to be sent to many PEs, broadcasting is applied instead of sending the message to these PEs separately. The insertion method described above does not by itself ensure that the communication sequence is correct. For example, two possible cases are shown in Figure 8(a) and Figure 9(a).
In Figure 8(a), the order of the sends is incorrect and must be changed as shown in Figure 8(b). In Figure 9(a), on the other hand, either the order of sends or the order of receives needs to be exchanged, as shown in Figures 9(b) and (c), respectively. We use a send-first strategy for this case; that is, we reorder the receives according to the order of the sends and obtain the solution shown in Figure 9(c). The entire communication primitive insertion algorithm is described below.
Algorithm 3 Communication insertion.

Assume that after scheduling and mapping each node ni of the macro data flow graph is assigned to PE M(ni), where M is a function mapping a node number to a PE number.

Step 1: For each edge ek from node ni to nj for which M(ni) ≠ M(nj), insert a send primitive after node ni in PE M(ni), denoted by S(ek, ni, M(nj)); and insert a receive primitive before node nj in PE M(nj), denoted by R(ek, nj, M(ni)). Once a message has been sent to a PE, eliminate other sends and receives which transfer the same message to the same PE. Now, for each PE, we have a sequence X(em1, nm1, pm1), X(em2, nm2, pm2), ..., where X can be either S or R.

Step 2: For each pair of PEs, say p1 and p2, extract all S(emi, nmi, p2) from PE p1 to form a subsequence Sp1, and extract all R(emj, nmj, p1) from PE p2 to form a subsequence Rp2.

Step 2.1: Within each segment of the subsequence Sp1 with the same node number, exchange the order of sends according to the order of receives as defined by the subsequence Rp2.

Step 2.2: If the two resulting subsequences still do not match each other, Rp2 is reordered according to the order of Sp1.
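Step 1 of Algorithm 3 can be sketched as follows; the data layout is an assumption made for the sketch, and the send/receive reordering of Step 2 is not shown.

    def insert_primitives(edges, pe_of):
        """Sketch of Step 1 of Algorithm 3.

        edges: list of (message_id, src_node, dst_node); pe_of: node -> PE id.
        Returns the communication primitives attached to each PE, with repeated
        transfers of the same message to the same PE suppressed.
        """
        per_pe = {}
        already_sent = set()                       # (message_id, destination PE)
        for msg, src, dst in edges:
            if pe_of[src] == pe_of[dst]:
                continue                           # local edge: no primitives needed
            if (msg, pe_of[dst]) in already_sent:
                continue                           # message already sent to that PE
            already_sent.add((msg, pe_of[dst]))
            per_pe.setdefault(pe_of[src], []).append(
                ("send", msg, "after", src, "to", pe_of[dst]))
            per_pe.setdefault(pe_of[dst], []).append(
                ("receive", msg, "before", dst, "from", pe_of[src]))
        return per_pe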
Reordering of sends and receives may not be necessary for a system supporting typed messages. However, even for such systems, message transmission reordering may reduce the message waiting time and the demand on system communication buffers.

VII. Examples and Comparison

We use the example of Figure 4 to illustrate scheduling and synchronization. The MD algorithm is used to schedule the macro data flow graph. Node n1 is scheduled first, then node n3. After n3 is scheduled, the weight of the edge connecting n1 and n3 becomes 0, inducing a modification of the mobilities. Node n7 is then scheduled, and so on. The macro data flow graph is scheduled to two PEs, as shown in Figure 10. Note that once "vector" in the first iteration has been sent to one node in PE 1, sending it separately to the other nodes in PE 1 is unnecessary. When communication primitives are inserted, the order of the second and the third receives in PE 1 is exchanged so that PE 1 receives the messages in the order in which they are sent. Figure 11 shows the generated code for the two PEs according to Figure 10. Note that only the main program for each PE is shown. The data structure is the same as in Figure 3. The initial matrix is loaded to the PEs such that each PE obtains the part of the matrix required for its computation. Consequently, the memory space can be compacted so that only what is required is allocated in each PE. In this example, the first, second, and third columns of the matrix are loaded to PE 0, and the fourth and fifth columns to PE 1. Similarly, the third, fourth, and fifth columns of the resulting matrix are unloaded from PE 0, and the first and second columns from PE 1. Furthermore, to reduce the number
of message transfers, and consequently the time spent initiating messages, several messages can be packed and sent together. For example, the first three columns of the initial matrix may be packed into one message and sent to PE 0. Figure 12 shows another example. The Gauss-Seidel algorithm is used to solve Laplace equations, since it converges faster than the Jacobi algorithm. In this algorithm, the update procedure for step k may be presented as:
    A_{i,j}^{(k)} = (1/4) [ A_{i-1,j}^{(k)} + A_{i,j-1}^{(k)} + A_{i+1,j}^{(k-1)} + A_{i,j+1}^{(k-1)} ]
Each value of A_{i,j} at iteration k is obtained from two newly generated values of iteration k and two old values of iteration k-1. This algorithm assumes a particular updating sequence. Since A_{i,j} depends on values of the same iteration, not all values of an iteration can be obtained simultaneously. A given matrix is partitioned by rows and columns. The macro data flow graph in Figure 12 may be scheduled to three PEs using the MD algorithm. However, if only two PEs are available, the MCP algorithm is applied. The resulting schedule is shown in Figure 13. Hypertool is currently running on a Sun workstation under UNIX. It takes 37 seconds to schedule a program with 162 processes to 8 PEs. Several examples have been tested on Hypertool. In our experience, developing programs with Hypertool takes much less time than manual program development. Debugging is much easier, and we have never had a deadlock in the programs developed with Hypertool. Next we compare manually generated codes and Hypertool-generated codes. Both codes are generated from the same algorithms; the only difference is that automatic scheduling has been applied to the Hypertool-generated codes but not to the manually generated codes. The scheduling method for the manually generated codes can be briefly described
as follows: the data domain is partitioned equally, in a style that reduces dependencies among the partitions. Each PE is assigned a partition, and an SPMD program is coded for the PEs. Although a programmer might use more sophisticated scheduling, the workload is too heavy for a human to produce good, error-free results; such sophisticated scheduling should be performed by an automatic tool. By using Hypertool, some problems, such as Gaussian elimination, Laplace equations, and dynamic programming, show up to a 300% improvement in speed (see Figures 14, 15, and 16). These problems have less regular structures, so manual scheduling usually leads to an unbalanced load distribution among PEs, because the amount of computation in each data partition is different. Automatic scheduling, on the other hand, moves some nodes from overloaded PEs to underloaded PEs and achieves a better load balance. For more regular problems, such as matrix multiplication and bitonic sort, automatic scheduling gives performance similar to that of manual scheduling, as shown in Figures 17 and 18. However, even for these kinds of problems Hypertool still shows better performance when the matrix size cannot be evenly divided by the number of PEs. Table 2 shows the comparison of Gaussian elimination for the MCP and MD scheduling algorithms. The two algorithms deliver almost the same performance; however, the MD algorithm is useful for reducing the number of PEs required to execute a program. The execution times of the manually generated codes are also listed for comparison. Figure 19 compares the execution times of random scheduling and MCP scheduling for Laplace equations. About a 30% improvement in speed is obtained by the MCP scheduling algorithm. Surprisingly, in many cases even the random scheduling algorithm generates better performance than manual scheduling.
The grain size of each procedure also affects the quality of the generated parallel codes. Too large a granularity reduces parallelism, and too small a granularity increases overhead; only a moderate grain size gives the best performance. Figure 20 shows the performance for different grain sizes of procedures. Since the execution times of nodes and the message transmission times are obtained by estimation, the effect on performance of the difference between the estimated values and the real values has been studied. Tables 3 and 4 show the results for Laplace equations and bitonic sort, respectively. For the Laplace equations, the node weight is estimated with a matrix size of 4, and for the bitonic sort, the node weight is estimated with an array length of 64. The results show that the difference between the estimated value and the real value has little effect on performance.

VIII. Conclusion

As both the number of PEs and the complexity of problems to be solved increase, programming multiprocessing systems becomes more difficult and error-prone. Optimal parallelization may be too complicated for all but simple problems. Indeed, early experiments on programming hypercube systems have revealed that conceptualization of program execution is very difficult, and any further optimization of complex problems was discouraged. A program development tool that helps programmers develop parallel programs by automating part of the parallelization tasks and back-annotating quality measures to programmers becomes a necessity. The experimental results show that the Hypertool approach is better than the manual methodology in many respects. First, it increases programming productivity.
Programmers only define partitions, without indicating the task allocation or communication primitives. Second, since communication primitive insertion is performed by Hypertool, synchronization errors are eliminated. Moreover, most programming errors may be debugged by running the program sequentially. Finally, the program development tool generates better parallel codes since it uses good scheduling algorithms; this resulted in the substantial performance increases demonstrated in the previous section. Since Hypertool generates target machine codes automatically, the programs developed with the tool are portable. The programs may run on different message-passing systems, and even on shared-memory systems. The tool can also be developed for a variety of languages to fit different applications. The current version of Hypertool uses static scheduling and is applied to static problems. That is, all processes must be created before execution starts, and an estimate of the amount of computation in each process must be known at compile time. One drawback of static scheduling is that a program must be rescheduled before it can run on another machine. A dynamic version of Hypertool is under investigation. Hypertool is able to schedule a macro data flow graph with several thousand processes; this limit is caused mainly by memory space limitations. A large graph may be partitioned into several strongly connected subgraphs for scheduling.

Acknowledgement
The authors would like to thank the anonymous reviewers for their thorough comments which caused us to improve the presentation and level of detail.
References

[1] J. Rattner, "Concurrent processing: a new direction in scientific computing," in Proc. National Computer Conf., pp. 157-166, 1985.
[2] Ametek System/14 User's Guide: C Edition. Ametek Computer Research Division, Arcadia, California, 1986.
[3] J. P. Hayes, T. Mudge, Q. F. Stout, S. Colley, et al., "A microprocessor-based hypercube supercomputer," IEEE Micro, vol. 6, pp. 6-17, Oct. 1986.
[4] H. L. Gustafson, S. Hawkinson, and K. Scott, "The architecture of a homogeneous vector supercomputer," in Proc. Int'l Conf. on Parallel Processing, pp. 649-652, Aug. 1986.
[5] L. W. Tucker and G. G. Robertson, "Architecture and applications of the Connection Machine," IEEE Computer, pp. 26-38, Aug. 1988.
[6] A. H. Karp, "Programming for parallelism," IEEE Computer, vol. 20, pp. 43-57, May 1987.
[7] C. L. Seitz, "The cosmic cube," Communications of the ACM, vol. 28, pp. 22-33, Jan. 1985.
[8] A. H. Karp and R. G. Babb II, "A comparison of 12 parallel Fortran dialects," IEEE Software, pp. 52-67, Sep. 1988.
[9] D. A. Padua, D. J. Kuck, and D. L. Lawrie, "High speed multiprocessor and compilation techniques," IEEE Trans. Computers, vol. C-29, pp. 763-776, Sep. 1980.
[10] M. Y. Wu and D. D. Gajski, "Computer-aided programming for message-passing systems: problems and a solution," Proceedings of the IEEE, Feb. 1990.
[11] D. D. Gajski and J. Peir, "The essential issues in multiprocessor systems," IEEE Computer, vol. 18, pp. 9-27, June 1985.
[12] J. K. Peir, D. D. Gajski, and M. Y. Wu, "Programming environments for multiprocessors," in Proc. Int'l Seminar on Scientific Supercomputers, pp. 47-68, Feb. 1987.
[13] V. Sarkar, Partitioning and Scheduling Parallel Programs for Multiprocessors. The MIT Press, 1989.
[14] L. Snyder, "Parallel programming and the POKER programming environment," IEEE Computer, pp. 27-36, July 1984.
[15] W. F. Appelbe and C. McDowell, "Anomaly detection in parallel Fortran programs," in Proc. Workshop on Parallel Processing Using the HEP, May 1985.
[16] R. Allan, D. Baumgartner, K. Kennedy, and A. Porterfield, "PTOOL: a semiautomatic parallel programming assistant," in Proc. Int'l Conf. on Parallel Processing, pp. 164-170, Aug. 1986.
[17] J. K. Peir and D. D. Gajski, "CAMP: a programming aide for multiprocessors," in Proc. Int'l Conf. on Parallel Processing, pp. 475-482, Aug. 1986.
[18] M. Y. Wu and D. D. Gajski, "A programming aid for message-passing systems," in Proc. 3rd SIAM Conf. on Parallel Processing for Scientific Computing, Dec. 1987.
[19] T. C. Hu, "Parallel sequencing and assembly line problems," Operations Research, vol. 9, no. 6, pp. 841-848, 1961.
[20] T. L. Adam, K. Chandy, and J. Dickson, "A comparison of list scheduling for parallel processing systems," Communications of the ACM, vol. 17, pp. 685-690, Dec. 1974.
[21] W. H. Kohler, "A preliminary evaluation of the critical path method for scheduling tasks on multiprocessor systems," IEEE Trans. Computers, vol. C-24, pp. 1235-1238, Dec. 1975.
[22] C. V. Ramamoorthy, K. M. Chandy, and M. J. Gonzales, "Optimal scheduling strategies in a multiprocessor system," IEEE Trans. Computers, vol. C-21, pp. 137-146, Feb. 1972.
[23] B. Bussell, E. Fernandez, and O. Levy, "Optimal scheduling for homogeneous multiprocessors," in Proc. IFIP Congress 74, pp. 286-290, North-Holland Publ. Co., 1974.
[24] B. Kruatrachue and T. Lewis, "Grain size determination for parallel processing," IEEE Software, pp. 23-32, Jan. 1988.
[25] D. Landskov, S. Davidson, B. Shriver, and P. Mallett, "Local microcode compaction techniques," Computing Surveys, vol. 12, pp. 261-294, Sep. 1980.
[26] R. Sethi, Computer and Job-Shop Scheduling Theory, ch. 2: Algorithms for Minimal-Length Schedules. Wiley, N.Y., 1976.
[27] R. P. Bianchini and J. P. Shen, "Interprocessor traffic scheduling algorithm for multiple-processor networks," IEEE Trans. Computers, vol. C-36, pp. 396-409, Apr. 1987.
[28] M. Hanan and J. Kurtzberg, "A review of the placement and quadratic assignment problems," SIAM Rev., vol. 14, pp. 324-342, Apr. 1972.
[29] S. Y. Lee and J. K. Aggarwal, "A mapping strategy for parallel processing," IEEE Trans. Computers, vol. C-36, pp. 433-442, Apr. 1987.
Table 1: Mobilities of the Nodes in Figure 4 (μs)

Node   Moving range   Mobility   Relative mobility
n1       0 - 0            0         0
n3     140 - 140          0         0
n7     220 - 220          0         0
n9     340 - 340          0         0
n12    410 - 410          0         0
n14    510 - 510          0         0
n16    570 - 570          0         0
n17    650 - 650          0         0
n18    650 - 650          0         0
n4     140 - 260        120         3
n10    340 - 440        100         3.3
n15    510 - 590         80         4
n5     140 - 360        220         5.5
n11    340 - 520        180         6
n13    510 - 640        130         6.5
n6     140 - 440        300         7.5
n8     340 - 630        290         9.7
n2     140 - 620        480        12
Table 2: Performance Comparison for MCP and MD Algorithms (Gaussian Elimination)

Matrix size   Number of PEs   Execution time (msecs)
                              MCP       MD       Manual
 4             4                1.8      1.8       2.4
 8             5                6.3      6.3       9.5
16             7               24.2     24.0      40.0
32            12              103.6    102.6     165.6
Table 3: The Effect of Estimation on Performance (Laplace Equations)

Problem size   Execution time (msecs)   Difference in speed
               Real       Estimated
  4              1.86       1.86           0%
  8              4.23       4.26          -0.7%
 16             12.78      12.82          -0.3%
 32             44.84      44.88          -0.1%
 64            168.8      168.8            0%
128            655.9      655.9            0%
Table 4: The Effect of Estimation on Performance (Bitonic Sort)

Problem size   Execution time (msecs)   Difference in speed
               Real       Estimated
 64              9.20       9.20           0%
128             17.08      17.10          -0.1%
256             33.17      33.30          -0.4%
512             68.03      68.31          -0.4%
Figure 1: The development tool.
Figure 2: Program synthesis and optimization.
Figure 3: A parallel Gaussian elimination algorithm.
Figure 4: The macro data flow graph of Figure 3 (N=4).
Figure 5: The ASAP binding of Figure 4.
Figure 6: The ALAP binding of Figure 4.
Figure 7: Mapping: (a) task graph; (b) system graph; (c) a mapping; (d) the optimal mapping.
Figure 8: Synchronization insertion (case 1).
Figure 9: Synchronization insertion (case 2).
Figure 10: The scheduling of Figure 4 by the MD algorithm.
Figure 11: The target machine code for each PE.
Figure 12: The macro data flow graph of a parallel Laplace equation algorithm.
Figure 13: The scheduling of Figure 12 by the MCP algorithm.
Figure 14: Performance comparison for Gaussian elimination.
Figure 15: Performance comparison for Laplace equations.
Figure 16: Performance comparison for dynamic programming.
Figure 17: Performance comparison for matrix multiplication.
Figure 18: Performance comparison for bitonic sort.
Figure 19: Comparison of random and MCP scheduling (Laplace equations).
Figure 20: Comparison of different grain sizes (Laplace equations).