Low-cost scheduling algorithms for distributed-memory architectures

Andrei Rădulescu    Arjan J.C. van Gemund    Hai-Xiang Lin    Henk J. Sips

Department of Information Technology and Systems
Delft University of Technology
P.O. Box 5031, 2600 GA Delft, The Netherlands

Keywords: scheduling, clustering, distributed computing

Abstract

One of the main problems in the field of scheduling algorithms for distributed-memory systems is finding heuristics that produce good schedules at a low cost. We propose two new approaches to the scheduling problem. Both algorithms are intended to be used at compile time, to schedule task graphs on distributed-memory systems. Compared to known scheduling algorithms, the proposed algorithms preserve output performance while reducing complexity.

1 Introduction

High-performance computing is still largely an academic practice, despite the general availability of large-scale distributed-memory systems such as networks of workstation clusters. One of the main obstacles to a cost-effective realization of high-performance applications is the lack of adequate scheduling algorithms. The existence of such algorithms would enable compilers to automatically map an application's parallel tasks onto the distributed-memory machine. Currently, however, a programmer is still forced to solve the mapping problem by hand, resulting in non-portable code.

The AUTOMAP project [15] is aimed at a task-parallel programming environment for distributed-memory machines. A primary focus of the project is the development of a fully automatic scheduling engine that relieves the programmer of the mapping problem, essentially providing programming ease as well as portability. The project emphasizes the development of general scheduling algorithms, regardless of the structure or size of the problems. In practical situations, where the number of tasks may be extremely large, the time complexity of an algorithm may be of more concern than how close the algorithm is to optimality.

In distributed-memory architectures, communication becomes an important factor in scheduling. Communication delays further complicate the design of good scheduling heuristics. Unlike shared-memory architectures, where even a low-cost scheduling algorithm is guaranteed to produce acceptable performance [8], for distributed-memory systems such a cost/performance guarantee does not exist. As a result, scheduling tasks in a distributed-memory system has received considerable attention.

This paper describes two new compile-time scheduling algorithms for distributed-memory architectures, aimed at reducing time complexity while maintaining acceptable performance. It is shown that, despite a large body of existing work, there is room for new algorithms with a better cost/performance ratio.

The paper is organized as follows. The next section gives an overview of the scheduling problem and the currently known ways of solving it. In Section 3 two new algorithms for mapping clusters to processors are presented, while their cost/performance evaluation is discussed in Section 4. Section 5 concludes the paper.

2 Related Work

Scheduling tasks to processors is known to be NP-complete in its general form and in several restricted cases [7]. Only a few very restricted scheduling problems are known to have polynomial algorithms leading to optimal solutions [5, 6, 9, 13]. Real problems rarely fit a restricted case. As a result, a large amount of work has been done on designing heuristics for finding sub-optimal solutions. The heuristic algorithms used for task scheduling can be divided into two classes: (a) scheduling algorithms for an unbounded number of processors and (b) scheduling algorithms for a bounded number of processors.

2.1 Unbounded number of processors

Scheduling for an unbounded number of processors is easier, because the constraint on the number of processors is not considered. The communication delays between tasks within the same virtual processor are considered negligible compared to the delays between different virtual processors. Within the class of algorithms for an unbounded number of processors, a distinction can be made between clustering algorithms, which schedule tasks without duplication, and duplication-based algorithms.

Clustering is performed by grouping connected tasks together in order to reduce communication. Clustering algorithms are based on, e.g., (a) critical-path (CP) analysis (the Dominant Sequence Clustering algorithm (DSC) with O((V + E) log V) complexity [18]) or (b) reducing the communication between tasks (Internalization with O(E(V + E)) complexity [14]). Duplication-based scheduling algorithms use the same approach as clustering, but they try to reduce the communication delays even more by duplicating tasks. They can use, e.g., CP analysis (Scalable Task Duplication Based Scheduling (STDS) with O(V²) complexity [3], CPM with O(V²) complexity [2]).

The scheduling algorithms for an unbounded number of processors are not practical by themselves, because they cannot be used if the required number of processors is not available. However, they can be used as a preliminary step for a subsequent algorithm that also considers the limited number of processors.
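CP-based clustering relies on critical-path information of the task graph. As an illustrative sketch (the graph, weights and function names below are invented, not taken from the paper), the "bottom level" of a task, i.e. the length of the longest path from that task to an exit task counting both execution times and communication delays, can be computed in a single memoized DAG traversal:

```python
from functools import lru_cache

# Hypothetical task graph: task -> execution time
exec_time = {"a": 2, "b": 3, "c": 1, "d": 4}
# (src, dst) -> communication delay
comm = {("a", "b"): 5, ("a", "c"): 2, ("b", "d"): 1, ("c", "d"): 6}
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}

@lru_cache(maxsize=None)
def bottom_level(task):
    """Longest path from `task` to an exit task, including
    execution times and inter-task communication delays."""
    tail = max((comm[(task, s)] + bottom_level(s) for s in succ[task]),
               default=0)
    return exec_time[task] + tail

# The entry task's bottom level is the critical-path length of the graph.
print(bottom_level("a"))  # a -> c -> d: 2 + 2 + 1 + 6 + 4 = 15
```

An algorithm such as DSC repeatedly zeroes the communication delay on an edge of the dominant sequence (the current critical path) when doing so does not lengthen the schedule; bottom levels of this kind are the priority information it maintains.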

2.2 Bounded number of processors

An important class of scheduling algorithms for a bounded number of processors is the class of list scheduling algorithms. In list scheduling, each task is assigned a priority. A ready task is a task whose dependencies are all satisfied. The ready task with the highest priority is scheduled on that task's "best" available processor, e.g., the processor on which the task has the earliest start time. The priorities of the tasks can be calculated: (a) before scheduling the tasks (Modified Critical Path (MCP) with O(V² log V) complexity [16], Mapping Heuristic (MH) with O(V³P²) complexity [4]) or (b) after each task is scheduled (Earliest Task First (ETF) with O(V²P) complexity [10], Mobility Directed (MD) with O(V³) complexity [16]). Besides list scheduling algorithms, there are also duplication-based algorithms (Duplication Scheduling Heuristic (DSH) with O(V⁴) complexity [11], Critical Path Fast Duplication (CPFD) with O(V⁴) complexity [1]). Their complexities, however, are too high to be practical for large problems.
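The list scheduling template shared by these algorithms can be sketched as follows. The task graph, priorities and 2-processor machine are a made-up toy instance, with priorities computed once before scheduling (as in MCP):

```python
# Hypothetical instance: task -> execution time, edge -> communication delay
exec_time = {"a": 2, "b": 3, "c": 1, "d": 4}
comm = {("a", "b"): 5, ("a", "c"): 2, ("b", "d"): 1, ("c", "d"): 6}
pred = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
priority = {"a": 15, "b": 8, "c": 11, "d": 4}   # e.g. bottom levels
P = 2                                           # number of processors

proc_free = [0.0] * P        # time at which each processor becomes idle
placed, finish = {}, {}      # task -> processor, task -> finish time
unscheduled = set(exec_time)

def est(task, p):
    """Earliest start time of `task` on processor `p`: the processor must be
    idle and all predecessor data must have arrived (communication delay is
    zero when a predecessor runs on the same processor)."""
    arrivals = [finish[q] + (0 if placed[q] == p else comm[(q, task)])
                for q in pred[task]]
    return max([proc_free[p]] + arrivals)

while unscheduled:
    ready = [t for t in unscheduled if all(q in placed for q in pred[t])]
    task = max(ready, key=lambda t: priority[t])    # highest-priority ready task
    p = min(range(P), key=lambda q: est(task, q))   # "best" processor
    placed[task] = p
    finish[task] = est(task, p) + exec_time[task]
    proc_free[p] = finish[task]
    unscheduled.remove(task)

print(max(finish.values()))  # schedule length (makespan): 10
```

On this instance the large communication delays keep every task on one processor; the sketch keeps only the structure relevant here and omits the sorted task list and tie-breaking rules that give MCP its exact complexity bound.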

A third approach is to use a multistep method. Step 1 is a clustering algorithm. Step 2 is cluster merging, which maps the previously obtained clusters to the available number of processors. Finally, in step 3, the tasks are ordered according to their dependencies. The first, clustering phase can be done fast and significantly reduces the dimension of the mapping problem. Because of the smaller size of the problem, cluster merging can be done by a rather more expensive algorithm (Load Balancing (LB) with O(V log V + P³) complexity [17], List Processor Assignment (LPA) with O(PVC) complexity [14], where C denotes the number of clusters). Task ordering can be done using list scheduling (Ready Critical Path (RCP) and Free Critical Path (FCP), both with O(V log V + E) complexity [17]). Results show that the multistep method yields good results at a low complexity [12, 18].

3 The Proposed Algorithms

In this section we describe two new cluster merging algorithms to be used within a multistep approach. The first uses a classical list scheduling algorithm that also exploits information obtained in the previous clustering step. The second is a new way of merging clusters by load balancing. Both algorithms have low complexities, while their results remain comparable with those of higher-complexity algorithms.

We use a multistep approach to perform scheduling. For step 1 (clustering) we use DSC [18], as it is a low-complexity algorithm that produces good clusterings. Step 3 (task ordering) can also be performed well with fast algorithms such as RCP [17]. For step 2, the known algorithms either have too high a complexity (LPA [14]) or do not perform well (LB [17]). We aim to improve step 2 with the two newly proposed cluster merging algorithms.

3.1 Cluster Merging by List Scheduling

The cluster merging algorithm is based on a list scheduling algorithm. Instead of scheduling each task separately, only the first task in a cluster is scheduled using the list scheduling algorithm. All the other tasks in the cluster are simply mapped onto the same processor, thus reducing the cost.

The algorithm (Fig. 1) works as follows. Initially, no tasks or clusters are mapped to processors. The tasks are scheduled sequentially in the order imposed by the list scheduling algorithm. If the current task belongs to a cluster that is already mapped, the task is assigned to the same processor. If the cluster is not mapped, the task is scheduled according to the list scheduling algorithm and the cluster is mapped to the same processor.

    List_ClusterMerging() {
        WHILE there are unscheduled tasks {
            Select task_k to be scheduled according to the list scheduling algorithm.
            Let clust be the cluster task_k belongs to.
            IF clust is already mapped to a processor P_i {
                Schedule task_k on P_i.
            } ELSE {
                Use the list scheduling algorithm to schedule task_k on a processor P_j.
                Map clust to P_j.
            }
        }
    }

Figure 1: Cluster merging using list scheduling

    GLB_ClusterMerging() {
        Order the clusters by the start time of the first task of the cluster,
        earliest first. Ties are broken by choosing the cluster with the
        largest amount of computation.
        WHILE there are unmapped clusters {
            Select the unmapped cluster clust with the highest priority.
            Map the entire cluster to the least loaded processor.
        }
    }

Figure 2: Guided Load Balancing (GLB)

This algorithm combines cluster merging with task ordering. Using a priority list of ready tasks, task mapping is done in the order imposed by the tasks' dependencies. Thus, the order in which the tasks are mapped is the same as the order of their execution in the final schedule, and there is no need for a separate task ordering step.

When calculating the priorities of the tasks, some list scheduling algorithms also consider the delays between tasks. A potential improvement is to zero the delays between tasks in the same cluster, because all the tasks in a cluster will be mapped to the same processor anyway.

The complexity of the cluster merging algorithm is determined by the complexity of the list scheduling algorithm. In the worst case, when each task is a separate cluster, the number of operations equals that of the list scheduling algorithm. The improvement is obtained when the clusters are large, because most of the computation time spent finding a processor for the current task is saved: only one task in each cluster is scheduled following the normal procedure. For the remaining tasks, only the start time of each task is calculated, which is at most O(E) for all tasks.

Compared to LPA [14], which also schedules the clusters using an ordered list of tasks, this algorithm does not verify at each cluster mapping that the total completion time is minimized. Consequently, the complexity of the algorithm is decreased, but we also expect somewhat worse results. Compared to LB [17], which is a pure load balancing method, the complexity is increased in the worst case, when each task is a cluster. In such a case, however, cluster merging performed only by load balancing will most probably lead to long schedules. When the number of tasks per cluster is large, the running time of our algorithm is almost the same as that of LB.
When MCP [16] is chosen as the basic list scheduling algorithm, the complexities of the different steps of the cluster merging algorithm are the following: computing the task priorities takes O(E) time and task ordering takes O(V log V) time. Scheduling the first task of each cluster is O(CVP) and scheduling the other tasks is O(E). The overall complexity of the proposed merging algorithm using MCP is therefore O(E + V log V + CVP).
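The merging rule of Fig. 1 can be sketched as follows (the task graph, priorities and clustering below are invented for illustration). Only the first task of an unmapped cluster triggers the list scheduler's processor search; every later task of that cluster inherits the cluster's mapping:

```python
exec_time = {"a": 2, "b": 3, "c": 1, "d": 4}
comm = {("a", "b"): 5, ("a", "c"): 2, ("b", "d"): 1, ("c", "d"): 6}
pred = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
priority = {"a": 15, "b": 8, "c": 11, "d": 4}
cluster = {"a": 0, "b": 0, "c": 1, "d": 0}   # from a prior clustering step
P = 2

proc_free = [0.0] * P
cluster_map = {}             # cluster id -> processor
placed, finish = {}, {}
unscheduled = set(exec_time)

def est(task, p):
    """Earliest start time of `task` on processor `p`."""
    arrivals = [finish[q] + (0 if placed[q] == p else comm[(q, task)])
                for q in pred[task]]
    return max([proc_free[p]] + arrivals)

while unscheduled:
    ready = [t for t in unscheduled if all(q in placed for q in pred[t])]
    task = max(ready, key=lambda t: priority[t])
    c = cluster[task]
    if c in cluster_map:                    # cluster already mapped:
        p = cluster_map[c]                  # no processor search needed
    else:                                   # first task of the cluster:
        p = min(range(P), key=lambda q: est(task, q))
        cluster_map[c] = p
    placed[task] = p
    finish[task] = est(task, p) + exec_time[task]
    proc_free[p] = finish[task]
    unscheduled.remove(task)

print(cluster_map, max(finish.values()))
```

Note that the per-task cost for an already-mapped cluster is only the start-time computation over the task's incoming edges, which is where the O(E) term for the remaining tasks comes from.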

3.2 Cluster Merging by Guided Load Balancing

Yang [17] proposed an algorithm based on load balancing (LB) for cluster merging. The merge criterion is the amount of computation in a cluster, defined as the sum of the execution times of the tasks in the cluster. Although the main focus of this algorithm is load balancing, it may lead to processors with sparse task occupancy. The cause is that the clustering step groups tasks locally. If cluster merging considers only the amount of computation in the clusters, i.e., without considering when that computation is supposed to be executed, some processors may end up overloaded at certain points in time.

In our algorithm for cluster merging using Guided Load Balancing (GLB) (Fig. 2), cluster priorities are based on the time the clusters are supposed to start in a system with an infinite number of processors. The start time of a cluster is defined as the earliest start time of the tasks belonging to it. The cluster starting earliest is preferred. If two clusters start at the same time, the cluster with the largest amount of computation has priority. The selected cluster is mapped onto the least used processor (i.e., the processor with the lowest amount of computation).

The reason for choosing the start time of the first task as the priority of a cluster is to use information from the dependence analysis in the previous step. Topologically ordering the tasks results in a natural order of the clusters as well. Scheduling clusters in the order they become available for execution may yield a better schedule than one considering only the amount of computation in a cluster.

The processor with the lowest workload is chosen, as opposed to choosing the processor where the cluster can start the earliest. The reason is that a cluster

is not a single schedulable unit. In the task ordering phase, the clusters on the same processor will most probably be interleaved, because of the inter-task dependences. In this case, a processor selection criterion such as the least used processor is more suitable.

Topologically sorting the clusters takes O(E) time. Mapping a cluster Ci takes O(log P) time to maintain the processor list ordered by workload and O(|Ci|) time to map the tasks in the cluster to the selected processor, where |Ci| denotes the number of tasks in cluster Ci. The resulting complexity for mapping all clusters is O(C log P + Σ|Ci|), with Σ|Ci| = V, where C is the number of clusters. As the complexity of mapping the clusters to processors is O(C log P + V), the total complexity of the cluster merging is O(C log P + E), which in the worst case is O(V log P + E). Note that in terms of complexity, our algorithm outperforms all the previously presented algorithms.
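Following the complexity argument above, the processor list can be kept in a heap so that each cluster mapping costs O(log P). A minimal sketch of GLB (the cluster data is invented; real start times would come from the clustering step):

```python
import heapq

# cluster id -> (start time of first task, execution times of its tasks)
clusters = {0: (0.0, [2, 3]), 1: (2.0, [1, 1]), 2: (2.0, [4])}
P = 2

# Priority: earliest start time first; ties broken in favor of the
# cluster with the larger amount of computation.
order = sorted(clusters,
               key=lambda c: (clusters[c][0], -sum(clusters[c][1])))

heap = [(0.0, p) for p in range(P)]     # (workload, processor)
heapq.heapify(heap)
mapping = {}
for c in order:
    load, p = heapq.heappop(heap)       # least-loaded processor, O(log P)
    mapping[c] = p
    heapq.heappush(heap, (load + sum(clusters[c][1]), p))

print(mapping)  # cluster 2 is mapped before cluster 1 by the tie-break
```

Mapping the |Ci| tasks of the selected cluster to the chosen processor (omitted here) adds the O(Σ|Ci|) = O(V) term of the analysis above.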

4 Performance and Comparison

To evaluate the performance of the proposed algorithms, we integrate the proposed merging algorithms as part of a full multistep scheduling method. As clustering algorithm we have chosen DSC [18], because it has a low complexity while still producing a good clustering. For task ordering, we use RCP [17]. For cluster merging using list scheduling, we have chosen MCP [16] for its low cost and its relatively good performance compared to other list scheduling algorithms. We use two versions of the cluster merging algorithm based on list scheduling. In the first one (DSC-MCP1) the priorities are calculated using the original task graph. In the second one (DSC-MCP2), when calculating priorities, we consider the communication delays between tasks in the same cluster to be zero. DSC-GLB denotes DSC followed by GLB for cluster merging and RCP for task ordering. We compare DSC-MCP1, DSC-MCP2 and DSC-GLB with the method presented by Yang, DSC-LB (DSC followed by LB and RCP), and with MCP. The comparison is done both in terms of completion time and in terms of the running time of the scheduling algorithms.

The following problems are used to obtain task graphs for the measurements: row LU decomposition, block LU decomposition, a divide-and-conquer problem and a stencil problem. The applications can lead to coarse-grain task graphs (i.e., task execution times larger than communication delays) or fine-grain task graphs (i.e., task execution times smaller than communication delays). Row LU decomposition is inherently a fine-grain application, while block LU decomposition is coarse-grain. For the other two problems,

different sets of task execution and communication times are used in order to vary their granularity.

As distributed system, we assume a homogeneous clique topology with no contention on communication operations and non-preemptive execution of tasks. Our experiments are performed by simulating the execution of the task graphs on such a topology. The execution and communication times on the simulated machine are the same as those used by the scheduling algorithms. For each application, we exponentially vary the number of processors, starting with 2, until no further speedup is obtained.

The experiments show that DSC-LB performs worse than MCP. This is expected, because the cluster merging step is performed without taking into consideration communication costs and the relative placement of the clusters in time. However, DSC-LB performs better for row LU decomposition, which has a very fine granularity. Here MCP leads to long schedules, because a future communication between two tasks mapped on different processors can be much larger than the small gains obtained by local decisions during scheduling.

DSC-MCP1 and DSC-MCP2 do not yield as good results as expected, despite the fact that the idea behind them seems natural and promising. The reason is that the mapping decision is taken using limited information about the cluster: a cluster mapping is based only on information about the first task of the cluster, without considering the effects of scheduling the rest of its tasks on the same processor. Note that as long as there is enough parallelism available, both versions of DSC-MCP yield results comparable to those of MCP. They perform worse when they have to extract all the available parallelism from a task graph. For both coarse-grain applications (Fig. 4 and Fig. 5) and fine-grain applications (Fig. 7 and Fig. 8), the speedups obtained for large numbers of processors are lower than for the other algorithms.
In the case of row LU decomposition (Fig. 6), which is very fine-grain, it is interesting to notice that the improvements due to the clustering step are important. DSC-MCP2 yields significantly better results in this case than DSC-MCP1. The reason is that the communication delays in this task graph are much longer than the task execution times. Therefore, zeroing the communication delays within a cluster improves the schedule.

For coarse-grain problems, DSC-GLB performs almost as well as MCP (Fig. 3, Fig. 4 and Fig. 5). The large amount of parallelism in coarse-grain problems, together with a cluster merging criterion that combines both dependence analysis and load balancing, leads to good schedules compared to MCP, DSC-MCP and DSC-LB.

[Figures 3–8: speedup versus number of processors P for DSC-MCP1, DSC-MCP2, DSC-GLB, DSC-LB and MCP. Figure 3: Block LU decomposition. Figure 4: Coarse-grain stencil application. Figure 5: Coarse-grain divide and conquer problem. Figure 6: Row LU decomposition. Figure 7: Fine-grain stencil application. Figure 8: Fine-grain divide and conquer problem.]

In the case of row LU decomposition (Fig. 6), all multistep approaches DSC-GLB, DSC-MCP and DSC-LB yield better schedules than MCP. The effect of reducing communication delays by clustering outweighs the increase in size of the schedulable unit. DSC-GLB yields a better speedup than DSC-LB and DSC-MCP2, but performs worse than DSC-MCP1. For the fine-grain applications (Fig. 7 and Fig. 8), DSC-GLB also performs better than the other multistep algorithms, but it yields lower speedup than MCP. The same behavior is observed for the other multistep algorithms. The reason is that if the granularity is not very low, the effect of reducing communication delays by clustering does not fully

compensate for the increase in size of the schedulable units.

Regarding the execution time (Fig. 9), note the significant improvement of the multistep approach compared to a simple list scheduling algorithm. When cluster merging is performed using a list scheduling algorithm, the mapping decision is taken only for the first task of each cluster, and consequently the complexity of the merging step is reduced. As the clustering step also has a low complexity, the combination of the two results in a low-complexity algorithm. In the worst case, however, in which each task is a cluster, the combination of clustering and list scheduling has the same high cost as a traditional list scheduling algorithm.

[Figure 9 plot: execution time [s] of the scheduling algorithms versus number of processors P for DSC-MCP1, DSC-MCP2, DSC-GLB, DSC-LB and MCP.]
[5] L. Finta, Z. Liu, I. Milis and E. Bampis, “Scheduling UET–UCT series–parallel graphs on two processors,” Theor. Comp. Science, vol. 162, Aug. 1996, pp. 323–340.

[6] M. Fujii, T. Kasami and K. Ninomiya, “Optimal sequencing of two equivalent processors,” SIAM J. App. Math., July 1969.

Figure 9: Execution times

DSC-LB and DSC-GLB both have a low complexity, even in the worst case. If DSC-LB is used, we may expect a 10–30% longer schedule. Overall, DSC-GLB obtains schedules comparable with MCP and better than the other multistep approaches presented, at a low cost.

5 Conclusion

Two new algorithms for cluster merging have been presented. The two algorithms are evaluated and compared with other known cluster merging algorithms and with list scheduling algorithms, both in terms of the quality of the results and in terms of complexity. The combination of task clustering, cluster merging and task ordering has a low complexity and has been shown to yield results comparable with those of higher-complexity algorithms, while in terms of complexity it outperforms other known scheduling algorithms.

An open question related to cluster merging is whether it is possible to use cluster merging after clustering with task duplication. Duplication of tasks may prove to be a good choice, as it can significantly reduce the communication time.

References

[1] I. Ahmad and Y-K. Kwok, “A new approach to scheduling parallel programs using task duplication,” in ICPP, Aug. 1994, pp. 47–51.

[2] J.Y. Colin and P. Chrétienne, “C.P.M. scheduling with small communication delays and task duplication,” Oper. Res., 1991, pp. 680–684.

[3] S. Darbha and D.P. Agrawal, “A scalable and optimal scheduling algorithm for distributed memory machines,” in ICPP, Aug. 1997.

[4] H. El-Rewini and T.G. Lewis, “Scheduling parallel program tasks onto arbitrary target machines,” JPDC, vol. 9, 1990, pp. 138–153.

[7] M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman and Co., 1979.

[8] R.L. Graham, “Bounds on multiprocessing timing anomalies,” SIAM J. App. Math., vol. 17, Mar. 1969, pp. 416–429.

[9] T.C. Hu, “Parallel sequencing and assembly line problems,” Oper. Res., vol. 9, Nov. 1961, pp. 841–848.

[10] J-J. Hwang, Y-C. Chow, F.D. Anger and C-Y. Lee, “Scheduling precedence graphs in systems with interprocessor communication times,” SIAM J. Comp., vol. 18, Apr. 1989, pp. 244–257.

[11] B. Kruatrachue and T.G. Lewis, “Grain size determination for parallel processing,” IEEE Software, Jan. 1988, pp. 23–32.

[12] J-C. Liou and M.A. Palis, “A comparison of general approaches to multiprocessor scheduling,” in IPPS, Apr. 1997, pp. 152–156.

[13] C.H. Papadimitriou and M. Yannakakis, “Scheduling interval-ordered tasks,” SIAM J. Comp., vol. 8, 1979, pp. 405–409.

[14] V. Sarkar, Partitioning and Scheduling Parallel Programs for Execution on Multiprocessors. PhD thesis, MIT, 1989.

[15] K. van Reeuwijk, H.J. Sips, H.-X. Lin and A.J.C. van Gemund, “Automap: A parallel coordination-based programming system,” Tech. Rep. 1-68340-44(1997)04, TU Delft, Apr. 1997.

[16] M-Y. Wu and D.D. Gajski, “Hypertool: A programming aid for message-passing systems,” IEEE TPDS, vol. 1, July 1990, pp. 330–343.

[17] T. Yang, Scheduling and Code Generation for Parallel Architectures. PhD thesis, Dept. of CS, Rutgers Univ., May 1993.

[18] T. Yang and A. Gerasoulis, “DSC: Scheduling parallel tasks on an unbounded number of processors,” IEEE TPDS, vol. 5, Dec. 1994, pp. 951–967.
