John von Neumann Institute for Computing
A Paradigm for Allocating Parallel Application Tasks to Heterogeneous Computing Resources on the Grid B. Arafeh, K. Day, A. Touzene
published in
Parallel Computing: Current & Future Issues of High-End Computing, Proceedings of the International Conference ParCo 2005, G.R. Joubert, W.E. Nagel, F.J. Peters, O. Plata, P. Tirado, E. Zapata (Editors), John von Neumann Institute for Computing, Jülich, NIC Series, Vol. 33, ISBN 3-00-017352-8, pp. 41-48, 2006.
© 2006 by John von Neumann Institute for Computing
Permission to make digital or hard copies of portions of this work for personal or classroom use is granted provided that the copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise requires prior specific permission by the publisher mentioned above.
http://www.fz-juelich.de/nic-series/volume33
A Paradigm for Allocating Parallel Application Tasks to Heterogeneous Computing Resources on the Grid

Bassel Arafeh, Khaled Day, and Abderezak Touzene
Department of Computer Science, Sultan Qaboos University, Muscat, Oman

Abstract

This work addresses the problem of allocating parallel application tasks for execution on heterogeneous computing resources on the Grid. The proposed allocation paradigm considers issues pertinent to the Grid environment. Our model considers the relationship between the clients and the environment on one side, and the relationship between the system providers and the environment on the other. This consideration is reflected in utilizing the client and system specifications to determine the objective function and the constraints of the mapping problem. The paradigm adopts a multilevel graph partitioning and mapping approach. The objective of the mapping is to minimize the parallel application execution time, subject to the specified constraints. The paradigm introduces an efficient heuristic for the coarsening step, called the VHEM method. The simulation study shows that this heuristic can achieve a very high reduction factor when the ratio of the number of tasks to the number of processors exceeds a threshold value. The paradigm also introduces an efficient heuristic for the refinement phase, in which the space of processor preferences for remapping includes the subset of processors on the shortest paths from the currently allocated processor to all other processors to which adjacent vertices are allocated.

1. Introduction

The concept of clustering computing resources to solve computational problems has been the focus of the high-performance computing community for more than two decades. Advances in high-speed microprocessors and computer networks have made cost-effective parallel computing based on clusters or networks of workstations (NOWs) an alternative to expensive supercomputers. However, the demand for computing power continues to grow, while most of the available machines are eventually underutilized. Recently, the use of large-scale high-performance distributed computing resources has been proposed through a new architecture, known as the computational Grid [4]. In order to construct a Grid computing environment, it is very important to have a Grid Resource Management System (RMS). The basic functions of an RMS are to accept requests for resources from users' applications and to allocate computing resources to those requests from the overall pool of Grid resources. The RMS is configured as a middleware infrastructure software system, through which resource information is disseminated, suitable resources are discovered, and applications are mapped and scheduled for execution. This work builds on the concept of Grid application schedulers, which has been adopted in research projects such as the AppLeS and GrADS projects [1] [3]. Our work focuses on the allocation of tasks of a resource-intensive parallel application to a selected pool of Grid resources for execution. The objective is to find a matching between the tasks and the set of Grid computing resources that optimizes the application completion time. We assume a static and decentralized approach, where a Grid application scheduler works on a predictive estimation of the application resource requirements provided by the client, such as the reservation period and the execution behavior. On the other
side, it works on a predictive estimation of the characteristics of the selected Grid resources for the application, such as the duration of a resource-sharing period, the resource utilization factor, and the CPU speed. In general, Bokhari [2] has shown that the general mapping problem is NP-hard. Several heuristic methods have been proposed to provide approximate solutions for parallel architectures. However, the mapping problem in computational Grids has received little attention so far. In our current work, we address the mapping problem in the context of a computational Grid based on the multilevel graph partitioning scheme [5] [6] [7]. The rest of the paper is organized as follows. Section 2 introduces our application model, system model, assumptions and the problem statement. Section 3 describes our proposed heuristic for mapping tasks of a parallel application to a pool of Grid resources. Section 4 describes the simulation and performance evaluation results. Finally, Section 5 concludes the paper.

2. Definitions and Background

2.1. System Model

The target architecture for the execution of a parallel application is a heterogeneous multi-cluster environment, formed from the distributed heterogeneous computing resources on the Grid. Accordingly, a heterogeneous computational system is modelled as a weighted undirected graph S = (P, L, β, λ), referred to as the system graph, where P is a finite set of vertices representing sites/processors of the system on the Grid, and L is a finite set of edges representing the communication links between sites/processors. Each site/processor vertex p is characterized by a specified or announced processing weight β(p), reflecting its processing cost per unit of computation. Each edge l_ij = (p_i, p_j) has a link weight λ(p_i, p_j) that denotes its communication latency (cost) per unit of communication between p_i and p_j. We assume each processor p ∈ P is characterized by a set of system parameters, based on its available resources for the Grid environment (e.g., memory capacity, CPU speed, workload, operating system, etc.). For the purpose of simplifying the system model, we assume each processor p has a declared utilization factor u(p), denoting the local workload on the processor, and a specified maximum duration δ(p) for allowing its computational resources to be shared on the Grid environment. We will refer to δ(p) as the processor's sharing period. The Grid does not enforce constraints on the network topology or the communication latency between processors; therefore, an arbitrary network topology of the system is assumed. However, we assume the system graph is connected. Although the network topology is not necessarily a complete graph, we can derive a communication latency matrix CL = [lat(p_i, p_j)] that represents the communication latency between any two processors in the network. The communication latency lat(p_i, p_j) between any two adjacent processors is equal to λ(p_i, p_j), while the communication latency between any two non-adjacent processors is the sum of the link weights on the shortest path between them. The matrix is symmetric, since all links are assumed to represent full duplex communication.
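Since the system graph is connected but not necessarily complete, the latency matrix CL can be derived with any all-pairs shortest-path algorithm. The following sketch is our own illustration, not the authors' code; the function and variable names are assumptions. It uses Floyd-Warshall over the link weights λ(p_i, p_j).

# A minimal sketch (not from the paper) of deriving the communication latency
# matrix CL = [lat(p_i, p_j)] from the system graph's link weights lambda(p_i, p_j)
# via all-pairs shortest paths (Floyd-Warshall). Names are illustrative.
import math

def latency_matrix(num_procs, links):
    """links: dict mapping (i, j) -> lambda(p_i, p_j) for each undirected link."""
    # Initialize: 0 on the diagonal, link weight for adjacent pairs, infinity otherwise.
    lat = [[0.0 if i == j else math.inf for j in range(num_procs)] for i in range(num_procs)]
    for (i, j), w in links.items():
        lat[i][j] = lat[j][i] = w          # full-duplex links => symmetric matrix
    # Floyd-Warshall: shortest-path latency between non-adjacent processors.
    for k in range(num_procs):
        for i in range(num_procs):
            for j in range(num_procs):
                if lat[i][k] + lat[k][j] < lat[i][j]:
                    lat[i][j] = lat[i][k] + lat[k][j]
    return lat

# Example: a 4-processor line topology p0 - p1 - p2 - p3.
CL = latency_matrix(4, {(0, 1): 2.0, (1, 2): 1.0, (2, 3): 3.0})
assert CL[0][3] == 6.0   # sum of link weights on the shortest path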
2.2. Application Model

In this work, a weighted undirected graph G = (V, E, ω, γ), known as a task interaction graph (TIG), is used to model a parallel application; we refer to it as the application graph. Here, V is a finite set of vertices representing the application tasks, and E is a finite set of edges, E = {(v_i, v_j) | v_i, v_j ∈ V}, representing data dependencies between vertices v_i and v_j. However, an edge (v_i, v_j) does not impose any precedence relation between the incident vertices v_i and v_j. Each vertex v has a computation weight ω(v) that represents the amount of computation required by this
task to accomplish a unit progress. Each edge e_ij = (v_i, v_j) has a communication cost γ(e_ij) that represents the amount of data to be communicated between v_i and v_j to advance a unit progress. The execution behavior of the parallel application is assumed to pass through a number of iterations. Each iteration forms a unit progress in the execution behavior of the application, and consists of a communication phase followed by a computation phase. Therefore, we assume that the modelled application has a requirement for executing the tasks iteratively at a certain rate, ρ times per second, referred to as the application execution rate. Also, the application specifies a maximum duration Θ, referred to as the application reservation period, that reflects the total time to be reserved by the application on the computational Grid in order to perform all required iterations.

2.3. Problem Definition

Given an application graph G = (V, E, ω, γ) and a system graph S = (P, L, β, λ), we need to find a mapping π : V → P, such that each vertex v ∈ V is assigned to a partition Π(p) that is allocated to a processor p in the system graph for execution. The objective of the mapping is to minimize the application execution time, subject to the application requirements and the system constraints. In this work, we assume there is no overlapping between computation and communication. Therefore, the execution time of a task is determined by the sum of its computation time and all the communication costs with its dependent tasks. Accordingly, the execution time of a task v_i on a processor p is defined as

ET(v_i, p) = ω(v_i)·β(p) + Σ_{q ∈ P, q ≠ p} Σ_{v_k ∈ Π(q)} γ(v_i, v_k)·lat(p, q)    (1)
where Π(q) is the partition of the set of vertices V that is mapped to processor q. The execution time of partition Π(p) on processor p is then defined as

ET(Π(p)) = Σ_{v_i ∈ Π(p)} { ω(v_i)·β(p) + Σ_{q ∈ P, q ≠ p} Σ_{v_k ∈ Π(q)} γ(v_i, v_k)·lat(p, q) }    (2)

Hence, the parallel application execution time, ET, is given by

ET = max_{p ∈ P} { ET(Π(p)) }    (3)
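To make equations (1)-(3) concrete, the following sketch evaluates the per-task, per-partition, and application execution times for a given mapping π and latency matrix CL. It is our own illustration, not the authors' implementation; all data structures and names are assumptions.

# A hedged sketch (ours) of equations (1)-(3): per-task execution time,
# per-partition execution time, and the application execution time ET
# under a given mapping. All data below is illustrative.

def task_time(v, p, mapping, omega, beta, gamma, lat):
    """Equation (1): ET(v, p) = omega(v)*beta(p) + sum of gamma(v, v_k)*lat(p, q)
    over all neighbors v_k mapped to other processors q."""
    t = omega[v] * beta[p]
    for (a, b), g in gamma.items():
        if v in (a, b):
            other = b if a == v else a
            q = mapping[other]
            if q != p:
                t += g * lat[p][q]
    return t

def partition_time(p, mapping, omega, beta, gamma, lat):
    """Equation (2): sum of ET(v, p) over all vertices mapped to p."""
    return sum(task_time(v, p, mapping, omega, beta, gamma, lat)
               for v, q in mapping.items() if q == p)

def application_time(procs, mapping, omega, beta, gamma, lat):
    """Equation (3): ET = max over processors of ET(Pi(p))."""
    return max(partition_time(p, mapping, omega, beta, gamma, lat) for p in procs)

# Toy example: 3 tasks on 2 processors.
omega = {0: 4.0, 1: 2.0, 2: 3.0}          # computation weights omega(v)
gamma = {(0, 1): 1.0, (1, 2): 2.0}        # edge communication costs gamma(v_i, v_j)
beta  = {0: 1.0, 1: 0.5}                  # processing cost per unit computation beta(p)
lat   = [[0.0, 2.0], [2.0, 0.0]]          # latency matrix CL
mapping = {0: 0, 1: 0, 2: 1}              # pi: task -> processor
print(application_time([0, 1], mapping, omega, beta, gamma, lat))   # 10.0, set by p0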
Then, the objective function for partitioning and mapping a parallel application to a heterogeneous system on the Grid is to minimize ET subject to the following constraints:

δ(p)·(1 − u(p)) ≥ Θ,  ∀p ∈ P    (4)

ET(Π(p)) ≤ δ(p)·(1 − u(p)) / N,  ∀p ∈ P    (5)
where N represents the maximum number of iterations that can be executed within a reservation period for the parallel application. Inequality (4) specifies the relationship between the application reservation period Θ, the processors' sharing periods δ(p), and the utilization factors u(p): the total time available on any processor p should not be less than the application's reservation period Θ. This constraint may serve as a rule for selecting a processor for the execution of the parallel application on the Grid environment. The system constraint in inequality (5) defines the amount of workload that is acceptable to each processor p, given its utilization factor u(p) and sharing period δ(p). The amount of workload is determined based on the expected number of iterations to be performed by the application during a maximum reservation period Θ. Accordingly, the execution time of a partition Π(p) should not exceed the amount of time that processor p can allocate per iteration.
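Both constraints can be checked directly from the declared processor parameters, as in the following sketch (ours; Theta, N, and the dictionaries are illustrative placeholders, and the check assumes the reconstruction of inequalities (4) and (5) given above).

# A small sketch (ours, not the authors') of the feasibility checks in
# inequalities (4) and (5). delta and u are per-processor dicts; Theta is the
# application reservation period; N is the maximum number of iterations; et_part
# maps each processor to the execution time ET(Pi(p)) from equation (2).

def processor_eligible(p, delta, u, Theta):
    """Inequality (4): the available shared time on p covers the reservation period."""
    return delta[p] * (1.0 - u[p]) >= Theta

def partition_feasible(p, et_part, delta, u, N):
    """Inequality (5): ET(Pi(p)) fits within the time p can allocate per iteration."""
    return et_part[p] <= delta[p] * (1.0 - u[p]) / N

# Example: one processor sharing 3600 s at 25% local load, for a 2000 s reservation.
delta, u = {0: 3600.0}, {0: 0.25}
print(processor_eligible(0, delta, u, Theta=2000.0))        # True: 2700 >= 2000
print(partition_feasible(0, {0: 2.0}, delta, u, N=1000))    # True: 2.0 <= 2.7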
3. Multilevel Clustering, Mapping and Refinement Paradigm

In this section, we introduce a multilevel paradigm for clustering an irregular TIG into a contracted graph with a reduced number of vertices. The process of contraction, referred to as coarsening, is carried out over several levels, until a threshold number of vertices is reached. The vertices of the coarsest TIG (CTIG) can then be mapped to the system processors, with the objective of minimizing the application execution time. Assigning the vertices of the CTIG to the processors generates the initial graph partitions. However, an optimal or suboptimal initial partitioning and mapping may not remain so for the original graph. An iterative optimization procedure can therefore be applied at each level, through which a coarse graph is expanded, or returned to its parent graph, for further refinement. At each level of expansion, the optimization scheme is used to reduce the application execution time. The refinement approach is based on considering the possible migration of vertices to other processors, such that the maximum execution time among all processors is minimized. In the following, we introduce each phase of the paradigm as applied to mapping parallel applications to heterogeneous processors in the Grid environment.

3.1. Multilevel Clustering Phase

In this phase, the original graph (TIG) is contracted/coarsened into a sequence of smaller graphs G_i = (V_i, E_i, ω_i, γ_i), starting from the original graph G_0 = (V_0, E_0, ω_0, γ_0), such that |V_i| < |V_{i−1}|. A coarser graph at level i is obtained by collapsing edges at level i−1. The collapse of an edge e_{i−1} = (v_1, v_2) at level i−1 generates a single vertex u ∈ V_i at level i, where ω_i(u) = ω_{i−1}(v_1) + ω_{i−1}(v_2). The approach relies on finding a maximal independent subset of graph edges, or a matching of vertices, and then collapsing them. Two edges are called independent if they are not incident on the same vertex. It follows that a subset of graph edges is independent and maximal if no more edges can be added to the subset without making two edges incident on the same vertex. The maximal independent subset of graph edges can be generated by visiting vertices in a random order, matching each unmatched vertex with one of its unmatched neighbors chosen at random. Since the objective of the partitioning and mapping is to minimize the maximum execution time, it is beneficial for the clustering phase to minimize the total communication cost in the CTIG. Consequently, the coarsening steps should collapse the most heavily weighted edges, as proposed by Karypis and Kumar [5] and referred to as heavy edge matching (HEM). In this work, we adopt a modified approach to HEM: only edges with communication costs exceeding the average edge communication cost are selected for matching a vertex with one of its unmatched neighbors. We call this scheme Very Heavy Edge Matching (VHEM). As such, VHEM guides the coarsening step towards achieving an effective reduction in the total communication cost at each level, while maintaining the ability to perform refinement at different resolution levels. However, the subset of collapsed edges is not maximal as defined previously. To overcome this drawback, we relax the requirement of independence among the subset of collapsed edges: if an unmatched vertex has no unmatched neighboring vertices joined to it by a very heavy edge, it is allowed to be matched with a matched neighbor, provided that the cost of the edge joining them is also a very heavy edge weight.
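The following sketch reflects our reading of the VHEM description above; it is not the authors' code, and the tie-breaking (picking the heaviest qualifying edge) and data structures are our own assumptions. It performs one coarsening step: vertices are visited in random order, matched across edges whose cost exceeds the average, and, failing that, allowed to join an already matched neighbor over a very heavy edge. Construction of the coarse edge weights between clusters is omitted.

# A hedged sketch of one VHEM coarsening step. gamma maps undirected edges
# (frozensets of endpoints) to communication costs; omega maps vertices to weights.
import random

def vhem_coarsen(vertices, omega, gamma):
    """Return clusters of matched vertices and the coarser vertex weights."""
    avg = sum(gamma.values()) / len(gamma) if gamma else 0.0   # "very heavy" threshold
    cluster_of = {}                                   # vertex -> cluster id
    clusters = {}                                     # cluster id -> set of vertices
    for v in random.sample(list(vertices), len(vertices)):
        if v in cluster_of:
            continue
        # Neighbors of v over edges heavier than the average cost.
        heavy = [(g, u) for e, g in gamma.items() if v in e and g > avg
                 for u in e if u != v]
        unmatched = [(g, u) for g, u in heavy if u not in cluster_of]
        if unmatched:                                 # match with an unmatched neighbor
            _, u = max(unmatched)
            cid = len(clusters)
            clusters[cid] = {v, u}
            cluster_of[v] = cluster_of[u] = cid
        elif heavy:                                   # relaxation: join a matched neighbor
            _, u = max(heavy)
            cid = cluster_of[u]
            clusters[cid].add(v)
            cluster_of[v] = cid
        else:                                         # no very heavy edge: stay a singleton
            cid = len(clusters)
            clusters[cid] = {v}
            cluster_of[v] = cid
    coarse_omega = {cid: sum(omega[v] for v in vs) for cid, vs in clusters.items()}
    return clusters, coarse_omega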
Furthermore, the coarsening phase must resolve three main issues. These are related to controlling the rate at which the size of a TIG is reduced, the threshold value for terminating the clustering phase, and the weight of a coarser vertex. For the first issue, the coarsening step at each level can be stopped when the generated coarser graph is smaller than the finer graph by a certain factor (e.g., 1.5-2.0). The aim is to control the rate at which the graph is reduced at each level, in order to allow refinement to take place at different resolution levels. For the second issue, the coarsening process
can be stopped when the number of vertices in the coarser graph becomes less than or equal to the number of processors in the system graph; this is the approach taken in this work. For the third issue, the weight of a coarser vertex is constrained to be less than or equal to an upper bound. The aim is to limit the rate at which the weights of certain vertices grow. The upper bound ub(v) on the weight of a coarser vertex v is determined from the rate of iterative execution ρ, the average communication latency in the system λ_a, the total communication cost c_iv incident on a vertex v ∈ V_i at level i, and the average processor speed s_a. That is, ub(v) = T(v)·s_a, where T(v) is the estimated maximum period of computation per iteration, given by T(v) = 1/ρ − λ_a·c_iv.

3.2. Initial Mapping Phase

The second phase performs an initial mapping of the coarsest graph G_c = (V_c, E_c, ω_c, γ_c) to the system processors, with the objective of minimizing the application's computation time. There is no need to apply a general optimization technique at this phase, since the refinement phase will apply an incremental refinement procedure at each uncoarsening step. Therefore, the initial mapping allocates tasks with heavy computation weights ω(v) to processors having higher speeds. The application graph vertices V_c are sorted in descending order of their computation weights. Similarly, the system graph processors are sorted in descending order of their speeds. However, a processor's speed is adjusted to take into account the specified processor's workload; that is, the adjusted processor's speed is s'(p) = s(p)·(1 − u(p)), where s(p) = 1/β(p). Accordingly, the application's tasks are mapped to the system processors that have the same corresponding rank order, as sketched below.
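A minimal sketch of this rank-order mapping follows (ours; names are illustrative). It assumes the coarsening phase has already reduced the graph so that |V_c| ≤ |P|, as in the termination rule adopted above.

# A minimal sketch (ours) of the initial mapping phase: the heaviest coarse
# vertices go to the fastest processors, where speed is adjusted for local workload.

def initial_mapping(omega_c, beta, u):
    """omega_c: coarse vertex -> computation weight; beta: processor -> cost per
    unit computation; u: processor -> utilization factor. Returns vertex -> processor."""
    # Adjusted speed s'(p) = s(p) * (1 - u(p)), with s(p) = 1 / beta(p).
    adj_speed = {p: (1.0 / beta[p]) * (1.0 - u[p]) for p in beta}
    tasks = sorted(omega_c, key=omega_c.get, reverse=True)        # heaviest first
    procs = sorted(adj_speed, key=adj_speed.get, reverse=True)    # fastest first
    # Map by corresponding rank; assumes |V_c| <= |P| after coarsening.
    return {v: procs[i] for i, v in enumerate(tasks)}

print(initial_mapping({"a": 5.0, "b": 9.0}, {0: 1.0, 1: 0.5}, {0: 0.1, 1: 0.4}))
# {'b': 1, 'a': 0}: the heavier task b goes to p1, whose adjusted speed
# (2.0 * 0.6 = 1.2) beats p0's (1.0 * 0.9 = 0.9).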
3.3. Refinement Phase: A Greedy Remapping Approach for k-way Partitioning

In this phase, the mapping π_c of the coarsest graph G_c is projected back to the original graph G_0 through several levels of refinement. Starting from π_c, the mapping π_i at level i is obtained from the mapping π_{i+1} at level i+1 by assigning the mapping π_{i+1}(v) of each vertex v ∈ V_{i+1} in the coarser graph G_{i+1} to each vertex u ∈ V_i in the finer graph G_i that was merged to produce the vertex v. This operation is called the uncoarsening step, and the whole process is referred to as the uncoarsening phase. Associated with each uncoarsening step at level i there is a refinement step, which is applied to reduce the application execution time by checking for vertex migration across the boundaries of the partitions.

3.3.1. The Gain Function

Let p_m be the processor having the maximum execution time over all processors under the current mapping π of G. Also, let π' be the mapping of G if a vertex v ∈ V migrates from processor p_m to any other processor p_q ∈ P. Given a cost function C(π) for a mapping π, the fruitfulness of migrating a vertex v ∈ Π(p_m) to a processor p_q is given by a gain function gain(v, p_m, p_q), such that

gain(v, p_m, p_q) = C(π) − C(π')    (6)

where

C(π) = ET(Π(p_m)) = max_{p ∈ P} { ET(Π(p)) }   and   C(π') = ET(Π(p_q)) + ET(v, p_q)    (7)

Hence, for the migration of a vertex v from processor p_m to processor p_q to be fruitful, the gain must be strictly positive.
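In code, the gain reduces to a comparison of precomputed execution times, as in this sketch (ours, not the authors' implementation; et_part and et_task_on_pq stand for the quantities defined by equations (2) and (1)).

# A hedged sketch (ours) of the gain function in equations (6)-(7), written in
# terms of precomputed quantities: et_part[p] = ET(Pi(p)) from equation (2) and
# et_task_on_pq = ET(v, p_q) from equation (1). All names are illustrative.

def gain(et_part, p_m, p_q, et_task_on_pq):
    """gain(v, p_m, p_q) = C(pi) - C(pi'):
    C(pi)  = ET(Pi(p_m)), the load of the most loaded processor, and
    C(pi') = ET(Pi(p_q)) + ET(v, p_q), the load p_q would carry after receiving v."""
    return et_part[p_m] - (et_part[p_q] + et_task_on_pq)

# Example: p0 is the bottleneck at ET = 10; moving a vertex that would cost 3 on
# p1, currently at ET = 5, yields a gain of 2 > 0, so the move is fruitful.
print(gain({0: 10.0, 1: 5.0}, p_m=0, p_q=1, et_task_on_pq=3.0))   # 2.0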
3.3.2. The Refinement Step

The gain function gain(v, p_m, p_q) for a vertex v ∈ Π(p_m) can be calculated with respect to every other partition Π(p_q), where p_m ≠ p_q. Basically, priority is given to moving a vertex v to the partition Π(p_q) that produces the maximum gain in the cost function over all possible migrations of vertices on the boundary of Π(p_m) with other partitions. Based on the cost function, the processor p_m having the maximum execution time, ET_max, must be considered the focal point for the next refinement step. The candidate vertices for migration are those on the borders of Π(p_m) with other partitions. A candidate border vertex v can be moved to a partition Π(p_q) only if the move would not violate the system constraint, that is,

ET(Π(p_q)) + ET(v, p_q) ≤ δ(p_q)·(1 − u(p_q)) / N    (8)
Our paradigm deviates from the general multilevel approach to partitioning in the way a processor preference is selected. In the general multilevel approach, a full processor preference may be considered, where a vertex may migrate to any other processor [6] [7]. In this case, the cost of a full processor preference is O(|P|²), since the computation of the gain function is O(|P|·d_av), where d_av is the average degree in the system graph, and there are (|P| − 1) possible migrations. Other alternatives can be employed to reduce the cost by restricting the migration of vertices to adjacent partitions or to adjacent processors [7]; the cost of computing the gain function for selecting the best target processor is then unlikely to reach O(|P|²). However, the latter two methods are not guaranteed to find the maximum gain. Therefore, we extend the space of possible vertex migrations beyond adjacent partitions/processors to include the subset of processors on the shortest paths from p_m to all processors to which the neighbors of a border vertex are allocated. This approach is more costly than considering migrations to only adjacent partitions or adjacent processors, but its cost is not expected to reach O(|P|²) if the number of processors is not too small. A sketch of this candidate-processor selection is given below.
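The following sketch is our own illustration of building this restricted preference set, not the authors' implementation; the graph representation and names are assumptions. For a border vertex v on the bottleneck processor p_m, it takes the processors hosting the neighbors of v and collects every processor lying on a shortest path from p_m to each of them.

# A hedged sketch (ours) of the restricted processor-preference set for a border
# vertex v currently on the bottleneck processor p_m. links maps undirected
# processor links (i, j) to lambda(i, j); app_edges is an iterable of task edges.
import heapq

def shortest_path(links, src, dst):
    """Dijkstra over link weights; returns the processors on one shortest path
    from src to dst (inclusive), assuming the system graph is connected."""
    dist, prev, heap = {src: 0.0}, {}, [(0.0, src)]
    while heap:
        d, x = heapq.heappop(heap)
        if x == dst:
            break
        for (a, b), w in links.items():
            if x in (a, b):
                y = b if x == a else a
                if d + w < dist.get(y, float("inf")):
                    dist[y], prev[y] = d + w, x
                    heapq.heappush(heap, (d + w, y))
    path, x = [dst], dst
    while x != src:
        x = prev[x]
        path.append(x)
    return path[::-1]

def preference_set(v, p_m, mapping, app_edges, links):
    """Candidate target processors for migrating border vertex v off p_m."""
    neighbor_procs = {mapping[u] for e in app_edges if v in e
                      for u in e if u != v and mapping[u] != p_m}
    candidates = set()
    for q in neighbor_procs:
        candidates.update(shortest_path(links, p_m, q))
    candidates.discard(p_m)          # v is already on p_m
    return candidates

In the refinement step, the gain of equation (6) would then be evaluated only for candidates in this set that also satisfy constraint (8).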
4. Simulation and Performance Study

A simulation study of our paradigm for allocating application tasks to heterogeneous computing resources on the Grid has been conducted. All phases of the paradigm have been implemented and tested using randomly generated graphs for the application and system models. We have generated application graphs randomly with sizes ranging from 100 to 6000 vertices, with vertex degrees in the range 2-20. For the system model, we have generated graphs representing arbitrary networks consisting of 16, 32, 64, 128, 256, and 512 processors. The system graphs are connected, with a maximum node degree equal to 0.25 of the graph size for all models, except for the 512-processor model, where it is limited to a maximum of 32. We have implemented the multilevel clustering phase based on both the Karypis and Kumar HEM method and our VHEM method. The reduction factor goal for each coarsening step is set to about 1.82. However, a coarsening step may stop as soon as the coarser graph size reaches a value less than or equal to the number of processors, or an upper bound on the number of iterations has been reached. The performance evaluation of both methods for application graphs with 3000 and 6000 vertices, mapped to systems with different numbers of processors, is shown in Figure 1. The plots in Figure 1(a) show that the VHEM method always generates coarsest graphs with less total communication volume than the HEM method. However, the cost of the VHEM method in terms of execution time is slightly higher than that of the HEM method, as shown in Figure 1(b). The major advantage of the VHEM method is its ability to achieve a very high reduction in the total communication volume when the ratio of the application graph size to the number of processors is very high. For example, a breakthrough of more than a 0.99 reduction in the total communication volume is achieved for application graphs with 3000 vertices when they are mapped to 16, 32, and 64 processors. Similarly, we have achieved a breakthrough in the reduction of the communication volume for application graphs with 6000 vertices when they are mapped to 16, 32, 64, and 128 processors. In general, a conservative assessment of our VHEM method is that it can produce efficient clustering of application graph models when the ratio of the application graph size to the number of processors is greater than 45. Accordingly, it is expected that effective clustering for system models with 256 and 512 processors can be achieved with application graph sizes exceeding 12000 and 24000 vertices, respectively. It is to be noted that the execution cost of the VHEM method maintains the same level of overhead over the HEM method irrespective of the breakthrough points.

[Figure 1. The clustering phase execution time and the total communication weight reduction versus the number of processors, for HEM and VHEM on application graphs with 3000 and 6000 vertices.]

[Figure 2. Comparisons between the HEM and VHEM coarsening techniques on the speedup factor and the execution time of the refinement phase for application graphs with 3000 and 6000 vertices.]

We have adopted the policy of remapping a border vertex to the best processor on the shortest path to one of its neighboring processors that reduces the application's execution time. Figure 2(a) shows the execution time of the refinement phase versus the number of processors. The plots
indicate, implicitly, the effect of the number of border vertices tested for migration on the execution time. The results of the refinement phase in reducing the application time are shown by the plots of the speedup factor versus the number of processors in Figure 2(b). It is noted that little speedup can be achieved when the computation weights of the partitions are very heavy. This situation mainly occurs when the refinement phase is applied after using the VHEM method, for which a breakthrough in the reduction of the communication volume of the coarsened graph has been achieved. The same effect of the VHEM clustering technique on the refinement phase can be seen from the low execution time at the numbers of processors for which the breakthrough in the reduction of the communication volume occurred.

5. Conclusion

This work has introduced a multilevel graph partitioning paradigm for mapping parallel application tasks to heterogeneous computing resources on the Grid. The contribution of the work is focused on three aspects. First, the paradigm considers the dynamic relationship between the application clients and the system providers in the Grid environment, through the requirements provided by each side. This consideration is reflected in utilizing the client and system specifications to determine the objective function and the constraints of the mapping problem. Second, the paradigm introduces an efficient heuristic for the coarsening step, called the VHEM method. The simulation results show that this heuristic can achieve a very high reduction factor in the communication volume, when the ratio of the number of tasks to the number of processors exceeds a threshold value, without extra overhead in the execution time. Third, the paradigm introduces an efficient heuristic for the refinement phase for remapping border vertices to other processors. The heuristic uses a space of processor preferences for migration that includes the subset of processors on the shortest paths from the processor to which a vertex is currently allocated to all other processors to which adjacent vertices are allocated. Future work should concentrate on enhancing the initial mapping phase and the refinement phase, and should develop procedures to reduce the effect of a greedy technique for the coarsening step, like VHEM, on the effectiveness of the refinement process.

References

[1] Berman, F., Wolski, R., Casanova, H., et al., "Adaptive Computing on the Grid Using AppLeS," IEEE Trans. Parallel and Distributed Systems, vol. 14, no. 4, April 2003, pp. 369-382.
[2] Bokhari, S. H., "On the Mapping Problem," IEEE Trans. Computers, vol. C-30, no. 3, March 1981.
[3] Cooper, K., Dasgupta, A., Kennedy, K., et al., "New Grid Scheduling and Rescheduling Methods in the GrADS Project," Workshop for Next Generation Software, Santa Fe, New Mexico, April 2004.
[4] Foster, I., Kesselman, C., The Grid: Blueprint for a New Computing Infrastructure, Second Edition, Elsevier Inc., 2004.
[5] Karypis, G., and Kumar, V., "Multilevel k-way Partitioning Scheme for Irregular Graphs," Journal of Parallel and Distributed Computing, vol. 48, no. 1, 1998, pp. 96-129.
[6] Kumar, S., Das, S. K., and Biswas, R., "Graph Partitioning for Parallel Applications in Heterogeneous Grid Environments," Proceedings of the 16th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2002), Fort Lauderdale, Florida, April 15-17, 2002.
[7] Walshaw, C., and Cross, M., "Multilevel Mesh Partitioning for Heterogeneous Communication Networks," Future Generation Computer Systems, vol. 17, 2001, pp. 601-623.