Performance and Cost Optimization for Multiple Large-scale Grid Workflow Applications



Rubing Duan, Radu Prodan, Thomas Fahringer Institute of Computer Science, University of Innsbruck Email: [email protected]

ABSTRACT. Scheduling large-scale applications on the Grid is a fundamental challenge and is critical to application performance and cost. Large-scale applications typically contain a large number of homogeneous and concurrent activities which are the main bottlenecks, but which also open great potential for optimization. This paper presents a new formulation of the well-known NP-complete scheduling problems and two novel algorithms that address them. The optimization problems are formulated as sequential cooperative games among workflow managers. Experimental results indicate that we have successfully devised and implemented a group of effective, efficient, and feasible approaches that produce solutions with significantly better performance and cost than traditional algorithms. Our algorithms have considerably lower time complexity and can assign 1,000,000 activities to 10,000 processors within 0.4 seconds on one Opteron processor. Moreover, the solutions can be practically performed by workflow managers, and violations of QoS can be easily detected, which is critical to fault tolerance.

1. INTRODUCTION

The Grid is a heterogeneous and geographically distributed computing environment with different access cost models and dynamically varying load and availability conditions [30]. The execution of workflows in a Grid environment must take such resource variability and economic factors into account, which makes dynamic performance and cost optimization one of the most essential issues in attaining high performance. Most current Grid workflow execution environments [1, 2, 5, 6, 9, 11, 16, 20] focus on improving performance by redistributing workload, but they provide relatively simple system models and lack effective functionality to support large-scale Grid applications. From an end user's perspective, minimizing both cost and execution time is desirable, whereas from the system's perspective, fairness is a good motivation. To the best of our knowledge, no existing scheme deals with all of these goals in an integrated and effective manner: performance, fairness, users' cost, and optimization are not considered together by most current workflow execution environments. Since there are potentially many workflows on the Grid competing for the use of available resources, several issues arise and ought to be dealt with: (1) efficient resource allocation
for different workflows taking into account their different needs and performance requirements; (2) the notion of fairness; (3) the ability to implement the allocation scheme in a distributed manner with minimal makespan; (4) the issue of cost if workflows are assigned to resources according to (1) and (2) above. In this paper, we address these four issues by proposing two optimization schemes for a class of scientific Grid workflows characterized by a large number of homogeneous activities. The first scheme aims to minimize the expected execution time of workflow applications, which we formulate as an NP-complete makespan minimization problem [23] and for which we propose a more efficient, effective, and feasible solution than existing algorithms [8, 12, 14, 19]. The second scheme minimizes the cost of execution while guaranteeing a user-specified deadline, which we solve in three steps: deadline assignment, workflow partitioning, and cost optimization. We compare the performance of our algorithms with six heuristics (in Section 4): Opportunistic Load Balancing (OLB) [14, 19], Minimum Execution Time (MET) [14, 19], Minimum Completion Time (MCT) [19], Sufferage [8, 12], Min-min [8, 12], and Max-min [8, 12]. Experimental results indicate that our algorithms are superior in efficiency, performance, fairness, and cost to these heuristics. However, our algorithms may not work well when a scheduling problem cannot be properly formulated as a typical and solvable game, comprised of phases which can be specifically defined so that game players can bargain without dependencies between them. Our contributions are both theoretical and experimental:
• More efficient algorithms: our novel algorithms can assign one million activities to ten thousand processors within 0.4 seconds on one Opteron 2.4GHz machine (see Section 4);
• More effective solutions: the solutions are closer to the optimal schedule; in simple cases, optimal solutions can even be obtained (see the example in Section 3);
• More realistic Grid models: existing approaches [8, 12, 14, 19] assume direct access to individual processors upon scheduling decisions, which acts against the independence of the local administration policies of each site. Our approach uses the local queuing system as the entry negotiation point to each site;
• More feasible resource management scheme: the scheduling and rescheduling solutions can be easily executed by the workflow managers.


The rest of this paper is structured as follows. In Section 2, we present the abstract workflow and Grid models, and define the multi-workflow performance and cost optimization problems. Section 3 describes the solutions for the performance and cost optimization schemes. Section 4 presents a comparative experimental study of our algorithms. In Section 5 we review related work. Section 6 concludes the paper with a summary and a short discussion of future research.

This work is partially funded by the European Union through the IST-034601 edutain@grid and FP6-004265 CoreGRID projects.


2. MODEL

This section describes our abstract workflow and Grid models, and defines the optimization problems addressed.

2.1 Workflow model

We focus our work on large-scale workflow applications that are characterized by a high number (thousands to millions) of independent, concurrent, and homogeneous activities that dominate the performance of the applications. For example, Figure 1 depicts three real applications that we use as case studies in our work: ASTRO (astronomy) [15], WIEN2k (chemistry) [13], and MeteoAG (meteorology) [29]. The details of these applications are described in Section 4. The sources of their performance bottlenecks are exactly this kind of activities, for example poten and pgroups in ASTRO, lapw1 and lapw2 in WIEN2k, and CaseInit, RamsMakevfile, RamsInit, or Raver in MeteoAG (see http://www.rams.com/). Currently, most related work only considers workflows with tens or a few hundreds of activities, which are not realistic large-scale applications. In ASTRO, for instance, the number of grid cells (i.e. the number of pgroups and poten activities) of a real simulation is 128^3. In WIEN2k, the number of lapw1 and lapw2 parallel activities may be several thousands for a good density of states. In MeteoAG, the number of parallel activities (e.g. CaseInit) could be arbitrarily large, because there is no lower limit to the domain or the mesh cell size of the model's finite difference grid. In contrast, sequential activities are relatively trivial in these applications, hence they can be served and scheduled on demand on the fastest available processor, since a few sequential activities do not really affect the performance of large-scale workflows. Based on the above motivation, we propose the following abstract workflow definition to fulfill the requirements of our pilot large-scale workflows.

Definition 2.1 (Workflow) Let WF = (ACS, CFD, DFD) denote a workflow application, where ACS = {AC^(k) | k = 1, ..., K} is the set of so-called activity classes. We define an activity class AC^(k), (k = 1, ..., K), as a set of activities which have the same activity type and can be concurrently executed. The term activity type refers to a functional description of activities such as matrix multiplication, a Fast Fourier Transform, or poten, pgroups, lapw1, lapw2, etc. as shown in Figure 1. In other words, AC^(k) is a set of homogeneous and parallel activities AC^(k) = {A^(k,j) | j = 1, ..., n_k}, where n_k is the number of activities in activity class AC^(k) and N = Σ_{k=1}^{K} n_k. Each activity A^(k,j) ∈ AC^(k) (j = 1, 2, ..., n_k) is distinguished by the activity class AC^(k) and an identifier j within the class. An atomic or sequential activity is an activity class of cardinality one. Finally, CFD = {(AC_source ≺_c AC_sink) | AC_source, AC_sink ∈ ACS} is the set of control flow dependencies, and DFD = {(AC_source ≺_d AC_sink) | AC_source, AC_sink ∈ ACS} is the set of data flow dependencies.
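To make the abstract model concrete, the sketch below shows one possible in-memory representation of Definition 2.1; the Python class and field names are illustrative assumptions, not part of the paper, and the WIEN2k cardinalities are invented.

    from dataclasses import dataclass, field
    from typing import List, Set, Tuple

    @dataclass
    class ActivityClass:
        """A set of homogeneous, concurrently executable activities AC^(k)."""
        name: str            # activity type, e.g. "lapw1" or "poten"
        cardinality: int     # n_k, number of activities in the class

    @dataclass
    class Workflow:
        """WF = (ACS, CFD, DFD) from Definition 2.1."""
        activity_classes: List[ActivityClass]
        control_flow: Set[Tuple[str, str]] = field(default_factory=set)  # (source, sink)
        data_flow: Set[Tuple[str, str]] = field(default_factory=set)

        def total_activities(self) -> int:
            """N = sum of n_k over all activity classes."""
            return sum(ac.cardinality for ac in self.activity_classes)

    # A WIEN2k-like workflow with two large parallel classes and two sequential ones
    wien2k = Workflow(
        activity_classes=[
            ActivityClass("lapw0", 1),
            ActivityClass("lapw1", 5000),
            ActivityClass("lapw2", 5000),
            ActivityClass("mixer", 1),
        ],
        control_flow={("lapw0", "lapw1"), ("lapw1", "lapw2"), ("lapw2", "mixer")},
    )
    print(wien2k.total_activities())   # 10002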

2.2 Optimization problem

The problems of multi-workflow performance and cost optimization can be transformed into a classical NP-complete problem: the problem of scheduling independent jobs on heterogeneous computational resources. In the following definitions, we assume that the expected execution time of activities is available from a performance prediction service [18]. The expected execution time consists of two components: data preparation time and CPU execution time. The performance prediction service [18], based on a training phase and statistical methods, supplies the data preparation times and CPU execution times required by our algorithms.


[Figure 1: Real world workflow examples — the MeteoAG, WIEN2k, and ASTRO workflow graphs. The graphs contain activity classes such as CaseInit, RamsMakevfile, RamsInit, Raver, RamsHist, RevuDump in MeteoAG; lapw0, lapw1, lapw2, lapw2_FERMI, sumpara, lcore, mixer, converged in WIEN2k; and nbody, poten, pgroups, hydro, galaxyformation in ASTRO. The legend distinguishes activity classes that are performance bottlenecks from activity classes of cardinality one.]

Usually, for our pilot applications, the data preparation time is small compared to the CPU execution time.

Definition 2.2 (Performance (Makespan) Optimization Problem) Suppose we have W workflows consisting of a set of N activities which can be categorized into K different activity classes, where the expected execution time of the activities in each class is p^(k) = {p_1^(k), ..., p_S^(k)}, and p_i^(k) is the expected execution time of activity class k on site i, i ∈ {1, ..., S}, k ∈ {1, ..., K}. Suppose we have a set of S Grid sites, where each site has m_i processors and the processors on one Grid site are homogeneous. The objective is to find a solution x that assigns the jobs to the Grid sites so that the overall makespan t(x) of all workflows is minimized.

Definition 2.3 (Cost Optimization Problem) Suppose we have the same input as in Definition 2.2 plus the following extra constraint: each site has a price ϕ_i. The objective is to find a solution x that assigns the jobs so that the cost is minimized and the deadlines are guaranteed, which is expressed by the following formulations:
• guarantee the deadline of each workflow w ∈ {1, ..., W}: t_w(x) < Deadline_w;
• minimize the overall cost: Cost(x) = Σ_{i=1}^{S} Σ_{k=1}^{K} t_i^(k) · ϕ_i,
where t_i^(k) is the remaining execution time of activity class k on site i.

2.3 Grid model

In this section, we define a more realistic Grid model, which makes our resource and workflow management scheme more controllable and monitorable. Grid sites are not fully controllable by outside users, as jobs are submitted to local resource managers such as the Portable Batch System (PBS), Load Sharing Facility (LSF), Condor, or WS-GRAM. However, most scheduling algorithms simply schedule jobs onto processors, which is not realistic and leads to impractical solutions. In contrast, we schedule and manage jobs based on the processing rates of activities on each site, which are controlled through appropriate job submission at runtime. Characterizing Grid resource access behavior for workflow managers in this way has four advantages.

[Figure 2: System model.]

First and most importantly, it allows the user or workflow manager to effectively control workflow execution on the Grid, because the workflow manager can control the processing rate by submitting the same number of activities as the number of resources allocated on a Grid site. Second, the first advantage in turn enables the adaptation of workflows based on allocated resources, especially when the computing environment changes and rescheduled solutions need to be performed; a rescheduled solution can easily be enacted by adjusting the number of job submissions. Third, it allows the execution of workflows that have additional constraints on performance, cost, and resource requirements; for example, our approach allows users to filter out unwanted Grid sites or to set deadlines for some workflows. Last, it allows more accurate prediction of workflow execution due to the precise mathematical characterization of solutions in the model. We consider a Grid environment as shown in Figure 2. The Grid has S sites connected by a communication network. Activities arriving at each site i (i = 1, 2, ..., S) may belong to W workflows. Workflow managers, which control the execution of workflows, are distributed over the Grid and compete with each other for resources. Each Grid site has m_i (i = 1, 2, ..., S) processors, and the total number of processors of all sites is M = Σ_{i=1}^{S} m_i. In this model, each workflow manager has S queues, where each queue corresponds to one Grid site. Queues are used by each workflow manager to control the processing of activities scheduled on each site. We use terminology and notations similar to [4] and introduce some additional notations as follows.

• δ_i^(k): current length (number of activities) of the queue of activity class k on site i;
• δ^(k) = Σ_{i=1}^{S} δ_i^(k): current length (number of activities) of the queue of activity class k;
• p_i^(k): expected execution time of activity class k on site i;
• β_i^(k) = θ_i^(k) / p_i^(k): job processing rate of activity class k on site i, where θ_i^(k) is the number of available processors for activity class k on site i;
• β^(k) = Σ_{i=1}^{S} β_i^(k): job processing rate of activity class k;
• t_i^(k) = δ_i^(k) / β_i^(k): remaining execution time of activity class k on site i;
• t_i = Σ_{k=1}^{n_i} δ_i^(k) / β_i^(k): remaining execution time on site i;
• t^(k) = max{t_1^(k), t_2^(k), ..., t_S^(k)}: remaining execution time of activity class k;
• λ_{a,b}: mean bandwidth from site a to site b.
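The notation above translates directly into a few array operations. The sketch below, with invented numbers, computes the processing rates and remaining times and, from them, the makespan t(x) and cost Cost(x) of Definitions 2.2 and 2.3; it only illustrates the model, not the scheduling algorithms.

    import numpy as np

    # Invented example: K = 2 activity classes, S = 3 Grid sites
    p     = np.array([[18., 14., 19.],     # p[k, i]: expected execution time of class k on site i
                      [20., 15., 19.]])
    theta = np.array([[2.,  5.,  1.],      # theta[k, i]: processors available to class k on site i
                      [2.,  3.,  1.]])
    delta = np.array([[100., 300., 50.],   # delta[k, i]: queued activities of class k on site i
                      [200., 100., 100.]])
    phi   = np.array([1.0, 2.5, 0.8])      # phi[i]: price of site i

    beta = theta / p                       # beta_i^(k) = theta_i^(k) / p_i^(k)
    t_ki = delta / beta                    # t_i^(k): remaining execution time of class k on site i
    t_i  = t_ki.sum(axis=0)                # t_i: remaining execution time on site i
    t_k  = t_ki.max(axis=1)                # t^(k): remaining execution time of class k

    makespan = t_k.max()                   # t(x): all classes of all workflows must finish
    cost     = (t_ki * phi).sum()          # Cost(x) = sum_i sum_k t_i^(k) * phi_i (Definition 2.3)
    print(t_i, t_k, makespan, cost)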

3. OPTIMIZATION ALGORITHMS

In this section we describe our performance and cost optimization algorithms.

3.1 Performance optimization

First of all, in Section 3.1.1, we formulate the performance optimization problem as a cooperative game among the workflow managers, which can theoretically generate the optimal solution. However, the optimal solution is hard to achieve due to the high problem complexity. Therefore, we observe that the problem can be further formulated and solved as a sequential cooperative game, and we present the complete algorithm and one simple example in Section 3.1.2.

3.1.1 Formulation and solution

Before starting to formulate our problem, we present the most important definition in game theory – what is a game? game = players + strategies + specification (of payoff). When players, strategies, and the specification of payoff are properly defined, the final solution can be obtained smoothly. In the following paragraphs, the problem is first formulated as a cooperative game. We consider a K-player game where the K workflow managers, which are the players, attempt to minimize the execution time of their own activity class t^(k), which depends on the number of activities in activity class k (δ^(k)) and the processing rate of activity class k (β^(k)). For clarity, we assume that each workflow manager handles the execution of one activity class. The objective function for each manager k can be expressed as:

f_k(∆) = t^(k) = δ^(k) / β^(k) = δ^(k) / Σ_{i=1}^{S} (θ_i^(k) / p_i^(k)),   k = 1, 2, ..., K,

where ∆ is a matrix of activity distribution (δ_i^(k))_{K×S} in which the activity distributions are the strategies, and θ_i^(k) is the number of processors allocated to activity class k on site i, which is the embodiment of the payoff in our cooperative game:

θ_i^(k) = m_i · (δ_i^(k) · p_i^(k) · w_i^(k)) / Σ_{x=1}^{K} (δ_i^(x) · p_i^(x) · w_i^(x)),   (1)

where m_i is the number of processors on site i, and w_i^(k) is the weight of Grid site i for activity class k:

w_i^(k) = (min_{x∈{1...S}}{p_x^(k)} / p_i^(k)) / Σ_{y=1}^{S} (min_{x∈{1...S}}{p_x^(k)} / p_y^(k)) = (1 / p_i^(k)) / Σ_{y=1}^{S} (1 / p_y^(k)).   (2)

We use this weight w_i^(k) to enhance the fairness of the allocation, because one Grid site has a different suitability for different activities for many reasons, for example the locality of data, the size of memory, etc. Intuitively, if the execution time of activity class k on Grid site i is much shorter than on other Grid sites, we should set a higher priority for activity class k on site i and allocate more resources to this activity class on site i. The specific utilization of this weight is explained when we introduce the notion of the sequential game. When the ideal load balance of activity class k is achieved, the objective function can also be defined as:

f_k(∆) = t^(k) = p_i^(k) · δ_i^(k) / θ_i^(k)  if θ_i^(k) ≥ 1,  and  0  if θ_i^(k) < 1.

Based on the allocation of resources and the ratio of the processing rate on site i to the total processing rate of the activity class, we can define the activity distribution as follows:

δ_i^(k) = δ^(k) · β_i^(k) / β^(k) = δ^(k) · (θ_i^(k) / p_i^(k)) / Σ_{i=1}^{S} (θ_i^(k) / p_i^(k)).   (3)

Accordingly, we have the following definition.

Definition 3.1 (The cooperative optimization game) The cooperative optimization game consists of:
• Managers of the K activity classes as players.
• The set of strategies ∆ defined by the following constraints:
  δ_i^(k) ≥ 0;  y_i^(k) ≤ δ_i^(k);  δ_i^(k) = 0, if y_i^(k) < 1;  Σ_{k=1}^{K} y_i^(k) = m_i;  Σ_{i=1}^{S} δ_i^(k) = δ^(k).   (4)
• For each player k, k = 1, 2, ..., K, the objective function f_k(∆). The goal is to minimize simultaneously all f_k(∆).
• For each player k, k = 1, 2, ..., K, the initial value of the objective function f_k(∆_0), where ∆_0 is a matrix of K rows by S columns filled with the initial distribution of activities (δ_i^(k))^0_{K×S}.

For the cooperative optimization game defined above, the solution is determined by solving the following problem:

minimize Σ_{k=1}^{K} f_k(∆),

subject to the constraints defined in eq. 4. In order to present the high complexity of this game, we introduce the Lagrangian for the optimization problem, which is a typical method for finding the extremum of a function of several variables subject to one or more constraints. Let L(δ, η, ς, ι) denote the Lagrangian, where η, ς, and ι_{ki} ≤ 0 denote the Lagrange multipliers. The Lagrangian is:

L(δ, η, ς, ι) = Σ_{k=1}^{K} f_k(∆) + η(Σ_{k=1}^{K} θ_i^(k) − m_i) + ς(Σ_{i=1}^{S} δ_i^(k) − δ^(k)) + ι_{ki} · δ_i^(k).   (5)

Unfortunately, the exact and direct solution (which is also optimal) to this optimization problem is in general difficult to obtain, because the problem has high complexity and K · S variables; the solution depends on the distribution of activities of the same class on different sites, and on the distribution of activities of different classes on the same site. In other words, the change of one variable impacts the values of all other variables. To circumvent this difficulty, we derive an approximate solution by further formulating this problem as a sequential game [27], in which players select a strategy following a certain predefined order, and in which some players can observe the moves of the players who preceded them. Although the optimal solution is not achievable directly from eq. 5, we can derive an intermediate solution which is comprised of a set of game stages based on the following decreasing sequence:

Σ_{k=1}^{K} f_k^{St(1)}(∆^{St(0)}) ≥ Σ_{k=1}^{K} f_k^{St(2)}(∆^{St(1)}) ≥ ... ≥ Σ_{k=1}^{K} f_k^{St(l)}(∆^{St(l−1)}) ≥ Σ_{k=1}^{K} f_k(∆*),

where St denotes the stage of the sequential game, St(l) is the l-th stage game, and ∆* denotes the optimal solution. At each stage, the players (managers of activity classes) provide a set of strategies (the distribution of their own activities) based on the allocation of resources of the last stage, and then the new allocation of resources is generated by using eq. 1. The first step in the sequential game is to initialize the distribution of activities ∆^{St(0)}. At the initial stage St(0), every activity class assumes that all processors are available to it and is allocated an amount of processors on the basis of the processing rate on each site, by using the following equation:

δ_i^(k) = δ^(k) · β_i^(k) / β^(k) = δ^(k) · (m_i / p_i^(k)) / Σ_{y=1}^{S} (m_y / p_y^(k)).

The resource allocation of the l-th stage (Θ^{St(l)}), where Θ is a matrix of resource allocation (θ_i^(k))_{K×S}, is calculated based on the activity distribution of the last stage (∆^{St(l−1)}). Accordingly, the activity distribution of the l-th stage (∆^{St(l)}) is calculated based on the resource allocation of the l-th stage (Θ^{St(l)}). These steps fully embody the idea of a sequential cooperative game. From eqs. 1 and 3, as shown in Figure 3, we have:

Θ^{St(l)} = Θ(∆^{St(l−1)});   (6)
∆^{St(l)} = ∆(Θ^{St(l)}).   (7)

[Figure 3: Data flow of input and output using the sequential game-based allocation strategy — at each stage, the weight matrix and the distribution ∆^{St(l−1)} produce the allocation Θ^{St(l)}, which in turn produces the distribution ∆^{St(l)} passed to the next stage.]

In the following, we explain why the weight defined in eq. 2 is important and how we utilize it. The main idea of our method is to accumulate the optimization effects over many game stages until a certain load balance among activity classes is achieved. Due to the demand for cumulative effects, we need a weight that generates positive impacts on the results of every game stage. The weight of one activity ought to be comparable with the weight of activities of the same activity class on different sites and with the weight of activities of different activity classes on the same site. Based on the above notion, we define the importance weight in eq. 2 as the normalized value of the expected execution times. Furthermore, how we utilize the importance weight is also innovative: the cumulative effects need to be transferred to the next stage and affect the final optimization results, hence we need an intermediate variable to accept, preserve, and transfer these effects. In our case, the intermediate variable is the resource allocation matrix Θ, which accepts the effects from the importance weights and transfers them to the activity distribution matrix ∆.

Algorithm 1 Performance optimization algorithm (Game-quick scheduling algorithm)
Input: WF, p_i^(k), δ^(k), m_i, constraints
Output: ∆^{St(l)} - distribution of activities, Θ^{St(l)} - allocation of resources
Step 1. Initialize ∆_0 and the weights of the activity classes, and apply constraints
1.  For each wf ∈ WF do
2.    For each AC^(k) ∈ wf to be scheduled next according to the control flow dependencies do
3.      add AC^(k) to the set of game players
4.      For each Grid site i do
5.        calculate w_i^(k) by applying eq. 2
6.        calculate δ_i^(k) by applying eq. 3 to build ∆_0
7.      End for
      End for
    End for
Step 2. Search the final distribution of activities and the allocation of resources
8.  do
9.    For each Grid site i do
10.     calculate θ_i^(k) by applying eq. 6 to build Θ^{St(l)}
11.   End for
12.   For each activity class k do
13.     calculate δ_i^(k) by applying eq. 7 to build ∆^{St(l)}
14.   End for
15. While Σ_{k=1}^{K} (t^(k)(∆^{St(l−1)}) − t^(k)(∆^{St(l)})) > ε
Step 3. Verify the constraints, increase the resource allocation for unsatisfied activity classes, and repeat Steps (1) and (2)
Step 4. Remove completed activity classes from the queues, and repeat Steps (1) and (2) until all wf ∈ WF are completed
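To make the sequential game concrete, the following is an illustrative NumPy sketch of the iteration behind Algorithm 1 (eqs. 1-3 and 6-7). It is a simplified re-implementation, not the authors' scheduler: it assumes the termination test of line 15 is applied to the summed remaining times and omits the constraint handling and class removal of Steps 3 and 4.

    import numpy as np

    def game_quick(p, m, delta_total, eps=0.0, max_stages=10000):
        """Sequential cooperative game for makespan optimization (sketch).

        p[k, i]        expected execution time of activity class k on site i
        m[i]           number of processors on site i
        delta_total[k] number of activities in class k"""
        K, S = p.shape
        w = (1.0 / p) / (1.0 / p).sum(axis=1, keepdims=True)     # eq. 2: site weights

        rate0 = m / p                                            # initial stage St(0)
        delta = delta_total[:, None] * rate0 / rate0.sum(axis=1, keepdims=True)

        def total_remaining(delta, theta):
            with np.errstate(divide="ignore", invalid="ignore"):
                t = np.where(theta > 0, delta * p / theta, 0.0)  # t_i^(k)
            return t.max(axis=1).sum()                           # sum of t^(k)

        prev = np.inf
        for stage in range(max_stages):
            share = delta * p * w                                # eq. 1 / eq. 6: split each site's
            theta = m * share / (share.sum(axis=0, keepdims=True) + 1e-12)  # processors by weighted work
            rate = theta / p                                     # eq. 3 / eq. 7: redistribute each
            delta = delta_total[:, None] * rate / (rate.sum(axis=1, keepdims=True) + 1e-12)  # class by rate
            total = total_remaining(delta, theta)
            if prev - total <= eps:                              # line 15 of Algorithm 1
                break
            prev = total
        return delta, theta

    # The 4-activity / 4-machine example of Figure 4 (one processor and one activity per class)
    p = np.array([[18., 14., 19., 14.],
                  [20., 15., 19., 12.],
                  [25., 19., 18., 15.],
                  [21.,  9., 19.,  8.]])
    delta, theta = game_quick(p, m=np.ones(4), delta_total=np.ones(4))
    print(delta.round(2))   # should drift towards the one-to-one mapping reported in Figure 5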

To explain why we achieve better performance, we present in this paragraph the relation between the so-called aggregated execution time (AET) and the makespan. When we achieve load balance, the makespan can be approximated by dividing the AET of all activities by the total number of processors:

makespan ≈ Σ_{i=1}^{S} (t_i · m_i) / Σ_{i=1}^{S} m_i,

where t_i denotes the remaining execution time on site i, and m_i denotes the number of processors on site i. At each stage, our algorithm produces an even load distribution of activities and, therefore, when the AET is decreased by our algorithm, the makespan is decreased proportionally. In other words, our algorithm indirectly reduces the makespan by reducing the AET, as also shown in our experimental results.

3.1.2 Game-quick Algorithm

In the following, we explain our optimization algorithm, called Game-quick, by using one simple example. Figure 4 presents a scenario in which Game-quick outperforms other heuristics. The first matrix presents the expected execution times of four activities {A0, A1, A2, A3} on four machines {M0, M1, M2, M3}. In this particular case, Game-quick gives a makespan of 18, which is also optimal, while Min-min gives a makespan of 20, Max-min and Sufferage a makespan of 19, and MCT performs the worst with a makespan of 25. Intuitively, MET assigns all tasks to the fastest machine M3 and gives the worst makespan of 49; hence, we do not show the mapping of MET. Figure 5 presents the intermediate data generated by Game-quick for this scenario. Algorithm 1 shows the pseudo-code for planning the execution of a workflow.

Step 1. After acquiring the information about activities and resources (e.g. the matrix of expected execution times in Figure 4), we generate an initial distribution of activities ∆0 and a weight matrix (see lines 1–7), as shown in Figure 5. In this simple case, these two matrices are identical because we have one processor on each cluster and one activity in each activity class. In Step 1, users are also allowed to set performance constraints, or filter out some unwanted Grid sites. This functionality is supported efficiently and effectively by our algorithm. To filter out some unwanted Grid sites for certain workflows, we can simply set the weights of the workflows on those sites to zero and not distribute any activities to those sites, with no further scheduling steps required. To assure that all constraints are satisfied, constraints can be verified again in Step 3.

Step 2. Every iteration of the While loop (see lines 8–15) is one game stage, where every stage is comprised of S sub-games; in other words, there is one sub-game on each Grid site. In each sub-game, all activity classes compete for resource allocation, and the activity classes with relatively heavier weight win the sub-game on that site and obtain more resources at the next stage. These activity classes, however, cannot win everywhere due to the definition of the weight (the sum of the weights of one activity class is 1). Therefore, winners of the sub-game on one Grid site must be losers on other Grid sites. Similarly, losers lose resources on one Grid site, but become winners on other sites and achieve compensation from somewhere else. This process is repeated until no more performance can be gained. The further processing of the algorithm depends on the evaluation result at line 15: Σ_{k=1}^{K} (t^(k)(∆^{St(n−1)}) − t^(k)(∆^{St(n)})) > ε, where ε can be used to control the number of stages and the degree of optimization. For the experiments in this study, we set ε to zero. The input and output data flow of each game stage is shown in Figure 3. Every stage of the game gets results from the last stage and sends new results to the next stage for further optimization. Specifically, as shown in Figure 5, we apply eq. 6 at line 10 to generate the first resource allocation matrix Θ^{St(1)}. Based on Θ^{St(1)}, we use eq. 7 at line 13 to generate the first activity distribution ∆^{St(1)}. Thereafter, we repeat the iteration until we reach the upper limit of optimization. In addition, we can use ε to control the number of stages: for example, in this case when ε is set to 0.1, the algorithm completes at stage 29; when ε is set to zero, the algorithm completes at stage 731.

Step 3. In this step we verify deadline constraints and increase the allocated resources for unsatisfied activity classes.

Step 4. In this step we eliminate the earliest completed activity classes. To utilize the resources released by the completed activity classes, we repeat Steps 1 and 2 to recompute the distribution of the remaining uncompleted classes until all workflows are completed. In order to recompute a new distribution and allocation, we need an iterative algorithm which must be run periodically when the environment parameters change. In terms of feasibility and predictability, the solution provided by our approach can be practically performed and the performance of workflows can easily be predicted, because our approach explicitly indicates the number of processors allocated to each activity class. Therefore, the workflow manager can easily control the processing rate of each activity class and precisely predict the execution time based on it. Moreover, load imbalance can easily be detected and resolved by the workflow manager, which is an important feature for Grid computing since load imbalance is a main source of performance bottlenecks. In contrast, other algorithms that just produce an execution sequence of activities are neither practical nor predictable. On the one hand, Grid sites are not fully controllable by outside users (including workflow managers) and we cannot decide which processor will be used for the next activity, hence the execution process cannot be the same as the scheduled plan. On the other hand, it is hard to predict the execution time of workflows, since the execution process is a mixture of all workflows (or activity classes). Hence, our approach performs better at achieving feasibility and predictability.

[Figure 4: A simple example that illustrates the situation where Game-quick outperforms other algorithms. The expected execution times of the four activities on machines M0–M3 are: A0: 18, 14, 19, 14; A1: 20, 15, 19, 12; A2: 25, 19, 18, 15; A3: 21, 9, 19, 8. The resulting mappings give Game-quick a makespan of 18 (optimal, sum = 57), Min-min 20 (sum = 60), Max-min and Sufferage 19, and MCT 25.]

[Figure 5: Intermediate data of Game-quick — the weight matrix, the initial distribution ∆_0, and the allocation and distribution matrices Θ^{St(l)} and ∆^{St(l)} of successive game stages; with ε = 0.1 the game ends at stage 29, with ε = 0 at stage 731, where the final distribution is a one-to-one mapping of activities to machines.]

[Figure 6: Convergence process of performance optimization — aggregated execution time (sec) versus game stage for five randomly generated examples.]

3.1.3 Algorithm analysis

The time complexity of the Game-quick algorithm is O(l · K · S) and the space complexity is O(K · S), where l is the number of stages of the sequential game, K is the number of activity classes, and S is the number of Grid sites. Table 1 contains the measured number of game stages l and the algorithm execution times of Game-quick and Min-min for different problem sizes, measured on a machine with Dual Core Opteron 880 2.4GHz processors and 1GB of RAM. From the table, we can notice that Game-quick scales well, since the number of game stages does not increase in proportion to the number of processors and activities. Even when there are 10^6 activities and 10^4 processors, the algorithm needs just 593 stages and 0.36 seconds to complete the optimization. The convergence processes of performance optimization are very fast, as shown in Figure 6. In this experiment, we randomly generated five examples to assign 10^2 × 10^4 activities to 10^2 × 10^2 processors. After about 30-40 stages, more than 90% of the optimization has been completed, and the entire optimization process needs about 600 stages for this problem size. The reason why the convergence process is fast is that, to some extent, every activity class is a winner on certain sites, and all of them can achieve performance improvement. Specifically, at the beginning of a game, every activity class moves its workload to the sites which are more efficient for it, and bargains for resources. If a class cannot successfully bargain and obtain more resources, it moves workload to the sites that are less important to it. Finally, all activity classes reach a balance point, and no more improvement can be achieved.

Grid sites × Processors | Activity classes × Activities | Stages | Game-quick (ms) | Min-min (ms)
10 × 10 | 10 × 10 | 310 | 2 | 15
10 × 10 | 10 × 100 | 334 | 2 | 22
10 × 10^2 | 10^2 × 10^3 | 476 | 23 | 3,109
10 × 10^2 | 10^2 × 10^4 | 484 | 25 | 29,512
10^2 × 10^2 | 10^2 × 10^3 | 597 | 362 | 485,597
10^2 × 10^2 | 10^2 × 10^4 | 593 | 362 | > 1 hour
10^2 × 10^2 | 10^3 × 10^3 | 632 | 11,065 | > 1 hour
10^2 × 10^2 | 10^3 × 10^4 | 627 | 11,856 | > 1 hour

Table 1: Game stages and algorithm execution times.

3.2 Cost optimization

We introduce a cost optimization algorithm, named Game-cost, based on a similar idea as the Game-quick algorithm. The first step is to assign deadlines to activity classes and partition workflows into sub-workflows according to the assigned deadlines. Then we apply our cost optimization algorithm on each sub-workflow.

3.2.1 Workflow partitioning

Partitioning a workflow into smaller sub-workflows and assigning them to different games for further optimization are the two key steps in the design of our time-constrained algorithm. Our partitioning is based on the deadlines of activity classes. In this paper we use a static deadline assignment method called Effective Deadline (ED), introduced in [25], in which the deadline of an activity is the overall workflow deadline minus the total expected execution time of its subsequent activities. Figure 7 presents one example of deadline assignment and partitioning of a workflow consisting of four activity classes. According to the user-specified deadline and the amount of work of each activity class, the four deadlines {d1, d2, d3, d4} are assigned to the partitions {P1, P2, P3, P4} by using the ED method. Thereafter, we sort the deadlines and identify game phases between two adjacent deadlines. In this example, the optimization process is divided into three game phases, where each game phase is associated with a color in Figure 7 (i.e. phase one between 0 and d1, phase two between d1 and d2, and phase three between d2 and d3, with d3 = d4). Our cost optimization algorithm is applied to the cost optimization problem of each game phase, where different game phases are independent of each other.
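The ED rule can be illustrated on a linear chain of activity classes; the helper below and its numbers are purely illustrative of the rule described above (the deadline of a class is the overall deadline minus the expected execution time of its successors). Sorting the resulting deadlines then yields the game phases of Figure 7.

    def effective_deadlines(chain_exec_times, workflow_deadline):
        """ED rule on a linear chain: deadline(k) = overall deadline
        minus the expected execution time of all activity classes after k."""
        deadlines = []
        remaining_after = sum(chain_exec_times)
        for exec_time in chain_exec_times:
            remaining_after -= exec_time
            deadlines.append(workflow_deadline - remaining_after)
        return deadlines

    # Four activity classes in sequence with expected times 10, 40, 30, 20
    # and an overall workflow deadline of 120 time units.
    print(effective_deadlines([10, 40, 30, 20], 120))   # [30, 70, 100, 120]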

3.2.2 Formulation and solution

To optimize the cost of workflows, we consider a K-player sequential cooperative game, where the K workflow managers try to minimize their costs while guaranteeing a deadline. Each manager controls the execution of one activity class.

[Figure 7: Partition of the cost optimization game — the deadlines d1, d2, d3 = d4 of the partitions P1–P4 delimit GamePhase1, GamePhase2, and GamePhase3.]

The game is comprised of a set of stages, where St denotes the stage of the sequential game and St(l) is the l-th stage game. The objective function for each manager k (k = 1, 2, ..., K) can be expressed as:

f_k(∆) = c_k(∆) = Σ_{i=1}^{S} p_i^(k) · δ_i^(k) · ϕ_i,   (8)

where c_k(∆) is the cost of activity class k and ϕ_i is the price of site i. When we achieve the best price/performance ratio, the following deadline constraint, which relates the distribution of activity class k on site i (δ_i^(k)) and the resource allocation of activity class k on site i (θ_i^(k)), is supposed to be satisfied:

d_phase ≥ δ_i^(k) · p_i^(k) / θ_i^(k),   (9)

where d_phase is the deadline of the current phase. The resource allocation for each activity class is defined by the following equation:

θ_i^(k)(∆) = m_i · (δ_i^(k) · p_i^(k) · cw_i^(k)) / Σ_{x=1}^{K} (δ_i^(x) · p_i^(x) · cw_i^(x)),   (10)

where cw_i^(k) is the importance weight of site i for activity class k:

cw_i^(k) = (1 / (ϕ_i · p_i^(k))) / Σ_{y=1}^{S} (1 / (ϕ_y · p_y^(k))).   (11)

We use this importance weight to improve the fairness of resource allocation. For this cooperative cost optimization game, the solution is determined by solving the following optimization problem:

minimize Σ_{k=1}^{K} c_k(∆)

subject to the constraint:

max_k {t_k(∆)} = max_k { δ^(k) / Σ_{i=1}^{S} (θ_i^(k) / p_i^(k)) } ≤ d_phase.

Let L(δ, η) denote the Lagrangian, where η denotes the Lagrange multiplier. The Lagrangian is:

L(δ, η) = Σ_{k=1}^{K} Σ_{i=1}^{S} p_i^(k) · δ_i^(k) · ϕ_i + η( δ^(k) / Σ_{i=1}^{S} (θ_i^(k) / p_i^(k)) − d_phase ).

Unfortunately, the direct and exact (optimal) solution to this problem is difficult to obtain too. Based on a similar idea as in the Game-quick algorithm, we have the following decreasing sequence:

Σ_{k=1}^{K} c_k^{St(1)}(∆^{St(0)}) ≥ Σ_{k=1}^{K} c_k^{St(2)}(∆^{St(1)}) ≥ ... ≥ Σ_{k=1}^{K} c_k^{St(l)}(∆^{St(l−1)}) ≥ Σ_{k=1}^{K} c_k(∆*).   (12)

The termination condition of the sequential cooperative game is judged as follows:

Σ_{k=1}^{K} c_k^{St(l+1)}(∆^{St(l)}) ≥ Σ_{k=1}^{K} c_k^{St(l)}(∆^{St(l−1)}),   (13)

which means that Game-cost cannot reduce costs any more. According to the analysis mentioned above, a new allocation Θ^{St(l)} can be achieved based on the distribution of the last stage ∆^{St(l−1)}. With the new allocation Θ^{St(l)} in the same stage, the new distribution ∆^{St(l)} can be generated for evaluation. From eqs. 9 and 10, we obtain:

Θ^{St(l)} = Θ(∆^{St(l−1)});   (14)
∆^{St(l)} = ∆(Θ^{St(l)}) = d_phase · θ_i^(k) / p_i^(k).   (15)

Algorithm 2 Game-cost optimization algorithm
Input: subWF, p_i^(k), δ^(k), m_i, ϕ, d_phase, constraints
Output: ∆^{St(l)} - distribution of activities, Θ^{St(l)} - allocation of resources
Step 1. Sort the Grid sites for each activity class by increasing performance/price ratio
Step 2. Initialize ∆^{St(0)} and the weights of the activity classes, and apply constraints
1.  For each wf ∈ subWF do
2.    For each AC^(k) ∈ wf in this game phase to be scheduled next according to the workflow partitions do
3.      add AC^(k) to the set of game players
4.      For each sorted Grid site i for AC^(k) do
5.        calculate cw_i^(k) by applying eq. 11
6.        assign δ_i^(k) by applying m_i · d_phase / p_i^(k) to build ∆^{St(0)}, according to the sort order of the resources
7.      End for
      End for
    End for
Step 3. Search the final distribution of activities and the allocation of resources
9.  do
10.   For each activity class k do
11.     For each sorted Grid site i do
12.       calculate θ_i^(k) by applying eq. 14 to build Θ^{St(n)}
13.       calculate δ_i^(k) by applying eq. 15 to build ∆^{St(n)}
14.       If all activities are allocated then
15.         break;
16.       End if
        End for
      End for
17. While Σ_{k=1}^{K} (c^(k)(∆^{St(n−1)}) − c^(k)(∆^{St(n)})) > ε
Step 4. If the deadlines are not satisfied, apply Game-quick and repeat Step (3)

Algorithm 2 shows the pseudo-code of the algorithm for planning the workflow execution. After acquiring the information about activities and resources, we sort the resources for every activity class, and generate an initial distribution of activities and an initial allocation of resources. At the beginning, the activity classes are competitors on the site which has the highest price/performance ratio. After one stage of competition, winners get more processors from that resource. In the next stage, losers compete for the resources which have the second highest price/performance ratio. The difference between Game-quick and Game-cost is that the competitors in Game-quick contend for resources on all Grid sites, whereas the competitors in Game-cost compete for resources starting from the Grid sites with the highest price/performance ratio down to the ones with the lowest price/performance ratio. This process is repeated until no more cost can be reduced, which is determined by the While condition in Step 3 of Algorithm 2. In addition, due to tight deadlines and the non-backtracking nature of the algorithm, it is sometimes not possible to meet the deadlines for all activity classes. If this happens, Game-cost invokes Game-quick to meet the deadline first, and then optimizes the cost based on the results of Game-quick. This has proven very effective when we have tight deadlines, because the other heuristics are in these cases not able to return a complete schedule. The time complexity of the cost optimization algorithm is O(l · K · S) and the space complexity is O(K · S), where l is the number of game stages, K is the number of activity classes, and S is the number of Grid sites. The convergence process of cost optimization is very fast, as shown in Figure 8. The number of stages for cost optimization is much smaller than for performance optimization, because the deadline constraints limit the convergence process. In addition, the initialization process in Game-cost is different from that of Game-quick, hence the convergence process is much shorter. In this experiment, we randomly generated five examples to assign 10^2 × 10^4 activities to 10^2 × 10^2 processors. After about 20–30 stages, the optimization has almost been completed, and the entire optimization process needs about 50 stages for this problem size.
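As a companion to Algorithm 2, the sketch below illustrates only the Step-2 initialization of Game-cost — filling each class's sites in price/performance order up to the quota m_i · d_phase / p_i^(k) — together with the cost function of eq. 8. The data and helper names are invented, and the stage-wise refinement of eqs. 10, 14, and 15 is omitted.

    import numpy as np

    def game_cost_init(p, m, phi, delta_total, d_phase):
        """Step 2 of Algorithm 2: build the initial distribution Delta^St(0).

        Each activity class fills its Grid sites in order of increasing
        phi_i * p_i^(k), assigning at most m_i * d_phase / p_i^(k) activities
        per site (the number a site can finish within the phase deadline)."""
        K, S = p.shape
        delta = np.zeros((K, S))
        for k in range(K):
            remaining = delta_total[k]
            for i in np.argsort(phi * p[k]):           # cheapest effective sites first
                capacity = m[i] * d_phase / p[k, i]    # activities site i can finish by d_phase
                delta[k, i] = min(remaining, capacity)
                remaining -= delta[k, i]
                if remaining <= 0:
                    break
        return delta

    def cost(delta, p, phi):
        """eq. 8: c_k(Delta) = sum_i p_i^(k) * delta_i^(k) * phi_i, summed over all classes."""
        return (p * delta * phi).sum()

    p   = np.array([[10., 20.], [30., 15.]])   # execution times of 2 classes on 2 sites
    m   = np.array([8, 16])                    # processors per site
    phi = np.array([2.0, 1.0])                 # price per CPU-second
    delta0 = game_cost_init(p, m, phi, delta_total=np.array([100., 50.]), d_phase=200.)
    print(delta0, cost(delta0, p, phi))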

[Figure 8: Convergence process of cost optimization — cost versus game stage for five randomly generated examples.]

4. EXPERIMENTAL RESULTS

In this section, we first compare the time and space complexity of the different approaches, and then show the scheduling results of two real applications on the Austrian Grid to explain the advantages of our algorithms. To ensure the completeness of our experiments, we also evaluate and compare the different algorithms in a complex simulated system with a large number of activities, based on different machine and activity heterogeneity. All measurements were performed on a machine with Dual Core Opteron 880 2.4GHz processors and 1GB of RAM.

4.1 Complexity and execution times

The computational time complexity is an important measure for the comparison of different algorithms. We have implemented OLB [14, 19], MET [14, 19], MCT [19], Sufferage [8, 12], Min-min [8, 12], and Max-min [8, 12] in our system, and modified them to work on classes of activities instead of individual activities. The execution time of the Game-quick and Game-cost algorithms is distinctly less than that of all other algorithms; their time complexity is only related to the number of activity classes (K) and the number of clusters (S). When we assign 10^5 activities to 10^3 processors, the execution time of our algorithm is less than 0.4 seconds, as shown in Table 2, while other algorithms may need several hours to generate comparable solutions. MET, which has asymptotic complexity of O(M + N), executes for less than 1 second, where M is the number of processors and N is the number of activities. However, the results of MET have serious problems, because MET schedules most activities to the fastest Grid sites. OLB and MCT have asymptotic complexity of O(M · N), but their results are much worse than those of our algorithms. Sufferage, with asymptotic complexity of O(M · N · ω), ω ≤ N, executes for an average of 200–300 seconds; there is no performance difference between Sufferage-C and Sufferage. Min-min and Max-min (and Duplex) have asymptotic complexity of O(M · N^2) and an average execution time of 200–300 seconds. There are other algorithms to which we do not compare, such as Work Queue (WQ) [21], Heterogeneous Earliest Finish Time (HEFT) [7], Genetic Algorithms (GA) [17, 22], or A* [31].

Algorithm | Time complexity | Time (seconds) | Space complexity
Game-quick, Game-cost | O(l · K · S) | < 0.4 | O(K · S)
MET | O(M + N) | < 1 | O(M + N)
OLB, MCT | O(M · N) | – | O(M + N)
Sufferage, Sufferage-C | O(M · N · ω) | 200–300 | O(M + N)
Min-min, Max-min | O(M · N^2) | 200–300 | O(M + N)
Duplex, HEFT | O(M · N^2) | – | O(M + N)
GA-based solutions | scales poorly | >> 200–300 | O(M + N)
A* | exponential | – | exponential

Table 2: Comparison of time complexity and execution time of the algorithms when assigning 10^5 activities to 10^3 processors.

WQ, however, is just for homogeneous parallel machines. HEFT degrades to Min-min for large-scale applications. GA-based solutions and A* scale poorly as the number of activities and processors increases, and their execution times are significantly higher than those of other algorithms, though they can decrease the makespans of Min-min by 5%-10% according to related work [17]. Other algorithms are similar to the implemented ones, or are not practical for large-scale workflows for the reasons mentioned above. Therefore, we did not implement and compare them with our algorithms. Although the space complexity is not as important as the time complexity, because most scheduling algorithms are not memory intensive, it is still worth mentioning that, for large-scale applications (usually M >> K), the space complexity of Game-quick and Game-cost, O(K · S), is much lower than that of the other algorithms, O(M + N) or exponential. Compared with existing algorithms, our Game-quick and Game-cost algorithms are the most efficient for large-scale applications characterized by a large number of homogeneous activities. For such applications, the scheduling problem can easily be formulated as a typical and solvable game, although we cannot exclude the possibility that there will be large-scale applications with tens of thousands of different types of activities. Solving this latter problem needs further research on game partitioning techniques.

4.2 Real applications

In the following, we evaluate our proposed methods using two real-world scientific workflow applications executed in a national Grid infrastructure. WIEN2k [13] is a program package for performing electronic structure calculations of solids using density functional theory, based on the full-potential (linearized) augmented plane-wave ((L)APW) and local orbital (lo) method. We have ported the application onto the Grid by splitting the monolithic code into several coarse-grain activities coordinated in a workflow, as illustrated in Figure 1. The lapw1 and lapw2 activity classes can be solved in parallel by a fixed number of homogeneous k-points. A final activity converged, applied on several output files, tests whether the problem convergence criterion is fulfilled. AstroGrid [15] is an astronomical application, illustrated in Figure 1, which solves numerical simulations of the movements and interactions of galaxy clusters using an N-body system. The computation starts with the state of the universe at some time in the past and proceeds to the current time. Galaxy potentials are computed for each time step, and then the hydrodynamic behavior and processes are calculated and described. We executed these applications on a subset testbed of the Austrian Grid infrastructure consisting of a set of parallel computers and workstation networks accessible through the Globus toolkit and local job queuing systems as separate Grid sites. For the sake of clarity, our experimental testbed consists of two clusters, one at the University of Innsbruck and the other at the University of Linz, and we use only 4 processors on each Grid site. The characteristics of the machines are given in Table 3. In this experiment, we evaluate the performance of Min-min (the best of the other heuristics) and Game-quick by comparing the makespan, the AET (see Section 3.1.1), and the fairness of these two applications. We quantify the fairness by using Jain's fairness index [24]:

fairness = (Σ_{w=1}^{W} T_w)^2 / (W · Σ_{w=1}^{W} T_w^2),

where W is the number of workflows, and T_w is the execution time of workflow w.

Site | Size | GHz | Architecture | Mgr. | Location
hc-ma.uibk | 4 (8) | 2.2 | EM64,COW | SGE | Innsbruck
altix1.jku | 4 (64) | 1.6 | I2,ccNUMA | PBS | Linz

Table 3: The Austrian Grid testbed.
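Jain's fairness index used above is simple to compute; a minimal, purely illustrative helper:

    def jain_fairness(times):
        """Jain's fairness index: (sum T_w)^2 / (W * sum T_w^2); 1.0 means perfectly fair."""
        w = len(times)
        return sum(times) ** 2 / (w * sum(t * t for t in times))

    print(jain_fairness([240.0, 240.0]))   # 1.0: both workflows finish together
    print(jain_fairness([100.0, 240.0]))   # about 0.855: one workflow is favoured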

[Figure 9: Scheduling results of two real applications (WIEN2k and ASTRO), where Game-quick gives a shorter makespan and AET and better fairness than Min-min. (a) Min-min: makespan = 240, AET = 1622.5, fairness = 0.86120; (b) Game-quick: makespan = 225, AET = 1585, fairness = 0.99987. Both panels show the activities executed over time (seconds) on processors P1–P4 of the hc-ma and altix1 sites.]

[Figure 10: Algorithm execution times (ms) for assigning more than 10^5 activities to about 10^3 processors — MET, OLB, MCT, Min-min, Max-min, Sufferage, and Game-quick under the HiHi, HiLo, LoHi, and LoLo scenarios.]

Scenario | No. of Procs. | No. of Clusters | No. of Activities | Activity Classes | Activity Heterog. | Machine Heterog.
HiHi | 900 | 10 | 157118 | 10 | [1, 1000] | [1, 100]
HiLo | 989 | 10 | 147871 | 10 | [1, 1000] | [1, 10]
LoHi | 900 | 10 | 149731 | 10 | [1, 10] | [1, 100]
LoLo | 1048 | 10 | 168208 | 10 | [1, 10] | [1, 10]

Table 4: Consistent computing environment.

The fairness value ranges from zero to one, where fairness = 0 indicates the worst fairness, and fairness = 1 the best fairness. Figure 9 presents a scenario in which Game-quick outperforms Min-min. In this particular case, as shown in Figure 9(a), Min-min gives a makespan of 240, an AET of 1622.5, and a fairness of 0.86120. Comparing the results of Game-quick (see Figure 9(b)) with the results of Min-min, we notice that Game-quick improved the makespan of the workflows by 6.67%, the AET by 2.36%, and the fairness by 16%. Notice that the fairness of Game-quick is almost perfect (0.99987). Moreover, we can intuitively observe that it is hard to predict the execution time of each workflow from the execution plan in Figure 9(a), because the workflows are interleaved and form a mixture of activities of different workflows. In contrast, Game-quick yields an execution plan in which each workflow is executed based on the control of processing rates, hence the activities of one workflow can be considered to be executed on dedicated processors. Therefore, we achieve better makespans and AET with Game-quick, obtain almost perfect fairness, and can more precisely predict the workflow execution times.

4.3 Performance optimization

For the completeness of our experiments and the universality of the experimental results, we evaluate and compare different algorithms for different scenarios. We first introduce our simulation environment, and then the makespans, fairness and AET of different algorithms are compared based on the classification of activity and machine heterogeneity.

We use an ETC (expected time to compute) matrix [17] to evaluate all algorithms. An ETC matrix categorizes activities and resources by the degree of their heterogeneity. Machine heterogeneity represents the variation that is possible among the execution times of a given activity class across all machines, while activity heterogeneity is defined as the amount of variance among the execution times of the activity classes on a given machine [17]. We assume that the activities in one activity class are homogeneous, and that the machines in one cluster are homogeneous. To simulate a real computing environment, we use consistent and inconsistent matrices. Consistent means that whenever a machine a executes any activity faster than machine b, then machine a executes all activities faster than machine b; inconsistent matrices characterize the situation where machine a may be faster than machine b for some activities and slower for others [17]. We evaluate the algorithms for four different scenarios: high activity and high resource heterogeneity (HiHi), high activity and low resource heterogeneity (HiLo), low activity and high resource heterogeneity (LoHi), and low activity and low resource heterogeneity (LoLo). Tables 4 and 5 present the details of the simulated computing environment. The expected execution times of activities are generated based on the activity and machine heterogeneity, and are selected from a uniform distribution in the specified ranges. High machine heterogeneity (in the range [1, 100]) causes a significant difference in an activity's execution time among Grid sites. High activity heterogeneity (in the range [1, 1000]) indicates that the expected execution times of different activities differ greatly. We assume that the number of activities is randomly generated from a uniform distribution in the range [10000, 20000], and the number of processors on one Grid site in the range [64, 128]. The algorithm execution times for consistent and inconsistent matrices are similar, and are shown in Figure 10. The relative execution time of the algorithms from best to worst was: (1) Game-quick, (2) MET, (3) OLB, (4) MCT, (5) Min-min, (6) Max-min (sometimes Sufferage), and (7) Sufferage (sometimes Max-min). Not only is its time complexity much lower than that of the other algorithms, but the Game-quick algorithm also gives the best makespans and machine utilization. For all cases, the relative performance order of the algorithms from best to worst was: (1) Game-quick, (2) Min-min, (3) Sufferage, (4) MCT, (5) Max-min, (6) OLB, and (7) MET. In terms of fairness, Game-quick always achieved almost perfect fairness (see Figure 11(g) and Figure 12(g)). On average, the fairness value of Game-quick is 0.99.

Scenario | No. of Procs. | No. of Clusters | No. of Activities | Activity Classes | Activity Heterog. | Machine Heterog.
HiHi | 982 | 10 | 131298 | 10 | [1, 1000] | [1, 100]
HiLo | 955 | 10 | 153395 | 10 | [1, 1000] | [1, 10]
LoHi | 955 | 10 | 173418 | 10 | [1, 10] | [1, 100]
LoLo | 1007 | 10 | 150156 | 10 | [1, 10] | [1, 10]

Table 5: Inconsistent computing environment.
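ETC matrices of the kind summarized in Tables 4 and 5 can be generated in a few lines; the range-based generator below is a simplification that only approximates the method of [17], and the row-sorting trick used to obtain consistency is an assumption of this sketch.

    import numpy as np

    def make_etc(num_classes, num_sites, act_range, mach_range, consistent, rng=None):
        """Expected-time-to-compute matrix with given activity and machine heterogeneity.

        Each class gets a base time from act_range; each entry is scaled by a machine
        factor from mach_range. For a consistent matrix the factors of every class are
        sorted the same way, so a site that is faster for one class is faster for all."""
        rng = rng or np.random.default_rng(0)
        base = rng.uniform(*act_range, size=(num_classes, 1))
        factors = rng.uniform(*mach_range, size=(num_classes, num_sites))
        if consistent:
            factors = np.sort(factors, axis=1)
        return base * factors

    etc_hihi = make_etc(10, 10, act_range=(1, 1000), mach_range=(1, 100), consistent=True)
    print(etc_hihi.shape)   # (10, 10)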

4.3.1 Consistent heterogeneity

Table 4 presents the four input scenarios and Figure 11 illustrates our results. We will discuss the performance of all algorithms in the order from the slowest to fastest. For all four consistent cases, MET gives the worst results, because it maps all activities to the fastest machine, thus, MET does not appear in the figures. OLB usually performs the second worst. This is because there is no cooperation between different activity classes, and the resources are selected based on their availability without considering the activity execution time. In many cases, OLB maps activities to the worst machines. Max-min gives a poor result because it only fits the situation when some activities are much larger than the others. At the beginning of execution, Max-min helps the execution of larger activities, but smaller activities are ignored. This very special situation is seldom encountered, thus, Max-min is not a good option. In addition, there is no fairness to smaller activities, hence Max-min performs worse than most algorithms. MCT performs quite well for high machine heterogeneity scenarios, because there is a higher likelihood that it selects the fastest machine for activities, especially for larger activities. This algorithm performs poorly for low machine heterogeneity scenarios because it does not compare the activity execution time and only considers the completion time. On the other hand, under these situations, the differences between faster and slower machines are blurred. Therefore, it is less likely to select the fastest machine for the activity and MCT cannot achieve better performance for low machine heterogeneity scenarios. Sufferage performs quite similar to MCT for high machine heterogeneity, but performs 5%-10% better than MCT for low machine heterogeneity scenarios. This is because Sufferage makes more intelligent decisions by considering the activity execution time. Theoretically, Sufferage performs better than Min-min if the execution time of one activity on a certain machine is much slower than anywhere else (i.e. activity suffers if not run on a specific machine). For the heterogeneous environments considered in this study, this type of special case never occurs. Min-min performs well, giving the second best results in each case. In contrast to Max-min, at the beginning of execution it only handles the smallest activities and ignores larger ones. In the midst of execution, Min-min still handles smaller activities more frequently than larger activities, which implies that smaller activities have higher priorities than larger ones. There is not enough fairness among activity classes, and that is why Min-min loses performance. Game-quick scheduling algorithm provides the best performances for all four scenarios, because it can make the best globally intelligent decisions. It performs about 10% better than Min-min for LoHi scenario, and 5% better for the other three scenarios. We can observe that when fairness is ensured, efficiency is improved. According to related work [17], GA was able to improve upon the Min-min solution by 5%-10%, which means Game-quick performs as well as GA, but with much shorter algorithm execution time.

4.3.2 Inconsistent heterogeneity

Table 5 presents the four input scenarios and Figure 12 illustrates the results. For all four inconsistent cases, MET gives the worst results, because it maps most activities to a few of the fastest clusters. MET could perform better than OLB when the fastest clusters for the different activity classes are distributed evenly in the computing environment, but this special case rarely occurs; therefore, MET is still the worst heuristic. OLB, Max-min, MCT, and Sufferage perform worse for inconsistent than for consistent scenarios, because inconsistent computing environments are more complex than consistent ones: faster machines do not always perform better than slower machines. By design, most existing algorithms cannot effectively handle such high machine heterogeneity, which results in poor makespans; it is therefore more likely that OLB, MCT, and Sufferage assign more activities to slower machines. In contrast, Min-min performs better for inconsistent than for consistent scenarios, because the fastest machines are loaded more evenly and Min-min is thus able to assign more activities to the fastest machines, although it does not intentionally handle the change of environment either. Game-quick still provides the best mapping for the inconsistent cases and, for the same reasons as in the consistent scenarios, achieves better performance than the other algorithms.

4.4 Cost optimization

Figure 13 illustrates the cost of each algorithm as a percentage of the cost of Game-cost. We do not present as many results as in the previous experiments for Game-quick, because most existing algorithms are not comparable to Game-cost, even after we optimized the heuristics for this problem. From Figure 13(a) it can be observed that all algorithms need more than twice the cost of Game-cost. Consequently, we modified the original heuristics to incorporate cost and deadline control into OLB, MCT, Min-min, Max-min, and Sufferage. The optimized algorithms are marked with "*" (OLB*, MCT*, Min-min*, Max-min*, and Sufferage*), as shown in Figure 13(b). MET is not shown in the figures because it cannot meet the deadline. The relative cost order of the algorithms from the best to the worst was: (1) Game-cost, (2) MCT*, (3) Sufferage*, (4) Min-min*, (5) Max-min*, (6) Game-quick, and (7) OLB*. In this case, Game-cost finds mappings whose costs are better than MCT* by 27%, better than Sufferage* by 45%, and better than the other algorithms by at least 50%. Moreover, Game-cost can be combined with Game-quick to avoid the problem that deadlines cannot be met; this problem frequently occurs when we set a relatively short deadline, because if the deadline is limited, the cost optimization problem can be quite challenging.
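Purely as an illustration of the kind of cost and deadline control added to the starred heuristics, the Python sketch below shows one plausible site-selection rule under an assumed per-time-unit pricing model; the function and parameter names are hypothetical and not taken from our implementation.

def cheapest_feasible_site(etc_row, price, ready, deadline):
    # etc_row[s] : expected execution time of the activity on site s
    # price[s]   : assumed cost per time unit charged by site s
    # ready[s]   : time at which site s can start the activity
    # Among sites that can still finish before the deadline, pick the cheapest;
    # if no site meets the deadline, fall back to the earliest completion time
    # (plain MCT behaviour).
    sites = range(len(etc_row))
    feasible = [s for s in sites if ready[s] + etc_row[s] <= deadline]
    if feasible:
        return min(feasible, key=lambda s: etc_row[s] * price[s])
    return min(sites, key=lambda s: ready[s] + etc_row[s])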

Figure 13: Cost optimization results. (a) Percentage of cost; (b) percentage of cost (optimized heuristics).

Figure 11: Performance optimization results for consistent scenarios. (a) Makespan (HiHi); (b) makespan (HiLo); (c) makespan (LoHi); (d) makespan (LoLo); (e) percentage of makespans; (f) percentage of AET; (g) fairness.

5. RELATED WORK

In this section, we compare our work with two main areas of related work. One uses game-theoretic approaches in the design of distributed computing algorithms, and the other deals with the minimum multiprocessor scheduling problem (an NP-complete problem) in current Grid workflow management systems. The work described in this paper bridges these two categories.

In terms of game theory based algorithms, other researchers in performance-oriented distributed computing focus on system-level load balancing [3, 28] or resource allocation [10, 26] that aims to introduce economic and game-theoretic aspects into computational questions. Penmatsa et al. [28] formulate the problem as a cooperative game where Grid sites try to minimize the expected response time of tasks; Kwok et al. [26] investigate the impact of the selfish behavior of individual machines by taking their non-cooperativeness into account; the game theory algorithm in ICENI [20] solves the scheduling problem by eliminating strictly dominated strategies, in which the least optimal solutions are continually discarded, but this algorithm is not feasible in practice due to its high time complexity. However, cooperation within and between workflows is not considered by these systems, and the idea has not yet been carried over to the Grid computing field. The approach presented here successfully applies the idea of game theory to solving two NP-complete problems: the makespan minimization problem and the cost optimization problem. In this paper, we introduce two algorithms for multi-workflow optimization based on game theory. Our work differs from others in that we present a practical system model which can model the Grid environment and workflow applications, and formulate the scheduling problems as cooperative games among workflow managers. The algorithms we designed not only consider performance and cost, but also provide fairness to all workflows.

Many algorithms have been applied in current Grid workflow management systems. The Pegasus system [5] uses a weighted Min-min heuristic; UNICORE [2] and GridAnt [11] provide manual scheduling; scheduling in Taverna [16] in myGrid and in Triana [9] is done just-in-time; the GrADS project [6] provides scheduling using the Min-min, Max-min, and Sufferage algorithms; apart from the game theory algorithm, ICENI [20] provides scheduling using random, best-of-n-random, and simulated annealing; the scheduler in Gridbus [1] provides just-in-time mappings using Grid economy mechanisms. Most of the algorithms mentioned above have been compared in detail in Section 4. A comparison of eleven heuristics is also presented by Braun et al. [17]. None of these projects provides support for multiple workflows with effective optimization. Our work, in contrast, concentrates on performance and cost optimization for real large-scale scientific workflow applications.

6. CONCLUSION AND FUTURE WORK

With the increasing focus on large-scale workflow applications on the Grid, it is important for Grid middleware to efficiently and effectively schedule and dynamically steer the execution of workflows. In this paper, we analyze the main bottleneck of a special class of workflows characterized by a large number of homogeneous activities, and present a set of systematic solutions to two main problems: performance optimization and cost optimization. Based on game theory and, in particular, the idea of a sequential cooperative game, we provide two novel algorithms, called Game-quick and Game-cost, for these optimization problems. Based on a broad set of realistic and simulated experiments, we obtain better solutions with shorter algorithm execution times compared to existing approaches such as Min-min, Max-min, and Sufferage. Furthermore, we observe that the larger the scale of the experiments, the better the results we achieve. The improved fairness, performance, and cost optimization solutions have been applied and validated on the Austrian Grid platform with real workflow applications and a large number of randomly generated applications with various activity and machine heterogeneity. Game-quick and Game-cost significantly outperform the other algorithms in terms of different metrics. In addition, a novel system model with better controllability and predictability has been designed and implemented for the multi-workflow optimization problem in Grid environments.

Our algorithms work properly for applications which can be formulated as a typical and solvable game. For applications which are too complex to be formulated directly, preprocessing that partitions the game into different phases will be necessary before our algorithms can be applied. Hence, we plan to investigate more effective preprocessing techniques in future research. In addition, cost optimization for different resource provisioning modes could also be investigated.

Figure 12: Performance optimization results for inconsistent scenarios. (a) Makespan (HiHi); (b) makespan (HiLo); (c) makespan (LoHi); (d) makespan (LoLo); (e) percentage of makespans; (f) percentage of AET; (g) fairness.

7. REFERENCES

[1] Rajkumar Buyya and Srikumar Venugopal. The Gridbus toolkit for service oriented grid and utility computing: An overview and status report. In 1st International Workshop on Grid Economics and Business Models (GECON 2004), pages 19-36. IEEE Computer Society Press, April 2004.
[2] Dietmar W. Erwin and David F. Snelling. UNICORE: A Grid computing environment. Lecture Notes in Computer Science, 2150:825-??, 2001.
[3] C. Kim et al. An algorithm for optimal load balancing in distributed computer systems. IEEE Transactions on Computers, 41(3):381-384, 1992.
[4] E. Altman et al. Nash equilibria in load balancing in distributed computer systems. International Game Theory Review, 4(2):91-100, 2002.
[5] Ewa Deelman et al. Mapping abstract complex workflows onto grid environments. Journal of Grid Computing, 1:25-39, 2003.
[6] Francine Berman et al. The GrADS Project: Software support for high-level Grid application development. The International Journal of High Performance Computing Applications, 15(4):327-344, 2001.
[7] Haluk Topcuoglu et al. Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Transactions on Parallel and Distributed Systems, 13(3):260-274, 2002.
[8] Henri Casanova et al. Heuristics for scheduling parameter sweep applications in grid environments. In Proc. 9th Heterogeneous Computing Workshop (HCW), pages 349-363, Cancun, Mexico, May 2000.
[9] I. Taylor et al. Triana applications within Grid computing and peer to peer environments. Journal of Grid Computing, 1(2):199-217, 2003.
[10] Jonathan Bredin et al. A game-theoretic formulation of multi-agent resource allocation. In Proceedings of the Fourth International Conference on Autonomous Agents, pages 349-356, Barcelona, Catalonia, Spain, 2000. ACM Press.
[11] Kaizar Amin et al. GridAnt: A client-controllable Grid workflow system. In 37th Hawai'i International Conference on System Science, Island of Hawaii, Big Island, 5-8 January 2004.
[12] Muthucumaru Maheswaran et al. Dynamic mapping of a class of independent tasks onto heterogeneous computing systems. Journal of Parallel and Distributed Computing, 59(2):107-131, 1999.
[13] P. Blaha et al. WIEN2k: An Augmented Plane Wave plus Local Orbitals Program for Calculating Crystal Properties. Institute of Physical and Theoretical Chemistry, Vienna University of Technology, 2001.
[14] R. Armstrong et al. The relative performance of various mapping algorithms is independent of sizable variances in run-time predictions, 1998.
[15] S. Schindler et al. Metal enrichment processes in the intra-cluster medium. Astronomy and Astrophysics, 435:L25-L28, May 2005.
[16] T. Oinn et al. Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics, 20(17):3045-3054, 2004.
[17] Tracy D. Braun et al. A comparison of eleven static heuristics for mapping a class of independent tasks onto heterogeneous distributed computing systems. Journal of Parallel and Distributed Computing, 61(6):810-837, 2001.
[18] T. Fahringer, R. Prodan, R. Duan, et al. ASKALON: A Grid application development and computing environment. In 6th International Workshop on Grid Computing (Grid 2005), Seattle, USA, November 2005. IEEE Computer Society Press.
[19] R. F. Freund et al. Scheduling resources in multi-user, heterogeneous, computing environments with SmartNet. In HCW '98: Proceedings of the Seventh Heterogeneous Computing Workshop, page 3, Washington, DC, USA, 1998. IEEE Computer Society.
[20] N. Furmento, W. Lee, A. Mayer, S. Newhouse, and J. Darlington. ICENI: An open grid service architecture implemented with Jini, 2002.
[21] R. L. Graham. Bounds for certain multiprocessor anomalies. Bell System Technical Journal, 45:1563-1581, 1966.
[22] John H. Holland. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control and Artificial Intelligence. MIT Press, Cambridge, MA, USA, 1992.
[23] Oscar H. Ibarra and Chul E. Kim. Heuristic algorithms for scheduling independent tasks on nonidentical processors. Journal of the ACM, 24(2):280-289, 1977.
[24] R. Jain, D. Chiu, and W. Hawe. A quantitative measure of fairness and discrimination for resource allocation in shared computer systems, 1998.
[25] B. Kao and H. Garcia-Molina. Deadline assignment in a distributed soft real-time system. IEEE Transactions on Parallel and Distributed Systems, 8(12):1268-1274, 1997.
[26] Y. Kwok, S. Song, and K. Hwang. Selfish grid computing: Game-theoretic modeling and NAS performance results, 2005.
[27] Roger B. Myerson. Game Theory: Analysis of Conflict. Harvard University Press, September 1997.
[28] S. Penmatsa and A. T. Chronopoulos. Cooperative load balancing for a network of heterogeneous computers. In Proc. of the 21st IEEE Intl. Parallel and Distributed Processing Symposium (IPDPS 2006), Rhodes Island, Greece, April 25-29, 2006.
[29] Regional Atmospheric Modeling System. http://www.atmet.org/.
[30] R. Buyya. Economic-based Distributed Resource Management and Scheduling for Grid Computing. PhD thesis, Monash University, Melbourne, Australia, 2002.
[31] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 2nd edition, 2003.
