Optimising Static Workload Allocation in Multiclusters

Ligang He, Stephen A. Jarvis, Daniel P. Spooner and Graham R. Nudd Department of Computer Science, University of Warwick Coventry, United Kingdom, CV4 7AL {liganghe, saj, dps, grn}@dcs.warwick.ac.uk

Abstract

Workload allocation and job dispatching are two fundamental components of static job scheduling in distributed systems. This paper addresses static workload allocation techniques for two types of job stream in multicluster systems, namely non-real-time job streams and soft-real-time job streams, which request different qualities of service. Two workload allocation strategies, called ORT and OMR, are developed by establishing and numerically solving two optimisation equation sets. The ORT strategy achieves the Optimised mean Response Time for the non-real-time job stream, while the OMR strategy achieves the Optimised mean Miss Rate for the soft-real-time job stream over multiple clusters (both strategies can also be applied to a single-cluster system). The effectiveness of both strategies is demonstrated through theoretical analysis. The proposed workload allocation schemes are combined with two job dispatching strategies (Weighted Random and Weighted Round-Robin) to generate new static job scheduling algorithms for multicluster environments. These algorithms are evaluated through extensive experimental studies, and the results show that, compared with static approaches without the optimisation techniques, the proposed workload allocation schemes significantly improve the performance of static job scheduling in multiclusters, in terms of both the mean response time (for non-real-time jobs) and the mean miss rate (for soft-real-time jobs)*.

* This work is sponsored in part by grants from the NASA AMES Research Center (administered by USARDSG, contract no. N68171-01-C9012), the EPSRC (contract no. GR/R47424/01) and the EPSRC eScience Core Programme (contract no. GR/S03058/01).

1. Introduction

Clusters are now a popular computing platform for scientific and commercial applications. Separate clusters are now being interconnected to create multicluster computing architectures, and these constituent clusters may be located within a single organization or across wide geographical sites [5][9]. Job scheduling in these architectures can be categorized into static and dynamic scheduling [15]. Static approaches usually consider average system behaviour, such as the mean job arrival rate and mean job size, while dynamic schemes typically take the instantaneous system state into account when making scheduling decisions. Dynamic schemes usually perform better than static approaches; however, they typically incur much higher overheads in obtaining the current system information needed for their decisions. Hence, it remains worthwhile to develop static scheduling schemes that deliver useful performance improvements at low cost.

Static job scheduling in distributed systems usually consists of two fundamental components: off-line workload allocation and on-line job dispatching [15]. In the multicluster architectures assumed in this paper, the workload allocation scheme determines the proportion of workload for each cluster, while the job dispatching strategy distributes incoming independent jobs to the clusters as they arrive, in a way that satisfies the proportions specified by the workload allocation scheme.

There is now considerable support for using low-cost clusters to process soft-real-time jobs, where a fraction of jobs are allowed to miss their real-time requirements [11][16][17]. Scheduling schemes for non-real-time and soft-real-time jobs are usually evaluated by different performance criteria. The scheduling of non-real-time jobs typically aims to reduce the mean response time of the incoming jobs [15]. Soft-real-time jobs, however, have additional real-time requirements: although missing some jobs' real-time constraints is tolerable, the main objective of soft-real-time job scheduling is to minimize the fraction of jobs that miss them [16][17].
In this paper, optimisation techniques are addressed for both non-real-time and soft-real-time job scheduling in multicluster systems. Two workload allocation strategies, Optimised mean Response Time (ORT) and Optimised mean Miss Rate (OMR), are developed. The aim of ORT is to achieve the optimised mean response time for the incoming non-real-time job stream, and the aim of OMR is to achieve the optimised mean miss rate for the soft-real-time job stream. Workload allocation in the multiclusters is modelled mathematically using optimisation equation sets, and numerical solutions are developed to compute the workload allocated to each cluster. When each cluster has only one processing computer, the multicluster system reduces to a single (heterogeneous) cluster; the proposed workload allocation strategies can therefore also be applied in a single-cluster environment.

Weighted Random (Rand) and Weighted Round-Robin (RR) are two job dispatching strategies often used in real heterogeneous systems [15]. In this paper, the proposed ORT and OMR workload allocation strategies are combined with these two dispatching strategies to generate four new static job scheduling algorithms: ORT-RR, ORT-Rand, OMR-RR and OMR-Rand. Extensive experimental studies are conducted, and the results verify that these algorithms significantly outperform static scheduling algorithms that lack these optimisation techniques.

The rest of the paper is organized as follows. Section 2 presents related work. The system model assumed in this paper is discussed in Section 3. The two optimised workload allocation strategies are presented in Section 4, and their performance is evaluated in Section 5. Finally, Section 6 concludes the paper.
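Both dispatching strategies need only the fraction of jobs assigned to each cluster (written $\alpha_i$ in Section 4) as input. The minimal Python sketch below illustrates one way to realize each. It assumes the "smooth" form of weighted round-robin, a common way of interleaving clusters in proportion to their weights, which is not necessarily the exact RR variant used in [15] or in this paper; the function and class names are hypothetical.

```python
import random
from typing import Sequence


def weighted_random_dispatch(alphas: Sequence[float], rng: random.Random) -> int:
    """Weighted Random (Rand): each job goes to cluster i with probability alpha_i."""
    return rng.choices(range(len(alphas)), weights=alphas, k=1)[0]


class WeightedRoundRobin:
    """Weighted Round-Robin (RR), here in its 'smooth' form (an assumed variant).

    For every job, each cluster's credit grows by alpha_i; the cluster with the
    largest credit receives the job and pays back the total weight. Over many
    jobs, cluster i receives a share alpha_i of them, up to a bounded error.
    """

    def __init__(self, alphas: Sequence[float]) -> None:
        self.alphas = list(alphas)
        self.total = sum(self.alphas)
        self.credit = [0.0] * len(self.alphas)

    def next_cluster(self) -> int:
        for i, a in enumerate(self.alphas):
            self.credit[i] += a
        best = max(range(len(self.credit)), key=self.credit.__getitem__)
        self.credit[best] -= self.total
        return best
```

Unlike Rand, the round-robin variant is deterministic, so the realized proportions track the allocation closely even over short runs of jobs.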

2. Related work

Studies on multicluster systems are receiving a good deal of attention [2][4][5][7][8][10]. A multicluster model is presented in [5] that integrates different workstation clusters into a virtual parallel machine. A multi-protocol communication library is presented in [2] to provide computing support for multicluster systems. However, this work does not consider suitable static job scheduling schemes for multicluster systems.

Optimising static workload allocation in heterogeneous systems is non-trivial, and relevant research has been documented in a number of papers [3][13][15]. It is shown in [13] that allocating workload in proportion to computing capability does not achieve the best performance unless the system workload is very high; however, that paper does not develop a quantitative scheme to optimise performance. A similar problem is addressed in [3], where an optimisation function is established; however, no solution to the objective function is given, and the function is limited to multicomputer rather than multicluster systems. A static workload allocation technique aimed at optimising mean response time in a heterogeneous cluster is also addressed in [15], where both an optimisation function and its solution are given. However, that solution is in fact a special case of this work, in which each cluster in our multicluster architecture has only one computer. In this paper, a non-linear optimisation function is established for multicluster systems. Although a symbolic solution to the function cannot be obtained, a numerical solution is developed by exploiting a property of the objective function.

Furthermore, all the workload allocation techniques discussed above are intended for processing non-real-time jobs. This paper addresses workload allocation optimisation for both non-real-time and soft-real-time jobs. Using non-real-time cluster systems with conventional operating systems to process soft-real-time jobs is gaining popularity [1][11][16][17]. A soft-real-time job is considered to have missed its real-time requirement if its waiting time in the queue is greater than its slack [11][17]. [11] documents the possibility of using dual non-real-time servers to provide a soft-real-time service, and [17] extends this by investigating the feasibility of using homogeneous clusters for soft-real-time service. In this paper, a multicluster architecture is considered; the detailed system model is discussed in Section 3. The performance of soft-real-time job scheduling is evaluated in [11] in terms of the miss rate. However, this work is confined to a single homogeneous cluster, and it does not consider optimising the miss rate through judicious workload allocation. The work presented here is developed for multicluster architectures consisting of a number of different clusters, where each cluster is assumed to be homogeneous. The cited literature assumes that conventional operating systems are used, and incoming jobs are therefore processed on a First-Come-First-Served (FCFS) basis. The FCFS policy is also used for processing the soft-real-time job stream in this paper.

3. System model

The multicluster system assumed in this paper consists of $n$ different clusters, where each cluster comprises a set of homogeneous computers. Cluster $i$ (1≤i≤n) is modelled by an M/M/$m_i$ queue, where $m_i$ is the number of computers in cluster $i$. The multicluster architecture has two levels of scheduler, a global scheduler and multiple local schedulers, as shown in Fig.1. The global scheduler has no waiting queue, so incoming jobs are dispatched to individual clusters immediately. The local scheduler at each cluster adopts a centralized queueing architecture: a single waiting queue accommodates the jobs received from the global scheduler, and there are no waiting queues at the individual processing computers. Each local scheduler schedules jobs on a First-Come-First-Served basis and sends them to free processing computers for execution. Scheduling is non-preemptive. For the soft-real-time job stream, each job has a slack that follows a uniform distribution on $[s_l, s_u]$.

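[Figure 1 placeholder: a job stream of rate $\lambda$ arrives at the global scheduler, which dispatches sub-streams of rate $\alpha_1\lambda, \alpha_2\lambda, \ldots, \alpha_n\lambda$ to the waiting queues of the local schedulers; each local scheduler feeds its cluster's processing computers.]

Figure 1. The Multicluster architecture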
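The per-cluster behaviour assumed above (Poisson arrivals, exponential service, one FCFS queue feeding $m_i$ identical computers, and a miss whenever a job's waiting time exceeds a slack drawn from $U[s_l, s_u]$) can be checked with a small discrete-event simulation. The Python sketch below is illustrative only; the function name, its parameters and the default run length are hypothetical.

```python
import heapq
import random
from typing import Tuple


def simulate_cluster(arrival_rate: float, m: int, u: float,
                     s_l: float, s_u: float,
                     n_jobs: int = 200_000, seed: int = 0) -> Tuple[float, float]:
    """Simulate one cluster as an M/M/m FCFS queue with a single central queue.

    Returns (mean response time, miss rate). A job counts as a miss if its
    waiting time in the queue exceeds its slack, drawn from U[s_l, s_u].
    """
    rng = random.Random(seed)
    free_at = [0.0] * m          # time each computer next becomes free (a valid min-heap)
    clock = 0.0
    total_response = 0.0
    misses = 0
    for _ in range(n_jobs):
        clock += rng.expovariate(arrival_rate)      # Poisson arrivals
        earliest = heapq.heappop(free_at)           # first computer to become free
        start = max(clock, earliest)                # FCFS: wait in the queue until then
        service = rng.expovariate(u)                # exponential service time, rate u
        heapq.heappush(free_at, start + service)
        wait = start - clock
        total_response += wait + service           # response = waiting + execution
        if wait > rng.uniform(s_l, s_u):            # slack ~ U[s_l, s_u]
            misses += 1
    return total_response / n_jobs, misses / n_jobs
```

Under FCFS with a single queue, each arriving job simply starts on whichever computer frees up first, so a min-heap of release times is the only state the simulation needs.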

4. Workload allocation

When a job stream with average arrival rate $\lambda$ is presented to the global scheduler, as shown in Fig.1, it is decomposed by a static job scheduling scheme so that a fraction $\alpha_i$ of all jobs is allocated to cluster $i$. The objective of workload allocation is to determine $\{\alpha_1, \alpha_2, \ldots, \alpha_n\}$, which can be done off-line.

4.1. ORT (Optimised mean Response Time) workload allocation

For the non-real-time job stream, the workload allocation strategy aims to optimise the mean response time of the job stream in the multicluster system. The response time of a job is defined as the time from when the job arrives at the system until it is completed. Intuitively, the workload allocation strategy might take the heterogeneity of the clusters' performance into account by allocating to cluster $i$ (1≤i≤n) a workload fraction $\alpha_i$ proportional to its processing capability $m_iu_i$, i.e.

$$\alpha_i = \frac{m_i u_i}{\sum_{j=1}^{n} m_j u_j} \qquad (1)$$

This strategy is called weighted workload allocation. A more detailed analysis is given in this subsection to develop a workload allocation scheme that optimises the mean response time. The response time of a job is its waiting time in the queue plus its execution time. Hence, the average response time of the jobs in cluster $i$, denoted $R_i$, can be computed by Eq.2, where $W_i$ is the mean waiting time of the jobs in cluster $i$ and $u_i$ is the mean job service rate of each computer in cluster $i$.

$$R_i = W_i + \frac{1}{u_i} \qquad (2)$$

Cluster $i$, containing $m_i$ computers, is modelled as an M/M/$m_i$ queue (1≤i≤n). According to queueing theory [12], the mean waiting time $W_i$ is computed by Eq.3, where $\rho_i$ is the utilization of cluster $i$ and $W_{0i}$ is the mean remaining execution time of the job in service when a new job arrives.

$$W_i = \frac{W_{0i}}{1-\rho_i} \qquad (3)$$

The formula for $W_{0i}$ is given by Eq.4 [6], where $P_{m_i}$ is the probability that there are at least $m_i$ jobs in the system.

$$W_{0i} = \frac{P_{m_i}}{m_i u_i} \qquad (4)$$

If the fraction of the workload allocated to cluster $i$ is $\alpha_i$, then

$$\rho_i = \frac{\alpha_i \lambda}{m_i u_i} \qquad (5)$$

$P_{m_i}$ in Eq.4 is given by Eq.6 [6][12].

$$P_{m_i} = \frac{(m_i\rho_i)^{m_i}}{m_i!\,(1-\rho_i)\left[\sum_{k=0}^{m_i-1}\frac{(m_i\rho_i)^k}{k!} + \frac{(m_i\rho_i)^{m_i}}{m_i!\,(1-\rho_i)}\right]} \qquad (6)$$

Combining Eq.2-Eq.6 gives $R_i$ in terms of the single unknown variable $\alpha_i$, as shown in Eq.7.

$$R_i = \frac{m_i u_i \left(\frac{\alpha_i\lambda}{u_i}\right)^{m_i}}{\left[m_i!\sum_{k=0}^{m_i-1}\frac{1}{k!}\left(\frac{\alpha_i\lambda}{u_i}\right)^{k} + \frac{\left(\frac{\alpha_i\lambda}{u_i}\right)^{m_i}}{1-\frac{\alpha_i\lambda}{m_i u_i}}\right]\left(m_i u_i - \alpha_i\lambda\right)^2} + \frac{1}{u_i} \qquad (7)$$
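To make Eq.1-Eq.7 concrete, the minimal Python sketch below evaluates $R_i$ step by step (Eq.5, Eq.6, Eq.4, Eq.3, then Eq.2) and computes the weighted allocation of Eq.1. The function names are hypothetical, and Eq.6 is evaluated directly with floating-point powers and factorials, so this form is only suitable for modest $m_i$.

```python
import math
from typing import List, Sequence


def p_wait(m: int, rho: float) -> float:
    """P_{m_i} of Eq.6: probability that at least m_i jobs are in cluster i."""
    a = m * rho                                   # m_i * rho_i = alpha_i * lambda / u_i
    tail = a ** m / (math.factorial(m) * (1.0 - rho))
    head = sum(a ** k / math.factorial(k) for k in range(m))
    return tail / (head + tail)


def cluster_response_time(alpha: float, lam: float, m: int, u: float) -> float:
    """R_i of Eq.7, assembled from Eq.2-Eq.6 for a cluster receiving alpha * lam."""
    rho = alpha * lam / (m * u)                   # Eq.5
    if rho >= 1.0:
        return math.inf                           # cluster i would be saturated
    w0 = p_wait(m, rho) / (m * u)                 # Eq.4
    w = w0 / (1.0 - rho)                          # Eq.3
    return w + 1.0 / u                            # Eq.2


def weighted_allocation(ms: Sequence[int], us: Sequence[float]) -> List[float]:
    """Eq.1: alpha_i proportional to the processing capability m_i * u_i."""
    caps = [m * u for m, u in zip(ms, us)]
    total = sum(caps)
    return [c / total for c in caps]
```

For example, with `alphas = weighted_allocation(ms, us)`, the call `cluster_response_time(alphas[i], lam, ms[i], us[i])` gives $R_i$ under the weighted strategy. For a single computer ($m_i = 1$) the result reduces to the familiar M/M/1 value $1/(u_i - \alpha_i\lambda)$.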

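Thus, the mean response time of the incoming job stream over these $n$ clusters, denoted by $R$, can be computed by Eq.8.

$$R = \sum_{i=1}^{n} R_i \alpha_i \qquad (8)$$

Hence, to achieve the optimal mean response time of the job stream over the multicluster system, the objective is to find a workload allocation $\{\alpha_1, \alpha_2, \ldots, \alpha_n\}$ that minimizes Eq.8 subject to $\sum_{i=1}^{n}\alpha_i = 1$ and $0 \le \alpha_i \le \frac{m_i u_i}{\lambda}$ (the constraint $\alpha_i \le \frac{m_i u_i}{\lambda}$ ensures that cluster $i$ does not become saturated). This is a constrained minimization problem, and according to the Lagrange multiplier theorem, solving it is equivalent to solving the following equation set.

$$\begin{cases} \sum_{i=1}^{n}\alpha_i = 1,\quad 0 \le \alpha_i \le \frac{m_i u_i}{\lambda} & \text{(a)}\\[4pt] \frac{\partial}{\partial \alpha_k}\left(\sum_{i=1}^{n} R_i\alpha_i\right) - v\,\frac{\partial}{\partial \alpha_k}\left(\sum_{i=1}^{n}\alpha_i - 1\right) = 0,\quad 1 \le k \le n & \text{(b)} \end{cases} \qquad (9)$$

Since $\alpha_i$ is the only unknown variable in the expression of $R_i$, every term $R_i\alpha_i$ with $i \ne k$ has zero partial derivative with respect to $\alpha_k$, and $\frac{\partial}{\partial\alpha_k}\left(\sum_{i=1}^{n}\alpha_i - 1\right) = 1$. Evaluating the partial derivatives in Eq.9.b therefore reduces Eq.9 to Eq.10.

$$\begin{cases} \sum_{i=1}^{n}\alpha_i = 1,\quad 0 \le \alpha_i \le \frac{m_i u_i}{\lambda} & \text{(a)}\\[4pt] \frac{\partial}{\partial \alpha_k}\left(R_k\alpha_k\right) = v,\quad 1 \le k \le n & \text{(b)} \end{cases} \qquad (10)$$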
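Eq.10 can be solved numerically with the nested binary search that the following paragraphs develop into Algorithm 1. The Python sketch below is a minimal illustration under stated assumptions, not the authors' implementation, and it does not reproduce Algorithm 1 line by line. Assumptions: $\partial(R_i\alpha_i)/\partial\alpha_i$ is estimated by a finite difference; $P_{m_i}$ is computed with the standard Erlang B recursion, which is algebraically equivalent to Eq.6 but avoids overflow for large $m_i$; each $\alpha_i$ is searched only up to just below $m_iu_i/\lambda$ (or 1); and instead of evaluating the upper limit of $v$ at $\alpha_i = 0.999$, the sketch doubles that limit until the allocation sums to at least 1. All names and tolerances (`marginal`, `alpha_for`, `ort_allocation`, `alpha_tol`) are hypothetical.

```python
import math
from typing import List, Sequence


def response_time(alpha: float, lam: float, m: int, u: float) -> float:
    """R_i(alpha_i) from Eq.2-Eq.5, with P_{m_i} of Eq.6 via the Erlang B recursion."""
    rho = alpha * lam / (m * u)
    if rho >= 1.0:
        return math.inf                          # cluster i saturated
    a = alpha * lam / u                          # offered load m_i * rho_i
    b = 1.0                                      # Erlang B with 0 servers
    for k in range(1, m + 1):
        b = a * b / (k + a * b)
    p_wait = b / (1.0 - rho * (1.0 - b))         # equals P_{m_i} of Eq.6
    return p_wait / (m * u * (1.0 - rho)) + 1.0 / u


def marginal(alpha: float, lam: float, m: int, u: float, h: float = 1e-7) -> float:
    """Finite-difference estimate of d(R_i * alpha_i)/d(alpha_i), the left side of Eq.10.b."""
    def cost(x: float) -> float:
        return x * response_time(x, lam, m, u)
    lo, hi = max(alpha - h, 0.0), alpha + h
    return (cost(hi) - cost(lo)) / (hi - lo)


def alpha_for(v: float, lam: float, m: int, u: float, tol: float = 1e-12) -> float:
    """Inner bisection: the alpha_i whose marginal equals v (0 if v is too small)."""
    if v <= marginal(0.0, lam, m, u):
        return 0.0                               # cf. Steps 5-6: cluster gets no work
    lo, hi = 0.0, min(1.0, (1.0 - 1e-9) * m * u / lam)
    if marginal(hi, lam, m, u) <= v:
        return hi                                # v lies above the whole search range
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if marginal(mid, lam, m, u) < v:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def ort_allocation(lam: float, ms: Sequence[int], us: Sequence[float],
                   alpha_tol: float = 1e-6, max_iter: int = 200) -> List[float]:
    """Outer bisection on the common value v of Eq.10.b until sum(alpha_i) = 1."""
    if sum(m * u for m, u in zip(ms, us)) <= lam:
        raise ValueError("total capacity must exceed the arrival rate")

    def alloc(v: float) -> List[float]:
        return [alpha_for(v, lam, m, u) for m, u in zip(ms, us)]

    v_lo = min(marginal(0.0, lam, m, u) for m, u in zip(ms, us))
    v_hi = max(marginal(0.0, lam, m, u) for m, u in zip(ms, us))
    for _ in range(max_iter):                    # grow v_hi until it over-allocates
        if sum(alloc(v_hi)) >= 1.0:
            break
        v_hi *= 2.0
    for _ in range(max_iter):
        v_mid = (v_lo + v_hi) / 2.0
        alphas = alloc(v_mid)
        total = sum(alphas)
        if abs(total - 1.0) < alpha_tol:
            return alphas
        if total < 1.0:
            v_lo = v_mid                         # too little work placed: raise v
        else:
            v_hi = v_mid                         # too much work placed: lower v
    raise RuntimeError("no allocation within alpha_tol; loosen the tolerances")
```

Because each $\partial(R_i\alpha_i)/\partial\alpha_i$ increases with $\alpha_i$, the total $\sum_i \alpha_i(v)$ increases with $v$, which is why plain bisection suffices at both levels. For two equal clusters, the sketch splits the load evenly; when one cluster is far slower, it can assign that cluster no work at all, in line with the behaviour discussed after Algorithm 1.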

It is impossible to find a general symbolic solution $\{\alpha_1, \alpha_2, \ldots, \alpha_n\}$ to Eq.10 because of the complicated expression of $R_i$. However, a property of Eq.10.b, revealed below, enables us to develop a numerical solution. The left side of Eq.10.b can be transformed into Eq.11.

$$\frac{\partial}{\partial\alpha_k}\left(R_k\alpha_k\right) = \alpha_k\frac{\partial R_k}{\partial\alpha_k} + R_k \qquad (11)$$

In queueing theory [6][12], the mean response time of jobs, $R_k$, increases monotonically with the average job arrival rate, here $\alpha_k\lambda$. Furthermore, the slope $\frac{\partial R_k}{\partial\alpha_k}$ also increases monotonically as $\alpha_k$ increases. By Eq.11, $\frac{\partial}{\partial\alpha_k}(R_k\alpha_k)$ is therefore a monotonically increasing function of $\alpha_k$. Based on this property, we develop a numerical solution to Eq.10 and thereby derive the optimised workload allocation $\{\alpha_1, \alpha_2, \ldots, \alpha_n\}$. The numerical solution is shown in Algorithm 1.

Algorithm 1. Computation of workload allocation among clusters for optimised mean response time
1. Let the lower and upper limits of v be v_lower and v_upper;
2. while (v_lower ≤ v_upper)
3.     v_mid = (v_lower + v_upper)/2;
4.     for each cluster i (1≤i≤n) do
5.         if (v_mid < ∂(R_iα_i)/∂α_i evaluated at α_i = 0)
6.             α_i = 0;
7.         else if (v_mid > m_iu_i/λ)
8.             v_upper = v_mid; continue;
9.         while (α_lower ≤ α_upper)
10.            α_mid = (α_lower + α_upper)/2;
11.            v_cur = ∂(R_iα_i)/∂α_i evaluated at α_i = α_mid;
12.            if (the difference between v_cur and v_mid is less than v_valve)
13.                α_i = α_mid;
14.            if (v_cur is less than v_mid)
15.                α_lower = α_mid;
16.            else
17.                α_upper = α_mid;
18.        end while
19.    end for i
20.    α_sum = α_1 + α_2 + … + α_n;
21.    if (the difference between α_sum and 1 is less than α_valve)
22.        the current set of α_i (1≤i≤n) is the correct workload allocation;
23.    else if (α_sum is less than 1)
24.        v_lower = v_mid;
25.    else
26.        v_upper = v_mid;
27. end while

Algorithm 1 is explained as follows. According to Eq.10, $\frac{\partial}{\partial\alpha_i}(R_i\alpha_i)$ (1≤i≤n) must equal a common value $v$. Since $\frac{\partial}{\partial\alpha_i}(R_i\alpha_i)$ (1≤i≤n) is a monotonically increasing function of $\alpha_i$, the lower limit of $v$ in Algorithm 1, v_lower, can be set to the minimum of $\frac{\partial}{\partial\alpha_i}(R_i\alpha_i)$ (1≤i≤n) evaluated at $\alpha_i = 0$, while the upper limit, v_upper, can be set to the maximum of $\frac{\partial}{\partial\alpha_i}(R_i\alpha_i)$ (1≤i≤n) evaluated at $\alpha_i = 0.999$. Since the derivative of $R_i\alpha_i$ at $\alpha_i = 0$ decreases as $m_iu_i$ increases, v_lower is effectively $\left.\frac{\partial}{\partial\alpha_k}(R_k\alpha_k)\right|_{\alpha_k=0}$, where $k$ is the cluster with the greatest $m_ku_k$ (i.e., the cluster with the most computing capability). For an arbitrary v_mid between v_lower and v_upper, the algorithm searches for a suitable α_mid (the search space of $\alpha_i$ is [0, 1]) such that the difference between v_mid and v_cur, where v_cur is $\frac{\partial}{\partial\alpha_i}(R_i\alpha_i)$ evaluated at $\alpha_i$ = α_mid, is less than a predefined threshold v_valve. In this way, a specific value of $\alpha_i$ is obtained for each cluster (1≤i≤n). The algorithm then adds these values to obtain α_sum. If the sum exceeds 1 by more than a predefined threshold α_valve, the current v_mid is too high, and a lower value is used to compute a new set of $\alpha_i$; conversely, if the sum falls below 1 by more than α_valve, a higher value of $v$ is used in the next iteration. This binary search technique is used to search for the actual $v$ and $\alpha_i$ in their respective search spaces.

In Algorithm 1, if v_mid is less than $\left.\frac{\partial}{\partial\alpha_i}(R_i\alpha_i)\right|_{\alpha_i=0}$ (Step 5), the algorithm cannot find an $\alpha_i$ with $0 \le \alpha_i \le \frac{m_iu_i}{\lambda}$ that satisfies $\frac{\partial}{\partial\alpha_i}(R_i\alpha_i)$ = v_mid. In that case $\alpha_i$ is set to 0 (Step 6), and the workload allocation is recalculated for the remaining clusters. It can be proved that the lower $m_iu_i$ is, the greater $\left.\frac{\partial}{\partial\alpha_i}(R_i\alpha_i)\right|_{\alpha_i=0}$ is (the proof is omitted in this paper). This suggests that in a highly heterogeneous multicluster environment (in terms of $m_iu_i$), clusters with low processing capabilities may not be allocated any workload if the optimised mean response time is to be achieved. This scheduling behaviour is also observed in the literature [3][13][15] for a single-cluster environment. The impact of heterogeneity of the

multicluster architecture on scheduling performance is extensively evaluated in our experimental studies. The feasibility and effectiveness of Algorithm 1 are proven in Theorem 1.

Theorem 1. The workload allocation $\{\alpha_1, \alpha_2, \ldots, \alpha_n\}$ computed by Algorithm 1 minimizes the average response time of the incoming job stream in a multicluster system of $n$ clusters.

Proof: To prove the theorem, we need to establish the following two points.
1. Algorithm 1 can generate a set $\{\alpha_1, \alpha_2, \ldots, \alpha_n\}$ that satisfies Eq.10.
2. The generated workload allocation leads to the minimal mean response time of the job stream over these $n$ clusters.

As stated above, $\frac{\partial}{\partial\alpha_i}(R_i\alpha_i)$ (1≤i≤n) is a monotonically increasing function of $\alpha_i$. Hence, for any $v$ in [v_lower, v_upper], if $v$ lies in the range of $\frac{\partial}{\partial\alpha_i}(R_i\alpha_i)$, which is $\left[\left.\frac{\partial}{\partial\alpha_i}(R_i\alpha_i)\right|_{\alpha_i=0}, \infty\right)$ (1≤i≤n), binary search can find a value of $\alpha_i$ in its domain [0, 1] that satisfies $\frac{\partial}{\partial\alpha_i}(R_i\alpha_i) = v$. If, for some cluster $k$ (1≤k≤n), $v$ is not in the range of $\frac{\partial}{\partial\alpha_k}(R_k\alpha_k)$ (i.e., $v < \left.\frac{\partial}{\partial\alpha_k}(R_k\alpha_k)\right|_{\alpha_k=0}$), we set $\alpha_k$ to 0. Suppose we obtain a workload allocation $\{\alpha_1', \alpha_2', \ldots, \alpha_n'\}$ from $v = v'$ and another, $\{\alpha_1'', \alpha_2'', \ldots, \alpha_n''\}$, from $v = v''$. Since $\frac{\partial}{\partial\alpha_i}(R_i\alpha_i)$ increases monotonically with $\alpha_i$, if $v'' > v'$, there must be $\alpha_i'' > \alpha_i'$ (1≤i≤n), and if $v''$