Cluster Comput (2010) 13: 31–46 DOI 10.1007/s10586-009-0103-1

PPDD: scheduling multi-site divisible loads in single-level tree networks Xiaolin Li · Bharadwaj Veeravalli

Received: 2 July 2007 / Accepted: 1 September 2009 / Published online: 1 October 2009 © Springer Science+Business Media, LLC 2009

Abstract This paper investigates scheduling strategies for divisible jobs/loads originating from multiple sites in hierarchical networks with heterogeneous processors and communication channels. In contrast, most previous work in the divisible load theory (DLT) literature has mainly addressed scheduling problems with loads originating from a single processor. This is one of the first works to address scheduling multiple loads from multiple sites in the DLT paradigm. Scheduling multi-site jobs is common in Grids and other general distributed systems for resource sharing and coordination. An efficient static scheduling algorithm, PPDD (Processor-set Partitioning and Data Distribution Algorithm), is proposed to near-optimally distribute multiple loads among all processors so that the overall processing time of all jobs is minimized. The PPDD algorithm is applied to two cases: when processors are equipped with front-ends and when they are not. The application of the algorithm to homogeneous systems is also studied. Further, several important properties exhibited by the PPDD algorithm are proven through lemmas. To implement the PPDD algorithm, we propose a communication strategy. In addition, we compare the performance of the PPDD algorithm with the commonly used Round-robin Scheduling Algorithm (RSA). Extensive case studies through numerical analysis have been conducted to verify the theoretical findings.

X. Li, Computer Science Department, Oklahoma State University, 219 MSCS, Stillwater, OK 74078, USA. e-mail: [email protected]

B. Veeravalli, Department of Electrical and Computer Engineering, The National University of Singapore, 4 Engineering Drive 3, 119260, Singapore, Republic of Singapore. e-mail: [email protected]

Keywords Divisible load theory · Heterogeneous computing · Load scheduling · Grid computing · Single-level tree networks

1 Introduction

Parallel and distributed heterogeneous computing has become an efficient solution methodology for various real-world applications in science, engineering, and business [1–4]. One of the key issues is how to partition and schedule jobs/loads that arrive at processing nodes among the available system resources so that the best performance is achieved with respect to the finish time of all input tasks. To efficiently utilize the computing resources, researchers have contributed a large number of load/task scheduling and balancing strategies in the literature [1, 5–7]. Recent efforts have focused on resource sharing and coordination across multi-site resources (e.g., multiple supercomputer centers or virtual organizations).

For divisible load scheduling problems, research since 1988 has established that the optimal workload allocation and scheduling to processors and links can be solved through a very tractable linear model formulation, referred to as Divisible Load Theory (DLT) [6]. DLT features easy computation, a schematic language, equivalent network element modeling, results for infinite-sized networks, and numerous applications. This theoretical formulation opens up attractive modeling possibilities for systems incorporating communication and computation issues, as in parallel, distributed, and Grid environments. Here, the optimality, involving solution time and speedup, is derived in


the context of a specific scheduling policy and interconnection topology. The formulation usually generates optimal solutions via a set of linear recursive equations. In simpler models, recursive algebra also produces optimal solutions. The model takes into account the heterogeneity of processor and link speeds as well as relative computation and communication intensity. DLT can model a wide variety of approaches with respect to load distribution (sequential or concurrent), communication (store-and-forward and virtual cut-through switching), and hardware availability (presence or absence of front-end processors). Front-end processors allow a processor to communicate and compute simultaneously by taking over communication duties. A recent survey of DLT research can be found in [8]. The DLT paradigm has proven to be remarkably flexible in handling a wide range of applications.

1.1 Related work

Since the early days of DLT research, the field has spanned from general optimal scheduling problems on different network topologies to various scenarios with practical constraints, such as time-varying channels [9], minimizing cost factors [10], resource management in Grid environments [11, 12], and distributed image processing [13]. Thorough surveys of DLT can be found in [5, 6, 14, 15]. Load partitioning of intensive computations of large matrix-vector products in a multicast bus network was theoretically investigated in [16]. Research efforts after 1996 particularly started focusing on practical issues such as scheduling multiple divisible loads [17], scheduling divisible loads with arbitrary processor release times in linear networks [18], communication startup time [19, 20], and buffer constraints [21]. Some of the proposed algorithms were tested through experiments on real-life application problems such as image processing [13], matrix-vector product computations [22], and database operations [23].
Various experimental works have used the divisible load paradigm, such as [22] for matrix-vector computation on PC clusters and [23] for other applications on networks of workstations (NOWs). Recent work in DLT has also attempted to use adaptive techniques when computation needs to be performed under unknown node and link speeds [24]; this study used bus networks as the underlying topology. Beaumont et al. consolidate the results for single-level tree and bus topologies and present extensive discussions on some open problems in this domain [25]. A few new applications and solutions in DLT have been investigated in recent years, e.g., bioinformatics [26], multimedia streaming [27], sensor networks [28, 29], and economic and game-theoretic approaches [30, 31]. Although most of the contributions in the DLT literature consider only a single load originating at one processor [14, 15],


scheduling multiple loads has been considered in [32] and [17]. Work presented in [33] considers processing divisible loads originating from an arbitrary site on an arbitrary graph. However, these works considered merely a single-site multi-load scheduling problem and did not address multiple loads originating at arbitrary multiple sites/nodes in networks. Multi-site multi-load scheduling is a practical situation, e.g., multiple jobs submitted to multiple sites in Grids. The point of load origination imposes a significant influence on the performance. In addition, when one considers multiple loads originating from several nodes/sites, it becomes much more challenging to design efficient scheduling strategies. One paper relevant to the context of the problem addressed in our work is [34]. This study investigated load scheduling and migration problems without synchronization delays in a bus network by assuming that all processors have front-ends and that the communication channel can be dynamically partitioned. Front-ends are communication co-processors that handle communication without involving the processors, so that communication and computation can be fully overlapped and concurrent [6]. In this case, load distribution without any consideration of synchronization delay is quite straightforward, as will be shown later. However, in practice, it would be unreasonable to assume that the channel can be dynamically partitioned. In addition, we shall also consider the case when processors are not equipped with front-ends. In particular, in distributed sensor systems, front-end modules may be absent from the processing elements [6]. Recently, [35] investigated the case of two load origination sources in a linear daisy chain architecture. Divisible load scheduling problems with multiple load sources in Grid environments have been studied in [11, 36].
In this paper, we consider a general load scheduling and balancing problem with multiple loads originating from multiple processors in a single-level tree network. This scenario commonly arises in realistic situations, such as applications in distributed real-time systems, collaborative Grid systems (where each virtual organization can be abstracted as a resource site or a local hierarchical network), and general load balancing and sharing applications [1, 2, 37]. In Grid environments, our proposed model can be applied to the following scenario: we have a super-scheduler across multiple sites and a local scheduler for each site; multiple jobs are submitted to local schedulers and possibly partitioned and migrated across multiple sites by the super-scheduler for resource sharing, load balancing, and high performance/throughput.

1.2 Our contributions

The contributions of this paper are as follows. The primary motivation of this work stems from the fact that, in a real-world scenario, there could be multiple loads submitted for



Fig. 1 A single-level tree network with multiple loads

processing on networks, originating from several geographically distributed sites, such as in Grid computing environments [1]. While multiple-loads processing has been studied in the DLT literature [17, 32], these studies focus on bus networks and assume that all the loads are available at the root (bus-controller unit) a priori. We regard these as single-site multiple-jobs problems. The study in this paper is different in formulation and attempts to provide a generalized framework. We formulate the load scheduling problem with multiple loads originating from multiple sites in single-level tree networks.1 For the cases with and without front-ends, we design a scheduling strategy, referred to as the Processor-set Partitioning and Data Distribution Algorithm (PPDD), to achieve near-optimal processing time of all loads. Several significant properties of the PPDD algorithm are proven in lemmas. A detailed analysis of the time performance of the PPDD algorithm is conducted. In order to actually implement the load distribution obtained through the PPDD algorithm, we propose a load communication strategy. In addition, we compare the time performance of the PPDD algorithm with another algorithm, referred to as the Round-robin Scheduling Algorithm (RSA). It is demonstrated that the proposed PPDD algorithm produces better scheduling solutions than RSA. We verify all these findings via detailed numerical examples on heterogeneous systems of processors. The contributions of this paper are expected to spur further research in this direction, especially when considering scheduling loads on arbitrary networks from multiple sites.

This paper is organized as follows. We first formulate the problem and present some notations in Sect. 2. In Sect. 3, we consider load partitioning strategies for the cases with and without front-ends. Then, we present the communication strategies for these cases in Sect. 4, and prove some important results to analyze the performance of the algorithms. In Sect. 5, we discuss in detail and compare the time performance of the PPDD algorithm and RSA. Section 6 concludes the paper and presents some possible extensions to this work.

1 It may be noted that our formulation holds for a bus network topology, which is a special case of a single-level tree network.

2 Problem formulation and some notations

This section first introduces the network architecture and then presents the definitions, notations, and terminology to be used throughout the paper. As shown in Fig. 1, we consider a single-level tree network with a root processor p_0, also referred to as the scheduler for the system, and m processors denoted as p_1, . . . , p_m, connected via links l_1, . . . , l_m, respectively. We assume that the scheduler is only in charge of collecting the load status of each processor and routing loads from one processor to another; it does not participate in processing any load. In other words, p_0 works like a router. Initially, each processor is assumed to have a load to be processed. The objective is to minimize the overall processing time of all the loads submitted to the system (at various processors). If we do not schedule the loads among the set of processors, then the overall processing time is determined by the time when the last processor finishes processing its own load. In order to minimize the overall finish time, we should carefully re-schedule and balance the loads among all processors. Also, the scheduling strategy must be such that a faster processor processes more load while a slower processor processes less. Since the processors, the links, and the sizes of the loads originating at various processors are heterogeneous (non-identical), obtaining an optimal solution is a complex problem. In the load balancing literature [38], the basic rationale is to balance loads in such a way that some load fractions from over-loaded processors are transferred to under-loaded processors, so that all the processors have more-or-less identical processing times for the loads assigned to them. Here too, we follow the same strategy as a basic mechanism for balancing the divisible loads among the processors.
We introduce some notations and terminology that will be used throughout the paper as follows.


E_i: The time it takes to compute a unit load by processor p_i, i = 1, . . . , m.
C_i: The time it takes to transmit a unit load on link l_i, i = 1, . . . , m.
L_i: The amount of load originating at p_i for processing, as shown in Fig. 1.
α_i: The load assigned to p_i according to a scheduling strategy.
η: The load distribution obtained. This is defined as an m-tuple denoting the loads assigned to each p_i, and is given by η = {α_1, α_2, . . . , α_m}. Certainly, the sum of the α_i, i = 1, 2, . . . , m, should be the same as the sum of the original loads, that is, L = Σ_{i=1}^{m} α_i = Σ_{i=1}^{m} L_i.
ΔL_i: The load portion to be transferred from or to a processor p_i, i = 1, 2, . . . , m.
T_i(m): The finish time for processing the load at p_i.
T(m): The overall processing time for all the loads processed by m processors. This is given by T(m) = max_{i=1,...,m} T_i(m).
T*(m): The optimal processing time of all the loads.
S_over: The set of processors which are over-loaded. Processors in this set are the potential senders of excess loads.
S_under: The set of processors which are under-loaded. Processors in this set are the potential receivers of loads transferred from the processors in S_over.

In our formulation, we consider a single-level tree network with m processors and a scheduler p_0. Each processor p_i has its own divisible load of size L_i to process, and the goal is to design an efficient scheduling strategy to minimize the overall processing time of all the loads (on all the processors) by partitioning and distributing the loads among all m processors. Note that the proposed scheduling strategy also handles the situation in which only a subset of the processors have loads to process. Note that T_i(m), the processing time at p_i, is a function of E_i, C_i, α_i, and ΔL_i. According to the above definitions, for a given load of size L units, its computation time at processor p_i is L E_i and its communication time over link l_i is L C_i.
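For concreteness, the notation above can be exercised with a short script. The numeric values are taken from Example 1 in Sect. 3.2; the variable names are ours, not the paper's:

```python
# Unit computation times E_i (sec/MB), unit link times C_i (sec/MB), and
# originating loads L_i (MB), indexed so that L_i*E_i is non-decreasing.
E = [50, 65, 60, 45, 80]
C = [0.3, 0.2, 0.15, 0.1, 0.55]
L = [100, 110, 120, 180, 150]

# With no load sharing, eta = {L_1, ..., L_m}: the finish time of p_i is
# T_i(m) = L_i * E_i and the overall time is T(m) = max_i T_i(m).
finish = [l * e for l, e in zip(L, E)]
print(finish)       # [5000, 7150, 7200, 8100, 12000]
print(max(finish))  # 12000
```

This reproduces the per-processor times quoted in Example 1 and shows why, without scheduling, the slowest pair L_m E_m dominates.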
Note that the central scheduler works like a router. On a network link, in general, the time taken by a load to reach its final destination depends on the slowest link on its path, owing to the available bandwidth on the various links comprising the path [39]. Thus, if l_i and l_j are the links connecting the source and destination nodes and C_i ≤ C_j, then we assume that the communication time taken to reach the destination via links l_i and l_j is simply L C_j. It may be noted that this assumption does not affect the way in which the strategy is designed; in fact, we will show that it eases analytical tractability. Without loss of generality, we index all p_i in the order L_i E_i ≤ L_{i+1} E_{i+1}, i = 1, . . . , m − 1. Thus, without any


load partitioning and scheduling, the overall processing time of all the loads is determined by the finish time of the last processor p_m, which is given by T_max(m) = L_m E_m. In the case that L_i = 0 for some i ∈ [1, m], we shall group these processors as an equivalent processor p_0. Thus, the analysis that follows will assume there is at most one processor which has no load. Since a divisible load is assumed to be computationally intensive, a natural assumption is that the computation time for a given load is much larger than the communication time, that is, E_i > C_j, i, j = 1, 2, . . . , m. A more general discussion of computation-intensive applications and their computation-communication ratios can be found in [2, 40]. In addition, in [6], a condition referred to as “Rule A” for single-level tree networks was used to eliminate all the redundant processor-link pairs to obtain an optimal reduced network that achieves the optimal processing time. In our current formulation, with the assumption that E_i > C_j, i, j = 1, 2, . . . , m, Rule A is automatically satisfied. The reader is referred to the implications of Rule A as explained in [6]. In the next section, we shall first identify a condition that designates a processor as over-loaded or under-loaded and form the S_over and S_under sets, respectively. Then, we obtain the exact load fractions ΔL_i, p_i ∈ S_under, to be received by processors in S_under and ΔL_j, p_j ∈ S_over, to be extracted from processors in S_over so as to minimize the overall processing time. Thus, we obtain the exact load portion assigned to p_i as α_i = L_i + ΔL_i, p_i ∈ S_under, or α_j = L_j − ΔL_j, p_j ∈ S_over, for i, j ∈ {1, . . . , m}. From the resulting load distribution η, we obtain the overall processing time of all the loads.

3 Load partitioning strategies

In this section, we consider two cases, namely when all the processors are equipped with front-ends and when they are not.
In the case with front-ends, we can improve the processing time performance by efficiently overlapping communication with computation [6]. In the case without front-ends, however, communication and computation at each processor cannot be fully overlapped, and the communication delays incurred while redistributing the loads among processors need to be minimized. The strategy involves two phases. In the first phase, the entire set of loads is partitioned. In the second phase, the partitioned loads are transferred from one processor to another following a communication strategy. These two phases are carried out for both the with and without front-end cases. This section focuses on the first phase, and the next section investigates the second phase. The implementation of the proposed PPDD algorithm also involves these two phases. In the first phase, the scheduler p_0 collects the load distribution information about all slave processors and applies the


PPDD algorithm to obtain the near-optimal load partitions. In the second phase, the set of over-loaded processors initiates sending data and the set of under-loaded processors initiates receiving data; the scheduler coordinates these slaves' sending and receiving operations by routing data among them. Note that, although the PPDD algorithm iterates to obtain the near-optimal data partition, the amount of load migrated for each processor is adjusted only once. We assume that each processor initially has its own load to process. A processor can start processing its own load or communicating with other processors from some time, say t = 0, onwards, as per the design of the scheduling strategy. During some time interval, a processor may have no load available to process and may not be engaged in receiving any load from other processors. In this situation, the processor simply remains idle. However, this processor may be assigned a load portion at some later time by the scheduler; hence, until that time, the processor will remain idle. We refer to this idle time interval as a starvation gap in the rest of the paper. Efficient load balancing strategies are thus expected to minimize these starvation gaps and maximize system utilization.

3.1 With front-ends

For the case with front-ends, consider an ideal situation in which there is no starvation gap. Also, we assume that the entire communication can be overlapped by computation. In other words, a processor will not starve for data: while it is receiving data from other processors, it is engaged in processing its own load. We refer to this situation as the ideal case hereafter. To achieve the optimal processing time for the entire set of loads, we should balance the loads among all the processors such that all the participating processors finish processing at the same time instant. We use this criterion as the optimality condition to determine the optimal solution, as in [6].
Intuitively, if some processors complete processing earlier and other processors complete later, we can reschedule some workload from the late processors to the early processors to reduce the overall processing time (which is determined by the processor that finishes last). Thus, for the ideal situation mentioned above, the optimal processing time is given by

T*_ideal(m) = (Σ_{i=1}^{m} L_i) / (Σ_{i=1}^{m} 1/E_i)    (1)

In the above equation, the numerator is the sum of all the loads and the denominator is the total processing power available in the system. From (1), we can obtain the load portions to be transferred from/to the nodes as,

ΔL_i = |(L_i E_i − T*_ideal(m)) / E_i| = |L_i − α*_i|,  i = 1, 2, . . . , m    (2)

where α*_i = T*_ideal(m)/E_i, i = 1, 2, . . . , m, is the load processed at processor p_i after balancing. Note that when L_i > α*_i, processor p_i belongs to the set of over-loaded nodes S_over (senders) and hence some load at p_i should be transferred to other nodes. On the other hand, when L_i ≤ α*_i, p_i belongs to the set of under-loaded nodes S_under (receivers), to which load from other nodes will be transferred. As mentioned earlier, since we index the processors in order of minimum L_i E_i first, we can uniquely obtain an integer K such that p_i ∈ S_under, i = 1, 2, . . . , K, and p_i ∈ S_over, i = K + 1, K + 2, . . . , m. We refer to K as a delimiter separating the receiver and sender sets. Thus, for all p_i ∈ S_under we have L_i < α*_i, and for all p_j ∈ S_over we have L_j > α*_j, respectively. The load distribution algorithm is presented in Table 1. We initially use the optimal solution obtained for the ideal case and determine a delimiter K to identify the potential senders and receivers. Then, using (2), we derive the load fractions to be exchanged, ΔL_i, i = 1, . . . , m, thus obtaining a load distribution η. Because of the assumption that the entire communication can be overlapped by computation, we immediately obtain the finish time for each processor as T_i(m) = α_i E_i. Since the algorithm first partitions the processors into two sets and then distributes the excess loads from the sender set to the receiver set, we refer to this algorithm as the Processor-set Partitioning and Data Distribution Algorithm (PPDD).

Table 1 PPDD algorithm for the case with front-ends

Initial stage: From (1) and (2), we obtain the initial delimiter K which separates the sender and receiver sets.

Load distribution: The load assigned to p_i is α_i = L_i + ΔL_i, i = 1, 2, . . . , K, and α_i = L_i − ΔL_i, i = K + 1, K + 2, . . . , m.

Overall processing time: The finish time for processor p_i is given by T_i(m) = α_i E_i. Thus, we obtain the overall processing time T(m) = max{T_i(m)}, i = 1, 2, . . . , m.
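The ideal-case computation of (1)–(2) and the partitioning step of Table 1 can be sketched as follows (our own rendering; function and variable names are not from the paper, and the printed values match Example 1 of Sect. 3.2 up to rounding):

```python
def ppdd_with_frontends(E, L):
    """Ideal-case PPDD; processors indexed so L_i*E_i is non-decreasing.

    Returns (T_ideal, deltas, K), where K is the delimiter: p_1..p_K are
    receivers (S_under) and p_{K+1}..p_m are senders (S_over).
    """
    # Eq. (1): T*_ideal(m) = sum(L_i) / sum(1/E_i)
    t_ideal = sum(L) / sum(1.0 / e for e in E)
    # Eq. (2): delta_i = |L_i - alpha*_i| with alpha*_i = T*_ideal / E_i
    alpha_star = [t_ideal / e for e in E]
    deltas = [abs(l - a) for l, a in zip(L, alpha_star)]
    # Delimiter K: receivers satisfy L_i < alpha*_i, senders L_i > alpha*_i
    K = sum(1 for l, a in zip(L, alpha_star) if l < a)
    return t_ideal, deltas, K

# Example 1 parameters, already indexed by L_i*E_i:
E = [50, 65, 60, 45, 80]
L = [100, 110, 120, 180, 150]
t, d, K = ppdd_with_frontends(E, L)
print(round(t, 2), K)            # 7606.01 3
print([round(x, 2) for x in d])  # [52.12, 7.02, 6.77, 10.98, 54.92]
```

With the front-end assumption, all processors then finish together at T*_ideal, since each receives or sends exactly the fraction that equalizes α_i E_i.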



Note that in the above strategy for load distribution, we have not explicitly discussed how the loads are communicated to the respective processors; rather, we have discussed how much load a processor is assigned from the entire set of loads. We shall discuss the communication strategy in the next section. In practical situations, however, there may be starvation gaps, and not all communication can be overlapped with computation. We will see how this issue is addressed in Sect. 4, when we discuss the load communication strategies.

3.2 Without front-ends

Table 2 describes in detail the proposed algorithm for finding a load distribution for the case without front-ends.

This algorithm operates in three steps. In the first step, an initial solution is obtained using (1) and (2), and the corresponding sender and receiver sets are formed as described in the previous section. Note that (1) and (2) are for the case with front-ends. In the second step, the feasibility of the resulting load distribution is validated. When all the resulting ΔL_j, j = 1, . . . , m, are positive, the PPDD algorithm has obtained a feasible delimiter K and stops iterating; it then obtains the final load distribution η using (9). In the last step, following the load distribution obtained above, we calculate the overall processing time of all the loads. Since the basic style of working is identical to the case with front-ends, we continue to refer to this algorithm simply as the Processor-set Partitioning and Data Distribution Algorithm (PPDD).

Table 2 PPDD algorithm for the case without front-ends

Initial phase: From (1) and (2), we obtain the initial delimiter K which identifies the potential sender and receiver sets.

Iteration phase: We assume that a processor p_i is a sender if i > K, or a receiver if i ≤ K. Assuming all processors finish processing at the same time, denoted as T_x(m), x = 1, . . . , m, to achieve the optimal processing time, we have,

T_x(m) = L_i E_i + ΔL_i (C_i + E_i),  i = 1, 2, . . . , K    (3)
T_x(m) = L_j E_j + ΔL_j (C_j − E_j),  j = K + 1, . . . , m    (4)

Thus, expressing all the ΔL_i in terms of ΔL_1, for the receiver set, we obtain,

ΔL_i = f_i + g_i ΔL_1,  i = 1, 2, . . . , K    (5)

where f_i = (L_1 E_1 − L_i E_i)/(C_i + E_i) and g_i = (C_1 + E_1)/(C_i + E_i), i = 1, 2, . . . , K. For the sender set, we have

ΔL_i = f'_i + g'_i ΔL_1,  i = K + 1, K + 2, . . . , m    (6)

where f'_i = (L_1 E_1 − L_i E_i)/(C_i − E_i) and g'_i = (C_1 + E_1)/(C_i − E_i), i = K + 1, K + 2, . . . , m. Since the sum of loads transferred from the sender set and the sum of loads received by the receiver set must be identical, we have

Σ_{i=1}^{K} ΔL_i = Σ_{i=K+1}^{m} ΔL_i    (7)

Thus, from (5), (6) and (7), the closed-form solution for ΔL_1 is given by,

ΔL_1 = (Σ_{i=K+1}^{m} f'_i − Σ_{i=1}^{K} f_i) / (Σ_{i=1}^{K} g_i − Σ_{i=K+1}^{m} g'_i)    (8)

Equations (5), (6) and (8) give the solution for ΔL_1, ΔL_2, . . . , ΔL_m, which should all be non-negative. If any resulting ΔL_i is negative, we update the receiver and sender sets by moving p_{K+1} from S_over to S_under and increasing K by 1. We repeat the calculations given by (3) to (8) until all ΔL_1, ΔL_2, . . . , ΔL_m are non-negative.

Overall processing time: The loads assigned to individual processors are given by,

α_i = L_i + ΔL_i = L_i + f_i + g_i ΔL_1,  i = 1, . . . , K
α_i = L_i − ΔL_i = L_i − (f'_i + g'_i ΔL_1),  i = K + 1, . . . , m    (9)

The finish time of processor p_i is given by,

T_i(m) = L_i E_i + (f_i + g_i ΔL_1)(C_i + E_i),  i = 1, 2, . . . , K
T_i(m) = L_i E_i + (f'_i + g'_i ΔL_1)(C_i − E_i),  i = K + 1, . . . , m    (10)

Thus, the overall processing time is T(m) = max{T_i(m)}, i = 1, 2, . . . , m.
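The iteration phase of Table 2 can be sketched in code as follows (our own rendering of (3)–(10); names are not from the paper, and the printed values match the without-front-ends solution of Example 1 up to rounding):

```python
def ppdd_without_frontends(E, C, L, K):
    """Iterate the delimiter K (Table 2) until all transfers are non-negative.

    Processors are indexed so L_i*E_i is non-decreasing. Returns
    (deltas, K, finish_times); p_1..p_K receive, p_{K+1}..p_m send.
    """
    m = len(E)
    while True:
        # Eqs. (5)-(6): f_i, g_i for receivers; f'_i, g'_i for senders
        f = [(L[0] * E[0] - L[i] * E[i]) / (C[i] + E[i]) for i in range(K)]
        g = [(C[0] + E[0]) / (C[i] + E[i]) for i in range(K)]
        fp = [(L[0] * E[0] - L[i] * E[i]) / (C[i] - E[i]) for i in range(K, m)]
        gp = [(C[0] + E[0]) / (C[i] - E[i]) for i in range(K, m)]
        # Eq. (8): closed-form solution for delta_1
        d1 = (sum(fp) - sum(f)) / (sum(g) - sum(gp))
        deltas = [fi + gi * d1 for fi, gi in zip(f, g)] + \
                 [fi + gi * d1 for fi, gi in zip(fp, gp)]
        if all(d >= 0 for d in deltas):
            break
        K += 1  # move p_{K+1} from S_over to S_under and retry
    # Eq. (10): with a feasible K, all processors finish at the same time
    T = [L[i] * E[i] + deltas[i] * (C[i] + E[i]) for i in range(K)] + \
        [L[i] * E[i] + deltas[i] * (C[i] - E[i]) for i in range(K, m)]
    return deltas, K, T

# Example 1, starting from the ideal-case delimiter K = 3:
E = [50, 65, 60, 45, 80]
C = [0.3, 0.2, 0.15, 0.1, 0.55]
L = [100, 110, 120, 180, 150]
d, K, T = ppdd_without_frontends(E, C, L, 3)
print([round(x, 2) for x in d])  # [51.98, 7.13, 6.89, 10.81, 55.2]
print(round(max(T), 1))          # 7614.7
```

Note that the finish times T_i(m) returned are all equal, as guaranteed by the construction of (3)–(8).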


It may be noted that since the load distribution for the ideal case serves as a convenient starting point for the case without front-ends, in the above load distribution strategy we avoid having to iterate from the value K = 1 onwards. Further, the optimal load distribution for the ideal case will definitely identify a larger sender set than the "optimal" load scheduling for the case without front-ends, because all the communication can be overlapped with the computation in the ideal case. Also, for the ideal case, we expect a smaller overall processing time than that of the case without front-ends. In addition, we initially designate as senders all the processors whose original processing times (L_i E_i) are greater than their expected overall processing time. This is our basic idea to judiciously use the initial results as the starting point of the algorithm. Thus, we increase K step by step to shrink the sender set in order to find the ultimate feasible receiver and sender sets. The initial choice of K determines the number of iterations needed by the PPDD algorithm. From equations (3) through (8), we observe that the algorithm always guarantees that the resulting load distribution makes all the processors finish processing at the same time. We present several significant properties exhibited by the proposed strategy below.

Lemma 1 In the case without front-ends, whenever the loads are not balanced, i.e., L_i E_i ≠ L_j E_j for some i ≠ j, there is at least one sender and one receiver such that ΔL_i > 0 for all receivers p_i ∈ S_under.

Lemma 2 The load distribution strategy takes a finite number of steps to converge. It needs only n < (m − K) iterations to obtain a near-optimal solution.

The proofs of the above lemmas can be found in the Appendix. Note that the above two lemmas also hold for the case with front-ends.
All the properties presented in the above lemmas lead to the following conjecture.

Conjecture 1 The load distribution strategies presented in Tables 1 and 2 yield optimal solutions for the cases with and without front-ends, respectively.

A rigorous proof can be attempted following the treatment presented in [6]; the basic idea is to derive a contradiction if the solution given by the PPDD algorithm is not followed. Due to the uncertainty of the processing speed distribution in heterogeneous systems, we have not derived a satisfactory rigorous proof of this conjecture. Our ongoing work is to derive a proof of "statistical optimality" for the PPDD algorithm. However, we observe from the workings of the algorithms that any re-distribution away from the load


scheduling proposed above will cause an imbalance of the loads among the processors and will result in under-utilizing certain processors, since some processors may be busy processing while others have finished their tasks, thus increasing the overall processing time. Based on the above proofs and observations, we argue that the PPDD algorithm yields near-optimal solutions. To see the working steps of this algorithm, we present a numerical example with the following speed parameters, for the case without front-ends. Note that, since Examples 1 to 4 are based on numerical analysis, the results are stable and deterministic. These examples are used to verify our theoretical analysis and to demonstrate certain features of our proposed algorithms more vividly. The ranges of parameters (normalized processor and link speeds) and computation-to-communication ratios used in these examples follow the observations and guidelines in [2, 40].

Example 1 In this example, we consider a single-level tree network with m = 5 processors and a root node (central scheduler). The system parameters are: processor speeds E_1 = 50 sec/MB, E_2 = 65 sec/MB, E_3 = 60 sec/MB, E_4 = 45 sec/MB, E_5 = 80 sec/MB, and link speeds C_1 = 0.3 sec/MB, C_2 = 0.2 sec/MB, C_3 = 0.15 sec/MB, C_4 = 0.1 sec/MB, C_5 = 0.55 sec/MB. These parameters are typical for image processing applications [13, 41]. The sizes of the respective loads injected at each processor are: L_1 = 100 MB, L_2 = 110 MB, L_3 = 120 MB, L_4 = 180 MB, L_5 = 150 MB. We index the processors in the order of smallest L_i E_i first, as mentioned before. Note that the original processing times at the processors (calculated using L_i E_i, i = 1, . . . , 5) are 5000, 7150, 7200, 8100, and 12000 sec, respectively, in increasing order. Thus, if each processor processes its own load without sharing with other processors, the overall processing time of the entire set of loads is 12000 sec, and the average processing time is 7890 sec.
Using (2), we first obtain the ideal scheduling solution as follows. The sender and receiver set delimiter is K = 3, and hence the sender set is Sover = {p4, p5} and the receiver set is Sunder = {p1, p2, p3}. The amounts of load migration are ΔL1 = 52.12, ΔL2 = 7.02, ΔL3 = 6.77, ΔL4 = 10.97, ΔL5 = 54.92, respectively. In the ideal case, the resulting schedule makes all processors finish processing at the same time instant, and the overall processing time is 7606.01 sec. Following the PPDD algorithm for the case without front-ends presented above, using K = 3 as the initial starting point, after one iteration we obtain the following scheduling solution. The sender and receiver delimiter is still K = 3, thus the sender set is Sover = {p4, p5} and the receiver


set is Sunder = {p1, p2, p3}. The amounts of load exchanged are ΔL1 = 51.98, ΔL2 = 7.13, ΔL3 = 6.89, ΔL4 = 10.81, ΔL5 = 55.19, respectively. Observe that all processors finish processing at the same time, and the overall finish time of the entire set of loads is 7614.7 sec.

From the above example, we observe that the resulting overall processing time for the case without front-ends is quite close to the ideal case and is much less than the original processing time without load sharing (a reduction of about 36.5%). In addition, the near-optimal finish time obtained, 7614.7 sec, is even better than the average of the original individual processing times of the respective loads, 7890 sec. These results clearly show that any naive strategy which aims to achieve an average processing time, or which assigns equal-size portions among the processors (the average of all the loads), will not result in a good solution in heterogeneous computing networks.

3.3 Homogeneous systems without front-ends

To gain more insight into the properties of the proposed algorithms, we further analyze homogeneous systems. Due to the irregular (sometimes random) parameters in heterogeneous systems, it is difficult to observe natural trends in the performance and load distribution under the PPDD algorithm. Homogeneous settings offer an opportunity to examine some special behaviors of such systems, and findings for homogeneous systems can serve as a reference or approximation for similar heterogeneous systems. For a homogeneous system, we have Ci = C and Ei = E for all i = 1, 2, ..., m. In this case, we observe some interesting special properties exhibited by the load partitioning strategy.

Lemma 3 In a homogeneous system, in the near-optimal load distribution obtained using the proposed strategy, we always have, for the receiver set, ΔLi ≥ ΔLi+1, i = 1, 2, ..., K − 1, and for the sender set, ΔLi ≤ ΔLi+1, i = K + 1, ..., m − 1.
The proof of this lemma is presented in the Appendix. From Lemma 3, we observe that the proposed strategy balances loads among all the processors such that it "pulls" more load from the heavily loaded processors and "pushes" more load to the lightly loaded processors. A similar behavior can be observed in heterogeneous computing systems.

Example 2 In this numerical example, we consider a homogeneous single-level tree network with m = 10 processing nodes. We set the processor and link speed parameters as E = 10 sec/MB and C = 1 sec/MB, respectively. The

Cluster Comput (2010) 13: 31–46

Fig. 2 Load distributions of Example 2

sizes of the loads originating at the processors are set as {L} = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100} MB, respectively. Following the PPDD algorithm, we obtain the receiver set {p1, p2, p3, p4, p5}, the final load distribution {53.1818, 54.0909, 55.0, 55.9091, 56.8182, 57.2222, 56.1111, 55.0, 53.8889, 52.7778}, and {ΔL} = {43.1818, 34.0909, 25.0, 15.9091, 6.8182, 2.7778, 13.8889, 25.0, 36.1111, 47.2222}, respectively. In addition, all processors finish processing at the same time, given by T(m) = 575.0 sec. Without load re-distribution, the overall processing time is 1000 sec. The load distribution is illustrated in Fig. 2. From this figure, we observe that in the final distribution the loads are almost equally balanced for this homogeneous system. In this example, due to non-negligible communication delays incurred in the load communication phase, the individual loads assigned to the processors are not identical in size. Since the minimum communication delays occur at p5 and p6 (ΔL5 = 6.8182, ΔL6 = 2.7778), these processors handle a larger amount of load than the other processors. Further, we observe that the distribution of the exchanged loads ΔLi completely adheres to the statement of Lemma 3.

One of the parameters that is often important in the study of load distribution problems in a network-based environment is the ratio of communication to computation delays. To see the effect of various communication-to-computation ratios, we consider the system used in Example 2 for different communication speeds, C = 1, 2, ..., 6, respectively, keeping all other parameters the same. We denote the communication-to-computation speed ratio as δ = C/E, so δ = 0.1, 0.2, ..., 0.6. Since we consider computation-intensive applications, the communication-to-computation ratios are typically much less than 0.5 [40]. The effect of δ can be understood by observing the variation of the exchanged loads among the processors. The resulting exchanged load



Fig. 3 Exchanged load distributions for various communication to computation ratios (δ = 0.1 to 0.6)

distributions are illustrated in Fig. 3 for various values of δ. From this figure, we observe that the basic tendency of the exchanged load distributions (ΔLi) is a V-shaped curve: the amount of load exchanged first decreases and then increases, for any value of δ. However, with increasing δ, the receiver set grows from 5 processors at δ = 0.1 to 7 processors at δ = 0.6, as shown in Fig. 3. According to Lemma 3, we know that ΔLi ≥ ΔLi+1 in the receiver set and ΔLi ≤ ΔLi+1 in the sender set. As a result, more load is transferred from the last processor pm when δ is larger. From Fig. 4, we observe that the final load distribution for δ = 0.1 is well balanced among all processors. As the communication-to-computation ratio increases, in this case (without front-ends), the last processor p10 is engaged in communication for a large amount of time rather than processing, and hence receives a smaller workload assignment. In Fig. 5, we see that the overall processing time increases monotonically as δ increases, as expected.
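The trends in Figs. 3 through 5 can be reproduced by specializing the equal-finish-time conditions to the homogeneous case. As before, this is an illustrative sketch under our own assumptions (each exchanged amount ΔL charged to the single link speed C; the delimiter K found by searching for the partition with non-negative ΔL):

```python
# Homogeneous sweep over delta = C/E for m = 10 nodes, loads 10..100 MB, E = 10 sec/MB.
# For a trial delimiter K, conservation of load fixes T; K is accepted when no
# exchanged amount is negative.
E, L = 10.0, [10.0 * i for i in range(1, 11)]

def solve(C):
    for K in range(1, len(L)):
        num = (sum(L[i] * E / (E + C) for i in range(K))
               + sum(L[j] * E / (E - C) for j in range(K, len(L))))
        den = K / (E + C) + (len(L) - K) / (E - C)
        T = num / den                              # common finish time for this K
        dL = ([(T - L[i] * E) / (E + C) for i in range(K)]
              + [(L[j] * E - T) / (E - C) for j in range(K, len(L))])
        if all(d > -1e-9 for d in dL):             # all exchanged amounts feasible
            return K, T, dL

for C in range(1, 7):                              # delta = 0.1 .. 0.6
    K, T, dL = solve(float(C))
    print(f"delta={C / E:.1f}  K={K}  T={T:.1f}")
```

Running the sweep shows the behavior described above: the receiver set grows from K = 5 at δ = 0.1 to K = 7 at δ = 0.6, the overall time T rises monotonically with δ, and within each set the exchanged amounts obey the ordering of Lemma 3.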

4 Load communication strategies

We shall now propose a load communication strategy to efficiently implement load balancing among all the processors, using the near-optimal load distribution obtained with the PPDD strategy in the previous section. It may be noted that when one attempts to distribute the optimal load fractions between the processors with a view to balancing, some processors may go idle, and the resulting overall processing time of the entire set of loads may be greater than the near-optimal solution proposed by PPDD. Thus, we need a communication strategy that carefully accounts for the communication delays while implementing the PPDD strategy (balancing the load fractions). The scheduler p0 therefore first obtains the near-optimal fractions using PPDD, which serve as the input to the communication strategy. In the following, we first describe the communication strategy for the case without front-ends, as this needs a systematic treatment; the strategy for the case with front-ends can then be designed using the same procedure.

Without loss of generality, let us assume that pi ∈ Sunder, i = 1, 2, ..., K, and pj ∈ Sover, j = K + 1, K + 2, ..., m, where K is the delimiter of the sender and receiver sets. The load redistribution process is as follows. Initially, pi holds Li units of load. We shall redistribute the extra loads ΔLj from Sover to Sunder. However, each processor pi in Sunder will accept only ΔLi. Thus, the senders and receivers are not sending and receiving the same amounts of load, and we need a strategy that carries out the redistribution process so as to minimize the overall processing time. Table 3 presents the details of the redistribution procedure and explains the various stages involved in the communication process. Note that senders transfer load to the central scheduler p0, which routes the load to the respective receivers. The transferred load size is determined by (5) through (8). Further, it may be observed that at any time instant there is only one active sender and one active receiver. In the load communication strategy shown in Table 3, we denote the communication time slots of senders with a superscript s and those of receivers with a superscript r. As mentioned in the algorithm, because of heterogeneous communication speeds, we



Fig. 4 Final load distributions for various communication to computation ratios (δ = 0.1 to 0.6)

Table 3 Load communication strategy for the case without front-ends

Initial stage: Initially, processor pi has load Li, i = 1, 2, ..., m. The sender set and the receiver set initiate communication at the same time. The first sender is pm and the first receiver is p1.

Load communication stage:

Sender part: processors pj ∈ Sover, j = K + 1, ..., m, send their extra loads ΔLj, obtained from (5) through (8), to the scheduler p0 in the reverse order of the processor index (from pm to pK+1). At the beginning, processor pm starts communication at t^s_0 = 0 and stops at t^s_1. The communication time slot for pj is [t^s_{m−j}, t^s_{m−j+1}], where t^s_{m−j} is the time instant at which pj starts transferring its extra load and t^s_{m−j+1} is the time instant at which pj stops the communication. It is given by,

  t^s_{m−j+1} = t^s_{m−j} + ΔLj Cj,  j = K + 1, ..., m    (3)

Thus, the finish time of processor pj is given by,

  Tj(m) = max{(Lj − ΔLj)Ej + ΔLj Cj, t^s_{m−j+1}}    (4)

Receiver part: processors pi ∈ Sunder, i = 1, 2, ..., K, receive the loads ΔLi, obtained from (5) through (8), from the scheduler p0 in the order of the processor index, from p1, p2, ..., to pK. At time zero, t^r_0 = 0, p1 starts to receive the load transferred from the sender set through the central scheduler and ends communication at time instant t^r_1. For pi, i = 1, 2, ..., K, the communication time slot is [t^r_{i−1}, t^r_i], where t^r_i is given by,

  t^r_i = t^r_{i−1} + ΔLi Ci  for Ci ≥ Cj (sender link is faster)    (5)
  t^r_i = t^r_{i−1} + ΔLi Cj  for Ci < Cj (receiver link is faster)

Hence, the finish time of pi is given by,

  Ti(m) = max{(Li + ΔLi)Ei + ΔLi Ci, t^r_i + ΔLi Ei}    (6)

Note that in the above equation, the term ΔLi Ci is replaced by ΔLi Cj if the corresponding sender is slower than the receiver during the communication session.

Overall processing time: T(m) = max{Ti(m), Tj(m)}, i = 1, 2, ..., K, j = K + 1, K + 2, ..., m

shall calculate the communication time taken by the receivers to receive the loads from the respective senders. Since the scheduler works as a router, as per our earlier assumption, the communication time for a receiver is determined by the slower of the links connecting the active sender and the active receiver during the communication session. Note that there



Fig. 5 Overall processing times for various communication to computation ratios (δ = 0.02 to 0.6)

is only one active sender and one active receiver at any instant in time. However, in the proposed strategy, at different time instants a sender may cater to more than one receiver, and a receiver may also receive loads from more than one sender. Further, the communication time for the sender part is governed solely by the sender's own link speed, while the communication time for the receiver part is determined not only by the receiver's own link speed but also by the sender's link speed during that communication session. For the case with front-ends, we follow the same load communication procedure explained above. However, with front-ends, we modify (4) and (6) as,

  Tj(m) = max{(Lj − ΔLj)Ej, t^s_{m−j+1}},    (7)

  Ti(m) = max{(Li + ΔLi)Ei, t^r_i + ΔLi Ei}    (8)

Let us now demonstrate the strategies via a detailed numerical analysis as follows.
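Before working through the numbers, the sender/receiver pairing prescribed by Table 3 can be sketched as a short simulation. This is our own reading of the pairing, not the paper's implementation: a single sender-receiver session at a time, routed cut-through by p0 and clocked by the slower of the two links, with each session moving the smaller of the two outstanding amounts.

```python
# Illustrative session pairing for Table 3: senders drain in order p_m..p_{K+1},
# receivers fill in order p_1..p_K; each session moves the smaller of the two
# outstanding amounts at the slower (larger sec/MB) of the two link speeds.
def sessions(senders, receivers):
    """senders/receivers: lists of (name, amount_MB, link_sec_per_MB)."""
    out, t, si, ri = [], 0.0, 0, 0
    s_left, r_left = senders[0][1], receivers[0][1]
    while si < len(senders) and ri < len(receivers):
        amt = min(s_left, r_left)
        rate = max(senders[si][2], receivers[ri][2])   # slower link governs
        out.append((round(t, 2), senders[si][0], receivers[ri][0], round(amt, 2)))
        t += amt * rate
        s_left -= amt
        r_left -= amt
        if s_left <= 1e-9:                             # sender drained: next sender
            si += 1
            s_left = senders[si][1] if si < len(senders) else 0.0
        if r_left <= 1e-9:                             # receiver filled: next receiver
            ri += 1
            r_left = receivers[ri][1] if ri < len(receivers) else 0.0
    return out

# Example 1's exchange (dL values from Sect. 3): p5, p4 send; p1, p2, p3 receive.
plan = sessions([("p5", 55.19, 0.55), ("p4", 10.81, 0.10)],
                [("p1", 51.98, 0.30), ("p2", 7.13, 0.20), ("p3", 6.89, 0.15)])
```

Under these assumptions the first session ships p1's entire 51.98 MB out of p5's surplus; the remaining 3.21 MB of p5, and then p4's surplus, are split across p2 and p3. The pairing closes in four sessions and moves the full 66 MB.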

Example 3 We continue with the system used in Example 1 in Sect. 3 for this case study. Using the proposed load communication strategy, we obtain the following results. The finish times of the processors are: T1(m) = 7626.8, T2(m) = 7617.2, T3(m) = 7617.5, T4(m) = 7614.7, T5(m) = 7614.7 sec. Thus, the overall finish time of the entire set of loads is T(m) = 7626.8 sec, which is 0.1588% more than the overall processing time obtained without considering the load communication stage. The load communication process on each processor is illustrated in Fig. 6 (not to exact proportion). The results of Example 1 give the receiver set Sunder = {p1, p2, p3} and the sender set Sover = {p4, p5}. From the results of Example 3, we observe that all the

processors in the sender set finish processing at the same time, while processors in the receiver set may not. This is because the communication times of the senders are determined only by their own channel speeds, whereas the communication time for the receiver part is determined by the slower link between the active sender and receiver, thus incurring additional communication delays. Figure 6 illustrates the load communication process for Example 3. Above the time axes, we show the communication process using shaded blocks, and below the time axes, we show the computation process using blank blocks. In the group of senders, p5 starts sending its load at time zero and stops its load communication at time instant 30.36 sec. During this time, p4 is processing its own load independently. Immediately following p5, p4 starts communication at 30.36 sec, which lasts only 1.08 sec. In the group of receivers, p1 receives its assigned load from time zero to time instant 27.69 sec. During this period, p2 and p3 are processing their own loads. Following p1, p2 continues the load communication, and then p3 starts communication following p2. When a processor is not engaged in load communication, it processes its available load independently, as clearly shown in Fig. 6. From the results obtained, we also observe that the difference between the actual overall processing time and the near-optimal finish time obtained using PPDD (in the previous section) is approximately 0.16%. Thus, the proposed strategy is shown to be efficient and close to the optimal solution.

5 Discussions of the results

The contributions in this paper are novel to the DLT literature. The paper addresses a realistic situation in a distributed network of processors wherein computational loads can originate at any processor on the network.
Thus, when there is more than one load to be processed in the system, the processors may not be efficiently utilized unless a clever load distribution strategy is employed. The existing DLT literature [32] addresses the processing of multiple loads on distributed networks; however, it assumes that all the loads originate at the central scheduler (the bus controller unit in the case of bus networks). Our formulation considers loads originating at different processors on the network. We proposed load distribution (PPDD) and communication strategies for both cases, i.e., when the processors are and are not equipped with front-ends. For the case with front-ends, we simply use (2) to obtain the loads exchanged among processors and hence the final load distribution. For the case without front-ends, we follow the steps proposed in Table 2. The PPDD algorithm takes advantage of the optimality principle to minimize the overall processing time. As proven in



Fig. 6 Timing Diagram for Example 3. We have 5 processors in total; 2 processors (p4 and p5 ) play the sender role and 3 processors (p1 , p2 and p3 ) are in the receiver set. We observe that senders all stop at the same time while receivers do not stop at the same time

Lemma 1 and Lemma 2, the PPDD algorithm is guaranteed to determine the near-optimal solution in a finite number of steps. Since the load partitioning phase does not account for the communication delays encountered during the actual load communication, we obtain the near-optimal solution immediately from the procedure described in Sect. 3. When we consider the actual load transfer to the processors, the PPDD algorithm is guaranteed to produce a near-optimal solution for homogeneous systems. However, in heterogeneous systems, (3) and (4) use only the communication speed of the corresponding receiving link, thus causing imprecise results (when the actual load transfer takes place), as demonstrated in Examples 2 and 3. This minor discrepancy arises because the communication speeds of the active sending and receiving links differ, whereas the actual load transfer time is determined by the slower of the two links and not by the receiving link alone. One may relax this assumption and consider the combined effect of both link delays in the PPDD algorithm; however, the resulting solution may not be drastically different from what PPDD proposes under the current model. A significant advantage of PPDD in its current form is the simplicity of designing and implementing a scheduler at the root processor p0. However, one may attempt to use other strategies to schedule and transfer loads among processors to minimize the processing time. A natural choice is to modify the load scheduling strategies proposed in the literature [6, 32], in which scheduling strategies for multiple loads arriving at a bus controller unit (BCU) were studied. In this paper, by contrast, we consider the case where multiple loads originate at different sites. In any case, we can apply the previous scheduling strategies in the literature to our problem context. At first, we consider a single-level tree network with only one load originating on a processor.
Using the following equations, we obtain a near-optimal load distribution for a single load. Then, we may repeat this procedure for all the

Fig. 7 A single-level tree network with a single load

loads residing at other sites. For comparison purposes, we consider the case without front-ends in this section; for the case with front-ends, one may follow a similar procedure. Without loss of generality, we assume that processor p1 has a load L, as shown in Fig. 7. The load distribution process is as follows. Processor p1 partitions its load L into m portions, α1 L, α2 L, ..., αm L. Then, p1 distributes load fraction α2 L to p2 first, then α3 L to p3, ..., until αm L to pm, respectively, via p0. After the load distribution, p1 starts processing its own portion α1 L. The timing diagram for this load distribution process can be found in [6]. Thus, to balance the load among all processors in such a way that they finish processing at the same time, we have,

  α1 L E1 = αm L Em,    (9)

  αi L Ei = αi+1 L (max{C1, Ci+1} + Ei+1),  i = 2, 3, ..., m − 1,    (10)

  ∑_{i=1}^{m} αi = 1    (11)

where αi, i = 1, 2, ..., m, is the fraction of load L assigned to processor pi. From Fig. 7, we observe that link l1 is the common link for all communication between p1 and pi, i = 2, 3, ..., m. Because the communication time is determined by the slower link, we use max{C1, Ci+1} in the above equations to obtain the (slower) link speed between links l1 and li+1. From the above equations, we can obtain the individual αi, and hence the near-optimal load fraction assigned to pi is αi L, i = 1, 2, ..., m.

Equations (9) through (11) provide a near-optimal solution for sharing one load among all processors. We now apply this to a new strategy for sharing multiple loads, referred to as the Round-robin Scheduling Algorithm (RSA), described in Table 4. The RSA algorithm is an extension of the load scheduling strategy for a single load. Note that when we schedule a load among all processors, processors not engaged in communication can process their loads independently. In every iteration, in a step-by-step fashion, each processor distributes its load among all processors and attempts to balance the loads such that all processors finish processing at the same time. To compare the time performance of the RSA and PPDD strategies, we present an example through numerical analysis.
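Equations (9) through (11) form a simple chain: (10) expresses each αi+1 in terms of αi, (9) ties α1 to αm, and (11) normalizes. A small sketch (a hypothetical helper, not the paper's implementation) makes this concrete:

```python
# Solve (9)-(11) for the single-load fractions alpha_1..alpha_m.
# Chain (10): alpha_{i+1} = alpha_i * E_i / (max(C_1, C_{i+1}) + E_{i+1}), i = 2..m-1.
# Tie (9):    alpha_1 = alpha_m * E_m / E_1.   Normalize by (11): sum alpha_i = 1.
def single_load_alphas(E, C):
    m = len(E)
    a = [0.0] * m
    a[1] = 1.0                                  # unnormalized seed for alpha_2
    for i in range(1, m - 1):                   # build alpha_3..alpha_m from (10)
        a[i + 1] = a[i] * E[i] / (max(C[0], C[i + 1]) + E[i + 1])
    a[0] = a[m - 1] * E[m - 1] / E[0]           # alpha_1 from (9)
    s = sum(a)
    return [x / s for x in a]                   # normalize by (11)

# Homogeneous instance matching Example 4 (E = 10 sec/MB, C = 1 sec/MB, m = 5).
alphas = single_load_alphas([10.0] * 5, [1.0] * 5)
```

For this homogeneous instance the solver gives α ≈ {0.177, 0.236, 0.215, 0.195, 0.177}: the distributor keeps 0.177 of its load and hands out 0.236, 0.215, 0.195, and 0.177, which are exactly the fractions that appear in Table 5.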

Example 4 We consider a homogeneous single-level tree network with m = 5 processors. The speed parameters are set as Ei = 10 sec/MB and Ci = 1 sec/MB, i = 1, 2, ..., 5. The original load status is L1 = 10 MB, L2 = 20 MB, L3 = 30 MB, L4 = 40 MB, L5 = 50 MB. The working steps of RSA are shown in Table 5. In Table 5, for example, row 1 specifies the distribution process by processor p5. Starting from 0 sec, p5, which has 50 MB (see column 2 of row 1), distributes its load to the rest of the processors as per the RSA algorithm, and the respective load fractions are shown in the last column. The distribution process continues similarly for the other processors. Using the RSA algorithm, the finish time at each processor is as follows: T1(m) = 256.266 sec, T2(m) = 329.393 sec, T3(m) = 385.391 sec, T4(m) = 429.262 sec, T5(m) = 463.926 sec. Thus, the overall processing time is T(m) = 463.926 sec. Because the system in this example is homogeneous, the load fractions αi are the same at every iteration. However, using the PPDD algorithm, we obtain an overall processing time of T′(m) = 312.245 sec, with all processors finishing at the same time. For comparison, the optimal processing time in the ideal situation is 300 sec. Comparing the solutions of these two load scheduling strategies, we observe that the overall processing time using RSA is much greater than that obtained using the PPDD algorithm, T(m) > T′(m). We also observe that, in

Table 4 Round-robin scheduling algorithm for the case without front-ends

Initial stage: Initially, processor pi has load Li, i = 1, 2, ..., m.

Iteration stage: From pm to p1, all processors share their loads step by step. At the first iteration, we schedule load Lm among all processors. After the communication of Lm, the remaining load on processor pm−1 is shared among all processors. At iteration i, pm−i+1 schedules its available load among all processors so that they finish processing this load at the same time, using (9) to (11). Note that at each iteration, we distribute the load in order of link speed, faster links first. Also note that there is no violation of Rule A [6].

Final solution: At the end of iteration m, load scheduling stops and each processor processes its remaining load to completion. The overall processing time of all loads is thus determined by the finish time of the processor that takes the maximum time to complete its processing.

Table 5 Results of Example 4 using RSA

Iteration (i)      Sender and load    Sender's α    Receivers' α
1 at 0 sec         p5, 50.000 MB      α5 = 0.177    {1, 2, 3, 4} = {0.236, 0.215, 0.195, 0.177}
2 at 41.136 sec    p4, 45.636 MB      α4 = 0.177    {1, 2, 3, 5} = {0.236, 0.215, 0.195, 0.177}
3 at 78.682 sec    p3, 42.646 MB      α3 = 0.177    {1, 2, 4, 5} = {0.236, 0.215, 0.195, 0.177}
4 at 113.768 sec   p2, 41.251 MB      α2 = 0.177    {1, 3, 4, 5} = {0.236, 0.215, 0.195, 0.177}
5 at 147.707 sec   p1, 41.827 MB      α1 = 0.177    {2, 3, 4, 5} = {0.236, 0.215, 0.195, 0.177}

All communication finishes at time 182.119 sec. At this time, the distribution of the remaining loads is {7.415, 14.727, 20.327, 24.714, 28.181} MB.


the RSA strategy, the last processor to distribute is p1, and it is the first to finish processing its load. The reason is that, at the last iteration, p1 makes all the other processors finish processing its remaining load at the same time, while the other processors still hold their own loads during that time. At the end of the load communication phase, the remaining load at p1 is 7.415 MB (the smallest), as shown in the last row of Table 5. Thus, after the end of the load communication, all the other processors need more time than p1 to finish their loads. A natural improvement is to repeat the round-robin scheduling until the finish times of all processors are sufficiently close. However, RSA cannot avoid the additional time delays (overhead) incurred by shuttling load from and to the same processor; i.e., a load fraction transferred from pi to pj in a previous iteration may be transferred back to pi from pj or pk, thus wasting communication resources. Since there is no front-end to overlap communication and computation, such unnecessary load "wandering" greatly prolongs the overall processing time. Moreover, RSA always needs m iterations to obtain the final solution, while the PPDD algorithm needs only (m − K) iterations, as proven in Lemma 2. In Example 4, RSA needs 5 iterations, while the PPDD algorithm needs only one iteration to obtain a better solution. If we improve RSA by repeating the round-robin scheduling, RSA needs even more iterations to obtain a better solution; even then, the improved version of RSA cannot avoid load wandering from and back to a processor.
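The bookkeeping behind row 2 of Table 5 can be checked directly. The assumption here (our reading of the no-front-end model, not a statement from the paper) is that a processor computes whenever it is neither sending nor receiving, and we use the unrounded fraction α ≈ 0.177273 behind Table 5's printed 0.177:

```python
# First RSA iteration of Example 4: p5 distributes its 50 MB with the chain
# fractions (distributor keeps ~0.177273; p4, the last receiver, also gets ~0.177273).
E, Cl = 10.0, 1.0                               # processor and link speed (sec/MB)
alpha_own, alpha_p4 = 0.177273, 0.177273        # assumed unrounded fractions
load_p5, load_p4 = 50.0, 40.0

sent = load_p5 * (1.0 - alpha_own)              # total MB shipped out by p5
t_comm = sent * Cl                              # iteration-1 communication time
recv_p4 = load_p5 * alpha_p4                    # p4's share of p5's load
t_recv = recv_p4 * Cl                           # time p4 spends receiving
processed = (t_comm - t_recv) / E               # MB p4 computes while others communicate
load_p4_next = load_p4 + recv_p4 - processed    # p4's load when iteration 2 starts
```

This reproduces the second row of Table 5: iteration 2 starts at t ≈ 41.136 sec with p4 holding ≈ 45.636 MB.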


6 Conclusions

We have addressed the problem of scheduling divisible loads originating from multiple sites in single-level tree networks. The formulation presented a general scenario with multi-site divisible loads, requiring several processors to share their loads for processing. We designed a load distribution strategy and a communication strategy to carry out the processing of all the loads submitted at the various sites. A two-phase approach is taken to attack the problem: a load partitioning phase, followed by the actual communication of load fractions to the respective processors (communication strategy). In the first phase, we derive the near-optimal load distribution; in the second phase, we consider the actual communication delay in transferring the load fractions to the processors, by assuming that the overall delay is contributed by the slowest link between the sending and receiving processors. As a first step, one can relax this assumption and analyze the performance; the proposed scheduling strategies are flexible in adapting to such relaxed assumptions, as mentioned in the discussion section. For the cases with and without front-ends, we propose a scheduling strategy, the PPDD algorithm, to achieve a near-optimal processing time for all loads. Several significant properties of the PPDD algorithm are proven in lemmas, and a detailed analysis of its time performance was conducted. The analysis is also extended to homogeneous systems, for which we have shown the time performance of the PPDD algorithm with respect to various communication-to-computation ratios. To implement the load distribution obtained through the PPDD algorithm, we proposed a simple load communication strategy. It was demonstrated that the overall processing time obtained using the PPDD algorithm is sufficiently close to the result following the actual load communication strategy proposed. To further demonstrate the efficiency of the PPDD algorithm, we also compared its time performance with that of another algorithm, the Round-robin Scheduling Algorithm (RSA). It is shown that the proposed PPDD algorithm produces better scheduling solutions than RSA. Detailed discussions and comparisons are carried out. The proposed load scheduling strategies can be readily extended to other network topologies in a similar way. Another interesting extension is to study the case with multiple load arrivals at each processor, which models dynamic scheduling scenarios in grid or cloud computing environments.

Acknowledgements The authors would like to thank the editors and referees for their valuable suggestions, which have significantly helped improve the quality and presentation of this paper. The research presented in this paper is supported in part by US National Science Foundation grant CNS-0709329.

Appendix

Proof of Lemma 1 Since L1 E1 ≤ Li Ei for i ≥ 2, and Ci < Ej for any i, j ∈ {1, ..., m}, as mentioned before, we have,

  fi = (L1 E1 − Li Ei) / (Ci + Ei) ≤ 0,  i = 1, 2, ..., K,    (12)

  gi = (C1 + E1) / (Ci + Ei) > 0,  i = 1, 2, ..., K,    (13)

  f′i = (L1 E1 − Li Ei) / (Ci − Ei) ≥ 0,  i = K + 1, K + 2, ..., m,    (14)

  g′i = (C1 + E1) / (Ci − Ei) < 0,  i = K + 1, K + 2, ..., m    (15)

Since the loads are unbalanced, fi ≠ 0 for some i, and f′i ≠ 0 for some i. From the closed-form expression for ΔL1 given in (8) and using the above inequalities, we immediately see that ΔL1 > 0. Similarly, by expressing all ΔLi, i = 1, 2, ..., m − 1, in terms of ΔLm and through algebraic manipulations similar to those used to prove ΔL1 > 0, we can



also show that ΔLm > 0. Therefore, there is always at least one sender and one receiver available. Now, we prove the second part of the lemma. For the receivers pi, i = 1, ..., K, at each iteration of the above load distribution strategy, we naturally have (L1 + ΔL1)E1 + ΔL1 C1 > Li Ei, i = 1, ..., K, where the left-hand side is the new finish time of processor p1 after load distribution and the right-hand side is the earlier finish time of processor pi ∈ Sunder. This is because we attempt to balance the loads among all the processors by extending (stretching) the processing times of the processors in the receiver set and reducing (shrinking) the processing times of the processors in the sender set in such a way that they finish processing at the same time. Thus, from (5), we have,

  ΔLi = ((L1 + ΔL1)E1 + ΔL1 C1 − Li Ei) / (Ci + Ei) > 0,  i = 2, ..., K    (16)

Hence the proof. ∎

Proof of Lemma 2 This follows directly from an inherent property of the proposed load distribution strategy. Assume that we fail to determine a near-optimal solution satisfying the condition ΔLi > 0, i = 1, 2, ..., m, in all previous iterations from K to m − 1. The last iteration cannot fail, since from Lemma 1 we have ΔLm > 0, and all the other processors become potential receivers with ΔLi > 0, i = 1, 2, ..., m − 1. Hence the proof. ∎

Proof of Lemma 3 For the receiver set, from (5), we have,

  ΔLi = ΔLi+1 + (Li+1 E − Li E) / (C + E),  i = 1, 2, ..., K − 1    (17)

Since Li E ≤ Li+1 E, i = 1, 2, ..., K − 1, from the above equation we immediately obtain ΔLi ≥ ΔLi+1, i = 1, 2, ..., K − 1. For the sender set, from (6), we have,

  ΔLi = ΔLi+1 + (Li+1 E − Li E) / (C − E),  i = K + 1, ..., m − 1    (18)

Note that (C − E) < 0 in the above equation, as per the assumptions mentioned in Sect. 2. Since Li E ≤ Li+1 E, i = K + 1, ..., m − 1, from the above equation we obtain ΔLi ≤ ΔLi+1, i = K + 1, ..., m − 1. Hence the proof. ∎

References

1. Kesselman, C., Foster, I. (eds.): The Grid 2: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, San Mateo (2003)

2. Hennessy, J., Patterson, D.: Computer Architecture: A Quantitative Approach, 4th edn. Morgan Kaufmann, San Mateo (2006)
3. Hwang, K., Xu, Z.: Scalable Parallel Computing: Technology, Architecture, Programming. McGraw-Hill, New York (1998)
4. Eshaghian, M. (ed.): Heterogeneous Computing. Artech House, Norwood (1996)
5. Drozdowski, M.: Selected Problems of Scheduling Tasks in Multiprocessor Computer Systems. Poznan University of Technology Press, Poznan (1997)
6. Veeravalli, B., Ghose, D., Mani, V., Robertazzi, T. (eds.): Scheduling Divisible Loads in Parallel and Distributed Systems. IEEE Computer Society Press, Los Alamitos (1996)
7. Shirazi, B., Hurson, A., Kavi, K. (eds.): Scheduling and Load Balancing in Parallel and Distributed Systems. IEEE Computer Society Press, Los Alamitos (1995)
8. Veeravalli, B., Ghose, D., Robertazzi, T.: Divisible load theory: a new paradigm for load scheduling in distributed systems. Clust. Comput. 6(1), 7–18 (2003). Special issue on divisible load scheduling
9. Robertazzi, T., Sohn, J.: Optimal time-varying load sharing for divisible jobs. IEEE Trans. Aerosp. Electron. Syst. 34, 907–923 (1998)
10. Sohn, J., Robertazzi, T., Luryi, S.: Optimizing computing costs using divisible load analysis. IEEE Trans. Parallel Distrib. Syst. 9(3), 225–234 (1998)
11. Marchal, L., Yang, Y., Casanova, H., Robert, Y.: A realistic network/application model for scheduling divisible loads on large-scale platforms. In: International Parallel and Distributed Processing Symposium (IPDPS 2005) (2005)
12. Viswanathan, S., Veeravalli, B., Yu, D., Robertazzi, T.: Design and analysis of a dynamic scheduling strategy with resource estimation for large-scale grid systems. In: Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing (held in conjunction with Supercomputing 2004), Pittsburgh, Pennsylvania, USA, Nov. 2004, pp. 163–171
13. Li, X., Veeravalli, B., Ko, C.: Distributed image processing on a network of workstations. Int. J. Comput. Appl. 25(2), 1–10 (2003)
14. Veeravalli, B., Ghose, D., Robertazzi, T.: A new paradigm for load scheduling in distributed systems. Clust. Comput. 6(1), 7–18 (2003). Special issue on divisible load scheduling
15. Robertazzi, T.: Ten reasons to use divisible load theory. IEEE Comput. 36(5), 63–68 (2003)
16. Ghose, D., Kim, H.J.: Load partitioning and trade-off study for large matrix-vector computations in multicast bus networks with communication delays. J. Parallel Distrib. Comput. 55(1), 32–59 (1998)
17. Veeravalli, B., Barlas, G.: Efficient scheduling strategies for processing multiple divisible loads on bus networks. J. Parallel Distrib. Comput. 62(1), 132–151 (2002)
18. Wong, H., Veeravalli, B.: Scheduling divisible loads on heterogeneous linear daisy chain networks with arbitrary processor release times. IEEE Trans. Parallel Distrib. Syst. 15(3), 273–288 (2005)
19. Drozdowski, M., Blazewicz, J.: Distributed processing of divisible jobs with communication startup costs. Discrete Appl. Math. 76(1–3) (1997)
20. Veeravalli, B., Li, X., Ko, C.C.: On the influence of start-up costs in scheduling divisible loads on bus networks. IEEE Trans. Parallel Distrib. Syst. 11(12), 1288–1305 (2000)
21. Li, X., Veeravalli, B., Ko, C.: Divisible load scheduling on single-level tree networks with buffer constraints. IEEE Trans. Aerosp. Electron. Syst. 36(4), 1298–1308 (2000)
22. Chan, S., Veeravalli, B., Ghose, D.: Large matrix-vector products on distributed bus networks with communication delays using the divisible load paradigm: performance analysis and simulation. Math. Comput. Simul. 58, 71–79 (2001)
23. Wolniewicz, P., Drozdowski, M.: Experiments with scheduling divisible tasks in clusters of workstations. In: Proceedings of the

46

24.

25.

26.

27.

28.

29.

30.

31.

32.

33.

34.

35.

36.

37.

38. 39. 40.

41.

Cluster Comput (2010) 13: 31–46 Parallel Processing, 6th International Euro-Par Conference, Munich, Germany, August 2000, pp. 311–319 Ghose, D., Kim, H.J., Kim, T.H.: Adaptive divisible load scheduling strategies for workstation clusters with unknown network resources. IEEE Trans. Parallel Distrib. Syst. 16(10), 897–907 (2005) Beaumont, O., Casanova, H., Legrand, A., Robert, Y., Yang, Y.: Scheduling divisible loads on star and tree networks: results and open problems. IEEE Trans. Parallel Distrib. Syst. 16(3), 207–218 (2005) Min, W.H., Veeravalli, B.: Aligning biological sequences on distributed bus networks: a divisible load scheduling approach. IEEE Trans. Inf. Technol. Biomed. 9(4), 489–501 (2005) Yao, J., Guo, J., Bhuyan, L., Xu, Z.: Scheduling real-time multimedia tasks in network processors. In: IEEE Global Telecommunications Conference (GLOBECOM’04), vol. 3, 2004 Li, X., Cao, J.: Coordinated workload scheduling in hierarchical sensor networks for data fusion applications. J. Comput. Sci. Technol. 23(3), 355–364 (2008) Moges, M., Robertazzi, T.G.: Wireless sensor networks: scheduling for measurement and data reporting. IEEE Trans. Aerospace Electronic Syst. 42(1), 327–340 (2006) Carroll, T.E., Grosu, D.: A strategyproof mechanism for scheduling divisible loads in tree networks. In: Proc. of the 20th IEEE Intl. Parallel and Distributed Processing Symp. (IPDPS 2006), 2006 Carroll, T.E., Grosu, D.: Strategyproof mechanisms for scheduling divisible loads in Bus-Networked distributed systems. IEEE Trans. Parallel Distrib. Syst. 19(8), 1124–1135 (2008) Robertazzi, T., Sohn, J.: A multi-job load sharing strategy for divisible jobs on bus networks. In: Proceedings of the Conference on Information Sciences and Systems, Princeton, NJ, March 1994 Veeravalli, B., Yao, J.: Design and performance analysis of divisible load scheduling strategies on arbitrary graphs. Clust. Comput. 7(2), 841–865 (2004) Haddad, E.: Real-time optimization of distributed load balancing. 
In: Proceedings of the Second Workshop on Parallel and Distributed Real-Time Systems, 1994, pp. 52–57 Robertazzi, T., Lammie, T.: A linear daisy chain with two divisible load sources. In: 2005 Conference on Information Sciences and Systems, The Johns Hopkins University, Baltimore, Maryland, March 2005 Wong, H.M., Yu, D., Veeravalli, B., Robertazzi, T.: Data intensive grid scheduling: multiple sources with capacity constraints. In: Fifteenth IASTED International Conference on Parallel and Distributed Computing and Systems, vol. 1, 2003, pp. 7–11 Chervenak, A., Foster, I., Kesselman, C., Salisbury, C., Tuecke, S.: The data grid: Towards an architecture for the distributed management and analysis of arge scientific datasets. J. Netw. Comput. Appl. 23, 187–200 (2001) Shivaratri, N., Krueger, P., Singhal, M.: Load distributing for locally distributed systems. Computer 25(12), 33–44 (1992) Gallager, D., Bertsekas, R. (eds.): Data Networks, 2nd edn. Prentice Hall, New York (1992) Luszczek, P., Dongarra, J.: Introduction to the hpcchallenge benchmark suite. University of Tennessee, Tech. Rep. ICL-UT05-01 (2005) Lee, C., Hamdi, M.: Parallel image processing applications on a network of workstations. Parallel Comput. 21, 137–160 (1995)

Xiaolin Li is an Assistant Professor in the Computer Science Department at Oklahoma State University. His research interests include parallel and distributed systems, cyber-physical systems, and network security. His research has been sponsored by several external grants, including the US National Science Foundation (NSF) (PetaApps, GENI, CRI, and MRI programs), the Department of Homeland Security (DHS), the Oklahoma Center for the Advancement of Science and Technology (OCAST), the Oklahoma Transportation Center (OTC), and industry partners. He is an associate editor of three international journals and a program chair for over 10 international conferences and workshops. He is on the executive committee of the IEEE Technical Committee on Scalable Computing (TCSC) and a panelist for NSF. He has been a TPC member for numerous international conferences, including INFOCOM, GlobeCom, ICC, CCGrid, MASS, and ICPADS. He received the Ph.D. degree in Communications and Information Engineering from the National University of Singapore, Singapore, and the Ph.D. degree in Computer Engineering from Rutgers University, USA. He directs the Scalable Software Systems Laboratory (http://s3lab.cs.okstate.edu). He is a member of IEEE and ACM.

Bharadwaj Veeravalli received his B.Sc. in Physics from Madurai Kamaraj University, India, in 1987, his Master's in Electrical Communication Engineering from the Indian Institute of Science, Bangalore, India, in 1991, and his Ph.D. from the Department of Aerospace Engineering, Indian Institute of Science, Bangalore, India, in 1994. He did his post-doctoral research in the Department of Computer Science, Concordia University, Montreal, Canada, in 1996. He is with the Department of Electrical and Computer Engineering, Communications and Information Engineering (CIE) division, at the National University of Singapore, Singapore, as a tenured Associate Professor. His mainstream research interests include multiprocessor systems, cluster/grid/cloud computing, scheduling in parallel and distributed systems, bioinformatics and computational biology, and multimedia computing. He is one of the earliest researchers in the field of divisible load theory (DLT). He has published over 65 papers in high-quality international journals and conferences, and has successfully secured several externally funded projects. He has co-authored three research monographs in the areas of PDS, distributed databases (competitive algorithms), and networked multimedia systems, published in 1996, 2003, and 2005, respectively. He guest-edited a special issue on cluster/grid computing for the IJCA, USA, journal in 2004, and has served as a program committee member and session chair at several international conferences. He currently serves on the editorial boards of IEEE Transactions on Computers, IEEE Transactions on SMC-A, and Multimedia Tools & Applications (MTAP), USA, as an Associate Editor. He is a Senior Member of the IEEE and the IEEE Computer Society. Bharadwaj Veeravalli's complete academic career profile can be found at http://cnds.ece.nus.edu.sg/elebv.
