Online Task Allocation in Client-Server Large Scale Heterogeneous Platforms

Olivier Beaumont, Lionel Eyraud-Dubois and Hejer Rejeb (INRIA Bordeaux – Sud-Ouest, University of Bordeaux, LaBRI)
Christopher Thraves (INRIA Rennes Atlantique, Rennes, France)

Abstract—In this paper, we consider the problem of the online allocation of a very large number of identical tasks on a master-slave platform. Initially, several masters hold or generate tasks that are transferred to and processed by slave nodes. The goal is to maximize the overall throughput achieved using this platform, i.e., the (fractional) number of tasks that can be processed within one time unit. We model the communications using the so-called bounded degree multi-port model, in which several communications can be handled by a master node simultaneously, provided that bandwidth limitations are not exceeded and that a given server is not involved in more simultaneous communications than its maximal degree. Under this model, it has been proved in [1] that maximizing the throughput (MTBD problem) is NP-Complete in the strong sense, but that a small additive resource augmentation (of 1) on the servers' degrees is enough to find in polynomial time a solution that achieves at least the optimal throughput. In this paper, we consider the realistic setting where the set of slave processors is not known in advance, but rather slaves join and leave the system at any time, i.e., the online version of MTBD. We prove that no fully online algorithm (where only one change is allowed for each event) can achieve a constant approximation ratio, whatever the resource augmentation on servers' degrees. Then, we prove that it is possible to maintain the optimal solution at the cost of at most four changes per server each time a new node joins or leaves the system. At last, we propose several other greedy heuristics to solve the online problem, and we compare the performance (in terms of throughput) and the cost (in terms of disconnections and reconnections) of the proposed algorithms through a set of extensive simulation results.

Keywords—online task allocation; large scale platforms; approximation algorithms; online algorithms.
I. INTRODUCTION

Scheduling computational tasks on a given set of processors is a key issue for high-performance computing, especially in the context of the emergence of large scale computing platforms such as BOINC [2] or Folding@home [3]. These platforms are characterized by their large scale, their heterogeneity and the variations in the performance of their resources. These characteristics strongly influence the set of applications that can be executed using these platforms. First, the running time of the application has to be large enough to benefit from the platform scale and to minimize the influence of start-up times due to sophisticated middlewares. Second, the application has to consist of many small independent tasks. This makes it possible to minimize the influence of variations in resource performance and to limit the impact of resource failures. From a scheduling point of view, the set
of applications that can be efficiently executed is therefore restricted, and we can concentrate on "embarrassingly parallel" applications consisting of many independent tasks. In this context, makespan minimization, i.e., minimizing the time needed to process a given number of tasks, is usually intractable. An idea to circumvent the difficulty of makespan minimization is to lower the ambition of the scheduling objective. Instead of aiming at the absolute minimization of the execution time, it may be more efficient to consider asymptotic optimality and to optimize the throughput (i.e., the fractional number of tasks that can be processed in one time-unit once steady state has been reached). After all, the number of tasks to be executed in applications such as SETI@home [4] or Folding@home [3] is huge. This approach has been pioneered by Bertsimas and Gamarnik [5] and has been extended to task scheduling in [6] and collective communications in [7]. Steady-state scheduling allows us to relax the scheduling problem in many ways, and aims at characterizing the activity of each resource during each time-unit by deciding which rational fraction of time is spent sending and receiving tasks, and to which client tasks are delegated.
In this paper, we restrict our attention to steady-state scheduling of independent equal-sized tasks. In order to consider a more general model than current settings where a single server is used, we consider that a set of servers initially hold (or generate) the tasks to be processed. Each server $S_j$ is characterized by its outgoing bandwidth $b_j$ (i.e., the number of tasks it can send during one time-unit) and its maximal degree $d_j$ (i.e., the number of open TCP connections that it can handle simultaneously). On the other hand, each client $C_i$ is characterized by its capacity $w_i$ (i.e., the number of tasks it can handle during one time-unit). $w_i$ encompasses both its processing and communication capacities. More specifically, if $\mathrm{comp}_i$ denotes the number of tasks $C_i$ can process during one time-unit, and $\mathrm{comm}_i$ denotes the number of tasks it can receive during one time-unit, then we set $w_i = \min(\mathrm{comp}_i, \mathrm{comm}_i)$. Our goal is to build a bipartite graph between the servers and the clients, assuming that the underlying network topology is unknown. To model contentions, we rely on the bounded multi-port model, which has already been advocated by Hong et al. [8] for independent task distribution on heterogeneous platforms. In this model, server $S_j$ can serve any number of clients simultaneously, each using a bandwidth $w_i' \le w_i$, provided that its outgoing bandwidth is not exceeded, i.e., $\sum_i w_i' \le b_j$. This corresponds to modern network infrastructure, where each
communication is associated with a TCP connection. This model strongly differs from the traditional one-port model used in the scheduling literature, where connections are made in exclusive mode: the server can communicate with a single client at any time-step. Previous results obtained in steady-state scheduling of independent tasks [6] have been obtained under this model, which is easier to implement. For instance, Saif and Parashar [9] report experimental evidence that achieving the performance of the bounded multi-port model may be difficult, since asynchronous sends become serialized as soon as message sizes exceed a few megabytes. Their results hold for two popular implementations of the MPI message-passing standard, MPICH on Linux clusters and IBM MPI on the SP2. Nevertheless, in the context of large scale platforms, the networking heterogeneity ratio may be high, and it is unrealistic to assume that a 100MB/s server may be kept busy for 10 seconds while communicating a 1MB data file to a 100kB/s DSL node. Therefore, in our context, all connections must directly be handled at TCP level, without using high level communication libraries. It is worth noting that at TCP level, several QoS mechanisms enable a prescribed sharing of the bandwidth [10], [11]. In particular, it is possible to handle several connections simultaneously and to fix the bandwidth allocated to each connection. In our context, these mechanisms are particularly useful since $w_i$ encompasses both the processing and communication capabilities of $C_i$ and therefore, the bandwidth allocated to the connection between $S_j$ and $C_i$ may be lower than both $b_j$ and $w_i$. Nevertheless, handling a large number of connections at server $S_j$ with prescribed bandwidths consumes a lot of kernel resources, and it may therefore be difficult to reach $b_j$ by aggregating a large number of connections. In order to circumvent this problem, we introduce another parameter $d_j$ in the bounded multi-port model, which represents the maximal number of connections that can be simultaneously opened at server $S_j$. Therefore, the model we propose combines the benefits of both the bounded multi-port model and the one-port model. It enables several communications to take place simultaneously, which is mandatory in the context of large scale distributed platforms, while a practical implementation is achieved by using TCP QoS mechanisms and by bounding the maximal number of connections.
In this context, the offline problem has been considered in [1]. It has been proved that the additional constraint on the number of simultaneous communications a server can handle makes the problem of maximizing the overall throughput NP-Complete. On the other hand, a polynomial time algorithm, based on a slight additive resource augmentation, has been proposed to solve this problem. More specifically, if $d_j$ denotes the maximal number of connections that can be opened at server $S_j$, then the throughput achieved using the algorithm proposed in [1] and $d_j + 1$ connections is at least the same as the optimal one with $d_j$ connections.
Formally, let $S_j$ be a server, and let $b_j$ and $d_j$ denote respectively the outgoing bandwidth of server $S_j$ and the
maximal number of connections that server $S_j$ can handle simultaneously. Also, let $C_i$ denote the $i$-th client and $w_i$ its capacity. Finally, let us denote by $w_i^j$ the outgoing bandwidth assigned from server $S_j$ to client $C_i$. Then, a valid assignment must satisfy the following conditions:

$$\forall j, \quad \sum_i w_i^j \le b_j \qquad (1)$$
$$\forall j, \quad \mathrm{Card}\{i,\; w_i^j > 0\} \le d_j \qquad (2)$$
$$\forall i, \quad \sum_j w_i^j \le w_i \qquad (3)$$

Constraint (1) is the outgoing bandwidth constraint at $S_j$, (2) is the degree constraint at $S_j$, and (3) is the capacity constraint at $C_i$. Therefore, MTBD is defined as follows:

Maximize $\sum_j \sum_i w_i^j$ under constraints (1), (2) and (3).
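To make the formulation concrete, here is a small checker (our own sketch, not code from the paper; names such as `is_valid_assignment` are ours) that verifies constraints (1)–(3) for a candidate assignment $w_i^j$, stored as `w[j][i]`, and computes the MTBD objective.

```python
from typing import List

def is_valid_assignment(w: List[List[float]], b: List[float], d: List[int],
                        cap: List[float]) -> bool:
    """Check constraints (1)-(3); w[j][i] is the bandwidth w_i^j from server j to client i."""
    m, n = len(b), len(cap)
    for j in range(m):
        if sum(w[j]) > b[j] + 1e-9:                         # (1) outgoing bandwidth of S_j
            return False
        if sum(1 for x in w[j] if x > 0) > d[j]:            # (2) degree of S_j
            return False
    for i in range(n):
        if sum(w[j][i] for j in range(m)) > cap[i] + 1e-9:  # (3) capacity of C_i
            return False
    return True

def throughput(w: List[List[float]]) -> float:
    """MTBD objective: total (fractional) number of tasks processed per time unit."""
    return sum(sum(row) for row in w)

# Toy example: 2 servers, 3 clients.
b, d, cap = [3.0, 2.0], [2, 1], [1.5, 1.5, 2.0]
w = [[1.5, 1.5, 0.0],   # server 0 serves clients 0 and 1
     [0.0, 0.0, 2.0]]   # server 1 serves client 2
assert is_valid_assignment(w, b, d, cap)
print(throughput(w))    # 5.0
```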
Due to the dynamic nature of large scale heterogeneous platforms and the results mentioned above, it becomes interesting to study MTBD in the more realistic dynamic scenario, where clients join and leave the system at any moment. In the online context, it makes sense to compare the algorithms according to their cost, in terms of the number of changes in platform topology induced by a client arrival or departure, and their performance, in terms of achieved throughput.
The rest of the paper is organized as follows. In Section II, we present the communication model we use, we formalize the scheduling problem we consider and we state the main results of this paper. Related work is discussed in Section III. In Section IV, we prove that no online algorithm whose cost is less than 2 can achieve a constant approximation ratio, whatever resource augmentation is used. In Section V, we present an online algorithm that maintains the optimal throughput (using an additive resource augmentation of 1 on the servers' degrees) and that involves at most 4 changes per server each time a client is added or removed. At last, we propose in Section VI several greedy heuristics to solve the online problem and we compare their performance (in terms of throughput) and their cost (in terms of changes in existing connections) through a set of extensive simulations. Concluding remarks are given in Section VII.

II. MODEL AND MAIN RESULTS

A. Platform Model and Maintenance Costs

Let us denote by $b_j$ the capacity of server $S_j$ and by $d_j$ the maximal number of connections that it can handle simultaneously (its degree). The capacity of client $C_i$ is denoted by $w_i$. All capacities are normalized and expressed in terms of (fractional) number of tasks per time-unit. Moreover, let us denote by $w_i^j$ the number of tasks per time-unit sent from server $S_j$ to client $C_i$. Then, as noticed in the introduction, MTBD can be expressed as a maximization problem under constraints (1), (2) and (3).
In the online version of MTBD, we introduce the notion of (virtual) rounds. A new round starts when a client joins or leaves the system, so that no duration is associated with a round. We denote by $\mathcal{C}^t$ the set of clients present at round $t$ (with their
respective capacities). Client $C$ joins (resp. leaves) the system at round $t$ if $C \in \mathcal{C}^t \setminus \mathcal{C}^{t-1}$ (resp. $C \in \mathcal{C}^{t-1} \setminus \mathcal{C}^t$). The arrival or departure of a client can therefore only take place at the beginning of a round, and $\forall t,\ |\mathcal{C}^t \setminus \mathcal{C}^{t-1}| + |\mathcal{C}^{t-1} \setminus \mathcal{C}^t| \le 1$. Let us denote by $\mathcal{S}$ the set of servers (with their respective bandwidth and degree constraints).
Solving the online version of MTBD comes in two flavours. First, one may want to maintain the optimal throughput (which is possible using resource augmentation on the degree) at a minimal cost in terms of changes in existing connections between clients and servers. Second, one may want to achieve a minimal number of changes in existing connections at each server and to obtain the best possible throughput.
In order to compare online solutions, we need to define precisely the cost of the changes to the existing allocation of clients to servers. Let us denote by $w_i^j(t)$ the fraction of outgoing bandwidth assigned from server $S_j$ to client $C_i$ at round $t$. We say that client $C_i$ is connected to server $S_j$ at round $t$ if $w_i^j(t) > 0$. We say that the connection between server $S_j$ and client $C_i$ changes at round $t$ if $w_i^j(t-1) \neq w_i^j(t)$, and we denote by $N_j^t = |\{i,\ w_i^j(t-1) \neq w_i^j(t)\}|$ the number of changes occurring at server $S_j$ at round $t$. This notion of change covers three different situations. If $w_i^j(t-1) = 0$ and $w_i^j(t) > 0$, then this change corresponds to a new connection to the server. Symmetrically, if $w_i^j(t-1) > 0$ and $w_i^j(t) = 0$, then client $C_i$ was disconnected from server $S_j$. Finally, if both $w_i^j(t-1)$ and $w_i^j(t)$ are positive, then this corresponds to a change in the quality of service required between client $C_i$ and server $S_j$. In our context, since we use complex QoS mechanisms to achieve the prescribed bandwidth between clients and servers, any change in bandwidth allocation induces some cost. If a new client connects to a server, a new TCP connection needs to be opened, which also induces some cost. On the other hand, all modifications in bandwidth connections made by the different server nodes can take place in parallel. Therefore, we introduce the following definition to measure and compare the algorithms that solve online MTBD while providing the same levels of throughput.
Definition 2.1: Let $A$ be an algorithm solving the online version of MTBD. It is said that $A$ produces at most $l$ changes in connections per round if:

$$\max_t \max_{S_j \in \mathcal{S}} N_j^t \le l$$

We will say that the cost of algorithm $A$ is the minimum such value of $l$.
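As an illustration of Definition 2.1 (again our own sketch, not the authors' code), the following helpers compute $N_j^t$ from two successive allocations and the resulting cost of a whole run.

```python
from typing import Dict, List, Tuple

Allocation = Dict[Tuple[int, int], float]   # (j, i) -> bandwidth w_i^j(t); missing key means 0

def changes_per_server(prev: Allocation, curr: Allocation, m: int) -> List[int]:
    """N_j^t = number of clients i such that w_i^j(t-1) != w_i^j(t), for each server j."""
    n_changes = [0] * m
    for (j, i) in set(prev) | set(curr):
        if prev.get((j, i), 0.0) != curr.get((j, i), 0.0):
            n_changes[j] += 1
    return n_changes

def online_cost(rounds: List[Allocation], m: int) -> int:
    """Cost of an online run: max over rounds and servers of N_j^t (Definition 2.1)."""
    cost = 0
    for prev, curr in zip(rounds, rounds[1:]):
        cost = max(cost, max(changes_per_server(prev, curr, m)))
    return cost

# Example: one server (j = 0); client 1 replaces client 0 at round 1 -> 2 changes.
rounds = [{(0, 0): 1.0}, {(0, 1): 1.0}]
print(online_cost(rounds, m=1))   # 2
```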
B. Main Results

The first result we present in this paper is that no online algorithm with cost less than 2 can achieve a constant approximation ratio, whatever the resource augmentation on the degree. This is Theorem 4.1 in Section IV.
The second result shows that there exists a polynomial time online algorithm whose cost is at most 4,
with a resource augmentation of 1, and that maintains the optimal throughput. Indeed, we know that Algorithm SEQ, introduced in [1], provides at least the optimal throughput using the smallest possible additive resource augmentation $\alpha = 1$. Hence, we turn Algorithm SEQ into an online algorithm and use it to solve the online version of MTBD. The online version of Algorithm SEQ is called OSEQ, and it recomputes at each round all the connections between servers and clients (see Algorithm 1 in Section V). In order to evaluate Algorithm OSEQ, we first prove (Lemma 5.1, based on [1]) that, at each round, it provides at least as much throughput as an optimal (offline) solution without resource augmentation. Then, we show in Theorem 5.4 that the cost of Algorithm OSEQ is at most 4. Therefore, in our context, maintaining the optimal throughput (with resource augmentation) is not more expensive, in terms of online cost and up to a constant ratio smaller than 2, than maintaining a constant approximation ratio of the optimal throughput.

III. RELATED WORK

A closely related problem is Bin Packing with Splittable Items and Cardinality Constraints. The goal in this problem is to pack a given set of items in as few bins as possible. The items may be split, but each bin may contain at most $k$ items or pieces of items. This is very close to the problem we consider, with two main differences: in our case the number of servers (corresponding to bins) is fixed in advance, and the goal is to maximize the total bandwidth throughput (corresponding to the total packed size), whereas the goal in Bin Packing is to minimize the number of bins used to pack all the items. Furthermore, we consider heterogeneous servers (which would correspond to bins with heterogeneous capacities and heterogeneous cardinality constraints).
As far as we know, Bin Packing with splittable items and cardinality constraints was introduced in the context of memory allocation in parallel processors by Chung et al. [12], who considered the special case where $k = 2$. They showed that even in this simple case the problem is NP-Complete, and they proposed a 3/2-approximation algorithm. Epstein and van Stee [13] showed that Bin Packing with splittable items and cardinality constraints is NP-Hard for any fixed value of $k$, and that the simple NEXT-FIT algorithm has an approximation ratio of $2 - 1/k$. They also designed a PTAS and a dual PTAS [14] for the general case where $k$ is a constant. Other related problems were introduced by Shachnai et al. [15], in which the size of an item increases when it is split, or there exists a global bound on the number of fragmentations. The authors prove that these two problems do not admit a PTAS, and provide a dual PTAS and an asymptotic PTAS. In a multiprocessor scheduling context, another related problem is scheduling with allotment and parallelism constraints [16], where the goal is to schedule a set of tasks, each task having a bound on the number of machines that can process it simultaneously, and another bound on the overall number of machines that can participate in its execution. This
problem can also be seen as a splittable packing problem, but this time with a bound $k_i$ on the number of times an item can be split. In [16], an approximation algorithm of ratio $\max_i(1 + 1/k_i)$ is presented.
In a related context, resource augmentation techniques have already been successfully applied to online scheduling problems [17], [18], [19], [20] in order to prove optimality or good approximation ratios. More precisely, it has been established that several well-known online algorithms, which have poor performance from an absolute worst-case perspective, are optimal for these problems when allowed moderately more resources [18]. In this paper, we consider a slightly different context, since the offline solution already requires resource augmentation on the servers' degrees. We prove that it is possible in the online context to maintain, at relatively low cost, a solution that achieves the optimal throughput with the same resource augmentation as in the offline context.

IV. UNAPPROXIMABILITY RESULTS

In this section, we prove that no online algorithm whose cost is less than 2 can achieve a constant approximation ratio for the online MTBD problem. This result holds true even if we allow any constant resource augmentation of the degree of the servers, which strongly differs from the offline setting, where a constant additive resource augmentation of 1 is enough to achieve the optimal throughput. The proof is by counter-example.
An algorithm $A_\alpha$ uses a resource augmentation factor $\alpha \ge 1$ when the maximal degree used by a server $S_j$ is $d_j + \alpha$ while its original degree is $d_j$. Moreover, let us denote by $OPT(I)$ the optimal throughput on the instance $I$, and by $A_\alpha(I)$ the throughput provided by Algorithm $A_\alpha$ on instance $I$.
Theorem 4.1: Given a resource augmentation factor $\alpha$ and a constant $k$, there exists an instance $I$ of online MTBD such that, for any algorithm $A_\alpha$ with cost less than 2,

$$A_\alpha(I) \le \frac{1}{k}\, OPT(I).$$

Proof: We exhibit an instance $I$ on which any online algorithm with cost less than 2 fails to achieve the required approximation ratio. The platform consists of only one server $S$ with bandwidth $b = (2k)^{\alpha+1}$ and degree constraint $d = 1$. On the other hand, we consider a set of clients $C_0, C_1, \ldots, C_{\alpha+1}$ whose capacities are $1, 2k, (2k)^2, \ldots, (2k)^{\alpha+1}$. In instance $I$, clients arrive one after the other, by increasing capacities. More precisely, at round $j$, for $0 \le j \le \alpha+1$, the new client $C_j$ with capacity $(2k)^j$ is added. Clearly, since the degree of the server is 1, only one client can be attached to the server and, since clients arrive by increasing capacity, the optimal solution consists in attaching $C_j$ to the server at round $j$. Note that maintaining this optimal solution at any time step has cost 2, since at each round, client $C_j$ is connected to the server and client $C_{j-1}$ is disconnected. In fact, any online algorithm that achieves an approximation ratio of at most $k$ must attach $C_j$ to the server at round $j$. Indeed, $A_\alpha(I)$
In particular, since by hypothesis $x \le w_p$ and $x + y \le w_{p+1}$, we have that in all cases, $D(u, v) \le C(u, v)$. Furthermore, since $x \ge w_{p-1}$, we also have that $D(u, v) \ge C(u-1, v-1)$.
Let us now consider the application of OSEQ$(d, b)$ to $C$. According to the remark about dummy clients in Section V-A, we will consider that a suitable interval $[l, l+d]$ was found, which means that $C(l, l+d-1) < b$ and $C(l+1, l+d) \ge b$. Furthermore, the allocation $A$ is $(C_l, \ldots, C_{l+d-1}, C_{l+d}^{(a)})$, and the updated list $C'$ is $(C_1, \ldots, C_{l-1}, C_{l+d}^{(b)}, \ldots, C_n)$, where $C_{l+d}^{(a)}$ and $C_{l+d}^{(b)}$ are the two parts of the split client $C_{l+d}$.
If the change from $C$ to $D$ is outside this interval (i.e., $p < l$ or $p > l+d$), then this change does not affect our algorithm, and allocations $A$ and $B$ are the same. In that case, the result holds. Otherwise, the resulting allocation $B$ depends on the value of $D(l+1, l+d)$. Indeed, our previous bounds for $D(u, v)$ imply that $D(l, l+d-1) < b$ and $D(l+2, l+d+1) \ge b$. Thus either $[l, l+d]$ or $[l+1, l+d+1]$ is a suitable interval for the application of OSEQ to $D$.
• Let us suppose that $D(l+1, l+d) \ge b$. In this case, the resulting allocation $B$ is $(C_l, \ldots, X, C_p^{(i)}, \ldots, C_{l+d-1}^{(a)})$. It differs from $A$ by 4 changes: the addition of $X$, the removal of $C_{l+d}^{(a)}$, and the modification of both $C_p$ and $C_{l+d-1}$. The updated list of clients $D'$ is $(C_1, \ldots, C_{l-1}, C_{l+d-1}^{(b)}, C_{l+d}, \ldots, C_n)$. It is thus an augmented version of $C'$, since $C_{l+d-1}^{(b)}$ is inserted between $C_{l-1}$ and $C_{l+d}^{(b)}$, whose capacity is increased to $w_{l+d}$. Indeed, by definition of the splitting process, $w(C_{l+d-1}^{(b)}) = D(l, l+d) - b$ and $w(C_{l+d}^{(b)}) = C(l, l+d) - b$, which implies $w(C_{l+d-1}^{(b)}) \le w(C_{l+d}^{(b)})$.
If $p = l+d$, then $X$ is the split client. This case actually results in only two changes in the allocations: the addition of (one part of) $X$, and the removal of $C_{l+d}^{(a)}$. The updated list $D'$ is also an augmented version of $C'$, with the other part of $X$ inserted, and the capacity of $C_{l+d}^{(b)}$ increased to $w_{l+d} + y$.
• Let us now suppose that $D(l+1, l+d) < b$. In this case, the suitable interval is $[l+1, l+d+1]$ and thus the resulting allocation $B$ is $(C_{l+1}, \ldots, X, C_p^{(i)}, \ldots, C_{l+d}^{(a')})$. Once again, it differs from $A$ by 4 changes: the addition of $X$, the removal of $C_l$, and the modification of both $C_p$ and $C_{l+d}^{(a)}$. The updated list of clients $D'$ is $(C_1, \ldots, C_l, C_{l+d}^{(b')}, \ldots, C_n)$. It is thus an augmented version of $C'$, since $C_l$ is inserted between $C_{l-1}$ and $C_{l+d}^{(b)}$, whose capacity is increased to $w(C_{l+d}^{(b')})$. This time, the fact that $w_l \le w(C_{l+d}^{(b)})$ is derived from the locality property of OSEQ, see Section V-A.
If $p = l$, then $X$ is actually not included in $B$. This case results in only two changes: the modification of the capacity of $C_p$, and the fact that $C_{l+d}$ is split differently. The updated list $D'$ is also an augmented version of $C'$, with $X$ inserted and the capacity of $C_{l+d}^{(b)}$ increased.
Theorem 5.4: The cost of Algorithm OSEQ is at most 4.
Proof: We will prove that if two inputs $C$ and $D$ of OSEQ differ by the addition of a new client, then the resulting allocations to each server differ by at most 4 changes. Denote by $C^j$ the current list of clients after applying the first $j$ steps of OSEQ starting from $C^0 = C$, and similarly for $D$. It is clear that $D$ is an augmented version of $C$, and it follows from Lemma 5.3 that if $D^j$ is an augmented version of $C^j$, then $D^{j+1}$ is an augmented version of $C^{j+1}$. By induction, Lemma 5.3 therefore applies throughout the whole execution of OSEQ, and all the allocations produced differ by at most 4 changes. In the case of the removal of a client, we can simply swap the roles of $C$ and $D$ in the previous statements.
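Algorithm SEQ itself is defined in [1] and only used indirectly here, so the following is a simplified sketch of the per-server allocation step as we read it from the proofs above (prefix sums $C(u, v)$, suitable interval $[l, l+d]$, split client); the function and variable names are ours and the dummy-client handling is simplified.

```python
from typing import List, Tuple

def seq_step(clients: List[float], b: float, d: int) -> Tuple[List[float], List[float]]:
    """One per-server allocation step of SEQ/OSEQ (illustrative sketch, not the authors' code).

    `clients` holds the remaining client capacities, sorted in non-decreasing order.
    The server has outgoing bandwidth `b` and degree `d`; it may use d + 1 connections,
    which is the additive resource augmentation of 1. Returns the bandwidths allocated
    by this server and the updated client list, where the split client is replaced by
    its leftover part, as in the lists C' / D' above.
    """
    padded = [0.0] * (d + 1) + clients             # dummy clients (cf. Section V-A)
    prefix = [0.0]
    for w in padded:
        prefix.append(prefix[-1] + w)
    C = lambda u, v: prefix[v] - prefix[u - 1]     # C(u, v) = w_u + ... + w_v, 1-indexed

    # Find the suitable interval [l, l+d]: C(l, l+d-1) < b and C(l+1, l+d) >= b.
    l = 1
    while l + d < len(padded) and C(l + 1, l + d) < b:
        l += 1

    full = padded[l - 1 : l + d - 1]               # clients l .. l+d-1, fully served
    split_a = min(padded[l + d - 1], b - C(l, l + d - 1))   # allocated part of client l+d
    split_b = padded[l + d - 1] - split_a                   # leftover part of client l+d
    allocation = [w for w in full if w > 0] + ([split_a] if split_a > 0 else [])
    remaining = [w for w in padded[: l - 1] + [split_b] + padded[l + d :] if w > 0]
    return allocation, remaining

# Tiny example: one server with b = 5, d = 2, clients with capacities 1, 2, 3, 4.
print(seq_step([1.0, 2.0, 3.0, 4.0], b=5.0, d=2))   # ([1.0, 2.0, 2.0], [1.0, 4.0])
```

In the example, the server fills its bandwidth exactly (1 + 2 + 2 = 5) using d + 1 = 3 connections, and the leftover part of the split client (capacity 1) is returned to the list together with the untouched client of capacity 4.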
C. Implementation details

We have presented OSEQ as an algorithm that recomputes from scratch the whole solution each time an event occurs (arrival or departure of a client). However, the proof of Lemma 5.3 shows clearly that it is possible to compute only the changes between the previous allocation and the new one. The implementation is made more complex by the analysis of many cases depending on the values of $p$, $l$, $l+d$ and so on. However, it is also more efficient, since for each server, we only have to decide whether the suitable interval is $[l, l+d_j]$ or $[l+1, l+d_j+1]$, instead of going through the whole list of clients. It is thus possible to do it in constant time, which yields a global complexity of $\Theta(m)$ for algorithm OSEQ.

VI. SIMULATION RESULTS

A. Heuristics for comparison

As already mentioned in Section III, related work has mostly been done in the context of Bin Packing, where there is an infinite supply of identical bins and the goal is to pack all items in as few bins as possible (in our context, servers are bins and clients are items). Interestingly, in this setting, the NEXT-FIT algorithm has a worst-case approximation ratio of $2 - 1/k$ [13], but it can easily be observed that it does not exhibit a constant approximation ratio for the total packed size when the number of bins is fixed. Moreover, most existing algorithms in this context are approximation schemes, with prohibitive running times. To provide a basis of comparison, we thus introduce online versions of the natural greedy heuristics that performed best in the offline setting; a sketch of the arrival rule they share is given after their description.

Online Best Connection (OBC). Servers are ordered by their remaining capacity per connection, which is defined as the ratio between the remaining capacity $b_j'$ and the remaining available degree $d_j'$. When a new client arrives, it is connected to the server whose capacity per connection is closest to the client's capacity. If no server is available, OBC goes through all servers which have some bandwidth remaining but no degree left, and swaps the newly arrived client with a smaller one, selecting the one which yields the largest gain in total throughput. When a client $X$ leaves, OBC tries to use the newly available bandwidth to reduce the in-degree of other clients. Assume that $X$ was connected to server $S$, and that client $Y$ is connected to both $S$ and $S'$. When $X$ leaves, $S$ can reallocate the corresponding bandwidth to client $Y$, which can be of interest if this allows $Y$ to be disconnected from $S'$, since this lowers the out-degree of $S'$. OBC selects as many such incident connections as possible, starting from the smallest ones. If there are some unconnected clients remaining, OBC then acts as if they had just arrived and tries to connect them with the procedure described earlier.

Online Largest Server (OLS). This heuristic is very similar to the previous one. The only difference is the order in which the servers are sorted: in OLS, when a new client arrives, it is connected to the server with the largest available bandwidth. The rest of the heuristic is the same.
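As announced above, here is a minimal sketch (our own code, not from the paper) of the server-selection rule shared by OBC and OLS on client arrival; the swap procedure and the departure handling are omitted, and all names are ours.

```python
from typing import List, Optional

class Server:
    def __init__(self, bandwidth: float, degree: int):
        self.remaining_bw = bandwidth      # b'_j: remaining outgoing bandwidth
        self.remaining_deg = degree        # d'_j: remaining available degree

def pick_server(servers: List[Server], client_cap: float, rule: str = "OBC") -> Optional[Server]:
    """Server selection on client arrival, as described for OBC and OLS above.

    OBC: pick the server whose remaining capacity per connection (b'_j / d'_j)
         is closest to the client's capacity.
    OLS: pick the server with the largest remaining bandwidth.
    Returns None when no server has both bandwidth and degree left (the swap
    procedure of OBC is not sketched here).
    """
    candidates = [s for s in servers if s.remaining_deg > 0 and s.remaining_bw > 0]
    if not candidates:
        return None
    if rule == "OBC":
        return min(candidates,
                   key=lambda s: abs(s.remaining_bw / s.remaining_deg - client_cap))
    return max(candidates, key=lambda s: s.remaining_bw)

def connect(server: Server, client_cap: float) -> float:
    """Open one connection and return the bandwidth actually allocated."""
    alloc = min(client_cap, server.remaining_bw)
    server.remaining_bw -= alloc
    server.remaining_deg -= 1
    return alloc
```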
B. Random Instance Generation

We generate instances randomly, trying to focus at the same time on realistic scenarios and on difficult instances. Instances are more difficult to solve when the sum of the server capacities is roughly equal to the sum of the client capacities. Indeed, the minimum of both is a trivial upper bound on the total achievable
throughput, and a large difference between them provides a lot of freedom on the largest component to reach this upper bound. Based on the same idea, we generate instances where the sum of the server degrees $\sum_j d_j$ is roughly equal to the number $n$ of clients. In order to get a realistic distribution of server and client capacities, we have used information available from the volunteer computing project GIMPS [21], which provides the average computing power of all its participants. A simple statistical study shows that the computational power (based on the 7,000 largest participants) follows a power-law distribution with exponent $\hat{\alpha} \approx 2.09$. We have thus used this distribution and this exponent to generate the capacities of both clients and servers. The resulting values are then scaled so that their sums ($\sum_i w_i$ and $\sum_j b_j$) are roughly equal. The same distribution, followed by a similar scaling, is also used to generate the degrees of the servers.
To generate online instances, we start from a complete instance with $m$ servers and $n = 50m$ clients. Two kinds of random events are then generated: departure of a client (picked uniformly at random), or arrival of a newly generated client. We generate 300 such events, each kind having probability 1/2. The time intervals between two successive events are generated as a Poisson process.
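The following sketch (ours, not the authors' generator) shows one way to produce such an instance: power-law capacities with exponent $\hat{\alpha} \approx 2.09$, scaling so that the sums of client and server capacities match, and exponentially distributed inter-event times (which is what a Poisson event process implies).

```python
import random

ALPHA = 2.09   # power-law exponent estimated from the GIMPS data (see above)

def power_law(alpha: float = ALPHA, x_min: float = 1.0) -> float:
    """Draw one sample from a Pareto-type power law with the given exponent."""
    return x_min * (1.0 - random.random()) ** (-1.0 / (alpha - 1.0))

def generate_instance(m: int, clients_per_server: int = 50):
    n = clients_per_server * m
    client_caps = [power_law() for _ in range(n)]
    server_caps = [power_law() for _ in range(m)]
    degrees     = [power_law() for _ in range(m)]

    # Scale server capacities so that sum_j b_j ~ sum_i w_i,
    # and degrees so that sum_j d_j ~ n.
    bw_scale  = sum(client_caps) / sum(server_caps)
    deg_scale = n / sum(degrees)
    server_caps = [b * bw_scale for b in server_caps]
    degrees     = [max(1, round(d * deg_scale)) for d in degrees]

    # 300 events: client departure or arrival with probability 1/2 each,
    # separated by exponential inter-arrival times (Poisson process).
    events, t = [], 0.0
    for _ in range(300):
        t += random.expovariate(1.0)
        if random.random() < 0.5 and client_caps:
            events.append((t, "leave", random.randrange(len(client_caps))))
        else:
            events.append((t, "join", power_law()))
    return client_caps, server_caps, degrees, events
```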
C. Results

We ran simulations for different instance sizes by varying the number $m$ of servers. For each value of $m$, 250 instances were generated, and we plot on the figures the average, the median, and the first and last decile over these 250 instances. For each algorithm, the line connects the average values, the upper error bar shows the last decile (which means that on 10% of the instances, the value was higher), the lower error bar shows the first decile (the value was lower on 10% of the instances), and the lonely mark in between is the median (half of the instances had lower values).

[Figure 1: Average normalized tasks computed. Figure 2: Maximum cost. Figure 3: Total cost over 300 events. Each figure plots the corresponding quantity against the number of servers, with one curve per algorithm (SEQ, OLS, OBC).]

On Figure 1, we plot the total number of computed tasks throughout the instance, which is simply the integral over time of the instantaneous throughput (assuming for simplicity that changes from one solution to the other take no time). The value obtained is then normalized against the upper bound $\min(\sum_j b_j, \sum_i w_i)$ (so that an average over 250 instances makes sense). We can see that although this upper bound is very likely to be unreachable (because of the degree constraints), OSEQ achieves about 90% of this value for all instance sizes. On the other hand, the performance of OBC gets worse on average when the instances grow, and is also very variable: it can get as low as 30% of the upper bound, even on smaller instances. The situation is even worse for OLS, with a very low average performance and roughly the same variability.
Figure 2 shows the cost of the algorithms. We can see that the cost of OSEQ is always 4, while the cost of OBC is between 10 and 15 on average; however, it is once again very variable and reaches 25 on roughly 10% of the smaller instances. The behavior of OLS is worse here also, with a cost of 25 on average, and over 40 on 10% of the instances, for all sizes. Remember that the average out-degree of the servers is 50 (instances with $m$ servers contain $50m$ clients), so this result means that it is quite likely that, using OLS or OBC, at some point in the execution, one server has to change almost half of the clients it is connected to.
On the other hand, Figure 3 shows the sum of the costs of all events. We can see that the average cost of an event for OSEQ is around 2.3 (it is above 3 for 10% of the instances), while it is around 1.6 (varying between 1.3 and 2) for OBC
and OLS. This shows that events which incur many changes for OBC or OLS are relatively rare, and are compensated by many events that generate very few changes. However, we feel that the cost of maintaining the guarantees is justified by the much higher performance and use of computing resources, and by the stability of OSEQ.

VII. CONCLUSION

We have considered the problem of the online allocation of a set of clients to a set of servers, which is a useful problem, for instance, in large scale distributed platforms such as volunteer computing networks. To model communications, we use the bounded degree multi-port model, where each server is characterized by its outgoing bandwidth and the maximal number of TCP connections it can handle with QoS mechanisms, and each client is characterized by its capacity, i.e., the number of tasks that it can handle within one time unit. This model is much more realistic than the one-port models traditionally used in scheduling theory. In this paper, we prove that this model is also tractable, even in an online setting, provided that we also use a slight resource augmentation on the servers' degrees. More specifically, we prove that it is possible to maintain, using an additive resource augmentation of 1 on the servers' degrees, a solution whose cost is constant (4) in terms of maximal changes in connections when a client joins or leaves the system, whereas maintaining a solution that achieves a constant approximation ratio has cost at least 2, whatever the resource augmentation of the degrees. At last, we provide a set of extensive simulations based on realistic degree and capacity distributions observed in volunteer computing networks. We compare our algorithm against a sophisticated greedy algorithm, and the simulations show that our algorithm achieves a much better throughput (i.e., it processes 25% more tasks than the greedy one), while its worst-case number of induced changes in existing connections is 3 times lower. Possible extensions of this work would consist in adapting these mechanisms to platforms that process several types of tasks instead of one, or in using the available outgoing bandwidth of the clients to build more complex topologies, where clients may act as intermediate servers for the tasks they cannot process by themselves.

ACKNOWLEDGMENT

This work was partially supported by the French ANR projects Alpage and Shaman.

REFERENCES

[1] O. Beaumont, L. Eyraud-Dubois, H. Rejeb, and C. Thraves, "Allocation of Clients to Multiple Servers on Large Scale Heterogeneous Platforms," in Proceedings of the Fifteenth International Conference on Parallel and Distributed Systems (ICPADS), 2009.
[2] D. Anderson, "BOINC: A System for Public-Resource Computing and Storage," in 5th IEEE/ACM International Workshop on Grid Computing, 2004, pp. 365–372.
[3] S. Larson, C. Snow, M. Shirts, and V. Pande, "Folding@Home and Genome@Home: Using distributed computing to tackle previously intractable problems in computational biology," Computational Genomics, 2002.
[4] D. Anderson, J. Cobb, E. Korpela, M. Lebofsky, and D. Werthimer, "SETI@home: an experiment in public-resource computing," Communications of the ACM, vol. 45, no. 11, pp. 56–61, 2002.
[5] D. Bertsimas and D. Gamarnik, "Asymptotically optimal algorithm for job shop scheduling and packet routing," Journal of Algorithms, vol. 33, no. 2, pp. 296–318, 1999.
[6] C. Banino, O. Beaumont, L. Carter, J. Ferrante, A. Legrand, and Y. Robert, "Scheduling Strategies for Master-Slave Tasking on Heterogeneous Processor Platforms," IEEE Transactions on Parallel and Distributed Systems, pp. 319–330, 2004.
[7] O. Beaumont, A. Legrand, L. Marchal, and Y. Robert, "Pipelining Broadcasts on Heterogeneous Platforms," IEEE Transactions on Parallel and Distributed Systems, pp. 300–313, 2005.
[8] B. Hong and V. Prasanna, "Distributed adaptive task allocation in heterogeneous computing environments to maximize throughput," in 18th International Parallel and Distributed Processing Symposium (IPDPS), 2004.
[9] T. Saif and M. Parashar, "Understanding the Behavior and Performance of Non-blocking Communications in MPI," Lecture Notes in Computer Science, pp. 173–182, 2004.
[10] M. A. Brown, "Traffic Control HOWTO. Chapter 6. Classless Queuing Disciplines," http://tldp.org/HOWTO/Traffic-Control-HOWTO/classlessqdiscs.html, 2006.
[11] B. Hubert et al., "Linux Advanced Routing & Traffic Control. Chapter 9. Queueing Disciplines for Bandwidth Management," http://lartc.org/lartc.pdf, 2002.
[12] F. Chung, R. Graham, J. Mao, and G. Varghese, "Parallelism versus Memory Allocation in Pipelined Router Forwarding Engines," Theory of Computing Systems, vol. 39, no. 6, pp. 829–849, 2006.
[13] L. Epstein and R. van Stee, "Improved results for a memory allocation problem," in WADS, ser. Lecture Notes in Computer Science, F. K. H. A. Dehne, J.-R. Sack, and N. Zeh, Eds., vol. 4619. Springer, 2007, pp. 362–373.
[14] ——, "Approximation Schemes for Packing Splittable Items with Cardinality Constraints," Lecture Notes in Computer Science, vol. 4927, p. 232, 2008.
[15] H. Shachnai, T. Tamir, and O. Yehezkely, "Approximation schemes for packing with item fragmentation," Theory Comput. Syst., vol. 43, no. 1, pp. 81–98, 2008.
[16] H. Shachnai and T. Tamir, "Multiprocessor Scheduling with Machine Allotment and Parallelism Constraints," Algorithmica, vol. 32, no. 4, pp. 651–678, 2002.
[17] B. Kalyanasundaram and K. Pruhs, "Speed is as powerful as clairvoyance," J. ACM, vol. 47, no. 4, pp. 617–643, 2000.
[18] C. Phillips, "Optimal Time-Critical Scheduling via Resource Augmentation," Algorithmica, vol. 32, no. 2, pp. 163–200, 2008.
[19] C. Chekuri, A. Goel, S. Khanna, and A. Kumar, "Multi-processor scheduling to minimize flow time with resource augmentation," in STOC '04: Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing. New York, NY, USA: ACM, 2004, pp. 363–372.
[20] S. Funk, J. Goossens, and S. Baruah, "On-line scheduling on uniform multiprocessors," in Real-Time Systems Symposium, IEEE International, 2001, p. 183.
[21] "The Great Internet Mersenne Prime Search (GIMPS)," http://www.mersenne.org/.