How to Avoid Herd: A Novel Stochastic Algorithm in Grid Scheduling
Qinghua Zheng1,2, Haijun Yang3 and Yuzhong Sun1
1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100080
2 Graduate School of the Chinese Academy of Sciences, Beijing, 100049
3 School of Economics and Management, BEIHANG University
[email protected],
[email protected],
[email protected]
Abstract
Grid technologies promise to bring grid users high performance. Consequently, scheduling is becoming a crucial problem. Herd behavior is a common phenomenon that causes severe performance degradation in grid environments as a result of bad scheduling decisions. In this paper, on the basis of the theoretical results for the homogeneous balls and bins model, we propose a novel stochastic algorithm to avoid herd behavior. Our experiments show that the multi-choice strategy, combined with the advantages of DHT, can reduce herd behavior in large-scale sharing environments while providing better scheduling performance and incurring much less scheduling overhead than greedy algorithms. In the case of 1000 resources, the simulations show that, under heavy load (i.e. a system utilization rate of 0.5), the multi-choice algorithm reduces the number of incurred herds by a factor of 36, the average job waiting time by a factor of 8, and the average job turn-around time by 12% compared to the greedy algorithms.
1. Introduction
Scheduling is one of the key issues in grid computing, and various scheduling systems can be found in many grid projects, such as GrADS[1] and Ninf[2]. Among all the challenges in grid scheduling, "herd behavior" (meaning that "all tasks arriving in an information update interval go to the same subset of servers")[3], which causes imbalance and drastic contention on resources, is a critical one. The large scale of grid resources partitions the network into many autonomous sites, each making its own decisions. However, there is theoretical evidence that systems in which resource allocation is performed by many independent entities can exhibit herd
behavior and thus degrade performance[3]. Herd behavior is caused by independent entities that, with imperfect and stale system information, simultaneously allocate multiple requests onto the same resource. Here, "imperfect" means that an entity does not know the others' decisions when it performs resource allocation. Such herd behavior can be found in supercomputer scheduling[4], where many SAs (short for Supercomputer AppLeS) scheduled jobs on a supercomputer. An SA estimated the turn-around time of each request based on the current state of the supercomputer, and then forwarded the request with the smallest expected turn-around time to the supercomputer. The decision made by one SA affected the state of the system and therefore impacted the other SA instances, so the global behavior of the system arose from the aggregate behavior of all SAs. We can abstract the above system into the model shown in Fig. 1, in which herd behavior might happen. In this model, jobs are first submitted to one of the SIs (scheduling instances) and then forwarded to one of the shared resources, allocated independently by that SI. When multiple SIs, with imperfect and stale information, perform the same resource allocation simultaneously, herd behavior occurs, leaving some resources overloaded while others sit idle.
Fig. 1. Job scheduling model (SI: scheduling instance; R: resource).
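To make the phenomenon concrete, the following minimal sketch (our own illustration, not the system of [4]) contrasts greedy placement on stale load information with purely random placement; all parameters, policies and helper names here are hypothetical. Within an update interval every scheduling instance sees the same stale snapshot, so the greedy policy sends every job of that interval to the same "least loaded" server.

import random

def simulate(policy, servers=10, instances=20, intervals=50, seed=1):
    """Return the peak queue length observed under a given placement policy.

    Load information is refreshed only at the start of each interval, so all
    scheduling instances act on the same stale snapshot within an interval.
    """
    rng = random.Random(seed)
    queues = [0] * servers
    peak = 0
    for _ in range(intervals):
        snapshot = list(queues)                      # stale info shared by all instances
        for _ in range(instances):                   # each instance places one job
            if policy == "greedy":
                target = snapshot.index(min(snapshot))   # identical choice for everyone
            else:
                target = rng.randrange(servers)          # randomized placement
            queues[target] += 1
        queues = [max(0, q - 2) for q in queues]     # each server finishes up to two jobs
        peak = max(peak, max(queues))
    return peak

print("greedy on stale info, peak queue:", simulate("greedy"))
print("random placement,     peak queue:", simulate("random"))

With these (arbitrary) settings the greedy policy piles jobs onto one server per interval and produces a much larger peak queue than random placement, even though the total load is identical.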
This abstract model remains reasonable in a grid environment. The autonomy of each site and network latency make the scheduling information imperfect and stale, and they even worsen these characteristics compared to the supercomputer environment. With
imperfect and stale information, however, the existing grid scheduling methods may cause herd behavior and thus degrade system performance. Preventing herd behavior in grid scheduling is therefore an important problem for improving system performance. The main contributions of this paper are as follows: 1) On the basis of the balls and bins model, we propose a novel stochastic algorithm, named the multi-choice algorithm, that reduces the herds incurred in grid scheduling by at least one order of magnitude. 2) By combining the advantages of DHT and data replication techniques, we bridge the gap between a theoretical result (the homogeneous balls and bins model) and a complex, realistic setting (the heterogeneous, distributed grid environment). 3) Simulations demonstrate that the multi-choice algorithm provides better scheduling performance while incurring much less overhead than conventional algorithms. In the case of 1000 resources, the simulations show that, under heavy load (i.e. a system utilization rate of 0.5), the multi-choice algorithm reduces the average job waiting time by a factor of 8 and the average job turn-around time by 12% compared to the greedy algorithms. The rest of this paper is structured as follows. Section 2 compares our method to related work. Section 3 presents our scheme to prevent herd behavior. Section 4 describes our implementation and algorithms. Section 5 presents simulations supporting our claims. Finally, we summarize our conclusions in Section 6.
2. Related work
The goal of the Grid Application Development Software (GrADS) Project is to provide programming tools and an execution environment that ease program development for the Grid. Its scheduling procedure involves the following general steps: (i) identify a large number of sets of resources that might be good platforms for the application, (ii) use the application-specific mapper and performance model specified in the COPs (configurable object programs) to generate a data map and predict execution time for those resource sets, and (iii) select the resource set that yields the best rank value (e.g. the smallest predicted execution time). The goal of Ninf is to develop a platform for global scientific computing with computational resources distributed in a worldwide network. One of its
components, named the metaserver, provides the scheduler and resource monitor. The scheduler obtains the computing servers' load information from the resource information database and selects the most appropriate server according to the characteristics of the requested function (usually the server that would yield the best response time while maintaining reasonable overall system throughput). Although the details of the above scheduling systems vary dramatically, they share an aspect that can cause herd behavior: all of them are greedy in nature, selecting the best resource set according to some criterion. When jobs simultaneously arrive at multiple scheduling instances, the identical schedules taken by these instances may aggregate on the system, and herding happens. Condor[5] is a distributed batch system for sharing the workload of compute-intensive jobs in a pool of UNIX workstations connected by a network. In a pool, the Central Manager is responsible for job scheduling: it finds matches between various ClassAds[6,7], in particular resource requests and resource offers, choosing the one with the highest Rank value (noninteger values are treated as zero) among compatible matches and breaking ties according to the provider's rank value[6]. To allow multiple Condor pools to share resources by sending jobs to each other, Condor employs a distributed, layered flocking mechanism whose basis is formed by the Gateway machines, at least one in every participating pool[8]. The purpose of these Gateway machines is to act as resource brokers between pools. A Gateway chooses at random a machine from the availability lists received from the other pools to which it is connected and presents that machine to the Central Manager for job allocation. When a job is assigned to the Gateway, the Gateway scans the availability lists in random order until it encounters a machine satisfying the job requirements. It then sends the job to the pool to which this machine belongs. Although DHT can be used to organize Condor pools automatically[9], the fashion of scheduling within a Condor flock remains unchanged. Scheduling in Condor does not cause herd behavior, because: (1) it defines a scope (the Condor pool) within which only one scheduling instance runs, so scheduling in one scope does not affect others; (2) stochastic selection is used when scheduling across a Condor flock. Some disadvantages are obvious: a large scope raises concerns about performance bottlenecks and a single point of failure, while a small scope prevents certain optimizations due to the layered Condor flocking mechanism[8]. This scoping limits the efficiency and flexibility with which
grid resources can be shared. Nevertheless, its use of randomness is a key idea for preventing herd behavior. Randomization has been widely reported to be useful for scheduling. In [10], it was pointed out that parameters of performance models may exhibit a range of values because, in non-dedicated distributed systems, execution performance can vary widely as other users share resources. The experience reported in [11] shows that introducing some degree of randomness into several of the host-selection heuristics can lead to better scheduling decisions in certain cases. When multiple task choices achieve, or are very close to, the required minimum or maximum of some value (e.g. minimum completion time, maximum sufferage, etc.), their idea is to pick one of these tasks at random rather than letting the choice be guided by the implementation of the heuristics. The paper [12] described a methodology that allows schedulers to take advantage of stochastic information (choosing a value from the range given by a stochastic prediction of performance) to improve application execution performance. In the next section, we will present a novel stochastic algorithm based on the balls-and-bins model.
3. How to avoid herd behavior
3.1. Balls and bins model
The balls and bins model is a theoretical model for studying load balancing. Here, we introduce some of its results.
Suppose that we sequentially place n balls into n boxes by putting each ball into a randomly chosen box. It is well known that when we are done, the fullest box holds, with high probability, (1 + o(1)) ln n / ln ln n balls. Suppose instead that for each ball we choose d (d ≥ 2) boxes at random and place the ball into the one that is least full at the time of placement. Azar et al.[13] showed that with high probability the fullest box then contains only ln ln n / ln d + Θ(1) balls, exponentially fewer than before. Further, Vöcking[14] found that choosing the bins in a non-uniform way results in even better load balancing than choosing them uniformly. Mitzenmacher[3] demonstrated the importance of using randomness to break symmetry: for systems relying on imperfect information to make scheduling decisions, "deterministic" heuristics that choose the least loaded server do poorly and significantly hurt performance (even with continuously updated but stale load information) due to herd behavior, whereas a small amount of randomness (e.g. a "d-random" strategy) suffices to break the symmetry and gives performance comparable to that achievable with perfect information.
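The contrast between one random choice and d random choices is easy to reproduce empirically. The sketch below is our own illustration, not code from [13] or [14]; n, d and the seed are arbitrary parameters.

import random

def max_load(n, d, seed=0):
    """Throw n balls into n bins; each ball samples d candidate bins uniformly
    at random and goes to the least full one. d=1 is the classic single-choice
    process."""
    rng = random.Random(seed)
    bins = [0] * n
    for _ in range(n):
        candidates = [rng.randrange(n) for _ in range(d)]
        target = min(candidates, key=lambda i: bins[i])
        bins[target] += 1
    return max(bins)

n = 100_000
print("single choice (d=1), max load:", max_load(n, 1))   # grows like ln n / ln ln n
print("two choices   (d=2), max load:", max_load(n, 2))   # grows like ln ln n / ln 2

For large n the reported maximum load is noticeably smaller for d = 2 than for a single choice, matching the exponential gap stated above.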
This multiple-choice idea, that is, the concept of first selecting a small subset of alternatives at random and then placing the ball into one of these alternatives, has several applications in randomized load balancing, data allocation, hashing, routing, PRAM simulation, and so on[15]. For example, it has recently been employed for DHT load balancing[16].
3.2. How to employ this model
Recall that herd behavior is caused by the same decisions simultaneously aggregating on a system. It is imperfect and stale information that leads multiple individuals to take the same action. With perfect and up-to-date information (the current state of resources and all decisions made by others are known), each scheduling instance can properly use greedy algorithms, making appropriate decisions that maximize system performance, and herding will not take place. In a grid environment, however, the information available for scheduling is inevitably imperfect and stale. In this case, stochastic algorithms can be used instead of greedy algorithms to keep individuals from taking the same actions. In this paper, we introduce a stochastic algorithm based on the balls and bins model into grid scheduling. However, grid scheduling cannot immediately benefit from the model, because the model has very strong constraints. Therefore, several techniques will be proposed to overcome these constraints.
When we use such an algorithm for load balancing in grid scheduling systems, balls represent jobs or tasks, bins represent machines or servers, and the maximum load corresponds to the maximum number of jobs assigned to a single machine. However, there are three essential problems we must tackle before employing this model.
Organization of grid resources. In the original theory, it is assumed that all n bins are known and that we can find any bin when placing a ball. In grid environments, however, we must realize this assumption by organizing all grid resources in resource management.
Constraints imposed by jobs on resources. In the original theory, there is no constraint on the choice of the d bins; that is, a ball can be placed into any one of the n bins if we do not care about load balance. In grid scheduling applications, however, some jobs have resource requirements (e.g.
CPU clock rate, MEM size, etc.), which limits the scope of the d choices. Hence, we should make appropriate resource choices for each job.
Probability of choosing a resource. In the original theory, all bins are homogeneous and each is chosen with the same probability (1/n); it does not matter which bin bears the maximum load. In grid scheduling applications, however, resources are drastically heterogeneous and their capabilities (e.g. CPU, MEM) differ greatly from one another. Therefore, we must carefully consider how to adjust the probability of choosing a resource, because it now does matter which resource is allocated the maximum load.
We bridge these gaps in the following subsections.
3.2.1. Organization of grid resources. There are three typical ways of organizing resources, and each can be found in an existing project.
Flat organization: the flock of Condor. As stated in Section 2, sharing resources between Condor pools requires connecting their Gateways to each other. All of these Gateways sit in the same plane, each connecting directly to the others when flocking. The disadvantage of flat organization is obvious: for global resource sharing, the Gateways form a fully connected network, which might be unsuitable for a large-scale environment such as a grid because it overburdens the Gateways; otherwise, sharing a few rare resources among all users is inefficient.
Hierarchical organization: MDS[17] of Globus[18]. In MDS, aggregate directories provide often specialized, VO-specific views of federated resources, services, and so on. Like a Condor pool, each aggregate directory defines a scope within which search operations take place, allowing users and other services to perform efficient discovery without scaling to large numbers of distributed information providers within a VO. Unlike the Condor flocking mechanism, MDS builds an aggregate directory hierarchically, as a directory of directories. One disadvantage of MDS is that each directory must be maintained by an organization. Another is that as more and more directories are added (we envision a true root directory consisting of a very large number of directories) and many searches are directed to the root directory, concerns about performance bottlenecks and a single point of failure arise.
P2P-structured organization: SWORD[19] of PlanetLab[20]. SWORD is a scalable resource discovery service for wide-area distributed systems.
The implementations of SWORD use the Bamboo DHT[21], but the approach generalizes to any DHT because only the key-based routing functionality of the DHT is used. One disadvantage of this decentralized, DHT-based resource discovery infrastructure is its relatively poor performance when collecting information, compared to a centralized architecture. All three methods above have their advantages as well as disadvantages. Nevertheless, we adopt DHT-based resource management for the following reasons:
(1) The advantages of using DHT. A DHT inherits the quick-search and load-balancing merits of hash tables, and its key properties include self-organization, decentralization, routing efficiency, robustness, and scalability. Moreover, as a middleware with simple APIs, DHT makes large-scale application development simple and easy. Besides, applications built on a DHT automatically inherit the properties of the underlying DHT.
(2) Small overhead of the d-random strategy. As mentioned above, the disadvantage of using a DHT is its relatively poor performance when collecting information, compared to a centralized architecture. However, our d-choice strategy submits only d queries to the DHT, each taking O(log n) time. Hence, our query complexity is only
O(d log n), far less than collecting all the information, which requires traversing the whole DHT and takes O(n) time. For example, with n = 1000 resources and d = 2, a schedule requires only about 2 × log2(1000) ≈ 20 routing hops, rather than on the order of 1000 messages. Given the above, the powerful theoretical results of the d-choice strategy, combined with the quick search of DHT, can produce a practical scheduling system that provides good performance even with imperfect information while incurring small overhead. In this paper, we use Chord[22] to organize resources. Note that other DHTs could also be employed, and advances in DHT research can benefit our scheduling system as well. The application using Chord is responsible for providing desired authentication, caching, replication, and user-friendly naming of data. Chord's flat key space eases the implementation of these features. Chord provides support for just one operation: given a key, it maps the key onto a node. Depending on the application using Chord, that node might be responsible for storing a value associated with the key. Data location can easily be implemented on top of Chord by associating a key with each data item and
storing the key/data-item pair at the node to which the key maps. As SWORD of PlanetLab has successfully designed and implemented a DHT-based resource discovery system that can answer multi-attribute range queries, we simply follow its key-mapping design in this part. In this way, our approach can be employed efficiently on top of SWORD. For simplicity, we use multiple Chord instances, one per resource attribute. For example, if a resource defines two attributes (e.g. CPU and MEM), we create two Chord rings, each corresponding to one attribute. Chord places no constraint on the structure of the keys it looks up: the Chord key space is flat. This gives applications a large amount of flexibility in how they map their own names to Chord keys. Fig. 2 shows how we map a measurement of one attribute to a DHT key. The number of value bits is allocated according to the range of the attribute's values. The random bits spread measurements with the same value among multiple DHT nodes, to load-balance attributes that take on a small number of values. In this way, we obtain an ordered DHT ring.
Fig. 2. Mapping a measured value to a DHT key (the value bits followed by the random bits, e.g. Key = 0x563A81D9).
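A minimal sketch of this key construction, inferred from the example keys in Fig. 3 (where, e.g., CPU = 86 MFLOPS maps to 0x563A81D9, i.e. the measured value occupies the top 8 bits and the remaining 24 bits are random); the 32-bit key length and the exact bit widths for other attributes are our assumptions, not parameters given in the paper.

import random

VALUE_BITS = 8                     # inferred from the Fig. 3 example keys
RANDOM_BITS = 32 - VALUE_BITS      # 32-bit keys assumed for the example

def attribute_key(value, rng=random):
    """Map a measured attribute value (e.g. CPU in MFLOPS) to a DHT key:
    the value occupies the high-order bits, so keys stay ordered by value,
    while the trailing random bits spread equal measurements over several
    DHT nodes."""
    capped = min(value, (1 << VALUE_BITS) - 1)
    return (capped << RANDOM_BITS) | rng.getrandbits(RANDOM_BITS)

print(hex(attribute_key(86)))      # e.g. 0x56......, as for CPU = 86 in Fig. 3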
Fig. 3 shows an example of how resources are organized for one attribute. When a resource joins the DHT, it is assigned a random node key and holds the key space between its predecessor's node key and its own. It is responsible for storing the data whose keys fall into its key space, answering queries whose keys fall into this space, and forwarding to other nodes any query whose key lies outside this space.
Fig. 3. Resource organization for the CPU attribute. The example ring contains four resources, each acting as a DHT node:
Resource   Node Key      CPU (MFLOPS)   CPU_Key       Resources stored at this node
A          0x16F92E11    86             0x563A81D9    C
B          0x990663CB    120            0x780209E1    A, B, D
C          0x5517A339    220            0xDC937C5A    (none)
D          0xD5827FA3    135            0x87103FF2    (none)
3.2.2. Constraints imposed by jobs on resources. Next, we focus on how to choose d resources that satisfy the job requirements. We take Fig. 3 as an example. Suppose that a job requires a resource with CPU ≥ 100 (MFLOPS) and MEM ≥ 700 (MB) to run. We query resources on the DHT instance corresponding to the CPU attribute. Note that it is a big challenge to perform this query on multiple DHT instances
simultaneously, because the intersection of the query results may well be an empty set. We first generate d queries, each with a random CPU_Query_Key in the interval [0x64000000, 0xFFFFFFFF]. All d queries are then launched on the corresponding DHT instance. With the help of the Chord routing algorithms, each query reaches the DHT node whose key space contains its CPU_Query_Key. That node checks its stored data for a valid resource satisfying the job requirements. If one exists, it sends the valid resource back to the querying node for further use; if not, it alters the CPU_Query_Key to fit the key space of its successor and forwards the query to the successor. This process iterates until a valid resource is returned or the query reaches the first node in the DHT key space (the one with the minimum node key), for example, Node A in Fig. 3. Therefore, all resources returned by the d queries, if not nil, satisfy the job requirements.
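This query procedure can be sketched with a purely local model of the Chord ring built from the Fig. 3 example. The ring here is an in-memory sorted list rather than a real DHT, the "one full lap" stopping rule stands in for reaching the first node, and the helper names are ours.

import bisect
import random

# Resources from the Fig. 3 example, keyed by name.
RESOURCES = {"A": {"cpu": 86}, "B": {"cpu": 120}, "C": {"cpu": 220}, "D": {"cpu": 135}}

# Local stand-in for the Chord ring of the CPU attribute: each entry is
# (node_key, names of resources whose CPU_Key falls in this node's key space).
RING = sorted([
    (0x16F92E11, ["C"]),
    (0x5517A339, []),
    (0x990663CB, ["A", "B", "D"]),
    (0xD5827FA3, []),
])

def d_constrained_queries(ring, min_key, max_key, satisfies, d=2, rng=random):
    """Issue d queries with random keys in [min_key, max_key]; walk each query
    along successor nodes until a satisfying resource is found or the ring has
    been traversed once."""
    node_keys = [k for k, _ in ring]
    found = []
    for _ in range(d):
        query_key = rng.randint(min_key, max_key)
        idx = bisect.bisect_left(node_keys, query_key) % len(ring)   # successor node
        for step in range(len(ring)):
            _, stored = ring[(idx + step) % len(ring)]
            matches = [name for name in stored if satisfies(RESOURCES[name])]
            if matches:
                found.append(rng.choice(matches))   # stand-in for the competition step
                break
    return found

# Job requirement from the text: CPU >= 100 MFLOPS (query keys start at 0x64000000).
print(d_constrained_queries(RING, 0x64000000, 0xFFFFFFFF, lambda r: r["cpu"] >= 100, d=2))

Running this only ever returns B, C or D, never the under-provisioned Resource A, which is the property the key-range constraint is meant to guarantee.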
3.2.3. Probability of choosing a resource. From the load-balancing point of view, the number of jobs assigned to a resource should be proportional to its capability when the system is moderately or heavily loaded. When the system is lightly loaded, however, users favor fast machines for their jobs. To achieve both goals, intuitively, we should appropriately increase the probability of choosing high-capability resources. There are two existing techniques for this purpose. The first is the virtual server used for DHT load balancing[23], and the second is data replication. In [23], each host acts as one or several virtual servers, each behaving as an independent peer in the DHT. The host's load is thus determined by the sum of its virtual servers' loads. By adjusting the number of working virtual servers (for example, making the number of virtual servers of a host proportional to its capacity), we can balance the DHT load among all hosts. From the viewpoint of our algorithm, however, one disadvantage is that it requires the DHT middleware itself to support this technique. Some DHT middleware can do so, but certainly not all. To make our system run over different platforms with flexible parameter settings, we also adopt an application-layer method. Data replication has been used extensively in wide-area distributed systems (content distribution networks). It aims at reducing the access to hot-spots and improving access QoS. Here, we instead use data replication to appropriately increase the access to high-capability resources. To preserve the constraints of jobs on resources, we place replicas of a resource on DHT nodes whose node keys are not greater than the maximum key corresponding to that resource's capability. Taking Fig. 3 as an example, we create a replica for Resource C. The CPU_Key of this replica should lie in the interval [0x00000000, 0xDCFFFFFF], and hence the replica is stored on a DHT node whose node key is not greater than 0xDCFFFFFF. Therefore, any query reaching this DHT node can possibly obtain Resource C, because with high probability the CPU capability of Resource C is greater than that required by the query.
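A minimal sketch of this capability-capped replica placement, reusing the 8/24-bit key split inferred earlier for the Fig. 2 sketch; the helper name and the uniform draw over the allowed interval are our assumptions.

import random

RANDOM_BITS = 24   # assumed width of the trailing random bits (see the Fig. 2 sketch)

def replica_keys(resource_attr_key, count, rng=random):
    """Generate `count` replica keys for a resource whose own attribute key is
    `resource_attr_key` (e.g. 0xDC937C5A for Resource C's CPU in Fig. 3).

    Each replica key is drawn uniformly from [0x00000000, cap_key], where
    cap_key keeps the resource's value bits and saturates the random bits,
    so a replica never lands in a key range that advertises more capability
    than the resource actually has."""
    cap_key = resource_attr_key | ((1 << RANDOM_BITS) - 1)   # e.g. 0xDCFFFFFF
    return [rng.randint(0, cap_key) for _ in range(count)]

# Three replica keys for Resource C (CPU_Key = 0xDC937C5A):
print([hex(k) for k in replica_keys(0xDC937C5A, 3)])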
The number of replicas of a resource, as well as the distribution of their keys, affects the probability of choosing that resource. As the number of replicas increases, the possibility of herd behavior increases too. How to determine and adjust the number of replicas and the distribution of their keys remains an open issue in this scheduling application. In the discussion above, several techniques have been introduced to prevent herd behavior in grid scheduling. Firstly, the balls and bins model is a powerful tool for this problem. Secondly, we employ DHT and data replication to bridge the gap between the homogeneity of the model and the heterogeneity of the grid environment. We believe that the d-random strategy, combined with the quick search of DHT, can provide a good system for grid job scheduling.
4. Our implementation
In this section, we describe our system architecture and the related algorithms. Each resource acts as a scheduling instance and independently performs job scheduling. Fig. 4 shows the modules in a DHT node.
Fig. 4. Modules in a DHT node.
When a resource joins the system, it first joins the Chord DHT; the algorithm for joining Chord can be found in [22]. It then places a certain number of resource replicas on the DHT and updates these replicas periodically to keep them alive, using the algorithm shown in Fig. 5. The resource update module in Fig. 4 is responsible for collecting the local resource's information and for generating and updating its replicas. The resource index module uses a list to store the replicas that other resources place on the DHT.
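The periodic keep-alive behaviour of the resource update module can be sketched as a simple soft-state refresh loop. The dht.put call, the measure() helper and the refresh interval below are illustrative assumptions, not an API defined by the paper.

import threading

class ResourceUpdater:
    """Periodically re-publishes a node's resource replicas so that replicas
    of a departed resource eventually stop being refreshed on the DHT."""

    def __init__(self, dht, local_resource, replica_keys, interval_s=60):
        self.dht = dht                      # assumed to expose put(key, value)
        self.local_resource = local_resource
        self.replica_keys = replica_keys
        self.interval_s = interval_s
        self._stop = threading.Event()

    def _run(self):
        while not self._stop.wait(self.interval_s):
            info = self.local_resource.measure()     # current CPU, MEM, load, ...
            for key in self.replica_keys:
                self.dht.put(key, info)              # refresh the soft-state replica

    def start(self):
        threading.Thread(target=self._run, daemon=True).start()

    def stop(self):
        self._stop.set()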
The scheduling process includes the following six steps:
STEP 1: The job proxy module receives a job from a user, extracts the job's running requirements and performance model, and requests scheduling with them.
STEP 2: The scheduling module generates d queries according to the job requirements and places these queries on the DHT. The job requirements as well as the performance model are included in each of the d queries. Here, d is a system parameter that is predefined or dynamically adjusted according to the system utilization rate. The d query keys must be no less than a certain minimum value to preserve the job constraints, as stated in Section 3.2.2, and may follow a uniform or non-uniform, independent or dependent distribution between this minimum value and the maximum key of the DHT key space.
STEP 3: With the help of the Chord routing algorithms, the queries reach their respective destination DHT nodes. At the destination node, the query is processed by the resource index module.
STEP 4: The resource index module checks its stored resource list for a resource satisfying the job's running requirements. If there is none, the query is forwarded to the successor node (until it reaches the first node of the DHT key space, as stated in Section 3.2.2) and the process returns to STEP 3. If multiple resources satisfy the requirements, a competition algorithm is performed to return one of the valid resources. The valid resource is then returned to the source node of the query.
STEP 5: When a query result returns, one more hop can be used to collect up-to-date information about the resource indicated in the returned query result.
The replica creation algorithm (Fig. 5) is as follows:
FUNCTION Replica Creation Algorithm (Input Parameter: r)
// Number of replicas = Capability / r, except for r = 0
BEGIN
    IF (r == 0) THEN RETURN;
    Local_Replica_List.set_nil();
    rep_num = local_resource.capability DIV r;
    remainder = local_resource.capability MOD r;
    IF (RANDOM(r)
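The Fig. 5 listing is cut off above at the RANDOM(r) test. The Python sketch below is our hedged reconstruction of what the visible fragment suggests: the replica count is capability DIV r, and the remainder is rounded up probabilistically so that a resource of capability c receives c / r replicas on average. The continuation after "IF (RANDOM(r)" is our assumption, not text from the paper.

import random

def create_replicas(capability, r, rng=random):
    """Return the number of replicas for a resource of the given capability.

    Mirrors the visible fragment of Fig. 5: rep_num = capability DIV r,
    remainder = capability MOD r, and (our assumption for the truncated
    branch) one extra replica is added with probability remainder / r."""
    if r == 0:
        return 0
    rep_num, remainder = divmod(capability, r)
    if rng.randrange(r) < remainder:      # probabilistic rounding of the remainder
        rep_num += 1
    return rep_num

# e.g. with r = 50, a 220 MFLOPS resource gets 4 or 5 replicas (4.4 on average)
print(create_replicas(220, 50))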