Hierarchical Scheduling of Independent Tasks with Shared Files

Hermes Senger, Fabrício A. B. Silva, Waneron M. Nascimento
Universidade Católica de Santos (UniSantos)
R. Dr. Carvalho de Mendonça, 144, Santos, SP, Brazil, 11070-906
Email: {senger,fabricio}@unisantos.br

Abstract— Parallel computing platforms such as grids, clusters, and multi-clusters are promising alternatives for executing applications composed of a large number of independent tasks. However, some application and architectural characteristics may severely limit performance gains. For instance, tasks with fine granularity, huge data files to be transmitted to or from data repositories, and tasks that share common input files are examples of characteristics that may cause poor performance. Bottlenecks may also appear due to the existence of a centralized controller in the master-slave architecture, or of centralized data repositories within the system. This paper shows how system efficiency decreases under such conditions. To overcome these limitations, a hierarchical strategy for file distribution, which aims at improving the system's capacity to deliver input files to processing nodes, is proposed and assessed. The strategy arranges the processors in a tree topology, clusters together tasks that share common input files, and maps such groups of tasks to clusters of processors. By means of this strategy, significant improvements in application scalability can be achieved.

I. INTRODUCTION

Because of their computational power, clusters, multi-clusters, and grid platforms are suitable for executing high performance applications composed of a great number of tasks, often ranging from hundreds to several thousands. This paper is dedicated to a class of applications studied in [2], [3], [7], [8], [12], which can be decomposed into a set of independent tasks and executed in any order. In this class, tasks do not communicate with each other and depend only upon one or more input data files to be executed. The output produced is also one or more files. This application class is referred to in the literature as parameter-sweep [3] or bag-of-tasks [4], and typical examples include applications that can be structured as a set of tasks that perform independent experiments with different parameters. Other typical examples include several applications that involve data mining, image manipulation, Monte Carlo simulations, and massive searches. They are frequent in fields such as astronomy, high-energy physics, bioinformatics, and many others. In this paper, we focus on a specific class of applications composed of a large number of independent tasks that may share input files. By large we mean a number at least one order of magnitude greater than the number of available processors. Such applications were studied in [2], [3], [7], [8], being the typical case for grid-based environments such as the AppLeS

Parameter Sweep Template [1] (for a detailed description of some real-world applications see [2], [3]). According to our previous experience [16]–[18], some data mining applications clearly fit this model, in particular those which follow the Independent Parallelism model, as mentioned in [20]. Furthermore, many other science and engineering computational applications are potential candidates. Many challenges arise when scheduling a large number of tasks in large clusters or computational grids. Typically, there is no guarantee regarding availability levels and quality of service, as the involved computational resources may be heterogeneous, non-dedicated, geographically distributed, and owned by multiple institutions. Furthermore, data grids that harness distributed resources such as computers and data repositories need support for moving high volume data files in an efficient manner. In [15], Ranganathan and Foster propose and assess some independent policies for assigning jobs to processors and moving data files among the sites of a grid environment, showing the importance of considering file locality when scheduling tasks. Scheduling independent tasks (with no file sharing) on heterogeneous processors was studied by Maheswaran et al. [12]. In that paper, the authors propose and assess three scheduling heuristics, named min-min, max-min, and sufferage. However, many problems may arise when the application tasks share files which are stored in a single repository and have to be transmitted to the processing nodes on which the tasks will be executed. In [2], [3], Casanova et al. adapted these heuristics to schedule independent tasks with an additional constraint: the possibility of sharing common input files. As the number of tasks to schedule in such applications is typically large, scheduling heuristics must have low computational costs. With this purpose, Giersch, Robert, and Vivien [6], [8] propose and assess some scheduling heuristics that produce schedules of similar quality to those proposed by Casanova et al. in [2], [3], while keeping the computational costs one order of magnitude lower. In a further work, Giersch, Robert, and Vivien [7] extend their work on heuristics and establish theoretical limits for the problem of scheduling independent tasks with shared files, for all possible locations of data repositories (e.g., centralized, decentralized). This paper addresses the problem of scheduling independent


Fig. 1. The shared files model: a bipartite graph relating input files F1, F2, ..., Fm to tasks T1, T2, ..., Tn.

tasks with shared files which are stored in a centralized repository. The focus of this paper is neither on proposing a new heuristic nor on stressing heuristic approaches. Instead, we present some scalability limits that are intrinsic to the execution of such applications on a master-slave architecture, and propose a strategy that can significantly improve application scalability by reducing the bottleneck at the master computer and multiplying the capacity of the grid to distribute data files. The proposed strategy implements a hierarchical scheme that integrates task scheduling with file distribution. The strategy is carried out in three steps: it starts by clustering tasks that share common files into jobs; it organizes the available processors to form a hierarchy; and then it maps groups of tasks onto the processor hierarchy. Experimental results obtained by means of simulation suggest that application scalability can be improved by one order of magnitude. The remainder of this paper is organized as follows. Section II presents a model that describes the architecture and application features. Section III describes the problem and illustrates its scalability limits. Section IV evaluates the potential benefits of using hierarchical scheduling. A hierarchical strategy is proposed in section V and assessed by means of simulation in section VI. The final considerations are presented in section VII.

II. A SYSTEM MODEL

The motivating architecture for this work is composed of one master computer and a set of S slaves distributed among C clusters. The master is responsible for controlling the slave computers, and is usually implemented by the user's machine from which the application is launched. An application consists of a set of T independent tasks. By independent we mean that no task depends upon the execution of any other task, and there is no communication among tasks. There is a set of F data files which are initially stored in the master computer and must be transmitted to the slave computers to serve as input for the application tasks. Every task requires one or more input files, and produces only one output file. Each file provides input data for at least one task and, not rarely, for a very large number of tasks of the same application. Such a relationship can be represented

as a bipartite graph, as depicted in Fig. 1. For instance, this example shows a task T1 that depends upon files F1 and F2. The master communicates with the slave processors in order to transmit input files sequentially (i.e., with no concurrent transmissions) by means of a dedicated full-duplex link. In this model, we consider that file sizes and task execution times are known a priori.

A. The Motivating Application

This model is motivated by previous works that involve the execution of computationally expensive machine learning algorithms for data mining in computational grids [18] and clusters [16]. For instance, the project mentioned in [16] involves the adaptation of Weka [21], a tool that is widely used among data mining researchers and practitioners, to execute on clusters of PCs. A specific class of data mining algorithms is the classification algorithms [5], which analyze the characteristics of a dataset from a specific domain and try to produce models that characterize a set of examples well. For instance, a common application consists in evaluating "which is the best classification algorithm from a list, for a given dataset", i.e., which algorithm can produce the model that best represents the characteristics of a given dataset. The tenfold cross validation procedure [5] is a standard way of predicting the error rate of a classifier given a single, fixed sample of data. In the tenfold cross validation process, the dataset is divided into ten equal parts (folds) and the classifier is trained on nine folds and tested on the remaining one. This procedure is repeated for ten different training sets, and the estimated error rate is the average over the test sets. If one would like to evaluate a list of classifier algorithms, say 30 algorithms, by means of tenfold cross validation, 300 tasks could be created. Alternatively, a more accurate method is N-fold cross validation [7]. In this method, a dataset with N instances (items) is divided into N folds (each one containing a single item), then the algorithm is trained with N − 1 folds and tested on the remaining one. The error rate for a given algorithm is computed as the mean over the N validation tasks. Thus, a dataset with 5,000 items to be tested with the same 30 algorithms will produce 150,000 tasks, all of them using the same input file.

III. DISTRIBUTING SHARED FILES

In order to illustrate scalability problems in the master-slave architecture, this section shows some results of a real application carried out in the Unisantos laboratory, which involves the execution of the Cluster Genetic Algorithm (CGA) [10]. The goal of the CGA is to identify a finite set of categories (clusters) to describe a given data set, maximizing both homogeneity within each cluster and heterogeneity among different clusters. Thus, objects that belong to the same cluster should be more similar to each other than objects that belong to different clusters. In these experiments, we used a dataset that is a benchmark for data mining applications (the Congressional Voting Records), available at the UCI Machine Learning Repository [13]. This dataset contains 435 instances (267 democrats, 168 republicans) and 16 Boolean valued


Fig. 2. The makespan for real experiments on a dedicated cluster (execution time versus number of processors, Workqueue on 2 to 20 dedicated machines).

attributes that represent key votes of each of the U.S. House of Representatives Congressmen. Also, 203 instances present missing values, which have been removed for our experiments. For these experiments we adopted the MyGrid [4] platform, a lightweight and easy to use grid middleware intended to execute applications composed of independent tasks. MyGrid implements the Workqueue [4] scheduling algorithm. Initially, a set of tasks was executed sequentially on a single machine with a Pentium IV (1.8 GHz) processor and 1 GB of main memory. Then, the same set of tasks was run on 2, 4, 8, 12, 16, and 20 dedicated machines with similar hardware characteristics located in the laboratory at Unisantos, interconnected by a 100 Mbps Ethernet LAN. As one can note from Fig. 2, no significant reduction in the execution time can be achieved with more than 9 slave processors. Beyond this limit, no performance gains can be realized by adding slave computers. Such behavior is typical of applications that cause high data transfer rates and high resource consumption at the master processor, as we explain in the following. The CGA application on the Congressional dataset is such an example. Additional information about these experiments as well as such scalability problems can be found in [17], [18]. Initially, the dataset is stored in some repository that is accessible to the user's machine, which plays the role of master processor. The master is responsible for controlling the execution of the application tasks as well as transferring input and output files to and from the grid machines. In such a master-slave platform, the more slave machines are added to the system, the greater is the demand for data transfers imposed on the master computer. If the number of slaves exceeds the master's capacity of delivering input files, the addition of new slave processors to the system will force some of them to stay idle while waiting for file transfers. Such a situation is aggravated in the presence of fine grain application tasks, i.e., tasks with a low computation per communication ratio. In order to manage the execution of remote tasks, the master usually spawns a small number of local processes that transmit input files to slave processors and receive output files which are sent back with results. Such processes are eliminated

Fig. 3. Simulated makespan for tasks with different granularities running on a master-slave platform (makespan versus number of processors, 1 to 1,000, log-log scale).

after their corresponding tasks are completed. Thus, short application tasks raise the rate of creation and destruction of control processes, degrading performance. Also, the number of control processes executing concurrently at the master node is proportional to the number of slaves under its control. Thus, the consumption of resources at the master processor (e.g., memory, I/O subsystem, network bandwidth, CPU cycles) tends to be proportional to the number of slaves with which it directly interacts for performing control and data transfer operations.

A. Granularity and Scalability in Master-Slave Platforms

Performance limitations of the master-slave architecture may appear with different numbers of processors, depending on some application characteristics. For instance, the execution times of the application tasks and the transmission times of their input files determine the maximum number of processors that can be added to the system without losing performance. Under some assumptions, this effective number of slaves can be estimated as follows. Let RT be the mean response time for the execution of tasks, which is given by the sum of their mean execution time (ET), the mean time required to transmit input files (TTIF), and the mean time required to transmit output files (TTOF):

RT = TTIF + ET + TTOF.    (1)

Also, suppose that files are sequentially transmitted to slave processors, i.e., the master does not perform concurrent transmissions. In such a scenario, the maximum number of slave processors that can be used without loss of performance, S_eff, can be estimated as

S_eff = RT / (TTIF + TTOF).    (2)

Obviously, this number depends on the granularity of the application tasks, since it reflects the ratio between the times involved in the execution of tasks and in the transmission of files. In order to illustrate the influence of granularity (here expressed as the computation time per transmission time ratio) on the effective number of slaves, some simulation experiments


were carried out using SimGrid [11]. For the experiments, we consider a dedicated cluster composed of the master node and a set of slave nodes. We assume a shared link and that the master is capable of handling only one transmission at any given time (no concurrent transmissions). The link is assumed to be full-duplex, so that the master node can send an input file to one processor and receive the output file from another processor concurrently. According to Yang and Casanova [22], this corresponds to the one-port model, which is suitable for simulating LAN network connections. For these experiments, different granularities were produced by varying the time to transmit files, while all other parameters (e.g., number of tasks, their execution times, total amount of work of the application) were kept fixed. In this example, a granularity equal to 2 means that the execution of a task takes twice as long as the transmission of its input files. The experiment involves 32,000 tasks which take 128 time units to execute, and the bandwidth of the communication links is 1,000,000 bytes per second. File sizes started at 2,000,000 bytes, doubling up to 64,000,000 bytes, producing tasks whose computation per communication ratio varies from 2 to 64. The metric adopted for this experiment is the makespan [14], which is the time taken to execute all the application tasks as well as the involved transmissions of input and output files. As shown in Fig. 3, the reduction of the makespan is strongly dependent on the granularity of the application tasks. These results show that the coarser the granularity of the application tasks, the more processors can be effectively added to the system, leading to a reduction in execution times. Although this problem is critical for fine granularity applications, it is worth noting that it may also occur in the presence of coarse grain applications, depending upon the application characteristics. Our work aims at extending such scalability limits, i.e., improving the effective number of processors that can be added to a grid system for a given application, so that its overall execution time can be reduced.
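To make the one-port model and equation (2) concrete, the following sketch (in Python, with hypothetical names; the paper's experiments used SimGrid, not this code) estimates the effective number of slaves and computes the makespan of a simple greedy one-port schedule. It uses the experiment's task length of 128 time units but a smaller task count (3,200 instead of 32,000) so that the example runs quickly; output transfers are neglected, which is an assumption of this sketch.

```python
def effective_slaves(et, ttif, ttof=0.0):
    """Equation (2): S_eff = RT / (TTIF + TTOF), where RT = TTIF + ET + TTOF."""
    return (ttif + et + ttof) / (ttif + ttof)


def makespan_one_port(num_slaves, num_tasks, et, ttif):
    """Greedy one-port schedule: the master sends one input file at a time,
    each slave computes one task at a time, and output transfers are
    neglected (an assumption of this sketch)."""
    free_at = [0.0] * num_slaves   # time at which each slave becomes idle
    link_free_at = 0.0             # time at which the master's link is free
    makespan = 0.0
    for _ in range(num_tasks):
        s = min(range(num_slaves), key=lambda i: free_at[i])
        start_tx = max(link_free_at, free_at[s])
        link_free_at = start_tx + ttif   # the input file occupies the link
        free_at[s] = link_free_at + et   # the slave computes after receiving
        makespan = max(makespan, free_at[s])
    return makespan


# Granularity 2 (ttif = 64) saturates at about 3 slaves, while granularity 64
# (ttif = 2) keeps roughly 65 slaves busy, as equation (2) predicts.
for ttif in (64, 2):
    print(ttif,
          effective_slaves(et=128, ttif=ttif),
          makespan_one_port(num_slaves=8, num_tasks=3200, et=128, ttif=ttif),
          makespan_one_port(num_slaves=64, num_tasks=3200, et=128, ttif=ttif))
```

With the coarse granularity, the makespan keeps shrinking as slaves are added, whereas with the fine granularity the 8-slave and 64-slave runs produce essentially the same makespan, mirroring the behavior reported in Fig. 3.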


B. Grouping Tasks to Improve Scalability

As mentioned before, the main cause of poor scalability is the bottleneck at the master computer. Despite the severity of this limitation, only a small number of papers are devoted to the problem. In [17], [18], Silva et al. propose a scheduling algorithm that organizes a given application into groups of tasks that share the same input files, thus minimizing file transfers. Because our proposal also adopts the grouping technique, this section presents details about the grouping of tasks that share input files. At scheduling time, tasks are clustered to form jobs, so that only one job is created per machine. The number of file transfers is minimized because the tasks that comprise a given job share the same input files, thus reducing the number of file transmissions to be delivered by the master computer. The principle behind this technique is the improvement of the computation per communication ratio, i.e., of the granularity of the work units, since a job is composed of a group of tasks.
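A minimal sketch of the grouping idea follows (Python, with hypothetical names and a simplified packing rule; the heuristic of [17], [18] may distribute tasks among jobs differently): tasks that share an input file are clustered, and the clusters are packed into one job per slave machine, so that each input file has to be transmitted to only one slave.

```python
from collections import defaultdict

def group_tasks_into_jobs(tasks, num_slaves):
    """Cluster tasks that share the same input file and pack the clusters
    into one job per slave machine.

    tasks: iterable of (task_id, input_file) pairs.
    Returns a list of num_slaves jobs, each holding the input files it needs
    and the tasks it contains."""
    # Step 1: cluster tasks by their shared input file.
    by_file = defaultdict(list)
    for task_id, input_file in tasks:
        by_file[input_file].append(task_id)

    # Step 2: greedily place each file cluster in the currently smallest job,
    # so that a given input file is sent to a single slave only.
    jobs = [{"files": set(), "tasks": []} for _ in range(num_slaves)]
    for input_file, cluster in sorted(by_file.items(), key=lambda kv: -len(kv[1])):
        job = min(jobs, key=lambda j: len(j["tasks"]))
        job["files"].add(input_file)
        job["tasks"].extend(cluster)
    return jobs

# Toy usage: 12 tasks over 4 shared input files, packed for 3 slave machines.
tasks = [(i, "file_%d" % (i % 4)) for i in range(12)]
for k, job in enumerate(group_tasks_into_jobs(tasks, 3)):
    print(k, sorted(job["files"]), job["tasks"])
```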

Fig. 4. Makespan for simulated experiments using pure Workqueue (WQ) and Workqueue with Grouping (WQ+G), for granularities of 2, 16, and 128 (makespan versus number of processors, 1 to 10,000, log-log scale).

To illustrate the scalability gains obtained with the grouping technique, we simulated a dedicated cluster composed of the master and a set of slave processors. The application involves 32,000 fine grain tasks which take 128 time units to be executed by homogeneous slaves. We assume that (similarly to our motivating application) each task requires one input file and produces one output file, both transmitted over the communication link. Such a situation is often verified for many other real applications mentioned in section I. For the sake of simplicity, we also consider that each task produces one output file whose size is at least one order of magnitude smaller than the input file, so that its transmission time is negligible.¹ Fig. 4 shows the simulation results as the total execution times for an application executing on grids scaling from 1 to 1,000 slave nodes. Our simulation evaluates the Workqueue and Workqueue with Grouping algorithms. In summary, the Workqueue algorithm puts all tasks in a queue and maps each of them to an idle processor. Whenever an idle processor is detected, it is assigned the next task in the queue. When a task finishes, it is removed from the queue. The algorithm proceeds while there remain tasks to be completed. The Workqueue originally does not consider file sharing, and transmits input files to a slave processor every time a task is scheduled to it. The Workqueue with Grouping extends the original algorithm by grouping tasks that share input files to compose one job. The latter algorithm creates only one job for each slave processor, so that each input file is sent only once to each slave. As suggested by the experiments shown in Fig. 4, there are scalability limits for both scheduling algorithms. The pure Workqueue performs very poorly, scaling up to only 9 slave processors. Beyond this number of processors, no reduction in the execution time can be achieved. As grouping tasks raises this limit to a few hundred slave nodes, this technique will be applied in our hierarchical strategy. Since each input file is transmitted only once to each slave processor that executes tasks which need

¹ Such an assumption is commonly adopted for simulation models found in the literature [7], [15], and reflects the characteristics of our motivating application mentioned in section II-A, which may use very large input files and produce output files of a few hundred bytes.


Fig. 5. The master-supervisor-slave architecture (master M, supervisors, and slaves S1, S2, ..., Sm connected by links).

it, the maximum number of slaves that can be effectively added to the system without loss of performance may be estimated for schedules that adopt task grouping. Under the same assumptions considered for equation 2, the expected response time for a job (RT_job) can be computed as

RT_job = TTIF + NT_job · (ET + TTOF),    (3)

where NT_job is the number of tasks grouped to form one job, TTIF is the time required to transmit the input file (which is the same for all tasks in this job), and TTOF is the transmission time of the output file produced by each task in the job. Thus, the number of slaves that can be effectively controlled by the master node is

S'_eff = RT_job / (TTIF + NT_job · TTOF).    (4)

As illustrated in Fig. 4, there are scalability limits for both scheduling algorithms. The pure Workqueue performs very poorly, while task clustering performs slightly better. Thus, although task clustering has been shown to reduce the minimum makespan by about one order of magnitude, a performance limit can still be observed because of the bottleneck at the master processor. In the next section we investigate whether the use of a hierarchical topology, as opposed to the traditional master-slave architecture, can reduce the bottleneck and improve the scalability of the system.

IV. ISOEFFICIENCY AND SCALABILITY

Scalability may be defined as "the system's ability to increase speedup as the number of processors increase" [9]. Another definition that is not based on the concept of speedup is the following: "An algorithm-machine combination is scalable if the achieved average speed of the algorithm on the given machine can remain constant with increasing number of processors, provided the problem size can be increased with the system size" [19]. This last definition is important since it relates scalability to the combination of a machine and an algorithm, instead of making it a property of either the machine or the algorithm alone. Based on those definitions, in this section we adopt one scalability metric: the isoefficiency function, proposed by Grama, Gupta and Kumar [9], which is based on the concept of parallel computing efficiency. Isoefficiency fixes the efficiency and measures how much the work must be increased to keep the efficiency unchanged as the machine scales up. An isoefficiency function f(P) relates the machine size P to the amount of work needed to maintain the efficiency. Parallel computing efficiency is defined as

E = (T_seq / T_par) / P,    (5)

where T_seq is the time for a sequential execution, and T_par is the time for a parallel execution with P processors.

Fig. 6. Efficiency of the executions when all tasks share the same input file (efficiency versus number of processors, for ratios 2, 4, 8, 16, and 32).

Fig. 7. Isoefficiency function when all tasks share the same input file (number of tasks versus number of processors, for ratios 2, 4, 8, 16, and 32).

Fig. 8. Isoefficiency function when each task has its own input file (number of tasks versus number of processors, for ratios 2, 4, 8, 16, and 32).

In this section we present the isoefficiency functions for the execution of independent-tasks-with-shared-files applications on a master-slave platform. The simulated platform was composed of up to 400 homogeneous and dedicated processors, and the application was composed of a variable number of


tasks, depending on the amount of work necessary to keep the efficiency constant. Each task takes 8 time units to complete (ET), and the amount of time needed to send the input files (TTIF) varies in order to obtain different ratios. In the following experiments we simulated the ratios 2, 4, 8, 16, and 32. Figure 6 shows the efficiency of the experiments when all tasks share the same input file. It can be verified that the efficiency is kept around 0.99 for all ratios. Figure 7 shows the corresponding isoefficiency functions. It is worth noting the parabolic shape of all curves. We also executed the same application when each task has its own input file. In this case a round-robin strategy was used to map tasks to processors. The round-robin strategy is optimal for master-slave platforms that are homogeneous and dedicated [8]. For those executions the input files are sent to the slave nodes before each task execution. Figure 8 shows the corresponding isoefficiency functions, for efficiencies around 0.99. It can be seen that the execution of an application composed of independent tasks which have different input files is not scalable. A third set of experiments is shown in Figure 9. We simulated a hierarchical platform composed of one supervisor node and several master and slave nodes. The maximum number of slave nodes considered is 200. It is possible to verify that the curves also have a parabolic shape. However, the rate of growth of the parabolic-like function is smaller when compared to the master-slave platform, as the direct comparison of the curves of Figure 7 and Figure 9 shows (see Figure 10).

Fig. 9. Isoefficiency function for the hierarchical platform (number of tasks versus number of processors, for ratios 2, 4, 8, 16, and 32).

Fig. 10. Comparison of the isoefficiency functions of the hierarchical and master-slave platforms, for ratios 2, 4, 8, 16, and 32.

V. HIERARCHICAL SCHEDULING

As shown in section IV, an architecture with a hierarchical topology presents higher scalability than a master-slave architecture for the execution of applications composed of independent tasks that share input files. In this section, we present a hierarchical scheme for file distribution and task scheduling that aims at improving application scalability in this scenario. The problem of scheduling independent tasks with shared input files stored in a centralized repository appears in many of the applications mentioned in section I, and has been studied in [2], [3], [6], [8], [17], [18]. In such a scenario, the master node is implemented by the user's machine, which accesses a centralized repository and launches the application tasks to be executed by the slave processors. The main functions of the master node are to take actions for application coordination (e.g., scheduling, control of completed tasks) and to distribute files. Our hierarchical scheme aims at alleviating the bottleneck at the master node by reducing the number of file transfers and control actions it must perform. In order to reduce the workload at the master node, we propose the addition of a number of supervisor nodes to the master-slave architecture. A supervisor is responsible for controlling the execution of application tasks as well as transmitting files to the slave nodes. The master groups tasks together to form execution units comprised of tasks that share common files, namely the jobs, which are distributed among the supervisor nodes. In turn, each supervisor is responsible for controlling the execution of its job on a subset of slave nodes. In this model, there is no direct interaction between the master and the slave nodes. Instead, the master delegates jobs to supervisors, which are responsible for communicating with the slave nodes and managing the execution of the application tasks.

A. A Strategy for Hierarchical Scheduling

First, consider a distributed architecture composed of one master node M and a collection of P processors placed in a collection of C clusters interconnected by communication links. The application is composed of T tasks, where T is at least one order of magnitude larger than P (T >> P). The collection of processors may be partitioned into two sets: the former comprises S slave computers, and the latter comprises N supervisor computers. Under such conditions, we propose the following steps to be carried out by the master node for launching the application execution (a sketch of this setup is given after the supervisor's event-handling steps below):
1) Initially, the master obtains static information about the available resources, e.g., the number and identification of available computers in the system, processor speeds, and memory. By the end of this step, the parameter P is known.
2) For each known cluster, compute its number of supervisors Ni, so that every cluster has at least one supervisor, and clusters with a large number of processors have additional supervisors according to the maximum number of slaves they can efficiently control.


More precisely, the number Ni of supervisors for the i-th cluster is computed as

N_i = ⌈P_i / S'_eff⌉,    (6)

where P_i is the number of processors in the i-th cluster, and S'_eff is the maximum number of processors a supervisor can efficiently control when task grouping is applied (see equation 4 in section III). Then, obtain N, the total number of supervisors in the system, as

N = Σ_{i=1}^{C} N_i.    (7)

3) Compute the number of slaves S = P − N.
4) As proposed by Silva et al. in [17], group the application tasks into S execution units, namely the jobs, so that each job can be assigned to one supervisor and its slaves. Each job groups together a number of tasks (in the range [⌊T/S⌋, ⌈T/S⌉]) that share common input files. The mapping of jobs to processors can be done at random, or by means of some heuristic similar to those presented in [6]–[8], [17]. For instance, the granularity (i.e., the number of tasks) of each job could be adjusted according to the capacities of the slave processors. The result of this step is a list of 2-tuples containing the identification of a job and the slave processor it was assigned to.
5) Decompose the job list into N sub-lists, so that each sub-list corresponds to one supervisor and contains the jobs assigned to the slave processors under its control. This step can be accomplished in time O(N) by selecting the sub-list each job must be moved to. For each supervisor, send the sub-list containing the information on the jobs assigned to it, and transmit the required input files (only once).
6) Wait for tasks to be executed and results to be returned.
As soon as a supervisor receives a job list and input files, it distributes the tasks to its slave processors. Every time a supervisor is notified that a task has concluded, it performs the following steps:
1) Set the task status to DONE.
2) Forward the notification and results back to the master.
3) While there remain tasks to execute (READY) whose input files have already been transmitted to the idle slave, the supervisor assigns the next such task to the idle processor.
4) When no READY task can be found, the supervisor looks for some uncompleted task (RUNNING) whose input files have already been transmitted to the idle slave, and creates a replica of that task on the idle slave processor. Replication improves the probability of such a task being concluded earlier, and also provides some level of fault tolerance.
5) When steps 3 and 4 are completed (all of its tasks have been concluded), the supervisor asks the master node for incomplete tasks from other processors.
6) If the master has no tasks to be executed, then the supervisor finishes.
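The following sketch (Python, with assumed data shapes; not the authors' implementation) illustrates steps 2, 3, and 5 of the master-side setup: it applies equations (6) and (7) to size the set of supervisors, derives S = P − N, and decomposes a job list, already produced by the grouping of step 4, into N sub-lists.

```python
import math

def supervisors_per_cluster(cluster_sizes, s_eff):
    """Equations (6) and (7): N_i = ceil(P_i / S'_eff) supervisors for the
    i-th cluster, and N as the sum of all N_i."""
    per_cluster = [math.ceil(p_i / s_eff) for p_i in cluster_sizes]
    return per_cluster, sum(per_cluster)

def master_setup(cluster_sizes, s_eff, jobs):
    """Steps 2, 3, and 5 of the master-side setup.

    cluster_sizes: number of processors P_i in each cluster.
    s_eff: S'_eff from equation (4).
    jobs: list of (job_id, input_files, task_ids) tuples built in step 4."""
    per_cluster, n = supervisors_per_cluster(cluster_sizes, s_eff)   # step 2
    s = sum(cluster_sizes) - n                                       # step 3
    # Step 5: decompose the job list into N sub-lists, one per supervisor.
    sub_lists = [jobs[i::n] for i in range(n)]
    return per_cluster, n, s, sub_lists

# Toy usage: two clusters of 16 and 8 processors, S'_eff = 6, and six jobs.
jobs = [("job%d" % j, ["file_%d" % j], list(range(j * 5, j * 5 + 5)))
        for j in range(6)]
print(master_setup([16, 8], 6, jobs))
```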

TABLE I
PERFORMANCE FOR WORKQUEUE (WQ), WORKQUEUE WITH GROUPING (WQ+G), AND HIERARCHICAL WORKQUEUE (WQ+H)

Processors (P) | WQ makespan | WQ eff. | WQ+G makespan | WQ+G eff. | WQ+H makespan | WQ+H eff.
     1         |   256000    |  1.00   |    224001     |   1.00    |    224002     |   1.00
     5         |    51204    |  0.87   |     44805     |   1.00    |     44806     |   1.00
     8         |    32007    |  0.87   |     28008     |   1.00    |     22410     |   1.00
    10         |    32007    |  0.70   |     22409     |   1.00    |     22410     |   1.00
    50         |    32007    |  0.14   |      4509     |   0.99    |      4510     |   0.99
   100         |    32007    |  0.07   |      2294     |   0.98    |      2270     |   0.99
   400         |    32007    |  0.02   |       764     |   0.73    |       589     |   0.95
   500         |    32007    |  0.01   |       702     |   0.64    |       482     |   0.93
  1000         |    32007    |  0.01   |       673     |   0.33    |       263     |   0.85
  2000         |    32007    |  0.00   |       674     |   0.17    |       161     |   0.70
  3000         |    32007    |  0.00   |       674     |   0.11    |       134     |   0.56
  5000         |    32007    |  0.00   |       675     |   0.07    |       123     |   0.36
 10000         |    32007    |  0.00   |       684     |   0.03    |       124     |   0.18
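As a quick check of how the efficiency column of Table I follows from equation (5), the snippet below recomputes two of the reported entries, taking the sequential time of each algorithm as its P = 1 makespan.

```python
def efficiency(t_seq, t_par, p):
    """Parallel efficiency as defined in equation (5): E = (T_seq / T_par) / P."""
    return t_seq / (p * t_par)

# WQ+G with 500 processors: 224001 / (500 * 702) is about 0.64.
print(round(efficiency(224001, 702, 500), 2))
# WQ+H with 3000 processors: 224002 / (3000 * 134) is about 0.56.
print(round(efficiency(224002, 134, 3000), 2))
```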

Every supervisor node maintains a list of the jobs and tasks under its control, and the master maintains a list of all jobs and tasks of the application. For each task, there is status information (READY, RUNNING, DONE), which is updated by means of messages exchanged among the processors upon the corresponding events.

VI. SIMULATION RESULTS

The hierarchical scheduling strategy was evaluated by means of simulation, using the same application and architecture scenario described in section III. For the simulations, homogeneous, dedicated computers and communication links were considered. Also, the simulated application is composed of homogeneous tasks. The results are shown in Table I. For these experiments, two metrics were used: the total makespan [14] and the computational efficiency (see equation 5). The total makespan is the time between the transmission of the first input file and the return of the last output file. As shown in Table I, with the pure Workqueue algorithm the makespan can be reduced to a minimum of 32,007 time units, using eight processors. After this threshold, the makespan cannot be shortened by adding processors. With the Workqueue with Grouping (WQ+G) algorithm, the makespan can be reduced to a minimum of 673 time units, by employing 1,000 nodes. In this experiment, the execution with 500 nodes is concluded in 702 time units and the computational efficiency is around 0.64. However, after this threshold the addition of processors does not lead to a significant reduction in execution times, and the efficiency drops fast. Finally, with the Hierarchical Workqueue (WQ+H) algorithm, the makespan can be reduced to a minimum of 123 time units, which is achieved with 5,000 processors. In this experiment, the execution with 3,000 nodes is concluded in 134 time units and the computational


Fig. 11. Makespan for simulated experiments using Workqueue (WQ), Workqueue with Grouping (WQ+G), and Hierarchical Workqueue (WQ+H) scheduling, for a fixed granularity (makespan versus number of processors, 1 to 10,000, log-log scale).

efficiency is 0.56. However, it is worth noting that after this threshold the addition of processors does not lead to a significant reduction in the makespan, and the efficiency decreases gracefully.

VII. CONCLUSIONS AND FUTURE WORK

Independent tasks with shared files constitute an important class of applications that can benefit from the computational power delivered by computational grids and clusters. The execution of such applications on master-slave architectures creates a bottleneck at the master computer, which limits system scalability. The bottleneck appears because the master node is responsible for scheduling tasks and transmitting input files to the slave processors. This limitation is aggravated by the existence of fine grain application tasks, which increase both the rate of files transmitted to slave processors and the rate of scheduling actions taken by the master. As a contribution, we propose and assess a strategy that orchestrates file transfers and the mapping of tasks to processors in an integrated manner. Our strategy maps groups of tasks to a hierarchical arrangement of processors, leading to significant improvements in application scalability, mainly in the presence of fine grain tasks. The basis for this improvement comes from the fact that our strategy: (i) groups together tasks that share input files, thus improving the application granularity and reducing the number of file transfers on the communication links; (ii) multiplies the system's capacity of transmitting input files to slave processors, by means of adding supervisors to the master-slave model; and (iii) reduces the number of scheduling actions to be taken by the master, by delegating them to a set of supervisor nodes. These results apply to the execution of applications composed of independent tasks with shared files, executing on top of master-slave platforms. Although in this paper we evaluated an instance of the master-supervisor-slave hierarchy with only one level of supervisors, a more general scheme should be evaluated in the future. Also, future evaluations should emphasize analytical and theoretical aspects of the scalability limits that can be achieved by means of the techniques discussed here.

REFERENCES

[1] Berman, F.: High-performance schedulers. In: Foster, I., Kesselman, C. (editors) The Grid: Blueprint for a New Computing Infrastructure, pp. 279-309. Morgan Kaufmann, 1999.
[2] Casanova, H., Legrand, A., Zagorodnov, D., Berman, F.: Using Simulation to Evaluate Scheduling Heuristics for a Class of Applications in Grid Environments. Research Report 99-46, LIP, ENS Lyon, France, 1999.
[3] Casanova, H., Legrand, A., Zagorodnov, D., Berman, F.: Heuristics for Scheduling Parameter Sweep Applications in Grid Environments. In: 9th Heterogeneous Computing Workshop, pp. 349-363. IEEE CS Press, 2000.
[4] Cirne, W., Paranhos, D., Costa, L., Santos-Neto, E., Brasileiro, F., Sauvé, J., Osthoff, C., Silva, F.A.B., Silveira, C.: Running Bag-of-Tasks Applications on Computational Grids: The MyGrid Approach. In: Intl. Conf. on Parallel Processing (ICPP), 2003.
[5] Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P.: From Data Mining to Knowledge Discovery: An Overview. In: Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (editors) Advances in Knowledge Discovery and Data Mining, pp. 1-37. MIT Press, 1996.
[6] Giersch, A., Robert, Y., Vivien, F.: Scheduling Tasks Sharing Files on Heterogeneous Clusters. Research Report RR-2003-28, LIP, ENS Lyon, France, May 2003.
[7] Giersch, A., Robert, Y., Vivien, F.: Scheduling Tasks Sharing Files from Distributed Repositories. Technical Report No. 5214, INRIA, France, 2004.
[8] Giersch, A., Robert, Y., Vivien, F.: Scheduling Tasks Sharing Files on Heterogeneous Master-Slave Platforms. In: 12th Euromicro Workshop on Parallel, Distributed and Network-based Processing, pp. 364-371. IEEE CS Press, 2004.
[9] Grama, A., Gupta, A., Kumar, V.: Isoefficiency: Measuring the Scalability of Parallel Algorithms and Architectures. IEEE Parallel and Distributed Technology, Vol. 1, No. 3, 1993.
[10] Hruschka, E.R., Ebecken, N.F.F.: A Genetic Algorithm for Cluster Analysis. Intelligent Data Analysis (IDA), v. 7, pp. 15-25. IOS Press, 2003.
[11] Legrand, A., Lerouge, J.: MetaSimGrid: Towards Realistic Scheduling Simulation of Distributed Applications. Research Report No. 2002-28, Laboratoire de l'Informatique du Parallélisme, ENS Lyon, 2002.
[12] Maheswaran, M., Ali, S., Siegel, H.J., Hensgen, D., Freund, R.: Dynamic Matching and Scheduling of a Class of Independent Tasks onto Heterogeneous Computing Systems. In: 8th Heterogeneous Computing Workshop (HCW'99), April 1999.
[13] Merz, C.J., Murphy, P.M.: UCI Repository of Machine Learning Databases, http://www.ics.uci.edu, Irvine, CA, University of California.
[14] Pinedo, M.: Scheduling: Theory, Algorithms and Systems. Prentice Hall, Englewood Cliffs, NJ, 1995.
[15] Ranganathan, K., Foster, I.: Simulation Studies of Computation and Data Scheduling Algorithms for Data Grids. Journal of Grid Computing 1(1), pp. 53-62. Kluwer Academic Publishers, The Netherlands, 2003.
[16] Senger, H., Hruschka, E.R., Silva, F.A.B., Sato, L.M., Bianchini, C.P., Esperidio, M.D.: Inhambu: Data Mining Using Idle Cycles in Clusters of PCs. In: Proc. IFIP Intl. Conf. on Network and Parallel Computing (NPC'04), Wuhan, China, 2004. LNCS, Vol. 3222, pp. 213-220. Springer-Verlag, Berlin Heidelberg New York, 2004.
[17] Silva, F.A.B., Carvalho, S., Senger, H., Hruschka, E.R., Farias, C.R.G.: Running Data Mining Applications on the Grid: a Bag-of-Tasks Approach. In: Int. Conf. on Computational Science and its Applications (ICCSA), Assisi, Italy. LNCS, Vol. 3044, pp. 168-177. Springer-Verlag, Berlin Heidelberg New York, 2004.
[18] Silva, F.A.B., Carvalho, S., Hruschka, E.R.: A Scheduling Algorithm for Running Bag-of-Tasks Data Mining Applications on the Grid. In: Euro-Par 2004, Pisa, Italy, 2004. LNCS, Vol. 3419, pp. 254-262. Springer-Verlag, Berlin Heidelberg New York, 2004.
[19] Sun, X., Rover, D.T.: Scalability of Parallel Algorithm-Machine Combinations. IEEE Transactions on Parallel and Distributed Systems, Vol. 5, No. 6, June 1994.
[20] Talia, D.: Parallelism in Knowledge Discovery Techniques. In: Proc. Sixth Int. Conference on Applied Parallel Computing, Helsinki. LNCS 2367, pp. 127-136, June 2002.
[21] Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, 2000.
[22] Yang, Y., van der Raadt, K., Casanova, H.: Multi-Round Algorithms for Scheduling Divisible Workloads. IEEE Transactions on Parallel and Distributed Systems (TPDS), 16(11), pp. 1092-1102, 2005.

