Scheduling Fine-Grained Distributed Simulations in Wide-Area Systems

Jon B. Weissman1 and Ping Jiang
[email protected]
Division of Computer Science, University of Texas at San Antonio

1.0 Introduction

Computer simulations of engineering and science phenomena have become a commonplace method for problem solving and scientific discovery. However, executing simulation models of sufficient fidelity and complexity to be useful requires a great deal of computational resources, in both memory and machine cycles. A great many of these simulation models are well suited to parallel and distributed execution, which can meet these high resource demands. In this paper, we study a class of simulation programs that consist of a large number of independent tasks, such as those arising from parameter studies or probabilistic analysis. A key feature of these applications is that they can benefit from a large number of machine resources, often spanning multiple sites across the network. The seamless use of wide-area resources, called metacomputing, has been proposed for problems that require the coordinated use of such resources [3][5][6][9]. A metacomputing solution to this problem requires that machine resources be located, selected, and allocated to the application in a manner that reduces the application completion time. Collectively, we refer to this as application scheduling.

This paper describes the design of scheduling methods for these applications. Since we presume that the wide-area network will consist of a very large number of resources that vary in availability, the issues of scalability and adaptability are important. Scalability means that hundreds to thousands of machine resources may be selected for the application, and adaptability means that machine resources may suffer dynamic load fluctuations or may be added or removed during the course of the application. We also consider applications in which the individual tasks are fine- to medium-grained, which presents challenges to achieving efficient execution in a distributed environment. The results show that wide-area resources can be effectively exploited for these applications, but careful selection of several scheduling parameters is required to obtain the best possible performance.

This effort is part of a larger project, GlobalSim, at the University of Texas at San Antonio [10]. The project aims to build a usable software tool that can automatically schedule large simulation applications across wide-area networks with minimal modifications to the application source.

1. This work was partially funded by NSF grants ASC-9625000 and CDA-9633299 and AFOSR grant F49620-96-1-0472.

2.0 GlobalSim Architecture

The GlobalSim architecture is designed to accommodate very large-scale applications composed of a large number of independent tasks that may benefit from machine resources located across a wide-area network. Unlike the similar projects described in Section 4.0, the tasks are not assumed to be coarse-grained; they may be fine- to medium-grained, taking on the order of milliseconds to seconds of execution time per task.

We classify simulation applications as either task independent or task synchronous. A task independent application does not require any synchronization during execution. The task set for this application class is normally created statically and remains unchanged. Parameter study applications generally fall into this category, particularly those that explore an entire set of parameters for comparison purposes. A task synchronous application contains tasks that are independent but require synchronization. In these applications, the task set is more dynamic and depends on the results of prior task executions. For example, an application may wish to find an acceptable set of parameter values in a parameter study, but examining all possibilities is too expensive and unnecessary. Instead, a subset of the space is explored: an initial task set is generated for an initial part of the parameter space; based on these results, another portion of the parameter space is identified for exploration, and so on until an acceptable solution is found. Another example is a simulation application with a human in the loop for steering. In these cases, synchronization is required to ensure that all prior task executions have finished so that a decision can be made about the next set of tasks to execute.

We have developed a multi-level architecture for mapping these applications onto distributed machine resources (Figure 1). The main program generates the task set to be executed by a set of slave computers located in potentially remote sites. The tasks are first allocated to one or more masters that each control a subset of the slave computers. The slaves request work from their master, execute the tasks, and return results to their master. The masters in turn return the results accumulated from their slaves to the main program. Each of these software components is provided in GlobalSim. The user application consists of two pieces: a function that generates tasks and collects final results for the user, which is linked with our main program, and the task execution program, which is linked with the slave program. (The details of the software architecture and APIs may be found in [10].)

This architecture meets two of our central requirements, scalability and adaptability. The distribution of work from the masters to the slaves allows a large number of slaves to be supported without creating a bottleneck. The architecture is also adaptable and allows slave computers to be easily added or removed during the application.


Figure 1: GlobalSim runtime architecture (tasks flow from the main program to the masters and on to the slaves)

The architecture is also tolerant to slave failures: a master keeps track of the tasks assigned to its slaves, so it can simply reallocate any work lost when a slave fails. Master failures can be handled similarly from the main program, though a more sophisticated scheme that elects a new master from among the slaves it controls will be explored in the future.

This architecture raises a number of important scheduling issues. First, how many slaves and masters are needed, and which computers in the selected sites should be used? This problem is particularly important if the application tasks have specific resource requirements, such as a minimum amount of memory or access to local disk, or for smaller applications in which scheduling overhead becomes more dominant. For example, a smaller application may be best served by a small number of fast slave resources, perhaps contained in a single remote site; in this case, locating the best single site quickly may be the primary objective, and allocating additional resources may in fact introduce unnecessary overhead. Second, where should the masters be located relative to the slaves? The placement of masters is an important issue in a wide-area network, as our results demonstrate. Third, given a selected set of resources, how are tasks allocated? Task allocation is an issue for the slaves as well as the masters.

These are complex issues, and this paper deals with the latter two questions. For simplicity, we assume that a fixed number of masters and slaves is used and that no specific resource constraints are present. For the slaves, task allocation is performed by dynamic self-scheduling, since it supports our goals of adaptivity and fault tolerance. For the masters, we have examined both static and dynamic self-scheduled allocation schemes. In the static schemes, the entire workload is divided among the masters once at runtime. In the dynamic schemes, the masters are self-scheduled and request work from the main program in a manner similar to the slaves. With dynamic schemes such as self-scheduling, an important issue is the size of the allocated task chunk.


We studied the effect of chunk size for both slave and master task allocation across a range of application parameters. In addition to chunk size, there is the issue of load balancing versus load sharing. Task synchronous applications require a load-balanced task allocation to ensure that all masters (via their slaves) reach the synchronization points at the same time. Task independent applications, on the other hand, do not require load balancing. Instead, they require load sharing to prevent busy slaves from delaying the completion of the application while idle slaves are available. Task independent applications are the more common case and are the subject of this paper.
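To make the self-scheduling policy concrete, the sketch below shows a fixed-chunk, self-scheduled work loop. It is only an illustration: it collapses the main program, a master, and its slaves into threads sharing a counter in a single process, whereas the GlobalSim prototype uses separate programs communicating over TCP/IP, and all names below are ours rather than part of the GlobalSim API.

    /* Minimal sketch of dynamic self-scheduling with a fixed chunk size.
     * Worker threads stand in for slaves and a shared counter stands in for
     * the master's pool of unassigned tasks; all names are illustrative. */
    #include <pthread.h>
    #include <stdio.h>

    #define TOTAL_TASKS 600000L  /* e.g., a 10-minute workload of 1 msec tasks */
    #define CHUNK_SIZE  1000L    /* tasks handed out per request for work */
    #define NUM_SLAVES  8

    static long next_task = 0;   /* next unassigned task id */
    static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

    /* A slave's request for work: returns the length of the granted chunk
     * (0 when the workload is exhausted) and its first task id via *first. */
    static long get_chunk(long *first)
    {
        long n = 0;
        pthread_mutex_lock(&pool_lock);
        if (next_task < TOTAL_TASKS) {
            *first = next_task;
            n = TOTAL_TASKS - next_task;
            if (n > CHUNK_SIZE)
                n = CHUNK_SIZE;
            next_task += n;
        }
        pthread_mutex_unlock(&pool_lock);
        return n;
    }

    static void *slave(void *arg)
    {
        long id = (long)arg, first, n, done = 0;
        while ((n = get_chunk(&first)) > 0) {
            /* executing tasks first..first+n-1 and returning results goes here */
            done += n;
        }
        printf("slave %ld executed %ld tasks\n", id, done);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NUM_SLAVES];
        for (long i = 0; i < NUM_SLAVES; i++)
            pthread_create(&tid[i], NULL, slave, (void *)i);
        for (int i = 0; i < NUM_SLAVES; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }

The chunk size is the knob examined throughout the experiments below: larger chunks mean fewer requests for work, while smaller chunks mean finer-grained load sharing.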

3.0 Results

We have performed a set of experiments that illustrate the key issues in the design of scalable wide-area scheduling methods for task independent applications, and we have picked a few points in the vast parameter space that highlight the important scheduling parameters. The results are a first step toward the design of automated scheduling strategies.

We have built a prototype to explore the scheduling issues outlined in the previous section. The prototype consists of main, master, and slave programs written in C with TCP/IP communication between all components. The main and master programs are threaded to allow them to efficiently control the masters and slaves, respectively. We use rsh for remote process creation of the slaves and masters. A synthetic application shell was constructed for experimentation with the following parameters (collected in the illustrative struct after the list):

• number of tasks or application length
• task granularity (msec)
• task size (bytes) -- the data record that specifies the task (e.g., a set of parameter values in a parameter study)
• task result size (bytes) -- result data associated with each executed/simulated task
• chunk size -- the number of tasks allocated to a slave or master in each request for work
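The parameter list above can be summarized, purely for illustration, as a small C struct; the type and field names are ours and are not part of the GlobalSim API.

    #include <stddef.h>

    /* Illustrative container for the synthetic application parameters;
     * the type and field names are ours, not part of the GlobalSim API. */
    struct sim_params {
        long   num_tasks;    /* number of tasks (application length) */
        double task_msec;    /* task granularity: execution time per task, msec */
        size_t task_bytes;   /* task size: data record specifying one task */
        size_t result_bytes; /* result data produced per executed task */
        long   chunk_size;   /* tasks allocated per slave or master work request */
    };

    /* e.g., the smallest initial experiment below: 10 sec of 1 msec tasks
     * (10,000 tasks), 20-byte tasks, 100-byte results, 100-task chunks. */
    static const struct sim_params small_run = { 10000, 1.0, 20, 100, 100 };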

We initially experimented with four problem sizes (10 sec, 1 min, 5 min, and 10 min of total application length) with 1 msec tasks, a small task size (20 bytes), and a small result size (100 bytes), on our local network. For simplicity, we assume that all tasks have the same size, granularity, and result size.

The first question we investigated was the issue of scale. In a self-scheduled master-slave architecture, the possibility of hundreds to thousands of slaves suggests that scalability may be compromised for some applications. Clearly, for extremely large-grain tasks this may be less of an issue, but we expect that the most common case will consist of a large set of fine- to medium-grain tasks, as in many parameter study applications. We therefore selected a fine-grain task granularity of 1 msec. In this initial set of experiments, we used a collection of 45 Sun Sparc5 and UltraSparc workstations located on an Ethernet LAN at UTSA. All runs were performed overnight when the system was lightly loaded, and the data presented throughout the paper are the average of multiple runs, unless otherwise stated.
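For reference, the application length corresponds to the total sequential work, i.e., number of tasks × task granularity: the 10-minute problem at 1 msec per task therefore comprises 600,000 tasks, and the 40-minute problems used later at 10 msec per task comprise 240,000 tasks, matching the task count quoted with the tapered allocation below.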


Figure 2: Impact of chunk size on scalability (chunk size = # tasks)

When we use a single master, a bottleneck clearly forms as the number of slave processors increases (Figure 2). In this figure and in all remaining figures and tables, the unit of time is seconds. As the problem size increases, additional slaves can be used profitably, but eventually this benefit too is overcome by the bottleneck that forms at the master. It is not surprising that the single-master bottleneck can be relieved somewhat by increasing the number of tasks assigned in a task chunk (the best number of slaves increases and the execution time decreases)2. However, arbitrarily increasing the chunk size has its own limitations when the slave workload becomes uneven. This can occur if the tasks themselves have different granularities or if the slave computers vary in computational power due to sharing with other users, as we will show. The problem is more acute for synchronous applications, in which the impact of a load imbalance is greater. Interestingly, a similar problem occurs for task independent applications, in that it becomes more difficult to perform load sharing: the slowest slave will still have work on its queue at the end of the application while the other slaves are idle, which delays completion.

The next set of experiments explores the trade-offs in increasing the chunk size. We ran a set of experiments using a fixed number of processors (16) and introduced load variance to see the effect of increasing the chunk size (Table 1). Load variance was introduced by having the master delay a slave, with 50% probability, for an interval of time equal to the execution time of a task chunk before allocating the next chunk to that slave.

2. For application length = 10 min and chunk size = 100, a bottleneck forms earlier than expected. We have no explanation for this at present.
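The delay injection can be sketched as follows; the 50% probability and the one-chunk delay length come from the description above, while the function and its names are ours.

    #include <stdlib.h>
    #include <time.h>

    /* With the given probability, sleep for one chunk's worth of task
     * execution time before handing the next chunk to a slave. */
    static void maybe_delay(long chunk_size, double task_msec, double probability)
    {
        if ((double)rand() / RAND_MAX < probability) {
            double secs = chunk_size * task_msec / 1000.0;  /* msec -> sec */
            struct timespec ts;
            ts.tv_sec  = (time_t)secs;
            ts.tv_nsec = (long)((secs - (double)(time_t)secs) * 1e9);
            nanosleep(&ts, NULL);                           /* POSIX sleep */
        }
    }

    /* e.g., maybe_delay(1000, 1.0, 0.5) adds a 1-second delay half the time */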

Table 1: Effect of chunk size under load conditions (problem: 1 msec/task; times in seconds)

                          Application length
chunk size (# tasks)      5 minutes    10 minutes    20 minutes
100                       241.87       452.57        875.10
1000                      247.34       458.46        891.99
10000                     295.71       536.68        984.33

The results show that increasing the chunk size under load variance can negatively impact the application. The benefit of increasing the chunk size from 100 to 1000 was offset by the load imbalance, and increasing the chunk size further to 10000 created a more severe load imbalance that swamped any gains from the larger chunk size. In addition, increasing the chunk size has a negative impact on the ability of the scheduler to adapt to a changing set of processors, a capability we plan to support. For example, if a processor is to be removed but is working on a large set of tasks, then either the tasks must be reclaimed and given to another slave or the system must wait before allowing the slave to be removed. A smaller chunk size allows the system to simply wait for the slave to finish and does not require any special reclamation code (this helps meet our goal of allowing application code to run in our system with minimal modification).

An obvious solution to the master bottleneck that does not suffer from the limitations of large chunk sizes is to use multiple masters. When two masters are used (under lightly loaded conditions), the bottleneck can be relieved (Figure 3). In these experiments, the two masters managed an identical number of slave computers of approximately equal power, and the masters were initially allocated equal chunks of the workload by the main program. If load variance were introduced, the problem of load sharing would again appear, since the computational power of each master's set of slaves could differ. The solution to this problem is for the masters to request work from the main program via self-scheduling, in a manner similar to the slaves, as we will show.

These results motivate the importance of multiple masters for providing scalability and adaptability. This becomes even clearer when we consider a multi-site wide-area network. We created a multi-site network by adding a remote Internet site (Southwest Research Institute, or SwRI, in San Antonio)3. The multi-site testbed consisted of 8 Sparcs at SwRI and 8 Sparcs at UTSA. We reduced the set of slaves at UTSA to make the sites comparable in power and modified our problem parameters: application length = 40 min, with varying task granularity and chunk size.

3. Even though SwRI is “local”, it takes over 10 Internet hops to get there from UTSA.

Figure 3: Impact of multiple masters on scalability (chunk size = 100 tasks)

The next question we investigated was the issue of master placement. The choices are both masters at UTSA, or one master each at UTSA and SwRI. We ran overnight in lightly loaded mode (both for the slaves and for the Internet) and allocated a one-shot, even chunk of tasks to each master, as above, for a few problem instances. The results clearly demonstrate that additional remote slaves are beneficial even for fine-grain tasks, but that collocating masters with their slaves is important to exploiting the additional computing power of remote slaves (Table 2). The reason is that communicating tasks from the main program to remote masters, and results back from those masters to the main program, is expensive across the Internet.

Table 2: The effect of master placement (40-minute problems; times in seconds)

problem                         sequential   UTSA only          UTSA and SwRI         UTSA and SwRI
                                             (8 slaves,         (16 slaves, 2         (16 slaves, 1 master
                                              1 master)          masters at UTSA)      at UTSA, 1 at SwRI)
2 msec/task, 30 tasks/chunk     2417.26      1963.59            2583.37               1917.24
4 msec/task, 100 tasks/chunk    2419.25      825.02             883.91                697.83
10 msec/task, 100 tasks/chunk   2406.35      767.40             671.73                498.51


The final question we examined was the allocation of tasks to the masters in the multi-site network when the site resources were shared. We began with an investigation of static policies in which a fixed amount of workload was delivered to the masters and remained unchanged, as above. Three static policies were implemented: static-even, static-comp-only, and static-comp-and-comm. The first policy simply divides the entire workload evenly between the sites (as above). In the latter two policies, the main program allocates an initial chunk of tasks (10% of the total workload) evenly to the sites and times the execution of this initial chunk to decide how the remainder will be divided. For example, if site 1 takes 10 sec and site 2 takes 20 sec, then site 1 is given twice as many tasks from the remaining set. In static-comp-only, the masters individually time the execution of their slaves on the initial chunk and report this time back to the main program; this does not include any wide-area communication costs. In static-comp-and-comm, the main program times the execution and the wide-area communication cost for each master, which includes the Internet communication time for main-master traffic.

The results indicate that the effective site power must be taken into account, but that this alone is not sufficient to determine how the workload should be divided (Table 3). Because Internet communication costs are high, a combined computation and communication load balance, as implemented by static-comp-and-comm, performs best. The principal reason is that the amount of task and result data transferred between the main program and a master is proportional to the number of tasks allocated to it, hence balancing computation and communication costs together is necessary. The remote site (SwRI) is allocated less work to reduce its communication cost to and from the main program. Collectively, the power of the UTSA site is a little greater than that of SwRI, hence static-comp-only also outperforms static-even.

The final issue is the need for dynamic scheduling policies in multi-site networks. Clearly, the static policies described above are unable to respond to changes in machine and network performance during the execution of the application. They also do not support adaptivity, in which processors may come and go during the execution. The solution is to allow the masters to be self-scheduled, requesting chunks of work from the main program periodically in a manner similar to the slaves. We simulated load variance as before by having each master stochastically delay the delivery of work to its slaves for an interval of time equal to the length of a slave task chunk. To create a load difference between the two sites, we set the probability of delay for the UTSA slaves and the SwRI slaves differently. The question remains: how should the master chunk sizes be determined? Clearly, for synchronous applications, timing data (as above) will be needed to determine chunk sizes that keep the masters and slaves load balanced.

Table 3: Static work allocation methods (40-minute problems; times in seconds)

problem                         static-even   static-comp-only   static-comp-and-comm
3 msec/task, 200 tasks/chunk    704.50        603.92             524.82
6 msec/task, 100 tasks/chunk    529.82        479.58             413.26
10 msec/task, 100 tasks/chunk   458.37        422.19             375.62
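As a concrete illustration of the division step used by static-comp-only and static-comp-and-comm, the sketch below splits the remaining 90% of the workload in inverse proportion to each site's measured probe time (the 2:1 example from the text); the function and its names are ours, not the GlobalSim implementation.

    #include <stdio.h>

    /* Split the remaining tasks among nsites in inverse proportion to the
     * measured probe times (computation only, or computation plus
     * main-master communication, depending on the policy). */
    static void split_remaining(long remaining, const double probe_time[],
                                long share[], int nsites)
    {
        double total_rate = 0.0;
        long assigned = 0;
        for (int i = 0; i < nsites; i++)
            total_rate += 1.0 / probe_time[i];   /* faster site => larger rate */
        for (int i = 0; i < nsites; i++) {
            share[i] = (long)(remaining * (1.0 / probe_time[i]) / total_rate);
            assigned += share[i];
        }
        share[0] += remaining - assigned;        /* rounding leftovers to site 0 */
    }

    int main(void)
    {
        /* The example from the text: site 1 finishes its probe in 10 sec and
         * site 2 in 20 sec, so site 1 receives twice as many remaining tasks
         * (here, 90% of a 240,000-task problem). */
        double probe[2] = { 10.0, 20.0 };
        long share[2];
        split_remaining(216000, probe, share, 2);
        printf("site 1: %ld tasks, site 2: %ld tasks\n", share[0], share[1]);
        return 0;
    }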


However, for task independent applications, the main goal is to ensure that the masters are load sharing and to prevent a load imbalance at the end of the application. We have implemented a tapered allocation in which the chunk sizes start out large and slowly taper off over time, as has been proposed in other scheduling contexts [8]. The idea is that large chunk sizes reduce communication overhead early on, and when load balance becomes more critical near the end of the application, smaller chunk sizes are used. We used a fixed tapered allocation consisting of smaller and smaller chunk sizes. For example, the 10 msec/task problem contained 240,000 tasks (for a 40-minute simulation) and we used master chunk sizes of 80,000, 60,000, 40,000, 25,000, 20,000, and 15,000 tasks, respectively. The main program allocates chunks to the masters in this order, i.e., the first master request is given 80,000 tasks, the next master request is given 60,000, and so on. The chunk sizes for the other problems are similar.

We compared tapering to the best static technique (static-comp-and-comm). The results indicate that the dynamic tapered scheme outperforms the best static scheme (Table 4). Interestingly, the dynamic scheme also outperforms the static scheme under no load. The reason is that the static scheme evenly allocates the first 10% of the workload in order to decide how to divide the rest; this creates an initial load imbalance for the first chunk of work, since the main program must wait for the results from both sites before dividing the remaining 90% of the work. The same discrepancy is seen for the .5, .5 scenario, since the generated load is the same at the two sites (i.e., no load variance between them) and no additional load imbalance beyond the initial 10% of the workload should occur. However, as the load variance between the two sites increases, the dynamic scheme is much less sensitive to load fluctuation than the static scheme. We speculate that a more adaptive tapering scheme would likely perform even better, and we are beginning to explore such an algorithm.
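The fixed taper for the 240,000-task problem can be sketched as the main program's response to successive master requests; the chunk sizes are those listed above, while the function, its names, and the fallback of reusing the last size for any additional requests are ours.

    #include <stdio.h>

    /* Tapered master chunk schedule for the 240,000-task (40 min, 10 msec/task)
     * problem; the six sizes sum to the full workload. */
    static const long taper[] = { 80000, 60000, 40000, 25000, 20000, 15000 };
    static const int  taper_len = (int)(sizeof(taper) / sizeof(taper[0]));

    static int  next_index = 0;      /* which taper entry to hand out next */
    static long tasks_left = 240000;

    /* Called each time a master asks the main program for work; returns the
     * number of tasks granted (0 once the workload is exhausted).  If more
     * requests arrive than there are taper entries, the last size is reused
     * (an assumption; the text lists only six chunks for this problem). */
    static long next_master_chunk(void)
    {
        long n;
        if (tasks_left <= 0)
            return 0;
        n = (next_index < taper_len) ? taper[next_index++] : taper[taper_len - 1];
        if (n > tasks_left)
            n = tasks_left;
        tasks_left -= n;
        return n;
    }

    int main(void)
    {
        long n;
        while ((n = next_master_chunk()) > 0)
            printf("granted %ld tasks (%ld remaining)\n", n, tasks_left);
        return 0;
    }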

Table 4: Dynamic vs. static work allocation (40-minute problems; times in seconds; load probability given as UTSA, SwRI)

problem                                              allocation   no load   .5, .5   .7, .5   .9, .5
10 msec/task, 100 tasks/chunk (per slave);           static       361.74    414.28   494.51   572.54
  80,60,40,25,20,15 K tasks (master chunks)          dynamic      310.22    357.89   371.15   405.67
4 msec/task, 500 tasks/chunk (per slave);            static       453.43    512.44   622.39   723.13
  200,150,100,62.5,50,37.5 K tasks (master chunks)   dynamic      368.14    426.54   478.49   497.91
10 msec/task, 500 tasks/chunk (per slave);           static       314.84    375.91   452.47   517.07
  80,60,40,25,20,15 K tasks (master chunks)          dynamic      277.82    338.06   357.98   365.79


4.0 Related Work

A number of related projects are exploring the scheduling of applications consisting of independent tasks in distributed systems, including [2][4][5][6][7]. The Nile project [2] is building a high-performance wide-area computing environment for high energy physics applications. The objective of Nile is to allocate tasks (or jobs) to computers near a dependent data source; Nile is concerned more with job throughput than with application completion time. EveryWare is a system designed to support the execution of large coarse-grain task independent applications in a variety of wide-area environments [4]; in contrast, GlobalSim focuses on fine- to medium-grain tasks and the scheduling issues that arise for these applications. Globus [5] and Legion [6] are metacomputing infrastructure projects that encapsulate scheduling policies in user-provided modules, and specific parameter study applications have been successfully developed using both systems. In GlobalSim, our focus is the construction of general and broadly applicable scheduling strategies suitable for a wide range of task independent and task synchronous applications. In Condor [7], a collection of idle computers within a LAN can be used to perform load sharing of large task independent applications such as parameter studies; recently, a more sophisticated resource management architecture has been adopted that allows more flexible resource management based on the concept of resource advertisements. Nimrod [1] is a tool designed specifically for the execution of large parameter studies in distributed networks. Its focus has been on building an easy-to-use tool rather than on the development of scheduling policies. Nimrod has also been integrated with Globus (Nimrod-G) to provide a wide-area execution capability.

5.0 Summary

We have described a scalable architecture for executing large fine- to medium-grain task independent applications, such as simulation parameter studies, across wide-area networks. The scheduling of workload must be done carefully to mitigate the overheads inherent in wide-area communication and to prevent bottlenecks from forming. We presented experimental evidence that showed the importance of a hierarchical workload distribution and of the selection of workload granularity at multiple levels. We also showed that dynamic scheduling must be performed at multiple levels in order to provide effective load sharing in the likely event that the network environment is shared.

This work is a first step toward our ultimate objective of providing fully automated scheduling support within GlobalSim. Ideally, the programmer would provide a few parameters to the system, such as the number of tasks and perhaps the average task granularity, and the system would automatically select the needed resources and the appropriate scheduling parameters. The next step is an extensive experimental evaluation that includes a large number of different problem sizes, problem granularities, and multi-level chunk sizes in our network testbed. This data will be used to derive equations for selecting the appropriate scheduling parameters and the appropriate number of masters and slaves as a function of the application.

Future work also includes an investigation of task synchronous applications, in which the issue of load balance becomes more crucial.

6.0 References

[1] D. Abramson et al., “Nimrod: A Tool for Performing Parameterised Simulations using Distributed Workstations,” Proceedings of the Fourth IEEE International Symposium on High Performance Distributed Computing, August 1995.
[2] A. Amoroso, “Wide-area Nile: A Case Study of a Wide-area Data-parallel Application,” The 18th International Conference on Distributed Computing Systems, 1998.
[3] H. Bal et al., “Optimizing Parallel Applications for Wide-Area Clusters,” Twelfth International Parallel Processing Symposium, March 1998.
[4] EveryWare. URL: http://nws.npaci.edu/EveryWare.
[5] I. Foster and C. Kesselman, “Globus: A Metacomputing Infrastructure Toolkit,” International Journal of Supercomputing Applications, 11(2), 1997.
[6] A. S. Grimshaw and W. A. Wulf, “The Legion Vision of a Worldwide Virtual Computer,” Communications of the ACM, 40(1), 1997.
[7] M. J. Litzkow et al., “Condor - A Hunter of Idle Workstations,” Proceedings of the 8th International Conference on Distributed Computing Systems, June 1988.
[8] C. Polychronopoulos and D. Kuck, “Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers,” IEEE Transactions on Computers, C-36(12), December 1987.
[9] J. B. Weissman, “Gallop: The Benefits of Wide-Area Computing for Parallel Processing,” Journal of Parallel and Distributed Computing, 54(2), November 1998.
[10] J. B. Weissman and P. Jiang, “GlobalSim: A System for Scheduling Large-Scale Simulation Applications,” UTSA Technical Report CS-99-1, February 1999.
