A RESOURCE PROVISIONING SYSTEM FOR SCIENTIFIC WORKFLOW APPLICATIONS
by Gideon Mark Juve
A Thesis Presented to the
FACULTY OF THE USC VITERBI SCHOOL OF ENGINEERING
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
MASTER OF SCIENCE (COMPUTER SCIENCE)
December 2008
Copyright 2008
Gideon Mark Juve
Table of Contents

List of Tables    iv
List of Figures    v
Abstract    vi
Chapter 1: Introduction    1
Chapter 2: Scientific Workflows    3
  2.1 Workflow representation    3
  2.2 Workflow management    4
  2.3 Example workflow applications    4
    2.3.1 CyberShake    4
    2.3.2 Montage    7
Chapter 3: Resource Provisioning for Workflow Applications    10
  3.1 Grid Computing    10
  3.2 Issues for Workflow Applications on the Grid    10
  3.3 Workflow Restructuring    11
  3.4 Resource Provisioning    12
  3.5 Advance Reservations    13
  3.6 Multi-level Scheduling and Personal Clusters    14
  3.7 Condor Glideins    17
  3.8 Existing Multi-level Scheduling Systems    17
Chapter 4: A Resource Provisioning System for Workflow Applications    20
  4.1 Design Goals    20
  4.2 System Design    21
    4.2.1 Architecture    21
    4.2.2 System Components    22
    4.2.3 Service Operation    24
  4.3 Glidein Service Implementation    24
    4.3.1 Overview    24
    4.3.2 Resources    25
    4.3.3 Resource Managers    25
    4.3.4 Batch Job Management    25
    4.3.5 State Management    26
    4.3.7 Fault Tolerance and Recovery    27
    4.3.8 Network Issues    27
  4.4 Features    28
    4.4.1 Command-line interface    28
    4.4.2 Service API    29
    4.4.3 Asynchronous Notifications    29
    4.4.4 Glidein Resubmission    29
    4.4.5 Resource History    30
Chapter 5: Resource Provisioning System Evaluation    31
  5.1 Resource Provisioning Overhead    31
  5.2 Job Execution Delays    32
  5.3 Workflow Runtime    32
  5.4 Storage Requirements    34
Chapter 6: Conclusion    36
  6.1 Future Work    36
  6.2 Summary    37
References    39
Appendix: Resource State Transition Diagrams and Tables    43
List of Tables

Table 2.1: CyberShake task characteristics for mesh generation workflow    6
Table 2.2: CyberShake task characteristics for rupture simulation workflow    6
Table 2.3: Montage task characteristics for a 2-degree mosaic workflow    8
Table 3.1: A comparison of existing multi-level scheduling systems    17
Table 5.1: Average provisioning overheads at several grid sites (in seconds)    31
Table 5.2: Average no-op job runtime at several grid sites (in seconds)    32
Table 5.3: Sizes of files used by Condor glideins (in KB)    34
Table 5.4: Storage required for multiple nodes (in MB)    35
Table A.1: Site State Transitions    43
Table A.2: Glidein State Transitions    44
List of Figures

Figure 2.1: DAG representation of a workflow    3
Figure 2.2: Hazard map (left) and hazard curves (right) generated by CyberShake    5
Figure 2.3: Schematic CyberShake mesh generation workflow    5
Figure 2.4: Schematic CyberShake rupture simulation workflow    6
Figure 2.5: Images of the Rho Oph dark cloud (left) and the Black Widow Nebula (right) generated by Montage    7
Figure 2.6: Schematic Montage workflow    8
Figure 3.1: Multi-level scheduling    15
Figure 4.1: Glidein Service architecture    22
Figure 4.2: Multi-level scheduling system components    23
Figure 4.3: Glidein Service components    25
Figure 4.4: Example meta-workflow containing resource provisioning jobs    29
Figure 5.1: Resource provisioning stages    31
Figure 5.2: Runtime comparison for a 1-degree Montage workflow    33
Figure 5.3: Utilization comparison for a 1-degree Montage workflow    34
Figure A.1: Site State Diagram    43
Figure A.2: Glidein State Diagram    44
Abstract

The development of grid and workflow technologies has enabled complex, loosely coupled scientific applications to be executed on distributed resources. Many of these applications consist of large numbers of short-duration tasks whose runtimes are heavily influenced by delays in the execution environment. Such applications often perform poorly on the grid because of the large scheduling overheads commonly found in grid environments. In this work we address the performance issues of these applications through the use of resource provisioning techniques. We present a provisioning system based on multi-level scheduling that improves workflow runtime by reducing task scheduling overheads, by reserving resources for the exclusive use of the application, and by giving users control over scheduling policies. This system is shown to reduce the runtime of fine-grained workflow applications by as much as 90% compared to traditional methods.
Chapter 1: Introduction

Scientists in fields such as high-energy physics [12], earthquake science [5, 13], and astronomy [24] are developing applications to orchestrate large-scale, data-intensive scientific analyses. These applications are commonly expressed as workflows containing computational tasks arranged in a hierarchy according to their data dependencies. The largest of these workflows contain millions of tasks and require thousands of hours of aggregate computation time. In many cases, however, individual tasks are serial programs that may take only a few seconds to run. This combination of numerous tasks and short runtimes creates unique challenges for workflow applications.

Because of the large amount of computation time required, executing workflows on a single computer is not feasible. In order to produce results in a reasonable amount of time, computations must be outsourced to high-performance computing centers and other external resource providers where they can be executed in parallel. This outsourcing is typically accomplished by submitting workflow tasks to remote clusters via the Grid [17]. Unfortunately, workflows with many fine-grained tasks do not perform well using this approach because of the large scheduling overheads and queuing delays commonly found on the Grid.

Several different techniques have been developed to remedy this situation. Clustering [42] increases the granularity of tasks in a workflow by grouping similar tasks together. This improves performance by amortizing overheads and delays over all the tasks in a cluster, but has the undesirable side-effect of reducing parallelism. Advance reservations [48, 49] can be used to pre-allocate resources for computations. This improves performance by eliminating batch queuing delays; however, many grid sites do not support reservations or, if they do, they charge users a premium for them.

In this thesis we investigate the use of multi-level scheduling [39] as an alternative approach for running workflows on the Grid. Multi-level scheduling is a resource provisioning technique that allows resources to be allocated from grid sites to create temporary, user-managed resource pools called personal clusters. Workflow tasks executed on personal clusters encounter fewer queuing delays because access to resources is not shared. In addition, personal clusters reduce scheduling overheads by allowing scheduling
policies and configuration to be defined at the application level. By assuming control over scheduling, the user is able to streamline the dispatch and execution of tasks to avoid many of the performance penalties of running workflows on the Grid.

The remainder of this thesis is organized as follows. In Chapter 2 we describe workflows in more detail and discuss the unique characteristics of workflow applications using two example applications. Chapter 3 describes how the traditional approach to executing workflows on the Grid can lead to poor performance and explains how multi-level scheduling and personal clusters can be used to create more efficient execution platforms for workflow applications. In Chapter 4 we describe the design and implementation of a resource provisioning service based on multi-level scheduling that can be used to create personal clusters out of grid resources. Chapter 5 presents some experiments that were conducted to evaluate the performance and benefits of this service. Finally, in Chapter 6 we conclude and discuss areas for future work.
Chapter 2: Scientific Workflows

2.1 Workflow representation

Workflows can be expressed as directed acyclic graphs (DAGs) as shown in Figure 2.1. Tasks in a DAG are arranged in a hierarchy according to their dependencies. Each task has an outgoing edge connecting it to all other tasks that are dependent on it. Edges typically represent data dependencies such as input and output files, but they may also be used to impose a specific ordering on task execution or to represent non-data dependencies, such as dependencies on shared resources.
Figure 2.1: DAG representation of a workflow

A DAG can be divided into levels based on dependencies. Tasks at level 1 are not dependent on any other tasks in the workflow. Each subsequent level is dependent on tasks from the level above it and any previous levels. For example, level 2 is dependent on level 1, level 3 is dependent on levels 1 and 2, and so on. DAGs are executed in order starting with tasks in level 1. The workflow is re-evaluated when each task terminates to determine if any other tasks have had all of their dependencies resolved. Any tasks that have become free of dependencies are submitted for execution. Tasks with no dependencies between them can be executed concurrently. This is an important characteristic for large-scale workflows that require many hours of aggregate computational time. Tasks in these workflows can be submitted to a cluster and executed on separate processors in parallel.
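The level computation and the ready-task check described above are simple to express. The following Java fragment is a minimal sketch, assuming tasks are numbered 0..n-1 and dependencies are stored as parent lists; it is an illustration of the idea, not code from any workflow system discussed here.

```java
import java.util.*;

public class DagLevels {
    // parents.get(t) lists the tasks that task t depends on.
    static int[] computeLevels(List<List<Integer>> parents) {
        int n = parents.size();
        int[] level = new int[n];
        boolean changed = true;
        // Relax levels to a fixed point; an acyclic graph converges in at most n passes.
        while (changed) {
            changed = false;
            for (int t = 0; t < n; t++) {
                int lvl = 1; // a task with no parents sits at level 1
                for (int p : parents.get(t)) lvl = Math.max(lvl, level[p] + 1);
                if (lvl != level[t]) { level[t] = lvl; changed = true; }
            }
        }
        return level;
    }

    // Re-evaluation after a task terminates: a task is ready for submission
    // when it has not finished and all of its parents have.
    static List<Integer> readyTasks(List<List<Integer>> parents, Set<Integer> done) {
        List<Integer> ready = new ArrayList<>();
        for (int t = 0; t < parents.size(); t++) {
            if (!done.contains(t) && done.containsAll(parents.get(t))) ready.add(t);
        }
        return ready;
    }
}
```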
2.2 Workflow management

There are a variety of software systems available for planning and executing workflow applications, including: Moteur [21], Kepler [28], Taverna [35], Triana [8], Swift [50] and others [43]. These systems implement many features that are useful for automating complex scientific analyses. In the area of data management they support the generation of metadata to track the characteristics and provenance of scientific datasets. For portability and modularization they support late binding of application binaries to workflow tasks. For optimization they support the execution of tasks using workflow-level scheduling heuristics [2, 1]. And for failure recovery and fault tolerance they detect errors, retry failed jobs, and provide checkpoints when problems cannot be resolved automatically.

Pegasus [14] is a good example of one of these workflow management systems. It transforms abstract workflow descriptions into concrete DAGs that can be executed on grid resources. This process involves identifying the appropriate executables and input files needed for the workflow, translating logical paths and file names into physical ones, and inserting directory creation and data transfer jobs into the workflow. In addition, Pegasus performs many useful workflow transformations such as task clustering, workflow reductions, and workflow partitioning. The concrete DAGs generated by Pegasus are executed on distributed resources using Condor [27] and DAGMan [11].
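For a concrete sense of the execution format, a concrete DAG is handed to DAGMan as a plain-text file that names a Condor submit description for each job and lists the parent-child edges. The fragment below is a hand-written sketch of such a file (the job and file names are invented), not actual Pegasus output:

```
# Minimal DAGMan input file: tasks B and C depend on task A
JOB A a.sub
JOB B b.sub
JOB C c.sub
PARENT A CHILD B C
# Retry B up to 2 times if it fails
RETRY B 2
```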
2.3 Example workflow applications

2.3.1 CyberShake

CyberShake [5, 13] is a probabilistic seismic hazard analysis (PSHA) application that estimates the level of shaking expected to result from future earthquakes. These estimates are calculated using statistical ground motion models created by simulating thousands of scenario earthquakes for each geographic location, or site, of interest. Two types of scientific output are produced from these models: hazard curves and hazard maps (Figure 2.2). Hazard curves illustrate the probability of exceeding a certain level of shaking at a site over time. Hazard curves from multiple sites in a geographic region can be combined and interpolated to create 2D hazard maps showing which areas are likely to experience strong shaking in the future. In addition to their scientific value, these products can be used by urban planners,
emergency responders, civil engineers, and insurance companies to help them assess and mitigate the risk that earthquakes pose to people and structures.
Figure 2.2: Hazard map (left) and hazard curves (right) generated by CyberShake

The calculations for each CyberShake site are split into two separate workflows: a mesh generation workflow and a rupture simulation workflow. The mesh generation workflow uses parallel seismic wave simulation codes to generate a model of the Earth’s crust near the site. The model is composed of seismic wave velocities and strain Green tensors (SGTs) for points in a 3D mesh surrounding the site. The model is produced by a workflow that contains three tasks as shown in Figure 2.3. The computational requirements of these tasks are shown in Table 2.1.
Figure 2.3: Schematic CyberShake mesh generation workflow
Table 2.1: CyberShake task characteristics for mesh generation workflow

| Task | Description | Number of Processors | Job Runtime¹ |
|---|---|---|---|
| Mesh Generation | Create a seismic velocity mesh containing points around the site of interest | 160 | 3 hours |
| Merge Mesh | Merge the X, Y, Z components of the mesh into a single file | 8 | 2 hours |
| SGT Generation | Use the velocity mesh to create strain Green tensors (SGTs) | 400 | 32 hours |

¹ Approximate runtime using X86_64 nodes on the USC HPCC cluster.
The rupture simulation workflow uses the model created by the mesh generation workflow to calculate synthetic seismograms and peak ground motions for many scenario earthquake ruptures. Figure 2.4 shows a schematic representation of the CyberShake rupture simulation workflow and Table 2.2 shows the characteristics of the simulation tasks.
Figure 2.4: Schematic CyberShake rupture simulation workflow

Table 2.2: CyberShake task characteristics for rupture simulation workflow

| Task | Description | Tasks per Workflow | Task Runtime² |
|---|---|---|---|
| SGT Extraction | Extracts SGTs relevant to a given earthquake rupture scenario | 7,000 | 139 sec |
| Seismogram Synthesis | Generates a synthetic seismogram using extracted model data and slip parameters for a given rupture scenario | 420,000 | 48 sec |
| Peak Ground Motion | Calculates peak ground motion intensity measurements (IMs) for a given synthetic seismogram | 420,000 | 1 sec |

² Average runtime using IA-64 nodes on NCSA’s Mercury cluster.
Unlike the mesh generation workflow, which consists of large, parallel MPI jobs, the rupture simulation workflow contains nearly 850,000 serial tasks with an average weighted runtime of less than 30 seconds each. Efficiently executing these fine-grained tasks is a significant challenge for CyberShake developers.
2.3.2 Montage

In the field of astronomy there are several survey projects that are using telescopes to collect high-resolution images covering the entire sky. These images are stored in centralized repositories and databases to facilitate research by the science community. The primary users of these databases are astronomers interested in studying images of celestial objects such as galaxies or nebulas. Individual images stored in a survey database often cover only a portion of an object being studied. To create a complete picture of the object, multiple images need to be combined into a mosaic. Montage [24] is a set of image processing tools that enable the creation of high-quality image mosaics by stitching together smaller images. Figure 2.5 shows two example mosaics created by Montage.
Figure 2.5: Images of the Rho Oph dark cloud (left) and the Black Widow Nebula (right) generated by Montage

Montage uses workflow technology to automate the many computations required to produce each mosaic. A Montage workflow takes as input a series of images that cover a section of the sky, re-projects them, performs background rectification and other image-processing steps, and then adds the images together to form a single mosaic as output. Figure 2.6 shows a schematic representation of a typical Montage workflow and Table 2.3 shows its task characteristics.
Figure 2.6: Schematic Montage workflow

Table 2.3: Montage task characteristics for a 2-degree mosaic workflow

| Task | Description | Tasks per Workflow | Task Runtime³ |
|---|---|---|---|
| mProject | Re-projects a single image to the desired coordinate system | 180 | 6 sec |
| mDiffFit | Finds the difference image between two adjacent images and fits a plane to that difference image | 1,010 | 1.4 sec |
| mConcatFit | Merges plane fit parameters into a single output | 1 | 44 sec |
| mBgModel | Models the sky background using the plane fit parameters and computes planar corrections for input images | 1 | 32 sec |
| mBackground | Rectifies the background in a single image | 180 | 0.8 sec |
| mImgtbl | Extracts the image geometry information from a set of files and stores it in a single image metadata table | 1 | 3.5 sec |
| mAdd | Co-adds a set of re-projected images to produce the desired mosaic | 1 | 60 sec |

³ Average runtime on a typical X86 processor.
Like the CyberShake rupture simulation workflow, Montage workflows are very fine-grained. A 2-degree Montage workflow contains over 1,300 tasks with an average weighted runtime of only 2 seconds. Larger 4- and 6-degree Montage workflows contain 3,000 and 6,000 tasks respectively, and have equally short runtimes. Efficiently running such fine-grained workflows requires low-latency, high-throughput task scheduling.
Chapter 3: Resource Provisioning for Workflow Applications

3.1 Grid Computing

The computing power needed to execute large-scale scientific workflows in a reasonable amount of time often exceeds the capabilities of the resources owned by the scientist or scientific collaboration. As a result, workflow computations are typically outsourced to high-performance computing centers where resources are more readily available. Access to these centers is provided through the Grid [17]. The Grid is a distributed computing infrastructure that provides common abstractions, protocols and middleware for sharing access to computational and storage resources across administrative boundaries and wide-area networks. Grid collaborations, such as the TeraGrid [44] and the Open Science Grid [36], are organized as a series of independent sites connected via high-speed networks. Each site provides computational resources in the form of space-shared supercomputers and clusters. Users access these resources by submitting batch jobs to each site’s local resource manager (LRM) from a remote submission host using grid protocols such as GRAM [10]. Resources are typically allocated to jobs with best-effort quality of service using a queue-based provisioning model. When a user submits a job it is placed in a queue with other users’ jobs. Jobs in the queue are matched to resources according to the site’s scheduling policies.
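For illustration, a job routed to a site's LRM through GRAM is typically described by a Condor-G submit file like the sketch below; the hostname and jobmanager are placeholders, not a real site:

```
# Condor-G submit description for one batch job sent to a remote site via GRAM
universe      = grid
grid_resource = gt2 gatekeeper.example.edu/jobmanager-pbs
executable    = my_task
output        = task.out
error         = task.err
log           = task.log
queue
```

Condor-G forwards the job to the site's gatekeeper, which hands it to the named jobmanager (here PBS) for queuing and scheduling.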
3.2 Issues for Workflow Applications on the Grid

Although the traditional method for accessing grid resources described above works well for many different types of applications, it is often inefficient for large-scale workflow applications. Many grid sites are not optimized for the large number of short-running, serial jobs that are commonly found in large-scale workflows. Submitting these jobs directly to a site’s LRM is not ideal for several reasons.

First, because LRMs support many advanced features, such as parallel processing, multiple queues, advance reservations, and job prioritization, they often have large scheduling overheads and long scheduling intervals. For many workflows, such as CyberShake and Montage, these delays are longer than the jobs themselves, which leads to poor resource utilization and low throughput. In addition, in workflows
with many short-running jobs these delays are incurred many times over, causing poor performance and greatly increased workflow runtimes.

Second, on space-shared systems competition with other users for access to resources can result in long queue times when resources are oversubscribed. This is problematic for workflow applications because jobs at each level of the workflow must wait until their ancestors have completed before they can be submitted to the queue. This causes queue delays to be added at every level in the workflow and can significantly increase the overall runtime of a workflow application.

Finally, grid sites often have scheduling policies that are unfriendly to workflow jobs. Many sites prioritize large parallel jobs over smaller serial jobs, leading to increases in queuing delays. Some sites also place an upper limit on the number of queued jobs per user, which limits the number of workflow jobs that can be executed concurrently.
3.3 Workflow Restructuring

Workflow management systems such as Pegasus are able to avoid some of these issues through the use of workflow restructuring techniques such as task clustering, a technique in which several independent tasks in the workflow are grouped into a single physical job [42]. Clustering in Pegasus can be done horizontally by grouping independent jobs at the same level in the workflow, vertically by grouping parents with their children, or by grouping jobs based on labels provided in the DAG.

Clustering results in two benefits for workflow execution: it decreases the number of jobs, thus decreasing scheduling costs and reducing total queuing delay, and it amortizes scheduling overheads by increasing average job runtimes, thus improving utilization. However, clustering also has some negative side-effects which can increase the overall runtime of a workflow. First, the longer job runtimes produced by clustering can delay jobs by preventing the scheduler from taking advantage of backfilling opportunities to schedule those jobs in smaller slots. Second, horizontal clustering decreases the number of jobs at each level of the workflow, thus reducing opportunities for parallel execution. Finally, vertical clustering restructures the parent-child relationships in the workflow, which can cause jobs to be delayed for longer than they would have been in a non-clustered workflow.
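A minimal sketch of horizontal clustering, assuming each task already carries the level number described in Section 2.1 and ignoring how the grouped tasks are wrapped into a physical job:

```java
import java.util.*;

public class HorizontalClustering {
    record Task(String id, int level) {}

    // Group tasks at the same workflow level into clusters of at most
    // `size` tasks; each returned group would become one physical job.
    static List<List<Task>> cluster(List<Task> tasks, int size) {
        Map<Integer, List<Task>> byLevel = new TreeMap<>();
        for (Task t : tasks) {
            byLevel.computeIfAbsent(t.level(), k -> new ArrayList<>()).add(t);
        }
        List<List<Task>> clusters = new ArrayList<>();
        for (List<Task> level : byLevel.values()) {
            for (int i = 0; i < level.size(); i += size) {
                clusters.add(new ArrayList<>(
                        level.subList(i, Math.min(i + size, level.size()))));
            }
        }
        return clusters;
    }
}
```

The trade-off described above is visible in the cluster size: larger clusters amortize more overhead per job, but leave fewer concurrent jobs at each level.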
3.4 Resource Provisioning

The execution of jobs in grid environments is typically based on a queuing model for resource allocation that provides best-effort quality of service. In this model a job is queued until all jobs ahead of it in the queue have been scheduled and it can be matched with available resources for execution. This ensures that access to resources is shared equally and fairly among all users of the system and that resource utilization is maximized. As a side effect of this model, individual jobs are frequently delayed for long periods as they wait for other jobs to finish and for resources to become available. In many cases these delays are longer than the actual runtime of the job. For applications that consist of a single job, or of several jobs that can all be submitted in parallel, this situation is not ideal, but is tolerable because delays are only encountered once. For workflow applications with complex job hierarchies and dependencies that force jobs to be submitted serially, this has a detrimental effect on performance because it causes delays to be accumulated for each job.

One way to improve workflow performance is to use a model for resource allocation based on provisioning [40]. In a provisioning model resources are reserved for the exclusive use of a single user for an extended period of time. This minimizes queuing delays because the user’s jobs no longer compete with other jobs for access to those resources. Furthermore, in contrast to the queuing model where resource allocation occurs as a side-effect of job scheduling and is not explicit, in the provisioning model resources are leased by the user for a fixed amount of time that is independent of the scheduling of jobs. This allows the resources to be used for multiple jobs while incurring the overhead of resource allocation only once. This is especially useful for workflow applications because it enables sequential jobs to be scheduled more efficiently than best-effort queuing.

Resource provisioning is slightly more complex than normal batch queuing mechanisms because it requires users to make resource allocation decisions explicitly, rather than relying on the automatic allocation provided by the queuing model. There are two policies that can be used to guide these decisions. In static provisioning the user allocates all resources required for the application before any jobs are
submitted, and releases the resources after all the jobs have finished. This method assumes that the number of resources required is known or can be predicted in advance. In dynamic provisioning resources are allocated on-demand at runtime. This allows the pool of available resources to grow and shrink according to the changing needs of the application. Dynamic provisioning does not require advance knowledge of resource needs, but it does require additional policies to automatically decide when to acquire and release resources.
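The decision logic of a dynamic policy can be very small. The method below is an invented illustration of one such policy (grow to match the idle-job backlog, release everything when the queue drains), not the interface of any system described in this chapter:

```java
public class DynamicProvisioningPolicy {
    // Returns the number of nodes to request (positive) or release
    // (negative), given the number of idle jobs, the current pool size,
    // and an upper bound on the pool.
    static int adjustment(int idleJobs, int poolSize, int maxPool) {
        if (idleJobs > poolSize) {
            // Grow toward the backlog, but never beyond the allowed maximum.
            return Math.max(0, Math.min(idleJobs, maxPool) - poolSize);
        }
        if (idleJobs == 0 && poolSize > 0) {
            return -poolSize; // nothing waiting: release all provisioned nodes
        }
        return 0; // backlog fits in the current pool
    }
}
```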
3.5 Advance Reservations

One resource provisioning mechanism commonly used on the grid is advance reservations. Users create advance reservations by requesting slots from the batch scheduler that specify the number of resources to reserve and the start and end times of the reservation. During the reservation period the scheduler runs only the user’s jobs on the reserved resources. This feature is supported by many popular batch schedulers, including Maui [30], Moab [31], PBS Pro [38], and the latest version of Sun Grid Engine [20].

Advance reservations can help increase the performance of workflow applications dramatically, but they suffer from several important disadvantages. One disadvantage is limited support by resource providers. Although batch schedulers used by many resource providers have advance reservation features, few providers support the use of reservations. This may be because reservations have been shown to have a negative impact on resource utilization and queuing delays for non-reservation jobs [34]. Singh, et al. conducted a survey of advance reservation capabilities at several grid sites [40]. They discovered that 50% of the sites surveyed did not support reservations at all, and that most of the sites that did support reservations required administrator assistance to create them. Only a few sites allowed users to create their own reservations. This lack of support makes using advance reservations time-consuming and cumbersome.

Another disadvantage to advance reservations is increased cost. Users of advance reservations are typically charged a premium for dedicated access to resources. Furthermore, users are forced to pay for the
entire reservation, even if they are not able to use it all (e.g. if there is a failure that causes the application to abort, or if the actual runtime of the application is shorter than predicted).

An alternative to scheduler-based advance reservations is the use of probabilistic advance reservations [34]. In this method reservations are made based on statistical estimates of queue times. The estimates allow jobs to be submitted with a high probability of starting some time before the desired reservation begins. This allows “virtual reservations” to be created by adjusting the runtime of the job to cover both the time between the submission of the job and the desired reservation start time, and the duration of the reservation itself. The advantage of this method is that it does not require the target cluster to support any special features. The disadvantages of this method are that 1) the reservation is not guaranteed because the actual queue delay may exceed the predicted delay, and 2) the final cost of the reservation is difficult to predict because the actual runtime of the job may exceed the desired reservation time.
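The walltime arithmetic behind such a virtual reservation is simple. A minimal sketch, assuming the statistical estimator is used separately to pick a submission time early enough that the job will likely be running by the desired start:

```java
public class VirtualReservation {
    // Walltime to request so that a job submitted now and started any time
    // before the desired start is still alive for the whole reservation.
    static long requestedWalltimeSec(long submitEpochSec,
                                     long desiredStartEpochSec,
                                     long reservationDurationSec) {
        // Cover the gap from submission to the desired start, plus the
        // reservation itself.
        return (desiredStartEpochSec - submitEpochSec) + reservationDurationSec;
    }
}
```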
3.6 Multi-level Scheduling and Personal Clusters

Many of the performance issues encountered by workflow applications on the grid arise because resource providers control both the management of resources and the scheduling of application jobs. In multi-level scheduling [39] these two functions are divided between the user and the provider. Providers retain their authority over resources, but users are given control over scheduling. This division is accomplished by creating personal clusters [26, 46], which are temporary pools of computational resources provisioned from a grid site and managed by the user. After a personal cluster is created it can be used to schedule and execute application jobs using policies specified by the user based on the needs of the application.

The process of creating a personal cluster is illustrated in Figure 3.1. A resource provisioner requests nodes by submitting provisioning jobs to a grid site using standard mechanisms. From the perspective of the site’s resource manager, these jobs are indistinguishable from normal user-level jobs. However, instead of running an application program, the provisioning jobs install and run guest node managers on the site’s worker nodes. The guest managers contact a pre-configured application resource
manager owned by the user, thus becoming a part of the user’s personal cluster. The newly acquired nodes are then matched with application jobs for execution. With this process the user is able to acquire computational resources from providers and schedule jobs on those resources using application-specific policies.
Figure 3.1: Multi-level scheduling

The use of personal clusters and multi-level scheduling leads to many important benefits for workflow applications. Multi-level scheduling allows users to manage scheduling policy at the application level. Custom scheduling configurations can be used to eliminate overheads and improve the runtime characteristics of workflow applications. With multi-level scheduling workflow jobs are submitted directly to an application-specific scheduler for execution on the personal cluster. They do not pass through the site’s LRM, thus avoiding many of the overheads associated with traditional resource access methods. Furthermore, application-specific schedulers can be specifically configured to minimize scheduling overheads, thus improving throughput for workflows with large numbers of small jobs. In addition, scheduling policies can be fine-tuned to fit the specific characteristics of the application. For example, workflow jobs with many dependants could be given priority over jobs with fewer dependants in order to create more opportunities for parallelism. Similarly, jobs at a higher level of the workflow could be given priority over jobs later in the workflow. Both of these policies are difficult to implement using traditional grid methods because there may be no way to communicate priorities to the remote scheduler, or to
guarantee that the remote scheduler honors the desired ordering of tasks. Using a custom scheduler also allows applications to take advantage of the many sophisticated task-scheduling algorithms available [2, 6, 29]. Together these optimizations can result in significant decreases in workflow runtime. Singh et al. have demonstrated some of the benefits of using application-specific scheduling parameters for workflow execution in [41].

Another advantage of multi-level scheduling is that it simplifies the use of resources from different providers by separating resource provisioning from scheduling. This allows personal clusters to be composed of resources gathered from multiple, independent resource providers. Personal clusters provide a uniform interface to heterogeneous resources for the application, allowing the application to use a single abstraction for all jobs.

The cost of multi-level scheduling is less than that of alternative methods such as advance reservations. Because multi-level scheduling acquires resources through traditional channels, it imposes no additional usage cost. In addition, users only pay for resources that they actually use because personal clusters can be dynamically sized to fit application requirements by simply submitting or canceling resource provisioning requests to increase or decrease the size of the available resource pool.

Multi-level scheduling can also help workflow applications take advantage of common site scheduling policies that are usually harmful to workflows. For example, multi-level scheduling enables users to make provisioning requests for several resources at once. This allows workflows to benefit from scheduling policies that give higher priorities to jobs requiring multiple nodes. It also allows workflows to compete favorably with parallel applications for access to resources.

The benefits of multi-level scheduling extend to resource providers as well. Offloading scheduling decisions allows providers to avoid creating scheduling policies for specific applications and users, thus simplifying cluster administration. It also reduces the load on cluster gateway nodes by eliminating the large number of application jobs that are submitted to the provider’s local resource manager during workflow execution. Finally, multi-level scheduling increases the utilization of resources by minimizing the amount of time each resource spends waiting to be matched with a job.
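As a sketch of the kind of application-level policy described earlier in this section, a personal cluster's scheduler could order queued jobs with a comparator that prefers jobs with more descendants, breaking ties in favor of jobs earlier in the workflow. The Job type here is invented for illustration:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class WorkflowJobOrdering {
    record Job(String id, int level, int descendants) {}

    // Prefer jobs that unlock the most downstream work; among equals,
    // prefer jobs at earlier (smaller) workflow levels.
    static final Comparator<Job> PRIORITY =
            Comparator.comparingInt(Job::descendants).reversed()
                      .thenComparingInt(Job::level);

    public static void main(String[] args) {
        PriorityQueue<Job> queue = new PriorityQueue<>(PRIORITY);
        queue.add(new Job("mAdd", 6, 0));
        queue.add(new Job("mProject_1", 1, 12));
        System.out.println(queue.poll().id()); // prints mProject_1
    }
}
```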
3.7 Condor Glideins

Glidein [18] is a multi-level scheduling technique based on the Condor [27] high-throughput workload management system. Using this technique, node managers called “glideins” are created by starting Condor worker daemons on remote cluster nodes. Upon startup, the worker daemons join a Condor pool administered by the user (i.e. a personal cluster) where they can be used to execute jobs submitted to the pool’s queue. The use of glideins has been shown to reduce runtimes for several large-scale workflow applications [40, 41].
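In configuration terms, a glidein is essentially a Condor startd pointed at the user's central manager rather than the site's. The following condor_config fragment is a hand-written sketch of the relevant settings (the hostname is a placeholder):

```
# Illustrative glidein worker configuration
# Report to the user's central manager, not the site's
CONDOR_HOST = personal-pool.example.edu
# Run only the daemons a worker node needs
DAEMON_LIST = MASTER, STARTD
# Accept any job from the personal pool
START = TRUE
# Exit if no job claims this node within 20 minutes, so idle
# allocations are not wasted
STARTD_NOCLAIM_SHUTDOWN = 1200
```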
3.8 Existing Multi-level Scheduling Systems

Multi-level scheduling is a complex process requiring a significant amount of setup, configuration and debugging. This complexity is being addressed through the development of middleware systems that automate much of the work of provisioning resources and installing and configuring resource managers on remote grid sites. Many of these systems are being built on top of existing resource managers, such as PBS [37, 38, 45], Sun Grid Engine [20], and Condor [27]. Leveraging off-the-shelf components allows multi-level scheduling systems to benefit from the fault tolerance, policy management, security, and scalability features supported by existing resource managers. In addition, many multi-level scheduling systems support advanced resource provisioning options, such as dynamic provisioning, automatic resubmission of provisioning requests, support for multiple resource providers, and load balancing between resource providers. Table 3.1 compares the features of these systems.

Table 3.1: A comparison of existing multi-level scheduling systems

| System | Resource Managers | Interfaces | Provisioning Policies | Firewall Negotiation | Resource Provider Interfaces |
|---|---|---|---|---|---|
| condor_glidein | Condor | command-line | static | none | Globus |
| glideinWMS | Condor | none (automatic) | dynamic | GCB | Globus |
| MyCluster | Condor, SGE, PBS | command-line | static | manager on head node, virtual networking | Globus, PBS, SGE, LoadLeveler, LSF, Condor, EC2 |
| Falkon | custom | API | static, dynamic | manager on head node | Globus |
| VGES | PBS | API | static | manager on head node | Globus, EC2 |
Condor_glidein [9] is a command-line tool that uses the glidein technique to create personal Condor pools using grid resources. It uses Condor-G [18] and the Globus Toolkit [16] to install and run glideins on remote grid sites. Condor_glidein is simple to use, but does not support any advanced provisioning features.

GlideinWMS [22] is a workload management system that is also based on Condor glideins. It supports dynamic provisioning by polling a Condor queue and creating glideins to service queued jobs. The provisioning policy used to decide when to request more glideins is customizable and can be tuned to meet the needs of many different applications. The system supports sophisticated credential management and security features, as well as firewall negotiation via GCB [19]. Because GlideinWMS is designed to be autonomous, it does not provide any interfaces that can be used for static provisioning.

MyCluster [46] creates personal clusters using the Condor, Sun Grid Engine, and OpenPBS resource managers. It can automatically maintain fixed-size pools by resubmitting resource requests as they expire, and it allows users to control the granularity of resource requests. It supports the creation of heterogeneous pools using resources allocated from multiple, independent grid sites, and is capable of migrating resource requests from one site to another based on evolving site conditions such as communications failures or increasing queue times. The system can configure node managers to contact a resource manager previously started by the user, or it can dynamically start a resource manager itself. MyCluster assumes that resource management software is pre-installed on the remote site and does not stage executables.

Falkon [39] is a multi-level scheduling system designed for high-throughput applications. It consists of a web service that accepts job requests, a provisioner that allocates resources from grid sites, and a custom node manager that executes application jobs on provisioned resources. The provisioner supports dynamic provisioning using several different resource acquisition policies. Although Falkon achieves very high throughput for computing tasks, it does not support many of the features provided by off-the-shelf resource managers such as configurable scheduling policies, fault tolerance, resource matching and parallel scheduling.
The VGES system [25] includes a Java API that creates personal clusters on the grid [26]. The system uses a custom version of the Torque resource manager [45] that has been modified to run in user mode. It creates personal clusters by starting Torque daemons on host clusters using grid protocols. Access to these personal clusters is provided through a user-level Globus gatekeeper that is started on the host cluster’s head node. The system assumes that both Torque and Globus are installed and configured on the remote site and does not stage executables.
Chapter 4: A Resource Provisioning System for Workflow Applications

In this chapter we describe the design and implementation of a resource provisioning system based on multi-level scheduling. The system creates personal clusters by provisioning resources from grid sites using the Condor glidein technique. In addition, the system provides many useful features including: the ability to stage resource management software on remote grid sites, command-line and programmatic interfaces for static or dynamic provisioning, automatic resubmission of provisioning requests, asynchronous notifications, and request history tracking.
4.1 Design Goals

When developing any new software system it is important to have guiding principles that can help in making design decisions. Based on our analysis of the existing multi-level scheduling systems and the requirements of workflow applications we developed the following list of goals for our system:

• Automate environment setup. Using multi-level scheduling on a remote Grid site requires that the resource manager software be installed and configured on the site. Many existing multi-level scheduling systems assume that the resource manager is pre-installed on the site by the user. This imposes additional burdens on the user to identify the characteristics of the remote system (OS, architecture, etc.), to identify the correct software version, and to install and configure the resource manager properly. This process makes using existing systems complicated and error-prone. Instead of relying on the user, our system should automate the setup process as much as possible while allowing the user to control details of the configuration where necessary or desirable.

• Minimize overheads. When running workflows on the Grid the main performance metric is time to solution. This makes the time required to acquire resources an important design criterion in the development of provisioning systems. Several of the existing multi-level scheduling systems transfer large executables for each provisioning request. This introduces overheads that delay the provisioning of resources. Our system should try to minimize these delays as much as possible by reducing the amount of data transferred for each request.

• Use standard technologies and conventions. The development of Grid technologies has been focused on creating uniform abstractions, protocols, and conventions for creating distributed computing systems. These elements enable heterogeneous systems to inter-operate, allow rapid development using pre-existing components, and provide a common language for talking about distributed systems. In developing our system we would like to leverage standard protocols, tools and technologies wherever possible.

• Provide multiple control interfaces. Complex software systems benefit from having multiple interfaces tailored to suit the needs of diverse clients. Alternative interfaces provide different access mechanisms and abstractions that cannot be offered through a single interface. Many existing multi-level scheduling systems provide only a single interface, and some provide no external interfaces at all. Our system should provide a powerful programmatic interface that can be used by third-party software tools, and a scriptable, easy-to-use command-line interface for users and administrators.

• Recover gracefully from failures. Component failures are common occurrences in distributed systems. This is especially true of grid environments that span administrative boundaries, support hundreds of simultaneous users, and operate on heterogeneous systems across wide-area networks. Any service that operates in such an environment should be able to recover from routine failures of system components. Our system should be able to recover the state of its resources and jobs after server failures.
4.2 System Design

4.2.1 Architecture

The system was developed using a client-server architecture as shown in Figure 4.1. It consists of one or more clients that make requests to a server hosting a grid service. The grid service communicates with one or more grid sites to fulfill the clients’ requests. Each grid site consists of a head node, several worker nodes, and a shared storage system that can be accessed by all nodes.
Figure 4.1: Glidein Service architecture
4.2.2 System Components

The components of the system and the functional relationships between them are shown in Figure 4.2. A description of the purpose and responsibilities of each of these components follows.

Glidein Service—The Glidein Service is the central component of the system. It accepts requests from clients, sets up the execution environment on the grid site, provisions resources using glideins, and cleans up files and directories created by the system.

Condor—Condor is used to process service and application jobs, and to manage glidein workers. Condor submits service jobs to the grid site using Condor-G [18], glidein workers contact the Condor central manager to join the user’s personal cluster, and application jobs are submitted to the Condor queue where they are matched to glidein workers for execution.

Delegation Service—The system uses standard grid security mechanisms [3] based on X.509 certificates [47]. Clients send their security credentials to the Delegation Service where they are stored for later use by the Glidein Service. These credentials are used for authentication when submitting jobs and transferring files to the grid site.

Staging Servers—The Glidein Service installs Condor on the remote grid site from bundles of executables and configuration files called packages. Each package contains a set of Condor worker node daemons for a different Condor version, system architecture and operating system. Staging servers are file servers used to host these packages. Any file server that supports the HTTP, FTP, or GridFTP protocols may be used as a staging server.
Replica Location Service (RLS)—The Replica Location Service (RLS) [7] is an existing Globus grid service that maps logical file names to physical locations. It is used by the Glidein Service to map package names to staging servers.

Setup Job—The setup job is submitted to the grid site to prepare a workspace for glideins. It runs an installer which determines the appropriate Condor package to use for the site, looks up the package in RLS to determine which staging servers have it, and downloads the package from the first available staging server. It then creates an installation directory and a working directory, and copies the Condor binaries into the correct location.

Glidein Job—The glidein job provisions worker nodes for the user’s personal cluster. Glidein jobs generate a Condor configuration script and launch Condor daemons on each allocated worker node. The Condor daemons register themselves with the central manager and are matched with application jobs for execution.

Cleanup Job—The cleanup job is submitted to the grid site to clean up the workspace used by the glideins. It runs an uninstaller which removes all log files, configuration files, executables and directories created by the service.
Figure 4.2: Multi-level scheduling system components
4.2.3 Service Operation

The process used by the service to provision resources has been divided into three phases: setup, provisioning, and cleanup. During the setup phase the remote site is configured to run glideins by the setup job. This includes creating directories in the shared file system for use by the service, identifying the architecture and operating system of the remote system, and downloading and installing Condor executables to the shared file system. In the provisioning phase resources are allocated by submitting glidein jobs to run on the remote site. This step includes requesting resources from the site’s LRM, generating Condor configuration files, and launching Condor daemons on the site’s worker nodes. Finally, in the cleanup phase all glidein jobs are cancelled and the cleanup job is submitted to remove all files and directories created on the site’s shared file system.

This three-phase process allows Condor executables to be staged once during the setup phase and reused for multiple requests during the provisioning phase. This reduces the amount of time and storage required for individual provisioning requests by eliminating redundant and costly data transfers.
4.3 Glidein Service Implementation

4.3.1 Overview

The Glidein Service is the central component of the system. It contains all the logic used by the system to provision resources from grid sites. Its functions include: communicating with clients about provisioning requests, setting up Condor on the remote grid site, submitting glidein jobs to provision resources, cleaning up log files and directories, and managing all resources and jobs used by the system. In keeping with our design goal of using standard tools and technologies, the Glidein Service was developed in Java using the Globus GT4 grid services framework [15]. The service is composed of several components as shown in Figure 4.3. These components are described in more detail in the following sections.
Figure 4.3: Glidein Service components
4.3.2 Resources

In order to facilitate the separation of the setup and provisioning phases, the service includes two different resource types. Site resources contain information about the configuration of a grid site, including target file system paths for executables and log files, the desired version of Condor to install, and contact information for the site’s job submission interface. Glidein resources contain information about a resource provisioning request, including the number of hosts and processors desired, the duration of the reservation, and specific resource requirements such as operating system, disk space, and available memory. Each site resource can have multiple glidein resources associated with it, and each glidein resource is associated with a single site resource.
4.3.3 Resource Managers

Each resource type has a corresponding manager component that provides an interface for client requests. These managers create resources, store and retrieve resources from the database, invoke resource operations on behalf of clients, and delete resources.
4.3.4 Batch Job Management

The Glidein Service submits batch jobs to grid sites using Condor-G [18] and Globus GRAM [10]. GRAM provides a standard job-submission interface that can be used with most grid sites, and Condor-G
provides job management and fault tolerance. Leveraging these technologies simplifies our system and improves its robustness and flexibility.

Jobs are submitted to Condor-G via a custom Condor API. The API generates submit scripts, monitors log files for job-related events, and invokes command-line tools to submit and cancel jobs. The critical feature of this interface is the ability to notify other parts of the system of job management events through the use of callbacks. Although there are many APIs that can be used to submit and monitor Condor jobs, none of them provide callbacks at the level of detail required by the Glidein Service. These callbacks are used by the service to update the state of resources and initiate critical operations. Without this feature the service would need to poll Condor to detect job management events. This would complicate service logic, limit scalability, and reduce fault tolerance by tightly coupling the service with Condor.
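A minimal sketch of what such a callback interface might look like; the names are invented for illustration and are not the service's actual API:

```java
// An implementation of this interface would be registered with the Condor
// API, which watches the Condor-G user log and fires these methods as
// matching events appear.
public interface JobEventListener {
    void submitted(String jobId);              // job accepted by Condor-G
    void started(String jobId);                // job began executing remotely
    void terminated(String jobId, int status); // job finished with exit status
    void aborted(String jobId, String reason); // job cancelled or failed
}
```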
4.3.5 State Management

The Glidein Service must carefully manage resource state by enforcing rules governing state transitions. Operations performed on a resource have different results depending on the current state of the resource. For example, a glidein resource can be submitted to a remote site only if it is in the NEW state and its associated site resource is in the READY state. If the glidein is not NEW, then a glidein job has already been submitted, and if the site is not READY, then the glidein job will fail to find the required execution environment when it runs. The Glidein Service enforces these rules by implementing a state machine inside each resource that validates state change events and generates actions based on the current state of the resource. A complete list of states and transition rules for site and glidein resources is documented in the Appendix.

The management of resource state is further complicated by concurrent requests. Because the Globus container is multi-threaded, concurrent requests for a single resource may create a race condition that leads to a violation of the state transition rules. Concurrent requests could be synchronized using traditional mechanisms such as locks and semaphores, but these mechanisms are difficult to program and can lead to deadlocks and unintended behavior if used incorrectly. Instead, the Glidein Service converts most requests into events and appends them to an event queue. This queue is serviced by a dedicated thread
that pulls events off the queue and applies them to resources in order. This serializes all requests and guarantees that race conditions cannot occur. This approach is similar to that used by the GT4 GRAM service [15].

One disadvantage of this synchronization method is that it can cause delays in the processing of some operations. However, it is expected that the event thread can process events much faster than they can be generated by clients, so the delay, if one occurs, will likely be small. Another disadvantage is that queuing prevents clients from learning the results of their request immediately and forces them to query for the results an indeterminate amount of time later. However, because the service makes non-blocking job submission requests to Condor and Globus, the client would have to query for results anyway. In addition, the service provides an asynchronous interface that can be used to obviate polling by notifying clients when results become available (see Section 4.4.3).
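The queue-and-single-thread pattern is straightforward to express with standard Java concurrency utilities. A minimal sketch, not the service's actual implementation:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class EventQueue {
    private final BlockingQueue<Runnable> events = new LinkedBlockingQueue<>();

    public EventQueue() {
        Thread worker = new Thread(() -> {
            // A single consumer applies events in FIFO order, so no two
            // events can ever operate on a resource concurrently.
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    events.take().run();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    // Request handlers convert operations into events and return immediately.
    public void enqueue(Runnable event) {
        events.add(event);
    }
}
```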
4.3.7 Fault Tolerance and Recovery

Resource state must also be carefully managed in the presence of failures. The Glidein Service is able to recover the state of its resources in the event of crashes and restarts. During system startup the service checks the database to see if there are any active resources and, if there are, begins the recovery process. During recovery the service loads all site and glidein resources to determine their last known state. Depending on the last state of the resource the service may submit new jobs, cancel existing jobs, query Condor for the current state of submitted jobs, or remove resources from the database. This process ensures that server failures do not lead to inconsistent resource state and that the service can continue to process client requests for resources created prior to the failure.
4.3.8 Network Issues

Multi-level scheduling systems function well when worker nodes have public IP addresses and are free to communicate with clients outside their network. However, many resource providers conserve IP addresses by using private networks and isolate their worker nodes behind firewalls for security. This prevents application-specific schedulers outside the resource provider’s network from communicating directly with worker nodes and hinders the use of multi-level scheduling techniques.
One solution to this problem is to use Generic Connection Brokering (GCB) [19]. Using GCB, a server called the broker is started in a location that is accessible to both the worker nodes and the application-specific scheduler to facilitate connections between them. The broker allows the application-specific scheduler and the worker nodes to communicate without requiring any direct connections into the private network. The Glidein Service supports the use of GCB by automatically configuring glideins to use a GCB broker. When creating a glidein the user specifies the address of an existing broker and the service automatically configures the Condor worker daemons to connect to the user’s Condor pool through the broker.
4.4 Features

4.4.1 Command-line interface

Users can interact with the system using a simple command-line interface. This is done by invoking the “glidein” command with one or more arguments specifying a sub-command and arguments for the sub-command. The available sub-commands include:

• create-site: Create a new site resource and install Condor on the remote file system.
• list-sites: Display a list of active sites.
• remove-site: Cancel all active jobs, clean up files and directories on the remote file system, and delete the site and all of its glideins from the database.
• create-glidein: Create a new glidein resource and submit a glidein job to the grid site.
• list-glideins: Display a list of active glideins.
• remove-glidein: Cancel the glidein job and remove the glidein from the database.
• site-history: Display the state change history of one or more sites.
• glidein-history: Display the state change history of one or more glideins.
• help: Display a detailed help message for one of the available sub-commands.
In addition to providing a simple interface for interactive provisioning requests, the command-line interface also supports scripting by providing outputs that are easy to parse and operations that block until
resources have been allocated. This allows the command-line interface to be used in shell scripts and workflows to automate provisioning. Figure 4.4 shows how this capability can be used to create a meta-workflow to automate the planning and execution of other workflows.
Figure 4.4: Example meta-workflow containing resource provisioning jobs
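A scripted run might look like the sketch below. The sub-command names are those listed above, but the option names and values are invented placeholders, not the tool's documented flags:

```
#!/bin/sh
# Provision a personal cluster, run a workflow on it, then tear it down.
# (Illustrative only: option syntax is invented.)
glidein create-site --site mercury ...      # blocks until Condor is staged
glidein create-glidein --site mercury ...   # blocks until nodes join the pool
condor_submit_dag montage.dag               # run the workflow on the pool
# ... wait for the workflow to complete ...
glidein remove-site --site mercury          # cancel glideins and clean up
```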
4.4.2 Service API

In order to simplify programmatic access to the service the implementation includes a client library and API written in Java. The API provides object-oriented abstractions that allow developers to create applications that access the service without the need to delve into the complexities of the SOAP and WSRF protocols.
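The sketch below illustrates the flavor of abstraction such a library provides; every name in it is invented for illustration and is not the service's real API:

```java
// Invented sketch of an object-oriented client abstraction over the
// SOAP/WSRF messaging used by the Glidein Service.
public interface GlideinServiceClient {
    record Site(String id) {}
    record Glidein(String id) {}

    // Setup phase: stage Condor onto a site and return a handle to it.
    Site createSite(String siteName, String installPath) throws Exception;

    // Provisioning phase: request the given number of processors for
    // wallTimeMinutes on the named site.
    Glidein createGlidein(Site site, int processors, int wallTimeMinutes)
            throws Exception;

    // Cleanup phase: cancel glideins and remove staged files.
    void removeGlidein(Glidein glidein) throws Exception;
    void removeSite(Site site) throws Exception;
}
```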
4.4.3 Asynchronous Notifications

The Glidein Service supports both synchronous and asynchronous interfaces. Most clients accessing the service use the synchronous interface to create, submit and remove site and glidein resources. Clients may also be interested in receiving automatic updates from the service when the state of a resource changes. To supply these updates the service provides an asynchronous interface based on the WS-Notification standard [33]. The asynchronous interface allows clients to learn about changes to the state of sites and glideins without explicitly requesting the information from the service. This allows clients to avoid polling, which does not provide timely updates, increases the load on the service, and reduces scalability. The service API includes operations that allow clients to easily subscribe to resources and receive state change notifications without being concerned with the details of the WS-Notification standard.
4.4.4 Glidein Resubmission

Many resource providers limit the maximum amount of time that can be requested for an individual job. This means that glidein jobs used to provision resources can only run for a limited amount of time before they expire. Often, however, users would like to provision resources for longer than the maximum allowed by the provider. This can be accomplished in a multi-level scheduling system by resubmitting provisioning requests as they expire. The Glidein Service supports this by automatically submitting new glidein jobs to replace old jobs that have terminated. When creating a new glidein the user can specify one of several resubmission policies:

• N Times: The glidein job is resubmitted N times.
• Deadline: The glidein job is resubmitted until a specific date and time has passed.
• Indefinite: The glidein job is resubmitted until the user explicitly cancels the request.

Under all of these policies, the request is not resubmitted if the last glidein job failed or if the user's credential has expired.
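The decision logic is roughly the following sketch; the types and method names are illustrative, not the service's actual code.

    // Sketch of the resubmission check applied when a glidein job terminates.
    boolean shouldResubmit(ResubmitPolicy policy, GlideinJob lastJob, Credential credential) {
        // A failed job or an expired credential always stops resubmission.
        if (lastJob.failed() || credential.isExpired()) {
            return false;
        }
        switch (policy.getType()) {
            case N_TIMES:
                return lastJob.getSubmissionCount() < policy.getMaxSubmissions();
            case DEADLINE:
                return new java.util.Date().before(policy.getDeadline());
            case INDEFINITE:
                return true; // until the user explicitly removes the glidein
        }
        return false;
    }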
4.4.5 Resource History

For record keeping it is useful to have the state change history of a resource. This information can be used to prepare reports for papers or allocation proposals, to determine how long resources have been used, and to analyze how long it takes to acquire resources at a particular site. The Glidein Service supports this feature by recording all resource state changes in its database. The history of changes can be retrieved using both the command-line interface and the API.
Chapter 5: Resource Provisioning System Evaluation

5.1 Resource Provisioning Overhead

In order to determine the overhead of the Glidein Service, we measured the amount of time required to complete the various stages of the resource provisioning process. These stages are illustrated in Figure 5.1. Each stage corresponds to specific events in the lifecycle of the setup, glidein, and cleanup jobs.
Figure 5.1: Resource provisioning stages

We measured the setup time, allocation time, and cleanup time for three TeraGrid sites. The average measurements are shown in Table 5.1. All allocation measurements were taken when the sites had plenty of free resources, in order to avoid unpredictable queuing delays.

Table 5.1: Average provisioning overheads at several grid sites (in seconds)

Site            Setup Time   Allocation Time   Cleanup Time
NCSA Mercury    29.5         52.5              15.0
NCSA Abe        28.4         35.3              15.7
SDSC IA-64      28.4         97.0              15.2
On all sites tested, the overheads for setup and cleanup were approximately 30 seconds and 15 seconds, respectively. This uniformity is most likely a result of using the Globus fork jobmanager to run the setup and cleanup jobs. If a batch jobmanager (e.g., jobmanager-pbs or jobmanager-lsf) were used for these jobs, the delays would have been longer due to scheduling overheads and queuing delays. Fortunately, because the Glidein Service installs packages on a shared file system, the fork jobmanager can be used at all sites where it is supported.
As expected, the allocation time varies considerably across the three sites. This is due to variations in the scheduling configurations of the sites.
5.2 Job Execution Delays

In order to determine the benefits of running application jobs using glideins, we measured the amount of time required to run a test job using Globus version 2, Globus version 4, and glideins provisioned by the Glidein Service. A no-op job was used in order to ensure that only the overheads were being measured. The average runtimes for three TeraGrid sites are shown in Table 5.2.

Table 5.2: Average no-op job runtime at several grid sites (in seconds)

Site            GT2 Runtime   GT4 Runtime   Glidein Runtime
NCSA Mercury    61.1          237.9         2.2
NCSA Abe        35.8          220.7         1.6
SDSC IA-64      263.3         N/A*          2.0

* The GT4 service was not available at this site.
On all three sites, the runtime of the jobs using glideins (~2 seconds) was significantly shorter than the runtime using Globus 2 and 4 (~35-260 seconds). This improvement is attributed to a reduction in software layers that introduce scheduling overheads in jobs submitted using Globus, and the ability to configure Condor’s scheduling policy to immediately execute jobs when resources become available rather than waiting for the next scheduling cycle.
5.3 Workflow Runtime

To quantify the benefits of using the Glidein Service for real applications, we measured the runtime of a 1-degree Montage workflow using both the traditional grid approach and the multi-level scheduling approach. The traditional approach used Condor-G and Globus GT2 GRAM to submit jobs directly to a grid site. The multi-level scheduling approach used Condor to submit jobs to a personal cluster created by the Glidein Service using resources from the same site. The site used for all experiments was the Mercury cluster at NCSA. The runtime of the workflow was measured using 1, 2, 4, and 8 resources. For the experiments using the traditional approach, Condor-G was configured to throttle the number of jobs submitted in order to ensure that the correct number of resources were utilized. The results of these experiments are shown in Figure 5.2.
Figure 5.2: Runtime comparison for a 1-degree Montage workflow

In all cases, the runtime of the workflow using multi-level scheduling was approximately 90% less than the runtime using the traditional approach. These results suggest that glideins and multi-level scheduling can be used to significantly improve the runtime of fine-grained workflow applications. Figure 5.3 compares the resource utilization achieved by the traditional approach and by glideins. In all cases the resource utilization with glideins was significantly higher than with the traditional approach; however, it was far from ideal. The highest utilization achieved using glideins was only about 30%, which is much less than the ideal utilization of 100%. This indicates that, although glideins are a significant improvement over the traditional approach, which achieved a maximum utilization of only 3%, there is still plenty of room for improvement. Higher utilization could be achieved by combining multi-level scheduling with clustering to further reduce overheads.
Figure 5.3: Utilization comparison for a 1-degree Montage workflow
5.4 Storage Requirements

In order to determine the storage requirements of the Glidein Service, we analyzed the disk space required to run Condor glideins on a cluster. This includes the space required for executables, configuration files, and logs. Because the size of these files varies depending on the runtime of the glideins (logs) and the architecture and system libraries of the cluster (executables), we report the minimum and maximum sizes that are possible. Table 5.3 shows the sizes of each type of file.

Table 5.3: Sizes of files used by Condor glideins (in KB)

File Type                    Minimum Size   Maximum Size
Logs (per worker)            144            6144
Configuration (per worker)   5              5
Executables (per site)       20480          44032
One important thing to note is that executables can be installed on a shared file system and used by multiple nodes, while logs and configuration files are generated for each node. Depending on the number of nodes allocated, the actual storage space used by the service may vary significantly. Table 5.4 shows the minimum and maximum amount of storage that would be required for multiple nodes.
Table 5.4: Storage required for multiple nodes (in MB)

Nodes   Minimum Size   Maximum Size
1       20             49
2       20             55
4       20             67
8       21             91
16      22             139
32      24             235
64      29             427
128     38             811
In the worst case, the maximum space required for a pool of 128 nodes is 811 MB. This space is primarily consumed by Condor log files, which, by default, are allowed to grow up to 6 MB per node. However, because the Glidein Service automatically cleans up log files, and because glideins will rarely run long enough for their log files to reach the maximum size, the actual space required will likely be closer to the minimum. Also, because the Glidein Service allows the user to specify values for all configuration parameters, a smaller maximum log file size could easily be used to reduce storage requirements. Finally, because the Glidein Service allows users to provide one path for executables and another path for logs and configuration files, the per-node files can be stored in the temporary storage space available on each node (e.g., /tmp). This means that shared storage is really only required for executables, which can easily fit in the user's home directory given the quotas available at most grid sites.
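As a check, the worst-case figure follows directly from Table 5.3: 44,032 KB of shared executables plus 128 × (6,144 + 5) KB of per-node logs and configuration gives 44,032 + 787,072 = 831,104 KB, or approximately 811 MB. The minimum follows the same way: 20,480 + 128 × (144 + 5) = 39,552 KB, or approximately 38 MB.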
Chapter 6: Conclusion

6.1 Future Work

Multi-level scheduling systems enable resources to be accessed and controlled by remote users over a wide area network. This allows the user to run applications more efficiently, but it also exposes resources and application jobs to security threats that may be present in the network. Several of the existing multi-level scheduling systems support advanced security features that can be used to guard against these threats. In developing the Glidein Service we concentrated primarily on enabling efficient resource provisioning and implemented only basic, host-based security mechanisms. As a result, it may be possible for a determined attacker to gain access to resources provisioned by the service, or to interfere with application jobs and data. To prevent this, the security of the service could be improved by automatically configuring personal clusters to use the authentication and encryption features supported by Condor.

Some workflows contain a mix of both serial and parallel jobs. The CyberShake application described in Chapter 2, for example, uses several parallel jobs to produce datasets that are used in subsequent analyses. Although Condor supports the execution of parallel jobs, the Glidein Service does not enable this feature in the personal clusters it creates. This means that workflows containing both types of jobs must use two different resource provisioning mechanisms: the traditional approach for parallel jobs and the multi-level scheduling approach for serial jobs. In order to simplify resource provisioning and job submission, the Glidein Service could be modified to facilitate the execution of parallel jobs on personal clusters.

The benefit of advance reservations over multi-level scheduling is the ability to specify the exact start and end times of a reservation. As was mentioned in Chapter 3, however, many grid sites do not support the use of advance reservations. VARQ [34] is a tool that allows users to make "virtual" reservations at sites that do not support scheduler-based reservations. It predicts the amount of time a batch job will spend waiting in the queue and submits the job with enough leeway to ensure that it is running at the time specified by the user. In comparison, the Glidein Service provisions resources on a best-effort basis, with reservations starting as soon as resources become available. By combining the Glidein Service with VARQ, users would be able to specify the exact start time of their reservations to ensure that their personal clusters are available when they are needed.

In order to provision resources for a workflow, the user must specify the number of resources required, their characteristics (in terms of CPU, memory, disk space, etc.), and the amount of time they are needed. Developing such specifications is non-trivial due to the large number of unknown quantities involved. Recently, several algorithms have been proposed that can be used to generate resource specifications for workflow applications [4][23]. These algorithms analyze an abstract workflow description to produce an estimate of its resource requirements. Combining these algorithms with the Glidein Service would provide an automatic resource provisioning capability for workflow applications.

Dynamic provisioning is a useful feature provided by some multi-level scheduling systems that allows resource allocation requests to be triggered by application-specific events external to the provisioning system. For example, resources could be provided for a workflow on demand by expanding the resource pool when a job is submitted and there are no free resources in the personal cluster. Policies could be implemented to determine the precise conditions under which more resources would be requested. This would ensure that the user's personal cluster automatically contains enough resources to run the application without any manual intervention on the user's part. Currently the Glidein Service does not support dynamic provisioning directly, but it does provide an API that could be used to implement dynamic provisioning tools, as sketched below.
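For example, a simple external tool could poll the personal cluster and grow it when jobs are waiting. The sketch below reuses the hypothetical API names from Chapter 4; condorPool, maxBurst, and leaseMinutes are likewise illustrative.

    // Sketch of a dynamic provisioning loop built on the service API.
    while (!done) {
        int idleJobs = condorPool.countIdleJobs();   // jobs waiting for a slot
        int freeSlots = condorPool.countFreeSlots(); // unclaimed glidein slots
        if (idleJobs > 0 && freeSlots == 0) {
            // Expand the pool with one glidein request sized to the backlog.
            Glidein g = client.createGlidein(site, Math.min(idleJobs, maxBurst), leaseMinutes);
            g.submit();
        }
        Thread.sleep(60 * 1000); // poll once a minute
    }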
6.2 Summary

Scientists in many fields are developing large-scale, fine-grained workflow applications for complex, data-intensive scientific analyses. These applications require the use of large numbers of low-latency computational resources in order to produce results in a reasonable amount of time. Although the grid provides access to ample resources, the traditional approach to accessing these resources introduces many overheads and delays that make the grid an inefficient platform for executing workflows.

We have described the design and implementation of a resource provisioning system that can be used to significantly improve the performance of workflow applications on the grid. The system is based on the concept of multi-level scheduling, a provisioning technique in which temporary resource pools called personal clusters are created by running user-level resource managers on grid sites. This approach eliminates queuing delays by pre-allocating resources, reduces overheads by streamlining resource management, and improves throughput by allowing the user to specify application-specific scheduling policies. The system has been shown to reduce the runtime of a typical workflow application by as much as 90% compared to traditional grid mechanisms.
References

1. A. Mandal, K. Kennedy, C. Koelbel, G. Marin, J. Mellor-Crummey, B. Liu, and L. Johnsson, "Scheduling strategies for mapping application workflows onto the grid," 14th IEEE International Symposium on High Performance Distributed Computing (HPDC'05), 2005, pp. 125-134.

2. J. Blythe, S. Jain, E. Deelman, Y. Gil, K. Vahi, A. Mandal, and K. Kennedy, "Task scheduling strategies for workflow-based applications in grids," IEEE International Symposium on Cluster Computing and the Grid (CCGrid'05), 2005, pp. 759-767.

3. R. Butler, V. Welch, D. Engert, I. Foster, S. Tuecke, J. Volmer, and C. Kesselman, "A national-scale authentication infrastructure," Computer, vol. 33, 2000, pp. 60-66.

4. E. Byun, J. Kim, Y. Kee, E. Deelman, K. Vahi, and G. Mehta, "Efficient Resource Capacity Estimate of Workflow Applications for Provisioning Resources," 4th IEEE International Conference on e-Science (eScience'08), to appear, 2008.

5. S. Callaghan, P. Maechling, E. Deelman, K. Vahi, G. Mehta, G. Juve, K. Milner, R. Graves, E. Field, D. Okaya, and T. Jordan, "Reducing Time-to-Solution Using Distributed High-Throughput Mega-Workflows: Experiences from SCEC CyberShake," 4th IEEE International Conference on e-Science (eScience'08), to appear, 2008.

6. H. Casanova, A. Legrand, D. Zagorodnov, and F. Berman, "Heuristics for scheduling parameter sweep applications in grid environments," 9th Heterogeneous Computing Workshop (HCW'00), 2000, pp. 349-363.

7. A. Chervenak, N. Palavalli, S. Bharathi, C. Kesselman, and R. Schwartzkopf, "Performance and Scalability of a Replica Location Service," 13th IEEE International Symposium on High Performance Distributed Computing (HPDC'04), 2004, pp. 182-191.

8. D. Churches, G. Gombas, A. Harrison, J. Maassen, C. Robinson, M. Shields, I. Taylor, and I. Wang, "Programming scientific and distributed workflow with Triana services," Concurrency and Computation: Practice and Experience, vol. 18, 2006, pp. 1021-1037.

9. condor_glidein, http://www.cs.wisc.edu/condor/glidein.
10. K. Czajkowski, I.T. Foster, N.T. Karonis, C. Kesselman, S. Martin, W. Smith, and S. Tuecke, "A Resource Management Architecture for Metacomputing Systems," IPPS/SPDP '98 Workshop on Job Scheduling Strategies for Parallel Processing, 1998.

11. DAGMan, http://cs.wisc.edu/condor/dagman.

12. E. Deelman, C. Kesselman, G. Mehta, L. Meshkat, L. Pearlman, K. Blackburn, P. Ehrens, A. Lazzarini, R. Williams, and S. Koranda, "GriPhyN and LIGO, building a virtual data Grid for gravitational wave scientists," 11th IEEE International Symposium on High Performance Distributed Computing (HPDC'02), 2002, pp. 225-234.

13. E. Deelman, S. Callaghan, E. Field, H. Francoeur, R. Graves, N. Gupta, V. Gupta, T.H. Jordan, C. Kesselman, P. Maechling, J. Mehringer, G. Mehta, D. Okaya, K. Vahi, and L. Zhao, "Managing Large-Scale Workflow Execution from Resource Provisioning to Provenance Tracking: The CyberShake Example," 2nd IEEE International Conference on e-Science and Grid Computing (e-Science'06), 2006.

14. E. Deelman, G. Singh, M. Su, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, G.B. Berriman, J. Good, A. Laity, J.C. Jacob, and D.S. Katz, "Pegasus: A framework for mapping complex scientific workflows onto distributed systems," Scientific Programming, vol. 13, 2005, pp. 219-237.

15. I. Foster, "Globus Toolkit Version 4: Software for Service-Oriented Systems," 2006.

16. I. Foster and C. Kesselman, "Globus: A Metacomputing Infrastructure Toolkit," International Journal of Supercomputer Applications, vol. 11, 1997, pp. 115-128.

17. I. Foster, C. Kesselman, and S. Tuecke, "The Anatomy of the Grid: Enabling Scalable Virtual Organizations," International Journal of High Performance Computing Applications, vol. 15, 2001, pp. 200-222.

18. J. Frey, T. Tannenbaum, M. Livny, I. Foster, and S. Tuecke, "Condor-G: A Computation Management Agent for Multi-Institutional Grids," Cluster Computing, vol. 5, 2002, pp. 237-246.

19. Generic Connection Brokering (GCB), http://cs.wisc.edu/condor/gcb.

20. W. Gentzsch, "Sun Grid Engine: towards creating a compute power grid," First IEEE/ACM International Symposium on Cluster Computing and the Grid, 2001, pp. 35-36.

21. T. Glatard, J. Montagnat, D. Lingrand, and X. Pennec, "Flexible and Efficient Workflow Deployment of Data-Intensive Applications On Grids With MOTEUR," International Journal of High Performance Computing Applications, vol. 22, Aug. 2008, pp. 347-360.

22. glideinWMS, http://home.fnal.gov/~sfiligoi/glideinWMS/.

23. R. Huang, A. Chien, and H. Casanova, "Automatic Resource Specification Generation for Resource Selection," IEEE International Conference on Supercomputing (SC'07), 2007.

24. D.S. Katz, J.C. Jacob, E. Deelman, C. Kesselman, G. Singh, M. Su, G.B. Berriman, J. Good, A.C. Laity, and T.A. Prince, "A comparison of two methods for building astronomical image mosaics on a grid," International Conference on Parallel Processing (ICPP'05), 2005, pp. 85-94.

25. Y. Kee, D. Nurmi, G. Singh, A. Mutz, C. Kesselman, and R. Wolski, "VGES: the Next Generation of Virtualized Grid Provisioning," IEEE/IFIP International Workshop on End-to-end Virtualization and Grid Management (EVGM'07), 2007.

26. Y. Kee, C. Kesselman, D. Nurmi, and R. Wolski, "Enabling personal clusters on demand for batch resources using commodity software," IEEE International Symposium on Parallel and Distributed Processing (IPDPS'08), 2008.

27. M.J. Litzkow, M. Livny, and M.W. Mutka, "Condor: A Hunter of Idle Workstations," 8th International Conference on Distributed Computing Systems, 1988, pp. 104-111.

28. B. Ludascher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger, M. Jones, E.A. Lee, J. Tao, and Y. Zhao, "Scientific workflow management and the Kepler system," Concurrency and Computation: Practice and Experience, vol. 18, 2006, pp. 1039-1065.

29. M. Maheswaran, S. Ali, H.J. Siegel, D. Hensgen, and R.F. Freund, "Dynamic Matching and Scheduling of a Class of Independent Tasks onto Heterogeneous Computing Systems," 8th Heterogeneous Computing Workshop (HCW'99), 1999.

30. Maui Cluster Scheduler, http://www.supercluster.org/maui.

31. Moab, http://www.clusterresources.com/pages/products/moab-cluster-suite.php.

32. Montage, http://montage.ipac.caltech.edu.

33. P. Niblett and S. Graham, "Events and service-oriented architecture: The OASIS Web Services Notification specifications," IBM Systems Journal, vol. 44, pp. 869-886.

34. D. Nurmi, R. Wolski, and J. Brevik, "VARQ: virtual advance reservations for queues," 17th International Symposium on High Performance Distributed Computing (HPDC'08), 2008.

35. T. Oinn, M. Greenwood, M. Addis, M.N. Alpdemir, J. Ferris, K. Glover, C. Goble, A. Goderis, D. Hull, D. Marvin, P. Li, P. Lord, M.R. Pocock, M. Senger, R. Stevens, A. Wipat, and C. Wroe, "Taverna: lessons in creating a workflow environment for the life sciences," Concurrency and Computation: Practice and Experience, vol. 18, 2006, pp. 1067-1100.

36. Open Science Grid, http://www.opensciencegrid.org.

37. OpenPBS, http://www.openpbs.org.

38. PBSPro, http://www.pbspro.com.

39. I. Raicu, Y. Zhao, C. Dumitrescu, I. Foster, and M. Wilde, "Falkon: a Fast and Light-weight tasK executiON framework," Supercomputing (SC'07), Reno, Nevada, 2007.

40. G. Singh, C. Kesselman, and E. Deelman, "Performance Impact of Resource Provisioning on Workflows," USC ISI Technical Report, 2005.

41. G. Singh, C. Kesselman, and E. Deelman, "Optimizing Grid-Based Workflow Execution," Journal of Grid Computing, vol. 3, 2005, pp. 201-219.

42. G. Singh, M. Su, K. Vahi, E. Deelman, B. Berriman, J. Good, D.S. Katz, and G. Mehta, "Workflow task clustering for best effort systems with Pegasus," Proceedings of the 15th ACM Mardi Gras Conference, Baton Rouge, Louisiana: ACM, 2008, pp. 1-8.

43. I.J. Taylor, E. Deelman, D.B. Gannon, and M. Shields, Workflows for e-Science: Scientific Workflows for Grids, Springer-Verlag New York, Inc., 2006.

44. TeraGrid, http://www.teragrid.org.

45. Torque, http://supercluster.org/torque.

46. E. Walker, J.P. Gardner, V. Litvin, and E. Turner, "Creating Personal Adaptive Clusters for Managing Scientific Jobs in a Distributed Computing Environment," IEEE Workshop on Challenges of Large Applications in Distributed Environments (CLADE'2006), Paris, France, 2006.

47. V. Welch, I. Foster, C. Kesselman, O. Mulmo, L. Pearlman, J. Gawor, S. Meder, and F. Siebenlist, "X.509 proxy certificates for dynamic delegation," Proceedings of the 3rd Annual PKI R&D Workshop, 2004.

48. M. Wieczorek, M. Siddiqui, A. Villazon, R. Prodan, and T. Fahringer, "Applying Advance Reservation to Increase Predictability of Workflow Execution on the Grid," 2nd IEEE International Conference on e-Science and Grid Computing (e-Science'06), 2006.

49. H. Zhao and R. Sakellariou, "Advance Reservation Policies for Workflows," Job Scheduling Strategies for Parallel Processing, 2007, pp. 47-67.

50. Y. Zhao, M. Hategan, B. Clifford, I. Foster, G.V. Laszewski, V. Nefedova, I. Raicu, T. Stef-Praun, and M. Wilde, "Swift: Fast, Reliable, Loosely Coupled Parallel Computation," 2007 IEEE Congress on Services, 2007, pp. 199-206.
Appendix: Resource State Transition Diagrams and Tables
Figure A.1: Site State Diagram

Table A.1: Site State Transitions

Current State   Event                                 Action                                     Next State
n/a             CREATE                                Add site to database                       NEW
NEW             SUBMIT                                Submit install job                         STAGING
STAGING         INSTALL_FAILED                        None                                       FAILED
STAGING         INSTALL_SUCCESS                       None                                       READY
STAGING         REMOVE                                Cancel install job, submit uninstall job   REMOVING
READY           REMOVE (no glideins)                  Submit uninstall job                       REMOVING
READY           REMOVE (has glideins)                 Remove glideins for site                   EXITING
EXITING         GLIDEIN_DELETED (has more glideins)   None                                       EXITING
EXITING         GLIDEIN_DELETED (no more glideins)    Submit uninstall job                       REMOVING
REMOVING        INSTALL_FAILED                        None                                       FAILED
REMOVING        INSTALL_SUCCESS                       Remove site from database                  DELETED
FAILED          REMOVE (no glideins)                  Submit uninstall job                       REMOVING
FAILED          REMOVE (has glideins)                 Remove glideins for site                   EXITING
DELETED         n/a                                   n/a                                        n/a
all             DELETE                                Remove site from database                  DELETED
Figure A.2: Glidein State Diagram
Table A.2: Glidein State Transitions

Current State   Event                       Action                         New State
n/a             CREATE                      Add glidein to database        NEW
NEW             SUBMIT (site not ready)     None                           WAITING
NEW             SUBMIT (site ready)         Submit glidein job             SUBMITTED
WAITING         SITE_READY                  Submit glidein job             SUBMITTED
WAITING         SITE_FAILED                 None                           FAILED
SUBMITTED       QUEUED                      None                           QUEUED
SUBMITTED       JOB_FAILURE                 None                           FAILED
SUBMITTED       REMOVE                      Abort glidein job              REMOVING
QUEUED          RUNNING                     None                           RUNNING
RUNNING         JOB_SUCCESS (no resubmit)   Remove glidein from database   DELETED
RUNNING         JOB_SUCCESS (resubmit)      Submit glidein job             SUBMITTED
RUNNING         REMOVE                      Abort glidein job              REMOVING
RUNNING         JOB_FAILURE                 None                           FAILED
REMOVING        JOB_ABORTED                 Remove glidein from database   DELETED
FAILED          REMOVE                      None                           DELETED
DELETED         n/a                         n/a                            n/a
all             DELETE                      Remove glidein from database   DELETED