Journal of Grid Computing (2004) 2: 353–367

© Springer 2005

The DataGrid Workload Management System: Challenges and Results

G. Avellino1, S. Beco1, B. Cantalupo1, A. Maraschini1, F. Pacini1, M. Sottilaro1, A. Terracina1, D. Colling2, F. Giacomini3, E. Ronchieri3, A. Gianelle4, M. Mazzucato4, R. Peluso4, M. Sgaravatto4, A. Guarise5, R. Piro5, A. Werbrouck5, D. Kouřil6, A. Křenek6, L. Matyska6, M. Mulač6, J. Pospíšil6, M. Ruda6, Z. Salvet6, J. Sitera6, J. Škrabal6, M. Voců6, M. Mezzadri7, F. Prelz7, S. Monforte8 and M. Pappalardo8

1 DATAMAT S.p.A., Via Laurentina 760, I-00143 Roma, Italy
2 Physics Department, Imperial College London, Prince Consort Road, London SW7 2BW, United Kingdom
3 INFN CNAF, Viale Berti Pichat 6/2, I-40127 Bologna, Italy
4 INFN Sezione di Padova, Via Marzolo 8, I-35131 Padova, Italy
5 INFN Sezione di Torino, Via P. Giuria 1, I-10125 Torino, Italy
6 CESNET z.s.p.o., Zikova 4, 160 00 Praha 6, Czech Republic
7 INFN Sezione di Milano, Via Celoria 16, I-20133 Milano, Italy
8 INFN Sezione di Catania, Via S. Sofia 64, I-95123 Catania, Italy

Key words: Grid scheduling, distributed resource management, Grid workload management

Abstract

The workload management task of the DataGrid project was mandated to define and implement a suitable architecture for distributed scheduling and resource management in a Grid environment. The result was the design and implementation of a Grid Workload Management System, a super-scheduler with the distinguishing property of being able to take data access requirements into account when scheduling jobs to the available Grid resources. Many novel issues in various fields were faced, such as resource management, resource reservation and co-allocation, and Grid accounting. In this paper, the architecture and the functionality provided by the DataGrid Workload Management System are presented.

1. Introduction

The first work package of the EU-funded DataGrid (EDG) project [1] was mandated to address the issue of optimizing the distribution of jobs onto Grid resources. The aggressive schedule of the project and the emphasis on meeting the needs and schedules of the collaborating High Energy Physics (HEP), Earth Observation and Bioscience projects required an effective collaboration with other projects (such as Globus [15] and Condor [22]) as well as new middleware development. Existing Grid systems and services were integrated whenever possible, and new components providing functionality not offered by any available Grid tools were designed and implemented where needed. Many novel issues in many different fields were faced, including: resource management,

scheduling, resource reservation and co-allocation, and distributed accounting. The result of all these activities was the design and implementation of a Grid Workload Management System (WMS). In particular this implements a super-scheduler (performing the actions described in [31]) with the new and distinguishing feature of taking data access into account when scheduling jobs to the available Grid resources. Scheduling a large number of different data-intensive jobs to a Grid encompassing many heterogeneous resources is a typical and challenging issue for many applications, and in particular for the DataGrid reference applications. Therefore the provision of a system to efficiently schedule and execute such jobs on the available resources, taking into account both the job characteristics, requirements and preferences, and the characteristics

and status of the Grid resources, was considered to be very important. This WMS has proved to successfully address many aspects of the problem of scheduling and resource management in a Grid environment. The success of this software can also be measured by its deployment and use in environments outside of the DataGrid testbeds. This paper discusses the architecture and the functionality provided by this Workload Management System. Section 2 discusses the architecture of the WMS and its components and describes the basic functionality. In Section 3 more advanced functionality provided by the WMS is reported. Section 4 discusses results and achievements in terms of deployment and use of the WMS, and of its measured performance. In Section 5 comparisons with related work are presented. Section 6 concludes the paper.

2. The DataGrid Workload Management System

The architecture of the DataGrid WMS is represented in Figure 1. The User Interface (UI), which is discussed in more detail in Section 2.1, is the component that allows users to access the functionality offered by the WMS. In particular it allows jobs, described in a Job Description Language (JDL), to be submitted. The Network Server (NS) is a generic network daemon, responsible for accepting incoming requests from the UI (e.g., job submission, job removal). These

Figure 1. WMS component and deployment diagram.

requests are validated and passed to the Workload Manager. The Workload Manager is the core component of the WMS. Given a valid request, the Workload Manager has to take the appropriate actions to satisfy it. To do so, it may need support from other components, which are specific to the different types of request. The MatchMaker (or Resource Broker, RB) is one of the classes offering support to the Workload Manager. It provides a matchmaking service: given a request (e.g., for a job submission, or for a resource reservation), it finds the resources that best match it. Details on the Resource Broker design and implementation are discussed in Section 2.2. The Job Adapter (JA) is responsible for making the final “touches” to the job before it is passed to the Job Submission Service (which is discussed in Section 2.3) for the actual submission. In particular it is responsible for creating the appropriate execution environment in the Computing Element (CE) worker node where the execution takes place. The Logging and Bookkeeping (L&B) service is another component of the WMS. It stores logging and bookkeeping information from events generated by the various components of the WMS and related to the job flow. Using this information, the L&B service, discussed in Section 2.4, keeps a state machine view of each job. As all components interact with the L&B service, it has not been shown in the figure for the sake of simplicity. The other components of the WMS represented in the figure are discussed in the following sections. Determining the hierarchy and distribution of WMS instances (the Service Nodes in Figure 1) is not trivial. At one extreme, a single central WMS could in principle provide optimal resource allocation and fairness, but it poses scalability problems that are very hard to address with commodity hardware.
At the other extreme, one could imagine a dedicated WMS (some sort of Grid browser) for each end-user, giving freedom in the choice of resources, but also causing races for distributed resources that cannot be fairly arbitrated. The intermediate solution usually deployed is to provide a WMS for each Virtual Organization (VO) participating in the Grid. In order to be able to experiment with different choices, however, the WMS was designed to work both at the “personal” and at the VO level. The Workload Management System, whose architecture and functionality will be described in the next

sections, has been implemented and deployed in the DataGrid testbed. A first working WMS, with limited functionality, was released in the first phase of the project [5]. This first WMS was then reviewed and refactored to address some shortcomings that emerged in the first DataGrid testbed (in particular some scalability and reliability problems) and to provide some new functionality [6, 7]. Among the several improvements applied was the removal of all duplication of persistent information related to jobs (which was difficult to keep coherent and which caused various problems). In the revised WMS, the L&B service was chosen as the only repository for job information. Another major improvement was the introduction of various techniques and capabilities to quickly recover from failures (e.g., process or system crashes). For example, the communication among the various components of the new WMS was made much more reliable, being implemented via persistent queues in the file system. In the revised WMS, moreover, monolithic long-lived processes were avoided. Instead, some functionality (e.g., the matchmaking) was delegated to pluggable modules. This also helped reduce the exposure to memory leaks, coming not only from the EDG software, but also from the third-party software linked with it. Other enhancements in design and implementation were applied to all the services, addressing the various shortcomings seen with the first release of the WMS. Improvements also came from enhancements in the underlying software, such as those coming from the Globus [15] and Condor [22] projects.

2.1. The User Interface and the Job Description Language

A WMS has to provide users with a suitable means to express their needs, which then have to be forwarded to and interpreted by the Grid middleware. Therefore mechanisms to represent the customers and resources of the system are needed.
These are:
− the characteristics of jobs (executable name and size, parameters, number of instances to consider, standard input/output/error files, etc.);
− the resources (CPUs, network, storage, etc.) required for the processing and their properties (architecture types, network bandwidth, latency, assessment of required computational power, etc.).
This information then has to be provided to the actual workload management software layer, through an appropriate API or a specific, flexible, easy-to-use (graphical) user interface.

A high-level, user-oriented DataGrid Job Description Language (JDL) to describe both jobs and resources was therefore defined. In the DataGrid JDL, which is based on the Condor ClassAd language [28], the central construct is a record-like structure, the classad, composed of a finite number of distinct attribute names mapped to expressions. These ads conform to a protocol that states that every description should include expressions named Requirements and Rank, which denote the requirements and preferences of the advertising entity. Two entity descriptions match if each ad has an attribute (Requirements) that evaluates to true in the context of the other ad. For the convenience of the reader, we quote from [28] some of the features of this framework that make it particularly fitting for our application:
− The use of a semi-structured data model, so no specific schema is required for the resources description, allowing it to work naturally in a heterogeneous environment.
− The query language folded into the data model. Therefore requirements (i.e. queries) may be expressed as attributes of the job description.
− The possibility to arbitrarily nest descriptions, leading to a natural language for expressing resources and job aggregates or co-allocation requests.
The JDL defined for the DataGrid WMS provides attributes to support:
− the definition and specification of batch, interactive, MPI-based, checkpointable and partitionable jobs,
− the definition of aggregates of jobs with dependencies (Directed Acyclic Graphs, DAGs),
− the specification of constraints to be satisfied by the selected computing and storage resources, including also data access requirements,
− the specification of preferences for choosing between multiple suitable resources (ranking expressions).
Despite its extensibility, the JDL encompasses a set of predefined attributes that have a special meaning for the underlying components of the Workload Management System.
The Requirements and Rank expressions are built using the resource attributes, which represent the characteristics and status of the resources. These resource attributes are not part of the predefined set of attributes for the JDL, as their naming and meaning depend on the adopted Information Service schema [4].

This independence of the JDL from the resource information schema allows the submission to target resources that are described by different Information Services, without any change to the job description language itself. An example of the JDL used to describe a simple job follows:

[
  Type = "Job";
  Executable = "/bin/bash";
  StdOutput = "std.out";
  StdError = "std.err";
  Arguments = "./sim010.sh";
  Environment = "GATE_BIN=/usr/local/bin";
  OutputSandbox = {"std.out", "std.err", "Brain_radioth000.root"};
  InputData = {"lfn:BrainTotal", "lfn:EyeTotal"};
  DataAccessProtocol = "gridftp";
  OutputSE = "grid011.pd.infn.it";
  InputSandbox = {"sim010.sh", "macro000.mac", "status000.rndm"};
  rank = -other.GlueCEStateEstimatedResponseTime;
  requirements = other.GlueCEStateFreeCPUs >= 2;
]
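Programmatic construction of such a description can be illustrated with a short sketch. The `to_jdl` helper and the `Expr` marker below are hypothetical, not part of the EDG User Interface API; a classad is modeled as a Python dictionary and serialized to JDL text:

```python
# Hypothetical sketch: serialize a Python dict into JDL text like the example
# above. Strings are quoted, lists become brace-delimited JDL lists, and
# Expr marks raw expressions (e.g. rank/requirements) that must not be quoted.
class Expr(str):
    """A raw JDL expression, emitted without quoting."""

def to_jdl(ad):
    def fmt(v):
        if isinstance(v, Expr):
            return str(v)
        if isinstance(v, (list, tuple)):
            return "{" + ", ".join(fmt(x) for x in v) + "}"
        if isinstance(v, str):
            return '"%s"' % v
        return str(v)
    body = "".join("  %s = %s;\n" % (k, fmt(v)) for k, v in ad.items())
    return "[\n" + body + "]"

job = {
    "Type": "Job",
    "Executable": "/bin/bash",
    "Arguments": "./sim010.sh",
    "InputData": ["lfn:BrainTotal", "lfn:EyeTotal"],
    "requirements": Expr("other.GlueCEStateFreeCPUs >= 2"),
}
print(to_jdl(job))
```

A real client would additionally validate attribute names and expression syntax before submission.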

After having described the characteristics and requirements of their applications, users expect to be able to submit them to the Grid and monitor their execution, without having to care about the complexity of the Grid middleware software. In the DataGrid WMS this is accomplished via the User Interface (UI) component. The UI provides access to all services of the WMS and includes all the public interfaces exposed by the WMS. The UI allows access to all the job management related functionality: job submission, job cancellation, job status and output retrieval, listing of the resources suitable to run a specific job, etc. All this functionality is made available through a Python command line interface and an API providing C++ and Java bindings. The JDL and the command line User Interface provide a very powerful means to interact with the Grid. However, JDL-based definitions of jobs can become quite complex, since users are to some extent required to know some details of the definition language. Moreover the increasing number of available options, and the added complexity, can make the learning curve of the command line UI even steeper. In order to relieve users from this burden, a set of flexible and easily configurable Java graphical components is provided.

2.2. The Resource Broker

As introduced in Section 2, the Resource Broker, RB (or MatchMaker), is one of the classes that offer support to the Workload Manager. It provides a

matchmaking service, which relies on the ClassAd mechanisms [28] provided by the Condor project: given a job submission or resource reservation request (represented by a JDL expression), the RB finds the resources that best match this request. In particular it is responsible for finding the resource that best matches the requirements and preferences of a submitted user job, considering also the current distribution of load on the Grid. The distinguishing feature of this DataGrid super-scheduler is the ability to also take into account the data access requirements of the user job. The RB can be decomposed into three sub-modules:
− a sub-module responsible for performing the matchmaking, therefore returning all the resources suitable for that JDL expression,
− a sub-module responsible for performing the ranking of matched resources, therefore returning just the “best” resource suitable for that JDL expression,
− a sub-module implementing the chosen scheduling strategy, which can be based on the services of the previous two modules; this module is easily pluggable and replaceable with others implementing different scheduling strategies.
In order to achieve its goal, the Resource Broker interacts with other Grid services. In particular, as shown in Figure 1, it interacts with the Information Services (to know the status of the available computing and storage resources) and with the Data Management Services (to know where the required data are physically available in the Grid). This information is cached and converted into JDL for symmetric matching against the user requests.
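The symmetric evaluation can be sketched in a few lines, with Requirements reduced to plain Python predicates over the opposite ad (a real implementation evaluates ClassAd expressions instead; the ads below are made up for illustration):

```python
# Two ads match when each one's Requirements evaluates to true in the
# context of the other ad (simplified: predicates instead of ClassAds).
def matches(job_ad, resource_ad):
    return (job_ad["Requirements"](resource_ad)
            and resource_ad["Requirements"](job_ad))

job = {"VO": "cms", "Requirements": lambda other: other["FreeCPUs"] >= 2}
busy_ce = {"Name": "busy", "FreeCPUs": 0,
           "Requirements": lambda other: other["VO"] == "cms"}
free_ce = {"Name": "free", "FreeCPUs": 8,
           "Requirements": lambda other: other["VO"] == "cms"}

suitable = [ce["Name"] for ce in (busy_ce, free_ce) if matches(job, ce)]
print(suitable)  # only the CE with enough free CPUs matches
```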
The default implemented scheduling policy (replaceable with different ones) is to submit to a resource where the submitting user has proper authorization, that matches the characteristics specified in the job JDL, and where the specified input data (and possibly the chosen output storage resource) are determined to be “close enough” (e.g., on the same local network), therefore minimizing the overall access cost to the required data. In this matchmaking framework not only computing resource but also storage resource information can be taken into account. For example it is possible to request that a job must be submitted to a computing resource close to a storage system where enough storage space is available. This is an application of a more general match-making technique sometimes referred to as gang-matching [30].
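As a sketch of this data-aware selection (the closeness metric below, counting the input files already held by a storage element on the CE's local network, is purely illustrative and not the actual EDG cost function):

```python
# Among matched CEs, prefer the one closest to the required input data;
# free CPUs break ties. candidates maps CE name -> free CPUs, close_files
# maps CE name -> set of logical file names available on a nearby SE.
def choose_ce(candidates, input_files, close_files):
    def score(ce):
        local = close_files.get(ce, set())
        return (len(set(input_files) & local), candidates[ce])
    return max(candidates, key=score)

candidates = {"ce1.cern.ch": 32, "ce2.infn.it": 4}
close_files = {"ce2.infn.it": {"lfn:BrainTotal", "lfn:EyeTotal"}}
best = choose_ce(candidates, ["lfn:BrainTotal", "lfn:EyeTotal"], close_files)
print(best)  # data closeness outweighs the larger CPU count
```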

The implementation of the RB was based on the Condor matchmaking library [29].

2.3. The Job Submission Services

The Job Controller (JC) is the component of the workload management system responsible for the actual job management operations, issued on request of the Workload Manager. In particular the JC is responsible for managing the job submission and job removal requests. The JC is essentially a wrapper around the CondorG system [17], which leverages the security and resource access protocols as supported within the Globus Toolkit [15] (namely the Globus GSI and GRAM [11] services), and the intra-domain resource management methods of Condor [22]. Instead of developing a job submission service from scratch, it was decided to integrate the CondorG system, as it is a proven and reliable job submission service which guarantees fault tolerance and exactly-once execution semantics. The first capability is due to the persistent (crash-proof) queue of jobs, used as a persistent database storing information concerning active jobs, while the exactly-once execution semantics is made available by the two-phase commit protocol used by CondorG for job management operations. CondorG was also chosen to favor interoperability with other Grid systems. CondorG, in fact, relies on the Globus GRAM protocol, which is a de-facto standard for Grid resource access. This assures that basically any Globus GRAM-based computational resource can be managed and integrated within the DataGrid Workload Management System.

2.4. The Logging and Bookkeeping Service

The L&B service collects and manages job-related data from the WMS components, which are not otherwise directly accessible by the user, to provide a global view on job states, as well as aggregate information on compound jobs. The data are collected in terms of L&B events of the following types:
− Register. The very first event of each job. It records the basic information on the job type, its definition (JDL), owner, etc.
− Transfer, Accept, Refuse, EnQueue, DeQueue, Call, Return. Passing job control between WMS components over network connections, through reliable job queues, and by calling a procedure.


Figure 2. Block diagram of the L&B infrastructure.

− Match, Run, Done, Cancel, Resubmit, Abort. Other important points of the job life (finding a matching Computing Element, etc.).
− Checkpoint, UserTag. Arbitrary user annotations of the job in the form (key = value), including the application checkpoints.
All the events carry common information (JobId, timestamp, originating component, . . .), as well as event-type specific attributes, e.g., where the job is transferred, or its exit code. The main components of the event delivery infrastructure are shown in Figure 2. The events are posted with calls to the L&B producer library [21], which passes them on to the locallogger daemon, running physically close (preferably on the same machine) in order to avoid any sort of network problems. Consequently, the logging calls are not blocked by the inaccessibility of any remote services. Upon successful return of the logging call, the event is guaranteed to be delivered reliably, unless the communication path to the bookkeeping server is broken forever. Event delivery is managed by the interlogger daemon. It takes the events from the locallogger (or its log files on crash recovery), and repeatedly tries to deliver them to the destination bookkeeping server (which is known from the JobId) until it eventually succeeds. A counterpart to the standard delivery is the synchronous mode (used under specific circumstances, e.g., for the Register event, when the resulting performance penalty is acceptable). In this case a reverse confirmation channel is established so that the logging call does not return until the event is accepted by the server. Besides being stored, events are processed by the L&B server to give a higher level view – the job state as seen by the user. The following states are identified:

− Submitted, Waiting, Ready, Scheduled. The four stages of job preparation, i.e. accepted by the WMS, transferred to the RB, being assigned a target Computing Element, and transferred to the Computing Element queue.
− Running. The job has started execution.
− Done, Cancelled, Aborted. Normal and abnormal (user or system initiated) termination.
− Cleared. Job output was retrieved and purged.
Again, each state carries an appropriate set of attributes, e.g., job owner, job description (JDL), destination Computing Element, reason of failure, etc. In the case of DAGs (see Section 3.1) basic statistics on the sub-jobs (i.e., how many of them are in each state) are computed as well. Information available in the bookkeeping server is retrieved by L&B consumers using two classes of queries: job queries, which return one or more jobs including their states, and event queries, which return raw events. The queries specify conditions on various job and event attributes (e.g., JobId, job owner, destination, various timestamps, etc., as well as specific user tags). However, completely arbitrary queries would translate into full searches through the large L&B server database, easily overloading the server. Therefore the L&B database can be indexed according to selected job attributes (including user tags). The server rejects queries which cannot make use of the indices. The selectivity of individual attributes is highly specific to a given user community, hence the actual set of indices is configurable and left to the server administrator. Finally, the L&B server is capable of pushing job state changes into the R-GMA [10] monitoring infrastructure. Careful setup of the infrastructure leads to considerable optimizations in the distribution of L&B data to the end users, namely notifications on job state changes instead of the otherwise necessary frequent database polling.

2.5. Security in the Workload Management System

The WMS can access many Grid resources and often processes sensitive user data. Therefore special attention was paid to security issues during the design and implementation of the WMS components. Network communications between all the WMS components are mutually authenticated using the Grid Security Infrastructure (GSI [16]), which also provides integrity protection of the protocols. GSI is based on a Public Key Infrastructure and the use of short-time PKI credentials (proxies), which offer a secure way to achieve single sign-on and credential delegation. Each user must present a valid GSI credential at each interaction with the WMS, and possibly delegate it to the WMS. Each job submitted to the WMS must be accompanied by a delegated credential so that the job owner is always identifiable and the job itself is able to access authenticated services. The WMS components can use either their own credential or a delegated user credential when acting on the user's behalf, e.g., while submitting a job to a target CE. Even though each job submitted to the WMS has a valid GSI credential, the overall job lifetime can easily exceed the lifetime of the credential, which is usually a few hours. Using credentials with a longer lifetime does not solve the problem, since the overall job lifetime is unpredictable and security is decreased with long-lived credentials. To address the problem the WMS offers the Proxy Renewal Service, which allows the WMS to keep the job's credential valid for its whole lifetime. The service is built upon the MyProxy Service [25], a credential storage system that allows users to store and maintain their long-lifetime credentials, and which issues users' short-time credentials to a small, well-defined set of clients, including the WMS.
Running as part of the WMS, the Proxy Renewal service registers the certificate of a submitted job and, by periodically contacting the MyProxy Service, keeps the proxy certificate valid. The proxy renewal mechanism ensures that jobs submitted to the WMS have valid credentials throughout their lifetime, while preserving the advantages of short-time proxy certificates. The proxy renewal process is performed in a controlled way, with only a small set of trusted services allowed to renew proxies. All transactions concerning proxy renewal are logged and can later be analyzed for possible misuse. Since the management of a user's proxy certificates in the repository is completely controlled by the user, renewal can be stopped at any time by removing the credential from the repository. The WMS also has integrated support for the Virtual Organization Management System (VOMS [3]). This provides the ability to specify, maintain and manage sets of groups or roles within a particular VO, and is part of the DataGrid authorization framework. Various WMS components use this information either for making authorization decisions (e.g., access control to data in the L&B service) or for finding out details about users' VOs (e.g., jobs owned by a given user can only be submitted to CEs available to that user).
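The periodic renewal pass described above can be sketched as follows. All names here are hypothetical: the real service speaks GSI to MyProxy, which is not modeled; a proxy is reduced to its expiry time.

```python
RENEWAL_MARGIN = 3600  # renew when less than one hour of lifetime remains

def renewal_pass(registered, now, fetch_fresh_proxy):
    """One pass over registered jobs; returns the job ids that were renewed."""
    renewed = []
    for job_id, proxy in registered.items():
        if proxy["expires"] - now < RENEWAL_MARGIN:
            # Contact the credential repository for a fresh short-lived proxy.
            registered[job_id] = fetch_fresh_proxy(job_id)
            renewed.append(job_id)
    return renewed

registered = {
    "job-1": {"expires": 1000},       # about to expire
    "job-2": {"expires": 1_000_000},  # still valid for a long time
}
fresh = lambda job_id: {"expires": 1000 + 12 * 3600}
print(renewal_pass(registered, now=900, fetch_fresh_proxy=fresh))
```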

3. Advanced Functionality

As well as allowing the management of simple, sequential batch jobs, as described in the previous section, the DataGrid WMS also provides some advanced capabilities: mechanisms for Grid accounting, a framework for resource reservation and co-allocation, the possibility to submit parallel, interactive, checkpointable and partitionable jobs, etc. Some of this functionality is described in the following subsections.

3.1. Jobs with Dependencies

The concept of a “job” only rarely corresponds to a single executable to be run on a worker node, reading some input and producing some output. Often a job is better expressed as a set of interdependent tasks, collaborating to achieve a common goal. It is then desirable that a Grid implementation provides mechanisms to manage arbitrarily complex workflows. In its simplest form a workflow can be expressed as a Directed Acyclic Graph (DAG), where nodes represent jobs and arcs represent temporal dependencies between any two of them. For example, an arc between node A and node B means that the job associated with node B can start only when the job associated with node A has finished (successfully). Within the WMS the support for DAGs is implemented on top of a tool called DAGMan, developed within the Condor project. DAGMan is a meta-scheduler whose purpose is to navigate through the graph, processing the nodes that happen to be free of dependencies. For each DAG submitted to CondorG, a DAGMan process is locally spawned.
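This traversal can be sketched as a toy dependency resolver (not DAGMan itself), using the diamond-shaped DAG that also serves as the example later in this section:

```python
# Sketch of DAGMan-style traversal: repeatedly release the nodes whose
# parents have all finished. deps maps each node to its parent nodes.
def runnable(deps, done):
    """Nodes whose parents are all done and which have not run yet."""
    return sorted(n for n, parents in deps.items()
                  if n not in done and set(parents) <= set(done))

deps = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
done = set()
order = []
while len(done) < len(deps):
    ready = runnable(deps, done)
    order.append(ready)   # these nodes could be matched and submitted now
    done.update(ready)
print(order)
```

The diamond releases A first, then B and C in parallel, and finally D.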

The DataGrid WMS inherited the specification of a DAG node from DAGMan, which consists of three parts: a PRE script, the user's job and a POST script. When DAGMan processes a node, it first executes the PRE script. If this succeeds, DAGMan submits the user's job to CondorG and keeps monitoring it. When the job finishes, either successfully or unsuccessfully, the POST script is run (e.g., to close any job-specific context). A node is considered successful if all three parts are successful. The external interface that the WMS offers to manage DAGs is independent of the DAGMan-based implementation. Three aspects contribute to this interface: the language used to specify a DAG, the API provided to manage a DAG and the state machine representing the status of a DAG during its lifetime. To express DAGs the customary Job Description Language introduced in Section 2.1 is used. For example, the DAG shown in Figure 3 can be expressed as follows:

[
  type = "dag";
  nodes = [
    nodeA = [ file = "A.jdl" ];
    nodeB = [ file = "B.jdl" ];
    nodeC = [ file = "C.jdl" ];
    nodeD = [ file = "D.jdl" ];
    dependencies = { {nodeA, {nodeB, nodeC}},
                     {{nodeB, nodeC}, nodeD} }
  ]
]

where A.jdl, B.jdl, C.jdl and D.jdl are files containing the description of the respective jobs. The contents of those files can also be directly embedded in the DAG description. Within the WMS a DAG is considered as a single unit of work. As such it has its own entries in the Logging and Bookkeeping Service, its passing through the

Figure 3. A diamond shaped DAG.

system generates events that feed the standard state machine for jobs, it can be queried for status, it can be cancelled, etc. Moreover its nodes, being themselves jobs, can be managed independently. As a rule of thumb, the later the scheduling decision for a job is taken, the better. Following this guideline, the scheduling of the DAG nodes is not performed when the DAG is submitted (eager scheduling) but is deferred until the single node is ready, i.e. free of dependencies (lazy scheduling) [18]. In principle the decision can be deferred to the moment a certain worker node is actually free (very lazy scheduling), but this approach raises other problems, so it was not pursued further in the project.

3.2. Job Checkpointing

Checkpointing is a service to facilitate automated recovery and continuation of interrupted computations with the aid of periodically recorded checkpoint data. Checkpointing is essential for long-running computations in order to minimize lost time and other costs incurred by system failures. Checkpointing also enables job preemption, job migration and other interruptions to calculations [32]. In the DataGrid WMS an application-level checkpointing service was implemented, where state saving is an explicit part of the application. A Grid checkpointing API was provided to allow applications to be instrumented to save the state of the process (represented by a list of (key, value) pairs) at any moment during the execution of a job and to restore a previously saved state (to restart the computation from checkpointed data). In the framework of the DataGrid WMS, this Grid checkpointing service is primarily used to enable failure recovery for applications: if a job instrumented with the provided Grid checkpointing API (and which therefore saves its intermediate results from time to time while running) fails, the Workload Management System automatically reschedules the job and resubmits it to another compatible resource.
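A possible shape of such an application-level API can be sketched as follows; the class and method names are illustrative, not those of the actual EDG checkpointing library, and the in-memory list stands in for the L&B-backed storage:

```python
# Hypothetical checkpointing API: a job state is a set of (key, value)
# pairs saved periodically; on restart the last saved state is retrieved.
class JobState:
    def __init__(self, store):
        self._store = store            # stands in for the L&B-backed storage

    def save(self, pairs):
        self._store.append(dict(pairs))

    def last(self):
        return self._store[-1] if self._store else {}

store = []
state = JobState(store)

# First run: checkpoint after every chunk of events, then "crash" at 600.
for event in range(0, 600, 200):
    state.save([("last_event", event)])

# On restart the job resumes from the last saved state.
resume_from = state.last()["last_event"]
print(resume_from)
```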
When the job restarts its execution, the last saved state is automatically retrieved, and the computation is restarted from that point. If the failure cannot be detected by the Grid middleware, the user can retrieve a saved state for her job (usually the last saved one), and resubmit it specifying that it must start from these retrieved checkpoint data. The functionality to persistently save the state of a job and to retrieve a previously saved state, in the

DataGrid WMS is provided by the L&B service. As described in Section 2.4, checkpoints (internal application states) are stored among the various attributes associated with each job. Using the checkpointing service naturally introduces an overhead, which depends on many factors: the job state saving frequency, the job state size and the network connectivity between the worker node where the job is executed and the L&B server where job states get saved.

3.3. Interactive Jobs

The interactive job support provided in the DataGrid WMS allows standard streams to be forwarded from the worker node where the execution takes place to a remote machine, usually the UI machine, so that the user can interact with the job during its execution. This interactive job support relies on the bypass software [33] (more specifically on the Grid Console module) developed by the Condor group. Grid Console is an implementation of an interposition agent system, that is, a software module which sits between the application layer and the operating system layer, and grabs control and manipulates the results when specific system calls are invoked by the user application. It consists of two software components: the

Grid Console Agent and the Grid Console Shadow. The Grid Console Agent intercepts reads and writes on stdin, stdout, and stderr, while all other operations are left untouched; reads and writes on these streams are forwarded to the Grid Console Shadow for execution.

Figure 4 shows the flows for interactive jobs in the DataGrid WMS. The Job Shadow is a DataGrid-customized Grid Console Shadow running on the UI machine. The Job Agent, installed on the worker node where the job runs, consists of a set of processes linked against a slightly modified Grid Console Interposition Agent; it is automatically staged and installed on the executing node when an interactive job is submitted. When the interactive job runs, its standard streams are sent to the Shadow machine (usually the UI node). Reliable mechanisms have been designed and implemented: for example, if the UI crashes or the user disconnects from it, the Job Agent writes the output locally and tries to reconnect, retrying for a while before giving up and aborting the job.
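The split-execution idea behind the Agent/Shadow pair, including the local spooling fallback used when the UI becomes unreachable, can be sketched as follows (the classes below are simplified illustrations, not the actual Grid Console code):

```python
class Shadow:
    """Stand-in for the Grid Console Shadow on the UI node: it simply
    collects the stream data forwarded to it by the agent."""
    def __init__(self):
        self.received = []
        self.up = True

    def deliver(self, data):
        if not self.up:
            raise ConnectionError("shadow unreachable")
        self.received.append(data)


class Agent:
    """Stand-in for the Grid Console Agent: it intercepts writes on the
    job's output streams and forwards them to the Shadow; on failure it
    spools the output locally and re-delivers it once the Shadow is back."""
    def __init__(self, shadow):
        self.shadow = shadow
        self.spool = []  # local fallback, e.g. when the UI crashes

    def write(self, data):
        try:
            # first flush anything spooled while the shadow was unreachable
            while self.spool:
                self.shadow.deliver(self.spool[0])
                self.spool.pop(0)
            self.shadow.deliver(data)
        except ConnectionError:
            self.spool.append(data)
```

The spool is drained before each new write and entries are removed only after successful delivery, so the stream order seen on the UI side is preserved across a disconnection and reconnection.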

Figure 4. Interactive job flow.

3.4. Grid Accounting

One of the objectives when designing and implementing the DataGrid WMS was to provide an accounting system that traces resource usage by authorized Grid users. Instead of a traditional approach, which simply records usage metrics for user jobs, an economy-based approach (users “pay” virtual credits for resource usage) was implemented, in order to support economic brokering of Grid resources. The main goal of economic brokering is to help balance the overall workload by pricing resources according to their current state (for example, queue length) and including price information in the resource selection process. In an economic context, the goal of resource-exchange fairness implies not only a fair exchange of computational energy [26], but also concerns the temporal value of Grid resources, since the price of a good reflects its value or true worth only if it brings about an equilibrium between demand and supply.

It is widely believed [2, 14] that an economic approach offers natural self-regulating mechanisms that can help allocate resources to Grid users in a fair and effective manner by balancing demand and supply, i.e. reaching market equilibrium as a long-term optimization. The existence of a precise competitive equilibrium (a market equilibrium that maximizes all agents’ utilities), however, is guaranteed only under restrictive conditions that are usually not met in actual markets. An economic approach may also help balance the incoming workload among the participating Grid resources (“just-in-time” scheduling as a short-term optimization): as in real market economies, the prices of idle resources may be lowered to attract “consumers”, while those of overloaded resources may be raised, in order to reasonably balance the average utilization and computational efficiency of the Grid as a whole.
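The load-based pricing and price-sensitive selection just described can be sketched as follows (the formula and function names are illustrative assumptions, not the algorithm actually used by the PA service):

```python
def quote_price(base_price, queued_jobs, free_cpus):
    """Illustrative load-sensitive price quotation: the price grows with
    the queue length and shrinks when CPUs sit idle, so that cheap idle
    resources attract jobs while expensive busy ones repel them."""
    return base_price * (1 + queued_jobs) / (1 + free_cpus)


def cheapest_resource(resources):
    """Price-sensitive brokering step: among the matching resources,
    pick the one with the lowest current quote.  `resources` maps a CE
    name to a (base_price, queued_jobs, free_cpus) tuple."""
    quotes = {ce: quote_price(*state) for ce, state in resources.items()}
    return min(quotes, key=quotes.get)
```

A broker ranking CEs by such quotes steers new jobs toward idle resources, which is exactly the self-regulating balancing effect described above.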
The fundamental services of which the DataGrid Accounting System (DGAS) is composed are the so-called Home Location Register (HLR) service, responsible for resource usage accounting, and the Price Authority (PA) service, responsible for resource pricing, as shown in Figure 1. DGAS services are based on the assumption of distributed and decentralized control, to assure scalability, and on the existence of trust relationships between all Virtual Organizations collaborating in the Grid. DGAS thus allows an arbitrary topology of HLR and PA servers; the suggested deployment scenario consists of one HLR and one PA per Virtual Organization.

Each HLR (“bank”) server manages a subset of Grid users and resources by maintaining their respective account and transaction (job) information in a relational database. User and resource accounts are updated after each job execution using resource usage records provided by the executing computing resource. In this context, the authorization to use a specific resource should depend not only on the authorization policy of the resource owner, but also on the availability of user credits; accounting and authorization are therefore tightly bound in our model.

The PA servers furnish, upon request, valid price quotations for the resources within their administrative domain. The current implementation of the PA service adopts per-resource pricing, where resource prices depend only on the current state of the resource. Each PA server keeps a history of resource prices, including their respective GMT timestamps, so that the prices that were valid at job submission time can be retrieved even after they have been updated.

Each Grid user should receive the amount of credits she needs from the management of her Virtual Organization, which redistributes the credits earned by its resources among its users. If necessary, an exchange rate between Grid credits and real currencies might be established, making it feasible for large computing centers to share their resources even if their users do not require Grid services in return, and allowing small laboratories or research groups to utilize these services even if they cannot contribute resources of their own. Although DGAS has been fully implemented, it has been deployed only on small testbeds, so it is not yet possible to present experimental results on the efficiency of economic brokering and pricing strategies with DGAS.
First simulation results, however, indicate that pricing algorithms based on the workload levels of individual Grid resources, combined with price-sensitive resource brokering strategies, may effectively balance the workload over the Grid [27].
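As a toy illustration of the HLR bookkeeping described above, in which accounting and authorization are tightly bound (all class and method names here are hypothetical, not the DGAS API):

```python
class HLR:
    """Sketch of Home Location Register bookkeeping: after each job the
    user account is debited and the resource account credited by the job
    cost (price per CPU second times consumed CPU time)."""
    def __init__(self):
        self.accounts = {}      # account name -> balance in Grid credits
        self.transactions = []  # job-level audit trail

    def open_account(self, name, credits=0.0):
        self.accounts[name] = credits

    def charge_job(self, user, resource, cpu_time, price_per_sec):
        cost = cpu_time * price_per_sec
        # accounting and authorization are tightly bound: refuse the
        # charge if the user has run out of credits
        if self.accounts[user] < cost:
            raise ValueError("insufficient credits for %s" % user)
        self.accounts[user] -= cost
        self.accounts[resource] += cost
        self.transactions.append((user, resource, cost))
        return cost
```

For example, charging a 3600 s job at 0.5 credits per second moves 1800 credits from the user's account to the resource's account, and a user with an empty account is simply refused.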

4. Results and Achievements

The DataGrid WMS, described in the previous sections, has been shown to provide much of the functionality needed by several application areas. This is a very important achievement and is reflected in the use of the WMS in Grid environments other than the DataGrid testbed.

Some details of the performance of the WMS and of its use by applications are reported below.

4.1. Deployment and Use of the Workload Management System Software

The WMS software was (and still is) used outside the DataGrid testbed. The most important deployments are listed below:
− The LCG Grid. The DataGrid WMS software has been deployed on the LCG Grid infrastructure, the Grid implemented by the LCG project to meet the unprecedented computing needs of the LHC HEP experiments. The LCG infrastructure currently spans about 80 sites (in Europe, America and Asia), encompassing about 7300 CPUs; the number of sites and resources keeps growing rapidly.
− The DataTAG testbed. In this testbed, in particular, interoperability with US Grid domains was successfully addressed, and job submission through the DataGrid WMS to US resources was demonstrated.
− The GRID.IT infrastructure. The DataGrid WMS software has been deployed in the Italian GRID.IT infrastructure. This currently comprises 13 sites, but it is growing to include other resources belonging to different Italian scientific organizations. Currently the GRID.IT Grid is used to run typical HEP applications, but in the near future new users and new applications (e.g., in the domains of volcano seismology, climate model development, genomics, proteomics and astrophysics) will be involved.
− The CrossGrid testbeds. The DataGrid WMS software has also been used by the CrossGrid project, where it has been customized and enhanced according to the specific requirements of the project.

4.2. Performance of the Workload Management System

Here we report the results of some stress tests performed by the LCG Certification, Testing and System Support teams on the LCG certification testbed. The LCG tests were pushed to hit the scaling limits of the software, and therefore they mostly represent “worst case” performance measurements.
All the following tests were performed on release 2.1.12 of the WMS, released in the middle of

December 2003. Since then many improvements and fixes have been applied and deployed, but large-scale performance tests on newer software releases have not yet been performed or made available.

4.2.1. Test 1
In this test all the available commands were exercised, to verify that they work properly and provide the expected functionality. Everything worked correctly.

4.2.2. Test 2
In this test 500 jobs were submitted to the WMS from a single stream. Each job took about 3680 s to execute on the average available computing resources: this guaranteed that when the last job was submitted the first was still running. The job resubmission functionality was disabled for this test. There were no failures during submission or execution. The submission of the 500 jobs took about 3000 s, so on average submitting one job took about 6 s.

4.2.3. Test 3
In this test submission to a single CE was exercised. 1000 short jobs were submitted from a single stream to a single CE whose computing nodes were managed by the PBS Local Resource Management System (LRMS). The test was then repeated with a CE using LSF as LRMS. Job resubmission was disabled (failed jobs were not resubmitted). The submission of the 1000 jobs to the PBS CE took about 100 minutes (6000 s, so on average the submission of a job took 6 s), while on the second CE (the one managed by LSF) the submission of the 1000 jobs took 98 minutes (5880 s, again about 6 s per job on average). There were 2 failures submitting to the first CE and one failure for the second CE: the failure rates were therefore 0.2% for the CE managed by PBS and 0.1% for the LSF CE. All three jobs failed because of problems in the CondorG–Globus GRAM interaction.

4.2.4. Test 4
In this test submission to the WMS was performed from 10 streams, with each stream submitting 100 jobs. The job resubmission capability was disabled.
The submission of the 1000 jobs took about 59 minutes (3540 s): therefore on average submitting a single job took about 3.5 s. For this test there were no failures at all and all jobs were successfully executed.

4.2.5. Test 5
This test is similar to the previous one; the only difference is that 20 submission streams were used instead of 10. For 67 jobs the submission to the WMS failed, giving a submission failure rate of 3.4%. Some of the failures happened simultaneously on different streams, probably because of a peak in the CPU load of the WMS server machine. The submission of the 1933 remaining jobs took 129 minutes (7740 s), so on average the submission of a single job took 4 s. 3 of the 1933 submitted jobs did not complete successfully (one because the matchmaking could not be performed due to problems querying the Information Service, and one because of problems in the CondorG submission to the Globus resource). The overall failure rate for this test was therefore 3.5%.

4.2.6. Test 6
Test 6, like Test 4, consisted in submitting 1000 jobs from 10 streams, but this time job resubmission was enabled (allowing up to 3 resubmissions in case of failures). There was just one failure during submission and no errors during execution. The overall failure rate was therefore 0.1%.

4.2.7. Test 7
In this test the proxy renewal functionality was exercised. 10 long jobs (taking about 14 hours to complete) were submitted with a short-lived user proxy (which therefore had to be renewed by the proxy renewal service). All jobs completed successfully.

4.2.8. Test 8
In this test more complex jobs were considered: 400 jobs were submitted from 4 different streams (100 jobs per stream). Each job:
− Produced some data, which was stored on one SE;
− Replicated this data on a second SE.

All 400 jobs were successfully executed.

4.2.9. Test 9
Matchmaking on input data was also tested, using a collection of three SEs. A set of three different files was created; all three files were replicated on two of the SEs, while only two of the three files were replicated on the remaining SE. Jobs requiring all three files were then submitted, to check whether they were dispatched to a proper CE (only two of the three CEs were expected to match). The matchmaking for this test worked as expected.

4.3. Workload Management System Software Application Evaluation

The LCG Grid infrastructure on which the WMS software has been deployed was heavily used during the first half of 2004 by the four LHC experiments (ALICE, ATLAS, CMS and LHCb) for large distributed data challenges. This use of the LCG software (including the DataGrid WMS) in real production activities allowed the detection of several problems that could not be detected on smaller scale testbeds. Most of these problems were addressed and the enhanced versions were deployed during the data challenges. The results of the evaluations performed by the ATLAS, CMS and LHCb experiments are presented below. ALICE feedback is not reported, as the ALICE data challenge was performed with a very particular configuration: the ALICE experiment-specific computing environment (AliEn) was used on the LCG Grid infrastructure via a “keyhole” approach, i.e. the entire LCG system was seen by AliEn as a single computing and storage element and was used alongside other pure AliEn computing and storage systems.

The WMS software proved to have most of the functionality required by the reference applications, and the feedback on the performance and reliability of the system was quite good. Unfortunately, many of the problems in the job submission chain that arose during the data challenges can be attributed to resource failures and site misconfigurations, and are outside the scope of the WMS. It must be stressed that the results presented refer to the evaluation of the integrated system, and not only to the WMS software: the many complex interactions between the WMS and other services make it very difficult to interpret these tests as an evaluation of the WMS software in isolation.

4.3.1. ATLAS
The ATLAS data challenge was performed during May 2004. The submitted jobs generated physics signals and performed the detector simulation on the output produced. In total about 20 M physics events were generated and simulated. These jobs had different execution times, depending on the physics channel being considered: the execution time of the physics generation jobs, for example, varied from 250,000 to 1.5 million seconds

on a SPECint2000 CPU. Simulation jobs took much longer. In total 74779 jobs were submitted and 37044 of them finished successfully. Several reasons explain these poor results: besides the many site misconfigurations (most of the jobs failed because of site-related problems), failures were due to problems in the interaction between the Workload Management System and the Information Services, and to problems in the data management services.

4.3.2. CMS
CMS was the first experiment to use the LCG Grid infrastructure for real production activities. In March and April 2004 the CMS experiment undertook a Data Challenge (DC04) to test several key aspects of the CMS Computing Model. Simulated data to reconstruct and analyze were produced before DC04, and then the full chain was exercised, with the exception of the first-pass reconstruction, which is carried out at the data-production site (the so-called “Tier 0” in a multi-tiered data processing organization). About 15000 jobs were submitted over a period of 17 days. The time spent by an analysis job varied depending on the kind of data and the specific analysis performed (and of course on the type and performance of the execution nodes); in any case the data analyses were not very CPU intensive, taking from a few to 30 minutes per job. These jobs were distributed to the only two LCG CMS-certified computing sites available at the time of this evaluation. The overall job efficiency during the period turned out to be around 90% (it was about 99% towards the end of the CMS data challenge, when some problems and misconfigurations had been addressed). The failures were due in particular to network problems.

4.3.3. LHCb
The LHCb data challenge was performed in June and July 2004. The submitted jobs consisted of simulating and reconstructing either 500 signal events or 500 background events together with 900 minimum-bias events.
The former type of job (the simulation jobs) needed about 2 days on a Normalized CERN Unit (180 SPECint2000) CPU, while the latter needed 3 days on the same CPU type. A total of 43 LCG sites executed at least one LHCb job. In the first phase 50000 jobs were submitted, and about 6600 jobs failed. In the second phase another

30000 jobs were submitted, and about 5600 of them failed. After some problems were fixed (in particular some memory leaks in some CondorG components), the efficiency rate reached 95%.

5. Related Work

The DataGrid WMS is characterized by its comprehensive coverage of the many facets of workload management. While an increasing number of Grid-related projects and activities use portals as the primary entry point into the Grid (see, e.g., [34] or the NPACI HotPage, https://hotpage.npaci.edu), the DataGrid command line user interface proved to be very useful when large job sets were to be submitted. It also enables off-line preparation of job submissions, requiring on-line connectivity only at actual submission time.

The GridLab project attempts to use multi-criteria scheduling strategies [20], which require a super-scheduler with full knowledge and control over the resources. DAGMan/Condor-G is used in GriPhyN [13] as a low-level scheduler; complex workflows are taken care of by the Pegasus subsystem. The general workflow scheduling approach studied in [23] poses such restrictive conditions on the knowledge of the current Grid state required by the scheduler that it has yet to be tested within a large scale production Grid environment (currently it is being tested within the GridLab project). GrADS [12] uses a simple launch-time scheduler, which can block, accompanied by a metascheduler capable of negotiating scheduling contracts. The approach adopted by the DataGrid WMS, while being multi-criteria and supporting workflows, performs well even in the presence of outdated information and does not rely on full knowledge of the Grid state. GrADS also supports job rescheduling via swapping, tailored to MPI jobs; the DataGrid WMS handles job recovery in the presence of failures via its job partitioning and checkpointing support. Working at the application level, it can support generic jobs, overcoming the global synchronization problems faced by automated checkpointing support for parallel (MPI) jobs [9].
Other Grid projects are developing infrastructures implementing economy-oriented Grid accounting (like the GASA GridBank system [8]), while other groups are focusing mainly on reporting usage metrics [19]. The DataGrid DGAS, while being developed mainly to implement economic accounting, is also meant to provide resource usage metrics.

Moreover, the evolution of the set of middleware components towards emerging standards based on service-oriented architectures is another area on which future work will have to focus.

6. Conclusions and Future Directions

In this paper the details of a Workload Management System, designed and implemented in the context of the DataGrid project, were presented. As well as integrating existing Grid technology whenever possible, new and original developments took place in order to implement missing functionality. In particular, a super-scheduler able to schedule jobs to the available Grid resources, taking into account data access constraints and characteristics, was designed and implemented. This was a missing Grid component, whose provision was considered critical by many scientific applications. Feedback on this software system, deployed and used in many environments besides the DataGrid testbed, has been quite good, not only in terms of the provided functionality, but also for the performance and robustness of the software.

Although significant results have been achieved, the problem of distributed processing of data-driven jobs using standard, general purpose components is of course not completely solved. Many areas still require attention, such as:
− The organization of resource information: the collection of the various pieces of information concerning Grid resources, the organization of this information, its indexing, and its possible optimization for matchmaking use are still open issues.
− Resource access: mechanisms providing access to computing resources as they become available, needed to accommodate the process of matchmaking/brokering at the latest possible time.
− The handling of failures: faults and misconfigurations are inevitable in a Grid environment; it is therefore necessary to be able to tolerate and manage them.
− The handling of data requirements: the co-location of the job and at least the bulk of the data to be accessed, the management of disk space (allocation and cleanup) and the staging of any output need further investigation.
− Distributed super-scheduling: mechanisms for scheduling between multiple Resource Brokers, to increase the scalability and reliability of the Workload Management System, and to avoid possible bottlenecks, should be investigated.

Acknowledgements

DataGrid is a project funded by the European Commission under contract IST-2000-25182. We also acknowledge the national funding agencies participating in DataGrid for their support of this work.

References

1. “Home page of the DataGrid project”, http://www.edg.org
2. D. Abramson, J. Giddy and R. Buyya, “An Economy Driven Resource Management Architecture for Global Computational Power Grids”, in Proc. of the 7th International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2000), Las Vegas, USA, 2000.
3. R. Alfieri et al., “VOMS, an Authorization System for Virtual Organizations”, in Grid Computing, First European Across Grids Conference, 2004.
4. S. Andreozzi, M. Sgaravatto and C. Vistoli, “Sharing a Conceptual Model of Grid Resources and Services”, in Proceedings of the 2003 Computing in High Energy and Nuclear Physics Conference (CHEP03), La Jolla, CA, USA, March 2003.
5. C. Anglano et al., “Integrating Grid Tools to Build a Computing Resource Broker: Activities of DataGrid WP1”, in Proceedings of the 2001 Computing in High Energy and Nuclear Physics Conference (CHEP01), Beijing, China, September 2001.
6. G. Avellino et al., “The EU DataGrid Workload Management System: Towards the Second Major Release”, in Proceedings of the 2003 Computing in High Energy and Nuclear Physics Conference (CHEP03), La Jolla, CA, USA, March 2003.
7. G. Avellino et al., “The First Deployment of Workload Management Services on the EU DataGrid Testbed: Feedback on Design and Implementation”, in Proceedings of the 2003 Computing in High Energy and Nuclear Physics Conference (CHEP03), La Jolla, CA, USA, March 2003.
8. A. Barmouta and R. Buyya, “GridBank: A Grid Accounting Services Architecture (GASA) for Distributed System Sharing and Integration”, in Proceedings of the 17th Annual International Parallel & Distributed Processing Symposium (IPDPS 2003) Workshop on Internet Computing and E-Commerce.
9. G. Bronevetsky et al., “Automated Application-Level Checkpointing of MPI Programs”, in ACM Symposium on Principles and Practice of Parallel Programming, 2003.
10. A. Cooke et al., “Relational Grid Monitoring Architecture (R-GMA)”, presented at the UK e-Science All Hands Meeting, Nottingham, UK, 2003, https://edms.cern.ch/file/400756/1/rgma.pdf
11. K. Czajkowski, I. Foster, N. Karonis, C. Kesselman, S. Martin, W. Smith and S. Tuecke, “A Resource Management Architecture for Metacomputing Systems”, Lecture Notes in Computer Science, Vol. 1459, 1998.
12. H. Dail et al., “Scheduling in the Grid Application Development Software Project”, in [24], pp. 73–98.
13. E. Deelman, J. Blythe, Y. Gil and C. Kesselman, “Workflow Management in GriPhyN”, in [24], pp. 99–118.
14. D.F. Ferguson et al., “Economic Models for Allocating Resources in Computer Systems”, in Market-Based Control: A Paradigm for Distributed Resource Allocation. World Scientific: Hong Kong, 1996.
15. I. Foster and C. Kesselman, “The Globus Toolkit”, in I. Foster and C. Kesselman (eds), The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann: San Francisco, CA, 1999, Chap. 11, pp. 259–278.
16. I. Foster, C. Kesselman, G. Tsudik and S. Tuecke, “A Security Architecture for Computational Grids”, in Proc. 5th ACM Conference on Computer and Communications Security, 1998.
17. J. Frey, T. Tannenbaum, I. Foster, M. Livny and S. Tuecke, “Condor-G: A Computation Management Agent for Multi-Institutional Grids”, in Proceedings of the Tenth IEEE Symposium on High Performance Distributed Computing (HPDC), San Francisco, California, 2001, pp. 7–9.
18. F. Giacomini, F. Prelz, M. Sgaravatto, I. Terekhov, G. Garzoglio and T. Tannenbaum, “Planning on the Grid: A Status Report”, Technical Report INFN-TC-02/26, INFN-GRID Project, 2002.
19. S. Jackson et al., “Charter of the Global Grid Forum Usage Records Working Group”, Technical Report, 2003.
20. K. Kurowski, J. Nabrzyski, A. Oleksiak and J. Weglarz, “Multicriteria Aspects of the Grid Resource Management”, in [24], pp. 271–294.
21. A. Křenek and Z. Salvet, “L&B API Reference”, DataGrid-01-TED-0139, 2003, http://lindir.ics.muni.cz/dg_public/
22. M. Litzkow, M. Livny and M. Mutka, “Condor – A Hunter of Idle Workstations”, in Proceedings of the 8th International Conference on Distributed Computing Systems, 1988.
23. M. Milka, G. Waligora and J. Weglarz, “A Metaheuristic Approach to Scheduling Workflow Jobs on a Grid”, in [24], pp. 295–320.
24. J. Nabrzyski, J. Schopf and J. Weglarz (eds), Grid Resource Management. Kluwer Academic: Norwell, MA, 2003.
25. J. Novotny, S. Tuecke and V. Welch, “An Online Credential Repository for the Grid: MyProxy”, in Proceedings of the Tenth International Symposium on High Performance Distributed Computing (HPDC-10), 2001.
26. R. Piro, A. Guarise and A. Werbrouck, “An Economy-Based Accounting Infrastructure for the DataGrid”, in Proc. of the 4th International Workshop on Grid Computing (Grid2003), Phoenix, Arizona, USA, 2003.
27. R. Piro, A. Guarise and A. Werbrouck, “Simulation of Price-Sensitive Resource Brokering and the Hybrid Pricing Model with DGAS-Sim”, in Proc. of the 13th International Workshops on Enabling Technologies: Infrastructures for Collaborative Enterprises (WETICE 2004), Modena, Italy, 2004.
28. R. Raman, M. Livny and M. Solomon, “Matchmaking: Distributed Resource Management for High Throughput Computing”, in Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing (HPDC7), Chicago, IL, 1998.
29. R. Raman, M. Livny and M. Solomon, “Matchmaking: Distributed Resource Management for High Throughput Computing”, in Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing (HPDC7), Chicago, IL, 1998.
30. R. Raman, M. Livny and M. Solomon, “Resource Management through Multilateral Matchmaking”, in Proceedings of the Ninth IEEE Symposium on High Performance Distributed Computing (HPDC9), Pittsburgh, PA, 2000, pp. 290–291.
31. J.M. Schopf, “Ten Actions when SuperScheduling”, Technical Report GFD-I.4, Global Grid Forum, Scheduling Working Group, 2001.
32. D. Simmel et al., “Charter of the Global Grid Forum Grid Checkpoint Recovery Working Group”, Technical Report, 2004.
33. D. Thain and M. Livny, “Bypass: A Tool for Building Split Execution Systems”, in Proceedings of the Ninth IEEE Symposium on High Performance Distributed Computing (HPDC9), Pittsburgh, PA, 2000, pp. 79–85.
34. M.P. Thomas et al., “The GridPort Toolkit: A System for Building Grid Portals”, in Proc. of the Tenth IEEE International Symposium on High Performance Distributed Computing, 2001.