Technical Report DHPC-026

Resource Descriptions for Job Scheduling in DISCWorld

H.A. James and K.A. Hawick
Email: {heath,khawick}@cs.adelaide.edu.au
Fax: +61 8 8303 4366, Tel: +61 8 8303 4519
Department of Computer Science, University of Adelaide, SA 5005, Australia

January 1998

Abstract

The DISCWorld system requires the description and mapping of data, software and computational resources to allow user queries or requests to be scheduled. We review the distributed scheduling problem and discuss mechanisms for expressing and operating upon the necessary distributed job control information. We focus upon techniques that can be expressed in a portable fashion using Java and a World Wide Web software infrastructure, constraining the problem to deal with the well characterised high-level service applications employed in DISCWorld.

1 Introduction

Within the context of the DISCWorld [9], we recognise three different types of resources: hardware components, software components and data. Hardware components include the processing hosts, storage devices and interconnection networks that are used inside the DISCWorld. Software components (also called operations or services) are the methods by which users interact with data. They may be custom-written for use within the DISCWorld, or may be legacy software. Data resources encompass all data that exists in, or is created inside, the DISCWorld. The data may be injected into the system automatically (for example, archived satellite data), or may be derived as the result of operations. Auxiliary data, or meta-data, is also termed a resource, as it is available to be searched and can be used to derive new data.

User problems, or requests, are expressed as queries on the data. These queries are translated into a virtual computational network that maps the data onto services and returns results, or partial results, to the user. The scheduling of user problems is a mapping of the virtual computational network onto a logical network of hardware components.

In section 2 we describe the general problem of resource description, and the approach that we have taken in building the DISCWorld system. A taxonomy for distributed scheduling is described in section 3, followed by a positioning of the DISCWorld within the taxonomy. We discuss some issues in distributed scheduling in section 4, and the representation of job control information in section 5.

2 Resource Description

Resource description is the problem of providing a canonical name and description for all the resources inhabiting a distributed system. The description must be in terms of attributes of the resources that are useful in the system's context. There have been many attempts to solve the resource description problem, most of which focus on describing the machines that make up the distributed system. There has been less work on describing the data or software components that are also an integral part of the distributed system.

Resources in RDL [1] consist of a specification of a hardware structure and a process graph. Resources in Globus [6] are hardware components only, and are specified by name-value pairs using LDAP [5]. Queries on the hardware components are specified in a resource specification language [4].

Some systems [6, 8] are attempting to provide a general metacomputing service, where users can submit processes of their own devising, together with data formatted especially for those processes. We believe that this level of generality is too broad to be achievable using present technology. For this reason, we have decided to simplify our distributed system by restricting the types of processes and data which will be used within the DISCWorld. This approach considerably reduces the naming problem for operations and, by using a closed-world approach, makes it simpler to transfer and update information about the methods available on individual hardware resources.
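As a minimal sketch of the ideas in this section, and not a description of the DISCWorld implementation, the following Java fragment models a resource as a canonical name plus name-value attributes, and models the closed-world assumption as a registry in which only registered operations exist. All class, method, attribute and host names (ResourceDescription, OperationRegistry, "classifyImage", "hostA") are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical descriptor: a canonical name plus name-value attributes. */
final class ResourceDescription {
    enum Kind { HARDWARE, SOFTWARE, DATA }

    final String canonicalName;
    final Kind kind;
    private final Map<String, String> attributes = new LinkedHashMap<>();

    ResourceDescription(String canonicalName, Kind kind) {
        this.canonicalName = canonicalName;
        this.kind = kind;
    }

    /** Adds one attribute and returns this descriptor so calls can be chained. */
    ResourceDescription with(String name, String value) {
        attributes.put(name, value);
        return this;
    }

    /** Returns the attribute value, or null if the attribute was never set. */
    String get(String name) {
        return attributes.get(name);
    }
}

/** Closed-world registry: only operations registered here are known to the system. */
final class OperationRegistry {
    private final Map<String, Set<String>> hostsByOperation = new LinkedHashMap<>();

    void register(String operation, String host) {
        hostsByOperation.computeIfAbsent(operation, k -> new TreeSet<>()).add(host);
    }

    /** Returns the hosts offering the operation; empty if the operation is unknown. */
    Set<String> hostsFor(String operation) {
        return hostsByOperation.getOrDefault(operation, Set.of());
    }
}

final class ResourceDemo {
    public static void main(String[] args) {
        ResourceDescription host =
                new ResourceDescription("hostA", ResourceDescription.Kind.HARDWARE)
                        .with("processors", "4")
                        .with("jvm", "true");
        OperationRegistry registry = new OperationRegistry();
        registry.register("classifyImage", "hostA");
        registry.register("classifyImage", "hostB");
        System.out.println(host.canonicalName + " processors=" + host.get("processors"));
        System.out.println("classifyImage on " + registry.hostsFor("classifyImage"));
    }
}
```

Because the set of operations is closed, a lookup for an unregistered operation can simply return an empty set rather than triggering a wider search.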

3 Categorising DISCWorld Scheduling Requirements

The general problem of classifying approaches to resource management is described by Casavant and Kuhl [2]; another description is given by Kopetz [11]. Casavant and Kuhl attempted to construct a hierarchical taxonomy for scheduling systems. They concluded that some attributes of scheduling systems could not easily be placed in the hierarchical model. The attributes that did not fit neatly into the hierarchical taxonomy were: adaptive versus non-adaptive scheduling; load balancing; bidding; probabilistic assignment; and one-time assignment versus dynamic reassignment.

By its nature, the DISCWorld is a distributed system, made up of a heterogeneous collection of resources. It is for this reason that, at the first level of the taxonomy, the DISCWorld's scheduling system is global. The DISCWorld model has well-defined services, some of which will not be available on every hardware resource. Our current design is that the model will use a combination of static and dynamic scheduling. The traversal of the processor allocation solution space will almost certainly be suboptimal in general. An exception is when the search space is trivially small, for example when a hardware component is only aware of a single instance of a service.

We believe that the DISCWorld system will be fully adaptive to the conditions of the distributed system. For example, if a certain hardware component is receiving many requests for operations, the system may decide, according to some heuristic, to seek another hardware resource that can provide the same services, perhaps at a higher cost but with an improved response time.

Load balancing implies suspending a running process and moving it to a different hardware component. At the moment we do not see the DISCWorld as having this capability, as the granularity of requests and operations is coarser than the process level.

The taxonomy's classification of one-time assignment versus dynamic reassignment was originally meant to refer to a single job, which was essentially a single process. In the DISCWorld, a request may consist of many operations, and it is intended that the system be able to revoke previous scheduling decisions and reassign operations to different hardware resources according to the type of optimisation being used (performance, network utilisation, data locality).
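The adaptive behaviour described in this section can be illustrated with a small Java sketch. The heuristic shown is an assumption made purely for illustration: prefer the cheapest host whose load is below a threshold, and otherwise fall back to the least-loaded host that offers the operation, even if it costs more. The names Host, AdaptivePlacer and busyThreshold are hypothetical and do not come from the DISCWorld design.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Hypothetical view of a hardware resource as seen by the scheduler. */
final class Host {
    final String name;
    final Set<String> services;   // operations this host can run
    final double cost;            // assumed charge per request
    int pendingRequests;          // current load

    Host(String name, Set<String> services, double cost, int pendingRequests) {
        this.name = name;
        this.services = services;
        this.cost = cost;
        this.pendingRequests = pendingRequests;
    }
}

final class AdaptivePlacer {
    private final int busyThreshold;

    AdaptivePlacer(int busyThreshold) {
        this.busyThreshold = busyThreshold;
    }

    /**
     * Picks the cheapest host offering the operation whose load is below the
     * threshold; if every candidate is busy, picks the least-loaded candidate.
     * Returns empty when no host offers the operation.
     */
    Optional<Host> choose(String operation, List<Host> hosts) {
        Host cheapestFree = null;
        Host leastLoaded = null;
        for (Host h : hosts) {
            if (!h.services.contains(operation)) {
                continue; // not every service is available on every host
            }
            if (leastLoaded == null || h.pendingRequests < leastLoaded.pendingRequests) {
                leastLoaded = h;
            }
            if (h.pendingRequests < busyThreshold
                    && (cheapestFree == null || h.cost < cheapestFree.cost)) {
                cheapestFree = h;
            }
        }
        return Optional.ofNullable(cheapestFree != null ? cheapestFree : leastLoaded);
    }
}

final class PlacementDemo {
    public static void main(String[] args) {
        Host cheap = new Host("cheap", Set.of("classifyImage"), 1.0, 10);
        Host pricey = new Host("pricey", Set.of("classifyImage"), 5.0, 0);
        AdaptivePlacer placer = new AdaptivePlacer(5);
        // The cheap host is overloaded, so the request goes to the dearer one.
        System.out.println(placer.choose("classifyImage", List.of(cheap, pricey)).get().name);
    }
}
```

Calling choose again for an operation that has already been assigned is one simple way to model revoking an earlier decision once load conditions change.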

4 Distributed Scheduling with the WWW and Java

4.1 Experiences with ERIC

ERIC [10] is a prototype satellite imagery browser developed to demonstrate web-based control systems for environmental scientists. When developing the system, a number of issues were identified that require further research. These include thread safety, result naming, and master-slave processing.

ERIC implements a canonical naming mechanism by using long filenames. Unfortunately this solution does not scale to multiple applications.

Because the ERIC program was executed and controlled by the web server local to the machine on which it ran, it behaved as a master-slave program whenever actions were performed in parallel. The web server spawned a script that initiated a series of rsh programs and, although the processing was executed in parallel, there still existed a single point of failure which could prevent the computation from finishing. Thread safety was also needed to prevent the program from overwriting partial results.

4.2 Experiences with the Java Thread Model

Due to its platform independence, network orientation, and remote method invocation (RMI), Java [7] is becoming the de facto standard for writing portable software for the Internet. We have used Java's threads model [12] to build and simulate a multithreaded scheduling system. This has been achieved by designing and implementing a custom user-level thread scheduler to which users can submit jobs.

A sample thread scheduler has been built, based on the example given in [12]. It uses a circular linked list to implement the queue. Threads to be executed are placed into the list and the list is traversed in a FIFO manner. Another thread, of higher priority, creates new jobs at regular intervals and then sleeps, allowing the lower-priority threads in the list to be executed. Accounting information has been added to the thread scheduler in the form of a higher-priority thread which periodically wakes and counts the number of threads marked as alive.

Initially, the thread scheduler was built to replicate the different techniques for load balancing [15, 3, 18]. One of the purposes of building our own thread scheduler is to experiment with techniques for expressing and implementing general scheduling policies. We intend to create two threads for the execution of jobs, both of which are RMIs, and to study the behaviour of such a system.
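The queue discipline described above can be sketched without real threads. The following Java fragment is a deterministic simulation under stated assumptions: each job needs a fixed number of time slices, an ArrayDeque stands in for the circular linked list, and the job-creating and accounting threads are replaced by direct calls to submit and aliveCount. It is not the scheduler from [12] nor the one built for DISCWorld, and the names Job and RoundRobinScheduler are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** A simulated job that needs a fixed number of time slices to finish. */
final class Job {
    final String name;
    int remainingSlices;

    Job(String name, int slices) {
        if (slices < 1) {
            throw new IllegalArgumentException("a job needs at least one slice");
        }
        this.name = name;
        this.remainingSlices = slices;
    }

    boolean isAlive() {
        return remainingSlices > 0;
    }
}

/** Round-robin scheduler: the deque is traversed in FIFO order like a ring. */
final class RoundRobinScheduler {
    private final Deque<Job> ring = new ArrayDeque<>();
    private final List<String> trace = new ArrayList<>();

    /** Called by the (simulated) job-creating thread. */
    void submit(Job job) {
        ring.addLast(job);
    }

    /** Called by the (simulated) accounting thread. */
    int aliveCount() {
        int alive = 0;
        for (Job j : ring) {
            if (j.isAlive()) {
                alive++;
            }
        }
        return alive;
    }

    /** Runs one slice of the job at the head; returns false when the ring is empty. */
    boolean step() {
        Job job = ring.pollFirst();
        if (job == null) {
            return false;
        }
        job.remainingSlices--;
        trace.add(job.name);
        if (job.isAlive()) {
            ring.addLast(job); // unfinished jobs rejoin the back of the ring
        }
        return true;
    }

    /** Names of the jobs in the order their slices were executed. */
    List<String> trace() {
        return trace;
    }
}

final class SchedulerDemo {
    public static void main(String[] args) {
        RoundRobinScheduler scheduler = new RoundRobinScheduler();
        scheduler.submit(new Job("A", 2));
        scheduler.submit(new Job("B", 1));
        scheduler.step();                   // A runs one slice and is requeued
        scheduler.submit(new Job("C", 1));  // a new job arrives mid-run
        while (scheduler.step()) {
            // keep running slices until every job has finished
        }
        System.out.println(scheduler.trace());
    }
}
```

Keeping the policy inside step makes it straightforward to substitute other policies, such as priority ordering, without changing how jobs are submitted.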

5 Representing Job Control Information

Job control information is necessary to track a job's progress from the point of submission as a user query, through the execution of a network of operations, to the delivery of a result to the user. This information is used for tasks such as job monitoring and tracking. It would be used, for example, if the user wished to cancel the query or enquire as to its progress. In this section we describe some of the efforts that are being undertaken to represent job control information.

A resource description language is being developed to describe DISCWorld resources. It is anticipated that resources will be described in enough detail to allow reasoning about them, selection among them, and their assembly into virtual computational networks (VCNs). The VCNs produced will provide intermediate representations for requests so that they can be analysed for possible parallel execution and optimisation. It is important that the options for trade-off can be expressed in both the resource description language and the VCN language.

We have developed a prototype EBNF [17] specification for describing VCNs. It is intended that the user request will be translated to a VCN, which will be passed to a parser. The parser will be the means by which the VCN is executed. The VCN is broken into sections, each describing a facet of the VCN. For example, the different sections describe: the operations; the parameters; the outline of the computation; and the result delivery mechanism.

The operations are listed and given labels, by which they are referred to throughout the script. A parameter section is necessary as each operation may have more than one parameter, and the output of one operation may be used as the direct input of another. There are two types of parameters, shared and literal. Literal parameters are passed to the operation verbatim; shared parameters act as linear pipes (cf. Helios [14]), with data sources and data sinks. The Unix pipes [16] model is not sufficient, as it does not allow the results of operations other than the immediately preceding one to be shared. The outline section details the way in which the computation is to proceed (i.e. whether the operations are to be executed serially, in parallel, or whether values are to be taken from shared parameters and used as loop control variables in later operations).

One of the deficiencies of the specification is that it does not allow the generator of the query to provide any meta-information as to the preferred services (see section 6). Another deficiency is that separate sections must currently be used to specify all the necessary information.
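To illustrate the distinction between literal and shared parameters, the following Java sketch builds a small in-memory VCN and executes it serially. It is a hypothetical representation only: it does not reproduce the prototype EBNF specification, and the names Vcn, Operation, Param and the operation labels are assumptions. Each operation writes its result to a named slot, and any later operation may read that slot, which is the behaviour a plain Unix pipeline cannot express.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** A parameter is either literal text or the name of a shared slot. */
final class Param {
    final boolean shared;
    final String value;

    private Param(boolean shared, String value) {
        this.shared = shared;
        this.value = value;
    }

    static Param literal(String text) {
        return new Param(false, text);
    }

    static Param shared(String slot) {
        return new Param(true, slot);
    }
}

/** A labelled operation whose result is stored in a named output slot. */
final class Operation {
    final String label;
    final String outputSlot;
    final List<Param> inputs;
    final Function<List<String>, String> body;

    Operation(String label, String outputSlot,
              Function<List<String>, String> body, Param... inputs) {
        this.label = label;
        this.outputSlot = outputSlot;
        this.body = body;
        this.inputs = List.of(inputs);
    }

    /** Stand-in body that only records its label and arguments as text. */
    static Operation tracing(String label, String outputSlot, Param... inputs) {
        return new Operation(label, outputSlot,
                args -> label + "(" + String.join(",", args) + ")", inputs);
    }
}

/** Hypothetical VCN whose outline is a simple serial list of operations. */
final class Vcn {
    private final List<Operation> outline = new ArrayList<>();

    Vcn add(Operation op) {
        outline.add(op);
        return this;
    }

    /** Executes the outline in order and returns every slot that was produced. */
    Map<String, String> runSerially() {
        Map<String, String> slots = new LinkedHashMap<>();
        for (Operation op : outline) {
            List<String> args = new ArrayList<>();
            for (Param p : op.inputs) {
                if (!p.shared) {
                    args.add(p.value); // literal: passed verbatim
                } else if (slots.containsKey(p.value)) {
                    args.add(slots.get(p.value)); // shared: read an earlier result
                } else {
                    throw new IllegalStateException("shared parameter not produced: " + p.value);
                }
            }
            slots.put(op.outputSlot, op.body.apply(args));
        }
        return slots;
    }
}

final class VcnDemo {
    public static void main(String[] args) {
        Map<String, String> slots = new Vcn()
                .add(Operation.tracing("load", "raw", Param.literal("scene")))
                .add(Operation.tracing("crop", "cropped", Param.shared("raw"), Param.literal("10x10")))
                // overlay reads "raw", which is not the immediately preceding result
                .add(Operation.tracing("overlay", "result", Param.shared("raw"), Param.shared("cropped")))
                .runSerially();
        System.out.println(slots.get("result"));
    }
}
```

A parallel outline could be modelled in the same structure by grouping operations whose shared inputs are already available, although that is not shown here.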

6 Discussion

The restriction to well-known services gives users greater control in constructing sequences of services that operate on data already present in the system, and allows the data that is produced to be recorded and managed by the system.

It is anticipated that the user will have available different versions of the same operation. All operations of the same name will, of course, perform the same basic function, but versions may vary in one or more respects: the standard operation may be written in Java and provide standard precision; a higher-precision version may be present; a fast C version may be supplied; or a parallel or distributed-memory version may be supplied for large data sets. In terms of cost, the more specialised an implementation of the operation is, the more the user will be charged. It is, then, a form of trade-off.

Another form of trade-off which will be presented to the user is that of priority. It is anticipated that, with many domain experts using the system, many requests will be executing at once. Some of the users may be involved in emergency services and disaster planning, and those users may require answers to requests quickly. As such, a priority system will be introduced, with a cost-performance curve from which the user can select.

It is important to note that not all services will be available on every machine within the DISCWorld. For example, the services that have very high performance for large datasets may only be available on parallel hardware resources.

It can be seen from the discussion in section 2 that not all of the operations in the DISCWorld will be written in Java. Indeed, there are a number of legacy applications that it will be necessary to support [13]. Unfortunately, the hosts on which the legacy applications reside may not support the Java Virtual Machine. One example of this is the Thinking Machines CM-5. Any operations that are to be run on such a host will have to be controlled remotely, possibly by a native method.

We believe that although there are other important components to the DISCWorld scheduling system, user requests should remain central to the design process. We also believe that, to achieve platform independence, the distributed scheduler in DISCWorld should be built using Java, which provides easy ways to interface with native methods and with remote Java methods using RMI.

We have designed and are currently building a distributed scheduling system using RMI that allows both local and remote daemons to request the placement of jobs into execution queues. Execution queues may be operation-specific, priority-based, or general queues. Operation-specific execution queues only accept requests for that particular operation, while priority queues execute any operation provided its priority is set above a threshold. Both of these queue types have a maximum number of jobs that may be queued at once. General execution queues are designed to accept requests for operations that do not have a specific queue and are not of high enough priority to be placed on a priority queue.

In summary, we have built a user-level thread scheduler in Java that was used to implement a number of different scheduling policies. We have discussed the different types of resources within the DISCWorld (hardware components, software components and data), and have discussed the organisation of these resources into user requests.
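The placement rules for the three kinds of execution queue described in this section can be sketched in Java as follows. This is an illustration under stated assumptions, not the RMI-based system itself: the queues are local objects, a single capacity applies to both bounded queue types, and a request that finds its operation-specific queue full falls through to the priority or general queue, a case on which the text above is silent. The names JobRequest and QueuePlacer are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Hypothetical job request: the operation wanted and the user's priority. */
final class JobRequest {
    final String operation;
    final int priority;

    JobRequest(String operation, int priority) {
        this.operation = operation;
        this.priority = priority;
    }
}

final class QueuePlacer {
    private final Map<String, Queue<JobRequest>> operationQueues = new HashMap<>();
    private final Queue<JobRequest> priorityQueue = new ArrayDeque<>();
    private final Queue<JobRequest> generalQueue = new ArrayDeque<>();
    private final int capacity;           // assumed shared limit for bounded queues
    private final int priorityThreshold;  // priority must exceed this value

    QueuePlacer(int capacity, int priorityThreshold) {
        this.capacity = capacity;
        this.priorityThreshold = priorityThreshold;
    }

    /** Creates a bounded queue dedicated to one operation. */
    void addOperationQueue(String operation) {
        operationQueues.putIfAbsent(operation, new ArrayDeque<>());
    }

    /** Places the job and returns the name of the queue that accepted it. */
    String place(JobRequest job) {
        Queue<JobRequest> specific = operationQueues.get(job.operation);
        if (specific != null && specific.size() < capacity) {
            specific.add(job);
            return "operation:" + job.operation;
        }
        if (job.priority > priorityThreshold && priorityQueue.size() < capacity) {
            priorityQueue.add(job);
            return "priority";
        }
        generalQueue.add(job); // unbounded catch-all
        return "general";
    }
}

final class QueueDemo {
    public static void main(String[] args) {
        QueuePlacer placer = new QueuePlacer(1, 5);
        placer.addOperationQueue("fft");
        System.out.println(placer.place(new JobRequest("fft", 1)));  // dedicated queue
        System.out.println(placer.place(new JobRequest("fft", 9)));  // dedicated queue is full
        System.out.println(placer.place(new JobRequest("sort", 2))); // no queue, low priority
    }
}
```

In a distributed setting the place method is the natural candidate for exposure as a remote method, so that both local and remote daemons use the same entry point.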

7 Acknowledgements

The Distributed High Performance Computing Infrastructure (DHPC-I) is a project of the Research Data Networks Cooperative Research Center (RDN CRC) and is managed under the On-Line Data Archives (OLDA) Program of the Advanced Computational Systems CRC. RDN and ACSys are established under the Australian Government’s CRC Program.

References

[1] B. Bauer and F. Ramme. A general purpose Resource Description Language. In R. Grebe and M. Baumann, editors, TAT'91, pages 68–75. Springer Verlag, 1991.

[2] T. L. Casavant and J. G. Kuhl. A Taxonomy of Scheduling in General-Purpose Distributed Computing Systems, May 1996.

[3] George Cybenko. Dynamic Load Balancing for Distributed Memory Multiprocessors. Journal of Parallel and Distributed Computing, 7:279–301, 1989.

[4] Karl Czajkowski, Ian Foster, Carl Kesselman, Stuart Martin, Warren Smith, and Steven Tuecke. A Resource Management Architecture for Metacomputing Systems, 1997.

[5] S. Fitzgerald, I. Foster, C. Kesselman, G. von Laszewski, W. Smith, and S. Tuecke. A Directory Service for Configuring High-Performance Distributed Computations. In IEEE, editor, Proceedings of the 6th IEEE Symposium on High-Performance Distributed Computing 1997, 1997.

[6] Ian Foster and Carl Kesselman. Globus: A Metacomputing Infrastructure Toolkit. International Journal of Supercomputer Applications, 1996.

[7] James Gosling, Bill Joy, and Guy Steele. The Java Language Specification. JavaSoft Series. Addison Wesley, 1996.

[8] Andrew S. Grimshaw and Wm. A. Wulf. Legion – A View From 50,000 Feet. In IEEE, editor, Proceedings of the Fifth IEEE International Symposium on High Performance Distributed Computing, Los Alamos, California, August 1996. IEEE Computer Society Press.

[9] K. A. Hawick, A. L. Brown, P. D. Coddington, J. F. Hercus, H. A. James, K. E. Kerry, K. J. Maciunas, J. A. Mathew, C. J. Patten, A. J. Silis, and F. A. Vaughan. DISCWorld: An Integrated Data Environment for Distributed High-Performance Computing. In Proceedings of the Fifth IDEA Workshop, February 1998. Also DHPC Technical Report DHPC-027.

[10] H. A. James and K. A. Hawick. A Web-based Interface for On-Demand Processing of Satellite Imagery Archives. In Proc of Australian Computer Science Conference, 1998. Also DHPC Technical Report DHPC-018.

[11] Hermann Kopetz. Scheduling. In Sape Mullender, editor, Distributed Systems. ACM Press, 1993. ISBN 0-201-62427-3.

[12] Scott Oaks and Henry Wong. Java Threads. Nutshell Handbook. O'Reilly & Associates, Inc., United States of America, 1st edition, 1997. ISBN 1-56592-216-6.

[13] C. J. Patten, F. A. Vaughan, K. A. Hawick, and A. L. Brown. DWorFS: File System Support for Legacy Applications in DISCWorld. In Proceedings of the Fifth IDEA Workshop, February 1998. Also DHPC Technical Report DHPC-032.

[14] Perihelion Software. The Helios operating system. Prentice Hall International, 1989.

[15] Niranjan G. Shivaratri, Phillip Krueger, and Mukesh Singhal. Load Distributing for Locally Distributed Systems. Computer, 25(12):33–44, December 1992.

[16] W. Richard Stevens. UNIX Network Programming. Prentice Hall, Inc., 1990.

[17] Niklaus Wirth. What Can We Do About the Unnecessary Diversity of Notation for Syntactic Definitions. Communications of the ACM, 20(11):822–823, November 1977.

[18] Min-You Wu. On Runtime Parallel Scheduling for Processor Load Balancing. IEEE Transactions on Parallel and Distributed Systems, 8(2):173–186, February 1997.