Technical Report DHPC-038
Service Scheduling on Wide-Area Metacomputer Clusters

K.A. Hawick, P.D. Coddington and H.A. James

Advanced Computational Systems CRC
Department of Computer Science, University of Adelaide, SA 5005, Australia
Tel +61 08 8303 4519, Fax +61 08 8303 4366
Email {khawick,paulc,heath}@cs.adelaide.edu.au
16 May 1998

Abstract

It is a significant problem to provide a robust and portable software environment that can link together clusters of workstations and other heterogeneous computers. There are particular difficulties when the computer clusters to be managed transcend administrative boundaries across wide-area networks. We review some of the technologies that have emerged recently for managing arbitrary computer programs across clusters of computers, and use our experiences with such systems to illustrate the difficulties in managing systems across wide areas. A simplifying approach is to limit the services provided across wide-area clusters to well-defined processing and data access modules that are specified a priori and are advertised between servers. Client programs can then invoke queries on databases, and set up processing tasks based on combinations of these well-defined services. Developers can build new modules or services conforming to a well specified application programming interface, and new services can be tested within administrative boundaries before being made available across wide-area clusters. This is the approach we take with our DISCWorld metacomputing environment. We focus on a description of the scheduling aspects involved in managing multiple job streams across wide-area clusters to optimise either user response-time or cluster utilisation. We describe how a serverless or non-hierarchical architecture maintains scalability when additional cluster nodes are added. This high-level service-based approach provides a higher granularity of distributed computation than other systems and provides a way to amortise the latency that accrues over wide areas. Services can be provided as portable code modules that may run on a variety of service providers, such as Java modules running on distributed Java Virtual Machines, or can be optimised native code that runs on specific high-performance resources in the clusters. This provides a way of encapsulating parallel supercomputers in a wide-area cluster environment.

Keywords: metacomputing; scheduling; wide-area distributed computing; latency tolerance; embedded supercomputing.
1 Introduction
The problem of managing clustered computing resources is still a challenging one which has only been partially addressed by available scheduling management software. The general issues of scheduling executable jobs in time and onto particular platforms, especially in heterogeneous clusters, are not easily solved in their entirety, but can be addressed with an appropriate software environment. We describe the metacomputing approach we are adopting with our DISCWorld environment [19] and develop a model framework for handling the scheduling of complex services by decomposing them into well characterised component parts. We illustrate this with a systematic set of performance measurements across a heterogeneous cluster of clusters using different scheduling management policies. Rather than consider the open-ended problem of optimally scheduling arbitrary computations on a heterogeneous cluster, we focus on applications that can be described using a known basis set of well-characterised component parts. This is well suited to many application regimes, such as decision support for land resource management[8] and defence related imagery analysis and processing[9]. We describe and discuss some of the general issues for a successful scheduler system in section 2 and focus on the particular needs of some typical end-user applications in section 3. We review some scheduling, batch queuing and metacomputing environments in section 4. We present some measurements and analysis of heterogeneous clusters in section 5 and develop a simple model for service based computing in section 6. Finally, in section 7 we discuss our work in incorporating these ideas for an adaptive model into the DISCWorld system, and we summarise our conclusions in section 8.
2 Scheduling Issues
The general scheduling problem can be described as the need to optimise the placement, and hence execution, of a number of jobs on a set of computing resources. Jobs may require various resources, but for the purposes of this paper we will simply characterise a job as a program that requires some processing time on some compute platform, and also some storage resource for input and output data. Schedule optimisation in this simple case can be either to optimise the absolute response time perceived by the users of certain jobs, or to make the best utilisation of the cluster of resources. In practice, both these criteria may need to be met on a practical system. In addition to these efficiency or response time criteria, some other general issues for scheduling on clustered computing resources are those of establishing an adaptive system that can cope with changing loads on the cluster nodes, and changing requirements from users, as well as providing an implementation robust enough to cope with component failures. This robustness against unexpected component failures, even temporary ones, is crucial for a system that will operate over wide areas. In a very large system, the probability of at least one component failing or becoming temporarily unavailable is quite high, and therefore this case has to be treated as a common or routine event. Examples might be a workstation in a cluster being rebooted to allow installation of some new software, a network component failing temporarily, or a resource becoming effectively unusable temporarily due to heavy loads on the network connecting it to the rest of the cluster. It is also extremely important to allow effective scheduling on heterogeneous clusters. Tightly coupled parallel supercomputers which are built from a homogeneous set of workstation-like nodes are still commonly available, but probably represent a minority amongst the clustered computer systems in widespread use around the world today and in the foreseeable future. Heterogeneous clusters with mixes of different architectures, or even just different models and clock speed versions of the same architecture, are far more common. Scheduling systems that rely on being able to predict performance and load from homogeneity of the cluster nodes are therefore considerably less useful. We discuss general scheduling issues below and review some of the software systems, both research systems
as well as production level ones, that are presently available in section 4. Some of the general issues for a cluster scheduling system include: job granularity; job and platform heterogeneity; job latency; portable job specification; incorporation of special resources such as supercomputers; and economic issues of cost performance that a practical scheduler must address. We discuss these issues in detail below. It is important to control or have knowledge about the granularity of jobs that are to be scheduled. We are particularly interested in wide-area clusters where the startup cost of remotely initiating jobs at a long distance may be comparable with the time cost of executing the job itself. We have recently carried out experiments with a long distance broadband network connecting Australia and Japan, where the network latency is approximately 200ms. The return trip time for such a network implies a significant cost in carrying out remote computations, even if the throughput benefits of having remote access to a very powerful supercomputer resource might justify incorporating it in a cluster. If jobs of a known granularity can be identified, then it is possible for the scheduler to make an intelligent decision regarding the tradeoff of placing them on fast resources at a distance, which may give an effectively longer time to complete due to the latency effect, or on nearby resources that may be slower at carrying out the computations themselves. Latencies or startup costs are also common when involving supercomputer nodes in a cluster, even if they are connected to the scheduler by low latency networks. Such systems often have their own queueing system or special job startup software that can be slow for small jobs compared to the total time to run them. Specifying jobs portably is a complex issue. Traditional scheduling or batching environments provide a mechanism to submit a script, such as a shell script on Unix based systems, that describes an entire environment for running arbitrary programs. This is often convenient for building services from legacy software components. We are working on an environment that will host jobs constructed from portable Java code as well as providing Java wrappers to system calls which invoke full shell scripts. A range of scheduling and environment techniques becomes applicable if the support of legacy code is not necessary. Also, by restricting the runnable services to those composed only of components that are well characterised a priori (in complexity) and have predictable performance, we allow the scheduler much more freedom to optimise for response time or resource utilisation. Achieving code portability is generally difficult in the case of incorporating supercomputers. Such systems are generally capable of performing certain operations very well, but may not perform well executing arbitrary codes. This is particularly so for parallel supercomputers, which are usually only well optimised for a particular model of parallelism. Our approach is to use parallel supercomputers as specialised service providers. This works well in the case of numerical or linear algebra services, where the supercomputer can be coded in its preferred native language (such as High Performance Fortran[23]) rather than in Java. In some cases it is appropriate for a scheduler to optimise purely for response time.
In many practical cases, however, economic grounds [29] must be taken into account, and jobs may be separable into those belonging to users who want a fast response and are willing to pay in some sense for it, and those belonging to users who will accept any response time with the cheapest resources available. Economic issues become particularly interesting for a clustered system which transcends administrative or ownership boundaries. Organisations normally buy compute resources to satisfy their own needs, often compromising between desired peak load capacity and price. However, in many cases organisations have machines operating with no load and could be persuaded to allow their resources to be viewed as part of a wide-area shared cluster resource. It then becomes important to collect billing information and provide some sort of costing mechanism and usage policy. We believe this is a worthwhile area to pursue as cluster computing becomes more widespread. Costing and policy control systems have not yet been well developed. Marshalling parameters and migrating data, particularly bulk data between job components, adds another dimension to the tradeoff space a scheduler needs to operate in. Additional interesting tradeoffs arise when the jobs run on clusters require access to shared and complex resources such as specialist storage, processing or real time devices. We are presently investigating some of these effects but do not report on them in this paper. They can be addressed in the DISCWorld framework by restricting jobs to be composed of well-defined a priori services which are executed based on a choice of implemented components on either general purpose nodes or specialist nodes which participate in the cluster.
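To illustrate the granularity and latency tradeoff discussed above, the following small sketch (in Java; the class, method names and figures are ours for illustration and not part of DISCWorld or dploader) estimates whether a job of known granularity is better placed on a fast remote node, paying the round-trip startup latency, or on a slower local node:

// Illustrative only: estimate where to place a job of known granularity.
// All names and numbers here are assumptions for the example, not measured values.
public class PlacementEstimate {
    // Estimated completion time = startup latency + work / relative speed.
    static double completionTime(double workSeconds, double relativeSpeed,
                                 double startupLatencySeconds) {
        return startupLatencySeconds + workSeconds / relativeSpeed;
    }

    public static void main(String[] args) {
        double work = 2.0;            // seconds of work on a reference machine
        double remote = completionTime(work, 10.0, 0.4);  // fast remote node, ~0.4 s startup
        double local  = completionTime(work, 1.0, 0.01);  // slower local node, negligible startup
        System.out.printf("remote %.2f s, local %.2f s -> prefer %s%n",
                remote, local, remote < local ? "remote" : "local");
    }
}

For very fine-grained jobs the latency term dominates and the local node wins; as the job granularity grows the fast remote resource becomes worthwhile, which is precisely the tradeoff a wide-area scheduler must weigh.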
The networks connecting various hosts in such a distributed computational environment will almost assuredly have different performance characteristics. Our DISCWorld environment specifically targets systems where a variety of different networks (possibly built from completely different technologies such as ATM or ethernet) are used to interconnect cluster hosts. Some hosts may even have dual port interfaces and have access to a slow control network for low volume traffic as well as a broadband high-volume network. A number of additional issues arise when the cluster hosts that are being used in the distributed computing environment cross administrative boundaries. These include: authentication of users across boundaries and establishing trust relationships between servers; managing the user environment on different systems; transferring data safely across fire-walls; and managing temporary working space, particularly for large data sets. We believe that current authentication technologies can be used in DISCWorld to address the issue of security, providing a sensible protocol is used for exchanging digitally signed or encrypted data between nodes in a clustered environment[17]. We are currently working on a wide-area data management system that may address the problem of distributed temporary work space[26].
3 Application User Scenarios
The service based approach is useful for a number of application areas. We describe four scenarios here: land and natural resource management applications; military and defence related decision support applications; research based on large datasets of scientific data; and the ingestion applications necessary to operate on-line data archives of public or government funded data. Consider the decision support example[8] of a station manager or farmer planning his year’s activities and wishing to make sensible decisions regarding irrigation and rainfall runoff; crop rotation and planting; expected yield and optimal harvest times; and other land care operations. How can he exploit the data that may be available to him to aid his decision making process? Some data will be available in the form of highly sophisticated processed data products such as short and long term weather forecasts. These may be available at the resolution and localisation required for precision agriculture decisions, or may only be available in undigested form. How can a processing and automatic product creation framework be put in place to allow those organisations who do have the necessary skills to create the desired products to do so economically for what might be a set of isolated one-off sales? A wide-area metacomputing environment built using a set of clustered computing resources can be set up to provide a common shared resource for the customers (farmers and land managers) to interact with the value-adders (organisations who can create processed products from raw data) and the raw data suppliers or government custodians. What is needed is a suitable set of middleware or software that can provide the necessary interoperability and scheduling of the necessary computing services. Consider the types of services that might be required for the land care and management decision support scenario we have outlined. These are not arbitrary programs running with arbitrary data. Instead they are generally drawn from a subset of well characterised application components with known performance requirements, and will typically be run on a very restricted set of data sets and data set sizes. Consequently it is possible to set up a scheduling management system that has access to the characteristics of each application component and can therefore predict, and hence optimise more accurately, the cost and time needed to carry it out on a set of computing resources. The smart scheduler is therefore able to make very good use of the whole set of resources under its control, or can organise user requests in priority order, running the most important or urgent on fast resources, and the lower priority actions on slower, cheaper queues. Given that a restricted, finite or manageable set of applications and data sets are being used under the system as a whole, it is also possible to effectively cache or store frequently accessed or requested data products. Suppose a farmer has requested information regarding crop yield and optimal harvest time predictions. Crop acreage may already be known to him, or can be accessed from either a land registry database or perhaps calculated from a satellite image which is geo-rectified and registered to allow a geographical area calculation[20]. Weather
predictions for the region and perhaps weather trends from previous years may be combined to calculate a likely rain pattern prediction, from which a crop yield estimate may be possible. A practical system might provide a series of possible harvest time predictions and options with the computed consequences for the farmer. The decision product delivered to the farmer may only be the summarised output of the calculations - a few kilobytes perhaps - whereas the raw data upon which the calculation was based may have represented many gigabytes of spatial imagery, digital terrain data, runoff patterns and so forth. Similarly, the computational power of the client computer the farmer might use to place the query and request the decision support product might be low, whereas many processing cycles may have been used by the value-adding organisation in creating the data product. Suppose the farmer makes a similar request the next day, but perhaps with some additional new information. By controlling and caching the intermediate data products the value-adding organisation used to create the product, such as the raw satellite imagery for the farmer’s particular region, it may well be possible to save considerable re-processing time. The profit margin on the second product will be much greater - or of course the value-adder may choose to pass on the margin saved to its customers in the form of a cheaper price. This model of being able to cache intermediate data products may provide significant savings and better utilisation of resources. It can only be set up under a smart caching system, however; it would be too hard to track the necessary data items manually.
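A minimal sketch of the kind of smart cache implied here is given below, keyed on a service name and its parameters; the class and key format are illustrative assumptions rather than the DISCWorld implementation:

import java.util.HashMap;
import java.util.Map;

// Illustrative cache of intermediate data products, keyed by service name plus
// its parameters (e.g. "georectify:region=..."). Hypothetical names only.
public class ProductCache {
    private final Map<String, byte[]> products = new HashMap<>();

    // Return the cached product if it has already been derived, otherwise
    // compute it once and remember it for subsequent requests.
    public byte[] getOrCompute(String serviceName, String parameters,
                               java.util.function.Supplier<byte[]> compute) {
        String key = serviceName + ":" + parameters;
        return products.computeIfAbsent(key, k -> compute.get());
    }
}

A repeat request the next day for the same region would then hit the cache rather than re-deriving the intermediate products from the raw imagery.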
Figure 1: On a clustered metacomputing environment a complex query can be broken down into a set of computable parts. (A general query, e.g. when to harvest a crop, together with the crop type, acreage and value, optimising criteria such as yield or income, and constraints such as weather, labour and equipment, is broken down into computable parts to produce a decision support answer and possible outcomes, with consequences such as yield, income and opportunity costs.)
The farmer’s query is decomposed into a sequence of well-defined and computable steps in the example above. This is shown in figure 1. Each of these steps might be an application component with a well characterised performance that can be controlled by the smart scheduler. The value-adding organisation that provides some particular services and data products might have access to archived raw data, either as a locally archived copy or from some other data provider such as a government agency or another vendor. Our value-adder might be able to respond semi-automatically to the incoming query and deliver the product and a bill to the farmer using a suitable scheduling environment which manages its computing resources. Other applications that are well suited to cluster computing share many of the properties we have described above. Defence applications such as reconnaissance and intelligence product processing can be viewed in a similar model framework. Intelligence products originate from satellite or air reconnaissance flights and are archived using various technologies including digital data systems. Various organisations within a government’s defence forces may collect and archive their own data and may make it available to each other. Sources may vary in quality and type considerably. Some organisations may work using an entirely in-house customised system, but more frequently economic and inter-operation reasons require various value-adding relationships to be set up across the whole defence force to share data, to derive intelligence products from multiple sources and to ‘on-sell’ derived products for decision support. Human analysts may be working as end-users or as value-adders in the system itself, combining data products into decision support material for
those processes that are not yet automatable. The defence community has some additional constraints but overall the model is very similar to that for commercial value-adding of land resource data. Some additional properties however include: real-time or near real-time data delivery; tighter security and secrecy between components of the system; encryption and restrictions to certain parts of the ‘market’; and closed rather than open product catalogues. In spite of these differences, the quality and quantity of the data components and the automated and human enhanced processing activities still conform to the decision support information delivery model outlined above[9]. One of our original motivations for DISCWorld was to enable processing of very large scientific data sets such as those that result from scientific simulations or from some of the multi-channel satellite data. Present levels of software technology make it difficult to carry out post processing on large datasets without building a customised software system that can manage the loading of data from tape to temporary disk and can loop over the data sets extracting the sub-sets necessary for a particular calculation. This model of operation is also common for post processing of computational physics and chemistry simulations[2], particularly for fields like quantum chromodynamics. Scientists in these disciplines will typically run simulation codes on supercomputing facilities, saving sets of statistics and model output configurations which represent highly valuable data sets and which can generally be post-analysed. These data sets are effectively experimental data, although they have been gathered from numerical experiments. Processing resources can be pooled together by organisations to provide better response time or resource utilisation than if individual departments under-utilised their own resources most of the time just to be able to handle their occasional peak capacity requirements. In particular, many organisations have workstations or clusters of personal computers that are sporadically used by individuals during working hours but which stand idle overnight and over weekends and holidays, often still switched on. It is often too difficult to make use of these lost cycles due to a lack of smart interoperable scheduling software environments. Improved scheduling on clusters can turn an unused resource into a virtual supercomputer for these organisations. In summary, better middleware to manage storage of raw data, under-utilised processing capabilities such as already existing clusters of workstations, and data network delivery mechanisms such as smart caches and pre-fetching, can be used to significantly enhance the performance and cost performance perceived by decision support users. This is particularly effective over large wide-area clusters of resources, where one might expect the statistical load fluctuations from individual users to be balanced out. A number of middleware products have already been attempted and several techniques are presently being researched. In the next section we review some of the major systems and their key features.
4 Scheduler Software Systems
To provide the necessary scheduling of services for remote users such as we describe in section 3, there are a number of software models possible. A centralised system is relatively easy to design and build, with all job requests passing through a single control gateway. This approach is the one traditionally used to manage single installations and especially supercomputers. Various software systems have been developed that extend this model to handle clustered resources, but at the time of writing the problem of allowing wide-area clusters that are separately owned to inter-operate and share jobs has not been adequately addressed. In this section we briefly review some of the existing software systems for localised batch scheduling as well as the more recent innovative approaches to metacomputing environments that do allow wide-area clustering. Batch queueing systems, the first class of software environment, are characterised by the user submitting inherently serial code with no intercommunication between different pieces of code. Examples of this class are PBS [4], NQS [22], DQS [15] and LoadLeveler [10]. These systems are compared and contrasted in detail in [3]. Portable Batch System (PBS) was a project initiated by NASA to create an extensible batch processing system for heterogeneous networks. PBS allows users to initiate and execute the scheduling of batch jobs, and allows routing of jobs between different hosts. Users must specify, in advance, any special resources
that they need to complete a job. Networked Queueing System (NQS) was also developed at NASA, with much the same goals and functionality as PBS, and has since been developed at the University of Sheffield. It extends the PBS system with the addition of rudimentary file staging support. Distributed Queueing System (DQS), developed by SCRI at Florida State University, provides the functionality of a queue system with the addition of parallel package support and multiple, redundant queue masters for fault-tolerance. The main task of batch queueing systems is to level the load on machines, thus increasing the average utilisation of resources. This may not increase the throughput as seen by the user, due to administrator-imposed restrictions on where certain programs may be run. It is not usually the case that a batch queueing system will have the ability to make intelligent scheduling decisions based on the contents of the job – instead most rely on the user informing the system of the expected performance characteristics. The user does not always have this information, however. On the border between batch queueing systems and general metacomputing systems are packages such as Nimrod [1] and Codine [13]. Nimrod is a system for parameterised simulation using a batch queue as a back end. Users specify a number of different parameters and Nimrod enumerates over the cross-product of all valid parameters, using the queueing software to execute and control the simulation on a collection of homogeneous computers. Codine is a batch queueing system which has additional support for heterogeneous platforms and jobs that may involve parallel computation. Metacomputing environments allow users to submit arbitrary pieces of code, containing serial or parallel components, and have these jobs run on what may be seen as a single virtual computer consisting of many, possibly heterogeneous, machines. It is the metacomputing software’s job to properly assign and manage the resources (in the form of compute nodes) to achieve either the best resource utilisation for the owner or the best throughput for the user. Some examples of metacomputing environments are: Globus/Nexus [11]; Legion [16]; Infospheres [7]; and Prospero Resource Manager [25]. Globus/Nexus is a metacomputing infrastructure toolkit, developed at Argonne National Laboratories, to support communications across distributed computational and information resources. It uses Nexus to provide the primitives with which distributed applications communicate. Based at the University of Virginia, Legion is a metacomputing project which is based on the Mentat object oriented platform. Mentat provides infrastructure for multiple communication methods, as supported by the heterogeneous hardware on which it runs. Infospheres, a metacomputing project at the California Institute of Technology, is designed around using Java and the WWW for communications and is quite similar, in many respects, to our DISCWorld system. Prospero Resource Manager (PRM) enables users to run sequential and parallel jobs across a large number of heterogeneous machines connected via local- or wide-area networks. Users must specify, in advance, the number and type of nodes that their jobs require. We believe there are a number of fundamental problems that limit the applicability of some of these systems to the broader problem of distributed computing for the non-computer scientist. One of the problems is that of managing the heterogeneity of machines.
Some environments manage the problem by allowing users to submit arbitrary binaries, but force the user to supply data as to the architecture and number of nodes requested [11, 25], while others use programming systems which are portable across a variety of architectures [16, 7]. Packages such as the Distributed Computing Environment (DCE) [27] are designed to facilitate the development and execution of distributed applications in heterogeneous environments. We found that while the security and authentication services were beneficial, the DCE system, as a whole, appreciably slowed the systems down. This may be a worthwhile tradeoff of performance to attain an extra level of security, but presents a significant impediment to efficient use of DCE in a high-performance clustered environment. One of the fundamental problems in managing distributed systems is that of whether to allow users to submit arbitrary binaries to be run directly on the machine. This raises security issues, as the user process will run with the privileges of the user. Sometimes, too, it is infeasible to allow every possible user a valid login to every machine in the metacomputing environment. We believe it is preferable to have a daemon controlling access
to the resources, one which knows about individual users, runs programs on their behalf, and records resource usage and charges them appropriately. The problem of the user being forced to log into a machine of the same architecture as the target is significant, as the user really needs to ensure the program’s correctness before submitting it to run on a machine that they will be charged for. System administrators are unlikely to advocate a metacomputing environment which requires kernel modifications. For this reason, we believe that the daemons that control users’ programs should run in user space, or, at the very most, be spawned by the machine at startup. We believe it is possible to avoid system level kernel modifications entirely, and our DISCWorld system runs as a set of user level daemons.
5 Cluster Performance Analysis
In this section we describe a series of measurements we carried out on combinations of three different sub-clusters of computers. Specifically, we used: DEC Alpha workstations running Digital Unix and interconnected by ATM fibres; Sun Ultra workstations running Solaris and connected by 10-baseT ethernet; and a Beowulf-style cluster of 486 and Pentium PCs running the Linux operating system. By combining the three sub-clusters, each of seven nodes of a different architecture, we obtained the performance data shown in figure 2. This illustrates the effect of running N compute jobs on each of the seven combined cluster configurations shown in table 1. This data was all measured using our prototype scheduler framework using a set of satellite data processing jobs.

  Sub-Cluster          Components
  Alpha-A              7 Local DEC Alpha Workstations
  Alpha-R              7 Remote DEC Alpha Workstations
  Sun Ultra            7 Local Sun Sparc Ultra Workstations
  Beowulf PC           6 Local i486 and 1 Pentium PCs
  Power Challenge      1 Local 20-Processor Power Challenge
  Connection Machine   1 Local 64-Processor CM5

Table 1: Sub-Cluster Test Components

In the search for the most appropriate method of scheduling services across this possibly heterogeneous cluster of machines, we have implemented a prototype system that schedules a given parameterised service across a set of nodes according to one of three scheduling algorithms: perfect; adaptive; and first-come first-serve. In perfect, or round-robin, scheduling, processors are assigned a task in the order in which they are listed, in a cyclic fashion, and later processors in the list are not assigned a task until the processor before them in the list has been assigned one. Thus, this method is favoured when assigning the same amount of work to each machine. This approach to scheduling tasks is most appropriate when using homogeneous computational resources and a homogeneous job mix. In effect, this scheduling algorithm spreads the tasks evenly amongst all available processors; it is unsuitable for heterogeneous job mixes or processors, as subsequent processors may be waiting for task allocation due to a slow processor being given a large task. The adaptive scheduling algorithm is separated into two parts: benchmarking and execution. In the benchmarking phase, each processor in the list is given a job representative of the task size. In this implementation, the first task in the list of tasks is assigned to each processor in the list. The time taken to execute the task is recorded for each processor and the list is ordered accordingly. In the execution phase tasks are assigned to the processors in the order specified by the ordered list of processors. If the fastest processor is currently executing a job, then the second fastest processor is used, and so on. Thus, working on the
premise that the fastest processor will be available most of the time, it should execute the majority of the tasks. Adaptive scheduling is most appropriate when the task mix is homogeneous; heterogeneity in the processors does not then impact on the performance as seen by the user. Unfortunately, as implemented, this algorithm does not adapt to changing conditions on the processors during the execution of the task list – if the machine that was benchmarked as the fastest processor suddenly becomes overloaded, the algorithm will still favour placing tasks on that processor, to the detriment of user response time. The first-come first-serve (FCFS) scheduling algorithm assumes nothing about the processors or the task mix before or during execution. The available processors are placed into an availability list, and the processor at the head of the list is chosen as the target for the assignment. While there are entries on the availability list, machines are assigned tasks and removed from the list. As machines finish executing a task, they are added to the end of the availability list. Thus, the longer a processor spends executing a task, the more likely it is that other processors will finish before it and will hence be further toward the head of the availability list. This is the most general scheduling algorithm, optimising in favour of user response time. This algorithm naturally favours the fastest machine, and if the fastest machine suddenly becomes overloaded, the increase in execution time will cause it to move lower down the availability list. The prototype system that schedules a parameterised service across a set of nodes is called dploader. Written in C, it uses POSIX threads to monitor and control the execution of services. The user input to the program is similar to the foreach construct in the Unix tcsh shell, with additional features, such as the actions to perform if a machine is not available or the user cancels the computations. The parameters listed in the input file are simply enumerated and are assigned to processors, in order, according to the selected scheduling policy. At the end of processing, means and standard deviations are output for each machine that participated in the run. The dploader tool was our early attempt at a framework to experiment with scheduling algorithms and policies. At present it implements the three simple policies: perfect, adaptive and FCFS, as described above. We plan to incorporate a chunk size parameter for use with the perfect and adaptive cases, as this will allow a subset of jobs to be issued at once, which is likely to be useful when controlling clustered resources over high latency networks, such as over great distances[18]. The synthetic job that was used to measure the effectiveness of each of the scheduling algorithms is representative of the types of operations that are performed on real satellite data. For example, when we receive the satellite data from a download site at NASA, it is in a compressed HDF [24] file – in order to be able to use it effectively, the data must be decompressed, extracted and checked for quality. The synthetic service is equivalent to the operation of ensuring the quality of the satellite data that we normally invoke prior to depositing files into the archive. The synthetic job loads we investigate in this paper, however, do not consider the time taken to load and save data from/to disk or tape. We only consider the compute component, by working on dummy satellite data assumed to be preloaded.
We plan to address the storage times and costs, from the perspective of cluster nodes with both their own and a shared disk resource, in a future paper. Some results from running various clusters are shown in figures 2, 3 and 4. These show the total time for N synthetic satellite data processing jobs to run on the cluster configurations in table 1. It is apparent that the Alpha workstations outperform the Sun Ultras and that both significantly outperform the Beowulf PC cluster. It is interesting to observe that the scheduling policies significantly change the overall performance. In terms of absolute throughput and getting the best response time to complete the whole queue, the FCFS scheduling algorithm is indeed best. Figure 4 shows that this policy will try to use the fastest machines (Alphas) at the expense of the other machines. It is interesting to note that in fact adding slower machines to the cluster degrades the overall performance with respect to absolute completion time, even though the cluster has more resources. This suggests that additional information given to the scheduler about which jobs have priority would allow the fast machines to be used for those jobs in preference to the slower ones, and that a better utilisation of the whole cluster would result.
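To make the behaviour of the first-come first-serve policy concrete, the following is a minimal sketch of an availability-list dispatcher of the kind described above (in Java, with hypothetical names; the dploader tool itself is written in C):

import java.util.LinkedList;
import java.util.Queue;

// Minimal sketch of first-come first-serve (FCFS) dispatch: hypothetical names,
// not the dploader implementation. A host leaves the availability list while it
// is busy and rejoins at the tail when its task completes, so faster machines
// naturally drift toward the head of the list and receive more of the work.
public class FcfsDispatcher {
    private final Queue<String> available = new LinkedList<>(); // host names

    public FcfsDispatcher(Iterable<String> hosts) {
        for (String h : hosts) available.add(h);
    }

    // Block until some host is free, then hand it out for the next task.
    public synchronized String assign() throws InterruptedException {
        while (available.isEmpty()) wait();
        return available.remove();          // head of the availability list
    }

    // Called when a host finishes its task: rejoin at the tail.
    public synchronized void finished(String host) {
        available.add(host);
        notifyAll();
    }
}

A driver thread per task would call assign(), run the task on the returned host, and then call finished(host).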
Figure 2: Timing for Perfect Scheduling of Homogeneous Job Load (elapsed time in seconds, better than +/- 10 ms, versus number of jobs from 60 to 200, for the configurations Sun+PC, Sun-Only, PC-Only, Alpha+PC, Alpha+Sun+PC, Alpha+Sun and Alpha-Only).
The slow-down ratio from adding the PC cluster to the Sun Ultras is even more marked than the slow-down of the Alphas by the Suns. The graphs show that while the FCFS policy still shows this degradation, it significantly outperforms the perfect algorithm and outperforms the adaptive algorithm by a factor of two. Figure 5 shows rather more fluctuation effects, where (integer) rounding effects of assigning jobs to cluster nodes are more significant than in the other graphs, which show large-N effects. These suggest that even at low numbers of jobs, the FCFS algorithm is still best. The analysis above is all in terms of response time. In terms of resource utilisation, however, it may be preferable not to run the FCFS scheduling algorithm. The data presented might represent a single queue of jobs all with the same priority. A different queue of higher priority might be run at the same time, and it might be preferable that the higher priority jobs have access to the faster nodes. In that case it may be advantageous to run the low priority queue with perfect scheduling to make use of all the nodes, including the slower ones, which are likely to be cheaper. In fact we can make some observations about the economics of running our synthetic job queue by considering the approximate cost of the various clusters. To a rough approximation, the DEC Alpha nodes cost us around 10k Australian dollars each, the Sun Ultras 5k and the PCs a mere 100. There are therefore significant cost differences that a practical scheduler with a billing system needs to address. Weighting the times for the Alpha, Sun and PC clusters to complete (separately) 200 jobs, which were approximately 25, 200 and 125 seconds respectively, using the best FCFS data, we obtain price/performance figures of 12.5, 31.25 and unity, normalising to the PC cluster. So although the Alpha cluster has the best absolute time performance, it may be economically preferable to use the cheaper PC cluster if a slower result is acceptable. Managing the economic tradeoffs and allowing trading or exchanging of resources that would otherwise be unused is a fascinating area of study. We are investigating mechanisms and optimisation strategies for the administrative owners of clustered resources to set their preferences as well as allowing users to state their preferred options for resource bidding.

Figure 3: Timing for Adaptive Scheduling of Homogeneous Job Load (elapsed time in seconds, better than +/- 10 ms, versus number of jobs from 60 to 200, for the configurations PC-Only, Sun-Only, Sun+PC, Alpha+PC, Alpha+Sun+PC, Alpha+Sun and Alpha-Only).
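As an indication of how a billing-aware scheduler might encode such tradeoffs, the sketch below ranks the sub-clusters under different policies; the prices and times are the approximate figures quoted above, and the combined cost-time weighting is only one plausible assumption among many possible policy choices:

// Illustrative sketch: choosing a cluster under different economic policies.
// Prices and times are approximate figures quoted in the text; the combined
// cost-time weighting is an assumption, one of many possible policy choices.
public class ClusterChooser {
    static final String[] NAME  = {"Alpha", "Sun", "PC"};
    static final double[] PRICE = {7 * 10000, 7 * 5000, 7 * 100}; // AUD per sub-cluster
    static final double[] TIME  = {25, 200, 125};                 // seconds for 200 jobs (FCFS)

    static int best(double[] score) {
        int b = 0;
        for (int i = 1; i < score.length; i++) if (score[i] < score[b]) b = i;
        return b;
    }

    public static void main(String[] args) {
        double[] costTime = new double[NAME.length];
        for (int i = 0; i < NAME.length; i++) costTime[i] = PRICE[i] * TIME[i];
        System.out.println("fastest policy:   " + NAME[best(TIME)]);     // best response time
        System.out.println("cheapest policy:  " + NAME[best(PRICE)]);    // lowest capital cost
        System.out.println("cost-time policy: " + NAME[best(costTime)]); // one combined weighting
    }
}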
6 Performance Model
In this section we outline a model framework for understanding the behaviour of client-server relationships in a clustered computing environment. Consider the cost of a single component of a multi-component program, $P_i$, in the context of the synthetic job we describe above. Let $T_i$ be the time for program component $P_i$ to complete, so that $T = \sum_i T_i$ is the time for the complete program. $T_i$ may be of the form $T_i = a_0 + a_1 n$, where $n$ is some measure of the amount of computational work or data to be transferred or retrieved. We do not consider the direct effects of heterogeneous jobs in this paper, but a simple approach to understanding their complexity is to model a mix of the same program working on different data sizes that can be parameterised in terms of some data size parameter $n$. However, resources may be shared and other processes are also in progress. Perhaps a particular instance of the program $P$ takes $T_j$ time to complete, and only over some number $N$ of program runs does this time measurement distribute around a mean value for some typical load of the system. Specifically, consider a program $P^A$ which is composed from the following services:

$$ P^A = P^A_1 \mapsto P^A_2 \mapsto P^A_3 \mapsto P^A_4 \mapsto P^A_5 \qquad (1) $$
A service component can be defined as something which takes (measurable) cycles to carry out on a resource. For simplicity consider only a linear chain of services which are carried out sequentially in strict order to allow for data dependencies.

Figure 4: Timing for First-come First-serve Scheduling of Homogeneous Job Load (elapsed time in seconds, better than +/- 10 ms, versus number of jobs from 60 to 200, for the configurations PC-Only, Sun-Only, Sun+PC, Alpha+Sun, Alpha+PC, Alpha-Only and Alpha+Sun+PC).

There may be some choice available to decide which resources will carry out the service components. Suppose we have resources $R_\alpha$, $R_\beta$ and $R_\gamma$ available to carry out $P^A$, and that for simplicity they are all equally capable of carrying out all components. The ultimate goal of the scheduler is to compute the costs associated with possible service-to-resource mappings and to rank them. This can be a substantial task. What information is available to the scheduler? Suppose it has an approximate time-to-complete estimate for each service component on each of the available resources. This is of the form $T(P^x, R^y)$, which maps programs to resources. It is then a matter of optimisation to determine which resource to place each service component on. This can either be done once for the whole program or as each service component requires placement. Ideally some bounded estimate is presented to the user when the program request is made. In the service based model, the full $x$-$y$ outer-product matrix of possible placements does not arise, however. In practice a DISCWorld node will only advertise its capability to carry out some services, and it will only store knowledge of a limited subset of its “favourite” providers of the services it needs. This means that the optimisation matrix is very sparse and only a few possible placements need be computed and ranked as options for the user of a particular service. A complication arises if loads are known to vary significantly with the time of day. The cost estimation then needs to forward predict the completion time of each stage in order to compute the launch time of the next stage for lookup in the cost tables for that stage. At this point the cost estimation process becomes too complicated to be worthwhile: the time to calculate it will probably outcost the job it is supposed to schedule. In addition the scheduler may have the ability to compute the execution time of a service component that has a variable workload argument, such as a different problem size parameter.
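The sketch below illustrates the sparse cost-table lookup and greedy ranking just described: for each service component in a chain, only the few known providers are costed and the cheapest is chosen. The class and method names are hypothetical; this is a simplification rather than the DISCWorld scheduler itself.

import java.util.*;

// Greedy placement over a sparse cost table T(service, resource).
// Names and structure are illustrative assumptions, not the DISCWorld scheduler.
public class GreedyPlacer {
    // costs.get(service) holds estimates only for the few known providers of that service.
    private final Map<String, Map<String, Double>> costs = new HashMap<>();

    public void addEstimate(String service, String resource, double seconds) {
        costs.computeIfAbsent(service, s -> new HashMap<>()).put(resource, seconds);
    }

    // Place a linear chain of services, picking the lowest-cost known provider for each.
    public Map<String, String> place(List<String> serviceChain) {
        Map<String, String> placement = new LinkedHashMap<>();
        for (String service : serviceChain) {
            Map<String, Double> options = costs.getOrDefault(service, Map.of());
            options.entrySet().stream()
                   .min(Map.Entry.comparingByValue())
                   .ifPresent(best -> placement.put(service, best.getKey()));
        }
        return placement;
    }
}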
Figure 5: Timing for Various Scheduling of Homogeneous Job Load on Alphas only (elapsed time in seconds, better than +/- 10 ms, versus number of jobs from 0 to 50, comparing Alpha-Lowend-Adaptive, Alpha-Lowend-Perfect and Alpha-Lowend-FCFS).
This is possible if we have a simple complexity model for each service component. We do not consider this case in the present paper. It is a major simplifying approximation to ignore any cost that arises from placing the service requests on different resources as compared with the same ones. For a high granularity of service request decomposition this may be adequate, although the effect is likely to be important in the case of jobs that need to exchange large data sets. Consider the optimisation procedure as a series of option evaluations. A greedy algorithm can look at the cost estimates available for a given service component on the various resources it knows about and pick the lowest. The cost function can be a simple function based on the criterion of fastest possible execution time, or it can be a combination of some sort of billing or monetary cost information as well as time. Other constraints might be applicable for various reasons. It may be desirable to constrain the execution to avoid or prefer certain resources if other conditions are equal. Consider now the difficulties in predicting the time to complete for a computational task in a distributed multi-user environment. Tasks may have well defined and known complexities, and there may exist sufficient performance data for a single otherwise unloaded computer platform running the task to predict the time to run. In a multi-user and distributed system a number of additional effects arise, making the problem of estimating completion time, and therefore cost to the user, much harder. A task has to compete with other tasks or jobs running in the same environment. It may not be known what fraction of the compute platform's resources a particular task may get at its initiation, and indeed this fraction may well change in time as other jobs are added and complete or abort. How can simple performance models be combined to provide at least bounds on the predicted time to complete and some estimate of a typical completion time to be expected by the user? Consider an event driven model based on well characterised jobs and compute platform components.
To develop the model, consider the simple case of a job that consists of reading some data, carrying out some processing and then writing out some data. Consider an idealised computer platform with a very simplified memory structure and with no conflicts in memory and CPU usage from other processes. In particular, suppose the job's granularity is sufficiently large compared with the system processes running on the platform that the following approximation holds. The time to complete the job is given by

$$ T_{\mathrm{Tot}} = T_{\mathrm{Load}} + T_{\mathrm{Proc}} + T_{\mathrm{Save}} \qquad (2) $$
where $T_{\mathrm{Load}}$ is the time to load the data (from disk or store), $T_{\mathrm{Proc}}$ the time to carry out the processing entirely in memory, and $T_{\mathrm{Save}}$ the time to store the output data to disk or other store. This might be a reasonable approximation on present workstation computing platforms providing each of these times is of the order of seconds or more. Consider the sequence of events the job goes through in running to completion. These states might be enumerated as: State 0, waiting to start execution; State 1, performing I/O, accessing the disk or store; State 2, processing data entirely within memory; and State 3, completed all required execution. This again is an oversimplification but is sufficient if the job components' granularity is very high compared with the other system tasks that are ongoing. In such a case, the transitions between states are approximately instantaneous, at least compared to the times spent in the states. Suppose the job is well characterised and that the times $T_{\mathrm{Load}}$, $T_{\mathrm{Proc}}$ and $T_{\mathrm{Save}}$ are either constant for that job on that compute platform, are perhaps some known functions of some parameter of the job, or can be deduced by some means for the particular platform the job is run on. As long as only a single job runs on the platform, the time to completion is essentially known to a very good approximation, again assuming high task granularity. We might consider $T_{\mathrm{Tot}}(\mathrm{Param}, \mathrm{Plat})$, where the time is now a (known) function of the problem size parameter(s) and the particular platform configuration being used. The model might be generalised to include times for the use of each resource (compute, storage or communications link) in a distributed system, with some approximate information available for the resource requirements of a particular job. This can become complex for jobs that do not conform to the high granularity approximation. In general, the model should not be over complicated, as to be useful it must itself be computable in a very much shorter time than carrying out the job itself. The tradeoff space might be crudely characterised by a table showing the regimes for model complexity and granularity, the number of jobs, and the knowable information for the relevant compute platforms. Consider the effects of a second job running on the same platform, or sharing storage resources such as a disk or store shared between compute platforms. Denoting the times for jobs 1 and 2 using superscripts, there are now interference effects between $T^1_{\mathrm{Tot}}$ and $T^2_{\mathrm{Tot}}$ depending upon how they share resources. How much is still estimable when each of the jobs is launched? This depends upon whether information is shared about the two jobs. A scheduler of well known services (the jobs) on a set of well known cooperating compute platforms might have access to this information. An event driven simulation can then compute the relative completion times for the jobs (or service instantiations). The situation is complicated since the two jobs sharing resources may well not have the same individual profiles and indeed may start at different times.
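A minimal sketch of such an event driven estimate is given below (illustrative assumptions only): two well characterised jobs, each a load, process, save sequence, share a single store for their I/O phases, and the simulation advances the job with the earliest ready time to predict both completion times.

import java.util.*;

// Illustrative event-driven estimate of completion times for two well characterised
// jobs (load -> process -> save) sharing a single store for I/O. Each job is assumed
// to have its own CPU; only the store is contended. Hypothetical names and figures.
public class SharedStoreSimulation {
    record Phase(boolean usesStore, double seconds) {}

    public static double[] simulate(List<List<Phase>> jobs, double[] startTimes) {
        int n = jobs.size();
        double[] ready = startTimes.clone();   // time each job can begin its next phase
        int[] next = new int[n];               // index of each job's next phase
        double storeFreeAt = 0.0;
        boolean progress = true;
        while (progress) {
            progress = false;
            // pick the job with remaining phases that became ready earliest
            int pick = -1;
            for (int j = 0; j < n; j++)
                if (next[j] < jobs.get(j).size() && (pick < 0 || ready[j] < ready[pick]))
                    pick = j;
            if (pick >= 0) {
                Phase p = jobs.get(pick).get(next[pick]);
                double start = p.usesStore() ? Math.max(ready[pick], storeFreeAt) : ready[pick];
                double end = start + p.seconds();
                if (p.usesStore()) storeFreeAt = end;   // store is busy until this I/O ends
                ready[pick] = end;
                next[pick]++;
                progress = true;
            }
        }
        return ready;  // completion time of each job
    }

    public static void main(String[] args) {
        List<Phase> job = List.of(new Phase(true, 5), new Phase(false, 20), new Phase(true, 5));
        double[] done = simulate(List.of(job, job), new double[]{0.0, 2.0});
        System.out.printf("job 1 completes at %.1f s, job 2 at %.1f s%n", done[0], done[1]);
    }
}

Running the example shows the second job's load phase waiting for the store and the job completing later than it would on an unshared store.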
A smart scheduler may have a number of choices available to it to determine what resources might be shared or used to place the service instantiations. It is important that the cost of scheduling the placements be substantially less than that of processing the jobs themselves. In some cases the scheduler can make some simple choices based on what-if scenarios. How should it choose which scenarios to evaluate? This can be encoded into the scheduler software as adaptive policy information, based on best guesses and the best performance achieved last time. The system is likely to work only within a regime where: the (high) job granularity approximations hold; scheduler costs are low compared to job costs (true
if the number of scheduling options can be restricted); and the scheduler has complete knowledge about what resources are allocated. A problem arises if the scheduler does not have complete control (and knowledge) of what is running on its compute platforms. This is quite a likely case, as other work may be being carried out on the platforms. This may arise from interactive users, or from other batch or scheduling environments that do not communicate information. Short of solving the problem of giving our scheduler full control and hence full information, how can it be made to give its best estimate? Can it monitor resources and make a best estimate of what fraction it is likely to have access to? Predictability can be maintained if the scheduler can book resources, or fractions thereof, in advance. Mechanisms for this are not simple and not widely available for general computing platforms. The closest reliable mechanisms for this are those for processor allocation on multi-processor computers. Multi-threaded systems with built-in preemptive scheduling may also be able to provide this mechanism. Consider what statistical estimates can be made on the basis of available loading information. If a means for monitoring and storing patterns of access can be found, then reasonable estimates for times of completion may be viable in certain load regimes. Distributions of test jobs, as we discuss in the previous section, can be used as input to an event driven simulation. A distribution of jobs and platforms or resource components can be used to simulate the likely performance of a smart scheduler. The simulation can be embedded in the scheduler itself, providing it does not itself take too much computational power to run. The key to this approach is to make sufficiently good approximations that the cost of running the model is low, but that the results are adequate. We believe the key to this lies in the approximations that can be made if all services are known a priori. It is not a simple problem to characterise resources themselves in a way that can be used for service instantiation placement. A simple numerical rating might rank processors by their floating point capability or by some other relevant benchmark. Ideally, platforms or resources are rated according to their relative performance (and perhaps cost) on the specific named services that a wide-area sharing system is configured to run. This idea has been raised, but not implemented, in the PVM[12] system. As we have seen in the previous section, absolute time data can give different optimisation targets from economic optimisation criteria. Costing resources is necessarily related to some non-technical issues and economics. A system can usefully provide its system administrators with mechanisms to set and monitor pricing policies. These may be set up to optimise machine use, or revenue or profit. It is interesting to consider the marketplace for shared resources of this nature and the framework that can provide a trading environment with all the economic and financial effects that arise. We plan to implement modules in our framework that will allow these issues to be dealt with according to policy information.
7 Implementing a Scheduler
In this section we discuss implementation issues for a practical scheduling system and in particular how to incorporate the performance models we have developed in section 6 to allow the scheduler to adapt its behaviour. We are presently incorporating the ideas and experiences from the dploader tool into the scheduler of our DISCWorld system. This uses Java and is therefore subject to the threading model provided by the Java Virtual Machine. Our DISCWorld software model is to decompose complex service requests into well characterised component parts. The intention is that these component parts can be more easily scheduled since the scheduler software can more easily predict their time to completion on the cluster nodes it manages. We believe this approach makes the general scheduling problem more tractable. The software architecture of our system involves a number of cooperating daemons, each running on a separate machine, arranged in a non-hierarchical fashion. This means that there are no nodes which are always designated as the masters or
clients, and none that are always slaves or servers, as in such systems as Ninf [28] and Netsolve [6]. Instead, nodes are all equal but can be temporarily promoted to act as the handler for a particular user job request. It is this symmetric arrangement of peer-level daemons that allows us to achieve a scalability that is not possible in strictly hierarchical architectures. DISCWorld daemons intercommunicate, exchanging information such as: the locations of other daemons; the services that they themselves support; and what results, if any, they have from previous invocations of services on data that they are able to share. In our model, resources are not only computational engines, but also services and the data to be operated on; services and data may, at the administrator’s discretion, be transferred between daemons to assist in the scheduling and placement of user queries. This exchange of information we refer to as the gossip protocol [21]. Upon receiving a user request, a daemon is able to co-opt the help of other daemons to fulfil the request by seeking their services. Heterogeneity support is gained through the use of Java [14] and CORBA [5] objects. Daemons are written in Java to ensure portability, and to reduce the considerable effort that must be employed to port a binary between architectures. This method allows us to write and test a single Java implementation, which is then pushed out to all participating nodes in the system. There are two types of service in the DISCWorld model: general services and optimised services. General services are intended to be supported by more than one node in the distributed system. As such, the code that represents a general service needs to be written in an architecture-independent, portable fashion. For this reason, we have chosen to write such services in Java, where each service conforms to a well-defined application programmer interface (API). General services may be propagated through the system by the daemons that reside on each machine gossiping. Services that are optimised for use on high-performance computing platforms, be they multi-node supercomputers or farms of high-performance workstations, may be implemented as Java Native Methods. Native methods allow supercomputer nodes to be embedded in the DISCWorld framework, providing their specialist (well-optimised) services.
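As an indication of what such a well-defined service API might look like, the following is an illustrative Java interface only; the method names are assumptions and the actual DISCWorld interface is not reproduced here:

// Illustrative sketch of a general-service interface of the kind described above.
// The interface name and methods are assumptions, not the actual DISCWorld API.
public interface GeneralService {
    String name();                         // advertised service name, gossiped between daemons

    byte[] execute(byte[] inputData,       // well-defined input product
                   java.util.Map<String, String> parameters) throws Exception;

    double estimatedSeconds(long inputSizeBytes);  // a priori complexity estimate for the scheduler
}

An optimised service for a supercomputer node could implement the same interface, with its execute method delegating to a Java Native Method.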
8 Discussion and Conclusions
We have presented user scenarios and applications that would benefit from a wide-area clustered computing environment. We have systematically investigated the effects of three different scheduling policies for managing queues of jobs across wide-area clusters, and have shown that a scalable cluster configuration is generally possible, but that effective utilisation of the resources depends upon having an appropriate job mix and scheduling policy. The scheduler is able to make considerable optimisations in either user response time or resource utilisation if it has full knowledge about the likely time to completion of a particular job on a particular platform. It is possible to give the scheduler even more flexibility if jobs are based on well-defined service components of known performance (or complexity), so that the scheduler is able to break complex jobs up into their atomic sub-components and decide upon the scheduling of these sub-components. Some simple models for the complexity and performance may allow schedulers to make intelligent decisions, or to offer users and system administrators options to meet their preferences.

We have described what we believe to be the key issues for effective scheduling of high-granularity jobs or service components on homogeneous and heterogeneous computer clusters. These involve modelling and understanding the tradeoff behaviour peculiar to such job and resource mixes, to allow good prediction of response time and hence of the performance and cost of running a particular job on a particular clustered resource. Tradeoffs between job granularity and the network costs of remote computation are particularly interesting. It may be possible to achieve a more scalable solution than has been possible with traditional batch queueing software systems by using the peer-based architecture we use in DISCWorld.
Acknowledgements

This work was carried out under the On-Line Data Archives (OLDA) program of the Advanced Computational Systems (ACSys) Cooperative Research Centre (CRC) with funding from the Research Data Networks (RDN) CRC. ACSys and RDN are established under the CRC program of the Australian Commonwealth Government. We thank D.A.Grove for assistance in setting up our Beowulf cluster of PCs.
References

[1] D. Abramson, R. Sosic, J. Giddy, and B. Hall. Nimrod: A tool for performing parameterised simulations using distributed workstations. In Proceedings of the 4th IEEE Symposium on High Performance Distributed Computing, Virginia, August 1995.
[2] Clive F. Baillie, Rajan Gupta, Kenneth A. Hawick, and G. Stuart Pawley. Monte Carlo renormalisation group study of the 3D Ising model. Physical Review B, 45:10438-10453, 1992.
[3] Mark A. Baker, Geoffrey C. Fox, and Hon W. Yau. A review of commercial and research cluster management software. Technical report, Northeast Parallel Architectures Center, June 1996.
[4] A. Bayucan, R. L. Henderson, T. Proett, D. Tweten, and B. Kelly. Portable Batch System external reference specification. NAS Scientific Computing Branch, NASA Ames Research Center, California, June 1996.
[5] Ron Ben-Natan. CORBA: A Guide to Common Object Request Broker Architecture. McGraw-Hill, 1995.
[6] Henri Casanova and Jack Dongarra. Netsolve: A network server for solving computational science problems. In Proc. Supercomputing 96, 1996.
[7] K. Mani Chandy, Adam Rifkin, Paolo A.G. Sivilotti, Jacob Mandelson, Matthew Richardson, Wesley Tanaka, and Luke Weisman. A world-wide distributed system using Java and the Internet. In High Performance Distributed Computing (HPDC-5), Caltech, March 1996. Also available as Caltech CS Technical Report CS-TR-96-08 and CRPC Technical Report Caltech-CRPC-96-1.
[8] P. D. Coddington, K. A. Hawick, and H. A. James. Web-based access to distributed, high-performance geographic information systems for decision support. Technical Report DHPC-038, University of Adelaide, June 1998. Paper submitted to HICSS99.
[9] P. D. Coddington, K. A. Hawick, K. E. Kerry, J. A. Mathew, A. J. Silis, D. L. Webb, P. J. Whitbread, C. G. Irving, M. W. Grigg, R. Jana, and K. Tang. Implementation of a geospatial imagery digital library using Java and CORBA. Technical Report DHPC-047, University of Adelaide, May 1998. Submitted to TOOLS Asia 98.
[10] IBM Corporation. LoadLeveler: network job scheduling and job management. 16 May 1998. http://www.austin.ibm.com/software/sp products/loadlev.html.
[11] Ian Foster and Carl Kesselman. Globus: A metacomputing infrastructure toolkit. International Journal of Supercomputer Applications, 1996.
[12] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. PVM: Parallel Virtual Machine. A Users' Guide and Tutorial for Networked Parallel Computing. MIT Press, 1994.
[13] Genias Software GmbH. Codine: resource-management system for heterogeneous environments. http://www.genias.de/products/codine/.
[14] James Gosling, Bill Joy, and Guy Steele. The Java Language Specification. JavaSoft Series. Addison Wesley, 1996.
[15] Thomas P. Green. DQS user interface preliminary design document. Supercomputer Computations Research Institute, Florida State University, July 1993.
[16] Andrew S. Grimshaw and Wm. A. Wulf. Legion: A view from 50,000 feet. In Proceedings of the Fifth IEEE International Symposium on High Performance Distributed Computing, August 1996. IEEE Computer Society Press, Los Alamitos, California.
[17] Duncan A. Grove, Andrew J. Silis, J. A. Mathew, and K. A. Hawick. Secure transmission of portable code objects in a metacomputing environment. Technical report, University of Adelaide, 1998. Also DHPC Technical Report DHPC-041.
[18] K. A. Hawick, H. A. James, K. J. Maciunas, F. A. Vaughan, A. L. Wendelborn, M. Buchhorn, M. Rezny, S. R. Taylor, and M. D. Wilson. Geostationary-satellite imagery applications on distributed, high-performance computing. In Proceedings of HPCAsia, Seoul, Korea, August 1997. Also DHPC Technical Report DHPC-004.
[19] K. A. Hawick, H. A. James, C. J. Patten, and F. A. Vaughan. DISCWorld: A distributed high performance computing environment. University of Adelaide, December 1997. Submitted to HPCN98. Also Technical Report DHPC-020.
[20] H. A. James and K. A. Hawick. A web-based interface for on-demand processing of satellite imagery archives. In Proc. of Australian Computer Science Conference, 1998. Also DHPC Technical Report DHPC-018.
[21] K. A. Hawick, A. L. Brown, P. D. Coddington, J. F. Hercus, H. A. James, K. E. Kerry, K. J. Maciunas, J. A. Mathew, C. J. Patten, A. J. Silis, and F. A. Vaughan. DISCWorld: An integrated data environment for distributed high-performance computing. In Proc. of the 5th IDEA Workshop, Fremantle, February 1998. Also DHPC Technical Report DHPC-027.
[22] B. Kingsbury. The Network Queueing System. 16 May 1998. http://pom.ucsf.edu/srp/batch/sterling/READMEFIRST.txt.
[23] C. H. Koelbel, D. B. Loveman, R. S. Schreiber, G. L. Steele, and M. E. Zosel. The High Performance Fortran Handbook. MIT Press, 1994.
[24] National Center for Supercomputing Applications. Getting Started with HDF: User Manual. University of Illinois at Urbana-Champaign, May 1993.
[25] B. Clifford Neuman and Santosh Rao. The Prospero resource manager: A scalable framework for processor allocation in distributed systems. Concurrency: Practice and Experience, 6(4):339-355, June 1994.
[26] C. J. Patten, F. A. Vaughan, K. A. Hawick, and A. L. Brown. DWorFS: File system support for legacy applications in DISCWorld. In Proceedings of the Fifth IDEA Workshop, February 1998. Also DHPC Technical Report DHPC-032.
[27] W. Rosenberry, D. Kenney, and G. Fisher. Understanding DCE. O'Reilly & Associates, Inc., 1992.
[28] Satoshi Sekiguchi, Mitsuhisa Sato, Hidemoto Nakada, and Umpei Nagashima. Ninf: Network-based information library for globally high performance computing. In Parallel Object-Oriented Methods and Applications (POOMA), February 1996.
[29] G. F. Stanlake. Introductory Economics. Longman Group Limited, London, 1967. SBN 582 350603.