Effective Use of Networked Reconfigurable Resources

A. V. Staicu 1, J. R. Radzikowski 1, K. Gaj 1, N. Alexandridis 2, and T. El-Ghazawi 2

1 George Mason University, 4400 University Drive, Fairfax, VA 22030
2 George Washington University, 725 23rd Street NW, Washington DC 20052
Abstract. Distributed reconfigurable resources, such as FPGA-based accelerator boards,1 are expensive and often underutilized. Therefore, it is important to permit sharing of these resources, at least among the members of the same organization. Traditional resources, such as the CPU time of loosely coupled workstations, can be shared using a variety of existing distributed computing systems. We analyzed twelve of these systems, and chose five of them, belonging to the class of Job Management Systems with a Central Job Scheduler, as most suitable for extension to reconfigurable hardware. These five systems, LSF, CODINE, PBS, Condor, and RES, were further compared based on their functional characteristics and ease of extension. LSF was shown to be superior according to the majority of the 25 identified requirements. The general architecture of the extended system was developed, and the exact way of extending LSF, CODINE, and PBS to manage FPGA-based accelerator boards was proposed.

1 This work was sponsored by the Department of Defense under the LUCITE contract #MDA904-98-C-A081.
I. INTRODUCTION

This paper reports on a research effort to create a distributed computing system interface for the effective utilization of networked reconfigurable computing resources. The objective is to construct a system that can leverage resources underutilized at a given time to serve other users who currently have the need, in a grid-computing-like style. The targeted resources are workstations and clusters equipped with Field Programmable Gate Array (FPGA) boards serving as reconfigurable coprocessors, such as one can find in academic and government research labs. In order to take advantage of previous related work, our strategy is to extend an efficient Commercial Off the Shelf (COTS) Job Management System (JMS) [1, 2]. Such extensions should provide the ability to recognize reconfigurable resources, monitor and understand their current loading, and effectively schedule them for incoming remote user requests, with little impact on local users. This also includes providing local users with proper tools to control the degree to which they wish to share their own resources and how others may use such resources.
Our effort started with a study that aimed at a comparative evaluation of currently available job management systems and a conceptual design of how to architect such a system for managing networked reconfigurable resources [2]. Twelve systems were considered in this study
and evaluated (and ranked) with respect to their suitability for being extended to handle FPGA reconfigurable computing resources and tasks. The compared systems are LSF, PBS, Condor, RES, Compaq DCE, Sun Grid Engine/CODINE, Globus, Legion, NetSolve, MOSIX, AppLES, and NWS. In order to facilitate the comparison, twenty-five major requirements were identified, and a scoring system was developed to gauge the compliance of these systems with the functional requirements and to rank them accordingly [1, 2, 6, 7]. In addition, the twelve systems were also clustered into several categories in order to easily appreciate the functional and operational differences among them and to help the decision-making.
The remainder of the paper is organized as follows. In Section II, we classify the compared systems into six distinctive categories, and choose the category that is most suitable for extension to reconfigurable hardware. In Section III, we give a comparative overview of the five systems belonging to the selected category. In Section IV, we identify the features of a system that support extension to reconfigurable hardware, and propose a general architecture of the extended system. Finally, we describe the means of extending three selected Job Management Systems, and identify a single JMS that offers the best functional characteristics combined with ease of extension to reconfigurable hardware.
II. CLASSIFYING THE COMPARED SYSTEMS

The twelve compared systems can be classified into several major categories, based on their primary objective, major modules, and distribution of control. These groups are defined below, together with a discussion of their relevance to the project goals.
A. Job Management Systems with a Central Scheduler
The objective of a JMS is to let users execute jobs on a non-dedicated cluster of workstations with a minimum impact on owners of these workstations by using computational resources that can be spared by the owners. The system should be able to perform at least the following tasks:
a. monitor all available resources,
b. accept jobs submitted by users together with resource requirements for each job,
c. perform centralized job scheduling that matches all available resources with all submitted jobs according to the predefined policies,
d. allocate resources and initiate job execution,
e. monitor all jobs and collect accounting information.
To perform these basic tasks, a JMS with a Central Scheduler must include at least the following major functional units [4]:
1. User server – which lets users submit jobs and their requirements to a JMS (task b), and additionally may allow the user to inquire about the status and change the status of a job (e.g., to suspend or terminate it).
2. Central job scheduler – which performs job scheduling and queuing based on the resource requirements, resource availability, and scheduling policies (task c).
3. Resource manager – including
• Resource monitor (tasks a and e), and
• Job dispatcher (task d).
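To make the interplay of these units concrete, the sketch below shows, in Python, how a central scheduler might match submitted jobs against the resources reported for each execution host and hand the matches to a dispatcher. The job and host records, the resource names, and the first-fit policy are illustrative assumptions of ours, not the mechanism of any particular JMS.

```python
# Illustrative sketch of central job scheduling (not the algorithm of any
# specific JMS): jobs declare resource requirements, hosts report available
# resources, and the scheduler matches them under a simple first-fit policy.

from dataclasses import dataclass

@dataclass
class Job:
    job_id: int
    requirements: dict            # e.g. {"cpus": 2, "mem_mb": 512}

@dataclass
class Host:
    name: str
    resources: dict               # current availability reported by the resource monitor

def satisfies(host: Host, job: Job) -> bool:
    """True if the host currently offers every resource the job asks for."""
    return all(host.resources.get(r, 0) >= amount
               for r, amount in job.requirements.items())

def schedule(jobs, hosts):
    """First-fit matching: assign each queued job to the first suitable host."""
    dispatch = []                 # (job, host) pairs handed to the job dispatcher
    for job in jobs:
        for host in hosts:
            if satisfies(host, job):
                for r, amount in job.requirements.items():
                    host.resources[r] -= amount   # reserve the allocated resources
                dispatch.append((job, host))
                break
    return dispatch

if __name__ == "__main__":
    hosts = [Host("ws1", {"cpus": 4, "mem_mb": 2048}),
             Host("ws2", {"cpus": 1, "mem_mb": 512})]
    jobs = [Job(1, {"cpus": 2, "mem_mb": 1024}), Job(2, {"cpus": 1, "mem_mb": 256})]
    for job, host in schedule(jobs, hosts):
        print(f"job {job.job_id} -> {host.name}")
```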
Figure 1 Major functional blocks of a Job Management System with a Central Scheduler. [Figure: the User Server passes jobs and their requirements to the Job Scheduler, which combines scheduling policies with the available resources, resource requirements, and resource allocation and job execution handled by the Resource Manager (Resource Monitor and Job Dispatcher) to place the user job.]

We have classified the following five systems as centralized Job Management Systems:
• LSF – Load Sharing Facility
• Sun Grid Engine / CODINE
• PBS – Portable Batch System
• Condor
• RES
Systems belonging to this group are of primary interest to us from the viewpoint of the project objectives. All these systems can potentially be extended to support distributed computations using FPGA accelerator boards.

B. Job Management Systems without a Central Scheduler
The primary difference compared to the first group is the lack of a central job scheduler. The system is able to collect information about all available resources and accept job submissions, but the decision about matching jobs with available resources is made locally by a user server or a local job scheduler. A system of this kind often works as a high-level job management system, which can submit jobs to several lower-level centralized job management systems, based on information about their currently available resources. This group of systems includes:
• Globus
• Legion
• NetSolve
Systems belonging to this group are of secondary interest to us from the viewpoint of the direct project goals. However, after extending the capabilities of a centralized JMS, a JMS without a central job scheduler can be used to create a high-level Distributed Job Management System in order to achieve additional scalability.

C. Distributed Operating System
A Distributed Operating System differs from a Job Management System by operating on processes instead of jobs, and by being oriented primarily towards tightly coupled clusters of homogeneous workstations [5]. The only member of this category is
• MOSIX.
Since one of the primary requirements of the project is support for loosely coupled clusters of heterogeneous workstations, MOSIX and other systems of this type should not be considered as potential candidates for extension.

D. Parameter Study Schedulers
A Parameter Study Scheduler is a system geared specifically towards the distribution of parameter study applications over networked resources. It is used to schedule jobs belonging to a single application on multiple workstations, and to assign input data to these jobs, depending on the current use of resources on these workstations. The application in this case should involve solving a given task for a large number of parameter sets. Each computational node works only on a small subset of a large parameter space; the size of this subset depends on the current computational capabilities of the given node. The instances of the program running on the different workstations do not communicate with each other, but rather return partial results to the central node. The only representative of this group is
• AppLES.
This type of system is not really a JMS and does not fulfill many of the major requirements formulated in our project.
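As a rough illustration of this mode of operation, the sketch below splits a parameter space among nodes in proportion to an assumed per-node capability score and gathers the partial results centrally. The capability scores, the node names, and the worker function are hypothetical; a real parameter study scheduler such as AppLES uses far more refined adaptive strategies.

```python
# Toy illustration of parameter-study scheduling: the parameter space is split
# among nodes in proportion to their (assumed) current capability, each node
# evaluates its subset independently, and partial results return to the center.

def split_parameters(params, capabilities):
    """Assign each node a slice of the parameter list proportional to its capability."""
    total = sum(capabilities.values())
    assignment, start = {}, 0
    nodes = list(capabilities)
    for i, node in enumerate(nodes):
        share = round(len(params) * capabilities[node] / total)
        end = len(params) if i == len(nodes) - 1 else start + share
        assignment[node] = params[start:end]
        start = end
    return assignment

def evaluate(p):
    """Stand-in for the real computation performed for one parameter set."""
    return p * p

if __name__ == "__main__":
    params = list(range(100))                      # the large parameter space
    capabilities = {"node_a": 4.0, "node_b": 1.0}  # assumed relative node capabilities
    results = {}
    for node, subset in split_parameters(params, capabilities).items():
        results[node] = [evaluate(p) for p in subset]   # would run remotely in practice
    print({n: len(r) for n, r in results.items()})
```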
E. Resource Monitors and Forecasters Resource Monitors and Forecasters are similar to the resource monitor, which is a standard part of each JMS system, as defined in point “A”. The difference is that the system of this type does not only monitor computational resources, but also predicts their availability and performance in the future. This system uses numerical models to generate forecasts regarding availability of resources in a given time frame. The only representative of this group is • Network Weather Service. The system of this type may be combined with a centralized JMS to extend its capabilities.
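As a toy example of such a numerical model (far simpler than the predictor families used by real forecasters such as the Network Weather Service), the sketch below predicts near-term CPU availability as the mean of a sliding window of recent measurements; the window size and the sample values are arbitrary.

```python
# Minimal sketch of a resource forecaster: predict the next value of a monitored
# quantity (e.g. fraction of idle CPU) as the mean of a sliding window of recent
# measurements. Real forecasters use richer families of predictors.

from collections import deque

class SlidingWindowForecaster:
    def __init__(self, window: int = 5):
        self.samples = deque(maxlen=window)   # keep only the most recent measurements

    def observe(self, value: float) -> None:
        """Record a new measurement from the resource monitor."""
        self.samples.append(value)

    def forecast(self) -> float:
        """Predicted availability for the next time step (window mean)."""
        if not self.samples:
            raise ValueError("no measurements observed yet")
        return sum(self.samples) / len(self.samples)

if __name__ == "__main__":
    f = SlidingWindowForecaster(window=3)
    for idle_fraction in (0.9, 0.7, 0.8, 0.6):
        f.observe(idle_fraction)
    print(f"forecast idle CPU fraction: {f.forecast():.2f}")
```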
F. Distributed Computing Interface Distributed Computing Interface is a set of interoperable libraries that can be linked with application programs or operating systems to support client-server applications. Distributed Computing Interface provides • secure communication between client and server, based on public key certificates; • directory services, including the means of creating, updating, and accessing directories of currently active servers, public key certificates, etc., • distributed file services (e.g., a client access to remote disks), and • time services (e.g., synchronization of local clocks on several distributed workstations).
III. COMPARATIVE OVERVIEW OF JOB MANAGEMENT SYSTEMS WITH A CENTRAL JOB SCHEDULER
The main features of Job Management Systems with a Central Job Scheduler can be classified into the following categories [1, 2, 3, 6, 7, 8, 9]:
i. Operating System Support, Flexibility, and User Interface
ii. Scheduling and Resource Management
iii. Efficiency and Utilization
iv. Fault Tolerance and Security
v. Documentation and Technical Support.

Analysis of the first set of requirements, "Operating System Support, Flexibility, and User Interface," for LSF, CODINE, PBS, Condor, and RES leads to the following conclusions (Table 1):
• The majority of the systems are available in the public domain. The only system that does not have a public domain version is LSF. PBS has both commercial and public domain versions.
• The majority of the systems operate under several versions of UNIX, including Linux. The only system that does not directly support Linux is RES. LSF is the only JMS that fully supports Windows NT. In Condor, there is a limitation that jobs submitted from Windows NT can only be executed on machines running Windows NT. CODINE, PBS, and RES currently do not support Windows NT.
• The majority of the systems have both graphical and command-line interfaces. The only system without a graphical user interface is RES.

Analysis of the second set of requirements, "Scheduling and Resource Management," leads to the following conclusions (Table 2):
• Batch jobs are supported by all job management systems, while interactive jobs are not supported by Condor and RES.
• Full parallel job support is offered only by LSF and CODINE; both systems support programming, testing, execution, and run-time management of parallel applications. PBS has limited parallel job support (it cannot control or monitor resource usage of all processes belonging to a parallel job). Condor can act as a resource manager for the PVM (Parallel Virtual Machine) daemon. The current version of Condor does not support MPI.
• Accounting is available in all centralized job management systems, except for RES.
Table 1 Comparison of selected Job Management Systems from the viewpoint of Operating System Support, Flexibility, and User Interface.

                   | LSF        | CODINE        | PBS                          | Condor        | RES
Distribution       | commercial | public domain | commercial and public domain | public domain | restricted
Source code        | no         | yes           | public domain version only   | yes           | restricted
Operating System Support:
  Solaris          | yes        | yes           | yes                          | yes           | yes
  Linux            | yes        | yes           | yes                          | yes           | yes
  Tru64            | yes        | yes           | yes                          | no            | no
  Windows NT       | yes        | no            | no                           | partial       | no
User Interface     | GUI & CLI  | GUI & CLI     | GUI & CLI                    | GUI & CLI     | CLI
Table 2 Comparison of selected Job Management Systems from the viewpoint of Scheduling and Resource Management.

                 | LSF | CODINE | PBS     | Condor         | RES
Batch jobs       | yes | yes    | yes     | yes            | yes
Interactive jobs | yes | yes    | yes     | no             | no
Parallel jobs    | yes | yes    | partial | limited to PVM | no
Accounting       | yes | yes    | yes     | yes            | no
Analysis of the third set of requirements, "Efficiency and Utilization," leads to the following conclusions (Table 3):
• Stage-in and stage-out are supported by LSF, PBS, and Condor, while CODINE and RES do not support these capabilities.
• In LSF, CODINE, and Condor, jobs can be checkpointed and migrated to another machine with a lighter load. If no checkpointing is supported by a given job (e.g., commercial software), and the job is preempted from a heavily loaded host, the job can be rerun from the beginning on another machine. In LSF and CODINE, a job can be automatically migrated if it is suspended for an extended period of time, to assure load balancing. In CODINE, exceeding a predefined load value may also cause migration (a simple load-threshold sketch of such a decision follows these lists). PBS and RES support neither job migration nor load balancing.
• LSF MultiCluster allows dividing the computational resources of a very large or geographically dispersed organization into a number of autonomous clusters. This configuration enables load sharing and batch processing among clusters, and can support systems with multiple thousands of nodes. A similar capability is not available in any other JMS.

Analysis of the fourth set of requirements, "Fault Tolerance and Security," leads to the following conclusions (Table 4):
• LSF supports kernel-level checkpointing on all operating systems that provide it. LSF also supplies a user-level library to provide checkpointing on operating systems that do not provide kernel-level support. Checkpointing in Condor can be done using user-level libraries. Checkpointing in CODINE can be done in the same way as in LSF; however, it must be based on foreign checkpointing libraries, such as the Condor libraries. In PBS, only kernel-level checkpointing is supported, on the small number of operating systems that provide this mechanism.
• In LSF and CODINE, if the master host becomes unavailable or its daemons fail, another host takes over automatically. In LSF, only the total number of hosts in the system limits the number of transitions; in CODINE, only one additional host is specified as a shadow master host. In all JMSs, jobs can be automatically rerun if the execution host fails.
• Authentication is supported by all systems, and strong authorization by all systems except RES.
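The load-triggered migration mentioned above can be pictured with the following sketch; the load metric, the threshold, and the checkpoint flag are illustrative assumptions rather than the policy of any particular JMS.

```python
# Illustrative sketch of a load-triggered migration decision: if a host's load
# exceeds a configured threshold, a checkpointable job is checkpointed and moved
# to the least-loaded host; a non-checkpointable job is rerun from the beginning.
# Thresholds and the "checkpointable" flag are illustrative assumptions.

def pick_target(hosts_load, exclude):
    """Least-loaded host other than the one being vacated."""
    candidates = {h: load for h, load in hosts_load.items() if h != exclude}
    return min(candidates, key=candidates.get)

def migration_action(job, current_host, hosts_load, load_threshold=2.0):
    """Return a description of what the JMS would do with this job."""
    if hosts_load[current_host] <= load_threshold:
        return "stay"                                   # host load acceptable, do nothing
    target = pick_target(hosts_load, exclude=current_host)
    if job.get("checkpointable", False):
        return f"checkpoint and migrate to {target}"    # resume from checkpoint elsewhere
    return f"rerun from the beginning on {target}"      # no checkpoint support

if __name__ == "__main__":
    hosts_load = {"ws1": 3.4, "ws2": 0.5, "ws3": 1.1}   # e.g. 1-minute load averages
    print(migration_action({"name": "simulate", "checkpointable": True}, "ws1", hosts_load))
    print(migration_action({"name": "render", "checkpointable": False}, "ws1", hosts_load))
```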
Analysis of the last set of requirements, “Documentation and Technical Support,” leads to the following conclusions (Table 5):
Table 3 Comparison of selected Job Management Systems from the viewpoint of Efficiency and Utilization.

                       | LSF       | CODINE | PBS  | Condor | RES
Stage-in and stage-out | yes       | no     | yes  | yes    | no
Timesharing            | yes       | yes    | yes  | yes    | no
Process migration      | yes       | yes    | no   | yes    | no
Dynamic load balancing | yes       | yes    | no   | no     | no
Scalability            | excellent | good   | good | good   | good
Table 4 Comparison of selected Job Management Systems from the viewpoint of Fault Tolerance and Security.

                      | LSF                        | CODINE                     | PBS                      | Condor                   | RES
Checkpointing         | yes                        | using external libraries   | only kernel-level        | yes                      | no
Daemon fault recovery | master and execution hosts | master and execution hosts | only for execution hosts | only for execution hosts | no
Authentication        | yes                        | yes                        | yes                      | yes                      | yes
Authorization         | yes                        | yes                        | yes                      | yes                      | no
• Documentation is excellent for LSF, and good for the majority of the other systems, except RES. The documentation of LSF is superior to that of the other systems from the viewpoint of organization, completeness, readability, and the clear division of information into parts required by LSF administrators, users, and programmers.
• The technical support teams of Platform Computing Corporation (LSF), Veridian Systems, Inc. (PBS), and the Condor Team at the University of Wisconsin-Madison were contacted several times during our study [3]. The typical issues concerned problems with installation and configuration of the systems, and clarifications regarding the capabilities of each JMS. The technical support of Sun Grid Engine was not used during our study. The technical support of LSF was found to be excellent: a typical response time was in the range of 2-3 hours, and the answers to our questions were complete and accurate. The technical support of PBS was good: a typical response time was in the range of 24 hours, although follow-up communication was sometimes necessary to clarify answers to our questions. The technical support of Condor was average: a typical response time was in the range of 2-3 days, and selected answers required follow-up communication.
IV. EXTENSION TO RECONFIGURABLE HARDWARE

A. Features of the Job Management Systems supporting extension
The specific features of Job Management Systems that support extension to reconfigurable hardware include:
• the capability to define new dynamic resources;
• strong support for stage-in and stage-out, in order to allow an easy transfer of the FPGA configuration bitstreams, data inputs, and results between the submission host and the execution host with reconfigurable hardware (a toy stage-in/stage-out sketch follows this list);
• support for Windows NT and Linux, which are the two primary operating systems running on PCs that can be extended with commercially available FPGA-based accelerator boards with the PCI interface.
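The stage-in/stage-out requirement can be illustrated with the sketch below, which copies a bitstream and input data to a work directory before execution and retrieves the results afterwards. The directory layout and file names are hypothetical; a real JMS would move these files across the network as part of job dispatch.

```python
# Toy sketch of stage-in/stage-out for an FPGA job: before execution, the
# configuration bitstream and input data are copied from the submission host's
# spool area to the execution host's work directory; after execution, results
# are copied back. Directory layout and file names are illustrative only.

import shutil
from pathlib import Path

def stage_in(job_dir: Path, work_dir: Path, files=("design.bit", "input.dat")):
    """Copy the bitstream and input files needed by the job to the execution host."""
    work_dir.mkdir(parents=True, exist_ok=True)
    for name in files:
        shutil.copy2(job_dir / name, work_dir / name)

def stage_out(work_dir: Path, job_dir: Path, files=("results.dat",)):
    """Copy result files produced on the execution host back to the submission host."""
    for name in files:
        shutil.copy2(work_dir / name, job_dir / name)

if __name__ == "__main__":
    # Local directories stand in for the submission and execution hosts.
    job_dir, work_dir = Path("spool/job42"), Path("work/job42")
    job_dir.mkdir(parents=True, exist_ok=True)
    (job_dir / "design.bit").write_bytes(b"\x00bitstream")
    (job_dir / "input.dat").write_text("1 2 3\n")
    stage_in(job_dir, work_dir)
    (work_dir / "results.dat").write_text("done\n")   # pretend the accelerated job ran
    stage_out(work_dir, job_dir)
    print((job_dir / "results.dat").read_text().strip())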
B. General architecture of the extended system
The general architecture of the extended system is shown in Figure 2. The primary component of this extension is an external resource monitor that tracks the status of an accelerator board and periodically communicates this status to the resource monitor. The resource monitor transfers this information, periodically or on request, to the job scheduler, which uses it to match each job that requires acceleration with an appropriate host. Job requirements regarding the new reconfigurable resource are specified during job submission to the user server, and are enforced by the job scheduler in the same way as requirements regarding the default built-in resources.
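To illustrate how the scheduler might use the reported board status, the sketch below treats the board's availability as one more resource in the matching step; the resource name "fpga", the job flag, and the matching rule are our own illustrative assumptions, not the interface of any particular JMS.

```python
# Illustrative sketch of matching jobs that request FPGA acceleration against
# hosts whose external resource monitors report a dynamic "fpga" resource.
# Resource names, host records, and the matching rule are assumptions made for
# illustration only.

def match_host(job, hosts):
    """Return the name of the first host that satisfies the job's requirements."""
    for name, resources in hosts.items():
        if job.get("needs_fpga") and resources.get("fpga", 0) < 1:
            continue                      # job requires a free accelerator board
        if resources.get("cpus", 0) < job.get("cpus", 1):
            continue                      # not enough spare CPUs on this host
        return name
    return None                           # job stays queued until resources free up

if __name__ == "__main__":
    # "fpga" is the dynamic resource periodically refreshed by the external monitor.
    hosts = {"ws1": {"cpus": 2, "fpga": 0},      # board present but currently busy
             "ws2": {"cpus": 4, "fpga": 1}}      # board reported as available
    print(match_host({"name": "crypto_bench", "needs_fpga": True, "cpus": 1}, hosts))
    print(match_host({"name": "compile", "needs_fpga": False, "cpus": 2}, hosts))
```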
C. Extending selected JMSs to manage reconfigurable resources
LSF, CODINE, and the most recent version of PBS (5.1.1) all support defining and managing new dynamic resources using configuration files only, without any need for changes in the source code of the JMS programs. This feature can be used to extend each of these three JMSs to manage FPGA-based accelerator boards. The new resource that needs to be added to a given JMS represents the availability of the accelerator board to JMS users. An external resource monitor is written according to the specification for
• ELIM, the External Load Information Manager, in LSF,
• the load sensor in CODINE, and
• a shell escape in the MOM configuration file in PBS.
This daemon is started by the local resource manager (LIM in LSF, cod_execd in CODINE, and MOM in PBS), and communicates with the resource monitor using standard output. Extending Condor and RES to provide similar functionality would require changes in the source code of these JMSs.
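A minimal sketch of such an external resource monitor is given below in Python. It follows the general reporting convention described above, periodically writing the resource name and its current value to standard output, where the local resource manager can pick it up. The probe for the accelerator board, the lock-file convention, and the resource name "fpga" are hypothetical placeholders; the exact output format expected by each JMS should be taken from its own documentation (ELIM for LSF, load sensor for CODINE, MOM shell escape for PBS).

```python
# Sketch of an external resource monitor for an FPGA accelerator board.
# It periodically reports a single dynamic resource ("fpga": 1 if the board is
# free, 0 if busy or absent) on standard output, where the local resource
# manager (LIM / cod_execd / MOM) can read it. The probe below is a placeholder;
# a real monitor would query the board through its API or driver.

import os
import sys
import time

RESOURCE_NAME = "fpga"             # hypothetical name of the new dynamic resource
REPORT_INTERVAL_S = 30             # reporting period, chosen arbitrarily here
LOCK_FILE = "/var/run/fpga.lock"   # placeholder convention: lock file means "board in use"

def board_available() -> int:
    """Placeholder probe: the board counts as free when no lock file exists."""
    return 0 if os.path.exists(LOCK_FILE) else 1

def main() -> None:
    while True:
        # One resource is reported per line as a "count name value" record; the
        # precise format each JMS expects is defined in its own documentation.
        sys.stdout.write(f"1 {RESOURCE_NAME} {board_available()}\n")
        sys.stdout.flush()         # the resource manager reads our stdout as a stream
        time.sleep(REPORT_INTERVAL_S)

if __name__ == "__main__":
    main()
```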
Figure 2 Extension of a Centralized Job Management System to recognize, monitor, and schedule reconfigurable resources. [Figure: the architecture of Figure 1 (User Server, Job Scheduler, and Resource Manager with Resource Monitor and Job Dispatcher) extended with an External Resource Monitor that reports the status of the FPGA board; the user job accesses the FPGA board through the FPGA board APIs.]
V. SUMMARY

LSF outperforms all other Centralized Job Management Systems in terms of operating system support, scalability, documentation, and technical support. It is also one of only two systems that fully support parallel jobs and checkpointing, and that offer strong resistance to master host failure.
CODINE performs extremely well in multiple categories, such as parallel job support, job migration, load balancing, and resistance to master host failure. The major drawbacks of CODINE include the lack of support for Windows NT, no support for stage-in and stage-out, and only externally supported checkpointing.
The primary weaknesses of PBS include no support for Windows NT, very limited checkpointing, no job migration or load balancing, and limited parallel job support.
Condor distinguishes itself from the other systems by its strong checkpointing. It is also one of the oldest and best understood job management systems. The main weaknesses of Condor include no support for interactive jobs, limited support for parallel jobs, and average technical support.
Surprisingly, ease of extension to support FPGA-based accelerator boards appeared to be a minor factor in the comparison. Three out of the five systems, LSF, CODINE, and PBS v. 5.1, seem to be easily extendable without the need for any changes in their source code. The remaining systems, Condor and RES, can also be relatively easily extended, taking into account the full access to their source code.
Taking into account the combined results of our study, we recommend choosing LSF as the primary system to be extended to support FPGA-based accelerator boards.
REFERENCES
[1] M. A. Baker, G. C. Fox, and H. W. Yau, "Cluster Computing Review," Northeast Parallel Architectures Center, Syracuse University, Nov. 1995.
[2] T. El-Ghazawi, et al., Conceptual Comparative Study of Job Management Systems, Technical Report, February 2001, available at http://ece.gmu.edu/lucite/reports.html.
[3] T. El-Ghazawi, et al., Experimental Comparative Study of Job Management Systems, Technical Report, August 2001, available at http://ece.gmu.edu/lucite/reports.html.
[4] K. Hwang and Z. Xu, Scalable Parallel Computing: Technology, Architecture, Programming, McGraw-Hill, 1998.
[5] A. Goscinski, Distributed Operating Systems: The Logical Design, Addison-Wesley, 1991.
[6] J. P. Jones, "NAS Requirements Checklist for Job Queuing/Scheduling Software," NAS Technical Report NAS-96-003, April 1996, available at http://www.nas.nasa.gov/Pubs/TechReports/NASreports/NAS-96-003/
[7] J. P. Jones, "Evaluation of Job Queuing/Scheduling Software: Phase 1 Report," NAS Technical Report NAS-96-009, Sep. 1996, available at http://www.nas.nasa.gov/Research/Reports/Techreports/1996/nas-96-009-abstract.html
[8] C. Byun, C. Duncan, and S. Burks, "A Comparison of Job Management Systems in Supporting HPC ClusterTools," Proc. SUPerG, Vancouver, Fall 2000.
[9] O. Hassaine, "Issues in Selecting Job Management Systems," Proc. SUPerG, Tokyo, April 2001.