User-level Parallel Operating System for Clustered Commodity Computers

Atsushi Hori, Hiroshi Tezuka, and Yutaka Ishikawa

Tsukuba Research Center, Real World Computing Partnership
Tsukuba Mitsui Building 16F, 1-6-1 Takezono, Tsukuba-shi, Ibaraki 305, JAPAN

E-mail: {hori,tezuka,[email protected]

1 Introduction

Workstation and PC clusters using a high-speed network can achieve performance comparable to massively parallel machines at a much lower cost. In particular, PC clusters can be more cost-effective than workstation clusters because of the mass-production effect of PCs. The merit of clusters is not only their cost-performance, but also the speed with which a cluster can be installed. The world of microelectronics is changing so rapidly that taking more than a year for cluster development means that the developed cluster cannot use state-of-the-art hardware technologies.

A parallel operating system that brings out the hardware potential of workstation and PC clusters, as well as a good multi-user parallel programming environment, is desired. Such a parallel operating system must be portable and easy to develop in order to catch up with state-of-the-art hardware technologies.

We have designed and developed a parallel operating system, called SCore-D, on top of the Unix operating system [6]. SCore-D is designed to bring out the full potential of the cluster hardware. Our approach is that local resource management is left to Unix and only global resource management is implemented in daemon processes on top of Unix. Choosing a Unix operating system as the base and local operating system frees us from developing the machine-dependent part of the operating system, so that we may devote ourselves to the design and implementation of the global resource management. SCore-D has been running on our workstation cluster, consisting of 36 SparcStation 20s, and on our PC cluster, consisting of 32 Pentiums. In this paper, after introducing an overview of the design and implementation of SCore-D, the overhead of the SCore-D user-level operating system is evaluated to examine whether our approach is a right way toward a parallel operating system for clustered commodity computers.

2 An Overview of SCore-D

2.1 Features

To support a multi-user environment, SCore-D employs gang scheduling, which has the advantage of efficient execution of fine-grain parallel programs. With gang scheduling, a set of user processes distributed over workstations and communicating with each other frequently is subject to context switching. Such a set is called a user parallel process, while each element of the set is called an element process. To enable gang scheduling under the assumption that a parallel process directly accesses the network hardware of the cluster to achieve low-latency and high-bandwidth communication, the network context, which consists of the status and buffers of the network hardware and the user messages flying in the network, must be stored and restored. We call this change of the network's user preemption of the network, or network preemption [6].

SCore-D is implemented as a parallel process on top of Unix. All user parallel processes are created and controlled by the SCore-D parallel process using the signal facility of Unix. Parallel process switching using network preemption is realized by the following scenario: i) an element process of the SCore-D parallel process stops an element process of the current user parallel process by issuing the SIGSTOP signal, ii) the network context is stored, iii) all SCore-D element processes are synchronized, iv) the network context of the next user parallel process is restored, and v) each SCore-D element process issues the SIGCONT signal to resume the next user parallel process. To realize this scenario, the SCore-D parallel process uses the network to synchronize all of its elements while the user parallel process also uses the network. Thus, SCore-D assumes that the network driver supports at least two virtual networks, one for SCore-D and one for the users.

We have designed and developed a network driver called PM [10] for Myrinet to satisfy the SCore-D requirements. PM introduces a communication channel, which represents the network status and a message buffer, and provides a context switching capability for it. The current PM on Myrinet [3] supports four channels. It achieves a latency of 7.2 microseconds and a bandwidth of 117.6 MB/s on our PC cluster with Myrinet (M2M-PCI32).

Network preemption enables not only gang scheduling but also the detection of the global state of a user parallel process [5]. At a network preemption, the SCore-D parallel process may examine whether no element process of a parallel process is running and there is no message in the network. In that case, if there exist one or more blocking system calls, the process is idle; otherwise the process is terminated. By detecting the status of user parallel processes at each time quantum, SCore-D can schedule parallel processes so as to realize an interactive parallel programming environment under gang scheduling. To provide global resource facilities such as parallel I/O, SCore-D also supports a system call mechanism in a protected way using Unix IPC mechanisms.
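The five-step switching scenario above can be pictured as a short signal-driven routine. The following sketch, a minimal C++/POSIX rendering and not SCore-D's actual code, shows one context switch as seen by a single SCore-D element process; the PM channel save/restore calls and the SCore-D-internal barrier are hypothetical placeholders, since the paper does not expose the driver interface.

    #include <sys/types.h>
    #include <signal.h>

    // Placeholder for the saved state of a PM communication channel:
    // channel status, message buffers, and messages still in flight.
    struct NetworkContext { };

    // Hypothetical PM driver hooks (the real PM interface is not shown here).
    static void pm_save_channel(NetworkContext *ctx)          { (void)ctx; }
    static void pm_restore_channel(const NetworkContext *ctx) { (void)ctx; }

    // Hypothetical barrier over SCore-D's own virtual network (a second PM channel).
    static void scored_barrier() { }

    // One parallel-process switch, executed by every SCore-D element process.
    void switch_parallel_process(pid_t current_pid, NetworkContext *current_ctx,
                                 pid_t next_pid, const NetworkContext *next_ctx)
    {
        kill(current_pid, SIGSTOP);      // i)   stop the local user element process
        pm_save_channel(current_ctx);    // ii)  store its network context
        scored_barrier();                // iii) synchronize all SCore-D elements
        pm_restore_channel(next_ctx);    // iv)  restore the next process's context
        kill(next_pid, SIGCONT);         // v)   resume the next user element process
    }

Step iii) runs over SCore-D's own virtual network, which is why the driver must provide at least two channels.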

2.2 Implementation

Realizing the SCore-D functions complicates the communication protocols among SCore-D element processes. Usually, a communication protocol is implemented in the message passing paradigm, where designing and implementing the message formats is hard and the resulting software is difficult to maintain. However, the inter-element-process communication protocol in SCore-D is written in a multi-threaded C++, MPC++ [8], which supports remote function invocation and remote memory access facilities. Since the multi-threaded features of MPC++ hide the communication protocol behind its abstraction, the developer is freed from the traditional, error-prone protocol implementation. As a result, all the SCore-D functions are implemented in about 5,000 lines of MPC++.
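To make the contrast concrete, the sketch below shows how a remote-function-invocation layer can hide message formats behind an ordinary function call: arguments are marshalled generically, so no per-message struct or hand-written dispatch protocol is needed. This is only an illustration of the idea; the handler table, marshal(), and remote_invoke() names are hypothetical and are not MPC++'s actual interface, and the "remote" call is dispatched locally to keep the example self-contained.

    #include <cstdint>
    #include <cstring>
    #include <cstdio>
    #include <vector>
    #include <functional>
    #include <map>

    using Buffer = std::vector<std::uint8_t>;
    static std::map<int, std::function<void(const Buffer&)>> handler_table;

    // Marshal a pack of plain-data arguments into a flat buffer (generic, reusable).
    template <typename... Args>
    Buffer marshal(const Args&... args) {
        Buffer buf;
        auto put = [&buf](const auto& a) {
            const auto *p = reinterpret_cast<const std::uint8_t*>(&a);
            buf.insert(buf.end(), p, p + sizeof(a));
        };
        (put(args), ...);
        return buf;
    }

    // "Remote" invocation: in a real system the buffer would travel over the
    // network to the target node; here we dispatch locally for brevity.
    template <typename... Args>
    void remote_invoke(int /*node*/, int handler_id, const Args&... args) {
        handler_table[handler_id](marshal(args...));
    }

    int main() {
        // Registering a handler is the only format-specific code left to write.
        handler_table[0] = [](const Buffer& buf) {
            int pid; double load;
            std::memcpy(&pid,  buf.data(),              sizeof pid);
            std::memcpy(&load, buf.data() + sizeof pid, sizeof load);
            std::printf("report: pid=%d load=%.2f\n", pid, load);
        };
        remote_invoke(3, 0, 42, 0.75);  // reads like a function call, not a protocol
        return 0;
    }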

3 Preliminary Evaluation

Here, we focus on the overhead incurred by SCore-D to see if the SCore-D approach can be practical. Table 1 shows the slowdown (overhead) due to gang scheduling, varying the time quantum over 0.2, 0.5, and 1.0 seconds. To measure the overhead, we ran a special program in which barrier synchronization is iterated 200,000 times on 32 processors. By evaluating with this special program, all possible gang scheduling overhead, including coscheduling skew [2], is included. The slowdown is compared with the same program running with a stand-alone runtime library of MPC++. With a time quantum of half a second, the slowdown is 8.84% on our workstation cluster and 4.16% on our PC cluster. The overhead of the PC cluster is much less than that of the workstation cluster; we believe this difference is dominated by the scheduling policy of the local Unix operating system [6]. Although the time quantum is relatively large compared with that of Unix, the execution-time granularity of parallel applications is considered to be larger than that of sequential applications. It is not reasonable to

run a text editor on parallel machines, and scheduling tens of processors to echo one character is meaningless. Thus, even for interactive parallel programming, the processing granularities triggered by input commands should be larger, and the larger time quantum is considered to be still acceptable. The global state detection mechanism for parallel processes can utilize processor resources and reduce users' frustration with interactive parallel programs.
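For reference, the slowdown micro-benchmark has a very simple structure, sketched below with MPI rather than the MPC++ stand-alone runtime actually used in the measurements: time 200,000 barrier synchronizations across all processes, run the same binary once under SCore-D gang scheduling and once stand-alone, and compare the elapsed times.

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int iterations = 200000;
        MPI_Barrier(MPI_COMM_WORLD);           // start everyone together
        double start = MPI_Wtime();
        for (int i = 0; i < iterations; ++i)
            MPI_Barrier(MPI_COMM_WORLD);       // fine-grain synchronization load
        double elapsed = MPI_Wtime() - start;

        if (rank == 0)
            std::printf("%d barriers: %.3f s (%.2f us each)\n",
                        iterations, elapsed, 1e6 * elapsed / iterations);
        MPI_Finalize();
        return 0;
    }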

Table 1: Slowdown due to gang scheduling [%]

    Time Quantum [sec]       1.0     0.5     0.2
    Workstation Cluster      6.96    8.84    28.7
    PC Cluster               2.87    4.16    6.25

To evaluate the overhead of the SCore-D supported system call mechanism, the processing times of getpid() on Unix and on SCore-D are compared in Table 2. The getpid() on SCore-D (actually named sc_getpid()) obtains the parallel process ID managed by SCore-D. The getpid() on SCore-D is much slower than that of Unix due to the IPC overhead. This overhead is, however, acceptable in parallel I/O facilities, whose processing time is much greater than the overhead of the SCore-D system call mechanism.

Table 2: getpid() [μsec]

                             SCore-D    Unix
    Workstation Cluster      351.7      5.0 (SunOS)
    PC Cluster               483.6      1.6 (NetBSD)
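The getpid() numbers in Table 2 are dominated by one IPC round trip per call. The sketch below shows how such a SCore-D-mediated system call could be routed over a Unix domain socket to the local SCore-D daemon; the socket path, request layout, and daemon-side handling are assumptions, not SCore-D's actual protocol, and error handling is minimal.

    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>
    #include <cstring>

    struct Request { int syscall_id; };   // e.g. a hypothetical SC_GETPID code
    struct Reply   { int parallel_pid; }; // parallel process ID managed by SCore-D

    int sc_getpid_sketch() {
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd < 0) return -1;

        sockaddr_un addr{};
        addr.sun_family = AF_UNIX;
        std::strncpy(addr.sun_path, "/tmp/scored.sock", sizeof(addr.sun_path) - 1);
        if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof addr) < 0) {
            close(fd);
            return -1;
        }

        Request req{0 /* SC_GETPID */};
        Reply   rep{};
        // One IPC round trip per call: this exchange dominates the measured cost.
        if (write(fd, &req, sizeof req) != (ssize_t)sizeof req ||
            read(fd, &rep, sizeof rep)  != (ssize_t)sizeof rep) {
            close(fd);
            return -1;
        }
        close(fd);
        return rep.parallel_pid;
    }

A send/receive pair of this kind easily costs hundreds of microseconds on the hardware considered here, which is consistent with the SCore-D column of Table 2 and explains why the mechanism is reserved for coarse-grain facilities such as parallel I/O.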

4 Related Works

In workstation and PC clusters, standard communication libraries such as PVM or MPI are commonly used on top of TCP or UDP, which have expensive protocol overhead, i.e., high latency and low bandwidth. To reduce the protocol overhead, Fast Messages (FM) [9], U-Net [11], and our PM use memory-mapping techniques in their network drivers and libraries. FM allows a process to use the network hardware resource exclusively. U-Net supports multiple communication channels and simultaneous access to the network hardware from multiple processes. However, U-Net does not guarantee the soundness of recycling or sharing a communication channel (an endpoint in U-Net terms), and the user processes are responsible for guaranteeing the freeing of message buffers and the flushing of messages flying in the network. In contrast, PM supports multiple channels and a network preemption mechanism, so SCore-D can preempt or kill a user parallel process at any time, and the channel used by the parallel process can be reused or recycled.

GLUnix [2, 1] is another user-level parallel operating system for workstation clusters. The major difference between GLUnix and SCore-D can be found in their goals. While GLUnix assumes a mixture of parallel and sequential workloads to utilize idle workstations, SCore-D assumes only parallel workloads. Our research target is to clarify how much parallel performance can be achieved on workstation or PC clusters compared with parallel machines. SCore-D does not take into account fault tolerance or binary compatibility with sequential programs. Through the building of our PC cluster, we have confirmed that a PC cluster is very cost-effective. We believe that it is also feasible for a PC cluster to dedicate itself to parallel processing.

5 Concluding Remarks

We have developed a user-level parallel operating system called SCore-D, providing i) gang scheduling implemented by network preemption, ii) a global state detection mechanism to utilize processor resources with interactive parallel programs, and iii) a system call mechanism. We have also developed a new class of job scheduling, called Time Space Sharing Scheduling (TSSS), a combination of space sharing and time sharing [7]. TSSS can utilize processor resources better than simple batch scheduling. A load balancing algorithm on TSSS, called Distributed Queue Tree (DQT) [4], has been developed in SCore-D to balance the system load in a distributed manner.

It is hard to give a definite answer to the question of whether the gang scheduling overhead is acceptable. However, considering the merit coming from TSSS and interactive programming, a few percent of slowdown is acceptable. We believe that the SCore-D approach is effective for developing a parallel operating system for workstation or PC clusters. SCore-D is not only a practical parallel operating system, but also a parallel operating system research platform. We are continuing to develop and expand SCore-D to support more parallel functions such as parallel I/O. For those interested in our clustering technologies, please visit the following Web address:

http://www.rwcp.or.jp/lab/mpslab/clusters/

References

[1] T. E. Anderson, D. E. Culler, D. A. Patterson, et al. A Case for NOW (Networks of Workstations). IEEE Micro, 15(1):54-64, February 1995.

[2] R. H. Arpaci, A. C. Dusseau, A. M. Vahdat, L. T. Liu, T. E. Anderson, and D. A. Patterson. The Interaction of Parallel and Sequential Workloads on a Network of Workstations. UC Berkeley Technical Report CS-94-838, Computer Science Division, University of California, Berkeley, 1994.

[3] N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, and W.-K. Su. Myrinet: A Gigabit-per-Second Local Area Network. IEEE Micro, 15(1):29-36, February 1995.

[4] A. Hori, Y. Ishikawa, H. Konaka, M. Maeda, and T. Tomokiyo. A Scalable Time-Sharing Scheduling for Partitionable, Distributed Memory Parallel Machines. In Proceedings of the Twenty-Eighth Annual Hawaii International Conference on System Sciences, Vol. II, pages 173-182. IEEE Computer Society Press, January 1995.

[5] A. Hori, H. Tezuka, and Y. Ishikawa. Global State Detection using Network Preemption. In IPPS'97 Workshop on Job Scheduling Strategies for Parallel Processing, April 1997. To appear.

[6] A. Hori, H. Tezuka, Y. Ishikawa, N. Soda, H. Konaka, and M. Maeda. Implementation of Gang-Scheduling on Workstation Cluster. In D. G. Feitelson and L. Rudolph, editors, IPPS'96 Workshop on Job Scheduling Strategies for Parallel Processing, volume 1162 of Lecture Notes in Computer Science, pages 76-83. Springer-Verlag, April 1996.

[7] A. Hori, T. Yokota, Y. Ishikawa, S. Sakai, H. Konaka, M. Maeda, T. Tomokiyo, J. Nolte, H. Matsuoka, K. Okamoto, and H. Hirono. Time Space Sharing Scheduling and Architectural Support. In D. G. Feitelson and L. Rudolph, editors, Job Scheduling Strategies for Parallel Processing, volume 949 of Lecture Notes in Computer Science, pages 92-105. Springer-Verlag, April 1995.

[8] Y. Ishikawa, A. Hori, H. Tezuka, M. Matsuda, H. Konaka, M. Maeda, T. Tomokiyo, and J. Nolte. MPC++. In G. V. Wilson and P. Lu, editors, Parallel Programming Using C++, pages 429-464. MIT Press, 1996.

[9] S. Pakin, M. Lauria, and A. Chien. High Performance Messaging on Workstations: Illinois Fast Messages (FM) for Myrinet. In Supercomputing'95, December 1995.

[10] H. Tezuka, A. Hori, Y. Ishikawa, and M. Sato. PM: An Operating System Coordinated High Performance Communication Library. In High-Performance Computing and Networking '97, 1997. To appear.

[11] T. von Eicken, A. Basu, and W. Vogels. U-Net: A User Level Network Interface for Parallel and Distributed Computing. In Fifteenth ACM Symposium on Operating Systems Principles, pages 40-53, 1995.
