The design and implementation of a microkernel-based parallel OS "Cenju-3/DE"

Yosuke Takano, Christopher Howson, Koichi Konishi, Tomoyoshi Sugawara, Hiroyuki Araki, Shinji Yanagida†, Akihiko Konagaya
C&C Systems Research Laboratories, NEC Corporation; †NEC Informatec Systems, Ltd.

Abstract
Cenju-3/DE is a microkernel-based parallel OS developed on the distributed memory parallel machine Cenju-3. Its objective is to realize an efficient software platform dedicated to specific parallel applications using standard components such as the Mach microkernel, UNIX and the Message Passing Interface (MPI) library. The current Cenju-3/DE system (Version 1.1) provides a multi-user, multi-tasking and multi-threaded programming environment; a subset of UNIX system calls; and various interprocessor communications such as NORMA-IPC/DE, MPI/DE and NODO. So far, most of our efforts have focused on tuning the performance of the IPCs, since for distributed memory architectures the performance of parallel OS services and applications strongly depends on the IPC performance. As a result, Cenju-3/DE has achieved 185 μsec latency and 22 Mbytes/sec throughput in a multi-tasking environment (NORMA-IPC/DE); 60 μsec latency and 19 Mbytes/sec throughput in a single task environment (MPI/DE); and 25 μsec latency for user-level IPC (NODO). (A latency of 12 μsec has been measured when an experimental network interface is used[7].)
1 Introduction
Recently, parallel computer architecture design has been focusing on distributed memory architectures to achieve scalability. For distributed memory architectures, there are two major approaches to designing a parallel operating system (OS): a monolithic kernel approach and a microkernel approach. In the monolithic kernel approach, each processor of a parallel computer runs a traditional OS and communicates with other nodes as if the parallel computer were a local area computer network (LAN). While the implementation of such a design seems easier because it can be based on an existing OS implementation, system management becomes difficult, especially when the number of processors becomes large. In addition, its networking facilities must be modified in order to achieve fast interprocessor communication (IPC), since conventional networking facilities use heavy communication protocols which are necessary for slow, unreliable and unscalable networks such as a LAN, but are a big burden for a parallel machine with a fast, reliable and scalable processor interconnection. (In this paper we use IPC to refer to interprocessor, not interprocess, communication, except when the term Mach-IPC is used.)

The alternative is a microkernel approach, in which a parallel OS consists of microkernels and OS servers. The microkernel is located on each processor of the parallel computer and handles low-level resource management facilities such as task management and memory management as well as IPC. The OS servers can be located on any processor of the parallel computer and serve all tasks via the IPC. From the viewpoint of parallel OS development, the microkernel approach has the following advantages. First of all, the microkernel is so small that it considerably reduces the computing power and memory consumed by OS facilities; this grows in significance as the number of processors increases. Secondly, OS servers can be accessed transparently from any processor on the parallel machine using the microkernel's IPC.

In addition, Cenju-3/DE aims to show the advantage of an application-specific OS, which provides server components specific to target applications rather than providing general purpose servers for all applications.

The organization of this paper is as follows. Section 2 gives an overview of the Cenju-3 system. Section 3 describes the design issues of Cenju-3/DE. Section 4 describes the structure and implementation of Cenju-3/DE. Section 5 describes the IPC implementations of Cenju-3/DE. Section 6 reports the results of performance measurements of the IPCs of Cenju-3/DE version 1.1 on Cenju-3.
[Figure 1: The block diagram of Cenju-3 (64 PE configuration). A multistage interconnection network built from NWC0 and NWC1 network cards (40 Mbyte/sec per port) connects PE0 through PE56 and, via an IOC, an EWS4800 front-end computer. Each PE is a multi-chip module (MCM) containing a VR4400 (75 MHz) CPU with 16KB+16KB primary caches, a 1 Mbyte secondary cache, a 64-bit 50 MHz system bus, memory address/data controllers with 64 Mbytes of DRAM on a 128-bit path, a 32-bit 25 MHz I/O controller with an SIO diagnosis line, and send/receive network interfaces (NIF-S, NIF-R) with FIFOs on 16-bit 20 MHz links to and from the network.]
2 Cenju-3

Cenju-3 is a parallel machine with up to 256 processing elements (PEs) and a front-end computer (an EWS4800 workstation) connected by a multistage interconnection network. The block diagram in Figure 1 shows a Cenju-3 system and an expanded view of a single PE. The multistage interconnection network is a packet-switched network with 40 Mbyte/sec hardware throughput for each port and 1 μsec latency from input to output. It consists of two kinds of 16 × 16 network cards (NWC0 and NWC1), and the same network cards can be used in configurations ranging from 16 to 256 PEs by changing the cable connections. Each PE consists of a VR4400 (75 MHz) RISC CPU with 1 Mbyte of secondary cache, 64 Mbytes of main memory, and network interface hardware (NIF) implemented with FPGAs. Using the NIF, a message can be sent either to the receive queues or to a physical address in the local memory of a destination processor. The latter case is called a remote DMA transfer.

3 Cenju-3/DE Design Philosophy

Cenju-3/DE has been designed to support the development of application-specific OSs and of parallel OS servers. An application-specific OS enables application programmers to optimize OS facilities for the target application system by customizing the OS server layer. Such customization facilities have great potential to reduce the overhead of a conventional general-purpose OS, which has been tuned for the most common cases.

For example, memory allocation is usually scheduled by the OS in a fixed manner. However, some applications may benefit greatly when they can schedule the allocation themselves, for example for application data, swap area, disk buffers and communication buffers. To achieve this, it is necessary to design OS facilities which are as flexible as possible. We therefore adopted the microkernel design, since it lets us provide most OS facilities as OS servers, although much remains to be done to design customizable OS server components.

We adopted Mach (CMU version) for the current Cenju-3/DE implementation, because various kinds of OS servers are freely available for it, for example GNU Hurd[3], the Mach4 server and the CMU multi-server[6]. These OS servers are useful as a starting point for building application-specific OS servers.

The second design issue is the parallelization of OS facilities. In order to take full advantage of a parallel computer, it is necessary to utilize the parallel processing power for OS execution as well as application execution. If an OS facility is provided as a single server, that server can easily become a bottleneck for a parallel application when system calls from user tasks on all nodes are handled there.

As for UNIX facilities, multiple-server systems such as Chorus/Mix, GNU Hurd and the CMU multi-server have been announced and developed on microkernels. However, there have been few reports concerning the parallelization of multiple servers, especially for distributed memory architectures. In Cenju-3/DE we have tried to revise the server decomposition and distribution to enhance performance, availability and application specificity.

4 Implementation

This section describes the current Cenju-3/DE implementation (Cenju-3/DE v1.1) running on Cenju-3 in terms of its system organization, facilities and IPC.

4.1 Cenju-3/DE Facilities

The facilities of Cenju-3/DE v1.1 are summarized below. The first four are implemented by Cenju-3/DE OS servers or libraries:

- Multi-user computing environment
- A subset of UNIX (System V R4.2) system calls
- Three kinds of IPC (MPI/DE, NORMA-IPC/DE and NODO)
- The psh interpreter

The other three facilities are part of the Mach microkernel:

- Multi-tasking and multi-threaded programming environment
- Mach services such as cthreads
- Local SCSI disk access (raw I/O)

4.2 System Organization

Figure 2 shows the organization of the Cenju-3/DE v1.1 system on Cenju-3, which consists of the microkernel, the IPCs and multiple OS servers.

[Figure 2: Cenju-3/DE v1.1 structure. On the Cenju-3 element processors, parallel application programs (DenEn v1.1, psh, MPI/DE) run on top of the OS servers (execs, rsyscalls, cjcs, snames) and the Mach microkernel with NORMA-IPC/DE; the denend server runs on the front-end computer.]

The Mach microkernel, which was ported to the Cenju-3 architecture, is located on each PE and provides the low-level resource management facilities. It also provides a network-transparent IPC (NORMA-IPC/DE) and a high-performance user-level IPC (NODO).

In order to support the multi-user, multi-tasking and UNIX system call facilities, we developed three OS servers named execs, rsyscalls and cjcs on Mach.

The execs server provides the multi-user and multi-tasking facilities. All task creation and termination on Cenju-3 are performed by sending user requests to the execs server. There is a single execs server in the Cenju-3/DE system, located on a PE designated at OS initialization.

The rsyscalls server realizes UNIX system calls such as input/output operations by forwarding them to the denend server on the front-end computer. The cjcs server handles the communication between the PEs and the front-end computer. The rsyscalls server and the cjcs server are parallelized and located on each PE; this facility is implemented mainly for convenience. The system call forwarding service is useful if an application system requires most of the processing power and memory of the parallel computer, leaving no resources for full OS services (a rough sketch of the forwarding appears at the end of this section).

The snames server, originally developed at CMU, is the name server used by the OS servers to register their service ports.

The denend server is located on the front-end computer. It processes the system calls forwarded from each PE, and also forwards terminal control operations between a user's terminal and the user task on a PE.

In addition to servers, Cenju-3/DE provides tools to simplify the construction of parallel programs. The psh interpreter expands a (currently) static task graph, specified as a text file, into a set of Mach tasks and ports. This simplifies the creation and initialization of sets of user programs designed to work together.

MPI/DE is a complete implementation of the Message Passing Interface (MPI) for building message-passing parallel programs. It is described in more detail in the next section.
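To make the forwarding idea concrete, here is a minimal sketch of how a PE-side stub might marshal a write() call, ship it to the front-end, and unpack the reply. The paper does not give the actual rsyscalls/denend interface, so the message layout and the ipc_rpc primitive below are purely illustrative assumptions.

```c
/* Hypothetical sketch of UNIX system call forwarding in the style of
 * the rsyscalls/denend pair; all names and layouts are illustrative. */
#include <stddef.h>
#include <string.h>

enum { SYS_WRITE = 4 };          /* assumed syscall tag */

typedef struct {
    int    syscall_no;           /* which UNIX call is forwarded */
    int    fd;                   /* file descriptor on the front-end */
    size_t count;                /* payload length */
    char   data[1024];           /* inline payload for small writes */
} syscall_req_t;

typedef struct {
    long   retval;               /* result from the front-end OS */
    int    err;                  /* errno value, 0 on success */
} syscall_rep_t;

/* Assumed IPC primitive: send a request to the denend server on the
 * front-end computer and block until its reply arrives. */
extern void ipc_rpc(const void *req, size_t req_len,
                    void *rep, size_t rep_len);

long remote_write(int fd, const void *buf, size_t count)
{
    syscall_req_t req = { .syscall_no = SYS_WRITE, .fd = fd };
    syscall_rep_t rep;

    if (count > sizeof(req.data))
        count = sizeof(req.data);   /* a real stub would chunk large writes */
    req.count = count;
    memcpy(req.data, buf, count);

    ipc_rpc(&req, sizeof(req), &rep, sizeof(rep));
    return rep.err ? -1 : rep.retval;
}
```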
5 Interprocessor Communication

In distributed memory architectures, the performance of parallel applications and OS facilities strongly depends on the performance of IPC. Cenju-3/DE v1.1 provides three kinds of IPC (NORMA-IPC/DE, MPI/DE and NODO) in order to achieve high-performance IPC in different situations.

5.1 NORMA-IPC/DE

NORMA-IPC[2] was originally designed and implemented in CMU Mach for multi-computers, that is, as IPC for NO-Remote-Memory-Access architectures. It attempts to maintain the same semantics as local Mach-IPC, which was designed for single-processor or shared-memory architectures. This facility is very attractive: if NORMA-IPC is provided, all OS servers using Mach-IPC can be ported to Cenju-3 without regard for the difference in processor architecture.

However, CMU NORMA-IPC has considerable drawbacks when used on the Cenju-3 system, particularly with respect to performance and compatibility. The original CMU NORMA-IPC paid little attention to these issues, since it targeted multiple computers connected by an Ethernet LAN, that is, a slow and unreliable network. In contrast, the Cenju-3 internal network has 32 times the throughput of Ethernet and assures reliable packet delivery at the hardware level. We therefore decided to develop NORMA-IPC/DE, which is tuned to the Cenju-3 network architecture.

As for compatibility with Mach-IPC, CMU NORMA-IPC lacks some facilities, one example being port migration. Although most applications do not require the port migration facility, it is useful for OS servers. NORMA-IPC/DE therefore provides enhanced compatibility with Mach-IPC.

The implementation of NORMA-IPC/DE is as follows. In order to reduce the number of communication packets, small user messages of up to 360 bytes are transferred as a single packet, including control information, to the destination node. User messages of more than 360 bytes are transferred using a more complex protocol involving multiple messages in either direction (a sketch of this choice follows at the end of this subsection).

Mach-IPC also defines out-of-line (OOL) data transfer using a pointer to an integral number of pages. This is implemented using Cenju-3's remote DMA hardware to reduce the number of data copies: while inline data, which is included in the message's body, is copied four times on the way from the sender to the receiver, OOL data is copied only once.

Our implementation of flow control between the nodes also supports Mach-IPC's SEND_NOTIFY option. Using this option, a sender can always force a message onto a port without blocking, which is useful for implementing OS servers.
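The 360-byte cutoff amounts to a branch in the send path between an eager single-packet transfer and a multi-message exchange. Below is a minimal sketch of that choice under assumed primitives (packet_send, wait_for_ack) and an assumed header layout; it is not the actual NORMA-IPC/DE kernel code, which additionally handles flow control, typed messages and OOL transfers.

```c
/* Sketch of the NORMA-IPC/DE send-side protocol choice. The 360-byte
 * threshold is from the text; all names here are illustrative, not the
 * real kernel interface. */
#include <stddef.h>
#include <string.h>

#define EAGER_LIMIT 360          /* small messages travel as one packet */

typedef struct {
    int    dest_node;            /* destination processor */
    int    dest_port;            /* destination port identifier */
    size_t length;               /* user data length in bytes */
} msg_header_t;

/* Assumed primitives: emit one hardware packet; wait for the receiver's
 * acknowledgement packet. */
extern void packet_send(int node, const void *buf, size_t len);
extern void wait_for_ack(int node);

void norma_send(const msg_header_t *hdr, const void *data)
{
    char packet[sizeof(msg_header_t) + EAGER_LIMIT];

    if (hdr->length <= EAGER_LIMIT) {
        /* Eager path: control information and user data in one packet. */
        memcpy(packet, hdr, sizeof(*hdr));
        memcpy(packet + sizeof(*hdr), data, hdr->length);
        packet_send(hdr->dest_node, packet, sizeof(*hdr) + hdr->length);
    } else {
        /* Multi-message path: send the control data first, then the user
         * data only after the receiver acknowledges the request. */
        packet_send(hdr->dest_node, hdr, sizeof(*hdr));
        wait_for_ack(hdr->dest_node);
        packet_send(hdr->dest_node, data, hdr->length);
    }
}
```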
5.2 MPI/DE
The Message Passing Interface (MPI)[4] is a standard IPC specification designed for parallel and distributed applications. Cenju-3/DE v1.1 provides MPI/DE, which supports the full MPI specification.
When implementing MPI/DE, we assumed that most MPI programmers value performance over security in IPC. Therefore, we decided to implement MPI/DE by bypassing the microkernel and using a user-level communication mechanism with direct access to the Cenju-3 network interface hardware (NIF). The MPI/DE implementation supports two modes: single task (space-sharing) and multiple task (time-sharing).

The single task mode allows only one MPI/DE task per processor. This restriction allows the receiver task to map the NIF receive buffer directly into its address space, and permits MPI/DE to use an efficient protocol to transfer data. This mode is suitable for traditional number-crunching applications which require maximum communication performance. The implementation of the single task mode requires two data copies, one at the sender and one at the receiver. It also includes a kernel call on the sender side to ensure the integrity of the message header. This call can be removed, but doing so reduces system integrity significantly, so it is removed only for testing purposes.

The multiple task mode allows different MPI/DE tasks to share the same processor. This mode is more appropriate for some applications, for example I/O-intensive applications using the local disks attached to each PE. In this case, a task should release the Cenju-3 NIF after it has finished. In multiple task mode, a thread in the receiver task is scheduled when a message arrives for that task. This adds some overhead to IPC latency, but considerably increases total system performance. The implementation of the multiple task mode requires a kernel call on the receiver side as well as on the sender side.
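For reference, MPI latency figures of the kind reported in Section 6 are typically measured with a standard ping-pong between two ranks. The following program uses only portable MPI calls and assumes nothing Cenju-3-specific beyond an MPI installation such as MPI/DE:

```c
/* Minimal MPI ping-pong between ranks 0 and 1; this is the style of
 * benchmark behind the latency numbers quoted in Section 6. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, i;
    char buf[128] = {0};         /* a small, fixed-size message */
    MPI_Status status;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    t0 = MPI_Wtime();
    for (i = 0; i < 1000; i++) {
        if (rank == 0) {
            MPI_Send(buf, (int)sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, (int)sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     &status);
        } else if (rank == 1) {
            MPI_Recv(buf, (int)sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     &status);
            MPI_Send(buf, (int)sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    /* Each iteration is one round trip, so halve the per-iteration time. */
    if (rank == 0)
        printf("one-way latency: %.1f usec\n",
               (t1 - t0) * 1e6 / (2.0 * 1000));

    MPI_Finalize();
    return 0;
}
```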
5.3 NODO

NODO is a user-level IPC facility designed as an example to prove the effectiveness of application-specific OS facilities. The key concept of the NODO design is to provide a kernelized but safe IPC facility that users can customize at the user level. The user's IPC can be adapted without modifying and re-compiling the microkernel. It is often pointed out that parallel applications have a bottleneck in IPC. According to our experience in IPC tuning, most communication time is spent on IPC protocol processing, such as flow control, buffering and data-type encoding, not in the NIF hardware. NODO therefore has good potential to improve total IPC performance by allowing an application-specific IPC protocol.

The implementation of NODO is as follows. In order to achieve minimum IPC latency, NODO provides only the lowest-level facilities in the kernel. In the single task mode, NODO achieves 25 μsec latency for a one-word transfer with memory protection and cache control facilities. The latency becomes 12 μsec if the memory protection facility is supported by the NIF hardware. Since NIF control and network data transfer take about 10 μsec at the hardware level, this latency is almost optimal on the current Cenju-3 system.

The kernelized IPC enables various IPC protocols to be developed at user level. For example, users can choose the buffer size, flow control and encoding scheme appropriate for the target application. Users can also choose between polling and interrupts for notification of message arrival, as sketched below. This is very similar to the relationship between a monolithic OS and a microkernel: kernelization gives users more opportunities to customize OS facilities and, as a result, to improve total system performance.
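As an illustration of the customization NODO is meant to enable, the sketch below builds a polling-based receive on top of assumed kernel primitives (nodo_poll, nodo_recv); the paper does not specify NODO's actual interface, so these names are hypothetical.

```c
/* Sketch of a user-level receive protocol of the kind NODO permits. */
#include <stddef.h>

/* Assumed kernelized primitives (illustrative only; not NODO's real API). */
extern int    nodo_poll(void);                  /* nonzero once a message arrives */
extern size_t nodo_recv(void *buf, size_t max); /* copy the message into buf */

/* Application-specific choice: poll for minimum latency rather than
 * taking an interrupt, trading CPU time for notification delay. */
size_t recv_by_polling(void *buf, size_t max)
{
    while (!nodo_poll())
        ;                        /* spin; cheap on a dedicated PE */
    return nodo_recv(buf, max);
}
```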
6 Evaluation

We have measured the IPC performance of NORMA-IPC/DE and MPI/DE. As a comparison, we use the IPC performance of Cenju-3 OS, a proprietary OS developed by NEC Corporation for the Cenju-3 system. Since Cenju-3 OS supports neither multi-tasking nor virtual memory, it is reasonable to expect its IPC to achieve the highest performance possible on the Cenju-3 system. Comparing the IPC performance makes it possible to analyze the overhead of OS facilities and the differences between the IPC implementations.

6.1 Throughput

Figure 3 shows the throughput of the IPCs. The average performance is 22 Mbytes/sec for NORMA-IPC/DE, 19 Mbytes/sec for MPI/DE and 33 Mbytes/sec for Cenju-3 OS.

[Figure 3: IPC throughput (Mbyte/sec) as a function of message length (0 to 600000 bytes) for NORMA-IPC/DE, Cenju-3 OS and MPI/DE.]

[Figure 4: IPC latency (μsec) as a function of message length (0 to 500 bytes) for MPI/DE (single-task version), MPI/DE (multi-task version) and NORMA-IPC/DE.]
As for peak performance, NORMA-IPC/DE achieves higher throughput than MPI/DE at all data sizes. The difference between NORMA-IPC/DE and MPI/DE results from the kernel call overhead for page mapping incurred in MPI/DE on both the sender task and the receiver task. The difference between NORMA-IPC/DE and Cenju-3 OS results from differences in protocol, such as message acceptance insurance, and the implementation overhead these entail, such as multi-threading. It appears difficult to optimize NORMA-IPC throughput further while keeping the semantics of Mach-IPC.

The difference between MPI/DE and Cenju-3 OS comes from the redundant data copying from kernel space to user space, so there is little room to optimize the MPI/DE implementation at user level. The difference could become negligible if a user-level page mapping facility were provided, although care must be taken that a user task cannot destroy the memory spaces of other tasks.
6.2 Latency
Figure 4 shows the latency of NORMA-IPC/DE and MPI/DE for various data sizes. NORMA-IPC/DE, MPI/DE (multi-task environment) and MPI/DE (single-task environment) achieve 185 μsec, 90 μsec and 60 μsec, respectively, for null packet data. The latencies are 37 μsec for Cenju-3 OS and 25 μsec for NODO.

MPI/DE achieves lower latency than NORMA-IPC/DE for small data sizes. The difference mainly comes from the protocol overhead of NORMA-IPC/DE, e.g. handling message completion, typed arguments and so on.

The latency of MPI/DE changes sharply at about 128 bytes. This is because the sender side can transfer user data without the receiver's acknowledgement if a message's size is less than 128 bytes; communication buffers of up to 128 bytes are pre-allocated on each PE for receiving user data. Otherwise, the sender task first sends a message with control data, then sends the user data only after the receiver acknowledges the request. The same thing happens in NORMA-IPC/DE at about 360 bytes.
The difference between MPI/DE and Cenju-3 OS results from the lack of user-level operations for cache invalidation in the current Cenju-3 NIF. Cenju-3 OS can minimize these operations by exploiting a static correspondence between physical and logical memory addresses. The MPI/DE latency can be improved to 42 μsec if an appropriate NIF is provided[7].
7 Conclusion
The design and implementation of Cenju-3/DE, a parallel OS for a distributed memory architecture, have been reported. Cenju-3/DE aims to realize the advantages of application-specific OSs using a microkernel and OS server components. On a distributed memory architecture, the performance of remote OS services and parallel applications strongly depends on the performance of IPC. By means of careful tuning of the network protocol, Cenju-3/DE has achieved 185 μsec latency and 22 Mbytes/sec throughput in a multi-tasking environment (NORMA-IPC/DE), 60 μsec latency and 19 Mbytes/sec throughput in a single task environment (MPI/DE), and 25 μsec latency for user-level IPC (NODO). Although much remains for further development, we believe that Cenju-3/DE can be a practical platform for the parallel computing research field in terms of both functionality and performance.
8 Acknowledgements
The authors would like to thank Mr. Sohei Misaki of NEC Systems Laboratory, Inc., Mr. Makoto Tsukagoshi of the NEC 1st Computer Operations Unit, Dr. Masahiro Yamamoto and Dr. Ryosei Nakazaki of NEC C&C Laboratories, and Dr. Nobuhiko Koike of NEC Europe Ltd. for their warm support and encouragement of the Cenju-3/DE project. Thanks also to all those involved in the Cenju-3 project at the University of Houston, NEC Systems Laboratory, NEC Research Institute, NEC Informatec Systems and NEC C&C Laboratories for their valuable discussions and comments.
References
[1] David B. Golub, Randall Dean, Alessandro Forin and Richard Rashid. UNIX as an Application Program. USENIX Summer Symposium, June 1990.

[2] Joseph S. Barrera III. A Fast Mach Network IPC Implementation. USENIX Mach Symposium, pp. 1-18, Nov. 1991.

[3] Michael Bushnell. GNU Hurd. In Tokyo GNU Technical Seminar Proceedings, Dec. 1994.

[4] The MPI Forum. MPI: A Message Passing Interface. Proc. of Supercomputing '93, Nov. 1993.

[5] Nobuhiko Koike. NEC Cenju-3: A Microprocessor-Based Parallel Computer. Proc. of the 8th International Parallel Processing Symposium, pp. 396-401, Apr. 1994.

[6] Paulo Guedes and Daniel Julin. Object-Oriented Interfaces in the Mach 3.0 Multi-Server System. Proc. of the IEEE 2nd Workshop on Object Orientation in Operating Systems, Oct. 1991.

[7] Yasushi Kanoh, Koichi Konishi et al. User Level Communication on Cenju-3. Hot Interconnects III Symposium, Aug. 1995.