Network interface sharing with an experimental gang scheduler on Cenju-3/DE

Hiroyuki Araki, Kosuke Tatsukawa, Akihiko Konagaya
NEC C&C Research Laboratories

Abstract

We have implemented an experimental gang scheduler on Cenju-3/DE[1] in order to investigate efficient scheduling schemes for multiple-user environments with high-performance communication on parallel machines. The Cenju-3/DE system provides user-level inter-processor communication mechanisms for a single user. This realizes high-performance communication, but the mechanism cannot be shared among multiple users. In order to realize a multiple-user environment, we have implemented a network interface sharing mechanism and examined its behavior under gang scheduling. In this paper, we report our early evaluation of the gang scheduler and highlight some issues we faced.

1 Introduction

Recently, high-speed communication libraries have become popular on parallel computers such as Cenju-3. Such libraries rely on a user-level inter-processor communication mechanism, in which the user program directly accesses the communication hardware to send messages and uses polling to detect incoming messages, which eliminates system call overhead from communication operations. However, in order to guarantee exclusive access to the communication hardware, such systems prohibit multiple processes from using the high-speed communication libraries simultaneously.

As parallel computers have become popular, the demand for sharing a parallel machine among multiple users has increased. There are two ways to support a multiple-user environment on a parallel machine: space sharing and time sharing. Many parallel computers support space sharing, in which user tasks can always make exclusive use of the CPU and use polling techniques for receiving messages. Time sharing can be implemented using gang scheduling[2]. Gang scheduling guarantees that user tasks which belong to one application are scheduled simultaneously.

Previous work[4][5] on gang scheduling has reported a large scheduling overhead to guarantee exclusive use of the communication hardware by at most one task. In order to share the communication mechanism, the gang scheduler has to wait for network preemption during a context switch. Therefore, we have been investigating a communication mechanism that can be accessed simultaneously by multiple users without network preemption. In order to reduce the large overhead, we introduce a server task which provides a network interface sharing mechanism. This server task aims to take a middle position between user-level communication and system-provided communication such as sockets. The server provides the network interface sharing mechanism through a thin interface layer to achieve lower latency than system-provided communication.

In this paper, we present an evaluation of the prototype network interface sharing server with gang scheduling. In section 2 we explain gang scheduling and the sources of scheduling overhead. Section 3 describes our network interface sharing server and gang scheduler implementation. Section 4 shows the evaluation results. We compare our system with previous work in section 5, and conclude in section 6.

2 Gang scheduling

Gang scheduling was originally introduced by Ousterhout in the context of the Medusa system on Cm*[2]. He treated the processes belonging to one application as a "gang". All processes in the same gang are scheduled in the same time slice on each node. In addition, the gang scheduler enables several gangs to share disjoint sets of processors during each time slice in a space-sharing fashion; that is, gangs can be assigned by a space-sharing scheme within each time slice. In this sense, the gang scheduler achieves both time and space sharing at the same time, and it is a suitable scheduling scheme for parallel computers.

Existing implementations[4][5] of gang scheduling on parallel computers have a large scheduling overhead which is not essential to gang scheduling, but results from the design of the communication hardware in existing systems. They assume a single receive buffer. This buffer must be shared among all tasks executing on the node. Messages sent from user-level code are all received into this buffer. In previous systems, in order to distinguish between messages sent before and after the scheduling event, the scheduler waits until the receiving tasks have received all outstanding messages sent by the previous tasks. This incurs a large scheduling overhead on global scheduling, since the high-speed networks in parallel computers are often accompanied by large message queues.

3 Implementation

In order to evaluate the scheduling overhead caused by network preemption, we have built an experimental gang scheduler on Cenju-3. Since the Cenju-3 network hardware provides a single receive buffer for user communication, we use a Mach server task to control access to the receive buffer. Our implementation consists of a gang scheduler and NIF servers. The gang scheduler provides a combination of space-sharing and time-sharing mechanisms. The NIF server supports network interface sharing for multiple user tasks while exclusively accessing the communication hardware in order to utilize user-level DMA communication.

3.1 NIF server

The NIF server allows multiple users to simultaneously access Cenju-3/DE's NIF0 device, a device for user-level communication. Each processor runs its own NIF server, and the NIF server exclusively accesses the NIF0 device from user level. In order to realize user-level communication, we use a remote DMA mechanism provided by Cenju-3's NIF hardware. The communication latency between two NIF servers is better than our ordinary user-level communication. Usually, in order to use user-level communication, tasks on Cenju-3/DE are required to make a kernel call that fixes the destination address field of the remote DMA header, which provides memory protection, because Cenju-3's NIF provides no memory protection mechanism. In this implementation, we can optimize the user-level communication by eliminating the protection overhead, since the NIF server does not violate memory protection.

A user process sends a message as follows. First, a message to the receiver process is expanded into a remote procedure call to the NIF server on the same processor using the Mach Interface Generator (MIG). Then the NIF server forwards the message to the target NIF server running on the destination processor. The receiver process receives messages from the NIF server via a remote procedure call, also generated by MIG.

3.2 Gang scheduling thread

We have implemented our gang scheduler by adding a gang scheduling thread to the "execution server", which controls task creation and termination on Cenju-3/DE. Only the execution server knows the application information which maps applications to user tasks. We adopted the matrix algorithm of gang scheduling[2][3], in which each column corresponds to a physical processor and each row represents a "virtual Cenju-3". Each virtual Cenju-3 acts like a Cenju-3 system on which user applications can be executed in each time slice. As a result, the Cenju-3 system can handle multiple applications concurrently. We call a row a "slot". Gangs can be packed into a slot, where possible, by a space-sharing scheme. Figure 1 illustrates the matrix algorithm.

Figure 1: The matrix algorithm (columns: physical processors; rows: slots, i.e. virtual parallel machines; a new session (gang) to be launched is assigned to free processors)

The gang scheduling thread in the execution server switches the context of slots at regular intervals called the "time quantum". It suspends all processes that belong to the current slot, and invokes the processes in the next scheduled slot. The suspension and invocation are performed by NORMA IPC issued from the execution server to a local scheduler. The scheduling policy is a simple round-robin scheme over eight slots; that is, eight virtual Cenju-3s are prepared for our experimental gang scheduler.
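The matrix algorithm and the round-robin slot switch described in this section can be illustrated with a small sketch. It is a toy model under stated assumptions, not the execution server's code: the class `GangMatrix` and its methods are hypothetical names, the first-fit placement policy and the four-processor machine are assumptions (the text only says that gangs are packed into a slot where possible), and suspension and invocation via NORMA IPC are reduced to returning lists of gang names.

```python
from typing import List, Optional, Tuple


class GangMatrix:
    """Toy scheduling matrix: rows are slots, columns are physical processors."""

    def __init__(self, num_slots: int = 8, num_processors: int = 4) -> None:
        # None marks a free processor in a slot.
        self.cells: List[List[Optional[str]]] = [
            [None] * num_processors for _ in range(num_slots)
        ]
        self.current = 0  # index of the slot that is running now

    def assign(self, gang: str, size: int) -> Optional[int]:
        """Place a gang in the first slot with enough free processors.

        First fit is an assumption made for this sketch.
        Returns the slot index, or None if no slot has room.
        """
        for slot, row in enumerate(self.cells):
            free = [p for p, owner in enumerate(row) if owner is None]
            if len(free) >= size:
                for p in free[:size]:
                    row[p] = gang
                return slot
        return None

    def gangs_in(self, slot: int) -> List[str]:
        """Names of the gangs packed into one slot, in sorted order."""
        return sorted({g for g in self.cells[slot] if g is not None})

    def tick(self) -> Tuple[List[str], List[str]]:
        """One time quantum expires: simple round-robin to the next slot.

        Returns (gangs to suspend, gangs to invoke).
        """
        suspended = self.gangs_in(self.current)
        self.current = (self.current + 1) % len(self.cells)
        return suspended, self.gangs_in(self.current)
```

In this model, two gangs that together fit on the machine share one slot (space sharing), while a gang that does not fit is placed in another slot and runs in a different time quantum (time sharing).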

4 Preliminary Evaluation

In order to evaluate the effect of overlapping inter-processor communication with context switches, we measured the execution time of two benchmark programs while varying the time quantum. Each program is executed under two different situations. One situation allows overlap between inter-processor communication and the context switch. In this case, the gang scheduler does not wait for the network to be preempted, because the NIF server handles the outstanding messages. The other situation does not allow overlap between inter-processor communication and the context switch. In this case, before the context switch the gang scheduler sends a request to each NIF server to flush all the messages in its request queue, in order to guarantee preemption of the network, and waits for the reply to the request.

Under each situation, we execute the following two programs. One is an ordinary "ping-pong" program, in which two processes repeatedly send a message and receive it via the intervening NIF servers. The other is a burst transfer program. It also consists of two processes: one process sends many messages to fill the network path, while the other process receives them.

The context switch overhead for each case is shown in Figures 2 and 3. These graphs show the slowdown ratio at various time quanta.

Figure 2: Results of Experiments(ping-pong case)

In these graphs, the slowdown ratio is calculated as

$$ r(t_{quantum}) = \frac{t_{elapsed} - t_{process}}{t_{process}} \times 100\ (\%) $$

where $t_{quantum}$ is the time quantum, $t_{elapsed}$ is the elapsed time of each benchmark at $t_{quantum}$, including context switch overhead, and $t_{process}$ is the elapsed time of each benchmark using the NIF server but without gang scheduling. The figures show that the performance of the overlapping cases is better than that of the non-overlapping cases, especially for the burst transfer program. Under the burst transfer program, the overlapping case is 13.6% faster than the other case at 100ms.
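As a worked illustration of the slowdown ratio defined above, the following sketch evaluates the formula directly. The function name `slowdown_ratio` is hypothetical, and the timings in the usage note are made-up values chosen only to show the arithmetic; they are not measurements from the experiments.

```python
def slowdown_ratio(t_elapsed: float, t_process: float) -> float:
    """Slowdown ratio in percent: (t_elapsed - t_process) / t_process * 100.

    t_elapsed: elapsed time under gang scheduling at some time quantum.
    t_process: elapsed time with the NIF server but without gang scheduling.
    """
    if t_process <= 0:
        raise ValueError("t_process must be positive")
    return (t_elapsed - t_process) / t_process * 100.0
```

For example, with hypothetical timings of 8.5 s under gang scheduling and 8.0 s without it, `slowdown_ratio(8.5, 8.0)` gives 6.25, that is, a slowdown of 6.25%.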

Figure 3: Results of Experiments(burst transfer case)
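The difference between the two situations compared in this section can be pictured with a toy model. Everything in it is an assumption made for illustration: the names `NifServer`, `hw_receive`, `drain`, `receive` and `context_switch` are hypothetical, messages are assumed to carry the name of their destination task, and the real server's RPC interface and remote DMA are not modeled. The sketch only shows why sorting a single shared receive buffer into per-task queues lets a context switch proceed without waiting.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Optional, Tuple


class NifServer:
    """Toy per-node server that owns the single shared receive buffer."""

    def __init__(self) -> None:
        # Messages as they arrive from the network, for any task on this node.
        self.hw_buffer: Deque[Tuple[str, bytes]] = deque()
        # Per-task queues filled by the server (hypothetical structure).
        self.inboxes: Dict[str, Deque[bytes]] = defaultdict(deque)

    def hw_receive(self, dest_task: str, payload: bytes) -> None:
        """Model a message landing in the shared receive buffer."""
        self.hw_buffer.append((dest_task, payload))

    def drain(self) -> int:
        """Move every buffered message to its task's queue; return the count."""
        moved = 0
        while self.hw_buffer:
            task, payload = self.hw_buffer.popleft()
            self.inboxes[task].append(payload)
            moved += 1
        return moved

    def receive(self, task: str) -> Optional[bytes]:
        """Called on behalf of a running task; returns None if nothing waits."""
        self.drain()
        inbox = self.inboxes[task]
        return inbox.popleft() if inbox else None


def context_switch(server: NifServer, overlap: bool) -> int:
    """Return how many messages the scheduler had to wait for.

    overlap=True: switch immediately; the server sorts messages out later.
    overlap=False: flush the buffer first, as in the non-overlapping case.
    """
    if overlap:
        return 0
    return server.drain()
```

In the overlapping case the messages of a suspended task simply stay queued until that task runs again, so the cost of a context switch in this model does not depend on how many messages are outstanding.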


Figure 3 shows that under the burst transfer program the slowdown ratio of the overlapping case is less than 5% even when the time quantum is 100ms. For interactive use, an operating system uses a time slice quantum of about 100ms[6]. Gang scheduling usually requires a longer time quantum (around 1s[4][5]) to achieve good performance. However, our evaluation results show that the context switch overhead is often caused by network preemption. By overlapping the context switch with inter-processor communication, performance does not decrease even if the time quantum is 100ms.

Although the NIF server can decrease the context switch overhead caused by network preemption, the communication delay of the current implementation is larger than that of user-level communication on Cenju-3/DE[1]. It takes about 200µs to send a short message. The delay is caused by the large overhead of the RPC interface generated by the Mach Interface Generator (MIG), which takes about 80µs at both the sender and the receiver side. We used MIG for communication between user tasks and the NIF server in order to implement the prototype server easily. There are several ways to improve communication performance. For example, using in-line messages of Mach IPC instead of the out-of-line messages generated by MIG can decrease the delay, because out-of-line messages have a larger communication overhead than in-line messages. Alternatively, performance can be improved by using a shared memory mechanism between the NIF server and user tasks.

5 Related Work

F. Wang shows a mathematical model for gang scheduling and confirms the effect of scheduling policy[4]. A. Hori introduced the idea of "network preemption" to implement an efficient and practical gang scheduler for their Myrinet cluster system[5], and concluded that it guarantees the maximum time to reach the steady state in a large network. They use the PM communication library[5], which works on the Myrinet cluster system and is targeted at multitasking environments, instead of standard libraries like MPI, which we used for evaluation.

6 Conclusion

There is an increasing requirement to share parallel computers. The gang scheduling scheme provides both time and space sharing on parallel computers, but it incurs a large scheduling overhead that is caused by network preemption to guarantee exclusive use of the communication hardware for user-level communication. In order to reduce this overhead, we introduced a server task which provides a network interface sharing mechanism. We implemented the prototype NIF server with gang scheduling. The current implementation of gang scheduling relies on the NIF server, whose network interface sharing mechanism makes it possible to overlap communication with context switches. The communication delay is still large (200µs). This is not essential to the NIF server, but results from the fat RPC interface generated by the Mach Interface Generator (MIG). We plan to overcome this problem by changing the RPC interface.

References

[1] A. Konagaya, K. Tatsukawa, C. Howson, T. Sugawara, H. Araki and K. Konishi, "Cenju-3/DE: Open Platform for Parallel Processing Research", Cenju-3 workshop in HPC Asia, 1997.
[2] John K. Ousterhout, "Scheduling Techniques for Concurrent Systems", 3rd Intl. Conf. Distributed Computing Systems, Oct. 1982, pp.22-30.
[3] Dror G. Feitelson and Larry Rudolph, "Gang Scheduling Performance Benefits for Fine-Grain Synchronization", Journal of Parallel and Distributed Computing 16, 1992, pp.306-318.
[4] F. Wang, H. Franke, M. Papefthymiou, P. Pattnaik, L. Rudolph, M. S. Squillante, "A Gang Scheduling Design for Multiprogrammed Parallel Computing Environments", IPPS'96 Workshop, April 16, 1996, pp.67-75.
[5] A. Hori, H. Tezuka, Y. Ishikawa, N. Soba, H. Konaka, M. Maeda, "Implementation of Gang-Scheduling on Workstation Cluster", IPPS'96 Workshop, April 16, 1996, pp.76-83.
[6] Marshall Kirk McKusick, Keith Bostic, Michael J. Karels, John S. Quarterman, "The Design and Implementation of the 4.4BSD Operating System", Addison Wesley (1996), p.93.
