Design and Implementation of Real-Time User-Level Thread

Shuichi Oikawa†    Hideyuki Tokuda‡    Tatsuo Nakajima

† Department of Mathematics, Keio University
  Faculty of Environmental Information, Keio University
‡ School of Computer Science, Carnegie Mellon University
For performance reasons, user-level thread packages have been proposed for multi-threaded environments. Threads in these packages, however, are not first class citizens in their operating systems, because the packages often lack adequate interaction with their kernels. First class user-level threads are threads with appropriate kernel support: user-level schedulers and the kernel communicate with each other and avoid wrong scheduling decisions. First class user-level threads suit the needs of real-time operating systems because both low latency and accurate scheduling are required for real-time systems. In this paper, we describe our design and implementation of real-time user-level threads in the ARTS operating system and demonstrate the effectiveness of the current implementation.

1 Introduction

Many user-level thread packages have been proposed to achieve high performance in multi-threaded environments. While threads can be implemented at both the kernel and the user level, user-level threads can achieve better performance than kernel-provided threads. This is because kernel-provided threads suffer from the overhead not only of switching between the kernel and user mode, but also of facilities required by some but not all applications. For instance, supporting multiple scheduling policies is not required by all applications.

Operating system kernels, however, often lack first class treatment of user-level threads. Consider a user-level thread package implemented on a conventional operating system which supports only one process in one address space. Because the kernel does not recognize the states of user-level threads, it cannot properly choose the address space containing the thread which should run next. When a user-level thread issues a system call and is blocked in the kernel, the kernel may choose another address space even if there is a runnable thread in the same address space.

Recent research has explored mechanisms in which the user-level thread library and the kernel cooperate with each other to avoid the above problem [2, 3, 4]. Since this research is motivated by the demands of parallel computing, the mechanisms have problems, such as very poor preemptability and predictability, when used for real-time applications. Real-time operating systems must be inherently high performance because they need to satisfy the time constraints of real-time applications. Thus, first class user-level threads match the needs of real-time operating systems.

Our goal is to provide high performance real-time user-level threads which satisfy the requirements of real-time systems, such as high preemptability and predictability. To achieve this goal, our user-level threads must be treated as first class threads. The next section proposes a software architecture and mechanisms for real-time user-level threads. The rest of the paper describes our implementation and evaluation of real-time user-level threads on the ARTS operating system [1].

2 Real-Time User-Level Thread

In this section, we first give our architectural model. Second, several structures of user-level schedulers and our approach are proposed. Then, we discuss mechanisms to realize user-level schedulers.

2.1 Architectural Model

Because real-time user-level threads run and are switched in a user context, a user-level scheduler (ULS) resides in each context. A ULS schedules all user-level threads in its address space. Figure 1 shows the overall architecture of our operating system with ULSs. A number of user-level threads can run in a single address space and share the resources of the address space, such as code and data. The multi-threaded model requires fewer resources than the single-threaded model to implement a highly preemptable server. Thus, the multi-threaded model suits real-time applications better than the single-threaded model.

Both kernel and user stacks are allocated for each user-level thread to achieve higher preemptability. A kernel stack is used when a thread enters the kernel. Without a kernel stack per user-level thread, a user-level thread can be preempted only at some specific points. This can cause a serious problem in a real-time operating system.

When an event which affects thread scheduling occurs, the kernel signals a ULS. For example, when a thread is blocked or unblocked inside the kernel, the event is passed to the ULS. Then, the ULS should be invoked. These events happen because of an interrupt or a system call. After the specific task of the interrupt or the system call, execution does not return to the currently running thread. Instead, the ULS is called. It then chooses the next thread from the runnable threads in the address space and dispatches it. If the chosen thread was blocked in the kernel, the ULS needs to call a system primitive to resume the thread. If there is no runnable thread in the address space, the ULS will return execution to the kernel, or pass it to another address space. This depends on the structure of the ULS. Section 2.2 shows how ULSs are structured.

[Figure 1: Operating System Architecture with ULS]

2.2 Structures of User-Level Schedulers

There are three schemes to construct a ULS, as follows:

US   A ULS concerns only threads in a single address space. An address space which does not have a runnable thread will not be preempted, because the kernel does not have enough information about user-level thread scheduling.

US/KM   A ULS concerns only threads in a single address space, but provides the kernel with the necessary information about the threads in its address space. Thus, the kernel can choose the next address space which should run.

UM   A ULS in each address space shares information about its threads with the other ULSs. Thus, the ULSs can choose the next thread in a different address space. A ULS often needs the help of the kernel to switch address spaces.

To provide maximum flexibility and performance, the UM approach is desirable. As our first step, however, the US approach was chosen to make a prototype of a ULS and to gain experience with real-time user-level threads.

2.3 Mechanisms

In this section, we describe the mechanisms which are used to realize real-time user-level threads.

Kernel Stack Association

In our approach, a kernel stack is allocated for each user-level thread when it is created. While a kernel-provided thread always uses the same kernel stack, the association of our user-level thread and its kernel stack is not fixed. A ULS switches user-level threads without notifying the kernel. The processor, however, keeps the kernel stack pointer of the previously scheduled thread. Therefore, when a thread enters the kernel, it may use the kernel stack of another thread.

We create an idle user-level thread for each ULS. This thread has both kernel and user stacks, like other user-level threads. The kernel stack of the idle user-level thread is used for a user-level thread when it enters the kernel. Then, when that thread is blocked inside the kernel, the kernel swaps its kernel stack with the idle thread's kernel stack which is currently in use.

Notifying Events from the Kernel to a ULS

As a mechanism for the kernel to communicate with a ULS, a space shared by a ULS and the kernel is used. The shared space is statically allocated by a ULS. It contains the address of the entry point to the ULS, pointers to the descriptors of the running user-level thread and the idle user-level thread, the buffer into which the kernel writes events, and the number of events. The ULS tells the kernel the address of the shared space at initialization.

We chose to use a shared region, though passing an event as an argument is possible. There are two advantages to using a shared region. One is that the number of upcalls can be reduced. Another is that the kernel can pass multiple events in one upcall. This is necessary for us to avoid a priority inversion problem.

Notifying Events from a ULS to the Kernel

The ULS needs to ask the kernel to run a thread which was blocked in the kernel. Because the thread needs to resume inside the kernel, the ULS cannot switch the context to it directly. Thus, a system call to resume a user-level thread which is blocked inside the kernel is implemented. Even when multiple address spaces are considered, the system call interface is enough to notify events to the kernel. Suppose there are several address spaces which execute user-level threads. If there is no runnable user-level thread in an address space, the ULS should tell the kernel that the address space yields the processor. These events, however, cannot happen simultaneously. Therefore, the system call interface suffices for our requirements.

3 Implementation

We have been implementing the real-time user-level threads described in Section 2 by modifying the ARTS operating system.

3.1 ULS/Kernel Interface

The interface to notify kernel events is a pair of an event type and its argument, as follows.

    typedef struct ULSKEV {
        u_long kev_type;    /* type of event */
        u_long kev_arg;     /* event argument */
    } ULSKEV;

    #define KEV_BLOCK   0x00000001   /* thread blocked */
    #define KEV_UNBLOCK 0x00000002   /* thread unblocked */

There are two types of events, KEV_BLOCK and KEV_UNBLOCK, which specify a blocked thread and an unblocked thread respectively. The thread which is blocked or unblocked is passed in kev_arg as the argument of the event. KEV_UNBLOCK is also used to specify a thread which was interrupted in order to upcall a ULS. Since such a thread is still runnable, the ULS can treat it in the same way as an unblocked thread.

A system primitive, ULS_Resume, is implemented to restart a thread which was blocked inside the kernel. A ULS chooses the next thread and calls this primitive with the thread as an argument if it was blocked inside the kernel.

3.2 Context Switch

Context switching of user-level threads is implemented both in the kernel and in a user space. There are three cases in which a context switch occurs. A context switch can occur in the user or kernel address space when blocking functions are called explicitly. Another case is caused by an interrupt. For example, when a timer interrupts a user-level thread, the thread can be preempted.

In the first case, all context switching of user-level threads is done in a user space. The old thread context is saved into the user-level thread descriptor, which is allocated by a ULS in the user space. Then, the next thread context is restored from its descriptor.

In the second case, in which a user-level thread is blocked in the kernel, the kernel saves the context in the user-level thread descriptor. The kernel switches its context to the idle user-level thread to upcall the ULS. This new context is created just before the upcall. The ULS_Resume primitive is used to restore the context of a blocked user-level thread. The ULS calls this primitive in the context of the idle user-level thread. The kernel restores the context from the user-level thread descriptor. The context of the idle user-level thread is simply thrown away.

In the last case, in which a thread is interrupted, the context of the thread is usually saved on the kernel stack. If the context were left on the kernel stack when a ULS is upcalled, the ULS could not retrieve the context to run the preempted thread without kernel help. This would make the performance worse. In our implementation, when an event happens and a ULS should be called, the kernel restores the context of the interrupted thread, which is saved on the kernel stack, into the user-level thread descriptor and then upcalls the ULS. Therefore, the ULS is able to restart the interrupted thread by itself.

4 Performance

In this section, we show the context switch cost of our implementation of real-time user-level threads. First, we describe the basic cost factors of context switching for three cases: user-level threads in a user space, user-level threads through the kernel, and kernel-provided threads. Then, we evaluate the performance of our user-level threads.

Figure 2 shows the context switch of kernel-provided threads. Csched_call is the delay time to switch to the scheduler. Cchoose is the time to choose the next thread. Cctx_switch is the actual context switch time for saving and restoring registers. Center and Cleave are the times to enter and to leave the kernel respectively. The total cost of the context switching of kernel-provided threads is the sum of these costs.

[Figure 2: Kernel Threads Context Switch]

Figure 3 shows the context switch of user-level threads. The cost of the context switching of user-level threads is the sum of Cu_choose and Cu_ctx_switch. Cu_choose is the time to choose the next user-level thread. Because switching user-level threads in a user space does not require entering the kernel, and the procedure to choose the next thread is called directly, the costs of entering and leaving the kernel and of switching to the scheduler are eliminated.

[Figure 3: User Threads Context Switch in User Space]

Figure 4 shows the context switch of user-level threads where a thread is blocked inside the kernel. Culs_call is the time to set up the upcall to the ULS. Cusched_call is the time to switch to the scheduler; this includes the time to process the arguments passed to the ULS. The total cost of the context switching of user-level threads through the kernel is the sum of Culs_call, Cusched_call, Cu_choose, Cu_ctx_switch, and twice Center and Cleave.

[Figure 4: User Threads Context Switch via Kernel]

Table 1: Context Switch Performance

    user thread (user space)    5.8 μsec
    kernel thread                39 μsec
    user thread (via kernel)     42 μsec

Table 2: Basic Operations Performance

    null function call          0.8 μsec
    null system call             11 μsec

Table 1 shows the context switch performance of our threads. These were measured on a Gateway 2000 486/33C, an Intel 80486-based IBM-PC compatible computer with a 33MHz clock. Table 2 shows the costs of basic operations.

From the cost of a null system call, the sum of Center and Cleave is 11 μsec. Although it needs to enter and leave the kernel twice, the context switch time of user-level threads through the kernel is only slightly worse than that of kernel-provided threads. This may be because of the overhead of the kernel scheduler in the ARTS operating system: it provides policy/mechanism separation and multiple scheduling policies, while the user-level scheduler provides only one scheduling policy for each thread package. From the cost of a user-level context switch in a user space and a null system call, we can estimate that the sum of Culs_call and Cusched_call is 14 μsec. This rather large number may come from the cost of the operations in Culs_call: it includes the time to check stack boundaries to determine which stack is currently used, to swap kernel stacks when necessary, and to set up a kernel stack for the upcall.

5 Related Work

There are several systems which provide first-class user-level threads. Scheduler Activations [2] was proposed at the University of Washington. The Psyche operating system also provides first-class user-level threads [3]. Both of them are implemented on parallel computers to exploit the parallelism of the underlying hardware. Our approach can provide both high performance and high preemptability for real-time applications on a uniprocessor.

DASH [4] provides user-level threads with first-class treatment by splitting the scheduler into user and kernel levels. It is implemented on a uniprocessor, and shared memory is used extensively to pass information between an address space and the kernel. DASH proposes a new mechanism for asynchronous communication to avoid threads being blocked in the kernel. Our approach does not require special attention to asynchronous communication, yet achieves high preemptability.

6 Summary

Our real-time user-level threads can achieve higher performance than kernel-provided threads by reducing the overhead of context switching. Also, first class treatment of user-level threads can be retained by appropriate interactions between the kernel and a ULS. The performance measurements show that the difference between our user-level threads and kernel-provided threads is not significant even when user-level threads require kernel intervention.

As our future work, it is important to support multiple address spaces. Because of hardware limitations, it is difficult for a ULS to perform a context switch across address spaces directly. Address spaces will be switched through the kernel, though a ULS can choose the next address space. More detailed timing analysis is necessary to use our user-level threads for real-time applications. In particular, it is required to analyze how much preemptability and predictability are improved. Also, if we can determine the overhead of the various cases of context switching, we may be able to develop a better mechanism or implementation for user-level threads.

References

[1] H. Tokuda and C.W. Mercer. ARTS: A Distributed Real-Time Kernel. ACM Operating Systems Review, 23(3), July 1989.

[2] T.E. Anderson, B.N. Bershad, E.D. Lazowska, and H.M. Levy. Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism. In Proceedings of the 13th Symposium on Operating Systems Principles, October 1991.

[3] B.D. Marsh, M.L. Scott, T.J. LeBlanc, and E.P. Markatos. First-Class User-Level Threads. In Proceedings of the 13th Symposium on Operating Systems Principles, October 1991.

[4] R. Govindan and D.P. Anderson. Scheduling and IPC Mechanisms for Continuous Media. In Proceedings of the 13th Symposium on Operating Systems Principles, October 1991.
