Achieving Multiprogramming Scalability on Intel SMP Platforms: Nanothreading in the Linux Kernel

Dimitrios S. Nikolopoulos, Eleftherios D. Polychronopoulos, Theodore S. Papatheodorou
Christos D. Antonopoulos, Ioannis E. Venetis, Panagiotis E. Hadjidoukas

High Performance Information Systems Laboratory
Department of Computer Engineering and Informatics
University of Patras
Rio 26500, Patras, Greece
e-mail: {dsn,edp,tsp}@hpclab.ceid.upatras.gr, {antonop,venetis,xdoukas}@ceid.upatras.gr
Abstract

This paper presents the architecture and implementation of a nanothreading interface in the kernel of the Linux operating system for Intel Pentium-based symmetric multiprocessors. The nanothreading interface aims at achieving scalability of parallel programs on multiprogrammed shared-memory multiprocessors, where multiple parallel and sequential programs with diverse characteristics and resource requirements execute simultaneously. The main idea of the nanothreading interface is to let parallel programs and the kernel exchange critical scheduling information through shared memory with minimal overhead, so that parallel programs can adapt to dynamically changing resources and all programs running in the system minimize their idle time and always make progress along their critical path. We evaluate both the overhead of the low-level nanothreading mechanisms and the efficiency of the nanothreading interface in terms of system throughput, using multiprogrammed workloads with parallel benchmarks. Our results substantiate the efficiency of our implementation and demonstrate that the nanothreading kernel provides solid improvements over the native Linux SMP kernel.
1 Introduction
Small-scale symmetric multiprocessors (SMPs), based on commodity microprocessors and a shared-bus architecture, are widely adopted as high-performance and cost-effective servers for the enterprise and the desktop, as well as building blocks for large-scale, scalable multiprocessors with hardware- or software-based shared memory. Two- and four-way SMPs with Pentium processors are marketed at competitive prices to service a broad spectrum of applications, ranging from scientific computing to desktop applications, database servers, Java and networking [8]. Modern proprietary and freeware operating systems are SMP-compliant. Symmetric multiprocessing at kernel level implies the ability to run operating system code on any of the system processors. The additional functionality required comprises memory management support for multithreading the same address space, scheduling support for load distribution across multiple processors, and the low-level mechanisms needed for multiprocessing and memory consistency, e.g. interprocessor interrupts and TLB coherence. Multiprocessing is exported to the user level primarily through a threads API [15]. This interface is used by application programs to create kernel-level execution vehicles that share a common address space. Task decomposition, orchestration and synchronization are then realized by mapping the user computation to kernel-threads, which are in turn mapped to physical processors. Instead of direct access to kernel threads, a user-level runtime layer is typically used as an intermediary between parallel programs and the kernel. Windows NT, Solaris
x86, Linux and Mach, among others, are operating systems that provide adequate multithreading support for Intel Pentium-based SMPs. The integration of multiprocessing with multiprogramming has been a hot spot in high performance computing research for quite some time [11]. As modern SMPs become heavily multiprogrammed, the need for scalability of parallel programs under multiprogramming is intensified. The burden of integrating parallel programs in multiprogrammed environments is shared between the operating system and the runtime system layers. Unfortunately, most runtime systems for parallel programming are oblivious to multiprogramming and most operating system kernels are oblivious to the fine-grain interactions between threads in parallel programs. This lack of coordination between runtime systems and operating systems has proven harmful for both application and system performance on multiprogrammed SMPs.

This paper addresses the problem of integrating parallel programs in multiprogrammed environments at both the runtime system and operating system layers. Our generic approach is derived from previous research on multiprogrammed multiprocessors [2, 12, 16]. The fundamental idea is to decouple the user-level and kernel-level schedulers and provide a communication path between the two schedulers, so that parallel programs can adapt automatically to changes in the system resources allocated to them by the operating system at runtime. The user-level scheduler maps computational tasks to execution vehicles (EVs), which serve as virtual processors and run on top of kernel-threads, while the kernel-level scheduler maps kernel-threads to physical processors. The communication path is bidirectional and lets the operating system communicate kernel-side scheduling interventions to the user-level scheduler, while the program communicates its processor requirements to the kernel. Multiprogramming scalability is attained when each executing parallel program reaches an equilibrium, where the user-level threads of the program are mapped one-to-one to the program's EVs, which are in turn mapped one-to-one to physical processors.

Our approach differs from previous work in several aspects. Unlike previous implementations that relied on coarse-grain mechanisms like signals and upcalls, we implement the communication path between the runtime system and the kernel using just loads and stores in shared memory. The data structure used as the communication medium, also known as the shared arena [6], is mapped shared between the kernel and the runtime system and is used by the kernel to communicate the state of a program's kernel-threads to the runtime system. The user-level scheduler polls the status of kernel-threads at idling points and applies priority scheduling of the owned EVs to resume preempted kernel-threads and greedily assist the program in making progress along its critical path. Parallel programs react promptly to kernel interventions in processor allocation, as well as to events with local scope, like page faults and I/O. The intra-program EV control is implemented through a hand-off mechanism and processor yielding supported by the kernel. In this way, the kernel gracefully grants parallel programs the authorization to control and use their EVs in the most effective way, while it maintains the responsibility of distributing system resources fairly. Put simply, the kernel acts more as an advisor than as an explicit coordinator of the concurrency of parallel programs.
In previous work [13], we presented the semantics of the proposed kernel-user communication path for multiprogramming scalability, as well as kernel-side processor allocation policies that exploit this interface in the context of the Nanothreads programming model [12]. We evaluated our proposals using user-level emulation of a kernel CPU allocator on the SGI Origin 2000 and demonstrated the benefits of our strategy compared to the IRIX operating system scheduler.

This paper moves one step further and presents the implementation of the kernel-side support for nanothreading along with associated runtime system extensions in the Linux operating system kernel for Intel-based SMPs. We present the low-level implementation details of a nanothreading kernel, including the shared memory interface, mechanisms for processor allocation and affinity scheduling of nanothreaded jobs at kernel level. We also present the integration of space sharing strategies with a standard UNIX time-share scheduler [13]. The nanothreading kernel exports a minimal set of services to the user level, which are exploited by the runtime system to implement dynamic process control within each parallel program. The process control mechanisms are coupled with a dynamic user-level scheduler [14] and non-blocking synchronization [10]. The current implementation is integrated with a user-level threads library which uses compiler knowledge to control thread granularity and match the number of running user-level threads of a parallel program to the number of EVs granted to the program by the kernel [3, 4]. However, the kernel-side support and the exported kernel interface can be directly used in general-purpose multithreading runtime systems with diverse characteristics, even outside the context of parallel programming models. The main effort needed is to customize the control of kernel-threads and tune scheduling and synchronization at the runtime system
layer, to meet the desired performance goal. Although this effort could well be substantial, it can be isolated within the runtime system, since the exported kernel services provide a clean interface. We are currently investigating the applicability of the nanothreading interface in the Java Virtual Machine and HTTP server libraries. The implementation was carried out on a dual Pentium Pro system, using the Linux 2.0.36 kernel. The native Linux SMP kernel support for scheduling and memory management has proven to be rather immature [24] for implementing nanothreading efficiently. However, our experimental results with multiprogrammed workloads of parallel applications demonstrate clearly the benefits of our approach. General optimizations of the Linux SMP kernel are outside the scope of this work; however, improvements planned for upcoming releases, and more specifically fine-grain locking in the kernel, can only benefit nanothreading.

The rest of this paper is organized as follows: Section 2 overviews the architecture of the nanothreads kernel interface. Section 3 presents the implementation of the kernel-side support and Section 4 discusses the runtime system modifications. Section 5 presents experimental results for the low-level kernel services and multiprogrammed workloads with parallel applications from the SPLASH-2 benchmark suite. Section 6 discusses related work and Section 7 concludes the paper.
2 Nanothreads Kernel Interface Architecture
In this section we outline a communication architecture based on shared memory that enables the multiprogramming scalability of parallel programs written in a multithreading programming model. We assume that parallel programs follow a task-queue execution paradigm, where parallel tasks are generated dynamically at runtime and enqueued in user-level run queues for execution [12]. Execution is controlled by a user-level scheduler which is invoked in between the execution of parallel tasks on any of the EVs of the parallel program. With respect to multiprogramming, a task-queue paradigm provides a straightforward way to let programs adapt to dynamically changing system resources, since the user-level scheduler has the ability to control the degree of parallelism exposed by the program at runtime, by adjusting thread granularity accordingly, as well as the mapping of parallel tasks to EVs. The assumption of a task-queue execution model does not compromise the applicability of a nanothreading interface in other programming models based on shared memory, such as the data parallel programming model, as long as these models are coupled with adequate runtime support to control the degree of parallelism at runtime and, more specifically, at the entry points of parallel regions. To simplify the discussion, we assume that the kernel-user interface is implemented in a traditional monolithic kernel, where kernel code has full access rights to the address space of user processes while essential kernel data structures are protected and maintained in pinned pages or unmapped physical memory. It is important to note that the infrastructure described in this section is portable to other operating system architectures such as microkernels [1] and exokernels [7]. The primary differences are related to performance issues and mainly to the mechanisms used to communicate kernel events to the programs, as well as the degree of information communicated to the programs. For example, in a monolithic kernel, preemption of a kernel-thread can be communicated with a flag in shared memory which is polled asynchronously by a user-level application library. In an exokernel architecture, a kernel-thread preemption can be communicated synchronously (i.e. immediately) to the program along with the context of the preempted process, which can be handled directly by the application library.
2.1 General Overview
The first design principle behind the nanothreading kernel interface is to let the programs and the kernel communicate solely through loads and stores in shared memory in an asynchronous manner. Shared memory is the most efficient communication medium and avoids the need for implementing complicated kernel-side mechanisms, such as special-purpose software interrupts or signals, to exchange scheduling information between the kernel and parallel programs. The shared-memory interface is general enough to provide the bulk of the functionality needed for program adaptability in multiprogrammed environments. The use of expensive mechanisms that cross the kernel boundaries is necessary only for the creation of kernel-threads and the context switches imposed by the scheduling policies.
[Figure 1 about here: each application maps a shared arena consisting of a user-writable segment (n_cpus_requested and the application-side EV states, worker/idler) and a read-only segment written by the kernel (kernel-side EV states, blocked/preempted/running, together with n_cpus_allocated, n_cpus_current and n_cpus_blocked); in kernel space, EVs are backed by kernel-threads scheduled on CPU1...CPUp, alongside system load information, EV affinity information and scheduler accounting information.]
Figure 1: Nanothreading kernel interface architecture.

The second design principle of the nanothreading kernel interface is to arm parallel programs with mechanisms that continuously assist a program in executing along its critical path, in the presence of undesirable interventions from the operating system scheduler, more specifically preemptions of EVs and blocking system calls. This design issue dictates the use of a mechanism that resumes kernel-preempted EVs, along with an efficient intra-program priority scheduling policy for the EVs owned by a program. The architecture of a nanothreading shared memory interface is illustrated in Figure 1. Each parallel program reserves a portion of memory at a fixed location within its address space, which is shared between the program and the kernel. This memory region is called a shared arena. A part of the shared arena is accessible with read-write privileges by the user program, while the rest of the memory region is read-only and modifiable only by the kernel. The kernel also maintains a private set of book-keeping data structures used for processor allocation among parallel programs. These data structures track the ids of the EVs that run on each physical processor of the system, accounting information used to distribute processors and processor time among parallel programs, and system load information. In the read-write region of a shared arena, a parallel program stores requests for processors through a cpus_request() call to the user-level runtime library. Requests for processors are guided by the application-dependent degree of parallelism that the program can effectively exploit, assuming that the application runs on a dedicated system. Apart from its inherent parallelism, the only hard limit to the number of processors requested by a program is the number of processors in the system. The processor requirements of a parallel program may vary during execution. For example, a parallel program may alternate between a sequential phase where the program performs I/O from a file and a parallel phase where the program performs some computation in parallel. Changes in processor requirements are communicated through successive cpus_request() calls and stored in the associated field in the shared arena. The read-write region of the shared arena also maintains program-specific information that can be effectively used by the kernel as a hint to make better scheduling decisions. The EVs allocated to the program designate themselves as workers or idlers, depending on whether they have a user-level thread to execute in their run queues. In case the kernel decides to preempt some EVs for scheduling purposes, the priority of preempted worker EVs is raised compared to the priority of preempted idler EVs, to accelerate the rescheduling of workers and help the program make progress. The processor requests of a parallel program can be different from the number of processors allocated
from the operating system to the program at runtime. Processor requests reflect an application's degree of parallelism, while processor allocations reflect the operating system resource allocation strategy and instantaneous scheduling decisions, although these may be affected in various ways by processor requests. The instantaneous number of physical processors allocated to each program is kept up to date in the shared arena by the kernel, which stores the associated information in the n_cpus_current field of the read-only region. The kernel exploits the information available in the shared arena to customize processor allocation. In our implementation, the kernel grants processors to parallel programs according to the number of processors requested by each program (n_cpus_requested) and the overall system load, expressed as the sum of CPU requests from all the programs in the system [13]. This dynamic space-sharing scheme is effectively integrated with the Linux time-sharing scheduler. A nanothreading program can retrieve snapshots of the actual number of processors granted to it by the kernel, by issuing calls to the runtime library which load the n_cpus_current value from the shared arena. This call is typically issued whenever the program initiates a parallel execution phase. In this way, the program will arrange its parallelism by creating as many user-level threads as n_cpus_current, instead of n_cpus_requested, which can be higher than n_cpus_current, to execute the parallel phase. The read-only portion of the shared arena is used to communicate preemptions of EVs by the kernel, as well as blocking of EVs inside the kernel. Communication of kernel-thread preemptions and blocking is essential, since it provides a convenient means for letting parallel programs manage their parallelism more effectively [2]. Although explicit knowledge of the number of processors allocated to a parallel program is readily available through the n_cpus_current variable, preemptions of threads that execute in the critical path of the program impede the progress of the program and may lead to pathological situations like severe processor underutilization and forceful serialization. Communication of preemptions and blocking enables the user-level scheduler to resume preempted EVs in the place of idler EVs. When the user-level scheduler detects that its associated EV is idling, it attempts to hand off its processor to an EV of the same program with higher priority, i.e. a worker EV. If such an EV is not found, the idler EV simply yields its processor with the prospect that a worker EV of another program will utilize the processor better. Processor yielding and hand-offs are implemented with kernel intervention. The kernel suspends the execution of the idler EV and reactivates, if necessary, the EV to which the processor is handed off by inserting it in the ready queue. The EV may or may not be resumed directly when the hand-off returns, depending on the kernel scheduling policy. A direct resume of the EV to which the processor is handed off would be possible if the operating system provided a capability for resuming a kernel-preempted context at user level. Such a capability is generally not available in contemporary operating systems. An implementation of a user-level mechanism for resuming kernel-preempted contexts in the Cellular IRIX operating system can be found in [6]. A similar functionality can be implemented in an exokernel architecture [7].
All the communication for the kernel interface is implemented in shared memory in an asynchronous manner, which nevertheless ensures the absence of deadlock and livelock in parallel programs, as long as the kernel grants some processor time to the program.
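To make the layout concrete, the shared arena described above can be pictured roughly as the following C structure. The field and type names are hypothetical illustrations of the fields discussed in this section, not the actual definitions used by our implementation.

    /* Illustrative sketch of a per-program shared arena (hypothetical names). */
    #define NT_MAX_EVS 16   /* assumed upper bound on EVs per program */

    enum nt_ev_user_state   { NT_EV_IDLER, NT_EV_WORKER };
    enum nt_ev_kernel_state { NT_EV_RUNNING, NT_EV_PREEMPTED, NT_EV_BLOCKED };

    struct nt_shared_arena {
        /* Read-write region: written by the user-level runtime, read by the kernel. */
        volatile int n_cpus_requested;            /* processors the program can exploit       */
        volatile int ev_user_state[NT_MAX_EVS];   /* worker or idler, per EV                   */

        /* Read-only region: written by the kernel, polled by the user-level scheduler. */
        volatile int n_cpus_allocated;            /* processors granted by the kernel policy   */
        volatile int n_cpus_current;              /* processors running the program right now  */
        volatile int n_cpus_blocked;              /* EVs currently blocked inside the kernel   */
        volatile int ev_kernel_state[NT_MAX_EVS]; /* running, preempted or blocked, per EV     */
    };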
2.2 A Working Example
We provide a simple example to demonstrate the functionality of the nanothreading kernel interface. Assume that a parallel program executes in distinct phases and the program alternates between an I/O phase and a parallel computation phase for a number of iterations. At the beginning of execution, the program requests a number of processors which corresponds to the minimum of the degree of parallelism in the program and the number of processors in the system. The kernel will create P − 1 additional EVs for the program, which will be activated whenever the kernel grants more than one processor to the program for parallel execution. At the beginning of the parallel phase, the program issues a call to poll the n_cpus_current variable in the shared arena and adapt its parallelism accordingly. At the end of each parallel phase, the program issues a cpus_request(1) call to inform the kernel that the program moves to a sequential phase. The kernel will react to this change by reducing the number of processors allocated to the program to 1 at the next invocation of the kernel nanothreads scheduler, i.e. within an OS scheduler time quantum. The nanothreads scheduler will be invoked as a reaction to the change in processor requests. Suppose that during the parallel execution phase the kernel decides to redistribute the processors among
the running programs and, as a consequence, preempts one of the EVs of the program. The first of the running EVs of the program to finish the execution of a user-level thread will detect the preemption by polling the shared arena. After detecting the preemption, the EV will hand off its processor to the preempted EV. If more than one EV of the program is preempted, the preemptions will be handled in the same manner, as long as the program has at least one processor on which to execute. The kernel processor allocation policy ensures that the program will receive a fair amount of resources to proceed and finish the parallel phase. When moving from the parallel phase to the I/O phase, the program experiences a transient state. The kernel may have already granted a number of processors to the program to execute the parallel phase and all the executing EVs but one should be preempted, since the program will make no further use of them until the next parallel phase. The idling EVs of the program will yield their processor gracefully after some time. EVs that do not yield their processor will eventually be preempted by the kernel within at most 10 msec. A possible scenario in this case is that the kernel, after granting one processor to the program, preempts the EV which is going to actually execute the sequential phase in favor of an idler, before the idler decides to yield its processor. The idler will mark its state in the shared arena, detect the existence of a preempted EV and consequently resume the preempted EV. As a consequence, the thread that executes in the critical path will be resumed with at most one extra context switch.
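The reaction of the user-level scheduler in this scenario can be sketched as follows, building on the arena layout sketched in Section 2.1. The helper functions and the nt_handoff()/nt_yield() entry points are hypothetical names for the runtime calls that wrap the corresponding kernel services; the sketch only illustrates the order of the checks.

    /* Illustrative safe-point check executed by an EV in the user-level scheduler. */
    extern struct nt_shared_arena *arena;     /* shared arena of this program                 */
    extern int  runqueue_empty(void);         /* user-level run queue of ready threads        */
    extern int  idled_too_long(void);         /* heuristic idling threshold                   */
    extern void nt_handoff(int ev);           /* kernel service: hand the CPU to EV ev        */
    extern void nt_yield(void);               /* kernel service: local nanothreads scheduling */

    void nt_safe_point(int my_ev)
    {
        int ev;

        if (!runqueue_empty()) {
            arena->ev_user_state[my_ev] = NT_EV_WORKER;
            return;                           /* pick the next user-level thread and run it   */
        }
        arena->ev_user_state[my_ev] = NT_EV_IDLER;

        /* Prefer resuming a preempted worker EV of the same program. */
        for (ev = 0; ev < NT_MAX_EVS; ev++) {
            if (arena->ev_kernel_state[ev] == NT_EV_PREEMPTED &&
                arena->ev_user_state[ev] == NT_EV_WORKER) {
                nt_handoff(ev);
                return;
            }
        }
        /* No preempted worker: after idling for a while, give the processor away. */
        if (idled_too_long())
            nt_yield();
    }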
3 Kernel-Side Implementation
As mentioned earlier, the core of the nanothreading interface is a shared arena, i.e. a memory region shared between each program that uses the nanothreading interface and the operating system kernel. This memory region is allocated in the virtual address space of nanothreading programs. Care is taken so that the memory region is situated in a single memory page. The pages containing such memory regions are pinned to physical memory in order to avoid paging out critical scheduling information at run time, which would be harmful for system performance. This poses a limit on the number of concurrently active parallel programs, which is however significantly less restrictive than other limits posed by the operating system itself, such as the length of the task table (512 entries in Linux 2.0.36) [24].

The shared arena is divided into two major regions, according to the privileges the user program has over them: the read-only and the read-write region. The read-only region is used by the kernel in order to communicate scheduling information (number of currently active CPUs, number of undesirably preempted CPUs, number of CPUs blocked in the kernel) to the program, while the read-write region serves the application as a medium for informing the kernel of its desired level of parallelism as well as of the current state of its EVs. Unfortunately, because of limitations posed by the Intel x86 processor architecture, a memory page can be either read-only or read-write. In order to achieve both goals of using only one memory page per application for the shared arena and protecting the read-only information from accidental overwriting, a workaround has been used. The kernel keeps a private copy of the shared arena and trusts only the data residing in it. The kernel's copy is updated with the data residing in the read-write region of an application's copy of the shared arena each time that application loses control of a processor as a result of an operating system scheduling decision. The read-only region of the application's copy is updated, in turn, when the operating system scheduler selects a kernel-thread belonging to that application to be given a processor for the next time quantum. The update of the application-side copy usually requires changing the CR3 (page table base address) register, which results in an expensive TLB flush [23]. In order to minimize the overhead, we update the application-side copy only if one of the read-only fields has been altered since the previous update.

A slightly different approach would be to place the shared arena in a single memory region, shared between all nanothreading applications, where each application would have its own slot. This solution has been rejected because it would cause a severe and unnecessary increase of traffic and contention on the memory region accommodating the shared arenas. Although this is not a problem in UMA machines, like the one used for this work, it would certainly result in performance degradation in large-scale NUMA machines.

The location of the shared arena region in the application's virtual address space is communicated to the kernel via a system call. This system call informs the kernel that the application uses the nanothreading interface and passes a pointer to the top of the memory region allocated by the user. It also serves as a request to the kernel to create as many kernel-threads as the application expects to use during its execution lifetime; these kernel-threads are going to be used as execution vehicles by the application.
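A minimal sketch of the kernel-side bookkeeping implied by the double-copy scheme above is given below; all names are hypothetical and the address-space switch is only indicated by a comment. The point of the sketch is the asymmetry of the two copies: read-write fields are pulled in whenever the application loses a processor, while read-only fields are pushed out only if they changed since the last update.

    /* Illustrative kernel-side descriptor of a nanothreading program (hypothetical names). */
    struct nt_app {
        struct nt_shared_arena  kcopy;    /* kernel's trusted copy of the arena              */
        struct nt_shared_arena *uarena;   /* user-space address of the pinned arena page     */
        int                     ro_dirty; /* read-only fields changed since the last push    */
    };

    /* Called when an EV of "app" loses its processor: trust only what we copy in
     * (assuming the application's address space is current at this point).       */
    static void nt_pull_rw_fields(struct nt_app *app)
    {
        app->kcopy.n_cpus_requested = app->uarena->n_cpus_requested;
        /* ... the application-side EV states are copied in the same way ...      */
    }

    /* Called when an EV of "app" is selected to run for the next quantum: push the
     * read-only fields only if something changed, to avoid a needless switch of the
     * page table base (CR3) and the TLB flush it implies.                          */
    static void nt_push_ro_fields(struct nt_app *app)
    {
        if (!app->ro_dirty)
            return;
        /* switch to the application's address space and copy the read-only fields
         * of app->kcopy into app->uarena (details omitted)                         */
        app->ro_dirty = 0;
    }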
The application is responsible for allocating the user-level stacks the new threads are going to use and for providing a pointer to the code the threads are going to execute after being created and unblocked. All kernel-threads needed are created at once, using a slightly modified version of the native kernel code for cloning. This reduces the overhead of multiple system calls and results in significantly faster kernel-thread creation. The field corresponding to the instruction pointer in the kernel TSS (Task State Segment) structure, i.e. the structure residing in kernel space that the processor uses to save the context of kernel-threads during context switches between them, is set to point to the function the thread must upcall to when running for the first time (the nanothreads run-time library idler function). At the same time, the arguments needed by that function are pushed onto the user-level stack of the thread. The newly created kernel-threads are left in the blocked state and they are explicitly unblocked later, when the kernel decides to grant some processors to the application.

Each nanothreading application is represented in the kernel via a data structure. This structure contains mainly the kernel-level copy of the shared arena, a pointer to the user-level copy, a pointer to the thread-parent of the application and information on every kernel-level thread that the application controls. The structures of all active nanothreading applications are organized as a doubly linked list. The nanothreading kernel also keeps accounting information on the assignment of physical processors to nanothreading applications and the total system workload, expressed as the sum of processor requirements of all nanothreading applications.

Given the definition of the nanothreads programming model, it is certain that if, for any reason, a kernel-thread belonging to a nanothreading application terminates prematurely, the application will either suffer a livelock or produce incorrect results. This necessitated the implementation of a share-groups mechanism, which was originally introduced in IRIX. In Linux, a share group is defined to be a subtree of the system process tree whose member-processes share the same virtual address space. The share-groups mechanism provides functionality for informing all the members of a share-group of an event that happened to another member. Currently, only the signaling of all members upon death of another member-process is supported. Each share-group is represented as an in-kernel data structure, which contains the number of the members of the group and the signal to be used. Each member of a share group retains in its task structure a pointer to the corresponding share-group structure. Special handling is required upon signaling the share-group members if the specified signal's standard behavior is to terminate the receiver. The first danger is to flood the system with signals. Each signaled process, before terminating, will in turn try to signal the other members of the group. This procedure would produce n² signals for a share-group with n members. In order to avoid that, we allow only the first terminating process to use the share-groups mechanism. The rest follow the standard Linux behavior (they signal their parent with SIGCHLD). The second risk is to leave zombie processes in the system. That risk is dealt with by applying an algorithm that signals each process only after all its children belonging to the same share-group have terminated and have been waited for.
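As an illustration of the bookkeeping just described, the share-group state can be pictured along these lines (hypothetical names; the actual structures also carry whatever per-member links are needed to walk the group):

    /* Illustrative share-group bookkeeping (hypothetical names). */
    struct nt_share_group {
        int nr_members;   /* number of member-processes in the group                */
        int signo;        /* signal delivered to the members when one of them dies  */
        int notified;     /* set by the first terminating member, so the group is   */
                          /* signalled only once instead of n² times                */
    };

    /* Each member's task structure keeps a pointer to its group, e.g.:
     *     struct task_struct { ...; struct nt_share_group *share_group; ... };     */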
When the signal used is changed, the signal masks of all member-processes of the share group are checked and, if necessary, changed, in order to make sure that all the members are able to receive the signal.

Functionality has been added to allow the binding of kernel-threads to processors. A kernel-thread bound to a processor is allowed to run only on that specific processor. A field in the task structure of the kernel-thread defines the id of the corresponding processor. During the selection process initiated by the scheduler when a processor is to be reallocated, the kernel-threads bound to a processor other than the one in favor of which the scheduling is taking place are bypassed. The binding is strict in that respect, i.e. the thread is not allowed to run on any processor other than the one it is bound to, even if that processor is idling. This behavior has been chosen because it has been determined that the cost of migrating the thread to another processor, incurred by the loss of cache locality, outweighs the benefit of increased utilization.

A mechanism has also been implemented to allow explicit blocking/unblocking of threads. A semaphore variable has been added to the task structure of each kernel-thread. This semaphore is decreased at block and increased at unblock requests. If the semaphore gets negative, the corresponding kernel-thread is blocked. This is done by changing its state to TASK_INTERRUPTIBLE and expelling it from the queue of runnable processes (if the thread happens to be running on another processor at the time of blocking, the removal from the run queue is postponed until scheduling takes place on that processor). At the same time all its signals, except the one used by the share-groups mechanism, get blocked in order to avoid accidental wake-up of the thread as a result of the receipt of a signal. The old signal mask
is saved in order to be restored during unblocking. If the blocked process was the one previously running on the current processor, an operating system scheduling is initiated. If, on the other hand, an unblock request sets the semaphore to zero, the kernel-thread must be unblocked. Its signal mask is restored using the saved value, its state is changed to TASK_RUNNING and it is placed back in the run queue. A case that needs special handling is the block request for an already blocked kernel-thread. The thread can be blocked in the kernel for numerous reasons, such as waiting for I/O to complete, having executed a blocking system call, etc. Trying to block an already blocked thread poses two problems. Firstly, the kernel-thread no longer resides in the run queue and any attempt to remove it from the queue will result in an error. This is dealt with by checking the presence of a thread in the run queue before trying to dequeue it. The second problem is that, after the occurrence of the event upon which the thread was blocked, the thread will be woken up by the kernel and will be from that time on available to run, which is not the expected behavior. This is avoided by checking the semaphore of any process to be woken up. If the semaphore is negative, the process remains blocked.

The assignment of runnable EVs to applications is a duty of the nanothreads kernel-level scheduler. The scheduling takes place in three phases. During the first phase, the scheduler decides how many runnable EVs will be assigned to each nanothreading application. The criterion at this phase is the scheduling policy currently active. At present, four policies have been implemented: batch, round-robin, dynamic space sharing and round-robin with dynamic space sharing [13]. The second phase results in an indirect assignment of physical processors to the nanothreading applications selected at phase one. More specifically, the nanothreading applications are not given a specific physical processor, but the right to compete with other, non-nanothreading applications for that specific processor. During this phase the major goal is the preservation of locality. The applications that have been assigned processors both for the previous and the next time quantum will be given the right to execute on the same physical processors as before. The remaining processors are distributed to the nanothreading applications based on the execution history of each application. The applications keep a bit-vector of their recent processor allocations, which serves as an advisor to the scheduler. During the final phase, a specific kernel-thread of each application chosen in phase one is selected to serve as an EV during the next time quantum. A priority scheme is applied among the threads of the application. If a thread was executing on that physical processor during the past time quantum, it is automatically selected. If this is not the case, the threads that have previously been preempted at non-safe points (a safe point being the point where an EV has finished the execution of a user-level thread and has not yet been assigned another user-level thread to execute), while executing useful work, have the highest priority. The threads previously preempted at non-safe points, while having marked themselves in the shared arena as idlers, constitute the next priority class. The threads with the lowest priority are the ones that were previously voluntarily suspended at safe points. The selected thread gets unblocked and bound to the physical processor, whereas the thread previously running gets blocked and unbound.
The scheduler also updates the kernel copy of the shared arena of the affected applications. Nanothreading scheduling is initiated in four cases: a) upon expiration of a nanothreads scheduler time quantum, b) when one or more nanothreading applications change their parallelism requests, c) when a nanothreading application terminates and d) when a nanothreading application enters the system.

In order to avoid livelock and achieve higher system throughput and utilization, our implementation provides mechanisms to assist applications in executing along their critical path in the presence of undesirable preemptions of execution vehicles by the operating system scheduler. Each EV reaching a safe point checks the shared arena for preempted EVs of the same application. If such EVs are found, the currently executing EV hands off its processor in favor of a preempted EV. This is done via a system call. The kernel executes the third phase of the nanothreads scheduling and selects, using the previously described priority scheme, a preempted EV to be resumed. The previously running EV gets suspended. EVs that find themselves idling for a long time can yield their processor in favor of another nanothreading application, hoping that there is at least one nanothreading application in the system having useful work to do. They execute a system call, which initiates a local nanothreads scheduling, i.e. a scheduling decision affecting only the yielded processor. The local scheduling tries to choose the application and the kernel-thread that the global scheduling is most likely to select during its next execution. If such an application and kernel-thread are found, the processor is assigned to them and the previously running kernel-thread gets suspended. With this technique, the nanothreading applications greedily try to achieve increased system throughput.
A common problem of user-level thread libraries is the fact that when a user-level thread blocks, as a result of executing a blocking system call or while waiting for the completion of pending I/O, the corresponding EV also blocks. However, the user-level library has no means of being informed and activating another EV in order to keep using the processor effectively. This can result in low processor utilization or, even worse, in deadlock if all EVs of an application block (in fact this can never occur in the Nanothreads programming model; in other programming models, however, it is a common case and substantial effort may be required to avoid deadlock [21]). A nanothreading-conscious kernel and the communication between kernel and user level are powerful tools which can assist in resolving such cases. When the kernel detects that an active EV of a nanothreading application gets blocked, a local scheduling takes place for the processor assigned to the blocked EV, which leads to the selection and resumption of another EV. When the blocked EV is ready to be unblocked again, the kernel checks if a global nanothreads scheduling has occurred in the meantime. In this case, the EV is not unblocked but marked in the shared arena as a preempted worker, as the physical processor is not assigned to it anymore. If the previous assignment still holds, i.e. no global scheduling arose in between, the EV currently executing on the processor that belonged to the blocked EV gets suspended and the previously blocked EV is resumed. The existence of blocked EVs is communicated to the application that owns them via the shared arena, in order to allow it to take any actions needed. Another effect of monitoring the blocking of EVs in the kernel is that the applications that own blocked EVs are considered, during scheduling, to be of lower priority than the applications that do not. This is based on the observation that blocking events usually occur on the critical path of an application. When this holds, it is meaningless to grant the application physical processors while it waits for the completion of a blocking event.
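The block/unblock mechanism described earlier in this section can be sketched as follows. This is an illustration only: the names are hypothetical, and the real code also saves and restores signal masks, handles the cross-processor cases and triggers a rescheduling when a thread blocks itself.

    /* Illustrative per-thread block/unblock counter (hypothetical names). */
    struct nt_kthread {
        int nt_sem;       /* 0 when runnable; negative while block requests are pending */
        int state;        /* simplified stand-in for TASK_RUNNING / TASK_INTERRUPTIBLE  */
        int on_runqueue;  /* whether the thread currently sits in the run queue         */
    };

    void nt_block(struct nt_kthread *t)
    {
        if (--t->nt_sem < 0) {
            t->state = 1;                /* TASK_INTERRUPTIBLE                           */
            if (t->on_runqueue)          /* the thread may already be blocked in kernel  */
                t->on_runqueue = 0;      /* remove it from the run queue                 */
        }
    }

    void nt_unblock(struct nt_kthread *t)
    {
        if (++t->nt_sem == 0) {          /* no pending block requests remain             */
            t->state = 0;                /* TASK_RUNNING                                 */
            t->on_runqueue = 1;          /* place it back in the run queue               */
        }
    }

    /* A wake-up caused by the completion of I/O or of a blocking system call must also
     * check nt_sem: if it is still negative, the thread stays blocked.                 */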
4 User-Side Implementation
In order to evaluate the efficiency of our kernel-side implementation, we ported a research prototype of the Nanothreads multithreading runtime library to Linux. The library was originally designed and implemented for IRIX [4]. Support for the shared arena has been added. The library undertakes the responsibility to allocate the memory region to be used as a shared arena and to execute the appropriate system call to inform the kernel of the location of the allocated region. The same system call creates the required kernel-threads. This is done during the library initialization phase. The shared arena is checked by the library at safe points, i.e. when the EV has just finished the execution of a user-level thread and the user-level scheduler is about to assign another one to it. If there are preempted EVs of the same application, the current processor is handed off to them. If this is not the case and the EV is idling for a significant amount of time, the processor is yielded in favor of other nanothreading applications. The library also marks the EVs in the shared arena as workers or idlers, depending on whether they are in the middle of the execution of a user-level thread or not. The application checks the shared arena at points where user-level threads are created. It takes into account the processor resources made available to it by the kernel, in order to appropriately adapt the degree of the created parallelism. The application also writes its requests for processors in the appropriate field of the shared arena, when these requests vary during its execution life.

Another augmentation to the nanothreading library is the addition of non-blocking synchronization primitives for enqueuing and dequeuing user-level threads [17]. An alternative approach would be the use of locks. However, this approach has proven to be quite inefficient in the presence of frequent, undesired EV preemptions by the kernel. If an EV gets preempted while the user-level thread running on top of it holds a queue lock, all enqueuing and dequeuing operations must be postponed until the resumption of the preempted thread and the release of the lock. The algorithm used is quite straightforward. Before enqueuing/dequeuing a queue entry, the entry residing at the head of the queue is copied to a temporary variable. This temporary variable is compared with the entry at the head of the queue and, if the comparison succeeds, the head of the queue is exchanged with another entry. The compare and exchange is atomic (using the CMPXCHGB instruction of the Pentium Pro) and therefore guarantees the atomicity of enqueuing/dequeuing. The major drawback of lock-free queues is that they allow only one access point to the queue. It is not, for example, possible to enqueue an entry either at the head or at the tail of the queue. This can cause problems when enqueuing/dequeuing
user-level threads at multiple points is used in order to introduce some kind of prioritization among them. This restriction, however, has proven to be of minor importance in our implementation.

    Primitive             Average Time (µsec)
    Block (other)                       1.427
    Block (myself)                      6.675
    Unblock                             8.045
    Handoff to stolen                  53.932
    Handoff to leader                  25.168

Table 1: Overhead of elementary nanothreading interface primitives.
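A minimal sketch of a dequeue operation in the spirit of the non-blocking algorithm described in Section 4 is shown below. It is written with a GCC atomic builtin for brevity, which compiles to the same compare-and-exchange instruction family on the Pentium Pro; the sketch also ignores the ABA problem, which a production queue (e.g. Valois-style counted pointers [17]) has to address.

    #include <stddef.h>

    struct nt_qnode {
        struct nt_qnode *next;
        void            *work;                      /* user-level thread descriptor      */
    };

    struct nt_queue {
        struct nt_qnode *head;                      /* single access point of the queue  */
    };

    struct nt_qnode *nt_dequeue(struct nt_queue *q)
    {
        struct nt_qnode *old_head;

        do {
            old_head = q->head;                     /* snapshot the current head         */
            if (old_head == NULL)
                return NULL;                        /* queue is empty                    */
            /* retry if another EV changed the head between the snapshot and the swap   */
        } while (!__sync_bool_compare_and_swap(&q->head, old_head, old_head->next));

        return old_head;
    }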
5 Performance Evaluation
The system used for the evaluation of our implementation is a dual Pentium Pro clocked at 200 MHz. Each processor is equipped with 512 Kbytes of L2 cache and the total physical memory is 512 Mbytes. The timing methodology we use has a granularity of 5 nanoseconds, which is the clock period of the system. This granularity is achieved by taking advantage of the Time Stamp Counter register of the Pentium Pro. This is a 64-bit counter which counts the clock cycles elapsed since system startup. The counter can be read from user level using the RDTSC instruction. RDTSC is not a serializing instruction, i.e. it is not guaranteed that all modifications to flags, registers and memory made by previous instructions are completed before RDTSC is fetched and executed [23]. For this reason, it is necessary to ensure that a serializing instruction is executed right before RDTSC. The most common serializing instruction on Pentium Pro processors is CPUID. In our timing routines, care is also taken to estimate the overhead caused by their execution and exclude it from the measured time interval. It is worth mentioning that CPUID has a rather peculiar behavior concerning the cycles it takes to execute: the first two CPUID instructions issued by each application take longer to execute than the following ones. In order to deal with this instability, the initialization of our time counting routines includes the execution of two CPUID instructions, before the estimation of the overhead those routines cause [22].

The first set of measurements focuses on the low-level kernel primitives we implemented. The measurements presented are averages over ten repetitions of each experiment. A short explanation of each primitive follows; the corresponding numbers and diagrams are given in Table 1 and Figure 2.

• Block (other): The process to be blocked is not currently running on any processor.
• Block (myself): The process to be blocked is the one currently running on the processor which is going to execute the block.
• Unblock: A blocked process is woken up.
• Handoff to stolen: The currently running EV hands off its physical processor to an EV of the same application marked as undesirably preempted.
• Handoff to leader: The currently running EV hands off its physical processor to the first process of the application, i.e. the application leader. This handoff is required by the Nanothreads run-time library and takes place during the termination phase of an application in order to ensure that no zombie processes remain in the system.
• Cloning: Creation of the kernel-level threads (EVs) needed by the application.

The efficiency of blocking/unblocking a kernel-thread is obvious, compared to the tens of µsecs required by user-level threads packages for blocking/unblocking user-level threads. The overhead of handoff is also satisfactory, given that it requires crossing the kernel boundaries and context switching to another kernel-thread. The handoff to a stolen CPU costs more than the handoff to the leader, because the former requires an extensive search to find the most appropriate EV to wake up, according to the priority scheme described.
[Figure 2 about here: two charts plotting average time (microseconds) against the number of kernel threads, comparing the standard Linux clone against the nanothreads clone for small and large numbers of threads.]
Figure 2: Overhead of cloning kernel-threads.
[Figure 3 about here: four charts (SPLASH-2 Block-LU, FFT, Raytrace and Volrend) plotting average turnaround time in seconds against the degree of multiprogramming (1-way, 4-way, 8-way), comparing the native Linux SMP kernel with the nanothreading Linux SMP kernel.]
Figure 3: Results from executions of multiprogrammed workloads of SPLASH-2 benchmarks with the native Linux SMP kernel and the nanothreading Linux kernel. The charts illustrate the average and normalized average execution times of the benchmarks under various degrees of multiprogramming.
It is also clear that our cloning mechanism performs much better than the standard Linux cloning facility. Figure 2 shows the overhead for cloning 1 to 16 and 10 to 400 kernel-threads, using the standard Linux clone system call and our service. The first diagram makes clear the overhead reduction in common cases, whereas the second demonstrates the scalability of our approach.

The next set of measurements attempts to evaluate the performance of the system in terms of overall throughput under multiprogramming. Four applications from the SPLASH-2 suite [18] have been chosen as benchmarks; more specifically, we used Block-LU, FFT, Raytrace and Volrend. These applications either follow a task-queue execution paradigm or consist of parallel regions separated from each other by global barriers. The changes needed in these applications in order to use the nanothreading interface were minor. For the first class of applications they were straightforward: each task was transformed into a user-level thread and the built-in task queues used in the applications were replaced by the user-level run queues of the nanothreading runtime system. For the second class of applications, we substituted each parallel region with a chunk of nanothreads. The throughput of our implementation for each workload has been compared to the throughput achieved by the standard Linux kernel (version 2.0.36) and the LinuxThreads package (an implementation of the POSIX 1003.1c thread package for Linux, which provides kernel-level threads). The workloads consist of 1, 4 and 8 identical copies of each application. The performance achieved for each workload under the nanothreading kernel with the nanothreads library and under the standard Linux kernel with the LinuxThreads library is presented in Figure 3. The nanothreading kernel demonstrates a solid improvement over the native Linux kernel, expressed as an 8% to 11% increase in system throughput, which is expected to grow at larger processor counts and higher degrees of multiprogramming. The overall results substantiate our argument that exporting kernel-level scheduling decisions to parallel programs through a very lightweight communication interface assists the scalability of parallel programs under multiprogramming, while keeping the cost of dynamic scheduling at both the user and kernel levels affordable.
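For reference, a timing routine following the methodology described at the beginning of this section can be sketched as follows (GCC inline assembly; an illustration rather than the exact code we used, and the overhead of the routine itself still has to be calibrated out as discussed above):

    #include <stdint.h>

    /* Read the Time Stamp Counter after serializing the pipeline with CPUID, so that
     * all previous instructions have completed before the counter is sampled.       */
    static inline uint64_t read_tsc_serialized(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__(
            "xorl %%eax, %%eax\n\t"   /* select CPUID leaf 0                          */
            "cpuid\n\t"               /* serializing instruction                      */
            "rdtsc\n\t"               /* EDX:EAX <- 64-bit time stamp counter         */
            : "=a" (lo), "=d" (hi)
            :
            : "%ebx", "%ecx");        /* CPUID clobbers EBX and ECX                   */
        return ((uint64_t)hi << 32) | lo;
    }

    /* At 200 MHz, elapsed cycles translate to time at 5 ns per cycle:
     *     uint64_t t0 = read_tsc_serialized();
     *     ...code under measurement...
     *     uint64_t cycles = read_tsc_serialized() - t0;                              */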
6 Related Work
Previous research on multiprogrammed shared-memory multiprocessors [2, 9, 16] has demonstrated the effectiveness of dynamic space sharing as the preferred processor allocation policy for parallel programs. The idea behind dynamic space sharing is to partition the system processors among parallel applications according to some predefined policy and subsequently let processors migrate from one parallel program to another upon changes of the system load. Two prerequisites for the applicability of dynamic space sharing are the use of a task-queue programming paradigm based on user-level threads and the existence of a mechanism that communicates kernel-level scheduling decisions, such as allocations and deallocations of processors. Our work is in line with these research proposals in the sense that we also use dynamic space sharing as our default processor allocation policy, by exploiting the nanothreading interface as the communication medium between parallel programs and the kernel. However, we differ from previous work in four important aspects. First, the works of Tucker and Gupta [16], Anderson et al. [2] and McCann et al. [9] relied on heavyweight mechanisms such as signals and upcalls to implement the communication interface with the kernel. We rely on shared memory, a completely asynchronous model of communication and polling to implement the same functionality. Our approach minimizes the overhead of kernel-user communication as such and provides parallel programs with adequate means to adjust their parallelism, even in the presence of very frequent changes of the workload and/or fine-grain interactions between threads in parallel programs. Second, our nanothreading interface is more oriented towards providing efficient mechanisms to let parallel programs make progress along their critical path in the presence of undesirable preemptions of EVs by the operating system. Put simply, the nanothreading interface is not coupled with a specific kernel scheduling policy and focuses rather on enhancing any scheduling policy with mechanisms for efficient multiprogrammed execution of parallel programs. A third difference is that our nanothreading interface uses a dynamic space sharing strategy that exploits information on the individual processor requirements of each parallel program [13], rather than the flat FCFS or equipartitioning schemes used in previous proposals. Finally, our nanothreading interface is naturally integrated with a standard UNIX time-sharing scheduler and enables efficient simultaneous execution of both nanothreading and non-nanothreading jobs. Integration with time-sharing has not
attracted considerable attention in previous research on multiprocessor scheduling. Recent work by Yue and Lilja [19, 20] on the Solaris operating system and by Craig [6] on the Cellular IRIX operating system presented scheduling methodologies based on kernel-user communication through a shared arena pinned to physical memory. The work of Yue and Lilja uses the shared arena to implement dynamic processor allocation at the entry points of parallel loops by polling a system load variable. Our nanothreading interface differs from this work in the sense that it supports arbitrary forms of parallelism instead of just a master/slave execution paradigm, and uses a shared arena which is enriched with ample information both for scheduling multiple parallel programs simultaneously and for controlling the EVs of each parallel program in the most effective way. In terms of overall architecture and philosophy, our nanothreading interface is similar to the multiprogrammed execution paradigm presented by Craig [6]. The two most notable differences between the two works reside in the implementation of the shared arena and the space sharing strategy used in the kernel scheduler. We use a distributed implementation of the shared arena mostly for reducing contention, while Craig uses a centralized implementation which pins the shared arena in a specified region of main memory shared between all parallel programs. Craig also uses a different dynamic space sharing algorithm, which is coupled with the earnings-based time-sharing scheduler of the IRIX operating system [5]. A third difference is that Craig uses the shared arena not only to exchange critical scheduling information between parallel programs and the kernel, but also as a repository for the contexts of EVs preempted by the kernel scheduler. The implementation enables the user-level scheduler of each parallel program to resume the contexts of preempted EVs for execution directly from the shared arena, without kernel intervention and privileged instructions (the actual implementation exploits a specific feature of superscalar execution on the MIPS R10000 processor, which enables a program to restore the values of the non-loadable registers that complete a context switch in the delay slots of branch instructions). We plan to investigate the feasibility of such functionality on Intel Pentium processors.
7 Conclusions
This paper presented the implementation details of a shared-arena nanothreading interface in Linux and demonstrated its efficiency in achieving scalable performance for parallel programs in multiprogrammed execution environments. Our implementation of the nanothreading interface stressed two important issues. First, we demonstrated that exporting kernel scheduling information to user programs is feasible with minimal overhead, using shared memory as the communication medium and a combination of asynchronous notifications and polling in the user-level scheduler. This approach, combined with highly tuned kernel mechanisms, enables the efficient scheduling of fine-grain parallel programs under multiprogramming. Second, we presented a methodology for letting parallel programs control their EVs in a way that ensures the progress of each parallel program in the presence of inopportune preemptions and minimizes idle time within each parallel program. Our current implementation work focuses on further optimizing the nanothreading interface and porting the complete infrastructure to newer versions of the Linux kernel that employ fine-grain locking in the kernel and more scalable SMP support. We are also investigating the applicability and efficiency of the developed nanothreading infrastructure in the context of general-purpose multithreading runtime systems, including WWW runtime libraries and the Java Virtual Machine.
Acknowledgements

We are grateful to Constantine Polychronopoulos, David Craig and our partners in the NANOS project. This work is supported by the European Commission, under the ESPRIT IV Project No. 21907 (NANOS).
References

[1] M. Accetta, R. Baron, W. Bolosky, D. Golub, R. Rashid, A. Tevanian and M. Young, Mach: A New Kernel Foundation for UNIX Development, Proc. of the Summer 1986 USENIX Conference, pp. 93–112, 1986.
[2] T. Anderson, B. Bershad, E. Lazowska and H. Levy, Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism, ACM Transactions on Computer Systems, Vol. 10(1), pp. 53–79, 1992.
[3] E. Ayguadé, M. Furnari, M. Giordano, H. Hoppe, X. Martorell, J. Labarta, N. Navarro, D. Nikolopoulos, T. Papatheodorou and E. Polychronopoulos, Nanothreads Programming Model Specification, Deliverable M1.D1, ESPRIT Project No. 21097, 1997.
[4] E. Ayguadé, M. Furnari, M. Giordano, H. Hoppe, X. Martorell, J. Labarta, N. Navarro, D. Nikolopoulos, T. Papatheodorou and E. Polychronopoulos, Nanothreads Library Implementation, Deliverable M2.D2, ESPRIT Project No. 21097, 1998.
[5] J. Barton and N. Bitar, A Scalable Multidiscipline, Multiprocessor Scheduling Framework for IRIX, Proc. of the 1st IPPS Workshop on Job Scheduling Strategies for Parallel Programming, LNCS Vol. 1459, Santa Barbara, CA, 1995.
[6] D. Craig, An Integrated Kernel-Level and User-Level Paradigm for Efficient Multiprogramming, Master's Thesis, CSRD Technical Report 1533, University of Illinois at Urbana-Champaign, 1999.
[7] D. Engler, M. Frans Kaashoek and J. O'Toole, Exokernel: An Operating System Architecture for Application-Level Resource Management, Proc. of the 15th ACM Symposium on Operating System Principles, pp. 251–266, 1995.
[8] K. Keeton, D. Patterson, Y. He, R. Raphael and W. Baker, Performance Characterization of a Quad Pentium Pro SMP using OLTP Workloads, Proc. of the 25th Annual International Symposium on Computer Architecture, pp. 15–26, Barcelona, 1998.
[9] C. McCann, R. Vaswani and J. Zahorjan, A Dynamic Processor Allocation Policy for Multiprogrammed Shared-Memory Multiprocessors, ACM Transactions on Computer Systems, Vol. 11(2), pp. 146–178, 1993.
[10] M. Michael and M. Scott, Nonblocking Algorithms and Preemption-Safe Locking on Multiprogrammed Shared Memory Multiprocessors, Journal of Parallel and Distributed Computing, Vol. 54(2), pp. 162–182, 1998.
[11] C. Polychronopoulos, Multiprocessing vs. Multiprogramming, Proc. of the 1989 International Conference on Parallel Processing, pp. II-223–II-230, 1989.
[12] C. Polychronopoulos, N. Bitar and S. Kleiman, Nanothreads: A User-Level Threads Architecture, CSRD Technical Report 1297, University of Illinois at Urbana-Champaign, 1993.
[13] E. Polychronopoulos, X. Martorell, D. Nikolopoulos, J. Labarta, T. Papatheodorou and N. Navarro, Kernel-Level Scheduling for the Nano-Threads Programming Model, Proc. of the 12th ACM International Conference on Supercomputing, pp. 337–344, Melbourne, Australia, 1998.
[14] E. Polychronopoulos and T. Papatheodorou, Scheduling User-Level Threads on Scalable Shared-Memory Multiprocessors, Technical Report 010498, High Performance Information Systems Laboratory, University of Patras, 1998.
[15] POSIX OSI API Standards, IEEE Standard 1003.1c, Thread Extensions, 1995.
[16] A. Tucker and A. Gupta, Process Control and Scheduling Issues for Multiprogrammed Shared-Memory Multiprocessors, Proc. of the 12th ACM Symposium on Operating System Principles, pp. 159–166, 1989.
[17] J. Valois, Lock-Free Data Structures, PhD Dissertation, Rensselaer Polytechnic Institute, 1995.
[18] S. Woo, M. Ohara, E. Torrie, J. P. Singh and A. Gupta, The SPLASH-2 Programs: Characterization and Methodological Considerations, Proc. of the 22nd Annual International Symposium on Computer Architecture, pp. 24–36, 1995.
[19] K. Yue and D. Lilja, An Efficient Processor Allocation Strategy for Multiprogrammed Shared-Memory Multiprocessors, IEEE Transactions on Parallel and Distributed Systems, Vol. 8(12), pp. 1246–1258, 1997.
[20] K. Yue and D. Lilja, Dynamic Processor Allocation with the Solaris Operating System, Proc. of the First Merged IPPS/SPDP Conference, pp. 392–397, Orlando, FL, 1998.
[21] M.L. Powell, S.R. Kleiman, S. Barton, D. Shah, D. Stein and M. Weeks, Sun Microsystems Inc., SunOS Multi-thread Architecture, USENIX, Winter 1991.
[22] Using the RDTSC Instruction for Performance Monitoring, Intel Corporation, 1997. (http://www.intel.com/drg/pentiumII/appnotes/RDTSCPM1.HTM)
[23] Intel Architecture Software Developer's Manual, Volumes 1, 2 and 3, Intel Corporation, 1996–1997.
[24] M. Beck, H. Boehme, M. Dziadzka, U. Kunitz, R. Magnus, D. Verworner, Linux Kernel Internals, Addison-Wesley, 1997.