The Design and Implementation of a Fully-Modular, Self-Healing, UNIX-Like Operating System
Technical Report IR-CS-020, February 2006
Jorrit N. Herder, Herbert Bos, Ben Gras, Philip Homburg, and Andrew S. Tanenbaum Dept. of Computer Science, Vrije Universiteit Amsterdam, The Netherlands {jnherder, herbertb, bjgras, philip, ast}@cs.vu.nl
Abstract

In this paper, we discuss the architecture of a fully-modular, self-healing operating system, which exploits the principle of least authority to provide reliability beyond that of most other operating systems. The system can be characterized as a minimal kernel with the entire operating system running as a set of compartmentalized user-mode servers and drivers. By moving most of the code to unprivileged user-mode processes and restricting the powers of each one, we gain proper fault isolation and limit the damage bugs can do. Moreover, the system has been designed to survive and automatically recover from failures in critical components, such as device drivers, transparently to applications and without user intervention. We used this design to develop a highly-reliable, open-source, POSIX-conformant member of the UNIX family, which is freely available and has been downloaded over 100,000 times in the past 3 months.

1 INTRODUCTION

Operating systems are expected to function flawlessly, but, unfortunately, most of today's operating systems frequently fail. As discussed in Sec. 2, many problems stem from the monolithic design that underlies most common systems. All operating system functionality, for example, runs in kernel mode without proper fault isolation, so that any bug can potentially trash the entire system. Like other groups, we believe that reducing the operating system kernel is an important first step toward designing for reliability. In particular, running drivers and other core components in user mode helps to minimize the damage that may be caused by bugs in such code. However, our system explores an extreme in the design space of UNIX-like operating systems where the entire operating system is run as a collection of independent, tightly-restricted, user-mode processes. This structure, combined with several explicit mechanisms for transparent recovery from crashes and other failures, results in a highly-reliable, completely-multiserver operating system that still looks and feels like UNIX. While some of the mechanisms are well-known, and multiserver operating systems were first proposed years ago, to the best of our knowledge, we are the first to explore such an extreme decomposition of the operating system designed for reliability, while providing reasonable performance. Quite a few of these ideas and technologies have been around for a long time, but were often abandoned for performance reasons. We believe that the time has come to reconsider the choices made in common operating system design.

1.1 Contribution

The contribution of this work is the design and implementation of an operating system that takes the multiserver concept to an extreme in order to provide a dependable computing platform. The concrete goal of this research is to build a UNIX-like operating system that can transparently survive crashes of critical components, such as device drivers. As we mentioned earlier, the answer that we came up with is to break the system into manageable units and rigidly control the power of each unit. The ultimate goal is that a fatal bug in, say, a device driver should not crash the operating system; instead, the failed component should be automatically and transparently replaced by a fresh copy, and running user processes should not be affected. No existing system has this property. To achieve this goal, our system provides: simple, yet efficient and reliable IPC; disentangling of interrupt handling from user-mode device drivers; separation of policies and mechanisms; flexible, run-time operating system configuration; decoupling of servers and drivers through a publish-subscribe system; and error detection and transparent recovery for common driver failures.

We believe that we are the first to realize a fully-modular, open-source, POSIX-conformant operating system with self-healing properties. Although we primarily use a multiserver architecture because of its reliability, we will show that the system has many other benefits as well, for example, for system administration and programming. The system has been released (with all the source code) and over 100,000 people have downloaded it so far, as discussed later.
1.2 Paper Outline

We first introduce how operating system structures have evolved over time (Sec. 2). Then we proceed with a detailed discussion of the kernel (Sec. 3) and the organization of the user-mode servers on top of it (Sec. 4). We review how our multiserver operating system realizes a dependable computing platform and highlight some additional benefits (Sec. 5), and briefly discuss its performance (Sec. 6). Finally, we survey related work (Sec. 7) and draw conclusions (Sec. 8).

2 OPERATING SYSTEM DESIGN

This section illustrates the operating system design space with three typical structures and some variants thereof. While most structures are probably familiar to the reader, we introduce them explicitly to give an overview of the design space that has monolithic systems at one extreme and ours at the other. Concrete systems are discussed and compared to our system in Sec. 7. It is sometimes said that virtual machines and exokernels provide sufficient isolation and modularity for making a system safe. However, these technologies provide an interface to an operating system, but do not represent a complete system by themselves. The operating system running on top of a virtual machine or exokernel can have any of the following structures.

2.1 Monolithic Systems

Monolithic kernels provide rich and powerful abstractions of the underlying hardware. All operating system services are provided by a single, monolithic program that runs in kernel mode; applications run in user mode and can request services directly from the kernel. Monolithic designs have some inherent problems that affect their reliability. All operating system code, for example, runs at the highest privilege level without proper fault isolation, so that any bug can potentially trash the entire system. With millions of lines of code (LoC) and 1-16 bugs per 1000 LoC [22, 23], monolithic systems are likely to contain many bugs. Since 70% to 85% of all operating system crashes are caused by device drivers [2, 20], running untrusted, third-party code in the kernel also diminishes the system's reliability. From a high-level reliability perspective, a monolithic kernel is unstructured. The kernel may be partitioned into domains, but there are no protection barriers enforced between the components. Two simplified examples, Linux and Mac OS X, are given in Fig. 1.

Figure 1: Two typical monolithic systems: (a) Vanilla Linux and (b) Mac OS X. Their properties are discussed in Sec. 2.1.

2.2 Single-Server Systems

A single-server system has a reduced kernel, and runs a large fraction of the operating system as a single, monolithic user-mode server. In terms of reliability, this setup adds little over monolithic operating systems, because there still is a single point of failure. The only gain in case of an operating system crash is a faster reboot. An advantage of this setup is that it preserves a UNIX environment while one may experiment with a microkernel approach. Legacy applications targeted at the monolithic operating system server can coexist with novel applications. The combination of legacy applications and real-time or secure components allows for a smooth transition to a new computing environment. Mach-UX [1] was one of the first systems to run BSD UNIX in user mode on top of the Mach 3 microkernel, as shown in Fig. 2(a). Another example, shown in Fig. 2(b), is Perseus [13], running Linux and some specialized components on top of the L4 microkernel.

Figure 2: Two typical single-server systems: (a) Mach-UX and (b) Perseus. Their properties are discussed in Sec. 2.2.
2.3 Multiserver Systems

In a multiserver design, the operating system environment is formed by a set of cooperating servers. Untrusted, third-party code such as device drivers can be run in separate, user-mode modules to prevent faults from spreading. High reliability can be achieved by applying the principle of least authority [15] and tightly controlling the powers of each module. A multiserver design also has other advantages. The modular structure, for example, makes system administration easier and provides a convenient programming environment, as discussed in Sec. 5. Several multiserver operating systems exist. An early system is MINIX [21], which distributed operating system functionality over two user-mode servers, but still ran the device drivers in the kernel, as shown in Fig. 3(a). More recently, IBM Research designed SawMill Linux [4], a multiserver environment on top of the L4 microkernel, as illustrated in Fig. 3(b). While the goal was a full multiserver variant of Linux, the project never passed the stage of a rudimentary prototype, and was abandoned when the people working on it left IBM.

Figure 3: Two typical multiserver systems: (a) MINIX and (b) SawMill Linux. Their properties are discussed in Sec. 2.3.

2.4 Designing for Reliability

Although several multiserver systems exist, either in design or in prototype implementation, none of them was designed with the explicit goal of being highly reliable. In the rest of this paper, we present a new, fully-modular, highly-reliable, open-source, POSIX-conformant multiserver operating system that is freely available for download and has been widely tested. Some commercial systems like Symbian OS and QNX [8] are also based on multiserver designs. However, they are proprietary and distributed without source code, so it is difficult to verify their claims. Still, the fact that innovative companies use multiserver designs demonstrates the viability of the approach. We will now discuss our multiserver architecture in detail, and show why it is a reliable system and how it can survive crashes of operating system components.

3 THE KERNEL ARCHITECTURE

The kernel is responsible for low-level functionality that cannot be handled in user space, such as IPC, process scheduling, and interrupt handling. Ours consists of fewer than 4000 lines of code (LoC), which makes it easy to understand. The kernel provides only the most elementary mechanisms, whereas the user-mode servers implement the policies that drive the operating system.

3.1 Interprocess Communication

Interprocess communication (IPC) is of crucial importance in a multiserver system. IPC allows user processes to request operating system services and enables cooperation between servers and drivers. We compared many alternatives to find a suitable set of IPC primitives that is simple, efficient, and reliable. Finding IPC primitives that do not hang the system when message senders and receivers crash during a request-reply sequence is far from trivial. Consequently, our primitives have evolved as the system matured. In this paper we describe the primitives of the version currently in test; these differ in minor ways from those in previous versions. Our IPC is characterized by rendezvous message passing using small, fixed-length messages. Rendezvous is a two-way interaction without intermediate buffering. The interaction is fully synchronous, which means that the first process that is ready to interact blocks and waits for the other. When both processes are ready, the message is copied from the sender to the receiver and both resume execution. Although rendezvous message passing is easier to implement than a buffered scheme, it is less flexible and sometimes even inadequate. Some events are inherently asynchronous or need to be communicated to higher-level processes without risking a deadlock. Therefore, we complemented the synchronous primitives with a simple nonblocking event notification mechanism.
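The rendezvous discipline described above can be illustrated with a small, self-contained sketch (the names, such as ipc_request and struct proc, are ours, not the system's actual declarations): the first party to arrive blocks, and the message is copied only when both are ready, so no kernel buffering is ever needed.

```c
#include <assert.h>

/* Hypothetical model of rendezvous message passing: no kernel buffering,
 * the first party to arrive blocks until its peer is ready. */
enum state { RUNNING, SENDING, RECEIVING };

typedef struct { int m_source; int m_type; } message;

struct proc {
    int nr;              /* process number */
    enum state state;
    int peer;            /* process we are blocked on, or -1 */
    message inbox;       /* where a received message is copied */
    message outbox;      /* message we are trying to send */
};

/* Sender side: if the destination is already waiting for us, copy the
 * message and unblock both; otherwise the sender blocks. */
void ipc_request(struct proc *snd, struct proc *dst) {
    if (dst->state == RECEIVING && dst->peer == snd->nr) {
        dst->inbox = snd->outbox;
        dst->inbox.m_source = snd->nr;   /* kernel patches in the source */
        dst->state = snd->state = RUNNING;
    } else {
        snd->state = SENDING;
        snd->peer = dst->nr;
    }
}

/* Receiver side: if a sender is already blocked on us, complete the
 * rendezvous; otherwise block until a matching sender arrives. */
void ipc_receive(struct proc *rcv, struct proc *src) {
    if (src->state == SENDING && src->peer == rcv->nr) {
        rcv->inbox = src->outbox;
        rcv->inbox.m_source = src->nr;
        rcv->state = src->state = RUNNING;
    } else {
        rcv->state = RECEIVING;
        rcv->peer = src->nr;
    }
}
```

Note that when a client issues its request before the server is ready to receive, the client simply stays blocked; at no point is a message stored inside the kernel.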
IPC Primitives
The kernel supports four IPC primitives. The standard primitives to send messages are IPC REQUEST and IPC REPLY, typical of synchronous client-server communication. A request blocks the caller until the reply has been received. The message type and arguments can be freely set by the caller, but the source is reliably patched into the message by the kernel. In addition, the nonblocking IPC NOTIFY primitive can be used to send notification messages. This is used to pass kernel events, user-configurable events, or callback requests. Normally only the event set is passed, but the kernel also includes the message source for callback requests, so that the receiver can query the sender.
Finally, the IPC SELECT primitive can be used to receive a specified message. The caller can pass the message source or set of events it is interested in, and will be blocked until such a message arrives. Messages are prioritized. Event notifications, such as hardware interrupts and timeouts, have the highest priority. Callback notifications have a lower priority, as they cannot be delivered together with other notifications. Finally, request messages have the lowest priority.
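The fixed-length messages used by these primitives, and the kernel-controlled endpoints covered in the design principles that follow, can be sketched as below (the field and macro names are illustrative, not the system's actual declarations). The message is a union of all message layouts, so its size is fixed at compile time as the largest member; an endpoint combines a process-table slot with a generation count, so IPC aimed at an exited process cannot reach a new process that reuses its slot.

```c
#include <assert.h>

/* Hypothetical fixed-length message: a union of all message layouts,
 * sized at compile time as the largest member of the union. */
typedef struct {
    int m_source;        /* reliably filled in by the kernel */
    int m_type;
    union {
        struct { int pid; int status;             } m_exit;
        struct { long offset; int fd; int nbytes; } m_rw;
        struct { char path[32];                   } m_exec;
    } u;
} message;

/* Endpoint = generation number combined with process-table slot, so a
 * reused slot yields a different endpoint than its previous occupant. */
#define NR_SLOTS 256
#define MAKE_ENDPOINT(gen, slot) ((gen) * NR_SLOTS + (slot))
#define ENDPOINT_SLOT(e)  ((e) % NR_SLOTS)
#define ENDPOINT_GEN(e)   ((e) / NR_SLOTS)
```

With this encoding, a stale endpoint held by a client simply fails to match any live process, rather than being delivered to an unrelated newcomer.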
Design Principles
We designed the IPC primitives to be simple, efficient, and reliable to reduce the amount of code and increase understandability. Complicated optimizations to improve resource usage often lead to complex, buggy code, so we have tried to keep the code straightforward. For example, only small, fixed-length messages are used. Messages are a union of different message types, and their size is determined at compile time as the largest of all types in the union. To ensure messages are reliably delivered to the right destination, IPC endpoints are under the control of the kernel. An IPC endpoint is formed by the combination of a process' slot number and the slot's generation number, which is increased with each new process. This ensures that IPC directed to an exited process cannot end up at a process that reuses its slot. Our IPC design eliminates the need for dynamic resource allocation, both in the kernel and in user space. The standard request-reply sequence uses a rendezvous, so that no message buffering is needed. If the destination is not waiting, IPC REQUEST blocks the sender. Similarly, a receiver is blocked on IPC SELECT when no IPC is available. Messages are never buffered in the kernel, but always directly copied from sender to receiver. No additional copies are required, which speeds up IPC. The asynchronous IPC NOTIFY mechanism is also not susceptible to resource exhaustion. Event notifications are typed and at most one bit per type is saved. All pending notifications can be stored in a compact bitmap that is statically declared as part of the process table. Multiple pending notifications of the same type are merged. Although the amount of information that can be passed this way is limited, this design was chosen for its simplicity, reliability, and low memory requirements.

Protection Mechanisms
Since IPC is a powerful construct, we included several mechanisms to restrict who can do what. First, we restrict the set of IPC primitives available to each process. User processes, for example, are allowed to use only IPC REQUEST. Second, we restrict who can request services from whom. A user process doing I/O, for example, cannot communicate directly with device drivers, but needs to send its requests to the file server instead. These restrictions also help to prevent deadlocks. Third, we restrict the use of event notifications. Only trusted processes, such as the process manager and file server, can use them. Callback events, in contrast, are also available to untrusted processes, such as drivers. All these protection mechanisms are implemented by means of bitmaps that are statically declared as part of the process table. This is space efficient, prevents resource exhaustion, and allows for fast permission checks, since only simple bit operations are required.

3.2 Process Scheduling
Scheduling is done using a fixed number of prioritized queues. The processes on each queue are kept in a linked list and are scheduled round robin. The scheduler simply finds the highest-priority populated queue and selects the first process on it to run. Whenever a process becomes ready, it is put at the head of its queue if it still has some quantum left; a process goes to the rear of its queue only when it has no quantum left. While somewhat counterintuitive, this works because it ensures that processes doing a system call are not moved to the rear of the queue. It also makes the system responsive, since processes that were blocked for I/O can run immediately once the I/O is done. Each time a process consumes a full quantum, it is degraded in priority. Periodically, the priority of all processes is upgraded to prevent all processes from ending up in the lowest-priority queue. Since I/O-bound processes consume fewer quanta than CPU-bound processes, they will have a higher average priority and are likely to be scheduled when the I/O finishes.

3.3 Interrupt Handling
Another important responsibility of the kernel is interrupt handling. Because this cannot be done in user space, we disentangled interrupt handlers and device drivers. User-mode device drivers can only instruct the kernel to transform specific interrupts into notification messages and must do all further processing themselves. After registration, drivers can tell the kernel to enable and disable hardware interrupts. The kernel catches all hardware interrupts with a generic interrupt handler that looks up which drivers are associated with the IRQ line, and sends a nonblocking notification message to each of them. As a side effect, the generic interrupt handler gathers randomness for the random number generation device. The only exception to the above is that the clock driver (CLOCK) defines its own handler as part of the kernel, as discussed below. This handler is simple and usually
does only accounting. When more work is needed, such as scheduling another process, a notification is sent to CLOCK for further processing. All the real work is done at the process level, usually in a user-mode device driver, but sometimes also in a kernel task. This helps to achieve a low interrupt latency—since processes can be preempted—and makes the system suitable for real-time applications.
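The queue discipline from Sec. 3.2 is small enough to sketch in full. This is a simplified model under our own assumptions; the queue count, quantum accounting, and function names are ours, and policy details such as refilling quanta and degrading priorities are omitted.

```c
#include <assert.h>
#include <stddef.h>

#define NR_QUEUES 16     /* illustrative; queue 0 is the highest priority */

struct proc {
    int nr;
    int priority;        /* which queue this process belongs on */
    int ticks_left;      /* remaining quantum */
    struct proc *next;
};

static struct proc *queue[NR_QUEUES];

/* Ready processes with quantum left go to the head of their queue (e.g.
 * returning from a system call or I/O); processes that used up their
 * quantum go to the rear for round-robin fairness. */
void enqueue(struct proc *p) {
    struct proc **q = &queue[p->priority];
    if (p->ticks_left > 0) {
        p->next = *q;                    /* head of the queue */
        *q = p;
    } else {
        p->next = NULL;                  /* rear: quantum exhausted */
        while (*q != NULL)
            q = &(*q)->next;
        *q = p;
    }
}

/* Pick the first process on the highest-priority populated queue. */
struct proc *pick_proc(void) {
    for (int i = 0; i < NR_QUEUES; i++)
        if (queue[i] != NULL) {
            struct proc *p = queue[i];
            queue[i] = p->next;
            return p;
        }
    return NULL;                         /* no ready process: idle */
}
```

This is why a process unblocked after I/O runs immediately: it still has quantum left, so it re-enters at the head of its queue, ahead of CPU-bound processes that exhausted theirs.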
3.4 Kernel Model
Our kernel design is different from that of most UNIX-like systems. In contrast to ordinary UNIX kernels, ours cleanly follows the process-oriented multiserver design, as illustrated in Fig. 4. It provides two kernel-mode servers, SYS and CLOCK, also known as kernel tasks, with special privileges to support the user-mode operating system servers. Although the tasks share the kernel's address space and run in kernel mode, they are treated like normal processes.

Figure 4: The kernel closely follows the process-oriented multiserver design. Services can be requested with ordinary, synchronous request messages. Kernel events are transformed into asynchronous notification messages.

Kernel Call
Apart from the actual service requested, a kernel call from a user-mode server to a kernel task is similar to a system call from a user process to an operating system server. Both calls use the IPC primitives discussed in Sec. 3.1 and result in a synchronous request message being sent from one process to another. The services provided by the kernel tasks are discussed below. Only a tiny fraction of the kernel is responsible for handling hardware interrupts, IPC traps, and exceptions. Whenever such an event happens, the CPU saves the state of the currently-running process and invokes the associated service routine that has been registered by the kernel. When the event has been processed, the kernel picks a (possibly different) process to run, restores that process' state, and tells the CPU to resume normal execution. To keep the kernel simple, kernel reentries are forbidden.

Real-Time Properties
Because the kernel is process oriented, it forms a suitable base for real-time systems. The kernel is locked only when this is absolutely required to prevent race conditions. Whenever a hardware interrupt occurs, the currently running kernel task or user process is preempted, the interrupt is transformed into a message, and another process is scheduled. If the interrupt is for a high-priority device, the associated driver is likely to be scheduled soon. Building a complete real-time operating system would require extending the scheduler with real-time primitives.

3.5 The Kernel Tasks
The kernel contains two independently-scheduled processes called tasks (to distinguish them from the user-mode OS components). They are in the kernel's address space and run in kernel mode because they perform privileged operations that cannot be done in user space in a portable way, such as device I/O.

3.5.1 System Task (SYS)
SYS is the interface to the kernel for all user-mode servers and drivers. All kernel calls in the system library are transformed into request messages that are sent to SYS, which processes the requests and sends reply messages. SYS never takes initiative by itself; it is always blocked waiting for a new request message. The kernel calls handled by SYS can be grouped into several categories, including process management, memory management, copying data between processes, device I/O and interrupt management, access to kernel data structures, and clock services. An overview of common kernel calls is given in Fig. 5.

Kernel Call     Purpose
SYS FORK        Fork a process; copy parent slot
SYS EXEC        Execute a process; initialize slot
SYS EXIT        Exit a process; clear process slot
SYS NEWMAP      Assign memory segment to process
SYS VIRCOPY     Copy data using virtual addressing
SYS DEVIO       Read or write a single I/O port
SYS IRQCTL      Set or reset an interrupt policy
SYS PRIVCTL     Assign system process' privileges
SYS GETINFO     Get a copy of kernel information
SYS TIMES       Get process times or kernel uptime
SYS SETALARM    Set or reset a synchronous alarm

Figure 5: A selection of common kernel calls. All calls require privileged operations and are handled by SYS.

3.5.2 Clock Task (CLOCK)
CLOCK is responsible for accounting of CPU usage, scheduling another process when a process' quantum expires, managing watchdog timers, and interacting with the hardware clock. It does not have a publicly-accessible user interface like SYS. When the system starts up, CLOCK programs the hardware clock's frequency and registers an interrupt handler that is run on every clock tick. The handler does only basic integer operations, that is, it increments a process' user or system time and decrements the scheduling quantum. If a new process must be scheduled or an alarm is due, a notification is sent to CLOCK to do the real work at the task level. This minimizes the hardware interrupt latency, because kernel tasks can be preempted. Although CLOCK has no direct interface from user space, its services can be accessed through the kernel calls handled by SYS. The most important call is SYS SETALARM, which allows system processes to schedule a synchronous alarm that causes a 'timeout' notification upon expiration. The alarm is synchronous because the notification message from CLOCK is only delivered when the client indicates it is ready to receive it. As an aside, the POSIX alarm that is available to ordinary user applications is handled by the user-mode process manager server, and causes an asynchronous SIGALRM signal.
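As described in Sec. 3.5.1, a kernel call is nothing more than a synchronous request message to SYS. A hypothetical library wrapper might look like the sketch below; the message layout, call numbers, and the in-process stand-in for the SYS task are all our own inventions, not the system's actual ABI.

```c
#include <assert.h>

/* Illustrative message layout and call numbers; not the actual ABI. */
typedef struct { int m_source; int m_type; long m_arg1; long m_retval; } message;

enum { SYS_SETALARM_NR = 24 };      /* hypothetical call number */
enum { OK = 0, E_INVAL = -22 };

static long alarm_ticks;            /* pretend kernel state */

/* Stand-in for the SYS task: receive a request, perform it, and fill in
 * the reply. In the real system this is a rendezvous with the kernel. */
static void sys_task(message *m) {
    switch (m->m_type) {
    case SYS_SETALARM_NR:
        if (m->m_arg1 < 0) { m->m_retval = E_INVAL; break; }
        alarm_ticks = m->m_arg1;    /* arm the synchronous alarm */
        m->m_retval = OK;
        break;
    default:
        m->m_retval = E_INVAL;      /* unknown kernel call */
    }
}

/* Library wrapper: build the request, rendezvous with SYS, and return
 * the status code from the reply message. */
int sys_setalarm(long ticks) {
    message m = {0};
    m.m_type = SYS_SETALARM_NR;
    m.m_arg1 = ticks;
    sys_task(&m);                   /* stands in for IPC REQUEST/REPLY */
    return (int) m.m_retval;
}
```

Because every kernel call goes through the same request-reply path, adding a new one is a matter of defining a message layout and a handler case, with no change to the IPC machinery itself.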
4 THE USER-MODE SERVERS

On top of the kernel we have implemented a multiserver operating system. The core components of this system are shown in Fig. 6. Apart from the device drivers and user processes, this constitutes the trusted computing base. Most of the servers are relatively small and simple. The sizes range from approximately 1000 to 3000 LoC per server, which makes them easy to understand and maintain. The components are discussed below. We first give some examples to illustrate how our multiserver operating system actually works. Fig. 6 also shows some typical IPC interactions initiated by user processes. Although the POSIX operating system interface is implemented by multiple servers, system calls are transparently targeted to the right server by the system libraries. Four examples are given below:

(1) A user process that wants to create a child process calls the fork() library function, which sends a request message to the process manager (PM). PM verifies that a process slot is available, asks the memory manager (MM) to allocate memory, and instructs the kernel to create a copy of the process.

(2) A read() or write() call, in contrast, is sent to FS. If the requested block is available in the buffer cache, FS asks the kernel to copy it to the user. Otherwise, it first sends a message to the disk driver asking it to retrieve the block from disk. The driver sets an alarm, commands the disk controller through a device I/O request to the kernel, and awaits the hardware interrupt or timeout notification. For character devices the user process may be suspended until the driver notifies FS that the data is ready.

(3) Additional servers and drivers can be started on the fly by requesting the reincarnation server (RS). RS then forks a new process, assigns all needed privileges, and, finally, executes the given path in the child process (not shown in the figure). Information about the new system process is published in the data store (DS), which allows parts of the operating system to subscribe to updates in the operating system configuration.

(4) Although not a system call, it is interesting to see what happens if a user or operating system process causes an exception, for example, due to an invalid pointer. In this event, the kernel's exception handler notifies PM, which transforms the exception into a signal or kills the process when no handler is registered. Recovery in case of operating system failures is discussed below.

Figure 6: The core components of the full multiserver operating system, and some typical IPC paths. Top-down IPC uses synchronous requests, whereas bottom-up IPC is done with asynchronous notifications.
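The transparent targeting of POSIX calls to different servers, as in examples (1) and (2) above, amounts to a small static binding in the system libraries. A hypothetical sketch follows; the endpoints, call numbers, and function names are invented for illustration.

```c
#include <assert.h>

/* Hypothetical server endpoints and call numbers; the real values would
 * live in the system headers. */
enum { PM_EP = 0, FS_EP = 1 };
enum { CALL_FORK = 1, CALL_EXIT = 2, CALL_READ = 10, CALL_WRITE = 11 };

typedef struct { int m_type; long m_arg1; } message;

/* Each POSIX call is statically bound to the server implementing it. */
static int call_target(int callnr) {
    switch (callnr) {
    case CALL_FORK:
    case CALL_EXIT:
        return PM_EP;        /* process management goes to PM */
    case CALL_READ:
    case CALL_WRITE:
        return FS_EP;        /* file I/O goes to FS */
    default:
        return -1;           /* unknown call */
    }
}

/* Library stub: the caller never sees which server is involved. */
int do_syscall(int callnr, message *m) {
    int target = call_target(callnr);
    if (target < 0)
        return -1;
    m->m_type = callnr;
    /* here the real library would do IPC REQUEST to `target` and block
     * until the matching IPC REPLY arrives */
    return target;
}
```

Since the binding lives entirely in the library, applications remain unaware that fork() and read() are served by different processes.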
Design Principle
The general design principle that led to the above set of servers is that each process should be limited to its core business. Having small, well-defined services helps to keep the implementation simple and understandable. As in the original UNIX philosophy, each server has limited responsibility and power, as is reflected in its name. For example, FS must interact with drivers, but should not be checking for odd driver failures such as nonresponsiveness. Although FS could manage driver timeouts itself, this would complicate its design and implementation. Therefore, it relies on a separate component, RS, which is responsible for the system's well-being. If FS hangs on a driver, RS will detect that the driver is not responding, and will kill the driver and revive FS. Although killing a nonresponsive driver may seem harsh, a properly designed driver must adhere to the protocol and return an error code to FS if it cannot fulfill a request.
Multiserver Protocol
Although our IPC facilities are fairly reliable by design, as discussed in Sec. 3.1, they cannot prevent deadlocks. Therefore, we devised a multiserver protocol that ensures that synchronous messages are sent in only one direction. There is a loose layering based on who can make requests to whom: user processes can call the servers, servers can call each other and the drivers, and servers and drivers can call the kernel. IPC in the opposite direction is done using the nonblocking notification mechanism. Another aspect of the multiserver protocol is that we try to minimize copying to prevent loss of performance. The number of copies required for I/O is precisely the same as in a monolithic system. Although there are more context switches, because a user process, the file server, and a driver must interact, no intermediate copies are required. For example, I/O for character devices is not buffered; all data is copied directly between the user process and the device driver.
4.1
4.1.2
FS manages the file system. It is an ordinary file server that handles standard POSIX calls such as open(), read(), and write(). More advanced functionality includes support for symbolic links and the select() system call. FS is
also the interface to the network server. For performance reasons, file system blocks are buffered in FS’ buffer cache. To maintain file system consistency, however, crucial file system data structures use write-through semantics, and the cache is periodically written to disk. Since the file server runs as an isolated process that is fully IPC driven, it can be replaced with a different one to serve other file systems, such as FAT. Moreover, it should be relatively straightforward to transform FS into a network file server that runs on a remote host. Device Driver Handling Because device drivers can be dynamically configured, FS maintains a table with the mapping of major numbers onto specific drivers. As discussed below, FS is automatically notified of changes in the operating system configuration through a publishsubscribe system. This decouples the file server and the drivers it depends on. A goal of our research is to automatically recover from common driver failures without human intervention. When a disk driver failure is detected, the system can recover transparently by replacing the driver and rewriting the blocks from FS’ buffer cache. For character devices, transparent recovery sometimes is also possible. Such failures are pushed to user space, but may be dealt with by the application if the I/O request can be reissued. A print job, for example, can be reissued by the print spooler system.
Core Components
This section discusses the core components shown in Fig. 6. We will focus on design decisions that are specific to the multiserver aspects of the system. 4.1.1
File Server (FS)
Process Manager (PM)
PM is responsible for process management such as cre-
ating and removing processes, assigning process identifiers and priorities, and controlling the flow of execution. Furthermore, PM maintains relations between processes, such as process groups and parent-child blood lines. The latter, for example, has consequences for exiting processes and accounting of CPU time. Although the kernel provides mechanisms, for example, to set up the CPU registers, PM implements the process management policies. As far as the kernel is concerned all processes are similar; all it does is schedule the highest-priority ready process.
Signal Handling PM is also responsible for POSIX signal handling. When a signal is to be delivered, by default, PM either ignores it or kills the process. Ordinary user processes can register a signal handler to catch signals. In this case, PM interrupts pending system calls and puts a signal frame on the stack of the process to run the handler. This approach is not suitable for system processes, however, as it interferes with IPC. Therefore, we implemented an extension to the POSIX sigaction() call so that system processes can request PM to transform signals into notification messages. Since event notification messages have the highest priority of all message types, signals are delivered promptly.

4.1.3 Memory Manager (MM)

To facilitate ports to different architectures, we use a hardware-independent, segmented memory model. Memory segments are contiguous, physical memory areas. Each process has a text, stack, and data segment. System processes can be granted access to additional memory segments, such as the video memory or the RAM disk memory. Although the kernel is responsible for hiding the hardware-dependent details, MM does the actual memory management. MM maintains a list of free memory regions, and can allocate or release memory segments for other system services. Currently MM is integrated into PM and provides support for Intel's segmented memory model, but work is in progress to split it out and offer limited virtual memory capabilities, for example, shared libraries.

We will not support demand paging, however, because we believe physical memory is no longer a limited resource in most domains. We strive to keep the code simple and eliminate complexity whenever possible. Swapping segments to disk would be easy to add, but in the interest of simplicity we have not done so.
4.1.4 Reincarnation Server (RS)

RS is the central component responsible for managing all operating system servers and drivers. While PM is responsible for process management in general, RS deals with only privileged processes. It acts as a guardian and ensures the liveness of the operating system. Administration of system processes also goes through RS. A utility program, service, provides the user with a convenient interface to RS. It allows the system administrator to start and stop system services, (re)set their policies, or gather statistics. For optimal flexibility in specifying policies, a shell script can be set to run on certain events, including device driver crashes. Such a policy script might deliberately not restart a failed component, which is useful, for example, for development purposes. Another policy might use a binary exponential backoff protocol when restarting components to prevent clogging the system due to repeated failures. In any event, the problems are logged so that the system administrator can always find out what happened. Optionally, an e-mail can be sent to a remote administrator. Failed components can be restarted from a fresh copy on disk, except for the disk driver, which is restarted from a copy kept in RAM.
Fault Set The fault set that RS deals with consists of protocol errors, transient failures, and aging bugs. Protocol errors mean that a system process does not adhere to the multiserver protocol, for example, by failing to respond to a request. Transient failures are problems caused by specific configuration or timing issues that are unlikely to recur. Aging bugs are implementation problems that cause a component to fail over time, for example, when it runs out of buffers due to memory leaks. Logical errors, where a server or driver perfectly adheres to the specified system behavior but fails to perform the actual request, are excluded. An example of a logical error is a printer driver that accepts a print job and confirms that the printout was successfully done, but, in fact, prints garbage. Such bugs are virtually impossible to catch in any system.

Fault Detection and Recovery During system initialization, RS adopts all processes in the boot image as its children. System processes that are started later also become children of RS. This ensures immediate crash detection, because PM raises a SIGCHLD signal that is delivered to RS when a system process exits. In addition, RS can check the liveness of the system. If the policy says so, RS does a periodic status check and expects a reply in the next period. Failure to respond will cause the process to be killed. The status requests and the consequent replies are sent using nonblocking event notifications. Whenever a problem is detected, RS can replace the malfunctioning component with a fresh copy, although the associated policy script might choose otherwise.

4.1.5 Data Store (DS)

DS is a small database server with publish-subscribe functionality. It serves two purposes. First, system processes can use it to store some data privately. This redundancy is useful in the light of fault tolerance. A restarting system service, for example, can request state that it lost when it crashed. Such data is not publicly accessible. Second, the publish-subscribe mechanism is the glue between operating system components. It provides a flexible interaction mechanism and elegantly reduces dependencies by decoupling producers and consumers. A producer can publish data with an associated identifier. A consumer can subscribe to selected events by specifying the identifiers or regular expressions it is interested in. Whenever a piece of data is updated, DS automatically broadcasts notifications to all dependent components. Although we currently do not do this, in the future, drivers could announce every request message and I/O completion to DS. In this manner, if a driver crashes, its replacement could find out what work was pending, similar to the shadow drivers in Nooks [19].

Naming Service IPC endpoints are formed from the process and generation numbers, which are controlled and managed by the kernel. Because every process has a unique IPC endpoint, system processes cannot easily find each other. Therefore, we introduced stable identifiers that consist of a natural-language name plus an optional number. The identifiers are globally known. Whenever a system process is (re)started, RS publishes its identifier and the associated IPC endpoint at DS for future lookup by other system services. In contrast to earlier systems, such as Mach, our naming service is a higher-level construction that is realized in user space. Mach provided stable IPC endpoints in the kernel, namely the 'port' mechanism, to which a client and server could attach. This required bookkeeping in the kernel and did not solve the problems introduced by exiting and reappearing system services. We have intentionally pushed all this complexity to user space.

Error Handling Since fault tolerance is an explicit design goal, the naming service is an integral part of the design. The publish-subscribe mechanism of DS makes it very suitable to inform other processes of changes in the operating system. Moreover, recovery of, say, a driver is made explicit to the services that depend on it. For example, FS subscribes to the identifiers of the disk drivers. When the system configuration changes, DS notifies FS about the event. FS then calls back to find out what happened. If FS discovers that a driver has been restarted, it tries to recover transparently to the user.
4.1.6 Network Server (INET)

The network server, INET, implements TCP/IP in user space. The interface offered to the application programmer is BSD sockets. As with other I/O handles, sockets are managed by the file server, but FS transparently forwards networking requests to INET, which manages both TCP streams and UDP datagrams. Like FS, INET requests DS to notify it about the configuration of Ethernet drivers, and can handle driver crashes. Since the TCP protocol prescribes retransmission of lost packets (and lost datagrams are explicitly allowed by the UDP protocol), INET can fully recover from Ethernet driver failures, transparent to the user.

4.2 Optional Components

In addition to the core components discussed above, several operating system services are started on the fly with help of RS. The most important ones are discussed here.

4.2.1 Information Server (IS)

Debugging dumps are centrally managed by IS. Instead of polluting each server with a mechanism to make debug dumps, servers only have to implement a method to allow IS to retrieve a copy of their data structures, and IS handles all user interaction. If a function key is pressed, the associated data structure is copied to IS' memory space, a formatted dump is displayed on the primary console, and the data are automatically logged for later inspection. Although IS is an optional component, it is automatically started by the startup scripts for convenience. To encourage development and experimentation, IS provides screendumps of the system's most crucial data structures at any time. Examples include the process table of the kernel or PM, the device driver mappings at FS, and the status of privileged processes at RS.

4.2.2 Device Drivers

All operating systems hide the raw hardware under a layer of device drivers. Consequently, we have implemented drivers for ATA, S-ATA, floppy, and RAM disks, keyboards, displays, audio, printers, serial lines, various Ethernet cards, etc. Although device drivers can be very challenging technically, they are not very interesting in the operating system design space. What is important, though, is that each of ours runs as an independent user-mode process to prevent faults from spreading outside its address space and to make it easy to replace a crashed or looping driver without a reboot. This is the self-healing property referred to in the title. While other people have measured the performance of user-mode drivers [10], no currently-available system is self-healing like this. We are obviously aware that not all bugs can be eliminated by restarting a failed driver, but since the bugs that make it past driver testing tend to be timing bugs or memory leaks rather than algorithmic bugs, a restart often does the job.

4.2.3 X Window System (X)

To demonstrate that our ideas are practical in real UNIX-like systems, we have ported a recent version of the X Window System. X provides a client-server interface between display hardware and the desktop environment. We have successfully run large GUI applications, including the Firefox browser, over a network. A downside of X is that it is a large, monolithic window system, but it clearly shows that our system can run real-world software. Future work might include porting a small, modular window system.

5 BENEFITS OF THIS DESIGN

Although the main reason for using a multiserver architecture is achieving extremely high reliability, there are other advantages to this approach. In this section, we will highlight some benefits of our system for programmers, system administrators, and end users.

5.1 Programming Environment

With the multiserver design in place, development of new components is much easier due to a shortened development cycle. With help of RS, drivers can be started and tested just like other user programs, lengthy builds of the entire operating system are not needed, and no reboot of the system is required to test a new driver. This may increase programmer productivity and lead to a shorter time-to-market. Since drivers run as isolated user-mode processes, the consequences of a bad driver are limited and debugging becomes easier. After a driver crash, RS simply logs the problem, and the developer can continue immediately.
There is no need for a reboot and a check of the file system. Crashes result in core dumps that can be inspected and debugged using normal tools. The programming model is more convenient, because the user-mode programming model is much closer to the POSIX standard than the restricted kernel API. This lowers the barrier for experimentation by new users and might improve driver quality. Furthermore, user-mode drivers enforce a proper design respecting interfaces, which leads to cleaner code. Finally, there is good accountability. RS logs all component crashes so that it is clear where the error occurred. This might have legal liability implications for the developer, and lead to more carefully crafted drivers.

5.2 System Administration

The multiserver approach also makes system administration much easier due to the presence of many small, well-understood, self-contained modules instead of a massive kernel. As mentioned above, the kernel consists of less than 4000 LoC and the core servers are even smaller. This improves the system's maintainability, because the components are easy to understand and can be maintained independently from each other, as long as the interfaces between them are respected. Because the operating system can be dynamically configured, a system administrator can quickly respond to security breaches. Instead of applying a patch and rebooting the system, a malfunctioning device driver can be replaced with a new one on the fly. This allows updates without loss of service or downtime.

Configurability The multiserver model makes it easy to configure a system. The core components are always present to provide the basic operating system services, but optional components can be installed and loaded at a later time, without having to reboot the system. Small consumer appliances, such as mobile phones and PDAs, and embedded systems need a small, configurable operating system. Trying to squeeze a large monolithic system like Linux into a small device requires much more effort than mixing and matching parts from a modular system that started small.

5.3 End-User Reliability

In the previous sections we already discussed how multiple, independent servers and drivers help to improve the end-user reliability of an operating system. Our system's major reliability features are briefly summarized below.

Structural Measures The system is designed to prevent problems from occurring in the first place. For example, in the design of our device drivers we use shared library code. This helps to improve reliability, since the code is thoroughly tested by the many drivers that use it. We also postpone initialization of drivers until their first use so that they cannot hang the system at boot time. The use of separate memory segments protects against many types of buffer overruns. Code injection is no longer possible because the text segment is read-only and the stack and data segments are not executable. Consequently, even if a buffer overrun injects a worm or virus onto the stack, this code cannot be executed. While other types of attacks exist, for example, the return-to-libc attack, they are harder to exploit.

Problem Detection Since all operating system services run as separate user-mode processes, we can detect many problems, just as we can for ordinary applications. Invalid pointers or illegal access attempts are caught by the MMU and will cause a signal from PM. The scheduler's feedback mechanism tames infinite loops by lowering a process' priority. Furthermore, a process' security policy is checked whenever it makes a system call.

Security Policies Each user, server, and driver process has an associated policy that specifies what it can do. Only the minimal privileges needed to perform the task are given, according to the principle of least authority. In contrast, in a monolithic operating system it usually is not possible to precisely restrict individual components. Driver access to I/O ports can be limited by a range stored in the kernel's process table. In this way, if, say, the printer driver tries to write to the disk's I/O ports, the kernel can prevent the access. Stopping rogue DMA is not possible with current hardware, but as soon as an I/O MMU is added, we can prevent that, too. Furthermore, we can tightly restrict the IPC capabilities of each process, as discussed in Sec. 3.1. User applications, for example, can use only IPC REQUEST, and only to a subset of the operating system servers.

Fault Detection and Recovery Sec. 4.1 discussed how RS deals with crashes at the operating system level. RS provides immediate crash detection and does periodic status checks if the policy says so. Depending on the policy, failing components are automatically restarted. If a block device driver fails, FS can provide transparent recovery to the application level by flushing its buffer cache. For character device drivers, the error is pushed to the user level, where transparent recovery is sometimes possible. Typically, daemons need to be rewritten to retry instead of giving up on the first failure.
6 PERFORMANCE

Multiserver systems based on microkernels have been criticized for decades because of alleged performance problems. We argue that a modular system need not be slow due to the additional copying and context-switching overhead introduced when multiple servers cooperate to perform a task. While this was the case for some early microkernel systems, current multiserver systems have competitive performance. To illustrate the case, BSD UNIX on top of the early Mach microkernel was well over 50% slower than the normal version of BSD UNIX, which led to the impression that microkernels are slow. Modern microkernels, however, have proven that high performance actually can be realized. L4 Linux on top of L4, for example, has a performance loss of about 5% [6]. Another project recently demonstrated that a user-mode gigabit Ethernet driver can achieve the same performance as a kernel-mode driver up to 750 Mbps; above that, throughput for the user-mode driver dropped by 7% [10].

We have done extensive measurements of our system and presented the results in a technical report [7]. We can summarize these results (done on a 2.2 GHz Athlon) as follows. The simplest system call, getpid, takes 1.011 microseconds, which includes two messages and four context switches. Rebuilding the full system, which is heavily disk bound, has an overhead of 7%. Jobs with mixed computing and I/O, such as sorting, sedding, grepping, prepping, and uuencoding a 64-MB file, have overheads of 4%, 6%, 1%, 9%, and 8%, respectively. The system can do a build of the kernel and all user-mode servers and drivers in the boot image within 6 sec. In that time it performs 112 compilations and 11 links (about 50 msec per compilation). Fast Ethernet easily runs at full speed, and initial tests show that we can also drive gigabit Ethernet at full speed. Finally, the time from exiting the multiboot monitor to the login prompt is under 5 sec.

It has to be noted that the prototype incorporates many new security checks that cause some overhead. Furthermore, we have not yet done any performance optimizations. Careful analysis and removal of bottlenecks may boost the performance. We believe a performance penalty of less than 5% is realistic.

7 RELATED WORK

In this section we review some related operating systems. Note that we survey complete operating systems and not individual kernels to make the comparison fair. In other words, we compare our complete POSIX-conformant operating system to other systems that provide a comparable full system call interface.

7.1 Virtual Machines and Exokernels

Virtual machines [16] and exokernels [3] do not provide a hardware abstraction layer like (other) operating systems. Instead, they respectively duplicate or partition the available hardware resources so that multiple operating systems can run next to each other, each with the illusion of having a private machine. A virtual machine monitor or exokernel runs in kernel mode and is responsible for the protection of resources and the multiplexing of hardware requests, whereas each operating system runs in user mode, fully isolated from the others. These technologies provide an interface to an operating system, but do not represent a complete system by themselves. Neither approach solves the problem we try to tackle: how to build a reliable operating system that can heal itself after a fatal bug has been triggered in a device driver or server.

7.2 Monolithic Systems

A monolithic system runs the entire operating system in kernel mode without proper fault isolation. Although these properties negatively affect the system's reliability, as discussed in Sec. 2.1, many operating systems have a monolithic design.

Windows XP and Vista Microsoft Windows XP is an example of a monolithic system that runs the entire operating system in kernel mode. Although Microsoft once tried to move some components to user space, the performance penalty was deemed too high. Faster hardware and consumer demands for reliability made Microsoft revisit this design decision. In Vista, Microsoft plans to run many device drivers and the graphics subsystem in user mode, thus demonstrating that Microsoft has come to the same insight that we have: user-mode drivers are the way to go.

Linux and Isolated Drivers Linux has basically the same monolithic style as Windows. One major difference with Windows is that the graphics subsystem in Linux, like in other UNIX systems, has always been in user space. As the system ages, it acquires more and more functionality that ends up in the kernel, with all the consequences for maintainability and reliability. An important project to improve the reliability of commodity systems such as Linux is Nooks [19, 20]. Nooks keeps device drivers in the kernel but transparently encloses them in a kind of lightweight protective wrapper so that driver bugs cannot propagate to other parts of the operating system. All traffic between the driver and the rest of the kernel is inspected by the reliability layer. Another project uses virtual machines to isolate device drivers from the rest of the system [11, 12]. When a driver is called, it is run on a different virtual machine than the main system so that a crash or other fault does not pollute the main system. In addition to isolation, this technique enables unmodified reuse of device drivers when experimenting with new operating systems. A recent project ran Linux device drivers in user mode with small changes to the Linux kernel [10]. This work shows that drivers can be isolated in separate user-mode processes without significant performance degradation. While isolating device drivers helps to improve the reliability of legacy operating systems, we believe a proper, modular design from scratch gives better results. This includes encapsulating all operating system components (e.g., the file system and memory manager) in independent, user-mode processes.

Mac OS X Apple Mac OS X is yet another example of a monolithic kernel, but it has a layered kernel structure. The lowest layer is a microkernel based on Mach. The BSD UNIX personality is part of the kernel, as are various other components. This makes it simply a differently-structured monolithic kernel, but does not add many reliability features.

VxWorks VxWorks is a POSIX-compliant, real-time operating system, generally used in embedded systems. The core of VxWorks is referred to as the 'Wind' microkernel, but, in fact, the kernel contains the operating system, including device drivers, and thus has a monolithic structure. VxWorks has historically provided only kernel mode, requiring users to develop exclusively in this mode. Only recently has VxWorks AE provided a real-time process model that enables memory protection between processes. While this allows isolated user-mode applications to be run, it is up to the developer to enable the protection; ours is mandatory.

7.3 Single-Server Systems

While these systems are more modular than the previous examples, the operating system still runs as one huge, monolithic server. Several systems follow this design.

L4 Linux-based Systems A typical example of a single-server system is L4 Linux, in which Linux is run on top of the L4 microkernel. User processes obtain operating system services by making remote procedure calls to the Linux server using L4's IPC mechanism. Measurements show the performance penalty over native Linux to be about 5% [6]. A real-time system built with help of L4 Linux is DROPS [5]. It is targeted toward multimedia applications. However, most of the device drivers still run as part of a big L4 Linux server, with only the multimedia subsystem running separately. Perseus [13] is another L4 Linux-based system. It was designed to provide secure digital signatures while still supporting legacy applications for Linux. L4 Linux is used for most operations, but whenever a document needs to be signed, control is given to a trusted subsystem that includes a signature server. The problem with all these systems is that a single bug in, say, a device driver can still crash the entire operating system server. The only gain of this design from a reliability point of view is a faster reboot.
7.4 Multiserver Systems

Multiserver operating systems distribute functionality over multiple, isolated components. Although several multiserver systems exist, to the best of our knowledge, nobody has yet built and released a working, fully modular, open-source, multiserver operating system.

SawMill Linux SawMill Linux [4] is a sophisticated approach that splits the operating system into pieces and runs each one in its own protection domain. While the system was designed to be fully modular, the focus was on efficiency instead of reliability. As discussed in Sec. 2.3, the project was abruptly terminated in 2001 when many of the principals left IBM, and the only outcome was a rudimentary, unfinished prototype.

CapROS and Coyotos CapROS is a persistent, capability-based system designed to be highly reliable, and is based on EROS [17]. The kernel includes critical drivers and the persistence manager. The fact that device drivers are tolerated in the kernel means that a driver bug might bring down the entire system. Although persistence allows processes to be resumed as of the last checkpoint, we believe it is better to prevent a full operating system crash in the first place. Coyotos is also based on EROS, but will be exploring the limits of software verification in operating systems. It uses a new programming language, BitC, with a well-defined semantics to formally verify security and correctness properties. Coyotos has not been released and little has been published about it.

Mach-US In Sec. 2.2, Mach-UX was mentioned as an example of a single-server system. A later project known as Mach-US [18] tried to split the operating system into a set of servers, each of which supports orthogonal system services, much like our system. However, like SawMill Linux, this system never left the development stage.

Singularity A recent multiserver system developed by Microsoft Research is Singularity [9]. In contrast to other systems, Singularity is based on language safety and bypasses the hardware protection offered by the MMU. The trusted base consists of parts of the kernel and runtime system that are not verifiably safe. The operating system is run on top of the kernel as a set of verifiably-safe, software-isolated servers, each running under the control of its own run-time system. A contract specifies allowable interactions via state-machine-driven IPC declarations. These contracts can be statically verified, but are complex and hard to get correct without knowledge of formal specifications. Building applications for Singularity means a paradigm shift for the programmer, making it less suitable for large-scale adoption.

Symbian OS This operating system is designed for small, handheld devices, especially mobile phones. Symbian shares many characteristics with monolithic systems. For example, process management, memory management, device drivers, and dynamically loadable modules are all implemented in the kernel. Only the file server, and the networking and telephony stacks, are hosted in user-mode servers.

QNX This is a closed-source, commercial, UNIX-like, real-time operating system [8]. Although QNX has a multiserver design, from recent information sheets we conclude that the QNX kernel, Neutrino, contains process management and other functions that could have been removed. Furthermore, QNX does not have a POSIX-conformant API.

Nemesis Yet another multiserver operating system containing user-mode device drivers is Nemesis [14]. This system had a single address space shared by all processes, but with hardware protection between processes. Like DROPS, it was aimed at multimedia applications, but it was not POSIX conformant, or even UNIX-like.

Our system represents a new data point in the spectrum from monolithic to fully modular structure. The design consists of a small kernel running the entire operating system as a collection of independent, isolated, user-mode processes. While people have tried for years to produce a fully modular, microkernel-based UNIX clone with decent performance (such as GNU Hurd), we have actually done it, tested it heavily, and released it. The kernel implements only the minimal mechanisms required to build an operating system upon. It provides IPC, scheduling, and interrupt handling, and contains two kernel tasks (SYS and CLOCK) to support the user-mode operating system parts. The core servers are the process manager (PM), memory manager (MM), file server (FS), reincarnation server (RS), and data store (DS). Since the size of these components ranges from about 1000 to 4000 lines of code, they are easy to understand and maintain. Additional operating system services, such as device drivers, the information server (IS), the window system (X), and the network server (INET), can be started on the fly and are guarded by RS. The system is robust and self-healing, so that it can withstand and automatically recover from common failures in these components, transparent to applications and without user intervention.

8 CONCLUSIONS

We have demonstrated that it is feasible to build a highly-reliable, multiserver operating system with a performance loss of only 5% to 10%. We have discussed the design and implementation of a serious, stable prototype that currently runs hundreds of standard UNIX applications, including two C compilers, language processors (e.g., awk, bison, flex, perl, python, yacc), many editors (e.g., emacs, nvi, vim, elvis, elle), networking tools (e.g., telnet, ftp, kermit, talk, wget), and all the standard shell, file, text manipulation, and other UNIX (and GNU) utilities.

9 ACKNOWLEDGMENTS

We thank Chandana Gamage, Kemal Bicakci, and Bruno Crispo for critically reviewing the paper. This work was supported by the Dutch Organization for Scientific Research (NWO) under grant 612-060-420.

10 AVAILABILITY

The system is called MINIX 3 because we started with MINIX 2 and then modified it very heavily. Starting from a known base, we were spared the task of writing boring pieces of code, such as device drivers, interrupt handlers, and part of a file system. The resulting system has only a passing resemblance to MINIX 2; it is really a completely new system, but for historical reasons we settled on the name MINIX 3. MINIX 3 is free, open-source software, available via the Internet. You can download MINIX 3 from the official homepage at http://www.minix3.org/, which also contains the source code, documentation, news, contributed software packages, and more. Over 100,000 people have downloaded the CD-ROM image in the first 3 months, resulting in a large and growing user community that communicates using the USENET newsgroup comp.os.minix. MINIX 3 is actively being developed, and your help and feedback are much appreciated.