A PORTABLE KERNEL-MODE RESOURCE MANAGER ON WINDOWS 2000 PLATFORMS

PANAGIOTIS E. HADJIDOUKAS, VASILEIOS K. BAREKAS, ELEFTHERIOS D. POLYCHRONOPOULOS and THEODORE S. PAPATHEODOROU

High Performance Information Systems Laboratory
Department of Computer Engineering and Informatics
University of Patras, Rio 26500, Greece
{peh, bkb, edp, tsp}@hpclab.ceid.upatras.gr
ABSTRACT
Radical technological improvements, low cost and the popularity of Intel processors are leading the major multiprocessor system integrators to design and build their new parallel systems around Intel microprocessors, which are historically intertwined with the Windows operating system. In this paper, we present the development of a kernel-mode device driver that performs efficient scheduling of parallel applications on the Windows 2000 operating system. This environment can easily be used as a platform for developing and evaluating scheduling policies on multiprocessor systems running Windows 2000. We review and evaluate the basic kernel-mode primitives and mechanisms used in our implementation, introducing our own solutions wherever necessary. Our results demonstrate significant performance improvements and scalability enhancements over the native Windows 2000 scheduler, arguing for the integration of such a subsystem into the Windows kernel.

Keywords: threads, scheduling, multiprogramming, operating systems, Windows.
1 INTRODUCTION
On Intel-based platforms, it is common to upgrade CPU power and RAM whenever we need to improve system performance. Although this is a relatively acceptable way to boost the performance of uniprocessor personal computers, it should not be considered the best solution for Intel-based multiprocessors. Increasing CPU power and/or adding memory is a costly, yet not efficient, way to improve system performance. Moreover, in the case of a hardware upgrade, the additional CPU and memory resources are not exploited sufficiently, since the major constraint lies either in the system software or in the optimization of the application code. Therefore, the most effective way of improving system performance and exploiting it fully is through appropriate software techniques and optimizations.
Like most traditional operating systems, Windows offers a general-purpose time-sharing multiprocessing and multiprogramming environment. However, its support for parallel processing is still primitive, due to the lack of appropriate tools for the efficient implementation of parallel applications. Moreover, time-sharing is proven to interact poorly with parallel programs, primarily because standard time-sharing schedulers preempt the threads of parallel programs without being aware of synchronization or communication. Scheduling decisions on Windows are made strictly on a per-thread basis, which makes Windows ill-suited for the efficient scheduling of parallel applications and points to the potential benefits of a CPU Resource Manager that distributes the available physical processors among the parallel applications. The presence of 8-way Pentium processor-based servers and the feasibility of building ccNUMA architectures on the Windows operating system [3] indicate the importance of such a subsystem in the Windows kernel.

In this paper, we present a kernel-mode Resource Manager for the Windows 2000 platform that implements the kernel mechanisms of the Nanothreads Programming Model (NPM) [13]. The parallel applications have been parallelized using the exported API of an appropriate user-level thread library that runs on several platforms, including Windows 2000. Although a user-level CPU manager, e.g. a daemon process, can already provide significant performance enhancements, we go one step further and develop a kernel-mode device driver. Such a driver has access to all system memory and all CPU instructions; in this way, we provide an implementation that is very close to a corresponding modification of the Windows kernel but preserves the important advantages of portability and availability. The Resource Manager also provides a platform for developing and evaluating scheduling policies. Several advanced and very efficient policies for shared-memory multiprocessor machines (including ccNUMA) have been introduced and tested on IRIX 6.5, using the user-level NANOS CPU Manager [10]. We also evaluate various capabilities of the operating system used in the implementation of our device driver, including sharing memory between kernel and user mode, blocking threads from kernel mode and several synchronization issues.

The rest of this paper is organized as follows: Section 2 presents an overview of the execution environment, including the Windows 2000 operating system and the Nanothreads Programming Model. In Section 3, we present the design and the functionality of the device driver, while in Section 4 we highlight the implementation details. The performance study and experimental results are presented in Section 5. In Section 6, we present related work; finally, in Section 7, we summarize our conclusions and present our future work.
2 BACKGROUND
This section provides the necessary background for our work. We present the Windows 2000 operating system features that we used and the functionality of the programming model on which we based our device driver.
2.1 Windows 2000 OS
Windows implements a priority-driven, preemptive scheduling system [15]: the highest-priority ready thread always runs. Consequently, if another thread with a higher priority becomes ready to run, the currently running thread is preempted before finishing its time slice. The scheduler tries to schedule a thread taking into account its soft affinity (the processor on which the thread last executed), its hard affinity (the set of processors on which the thread is allowed to execute) and its ideal processor, which can be considered a user-defined soft affinity. At any time, a processor executes at a particular Interrupt Request Level (IRQL), which determines which interrupts can be received. An interrupt is never processed while the processor is busy processing a higher-level interrupt. The highest IRQLs are the Device Interrupt Request Levels (DIRQLs), which correspond to hardware interrupts; the remaining IRQLs are implemented in software. The lower IRQLs, in descending order, are the DISPATCH, APC and PASSIVE levels. Thread-scheduling decisions are made at DISPATCH_LEVEL IRQL, while threads run at PASSIVE_LEVEL IRQL; all thread scheduler priorities exist at this level. The Win32 API organizes processes first by the priority class assigned to them at creation and then by the relative priority of the individual threads within those processes.

Device drivers interface with the operating system via the I/O manager [6], which calls the appropriate driver functions as required. Most I/O requests are represented by an I/O request packet (IRP), a data structure that describes everything a device driver needs to know in order to complete the program's request. The driver's internal code is structured as a series of subroutines that are called to process the various stages of an I/O request. Windows also provides Asynchronous Procedure Calls (APCs). These kernel-defined control objects represent a procedure that is called asynchronously, in the context of a specific thread and at times determined by the operating system. An APC can preempt the currently running thread, and its routine can itself be preempted. Windows NT supports User APCs, Kernel APCs and Special Kernel APCs [6].
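To give a concrete picture of this IRP-based interface, the sketch below shows a minimal driver skeleton that registers a device-control dispatch routine and completes requests in the METHOD_BUFFERED ("kernel buffer") style discussed later. It is not the Resource Manager's actual code; the device name and the IOCTL code are invented for illustration.

```c
/* Minimal WDM-style skeleton (illustrative only): DriverEntry creates a
 * device, exposes it to user mode and registers an IRP_MJ_DEVICE_CONTROL
 * dispatch routine.  Device and IOCTL names are hypothetical. */
#include <ntddk.h>

#define IOCTL_RM_QUERY \
    CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)

static NTSTATUS DispatchCreateClose(PDEVICE_OBJECT DeviceObject, PIRP Irp)
{
    UNREFERENCED_PARAMETER(DeviceObject);
    Irp->IoStatus.Status = STATUS_SUCCESS;
    Irp->IoStatus.Information = 0;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
    return STATUS_SUCCESS;
}

static NTSTATUS DispatchIoctl(PDEVICE_OBJECT DeviceObject, PIRP Irp)
{
    PIO_STACK_LOCATION stack = IoGetCurrentIrpStackLocation(Irp);
    NTSTATUS status = STATUS_INVALID_DEVICE_REQUEST;
    ULONG_PTR info = 0;

    UNREFERENCED_PARAMETER(DeviceObject);
    switch (stack->Parameters.DeviceIoControl.IoControlCode) {
    case IOCTL_RM_QUERY:
        /* METHOD_BUFFERED: user data arrives in the kernel buffer of the IRP */
        if (stack->Parameters.DeviceIoControl.OutputBufferLength >= sizeof(ULONG)) {
            *(PULONG)Irp->AssociatedIrp.SystemBuffer = 0;   /* dummy reply */
            info = sizeof(ULONG);
            status = STATUS_SUCCESS;
        } else {
            status = STATUS_BUFFER_TOO_SMALL;
        }
        break;
    }
    Irp->IoStatus.Status = status;
    Irp->IoStatus.Information = info;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
    return status;
}

NTSTATUS DriverEntry(PDRIVER_OBJECT DriverObject, PUNICODE_STRING RegistryPath)
{
    UNICODE_STRING devName, linkName;
    PDEVICE_OBJECT devObj;
    NTSTATUS status;

    UNREFERENCED_PARAMETER(RegistryPath);
    RtlInitUnicodeString(&devName, L"\\Device\\ResMgrDemo");
    RtlInitUnicodeString(&linkName, L"\\DosDevices\\ResMgrDemo");

    status = IoCreateDevice(DriverObject, 0, &devName, FILE_DEVICE_UNKNOWN,
                            0, FALSE, &devObj);
    if (!NT_SUCCESS(status))
        return status;
    IoCreateSymbolicLink(&linkName, &devName);  /* make it reachable via CreateFile */

    /* Unload routine omitted for brevity. */
    DriverObject->MajorFunction[IRP_MJ_CREATE] = DispatchCreateClose;
    DriverObject->MajorFunction[IRP_MJ_CLOSE]  = DispatchCreateClose;
    DriverObject->MajorFunction[IRP_MJ_DEVICE_CONTROL] = DispatchIoctl;
    return STATUS_SUCCESS;
}
```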
2.2 Nanothreads Programming Model
The Nanothreads Programming Model exploits multiple levels of loop and functional parallelism. The integrated compilation and execution environment consists of a parallelizing compiler, a multithreaded runtime library and the appropriate operating system support. Applications are decomposed into multiple levels of parallel tasks; managing multiple levels of parallel sections and loops allows the extraction of all the useful parallelism contained in the application. According to this model, applications are decomposed into fine-grain tasks and executed in a dynamic multiprogrammed environment.

The operating system offers virtual processors to applications as the kernel abstraction of physical processors on which applications can execute in parallel. Virtual processors provide user-level contexts for the execution of the tasks' user-level threads. The kernel-level scheduling policy is responsible for the distribution and allocation of physical processors to the applications currently running in the system. The runtime environment, which controls task creation, cooperates with the operating system, ensuring that the generated parallelism matches the number of processors allocated to the application. The resulting environment is multi-user and multiprogrammed, allowing each user to run parallel and sequential applications.

The main scheduling objective of the NPM is that application scheduling (the mapping of user-level threads to virtual processors) at the user level and virtual processor scheduling (the mapping of virtual to physical processors) at the kernel level must be tightly coordinated in order to achieve high performance. The overhead of the runtime library is low enough to make the management of parallelism affordable. It offers user-level threads and the appropriate support for the application to exploit its parallelism. Additionally, it acts as the user-level representative of the Resource Manager, adapting the execution of the application according to its decisions. In this way, when the application is executed in a heavily loaded system, the runtime library provides the application with an execution environment that includes only the resources that have been assigned to it by the Resource Manager.
Currently, we have developed two such runtime libraries, which export almost the same API to the applications. The first library, called NTLib, uses custom-made user-level threads, while the second, called FibRT, uses the standard Windows fibers. More information about these two runtime libraries and their implementation details can be found in [2].
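For reference, the sketch below shows the documented Win32 fiber primitives that a library such as FibRT builds on; this is not FibRT code, only an illustration of the underlying API.

```c
/* The Win32 fiber primitives underlying a fiber-based user-level runtime:
 * cooperative, user-level contexts switched without kernel involvement. */
#include <windows.h>
#include <stdio.h>

static LPVOID g_mainFiber;

static VOID CALLBACK WorkerFiber(LPVOID param)
{
    printf("running user-level work item %d\n", (int)(INT_PTR)param);
    SwitchToFiber(g_mainFiber);          /* yield back to the scheduling fiber */
}

int main(void)
{
    LPVOID w;

    g_mainFiber = ConvertThreadToFiber(NULL);     /* current thread becomes a fiber */
    w = CreateFiber(0, WorkerFiber, (LPVOID)1);   /* create a user-level context    */
    SwitchToFiber(w);                             /* cooperative user-level switch  */
    DeleteFiber(w);
    return 0;
}
```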
3 GENERAL DESIGN
In this section, we describe the general design decisions that guided the implementation of the Resource Manager and the functionality it exports to the runtime system. The implementation details are described in the next section.

The primary operation of the Resource Manager is to keep track of all applications running on the system and to apply to them a user-defined scheduling policy, which mainly determines the processors that each application will receive to run on. In order to achieve this, the Resource Manager utilizes a shared memory area, which is mapped into the address space of each executing application and acts as the communication path between the Manager and the applications. Apart from this, the Resource Manager creates a separate thread that applies the scheduling policy. Next, we give a more detailed description of the Resource Manager's subparts.

During its initialization, the Resource Manager creates a shared memory section, which is mapped into the address space of each application that uses the runtime system. This shared memory area contains information such as the maximum number of processors each application requests and the number of processors allocated to it. The Resource Manager uses this information to compute the number of processors allocated to each application. Additional information is kept in the shared memory about the virtual processors of each application, allowing the Resource Manager to control their execution on the available physical processors.

Every application that uses our runtime library checks, during its initialization, whether the Resource Manager is active in the system. If not, it continues its execution on the number of processors it has requested, and the system's native scheduler has the exclusive responsibility to schedule it. Otherwise, the application is first connected to the Manager by mapping the shared area into its virtual address space. Next, the application's main thread creates the remaining virtual processors in suspended mode and registers them in the shared area, making them accessible from the Resource Manager and thus enabling the Manager to manipulate them. Finally, the main thread sets the flag that indicates that the application is available for scheduling by the Manager and suspends itself. From this point on, control of the application has passed exclusively to the Resource Manager, which is responsible for its execution.
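To make the layout of this shared area concrete, the following sketch shows one plausible set of per-application records, inferred from the description above; the field names, types and sizes are our own assumptions, not the driver's actual definitions.

```c
/* Hypothetical per-application records kept in the shared section.
 * The real Resource Manager layout is not given in the paper. */
#define MAX_VPROCS  32
#define MAX_APPS    64

typedef struct _VPROC_INFO {
    unsigned long threadId;      /* kernel thread backing this virtual processor */
    void         *blockEvent;    /* user-mode event handle used for blocking     */
    int           running;       /* currently mapped onto a physical processor   */
    int           physicalCpu;   /* which physical processor, if running         */
} VPROC_INFO;

typedef struct _APP_INFO {
    unsigned long processId;
    int           registered;    /* application is available for scheduling      */
    int           requested;     /* maximum number of processors requested       */
    int           allocated;     /* processors granted for the next quantum      */
    VPROC_INFO    vprocs[MAX_VPROCS];
} APP_INFO;

typedef struct _SHARED_AREA {
    int       numApps;
    APP_INFO  apps[MAX_APPS];
} SHARED_AREA;
```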
The Resource Manager, based on the information in the shared area, applies a user-defined scheduling policy to distribute the physical processors across the registered applications. This is accomplished by a separate thread, called the scheduler thread, which is dedicated to the execution of the scheduling policy at fixed time intervals, called the scheduler quantum. Both the scheduling policy and the scheduler quantum are user-defined and can be changed dynamically at run time. Depending on the current scheduling policy, the scheduler thread decides on the distribution of physical processors to the applications, informs them by writing its decisions to the shared memory, and finally applies its decisions. More specifically, the scheduling policy decides how many processors each application will run on during the next scheduling quantum and which physical processors these will be. In addition, it decides which virtual processors of the applications will be running on these physical processors and what the mapping between them will be. A number of scheduling policies have been presented and evaluated based on the user-level implementation of the CPU Manager on IRIX [14].

The applications, through the runtime library, cooperate with the Resource Manager during their execution; this cooperation is necessary in order to achieve high performance and to avoid idling physical processors while some application could execute on them. Each application informs the Resource Manager about its requirements, reflecting the actual degree of parallelism it is able to exploit at any moment. The Resource Manager responds to the application's requirements and allocates physical processors to it according to the specified scheduling policy. The application receives, through the shared memory, the Resource Manager's decisions and tries to match the parallelism that it will generate with the assigned number of physical processors. In this way, the Resource Manager adjusts each application to the available resources and manages to keep all the physical processors busy executing useful application code.

More specifically, the Resource Manager decides on physical processor reallocations by removing some processors from some applications and assigning them to others. The removal of a physical processor from an application means blocking the associated virtual processor, while the assignment of a physical processor to an application means unblocking one of the application's virtual processors and binding it to the assigned physical processor. Since this reallocation procedure can block a virtual processor at an unsafe point (for example, while it is inside a critical section), a virtual processor checks for any non-safely preempted virtual processors whenever it reaches a safe point, where it is known that the application and runtime synchronization constraints are satisfied. If such a processor is found, the currently running virtual processor yields its physical processor to the preempted virtual processor. The fine-grain decomposition of our applications allows the polling of the Resource Manager's decisions to be performed often enough without hurting performance.

Currently, we avoid using priorities and study the performance of scheduling parallel applications of the same priority, thus focusing on the general case. This allows the Resource Manager and the native Windows scheduler to coexist successfully, making possible the simultaneous execution of applications controlled by the CPU Manager alongside other, independent ones.
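The scheduler thread described above can be pictured as a simple loop. The sketch below is our own illustration, reusing the hypothetical SHARED_AREA layout sketched in Section 3; Policy, BlockVproc and UnblockAndBind are invented names standing in for the driver's internal routines, and DSS is only one possible policy.

```c
/* Sketch of the scheduler thread's main loop, under the stated assumptions. */
#include <ntddk.h>

extern SHARED_AREA *SharedArea;          /* the mapped shared section          */
extern LONG SchedulerQuantumMs;          /* user-settable scheduler quantum    */
extern BOOLEAN StopScheduler;

extern VOID Policy(SHARED_AREA *sa);               /* e.g. a DSS-like policy  */
extern VOID BlockVproc(VPROC_INFO *vp);            /* e.g. queue a kernel APC */
extern VOID UnblockAndBind(VPROC_INFO *vp, int cpu);

VOID SchedulerThread(PVOID Context)
{
    LARGE_INTEGER interval;
    int i, v;

    UNREFERENCED_PARAMETER(Context);
    while (!StopScheduler) {
        /* 1. Decide the per-application processor allocation. */
        Policy(SharedArea);

        /* 2. Apply the decisions: block virtual processors that lost a
         *    processor, unblock and bind those that gained one.          */
        for (i = 0; i < SharedArea->numApps; i++) {
            APP_INFO *app = &SharedArea->apps[i];
            if (!app->registered)
                continue;
            for (v = 0; v < MAX_VPROCS; v++) {
                VPROC_INFO *vp = &app->vprocs[v];
                if (vp->running && v >= app->allocated)
                    BlockVproc(vp);
                else if (!vp->running && v < app->allocated)
                    UnblockAndBind(vp, vp->physicalCpu);
            }
        }

        /* 3. Sleep until the next scheduler quantum (relative time, 100ns units). */
        interval.QuadPart = -10000LL * SchedulerQuantumMs;
        KeDelayExecutionThread(KernelMode, FALSE, &interval);
    }
    PsTerminateSystemThread(STATUS_SUCCESS);
}
```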
4 IMPLEMENTATION ISSUES
In this section, we describe the basic mechanisms used in the implementation of our device driver. We present the possible ways to establish a communication path between kernel and user mode based on shared memory allocated by a device driver. We also introduce two ways of blocking threads from kernel mode, and we discuss the adopted synchronization solution. A more detailed description can be found in [7].

A kernel-mode device driver can share memory with the applications using the following methods:

"No mapping": The applications access the memory exclusively through the I/O subsystem. The driver can receive a copy of the user data in the kernel buffer of the IRP or access the data directly from the user buffer. The disadvantage of this technique is the intensive use of the I/O subsystem and its significant latency compared to communication through shared memory.

Named Section Object: The driver creates a named section object and the application itself opens and maps it into its address space, also changing the protection of the memory to enable write access to it. This method does not require the application to exchange any information with the device driver.

Section Object: The driver creates a section object, which represents a shareable block of memory, and maps it into the address space of the requesting application. The disadvantage of using a section object is that this memory can be paged out, and it is not safe to access it at DISPATCH_LEVEL IRQL.

Explicitly allocated kernel memory: The driver allocates non-paged kernel memory for the shared data. This memory is mapped into an application's address space as described before, using the native physical memory section object. The data is guaranteed to always be resident in physical memory and thus can be accessed from any address space and at DISPATCH_LEVEL IRQL.

Memory Descriptor List (MDL): This is an opaque structure, defined by the Memory Manager, that uses an array of physical page frame numbers to describe the pages that back a virtual memory range. As before, this memory cannot be paged out.
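As an illustration of the last method, the following sketch allocates non-paged memory, describes it with an MDL and maps it into the calling process. This is a generic pattern under our assumptions (names, sizes and error handling are ours), not the driver's actual code, and it must run in the context of the requesting process, for example inside the IOCTL dispatch routine.

```c
/* MDL-based mapping: non-paged kernel memory is described by an MDL and
 * mapped into the calling process' address space. */
#include <ntddk.h>

#define SHARED_SIZE  PAGE_SIZE
#define RM_POOL_TAG  'MRsR'

static PVOID g_kernelVa;     /* kernel-mode view, always resident            */
static PMDL  g_mdl;          /* describes the physical pages of that buffer  */

PVOID MapSharedAreaIntoCaller(void)
{
    PVOID userVa = NULL;

    if (g_kernelVa == NULL) {
        g_kernelVa = ExAllocatePoolWithTag(NonPagedPool, SHARED_SIZE, RM_POOL_TAG);
        if (g_kernelVa == NULL)
            return NULL;
        RtlZeroMemory(g_kernelVa, SHARED_SIZE);

        g_mdl = IoAllocateMdl(g_kernelVa, SHARED_SIZE, FALSE, FALSE, NULL);
        if (g_mdl == NULL)
            return NULL;
        MmBuildMdlForNonPagedPool(g_mdl);   /* fill in the physical page numbers */
    }

    __try {
        /* Map the same physical pages into the current (user) address space. */
        userVa = MmMapLockedPagesSpecifyCache(g_mdl, UserMode, MmCached,
                                              NULL, FALSE, NormalPagePriority);
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        userVa = NULL;
    }
    return userVa;   /* returned to the application through the IRP */
}
```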
The reallocation procedure presupposes support for blocking primitives. However, such a mechanism is not directly available in kernel mode. We implemented two different methods for blocking and unblocking a kernel thread. The first method uses the native API functions NtSuspendThread and NtResumeThread, although these are undocumented and not exported by the kernel. Note that every system service function (e.g. SuspendThread) corresponds to a native function whose address is located in the system service dispatch table, which is accessible from kernel mode. Based on this fact, we determined experimentally the positions of these two functions in the table. The second method is more general and is based on Kernel APCs. Each virtual processor is associated with a blocking structure that consists of an Event handle created in user mode, a pointer to the underlying kernel object, which is computed by the device driver, and a Kernel APC object. The latter is initialized with a callback function that is executed in user mode and performs a wait call on the Event of the blocking structure it belongs to. When the scheduler of the driver needs to block a certain virtual processor, it inserts that thread's APC, forcing the thread to block itself. Obviously, a virtual processor can also use this Event to block itself; this is necessary for our model, as the virtual processors must become suspended immediately after their creation.

An inappropriate synchronization solution can cause significant performance degradation, especially in kernel mode. Of the variety of synchronization primitives that Windows offers in kernel mode [6], Spin Locks and Fast Mutexes represent two possible and simple solutions. The operation that acquires a Spin Lock first raises the IRQL to DISPATCH_LEVEL and then obtains the lock. If it fails to obtain the lock, the spinning is also performed at DISPATCH_LEVEL, thereby preventing any other work that could be done at or below that level. On the other hand, once a Fast Mutex has been acquired, the IRQL is raised to APC_LEVEL, so APCs to the thread are blocked. If the thread fails to acquire the Fast Mutex, it is put into a wait state, releasing the processor. In order to avoid the possible idling of processors, we adopt Fast Mutexes as our solution. However, as the protected code is either critical (during the scheduling phase) or short enough (on the application side), we raise the IRQL to DISPATCH_LEVEL after acquiring the lock, thus avoiding any undesirable preemption.
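A minimal sketch of this locking scheme follows, assuming our own wrapper names; the combination of ExAcquireFastMutex with an explicit KeRaiseIrql to DISPATCH_LEVEL mirrors the description above.

```c
/* Fast Mutex guarding the shared data, with the IRQL raised to DISPATCH_LEVEL
 * after acquisition so the lock holder cannot be preempted. */
#include <ntddk.h>

static FAST_MUTEX g_rmLock;

VOID RmLockInit(VOID)                   /* call once, e.g. from DriverEntry */
{
    ExInitializeFastMutex(&g_rmLock);
}

VOID RmLockAcquire(PKIRQL OldIrql)
{
    ExAcquireFastMutex(&g_rmLock);          /* raises IRQL to APC_LEVEL      */
    KeRaiseIrql(DISPATCH_LEVEL, OldIrql);   /* now also disable preemption   */
}

VOID RmLockRelease(KIRQL OldIrql)
{
    KeLowerIrql(OldIrql);                   /* back to APC_LEVEL             */
    ExReleaseFastMutex(&g_rmLock);          /* restores the original IRQL    */
}
```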
5 EXPERIMENTAL EVALUATION
In this section, we measure the overhead of the mechanisms we have presented and the performance obtained for various multiprogramming levels when the applications are executed under the control of the Resource Manager. All the experiments were performed on a Compaq Proliant 5500 system with four 200MHz Pentium Pro processors and 512 MB of main memory, running Windows 2000 Advanced Server.
5.1 Mechanisms Overhead
First, we measure the time for successively mapping and unmapping the shared memory section into the address space of an application. The measurements are presented in Table 1. Initially, we measure the overhead of accessing the memory twice through request packets, which represents the overhead of sending two I/O requests to the driver and exchanging four bytes of information with it. The results show that using the User Buffer is faster, so we use this option in the three methods that need communication with the driver. We observe that the Memory Descriptor List is the fastest method for establishing the necessary communication path between kernel and user mode. In all three of these cases, the measured time includes the overhead of sending the two requests, i.e. 42 µs.

Mapping Method                   Time (µs)
"No Mapping" - Kernel Buffer         56
"No Mapping" - User Buffer           42
Named Section Object                109
Section Object                      151
Allocated Memory                    183
Memory Descriptor List              100

Table 1: Memory mapping overhead.

In Table 2, we illustrate the overhead of blocking a running kernel thread from kernel mode using the two methods described previously. The second method is clearly much faster than the native Windows suspend primitive; it depends on a kernel mechanism rather than on a kernel data structure, and it is completely user-defined.

Blocking Method                  Time (µs)
System Service Table Hook           126
Asynchronous Procedure Call          37

Table 2: Blocking overhead.

Finally, the overhead of successively acquiring and releasing a Kernel Spin Lock is 0.75 µs, while for a Fast Mutex it is 1.35 µs. Although the Kernel Spin Lock is slightly faster than the Fast Mutex, we have already explained that it is not the appropriate solution, as it can cause the idling of processors.
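The paper does not give its measurement code; as an illustration of how such microbenchmark figures can be obtained from user mode, the sketch below times repeated IOCTL round trips with the high-resolution performance counter. The device and IOCTL names are the hypothetical ones from the earlier sketches, and the numbers it produces are not the paper's.

```c
/* Average the cost of an operation over many iterations using
 * QueryPerformanceCounter; here the operation is an IOCTL round trip. */
#include <windows.h>
#include <winioctl.h>
#include <stdio.h>

#define IOCTL_RM_QUERY \
    CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)
#define ITERATIONS 10000

static double MeasureIoctlRoundtrip(HANDLE dev, DWORD ioctl)
{
    LARGE_INTEGER freq, start, end;
    DWORD value = 0, bytes;
    int i;

    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);
    for (i = 0; i < ITERATIONS; i++)
        DeviceIoControl(dev, ioctl, &value, sizeof(value),
                        &value, sizeof(value), &bytes, NULL);
    QueryPerformanceCounter(&end);

    /* average time per operation, in microseconds */
    return (double)(end.QuadPart - start.QuadPart) * 1e6
           / (double)freq.QuadPart / ITERATIONS;
}

int main(void)
{
    /* assumes the driver exposed the symbolic link \DosDevices\ResMgrDemo */
    HANDLE dev = CreateFileA("\\\\.\\ResMgrDemo", GENERIC_READ | GENERIC_WRITE,
                             0, NULL, OPEN_EXISTING, 0, NULL);
    if (dev == INVALID_HANDLE_VALUE)
        return 1;
    printf("avg IOCTL roundtrip: %.2f us\n",
           MeasureIoctlRoundtrip(dev, IOCTL_RM_QUERY));
    CloseHandle(dev);
    return 0;
}
```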
5.2 Performance under Multiprogramming
In order to measure the performance of the Resource Manager under multiprogramming, we run workloads composed of several instances of the same parallel application. We used applications from the SPLASH-2 benchmark suite [17], built with the NTLib runtime library presented in [2]. For our experiments, we used a time-sharing version of the DSS scheduling policy presented in [14]. According to this policy, each application receives a number of processors proportional to its request; time-sharing is then applied on a scheduler-quantum basis. We conduct two types of experiments, also measuring the performance of the native Windows scheduler for comparison. In the first type, we measure the performance obtained when the multiprogramming workload runs on a dedicated machine, while in the second it runs on an independently, fully utilized machine. The utilization of the four-processor machine is achieved using four threads of normal priority and maximum activity of the CPU Stress utility. This allows us to show that our multiprogramming workloads execute faster even when there are other running applications in the system that are not controllable by the Resource Manager. For both experiments, we run workloads with different multiprogramming levels, expressed as the total number of instances of the application that composes a workload. These instances run simultaneously and each requests all four processors of the machine.

In Figures 1 and 2, we present the execution time of our multiprogramming workloads for both experiments, using the Block LU kernel as the application. The times presented correspond to the total time the multiprogramming workloads need to run to completion. As we can see in Figure 1, when we use the Resource Manager to coordinate the execution of the workloads, their execution time scales linearly with the multiprogramming level, while the native Windows scheduler fails to scale well. The same scaling pattern is also observed for the second experiment, in the total execution time of the workloads in Figure 2. Although an independent application (CPU Stress) fully utilizes all the processors of the machine, the total execution time of the workloads still scales proportionally to the multiprogramming level when the Resource Manager is active.
Figure 1: Total execution time of our workload on a dedicated machine.
Figure 2: Total execution time of our workload on a machine fully utilized by the CPU Stress utility.

Another interesting measurement is the average turnaround time of the applications in the workload and its deviation in each case. We present these times for the two types of experiments in Tables 3 and 4, respectively. For both types, the Resource Manager leads to application turnaround times that scale proportionally to the multiprogramming level of the workload and exhibit very low deviation. On the other hand, the native Windows scheduler fails to exhibit a similar behavior.

                 Resource Manager           Windows
MP level      Ex. time   Deviation     Ex. time   Deviation
1-way            4.713      0.080         4.719      0.010
2-way            9.183      0.117        17.624      1.042
4-way           17.597      0.430        38.146      3.900
8-way           34.381      0.669        89.495     10.272

Table 3: Average application turnaround times (sec) for the workload running on a dedicated machine.

                 Resource Manager           Windows
MP level      Ex. time   Deviation     Ex. time   Deviation
1-way            7.448      0.402        12.216      0.422
2-way           12.673      0.386        21.634      0.518
4-way           22.601      0.505        42.459      3.677
8-way           42.089      1.021       101.317      8.104

Table 4: Average application turnaround times (sec) for the workload running on a machine fully utilized by the CPU Stress utility.

Our experiments show clearly that the Resource Manager substantially improves the execution of parallel application workloads on the Windows 2000 operating system. The specific application we used, the Block LU kernel, exhibits an intensive synchronization and data-sharing pattern among the executing processors. The native Windows scheduler proves inadequate for scheduling such parallel applications efficiently, failing to utilize the machine's hardware effectively. By contrast, the Resource Manager performs scheduling adapted to the needs of the running applications, thus achieving high performance.
6 RELATED WORK
The concept of using shared memory for communication between the kernel and user processes, as well as the notion of scheduling based on proportional processor allocations, was first proposed in [13]. Similarly to Process Control [16], the application is allowed to change its requests dynamically during its lifetime and to set the maximum number of processors it wants to run on, while the operating system allocates the most suitable number of processors taking into account the overall system load. Our approach maintains the number and mapping of processors allocated to the application, similarly to Scheduler Activations [1]. Considering the amount and quality of the information shared between an application and the operating system, Process Control maintains a counter of processors allocated to the application and informs the application of kernel-level events through UNIX signals. Scheduler Activations and First-Class Threads use upcalls, while we establish this communication path through shared memory. Yue and Lilja [18] use shared memory to implement dynamic processor allocation at the entry points of parallel loops by polling a system load variable, thus supporting a master/slave execution paradigm, while the NPM supports various and arbitrary forms of parallelism. Multiprogramming support for the NPM is presented in [12] and is based on a modification of the Linux kernel.

As far as we know, this is the first work that supports efficient execution of workloads of parallel applications on Windows platforms. With the exception of Vassal [4], previous work on Windows has concentrated strictly on real-time scheduling, although there are various similarities in implementation issues. Rialto/NT [8] is an implementation of scheduling abstractions originally developed for the Rialto real-time operating system within a research version of Windows NT. In [9], a soft real-time scheduling server for time-sensitive multimedia applications in the Windows NT environment is presented. Finally, VenturCom's RTX [5] provides a real-time subsystem running on Windows NT, implementing extensions to Windows NT that are often found in specialized real-time operating systems.
7 CONCLUSIONS AND FUTURE WORK
This paper presents a kernel-mode Resource Manager that provides support for efficient, multiprogramming-conscious scheduling of parallel applications on Windows 2000 platforms. It can be used both as an extension of the native Windows scheduler and as a platform for evaluating scheduling policies. We measured the overhead of the mechanisms we used, examining the performance of the various solutions that we developed on top of the support that the operating system provides. Finally, the execution of workloads at different multiprogramming levels demonstrated significant performance and throughput gains over the native scheduler, which proves inadequate for such workloads.
Our future work concentrates on extending our device driver to inform the applications about blocking events that occur in the kernel, e.g. I/O operations and page faults. More kernel-level scheduling policies will be developed and evaluated, and the interaction of kernel-level and user-level scheduling policies will be investigated in depth. Finally, the Manager should be able to control all the applications that run in the system, also taking into account the priority model that Windows supports.
ACKNOWLEDGMENTS
We would like to thank Dimitrios Nikolopoulos, Xavier Martorell and all our NANOS Project partners who participated in the development of the NANOS CPU Manager.

REFERENCES
[1] T. Anderson, B. Bershad, E. Lazowska, and H. Levy, Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism, In Proc. of the 13th ACM Symposium on Operating System Principles (SOSP), December 1991.
[2] V. Barekas, P. Hadjidoukas, E. Polychronopoulos, and T. Papatheodorou, NanoThreads vs. Fibers for the Support of Fine Grain Parallelism on Windows NT Platforms, To appear in Proc. of the 3rd ISHPC, Tokyo, Japan, October 2000.
[3] B. Brock, G. Carpenter, E. Chiprout, E. Elnozahy, M. Dean, D. Glasco, J. Peterson, R. Rajamony, F. Rawson, R. Rockhold, and A. Zimmerman, Windows NT in a ccNUMA System, In Proc. of the Third USENIX Windows NT Symposium, Seattle, WA, August 1999.
[4] G. M. Candea and M. B. Jones, Vassal: Loadable Scheduler Support for Multi-Policy Scheduling, In Proc. of the Second USENIX Windows NT Symposium, Seattle, WA, August 1998.
[5] M. Cherepov and C. Jones, Hard Real-Time With RTX on Windows NT, In Proc. of the Third USENIX Windows NT Symposium, Seattle, WA, August 1999.
[6] E. N. Dekker and J. M. Newcomer, Developing Windows NT Device Drivers: A Programmer's Handbook, Addison-Wesley, 1999.
[7] P. Hadjidoukas, V. Barekas, E. Polychronopoulos, and T. Papatheodorou, Efficient Multiprogramming on Windows 2000 Platforms, Technical Report HPCLAB-TR-200600, June 2000.
[8] M. B. Jones and J. Regehr, CPU Reservations and Time Constraints: Implementation Experience on Windows NT, In Proc. of the Third USENIX Windows NT Symposium, Seattle, WA, August 1999.
[9] C. Lin, H. Chu, and K. Nahrstedt, A Soft Real-time Scheduling Server on the Windows NT, In Proc. of the Second USENIX Windows NT Symposium, Seattle, WA, August 1998.
[10] X. Martorell, J. Corbalan, D. S. Nikolopoulos, N. Navarro, E. D. Polychronopoulos, T. S. Papatheodorou, and J. Labarta, A Tool to Schedule Parallel Applications on Multiprocessors: The NANOS CPU Manager, In Proc. of the 6th Workshop on Job Scheduling Strategies for Parallel Processing, in conjunction with IEEE IPDPS 2000, Cancun, Mexico, May 2000.
[11] C. McCann, R. Vaswani, and J. Zahorjan, A Dynamic Processor Allocation Policy for Multiprogrammed Shared-Memory Multiprocessors, ACM Transactions on Computer Systems, 11(2), pp. 146-178, May 1993.
[12] D. Nikolopoulos, E. Polychronopoulos, T. Papatheodorou, C. Antonopoulos, I. Venetis, and P. Hadjidoukas, Achieving Multiprogramming Scalability of Parallel Programs on Intel SMP Platforms: Nanothreading in the Linux Kernel, In Proc. of the Parallel Computing '99 Conference (ParCo'99), Delft, The Netherlands, August 1999.
[13] C. Polychronopoulos, N. Bitar, and S. Kleiman, Nano-Threads: A User-Level Threads Architecture, Technical Report 1297, Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign, 1993.
[14] E. D. Polychronopoulos, D. S. Nikolopoulos, T. S. Papatheodorou, N. Navarro, and X. Martorell, An Efficient Kernel-Level Scheduling Methodology for Multiprogrammed Shared Memory Multiprocessors, In Proc. of the 12th International Conference on Parallel and Distributed Computing Systems, Fort Lauderdale, Florida, August 1999.
[15] D. A. Solomon, Inside Windows NT, Second Edition, Microsoft Press, 1998.
[16] A. Tucker and A. Gupta, Process Control and Scheduling Issues for Multiprogrammed Shared-Memory Multiprocessors, In Proc. of the 12th ACM Symposium on Operating System Principles, pp. 159-166, 1989.
[17] S. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, The SPLASH-2 Programs: Characterization and Methodological Considerations, In Proc. of the 22nd International Symposium on Computer Architecture, Santa Margherita Ligure, Italy, June 1995.
[18] K. Yue and D. Lilja, An Effective Processor Allocation Strategy for Multiprogrammed Shared-Memory Multiprocessors, IEEE Transactions on Parallel and Distributed Systems, 8(12), pp. 1246-1258, 1997.