Hardware/Software Co-Design of Operating System Services for Thread Management and Scheduling

Wesley Peck, David Andrews, Ed Komp, Jason Agron, Mike Finley
EECS Department, University of Kansas
Information and Telecommunications Technology Center (ITTC)
{peckw, dandrews, komp, jagron, mfinley}@ittc.ku.edu
Abstract
Migrating operating system services from software to hardware has long held the promise of enhanced performance through the direct implementation of services as parallel low-level hardware components that run concurrently with application programs. This paper describes the design of one such service, the management and scheduling of threads in a multithreaded system. This service is part of our multithreaded RTOS kernel, which is being designed as an integral part of the hybrid thread programming model we are developing for hybrid systems. The service described in this paper is implemented as a thin software API that takes the high-level constructs familiar to application programmers and transforms them into the low-level commands understood by the hardware implementation. The design currently supports up to 256 active threads along with preemptive priority, round robin, and FIFO scheduling algorithms. Performance tests verify that our hardware/software co-design approach provides significantly tighter bounds on scheduling precision and jitter than traditionally implemented RTOS kernels.

1 Introduction

Advances in FPGA density and performance are now making it economically feasible to implement operating system services, which have traditionally been implemented as software running on general purpose processors, as hardware components that run concurrently with each other and with application programs. Implementing these services in hardware promises to enhance performance by reducing overhead and to provide more precise operation by reducing variability in the time it takes to perform each service.

All operating systems introduce some level of overhead that reduces the execution capacity available to application programs running on the processor. Traditional operating systems have a relatively high level of overhead because all of the operating system services and all of the application programs must compete for the limited processor resources. Hardware-based operating systems reduce this overhead drastically by moving the operating system services into hardware, leaving the processor resources for the application programs. Moving the services into hardware also reduces jitter, because decisions made in hardware are typically available to applications with very little variability in the time it takes to make them.

This paper describes the implementation of one operating system service, thread management and scheduling, using a Xilinx Virtex II Pro. The Virtex II Pro is a hybrid chip containing a PowerPC 405 processor immersed inside the FPGA, which allows software running on the processor to communicate with hardware components inside the FPGA at very high speed. All of our components were implemented in VHDL as slave attachments to the OPB bus, which is part of the CoreConnect architecture used by the Virtex II. The thread management service is one part of a multithreaded RTOS kernel that is being designed as the integral part of a hybrid programming model that can take advantage of the new hybrid FPGAs being developed by Xilinx [1] and Altera [2]. The goal of this hybrid programming model is to provide a familiar multithreaded programming model in which applications can be implemented as many software threads interacting with many hardware threads. This model allows software and hardware to interact using the standard tools available to multithreaded programs (mutexes, condition variables, shared memory, etc.) instead of the custom interfaces that are typically used for software/hardware interaction.

Table 1: Hardware Operations

Operation   Description
CREATE      Create a new joinable or detached thread but do not add it to the ready-to-run queue
ADD         Add an already existing thread to the ready-to-run queue
JOIN        Suspend the operation of one thread until another thread terminates
EXIT        Terminate a thread but do not release the resources it is using
CLEAR       Release the resources of a thread that has already terminated
NEXT        Perform a scheduling decision to find the next thread to run
YIELD       Add a thread back to the ready-to-run queue and perform a new scheduling decision
PRIORITY    Set the scheduling priority of a thread
IDLE        Set the thread that will be used as the idle thread
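The bus-level encoding of these operations is not specified above. Purely as an illustration, a software API could pack an opcode, a thread identifier, and a priority into a single 32-bit command word written to the thread manager's memory-mapped slave registers; the field layout and the names below (tm_encode, the bit offsets) are our assumptions, not the actual OPB interface.

```c
#include <stdint.h>

/* Hypothetical opcodes for the operations in Table 1; the real
 * encoding used by the thread manager hardware may differ. */
enum tm_opcode {
    TM_CREATE, TM_ADD, TM_JOIN, TM_EXIT, TM_CLEAR,
    TM_NEXT, TM_YIELD, TM_PRIORITY, TM_IDLE
};

/* Pack an opcode, an 8-bit thread id (up to 256 threads), and a
 * 7-bit priority (128 levels) into one 32-bit command word. */
static inline uint32_t tm_encode(enum tm_opcode op, uint8_t tid, uint8_t prio)
{
    return ((uint32_t)op << 16) | ((uint32_t)(prio & 0x7fu) << 8) | tid;
}

/* Field accessors, e.g. for decoding a command in a bus monitor. */
static inline enum tm_opcode tm_op(uint32_t cmd)   { return (enum tm_opcode)(cmd >> 16); }
static inline uint8_t        tm_prio(uint32_t cmd) { return (cmd >> 8) & 0x7fu; }
static inline uint8_t        tm_tid(uint32_t cmd)  { return cmd & 0xffu; }
```

A single-word request format of this kind is what allows the hardware to act on each operation in a small, fixed number of bus-clock cycles.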
2 Thread Management Services

Our thread management service is designed as a thin software API which forwards application requests to, and performs context switching on behalf of, the thread management hardware. Together these two components implement most of the thread operations supported by general purpose thread packages such as PThreads. The operations supported by the thread management hardware are shown in Table 1. The supported operations are all low-level tasks that the thread management hardware can perform in 12 clock cycles or less. These operations, however, are too low-level to burden an application programmer with, so the software API combines them to form the higher-level constructs that the application programmer is more familiar with and can use more easily. Any thread operation therefore goes through several stages. In the first stage the application program requests some operation from the software API. The software API then breaks the request into one or more low-level operations that the thread management hardware performs. The results of these low-level operations are then combined and returned to the application program.

By placing the thin software API between the application program and the thread management hardware, this design achieves two goals. First and most importantly, the API presented to the application program remains similar to existing thread packages, which makes our hardware-based thread package familiar and easy to use. Second, the complexity in the hardware is reduced, resulting in a smaller and faster component.

To implement our design we broke the thread management hardware into two components: the thread manager and the thread scheduler. The thread manager implements all of the low-level operations except for NEXT, YIELD, and PRIORITY, which are implemented by the thread scheduler. The thread manager supports the management of up to 256 simultaneously active threads, with one thread reserved as an idle thread that runs only when no other thread in the system can be scheduled. This management includes maintaining a ready-to-run queue and keeping track of which thread resources are in use and which are available. The thread scheduler provides priority-based preemptive scheduling, round robin scheduling, and FIFO scheduling of all threads in the ready-to-run queue. Both components are connected to the system bus, which allows both the processor and other hardware components to access the services provided. This becomes especially useful when other operating system services are implemented in hardware. For example, a hardware implementation of blocking semaphores can add a software thread back to the ready-to-run queue as easily as the processor can.
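As a concrete illustration of the staging described above, a PThreads-style create call might decompose into a CREATE followed by a PRIORITY and an ADD. This sketch is hypothetical: tm_issue() stands in for writes to the memory-mapped thread manager and merely records which opcodes were issued, and the function names are ours, not the actual API's.

```c
/* Sketch of the thin software API's decomposition of one high-level
 * call into low-level hardware operations. tm_issue() is a stand-in
 * for a write to the memory-mapped thread manager: here it only
 * records the opcodes, in order, so the decomposition is visible. */
enum { OP_CREATE, OP_ADD, OP_PRIORITY };

static int issued[16];
static int n_issued;

static int tm_issue(int op, int arg)
{
    (void)arg;               /* the real hardware takes a thread id etc.   */
    issued[n_issued++] = op;
    return 0;                /* the real hardware returns a status word    */
}

/* A PThreads-like create: allocate the thread record, set its
 * priority, then place it on the ready-to-run queue. */
int hybrid_thread_create(int tid, int priority)
{
    if (tm_issue(OP_CREATE, tid) != 0)       return -1;
    if (tm_issue(OP_PRIORITY, priority) != 0) return -1;
    return tm_issue(OP_ADD, tid);
}
```

The results of the individual hardware operations (status, allocated thread id) would be combined into the single return value seen by the application, as described above.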
3 Results

These results were generated on a Memec [9] 2VP7 board which contains a Xilinx Virtex 2 Pro 7 series FPGA. This FPGA contains a PowerPC 405 processor running at 300 MHz with PLB and OPB buses both running at 100 MHz. Figures 1, 2, and 3 show the end-to-end scheduling delay when the system is running 250 threads, 64 threads, and 2 threads respectively. This end-to-end time includes the time taken for the processor to acknowledge the interrupt from the thread management service, the handshaking sequence with the PIC, the scheduling decision, and the context switch. The measurements were obtained by creating a free-running timer inside the FPGA which reset itself to zero on the same clock cycle that a scheduling interrupt was asserted. The timer then incremented at every 10 ns clock tick. The delay time was read from this counter at the end of the context switching code, before control was given to the newly scheduled thread.

Figure 1: Scheduling Delay with 250 Threads
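Because the counter is zeroed on the cycle the interrupt asserts and increments once per 10 ns tick, converting a raw reading into a delay is a single multiply. The helper below sketches that conversion; the name and constant are ours, matching the 100 MHz bus clock described above.

```c
#include <stdint.h>

/* The free-running FPGA timer is zeroed on the clock cycle the
 * scheduling interrupt asserts and increments every 10 ns (100 MHz
 * OPB clock), so a raw reading converts directly to nanoseconds. */
#define TM_TICK_NS 10u

static inline uint32_t delay_ns(uint32_t ticks)
{
    return ticks * TM_TICK_NS;
}
```

Under this conversion a reading of 200 ticks at the end of the context-switch code corresponds to 2.0 us of end-to-end delay.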
Figure 2: Scheduling Delay with 64 Threads
The results given in Figures 1, 2, and 3 show an average scheduling delay of 2.0 us, 1.9 us, and 1.7 us, with a maximum scheduling delay of 4.4 us, 4.2 us, and 2.0 us, when the system was running 250, 64, and 2 simultaneously active threads. Although low, the system did exhibit a small amount of jitter (2.4 us, 2.3 us, and 0.3 us respectively). To identify the source of the jitter we reran the same tests with the data and instruction caches on the processor turned off. In all three cases the jitter was reduced to about 100 ns. This is a strong indication that the main source of jitter in the system is cache misses during the interrupt service routine and the context switch, and not inherent variability in the thread management service.
Figure 3: Scheduling Delay with 2 Threads
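The jitter figures above characterize the spread of the measured delays. One simple way to compute such a spread from a log of delay readings is the difference between the largest and smallest observed delay; the function below is a sketch of that computation, and the sample values in the usage note are illustrative, not the raw data behind the figures.

```c
#include <stddef.h>
#include <stdint.h>

/* Spread (max - min) of logged end-to-end scheduling delays, one
 * common way to quantify jitter. Assumes n >= 1. */
static uint32_t jitter_ns(const uint32_t *d, size_t n)
{
    uint32_t lo = d[0], hi = d[0];
    for (size_t i = 1; i < n; i++) {
        if (d[i] < lo) lo = d[i];
        if (d[i] > hi) hi = d[i];
    }
    return hi - lo;
}
```

For example, a log containing readings of 2000, 2100, 4400, and 2050 ns yields a spread of 2400 ns, i.e. 2.4 us.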
4 Future Work

Work on the hybrid thread programming model and the thread management service itself is still very active. For the thread management service, we are changing our VHDL implementation to use a more generic style that should allow the service to be configured based on an application program's requirements.

As for the hybrid thread programming model, future work includes moving more operating system services into hardware. We are currently creating hardware components for timing services, synchronization services, and message passing services, and we hope to combine all of these services as the core of our RTOS kernel.

5 Conclusion

In this paper we have presented the design of thread management services implemented in hardware. This design takes advantage of emerging FPGA technology to achieve low overheads and precise scheduling that cannot be obtained using traditional approaches. The current system supports most standard thread operations through a thin software API which interacts with the thread management hardware, and provides support for up to 256 simultaneously active threads, each of which can have one of 128 priority levels. In addition, this component can easily be integrated alongside other components inside the FPGA to form general purpose SoC solutions. This capability is fundamental in enabling the creation of low overhead and precise operating system components which form the core of a small footprint RTOS kernel. A more detailed description of the hybrid threads project can be found at http://www.ittc.ku.edu/hybridthreads/.

6 Acknowledgement

The work in this article is partially sponsored by National Science Foundation EHS contract CCR-0311599.

References

[1] www.xilinx.com

[2] www.altera.com

[3] D. Andrews et al., "Programming Models for Hybrid FPGA/CPU Computational Components: A Missing Link," IEEE Micro, July/August 2004, pp. 42-53.

[4] D. Andrews, D. Niehaus, P. Ashenden, "Programming Models for Hybrid FPGA/CPU Computational Components," IEEE Computer, January 2004, pp. 118-120.

[5] D. Niehaus, D. Andrews, "Using the Multi-Threaded Computation Model as a Unifying Framework for Hardware-Software Co-Design and Implementation," Proceedings of the 9th Workshop on Object-oriented Real-time Dependable Systems (WORDS 2003), September 2003, Isle of Capri, Italy.

[6] www.opengroup.org

[7] R. Jidin, D. Andrews, D. Niehaus, "Implementing Multi Threaded System Support for Hybrid FPGA/CPU Computational Components," Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms, June 21-24, Las Vegas, Nevada.

[8] D. Andrews, D. Niehaus, R. Jidin, "Implementing the Thread Programming Model on Hybrid FPGA/CPU Computational Components," Proceedings of the 1st Workshop on Embedded Processor Architectures of the International Symposium on Computer Architecture (ISCA), Madrid, Spain, February 2004.

[9] www.memec.com