FPGA Implementation of a Priority Scheduler Module

Jason Agron, David Andrews, Mike Finley, Wesley Peck, Ed Komp
EECS Department, University of Kansas
Information and Telecommunications Technology Center (ITTC)
{dandrews, peckw, mfinley, komp, jagron}@ittc.ku.edu

Abstract

Minimizing system overhead and jitter is a fundamental challenge in the design and implementation of a Real-Time Operating System (RTOS). This paper describes the design of a hardware scheduler module developed as part of a multithreaded RTOS kernel built on a hybrid FPGA/CPU system. The scheduler module currently provides FIFO, round-robin, and preemptive-priority scheduling services that run in parallel with the system's microprocessor so as to reduce scheduling overhead. The scheduler amortizes the cost of each new scheduling decision over the time the current thread is running, resulting in minimal delay and jitter at each scheduling decision point.
1. Introduction

Minimizing system overhead and jitter is a fundamental challenge in the design and implementation of a Real-Time Operating System (RTOS). System overhead and jitter are introduced by many factors, including the time taken by the RTOS itself to perform basic system services on behalf of the application program. Minimizing the overhead allows the system to provide more computational cycles to the application programs, and minimizing the jitter allows for more precise, deterministic scheduling. For any given system, there is certainly a minimum amount of overhead and jitter that must exist. However, this level can be dramatically reduced by careful redesign and migration of portions of the RTOS into concurrent co-designed hardware modules [7][8][9]. Migrating these services off the CPU also eliminates the hidden context-switch overhead of entering and exiting the RTOS. Hardware versions of RTOS components can be configured as specialized co-processors that run concurrently with applications on the CPU. Perhaps the biggest benefit of this approach is the ability to effectively eliminate
the overhead and jitter associated with running the RTOS scheduler itself.

This paper describes the hardware design of a Priority Scheduler Module developed as part of our multithreaded RTOS kernel [3][4][5]. Our RTOS kernel implements a generalized thread programming model to provide a uniform interface for programmers and designers creating hybrid computational components: general-purpose threads that can be implemented in both hardware and software for embedded and real-time systems. The overall goal of our Priority Scheduler Module was to provide a modular and modifiable RTOS scheduling component that requires little or no CPU processing time. To support this goal, we have designed our Priority Scheduler Module to perform all scheduling processing concurrently with applications running on the CPU. Additionally, we have created a standard set of interfaces and protocols that provide a stable framework for designers to insert application-specific scheduling functions. Within our Priority Scheduler Module, the scheduling decision for the next scheduling event is computed in hardware before the event occurs. The scheduler module then delivers its decision to the CPU via an interrupt, which directs the CPU to perform a simple and fast context switch to the predetermined thread ID. In this fashion, we eliminate the majority of the overhead and jitter incurred by the traditional approach, in which a variable-length scheduling computation begins only after the scheduling timer interrupt has fired.

The paper is organized as follows: Section 2 describes the importance of a hardware Scheduler Module in an RTOS. Section 3 describes the design of the hardware Scheduler Module. Section 4 provides resource usage and timing results of the Scheduler Module, and Section 5 contains conclusions and future work.
2. Importance of a Hardware Scheduler Module
In most traditional operating systems, a timer interrupt mechanism is used to invoke scheduler processing. Traditionally, expiration of the timer interrupt causes a context switch to the scheduler, which then determines which task from the ready-to-run queue should run. Only after this processing has occurred can the operating system perform the context switch to the chosen thread. This sequence of an interrupt firing and then calculating which task to run introduces latency and variability (jitter) into the system. A hardware scheduler module removes this scheduling latency and jitter by calculating the task to run in parallel with CPU computations. With a hardware scheduler module in place, the scheduling "decision" has already been made before the interrupt fires. When the interrupt fires, a low-overhead context switch can be performed that simply asks the scheduler which task it has picked, saves the current context, and loads the context of the task that has already been selected by the scheduler module. This approach to designing hardware Scheduler Modules has two very important implications:

1) In order for the scheduler to be completely independent from the CPU, it must have all thread state information directly available to the module.

2) All entry points to scheduling services must now be directed toward the hardware Scheduler Module (requests from the CPU, unblocking of semaphores, timers, etc.).
These implications require the hardware Scheduler Module to have some form of local storage and a general interface so that devices and modules other than the CPU can interact with it. A complete hardware Scheduler Module can be designed by satisfying these implications while emulating traditional scheduling services. A hardware Scheduler Module designed in this fashion can drastically reduce, or sometimes eliminate, the overhead of scheduling decisions in an RTOS by performing scheduling services in parallel with the CPU.
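To make the low-overhead context switch described above concrete, the following is a minimal sketch of what a timer interrupt handler might look like when the scheduling decision has already been computed by a hardware module. This is our illustration, not code from the paper: the register address, the current_tid bookkeeping, and the save_context/load_context helpers are all hypothetical.

```c
#include <stdint.h>

/* Assumed memory-mapped location of the scheduler's Next_Thread_ID
 * register; the address and layout are illustrative only. */
#define SCHED_NEXT_THREAD_ID  (*(volatile uint32_t *)0x80000004u)

extern void save_context(uint32_t tid);   /* save CPU registers of tid  */
extern void load_context(uint32_t tid);   /* restore CPU registers of tid */

static uint32_t current_tid;              /* thread now running on the CPU */

void timer_isr(void)
{
    /* The decision is already waiting in a register: no queue walk and
     * no priority comparison here, so latency and jitter stay constant. */
    uint32_t next = SCHED_NEXT_THREAD_ID;

    if (next != current_tid) {
        save_context(current_tid);
        current_tid = next;
        load_context(next);
    }
    /* While the new thread runs, the hardware module concurrently
     * computes the decision for the next scheduling event. */
}
```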
3. Scheduler Module Design

A block diagram of the Scheduler Module is shown in Figure 1. The scheduler module works in conjunction with the Software Thread Manager (SWTM).
Figure 1: Scheduler Module Block Diagram.

The Scheduler Attributes Table in Figure 1 contains entries for all threads in the ready-to-run queue. The format of the attribute table can be seen in Figure 2.
Figure 2: Format of Scheduler Attribute Table Entry.

Both the SWTM and the Scheduler were implemented within an FPGA, and both are connected to each other and to a peripheral bus that allows the CPU to communicate with the modules. The dedicated hardware interface between the scheduler and the SWTM consists of 3 data registers, 4 control signals, and the B-Port interface to the SWTM's Block RAM (BRAM). The Current_Thread_ID data register holds the thread ID of the thread currently running on the CPU; it is readable by the scheduler and writeable by the SWTM. The Next_Thread_ID data register holds the thread ID of the thread that has been identified by the scheduler to run at the next scheduling decision; it is writeable by the scheduler and readable by the SWTM. The Thread_ID_2_Sched data register holds the thread ID of the thread to be added (enqueued) to the ready-to-run queue; it is readable by the scheduler and writeable by the SWTM. The B-Port interface allows the
Scheduler to look up thread management information from the SWTM's thread table. The 4 control signals implement a "handshake" protocol used to reliably coordinate communication between the Scheduler and the SWTM. The Scheduler Module has two main categories of operations: bus commands (BUScom) and SWTM commands (SWTMcom). The SWTM commands are issued only by the SWTM and are requested via the dedicated hardware interface. The bus commands can be issued by any device that is a bus master, and are requested via the bus-attachment interface. The command set of the Scheduler Module can be seen in Figure 3. A sketch of how software might drive this register interface appears below.
Figure 3: Scheduler Module Command Set.
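The following sketch models the three data registers and the request/ack handshake from the driving side. The struct layout, the bit assignments, and the polling loop are our assumptions for illustration; the paper does not specify the exact encoding of the 4 control signals.

```c
#include <stdint.h>

/* Assumed memory-mapped view of the Scheduler's register interface.
 * Field order and handshake bits are illustrative only. */
typedef struct {
    volatile uint32_t current_thread_id;  /* written by SWTM, read by scheduler  */
    volatile uint32_t next_thread_id;     /* written by scheduler, read by SWTM  */
    volatile uint32_t thread_id_2_sched;  /* thread to enqueue (SWTM -> scheduler) */
    volatile uint32_t handshake;          /* request/ack control bits (assumed)  */
} sched_regs_t;

#define HS_REQUEST  (1u << 0)   /* requester raises to issue a command */
#define HS_ACK      (1u << 1)   /* scheduler raises when it accepts    */

/* Enqueue a thread into the ready-to-run queue via the handshake. */
static void sched_enqueue(sched_regs_t *regs, uint32_t tid)
{
    regs->thread_id_2_sched = tid;        /* stage the operand            */
    regs->handshake |= HS_REQUEST;        /* assert the request signal    */
    while (!(regs->handshake & HS_ACK))   /* wait for the scheduler's ack */
        ;
    regs->handshake &= ~HS_REQUEST;       /* complete the handshake       */
}
```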
The ready-to-run queue, held in the scheduler attribute table, is kept in First-In-First-Out (FIFO) order. This FIFO ordering allows the scheduler to enqueue threads very quickly because no re-sorting of threads is needed. The enqueue operation consists of adding the new thread to the tail of the list and checking whether this newly enqueued thread should be the thread to run next. However, the FIFO ordering of the queue does make the dequeue operation longer, because the position of the entry linked to the thread being dequeued is unknown, so the list must be traversed in order to unlink that thread. The lengthier dequeue operation fits very well into our system because dequeue operations are performed only directly before context switches, which is the longest period of "idle" time the Scheduler Module will encounter. During this time, the scheduler module traverses the entire ready-to-run queue both to unlink the thread being dequeued and to find the best-priority thread in the queue, whose thread ID it places in the Next_Thread_ID data register. Timing results, which can be seen in Section 4, show that the entire dequeue operation completes in much less time than it takes to complete a context switch, which validates the use of the FIFO queue order. If a priority (sorted) queue were used instead, the enqueue operation would take longer because it would involve a sorted insert, and the dequeue operation would be shorter because the highest-priority thread would always be located at the head of the list. A slow enqueue operation is detrimental to system performance, because enqueue operations are usually performed by user-programmed threads that wait for the enqueue operation to return its status. A priority queue would also require additional logic to keep the list constantly sorted, so we decided to design the Scheduler Module around a queue in FIFO order. A software model of these two operations is sketched below.
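As a reading aid, here is a small software model of the enqueue and dequeue behavior just described: an O(1) append with a "should this run next?" check, and an O(n) traversal on dequeue that simultaneously unlinks the departing thread and recomputes the best-priority candidate. The real module implements this in FPGA logic over the attribute table; this C rendering, including the assumption that a lower priority value means higher priority, is ours.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct thread_entry {
    uint32_t             tid;
    uint32_t             priority;   /* lower value = higher priority (assumed) */
    struct thread_entry *next;
} thread_entry;

static thread_entry *head, *tail;
static thread_entry *next_to_run;    /* models the Next_Thread_ID register */

/* Enqueue is O(1): append at the tail, then check whether the new
 * thread beats the currently selected next thread. */
void enqueue(thread_entry *t)
{
    t->next = NULL;
    if (tail) tail->next = t; else head = t;
    tail = t;
    if (!next_to_run || t->priority < next_to_run->priority)
        next_to_run = t;
}

/* Dequeue is O(n): one traversal both unlinks the departing thread
 * and recomputes the best-priority thread for Next_Thread_ID. */
void dequeue(uint32_t tid)
{
    thread_entry *prev = NULL, *best = NULL;
    for (thread_entry *e = head; e; e = e->next) {
        if (e->tid == tid) {                      /* unlink this entry   */
            if (prev) prev->next = e->next; else head = e->next;
            if (e == tail) tail = prev;
            continue;                             /* skip it in the scan */
        }
        if (!best || e->priority < best->priority)
            best = e;                             /* track best priority */
        prev = e;
    }
    next_to_run = best;
}
```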
4. Results of Implementation

The Scheduler Module has been implemented on a Memec [6] 2VP7 development board, which contains a Xilinx [1] Virtex-II Pro 7 series FPGA. The module uses the following resources: 484 slices, 518 slice flip-flops, 967 4-input LUTs, 161 bonded IOBs, and 1 BRAM. The module has a maximum operating frequency of 142.9 MHz, which easily meets the 100 MHz clock frequency at which the FPGA runs. Timing results for how long the Scheduler Module takes to make a scheduling decision (essentially, the dequeue operation) were obtained by running the scheduler through a ModelSim simulation. The simulation involved queuing up the requested number of threads and then performing a dequeue operation, which was timed. The results of this simulation can be seen in Figure 4.
# Threads    Time (ns)    Est. Per-Thread Time (ns)
   250         10060        40.24
   128          5140        40.15625
    64          2610        40.78125
    32          1330        41.5625
    16           690        43.125
     2           130        65

Figure 4: ModelSim Timing Results of Scheduling Decision.
From Figure 4, one can see that it takes approximately 40 ns (4 clock cycles) per thread to search the list, plus a fixed setup cost. These results show that in the worst case (250 threads) a dequeue operation takes about 10 µs to complete, running concurrently with the context switch to the next thread ID on the CPU. During a context switch, the CPU does not interact with the scheduler module, which gives the scheduler module the perfect opportunity to search through the ready-to-run queue and calculate its next scheduling decision. This scheduling decision is calculated in parallel with the CPU, thus eliminating much of the processing delay normally incurred by computing a scheduling decision.
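The numbers in Figure 4 fit a simple linear model quite closely; the fit below is our own observation, not stated in the paper, but it matches the 16- and 32-thread rows exactly and the others to within about 1%.

```latex
% Rough linear fit to the Figure 4 data (our estimate):
%   T(16)  = 50 + 40 * 16  =   690 ns   (measured:   690 ns)
%   T(250) = 50 + 40 * 250 = 10050 ns   (measured: 10060 ns)
T_{\text{dequeue}}(n) \;\approx\; t_{\text{setup}} + n \cdot t_{\text{per}},
\qquad t_{\text{setup}} \approx 50\ \text{ns},
\quad t_{\text{per}} = 4\ \text{cycles} \times 10\ \text{ns} = 40\ \text{ns}
```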
5. Conclusions and Future Work

In this paper we have presented the design of our hardware scheduler module. This design takes advantage of current FPGA technology by building hardware modules that provide important operating system services in parallel with a microprocessor. The scheduler module supports FIFO, round-robin, and priority-based scheduling algorithms while drastically reducing the system overhead normally spent on scheduling decisions. The system currently supports up to 256 active threads with up to 128 different priority levels. Our immediate future work focuses on creating a more flexible organization that will allow the user to reconfigure the scheduler module to support an arbitrary number of threads and priority levels. We are also working on optimizing the execution time of our dequeue operation. The capability of migrating traditional operating system services into hardware allows system designers to improve concurrency and reduce system overhead, which has the potential to make real-time and embedded systems more precise and accurate. More detailed descriptions of the hybrid threads project can be found at www.ittc.ku.edu/hybridthreads.
6. Acknowledgment

The work in this article is partially sponsored by National Science Foundation EHS contract CCR-0311599.
7. References

[1] www.xilinx.com
[2] www.altera.com
[3] D. Andrews et al., "Programming Models for Hybrid FPGA/CPU Computational Components: A Missing Link," IEEE Micro, July/August 2004, pp. 42-53.
[4] D. Andrews, D. Niehaus, and P. Ashenden, "Programming Models for Hybrid FPGA/CPU Computational Components," IEEE Computer, January 2004, pp. 118-120.
[5] D. Niehaus and D. Andrews, "Using the Multi-Threaded Computation Model as a Unifying Framework for Hardware-Software Co-Design and Implementation," Proceedings of the 9th Workshop on Object-Oriented Real-Time Dependable Systems (WORDS 2003), September 2003, Isle of Capri, Italy.
[6] www.memec.com
[7] J. Lee, V. Mooney, K. Instrom, A. Daleby, T. Klevin, and L. Lindh, "A Comparison of the RTU Hardware RTOS with a Hardware/Software RTOS," Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC 2003).
[8] www.realfast.se
[9] V. J. Mooney and D. M. Blough, "A Hardware-Software Real-Time Operating System Framework for SoCs," IEEE Design and Test of Computers, pp. 44-51, Nov/Dec 2002.