Accelerating an Embedded RTOS in a SOPC Platform

Timothy F. Oliver, Siraj Mohammed, Nataraj Muthu Krishna and Douglas L. Maskell
Centre for High Performance Embedded Systems (CHiPES)
School of Computer Engineering
Nanyang Technological University, Singapore
{tim.oliver, siraj, krishna}@pmail.ntu.edu.sg, [email protected]
Abstract

SoPC platforms are becoming more prevalent as a solution for implementing embedded computing systems, owing to their ease of implementation and highly customisable nature. We demonstrate a simple yet effective technique for accelerating an embedded RTOS running on a soft-core CPU in an SoPC platform. Custom instructions are developed to accelerate task scheduling. We show that rapid development of our technique can be achieved through integrated SoPC development environments such as Altera's Quartus-II. Further, implementing a system running the same accelerated RTOS on the Opencores ORP SoPC platform shows the portability of our method. A notable increase in the performance of key RTOS routines has been seen, as well as a reduction in interrupt latency, at the cost of a minimal amount of FPGA real estate. We propose the novel use of custom instructions to access frequently used global data structures as an acceleration technique suitable for SoPC platforms.
1. Introduction

A modern digital system is a combination of one or more processors, configured intellectual property blocks (IPB) and user logic. In the past, a system on a chip (SoC) could only be implemented as an ASIC design. Developments in FPGA logic density and speed have made a system on a programmable chip (SoPC) implementation possible. The advantages of SoC are simplified PCB design, improved performance and reliability. Further advantages of SoPC are flexibility and shortened development time. A RISC core can be configured to be 16 or 32 bits wide and to have data or instruction pipelining, hardware multipliers and custom instructions with hardware acceleration. All the peripherals that the application demands can be customised and included on chip. In an ASIC development, this application-specific partitioning would usually be considered a risk, because the product may only be suitable for a small market. With an SoPC, hardware is effortlessly scaled and the application partitioned to meet requirements efficiently.

The availability of proprietary [1] and open source [2] configurable embedded computing platforms gives the system architect greater options on the structure and customisation available. Altera offers the SoPC Builder integrated into its Quartus-II FPGA development environment. This allows a number of system blocks to be built around one or more Nios processors. The Nios instruction set can be extended with custom execution units [3]. An open-source offering is the OpenRISC reference platform (ORP) [4]. The ORP provides a comprehensive collection of peripherals that can be customised. The open source nature of the ORP simulator allows it to be modified to emulate any changes to the soft core.

Custom computing hardware can benefit a general computing application [5]. A computing application can also benefit from hardware acceleration of the RTOS that it is running [6]. Our study requires an RTOS that allows partitioning of functionality between hardware and software. Micrium's MicroC/OS-II is a small, simple pre-emptive RTOS suitable for embedded applications [7]. It has been widely used and ported to a number of microprocessors. While the RTOS is not open source, its simplicity and well-documented structure make it easy to adapt functionality to use software or custom computing hardware. MicroC/OS-II can be scaled to the needs of the application, thus minimising the memory footprint and processing overhead.

Research has been done to provide RTOS functionality in hardware, and significant improvements in speed have been reported at the expense of only a minimal amount of additional hardware. For example, semaphores and the associated deadlock detection for sharing resources among several concurrent processes have been implemented in hardware [8]. Logic circuits to perform scheduling of tasks and interrupt handling have been developed. This concept has been commercialised [9] using the Xilinx Microblaze [10] as the SoPC platform.

We propose the novel use of custom instructions to access frequently used global data structures as an acceleration technique suitable for SoPC platforms. Previous work in RTOS acceleration has not considered the use of custom instructions. Accelerators have instead been connected to the system on the main bus. This approach requires the structural overhead of a bus interface to the acceleration hardware and the overhead of multiple clock cycles to perform an access transaction over the bus. If the bus is shared, the RTOS acceleration hardware could degrade the performance of other connected peripherals. Using a custom instruction to accelerate RTOS functionality alleviates the issues of bus connectivity. However, it localises acceleration to a single processor. Therefore we only consider this lightweight technique in single processor systems.

We apply this RTOS acceleration framework to two SoPC platforms. We then investigate the impact on RTOS overhead and interrupt response time. The FPGA resource usage, memory usage and energy consumption are reported and the overheads weighed against the benefits. Finally we discuss how easy the framework was to use and to port to different platforms.
2. Acceleration of RTOS

Lightweight RTOS software is used in a wide range of applications. Accelerating the RTOS will provide benefit to all these applications. This work focuses on an application-specific paradigm where hardware and software are customised and scaled to a given application. We assume there is some unused FPGA resource available. This unused resource is considered to be a waste, as it dissipates power through static leakage. The issue is how best to utilise this wasted resource to improve the RTOS performance and potentially save the energy that would be attributed to extra instruction reads from memory.

As FPGA devices are register-rich, we consider placing key RTOS global data structures into hardware and accessing them through custom instructions. A custom instruction traditionally has no concept of state, as the state of a processor is encapsulated in its register set. In this paper we explore the use of a custom instruction to access an RTOS accelerator with local storage. A data structure selected for localisation must be a set of global variables that are central to the application and therefore never need to be swapped out to memory. The functions that access it should benefit from the localisation.
3. Custom Acceleration Framework

First the RTOS is profiled to identify functionality that could benefit from custom acceleration. For real-time applications it is important to consider interrupt response time. The processing overhead of the operating system is also considered. For the identified functionality, the access patterns of any data structures concerned are investigated for potential localising of state within the custom instruction.

The custom instruction is created as a standalone piece of synchronous logic with a standard processor interface (clock, reset, start, two data operands and one result). It is written in RTL and tested with an independent RTL test environment. The instruction is then integrated into the soft-core CPU data path. Using the custom instruction in software requires changes to the SDK. In particular, the Binutils definition files are updated to include the opcode and instruction format. It is also necessary to modify the cycle-accurate simulator to emulate the custom instruction.

To enable the use of the custom instruction in the application software we create a platform-independent abstraction layer. We use macros to encapsulate inline assembly code for the custom instruction [3]. This method removes the need to add custom instruction support to the C compiler. The encapsulation is the only part of the framework that is platform dependent. The target software is modified to use the encapsulation macro in place of the original code. For exploration purposes the original code can be switched back in through compile options. Further testing and profiling of the system is then performed in software.
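The macro encapsulation described above could be sketched as follows. This is a minimal illustration rather than the code used in the study: the command codes, the `USE_CUSTOM_INSN` compile option, the `RDY_INSN` macro and the `OS_RDY_*` names are all hypothetical, and the platform-specific inline assembly is deliberately omitted because its form depends on the soft core and toolchain. The software stand-in models the instruction's local state as one bit per task priority.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical command codes passed as the first operand. */
enum { RDY_CMD_INIT = 0, RDY_CMD_SET = 1, RDY_CMD_CLEAR = 2, RDY_CMD_HIGHEST = 3 };

#ifdef USE_CUSTOM_INSN
/* Platform-dependent part: the inline assembly that issues the custom
 * opcode belongs here. Its exact form depends on the soft core and the
 * toolchain, so it is left out of this sketch. */
#error "define RDY_INSN(cmd, prio) with the platform's inline assembly"
#else
/* Software stand-in used when the custom instruction is compiled out.
 * It models the instruction's local state as one bit per priority. */
static uint64_t rdy_state;

static uint32_t rdy_insn_sw(uint32_t cmd, uint32_t prio)
{
    uint32_t p;

    assert(prio < 64u);
    switch (cmd) {
    case RDY_CMD_INIT:
        rdy_state = 0;
        break;
    case RDY_CMD_SET:
        rdy_state |= (uint64_t)1 << prio;
        break;
    case RDY_CMD_CLEAR:
        rdy_state &= ~((uint64_t)1 << prio);
        break;
    case RDY_CMD_HIGHEST:
        for (p = 0; p < 64u; p++)
            if (rdy_state & ((uint64_t)1 << p))
                return p;   /* lowest number = highest priority */
        break;
    default:
        break;
    }
    return 0;   /* unused result; also returned when no task is ready */
}
#define RDY_INSN(cmd, prio) rdy_insn_sw((cmd), (prio))
#endif

/* Platform-independent layer used by the kernel code. */
#define OS_RDY_INIT()      ((void)RDY_INSN(RDY_CMD_INIT, 0u))
#define OS_RDY_SET(prio)   ((void)RDY_INSN(RDY_CMD_SET, (prio)))
#define OS_RDY_CLEAR(prio) ((void)RDY_INSN(RDY_CMD_CLEAR, (prio)))
#define OS_RDY_HIGHEST()   (RDY_INSN(RDY_CMD_HIGHEST, 0u))
```

Kernel code written against the `OS_RDY_*` macros is unchanged when the build switches between the software path and the hardware path, which is the property the abstraction layer is meant to provide.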
4. Scheduler Acceleration

The core function of any RTOS is task scheduling in response to various events. Therefore we focused on how to accelerate the scheduling with custom instructions. MicroC/OS-II supports a maximum of 64 tasks. The status that indicates whether a task is ready to run is held in the ready table. This is an array of 8 bytes. Each byte is considered to be a ready group of 8 tasks. Each bit in the ready group indicates whether a particular task within the group is ready to run. Tasks are organised in the ready table by their priority. The highest priority task (HPT) is assigned bit 0 in ready group 0. The lowest priority task is assigned bit 7 in the last ready group.

Four operations are performed on the ready table: initialisation, making a task ready, making a task not ready and finding the HPT that is ready to run. Initialisation is the trivial case of clearing the bytes that store the ready table. Changing the ready status of a task requires decoding the task priority to select the correct ready group and alter the correct bit location in that group. This "mapping" requires an 8-byte look-up table and several instructions from the standard instruction set. Finding the HPT that is ready to run is slightly more complicated. The basis of this is the "un-mapping" operation, which finds the position of the first bit that is set within a byte, starting from bit 0. This operation is performed using a 256-byte look-up table. The ready groups are encoded into a byte in which each bit indicates that a group contains at least one task ready to run. Finding the highest ready task is performed in two steps: the un-mapping operation is used to find the highest priority group that is ready; then it is used to find the HPT that is ready within that group.

We first looked at accelerating the mapping and un-mapping operations with simple custom instructions. Then we looked at accelerating the HPT ready search as a whole. Finally we localised the ready table inside the processor, with access through the custom instruction. The circuit structure for the custom instruction is shown in Figure 1. The priority number is broken into a group select and a member select. For efficiency, MicroC/OS-II allows the ready table to be scaled to the number of tasks required in an application. Only enough ready groups to serve the required number of tasks are used. To maximise the efficiency of the acceleration hardware, it is likewise necessary to scale the hardware to handle only the number of tasks required by the application.
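The four software ready-table operations described in this section can be sketched in C as below. This is an illustrative reconstruction from the description, not the kernel source: the identifiers (`rdy_grp`, `rdy_tbl`, `map_tbl`, `unmap_tbl`) are chosen for this example, and the 256-byte un-mapping table is generated at initialisation here purely to keep the listing short, whereas the text describes it as a stored look-up table.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint8_t rdy_grp;      /* one bit per ready group                  */
static uint8_t rdy_tbl[8];   /* one byte per group, one bit per task     */
static uint8_t unmap_tbl[256];

/* "Mapping" table: index 0..7 -> single-bit mask. */
static const uint8_t map_tbl[8] = {
    0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80
};

/* Initialisation: clear the ready table and build the un-mapping table,
 * whose entry v holds the position of the lowest set bit of v. */
static void rdy_init(void)
{
    unsigned v, bit;

    rdy_grp = 0;
    memset(rdy_tbl, 0, sizeof rdy_tbl);
    unmap_tbl[0] = 0;        /* no bit set: entry carries no meaning */
    for (v = 1; v < 256; v++) {
        bit = 0;
        while (((v >> bit) & 1u) == 0)
            bit++;
        unmap_tbl[v] = (uint8_t)bit;
    }
}

/* Make a task ready: upper 3 bits select the group, lower 3 the member. */
static void rdy_set(uint8_t prio)
{
    assert(prio < 64);
    rdy_grp |= map_tbl[prio >> 3];
    rdy_tbl[prio >> 3] |= map_tbl[prio & 7];
}

/* Make a task not ready; clear the group bit when the group empties. */
static void rdy_clear(uint8_t prio)
{
    assert(prio < 64);
    rdy_tbl[prio >> 3] &= (uint8_t)~map_tbl[prio & 7];
    if (rdy_tbl[prio >> 3] == 0)
        rdy_grp &= (uint8_t)~map_tbl[prio >> 3];
}

/* Two-step search: un-map the group byte, then un-map that group. */
static uint8_t rdy_highest(void)
{
    uint8_t grp = unmap_tbl[rdy_grp];
    uint8_t member = unmap_tbl[rdy_tbl[grp]];

    return (uint8_t)((grp << 3) + member);
}
```

For example, after `rdy_init()` followed by `rdy_set(40)`, `rdy_set(12)` and `rdy_set(5)`, a call to `rdy_highest()` yields 5; clearing priority 5 empties group 0, so the next search yields 12. Each of these functions touches the tables in memory, which is the traffic that localising the ready table inside the custom instruction is intended to remove.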
Figure 1. Scheduler Acceleration Hardware
[Block diagram. Labelled blocks: Group Select, Member Select, Map, Command (Set or Clear), Ready Table (8-byte array), Ready Groups (8-bit array), UnMap (twice), Highest Ready Group, Highest Ready Member, Highest Ready Task.]

5. Experimental Method

The mapping, un-mapping and ready table access have been selected for acceleration. The two target platforms are Nios [1] running on the Altera APEX20KE family and the ORP [4] running on the Xilinx Virtex family [11]. The SoPC Builder handles all the SDK and CPU integration for the custom instruction on the Nios platform. Altera does not supply a cycle-accurate simulator for the Nios platform. The ORP required us to change the SDK and integrate the instruction RTL by hand. This was relatively straightforward, as code examples are available for modifying the opcode and instruction definitions and the cycle-accurate simulator emulation.

The hardware is built with and without the custom instruction module included. The resource usage is extracted from the build report files. The RTOS kernel was compiled with and without custom instruction support. The RTOS was configured to use the minimum amount of resource for 16, 32 and 64 tasks with only semaphores enabled. It was noted that functionality concerning mutual exclusion and flags would also benefit from the scheduler acceleration. The critical section entry and exit have been instrumented to record the maximum and average critical section length (CSL). The CPU idle counter is used to gauge the difference in RTOS overhead for the different builds.

A simple exerciser application has been created that runs the maximum number of tasks allowed. The tasks are split into groups of eight. Each task group shares a semaphore. Each task in the group attempts to get the semaphore and then increments its own counter. It then releases the semaphore and waits for a delay. Each task waits for a different amount of time. Any improvement in performance due to lower RTOS processing overhead will be indicated by a higher idle count figure.

6. Results

The exerciser application is run on both platforms, then build and run-time statistics are gathered. Net activity figures were obtained for each platform, with and without the custom instruction. These figures were used to find the approximate power dissipation of the FPGA using vendor models. The maximum critical section was always found in the task scheduling routine.
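The critical-section instrumentation mentioned in the experimental method might look something like the sketch below. It is an assumption about one reasonable way to record maximum and average CSL, not the instrumentation actually used: the `csl_stats_t` type and the `csl_*` functions are hypothetical, and the cycle count passed as `now` is assumed to come from a free-running hardware counter read at critical section entry and exit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t start;   /* cycle count at the most recent entry */
    uint32_t max;     /* longest section seen so far          */
    uint64_t total;   /* sum of all section lengths           */
    uint32_t count;   /* number of completed sections         */
} csl_stats_t;

/* Called where the kernel enters a critical section. */
static void csl_enter(csl_stats_t *s, uint32_t now)
{
    assert(s != NULL);
    s->start = now;
}

/* Called where the kernel leaves a critical section. */
static void csl_exit(csl_stats_t *s, uint32_t now)
{
    uint32_t len;

    assert(s != NULL);
    len = now - s->start;   /* unsigned subtraction tolerates one wrap */
    if (len > s->max)
        s->max = len;
    s->total += len;
    s->count++;
}

/* Average section length in cycles, or 0 before any section completes. */
static uint32_t csl_average(const csl_stats_t *s)
{
    assert(s != NULL);
    return s->count ? (uint32_t)(s->total / s->count) : 0u;
}
```

Keeping the bookkeeping to a compare, an add and an increment limits how much the measurement itself lengthens the sections being measured.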
Table 1. Altera Nios Results

Metric              Pure SW   HW 16task         HW 32task         HW 64task
CSL Ave. (Cycles)   49        49                49                49
CSL Max. (Cycles)   1273      1230 (-43)        1230 (-43)        1230 (-43)
Idle Time (%/s)     19.60     19.88 (+0.28)     19.79 (+0.19)     19.61 (+0.01)
FPGA LE (Units)     2993      3204 (+211)       3249 (+256)       3370 (+377)
FPGA FF (Units)     1362      1395 (+33)        1414 (+52)        1451 (+89)
ROM (Bytes)         35714     34434 (-1280)     34462 (-1252)     34562 (-1152)
RAM (Bytes)         104872    32900 (-71972)    52100 (-52772)    104872
Clk Speed (MHz)     57.76     60.18 (+2.42)     55.34 (-2.42)     52.97 (-4.79)
Power Dis. (mW)     -         -                 -                 -

Values in parentheses are differences relative to the pure software build.
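As a worked check on the maximum CSL figures, the relative reduction can be computed directly from the Pure SW and HW entries. The first ratio uses the Nios values of 1273 and 1230 cycles; the second uses the ORP values of 652 and 616 cycles reported in Table 2.

```latex
\[
\text{Nios: } \frac{1273 - 1230}{1273} = \frac{43}{1273} \approx 3.4\%
\qquad
\text{ORP: } \frac{652 - 616}{652} = \frac{36}{652} \approx 5.5\%
\]
```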
The maximum CSL has been reduced by around 3% for the Nios and 5% for the ORP. This suggests an improvement in the interrupt response time. The task scheduling acceleration resulted in a ROM saving of around 1 kilobyte. The ROM saving suggests that less access to read-only data and instruction memory is required. This would result in less power dissipation attributed to the memory interface. The energy consumed in performing an external memory access would far outweigh that of executing a custom instruction unit [12]. A minimal reduction in RTOS processing overhead was observed. The custom instruction required a small amount of FPGA resource. This resource would probably go unused in most SoPC applications. The impact on clock speed was minimal.

Table 2. Xilinx ORP Results

Metric              Pure SW   HW 16task         HW 32task         HW 64task
CSL Ave. (Cycles)   111       110               110               111
CSL Max. (Cycles)   652       616 (-36)         616 (-36)         616 (-36)
Idle Time (%/s)     18.83     19.62 (+0.79)     19.37 (+0.54)     18.86 (+0.03)
FPGA LE (Units)     4627      4705 (+78)        4825 (+198)       4831 (+204)
FPGA FF (Units)     1774      1810 (+36)        1831 (+57)        1868 (+94)
ROM (Bytes)         25445     24309 (-1136)     24329 (-1116)     24377 (-1068)
RAM (Bytes)         139380    35332             70020             139380
Clk Speed (MHz)     14.6      14.6              14.6              14.6
Power Dis. (mW)     335       337 (+2)          337 (+2)          338 (+3)

Values in parentheses are differences relative to the pure software build.

7. Conclusion

The novel use of a custom instruction to access frequently used global data structures as an acceleration technique suitable for SoPC platforms has been demonstrated. The portability of the technique has been illustrated by the use of two target platforms. The technique proves possible to realise with automated CAD tools such as Altera's SoPC Builder and with open-source offerings such as the ORP.

There is scope for much future work on accelerating embedded RTOS on SoPC platforms. Other RTOS functionality would also benefit from acceleration. A few examples are task delay and event control. As automated SoPC integration tools advance, the use of such techniques will become more widespread. There is potential to allow the tools to explore trade-offs of whether to implement one or more of a set of application-specific custom instructions for a given software application. The re-programmable nature of SoPC devices allows custom instructions to be swapped in and out of the device at run-time as necessary [13].

8. References

[1] Altera Corporation, "Nios Embedded Processor 32-Bit Programmer's Reference Manual", Altera version 3.1, January 2003.
[2] "The Opencores Website", http://www.opencores.org, Last accessed April 2004.
[3] Altera Corporation, "AN188: Custom Instructions for the Nios Embedded Processor", Altera version 1.2, September 2002.
[4] D. Lampret, "Open RISC 1000 Project", http://www.opencores.org/projects.cgi/web/or1k/overview, Last accessed April 2004.
[5] David Abramson, Paul Logothetis, Adam Postula, Marcus Randall, "FPGA Based Custom Computing Machines for Irregular Problems", HPCA, Pages 324-333, 1998.
[6] V. J. Mooney and D. M. Blough, "A Hardware-Software Real-Time Operating System Framework for SOCs", IEEE Design & Test of Computers, Vol. 19, pp. 44-51, Nov. 2002.
[7] J. J. Labrosse, "MicroC/OS-II The Real Time Kernel", R&D Books, 1999.
[8] J. Lee, V. Mooney, A. Daleby, K. Ingstrom, T. Klevin and L. Lindh, "A Comparison of the RTU Hardware RTOS with a Hardware/Software RTOS", Proceedings of the Asia and South Pacific Design Automation Conference, Pages 683-688, 2003.
[9] RF Realfast AB, "RealFast", http://www.realfast.se, Last accessed April 2004.
[10] Xilinx Corporation, "MicroBlaze Soft Processor", http://www.xilinx.com, Last accessed April 2004.
[11] Xilinx Corporation, "Virtex Series FPGAs", http://www.xilinx.com, Last accessed April 2004.
[12] J. P. Heron, R. Woods, S. Sezer and R. H. Turner, "Development of a Run-Time Reconfiguration System with Low Reconfiguration Overhead", Journal of VLSI Signal Processing, Volume 28, Pages 97-113, 2001.
[13] M. J. Wirthlin and B. L. Hutchings, "DISC: The dynamic instruction set computer", Field Programmable Gate Arrays (FPGAs) for Fast Board Development and Reconfigurable Computing, John Schewel, Editor, Proc. SPIE 2607, pp. 92-103, 1995.