SUPPORTING HIGH LEVEL LANGUAGE SEMANTICS WITHIN HARDWARE RESIDENT THREADS

Erik Anderson, Wesley Peck, Jim Stevens, Jason Agron, Fabrice Baijot, Seth Warn, David Andrews
Information and Telecommunication Technology Center - The University of Kansas
2335 Irving Hill Road, Lawrence, KS
{eanderso,peckw,jstevens,jagron,bricefab,warn,dandrews}@ittc.ku.edu
ABSTRACT

This paper presents the new Hardware Thread Interface (HWTI), a meaningful and semantically rich target for a high level language to hardware descriptive language translator. The HWTI provides a hardware thread with the same hthread system calls available to software threads, a fast globally distributed memory, support for pointers, a generalized function call model including recursion, local variable declaration, dynamic memory allocation, and a remote procedure call model that gives hardware threads access to any library function.

1. INTRODUCTION

The potential benefits of Multi-Core System on a Programmable Chip (MCSoPC) are clear: increased performance through customization and parallelization. However, these benefits do not come without challenges, including selecting appropriate programming models [8], defining targets for hardware tasks, and creating communication models between hardware and software. Ideally, programmers would specify their MCSoPC in a known high level language and programming model, and have their design automatically and correctly translated to hybrid software and hardware tasks. Numerous research groups have made impressive strides toward this goal [5, 6, 13, 9, 7, 11]; however, none thus far has achieved unaltered high level language to hardware descriptive language (HLL to HDL) translation. Issues such as support for pointers, function calls, recursion, and access to existing software libraries have prevented true HLL to HDL translation.

The Hybridthreads Computational Model (hthreads) [3, 4, 1] addresses these problems. With hthreads, an engineer designs, develops, and debugs applications for hybrid CPU/FPGA chips using a familiar shared memory, Pthreads-like programming interface and model. A primary hthread goal is to abstract the hardware/software interface by extending operating system services, in particular hthread system calls, to hardware threads that run independently of and concurrently with CPUs.
Fig. 1. Hthread System Block Diagram (software threads attach to the system bus through the CPU's software interface, and hardware threads through Hardware Interfaces; the bus connects them to the Mutexes, Conditional Variables, Thread Manager, Thread Scheduler, CBIS, and Shared Memory cores)

An hthreads design consists of
a series of threads, written in C, that are either compiled to run as software binaries on one or more general purpose CPUs or translated [12] and synthesized as application specific hardware threads that run as hardware cores within the FPGA fabric. A block diagram of the hthreads run-time system is shown in Figure 1. The hthread interface is implemented as a runtime library for software threads and as the Hardware Thread Interface (HWTI) for hardware threads. A hardware thread, depicted in Figure 2, consists of three components: a vendor supplied bus connection (IPIF), a HWTI that enables hthread library support, and a "user logic" that provides user thread level functionality. Initially, the HWTI was designed to abstract the hardware/software interface by providing the same communication and synchronization services available to software threads to independently and concurrently executing hardware threads. The details of this initial work are described in [2]. However, once we started developing our own C to HDL tool, it became clear that the HWTI needed to be improved to provide a generalized function call model with equivalent HLL runtime services. The new enabling technology is globally distributed local memory. This memory, accessible by all cores, is instantiated within each HWTI using dual ported on-chip memory. The HWTI leverages this fast on-chip memory to provide a function call stack and heap for hardware threads, features that in turn provide a generalized model for invoking functions, support for recursion, and dynamic memory allocation. Finally, using existing support for mutexes, condition variables, and memory access, we created a remote
Fig. 2. Hardware Thread State Machine and Register Set Block Diagram
procedure call model that hardware threads may leverage to call any library function. These services were added to the HWTI not only to give hardware threads better performance, but also to provide a meaningful target for a HLL to HDL translator.

2. GLOBALLY DISTRIBUTED LOCAL MEMORY

Memory latency is a well-studied and well-understood problem. In traditional CPU systems it is addressed through memory hierarchies. Unfortunately, there has been little research on memory hierarchies for MCSoPC. De Micheli looked at ways of representing memory for C to HDL synthesis [10], and Vuletic investigated using virtual memory in reconfigurable hardware cores [15, 14]. Hardware threads, which operate like application specific processors, suffer from the same memory latency problems as CPUs. If hardware threads operated only on off-chip data, their performance would be greatly degraded, and instantiating a traditional cache for a hardware thread is very likely too expensive. Our new globally distributed local memory addresses this problem.

Globally distributed local memory, or "local memory," is not a cache, but rather a fast memory accessible both to the user logic and to other hthread cores. The local memory is instantiated using on-chip dual ported memory embedded within the FPGA fabric (known as BRAM on Xilinx chips). One port on the BRAM allows access by the user logic; the second port allows access by other hthread cores. The local memory address space is encompassed within the hardware thread's address space. Other cores access the local memory through standard bus read and write protocols. The hardware thread's user logic accesses the local memory through the HWTI's user interface, using the same protocol as for accessing traditional global memory. On each LOAD or STORE operation, the HWTI checks whether the requested address is "local" (within the HWTI) or "global" (outside the current HWTI). If the address is local, the HWTI accesses the memory through the BRAM's signaling protocol. If the address is global, the HWTI accesses the memory through bus operations. In this way, the HWTI abstracts the location of data from the programmer.

An advantage of the HWTI's local memory is that the user logic may access it without issuing a bus command. Consequently, multiple hardware threads can perform simultaneous memory accesses, even when the threads share a common bus. For example, Figure 3 depicts four threads, three hardware and one software, each concurrently accessing global memory. The largest portion of global memory is on the same bus as the CPU; this is normally off-chip memory, shown here as DRAM. The three hardware threads share a bus, each thread containing a small segment of memory, shown here as BRAM. If the shared variables were all stored in traditional global memory, accessing them would be slower: not only would each thread have to perform a bus operation, but the bus would effectively serialize the operations. When the variables are distributed as depicted, four memory operations may be performed in parallel. Furthermore, accessing local memory is very quick: it takes 3 clock cycles to load a value from local memory and 1 clock cycle to store a value, compared with 51 and 28 clock cycles, respectively, to load from and store to off-chip memory (runtime measurements on the Virtex-II Pro).

3. HWTI'S FUNCTION CALL STACK

A HWTI goal is to provide a target for HLL to HDL translation, in particular C to VHDL. Whereas software can target an existing processor's instruction set with a Von Neumann architecture behind it, hardware has no pre-existing target. Consequently, there is no implicit support for high-level language semantics; pointers and a function call stack are often two capabilities left out.
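The local-versus-global address check described in Section 2 amounts to a range comparison on each LOAD or STORE. A minimal C sketch, assuming an illustrative base address (0x6300 0000, as in the Figure 4 example) and an assumed 16KB of BRAM (the real values are fixed per thread at design time):

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative values only: the actual base address and BRAM size are
 * set per hardware thread at design time. */
#define HWTI_BASE  0x63000000u
#define LOCAL_SIZE 0x00004000u   /* assumed 16 KB of local BRAM */

/* Returns true if a LOAD/STORE address falls in the thread's own local
 * memory (serviced through the BRAM port) rather than global memory
 * (serviced through a bus operation). */
bool hwti_is_local(uint32_t addr)
{
    return addr >= HWTI_BASE && addr < HWTI_BASE + LOCAL_SIZE;
}
```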
Support for pointers has largely been solved by translating a pointer address into a bus operation [9, 2]; both the HWTI and Altera's C2H use this method. However, a hardware equivalent of the function call stack has not been addressed. Without a stack's functionality, parameter passing is difficult and true recursion is impossible. To address this problem, the HWTI creates a function call stack using its local memory.

The HWTI's function call stack works analogously to a software function call stack. The HWTI maintains registers for a stack and frame pointer, pointing into its local memory instead of traditional global memory. During a call, the HWTI pushes the frame pointer and the number of passed function parameter values onto the stack. The stack and frame pointers are then appropriately incremented for the new function. All function parameters are passed by pushing the values onto the stack and retrieved by popping the values off the stack; the HWTI supports PUSH and POP operations for this purpose. Finally, instead of saving the contents of the program counter during a function call, as done on CPUs, the HWTI pushes the user logic's state machine's return state onto the stack. The user logic is required to pass the return state to the HWTI, along with the function to call, during a CALL operation. To be more specific, the user logic passes a 16-bit variable representing the return state, and is responsible for mapping this variable back to its return state when control is returned to the caller function.

When the user logic makes a CALL operation, it specifies the function it wants to call through a 16-bit function code. Purposefully similar to the return state, this function code represents either the start state of the called function within the user logic, or a system library function supported by the HWTI. The interface and protocol for calling either a user defined function or a system function is identical. The implementation difference is that for a system or library call, the HWTI performs the needed operations on behalf of the user logic, whereas for a user defined function call, the HWTI sets up the function stack for a new function and then returns control to the user logic, specifying the start state of the function. Function returns are implemented with the RETURN operation. Here, the stack register is set to the current frame register, and the frame register is restored by popping its saved value from the stack. The return state and return value are passed back to the user logic. The return value is limited to 32 bits.

In order to give the user logic easy access to the local memory, the HWTI supports semantics similar to HLL variable declarations. To declare local variables, the user logic uses the DECLARE operation, specifying the number of words (4 bytes each) in memory it wants to set aside. The HWTI reserves space on the stack by incrementing its stack pointer the specified number of words. The user logic accesses this memory using READ and WRITE operations in conjunction with an index number that corresponds to the declared variables: the first declared variable has index 0, the second has index 1, and so on. Since the variables are declared and granted space within the HWTI's local memory, they each have an address in global memory. The ADDRESSOF operation works by converting the index number into its equivalent memory address, taking into account the HWTI's base address and current frame pointer. As an example, Figure 4 depicts a hardware thread (base address 0x6300 0000) with its equivalent C code, and the state of the function call stack after the foo function is called.

Fig. 3. Simultaneous memory access between four threads

Fig. 4. HWTI Function Call Stack Example
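The actual listing for Figure 4 is in the figure itself; as a hypothetical stand-in, the following C fragment illustrates the semantics the stack model supports: locals DECLAREd on the HWTI stack live in local memory and therefore have real global addresses, so ADDRESSOF-style pointer passing works as in ordinary C.

```c
/* Hypothetical example in the spirit of Fig. 4 (not the figure's actual
 * code): foo receives a pointer to a caller's local variable. */
static int foo(int *val)
{
    return *val + 1;    /* the dereference resolves to the caller's frame */
}

int thread_body(void)
{
    int a = 41;         /* DECLAREd on the HWTI stack in local memory */
    return foo(&a);     /* ADDRESSOF yields a's global address */
}
```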
Finally, incorporating a function call stack allows a hardware thread to call a function recursively. A hardware thread may repeatedly call the same function without incurring the costs of duplicating function logic within the FPGA fabric. The caller function’s state is saved to the HWTI’s local memory, and then restored when the callee function returns. The recursive depth of a function is only limited by the availability of local memory.
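For instance, a directly recursive C quicksort of the kind Section 5 later implements as a hardware thread translates without duplicating function logic; each recursive call simply pushes a new frame onto the HWTI stack. (This generic Lomuto-partition version is illustrative, not the paper's actual implementation.)

```c
static void swap(int *a, int *b)
{
    int t = *a;
    *a = *b;
    *b = t;
}

/* In-place recursive quicksort over a[lo..hi]; under the HWTI, each
 * recursive CALL pushes a frame onto the stack in local memory. */
void quicksort(int *a, int lo, int hi)
{
    if (lo >= hi)
        return;
    int pivot = a[hi], i = lo;          /* Lomuto partition */
    for (int j = lo; j < hi; j++)
        if (a[j] < pivot)
            swap(&a[i++], &a[j]);
    swap(&a[i], &a[hi]);
    quicksort(a, lo, i - 1);
    quicksort(a, i + 1, hi);
}
```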
4. LIBRARY SUPPORT FOR HARDWARE THREADS

The initial set of hardware supported library calls was carefully chosen to support the hthread programming model. Their implementation is relatively inexpensive, since a number of these functions, such as hthread mutex lock or hthread cond wait, are largely encoded bus operations to the respective Mutex Manager or Condition Variable Manager cores. Although this model could be extended to support other POSIX features, such as semaphores or timer queues, it is unrealistic to believe it could cover the majority of C's standard libraries (stdlib.h, string.h, math.h, etc.). Support for other library functions must therefore be written as user-defined functions in the user logic, implemented directly within the HWTI, or passed to the CPU to be performed on behalf of the hardware thread. We do not advocate writing a library function as a user-defined function, as this breaks the programming model abstraction. Support for library functions should therefore be implemented directly in the HWTI (but only when the resource size and speed tradeoff is reasonable) or passed to the CPU.

Thus far we have only implemented dynamic memory allocation within the HWTI. Traditionally managed by the operating system, dynamically allocated memory is available to users through services such as malloc and free. In [16] the authors survey many of the allocation and deallocation techniques for software based memory management. Despite the use and study of memory allocation within software, and with the exception of [10], dynamic memory allocation for custom hardware cores has thus far been elusive, primarily because hardware cores do not have access to the operating system that manages memory. Taking advantage of its operating system services and local memory, the HWTI implements its own lightweight versions of malloc and free.

Like the existing hthread library functions, when the user logic calls malloc or free, the HWTI implements these functions on behalf of the user, acting like an operating system. When a hardware thread allocates memory, the returned pointer is guaranteed to point into the HWTI's local memory. The user logic may access this memory with 3 clock cycle reads and 1 clock cycle writes. Since the HWTI's local memory is accessible by all cores, the hardware thread may pass the pointer to any other core, allowing them to use the data as well. This scheme is desirable since it allows the programmer to maintain the existing multi-threaded shared memory programming model for both software and hardware threads. The HWTI's dynamic memory allocation has two limitations. First, the thread that allocates memory must also deallocate it. Second, since dynamic memory is allocated within the thread's local memory, there is a limit to the size and number of memory segments that can be allocated.

The memory the HWTI allocates, known as the heap, is pre-allocated in 8B, 32B, and 1024B segments at the top of the local address range. These sizes were selected to assist with the dynamic creation of mutexes, condition variables, and threads, common structures within hthreads programming. By preallocating memory the HWTI avoids implementing a defragmentation routine, which is advantageous since the HWTI should remain as small as possible. When the user logic calls malloc, the HWTI selects, using a "best-fit" algorithm, the smallest appropriate preallocated memory space, returns its address to the user, and marks the memory used in a malloc state table. If the requested memory size is larger than 1024B, the HWTI allocates the space by decrementing an internal heap pointer the specified amount and returning the appropriate address to the user; the user logic may request only a single segment of memory larger than 1024B. If the user logic requests a memory space larger than the HWTI has available, the HWTI returns a null pointer. When the user logic calls free, the HWTI marks the appropriate malloc state table entry as unused.

To enable support for all other software resident library functions, the HWTI relies on a new remote procedure call (RPC) mechanism backed by a special software system thread. The RPC model uses existing hthread mechanisms and is completely abstracted from the user. A block diagram of the RPC methodology is shown in Figure 5. The user logic makes the library function call to the HWTI in the same manner as any other function call. The HWTI recognizes the function (the function opcode must be known at synthesis time) as an RPC function. Using a mutex, the HWTI obtains a lock protecting the RPC mechanism. Once granted, the HWTI writes the opcode and all arguments to a global RPC struct. Using a condition variable, the HWTI signals the RPC thread to perform the function.
The HWTI then waits for a return signal from the RPC thread. The RPC thread, which runs as a software thread on the CPU, reads the RPC struct and calls the appropriate function (the function must be known at compile time) with the passed-in arguments. The return value, if any, is written back to the RPC struct, and the RPC thread signals the HWTI to indicate that the function is complete. When the HWTI receives the condition variable signal, it wakes up, reads the return value, unlocks the RPC mutex, and passes the return value to the user logic.

The RPC model may be used to support any library function or operation too expensive to implement within the FPGA. Due to their complexity, the HWTI also uses RPC to implement hthread create and hthread join. However, the RPC model does have disadvantages. The CPU has to be interrupted to perform the RPC, which may impact real time constraints. Furthermore, depending on the priorities of the threads in the system, a hardware thread may have to wait a significant amount of time before the RPC is complete.
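The software side of this hand-off can be sketched, in simplified single-threaded form and with illustrative names (the real RPC struct, opcodes, and synchronization are internal to hthreads), as a decode-and-dispatch over the shared struct:

```c
#include <string.h>

/* Illustrative RPC block; in the real system the struct lives in shared
 * memory, guarded by the RPC mutex and condition variable. */
enum rpc_op { RPC_STRCMP, RPC_STRLEN };

struct rpc_block {
    enum rpc_op opcode;       /* must be known at synthesis time */
    const char *s1, *s2;      /* function arguments */
    int retval;               /* 32-bit return value */
};

/* What the software RPC thread does once signaled: decode the opcode
 * and run the library call on the hardware thread's behalf. */
void rpc_dispatch(struct rpc_block *b)
{
    switch (b->opcode) {
    case RPC_STRCMP:
        b->retval = strcmp(b->s1, b->s2);
        break;
    case RPC_STRLEN:
        b->retval = (int)strlen(b->s1);
        break;
    }
}
```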
Fig. 5. Remote procedure call high level diagram.

Fig. 6. Quicksort performance, comparing software and hardware threads, and local and global memory.
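The segment-pool malloc described above can be sketched in C as follows. The 8B/32B/1024B size classes come from the text; the per-pool counts, the identifier names, and the omission of the greater-than-1024B heap-pointer path are simplifications for illustration.

```c
#include <stddef.h>
#include <stdbool.h>

#define N_SEGS 6
/* Two segments per size class for illustration only; the real HWTI's
 * counts differ, and requests over 1024B use a separate heap pointer
 * (omitted here). */
static const size_t seg_size[N_SEGS] = { 8, 8, 32, 32, 1024, 1024 };
static bool seg_used[N_SEGS];                 /* the malloc state table */
static unsigned char heap[8 + 8 + 32 + 32 + 1024 + 1024];

/* Best fit: smallest free preallocated segment that satisfies n. */
void *hwti_malloc(size_t n)
{
    size_t best = N_SEGS, best_sz = (size_t)-1, off = 0, best_off = 0;
    for (size_t i = 0; i < N_SEGS; off += seg_size[i], i++) {
        if (!seg_used[i] && seg_size[i] >= n && seg_size[i] < best_sz) {
            best = i;
            best_sz = seg_size[i];
            best_off = off;
        }
    }
    if (best == N_SEGS)
        return NULL;          /* no free segment is large enough */
    seg_used[best] = true;
    return heap + best_off;
}

/* Free: mark the matching state-table entry unused. */
void hwti_free(void *p)
{
    size_t off = 0;
    for (size_t i = 0; i < N_SEGS; off += seg_size[i], i++)
        if ((unsigned char *)p == heap + off)
            seg_used[i] = false;
}
```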
5. RESULTS

To demonstrate the HWTI's memory hierarchy and function call stack, the well-known quicksort algorithm was implemented as a hardware thread. At start up, the main thread (running in software) creates an array of random integers and passes a pointer to the array, along with the array length, to the quicksort thread. The thread then uses the quicksort algorithm to sort the array. Using the HWTI's function call stack to handle the recursion and the local memory to store intermediate values, the hardware thread correctly sorts the array. The quicksort hardware thread is implemented on a Xilinx Virtex-II Pro 30 in 2851 slices, using 16 BRAM blocks for local memory.

Figure 6 shows the performance of the quicksort thread, using either a software or hardware implementation, and operating on an array in either global memory or local memory (for a software thread this corresponds to running with the data cache off or on). The hardware and software versions were purposefully written to be functionally equivalent; optimizations to take advantage of possible instruction level parallelism in hardware or compiler optimizations in software were not used. What is most striking about Figure 6 is that the hardware thread operating on an array in its local memory performs almost identically to a software thread using the data cache. This strongly supports the notion that the HWTI's local memory can give "cache-like" performance without actually being a cache. Figure 6 also shows how a hardware thread's performance is limited when operating from traditional off-chip global memory. In this example there is roughly a 6x slowdown for a hardware thread, and a 19x slowdown for a software thread. The difference is due to where intermediate values are stored: the hardware thread implementation kept intermediate values either in registers or in its local memory, while the software thread version had to reference all data from off-chip memory.

Table 1 summarizes the HWTI performance for selected library functions. The performance of most hthread system calls depends on the number of bus operations needed to complete the call. hthread self, equal, and exit do not require access to any other hthread core and execute very quickly. The mutex and condition variable operations typically require only one bus read to either the Mutex Manager or Condition Variable Manager and take roughly 0.40µs to complete. hthread cond wait requires two bus reads to the Mutex Manager and one to the Condition Variable Manager, and consequently takes roughly three times as long. The remaining listed functions were implemented using the RPC model. Although the RPC model may be used to provide access to any library function or operation, it is relatively slow. Finally, Table 2 lists the HWTI operations and their respective execution times. All performance numbers were measured on a Xilinx ML310 development board, with the CPU running at 300MHz and hardware threads running at 100MHz.

6. CONCLUSION

In this paper we presented the new HWTI, a semantically rich abstract target for HLL to HDL translation. Specifically, we showed how instantiating hardware threads with globally distributed local memory creates a fast on-chip global memory for hardware threads that maintains the existing shared
Function Call              Execution Time
hthread create *           160µs
hthread join *             130µs
hthread self               0.01µs
hthread equal              0.05µs
hthread exit               0.02µs
hthread mutex lock         0.36µs
hthread mutex trylock      0.36µs
hthread mutex unlock       0.36µs
hthread cond signal        0.41µs
hthread cond broadcast     0.41µs
hthread cond wait          1.02µs
malloc (stdlib.h)          0.10µs
free (stdlib.h)            0.07µs
printf * (stdio.h)         1.66ms
cos * (math.h)             450µs
strcmp * (string.h)        113µs

Table 1. Performance of selected HWTI library calls. (*) indicates the function is implemented using the RPC model.
Operation                  Execution Time
POP                        0.05µs
PUSH                       0.01µs
LOAD (local memory)        0.03µs
LOAD (global memory)       0.51µs
STORE (local memory)       0.01µs
STORE (global memory)      0.28µs
DECLARE                    0.01µs
READ                       0.03µs
WRITE                      0.01µs
ADDRESSOF                  0.01µs
CALL (user defined)        0.03µs
CALL (system function)     varies
RETURN                     0.07µs

Table 2. On-the-board performance of HWTI operations.
memory programming model. The local memory has in turn enabled the HWTI to support function call stacks, local variable declaration, and dynamic memory allocation. Finally, by using an RPC model, the HWTI can support any software resident library function. Hthreads allows for seamless integration of software and hardware computations by providing two uniform interfaces (the hthreads.h library and the HWTI) to the same set of system-level services. The interfaces work together to abstract the HW/SW boundary, allowing all computations in a system, whether implemented in SW or HW, to transparently communicate and synchronize using standard APIs. Additional information on hthreads may be found on our website: www.ittc.ku.edu/hybridthreads.
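For comparison, the Pthreads usage model that hthreads mirrors on both sides of the HW/SW boundary looks like the following (plain POSIX threads here, not the hthread API itself); under hthreads the same create/join pattern applies whether the created thread runs as software or as a synthesized hardware core.

```c
#include <pthread.h>

/* A worker thread that doubles the integer it is handed. */
static void *worker(void *arg)
{
    int *p = arg;
    *p *= 2;
    return NULL;
}

/* Create a worker, wait for it, and return the shared result. */
int run_worker(int v)
{
    int x = v;
    pthread_t t;
    pthread_create(&t, NULL, worker, &x);
    pthread_join(t, NULL);    /* join before reading x */
    return x;
}
```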
Acknowledgment

The work in this article is partially sponsored by National Science Foundation EHS contract CCR-0311599.

7. REFERENCES

[1] J. Agron, W. Peck, et al. Run-Time Services for Hybrid CPU/FPGA Systems on Chip. In Proc. of the 27th IEEE Int. Real-Time Systems Symp., pages 3–12, Rio de Janeiro, Dec. 2006.
[2] E. Anderson, J. Agron, et al. Enabling a Uniform Programming Model Across the Software/Hardware Boundary. In Proc. of the 14th Annual IEEE Symp. on Field-Programmable Custom Computing Machines, pages 89–98, Napa, California, Apr. 2006.
[3] D. Andrews, D. Niehaus, et al. Programming Models for Hybrid FPGA/CPU Computational Components: A Missing Link. IEEE Micro, 24(4):42–53, July/August 2004.
[4] D. Andrews, W. Peck, et al. hThreads: A Hardware/Software Co-Designed Multithreaded RTOS Kernel. In Proc. of the 10th IEEE Int. Conference on Emerging Technologies and Factory Automation, Catania, Sept. 2005.
[5] T. Callahan, J. R. Hauser, and J. Wawrzynek. The GARP Architecture and C Compiler. IEEE Computer, 33(4):62–69, 2000.
[6] Celoxica. Handel-C. http://www.celoxica.com.
[7] S. C. Goldstein, H. Schmit, et al. PipeRench: A Coprocessor for Streaming Multimedia Acceleration. In Proc. of the Int. Symp. on Computer Architecture, pages 28–39, 1999.
[8] A. A. Jerraya and W. Wolf. Hardware/Software Interface Codesign for Embedded Systems. Computer, 38(2):63–69, 2005.
[9] D. Lau, O. Pritchard, and P. Molson. Automated Generation of Hardware Accelerators with Direct Memory Access from ANSI/ISO Standard C Functions. In Proc. of the 14th Annual IEEE Symp. on Field-Programmable Custom Computing Machines, pages 45–56, Napa, California, Apr. 2006.
[10] L. Semeria, K. Sato, and G. De Micheli. Resolution of Dynamic Memory Allocation and Pointers for the Behavioral Synthesis from C. In Proc. of Design Automation and Test in Europe, pages 312–319, 2000.
[11] B. So, M. Hall, and P. Diniz. A Compiler Approach to Fast Hardware Design Space Exploration in FPGA-Based Systems. In Proc. of the Int. Conf. on Programming Language Design and Implementation, pages 165–176, Berlin, June 2002.
[12] J. Stevens and F. Baijot. Hybridthreads Compiler. Technical Report. http://www.ittc.ku.edu/hybrdithreads/downloads.
[13] SystemC. http://www.systemc.org.
[14] M. Vuletic. Unifying Software and Hardware of Multithreaded Reconfigurable Applications within Operating System Processes. PhD thesis, Ecole Polytechnique Federale de Lausanne, July 2006.
[15] M. Vuletic, L. Pozzi, and P. Ienne. Seamless Hardware-Software Integration in Reconfigurable Computing Systems. IEEE Design and Test, 22(2):102–113, 2005.
[16] P. Wilson, M. Johnson, and D. Boles. Dynamic Storage Allocation: A Survey and Critical Review. In Int. Workshop on Memory Management, Sept. 1995.