Utilizing Custom Registers in Application-specific Instruction Set Processors for Register Spills Elimination
Hai Lin
Dept. of Electrical & Computer Engineering University of Connecticut Storrs, CT 06269, USA
[email protected]
ABSTRACT The application-specific instruction set processor (ASIP) has become an important design choice for embedded systems. It combines the high flexibility of the base processor core with the high performance and energy efficiency of dedicated hardware extensions. Although considerable effort has been devoted to computation acceleration, e.g., automatic custom instruction identification and synthesis, the limited on-chip data storage elements, including the register file and data cache, have become a potential performance bottleneck. In this paper, we propose a hardware/software cooperative approach and a linear scan register allocation algorithm that utilize the existing custom registers in ASIPs to eliminate register spills. The data traffic between the processor and memory is reduced through efficient on-chip communication between the base processor core and the custom hardware extensions. Our experimental results demonstrate a promising performance gain, which is orthogonal to the improvements offered by other techniques in ASIP design.
Categories and Subject Descriptors C.1.m [Processor Architectures]: Miscellaneous
General Terms Design
Keywords ASIP, register spill, custom register
Yunsi Fei
Dept. of Electrical & Computer Engineering University of Connecticut Storrs, CT 06269, USA
[email protected]
∗ Acknowledgments: This work was supported by NSF grant CCF-0541102.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. GLSVLSI’07, March 11–13, 2007, Stresa-Lago Maggiore, Italy. Copyright 2007 ACM 978-1-59593-605-9/07/0003 ...$5.00.
1. INTRODUCTION In recent decades, application-specific instruction set processors (ASIPs) have become increasingly popular in embedded system design to satisfy demanding requirements on performance, power consumption, cost, and turn-around time. ASIPs allow designers to customize both the instruction set architecture (ISA) and the underlying microarchitecture for a specific application domain. A programmable base processor core can be extended with dedicated hardware accelerators, so that ASIPs achieve both high flexibility and application-oriented high performance and energy efficiency [18]. Several commercial tools have emerged that take configurable and extensible processors from specification to hardware implementation, such as Tensilica Xtensa [1], ARCtangent Processor [2], Jazz DSP [3], Altera Nios/NiosII [4], and Xilinx MicroBlaze [5].
A crucial step toward high performance in ASIP design is to select an optimal set of custom instructions that gives the best speedup under certain architectural constraints, e.g., area constraints, critical path timing constraints, and the limited number of input and output operands. Three techniques, Very Long Instruction Word (VLIW), vector operations, and fused operations, have been exploited in custom instructions to achieve the best tradeoff between performance improvement and hardware cost [15]. Various techniques have been presented to explore the design space of custom instructions efficiently and thoroughly. Sun et al. used a priority function to rank the candidate instructions and employed a branch-and-bound algorithm to prune inferior candidates [22]. To speed up the exploration process, Clark et al. added a configurable array of functional units to the baseline processor, which enables acceleration of a wide range of applications [9]. A set of algorithms for pattern generation, pattern selection, and application mapping has also been employed for reconfigurable systems [11, 17].
Several exact exhaustive algorithms (e.g., a binary tree that decides whether or not to include an operation node in a candidate instruction) and approximate algorithms (e.g., a genetic algorithm) are summarized in [20]. The ASIP architecture and compiler co-exploration problem is addressed in [13].
In a simple RISC-style processor, operations are performed only on register data, and memory accesses are restricted to load and store instructions. In an ASIP implementation, the hardware extensions normally need to read or update data in the generic register file of the base core. Base instructions use at most two input operands and one output, a limit determined by the number of read and write ports available on the register file. However, previous studies have shown that a significant performance gain of custom instructions generally comes from clusters with more than two input operands, e.g., 4 to 5 inputs give the best results [24]. To reconcile this data bandwidth mismatch between the base processor core and potential custom extensions, local storage elements, i.e., designer-defined custom registers, can be generated to hold the extra input operands needed by custom instructions [14, 22]. Each additional input has to be loaded from the register file into a custom register explicitly by a “move” instruction. This data traffic between the base processor core and the custom extensions can significantly offset the performance gain from selecting a complex cluster. A shadow register technique has been presented to mitigate the data bandwidth limitation in configurable processors [10, 12]. It makes writes to custom registers coincide with the “write-back” pipeline stage of base instructions, so that the cycle overhead of data transfer is removed.
We observe that in this shadow register scheme, custom registers are local to the hardware extensions: only the custom instructions are allowed to use them, for extra input operands. The reverse direction of usage, using the custom registers for base instructions, has not been exploited. As indicated in [23], the choice of a sufficient number of registers has a significant impact on the code size, performance, and energy consumption of embedded processors. Since multiple constraints of embedded processors, such as area and power, limit the size of the register file, generic registers have become a scarce on-chip resource. For example, the ARM7TDMI has only 16 general registers, of which 8 are used for the THUMB ISA [23]. In this paper, we propose a novel approach that turns the custom registers into a register file extension, so that base instructions can use data stored in custom registers instead of going to the memory system, which can reduce the memory traffic and hence the execution time and energy consumption.
The remainder of the paper is organized as follows. Section 2 analyzes the register spill and memory traffic problem. Section 3 describes the proposed hardware/software cooperative approach and its implementation within an ASIP synthesis framework. Section 4 presents experimental results, followed by conclusions and future work in Section 5.
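To make the idea of using custom registers as a register file extension concrete, the following Python sketch may help. It is only an illustration under simplifying assumptions, not the allocation algorithm of this paper: it is a textbook-style linear scan in which an evicted value is parked in an idle custom register when one is available and goes to memory only otherwise. The names `linear_scan`, `memory_spills`, and `EXAMPLE` are hypothetical, the custom registers are assumed to be idle (not claimed by custom instructions), and the cost of moving values is ignored.

```python
def linear_scan(intervals, num_generic, num_custom=0):
    """Assign each live interval a location.

    intervals: dict mapping a variable name to its (start, end) live range.
    Returns a dict mapping each name to ("reg", i), ("custom", j) or ("mem", None).
    Assumption: custom registers are idle and can hold any evicted value.
    """
    free_regs = list(range(num_generic))
    free_custom = list(range(num_custom))
    active = []   # variables currently held in generic registers
    parked = []   # variables currently held in custom registers
    location = {}

    # Visit live intervals in order of increasing start point.
    for var in sorted(intervals, key=lambda v: intervals[v][0]):
        start = intervals[var][0]

        # Release registers whose intervals ended before this one starts.
        for pool, free in ((active, free_regs), (parked, free_custom)):
            for old in [v for v in pool if intervals[v][1] < start]:
                pool.remove(old)
                free.append(location[old][1])

        if free_regs:
            location[var] = ("reg", free_regs.pop(0))
            active.append(var)
            continue

        # Register file exhausted: evict the candidate whose interval ends last.
        victim = max(active + [var], key=lambda v: intervals[v][1])
        if victim != var:
            location[var] = location[victim]  # new variable takes over the register
            active.remove(victim)
            active.append(var)

        # Prefer an idle custom register over a memory spill.
        if free_custom:
            location[victim] = ("custom", free_custom.pop(0))
            parked.append(victim)
        else:
            location[victim] = ("mem", None)

    return location


def memory_spills(location):
    """Count the variables that ended up in memory."""
    return sum(1 for kind, _ in location.values() if kind == "mem")


# Hypothetical live ranges: four overlapping variables.
EXAMPLE = {"a": (0, 10), "b": (1, 9), "c": (2, 8), "d": (3, 4)}
```

With two generic registers, `memory_spills(linear_scan(EXAMPLE, 2))` evaluates to 2. Allowing one or two custom registers (`linear_scan(EXAMPLE, 2, 1)` and `linear_scan(EXAMPLE, 2, 2)`) reduces the count to 1 and 0, respectively, which is the kind of memory traffic reduction the proposed approach aims for.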
2. ANALYSIS OF MEMORY TRAFFIC PROBLEM Figure 1 shows the partial datapath of a typical extensible processor. The left part illustrates the base architecture that we target in this paper: a simple single-issue, 5-stage pipelined RISC processor core with a generic register file that has two read ports and one write port. The figure does not show the pipeline stages explicitly, and the tri-state buffers represent the control signals. The right part depicts the custom hardware extensions, where custom logic is added as computation accelerators, and several custom registers are added to feed extra inputs to the custom logic. Previous studies have shown that relaxing the input operand constraint to approximately 4 achieves a performance gain close to the theoretical limit [24]. Thus, with two operands supplied by the register file, we normally need two custom registers for the extra inputs. The shadow register technique makes the custom registers visible to the base instructions for writing, and uses the reserved field of an instruction to control the shadow registers (either skipping them or forwarding values to them) [10]. In the microarchitecture, as shown in Figure 1, when sg is set high to write a result to the register file, sc can also be asserted to copy the result to a custom register. However, in this approach, only the data needed by custom instructions are forwarded to custom registers. If we mark the activation periods of a custom register during program execution, they tend to be sparsely scattered, and the custom register is idle for much of the time. On the one hand, custom registers in hardware extensions are under-utilized during program execution. On the other hand, the generic registers in the base processor core are a scarce resource. When program variables have exhausted the register file, some register values have to be stored in the memory system temporarily (so-called register spills) so that the registers can be reclaimed for other variables. Later on, the spilled values can be loaded back to
[Figure 1: Partial datapath of a typical extensible processor. Labels recoverable from the figure: sg, sc, br, bt, bc, load/store.]