TSINGHUA SCIENCE AND TECHNOLOGY ISSN 1007-0214 03/18 pp561-569 Volume 14, Number 5, October 2009
Architecture Design of a Variable Length Instruction Set VLIW DSP*

SHEN Zheng (沈 钲), HE Hu (何 虎), YANG Xu (杨 旭), JIA Di (贾 迪), SUN Yihe (孙义和)**

Tsinghua National Laboratory of Information and Technology, Institute of Microelectronics, Tsinghua University, Beijing 100084, China

Abstract: The cost of the central register file and the size of the program code limit the scalability of very long instruction word (VLIW) processors with increasing numbers of functional units. This paper presents the architectural design of a six-way VLIW digital signal processor (DSP) with clustered register files. The architecture uses a variable length instruction set and supports dynamic instruction dispatching. The one-level memory system architecture of the processor includes 16-KB instruction and data caches and 16-KB instruction and data on-chip RAM. A compiler based on the Open64 was developed for the system. Evaluations show that the processor is suitable for high performance applications with a high code density and small program code size.

Key words: digital signal processor (DSP); very long instruction word (VLIW); variable length instruction set; clustered register file
Introduction

High performance hardware is of great importance for digital signal processing, especially for multimedia applications. Very long instruction word (VLIW) architectures have proved effective in many performance critical applications[1]. Several concurrent functional units are used to exploit the instruction level parallelism (ILP), producing a significant speedup. In contrast to traditional superscalar architectures, VLIW machines move the instruction parallelization task from hardware to software or, in other words, from run-time to compile-time. This shift benefits the processor performance and reduces the processor area, making VLIW machines very attractive for computationally intensive tasks.

However, the VLIW approach also has some serious disadvantages. First, the cost of the register files explodes with an increasing number of functional units in VLIW processors. The single central register file that has traditionally been used to interconnect functional units and to provide short-term storage does not scale well with a large number of functional units[2,3]. Second, the large code size requires more program memory, which leads to significant power dissipation, due not only to the large amount of memory needed but also to the increased instruction bus width[4]. The effects of these disadvantages are reduced by the variable length instruction set VLIW DSP (digital signal processor) architecture with clustered register files described in this paper. The processor architecture targets high-performance digital signal processing and multimedia applications such as imaging, audio, and video. Specific instructions are proposed, including bit permutation instructions for variable length codec algorithms and vector instructions for motion estimation.

Received: 2008-06-17; revised: 2009-04-29
* Supported by the National Natural Science Foundation of China (No. 60236020) and the Specialized Research Fund for the Doctoral Program of Higher Education of MOE, China (No. 20050003083)
** To whom correspondence should be addressed. E-mail: [email protected]; Tel: 86-10-62788905
1 Related Work

The VLIW architecture has proved effective in exploiting instruction level parallelism. Most conventional high performance DSPs use VLIW architectures[5]. For example, the SC1400 announced by Freescale is a fixed-point VLIW DSP core[6] with four identical DALUs and two identical AGUs, which can simultaneously execute up to six instructions. TI's TMS320C64x[7] is a VLIW architecture with eight execution units clustered into two identical datapaths. Each datapath is bound to a corresponding register file. A crossbar is used to exchange data between the two register files. Using its eight execution units, the processor can execute up to eight 32-bit instructions in a single clock cycle. The CEVA-X1640 is a member of CEVA's fifth-generation DSP processor core family[8]. It is a quad-MAC processor with a 16-bit data width and four MAC units. The processor issues up to 8 instructions simultaneously, and the instruction set supports variable instruction widths (16-bit/32-bit) and variable instruction packet lengths.
2 Architecture

The LILY processor is a six-way VLIW DSP core with clustered register files. The processor supports a 16-bit/32-bit variable length instruction set, which improves the code density. A block diagram of the processor is shown in Fig. 1. The processor has a one-level memory architecture with 16-KB instruction and data caches and 16-KB instruction and data on-chip RAM. Both the instruction cache and the on-chip RAM are connected to a program memory management unit (PMMU). The data memory management unit (DMMU) is used to control the data cache and the on-chip RAM. The PMMU and the DMMU are connected to a direct memory access (DMA) controller, which serves as a communication bridge between the AMBA[9] bus interface and the memory system.
Fig. 1 Overall architecture of LILY processor
2.1 Architecture features

The LILY is a six-way, 32-bit fixed-point, exposed pipeline VLIW processor. To reduce the register file cost, the architecture uses a hierarchical register file connectivity cluster (RFCC)[3] model. Each functional unit has private ports to access one of the local register files (LRF) and all functional units share the global register file (GRF). To reduce the program code size, the instruction format resembles a 16-bit/32-bit variable length RISC (reduced instruction set computer) instruction set to achieve high code density.

Traditional VLIW architectures require each VLIW packet to contain the maximum number of instructions, even if there is insufficient instruction-level parallelism to actually issue that number of instructions per cycle. In contrast, the VLIW instruction words of LILY are stored in a compressed format in memory. This format removes instructions for functional units scheduled to have a no operation (NOP) instruction.

Conventional VLIW architectures may require a wide instruction word to specify which functional unit will execute the instruction[7,8]. Instructions are dispatched to the corresponding functional unit according to the functional unit assignment field in the operation code. Mapping an instruction to different functional units then requires different operation codes, even though the instructions have the same function. For instance, in the TMS320C64x DSP[7], the addition (ADD) instruction can be mapped to six functional units, so the ADD has six different operation codes corresponding to the different functional units. This generates redundant instruction information, which reduces the code density. To address this problem, the dispatch unit of the LILY processor dynamically dispatches instructions to functional units according to the instruction type, so that an instruction has a uniform operation code without a functional unit assignment field. This orthogonal instruction set maintains a compact code density for applications.

Conventional VLIW architectures indicate which instructions are issued in parallel by an explicit parallel field in the instruction word[6-8], which costs at least one bit of the operation code. In contrast, the LILY processor uses an implicit parallel index to compress the code size. The instruction type and the register file assignment field in the instruction code are combined to generate an implicit instruction parallel index (IPI). The IPI is used to indicate which instructions are to be issued in parallel. The details of the architecture and instruction set are discussed in the following sections.

2.2 Pipeline

The cycle time and the core pipeline are balanced and optimized for high throughput. By minimizing the latency of the most frequent operations, the dead time in the overall computation is reduced. The DSP core pipeline has ten stages, as seen in Fig. 2, and the pipeline stages have the following functions.

Fig. 2 LILY pipeline
Program counter generate stage (PCG): It generates the next program counter (PC) address to be sent to the instruction memory system.

Program counter send stage (PCS): It sends the PC address to the instruction memory system.

Processor wait stage (PWT): It waits for the instruction memory system to get a 256-bit instruction fetch-packet.

Fetch-packet receive stage (FPR): It fetches 256 bits of instructions each cycle, which is called the instruction fetch-packet. The DSP core receives a new instruction fetch-packet from the instruction memory system.

Dynamic dispatch stage (DDP): It dynamically dispatches instructions to functional units according to the instruction type. The parallel index generating unit combines the instruction type and the register file assignment field in the instruction code to generate the IPI. Based on this IPI, the dispatch unit dispatches up to six instructions in parallel to the appropriate functional units.

Instruction decode stage (IDC): It decodes any pending instructions in the functional unit and reads the source register file. For branch instructions, the program address is calculated and sent to the PCG stage to generate an appropriate PC address. A branch therefore takes six cycles before the target instruction, whose address is generated in the PCG stage as a result of the branch, arrives at the EX1 stage.

Execution stage 1 (EX1): For instructions that take only a single cycle to execute, such as most arithmetic operations, it operates on the values read from the source register files and writes the result to the destination register file. For instructions that take multiple cycles to execute, such as multiply operations, EX1 operates on the values read from the source register files and sends the intermediate result to EX2.

Execution stages 2, 3, 4 (EX2, EX3, EX4): For instructions that take multiple cycles to execute, the result is written to the destination register file in the last execution stage.

For a memory load instruction, the DSP core sends the data access address to the data memory system in EX2 and waits for the data memory system to get the appropriate data. In EX4, the data loaded from memory is written to the destination register. For a memory store instruction, the data to be stored and the corresponding addresses are sent to the data memory system in the EX2 stage.

2.3 Datapath
Figure 3 shows the block diagram of the LILY architecture datapath. The architecture has a clustered register file. Each local register file has twenty-four 32-bit registers, four write ports, and seven read ports, which enable parallel reads/writes of source/destination operands from three functional units. The global register file has eight 32-bit registers, six write ports, and ten read ports, with all six functional units sharing the write ports and read ports. Due to the limited number of register file ports, the software programmer and the compiler must assign the read/write ports to each functional unit. Each register file is divided into two sections (odd/even registers) to enable 40-bit/64-bit reads and writes to registers without requiring extra read/write ports.
Fig. 3 LILY datapath
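The odd/even register split described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the paper does not say which register of a pair holds the low word, so the model assumes the even register holds the low 32 bits and the following odd register holds the high 32 bits. The class and method names are hypothetical and are not part of the LILY design.

```python
class RegisterFile:
    """Toy model of a 32-bit register file with odd/even register pairs.

    Assumption (not stated in the paper): for a 64-bit access, the even
    register holds the low word and the next odd register the high word.
    """

    def __init__(self, size):
        self.regs = [0] * size

    def write32(self, idx, value):
        self.regs[idx] = value & 0xFFFFFFFF

    def read32(self, idx):
        return self.regs[idx]

    def write64(self, even_idx, value):
        if even_idx % 2 != 0:
            raise ValueError("64-bit accesses must start at an even register")
        # One access touches one even and one odd register, so each
        # section (odd/even) only needs to serve a single 32-bit write.
        self.write32(even_idx, value)
        self.write32(even_idx + 1, value >> 32)

    def read64(self, even_idx):
        if even_idx % 2 != 0:
            raise ValueError("64-bit accesses must start at an even register")
        return (self.read32(even_idx + 1) << 32) | self.read32(even_idx)


# Sizes taken from the text: 24 registers per local file, 8 in the global file.
local_rf = RegisterFile(24)
global_rf = RegisterFile(8)
local_rf.write64(2, 0x1122334455667788)
```

Because a register pair always spans both sections, a wide access can be served by one port on each section, which is consistent with the claim that no extra read/write ports are needed.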
The LILY processor has six functional units classified into three functional unit types (A, M, and D). The A, M, and D types support 274 instructions, with 88 16-bit instructions and 186 32-bit instructions. The datapaths are capable of executing six instructions per cycle. The functional units can execute several operations in parallel, such as six 32-bit additions/subtractions, six packs/unpacks, four 16-bit multiplies, four 16-bit multiply-accumulations, eight 8-bit multiplies, two 16-bit dot products, bit permutation operations, shifts, and comparisons. The A units (XA, YA) support all the arithmetic operations, logical operations, and shift/rotate operations. The M units (XM, YM) support multiply operations and most arithmetic operations. The D units (XD, YD) are responsible for calculating memory locations for load/store operations, are in charge of branch operations and compare operations, and support a subset of the arithmetic operations.

2.4 Memory architecture
The one-level memory system in the LILY processor consists of a 16-KB instruction and data cache and a 16-KB instruction and data on-chip RAM. Both the instruction cache and the on-chip RAM are connected to the PMMU, and the DMMU controls the data cache and the on-chip RAM. The instruction memory system provides up to 256-bit instruction fetch-packets to the DSP core each cycle. The instructions can be fetched from the direct mapped L1 I-cache or the on-chip RAM according to the memory map table in the PMMU. The instruction memory system is organized in two pipeline stages to optimize access to the data bank in the instruction memory. The data memory system is capable of handling two 64-bit aligned load or store operations in parallel. The data can be loaded/stored from/to the L1 D-cache or the on-chip RAM according to the memory map table in the DMMU. Figure 4 shows the organization of the dual-ported, variable-way set-associative L1 D-cache, which can be dynamically configured as two-way/four-way set associative. The L1 D-cache can also work in a low power mode, in which the memories in way 2 and way 3, indicated by the dashed lines in Fig. 4, can be configured either as cache memories or as on-chip buffers.

Fig. 4 Organization of the L1 D-cache
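To make the effect of the two-way/four-way configuration concrete, the number of cache sets can be computed from the capacity, line size, and associativity. This is a rough sketch only: the paper does not give the D-cache line size, so the 32-byte line used below is an assumption, as is the idea that the full 16 KB stays in use as cache in both configurations. The function name is hypothetical.

```python
def cache_sets(capacity_bytes, line_bytes, ways):
    """Number of sets in a set-associative cache of the given geometry."""
    bytes_per_set = line_bytes * ways
    if capacity_bytes % bytes_per_set != 0:
        raise ValueError("capacity must be a multiple of line_bytes * ways")
    return capacity_bytes // bytes_per_set


CAPACITY = 16 * 1024   # 16 KB, from the text
LINE_BYTES = 32        # assumed line size; not stated in the paper

sets_two_way = cache_sets(CAPACITY, LINE_BYTES, 2)
sets_four_way = cache_sets(CAPACITY, LINE_BYTES, 4)
```

Under these assumptions, doubling the associativity halves the number of sets, so one more address bit moves from the index into the tag when the cache is switched from two-way to four-way operation.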
3 Instruction Set

The LILY processor supports 274 instructions, including arithmetic operations, logic operations, multiply operations, shift operations, compare operations, and bit permutation operations. To achieve a high code density for a small program code size, the instruction format resembles a 16-bit/32-bit variable length instruction set.

3.1 Instruction format
The LILY processor instruction set includes 16-bit and 32-bit instructions. Most frequently used instructions, such as addition, subtraction, and and/or, have both 16-bit and 32-bit instruction formats. As seen in Fig. 5, the 16-bit instruction format has up to two register address fields. The destination register address field within the 16-bit instructions is used to indicate one source register and the destination register. Most 16-bit instructions have only a 4-bit register address field, and these instructions can access only half the register file, which consists of half the local register file and half the global register file. The 4-bit instruction type field is used by the dispatch unit to distinguish the 16-bit and 32-bit instruction formats and to dynamically dispatch instructions to the appropriate functional units. When the first two bits of the “op_t” field are not “11”, as seen in Fig. 6, the dispatch unit dispatches the 16-bit instructions to the appropriate functional unit. The dispatch unit dispatches 32-bit instructions when the first two bits of the “op_t” field are “11”. The LILY instruction set picks up the parallel information from the instruction type and the register file assignment field. Figure 6 shows that the instruction type is used to generate an instruction parallel index to indicate the instruction position in the instruction slot. The instructions are grouped into two slots, slot x and slot y, according to the register file assignment field, with each slot holding up to three instructions. Slot x and slot y can be executed concurrently, and the instructions in the same slot can be dispatched in parallel.
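The width rule above (a 32-bit instruction when the first two bits of the "op_t" field are "11", otherwise a 16-bit instruction) can be sketched as a simple packet splitter. The sketch makes assumptions that the paper does not confirm: the 4-bit "op_t" field is taken to occupy the top four bits of the first 16-bit halfword, and the first halfword is taken to be the upper half of a 32-bit instruction. The function name is hypothetical, and the slot x/slot y grouping by register file assignment is not modeled.

```python
def split_fetch_packet(halfwords):
    """Split a sequence of 16-bit halfwords into (width, value) instructions.

    Assumed layout (illustrative only): op_t is bits 15..12 of the first
    halfword; a 32-bit instruction is the first halfword followed by the next.
    """
    instructions = []
    i = 0
    while i < len(halfwords):
        first = halfwords[i] & 0xFFFF
        op_t = (first >> 12) & 0xF
        if (op_t >> 2) == 0b11:
            # First two bits of op_t are "11": 32-bit instruction.
            if i + 1 >= len(halfwords):
                raise ValueError("32-bit instruction is cut off at the packet end")
            second = halfwords[i + 1] & 0xFFFF
            instructions.append((32, (first << 16) | second))
            i += 2
        else:
            # Any other op_t prefix: 16-bit instruction.
            instructions.append((16, first))
            i += 1
    return instructions


decoded = split_fetch_packet([0x1234, 0xC000, 0x0001, 0xBFFF])
```

A 256-bit fetch-packet contains sixteen halfwords, so under this model it holds between eight instructions (all 32-bit) and sixteen instructions (all 16-bit).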
Fig. 5 LILY instruction format

Fig. 6 Instruction type and instruction parallel index

3.2 Instruction set
The LILY instruction set supports a load-store architecture, which loads data from memory into local or global register files before processing and stores data back to memory from the register files after processing. The instruction set provides a rich set of operations for digital signal processing and multimedia processing.

To exploit the instruction-level parallelism, the LILY instruction set provides conditional execution instructions to support compiler optimization and software pipelining techniques[10]. Almost all the 32-bit instructions have a 3-bit conditional register address field, which specifies one of the seven conditional registers. A 1-bit z field, shown in Fig. 5, in the 32-bit instructions indicates whether the conditional register is tested for zero or nonzero when conditional execution occurs.

The LILY instruction set also provides instructions with four operands. Thus, the instruction set can execute operations that require three source operands and one destination operand. These instructions include fused multiplication and addition/subtraction, fused shift and addition/subtraction, dot product with addition/subtraction, and so on. These instructions are important for modern multimedia applications.

DSP applications have many 8-bit and 16-bit operations. Such operations cannot fully utilize the datapath on their own. Therefore, the LILY instruction set provides single instruction multiple data (SIMD) instructions to take full advantage of the 32-bit datapath. These include addition, subtraction, average value, shift, compare, multiply, multiply-accumulation, and dot product. Within the SIMD operations, packed data types store either four 8-bit values or two 16-bit values in a single 32-bit register, or four 16-bit values in a 64-bit register pair.

In addition to SIMD instructions, the instruction set includes byte- and bit-permutation instructions. The NORM instruction, for instance, which counts consecutive zero/one bits, and the EXT instruction, which extracts bits from a given register, are useful in variable-length decoding in multimedia applications. The shuffle/pack/unpack instructions manipulate bytes of data for motion compensation in video applications. With these instruction set enhancements, multimedia applications can achieve significantly better performance on the LILY processor than on conventional microprocessors.
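The packed data types used by the SIMD instructions can be illustrated with a lane-wise addition on a 32-bit word. This is a behavioral sketch, not the LILY instruction semantics: it assumes wrap-around (modulo) arithmetic in each lane, whereas the paper does not say whether the SIMD additions wrap or saturate. The function name is hypothetical.

```python
def packed_add(a, b, lane_bits):
    """Add two 32-bit words lane by lane (lane_bits = 8 or 16).

    Assumption: each lane wraps around on overflow and carries never
    propagate into the neighboring lane.
    """
    if lane_bits not in (8, 16):
        raise ValueError("lane_bits must be 8 or 16")
    mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, 32, lane_bits):
        lane_sum = ((a >> shift) & mask) + ((b >> shift) & mask)
        result |= (lane_sum & mask) << shift
    return result


two_lanes = packed_add(0x0001FFFF, 0x00010001, 16)
four_lanes = packed_add(0x01FF7F10, 0x01010101, 8)
```

One packed addition of this kind replaces two 16-bit or four 8-bit scalar additions, which is how narrow DSP data types can keep the 32-bit datapath fully occupied.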
4 Compiler

The compiler developed for the LILY architecture is a retargetable compiler ported from an open research compiler named Open64[11]. As shown in Fig. 7, Open64 consists of a front end, a machine independent optimizer, and a back end. The machine description part provides information about the target machine for the compiler to use throughout the whole flow. The porting focuses on the back end and the machine description parts.

Fig. 7 Basic compiler architecture

The machine description part includes information about instruction operation codes, the operand type for each instruction, the number of operands in each instruction, the register types, the number of registers of each type, the assembly language syntax of each instruction, and the target processor resources. Information about the LILY processor is added to the machine description files to support the variable length instruction set.

The compiler back end, also called the code generator, can be divided into four parts. The instruction selection part translates the very low level intermediate representation into instructions from the target machine instruction set. The intermediate representation is thereby mapped into LILY-specific instructions. In the instruction scheduling phase, a preprocessing algorithm[3] is used to improve the instruction level parallelism and achieve high performance for the LILY architecture. The global registers are mainly used for data exchange, so the local registers must have higher priority than the global registers. In the register allocation part, the compiler calculates the live range of each register type operand, and assigns these operands to registers. An extended graph coloring model[12] is used to solve the register allocation problem in a multiple register file architecture. When there are not enough registers, new store and load operations are inserted to spill a symbolic register to memory and then restore it to a register later, and the destination operands of the load instructions are assigned to global registers.

The assembly code is output in the instruction emission part, where modifications are made to ensure that all the output assembly code is in the form specified by the LILY architecture. Open64 provides four levels of optimization (O0, O1, O2, and O3). The compiler currently produces results for every configuration at the O0 optimization level, with code generation for the O1, O2, and O3 levels expected to be available soon.
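The preference for local over global registers during allocation can be sketched with a plain greedy graph coloring. This is not the extended graph coloring model of Ref. [12], whose details are not given here. It is a simplified stand-in that only shows the priority order and the point at which a spill becomes necessary. All names are hypothetical, and the sketch does not model the reload of spilled values into global registers.

```python
def allocate_registers(interference, local_regs, global_regs):
    """Greedy coloring of an interference graph.

    interference: dict mapping each symbolic register to the set of
    symbolic registers that are live at the same time.
    Local registers are tried first, global registers second, and a
    symbolic register is spilled when neither kind is free.
    """
    preference = list(local_regs) + list(global_regs)
    assignment = {}
    spilled = []
    # Color the most constrained (highest degree) nodes first.
    order = sorted(interference, key=lambda v: len(interference[v]), reverse=True)
    for v in order:
        taken = {assignment[n] for n in interference[v] if n in assignment}
        choice = next((r for r in preference if r not in taken), None)
        if choice is None:
            spilled.append(v)  # would need store/load code inserted
        else:
            assignment[v] = choice
    return assignment, spilled


graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": set(),
}
assignment, spilled = allocate_registers(graph, ["L0"], ["G0"])
```

In this toy graph, three mutually interfering values compete for one local and one global register, so one of them must be spilled, while the independent value "d" can reuse the local register.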
5 Evaluation

The LILY processor operates at 300 MHz and was synthesized with the ARM 0.13 μm Metro Lib[13] under worst case conditions, with about 64 KB of on-chip memory. The gate count for the entire processor logic without memory is about 260 K NAND2 gates, of which the core control logic is about 19%, the datapath logic is about 35%, and the cache control logic is about 46%.

The processor performance was evaluated using a set of benchmarking algorithms, including those in the BDTI DSP benchmarking suite[14] and several multimedia applications, with hand-written assembly code. Table 1 compares the kernel cycle counts of LILY with several commercial digital signal processors, and the performance for multimedia applications is shown in Table 2. As seen in Tables 1 and 2, the instruction set enhancements combined with the static instruction-level parallelism and software pipelining enable the six-way VLIW LILY processor to obtain high performance on digital signal processing and multimedia applications. Table 3 compares the program code size. With the 16-bit/32-bit variable length instruction set and dynamic functional unit dispatching, the LILY processor can achieve high code density and small program code size.

Table 1 Signal processing benchmark kernel cycle counts (cycles)

Benchmark kernel                       StarCore SC1400   TI C64x   CEVA X1640   LILY
40-sample, 16-tap complex block FIR    675               674       1333         448
Two biquad IIR                         9                 16        9            13
Radix-2, 256-point complex FFT         1631              1246      1248         1472
Vector dot product                     16                25        19           19
Vector add                             19                27        18           21
Vector maximum                         27                36        22           19
8×8 FDCT                               -                 506       -            562
8×8 IDCT                               -                 614       -            643

Table 2 Application performance of LILY

Application                            Performance
AMR-NB decode                          0.98M cycles
H.263 baseline encode, 15 fps, QCIF    9.4M cycles
H.263 baseline decode, 15 fps, QCIF    5.5M cycles

Table 3 Signal processing benchmark kernel program code size (Bytes)

Benchmark kernel                       StarCore SC1400   TI C64x   CEVA X1640   LILY
40-sample, 16-tap complex block FIR    234               568       304          292
Two biquad IIR                         76                128       80           90
Radix-2, 256-point complex FFT         630               1000      1248         318
Vector dot product                     64                120       110          96
Vector add                             60                100       100          71
Vector maximum                         170               252       196          72
8×8 FDCT                               -                 980       -            528
8×8 IDCT                               -                 968       -            584

6 Conclusions

The architecture of the LILY processor, a 300-MHz six-way VLIW DSP, has been presented. A compiler based on Open64 was developed for this architecture. The major architectural features, the instruction set, the compiler, and the capabilities for digital signal processing and multimedia processing have been described in detail. The LILY processor performance for audio/video processing and digital signal processing has demonstrated that the processor can provide high performance on a range of applications. With the 16-bit/32-bit variable length instruction set and dynamic functional unit dispatching, the LILY processor achieves high code density and small program code size.

References
[1] Rau B, Fisher J. Instruction-level parallel processing: History, overview, and perspective. Journal of Supercomputing, 1993, 7(21): 9-50.
[2] Rixner Scott, Dally William J, Brucek Khailany, et al. Register organization for media processing. In: Sixth International Symposium on High-Performance Computer Architecture. Toulouse, France, 2000: 375-386.
[3] Zhou Zhixong, He Hu, Sun Yihe, et al. A 2-dimension force-directed scheduling algorithm for register-file-connectivity VLIW architecture. In: Proceedings of 18th IEEE Conference on Application-Specific System, Architecture and Processor. Montreal, Canada, 2007: 371-376.
[4] Yuan Xie, Wolf W, Lekatsas H. Code compression for embedded VLIW processors using variable-to-fixed coding. IEEE Transactions on Very Large Scale Integration Systems, 2006, 14(5): 525-536.
[5] BDTI insight, analysis, and advice on signal processing technology. http://www.bdti.com/, 2006.
[6] SC140 DSP core reference manual. http://www.freescale.com/, 2005.
[7] TMS320C6000 CPU and instruction set reference guide. http://www.ti.com/, 2005.
[8] CEVA-X1641. http://www.ceva-dsp.com/products/cores/ceva-x1641.php, 2004.
[9] ABMA bus. http://www.abma.com, 2004.
[10] Stoodley M G, Lee C G. Software pipelining loops with conditional branches. In: Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture. Paris, France, 1996: 262-273.
[11] The open research compiler. http://www.open64.net/, 2008.
[12] Zhou Zhixong, He Hu, Sun Yihe, et al. A retargetable compiler of VLIW ASIP for media signal processing. In: Proceedings of 2006 International Conference on Embedded Systems & Applications. Las Vegas, USA, 2006: 46-49.
[13] ARM 0.13μm metro lib. http://www.arm.com/products/physicalip/productsservices.html, 2007.
[14] BDTI benchmark results. http://www.bdti.com/, 2005.