Journal of Signal Processing Systems 51, 209–223, 2008

* 2007 Springer Science + Business Media, LLC. Manufactured in The United States.

DOI: 10.1007/s11265-007-0061-x

Design and Implementation of a High-Performance and Complexity-Effective VLIW DSP for Multimedia Applications

TAY-JYI LIN, SHIN-KAI CHEN, YU-TING KUO AND CHIH-WEI LIU
Department of Electronics Engineering, National Chiao Tung University, Hsinchu, Taiwan

PI-CHEN HSIAO
SoC Technology Center, Industrial Technology Research Institute, Hsinchu, Taiwan

Received: 13 September 2006; Revised: 9 March 2007; Accepted: 17 March 2007

Abstract. This paper presents the design and implementation of a novel VLIW digital signal processor (DSP) for multimedia applications. The DSP core embodies a distributed and ping-pong register file which, by restricting its access patterns, saves 76.8% of the silicon area and improves the access time by 46.9% compared with the centralized register files found in most VLIW processors, while retaining performance (estimated in cycles) comparable to state-of-the-art DSPs for multimedia applications. A hierarchical instruction encoding scheme is also adopted to reduce the program sizes to 24.1–26.0% of the original. The DSP has been fabricated in the UMC 0.13 μm 1P8M Copper Logic Process; it operates at 333 MHz while consuming 189 mW, and the core size is 3.2 × 3.15 mm² including 160 KB of on-chip SRAM.

Keywords: VLIW, digital signal processor, register organization, instruction encoding, micro-architecture

1. Introduction

Programmable processors are attractive in embedded multimedia systems because software-based systems need less development effort: bugs can be fixed with field patches, and products can be upgraded to support new standards with software alone. These factors reduce the time-to-market, extend the time-in-market, and thus help to maximize profits. However, today's multimedia applications, including speech, audio, image, and video processing, demand extremely high computing power, and processors must exploit the inherent parallelism extensively to meet the requirements. For instruction-level parallelism, VLIW architectures [1] offer low-cost static instruction scheduling (by compilers or assembly programmers) and thus deterministic execution time compared with dynamically scheduled superscalar processors, and they have become the mainstream of high-performance DSP designs [2, 3]. However, VLIW processors have two major weaknesses that prevent the integration of more functional units at higher instruction issue rates: the dramatically growing register file complexity and the poor code density. This paper proposes a distributed and ping-pong register organization, which saves 76.8% of the silicon area and 46.9% of the access time of the non-scalable centralized register files in conventional VLIW processors.

A hierarchical instruction-encoding scheme is also proposed to reduce the program sizes to 24.1–26.0% of the original. Figure 1 shows the VLIW DSP architecture with packed instructions and the clustered architecture (Pica), which integrates up to four clusters depending on application requirements. The clusters can operate independently, exploiting the data-level parallelism present in most DSP applications, or inter-cluster communication can be carried out via the memory subsystem. We have implemented a prototype of the Pica DSP with two identical clusters in the UMC 0.13 μm 1P8M Copper Logic Process. The chip operates at 333 MHz and consumes 189 mW average power; its core size is 3.2 × 3.15 mm² including the 160 KB on-chip memory.

The rest of this paper is organized as follows. Section 2 describes clustered architectures and the proposed distributed and ping-pong register file, which reduce the datapath complexity. Section 3 describes the hierarchical instruction encoding that improves the code density of VLIW processors. The design and verification flow of programmable processors is summarized in Section 4, while the implementation results, from instruction set simulation to silicon proof, are given in Section 5. Section 6 concludes this work and outlines future research.

Figure 1. Pica DSP architecture: an instruction memory feeding instruction fetch and instruction unpack (dispatch & pre-decode) stages and a program control unit (program sequencer and interrupt controller), plus up to four clusters, each containing a load/store unit with address registers, an arithmetic unit with accumulators, ping and pong register banks, and status/predicate registers, sharing the data memory.

2. Register Organization

2.1. Clustered Architectures

A register file provides temporary storage and interconnections for the parallel functional units (FU) in a microprocessor. As more and more FUs are integrated into a microprocessor, the complexity of the register file grows rapidly. For a centralized register file serving N FUs, where each FU can read or write any register directly, the area complexity grows as N^3, the access delay as N^(3/2), and the power dissipation as N^3 [4]. Therefore, the register files in modern wide-issue microprocessors are usually partitioned to reduce the hardware complexity [5-12]. Figure 2 shows a generic clustered architecture, in which each FU has direct access only to a subset of the register file. If data need to be passed across clusters, additional wiring and execution cycles are necessary for inter-cluster communication (ICC).

Figure 2. Clustered architecture: N clusters, each with a partitioned register file and its own functional units, connected by a switch for inter-cluster communication.

ICC mechanisms can be classified into three categories [5, 6]. The first uses "copy" instructions, issued through existing instruction slots [7] or dedicated slots [8], to copy variables from (or to) the register files of other clusters. The former can reduce the number of additional access ports on each of the partitioned register files, but it wastes some effective instruction cycles.

The second ICC category is through extended register accesses, where each cluster has limited "read" accesses from the other partitioned register files [9] (or "write" accesses to them [10]). The third kind of ICC mechanism is via common storage resources, such as shared memory or specific shared registers [11]. Table 1 shows the comparison of ICC mechanisms [6]. The function f(x, y, n, w) denotes the hardware complexity of a centralized register file with x read ports, y write ports, and n w-bit registers, whose cell-based implementation requires n registers, x n-input multiplexers (n:1 MUX) for the read ports, and n y:1 MUX with some glue control for the write ports. The glue control is negligible when w is much larger than y. Thus, the area complexity of f can be approximated by the area of [x(n-1) + n(y-1)] × w 2:1 MUX plus n × w flip-flops, while the time complexity can be approximated as the delay of log2(n) 2:1 MUX for the read ports, or the delay of log2(y) 2:1 MUX plus the write control overhead for the write ports. The register file complexity of a clustered architecture with dedicated ICC copy slots can be described as c × f(x/c + p, y/c + p, n/c, w) + g_CP(c, p, w), where c denotes the number of clusters, each of which has p extra ports, and the g function describes the interconnection complexity of the ICC switch network. Architectures without dedicated copy slots have the same register file complexity as those with extended-read ICC, c × f(x/c + p, y/c, n/c, w) + g_ER(c, p, w), where no extra write port is needed. Similarly, the register file complexity of extended-write ICC can be described as c × f(x/c, y/c + p, n/c, w) + g_EW(c, p, w). We have adopted the ICC mechanism through the on-chip data memory [6]: when a load/store pair in a VLIW packet has an identical memory address, the data from the "store" is directly forwarded to the "load". The proposed ICC mechanism with load/store pairs has the lowest complexity, c × f(x/c, y/c, n/c, w) + g_LS(c, p, w). Note that the existing crossbar in the memory subsystem can be reused for ICC, so the g complexity can be absorbed.

Table 1. Comparison of ICC mechanisms.

    ICC mechanism     HW complexity
    Centralized       f(x, y, n, w)
    Copy              c × f(x/c + p, y/c + p, n/c, w) + g_CP(c, p, w)
    Extended read     c × f(x/c + p, y/c, n/c, w) + g_ER(c, p, w)
    Extended write    c × f(x/c, y/c + p, n/c, w) + g_EW(c, p, w)
    Load/store pair   c × f(x/c, y/c, n/c, w) + g_LS(c, p, w)
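To make the comparison concrete, the following C sketch evaluates the approximate area term of f(x, y, n, w) derived above (2:1-MUX equivalents plus flip-flops) for a centralized 16-read/12-write file of 64 registers versus two 8-read/6-write files of 32 registers, the configurations that reappear in Table 4. It merely restates the formula; the g interconnect terms and all technology-dependent weights are ignored.

    #include <stdio.h>

    /* Approximate area of a centralized register file with x read ports,
       y write ports and n w-bit registers, counted as 2:1-MUX equivalents
       plus flip-flops (the approximation of f in Section 2.1; glue and
       interconnect terms are ignored). */
    static long f_area(long x, long y, long n, long w)
    {
        long mux2 = (x * (n - 1) + n * (y - 1)) * w;  /* read + write muxing */
        long ff   = n * w;                            /* storage flip-flops  */
        return mux2 + ff;
    }

    int main(void)
    {
        long centralized = f_area(16, 12, 64, 32);    /* one shared file        */
        long clustered   = 2 * f_area(8, 6, 32, 32);  /* one file per cluster,
                                                         load/store-pair ICC   */
        printf("centralized : %ld units\n", centralized);
        printf("clustered   : %ld units (%.1f%% of centralized)\n",
               clustered, 100.0 * clustered / centralized);
        return 0;
    }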

2.2. Distributed and Ping-Pong Register File

Besides global clustering, the register file complexity of each cluster can be further reduced. Figure 3 shows a cluster of Pica with a load/store unit (LS), an arithmetic unit (AU), and the proposed distributed and ping-pong register file with 32 registers. The 32 registers are divided into four independent banks, and each bank is equipped with the access ports of a single functional unit (FU) only (i.e. two "read" and two "write" ports in our case). The address registers (a0-a7) and the accumulators (ac0-ac7) are dedicated to LS and AU respectively, and they are not visible to the other FU. The remaining 16 registers are shared between the two FUs and are divided into two banks with exclusive access: when LS accesses the ping bank, AU can only access the pong bank, and vice versa. Each VLIW packet of Pica must specify its access mode (either ping or pong) with a corresponding bit (i.e. the "ping-pong index" described later) in its encoding.

Pica supports powerful SIMD instructions based on the proposed distributed and ping-pong register file, for example the double load (store) word instruction:

    dlw dm, (ai)+k, (aj)+l

It performs two memory accesses (i.e. dm <- Mem32[ai] and dm+1 <- Mem32[aj]) and two address updates (i.e. ai <- ai + k and aj <- aj + l) simultaneously. The index m must be an even number, with m + 1 implicitly specified.
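As an illustration of the double-load semantics just described, the following C fragment models what dlw dm, (ai)+k, (aj)+l does to the architectural state. It is a behavioural sketch only; the struct layout, the flat word-indexed memory array, and the byte-to-word index conversion are modelling assumptions, not the production simulator.

    #include <stdint.h>

    typedef struct {
        uint32_t a[8];   /* address registers a0..a7 (private to LS)            */
        uint32_t d[16];  /* ping/pong data registers, modelled as one flat bank */
    } ls_state_t;

    /* dlw d_m, (a_i)+k, (a_j)+l :
       two 32-bit loads plus two post-increment address updates.
       m must be even; d_{m+1} is implicit.  Addresses are byte addresses,
       assumed word-aligned for this sketch. */
    void dlw(ls_state_t *s, const uint32_t *mem32, int m, int i, int k, int j, int l)
    {
        s->d[m]     = mem32[s->a[i] / 4];   /* d_m     <- Mem32[a_i] */
        s->d[m + 1] = mem32[s->a[j] / 4];   /* d_{m+1} <- Mem32[a_j] */
        s->a[i]    += (uint32_t)k;          /* a_i <- a_i + k        */
        s->a[j]    += (uint32_t)l;          /* a_j <- a_j + l        */
    }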

Figure 3. Distributed and ping-pong register organization: address registers a0-a7 private to the load/store unit (LS), accumulators ac0-ac7 private to the arithmetic unit (AU), and the shared ping bank d0-d7 and pong bank d'0-d'7 between them.


These double load/store instructions require six concurrent accesses of the register file (two "reads" and four "writes" for dlw, or four "reads" and two "writes" for dsw). They do not cause access conflicts on the distributed and ping-pong register file, because ai and aj are private address registers while dm and dm+1 are ping-pong registers that deliver data between LS and AU; these registers reside in independent banks. The DSP also supports sub-word (16-bit) SIMD multiply-accumulation with full-precision results on two 40-bit accumulators aci and aci+1:

    fmac.d aci, dm, dn

It performs aci <- aci + dm.H × dn.H (dm.H and dn.H denote the upper 16 bits of dm and dn respectively) and aci+1 <- aci+1 + dm.L × dn.L (dm.L and dn.L denote the lower 16 bits) in parallel. Similarly, the index i must be an even number, with i + 1 implicitly specified. This instruction requires six concurrent accesses of the register file (i.e. four "reads" and two "writes"). Without the proposed ping-pong register file, Pica would need a 14-port centralized register file for the 14 simultaneous register accesses (i.e. four "reads" and four "writes" for LS, and four "reads" and two "writes" for AU). Finally, each of the four four-port banks (i.e. with two "reads" and two "writes") in Fig. 3 can be further partitioned into even and odd banks, so the distributed and ping-pong register file can be implemented with eight smaller banks, each of which has only two "read" ports and one "write" port. We define f_DPP(n, w) = 8 × f(2, 1, n/8, w) + g_DPP(w) as the hardware complexity of a distributed and ping-pong register file with n w-bit registers. The f function is that of the centralized register file described in Section 2.1, and g_DPP denotes the complexity of the mapping between logical and physical ports. The port mapping contains six 6:1 MUX and two 3:1 MUX (for rm+1 and ri+1 in LS and AU respectively) for the read ports, and one 2:1 MUX (even private bank), one 4:1 MUX and one 2:1 MUX (odd private banks), two 3:1 MUX (even ping-pong banks), and two 6:1 MUX (odd ping-pong banks) for the write ports. In addition, the ping-pong registers can be extended to support a special ICC mechanism via register permutations, such as the "ring-structure register organization" proposed in our previous work [12].
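The sub-word MAC just described can be modelled in C as below. The sketch treats the registers as plain integers, keeps the accumulators to 40 bits by explicit wrapping, and ignores any fractional (Q15) scaling or saturation the real datapath may apply; the struct and helper names are ours.

    #include <stdint.h>

    typedef struct {
        int64_t ac[8];   /* 40-bit accumulators ac0..ac7, held in int64_t */
        int32_t d[8];    /* 32-bit ping or pong data registers            */
    } au_state_t;

    static inline int16_t hi16(int32_t v) { return (int16_t)((uint32_t)v >> 16); }
    static inline int16_t lo16(int32_t v) { return (int16_t)(uint16_t)(v & 0xFFFF); }

    static inline int64_t wrap40(int64_t v)       /* keep 40 bits, signed */
    {
        v &= 0xFFFFFFFFFFLL;
        return (v & 0x8000000000LL) ? v - 0x10000000000LL : v;
    }

    /* fmac.d ac_i, d_m, d_n : i must be even, ac_{i+1} is implicit */
    void fmac_d(au_state_t *s, int i, int m, int n)
    {
        s->ac[i]     = wrap40(s->ac[i]     + (int64_t)hi16(s->d[m]) * hi16(s->d[n]));
        s->ac[i + 1] = wrap40(s->ac[i + 1] + (int64_t)lo16(s->d[m]) * lo16(s->d[n]));
    }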

The assembly syntax of our VLIW packet starts with the ping-pong index, followed by the instruction for each issue slot in sequence:

    ping-pong index;  i0 (for LS);  i1 (for AU);

In the following, we use FIR filtering and the discrete cosine transform (DCT), two popular DSP kernels [13], to illustrate how the proposed distributed and ping-pong register file works and to show some code optimization techniques. For simplicity, assume there are no delay slots (in a classical five-stage pipelined processor, for example, an ALU instruction that immediately follows a load cannot use the load result). The data memory is byte-addressable.

- FIR filtering

The following code segment performs 64-tap finite impulse response (FIR) filtering on 1,024 samples. The inputs, both the data samples (pointed to by X) and the coefficients (pointed to by COEF), are 16-bit fractional numbers, and the outputs (pointed to by Y) are 32-bit variables.

         PP    LS                         AU
    1    0;    li a0, COEF;               li ac0, 0;
    2    0;    li a1, X;                  nop;
    3    0;    li a2, Y;                  nop;
    4    0;    rpt 1024, 8;               li ac1, 0;
    5    1;    dlw d0, (a0)+4, (a1)+4;    fmac.d ac0, d0, d1;
    6    0;    rpt 15, 2;                 fmac.d ac0, d0, d1;
    7    1;    dlw d0, (a0)+4, (a1)+4;    fmac.d ac0, d0, d1;
    8    0;    dlw d0, (a0)+4, (a1)+4;    fmac.d ac0, d0, d1;
    9    0;    dlw d0, (a0)+4, (a1)+4;    add d0, ac0, ac1;
    10   1;    li a0, COEF;               li ac0, 0;
    11         addi a1, a1, -126;
    12         sw (a2)+4, d0;
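The computation that this listing implements can also be written in plain C for reference. The sketch below is only functional: it follows the forward-walking indexing implied by the pointer updates, accumulates in 64 bits, and omits the rounding that the DSP applies when packing the result.

    #include <stdint.h>

    #define NSAMPLES 1024
    #define NTAPS    64

    /* Reference FIR: y[n] = sum_{k=0}^{63} coef[k] * x[n+k], matching the
       forward-walking data pointer in the assembly above.  x[] must hold
       NSAMPLES + NTAPS - 1 input samples. */
    void fir64(const int16_t *x, const int16_t *coef, int32_t *y)
    {
        for (int n = 0; n < NSAMPLES; n++) {
            int64_t acc = 0;
            for (int k = 0; k < NTAPS; k++)
                acc += (int32_t)coef[k] * x[n + k];
            y[n] = (int32_t)acc;   /* truncation here; the DSP rounds */
        }
    }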

The zero-overhead loop instructions (rpt in lines 4 and 6) are carried out in the instruction dispatcher and do not consume execution cycles of the datapath. The inner loop (lines 7-8) loads two 16-bit inputs and two 16-bit coefficients into the 32-bit d0 and d1 with the SIMD load operation (i.e. d0 <- Mem32[a0] and d1 <- Mem32[a1]), while the address registers a0 and a1 are updated simultaneously (i.e. a0 <- a0 + 4 and a1 <- a1 + 4). Meanwhile, AU performs two 16-bit (SIMD) multiply-accumulations (i.e. ac0 <- ac0 + d0.H × d1.H and ac1 <- ac1 + d0.L × d1.L). After accumulating 32 32-bit products in the two 40-bit accumulators, ac0 and ac1 are added together and rounded back to the 32-bit d0 in the ping-pong registers. Finally, the 32-bit output is stored to memory by LS via d0. In this example, each output is produced in 35 cycles, and the loops can be unrolled to achieve similar performance even when the load slots are taken into account.

- Discrete Cosine Transform (DCT)

Figure 4 shows the eight-point fast DCT algorithm [14], which is extensively used in image and video processing. In this case, the number of operations per sample is larger than in the previous FIR example, so LS can assist with some simple arithmetic operations to balance the load. The first dlh (double load half-words) instruction (line 3) loads and sign-extends the first and the last 16-bit input data (i.e. x0 and x7 in Fig. 4) into the 32-bit d0 and d1. AU performs a butterfly operation on these two data by switching the ping-pong index in the succeeding VLIW packet. The butterfly operations on the other three input pairs are performed accordingly.

         PP    LS                          AU
    1    0;    li a0, X;                   nop;
    2    0;    li a1, COEF;                nop;
    3    0;    dlh d0, 0(a0), 14(a0);      nop;
    4    1;    dlh d0, 2(a0), 12(a0);      bf ac0, d0, d1;
    5    0;    dlh d0, 4(a0), 10(a0);      bf ac4, d0, d1;
    6    1;    dlh d0, 6(a0), 8(a0);       bf ac6, d0, d1;
    7    0;    lh d4, 0(a1);               bf ac2, d0, d1;
    8    1;    dlh d4, 0(a1), 2(a1);       bf d0, ac0, ac2;
    9    1;    dlh d6, 4(a1), 6(a1);       bf d2, ac4, ac6;
    10   1;    nop;                        add d3, d1, d3;
    11   1;    nop;                        xmpy16 d3, d3, d4;
    12   0;    add d4, d0, d2;             add d0, ac3, ac7;
    13   0;    sub d5, d0, d2;             add d1, ac5, ac7;
    14   0;    dsh d4, 0(a0), 8(a0);       add d2, ac5, ac1;
    15   0;    add d4, d1, d3;             xmpy16 d4, d4, d1;
    16   0;    sub d5, d1, d3;             bf ac0, ac1, d4;
    17   0;    dsh d4, 4(a0), 12(a0);      sub d3, d0, d2;
    18   0;    nop;                        xmpy16 d3, d3, d5;
    19   0;    nop;                        xmpy16 d6, d0, d6;
    20   0;    nop;                        add ac2, d3, d6;
    21   0;    nop;                        xmpy16 d7, d2, d7;
    22   0;    nop;                        add d7, d7, d2;
    23   0;    nop;                        add ac3, d3, d7;
    24   0;    nop;                        bf d0, ac1, ac3;
    25   1;    dsh d0, 10(a0), 6(a0);      bf d0, ac0, ac2;
    26   0;    dsh d0, 2(a0), 14(a0);      nop;

Figure 4. Discrete cosine transform: the eight-point fast DCT flow graph mapping the inputs x0-x7 to the outputs X0-X7 through butterfly stages and multiplications by the coefficients C1/4, S1/4, C3/8, S3/8, C3/16, S3/16, C7/16, and S7/16.

The final two butterfly operations for the even outputs (i.e. X0, X2, X4, and X6 in Fig. 4) are performed in LS instead (lines 12-17), after the coefficient multiplication, which must be performed in AU following the two butterfly operations (lines 8-11). In this example, an eight-point DCT is carried out in 24 cycles. Note that two eight-point DCTs can be performed concurrently in one cluster with SIMD instructions, so a 2D 8 × 8 DCT (i.e. equivalent to sixteen eight-point DCTs) requires only 192 cycles.

3. Instruction Encoding

VLIW architectures are notorious for their poor code density, which comes from the redundancy in (1) fixed-length RISC-like instructions, since many instructions do not actually use all of their bits; (2) fixed-length VLIW packets, whose encoding dedicates a bit-field to each issue slot, so NOPs must be filled into the unused slots; and (3) the repeated code produced by loop unrolling or software pipelining. Here, we define an instruction to be a RISC-like operation for an issue slot, and a VLIW packet (or simply a packet) as the instructions issued in the same cycle. HAT [15] is an effective variable-length instruction format that addresses the first problem. The variable-length VLIW encodings applied in TI C64 [9] and NEC SPXK5 [16] can eliminate NOPs in a packet by attaching a slot number to each instruction for runtime dispersal; in addition, a mark is needed on each instruction to denote the boundary of each variable-length packet (i.e. with a varying number of effective instructions). Indirect VLIW [8] uses a programmable microinstruction memory for the VLIW datapath (i.e. the programmable VIM); the VLIW packets are stored internally and executed via short indices, and the instructions in existing packets may be reused to synthesize new packets to reduce the instruction bandwidth. Systemonic proposes an incremental encoding scheme for the prolog and the epilog of software-pipelined code [17], which helps to remove the repeated instructions. Pica uses a novel, integrated encoding scheme [18] that takes all three problems into account to improve the VLIW code density.

3.1. Variable-length Instructions

First, each instruction is variable-length coded, depending on the number of operands, the size of its immediate, the frequency of its occurrence, and whether it is conditionally executed. Note that the same operation in different issue slots need not be coded identically. Each variable-length instruction is divided into a fixed-length "head" and a variable-length "tail", as in the HAT format, to reduce the complexity of instruction alignment and dispersal. Figure 5 shows our encoding of the add/sub instructions for the arithmetic unit (AU). Note that the heads need not have the same length across different issue slots.

Figure 5. Instruction encoding for add/sub instructions: a 16-bit head plus a 0- to 32-bit tail, with a func field selecting the exact operation (e.g. 00: addi, 01: addi.d, 10: addi.q, 11: rsbi) and a DL field selecting a 4-, 8-, 16-, or 32-bit immediate.

3.2. VLIW Packets without NOP

NOP instructions are not coded in the VLIW packets of Pica. Here we attach a fixed-length CAP to each VLIW packet. The CAP has a "valid" field for centralized dispersal, in which each bit indicates whether the corresponding issue slot holds an effective instruction; NOPs are thus eliminated simply by turning the corresponding "valid" bits off. Besides, the total length of the "heads" and "tails" (HT) of a VLIW packet can easily be calculated for parallel grouping, using the "valid" field for the fixed-length "heads" and the "tail length" field for the variable-length "tails". To reduce the CAP length and the alignment complexity while retaining enough flexibility, the tail lengths are restricted to multiples of 4 bits. Figure 6 shows the 14-bit CAP format of the four-way Pica DSP.

Figure 7a gives an illustrating example with only two effective instructions. The addi instructions are first converted into machine code by looking up the encoding table in Fig. 5. The CAP is then encoded as 0 for a VLIW packet; 0101 to remove the NOPs in the first and the third instruction slots; 00010 for an eight-bit total tail length; 00 for the ping-pong indices; and 00 to disable SIMD clusters and conditional execution. The identical clusters in the VLIW/data-streaming mode of Pica can be configured for SIMD execution by asserting the S bit in the CAP: the instructions of the main cluster are replicated to reduce the code size. Figure 7b is an example, which eliminates the encoding of the instruction for the slave cluster by turning on the S bit. Besides SIMD execution, we also support differential encoding to reduce the repeated instructions in loop-unrolled or software-pipelined programs; a VLIW packet can be reused to synthesize a new one for its succeeding cycle with a small register-index offset or new ping-pong indices.
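To make the CAP mechanics concrete, the following C sketch unpacks a 14-bit CAP into the fields listed in Fig. 6 and computes the HT length used for parallel grouping. The field order (M, valid, tail length, PP, S, C from the most-significant bit down) follows the order of the encoding example above, and the HT computation assumes 16-bit heads in every slot; both are assumptions made for illustration, not a statement of the exact hardware bit assignment.

    #include <stdint.h>

    /* Decoded view of the 14-bit CAP. */
    typedef struct {
        unsigned m;         /* 1 bit : 0 = VLIW packet, 1 = scalar/program flow */
        unsigned valid;     /* 4 bits: one bit per issue slot (0 = NOP removed) */
        unsigned tail_len;  /* 5 bits: total tail length, in 4-bit units        */
        unsigned pp;        /* 2 bits: ping-pong indices                        */
        unsigned s;         /* 1 bit : SIMD clusters                            */
        unsigned c;         /* 1 bit : conditional execution                    */
    } cap_t;

    static cap_t decode_cap(uint16_t cap14)
    {
        cap_t f;
        f.m        = (cap14 >> 13) & 0x1;
        f.valid    = (cap14 >> 9)  & 0xF;
        f.tail_len = (cap14 >> 4)  & 0x1F;
        f.pp       = (cap14 >> 2)  & 0x3;
        f.s        = (cap14 >> 1)  & 0x1;
        f.c        =  cap14        & 0x1;
        return f;
    }

    /* Total head+tail (HT) length of the packet, assuming a 16-bit head
       per valid slot (heads may in fact differ per slot). */
    static int ht_length_bits(const cap_t *f)
    {
        int slots = 0;
        for (unsigned v = f->valid; v; v >>= 1)
            slots += v & 1;
        return slots * 16 + (int)f->tail_len * 4;
    }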

Figure 6. CAP in Pica: a 14-bit field comprising the 1-bit mode flag M (0: VLIW/data-streaming mode; 1: scalar/program-control mode and differential encoding), a 4-bit valid field, a 5-bit tail length, 2-bit ping-pong indices, and the 1-bit S (SIMD clusters) and C (conditional execution) flags.

Figure 7. Example of the proposed VLIW encoding: (a) an illustrating example with only two effective instructions; (b) SIMD execution by asserting the S bit in the CAP.

3.3. Instruction Bundles

To simplify the accesses of variable-length VLIW packets in the instruction memory, the packets are first packed into fixed-length instruction bundles. The fixed-length CAP and the variable-length HT of a VLIW packet are placed separately, from the two ends of an instruction bundle, as shown in Fig. 8a. Moreover, all fixed-length "heads" are placed first, in order, ahead of their variable-length "tails". The proposed code layout enables parallel packet fetch, alignment, dispersal, and decoding; it is even possible to look ahead to succeeding VLIW packets to further reduce control overheads. We attach a "#CAP" field to each bundle to indicate the number of packets in the bundle. In our simulations, the 512-bit bundle size is optimal, offering practical decoding complexity and acceptable fragment (i.e. the unused bits in a bundle that cannot accommodate a new packet). Our first prototype of the Pica DSP contains a 32-kByte on-chip instruction memory, which holds 512 512-bit bundles. We will use these parameters hereafter for simplicity.

Figure 8. (a) Instruction bundle and (b) packet extractor: a 512-bit bundle holds a #CAP field, the 14-bit CAPs packed from one end, and the heads and tails packed from the other; the 512-bit bundle register feeds a 280-bit CAP buffer and a 465-bit HT buffer, four 0/16-bit head shifters, a 0/20-bit C shifter, and a 0- to 124-bit logarithmic tail shifter, followed by the CAP, head, and tail decoders.

Instead of huge multiplexers, we use incremental and logarithmic shifters to extract VLIW packets from an instruction bundle, as shown in Fig. 8b. The CAP decoder first decodes the leading CAP (14 bits) of the CAP buffer, and that CAP is then shifted out to make room for the CAP of the next packet. The right-hand-side incremental shifters then shift out 0-4 fixed-length "heads", depending on the "valid" field of the current CAP. If the C bit in the CAP is asserted, 20 more bits are shifted out, since each instruction slot has an independent five-bit condition code. Finally, the logarithmic tail shifter shifts out all "tails" of the VLIW packet. A new VLIW packet can thus be allocated, similarly to the CAP. In brief, a CAP and an HT are continuously shifted out from the two ends of an instruction bundle, and a new VLIW packet is aligned every cycle to the boundaries of the CAP buffer and the HT buffer respectively. The lengths of the CAP and HT buffers can be estimated as follows:

    bundle capacity  = bundle size - ceil(log2(worst-case #CAP)) = 512 - ceil(log2(19.5)) = 507 (bits)
    worst-case #CAP  = bundle capacity / average length of scalar instr. = 507 / 26 = 19.5 (i.e. 20)
    CAP buffer size  = CAP size × worst-case #CAP = 14 × 20 = 280 (bits)
    HT buffer size   = bundle capacity - CAP size × ceil(bundle capacity / maximum packet length)
                     = 507 - 14 × ceil(507 / 222) = 465 (bits)

First, the capacity of a bundle to accommodate VLIW packets is calculated by deducting the "#CAP" bits from the bundle size. We assume that the worst-case number of CAPs occurs when a bundle holds only scalar instructions (each encoded as a fixed-length CAP and a variable-length "tail", as described later), which are much shorter than VLIW packets; the CAP buffer must therefore hold this worst-case number of entries, which can be approximated by the maximum number of scalar instructions of average length. The worst-case HT buffer size is the bundle capacity minus the bits that cannot possibly be "heads" or "tails" (i.e. those taken by the minimum number of CAPs in a bundle, which equals the minimum number of VLIW packets). Note that the two buffers have overlapping bits, because the boundary between the CAPs and the HTs is not deterministic. Finally, the above estimation is conservative; the buffer sizes can be chosen arbitrarily, with constraints imposed on the linker and the code generation tools to prevent illegal instruction bundles.
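The arithmetic above is easy to re-derive; the short C program below recomputes the buffer sizes from the stated parameters (512-bit bundles, 14-bit CAPs, a 26-bit average scalar-instruction length and a 222-bit maximum packet length). The width of the #CAP field is estimated here as ceil(log2(bundle size / average scalar length)), a slight simplification of the mutually dependent estimate in the text; the results (507, 280 and 465 bits) are unchanged.

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const int    bundle_size = 512;   /* bits per instruction bundle       */
        const int    cap_size    = 14;    /* bits per CAP                      */
        const double avg_scalar  = 26.0;  /* average scalar instruction length */
        const int    max_packet  = 222;   /* maximum VLIW packet length, bits  */

        int ncap_bits       = (int)ceil(log2(bundle_size / avg_scalar));      /*   5 */
        int bundle_capacity = bundle_size - ncap_bits;                        /* 507 */
        double worst_caps   = bundle_capacity / avg_scalar;                   /* 19.5 */
        int cap_buffer      = cap_size * (int)ceil(worst_caps);               /* 280 */
        int ht_buffer       = bundle_capacity
                            - cap_size * (int)ceil((double)bundle_capacity / max_packet); /* 465 */

        printf("bundle capacity = %d bits\n", bundle_capacity);
        printf("CAP buffer size = %d bits\n", cap_buffer);
        printf("HT buffer size  = %d bits\n", ht_buffer);
        return 0;
    }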

Figure 9. Instruction decoder: a 20-to-1 CAP selector replaces the CAP shifter of Fig. 8b, and a 0- to 456-bit HT aligner feeds the four 0/16-bit head shifters, the 0/20-bit C shifter, and the tail decoder in parallel with the 465-bit HT buffer.

Program flow instructions in Pica carry a bundle offset, a CAP index, and an HT pointer, so they can jump to an arbitrary VLIW packet by fetching a new instruction bundle and shifting out the unused CAPs and HTs. They are also variable-length encoded, like the effective VLIW instructions, but a variable-length program flow instruction is decomposed into a fixed-length CAP with a leading 1 (instead of a fixed-length "head" as for a VLIW instruction) and a variable-length "tail". Program flow instructions and VLIW packets are mixed together, and their CAPs, "heads", and "tails" are stuffed in order into the instruction bundles. We have developed a linking tool that automatically translates the code labels in Pica assembly (or the instruction offsets of PC-relative jump or branch instructions in compiled code) into the corresponding bundle offsets, CAP indices, and HT pointers in the machine code. Figure 9 shows the complete instruction decoder for the Pica DSP, where a CAP selector is used instead of the shifter in Fig. 8b. The "head" dispersal of a packet and the packet alignment upon branching proceed in parallel, so the sizes of the incremental shifters are significantly reduced.

4. Design and Verification Flow

Figure 10 shows a generic design flow for programmable processors, comprising the definition of an instruction set, the exploration of an optimal micro-architecture that implements the instruction set architecture (ISA), RTL authoring, and silicon or FPGA implementation. Designers move forward to the next phase once the function and the performance meet the processor specification. Although iteration is inevitable in the design process, the overall design time can still be minimized by reducing the number of iterations, especially in the major loops. In the following sections, we describe the design tasks in turn.

Figure 10. Generic design flow for programmable processors: ISA design (ISS), micro-architecture exploration (SystemC), RTL authoring, and implementation, each gated by function and performance checks (hand-coded assembly or compiled programs and instruction/cycle counts, equivalence checking against the cycle-accurate SystemC and RTL models, and timing/area/power closure).

4.1. Instruction Set Design

An instruction set characterizes the functional behavior of a processor, and software must be mapped to, or encoded in, this instruction set in order to be executed by the processor; in other words, every program is compiled into a sequence of instructions from the instruction set. Attributes associated with an instruction set design include the assembly language, the instruction format, the addressing modes, and the programming model, which are exposed to the software as perceived by compilers or assembly programmers. The instruction set may influence the design effort and the implementation complexity, so we focus not only on micro-architecture techniques but also on the instruction set design. The primitive instructions of the Pica DSP are defined by pruning the MIPS32 instruction set. DSP-enhanced instructions are added by surveying the instruction sets of state-of-the-art DSPs, such as TI C64 [9]/C55 [19], NEC SPXK5 [16], and Intel/ADI MSA [20]. Whether an instruction is included in our instruction set depends on the tradeoff between performance (in terms of the cycle counts or code sizes of applications) and implementation complexity (both cycle time and silicon area). An instruction set design can be modeled as an instruction set simulator (ISS) for early performance evaluation, which is usually the most abstract model of a programmable processor. The ISS is also very useful for the development of large-scale application software, because it hides hardware details that are not exposed to software programming.

4.2. Micro-Architecture Exploration

A micro-architecture is a specific implementation of an ISA design, and every implementation should execute any program encoded in that instruction set. Attributes associated with an implementation include the pipeline design and the memory organization. Figure 11 shows the pipeline design flow. The instructions are first classified in order to design their corresponding datapaths. In general, a processor must perform three generic tasks to carry out a computation: (1) arithmetic operations, (2) data movements, and (3) instruction sequencing. The semantics of each of the three instruction types can be specified in terms of the sub-computations performed by that instruction type.

Figure 11. Pipeline design: the instruction set is refined through instruction classification, operation factorization, pipeline balancing (subdivide or merge), and optimal forwarding to obtain the processor pipeline.

Figure 12. Pipelines for Pica: (a) classical and (b) balanced pipeline stages. In (a), the load/store pipeline (OF, EX/AG, MEM, WB) is limited by a 4.30 ns MEM stage and the arithmetic pipeline (OF, EX, WB) by a 5.35 ns MAC stage; in (b), memory access and the MAC are subdivided (OF, EX/AG, MC, MEM1, MEM2, WB and OF, EX_B, EX1, EX2, AC, WB), with all stage delays below about 2.2 ns behind the 2.08 ns instruction dispatcher.

Designers can begin with the five generic sub-computations. Instructions with distinct sequences of sub-computations can be separated into different pipelines to improve modularity [21]. One natural operation factorization is based on the five generic sub-computations: each sub-computation is mapped to a pipeline stage, resulting in a five-stage pipeline. The pipeline stages can then be balanced to minimize the internal fragment with two methods: (1) merging multiple sub-computations into one, and (2) subdividing a sub-computation into several. Figure 12a shows the initial design with two four-stage pipelines, where the critical path lies in the multiply-accumulator (MAC) for arithmetic operations. Figure 12b shows more balanced pipelines, obtained by subdividing the MAC and the memory accesses into three stages. The forwarding paths can be grouped into a separate module, as shown in Fig. 13, to improve timing closure in silicon implementation with registered output ports [22, 23]. We define the "tolerable latency" (TL) of a data dependency between consecutive in-flight instructions as the difference between the time the produced datum is ready and the time the consumer actually needs it. A dependency with a negative TL must be avoided with processor stalls or exposed explicitly as delay slots. On the other hand, there is no harm in dependencies with non-negative TLs, where the data item can be passed through either the register file or specific forwarding paths. Thus, the dependencies from all data generators to all data consumers in a pipeline can be classified as non-causal (TL < 0), timing-critical (TL = 0), or normal (TL > 0). Data for normal dependencies are passed through the register file or the forwarding unit with little or no impact on the timing of the datapath. However, the forwarding paths for timing-critical dependencies increase the number of input ports (MUX) of the functional units, which usually lengthens the critical path. Thus, the inclusion of a timing-critical forwarding path is a difficult tradeoff between the cycle time and the cycle counts of application software.
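As a small illustration of this classification, the snippet below labels a dependency from its TL, computed here as the cycle in which the consumer needs the datum minus the cycle in which the producer makes it ready, so that a negative value corresponds to the stall/delay-slot case described above. It is a bookkeeping sketch only, not part of the Pica design.

    /* Classify a data dependency by its tolerable latency (TL). */
    typedef enum { DEP_NON_CAUSAL, DEP_TIMING_CRITICAL, DEP_NORMAL } dep_class_t;

    static dep_class_t classify_dependency(int producer_ready_cycle,
                                           int consumer_needs_cycle)
    {
        int tl = consumer_needs_cycle - producer_ready_cycle;
        if (tl < 0)  return DEP_NON_CAUSAL;       /* needs a stall or delay slot    */
        if (tl == 0) return DEP_TIMING_CRITICAL;  /* forwarding on the timing-
                                                     critical path                  */
        return DEP_NORMAL;                        /* register file or cheap
                                                     forwarding suffices            */
    }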

The micro-architecture designs were modeled in SystemC [24] and validated against the ISS to ensure that they meet the functional requirements specified by the ISA. Figure 14 shows an example dump file from the cycle-accurate SystemC simulation, which is faster than RTL simulation by an order of magnitude. After the micro-architecture exploration, the final design was described in synthesizable Verilog for the implementation phase. Moreover, the RTL design has been cross-verified with the cycle-accurate SystemC model from the bottom up to achieve 100% coverage (e.g. statement, branch, state, and arc coverage) [25].

4.3. FPGA Prototyping

Figure 15 depicts a simplified Pica system for multimedia applications based on the AMBA AHB bus. We have prototyped this system on the ARM Versatile platform [26] with an ARM926EJ-S core, 128 MB of SDRAM, and a rich set of peripherals such as an LCD controller and a stereo audio codec. Pica was first encapsulated in an AHB wrapper and ported onto the Xilinx XC2V6000 FPGA on the Logic Tile daughter card. The system AHB, the expansion AHB, and the ARM run at 70 MHz, 35 MHz, and 210 MHz respectively, while Pica operates at 70 MHz using the embedded DLL-based clock generator on the FPGA, which doubles the clock rate of the expansion AHB.

Figure 13. Forwarding unit and simplified "EX" pipeline stage: normal and timing-critical forwarding paths route data from producers to consumers, with the forwarding unit's registered outputs feeding the ALU and the pipeline registers.

Figure 14. Cycle-accurate SystemC simulation: an example dump listing, for each cluster, the Boolean registers, the ping and pong registers (d and d'), the address registers (a), and the accumulators (ac) at a given cycle.

ARM functions as a smart DMA controller that moves data between the Pica on-chip memory and the on-board 128 MB SDRAM; all peripherals are controlled by ARM as well. After the power-on reset, Pica stays in its standby mode until ARM initializes the instruction memory and triggers an interrupt. We have successfully demonstrated a simplified H.264 intra coder on the prototyping system, which compresses 21 CIF-format pictures per second in real time. We have also implemented a standard JPEG encoder and a real-time MP3 player on this prototyping platform.

5. Results

5.1. Performance Evaluation

We have hand-optimized several DSP kernels in assembly to evaluate the performance of the Pica DSP with cycle-accurate instruction set simulation. Table 2 compares state-of-the-art DSP processors with the Pica DSP. The first column shows the number of cycles required for 40-sample, 16-tap real-valued FIR filtering on 16-bit samples. The second column compares the execution cycles for the eight-by-eight 2D DCT. The third column is for the 256-point radix-2 fast Fourier transform (FFT) [13]. The last column summarizes the cycle counts for block-based motion estimation under the MAE criterion [27], with a 16 × 16 block size and a search range within ±15 pixels. TI C64 [9] and NEC SPXK5 [16] are two high-performance VLIW DSPs: C64 can issue eight instructions per cycle, its datapath is partitioned into two identical clusters, each with a 16-port centralized register file, and the two clusters communicate via the extended-read mechanism; SPXK5 is a four-issue processor with a centralized register file that has only eight general-purpose registers to reduce the hardware complexity. TI C55 [19] and Intel/ADI MSA [20] are conventional DSP architectures with dual multiply-accumulators; the former even allows memory operands (i.e. it is not a load/store architecture [28]), and these results are included only for reference. All cycle counts are excerpted from the vendors' application notes to provide objective performance indices. Pica has DSP-oriented instructions such as MAC and rich SIMD instructions (e.g. those described in Section 2), but these are not new and can be found in other DSP processors; most Pica instructions are simple (i.e. RISC-like). Pica would likely have performance comparable to state-of-the-art DSPs if it were equipped with a generic register file, since the supported instructions are very similar. The simulation results in Table 2 show that the clustered architecture with the distributed and ping-pong register organization has only a small impact on highly parallel DSP kernels whenever the dataflow is appropriately arranged. Moreover, the instruction slots in the hand-optimized assembly code can be almost completely filled to maximize hardware utilization.

Figure 15. Simplified Pica system for multimedia processing: the Pica DSP (unified RISC/VLIW core with its instruction and data memories and DMA) on the Logic Tile at 70 MHz, attached through the 35 MHz expansion AHB to the 70 MHz system AHB on the Versatile baseboard, which hosts the 210 MHz ARM core, the main memory (SRAM/SDRAM), and miscellaneous peripherals.

Table 2. Performance comparison (in cycles).

                         FIR    DCT    FFT      ME       Processor description
    TI C64 [9]           194    126    1,246    36,538   8-issue VLIW at 1 GHz (90 nm)
    NEC SPXK5 [16]       -      240    2,944    -        4-issue VLIW at 250 MHz (0.13 μm)
    TI C55 [19]          394    238    4,922    82,260   Dual MAC at 300 MHz
    Intel/ADI MSA [20]   381    284    3,176    90,550   Dual MAC at 400 MHz
    Pica DSP             232    123    2,510    41,372   4-issue VLIW at 333 MHz (0.13 μm)

Note that the concurrently-issued instructions may be severely limited in general applications with irregular dataflow or arbitrary dependencies (especially across the clusters), and the performance will then degrade significantly. Finally, we have also hand-crafted a standard JPEG encoder [14, 29] on the Pica DSP, which compresses the 512 × 512 24-bit color Lena and Baboon images in 4,340,149 and 6,326,148 cycles respectively (i.e. about 53.6-76.7 frames/s at 333 MHz).

5.2. Code Size Analysis

It is very difficult to compare the quality of the instruction encoding of different processor architectures, because it depends strongly on the instruction set architecture (ISA), the compiler, and the code generation strategy. Therefore, we base the comparison on a single ISA (the aforementioned Pica DSP) and try our best to make it fair. Table 3 compares various instruction encoding schemes. The first four rows summarize the results from the hand-optimized assembly programs described in Table 2; the fifth and sixth rows are for a standard JPEG encoder [14] and a simplified H.264 encoder [27]. The first column lists the code sizes for 192-bit fixed-length VLIW packets, each of which contains four 48-bit instructions. The second column lists the variable-length VLIW encoding used in TI C64 and NEC SPXK5. The third column shows the results with variable-length instruction encoding and the proposed NOP-removal scheme with CAP. The fourth column shows the improved code sizes with SIMD execution, while the fifth shows those with additional differential encoding. The variable-length VLIW packets are then stuffed into bundles of different sizes to evaluate the fragment (i.e. the unused bits in a bundle that cannot accommodate a new packet). Note that the lengths of program flow instructions depend on the bundle size: when the bundle size is doubled, the "bundle offset" can decrease by 1 bit, but both the "CAP index" and the "HT pointer" need 1 more bit. To minimize the effect of the intra-instruction fragment (i.e. the "tails" must be multiples of 4 bits), we use the worst-case lengths of program flow instructions for all bundle sizes under comparison. Figure 16 shows the bundling overheads over the effective instructions.

Figure 16. Bundling overheads (due to fragment): the overhead over the effective instructions for 256-, 512-, 1,024-, and 2,048-bit bundles across the FIR, DCT, ME, FFT, JPEG, and H.264 benchmarks, ranging from roughly 0 to 20%.

Table 3. Code size comparison (in bits).

                                         Hierarchical
             Fixed      NOP removal      Original    +SIMD     +Diff
    FIR      5,016      4,947            2,686       1,646     1,646
    DCT      11,248     10,557           6,160       3,696     3,696
    ME       12,616     13,158           6,946       4,486     3,838
    FFT      60,648     56,202           32,850      20,058    19,698
    JPEG     62,472     47,481           28,936      20,128    19,748
    H.264    229,368    167,535          99,424      72,992    72,128

Table 4. Comparison of register organizations.

                    Centralized         Clustered                     Proposed
    HW complexity   f(16, 12, 64, 32)   2 × f(8, 6, 32, 32) + g_LS    2 × f_DPP(32, 32) + g_LS
    Delay           3.18 ns             1.91 ns                       1.69 ns
    Gate count      196,920             104,976                       45,850
    Area            1,440,000 μm²       766,314 μm²                   334,544 μm²
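As a quick cross-check against the figures quoted in the abstract, the proposed organization gives 1.69 ns / 3.18 ns ≈ 0.531 (a 46.9% access-time improvement) and 334,544 μm² / 1,440,000 μm² ≈ 0.232 (a 76.8% area saving) relative to the centralized file, i.e. reduction factors of 1.88 and 4.30 respectively.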

5.3. Silicon Implementation

We have implemented the proposed distributed and ping-pong register file for a cluster with 32 32-bit registers, together with equivalent centralized register files for both one and two clusters. The three designs are described in Verilog RTL and synthesized using Synopsys Design Compiler with the UMC L130E High-Speed FSG library (from Faraday Technology). The designs are optimized for timing, and their netlists are placed and routed for the UMC 0.13 μm 1P8M Copper Logic Process using Cadence SoC Encounter. Table 4 summarizes the implementation results from post-layout simulations, where the proposed register organization reduces the access delay by a factor of 1.88 and the silicon area by a factor of 4.30 for the dual-cluster configuration.

We have also implemented the instruction decoders for various bundle sizes in Verilog RTL, synthesized using Synopsys Design Compiler and the same library. Table 5 shows the gate counts and delays of the synthesis results. Combined with the overhead analysis depicted in Fig. 16, 512-bit bundles are optimal for our implementation.

Table 5. Complexity comparison.

    Bundle size   256-bit   512-bit   1,024-bit   2,048-bit
    Delay (ns)    1.79      2.08      2.32        2.71
    Gate count    14,780    29,792    66,165      147,195

Finally, we have implemented the complete four-way Pica DSP with the proposed load/store-pair ICC mechanism, the distributed and ping-pong register file, and the hierarchical VLIW decoder. The RTL design is synthesized using Synopsys Physical Compiler and the UMC L130E High-Speed FSG library. The gate count is 223,578, of which the register file and the instruction dispatcher (i.e. the hierarchical VLIW decoder) account for 20.5 and 13.1% respectively. The netlists are placed and routed using Cadence SoC Encounter for the UMC 0.13 μm 1P8M Copper Logic Process. Figure 17 shows the layout of the Pica DSP; the core size is 3.2 × 3.15 mm², including the 128-kB data memory and the 32-kB instruction memory. The DSP core operates at 333 MHz while consuming 189 mW average power (while running the 2D DCT); the power breakdown is shown in Fig. 18.

Figure 17. Layout of Pica DSP.

Figure 18. Power breakdown of Pica DSP among the instruction dispatcher, the ICC & memory control, cluster 0, cluster 1, and the on-chip memory.

6. Conclusions

This paper describes the design and implementation of a high-performance and complexity-effective VLIW DSP for multimedia applications. A simple inter-cluster communication mechanism based on load/store pairs and a novel distributed and ping-pong register organization are proposed to reduce the register file complexity. The simulation results show that a four-issue VLIW DSP with the proposed micro-architecture has performance comparable to state-of-the-art high-performance DSPs, while saving 76.8% of the silicon area and 46.9% of the access time of an equivalent centralized register file found in most DSPs. Hierarchical VLIW encoding with flexible variable-length instruction formats, NOP removal, and automatic code replication is proposed to address the VLIW code density problem; in our simulations, the code size of a four-issue VLIW DSP can be reduced to only 24.1-26.0% of the original. An efficient decoding architecture is also proposed.


Finally, a complete four-issue VLIW DSP with the area-efficient register organization and the hierarchical instruction decoder has been implemented in the UMC 0.13 μm Copper Logic Process. The gate count is 223,578 (core only), of which the register file and the instruction dispatcher account for 20.5 and 13.3% respectively. The core size is 3.2 × 3.15 mm², including the 128 kB data memory and the 32 kB instruction memory. The DSP core can operate at 333 MHz with 189 mW power dissipation (while performing 2D DCT on a natural image).

The drawback of the proposed register organization is that it significantly complicates operation scheduling and code generation; at present, we can only optimize the assembly code by hand. We have tried some compilation methods with ping-pong access constraints, but the performance of the compiled code is still far behind that of the hand-optimized code. The development of customized compilation techniques may be a good research direction [30]. The variable-length VLIW packets are packed into fixed-length instruction bundles to simplify instruction memory accesses; however, this introduces fragments, especially for small bundles, so there is an opportunity for future research on optimal code generation and linking methods to reduce such fragments. Besides, the automatic synthesis of optimized variable-length instruction encodings for application-specific instruction set processors (ASIP) is also an interesting topic. The overhead of the "valid" bits used to remove NOPs in each VLIW packet can be reduced by allowing only limited patterns [31] or even by applying entropy coding according to the occurrence frequency. Compressing redundant operands, as in the reduced ISA of ARM Thumb [32], may also further improve the code density at small cost.

References

1. P. Lapsley, J. Bier, and E. A. Lee, "DSP Processor Fundamentals: Architectures and Features," IEEE Press, 1996.
2. Y. H. Hu, "Programmable Digital Signal Processors: Architecture, Programming, and Applications," Marcel Dekker, 2002.
3. J. A. Fisher, P. Faraboschi, and C. Young, "Embedded Computing: A VLIW Approach to Architecture, Compiler, and Tools," Morgan Kaufmann, 2005.
4. S. Rixner, W. J. Dally, B. Khailany, P. Mattson, U. J. Kapasi, and J. D. Owens, "Register Organization for Media Processing," in Proc. HPCA, 2000, pp. 375-386.
5. A. Terechko, E. L. Thenaff, M. Garg, J. Eijndhoven, and H. Corporaal, "Inter-Cluster Communication Models for Clustered VLIW Processors," in Proc. HPCA, 2003, pp. 354-364.
6. T. J. Lin, P. C. Hsiao, C. W. Liu, and C. W. Jen, "Area-Efficient Register Organization for Fully-Synthesizable VLIW DSP Cores," Int. J. Electr. Eng., vol. 13, pp. 117-127, May 2006.
7. P. Faraboschi, G. Brown, J. A. Fisher, G. Desoli, and F. M. O. Homewood, "Lx: A Technology Platform for Customizable VLIW Embedded Processing," in Proc. ISCA, 2000, pp. 203-213.
8. G. G. Pechanek and S. Vassiliadis, "The ManArray Embedded Processor Architecture," in Proc. Euromicro Conf., 2000, pp. 348-355.
9. TMS320C64x DSP Generation. http://www.ti.com.
10. K. Arora, H. Sharangpani, and R. Gupta, "Copied Register Files for Data Processors Having Many Execution Units," US Patent 6,629,232, Sep. 30, 2003.
11. A. Kowalczyk et al., "The First MAJC Microprocessor: A Dual CPU System-On-a-Chip," IEEE J. Solid-State Circuits, vol. 36, pp. 1609-1616, Nov. 2001.
12. T. J. Lin et al., "Performance Evaluation of Ring-Structure Register File in Multimedia Applications," in Proc. ICME, July 2003.
13. A. V. Oppenheim, R. W. Schafer, and J. R. Buck, "Discrete-Time Signal Processing," 2nd ed., Prentice Hall, 1999.
14. Independent JPEG Group. http://www.ijg.org.
15. H. Pan and K. Asanovic, "Heads and Tails: A Variable-Length Instruction Format Supporting Parallel Fetch and Decode," in Proc. CASES, 2001.
16. T. Kumura, M. Ikekawa, M. Yoshida, and I. Kuroda, "VLIW DSP for Mobile Applications," IEEE Signal Process. Mag., pp. 10-21, July 2002.
17. G. Fettweis, M. Bolle, J. Kneip, and M. Weiss, "OnDSP: A New Architecture for Wireless LAN Applications," presented at the Embedded Processor Forum, San Jose, 2002.
18. T. J. Lin et al., "A Unified Processor Architecture for RISC & VLIW DSP," in Proc. GLSVLSI, Apr. 2005.
19. TMS320C55x DSP Generation. http://www.ti.com.
20. R. K. Kolagotla et al., "High-Performance Dual-MAC DSP Architecture," IEEE Signal Process. Mag., pp. 42-53, July 2002.
21. J. P. Shen and M. H. Lipasti, "Modern Processor Design: Fundamentals of Superscalar Processors," McGraw-Hill, 2005.
22. M. Keating and P. Bricaud, "Reuse Methodology Manual for System-on-a-Chip Designs," 3rd ed., Kluwer, 2002.
23. D. Chinnery and K. Keutzer, "Closing the Gap Between ASIC & Custom: Tools and Techniques for High-Performance ASIC Design," Kluwer, 2002.
24. J. Bhasker, "A SystemC Primer," Star Galaxy Publishing, 2002.
25. J. Bergeron, "Writing Testbenches: Functional Verification of HDL Models," 2nd ed., Kluwer, 2003.
26. Versatile Platform Baseboard for ARM926EJ-S. http://www.arm.com/.
27. I. E. G. Richardson, "H.264 and MPEG-4 Video Compression," Wiley, 2003.
28. J. L. Hennessy and D. A. Patterson, "Computer Architecture: A Quantitative Approach," 3rd ed., Morgan Kaufmann, 2002.
29. W. B. Pennebaker and J. L. Mitchell, "JPEG: Still Image Data Compression Standard," Van Nostrand Reinhold, 1993.
30. Y. C. Lin, Y. P. You, and J. K. Lee, "Register Allocation for VLIW DSP Processors with Irregular Register Files," in Proc. CPC, 2006.
31. Intel 64 and IA-32 Architectures Software Developer's Manual, Intel, Nov. 2006.
32. The Thumb Architecture Extension. http://www.arm.com.

Tay-Jyi Lin received the B.S. degree in Electrical and Control Engineering and the Ph.D. degree in Electronics Engineering from National Chiao Tung University, Taiwan, in 1998 and 2005, respectively. He is now with the Department of Electronics Engineering and the Institute of Electronics, National Chiao Tung University, as a Research Assistant Professor. His research interests include VLSI signal processing, configurable computing, and computer architecture.

Shin-Kai Chen received the B.S. degree in Electronics Engineering from National Chiao Tung University, Taiwan, in 2004, where he is working toward the Ph.D. degree. His research interests include system software design and application mapping for programmable processors.

Yu-Ting Kuo received the B.S. and M.S. degrees in Electronics Engineering from National Chiao Tung University, Taiwan, in 2004 and 2006, respectively. He is working towards the Ph.D. degree. His research interests include low-power signal processing, digital hearing aids, and computer architecture.

Chih-Wei Liu received the B.S. and Ph.D. degrees in Electrical Engineering from National Tsing Hua University, Taiwan, in 1991 and 1999, respectively. From 1999 to 2000, he was an integrated circuit design engineer at the Electronics Research and Service Organization (ERSO) of the Industrial Technology Research Institute (ITRI), Taiwan. Near the end of 2000, he began working for the SoC Technology Center (STC) of ITRI as a project leader, and he left ITRI at the end of October 2003. He is currently with the Department of Electronics Engineering, National Chiao Tung University, Taiwan, as an Assistant Professor. His current research interests include SoC and VLSI system design, processor architecture, digital signal processing, digital communications, and coding theory.

Pi-Chen Hsiao received the B.S. degree in Electronics Engineering from National Central University, Taiwan, in 2003, and the M.S. degree in Electronics Engineering from National Chiao Tung University, Taiwan, in 2005. He is now with the SoC Technology Center at the Industrial Technology Research Institute, Hsinchu, Taiwan. His research interests include DSP architecture and datapath design.
