
Dynamic Codewidth Reduction for VLIW Instruction Set Architectures in Digital Signal Processors

Matthias H. Weiss and Gerhard P. Fettweis
Mobile Communications Systems, Dresden University of Technology, 01062 Dresden, Germany
{weissm,fettweis}@ifn.et.tu-dresden.de

Abstract - The design of an instruction set architecture (ISA) plays an important role in both exploiting processor resources and providing a common software interface. Three main classes of ISAs can be distinguished: CISC (Complex Instruction Set Computer), RISC (Reduced Instruction Set Computer), and VLIW (Very Long Instruction Word). They differ mainly in assembler and compiler support, pipeline control, hardware requirements, and code density. When these architectures are compared for use in a DSP, the VLIW architecture proves very advantageous. Its main disadvantage in many applications, however, is code size explosion. To reduce code size, a method called tagged VLIW (TVLIW) is presented. Dividing the instruction set into control/move and arithmetic instructions reveals a different usage of functional units. The first set requires only a limited number of functional units to execute in parallel. The second set, although it requires several functional units in parallel, is mostly used inside loops. In our proposed method, the instruction word is assembled dynamically by low-complexity, highly regular decoding hardware. Inside loops, the full VLIW functionality is supported by cache methods.

1. INTRODUCTION

Three main classes of ISAs are applied in microprocessors: CISC, RISC, and VLIW. They differ mainly in assembler and compiler support, pipeline control, hardware requirements, and code density. When these architectures are compared for use in a DSP, the VLIW architecture proves very advantageous in terms of processing performance. Nevertheless, DSPs often contain a CISC ISA, mainly for reasons of code density and compatibility. Recently, RISC ISAs have also been applied, e.g. in Hx24 by Hitachi or Lode by TCSI. This paper presents a first step towards an efficient use of the VLIW ISA. A tagged VLIW ISA provides the advantages in processing power while avoiding code explosion.

Besides code compactness, CISC ISAs provide the assembly programmer with a wide variety of instructions. Since programming is done at instruction level (see explanation in Fig. 1), the hardware architecture does not have to be known in full detail at implementation time, which allows different hardware implementations to be object code compatible. For the same reason, however, hardware resources cannot be fully exploited. Furthermore, CISC ISAs support only a decoding pipeline but no deep execution pipeline, since the instructions are too heterogeneous.

The RISC ISA, on the other hand, consists of instructions that are homogeneous in terms of pipeline properties. This is achieved by splitting complex instructions into several small instructions. The hardware decoding complexity can therefore be reduced to achieve a cycles per instruction (CPI) value close to 1 [HePa90]. Superscalar architectures increase the decoding hardware again, e.g. for hazard resolution or for scoreboarding in out-of-order execution. Hence, code execution speedup is achieved mostly by hardware support and not by compiler optimizations.

[Pipeline diagram with the axes "Instruction Level", "Cycle" and t, showing overlapped instructions passing through the stages listed below.]

IF: Instruction Fetch, ID: Instruction Decode, R: Read Operands, E: Execute Instruction, W: Write Result

Fig. 1: Instruction vs. cycle level in pipelined architectures
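The two views of Fig. 1 can be illustrated with a small sketch. It assumes an idealized pipeline in which instruction i enters IF in cycle i and nothing ever stalls; the function names are hypothetical and only the five stage names are taken from the figure legend.

```python
STAGES = ["IF", "ID", "R", "E", "W"]  # stage names from the Fig. 1 legend


def stage_at(instr, cycle):
    """Stage occupied by instruction `instr` in `cycle`, or None if it is not in flight.

    Assumption: instruction i enters IF in cycle i and the pipeline never stalls.
    """
    k = cycle - instr
    return STAGES[k] if 0 <= k < len(STAGES) else None


def instruction_view(instr):
    """Instruction level: follow one instruction through all stages (cycle -> stage)."""
    return {instr + k: stage for k, stage in enumerate(STAGES)}


def cycle_view(cycle, num_instrs):
    """Cycle level: what every instruction in flight does in one cycle (instr -> stage)."""
    view = {}
    for i in range(num_instrs):
        stage = stage_at(i, cycle)
        if stage is not None:
            view[i] = stage
    return view
```

An instruction-level (data-stationary) description corresponds to `instruction_view`, while a cycle-level (time-stationary) description corresponds to `cycle_view`: one control word describes everything the hardware does in a single cycle.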

To support compiler optimizations for superscalar architectures, a horizontal or VLIW ISA can be applied. Due to time-stationary pipelining (i.e. pipeline control at cycle level [Kogg81]), both the programmer and the compiler are given full control over the pipeline at the cost of increased code size, e.g. a 128-bit code width in the VIPER architecture [Gray93]. This cost prohibits this type of ISA in the main application field of fixed-point DSPs.

The tagged VLIW scheme proposed in this paper reduces the code size requirements by assembling the VLIW dynamically. The method is based on the distinction between in-line and in-loop code. While the former requires only limited parallelism, the latter can be supported by a simple cache. Thus, the properties of VLIW can be exploited without code size explosion.

This paper is organized as follows. In Section 2, the properties of the VLIW architecture in DSPs are explained in more detail. In Section 3, dynamic instruction word coding with the TVLIW scheme is described. In Section 4, the scheme is applied to the AT&T DSP16 architecture to demonstrate its applicability.

2. APPLYING VLIW TO DSP ARCHITECTURES

A VLIW architecture consists of several independent functional units (FU) controlled by one instruction word (IW) and connected by a fairly complex bus system (Fig. 2). In a DSP architecture these FUs are the Program Control Unit (PCU), Address Generation Units (AGU), Datapath Units (DPU), I/O-Units (IOU) etc. [Vanh92]. VLIW architectures are already applied in some floating-point DSPs [Madi95]. These DSPs are usually used for high-performance applications, which require a high degree of flexibility. In high-volume and low-power products, on the other hand, e.g. in mobile communications, fixed-point DSPs are employed. Since they are typically programmed in assembly language and code density is an important issue, they often contain a CISC ISA for both data-stationary (e.g. TI C54x, Motorola DSP5630x) and time-stationary (e.g. AT&T DSP16, NEC 7701x) pipeline control.

However, as processing power requirements grow, concurrent processing must be supported. Most DSP algorithms have inherent parallelism [Kung88], which can be exploited by replicating arithmetic units. This is not necessarily restricted by limited memory bandwidth. Due to the locality of algorithms, duplicating arithmetic units does not necessarily lead to multiport memory architectures and can thus also be applied in fixed-point DSPs, as shown in [Fett96]. Furthermore, since stronger compiler support is demanded [Zivo95], an ISA is required that combines flexibility (offered by a VLIW ISA) with code density (offered by a CISC ISA).

The main drawback of employing a VLIW ISA, besides complicated assembly coding, is the increase in code size. The main reason for this increase is the independent control of concurrent FUs. For maximum flexibility, a VLIW ISA must support all permutations of all FUs.
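The cost of independent FU control can be made concrete with a little arithmetic. The sketch below assumes that a horizontal instruction word simply concatenates one equally wide field per FU; the FU count and field width are hypothetical values chosen for illustration and are not taken from any of the processors named above.

```python
def horizontal_width(num_fus, field_bits):
    """Bits per instruction word if every FU gets its own control field.

    Assumption: all fields have the same width and are simply concatenated.
    """
    return num_fus * field_bits


def nop_fraction(num_fus, used_fus):
    """Share of the instruction word that only encodes nops."""
    return (num_fus - used_fus) / num_fus


# Hypothetical machine: 5 FUs with 8-bit control fields.
width = horizontal_width(5, 8)   # 40 bits fetched per instruction
wasted = nop_fraction(5, 2)      # 0.6 when only 2 of the 5 FUs do useful work
```

Under these assumptions, an instruction that keeps only two of five FUs busy still occupies the full word, so more than half of the fetched bits carry no information.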

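[Block diagram: a horizontal very long instruction word (VLIW) made up of FIW fields, each controlling one functional unit (PCU, AG1, AG2, DP1, ..., FUx); the functional units are connected by a bus system.]

Fig. 2: Example of a VLIW DSP-Architecture

[Block diagram: a tagged very long instruction word (TVLIW) with the fields IWC, F#, FIW, F#, FIW (the F# fields are labeled TAG) is expanded, with the help of a VLIW cache, into a horizontal VLIW consisting of FIW fields.]

Fig. 3: Tagged VLIW Instruction Decoder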

However, not all instructions can exploit the full functionality of the VLIW ISA. Program control and move instructions, for instance, require only a limited number of FUs at one time. These instructions are mainly applied in in-line code. Thus, full VLIW functionality is not required for in-line code. In-loop code, on the other hand, mainly consists of arithmetic and logic instructions, which typically require several FUs concurrently and thus the full functionality of VLIW. On the side of the DSP architecture, loops are already supported, e.g. by zero-cycle hardware loop counters and cache mechanisms. Furthermore, compilers also support loops by applying techniques such as loop unrolling, software pipelining, and trace scheduling, which were developed especially for VLIW architectures [HePa90]. Hence, the functionality of the VLIW ISA must be available within loops. This can be achieved by employing our TVLIW scheme.

3. DYNAMIC INSTRUCTION WORD CODING BY THE TVLIW SCHEME

TVLIW supports the different requirements of in-line and in-loop instructions by assembling the VLIW dynamically. As shown in Fig. 2, the very long instruction word (VLIW) consists of a number of functional unit instruction words (FIW). Each FIW controls its associated FU independently of the remaining FIWs. Thus, the whole VLIW can control several FUs concurrently. The idea of the TVLIW scheme is to assemble the actual VLIW out of a limited number of FIWs (Fig. 3). If the full functionality of VLIW is required, this assembling may take several cycles. However, the instructions that require the full parallelism of VLIW mainly occur within loops. With the help of a loop cache, this overhead is only incurred during the first iteration.

The TVLIW scheme is based on two assumptions:
• Equal FIW width: All FUs require a common instruction word width. This can be achieved by designing the FIW for a given TVLIW width, since FIWs can be fully decoded if necessary.
• Limited parallelism in in-line code: If parallelism can be fully exploited all the time, this scheme is not applicable. However, as shown below, this is usually not the case for in-line instructions. In-loop instructions, on the other hand, are supported by cache mechanisms.

While the first assumption is verified by the case study in Section 4, the second assumption is checked in more detail by the following examination of a DSP's instruction set.

A. Classification of the Instruction Set

In-loop code mainly consists of arithmetical/logical (AL) instructions, including memory accesses. In-line code mainly consists of move instructions, including register-register and register-memory transfers, and of program control instructions, including jumps, branches, calls etc. Program control/move instructions do not require all FUs at the same time. To show this in more detail, table 1 lists the FU usage of some program control and move instructions. At instruction level, these instructions typically use only one or two FUs. Note that besides FUs, immediate fields also need to be considered.

Due to time-stationary pipeline control, the usage of FUs at cycle level must be considered as well. As an example, a pipelined machine with a one-cycle memory latency is assumed. In table 2, the first four instructions of table 1 are assumed to appear in sequential order, so that each column represents the actual usage of FUs at cycle level. As at instruction level, only one or two FUs are used at one time at cycle level. Applying the same method to AL instructions (table 3 and table 4) shows that they use several FUs at both instruction and cycle level. Taking this behavior into account, our current implementation of TVLIW discussed below supports the independent control of two FUs at one time without assembly loss.

B. Overview

The block diagram of the TVLIW decoder is shown in Fig. 3.
The TVLIW consists of a class field (IWC), which mainly indicates from how many TVLIWs the actual VLIW has to be assembled, and two tag fields (F#), which indicate the FU to be controlled by the associated FIW. The output is the actual VLIW, which contains the coded FIWs and nops in all remaining positions. For instructions that require the full processing power of the VLIW, multiple TVLIW instructions are collected to build one VLIW. During the first iteration of a loop, the actual VLIW is stored in a wide cache, from which instructions can be read during the following iterations.

C. Functional Description

During the programming process, a VLIW is assumed. The resulting intermediate object code consists of a set of VLIWs, each containing a number of independent FIWs. The main task of the following assembler pass is to reduce each VLIW to one or more TVLIWs by using the instruction classes supported by the TVLIW: single IW, multiple IW, insert IW, and end IW.

The single IW class indicates that the current VLIW uses two FUs at most, except for the following case. If the current VLIW contains the same FIWs as the following one, the current VLIW is a subset of the following one. The current VLIW is then executed and also stored to be used by the next VLIW. This is indicated by the insert IW class.

If the current VLIW uses more than two FUs, the VLIW must be assembled sequentially. The current VLIW is therefore divided into a set of TVLIWs, each containing two FUs at most. If the preceding TVLIW was an insert IW, this TVLIW is removed from the set. All remaining TVLIWs except for the last are indicated by the multiple IW class, while

Table 1: FU usage by program control and move instructions

Instruction type | Example (a) | Usage of FU
program flow | return, icall, nop | PCU
argumented program flow | call, branch, loop | PCU, IM
memory+register | *Xr1++ = Reg1 | DP1, AG1
register+memory | Reg1 = *Yr1++ | AG2, DP1
register+register | Acc = Reg1 | DP1
register+constant | Reg1 = 7 | DP1, IM

a. Examples are written in a C-like notation.

Table 2: FU usage over time for program control and move instructions of table 1

[Grid of instructions n, n+1, n+2, n+3 against cycles t, t+1, t+2, t+3, t+4; cell positions not recoverable. Surviving entries, in order: PCU; PCU, IM; DP1; AG2; AG1; DP1.]

Table 3: FU usage by arithmetic/logic (AL) instructions

Instruction type | Example | Usage of FUs
AL instruction with PMA (a) | Acc += Reg1 • *Xr1++ | AG1, DP1
AL instruction with 2 PMAs | Acc += *Xr1++ • *Yr1++ | AG1, AG2, DP1
AL instruction with a constant and 1 PMA | Acc = Const • *Xr1++ | AG1, IM, DP1
parallel AL instructions with 2 PMAs | Acc1 += *Xr1++ • *Yr1++ || Reg2 = *Xr1++ || Acc2 += *Xr1++ • Reg2 | AG1, AG2, DP1, DP2

a. PMA: Parallel Memory Access

Table 4: FU usage for arithmetic/logic instructions of table 3

[Grid of instructions n, n+1, n+2, n+3 against cycles t, t+1, t+2, t+3, t+4; cell positions not recoverable. Surviving entries, in order: AG1; DP1; AG1, AG2; DP1; AG1; IM, DP1; AG1, AG2; DP1, DP2.]

the last is indicated by the end IW class. The end IW class is necessary to clear all previously stored FIWs. The insert IW class is introduced especially to support the coding of unrolled loops. In unrolled loops, the current instruction is often obtained by extending the previous instruction by one or two further FIWs. In this case, the previous instruction can be reused for the following one.

D. Hardware Description

The hardware required for a TVLIW decoder is inexpensive and highly regular. As shown in Fig. 4, the hardware's structure can be divided into three parts. The IW itself consists of a two-bit wide class field, an m-bit wide tag field for determining one out of 2^m FUs, and two n-bit wide FIWs controlling two independent FUs. In the first step, the control signals QCX, QAX, and QBX are generated from the decoded class and both tag fields, respectively. The tag signals control the crossbar unit, in which the n-bit wide FA- and FB-fields are routed to the appropriate intermediate busses F’X, {x: 0..2^m-1}. A nop is switched to the remaining 2^m-2 intermediate busses, i.e. to each bus whose FU is not selected by QAX or QBX, respectively.

In parallel, the class signals are used to determine how the VLIW is assembled. The first multiplexer controls an n-bit wide register, which stores the intermediate F’X in the multi or insert IW case and is cleared in the end IW case. The final multiplexer switches either the intermediate F’X, the content of the register, or a nop to the actual FX. The complete set of all FX represents the actual VLIW. Thus, the hardware expense amounts to only one 2:4 decoder and two m:2^m decoders, 2^m n-bit wide registers, 2 × 2^m n-bit wide 3:1 multiplexers, and 2^m 2:1 multiplexers. However, the crossbar unit may require some wiring, although this wiring is highly regular.

E. Further Remarks

To reduce both the IW width and the hardware cost of the TVLIW decoder, particularly of the crossbar switch, the combinations of FUs within one TVLIW can be limited.
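The assembler pass of Section C and the register/multiplexer behaviour of Section D can be modelled in a few lines. This is a minimal behavioural sketch, not the authors' implementation: a VLIW is represented as a dict from FU name to FIW, and the names `IWClass`, `encode`, `decode` and the sample program are hypothetical. It further assumes that an insert IW issues the merged register contents and that no insert chain is started while FIWs are still stored.

```python
from enum import Enum


class IWClass(Enum):
    SINGLE = 0   # VLIW fits into one TVLIW and is issued directly
    MULTI = 1    # FIWs are stored, further TVLIWs follow
    INSERT = 2   # VLIW is issued and also kept for the next VLIW
    END = 3      # last part: issue stored plus current FIWs, then clear

SLOTS = 2  # FIWs per TVLIW, as in the implementation described in the text


def encode(vliws):
    """Assembler pass: reduce VLIWs (dict FU -> FIW) to (class, slots) TVLIWs."""
    tvliws, stored = [], {}
    for i, vliw in enumerate(vliws):
        nxt = vliws[i + 1] if i + 1 < len(vliws) else {}
        # FIWs kept in the decoder registers by an insert IW are not sent again.
        todo = [(fu, fiw) for fu, fiw in vliw.items() if stored.get(fu) != fiw]
        if not stored and len(todo) <= SLOTS:
            if vliw and vliw.items() < nxt.items():   # proper subset of next VLIW
                tvliws.append((IWClass.INSERT, todo))
                stored = dict(vliw)
            else:
                tvliws.append((IWClass.SINGLE, todo))
            continue
        chunks = [todo[j:j + SLOTS] for j in range(0, len(todo), SLOTS)]
        for chunk in chunks[:-1]:
            tvliws.append((IWClass.MULTI, chunk))
        tvliws.append((IWClass.END, chunks[-1]))
        stored = {}
    return tvliws


def decode(tvliws):
    """Decoder model: return the list of VLIWs that are actually issued."""
    regs, issued = {}, []
    for cls, slots in tvliws:
        current = dict(slots)          # crossbar: route each FIW to its FU
        if cls is IWClass.SINGLE:
            issued.append(current)
        elif cls is IWClass.MULTI:
            regs.update(current)       # store only, nothing is issued yet
        elif cls is IWClass.INSERT:
            regs.update(current)
            issued.append(dict(regs))  # issue and keep for the next VLIW
        else:                          # END
            issued.append({**regs, **current})
            regs = {}                  # clear all stored FIWs
    return issued


# Hypothetical program: one control instruction, then a growing unrolled loop body.
program = [
    {"PCU": "call"},
    {"AG1": "x++", "DP1": "mac"},
    {"AG1": "x++", "AG2": "y++", "DP1": "mac"},
    {"AG1": "x++", "AG2": "y++", "DP1": "mac", "DP2": "mac2"},
]
tvliw_code = encode(program)
```

In this example the four VLIWs are encoded as five TVLIWs, and decoding them reproduces the original VLIW sequence. FIWs not present in a dict stand for nops, which mirrors the nop that the decoder switches to every unselected FU.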
For instance, the same FU cannot be used twice in one TVLIW, and both permutations F1:F2 and F2:F1 do not have to be supported. Thus, the combinations Fn, Fm, n ≤ m do not have to be supported. If the separate fields are combined into one field, $\sum_{i=1}^{n} i$ combinations can be removed at the expense of a slightly more complex tag decoder. At the same time, restrictions of this kind can be used to simplify the crossbar switch.

In the event of an interrupt, the contents of the decoder registers must be saved. This is necessary for restoring the current state if the interrupt occurs during a multi- or insert-instruction. The interrupt service routine can then use the whole VLIW as well.

4. CASE STUDY: APPLYING TVLIW TO AT&T'S DSP16

To demonstrate the TVLIW scheme on a real-world example, we chose the DSP16 of AT&T. This DSP contains a well-structured register set, a small 16-bit wide instruction set, a simple bus architecture with only one read/write bus and, above all, a time-stationary pipeline organization. By orthogonalizing the instruction set into separate FIWs, the dynamic instruction coding scheme can be applied.

A. Overview

As can be seen from table 5, a FIW width of n=8 bit is sufficient to support the DSP16's functionality. Additionally, m=5

Functional Unit Permutations

PCU-, X-Unit 90 + 3
