DIVA: Dual-Issue VLIW Architecture with Media Instructions for Image Processing

Sang-Joon Nam, Young-Su Kwon, Yeon-Ho Im, Kyung-Gu Kang, and Chong-Min Kyung

VLSI Systems Lab., Department of Electrical Engineering
Korea Advanced Institute of Science and Technology (KAIST), Taejon, Korea
Phone: +82-42-866-0848 (Sang-Joon Nam), FAX: +82-42-866-0702
e-mail address: [email protected]

ABSTRACT

To meet the demand for processing enormous amounts of multimedia data, we have designed a VLIW (Very Long Instruction Word) processor called DIVA (Dual-Issue VLIW Architecture) that exploits the ILP (instruction-level parallelism) in multimedia programs. The DIVA processor, which can execute two instructions in one cycle, supports 86 instructions including 30 media instructions, and has a sub-word execution structure that supports saturation-mode arithmetic for image processing. Compared to scalar architectures without media instructions, the performance of the DIVA processor is improved by 2.2 to 5 times due to the combination of the VLIW architecture and media instructions. The DIVA processor, consisting of about 90,000 gates, was implemented using a 0.6 µm CMOS SOG (Sea-of-Gate) process on an 8 mm × 8 mm die, and has shown a performance of 80 MOPS (Million Operations Per Second) at a 10 MHz clock frequency.

I. Introduction

In addition to the text, numbers, and 2-D graphics data previously handled, the processor market is moving fast toward digital multimedia processing areas such as digital image, speech, 3-D graphics, audio, and video. To meet the demand for multimedia processing, two approaches are in progress. Over the past four years, major vendors of general-purpose processors have extended their instruction set architectures to support multimedia workloads. On the other hand, dedicated processors targeted to multimedia applications, called media processors, have employed similar semantics. In addition, some media processors incorporate multimedia-specific features including vector instructions, VLIW architectures, and special-purpose hardware such as video and audio ports and bit-stream CODECs.

General-purpose processors with multimedia extensions and media processors both need a specific architecture to exploit the ILP (Instruction Level Parallelism) in multimedia application programs and execute multiple operations simultaneously. Superscalar and VLIW architectures can take advantage of this parallelism[1].

Considering the following characteristics of multimedia applications, we have chosen the VLIW architecture as our target ILP processor:

• Multimedia application programs are mostly described in high-level programming languages.
• Compilers must exploit the parallelism in application programs and architectures.
• Sub-word execution is required to process the tremendous volume of small-size data.

In this paper we describe a prototype DSP core based on the VLIW architecture, called DIVA (Dual-Issue VLIW Architecture), with sub-word operations for image processing. DIVA can issue two instructions per clock cycle and execute them with eight functional units. The experimental results in Section IV, obtained with real application programs, show that performance can be significantly improved by an appropriate combination of the VLIW architecture and sub-word operations.

The rest of this paper is organized as follows. The features of DIVA are described in Section II. Section IV shows the performance of DIVA with experimental results.

II. DIVA processor

This section describes the DIVA (Dual-Issue VLIW Architecture) processor, which issues two instructions per clock cycle and executes sub-word operations for image processing.

A. Overview of DIVA processor

Major features of the DIVA processor include:

• 2-issue VLIW architecture
• 64-bit program bus and 32-bit data bus
• 4 pipeline stages (fetch instruction, decode instruction, execute and write back)
• 64K 16-bit word program/data memory space
• 8 independent functional units
• Sub-word execution (8- or 16-bit)
• Saturating arithmetic operation
• 2-bank, 16-entry, 32-bit register file with 2 read / 1 write ports
• 4-depth nested hardware FOR-loop
• 86 instructions (including 30 media instructions)

The block diagram of the DIVA processor is shown schematically in Fig. 1. The DIVA processor consists of two integer ALUs (IALU0, IALU1), an integer multiplier (IMUL), a media ALU (PALU), a media multiplier (PMUL), a shift unit (SHIFT), a load/store unit (LDST), a two-bank register file having 32 entries altogether, and a controller which consists of a branch and program counter unit (BRPC), pipeline controllers, and an operand network.

Fig 1. Block diagram of DIVA processor consisting of two integer ALUs, an integer multiplier, a media ALU, a media multiplier, a shift unit, a load/store unit, two-bank register file, and controller.

IALU performs integer arithmetic, logical operations, and address calculation. Because these operations are executed frequently, the DIVA processor has two IALUs. BRPC controls the program sequence and supports loop and branch operations.

B. Data types of DIVA processor

To accommodate a wide variety of image processing algorithms, we support rich packed data types. As shown in Fig. 2, the DIVA processor defines three packed data types: packed byte, packed word, and double word. Each element within a packed data type is a fixed-point integer. The packed word data type is used for operations that need higher precision than 8 bits.
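The three data types of Section II.B can be modeled as three views of one 32-bit register value. The following is a minimal Python sketch under that reading; the function names (`pack_bytes`, `unpack_bytes`, `pack_words`, `unpack_words`) are hypothetical helpers for illustration and are not DIVA mnemonics. Elements are treated as unsigned here, which is an assumption.

```python
# Illustrative model of the 32-bit data types described in Section II.B.
# All helper names are hypothetical; elements are treated as unsigned.

MASK32 = 0xFFFFFFFF


def pack_bytes(b3, b2, b1, b0):
    """Pack four 8-bit elements; b3 lands in bits 31-24, b0 in bits 7-0."""
    for b in (b3, b2, b1, b0):
        if not 0 <= b <= 0xFF:
            raise ValueError("byte element out of range")
    return (b3 << 24) | (b2 << 16) | (b1 << 8) | b0


def unpack_bytes(reg):
    """Return (Byte 3, Byte 2, Byte 1, Byte 0) of a 32-bit register value."""
    reg &= MASK32
    return tuple((reg >> shift) & 0xFF for shift in (24, 16, 8, 0))


def pack_words(w1, w0):
    """Pack two 16-bit elements; w1 lands in bits 31-16, w0 in bits 15-0."""
    for w in (w1, w0):
        if not 0 <= w <= 0xFFFF:
            raise ValueError("word element out of range")
    return (w1 << 16) | w0


def unpack_words(reg):
    """Return (Word 1, Word 0) of a 32-bit register value."""
    reg &= MASK32
    return ((reg >> 16) & 0xFFFF, reg & 0xFFFF)


# The same register contents viewed as packed bytes and as packed words.
reg = pack_bytes(0x12, 0x34, 0x56, 0x78)
as_bytes = unpack_bytes(reg)
as_words = unpack_words(reg)
```

The double word type needs no helper in this model, since it is the 32-bit value itself.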

Fig 2. Three data types in DIVA processor: (a) packed byte (Byte 3 to Byte 0 in bits 31-24, 23-16, 15-8, and 7-0), (b) packed word (Word 1 and Word 0 in bits 31-16 and 15-0), and (c) double word (bits 31-0).

To execute the sub-word operations, PALU consists of four 8-bit sub-ALUs and carry propagation logic. PALU can execute either four 8-bit add/subtract operations in parallel or two 16-bit add/subtract operations in parallel, while 32-bit operations are handled only by IALU.

Carry propagation of each sub-ALU is controlled according to the required data type. The four sub-ALUs are independent of each other in byte operation, and carry propagation does not occur across the byte border. In word operation, however, two carry_enable signals become active to propagate the carry of the lower 8-bit sub-ALU to the higher one to generate 16-bit results. If the two issued instructions are both media instructions, the DIVA processor can execute four to eight operations according to the data type.

C. Packed addition and subtraction instructions with saturation

These instructions deal with two packed data types, i.e., packed byte and packed word. Each single add or subtract operation is independent of the others, in either unsigned-saturation mode or signed-saturation mode. The upper and lower saturation limits are 0xFF and 0x00 for unsigned bytes, and 0x7F and 0x80 for signed bytes. For packed words, the limits are the maximum and minimum unsigned or signed values that the data type can represent. If saturation mode is not supported, a complex saturation check routine has to be implemented with branch instructions that test the carry or overflow flag, which significantly increases the execution cycle count.

D. Packed multiplication instructions

We defined two variations of these instructions, both of which support 8-bit precision multiplication. The PMULHW (packed multiply high) or PMULLW (packed multiply low) instruction performs four 8-bit x 8-bit multiplications and lets the user choose the higher or lower 8 bits of each 16-bit multiplication result, respectively. The code size reduction due to the inclusion of the PMULLW instruction in DIVA is shown in Fig. 3 for an α-blending program. α-blending is a way of mixing two images by adding, pixel by pixel, the two corresponding pixel values weighted by α and (1-α), respectively. Fig. 3 shows the part of the program that processes one pixel. In register R10, the 8-bit RGB component data of each pixel is stored in series in a packed format, while the α-value is stored in R7. For α-blending, we multiply each value of the red, green, and blue planes by the α-value. To perform this operation without media instructions, sequential logical shifts by 8 bits for the blue plane, by 16 bits for the green plane, and by 24 bits for the red plane need to be performed.

(a)
    ;; R12 : 0x000000ff
    and    R4, R10, R12
    mul.uu R4, R4, R7
    addu   R5, R5, R4
    lsr    R4, R10, 8
    and    R4, R4, R12
    mul.uu R4, R4, R7
    lsl    R4, R4, 8
    addu   R5, R5, R4
    lsr    R4, R10, 16
    and    R4, R4, R12
    mul.uu R4, R4, R7
    lsl    R4, R4, 16
    addu   R5, R5, R4

(b)
    ;; R10 : image 1
    ;; R7 : {0x00, alpha, alpha, alpha}
    pmullw R5, R10, R7

Fig 3. Comparison of code size of a part of the α-blending program processing one pixel, showing that (a) 13 instructions are necessary when no media instruction is available, while (b) one instruction is enough when the media instruction PMULLW is used.
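The packed multiply of Section II.D can be modeled in a few lines. This is a minimal Python sketch of the behaviour as the text describes it (four independent 8-bit x 8-bit products, keeping either the higher or the lower 8 bits of each 16-bit result); the function name `pmul_bytes` and the treatment of all elements as unsigned are assumptions, not details confirmed by the paper.

```python
# Sketch of a packed 8-bit multiply on 32-bit register values.
# Assumption: elements are unsigned; pmul_bytes is a hypothetical name.


def pmul_bytes(a, b, high):
    """Multiply corresponding bytes of a and b.

    high=True keeps bits 15-8 of each 16-bit product (PMULHW-like),
    high=False keeps bits 7-0 (PMULLW-like).
    """
    out = 0
    for shift in (0, 8, 16, 24):
        prod = ((a >> shift) & 0xFF) * ((b >> shift) & 0xFF)
        kept = (prod >> 8) if high else (prod & 0xFF)
        out |= kept << shift
    return out


# One pixel with components 200, 100, 50 in the three low bytes,
# and a packed weight {0x00, 0x80, 0x80, 0x80} in the style of Fig. 3(b).
pixel = 0x00C86432
alpha = 0x00808080
high_part = pmul_bytes(pixel, alpha, high=True)
low_part = pmul_bytes(pixel, alpha, high=False)
```

Keeping the higher byte scales each component by alpha/256 (here 200, 100, 50 become 100, 50, 25). How alpha is scaled in the authors' program, and therefore which half of the product carries the blended value, is not spelled out in the text, so the sketch exposes both.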

By replacing all these operations with the PMULLW instruction in DIVA, the number of registers used is reduced from 5 to 3. Overall, media instructions reduce the code size by 92%.

The PMADD (packed multiply and addition) instruction, shown in Fig. 4, is the basic operation for the fast multiply-accumulate capability in DIVA. It multiplies the corresponding 8-bit elements from two sources, generating four 16-bit products. The first two 16-bit products are added up for one 16-bit result, and so are the two remaining 16-bit products. The PMADD instruction, therefore, performs four multiplications and two 16-bit accumulations. The product of a 4 × 4 matrix and a 4 × 1 vector, which normally requires 16 multiplications and 12 accumulations, can be done using 4 PMADD and 4 PADDW (packed addition word) instructions.

Fig 4. Packed multiply-add half-word to word (PMADD): src1 = {A3, A2, A1, A0} and src2 = {B3, B2, B1, B0} in bits 31-24, 23-16, 15-8, and 7-0; dest = {A3 × B3 + A2 × B2, A1 × B1 + A0 × B0}.

PMUL is designed as a two-stage pipeline in order to prevent the cycle time from increasing due to the added media instructions.

E. Packed compare instructions

These instructions independently compare the corresponding data elements of two packed data types, in parallel, to generate a mask of 1's and 0's, depending on whether the relevant condition is true or false. Subsequent instructions can use the compare result mask to select elements from either one of two different input sources. These instructions are very useful in various image filtering and composition applications such as the chroma keying program.

Chroma keying is an image overlay technique frequently used in weather forecasts. If a pixel of the foreground image (the girl's image) is not blue, that pixel is selected as the corresponding pixel of the result image; otherwise, the corresponding pixel of the scenery image is selected. This data selection without branch instructions enables great performance enhancement.

IV. Experimental Results

The DIVA processor was fabricated with a 0.6 µm CMOS SOG (Sea-of-Gate) process. The specification of the DIVA processor is summarized in Table 1.

Table 1. Specification of DIVA processor.

Performance           80 MOPS
Process Technology    0.6 µm SOG
Chip Size             8 mm × 8 mm
Gate Count            88,931
Supply Voltage        5 V Single Supply
Operating Frequency   10 MHz (typical)
Package               160-pin QFP

Four programs, α-blending, chroma keying, matrix multiplication, and dot product, are used to measure the performance of DIVA in image processing applications.

The relative performance of four processor configurations is shown in Fig. 5. The four configurations differ in whether a VLIW or scalar processor is used, and whether media instructions are included or not. 'Scalar' indicates the scalar, single-issue processor. 'VLIW2, no_media' indicates the dual-issue VLIW processor with no support for media instructions. Performance for each program is normalized to that of the scalar processor with no media instruction support (scalar, no_media). Because DIVA issues two instructions per clock cycle and uses the media instructions, the performance of DIVA is improved by 2.2 to 5 times compared to that of the scalar processors without media instructions. It is also shown that, without the addition of media instructions, even VLIW processors are inferior to the scalar processors with media instructions. This result shows that the inherent parallelism of the VLIW processor can only be fully exploited by employing proper media instructions dealing with packed byte or packed word data types.

Fig. 6 shows the effect of some media instructions on the performance in various application programs. Performance is normalized to that of VLIW2 without media instructions. The performance of the VLIW architecture improves as media instructions are added. No increase in performance after adding a new media instruction means that the program does not use that media instruction. Addition/subtraction instructions with saturation and the PMADD instruction improve the performance of DIVA by 30% and 100%, respectively. Through the synergistic effect of combining the media instructions and the VLIW architecture, the performance of DIVA is improved by 1.8 to 2.7 times compared to the dual-issue VLIW architecture without media instructions, depending on the application. A chip photo of the DIVA processor is shown in Fig. 7.
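As a reminder of what the saturating add of Section II.C computes, the following minimal Python sketch models a packed byte addition with the saturation limits quoted there (0xFF and 0x00 unsigned, 0x7F and 0x80 signed). The function name `padd_sat_bytes` is hypothetical, and the sketch models only the arithmetic result, not the PALU carry logic.

```python
# Sketch of packed byte addition with saturation on 32-bit register values.
# padd_sat_bytes is a hypothetical name, not a DIVA mnemonic.


def padd_sat_bytes(a, b, signed=False):
    """Add corresponding bytes of a and b, clamping each sum independently."""
    out = 0
    for shift in (0, 8, 16, 24):
        x = (a >> shift) & 0xFF
        y = (b >> shift) & 0xFF
        if signed:
            # Reinterpret each byte as a two's-complement value.
            x = x - 0x100 if x & 0x80 else x
            y = y - 0x100 if y & 0x80 else y
            s = max(-0x80, min(0x7F, x + y))
        else:
            s = min(0xFF, x + y)
        out |= (s & 0xFF) << shift
    return out


a = 0xF0107F80
b = 0x20200180
unsigned_sum = padd_sat_bytes(a, b)
signed_sum = padd_sat_bytes(a, b, signed=True)
```

Each byte is clamped without any branch on a carry or overflow flag at the program level, which is the saving the text attributes to hardware saturation support.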

Fig 5. Comparison of performance index, defined as the inverse of the number of clock cycles consumed, for four different combinations depending on whether VLIW or scalar is used, and whether media instructions are included or not, for 4 different benchmark programs. (Bars: DIVA (VLIW2, media); VLIW2, no_media; scalar, media; scalar, no_media. Programs: dot product, matrix multiplication, chroma keying, alpha blending. Axis: performance index, 0 to 6.)
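The dot product and matrix multiplication benchmarks rest on the PMADD operation of Fig. 4. The sketch below is a minimal Python model of that operation and of a 4 × 4 by 4 × 1 product built from it. Several points are assumptions: elements are treated as unsigned, each 16-bit sum wraps around on overflow (the paper does not say whether it wraps or saturates), and the final addition of the two halves merely stands in for the PADDW step without reproducing the authors' instruction sequence. The names `pmadd` and `matvec4` are hypothetical.

```python
# Sketch of PMADD: dest = {A3*B3 + A2*B2, A1*B1 + A0*B0} as two 16-bit words.
# Assumptions: unsigned elements, 16-bit wraparound; names are hypothetical.


def pmadd(a, b):
    """Multiply corresponding bytes and add the products pairwise."""
    ab = [((a >> s) & 0xFF) * ((b >> s) & 0xFF) for s in (0, 8, 16, 24)]
    low_word = (ab[1] + ab[0]) & 0xFFFF
    high_word = (ab[3] + ab[2]) & 0xFFFF
    return (high_word << 16) | low_word


def matvec4(rows, vec):
    """Multiply a 4 x 4 byte matrix (one packed row per register) by a vector."""
    result = []
    for row in rows:
        partial = pmadd(row, vec)
        # Combine the two partial sums of the row into one element.
        result.append(((partial >> 16) + (partial & 0xFFFF)) & 0xFFFF)
    return result


identity = [0x01000000, 0x00010000, 0x00000100, 0x00000001]
vector = 0x0A141E28  # elements 10, 20, 30, 40
product = matvec4(identity, vector)
```

One `pmadd` call per matrix row covers the four multiplications and two of the three additions that the row needs, which is where the reduction from 16 multiplications and 12 accumulations comes from.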

Fig 6. Comparison of performance index in VLIW2 without media instructions, as media instructions (add/sub saturation instruction, PCMPE, and PMADD) are added. (Bars: VLIW2, no_media; with saturation; with saturation and PCMPE; with saturation, PCMPE and PMADD. Programs: chroma keying, matrix multiplication, alpha blending, dot product. Axis: performance index, 0 to 3.)
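The compare-and-select pattern of Section II.E, used by the chroma keying benchmark, can be sketched as follows. This is a minimal Python model under stated assumptions: PCMPE is read as a packed compare for equality, a true element produces an all-ones byte (0xFF) in the mask, and the example keys on four 8-bit samples of a single plane rather than on full RGB pixels. The names `pcmpe_bytes` and `select` are hypothetical.

```python
# Sketch of branch-free selection with a packed compare mask.
# Assumptions: equality compare, 0xFF per true element; names are hypothetical.

MASK32 = 0xFFFFFFFF


def pcmpe_bytes(a, b):
    """Return 0xFF in every byte position where a and b hold equal bytes."""
    mask = 0
    for shift in (0, 8, 16, 24):
        if (a >> shift) & 0xFF == (b >> shift) & 0xFF:
            mask |= 0xFF << shift
    return mask


def select(mask, if_set, if_clear):
    """Take bytes from if_set where mask is ones, from if_clear elsewhere."""
    return (if_set & mask) | (if_clear & ~mask & MASK32)


foreground = 0x10FF20FF  # samples equal to the key value are keyed out
key = 0xFFFFFFFF         # key value replicated in all four bytes
background = 0xAABBCCDD
mask = pcmpe_bytes(foreground, key)
composite = select(mask, background, foreground)
```

The loop inside `pcmpe_bytes` only models what the hardware does in parallel; at the program level the selection is three logical operations with no branch per sample.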

Fig 7. Photograph of DIVA chip.

V. Conclusions

In this paper, we described the DIVA (Dual-Issue VLIW Architecture) processor, which is targeted to image processing applications and is based on media instructions and the VLIW architecture. After analyzing a wide range of image processing programs, we defined an instruction set consisting of thirty media instructions with parallel sub-word execution capability. DIVA can issue two instructions per clock cycle and execute them with eight functional units. The DIVA processor has been implemented in a 0.6 µm CMOS SOG process and its performance is 80 MOPS at a 10 MHz clock frequency. In the experimental setup for image processing applications, DIVA has shown a performance improvement of 2.2 to 5 times over the scalar processors without media instructions. Even in comparison with the 2-issue VLIW architecture without media instructions, the performance of DIVA is improved by 1.8 to 2.7 times without any detrimental effect on the cycle time. This result shows that the inherent parallelism of the VLIW architecture can only be fully exploited by employing proper media instructions dealing with packed byte or packed word data types.

References

[1] T.M. Conte et al., "Challenges to Combining General-Purpose and Multimedia Processors," IEEE Computer, pp. 33-37, Dec. 1997.
[2] A.K. Jain, "Fundamentals of Digital Image Processing," Prentice-Hall, Inc., 1989.
