DIVA: Dual-Issue VLIW Architecture with Media Instructions for Image Processing
Sang-Joon Nam, Young-Su Kwon, Yeon-Ho Im, Kyung-Gu Kang, and Chong-Min Kyung
VLSI Systems Lab., Department of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Taejon, Korea
Phone: +82-42-866-0848 (Sang-Joon Nam), FAX: +82-42-866-0702
e-mail address:
[email protected]
ABSTRACT

To meet the demand for processing enormous amounts of multimedia data, we have designed a VLIW (Very Long Instruction Word) processor called DIVA (Dual-Issue VLIW Architecture) that exploits the ILP (instruction-level parallelism) in multimedia programs. The DIVA processor, which can execute two instructions in one cycle, supports 86 instructions including 30 media instructions, and has a sub-word execution structure that supports saturation-mode arithmetic for image processing. Compared to scalar architectures without media instructions, the performance of the DIVA processor is improved by 2.2 to 5 times due to the combination of the VLIW architecture and media instructions. The DIVA processor, consisting of about 90,000 gates, was implemented in a 0.6 µm CMOS SOG (Sea-of-Gate) process on an 8 mm × 8 mm die, and has shown a performance of 80 MOPS (Million Operations Per Second) at a 10 MHz clock frequency.

I. Introduction

In addition to the text, numbers, and 2-D graphics data previously handled, the processor market is moving fast toward digital multimedia processing areas such as digital image, speech, 3-D graphics, audio, and video. To meet the demand for multimedia processing, two approaches are in progress.

Over the past four years, major vendors of general-purpose processors have extended their instruction set architectures to support multimedia workloads. On the other hand, dedicated processors targeted at multimedia applications, called media processors, have employed similar semantics. In addition, some media processors incorporate multimedia-specific features including vector instructions, VLIW architectures, and special-purpose hardware such as video and audio ports and bit-stream CODECs.

General-purpose processors with multimedia extensions and media processors need a specific architecture to exploit the ILP (Instruction Level Parallelism) in multimedia application programs and execute multiple operations simultaneously. Superscalar and VLIW architectures can take advantage of this parallelism [1].

Considering the following characteristics of multimedia applications, we have chosen the VLIW architecture as our target ILP processor:

- Multimedia application programs are mostly described in high-level programming languages.
- Compilers must exploit the parallelism in application programs and architectures.
- Sub-word execution is required to process the large amounts of small-size data.
In this paper we describe a prototype DSP core based on the VLIW architecture, called DIVA (Dual-Issue VLIW Architecture), with sub-word operations for image processing. DIVA can issue two instructions per clock cycle and execute them with eight functional units. The performance can be significantly improved by an appropriate combination of the VLIW architecture and sub-word operations, as shown by the experimental results in Section IV using real application programs.

The rest of this paper is organized as follows. The features of DIVA are described in Section II. Section IV shows the performance of DIVA with experimental results.

II. DIVA processor

This section describes the DIVA (Dual-Issue VLIW Architecture) processor, which issues two instructions per clock cycle and executes sub-word operations for image processing.

A. Overview of DIVA processor

Major features of the DIVA processor include:

- 2-issue VLIW architecture
- 64-bit program bus and 32-bit data bus
- 4 pipeline stages (instruction fetch, instruction decode, execute, and write back)
- 64K 16-bit word program/data memory space
- 8 independent functional units
- Sub-word execution (8- or 16-bit)
- Saturating arithmetic operation
- 2-bank, 16-entry, 32-bit register file with 2 read / 1 write ports
- 4-depth nested hardware FOR-loop
- 86 instructions (including 30 media instructions)

The block diagram of the DIVA processor is shown schematically in Fig. 1. The DIVA processor consists of two integer ALUs (IALU0, IALU1), an integer multiplier (IMUL), a media ALU (PALU), a media multiplier (PMUL), a shift unit (SHIFT), a load/store unit (LDST), a two-bank register file having 32 entries altogether, and a controller which consists of a branch and program counter unit (BRPC), pipeline controllers, and an operand network.

Fig. 1. Block diagram of the DIVA processor, consisting of two integer ALUs, an integer multiplier, a media ALU, a media multiplier, a shift unit, a load/store unit, a two-bank register file, and a controller. (Labels recoverable from the figure: 64-bit PDATA carrying INST 1 and INST 0; ID, EX, and WB instruction stages; LDST with a 32-bit port and WEB/OEB signals; two register file banks of 16 x 32 bit each; operand network; IALU0, IALU1, IMUL, PALU, PMUL, SHIFT; BRPC with 16-bit PADDR; RESET.)

The IALU performs integer arithmetic, logical operations, and address calculation. Because these operations are executed frequently, the DIVA processor has two IALUs. The BRPC controls the program sequence and supports loop and branch operations.

B. Data types of DIVA processor

To accommodate a wide variety of image processing algorithms, we support rich packed data types. As shown in Fig. 2, the DIVA processor defines three packed data types: packed byte, packed word, and double word. Each element within a packed data type is a fixed-point integer. The packed word data type is used for operations that need higher precision than 8 bits.
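As an illustration of the three data types, the sketch below models a 32-bit register as a Python integer and views it as packed bytes or packed words. The helper names (`unpack_bytes`, `pack_bytes`, `unpack_words`, `pack_words`) are hypothetical and are not DIVA instructions; elements are treated as unsigned, and element 0 is taken to occupy the least significant bits, following the bit positions in Fig. 2.

```python
MASK32 = 0xFFFFFFFF  # a double word is the full 32-bit register


def unpack_bytes(reg):
    """Return the four packed bytes as [Byte 0, Byte 1, Byte 2, Byte 3]."""
    reg &= MASK32
    return [(reg >> (8 * i)) & 0xFF for i in range(4)]


def pack_bytes(lanes):
    """Combine [Byte 0, ..., Byte 3] into one 32-bit register value."""
    reg = 0
    for i, b in enumerate(lanes):
        reg |= (b & 0xFF) << (8 * i)
    return reg


def unpack_words(reg):
    """Return the two packed 16-bit words as [Word 0, Word 1]."""
    reg &= MASK32
    return [(reg >> (16 * i)) & 0xFFFF for i in range(2)]


def pack_words(lanes):
    """Combine [Word 0, Word 1] into one 32-bit register value."""
    reg = 0
    for i, w in enumerate(lanes):
        reg |= (w & 0xFFFF) << (16 * i)
    return reg
```

For example, the register value 0x11223344 holds the packed bytes 0x44, 0x33, 0x22, 0x11 (Byte 0 first) or the packed words 0x3344 and 0x1122.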
Fig. 2. Three data types in the DIVA processor: (a) packed byte (Byte 3 in bits 31-24, Byte 2 in bits 23-16, Byte 1 in bits 15-8, Byte 0 in bits 7-0), (b) packed word (Word 1 in bits 31-16, Word 0 in bits 15-0), and (c) double word (bits 31-0).

To execute the sub-word operations, the PALU consists of four 8-bit sub-ALUs and carry propagation logic. The PALU can execute either four 8-bit add/subtract operations in parallel or two 16-bit add/subtract operations in parallel, while 32-bit operations are handled only by the IALU.

Carry propagation of each sub-ALU is controlled according to the required data type. In byte operation, the four sub-ALUs are independent of each other and the carry does not propagate across byte borders. In word operation, two carry_enable signals become active to propagate the carry of the lower 8-bit sub-ALU to the higher one, generating 16-bit results. If both issued instructions are media instructions, the DIVA processor can execute four to eight operations per cycle, depending on the data type.

C. Packed addition and subtraction instructions with saturation

These instructions deal with two packed data types, i.e., packed byte and packed word. Each single add or subtract operation is independent of the others, in either unsigned-saturation mode or signed-saturation mode. The upper and lower saturation limits are 0xFF and 0x00 for unsigned bytes, and 0x7F and 0x80 for signed bytes. For packed words, the limits are the maximum and minimum unsigned or signed values that the data type can represent. If the saturation mode is not supported, a complex saturation check routine needs to be implemented with branch instructions using the carry or overflow flag, which significantly increases the execution cycle count.

D. Packed multiplication instructions

We defined two variations of these instructions, both of which support 8-bit precision multiplication. The PMULHW (Packed multiply high) or PMULLW (Packed multiply low) instruction performs four 8-bit x 8-bit multiplications and lets the user choose the higher or the lower 8 bits of each 16-bit multiplication result, respectively. The effect of code size reduction due to the inclusion of the PMULLW instruction in DIVA is shown in Fig. 3 for an α-blending program. α-blending is a way of mixing two images by adding, pixel by pixel, the two corresponding pixel values weighted by α and (1-α), respectively. Fig. 3 shows a part of the program that processes one pixel. The 8-bit RGB component data of each pixel is stored in series in a packed format in register R10, while the α-value is stored in R7. For α-blending, we multiply each value of the red, green, and blue planes by the α-value. To perform this operation without media instructions, sequential logical shifts by 8 bits for the blue plane, by 16 bits for the green plane, and by 24 bits for the red plane need to be performed.

(a)
    ;; R12 : 0x000000ff
    and    R4, R10, R12
    mul.uu R4, R4, R7
    addu   R5, R5, R4
    lsr    R4, R10, 8
    and    R4, R4, R12
    mul.uu R4, R4, R7
    lsl    R4, R4, 8
    addu   R5, R5, R4
    lsr    R4, R10, 16
    and    R4, R4, R12
    mul.uu R4, R4, R7
    lsl    R4, R4, 16
    addu   R5, R5, R4

(b)
    ;; R10 : image 1
    ;; R7 : {0x00, alpha, alpha, alpha}
    pmullw R5, R10, R7

Fig. 3. Comparison of code size of a part of the α-blending program processing one pixel, showing that (a) 13 instructions are necessary when no media instruction is available, while (b) one instruction is enough when the media instruction PMULLW is used.
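The lane-wise behavior described for PMULHW and PMULLW can be sketched in Python as follows. This is only a behavioral model under stated assumptions: the elements are treated as unsigned 8-bit values, element 0 is the least significant byte, and the function names `pmul_low` and `pmul_high` are hypothetical stand-ins, not the actual DIVA implementation.

```python
def _lanes(reg):
    """Split a 32-bit value into four unsigned bytes, Byte 0 first."""
    return [(reg >> (8 * i)) & 0xFF for i in range(4)]


def _pack(lanes):
    """Reassemble four bytes (Byte 0 first) into a 32-bit value."""
    out = 0
    for i, b in enumerate(lanes):
        out |= (b & 0xFF) << (8 * i)
    return out


def pmul_low(src1, src2):
    """Four 8-bit x 8-bit products; keep the lower 8 bits of each (assumed PMULLW-like)."""
    return _pack([(a * b) & 0xFF for a, b in zip(_lanes(src1), _lanes(src2))])


def pmul_high(src1, src2):
    """Four 8-bit x 8-bit products; keep the higher 8 bits of each (assumed PMULHW-like)."""
    return _pack([((a * b) >> 8) & 0xFF for a, b in zip(_lanes(src1), _lanes(src2))])


# One pixel with three 8-bit components, and a packed multiplier laid out
# like R7 in Fig. 3: {0x00, alpha, alpha, alpha}.
pixel = 0x00102030
low = pmul_low(pixel, 0x00030303)    # alpha = 3
high = pmul_high(pixel, 0x00808080)  # alpha = 0x80
```

A single call handles all three color components of the pixel, which is the source of the code size reduction in Fig. 3; the unused top element is multiplied by zero.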
By replacing all these operations with the PMULLW instruction in DIVA, the number of registers used is reduced from 5 to 3. Overall, media instructions reduce the code size by 92%.

The PMADD (Packed multiply and addition) instruction, shown in Fig. 4, is the basic operation for the fast multiply-accumulate capability in DIVA. It multiplies the corresponding 8-bit elements from two sources, generating four 16-bit products. The first two 16-bit products are added up for one 16-bit result, and so are the two remaining 16-bit products. The PMADD instruction, therefore, performs four multiplications and two 16-bit accumulations. The product of a 4 × 4 matrix and a 4 × 1 vector, which normally requires 16 multiplications and 12 accumulations, can be done using 4 PMADD and 4 PADDW (packed addition word) instructions.

    Bits:  31-24   23-16   15-8    7-0
    src1:  A3      A2      A1      A0
    src2:  B3      B2      B1      B0
    dest:  A3×B3 + A2×B2 (bits 31-16), A1×B1 + A0×B0 (bits 15-0)

Fig. 4. Packed multiply-add half-word to word (PMADD).

PMUL is designed as a two-stage pipeline in order to prevent the cycle time from increasing due to the added media instructions.

E. Packed compare instructions

These instructions independently compare the corresponding data elements of two packed data types, in parallel, to generate a mask of 1's and 0's, depending on whether the relevant condition is true or false. Subsequent instructions can use the compare result mask of 1's and 0's to select elements from either one of two different input sources. These instructions are very useful in various image filtering and composition applications such as the chroma keying program.

Chroma keying is an image overlay technique frequently used in weather forecasts. If a pixel of the girl's image (the foreground) is not blue, that pixel is selected as the corresponding pixel of the result image; otherwise, the corresponding pixel of the scenery image is selected. This data selection without branch instructions enables a great performance enhancement.

IV. Experimental Results

The DIVA processor was fabricated in a 0.6 µm CMOS SOG (Sea-of-Gate) process. The specification of the DIVA processor is summarized in Table 1.

Table 1. Specification of DIVA processor.

    Performance          80 MOPS
    Process Technology   0.6 µm SOG
    Chip Size            8 mm × 8 mm
    Gate Count           88,931
    Supply Voltage       5 V Single Supply
    Operating Frequency  10 MHz (typical)
    Package              160-pin QFP

Four programs, α-blending, chroma keying, matrix multiplication, and dot product, are used to measure the performance of DIVA in the field of image processing applications.

The relative performance of four processor configurations is shown in Fig. 5. The four configurations differ in whether a VLIW or a scalar processor is used, and whether media instructions are included or not. "Scalar" indicates the scalar, single-issue processor. "VLIW2, no_media" indicates the dual-issue VLIW processor with no support for media instructions. Performance for each program is normalized to that of the scalar processor with no media instruction support (scalar, no_media). Because DIVA issues two instructions per clock cycle and uses the media instructions, the performance of DIVA is improved by 2.2 to 5 times compared to that of the scalar processor without media instructions. It is also shown that, without media instructions, even VLIW processors are inferior to scalar processors with media instructions. This result shows that the inherent parallelism of the VLIW processor can only be fully exploited by employing proper media instructions dealing with packed byte or packed word data types.

Fig. 6 shows the effect of some media instructions on the performance in various application programs. Performance is normalized to that of VLIW2 without media instructions. The performance of the VLIW architecture improves as media instructions are added. No increase in performance after adding a new media instruction means that the program does not use that media instruction. Addition/subtraction instructions with saturation and the PMADD instruction improve the performance of DIVA by 30% and 100%, respectively. Through the synergistic effect of combining the media instructions and the VLIW architecture, the performance of DIVA is improved by 1.8 to 2.7 times compared to the dual-issue VLIW architecture without media instructions, depending on the application. A chip photo of the DIVA processor is shown in Fig. 7.
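The PMADD operation behind the dot product and matrix multiplication benchmarks can be modeled as below. This is a behavioral sketch under assumptions that the paper does not state: elements are unsigned 8-bit values, each 16-bit sum wraps modulo 2^16, and element 0 is the least significant byte. The names `pmadd`, `dot4`, and `matvec4` are hypothetical, and the final addition of the two 16-bit halves is done with ordinary integer arithmetic rather than with a modeled PADDW.

```python
def pmadd(src1, src2):
    """Multiply four byte pairs and add the products pairwise into two 16-bit results."""
    a = [(src1 >> (8 * i)) & 0xFF for i in range(4)]
    b = [(src2 >> (8 * i)) & 0xFF for i in range(4)]
    low = (a[1] * b[1] + a[0] * b[0]) & 0xFFFF   # A1*B1 + A0*B0, assumed to wrap
    high = (a[3] * b[3] + a[2] * b[2]) & 0xFFFF  # A3*B3 + A2*B2, assumed to wrap
    return (high << 16) | low


def dot4(src1, src2):
    """4-element dot product: one pmadd, then add the two 16-bit halves."""
    r = pmadd(src1, src2)
    return (r >> 16) + (r & 0xFFFF)


def matvec4(rows, vec):
    """Matrix-vector product with one packed row per register (one pmadd per row)."""
    return [dot4(row, vec) for row in rows]


# Row (1, 2, 3, 4) times vector (5, 6, 7, 8), with A3 in the top byte.
packed = pmadd(0x01020304, 0x05060708)
total = dot4(0x01020304, 0x05060708)
```

Each row of a 4 × 4 matrix needs one `pmadd` call, which mirrors the four PMADD instructions counted in Section II.D.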
Fig. 5. Comparison of performance index, defined as the inverse of the number of clock cycles consumed, for four different combinations depending on whether VLIW or scalar is used, and whether media instruction is included or not, for 4 different benchmark programs. (Bar chart. Configurations: DIVA (VLIW2, media); VLIW2, no_media; scalar, media; scalar, no_media. Programs: dot product, matrix multiplication, chroma keying, alpha blending. Axis: performance index, 0 to 6.)

Fig. 6. Comparison of performance index in VLIW2 without media instructions, as media instructions (add/sub saturation instruction, PCMPE, and PMADD) are added. (Bar chart. Configurations: VLIW2, no_media; with saturation; with saturation and PCMPE; with saturation, PCMPE and PMADD. Programs: chroma keying, matrix multiplication, alpha blending, dot product. Axis: performance index, 0 to 3.)
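The saturating addition that forms the first step in Fig. 6 can be sketched for the unsigned packed-byte case as follows. Only the 0x00 and 0xFF limits come from Section II.C; the function names `padd_wrap` and `padd_sat_unsigned` are hypothetical, and element 0 is assumed to be the least significant byte.

```python
def _lanes(reg):
    """Split a 32-bit value into four unsigned bytes, Byte 0 first."""
    return [(reg >> (8 * i)) & 0xFF for i in range(4)]


def _pack(lanes):
    """Reassemble four bytes (Byte 0 first) into a 32-bit value."""
    out = 0
    for i, b in enumerate(lanes):
        out |= (b & 0xFF) << (8 * i)
    return out


def padd_wrap(src1, src2):
    """Four independent 8-bit additions; no carry crosses a byte border."""
    return _pack([(a + b) & 0xFF for a, b in zip(_lanes(src1), _lanes(src2))])


def padd_sat_unsigned(src1, src2):
    """Same additions, but each result is clamped to the upper limit 0xFF."""
    return _pack([min(a + b, 0xFF) for a, b in zip(_lanes(src1), _lanes(src2))])


wrapped = padd_wrap(0xF0102030, 0x20F01010)
clamped = padd_sat_unsigned(0xF0102030, 0x20F01010)
```

In the two upper bytes the wrapped sums fall back to small values, whereas the saturated sums stay at 0xFF, which is the behavior wanted for pixel brightness and which would otherwise need a flag test and branch per element.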
Fig. 7. Photograph of DIVA chip.

V. Conclusions

In this paper, we described the DIVA (Dual-Issue VLIW Architecture) processor, which is targeted at image processing applications and is based on media instructions and the VLIW architecture. After analyzing a wide range of image processing programs, we defined an instruction set consisting of thirty media instructions with parallel sub-word execution capability. DIVA can issue two instructions per clock cycle and execute them with eight functional units. The DIVA processor has been implemented in a 0.6 µm CMOS SOG process and its performance is 80 MOPS at a 10 MHz clock frequency. In the experimental setup for image processing applications, DIVA has shown a performance improvement of 2.2 to 5 times over scalar processors without media instructions. Even in comparison with a 2-issue VLIW architecture without media instructions, the performance of DIVA is improved by 1.8 to 2.7 times without any detrimental effect on the cycle time. This result shows that the inherent parallelism of the VLIW architecture can only be fully exploited by employing proper media instructions dealing with packed byte or packed word data types.

References

[1] T.M. Conte et al., "Challenges to Combining General-Purpose and Multimedia Processors," IEEE Computer, pp. 33-37, Dec. 1997.
[2] A.K. Jain, "Fundamentals of Digital Image Processing," Prentice-Hall, Inc., 1989.