Mapping of Application Software to the Multimedia Instructions of General-Purpose Microprocessors
Ruby Lee and Larry McMahan
Hewlett-Packard Company
[email protected], [email protected]

Abstract: This paper describes how media processing programs may be accelerated by using the multimedia instruction extensions that have been added to general-purpose microprocessors. As a concrete example, it describes MAX2, a minimalist, second-generation set of multimedia instructions included in the PA-RISC 2.0 processor architecture. MAX2 implements subword parallel instructions, which utilize the microprocessor’s 64-bit wide datapaths to process multiple pieces of lower-precision data in parallel. It also includes innovative new instructions like Mix, which are very useful for matrix transpose and other common data rearrangements. The paper examines some typical multimedia kernels, like Block Match, Matrix Transpose, Box Filter and the IDCT, coded with and without the MAX2 instructions, to illustrate programming techniques for exploiting subword parallelism and superscalar instruction parallelism. The kernels using MAX2 show significant speedups in execution time, and more efficient utilization of the processor’s resources.

Keywords: multimedia extensions, subword parallelism, MAX2, PA-RISC, media processing, code optimizations, SIMD, packed arithmetic.
1. Introduction
Media processing, or the processing of digital multimedia data such as images, video, audio, and graphics, requires significant computation power. For example, merely reading (or viewing) a video object requires performing video decompression in real-time (e.g. 30 frames per second), and storing video requires performing video compression. As the number of media streams, the frame size and the desired fidelity of the multimedia objects increase, the compute and access bandwidth requirements also increase.

General-purpose digital processors have a set of instructions that can be programmed to perform any algorithm on binary bits of data. However, many media processing algorithms can be significantly accelerated with the addition of a few new instructions. These new instructions exploit the fact that there is a great deal of parallelism in media processing algorithms, and that the data being worked on (e.g., 8-bit pixels) have lower precision than the word size of modern microprocessors, which is currently either 32 or 64 bits.

Subword parallelism [1,2] is a technique proposed for performing parallel operations on lower-precision data packed into word-oriented datapaths. For example, the 64 bits in a processor’s register can be assumed to represent four 16-bit quantities. By very minor changes in the 64-bit adder, it can be used to perform either one 64-bit add or four parallel 16-bit adds (Figure 1). These subword parallel instructions, like Parallel Add, have been called “multimedia instructions” because they were first introduced into the instruction sets of general-purpose processors for performing multimedia functions with software rather than hardware solutions [3,4]. In fact, they can be used for any programs that repeat a set of operations on different sets of lower-precision data. They are often called SIMD instructions (Single Instruction Multiple Data [18]), since they perform the same operation on multiple subwords.
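As an illustration of the idea, the following Python sketch emulates four parallel 16-bit modulo adds inside one 64-bit word. The helper names (`pack4`, `unpack4`, `parallel_add`) are hypothetical, and the masking trick is only a software stand-in for the modified adder described above, which blocks carries between subwords in hardware.

```python
MASK64 = (1 << 64) - 1
HIGH = 0x8000_8000_8000_8000  # top bit of each 16-bit subword
LOW = 0x7FFF_7FFF_7FFF_7FFF   # low 15 bits of each 16-bit subword


def pack4(subwords):
    """Pack four 16-bit values into one 64-bit word, leftmost value first."""
    word = 0
    for s in subwords:
        word = (word << 16) | (s & 0xFFFF)
    return word


def unpack4(word):
    """Split a 64-bit word into four 16-bit values, leftmost first."""
    return [(word >> shift) & 0xFFFF for shift in (48, 32, 16, 0)]


def parallel_add(x, y):
    """Four independent 16-bit modulo adds performed with one wide add.

    Adding only the low 15 bits of each subword cannot carry across a
    subword boundary; the top bits are then folded in with XOR, which is
    an add that discards the carry out of each subword.
    """
    return (((x & LOW) + (y & LOW)) ^ ((x ^ y) & HIGH)) & MASK64


x = pack4([1, 0xFFFF, 0x7FFF, 100])
y = pack4([2, 1, 1, 200])
print([hex(v) for v in unpack4(parallel_add(x, y))])
```

An ordinary 64-bit add of the same two words would let the overflow of the second subword carry into the first, which is exactly what the partitioned adder prevents.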
The purpose of this paper is to illustrate how media processing programs may use multimedia extensions in general-purpose processors. We do this by showing a few key techniques in some typical multimedia kernels, commonly found in image, video, audio and graphics programs. In section 2, we describe the multimedia instructions that have been added to microprocessor architectures, in particular, the MAX2 multimedia extensions. MAX2 is the second generation of multimedia instructions for the 64-bit PA-RISC 2.0 instruction set architecture. In section 3, we describe four important media processing kernels, optimized at the instruction-set level (assembly language) both with and without the MAX2 instructions. In section 4, we summarize the programming techniques used in these examples. In section 5, we compare the performance of equally optimized code, with and without the MAX2 instruction extensions. Section 6 concludes the paper.

Proceedings of Multimedia Hardware Architectures 1997, IS&T/SPIE Symposium on Electronic Imaging: Science and Technology, February 10-14, 1997, San Jose, California, pp.122-133.
[Figure 1: General registers holding operands x1..x8 and y1..y8 feed two ALUs. (a) Two standard ALUs: 2 ops/cycle (x1+y1, x2-y2). (b) Two partitionable 64-bit ALUs: 8 ops/cycle.]
Figure 1a: Superscalar Processor with 2 ALUs
Figure 1b: Subword Parallelism in Superscalar Processor
2. Multimedia Instructions
Multimedia extensions for general-purpose processors were first introduced in a product in January 1994 with the MAX1 (Multimedia Acceleration eXtensions) instructions for 32-bit PA-RISC processors [2-4]. Later, Sun introduced VIS (Visual Instruction Set) for UltraSparc processors [6], HP introduced MAX2 for 64-bit PA-RISC processors [5,1], and Intel introduced MMX (Multi Media eXtensions) for x86 processors [7]. Recently, MIPS has announced the MDMX multimedia extensions for some future MIPS processors [8], and DEC has announced a small number of instructions to support MPEG for Alpha processors [8]. To provide real code examples, we have to choose one of these sets of multimedia extensions. We have chosen MAX2, since it is perhaps the simplest set of general-purpose multimedia acceleration primitives: it shares its key characteristics with the other multimedia extensions, and adds some uniquely versatile yet simple features.
2.1 MAX2 instructions in PA-RISC 2.0

Parallel Subword Instruction            Description
Parallel add, hadd                      Add 4 pairs of 16-bit operands, with modulo arithmetic
  hadd,ss                               Add 4 pairs of 16-bit operands, with signed saturation
  hadd,us                               Add 4 pairs of 16-bit operands, with unsigned saturation
Parallel subtract, hsub                 Subtract 4 pairs of 16-bit operands, with modulo arithmetic
  hsub,ss                               Subtract 4 pairs of 16-bit operands, with signed saturation
  hsub,us                               Subtract 4 pairs of 16-bit operands, with unsigned saturation
Parallel shift left & add, hshladd      Multiply 4 first operands by 2, 4 or 8 and add corresponding second operands
Parallel shift right & add, hshradd     Multiply 4 first operands by 1/2, 1/4 or 1/8 and add corresponding second operands
Parallel average, havg                  Arithmetic mean of 4 pairs of operands
Parallel shift right signed, hshr       Shift right by 0 to 15 bits, with sign extension on the left
Parallel shift right unsigned, hshr,u   Shift right by 0 to 15 bits, with zero extension on the left
Parallel shift left, hshl               Shift left by 0 to 15 bits, with zeros shifted in on the right
Mix, mixh,L mixh,R mixw,L mixw,R        Interleave alternate 16-bit [h] or 32-bit [w] subwords from two source registers, starting from Leftmost [L] subword, or ending with Rightmost [R] subword
Permute, permh                          Rearrange subwords from one source register, with or without repetition

Table 1: MAX2 Instructions in PA-RISC 2.0

MAX is a minimalistic set of parallel subword instructions for PA-RISC processors. Although MAX2 is the second generation of multimedia instructions for PA-RISC processors, it is still a much smaller set than those now proposed for other processors [5-10]. It uses the existing microprocessor registers and functional units, like the Arithmetic Logical Unit (ALU) and the Shift-Merge Unit (SMU). MAX2 features are added only if they have potential general-purpose usage, in addition to providing significant speedup for media processing. Table 1 shows the instructions in MAX2. The instructions in MAX1 are a proper subset, including only the parallel subword arithmetic instructions (through havg).

IS&T’s 49th Annual Conference
2.2 Parallel Subword Compute Instructions
These instructions perform the basic arithmetic functions of add and subtract, and a few common varieties of multiply and divide. The Parallel Add and Parallel Subtract instructions each have three variants, which differ only in the way they treat overflow. The default action is modulo arithmetic, where any overflow is discarded. If signed saturation is specified in the instruction, an overflow causes the result to be clipped to the largest or smallest signed integer representable in the result range, depending on the direction of the overflow. Similarly, if unsigned saturation is specified, an overflow causes the result to be clipped to the largest or smallest unsigned integer in the result range [2].

One difference between the multimedia extensions in different microprocessor architectures is the support provided for multiplication. A fast multiply circuit typically occupies two to three times the space of an adder, and takes several execution cycles. Furthermore, the product is twice as long as the operands, assuming these have the same number of bits. In addition, the audio and 3-D graphics transformations that require the most multiplications usually also need multiply-accumulate, and more than 16 bits of precision for intermediate results. Hence, in MAX2, we decided on two approaches for multiplication, depending on the media stream being processed.

For audio and 3-D graphics transformations, the full power and versatility of the floating-point multiply-accumulate functional units is used. This gives two single-precision (32-bit) or two double-precision (64-bit) multiply-accumulate instructions per cycle in PA-RISC processors, or the equivalent of four operations per cycle.
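The three overflow treatments of Parallel Add and Parallel Subtract can be sketched as follows. This is a minimal Python emulation on lists of four 16-bit patterns, not the architectural definition: the function names are hypothetical, and in the unsigned-saturation case the sketch assumes both operands are unsigned, which is a simplification of how the real instructions interpret their operands.

```python
def to_signed16(value):
    """Reinterpret a 16-bit pattern (0..0xFFFF) as a two's-complement integer."""
    return value - 0x10000 if value & 0x8000 else value


def clip(value, lowest, highest):
    """Clamp value into the inclusive range [lowest, highest]."""
    return max(lowest, min(highest, value))


def parallel_op(xs, ys, subtract=False, mode="modulo"):
    """Emulate hadd/hsub on four 16-bit subwords.

    mode is "modulo" (overflow discarded), "ss" (signed saturation) or
    "us" (unsigned saturation, both operands assumed unsigned here).
    """
    results = []
    for x, y in zip(xs, ys):
        if mode == "ss":
            x, y = to_signed16(x), to_signed16(y)
        raw = x - y if subtract else x + y
        if mode == "modulo":
            results.append(raw & 0xFFFF)
        elif mode == "ss":
            results.append(clip(raw, -0x8000, 0x7FFF) & 0xFFFF)
        elif mode == "us":
            results.append(clip(raw, 0, 0xFFFF))
        else:
            raise ValueError(f"unknown mode: {mode}")
    return results


a = [0x7FFF, 0xFFFF, 1, 0x8000]
b = [1, 1, 2, 0x8000]
for mode in ("modulo", "ss", "us"):
    print(mode, [hex(v) for v in parallel_op(a, b, mode=mode)])
```

Comparing the three printed rows shows where the variants diverge: only the subwords that overflow differ, and the direction of clipping depends on whether the operands are read as signed or unsigned.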
For video, images, and graphics rendering, where the data are 8-bit pixels (or 12-bit pixels for medical images), multiplications by constants are done by a series of shift and add instructions, while multiplications by variables use the standard 64-bit integer multiply instruction. Our data indicate that many of the required multiplications are indeed by constants. MAX2 provides two multiply primitives: the Parallel Shift Left and Add, and the Parallel Shift Right and Add instructions. These instructions can shift the first operand left or right by 1, 2 or 3 bits, before adding the second operand. They are very effective in implementing multiplication by integer or fractional constants, respectively. They require just a minor modification to the existing preshifter to the integer ALU, rather than new subword-parallel integer multiplier circuits.

Division circuitry is even more expensive, and integer division circuitry is not usually provided by microprocessors. In MAX2, audio and 3-D graphics transforms use the floating-point registers and functional units, and so have access to full floating-point division (and reciprocal) circuitry. For the pixel-oriented media types, parallel integer division is simulated by a series of right shifts, which are divisions by a power of two. The Parallel Shift Right (Signed or Unsigned) instructions may be used for division of signed and unsigned subwords, respectively. They use the existing 64-bit shifter, but block any bits shifted out from one subword from being shifted into the adjoining subword. The Parallel Shift Right and Add instruction may also be used for division by 2, 4 or 8 (that is, multiplication by 1/2, 1/4 or 1/8). Division by any constant can be simulated with a combination of these instructions.

The Parallel Average instruction adds the two operands, then performs a divide by two. This is an add followed by a right shift of one bit. In the process, the overflow bit is shifted in as the most significant bit of the result, so the instruction has the added advantage that no overflow can occur. In addition, rounding is done on the least significant bit, to conserve precision in cascaded average operations. This instruction is very useful for interpolation, sub-pixel resolution, as well as division by two with rounding.
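The shift-and-add approach to constant multiplication can be illustrated with a short Python sketch. The functions `hshladd` and `hshradd` below only mimic the arithmetic of the two instructions on lists of four signed values; overflow and saturation behavior is deliberately left out, and the two derived helpers (`times_10`, `times_three_quarters`) are hypothetical examples, not sequences taken from the paper.

```python
ZERO = [0, 0, 0, 0]  # plays the role of register r0, which always reads as zero


def hshladd(xs, shift, ys):
    """Per subword: (x << shift) + y, with shift restricted to 1, 2 or 3."""
    if shift not in (1, 2, 3):
        raise ValueError("shift amount must be 1, 2 or 3")
    return [(x << shift) + y for x, y in zip(xs, ys)]


def hshradd(xs, shift, ys):
    """Per subword: (x >> shift) + y, with sign-extending right shift."""
    if shift not in (1, 2, 3):
        raise ValueError("shift amount must be 1, 2 or 3")
    return [(x >> shift) + y for x, y in zip(xs, ys)]


def times_10(xs):
    """10x = 2 * (4x + x): two shift-left-and-add steps."""
    five_x = hshladd(xs, 2, xs)      # 4x + x
    return hshladd(five_x, 1, ZERO)  # 2 * 5x + 0


def times_three_quarters(xs):
    """0.75x is approximated by x/4 + x/2: two shift-right-and-add steps."""
    quarter = hshradd(xs, 2, ZERO)   # x/4 + 0
    return hshradd(xs, 1, quarter)   # x/2 + x/4


print(times_10([1, 2, -3, 100]))
print(times_three_quarters([7, 8, 100, 16]))
```

Because each right shift truncates separately, fractional constants are only approximated: for an input of 7 the sketch yields 3 + 1 = 4 rather than 5.25, which is the precision cost of avoiding a multiplier.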
2.3 Data Alignment and Data Rearrangement Instructions
Data alignment is often needed to maintain the desirable significant bits in the intermediate results. This is achieved with the Parallel Shift Right or Left instructions. Data rearrangement of the packed subwords in a register is often needed in order that subsequent parallel subword operations can proceed at full parallelism. The design challenge is to find a small set of data rearrangement primitives that are most powerful for frequent inner-loop cases. MAX2 defines only two data rearrangement primitives, Mix and Permute, based on their versatility of use, and ease of implementation. Mix rearranges subwords from two source registers, while Permute provides a comprehensive set of rearrangements of subwords in a single source register.

The Mix instruction takes subwords from two registers, and interleaves alternate subwords from each register in the result register as shown in Figure 2. The subword sizes are indicated by the suffix “h” for halfword (16 bits), and “w” for word (32 bits). The second suffix, “L” or “R”, indicates Mix Left or Mix Right: Mix Left collects the odd-numbered subwords in the result register, whereas Mix Right collects the even-numbered subwords. (Because even and odd numberings change depending on numbering from 0 or 1, or from left or right, the names Mix Left and Mix Right are used rather than Mix Odd and Mix Even. Mix Left starts from the leftmost subword in each of the two source registers, while Mix Right ends with the rightmost subwords from each source register.) In Figure 2, the definitions of the four Mix instruction variants are given, where the contents of a register are given as four 16-bit elements. (Note that in PA-RISC instructions, the first two operands, Ra and Rb, are source registers, and Rc is the result register.) In section 3, the use of Mix is illustrated by a matrix transpose example, and in the IDCT. Mix also implements unpacking of operands, using R0 as one of the source registers, and subsequent packing of operands.
                     ;Ra = a1 a2 a3 a4, Rb = b1 b2 b3 b4   are the contents of the source registers
mixh,L  Ra,Rb,Rc     ;Rc = a1 b1 a3 b3
mixh,R  Ra,Rb,Rc     ;Rc = a2 b2 a4 b4
mixw,L  Ra,Rb,Rc     ;Rc = a1 a2 b1 b2
mixw,R  Ra,Rb,Rc     ;Rc = a3 a4 b3 b4

Figure 2: Definition of Mix Instruction Variants

The Permute instruction takes one source register, and produces a permutation of the subwords in that register. With 16-bit subwords, this instruction allows all possible permutations, with and without repetitions, of the four subwords in the source register. Figure 3 shows some possible permutations. A Permute index in the instruction, comprising four 2-bit indices, identifies which subword in the source register is to be placed in each subword of the destination register. Subwords in the source register, Ra, are numbered from left to right starting from zero. Permute allows, for example, the replication of a subword scalar value to all the subwords in a register, in a single cycle.
                     ;Ra = a b c d   are the contents of the source register
permh,0000  Ra,Rc    ;Rc = a a a a   replicate scalar across vector
permh,3210  Ra,Rc    ;Rc = d c b a   reverse order of subwords
permh,1003  Ra,Rc    ;Rc = b a a d   arbitrary permutation with repetition
permh,0312  Ra,Ra    ;Ra = a d b c   arbitrary permutation without repetition
Figure 3: Permute Instruction Examples
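The definitions in Figures 2 and 3 can be checked with a small Python model that treats a register as a list of four subwords, leftmost first. The function names are hypothetical, and the 4 x 4 transpose at the end is only one way to compose the Mix variants, shown here as an illustration rather than as the paper's own 8 x 8 routine.

```python
def mixh(ra, rb, side):
    """mixh,L interleaves subwords 1 and 3 of each source; mixh,R subwords 2 and 4."""
    i = 0 if side == "L" else 1
    return [ra[i], rb[i], ra[i + 2], rb[i + 2]]


def mixw(ra, rb, side):
    """mixw,L takes the left 32-bit word of each source; mixw,R the right word."""
    i = 0 if side == "L" else 2
    return [ra[i], ra[i + 1], rb[i], rb[i + 1]]


def permh(index, ra):
    """index is four digits, each selecting a source subword numbered 0..3 from the left."""
    return [ra[int(digit)] for digit in index]


def transpose_4x4(rows):
    """Transpose four registers of four halfwords using 8 Mix operations."""
    r0, r1, r2, r3 = rows
    t0 = mixh(r0, r1, "L")
    t1 = mixh(r0, r1, "R")
    t2 = mixh(r2, r3, "L")
    t3 = mixh(r2, r3, "R")
    return [mixw(t0, t2, "L"), mixw(t1, t3, "L"),
            mixw(t0, t2, "R"), mixw(t1, t3, "R")]


ra = ["a1", "a2", "a3", "a4"]
rb = ["b1", "b2", "b3", "b4"]
print(mixh(ra, rb, "L"), mixw(ra, rb, "R"))
print(permh("1003", ["a", "b", "c", "d"]))
print(transpose_4x4([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]))
```

The transpose uses halfword mixes to pair up elements from adjacent rows and word mixes to gather the pairs into columns, so no subword is ever moved individually.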
2.4 Other Useful PA-RISC features
In addition to the MAX2 instructions described above, other existing features in the PA-RISC architecture are also very useful for media processing [12-14, 5]. Table 2 lists some of the more useful ones.

The Shift Right Pair instruction allows two source registers to be concatenated and shifted together, with the resulting rightmost 64 bits placed in the destination register. This instruction facilitates use of arbitrarily aligned 64-bit quantities. The Extract instruction extracts a sequence of contiguous bits from a source register, and places it right-aligned in the destination register. The Deposit instruction does the reverse: it places a right-aligned field of bits from the source register anywhere in the destination register. All the existing logical functions are also available, and needed, for media processing [1,5,17].

The Floating-point Multiply Accumulate instructions (FMAC) provide high-performance multiply-accumulate, with full IEEE floating-point precision compliance, for audio and 3-D graphics media datatypes. The multiple floating-point condition bits allow simultaneous testing of conditions, while eliminating costly conditional branches. For example, very fast graphics accept and reject tests, which determine whether or not an object falls within a bounding box, are supported by PA-RISC processors using these condition bits.

The low-overhead cache prefetch instructions can take advantage of the highly predictable, streaming nature of the memory accesses of many media processing programs by prefetching data into the cache before it is actually used, thus hiding memory latencies from cache misses. Load and store instructions may use a cache hint to indicate that the data has spatial locality (but no temporal locality), and may be fetched into a look-aside buffer, to prevent replacing useful cache lines for data that is used only once.

Another PA-RISC feature, especially useful for code without MAX2 instructions, is the arithmetic nullify feature. Every arithmetic, logical and field manipulation instruction generates a condition which can be used to nullify the execution of the next instruction. This enables in-line conditional execution, while avoiding the pipeline penalties associated with branch instructions.
Feature                                                 Description
Shift Right Pair of Registers: shrpd                    Concatenate and shift two 64-bit regs. into target register
Extract a field: extrd, extrw                           Select a bit-field from source reg. and place right-aligned in target register
Deposit a field into a reg.: depd, depdi, depw, depwi   Select a right-aligned field from source register or immediate, and place anywhere in the target register
Logical operations: and, andcm, or, xor                 Logical operations: and, and complement, or, exclusive or
fmac                                                    Floating-Point Multiply Accumulate instruction
Multiple FP condition bits                              Enable concurrent floating-point comparisons and tests
ldd r0; Prefetch Cache line for Read,                   Fetch data into cache before it is used, to reduce cache miss penalty (no action on TLB miss).
  ldw r0; Prefetch Cache line for Write
Cache hint: Spatial Locality                            Hint to prevent cache pollution when data has no reuse
Arithmetic Nullification                                Conditional execution of next instruction based on condition generated by current arithmetic, logical, or field instruction
Table 2: Other Supporting PA-RISC Features
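The first three features in Table 2 are plain bit manipulations, sketched below in Python. The helper names and their argument conventions (field position given as the index of the least significant bit) are assumptions made for this illustration; the real shrpd, extrd and depd instructions use PA-RISC's own operand encoding and bit numbering, which this sketch does not reproduce.

```python
MASK64 = (1 << 64) - 1


def shift_right_pair(left, right, amount):
    """Concatenate two 64-bit values, shift right, keep the rightmost 64 bits."""
    return (((left << 64) | right) >> amount) & MASK64


def extract_field(source, lsb, width):
    """Return the width-bit field starting at bit lsb, right-aligned."""
    return (source >> lsb) & ((1 << width) - 1)


def deposit_field(target, value, lsb, width):
    """Place the low width bits of value into target at bit lsb."""
    mask = ((1 << width) - 1) << lsb
    return (target & ~mask & MASK64) | ((value << lsb) & mask)


left = 0x1111_2222_3333_4444
right = 0x5555_6666_7777_8888
# Shifting the pair by 48 bits slides the window by one 16-bit subword.
print(hex(shift_right_pair(left, right, 48)))
print(hex(extract_field(0xABCD, 4, 8)))
print(hex(deposit_field(0xFFFF, 0x00, 4, 8)))
```

Shifting a register pair by a multiple of 16 bits yields a 64-bit window that starts at an arbitrary subword, which is how arbitrarily aligned packed data can be brought into register alignment.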
3. Code Examples
Table 3 shows the four multimedia kernels chosen to illustrate the programming techniques used to map multimedia algorithms to the MAX2 instructions, and to exploit modern superscalar microprocessor operation. These kernels are often performance-critical loops in multimedia applications. For each of the kernels described, an algorithm is coded both with and without the MAX2 extensions. General techniques for loop optimization are applied to both code versions, while special techniques, such as saturating arithmetic and data rearrangement, are used to optimize the code with the MAX2 multimedia instructions. The resulting code is scheduled using the superscalar scheduling rules of the PA-8000 (a 64-bit PA-RISC 2.0 processor [11]). The figures in this section show fragments of code to illustrate the coding techniques, and do not show the entire programs, which are rather lengthy.

Multimedia Kernel       Description
16 x 16 Block Match     Sum of absolute magnitude of the differences of corresponding pixels in two 16 x 16 blocks
3 x 3 Box Filter        Compute the ‘smoothed’ value of each pixel in an image using a 3x3 filter
Matrix Transpose        Transpose an 8 x 8 matrix of 16-bit values contained in the processor’s general registers
8 x 8 2-D IDCT          Perform a two-dimensional Inverse Discrete Cosine Transform [15] on an 8x8 block

Table 3: Multimedia Kernels Chosen

3.1 Block Match
In this example, the inner loop accumulates the absolute magnitude of the difference of two corresponding values from two 16x16 blocks of data. This is often used in motion estimation for MPEG-1 and MPEG-2. The example illustrates the use of saturation arithmetic for in-line conditional execution, eliminating the need for conditional branches.

The absolute value of Xij-Yij is obtained as follows: First Xij-Yij is calculated using unsigned saturation, then the operands are reversed and Yij-Xij is calculated. Without saturation arithmetic, one of these terms will be positive and the other, negative. With unsigned saturation, the smallest unsigned number representable is zero, so the negative term saturates to zero. The results for each pair of subtractions are accumulated in parallel in two registers (r4 and r5 in Figure 4a). Two separate registers are used to accumulate the results, to allow full superscalar bandwidth without dependency stalls on the accumulation. The 6 instructions in bold print in the code fragment shown in Figure 4a form the core function performed on every four pairs of elements in the two 16x16 blocks.

In the code without MAX2 instructions (Figure 4b), the arithmetic nullify feature is used instead. Xij-Yij is calculated, and if the result is less than zero, the result is subtracted from zero. This places the positive value of the difference in the result register.

In the complete code, either with or without MAX2 instructions, the inner loop is unrolled completely to accumulate 16 pairs of absolute differences, to eliminate loop counter overhead and the loop counter register. Figure 4a shows how 2 unrolled iterations may be interleaved to optimize superscalar instruction scheduling. A limited amount of software pipelining is used, where the initial loads of the loop are moved to the previous iteration to avoid load latency stalls.
        ldd,ma   8(r6),r8       ;load first Xij value outside loop
        ldd,ma   8(r7),r9       ;load first Yij value outside loop
        copy     r0,r4          ;zero accumulator 1
        copy     r0,r5          ;zero accumulator 2
                                ;loop is unrolled to minimize memory latency
loop    ldd,ma   8(r6),r10      ;load second Xij value
        ldd,ma   8(r7),r11      ;load second Yij value
        hsub,us  r8,r9,r13      ;subtract first Xij - Yij
        hsub,us  r9,r8,r14      ;subtract first Yij - Xij
        hsub,us  r10,r11,r15    ;subtract second Xij - Yij
        hsub,us  r11,r10,r16    ;subtract second Yij - Xij
        hadd     r4,r13,r4      ;accumulate first Xij - Yij
        hadd     r5,r14,r5      ;accumulate first Yij - Xij
        hadd     r4,r15,r4      ;accumulate second Xij - Yij
        hadd     r5,r16,r5      ;accumulate second Yij - Xij

Figure 4a: Block Match Code with MAX2 Instructions

        ldh,ma   2(r6),r8       ;load first Xij value outside loop
        ldh,ma   2(r7),r9       ;load first Yij value outside loop
        copy     r0,r4          ;zero accumulator
        copy     r0,r15         ;zero difference for ‘previous iteration’
        ldh,ma   2(r6),r10      ;load second Xij value outside loop
        ldh,ma   2(r7),r11      ;load second Yij value outside loop
                                ;loop is unrolled to minimize memory latency
loop    add      r15,r4,r4      ;last add absolute difference from previous iteration of loop
        sub,>=   r8,r9,r13      ;subtract first Xij - Yij values, skip next instruction if >= 0
        subi     0,r13,r13      ;subtract negative difference from zero
        ldh,ma   2(r6),r8       ;load third Xij value now to avoid stall
        ldh,ma   2(r7),r9       ;load third Yij value
        add      r4,r13,r4      ;accumulate first |Xij - Yij| value
        sub,>=   r10,r11,r15    ;subtract second Xij - Yij, skip next instruction if >= 0
        subi     0,r15,r15      ;subtract negative difference from zero
        add      r4,r15,r4      ;accumulate second |Xij - Yij| value

Figure 4b: Block Match Code without MAX2 Instructions
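The branch-free absolute difference used in the MAX2 version can be modeled in a few lines of Python. The sketch below is an assumption-laden illustration, not a translation of Figure 4a: the function names are hypothetical, pixels are processed four at a time as lists, and the final step that sums the four subwords of each accumulator is an addition of this sketch, since that step is not part of the fragment shown.

```python
def hsub_us(xs, ys):
    """Per subword: x - y with unsigned saturation, so negative results become 0."""
    return [max(x - y, 0) for x, y in zip(xs, ys)]


def hadd(xs, ys):
    """Per subword: 16-bit modulo add."""
    return [(x + y) & 0xFFFF for x, y in zip(xs, ys)]


def block_match(x_pixels, y_pixels):
    """Sum of absolute differences, four pixels per step, two accumulators."""
    acc1 = [0, 0, 0, 0]  # accumulates saturated X - Y
    acc2 = [0, 0, 0, 0]  # accumulates saturated Y - X
    for i in range(0, len(x_pixels), 4):
        x = x_pixels[i:i + 4]
        y = y_pixels[i:i + 4]
        # One of the two saturated differences is |x - y|, the other is 0.
        acc1 = hadd(acc1, hsub_us(x, y))
        acc2 = hadd(acc2, hsub_us(y, x))
    # Reduction across subwords (assumed here, not shown in the fragment).
    return sum(acc1) + sum(acc2)


x_block = [(i * 37) % 256 for i in range(256)]      # 16 x 16 block, row-major
y_block = [(i * 91 + 5) % 256 for i in range(256)]
print(block_match(x_block, y_block))
print(sum(abs(a - b) for a, b in zip(x_block, y_block)))
```

With 8-bit pixels in 16-bit subwords, each accumulator subword receives at most 64 differences of at most 255 for a 16 x 16 block, so the modulo add never wraps in this setting.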
3.2 Box Filter
In this algorithm the smoothed values of the pixels are computed for the middle 14x14 section of a 16x16 block of pixels. This example illustrates programming techniques for reducing the number of load and copy instructions, for parallel result accumulation, and for performing constant multiplications with shift and add instructions. The constant multipliers used in the 3 x 3 Box Filter are shown in Figure 5.
3x3 Box Filter
  1/4  1/2  1/4
  1/2   1   1/2
  1/4  1/2  1/4

Figure 5: 3 x 3 Box Filter Values Used

[Figure 6: Pixel Matrix, row i, with pixels x0..x5, y0..y5 and z0..z5 laid out in registers r1, r2, r3 and r4]

Figure 6: Parallel Accumulation with 2 Loads/Iteration
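Because every weight in Figure 5 is a power of two, each multiplication can be written as a right shift by 0, 1 or 2 bits. The scalar Python sketch below shows that arithmetic for the interior pixels of an image. It is an illustration under stated assumptions: the function name and the `SHIFTS` table are hypothetical, each term is shifted before it is added (so low-order bits are truncated per term), and the accumulated sum is left unscaled even though the weights add up to 4.

```python
# Right-shift amounts corresponding to the weights 1/4, 1/2 and 1 of Figure 5.
SHIFTS = [
    [2, 1, 2],
    [1, 0, 1],
    [2, 1, 2],
]


def box_filter_3x3(image):
    """Weighted 3x3 sum for every interior pixel of a 2-D list of integers."""
    rows, cols = len(image), len(image[0])
    result = [[0] * (cols - 2) for _ in range(rows - 2)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            acc = 0
            for dr in range(3):
                for dc in range(3):
                    # Shift first, then accumulate, as a shift-and-add step does.
                    acc += image[r - 1 + dr][c - 1 + dc] >> SHIFTS[dr][dc]
            result[r - 1][c - 1] = acc
    return result


block = [[8] * 16 for _ in range(16)]
smoothed = box_filter_3x3(block)
print(len(smoothed), len(smoothed[0]), smoothed[0][0])
```

For a 16 x 16 input the result is the 14 x 14 interior, matching the block size quoted in the text; a packed version would compute four such sums at once in the subwords of one register.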
Each pixel in the image requires its eight nearest neighbors and itself, in order to perform the smoothed function of the 3x3 box filter. This involves 8 multiplications and 9 adds. By moving down the columns of the image, 6 of these 9 pixel values are reused for the next smoothed pixel. Hence, only three new elements need to be loaded for each smoothed pixel. For the MAX2 version of the code, four pixels from the same row are now packed in one register, which allows the number of load instructions to be further reduced from three to two, for each set of 4 smoothed pixels. Figure 6 shows the register layout of input data to simultaneously accumulate 4 pixel results, p1, p2, p3 and p4, with two loads. In addition, register copy instructions may be reduced by unrolling the inner loop three times and reusing the same registers for subsequent iterations in a round-robin fashion. This technique is known to compiler writers as recurrent scalar replacement (section 4).

In the code with MAX2, a single Parallel Shift Right and Add instruction is used to multiply four pixels by a fractional constant of the box filter, as well as accumulate these results. In the code without MAX2, the multiplication of each pixel is done with an Extract instruction (equivalent to a right shift operation), and the accumulation must be done with a separate add instruction. This shows the power of a single Parallel Shift and Add instruction: performing four parallel multiplications and four parallel accumulations in a single cycle. Figure 7 shows a code fragment using MAX2 instructions. 8 Parallel Shift and Add, 1 Parallel Add, 6 Shift Pair and 2 Load instructions produce 4 smoothed pixel results simultaneously.

colloop ldd,ma   8(r2),r3
        ldd,ma   Rowoffset-8(r2),r4
        ldd,ma   8(r2),r5
        ldd,ma   Rowoffset-8(r2),r6
        addi     8,r12,r12
        ldi      Numrows-2,r13
rowloop ldd,ma   8(r2),r7
        ldd,ma   Rowoffset-8(r2),r8
        hshradd  r3,2,r0,r9
        shrpd    r3,r4,48,r10
        hshradd  r10,1,r9,r9
        shrpd    r3,r4,32,r10
        hshradd  r10,2,r9,r9
        hshradd  r5,1,r9,r9
        shrpd    r5,r6,48,r10
        hadd     r10,r9,r9
        shrpd    r5,r6,32,r10
        hshradd  r10,1,r9,r9
        hshradd  r7,2,r9,r9
        shrpd    r7,r8,48,r10
        hshradd  r10,1,r9,r9
        shrpd    r7,r8,32,r10
        hshradd  r10,2,r9,r9
        addib,