Optimizing the AMR Encoder for the ARM10 Vector Floating-Point Coprocessor

David R. Lutz
ARM
1250 S. Capital of Texas Highway
Austin, TX 78746
phone: +1 (512) 732-8076
david.lutz@arm.com

Chris N. Hinds
ARM
1250 S. Capital of Texas Highway
Austin, TX 78746
phone: +1 (512) 314-1055
chris.hinds@arm.com

ABSTRACT
We present a method and example for optimizing computational and I/O intensive floating-point algorithms for the ARM Vector Floating-Point (VFP) coprocessor. The VFP incorporates several novel features which are beyond the scope of available compilers to utilize, but which are capable of providing significant performance improvement. The work focuses on the Adaptive Multi-Rate (AMR) speech encoder and is targeted to an ARM1020E with a hardware VFP. We show significant performance gains by using the vector capability of the VFP.

Categories and Subject Descriptors
B.8.2 [Performance and Reliability]: Performance Analysis and Design Aids

General Terms
Algorithms, Performance

Keywords
Floating-point, AMR codec, speech processing, VFP, code optimization.

1. INTRODUCTION
The purpose of this work is to determine how well the ARM Vector Floating-Point (VFP) hardware floating-point coprocessor can perform running the Adaptive Multi-Rate (AMR) speech encoder. The AMR standard has been selected by the Third Generation Partnership Project (3GPP) for speech transmission, and it will be used in GSM and possibly other next-generation wireless networks. If the ARM VFP can be made to run the AMR codec efficiently enough, then it may be possible to use it in low-cost handsets.

Our processor target is the ARM1020E. Its VFP10 coprocessor is the first ARM implementation of the VFPv2 architecture, providing IEEE 754-compliant [1] single and double precision floating-point operations, a large, flat register file, and overlapped arithmetic and I/O. The novel part of this architecture is the presence of short vector operations, which allows us to keep both the load/store and arithmetic units busy in a single-issue processor. Our compiler does not vectorize code, so maximum speedups require coding in assembler.

The speech encoder contains over 12,000 lines of C code, so we needed some way of picking out the routines that were doing most of the work. We did this by running a call graph profiler (gprof) on the encoder for a variety of speech files and encoding rates. For any given encoding rate, there were usually only about ten subroutines that made any significant difference in the overall time. For example, when encoding at the 12.2 kbit/second rate, the top nine routines took up more than 87 percent of the processing time, but accounted for less than 12 percent of the C code.

We used a variety of optimizations in order to quickly generate good assembler code. Many of the AMR routines are load/store bound, and usually the best design was achieved by minimizing the loads and stores, and then trying to use vector operations to do the computations for free. We were somewhat surprised to find that longer vectors almost always outperform shorter vectors. While vector instructions cannot issue until all of their operands are available, careful scheduling of the short vector instructions and the data transfer operations, and a sufficiently large register file, allowed us to avoid stalls due to register availability.

2. VFP HARDWARE
The ARM VFPv2 floating-point architecture was designed to support embedded applications which can benefit from the range and precision of floating-point over that of integer or fixed-point computational models. The VFPv2 architecture specifies a register file with 32 single-precision registers overlapped with 16 double-precision registers. The register file can be subdivided to create 4 hardware circular queues which respond to short vector operations. We will talk more about short vectors later in this section.

The VFPv2 architecture specifies a basic IEEE 754 compliant instruction set including add, subtract, multiply, divide, and square root; conversion between floating-point formats and integer formats; comparison; and absolute value and negation. The VFPv2 also specifies a family of multiply-accumulate (FMAC) instructions. These instructions perform a multiplication rounded to the destination precision, then an add, resulting in a final sum which is equivalent to the multiply and add being executed independently. The result of the FMAC operations is fully IEEE 754 compliant.

[Figure 1: VFP10 FMAC and Load/Store Datapath. The pipeline stages are Issue, Decode, Execute 1, Execute 2, Execute 3, and Round and Writeback.]

Figure 1 presents the primary FMAC datapath and the load/store datapath. The FMAC datapath handles all arithmetic operations except divide and square root. Throughput for single precision operations and double precision operations not involving a multiply is 1 cycle, with a latency of 4 cycles. Double precision multiply and multiply-accumulate operations have a throughput of 1 operation per 2 cycles and a latency of 5 cycles.

Many floating-point applications, such as speech processing or graphics processing, contain functions which operate on large blocks of data in a sequential fashion. Typical of these applications are the dot product and convolution, in which data access is to locations with high spatial locality and the data is used once and discarded. If the function is carried out in a strictly linear fashion, hazards will appear between the data transfer and the arithmetic which could significantly reduce performance. In these cases unrolling loops is valuable in removing the dependencies between the data transfer operations and the arithmetic operations.

In commercial DSP chips the ability to specify a repeat condition for an instruction or sequence of instructions has significantly reduced code size and increased the potential for parallel execution of data transfer and arithmetic operations. The VFPv2 architecture incorporates this feature through the use of short vector operations. Most data processing instructions may be vectorized. The length of the vector may be up to 8 for single-precision instructions and 4 for double-precision. The vector length is specified in the LEN field of the Floating Point Status and Control Register (FPSCR). Each subsequent iteration of the vector operation increments the source and destination registers by one, then performs the operation. The following example illustrates a vector addition. Assuming a vector length of 4, the instruction

    fadds s10, s18, s26

results in the following scalar operations being issued sequentially in 4 cycles:

    fadds s10, s18, s26
    fadds s11, s19, s27
    fadds s12, s20, s28
    fadds s13, s21, s29

The vector operations are supported for DSP operations through the presence of hardware circular queues in the register file [2]. The register file is divided into 4 banks, with the first bank (designated bank 0) serving a special function for operations involving scalar operands or results. For single-precision (double-precision) operations the 32 (16) registers are divided into 4 banks of 8 (4) registers per bank. Within each bank a short vector operation will treat the bank as a circular queue. Again assuming a vector length of 4, the instruction

    fadds s15, s16, s30

results in the following scalar operations being issued sequentially in 4 cycles (the destination wraps within bank 1 and the second source wraps within bank 3):

    fadds s15, s16, s30
    fadds s8,  s17, s31
    fadds s9,  s18, s24
    fadds s10, s19, s25

Several features of the vector architecture allow for easy intermixing of scalar operations and short vector operations with a scalar. Three rules define the operation:

1. If the destination register is in bank 0, the operation is treated as a scalar operation, regardless of the value in the LEN field.

2. If the second source register is in bank 0 and the destination is not in bank 0, the vector addressed in the first source is operated on by the scalar in the second source and written to the vector addressed in the destination.

3. If neither the destination nor the second source register is in bank 0, the instruction operates on vector data addressed by the source registers to a vector destination.

A small C model of these rules appears at the end of this section.

Register conflicts are avoided through the use of a simple scoreboard. Registers involved in an operation, including all registers in a vector operation, are locked when the instruction issues and cleared when they are no longer needed or when they have been written. The scoreboard contains a single bit per register to enable fast setting and checking. All operations (vector and scalar) are stalled until all registers involved in the operation are available.

Issuing a single instruction that specifies multiple operations is beneficial from a code-size perspective, but it also provides a mechanism for parallel data transfer on a single-issue machine. The ARM1020E core and the VFP10 have separate pipelines, allowing parallel execution of arithmetic operations and data transfer operations. Good performance is often achieved by double buffering, in which one set of registers is being loaded while computations are being performed on a second set of registers. Most of the routines optimized using the short vector operations had cycle counts reflecting either the cycles required for the arithmetic or those required for the data transfer, but rarely more than the maximum of these operations. As a result, one of the activities, either the arithmetic or the data transfer, did not impact the final cycle counts.
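The following small C model is our own illustration, not part of the VFP specification: it expands one single-precision instruction into the scalar operations the hardware would issue, with registers numbered 0-31 for s0-s31 and bank(r) = r/8. (The behavior when the first source is in bank 0 is not covered by the three rules above, and the model simply steps it like any other vector operand.)

    #include <stdio.h>

    /* Our own model of the three banking rules, not VFP hardware code.
       next_in_bank() advances a register within its 8-register bank,
       wrapping circularly, e.g. s15 -> s8. */
    static int next_in_bank(int r) { return (r & ~7) | ((r + 1) & 7); }

    static void expand(const char *op, int d, int n, int m, int len)
    {
        int i;
        if (d / 8 == 0)          /* rule 1: destination in bank 0 => scalar */
            len = 1;
        for (i = 0; i < len; i++) {
            printf("%s s%d, s%d, s%d\n", op, d, n, m);
            d = next_in_bank(d);
            n = next_in_bank(n);
            if (m / 8 != 0)      /* rule 2: a bank-0 second source is a scalar */
                m = next_in_bank(m);
        }
    }

    int main(void)
    {
        expand("fadds", 10, 18, 26, 4);  /* the straight vector example */
        expand("fadds", 15, 16, 30, 4);  /* wraps within banks 1 and 3 */
        return 0;
    }

Running the model reproduces both expansions shown above, including the circular-queue wrap in the second example.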

3. OPTIMIZING FOR THE VFP
Our method for optimization is as follows:

1. profile and select optimization targets
2. unroll loops in C
3. rewrite C routines to support vectorization
4. map out the register space for the target subroutine
5. use the compiler to generate assembler, then fix up and test
6. use multiple register loads and stores
7. use vector arithmetic
8. double buffer I/O and arithmetic

The first step is applicable to the program as a whole, while the remaining steps are used on individual subroutines. Not every step is applicable to every subroutine. Most of the expensive routines in the AMR encoder are load/store bound, and so usually the critical step is to map out the register space to minimize I/O costs. Each of the subsections in this section addresses one of the optimization techniques.

3.1 Profile and Select Optimization Targets
The first step in any optimization effort is to figure out where the resources are being spent. The easy way to do this with C programs is to generate a call-graph profile. We generated this profile using gcc and gprof, and found that most of the time was spent on a handful of functions, which are listed in Table 1. For each subroutine, we list the number of lines of C code, and what percent of the execution time is attributable to that subroutine. A few of the functions are marked inlined, and no estimate is made for their execution time. These functions were moved inline by the compiler, and so they did not appear in the call graph profile. We add them to the table because they were called frequently from the other functions in the table. Their contribution to the overall execution time is already mostly accounted for by the calling routines.

The encoder as a whole contains 12,229 lines of code in .c files (plus another 16,419 lines in header files). Together, the functions in Table 1 contribute to more than 87% of the runtime for the encoder, but represent less than 12% of the C code.

Table 1: Candidate Functions for VFP Optimizations

Subroutine      Lines   % Execution Time
Autocorr           44   inlined
cl_ltp            134   6.3
comp_corr          53   6.3
cor_h             301   14.3
cor_h_x            53   inlined
Dotproduct40       29   inlined
search_10i40      426   22.2
Norm_Corr          98   9.5
Residu             72   3.2
Syn_filt           96   11.1
Vq_subvec          51   12.7
Vq_subvec_s        78   1.6
total            1435   87.2

Most of these functions are much too involved to discuss in a short paper. The simplest routine is Dotproduct40, and even it has some complications that obscure the optimizations we are discussing. To get around this, we will look at a

simplified version called dotproduct40 (abbreviated dp40 in the code samples) that performs a 40-term single-precision dot product.

    extern float dp40(float *b, float *c)
    {
        int i;
        float a;

        a = 0.0f;
        for (i = 0; i < 40; i++)
        {
            a += b[i] * c[i];
        }
        return(a);
    }

3.2 Unroll Loops in C
Our objective is to unroll enough so that vectorization becomes possible. For single-precision arithmetic, we unroll enough so that we can use vectors of length eight:

    extern float dp40(float *b, float *c)
    {
        int i;
        float a;

        a = 0.0f;
        for (i = 0; i < 40; i += 8)
        {
            a += b[i]   * c[i];
            a += b[i+1] * c[i+1];
            a += b[i+2] * c[i+2];
            a += b[i+3] * c[i+3];
            a += b[i+4] * c[i+4];
            a += b[i+5] * c[i+5];
            a += b[i+6] * c[i+6];
            a += b[i+7] * c[i+7];
        }
        return(a);
    }

3.3 Rewrite C Routines to Support Vectorization
The unrolled loop in the previous section is great for vector loads of b[] and c[], but the fmacs operation that multiplies and sums to a is not set up for good vector performance. To get good vector performance we need:

• independent results - if results are dependent, then the vector pipeline is going to be filled with bubbles. We can sometimes avoid this problem by using multiple partial results, which are then combined for the final result.

• operands loaded into registers in the proper order - this is particularly apparent for convolutions, where some of the vectors are likely to be stored backwards in memory. In this case, we usually have to choose between vector loads and vector operations.

• enough bandwidth to keep the vector units busy - if the vector units have to wait for data, performance is going to suffer. Sometimes we can help this situation by rewriting the C function to reuse values and cut down on I/O.

In the case of dotproduct40, the problem is getting independent results. We can do this by having multiple sums in the loop body, which are then combined to get the final sum. This does change the answer, but the floating-point AMR encoder is not required to be bit exact. Since the compiler often generates better code for pointers than for arrays, we also change array references to pointer references:

    extern float dp40(float *b, float *c)
    {
        int i;
        float a0, a1, a2, a3, a4, a5, a6, a7;

        a0 = a1 = a2 = a3 = 0.0f;
        a4 = a5 = a6 = a7 = 0.0f;
        for (i = 0; i < 40; i += 8)
        {
            a0 += *b++ * *c++;
            a1 += *b++ * *c++;
            a2 += *b++ * *c++;
            a3 += *b++ * *c++;
            a4 += *b++ * *c++;
            a5 += *b++ * *c++;
            a6 += *b++ * *c++;
            a7 += *b++ * *c++;
        }
        return(a0 + a1 + a2 + a3 + a4 + a5 + a6 + a7);
    }

3.4 Map out Register Space for the Target Subroutine
This can be very tricky, particularly if there are unusual vector lengths (one of the AMR routines has vectors of length eleven), or mixed precision operations. We need to be mindful about banks: bank 0 registers can imply vector, vector-scalar, or scalar operations depending on where they appear in an instruction. If you cannot find a good fit in the register space, it pays to rewrite the C code to make it fit. For example, the vectors of length 11 can be handled by adding a zero term to give them length 12, and then treating that vector as three vectors of length 4 (a C sketch of this trick appears at the end of this subsection).

For the case of dotproduct40, the register mapping is easy. The terms a[], b[], and c[] can be assigned to banks 1, 2, and 3; and the partial sums can be added in bank 0.

3.5 Use Compiler to Generate Assembler
The reason we have spent so much time on the C code is that it is much easier to let the compiler deal with pointer manipulations. This is not so important in functions like dotproduct40, but for complicated routines it saves a lot of programming time. Another advantage is that the rewritten C code is usually much faster, and in some cases we can get good enough performance without writing assembler.

For maximum performance, though, we have to use assembler. Once we have the structure we want, we compile to get working assembler, and then make changes to an assembler routine that already supports the data layout and vector lengths we have in mind.

3.6 Use Multiple Register Loads and Stores
Because our pipeline allows two singles to be loaded or stored per cycle, and because load latency is three cycles, it is very beneficial to use load multiple instructions. To continue the example from the previous section, we load eight elements of b[] into bank 2 as follows:

    fldmias r1!,{s16-s23}

This is a huge performance boost. The load/store unit is fully utilized for four cycles, and only one issue slot is needed for the instruction. Furthermore, the bus is fully utilized. A reasonable version of the assembler listing for dotproduct40 using the optimizations we have discussed so far is as follows:

    dotproduct40 proc
        flds    s8,=0.0
        mov     r3,#0
        fmovs   s9,s8
        fmovs   s10,s8
        fmovs   s11,s8
        fmovs   s12,s8
        fmovs   s13,s8
        fmovs   s14,s8
        fmovs   s15,s8
    loop
        fldmias r1!,{s16-s23}
        add     r3,r3,#8
        fldmias r2!,{s24-s31}
        cmp     r3,#0x28
        fmacs   s8,s16,s24
        fmacs   s9,s17,s25
        fmacs   s10,s18,s26
        fmacs   s11,s19,s27
        fmacs   s12,s20,s28
        fmacs   s13,s21,s29
        fmacs   s14,s22,s30
        fmacs   s15,s23,s31
        blt     loop
        fadds   s0,s8,s9
        fadds   s1,s10,s11
        fadds   s2,s12,s13
        fadds   s3,s14,s15
        fadds   s0,s0,s1
        fadds   s2,s2,s3
        fadds   s0,s0,s2
        fsts    s0,[r0,#0]
        bx      lr
        endp

3.7 Use Vector Arithmetic

At this point, we set the vector length to 8 and replace the 8 fmacs calls with one:

    fmacs s8,s16,s24

In many cases, this will offer good speedup, but there are unnecessary bubbles in dotproduct40. Vector operations cannot begin until all operands are available, so there is a long wait between the second fldmias and the vector fmacs. To solve this problem, we turn to double buffering.

3.8 Double Buffer I/O and Arithmetic
For double buffering, we break our length-8 vectors into two vectors of length 4. While one set of vectors is loading, the arithmetic unit is operating on the second set. To do this properly, we need to preload one set of vectors. This involves some changes to the loop counter (we count to 32 instead of 40), and adds some additional code before and after the loop. The advantage is that the body of the loop can go at maximum speed: both the load/store pipeline and the arithmetic pipeline are fully utilized, with no bubbles.

        flds    s0,=0.0
        fldmias r1!,{s8-s11}
        mov     r3,#0
        fmovs   s24,s0
        fldmias r2!,{s16-s19}
        fmovs   s28,s0
    loop
        fldmias r1!,{s12-s15}   ; 1
        add     r3,r3,#4        ; 2
        fldmias r2!,{s20-s23}   ; 3
        fmacs   s24,s8,s16      ; 4
        fldmias r1!,{s8-s11}    ; 5
        cmp     r3,#32          ; 6
        fldmias r2!,{s16-s19}   ; 7
        fmacs   s28,s12,s20     ; 8
        blt     loop
        fldmias r1!,{s12-s15}
        fldmias r2!,{s20-s23}
        fmacs   s24,s8,s16
        fmacs   s28,s12,s20
        fadds   s0,s24,s25
        fadds   s1,s26,s27
        fadds   s2,s28,s29
        fadds   s3,s30,s31
        fadds   s0,s0,s1
        fadds   s2,s2,s3
        fadds   s0,s0,s2
        fsts    s0,[r0,#0]

The comment field in the code above indicates the cycle in the main loop during which the accompanying instruction starts executing. Double buffering has eliminated all of the pipeline bubbles from the main loop. This code is more than twice as fast as the best compiler output.

4. RESULTS
Speech data consists of 16-bit samples taken 8000 times per second. The speech is processed in 20 millisecond frames. We count how many cycles it takes to process each frame, and then multiply the highest number (the worst frame) by 50 to get the number of processor cycles per second required to encode continuous speech.

For the ARM1020E encoding at the 12.2 kbps rate, the compiler required about 722,000 cycles to process the worst frame. This means that the processor would have to run at a speed of at least 36.1 MHz to encode continuous speech. After optimizing, we reduced the required speed to 23.4 MHz. A sketch of this arithmetic follows.
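The conversion from cycles to clock rate is simple enough to state as code. This is a minimal sketch using only the numbers quoted above (20 ms frames, so 50 frames per second of continuous speech):

    #include <stdio.h>

    /* Minimal sketch of the dimensioning arithmetic described above:
       20 ms frames => 50 frames per second of continuous speech. */
    static double required_mhz(long worst_frame_cycles)
    {
        return worst_frame_cycles * 50 / 1.0e6;  /* cycles/s -> MHz */
    }

    int main(void)
    {
        printf("%.1f MHz\n", required_mhz(722000));  /* prints 36.1 MHz */
        return 0;
    }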

Table 2 shows the speedups on a function-by-function basis (Dotproduct40 is not included because it was difficult to separate from the other functions). We used the ARMulator (an ARM instruction set simulator) to execute the code and to collect statistics before and after each function was called, and added up all of the calls for a single frame. So comp_corr, for example, consumed 85,418 cycles before optimization, and only 35,405 cycles after optimization. Speedup is given by dividing the old cycles by the new cycles.

Table 2: Speedups from VFP Optimizations

Subroutine       Old Cycles   New Cycles   Speedup
Autocorr              22085        15906      1.39
cl_ltp               132239        87420      1.51
comp_corr             85418        35405      2.41
cor_h                 51412        31854      1.61
cor_h_x               24512        15000      1.63
dblloop_10i40        130243        97211      1.34
Norm_Corr             71096        45031      1.58
Residu                19102         5811      3.29
Syn_filt              55174        30354      1.82
Vq_subvec             29152        20738      1.41
Vq_subvec_s           18359        15453      1.19
optimized total      638792       400183      1.60
total                722497       467495      1.55

Our optimizations were targeted at the 12.2 kbps rate, but they also improved the other rates. Most of the optimized routines are shared among the rates. The biggest difference is the first routine in the call graph profile, which is specific to each rate. We believe that optimizing this routine for the remaining rates would allow us to make all of the rates at least as good as the 12.2 kbps rate. Our current performance numbers for all eight rates on the ARM1020E are given in Table 3.

Table 3: Required MHz for Encoding at Various Rates

Rate (kbps)   Required MHz
12.2          23.4
10.2          24.4
7.95          24.5
7.40          23.6
6.70          24.1
5.90          19.5
5.15          15.9
4.75          21.6

5. CONCLUSION
Vector arithmetic instructions together with vector loads and stores give the VFP unusually good performance. The ARM1020E is a single-issue processor, but the vector instructions give superscalar performance, and sometimes better than superscalar performance. Theoretically, one could get the same performance out of a superscalar machine (say one that can issue a load/store, a floating-point, and an integer instruction simultaneously). The problem is that the window for picking instructions to issue is small, and in practice programmers are not going to be able to pack the instructions correctly to get peak performance. In contrast, the vector paradigm encouraged by the VFP is relatively easy to master, and we think most programmers will quickly learn how to get good performance out of the machine.

Our performance results are simulated, and we have some reason to believe that actual results would be a little better. In future work, we would like to test timing on real hardware, and perhaps see how well the optimizations perform on other ARM processors with different pipelines. We are also interested in determining whether short vector operations are useful on superscalar processors.

References
[1] ANSI/IEEE Standard No. 754-1985. IEEE Standard for Binary Floating-Point Arithmetic. The Institute of Electrical and Electronics Engineers, New York, New York, 10017, 1985.
[2] Chris N. Hinds. An enhanced floating point coprocessor for embedded signal processing and graphics applications. In 1999 Asilomar Conference on Signals, Systems, and Computers, October 1999.

ARM is a registered trademark of ARM Limited. ARM10, ARM1020E, VFP10, and VFPv2 are trademarks of ARM Limited.

ARM VFPLite Architecture and Implementation

Chris N. Hinds and David R. Lutz
ARM Austin Design Center
Category: F4, F6, F3

Contact: Chris N. Hinds
ARM Austin Design Center
1250 Capital of Texas Highway, Building 3, Suite 560
Austin, TX 78746
(512) 314-1055
chris.hinds@arm.com

Abstract
VFPLite is a small, non-pipelined implementation of ARM's VFPv3 architecture. It provides fully compliant IEEE 754 single and double precision arithmetic entirely in hardware, with performance that is surprisingly good given its modest area. VFPLite is aggressively clocked, and uses new techniques that allow many operations to terminate early. In this paper we discuss the motivation for VFPLite, examine its basic architecture and microarchitecture, and give timing and area results for the first implementation.

Introduction
Fast 3D graphics hardware has long relied on the speed and dynamic range of floating-point hardware for performance [1, 2]. In the graphics pipeline two primary tasks involve high usage of floating-point arithmetic: content creation and scene creation. In the first of these tasks the flow of the scene, the intelligence of the action, and the physics of that action are developed. In the second, the scene itself is transformed from the created scene, with the characters, visual elements, and backgrounds, to a 3D image which can be rendered to a display. Scene creation requires high-throughput single-precision computations, which are targeted by ARM's Neon floating-point unit. Content creation commonly involves both single-precision and double-precision operands, a wider range of operations, and IEEE 754 compliance [3], and this is the target application for VFPLite.

The first implementation of the ARM VFPv3 floating-point architecture [4], the VFPLite, is designed to provide hardware support for content creation. The majority of the 3D graphics computations are in the scene creation task, and VFPLite has been designed to provide sufficient performance for the scene creation task in a very small area and with low power consumption. Support for both double-precision and single-precision operands is provided in hardware. We describe the first implementation of a new set of fixed-point conversion instructions. Subnormal operands and operations which result in a subnormal are supported, as is a flush-to-zero mode. The VFPLite provides the four IEEE specified rounding modes, and default results for all exceptional conditions. The VFPLite is structured to execute a single instruction at a time under state machine control. An instruction may take as little as 2 cycles or as much as 15 cycles, depending on the instruction and the data involved. We intend to present the final area, timing, and performance on several benchmarks in the final paper.

Hardware description
The implementation is a semi-custom design with a very aggressive gates-per-cycle constraint, and we will discuss in detail the design decisions relating to this constraint, specifically the architecture of the functional blocks (adder, shifter, leading-zero anticipator) and the design of the state machine.

The design is partitioned into three blocks: the operand block, the add block, and the multiply block. The operand block, shown in Figure 1, contains the operand registers, exponent logic, swap muxing, shifter, and operand evaluation logic. Early evaluation of the exponents, particularly for equal exponents or exponents which differ by exactly 1, improves add timing. A fast 8-bit fraction comparison is done to speed up compare. In our consideration of floating-point compare operations we discovered that a very high percentage of compares do not require an evaluation of the lower bits of the fraction, but are resolved based only on the sign, exponent, and a few upper fraction bits (a bit-level sketch of this observation follows this section). Many operations do not require computation, for example those with NaNs, infinities, and zeros. The operand block performs the evaluation of the operands, and in cases which are determinable based only on these evaluations the operation will complete in a few cycles without engaging the datapath blocks, avoiding additional power consumption.

The add block, shown in Figure 2 with the multiply block, includes twin adders and performs all add and difference computations on the significands for add, subtract, multiply, and compare operations, and speculative rounding for arithmetic operations. Two's complement operations, required for integer conversions, are done in this block. A leading-one detector is included for normalization and shift count generation.

Multiplication is a frequent operation which is typically implemented in a single- or double-pass compression array. Our study of several benchmark programs demonstrates that a high percentage of multiplications do not require the full array; with careful selection of the multiplier operand only a limited number of partial products are required. Multiply cycle counts range from 3 cycles for special cases to 9-15 cycles for double-precision, with an average around 11 cycles.
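The observation that most compares resolve from the upper bits can be made concrete with a small sketch. This is our own illustration, not VFPLite code: it uses the standard property that finite IEEE 754 values, remapped as below, order as plain unsigned integers, which is why a comparator can usually decide from the sign, exponent, and a few upper fraction bits alone.

    #include <stdint.h>
    #include <string.h>

    /* Our own illustration, not VFPLite code.  Finite IEEE 754 floats,
       remapped as below, order as plain unsigned integers.  NaNs are
       excluded, and -0/+0 get distinct keys here even though real
       hardware must treat them as equal. */
    static uint32_t order_key(float f)
    {
        uint32_t b;
        memcpy(&b, &f, sizeof b);          /* raw IEEE 754 bit pattern */
        return (b & 0x80000000u)
                   ? ~b                    /* negative: flip all bits  */
                   : b | 0x80000000u;      /* positive: set sign bit   */
    }

    /* returns -1, 0, or 1 for a < b, a == b, a > b */
    static int compare(float a, float b)
    {
        uint32_t ka = order_key(a), kb = order_key(b);
        return (ka > kb) - (ka < kb);
    }

Because the keys order lexicographically from the most significant bit down, a hardware comparator that examines the sign, exponent, and top few fraction bits first resolves most cases without looking at the rest of the fraction.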

Conclusions
We present a new implementation with a focus on small area and power consumption while providing a full IEEE 754 implementation. The actual area and power figures are not available at this time but will be by the presentation. As 3D graphics becomes more important in handheld devices, where die cost and power are critical issues, we see the VFPLite as providing valuable functionality in a small area.

References
[1] David R. Lutz and Chris N. Hinds. Accelerating Floating-Point 3D Graphics for Vector Microprocessors. In 2003 Asilomar Conference on Signals, Systems, and Computers, November 2003.
[2] David H. Eberly. 3D Game Design, a Practical Approach to Real-Time Computer Graphics. Morgan Kaufmann, San Francisco, CA, 2001.
[3] ANSI/IEEE Standard No. 754-1985. IEEE Standard for Binary Floating-Point Arithmetic. The Institute of Electrical and Electronics Engineers, New York, New York, 10017.
[4] Chris N. Hinds. An Enhanced Floating Point Coprocessor for Embedded Signal Processing and Graphics Applications. In 1999 Asilomar Conference on Signals, Systems, and Computers, October 1999.

[Figure 1. Operand Block Diagram: operand registers A, B, and C feed the swap muxing and shifter on the way to AddA.]

[Figure 2. Add and Multiply Block Diagrams: three adders (AddA, AddB, AddC) with a leading-zero anticipator (LZA); the multiply block uses an 8:3 reduction network, a 3:2 adder, and a 4:2 compressor producing the MulC/MulS carry-save pair.]

Performance of the Floating-Point AMR Encoder on a Commercial Single-Issue Vector Processor

David R. Lutz
ARM
1250 S. Capital of Texas Highway
Austin, TX 78746
phone: +1 (512) 732-8076
david.lutz@arm.com

Chris N. Hinds
ARM
1250 S. Capital of Texas Highway
Austin, TX 78746
phone: +1 (512) 314-1055
chris.hinds@arm.com

1. INTRODUCTION
We present a method for optimizing high data throughput floating-point applications on a single-issue vector processor. We consider as an example the Adaptive Multi-Rate (AMR) codec floating-point reference code. The AMR codec has been selected by the Third Generation Partnership Project (3GPP) for speech transmission, and it will be used in GSM wireless networks. While the fixed-point specification is the only allowed implementation for the 3G mandatory speech service, the use of floating-point is permitted in multimedia applications, such as the 3G-324M terminal, and in packet-based applications.

Recent literature has demonstrated a renewed interest in vector processors [4, 6, 7, 8, 10]. Our target processor, the ARM1022E, provides a simple but powerful vector floating-point (VFP) coprocessor. The VFP provides IEEE 754-compliant [2] single and double precision floating-point operations, a large, flat register file, and overlapped arithmetic and I/O. The novel part of this architecture is the presence of short vector operations, which allows us to keep both the load/store and arithmetic units busy in a single-issue processor. Our compiler does not vectorize code, so maximum speedups require coding in assembler.

The speech encoder contains over 12,000 lines of C code, so we needed some way of picking out the routines that were doing most of the work. We did this by running a call graph profiler (gprof) on the encoder for a sampling of speech files and encoding rates. For any given encoding rate, there were usually only about ten subroutines that made any significant difference in the overall time. For example, when encoding at the 12.2 kbit/second rate, the top nine routines took up more than 87% of the processing time, but accounted for less than 12% of the C code.

We used a variety of optimizations in order to quickly generate good assembler code. Floating-point addition is not associative, but the order of addition is arbitrary for the AMR (i.e., based on programming convenience rather than numerical analysis), and the encoder in the specification is a non-bit-exact representation of the fixed-point encoder [1], and so we sometimes rearrange sums to get improved performance. Many of the AMR routines are load/store bound, and usually the best design was achieved by minimizing the time taken for loads and stores and then trying to use vector operations to do the computations for free. Normally we do this by double buffering: we load a subset of the registers while we are working on another subset. As we will see in this paper, we had to use other techniques for the AMR.

In a prior paper [9], we discussed optimizing the AMR codec for the ARM10 family of parts using ARM's instruction simulator, ARMulator. ARMulator is a near-cycle-accurate simulator, providing good insight into the relative merits of the potential optimizations. Our goal in that effort was to investigate the general behavior of the AMR codec on an ARM platform and how the specific optimizations of interest to the VFP would perform in relation to non-optimized code and in comparison to other optimization options. ARMulator did not allow us to investigate the memory system and data alignment, which turned out to be critical issues for getting the best performance out of the AMR.

2. VFP ARCHITECTURE
The ARM VFPv2 floating-point architecture [5, 3] was designed to support embedded applications that can benefit from the range and precision of floating-point computational models over that of integer or fixed-point models. The VFPv2 architecture specifies a basic IEEE 754 compliant instruction set including add, subtract, multiply, divide, and square root; conversion between floating-point formats and integer formats; comparison; and absolute value and negation. The VFPv2 also specifies a family of multiply-accumulate (FMAC) instructions. These instructions perform a multiplication rounded to the destination precision, then an add, resulting in a final sum that is equivalent to the multiply and add executed independently. The result of the FMAC operations is fully IEEE 754 compliant.

Figure 1 presents the primary FMAC datapath and the load/store datapath. The FMAC datapath handles all arithmetic operations except divide and square root. Throughput for single precision operations and double precision operations not involving a multiply is one cycle, with a latency of four cycles. Double precision multiply and multiply-accumulate operations have a throughput of one operation per two cycles and a latency of five cycles.

The VFPv2 architecture specifies a register file with 32 single-precision registers (s0-s31) overlapped with 16 double-precision registers (d0-d15). The 32 single-precision registers are divided into four eight-register banks. Scalars and vectors of length two to eight operate within the banks. The following diagram shows the disposition of the registers within the banks, and is useful as a tool for mapping functions onto the VFP.

bank 0   bank 1   bank 2   bank 3
s0       s8       s16      s24
s1       s9       s17      s25
s2       s10      s18      s26
s3       s11      s19      s27
s4       s12      s20      s28
s5       s13      s21      s29
s6       s14      s22      s30
s7       s15      s23      s31

The double-precision registers are similarly divided into four banks, but this feature will not be used in this paper. The VFP is also a vector processor. Most data processing instructions may be vectorized. The length of the vector may be up to eight for single-precision instructions and four for double-precision. The vector length is specified in a control register. Each subsequent iteration of the vector operation increments the source and destination registers by one, then performs the operation. The following example illustrates a vector addition. Assuming a vector length of four, the floating-point add instruction

    fadds s10, s18, s26

results in the following scalar operations being issued sequentially in 4 cycles:

    fadds s10, s18, s26
    fadds s11, s19, s27
    fadds s12, s20, s28
    fadds s13, s21, s29

Several features of the vector architecture allow for easy intermixing of scalar operations and short vector operations with a scalar. Three rules define the operation:

scalar - If the destination register is in bank 0, the operation is treated as a scalar operation, regardless of the vector length specified in the control register.

vector-scalar - If the second source register is in bank 0 and the destination is not in bank 0, the vector addressed in the first source is operated on by the scalar in the second source and written to the vector addressed in the destination.

vector - If neither the destination nor the second source register is in bank 0, the instruction operates on vector data addressed by the source registers to a vector destination.

Vectors provide for easier programming and smaller code size, but their biggest benefit is increased performance. The ARM1022E core and the VFP10 have separate pipelines, allowing parallel execution of arithmetic operations and data transfer operations. The best performance is often achieved by double buffering, in which one set of registers is being loaded while computations are being performed on a second set of registers. Our goal is to have cycle counts reflecting either the cycles required for the arithmetic or the cycles required for the data transfer, but not more than the maximum of these operations.

3. HARDWARE AND TIMING
The ARM1022E is a second generation implementation of the ARM v5TE architecture. It incorporates a Harvard architecture with 16 KB instruction and data caches and includes full memory management hardware.

[Figure 1: VFP10 FMAC and Load/Store Datapath]

We ran our tests on an ARM Integrator AP board with a CM/10220E Core Module daughter card. Both caches were enabled, the data cache was set to writeback mode, and the processor and bus interface to SRAM were clocked at 24 MHz. On the Core Module card was an FPGA that contained several counters, including an asynchronous free-running 24 MHz counter and a bus counter. By running the bus and processor at 24 MHz (the processor itself is capable of operating at over 400 MHz), we were able to use these counters to provide accurate cycle counts. The asynchronous relationship with the processor clock caused an uncertainty in the actual cycle count of only 3 cycles. We measured the time required for a function by reading the bus timer (bus_cnt), reading the free-running timer (ref_cnt), calling the function, reading ref_cnt, reading bus_cnt, and computing the differences. To get rid of the timing overhead introduced by these reads, we subtract 15 cycles from the ref_cnt difference. Small functions were measured five times on the same data to make sure that we were getting cached timing. While this provides the most optimistic performance, we expect that in most deeply embedded applications, such as wireless phones and PDAs, the speech codec and speech data will be in very fast memory (such as a tightly coupled RAM) and the timing will be on the same order as we recorded in this configuration.

Large functions, such as the encoder itself, do not fit in cache and so they were only called once for each frame. Performance on systems with a larger cache or a tightly-coupled memory would be considerably faster than what is presented here.

The Integrator AP board was run without an operating system, using the SemiHosting capabilities of the MultiICE debugging environment to access a host processor for program code and data, and display of results. We were careful to avoid any semihosting activity during measurements, so there was no semihosting impact on the cycle count data. The measurement recipe itself is sketched below.
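This is a hedged sketch of the measurement recipe described above, not code from the test harness: read_bus_cnt() and read_ref_cnt() are hypothetical stand-ins for reads of the FPGA bus counter and the free-running 24 MHz counter, and the 15-cycle constant is the read overhead quoted in the text.

    /* Hedged sketch of the measurement recipe described above.
       read_bus_cnt() and read_ref_cnt() are hypothetical stand-ins
       for reads of the FPGA counters on the Core Module card. */
    extern unsigned read_bus_cnt(void);
    extern unsigned read_ref_cnt(void);

    #define READ_OVERHEAD 15  /* cycles added by the counter reads */

    unsigned time_function(void (*fn)(void), unsigned *bus_cycles)
    {
        unsigned bus0 = read_bus_cnt();
        unsigned ref0 = read_ref_cnt();
        fn();
        unsigned ref1 = read_ref_cnt();
        unsigned bus1 = read_bus_cnt();

        *bus_cycles = bus1 - bus0;
        return (ref1 - ref0) - READ_OVERHEAD;
    }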

4. OPTIMIZING FOR THE VFP
Our method for optimization is as follows:

1. profile and select optimization targets
2. unroll loops in C
3. rewrite C routines to support vectorization
4. map out the register space for the target subroutine
5. use the compiler to generate assembler, then fix up and test
6. use vector loads and stores
7. use vector arithmetic
8. double buffer I/O and arithmetic

The first step is applicable to the program as a whole, while the remaining steps are used on individual subroutines. Not every step is applicable to every subroutine. Most of the expensive routines in the AMR encoder are load/store bound, and so usually the critical step is to map out the register space to minimize I/O costs. Each of the subsections in this section addresses one of the optimization techniques.

4.1 Profile and Select Optimization Targets
The first step in any optimization effort is to figure out where the resources are being spent. The easy way to do this with C programs is to generate a call graph profile. We generated this profile using gcc and gprof and found that most of the time was spent on a handful of functions, which are listed in Table 1. For each subroutine, we list the number of lines of C code, and what percent of the execution time is attributable to that subroutine. A few of the functions are marked inlined, and no estimate is made for their execution time. These functions were moved inline by the compiler, and so they did not appear in the call graph profile. We add them to the table because they were called frequently from the other functions in the table. Their contribution to the overall execution time is already mostly accounted for by the calling routines.

The encoder as a whole contains 12,229 lines of code in .c files, plus another 16,419 lines in header files. Together, the functions in Table 1 contribute to more than 87% of the runtime for the encoder, but represent less than 12% of the code in the .c files.

Table 1: Candidate Functions for VFP Optimizations

Subroutine      Lines   % Execution Time
Autocorr           44   inlined
cl_ltp            134   6.3
comp_corr          53   6.3
cor_h             301   14.3
cor_h_x            53   inlined
Dotproduct40       29   inlined
search_10i40      426   22.2
Norm_Corr          98   9.5
Residu             72   3.2
Syn_filt           96   11.1
Vq_subvec          51   12.7
Vq_subvec_s        78   1.6
total            1435   87.2

Most of these functions are much too involved to discuss in a short paper. The simplest routine is Dotproduct40, and even it has a complication (accumulation of some partial results in double-precision) that obscures the optimizations we are discussing. To get around this, we will look at a simplified version called dotproduct40 (abbreviated dp40 in the code samples) that performs a 40-term single-precision dot product. This code for dotproduct40 is used as is in some of the correlation subroutines, and is slightly modified for use in Dotproduct40. The routines utilizing Dotproduct40 operate on independent data sets for both the b and c vectors, and the result is not written to an element in either vector, removing the potential for aliasing of the data pointers.

    extern float dp40(float *b, float *c)
    {
        int i;
        float a;

        a = 0.0f;
        for (i = 0; i < 40; i++)
        {
            a += b[i] * c[i];
        }
        return(a);
    }

For aligned data, the base code takes 366 cycles.

4.2 Unroll Loops in C
Our objective is to unroll enough so that vectorization becomes possible. For single-precision arithmetic, we unroll so that we can use vectors of length eight.

    extern float dp40(float *b, float *c)
    {
        int i;
        float a;

        a = 0.0f;
        for (i = 0; i < 40; i += 8)
        {
            a += b[i]   * c[i];
            a += b[i+1] * c[i+1];
            a += b[i+2] * c[i+2];
            a += b[i+3] * c[i+3];
            a += b[i+4] * c[i+4];
            a += b[i+5] * c[i+5];
            a += b[i+6] * c[i+6];
            a += b[i+7] * c[i+7];
        }
        return(a);
    }

Unrolling the loop exposes some parallelism and gives better branch prediction, resulting in a speedup of 1.69. For aligned data, the unrolled code takes 216 cycles.

4.3 Rewrite C Code to Support Vectorization
The unrolled loop in the previous section is great for vector loads of b[] and c[], but the fmacs operation that multiplies and sums to a is not set up for good vector performance. To get good vector performance we need:

• independent results - if results are dependent, then the vector pipeline is going to be filled with bubbles. We can sometimes avoid this problem by using multiple partial results, which are then combined for the final result.

• operands loaded into registers in the proper order - this is particularly apparent for convolutions, where some of the vectors are likely to be stored backwards in memory. In this case, we usually have to choose between vector loads and vector operations.

• enough bandwidth to keep the vector units busy - if the vector units have to wait for data, performance suffers. Sometimes we can help this situation by rewriting the C function to reuse values and cut down on I/O.

In the case of dp40, the problem is getting independent results. We can do this by having multiple sums in the loop body, which are then combined to get the final sum. Floating-point addition is not associative, so reordering the additions sometimes changes the sums (a small demonstration follows this subsection). We contend that this is acceptable, first because the input data has not been arranged in any particular numeric order (so one order is no more correct than any other), and second because the resulting changes are small and the floating-point AMR is not required to be bit-exact. For the encoding we analyzed, only 0.14% of the output bits were different for the 23 test sequences provided with the AMR, with a worst-case difference of 1.39% on one test. This is insignificant compared to the average 8.84% difference between the 3GPP floating-point version and the 3GPP fixed-point version.
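Non-associativity can be seen with a two-line experiment. This is our own demonstration, not AMR code; it picks values where the rounding difference is visible in single precision:

    #include <stdio.h>

    /* Our own demonstration, not AMR code: single-precision addition is
       not associative, so reordering a sum can change the rounded result. */
    int main(void)
    {
        float big = 1.0e8f, small = 3.0f;

        float left  = (big + small) + small;  /* each small rounds away */
        float right = big + (small + small);  /* smalls combine first   */

        printf("%.1f vs %.1f\n", left, right); /* 100000000.0 vs 100000008.0 */
        return 0;
    }

At 1.0e8 the spacing between adjacent single-precision values is 8, so adding 3 at a time rounds back down, while adding 6 at once rounds up: the same terms, summed in a different order, give results that differ by one unit in the last place.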

4.4 Map out Register Space
This can be very tricky, particularly if there are unusual vector lengths or mixed precision operations. We need to be mindful about banks: bank 0 registers can imply vector, vector-scalar, or scalar operations depending on where they appear in an instruction. If you cannot find a good fit in the register space, it pays to rewrite the C code to make it fit.

For the case of dp40, the register mapping is easy. The terms a[], b[], and c[] can be assigned to banks 1, 2, and 3. Bank 0 can be used for loading the next iteration of b[] or c[], and can also be used to add the partial sums at the end of the computation. The initial register mapping is as follows:

s8  = a0    s16 = b0    s24 = c0
s9  = a1    s17 = b1    s25 = c1
s10 = a2    s18 = b2    s26 = c2
s11 = a3    s19 = b3    s27 = c3
s12 = a4    s20 = b4    s28 = c4
s13 = a5    s21 = b5    s29 = c5
s14 = a6    s22 = b6    s30 = c6
s15 = a7    s23 = b7    s31 = c7

4.5 Use Compiler to Generate Assembler
The reason we have spent so much time on the C code is that it is much easier to let the compiler deal with pointer manipulations. This is not so important in functions like the dot product, but for complicated routines it saves a lot of programming time. Another advantage is that the rewritten C code is usually much faster, and in some cases we can get good enough performance without writing assembly code.

For maximum performance, though, we have to use assembly code. Once we have the structure we want, we compile to get working assembly code, and then make changes to an assembly code routine that already supports the data layout and vector lengths we have in mind.

4.6 Use Vector Loads and Stores
Because our pipeline allows two single-precision operands to be loaded or stored per cycle, and because load latency is three cycles, it is very beneficial to use load multiple instructions. For example, we load eight elements of b[] into bank 2 as follows:

    fldmias r1!,{s16-s23}

The fldmias instruction loads multiple values beginning at the address in register r1. It also updates the address register. This is a huge performance boost. The load/store unit is fully utilized for four cycles, and only one issue slot is needed for the instruction. Furthermore, the bus is fully utilized. A reasonable version of the assembler listing for dp40 using the optimizations we have discussed so far is as follows:

    dp40 proc
        flds    s8,=0.0
        mov     r3,#0
        fmovs   s9,s8
        fmovs   s10,s8
        fmovs   s11,s8
        fmovs   s12,s8
        fmovs   s13,s8
        fmovs   s14,s8
        fmovs   s15,s8
    loop
        ; load next 8 values of b[] and c[]
        fldmias r1!,{s16-s23}
        add     r3,r3,#8
        fldmias r2!,{s24-s31}
        cmp     r3,#40
        ; compute a[] += b[] * c[]
        fmacs   s8,s16,s24
        fmacs   s9,s17,s25
        fmacs   s10,s18,s26
        fmacs   s11,s19,s27
        fmacs   s12,s20,s28
        fmacs   s13,s21,s29
        fmacs   s14,s22,s30
        fmacs   s15,s23,s31
        blt     loop
        ; combine partial sums
        fadds   s0,s8,s9
        fadds   s1,s10,s11
        fadds   s2,s12,s13
        fadds   s3,s14,s15
        fadds   s0,s0,s1
        fadds   s2,s2,s3
        fadds   s0,s0,s2
        fsts    s0,[r0,#0]
        bx      lr
        endp

For aligned data, this change results in a time of 102 cycles, a significant speedup over the compiled versions.

4.7 Use Vector Arithmetic
At this point, we set the vector length to eight, and replace the eight multiply-accumulate (fmacs) calls with one:

    fmacs s8,s16,s24

In many cases, this offers good speedup, but here there are unnecessary bubbles. Vector operations cannot begin until all operands are available, so there is a long wait between the second fldmias and the vector fmacs. We can ameliorate this problem by using bank 0 to preload data. For aligned data, this code takes 75 cycles, a significant improvement over nonvector code. Unfortunately, there are still load stalls because there are not enough registers to completely utilize vectors of length eight. For vectors that are this long, 24 registers are involved in the computation, and we need to preload 16 registers for the next computation. Since we only have 32 registers, the second load has to wait. A potential way around this problem is to double buffer with smaller vectors.

4.8 Double Buffer I/O and Arithmetic
For double buffering, we break our eight-iteration vectors into two four-iteration vectors. While one set of vectors is loading, the arithmetic unit is operating on the second set. To do this properly, we need to preload one set of vectors. This involves changing the loop counter to count to 32 instead of 40, and adds some additional code before and after the loop. The advantage is that the body of the loop can go at maximum speed: both the load/store pipeline and the arithmetic pipeline are fully utilized with no bubbles.

In the code below it should be noted that the fmovs operations, like the fmacs, operate on short vectors of length four. So the instruction

    fmovs s24,s0

loads the four registers beginning with s24 (s24, s25, s26, and s27) with the contents of s0. As previously mentioned in Section 2, the use of bank 0 registers affords scalar or vector-scalar operations. This use of bank 0 registers is seen in the final summation operations after the loop, where each of the fadds operations executes only on the registers explicit in the instruction.

    dp40 proc
        ; requires vector length = 4:
        ; 4 arithmetic operations are initiated
        ; with each floating-point instruction
        flds    s0,=0.0
        fldmias r1!,{s8-s11}
        mov     r3,#0
        fmovs   s24,s0
        fldmias r2!,{s16-s19}
        fmovs   s28,s0
    loop
        fldmias r1!,{s12-s15}
        add     r3,r3,#4
        fldmias r2!,{s20-s23}
        fmacs   s24,s8,s16
        fldmias r1!,{s8-s11}
        cmp     r3,#32
        fldmias r2!,{s16-s19}
        fmacs   s28,s12,s20
        blt     loop
        fldmias r1!,{s12-s15}
        fldmias r2!,{s20-s23}
        fmacs   s24,s8,s16
        fmacs   s28,s12,s20
        ; remaining code is scalar
        fadds   s0,s24,s25
        fadds   s1,s26,s27
        fadds   s2,s28,s29
        fadds   s3,s30,s31
        fadds   s0,s0,s1
        fadds   s2,s2,s3
        fadds   s0,s0,s2
        fsts    s0,[r0,#0]

For aligned data, this code takes 72 cycles.

5. RESULTS
In our simulated results, we concluded that the best vectorization strategy was to double buffer whenever possible, but this is not so clear on real hardware. Consider the times for computing 200-term dot products in Table 2 (we use the longer dot products to show asymptotic behavior). The columns are arranged by alignment:

align 8,8 - both inputs are 8-byte aligned
align 8,4 - one input is 8-byte aligned
align 4,4 - both inputs are 4-byte aligned

Table 2: Dot Product 200 Results

Version               align 8,8   align 8,4   align 4,4
Initial .c                 1806        1806        1806
Unrolled .c                1059        1056        1059
Best Non-Vector .s          405         480         429
Vector Length 8 .s          300         321         330
Double Buffered .s          234         426         327

For 8-byte aligned data the double buffered result is clearly superior, but if some or all of the reads begin on four-byte boundaries, the longer vectors work better. The reason for the improvement on longer vectors is that for unaligned data (four-byte boundaries), there is a delay of one extra cycle before the data appears. For an unaligned load of four single-precision operands, the data is returned as zero operands, two operands, two operands. For an unaligned load of eight single-precision operands, the data is returned as zero, two, two, two, and two operands, which is closer to the performance of an eight-byte aligned load. The data in the AMR encoder cannot be aligned, because multiple passes are made through the data at various offsets. In practice, we found that using vectors of length 8 was faster than using vectors of length 4. The overall and per-function speedups are shown in Table 3.

Table 3: Speedups from VFP Optimizations

Subroutine      Speedup
Autocorr        1.39
cl_ltp          1.51
comp_corr       2.41
cor_h           1.61
cor_h_x         1.63
dblloop_10i40   1.34
dotproduct40    5.70
Norm_Corr       1.58
Residu          3.29
Syn_filt        1.82
Vq_subvec       1.41
Vq_subvec_s     1.19
total           1.54

The speedups are taken from actual AMR runs, with the exception of dotproduct40, which was difficult to isolate because it is called from so many places. The overall speedup of 1.54 is consistent with our simulated results [9].

To compute absolute performance numbers, we have to look at speech data. For telephone-grade speech, speech data consists of 16-bit samples taken 8000 times per second. The speech is processed in 20 millisecond frames. We count how many cycles it takes to process each frame, and then multiply the highest number (the worst frame) by 50 to get the number of processor cycles per second required to encode continuous speech.

For the ARM1022E encoding at the 12.2 kbps rate, the compiler required about 1,052,100 cycles to process the worst frame. This means that the processor would have to run at a speed of at least 52.6 MHz to encode continuous speech. After optimizing, we reduced the required speed to 34.3 MHz. Since the processor itself can run at over 400 MHz, the encoding can easily be shared with other operations.

Our optimizations were targeted at the 12.2 kbps rate, but they also improved the other rates. Most of the optimized routines are shared among the rates. The biggest difference is the first routine in the call graph profile, which is specific to each rate. We believe that optimizing this routine for the remaining rates would allow us to make all of the rates at least as good as the 12.2 kbps rate. Our current performance numbers for all eight rates on the ARM1022E are given in Table 4.

Table 4: Required MHz for Encoding at Various Rates

Rate (kbps)   compiled (MHz)   optimized (MHz)   speedup
12.2          52.6             34.3              1.54
10.2          51.8             35.7              1.45
7.95          54.0             36.6              1.48
7.40          51.7             35.7              1.45
6.70          51.1             34.8              1.47
5.90          45.2             29.1              1.55
5.15          39.0             23.4              1.67
4.75          48.4             31.3              1.55

6. CONCLUSION
Vector arithmetic instructions together with vector loads and stores give the VFP unusually good performance. The ARM1022E is a single-issue processor, but the vector instructions give superscalar performance, and sometimes better than superscalar performance. Theoretically, one could get the same performance out of a superscalar machine that can issue a load/store, a floating-point, and an integer instruction simultaneously. The problem is that the window for picking instructions to issue is small, and in practice programmers are not going to be able to arrange the instructions correctly to get peak performance. Looking at compiler output for several different superscalar machines leads us to believe that compilers are also a long way from achieving peak performance. In contrast, the vector paradigm encouraged by the VFP is relatively easy to master, and we think programmers can quickly learn how to get good performance out of the machine.

In a cell phone implementation, we expect the encoder and data to be placed in a tightly coupled memory, and the speed to be much faster. Our memory accesses cost 15 cycles, and our relatively small 16 KB caches made for a lot of expensive memory traffic.

In future work, we would like to see how well the optimizations perform on other ARM processors with different pipelines. We would also like to compare vector solutions with superscalar solutions. We are particularly interested in seeing whether the smaller code size and easy programmability offered by vectors would be useful in a superscalar setting.

References

[1] 3GPP. ANSI-C code for the floating-point AMR speech codec (Release 5). 3GPP TS 26.104 V5.0.0, ftp://ftp.3gpp.org/specs/latest, 2002.
[2] ANSI/IEEE Standard No. 754-1985. IEEE Standard for Binary Floating-Point Arithmetic. The Institute of Electrical and Electronics Engineers, New York, New York, 10017, 1985.
[3] ARM. VFP10 Vector Floating-Point Coprocessor Technical Reference Manual, 2001.
[4] K. Asanovic. Vector Microprocessors. PhD thesis, University of California at Berkeley, 1998.
[5] Chris N. Hinds. An enhanced floating point coprocessor for embedded signal processing and graphics applications. In 1999 Asilomar Conference on Signals, Systems, and Computers, October 1999.
[6] C. Kozyrakis and D. Patterson. Vector vs. superscalar and VLIW architectures for embedded multimedia benchmarks. In 35th International Symposium on Microarchitecture, November 2002.
[7] C. G. Lee and D. J. DeVries. Initial results on the performance and cost of vector microprocessors. In 30th International Symposium on Microarchitecture, pages 171-182, December 1997.
[8] C. G. Lee and M. G. Stoodley. Simple vector microprocessors for multimedia applications. In 31st International Symposium on Microarchitecture, December 1998.

[9] David R. Lutz and Chris N. Hinds. Optimizing the AMR encoder for the ARM10 vector floating-point coprocessor. In 2003 International Signal Processing Conference, April 2003.
[10] M. G. Stoodley and C. G. Lee. Vector microprocessors for desktop computing. Unpublished monograph, June 1999.

ARM is a registered trademark of ARM Limited. ARM10, ARM1022E, ARMulator, VFP10, and VFPv2 are trademarks of ARM Limited.

Accelerating Floating-Point 3D Graphics for Vector Microprocessors

David R. Lutz, ARM, Inc., 1250 S. Capital of Texas Highway, Austin, TX 78746, [email protected]
Chris N. Hinds, ARM, Inc., 1250 S. Capital of Texas Highway, Austin, TX 78746, [email protected]

Abstract— We present performance results and discuss optimization techniques for floating-point 3D graphics primitives on the ARM vector floating-point (VFP) coprocessor. Much of the work in 3D graphics is performed by a surprisingly small number of primitives. Handheld devices cannot always afford the power and cost of dedicated graphics processors, so there is considerable interest in performing these primitives on the handheld's own processor. In this paper, we examine the 3D graphics pipeline, break out the important primitives, and then show how to vectorize and optimize these primitives on the VFP.

I. INTRODUCTION

There is a large and growing interest in using 3D graphics on small portable devices such as cellphones and PDAs. These devices are too small to include the specialized graphics cards used in PCs and game machines, but the requirements are modest enough that much of the computation can be performed on a suitably powerful floating-point unit. In this paper, we examine the major primitives used for 3D floating-point graphics, and show optimized solutions for a small vector coprocessor, the ARM VFP. While SIMD extensions appear to have won the day in multimedia support, a good case can be made that vector architectures are even better suited for these applications [1], [4], [6], [7], [10]. We consider the interest in vector architectures for multimedia and demonstrate that 3D graphics problems are efficiently computed on the VFP.

The graphics pipeline is usually divided into two parts: (1) the geometry pipeline, which produces triangles, maps them to the appropriate locations in the 3D world, determines which triangles are in the viewable portion of the world, and computes the lighting at the vertices, and (2) the rendering pipeline, which consumes triangles and produces output on a 2D display. The trend in graphics is toward more and more single-precision floating-point, but not IEEE-compliant floating-point. The most advanced graphics coprocessors are now doing some of the rendering with floating-point, but for the next few generations of handheld devices, floating-point will probably be confined to the geometry pipeline.

The basic geometry pipeline [2], [3], [5] consists of
1) modeling transformation
2) trivial accept/reject classification (culling)
3) lighting
4) view transformation
5) clipping
6) division by w

In section IV of the paper, we consider the equations used by each stage of the pipeline, and then compare timing results under various optimizations. The low-level functions most used in the pipeline are three- and four-term dot products, three-term cross products, comparisons, and inverse and inverse square root computations.

II. VECTOR ARCHITECTURE

The VFP has three very different implementations, but these implementations share a common architecture, which we abstract in the following reference architecture. The processor is single-issue, which is appropriate for small, low-power devices. On any cycle, an integer instruction, a load/store instruction, or a floating-point instruction can issue. The floating-point instructions use a flat register file with 32 single-precision registers. Load and store multiple instructions can load or store any number of consecutive registers. All required IEEE 754 instructions are supported, and the processor can run in either IEEE-compliant mode or a faster, nearly compliant mode called RunFast (RunFast mode flushes subnormal inputs and outputs to zero, and disables user-level traps). Besides the basic floating-point operations, the instruction set contains a chained multiply-accumulate.

If we stopped here, we would have a good solid floating-point unit, but not a very busy unit because of the single-issue restriction. We get around this restriction with a surprising solution: vectors. The registers can be addressed individually in scalar mode, or in groups of up to 8 registers in vector or vector-scalar mode. Vector operations perform the same operation on corresponding elements of two vectors, while vector-scalar operations perform a scalar operation on each element of a vector. For example, a vector-scalar multiply would multiply each element of a vector by a scalar. The register file is split into four 8-register banks. The following diagram shows the disposition of the registers within the banks, and will be useful as a tool for mapping graphics functions onto the reference architecture.

    bank 0:  s0   s1   s2   s3   s4   s5   s6   s7
    bank 1:  s8   s9   s10  s11  s12  s13  s14  s15
    bank 2:  s16  s17  s18  s19  s20  s21  s22  s23
    bank 3:  s24  s25  s26  s27  s28  s29  s30  s31

When the VFP is in vector mode, instruction execution is governed by a 3-bit length register. This register specifies the number of operations to be executed, with the registers in the instruction incremented on each iteration. The register file banks function as hardware circular queues: a vector operation that steps past the last register in a bank wraps around to the first register in that bank. The following example illustrates a vector addition. Assuming a vector length of four, the floating-point add instruction

    fadds s10, s22, s26

results in the following scalar operations being issued sequentially in 4 cycles:

    fadds s10, s22, s26
    fadds s11, s23, s27
    fadds s12, s16, s28
    fadds s13, s17, s29

On a single-issue processor, vectors give us three advantages: (1) performance, (2) code size, and (3) easy programmability. Performance is greatly improved because a single issue slot can generate an entire vector of operations, which generally execute one per cycle. Loads and stores, pointer updates, control overhead, and other integer arithmetic can all be performed without slowing down the vector machine. Code size is greatly improved when a single vector instruction of length n replaces n scalar instructions. Finally, many graphics problems are naturally short vector problems, and vectors of length 3 and 4 are actually easier to use than scalars for these problems.

The reference architecture has been implemented on three different microarchitectures. These microarchitectures are designed as coprocessors and interface to their respective ARM core on a common coprocessor bus (VFP9 and VFP10) or a dedicated bus (VFP11). The ARM core handles all instruction and data memory accesses, and processes arithmetic exceptions generated by the VFP. The characteristics of the microarchitectures are summarized in Table I. We define non-arithmetic instructions (non-arith) as those instructions which only manipulate the sign bit, but do not modify the fraction. The VFP family treats absolute value and negate as non-arithmetic operations in accordance with the IEEE 754 specification. Compare operations are also considered non-arithmetic and are included in the non-arith rows of Table I. Throughput for all single-precision operations, and all double-precision operations except those involving a multiply or a multiply-accumulate, is 1 operation per cycle. For double-precision multiply or multiply-accumulate operations the throughput is 1 operation per 2 cycles.

When designing vector code, performance is often limited by register availability. We need to design our code so that registers can be used as soon as possible. There are three sorts of delay that affect register reuse:
1) use-use: how soon after a vector has been used as the destination for a computation can we use that value in further computations?
2) load-use: how soon after we initiate a load can we use the loaded value?
3) use-load: how soon after we initiate a vector operation can we load a new value into that register?
Table II gives these values for VFP9, VFP10, and VFP11 for each of the possible vector lengths.

TABLE II: RunFast Delays by Microarchitecture and Vector Length

              VFP9                      VFP10                     VFP11
    length  use-use load-use use-load  use-use load-use use-load  use-use load-use use-load
    1        4       3        2         4       3        1         8       4        1
    2        5       4        2         5       3        1         9       4        1
    3        6       5        2         6       4        1        10       5        1
    4        7       6        2         7       4        1        11       5        1
    5        8       7        2         8       5        2        12       6        3
    6        9       8        3         9       5        3        13       6        4
    7       10       9        4        10       6        4        14       7        5
    8       11      10        5        11       6        5        15       7        6
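Referring back to the fadds example above, the following small C program models the bank-wrapping register stepping of a vector instruction. This is our own illustration rather than ARM code; only the 8-register bank size and the example register numbers come from the text.

    #include <stdio.h>

    /* Model of VFP short-vector register stepping: each iteration
     * increments the register number, wrapping within its 8-register
     * bank, so each bank behaves as a hardware circular queue.     */
    static int next_reg(int reg)
    {
        int bank = reg / 8;               /* banks are s0-s7, s8-s15, ... */
        return bank * 8 + (reg + 1) % 8;  /* wrap within the bank */
    }

    int main(void)
    {
        int d = 10, a = 22, b = 26;       /* fadds s10, s22, s26 */
        for (int i = 0; i < 4; i++) {     /* vector length 4 */
            printf("fadds s%d, s%d, s%d\n", d, a, b);
            d = next_reg(d);
            a = next_reg(a);
            b = next_reg(b);
        }
        return 0;
    }

Running this prints the four scalar operations listed above, including the wrap from s23 back to s16 in bank 2.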

It's important to note that for vectors of length 4, we need at least 7 cycles between dependent operations (11 cycles for VFP11) to keep the pipeline fully utilized. For example, suppose that we want to compute ax + by. We first compute ax (one multiplication), and then add by (a multiply-add). For vectors of length 4, the multiply-add must wait 7 cycles before the value ax can be updated, because the use-use penalty is 7. If we want to keep the pipeline full, we need to have two independent vector operations of length 4.

III. GEOMETRY PIPELINE

A. Modeling Transformation

The modeling transformation converts each object from local coordinates to world coordinates. An object is something that you can see on the screen, say a chair or a door or a tree. There might be 5-10 objects on the screen of a small handheld device like a cell phone. It's usually easiest to create an object in its own coordinate system, but then it has to be moved, scaled, and rotated into its proper position in the world. The object is usually constructed of small triangles, each of which may contain between 10 and 30 pixels (screen dots). An object is commonly represented by hundreds or thousands of triangles. The triangles are represented by their vertices, and each vertex consists of 4 floating-point values, (x, y, z, w). The x, y, and z values are the location in 3D space, while w makes the vertex a homogeneous point that simplifies perspective transformations. All of the modeling transformations can be reduced to one 4x4 matrix per object. Applying the transformation to a set of vertices means multiplying each vertex by the matrix to get a new vertex. Constructing the matrix is straightforward, so most of the computation is multiplying hundreds or thousands of vertices by each matrix.

TABLE I: VFP Microarchitecture Differences

                                   VFP9             VFP10         VFP11
    instruction issue rate         1 per 2 cycles   1 per cycle   1 per cycle
    load/store width               32 bits          64 bits       64 bits
    blocking loads?                yes              no            no
    pipe depth (stages)            4                4             8
    SP div/sqrt (cycles)           17               17            19
    SP arith/non-arith (cycles)    4/4              4/4           8/4
    DP div/sqrt (cycles)           28/31            28/31         29/33
    DP mul or mac (cycles)         5                5             9
    DP arith/non-arith (cycles)    4/4              4/4           8/4

B. Trivial Accept/Reject Classification, or Culling

In an interactive 3D game, the camera or eye moves around the 3D world and only sees some subset of the world. Culling removes those objects and triangles that are not in our view. Specifically, the camera defines a view frustum, a sort of lopsided box defined by six planes. Objects outside of these planes can be ignored in the current frame. Each plane is represented by 4 floating-point numbers, and the vertices are tested against the plane by performing a 4-term dot product. The sign of the dot product determines whether the vertex is inside or outside the plane (see the C sketch following subsection D). Our third design target is to be able to test thousands of vertices against six planes by performing thousands of 4-term dot products. Ideally, we would like to have enough registers to hold the six planes and enough vertices to keep the floating-point pipeline full.

A second type of culling, back face culling, typically removes about half of the triangles that are in the view frustum. This type of culling involves computing the dot product of the surface normal of each triangle and the camera vector, with the sign of the result indicating whether the triangle is facing toward the camera or away from the camera. Triangles that are facing away from the camera are going to be covered with something else, so there is no reason to compute them. The computational requirements for back face culling are to be able to compute three-term cross products and dot products for all of the triangles in the view frustum.

C. Lighting

Lighting refers to computing colors based on light sources (directional, point, or spot) and materials (surface characteristics of an object, e.g., metallic objects might be shiny), and is normally done for each triangle or each vertex. With multiple light sources, the lighting equation can be quite involved, but all variants seem to include componentwise addition and/or multiplication, dot products, and vector normalization. To normalize a vertex (x, y, z) we divide each of x, y, z by sqrt(x^2 + y^2 + z^2).

D. View Transformation

The view transformation converts from world coordinates to camera coordinates (normalized projection coordinates). It has the same complexity as the world transformation (4x4 matrix times a 4x1 vector), and is sometimes done at the same time.
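As promised in subsection B, here is a scalar C sketch of the frustum plane test. This is our own illustration; the type and function names are hypothetical, and the sign convention (non-negative means inside) is a choice.

    /* A plane is (a, b, c, d); a vertex is homogeneous (x, y, z, w).
     * The sign of the 4-term dot product gives inside vs. outside. */
    typedef struct { float x, y, z, w; } Vertex;
    typedef struct { float a, b, c, d; } Plane;

    static int inside(const Plane *p, const Vertex *v)
    {
        return p->a * v->x + p->b * v->y
             + p->c * v->z + p->d * v->w >= 0.0f;
    }

In the vectorized version, this test is exactly the 4-term dot product primitive measured in Table IV.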

E. Clipping

Culling got rid of objects entirely outside of the view frustum. For objects that are partially inside the view volume, clipping is used to remove the unviewed portions. For a triangle, we know we need to clip if two of the vertices are on different sides of a view frustum plane. The actual process is to form a new polygon based on the intersection points with the appropriate plane. The computation involves dot products, subtraction, multiplication, division, and comparison. We usually assume that the number of triangles to be clipped is small (less than 10% of the total), but this still implies a substantial number of divisions. Clipping is not easy to vectorize, but it is common to defer clipping to the rendering pipeline, i.e., to process the triangles and then have the renderer discard the unneeded pixels [5].

F. Division by w

If we have applied a perspective transformation to a vertex (x, y, z, w), then we must divide x, y, and z by w to get the correct x, y, z coordinates. This initially seems like a lot of divisions, but multiplicative inverses work just as well, and we only need one of those per vertex. The total work for this step involves one multiplicative inverse computation and three multiplications per vertex.
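A scalar C rendering of this step (our illustration only; the function name is hypothetical) makes the operation count explicit: one reciprocal replaces the three divisions.

    /* Perspective divide: one multiplicative inverse plus three
     * multiplies per vertex, as described above. */
    static void perspective_divide(float *x, float *y, float *z, float w)
    {
        float inv_w = 1.0f / w;   /* the single expensive operation */
        *x *= inv_w;
        *y *= inv_w;
        *z *= inv_w;
    }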

IV. VECTORIZATION AND OPTIMIZATION

Our candidate operations for optimization are as follows (a C sketch combining the last two appears after this list):
• 4-term dot product (transformation matrices and culling)
• 3-term cross product (back face culling)
• 3-term dot product (back face culling, lighting)
• inverse square root (3-term vector normalization, lighting)
• inverse (division by w)
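As a scalar reference point for the normalization primitive, which combines a 3-term dot product with an inverse square root, consider the following C sketch. This is our own illustration using the standard C library sqrtf; the function name is hypothetical.

    #include <math.h>

    /* Normalize (x, y, z): one 3-term dot product, one inverse
     * square root, and three multiplies, per section III-C. */
    static void normalize(float v[3])
    {
        float inv_len = 1.0f / sqrtf(v[0]*v[0] + v[1]*v[1] + v[2]*v[2]);
        v[0] *= inv_len;
        v[1] *= inv_len;
        v[2] *= inv_len;
    }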

Space does not permit detailed examination of all of these, so we will look at the first two before presenting our results. Transformation matrices are illustrative because they are easy to vectorize, and cross products are conversely interesting because they are not easy to vectorize.

A. Example: Matrix Times Vector

This operation is used for the modeling and view transformations. Since the matrix is fixed for each object, the optimum routine will load the matrix into the floating-point registers, and then apply it to some large number of vertices. For a matrix A and a vertex p = (x, y, z, w), the product we want to compute is

    [ a00*x + a01*y + a02*z + a03*w ]
    [ a10*x + a11*y + a12*z + a13*w ]
    [ a20*x + a21*y + a22*z + a23*w ]
    [ a30*x + a31*y + a32*z + a33*w ]

which of course is just a series of 4-term dot products.

If we code the algorithm directly into C, for each vertex we would have to do something like the following:

    new_x = a00*x + a01*y + a02*z + a03*w;
    new_y = a10*x + a11*y + a12*z + a13*w;
    new_z = a20*x + a21*y + a22*z + a23*w;
    new_w = a30*x + a31*y + a32*z + a33*w;

This is easy to read and write, but the performance is mediocre. Each step is serialized, so for example we wait the full pipeline length for a00*x to complete before we can begin the multiply-add of a01*y. Computing the multiplies separately is possible, but does not help, because the basic pipeline on the VFP is a multiply-add, so an add takes the same amount of time as a multiply-add. The performance of this code segment if executed as shown above is 55 cycles on VFP10, 58 cycles on VFP9, and 107 cycles on VFP11.

A much better approach is to perform the operations in an independent manner, avoiding the implied serialization above. The following code illustrates this:

    new_x  = a00*x;
    new_y  = a10*x;
    new_z  = a20*x;
    new_w  = a30*x;

    new_x += a01*y;
    new_y += a11*y;
    new_z += a21*y;
    new_w += a31*y;

    new_x += a02*z;
    new_y += a12*z;
    new_z += a22*z;
    new_w += a32*z;

    new_x += a03*w;
    new_y += a13*w;
    new_z += a23*w;
    new_w += a33*w;

Each of the operations in a block of four is independent of the other operations in that block, and can be issued as soon as the result of the corresponding operation in the prior block has completed. Performance is greatly improved: 19 cycles on VFP10, 35 cycles on VFP9, and 35 cycles on VFP11.

Looking at the code from a vector point of view, we see that each block of four in the code is a column of the matrix times a scalar. With vectors of length 4, the code could be easily expressed with vector-scalar operations. Our experience with writing VFP code is that the most critical step is figuring out the optimal placement of the data in registers [8], [9]. Once this is done, the assembler code is usually obvious. For our reference architecture, we load the matrix by columns into banks 1 and 2. The vertices to be transformed are loaded into bank 0, and the transformed vertices are computed in bank 3:

    bank 0:  x0   y0   z0   w0  |  x1   y1   z1   w1     (two input vertices)
    bank 1:  a00  a10  a20  a30 |  a01  a11  a21  a31    (matrix columns 0 and 1)
    bank 2:  a02  a12  a22  a32 |  a03  a13  a23  a33    (matrix columns 2 and 3)
    bank 3:  x0'  y0'  z0'  w0' |  x1'  y1'  z1'  w1'    (transformed vertices)

Sixteen operations are required for each vertex (4 multiply and 12 multiply-accumulate operations). We transform two vertices at a time so that there are 8 cycles between dependent operations, which minimizes any bubbles from use-use delays. In a real program, we might apply this transform to hundreds of vertices, so the load/store behavior becomes interesting. The following is the inner loop of a subroutine that transforms vertices two at a time:

    loop    fldmias r0!, {s0-s3}     ; load vertex p0
            fmacs   s28, s20, s7     ; + col3 * p1.w
            fldmias r0!, {s4-s7}     ; load vertex p1
            fstmias r1!, {s24-s27}   ; store transformed p0
            fmuls   s24, s8, s0      ; col0 * p0.x
            fstmias r1!, {s28-s31}   ; store transformed p1
            fmuls   s28, s8, s4      ; col0 * p1.x
            sub     r2, r2, #2
            fmacs   s24, s12, s1     ; + col1 * p0.y
            cmp     r2, #0
            fmacs   s28, s12, s5     ; + col1 * p1.y
            fmacs   s24, s16, s2     ; + col2 * p0.z
            fmacs   s28, s16, s6     ; + col2 * p1.z
            fmacs   s24, s20, s3     ; + col3 * p0.w
            bgt     loop
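The assembler above shows only the steady-state loop. As a C-level reference for what a complete routine must compute, including the odd trailing vertex, the following sketch may help; it is our own outline, not the paper's code, and the names and column-major layout are our assumptions.

    /* Transform n vertices (4 floats each) by a 4x4 column-major
     * matrix m (m[4*col + row]). The inner loop mirrors the
     * assembler above by keeping two vertices in flight.        */
    static void transform(const float m[16], const float *in,
                          float *out, int n)
    {
        for (int i = 0; i + 1 < n; i += 2)       /* two at a time */
            for (int v = 0; v < 2; v++)
                for (int r = 0; r < 4; r++)
                    out[4*(i+v) + r] =
                        m[r]      * in[4*(i+v)]     +   /* col0 * x */
                        m[4 + r]  * in[4*(i+v) + 1] +   /* col1 * y */
                        m[8 + r]  * in[4*(i+v) + 2] +   /* col2 * z */
                        m[12 + r] * in[4*(i+v) + 3];    /* col3 * w */
        if (n & 1) {                              /* leftover vertex */
            int i = n - 1;
            for (int r = 0; r < 4; r++)
                out[4*i + r] = m[r]*in[4*i]     + m[4+r]*in[4*i+1]
                             + m[8+r]*in[4*i+2] + m[12+r]*in[4*i+3];
        }
    }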

The missing parts of the code are the bits that handle the first two vertices and the last one or two vertices. Each multiply (fmuls) or multiply-add (fmacs) actually specifies four operations, giving the inner loop the satisfying property of being shorter than the corresponding C code. The loads and stores are infrequent enough that they can be completely hidden on the VFP10. Consider the pipeline diagram shown in Table III, which shows the floating-point and load/store pipelines for the VFP10 during the execution of this loop. The four-stage floating-point pipeline is completely filled, and could be filled indefinitely as long as the data is in cache. Our optimized performance results for VFP10 are 16 cycles per vertex, which is the best that can be done with a single execution unit. The blocking loads and stores on VFP9 cost us 5 additional cycles, for a total of 21 cycles per vertex. The longer pipeline on VFP11 causes four pipeline bubbles before we can begin the next operation on a vector (a vector of length 4 requires 9 cycles to produce the first result, but 12 cycles to produce all 4 results), so the total time increases to 49 cycles for two vertices, or 24.5 cycles per vertex.

B. Example: Cross Product

Our results are not nearly as good on cross products. A cross product of two vectors generates a vector that is perpendicular to the plane containing the original two vectors. Given two vectors x and y, the cross product is

    c = x × y = [ x1*y2 - x2*y1 ]
                [ x2*y0 - x0*y2 ]
                [ x0*y1 - x1*y0 ]

The cross product is used to compute the normal to a triangle, which is used for back face culling. The normal will often be stored with the triangle, but it needs to be computed on a frame-by-frame basis for moving objects.
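In scalar C, the equation above is simply the following (our illustration; the function name is hypothetical):

    /* c = x × y, components per the equation above. */
    static void cross(const float x[3], const float y[3], float c[3])
    {
        c[0] = x[1]*y[2] - x[2]*y[1];
        c[1] = x[2]*y[0] - x[0]*y[2];
        c[2] = x[0]*y[1] - x[1]*y[0];
    }

The shuffled index pattern is what makes the operation awkward to map onto short vectors, as discussed below.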

TABLE III: VFP10 Matrix Times Vector Pipeline. (Cycle-by-cycle diagram, cycles 1-36: the floating-point pipeline stages E1, E2, E3, W are occupied by the vector operations c1-c8 of the loop above, while the load/store pipeline stages E, M, W handle the loads l1, l2 and stores s1, s2. The floating-point pipeline has no empty slots.)

We can see that the cross product is not easy to vectorize, even though there are many to calculate. We tried two strategies: (1) vector loads with scalar arithmetic (load several vertices at a time, but the placement of the data does not allow for vector arithmetic), and (2) scalar loads with vector arithmetic. The second strategy is to put corresponding terms from several problems in a vector (e.g., all the x0 terms in one short vector), and then quickly perform the arithmetic. Neither solution was completely satisfying, and they gave almost identical performance: 2 cross products in 23 cycles for VFP10, and 3 cross products in 35 cycles for VFP11. The theoretical minimum for a single-issue machine would be a cross product every 6 cycles. The average computation time of 11.5 cycles is nearly twice as long, but still a significant improvement over the best scalar approach, which requires 25 cycles per cross product.

C. Results

Our results for scalar VFP11, vector VFP11, and vector VFP10 are given in Table IV. The table entries are in cycles, and the vector results are for inner loops. The scalar numbers do not include I/O costs, but the vector results do, making the speedups somewhat greater than indicated.

TABLE IV: Vectorization Results

    problem                       scalar   VFP11   VFP10
    4x4 matrix times vector       42       22      16
    culling 4-term dot product    16       4.25    4
    3-term cross product          25       11.67   11.5
    x/w, y/w, z/w                 29       17      14
    vector normalization          72       30      28

V. CONCLUSION

3D graphics primitives have a large amount of parallelism, and most of the primitives are easily vectorized. Compilers require some help to utilize this parallelism (C is notoriously unfriendly toward vectorization), but this can be largely overcome by writing a few low-level library routines that handle the basic primitives.

Even with a single issue slot, vectors can keep a floating-point unit fully utilized, which allows us to get by with much less hardware than we might have thought. This is an important point to consider before we start adding execution units (i.e., additional floating-point pipelines). For VFP9, we were sometimes constrained by bandwidth, and for VFP11 with its longer pipeline we often wanted more registers, but we never saw the need for more execution units. Keeping one pipeline fully utilized is a difficult enough problem, and unless we are willing to go for more bandwidth and more registers, SIMD approaches are not very useful for the graphics problems presented here.

The VFP is well-suited to performing most basic 3D floating-point graphics operations. The major problem is I/O, and in future work, we need to find some better way to deal with problems like cross products that may not be stored in any vectorizable order.

References

[1] K. Asanovic. Vector Microprocessors. PhD thesis, University of California at Berkeley, 1998.
[2] David H. Eberly. 3D Game Engine Design: A Practical Approach to Real-Time Computer Graphics. Morgan Kaufmann, San Francisco, CA, 2001.
[3] James D. Foley, Andries van Dam, Steven K. Feiner, and John F. Hughes. Computer Graphics: Principles and Practice, Second Edition in C. Addison-Wesley, Reading, MA, 1997.
[4] C. Kozyrakis and D. Patterson. Vector vs. superscalar and VLIW architectures for embedded multimedia benchmarks. In 35th International Symposium on Microarchitecture, November 2002.
[5] André LaMothe. Tricks of the 3D Game Programming Gurus. Sams Publishing, Indianapolis, IN, 2003.
[6] C. G. Lee and D. J. DeVries. Initial results on the performance and cost of vector microprocessors. In 30th International Symposium on Microarchitecture, pages 171-182, December 1997.
[7] C. G. Lee and M. G. Stoodley. Simple vector microprocessors for multimedia applications. In 31st International Symposium on Microarchitecture, December 1998.
[8] David R. Lutz and Chris N. Hinds. Optimizing the AMR encoder for the ARM10 vector floating-point coprocessor. In 2003 International Signal Processing Conference, April 2003.
[9] David R. Lutz and Chris N. Hinds. Performance of the floating-point AMR encoder on a commercial single-issue vector processor. In First Workshop on Embedded Systems for Real-Time Multimedia (ESTIMedia), October 2003.
[10] M. G. Stoodley and C. G. Lee. Vector microprocessors for desktop computing. Unpublished monograph, June 1999.
