MP3 Optimization Exploiting Processor Architecture and Using Better Algorithms

An application's execution time depends on the processor architecture and clock frequency, the computational complexity of the algorithms, the choice of compiler and optimization options, and how well the programmer explicitly and implicitly exploits the processor architecture. This article quantifies the influence of these factors for an MP3 decoder through experimental results.

Mancia Anguita, Universidad de Granada
J. Manuel Martínez-Lechado, Vitelcom Mobile Technology

IEEE Micro, May–June 2005. 0272-1732/05/$20.00 © 2005 IEEE

For all applications, execution time depends on the executing processor's architecture and clock frequency, the computational complexity (number of operations) of the algorithms used in the program, the compiler, and the programmer's skill. But how much influence do these factors exert on overall performance? In this article, we try to quantify these effects for a sample application. We believe that some applications can reach their required performance levels with a quicker time to market and/or a lower development budget if programmers exploit the processor architecture. This can be preferable to programming for or procuring specialized hardware, or to waiting for better compiler implementations or function libraries. In particular, programmers can exploit architectural features that compilers do not use1-4 or do not apply at all the suitable points in a program. By exploiting the processor architecture,

Published by the IEEE Computer Society

programmers can benefit from a processor's new architectural features well before a compiler or library incorporating those features appears on the market (possibly at a significant price). For example, the arch:SSE option added in Microsoft Visual C++ .NET 2003 (and absent from Visual C++ .NET 2002) takes advantage of Intel's x86 cmov instructions and of the single-instruction, multiple-data (SIMD) enhancements in the Streaming SIMD Extensions (SSE). But cmov instructions first appeared in the Pentium Pro (1995), and SSE instructions appeared with the Pentium III (1999).5 Programmers can also exploit processor architecture to avoid the added cost of using dedicated, special-purpose hardware to meet high computational requirements.6,7 As an added benefit, general-purpose processors are easier to program than dedicated hardware and can flexibly incorporate new algorithms, such as those for communication

protocols, which can change quickly and make dedicated hardware obsolete. To investigate our ideas, we compare the performance of diverse MP3 decoder implementations that cover various levels and types of optimization. Our goal is to illustrate the influence of these factors on performance.

Figure 1. MP3 decoder structure. The MP3 bitstream passes through preprocessing (which uses the Huffman tables and scale factors), Huffman decoding, requantization, reordering, and stereo decoding; each channel then passes through alias reduction, IMDCT, frequency inversion, and a synthesis filter bank, producing the left and right PCM channels.

Figure 2. Frame structure for a single channel. Each frame holds two granules; each granule holds 32 subband blocks of 18 frequency lines each (18 × 32 = 576 frequency lines).

MP3 decoder overview

Reviewing the basic processing stages of an MP3 decoder provides a better understanding of the requirements for the decoder implementation. The ISO/IEC 11172-3 standard8 and other works9 offer more information. Figure 1 shows an overview of the decoding structure. The decoding process produces output pulse code modulation (PCM) samples from an MP3 bitstream. The MP3 bitstream consists of frames, each producing a few milliseconds of sound output. A frame contains compressed audio data and the information necessary to decode it properly. The compressed audio data contains one (mono) or two (stereo) channels, divided into two granules comprising 576 samples each, as Figure 2 shows. First, the preprocessing stage finds


frames in the bitstream and extracts encoded audio data and other information needed during the decoding process. Then, from the encoded audio data, the Huffman decoder produces a set of symbols for each granule representing 576 scaled frequency lines. The next step descales the symbols to normal frequency lines, using the scale factors extracted from the frame. If the frame contains more than one channel of audio information (except in dual-channel mode), the stereo decoder produces the frequencies for the left and right channels. In the last decoder steps, the inverse modified discrete cosine transform (IMDCT) and the synthesis polyphase filter bank, complex calculations transform the frequency lines into time samples.

Decoder stages

Data must pass through various decoder stages: preprocessing, Huffman decoding, requantization, reordering, stereo decoding, alias reduction, IMDCT, frequency inversion, and the synthesis polyphase filter bank.

Preprocessing. This stage finds frames in the bitstream and extracts their compressed audio data and some information needed during the decoding process, such as Huffman tables and scale factors. This stage might also perform error detection.

Huffman decoding. Huffman encoding is a

lossless coding scheme that produces Huffman codes from input symbols. The encoder bases the mapping of symbols to Huffman codes on the statistical content of the input sequence. It codes symbols that occur more frequently with shorter codes and symbols that occur less frequently with longer codes, compressing the data. The output from the Huffman decoder is, for each granule, 32 subband blocks with 18 frequency lines each (32 × 18 = 576 frequency lines). The decoding process is based on several Huffman tables for mapping Huffman codes to symbols; the information extracted in the preprocessing stage specifies which table to use for decoding the current frame. The standard defines 17 different tables. The longest variable-length code word in any table is 19 bits, so a direct table-lookup decoder would need large tables (some with 2^19 entries). A more compact representation translates each Huffman table into a tree-like lookup table. The decoder starts at the root and traverses the tree according to the bits in the compressed audio bitstream: a 0 extracted from the bitstream might mean go left; a 1, go right. A code word is fully decoded when a leaf is reached; the leaves contain the values of the scaled frequency lines (the Huffman decoder's output symbols). In the Huffman decoder stage, the most significant part of the processing lies in handling the compressed audio bitstream and in searching the Huffman tables; each search iteration extracts and processes one bit from the bitstream. The cost of the Huffman decoder stage increases with the bit rate.

Requantization. From the symbols isi generated in the Huffman decoder, this stage reconstructs the original frequency-line samples xri using the scale factors extracted in the preprocessing stage: xri = sign(isi) × |isi|^(4/3) × 2^(Cj/4) (the scale factors are enclosed within Cj in this equation). The encoder had divided each frequency line into a scale factor and a small integer, saving the scale factors separately and encoding the integers with Huffman coding. Scale factors differ for short and long blocks. The encoder normally performs a modified discrete cosine transform (MDCT) on 18 samples at a time

Figure 3. IMDCT, windowing, and overlapping for long blocks. The 36-point IMDCT expands a subband block's 18 input values into 36 values (648 multiplications and 612 additions); windowing multiplies each value by a window coefficient, xi = xi × ai, i = 0, …, 35 (36 multiplications); and overlapping adds the 18 lower values to the 18 higher values buffered from the previously processed block (18 additions), yielding the stage's 18 final output values.

(long blocks) to achieve good frequency resolution. It can also perform the MDCT on six samples at a time (short blocks) to achieve better time resolution.

Reordering. The encoder reorders short blocks to make the Huffman coding more efficient; the decoder reverses this reordering. This stage therefore operates only on short blocks.

Stereo decoding. To exploit redundancies between stereo channels, an encoder can code samples in MS-Stereo or Intensity Stereo mode, so the decoder must extract the independent channels. With single or dual channels, no stereo processing is necessary. (In dual-channel mode, the two channels are coded independently.)

Alias reduction. This stage negates the aliasing effects of the polyphase filter bank in the encoder; it does not apply to granules that use short blocks. Alias reduction consists of eight butterfly calculations for each pair of adjacent subbands.9

IMDCT. This stage applies, in sequence, the operations shown in Figure 3 to the 18 frequency lines in a subband block. The decoder obtains polyphase filter subband samples (xi) from the input frequency lines (Xk) by applying the following equation:


xi = Σ (k = 0 to n/2 − 1) Xk cos[(π/(2n))(2i + 1 + n/2)(2k + 1)], i = 0, …, n − 1

For long blocks, this expression generates 36 values from the 18 frequency lines of a subband block (n = 36). The stage then multiplies the outputs by a 36-coefficient window. Finally, it adds the lower 18 values to the higher 18 values (overlapping) of the previously processed block, generating the stage's 18 output samples for a long block. For short blocks, it performs three IMDCT transforms, each producing 12 outputs (n = 12) from six frequency lines (thus processing all 18 frequency lines in a subband block). The stage then windows the three vectors and overlaps them with each other. Concatenating six zeros on both ends of the resulting vector yields a vector of length 36, which the stage processes like the output of a long-block transform. For a single granule, the outputs of this stage are 18 samples for each subband block (576 samples); the stage applies this process 32 times per granule.

Frequency inversion. To compensate for frequency inversions in the synthesis polyphase filter bank, this stage negates every odd sample in all odd subbands.

Figure 4. Synthesis polyphase filter bank for 32 PCM samples. Matrixing computes Vi = Σ (k = 0 to 31) Sk cos[(π/64)(i + 16)(2k + 1)], i = 0, …, 63, from the 32 input values (2,048 multiplications and 1,984 additions); the 64-value V vector enters a 1,024-value FIFO; extraction builds the 512-value U vector:

for i = 1,023 down to 64 do F(i) = F(i − 64)
for i = 0 to 63 do F(i) = V(i)
for i = 0 to 7 do
  for j = 0 to 31 do
    U(i × 64 + j) = F(i × 128 + j)
    U(i × 64 + 32 + j) = F(i × 128 + 96 + j)

windowing computes Wi = Ui × bi, i = 0, …, 511 (512 multiplications); and vector addition sums the 16 resulting 32-value pieces (512 additions) into 32 PCM samples.

Synthesis polyphase filter bank. This stage transforms the 32 subband blocks of 18 samples in each granule into 18 blocks of 32 PCM samples. The filter bank operates on 32 samples at a time, one from each subband block. First, a variant of the IMDCT (the matrixing operation) is applied to the 32 input samples, producing a 64-value V vector, as Figure 4 shows. The stage then pushes V into a first-in, first-out (FIFO) buffer that stores the last 16 V vectors (1,024 samples). Next, the filter bank forms a 512-sample U vector from alternate 32-sample blocks in the FIFO. It multiplies the 512 samples in U by a 512-coefficient window, producing the W vector. To obtain the reconstructed samples, the stage decomposes W into 16 vectors, each 32 values long; summing these vectors yields the 32 PCM samples for the 32 subband samples at the input. This process occurs 18 times per granule and is the most computationally intensive stage.

MP3 decoder implementations

We have implemented diverse MP3 code versions that cover different levels and types of optimization. Figure 5 summarizes how we derived each version.

Standard

The first version, standard, implements MP3 following the standard documentation, using only the tables specified in the standard.8 Successive versions improve on this version.

Basic

The basic version improves on the standard version by accounting for the actual processor's architectural features, both explicitly and implicitly, through the compiler's standard library functions that take advantage of the architecture. For example, it does the following:

• It replaces some instructions with others that take fewer clock cycles; for example, it replaces floating-point divisions with multiplications and some integer multiply instructions with shifts.
• It replaces computationally intensive library functions, such as cosines or powers, with tables.
• It replaces slower high-level code with library functions that use special processor instructions; for example, we use memcpy for memory copies.
• It uses loop unrolling to improve some loops.
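The first replacement can be illustrated in C (toy functions of our own; the real decoder applies the same transformations inside its inner loops):

```c
/* Floating-point division by a constant becomes multiplication by its
 * reciprocal, and integer multiplication by a power of two becomes a
 * shift; each pair computes the same result with cheaper instructions. */
static double scale_div(double x) { return x / 4.0; }
static double scale_mul(double x) { return x * 0.25; }  /* same value */

static int times8_mul(int n)   { return n * 8; }
static int times8_shift(int n) { return n << 3; }       /* same value */
```

For constants that are not powers of two, the reciprocal is rounded, so the replacement trades a last-bit difference for speed.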


Loop unrolling is a transformation that increases the loop body size and decreases the number of iterations.1 It reduces the number of instructions executed (such as those that update the loop index and take the conditional branch), allows more efficient memory access and register use, and reduces pipeline stalls from data dependencies (hazards) by giving the compiler's instruction scheduler more independent work to interleave.
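On a vector kernel of MP3's granule size, the transformation looks like this (a sketch with names of our own; the four-way unroll keeps four independent accumulators the scheduler can interleave):

```c
/* Dot product over one granule's 576 samples, unrolled by four:
 * one index update and one branch per four elements, and four
 * independent partial sums that avoid a serial dependence chain. */
static float dot576_unrolled(const float *a, const float *b)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i;
    for (i = 0; i < 576; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```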

SIMD

The SIMD version improves on the basic version using typical instructions found in SIMD extensions. SIMD instructions perform the same operation on several data elements in parallel, increasing the performance of matrix and vector operations. In general, a programmer can exploit more architectural features, or exploit a feature to a higher degree, than a compiler; for example, it is difficult for a compiler to find places to use some instructions, such as SIMD instructions. Some processor instructions that compilation does not generate might be available through library functions, but generic library functions can be too general, and buying specialized libraries increases the software's final price. MP3 is based on vector operations, so it can benefit from SIMD instructions. We have developed SIMD implementations of several MP3 stages: requantization, stereo processing, IMDCT processing, and the synthesis filter bank. Globally, this version also uses SIMD to improve memory initializations and block transfers. The SIMD instructions used execute up to four single-precision floating-point operations in parallel. This decoder version also aligns data in memory for improved access time and avoids operations on denormalized floating-point values.
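The four-wide single-precision operations can be pictured with a portable scalar emulation (our implementations use the actual SSE instructions through inline assembler; this sketch, with names of our own, only illustrates the data layout and lane-parallel arithmetic):

```c
/* Four-lane single-precision vector, mirroring one SSE register. */
typedef struct { float lane[4]; } vec4;

/* Lane-wise multiply: the scalar picture of SSE's packed multiply. */
static vec4 v4_mul(vec4 a, vec4 b)
{
    vec4 r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = a.lane[i] * b.lane[i];
    return r;
}

/* Windowing step Wi = Ui * bi done four samples per step:
 * 512 samples take 128 vector multiplies instead of 512 scalar ones. */
static void window512(const float *U, const float *b, float *W)
{
    for (int i = 0; i < 512; i += 4) {
        vec4 u = { { U[i], U[i + 1], U[i + 2], U[i + 3] } };
        vec4 c = { { b[i], b[i + 1], b[i + 2], b[i + 3] } };
        vec4 r = v4_mul(u, c);
        for (int k = 0; k < 4; k++)
            W[i + k] = r.lane[k];
    }
}
```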


Figure 5. Source code versions, with order of implementation and relations.

Algorithmic

This version improves the basic version with algorithms that reduce the number of operations required by the most computationally intensive MP3 stages.

Synthesis polyphase filter bank. Konstantinides' method reduces the number of operations in this stage by transforming the matrixing operation into a 32-point discrete cosine transform (DCT) plus some reordering operations.10 We implement the DCT with a fast DCT algorithm that divides the DCT recursively into two smaller DCTs.11 This method eliminates 96 percent of the multiplications and 90 percent of the additions.

IMDCT. Marovich's method reduces the IMDCT's complexity.12 Following Konstantinides' approach, it reduces the IMDCT to a fast DCT computation plus some data-copying operations. This method eliminates 88 percent of the multiplications and 76 percent of the additions for long blocks, the most frequent blocks in practice; for short blocks, it eliminates 82 percent of the multiplications and 55 percent of the additions.

Huffman decoding. A tree-clustering algorithm13 can speed up the search for a symbol in a Huffman tree and reduce the memory size. The algorithmic version uses this clustered Huffman decoding.
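The tree walk that this clustering accelerates is the bit-at-a-time search described earlier. A C sketch of that baseline (node layout and names are ours; real decoders flatten the standard's 17 tables into arrays):

```c
/* Binary Huffman tree: internal nodes hold child links for bits 0 and 1;
 * leaves hold the decoded symbol (a scaled frequency-line value). */
struct hnode {
    int is_leaf;
    int symbol;                   /* valid when is_leaf */
    const struct hnode *child[2]; /* child[bit] when internal */
};

/* Consume bits MSB-first from `bits` until a leaf is reached;
 * returns the symbol and advances *pos past the code word. */
static int huff_decode(const struct hnode *root,
                       const unsigned char *bits, int *pos)
{
    const struct hnode *n = root;
    while (!n->is_leaf) {
        int bit = (bits[*pos >> 3] >> (7 - (*pos & 7))) & 1;
        n = n->child[bit];
        (*pos)++;
    }
    return n->symbol;
}
```

Each iteration extracts and processes a single bit, which is exactly the per-bit cost the clustering method amortizes.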

Algorithmic-SIMD

This version is based on the SIMD version combined with the SIMD implementations


Table 1. Compilers used in the performance evaluation.

Intel C++ 7.1 (Intel)4
  Classical optimizations and inline expansion of functions: O2
  Pentium III optimization: G6
  Pentium 4 optimization: G7
  Vectorization (MMX, SSE) and cmov instructions: QxK

Microsoft Visual C++ 6 (MVC6)5
  Classical optimizations and inline expansion of functions: O2
  Pentium III optimization: G6
  Pentium 4 and AMD Athlon optimization: not available
  Vectorization (MMX, SSE) and cmov instructions: not available

Microsoft Visual C++ .NET 2003 (MNET)5
  Classical optimizations and inline expansion of functions: O2
  Pentium III optimization: G6
  Pentium 4 and AMD Athlon optimization: G7
  Vectorization (MMX, SSE) and cmov instructions: arch:SSE

of the algorithms we used to improve the IMDCT and synthesis stages. It uses the same clustering Huffman-decoding implementation as the algorithmic version.

Performance comparison

We have compiled the different MP3 source code versions with three compilers: Intel C++ 7.1, Microsoft Visual C++ 6, and Microsoft Visual C++ .NET 2003. Table 1 summarizes the compiler optimization options we used to tune code performance. They include the following:

• O2. This option includes classical, processor-independent optimizations,2 such as constant propagation, copy propagation, dead-code removal, strength reduction, variable renaming, common-subexpression elimination, loop-invariant code motion, loop unrolling, and induction-variable simplification. O2 also includes inline function expansion, which replaces a function call with the function body and helps programs that make many calls to small functions (such as an MP3 Huffman decoder).
• G6. This switch optimizes code for the Pentium Pro, Pentium II, and Pentium III, generating code that remains compatible with earlier processors.
• G7. This switch optimizes code for the Pentium 4 (and the Athlon with Microsoft's compilers), generating code that remains compatible with earlier processors.


• QxK. This Intel-specific option generates instructions and optimizations for the Pentium III; the resulting executables run only on Pentium III or later processors. QxK enables vectorization using the SSE and MMX instructions included in the Pentium III and Pentium 4,4 and it generates cmov instructions. The vectorizer, a compiler component that works with QxK, detects patterns of sequential data accesses by the same instruction and transforms the code to execute SIMD instructions. To take advantage of the vectorizer, a loop must exhibit iteration independence, allow memory disambiguation (the compiler must be able to determine whether two or more pointers point to the same memory location), and have a high loop count. The cmov instruction helps avoid mispredicted branches; the branch misprediction penalty is usually equivalent to the processor's pipeline depth.
• arch:SSE. This option uses SSE and cmov instructions.5

Table 2 presents the processors on which we tested this code: the AMD Athlon, Intel Pentium III, and Intel Pentium 4. We measured the processor cycles consumed by the different MP3 stages, testing MP3 bitstreams with varying bit rates. Figures 6 and 7 present the processor cycles required for decoding a frame (cycles per frame) of a typical MP3 bitstream, Tristana,

Table 2. Processors used in the evaluation.

                      Athlon XP      Pentium III     Pentium 4
Family                6              6               15
L1 data cache         64 Kbytes      16 Kbytes       8 Kbytes
L1 instruction cache  64 Kbytes      16 Kbytes       12K micro-ops
L2 cache              256 Kbytes     256 Kbytes      512 Kbytes
Memory size           1 Gbyte        256 Mbytes      512 Mbytes
Operating system      Windows XP     Windows 2000    Windows XP

Table 3. Tristana bitstream features.

Size               4,392,960 bytes
Play time          274 seconds
No. of frames      10,508
No. of channels    2
Frequency          44.1 kHz
Bit rate           128 Kbps, constant
Mode               MS-Stereo (99.42%), Stereo (0.58%)
Block composition  97.29% long blocks, 2.61% short blocks
Processing speed   10,508/274 = 38.35 frames/s

Figure 6. Cycles per frame for the standard (Std) and basic (Bas) versions running on the Athlon, Pentium III, and Pentium 4, compiled with MVC6, MNET G7, and Intel G7; each bar breaks down into the Huffman, IMDCT, synthesis filter (SF), and other stages. Designations such as "MVC6" refer to the compilers and optimizations outlined in Table 1.

with the characteristics shown in Table 3. Because we measure processor clock cycles instead of time, the results are independent of the processor clock frequency. The figures show the average number of cycles required to decode a frame with the different MP3 versions: Figure 6 shows the standard (Std) and basic (Bas) versions, and Figure 7 shows the basic, SIMD, algorithmic (Alg), and algorithmic-SIMD (Alg-SIMD) versions. The figures break down the average cycle counts for the slowest stages: Huffman, IMDCT, and synthesis filter bank (labeled "SF" in the figures); "Other" represents the cycles consumed by all the other MP3 stages. The relative measure we use, cycles per frame, depends less on the input MP3 bitstream than the absolute count of cycles consumed for the complete bitstream. We have observed the expected differences between mono and stereo bitstreams: for a mono stream, cycles per frame are approximately halved. For different bit rates, we have seen variations in the Huffman stage. We also show results for the compiler optimization options summarized earlier (G6, G7, QxK, and arch:SSE). We used the O2 option for all binary codes and also G6 for all MVC6 binary codes (we saw no noticeable difference between QxK with the O2 or O3 options).

As Figure 6 shows, the Pentium 4 needs more cycles than the Pentium III and the Athlon for the standard version because of the Pentium 4's deeper pipeline: the Pentium III pipeline has 12 stages; the Pentium 4, 20 stages;14 and the Athlon's integer pipeline has 10 stages.3 The large number of stages let Intel raise the Pentium 4's clock frequency with respect to the Pentium III, but it also increases the number of penalty cycles for unoptimized code. The MVC6 basic version is about 20× faster than the MVC6 standard code for the Pentium 4, 13× for Pentium


Figure 7. Cycles per frame for the basic, SIMD, algorithmic, and algorithmic-SIMD versions on the Athlon (a), Pentium III (b), and Pentium 4 (c), for each compiler and option combination (Intel G6, Intel G7, Intel G6 QxK, Intel G7 QxK, MVC6, MNET G6, MNET G7, MNET G6 SSE, MNET G7 SSE); each bar breaks down into the Huffman, IMDCT, synthesis filter (SF), and other stages. Designations such as "MVC6" refer to the compilers and optimizations outlined in Table 1.

III, and 12× for the Athlon. The Pentium 4 shows the greatest speedup because, as we have pointed out, the standard version spends more cycles on the Pentium 4 than on the Pentium III or the Athlon (nearly 1.86× more cycles for MVC6). Figure 8 presents the speedup values obtained with the SIMD, algorithmic, and algorithmic-SIMD versions as compared to


the MVC6 basic version. The figure shows that the MVC6 Alg-SIMD code is about 4× faster than the MVC6 basic code for the Pentium 4, 5× for the Pentium III, and 4.5× for the Athlon. Figure 8 also reveals the improvement from explicitly using SIMD and from explicitly using low-complexity algorithms. We observe that on the Pentium 4 and Pentium III, the SIMD version offers more speedup than the algorithmic version for all compilers and compiler optimization options. On the Athlon, the results vary with the compiler and optimizations: the algorithmic version can offer more speedup than the SIMD version. (We outlined the ideal speedups for the SIMD and algorithmic versions in the previous section.) Using this data, we can also compare the programmer's SIMD versions to the compiler-vectorized (QxK and arch:SSE) codes. Figure 8 shows that the Intel QxK and Microsoft arch:SSE codes for the basic version obtain less speedup than all SIMD executable codes. Also,

Figure 8. Speedup obtained with the basic, SIMD, algorithmic, and algorithmic-SIMD versions with respect to the MVC6 basic program, on the Athlon, Pentium III, and Pentium 4, for each compiler and option combination (Intel G6, Intel G7, Intel G6 QxK, Intel G7 QxK, MVC6, MNET G6, MNET G7, MNET G6 SSE, MNET G7 SSE).

Figure 9. Processing speed for the Athlon (frames/s), best executable code for each source version.

Clock speed   Std MNET G6   Bas Intel G7 QxK   SIMD MNET G7 SSE   Alg Intel G7 QxK   Alg-SIMD MVC6
100 MHz       3.94          80.37              121.71             143.96             223.8
750 MHz       29.54         602.75             912.83             1,079.72           1,678.49
1.2 GHz       47.26         964.4              1,460.53           1,727.55           2,685.58
2 GHz         78.76         1,607.33           2,434.22           2,879.26           4,475.97

Figure 10. Processing speed for the Pentium III (frames/s), best executable code for each source version.

Clock speed   Std MNET G6   Bas Intel G7 QxK   SIMD MNET G7   Alg Intel G7 QxK   Alg-SIMD MVC6
100 MHz       3.83          66.86              122.2          110.37             221.53
750 MHz       28.72         501.47             916.47         827.76             1,661.5
1.2 GHz       45.98         802.35             1,466.35       1,324.41           2,658.4
2 GHz         76.63         1,337.24           2,443.92       2,207.35           4,430.67

the Intel QxK or Microsoft arch:SSE codes for the algorithmic version have lower speedups than all algorithmic-SIMD executable codes (a conclusion also supported by Figure 7). Because the SIMD and algorithmic-SIMD versions use inline assembler, the compiler has few optimization opportunities, so the results for these codes depend little on the compiler optimization options and only slightly on the compiler. Nevertheless, we found that the algorithmic-SIMD version compiled with MVC6 (the oldest compiler) offers the best performance. Figure 7 shows that the Pentium 4 offers the


Figure 11. Processing speed for the Pentium 4 (frames/s), best executable code for each source version.

Clock speed   Std MNET G6   Bas Intel G7 QxK   SIMD Intel G6   Alg Intel G7 QxK   Alg-SIMD MVC6
100 MHz       2.06          67.88              133.75          90.18              170.36
750 MHz       15.45         509.13             1,003.16        676.37             1,277.73
1.2 GHz       24.72         814.62             1,605.05        1,082.19           2,044.37
2 GHz         41.2          1,357.69           2,675.08        1,803.65           3,407.29

lowest cycles per frame for the SIMD version (based mostly on vector instructions) and the worst for the algorithmic version (based on scalar instructions). The Athlon offers the lowest cycles per frame for the algorithmic and basic versions; for the algorithmic-SIMD version, the Athlon and Pentium III offer the lowest cycles per frame. To compare performance among processors, we must also account for the processor clock frequencies. Figures 9, 10, and 11 show the number of frames per second processed at several clock frequencies on the Athlon, Pentium III, and Pentium 4. These figures show frames per second for hypothetical low-power 100-MHz processors and for frequencies that are not available in all of these processors. The figures give the best result for each source version, that is, the frames per second obtained with each processor's best executable code. These figures let us account for the fact that the Pentium 4 operates at higher clock frequencies than the Pentium III for the same process technology, and for the fact that we can actually find an Athlon XP at approximately 2 GHz. The Pentium 4, as opposed to the Pentium III, was designed with a greater than 1.6× higher frequency target for its main clock rate, using the same process technology (http://www.intel.com/technology/itj/q12001/articles/art_2a.htm). In these figures, we can see that a Pentium 4 at 1.2 GHz offers better performance than a Pentium III at 750 MHz for the basic, SIMD, algorithmic, and algorithmic-SIMD versions.


On the other hand, we must also account for the fact that an MP3 decoder must process frames in real time. Real-time processing requires 38.35 frames/s (as Table 3 shows). Figures 9, 10, and 11 show that the standard version does not allow real-time processing on the 750-MHz Pentium III or Athlon, or even on the 1.2-GHz Pentium 4. The basic version, on the other hand, permits real-time processing even on 100-MHz processors, a low clock frequency suitable for embedded products.
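The real-time requirement reduces to a one-line check: a version meets real time when the clock frequency divided by its cycles-per-frame cost reaches 38.35 frames/s. A sketch (the cycles-per-frame values in the test below are back-computed from Figure 9's 100-MHz frames/s figures, so they are approximations):

```c
/* Real-time test: frames/s = clock_hz / cycles_per_frame must reach
 * Tristana's 10,508 frames / 274 s = 38.35 frames/s. */
static double frames_per_second(double clock_hz, double cycles_per_frame)
{
    return clock_hz / cycles_per_frame;
}

static int meets_realtime(double clock_hz, double cycles_per_frame)
{
    return frames_per_second(clock_hz, cycles_per_frame) >= 10508.0 / 274.0;
}
```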

Lessons learned From this work, we see that the exploiting and understanding of the basic architecture can drastically increase software performance. For example, comparing the basic and standard versions in Figure 6, we see that for MVC6, the basic version is at least 20× faster than the standard version, running on the Pentium 4. Besides this basic observation, we learned the following lessons about optimizing code: • Exploiting architecture features can be as important as choosing the right algorithms. Compare algorithmic and SIMD versions in Figures 7 and 8, we see that both improve on the basic version, but the SIMD version offers better performance than the algorithmic version for the Pentium 4 and Pentium III. In contrast, the algorithmic version offers better performance for the Athlon in most cases. • Programmers can exploit architecture features to a higher degree than compilers. Figure 8 shows that the Intel QxK or Microsoft arch:SSE codes for the basic version offer less speedup than all SIMD executable codes. Also, the Intel QxK or Microsoft arch:SSE codes for the algorithmic version offer less speedup than all of the algorithmic-SIMD executable codes. • Optimization choice depends on the application. In our MP3 examples, there is little difference in performance between G6 and G7. When using the vectorization options (arch:SSE and QxK), we obtained a noticeable improvement in performance in some cases (see Figures 7 and 8). MP3

is an application based on vector operations, so it seems natural that it would especially benefit from vectorization options. Programmers should therefore choose compiler options with an eye toward the application and not blindly trust compiler optimization options. As compiler manuals suggest, programmers should test whether and where compiler options actually improve performance. Also, note that compilers allow selective enabling or disabling of some optimization options in code fragments.
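The SIMD-versus-scalar contrast in the lessons above comes down to how a loop is written. A sketch of the kind of restructuring behind the SIMD versions: the windowed dot product is typical of the MP3 synthesis filter bank, but this code and its sizes are illustrative, not the article's implementation.

```c
/* Scalar versus SIMD-friendly forms of a dot product, the core
   operation of the MP3 synthesis filter bank.  Illustrative code,
   not the article's. */
#include <stddef.h>

/* Straightforward scalar form: one dependent accumulator. */
float dot_scalar(const float *a, const float *b, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* SIMD-friendly form: unit-stride, non-aliased (restrict) operands
   and four independent accumulators.  With SSE intrinsics, or a
   vectorizing option such as Intel's QxK or Microsoft's arch:SSE,
   each group of four multiply-adds can map onto packed instructions. */
float dot_simd_style(const float *restrict a, const float *restrict b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)   /* remainder loop for n not a multiple of 4 */
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);
}
```

Writing the loop this way is exactly the kind of explicit architecture exploitation the compiler options only partially achieve: the programmer guarantees independence and alignment-friendly access that the compiler must otherwise prove.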

Our work clearly shows that vector applications of high computational complexity, such as MP3, can significantly improve in performance if programmers know and exploit the processor's architecture. Compilers or function libraries incorporating new architecture features usually appear well after the feature's release and possibly come at a significant price. So the ability to exploit architecture can provide programmers—and their companies—with a significant competitive advantage. We trust that our results will encourage programmers to exploit architecture. MICRO

Acknowledgments
We thank our colleagues in the Dept. de Arquitectura y Tecnología de Computadores, Universidad de Granada—Julio Ortega, Javier Fernández-Baldomero, and Francisco Illeras—for letting us use their computers. Many special thanks go to Antonio Martinez-Calderón and Vicenta Lechado-Comino for their long-standing support.

References
1. K. Dowd and C. Severance, High Performance Computing, O'Reilly, 1998.
2. K.R. Wadleigh and I.L. Crawford, Software Optimization for High Performance Computing, Hewlett-Packard Professional Books, 2000.
3. AMD Athlon Processor x86 Code Optimization Guide, Advanced Micro Devices; http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_1274_3734^3748,00.html.
4. Intel Pentium 4 Processor Optimization Reference Manual, Intel Corp.; http://developer.intel.com/design/Pentium4/manuals/.

5. M. Lacey, "Optimizing Your Code with Visual C++," Microsoft Corp., Apr. 2003; http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dv_vstechart/html/vctchOptimizingYourCodeWithVisualC.asp.
6. S.M. Akramullah, I. Ahmad, and M.L. Liou, "Optimization of H.263 Video Encoding Using a Single Processor Computer: Performance Tradeoffs and Benchmarking," IEEE Trans. Circuits and Systems for Video Technology, vol. 11, no. 8, Aug. 2001, pp. 901-915.
7. V. Lappalainen, T.D. Hämäläinen, and P. Liuha, "Overview of Research Efforts on Media ISA Extensions and Their Usage in Video Coding," IEEE Trans. Circuits and Systems for Video Technology, vol. 12, no. 8, Aug. 2002, pp. 660-670.
8. ISO/IEC 11172-3, Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to about 1.5 Mbit/s, Part 3: Audio, Int'l Organization for Standardization/Int'l Electrotechnical Commission, 1993.
9. S. Gadd and T. Lenart, A Hardware Accelerated MP3 Decoder with Bluetooth Streaming Capabilities, master's thesis, Lund Univ., Nov. 2001.
10. K. Konstantinides, "Fast Subband Filtering in MPEG Audio Coding," IEEE Signal Processing Letters, vol. 1, no. 2, Feb. 1994, pp. 26-29.
11. B. Lee, "FCT—A Fast Cosine Transform," Proc. IEEE Conf. Acoustics, Speech, and Signal Processing (ICASSP 84), vol. 32, no. 9, Mar. 1984, pp. 1243-1245.
12. S. Marovich, Faster MPEG-1 Layer III Audio Decoding, tech. report HPL-2000-66, HP Laboratories, 2000.
13. R. Hashemian, "Memory Efficient and High-Speed Search Huffman Coding," IEEE Trans. Comm., vol. 43, no. 10, Oct. 1995, pp. 2576-2581.
14. IA-32 Intel Architecture Software Developer's Manual, Volume 1: Basic Architecture, Intel Corp.; http://developer.intel.com/design/Pentium4/manuals/.

Mancia Anguita is a professor at the Departamento de Arquitectura y Tecnología de Computadores, Universidad de Granada. Her research interests include code optimization and parallel processing. Anguita has a PhD in computer science from the Universidad de Granada.

J. Manuel Martinez-Lechado is an engineer at Vitelcom Mobile Technology, a Spanish handset manufacturer. His experience includes projects for multimedia communication over the Internet and photographic stitching. Martinez-Lechado has an MSc in computer engineering from the Universidad de Granada.

MAY–JUNE 2005


Direct questions and comments about this article to Mancia Anguita, Dept. Arquitectura y Tecnología de Computadores, ETS Ingeniería Informática, Universidad de Granada, C/ Periodista Daniel Saucedo Aranda, s/n, E-18071 Granada (Spain); [email protected] or [email protected]. For further information on this or any other computing topic, visit our Digital Library at http://www.computer.org/publications/dlib.

