IEEE Transactions on Consumer Electronics, Vol. 50, No. 4, NOVEMBER 2004
Embedded Software Optimization for MP3 Decoder Implemented on RISC Core
Yingbiao Yao, Qingdong Yao, Peng LIU, Zhibin Xiao
Zhejiang University, Information Science & Electronic Engineering Department, P.R. China
Abstract: This paper proposes general software optimization techniques for processor-based embedded systems, which mainly comprise general optimization methods in a high-level language and software and hardware co-optimization in assembly language. These techniques are then applied to optimize our MP3 decoder, which is based on RISC32, a RISC core compatible with the MIPS-I instruction set. The final optimized decoder requires 48 MIPS and 49 Kbytes of memory to decode a 128 kbps, 44.1 kHz joint stereo MP3 stream in real time with a CPI of 1.15, which is a performance increase of 46.7% and a memory reduction of 38.8% over the original decoding software¹.
Index Terms: Software Optimization, MP3 Decoder, RISC core, Embedded Systems.
I. INTRODUCTION
Low cost and fast time to market are the top requirements in the development of embedded portable devices. As a result, typical embedded portable devices are developed around a microprocessor architecture through software and hardware codesign. Because embedded system development adopts the design model of a processor plus application programs, making software run efficiently on the selected processor or IP core has become a main problem [1]. We met the same problem during the development of our MP3 [2] decoder. Many portable MP3 players [3][4] adopt a DSP core, or a dual core comprising a DSP and a RISC processor, because of the complexity of the application algorithms. However, with the development of processor design and semiconductor technology, the digital signal processing capability of RISC processors has approached the DSP level in recent years. Therefore, it has become feasible to implement real-time MP3 decoding on a single RISC core. This is especially useful for resource-constrained SoC (System on Chip) design [5], because the RISC core can be used as both the audio decoding unit and the control
unit. Based on the considerations above, RISC32, which was developed in the Department of Information Science & Electronic Engineering of Zhejiang University, is chosen as the target processor in this paper. RISC32 is a 32-bit RISC-based general-purpose processor targeted at embedded applications. It adopts a 6-stage pipeline consisting, in order, of instruction fetch (IF), instruction decoding (ID), execute (EX), data memory access (DM), data tag comparing (TC), and write back (WB). The main features are as follows:
◆ Binary compatible with the MIPS-I instruction set architecture, with media processing enhancements
◆ Separate 16 Kbytes instruction cache and 16 Kbytes data cache (Harvard architecture)
◆ Direct-mapped, write-through cache strategy
◆ 16 Kbytes on-chip RAM
◆ 200 MHz clock frequency
◆ 1 mW power dissipation per MHz
Since RISC32 employs limited hardware and a simple load/store architecture to reach a lower Cycles Per Instruction (CPI), and the compiler usually cannot generate optimal code for RISC32, the raw MP3 decoding software produced by a conventional C compiler runs very inefficiently. In fact, according to our implementation results, the MP3 decoding software without optimization required about 90 MIPS and a total of 80 Kbytes of memory to decode a 128 kbps, 44.1 kHz stereo bit stream on RISC32, which motivated the present work.
The paper is organized as follows. The general embedded software optimization techniques are presented in Section Ⅱ. According to these techniques, algorithm optimization of the decoding software in C is described in Section Ⅲ, and hardware and software co-optimization methods are given in Section Ⅳ. Finally, implementation results are given in Section Ⅴ.
¹ This work has been supported by the National Natural Science Foundation of China (NSFC) (69972043), by the National High Technology Research & Development Program of China (863 Program) (2002 AA1Z1140), and by the Fok Ying Tong Education Foundation (94031). Yingbiao YAO is a Ph.D. candidate in the Department of Information Science and Electronic Engineering, Zhejiang University, Hangzhou, P.R. China (e-mail: [email protected]). Qingdong YAO is a professor in the Department of Information Science and Electronic Engineering, Zhejiang University, Hangzhou, P.R. China (e-mail: [email protected]). Peng LIU is an associate professor in the Department of Information Science and Electronic Engineering, Zhejiang University, Hangzhou, P.R. China (e-mail: [email protected]). Contributed Paper. Manuscript received July 24, 2004.
II. EMBEDDED SYSTEMS SOFTWARE OPTIMIZATION TECHNIQUES
One of the difficult problems of embedded system design is how to make software run efficiently on the chosen processor. Generally, software optimization [6][7] is needed to reach optimal system performance by matching the program code to the processor. Embedded software optimization can be divided into two parts: algorithm optimization in a high-level language and code optimization at the assembly level. Due to the improved computational ability of present embedded processors, the computation load (mainly multiplication and
0098 3063/04/$20.00 © 2004 IEEE
Y. Yao et al.: Embedded Software Optimization for MP3 Decoder Implemented on RISC Core
addition) is no longer the dominant factor, while data movement has become important. Therefore, during algorithm optimization, we must trade off computation load, the complexity of data movement, and the size of the coefficient tables. This work can be accomplished in a high-level language, for example C or Java. During code optimization, we must take the instruction set, micro-architecture, and pipeline of the target processor into account to achieve efficient execution and a lower CPI on the target processor. This work must be accomplished at the assembly level. Based on these considerations, we propose the general embedded software optimization process shown in Fig. 1. The high-level language chosen in this paper is C. The first phase is the preparation stage, in which the main task is to analyze the constraints of the embedded system, for example memory size, power dissipation, and the digital signal processing capability of the specific processor. The optimization goals and the constraints shall be made clear in this phase. The next phase is to analyze the features of the specific processor, especially its micro-architecture and pipeline, which guide the embedded software optimization work. We call this co-optimization of hardware and software. The third phase, software optimization, is the subject of this paper; it is carried out first in C and then in assembly language. Assembly-level optimization is needed because no compiler for the embedded processor is yet good enough to meet the requirements identified in the first phase. The key idea of software optimization is that embedded software must be closely linked to the target processor. We must then perform software and hardware co-validation to make certain that the optimization goals are reached. Finally, we obtain the optimized object software.
The methods of embedded software optimization in the third phase are summarized in the following six steps.
⒈ Software module partition and performance estimation in C. The application software is divided into small modules. Each module is then run and profiled to collect instruction statistics, such as the number of multiplications and additions per module. This helps the designer understand the program structure and identify optimization targets in C.
⒉ Module algorithm optimization in C. The algorithm is modified to match the target processor, because even the best fast algorithms may not fit it well. Computation workload must be traded off against memory size when optimizing a module's algorithm, so validating the algorithm on the processor is indispensable.
⒊ General optimization in C. This step applies the usual C-level optimizations, for example loop unrolling and program structure modification, to help the compiler generate better code and to reduce the optimization workload at the assembly level.
⒋ Resource statistics of assembly code. This mainly covers the memory size of the object code and the MIPS consumed while the program runs on the specific processor.
⒌ Memory size optimization. Generally, the assembly program that a conventional compiler generates from a C program is optimized for a personal computer or workstation, where memory size is not a concern, so it needs more memory. Therefore, much attention must be given to memory optimization. The work has two facets: one is to reduce the memory used by the program, and the other is to optimize the utilization of high-speed on-chip memory.
⒍ Assembly program structure optimization. Because a conventional C compiler cannot generate the best code for a specific embedded processor, further optimization of the assembly code structure is indispensable. In this step, we must take the instruction set, micro-architecture, and pipeline of the target processor into account to achieve efficient execution and a lower CPI on the target processor.

[Flow chart blocks: system functions and constraint conditions analysis; the hardware features analysis; software optimization (in C language: software module partition and performance estimation, module algorithm optimization, general optimization; in assembler: resource statistics, memory optimization, assembly program structure optimization); software and hardware co-validation; decision "Success" (Y/N); object software output]
Fig. 1. The software optimization flow chart for embedded systems

III. MP3 DECODING SOFTWARE OPTIMIZATION IN C LANGUAGE
Before the co-optimization, we performed C-code level optimization to improve the performance of the decoding software. On the one hand, we chose the best fast algorithms, with lower computation load and smaller coefficient tables; on the other hand, we modified the code to better suit RISC32 and to help the C compiler generate better assembly code.
A. Selection of the Optimization Targets
According to the ISO 11172-3 standard [2], the MP3 decoding program can be divided into nine modules: system information decoding (including header and side information), Huffman decoding, requantization, reordering, joint stereo decoding, inverse modified discrete cosine transform (IMDCT), frequency inversion, alias reduction, and the synthesis polyphase filter bank. The decoding process is shown in Fig. 2. Table Ⅰ gives the profiling result of the ISO reference decoder, which employs floating-point computation. Table Ⅱ gives the result of our integer-based decoder, measured on our ISA (instruction set architecture) simulator before optimization. It needs 66.8 MIPS to decode in real time at the ideal CPI. From Table Ⅰ and Table Ⅱ, it can be concluded that subband synthesis consumes the most CPU resources, followed by IMDCT. However, requantization consumes fewer CPU resources in our integer-based decoder than in the ISO floating-point decoder. This is because the computationally expensive operations of requantization in the floating-point decoder are replaced by table lookups in the integer-based decoder. Therefore, the algorithm optimization focuses on the subband synthesis and IMDCT modules.

[Block diagram blocks: Bitstream; Sync seek; Head info decoding; Error checking; Huffman Information Decoding; Scalefactor Decoding; Huffman Code Bits; Huffman Information; Scalefactor Information; Huffman Decoding; Requantization; Reordering; Joint stereo Decoding; Alias Reduction; IMDCT; Frequency Inversion; Subband Synthesis; PCM]
Fig. 2. Decoding Process of MP3

TABLE Ⅰ
PROFILING OF ISO REFERENCE DECODER
Module               CPU Time (%)
Huffman decoding     4.1
Requantization       16.9
Stereo processing    1.2
IMDCT                17.9
Subband synthesis    58.2
Other                1.7

TABLE Ⅱ
PERFORMANCE OF OUR INTEGER DECODER WITHOUT OPTIMIZATION
Module               MIPS on ISA simulator    CPU Time (%)
Huffman decoding     4.1                      6.1
Requantization       3.2                      4.8
Stereo processing    1.5                      2.2
IMDCT                19.2                     28.7
Subband synthesis    33.4                     50
Other                5.4                      8.1
Total                66.8                     100

B. Algorithm Optimization at the C-Code Level
Although there are many efficient fast IMDCT algorithms, most of them have been derived for data sequences of length N = 2^n. Since the IMDCT in MP3 audio uses lengths of 36 points for long blocks and 12 points for short blocks, Britanak & Rao’s algorithm [8] is adopted in this paper.

TABLE Ⅲ
MULTIPLICATION & ADDITION IN IMDCT
Block type              Operation    ISO reference    Britanak & Rao’s
36-point long block     Mul          648              47
                        Add          630              165
12-point short block    Mul          72               13
                        Add          66               39

Subband synthesis is the most computationally intensive operation in the MP3 decoder. Its complexity is twofold: one part is the matrixing operation, and the other is the PCM output with window filtering. The matrixing operation in the subband synthesis filter is defined in (1):

$$V(i) = \sum_{k=0}^{31} \cos\left[\frac{\pi}{64}(2k+1)(i+16)\right] S(k), \quad 0 \le i < 64,\ 0 \le k < 32 \tag{1}$$
Konstantinides [9] has proved that equation (1) can be computed by a 32-point DCT and some data copy operations.
Following Konstantinides’s result, this paper employs the fast DCT algorithm suggested by Lee [10]. The final result is shown in Table Ⅳ.

TABLE Ⅳ
                 Mul     Add     Coefficient Sizes
ISO reference    2048    1984    2048
Lee’s            80      209     30

C. Zero Value Optimization
After Huffman decoding, the audio channel data comprises three parts: the big values region, the count1 region, and the rzero region, as shown in Fig. 3. Generally, all 576 sample points are processed in the following modules. However, this is unnecessary, because a zero input always yields a zero output in the requantization, reordering, joint stereo decoding, IMDCT, frequency inversion, and alias reduction modules. Therefore, we can set tags that mark the zero positions among the 576 frequency samples and skip these zero-valued points until the computation of equation (1). A specific example of saving computation in requantization by zero value optimization follows. Without zero value optimization: For (i=0; i