Real Time Implementation of MPEG-4 Video Decoder on ARM7TDMI

K. Ramkishor
Digital Multimedia Group, eMuzed, Bangalore, India
[email protected]

V. Gunashree
Mobile Multimedia Solutions, Sasken Communication Technologies Ltd, Bangalore, India
[email protected]
Abstract: MPEG-4 simple profile is being used as the video compression standard in mobile video communications. Video compression requires a significant amount of processing power. ARM cores are currently widely used in mobile applications because of their low power consumption. As the available processing power is limited, optimization of the applications is inevitable. This paper describes in detail the implementation of an MPEG-4 simple profile video decoder on ARM7TDMI.
1 Introduction
Multimedia communication typically involves the transfer of large amounts of data. Therefore, compression of video, audio, and image data is essential for cost-efficient use of existing communication channels and storage media. Various coding standards have been developed for different types of applications. The latest MPEG-4 standard [1] defines a standardized framework for different types of multimedia applications; examples include mobile video communications, tele-shopping and interactive TV. The MPEG-4 simple profile is used mainly for mobile video communications. MPEG-4 video encoding and decoding require a significant amount of computational power. Embedded system applications have become very complex in recent times, and RISC cores are increasingly being used to cope with this complexity. The most popular RISC processor in embedded applications is the ARM core, because of its low power consumption and small silicon area, which are crucial for mobile applications. Hence, ARM cores are going to be used to implement video encoding and decoding algorithms in mobile video communications. In a mobile application, the amount of available processing power is limited and has a direct impact on battery life. As a result, optimizing applications to the maximum possible extent is necessary. Optimization can be done at the algorithmic level and at the architecture-specific level. This paper describes the optimization techniques applied to an MPEG-4 simple profile video decoder on the ARM7TDMI core.
Section 2 gives an overview of the ARM7TDMI architecture. Section 3 briefly describes the MPEG-4 simple profile video decoding process. Section 4 describes the optimization techniques in detail. Results are presented in Section 5, and conclusions are given in Section 6. References are listed in Section 7.
2 ARM7TDMI Architecture
ARM7TDMI is a 32-bit microprocessor that gives high performance for low power consumption, which makes it attractive for embedded applications [2]. The ARM architecture is based on RISC principles, and the instruction set and related decode mechanism are much simpler than those of CISC processors. This simplicity results in a high instruction throughput and impressive real-time interrupt response from a small and cost-effective chip. Pipelining is employed so that all parts of the processing and memory systems can operate continuously: typically, while one instruction is being executed, its successor is being decoded, and a third instruction is being fetched from memory. The ARM memory interface has been designed to allow the performance potential to be realized without incurring high costs in the memory system. ARM7TDMI does not have an instruction or data cache. It is mostly used as a controller core rather than for data processing, but by intelligent use of the features provided, it can be easily adapted to the video decoding process. The features of ARM7TDMI relevant to this application are:
• Fifteen general-purpose registers, of which the last one (R14) is the link register used while branching.
• Conditional execution of instructions.
• Block data transfer instructions.
• MUL (multiply) and MLA (multiply and accumulate) instructions with source-operand-dependent execution cycle times.
• Configurable little-endian or big-endian mode of operation.
These features of the ARM7TDMI architecture are exploited, and general RISC optimization techniques are also used.
Figure 1: MPEG-4 Video Decoding Process. The encoded bit stream is parsed; the motion part of the coded bit stream goes through motion vector decoding and motion compensation (using the previous reconstructed VOP), while the texture part goes through variable length decoding, inverse scan, inverse DC & AC prediction, inverse quantization and texture decoding (IDCT). VOP reconstruction combines the two paths to produce the YUV data.
3 MPEG-4 Decoding Process
The MPEG-4 decoding procedure is depicted in Figure 1. It consists of bit stream parsing, Variable Length Decoding (VLD), inverse DC and AC prediction, inverse scanning, Inverse Quantization (IQ), Inverse Discrete Cosine Transform (IDCT), Motion Compensation (MC) and Video Object Plane (VOP) reconstruction.
4 Optimization Techniques
The compute-intensive processes of the decoder are IDCT, MC and VOP reconstruction. Table 1 gives the profile information of the decoding process.

Table 1: Profile Information of the Decoding Processes before Optimization

Module Name                                                      Time Spent in Module
Inverse Discrete Cosine Transform (IDCT)                         59%
Reconstruction of Macroblock/VOP                                 14%
Inverse Quantization, Inverse Scanning and Inverse Prediction    12%
Motion Compensation                                              10%
Parsing & Variable Length Decoding                                5%

It is evident from Table 1 that there will be considerable improvement in performance if IDCT, MC, IQ and VOP reconstruction are optimized to the maximum possible extent. The bit parsing and VLD functions are called the largest number of times, but the gain from optimizing them would be small compared to the gains in the other modules. Hence, the functions in parsing and VLD are subject only to coarse optimization. At first, the different modules are optimized at the algorithmic level [4]. The assembly code generated by the compiler for these functions is then studied, and the functions that need to be hand coded in assembly are decided upon. The different approaches used in optimization, both at the algorithmic level and at the assembly code level, are explained in detail in the following sections.

4.1 Algorithmic Optimization
This involves changing the algorithms in the high-level language 'C' to reduce computations. Apart from implementing optimal algorithms, optimization techniques such as loop unrolling, loop distribution and loop interchange are used. Characteristics specific to MPEG-4 video decoding and low bit-rate video are exploited wherever possible to reduce computations.
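As an illustration of loop unrolling (a sketch, not code from the paper), the following C fragment shows a fixed-width 8-element row update with the loop fully unrolled; removing the counter update and branch from such small, fixed-count loops is where the technique pays off.

/* Illustrative sketch only: unrolling the inner loop of a simple 8-tap row
 * operation so straight-line code is issued instead of a counted loop with a
 * branch per element. */
#include <stdint.h>

void add_row(int16_t *dst, const int16_t *src)
{
    /* Rolled form: for (int i = 0; i < 8; i++) dst[i] += src[i]; */
    dst[0] += src[0];  dst[1] += src[1];
    dst[2] += src[2];  dst[3] += src[3];
    dst[4] += src[4];  dst[5] += src[5];
    dst[6] += src[6];  dst[7] += src[7];
}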
IDCT
Chen's algorithm for the IDCT is generally applied to a block of 8x8 pixels. In sequences for videophone applications there is high spatial and temporal correlation, which results in very few significant (non-zero) coefficients after the DCT at the encoder; the number of non-zero coefficients reduces further after quantization. Instead of a generalized 8x8 IDCT, the IDCT can be applied only to the significant coefficients, depending on how many there are. This optimizes the IDCT to a great extent, since the default 8x8 IDCT involves unnecessary multiplications and additions for the zero-valued coefficients.
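A minimal sketch of this dispatch idea is shown below, assuming a hypothetical full idct_8x8() routine and a last_nonzero index supplied by the preceding stages; when only the DC coefficient is present, every output pixel reduces to (DC + 4) >> 3.

/* Sketch (assumptions, not the paper's code): dispatch the IDCT on the number
 * of non-zero coefficients. idct_8x8() is a hypothetical full 2-D Chen IDCT;
 * last_nonzero is the index of the last non-zero coefficient in scan order. */
#include <stdint.h>

void idct_8x8(int16_t block[64]);            /* full 2-D IDCT, assumed to exist */

static void idct_dc_only(int16_t block[64])
{
    /* With only the DC term present, every output pixel equals (DC + 4) >> 3. */
    int16_t v = (int16_t)((block[0] + 4) >> 3);
    for (int i = 0; i < 64; i++)
        block[i] = v;
}

void idct_dispatch(int16_t block[64], int last_nonzero)
{
    if (last_nonzero == 0)
        idct_dc_only(block);                 /* cheapest path: one shift, one fill */
    else
        idct_8x8(block);                     /* could be split further by row count */
}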
Coding Redundancy Reduction
The coded block pattern (CBP) is a parameter that indicates whether the blocks in a macroblock are coded or not. It is a set of six bits, each of which represents one block in the macroblock; the status of each bit shows whether the corresponding block is coded. If all coefficients of a block are zero, that block is not coded. This parameter is checked before IQ, IDCT, inverse scanning and reconstruction for every block, and these processes are skipped for blocks whose CBP bit is zero. By using the CBP in the decoding process, the redundant computation on a per-block basis is eliminated, which translates to a better decoder performance in terms of MIPS requirement. To further reduce the redundancy within a block, the position of the last non-zero coefficient in the block is found and is used to reduce the computation in IQ, inverse scanning and the IDCT.
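The per-block check can be sketched in C as follows (hypothetical helper name; the bit-to-block mapping shown is an assumption for illustration).

/* Sketch of the CBP check: skip texture decoding entirely for blocks whose CBP
 * bit is clear.  Assumes bit 5 maps to block 0 and bit 0 to block 5. */
#include <stdint.h>

void decode_block_texture(int16_t coeffs[64], int block_index);  /* IQ + inverse scan + IDCT, assumed */

void decode_macroblock_texture(int16_t coeffs[6][64], unsigned cbp)
{
    for (int b = 0; b < 6; b++) {
        if (cbp & (1u << (5 - b)))
            decode_block_texture(coeffs[b], b);
        /* else: all coefficients are zero, nothing to do for this block */
    }
}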
Memory Access Reduction
It is always desirable to reduce memory accesses as far as possible in any implementation. Since there is no cache in ARM7TDMI, this is especially important, because every access goes to external memory. By combining decoding processes wherever possible, the number of accesses can be reduced. Inverse quantization and inverse scanning are combined into a single process, which requires just one load and one store for every non-zero coefficient. Similarly, saturation and reconstruction of pixels are done immediately after the IDCT. This again saves one load and one store per coefficient.
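A sketch of the fused inverse scan and inverse quantization pass is shown below; the zig-zag table name and the H.263-style quantizer formula are assumptions for illustration, not the paper's code.

/* Each non-zero coefficient is loaded once from the run-length-decoded buffer
 * and stored once into its final reordered position.  block[] is assumed to be
 * pre-cleared to zero; coefficient saturation is omitted for brevity. */
#include <stdint.h>

extern const uint8_t zigzag[64];   /* scan-order -> block-position table (assumed) */

void inv_scan_and_quant(const int16_t *scan_coeffs, int last_nonzero,
                        int16_t block[64], int quant)
{
    for (int i = 0; i <= last_nonzero; i++) {
        int16_t level = scan_coeffs[i];            /* one load  */
        if (level != 0) {
            int sign = (level < 0) ? -1 : 1;
            int mag  = sign * level;
            /* Simplified H.263-style inverse quantization for illustration. */
            level = (int16_t)(sign * (quant * (2 * mag + 1) - ((quant & 1) ? 0 : 1)));
            block[zigzag[i]] = level;              /* one store */
        }
    }
}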
Multiplication instead of Division using Table-lookup
One division operation takes 60 to 120 cycles to execute, while a multiplication takes a maximum of four cycles. In computations where division can be replaced by multiplication without loss of precision, it is advantageous to do so. However, this requires additional memory for storing the lookup data needed for the multiplication. The procedure is as follows: if x/y is to be converted to a multiply operation, x*z is computed, where z is 1/y represented in fixed point with the required precision. The values of z for all possible values of y are stored in an array in memory, and z is accessed directly by indexing with y. This technique is used in the inverse AC and DC prediction functions to replace all divisions with multiplications.
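A possible C sketch of the table-lookup approach is shown below; the table size and the 16-bit precision are example values chosen for illustration, not the paper's constants.

/* x / y is replaced by (x * recip[y]) >> SHIFT, where recip[y] holds 1/y in
 * fixed point.  SHIFT and MAX_DIVISOR are assumptions; the precision must be
 * chosen large enough for the dynamic range of x (and signed values handled
 * separately). */
#include <stdint.h>

#define SHIFT        16
#define MAX_DIVISOR  64

static uint32_t recip[MAX_DIVISOR + 1];

void init_recip_table(void)
{
    for (int y = 1; y <= MAX_DIVISOR; y++)
        recip[y] = ((1u << SHIFT) + (uint32_t)y - 1) / y;   /* ceil(2^SHIFT / y) */
}

static inline uint32_t div_by_table(uint32_t x, uint32_t y)
{
    return (x * recip[y]) >> SHIFT;
}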
4.2 Assembly/Architecture Level Optimization
The code generated by the compiler for IQ, inverse scanning, IDCT, MC and reconstruction of the VOP is evaluated for efficiency. IDCT, reconstruction of the VOP and MC are identified as the modules that could be hand coded in assembly to make better use of the specific features of the ARM7TDMI architecture [3]. The following paragraphs detail these features and the modules for which they are used.

Optimal Number of Function Parameters
The ARM compiler uses four registers, R0-R3, to pass parameters to a function. If more than four parameters are passed, the excess parameters are pushed onto the stack and have to be loaded from the stack when they are first accessed in the function. By limiting the number of parameters to four or fewer, they can be used directly without any loads, since the values are already available in registers.
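The idea can be sketched at the signature level as follows (hypothetical function and structure names): passing a single context pointer plus the varying index keeps all arguments in R0-R3, whereas six separate arguments would spill two of them to the stack.

/* Illustrative only: argument count kept at four or fewer so everything
 * arrives in registers under the ARM calling convention. */
#include <stdint.h>

typedef struct {
    int16_t *coeffs;     /* block coefficients               */
    uint8_t *ref;        /* reference (previous) frame data  */
    uint8_t *dst;        /* destination in the current frame */
    int      stride;     /* frame line stride in bytes       */
    int      quant;      /* quantizer for this macroblock    */
} BlockCtx;

/* Six separate arguments would push two of them onto the stack ...           */
void reconstruct_block_6args(int16_t *coeffs, uint8_t *ref, uint8_t *dst,
                             int stride, int quant, int block_index);

/* ... whereas one context pointer plus the varying index fits in R0-R1.      */
void reconstruct_block(const BlockCtx *ctx, int block_index);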
Reduction of Memory Accesses using LDM and STM
The LDM instruction loads multiple words from increasing or decreasing addresses into different registers; similarly, the STM instruction stores multiple data words to increasing or decreasing memory addresses. This feature is very useful, as it is less expensive in terms of execution cycles than single-word loads and stores. It is used effectively in the IDCT to fetch all the coefficients of one row at a time. In MC and VOP reconstruction as well, a set of data words is fetched in one execution of the instruction and is held temporarily in multiple registers for later use.
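The access pattern can be sketched in C as follows; the paper hand codes this in assembly, and the union layout assumes 4-byte alignment and the little-endian mode described in Section 2.

/* One IDCT row of eight 16-bit coefficients occupies four consecutive 32-bit
 * words, so it can be brought in with a single LDM of four registers. */
#include <stdint.h>

typedef union {
    int16_t  s16[8];   /* one IDCT row as eight halfwords   */
    uint32_t w[4];     /* the same row as four packed words */
} Row;

void fetch_row(const Row *src, uint32_t dst[4])
{
    /* Four word loads from consecutive addresses: a compiler (or hand-written
     * assembly) can issue these as a single LDMIA of four registers. */
    dst[0] = src->w[0];
    dst[1] = src->w[1];
    dst[2] = src->w[2];
    dst[3] = src->w[3];
}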
Avoiding Half Word and Byte Load/Store
The dynamic range of the data processed in the IDCT and IQ is that of 16-bit signed integers. As a result, for loading and storing the coefficients the compiler typically uses the half-word load/store instructions LDRSH and STRH. Packing two 16-bit half words into a word and using LDM and STM is the better way to access 16-bit signed integers. The additional overhead is packing and unpacking the data appropriately, but this overhead is small compared to the cost of using half-word load/store instructions.
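A sketch of the packing and unpacking helpers is shown below, assuming the little-endian configuration described in Section 2.

/* Two 16-bit coefficients packed into one 32-bit word so they can be moved
 * with word (LDM/STM) transfers instead of halfword loads and stores. */
#include <stdint.h>

static inline uint32_t pack2(int16_t lo, int16_t hi)
{
    return ((uint32_t)(uint16_t)lo) | ((uint32_t)(uint16_t)hi << 16);
}

static inline void unpack2(uint32_t w, int16_t *lo, int16_t *hi)
{
    *lo = (int16_t)(w & 0xFFFFu);   /* lower halfword */
    *hi = (int16_t)(w >> 16);       /* upper halfword */
}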
Customized Memset and Memcpy Functions
The compiler-supplied memset and memcpy functions use LDM and STM instructions, but since these functions are generalized for all data types and block sizes, they include a number of checks for different cases and sizes. These unnecessary checks can be avoided by writing customized functions that exactly suit the specific requirements.
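For example, a clear routine specialized to a 64-coefficient block might look like the following sketch (illustrative, assuming the block is word aligned).

/* The block size is fixed at 64 int16_t coefficients (128 bytes), so all the
 * size and alignment checks of a general memset can be dropped and the loop
 * reduces to straight word stores that map well onto STM. */
#include <stdint.h>

void clear_block(int16_t block[64])
{
    uint32_t *p = (uint32_t *)block;      /* 128 bytes = 32 words, word aligned */
    for (int i = 0; i < 32; i += 4) {     /* 4 word stores per iteration -> STMIA-friendly */
        p[i + 0] = 0;
        p[i + 1] = 0;
        p[i + 2] = 0;
        p[i + 3] = 0;
    }
}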
Conditional Execution of Instructions
Conditional execution is supported for all arithmetic, logic and data move instructions in the ARM7TDMI; optionally, an instruction can also update the flags when it executes. In general, a CMP or TST instruction is used to set the flags required for branching, but by using conditional execution the CMP or TST can often be avoided. Conditional execution is typically used for loop exit conditions and for saturation. It saves one CMP instruction when exiting a loop; for loops with a large trip count, removing even one instruction from the loop body is a significant gain. Saturation of pixel values has to be done frequently in video processing, and nested 'if' statements are generally used for it. Conditional execution is applied to these nested 'if' statements to remove the branches, giving them a fixed execution latency. For example, assume that a value obtained after an addition must be saturated to the range 0 to 255 (this happens at the end of reconstruction for every pixel): saturation using conditional execution has a fixed latency of four cycles without any branches.
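The saturation step can be sketched in C as follows; with the ARM compiler the two comparisons map to CMP plus conditionally executed MOV instructions, four instructions in total, matching the fixed four-cycle latency mentioned above (the paper implements this in hand-written assembly).

/* Branch-free clamp of a reconstructed pixel to [0, 255]. */
#include <stdint.h>

static inline uint8_t saturate_u8(int v)
{
    if (v < 0)        v = 0;      /* e.g. CMP r0,#0   ; MOVLT r0,#0   */
    else if (v > 255) v = 255;    /* e.g. CMP r0,#255 ; MOVGT r0,#255 */
    return (uint8_t)v;
}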
Efficient Method for Motion Compensation
Motion compensation is the process of predicting macroblocks in the current frame using motion vectors. It involves fetching every pixel of the macroblock from a region of the previous frame determined by the motion vectors and storing it in the current frame memory. Each pixel is one byte in size, so if motion compensation is done on a byte (pixel) basis, byte loads and byte stores have to be used, which are the costliest memory access operations. To avoid this, an efficient method of motion compensation is devised that operates on whole words of data. The ARM7TDMI does not provide SIMD features, but in this method a SIMD-style operation is simulated, so that motion compensation is carried out for four pixels simultaneously. Memory reads on ARM7TDMI are always word aligned; in cases where the horizontal motion vector is not word aligned, the words spanning the required pixels are fetched from the nearest word-aligned boundaries using an LDM instruction, and the required bytes are rearranged into one word so that operations can be performed on words of data rather than bytes or half words. The logic for doing motion compensation on four pixels simultaneously is as follows. Motion compensation implements the equations (a+b+1-rounding_control)/2 or (a+b+c+d+2-rounding_control)/4, where rounding_control can take the value 0 or 1. The first equation is rewritten as a/2 + b/2 + z, where z is ((a AND b) AND 1) if rounding_control is 1 and ((a OR b) AND 1) if rounding_control is 0. Similar logic is used to implement the second equation. With the modified equations, the motion compensation operation can be performed on four bytes at a time.
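Following the rewritten equation above, the averaging of four packed pixels can be sketched in C as shown below (illustrative; the paper implements this in hand-written assembly).

/* Simulated-SIMD average of four pixels packed one byte each into a word. */
#include <stdint.h>

static inline uint32_t avg4_pixels(uint32_t a, uint32_t b, int rounding_control)
{
    /* Per-byte correction bit z: (a AND b) & 1 when rounding_control is 1,
     * (a OR b) & 1 when it is 0. */
    uint32_t z = (rounding_control ? (a & b) : (a | b)) & 0x01010101u;

    /* Per-byte a/2 and b/2: shift, then mask off the bit that crossed over
     * from the neighbouring byte.  Each byte sum is at most 127+127+1 = 255,
     * so no carries propagate between bytes. */
    return ((a >> 1) & 0x7F7F7F7Fu) + ((b >> 1) & 0x7F7F7F7Fu) + z;
}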
Efficient Ordering of the Coefficients for the Multiplication Instruction in IDCT
The number of execution cycles for a multiplication depends on the number of 8-bit multiplier array cycles required to complete the multiply, which in turn depends on the value of the multiplier operand. The cycle count is:
• 1 if bits [32:8] of the multiplier operand are all zero or all one,
• 2 if bits [32:16] of the multiplier operand are all zero or all one,
• 3 if bits [32:24] of the multiplier operand are all zero or all one,
• 4 in all other cases.
One operand of the multiplications in the IDCT is the inverse-quantized coefficient, which has a 12-bit dynamic range; the other operand is the IDCT constant, represented using 16 bits. Most of the DCT coefficients have a dynamic range of less than 8 bits, owing to quantization and the limited high-frequency detail in the signal. By appropriately arranging the operands of the MUL instruction, the cycle count can therefore be reduced. This principle can be applied every time the MUL instruction is used, given prior knowledge of the operands being multiplied.
5 Results
Table 2 gives the MHz counts, before and after optimization, for the decoder processes that were subjected to optimization. These figures are the number of processor cycles taken by each process to decode one second of video comprising 15 frames of QCIF (176x144) size. The results were collected by decoding the hall-monitor sequence, which is typical of videophone applications. Table 3 summarizes the results for one second of video (15 QCIF frames) for various sequences of different complexity, all of them characteristic of videophone applications. The MHz counts cover the MPEG-4 decoding process only, i.e., generating YUV data from the encoded stream.
Table 2: Decoder Process Execution Cycles

Process                  MHz before Optimization    MHz after Optimization
IQ                       3.7                        1.25
IDCT                     14.0                       2.46
MC                       3.7                        1.40
Reconstruction of VOP    2.3                        1.10
Table 3: Optimization Results for Different Sequences

Sequence       MHz for 15 Frames before Optimization    MHz for 15 Frames after Optimization
Hall-monitor   71                                        12.7
Suzie          84                                        15.0
Foreman        100                                       16.0
Car-phone      87                                        15.0
Akiyo          56                                        12.5
News           54                                        12.0
6 Conclusions
This paper has summarized the different methods used to optimize the MPEG-4 Simple Profile Video Decoder for real-time performance on the ARM7TDMI processor. A significant reduction in MHz is obtained after implementing all of the optimization techniques described above. By exploiting characteristics specific to the MPEG-4 video decoder and by using RISC architecture specific optimization techniques, it is possible to reduce the decoding time considerably. Similar techniques can be used to optimize an MPEG-4 video encoder as well.
7 References
[1] "Information Technology – Generic Coding of Audio-Visual Objects – Part 2: Visual," MPEG-4 standard, ISO/IEC JTC 1/SC 29/WG 11 N 2688, Seoul, March 1999.
[2] "ARM7TDMI Data Sheet," ARM DDI 0029E, August 1995.
[3] "Writing Efficient C for ARM," Application Note 34, ARM DAI 0034A, January 1998.
[4] Michael E. Lee, "Optimization of Computer Programs in C," Ontek Corporation, CA, USA.
[5] John L. Hennessy and David A. Patterson, "Computer Architecture: A Quantitative Approach."