MPEG-4 VIDEO ENCODER ON ADI BLACKFIN DSP for DIGITAL IMAGING APPLICATIONS

PSSBK Gupta
Sr. Design Engineer
Emuzed India Pvt Ltd
Bangalore, India
T: 91-80-5252224, F: 91-80-5252223
[email protected]

Ramkishor Korada
Architect - Video
Emuzed India Pvt Ltd
Bangalore, India
T: 91-80-5252224, F: 91-80-5252223
[email protected]

ABSTRACT

The advent of flash memory technology has enabled digital camcorders to store video in multiple coding formats for different applications. MPEG-4 is one of the popular formats because of its compression efficiency and its suitability for sharing over wire-line or wireless networks. Power-efficient DSPs are used to encode and decode video because their programmability allows them to support multiple formats. The compression tools of the MPEG-4 video encoder are complex and computationally demanding. Hence, extensive optimization is required to design an MPEG-4 video encoder with real-time performance on DSPs such as Blackfin for camcorder applications. This paper describes in detail the implementation of an MPEG-4 simple profile video encoder on the Blackfin core that requires 175 mega cycles per second to encode CIF resolution video at 30 frames per second with low memory requirements.

Keywords

MPEG, DSP, Blackfin, Optimization and Video encoding.

1. INTRODUCTION

Flash memory is replacing conventional magnetic tape as the storage medium in camcorders because of advantages such as compactness, random access and low power consumption. Flash memory is used to store video in multiple formats such as MPEG-4 [1], MPEG-2 [2], H.263 [3], MPEG-1 [4], MJPEG [5] and DV. The capacity of currently available flash memories is lower than that of magnetic tapes [10]. Hence, a high compression ratio is required for storing large amounts of video content on flash memory. The content recorded on flash memory can be copied to hard disk for backup or archival purposes and shared over wired or wireless networks. The MPEG-4 simple profile video standard provides better compression efficiency than the preceding standards by using tools such as four motion vectors, unrestricted motion vectors and AC/DC prediction.

The power consumption of the different modules in a digital camcorder should be as low as possible. Programmable processors are suitable for supporting encoding and decoding of content in multiple formats. Recent DSP cores provide instructions suited to video and image processing as well as power management units, which makes them attractive processing cores for applications such as camcorders. The Blackfin DSP from ADI combines the capabilities of a dual-MAC DSP engine and an orthogonal RISC microprocessor instruction set. It also has dynamic power management, which allows continuous adjustment of the processor's voltage and frequency to balance power consumption and processor performance for real-time applications. Applications need to be optimized as far as possible because the processing power available on a camcorder is limited and application performance has a direct impact on battery life.

This work is aimed at efficient encoding of CIF resolution video at 30 frames per second on the Blackfin DSP core. Section 2 briefly describes the Blackfin architecture. The MPEG-4 simple profile video encoding process is explained in Section 3. Section 4 describes the optimization techniques in detail. Results are presented in Section 5. Conclusions are given in Section 6 and references are listed in Section 7.

2. BLACKFIN DSP ARCHITECTURE

Blackfin [6] is a single-core, load/store architecture that combines the capabilities of a dual-MAC DSP engine and an orthogonal RISC microprocessor instruction set. The Blackfin core contains two multiplier/accumulators (MACs), two 40-bit Arithmetic Logic Units (ALUs), one 40-bit shifter, four video ALUs, and an 8-entry 32-bit data register file. The computational units process 8-bit, 16-bit, or 32-bit data from the register file. Each MAC performs a 16-bit by 16-bit multiply in every cycle, with accumulation to a 40-bit result. The two ALUs can operate on 16-bit or 32-bit data. Each of the 32-bit data registers can be regarded as two 16-bit halves, so each ALU can perform very flexible single 16-bit arithmetic operations. By viewing the registers as pairs of 16-bit operands, dual 16-bit or single 32-bit operations can be accomplished in a single cycle. By also using the second ALU, quad 16-bit operations can be accomplished, which increases the per-cycle throughput. The 40-bit shifter can deposit data and perform shifting, rotating, normalization, and extraction operations.

Two Data Address Generators (DAGs) provide addresses for simultaneous dual operand fetches from memory. The DAGs share a register file containing four sets of 32-bit Index, Modify, Length, and Base registers. Eight additional 32-bit registers provide pointers for general indexing of variables and stack locations. Blackfin DSPs support a modified Harvard architecture in combination with a hierarchical memory structure.

Blackfin permits up to three instructions to be issued in parallel, with some limitations; all three instructions execute in the same amount of time as the slowest of the three. A load, store or pointer update can be issued in parallel with most other operations. This makes the Blackfin throughput nearly equal to that of a processor whose instructions take data memory addresses as operands. The Blackfin DSP core diagram is shown in Figure 1.

Figure 1. Blackfin DSP Architecture Core

The ADSP-21535 (one of the processors of the Blackfin family) has an eight-stage instruction pipeline: Instruction Fetch 1, Instruction Fetch 2, Instruction Decode, Address Calculation, EX1 (execution), EX2, EX3 and WB (write back).

Register file reads occur in the EX1 pipeline stage (for operands) and the EX3 pipeline stage (for stores). Writes occur in the WB stage. The multipliers and the video unit are active in the EX2 stage, and the ALUs and shifter are active in the EX3 stage. The accumulators are written at the end of the EX3 stage. Blackfin also has static branch prediction for conditional branch instructions to reduce pipeline stalls.

3. MPEG-4 VIDEO ENCODER

Raw video streams occupy a large amount of storage space and take a long time to transmit. Several international standards have been developed for compressing video content by exploiting the temporal and spatial correlation of video streams. The block diagram depicting the data flow across the different modules of the MPEG-4 simple profile video encoding process is shown in Figure 2.

Figure 2: Block Diagram of MPEG-4 Simple Profile Video Encoder

The encoder consists of modules for Motion Estimation (ME), Motion Compensation (MC), rate control, DCT, Quantization (Q), Inverse Quantization (IQ), IDCT, VOP Reconstruction (VOPR), AC/DC prediction, motion vector coding, Variable Length Encoding (VLE), bit stream formation, scene change detection and so on. It compresses the video frames by exploiting temporal and spatial correlation, using motion compensation and DCT tools. Motion estimation must estimate the motion efficiently so that motion compensation is effective in removing the temporal redundancy. DCT is used as the transform to compress the spatial data in I-VOPs and the residual information after motion compensation in P-VOPs. MPEG-4 has state-of-the-art compression tools such as adaptive AC/DC prediction, nonlinear quantization of DC coefficients in intra blocks, unrestricted motion vectors and four motion vectors to improve the compression efficiency. Besides compression efficiency tools, MPEG-4 has error resilience tools such as resynchronization markers, data partitioning, Reversible Variable Length Coding (RVLC) and Header Extension Code (HEC) to offer improved error resilience over wire-line and wireless channels.

4. OPTIMIZATION

Table 1. Module-wise Complexity Break-up before Optimization

Module Name       M cycles/s    % of Total Time
ME                   560.5           45.0
MC & VOPR            100.5            8.0
DCT & IDCT           408.7           32.8
Q & IQ               104.8            8.4
VLE                   45.5            3.7
Miscellaneous         26.2            2.1

The module-wise computational complexity break-up for a typical test sequence without any optimization is given in Table 1. The profile shows that the computationally intensive modules are ME, MC, DCT, IDCT and quantization. The optimization process used to reduce the computational complexity of the encoder consists of two levels: algorithmic level optimization and architecture level optimization. Besides reducing the computational complexity of the encoding process, the optimizations considerably reduce the program memory and data memory requirements. The following sub-sections give more details about the algorithmic and architecture level optimizations.

4.1 Algorithmic Level Optimization

Algorithmic optimization involves changing the algorithms in the high-level language C to reduce the computations. Apart from implementing optimal algorithms, optimization techniques such as loop unrolling, loop distribution and loop interchange [9] are used. Properties specific to MPEG-4 video encoding are exploited as far as possible to reduce complexity.

4.1.1 Motion Estimation

Table 1 shows that motion estimation requires the major portion of the processing power. A fast motion estimation algorithm based on spatio-temporal correlation [8] is used for integer-pixel motion vector estimation. This algorithm adaptively changes the search range according to the motion in the video sequence. The following optimizations are used to reduce the complexity further.

Eliminating Redundant Predictors: Predictors from spatially and temporally surrounding blocks are checked for repetition. Redundant predictors are eliminated from the predictor list and the remaining predictors are used for estimating the best predictor for motion estimation. Hence, this process reduces the computational complexity involved in finding the best predictor for motion estimation.
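The elimination of repeated predictors can be sketched in C as follows. This is only an illustration under stated assumptions: the `MotionVector` type and the function name `remove_redundant_predictors` are hypothetical, and the paper does not describe how the predictor list is stored.

```c
#include <stddef.h>

/* Hypothetical motion vector type; the paper does not give its layout. */
typedef struct {
    int x;
    int y;
} MotionVector;

/*
 * Removes repeated candidates from preds[0..count-1] in place, keeping the
 * first occurrence of each vector, and returns the number of unique
 * candidates. The predictor list is assumed to be short, so a simple
 * quadratic scan is adequate.
 */
size_t remove_redundant_predictors(MotionVector *preds, size_t count)
{
    size_t unique = 0;

    for (size_t i = 0; i < count; i++) {
        int seen = 0;
        for (size_t j = 0; j < unique; j++) {
            if (preds[j].x == preds[i].x && preds[j].y == preds[i].y) {
                seen = 1;
                break;
            }
        }
        if (!seen) {
            preds[unique++] = preds[i];
        }
    }
    return unique;
}
```

Each duplicate removed this way saves one full SAD evaluation in the predictor search that follows.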

Figure 3: Half Pixel Motion Vector Estimation

Half-Pixel Motion Vector Estimation: This step refines the integer-pixel motion vector to half-pixel accuracy. Instead of computing the SAD value for all eight half-pixel motion vectors around the integer-pixel motion vector (L, T, B, R, LT, RT, LB and RB, as shown in Figure 3), the new algorithm first computes the SAD values for L, T, B and R and compares them with the integer-pixel SAD value. If the integer-pixel SAD value is less than all four half-pixel SAD values, the integer-pixel motion vector is returned as the best half-pixel motion vector. Otherwise, the best of the four half-pixel motion vectors is found and the SAD values of its two adjacent diagonal half-pixel motion vectors are computed. For example, if the SAD value for L is the best, the SAD values for LT and LB are computed. Whichever of those three has the lowest SAD is returned as the best half-pixel motion vector. Since at most six of the eight half-pixel positions are evaluated, this algorithm reduces the computational complexity by at least 25%, and the change in PSNR is within +/- 0.03 dB. We also observed that the interpolated values computed for L can be reused for R; hence, the SAD computations for L and R are merged to reduce the computational complexity. The same procedure is used in all pixel interpolations.
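The two-stage half-pixel search described above can be sketched in C as follows. This is a minimal sketch, not the authors' implementation: the callback type `HalfPelSadFn`, the result struct, and the `demo_sad` cost function are hypothetical and exist only to make the example self-contained. Offsets are in half-pixel units with y growing downwards, and a tie between the integer position and the best side position is assumed to proceed to the second stage.

```c
#include <limits.h>
#include <stdlib.h>

/* Hypothetical cost callback: SAD for the half-pixel offset (dx, dy), each in
 * {-1, 0, +1}, relative to the integer-pixel motion vector. */
typedef unsigned (*HalfPelSadFn)(int dx, int dy, void *ctx);

typedef struct {
    int dx;        /* chosen half-pixel offset, horizontal */
    int dy;        /* chosen half-pixel offset, vertical */
    unsigned sad;  /* SAD of the chosen position */
} HalfPelResult;

HalfPelResult refine_half_pel(unsigned int_sad, HalfPelSadFn sad_fn, void *ctx)
{
    /* Stage 1 candidates in the order L, R, T, B. */
    static const int side[4][2] = { {-1, 0}, {1, 0}, {0, -1}, {0, 1} };
    HalfPelResult best = { 0, 0, int_sad };
    unsigned side_sad = UINT_MAX;
    int side_idx = 0;

    for (int i = 0; i < 4; i++) {
        unsigned s = sad_fn(side[i][0], side[i][1], ctx);
        if (s < side_sad) {
            side_sad = s;
            side_idx = i;
        }
    }

    /* Integer position beats all four neighbours: stop after 4 SADs. */
    if (int_sad < side_sad) {
        return best;
    }

    /* Stage 2: test the two diagonal positions adjacent to the best side. */
    best.dx = side[side_idx][0];
    best.dy = side[side_idx][1];
    best.sad = side_sad;
    for (int k = -1; k <= 1; k += 2) {
        int dx = (side[side_idx][0] != 0) ? side[side_idx][0] : k;
        int dy = (side[side_idx][1] != 0) ? side[side_idx][1] : k;
        unsigned s = sad_fn(dx, dy, ctx);
        if (s < best.sad) {
            best.dx = dx;
            best.dy = dy;
            best.sad = s;
        }
    }
    return best;
}

/* Illustrative cost function only: the SAD grows with the distance from a
 * target offset (tx, ty) and every evaluation is counted. */
typedef struct {
    int tx;
    int ty;
    int calls;
} DemoCtx;

unsigned demo_sad(int dx, int dy, void *ctx)
{
    DemoCtx *c = (DemoCtx *)ctx;
    c->calls++;
    return 100u + 10u * (unsigned)(abs(dx - c->tx) + abs(dy - c->ty));
}
```

With `demo_sad`, a target at the LB position is found after six SAD evaluations, and a target at the integer position after four, compared with eight for the exhaustive search.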

4.1.2 DCT & Quantization

Prediction of 'Not Coded' Blocks: A not-coded block is a block whose coefficients are all zero after quantization. Predicting, after motion compensation, that a block will be not coded avoids the DCT and quantization computations for that block. The SAD of the block after motion compensation is used as the metric for this prediction. If the ratio of SAD to quantizer scale is less than a threshold, the block is treated as not coded. With a higher threshold, more blocks are predicted as not coded and the computational complexity falls, but mispredictions increase and degrade quality. Hence the threshold was chosen as a trade-off between computational complexity and quality. Based on experiments with different sequences, a value of 25 was chosen as the threshold.

Multiplication Instead of Division using Table Lookup: One integer division takes about 32 cycles to execute, whereas a multiplication takes one cycle. Quantization requires one division per coefficient. This division can be replaced by a multiplication without loss of precision by using a lookup table that needs little constant data memory. The procedure is as follows. To convert x/y to a multiply operation, x*z is computed, where z is 1/y represented in fixed point with the required precision. The values of z for all possible values of y are stored in an array in memory. The same procedure is used wherever a division operation is required.

Last Position: The last position of a block is the position of the last non-zero coefficient, after which all the AC coefficients are zero. The last position of every block is computed while performing quantization. It is used to reduce the processing time of the Coded Block Pattern (CBP) calculation, IQ, IDCT, and VLE.
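The three techniques of this sub-section can be illustrated with the following C sketch. Several details are assumptions rather than facts from the paper: the quantizer is a simplified uniform one (level = |coef| / (2 * QP), not the exact MPEG-4 rule), the reciprocal is stored with 20 fractional bits, coefficient magnitudes are assumed to be at most 2048, and coefficients are assumed to be stored in scan order. All function and macro names are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

#define NOT_CODED_THRESHOLD 25u  /* threshold value reported in the text */
#define RECIP_SHIFT 20           /* assumed fixed-point precision of 1/y */
#define MAX_DIVISOR 62           /* assumed: y = 2 * QP with QP in 1..31 */

uint32_t recip_table[MAX_DIVISOR + 1];

/* Fills recip_table[y] with floor(2^RECIP_SHIFT / y) + 1 for y >= 1. */
void init_recip_table(void)
{
    recip_table[0] = 0; /* unused entry */
    for (uint32_t y = 1; y <= MAX_DIVISOR; y++) {
        recip_table[y] = ((uint32_t)1 << RECIP_SHIFT) / y + 1u;
    }
}

/* Returns x / y for 0 <= x <= 2048 and 1 <= y <= MAX_DIVISOR using one
 * multiply and one shift; the product fits in 32 bits for this range. */
uint32_t div_by_table(uint32_t x, uint32_t y)
{
    return (x * recip_table[y]) >> RECIP_SHIFT;
}

/* Tests SAD / quant_scale < threshold without performing a division. */
int predict_not_coded(uint32_t sad, uint32_t quant_scale)
{
    return sad < NOT_CODED_THRESHOLD * quant_scale;
}

/*
 * Simplified uniform quantizer: level = sign(coef) * (|coef| / (2 * qp)).
 * Returns the last position (index of the last non-zero level), or -1 when
 * the whole block quantizes to zero.
 */
int quantize_block(const int16_t coef[64], int16_t level[64], int qp)
{
    uint32_t divisor = 2u * (uint32_t)qp;
    int last = -1;

    for (int i = 0; i < 64; i++) {
        uint32_t mag = (uint32_t)abs(coef[i]);
        int16_t q = (int16_t)div_by_table(mag, divisor);
        level[i] = (coef[i] < 0) ? (int16_t)-q : q;
        if (q != 0) {
            last = i; /* remember the most recent non-zero level */
        }
    }
    return last;
}
```

For the assumed operand ranges the table-based quotient equals the integer quotient for every input, which can be confirmed by an exhaustive loop over x and y. A narrower reciprocal, such as 16 fractional bits, would not be exact over the same range.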

4.1.3 IQ, IDCT and VLE

Inverse Quantization: The inverse quantization process in the MPEG-4 simple profile can be expressed as follows:

$$ IQ[X(n)] = X(n) \cdot C1 + F(X(n)) \cdot C2 \qquad (1) $$

where IQ[X(n)] is the inverse quantized coefficient, X(n) is the quantized coefficient, and

$$ F(X(n)) = \begin{cases} -1, & X(n) < 0 \\ 0, & X(n) = 0 \\ +1, & X(n) > 0 \end{cases} \qquad (2) $$

C1 and C2 are constant for a block. IDCT requires pre-scaling to preserve precision. The pre-scaling is merged efficiently with the inverse quantization process by scaling the constants C1 and C2. Processing the coefficients only up to the last position further reduces the computational complexity of the IQ.

Variable Complexity IDCT using Last Position: Most of the high-frequency coefficients have zero value. Instead of performing the complete IDCT operation on the entire 8x8 block, a variable complexity IDCT is used. The last position of the block is used to find the number of rows that contain non-zero coefficients, and the row IDCT is applied only to those rows. This avoids computing the row IDCT of the remaining all-zero rows.

Variable Length Encoding: The AC coefficients of coded blocks are variable length encoded only up to the last position of the block to gain a computational advantage.
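Equation (1) and the use of the last position can be sketched in C as follows. The sketch assumes the H.263-style inverse quantization of MPEG-4, for which C1 = 2 * QP and C2 = QP when QP is odd or QP - 1 when QP is even. It omits the IDCT pre-scaling and the clipping of the result, assumes that levels are stored in zigzag scan order, and assumes that the output buffer has been cleared beforehand. All function names are hypothetical.

```c
#include <stdint.h>

/* Standard 8x8 zigzag scan: zigzag[i] is the raster index of scan position i. */
const uint8_t zigzag[64] = {
     0,  1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63
};

/* F(x) of equation (2) without a branch. Relies on arithmetic right shift of
 * negative values, which is implementation-defined in C but near universal. */
int32_t sign_of(int32_t x)
{
    return (x >> 31) | (int32_t)((0u - (uint32_t)x) >> 31);
}

/* Equation (1) applied to scan positions 0..last only; positions after
 * 'last' are assumed to be zero already in coef[]. */
void inverse_quantize(const int16_t level[64], int16_t coef[64], int qp, int last)
{
    int32_t c1 = 2 * qp;
    int32_t c2 = (qp & 1) ? qp : qp - 1;

    for (int i = 0; i <= last; i++) {
        int32_t x = level[i];
        coef[i] = (int16_t)(x * c1 + sign_of(x) * c2);
    }
}

/* Number of leading rows of the 8x8 block that can hold non-zero data when
 * the last non-zero coefficient is at scan position 'last' (-1 = none). */
int rows_needed(int last)
{
    int max_row = -1;

    for (int i = 0; i <= last; i++) {
        int row = zigzag[i] >> 3;
        if (row > max_row) {
            max_row = row;
        }
    }
    return max_row + 1;
}
```

In practice `rows_needed` would more likely be a precomputed 64-entry table indexed by the last position; the loop form is used here only to show where the values come from.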

4.2 Architecture Level Optimizations

Reordering the Conditional Branches: Blackfin supports static branch prediction, and a mispredicted branch takes more cycles than a correctly predicted one. Hence all conditional branches are arranged according to their probability of occurrence.

Converting Arrays into Pointers: Frequently accessed arrays are converted into pointers because the cross-compiled assembly code shows that the cross-compiler does not efficiently reuse array address calculations.

Using Zero Overhead Loops: The loops involved in motion estimation (finding the SAD value), motion compensation, DCT, Q, IQ, IDCT, and VLE are replaced with zero-overhead hardware loops, which reduces loop overhead without increasing the code size.
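The array-to-pointer conversion can be illustrated with the add-and-saturate step of VOP reconstruction. The C sketch below shows the same 8x8 kernel written once with array indexing and once with walking pointers. The function names and the buffer layout (prediction and reconstruction with a common stride, residual stored as a contiguous 8x8 block) are assumptions made for the example, and whether the pointer form is faster depends on the compiler.

```c
#include <stdint.h>

/* Saturates an integer to the 8-bit pixel range 0..255. */
uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Array-indexed form: every access recomputes y * stride + x. */
void reconstruct_indexed(uint8_t *recon, const uint8_t *pred,
                         const int16_t *resid, int stride)
{
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            recon[y * stride + x] =
                clip_u8(pred[y * stride + x] + resid[y * 8 + x]);
        }
    }
}

/* Pointer form: the addresses are advanced incrementally, so no index
 * arithmetic is needed inside the inner loop. */
void reconstruct_pointer(uint8_t *recon, const uint8_t *pred,
                         const int16_t *resid, int stride)
{
    int skip = stride - 8; /* distance from the end of a row to the next row */

    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            *recon++ = clip_u8(*pred++ + *resid++);
        }
        recon += skip;
        pred += skip;
    }
}
```

Both functions produce identical output; the pointer form simply makes the address progression explicit instead of relying on the compiler to derive it.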

Using Conditional Move: Blackfin supports a conditional move operation, which copies the source register to the destination register if the condition is true and otherwise leaves the destination register unmodified. In either case it takes one cycle. The conditional move operation is used to calculate the last position efficiently.

Exploiting the Video and Vector Operations: Video pixel operations and vector operations are powerful features of the Blackfin processor [7]. Video pixel operations are used for calculating the SAD, for interpolation, and for the add-and-saturate step involved in the ME, MC and VOP reconstruction modules. Vector operations are used in DCT and IDCT. In Q and IQ, two coefficients are processed at a time using the vector operations. The function F(X(n)) in equation (2) is implemented using shift operations, without a conditional branch.

Reducing the Function Call Overhead: In Blackfin, the function call overhead is 8 cycles (excluding the cycles required for register push and pop). To reduce this overhead, intensive loops that call a function are converted into functions that contain the loop. In addition, child functions invoked from only one location in the code are inlined using the 'inline' keyword, which removes the function call overhead without increasing the code size or harming code readability.
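To show the data layout that the video pixel operations exploit, the portable C sketch below computes a SAD by handling four 8-bit pixels packed in one 32-bit word per step. It only emulates the idea in plain C: it uses no Blackfin instructions or intrinsics, the function names are hypothetical, and on the DSP the inner per-lane loop would be replaced by hardware that processes the four pixels together.

```c
#include <stdint.h>
#include <string.h>

/* Sum of absolute differences of the four 8-bit lanes packed in a and b. */
uint32_t sad4(uint32_t a, uint32_t b)
{
    uint32_t sum = 0;

    for (int lane = 0; lane < 4; lane++) {
        int pa = (int)((a >> (8 * lane)) & 0xFFu);
        int pb = (int)((b >> (8 * lane)) & 0xFFu);
        sum += (uint32_t)(pa > pb ? pa - pb : pb - pa);
    }
    return sum;
}

/* SAD of a 16x16 block, consuming four pixels of each row per step. */
uint32_t sad16x16(const uint8_t *cur, const uint8_t *ref, int stride)
{
    uint32_t sad = 0;

    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x += 4) {
            uint32_t a, b;
            /* memcpy keeps the 32-bit loads valid for unaligned addresses. */
            memcpy(&a, cur + x, 4);
            memcpy(&b, ref + x, 4);
            sad += sad4(a, b);
        }
        cur += stride;
        ref += stride;
    }
    return sad;
}
```

Because the SAD sums over all four lanes, the result does not depend on the byte order of the host machine.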

4.2.1 Reducing the Stalls (Latencies)

Instruction Alignment Unit Empty Latencies: If the Instruction Alignment Unit (IAU) does not yet hold the next instruction, that instruction incurs a one-cycle stall while the IAU is being filled. To eliminate these stalls, functions and hardware loops are aligned to 64-bit boundaries.

Accumulator to Data Register Latencies: Consider the following instruction sequence:

    R2 = R3 + R1;
    R4.H = R2.L * R0.H;

Executing the second instruction takes one extra cycle because of the accumulator to data register latency. This type of latency is eliminated by unrolling the loop and reordering the instructions.

5. RESULTS

Representative test sequences are encoded at CIF resolution (352x288), 30 frames/s and a bit rate of 768 kbps, and the objective quality is measured in terms of PSNR. The processing power required is measured in M cycles/s. The results are given in Table 2. The module-wise percentage of processing time for a typical sequence is listed in Table 3. Table 4 gives the program memory and data memory requirements.

Table 2: Performance of Different Sequences after Optimization

Sequence Name     M cycles/s    PSNR (dB)
Foreman              152.5        34.79
Mcdonald             158.7        33.80
Coastguard           156.2        32.19
Tempete              157.4        29.83
Motorrace            143.6        31.57
Rowing               174.2        28.99

Table 3: Module-wise Complexity Break-up after Optimization

Module Name       M cycles/s    % of Total Time
ME                    47.3           31
MC & VOPR             17.7           11.6
DCT & IDCT            35.7           23.4
Q & IQ                17.3           11.4
VLE                   27.2           17.8
Miscellaneous          7.3            4.8

Table 4. Memory Requirements after Optimization

Program Memory    45 KB
Data Memory       10 KB + 2 frame buffers

6. CONCLUSIONS

This paper summarizes an efficient implementation of a real-time MPEG-4 simple profile video encoder on the Blackfin core. The results show that applying the above techniques greatly reduces the required computation power. Encoding a typical CIF sequence at 30 frames/s and 768 kbps into MPEG-4 format requires about 175 M cycles/s on the Blackfin core simulator, assuming single wait state memory. This is 58.3% of the processing power of the presently available Blackfin processor, which leaves some room for other applications such as a speech encoder.
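The figures in the tables can be put into perspective with a little arithmetic, sketched below in C. The helper names are hypothetical, and the comparison assumes that Table 1 and Table 3 were measured on the same sequence, which the paper does not state explicitly. Under that assumption the totals are 1246.2 and 152.5 M cycles/s, an overall reduction of roughly 8.2 times. A CIF frame contains 22 x 18 = 396 macroblocks, so 175 M cycles/s at 30 frames/s corresponds to roughly 14,700 cycles per macroblock.

```c
/* Number of 16x16 macroblocks processed per second. */
long macroblocks_per_second(int width, int height, int fps)
{
    return (long)(width / 16) * (height / 16) * fps;
}

/* Average cycle budget per macroblock for a rate given in M cycles/s. */
double cycles_per_macroblock(double mcycles_per_second, long mb_per_second)
{
    return mcycles_per_second * 1e6 / (double)mb_per_second;
}

/* Sum of the M cycles/s column of a module-wise table. */
double sum_rows(const double *rows, int count)
{
    double total = 0.0;

    for (int i = 0; i < count; i++) {
        total += rows[i];
    }
    return total;
}
```

Calling `sum_rows` on the M cycles/s columns of Table 1 and Table 3 and dividing the results gives the overall reduction factor; `cycles_per_macroblock(175.0, macroblocks_per_second(352, 288, 30))` gives the per-macroblock budget.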

7. REFERENCES

[1] "Information Technology – Coding of Audio-Visual Objects – Part 2: Visual," MPEG-4 Standard, ISO/IEC 14496-2.
[2] "Information Technology – Generic Coding of Moving Pictures and Associated Audio Information – Part 2: Video," MPEG-2 Standard, ISO/IEC 13818-2.
[3] "Video Coding for Low Bit Rate Communication," H.263 Standard, ITU-T Recommendation H.263, Feb. 1998.
[4] "Information Technology – Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to about 1.5 Mbit/s – Part 2: Video," MPEG-1 Standard, ISO/IEC 11172-2.
[5] JPEG Standard, "Information Technology – Digital Compression and Coding of Continuous-Tone Still Images – Requirements and Guidelines," Recommendation T.81, ITU, September 1992.
[6] Blackfin DSP Hardware Reference Manual, Nov 2001.
[7] Blackfin DSP Instruction Reference Manual, June 2001.
[8] Ramkishor Korada and S. Krishna, "Spatio-Temporal Correlation based Fast Motion Estimation Algorithm for MPEG-2," 35th IEEE Asilomar Conference on Signals, Systems and Computers, California, November 2001.
[9] Michael E. Lee, "Optimization of Computer Programs in C," Senior Programmer, Ontek Corporation, CA, USA.
[10] SD Memory Technology Explained, http://www.panasonic.com/consumer_electronics/sd/sd_explained.asp