Efficient FPGA Implementation of H.264 CAVLC Entropy Decoder

Ali Siblini∗, Elias Baaklini‡, Hassan Sbeity†, Ahmad Fadlallah† and Smail Niar‡
∗ Lebanese University, Beirut, Lebanon. Email: [email protected]
† Arab Open University, Beirut, Lebanon. Email: {hsbeity, afadlallah}@aou.edu.lb
‡ University of Valenciennes, France. Email: {elias.baaklini, smail.niar}@univ-valenciennes.fr
Abstract—Multiprocessor system-on-chip (MPSoC) is the dominant architecture in embedded systems. Applications need to be multi-threaded to benefit from the concurrency an MPSoC provides. Many parallel versions of the emerging H.264/AVC standard [1] already exist. However, a fully parallel H.264 decoder is blocked by the fact that all parts of the decoder depend on the first, sequential stage of the decoding process, the entropy decoder (mainly CAVLC). The entropy decoder consumes about 30% of the total decoding time [8]. In this work, we propose an optimized FPGA design that meets the demands of multi-threaded H.264 decoders and can be integrated in an MPSoC. We focus on time optimization and on reducing the number of cycles needed to decode an encoded 4x4 block of pixels. We also aim for a design that operates at high frequencies. The resulting design decodes at least 62 frames per second at HD resolution (1280x720). Decoding one block of 4x4 pixels takes at most 22 clock cycles. The design has an upper frequency limit of 247MHz. High-resolution frames such as 1920x1088 FHD (full high definition) video maintain a minimum frame rate of 30 fps.

Keywords-FPGA, CAVLC, Entropy Decoding, H.264, Video Decoding, Multi-Core, Embedded Systems
I. INTRODUCTION

Multimedia hand-held devices are nowadays an essential part of daily life. Smartphones are generally equipped with high-definition screens and fast multi-core processors. Video decoding algorithms are becoming more complex in order to achieve better compression for high-definition video. H.264 is currently one of the most widely used video compression standards for capturing, compressing, and broadcasting high-definition video. H.264 offers better compression and better quality at the expense of higher algorithmic complexity [1]. The decoding process in H.264 is divided into the following main stages: entropy decoder (ED), inverse quantization and inverse transform (IQT), motion compensation (MC), intra-prediction (IP) and deblocking filter (DF). Figure 1 shows the simplified block diagram of the decoder. At both the encoder and the decoder, a video frame is processed in all steps as a set of macroblocks (MB), where a macroblock is in general a block of 16x16 pixels. The entropy decoding algorithm used in our research is context-adaptive variable-length coding (CAVLC). The time consumed by the CAVLC decoder is estimated to be between 20% and 30% of the total time needed to decode a video sequence [8]. As a minimum requirement, the decoder should decode video with a 720x480 resolution at a rate of 30 fps (frames per second).

Fig. 1. H.264 decoder stages

In order to benefit from existing and future multi-core processors, it is crucial to remove bottlenecks that require sequential execution, such as the CAVLC algorithm, so that parallel optimization techniques can achieve their full speedups. Optimized parallel algorithms allow FHD video sequences to be decoded in real time at a minimum rate of 30 fps.

The remainder of the paper is organized as follows. In Section 2, we present the related work concerning CAVLC FPGA implementations. In Section 3, we describe our approach to designing and implementing CAVLC on FPGA. In Section 4, we describe the experiments and present the results of our approach. In Section 5, we discuss and analyze the obtained results, including performance and bottlenecks. Final conclusions and future work are given in Section 6.

II. RELATED WORKS

Several optimization studies exist for the design and implementation of the H.264 CAVLC decoder. Some improvements increase the frequency and the frame rate. Others decrease the number of logic gates and the amount of energy consumed by the decoder. Most designers aim to minimize the number of clocks needed to decode a block of 4x4 pixels and to maximize the upper frequency limit.
A CAVLC encoder is designed in [7]. It reaches a frame rate of 36-41 fps for 1280x720 HD resolution at a frequency of 1070 MHz. The design in [9] decodes CIF resolution, 352x288, at a very low frequency, 15 MHz, with a frame rate of 32 fps. That design has very low power consumption; however, the frequency needs to be increased in order to decode higher resolutions. [5] proposes a very fast architecture using pipelining and multi-symbol decoding. That design needs further tests to confirm its reported result of decoding FHD, 1920x1088, video streams at 30 fps with a clock speed of 74.25 MHz. Other works such as [10] and [4] present important designs and ideas, but accurate results can only be assessed with their complete implementations.

III. CAVLC DECODER

As illustrated in figure 1, the bitstream output by the entropy encoder is the input to the decoder. The bitstream can be seen as a sequence of bits made up of codewords whose lengths are not known to the decoder in advance; each codeword must be decoded to extract the coefficients needed to rebuild the 4x4 blocks. So, from the decoder's perspective, we cannot locate where the next coefficient begins before decoding the previous one, which makes this process an inherently sequential task. The CAVLC entropy decoder works on each block of 4x4 pixels and adapts some of its parameters during decoding. Adaptation depends on features of the previously decoded blocks and on the content of the current 4x4 block. Before coding, the block is scanned in a zigzag manner to obtain an array of 16 coefficients. The elements to be coded are described in the following four steps, as illustrated in figure 2:

1. Count the total number of non-zero coefficients in the array and assign it to the variable TotCoeff, which can range from 0 to 16.
2. Count the number of trailing ones, the +/-1s from the end to the start of the array, and assign the result to the variable T1s. TotCoeff and T1s are coded together and the resulting code is called the coeff token.
3. Once the number of trailing ones is known, each trailing one is assigned the bit code 0 if it is +1 or the bit code 1 if it is -1. These bits are appended after the coeff token code.
4. The non-zero coefficients, except the trailing ones, are called levels. These coefficients are coded using a dedicated prefix-suffix algorithm. Note that T1s is limited to a maximum value of 3; if there are more than three +/-1s, the remaining +/-1s are coded as levels.

Fig. 2. H.264 entropy decoding process

IV. EXPERIMENTS
The proposed design is implemented in VHDL [6]. The ISE 12.1 software from Xilinx [3] is used to implement, check and simulate the design. The Virtex-6 lower power family is used, specifically the XC6VLX75TL device. This device has low power consumption compared to other devices in the Virtex-6 family. Virtex-6 devices in general have a large number of logic blocks and can sustain a fast clock frequency. For each module, we create a new VHDL file containing test statements, including the clock and the inputs to the module under test. We then check the outputs using the ISim simulator (included in the ISE tool), which takes as input the VHDL module and its corresponding test module and displays all signals in a waveform view. The waveform of the level decoder module shows that the output is available one clock after enable goes high. In fact, the computation and all the logic are implemented as combinational logic that settles in less than one clock cycle (10 ns), so the output is available after a delay of less than 10 ns. Input data are synchronized with the enable signals (FSM module). The transitions between states drive the temporary signals needed to fetch data from the 64-bit buffer.

A very important step in the design flow is implementing the design, which includes three stages: translate, map, and place and route. Translate converts the written code into a set of lookup tables and logic gates. Map matches the requirements produced by the translate stage (lookup tables and gates) with the resources available in the library for the selected device. Place and route assigns these requirements to the actual resources in the hardware. It also creates the routes between the selected CLBs on the device using the programmable interconnect, called the switch matrix. The design consumes 2% of the LUTs of the device.
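As a software illustration of the symbols handled by the decoder modules, the sketch below classifies a zigzag-scanned 4x4 block into TotCoeff, T1s, the trailing-one sign bits and the remaining levels, following the four coding steps listed in Section III. It is a minimal Python model under stated assumptions, not the VHDL design: the names `classify_block` and `CavlcSymbols` are hypothetical, trailing ones are counted among the non-zero coefficients starting from the end of the array, and the example block is an arbitrary illustration.

```python
from typing import List, NamedTuple


class CavlcSymbols(NamedTuple):
    """Symbols derived from one zigzag-scanned 4x4 block (illustrative model)."""
    tot_coeff: int            # number of non-zero coefficients (0..16)
    t1s: int                  # number of trailing +/-1 coefficients (0..3)
    t1_sign_bits: List[int]   # 0 for +1, 1 for -1, from the end of the array
    levels: List[int]         # remaining non-zero coefficients, from the end


def classify_block(zigzag: List[int]) -> CavlcSymbols:
    """Split a 16-entry zigzag array into TotCoeff, T1s, sign bits and levels."""
    if len(zigzag) != 16:
        raise ValueError("a 4x4 block has exactly 16 coefficients")

    # Step 1: count the non-zero coefficients.
    nonzero = [c for c in zigzag if c != 0]
    tot_coeff = len(nonzero)

    # Steps 2 and 3: walk from the end of the array, collecting at most
    # three consecutive +/-1 coefficients and their sign bits.
    from_end = list(reversed(nonzero))
    t1s = 0
    sign_bits: List[int] = []
    for c in from_end:
        if abs(c) == 1 and t1s < 3:
            t1s += 1
            sign_bits.append(0 if c == 1 else 1)
        else:
            break

    # Step 4: everything that is not a trailing one is a level,
    # including any further +/-1 beyond the limit of three.
    levels = from_end[t1s:]
    return CavlcSymbols(tot_coeff, t1s, sign_bits, levels)


if __name__ == "__main__":
    block = [0, 3, 0, 1, -1, -1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    print(classify_block(block))
```

For the example block, the model reports TotCoeff = 5 and T1s = 3; the fourth +/-1 (the +1 next to the 3) is treated as a level because T1s is capped at 3. The number of levels obtained this way corresponds to the level count that enters the cycle estimate in the discussion.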
The timing report has the following statistics:
- Minimum period: 3.782ns (Maximum Frequency: 264.410MHz).
- Maximum output required time after clock: 2.078ns.
- Maximum combinational path delay: 1.121ns.

TABLE I
CYCLES AND FRAME RATE OF THE CAVLC ENTROPY DECODER - WORST CASE SCENARIO

Format  Resolution  CAVLC cycles  Frame cycles  fps
VGA     640x480     422400        1408000       187.5
HD      1280x720    1267200       4224000       62.5
FHD     1920x1088   2872320       9574400       27.6

TABLE II
CYCLES AND FRAME RATE OF THE CAVLC ENTROPY DECODER - AVERAGE CASE SCENARIO

Format  Resolution  CAVLC cycles  Frame cycles  fps
VGA     640x480     249600        832000        317.3
HD      1280x720    748800        2496000       105.8
FHD     1920x1088   1697280       5657600       46.7

TABLE III
BIT RATES OF THE CAVLC ENTROPY DECODER - WORST CASE SCENARIO

Format  Resolution  CAVLC bits  Frame bits  Kbps at 30 fps
VGA     640x480     8908800     29696000    989.9
HD      1280x720    26726400    89088000    2969.6
FHD     1920x1088   60579840    201932800   6731.1

TABLE IV
BIT RATES OF THE CAVLC ENTROPY DECODER - AVERAGE CASE SCENARIO

Format  Resolution  CAVLC bits  Frame bits  Kbps at 30 fps
VGA     640x480     614400      2048000     68.3
HD      1280x720    1843200     6144000     204.8
FHD     1920x1088   4177920     13926400    464.2

Fig. 3. H.264 CAVLC cycle results

V. DISCUSSION

Experiments are conducted for the worst-case and the average-case scenarios. The worst case occurs when a block contains 16 non-zero coefficients, which consumes the maximum number of clocks, given by the following equation:

Maximum Clocks = T1 + (level_num * T2) + T3 + (run_num * T4) + Tm + 2

where T1 is the number of clocks consumed by the TotCoeff/T1s decoder module, T2 is the number of clocks consumed by the level decoder module, T3 is the number of clocks consumed by the total zeros module, T4 is the number of clocks consumed by the run before module, and Tm is the number of clocks consumed for merging levels and run-before zeros. The maximum number of clocks is then 22 clock cycles. If we check the performance of the design against the requirements in the worst-case scenario, decoding an HD frame of 1280x720 pixels takes 4224000 clocks, as shown in Table I. However, a 4x4 block requires 13 cycles on average. The total number of clocks needed to decode an HD frame then drops significantly, to 2496000 clocks (Table II), giving a frame rate of 105 fps at a clock frequency of 264 MHz. The lowest frame rate when decoding an FHD frame, 1920x1088, is 27.6 fps. In the average case, however, the rate increases to 46.7 fps. So a frame rate of 30 fps is reached with a probability of 87.4% when decoding an FHD video sequence using our CAVLC design.
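The frame-rate figures in Tables I and II can be reproduced with a few lines of arithmetic. The Python sketch below rests on assumptions inferred from the tabulated numbers rather than stated explicitly in the text: one 4x4 block per 16 pixels of the frame, the CAVLC stage accounting for 30% of the cycles spent on a frame, and a 264 MHz clock. The function and constant names are illustrative only.

```python
# Assumptions inferred from Tables I and II (not stated as formulas in the text).
CLOCK_HZ = 264_000_000        # clock frequency used for the fps figures
CAVLC_SHARE_PERCENT = 30      # CAVLC share of the total cycles per frame

WORST_CASE_CYCLES = 22        # cycles per 4x4 block, worst case
AVERAGE_CASE_CYCLES = 13      # cycles per 4x4 block, average case


def cavlc_cycles(width: int, height: int, cycles_per_block: int) -> int:
    """Cycles spent in the CAVLC decoder for one frame."""
    blocks = (width * height) // 16   # one 4x4 block per 16 pixels
    return blocks * cycles_per_block


def frame_cycles(width: int, height: int, cycles_per_block: int) -> int:
    """Total cycles per frame if CAVLC accounts for CAVLC_SHARE_PERCENT of them."""
    return cavlc_cycles(width, height, cycles_per_block) * 100 // CAVLC_SHARE_PERCENT


def frames_per_second(width: int, height: int, cycles_per_block: int) -> float:
    """Frame rate reachable at CLOCK_HZ for the given per-block cycle count."""
    return CLOCK_HZ / frame_cycles(width, height, cycles_per_block)


if __name__ == "__main__":
    formats = {"VGA": (640, 480), "HD": (1280, 720), "FHD": (1920, 1088)}
    for name, (w, h) in formats.items():
        for label, cpb in (("worst", WORST_CASE_CYCLES), ("average", AVERAGE_CASE_CYCLES)):
            fps = frames_per_second(w, h, cpb)
            print(f"{name:4s} {label:8s} {frame_cycles(w, h, cpb):>9d} cycles  {fps:6.1f} fps")
```

Under these assumptions the sketch returns 4224000 frame cycles and 62.5 fps for an HD frame in the worst case, and 27.6 fps for FHD in the worst case, in agreement with Table I.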
VI. CONCLUSION AND FUTURE WORK

We have designed and implemented a CAVLC entropy decoder for H.264 on FPGA. Experiments show a high frame rate for HD and FHD video resolutions. Our future work focuses on implementing a complete H.264 decoder on FPGA and on improving the overall performance of the decoder for consistent real-time FHD execution.

REFERENCES

[1] ISO/IEC. International standard. Part 10: Advanced video coding, 14496-10, 2003.
[2] K. Suehring. H.264 reference software. http://bs.hhi.de/ suehring/tml/.
[3] Xilinx Company. http://www.xilinx.coml/.
[4] R. Osorio, J. Bruguera. An FPGA architecture for CABAC decoding in manycore systems. In International Conference on Application-specific Systems, Architectures and Processors, pages 293–298, 2008.
[5] Tony Gladvin George, N. Malmurugan. The Architecture of Fast H.264 CAVLC Decoder and its FPGA Implementation. In Proceedings of the Third International Conference on International Information Hiding and Multimedia Signal Processing (IIH-MSP 2007) - Volume 02, pages 389–392, 2007.
[6] IEEE Standard VHDL Language Reference Manual. In IEEE Std 1076, 2000.
[7] Zhibin Xiao, Bervan Baas. A high-performance parallel CAVLC encoder on a fine-grained many-core system. In IEEE International Conference on Digital Object Identifier, ICCD.2008, pages 248–254, 2008.
[8] E. Baaklini, H. Sbeity, S. Niar, N. Amaneddine. H.264 Color Components Video Decoding Parallelization on Multi-core Processors. In 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, DSD ’10, pages 785–790, 2010.
[9] Sung-Kyu Choi, Jong-Gu Jeon, Woo-Sung Shim, Won-Kap Jang, Victor H. S. Ha. Design and implementation of H.264-based video decoder for digital multimedia broadcasting. In IEEE International Conference on Multimedia and Expo, ICME ’04, pages 149–152, Vol. 1, 2004.
[10] T. Lindroth, N. Avessta, J. Teuhola, T. Seceleanu. Complexity Analysis of H.264 Decoder for FPGA Design. In IEEE International Conference on Multimedia and Expo, ICME ’06, pages 1253–1256, 2006.