A High Performance CAVLC Encoder Design for MPEG-4 AVC/H.264 ...

A High Performance CAVLC Encoder Design for MPEG-4 AVC/H.264 Video Coding Applications

Chih-Da Chien, Keng-Po Lu, Yi-Hung Shih, and Jiun-In Guo
Dept. of Computer Science and Information Engineering, National Chung Cheng University, Chia-Yi, Taiwan, R.O.C.
E-mail: {cct, lkp93, syh94, jiguo}@cs.ccu.edu.tw

Abstract: This paper presents a high performance VLSI architecture for MPEG-4 AVC/H.264 CAVLC encoding. The design uses a forward-based parallel coding (FPC) technique to increase the data throughput rate. Moreover, two approaches, arithmetic table elimination (ATE) and fast look-up table matching (FLM), are exploited to reduce the hardware cost. Synthesized under a 125 MHz clock constraint in a 0.18 um CMOS technology, the proposed design costs 9724 gates and meets the real-time processing requirement of H.264 video encoding for HD1080 format video.

0-7803-9390-2/06/$20.00 ©2006 IEEE (ISCAS 2006)

I. INTRODUCTION

Variable length coding (VLC) is a widely used technique in image and video compression applications, such as JPEG, MPEG and H.263. The main idea of VLC is to minimize the average codeword length: shorter codewords are assigned to input symbols with high probability, and longer codewords to symbols with low probability. VLC is often applied together with lossy compression techniques to increase the data compression rate. In conventional MPEG standards, most of the coding information is encoded as a variable length coded bit-stream using a fixed statistical model. To further increase the data compression rate, context-based adaptive variable length coding (CAVLC) is adopted in the new MPEG-4 AVC/H.264 standard [1][2] to encode the transform coefficients of residual data. The coding efficiency of CAVLC is much higher than that of conventional MPEG entropy coding owing to its context-adaptive feature. However, the computational complexity of CAVLC is also much higher than that of conventional VLC. To meet the processing demand of real-time H.264 video coding applications, a dedicated hardware implementation of CAVLC is a good choice for pursuing high performance.

In this paper, we propose a high performance CAVLC encoder for MPEG-4 AVC/H.264 video coding applications. We use an arithmetic table elimination (ATE) scheme to reduce hardware cost. The look-up tables (LUTs) required for encoding are eliminated by exploiting the numerical properties of symbols and codewords, so that codewords can be produced from input symbols by arithmetic operations instead of table look-up. For the codewords that cannot be generated by arithmetic operations, a fast LUT matching (FLM) scheme is adopted to reduce the size of the LUTs.

The encoding procedure of CAVLC can be partitioned into two phases, the scan phase and the encoding phase. In general, the throughput rate of the scan phase is constant, but the throughput rate of the encoding phase depends on the statistics from the scan phase. In a pipelined architecture, the number of cycles CAVLC requires to encode one block is determined by max(Cscan, Cencode), where Cscan denotes the cycles of the scan phase and Cencode denotes the cycles of the encoding phase. To increase the throughput rate of the proposed design, we use a serial-input parallel-output (SIPO) buffer combined with the forward-based parallel coding (FPC) approach to limit the encoding phase to 15 cycles, achieving both high performance and a constant throughput rate. Compared with the existing design [3], the proposed design offers higher performance at lower hardware cost. Implemented in a 0.18 um CMOS technology, the proposed design operates at 125 MHz with a cost of 9724 gates, and its throughput meets the real-time processing demand of H.264 video encoding for the HD1080 video format.

The rest of the paper is organized as follows. Section II introduces the CAVLC encoding process and previous CAVLC design approaches proposed in the literature. The proposed design approaches and architecture are described in Section III. Section IV presents the performance analysis and comparison. Finally, Section V concludes the paper.

II. BACKGROUND

CAVLC is designed to encode a quantized 4x4 or 2x2 block of transform coefficients. The essential statistics for encoding are stored in the statistics buffer after the zig-zag scan. Then the five encoding steps [5] are performed sequentially, as follows.

1. Encode the number of non-zero coefficients (TC) and trailing ones (T1).
2. Encode the sign of each trailing one.
3. Encode the levels of the remaining non-zero coefficients (Level).
4. Encode the total number of zeros (TZ) before the last non-zero coefficient.
5. Encode each run of zeros (Run_Before).

A serial CAVLC encoder architecture is proposed by Lai [4]. This design scans the symbols of a block and encodes them serially, using stacks as the statistics buffer to keep the cost low. However, the serial encoding architecture cannot meet the real-time processing demand of high quality applications. Apart from the serial CAVLC design mentioned above, Chen [3] proposed a pipelined design approach with a dual buffer. The main concept of this design is a two-stage pipeline that processes two blocks in parallel: while one block is processed by the encoding unit, the next block is processed by the scan unit to collect the required statistics. In this way, the pipelined design increases the throughput compared with the serial design, but the pipeline requires an extra buffer to store all of the symbols and statistics of a 4x4 block. The performance of the pipelined design is limited by the larger of the cycle counts of the scan stage and the encoding stage, and the encoding stage may take more cycles than the scan stage if the symbols are encoded serially. When the value of QP is small, most coefficients in a block are non-zero. The throughput is therefore degraded and becomes non-constant when encoding blocks with many non-zero coefficients, as in high quality video.
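The five symbol classes listed in Section II can be illustrated with a small software model. The following Python sketch is only an illustration of how the statistics could be gathered from one zig-zag scanned block; the names `BlockStats` and `collect_stats` are hypothetical and do not describe the scan unit of any of the designs discussed here.

```python
from typing import List, NamedTuple


class BlockStats(NamedTuple):
    total_coeffs: int       # TC: number of non-zero coefficients
    trailing_ones: int      # T1: trailing +/-1 coefficients, at most 3
    t1_signs: List[int]     # one sign bit per trailing one, 0 = '+', 1 = '-'
    levels: List[int]       # remaining non-zero levels, in reverse scan order
    total_zeros: int        # TZ: zeros before the last non-zero coefficient
    run_before: List[int]   # zeros preceding each non-zero coefficient, reverse order


def collect_stats(scanned: List[int]) -> BlockStats:
    """Gather the CAVLC symbols of one zig-zag scanned block (illustrative model)."""
    nonzero_pos = [i for i, c in enumerate(scanned) if c != 0]
    tc = len(nonzero_pos)
    if tc == 0:
        return BlockStats(0, 0, [], [], 0, [])

    # CAVLC works backwards from the highest-frequency non-zero coefficient.
    rev_pos = nonzero_pos[::-1]
    rev_vals = [scanned[i] for i in rev_pos]

    # Trailing ones: consecutive +/-1 values at the high-frequency end, capped at 3.
    t1 = 0
    for v in rev_vals:
        if abs(v) == 1 and t1 < 3:
            t1 += 1
        else:
            break

    t1_signs = [1 if v < 0 else 0 for v in rev_vals[:t1]]
    levels = rev_vals[t1:]

    # Total zeros: zeros located before the last non-zero coefficient.
    total_zeros = nonzero_pos[-1] + 1 - tc

    # Run_Before: zeros immediately preceding each non-zero coefficient.
    # The final entry is listed for completeness; an encoder can infer it.
    run_before = []
    for k, pos in enumerate(rev_pos):
        prev = rev_pos[k + 1] if k + 1 < tc else -1
        run_before.append(pos - prev - 1)

    return BlockStats(tc, t1, t1_signs, levels, total_zeros, run_before)
```

For example, the scanned block 0, 3, 0, 1, -1, -1, 0, 1 followed by eight zeros has TC = 5, T1 = 3 and TZ = 3, and the runs sum to TZ.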

III. PROPOSED CAVLC ENCODER

The overall architecture of the proposed design for MPEG-4 AVC/H.264 CAVLC encoding is illustrated in Fig. 1. The proposed design contains five main components: the scan unit, SIPO buffer, coding unit, codeword linker and control unit. The coding unit consists of three encoding elements, each in charge of encoding a different type of symbol. The symbol encoders that belong to the same encoding element can work in parallel to reduce the number of cycles required in the coding phase. To encode a block, the scan unit first performs a zig-zag scan on the 4x4 or 2x2 block to collect the statistics, which are then fed into the SIPO buffer. After that, the coding unit fetches two symbols from the SIPO buffer simultaneously and encodes them in parallel while the scan unit works on the next block. Finally, the codewords produced by the coding unit are linked together by the codeword linker. The key components of the proposed CAVLC encoder are described in the following.

A. SIPO Buffer

During the scan phase, the statistical data have to be stored in a buffer for the later encoding phase. The straightforward way is to implement this buffer as FIFOs or stacks with a depth of 16 entries. However, this approach may require a dual buffer in order to pipeline the encoding process into a scan stage and an encoding stage. To reduce the required buffer size, a SIPO buffer combined with a parallel coding scheme is adopted in the proposed design. Fig. 2 shows an example of the proposed SIPO buffer. The symbols are written serially to the SIPO buffer during the scan, and the coding units read two symbols from the SIPO buffer at a time for parallel coding. By exploiting the SIPO buffer, the dual buffer needed for storing Level symbols can be reduced to a single buffer and one 16-bit register. However, an extra Run_Before buffer is still required because the Run_Before symbols are encoded at the end of a block. With the parallel coding approach, the Coeff_Token encoder and T1 encoder complete their work in the first cycle of the encoding phase of a block. Then, two Level encoders read two Level symbols and generate two codewords immediately in the next cycle. The scan phase of the next block can overlap the encoding phase of the current block. Hence, the proposed design keeps the advantage of pipelined encoding while eliminating the additional buffers for storing Level and T1.
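The serial-write, paired-read behaviour described for the SIPO buffer can be mimicked by a short behavioural model. The Python sketch below is a software analogy under our own assumptions (a 16-entry depth, and the hypothetical names `SipoBuffer`, `read_pair` and `drain_in_pairs`); it does not reproduce the register-level structure shown in Fig. 2.

```python
from typing import List, Optional, Tuple


class SipoBuffer:
    """Toy serial-in, parallel-out buffer: one symbol per write, two per read."""

    def __init__(self, depth: int = 16) -> None:
        self.depth = depth  # assumed: one entry per coefficient of a 4x4 block
        self.slots: List[int] = []
        self.read_ptr = 0

    def write(self, symbol: int) -> None:
        """Serial input: the scan unit stores one symbol per scan cycle."""
        if len(self.slots) >= self.depth:
            raise OverflowError("SIPO buffer is full")
        self.slots.append(symbol)

    def read_pair(self) -> Tuple[Optional[int], Optional[int]]:
        """Parallel output: hand up to two symbols to the coding unit at once."""
        pair = self.slots[self.read_ptr:self.read_ptr + 2]
        self.read_ptr += len(pair)
        first = pair[0] if len(pair) > 0 else None
        second = pair[1] if len(pair) > 1 else None
        return first, second

    def empty(self) -> bool:
        return self.read_ptr >= len(self.slots)


def drain_in_pairs(buf: SipoBuffer) -> List[Tuple[Optional[int], Optional[int]]]:
    """Read the buffer two symbols at a time; each pair models one coding cycle."""
    pairs = []
    while not buf.empty():
        pairs.append(buf.read_pair())
    return pairs
```

In this model, n stored symbols are consumed in ceil(n / 2) paired reads, which is the property the parallel coding scheme relies on to keep the encoding phase no longer than the scan phase.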

Fig. 1. The architecture of the proposed CAVLC encoder

Fig. 2. An example of the SIPO buffer approach.

B. Coeff_Token Encoder and T1 Encoder

In the H.264 standard, five codeword tables are used to define the codewords of Coeff_Token symbols. When encoding a Coeff_Token symbol, the parameter nC, derived from the average number of non-zero coefficients in the left and upper neighboring blocks, determines which codeword table is selected. Implementing the Coeff_Token encoder by direct LUT mapping is area consuming. In the proposed design, we use fast LUT matching (FLM) and arithmetic table elimination (ATE) to achieve low cost and high performance. The main idea of the FLM approach is to partition the original LUT into multiple smaller ones, generate the required addresses for these LUTs in parallel, and enable only one of them. By applying the FLM approach, we reduce the size of the LUT and also speed up the table look-up process. When the value of nC is 8 or greater, a fixed-length codeword table is selected for encoding. In this case, the ATE scheme generates the codeword by an arithmetic operation instead of a table look-up, further reducing the hardware cost. The codeword is the least significant 6 bits of the variable Code defined in (1), where the operator {} denotes concatenation, that is, the joining together of the bits resulting from two or more expressions.

$$
\mathrm{Code} =
\begin{cases}
3, & \text{if } TC = 0 \\
\{\, TC - 1,\ T1 \,\}, & \text{if } TC \neq 0
\end{cases}
\qquad \mathrm{CodeLen} = 6 \tag{1}
$$
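Equation (1) can be written out as a few lines of software. The Python sketch below is a minimal illustration, assuming that the concatenation places TC - 1 in the upper 4 bits and T1 in the lower 2 bits of the 6-bit codeword; the function name `coeff_token_flc` is hypothetical.

```python
def coeff_token_flc(tc: int, t1: int) -> str:
    """Return the 6-bit fixed-length Coeff_Token codeword following equation (1)."""
    if not 0 <= tc <= 16 or not 0 <= t1 <= min(tc, 3):
        raise ValueError("invalid (TC, T1) pair")
    if tc == 0:
        code = 3
    else:
        # Concatenation {TC - 1, T1}: assumed 4 bits for TC - 1, then 2 bits for T1.
        code = ((tc - 1) << 2) | t1
    # Keep the least significant 6 bits, as stated for (1).
    return format(code & 0x3F, "06b")
```

For instance, (TC, T1) = (0, 0) yields 000011 and (5, 3) yields 010011, so no table storage is needed for this case.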

The codeword of T1 is decided by the sign of each T1 symbol. There are at most three T1 symbols in a block. In this case, the serial encoding architecture [4] needs three cycles to encode the T1 symbols. In contrast, the proposed T1 encoder produces three T1 codewords simultaneously by using a 3-bit array to record the T1 symbols. Moreover, the Coeff_Token and T1 symbols are encoded in parallel. That is, in the first cycle of the encoding phase, the Coeff_Token encoder and T1 encoder complete the encoding of at most four symbols.

C. Level Encoder

In the H.264 standard, seven VLC tables are used to define the codewords of Level symbols. The Level encoder can be implemented easily with a memory-based architecture, but this kind of design is not hardware efficient. In the proposed design, we use the arithmetic table elimination (ATE) scheme to eliminate the cost of the tables by producing the codewords with simple arithmetic operations. By exploiting the numerical properties of the codewords, four different arithmetic operations suffice to generate the desired codewords for the Level symbols. The architecture of the proposed Level encoder is shown in Fig. 3.

The number of cycles required for encoding Level symbols depends on the number of non-zero coefficients in a block. Therefore, with serial symbol encoding, the number of cycles required in the encoding phase may exceed that in the scan phase and yield a non-constant throughput rate. Besides, the encoding of the next symbol depends on the current one because of the context-adaptive property, which increases the complexity of parallel symbol encoding. To overcome these problems, we propose a forward-based parallel coding (FPC) approach to encode two Level symbols in parallel. Fig. 4 shows the architecture of the Level coding element with FPC. In the proposed design, two Level symbols are read from the SIPO buffer simultaneously, as mentioned before, and two Level encoders are used for parallel encoding. We use the symbol whose scan order is earlier than that of the other to determine the table index, and then forward this index to the other Level encoder so that the two symbols are encoded in parallel. When only one Level symbol remains in the SIPO buffer, the Level symbol and the Total_Zero symbol can also be encoded simultaneously. With the proposed FPC approach, two codewords are generated in one cycle except when escape mode occurs. Escape mode costs one more cycle to encode the escape codeword because the sum of the lengths of the two codewords may exceed 32 bits. To find the effect of escape codewords on coding performance, we collected statistics over various sequences and analyzed the percentage of escape codewords using JM [2]. Even for high quality video, i.e., when QP (quantization parameter) is small, the ratio of symbols coded in escape mode is less than 1%.
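The idea of forwarding the table index can be illustrated in software. The Python sketch below is our own illustration, not the circuit of Fig. 3 or Fig. 4: it builds Level codewords arithmetically from the prefix/suffix structure of the H.264 CAVLC syntax rather than from the paper's four operations (which are not spelled out here), and it processes the symbols in pairs, with the second encoder using the table index derived from the first symbol. All function names are hypothetical, and the extra cycle of escape mode is not modelled.

```python
from typing import List, Tuple


def next_suffix_length(suffix_length: int, level: int) -> int:
    """Context update: table index to use for the Level that follows `level`."""
    if suffix_length == 0:
        suffix_length = 1
    if abs(level) > (3 << (suffix_length - 1)) and suffix_length < 6:
        suffix_length += 1
    return suffix_length


def level_codeword(level_code: int, suffix_length: int) -> str:
    """Build one codeword arithmetically: unary prefix, then fixed-size suffix."""
    if suffix_length == 0:
        if level_code < 14:
            prefix, suffix, size = level_code, 0, 0
        elif level_code < 30:
            prefix, suffix, size = 14, level_code - 14, 4
        else:
            prefix, suffix, size = 15, level_code - 30, 12   # escape codeword
    elif level_code < (15 << suffix_length):
        prefix = level_code >> suffix_length
        suffix = level_code & ((1 << suffix_length) - 1)
        size = suffix_length
    else:
        prefix, suffix, size = 15, level_code - (15 << suffix_length), 12  # escape
    bits = "0" * prefix + "1"
    if size:
        bits += format(suffix, "0{}b".format(size))
    return bits


def encode_levels_fpc(levels: List[int], total_coeffs: int,
                      trailing_ones: int) -> Tuple[List[str], int]:
    """Encode Level symbols two per modelled cycle, forwarding the table index."""
    suffix_length = 1 if total_coeffs > 10 and trailing_ones < 3 else 0
    codewords: List[str] = []
    cycles = 0
    for i in range(0, len(levels), 2):
        pair = levels[i:i + 2]
        # Encoder A uses the current index; encoder B gets the index
        # forwarded from encoder A's symbol within the same cycle.
        indices = [suffix_length]
        if len(pair) == 2:
            indices.append(next_suffix_length(suffix_length, pair[0]))
        for k, (lvl, sl) in enumerate(zip(pair, indices)):
            code = 2 * abs(lvl) - (1 if lvl < 0 else 2)
            if i + k == 0 and trailing_ones < 3:
                code -= 2  # first Level after fewer than 3 trailing ones
            codewords.append(level_codeword(code, sl))
        suffix_length = next_suffix_length(indices[-1], pair[-1])
        cycles += 1
    return codewords, cycles
```

With the Level symbols -3, 3, 4, -2 (TC = 5, T1 = 1), the model emits four codewords in two paired steps instead of four serial steps, which is the effect the FPC approach aims for.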

Fig. 3. The architecture of the proposed Level encoder

Fig. 4. The architecture of the proposed encoding element 2 with FPC scheme

D. Run_Before Encoder

The Run_Before encoder is also implemented with the proposed ATE and FPC design approaches. As in the Level encoder, the codewords of Run_Before symbols can be generated by arithmetic operations, which eliminates the need for a LUT. During encoding, two Run_Before symbols are read from the SIPO buffer, and two codewords are produced in parallel by two Run_Before encoders, with the table index forwarded from one to the other.

IV. PERFORMANCE ANALYSIS AND COMPARISON

In the proposed design, the performance is limited by the execution cycles of the scan phase. The throughput of the proposed design is 16 x 27 = 432 cycles/MB. The proposed design is therefore capable of real-time processing of HD1080@30fps video at a working frequency of 106 MHz even when all blocks are coded. However, the coded block pattern (CBP) is usually used to identify blocks that need not be encoded, which further increases the video compression ratio. To take the effect of CBP into account, we simulated the proposed design with various sequences (D1 format) and different QP values. The average number of cycles for encoding a MB is shown in Table I. The simulation results show that the proposed design needs about 300 cycles on average to encode a MB.

To implement the proposed CAVLC encoder, we performed logic synthesis of the design in a 0.18 um CMOS technology. Table II shows the hardware cost profile, in terms of gate count, of the symbol encoders and the statistics buffer in the proposed design and in Chen's design [3]. The proposed design requires less hardware for the statistics buffer than design [3]. However, the hardware cost of the Level and Run_Before encoding elements in the proposed design is larger than that in design [3] because of the overhead of the FPC design approach. The comparison of hardware cost and processing speed between the proposed design and other existing designs [3][4] is shown in Table III. The gate count of design [3] listed in Table III excludes the Exp-Golomb encoder and the bit-stream packer for a fair comparison. Design [4] also contains a bit-stream packer, which packs the codewords produced by the symbol encoders; the packing of bit-stream headers and Exp-Golomb VLC is not mentioned in design [4]. From Table III, we conclude that the proposed design outperforms the others, with higher performance at lower or nearly the same hardware cost.

TABLE I. THE AVERAGE ENCODING CYCLES PER MB IN THE PROPOSED DESIGN

| QP      | Sailormen | Harbour | Crew |
| QP = 10 | 414       | 411     | 414  |
| QP = 20 | 381       | 348     | 345  |
| QP = 30 | 288       | 266     | 218  |
| QP = 40 | 185       | 209     | 104  |
| Average | 317       | 309     | 270  |

TABLE II. HARDWARE COST PROFILE ON SYMBOL ENCODERS AND STATISTICS BUFFER OF THE PROPOSED DESIGN COMPARED TO DESIGN [3]

| Category          | Component                        | Proposed (gates) | Chen [3] (gates) |
| Symbol Encoders   | Coeff_Token Encoder              | 554              | 864              |
| Symbol Encoders   | Level Encoding Element (1)       | 1208             | 1012             |
| Symbol Encoders   | Total_Zero Encoder               | 420              | 646              |
| Symbol Encoders   | Run_Before Encoding Element (2)  | 432              | 263              |
| Statistics Buffer | Statistics Buffer                | 5325             | 12283            |

(1) The Level encoding element contains two Level encoders and the FPC control unit.
(2) The Run_Before encoding element contains two Run_Before encoders and the FPC control unit.

TABLE III. COMPARISON OF THE PROPOSED DESIGN WITH OTHERS

|                 | Proposed     | Chen [3]     | Lai [4]                 |
| Technology      | 0.18 um      | 0.18 um      | 0.35 um                 |
| Gate Count      | 9724         | 17635        | 9171 (including packer) |
| Clock Frequency | 125 MHz      | 100 MHz      | 66 MHz                  |
| Target Format   | HD1080 30fps | HD1080 30fps | QCIF 10fps              |
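The cycle budget quoted in Section IV can be checked with a short calculation. The Python sketch below is a back-of-the-envelope model under our own reading of the figures: 16 is taken as the scan cycles per block and 27 as the number of blocks per MB, and the frame height is rounded up to whole macroblocks. The function names are hypothetical.

```python
import math

CYCLES_PER_BLOCK = 16   # assumed reading: scan-phase cycles for one block
BLOCKS_PER_MB = 27      # assumed reading: blocks coded per macroblock


def block_cycles(c_scan: int, c_encode: int) -> int:
    """Cycles per block in a two-stage pipeline: the slower stage dominates."""
    return max(c_scan, c_encode)


def macroblocks_per_frame(width: int, height: int, mb_size: int = 16) -> int:
    """Number of 16x16 macroblocks, rounding partial rows and columns up."""
    return math.ceil(width / mb_size) * math.ceil(height / mb_size)


def required_clock_hz(width: int, height: int, fps: int, cycles_per_mb: int) -> int:
    """Clock rate needed to encode every macroblock of every frame in real time."""
    return macroblocks_per_frame(width, height) * fps * cycles_per_mb


# Encoding phase limited to 15 cycles, scan phase takes 16 cycles per block.
cycles_per_mb = block_cycles(CYCLES_PER_BLOCK, 15) * BLOCKS_PER_MB
hd1080_clock = required_clock_hz(1920, 1080, 30, cycles_per_mb)
print(cycles_per_mb, hd1080_clock / 1e6)
```

Under these assumptions the result is 432 cycles/MB and roughly 105.8 MHz for HD1080 at 30 fps, in line with the 106 MHz quoted in the text and below the 125 MHz synthesis target.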

V. CONCLUSION

In this paper, a high performance CAVLC encoder for MPEG-4 AVC/H.264 is presented. The proposed design employs both the FLM and ATE schemes to reduce hardware cost. Moreover, we propose the FPC approach combined with a SIPO buffer to achieve high performance and a constant throughput rate. The proposed CAVLC encoder achieves higher throughput than previous architectures. The implementation results show that the proposed design operates at a 125 MHz clock rate with a cost of 9724 gates, which meets the real-time processing demand of HD1080@30fps video encoding. Owing to its constant throughput rate, the proposed design is easy to control and simplifies system integration.

REFERENCES

[1] ITU-T Rec. H.264 and ISO/IEC 14496-10, "Advanced Video Coding", May 2003.
[2] Joint Video Team (JVT) reference software JM9.3.
[3] T. C. Chen, Y. W. Huang, C. Y. Tsai, B. Y. Hsieh, and L. G. Chen, "Dual-block-pipelined VLSI Architecture of Entropy Coding for H.264/AVC Baseline Profile", Proc. International Symposium on VLSI Design, Automation and Test (VLSI-DAT), pp. 271–274, 2005.
[4] Yeong-Kang Lai, Chih-Chung Chou, and Yu-Chieh Chung, "A Simple and Cost Effective Video Encoder with Memory-Reducing CAVLC", Proc. ISCAS 2005.
[5] Iain E. G. Richardson, "H.264 and MPEG-4 Video Compression: Video Coding for Next-generation Multimedia", John Wiley & Sons Ltd., Sussex, England, December 2003.
