VLSI Architecture of H.264 RDO-based Block Size Decision for 1080 HD

Ryoji Hashimoto∗, Kimiya Kato∗, Gen Fujita†, and Takao Onoye∗
∗ Dept. Information Systems Engineering, Osaka University
E-mail: {hashimoto.ryoji, katou.kimiya, onoye}@ist.osaka-u.ac.jp
† Dept. Engineering Informatics, Osaka Electro-Communication University
E-mail: [email protected]
Abstract— A hardware architecture of rate-distortion optimization (RDO) is proposed, dedicated to H.264 block size decision for 1080 HD. To achieve the high coding efficiency of H.264, RDO for block size decision is indispensable but suffers from enormous computational cost, since distortion and the number of coded bits can be determined only after completing the whole encoding process of the block. The proposed approach reduces the computational cost by approximating the bit amount in entropy coding. In addition, a four-parallel, seven-stage codec pipeline enables high-speed calculation of the distortion originating from the residual. As a result, the proposed architecture, which can be implemented with 14K gates, achieves real-time processing of HDTV (1920×1080) frames at a rate of 30 fps in 120 MHz operation, gaining 0.5 dB of PSNR over conventional approaches.

I. Introduction

H.264/AVC [1], the latest video coding standard of ITU-T and ISO/IEC, adopts a set of new coding techniques and offers coding efficiency high enough that up to 50% bit-rate reduction can be achieved compared with MPEG-4 Advanced Simple Profile. Moreover, H.264 supports a wide range of frame formats, from QCIF (176×144) to HDTV (1,920×1,080). Newly introduced techniques in H.264, such as multi-frame reference, 1/4- and 1/8-pixel prediction, variable block size, and context-adaptive entropy coding, must be combined effectively to achieve high coding efficiency.

As an optimization scheme for these parameters, rate-distortion optimization (RDO) [2] is considered the most viable, since it estimates distortion and bit-rate based on actual execution of the succeeding encoding and local-decoding processes, whereas other conventional schemes concentrate on optimizing parameters within each encoding process. In H.264, RDO can be utilized mainly for the selection of prediction mode and motion vector in motion estimation (ME), and for encoding block size decision. However, since RDO incurs a considerable amount of computation, existing architectures [3], [4] do not execute the actual encoding processes and instead estimate "rate" and "distortion" in a different manner, which results in non-optimal coding efficiency. In other words, no practical system implementation has been reported so far.

According to [5], inter block size decision by RDO is more cost-effective for enhancing picture quality than motion estimation by RDO. As can be seen from the PSNR comparison for a 1080 HD sequence depicted in Fig. 1, RDO-based block size decision, which calculates the RD cost 112 times, provides performance comparable to RDO-based motion estimation with its 448×n² + 1,792 (n: search range) RD cost calculations.

Fig. 1. Coding performance for "08HD Walk through the Square" [6]. (PSNR Y [dB] versus bitrate [Mbps]; curves: SATD, RDO block size, RDO ME.)
Motivated by this tendency, the present paper proposes an RDO-based H.264 block size decision architecture for 1080 HD sequences. In this architecture, the header cost and the block data cost are estimated by an efficient approximation of the bit amount. Moreover, a seven-stage 4×4-block-level codec pipeline, used exclusively for inter block size decision, is employed with four-parallel operation so as to facilitate high-speed calculation of the distortion originating from the residual. The proposed architecture is implemented with 14K gates in 0.13 µm CMOS technology and can process 1080 HD sequences at a rate of 30 fps in real-time.

II. Rate-Distortion Optimization for Block Size Decision

In most video coding algorithms, RDO tries to minimize distortion subject to a certain bit-rate restriction. Specifically, Lagrange multiplier optimization is applied to

J = D + λR,  (1)

where D, R, and λ indicate distortion, bit-rate, and the Lagrange multiplier, respectively. Among the existing architectures, [3] uses SAD (sum of absolute differences) as D and does not consider R, while [4] uses SATD (sum of absolute transformed differences) and the number of bits for macroblock type and motion vector as D and R, respectively. In H.264, seven types of block size are prepared as shown in Fig. 2, and RDO-based block size decision in the H.264 JM reference software [7] utilizes the RD cost given by

RD cost(size, λ) = SSD(size) + λ × R(size),  (2)
where size represents one of the block sizes from 4×4 to 16×16. SSD indicates the sum of squared differences between the original image and the reconstructed image associated with the chosen block size, and R represents the number of coded bits associated with the chosen block size, including macroblock type, motion vector, and residual data. The Lagrange multiplier λ depends on the quantization parameter (QP). Since one 16×16 macroblock comprises sixteen 4×4 blocks, the terms in Eq. (2) can be expressed as

SSD(size) = Σ_{idx=0}^{15} SSD4×4(size, idx),  (3)

R(size) = Σ_{idx=0}^{15} rbit(size, idx) + hbit(size),  (4)

where SSD4×4 and rbit represent the SSD and the number of coded bits of the transformed residual data for a 4×4 block, respectively, and hbit indicates the number of coded bits of the macroblock header, such as macroblock type, reference frame index, and motion vector.

Fig. 2. Variable block size of H.264. (Block sizes and the number of MVs shown in the figure: 16×16: 1; 16×8: 2; 8×16: 2; 8×8: 1; 8×4: 2; 4×8: 2; 4×4: 4.)

Fig. 3. Organization of block size decision unit. (MV Buffer, fed from ME, and Neighboring MV Buffer drive the Header cost generator; MC Buffer (16×16×7, shared with ME/MC) and Curr MB buffer (16×16, shared with ME/MC) feed the Residual processor; Ctrl & Block Size Decision outputs the best block size.)

Fig. 4. Architecture of proposed Residual processor. (Seven-stage 4×4-block-level pipeline: MC, IT (horizontal/vertical), Q, IQ, IIT (horizontal/vertical), and SSD calculation, with EC executed in parallel to generate rbit; n: parallel degree.)
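To make the decision rule of Eqs. (2)-(4) concrete, the following C sketch shows the cost accumulation and minimum search over the seven candidate sizes. The helpers compute_ssd4x4, compute_rbit4x4, and compute_hbit are hypothetical stand-ins for the encoding/local-decoding chain described in the next paragraph, and λ is shown as an integer (e.g., fixed point) for simplicity.

#include <limits.h>

#define NUM_SIZES 7   /* 16x16, 16x8, 8x16, 8x8, 8x4, 4x8, 4x4 */

/* Hypothetical stand-ins for the MC/IT/Q/IQ/IIT and EC processes
 * that actually produce these terms.                              */
extern long compute_ssd4x4(int size, int idx);   /* SSD4x4(size, idx), Eq. (3) */
extern long compute_rbit4x4(int size, int idx);  /* rbit(size, idx), Eq. (4)   */
extern long compute_hbit(int size);              /* hbit(size), Eq. (4)        */

/* Return the index of the block size minimizing Eq. (2):
 * RD cost(size) = SSD(size) + lambda * R(size).                    */
int decide_block_size(long lambda)
{
    int best_size = 0;
    long best_cost = LONG_MAX;

    for (int size = 0; size < NUM_SIZES; size++) {
        long ssd = 0;
        long rbits = compute_hbit(size);          /* header part of R(size) */
        for (int idx = 0; idx < 16; idx++) {      /* sixteen 4x4 blocks/MB  */
            ssd   += compute_ssd4x4(size, idx);
            rbits += compute_rbit4x4(size, idx);
        }
        long cost = ssd + lambda * rbits;         /* Eq. (2) */
        if (cost < best_cost) {
            best_cost = cost;
            best_size = size;
        }
    }
    return best_size;
}

In total this performs 7 × 16 = 112 per-4×4 evaluations per macroblock, which matches the 112 RD cost calculations quoted in the Introduction.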
In the course of RDO, integer transform (IT), quantization (Q), inverse quantization (IQ), inverse integer transform (IIT), entropy coding (EC), and motion compensation (MC) must be carried out in the same manner as in general encoding. Consequently, the computational cost of using the RD cost is inevitably high.

III. VLSI Architecture

Figure 3 illustrates the organization of the proposed block size decision unit. MC Buffer and Curr MB Buffer store the reference image used in ME and the original image of the current macroblock, respectively. MV Buffer and Neighboring MV Buffer store the motion vectors chosen in ME and the motion vectors of neighboring macroblocks. The cost terms of the RD cost are generated by the Residual processor (SSD4×4 and rbit) and the Header cost generator (hd cost), and are summed up in Ctrl. Then the block size giving the minimum RD cost is chosen for the current block.

A. Pipelining and parallel operations

As mentioned before, considerable computation is required to calculate the RD cost, and thus pipelining and parallel operation are indispensable for real-time processing. In our implementation, a seven-stage 4×4-block-level pipeline is employed, as illustrated in Fig. 4. Basically, each process such as Q corresponds to one stage of the pipeline, while IT and IIT, each comprising horizontal and vertical operations, occupy two pipeline stages. EC, which generates rbit, is executed in parallel with the local decoding processes and consumes four stages.
Since a 4×4 block has 16 samples, n-degree parallel operation in each stage makes a single pipeline stage take 16/n cycles. Considering that there are seven block sizes, that one macroblock comprises sixteen 4×4 blocks, and that the 7-stage pipeline introduces a latency of 6 stages, the number of cycles for processing one macroblock equals (16 × 7 + 6) × 16/n. Therefore the minimum operating frequency is given by

frequency = MBFPS × (16 × 7 + 6) × 16/n,  (5)

where MBFPS represents the number of macroblocks processed per second. For a 1080 HD sequence at 30 fps, MBFPS equals 8,160 × 30 = 244,800. Consequently, a parallel degree of four enables real-time processing of a 1080 HD sequence at 116 MHz operation.
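As a quick sanity check of Eq. (5), the following self-contained C program reproduces the operating-frequency estimate. The 1920×1088 rounding of 1080 HD to whole macroblocks is our assumption, chosen so that the macroblock count matches the 8,160 figure above.

#include <stdio.h>

int main(void)
{
    /* 1080 HD coded as 1920x1088, i.e., 120 x 68 = 8,160 macroblocks. */
    const long mb_per_frame = (1920 / 16) * (1088 / 16);
    const long fps = 30;
    const long n   = 4;                    /* parallel degree            */
    long mbfps = mb_per_frame * fps;       /* 8,160 x 30 = 244,800 MB/s  */

    /* Eq. (5): (16 blocks x 7 sizes + 6 latency stages) pipeline slots,
     * each slot taking 16/n cycles.                                     */
    long freq = mbfps * (16 * 7 + 6) * 16 / n;
    printf("minimum frequency = %.1f MHz\n", freq / 1e6);
    return 0;
}

This prints 115.5 MHz, which the text rounds to 116 MHz.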
B. Specific architecture for RD cost calculation

In the same manner as general ME, the proposed architecture considers only luma; in fact, only inter luma AC components are processed. Focusing on AC components makes it possible to reduce gate count and improve performance by tuning the functional modules. For example, if the IT module were designed to process both intra 16×16 luma DC and inter luma AC components, its input range would have to be [-9180, 9180]; when aiming only at inter luma AC components, the input range is limited to [-255, 255]. In a similar way, optimization of bit precision helps to reduce gate count and raise performance. Table I compares our IT module, synthesized with the SYNOPSYS Design Compiler from a Verilog HDL description using a 0.13 µm CMOS standard cell library, against other "real" IT architectures. The table shows that our RDO-specific IT reduces hardware cost in comparison to the existing IT architectures by omitting support of luma DC, chroma DC, and chroma AC coefficients.
TABLE I
Comparative synthesis results of IT modules

               Technology   Parallelism   Gate count   Functions
Ours           0.13 µm      4             1,488        luma AC
Wang [9]       0.35 µm      4             6,538        all*
Chen [10]      0.18 µm      8             6,482        all
Agostini [11]  0.35 µm      16            18,353       all

(*) "all" includes luma DC, luma AC, chroma DC, and chroma AC.
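For reference, the standard H.264 4×4 forward integer transform that an IT module realizes can be sketched behaviorally in C as below; this is not the synthesized datapath, and the bit-width comments reflect the inter-luma-AC restriction discussed above.

#include <stdint.h>

/* H.264 4x4 forward integer transform in butterfly form.
 * For inter luma AC residuals the inputs lie in [-255, 255]; each pass
 * grows the magnitude by at most a factor of 6, so intermediates fit in
 * a 12-bit signed word (|v| <= 1,530) and outputs in a 15-bit signed
 * word (|v| <= 9,180).                                                  */
void it4x4(const int16_t in[4][4], int16_t out[4][4])
{
    int16_t tmp[4][4];

    /* Horizontal (row) pass */
    for (int i = 0; i < 4; i++) {
        int a = in[i][0] + in[i][3];
        int b = in[i][1] + in[i][2];
        int c = in[i][1] - in[i][2];
        int d = in[i][0] - in[i][3];
        tmp[i][0] = (int16_t)(a + b);
        tmp[i][1] = (int16_t)(2 * d + c);
        tmp[i][2] = (int16_t)(a - b);
        tmp[i][3] = (int16_t)(d - 2 * c);
    }

    /* Vertical (column) pass */
    for (int j = 0; j < 4; j++) {
        int a = tmp[0][j] + tmp[3][j];
        int b = tmp[1][j] + tmp[2][j];
        int c = tmp[1][j] - tmp[2][j];
        int d = tmp[0][j] - tmp[3][j];
        out[0][j] = (int16_t)(a + b);
        out[1][j] = (int16_t)(2 * d + c);
        out[2][j] = (int16_t)(a - b);
        out[3][j] = (int16_t)(d - 2 * c);
    }
}

With inputs restricted to [-255, 255], every word stays well within 16 bits, consistent with the narrower datapath motivating Table I.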
As for the calculation of SSD, the multiplier would need 16-bit precision to keep the exact squared value, because the absolute difference between the original image and the reconstructed image is distributed over the range [0, 255]. However, a block size exhibiting an extremely large difference between the original and reconstructed images is never chosen, owing to its very large RD cost. The proposed architecture therefore clips the absolute difference to an upper bound of 64, in accordance with experimental results, which also contributes to reducing gate count.

C. Simplified EC

H.264 provides two EC methods: CAVLC (context-based adaptive variable length coding) and CABAC (context-based adaptive binary arithmetic coding). In general, CABAC achieves higher coding efficiency than CAVLC at higher computational cost. However, using CABAC to calculate the RD cost is impractical, as explained below. According to [8], a processing capability of 5.0 × 10^10 symbols per second is required to handle an HD sequence at 25 fps by RDO. Considering that the architecture proposed in [8], implemented with about 20K gates, can process 2.3 symbols per cycle, approximately 22 GHz operation would be required to process an HD sequence in real-time. For this reason, the proposed architecture uses CAVLC for the calculation of the RD cost.

Coded data of a 4×4 block consists of four components: coeff_token, representing the number of non-zero coefficients; level, indicating the value of a transform coefficient; and total_zeros and run, representing the positions of the transform coefficients. These four components have no dependency on each other, so they are processed in parallel. The proposed architecture also simplifies CAVLC, aiming only at bit amount estimation.

1) coeff_token: Calculating the precise bit amount of coeff_token requires information from neighboring 4×4 blocks; storing this information would cost CAVLC a memory of 20×120 bits for an HD sequence. The proposed architecture therefore does not utilize neighboring information, in order to reduce hardware cost, and instead estimates the bit amount of coeff_token using only # of nz_co (the number of non-zero coefficients) of the current 4×4 block. The correlation between # of nz_co and the bit amount of coeff_token was evaluated by software simulation, as summarized in Fig. 5. The figure shows a strong correlation between coeff_token and # of nz_co, and this tendency holds regardless of video sequence and bitrate. Therefore, we employ a weighted-average table, obtained from software simulation, to approximate the bit amount of coeff_token, which gives an estimation error of 5% or less.
Fig. 5. Distribution of # of nz_co and coded bit (08HD). (Histograms over # of nz_co (0, 1, 2, 3, 4, >4) and coded bit (1, 2, 3, 4, >4), in percentage [%]; panels: (a) QP=27, (b) QP=42.)
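A behavioral C sketch of this approximation follows. The table values are illustrative placeholders only; the actual entries are the weighted averages obtained from the authors' software simulation and are not reproduced in the paper.

#include <stdint.h>

/* Weighted-average bit amount of coeff_token, indexed by # of nz_co
 * (0..16).  Placeholder values for illustration only.                */
static const float COEFF_TOKEN_BITS[17] = {
    1.0f, 2.5f, 4.0f, 5.5f, 7.0f, 8.0f, 9.0f, 10.0f, 11.0f,
    12.0f, 13.0f, 14.0f, 15.0f, 16.0f, 16.0f, 16.0f, 16.0f
};

/* Estimate coeff_token bits from the current 4x4 block alone,
 * without any neighboring-block context.                             */
float estimate_coeff_token_bits(const int16_t coeff[16])
{
    int nz = 0;
    for (int i = 0; i < 16; i++)
        nz += (coeff[i] != 0);
    return COEFF_TOKEN_BITS[nz];
}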
Fig. 6. Distribution of abs(level) and coded bit (08HD). (Histograms over abs_level (1, 2, 3, 4, >4) and coded bit (1, 2, 3, 4, >4), in percentage [%]; panels: (a) QP=27, (b) QP=42.)
2) level: There is a complex relationship among the levels within a 4×4 block; for example, if the order in which the level values appear changes, the bit amount of the levels also changes. To count the bit amount of a level simply, we investigate the relation between the absolute value of a level (hereafter abs_level) and its number of coded bits. A software simulation result is exemplified in Fig. 6; other sequences show a similar tendency, namely that the bit amount of a level increases as abs_level grows. At both low QP (Fig. 6(a)) and high QP (Fig. 6(b)), only a few coefficients exceed 2, since the temporal redundancy exploited by inter prediction keeps abs_level small. In addition, the relation between abs_level and the bit amount of a level does not change. This characteristic makes it possible to approximate the number of coded bits of the levels without executing the scanning process. For the approximation we use the table shown in Table II, whose threshold (upper bound) value is 7. The table is built from weighted averages obtained by software simulation and offers an estimation error of 10% or less.

TABLE II
Bit table for abs(level)

abs(level)   1   2     3-5              6-7          >7
bit          1   2.5   abs(level) + 1   abs(level)   10

3) run, total_zeros: As for run, it is difficult to estimate the bit amount without the scanning process, because run represents the number of consecutive zeros in scanning order. The bit amount of a run depends only on the value of run and on zeros_left, which represents the number of preceding zero coefficients. This means that the scanning process and the coding process can run in parallel, which makes it possible to calculate the bit amount of the runs. For calculating run, all that matters is whether each level is zero or not, so just a single bit per symbol suffices, resulting in hardware reduction. As for total_zeros, the number of zeros equals the sum of the runs, so by reusing the result of scanning the runs, the bit amount of total_zeros can be calculated with no error.
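The following C sketch combines the Table II level approximation with a simplified run scan. It is a behavioral sketch under our own simplifying assumptions: run_bits is a hypothetical stub for the standard CAVLC run_before code lengths, and exact CAVLC details such as trailing-ones handling are deliberately omitted.

#include <stdint.h>

/* Table II: approximate coded bits of one level from abs(level) alone. */
static float level_bits(int abs_level)
{
    if (abs_level <= 1) return 1.0f;
    if (abs_level == 2) return 2.5f;
    if (abs_level <= 5) return (float)(abs_level + 1);  /* 3-5 */
    if (abs_level <= 7) return (float)abs_level;        /* 6-7 */
    return 10.0f;                        /* beyond the threshold of 7 */
}

/* Hypothetical stub for the CAVLC run_before code length, which depends
 * only on run and zeros_left (the spec tables are not reproduced here). */
extern int run_bits(int run, int zeros_left);

/* Simplified bit estimation for the levels and runs of one 4x4 block in
 * scanning order.  Only a 1-bit significance flag per coefficient is
 * needed to derive the runs, and total_zeros equals the sum of the runs
 * (its own code length, a direct table look-up, is omitted here).       */
float estimate_level_run_bits(const int16_t coeff[16])
{
    int last = -1;                       /* highest-frequency nonzero    */
    for (int i = 0; i < 16; i++)
        if (coeff[i] != 0) last = i;
    if (last < 0) return 0.0f;           /* all-zero block               */

    int zeros_left = 0;                  /* == total_zeros               */
    for (int i = 0; i < last; i++)
        zeros_left += (coeff[i] == 0);

    float bits = 0.0f;
    int run = 0;
    for (int i = last; i >= 0; i--) {
        if (coeff[i] == 0) { run++; continue; }
        int a = coeff[i] < 0 ? -coeff[i] : coeff[i];
        bits += level_bits(a);
        if (i != last && zeros_left > 0) {  /* run below the coeff above */
            bits += run_bits(run, zeros_left);
            zeros_left -= run;
        }
        run = 0;
    }
    return bits;
}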
TABLE III
Implementation results

Module                  Gate count
Ctrl & Mode decision    1,207
Header cost generator   2,234
Residual processor      10,310
Total                   13,751

TABLE IV
Simulation conditions

Parameter           Value
Image size          1920×1080
Test sequence [6]   08HD Walk through the Square
Search range        [-128, 127]
Entropy coding      CABAC
GOP structure       I-P-P-P...
Block matching      Full search
D. Header Cost Generator

The Header cost generator calculates the number of coded bits of the macroblock type, reference frame index, and motion vector. These syntax elements tend to take small values; e.g., the reference frame index ranges from 0 to 15. By using a look-up table, the hardware cost is kept small, and the bit amount of these values can be calculated within a few cycles. Although the current macroblock may have several reference frame indices and motion vectors, there are enough cycles to calculate these bits sequentially. Therefore the proposed architecture uses only one table, calculating the number of coded bits of one component per cycle (a sketch of the underlying code-length computation is given below).

IV. Implementation

The proposed block size decision architecture based on the RD cost has been implemented with the SYNOPSYS Design Compiler and a 0.13 µm CMOS standard cell library from a Verilog HDL description. Table III summarizes the implementation results.
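The quantities such a header-cost table holds are the Exp-Golomb code lengths that H.264 CAVLC uses for these header elements; the following C sketch shows the computation the table would tabulate. The function names are ours, and ref_idx, strictly te(v)-coded, is shown with the plain ue(v) length for simplicity.

/* Bit length of an unsigned Exp-Golomb code ue(v):
 * 2 * floor(log2(k + 1)) + 1 bits for codeNum k.        */
static int ue_bits(unsigned k)
{
    int lz = 0;
    unsigned v = k + 1;
    while (v > 1) { v >>= 1; lz++; }
    return 2 * lz + 1;
}

/* Bit length of a signed Exp-Golomb code se(v), used for
 * motion vector differences.                            */
static int se_bits(int v)
{
    unsigned k = (v > 0) ? 2u * (unsigned)v - 1 : 2u * (unsigned)(-v);
    return ue_bits(k);
}

/* Illustrative header cost for one partition: mb_type and ref_idx
 * with ue(v) lengths, MV difference components with se(v) lengths. */
int header_bits(unsigned mb_type, unsigned ref_idx, int mvd_x, int mvd_y)
{
    return ue_bits(mb_type) + ue_bits(ref_idx)
         + se_bits(mvd_x) + se_bits(mvd_y);
}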
Fig. 7. Result of software simulation. (PSNR Y [dB] versus bitrate [Mbps]; curves: SATD, RDO (opt.), proposed.)
The maximum operating frequency of the designed VLSI is 140 MHz. The proposed architecture processes one macroblock in 486 cycles, including overhead such as memory access, and therefore offers sufficient performance to support HD frames at a rate of 30 fps, since the designed module can operate at a clock rate above 120 MHz. According to SYNOPSYS Power Compiler, the designed architecture dissipates 19.6 mW from a 1.2 V supply at 120 MHz operation.

In order to evaluate the consequences of the approximation, the RD cost adopted in the proposed architecture was also implemented in software and simulated under the conditions summarized in Table IV. In this software simulation, "proposed" employs CAVLC for RD cost calculation and CABAC for generating the coded bitstream. Figure 7 illustrates the simulation results, demonstrating that the PSNR degradation of the proposed architecture relative to "RDO (opt.)," which utilizes RDO without any simplification, is 0.1 dB in the worst case. On the other hand, our architecture improves PSNR by up to 0.5 dB over SATD. Considering that the motion estimation module including block size decision requires 597K gates in the previous work [3], our 14K-gate block size decision unit raises video quality at only about a 2% area increase.

V. Conclusion

This paper has described an efficient VLSI architecture for H.264 block size decision based on RDO. By using a set of techniques encompassing pipelining, simplified EC, and data clipping, the proposed architecture can be constructed with a practical gate count. Our 14K-gate block size decision unit improves PSNR by up to 0.5 dB compared with conventional block size decision based on SATD, while offering sufficient performance to process HD frames at a rate of 30 fps in real-time.

References

[1] ITU-T Rec. H.264 / ISO/IEC 14496-10, "Advanced video coding," International Standard, Oct. 2004.
[2] G. J. Sullivan and T. Wiegand, "Rate-distortion optimization for video compression," IEEE Signal Processing Magazine, vol. 15, pp. 74-90, Nov. 1998.
[3] C. M. Ou, C. F. Le, and W. J. Hwang, "An efficient VLSI architecture for H.264 variable block size motion estimation," IEEE Trans. Consumer Electronics, vol. 51, pp. 1291-1299, Nov. 2005.
[4] T. C. Chen, S. Y. Chien, et al., "Analysis and architecture design of an HDTV720p 30 frames/s H.264/AVC encoder," IEEE Trans. Circuits and Systems for Video Technology, vol. 16, pp. 673-688, June 2006.
[5] R. Hashimoto, K. Kato, G. Fujita, and T. Onoye, "VLSI architecture of H.264 block size decision based on rate-distortion optimization," in Proc. ISPACS 2006, pp. 618-621, Dec. 2006.
[6] "http://www.ite.or.jp/products/ DISK HD".
[7] "http://iphome.hhi.de/suehring/tml/", JM12.1.
[8] R. R. Osorio and J. D. Bruguera, "High-throughput architecture for H.264/AVC CABAC compression system," IEEE Trans. Circuits and Systems for Video Technology, vol. 16, pp. 1376-1384, Nov. 2006.
[9] T. C. Wang, Y. W. Huang, H. C. Fang, and L. G. Chen, "Parallel 4×4 2D transform and inverse transform architecture for MPEG-4 AVC/H.264," in Proc. ISCAS, pp. 800-803, May 2003.
[10] K. Chen, J. Guo, and J. Wang, "An efficient direct 2-D transform coding IP design for MPEG-4 AVC/H.264," in Proc. ISCAS, pp. 4517-4520, May 2005.
[11] L. Agostini, R. Porto, J. Guntzel, I. S. Silva, and S. Bampi, "High throughput multitransform and multiparallelism IP for H.264/AVC video compression standard," in Proc. ISCAS, pp. 5419-5423, May 2006.