Hardware-Efficient Propagate Partial SAD Architecture for Variable Block Size Motion Estimation in H.264/AVC
Zhenyu Liu, Yiqing Huang, Yang Song, Satoshi Goto, and Takeshi Ikenaga
IPS, Waseda University, N355, 2-7, Hibikino, Wakamatsu, Kitakyushu, 808-0135, Japan
ABSTRACT
A hardware-efficient, high-speed architecture for variable block size motion estimation in H.264 is presented in this paper. By compressing the propagated data and optimizing the processing element and adder tree circuits in the pipeline, this architecture achieves more hardware-efficient datapath logic. Compared with the original Propagate Partial SAD structure, 12.1% of the hardware cost is saved. With the TSMC 0.18µm CMOS 1P6M standard cell library, the maximum clock speed of this design is 227MHz under worst-case operating conditions (1.62V, 125°C). With a 48×32 search range, the maximum throughput of our design is 147,786 MB/s (macroblocks per second), which supports real-time encoding of VGA-resolution frames with 4 reference frames at 30Hz.

Categories and Subject Descriptors: C.3 [Special-Purpose and Application-Based Systems]: Signal processing systems; B.7.1 [Integrated Circuits]: Types and Design Styles–Algorithms implemented in hardware, VLSI
General Terms: Algorithms, Performance, Design
Keywords: H.264, Variable Block Size Motion Estimation, VLSI

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. GLSVLSI’07, March 11–13, 2007, Stresa-Lago Maggiore, Italy. Copyright 2007 ACM 978-1-59593-605-9/07/0003 ...$5.00.

1. INTRODUCTION
Variable block size motion estimation (VBSME) is a powerful technique adopted by the latest international video coding standard, H.264/AVC [1]. Compared with fixed block size motion estimation (FBSME), VBSME achieves more accurate motion vectors (MVs). In H.264/AVC, motion estimation (ME) is conducted on seven block sizes: 4×4, 4×8, 8×4, 8×8, 8×16, 16×8 and 16×16, as shown in Fig. 1. During ME, all blocks inside one macroblock (MB) are processed and the block mode with the best rate-distortion (RD) cost is chosen. Although VBSME achieves a higher compression ratio, the ME computation becomes even more intensive: in the H.264 encoding process, more than 50% of the computation power is consumed by the VBSME algorithm.

Figure 1: 41 Blocks in one MB (sixteen 4×4, eight 8×4, eight 4×8, four 8×8, two 16×8, two 8×16 and one 16×16)

Many hardwired VBSME accelerator designs have been proposed [2]-[6]. Among these architectures, three provide superior performance in different applications, namely Propagate Partial SAD [2], SAD Tree [3] and Parallel Sub-Tree [4]. An elegant Propagate Partial SAD design was first presented by Huang in [2]. In this architecture, partial SADs of 4×4 blocks are propagated in the pipeline and accumulated to generate the other SADs. When parallelism is not required, this design has the most efficient datapath, which makes it suitable for middle- and small-resolution videos [3]. The essence of Propagate Partial SAD is reducing the number of shift registers. For example, SAD Tree propagates the reference pixels in the datapath, whereas Propagate Partial SAD keeps only the 4×x (x: 1-4) partial SADs; consequently, 368-bit registers can be saved.

In fact, Propagate Partial SAD can be further optimized. First, the propagated data in the pipeline can be compressed further to reduce the hardware cost of the shift registers. Second, the pipeline circuit and structure can be optimized to reduce the system latency and the hardware cost. This paper explains these improvements in detail.

This paper is organized as follows. Section 2 presents the proposed architecture. Section 3 shows the silicon designs based on this architecture and the performance comparisons with previous works. Section 4 draws some conclusions.
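The accumulation of 4×4 partial SADs into the SADs of the larger blocks can be sketched in software as follows. This is a minimal Python illustration rather than the hardware datapath: the function name `merge_sads` is hypothetical, and the indexing `sad4x4[y][x]` (Y as block row, X as block column, following the labels of Fig. 1) is an assumption of the sketch.

```python
def merge_sads(sad4x4):
    """Combine a 4x4 grid of 4x4-block SADs into the SADs of all 41 blocks.

    sad4x4[y][x] is the SAD of block B4x4_yx (y: block row, x: block column).
    The label convention follows Fig. 1 and is an assumption of this sketch.
    """
    s = sad4x4
    out = {}
    # Sixteen 4x4 blocks are taken as they are.
    for y in range(4):
        for x in range(4):
            out[f"B4x4_{y}{x}"] = s[y][x]
    # Eight 8x4 blocks (8 wide, 4 high): two horizontally adjacent 4x4 blocks.
    for y in range(4):
        for x in range(2):
            out[f"B8x4_{y}{x}"] = s[y][2 * x] + s[y][2 * x + 1]
    # Eight 4x8 blocks (4 wide, 8 high): two vertically adjacent 4x4 blocks.
    for y in range(2):
        for x in range(4):
            out[f"B4x8_{y}{x}"] = s[2 * y][x] + s[2 * y + 1][x]
    # Four 8x8 blocks: two vertically adjacent 8x4 blocks.
    for y in range(2):
        for x in range(2):
            out[f"B8x8_{y}{x}"] = out[f"B8x4_{2 * y}{x}"] + out[f"B8x4_{2 * y + 1}{x}"]
    # The five largest blocks are sums of 8x8 SADs.
    out["B16x8_0"] = out["B8x8_00"] + out["B8x8_01"]   # upper half
    out["B16x8_1"] = out["B8x8_10"] + out["B8x8_11"]   # lower half
    out["B8x16_0"] = out["B8x8_00"] + out["B8x8_10"]   # left half
    out["B8x16_1"] = out["B8x8_01"] + out["B8x8_11"]   # right half
    out["B16x16_0"] = out["B16x8_0"] + out["B16x8_1"]
    return out
```

Calling `merge_sads` on any 4×4 grid of non-negative integers returns a dictionary with 41 entries, one per block of Fig. 1.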
2. HARDWARE ARCHITECTURE
In this section, we first present the system structure and the data flow of our VBSME hardware architecture. Second, we describe the circuit optimizations of the processing element (PE) and adder tree. Finally, the memory organization is illustrated.

Figure 2: Proposed hardware architecture (labels: Row 0 to Row 15; 16x1 + 16x1 Reference Pixels, each pixel broadcast to 1x16 PE; Processing Element; Row Adder Tree; Shift Register for 4x4 Partial SAD; Shift Register for 8x8 Partial SAD)
2.1 System Architecture
The ME algorithm is integrated in various international standards such as MPEG-1/2/4 and H.261/263/264. It is implemented in three steps. First, the absolute difference is calculated for each pixel. Second, the sum of absolute differences (SAD) is calculated for every MV. Third, the minimum SAD value is found. The procedure is shown in (1) and (2).

\[ \mathrm{SAD}(m, n) = \sum_{k=0}^{W-1} \sum_{l=0}^{H-1} \left| R(m + k, n + l) - C(k, l) \right| \tag{1} \]

\[ \mathrm{SAD}_{\min} = \min_{m \in [0, M-1],\; n \in [0, N-1]} \mathrm{SAD}(m, n) \tag{2} \]
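Equations (1) and (2) amount to an exhaustive search over all candidate MVs. The following Python sketch spells this out in software; it is only an illustration of the two formulas, not of the pipelined hardware. The function names `sad` and `full_search` are hypothetical, and treating the first array index as the k (or m + k) coordinate is an assumption of the sketch.

```python
def sad(ref, cur, m, n):
    """Equation (1): SAD between the current block and the reference block at (m, n).

    ref[m + k][n + l] plays the role of R(m + k, n + l) and cur[k][l] of C(k, l).
    """
    W, H = len(cur), len(cur[0])
    return sum(abs(ref[m + k][n + l] - cur[k][l])
               for k in range(W) for l in range(H))


def full_search(ref, cur, M, N):
    """Equation (2): minimum SAD over m in [0, M-1] and n in [0, N-1].

    Returns (SAD_min, m, n); ties are resolved towards the smallest (m, n).
    """
    return min((sad(ref, cur, m, n), m, n)
               for m in range(M) for n in range(N))
```

For a W×H current block and an M×N search range, `ref` must hold at least (M + W − 1)×(N + H − 1) pixels, which for a 16×16 MB matches the (M + 15)×(N + 15) search window.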
W and H are the width and height of the current block, respectively. M and N are the width and height of the search window, respectively. C(k, l) denotes the pixel values of the current block and R(m + k, n + l) denotes the pixel values of the reference frame. (m, n) represents the MV. In our design, the search range is defined as M = 48 and N = 32.

The hardware architecture of our design is shown in Fig. 2. The dataflow is similar to that of the original Propagate Partial SAD architecture. The current pixels are stored in the PE array. Two sets of 16×1 reference pixels are broadcast to the PE array; in Fig. 2, the vertical dashed lines represent the data broadcasting. Each PE row computes the distortion of one row in the MB. In the same clock cycle, the row SADs of different PE rows belong to different search positions. Each row SAD is accumulated with the incoming partial SAD that belongs to the same search position, and the result is then propagated vertically to the next stage.

To achieve full hardware utilization, the search window is partitioned horizontally. The last 15 rows of the search area are saved in one memory block, denoted as M1, and the other reference pixels are stored in another memory block, M0. If the search range is M×N, the search window size is (M + 15)×(N + 15). The upper (M + 15)×N reference pixels are stored in M0 and the lower (M + 15)×15 reference pixels are stored in M1. In the last 15 cycles of each search column, M0 and M1 both broadcast reference pixels to the PE array and each PE chooses the proper reference pixel to calculate the distortion. In detail, at the t-th cycle of the last 15 cycles, where t ∈ [0, 14], PE rows 0 to t read the pixels from M0 and the other PE rows read data from M1. In this way, there are no data bubbles in the pipeline.

At the output of PE Row 7, the SADs of the upper-half partitions inside one MB, which include B4x4_YX (Y: 0-1, X: 0-3), B4x8_0X (X: 0-3), B8x4_YX (Y: 0-1, X: 0-1), B8x8_0X (X: 0-1) and B16x8_0, are derived through one adder tree. These block SADs are used to update the current minimum SADs stored in registers. To calculate the distortions of B16x8_1, B8x16_0, B8x16_1 and B16x16_0, the SADs of B8x8_00 and B8x8_01 continue to be propagated. At the last pipeline stage, they are summed with the SADs of B8x8_10 and B8x8_11 to compute the distortions of these four blocks.

In contrast, the original Propagate Partial SAD architecture propagates all 4×4 partial SADs in the pipeline [2] and sums them at the last pipeline stage to generate the other SADs. The proposed architecture instead propagates 8×8 partial SADs in the last eight pipeline stages, which reduces the hardware cost of data propagation. In detail, 560-bit registers can be saved by the proposed architecture.

Another advantage of our design is that the number of operands to the adder tree is reduced, so the maximum path delay is shortened. In the original design, the critical path lies in the adder tree in the last stage. This adder tree derives all other blocks' SADs from the SADs of sixteen 4×4 blocks, so it has sixteen inputs. In our design, the last adder tree takes eight 4×4 block SADs and two 8×8 block SADs as operands, so the operand number is reduced to ten. In this way, the circuit complexity of the adder tree is simplified and the maximum delay is reduced. Moreover, the system latency for the blocks in the upper half of the MB is reduced from sixteen to eight cycles.

2.2 Circuit Optimization for PE and Adder Tree
To reduce the hardware cost, we also optimize the circuits of the PE and of the adder tree in each PE row. The absolute difference operation can be expressed as (3), where the overline denotes the bitwise complement.

\[ |R - C| = \begin{cases} R + \overline{C} + 1, & R > C \\ \overline{\left(R + \overline{C}\right)}, & R \le C \end{cases} \tag{3} \]

The intuitive hardware implementation of this algorithm is shown in Fig. 3 (a). The MSB ‘s8’ of the sum from the first 8-bit adder is inverted and then XORed bitwise with the remaining bits of the result. ‘s8’ is then added to the XOR result to generate the absolute difference. The number of PEs is 256, so the last adder in every PE consumes nontrivial hardware overhead. One approach is to eliminate this adder in the PE. However, this causes a one-bit error in each PE; in the worst case, the accumulated error of all PEs is 256. In our design, as shown in Fig. 3 (b), we do not use a dedicated adder in each PE. Instead, ‘s8’ and ‘ABS’ of each PE are both output to the ‘Row Adder Tree’ shown in Fig. 2. The addition of ‘s8’ and ‘ABS’ is merged into the CSA tree in the ‘Row Adder Tree’ [7][8], so the dedicated adder in each PE is eliminated.

Another trick is applied in the PE circuit design. In theory, |C − R| is equal to |R − C|, but the latter is preferred in this hardware implementation. During the ME processing, the data of the current pixels are constant, so the timing constraints through these paths can be defined as multi-cycle data paths. In detail, after the initialization of the current MB, a one-cycle delay is inserted before the ME processing. Consequently, the setup time from the current MB pixels is two cycles. With |R − C|, the inverters are placed on the current pixels, which do not lie on the critical paths, as shown in Fig. 3. When the timing target is 227MHz, this scheme saves 2k gates. Moreover, during the calculation, the data from the current MB are stable, so no dynamic power is consumed by these inverters.

The detailed circuit of the ‘Row Adder Tree’ in the dashed-line block of Fig. 2 is shown in Fig. 4. Note that we only illustrate the ‘Row Adder Tree’ circuit applied in ‘Row 1’, ‘Row 5’, ‘Row 9’ and ‘Row 13’; the circuits of the other ‘Row Adder Tree’ instances can be derived by analogy. In Fig. 4, the ‘Row Adder Tree’ has nine operands, which include the incoming partial SAD (PSAD_I), Cx (x: 0-3) and ABSx (x: 0-3). A three-stage CSA tree is designed as the compressor that compresses these nine operands into two vectors. These two vectors are added together by the final-stage adder. The structure of the final-stage adder depends on the timing, area and power constraints during synthesis.

Figure 3: Hardware architecture of AD operation. (a) Intuitive PE AD circuit. (b) Our PE AD circuit.

Figure 4: Circuit structure of Row Adder Tree

Figure 5: Memory organization of M0 (logical partitions L0_M0 to L63_M0 mapped onto physical partitions P0_M0 to P15_M0)

Table 1: Hardware statistics (1.62V, 125°C)
Clock (MHz)  PE Array (gate)  Cur. MB (gate)  Min SAD (gate)  Control (gate)  Total (gate)
106          55,255           15,726          11,187          1,900           84,068
133          55,835           15,728          11,264          1,900           84,727
150          59,038           15,726          11,417          1,900           87,468
227          64,708           15,748          11,877          1,956           94,289

Table 2: Comparisons with Propagate Partial SAD
Design    Technology  Clock (MHz)  PE Array&Cur.MB  Min SAD&Control  Total
Ref. [2]  0.35µm      66.67        79kgate          26.6kgate        105.6kgate
Ref. [3]  0.18µm      110.8        81.5kgate        –                –
Ours      0.18µm      133          71.6kgate        13.1kgate        84.7kgate
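Returning to the absolute-difference circuit of Section 2.2, the following Python sketch models the arithmetic of equation (3) and the deferred ‘s8’ correction. It is a behavioural illustration under stated assumptions, not the authors' RTL: the names `pe_outputs` and `row_adder_tree` are hypothetical, the Cx inputs of Fig. 4 are taken to be the ‘s8’ bits of the four PEs, and the CSA tree is modelled as a plain sum.

```python
MASK8 = 0xFF  # 8-bit pixel width


def pe_outputs(r, c):
    """Behavioural model of one PE without the dedicated +1 adder.

    Returns (abs_bits, s8) such that abs_bits + s8 == |r - c| for 8-bit r, c.
    """
    s = r + (~c & MASK8)      # 9-bit sum of R and the complement of C
    s8 = s >> 8               # MSB of the sum: 1 exactly when r > c
    low = s & MASK8           # bits s7..s0
    # XOR of s7..s0 with the inverted s8: keep the bits if s8 = 1, flip them if s8 = 0.
    abs_bits = low ^ (0 if s8 else MASK8)
    return abs_bits, s8


def row_adder_tree(psad_in, ref4, cur4):
    """Sum of the nine operands: PSAD_I, four ABSx values and four s8 bits.

    The hardware compresses these operands with a CSA tree; a plain sum gives
    the same numerical result and is used here for brevity.
    """
    total = psad_in
    for r, c in zip(ref4, cur4):
        abs_bits, s8 = pe_outputs(r, c)
        total += abs_bits + s8
    return total
```

Dropping the `s8` term in `row_adder_tree` reproduces the one-bit error per PE mentioned in the text for every pixel pair with r > c.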
2.3 Search Window Memory Organization
The on-chip memories for the search area data consume nontrivial hardware cost and power. Memory organization is therefore another important issue in ME hardware design. According to reference [4], the memory mapping and organization must focus on improving the memory IO utilization and reducing the number of partitions. In this paper, we apply the mapping algorithm provided in reference [4]. First, to make the physical implementation convenient, the search area is extended by one pixel in both the vertical and horizontal directions, so the search area sizes of M0 and M1 are 64×32 and 64×16 pixels, respectively. Second, according to reference [4], because one PE array set is configured, there are 16 physical partitions for each of M0 and M1.

To simplify the explanation of the mapping algorithm, M0 is used as the illustration; the mapping of M1 can be derived by analogy. Logically, M0 is divided by column and each logical partition is 1 pixel wide. The l-th logical partition is mapped to the (l mod 16)-th physical partition and its start address is ⌊l/16⌋ × 32. The depth of each physical partition of M0 is 128. The logical-to-physical mapping procedure of M0 is shown in Fig. 5. This algorithm achieves high IO utilization with the minimum number of memory partitions. Consequently, the hardware cost and power dissipation of these on-chip memory modules are reduced.

3. EXPERIMENTAL RESULTS
Our design is described in Verilog HDL and synthesized with Synopsys Design Compiler (DC). The design specifications are: 640×480 frame size at 30Hz, VBSME support for H.264, 4 reference frames and a 48×32 search range. The on-chip memory is 96Kb and the clock speed is 227MHz.

The hardware cost and throughput of an ME engine are both affected by its clock speed. At a high clock speed, the synthesizer consumes more hardware to satisfy the timing constraints. To analyze the performance of our design, we synthesize the design under different clock frequencies and collect the corresponding hardware cost statistics. The detailed hardware statistics under the worst-case operating conditions (1.62V, 125°C) with 106MHz, 133MHz, 150MHz and 227MHz clock speeds are listed in Tab. 1. The “Min SAD” module contains the forty-one comparators that find the minimum SADs and the registers that keep the minimum SADs and the corresponding MVs. During the ME processing, the output of the current MB is stable. Based on this property, we apply a multi-cycle synthesis strategy to the paths through the outputs of the “Cur.MB” module. With these loose timing constraints, DC chooses minimal-area registers to implement the “Cur.MB” module. From Tab. 1, we can see that the area of “Cur.MB” is almost unaffected by the timing constraints.

The comparisons between our architecture and the original Propagate Partial SAD designs are shown in Tab. 2. Reference [3] uses the same technology as our design, but it only gives the hardware cost of “PE Array” and “Cur.MB”. Reference [2] gives the detailed hardware statistics of this architecture; however, it is implemented with 0.35µm technology and works at 66.67MHz. To make a fair comparison, we choose our synthesis result at 133MHz as the counterpart. Compared with the design in [3], which works at a lower clock speed, our architecture saves 12.1% of the hardware cost of the “PE Array” and “Cur.MB” modules. Compared with the design provided in [2], our design saves 19.8% of the total hardware cost.

The performance comparisons between the proposed architecture and previous designs, which include the hardware cost, clock speed and power dissipation, are listed in Tab. 3. Because these designs are implemented under different processes and timing constraints, the “performance hardware ratio” (PHR) [4] is adopted in this paper for a more meaningful comparison. A higher PHR score represents higher hardware efficiency. Under the same process technology, PHR accurately reflects the hardware efficiency of a design. Among the designs listed in Tab. 3, all but the Propagate Partial SAD design in [2] use the same or a more advanced technology than our design. Thus, PHR is a reasonable criterion for the performance comparison. The proposed architecture has a much higher PHR score than the others. Although the 1-D architecture has the smallest area, its PE number is sixteen, which is just one sixteenth of that of its counterparts, so it has the lowest hardware efficiency. Tab. 3 demonstrates that our architecture has the highest hardware and power efficiency.

Table 3: Comparisons with previous designs
Design                     PE Number  Technology  Clock (MHz)  Area PE Array&Cur.MB (gate)  Area Total (gate)  PHR PE Array&Cur.MB  PHR Total  Power (mW)
1-D [6]                    16         0.13µm      294          –                            61k                –                    77.1       573.4
2-D [5]                    256        0.18µm      100          –                            154k               –                    166.2      –
Propagate Partial SAD [2]  256        0.35µm      66.67        79k                          105.6k             216.0                161.7      737.3
Propagate Partial SAD [3]  256        0.18µm      110.8        81.5k                        –                  348.0                –          –
SAD Tree [3]               256        0.18µm      110.8        88.6k                        –                  320.1                –          –
Parallel Sub-Tree [4]      256        0.18µm      261          –                            151.8k             –                    440.2      484@200MHz
Ours @133MHz               256        0.18µm      133          71.6k                        84.7k              475.5                402.0      255.2
Ours @227MHz               256        0.18µm      227          80.5k                        94.3k              721.9                616.2      461.5
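PHR is defined in reference [4] rather than in this paper. The values in Tab. 3 are nevertheless consistent with the ratio PE number × clock (MHz) ÷ area (k gates), and the sketch below assumes that formula; it should be read as an inference from the table, not as the definition from [4]. The function names `phr` and `saving_percent` are hypothetical. The same snippet recomputes the 12.1% and 19.8% savings from the gate counts in Tab. 2.

```python
def phr(pe_number, clock_mhz, area_kgates):
    """Assumed PHR: PE number x clock (MHz) / area (k gates)."""
    return pe_number * clock_mhz / area_kgates


def saving_percent(reference_kgates, ours_kgates):
    """Relative area saving of our design against a reference design, in percent."""
    return 100.0 * (reference_kgates - ours_kgates) / reference_kgates


# (PE number, clock in MHz, total area in k gates), copied from Tab. 3.
totals = {
    "1-D [6]": (16, 294, 61),
    "Propagate Partial SAD [2]": (256, 66.67, 105.6),
    "Parallel Sub-Tree [4]": (256, 261, 151.8),
    "Ours @133MHz": (256, 133, 84.7),
    "Ours @227MHz": (256, 227, 94.3),
}

for name, (pe, clk, area) in totals.items():
    print(f"{name}: PHR (total) = {phr(pe, clk, area):.1f}")

print(f"PE Array&Cur.MB saving vs. [3]: {saving_percent(81.5, 71.6):.1f}%")
print(f"Total saving vs. [2]: {saving_percent(105.6, 84.7):.1f}%")
```

Under this assumed formula, the computed ratios agree with the PHR Total column of Tab. 3 to within about 0.1.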
4. CONCLUSIONS
A hardware-efficient Propagate Partial SAD VBSME architecture is presented in this paper. The PE array architecture is optimized by compressing the shift registers for partial SAD propagation and by optimizing the circuits of the PE and adder tree. The design is implemented with the TSMC 0.18µm CMOS 1P6M standard cell library. Compared with the original Propagate Partial SAD architecture, the proposed design saves 12.1% of the hardware cost. Under worst-case operating conditions (1.62V, 125°C), the maximum clock speed is 227MHz, which supports real-time processing of VGA-resolution frames with 4 reference frames at 30Hz, and the power consumption is 461.5mW.
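The real-time claim can be checked with a short calculation. The sketch below assumes that the engine evaluates one search position per clock cycle and ignores pipeline fill and memory loading; under that assumption it reproduces the throughput quoted in the abstract. The function names `mb_throughput` and `mb_demand` are hypothetical.

```python
def mb_throughput(clock_hz, search_w, search_h):
    """Macroblocks per second, assuming one search position per clock cycle."""
    return clock_hz // (search_w * search_h)


def mb_demand(frame_w, frame_h, fps, ref_frames, mb_size=16):
    """Macroblock searches per second required for real-time encoding."""
    mbs_per_frame = (frame_w // mb_size) * (frame_h // mb_size)
    return mbs_per_frame * fps * ref_frames


supply = mb_throughput(227_000_000, 48, 32)   # 227MHz clock, 48x32 search range
demand = mb_demand(640, 480, 30, 4)           # VGA, 30Hz, 4 reference frames
print(supply, demand, supply >= demand)
```

With these inputs the available throughput exceeds the VGA requirement by a margin of roughly 2.6%.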
5. ACKNOWLEDGMENTS
This work was supported by funds from MEXT via the Kitakyushu innovative cluster project.
6. REFERENCES
[1] J. Ostermann et al. Video coding with H.264/AVC: Tools, performance, and complexity. IEEE Circuits and Systems Magazine, 4(1):7–28, First Quarter 2004.
[2] Y. W. Huang, T. C. Wang, B. Y. Hsieh, and L. G. Chen. Hardware architecture design for variable block size motion estimation in MPEG-4 AVC/JVT/ITU-T H.264. In Proceedings of ISCAS 2003, volume 2, pages 796–799, May 2003.
[3] C. Y. Chen, S. Y. Chien, Y. W. Huang, T. C. Chen, T. C. Wang, and L. G. Chen. Analysis and architecture design of variable block-size motion estimation for H.264/AVC. IEEE Circuits and Systems I, 53(3):578–593, March 2006.
[4] Z. Y. Liu, Y. Song, T. Ikenaga, and S. Goto. A fine-grain scalable and low memory cost variable block size motion estimation architecture for H.264/AVC. IEICE Transactions on Electronics, E89-C(12):1928–1936, December 2006.
[5] M. Kim, I. Hwang, and S. I. Chae. A fast VLSI architecture for full-search variable block size motion estimation in MPEG-4 AVC/H.264. In Proceedings of ASP-DAC 2005, volume 1, pages 631–634, January 2005.
[6] S. Y. Yap and J. V. McCanny. A VLSI architecture for variable block size video motion estimation. IEEE Transactions on Circuits and Systems II: Express Briefs, 51(7):384–389, October 2004.
[7] J. Vanne, E. Aho, T. D. Hamalainen, and K. Kuusilinna. A high-performance sum of absolute difference implementation for motion estimation. IEEE Transactions on Circuits and Systems for Video Technology, 16(7):876–883, July 2006.
[8] C. Wallace. A suggestion for a fast multiplier. IEEE Transactions on Computers, 13(3):14–17, February 1964.