IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 20, NO. 9, SEPTEMBER 2010
A Hardware-Efficient Multi-Resolution Block Matching Algorithm and Its VLSI Architecture for High Definition MPEG-Like Video Encoders

Haibing Yin, Huizhu Jia, Honggang Qi, Xianghu Ji, Xiaodong Xie, and Wen Gao, Fellow, IEEE
Abstract—High throughput, heavy bandwidth requirements, huge on-chip memory consumption, and complex data flow control are major challenges in high definition integer motion estimation hardware implementation. This paper proposes an efficient very large scale integration architecture for integer multi-resolution motion estimation based on an optimized algorithm. There are three major contributions in this paper. First, it proposes a hardware-friendly multi-resolution motion estimation algorithm well-suited for high definition video encoders. Second, a parallel processing element (PE) array structure is proposed to implement three-level hierarchical motion estimation; through efficient PE reuse, only 256 PEs are enough for real-time high definition motion estimation over one reference frame. Third, an efficient on-chip reference pixel buffer sharing mechanism between integer and fractional motion estimation is proposed, with almost 50% SRAM saving and a corresponding memory bandwidth reduction. The proposed multi-resolution motion estimation algorithm reaches a good balance between complexity and performance with rate distortion optimized variable block size motion estimation support. Also, we achieve moderate logic circuit and on-chip SRAM consumption. The proposed architecture is well-suited for all MPEG-like video coding standards such as H.264, audio video coding standard, and VC-1.

Index Terms—Architecture, audio video coding standard (AVS), H.264, multi-resolution motion estimation, very large scale integration (VLSI), video coding.
I. Introduction
IN MULTIMEDIA systems, there are several video coding standards: MPEG-1/2/4, H.264 [1], VC-1 [2], and AVS [3]. Different coding tools and features are employed in the different standards. However, the crucial technologies in the different standards are very similar in coding and decoding framework. These similar standards are called MPEG-like video coding standards. The audio video coding standard (AVS) video part (AVS-P2) [3] is also an MPEG-like video standard of China, approved in 2006. With the fast development of microelectronic technology, multimedia applications with high definition (HD) video coding and decoding are increasingly popular. Dedicated ASIC HD video encoders are highly desired to deliver the huge throughput and computation. Several works were reported on 720P or 1080P H.264 encoder very large scale integration (VLSI) implementations [4]–[9]. Nevertheless, further algorithm and architecture optimizations are desired to achieve an optimal balance among circuit area, rate distortion performance, and power consumption.

Motion estimation (ME) is the most complex module in an MPEG-like video encoder. Real-time ME implementation for an HD video encoder is challenging due to not only the large search window (SW) to cover, but also new tools such as variable block size ME (VBSME), multiple reference frames, and fractional pixel motion estimation. A performance and complexity jointly optimized integer motion estimation (IME) engine is the biggest challenge in HD video encoder architecture [4], [10]. The full search block matching (FSBM) algorithm is widely used in hardware architectures due to its superior performance and high regularity. Despite these advantages, system throughput burden, memory bandwidth, and hardware cost are huge challenges in HD FSBM based ME architecture due to the large search window size requirement. Many fast ME algorithms were proposed in the literature, but many are ill-suited for VBSME hardware implementation due to performance degradation, complex control, or irregular memory access [10]. Hierarchical ME algorithms such as three-step search (TSS), new TSS, and four-step search are well-suited for VLSI implementation with high regularity and fast search speed.

Manuscript received December 2, 2009; revised March 1, 2010; accepted March 31, 2010. Date of publication July 26, 2010; date of current version September 9, 2010. This work was supported by the National Natural Science Foundation of China, under Project 60802025, the 973 Project, under Project 2009CB320900, the China Postdoc Science Foundation, under Project 200902015, and the Open Project of Zhejiang Provincial Key Laboratory of Information Network Technology, Zhejiang University. This paper was recommended by Associate Editor G. Lafruit. H. Yin was with the National Engineering Laboratory for Video Technology, Peking University, Beijing 100871, China. He is now with China Jiliang University, Hangzhou, China (e-mail: [email protected]; [email protected]). H. Jia, X. Ji, X. Xie, and W. Gao are with the National Engineering Laboratory for Video Technology, Peking University, Beijing 100871, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]). H. Qi is with the Graduate University of Chinese Academy of Sciences, Beijing 100049, China (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCSVT.2010.2058476
However, they suffer from nontrivial performance loss in HD cases. The multi-resolution ME algorithm (MMEA) is a good choice for VLSI implementation to achieve a good balance between performance and complexity in HD cases [29]. This paper therefore focuses on MMEA based cost-effective ME hardware implementation. The rest of this paper is arranged as follows. The background and challenges are presented in Section II. Hardware
oriented MMEA and the problem formulation are given in Section III. The proposed VLSI architecture is presented in Section IV. Parameter selection and simulation results are given in Section V. Finally, the conclusion is drawn in Section VI.

II. Integer Motion Estimation VLSI Architecture Design Challenges

In this section, we make an in-depth investigation of the challenges of IME algorithm and architecture design for an HD video encoder chip. Then, the major factors to be considered in MMEA based architecture design are analyzed.

A. Challenges

There are four challenges in HD IME architecture design, and they are analyzed in turn as follows. First, high processing throughput is the largest challenge in HD IME hardware implementation. In general, the average MB pipeline interval should be no larger than 1000 cycles [4]–[9]. It is a real challenge to cover a large search window under such a critical constraint. To solve this problem, an eight-way parallel processing element (PE) array was adopted in [4] for FSBM to cover the 128 × 64 search window. Each PE array has 256 or 128 PEs. If this structure were simply applied in a main profile 1080P encoder, even more PE arrays would be required.

Second, external memory bandwidth is another challenge. 64-bit DDR SDRAM external memory is usually used in video encoder architectures for temporary data storage. The luminance and chrominance reference pixel reads are the largest bandwidth consumers, accounting for almost 80% of the total. Moreover, multiple reference frame ME is supported in both Jizhun profile AVS and main profile H.264. It directly doubles the bandwidth consumption and greatly aggravates the bandwidth burden.

Third, huge on-chip RAM consumption for reference pixel buffering is another challenge. In typical H.264 video encoder architectures, reference pixels in the whole search window are simultaneously buffered in on-chip SRAM buffers for IME and fractional ME (FME) to avoid redundant external memory access [4]–[9].
In these works, dual-port SRAM or double-buffered single-port SRAM is used to achieve data sharing between IME and FME. Dual-port and double-buffered SRAMs both consume roughly double the logic gates compared with single-port SRAM. Thus, efficient sharing of reference luminance pixels between IME and FME using single-port SRAM is highly desired.

Fourth, reference pixel data flow control is also an important challenge. Relatively speaking, the data flow in FSBM is regular and simple. Some intelligent ME algorithms suffer from irregular data flow, and thus they are ill-suited for hardware implementation. TSS and multi-resolution ME algorithm based architectures are also regular. However, the corresponding data flow controls are still complex. Format transformation is necessary between data in the on-chip buffer and data in external memory. Also, the data structure should be arranged to match the PE array structure. On-chip register arrays with intelligent shifting operations are necessary to collaborate with the pixel buffer and PE arrays.
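To give a feel for the magnitudes behind these four challenges, the following back-of-envelope sketch estimates the search-window SRAM footprint and the external traffic of a naive per-MB reload scheme for the paper's target range (SRX = 128, SRY = 96, stated in Section III-B); the +16 padding, 1080p geometry, and 30 frame/s rate are illustrative assumptions, not figures from the paper.

```python
# Back-of-envelope estimate of the challenge magnitudes; the +16 MB padding,
# 1080p geometry, and 30 frame/s rate are illustrative assumptions.
SRX, SRY = 128, 96               # target search range (Section III-B)
MB = 16                          # macroblock size in pixels

sw_bytes = (2 * SRX + MB) * (2 * SRY + MB)   # one luma SW copy, 8 bit/pixel
mbs_per_frame = (1920 // MB) * (1080 // MB)  # 1080p macroblocks
# Naive scheme: reload the whole SW for every MB, one reference frame, 30 f/s
naive_traffic = sw_bytes * mbs_per_frame * 30
print(sw_bytes, naive_traffic // 10**9)      # ~55 KB on-chip; ~13 GB/s naive
```

Even this crude model shows why SW reuse schemes such as level C+ (discussed below) and single-port buffer sharing are indispensable: the naive traffic alone would saturate a 64-bit DDR interface.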
An extensive exploration of IME algorithms for hardware implementation was made in [10]. In this review, FSBM [11]–[17] and fast block matching algorithm [18]–[25] based VLSI architectures derived from inter-type and intra-type systolic mapping with 1-D, 2-D, and tree structures were reviewed and compared in detail. Six aspects, including gate count, clock frequency, hardware utilization, memory bandwidth, memory bit-width, and sum of absolute differences (SAD) latency, were used as the comparison criteria. In the last three to four years, many scholars have also worked on ME architectures; typical works are shown in [28]–[33]. Hardware cost, SRAM, memory bandwidth, throughput and working frequency, power consumption, search accuracy and performance, and control complexity are all important factors to be considered in IME architecture design. It is difficult to satisfy all these constraints and reach an optimal balance among them. Thus, it is necessary to make an in-depth investigation of algorithm level and architecture level optimization to balance these mutually exclusive factors, especially in HD applications with large search range and high throughput. In HD video encoders, MMEA is a better choice for hardware implementation to achieve a good balance between performance and complexity.

B. Design Consideration in MMEA Based VLSI Architecture

In three-level MMEA [26], [27], [32], hierarchical refinements of the search are performed from the coarsest level to the finest level successively. Although MMEA reduces complexity and throughput greatly, the resulting performance degradation is not negligible due to down-sampling. Multiple candidate refinement centers are widely combined with MMEA to avoid being trapped in local minima [26]. This measure improves the accuracy to some extent. However, the advantages are seriously challenged when RDO based VBSME and the HD factor are both considered.
Also, multi-resolution ME search and efficient SAD reuse for VBSME are difficult to achieve with negligible performance degradation. First, VBSME contributes remarkably to the performance superiority of H.264 and AVS. How to combine VBSME with MMEA thus becomes a key problem. For blocks no larger than 8 × 8, the down-sampled versions at coarse levels are very small. For example, at the coarsest level, only 4 pixels of an 8 × 8 block participate in the SAD calculation. Too few pixels in small blocks at the coarsest level result in untrustworthy SAD values. Thus, VBSME is ill-suited to be implemented at coarse levels. As a result, VBSME is usually performed only at the finest level, within a local search window (LSW) centered about the winner selected at the middle level. The position and size of the LSW are very important to sustain the superiority of VBSME.

Second, throughput is the largest challenge for IME hardware implementation in HD cases. Multiple processing element arrays are indispensable for real-time pipelining. However, simply multiplying PE arrays will result in a dramatically increased circuit area. Thus, more efficient parallelism of PE arrays is highly desired.

Third, RDO based ME is recommended in AVS and H.264, although it is not mandated. The cost function weighted SAD
(WSAD), including SAD and motion vector (MV) coding bit consumption, is used as the matching criterion. RDO based ME can achieve superior coding performance. In MMEA, however, the MV coding bit measure at coarse levels is relatively untrustworthy. This aggravates the WSAD uncertainty at the coarsest level, possibly resulting in being trapped in a local minimum.

Lin et al. [29] proposed a parallel MMEA in which three hierarchical levels were simultaneously searched with parallel PE arrays. The search speed and the SRAM consumption were very encouraging. A data sharing percentage of at least 90% was reported [29]. However, on sharing misses the reference pixel data have to be reloaded from external memory. The miss rate of direct mode in P and B frames was not considered. Unfortunately, the miss rate of direct mode is usually high, and direct mode is the most important coding mode with the highest probability. This irregular data access complicates the external memory access control, especially on sharing misses in variable size blocks. Also, the miss rate and the performance degradation were obvious in sequences with irregular motion. Other MMEA architectures all perform MV refinement from the coarsest level to the finest level sequentially [26], [27], [32]. This mechanism is also adopted in this paper, and measures are taken to address the three challenges analyzed above.

Data reuse schemes for motion estimation and memory bandwidth analysis were elaborated in [34] and [35]. The memory bandwidth burden is greatly alleviated by the level C+ data reuse scheme [35]. The HF2V3 scan mode level C+ data reuse scheme is adopted in this paper to alleviate the memory access burden. Search accuracy, circuit consumption, memory bandwidth, SRAM consumption, and data flow regularity will be jointly considered for IME algorithm and architecture optimization.

III. Hardware Oriented Multi-Resolution Motion Estimation Algorithm

A.
Pixel Organization in Multi-Resolution Motion Estimation

Three-level MMEA is performed from the coarsest 16:1 down-sampled level (L2) to the finest unsampled level (L0) sequentially. Direct down-sampling is used for efficient data and circuit reuse among adjacent levels. The original MB and all reference pixels in the whole search window are down-sampled into three resolutions and 16-way interlaced groups to coincide with the parallel PE structure. We take the original MB as the down-sampling example. The raw 256 pixels of the original MB (at the finest level L0) are shown in Fig. 1(a). They are 4:1 down-sampled into four 8 × 8 blocks (the middle level L1) indexed by m and n. To simplify the illustration, the pixels in the four 8 × 8 blocks are, respectively, marked using different symbols [× (mn = 00), • (mn = 01), and two further symbols (mn = 10 and mn = 11)] as shown in Fig. 1(b). Similarly, each 8 × 8 block at level L1 is 4:1 down-sampled into four 4 × 4 subblocks (the coarsest level L2) indexed by p and q. The pixels in the four 4 × 4 subblocks of each 8 × 8 block are marked using red, blue, green, and black colors as shown in Fig. 2(a)–(d), respectively. As a result, the pixels at level L0 are down-sampled into 16 interlaced subblocks
Fig. 1. Down-sampling from level L0 to level L1 . (a) 256 pixels in a macroblock. (b) 64 pixels in four 8 × 8 blocks.
Fig. 2. Pixels of the 16 groups at level L2 . (a) 16 pixels in block 00. (b) Block 01. (c) Block 10. (d) Block 11. (e) Four 2 × 2 blocks in one 4 × 4 block.
indexed by mnpq and marked using different symbol and color combinations. Similarly, the whole reference search window is also down-sampled into 16 interlaced versions indexed by mnpq. Each subblock is partitioned into four 2 × 2 granules indexed by rs, as shown in Fig. 2(e). This partition is adopted for data and circuit reuse in the hardware architecture to implement the seamless combination of VBSME and MMEA.

Suppose C^{L0}(i, j) and R^{L0}(i, j) are the pixels at spatial location (i, j) at level L0 of the original MB and the reference SW, respectively. C^{L1}_{mn}(i, j) is the pixel at spatial location (i, j) at level L1 indexed by mn, described at level L0 as follows:

C^{L1}_{mn}(i, j) = C^{L0}(i × 2 + m, j × 2 + n).   (1)

C^{L2}_{mnpq}(i, j) is the pixel at location (i, j) at level L2 indexed by mnpq, described as follows:

C^{L2}_{mnpq}(i, j) = C^{L0}((i × 2 + m) × 2 + p, (j × 2 + n) × 2 + q).   (2)

Similarly, R^{L1}_{mn}(i, j) and R^{L2}_{mnpq}(i, j) are the reference pixels at spatial location (i, j) at levels L1 and L2.
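As a sanity check on the index mappings in (1) and (2), the direct down-sampling can be modeled in a few lines of NumPy; the function and array names and the use of strided slicing are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Illustrative model of the direct down-sampling in (1)-(2); names and the
# NumPy slicing idiom are assumptions for clarity only.
def downsample_L1(mb):
    # C_mn^L1(i, j) = C^L0(2i + m, 2j + n): four interlaced 8x8 blocks
    return {(m, n): mb[m::2, n::2] for m in range(2) for n in range(2)}

def downsample_L2(mb):
    # C_mnpq^L2(i, j) = C^L0(4i + 2m + p, 4j + 2n + q): sixteen 4x4 subblocks
    return {(m, n, p, q): mb[2 * m + p::4, 2 * n + q::4]
            for m in range(2) for n in range(2)
            for p in range(2) for q in range(2)}

mb = np.arange(256, dtype=np.uint8).reshape(16, 16)  # a dummy 16x16 MB
L1 = downsample_L1(mb)
L2 = downsample_L2(mb)
assert L1[(1, 0)].shape == (8, 8) and L2[(1, 0, 1, 1)].shape == (4, 4)
# Every L0 pixel lands in exactly one of the 16 interlaced L2 subblocks.
assert sum(int(b.sum()) for b in L2.values()) == int(mb.sum())
```

The strided slices mirror the hardware's interlaced pixel grouping: each of the 16 subblocks touches a disjoint quarter-resolution lattice of the original MB.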
B. Proposed Multi-Resolution Motion Estimation Algorithm

The proposed three-level MMEA is illustrated in Fig. 3 and described as follows. Suppose the whole integer pixel SW is [−SRX, SRX] × [−SRY, SRY]; our target SW is SRX = 128
and SRY = 96. A small SW of [−32, 32] × [−32, 32] is used as the example here due to display resolution limitations.

1) Full Search at IME Stage 1: Full search is performed at IME stage 1 to check all candidate MVs at level L2, shown as black points in Fig. 3(a). To accelerate the search, 16-way parallel motion searches are performed using the 16 interlaced down-sampled pixel groups [original C^{L2}_{mnpq}(i, j) and reference R^{L2}_{mnpq}(i, j)]. The 16 processing element arrays (PEA) PEA_mnpq are employed to perform the parallel searches. In each PEA, there are 16 processing elements to perform the SAD calculation for each subblock. The PEA architecture will be given in Section IV. All candidate MVs at level L2 are divided into 16 subareas (Ω^{L2}_{mnpq}) indexed by mnpq, shown in Fig. 4(a). PEA_mnpq implements motion matching for the MVs in Ω^{L2}_{mnpq}. The 16-way parallel searches of the 16 PEAs achieve a throughput of 16 MVs per cycle at level L2.

The basic hardware unit in the proposed architecture is the PEA, which is based on the 4 × 4 subblocks at level L2 indexed by mnpq as shown in Fig. 2(a)–(d). The basic SAD of subblock mnpq (SAD_mnpq) is defined by

SAD((u, v)^{L2}) = SAD_{mnpq}((u, v)^{L2}) = Σ_{r=0}^{1} Σ_{s=0}^{1} SAD_{mnpq,rs}((u, v)^{L2}, 0, 0)   (3)

where (u, v)^{L2} is one motion vector at level L2, and u and v are the horizontal and vertical MV components. (u, v)^{L2} is mapped to PEA_mnpq and Ω^{L2}_{mnpq} according to the following rule:

mn = 00 if 4u ∈ [−SRX, −SRX/2); 01 if 4u ∈ [−SRX/2, 0); 10 if 4u ∈ [0, SRX/2); 11 if 4u ∈ [SRX/2, SRX]
pq = 00 if 4v ∈ [−SRY, −SRY/2); 01 if 4v ∈ [−SRY/2, 0); 10 if 4v ∈ [0, SRY/2); 11 if 4v ∈ [SRY/2, SRY].   (4)

The down-sampled pixels in the subblocks indexed by mnpq are used for the SAD calculation for (u, v)^{L2}. (xoff, yoff) is the pixel offset due to the misalignment of the MVs at levels L1 and L0 for SAD reuse. SAD_mnpq is the summation of the SADs of the four 2 × 2 granules SAD_{mnpq,rs}, described by

SAD_{mnpq,rs}((u, v)^{L2}, xoff, yoff) = Σ_{x=2r}^{2r+1} Σ_{y=2s}^{2s+1} |C^{L2}_{mnpq}(x0 + x, y0 + y) − R^{L2}_{mnpq}(x0 + x + u + xoff, y0 + y + v + yoff)|.   (5)

Here, (x0, y0) is the spatial location of the top-left pixel of the original subblock mnpq. RDO based MMEA is adopted and combined with VBSME in this paper. The actual block matching criterion is the WSAD, defined as follows:

WSAD_{mnpq}((u, v)^{L2}) = SAD_{mnpq}((u, v)^{L2}) + λ_motion × R_MV((4u, 4v)^{L0} − mvp^{L0}_{16×16}).   (6)

The MV (u, v)^{L2} at level L2 is mapped to level L0 as (4u, 4v)^{L0}, and mvp^{L0}_{16×16} is the predicted MV of the current MB at level L0. R_MV is the bit calculation function for delta MV
Fig. 3. Three level hierarchical integer multi-resolution motion estimation. (a) IME Stage 1. (b) IME Stage 2. (c) IME Stage 3.
coding, and λ_motion is the Lagrange multiplier for the cost function calculation. Sixteen locally optimal MVs (MV^{L2}_{mnpq}) are obtained by the 16-way parallel search by minimizing the WSAD as

MV^{L2}_{mnpq} = arg min_{(u, v)^{L2} ∈ Ω^{L2}_{mnpq} && Restriction1} WSAD_{mnpq}((u, v)^{L2}).   (7)

The first restriction in (7) states that the MV (u, v)^{L2} belongs to the subarea Ω^{L2}_{mnpq} associated with PEA_mnpq. The second restriction, Restriction1, will be analyzed in Section III-D. The 16 optimal MVs at level L2 (MV^{L2}_{mnpq}) form the optimal MV subset Φ^{L2} of level L2, in which the index mnpq varies from 0000 to 1111. Multiple candidate MVs are used as the refinement centers of level L1 in this paper. Four winners (MV^{L1}_{Cmn}) are selected from Φ^{L2} according to the rules described in (8), and they are shown with four red circles in Figs. 3(a) and 4(a). Here, Restriction2 and Restriction3 will also be analyzed in Section III-D. The predicted MV is used as the fourth refinement center MV^{L1}_{C11} to compensate for the case of a candidate MV being missed by the level L2 search due to aggressive down-sampling:

MV^{L1}_{C00} = 2 × arg min_{MV^{L2}_{mnpq} ∈ Φ^{L2}} WSAD_{mnpq}(MV^{L2}_{mnpq})
MV^{L1}_{C01} = 2 × arg min_{MV^{L2}_{mnpq} ∈ Φ^{L2} − {MV^{L1}_{C00}/2} && Restriction2} WSAD_{mnpq}(MV^{L2}_{mnpq})
MV^{L1}_{C10} = 2 × arg min_{MV^{L2}_{mnpq} ∈ Φ^{L2} − {MV^{L1}_{C00}/2, MV^{L1}_{C01}/2} && Restriction3} WSAD_{mnpq}(MV^{L2}_{mnpq})
MV^{L1}_{C11} = MVp, predicted from spatio-temporally adjacent MVs using MV correlation.   (8)
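To make the stage-1 matching criterion of (3)–(7) concrete, the sketch below evaluates the WSAD full search for one 4 × 4 subblock in plain Python. The signed exp-Golomb code length standing in for R_MV, the fixed mvp = (0, 0), λ = 4, and the synthetic reference data are all assumptions for illustration; the paper defines R_MV and λ_motion only symbolically.

```python
import numpy as np

# Sketch of the level-L2 WSAD search of (3)-(7) for one 4x4 subblock.
# The exp-Golomb MV-cost model and all numeric parameters are assumptions.
def golomb_bits(v):
    k = 2 * abs(v) - (1 if v > 0 else 0)       # signed exp-Golomb index
    return 2 * ((k + 1).bit_length() - 1) + 1  # ue(k) code length in bits

def wsad(cur, ref, mv, mvp, lam, org=(8, 8)):
    u, v = mv
    y0, x0 = org
    block = ref[y0 + v:y0 + v + 4, x0 + u:x0 + u + 4]
    sad = int(np.abs(cur.astype(int) - block.astype(int)).sum())  # eqs. (3)/(5)
    # eq. (6): level-L2 MVs are scaled by 4 when mapped back to level L0
    bits = golomb_bits(4 * u - mvp[0]) + golomb_bits(4 * v - mvp[1])
    return sad + lam * bits

rng = np.random.default_rng(0)
ref = rng.integers(0, 256, (24, 24), dtype=np.uint8)
cur = ref[10:14, 9:13].copy()        # plant a true match at (u, v) = (1, 2)
best = min((wsad(cur, ref, (u, v), (0, 0), 4), (u, v))
           for u in range(-4, 5) for v in range(-4, 5))
print(best[1])                        # prints (1, 2)
```

The planted block makes the SAD term vanish at (1, 2), so the minimum WSAD there is purely the λ-weighted MV cost, illustrating how (6) trades distortion against MV bits.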
2) Local Full Search at IME Stage 2: MV refinement at level L1 is performed at IME stage 2, illustrated in Fig. 3(b), in which four-way full searches at level L1 are performed by employing the four PEA subsets (PEAS) PEAS_mn, respectively. The local SW of level L1 (Ω^{L1}) consists of four local small SWs (Ω^{L1}_{00}, Ω^{L1}_{01}, Ω^{L1}_{10}, and Ω^{L1}_{11}), as shown in Fig. 4(b) and defined by

Ω^{L1} = Ω^{L1}_{00} ∪ Ω^{L1}_{01} ∪ Ω^{L1}_{10} ∪ Ω^{L1}_{11}.   (9)

Fig. 4. Illustration of the relationship among Ω^{L2}_{mnpq}, Ω^{L1}_{mn}, and Ω^{L0}_{VBSME}. (a) Sixteen small windows at level L2. (b) Four local windows at level L1.

Here, the local SW Ω^{L1}_{mn} with size [−SRX_{L1}, SRX_{L1}] × [−SRY_{L1}, SRY_{L1}] is defined by

Ω^{L1}_{mn} = {(u, v)^{L1} | −SRX_{L1} ≤ u − MVX^{L1}_{Cmn} ≤ SRX_{L1}, −SRY_{L1} ≤ v − MVY^{L1}_{Cmn} ≤ SRY_{L1}}.   (10)

PEAS_mn is arranged to perform full search over all candidate MVs at level L1 in the four local small SWs (Ω^{L1}_{mn}) centered about the corresponding centers MV^{L1}_{Cmn}. The four refinement center MVs (MV^{L1}_{C00}, MV^{L1}_{C01}, MV^{L1}_{C10}, and MV^{L1}_{C11}) and their corresponding local SWs (Ω^{L1}_{00}, Ω^{L1}_{01}, Ω^{L1}_{10}, and Ω^{L1}_{11}) are shown in Fig. 4(b). The 16 PEAs are combined to implement four PEASs, resulting in a throughput of four candidate MVs per cycle at level L1. Similarly, the SAD of MV (u, v)^{L1} at level L1 is given by

SAD((u, v)^{L1}) = SAD_{mn}((u, v)^{L1}) = Σ_{r=0}^{1} Σ_{s=0}^{1} SAD_{mn,rs}((u, v)^{L1}).   (11)

(u, v)^{L1} is mapped into the local small SW Ω^{L1}_{mn}, and the 64 pixels in the 8 × 8 block indexed by mn at level L1 participate in the SAD calculation for block matching. SAD_{mn,rs} is the summation of the SAD_{mnpq,rs} of four 4 × 4 subblocks, defined by

SAD_{mn,rs}((u, v)^{L1}) = Σ_{p=0}^{1} Σ_{q=0}^{1} SAD_{mnpq,rs}((u/2, v/2)^{L2}, xoff_{L1}, yoff_{L1}).   (12)

Similarly, the WSAD of level L1 is described as follows:

WSAD_{mn}((u, v)^{L1}) = SAD_{mn}((u, v)^{L1}) + λ_motion × R_MV((2u, 2v)^{L0} − mvp^{L0}_{16×16}).   (13)

WSAD is used as the matching criterion. Only one optimal MV is selected by full search within Ω^{L1}. The winner MV at level L1 is used as the refinement center of level L0 (MV^{L0}_{cen}), which is shown with a blue circle in Figs. 3(b) and 4(b) and given by

MV^{L0}_{cen} = 2 × arg min_{(u, v)^{L1} ∈ Ω^{L1}} WSAD_{mn}((u, v)^{L1}).   (14)

3) VBSME at IME Stage 3: In this paper, VBSME is performed at IME stage 3 at level L0, as shown in Fig. 3(c), only within a well-selected local search window (Ω^{L0}_{VBSME}) centered about MV^{L0}_{cen} with size [−SRX_{L0}, SRX_{L0}] × [−SRY_{L0}, SRY_{L0}]:

Ω^{L0}_{VBSME} = {(u, v)^{L0}_{VBSME} | −SRX_{L0} ≤ u − MVX^{L0}_{cen} ≤ SRX_{L0}, −SRY_{L0} ≤ v − MVY^{L0}_{cen} ≤ SRY_{L0}}.   (15)

Although VBSME is performed within the LSW instead of the whole SW, the resulting performance degradation is negligible if the LSW size (SRX_{L0}, SRY_{L0}) is large enough, which will be verified in Section V-A. The 16 PEAs are simultaneously employed for SAD reuse to achieve a throughput of one MV per cycle at level L0. The WSAD of a block of a given block size is given by

WSAD_{VBS}((u, v)^{L0}_{VBS}) = SAD_{VBS}((u, v)^{L0}_{VBS}) + λ_motion × R_MV((u, v)^{L0}_{VBS} − mvp^{L0}_{VBS}).   (16)

Here (u, v)^{L0}_{VBS} is the candidate MV at level L0. Full search is adopted for VBSME within the LSW Ω^{L0}_{VBSME} at level L0, and the resulting optimal MV (MV^{L0}_{VBS}) is given as follows:

MV^{L0}_{VBS} = arg min_{(u, v)^{L0}_{VBS} ∈ Ω^{L0}_{VBSME}} WSAD((u, v)^{L0}_{VBS}).   (17)

VBS is the block type, which may be 8 × 8_rs, 16 × 8_s, 8 × 16_r, or 16 × 16; the corresponding WSADs are derived from block level SAD (SAD_{8×8,rs}) reuse, defined by

SAD_{8×8,rs}((u, v)^{L0}) = Σ_{m=0}^{1} Σ_{n=0}^{1} Σ_{p=0}^{1} Σ_{q=0}^{1} SAD_{mnpq,rs}((u/4, v/4)^{L2}, xoff_{L0}, yoff_{L0}).   (18)

The cycle consumption of the three levels in the proposed MMEA is analyzed as follows:

cycle_IME = (2 × SRX) × (2 × SRY) / (16 × 16) + (2 × SRX_{L1} + 1) × (2 × SRY_{L1} + 1) + (2 × SRX_{L0} + 1) × (2 × SRY_{L0} + 1).   (19)

The total cycles cycle_IME consumed for each MB IME can be controlled by adjusting the SW parameters of the three levels. The proposed parallel MMEA achieves fast search speed, and cycle_IME can be restricted to the range from 600 to
1000. Detailed parameter selection will be given in Section V-A. This fast search speed is very important and desirable for low-power hardware implementation in HD video coding.

C. Local Buffer Sharing for VBSME Between IME and FME

As shown in Fig. 3(c), VBSME is performed at IME stage 3 at level L0 within the local SW (LSW_VBSME); all MVs in LSW_VBSME are labeled Ω^{L0}_{VBSME}. The center of LSW_VBSME is the winner MV of level L1, and the size of LSW_VBSME is [−SRX_{L0}, SRX_{L0}] × [−SRY_{L0}, SRY_{L0}]. FME contributes significantly to the coding performance improvement, but its computation consumption is drastically high. The optimal integer pixel MVs of all MB partition modes are determined at the IME VBSME stage by WSAD reuse. At the FME stage, 1/2 and 1/4 pixel MVs are refined sequentially, centered about these integer pixel MVs.

As analyzed in Section II, the on-chip SRAM consumption for the reference pixels in the SW is very high in main profile H.264 and Jizhun profile AVS with B frame support. The reference pixels are simultaneously needed at the IME and FME stages because IME and FME occupy adjacent pipeline stages. To decrease the on-chip SRAM consumption for reference pixel buffering, we propose an efficient SW buffer sharing mechanism between IME and FME. Strong correlations exist among the MVs of different size blocks in the same MB. If this assumption is valid, there must exist a local SW (LSW) that contains almost all of the displaced blocks that the whole SW would provide for FME refinement. As a result, FME only needs to be performed within this LSW. However, there inevitably exist exceptional MVs that are beyond the LSW; these would have been kept in the whole SW case. Almost all optimal MVs can be tracked accurately if the following two conditions are satisfied: one is that the center of the LSW is accurate enough, and the other is that the LSW size is large enough.

If the above two conditions are satisfied, the exceptional MVs are very rare. Exceptional MVs can be skipped and replaced by suboptimal MVs within the LSW if the performance degradation is small enough. During IME stage 3, in which VBSME is implemented by full search at level L0, the pixels in the LSW are simultaneously transferred to the ping-pong LSW buffer for the next MB FME pipelining. As a result, only a single-port original SW buffer and a ping-pong LSW buffer shared between IME and FME are needed, and nearly 50% buffer saving is achieved. Verification of the above assumption can be found in our previous work [36].

The LSW center is also important. In this paper, MV^{L0}_{cen} in (14) is used as the LSW center for IME and FME data sharing; it is obtained at IME level L1 and is also the center of VBSME at level L0. Also, performance degradation is sensitive to the LSW size. In this paper, the LSW actually used for data sharing between IME and FME is named LSW_share; its size [−SRX_share, SRX_share] × [−SRY_share, SRY_share] is adjustable and no smaller than that of LSW_VBSME. The relationship between LSW_share and LSW_VBSME is shown in Fig. 3(c). LSW_VBSME is the local SW for VBSME at level L0, while LSW_share is the local SW for reference pixel data sharing between IME and FME.
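As a quick numerical check of the stage-wise cycle budget in (19) above, the arithmetic can be evaluated directly; the level-wise refinement window sizes below are hypothetical placeholders, since the actual parameters are only selected in Section V-A.

```python
# Cycle budget of (19) with hypothetical refinement window parameters
# (the real values are chosen in Section V-A of the paper, not here).
SRX, SRY = 128, 96          # full search range, the paper's target SW
SRX_L1 = SRY_L1 = 8         # assumed local refinement range at level L1
SRX_L0 = SRY_L0 = 8         # assumed local VBSME range at level L0

level2 = (2 * SRX) * (2 * SRY) // (16 * 16)   # 16 MVs/cycle, 16:1 sampling
level1 = (2 * SRX_L1 + 1) * (2 * SRY_L1 + 1)  # four windows searched in parallel
level0 = (2 * SRX_L0 + 1) * (2 * SRY_L0 + 1)  # one MV per cycle
cycle_ime = level2 + level1 + level0
print(cycle_ime)  # 192 + 289 + 289 = 770, inside the 600-1000 budget
```

With these assumed windows the total lands comfortably inside the 600–1000 cycle MB pipeline interval cited in Section II-A.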
Fig. 5. Ping-pong LSW buffer structure and its data organization. (a) 256 pixels stored in interlaced manner. (b) Normal 256 pixels. (c) Ping-pong LSW buffer data organization.
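A short sizing sketch for this ping-pong LSW buffer, using the (2 × SRX_share + 22) × (2 × SRY_share + 22) dimensions and the 128-bit word width given in this section; the SRX_share/SRY_share values themselves are hypothetical examples, not the parameters chosen in the paper.

```python
# SRAM sizing sketch for the ping-pong LSW buffer; SRX_share/SRY_share are
# hypothetical example values, not the parameters selected in the paper.
SRX_share, SRY_share = 24, 16
width = 2 * SRX_share + 22    # +22 likely covers the 16-pixel MB plus filter margin
height = 2 * SRY_share + 22
pixels = width * height                  # one LSW copy, 8 bits per pixel
words_128 = -(-width // 16) * height     # 128-bit words: 16 pixels per cycle
total_bytes = 2 * pixels                 # ping-pong: two buffers alternate
print(width, height, pixels, total_bytes)
```

Even with this generous example window, the two LSW copies stay under 8 KB, which is how the scheme achieves its roughly 50% saving relative to double-buffering the whole SW.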
On the one hand, we want the LSW to be as large as possible to decrease the probability of exceptional MVs: a larger LSW results in less erroneous exclusion of skip/direct MVs. Hence, the larger the LSW used for IME and FME reference pixel data sharing, the lower the performance degradation. On the other hand, a small LSW is desired to alleviate on-chip SRAM consumption and the throughput burden of data transfer between IME and FME. Thus, it is necessary to reach a good balance between the LSW size (LSW_share) and performance degradation. This relationship will be analyzed in Section V-A.

The reference pixels in the LSW are buffered from the 16-way interlaced reference SW buffers (RSWB) into the ping-pong LSW buffer during IME stage 3 at level L0 for the next MB pipelining. To facilitate both data access during the FME stage and LSW buffer refreshment, 16 × 8 = 128 bit wide RAM is used to implement the ping-pong LSW buffer, whose size is (2 × SRX_share + 22) × (2 × SRY_share + 22). Sixteen pixels of one line are buffered into this ping-pong buffer in each cycle. The address mapping is illustrated in Fig. 5(c), and examples of pixel interlacing and de-interlacing are illustrated in Fig. 5(a) and (b).

D. Intelligent Multiple Center MV Selection for MMEA

Performance degradation is inevitable in MMEA due to the degraded validity of the SAD term in the WSAD. As an example, a typical WSAD surface over a 128 × 128 window of the 720P Sailormen sequence is shown at the three levels in Fig. 6. Here, (a) and (b) are the 3-D and 2-D versions of the WSAD surface at level L2, (c) and (d) are the corresponding WSAD surfaces at level L1, and (e) and (f) are the results at level L0. It is obvious that the WSAD at level L2 is coarser than those at the two finer levels. This precision degradation decreases the reliability of the selected candidate MVs. The problem can be mitigated using multiple candidates. However, the MVs with minimal WSAD may be very close to each other. MV refinement at level L1 would then often be performed centered about these close candidate
Fig. 6. Typical WSAD surface of three hierarchical levels. (a) WSAD at level L2. (b) WSAD image of level L2. (c) WSAD at level L1. (d) WSAD image of level L1. (e) WSAD at level L0. (f) WSAD image of level L0.
MVs. As a result, redundant refinement usually occurs, and the expected advantage of multiple candidate MVs is largely weakened. We propose an intelligent multiple candidate MV selection algorithm to compensate for this deficiency. FSBM is first performed at level L2 and all MVs are indexed in ascending order of WSAD. Then, candidate MVs are selected under two constraints. First, the WSAD of the selected candidates should be as small as possible. Second, the distance between every two selected candidates should be larger than a threshold, to bypass redundant refinement at level L1. This is implemented by the three restrictions (20)–(22), respectively defined as follows:

Restriction 1: (u, v)_L2 ∉ Z(MVcen11_L1 / 2)    (20)
Restriction 2: MVmnpq_L2 ∉ Z(MVcen00_L1 / 2)    (21)
Restriction 3: MVmnpq_L2 ∉ Z(MVcen00_L1 / 2) && MVmnpq_L2 ∉ Z(MVcen01_L1 / 2).    (22)
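A minimal sketch of this selection loop: candidates are scanned in ascending WSAD order, and each new refinement center must fall outside the square exclusion zone of every already-chosen one. The threshold [−SRXL1/4, SRXL1/4] × [−SRYL1/4, SRYL1/4] follows the text; the level-scaling of the centers (the /2 in (20)–(22)) is omitted for brevity, and all names are illustrative:

```python
def select_centers(candidates, srx_l1, sry_l1, num_centers=4):
    """candidates: list of ((mvx, mvy), wsad) from the level-L2 full search.
    Pick up to num_centers refinement centers for level L1 such that each
    new center lies outside the exclusion zone of every chosen center."""
    zx, zy = srx_l1 // 4, sry_l1 // 4          # exclusion-zone half sizes
    chosen = []
    for (mvx, mvy), _ in sorted(candidates, key=lambda c: c[1]):
        if all(abs(mvx - cx) > zx or abs(mvy - cy) > zy for cx, cy in chosen):
            chosen.append((mvx, mvy))
        if len(chosen) == num_centers:
            break
    return chosen

cands = [((0, 0), 100), ((1, 0), 105), ((8, 4), 120), ((-6, -3), 130)]
print(select_centers(cands, srx_l1=10, sry_l1=6))
# (1, 0) is skipped: it lies inside the exclusion zone of (0, 0)
```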
Here, Z(MV) is a square area centered about MV, and the size of this square area is a threshold to be chosen. It is determined according to the local SW size at level L1, which is [−SRXL1, SRXL1] × [−SRYL1, SRYL1]; the threshold here is set as [−SRXL1/4, SRXL1/4] × [−SRYL1/4, SRYL1/4]. The efficiency of the proposed intelligent multiple center selection algorithm can be found in our previous work [37].

IV. Proposed VLSI Architecture

A. VLSI Architecture

The block diagram of the proposed IME architecture is given in Fig. 7. The basic unit for motion estimation is the processing element (PE), and each PE performs the SAD calculation for one pixel. In total, 256 parallel PEs are divided
into 16 groups named PEAs, indexed by mnpq, which are employed for 16-way parallel searching at level L2. At the level L1 stage, four PEA modules (PEAmn00, PEAmn01, PEAmn10, and PEAmn11) are combined to implement one PEA subset (PEAS), PEASmn; as a result, the 16 parallel PEAs are mapped to four PEASs to achieve 4-way parallel full search at level L1. At level L0, the four PEAS modules (PEAS00, PEAS01, PEAS10, and PEAS11) are combined into the PEAS array for VBSME implementation.

Search area luminance reference pixels and the current MB are fetched from external memory and input to the RSWB and the Cur. Sub. MB Reg. (CSMR), respectively. The IME controller accepts encoding parameters from the MB controller and the main processor, and coordinates all sub-modules for the three-level refinements. SAD values and MV costs of the different block sizes are input to the WSAD adder tree and the 16-input WSAD comparator for SAD reuse and MV selection. The LSW reference pixel refreshment controller refreshes the pixels in the ping-pong LSW for efficient data sharing between IME and FME.

The VLSI architecture of PEAmnpq is shown in Fig. 8. The task of PEAmnpq is to calculate SADmnpq for the 4 × 4 block indexed by mnpq, shown in Fig. 2(a)–(d), which is 16:1 down-sampled from the finest level L0; hence 16 parallel PEAs are adopted for full search at level L2. To fully utilize the PEA structure for VBSME at level L0, we employ four sub-block PE arrays (SPEA), indexed by rs, in every PEA. SPEArs calculates SADmnpq_rs for 2 × 2 pixels, as shown in Figs. 2(e) and 8. The 16 pixels of the current original 4 × 4 block mnpq are stored in the CSMR, and the 16:1 down-sampled luminance reference pixels are stored in the RSWB. A Ref. Pel. Reg. Array of 5 × 5 pixels loads reference pixels from the RSWB for the four SPEAs in the current PEA directly.
This register array is crucial for SAD reuse in the case of nonzero offsets (xoff_L1, yoff_L1) and (xoff_L0, yoff_L0) at levels L1 and L0. The Reg. array shift control issues shift orders (left, right, or up shift) to the Ref. Pel. Reg. Array for reference pixel sharing among adjacent MVs, and generates the addresses (Base_Addr and Offset_Addr) to the RSWB for data access. All candidate MVs at level L2 are searched with 16-way parallelism indexed by mnpq, as shown in Fig. 3(a); the mapping among the candidate MV (u, v)_L2, PEAmnpq, and mnpq_L2 is given in (4). The outputs SADmnpq_rs and SADmnpq are input to the WSAD adder tree and the WSAD comparator for MV selection.

The architecture of PEASmn is given in Fig. 9. PEASmn calculates SADmn for 8 × 8 blocks at level L1; the 16 parallel PEAs are thus mapped to four PEASs to achieve 4-way parallel full search at level L1, as shown in Fig. 3(b). Similarly, SADmn_rs is calculated by SAD reuse in PEASmn for VBSME implementation. At the level L0 stage, the four PEAS modules (PEAS00, PEAS01, PEAS10, and PEAS11) are combined as the PEAS array, shown in Fig. 10, to calculate the SADs of the different blocks for VBSME. SADmn_rs values are summed by the SAD adder tree to obtain SAD_8x8_rs, which is reused and summed with the MV costs by the WSAD adder tree to obtain the WSADs of the different sized blocks. The
Fig. 7. Block diagram of the proposed IME architecture.

Fig. 8. Proposed architecture of PEAmnpq.

Fig. 10. Complete PEAS array structure for VBSME.
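The SAD-reuse hierarchy just described (2 × 2 SPEA partial sums combined into 4 × 4, 8 × 8, and larger partitions) can be sketched as a software model. This is an illustrative model of the reuse principle, not the adder-tree RTL, and the row-major partition layout assumed here is an assumption:

```python
def vbsme_sads(cur, ref):
    """cur, ref: 16x16 lists of pixel rows. Builds variable-block-size SADs
    from the four 8x8 partial sums, mirroring the PEAS-array SAD reuse."""
    def sad8(r, c):  # SAD of the 8x8 sub-block at block coordinates (r, c)
        return sum(abs(cur[y][x] - ref[y][x])
                   for y in range(8 * r, 8 * r + 8)
                   for x in range(8 * c, 8 * c + 8))
    s = {(r, c): sad8(r, c) for r in (0, 1) for c in (0, 1)}
    return {
        "8x8": s,
        "16x8": {r: s[(r, 0)] + s[(r, 1)] for r in (0, 1)},  # top / bottom halves
        "8x16": {c: s[(0, c)] + s[(1, c)] for c in (0, 1)},  # left / right halves
        "16x16": sum(s.values()),
    }

cur = [[0] * 16 for _ in range(16)]
ref = [[1] * 16 for _ in range(16)]
print(vbsme_sads(cur, ref)["16x16"])  # 256: every pixel differs by 1
```

Only the four 8 × 8 sums are computed from pixels; every larger partition is an addition of already-available partial sums, which is the point of the reuse.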
Fig. 9. Proposed architecture of PEASmn.

optimal integer MVs of all MB partition modes are finally selected by the 16-parallel WSAD comparator array.

B. Data Flow and Control Flow

Due to the regular relationship between the 16-way buffers and the PE arrays, data access is also regular; the control flow and data flow complexities mainly lie in the IME controller and in the Reg. array shift control for the Ref. Pel. Reg. Array shift operations. Fig. 11 gives the control flow for PEAmnpq in the IME controller in pseudo-C form; its general description style applies to all three levels. (MVXBase, MVYBase) is the MV of the upper-left point of the local search window of PEAmnpq at the current level, (MVXmnpq, MVYmnpq) is the offset MV within the current local search window mapped to PEAmnpq, and (MVXFinal, MVYFinal) is the final MV after the current-level local search. SRH and SRW are the local search window height and width at the current level, defined by

SRH = SRY/16,         SRW = SRX/16          at level L2
SRH = 2 × SRYL1 + 1,  SRW = 2 × SRXL1 + 1   at level L1
SRH = 2 × SRYL0 + 1,  SRW = 2 × SRXL0 + 1   at level L0.    (23)

Three kinds of MV scan modes are listed in Fig. 11(b), and three register array operation types are given in Fig. 12. In the register array operation
Fig. 11. Pseudo-C form of control flow in IME controller. (a) Control flow of IME controller. (b) MV scan order.
types shown in Fig. 12, the first two involve only register array upward or downward shifts and register array row refreshment, while the third also requires a register array left shift and refreshment of the parameters (Base_Addr, Offset_Addr, α, β). At level L2, these operations are performed in all 16 PEA modules, while at levels L1 and L0 they are conditional; they are also closely tied to the RSWB data organization.

Fig. 13 gives the RSWB data organization for a PEA. Two 32-bit SRAMs (RSWBmnpq_A and RSWBmnpq_B) implement RSWBmnpq to satisfy the real-time pixel refreshment requirement: four pixels of one row are fetched from RSWBmnpq_A and the other four from RSWBmnpq_B. (Base_Addr, Offset_Addr) and (MVXmnpq, MVYmnpq) jointly determine the RSWB access address. Only five pixels are used for register array row refreshment in each cycle, so selecting these five pixels from the eight fetched pixels is an important problem. We define the parameter pair (α, β) to describe this pixel combination process, as given in Fig. 14; α and β refreshments occur only for the left-shift scan type. This control process is predetermined and periodic.

The smallest MB partition mode for VBSME is 8 × 8 in AVS. Although H.264 supports blocks as small as 4 × 4, only blocks not smaller than 8 × 8 are supported in typical H.264 video encoder VLSI architectures for the HD case [7]; the performance degradation is negligible because non-translational motion rarely occurs within 8 × 8 blocks in HD content. Thus, the proposed architecture is also well-suited for H.264 if only blocks not smaller than 8 × 8 are adopted.
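Returning to the row-refreshment step above: one plausible software model of the 5-from-8 pixel selection is sketched below. The exact (α, β) encoding is defined in Fig. 14 and is not reproduced here, so the semantics assumed in this sketch (α selects which SRAM word comes first, β selects the starting pixel of the 5-pixel window) are an assumption:

```python
def refresh_row(word_a, word_b, alpha, beta):
    """Combine two 4-pixel words fetched from RSWB_A and RSWB_B and select
    the 5 consecutive pixels for the 5x5 Ref. Pel. Reg. Array row refresh.
    Illustrative model only: alpha orders the words, beta offsets the window."""
    assert len(word_a) == len(word_b) == 4 and 0 <= beta <= 3
    combined = (word_a + word_b) if alpha == 0 else (word_b + word_a)
    return combined[beta:beta + 5]

print(refresh_row([10, 11, 12, 13], [14, 15, 16, 17], alpha=0, beta=2))
# [12, 13, 14, 15, 16]
```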
Fig. 12. Three register array operation types in the Reg. shift control.
Fig. 13. Data organization of RSWB in PEAmnpq. (a) Data organization of RSWB. (b) MV scan order.
Fig. 14. Parameter pairs (α, β) and pixel combination for pixel row refreshment.
Fig. 15. Results of rm(SRXL0) and PSNR degradation. (a) rm(SRXL0) for different SRXL0 and m. (b) PSNR degradation vs. SRXL0.

TABLE I
V. Parameter Selection and Simulation Results
Relationship Between CyclesVBSME and SRXL0
A. LSW Size Determination and Performance Degradation

In this paper, the ratio between the width and the height of the LSW at levels L1 and L0 is set identical to that of the original image. Suppose the image width and height are Wi and Hi; then the LSW heights at levels L1 and L0 are given by

SRYL1 = Even_round((Hi/Wi) × SRXL1 + 0.5)    (24)
SRYL0 = Even_round((Hi/Wi) × SRXL0 + 0.5).   (25)

In general, SRXL1 = 8 is enough for multi-resolution motion estimation. Even_round is the even-integer rounding function. Comparatively, the LSW size at level L0, (SRXL0, SRYL0), is crucial for VBSME accuracy and performance degradation, so we have investigated the relationship between the level L0 LSW size and the resulting performance degradation. As shown in (25), SRXL0 can serve as the measure of the LSW size. We define rm(SRXL0) as the percentage of blocks whose MV in mode VBS = m falls beyond the level L0 LSW (SW_VBSME_L0) centered about the corresponding block's MV in mode 8 × 8:

rm(SRXL0) = ( Σ_{i=1..N} f(m, SRXL0)_i / N ) × 100%.    (26)

N is the number of MVs used. We label the ith MV of mode m at level L0 as (u, v)m_L0(i), and f(m, SRXL0)_i indicates whether this MV falls beyond the level L0 LSW, as follows:
f(m, SRXL0)_i = 1 if (u, v)m_L0(i) ∉ R(MV8×8_L0), and 0 otherwise.    (27)

The relationships between rm(SRXL0) and SRXL0 for several 720P sequences are shown in Fig. 15(a). Two conclusions can be drawn. One is that rm(SRXL0) is very close across different m when SRXL0 is fixed. The other is that rm(SRXL0) decreases rapidly as the side length SRXL0 increases. The first conclusion verifies the assumption discussed
SRXL0            6     8     10    12    14    16    18    20
2 × SRXL0 + 1    13    17    21    25    29    33    37    41
2 × SRYL0 + 1    9     11    13    15    17    19    23    25
CyclesVBSME      117   187   273   375   493   627   851   1025
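The Table I entries follow from the LSW geometry: the candidate count per block is (2·SRXL0 + 1)(2·SRYL0 + 1), and the tabulated heights are consistent with rounding (Hi/Wi)·SRXL0 up to the next integer for 720P (Wi = 1280, Hi = 720). A check script under that rounding assumption (the function name is illustrative):

```python
import math

def vbsme_cycles(srx_l0, wi=1280, hi=720):
    """Level-L0 LSW width, height, and candidate count for an
    aspect-ratio-preserving LSW; rounding up reproduces Table I for 720P."""
    sry_l0 = math.ceil(hi / wi * srx_l0)
    w, h = 2 * srx_l0 + 1, 2 * sry_l0 + 1
    return w, h, w * h

for srx in (6, 8, 10, 12, 14, 16, 18, 20):
    print(srx, vbsme_cycles(srx))
# e.g. 14 -> (29, 17, 493) and 20 -> (41, 25, 1025), matching Table I
```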
in Section II-C. The second one indicates that, given an appropriate SRXL0, VBSME can be implemented within the LSW instead of the whole SW, with only a small probability of exceptional MVs. We have further evaluated the PSNR degradation caused by simplified VBSME within the LSW compared with full search VBSME within the whole SW. The average PSNR degradation results for the 720P Sailormen sequence are shown in Fig. 15(b): the PSNR improvement from increasing SRXL0 diminishes gradually, and once SRXL0 exceeds 10 the improvement is almost negligible. For 720P and 1080P sequences, the relationship between the LSW size (SRXL0) and the cycle consumption (CyclesVBSME) for VBSME is given in Table I. Comparing SRXL0 = 20 with SRXL0 = 14, the PSNR improvement is very small, as shown in Fig. 15(b), while 532 additional cycles are required. With these cycles, more candidate MVs can instead be checked at level L1 to improve the center MV accuracy, and this improvement adequately compensates for the PSNR degradation due to the larger rm(SRXL0) at SRXL0 = 14, shown in Fig. 15(a). Thus, SRXL0 = 14 is used in this paper, with an LSW size of 29 × 17. In conclusion, the SW size parameters of the three levels are: SRX = 128, SRY = 96, SRXL1 = 10, SRYL1 = 6, SRXL0 = 14, SRYL0 = 8, SRXshare = 20, and SRYshare = 12.

B. Performance of the Proposed MMEA

Hardware-friendly motion estimation algorithms, including full search block matching (FSBM) [4], NTSS, MMEA [29], and sub-sampling based MMEA [7], are used as references. Although VBSME is not supported in the MMEA of [26], developed for MPEG-2 applications, its multiple-candidate three-level MMEA is very typical, so we combine it with VBSME and
Fig. 16. PSNR curves of the proposed MMEA vs. FSBM. (a) RD curve of sequence group one. (b) RD curve of sequence group two.
take it also as a reference. Identical SW and coding parameters are used in all reference algorithms for a fair comparison. The AVS Jizhun profile with rate distortion optimized VBSME is used for performance evaluation. Eight 720P sequences (City, Spincalendar, Sailormen, Crew, Cyclists, Optis, Harbour, and Night) are used for simulation, and the integer pixel SW is [−128, 128] × [−96, 96]. The PSNR results of the proposed MMEA vs. FSBM [4] are given in Fig. 16(a) and (b). According to Fig. 16, the PSNR degradation of the proposed MMEA vs. FSBM is small: the average degradation is approximately 0.1 dB, and the worst case occurs in the Spincalendar sequence, which has complex motion and camera rotation.

The Spincalendar sequence is therefore used to compare the proposed MMEA with the other reference algorithms. The PSNR results of all reference multi-resolution VBSME algorithms, the full search VBSME algorithm, and the proposed MMEA are shown in Fig. 17. From the results in Figs. 16 and 17, we conclude that the proposed MMEA achieves searching performance similar to MMEA [26] and sub-sampling based MMEA [7], and better than MMEA [29] and NTSS. The PSNR degradation of the proposed algorithm compared with FSBM [4] is smaller than 0.15 dB even in the Spincalendar case.

C. Hardware Implementation Results

The proposed IME architecture was implemented in Verilog-HDL and synthesized with Design Compiler using the SMIC 0.18 µm 1P6M standard cell library. Table II shows the total hardware cost of the proposed design and a comparison with reference designs. The throughput capacity (search range, reference frame number), algorithm performance, circuit cost (PE number, gate count, SRAM consumption), system clock frequency, external memory interface bit width and access bandwidth requirement, and data flow regularity are taken as comparison factors.
First, the search range and reference frame number determine the system throughput capacity. This paper targets main profile H.264 and Jizhun profile AVS video coding, in which simultaneous search over two reference frames with an integer pixel window of 256 × 192 is supported. According to the data in Table II, the MMEA of [27] is designed for small-size video coding applications with a search range of 32 × 32. The other designs
Fig. 17. Performance comparison for the 720P Spincalendar sequence.
supporting large search ranges meet the throughput constraint by algorithm simplification or PE multiplication. Our design focuses on satisfying high throughput with moderate PE and SRAM consumption and negligible performance loss.

Second, PSNR performance is very important for algorithm evaluation. According to the results in Section V-B, the proposed MMEA suffers only a small performance loss, even in the case of complex motion: full search at level L2 tracks irregular motion, and the intelligent multiple refinement center mechanism further improves search accuracy.

Third, the PE number and on-chip SRAM size determine the final circuit cost. 2048 PE units were employed to cover search ranges of 128 × 64 and 196 × 128 in [4] and [7], respectively, while 728 PEs covered a range of 256 × 256 in [8]; only one reference frame was supported in these three baseline H.264 encoder designs. In this paper, 512 PE units suffice for two reference frames through efficient PE array reuse. As a result, only 130 K gates are needed for one-reference-frame motion estimation, and 260 K gates in total for Jizhun profile AVS and main profile H.264 B frame support. The proposed architecture was verified on a Virtex-5 FPGA development system; efficient hardware reuse results in a reasonable area of 26% of the slices of a V5LX330. Despite the moderate PE consumption, parallel search at the coarse levels achieves an acceptable processing throughput of 900 cycles per MB.

In large-scale hardware architectures, SRAM is usually the largest circuit consumer. An efficient SRAM structure is adopted in the proposed architecture for data sharing between IME and FME: single-port SRAM is used for the on-chip SW buffer instead of dual-port SRAM for IME, and dual-port SRAM is used only for the LSW buffer to achieve efficient data sharing between IME and FME. As a result, the SW buffer SRAM consumption is reduced by almost 50% [37], which is very encouraging for HD video encoder chip implementation.

Fourth, the system clock frequency is crucial for throughput efficiency and power consumption. The proposed algorithm and architecture reach a throughput of 900 cycles per MB, and a 200 MHz system clock is enough for real-time 1080P 30 frames/s video coding; this throughput is similar to those of [7] and [8]. Due to its high regularity, the proposed IME VLSI architecture can be multiplied to accelerate the search and further improve data reuse if higher throughput is desired: if each PEAmnpq is cloned into a T-way parallel structure, the three levels can evaluate 16T, 4T, and T candidate MVs per cycle. The parallelism intensity T is configurable to the image resolution and system clock frequency.

Fifth, the external memory interface bit width and access bandwidth are also important for a high definition video encoder chip. Some IME architectures require wide SDRAM interfaces to satisfy the system throughput requirement; for example, 512 and 128 bit SDRAM interfaces are needed in [7] and [8] to cover search ranges of 196 × 128 and 256 × 256 for real-time 1080P 30 frames/s coding. This paper adopts the HF2V3 zigzag scan mode Level C+ data reuse scheme [35], and efficient reference pixel data sharing between IME and FME is achieved through the ping-pong structured LSW buffer. As a result, a 64 bit external DDR SDRAM interface is enough to cover the 256 × 192 search window for 1080P 30 frames/s video coding.

Finally, data flow regularity is preferred for simple control flow with low verification and implementation risk. FSBM based architectures have the highest data flow regularity and the simplest control flow. The proposed architecture also has high data flow regularity due to its regular parallel PE array structure and the data organization in the 16 parallel RSWB buffers.

TABLE II
ME Cycle, PE Number, and SW Memory Comparison

                        Wu [27]        Lee [32]         Chen [4], [30]   Lin [8], [29]   Liu [7]                  Proposed
Video spec.             CIF, 30 f/s    720×480, 30 f/s  720p, 30 f/s     1080p, 30 f/s   1080p, 30 f/s            1080p, 30 f/s
Algorithm               MMEA           MMEA             FSBM             MMEA            Sub-sampling             MMEA
Profile (frame type)    Baseline (P)   Baseline (P)     Baseline (P)     Baseline (P)    Baseline (P)             Main (P, B)
Ref. frame number       1              1                1                1               1                        2
Search range            32 × 32        128 × 128        128 × 64         256 × 256       196 × 128                256 × 192
Inter modes             N/A            All              All              All             8×8/16×8/8×16/16×16      8×8/16×8/8×16/16×16
No. of PEs              50             64/320           2048             728             2048                     256 × 2 = 512
Gate count (k)          59             N/A              305              213.7           486                      130 × 2 = 260
SRAM for IME (KB)       1.3            N/A              13.71 (dual)     5.95 (dual)     40 (dual)                159.5 (single)
SRAM for FME (KB)       N/A            N/A              13.82 (dual)     N/A             40.8 (dual)              17 (dual)
Throughput (cycles/MB)  495            375              1536             256             960                      872
Frequency (MHz)         153            16               108              128.8           200                      200
DDR SDRAM bit width     N/A            N/A              N/A              128             512                      64
Data flow regularity    N/A            N/A              Very high        High            High                     High
Technology              0.18 CMOS      N/A              0.18 CMOS        0.13 CMOS       0.18 CMOS                0.18 CMOS
VI. Conclusion

This paper has proposed an efficient MMEA well-suited for hardware implementation of HD MPEG-like video encoders. The MMEA was mapped to a VLSI architecture with reconfigurable PE arrays achieving encouraging throughput; the highly regular PE array structure can be configured to implement both full search and MMEA with regular data flow control. The proposed algorithm and architecture are well-suited for all MPEG-like video standards such as H.264, AVS, and VC-1.

References

[1] ITU-T Recommendation and International Standard of Joint Video Specification, ITU-T Rec. H.264/ISO/IEC 14496-10 AVC, Mar. 2005.
[2] SMPTE 421M, VC-1 Compressed Video Bitstream Format and Decoding Process [Online]. Available: http://www.smpte.org/smpte_store/standards/pdf/s421m.pdf
[3] Information Technology—Advanced Coding of Audio and Video: Part 2. Video, document AVS_N1063, China AVS Working Group, 2003.
[4] T.-C. Chen, S.-Y. Chien, Y.-W. Huang, C.-H. Tsai, C.-Y. Chen, T.-W. Chen, and L.-G. Chen, "Analysis and architecture design of an HD720p 30 frames/s H.264/AVC encoder," IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 6, pp. 673–688, Jun. 2006.
[5] H.-C. Chang, J.-W. Chen, C.-L. Su, Y.-C. Yang, Y. Li, C.-H. Chang, Z.-M. Chen, W.-S. Yang, C.-C. Lin, C.-W. Chen, J.-S. Wang, and J.-I. Guo, "A 7 mW-to-183 mW dynamic quality-scalable H.264 video encoder chip," in Proc. IEEE ISSCC Dig. Tech. Papers, 2007, pp. 280–281.
[6] Y.-H. Chen, T.-D. Chuang, Y.-J. Chen, C.-T. Li, C.-J. Hsu, S.-Y. Chien, and L.-G. Chen, "An H.264/AVC scalable extension and high profile HDTV 1080p encoder chip," in Proc. IEEE Symp. VLSI Circuits Dig. Tech. Papers, Jun. 2008, pp. 104–105.
[7] Z. Liu, Y. Song, M. Shao, S. Li, L. Li, S. Ishiwata, M. Nakagawa, S. Goto, and T. Ikenaga, "HDTV 1080P H.264/AVC encoder chip design and performance analysis," IEEE J. Solid-State Circuits, vol. 44, no. 2, pp. 594–608, Feb. 2009.
[8] Y.-K. Lin, D.-W. Li, C.-C. Lin, T.-Y. Kuo, S.-J. Wu, W.-C. Tai, W.-C. Chang, and T.-S.
Chang, “A 242 mW 10 mm2 1080P H.264/AVC highprofile encoder chip,” in Proc. ISSCC Dig. Tech. Paper, Feb. 2008, pp. 314–615. [9] K. Iwata, S. Mochizuki, M. Kimura, T. Shibayama, F. Izuhara, H. Ueda, K. Hosogi, H. Nakata, M. Ehama, T. Kengaku, T. Nakazawa, and H. Watanabe, “A 256 mW 40 Mbps full-HD H.264 high-profile codec featuring a dual-macroblock pipeline architecture in 65 nm CMOS,” IEEE J. Solid-State Circuits, vol. 44, no. 4, pp. 1184–1191, Apr. 2009. [10] Y.-W. Huang, C.-Y. Chen, C.-H. Tsai, C.-F. Shen, and L.-G. Chen, “Survey on block matching motion estimation algorithms and architectures with new results,” J. VLSI Signal Process., vol. 42, no. 3, pp. 297–320, Mar. 2006. [11] T. Komarek and P. Pirsch, “Array architectures for block matching algorithms,” IEEE Trans. Circuits Syst., vol. 36, no. 10, pp. 1301–1308, Oct. 1989.
[12] L. D. Vos and M. Stegherr, “Parameterizable VLSI architectures for the full-search block-matching algorithm,” IEEE Trans. Circuits Syst., vol. 36, no. 10, pp. 1309–1316, Oct. 1989. [13] K. M. Yang, M. T. Sun, and L. Wu, “A family of VLSI designs for the motion compensation block-matching algorithm,” IEEE Trans. Circuits Syst., vol. 36, no. 10, pp. 1317–1325, Oct. 1989. [14] C. H. Hsieh and T. P. Lin, “VLSI architecture for block matching motion estimation algorithm,” IEEE Trans. Circuits Syst. Video Technol., vol. 2, no. 2, pp. 169–175, Jun. 1992. [15] H. Yeo and Y. H. Hu, “A novel modular systolic array architecture for full-search block matching motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol. 5, no. 5, pp. 407–416, Oct. 1995. [16] Y. K. Lai and L. G. Chen, “A data-interlacing architecture with 2-D datareuse for full-search block-matching algorithm,” IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 2, pp. 124–127, Apr. 1998. [17] M. Mizuno, Y. Ooi, N. Hayashi, J. Goto, M. Hozumi, K. Furuta, A. Shibayama, Y. Nakazawa, O. Ohnishi, Y. Shu-Yu Zhu Yokoyama, Y. Katayama, H. Takano, N. Miki, Y. Senda, I. Tamitani, and M. Yamashina, “A 1.5-W single-chip MPEG-2 MP@ML video encoder with low power motion estimation and clocking,” IEEE J. Solid-State Circuits, vol. 32, no. 11, pp. 1807–1816, Nov. 1997. [18] H. M. Jong, L. G. Chen, and T. D. Chiueh, “Parallel architectures for 3step hierarchical search block-matching algorithm,” IEEE Trans. Circuits Syst. Video Technol., vol. 4, no. 4, pp. 407–416, Aug. 1994. [19] H. D. Lin, A. Anesko, and B. Petryna, “A 14-GOPS programmable motion estimator for H.26×VideoCoding,” IEEE J. Solid-State Circuits, vol. 31, no. 11, pp. 1742–1750, Nov. 1996. [20] S. C. Cheng and H. M. Hang, “A comparison of block matching algorithms mapped to systolic-array implementation,” IEEE Trans. Circuits Syst. Video Technol., vol. 7, no. 5, pp. 741–757, Oct. 1997. [21] V. G. 
Moshnyaga, “A new computationally adaptive formulation of block-matching motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 1, pp. 118–124, Jan. 2001. [22] S. C. Hsia, “VLSI implementation for low-complexity full search motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 7, pp. 613–619, Jul. 2002. [23] S. Kawahito, D. Handoko, Y. Tadokoro, and A. Matsuzawa, “Low power motion vector estimation using iterative search block-matching methods and a high-speed non-destructive CMOS sensor,” IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 12, pp. 1084–1092, Dec. 2002. [24] C. D. Vleeschouwer, T. Nilsson, K. Denolf, and J. Bormans, “Algorithmic and architectural co-design of a motion estimation engine for low-power video devices,” IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 12, pp. 1093–1105, Dec. 2002. [25] W.-M. Chao, T.-C. Chen, Y.-C. Chang, C.-W. Hsu, and L.-G. Chen, “Computationally controllable integer, half, and quarter-pel motion estimator for MPEG-4 advanced simple profile,” in Proc. IEEE ISCAS, 2003, pp. 788–791. [26] B. C. Song and K. W. Chun, “Multi-resolution block matching algorithm and its VLSI architecture for fast motion estimation in a MPEG-2 video encoder,” IEEE Trans. CSVT, vol. 14, no. 9, pp. 1119–1137, 2004. [27] B.-F. Wu, H.-Y. Peng, and T.-L. Yu, “Efficient hierarchical motion estimation algorithm and its VLSI architecture,” IEEE Trans. Very Large Scale Integr. Syst., vol. 16, no. 10, pp. 1385–1398, Oct. 2008. [28] J. Vanne, E. Aho, K. Kuusilinna, and T. D. Hämäläinen, “A configurable motion estimation architecture for block-matching algorithms,” IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 4, pp. 74–86, Apr. 2009. [29] Y.-K. Lin, C.-C. Lin, T.-Y. Kuo, and T.-S. Chang, “A hardware-efficient H.264/AVC motion-estimation design for high-definition video,” IEEE Trans. Circuits Syst. I Regular Papers, vol. 55, no. 6, pp. 1526–1535, Jul. 2008. [30] C. Y. Chen, S. Y. Chien, Y. 
W. Huang, T. C. Chen, T. C. Wang, and L. G. Chen, “Analysis and architecture design of variable block-size motion estimation for H.264/AVC,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 53, no. 3, pp. 578–593, Mar. 2006. [31] S. Y. Yap and J. V. Mc Canny, “A VLSI architecture for variable block size video motion estimation,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 51, no. 7, pp. 384–389, Jul. 2004. [32] J. H. Lee and N. S. Lee, “Variable block size motion estimation algorithm and its hardware architecture for H.264/AVC,” in Proc. IEEE Int. Symp. Circuits Syst., vol. 3. May 2004, pp. 741–744. [33] L. Deng, X. D. Xie, and W. Gao, “A real-time full architecture for AVS motion estimation,” IEEE Trans. Consumer Electron., vol. 53, no. 4, pp. 1744–1751, Nov. 2007.
[34] J.-C. Tuan, T.-S. Chang, and C.-W. Jen, "On the data reuse and memory bandwidth analysis for full-search block-matching VLSI architecture," IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 1, pp. 61–72, Jan. 2002.
[35] C.-Y. Chen, C.-T. Huang, Y.-H. Chen, and L.-G. Chen, "Level C+ data reuse scheme for motion estimation with corresponding coding orders," IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 4, pp. 553–558, Apr. 2006.
[36] H. B. Yin, L. Deng, H. Qi, and W. Gao, "VLSI friendly ME search window buffer structure optimization and algorithm verification for high definition H.264/AVS video encoder," in Proc. IEEE ICME, Jun.–Jul. 2009, pp. 1098–1101.
[37] H. B. Yin, X. M. Wang, Z. L. Xia, and H. G. Qi, "Cost-effective multiresolution motion estimation algorithm for rate distortion optimized high definition video encoder," in Proc. 17th IFIP/IEEE Int. Conf. Very Large Scale Integr. (VLSI-SoC), Oct. 2009, pp. 12–14.
[38] Calculation of Average PSNR Differences Between RD-Curves, ITU-T VCEG, Proposal VCEG-M33, 2001.

Haibing Yin received the Ph.D. degree from Shanghai Jiaotong University, Shanghai, China, in 2006. He was a Post-Doctoral Researcher with the National Engineering Laboratory for Video Technology, Peking University, Beijing. He is currently with the School of Electrical Engineering, China Jiliang University, Hangzhou. His current research interests include image and video processing, and very large scale integration architecture design.
Huizhu Jia received the Ph.D. degree in electrical engineering from the Chinese Academy of Sciences, Beijing, China, in 2007. He is currently with the National Engineering Laboratory for Video Technology, Peking University, Beijing. His current research interests include image and video processing and very large scale integration architecture design.

Honggang Qi received the Ph.D. degree in electrical engineering from the Chinese Academy of Sciences, Beijing, China, in 2007. He is currently with the Graduate University of the Chinese Academy of Sciences. His current research interests include image and video processing, and very large scale integration architecture design.

Xianghu Ji is currently pursuing the Ph.D. degree with the National Engineering Laboratory for Video Technology, Peking University, Beijing, China. His current research interests include video coding algorithms, very large scale integration implementation, and hardware description language coding.

Xiaodong Xie received the Ph.D. degree in electrical engineering from the University of Rochester, Rochester, NY. He served in industry at several companies, including Eastman Kodak, Rochester, Broadcom, Irvine, CA, Grandview Semi, Beijing, China, and Spreadtrum Communications, Beijing, from 1994 to 2009. He is currently a Professor with the Department of Electrical Engineering, Peking University, Beijing, China. His current research interests include multimedia system-on-a-chip design and embedded systems. He has been granted 23 U.S. patents.

Wen Gao (F'09) received the M.S. and Ph.D. degrees in computer science from the Harbin Institute of Technology, Harbin, China, in 1985 and 1988, respectively, and the Ph.D. degree in electronics engineering from the University of Tokyo, Tokyo, Japan, in 1991.
Currently, he is the Director of the National Engineering Laboratory for Video Technology and the Joint Research and Development Laboratory for Advanced Computing and Communication, Chinese Academy of Sciences, Beijing, China, a Professor with the Institute of Computing Technology, a Professor with Peking University, Beijing, and a Professor of computer science with the Harbin Institute of Technology. He has published seven books and over 200 scientific papers. His current research interests include signal processing, image and video communication, computer vision, and artificial intelligence. Dr. Gao chairs the Audio Video Coding Standard Workgroup of China. He is the Head of the Chinese National Delegation to the MPEG Working Group.