An Efficient Hardware Implementation for Motion Estimation of AVC Standard

Lei Deng, Wen Gao, Ming Zeng Hu, Zhen Zhou Ji

Abstract — In the advanced video coding standard (AVC), motion estimation adopts many new features, such as variable block size matching, multiple reference frames and motion vector prediction, to enhance the coding performance. However, the high data dependence and high computational requirements of these new features make the hardware implementation very complex, especially for real-time applications. Therefore, based on the reference software JM9.0, this paper first improves the motion estimation algorithm from a hardware-oriented viewpoint and then proposes a systolic architecture for the improved algorithm. The architecture adopts 2-D systolic arrays, fully supports the AVC's variable block size matching, and can produce 41 motion vectors for one macroblock. Experimental results show that the improved algorithm avoids the data dependences while keeping the same coding performance as JM9.0, and that the proposed architecture meets the real-time requirement for a 720x576 picture size at 30 fps with a search range of 65x65.

Index Terms — AVC, motion estimation, VBSME, VLSI, systolic.

I. INTRODUCTION

The advanced video coding standard (AVC) [1], also known as H.264, is a new compression standard developed by the Joint Video Team (JVT) of ITU-T VCEG and ISO/IEC MPEG. It can be used in many application areas such as high-resolution digital broadcasting, high-density DVD players, internet streaming media, and wireless multimedia communication. Generally speaking, AVC has a block-based hybrid coding framework similar to the previous MPEG-x and H.26x series of standards. It uses spatial and temporal prediction to eliminate spatial and temporal data redundancy, and the prediction errors are transformed, quantized and entropy coded. In order to achieve better coding efficiency, many new features are included in the AVC standard. The entropy coding adopts two methods, i.e., Exp-Golomb codes and Context-Adaptive Variable Length Coding (CAVLC), to code the syntax elements; the intra prediction defines 9 modes for luma blocks and 4 modes for chroma blocks to enhance the performance of intra-coded macroblocks (MB); the adaptive integer transform is used to eliminate the mismatch in the inverse transform; and the adaptive in-loop de-blocking filter is used to reduce blocking artifacts. For multiple reference pictures, both P and B pictures may use five or more reference pictures. For variable block sizes, the MB, which has 16x16 pixels, can be partitioned into 16x8, 8x16 or 8x8 blocks (Fig. 1), and an 8x8 block can be further partitioned into 8x4, 4x8 or 4x4 blocks. The motion vector (MV) in AVC can be specified with quarter-pixel accuracy, and the MB prediction has four modes: forward, backward, direct and bi-directional. Benefiting from these new features, AVC achieves more than 50% coding gain over MPEG-2 [2]. However, these new features require a much higher computational load, so hardware acceleration is necessary for real-time coding applications, especially for motion estimation, which is the most computationally intensive part of the codec [3]. Many architectures for motion estimation have been proposed for previous standards [4-9], but they focus on fixed block size motion estimation and cannot fully support variable block size motion estimation (VBSME) in the AVC. Since AVC was developed, a few VBSME architectures have also become available. References [10] and [11] describe 16 processing element (PE) 1-D arrays for low-power applications, and [12] describes a 64-PE 2-D array, also for low power, but it only supports blocks larger than 8x8. References [13-16] adopt 256 PEs in their 2-D architectures. Compared with the low-power architectures, their computational capability is obviously improved, yet for highly computation-demanding coding, such as a 65x65 search range, they still cannot achieve real-time coding.

Fig. 1. The macro-block patterns adopted in the VBSME of AVC.
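To make the "41 motion vectors per macroblock" figure concrete, the following minimal C sketch (illustrative only) counts the motion vectors a full VBSME engine must deliver; the counts follow directly from the partition modes in Fig. 1.

```c
#include <stdio.h>

/* Count the motion vectors produced for one 16x16 MB:
 * 1 (16x16) + 2 (16x8) + 2 (8x16) + 4 (8x8), plus, for each of the four
 * 8x8 partitions, 2 (8x4) + 2 (4x8) + 4 (4x4) sub-blocks.               */
int main(void) {
    int mb_level  = 1 + 2 + 2 + 4;      /* 16x16, 16x8, 8x16, 8x8       */
    int per_8x8   = 2 + 2 + 4;          /* 8x4, 4x8, 4x4 per 8x8 block  */
    int total_mvs = mb_level + 4 * per_8x8;
    printf("motion vectors per MB: %d\n", total_mvs);  /* prints 41 */
    return 0;
}
```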

The purpose of this paper is to present a hardware-oriented motion estimation algorithm and its VLSI architecture for real-time AVC video coding applications. The proposed architecture performs variable block size full search and can produce all 41 motion vectors of a MB concurrently. This paper is organized as follows. In Section II, the hardware-oriented motion estimation algorithm is presented together with experimental results on its coding performance. In Section III, the hardware architecture of motion estimation based on the modified algorithm is described in detail. Finally, the experimental results and the conclusion are given in Sections IV and V, respectively.

II. HARDWARE-ORIENTED MOTION ESTIMATION ALGORITHM FOR AVC

The motion estimation of the AVC is based on the rate distortion optimization (RDO) framework [17].


The distortion criterion to determine the best motion vector can be formalized as the minimization of the cost function:

J(m, REF | λ) = SAD(o, r(REF, m)) + λ · ( R(m − p) + R(REF) )    (1)

where m = (mv_x, mv_y) is the current candidate motion vector, REF is the index of the current reference picture, and SAD(o, r(REF, m)) is the sum of absolute differences (SAD) between the current block and the candidate reference block:

SAD(o, r(REF, m)) = Σ_{i=1}^{N1} Σ_{j=1}^{N2} | o(i, j) − r_REF(i + mv_x, j + mv_y) |    (2)

N1 and N2 are the width and height of the current block; they can be 4, 8 or 16. p is the prediction motion vector of the current block, R(m − p) and R(REF) are the bits assigned to the motion vector difference and to REF after entropy coding, and λ is the Lagrangian multiplier.
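As a concrete reading of (1) and (2), the C sketch below evaluates the cost J of one candidate motion vector. It is illustrative only: the bit-cost helper `mv_bits()` and the frame layout are assumptions, not part of the paper.

```c
#include <stdlib.h>
#include <stdint.h>

/* Hypothetical helper: bits needed to code a signed value after entropy
 * coding (e.g. Exp-Golomb); the exact table is not specified here.      */
extern int mv_bits(int v);

/* Cost J(m, REF | lambda) of one candidate MV (mvx, mvy) for an N1xN2
 * block, following equations (1) and (2).                               */
int candidate_cost(const uint8_t *org, int org_stride,   /* current block o     */
                   const uint8_t *ref, int ref_stride,   /* reference picture r */
                   int N1, int N2,
                   int mvx, int mvy,                     /* candidate m         */
                   int pred_x, int pred_y,               /* predictor p         */
                   int ref_idx_bits,                     /* R(REF)              */
                   int lambda)
{
    int sad = 0;
    for (int j = 0; j < N2; j++)
        for (int i = 0; i < N1; i++)
            sad += abs(org[j * org_stride + i] -
                       ref[(j + mvy) * ref_stride + (i + mvx)]);

    int rate = mv_bits(mvx - pred_x) + mv_bits(mvy - pred_y) + ref_idx_bits;
    return sad + lambda * rate;   /* J = SAD + lambda * (R(m-p) + R(REF)) */
}
```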

In the AVC standard, p depends on the motion vectors of the left, top-left, top and top-right neighbor blocks (Fig. 2, MV0-3). The current block cannot be processed until the motion vectors of its neighbors have been determined. This data dependency between the current block and its neighbors forces the hardware architecture to adopt sequential processing block by block, which greatly decreases the efficiency of the hardware implementation. In order to eliminate the data dependency between the current block and its neighbors, the motion vectors MV0'-3' in Fig. 2 replace MV0-3 in the calculation of p for the current block. This replacement is not accurate, and the "false" p is only used during the calculation of block matching. In the entropy coding of the final motion vectors, the real p will be used as the predictive motion vector. So the replacement only affects the motion estimation calculation, and the proposed motion estimation algorithm is still compatible with the standard. All blocks in the current MB can be computed simultaneously after this replacement.

Fig. 2. Diagram of the data dependency between the current block and its neighbors. The prediction motion vector p of the current block is calculated from the neighbors' motion vectors.

Another hardware-oriented modification is the position of the search center. In traditional video coding reference software, such as that of MPEG-2, the search center is located at (0, 0). The search windows of adjacent MBs are shown in Fig. 3; the shaded area in Fig. 3 is the overlapped search range of the current MB and the next MB. It is fetched before the current MB is processed and can be reused by the next MB. This reuse scheme can greatly reduce the input bandwidth requirement of the chip. However, in the AVC reference software the search center is located at p, which differs from MB to MB, so it is difficult to apply the reuse scheme to reduce the bandwidth requirement. In our hardware-oriented algorithm, (0, 0) is still used as the search center so that the reuse scheme can be applied.

Fig. 3. The reuse scheme of pixel data in the search window. The pixel data in the shaded area are needed by both the current MB and the next MB.

To evaluate the coding performance of the improved algorithm, five sequences are employed: two standard definition sequences, "hero" and "zy", and three high definition sequences, "city", "crew" and "harbour". Weighted prediction, rate control and de-blocking are disabled. For the motion estimation, two reference frames, a 65x65 search range and all block sizes are used. The experiments are performed under different QP conditions, and the comparisons of PSNR and bit rate between the improved algorithm and JM9.0 are given in Table 1(a) and Table 1(b), respectively. In Table 1(a), the average degradation of PSNR is about 0.026 dB and the largest degradation is 0.109 dB. In Table 1(b), the average increase of bit rate is about 1.16% and the largest increase is 4.34%, which occurs for the city sequence at QP = 50. Hence the improved algorithm has almost the same performance as JM9.0. Meanwhile, the improved algorithm has low data dependency and a low bandwidth requirement, and it is simpler to implement in hardware.

III. HARDWARE ARCHITECTURE OF MOTION ESTIMATION

Since the 4x4 block is the smallest block among the MB partitions, we first derive the 4x4 PE array architecture and then use the 4x4 PE array to construct the ME architecture.


Table 1. Experimental results of the hardware-oriented ME algorithm compared with the reference software JM9.0.

(a) Comparison of PSNR. The JM9.0 rows give PSNR in dB; the "improved" rows give the PSNR change of the improved algorithm relative to JM9.0 (dB).

Sequence            QP=50   QP=45   QP=40   QP=35   QP=30   QP=25   QP=20
City     JM9.0      27.56   30.33   32.92   35.3    37.51   39.81   43.12
         improved   -0.109  -0.033  -0.031  -0.031  -0.026  -0.026  -0.026
Crew     JM9.0      31.12   33.50   35.81   37.64   39.76   41.33   44.28
         improved   -0.032  -0.029  -0.026  -0.026  -0.021  -0.018  -0.011
Harbour  JM9.0      26.87   29.56   32.24   34.62   37.33   39.91   43.44
         improved   -0.025  -0.015  -0.015  -0.009  -0.011   0       0
Hero     JM9.0      31.56   34.15   36.72   38.96   41.64   44.23   47.11
         improved   -0.031  -0.029  -0.027  -0.025  -0.024  -0.022  -0.021
Zy       JM9.0      31.34   34.10   36.68   38.83   40.76   42.34   44.98
         improved   -0.031  -0.058  -0.032  -0.025  -0.023  -0.022  -0.022

(b) Comparison of bit rate. The JM9.0 rows give the bit rate; the "improved" rows give the bit rate increase of the improved algorithm relative to JM9.0.

Sequence            QP=50    QP=45    QP=40    QP=35    QP=30    QP=25     QP=20
City     JM9.0      855.53   878.55   1328.44  2321.1   5005.4   11646.8   26160.3
         improved   +4.34%   +2.06%   +1.16%   +1.44%   +0.70%   +0.50%    +0.25%
Crew     JM9.0      910.5    1092.34  1373.74  2033.52  3462.99  8095.62   20030.39
         improved   +0.71%   +0.56%   +0.55%   +1.52%   +1.17%   +0.81%    +0.46%
Harbour  JM9.0      1216.78  1686.31  2852.61  5016.66  9834.44  19028.65  33661.20
         improved   +1.52%   +1.59%   +1.43%   +1.21%   +0.81%   +0.48%    +0.24%
Hero     JM9.0      265.33   383.22   597.49   893.58   1549.19  2488.17   3994.02
         improved   +0.81%   +1.26%   +1.11%   +0.82%   +0.65%   +0.53%    +0.39%
Zy       JM9.0      461.35   531.83   702.11   991.39   1691.22  3412.22   7792.48
         improved   +0.93%   +2.66%   +2.36%   +2.29%   +1.55%   +1.03%    +0.53%

A. Derivation of the Systolic Array for the 4x4 Block PE Array

For a given reference picture, the computations of full search motion estimation for a 4x4 block can be organized as four nested loops:

For k = -p to p
  For l = -p to p
    For i = 0 to 3
      For j = 0 to 3
        SAD(k, l) = SAD(k, l) + | o(i, j) − r(k + i, l + j) |    (3)
      End(j)
    End(i)
  End(l)
End(k)

where [-p, p] is the search range. Systolic arrays for motion estimation can be derived by the method described by Kung [18]. A three-dimensional dependence graph (DG) in the (i, j, k) space for the 4x4 block full search motion estimation is presented in Fig. 4. In this DG, the nodes of the "AD" type perform the subtraction, the magnitude operation and an addition; the nodes of the "PA" type perform the addition of the intermediate sums; and the addition producing the final sum is carried out by the node of type "A".

Fig.4. Dependence graph (DG) of computation nodes and data dependencies for full search motion estimation of a line of 4x4 candidate blocks.
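The four nested loops above map directly onto software. The following C sketch (illustrative, with an assumed search-window layout) computes the best SAD of one 4x4 block over the full [-p, p] search range exactly as formalized in (3), before any systolic mapping.

```c
#include <stdlib.h>
#include <stdint.h>
#include <limits.h>

#define P 32  /* search range [-P, P], e.g. for a 65x65 window */

/* Full-search SAD of one 4x4 block. cur is the 4x4 current block; ref
 * points to the search-window pixel corresponding to displacement (0,0).
 * Returns the best SAD and the winning displacement (*best_k, *best_l). */
int full_search_4x4(const uint8_t cur[4][4],
                    const uint8_t *ref, int ref_stride,
                    int *best_k, int *best_l)
{
    int best = INT_MAX;
    for (int k = -P; k <= P; k++) {
        for (int l = -P; l <= P; l++) {
            int sad = 0;
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                    sad += abs(cur[i][j] - ref[(k + i) * ref_stride + (l + j)]);
            if (sad < best) { best = sad; *best_k = k; *best_l = l; }
        }
    }
    return best;
}
```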

We choose the projection vector d = [0, 0, 1]^T and the node mapping

P^T = [ 1 0 0 ; 0 1 0 ]

According to the min-max formulation [15], the minimal computation time schedule s can be found by solving the min-max problem:

min_s max_{x, y ∈ L} { s^T (x − y) + 1 }    (4)

under the constraints s^T d > 0 and s^T e > 0 for all e ∈ E, where E is the arc set of the DG:

E = { (i, j, k) | (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 0, −1) }    (5)

x and y are the indices of the nodes in the DG, and L is the index set:

L = { (i, j, k) | 0 ≤ i ≤ 3, 0 ≤ j ≤ 4, 0 ≤ k ≤ 2p + 1 }    (6)

Thus the schedule for minimal computation time is s^T = [2 1 1], and the resulting signal flow graph (SFG) is shown in Fig. 5. The number of delays on each edge is derived by:

D(e) = s^T · E = (2 1 1 1)    (7)
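As a quick numeric check of (7): with the schedule vector s = [2, 1, 1], the delay of each arc of E is simply the dot product s·e, which reproduces the (2, 1, 1, 1) delays annotated in Fig. 5. The sketch below is only an illustration of that step.

```c
#include <stdio.h>

int main(void) {
    int s[3] = {2, 1, 1};                        /* schedule vector s^T        */
    int E[4][3] = {{1, 0, 0}, {0, 1, 0},         /* arc set of the DG, eq (5)  */
                   {0, 0, 1}, {1, 0, -1}};
    for (int e = 0; e < 4; e++) {
        int delay = 0;
        for (int d = 0; d < 3; d++)
            delay += s[d] * E[e][d];             /* D(e) = s^T . e             */
        printf("arc %d: delay = %d cycle(s)\n", e, delay);  /* 2, 1, 1, 1      */
    }
    return 0;
}
```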

Fig. 5. The signal flow graph (SFG) of the 4x4 PE array. There are three types of PEs, i.e., the "AD", "PA" and "A" types. The letter "D" on each arrow line denotes one cycle of delay. The reference data are transferred from the left border of the PE array to the right, and their positions denote the cycle at which each pixel will be sent.

In Fig. 5, each square represents a PE. Corresponding to the functions of the nodes in the DG, there are also three types of PEs in the SFG, i.e., the "AD" type, the "PA" type and the "A" type. The character "D" on an arrow line represents a data transfer delay. Two cycles of delay are needed to transfer data from a "PA" type PE to another "PA" type PE or to the "A" type PE, and one cycle of delay is needed to transfer data between the other PEs. Thus the latency of the SFG is 12 cycles, which is the time from the cycle at which the first pixel of a candidate block arrives at the top-left PE to the cycle at which the SAD of this candidate block arrives at the output of the "A" type PE. The reference pixel inputs also need to be delayed to meet the timing requirement of the SFG, as indicated by the positions of the reference pixels in Fig. 5. The delays on the arrow lines are realized in the PE architectures shown in Fig. 6. Fig. 6(a) is the architecture of the "AD" type PE, and Fig. 6(b) is the architecture of the "PA" and "A" type PEs. In the "AD" type PE, two 8-bit registers, D1 and D2, store the current MB pixel and the reference pixel, and the absolute difference values are accumulated in the 16-bit register D3. The architecture of a "PA" or "A" type PE has two 16-bit registers and an addition circuit.

Fig. 6. Diagram of the PE architectures. (a) The architecture of the "AD" type PE. (b) The architecture of the "PA" and "A" type PEs.
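For readers more comfortable with software, the register behaviour of the two PE types in Fig. 6 can be summarized with a small cycle-level C model. This is a simplification (one function call per clock cycle), assumed from the description above, not the actual RTL.

```c
#include <stdint.h>
#include <stdlib.h>

/* "AD" type PE: registers D1/D2 latch the current and reference pixels,
 * D3 accumulates the absolute differences (one step per clock cycle).   */
typedef struct { uint8_t d1, d2; uint16_t d3; } ad_pe_t;

void ad_pe_clock(ad_pe_t *pe, uint8_t cur_in, uint8_t ref_in) {
    pe->d3 += (uint16_t)abs((int)pe->d1 - (int)pe->d2);  /* accumulate |D1-D2| */
    pe->d1 = cur_in;                                     /* latch new pixels   */
    pe->d2 = ref_in;
}

/* "PA"/"A" type PE: two 16-bit registers and an adder that merges two
 * partial sums arriving from the previous stage.                        */
typedef struct { uint16_t r0, r1; } pa_pe_t;

uint16_t pa_pe_clock(pa_pe_t *pe, uint16_t in0, uint16_t in1) {
    uint16_t sum = pe->r0 + pe->r1;   /* output the registered partial sums */
    pe->r0 = in0;
    pe->r1 = in1;
    return sum;
}
```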

B. The Architecture for the Motion Estimation

Based on the improved algorithm of Section II, the proposed architecture for ME is shown in Fig. 7. It consists of a 16x16 PE array, the merging scheme, the comparator circuit and two static random access memory (SRAM) modules, one for the reference data and one for the current MB. A 65x65 search range is used in this architecture, which means that 80x80 reference pixels are needed while the current MB is processed. Additionally, to support the data reuse scheme, the SRAM also needs to hold an extra 16x80 pixels. Thus the size of the SRAM for each search range is 80x80 + 16x80 = 7680 pixels. Adding the 256 bytes of the current MB, the total size of on-chip SRAM is 7680 + 256 = 7936 bytes. The merging scheme (see Fig. 8) computes the SADs of the blocks larger than 4x4 and sends a total of 41 SADs of the different partition patterns to the comparator, which tracks the minimum distortions and the corresponding motion vectors over the candidates. The outputs of the comparator, i.e., the outputs of the proposed architecture, are the 41 motion vectors of the current MB.



Fig. 7. The architecture of the motion estimation. Sixteen 4x4 PE arrays are used in this architecture, each computing one 4x4 SAD. Delay lines at the left border of the architecture match the timing requirement of the input reference data, and the inner delay lines ensure that the sixteen 4x4 SADs arrive at the right border at the same time.

Fig. 8. The merging scheme. (a) The first stage of the merging scheme for an 8x8 block, based on its own four 4x4 block SADs; i is the index of the 8x8 SAD in the MB and lies in the range 0 to 3. (b) The second stage of the merging scheme for the MB, based on the four 8x8 block SADs.
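Assuming the sixteen 4x4 SADs of a candidate are indexed in raster order within the MB, the two-stage merging can be written out in software as below. This is an illustrative model of the merging scheme, not the hardware adder tree itself.

```c
#include <stdint.h>

/* Merge sixteen 4x4 SADs (raster order inside the MB: index = 4*row + col,
 * row/col in 4x4-block units) into the 41 SADs of all AVC partitions.
 * out[] receives, in order: 16 x (4x4), then per 8x8 block 2 x (4x8),
 * 2 x (8x4), then 4 x (8x8), 2 x (8x16), 2 x (16x8), 1 x (16x16).        */
void merge_sads(const uint32_t sad4x4[16], uint32_t out[41])
{
    int n = 0;
    uint32_t sad8x8[4];

    for (int i = 0; i < 16; i++) out[n++] = sad4x4[i];            /* 4x4   */

    for (int i = 0; i < 4; i++) {                   /* first merging stage */
        int r = (i / 2) * 2, c = (i % 2) * 2;       /* top-left 4x4 of 8x8 i */
        uint32_t a = sad4x4[4 * r + c],       b = sad4x4[4 * r + c + 1];
        uint32_t d = sad4x4[4 * (r + 1) + c], e = sad4x4[4 * (r + 1) + c + 1];
        out[n++] = a + d;  out[n++] = b + e;                      /* 4x8   */
        out[n++] = a + b;  out[n++] = d + e;                      /* 8x4   */
        sad8x8[i] = a + b + d + e;                                /* 8x8   */
    }
    for (int i = 0; i < 4; i++) out[n++] = sad8x8[i];

    out[n++] = sad8x8[0] + sad8x8[2];              /* second stage: 8x16   */
    out[n++] = sad8x8[1] + sad8x8[3];
    out[n++] = sad8x8[0] + sad8x8[1];              /* 16x8                 */
    out[n++] = sad8x8[2] + sad8x8[3];
    out[n++] = sad8x8[0] + sad8x8[1] + sad8x8[2] + sad8x8[3];     /* 16x16 */
}
```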

The 16x16 PE array computes all the 4x4 SADs in the matching process. It is composed of sixteen 4x4 PE arrays, PEAn, n = 0...15. Each PEA, which is derived from the SFG (see Fig. 5), computes one of the 4x4 SADs of a candidate MB. It has four input ports for the current pixels, CP0~CP3, and four input ports for the reference pixels, RP0~RP3. The CPs, s = 0...3, are at the left border of each PEA. The CPs of PEA0, PEA4, PEA8 and PEA12 receive the current pixels from the SRAM of the current MB, and every other PEA receives the current pixels from its left neighbor PEA. In each PEA, the current pixels are propagated from its CP0~CP3 to its right border cycle by cycle. The RPs, s = 0...3, are also at the left border of each PEA. Sixteen reference pixels in the same row of the candidate MB can be read from the SRAM of the reference data in the same cycle. The reference pixels are propagated through the 16x16 PE array in the same manner as the current pixels, but before being transferred to the RPs of PEA0, PEA4, PEA8 and PEA12, they need to be latched to match the timing required by the SFG (see Fig. 5). Assume that the first pixel used by PEA0 appears at RP0 of PEA0 at the 0th cycle; then the first pixels used by PEA1, PEA2 and PEA3 appear at RP0 of PEA0 at the 4th, 8th and 12th cycle, respectively (see Fig. 7), and a pixel also needs 4 cycles to pass through each PEA. Note that a PEA needs 12 cycles to compute a 4x4 SAD. Consequently, the first 4x4 SADs of PEA0, PEA1, PEA2 and PEA3 appear at their output ports at the 12th, 20th, 28th and 36th cycle, respectively. In order that the sixteen 4x4 SADs of a candidate arrive at the right border of the 16x16 PE array in the same cycle, delay lines are applied to the 4x4 SADs produced by all PEAs except PEA3, PEA7, PEA11 and PEA15. Therefore the 16x16 PE array needs 36 cycles to flush its pipeline, and from the 36th cycle on, the sixteen 4x4 SADs of one candidate can be derived cycle by cycle.
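The staggered start-up just described reduces to a small piece of arithmetic. Assuming the figures quoted above (4 cycles of skew per PEA column and a 12-cycle SAD latency per PEA), the first 4x4 SAD of column n appears at cycle 8n + 12, which gives 12, 20, 28 and 36 and hence the 36-cycle pipeline flush.

```c
#include <stdio.h>

int main(void) {
    /* Column n of PEAs: its first reference pixel enters the array 4*n
     * cycles late, needs another 4*n cycles to reach column n, and the
     * PEA then needs 12 cycles to produce its first 4x4 SAD.            */
    for (int n = 0; n < 4; n++)
        printf("PEA column %d: first SAD at cycle %d\n", n, 4 * n + 4 * n + 12);
    /* prints 12, 20, 28, 36 -> the pipeline is full after 36 cycles */
    return 0;
}
```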

C. Data Flow of the Architecture of Motion Estimation

At the beginning of a MB matching process, the current MB is read row by row from the on-chip SRAM of the current MB into the 16x16 PE array during the first 16 cycles. Fig. 9 shows how the current MB is stored in PEA0~PEA15 at the 16th cycle.

Fig. 9. Current MB stored in the PEAn of the 16x16 PE array, n = 0, 1, 2, ..., 15.

After the placement of the current MB pixels, from the 16th cycle onward, 16 reference pixels are read from the on-chip SRAM of the search range and sent to the 16x16 PE array in each cycle. Since the 65x65 search range has 65 columns of candidates and each column has 65 candidate MBs, Table 2 shows the data flow of reference pixels during one current MB matching process, where d(u, v) denotes the pixel located at position (u, v) in the search window (u, v ∈ [0, 79]) and task(l) denotes the computations for the l-th column of candidate blocks. From Table 2 it is clear that the processing of one current MB needs 5216 cycles to read the current MB pixels and the reference pixels. Since the 16x16 PE array is able to read continuously, 5216 cycles is also the processing time of one current MB.

Table 2. The data flow of reference pixels for the 16x16 PE array.

cycle  Task(-32)               cycle  Task(-31)               ...  cycle  Task(32)
16     d(0, v), v ∈ [0, 15]     96     d(0, v), v ∈ [1, 16]     ...  5136   d(0, v), v ∈ [64, 79]
17     d(1, v), v ∈ [0, 15]     97     d(1, v), v ∈ [1, 16]     ...  5137   d(1, v), v ∈ [64, 79]
18     d(2, v), v ∈ [0, 15]     98     d(2, v), v ∈ [1, 16]     ...  5138   d(2, v), v ∈ [64, 79]
...    ...                      ...    ...                      ...  ...    ...
95     d(79, v), v ∈ [0, 15]    175    d(79, v), v ∈ [1, 16]    ...  5215   d(79, v), v ∈ [64, 79]
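The read schedule in Table 2 can be reproduced with a few lines of code. The sketch below (an illustration based on the table, with the first read assumed at cycle 16) walks through the 65 candidate columns and confirms the 5216-cycle figure quoted above.

```c
#include <stdio.h>

int main(void) {
    int cycle = 16;                              /* reads start at cycle 16    */
    for (int task = -32; task <= 32; task++) {   /* 65 candidate columns       */
        for (int u = 0; u < 80; u++) {           /* 80 rows of search window   */
            /* in this cycle: read d(u, v) for v in [task+32, task+47] */
            cycle++;
        }
    }
    printf("first cycle after the last read: %d (5216 cycles per MB)\n", cycle);
    return 0;
}
```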

IV. EXPERIMENTAL RESULTS

We synthesized the proposed architecture with a 0.18um CMOS standard cell technology. The results show that the gate count of the architecture is about 210K and the area of the on-chip SRAM is about 0.9 mm2. The design can work at a 260 MHz clock frequency. Additionally, from the data flow, the proposed architecture needs 5216 cycles to process one MB. Thus the proposed architecture can achieve real-time encoding at 30 fps for a picture size of 720x576 with a full search range of 65x65.

For the comparison of different VBSME architectures, we introduce the efficiency E [19], which is the ratio of the throughput R to the required silicon area A of the architecture. R is the number of search points the architecture computes per second:

R = f × (2P + 1)² / T    (8)

where f is the operating frequency of the architecture, T is the number of cycles needed to process one MB, and [-P, P] is the search range. Different from [19], we use the gate count G to evaluate the silicon area, so E is described as:

E = R / A = f × (2P + 1)² / (T × G)    (9)

The unit of E is search points per second per gate. Table 3 shows the comparison between the proposed architecture and other VBSME architectures. Among all the architectures, the proposed one provides the highest computational capability, more than two times that of [15], the most powerful architecture among the references. The proposed architecture also has the best efficiency E, i.e., the best performance/price ratio, of all the architectures.
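Plugging the proposed design's figures into (8) and (9) reproduces the last column of Table 3; the snippet below is just that arithmetic (f = 260 MHz, T = 5216 cycles, P = 32 and G ≈ 210k gates are the values quoted in the text).

```c
#include <stdio.h>

int main(void) {
    double f = 260e6;       /* clock frequency (Hz)             */
    double T = 5216.0;      /* cycles per macroblock            */
    double P = 32.0;        /* search range [-P, P], i.e. 65x65 */
    double G = 210e3;       /* gate count                       */

    double R = f * (2.0 * P + 1.0) * (2.0 * P + 1.0) / T;   /* eq (8) */
    double E = R / G;                                        /* eq (9) */
    printf("R = %.0f search points/s, E = %.1f points/s/gate\n", R, E);
    /* roughly 2.1e8 search points/s and E close to 1000, as in Table 3 */
    return 0;
}
```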

Table 3. Comparison with other VBSME architectures.

                 [16]          [15]          [14]            [13]          [12]               [11]          [10]          Proposed
Number of PEs    256           256           256             256           64                 16            16            256
Search range     32x32         64x64         16x16           48x32         32x32              32x32         16x16         65x65
Process          0.25um        0.18um        0.6um           0.35um        0.6um              0.25um        0.13um        0.18um
Block size       4x4 to 16x16  4x4 to 16x16  2^n x 2^n, n>=1 4x4 to 16x16  8x8, 16x16, 32x32  4x4 to 16x16  4x4 to 16x16  4x4 to 16x16
Frequency        100 MHz       100 MHz       72 MHz          67 MHz        60 MHz             150 MHz       294 MHz       260 MHz
Gate count       105k          154k          263k            105k          67k                71k           61k           210k
R                96,215,040    99,916,185    76,308,480      62,208,000    15,084,748         12,165,120    18,247,680    210,601,993
E                916.3         674.1         291.2           592.5         242.1              171.4         299.1         1002.8


V. CONCLUSION

Based on the AVC standard, this paper improved the motion estimation algorithm from the viewpoint of hardware implementation and then proposed its hardware architecture. The software experimental results show that the algorithm has almost the same performance as the AVC reference software JM9.0 in large picture size video encoding applications. The proposed architecture can perform variable block size full search and can produce the 41 motion vectors of one MB. The experimental results of the hardware architecture indicate that the design achieves real-time encoding capability for an AVC standard definition application with a 720x576 picture size at 30 fps and a full search range of 65x65. Furthermore, compared with other VBSME architectures, the proposed one also provides the highest computational capability and the best performance/price ratio.

REFERENCES
[1] Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, Document JVT-G050r1, June 2003.
[2] N. Kamaci and Y. Altunbasak, "Performance comparison of the emerging H.264 video coding standard with the existing standards," ICME'03, Vol. 1, pp. 345-348, July 2003.
[3] K. Denolf, C. Blanch, G. Lafruit, and J. Bormans, "An initial memory complexity analysis of the AVC codec," SIPS'02, IEEE Workshop on, 16-18 Oct. 2002.
[4] Chun-Hsien Chou and Yung-Chang Chen, "A VLSI architecture for real-time and flexible image template matching," IEEE Trans. Circuits Syst., 1989, 36(10): 1336-1342.
[5] K.-M. Yang, M.-T. Sun, and L. Wu, "A family of VLSI designs for the motion compensation block-matching algorithm," IEEE Trans. Circuits Syst., 1989, 36(10): 1317-1325.
[6] T. Komarek and P. Pirsch, "Array architectures for block matching algorithms," IEEE Trans. Circuits Syst., 1989, 36(10): 1301-1308.
[7] Yu-Wen Huang, Tu-Chih Wang, and Bing-Yu Hsieh, "Hardware architecture design for variable block size motion estimation in MPEG-4 AVC/JVT/ITU-T H.264," ISCAS'03, International Symp., 2003, pp. 796-799.
[8] Swee Yeow Yap and J. V. McCanny, "A VLSI architecture for variable block size video motion estimation," IEEE Trans. Circuits Syst. II: Express Briefs, 2004, 51(7): 384-389.
[9] J. F. Shen et al., "A novel low-power full-search block-matching motion-estimation design for H.263+," IEEE Trans. Circuits Syst. Video Technol., 2001, 7: 890-897.
[10] Swee Yeow Yap and J. V. McCanny, "A VLSI architecture for variable block size video motion estimation," IEEE Trans. Circuits Syst. II: Express Briefs, 2004, 51(7): 384-389.
[11] Cao Wei, Mao Zhi Gang, Lv Zhi Qiang, and Zhang Yan, "VLSI architecture design for variable-size block motion estimation in MPEG-4 AVC/H.264," IEEE Asia-Pacific Conference on Circuits and Systems, Proc., 6-9 December 2004, pp. 617-620.
[12] J. F. Shen et al., "A novel low-power full-search block-matching motion-estimation design for H.263+," IEEE Trans. Circuits Syst. Video Technol., 2001, 7: 890-897.
[13] Yu-Wen Huang, Tu-Chih Wang, and Bing-Yu Hsieh, "Hardware architecture design for variable block size motion estimation in MPEG-4 AVC/JVT/ITU-T H.264," ISCAS'03, International Symp., 2003, pp. 796-799.
[14] L. de Vos and M. Schobinger, "VLSI architecture for a flexible block matching processor," IEEE Trans. Circuits Syst. Video Technol., 1995, 5: 417-428.
[15] Minho Kim, Ingu Hwang, and Soo-Ik Chae, "A fast VLSI architecture for full-search variable block size motion estimation in MPEG-4 AVC/H.264," ASP-DAC, 2005, pp. 631-634.
[16] Cao Wei and Mao Zhi Gang, "A novel VLSI architecture for VBSME in MPEG-4 AVC/H.264," ISCAS 2005, IEEE International Symp., 23-26 May 2005, pp. 1794-1797.
[17] T. Wiegand and B. Girod, "Lagrangian multiplier selection in hybrid video coder control," ICIP'01, Thessaloniki, Greece, October 2001.
[18] S. Y. Kung, VLSI Array Processors. Englewood Cliffs, NJ: Prentice Hall, 1988, pp. 140-235.
[19] P. Pirsch and W. Gehrke, "VLSI architectures for video compression," in Systems and Electronics Proc., URSI International Symp., 25-27 October 1995, pp. 49-54.

Lei Deng was born in Harbin, Heilongjiang Province, P. R. China, in 1975. He received his B.Sc. in computer science from Jilin University in 1998 and his M.Sc. in computer science and engineering from Harbin Institute of Technology in 2000. Since 2000 he has been pursuing his doctoral degree at Harbin Institute of Technology in computer architecture and video signal processing. His research interests lie in the areas of computer architecture, digital signal processing and video compression.
