C.-M. Ou et al.: An Efficient VLSI Architecture for H.264 Variable Block Size Motion Estimation
An Efficient VLSI Architecture for H.264 Variable Block Size Motion Estimation Chien-Min Ou, Chian-Feng Le and Wen-Jyi Hwang
Abstract: This paper proposes a novel flexible VLSI architecture for the implementation of variable block size motion estimation (VBSME). The architecture is able to perform a full motion search on block sizes that are integral multiples of 4 × 4. To use the architecture, each 16 × 16 macroblock of the source frames should be partitioned into sixteen 4 × 4 non-overlapping subblocks, called primitive subblocks. The architecture contains sixteen modules and one VBSME processor. Each module, realized by cascading 1D systolic arrays, is responsible for the block-matching operations of a different primitive subblock. The realization has the advantages of high throughput, high flexibility and 100% processing element (PE) utilization. The motion estimation of all the primitive subblocks is performed in parallel. Because these primitive subblocks can be used to form the 41 subblocks of different sizes specified by H.264, the VBSME processor is employed to concurrently compute the sums of absolute differences (SADs) of all the 41 subblocks from the SADs of the primitive subblocks. This new architecture has lower latency and higher throughput than other existing VBSME architectures for the hardware implementation of H.264 encoders.
Index Terms: Video Coding, VLSI Architecture, Variable Block Size Motion Estimation, H.264 Standard.
Chien-Min Ou is with the Department of Electronics Engineering, Ching-Yun University, Chungli, 320 TAIWAN (e-mail: [email protected]). Chian-Feng Le is with the Graduate Institute of Computer Science and Information Engineering, National Taiwan Normal University, Taipei, 117 TAIWAN (e-mail: [email protected]). Wen-Jyi Hwang is with the Graduate Institute of Computer Science and Information Engineering, National Taiwan Normal University, Taipei, 117 TAIWAN (e-mail: [email protected]).
Contributed Paper. Manuscript received September 28, 2005.

I. INTRODUCTION
In many video encoders, block-matching algorithms (BMAs) have been established as important motion estimation and compensation tools for temporal redundancy removal because of their simplicity and effectiveness. Fixed-size BMAs, however, have difficulty accommodating the different changes in object movement within a video frame. This drawback may limit the performance of BMAs in low bit rate video coding applications. Advanced video coding standards such as H.264 [6], [11] lift this limitation by employing variable block size BMAs (VBS-BMAs). In H.264, a video frame is first split into macroblocks of size 16 × 16. Each macroblock may then be segmented into subblocks of different block sizes. The size of the smallest subblock, termed the primitive subblock, is 4 × 4. Therefore, a macroblock contains 16 non-overlapping primitive subblocks. Other subblocks correspond to derivatives of these primitive subblocks. In H.264, there are 41 subblocks in a macroblock.

Although VBS-BMAs are effective, their computational complexity may be very high. Many VLSI architectures have been proposed to cope with this complexity. However, some of these architectures [3], [7], [8] are not capable of processing all the block sizes specified by H.264. In [9], the flexibility of the one-dimensional (1D) processing element (PE) array is exploited for the realization of H.264 VBS-BMAs. Nevertheless, the employment of a 1D PE array may result in high latency and low efficiency.

The objective of this paper is to present an efficient VLSI architecture for hardware realizations of the full-search H.264 VBS-BMA. The architecture attains low latency, low power and high throughput while supporting all the block sizes specified by H.264. There are sixteen modules and one VBS motion estimation (VBSME) processor in the architecture. We use each module for the block-matching operations of a different primitive subblock. Each module is a cascade of 1D systolic arrays [2], which attains high throughput and high flexibility with 100% PE utilization. The sums of absolute differences (SADs) of all the primitive subblocks are computed by the 16 modules in parallel. We then use the VBSME processor to concurrently identify the best-matching block for each of the 41 subblocks from the SADs of the primitive subblocks. Whereas conventional 1D or 2D [5] architectures process only one motion vector (MV) within a macroblock, our new architecture can process up to 41 MVs in a smaller number of clock cycles. By taking advantage of this high throughput, the clock rate of the circuit can be reduced subject to a constraint on frame size and frame rate. The average power dissipation can therefore be substantially decreased [1].
Fig. 1. Block sizes supported by H.264 for motion estimation: one 16 × 16 block, 16 × 8 blocks, 8 × 16 blocks, 8 × 8 blocks, 8 × 4 blocks, 4 × 8 blocks, and 4 × 4 blocks.
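The count of 41 subblocks follows from tiling the macroblock with each of the seven block sizes. The short sketch below enumerates these tiles; the (width, height) reading of each size and the names `BLOCK_SIZES` and `enumerate_subblocks` are assumptions made for illustration and do not come from the paper.

```python
# Seven block sizes of Fig. 1, read here as (width, height); this reading is an assumption.
MACROBLOCK = 16
BLOCK_SIZES = [(16, 16), (16, 8), (8, 16), (8, 8), (8, 4), (4, 8), (4, 4)]

def enumerate_subblocks(mb=MACROBLOCK, sizes=BLOCK_SIZES):
    """Return (x, y, width, height) for every non-overlapping tile of each size."""
    subblocks = []
    for w, h in sizes:
        for y in range(0, mb, h):
            for x in range(0, mb, w):
                subblocks.append((x, y, w, h))
    return subblocks

subblocks = enumerate_subblocks()
# Number of tiles per block size: 1 + 2 + 2 + 4 + 8 + 8 + 16.
counts = {size: sum(1 for s in subblocks if s[2:] == size) for size in BLOCK_SIZES}
print(counts)
print(len(subblocks))
```

The per-size counts are 1, 2, 2, 4, 8, 8 and 16, which sum to 41.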
0098 3063/05/$20.00 © 2005 IEEE
IEEE Transactions on Consumer Electronics, Vol. 51, No. 4, NOVEMBER 2005
Fig. 2. The 4 × 4 current block and its search area. (a) The 4 × 4 current block, with pixels x0,0 to x3,3 grouped into columns X0 to X3. (b) The 7 × 7 search area, with pixels y0,0 to y6,6. (c) The index of columns of the first two block strips: Col0 to Col6 for block strip 0, and Col7 to Col13 for block strip 1.
The proposed architecture has been prototyped, simulated and synthesized using the UMC 0.18 μm CMOS standard cell technology. The circuit is able to support up to the super high definition TV (SHDTV) frame format at a frame rate of 60 fps. In addition, it consumes lower average power than other circuits. Experimental results reveal that the proposed architecture is an effective solution for high-performance, low-power VBS-BMA design.

II. BACKGROUND
This section reviews the background of this paper. We start with the VBS-BMAs supported by H.264. As shown in Figure 1, the VBS-BMAs divide a 16 × 16 macroblock into 16 × 8, 8 × 16, and 8 × 8 subblocks. An 8 × 8 subblock may be further split into 8 × 4, 4 × 8, and 4 × 4 subblocks. Therefore, the VBS-BMAs have to handle 41 different subblocks of 7 different sizes.

One simple way to implement the VBS-BMAs is based on the 1D architecture. Figure 2 shows an N × N current block (N = 4) and its search area for the full-search BMA. The range of displacement is [−p, p − 1] (p = 2) in both the x- and y-directions. Therefore, the size of the search region is (N + 2p − 1) × (N + 2p − 1), and there are 2p × 2p candidate blocks in the search area. The candidate blocks in the same row form a block strip. Adjacent block strips overlap. For illustration purposes, the columns of the block strips are indexed as shown in Figure 2.

The 1D systolic array of the full-search BMA is shown in Figure 3. It skews each column of the current block and of the candidate blocks for the SAD computation. Table 1 shows the data flow schedule, indicating the starting clock for calculating each column SAD, which takes N clock cycles to complete. Every N consecutive column SADs are then accumulated into one block SAD. The 1D architecture can be directly used for BMAs with block sizes n × m, where m can be any positive integer and n ≤ N. Accordingly, N should be 16 to realize the VBS-BMAs. However, as compared with the existing 2D architectures, the 1D systolic array has a longer latency for producing the best MVs. Its PE utilization is less than 100% when n < N. Moreover, the 1D architecture cannot search concurrently for the MVs of blocks with different sizes. The throughput of the architecture for VBS-BMA implementation may therefore be low.
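As a software reference for the full-search BMA described above, the following minimal sketch evaluates the SAD of all 2p × 2p candidates in an (N + 2p − 1) × (N + 2p − 1) search area and returns the displacement with the minimum SAD. It is only an illustration of the search itself, not of the systolic hardware; the function names and the synthetic test data are assumptions.

```python
def sad(current, search, dx, dy):
    """SAD between the N x N current block and the candidate whose top-left is (dx, dy)."""
    n = len(current)
    return sum(
        abs(current[r][c] - search[dy + r][dx + c])
        for r in range(n)
        for c in range(n)
    )

def full_search(current, search, p):
    """Return (best_mv, best_sad) over all 2p x 2p candidate blocks.

    The candidate at top-left (dx, dy) corresponds to the displacement
    (dx - p, dy - p), so each component lies in [-p, p - 1].
    """
    best_mv, best_sad = None, None
    for dy in range(2 * p):
        for dx in range(2 * p):
            s = sad(current, search, dx, dy)
            if best_sad is None or s < best_sad:
                best_mv, best_sad = (dx - p, dy - p), s
    return best_mv, best_sad

# Example with N = 4 and p = 2, i.e. a 7 x 7 search area (synthetic data).
N, P = 4, 2
size = N + 2 * P - 1
search_area = [[7 * r + 3 * c for c in range(size)] for r in range(size)]
# Use the candidate at dx = 3, dy = 1 as the current block.
current_block = [row[3:3 + N] for row in search_area[1:1 + N]]
mv, best = full_search(current_block, search_area, P)
print(mv, best)
```

Because the current block is copied from the candidate at (dx, dy) = (3, 1), the search returns the displacement (1, −1) with a SAD of 0.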
TABLE 1
The data flow schedule of 1D array (N=4).

Clock Cycle   Current Block   Search Area   Operations
0             X0              Col0          |X0 − Col0|
1             X1              Col1          |X1 − Col1|
2             X2              Col2          |X2 − Col2|
3             X3              Col3          |X3 − Col3|
4             X0              Col1          |X0 − Col1|
5             X1              Col2          |X1 − Col2|
6             X2              Col3          |X2 − Col3|
7             X3              Col4          |X3 − Col4|
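The pattern in Table 1 can be generated programmatically: every N clocks the array moves on to the next candidate block of the strip, and within a candidate it steps through the N columns. The sketch below is an assumption-based reconstruction of that pattern; the function name `schedule_1d` is hypothetical.

```python
def schedule_1d(n, num_clocks):
    """Return (clock, current_column, search_column) rows following the pattern of Table 1."""
    rows = []
    for clock in range(num_clocks):
        candidate = clock // n   # index of the candidate block within the block strip
        i = clock % n            # column of the current block processed at this clock
        rows.append((clock, f"X{i}", f"Col{candidate + i}"))
    return rows

rows = schedule_1d(4, 8)
for clock, cur, col in rows:
    print(clock, cur, col, f"|{cur} - {col}|")
```

With this schedule a single block SAD needs N clocks, so one 1D array processes the candidates of a strip strictly one after another.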
Fig. 3. The basic structure of 1D array.
III. THE PROPOSED ARCHITECTURE
Our architecture has the advantages of low latency, 100% PE utilization, and high throughput for the H.264 VBS-BMA implementation. Figure 4 shows the block diagram of this architecture, which contains the SAD modules, the VBSME processor, the control unit (CU), and the address generation unit (AGU). There are 16 SAD modules in the architecture, each of which is used for the SAD computation of one primitive subblock. The VBSME processor then fetches the SADs of the primitive subblocks to produce the SADs of the subblocks of other sizes in parallel. The CU and AGU coordinate the operations among the modules and the VBSME processor.

Given a current frame and its reference frame for motion estimation, we first divide the current frame into non-overlapping macroblocks of identical size 16 × 16. Each macroblock is associated with a search region in the reference frame. The macroblocks are arranged in a raster scan sequence. For each macroblock to be coded, 16 current primitive subblocks are formed.
Each current primitive subblock is associated with a search region, which is a subregion in the search region associated with the macroblock. The block-matching operations of the modules are based on the current primitive subblocks and their search regions. Figure 5 shows the primitive subblock and the corresponding search region assigned to each module.
Fig. 4. Block diagram of the proposed architecture.
Fig. 5. The primitive subblock and its search area assigned to each SAD module.
Figure 6 depicts the structure of each SAD module, which contains a RAM and a PE array. The PE array is constructed by cascading 1D arrays, as shown in Figure 7 for the 4 × 4 (N = 4) current block size. There are four 1D arrays in the PE array, and each 1D array contains four PEs. This circuit operates by scheduling the columns of the current primitive subblock through a delay line, and broadcasting two sets of candidate block columns in the search region on each clock cycle.
Fig. 6. The basic structure of SAD modules. (a) The structure of SAD modules. (b) The structure of module i in SAD modules.
Fig. 7. The basic structure of the PE Array. (a) structure of the PE array for the module i (b) structure of the 1D array in the PE array (c) structure of the PE in 1D array.
TABLE 2
THE DATA FLOW OF THE PROPOSED PE ARRAY SHOWN IN FIGURE 7 FOR P=2. A dash marks an empty cell.

Clock  Current_block_data  Block_strip_A  Block_strip_B  1-D Array 0    1-D Array 1    1-D Array 2    1-D Array 3
0      X0                  Col0           -              |X0 − Col0|    -              -              -
1      X1                  Col1           -              |X1 − Col1|    |X0 − Col1|    -              -
2      X2                  Col2           -              |X2 − Col2|    |X1 − Col2|    |X0 − Col2|    -
3      X3                  Col3           -              |X3 − Col3|    |X2 − Col3|    |X1 − Col3|    |X0 − Col3|
4      -                   Col4           Col7           |X0 − Col7|    |X3 − Col4|    |X2 − Col4|    |X1 − Col4|
5      -                   Col5           Col8           |X1 − Col8|    |X0 − Col8|    |X3 − Col5|    |X2 − Col5|
6      -                   Col6           Col9           |X2 − Col9|    |X1 − Col9|    |X0 − Col9|    |X3 − Col6|
7      -                   -              Col10          |X3 − Col10|   |X2 − Col10|   |X1 − Col10|   |X0 − Col10|
8      -                   Col14          Col11          |X0 − Col14|   |X3 − Col11|   |X2 − Col11|   |X1 − Col11|
9      -                   Col15          Col12          |X1 − Col15|   |X0 − Col15|   |X3 − Col12|   |X2 − Col12|
10     -                   Col16          Col13          |X2 − Col16|   |X1 − Col16|   |X0 − Col16|   |X3 − Col13|
11     -                   Col17          -              |X3 − Col17|   |X2 − Col17|   |X1 − Col17|   |X0 − Col17|
...    ...                 ...            ...            ...            ...            ...            ...
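The entries of Table 2 follow a regular pattern: 1D array k starts k clocks after array 0, handles the k-th candidate block of each block strip, and moves to the next strip every N clocks. The sketch below reproduces the operations column of the table under the assumption that the number of candidates per strip (2p) equals the number of 1D arrays (N), which holds for N = 4 and p = 2. The function name `array_operation` is hypothetical.

```python
N = 4                        # block width and number of 1D arrays (N = 2p is assumed here)
P = 2                        # displacement parameter; 2 * P candidate blocks per strip
STRIP_COLS = N + 2 * P - 1   # 7 columns per block strip, indexed as in Fig. 2(c)

def array_operation(clock, k):
    """Return (i, col), meaning |X_i - Col_col| for 1D array k at this clock, or None if idle."""
    t = clock - k            # array k starts k clocks after array 0
    if t < 0:
        return None
    strip, i = divmod(t, N)  # block strip index and current-block column index
    return i, strip * STRIP_COLS + k + i

for clock in range(12):
    ops = [array_operation(clock, k) for k in range(N)]
    cells = ["-" if op is None else f"|X{op[0]} - Col{op[1]}|" for op in ops]
    print(clock, *cells)
```

From clock 3 onward no array is idle, which is the 100% PE utilization property read off Table 2.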
Fig. 8. The basic structure of local RAM for each module (p=2)
Table 2 shows the data flow schedule of this PE array for the current block and the corresponding search area depicted in Figure 2. It can be observed from Table 2 that each 1D systolic array is used for the SAD computation between one candidate block and the current block. Hence, 4 block-matching operations can be performed concurrently in the architecture. The latency of the new PE array is therefore lower than that of the simple 1D structure. In addition, because of its 100% PE utilization, the latency of this structure is also lower than that of conventional 2D systolic architectures with PE utilization below 100%, such as AB2 [5].

From Table 2, we also observe that the columns of two adjacent block strips (termed a strip band) may be fetched concurrently to attain 100% PE utilization. Therefore, the local RAM in each module is employed for storing the strip band data in the search area. Figure 8 shows the structure of the RAM for p = 2. Since there are 16 + 2p − 1 columns in each strip, the strip band also contains 16 + 2p − 1 columns. The RAM therefore contains 4 + 2p − 1 lines, where each line stores the pixels in a column of the strip band. It can be observed from Figure 2 that the same columns of adjacent strips overlap by 3 pixels. Therefore, each line contains five 1-pixel cells. Each memory line can be updated as soon as all its contents have been loaded into the PE array and are no longer needed for subsequent block-matching operations. In this case, the pixels in the corresponding column of the next strip band (i.e., the next two consecutive block strips) will be loaded into that line. We can also increase the number of block strips contained in the local RAM to extend the interval between memory line updates. This, however, results in a larger storage size and a higher pin count for memory updating.
Fig. 9. The updating process of the local RAM of the 16 modules for p=2. (a) Four columns of the strip band will be fetched concurrently to the 16 modules. (b) Each of the four columns will be accessed by four different modules.
Fig. 10. The basic structure of VBSME processor.
Note that, in the proposed architecture, all the modules have an identical range of displacement [−p, p − 1] and the same scan order for fetching candidate blocks to their PE arrays. Therefore, the same line of all the 16 modules will be updated synchronously. As shown in Figure 9, four columns are fetched concurrently by the 16 modules for updating the local RAM. Each column is separated from the others by a multiple of 4 columns in the same strip band. Modules 4i, 4i + 1, 4i + 2 and 4i + 3 request different portions of the data from the same column in a strip band. Let t be the position of the leftmost of these four columns in the strip band. The pixels fetched from these 4 columns are then used to update the t-th line of the local RAM of the 16 modules.

Because all the 16 modules perform their SAD computations synchronously, the SADs produced by these modules on the same clock cycle are associated with the same MV. The SADs of the subblocks of other sizes can therefore be computed by adding the SADs produced by the modules. In our architecture, the VBSME processor, shown in Figure 10, is used for the SAD computation of the subblocks of other sizes.
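The addition step described above can be illustrated in software. The sketch below sums the 16 primitive SADs obtained for one candidate MV into the SADs of all 41 subblocks. It assumes the primitive SADs are arranged as a 4 × 4 grid in raster order and that sizes are read as (width, height); both are assumptions for illustration. It also sums the primitive SADs directly for each subblock, whereas a hardware adder tree would reuse partial sums.

```python
def merge_sads(prim):
    """Combine primitive SADs into the SADs of all 41 subblocks for one candidate MV.

    prim is a 4 x 4 grid, prim[row][col], of 4 x 4 primitive SADs (raster order assumed).
    Returns a dict mapping (x, y, width, height) in pixels to the subblock SAD.
    """
    out = {}
    for w, h in [(4, 4), (8, 4), (4, 8), (8, 8), (16, 8), (8, 16), (16, 16)]:
        for y in range(0, 16, h):
            for x in range(0, 16, w):
                out[(x, y, w, h)] = sum(
                    prim[r][c]
                    for r in range(y // 4, (y + h) // 4)
                    for c in range(x // 4, (x + w) // 4)
                )
    return out

# Synthetic primitive SADs: the value 4 * row + col identifies each primitive subblock.
prim = [[4 * r + c for c in range(4)] for r in range(4)]
merged = merge_sads(prim)
print(len(merged), merged[(0, 0, 8, 8)], merged[(0, 0, 16, 16)])
```

Keeping one running minimum per key of the returned dict across all candidate MVs would then give the best MV of each subblock.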
Fig. 11. The basic structure of 8 × 8 Mode Processor.
Fig. 12. The basic structure of Macroblock Mode Processor.
TABLE 3
THE LATENCY, THROUGHPUT AND PE ARRAY OF VARIOUS VLSI ARCHITECTURES.

Architecture   Number of PE   Latency (T)   Throughput (S)   Block size supported
AB2 [3]        256            496           1/496            16 × 16
Lai [4]        256            256           1/256            16 × 16
Vos [8]        256            5376          1/256            16 × 16, 8 × 8, 4 × 4 (masking)
Shen [7]       64             4096          21/4096          32 × 32, 8 × 8, 4 × 4 (N=32)
Yap [9]        16             4096          41/4096          16 × 16, 16 × 8, 8 × 16, 8 × 8, 8 × 4, 4 × 8, 4 × 4
Ours           256            256           41/256           16 × 16, 16 × 8, 8 × 16, 8 × 8, 8 × 4, 4 × 8, 4 × 4

Block size: N=16, Search Region Size: p=8
The VBSME processor contains four 8 × 8 mode processors and one macroblock mode processor. Each 8 × 8 mode processor computes the SADs of two 8 × 4 subblocks, two 4 × 8 subblocks, and one 8 × 8 subblock, as shown in Figure 11. In the macroblock mode processor shown in Figure 12, the SADs of the four 8 × 8 subblocks are used to obtain the SADs of two 16 × 8 subblocks, two 8 × 16 subblocks, and the 16 × 16 macroblock. In addition to the SAD computation, comparison circuits are included in the 8 × 8 mode processors and the macroblock mode processor to identify the best MV with minimum SAD for each subblock concurrently. Therefore, the best MV for a macroblock can be identified concurrently with the best MVs for all the other 40 subblocks in the macroblock. By contrast, conventional 1D or 2D systolic arrays are only able to find the best MV for one block size at a time. The throughput of our architecture is therefore higher than that of conventional 1D or 2D architectures.

IV. PERFORMANCE ANALYSIS
Let M be the number of subblocks in a macroblock. Therefore, M = 41 for the VBS-BMAs supported by H.264, and M = 1 for the fixed-size BMA. Define the latency T of a full-search BMA VLSI architecture as the number of clock cycles required to identify the best MVs for all the M subblocks in the macroblock. The architecture then produces M best MVs every T clock cycles. Define the throughput S of the architecture as the number of best MVs produced per clock cycle. It then follows that

$$S = \frac{M}{T}.$$

Given a range of displacement [−p, p − 1], the latency of the proposed architecture is $T = 4p^2$. Therefore, the throughput of the architecture is $S = 41/(4p^2)$. In our architecture, 16 × 16 PEs are used to attain this throughput. Table 3 shows the latency, throughput and PE number of various VLSI architectures for p = 8 (i.e., the area of the search range is 16 × 16). It follows from the table that the proposed architecture has the lowest latency and highest throughput. This is because the full-search BMA operations of all the primitive subblocks are performed in parallel in the architecture, and the results of these BMA operations can subsequently be used for computing the SADs of the subblocks with larger sizes. The PE number of the proposed architecture is identical to that of the traditional 2D systolic array AB2, which can only perform the fixed-size BMA. The architecture presented in [8] can be viewed as a direct extension of AB2 for VBS-BMA, where the best MV is identified for only one subblock at a time. Therefore, its improvement in throughput over AB2 is quite marginal. Using the same number of PEs, our architecture achieves a substantial improvement in throughput over these works.
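As a quick numerical check of the latency and throughput expressions, the sketch below evaluates T = 4p² and S = M/T with exact fractions for p = 8 and compares the result with the throughput listed for [9] in Table 3. The function names are hypothetical.

```python
from fractions import Fraction

def latency(p):
    """Latency T = 4 * p**2 in clock cycles for the displacement range [-p, p - 1]."""
    return 4 * p * p

def throughput(p, m=41):
    """Throughput S = M / T in best MVs per clock cycle."""
    return Fraction(m, latency(p))

T = latency(8)
S = throughput(8)
speedup_over_yap = S / Fraction(41, 4096)   # 41/4096 is the entry for [9] in Table 3
print(T, S, speedup_over_yap)
```

For p = 8 this gives T = 256 and S = 41/256, which is 16 times the throughput of 41/4096 listed for the 1D architecture of [9].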
TABLE 4
THE REQUIRED CLOCK RATE OF THE PROPOSED ARCHITECTURE AND ITS COUNTERPART [9] FOR VARIOUS FRAME SIZES AND FRAME RATES.

Frame size              Frame rate (fps)   Clock rate (MHz), Ours   Clock rate (MHz), Yap [9]
QCIF (176 × 144)        30                 0.76                     12.16
CIF (352 × 288)         30                 3.04                     48.64
4CIF (704 × 576)        30                 12.16                    194.56
16CIF (1408 × 1152)     30                 48.66                    778.56
SDTV (1280 × 720)       60                 97.32                    1557.12
HDTV (1280 × 720)       30                 27.64                    442.24
HDTV (1280 × 720)       60                 55.29                    884.64
SHDTV (1920 × 1080)     60                 123.49                   1975.84
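The QCIF to 16CIF entries of Table 4 are consistent with a simple model in which the required clock rate equals the number of macroblocks per frame, times the frame rate, times the latency T in cycles per macroblock. The sketch below implements this model; it is an assumption inferred from the numbers, not a formula stated in the paper, and `required_clock_mhz` is a hypothetical name.

```python
import math

CYCLES_PER_MACROBLOCK = 256   # T = 4 * p**2 with p = 8

def required_clock_mhz(width, height, fps, cycles=CYCLES_PER_MACROBLOCK):
    """Clock rate in MHz needed to process every 16 x 16 macroblock of every frame."""
    macroblocks = math.ceil(width / 16) * math.ceil(height / 16)
    return macroblocks * fps * cycles / 1e6

for name, w, h, fps in [("QCIF", 176, 144, 30), ("CIF", 352, 288, 30),
                        ("4CIF", 704, 576, 30), ("16CIF", 1408, 1152, 30)]:
    print(name, round(required_clock_mhz(w, h, fps), 2))
```

Passing `cycles=4096` models the latency of [9] from Table 3. For 1920 × 1080, whose height is not a multiple of 16, this model gives a slightly higher value than the table entry.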
One major advantage of high throughput is that the clock rate can be substantially reduced subject to a constraint on frame size and frame rate. The clock rate reduction may effectively reduce the average power [1] of block-matching operations. Table 4 shows the required clock rate of the proposed architecture and its counterpart [9] for various frame sizes and frame rates. From the table, it can be observed that, in the proposed architecture, a clock rate of 123.49 MHz is sufficient to perform VBS-BMA for video sequences with SHDTV frame size and 60 fps frame rate. By contrast, a clock rate of 1.97 GHz is necessary for the architecture proposed in [9] to perform the VBS-BMA for the same sequences.

The proposed architecture has been prototyped, simulated and synthesized using the UMC 0.18 μm CMOS standard cell technology (1.8 V). Table 5 shows the performance of the VBS-BMA circuit. The maximum frequency of the chip is 200 MHz. Therefore, the circuit supports all the frame sizes and frame rates listed in Table 4. In particular, for the CIF frame size, the circuit can perform the VBS-BMA for frame rates up to 972 fps. A wide range of H.264-based video applications can therefore be processed by this chip.

The average power of this circuit for various frame rates and frame sizes is shown in Table 6. It can be observed from the table that video sequences with smaller frame sizes consume lower average power because of the slower clock rate required by this circuit. In fact, the clock rate required for QCIF sequences at 30 fps is only 0.76 MHz, which results in a low power dissipation of 4.08 mW. The average power of existing circuits is also included in Table 6 for comparison. Exact comparisons of these circuits are difficult because they are realized with different technologies and have different capabilities and specifications. However, it should be noted that our circuit has the lowest power in the table for the same frame size and frame rate. Therefore, this circuit may be an effective alternative for video applications where high visual quality and low power dissipation are desired.

TABLE 5
THE PROPOSED VBS-BMA CHIP PERFORMANCE.

Number of PE       256
Searching Region   16 × 16
Block size         4 × 4, 4 × 8, 8 × 4, 8 × 8, 8 × 16, 16 × 8, 16 × 16
Technology         UMC 0.18 μm
Gate count         597K
Max frequency      200 MHz
V. CONCLUSION
As compared with existing VBS-BMA VLSI architectures, the proposed architecture is able to produce the best MVs for the H.264 VBS-BMAs with the lowest latency and highest throughput. Therefore, the clock rate of the circuit can be effectively reduced for low-power designs. In particular, the clock rate for the VBS-BMA operations over QCIF sequences at 30 fps is 0.76 MHz. The resulting power dissipation is only 4.08 mW, which may be attractive for mobile or portable video applications. On the other hand, the frame size and frame rate supported by the circuit can also be substantially extended subject to a clock rate constraint. In our experiment, the required clock rate for SHDTV sequences at 60 fps is 123.49 MHz. The circuit may therefore be very helpful for designs requiring high visual quality. Gate-level synthesis and verification show that our VBS-BMA circuit is beneficial for enhancing the performance of H.264 encoders over a wide range of video applications.
TABLE 6
THE AVERAGE POWER OF VARIOUS VLSI ARCHITECTURES

Architecture    Process    Block size
Saponara [10]   0.25 μm    16 × 16
Shen [7]        0.6 μm     8 × 8, 16 × 16, 32 × 32
Yap [9]         0.13 μm    4 × 4, 4 × 8, 8 × 4, 8 × 8, 8 × 16, 16 × 8, 16 × 16
Ours            0.18 μm    4 × 4, 4 × 8, 8 × 4, 8 × 8, 8 × 16, 16 × 8, 16 × 16

Architecture    Frame size   Frame rate (fps)   Power (mW)
Saponara [10]   QCIF         30                 8.0
Shen [7]        QCIF         30                 32.1
Shen [7]        CIF          30                 423
Yap [9]         CIF          15                 11.88
Yap [9]         CIF          30                 23.76
Ours            QCIF         30                 4.08
Ours            CIF          30                 20.48
Ours            4CIF         30                 50.7
Ours            16CIF        30                 203.06
Ours            SDTV         60                 408.62
Ours            HDTV         30                 133.84
Ours            HDTV         60                 250.40
Ours            SHDTV        60                 503.05
REFERENCES
[1] A.P. Chandrakasan and R.W. Brodersen, “Minimizing Power Consumption in Digital CMOS Circuits,” Proceedings of the IEEE, Vol. 83, pp. 498-523, 1995.
[2] Z. He, M.L. Liou, P.C.H. Chan, and R. Li, “An efficient VLSI architecture for new three-step search algorithm,” Proceedings of the 38th IEEE Midwest Symposium on Circuits and Systems, Vol. 2, pp. 1228-1231, 1996.
[3] P. Kuhn, Algorithms, complexity analysis and VLSI architectures for MPEG-4 motion estimation, Kluwer Academic, 1999.
[4] Y.K. Lai, Y.L. Lai, Y.C. Liu, P.C. Wu and L.G. Chen, “VLSI Implementation of the Motion Estimator with Two-Dimensional Data-Reuse,” IEEE Trans. Consumer Electronics, Vol. 44, pp. 623-629, 1998.
[5] P. Pirsch, “VLSI Architectures for Video Compression-A Survey,” Proceedings of the IEEE, Vol. 83, pp. 220-246, 1995.
[6] I.E.G. Richardson, H.264 and MPEG-4 Video Compression, John Wiley & Sons, 2003.
[7] J.F. Shen, T.C. Wang and L.G. Chen, “A novel low-power full-search block-matching motion-estimation design for H.263+,” IEEE Trans. Circuits and Systems for Video Technology, pp. 890-897, 2001.
[8] L. de Vos and M. Schobinger, “VLSI architecture for a flexible block matching processor,” IEEE Trans. Circuits and Systems for Video Technology, Vol. 5, pp. 417-428, 1995.
[9] S.Y. Yap and J.V. McCanny, “A VLSI Architecture for Variable Block Size Video Motion Estimation,” IEEE Trans. Circuits and Systems, Vol. 51, pp. 384-389, 2004.
[10] S. Saponara and L. Fanucci, “Data-adaptive motion estimation algorithm and VLSI architecture design for low-power video system,” IEE Proc. Computer and Digital Techniques, Vol. 151, pp. 51-59, 2004.
[11] T. Wiegand, G.J. Sullivan, G. Bjontegaard and A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Trans. Circuits and Systems for Video Technology, Vol. 13, pp. 560-576, 2003.
Chien-Min Ou received his diploma in Telecommunication Engineering from Chien Shin Institute of Technology, Chung Li, Taiwan, in 1978, and M.S. and Ph.D. degrees in Electrical Engineering from Chung Yuan Christian University, Chung Li, Taiwan, in 2000 and 2003, respectively. He joined the Faculty of the Department of Electronics Engineering, Ching Yun Institute of Technology, Chung Li, Taiwan, as an Instructor in 1978. Since 2004, he has been an Assistant Professor in the Department of Electronics Engineering, Ching Yun University. He is also a member of the honor society Phi Tau Phi. His research topics include VLSI design and testing, image processing, video compression, and motion estimation.
Chian-Feng Le was born in Taipei, Taiwan, R.O.C., on April 20, 1980. He received the B.S. degree in computer science and information engineering from TamKang University in 2003. He is presently working toward the M.S. degree in computer science and information engineering at National Taiwan Normal University. His research interests include VLSI chip design and multimedia communications.
Wen-Jyi Hwang received his diploma in electronics engineering from National Taipei Institute of Technology, Taiwan, in 1987, and M.S.E.C.E. and Ph.D. degrees from the University of Massachusetts at Amherst in 1990 and 1993, respectively. From September 1993 until January 2003, he was with the Department of Electrical Engineering, Chung Yuan Christian University, Taiwan. In February 2003, he joined the Graduate Institute of Computer Science and Information Engineering, National Taiwan Normal University, where he is now a Full Professor. His research interests are centered on multimedia communications systems with particular emphasis on image/video transmission. Dr. Hwang is the recipient of the 2000 Outstanding Research Professor Award from Chung Yuan Christian University, 2002 Outstanding Young Researcher Award from the Asia-Pacific Board of the IEEE Communication Society, and 2002 Outstanding Young Electrical Engineer Award from the Chinese Institute of the Electrical Engineering.