Abstract—Fractional motion estimation (FME) significantly enhances video compression efficiency, but its high computational complexity limits real-time processing capability. In this paper, we present a VLSI implementation of an FME design in High Efficiency Video Coding (HEVC) for ultra-high definition (Ultra-HD) video applications. We first propose a bilinear quarter pixel approximation, together with a search pattern based on it, to reduce the complexity of the interpolation and fractional search processes. Furthermore, a data reuse strategy is exploited to reduce the hardware cost of the transform. In addition, by using carefully chosen pixel parallelism and a dedicated memory access pattern, we fully pipeline the computation and achieve high hardware utilization. This design has been implemented as a 65nm CMOS chip and verified. The measured throughput reaches 995Mpixels/s for 7680x4320 30fps at 188MHz, at least 4.7 times faster than prior art. The corresponding power dissipation is 198.6mW, with a power efficiency of 0.2nJ/pixel. Due to these optimizations, our work achieves more than 52% improvement in power efficiency relative to previous works in H.264.
I. INTRODUCTION
Ultra-high definition (Ultra-HD) delivers a remarkably enhanced visual experience and has been targeted by next-generation applications. It includes the resolutions of 4K (3840×2160) and 8K (7680×4320), which deliver 4× to 16× the number of pixels per frame compared to today’s high definition (HD). To store and transmit such a huge volume of Ultra-HD data, efficient real-time compression is essential. As the latest video compression standard, High Efficiency Video Coding (H.265/HEVC) [1] improves the compression ratio by 50% compared to its predecessor H.264/AVC, but involves intensive computational complexity. As an important component of video coding, fractional motion estimation (FME) refines motion estimation to sub-pixel accuracy. It improves the rate-distortion performance significantly, by about 2-6dB in H.264 [3]. In HEVC, the coding efficiency has been further improved by many new coding tools, such as interpolation with 8-/7-tap filters, which provides up to 21.7% bitrate reduction
Manuscript received Jul. 21, 2014; revised Nov. 03, 2014. This work was supported by the STARC program in Japan and the National Natural Science Foundation of China (61222101). The chip fabrication was provided by VDEC, the University of Tokyo, in collaboration with STARC, e-shuttle and Fujitsu. G. He and Y. Li are with the Telecommunication School of Xidian University, Xi’an 710071, China (e-mail: [email protected]). D. Zhou, Z. Chen, T. Zhang and S. Goto are with the Graduate School of Information, Production and Systems, Waseda University, Kitakyushu 808-0135, Japan.
Gang He, Dajiang Zhou, Yunsong Li, Zhixiang Chen, Tianruo Zhang and Satoshi Goto
Index Terms—HEVC, FME, VLSI architecture, chip implementation, Ultra-HD.
High-Throughput Power-Efficient VLSI Architecture of Fractional Motion Estimation for Ultra-HD HEVC Video Encoding
This is the accepted version of the paper. The final version is available at http://dx.doi.org/10.1109/TVLSI.2014.2386897
TVLSI-00387-2014
Fig. 1. FME search patterns: (a) conventional 9T25S; (b) proposed 5T12S. Legend: integer, half and quarter candidates.
*: The corner is chosen using the evaluation results of neighboring integer pixels.
over H.264 [2]. However, FME incurs high computational complexity (49% of the total encoding time in HEVC [12]) due to the complex interpolation and fractional search processes. This poses critical challenges for implementations that must meet throughput and power constraints. Previous works [3]-[10] have proposed many FME architectures for H.264. One category adopts the search pattern of the reference software [3][4]. The other category reduces complexity by exploiting the correlations between neighboring fractional pixels, so that the number of search candidates is reduced in each iteration [5]. However, these multi-iteration algorithms have long latency, which limits the design throughput. More recently, many works have proposed search patterns with only one iteration to achieve high throughput [6]-[10], at the price of a performance loss that depends on the adopted search area. Due to the new features and higher complexity of HEVC, most of them cannot be applied directly. The emerging state-of-the-art works on FME in HEVC [12]-[14] focus on fast algorithms such as mode selection and search patterns, but do not propose any hardware architecture. An HEVC encoder design was proposed in [15], but it has no optimization of the FME part. In addition, most existing works target video applications with resolutions up to HD or 4K. For the higher throughput of Ultra-HD 7680x4320, a more efficient hardware architecture is desirable.
II. PROPOSED FME ALGORITHM
A. Bilinear Quarter Pixel Approximation
In HEVC decoding, fractional pixels are classified into a, b and c types and interpolated with 8-tap and 7-tap filters for the motion compensation (MC) process. Applying the same filters to FME in the encoder would introduce very high computational complexity. According to previous implementations [16][17], it takes 165 and 33 additions to interpolate the 15 fractional pixels per integer pixel with the 8-/7-tap filters in HEVC and the 6-/2-tap filters in H.264, respectively, so the number of additions increases fivefold. In this design, we propose a bilinear quarter pixel approximation (BQA) scheme. It employs the 8-tap filter to interpolate the half pixels and a bilinear 2-tap filter to interpolate the quarter pixels. It greatly benefits the FME hardware design with only 0.02dB PSNR degradation. Firstly, the number of additions for the 15 fractional pixels per integer pixel is reduced from 165 to 39, a reduction of about 76%. Moreover, the FME processing flow is optimized by exploiting the linearity of the operations. Since a quarter pixel is generated from integer and half pixels with the 2-tap filter, the transformed coefficients for a quarter search candidate can be derived directly from the coefficients of the integer and half candidates, rather than by a full transform. Therefore, the transform operations, including computation and memory access, are saved for quarter candidates.

Fig. 2. Block diagram of the FME design. Labeled blocks: interpolation module (½-pixel interpolation, ½-pixel buffer, integer buffer); fractional search (5T12S) module (corner decision, candidate selection, five R.G. and HT8x8 units, 8x8 coefficient buffer SRAM, Hadamard transform (8x8, 16x16, 32x32) with 5TC and 7SC paths, SATD & comparison). Labeled signals: reference pixels, IME costs, MV costs, fractional MVs and transform size. R.G.: residual generation; D.R.: data reuse; 5T: 5 candidates with transform; 12S: 12 search candidates; IME: integer motion estimation; MV: motion vector.

B. Search Pattern Analysis Based on BQA
In high-throughput FME design, a single-iteration search pattern is suitable and widely used. Most existing designs focus on reducing the number of search candidates (SCs) [6][10], since they mainly determine the hardware cost. However, in this design based on BQA, candidates at different locations do not have the same complexity. The hardware cost is related not only to the number of SCs, but depends largely on the number of candidates with transform (TCs). Fig. 1(a) shows a conventional search pattern in hardware design [9]. It fully searches a central 5x5 fractional area around the best integer candidate and provides good coding performance. If FME is performed based on the proposed flow with BQA, nine integer and half candidates among the total 25 SCs require transforms. We denote this pattern as 9T25S.
To further reduce complexity, we propose a 5T12S BQA-based search pattern, shown in Fig. 1(b), which utilizes the evaluation results of the neighboring integer candidates. A cost for each corner is calculated by summing the costs of the integer candidate at the corner and its two closest integer candidates. Using these costs, the best corner is decided first, and then the better of its two neighboring corners is chosen. The directions to these two corners contour a trapezoid-shaped region that contains a total of 12 candidates, five of which require transforms. Compared with 9T25S, it causes only a 0.03dB PSNR degradation while reducing the hardware cost of the fractional search by 48%; this corresponds to the fractional search module in the FME architecture shown in Fig. 2. Since this module takes about 2/3 of the design area, the proposed search pattern reduces the hardware cost of the whole system by 37%.
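The corner decision of the 5T12S pattern can be sketched in a few lines. This is a minimal illustration, not the chip's logic: the 3x3 cost dictionary, the function name `corner_decision`, and the reading of "two closest ones" as the two edge-adjacent integer candidates are assumptions made here.

```python
def corner_decision(cost):
    """Pick the best corner and its better neighboring corner.

    cost maps integer-candidate offsets (dx, dy), each in {-1, 0, 1},
    to their IME costs (hypothetical input layout).
    """
    corners = [(-1, -1), (1, -1), (-1, 1), (1, 1)]

    def corner_cost(corner):
        dx, dy = corner
        # Corner candidate plus its two closest integer candidates
        # (assumed to be the edge neighbors (dx, 0) and (0, dy)).
        return cost[(dx, dy)] + cost[(dx, 0)] + cost[(0, dy)]

    best = min(corners, key=corner_cost)
    # The two corners adjacent to the best one share one coordinate with it.
    neighbors = [(-best[0], best[1]), (best[0], -best[1])]
    better = min(neighbors, key=corner_cost)
    return best, better


# Example: costs fall towards negative dx and, more weakly, negative dy.
example = {(dx, dy): 100 + 10 * dx + 3 * dy
           for dx in (-1, 0, 1) for dy in (-1, 0, 1)}
print(corner_decision(example))
```

The two returned corners give the directions that bound the trapezoid-shaped search region; how the 12 candidates are enumerated inside that region is not reproduced here.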
C. Exhaustive Size-HAD
In HEVC, the residual quad-tree (RQT) allows transform sizes in the range of 4x4 to 32x32. This approach adapts the transform to the varying space-frequency characteristics of the residual. However, the evaluation function used in FME, the Hadamard transformed absolute difference (HAD), is limited to the 4x4/8x8 transform sizes. In this design, we apply an exhaustive size HAD (ES-HAD) in FME. The residuals are simultaneously processed by HT8x8, HT16x16 and HT32x32, and the results are then compared recursively to obtain the rate-distortion cost. This technique avoids forcing variable-size blocks into small transform sizes and improves the coding performance. Moreover, it determines the best transform size with much lower complexity than the complex RDO in RQT. The average coding performance improvement is about 0.05dB.
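One possible reading of the recursive comparison in ES-HAD is sketched below. The source does not spell out the comparison rule, so the quad-tree recursion, the normalisation of each SATD by the transform size n, the omission of any rate term, and all function names are assumptions for illustration only.

```python
def hadamard(n):
    """Sylvester-type n x n Hadamard matrix, n a power of two."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def satd(block):
    """Sum of absolute Hadamard coefficients, divided by n (assumed scaling)."""
    n = len(block)
    h = hadamard(n)
    tmp = [[sum(a * b for a, b in zip(hr, col)) for col in zip(*block)] for hr in h]
    coeff = [[sum(a * b for a, b in zip(row, hr)) for hr in h] for row in tmp]
    return sum(abs(v) for row in coeff for v in row) / n


def best_partition(block, r0=0, c0=0, min_size=8):
    """Return (cost, leaves); each leaf is (row, col, transform_size)."""
    n = len(block)
    whole = satd(block)
    if n == min_size:
        return whole, [(r0, c0, n)]
    half = n // 2
    split_cost, leaves = 0, []
    for r in (0, half):
        for c in (0, half):
            sub = [row[c:c + half] for row in block[r:r + half]]
            cost, sub_leaves = best_partition(sub, r0 + r, c0 + c, min_size)
            split_cost += cost
            leaves += sub_leaves
    # Keep the larger transform unless splitting is strictly cheaper.
    if whole <= split_cost:
        return whole, [(r0, c0, n)]
    return split_cost, leaves


flat = [[1] * 32 for _ in range(32)]                 # smooth residual
detail = [[(-1) ** (i + j) if i < 8 and j < 8 else 0  # one busy 8x8 corner
           for j in range(32)] for i in range(32)]
print(best_partition(flat))
print(best_partition(detail)[0], len(best_partition(detail)[1]))
```

Under these assumptions a smooth residual keeps the single 32x32 transform, while a residual with detail confined to one 8x8 corner is split so that only that corner uses 8x8 transforms.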
III. HARDWARE ARCHITECTURE
A. System Architecture
Fig. 2 shows the block diagram of the proposed FME design. The entire design is divided into an interpolation module and a fractional search module. From the referenced integer pixels, the half pixels in the vertical, horizontal and diagonal positions are first generated by an interpolation calculation unit and then stored in the corresponding buffers. In the meantime, the search-related integer pixels are also stored. The detailed architecture of the interpolation calculation unit is illustrated in Fig. 3 and discussed in Section III.B.
In the fractional search module, a corner decision unit first determines the fractional search area based on the proposed 5T12S search pattern, using the evaluation costs of the integer candidates. Then, the five TCs in the adopted search area, comprising four half candidates and one integer candidate, are selected and processed in parallel by five residual generation & HT8x8 units. The resulting HT8x8 coefficients (C8) are stored and reordered in an SRAM buffer for the subsequent reuse operations of HT16x16 and HT32x32. After the coefficients of the five candidates are calculated for all three transform sizes, we derive the results for the other seven SCs. Finally, the SATD costs of the 12 candidates with three transform sizes are calculated and compared, and the fractional MVs and the corresponding best transform size are determined. Data reuse and memory organization are described in detail in Sections III.C and III.D.
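The step that derives the remaining seven SCs from the five TCs rests on the linearity of the Hadamard transform under BQA. The sketch below checks that property numerically. It is an illustration under stated assumptions: the 2-tap bilinear filter is modeled without rounding (values are kept doubled to stay in integers), the blocks are random test data, and all names are hypothetical.

```python
import random


def hadamard(n):
    """Sylvester-type n x n Hadamard matrix, n a power of two."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def ht(block):
    """2-D Hadamard transform H * block * H^T in exact integer arithmetic."""
    h = hadamard(len(block))
    tmp = [[sum(a * b for a, b in zip(hr, col)) for col in zip(*block)] for hr in h]
    return [[sum(a * b for a, b in zip(row, hr)) for hr in h] for row in tmp]


def residual(cur, ref):
    return [[c - r for c, r in zip(cr, rr)] for cr, rr in zip(cur, ref)]


def quarter_coeffs_x2(t_int, t_half):
    """Twice the quarter-candidate coefficients, from two transformed candidates."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(t_int, t_half)]


random.seed(0)
N = 8


def rand_block():
    return [[random.randrange(1024) for _ in range(N)] for _ in range(N)]


cur, ref_int, ref_half = rand_block(), rand_block(), rand_block()

# Doubled quarter-pel reference: ref_int + ref_half (bilinear, rounding ignored).
ref_quarter_x2 = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(ref_int, ref_half)]
cur_x2 = [[2 * c for c in row] for row in cur]

direct = ht(residual(cur_x2, ref_quarter_x2))           # full transform
derived = quarter_coeffs_x2(ht(residual(cur, ref_int)),  # one addition per coefficient
                            ht(residual(cur, ref_half)))
print(direct == derived)
```

In hardware, rounding in the 2-tap filter would make the derived coefficients an approximation rather than an exact match; the sketch only shows why a full transform is not needed for quarter candidates.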
B. Parallelism & Processing Schedule
To design the FME architecture with high throughput and minimal hardware cost, the degree of parallelism has to be carefully considered. For the interpolation module, an efficient way is to design the processing unit based on a small block, as analyzed in [10]. Other blocks can be decomposed to re-utilize the hardware. In this design, based on the 16x8 block, the column unit is designed with 16-pixel parallelism. As shown in Fig. 3, it generates the 17 half pixels in horizontal positions with 8-tap filters and bypasses the 16 integer pixels. The corresponding 33 row units are then adopted to generate the half pixels in vertical and diagonal positions. However, in this case, the unit has to work at a high frequency of 380MHz to achieve the targeted throughput. In this design, we double the parallelism by adopting two column units and embedding two 8-tap filters into each row unit. By doing so, the required operating frequency is halved.

Fig. 3. Half pixel interpolation calculation unit. Labeled blocks: column unit (17 filters) x 2, producing (17 H + 16 I) x 2; row unit (2 filters) x 33, producing 2D x 17 + 2V x 16. I: integer pel; H, V, D ½ pel: half pels in different positions; 8:1: 8-tap filter.

For the fractional search module, the parallelism is decided by the trade-off between processing speed and hardware cost. In this design, to achieve the throughput of Ultra-HD 8K, 16-pixel parallelism is adopted in the fractional search module. Moreover, since it processes at the same speed as the interpolation module, a fully pipelined architecture can be designed.
Fig. 4 summarizes the overall processing schedule of this design. With the adopted 16-pixel parallelism, it takes eight cycles to process a 16x8 block. The operations are performed in a four-stage pipeline, which includes one interpolation stage, two HT8x8 stages and one cost calculation stage. Each stage takes eight cycles. Due to data dependency, the HT-1 stage starts six cycles later than the interpolation stage. Note that there is a gap of 64 cycles between the HT-2 and cost calculation stages, due to the coefficient reordering operation.

Fig. 4. Processing schedule of the design. Labeled timing: 768 cycles per 64x64 CTB; 256 cycles per mode (modes 0 to 2), each covering 16x8 blocks 0 to 31; 8 cycles per 16x8 block in each of the interpolation, HT-1, HT-2 and cost stages; HT-1 starts 6 cycles after interpolation; 64 cycles of coefficient reordering between HT-2 and cost.

In this design, each CTB (64x64) can be decomposed into 32 16x8 blocks. Therefore, it takes 256 (32x8) cycles to process one CTB for a given mode. To reach the targeted high throughput, we support three modes, which are selected from the 64x64 to 16x8/8x16 partitioning modes, both symmetric and asymmetric, by using the advanced mode pre-decision (AMPD) method [3] with the IME evaluation results. This causes about 0.08dB PSNR quality loss. In this way, the total cycle count per CTB is 3 x 256 = 768 cycles. For the targeted application of 7680x4320 30fps real-time encoding, the following operating frequency is required:

\[
f = 768\ \text{cycles/CTB} \times \frac{7680 \times 4320 \times 30\ \text{pixels/s}}{64 \times 64\ \text{pixels/CTB}} \approx 188\ \text{MHz}
\]

C. Data Reuse in ES-HAD
In this design, ES-HAD is adopted as the cost evaluation function. The residuals are simultaneously processed with HT8x8, HT16x16 and HT32x32. In an individual implementation, six, eight and ten layers of butterfly additions (ADD) are needed, respectively, for the pipeline design. Owing to the recursive structure of the HT, we exploit data reuse between different transform sizes. As illustrated in derivation (1), the transform result T2x of the residual matrix R can be calculated by reusing the transform results Tx (Tx-11, Tx-12, Tx-21, Tx-22) of the decomposed residual matrices R11, R12, R21, R22.

\[
H_{2x} = \begin{bmatrix} H_x & H_x \\ H_x & -H_x \end{bmatrix}, \quad
R = \begin{bmatrix} R_{11} & R_{12} \\ R_{21} & R_{22} \end{bmatrix}, \quad
T_{x\text{-}ij} = H_x R_{ij} H_x^{T}
\]
\[
T_{2x} = H_{2x} R H_{2x}^{T}
= \begin{bmatrix} H_x & H_x \\ H_x & -H_x \end{bmatrix}
\begin{bmatrix} R_{11} & R_{12} \\ R_{21} & R_{22} \end{bmatrix}
\begin{bmatrix} H_x & H_x \\ H_x & -H_x \end{bmatrix}^{T}
= \begin{bmatrix}
T_{x\text{-}11} + T_{x\text{-}12} + T_{x\text{-}21} + T_{x\text{-}22} &
T_{x\text{-}11} - T_{x\text{-}12} + T_{x\text{-}21} - T_{x\text{-}22} \\
T_{x\text{-}11} + T_{x\text{-}12} - T_{x\text{-}21} - T_{x\text{-}22} &
T_{x\text{-}11} - T_{x\text{-}12} - T_{x\text{-}21} + T_{x\text{-}22}
\end{bmatrix}
\qquad (1)
\]

The reuse operation needs only two layers of ADD butterflies in the pipeline design. Based on this characteristic, only the HT8x8 architecture is implemented in full. Two additional ADD layers each are added to further calculate HT16x16 and HT32x32. The numbers of reused layers for HT8x8, HT16x16 and HT32x32 are thus 0, 6 and 8, respectively. The data reuse percentage is 58% (14 of the 24 layers), which can also be regarded as a 58% hardware cost reduction, as illustrated in Table I. Note that this reduction applies to the HT implementation, as marked in Fig. 2. Since HT takes about 1/3 of the area of the whole system, the reuse reduces the hardware cost of the design by 34%.

TABLE I. COMPARISON FOR INDIVIDUAL AND DATA REUSE IMPLEMENTATION

Implementation comparison      Addition layers                  Hardware cost percentage
                               HT8x8    HT16x16    HT32x32
Individual implementation      6        8          10           100%
Data reuse reduction*          0        6          8            58%
Data reuse implementation      6        2          2            42%
*: This reduction is for the Hadamard transform part, which is marked in Fig. 2.
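Derivation (1) can be checked numerically with a short script. The sketch below builds the larger transform from four smaller ones using two explicit butterfly layers and compares it with a direct transform; the function names and the random test data are assumptions for illustration, and the code models arithmetic only, not the pipelined datapath.

```python
import random


def hadamard(n):
    """Sylvester-type n x n Hadamard matrix, n a power of two."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def ht(block):
    """Direct 2-D Hadamard transform H * block * H^T."""
    h = hadamard(len(block))
    tmp = [[sum(a * b for a, b in zip(hr, col)) for col in zip(*block)] for hr in h]
    return [[sum(a * b for a, b in zip(row, hr)) for hr in h] for row in tmp]


def reuse(t11, t12, t21, t22):
    """Build T_2x from the four T_x quadrants with two butterfly layers."""
    top, bottom = [], []
    for r11, r12, r21, r22 in zip(t11, t12, t21, t22):
        tl, tr, bl, br = [], [], [], []
        for a, b, c, d in zip(r11, r12, r21, r22):
            s1, d1, s2, d2 = a + b, a - b, c + d, c - d  # butterfly layer 1
            tl.append(s1 + s2)                           # butterfly layer 2
            tr.append(d1 + d2)
            bl.append(s1 - s2)
            br.append(d1 - d2)
        top.append(tl + tr)
        bottom.append(bl + br)
    return top + bottom


def quadrants(block):
    h = len(block) // 2
    cut = lambda r, c: [row[c:c + h] for row in block[r:r + h]]
    return cut(0, 0), cut(0, h), cut(h, 0), cut(h, h)


random.seed(1)
res32 = [[random.randrange(-512, 512) for _ in range(32)] for _ in range(32)]

# HT16x16 of each 16x16 quadrant, built from four HT8x8 results.
t16 = [reuse(*[ht(q8) for q8 in quadrants(q16)]) for q16 in quadrants(res32)]
# HT32x32 built from the four reused HT16x16 results.
t32 = reuse(*t16)
print(t32 == ht(res32))
```

Each reuse step adds two addition layers on top of the six layers of HT8x8, which is consistent with the 6/2/2 row of Table I.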
Fig. 5. Chip implementation result.

(a) Chip micrograph: die 4.2mm x 2.1mm; core 2.65mm x 1.24mm; labeled blocks: HEVC FME, PLL, SRAM, SRAM for chip test.

(b) Chip specification
Technology            E-Shuttle 65nm CMOS
I/O pads              162
Die area              8.82mm2
Core area             3.29mm2
Density               81%
Logic gate count      1183k (2-input NAND)
SRAM                  19.2KB/5.1KB*
Max. resolution       7680x4320 30fps (10bits/pel)
Max. pixel rate       995Mpixels/s
Supported modes       3 modes from 64x64 to 16x8/8x16
Unsupported modes     8x8, 8x4, 4x8 modes
*: 5.1KB SRAM for chip test

(c) Area breakdown in logic gate count and on-chip memory
Module                    Logic gate count    On-chip memory
Interpolation             442k                0
Frac. search: HT          439k                19.2KB
Frac. search: others      302k                0

(d) Measured core power for different scenarios
Scenario                      Measured power
4320p30 @1.20V, 188MHz        198.6mW
2160p60 @0.70V, 95MHz         48.3mW
2160p30 @0.70V, 48MHz         25.9mW
1080p30 @0.70V, 12MHz         6.3mW
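The operating points in Fig. 5(d) can be cross-checked against the 768-cycle-per-CTB schedule with a few lines of arithmetic. This is a sketch under one assumption not stated in the source: frames whose height is not a multiple of 64 are padded to whole CTB rows, so the CTB count is rounded up. Without the rounding, the 8K case evaluates to about 186.6MHz rather than 188MHz. Function names are hypothetical.

```python
import math

CYCLES_PER_CTB = 768  # 3 modes x 256 cycles, from the processing schedule
CTB_SIZE = 64


def required_mhz(width, height, fps):
    """Minimum clock in MHz, assuming partial CTBs are padded to full ones."""
    ctbs = math.ceil(width / CTB_SIZE) * math.ceil(height / CTB_SIZE)
    return CYCLES_PER_CTB * ctbs * fps / 1e6


def energy_pj_per_pixel(power_mw, width, height, fps):
    """Energy per pixel in pJ from measured core power and pixel rate."""
    return power_mw * 1e-3 / (width * height * fps) * 1e12


scenarios = [  # (label, width, height, fps, measured MHz, measured mW)
    ("4320p30", 7680, 4320, 30, 188, 198.6),
    ("2160p60", 3840, 2160, 60, 95, 48.3),
    ("2160p30", 3840, 2160, 30, 48, 25.9),
    ("1080p30", 1920, 1080, 30, 12, 6.3),
]
for label, w, h, fps, mhz, mw in scenarios:
    print(label,
          "needs %.1f MHz (measured at %d MHz)," % (required_mhz(w, h, fps), mhz),
          "%.1f pJ/pixel" % energy_pj_per_pixel(mw, w, h, fps))
```

With this assumption every measured clock is at or slightly above the computed minimum, and the computed energies (about 199.5 and 97.1 pJ/pixel for the first two rows) are close to the 199.4 and 97.0 pJ/pixel reported in the text.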
Fig. 6. Photos of verification system.

TABLE II. CODING EFFICIENCY COMPARISON WITH HM10.0

Video sequence        Delta bit-rates (%)*              BD-Bitrate (%)   BD-PSNR (dB)
                      QP37    QP32    QP27    QP22
4320p
  NebutaFestival      1.37    1.46    1.72    2.32      1.68             -0.06
  LocomotiveTrain     1.77    1.89    2.34    2.61      2.18             -0.09
2160p
  DucksTakeOff        1.33    1.43    1.42    1.49      1.43             -0.04
  CrowdRun            2.05    1.94    2.22    2.42      2.19             -0.08
  ParkJoy             1.37    1.48    1.82    2.24      1.69             -0.05
  InToTree            1.54    1.74    1.8     2.22      1.87             -0.06
1080p
  Kimono              2.08    2.13    2.39    3.08      2.37             -0.08
  ParkScene           2.55    2.67    3.02    3.54      3.04             -0.12
  Cactus              1.65    1.67    1.77    1.93      1.78             -0.06
  BQTerrace           1.85    1.99    2.02    2.26      2.10             -0.10
  BasketballDrive     2.12    2.28    2.58    2.64      2.44             -0.11
Average               1.78    1.88    2.09    2.43      2.07             -0.08
*: These QP-specific delta bit rates were obtained by matching the QP-specific PSNR of this design to the RD curve of HM.
D. Memory Organization & Access Pattern
According to derivation (1), the reuse operation for the matrix T2x needs the results of all four decomposed blocks Tx-ij. If the decomposed blocks are processed in a pipelined manner, their results have to be stored for the subsequent reuse operation. In this design, a two-port SRAM is employed to store and reorder the C8 coefficients. The word size is 240 bits (16 pixels x 15 bits/pixel), and a depth of 128 words is used to store two 32x32 blocks. With a dedicated access pattern, the cost calculations of the 8x8, 16x16 and 32x32 HADs are performed in parallel in a pipelined manner. The addresses are divided into two parts, which are written and read in a ping-pong fashion. In each part, the C8 coefficients of the four decomposed 16x16 blocks are written row by row in ascending address order. A reuse operation of HT16x16 needs the data of two rows separated by eight addresses, while HT32x32 needs the data of four rows separated by 16 addresses, taken after the reuse ADD operation of HT16x16. Thus, we classify the addresses within each part into eight groups by applying the MOD (%) operation with 8. The data are read in ascending order of group identity between groups, and in ascending address order within a group. In this way, the processing units of all HAD calculations with the adopted 16-pixel parallelism can operate in a pipelined manner.
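The read order described above can be made concrete with a small address generator. This is a sketch of the addressing rule only, assuming 64 words per ping-pong part (half of the 128-word depth); the function names and the pairing helper are hypothetical.

```python
PART_DEPTH = 64   # assumed: 128 words split into two ping-pong parts
GROUPS = 8


def read_order(part_depth=PART_DEPTH, groups=GROUPS):
    """Addresses sorted by group (address % 8), then by address."""
    return sorted(range(part_depth), key=lambda addr: (addr % groups, addr))


def ht16_pairs(group):
    """Address pairs, eight apart, consumed by the HT16x16 reuse step."""
    return [(group + 16 * k, group + 16 * k + 8) for k in range(4)]


order = read_order()
print(order[:8])       # first group: addresses 0, 8, 16, ..., 56
print(ht16_pairs(0))   # four pairs, one per 16x16 block
```

Within one group of eight reads, each consecutive pair feeds one HT16x16 reuse operation, and the four pairs start 16 addresses apart, which is the spacing the HT32x32 reuse needs. As a side note, 240 bits x 128 words is 3840 bytes; five such buffers, one per TC, would total 19.2KB, which matches the reported SRAM size, although the source does not state this breakdown explicitly.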
IV. IMPLEMENTATION RESULT
To verify the proposed architecture, it is implemented in 65nm 1P12M LVT CMOS. Fig. 5 shows the die photo and implementation results. 1183K logic gates, 24.3KB of SRAM and a PLL are integrated into a 3.29mm2 core. The chip is verified with the FPGA-based evaluation system shown in Fig. 6. In the system, the FPGA implements the verification core, which generates the chip control signals and stores the result data. A DUT board connects the FPGA and the ASIC chip. Through a JTAG interface, a PC is connected to the mother board to load the test code and read the results.
In the measurement process, we use a constant-voltage power supply that supports mA-level current measurement as the power source. The chip power consumption is measured in different scenarios, i.e., combinations of supply voltage and clock frequency for different video applications. Firstly, for a given video resolution, the lowest operating frequency is derived from the required throughput and the chip processing speed. The on-chip clock is adjusted to this frequency with the on-chip PLL and the clock control system in the FPGA. Secondly, we lower the core supply voltage as far as possible while the chip still responds correctly. The power is then calculated as the product of the given voltage and the measured current. It should be noted that the measured power is for the chip core.
The results for different video applications are shown in Fig. 5(d). At a 1.2V supply and 188MHz operating frequency, the design realizes 7680x4320 30fps real-time encoding. The corresponding power dissipation is 198.6mW, with an energy efficiency of 199.4pJ/pixel. At 0.7V, 1080p30 processing dissipates only 6.3mW when running at 12MHz. The best energy efficiency is 97.0pJ/pixel, measured at 0.7V and 95MHz.
The proposed algorithm is compared with the HM10.0 reference software. High-resolution sequences are chosen for the Ultra-HD encoding applications.
As shown in Table II, the overall PSNR degradation of the proposed design is about 0.08dB, equivalent to an approximately 2.07% bit-rate increase. The coding performance of the design is affected by the four adopted algorithms: BQA, the 5T12S search pattern, ES-HAD and mode selection. Firstly, mode selection causes the major performance degradation, a PSNR drop of about 0.08dB. The processing speed of the design is related to the number of selected modes. Considering the high throughput of Ultra-HD applications, three selected modes are processed in this design. Moreover, the proposed BQA and 5T12S search pattern trade coding performance for reduced FME complexity, causing only 0.02dB and 0.03dB PSNR degradation, respectively. Through the interpolation approximation and the reduced search area, they greatly benefit the hardware design. In addition, ES-HAD is applied to provide more flexible transform sizes for FME. It improves the coding efficiency by 0.05dB PSNR. Table II also shows the performance as a function of QP. The quality loss becomes slightly larger when QP is small. This is mainly because the small partitioning prediction modes, including 8x8, 8x4 and 4x8, are not supported in the design.

                        This Work             TCSVT’09 [9]        ISSCC’09 [7]/ICME’09 [8]   TVLSI’10 [4]
Max. resolution         7680x4320@30fps       720x480@30fps       4096x2160@24fps            1920x1080@30fps
Max. throughput         995Mpixels/s          10.4Mpixels/s       212Mpixels/s               62.2Mpixels/s
Standard                HEVC/H.265            AVC/H.264           H.264 MVC                  AVC/H.264
Supported trans.        HT8x8/16x16/32x32     HT4x4               HT4x4/8x8                  HT4x4
Algorithm               8/2-tap & 5T12S       6/2-tap & 9T25S     2-tap & 9T49S              6/2-tap & 9T9Sx2
Technology/supply       65nm/1.2V             0.18um/1.8V         90nm/1.2V                  0.18um/1.2V
Logic gates/SRAM        1183K/19.2KB          199.2K/-            448K/-                     321K/9.72KB
Cycles/pixel 1)         0.19                  5.19                1.32                       2.46
FME power               198.6mW               18.3mW 2)*          135.0mW 2)*                374mW**
Norm. FME power 3)      198.6mW               4.4mW               97.5mW                     135.0mW
Power efficiency        0.20nJ/pixel          0.42nJ/pixel        0.46nJ/pixel               2.17nJ/pixel
1): The speed normalized to the processing cycles for each pixel.
2): FME power calculated from encoder core power and FME area proportion.
3): FME power normalized to 65nm/1.2V (P65nm/1.2V = P90nm/1.2V / 1.385 = P0.18μm/1.2V / 2.77 = P0.18μm/1.8V / 4.16).
*: Chip measured power. **: Estimated power in post-layout simulation.
Fig. 7. Design comparison with the state-of-the-art designs.

Fig. 7 summarizes the chip comparison between the proposed and state-of-the-art designs. The design delivers a maximum throughput of 995Mpixels/s for the 7680x4320 30fps video application, which is at least 4.7 times higher than previous designs. Because of the different targeted video specification, our design has a higher hardware cost and power consumption. However, the hardware resources normalized per pixel are reduced (by 44% in normalized logic gates and 52% in normalized power, compared with the FME part in [7]). Moreover, for lower-throughput applications, this design consumes much less power than the others by scaling down the operating frequency and voltage. Some measured data are shown in Fig. 5(d). At a throughput similar to that of [7] (3840x2160 30fps), our design consumes only 25.9mW, which saves 73% of the power. As for processing speed, the design takes only 0.19 cycles per pixel, which is more than 6.8 times faster than the others. This speed improvement comes from both the high parallelism and the hardware efficiency. In addition, the power efficiency, defined in nanojoules per pixel, is calculated for the comparison. The previous works [4][7]-[9] are designed for H.264, while this work is for HEVC. Compared with H.264, FME in HEVC has higher complexity and also involves more data dependency in hardware implementation.
Despite that, our chip achieves better power efficiency (0.2nJ/pixel) than previous works in H.264, an improvement of at least 52%, even after accounting for technology scaling. The improved power efficiency benefits from the optimization of both algorithm and hardware: (1) BQA greatly reduces the complexity of interpolation and optimizes the processing flow; (2) the proposed search pattern reduces the complexity of the search process by nearly half; (3) the data reuse strategy, together with the dedicated hardware design including the carefully chosen parallelism and the arranged memory access pattern, results in high hardware utilization.
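The normalized power and power-efficiency figures of the comparison can be reproduced from the raw numbers and the scaling factors given in the footnote of Fig. 7. The sketch below is plain arithmetic on the published values; the dictionary layout and function name are assumptions for illustration.

```python
# Scaling factors to 65nm/1.2V, taken from footnote 3) of Fig. 7.
SCALE = {
    ("65nm", 1.2): 1.0,
    ("90nm", 1.2): 1.385,
    ("0.18um", 1.2): 2.77,
    ("0.18um", 1.8): 4.16,
}

# (FME power in mW, technology/supply, max. throughput in Mpixels/s)
DESIGNS = {
    "This work":    (198.6, ("65nm", 1.2), 995.0),
    "TCSVT'09 [9]": (18.3, ("0.18um", 1.8), 10.4),
    "ISSCC'09 [7]": (135.0, ("90nm", 1.2), 212.0),
    "TVLSI'10 [4]": (374.0, ("0.18um", 1.2), 62.2),
}


def efficiency_nj_per_pixel(power_mw, tech, mpixels_per_s):
    """Normalized power divided by throughput; mW / (Mpixel/s) equals nJ/pixel."""
    return power_mw / SCALE[tech] / mpixels_per_s


eff = {name: efficiency_nj_per_pixel(*vals) for name, vals in DESIGNS.items()}
for name, value in eff.items():
    print("%-14s %.2f nJ/pixel" % (name, value))

best_prior = min(v for k, v in eff.items() if k != "This work")
improvement = 1 - eff["This work"] / best_prior
print("improvement over best prior work: %.1f%%" % (100 * improvement))
```

The result agrees with the table row (0.20, 0.42, 0.46 and 2.17 nJ/pixel) and with the stated improvement of more than 52% over the most efficient prior design.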
V. CONCLUSION
In this paper, an efficient HEVC FME architecture is designed. A bilinear quarter pixel approximation strategy, together with a search pattern based on it, reduces the complexity. Furthermore, a data reuse strategy and the dedicated hardware design result in high hardware utilization. The implemented chip realizes 7680x4320 30fps real-time encoding at a 1.2V supply and 188MHz operating frequency. The achieved throughput of 995Mpixels/s is at least 4.7 times higher than prior art. Moreover, the design achieves more than 52% improvement in power efficiency (0.2nJ/pixel) over previous works.

REFERENCES
[1] Joint Collaborative Team on Video Coding, “High Efficiency Video Coding (HEVC) text specification draft 10,” JVT-G050, Jan. 2013.
[2] K. Ugur, A. Alshin, E. Alshina, F. Bossen, W. Han, J. Park, and J. Lainema, “Motion Compensated Prediction and Interpolation Filter Design in H.265/HEVC,” IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 6, pp. 946-956, July 2013.
[3] T.-C. Chen, Y.-H. Chen, C.-Y. Tsai, and L.-G. Chen, “Low power and power aware fractional motion estimation of H.264/AVC for mobile applications,” in Proc. IEEE ISCAS, May 2006, pp. 5331-5334.
[4] C.-Y. Kao, C.-L. Wu, and Y.-L. Lin, “A High-Performance Three-Engine Architecture for H.264/AVC Fractional Motion Estimation,” IEEE Trans. VLSI Syst., vol. 18, no. 4, pp. 662-666, April 2010.
[5] Y.-J. Wang, C.-C. Cheng, and T.-S. Chang, “A Fast Algorithm and Its VLSI Architecture for Fractional Motion Estimation for H.264/MPEG-4 AVC Video Coding,” IEEE Trans. CSVT, vol. 17, no. 5, April 2007.
[6] Y.-K. Lin, C.-C. Lin, T.-Y. Kuo, and T.-S. Chang, “A Hardware-Efficient H.264/AVC Motion-Estimation Design for High-Definition Video,” IEEE Trans. Circuits and Syst. I, Reg. Papers, vol. 55, no. 6, July 2008.
[7] L.-F. Ding, W.-Y. Chen, P.-K. Tsung, T.-D. Chuang, H.-K. Chiu, Y.-H. Chen, P.-H. Hsiao, S.-Y. Chien, T.-C. Chen, P.-C. Lin, C.-Y. Chang, W.-L. Chen, and L.-G. Chen, “A 212 MPixels/s 4096x2160p multiview video encoder chip for 3D/quad HDTV applications,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2009, pp. 154-155.
[8] P.-K. Tsung, W.-Y. Chen, L.-F. Ding, C.-Y. Tsai, T.-D. Chuang, and L.-G. Chen, “Single-iteration full-search fractional motion estimation for quad full HD H.264/AVC encoding,” in Proc. ICME, 2009, pp. 9-12.
[9] Y.-H. Chen, T.-C. Chen, C.-Y. Tsai, S.-F. Tsai, S.-Y. Chien, and L.-G. Chen, “Algorithm and Architecture Design of Power-Oriented H.264/AVC Baseline Profile Encoder for Portable Devices,” IEEE Trans. CSVT, vol. 19, no. 8, pp. 1118-1128, April 2009.
[10] G. Kim, J. Kim, and C.-M. Kyung, “Low cost single-pass fractional motion estimation architecture using bit clipping for H.264/AVC codec,” in Proc. ICME, 2011, pp. 661-666.
[11] G. He, D. Zhou, Z. Chen, T. Zhang, and S. Goto, “A 995Mpixels/s 0.2nJ/pixel fractional motion estimation architecture in HEVC for Ultra-HD,” in Proc. A-SSCC, 2013, pp. 301-304.
[12] H. Li, Y. Zhang, and H. Chao, “An optimally scalable and cost-effective fractional-pixel motion estimation algorithm for HEVC,” in Proc. ICASSP, 2013, pp. 1399-1403.
[13] T. Sotetsumoto, T. Song, and T. Shimamoto, “Low complexity algorithm for sub-pixel motion estimation of HEVC,” in Proc. ICSPCC, pp. 1-4.
[14] S.-Y. Jou and T.-S. Chang, “Fast prediction unit selection for HEVC fractional pel motion estimation design,” in Proc. IEEE SiPS, 2013, pp. 247-250.
[15] S. F. Tsai, C. T. Li, H. H. Chen, P. K. Tsung, K. Y. Chen, and L. G. Chen, “A 1062Mpixels/s 8192x4320p High Efficiency Video Coding (H.265) Encoder Chip,” Sym. on VLSI Circuits (VLSIC), 2013, pp. 188-189.
[16] D. Zhou and P. Liu, “A Hardware-Efficient Dual-Standard VLSI Architecture for MC Interpolation in AVS and H.264,” in Proc. IEEE ISCAS, 2007, pp. 2910-2913.
[17] Z. Guo, D. Zhou, and S. Goto, “An Optimized MC Interpolation Architecture For HEVC,” in Proc. IEEE ICASSP, 2012, pp. 1117-1120.