
A HIGH-PERFORMANCE CABAC ENCODER ARCHITECTURE FOR HEVC AND H.264/AVC

Jinjia Zhou, Dajiang Zhou, Wei Fei and Satoshi Goto

Graduate School of Information, Production and Systems, Waseda University, 2-7 Hibikino, Kitakyushu 808-0135, Japan. E-mail: [email protected]

ABSTRACT

This paper presents a high-performance context-adaptive binary arithmetic coding (CABAC) architecture for next-generation UHDTV applications. Its maximum throughput is enhanced by 31% to 34% with the proposed pre-normalization (prenorm.), hybrid path coverage (HPC), bypass bin splitting (BPBS) and state dual-transition (SDT) schemes. Both the HEVC and H.264/AVC formats are supported by applying a dual-standard binarization design. The proposed CABAC architecture has been silicon-proven in a 65 nm video encoder chip. It delivers 4.27 to 4.40 bins/cycle with synthesized and measured clock rates of 401.5 MHz and 330 MHz, respectively. A high performance of 1.452 Gbin/s is therefore achieved for real-time UHDTV encoding.

Index Terms: CABAC, entropy coding, HEVC, UHDTV

1. INTRODUCTION

Next-generation video codecs are expected to support the 2160p and 4320p UHDTV formats, which have 4 to 32 times as many pixels per frame as today's 1080p and 720p HDTV. To store and transmit the huge volume of UHDTV data, efficient video coding is essential. The next-generation standard, High Efficiency Video Coding (HEVC) [1], is currently being developed by the Joint Collaborative Team on Video Coding (JCT-VC). It is expected to deliver up to 50% higher coding efficiency than its predecessor, H.264/AVC [2]. HEVC uses several new tools to improve coding efficiency, including larger block and transform sizes, additional loop filters, and highly adaptive entropy coding.

Context-adaptive binary arithmetic coding (CABAC) [3] is a form of entropy coding used in H.264/AVC and also in HEVC. While CABAC provides high coding efficiency, its critical bin-to-bin data dependencies make it a throughput bottleneck for the whole video encoder. HEVC introduces the new concept of the "entropy slice", which partitions slices into smaller entropy slices that can be processed independently. This allows multiple entropy slices to be processed in parallel to improve the throughput. However, slice parallelism requires a large memory to buffer all the data of the parallel slices, including the huge context table, and therefore suffers from memory bandwidth overhead [4], system latency and large area. Several works have addressed these problems for CABAC decoding [5], [6]. In this design, we aim at a fast single-core CABAC without slice parallelism.

Many single-core CABAC implementations have been proposed. Liu [7] applies one-bin binary arithmetic encoding (BAE) with optimized context memory access to realize a hazard-free pipelined architecture. Based on one-bin BAE, the designs of Liu [8], Chen [9], and Kuo [10] process 2 bins in parallel. Chen [11] further increases the parallelism to 4 bins by optimizing a four-stage pipeline to shorten the critical path of each symbol.

We propose to further improve the performance of the 4-bins/cycle architecture. Firstly, the throughput of the BAE is optimized by three proposals: pre-renormalization (prenorm), hybrid path coverage (HPC), and bypass bin splitting (BPBS). Secondly, a state dual-transition (SDT) based quaternary context modeling is developed to resolve the context dependency so that the critical path of context modeling can be shortened. Finally, by applying a dual-standard binarization, both the HEVC and H.264/AVC formats are supported by our architecture.

The rest of this paper is organized as follows. Section 2 presents the proposed architectures for binary arithmetic coding, context modeling, and binarization. Section 3 shows the implementation results and compares this work with state-of-the-art designs. The conclusion is given in Section 4.

2. PROPOSED ARCHITECTURE

978-1-4799-2341-0/13/$31.00 ©2013 IEEE


Context-adaptive binary arithmetic coding (CABAC) involves three main functions: binarization, context modeling, and arithmetic coding, as shown in Fig. 6. Binarization maps the syntax elements to binary symbols (bins). Context modeling estimates the probability of the bins. Finally, arithmetic coding compresses the bins to bits based on the estimated probability. To enhance the throughput, our design processes 4 bins in parallel. However, the increased parallelism lengthens the critical path. The following three sections describe the detailed design of each part, which shortens the critical path and increases the throughput.

2.1. Binary Arithmetic Encoding

The bottleneck of CABAC lies in binary arithmetic encoding (BAE). To optimize BAE, three proposals are applied: pre-normalization (prenorm.), hybrid path coverage (HPC), and bypass bin splitting (BPBS).

2.1.1. Pre-renormalization (prenorm.)

A pipelined multi-bin BAE is commonly applied, as shown in Fig. 1. Its critical path lies in stage 2, where multiple range update units are connected in series. Figure 2 shows a single range update unit in detail. After a 4-1 LUT, the critical delay is determined jointly by two paths: a complex LPS renormalization, and an addition followed by a simple MPS renormalization. The result of the MPS renormalization is either rMPS or rMPS << 1. The optimized LPS renormalization [8], [12] needs a find-first-one (FFO) circuit and a selective shifter, and therefore remains critical. To reduce the delay of the range update unit, a pre-normalization (prenorm) architecture is proposed. As shown in Fig. 3, for LPS

ICIP 2013

[Figure residue: block diagrams of the pipelined multi-bin BAE and of a single range update unit. Recoverable labels: Stage 1 (rLPStab lookups), Stage 2 (range update units), Stage 3 (low update units), bit pack; FF, range, LUT, updated range, MPS.]
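The conventional range update unit described in Section 2.1.1 can be sketched in software as follows. This is a minimal behavioural sketch of the baseline unit, not the proposed prenorm architecture, and it assumes the 9-bit CABAC range register of H.264/AVC and HEVC, with the range kept at or above 256 after renormalization. The function names `range_update` and `range_update_chain` and the rLPS candidate values in the usage note are hypothetical, chosen only for illustration. Each bin supplies the four rLPS candidates already read from rLPStab in stage 1; the 4-1 selection uses bits 7:6 of the current range.

```python
RANGE_BITS = 9                       # assumed width of the CABAC range register
RANGE_MIN = 1 << (RANGE_BITS - 1)    # 256: lower bound of a normalized range


def range_update(rng, r_lps_candidates, is_mps):
    """Update the range for one bin; return (new_range, shift).

    r_lps_candidates holds the four rLPS values looked up for this bin's
    context state; the current range selects one of them (the 4-1 LUT).
    """
    q_idx = (rng >> 6) & 3           # quantized range index, bits 7:6
    r_lps = r_lps_candidates[q_idx]
    if is_mps:
        # MPS path: a subtraction followed by a simple renormalization.
        # The result is either rMPS or rMPS << 1.
        r_mps = rng - r_lps
        shift = 1 if r_mps < RANGE_MIN else 0
        return r_mps << shift, shift
    # LPS path: find the leading one of rLPS, then shift it up to bit 8.
    shift = RANGE_BITS - r_lps.bit_length()
    return r_lps << shift, shift


def range_update_chain(rng, bins):
    """Apply range_update serially to several bins, as in stage 2.

    bins is a list of (r_lps_candidates, is_mps) pairs. Each update needs
    the range produced by the previous one, which is the serial dependency
    that forms the critical path. The shifts are returned because the low
    update stage needs them.
    """
    shifts = []
    for r_lps_candidates, is_mps in bins:
        rng, shift = range_update(rng, r_lps_candidates, is_mps)
        shifts.append(shift)
    return rng, shifts
```

For example, with the illustrative candidate rows `(128, 176, 208, 240)` and `(6, 7, 8, 9)`, calling `range_update_chain(510, [((128, 176, 208, 240), True), ((128, 176, 208, 240), True), ((6, 7, 8, 9), False)])` steps the range through 270, 284 and 384 with shifts 0, 1 and 6. The LPS bin needs the large variable shift, which is why its renormalization is the more complex of the two paths.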
