IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 12, NO. 3, APRIL 2010
A Multitransform Architecture for H.264/AVC High-Profile Coders
Woong Hwangbo and Chong-Min Kyung, Fellow, IEEE
Abstract—This paper presents a high-throughput, cost-effective implementation of six different integer transforms in H.264/AVC high-profile coders, i.e., the 4×4 forward, 4×4 inverse, forward Hadamard, inverse Hadamard, 8×8 forward, and 8×8 inverse transforms, all integrated as shared hardware. The 4×4 transform matrices are regularized by using permutation, partitioned into 2×2 blocks, and factored for maximal hardware sharing. By using the two types of 4×4 transform matrices included in an 8×8 transform matrix, the two different 8×8 transforms are both described as three steps and unified with minor modification. To improve the throughput of the transform, two independent 4×4 transform blocks within the 8×8 transform block operate in parallel in the 4×4 transform mode, while a two-stage pipelined architecture is used in the 8×8 transform mode. Using 0.18-µm CMOS technology, the maximum operating frequency of the proposed multitransform architecture is 200 MHz, which achieves a throughput of 4.1 Gpixels/sec at a hardware cost of 63618 gates. Compared with existing designs, the proposed design delivers at least 54% higher throughput and a 38% higher throughput/area ratio in Adaptive Block-size Transform (ABT) mode.
TABLE I THROUGHPUT REQUIREMENT FOR VARIOUS VIDEO SIZES (QP = 24). A TOTAL OF 50 FRAMES WITH 4:2:0 YUV FORMAT AND A FRAME RATE OF 50 frames/sec ARE USED
Index Terms—DCT, H.264/AVC, Hadamard transform, IDCT, integer transform, VLSI design.
I. INTRODUCTION
H.264/AVC is the state-of-the-art video coding standard, achieving significant improvement in video compression performance [1]. To quickly compress video data in the spatial domain, H.264/AVC employs 4×4 integer transforms which use only integer arithmetic without any multiplications, with coefficients that allow 16-bit arithmetic computation [2]. A small block-size transform tends to reduce the computational complexity and ringing artifacts. However, for high-quality video, a large block-size transform must be used not only to preserve fine details of the image but also to obtain better energy compaction [3]. The high profile in the H.264/AVC Fidelity Range Extension (FRExt) [4], an amendment to the H.264 standard, includes an 8×8 integer transform and allows the encoder to adaptively choose between the 4×4 and 8×8 transforms for luma samples on an MB level, which is called adaptive block-size transform (ABT).

Manuscript received February 22, 2009; revised November 05, 2009. First published January 26, 2010; current version published March 17, 2010. This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (No. 2009-0080188). The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Ketan Mayer-Patel. The authors are with the Department of Electrical Engineering, KAIST, Daejeon 305-701, Korea (e-mail:
[email protected];
[email protected]. kr). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TMM.2010.2041099
Fig. 1. (a) 4×4 transform flow with four different 4×4 transforms. (b) 8×8 transform flow only for luma samples.
The transforms in H.264/AVC require a high data throughput rate for real-time processing of high-resolution video formats like HD 1080p (1920×1080). Moreover, the mode decision block in the H.264 encoder uses ABT iteratively, which further increases the required data throughput. Table I shows the throughput requirements for some example frame sizes obtained from the H.264/AVC reference software JM14.0. The test video is “Crowd Run”. In the JM14.0 reference software, we set QP = 24, high profile, level 5.1, an “IPPP..” GOP structure, fast full search motion estimation, a single reference frame, and SAD as the mode decision metric without rate-distortion optimization (RDO). The number of tested frames is 50 and the frame rate is 50 frames/sec. Fig. 1 shows the various transforms in the H.264/AVC encoding system. For luma residual input, the H.264/AVC encoder selects the transform flow between the 4×4 flow in Fig. 1(a) and the 8×8
1520-9210/$26.00 © 2010 IEEE Authorized licensed use limited to: Korea Advanced Institute of Science and Technology. Downloaded on March 17,2010 at 01:08:33 EDT from IEEE Xplore. Restrictions apply.
flow in Fig. 1(b). For chroma residual input, the H.264/AVC encoder performs the 4×4 transform flow only. There are four types of 4×4 transform, i.e., the forward, inverse, forward Hadamard, and inverse Hadamard transforms, and two types of 8×8 transform, i.e., the forward and inverse transforms. This paper describes how the 4×4 and 8×8 transforms of the H.264/AVC encoder can be modified such that they are implemented as one hardware block by maximally sharing common operations while satisfying the throughput requirement of real-time processing and reducing hardware cost.

For early-stage H.264/AVC such as the baseline or main profile, researchers mainly focused on developing fast algorithms for the 4×4 transforms [5] and their implementation to improve performance with minimal area overhead [6]–[11]. With the advent of the H.264/AVC high profile, implementing 8×8 transforms and unifying the 8×8 and 4×4 transforms have become very important. A fast 8×8 transform algorithm using the Kronecker product and direct sum is described in [12]. Hardware architectures shared between the 8×8 and 4×4 transforms are described in [13]–[15]. In [15], a transform architecture to support RDO mode decision is also proposed. A unified architecture of the forward and inverse transforms is presented in [16]. Moreover, some architectures to support multistandard video applications with adaptive block-size transform (8×8 and 4×4) are proposed in [17] and [18]. However, the throughput values of these architectures are not sufficient to satisfy the real-time requirement of the unified transform in the HD 2160p system. Only the proposed architecture satisfies the requirement of the HD 2160p system, as will be shown in Section VI.

The rest of this paper is organized as follows. In Section II, we briefly review the equations of the four different 4×4 integer transforms and the 8×8 integer transforms. The proposed 4×4 transform algorithm and implementation are described in Section III. In Section IV, we present the 8×8 transform algorithm including the 4×4 transforms.
The unified multitransform architecture (MTA) supporting all six kinds of integer transforms is described in Section V. Section VI discusses the results of synthesis and evaluation in comparison with previous works, followed by conclusions in Section VII.

II. INTEGER TRANSFORM ALGORITHMS

A. 4×4 Integer Transforms
The 4×4 forward and inverse transforms are defined as in (1), where the input to the forward transform is a 4×4 residual block and the input to the inverse transform is an inversely quantized 4×4 block. The forward and inverse transform matrices are given in (2). The 4×4 forward and inverse transforms are applied to all 4×4 input blocks regardless of the type of block, i.e., luma or chroma (Cb or Cr), and the prediction mode, i.e., intra or inter mode.

The forward and inverse Hadamard transforms are defined as in (3), where the input to the forward Hadamard transform is the 4×4 block comprised of the dc components from each of the 16 4×4 submacroblocks, and the input to the inverse Hadamard transform is a quantized 4×4 DC block. The Hadamard transform matrix is given in (4). The Hadamard transforms are applied only when a macroblock is encoded in 16×16 intra prediction mode.

B. 8×8 Integer Transforms
The 8×8 forward and inverse transforms are defined as in (5), where the input to the forward transform is an 8×8 residual block and the input to the inverse transform is an inversely quantized 8×8 block. The 8×8 transform matrix is given in (6). The 8×8 transforms are applied only to luma blocks.

III. 4×4 INTEGER TRANSFORM CODING
In this section, we describe the 4×4 inverse transform coding based on permutation and matrix factorization so that the 4×4 forward and (forward and inverse) Hadamard transforms are derived from the 4×4 inverse transform with a minor modification. The integration of the four 4×4 transforms is also addressed in this section.

A. 4×4 Inverse Transform
The 4×4 inverse transform matrix can be regularized by two permutation matrices [5], as in (7).
Pre- and post-multiplying the inverse transform matrix by the two permutation matrices of (7), respectively, and partitioning the result into 2×2 blocks, it follows that (8) and (9) hold, with the 2×2 blocks defined in (10). It is to be noted that (10) is accompanied by an identity involving the 4×4 identity matrix. Pre- and post-multiplying once more gives the form in (11), which can then be factored as in (12), where the factors are given in (13) in terms of the 2×2 identity matrix and the 2×2 null matrix. A further matrix is defined in (14) by pre- and post-multiplication; because of the identity noted above, it can be expressed as the product in (15). By using (12) and (15) in (11), we obtain (16). Then, we can rewrite the inverse transform (1) using (16) as (17). Since the matrices involved satisfy the symmetry relations noted above, it follows that (18) holds.

Fig. 2 shows the sequence of the proposed inverse transform. The inverse transform can now be carried out by the following six steps, among which four steps (Step1, 3, 5, and 6) are simple permutations:
1) Step1, 3, 5, and 6: Permutation. These four steps are all implemented as pure hard-wired interconnection, i.e., without any arithmetic logic.
2) Step2: Block multiplication. Partitioning the operands into 2×2 blocks, we compute the result through block multiplication as in (19).
3) Step4: Block multiplication. Partitioning the operands into 2×2 blocks, the result is again obtained through block multiplication, as in (20).
Equation (20) has the same form as (19) in Step2 except for the matrix that is used. Thus, we can reuse the Step2 block multiplication to calculate (20) by substituting that matrix in (19).

Fig. 2. Block diagram of the proposed inverse transform consisting of six steps.

B. 4×4 Forward Transform
In (2), the forward transform matrix can be expressed by the inverse transform matrix and an additional matrix S as in (21), where S is given in (22).
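The relation between the forward and inverse matrices described around (21) and (22) can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses the widely published H.264/AVC 4×4 core transform matrices rather than the paper's own equations, the names `C_F`, `C_I`, `matmul`, and `scale_rows_by_shift` are chosen here, and the diagonal form of `S` is inferred from the statement that S equals the identity matrix except for a scaling factor of 2.

```python
from fractions import Fraction as F

# Widely published H.264/AVC 4x4 core transform matrices (assumed here to
# match equation (2); the names C_F and C_I are chosen for this sketch).
C_F = [[1, 1, 1, 1],
       [2, 1, -1, -2],
       [1, -1, -1, 1],
       [1, -2, 2, -1]]
C_I = [[F(1), F(1), F(1), F(1)],
       [F(1), F(1, 2), F(-1, 2), F(-1)],
       [F(1), F(-1), F(-1), F(1)],
       [F(1, 2), F(-1), F(1), F(-1, 2)]]

# Assumed form of the scaling matrix S: identity except for factors of 2.
S = [[1, 0, 0, 0],
     [0, 2, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 2]]


def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def scale_rows_by_shift(m):
    """Apply S to an integer matrix using only 1-bit left shifts on odd rows."""
    return [[v << 1 if i % 2 else v for v in row] for i, row in enumerate(m)]


# Arbitrary integer test block (illustrative data only).
block = [[5, -3, 0, 7], [1, 2, -4, 6], [-8, 0, 3, 1], [2, -2, 9, -5]]
print(matmul(S, C_I) == C_F)                           # S times C_I gives C_F
print(scale_rows_by_shift(block) == matmul(S, block))  # shifts implement S
```

Because multiplying by this `S` only doubles two rows, it needs no adders, which is consistent with the text's remark that the scaling is a left-shift operation realized in the interconnection.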
By using (16) in (21), we obtain (23). Then, the forward transform can be rewritten as (24). Fig. 3 shows the sequence of the proposed forward transform. Similar to the inverse transform, the forward transform is carried out in six steps. As Step2, 3, 4, and 5 in sequence in Fig. 3 are the same as Step4, 3, 2, and 5 in sequence in Fig. 2, we can reuse them as common blocks when integrating the 4×4 forward and inverse transforms. Like the other permutations, Step1 is implemented as mere hard-wired interconnection. In Step6, the matrix S in (22) is the same as the 4×4 identity matrix except for a scaling factor of 2, which is simply a left-shift operation. Thus, Step6 is also implemented as hard-wired interconnection, which will be shown in the next subsection.

Fig. 3. Block diagram of the proposed forward transform consisting of six steps.

C. 4×4 Hadamard Transform
Applying the same process as for the 4×4 inverse transform, the Hadamard transform matrix can be expanded as in (25). Then, the forward and inverse Hadamard transforms can be rewritten as (26) and (27). Since (26) and (27) have the same form as the inverse transform (18) except that the Hadamard transform matrix is used instead of the inverse transform matrix, the Hadamard transforms can be carried out by the same procedure as the inverse transform with a minor modification.

D. 4×4 Multitransform Architecture
Fig. 4(a) shows the sequences of the four different 4×4 transforms based on the proposed algorithm. There is a common sequence among the four transforms, i.e., from Step2 to Step5 in Fig. 4(a), which is merged into a 4×4 MTA core as shown in Fig. 4(b). The 4×4 MTA core is designed to process a 4×4 block within two clock cycles. Execution in the odd and even clock cycles is named Phase1 and Phase2, respectively. In Phase1, Step2 and Step3 are performed, followed by Step4 and Step5 in Phase2. A feedback path for the two-phase implementation is enclosed within the 4×4 MTA core. The two different block multiplications, i.e., Step2 and Step4 in Fig. 4(a), can be merged into one block (the “Block multiplication” block in the 4×4 MTA core) as they do not occur simultaneously. Likewise, the permutation processes [Step3 and Step5 in Fig. 4(a)] are merged into one permutation block (the “P permutation” block in the 4×4 MTA core). The remaining blocks [Step1 and Step6 in Fig. 4(a)] are merged into the input and output interconnection blocks as shown in Fig. 4(b).

The proposed 4×4 MTA core is shown in Fig. 5. This architecture is composed of four processing elements (PEs), 16 multiplexers, a P permutation block, and four register blocks.
1) Sixteen multiplexers between the input ports and the four PEs determine the input to the PEs according to the phase. In Phase1, the multiplexer controller (MC) selects the input ports as the input of the PEs. In Phase2, the MC selects the output ports of the PEs, through the feedback path, as the input.
2) Four processing elements are used to calculate the block multiplications of Step2 and Step4 in Fig. 4(a). Each PE is composed of two-stage butterfly adders with shift operation as illustrated in Fig. 6. The PEs operate differently according to the phase and transform type. In Phase1, the MC in Fig. 6 selects input 0 for the forward transform and input 1 for the inverse transform. In Phase2, the MC selects input 1 for the forward transform and input 0 for the inverse transform. On the other hand, the MC always selects input 0 when the transform type is the forward or inverse Hadamard transform, regardless of the phase. Thus, the PEs compute one of the four Step2 operations in Fig. 6(a) in Phase1, and one of the four Step4 operations in Phase2. It is to be noted that, as Fig. 6(a) is also the 2×2 Hadamard transform for chroma dc components, that transform can be implemented as a part of the 4×4 transform.
3) The P permutation block uses a wiring network to implement Step3 and Step5.
4) Four register blocks temporarily store the result of the P permutation. In Phase1, the stored data enters the PEs again along the feedback path, while the data enters the output interconnection block in Phase2.
To perform all six steps for a transform, appropriate input and output (I/O) interconnection needs to be done depending on the
Fig. 4. (a) Block diagram of the sequence of operations for four 4×4 transforms. (b) Proposed 4×4 MTA to implement the four transforms on a common hardware platform.

Fig. 5. First-level details of the proposed MTA core for performing input multiplexing, block multiplication, and P permutation. Step2 and 4 are merged, as are Step3 and 5. MC denotes the multiplexer controller. Because 16 output coefficients are output every two cycles, the processing rate is eight pixels/cycle.

Fig. 6. Second-level details of the 2×2 components in the MTA core. (a)–(d) The four PEs. Each PE corresponds to one of the 2×2 elements of block multiplication in (24), (26), (37), and (43). MC denotes the multiplexer controller. (a) is also the 2×2 Hadamard transform for chroma dc components.
type of transforms. Fig. 7(a) shows the complete 4×4 multitransform architecture including the I/O interconnection blocks. The input interconnection block is composed of four permutation blocks and one multiplexer to choose an appropriate input to be processed. The output interconnection block is composed of three permutation blocks and an S multiplication block. As the matrix S in (22) is a scaling matrix without permutation, the S multiplication is implemented as in Fig. 7(b).

Fig. 7. (a) Complete 4×4 multitransform architecture including I/O interconnection blocks. (b) S multiplication block in the output interconnection of the 4×4 forward transform. The processing rate of the 4×4 MTA core is eight pixels/cycle. All 16 coefficients of the selected input among the four inputs must be prepared simultaneously.

IV. 8×8 INTEGER TRANSFORM CODING
In this section, we describe the 8×8 inverse transform coding based on the extended transform and block multiplication so that the 8×8 forward transform is derived from the 8×8 inverse transform with a minor modification and the 4×4 transforms are included in the 8×8 transform.

A. Extended Transform
Extended transform [19] means that a smaller transform is a part of a larger transform. Taking the 4×4 and 8×8 integer transforms in H.264/AVC as an example, the relation between them can be described as in (28), where (29) and (30) define an 8×8 permutation matrix and a butterfly matrix B. The two 4×4 transform matrices Ce and Co are the integer forms of the II-type and IV-type DCT (discrete cosine transform) [19], respectively. It is to be noted that Ce corresponds to the 4×4 inverse transform matrix in (2).

B. 8×8 Inverse Transform
Defining a new matrix, we obtain (31). Then, we can rewrite the 8×8 transform matrix using (28) as (32). Applying (32) to the 8×8 inverse transform (5), we obtain (33). Fig. 8 shows the sequence of the proposed 8×8 inverse transform. The 8×8 inverse transform is carried out by the following three steps:
1) Permutation: As permutation means reordering elements in an 8×8 block, this step is implemented as hard-wired interconnection, i.e., without any arithmetic logic.
2) Transform: Partitioning the permuted block into 4×4 blocks, we compute the result as four different kinds of 4×4 transforms, as in (34). As Ce is equal to the 4×4 inverse transform matrix, the first component is exactly the same as the 4×4 inverse transform. Therefore, we can reuse the 4×4 MTA to compute it.

Fig. 8. Block diagram of the proposed 8×8 inverse transform consisting of three steps. IQ denotes inverse quantization.
Fig. 9. (a) Direct implementation of Co transform. (b) Co butterfly unit. (c) Two-cycle implementation of Co transform.

Fig. 10. (a) CeCo transform block including the 4×4 inverse transform. (b) Ce butterfly unit.
The other three components in (34) can be processed by the conventional row-column approach with 1-D transform and transposition presented in the H.264/AVC standard [20]. By using algebraic rules for the transpose together with a matrix identity, they can be rewritten as in (35), (36), and (37). Each of the three 4×4 transforms can be computed by applying the one-dimensional (1-D) transform twice. Taking the component named the Co transform as an example, Fig. 9(a) shows a direct implementation of the Co transform, and Fig. 9(b) shows the Co butterfly unit. As a Co butterfly unit can process four pixels at a time, four Co butterfly units are needed to process a 4×4 block. By sharing the 1-D transform unit and transpose register, we obtain the two-cycle implementation of the Co transform shown in Fig. 9(c). Likewise, the remaining two components, named the CeCo transform, are implemented as shown in Fig. 10(a). As they use both the Ce and Co butterfly units, a cross-feedback path is enclosed in the CeCo transform block. In Fig. 10(a), the shaded box with the dotted-line feedback path indicates an additional 4×4 inverse transform block. Applying the 4×4 inverse transform to its input block gives (38). Equation (38) means that the 4×4 inverse transform can be implemented by applying the 1-D Ce transform twice with transposition, which corresponds to the shaded box in Fig. 10(a). It can be used with the 4×4 MTA in parallel to further improve the throughput of the 4×4 transforms. Moreover, the 4×4 forward transform is also merged into Fig. 10(a), which will be described in the next subsection.
3) Block Multiplication: Partitioning the butterfly matrix B into 4×4 blocks yields (39), whose blocks are the 4×4 identity matrix and the 4×4 permutation matrix given in (40). Partitioning the input also into 4×4 blocks, we obtain the result through block multiplication as in (41). Fig. 11(a) shows the signal flow diagram and Fig. 11(b) shows its two-cycle implementation. Input multiplexers, registers, and feedback paths are used to share adders as shown in Fig. 11(b).

Fig. 11. (a) Signal flow of B block multiplication. (b) Two-cycle implementation.
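The idea that an 8-point integer transform splits into a butterfly followed by two 4-point transforms can be illustrated with a short numerical check. The sketch below is not the paper's factorization in (28) to (41): it applies the widely published H.264/AVC 8×8 integer transform matrix to a 1-D vector, uses the row symmetry of that matrix, and the names `CE`, `CO`, `transform8_direct`, and `transform8_split` are assumptions chosen to mirror the Ce and Co labels in the figure captions.

```python
from fractions import Fraction as F

# Widely published H.264/AVC 8x8 integer transform matrix, scaled by 8.
C8_TIMES_8 = [
    [8, 8, 8, 8, 8, 8, 8, 8],
    [12, 10, 6, 3, -3, -6, -10, -12],
    [8, 4, -4, -8, -8, -4, 4, 8],
    [10, -3, -12, -6, 6, 12, 3, -10],
    [8, -8, -8, 8, 8, -8, -8, 8],
    [6, -12, 3, 10, -10, -3, 12, -6],
    [4, -8, 8, -4, -4, 8, -8, 4],
    [3, -6, 10, -12, 12, -10, 6, -3],
]
C8 = [[F(v, 8) for v in row] for row in C8_TIMES_8]

# Left halves of the even and odd rows act as two 4x4 transforms
# (called CE and CO here by assumption).
CE = [C8[k][:4] for k in (0, 2, 4, 6)]
CO = [C8[k][:4] for k in (1, 3, 5, 7)]


def matvec(m, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transform8_direct(x):
    """Reference result: multiply the full 8x8 matrix by the input vector."""
    return matvec(C8, x)


def transform8_split(x):
    """Butterfly stage, then two independent 4-point transforms."""
    lo, hi = x[:4], x[7:3:-1]                # hi holds x7, x6, x5, x4
    sums = [a + b for a, b in zip(lo, hi)]   # feeds the even (CE) rows
    diffs = [a - b for a, b in zip(lo, hi)]  # feeds the odd (CO) rows
    out = [0] * 8
    out[0::2] = matvec(CE, sums)
    out[1::2] = matvec(CO, diffs)
    return out


x = [3, -1, 4, 1, -5, 9, 2, -6]              # arbitrary test vector
print(transform8_split(x) == transform8_direct(x))
```

In this sketch the sum and difference stage plays the role of a butterfly, the interleaving of even and odd outputs plays the role of a permutation, and `CE` comes out equal to the standard 4×4 inverse transform matrix, which is why a 4×4 transform block can be reused inside an 8×8 datapath.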
Fig. 12. Block diagram of the proposed 8×8 forward transform consisting of three steps. Q denotes quantization.

C. 8×8 Forward Transform
The 8×8 forward transform can be expanded as in (42), using a process similar to that of the 8×8 inverse transform. Fig. 12 shows the sequence of the proposed 8×8 forward transform. As Step1 in Fig. 12 is the same as Step3 in Fig. 8 except for the position of the transpose, Step1 can be implemented by reusing the B multiplication block in Fig. 11(b). Step3 is the permutation, which can be implemented as hard-wired interconnection. In Step2, we obtain the four different kinds of 4×4 transforms in (43) by applying 4×4 block partitioning. As in the inverse case, the first component can be expanded by the same procedure as the 4×4 transforms, giving (44). Equation (44) is the same as (24) in the 4×4 forward transform except that S in (24) is removed in (44). Thus, we can reuse the 4×4 MTA to compute the first component by bypassing the S multiplication block in Fig. 7.

Fig. 13(a) shows the CeCo transform block for computing the remaining components. The Ce butterfly unit in Fig. 10(b) is replaced by the butterfly unit shown in Fig. 13(b). As the 4×4 forward transform matrix is equal to the product of S and the 4×4 inverse transform matrix, it is implemented by selecting the multiplexer terminal 0 in Fig. 13(b). Thus, the 4×4 forward transform can also be implemented along the dotted-line feedback path in Fig. 13(a). To compute the 8×8 forward and inverse transforms in one transform architecture, the two Ce butterfly units are unified as shown in Fig. 14. As it also includes the 4×4 forward transform matrix, the CeCo transform block in Fig. 13(a) can process the 8×8 forward, 8×8 inverse, 4×4 forward, and 4×4 inverse transforms by using the unified butterfly unit in Fig. 14.

Fig. 13. (a) CeCo transform block including the 4×4 forward transform. (b) Ce butterfly unit including the 4×4 forward transform matrix (the product of S and the inverse transform matrix).

Fig. 14. Butterfly unit unifying the two Ce butterfly units.

V. MULTITRANSFORM ARCHITECTURE UNIFYING 8×8 AND 4×4 INTEGER TRANSFORMS
Fig. 15 shows the proposed MTA supporting six different kinds of transforms for the H.264/AVC high profile encoder. The MTA is composed of a B block multiplication block, four 4×4 transform blocks, two permutation blocks, and multiplexers. The B block multiplication, Co transform, and permutation blocks are used only for the 8×8 transforms. The 4×4 MTA and CeCo transform blocks are used for both the 4×4 and 8×8 transforms.

Fig. 15. Proposed MTA supporting six different kinds of transforms.

For performing the four 4×4 transforms (4×4 forward, 4×4 inverse, forward Hadamard, and inverse Hadamard), two 4
TABLE II SYNTHESIS RESULTS AND HARDWARE RESOURCE COMPARISON BETWEEN THE SINGLE TRANSFORM AND MULTITRANSFORM DESIGN. EACH TRANSFORM HAS THE SAME OPERATING FREQUENCY OF 200 MHz
FT, IT, FHT, and IHT denote the forward, inverse, forward Hadamard, and inverse Hadamard transform, respectively. ABT denotes adaptive block-size transform with 4×4 and 8×8 block sizes. DPR denotes data processing rate.

Fig. 16. Temporal diagram of two-stage pipelined transform. (a) 8×8 forward transform. (b) 8×8 inverse transform. Each stage takes two clock cycles.

Fig. 17. Block diagram for functional verification of the proposed multitransform hardware using testbench from the JM reference software.
×4 transform blocks, the 4×4 MTA and part of the CeCo transform block, are used to double the throughput compared to using only the 4×4 MTA. Such throughput allows the proposed MTA to process the transforms of HD 2160p video (3840×2160 at 50 frames/sec) in real time; the throughput requirement is described in Table I and is further discussed in Section VI. Unifying the 8×8 forward and inverse transforms is simple because the three functional blocks in each transform are almost the same while only their sequences are reversed, as shown in Fig. 8 and Fig. 12. Multiplexers and feedback paths are used to unify the 8×8 forward and inverse transforms as shown in Fig. 15, in which the dotted-line paths are used when performing the 8×8 inverse transform. Processing an 8×8 block using the MTA takes four clock cycles because the B block multiplication takes two clock cycles and the 4×4 transform block takes two clock cycles. However, by applying two-stage pipelining to the 8×8 transforms as shown in Fig. 16, the throughput can be doubled, i.e., one 8×8 block every two clock cycles.

VI. IMPLEMENTATION AND RESULTS

A. Implementation and Verification
We have implemented the proposed multitransform design and verified its behavior using Verilog RTL simulation, logic synthesis, and gate-level simulation. Fig. 17 shows the simulation environment used to verify the functional behavior of the proposed architecture. Test vectors are obtained by using the H.264/AVC reference software, version JM14.0. After extracting input and output data from the reference software, we applied the input data to the proposed design and compared its result with the output data from the reference software. We synthesized the proposed multitransform design by using Synopsys Design Compiler and the UMC 0.18-µm Faraday standard cell library [21]. In the logic synthesis, a wireload model was used, and the skew, jitter, and transition time of the clock, as well as the I/O external delay, were separately taken into account.

Table II shows the performance and hardware cost of the proposed multitransform design compared with the separate implementation of the six transforms. Timing constraints are identical so that each transform has the same operating frequency of 200 MHz. The single transform design, which is a separate implementation of the four 4×4 transform paths in Fig. 4(a) and the two 8×8 transform paths in Figs. 8 and 12, has the same behavior as the multitransform design and is used as the target for comparison. According to Table II, the proposed MTA has about 51% less area than the single transform design. Table II also shows that the proposed MTA can process 3.2 Gpixels/sec when it processes only 4×4 transforms. Because the MTA includes two 4×4 transform blocks, i.e., the 4×4 MTA and the CeCo transform block, each of which can process a 4×4 block within two clock cycles, the MTA has a data processing rate of 16 pixels/cycle. If the MTA processes only 8×8 transforms, the throughput becomes 6.4 Gpixels/sec.

B. Performance Comparison
When adaptive block-size transform (ABT), which uses the 4×4 and 8×8 transforms jointly, is used, we obtain a throughput of 4.1 Gpixels/sec. It is based on the observation that the ratio of clock cycles spent in the 4×4 mode to those spent in the 8×8 mode is 2.5. This was obtained from Table I, considering that one cycle is required to process a 4×4 block and two cycles are required to process an 8×8 block. Thus, the proposed design allows real-time processing of HD 2160p video (3840×2160 at 50 frames/sec), whose throughput requirement is described in Table I. Table III shows the comparison among various methods in terms of operating frequency, data processing rate, throughput, gate count, and throughput per area. There are three different transform modes, i.e., 4×4, 8×8, and ABT. The results for the 4×4 and 8×8 modes are based on the assumption that each transform hardware performs either 4×4
TABLE III SYNTHESIS RESULTS AND COMPARISON OF THE PROPOSED MTA WITH OTHER REPORTED DESIGNS. ALL ARCHITECTURES ARE DESIGNED AS 2-D TRANSFORM. DPR DENOTES DATA PROCESSING RATE AND MEANS THE NUMBER OF PIXELS TO BE PROCESSED EVERY CLOCK CYCLE. FT, IT, FHT, AND IHT DENOTE THE FORWARD, INVERSE, FORWARD HADAMARD, AND INVERSE HADAMARD TRANSFORM, RESPECTIVELY
Table notes: Assume 2-D transform design by the architecture in Wang [6]. Gate count of the transpose register estimated by Design Compiler is 8821. Gate count of the transpose register estimated by Design Compiler is 11416. Gate count of the on-chip memory estimated by UMC MEMMAKER is 5496. Power consumption of the transpose register estimated by Prime Power is 4.102 mW. Power consumption of the transpose register estimated by Prime Power is 5.374 mW.
or 8×8 mode, while the result for the ABT mode indicates that the 4×4 and 8×8 transform modes are jointly used. Table III shows that the proposed MTA in the 4×4 transform mode is the most efficient in terms of throughput/area ratio among designs supporting all six kinds of transforms, which results from the high operating frequency and the two independent 4×4 transform blocks operating in parallel. In the 8×8 transform mode, the proposed design has the highest throughput and throughput/area ratio. This comes from the high data processing rate and the two-stage pipelined architecture, as well as efficient sharing of sub-blocks when unifying the 8×8 forward and inverse transforms. When the designs are operated in ABT mode, which is the practical operating condition of the transforms, the proposed design has at least 54% higher throughput and a 38% higher throughput/area ratio than the other designs.

After the logic synthesis, we used Synopsys PrimePower to estimate power consumption. When supplied with 1.8 V and operated at 200 MHz, the proposed design consumes about 83.8 mW. Compared to other designs [13], [18], the proposed design has the largest throughput/power ratio. Moreover, as power consumption increases in proportion to operating frequency, the power consumption of the proposed design can be lowered with a lower frame rate or smaller frame size.
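The throughput figures quoted in this section can be reproduced with simple arithmetic. The sketch below is a hedged back-of-the-envelope check, not the authors' evaluation method: it assumes that the ABT figure is a cycle-weighted average of the 4×4 and 8×8 processing rates using the stated cycle ratio of 2.5, and the function and constant names are chosen here for illustration.

```python
CLOCK_HZ = 200_000_000  # operating frequency stated in the text (200 MHz)

# Two 4x4 blocks (16 pixels each) complete every two cycles in 4x4 mode.
PIXELS_PER_CYCLE_4X4 = 2 * 16 / 2
# With two-stage pipelining, one 8x8 block (64 pixels) completes every two cycles.
PIXELS_PER_CYCLE_8X8 = 64 / 2


def abt_pixels_per_cycle(cycle_ratio_4x4_to_8x8):
    """Cycle-weighted average rate (assumed model for the ABT figure)."""
    r = cycle_ratio_4x4_to_8x8
    return (r * PIXELS_PER_CYCLE_4X4 + PIXELS_PER_CYCLE_8X8) / (r + 1)


def throughput_gpixels(pixels_per_cycle, clock_hz=CLOCK_HZ):
    """Convert a per-cycle pixel rate into Gpixels/sec."""
    return pixels_per_cycle * clock_hz / 1e9


print(round(throughput_gpixels(PIXELS_PER_CYCLE_4X4), 1))       # 4x4-only mode
print(round(throughput_gpixels(PIXELS_PER_CYCLE_8X8), 1))       # 8x8-only mode
print(round(throughput_gpixels(abt_pixels_per_cycle(2.5)), 1))  # ABT mode
```

Under these assumptions the three values round to 3.2, 6.4, and 4.1 Gpixels/sec, matching the figures reported for the 4×4, 8×8, and ABT modes.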
VII. CONCLUSION
We proposed a fast and cost-effective algorithm and implementation of the multitransform architecture in H.264/AVC encoders. Four different 4×4 transforms and two 8×8 transforms are integrated on shared hardware by using the extended transform and block multiplication. Comparing the proposed multitransform design with the best previous work, we obtained 54% higher throughput and 38% higher throughput/area.

REFERENCES
[1] N. Kamaci and Y. Altunbasak, “Performance comparison of the emerging H.264 video coding standard with the existing standards,” in Proc. IEEE Int. Conf. Multimedia and Expo, Jul. 2003, pp. 345–348.
[2] H. Malvar, A. Hallapuro, M. Karczewicz, and L. Kerofsky, “Low-complexity transform and quantization in H.264/AVC,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 598–603, 2003.
[3] M. Wien, “Variable block-size transform for H.264/AVC,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 604–613, Jul. 2003.
[4] D. Marpe, T. Wiegand, and S. Gordon, “H.264/MPEG4-AVC fidelity range extensions: Tools, profiles, performance, and application areas,” in Proc. IEEE Int. Conf. Image Processing, Sep. 2005, pp. I-593–I-596.
[5] C. P. Fan, “Fast 2-dimensional 4×4 forward integer transform implementation for H.264/AVC,” IEEE Trans. Circuits Syst. II, vol. 53, no. 3, pp. 174–177, Mar. 2006.
[6] T. C. Wang, Y. W. Huang, H. C. Fang, and L. G. Chen, “Parallel 4×4 2D transform and inverse transform architecture for MPEG-4 AVC/H.264,” in Proc. IEEE Int. Symp. Circuits and Systems, May 2003, pp. 800–803.
[7] Z. Y. Cheng, C. Chen, B. D. Liu, and J. F. Yang, “High throughput 2-D transform architectures for H.264 advanced video coders,” in Proc. IEEE Asia-Pacific Conf. Circuits and Systems, Dec. 2004, pp. 1141–1144.
[8] K. H. Chen, J. I. Guo, and J. S. Wang, “A high-performance direct 2-D transform coding IP design for MPEG-4 AVC/H.264,” IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 4, pp. 472–483, Apr. 2006.
[9] W. Hwangbo, J. Kim, and C. M. Kyung, “A high-performance 2-D inverse transform architecture for the H.264/AVC decoder,” in Proc. IEEE Int. Symp. Circuits and Systems, May 2006, pp. 1613–1616.
[10] P. Chungan, Y. Dunshan, C. Xixin, and S. Shimin, “A new high throughput VLSI architecture for H.264 transform and quantization,” in Proc. Int. Conf. ASIC, Oct. 2007, pp. 950–953.
[11] C. Wei, H. Hui, L. Jinmei, T. Jiarong, and M. Hao, “A high-performance reconfigurable 2-D transform architecture for H.264,” in Proc. IEEE Int. Conf. Electronics, Circuits and Systems, Aug. 2008, pp. 606–609.
[12] C. P. Fan, “Fast 2-dimensional 8×8 integer transform algorithm design for H.264/AVC fidelity range extensions,” IEICE Trans. Inf. Syst., vol. E89-D, pp. 3006–3011, Dec. 2006.
[13] C. P. Fan, “Cost-effective hardware sharing architectures of fast 8×8 and 4×4 integer transforms for H.264/AVC,” in Proc. IEEE Asia Pacific Conf. Circuits and Systems, Dec. 2006, pp. 776–779.
[14] Y. C. Chao, H. H. Tsai, Y. H. Lin, J. F. Yang, and B. D. Liu, “A novel design for computation of all transforms in H.264/AVC decoders,” in Proc. IEEE Int. Conf. Multimedia and Expo, Jul. 2007, pp. 1914–1917.
[15] G. Pastuszak, “Transforms and quantization in the high-throughput H.264/AVC encoder based on advanced mode selection,” in Proc. IEEE Comput. Soc. Annu. Symp. VLSI, Apr. 2008, pp. 203–208.
[16] Y. Li, Y. He, and S. Mei, “A highly parallel joint VLSI architecture for transforms in H.264/AVC,” J. Signal Process. Syst., vol. 50, pp. 19–32, Oct. 2007.
[17] B. Li, D. Zhang, J. Fang, L. Wang, and M. Zhang, “A unified IDCT architecture for multi-standard video codecs,” in Proc. Int. Conf. ASIC, Oct. 2007, pp. 962–965.
[18] C. Y. Huang, L. F. Chen, and Y. K. Lai, “A high-speed 2-D transform architecture with unique kernel for multi-standard video applications,” in Proc. IEEE Int. Symp. Circuits and Systems, May 2008, pp. 21–24.
[19] W. Chen, C. Smith, and S. Pralick, “A fast computational algorithm for the discrete cosine transform,” IEEE Trans. Commun., vol. 25, no. 9, pp. 1004–1009, Sep. 1977.
[20] Advanced Video Coding for Generic Audiovisual Services, ITU-T Recommendation H.264, Std., 2007.
[21] Faraday UMC Standard Library. [Online]. Available: http://www.faraday-tech.com.
Woong Hwangbo received the B.S. degree in electrical engineering from Pusan National University, Busan, Korea, and the M.S. degree in electrical engineering from Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea. He is currently pursuing the Ph.D. degree in the Department of Electrical Engineering and Computer Science at KAIST. His research interests include VLSI design and multimedia application with high performance and low power consumption.
Chong-Min Kyung (S’76–M’81–SM’99–F’08) received the B.S. degree in electronics engineering from Seoul National University, Seoul, Korea, in 1975 and the M.S. and Ph.D. degrees in electrical engineering from Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 1977 and 1981, respectively. From April 1981 to January 1983, he worked at Bell Telephone Laboratories, Murray Hill, NJ, as a postdoc. Since he joined KAIST in 1983, he has been working on System-on-a-Chip design and verification methodology as well as processor and graphics architectures for high-speed and/or low-power applications, including mobile video codec. He is Hynix Chair Professor at KAIST. Dr. Kyung received the Most Excellent Design Award and the Special Feature Award in the University Design Contest in the ASP-DAC 1997 and 1998, respectively. He received the Best Paper Awards in the 36th DAC held in New Orleans, LA; the 10th International Conference on Signal Processing Application and Technology (ICSPAT), Orlando, FL, in September 1999; and the 1999 International Conference on Computer Design (ICCD), Austin, TX. He was General Chair of Asian Solid-State Circuits Conference (A-SSCC) 2007, and ASP-DAC 2008. In 2000, he received a National Medal from the Korean government for his contribution to research and education in IC design. He is a member of the National Academy of Engineering Korea (NAEK) and the Korean Academy of Science and Technology (KAST).