International Journal of Computer Science & Communication Vol. 1, No. 1, January-June 2010, pp. 43-50
Design and FPGA Implementation of Integer Transform and Quantization Processor and Their Inverses for H.264 Video Encoder

N. Keshaveni1, S. Ramachandran2 & K.S. Gurumurthy3

1 Department of Electronics & Communication Engg., Dr MGR University, Chennai, India
2 National Academy of Excellence, Bangalore, India
3 Department of Electronics & Communication Engg., UVCE, Bangalore, India

Emails: 1 [email protected], 2 [email protected], 3 [email protected]
ABSTRACT

This paper proposes a novel implementation of the core processors, intra prediction, the integer transform, quantization, inverse quantization and inverse transformation, for an H.264 video encoder using an FPGA. It is capable of processing video frames with the desired compression controlled by the user input. The algorithm and architecture of the core modules of the video encoder, namely, the horizontal mode of intra prediction, the integer transform, quantization, inverse quantization and inverse transformation, were developed, designed and coded in Verilog. The complete H.264 Advanced Video Codec was coded in Matlab in order to verify the results of the Verilog implementation. The processor is implemented on a Xilinx Virtex-II Pro XUPVP30 FPGA. The gate count of the implementation is approximately 870,000. It can process 1024x768-pixel moving color pictures in 4:2:0 format at 25 frames per second. The reconstructed picture quality is better than 35 dB.

Keywords: Intra-prediction, Integer transform, Quantization, Inverse quantization, Inverse transform, Codec, Verilog, FPGA.
1. INTRODUCTION
With the widespread use of technologies such as digital television, internet streaming video and DVD video, video compression has become an inevitable component of broadcast and entertainment media. Currently, the video codec that achieves the highest data compression without sacrificing picture quality is MPEG-4 Part 10 Advanced Video Coding, also known as H.264 [1-3]. This codec has many new features, such as intra-frame prediction, the 4x4 integer transform, quantization, context adaptive entropy coding and a deblocking filter, which were not available in the earlier standards. The present work has realized some of the above features. The implementation conforms to the baseline, main and extended profiles, since only Intra (I) frames are used. While hardware implementations of the integer transform [4, 5] have been reported, no implementation of intra-prediction has been found, to the best of the authors' knowledge. Qiang Peng et al. [6] have reported an implementation of the H.264 encoder using a 32-bit RISC CPU on a single chip running the Linux operating system, which can process PAL, SECAM or NTSC video at 80 MHz. This paper is organized as follows: Sections 2 and 3 describe the basic building blocks of an AVC encoder and the principles involved, and highlight the actual modules implemented in the present work. A novel parallel algorithm for evaluating the transform and quantization, suitable for high-speed implementation on an FPGA/ASIC, is presented in Section 4. Section 5 presents detailed architectures of the intra prediction, integer transform and quantization processors and their inverses. Results and discussions are presented in Section 6. The FPGA implementation results of the design are presented in the next section, and conclusions are presented in the last section.

2. BLOCK DIAGRAM OF H.264 ADVANCED VIDEO ENCODER AS IMPLEMENTED
This section presents the overall building blocks of the Advanced Video Encoder as well as the functional modules implemented in the present work. The block diagram of the H.264 encoder [7, 8] is shown in Fig. 1, in which the modules designed in this work are shown shaded grey. An input frame or field Fn is processed in units of a macro block. A macro block consists of 16x16 pixels. Each macro block is encoded in intra or inter mode, and for each block in the macro block, a prediction P is formed based on reconstructed picture samples [9]. In Intra mode, P is formed from samples in the current slice that have been previously reconstructed. In Inter mode, P is formed by motion-compensated prediction from one or two reference picture(s) selected from the set of reference pictures.
Fig 1: Modules Implemented in H.264 Video Encoder
In Fig. 1, the reference picture is shown as the previously encoded picture F'n-1, but the prediction reference for each macro block partition (in inter mode) may be chosen from a selection of past or future pictures (in display order) that have already been encoded, reconstructed and filtered. The prediction P is subtracted from the current block to produce a residual (difference) block, which is transformed using a block transform and quantized to give a set of quantized transform coefficients that are reordered and entropy encoded [10]. The entropy-encoded coefficients, together with other information required to decode each block within the macro block, form the compressed bit stream, which is passed on to a Network Abstraction Layer (NAL) for transmission or storage. The extra information required to decode each block typically comprises the prediction modes, the quantization parameter, motion vector information, etc. Apart from encoding and transmitting each block in a macro block, the encoder also decodes, i.e., reconstructs it to provide a reference for further predictions. The quantized coefficients are scaled, inverse quantized and inverse transformed to produce a difference block. The prediction block P is added to the difference block to get the reconstructed block. A filter is applied to reduce the effects of blocking distortion, and the reconstructed reference picture is generated as F'n. In the present work, the core processors, namely the intra prediction, integer transform, quantization, inverse quantization and inverse transform (TQIQIT), were implemented. They are shown shaded in the figure. The design of the remaining modules is involved, and their development is in progress.

3. ARCHITECTURAL DETAILS OF HORIZONTAL MODE OF INTRAPREDICTION
Within a picture frame, pixels close to each other tend to have similar values. Intra-prediction is done in order to exploit this spatial redundancy within a frame. Each pixel is predicted based on the values of its neighboring pixels that are available. Instead of processing the pixel value, only the difference between the actual value of the pixel and its predicted value, known as the residual pixel, is processed. If a block or macro block is encoded in intra
mode, a prediction block is formed based on previously encoded and reconstructed blocks. This prediction block is subtracted from the current block prior to encoding. These details are explained clearly in the following paragraphs. The H.264 standard recommends the use of nine different prediction modes, namely, vertical, horizontal, DC, diagonal down left, diagonal down right, vertical right, vertical left, horizontal down and horizontal up from which one or more may be used to form the predicted block [11-14]. There are a total of 9 optional prediction modes for each 4x4 luma block; 4 optional modes for a 16x16 luma block; and one mode that is always applied to each 4x4 chroma block. For the luminance (luma) samples, prediction may be formed for each 4x4 sub block or for a 16x16 macro block. In the present work, the horizontal mode of intra prediction and the 4x4 luma block are used.
Fig 2: Basic Principle of Horizontal Mode of Intra-prediction: (a) Order of sub block processing in a macro block; (b) Horizontal mode prediction block (shaded part) for processing the current sub block; (c) TQIQIT processing (reconstruction of residual sub block)
The best prediction mode would be the one by which the predicted block most closely matches the actual block. To find it, the predicted block would have to be generated using all the modes and then compared with the actual block, which would involve an enormous amount of computation. Also, until the best mode is found, the processing would be stalled, thereby bringing down the throughput of the encoder. Further, no particular mode consistently offers better compression than the others; the best mode changes dynamically with the picture being processed. Of these nine modes, the horizontal mode of prediction is aptly suited for fast implementation as an ASIC or an FPGA consuming minimum hardware. For these reasons, the horizontal mode of intra-prediction has been chosen for this implementation. The basic principle involved in the horizontal mode of intra prediction implemented in this work is shown in Fig. 2. A picture is processed macro block by macro block in the order from top to bottom and from left to right. A macro block consists of 16x16 pixels. It is further divided into 4x4-pixel sub blocks B0 to B15, as shown in Fig. 2(a). These sub blocks are processed in the order B0, B1, ..., B3; B4, B5, ..., B15. As an example, B6 is shown as the current sub block that is required to be processed. The pixel values of this sub block are 'p1', 'p2', ..., 'p16'. It may be noted that just before this sub block, B5 was already processed. The shaded part in Fig. 2(b) shows the horizontal mode prediction block for processing the current sub block. As shown therein, the pixels 'd', 'c', 'b' and 'a' serve as the prediction for the current sub block B6. These prediction pixels belong to the last (4th) column of the previously reconstructed sub block B5. The pixels 'e' to 'm' are the already reconstructed last-row pixels of the upper sub blocks; however, these pixels are not used in the horizontal prediction. It may be noted that the prediction for the current sub block being processed is always the last column of the most recently reconstructed sub block. In the above example, B5 offers the prediction for B6. As another example, the reconstructed last column of B0 forms the prediction for the current sub block B1; likewise, the reconstructed last column of B11 forms the prediction for the current sub block B12. Fig. 2(c) shows the reconstruction of the current sub block (B6 in this example) by TQIQIT processing. For the integer transform, the residual values of the sub block (and not the actual pixel values) are taken as the inputs. The residual values are obtained by taking the pixel-wise difference between the current sub block (B6) and the prediction sub block (B5). The processing of TQIQIT is explained in detail in subsequent sections. The reconstructed residual pixels of the sub block are obtained after TQIQIT processing. Subsequently, these reconstructed residual pixels are added to the corresponding prediction sub block pixels to get the reconstructed sub block (B6 being an example). For the first sub block B0 of a macro block, no pixels are available to generate the predicted block. Therefore, the predicted block in this case has all its pixel values set to '0', i.e., the block is processed without prediction.
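To make the residual computation concrete, a minimal behavioural sketch in C++ is given below (an illustration only, not the Verilog design; the function and variable names are chosen for exposition):

```cpp
// Behavioural sketch of horizontal-mode intra prediction for one 4x4 sub
// block (Fig. 2): every pixel of a row is predicted by the reconstructed
// last-column pixel of the left neighbour in the same row (d, c, b, a).
// For the first sub block B0, the prediction column is all zeros.
void horizontal_residual(const unsigned char cur[4][4],  // current sub block
                         const unsigned char left[4],    // d, c, b, a (or zeros)
                         int res[4][4]) {                // residuals to TQIQIT
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            res[row][col] = (int)cur[row][col] - (int)left[row];
}
```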
4. ALGORITHM FOR PARALLEL MATRIX MULTIPLICATION OF INTEGER TRANSFORM, QUANTIZATION AND THEIR INVERSES
4.1. Integer Transform and Quantization

A novel parallel algorithm that is capable of being highly pipelined has been developed for computing the integer transform and quantization in an earlier work [15]. The core integer transform is expressed as a two-stage matrix multiplication, as shown in Eq. 1. The values X00 to X33 are the residual pixel inputs from the intra-prediction stage, contained in matrix X, as described in the previous section. C and C' (the transpose of C) are constant matrices. W, containing elements W00 to W33, is the matrix of coefficients obtained by transforming the matrix X.
    [ W00 W01 W02 W03 ]   [ 1  1  1  1 ] [ X00 X01 X02 X03 ] [ 1  2  1  1 ]
    [ W10 W11 W12 W13 ] = [ 2  1 -1 -2 ] [ X10 X11 X12 X13 ] [ 1  1 -1 -2 ]
    [ W20 W21 W22 W23 ]   [ 1 -1 -1  1 ] [ X20 X21 X22 X23 ] [ 1 -1 -1  2 ]
    [ W30 W31 W32 W33 ]   [ 1 -2  2 -1 ] [ X30 X31 X32 X33 ] [ 1 -2  1 -1 ]   (1)

or, in short, W = C * X * C'.
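To make Eq. 1 concrete, a minimal C++ reference computation of the forward transform is sketched below. It evaluates the two matrix products directly; the hardware instead pipelines them row by row, as described in the algorithm that follows. All multiplications are by ±1 or ±2 and therefore reduce to additions and shifts in hardware.

```cpp
// Reference computation of the forward 4x4 integer transform of Eq. 1:
// W = C * X * C'. The FPGA design pipelines the two matrix products row
// by row; this sketch simply evaluates them in sequence.
static const int C[4][4] = {
    { 1,  1,  1,  1 },
    { 2,  1, -1, -2 },
    { 1, -1, -1,  1 },
    { 1, -2,  2, -1 }
};

void forward_transform(const int X[4][4], int W[4][4]) {
    int T[4][4];                        // T = C * X (first stage)
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            T[i][j] = 0;
            for (int k = 0; k < 4; ++k)
                T[i][j] += C[i][k] * X[k][j];
        }
    for (int i = 0; i < 4; ++i)         // W = T * C' (second stage)
        for (int j = 0; j < 4; ++j) {
            W[i][j] = 0;
            for (int k = 0; k < 4; ++k)
                W[i][j] += T[i][k] * C[j][k];   // C'[k][j] == C[j][k]
        }
}
```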
Each of the transformed coefficients Wij is quantized by a scalar quantizer as specified in Ref. [1]. A total of 52 values of quantization step size (Qstep) are supported by the standard, and these are indexed by a Quantization Parameter, QP. Qstep doubles in size for every increment of 6 in QP. The wide range of quantizer step sizes makes it possible for an encoder to accurately and flexibly control the trade-off between bit rate and quality. The quantized coefficients are computed as:

    Zij = floor(Wij * MF / 2^qbits)                                        (2)
where qbits = 15 + floor(QP/6) and MF is a multiplication factor specified in the H.264 reference model software of the standard. The algorithm for the integer transform and quantization is as follows:

1. Multiply/add the first row of C with each column of X, one after another, to generate the first row of partial products, P00 - P03. The multiplications involved are trivial since 1, -1, 2 and -2 are the multiplying constants.

2. Multiply/add the second row of C with each column of X, one after another, to generate the second row of partial products, P10 - P13. Concurrently, multiply the first row of partial products P00 - P03 (generated in the previous step) with each of the columns of C', one after another, to generate the first row of integer transformed coefficients. Pipeline the quantization (multiplication with MF) as per Eq. 2 immediately after each integer coefficient Wij is generated. It may be noted that the division by 2^qbits is just a right shift operation, dispensing with division hardware. In this step, the quantized coefficients Zij are generated.

3. Repeat step 2 for the third and fourth rows of C to generate the rest of the sixteen quantized coefficients.
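The quantization of step 2 (Eq. 2) may be sketched as follows. The MF values are the position-dependent multiplication factors tabulated in the H.264 reference model; the rounding offset used by the reference model is omitted here for clarity, and the hardware described in Section 5.2 further simplifies quantization to a bare right shift for power-of-two Qstep.

```cpp
#include <cstdlib>

// Quantization of one transformed coefficient per Eq. 2:
//   Zij = floor(Wij * MF / 2^qbits), qbits = 15 + floor(QP/6).
// MF depends on QP % 6 and on the coefficient position (i, j); the values
// below are those tabulated in the H.264 reference model. The reference
// model also adds a rounding offset before the shift, omitted here.
static const int MF_TABLE[6][3] = {
    // (i,j) both even | both odd | mixed
    { 13107, 5243, 8066 },
    { 11916, 4660, 7490 },
    { 10082, 4194, 6554 },
    {  9362, 3647, 5825 },
    {  8192, 3355, 5243 },
    {  7282, 2893, 4559 }
};

int quantize(int Wij, int i, int j, int QP) {
    int group;
    if (i % 2 == 0 && j % 2 == 0)      group = 0;  // (0,0),(0,2),(2,0),(2,2)
    else if (i % 2 == 1 && j % 2 == 1) group = 1;  // (1,1),(1,3),(3,1),(3,3)
    else                               group = 2;  // remaining positions
    int MF    = MF_TABLE[QP % 6][group];
    int qbits = 15 + QP / 6;
    int sign  = (Wij < 0) ? -1 : 1;
    // The division by 2^qbits is a right shift, dispensing with division.
    return sign * ((std::abs(Wij) * MF) >> qbits);
}
```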
4.2. Inverse Quantization and Inverse Integer Transform

The inverse quantization is expressed as:

    W'ij = Zij * Vij * 2^floor(QP/6)                                       (3)
where Zij are the quantized coefficients and Vij are the rescaling factors dependent upon the coefficient position, as specified in the H.264 standard. The inverse integer transform that follows the inverse quantization stage is:

    X = Ci' * W' * Ci                                                      (4)

where

         [ 1   1    1    1/2 ]
    Ci = [ 1   1/2 -1   -1   ]
         [ 1  -1/2 -1    1   ]
         [ 1  -1    1   -1/2 ]

and Ci' is its transpose.
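A corresponding sketch of the rescaling of Eq. 3 is given below, using the position-dependent V values tabulated in the standard; the power of two becomes a left shift in hardware. The inverse transform of Eq. 4 then mirrors the forward sketch given earlier, with the ±1/2 entries of Ci realized as right shifts.

```cpp
// Rescaling (inverse quantization) of one coefficient per Eq. 3:
//   W'ij = Zij * Vij * 2^floor(QP/6).
// Vij depends on QP % 6 and on the coefficient position, as tabulated in
// the standard; the power of two is realized as a left shift in hardware.
static const int V_TABLE[6][3] = {
    // (i,j) both even | both odd | mixed
    { 10, 16, 13 },
    { 11, 18, 14 },
    { 13, 20, 16 },
    { 14, 23, 18 },
    { 16, 25, 20 },
    { 18, 29, 23 }
};

int inverse_quantize(int Zij, int i, int j, int QP) {
    int group;
    if (i % 2 == 0 && j % 2 == 0)      group = 0;
    else if (i % 2 == 1 && j % 2 == 1) group = 1;
    else                               group = 2;
    return Zij * V_TABLE[QP % 6][group] * (1 << (QP / 6));
}
```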
The algorithm for the inverse quantization and the inverse integer transform is similar to that of the forward transform and quantization and is, therefore, not presented here.

5. ARCHITECTURE OF INTRA PREDICTION, INTEGER TRANSFORM AND QUANTIZATION PROCESSORS AND THEIR INVERSES

The basic architecture of the horizontal mode of Intra Prediction, Integer Transform and Quantization (TQ) and their Inverses (IQIT) of the H.264 Advanced Video Encoder, as implemented in the present work, is shown in Fig. 3. The video encoder brings about the compression of video signals, which is vital in reducing the storage and the serial channel bandwidth over which the compressed bit stream is transmitted. A video sequence, such as the output of a camera decoder, is input to the first stage, the format converter, which converts the 4:2:2 format luminance (Y) and chrominance (Cb, Cr) components of a color motion picture to the standard 4:2:0 format. This format has fewer pixels to process than the 4:2:2 format, thus resulting in less processing time and more compression. The 4:2:0 Y, Cb, Cr components are intra-predicted, transformed, quantized, inverse quantized and inverse transformed in order to get the reconstructed picture output. The quantized coefficients are then fed to the Context Adaptive Variable Length Coder (CAVLC) module [16], which assigns variable length codes to get the desired compressed bit stream. These pixel components are valid when the "datain_valid" signal is asserted. The luminance and chrominance components are written into a "dual RAM" at the rising edge of the "write_clk" signal. Thus, the RAMs store two blocks of 16 lines, i.e., two macro block rows. A macro block consists of 16x16 pixels. As one RAM buffer gets filled, intra-prediction is processed concurrently from the other buffer previously filled. The stored data is read from the "dual RAM" for further processing at the rising edge of the "read_clk" signal. The system is reset at power-on using an asynchronous, active-low signal "reset_n". Just as a microprocessor may be halted at any point of time, the TQIQIT processing may also be temporarily suspended using the "halt" signal in order to allow the CAVLC processor to catch up with the TQIQIT processor. The desired compression may be set by the quantization parameter "Qstep_in[1:0]", which is user configurable. After processing, the reconstructed chrominance components "pix_cb_rec_out" and "pix_cr_rec_out" are output along with their corresponding valid signals. "q_coef" is the output after quantization; it is fed to the CAVLC processor for effecting compression.
Fig 3: Functional Modules of the Advanced Video Encoder Implemented
5.1. Detailed Architecture of the Intra-prediction Module

The intra-prediction module consists of double buffered RAMs for each of the components Y, Cb and Cr. The output of the dual RAM, which stores the current sub block being processed, is the input to the intra-prediction module. This module consists of two sub modules, namely, the "four_pix_out_y" module for storing a current sub block and the "ram_predict_y" module for storing the predicted pixel values and the reconstructed pixel values needed to generate the predicted block, as shown in Fig. 4. The current sub block pixel values from the "dual RAM" module are input to the module "four_pix_out_y" using the data bus "pix_y_dram_in[7:0]". Their validity is signaled by simultaneously asserting the signal "pix_y_dram_val".
The "four_pix_out_y" module outputs the current sub block pixel values, one column at a time, at the pins marked "pix_fpo0" to "pix_fpo3", with the signal "pix_fpo_valid" serving as their valid signal. The entire sub block is output in 4 clock cycles. The pixels output in these clock cycles are p1, p5, p9, p13 in the first cycle, p2, p6, p10, p14 in the second cycle, p3, p7, p11, p15 in the third cycle and p4, p8, p12, p16 in the last cycle. These are the current sub block pixel values presented in Fig. 2 earlier.
In the next module, called "intrapred_mem_y", the difference between the current sub block pixel values "pix_fpo0" to "pix_fpo3" and the predicted values gives the residual values "pix0_y_res" to "pix3_y_res", with the "pix_y_res_valid" signal asserted. These four values are fed to the TQIQIT module to get back the reconstructed residual values ("pix_y_res_rec"). The reconstructed residual values are in turn added to the predicted values (d, c, b, a) to get the reconstructed sub block, as described below for the "ram_predict_y" module. The signal "pix_y_req_dram_out" is the pixel request to the dual RAM; when this signal is high, the dual RAM outputs pixels to the intra-prediction module.
Fig 4: Architecture of Intra-prediction Module
The module "ram_predict_y" contains the predicted values, i.e., the last column of the reconstructed pixels (d, c, b, a) of the previously processed sub block. The reconstructed residual values "pix_y_res_rec" computed by the TQIQIT module are added to these previously reconstructed pixels (d, c, b, a) in the "ram_predict_y" module, giving the reconstructed value ("pix_y_rec_out") of the current sub block. The last column pixel values of this reconstructed sub block are also output as "pix_pred0" to "pix_pred3", with "pix_pred_valid" as the valid signal. These are internally stored as (d, c, b, a) to serve as the predicted values for processing the next sub block.
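The reconstruction step performed in "ram_predict_y" may be sketched behaviourally as follows (the clipping to the 8-bit pixel range is an assumption of this sketch; the paper does not detail it):

```cpp
#include <algorithm>

// Behavioural sketch of the reconstruction performed in "ram_predict_y":
// the TQIQIT-reconstructed residuals are added to the stored prediction
// pixels (d, c, b, a) and clipped to the 8-bit range (clipping assumed).
void reconstruct_subblock(const int res_rec[4][4],      // pix_y_res_rec
                          const unsigned char pred[4],  // d, c, b, a
                          unsigned char rec[4][4]) {    // pix_y_rec_out
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            rec[row][col] = (unsigned char)std::min(255,
                                std::max(0, res_rec[row][col] + pred[row]));
}
```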
5.2. Detailed Architecture of the TQIQIT Processor

The TQIQIT module consists of the transformation, quantization, inverse quantization, dual RAM and inverse transformation modules, shown in Fig. 5. The four residual pixel values "pix0_res" to "pix3_res" after intra-prediction are fed to the transformation module marked "transform". These pixel values are valid when the "pixel_valid" signal is asserted. The output of the transform module is the transformed coefficient "t_coef", and the validity of the data is asserted by the signal "t_coef_val".

Fig 5: Architecture of TQIQIT Processor
The transformed coefficients are fed to the quantizer. The quantization is performed according to Eq. 2. The signal "q_rsh" is an external input to the quantizer module used to set the desired compression via Qstep; the quantization is performed by a right shift operation. The output of the quantizer is the quantized coefficient "q_coef", and the validity of the data is asserted by the signal "q_coef_val". These outputs are fed as inputs to the next processing module, the CAVLC, which is not part of this work. After quantization, the coefficients are fed to the inverse quantization module. The signal "q_lsh" is used as the inverse quantization parameter, and the inverse quantization is performed by a left shift operation. The output of the inverse quantizer is the signal "inv_coef", and the validity of the data is asserted by the signal "inv_coef_val". The output of the inverse quantizer is fed to the dual RAM module "dram_inter" to get the four coefficients "inv_coef0" to "inv_coef3", whose validity is asserted by the signal "inv_coef_val". These four coefficients are fed to the inverse transform module, whose output is the reconstructed residual values "pix_res_rec", with validity asserted by the signal "pix_res_rec_val". The inverse transformation is the reverse of the transformation process, just as inverse quantization is the reverse of the quantization process. The architectures for intra-prediction and TQIQIT for chrominance (Cb and Cr) are similar to those for luminance and hence are not presented in this paper.
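For a power-of-two Qstep, as selected by "Qstep_in", the quantizer and inverse quantizer thus reduce to bare shifts. A behavioural sketch is given below; taking "q_lsh" equal to "q_rsh" is an assumption of this sketch, made so that the rescaling restores the original magnitude.

```cpp
// Behavioural sketch of the shift-based quantizer and inverse quantizer of
// this design: for a power-of-two Qstep, quantization is a right shift by
// "q_rsh" and inverse quantization a left shift by "q_lsh". Taking
// q_lsh == q_rsh (an assumption) makes the rescaling restore the magnitude.
int quantize_hw(int t_coef, int q_rsh) {
    return t_coef >> q_rsh;         // arithmetic shift; floors negative values
}

int dequantize_hw(int q_coef, int q_lsh) {
    return q_coef * (1 << q_lsh);   // equivalent to a left shift
}
```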
6. SIMULATION RESULTS AND DISCUSSIONS

The various modules described in the previous sections were coded in Verilog, the top design being called "top_tqiqit". Several other sub modules are instantiated by this top design module. A test bench was also developed so that the design may be tested using ModelSim. A Matlab program was first written that accepts a standard true color picture in 4:4:4 "tif" format as input and converts it to luminance (Y) and chrominance (C) components in 4:2:2 "tif" format. These "tif" files are converted to "raw" format using standard software such as IrfanView. These files are in raster scan order and need to be converted to macro block/sub block order before they can be used in ModelSim for simulation. Therefore, a C++ program was written to convert the raw picture into sub blocks as a "txt" file, which serves as the input to the Verilog module TQIQIT. The Verilog design "top_tqiqit", whose architecture was presented earlier, was run in ModelSim to get the reconstructed picture in 4:2:0 "txt" format. These reconstructed "txt" files (Y, Cb, Cr) were converted back to "raw" format using another C++ program.
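The raster-to-sub-block conversion step mentioned above might look as follows; this is a hypothetical reconstruction, since the actual C++ helper is not published, and the file names and picture size are placeholders:

```cpp
#include <cstdio>
#include <vector>

// Hypothetical reconstruction of the raster-to-sub-block conversion: reads
// a raw 8-bit luma plane in raster-scan order and writes the pixels in
// macro block / 4x4 sub block order (B0..B15 of Fig. 2a), one value per
// line, as consumed by the Verilog test bench.
int main() {
    const int W = 512, H = 256;                    // picture size assumed here
    std::vector<unsigned char> pix(W * H);
    FILE* in = fopen("lena_y.raw", "rb");          // placeholder file name
    if (!in || fread(pix.data(), 1, pix.size(), in) != pix.size()) return 1;
    fclose(in);

    FILE* out = fopen("lena_y_subblocks.txt", "w");
    for (int my = 0; my < H; my += 16)             // macro block rows
        for (int mx = 0; mx < W; mx += 16)         // macro blocks, left to right
            for (int by = 0; by < 16; by += 4)     // sub blocks B0..B15
                for (int bx = 0; bx < 16; bx += 4)
                    for (int y = 0; y < 4; ++y)    // 4x4 pixels of one sub block
                        for (int x = 0; x < 4; ++x)
                            fprintf(out, "%d\n",
                                    pix[(my + by + y) * W + (mx + bx + x)]);
    fclose(out);
    return 0;
}
```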
The reconstructed raw output is still in macro block/sub block order and is finally converted to "tif" format using another Matlab program. This program automatically displays both the original picture and the reconstructed picture. The Matlab program also computes the quality of the reconstructed picture, referred to as PSNR and expressed in dB. The simulated waveforms are shown in Figs. 6 and 7. The reconstructed picture is generated at every rising edge of "read_clk", with latency coming into play for every sub block processed. The first sub block reconstructed pixel values "pix_Y_rec_out", "pix_Cb_rec_out" and "pix_Cr_rec_out" and their corresponding valid signals are shown in Figs. 6 and 7. From these waveforms, we observe that the reconstructed data "pix_Y_rec_out" commences at 98445 ns and ends at 1638311 ns. Similarly, the reconstructed data "pix_Cb_rec_out" starts at 99821 ns and ends at 1638655 ns, while "pix_Cr_rec_out" commences at 100165 ns and ends at 1638999 ns. Some of these start/end time waveforms are not presented here since they occupy considerable space. In summary, the reconstructed picture pixels start issuing at 98445 ns (Fig. 6) and end at 1638999 ns (Fig. 7), thus taking 1540554 ns to process a complete frame of a video sequence. Since each "read_clk" cycle is of duration 2 ns during simulation, it takes 770277 "read_clk" cycles to reconstruct the entire data. Therefore, for a picture of size 512x256 pixels, such as the Lena image used in the present simulation, it takes 5.9 clock cycles to process each pixel. Assuming an operating frequency of 124 MHz for "read_clk", this works out to 6.24 ms per frame, ignoring latency, which is small. This assumption is valid since the Verilog design works at 124 MHz, as presented in the next section on FPGA implementation. Extrapolating this processing time to a picture of resolution 1024x768 pixels, we get a processing time of 37.4 ms per frame, or, in other words, a frame rate of 25 pictures per second. The H.264 video encoder was first implemented in Matlab in order to estimate the quality of the reconstructed image and the compression that can be achieved. In addition, the Matlab output serves as a reference for verifying the Verilog output. Subsequently, the core modules of the encoder, as described earlier, were realized in Verilog for ASIC/FPGA implementation. The resulting qualities of the reconstructed images obtained with intra-prediction using Matlab and Verilog compare favorably, as can be seen from Fig. 8. It may be seen from Figs. 8(b) and 8(c) that the Verilog result is very close to the Matlab result, since the Verilog code uses at least 16-bit precision.
Fig 6: Reconstructed Picture Waveforms of First Sub Block

Fig 7: Reconstructed Picture Waveforms of Last Sub Block

Fig 8: Simulation Results of H.264 TQIQIT Processor: (a) Original Lena Image (512x512 pixels); (b) Reconstructed Lena Image using Matlab with Intra-prediction (PSNR: 35.5 dB); (c) Reconstructed Lena Image using Verilog with Intra-prediction (PSNR: 35.2 dB)

In the simulation results presented earlier, the Lena image with a resolution of 512x256 pixels was used. However, for reconstruction, 512x512 pixels were used. The results were obtained for Qstep_in = 16 (QP = 28). The compression achieved for the 4:2:0 format using Matlab was 15 with intra-prediction and 11.7 without, revealing a substantial improvement in compression for the horizontal mode of intra-prediction. The Verilog result without intra-prediction, however, offered a higher PSNR value, namely, 37.3 dB. The proposed FPGA implementation, which processes motion pictures of size 1024x768 pixels at 25 frames/sec, is about two times faster than the SOC/ASIC implementation reported by Qiang Peng et al. [6].

7. FPGA IMPLEMENTATION

The various modules described in the previous sections were coded in Verilog, simulated using ModelSim, synthesized using Synplify Pro 8.5 and placed and routed using Xilinx Project Navigator ISE 8.2. The target device chosen was the Xilinx Virtex-II Pro XUPVP30-7 FF896 FPGA, since the board available in our laboratory is based on this FPGA. The core parts of the video encoder design described in the previous sections utilize 863,469 gates, 12 block RAMs and 1666 occupied slices. The maximum frequency of operation is 124 MHz for "read_clk". This works out to a frame rate of 25 per second for a picture size of 1024x768 pixels, as explained earlier. With a higher speed FPGA, the frame rate can be increased to 30. The Verilog code developed for this project is fully RTL-compliant and technology independent. As a result, it can work on any FPGA or ASIC without needing any code changes. As an ASIC, it is likely to work for higher resolutions, up to 1600x1200 pixels at 30 frames/sec.

8. CONCLUSION
An FPGA implementation of the core processors of the H.264 video codec has been presented. It uses the horizontal mode of intra-prediction. While intra-prediction was found to give higher compression, the gains obtained vary with the video sequence. Although the 4x4 integer transform used is significantly simpler and faster than the 8x8 DCT used in MPEG-2, the processing speed is offset by intra-prediction and latency. A significant improvement over MPEG-2 is the reduction of blocking artifacts, especially at high compression, even without using a de-blocking filter. The desired compression can be selected by the user in the implemented H.264 encoder modules. The FPGA implementation of the present work produces high quality reconstructed pictures and compares favorably with another implementation.

REFERENCES

[1] Joint Video Team, "Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification", ITU-T Rec. H.264 and ISO/IEC 14496-10 AVC, March 2005.
[2] "Advanced Video Coding for Generic Audio-visual Services", ITU-T Recommendation H.264.
[3] "Generic Coding of Moving Pictures and Associated Audio", ISO/IEC JTC1 CD 13818, 1994.
[4] Sadiqullah Khan, Gulistan Raja, "Integer Cosine Transform and its Application in Image/Video Compression", ICSEA-2004 Conference Proceedings, Islamabad, pp. 189-193, December 2004.
[5] Liu Ling-zhi, Qiu Lin, Rong Meng-tian, Jiang Li, "A 2-D Forward/Inverse Integer Transform Processor of H.264 Based on Highly Parallel Architecture", Proceedings of the 4th IEEE International Workshop on System-on-Chip for Real-Time Applications (IWSOC'04), 2004.
[6] Qiang Peng and Jin Jing, "H.264 System on Chip Design and Verification", The IEEE 2003 Workshop on Signal Processing Systems (SIPS'03), 2003.
[7] I. E. G. Richardson, "H.264 and MPEG-4 Video Compression (Video Coding for Next Generation Multimedia)", John Wiley, January 2004.
[8] "MPEG-4 Overview", ISO/IEC JTC1/SC29/WG11 N4668.
[9] D. LeGall, "MPEG: A Video Compression Standard for Multimedia Applications", Communications of the ACM, 34, pp. 46-58, Apr. 1991.
[10] F. Pan, "Fast Intra Mode Decision Algorithm for H.264/AVC Video Coding", Proceedings, International Conference on Image Processing (ICIP), 2, pp. 781-784, Oct. 2004.
[11] W. K. Cham, "Development of Integer Cosine Transforms by the Principle of Dyadic Symmetry", IEE Proceedings, 136, Pt. I, No. 4, pp. 276-282, August 1989.
[12] K. M. Cheung, F. Pollara, and M. Shahshahani, "Integer Cosine Transform for Image Compression", The Telecommunications and Data Acquisition Progress Report 42-105, January-March 1991, Jet Propulsion Laboratory, Pasadena, California, pp. 45-60, May 15, 1991.
[13] Thomas Wiegand and Gary J. Sullivan, "Overview of the H.264/AVC Video Coding Standard", IEEE Transactions on Circuits and Systems for Video Technology, pp. 1-17, July 2003.
[14] J. Ribas-Corbera, P. A. Chou, and S. Regunathan, "A Generalized Hypothetical Reference Decoder for H.264/AVC", IEEE Transactions on Circuits and Systems for Video Technology, pp. 18-32, July 2003.
[15] N. Keshaveni, S. Ramachandran and K.S. Gurumurthy, "Design and Implementation of Integer Transform and Quantization Processor for H.264 Encoder on FPGA", International Conference on Advances in Computing, Control and Telecommunication Technologies, December 2009.
[16] N. Keshaveni, S. Ramachandran and K.S. Gurumurthy, "Implementation of Context Adaptive Variable Length Coder for H.264 Video Encoder", International Journal of Recent Trends in Engineering (ISSN: 1797-9617), Academy Publishers, Finland.