implementation of three block mat0063hing search algorithms ... - IRAJ

International Journal of Advanced Computational Engineering and Networking, ISSN: 2320-2106,

Volume-2, Issue-12, Dec.-2014

IMPLEMENTATION OF THREE BLOCK MAT0063HING SEARCH ALGORITHMS IN H. 264 ON CUDA 1

RENUKA JOSHI, 2JANCY JAMES, 3SAGAR KARWANDE, 4DARSHAN PAWAR 1,2,3,4

Department of Computer Engineering, JSPM’s Rajarshi Shahu College of Engg., Pune, India E-mail: [email protected], [email protected], [email protected], [email protected]

Abstract- Over the past two decades, improvisations in computing field indicated by fast networks, distributed systems and parallel computer architectures delineates that parallelism is the future of computing. H.264 is a video coding standard which is a bitstream structure and decoding method for video compression. In this paper, we combine the power of two i.e. GPGPU and H.264. Here, we use NVIDIA CUDA GPGPU to further increase the efficiency of H.264. We use three block matching search algorithms Full search, Diamond search and Hexagon based search. We implement all of them independently for the extensive analysis of Motion Estimation (ME). We discuss the components like inter and intraprediction models, frame analysis and CUDA implementation which are required to make the system parallel on GPGPU. Motion estimation is computationally expensive process in video coding. Thus, this paper overcomes the limitation of time required for motion estimation in CPU by using GPGPU. Keywords- Compute Unified Device Architecture (CUDA), General Purpose Graphics Processing Unit (GPGPU), H.264 Video coding Standard, Motion Estimation, Block Matching Search Algorithms

I.

with minimum resources. H.264 has proven to be more pragmatic and robust algorithm for video encoding. H.264 is a lossy video compression technique although the loss is indistinguishable to bare eyes or human visual system.

INTRODUCTION

In the universe, almost everything is inherently parallel. Many complex, interrelated events happen simultaneously, still in the synchronization with each other. And therefore parallel computing relates to the real world phenomena and provides flexibility in computations. Even though powerful microprocessors are capable of executing millions of instructions per second, there are many such complex and computationally expensive problems for which a CPU may need years to solve and this is when the concept of parallel computing comes into picture. In general, parallel computing means that more than one microprocessors are executing some part of the overall task like modules or functions in C/C++ execute independently. A programmer assigns different tasks to each dedicated microprocessor which solves the problem assigned to it. And then solution for each sub-problem is reassembled and a solution is generated for a complex problem which has initially been divided into sub-problems.

Computational complexity is directly proportional to frame resolution and as the resolution of the frame increases the computational complexity also increases making the video size heavier. So there is a need to accelerate H.264 video encoders by dividing the task of computationally expensive Motion Estimation algorithm and hence the immense power of parallelization provided by NVIDIA CUDA can be used to accelerate video encoding. There are various block matching search algorithms to make motion estimation faster by using different search patterns to cover all the macro-blocks in the frame while still doing faster computations. Thus, making it possible for developers to decide which algorithm is efficient as per the system requirements and various factors which affect motion estimation.

Video encoding is used to reduce temporal and spatial redundancies from video frames resulting in reduction of size of a video. From the time when the MPEG1 video compression standard was proposed by ISO/IEC it has been one of the most popular compression standards. The further modifications to 1st encoding format led to the creation of MPEG2, MPEG4 and finally H.264 in year 2003. H.264 is a video compression technology developed by Joint Video Team (JVT) has set a standard in video encoding process for its high coding efficiency compared to other encoders. Internet and streaming media is a part of our day to day life. But it is costing companies significant amount of bandwidth, thus their primary goal since last two decades is to achieve good quality

II.

MOTIVATION AND BACKGROUND

The existing video encoders make use of optimized H.264 algorithm and using CPU as platform for video compression. Theora (developed by HTML5 an open standard) and WebM/VP8 are also some of the video compression standards but due to some controversies faced by these standards H.264 became most popular and trustable and this is the reason it is being used by iTunes store, Google Chrome browser, YouTube etc. However there is still scope for improvement in average time and speedup that is required for video compression of a video. This is done by optimally using the GPU

Implementation of Three Block Matching Search Algorithms in H.264 on CUDA 61


resources. A CPU has multiple cores but a GPU has thousands of cores and 1 core has number of threads which enables data parallelism. On the web, large number of videos are getting uploaded per minute which are used globally. Although it costs a huge amount of bandwidth because of motion prediction which is computationally expensive process, people still stick to traditional CPUs because GPU implementation is not an easy method. In our system, we make use of open source API libraries of CUDA platform which would only require the cost of the GPU i.e. hardware part and thus, this project is an open source project.

two parts, host code and device code. Host part consists of kernels and device part consists of grids. Further a grid may have 65535 blocks and a block can have 512 threads each. These threads are assigned to each subtask of a larger problem. There is just one constraint that must be followed to get maximized efficiency using CUDA is no dependency between data by dividing data in such a manner that each divided task can execute on its own. If there are no dependencies between data elements, true parallelization is accomplished. B. Motion Estimation Motion estimation uses a reference frame, a previous and a future frame for block matching. It divides the frames in blocks(16x16/16x8/8x16/8x8/8x4/4x8/4x4) and figures out where the blocks have moved in next frame using motion vectors. This enables us to reduce the overhead of re-encoding the same block again. This is because there are only a few changes in the sequence of frames of a video. Determining motion vectors between adjacent frames is called as motion estimation. Finding motion vectors is the core part of motion estimation. 2-dimensional vector is used in inter-prediction to remove the temporal redundancies within the frames. Motion estimation takes near about 80% of total encoding time to find out redundancies. Here, various block matching search algorithms are suggested to reduce the time required for example Full search algorithm, Diamond Search algorithm, and Hexagon-based search algorithm. When a video is pictured, there are very few differences between its adjacent frames making the remaining data redundant. Motion estimation is used to determine the redundant part and preventing it from re-encoding, which results in saving time. There are three types of MBs B-MB one which has previously been encoded from one or two frames and I-MB is the current frame taken for encoding and P-MB is predicted from previously coded frames. A macro-block is chosen from current video frame as well as previous and future frame to find out best matching macro block which is then selected for encoding, this process continues till all the macro-blocks from a frame are compared with others. Data with 0, -1, +1 values are obtained, here 0 indicates redundant data which is then not considered for encoding and hence it removes the overhead of reencoding the redundant data.

However one may say, CUDA does not efficiently work when data dependencies are present but by using optimization guidelines and carefully breaking a problem into parallelizable parts. This enables us to achieve high level of efficiency. H.264 produces all types of low bit-rate compressed videos as well as HDTV broadcast videos with nearly lossless video coding. H.264 algorithm is highly parallelizable as there are various calculations for finding Motion Vectors (MVs) between frames of a video which can be worked out simultaneously. In this paper, we propose a way to use the immense power of parallelization in video encoding using H.264 standard along with block matching algorithms used for motion estimation. In our system, we try to parallelize the major part i.e. motion estimation from video encoding of H.264 by using CUDA whereas the remaining encoding process may run on CPU. This approach would provide flexibility, speedup, quality and much less computational time. III.


LITERATURE SURVEY

A. CUDA architecture CUDA is a parallel computing platform created by NVIDIA on GPUs. Using CUDA to solve computationally expensive problems saves time and money, provides concurrency. It is capable of solving large and complex problems for which a CPU may take years to solve. A problem can be broken into series of instructions which are comparatively easy to compute on different processors simultaneously. In simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem. Multi-core processor is a class of parallel processors which has multiple cores on a single chip. The one which we are using Nvidia Quadro K600 has 2880 cores. A multiprocessor is capable to issue multiple instructions per cycle from multiple instruction threads. CUDA architecture is basically divided into



IV.

values of integer pixels of the whole image transferred to GPU. Motion Estimation is the process of block matching between the frames. There are various algorithms used for block matching and they are as follows:

PROPOSED SYSTEM

A. Exhaustive Full Search Full search algorithm finds the best block match between the frames where it uses brute force method which makes it slower. It is a complete and extensive analysis of each and every block in the frame and has more parallelism than any other block matching search algorithm. Despite that it is computationally one of the most expensive algorithms for motion estimation. 1) Every possible rectangular search window is considered one by one from the reference frame. 2) Motion vectors are calculated based on SAD(sum of absolute differences) bewtween macro-blocks. 3) Corresponding pairs of pixels i.e temporal MB or spatial MB are used to calculate SAD values for one reference MB in search area. 4) Now these pixels are paired by reference MB. Number of pairs per MB is equal to the number of pixels in current MB. 5) All the pixels pairs for one MB are collected and the value which builts up from this is SAD value for that reference MB. This process is continued till all the reference MBs get SAD values.

As shown in the Fig. 3 the architecture is consisting of following main blocks: i) Motion Estimation/Prediction ii) CUDA Implementation iii) Transformation, Quantization and Bit-stream Encoding. The system will work as follows: 1) Input to the system is raw video file(.y4m/.yum) 2) Prediction a) Inter-prediction-reduces temporal redundancies between frames. b) Intra-prediction- reduces spatial redundancies from current frame i.e. reference frame 3) Transformation, Quantization- reduces unnecessary details for the human visual system in order to achieve higher coding gain 4) Bit-stream encoding- encoder will give encoded video file as output. We use Ubuntu 12.04 with CUDA Toolkit installed on it having minimum requirements as RAM 512MB, Hard disk size 160GB, intel dual core processor(64 bit) and 1GB NVIDIA graphics card. V.


B. Diamond Search This algorithm provides four sided shape analysis. It balances speed and quality. The search pattern starts from the center macro-block There are two search patterns: 1) LDSP(Large Diamond Search pattern) It consists of 9 checking points in a search window wherein 8 points surround the center one to form a diamond shape. 2) SDSP(Small Diamond Search pattern) It consists of 5 checking points and forms a smaller diamond shape.

IMPLEMENTATION

Firstly, encoding process except the interpolation module is executed on CPU. After the encoding process is offloaded to GPU then and then only interpolation module is executed. GPU assigns thousands of threads for calculation of pixels with help of parameter SAD (sum of absolute differences). The major part of this process is Motion Estimation which has two prediction modes inter and intra which reduces temporal and spatial redundancies across frames improving the further compression of the video. When sub-pixels are calculated, CPU imports

Searching procedure for finding Motion Vector: 1) LDSP is recursively called until minimum block distortion (MBD) occurs at the center point. If calculated MBD is found to be at the center point then




directly move to step 3, otherwise proceed through step 2. 2) The MBD which is found in previous step is now considered as the center point and is surrounded by 8 new points to form a new LDSP. If calculated MBD is found to be at the center point then proceed to step 3, otherwise keep repeating step 2 until MBD is found at center position. 3) Here, the search pattern is switched from LDSP to SDSP for further calculation on a detailed search window. Among these 5 checking points in SDSP, there is one point which gives MBD and it is the motion vector of best matching block.

architecture NVIDIA CUDA. The implementation would unleash the aspects which affect the video encoding. It differentiates between those three block matching search algorithms and its efficiency coefficients which would help to choose among them as per the system requirements.

C. Hexagon-based search (HEXBS) This algorithm provides six sided shape analysis. It provides reasonable quality in accordance with speed. The search pattern starts from the center macro-block

ACKNOWLEDGMENT

On the basis of our analysis, further work calls on introducing parallelization in transformation, quantization part of the proposed system. The system is presently designed for CUDA, but in the future we can increase the scope for other parallel architectures and for other GPGPUs.

The paper focuses on implementing H.264 video encoder on CUDA platform in CUDA research centre established at RSCOE, University of Pune as the part of our final year project.

There are two HEXBS patterns: 1) Large HEXBS- concists of 7 checking points from which 6 points surround the centre one to compose a hexagon. 2) Small HEXBS- consists of 4 checking points from which 3 surround the centre one. This strategy is applied after Large HEXBS for the focused inner search.

We would like to thank the members of CUDA research centre who have helped and guided us through the entire process. We are grateful to our project guide Prof. C.G.Thorat and project coordinator Prof. V.M.Barkade for their support and motivation in selection of the project topic and valuable guidance. We would also like to thank the Head of Computer Department Prof. A.B.Bagwan who gave us a continuous support and possible resources at every step we needed. REFERENCES

Searching procedure for finding Motion Vector: 1) Large HEXBS is recursively called until minimum block distortion (MBD) occurs at the center. If the MBD point is found to be at the center then go to step 3, otherwise go to step 2. 2) With the MBD point considered as center again a large HEXBS pattern is formed, again MBD is evaluated with 3 new candidate points. If the MBD point is at the centre then go to step 3, otherwise recursively call step 2. 3) Here, switch from Large HEXBS to Small HEXBS. MBD point is compared with four points covered in small hexagon and the new MBD is the final solution for the motion vector.

[1]

Aleksandar Colic, Hari Kalva, Borko Furht , “Exploring NVIDIA-CUDA for Video Coding”, Dept. of Electrical and Computer, Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, ACM, 2010

[2]

Youngsub Ko, and Youngmin Yi, Soonhoi Ha, “An Efficient Parallel Motion Estimation Algorithm and X264 Parallelization in CUDA”, Dept. of Electrical and Computer, Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, ACM, 2012.

[3]

Liye Zhang, Jian Wang, Chen Chu and Xiaoyong Ji, “Parallel Motion Compensation Interpolation in H.264/AVC using Graphic Processing Units”,School of Electrical Science and Engineering, Nanjing University Nanjing, China, IEEE, 2012.

[4]

Eduarda Monteiro, Bruno Vizzotto, Claudio Diniz, Bruni Zatt, Sergio Bampi, “Applying CUDA Architecture to accelerate Full Search Block Matching Algorithm for High Performance Motion Estimation in Video Encoding”, Informatics Institute-PPGC-PGMICRO, Federal University of Rio Grande do Sul (UFRGS), Porto Algre, Brazil, IEEE, 2011

[5]

S. Zhu, K. Ma, “A New Diamond Search Algorithm for Fast Block-Matching Motion Estimation”, IEEE TRANSACTIONS ON IMAGE PROCESSSING,VOL.9, NO. 2, Feb .2000

[6]

Ce Zhu, Xiao Lin, and Lap-Pui Chau, “Hexagon-Based Search Pattern for Fast Block Motion Estimation”, Sch. of Electr. & Electron. Eng., Nanyang Technol. Univ., Singapore, Singapore, IEEE, 2002.

CONCLUSION AND FUTURE SCOPE In this paper, we implement three block matching search algorithms for H.264 on massively parallel




[7]

W. Chen, H. Hang, “H.264/AVC motion estimation implementation on Compute Unified Device Architecture(CUDA)”, IEEE 2008

[11] W. Chen, H. Hang, “H.264/AVC motion estimation implementation on Compute Unified Device Architecture(CUDA)”, IEEE 2008

[8]

NVIDIA CUDA Programming Guide version 4.0 (5/6/2011), http:// www.shodor.org/ media/ content// petascale/ materials/UPModules/ matrix Multiplication/cudaCguide.pdf

[12] NVIDIA CUDA Programming Guide version 4.0 (5/6/2011), http://www.shodor.org/ media/content// petascale/ materials/UPModules/ matrix Multiplication/cudaCguide.pdf

[9]

Dong-Kyu Lee, Seoung-Jun Oh, “Variable Block Size Motion Estimation Implementation on Compute Unified Device Architecture(CUDA)”, Department of Electronics Engineering Kwangwoon University, IEEE, 2013

[13] The H.264 Advanced Video Compression Standard, second edition, http://as.wiley.com/ WileyCDA/ Wiley Title/ productCd-0470516925.html [14] Parallel computing, http://en.wikipedia.org/ wiki/ Parallel_ computing

[10] Zhu, Xiao Lin, and Lap-Pui Chau, “Hexagon-Based Search Pattern for Fast Block Motion Estimation”, Sch. of Electr. & Electron. Eng., Nanyang Technol. Univ., Singapore, Singapore, IEEE, 2002.

[15] CUDA, http://www.nvidia.com/ object/ cuda_ home _ new. html.




implementation of three block mat0063hing search algorithms ... - IRAJ

implementation of three block mat0063hing search algorithms ... - IRAJ

Suggest Documents

implementation of three block mat0063hing search algorithms ... - IRAJ

parallelizing and comparing block matching algorithms in h.264 ... - IRAJ

parallelizing and comparing block matching algorithms in h.264 ... - IRAJ

Implementation of Recursive Search Algorithms in ... - Ua.pt

Novel Cross-Diamond-Hexagonal Search Algorithms for Fast Block ...

An Implementation of Three Algorithms for Timing ... - Semantic Scholar

Design and Implementation-Algorithms of Amharic Search Engine ...

Block-based Web Search

implementation of six sigma techniques in load conscious ... - IRAJ

implementation of six sigma techniques in load conscious ... - IRAJ

implementation of ihs fusion technique and comparative ... - IRAJ

algorithms and FPGA implementation

Comparison of Symmetric Block Encryption Algorithms

Efficient FPGA Implementation of Block Cipher MISTY1

Low Complexity Implementation of Block ... - IEEE Xplore

VHDL Implementation of Efficient Multimode Block ...

Parallel Implementation of Real-Time Block-Matching

passive damping filter design and application for three-phase ... - IRAJ

The Three Block Model of UDL

Comparison of three cell block techniques for

SEMANTIC MATCHING: ALGORITHMS AND IMPLEMENTATION

Block-Structured Adaptive Mesh Refinement Algorithms and ...

parallelizing and comparing block matching algorithms ... - DigitalXplore

block algorithms for orthogonal symplectic factorizations - Infoscience