The 3rd International Conference on Next Generation Computing(ICNGC2017b)

Video Codec Implementation Using Parallel Processing Scheme

Ikram Hussain1, Tran Thi Hai Uyen1, Madiha Sher2
1 Department of Electronics Engineering, Sejong University, Seoul, South Korea
2 Department of Computer System Engineering, University of Engineering and Technology Peshawar, Pakistan
[email protected], [email protected], [email protected]

Abstract: Video processing is highly challenging because good quality must be delivered at a low data rate. As video is used in more and more applications, video resolutions keep increasing, which demands fast and efficient coding algorithms. Although tremendous progress has been made in this area, the transmission and storage of large videos remain problematic, so video must be compressed (to reduce its size) and decompressed (for viewing) while maintaining its quality. The H.264/MPEG-4 AVC video codec offers better performance and compression efficiency. Our goal is to implement the H.264 video codec both sequentially and in parallel. Parallel processing and multithreaded encoding are the solution for implementing a video codec in real-time systems. We use the GPU (graphics processing unit) parallel architecture named CUDA (Compute Unified Device Architecture). This paper presents a comparison of the performance of the video codec on a multi-core CPU and on different GPUs.

Keywords: parallel processing; video codec H.264-AVC

I. Introduction

CODEC is short for compression and decompression. Compression and decompression are performed to handle the problem of transmitting and storing a data stream. Videos are almost always compressed, and raw video is rarely used because of aspects such as size and quality. Such data streams can be large video files, which are difficult to transfer quickly across the Internet or even to store.

For compression and decompression we use a video CODEC, which can be software or a hardware device. Compression shrinks the video file for transmission. On the receiving side, decompression comes into play and the compressed video file is decoded for viewing or editing.

Without a CODEC, downloading and uploading would take far longer than they do now. The intent of the H.264/AVC project was to create a standard that delivers good video quality at lower bit rates than previous standards without increasing the complexity of the video codec, and that is flexible enough to be applied to a diversity of applications on a variety of networks and systems at low and high bit rates. It was developed by the Joint Video Team (JVT).

In this paper, our goal is to reduce the processing time spent in the most time-consuming blocks of the codec by parallelizing them, while maintaining the video quality and reducing the size of the video.

The rest of the paper is organised as follows. Section II addresses the related work. Section III presents our proposed method. Section IV presents the experimental setup and results. Finally, Section V concludes the paper.

II. Related Work

Different solutions for video encoding have been developed since the standardization of video encoders. Several H.264/AVC solutions are available, such as the JM reference software and the x264 free software library. However, libraries for GPU acceleration are either unavailable or copyrighted. Some work on accelerating video encoding using GPGPU can be found in the literature. The work in [1] presents sequential and parallel implementations using OpenMP and also proposes a theoretical model for GPGPU. The authors of [2] parallelized the SAD (sum of absolute differences) block using different methods, such as 4x4 SAD calculation and variable-size SAD calculation, and achieved a speedup of 10-11.

The work in [3], another attempt to parallelize SAD, achieves a massive speedup in SAD calculation. We selected the full search algorithm because [4] clearly shows that it performs better than the alternatives. Our use of the DCT (discrete cosine transform) for image compression is motivated by [5], which compares the results obtained with the DCT and the FFT and shows that the DCT is better.

III. Proposed Method

The video codec is implemented both serially and in parallel.

The components of the video codec are shown in Figure 1. Each block is implemented serially and in parallel.
Encoding: Motion Estimation, DCT, Quantization and Entropy Encoding
Decoding: Entropy Decoding, De-quantization, Inverse DCT and Motion Compensation


Serial Implementation of Motion Estimation:

Algorithm: Full Search. The SAD and the motion vector are calculated in this block.

Parallel Implementation (Motion Estimation):
1. Declare GPU memory pointers
2. Allocate memory for the GPU pointers
3. Copy the frame to GPU memory
4. Launch the kernel
5. Copy the frame back from GPU memory

Serial Implementation (DCT):

The following steps are followed for DCT:
1. Read a frame
2. Send this frame to the DCT block
3. Extract a block (8x8) from the frame
4. Pass the block to the DCT function

Parallel Implementation (DCT):

Note that the device is the GPU and the host is the CPU.
1. Declare GPU memory pointers
2. Allocate GPU memory pointers
3. Read a frame
4. Transfer the frame to the GPU
5. Launch the kernel (function call)
6. Then copy back the results to the CPU

Serial Implementation (Quantization):

The following steps are followed for quantization:
1. Read a frame
2. Send this frame to the Quantization block
3. Extract a block (8x8) from the frame
4. Divide the block by the quantization matrix
5. Floor the quantized values to integers and return the result

Parallel Implementation (Quantization):

The following steps are performed to implement quantization in CUDA:
1. Declare GPU memory pointers
2. Allocate GPU memory pointers
3. Read a frame
4. Transfer the frame to the GPU
5. Launch the kernel (function call)
6. In the kernel launch, the first argument is the number of blocks and the second is the number of threads
7. Then copy back the results to the CPU

Serial Implementation of Encoding:

In entropy encoding, we first create a node for each element, containing its occurrence count, probability, symbol and a pointer to the next node. The nodes are then sorted according to their probability. After the nodes are arranged in order of increasing probability, a tree is built by taking the two nodes with the minimum probabilities, creating a new joint node and assigning it the joint probability of the two nodes. The final tree is obtained by repeating this process (arranging and creating a new node). A hash table stores, for each symbol, the symbol itself, the length of its code word and its code. This hash table is transmitted and is used in the decoding process to recover the original symbols.

Parallel Implementation (Encoding): According to [6], a GPGPU implementation would not be very effective for the Huffman encoding block. The major factors that restrict its parallelism are context-based data dependence, memory access dependence and control dependence, which make it unfavourable for the GPU.

IV. Experimental results

Experimental Setup: The software used is Visual Studio and the hardware is an Nvidia Tesla K40 (Kepler) GPGPU with 2880 CUDA cores. Results are taken for two input videos, Foreman (.yuv) and Grandma (.yuv). For each block, the serial time and the parallel time (in seconds) are measured, and the percentage speedup is calculated for each result by the formula % Speedup = (Serial Time - Parallel Time) / Serial Time × 100.

1. Motion Estimation Results

1. Video Name: Foreman (.yuv)

Frames   Execution Time Serial (seconds)   Threads   Execution Time Parallel (seconds)
300      111                               44        5.99
                                           88        3.06

Percentage Improvement: % Speedup = 94.45%
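As an illustration of the full search motion estimation block, the following minimal Python sketch evaluates the SAD for every candidate displacement inside a small search window and keeps the displacement with the smallest SAD as the motion vector. The block size, search range, function names and toy frames are assumptions chosen for illustration and are not taken from our implementation. The sketch runs serially on the CPU, whereas in the parallel version each candidate can be evaluated by a separate GPU thread.

```python
def sad(cur, ref, bx, by, rx, ry, n):
    """Sum of absolute differences between two n x n blocks."""
    return sum(
        abs(cur[by + j][bx + i] - ref[ry + j][rx + i])
        for j in range(n)
        for i in range(n)
    )


def full_search(cur, ref, bx, by, n=4, search=2):
    """Return (best_sad, (dx, dy)) for the block at (bx, by) in `cur`.

    Every displacement in [-search, search] x [-search, search] is tested,
    which is what makes the search "full".
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = bx + dx, by + dy
            # Skip candidates that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + n > width or ry + n > height:
                continue
            cost = sad(cur, ref, bx, by, rx, ry, n)
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best


# Toy frames (hypothetical data): the current frame is the reference
# frame shifted one pixel to the right.
ref = [[(3 * x + 7 * y) % 31 for x in range(8)] for y in range(8)]
cur = [[ref[y][max(x - 1, 0)] for x in range(8)] for y in range(8)]
best_sad, motion_vector = full_search(cur, ref, bx=2, by=2)
```

For these toy frames the best match lies one pixel to the left in the reference frame, so the search returns a SAD of 0 with motion vector (-1, 0).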
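The tree construction and code table described for the entropy encoding block can be sketched in Python as follows. This is a minimal illustration under stated assumptions: it uses a heap instead of a sorted linked list, orders nodes by occurrence count (which is equivalent to ordering by probability), and the function names and sample data are hypothetical rather than taken from our implementation.

```python
import heapq
from collections import Counter


def huffman_table(symbols):
    """Build a {symbol: code word} table by repeatedly merging the two
    least frequent nodes into a joint node."""
    counts = Counter(symbols)
    # Heap entries are (count, tie_breaker, tree); the unique tie_breaker
    # keeps comparisons from ever reaching the tree itself.
    heap = [(count, order, ("leaf", sym))
            for order, (sym, count) in enumerate(sorted(counts.items()))]
    if not heap:
        return {}
    if len(heap) == 1:
        return {heap[0][2][1]: "0"}
    heapq.heapify(heap)
    next_order = len(heap)
    while len(heap) > 1:
        c1, _, t1 = heapq.heappop(heap)
        c2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (c1 + c2, next_order, ("node", t1, t2)))
        next_order += 1

    table = {}

    def walk(tree, prefix):
        if tree[0] == "leaf":
            table[tree[1]] = prefix
        else:
            walk(tree[1], prefix + "0")
            walk(tree[2], prefix + "1")

    walk(heap[0][2], "")
    return table


def encode(symbols, table):
    """Concatenate the code words of all symbols into one bit string."""
    return "".join(table[s] for s in symbols)


def decode(bits, table):
    """Recover the symbols from a bit string using the transmitted table."""
    inverse = {code: sym for sym, code in table.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out


# Hypothetical stream of quantized coefficients.
data = [0, 0, 0, 0, 1, 1, 2, 3]
table = huffman_table(data)
bits = encode(data, table)
restored = decode(bits, table)
```

The most frequent symbol receives the shortest code word, and decoding with the same table recovers the original sequence, which is why the table has to be transmitted with the encoded data.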
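The percentage speedup can be computed as in the short Python sketch below, in which the ratio is multiplied by 100 to express it as a percentage. The function name is an assumption made for illustration. The example call uses the Grandma motion estimation timings reported in this paper (72 s serial, 4.18 s parallel).

```python
def percent_speedup(serial_time, parallel_time):
    """Percentage reduction in execution time relative to the serial run."""
    if serial_time <= 0:
        raise ValueError("serial_time must be positive")
    return (serial_time - parallel_time) / serial_time * 100.0


# Grandma motion estimation: 72 s serial versus 4.18 s in parallel.
grandma_me = percent_speedup(72.0, 4.18)  # about 94.19
```

Note that this measure is the fraction of time saved, so it is bounded by 100% and differs from the speedup factor (serial time divided by parallel time).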
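The serial DCT steps (extract each 8x8 block from the frame and pass it to the DCT function) can be sketched in Python as follows. This is an illustrative sketch only: it uses the direct form of the orthonormal 2-D DCT-II, the scaling convention, function names and test frame are assumptions, and it runs on the CPU. Because every block is transformed independently, the blocks map naturally onto parallel GPU threads.

```python
import math

N = 8  # block size used by the DCT block


def dct_2d(block):
    """Orthonormal 2-D DCT-II of an N x N block (direct O(N^4) form)."""
    def alpha(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = alpha(u) * alpha(v) * total
    return out


def blocks(frame):
    """Yield (row, col, block) for each full N x N block of the frame."""
    for r in range(0, len(frame) - N + 1, N):
        for c in range(0, len(frame[0]) - N + 1, N):
            yield r, c, [row[c:c + N] for row in frame[r:r + N]]


# Hypothetical 16x16 frame with constant intensity 10.
frame = [[10] * 16 for _ in range(16)]
coefficients = [dct_2d(block) for _, _, block in blocks(frame)]
```

For a constant block all the energy ends up in the DC coefficient (80 for this frame under the orthonormal scaling) and the remaining coefficients are zero up to rounding error, which is the energy compaction that makes the DCT useful for compression.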
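The serial quantization steps (divide each 8x8 block element-wise by the quantization matrix and floor the result) can be sketched in Python as below. The uniform quantization matrix, the function names and the sample block are assumptions made for illustration, since the actual matrix is not listed in this paper. The matching de-quantization used by the decoder is included for completeness.

```python
import math

N = 8

# Hypothetical uniform quantization matrix; a real codec would use a
# matrix tuned to the visual importance of each frequency.
QMATRIX = [[16] * N for _ in range(N)]


def quantize(block, qmatrix):
    """Divide each coefficient by its quantizer step and floor the result.

    math.floor rounds toward negative infinity, so negative coefficients
    are rounded down as well.
    """
    return [[math.floor(c / q) for c, q in zip(brow, qrow)]
            for brow, qrow in zip(block, qmatrix)]


def dequantize(levels, qmatrix):
    """Decoder side: multiply each level by its quantizer step."""
    return [[lvl * q for lvl, q in zip(lrow, qrow)]
            for lrow, qrow in zip(levels, qmatrix)]


# Hypothetical block of DCT coefficients with two non-zero entries.
block = [[0.0] * N for _ in range(N)]
block[0][0] = 80.0
block[0][1] = -7.5
levels = quantize(block, QMATRIX)
restored = dequantize(levels, QMATRIX)
```

Each element is processed independently of the others, so in a parallel version one thread can handle one coefficient.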


2. Video Name: Grandma (.yuv)

Frames   Execution Time Serial (seconds)   Threads   Execution Time Parallel (seconds)
870      72                                44        4.18
                                           88        3.98

Percentage Improvement: % Speedup = 94.19%

2. DCT Results

1. Video Name: Foreman (.yuv)

Frames   Execution Time Serial (seconds)   Threads   Execution Time Parallel (seconds)
300      293.42                            44        110.15

Percentage Improvement: % Speedup = 62.45%

2. Video Name: Grandma (.yuv)

Frames   Execution Time Serial (seconds)   Threads   Execution Time Parallel (seconds)
870      120.38                            16        97.54
                                           33        48.94

Percentage Improvement: % Speedup = 59.34%

3. Quantization Results

1. Video Name: Foreman (.yuv)

Frames   Execution Time Serial (seconds)   Threads   Execution Time Parallel (seconds)
300      0.546184                          16        0.518
                                           32        0.344
                                           44        0.1744

Percentage Improvement: % Speedup = 68.05%

2. Video Name: Grandma (.yuv)

Frames   Execution Time Serial (seconds)   Threads   Execution Time Parallel (seconds)
870      0.430                             22        0.247
                                           44        0.235

Percentage Improvement: % Speedup = 45.34%

V. Conclusion

Parallel implementation of the video codec on GPGPU greatly reduced the execution time required for compression and decompression. The major advantage of H.264 is its high compression rate, which yields video much smaller than the original and thus eases transmission and storage. It is very efficient for mobile surveillance applications and is also popular for its better quality in real-time playback. Another advantage is that H.264 has backward compatibility with older standards. It allows video from higher-resolution cameras to be streamed over the Internet. Audio can also be compressed separately and carried in combination with the H.264 video stream. H.264/AVC is a standard, so devices from different manufacturers can communicate, exchange data and use the information via this standard. Further work can be done in the field of real-time video streaming: for now the implementation works on stored videos, but it can also be used for real-time video. In addition, there were block artefacts in the video, which can be reduced by using an averaging filter. This will help in maintaining the video quality.

References

[1] E. Monteiro, B. Vizzotto, C. Diniz, M. Maule, B. Zatt, and S. Bampi, "Parallelization of full search motion estimation algorithm for parallel and distributed platforms," International Journal of Parallel Programming, pp. 1-26, 2014.
[2] W.-N. Chen and H.-M. Hang, "H.264/AVC motion estimation implmentation on compute unified device architecture (CUDA)," in Multimedia and Expo, 2008 IEEE International Conference on, 2008, pp. 697-700.
[3] S. Mehta, A. Misra, A. Singhal, P. Kumar, and A. Mittal, "A high-performance parallel implementation of sum of absolute differences algorithm for motion estimation using CUDA," in HiPC Conf, 2010, p. 6.
[4] E. Monteiro, M. Maule, F. Sampaio, C. Diniz, B. Zatt, and S. Bampi, "Real-time block matching motion estimation onto GPGPU," in Image Processing (ICIP), 2012 19th IEEE International Conference on, 2012, pp. 1693-1696.
[5] I. A. Ismaili, S. A. Khowaja, and W. J. Soomro, "Image Compression, Comparison between Discrete Cosine Transform and Fast Fourier Transform and the problems associated with DCT," in International Conference on Image Processing, Computer Vision and Pattern Recognition, 2013, pp. 962-965.
[6] H. Su, M. Wen, N. Wu, J. Ren, and C. Zhang, "Efficient parallel video processing techniques on GPU: from framework to implementation," The Scientific World Journal, vol. 2014, 2014.
