Parallel Full HD Video Decoding for Multicore Architecture

S. Sankaraiah⋆, Lam Hai Shuan, C. Eswaran, and Junaidi Abdullah

Centre for Visual Computing, Multimedia University, Cyberjaya, Malaysia
{*[email protected]}{hslam,eswaran,junaidi}@mmu.edu.my

Abstract. Nowadays, the multi-core architecture is adopted everywhere in the design of contemporary processors in order to boost the performance of multitasking applications. This paper exploits the multi-core capability to speed up full HD video decoding so as to meet real-time display requirements. The Hantro 6100 H.264 decoder is chosen as the reference decoder, and its serial decoding algorithm is replaced with a parallel decoding algorithm. In this research work, macroblock-level parallelism is implemented using an enhanced version of macroblock region partitioning (MBRP) for the parallel decoding of H.264 video. The results show that the workloads are well balanced among the processor cores. It is observed that the maximum speedup values are attained when the decoder runs with 4 threads on a 4-core system and on an 8-logical-core system configuration. Moreover, no degradation of visual quality is observed throughout the decoding process.

Keywords: Multicore processor, speed-up, macroblock, parallel, load balancing, Hantro decoder.

1 Introduction

On a general multi-core platform, an application without a parallel implementation loads all of its work onto one processor and keeps the other processors idle. Such a process does not utilize the available resources and behaves like a single-processor program. With a parallel implementation, however, the multi-core architecture provides a platform for speeding up the application through thread-level and data-level parallelism rather than by increasing the operating frequency of the processor. In the proposed research work, the Hantro 6100 H.264 decoder is chosen, since it is fully open source and is used in the Android mobile operating system [1]. The decoder supports only the baseline profile of H.264 video decoding and primarily targets mobile devices. Its source code is written entirely in ANSI C, so it can be easily ported to different platforms. As implemented for the mobile platform, it supports at most CIF resolution (352×288). In order to examine real full HD quality, the Hantro decoder implementation is modified for the Windows platform using advanced data structures and some Windows-supported libraries. However, the decoder is still unable to decode and render full HD video smoothly, because only one processor is active and the frame rate of video playback is perceived at approximately 10 FPS on average. This is far below the rate needed for real-time video playback, which is at least 30 FPS on average. From this it is observed that the Hantro H.264 decoder is implemented in a serial manner and that one processor cannot handle the full HD decoding task, since the H.264 decoding process is computation-intensive and time-consuming [2]. From this analysis, this research work is motivated by the need

⋆ Corresponding author: S. Sankaraiah, email: [email protected]

T. Herawan et al. (eds.), Proceedings of the First International Conference on Advanced Data and Information Engineering (DaEng-2013), Lecture Notes in Electrical Engineering 285, DOI: 10.1007/978-981-4585-18-7_36,  Springer Science+Business Media Singapore 2014


for parallel decoding of H.264 and by the non-uniform distribution of work among the processor cores. Hence, the Hantro H.264 video decoding process can be sped up using data-level parallelism. By Amdahl's law, the speedup of the decoding process is at best bounded by a factor equal to the number of cores in the processor [3]. In this research work, macroblock-level parallelism is implemented in the video decoder and the workload is distributed equally among the processor cores. The remainder of the paper is organized as follows: Section 2 reviews the H.264 video decoding process and techniques for H.264 parallelization. Section 3 provides a detailed explanation of the design and implementation. Section 4 presents the results of the proposed implementation, where the performance (speed), visual quality, thread utilization and cache misses are analyzed and discussed. Section 5 concludes the research work with recommendations for future work.

2 Literature Review

The H.264 decoding process starts with the extraction of a compressed video frame from the H.264 bitstream. The compressed frame undergoes entropy decoding to obtain the transformed residual image and its associated prediction information. The transformed residual image then undergoes inverse quantization followed by the inverse Discrete Cosine Transform (IDCT) to obtain the original residual image. The reconstructed image passes through the deblocking filter before being stored into the frame buffer as the reference for the next frame or sent out for display. These stages of the H.264 decoding process take a long time in a serial implementation, which makes real-time decoding impossible. Thus, the H.264 process requires parallelization to speed up decoding. According to C. Meenderinck et al., H.264 parallelization can be performed at either task level or data level [4]. In task-level parallelism, the H.264 decoding task is decomposed into smaller entities or sub-tasks, each of which is assigned to a different processor or core. In data-level parallelism, different portions of the data are handled by different processors or cores. Data-level parallelism has several advantages over task-level parallelism. It has higher scalability, owing to the huge amount of data stored in a video frame for processing [5]. Besides, when the H.264 decoder is parallelized at the data level, load balancing among the processors is more easily achieved than with task-level parallelism, since the time taken to process each portion of data is roughly the same. According to C. Meenderinck et al., data-level parallelism can be exploited at different levels of the H.264 structure, such as the GOP level, frame level, slice level and macroblock level [4].
In GOP-level parallelism, each Group Of Pictures (GOP) is assigned to a different processor or core, so each processor or core decodes its assigned GOP independently [6]. According to A. Gurhanli et al. and S. Sankaraiah et al., the size of the GOP affects the speedup and memory usage [6]-[9]. Their results show that a small GOP size yields a significant speedup due to lower memory usage. The lower memory usage in turn causes less cache pollution, which is one of the factors affecting performance. Frame-level parallelism suffers from a scalability problem due to the limited number of B-frames that can exist within a video sequence. According to Y. Chen et al., the performance of frame-level parallelism can be improved by accelerating P-frame decoding, which unlocks more B-frames for decoding and thereby exposes more parallelism [10], [11]. The advantage of slice-level parallelism is that there is no communication or synchronization overhead between processors and cores, as the slices are independent of each other. Besides, slice-level parallelism consumes fewer memory resources than GOP-level parallelism, and the effect of cache pollution can be minimized. The disadvantage of


slice-level parallelism is that it suffers from a scalability problem, because there can be no more than 8 slices in a frame for H.264 video [12]. Hence, slice-level parallelism may not be suitable for the decoder due to this limited parallelism. Macroblock-level parallelism has the advantage of low data-communication overhead. Besides, it consumes fewer memory resources than the other levels of parallelism [13], [14]. In macroblock-level parallelism, several macroblocks are assigned to different processors for concurrent decoding.

3 Design and Implementation

In this research work, the source code of the Hantro 6100 H.264 decoder is modified in two aspects. The original Hantro 6100 H.264 decoder is implemented only for the Android mobile platform, which is not suitable for analyzing full HD resolution applications. In order to exploit the capability of the multicore processor and to render full HD video, the first task is to port the Hantro 6100 H.264 decoder to the Windows platform. The second modification is to replace the serial decoding algorithm with a parallel decoding algorithm to improve the speedup and render full HD video in real time. This paper focuses on the H.264 baseline profile. The main contribution of this research work is the decoding of full HD video at 1920×1080 resolution without any loss of quality. The proposed macroblock-level parallelism is implemented using enhanced macroblock region partitioning (MBRP) to exploit data-level parallelism in H.264 video [13]. The programming model used in this research work is based on the fork-join model as specified by the OpenMP standard [3]. In the parallel region, several threads are created and each thread handles the decoding task in one macroblock region only. The width of each macroblock region is determined by dividing the width of the frame by the number of cores or processors. Each thread identifies the starting and ending positions of its region based on its own thread ID, and decodes the macroblocks within its region once all the data dependencies are resolved. However, whenever there is an unresolved inter-thread dependency or no macroblock is available for decoding, the thread continues with entropy decoding instead of polling for the dependency to be resolved. This design effectively balances the workload among the threads, so that no thread falls into an idle state.
Hence, threads do not have to wait for macroblocks to become available for decoding within their regions. The detailed step-by-step procedure of the proposed algorithm is as follows:

1. Obtain the first macroblock index of the region.
2. Check whether the current macroblock is entropy decoded; if not, perform entropy decoding of one macroblock, otherwise proceed to the next step.
3. Check whether the data dependencies of the current macroblock are resolved; if not, perform entropy decoding of one macroblock, otherwise proceed to the next step.
4. Decode the current macroblock, followed by the deblocking filter.
5. Check whether the end of the region has been reached; if so, proceed to step 7, otherwise proceed to the next step.
6. Obtain the next macroblock index of the region and proceed to step 2.
7. All the macroblocks within the region are completely decoded; wait for all threads to reach the barrier.

The video rendering is implemented with the Simple DirectMedia Layer (SDL) library API [15]. SDL is a fully open-source multimedia development API that can be used on different platforms. The proposed research work invokes the


overlay functions defined in the SDL library for rendering the decoded video frames. The tools for evaluating the visual quality (PSNR) and the elapsed time are incorporated in the source code as a data logger.

4 Results and Discussion

4.1 Testing Environment and Platform

In this research work, an Intel Core i7-930 multicore processor running the Windows 7 64-bit operating system is used for profiling the parallel Hantro 6100 H.264 decoder. It has a clock speed of 2.8 GHz (turbo up to 3.06 GHz), 32 KB D-cache (L1), 32 KB I-cache (L1), 256 KB L2 cache, 8 MB L3 cache and 8 GB RAM. Before profiling the decoder, some additional precautions were taken to isolate unnecessary interference, as follows:

1. Disabled the LAN card
2. Changed the theme from Windows Aero to Windows Basic
3. Closed all unnecessary background processes
4. Disabled the real-time protection of the antivirus software
5. Disabled the disk defragmentation task

All these steps are taken to minimize processor consumption before profiling the decoder. The samples of both PSNR and elapsed time with respect to the number of threads are recorded using a data logger embedded inside the decoder. The profiles of both thread workload and thread concurrency are obtained using the Parallel Amplifier of Intel Parallel Studio. The compiler used is the Intel C++ compiler. The Big Buck Bunny H.264 video sequence is used throughout the whole profiling process.

4.2 Performance Metrics

The measurement of the decoder's elapsed time is intended for benchmarking the performance of the decoder with respect to the number of threads created. The measurement of PSNR represents the visual quality of the video with respect to the number of threads created. The performance of the decoder is evaluated based on the elapsed time of decoding, defined as the difference between the start time and end time of the decoding process. The shorter the elapsed time, the higher the performance of the decoder and the higher the speedup. The measurements of both PSNR and elapsed time are embedded inside the source code of the decoder as a data logger. The data are collected both with Hyper-Threading (HT) technology enabled and with HT technology disabled. Figures 1 and 2 show the data collected before thread affinity is implemented, whereas Figure 3 shows a comparison of the data collected before and after thread affinity is implemented. The data collection is performed with video rendering disabled. For each sample of elapsed time, the speedup is calculated as the ratio of the elapsed time of serial execution to that of parallel execution, in accordance with Amdahl's law [3].

4.2.1 Performance Analysis with Hyper-Threading Technology Disabled

When HT technology is disabled, only 4 of the 8 processing elements, i.e. the 4 physical cores (1 thread per core), are active in the Intel Core i7 processor. From Figure 1, the speedup increases gradually as the number of threads increases from 1 to 4. This shows that decoding performance can be sped up by utilizing the multicore architecture. However, when HT technology is disabled and the decoder is run with more than


4 threads, the speedup decreases drastically, as shown in Figure 1. This problem arises because the 4 physical cores are shared among the 6 or 8 threads. When the number of threads exceeds the number of available physical cores, the operating system performs thread switching to move each physical core from one thread to another. This poses a serious problem for the decoder speedup, because the threads keep waiting, functioning like a thread pool.

Fig. 1. The graph of elapsed time and speedup with respect to the number of threads (HT Technology is not enabled and no thread affinity implemented)

Discussion: The parallel algorithm of the decoder is designed such that each thread waits for another thread to continue its execution by checking the status of a reference flag. Thread-switching activity can thus cause a data race condition while the value of the reference flag is being read. This data race condition incurs extra latency and eventually degrades the overall decoding performance. Another problem arises from the extra waiting cycles a thread spends attempting to acquire the mutex (mutual exclusion) lock. Whenever no macroblock is available for decoding, each thread enters the shared region to perform entropy decoding [16]-[19]. Only one thread is allowed to access the shared region at a time. Before entering it, each thread tries to acquire a mutex lock to gain access and execute the code inside the shared region. A thread can therefore fall into a waiting state due to repeated failures in acquiring the mutex lock. This issue also incurs extra latency and eventually degrades the overall decoding performance. Together, the data race condition and the thread-locking overhead give rise to a thread synchronization problem.

4.2.2 Performance Analysis with Hyper-Threading Technology Enabled

When HT technology is enabled, all 8 processing elements, i.e. the 8 logical cores (2 threads per core), are active in the Intel Core i7 processor. Figure 2 shows an improvement in speedup compared with that of Figure 1. This is due to Hyper-Threading technology, which provides Simultaneous Multithreading (SMT) to allow seamless concurrent execution of multiple threads. With the 8 logical cores, up to 8 threads can execute independently without sharing of cores between the threads. This effectively minimizes the effect of the data race condition


and of the extra waiting cycles in acquiring the mutex lock. However, the speedup is not as high as expected, possibly due to thread switching by the operating system. In order to resolve the thread-switching problem, thread affinity has to be implemented.

Fig. 2. Comparison of elapsed time and speedup with respect to the number of threads (HT technology is enabled and no thread affinity implemented)

4.2.3 Performance Analysis with Hyper-Threading Technology Enabled and Thread Affinity Implemented

Thread affinity is the binding of a single thread to a single core or processor so that the thread reuses the cache resources of that processor. As a result of implementing thread affinity, the process may run more efficiently because the number of cache misses is minimized. After thread affinity is implemented, each thread is permanently mapped to one logical core, so the operating system does not switch the thread to another core during the program's runtime.

Fig. 3. The comparison of speedup before and after thread affinity is implemented

Discussion: As shown in Figure 3, the thread affinity implementation increases


the speedup progressively compared with the case without thread affinity. It is also observed that, after thread affinity is implemented, the speedup increases with the number of threads from 1 to 8. However, from 4 threads onwards the speedup ramps up only slowly, because it is gradually approaching its limit. According to Amdahl's law, the maximum speedup is restricted by the part of the program that cannot be parallelized, and the additional logical cores can give a maximum speedup of only about 33% [?]. Therefore, the speedup of the decoder is restricted by its sequential parts, such as the initialization process. From the graphs in Figures 1, 2 and 3, it is observed that the number of threads created should be at least equal to the number of processor cores in order to obtain the maximum performance; hence, it is advisable not to create more threads than the number of processing cores. Thus, it is justified that the implementation of thread affinity improves the stability of the decoder's performance.

5 Conclusion and Recommendation

In the proposed implementation, each thread manages its own task without the control of a master thread. The results obtained from the proposed parallel algorithm (MBRP) show that the speedup values are marginal when it runs with 4 threads without thread affinity. This is partly due to the entropy decoding task (CAVLC) being shared among the threads. After thread affinity is implemented, the stability of the decoder's performance gradually improves. From the results, it is observed that the workloads among the threads are well balanced by the efficient design of the algorithm, without any deviation among the threads' workloads. Besides, the video rendered on screen shows the same visual quality (PSNR) irrespective of the number of threads created. The results also show that the decoder exhibits a good level of thread concurrency and CPU utilization. Furthermore, the effect of cache misses is minimized after thread affinity is implemented. In order to improve the speedup further, it is recommended that the entropy decoding task be separated from the parallel region. Entropy decoding is a slow process, as it has to be performed sequentially; hence, it should be assigned to another core with a higher processing speed. Besides, it is recommended that MBRP be implemented in both the spatial and temporal domains. In that case, the decoding of macroblocks in the next frame can start without waiting for all macroblocks in the current frame to finish, minimizing the time each thread waits for macroblocks.

References

1. Hantro 6100 H.264/AVC Decoder, http://www.on2.com/hantro-embedded/software-video-codecs/6100-h.264-avc-decoder
2. I. E. Richardson: The H.264 Advanced Video Compression Standard. 2nd ed., Wiley, UK (2010)
3. M. J. Quinn: Parallel Programming in C with MPI and OpenMP. McGraw-Hill, USA (2003)
4. C. Meenderinck, A. Azevedo, B. Juurlink, M. Alvarez Mesa, A. Ramirez: Parallel Scalability of Video Decoders. Journal of Signal Processing Systems 57(2), 173–194 (2009)
5. E. B. van der Tol, E. Jasper, R. H. Gelderblom: Mapping of H.264 Decoding on a Multiprocessor Architecture. In: Proceedings of the SPIE Conference on Image and Video Communications, pp. 707–709. SPIE Press, New York (2003)
6. A. Gurhanli, C. C. Chen, S. Hung: GOP-level parallelization of the H.264 decoder without a start-code scanner. In: 2nd International Conference on Signal Processing Systems (ICSPS) (2010)


7. S. Sankaraiah, H. S. Lam, C. Eswaran, J. Abdullah: GOP Level Parallelism on H.264 Video Encoder for Multicore Architecture. In: IPCSIT, pp. 127–132. IACSIT Press, New York (2011)
8. S. Sankaraiah, H. S. Lam, C. Eswaran, J. Abdullah: Performance Optimization of Video Coding Process on Multi-Core Platform Using GOP Level Parallelism. International Journal of Parallel Programming, Springer, 1–17 (2013)
9. A. Gurhanli, C. C. Chen, S. Hung: Coarse Grain Parallelization of H.264 Video Decoder and Memory Bottleneck in Multi-Core Architectures. International Journal of Computer Theory and Engineering 3(3), 375–381 (2011)
10. Y. Chen, E. Li, X. Zhou, S. Ge: Implementation of H.264 Encoder and Decoder on Personal Computers. Journal of Visual Communications and Image Representation 17 (2006)
11. Y. Chen, X. Tian, S. Ge, M. Girkar: Towards Efficient Multi-Level Threading of H.264 Encoder on Intel Hyper-Threading Architectures. In: Proceedings of the 18th International Parallel and Distributed Processing Symposium, pp. 63–67. IEEE Press, New York (2004)
12. A. Azevedo, C. Meenderinck, B. Juurlink, A. Terechko, J. Hoogerbrugge, M. Alvarez, A. Ramirez: Parallel H.264 Decoding on an Embedded Multicore Processor. In: 4th International Conference on High Performance and Embedded Architectures and Compilers (2009)
13. S. Sun, D. Wang, S. Chen: A Highly Efficient Parallel Algorithm for H.264 Encoder Based on Macroblock Region Partition. In: 3rd International Conference on High Performance Computing and Communications, pp. 577–585. IEEE Press, New York (2007)
14. M. A. Mesa, A. Ramirez, A. Azevedo, C. Meenderinck, B. Juurlink, M. Valero: Scalability of Macroblock-level Parallelism for H.264 Decoding. In: 15th International Conference on Parallel and Distributed Systems, pp. 236–243 (2009)
15. Simple DirectMedia Layer, http://www.libsdl.org
16. Y. Kim, J. Kim, S. Bae, H. Baik, H. Song: H.264/AVC decoder parallelization and optimization on asymmetric multicore platform using dynamic load balancing. In: IEEE International Conference on Multimedia and Expo, pp. 1001–1004 (2008)
17. C. Meenderinck, A. Azevedo, M. Alvarez, B. Juurlink, A. Ramirez: Parallel Scalability of H.264. In: First Workshop on Programmability Issues for Multi-Core Computers, pp. 1–12. Goteborg, Sweden (2008)
18. M. Alvarez Mesa, A. Ramirez, A. Azevedo, C. Meenderinck, B. Juurlink, M. Valero: Scalability of Macroblock-level Parallelism for H.264 Decoding. In: 15th International Conference on Parallel and Distributed Systems, pp. 236–243. Shenzhen, China (2009)
19. M. Kim, J. Song, D. H. Kim, S. Lee: H.264 Decoder on Embedded Dual Core with Dynamically Load-balanced Functional Partitioning. In: 17th IEEE International Conference on Image Processing. IEEE Press, New York (2010)