2011 International Conference on Circuits, System and Simulation IPCSIT vol.7 (2011) © (2011) IACSIT Press, Singapore

GOP Level Parallelism on H.264 Video Encoder for Multicore Architecture

S. Sankaraiah 1, H.S. Lam 2, C. Eswaran 1+ and Junaidi Abdullah 1++

1, 1+, & 1++ Faculty of Information Technology, MultiMedia University, Cyberjaya, Selangor, Malaysia.
2 Faculty of Engineering, MultiMedia University, Cyberjaya, Selangor, Malaysia.
{sreemula.sankaraia10, hslam, eswaran, junaidi.abdullah}@mmu.edu.my

Abstract: H.264 is a popular codec for encoding videos that are hosted on video servers and delivered over the internet. Achieving real-time encoding remains a challenging problem. A possible way to minimize the encoding time is to develop applications with a high level of Thread-Level Parallelism (TLP) that exploit the power of multi-core processors. Parallelization strategies at various levels, such as macro-block level, slice level and frame level, have been proposed by various authors. Most of these techniques suffer from limited scalability and data dependency. In this paper we propose a high-level parallelization method based on Group-Of-Pictures (GOP). In this method, each GOP is encoded independently and the frames being referenced are included within the GOP. The OpenMP programming model is used to restructure the H.264 encoder for GOP-level parallelism, so that the available hardware resources can be exploited for concurrent processing. The results show that the implemented strategy provides a high level of parallelism and efficiently exploits the capabilities of the multi-core system. The speedup achieved using the proposed method is 5.6 to 10 times higher compared to a well-optimized sequential code implementation.

Keywords: Video encoding, H.264, Parallel Programming, TLP, GOP, ME, TP, OpenMP, Multi-core, Dual Processor (DP) and Quad Processor (QP).

1. Introduction

H.264 is currently the most popular high-quality video coding standard [1]. The standard is designed to serve a broad range of applications, from low to high bitrates, from low to high resolutions, and across a variety of networks and systems, i.e., internet streams, mobile streams, disc storage and broadcast. The H.264 codec has many advanced features that make the encoding process require more computation power than the other existing standards [2]. Hence, there is a need for speeding up the encoder. One possible way of improving the speed is to process the data in parallel [3]. This paper describes how to efficiently restructure the H.264 encoder using GOP parallelization.

The remainder of this paper is organized as follows. In Section 2, we provide an overview of the parallelization of H.264. In Section 3, the simulation environment and the experimental methodology used to evaluate the dynamic parallelism with the group of pictures (GOP) access pattern are presented. In Section 4, the implementation of H.264 parallelism with the GOP pattern on multicore is discussed in detail. In Section 5, the simulation results, the analysis of the scalability and performance of the GOP-level parallelism, and the impact of parallelization overhead are presented. Section 6 contains the conclusion and possible future work.

2. Previous works on Parallelization of H.264

The high quality output of advanced video codecs such as H.264 comes at the price of increased computational complexity. As a result, the current high performance Uni-Processor (UP) architecture is not capable of providing the required performance [4]. Thus, it is necessary to exploit parallelism. The H.264 codec can be parallelized by using the Task-Level or the Data-Level Decomposition methods. In the Task-level Decomposition (TLD) method, the functional partitions of the algorithm are assigned to different processors. The main drawbacks of the TLD method are the load balancing issue and the scalability constraints. In the Data-level Decomposition (DLD) method, the data is divided into smaller parts and each part is assigned to a different processor. Therefore, each processor runs the same program but with a different set of data elements. In the H.264 encoding process, the DLD method can be implemented at various levels of the data structures, such as GOP level, frame level, slice level, macro-block level and block level.

The implementation of parallelism at various levels on the H.264 codec has been described in several papers. Rodriguez et al. implemented the H.264 encoder using frame-level parallelism combined with a group of frames on clustered workstations using the Message Passing Interface (MPI) [5]. Although real-time operation can be achieved with this approach, the latency is very high. Chen et al. presented a parallel implementation that encodes and decodes several B frames in parallel [6]. This limits the scalability to a few threads. This problem is solved in our proposed approach by dynamically detecting the dependencies and automatically exploiting the parallelism. Van der Tol et al. presented the exploitation of intra-frame MB-level parallelism and suggested combining it with frame-level parallelism [7]. In their method the frame-level parallelism is determined statically by the length of the motion vectors, while in our approach the parallelism is determined dynamically.

In terms of scalability, independence, load balancing and the utilization of processing cores, GOP-level parallelism has many advantages over other methods. Scalability can be easily achieved by increasing the number of processing cores and by applying homogeneous software optimization techniques to each core.
The same concept can be applied to full-HD (1920x1080) video encoding. Experiments show that, as the number of processing cores increases, the performance improves almost linearly. As per Moore's law, it is expected that the number of cores on a CMP will double every three years, resulting in approximately 150 high performance cores on a single die in the year 2017 [8]. This increases the challenge of developing highly scalable applications that exploit the capability of multi-core processors by balancing the load among the processing cores. Various techniques for analyzing scalability in terms of parallelism are suggested by Stenstrom et al. [9]. This paper focuses on a new parallelization strategy that provides sufficient scalability to fully utilize the processing cores in the future.

3. Methodology and Simulation environment

In this section, the tools and methodology used to implement and evaluate the dynamic scheduling based on the GOP-level parallelism technique are described. The computations on the processing cores are modeled accurately in terms of the number of cycles. The memory system is modeled using average transfer times with channel and bank contention. It is assumed that each core has its own L1 data cache and that data can be copied from other L1 caches through 4 channels. The processing cores share a distributed L2 cache with 8 banks and an average access time of 40 cycles. The average access time takes into account the L2 hits, misses, and the interconnect delays. With the modeling of the L2 bank contention, two cores will not access the same bank simultaneously. The multi-core programming model follows the task pool model. In this approach, one main thread and a number of slave threads are created. The task execution overhead is very low and the time to request a task is less than 2% of the entire GOP encoding time. The experimental results focus on the modified main profile of the H.264 standard, as this profile supports I, P and B frames.

The simulation was conducted using the JM 17.2 reference software compiled with Visual Studio 2008 on two platforms: (1) a Dell laptop with an Intel Core2 Duo T5750 CPU running at 2.0GHz under Windows XP, with 32KB L1 D-cache, 32KB L1 I-cache, 2MB 8-way set associative L2 cache and 2GB RAM; (2) a Dell desktop with an Intel Core2 Quad 9400 running at 3.0GHz under Windows 7 Ultimate 64-bit, with 64KB L1 D-cache, 64KB I-cache, 4MB 8-way set associative L2 cache and 4GB RAM. The encoding and elapsed time for each thread are measured with Intel Parallel Studio 2011 and AMD CodeAnalyst. All video sequences used in the simulation have QCIF and CIF resolutions.
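The task pool model described above can be illustrated with a minimal sketch. It is not the authors' OpenMP/JM implementation: it is written in Python for brevity, `encode_gop` is a hypothetical stand-in for the real encoder, and Python threads only show the scheduling structure (one main thread filling a pool, worker threads requesting tasks), not real multi-core speedup.

```python
import queue
import threading


def encode_gop(gop_index, frames):
    """Hypothetical stand-in for encoding one GOP; returns a fake bitstream tag."""
    return f"GOP{gop_index}:{len(frames)}frames"


def run_task_pool(gops, num_workers):
    """Main thread enqueues GOP tasks; worker threads pull them until a sentinel arrives."""
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            item = tasks.get()
            if item is None:          # sentinel: no more work for this worker
                break
            gop_index, frames = item
            encoded = encode_gop(gop_index, frames)
            with lock:                # protect the shared result table
                results[gop_index] = encoded

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for gop_index, frames in enumerate(gops):
        tasks.put((gop_index, frames))
    for _ in threads:                 # one sentinel per worker
        tasks.put(None)
    for t in threads:
        t.join()
    # Reassemble the output in display order, whatever order the workers finished in.
    return [results[i] for i in range(len(gops))]
```

Because each worker requests a new task only when it becomes idle, the load balances itself when some GOPs take longer to encode than others.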

4. Implementation of Parallel H.264

To achieve good data parallelism, the set of data which can be treated independently and fed to a processing element must be determined. In GOP-level parallelism, each GOP is handled by a separate thread. GOP-level parallelism uses a temporal division of frames to implement parallelism. This method assigns GOPs to different processor threads, and each thread processes multiple sequences of frames. For a GOP data access pattern, dependency exists among the frames within a GOP and there is no data dependency between two sets of GOPs; thus each thread can independently process each GOP set without referencing any frame outside the GOP. Figure 1 shows the GOP access pattern of frames in an independent manner. For this data access pattern, the memory hierarchy needs to store large amounts of data, but requires considerably less synchronization. This is due to the fact that the system exhibits higher granularities of parallelism. This higher level of granularity characterizes the data access pattern, and the system memory becomes a bottleneck as the smaller L1 and L2 memory levels are insufficient to hold multiple frames of data [10]. In the proposed approach, all the frames of a GOP are stored in a temporary buffer and sequentially transferred to the corresponding cores for processing. Odd numbered GOPs are processed by core 1 and even numbered GOPs are processed by core 2. In a dual core system, the two cores share the L2 cache memory, which is connected to the main memory with a separate bus. In the proposed GOP-level parallelism, closed GOPs are used and there is no reference between the two GOPs processed by the two cores. In this implementation, an additional core is not used for task scheduling, as one of the available cores is assigned to do this task.
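The partitioning described above can be sketched as follows. This is an illustrative Python sketch, not the authors' code: the function names are hypothetical, and the generalization from two cores to `num_cores` via round-robin is an assumption (the text only states the odd/even rule for a dual core system).

```python
def split_into_gops(num_frames, gop_size):
    """Divide frame indices 0..num_frames-1 into consecutive closed GOPs.

    The last GOP may be shorter if num_frames is not a multiple of gop_size.
    """
    return [
        list(range(start, min(start + gop_size, num_frames)))
        for start in range(0, num_frames, gop_size)
    ]


def assign_gops(num_gops, num_cores=2):
    """Round-robin mapping of 1-based GOP numbers to 1-based core numbers.

    With num_cores=2 this reproduces the rule in the text:
    odd numbered GOPs go to core 1, even numbered GOPs go to core 2.
    """
    assignment = {core: [] for core in range(1, num_cores + 1)}
    for gop_number in range(1, num_gops + 1):
        core = (gop_number - 1) % num_cores + 1
        assignment[core].append(gop_number)
    return assignment


# Example with the test configuration reported later: 300 frames, GOP size 15.
gops = split_into_gops(300, 15)
mapping = assign_gops(len(gops), num_cores=2)
```

Since the GOPs are closed, no frame index in one list is ever referenced while encoding another list, which is what allows the two cores to work without synchronizing on reference frames.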
Figure 2 shows the implementation of GOP-level parallelism with threads. Two GOP buffers are used for moving the raw images; they first store the frames whenever the buffers have space. The frames are then scheduled into 4 temporary buffers according to the frame types, namely I, P and B frames, as shown in Figure 2. There is one master thread handling the input/output processes, such as the checking of data dependency, and this master thread runs on whichever core is free. Four working threads are created to encode the frames waiting in the temporary buffers. The number of threads created depends on the number of processing cores available in the system. All the operations are synchronized sequentially through the GOP buffers by the master thread. Figure 3 shows the steps involved in the encoding process.

Fig 1: The GOP frame access pattern

Fig 2: Implementation of the GOP-level parallelism with threads

Fig 3: The flow of the encoding process

5. Experimental results and Discussions

In this section the experimental results are presented. The results include the values of PSNR, total encoding time, ME time and bit-rate of the video. Two different types of video sequences are considered for testing. In Table 1, the results for the Grandma video sequence with slow motion are presented. In Table 2, the results for the Foreman video sequence with high motion are presented. The results have been obtained by performing tests with 300 frames on both Dual-core and Quad-core processors, using GOP-level parallelism with an I frame as the starting frame. In Tables 1 and 2, the results obtained with GOP parallelism are compared with those obtained using the original JM. The size of the GOP is fixed at 15. The results show that the proposed method yields reduced encoding time and ME time with a small reduction in the bit rate. Further, it is noticed that the proposed method does not affect the PSNR value. To achieve optimum performance with higher speedup and lower bit-rate (without reducing the video quality), the size of the GOP should be carefully determined. Figures 4, 5 and 6 show the effect of GOP size on PSNR, encoding time and bit-rate respectively in a quad processor. From these figures, we note that GOP size 15 yields optimum results with regard to these quality parameters. The effect of the number of threads on PSNR in a quad processor is shown in Figure 7.

Parameters                  Original JM with DP and QP   15 GOP Parallelism with DP   15 GOP Parallelism with QP
Average PSNR (dB)           39.19                        39.19                        39.19
Total Encoding time (min)   116.18                       22.40                        11.72
Total ME time (min)         99.20                        18.43                        9.36
Bit rate (Kbit/s)           91.19                        85.26                        85.26

Table 1: The results of parallel encoding of less motion video sequence, Grandma_cif

Parameters                  Original JM with DP and QP   15 GOP Parallelism with DP   15 GOP Parallelism with QP
Average PSNR (dB)           39.50                        39.50                        36.23
Total Encoding time (min)   122.24                       24.44                        12.23
Total ME time (min)         101.39                       20.05                        10.89
Bit rate (Kbit/s)           93.56                        88.28                        88.28

Table 2: The results of parallel encoding of high motion video sequence, Foreman_cif
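As a worked check on the tabulated totals, the ratio of sequential to parallel encoding time can be computed directly from Tables 1 and 2. The sketch below is illustrative only: the dictionary layout and function names are assumptions, the numbers are copied from the tables, and the resulting ratios describe these tabulated totals only.

```python
# Total encoding times in minutes, copied from Table 1 (Grandma_cif) and Table 2 (Foreman_cif).
ENCODING_TIME_MIN = {
    "Grandma_cif": {"JM": 116.18, "DP": 22.40, "QP": 11.72},
    "Foreman_cif": {"JM": 122.24, "DP": 24.44, "QP": 12.23},
}


def speedup(sequential_time, parallel_time):
    """Ratio of the original JM encoding time to the GOP-parallel encoding time."""
    return sequential_time / parallel_time


def speedups_from_tables(times=ENCODING_TIME_MIN):
    """Return {sequence: {"DP": ratio, "QP": ratio}} for every tabulated sequence."""
    return {
        sequence: {cfg: speedup(t["JM"], t[cfg]) for cfg in ("DP", "QP")}
        for sequence, t in times.items()
    }


for sequence, ratios in speedups_from_tables().items():
    print(f"{sequence}: DP {ratios['DP']:.2f}x, QP {ratios['QP']:.2f}x")
```

For both sequences the tabulated encoding-time ratio is a little above 5 on the dual-core platform and close to 10 on the quad-core platform.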

Fig 4: GOP size vs PSNR (series: CIF and QCIF; x-axis: GOP size; y-axis: PSNR)

Fig 5: Encoding time vs GOP size (series: CIF and QCIF; x-axis: GOP size; y-axis: encoding time in min)

Fig 6: Bit-rate vs GOP size (series: CIF and QCIF; x-axis: GOP size; y-axis: bit-rate in Kbps)

Fig 7: The PSNR vs the number of threads

Figure 7 shows a constant PSNR, even when the number of threads is increased, for both the QCIF and CIF resolutions. The results show that there is no loss of video quality after exploiting the GOP-level parallelism. Table 3 shows a comparison of the performance parameters obtained for different processors during the encoding process [9,10]. The quad-core processor shows a good front-side-bus utilization rate. It is observed that the bus activities do not increase significantly as the number of threads increases. Therefore the execution time is reduced due to better utilization of the processor resources by exploiting the optimum thread-level parallelism.

Parameters                             UP      DP      QP
Instruction per cycle                  0.689   1.71    3.02
Microoperations per cycle              1.22    2.69    5.05
Trace cache deliver mode %             78.23   89.98   93.58
Trace cache build mode %               21.49   9.84    5.23
1st level cache load misses rate %     5.23    5.35    4.89
2nd level cache load misses rate %     0.47    0.53    0.21
Front-side-bus utilization rate %      0.59    4.59    12.35

Table 3: Micro Architecture metrics

Fig 8: Speedup vs the number of Threads

The standard measure, speedup, is used to evaluate the performance of the proposed method. Figure 8 shows the plot of speedup vs the number of threads. It can be seen from this figure that the peak performance is achieved when the number of threads equals the number of cores. It is also observed that the speedup is almost constant (or slightly lower) when the number of threads exceeds the number of cores. This is because additional overhead is required to schedule the extra threads and to hold or process their information. We also observe from Figure 8 that it is possible to achieve significantly higher speedup values using GOP parallelism.
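The defining equation for speedup did not survive in this copy of the text. The conventional definition, given here as an assumption about what the authors used rather than a recovered formula, is the ratio of the sequential encoding time to the encoding time with n threads; the symbols below are introduced only for illustration.

```latex
S(n) = \frac{T_{\mathrm{seq}}}{T_{\mathrm{par}}(n)}
```

Here T_seq is the encoding time of the sequential reference encoder and T_par(n) is the encoding time of the GOP-parallel encoder running n threads, so S(n) > 1 means the parallel version is faster.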

6. Conclusion and Future Work

In this paper, we have presented a method based on GOP parallelism and analyzed the parallel scalability of the H.264 video encoding process using dual core and quad core processors. Our proposed parallelization strategy can overcome many of the shortfalls of the other known methods, such as scalability issues and data dependency constraints. In general, the experimental results show that the GOP-level parallelism strategy efficiently exploits the capabilities of multicore processors. The speedup values obtained using dual and quad core systems are 5.6 and 10 times higher, respectively, compared to the original reference software for H.264 (JM 17.2). Although the focus of this paper is on the H.264 codec, it is expected that other video codecs and multimedia applications also exhibit similar characteristics. Hence, the proposed method can be extended to any of the computationally intensive applications of video processing.

7. References

[1] International Standard of Joint Video specification (ITU-T Rec. H.264 ISO/IEC) (2009).
[2] Ostermann, J., et al., "Video Coding with H.264/AVC: Tools, Performance, and Complexity", IEEE Circuits and Systems Magazine 4(1) (2004) pp. 7-28.
[3] Hoogerbrugge, J., et al., "A Multithreaded Multicore System for Embedded Media Processing", Trans. on High-Performance Embedded Architectures and Compilers (2009).
[4] Drose, M., Clemen, C., Sikora, T., "Extending Single-View Scalable Video Coding to Multi-View Based on H.264/AVC", Image Processing, 2006 IEEE Inter. Conf. on (2006) pp. 2977-2980.
[5] Rodriguez, A., et al., "Hierarchical Parallelization of an H.264/AVC Video Encoder", Proc. Int'l. Symp. on Parallel Computing in Electrical Engineering (2006) pp. 363-368.
[6] Chen, Y., Li, E., Zhou, X., Ge, S., "Implementation of H.264 Encoder and Decoder on Personal Computers", Journal of Visual Communications and Image Representation 17 (2006).
[7] Van der Tol, E., Jaspers, E., Gelderblom, R., "Mapping of H.264 Decoding on a Multiprocessor Architecture", Proc. SPIE Conf. on Image and Video Communications and Processing (2003).
[8] Stenstrom, P., et al., "Chip-multiprocessing and Beyond", Proc. Twelfth Int'l. Symp. on High-Performance Computer Architecture (2006) pp. 109-109.
[9] Chen, Y.K., et al., "Towards Efficient MultiLevel Threading of H.264 Encoder on Intel Hyper-Threading Architectures", Proc. of the 18th Int'l Parallel and Distributed Processing Symposium, Apr. 2004.
[10] Ge, S., Tian, X. and Chen, Y.K., "Efficient Multithreading Implementation of H.264 Encoder on Intel Hyper-Threading Architectures", IEEE Pacific-Rim Conf. on Multimedia, Dec 2003.
