Exploiting Data Parallelism for the H.264 Encoder

Mei Wen, Ju Ren, Nan Wu, Huayou Su, Changqing Xun, Chunyuan Zhang
Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha, China
[email protected]

Abstract—Real-time H.264 encoding of high-definition (HD) video (up to 1080p) is a challenging workload for most existing programmable processors. Novel programmable parallel processors such as stream processors, graphics processing units (GPUs) and DSPs offer a different and very promising technology for these demands, so parallel computing for H.264 encoding on these processors is becoming a hot research topic. It is challenging because most emerging parallel processors focus on exploiting data-level parallelism (DLP), while the dependencies inherent in the traditional H.264 encoding algorithm significantly restrict DLP. Facing this challenge, this paper presents data-parallel processing methods for the key modules of the H.264 encoder that eliminate these dependency restrictions. The results show that the key modules, including Intra-prediction, Inter-prediction and CAVLC, achieve significant speedup on a stream processor when these data-parallel processing methods are used.

Keywords—H.264 encoder; data-level parallelism; SIMD

I. INTRODUCTION

The H.264 standard [1], developed by the Video Coding Experts Group and the ISO/IEC Moving Picture Experts Group, is widely adopted in applications ranging from high-definition living-room entertainment (Blu-ray/HD-DVD) to handheld terminals (DVB-H). It saves 25%-45% and 50%-70% of the bitrate compared with MPEG-4 Advanced Simple Profile (ASP) and MPEG-2, respectively. However, the coding performance of H.264 comes at the price of significantly increased computational complexity. In particular, real-time encoding of high-definition H.264 video (up to 1080p) is a challenge for most existing programmable processors. According to an instruction profile for HDTV1024P (2048x1024, 30 fps) video, an H.264/AVC decoder requires 83 GIPS (giga instructions per second) of computation and 70 GB/s of memory bandwidth; for an HDTV720P (1280x720, 30 fps) encoder, the performance and bandwidth requirements reach 3600 GIPS and 5570 GB/s [2]. Emerging parallel processors, such as stream processors and GPUs, focus on exploiting data-level parallelism and show surprising efficiency in many compute-intensive domains, especially media and graphics processing [3,4]. This provides an alternative platform for accelerating applications on

programmable processors instead of relying on long and expensive dedicated ASIC designs. Thus, parallel computing for H.264 encoding on these processors is becoming a hot research topic. However, dependencies exist throughout the traditional H.264 encoding process, including prediction, CAVLC and the Deblocking Filter; they significantly restrict the exploitation of DLP and decrease parallel efficiency. Facing this challenge, this paper first analyzes the dependency issues of H.264 encoding and then presents data-parallel processing methods for the key modules of the H.264 encoder that eliminate the data-dependency restrictions and provide sufficient data-parallel processing granularity for a parallel H.264 encoder implementation.
The paper is organized as follows. Section 2 summarizes the dependency issues in H.264 encoding. The data-parallelism techniques for the key modules are addressed in Section 3. Evaluation results are presented in Section 4. Finally, Section 5 concludes the paper.

II. DEPENDENCY ANALYSIS

This section describes the kinds of dependency that hinder data-level parallel processing in H.264 encoding. We classify them into data dependency, priority restriction of bit-stream storage, and control dependency, described in detail as follows.

A. Data dependency
According to its granularity, data dependency can be further divided into inter-frame, inter-macroblock and inter-block data dependency.
(1) Inter-frame data dependency. As shown in Fig. 1(a), inter-frame prediction of the current frame needs pixels from parts of the reference frame as input data, so there is data dependency between the current frame and the reference frame.
(2) Inter-macroblock data dependency. As shown in Fig. 1(b), during intra-frame prediction some prediction patterns of the current macroblock, such as 16x16 luma prediction, must be calculated from the edge pixels of the left, top-left, top and top-right neighboring macroblocks.
(3) Inter-block data dependency. As shown in Fig. 1(c), some intra-frame prediction patterns (such as the 9 modes of 4x4 luma prediction) must be calculated from the edge pixels of neighboring blocks. In the Deblocking Filter, the left and top neighboring blocks, or the top, bottom, left and right neighboring blocks, are needed during processing.

Figure 1. Data dependency in H.264 encoding

B. Priority restriction of bit-stream storage
Take CAVLC as an example. Because of variable-length encoding, the number of elements and the encoding length of each block are unpredictable, so the length of the bitstream generated by each block is also unpredictable. Moreover, the bitstreams of different blocks are tightly coupled: until the previous block has generated its bitstream and stored it into the bitstream structure, the next block cannot do so, because the position at which its bitstream should start is unknown [5].

C. Control dependency
Control dependency means that different input data lead to different branches. Take the Deblocking Filter as an example: different edges correspond to different filter procedures.

Obviously, these kinds of dependency conflict with data-parallel execution models such as SIMD, in which different data are processed by the same instruction or procedure. The priority restriction of bit-stream storage means that the data to be processed in parallel cannot be prepared in advance, while control dependency means that data processed in parallel may require different instructions along different execution paths. In addition, these dependencies impose an ordering on processing, which confines H.264 encoding to a serial execution model with low data-parallel processing granularity. For example, the main loop of x264 processes data at block granularity, which causes the short-stream effect [3].
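To make the priority restriction of bit-stream storage concrete, the following minimal sketch of a conventional serial bit writer illustrates the ordering constraint; the BitWriter type and put_bits routine are our own illustration, not x264's implementation.

/* Illustrative sketch (not x264 code) of a conventional serial bit writer.
 * buf must be zero-initialized; bits are packed MSB first.                */
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint8_t *buf;     /* output bitstream buffer */
    size_t   bitpos;  /* next free bit position  */
} BitWriter;

static void put_bits(BitWriter *bw, uint32_t value, int nbits)
{
    for (int i = nbits - 1; i >= 0; i--) {
        size_t byte = bw->bitpos >> 3;
        int    off  = 7 - (int)(bw->bitpos & 7);
        bw->buf[byte] |= (uint8_t)(((value >> i) & 1u) << off);
        bw->bitpos++;
    }
}
/* Block k+1 cannot call put_bits() until bitpos reflects every bit of
 * block k, because its starting position is exactly bitpos; this is the
 * priority restriction relaxed in Section III.C.                         */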

III. DATA PARALLEL METHODS

For the kinds of dependency described in Section 2, this section presents data-parallel methods that eliminate the dependency restrictions. Since the H.264 encoding algorithm is complicated and some modules contain several kinds of dependency, it is impractical to describe every module; we therefore describe representative data-parallel methods for the key modules: Inter-frame Prediction, Intra-frame Prediction and CAVLC.

A. Macroblock Level Parallelism of Inter-frame Prediction
Inter-macroblock data dependency exists in the motion vector (MV) prediction of Inter-frame Prediction. There are 3 patterns of MV prediction, as shown in the top part of Fig. 2. In each pattern, the MV value of the current macroblock (MB) depends on the MV values of neighboring macroblocks, and the current macroblock can continue only after the MVs of its neighbors have been decided. This prevents macroblocks from being processed in parallel. For example, as shown in Fig. 2, there is data dependency between MB-P0 and MB-C.
To implement MB-level parallelism, we change the direction of the dependency so as to relax it. As shown in the lower part of Fig. 2, in all of these modes the MVs taken from the left macroblock are replaced by the MVs from the top-left macroblock when forming the median. For example, in the first mode the exact MV cost of the block is derived from the median of MV0, MV1 and MV2; as shown in Fig. 2, the Motion Vector Predictor (MVP) of the block is changed to the median of MV3, MV1 and MV2. Thus, MB-P0 and MB-C can be processed in parallel, and all MBs in a row of an image can go through Inter-frame prediction in parallel.

Figure 2. Eliminating MB dependency in Inter Prediction

Note that the modified code still conforms to the H.264 standard; the Claire QCIF 10 Hz sequence, for example, is used to test PSNR. We observe that the PSNR of the modified Inter Prediction decreases by at most 0.4 dB compared with the original x264 code, which has only a slight influence on image quality, while Inter Prediction can now be applied to all macroblocks in the same row, increasing the data-processing granularity of Inter Prediction to 30 KB.
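As a minimal sketch of this modification (our own illustration, not the actual x264 or stream-processor code; the mv_t type and the neighbor numbering MV0 = left, MV1 = top, MV2 = top-right, MV3 = top-left follow the description above), the predictor used for the MV cost simply takes the top-left neighbor in place of the left one.

/* Illustrative sketch (not the x264 implementation). */
typedef struct { int x, y; } mv_t;

static int median3(int a, int b, int c)
{
    if (a > b) { int t = a; a = b; b = t; }  /* now a <= b          */
    if (b > c) { b = c; }                    /* b = min(b, c)       */
    return (a > b) ? a : b;                  /* median of the three */
}

/* Original predictor: median(MV0, MV1, MV2), which depends on the MB to
 * the left. Modified predictor: MV0 is replaced by MV3 (top-left), so
 * all MBs of a row can run Inter-frame prediction in parallel.          */
static mv_t mvp_row_parallel(mv_t mv3, mv_t mv1, mv_t mv2)
{
    mv_t mvp = { median3(mv3.x, mv1.x, mv2.x),
                 median3(mv3.y, mv1.y, mv2.y) };
    return mvp;
}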

B. Block Level Parallelism of Intra-frame Prediction
Intra-frame Prediction has several prediction patterns, such as 4x4 luma intra prediction, and inter-block data dependency exists in 4x4 luma intra prediction. Fig. 3(a) shows the dependency relation among the 4x4 blocks of a macroblock; the numbers indicate block numbers, and the arrows indicate dependency. Although the 9 prediction modes of 4x4 intra prediction make the prediction of each block depend on the values of up to 4 neighboring blocks, this does not mean that every pair of 4x4 blocks in a macroblock is dependent. The idea of block-level parallelism in Intra-frame Prediction is therefore to process independent blocks in parallel.
It can be seen from Fig. 3(a) that there are independent blocks and dependent blocks, and that dependency among blocks can be direct or indirect. For example, blocks 0 and 1 are directly dependent, while blocks 0 and 4 are indirectly dependent: block 4 depends on block 1, and block 1 depends on block 0, so the dependency follows by transitivity. The key to block-level parallelism is thus to find independent blocks.
To show the dependency relation among the blocks of a macroblock clearly, Fig. 3(b) is generated according to an "ignition" rule: a block is calculated as soon as all of its dependency conditions are satisfied. Fig. 3(b) shows the dependency relation among blocks 0-8. Obviously, the block pairs (2, 4), (3, 5) and (6, 8) are independent, so the two blocks in each pair can be processed in parallel; in other words, the blocks on the same horizontal level in Fig. 3(b) can be processed simultaneously. Following this method, there are 6 pairs of blocks that can be processed simultaneously: (2, 4), (3, 5), (6, 8), (7, 9), (10, 12) and (11, 13). The resulting block-level parallelism is shown in Fig. 3(c): the intra 4x4 prediction of each macroblock is decoupled into 10 stages, called the 10-stage block-level parallelism algorithm. The number n in the figure indicates the execution stage of the corresponding block. In stages 1, 2, 9 and 10 only one block is processed, while in stages 3-8 two blocks are processed in parallel. The execution time of the 10-stage algorithm is 62.5% of that of the traditional 16-stage serial process.
In this way, the data-processing granularity of the parallel part is 2, which means at least 2 PPUs are needed. Furthermore, if a frame is partitioned into multiple slices, macroblocks in different slices can be processed simultaneously, so the maximum data-processing granularity can reach 2n blocks (where n is the number of slices in a frame). Note that in SIMD execution mode, some redundant data are generated by the 10-stage algorithm.

Figure 3. 4x4 block dependency and 10 stages parallel intra prediction
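The 10-stage schedule can be written down as a static stage table; the sketch below is an illustration only, assuming the block numbering of Fig. 3(a), with the singleton stages taken as blocks 0, 1, 14 and 15 (the blocks not covered by the six pairs) and predict_4x4() standing in for the real per-block prediction kernel.

/* Illustrative sketch of the 10-stage schedule for the 4x4 blocks of one
 * macroblock; -1 marks an empty slot. The two blocks of a stage carry no
 * dependency on each other and may be issued to different PPUs/lanes.   */
#define NUM_STAGES 10

static const int stage_blocks[NUM_STAGES][2] = {
    {  0, -1 }, {  1, -1 },                   /* stages 1-2: single blocks */
    {  2,  4 }, {  3,  5 }, {  6,  8 },       /* stages 3-8: independent   */
    {  7,  9 }, { 10, 12 }, { 11, 13 },       /*             pairs         */
    { 14, -1 }, { 15, -1 }                    /* stages 9-10: single blocks*/
};

void predict_4x4(int blk);  /* assumed per-block intra prediction kernel */

void intra4x4_mb_10stage(void)
{
    for (int s = 0; s < NUM_STAGES; s++) {
        for (int i = 0; i < 2; i++) {            /* both slots of a stage  */
            if (stage_blocks[s][i] >= 0)
                predict_4x4(stage_blocks[s][i]); /* independent within s   */
        }
    }
}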

C. CAVLC
In CAVLC, the processing depends on the features of the input data because of the context-adaptive algorithm. In addition, the bitstreams of macroblocks are tightly coupled, which is the priority restriction of bit-stream storage described in Section 2. Bitstream assembling is shown in Fig. 4(a). Suppose the bitstream lengths of MB0, MB1 and MB2 are 18 bits, 12 bits and 14 bits respectively. Only 2 bits are written into the last byte of MB0's bitstream, so the first 6 bits of MB1's bitstream have to fill the remaining 6 bits of that byte, and the rest of MB1's bitstream is stored starting from the next byte. As a result, there are 2 vacant bits in the fourth byte of the bitstream, and the bitstream of MB2 is stored bit by bit in the same way. Obviously, in CAVLC the priority restriction of neighboring macroblock bitstream storage caused by bitstream assembling means that the bitstream of the next macroblock (or block) can be stored only after that of the previous macroblock (or block) has been stored.
To solve this problem, we present a parallel bitstream-generating method. By calculating the storage length in advance and shifting in parallel, the storage location of each bitstream is changed from unpredictable to predictable, so the priority restriction of bit-stream storage can be relaxed. The method decouples the VLC process into three steps, illustrated by the example in Fig. 4(b).
Step 1. Encoding, which can be applied to multiple macroblocks in parallel. The 27 blocks of a macroblock are encoded in turn. For each macroblock an assembled bitstream is generated, which may or may not be byte-aligned, together with its length L in bits.
Step 2. Shifting in parallel, which aligns the bitstream of each macroblock to a byte boundary. If the end of a macroblock's bitstream is not byte-aligned, the trailing bits (fewer than 8) are moved into the first byte of the next macroblock's bitstream. As shown in Fig. 4(b), the last 2 bits of MB0's bitstream are stored at the beginning of MB1's bitstream; to link the bitstreams of MB0 and MB1, MB1's bitstream has to be right-shifted by 2 bits, and the byte-aligned bitstream of the next macroblock is then generated by calculating its shift offset from the new bitstream length. The remove, fill and shift operations described so far are still serial, because each shift depends on the previous one. To process them in parallel, we first calculate the shift offset of every macroblock: since the bitstream length L of each macroblock is already known from Step 1, the number of bits to be removed and the shift length can be computed for every macroblock at the beginning of Step 2, so the remove, fill and shift operations can be performed with macroblock-level parallelism. In addition, for the last macroblock of each batch processed in parallel, the bits to be removed are passed as a variable to the next batch and filled into the first byte of its first macroblock.
Step 3. Bitstream output. At this point each macroblock's bitstream is byte-aligned and its length is known, so the bitstreams can be output without further adjustment.
By using this method, a row of macroblocks of an image can be processed in parallel in CAVLC.

Figure 4. Parallel bit-stream generation in CAVLC
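A minimal sketch of the three steps follows (our own illustration, not the stream-processor kernels; MbStream, encode_mb_bits() and MAX_MB_BYTES are assumptions). The key point is that the prefix sum of the lengths L makes every macroblock's final bit position known before any shifting, so the shift/copy of each macroblock is independent.

/* Illustrative sketch of the three-step parallel bitstream generation. */
#include <stdint.h>
#include <stdlib.h>

#define MAX_MB_BYTES 512

typedef struct {
    uint8_t bits[MAX_MB_BYTES]; /* bit-packed, MSB first, starting at bit 0 */
    int     len_bits;           /* L: exact bitstream length in bits        */
} MbStream;

/* Step 1 (parallel over macroblocks): CAVLC-encode one macroblock into
 * its own buffer and record its length L. Assumed to exist elsewhere.    */
void encode_mb_bits(int mb_idx, MbStream *out);

/* OR nbits bits of src (starting at bit 0) into dst at a known bit
 * offset; dst must be zero-initialized.                                  */
static void copy_bits(uint8_t *dst, long dst_bit, const uint8_t *src, int nbits)
{
    for (int i = 0; i < nbits; i++) {
        int  bit = (src[i >> 3] >> (7 - (i & 7))) & 1;
        long d   = dst_bit + i;
        dst[d >> 3] |= (uint8_t)(bit << (7 - (d & 7)));
    }
}

/* Steps 2 and 3 for one row of macroblocks. */
void cavlc_row_parallel(int num_mbs, MbStream *mb, uint8_t *out /* zeroed */)
{
    /* Step 2a: prefix sum of the lengths gives every macroblock's start
     * offset, so each write position is known before any shifting.       */
    long *offset = (long *)malloc(((size_t)num_mbs + 1) * sizeof(long));
    offset[0] = 0;
    for (int k = 0; k < num_mbs; k++)
        offset[k + 1] = offset[k] + mb[k].len_bits;

    /* Step 2b + Step 3: each macroblock is shifted to its final position
     * independently (macroblock-level parallelism); only the boundary
     * byte shared with a neighbor needs the carry handling of Fig. 4(b).
     * The loop is written serially here for clarity.                     */
    for (int k = 0; k < num_mbs; k++)
        copy_bits(out, offset[k], mb[k].bits, mb[k].len_bits);

    free(offset);
}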

IV. EVALUATION

This paper chooses the STORM-SP16 G220 processor [6] as the platform to evaluate the data-parallel methods presented in this paper for H.264 encoding. It is a representative stream processor: the STORM architecture is typical of emerging stream processors and data-parallel processors such as Imagine, FT64, CELL and ClearSpeed. STORM contains two MIPS processor cores for data handling and control and 16 SIMD arithmetic lanes for compute-intensive inner-loop computations; it exploits DLP through the SIMD execution units and ILP through the VLIW arithmetic lanes. x264 (baseline) is chosen as the reference code. The experimental parameters are: 1 I-frame per 20 frames, 1 slice per frame, and 1 reference frame. HD-VideoBench [7] is selected as the test video suite; it contains 4 HD video sequences, whose parameters are shown in Table 1.

By using the data-parallelism methods presented in this paper, the dependency problems in the H.264 encoding modules are solved and the data-processing granularity is significantly increased. For 1080p video, the data-processing granularity of Prediction and CAVLC reaches 30 KB, which provides an important basis for implementing data parallelism efficiently.

TABLE I. HD VIDEOBENCH CONFIGURATION

HD VideoBench     Resolution   Frames   Format      Data Capacity
Blue_sky          1080p        217      YUV 4:2:0   644 MB
Pedestrian_area   1080p        375      YUV 4:2:0   1112 MB
Riverbed          1080p        250      YUV 4:2:0   742 MB
Rush_hour         1080p        500      YUV 4:2:0   1483 MB

TABLE II. PERFORMANCE OF DIFFERENT CODING MODULES

HD VideoBench                     Inter Prediction   Intra Prediction   CAVLC
Blue_sky (data parallel)          1.61 s             7.81 s             1.03 s
Blue_sky (x264)                   20.67 s            11.79 s            3.60 s
Pedestrian_area (data parallel)   2.79 s             13.02 s            1.70 s
Pedestrian_area (x264)            36.08 s            20.42 s            6.00 s
Riverbed (data parallel)          1.89 s             8.89 s             1.47 s
Riverbed (x264)                   26.07 s            14.01 s            4.78 s
Rush_hour (data parallel)         3.97 s             17.85 s            1.90 s
Rush_hour (x264)                  50.27 s            26.93 s            6.90 s
Average speedup                   13.04x             1.54x              3.47x

Table 2 shows the execution time of the key modules of the H.264 encoder for both the serial x264 code and the data-parallel code. The performance is similar across the 4 video sequences, and the data-parallel code achieves significant speedup over the serial x264 code: 13.04x for Inter-frame Prediction, 1.54x for Intra-frame Prediction and 3.47x for CAVLC. The speedup for Inter-frame Prediction is clearly the highest, which indicates that the data-parallelism methods are most efficient for the parts with higher computational intensity. Since Inter-frame Prediction consumes most of the overall execution time of H.264 encoding, its high speedup has a strong positive impact on the overall performance of the encoder. Conversely, the speedup for Intra-frame Prediction is the lowest, because its data-processing granularity is only 2, which cannot fully exploit the capability of the parallel PEs. Although the data-processing granularity of CAVLC reaches 16, its speedup is still relatively low because of the large number of memory references during bitstream generation. In addition, the PSNR decreases by 0.5 dB on average for the 4 video sequences, and the average compression ratio decreases by 6% compared with the original x264 code, because the parallel inter prediction uses only MVs from the previous row.

V. CONCLUSION

This paper summarizes the major dependencies in H.264 encoding: data dependency, priority restriction of bit-stream storage and control dependency. For these dependency restrictions, it presents macroblock-level parallelism for Inter-frame Prediction, block-level parallelism for Intra-frame Prediction, and macroblock-level parallel bitstream generation for CAVLC. The results show that these data-parallel methods raise the data-processing granularity of most key modules of the H.264 encoder to 30 KB, which provides an important basis for efficient parallelization and real-time H.264 encoding.

ACKNOWLEDGMENT
This work was supported by the National Natural Science Foundation of China under grants No. 60703073, No. 60903041 and No. 61033008. Several of the authors were supported by National University of Defense Technology and/or Computer School graduate fellowships.

REFERENCES
[1] T. Wiegand. Draft Text of Final Draft International Standard (FDIS) of Joint Video Specification (ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC). 7th Meeting of JVT, Pattaya, Mar. 2003.
[2] T. C. Chen, C. J. Lian and L. G. Chen. Hardware Architecture Design of an H.264/AVC Video Codec. In Proceedings of the 2006 Conference on Asia South Pacific Design Automation, Yokohama, Japan, 2006: 750-757.
[3] S. Rixner. Stream Processor Architecture. Boston: Kluwer Academic Publishers, 2001: 1-6.
[4] P. Mattson. A Programming System for the Imagine Media Processor. PhD Thesis, Stanford University, 2002.
[5] Ju Ren, Yi He, Wei Wu, et al. Software Parallel CAVLC Encoder Based on Stream Processing. 2009 IEEE/ACM/IFIP 7th Workshop on Embedded Systems for Real-Time Multimedia, Grenoble, France, 2009: 126-133.
[6] Stream Processors Inc. SPI Software Documentation. http://www.streamprocessors.com, 2008.
[7] Mauricio Alvarez, Esther Salami, Alex Ramirez and Mateo Valero. HD-VideoBench: A Benchmark for Evaluating High Definition Digital Video Applications. Proceedings of the IEEE 10th International Symposium on Workload Characterization, 2007: 120-125.