THE JVT ADVANCED VIDEO CODING STANDARD: COMPLEXITY AND PERFORMANCE ANALYSIS ON A TOOL-BY-TOOL BASIS

Sergio Saponara, Carolina Blanch, Kristof Denolf, Jan Bormans
IMEC, Kapeldreef 75, B-3001 Leuven, Belgium
{saponara, blanch, denolf, bormans}@imec.be

ABSTRACT

The Advanced Video Codec (AVC), currently being defined by a Joint Video Team (JVT) of ISO/IEC and ITU-T, aims at enhanced coding efficiency and network friendliness over a wide range of transmission systems. While the basic framework is similar to the motion-compensated hybrid scheme of previous video coding standards, additional tools improve the compression efficiency at the expense of an increased implementation cost. This paper presents a systematic analysis of the complexity and the rate-distortion performance of the new tools over a wide variety of video contents. The simulation results show how the improved coding efficiency, up to 50 %, comes with a complexity increase of more than one order of magnitude at the encoder and a factor of 2 at the decoder. This represents a design challenge for resource-constrained electronic systems such as wireless devices or cost-effective consumer terminals. A thorough tool-by-tool analysis, carried out in terms of memory cost, computational burden and compression efficiency, indicates that a proper use of the AVC tools allows for the same performance as the most complex configuration (all tools on) while considerably reducing the complexity. The reported results highlight the AVC implementation bottlenecks, provide input to assist the definition of profiles in the standard and enable the selection of the optimal balance between coding efficiency and implementation complexity.

1. INTRODUCTION

The Advanced Video Codec (AVC) is the current joint project of the ITU-T/VCEG and ISO/IEC MPEG [1-3]. This new standardisation effort promises both enhanced compression efficiency over existing video coding standards (H.263, MPEG-4 Part 2) and a network-friendly video representation. The codec is aimed at both conversational (bi-directional and real-time videotelephony, videoconferencing) and non-conversational (storage, broadcasting, streaming) applications for a wide range of bit-rates over wireless and wired communication networks. Like previous video coding standards [4,5], AVC is based on a hybrid block-based motion compensation and transform-coding model.

Additional features improve the compression efficiency and the error robustness at the expense of an increased complexity. This directly affects the cost-effective implementation of AVC and hence the final success of the standard. The goal of this paper is to systematically analyse the complexity and the coding performance of the new tools, thus providing early feedback on the implementation bottlenecks. The complexity analysis focuses on data transfer and storage, as these are the dominant cost factors in multimedia systems for both software- and hardware-based architectures [6-11]. Memory metrics are completed by computational cost measures. A comparison of AVC with current video coding technology [12,13], in terms of algorithm performance and implementation cost, is also provided. The paper starts with a brief description of the upcoming standard and a review of previous work concerning its performance and complexity assessment in Section 2. Section 3 describes the metrics and the adopted test bench. Section 4 presents the global results obtained for the codec in terms of compression efficiency, memory cost and computational burden. Subsequently, the tool-by-tool analyses of the encoder and decoder are discussed in Sections 5 and 6 respectively. In Section 7 the source code is modified to test some low-complexity configurations not included in the basic codec. Section 8 deals with the dependency of the proposed analysis on the quantization parameter. Finally, conclusions are drawn in Section 9.
2. THE EMERGING AVC STANDARD

2.1 Standard Overview
An important concept of AVC is the separation of the system into two layers: a Video Coding Layer (VCL), providing the high-compression representation of data, and a Network Adaptation Layer (NAL), packaging the coded data in an appropriate manner based on the characteristics of the transmission network. This study focuses on the VCL. For a description of NAL features the reader is referred to [14,15]. The basic coding framework defined by AVC is similar to the one of previous video coding standards: translational block-based motion estimation and compensation, residual coding in a
transformed domain and entropy coding of quantized transform coefficients. Additional tools improve the compression efficiency at an increased implementation cost. The motion compensation scheme supports multiple previous reference pictures (up to 5) and a large number of different block sizes (from 16x16 up to 7 modes, including 16x8, 8x16, 8x8, 8x4, 4x8 and 4x4 pixel blocks). The motion vector field can be specified with a higher spatial accuracy (quarter- or eighth-pixel resolution instead of half-pixel) and a Rate-Distortion (RD) Lagrangian technique [16] optimises both motion estimation and coding mode decisions. A Hadamard transform can be applied to the residual to improve the performance of conventional error cost functions such as the sum of absolute differences. A deblocking filter within the motion compensation loop aims at improving prediction and reducing visual artefacts. AVC adopts spatial prediction for intra-coding: pixels are predicted from the neighbouring samples of already coded blocks. The conventional 8x8 floating-point discrete cosine transform, specified with rounding error margins, is replaced by a purely integer spatial transform basically operating on 4x4 blocks. The small sizes help to reduce blocking and ringing artefacts, while the integer specification prevents any mismatch between encoder and decoder. Finally, two methods are specified for entropy coding: a Universal Variable Length Coder (UVLC) that uses a single reversible VLC table for all syntax elements and a more sophisticated Context Adaptive Binary Arithmetic Coder (CABAC) [17].
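The mismatch-free behaviour of the integer transform mentioned above can be sketched with its 4x4 core matrix; the matrix is the one defined by the standard, while the helper names and the plain-Python formulation are ours for illustration (the standard folds the scaling step into quantization, which this sketch omits):

```python
# Core 4x4 forward transform matrix of AVC (scaling is folded into
# quantization in the standard and omitted in this sketch).
CF = [[1,  1,  1,  1],
      [2,  1, -1, -2],
      [1, -1, -1,  1],
      [1, -2,  2, -1]]

def matmul(a, b):
    # Plain 4x4 integer matrix product.
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def forward_4x4(x):
    # Y = CF . X . CF^T -- exact in integer arithmetic, so encoder and
    # decoder cannot drift apart the way rounded floating-point DCTs can.
    cft = [[CF[j][i] for j in range(4)] for i in range(4)]
    return matmul(matmul(CF, x), cft)

flat = [[3] * 4 for _ in range(4)]
coeffs = forward_4x4(flat)   # a flat block yields a single DC coefficient
```

Since every entry of CF is a small integer, the whole transform reduces to additions and shifts in a real implementation, which is precisely why it is attractive for resource-constrained terminals.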
2.2 Related Work
Several contributions have recently been proposed to assess the coding efficiency of the AVC/H.26L scheme [3,14,17-20] (H.26L is the original ITU-T project used as starting point for the AVC standardization effort that will be released as ITU-T H.264 and ISO/IEC MPEG-4 Part 10). Although this analysis covers all tools, the new features are typically tested independently, comparing the performance of a basic configuration to the same configuration plus the tool under evaluation. In this way, the inter-tool dependencies and their impact on the tradeoff between coding gain and complexity are not fully explored yet. Indeed, the achievable coding gain is greater for basic configurations where the other tools are off and video data still feature a high correlation. For codec configurations in which a certain number of tools are already on, the residual data correlation is lower and hence the further achievable gain is less noticeable. Complexity assessment contributions have been proposed in [18,19,21,22]. However, these works do not exhaustively address the problem since just one side of the codec is considered (encoder in [18,19] and decoder in [21,22]) and/or the analysis of the complete tool-set provided by the upcoming standard is not presented. Typically, the use
of B-frames, CABAC, multi reference frames and 1/8 resolution is not considered. Consequently, the focus is mostly on a baseline implementation suitable for low-complexity, low bit-rate applications (e.g. video conversation), while AVC aims at both conversational and non-conversational applications for which these discarded tools play an important role [3,17,20]. Furthermore, the complexity evaluation is mainly based on computational cost figures, while data transfer and storage exploration has proved to be mandatory for an efficient implementation of video systems [6-11]. Access frequency figures are reported in [21] for a basic H.26L decoder, but in this case the analysis focuses on the communication between a 32-bit ARM9 CPU and the RAM, a platform-dependent measure of the bus bandwidth rather than a platform-independent exploration of the system.

3. TEST ENVIRONMENT

3.1 Test Bench Definition
The proposed test bench consists of 4 sequences whose characteristics are summarized in Table 1. Mother & Daughter 30 Hz QCIF (MD) is a typical head-and-shoulders sequence for very low bit-rate applications (tens of Kbps). Foreman 25 Hz QCIF (FOR1) has a medium complexity, being a good test for low bit-rate applications ranging from tens to a few hundred Kbps. The CIF version of Foreman (FOR2) is a useful test case for middle-rate applications. Finally, Mobile & Calendar 15 Hz CIF (MC) is a high-complexity sequence with a lot of movement, including rotation, and is a good test for high-rate applications (thousands of Kbps). Since the current standard description does not provide on-line rate control, the test sequences in Sections 4-7 are coded with a fixed quantization parameter (QP) to achieve the target bit-rate. The dependency of the proposed analysis on the QP value is addressed in Section 8. Up to 19 different AVC configurations, selected as representative of a greater set of considered test cases, are reported in this paper for each input video. An identification number (from 0 to 18) characterizes each configuration, whose description is detailed in the Appendix. Starting from a simple implementation (all optional tools off, I and P pictures, 1 reference frame with search displacements of 8 for case 0 and 16 for case 1), the different AVC options are added step by step (cases 2 to 9) up to a complex configuration (all tools on, including B pictures, with 5 reference frames and search displacements of 16 for cases 10 and 12 and 32 for case 11), thus reaching the best coding performance albeit at maximum complexity. As explained further on, cases 13 to 18 achieve roughly the same coding efficiency as the complex cases, while considerably reducing the complexity overhead by discarding some tools and reducing the number of reference frames and/or the search area.
Table 1: Characteristics of the test sequences with their targeted bit-rate

Test Case            Video Format  Number of Frames  Frame rate (Hz)  QP  Bit-rate (Kbps)
Mother & Daughter    QCIF          30                30               18  40
Foreman              QCIF          25                25               17  150
Foreman              CIF           25                25               17  450
Calendar & Mobile    CIF           15                15               12  2000
Table 2: AVC JM 2.1 vs. MPEG-4 SP

                       PSNR-Y  Kbps    Encoder                              Decoder
                       (dB)            Peak mem   Accesses   Relative       Peak mem   Accesses   Relative
                                       (Kbytes)   (10^9/s)   time           (Kbytes)   (10^6/s)   time
MC    MPEG4-SP         35.2    2279.7  9880       3.2        11.32          3129       278.3      1.76
MC    JM 2.1 simple    37.3    2243.9  8033       6.9        29.70          2910       303.8      2.46
MC    JM 2.1 complex   37.8    1312.3  18408      63.4       262.01         4152       470.2      4.05
FOR2  MPEG4-SP         35.5    506.4   9880       5.0        14.64          3129       411.4      2.25
FOR2  JM 2.1 simple    35.8    435.4   8033       11.6       49.98          2910       385.2      3.06
FOR2  JM 2.1 complex   36.4    284.7   18408      116.3      511.86         4152       623.3      5.42

Table 3: AVC encoder and decoder ranges

                                 MD              FOR1            FOR2             MC
                                 Min     Max     Min     Max     Min      Max      Min      Max
PSNR-Y (dB)                      36.24   36.77   35.19   36.00   35.77    36.51    37.30    37.90
Kbps                             24.80   33.29   96.14   145.80  276.89   435.45   1305.20  2243.93
Encoder peak memory (Mbytes)     2.19    15.60   2.19    15.60   7.31     26.87    7.31     26.87
Encoder accesses (10^9/s)        1.40    79.92   1.18    65.78   4.60     258.01   2.75     134.26
Encoder relative time            6.48    409.12  5.40    330.87  21.70    1117.48  12.98    567.37
Decoder peak memory (Mbytes)     1.06    1.38    1.06    1.38    2.91     4.15     2.91     4.15
Decoder accesses (10^6/s)        61.10   104.41  90.16   153.88  385.21   636.90   287.10   492.54
Decoder relative time            0.30    0.69    0.58    1.30    3.04     5.51     2.33     4.26

3.2 Complexity Metrics
The coding performance for the different test cases is reported hereafter in terms of PSNR (Peak Signal-to-Noise Ratio) and bit-rate. The complexity metrics of this paper are the access frequency (total number of memory transfers per second) and the peak memory usage (maximum memory amount allocated by the source code), as counted within the ATOMIUM/Analysis environment [23]. These figures give a platform-independent measure of the memory cost and are completed with the processing time as a measure of the computational burden. Coding speed figures are measured on a Pentium IV at 1.7 GHz with Windows 2000 and are expressed in relative times, with the duration of the sequence in seconds as reference. Meeting real-time constraints entails a relative coding time smaller than one. The memory-centric analysis addressed in this work is motivated by the data dominance of multimedia applications. As a consequence, data transfer and storage have a dominant impact on the cost-effective realisation of multimedia systems [6-11]. Application-specific hardware implementations have the freedom to match the memory system to the application; an efficient design flow exploits this to reduce area and power [8,9]. Programmable processors instead rely on the memory hierarchy that comes with them. Efficient use of this memory architecture is crucial to obtain the required speeds, as the performance gap between CPU and DRAM grows every year [10,11]. Moreover, this high-level analysis is essential for an efficient hardware/software system partitioning.
The huge C-code complexity of multimedia systems [7] makes a complexity analysis without additional help time-consuming and error-prone. To tackle this design bottleneck, the C-in-C-out ATOMIUM/Analysis environment has been developed: it consists of a scalable set of kernels providing functionality for data transfer and storage analysis and pruning. Using ATOMIUM/Analysis involves three steps [7]: instrumenting the program, generating instrumentation data and post-processing these data. The reference software models used as input for this paper are the AVC JM2.1 [2] and the MPEG-4 Part 2 Simple Profile (SP) in [13] (since the standard has been changing, the AVC reference code at the time of publication is JM6 [2]). Automated analysis techniques have the advantage of rapidly and reliably producing detailed "what if" cost figures for a wide variety of parameter settings and inputs. Additionally, they allow for accurately assessing the impact of a coding tool under test on the global complexity. However, automated techniques inherently depend on the quality and coding style of the source code under test. This implies that the reported results should also be interpreted in light of the quality of the source code. To cope with this issue, the reported results are supported by a theoretical analysis of the codec and by a comparison with literature.
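The kind of platform-independent transfer counting performed by such an environment can be illustrated with a toy wrapper (this is our own sketch, not the actual ATOMIUM tool, which instruments C code):

```python
class CountingArray:
    # Every read or write of the wrapped buffer bumps a transfer counter,
    # yielding access figures that do not depend on any target platform.
    def __init__(self, data):
        self._data = list(data)
        self.accesses = 0

    def __getitem__(self, i):
        self.accesses += 1
        return self._data[i]

    def __setitem__(self, i, value):
        self.accesses += 1
        self._data[i] = value

buf = CountingArray([0] * 8)
for k in range(8):
    buf[k] = k                          # 8 writes
total = sum(buf[k] for k in range(8))   # 8 reads
# buf.accesses now holds the total number of transfers: 16
```

Dividing such a counter by the sequence duration gives the accesses-per-second figures reported in the tables below.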
4. GLOBAL CODEC ANALYSIS

4.1 Coding-efficiency and Complexity Results
A comparison of JM2.1 in simple (case 1) and complex (case 10) configurations vs. MPEG-4 Part 2 (SP in [13] with search size 16, half-pixel resolution, unrestricted motion vectors, I and P pictures) is shown in Table 2 for MC and FOR2. Similar results are obtained for the other test sequences. From Table 2 it is clear that AVC is a new codec generation featuring an outstanding coding efficiency (1-2 dB PSNR increase and 40 % bit saving), but its cost-effective realization is a big challenge. These results refer to non-optimised source codes, and hence the application of algorithmic and architectural optimizations will decrease the absolute complexity values, as happened for the implementation of previous video coding standards [4,8]. However, the complexity ratio between the AVC and MPEG-4 SP reference codes (both non-optimised), one order of magnitude at the encoder and a factor of 2 at the decoder, represents a serious challenge requiring an exhaustive system exploration starting from the early standard development phases. Table 3 lists the ranges achieved by using AVC in different configurations for all test inputs. The overview of the encoder and decoder results (PSNR-Y, bit-rate, peak memory usage, access frequency, coding speed) for all test cases is
summarized in Figs. 1 to 8. All these figures present relative values: the measurements are normalized with respect to those of test case 0 (absolute values are reported in Tables 2 and 3). Storage complexity graphs (Figs. 4, 6) are only depicted for the FOR1 and FOR2 inputs, since the memory usage depends on the video format but not on the scene characteristics (for details refer to Sections 5.2 and 6.2). It is worth noting that the new coding scheme behaves similarly for all input sequences and that the obtained processing time curves closely resemble those obtained for the memory access frequency. The analysis of Table 3 and Figs. 1 to 8 clearly indicates the cost of the improved coding efficiency (from 26 to 42 % bit saving for a PSNR gain of 0.5-0.8 dB) vs. a basic configuration: an encoder complexity increase up to a factor of 7 for the peak memory usage and 57 for the access frequency. The higher the bit-rate, the lower the complexity increase. Indeed, for MC there is a factor of 3.7 for the peak memory and a factor of 49 for the access frequency. At the decoder side, the ratio between a complex and a simple profile is smaller. For all test sequences, a factor of 1.3-1.4 for the peak memory usage and 1.7 for the access frequency is observed. For a simple profile (Min columns in Table 3), the encoder requires an access frequency at least 10 times that of the decoder and a peak memory usage 2 times higher. For complex profiles (Max columns in Table 3), the encoder access frequency is 2 orders of magnitude above the decoder one, while the peak memory usage is 1 order of magnitude higher. The encoding time increases by a factor of at least 40 when going from the basic to the most complex configuration. For the decoder the increase factor is around 2. Real-time processing is only achieved at the decoder side.
The low-complexity MD video is the only one meeting real-time constraints in its complex configuration, while the more demanding FOR2 and MC inputs exceed the available time even in the simplest configuration.

4.2 Optimal Coding-efficiency vs. Complexity Balance
Figs. 1 to 8 also provide useful hints for the selection of the optimal trade-off between coding efficiency and implementation complexity. Starting from basic configurations (cases 0 and 1), the adoption of the new tools leads to complex implementations (cases 9 to 12) resulting in high rate-distortion efficiency at the cost of a high implementation complexity. Indeed, while in Figs. 1 and 2 the PSNR increases towards its maximum and the bit-rate decreases towards its minimum, Figs. 3 to 8 show a peak of complexity at both encoder and decoder side in the memory metrics and for the computational burden.
Figure 1. Normalized PSNR-Y

Figure 2. Normalized Bit-rate
Figure 3. Normalized access frequency (encoder)

Figure 4. Normalized peak memory usage (encoder)
Figure 5. Normalized access frequency (decoder)

Figure 6. Normalized peak memory usage (decoder)
Figure 7. Normalized coding time

Figure 8. Normalized decoding time

The analysis of test cases 13 to 18 shows that a "smart" use of the new tools can reach the same performance as a complex implementation (e.g. the PSNR and bit-rate results in Figs. 1 and 2 are practically the same as for cases 9 to 12) with a considerable complexity reduction. The lower the bit-rate, the higher the saving factor. Case 11, for which the maximum complexity is measured, is the configuration used in [3] to highlight the AVC compression efficiency with respect to previous video coding standards. At the encoder, for low and middle-rate applications (FOR 1 and 2), a factor of 6.5 in access frequency and up to 8 in coding time can be saved by using the tools properly. In very low bit-rate applications (MD), the saving factor vs. a fully complex configuration is up to 26 for the access frequency and 34 for the encoding time (11.5 for both metrics vs. test case 10). Finally, at high bit-rates the access savings vs. complex configurations such as cases 10 and 11 are 6.9 and 14.5 respectively, while the time savings are 6.2 and 13.4 respectively. At the decoder side, the range of variation between simple and complex configurations is smaller, so the savings are less noticeable than for the encoder. Indeed, no time saving is achieved for high-rate video (MC) at the decoder side, while the processing time of the complex case can be decreased by a factor of 1.5-2 for the other sequences. Similar results are obtained for the memory metrics. No access frequency reduction is measured for MC, while a peak memory saving of 20 % is still reached. At lower bit-rates, saving factors of 1.5 for both access frequency and peak memory usage can be achieved.

5. TOOL-BY-TOOL ENCODER ANALYSIS

A detailed tool-by-tool exploration of the encoder is addressed in this section. First, compression efficiency and data transfer complexity assessments are reported in Section 5.1. Section 5.2 deals with data storage requirements. Coding time measurements present a similar behaviour to the memory access ones, and hence all the considerations in Section 5.1 for the data transfer cost also hold for the computational burden.

5.1 Compression Efficiency and Data Transfer Complexity

5.1.1 1/8-resolution
The use of higher pixel accuracy (1/8 instead of 1/4) has a different impact on the coding performance at low and high rates. In FOR 1 and 2 it does not have a positive impact on PSNR or bit-rate. In MD it improves the PSNR (0.07 dB) at the expense of a bit-rate increase of 4.7 %. In MC it improves both PSNR (0.04 dB) and bit-rate (roughly 12 % reduction). At all bit-rates the higher accuracy entails an access frequency increase of about 15 %. Hence, 1/8 pixel accuracy should be selected only for high-rate video solutions.
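The memory-traffic cost of higher accuracy comes from sub-sample interpolation. As a sketch, half-sample positions in AVC are produced by a 6-tap filter; the function name and the scalar, single-sample formulation here are ours (the codec applies the filter separably over whole blocks, and finer accuracies require further averaging passes over these intermediate samples):

```python
def half_sample(e, f, g, h, i, j):
    # 6-tap filter (1, -5, 20, 20, -5, 1) with rounding and clipping to the
    # 8-bit range. Every finer accuracy step means more such filtered and
    # averaged samples must be fetched, hence the ~15 % access increase.
    value = (e - 5 * f + 20 * g + 20 * h - 5 * i + j + 16) >> 5
    return max(0, min(255, value))

# A flat area interpolates to the same value.
mid = half_sample(10, 10, 10, 10, 10, 10)
```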
5.1.2 Variable Block Sizes
Using variable block sizes (either 4 or 7 modes vs. 1) improves PSNR and bit-rate. The enhancement depends on the input video: in FOR 1 and 2 the gain (up to 0.2 dB, 17 % bit saving) is greater than for MC (up to 0.1 dB, 6.4 % bit saving) and MD (0.07 dB, 5 % bit saving). From the analysis of Table 5 it should be noted that (i) the greater the number of modes used, the greater the rate-distortion efficiency (however, 60-85 % of the PSNR improvement and 75-85 % of the bit-rate reduction can already be achieved using 4 modes); and (ii) the adoption of different block sizes affects the access frequency according to a linear model: more than a 2.5 % complexity increase for each additional mode.
Table 4: Coding performance and memory complexity analysis for 1/8 vs. 1/4 pixel resolution (test cases 8 vs. 7)

                    MD     FOR1   FOR2   MC
∆PSNR (dB)          +0.07  0.0    -0.01  +0.04
bit reduction (%)   -4.7   -0.6   -1.5   +11.7
∆ accesses/s (%)    +14.3  +15.6  +15.7  +14.7
Table 5: Coding performance and memory complexity analysis for different block sizes

                    MD                 FOR1               FOR2               MC
Block sizes         4 vs. 1  7 vs. 1   4 vs. 1  7 vs. 1   4 vs. 1  7 vs. 1   4 vs. 1  7 vs. 1
Test cases          2 vs. 1  3 vs. 1   2 vs. 1  3 vs. 1   2 vs. 1  3 vs. 1   2 vs. 1  3 vs. 1
∆PSNR (dB)          +0.06    +0.07     +0.16    +0.20     +0.14    +0.17     +0.06    +0.10
bit reduction (%)   +4.1     +4.9      +13.6    +16.8     +13.8    +16.2     +4.7     +6.4
∆ accesses/s (%)    +10.4    +18.1     +10.6    +18.3     +11.3    +19.8     +10.2    +17.6
Table 6: Coding performance and memory complexity analysis for Hadamard tool (test cases 7 vs. 6)

                    MD     FOR1   FOR2   MC
∆PSNR-Y (dB)        +0.02  +0.11  +0.12  +0.08
bit reduction (%)   -0.4   -1.0   -2.4   0.0
∆ accesses/s (%)    +20.8  +20.9  +22.1  +18.9
Table 7: Coding performance and memory complexity analysis for RD-Lagrangian optimisation (test cases 6 vs. 5)

                    MD      FOR1    FOR2    MC
∆PSNR-Y (dB)        +0.35   +0.29   +0.27   +0.14
bit reduction (%)   +9.3    +3.1    +6.7    +1.8
∆ accesses/s (%)    +119.7  +117.8  +117.7  +126.6
Table 8: Coding performance and memory complexity analysis for B-frames

                    MD                   FOR1                 FOR2                 MC
B frames            1 vs. 0   2 vs. 1    1 vs. 0   2 vs. 1    1 vs. 0   2 vs. 1    1 vs. 0   2 vs. 1
Test cases          10 vs. 9  12 vs. 13  10 vs. 9  12 vs. 10  10 vs. 9  12 vs. 10  10 vs. 9  12 vs. 10
∆PSNR (dB)          +0.04     -0.05      -0.11     -0.19      -0.10     -0.06      -0.15     -0.08
bit reduction (%)   +3.5      -4.2       +9.8      +1.3       +7.9      -2.1       +4.2      +0.5
∆ accesses/s (%)    +12.2     +5.3       +1.5      -16.4      +1.4      -3.5       -2.5      -5.7

5.1.3 Hadamard

As Table 6 shows, using the Hadamard transform improves the PSNR (the gain varies with the input video from 0.02 to 0.12 dB) but can increase the bit-rate (up to 2.4 % at low and middle rates, while MC is not affected). The access frequency increase amounts to roughly 20 % for all test inputs.

5.1.4 RD-Lagrangian optimisation

As expected from the state of the art [16], the use of RD-Lagrangian optimisation for motion estimation and coding mode decisions improves PSNR (up to 0.35 dB) and bit-rate (up to 9 % bit saving). However, the noticeable improvement comes with a data transfer increase in the order of 120 % (see Table 7). The performance vs. cost trade-off when using RD techniques for motion estimation and coding mode decisions inherently depends on the other tools used. For instance, when applied to a basic configuration with 1 reference frame and only a 16x16 block size, the complexity increase was less than 40 % while keeping performance improvements similar to those in Table 7.

5.1.5 B-frames

The use of B pictures decreases the bit-rate by up to 10 %. Greater bit-rate savings are typical when adopting B frames in basic codec configurations, as more data redundancy is left than in complex ones (like those in Table 8). The influence of B frames on the access frequency depends on the test case and varies from -16 to +12 %. The use of multiple B frames increases the bit-rate in some configurations. Moreover, a larger number of B pictures increases the latency of the system (B frames need backward and forward reference pictures to be reconstructed). Consequently, this tool is not used in low-latency constrained video applications and is typically not supported in baseline standard profiles [1,4,12].
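The latency effect of B pictures can be sketched by the reordering they force between display and decode order; this is our own simplified illustration (real AVC reference-list management is richer):

```python
def decode_order(frame_types):
    # Each run of B frames is deferred until its forward reference (the
    # next I or P picture) has been emitted -- that deferral is the extra
    # latency the text refers to.
    pending_b, out = [], []
    for idx, t in enumerate(frame_types):
        if t == "B":
            pending_b.append(idx)
        else:
            out.append(idx)
            out.extend(pending_b)
            pending_b.clear()
    out.extend(pending_b)
    return out

# Display order I B B P B B P: the decoder must receive each P before
# the B frames that precede it on screen.
order = decode_order(list("IBBPBBP"))
```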
5.1.6 CABAC

Table 9 illustrates the positive impact of using CABAC for entropy coding instead of the basic UVLC: up to 16 % bit reduction. The achieved gain is lower than the 30-40 % claimed in [17], but is in line with the analysis in [20]. This confirms that the performance of the new tools depends on the adopted codec configuration and that the effects of the different tools are not additive. As a matter of fact, the percentage gain is greater for basic configurations where the other tools are off and the video data still feature a high correlation. For codec configurations in which a number of tools are already on, the residual data correlation is lower and hence the further achievable gain is less noticeable. CABAC entails an access frequency increase from 25 to 30 %.

5.1.7 Search Range

The size of the search range (8, 16 or 32) has a negligible impact on PSNR and bit-rate performance, as is clear from Table 10. These results refer to middle-rate applications (FOR2), but similar findings are obtained for the other tests. Increasing both the number of reference frames and the search size leads to a higher access frequency, up to 60 times, for all test cases (see Table 11). MD figures are not shown since they are similar to the FOR1 ones.

5.1.8 Multi Reference Frames

The use of multiple reference frames improves PSNR and bit-rate (Table 12). Typically, the greater the number of frames used, the greater the achieved gain. The improvement depends on the test case. It is practically negligible at low and middle rates, where a maximum 2 % bit saving is measured when passing from 1 to 5 frames. Good results can be achieved (14 % bit saving) for high-rate video such as MC. In all cases most of the improvement is obtained when passing from 1 to 3 frames rather than from 3 to 5 frames. Thus the gain vs. frame number is not linear and the use of many reference pictures should be avoided. Indeed, adopting multiple reference frames increases the access frequency according to a linear model: a 25 % complexity increase for each added picture.
Table 9: Coding performance and memory complexity analysis for CABAC (test cases 9 vs. 8)

                    MD     FOR1   FOR2   MC
∆PSNR-Y (dB)        0.0    +0.04  +0.02  +0.06
bit reduction (%)   +16.4  +7.7   +10.3  +11.7
∆ accesses/s (%)    +25.3  +25.5  +25.7  +30.7
Table 10: PSNR-Y and Bit-rate results vs. search range (FOR2)

Search size       32 vs. 16  16 vs. 8   16 vs. 8   32 vs. 8   16 vs. 8
Test cases        11 vs. 10  14 vs. 15  10 vs. 13  11 vs. 13  1 vs. 0
∆PSNR-Y (dB)      0.0        0.0        +0.02      +0.02      0.0
% bit reduction   0.0        +1.5       +0.5       +0.5       -0.2
Table 11: Accesses/s (x10^9) vs. number of reference frames and search range

               FOR1                    FOR2                     MC
Search size    8      16     32        8      16      32        8      16     32
5 frames       19.98  29.04  65.78     80.37  116.26  258.01    45.68  63.43  134.26
1 frame        1.18   3.00   10.47     4.60   11.64   40.95     2.75   6.86   23.36

Table 12: Coding performance and memory complexity analysis for multi reference frames

                    MD                 FOR1               FOR2               MC
No. frames          3 vs. 1  5 vs. 1   3 vs. 1  5 vs. 1   3 vs. 1  5 vs. 1   3 vs. 1  5 vs. 1
Test cases          4 vs. 3  5 vs. 3   4 vs. 3  5 vs. 3   4 vs. 3  5 vs. 3   4 vs. 3  5 vs. 3
∆PSNR-Y (dB)        +0.01    +0.02     +0.10    +0.15     +0.12    +0.15     +0.13    +0.16
bit reduction (%)   -1.8     -1.5      +0.6     +1.7      +1.2     +2.6      +12.1    +13.9
∆ accesses/s (%)    +57.4    +110.6    +58.0    +111.2    +58.7    +112.6    +54.0    +99.7
Table 13: Peak memory usage (Mbytes) vs. number of reference frames and search size

               QCIF                           CIF
Search size    1 frame  3 frames  5 frames    1 frame  3 frames  5 frames
32             5.76     10.69     15.60       10.87    18.87     26.87
16             2.92     5.09      7.14        8.03     13.06     18.41
8              2.19     3.64      4.98        7.31     11.94     16.25
Table 14: Access frequency increase due to 1/8 resolution, B frames, CABAC and RD-Lagrangian techniques MD + 11.9 + 6.3 -1.0 +29.1 +10.2
1/8-resolution RD-Lagrangian CABAC B-frames (1 vs. 0) B-frames (2 vs.1)
FOR1 +31.7 + 4.3 +3.1 + 15.8 + 1.1
The results of the test cases demonstrate that the peak memory usage depends on the reference frame number and on the search range according to a linear model (see Table 13) while the effect of the other tools is negligible. Moreover, as the results for QCIF inputs (MD and FOR1) and CIF inputs (MC and FOR2) are the same, the memory usage depends only on the video format, not on the scene features. 6.
6.2
TOOL-BY-TOOL DECODER ANALYSIS
Peak Memory Usage
The decoder memory usage depends only on the number of reference frames according to a linear model and is practically independent from the search size (as instead for the encoder) and the other tools. The cost of 1 additional QCIF reference frame is roughly 80 Kbytes while that for 1 additional CIF reference frame is 310 Kbytes. As for the encoder, the peak memory usage depends on the video format but not on the scene features. These results and those in Fig. 6 refer to test cases in which the size of the reference frame buffer in the decoder is adaptively set according to the one used by the encoder. However, in real implementations in which the decoder frame buffer is fixed by the chosen profile, the memory usage does not depend on the input video but is fixed by the worst case (i.e. the maximum number of decodable reference frames).
A tool-by-tool detailed exploration for the decoder is addressed in this section focusing on data transfer (Section 6.1), storage requirements (Section 6.2) and decoding speed (Section 6.3).
Data Transfer Analysis
6.1
MC + 32.8 - 0.1 +11.8 +12.0 -0.4
10 and 9 (1 vs. 0 B frame), 12 and 10 (2 vs. 1 B frames). The use of a higher pixel accuracy (1/8 instead of 1/4) increases the access rate from 12 to 32 % depending on the input. The use of Lagrangian cost functions causes an average increase of 5 % for middle and low–rates while high-rate video (MC) are not affected. The access frequency increase due to CABAC is up to 12 %: the higher the bit-rate, the higher the increase. The influence of B frames on the data transfer complexity increase varies with the test case from 11 to 29 %.
Peak Memory Usage
5.2
FOR2 +28.8 + 4.6 +5.1 +11.1 +0.6
The main tools affecting decoder complexity are 1/8 resolution and B-frames. A minor complexity increase is due to CABAC and RD techniques while the contribution of the other tools seems negligible. Table 14 shows the data transfer complexity (measured as % access frequency increase) due to the adoption of 1/8-resolution, RDLagrangian techniques, CABAC and B-frames. The percentages are obtained comparing test cases 8 and 7 (1/8 resolution), 6 and 5 (RD-optimisation), 9 and 8 (CABAC), 41
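The claimed linearity of the peak memory usage in the number of reference frames can be sanity-checked directly against the Table 13 measurements. The following Python sketch (ours, not part of the JM reference software) verifies that, for every video format and search size, the successive per-frame memory increments stay within roughly 10 % of each other:

```python
# Sanity check of the linear peak-memory model (Table 13): at a fixed
# search size, memory should grow almost linearly with the number of
# reference frames, i.e. successive increments should be nearly constant.

# Peak memory usage in Mbytes, from Table 13:
# {(format, search size): [1 frame, 3 frames, 5 frames]}
TABLE_13 = {
    ("QCIF", 32): [5.76, 10.69, 15.60],
    ("QCIF", 16): [2.92, 5.09, 7.14],
    ("QCIF", 8):  [2.19, 3.64, 4.98],
    ("CIF", 32):  [10.87, 18.87, 26.87],
    ("CIF", 16):  [8.03, 13.06, 18.41],
    ("CIF", 8):   [7.31, 11.94, 16.25],
}

def linearity_ratio(mem):
    """Ratio of the largest to the smallest increment between consecutive
    measurements (1.0 means perfectly linear growth)."""
    diffs = [b - a for a, b in zip(mem, mem[1:])]
    return max(diffs) / min(diffs)

for key, mem in TABLE_13.items():
    r = linearity_ratio(mem)
    print(key, round(r, 3))
    assert r < 1.15, f"non-linear growth for {key}"
```

A similar check on the decoder figures would use the roughly constant costs of 80 Kbytes (QCIF) and 310 Kbytes (CIF) per additional reference frame reported above.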
Figure 9. PSNR-Y (dB) vs. bit-rate (Kbps)
Figure 10. Encoder complexity for various QP values
Figure 11. Decoder complexity for various QP values
6.3 Time Analysis
To rule out possible file access overhead, the decoded sequence is not written to disk during the time measurements. The access counts, on the contrary, do include these memory transfers, as the reconstructed images need to be sent to a display memory in a real implementation. As in the case of the encoder, the resulting curves are very similar to those of the memory accesses. Nevertheless, the effect of some tools differs somewhat between time and memory accesses. For instance, while the cost of Hadamard is negligible in terms of memory accesses, it increases the decoding time by up to 5 %. The use of B-frames has an important effect on the decoding time: introducing a first B-frame requires an extra 50 % in the very low bit-rate video MD, 35 % in FOR1 and 20 % in FOR2 and MC. The extra time required by a second B-frame is much lower (a few percent). The use of 1/8 resolution increases the decoding time by 13 % (MD) up to 40 % for the higher bit-rates (MC).
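As a rough first-order illustration (our assumption, not a model proposed in the paper), the per-tool time overheads above can be combined multiplicatively to estimate the decoding time of a configuration that enables several tools at once:

```python
# Naive estimate of the decoder slowdown when enabling several tools,
# ASSUMING the measured per-tool overheads combine multiplicatively.
# This independence assumption is ours; real tool interactions may differ.

# Measured decoding-time overheads for the high-rate MC sequence:
# first B-frame +20 %, 1/8-pel resolution +40 %, Hadamard up to +5 %.
OVERHEAD_MC = {"b_frame_1": 0.20, "eighth_pel": 0.40, "hadamard": 0.05}

def combined_factor(overheads, enabled):
    """Multiply (1 + overhead) over every enabled tool."""
    factor = 1.0
    for tool in enabled:
        factor *= 1.0 + overheads[tool]
    return factor

# Enabling the first B-frame together with 1/8-pel resolution on MC:
f = combined_factor(OVERHEAD_MC, ["b_frame_1", "eighth_pel"])
print(round(f, 3))  # ~1.68, i.e. roughly a 68 % slower decode
```

Such an estimate is only a starting point; the measurements in this paper are the authoritative numbers for each tested tool combination.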
7. DEBLOCKING AND HALF-PIXEL RESOLUTION ANALYSIS
To test some low-complexity configurations not included in the basic codec scheme (half-pixel instead of quarter-pixel resolution, and removal of the in-loop deblocking filter), the JM2.1 code was suitably modified. Results for the same test cases as in the previous sections show that restricting motion compensation to half-pixel resolution decreases the compression efficiency by up to 30 %, particularly for complex video inputs. Reducing the pixel resolution is useful only for very low bit-rate sequences (MD) coded with a complex AVC profile: in this case the lower pixel accuracy does not affect the coding efficiency and allows a complexity reduction (in both access frequency and processing time) of 10 % for the encoder and 15 % for the decoder. As concerns deblocking, its use improves both PSNR (up to 0.7 dB) and bit-rate (up to 6 % saving). The complexity overhead is negligible at the encoder side and amounts to up to 6 % (access frequency) and 10 % (processing time) at the decoder side. As shown in the literature [10], a PSNR analysis alone is not sufficient for a fair assessment of the deblocking tool; a subjective analysis is also required. The latter, in addition to the above rate-distortion gain, confirms the effectiveness of including deblocking in the basic standard profile [1].

Figure 12. Decoder memory usage (Mbytes) vs. QP
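For illustration only, the principle behind in-loop deblocking can be sketched as follows. This is not the AVC filter defined in [1], whose adaptive strength decisions are considerably more elaborate; the threshold here is a hypothetical parameter. The idea it demonstrates is the one that matters for the trade-off above: smooth across a block boundary only when the discontinuity is small enough to be a coding artifact, and leave large steps (likely real edges) untouched.

```python
# Minimal illustration of the deblocking principle (NOT the AVC filter):
# a small step across a block boundary is treated as a quantization
# artifact and smoothed; a large step is kept as a genuine image edge.

def smooth_boundary(left_block, right_block, threshold):
    """Average the two pixels adjacent to the boundary when their
    difference is below `threshold` (hypothetical parameter)."""
    p0, q0 = left_block[-1], right_block[0]
    if abs(p0 - q0) < threshold:
        avg = (p0 + q0) // 2
        left_block = left_block[:-1] + [avg]
        right_block = [avg] + right_block[1:]
    return left_block, right_block

# A small blocking artifact (step of 4) is smoothed away...
print(smooth_boundary([100, 102, 104, 106], [110, 110, 111, 112], 8))
# ...while a real edge (step of 60) is left untouched.
print(smooth_boundary([100, 102, 104, 106], [166, 168, 170, 172], 8))
```

The extra decoder accesses reported above come precisely from this per-boundary read-modify-write pattern, applied to every block edge of every reconstructed picture.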
8. PERFORMANCE AND COMPLEXITY ASSESSMENT VS. QP
This section details the impact of different QP values on the analysis addressed in Sections 4 to 7. All the measurements described above were repeated for several QP values (12, 16, 20, 24, 28) besides the fixed ones set in Section 3. Fig. 9 sketches the rate-distortion results, while Figs. 10 and 11 present the encoder and decoder complexity metrics (memory access frequency and processing time). These figures refer to the FOR2 test, covering a range from 100 to 1100 Kbps. Four AVC configurations are considered: simple (case 0), complex-16 (case 10), complex-32 (case 11) and clever (case 18). A similar behaviour is measured for the other test sequences. The rate-distortion results (Fig. 9) show a typical logarithmic behaviour. For all bit-rates the complex configurations achieve at least a 2 dB PSNR increment over the simple one. As expected from Section 5.1, the compression performances with search sizes of 16 and 32 are practically the same. With respect to the MPEG-4 SP standard, the same PSNR performance is achieved with a 50 % reduced bit-rate, enabling full-rate, good-quality video communication over today's wireless and wired networks. For instance, a complex CIF video like Foreman can be transmitted at 25 Hz and 36 dB with less than 300 Kbps, which is compatible with 3G wireless network capabilities. Moreover, the high coding efficiency of AVC allows the insertion of some redundancy in the source coder to improve the transmission robustness over error-prone channels [14,15]. As already shown in Section 4, a proper use of the AVC tools yields nearly the same efficiency as the complex cases with a considerable complexity reduction. Indeed, in the complexity figures (Figs. 10, 11) the implementation cost of case 18 (clever) is close to the simple one, while in Fig. 9 its coding efficiency is close to the complex ones (the difference between the complex and clever curves is below 0.4 dB, and for QP ≥ 16 the same results are achieved). The encoder data transfer and processing time (Fig. 10) are practically independent of QP, while at the decoder side (Fig. 11) this dependency is more noticeable: as expected from the literature, the higher the QP value (and hence the lower the bit-rate), the lower the complexity. Finally, the storage requirements at both the encoder and decoder side (Fig. 12) are not affected by the selected QP.
9. CONCLUSION
The emerging AVC standard aims at improving the compression efficiency (by up to 50 % compared to previous ISO/IEC and ITU-T standards) to enable full-rate, good-quality video communication over a wide range of transmission networks. However, the new coding functionalities come at the expense of an increased implementation cost. The presented complexity and performance analysis of the upcoming standard focuses on memory cost, computational burden and rate-distortion efficiency. A comparison with MPEG-4 Part 2 is also provided. All reported results refer to non-optimised source code; the application of algorithmic and architectural optimisations will therefore lower the absolute complexity values, as was the case for previous video coding standards. In relative terms, however, the encoder complexity increases by more than one order of magnitude between MPEG-4 Part 2 and AVC (when considering the non-optimised specifications) and by a factor of 2 for the decoder. The AVC encoder/decoder complexity ratio is in the order of 10 for a basic configuration and grows up to 2 orders of magnitude for a complex one. As a first step towards addressing the design challenge of a cost-effective AVC implementation, a tool-by-tool analysis was carried out to expose inter-tool dependencies and find the optimal balance between coding performance and complexity. For the encoder, the main bottleneck is the combination of multiple reference frames and large search sizes. These tools have a limited impact on coding efficiency, especially at medium and low bit-rates, but a great one on complexity (up to a factor of 60). Adopting Hadamard causes both a complexity and a bit-rate increase, while a PSNR gain is achieved. The use of 1/8 resolution should be avoided at low bit-rates (there is only a complexity increase without any coding efficiency improvement), but it is an important tool for high-rate applications. RD-Lagrangian techniques give a substantial compression efficiency improvement, but the complexity doubles when the codec configuration entails many coding modes and motion estimation decisions. The adoption of multiple block sizes results in a PSNR and bit-rate improvement; while the complexity increases linearly with the number of modes used, the major part of the gain is already achieved with the first 4 modes. The CABAC entropy coder provides a consistent bit-rate saving (up to 16 %) at the expense of a computational and memory increase (up to 30 %) compared to UVLC. Results for B-frames and deblocking are provided too. It should be noted that the memory and time figures show a consistent behaviour. At the decoder side, the main tools affecting the access frequency and the decoding speed are the use of 1/8 resolution and B-frames. A minor complexity increase is also due to CABAC, Hadamard, deblocking and RD-optimisations. The adoption of multiple reference frames causes a linear increase of the peak memory usage. The simulation results prove that, when combining the new coding features, the implementation complexity accumulates while the global compression efficiency saturates. Thus, a proper use of the AVC tools leads to roughly the same performance as the complex profiles (all tools on) with a considerable complexity reduction (a factor of 6.5 for the encoder and up to 1.5 at the decoder side).

REFERENCES

[1] T. Wiegand (ed.), “Joint Final Committee Draft (JFCD) of joint video specification (ITU-T Rec. H.264, ISO/IEC 14496-10 AVC)”, Joint Video Team of ISO/IEC MPEG and ITU-T VCEG, JVT-D157, Klagenfurt, July 2002
[2] ftp://ftp.imtc-files.org/jvt-experts/reference_software/
[3] H. Schwarz, T. Wiegand, “The emerging JVT/H.26L video coding standard”, IBC 2002, Amsterdam, NL, September 2002
[4] G. Cote, B. Erol, M. Gallant, F. Kossentini, “H.263+: video coding at low bit-rates”, IEEE Trans. on Circuits and Systems for Video Technology, 1998, 8 (7), pp. 849-856
[5] P. Kuhn, W. Stechele, “Complexity analysis of the emerging MPEG-4 standard as a basis for VLSI implementation”, Proc. SPIE, vol. 3309, pp. 498-509, San Jose, CA, January 1998
[6] F. Catthoor et al., “Custom Memory Management Methodology”, Kluwer Academic Publishers, 1998
[7] J. Bormans et al., “Integrating system-level low power methodologies into a real-life design flow”, Proc. IEEE PATMOS '99, pp. 19-28, Kos, Greece, October 1999
[8] A. Chimienti, L. Fanucci, R. Locatelli, S. Saponara, “VLSI architecture for a low-power video codec system”, Microelectronics Journal, 2002, 33 (5), pp. 417-427
[9] T. Meng et al., “Portable video-on-demand in wireless communication”, Proc. of the IEEE, 1995, 83 (4), pp. 659-680
[10] J. Jung, E. Lesellier, Y. Le Maguet, C. Miro, J. Gobert, “Philips Deblocking Solution (PDS), a low complexity deblocking for JVT”, Joint Video Team of ISO/IEC MPEG and ITU-T VCEG, JVT-B037, Geneva, CH, February 2002
[11] L. Nachtergaele et al., “System-level power optimization of video codecs on embedded cores: a systematic approach”, Journal of VLSI Signal Processing, Kluwer Academic Publishers, 1998, 18 (2), pp. 89-111
[12] ISO/IEC JTC1/SC29/WG11 14496-2, “Information technology – Generic coding of audio-visual objects – Part 2: Visual, Amendment 1: Visual Extensions”, N3056, Maui, December 1999
[13] ISO/IEC JTC1/SC29/WG11 14496-5, “Information technology – Generic coding of audio-visual objects – Part 5: Amd. 1: Simulation software”, N3508, Beijing, July 2000
[14] T. Stockhammer, T. Oelbaum, T. Wiegand, “JVT/H.26L video transmission in 3G wireless environments”, Proc. Int. Conf. on Third Generation Wireless and Beyond, pp. 25-30, San Francisco, CA, May 2002
[15] T. Stockhammer, M. Hannuksela, S. Wenger, “H.26L/JVT coding network abstraction layer and IP-based transport”, Proc. IEEE ICIP 2002, pp. 485-488, Rochester, NY, September 2002
[16] G. Sullivan et al., “Rate-distortion optimization for video compression”, IEEE Signal Processing Magazine, 1998, 15 (6), pp. 74-90
[17] D. Marpe, G. Blattermann, G. Heising, T. Wiegand, “Video compression using context-based adaptive arithmetic coding”, Proc. IEEE ICIP 2001, pp. 558-561, Thessaloniki, Greece, October 2001
[18] V. Lappalainen, A. Hallapuro, T.D. Hamalainen, “Optimization of emerging H.26L video encoder”, Proc. IEEE Workshop on Signal Processing Systems, pp. 406-415, Antwerp, Belgium, September 2001
[19] V. Lappalainen, A. Hallapuro, T.D. Hamalainen, “Performance analysis of low bit-rate H.26L video encoder”, Proc. IEEE Conference on Acoustics, Speech and Signal Processing, pp. 1129-1132, Salt Lake City, US, May 2001
[20] A. Joch, F. Kossentini, P. Nasiopoulos, “A performance analysis of the ITU-T draft H.26L video coding standard”, Proc. Packet Video Workshop, Pittsburgh, US, April 2002
[21] K. Abe et al., “H.26L Implementation Evaluation”, ISO/IEC JTC1/SC29/WG11 M7666, Pattaya, December 2001
[22] M. Horowitz, A. Joch, F. Kossentini, A. Hallapuro, “JVT/H.26L decoder complexity analysis”, Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, JVT-C152, Fairfax, May 2002
[23] http://www.imec.be/atomium

A. APPENDIX
Tables A.1 and A.2 detail the activation status of the optional tools with respect to the default configuration. The latter is characterized by a search range of 8, 1 reference frame, UVLC entropy coding, 1/4-pixel resolution, intra prediction by a DC mode plus 8 directional prediction modes, and in-loop deblocking. All P pictures follow the first I picture. When Block sizes = 1 only the 16x16 mode is on; when Block sizes = 4 the 16x16, 16x8, 8x16 and 8x8 modes are on; otherwise all modes are on.
Table A.1: AVC configuration cases – FOR1 and FOR2 test sequences (Y = tool on, N = tool off)

Case number     0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18
Hadamard        N  N  N  N  N  N  N  Y  Y  Y  Y  Y  Y  Y  N  N  N  N  N
1/8 resolution  N  N  N  N  N  N  N  N  Y  Y  Y  Y  Y  Y  N  N  N  N  N
Ref. frames     1  1  1  1  3  5  5  5  5  5  5  5  5  5  5  5  3  2  1
Block sizes     1  1  4  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7
B-frames        N  N  N  N  N  N  N  N  N  N  1  1  2  1  1  1  1  1  1
CABAC           N  N  N  N  N  N  N  N  N  Y  Y  Y  Y  Y  Y  Y  Y  Y  Y
RD-Lagrange     N  N  N  N  N  N  Y  Y  Y  Y  Y  Y  Y  Y  Y  Y  Y  Y  Y
Search range    8  16 16 16 16 16 16 16 16 16 16 32 16 8  16 8  8  8  8

Table A.2: AVC configuration cases – MD and MC test sequences (Y = tool on, N = tool off; cases 0-11 as in Table A.1)

                MD                      MC
Case number     12 13 14 15 16 17 18    12 13 14 15 16 17 18
Hadamard        Y  Y  Y  Y  N  N  Y     Y  Y  N  N  Y  N  N
1/8 resolution  Y  Y  Y  N  N  N  N     Y  Y  Y  Y  Y  Y  Y
Ref. frames     1  1  1  1  1  1  1     5  5  3  2  2  3  2
Block sizes     7  7  7  7  7  7  7     7  7  7  7  7  7  7
B-frames        2  1  N  N  N  N  N     2  1  1  1  1  1  1
CABAC           Y  Y  Y  Y  Y  Y  Y     Y  Y  Y  Y  Y  Y  Y
RD-Lagrange     Y  Y  Y  Y  Y  Y  N     Y  Y  Y  Y  N  N  N
Search range    16 8  16 16 16 16 16    8  8  8  8  16 8  8
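The conventions of these tables and of the Appendix text can be captured as plain data. The sketch below uses our own field names (not those of the JM configuration files); the three additional modes enabled beyond Block sizes = 4 are the standard's 8x4, 4x8 and 4x4 sub-partitions:

```python
# Configuration conventions of Tables A.1/A.2 as data (field names are
# ours, not taken from the JM software).

# Default configuration described in the Appendix text:
DEFAULT = {
    "search_range": 8, "ref_frames": 1, "entropy": "UVLC",
    "pel_resolution": "1/4", "hadamard": False, "b_frames": 0,
    "rd_lagrange": False, "block_sizes": 7, "deblocking": True,
}

def enabled_modes(block_sizes):
    """Expand the 'Block sizes' column into the enabled inter modes."""
    modes = ["16x16", "16x8", "8x16", "8x8", "8x4", "4x8", "4x4"]
    if block_sizes == 1:
        return modes[:1]
    if block_sizes == 4:
        return modes[:4]
    return modes  # 7 -> all modes on

# Case 11 of Table A.1 (the most complex configuration, 'complex-32'):
case_11 = dict(DEFAULT, search_range=32, ref_frames=5, entropy="CABAC",
               pel_resolution="1/8", hadamard=True, b_frames=1,
               rd_lagrange=True)

print(enabled_modes(1), enabled_modes(4))
```

Encoding the cases this way makes it straightforward to regenerate any test configuration in Tables A.1 and A.2 as an overlay on the default.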