Tuning the M-coder to improve Dirac’s Entropy Coding

HENDRIK EECKHAUT

BENJAMIN SCHRAUWEN, MARK CHRISTIAENS, JAN VAN CAMPENHOUT
Parallel Information Systems (PARIS), Electronics and Information Systems (ELIS)
Ghent University (UGent)
St.-Pietersnieuwstraat 41, 9000 Gent, Belgium

[email protected]

http://www.elis.ugent.be/resume

Abstract: The Dirac codec is a new prototype video coding algorithm from BBC R&D based on wavelet technology. In terms of compression the algorithm is broadly competitive with state-of-the-art video codecs, but its execution time is currently poor. One of the largest bottlenecks is the arithmetic coder. We therefore investigated this part and replaced it with a faster variant, the M-coder, which is also used in the H.264/AVC codec. This resulted in a convincing speedup. Using a faster (read: simpler) coder of course has an impact on the compression performance, but because we carefully profiled Dirac’s entropy coding statistics and tuned the initialisation and adaptation parameters, the impact on compression is very limited.

Key-Words: Dirac video codec, arithmetic coding, M-coder

1 Introduction

In January 2003, BBC R&D produced a prototype video coding algorithm, “Dirac” [1], based on wavelet technology. The algorithm originally targeted high definition video but has been further developed to optimise it for Internet streaming resolutions, and seems broadly competitive with state-of-the-art video codecs. The main philosophy behind the Dirac codec is ‘keep it simple’, which is an ambitious aim since video codecs, particularly those with state-of-the-art performance, tend to be fearsomely complex. The Dirac codec has been developed as a research tool, not a product, as a basis for further developments. An experimental version of the code, written in C++, was released under an Open Source licence agreement on 11th March 2004. For our experiments we used the source code of version 0.5.1 of the Dirac codec, the most recent version at the time of writing.

Dirac’s weakest point is currently its execution time. Real-time decoding is limited to smaller resolutions, even for the latest processors. In this paper we focus on the entropy coding part of the algorithm, which is responsible for a large portion of the decoding time. Certainly for higher bit rates the arithmetic coder is an obstinate bottleneck. Hardware acceleration with FPGAs could solve this [2], but is for the time being not a generally applicable solution. We solve the problem by equipping Dirac with a less computationally intense arithmetic coder which is also used in H.264/AVC [3]. This results in a much shorter execution time. The arithmetic coder is specially tuned to work optimally for typical data streams from Dirac. This way the quality penalty due to the use of an arithmetic coder with many approximations is reduced.

This paper is organized as follows: In Section 2 we introduce the Dirac codec and describe its main parts. In Section 3 we describe the arithmetic coder of the Dirac codec. We pinpoint its main problem

and propose another arithmetic coder and report on the benefits and disadvantages. Next the arithmetic coder is tuned to work optimally inside the Dirac codec. The multiple techniques used to accomplish this are elaborated in Section 4. Section 5 presents the compression and timing results of the new coder versus the original version. Finally, Section 6 summarizes our findings and concludes this work.

2 The Dirac Algorithm

Overall, the Dirac codec is a wavelet-based motion-compensated codec. The coder has the architecture shown in Figure 1, whilst the decoder performs the inverse operations. There are four main elements or modules to the coder:

Transform, scaling and quantisation: Transform and scaling involves taking frame data, applying a wavelet transform and scaling the coefficients to perform quantisation. For quantisation a rate-distortion optimisation algorithm is used to strip information from the frame data in a way that results in as little visual distortion as possible.

Motion estimation (ME): ME exploits temporal redundancy in video streams by looking for similarities between adjacent frames. Dirac implements hierarchical motion estimation and defines three types of frame. Intra (I) frames are coded without reference to other frames in the sequence. Level 1 (L1) frames and Level 2 (L2) frames are both inter frames, that is, they are coded with reference to other previously coded frames. The difference between L1 and L2 frames is that L1 frames are also used as temporal references for other frames, whereas L2 frames are not.

Motion compensation (MC): MC is the inverse operation of ME. It involves using the motion vectors to predict the current frame in such a way as to minimise the cost of encoding the residual data (the difference between the actual and the motion-estimated frame). A trade-off is made between accuracy and motion vector bit rate. Dirac uses Overlapped Block-based Motion Compensation (OBMC) to avoid block-edge artifacts, which would be expensive to code using wavelets. Dirac also provides sub-pixel motion compensation with motion vectors to 1/8th pixel accuracy.

Entropy coding (EC): EC is illustrated in Figure 2 and consists of three stages: binarization, context modelling and arithmetic coding (AC). The purpose of the first stage is to provide a bit stream with easily analysable statistics that can be encoded using AC, which can adapt to those statistics, reflecting any local statistical features. The context modelling in Dirac is based on the principle that whether a coefficient is small (or zero, in particular) or not is well predicted by its neighbours and its parents. AC performs lossless compression and is further elaborated in Section 3. The same infrastructure is applied to quantised transform coefficients and to motion vector (MV) data.

Figure 1: High-level overview of the Dirac encoder [1].

3 Arithmetic Coding

Arithmetic coding is an entropy coding technique which assigns fractional-length code words to the input symbols, provided that a good estimate of the input distribution function (IDF) of each successive input symbol is available. An arithmetic encoder reaches its optimal compression ratio when the estimated IDF equals the actual IDF. In fact, with perfect estimates, it is possible to reach the Shannon entropy limit of coding efficiency (with an infinite-precision arithmetic coder) [9, 10].

Figure 2: Entropy coding block diagram [1].

3.1 Dirac’s Arithmetic Coder

A traditional technique to estimate the IDF of each input symbol is to use a frequency table. Each time an input symbol takes on a certain value, the frequency count corresponding to that value is increased. The probability that the next symbol will take on a certain value is then estimated by its relative occurrence in the frequency table. The frequency table is renormalised at regular intervals to prevent overflow and to allow the tracking of non-stationary IDFs. This classic technique is used in the Dirac codec.

To improve the IDF estimation, context modelling is used: the symbol stream is split into several sub-streams with similar statistical properties, which improves estimation performance. For example, in Dirac, motion vectors and the various wavelet subband types are coded separately. In the Dirac codec all context IDF estimates are initialized with equal probability for 1 and 0.

Dirac’s AC implementation [1, 2] is computationally very expensive since, in order to obtain an estimate for the probability of a value, a division operation has to be performed. Also, the AC itself uses a multiplication to construct the fractional code words. The decoding speed of the Dirac codec is currently the most critical point to be considered; we will thus focus on solving the computational issue without compromising the compression performance too much.

3.2 A Faster Arithmetic Coder

Because Dirac is currently too slow, we have replaced the Dirac AC with an AC which introduces several approximations (which will degrade the compression performance) but which has a much better computational performance: the M-coder. The arithmetic codec used by H.264/AVC [3] (also called the M-coder) was inspired by the Q-coder [4, 5, 6] and MQ-coder [7] but is faster and achieves better compression than state-of-the-art MQ-coders [8] (used for example in JPEG2000 [7]). It avoids the overhead incurred by frequency counting by estimating the IDF using a finite state machine (FSM). The state of the FSM is updated each time a new input symbol arrives. With each state an estimate for the IDF of the next input symbol is associated. The M-coder is multiplication-free due to the use of precalculated tables and quantisation. Because the M-coder only uses lookups, additions and shift operations, its computational complexity is extremely low compared to classic implementations of ACs.
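As a concrete illustration, the classic frequency-table estimation used by Dirac's AC can be sketched as follows. This is a minimal sketch: the starting counts and the renormalisation interval are assumptions for illustration, not Dirac's actual constants.

```python
class FreqEstimator:
    """Classic frequency-count estimate of P(next bit = 1), with
    periodic renormalisation to track non-stationary statistics.
    (Illustrative sketch; the constants are hypothetical.)"""

    def __init__(self, renorm_limit=1024):
        self.counts = [1, 1]          # start from the uniform estimate 0.5
        self.renorm_limit = renorm_limit

    def p_one(self):
        # estimate by relative occurrence -- this per-symbol division
        # is exactly the cost the M-coder avoids
        return self.counts[1] / (self.counts[0] + self.counts[1])

    def update(self, bit):
        self.counts[bit] += 1
        if self.counts[0] + self.counts[1] > self.renorm_limit:
            # halving (rounding up) prevents overflow and ages old data
            self.counts = [(c + 1) // 2 for c in self.counts]
```

After a run of identical symbols the estimate tracks the skewed statistics, while renormalisation keeps the counters bounded.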

4 Tuning the M-coder

The standard M-coder does not perform that well inside the Dirac codec (compare ‘AVC’ with ‘Original’ in Figures 3, 4 and 5). This is because the M-coder was originally developed for the H.264/AVC codec and is tuned to its statistics. In this section we present several techniques with which we adapted the M-coder so that it is specially tuned for the expected bit streams of the Dirac codec.
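For contrast with Dirac's divide-and-multiply inner loop, the multiplication-free interval update that makes the M-coder cheap (Section 3.2) can be sketched as follows. The table contents below are illustrative stand-ins, not the real H.264/AVC tables:

```python
# Sketch of a multiplication-free interval update in the style of the
# M-coder: the current range is quantised to a few bits, which index a
# precomputed table of LPS subrange widths. The 4x4 table below is a
# made-up illustration, NOT the real H.264/AVC rangeTabLPS.
RANGE_TAB_LPS = [
    [128, 104, 86, 70],   # one row per FSM state (here: 4 toy states)
    [96, 78, 64, 52],
    [64, 52, 43, 35],
    [32, 26, 21, 17],
]

def subranges(state, rng):
    """Return (lps_subrange, mps_subrange) for a 9-bit range `rng`."""
    q = (rng >> 6) & 3          # quantise the range to 2 bits: no division
    r_lps = RANGE_TAB_LPS[state][q]
    return r_lps, rng - r_lps   # MPS gets the remainder: one subtraction
```

The whole per-symbol cost is one shift, one mask, one table lookup and one subtraction, which is why the coder reduces to lookups, additions and shifts.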

4.1 Predicted 1-Probabilities

Figure 3: The absolute difference in PSNR of the different arithmetic coders for 30 CIF-frames of the “Mobile” sequence. Original uses the original Dirac algorithm; New uses the optimized M-coder with 256 states and α = 0.96; NoInit uses the optimized M-coder (256 states and α = 0.96) without initialization; AVC uses the original M-coder (of H.264/AVC, with 128 states and α = 0.9492); and New128 uses an optimized M-coder with only 128 states (α = 0.96).

Figure 4: The absolute difference in PSNR of the different arithmetic coders for 27 720 × 576 frames of the “Snowboard” sequence. The labels are identical to those of Figure 3. Notice the logarithmic scale of the X-axis.

Figure 5: The absolute difference in PSNR of the different arithmetic coders for 27 1280 × 720 frames of the “Snowboard” sequence. The labels are identical to those of Figure 3. Notice the logarithmic scale of the X-axis.

Every state of the IDF-estimating FSM corresponds to a predicted probability of the next symbol being a 1. The original M-coder starts with a set of predefined predicted 1-probabilities (chosen empirically to follow an exponential curve with decay rate α) and tries to construct an FSM that will generate these predictions as correctly as possible. This forward technique does not work very well because it does not take into account the quantisations introduced by the FSM. It also does not make use of the a-priori knowledge of the statistical process generating the input stream.
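The forward construction described above can be sketched as follows. The state counts (128 vs. 256) and decay rates (α = 0.9492 vs. α = 0.96) are those reported in this paper; the direct mapping from state index to probability is a simplifying assumption for illustration:

```python
# Forward construction of the predefined predicted 1-probabilities:
# an exponentially decaying set p_k = 0.5 * alpha**k. Halving the
# state count accounts for the MPS/LPS symmetry of the states; this
# mapping is illustrative, not the exact M-coder derivation.
def predicted_probabilities(num_states, alpha):
    levels = num_states // 2        # probability levels, mirrored for MPS/LPS
    return [0.5 * alpha ** k for k in range(levels)]

avc = predicted_probabilities(128, 0.9492)   # original M-coder parameters
new = predicted_probabilities(256, 0.96)     # tuned variant from this paper
```

More states combined with a slower decay reach smaller probabilities: min(new) ≈ 0.5 · 0.96^127 is well below min(avc) ≈ 0.5 · 0.9492^63, consistent with the trade-off discussed in Section 4.4.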

We will use a reverse approach: we first generate an FSM and then try to find optimal predicted 1-probabilities for each of the states. This process is split into two parts: first we generate a non-stationary statistical model of the input process; secondly we make a non-stationary Markov model of the FSM. When we combine both parts and use them with a model of the AC, we are able to generate optimal predicted 1-probabilities.

The non-stationary statistical model of the input process is generated by computing the sample average of a large number of benchmark input streams (generated by encoding a large number of benchmark movies, here over 10 minutes of mixed benchmark CIF movies). This results in a series of Bernoulli distributions (1 − pn, pn), one for each time step n. Here pn is set equal to the average number of ones received by the FSM at time step n, averaged over the different benchmark streams.

The generation of the non-stationary Markov model from the FSM is straightforward: the states and state transitions are modeled by a matrix. The initial state and this matrix can then be used to calculate the probability that the FSM is in a certain state at a certain time, given a non-stationary model of the input process.

Next we create a model for the expected compression attained by the M-coder. This is done by combining the model of the input process and the non-stationary Markov model (a model for the FSM) with an optimal entropy coder (a model for the arithmetic coder). This optimal entropy coder is a coder that is capable of compressing up to the Shannon entropy limit. This approximation is acceptable as it is well known that arithmetic encoders can reach the entropy limit when infinitely many symbols are encoded, when the input process is stationary and when a good estimate of the IDFs of the input symbols is available [9, 10]. This cost function is used to generate analytically optimal values for the predicted

1-probabilities for each of the FSM’s states. Note that all of the above calculations are done only once when generating the FSM for Dirac. Once this is done no speed penalty whatsoever is present in the M-coder, only a few constants are redefined. Note also that in this step the FSM itself (states and state transitions) is left unchanged.
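The two-part cost model described above can be sketched as follows: propagate the FSM's state-occupancy distribution under the non-stationary Bernoulli input model, and charge each state the Shannon-ideal code length for its predicted 1-probability. This is a minimal sketch; the transition table, predicted probabilities and input model are hypothetical stand-ins, not the real M-coder tables.

```python
import math

def expected_bits(next_state, pred, init_state, p_ones):
    """Expected ideal code length of a stream under the model.
    next_state[s][bit] -> successor state; pred[s] -> predicted P(1)
    in state s; p_ones[n] -> actual P(1) of the input at time step n."""
    occ = [0.0] * len(pred)          # state-occupancy distribution
    occ[init_state] = 1.0
    total = 0.0
    for p in p_ones:
        new_occ = [0.0] * len(pred)
        for s, mass in enumerate(occ):
            if mass == 0.0:
                continue
            q = pred[s]
            # Shannon-ideal code length of one symbol coded with estimate q
            total += mass * -(p * math.log2(q) + (1 - p) * math.log2(1 - q))
            # propagate occupancy along the 1- and 0-transitions
            new_occ[next_state[s][1]] += mass * p
            new_occ[next_state[s][0]] += mass * (1 - p)
        occ = new_occ
    return total
```

The same cost model also supports the initial-state search of Section 4.2: `min(range(len(pred)), key=lambda s: expected_bits(next_state, pred, s, p_ones))` enumerates all initial states and keeps the cheapest.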

4.2 FSM Initialisation

Whenever the FSM starts to process an input stream, it would be ideal if the estimated 1-probability of the initial state were as close as possible to the actual 1-probability of the underlying statistical process at the start of the input stream. Therefore we search for the optimal initial state of the FSM given a non-stationary model of the input process. The initial state is found using the model created in the previous subsection. All possible initial states are tried and the one with the best compression ratio (according to the analytical model) is chosen. This enumeration can be performed very fast because an analytical model of the input and of the FSM was constructed.

In [11] it is stated that the gain of correctly initialising the AC is modest, but in Figures 3, 4 and 5 (‘New’ vs. ‘NoInit’) one can see that initialisation does have a clearly defined effect, especially at lower bit rates (when the influence of the start of the stream is more important) and for bigger image sizes. Since the cost of proper initialisation is almost zero – the AC has to be initialised anyway – it is a welcome improvement.

4.3 More Context Models

All the above discussions talked about one input stream or input process. As was stated earlier, the input stream is split into several sub-streams which have similar statistical properties. This results in better predictions, and thus in improved compression. These sub-streams are called ‘context models’. The original Dirac coder used 16 context models for subband data and 50 context models for motion vector data. All of these context models are used simultaneously. Note that the original Dirac codec initialises all context model predictions to 0.5.

As in [12] we increased the total number of context models. We introduced a total of 482 context models, which are almost the same models as those from the original Dirac algorithm: 50 models for motion vector data and 16 models for subband data, but now the subband models are also dependent on the color channel (Y, U, V), the frame type (I, L1, L2) and the wavelet band type (DC, LF, others). Note that only 16+50 of them are used simultaneously; these are the 16+50 contexts used in the current color channel, frame and wavelet band type. A context model has thus become one of the 16+50 old Dirac models plus an estimated IDF initialisation which is dependent on color, frame and band type.

4.4 Adaptation Rate: the α-Parameter and the Number of States

Two important parameters of the estimating FSM have yet to be chosen: the number of states and the α-parameter, which is used for the transition generation of the FSM. The combination of both parameters defines the smallest predictable 1-probability, the ‘learning rate’ or ‘tracking speed’, and the variance of the prediction. No analytical method is known to optimise these parameters, so enumeration is the only option. The original M-coder uses 128 states. All the above techniques are applied to an FSM with 128 and 256 states. As Figures 3, 4 and 5 show, 256 states perform clearly better, especially for large image sizes. An optimisation curve for the α-parameter for 256 states can be seen in Figure 6. Note that H.264/AVC uses α = 0.9492 and 128 states, while we use α = 0.96 and 256 states. This results in an FSM with a slower tracking speed, but also in an FSM that can estimate smaller probabilities and has less variance on the prediction.

5 Results

In Figure 7 we plotted the average PSNR for decoding 30 frames of the “Mobile” sequence (CIF@25Hz) at different bit rates. The PSNR of each frame is calculated as follows. If Yi,j, Ui,j and Vi,j and Y′i,j, U′i,j and V′i,j are resp. the luminance and the two chrominance


Figure 6: Optimization curve of α for compressing the “Mobile” (CIF) sequence at 30 dB for an FSM of 256 states.

Figure 7: PSNR vs. bitrate for 30 CIF-frames of the “Mobile” sequence.

channels of the original and reconstructed frame of h × w pixels, then the PSNR is defined as follows:

PSNR = 10 log10 [ 255² · (3/2)hw / ( Σi,j (Yi,j − Y′i,j)² + Σi,j (Ui,j − U′i,j)² + Σi,j (Vi,j − V′i,j)² ) ]   (1)

The PSNR of the original and the new algorithm is almost the same. The original AC is slightly better, but the difference is never larger than 0.5 dB. This is even better illustrated in Figure 3. The new coder gives very satisfactory results and is clearly better than the unadjusted M-coder of H.264/AVC.

We also measured the decoding time per frame for the same bit rates as Figure 7 in Figure 8. The measurements were done on an AMD Athlon 64 3500+ processor. As can be seen in Figure 8, the original algorithm never reached real-time decoding (< 0.04 s/frame). The M-coder is real-time for bit rates up to 10 million bits per second (∼ 43 dB PSNR). Notice the exponential behaviour of the original implementation versus the near-linear behaviour of the M-coder. This is easily explained by the fact that the original algorithm needs a multiplication and a division for each symbol. The new version only uses lookups, additions and shift operations.

We repeated the same experiment for 27 frames of another sequence with a larger resolution (720 × 576 pixels): the “Snowboard” sequence, which is available on the Dirac website. The difference in PSNR is now smaller (Figure 9 and Figure 4). The compression is now even better for some bit rates, while the difference in decoding time is still much better for the new arithmetic coder (Figure 10). It is more than two times faster for most bit rates.

Finally we also measured the coding performance and decoding time for a high definition sequence (“Snowboard”) of 1280 by 720 pixels. The two PSNR-curves of Figure 11 are now nearly indistinguishable. In Figure 5 we can clearly see that the compression is now even slightly better for most bit rates. This is a very good result, because the M-coder introduces a lot of approximations compared to the original Dirac AC. But due to the tuning it is possible to overcome these approximations and even overtake the original algorithm. The new decoder is again at least two times faster for most bit rates (Figure 12), which confirms the previous conclusions. Note, however, that for the big frame sizes real-time decoding is still not possible, because other major parts of the Dirac codec still need optimization.
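The per-frame metric of Eq. (1) above can be sketched in code as follows, assuming 4:2:0 sampling so that the three planes together hold (3/2)·h·w samples. Plane arguments are plain lists of lists; the names are illustrative.

```python
import math

def psnr(y, u, v, y2, u2, v2):
    """PSNR between original (y, u, v) and reconstructed (y2, u2, v2)
    planes per Eq. (1). Assumes 4:2:0 sampling: U and V are quarter
    size, so the total sample count is (3/2) * h * w."""
    def sse(a, b):
        # sum of squared errors over one plane
        return sum((p - q) ** 2
                   for ra, rb in zip(a, b) for p, q in zip(ra, rb))
    h, w = len(y), len(y[0])
    total = sse(y, y2) + sse(u, u2) + sse(v, v2)
    return 10 * math.log10(255 ** 2 * 1.5 * h * w / total)
```

Identical frames would make the denominator zero (infinite PSNR), so in practice the function is only applied to lossy reconstructions.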


Figure 8: Decoding time per frame vs. bitrate for 30 CIF-frames of the “Mobile” sequence.

Figure 10: Decoding time per frame vs. bitrate for 27 (720 × 576)-frames of the “Snowboard” sequence.


Figure 9: PSNR vs. bitrate for 27 (720 × 576)-frames of the “Snowboard” sequence.

Figure 11: PSNR vs. bitrate for 27 (1280 × 720)-frames of the “Snowboard” sequence.


Figure 12: Decoding time per frame vs. bitrate for 27 (1280 × 720)-frames of the “Snowboard” sequence.

6 Conclusions

Replacing the original arithmetic coder of the Dirac algorithm with an accurately configured M-coder resulted in a much faster video codec. The speedup is all the clearer as more symbols are decoded; for very high bit rates the new arithmetic decoder is three times faster. Because we initialised the different context models and perfected the different lookup tables, we were able to closely approximate or even outperform the original compression performance, even though significant approximations were introduced in the arithmetic coder.

Acknowledgement: This research is supported by I.W.T. grant 020174, F.W.O. grant G.0021.03, and by GOA project 12.51B.02 of Ghent University.

References

[1] Davies T. The Dirac Algorithm. http://dirac.sourceforge.net/documentation/algorithm/, 2005.

[2] Bleackley P.J. Hardware for Arithmetic Coding. Tech. rep., BBC R&D, 2005. URL http://www.bbc.co.uk/rd/pubs/whp/whp112.shtml.

[3] Marpe D., Schwarz H., and Wiegand T. Context-Based Adaptive Binary Arithmetic Coding in the H.264/AVC Video Compression Standard. IEEE Transactions on Circuits and Systems for Video Technology, vol. 13(7), July 2003, pp. 620–636.

[4] Pennebaker W.B., Mitchell J.L., Langdon G.G.J., and Arps R.B. An overview of the basic principles of the Q-Coder adaptive binary arithmetic coder. IBM Journal of Research and Development, vol. 32(6), November 1988, pp. 717–726.

[5] Pennebaker W.B. and Mitchell J.L. Probability Estimation for the Q-Coder. IBM Journal of Research and Development, vol. 32(6), November 1988, pp. 737–752.

[6] Mitchell J.L. and Pennebaker W.B. Optimal hardware and software arithmetic coding procedures for the Q-Coder. IBM Journal of Research and Development, vol. 32(6), November 1988, pp. 727–736.

[7] Taubman D.S. and Marcellin M.W. JPEG2000: Image Compression Fundamentals, Standards and Practice. ISBN 0-7923-7519-X. Kluwer Academic Publishers, Dordrecht, The Netherlands, 2002.

[8] Marpe D., Schwarz H., Blättermann G., Heising G., and Wiegand T. Context-Based Adaptive Binary Arithmetic Coding in JVT/H.26L. Proc. IEEE International Conference on Image Processing (ICIP’02), vol. 2, September 2002, pp. 513–516.

[9] Rissanen J.J. Generalized Kraft Inequality and Arithmetic Coding. IBM Journal of Research and Development, vol. 20(3), May 1976, pp. 198–203.

[10] Rissanen J.J. and Langdon Jr. G.G. Arithmetic Coding. IBM Journal of Research and Development, vol. 23(2), March 1979, pp. 149–162.

[11] Langdon G. and Rissanen J. Compression of black-white images with arithmetic coding. IEEE Transactions on Communications, vol. 29(6), 1981, pp. 858–867.

[12] Eeckhaut H., Devos H., Schrauwen B., Christiaens M., and Stroobandt D. A Hardware-Friendly Wavelet Entropy Codec for Scalable Video. In DATE 2005 Designers’ Forum Proceedings, March 2005.
