Subband Adaptive Motion Compensated Temporal Filtering for Scalable Video Coding

Ruiqin Xiong 1, Jizheng Xu 2, Feng Wu 2, Shipeng Li 2

1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, 100080
2 Microsoft Research Asia, Beijing, China, 100080

Abstract. This paper proposes a subband adaptive motion compensated temporal filtering (MCTF) technique for scalable video coding. In scalable video coding schemes, hierarchical MCTF is widely adopted to exploit the temporal correlation across frames, owing to its good energy compaction and its ability to provide efficient SNR scalability. In this hierarchical MCTF structure, the temporal correlation between neighboring frames has different strength at different MCTF levels. Furthermore, the temporal correlation exploited by motion compensation differs in strength across spatial frequency components. When SNR scalability is present, the reference frames at the decoder are variable random signals: reconstruction noise is added to the references used at the encoder. According to the correlation and noise characteristics of each spatial subband between the reference frame and the target frame, we can adjust the strength of MCTF to take maximum advantage of temporal correlation while restricting the propagation of uncorrelated reconstruction noise. In this way an adaptive MCTF scheme is formed, and the proposed technique improves the performance of scalable video coding by up to 0.5 dB.

Index Terms—scalable video coding, motion compensated temporal filtering, temporal correlation

1. INTRODUCTION

In video coding, motion compensated inter-frame prediction, which exploits temporal correlation, is the key component whose efficiency most influences the overall coding performance. Multi-level motion compensated temporal filtering (MCTF) with a dyadic temporal decomposition structure is widely employed in various 3D-wavelet-based video coding schemes, such as [1]~[5]. Owing to its good energy compaction and its ability to provide efficient SNR scalability, the MCTF framework has also been incorporated into the recent state-of-the-art scalable video coding scheme, the scalable extension of H.264/AVC [6][7], with some possible variations such as omitting the update step. It has been shown that the hierarchical MCTF structure yields a significant performance gain over the traditional hybrid closed-loop predictive coding structure of H.264/AVC [6][7].

In the hierarchical MCTF structure, the temporal correlation between neighboring frames has different statistical characteristics at different MCTF levels. The correlation is usually strong at high frame-rate (HFR) MCTF levels but relatively weak at low frame-rate (LFR) MCTF levels, due to the difference in frame interval. In addition, the temporal correlation exploited by motion compensation differs in strength across spatial frequency components: it is usually strong for low-frequency components but weak for high-frequency components, and this diversity is especially evident at LFR MCTF levels. This motivates us to differentiate the various signal components in MCTF according to their correlation characteristics. It is reasonable to apply strong temporal filtering to low spatial frequency components and/or at HFR MCTF levels, but weak or no filtering to high spatial frequency components and/or at LFR MCTF levels.

Furthermore, in the context of scalable video coding with MCTF, the uncertainty of the reference frames in the decoder-side motion-compensated prediction structure should be taken into consideration. Generally, in the inverse MCTF of the reconstruction process, the frames used as references in motion compensation are variable and not identical to the ones used at the encoder. This is an intrinsic characteristic of efficient SNR-scalable video coding. Decoder-side reference frames have diverse signal-to-noise ratio (SNR) values across spatial frequency components, which motivates us to adjust the MCTF strength of each spatial frequency component according to its relative noise amplitude. By investigating the above correlation and noise characteristics between reference frames and target frames, an adaptive MCTF scheme can be formed that not only takes advantage of temporal correlation but also reduces the propagation of uncorrelated reconstruction noise.

The rest of this paper is organized as follows. Section 2 reviews MCTF decomposition and reconstruction in scalable video coding. Section 3 analyzes the correlation and noise characteristics of inter-frame prediction in MCTF. Section 4 describes the proposed adaptive MCTF algorithm. Experimental results are given in Section 5, and Section 6 concludes the paper.

2. MCTF IN SCALABLE VIDEO CODING

In video coding with MCTF, input frames are decomposed into lowpass frames and highpass frames through a motion compensated prediction step (1) and an update step (2):

$$H(k) = F(2k+1) - P(k) \qquad (1)$$
$$L(k) = F(2k) + U(k) \qquad (2)$$

Here $P(k)$ and $U(k)$ are the prediction and update signals, respectively. The most frequently used wavelet filters in MCTF are the Haar and 5/3 filters, defined in (3) and (4) respectively, where the operator $W$ denotes motion compensation:

$$P(k) = W(F(2k)), \quad U(k) = W(H(k))/2 \qquad (3)$$
$$P(k) = \{W(F(2k)) + W(F(2k+1))\}/2, \quad U(k) = \{W(H(k-1)) + W(H(k))\}/4 \qquad (4)$$

To further compact the energy of the video sequence, the output lowpass frames can be decomposed iteratively by subsequent MCTF levels. Fig. 1 illustrates the hierarchical multi-level MCTF decomposition structure. The update steps are sometimes omitted to reduce complexity, with minor performance loss.

Figure 1: The hierarchical multi-level MCTF structure: the input frames $F(k)$ are decomposed into highpass frames $H_1(k)$ and lowpass frames $L_1(k)$; the lowpass frames are further decomposed into highpass frames $H_2(k)$ and lowpass frames $L_2(k)$.
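As a concrete illustration of the lifting structure, the following minimal sketch applies one level of Haar MCTF to a frame pair and then inverts the lifting to verify perfect reconstruction. Motion compensation $W$ is replaced by the identity (zero motion) purely to keep the example self-contained, so this is a sketch of the filter structure under that simplifying assumption, not of the paper's full codec.

```python
import numpy as np

def haar_mctf_analysis(f_even, f_odd):
    """One level of Haar MCTF with W = identity (zero motion).

    H(k) = F(2k+1) - P(k),  P(k) = W(F(2k))
    L(k) = F(2k)   + U(k),  U(k) = W(H(k)) / 2
    """
    p = f_even            # prediction signal W(F(2k)), zero motion assumed
    h = f_odd - p         # highpass frame (prediction step)
    u = h / 2.0           # update signal W(H(k)) / 2
    l = f_even + u        # lowpass frame (update step)
    return l, h

def haar_mctf_synthesis(l, h):
    """Inverse lifting: recover the even/odd frames from L(k), H(k)."""
    u = h / 2.0
    f_even = l - u        # F(2k)   = L(k) - U(k)
    p = f_even            # P(k)    = W(F(2k)), zero motion assumed
    f_odd = h + p         # F(2k+1) = P(k) + H(k)
    return f_even, f_odd

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f0 = rng.standard_normal((64, 64))
    f1 = f0 + 0.1 * rng.standard_normal((64, 64))      # temporally correlated pair
    L, H = haar_mctf_analysis(f0, f1)
    r0, r1 = haar_mctf_synthesis(L, H)
    print(np.allclose(r0, f0), np.allclose(r1, f1))    # perfect reconstruction
```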

In the reconstruction process at the decoder, the prediction and update steps (1) and (2) are inverted by (5) and (6), and perfect reconstruction can easily be achieved. Here $\hat{X}$ denotes the reconstruction of a signal $X$:

$$\hat{F}(2k) = \hat{L}(k) - \hat{U}(k) \qquad (5)$$
$$\hat{F}(2k+1) = \hat{P}(k) + \hat{H}(k) \qquad (6)$$

For non-scalable video coding, it always holds that $\hat{P}(k) = P(k)$ and $\hat{U}(k) = U(k)$. However, in the context of scalable video coding with MCTF, an embedded bitstream is generated for each temporal subband frame. The embedded bitstream may be truncated arbitrarily during transmission, so its availability at the decoder is not always guaranteed. In one category of coding schemes, both encoder and decoder employ the same base-quality reference frames for motion compensation, as in the FGS of MPEG-4 [8]; this leads to inefficient SNR-scalable coding performance. In another category of coding schemes, the encoder always uses reference frames of original quality while the decoder uses reference frames of as high quality as it can reconstruct. This open-loop structure is friendly to efficient SNR scalability. Generally, from the point of view of the reconstruction process (6) at the decoder, the inverse MCTF predicts the target frame $F(2k+1)$ from reconstructed reference frames $\hat{F}(2l)$ ($l = k, k+1$), which are variable random signals rather than deterministic ones, with reconstruction noise added to the references $F(2l)$ used at the encoder. This is an intrinsic characteristic of efficient SNR-scalable video coding.

3. CORRELATION AND NOISE CHARACTERISTICS IN MCTF

Temporal correlation among neighboring frames is exploited in video coding, but it cannot be measured directly; in practice the correlation is realized through motion compensated prediction. Therefore, it is reasonable to use the correlation between the target frame and its prediction as an indicator of temporal correlation, as defined in (7):

$$\rho = \mathrm{corr}\big(F(2k+1),\, P(k)\big) \qquad (7)$$

Here $\mathrm{corr}(A, B)$ denotes the correlation between random signals $A$ and $B$. The value of $\rho$ indicates the correlation that is exploited in motion compensation. It depends on the frame interval, the motion, and the MC techniques employed in coding. In the hierarchical MCTF structure, the temporal correlation is usually strong at high frame-rate (HFR) MCTF levels but relatively weak at low frame-rate (LFR) MCTF levels, due to the larger frame interval and the motion that occurs within it. Fig. 2 shows an example of a target frame and its MC prediction at an LFR MCTF level.

Figure 2: An example of motion compensated prediction: (a) target frame, (b) prediction frame.
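A small sketch of estimating the correlation in (7), using synthetic arrays as stand-ins for $F(2k+1)$ and $P(k)$; it also shows how additive reconstruction noise on the decoder-side reference (the noise levels here are arbitrary illustration values) lowers the correlation that the decoder can actually exploit.

```python
import numpy as np

def corr(a, b):
    """Sample correlation coefficient between two frames (eq. 7)."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

rng = np.random.default_rng(1)
target = rng.standard_normal((64, 64))                # stands in for F(2k+1)
pred = target + 0.3 * rng.standard_normal((64, 64))   # stands in for P(k)

print("rho(F, P) =", round(corr(target, pred), 3))
# Decoder-side references carry reconstruction noise: P_hat = P + N.
for sigma_n in (0.2, 0.5, 1.0):                       # illustrative noise levels
    noisy = pred + sigma_n * rng.standard_normal(pred.shape)
    print("rho(F, P+N), sigma_N =", sigma_n, "->", round(corr(target, noisy), 3))
```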

Furthermore, the temporal correlation exploited by motion compensation differs in strength across spatial frequency components. When matching two frames with a large interval and sharp motion, it is easy to match the DC component but difficult to match texture well. Also, from the relation (8) between the prediction error $D$ and the estimation error $\delta$ of the motion vector [9],

$$D \approx \frac{1}{(2\pi)^2}\iint S_x(\omega)\,(\omega^T\delta)^2\, d\omega \qquad (8)$$

we can see that high-frequency signal components are more sensitive to the accuracy of the motion field. Therefore, when the estimated motion vector is not accurate enough, the prediction error of motion compensation is likely to contain more high-frequency than low-frequency signal. In addition, due to the limitations of the widely used block-based MC technique, some high-frequency artifacts are produced in the prediction images, especially at block boundaries.

Therefore, it is reasonable to analyze the temporal correlation of each spatial frequency component separately. Let $S$ be a subband decomposition operator and $S_i(F(k))$ the $i$-th subband of frame $F(k)$. We define the correlation of subband $i$ as

$$\rho_i = \mathrm{corr}\big(S_i(F(2k+1)),\, S_i(P(k))\big) \qquad (9)$$

as illustrated in Fig. 3. Experimental data show that the correlation is usually strong for low-frequency components but weak for high-frequency components, and this diversity is especially evident at LFR MCTF levels.
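As a quick numerical illustration of the frequency weighting in (8), using an arbitrary half-pel motion error purely for concreteness: for $\delta = (0.5, 0)^T$, the weight applied to a spectral component at $\omega = (\omega_x, 0)^T$ is

$$(\omega^T\delta)^2 = 0.25\,\omega_x^2,$$

so a component near $\omega_x = \pi$ contributes $(\pi/(\pi/4))^2 = 16$ times more to $D$ than a component at $\omega_x = \pi/4$, even though both experience exactly the same motion error.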

Figure 3: Subband-based correlation calculation. The target frame $F(2k+1)$ and the prediction frame $P(k)$ are each decomposed by $S$ into a 4×4 grid of subbands, and $\rho_i = \mathrm{corr}(S_i(F(2k+1)), S_i(P(k)))$ is computed between each pair of corresponding subbands $S_i(F(2k+1))$ and $S_i(P(k))$.
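The per-subband correlation measurement in (9) can be sketched as follows. The two-level average/difference split into a 4×4 grid of 16 subbands is a simplified stand-in for the DWT used in the paper, the frames are synthetic, and the (i, j) band indexing here does not match the Band(x,y) layout of Fig. 3 and Table 1.

```python
import numpy as np

def haar_split(x):
    """One-level 2-D Haar-style split (average/difference) into LL, LH, HL, HH quarters."""
    a = (x[0::2, :] + x[1::2, :]) / 2.0   # vertical average
    d = (x[0::2, :] - x[1::2, :]) / 2.0   # vertical difference
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, lh, hl, hh

def subbands_4x4(x):
    """Two-level uniform split into 16 subbands, keyed by (first-level, second-level) index."""
    bands = {}
    for i, quad in enumerate(haar_split(x)):          # first level: 4 quadrants
        for j, sub in enumerate(haar_split(quad)):    # second level: 4 subbands each
            bands[(i, j)] = sub
    return bands

def corr(a, b):
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

rng = np.random.default_rng(2)
# Synthetic, spatially smooth "target" and a noisier "prediction" stand-in.
target = np.cumsum(np.cumsum(rng.standard_normal((64, 64)), 0), 1)
pred = target + 5.0 * rng.standard_normal((64, 64))

tb, pb = subbands_4x4(target), subbands_4x4(pred)
rho = {idx: corr(tb[idx], pb[idx]) for idx in tb}     # eq. (9) per subband
for idx in sorted(rho):
    print(idx, round(rho[idx], 2))   # low-frequency bands correlate much more strongly
```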

In addition, the signal-to-noise ratio (SNR) of the reference frames reconstructed at the decoder also differs across spatial subbands. Compared with low-frequency subbands, high-frequency subbands usually contain less energy but suffer comparable or larger quantization noise at the decoder. Therefore, the SNR is usually high in lowpass subbands but low in highpass subbands.

4. SUBBAND ADAPTIVE MCTF

Based on the above observations, we propose an adaptive motion compensated prediction technique for MCTF. The algorithm adjusts the strength of MCTF for each spatial subband according to its correlation and noise characteristics. Let $F(x,y)$ be the target frame, $P(x,y)$ the prediction frame produced by traditional motion compensation, and $N(x,y)$ the reconstruction noise of $P$ at the decoder. $N$ is a random signal that is unknown at the encoder, but it can be assumed to be independent of $F$ and $P$, to have zero mean, and to have a variance that is predictable at the encoder for a certain bit-rate range.

Figure 4: Flowchart of the proposed algorithm: motion estimation between the target frame $F(2k+1)$ and the reference frames $F(2l)$, traditional motion compensation producing the prediction frame $P(k)$, correlation-noise analysis yielding $\mathbf{A} = \arg\min_{\mathbf{A}} D$, and subband adaptive motion compensated prediction $\mathbf{H} = \mathbf{F} - \mathbf{A}\mathbf{P}$.

To differentiate signal components with different correlation-noise characteristics, an invertible subband decomposition $S$ is used to analyze the above signals: $\mathbf{F} = S(F) = (F_1, F_2, \dots, F_n)^T$, $\mathbf{P} = S(P) = (P_1, P_2, \dots, P_n)^T$, $\mathbf{N} = S(N) = (N_1, N_2, \dots, N_n)^T$. We define a diagonal scaling matrix $\mathbf{A}_{n \times n} = \mathrm{diag}\{\alpha_1, \alpha_2, \dots, \alpha_n\}$. The traditional motion compensated prediction (1) at the encoder is modified to

$$\mathbf{H} = S(H) = (H_1, H_2, \dots, H_n)^T = \mathbf{F} - \mathbf{A}\mathbf{P} \qquad (10)$$

The motion compensated prediction error $\mathbf{E}$ at the decoder is

$$\mathbf{E} = \mathbf{F} - \mathbf{A}\hat{\mathbf{P}} = \mathbf{F} - \mathbf{A}(\mathbf{P} + \mathbf{N}) \qquad (11)$$

The scaling matrix is selected to minimize the mean square prediction error $D$, i.e., $\mathbf{A} = \arg\min D$ with

$$D = E[\mathbf{E}^T \mathbf{E}] = E\big[(\mathbf{F} - \mathbf{A}(\mathbf{P}+\mathbf{N}))^T (\mathbf{F} - \mathbf{A}(\mathbf{P}+\mathbf{N}))\big] = \sum_{i=1}^{n} E\big[F_i - \alpha_i (P_i + N_i)\big]^2 \qquad (12)$$

Here $E[\cdot]$ denotes expectation. Setting $\partial D / \partial \alpha_i = 0$ for $i = 1, 2, \dots, n$ leads to

$$\alpha_i = \frac{E[F_i (P_i + N_i)]}{E[(P_i + N_i)^2]} = \frac{E[F_i P_i]}{E[P_i^2] + E[N_i^2]} = \frac{E[F_i P_i] / E[P_i^2]}{1 + E[N_i^2] / E[P_i^2]} \qquad (13)$$

Equation (13) is the statistically optimal solution for the scaling matrix $\mathbf{A}$. It reflects the correlation and noise characteristics of $F$ and $P$ in two ways.

1) When there is no reconstruction noise on $P$, we have $E[N_i^2]/E[P_i^2] = 0$ and the scaling factor becomes $\alpha_i = E[F_i P_i] / E[P_i^2]$. Using $m_X$ and $\sigma_X^2$ to denote the mean and variance of a random variable $X$, we usually have, for spatial highpass subbands, $\sigma_{F_i}^2 \approx \sigma_{P_i}^2 = \sigma^2$ and $m_{F_i}^2 \approx m_{P_i}^2 \ll \sigma^2$. With these approximations,

$$\alpha_i = \frac{m_{F_i} m_{P_i} + \rho_i \sigma_{F_i} \sigma_{P_i}}{m_{P_i}^2 + \sigma_{P_i}^2} \approx \frac{\rho_i \sigma_{F_i}}{\sigma_{P_i}} \approx \rho_i \qquad (14)$$

That is, the scaling factor is mainly determined by the correlation parameter $\rho_i$.

2) When there is reconstruction noise on $P$, $\alpha_i$ is reduced by the factor $1 + E[N_i^2]/E[P_i^2]$, where $E[N_i^2]/E[P_i^2]$ is the noise-to-signal ratio of $P$.

Therefore, the principle of the proposed adaptive technique is to apply strong filtering to subbands with strong correlation but to reduce the filtering strength for subbands with large noise. The flowchart of the proposed adaptive MCTF algorithm is illustrated in Fig. 4.
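A minimal sketch of the scaling-factor computation (13) and the adaptive prediction step (10). The subband signals and the decoder-side noise variance $E[N_i^2]$ are synthetic placeholders here; in the paper they come from the subband decomposition and the correlation-noise analysis stage.

```python
import numpy as np

def scaling_factor(f_i, p_i, noise_var_i):
    """alpha_i = E[F_i P_i] / (E[P_i^2] + E[N_i^2])   (eq. 13)."""
    e_fp = float(np.mean(f_i * p_i))
    e_pp = float(np.mean(p_i * p_i))
    return e_fp / (e_pp + noise_var_i)

def adaptive_prediction(f_bands, p_bands, noise_vars):
    """Subband adaptive prediction H_i = F_i - alpha_i * P_i   (eq. 10)."""
    alphas, h_bands = [], []
    for f_i, p_i, nv in zip(f_bands, p_bands, noise_vars):
        a = scaling_factor(f_i, p_i, nv)
        alphas.append(a)
        h_bands.append(f_i - a * p_i)
    return alphas, h_bands

rng = np.random.default_rng(3)
# Two toy subbands: one well predicted, one weakly predicted.
f_lo = rng.standard_normal(1024); p_lo = f_lo + 0.1 * rng.standard_normal(1024)
f_hi = rng.standard_normal(1024); p_hi = 0.3 * f_hi + rng.standard_normal(1024)

for nv in (0.0, 0.5):   # without / with an assumed decoder-side noise variance
    alphas, _ = adaptive_prediction([f_lo, f_hi], [p_lo, p_hi], [nv, nv])
    print("E[N^2] =", nv, "-> alphas =", [round(a, 2) for a in alphas])
```

The printed factors stay close to 1 for the strongly correlated subband, shrink for the weakly correlated one, and shrink further for both once reconstruction noise is assumed, which is exactly the behavior (13) prescribes.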

5. EXPERIMENTAL RESULTS

To test the adaptive MCTF algorithm proposed in this paper, we have conducted experiments on many standard MPEG test sequences. The input sequences are of CIF size at 30 fps. Multi-level MCTF is performed on the input video, and the temporal subbands are transformed and bit-plane coded to generate embedded bitstreams. For the correlation-noise characteristic analysis, a discrete wavelet transform (DWT) with the structure illustrated in Fig. 3 is employed as the subband decomposition S; each frame is divided into 16 subbands. For each subband, the correlation and noise characteristics are analyzed and the scaling matrix is calculated as defined in (13). For the noise calculation, a middle reconstruction bit rate within the testing bit-rate range is used.

Table 1 shows the experimental results for Football, averaged over all frames. They verify the observations made in Section 3. Please refer to [10] for the relationship between a wavelet subband index and its position in the original signal spectrum. The performance with and without the proposed adaptive MCTF technique is compared in Fig. 5. The proposed algorithm is marked as "SA" (subband adaptive MCTF) and the baseline is marked as "w/o SA". The results in Fig. 5 show that the proposed adaptive MCTF scheme based on correlation and noise characteristics improves coding performance by 0.1~0.5 dB.

Table 1: Correlation and noise characteristics for Football.

(x,y) index of Band i | Correlation of Fi and Pi (30Hz / 15Hz / 7.5Hz) | SNR of Pi in dB (30Hz / 15Hz / 7.5Hz)
Band(0,0) | 0.99 / 0.98 / 0.96 | 25.60 / 27.96 / 29.06
Band(0,1) | 0.92 / 0.77 / 0.69 | 14.16 / 16.04 / 16.94
Band(1,0) | 0.95 / 0.87 / 0.78 | 15.52 / 17.75 / 18.84
Band(1,1) | 0.79 / 0.58 / 0.37 |  8.28 /  9.87 / 10.35
Band(0,3) | 0.78 / 0.53 / 0.36 |  9.38 / 11.51 / 12.30
Band(0,2) | 0.57 / 0.36 / 0.19 |  5.00 /  7.99 /  9.01
Band(1,3) | 0.67 / 0.45 / 0.24 |  5.56 /  7.45 /  7.79
Band(1,2) | 0.49 / 0.28 / 0.12 |  1.44 /  4.06 /  4.57
Band(3,3) | 0.56 / 0.35 / 0.16 |  2.57 /  4.93 /  5.30
Band(3,2) | 0.35 / 0.21 / 0.08 | -1.34 /  1.83 /  2.39
Band(2,2) | 0.30 / 0.17 / 0.07 | -1.90 /  1.66 /  2.17
Band(2,2) | 0.16 / 0.08 / 0.02 | -4.65 / -0.72 / -0.22

Figure 5: Coding performance of adaptive MCTF: (a) Crew, (b) Football, (c) Soccer, each comparing "w/o SA" and "SA" at 30 Hz, 15 Hz, and 7.5 Hz.

6. CONCLUSIONS AND FUTURE WORK

In this paper, an adaptive MCTF scheme is proposed. It adjusts the strength of temporal filtering of each spatial subband at each MCTF level, according to the correlation and noise characteristics of that subband. Strong filtering is applied to subbands with strong correlation, and the filtering is weakened for subbands with weak correlation or with large noise. In this way the scheme takes advantage of temporal correlation while restraining the propagation of uncorrelated reconstruction noise. The proposed adaptive MCTF technique improves coding performance by up to 0.5 dB.

The proposed subband adaptive MCTF technique changes the filtering strength of each spatial subband, so the temporal synthesis structure of that spatial subband is also changed. Therefore, the rate allocation for each spatial-temporal subband should be adjusted accordingly; this is left as future work.

REFERENCES

1. J.-R. Ohm, "Three-dimensional subband coding with motion compensation," IEEE Trans. Image Processing, vol. 3, no. 5, pp. 559-571, Sept. 1994.
2. S.-J. Choi and J. W. Woods, "Motion-compensated 3-D subband coding of video," IEEE Trans. Image Processing, vol. 8, no. 2, pp. 155-167, Feb. 1999.
3. J. Xu, Z. Xiong, S. Li, and Y.-Q. Zhang, "Three-dimensional embedded subband coding with optimal truncation (3D ESCOT)," Applied and Computational Harmonic Analysis, vol. 10, pp. 290-315, 2001.
4. A. Secker and D. Taubman, "Lifting-based invertible motion adaptive transform (LIMAT) framework for highly scalable video compression," IEEE Trans. Image Processing, vol. 12, no. 12, pp. 1530-1542, Dec. 2003.
5. L. Luo, F. Wu, S. Li, Z. Xiong, and Z. Zhuang, "Advanced motion threading for 3D wavelet video coding," Signal Processing: Image Communication, vol. 19, no. 7, pp. 601-616, 2004.
6. H. Schwarz, D. Marpe, and T. Wiegand, "Scalable extension of H.264/AVC," ISO/IEC JTC1/SC29/WG11 M10569/S03, Munich, 2004.
7. H. Schwarz, D. Marpe, and T. Wiegand, "MCTF and scalability extension of H.264/AVC," Proc. PCS 2004, San Francisco, CA, USA, Dec. 2004.
8. W. Li, "Overview of fine granularity scalability in MPEG-4 video standard," IEEE Trans. Circuits and Systems for Video Technology, vol. 11, no. 3, pp. 301-317, Mar. 2001.
9. A. Secker and D. Taubman, "Highly scalable video compression with scalable motion coding," IEEE Trans. Image Processing, vol. 13, no. 8, pp. 1029-1041, Aug. 2004.
10. D. Yu and J. B. Ra, "Fine spatial scalability in wavelet based image coding," Proc. ICIP 2005, pp. 862-865, Genoa, Italy, Sep. 2005.