Scalable video compression using longer motion compensated temporal filters

Abhijeet Golwelkar and John W. Woods
Center for Next Generation Video and ECSE Department
Rensselaer Polytechnic Institute, Troy, NY 12180-3590, USA

ABSTRACT

Three-dimensional (3-D) subband/wavelet coding using a motion compensated temporal filter (MCTF) is emerging as a very effective structure for highly scalable video coding. Most previous work has used two-tap Haar filters for the temporal analysis/synthesis. To make better use of the temporal redundancies, we propose an MCTF scheme based on longer biorthogonal filters. We show a lifting-based coder capable of subpixel accurate motion compensation. If we retain the fixed-size GOP structure of the Haar-filter MCTFs, we need to use symmetric extensions at both ends of the GOP. This gives rise to a loss of coding efficiency at the GOP boundaries, resulting in significant PSNR drops there. This performance can be considerably improved by using a 'sliding window' in place of the GOP block. We employ the 5/3 filter, whose non-orthogonality causes PSNR variation; this variation can be reduced by employing filter-based weighting coefficients. Overall, the longer filters have a higher coding gain than the Haar filters and show significant improvement in average PSNR at high bit rates. However, a doubling in the number of motion vectors to be transmitted translates to a drop in PSNR at the lower video bit rates.

KEYWORDS: Subband/wavelet coding, motion compensation, lifting, boundary artifacts, HVSBM
1. INTRODUCTION
Over the last decade, the telecommunication industry has been experiencing a digital revolution. One of the main advantages of digital technology is the ease with which it provides a diverse range of services over a common network. While digital data and voice communication have long been around, it is now the turn of digital video. Due to advances in network technology, a single workstation may serve as a personal computer, a high-definition TV, a videophone, or a fax machine. Today the main transmission medium is the computer network, which is often a heterogeneous environment consisting of a diverse mixture of subnets and network users. In this situation, scalable transmission of video is essential to serve different clients with widely varying display and processing capabilities. Scalability refers to the ability of an algorithm to decode a certain part of a video bitstream to obtain video at a desired quality or spatiotemporal resolution.

Since its introduction, subband/wavelet coding has emerged as a powerful method for compressing still images and video. Besides its effectiveness in data compression, the presence of subbands makes this scheme a natural choice for scalability applications. In simple 3-D subband/wavelet schemes, the subband decomposition is extended to the temporal domain as well, and performance can be improved with motion compensation1,4,8. Ohm1 used simple fixed size block matching (FSBM) for estimating the motion field. Both Choi2 and Chen4 used a hierarchical variable size block matching (HVSBM) scheme to obtain further improvement. All of these schemes used 2-tap Haar filters for temporal analysis/synthesis. The low temporal subbands generated in such coders can be sent as part of a low frame-rate sequence.
For these schemes, with sub-pixel motion compensation, frames must be interpolated for the MC temporal filtering (MCTF), and the resulting analysis/synthesis scheme is not invertible. To achieve invertibility for arbitrary subpixel accuracy, a lifting scheme was used3; besides its computational efficiency, this is the big advantage of the lifting scheme. All of these schemes use the 2-tap Haar filter for motion compensated temporal filtering, but with a longer filter we can make better use of the correlation in the temporal domain. One main advantage of Haar filters over longer filters is that a motion field is needed only between every other pair of input frames, as opposed to between every consecutive pair. Hence the initial attempt by Ohm1 to use longer filters did not show much improvement. Secker and Taubman6 used 5/3 filters and bi-directional motion estimates to achieve a 3-D wavelet transform with the lifting approach. Their work showed a potential PSNR improvement from using longer filters instead of Haar filters, but since a bi-directional motion field is needed, the motion vector rate is almost 4 times that of the Haar filters. Xu et al.5 used a motion threading approach with the lifting scheme to apply 5/3 filters along motion trajectories, but did not consider a sub-pixel accurate motion field. In this work, we propose a lifting-based 3-D subband/wavelet coder using 5/3 filters for the temporal filtering and unidirectional motion estimation. In our approach a backward motion field (i.e. the current frame comes before the reference frame) with quarter pixel accuracy, determined using hierarchical variable size block matching (HVSBM)2,4, is used for the MCTF. We estimate and transmit a backward motion field between every pair of consecutive frames and infer the forward motion field from this backward field.
2. MCTF WITH LONGER FILTERS
We expect to take better account of temporal correlation with longer filters that input more than two frames at a time. We first explain the estimation of the motion field and then present a lifting implementation.

2.1 Estimation of motion field

The Haar filters need a motion field only between every other pair of frames. With 5/3 filters, we need a motion field between every pair of consecutive frames, i.e. almost twice as many as for the Haar filters. A lifting implementation of the 5/3 filters needs both a backward and a forward motion field at each frame, which would increase the motion vector rate by almost a factor of 4. Instead, we choose to estimate the forward field from the backward field, as shown in Figure 1, to save this additional overhead, thus settling at twice the motion rate of the Haar case. Figure 1 shows quarter pixel accurate motion vectors in the space-time domain. This model does not assume any particular type of motion estimation, and hence our approach can be extended to any motion estimation scheme.

Both the forward and backward motion vectors start at integer pixel locations. If the backward motion vector (d) points to an integer pixel position, the inferred forward motion vector for that pixel is given by b = -d. But if d points to a subpixel location, we find the nearest integer pixel location and associate it with the motion vector b = -d. The algorithm used for this procedure is given below.

Each pixel in a GOP is classified into one of four classes, (a) START, (b) CONTINUE, (c) SINGLE and (d) END, based upon its position in a given motion field. This classification, if carried out in a particular scanning order, does not require transmission of any additional side information. If two or more motion vectors point to the same location in the next frame, we have multiple options available for the forward motion vector; for these "multi-connected" pixels, we make a decision based on scanning order to avoid transmitting any additional information. If no motion vector in the current frame points to a pixel in the reference frame, we have the case of "unconnected" pixels. If such pixels are present in the last frame of a GOP, we classify them in the SINGLE class; for all other "unconnected" pixels, we resort to the backward motion vector and identify them as in the START class. The CONTINUE class contains the pixels having a bi-directional motion match. The pixels in the END class occur in the last frame of a GOP and hence do not have a backward motion field, but the END class connects via the forward motion field.
Figure 1 Subpixel accurate motion field in x-t domain

The algorithm used to generate the bi-directional motion field at every pixel location is presented below.

Algorithm: Let
    Xt, t = 1, 2, ..., L be the frames in a GOP, where L is the total number of such frames;
    Ct, t = 1, 2, ..., L be the array storing the class label of each pixel in frame Xt, with all pixels initially labeled UNUSED;
    (dm, dn) be the backward motion vector of a pixel in Xt;
    (bm, bn) be the forward motion vector of a pixel in Xt;
    M x N be the dimensions of frame Xt;
and let m + dm denote the nearest integer-valued pixel location (determined in a unique way).
For t = 1 to L-1
    For m = 1 to M
        For n = 1 to N
            If Ct[m,n] = UNUSED:   (this pixel starts a new trajectory)
                Ct[m,n] = START, len = 0
                While t + len < L
                    If Ct+len+1[m + dm, n + dn] = UNUSED
                        Ct+len+1[m + dm, n + dn] = CONTINUE
                        (bm, bn) at [m + dm, n + dn] = (-dm, -dn)
                    else
                        Ct+len[m,n] = END; break out of While loop
                    len = len + 1, m = m + dm, n = n + dn
                End While
        End (for n = 1 to N)
    End (for m = 1 to M)
End (for t = 1 to L-1)
Finally, for all (m,n) in the last frame, if CL[m,n] = UNUSED, set CL[m,n] = SINGLE.

Thus for an integer pixel accurate motion field, this approach is equivalent to forming motion trajectories and doing the temporal filtering along them; the backward and forward motion vectors both align with these trajectories. For a subpixel accurate motion field, the motion trajectories lie on a subpixel grid, and we get only approximate alignment.
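For concreteness, the following is a minimal Python sketch of this classification for a one-dimensional GOP. The function and variable names are ours, not the paper's, and the clipping of trajectories at the frame border is an assumption the paper does not spell out.

import numpy as np

UNUSED, START, CONTINUE, END, SINGLE = 0, 1, 2, 3, 4

def classify_gop(d):
    """Infer pixel classes and the forward motion field from the backward field.

    d[t][m] is the backward motion vector of pixel m in frame t, pointing
    into frame t+1 (the convention above); d covers frames 0..L-2.
    Returns the class array C (L x N) and the inferred forward field b,
    where b[t][m] is set for pixels reached from frame t-1.
    """
    L, N = len(d) + 1, len(d[0])
    C = np.full((L, N), UNUSED, dtype=int)
    b = np.zeros((L, N))
    for t in range(L - 1):
        for m in range(N):
            if C[t, m] != UNUSED:                      # already on a trajectory
                continue
            C[t, m] = START
            tt, mm = t, m
            while tt < L - 1:
                target = int(round(mm + d[tt][mm]))    # nearest integer pixel
                target = min(max(target, 0), N - 1)    # clip at border (assumption)
                if C[tt + 1, target] == UNUSED:
                    C[tt + 1, target] = CONTINUE
                    b[tt + 1, target] = -d[tt][mm]     # inferred forward vector
                else:                                  # multi-connected target
                    C[tt, mm] = END
                    break
                tt, mm = tt + 1, target
    C[L - 1][C[L - 1] == UNUSED] = SINGLE              # unconnected, last frame
    return C, b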
2.3 Generation of the temporal high subbands:
We temporally locate (i.e. time reference) the H subbands at odd positions, so that all their pixels have at least a backward motion vector. If a pixel belongs to class CONTINUE, we can use both the backward (d) and forward (b) motion vectors. For the pixels in the START class we use only the backward motion vector. The SINGLE and END classes consist of pixels in the last frame of the GOP only and hence are part of the temporal low subband. For the subpixel accurate motion field, $\tilde{X}$ denotes an interpolated value. For the 5/3 filters, this procedure can be represented as follows.

If $C_{2t+1}[m,n] = \text{CONTINUE}$:
$$H_t[m,n] = X_{2t+1}[m,n] - 0.5\left(\tilde{X}_{2t}[m-d_m,\,n-d_n] + \tilde{X}_{2t+2}[m-b_m,\,n-b_n]\right)$$

If $C_{2t+1}[m,n] = \text{START}$:
$$H_t[m,n] = X_{2t+1}[m,n] - \tilde{X}_{2t}[m-d_m,\,n-d_n]$$
2.4 Generation of the temporal low subbands:
We place the temporal low subbands at even time positions. If a pixel belongs to class CONTINUE, we can use both the backward (d) and forward (b) motion vectors. For the START class we use only the backward motion vectors, while for the END class we use the forward estimates. For the SINGLE class, we have no motion vectors in either direction.

If $C_{2t}[m,n] = \text{CONTINUE}$:
$$L_t[m,n] = X_{2t}[m,n] + 0.25\left(\tilde{H}_{t+1}[m-d_m,\,n-d_n] + \tilde{H}_t[m-b_m,\,n-b_n]\right)$$

If $C_{2t}[m,n] = \text{START}$:
$$L_t[m,n] = X_{2t}[m,n] + 0.5\,\tilde{H}_{t+1}[m-d_m,\,n-d_n]$$

If $C_{2t}[m,n] = \text{END}$:
$$L_t[m,n] = X_{2t}[m,n] + 0.5\,\tilde{H}_t[m-b_m,\,n-b_n]$$

If $C_{2t}[m,n] = \text{SINGLE}$:
$$L_t[m,n] = X_{2t}[m,n]$$
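To make the predict and update steps of Sections 2.3-2.4 concrete, here is a minimal Python sketch of one temporal analysis level. It is ours, not the paper's implementation: motion compensation and subpixel interpolation are omitted, so interior frames behave as CONTINUE and the window ends behave as the START/END/SINGLE cases.

import numpy as np

def mctf_53_level(frames):
    """One 5/3 temporal lifting level (no motion compensation).

    frames: list of 2-D numpy arrays. Returns (L, H) with the low
    subbands at even times and the high subbands at odd times.
    """
    frames = [np.asarray(f, dtype=float) for f in frames]
    T = len(frames)
    H = []
    # Predict step: high subband at each odd time position.
    for k in range(T // 2):
        x = frames[2 * k + 1]
        if 2 * k + 2 < T:            # CONTINUE: both neighbors available
            H.append(x - 0.5 * (frames[2 * k] + frames[2 * k + 2]))
        else:                        # boundary: backward neighbor only
            H.append(x - frames[2 * k])
    L = []
    # Update step: low subband at each even time position.
    for k in range((T + 1) // 2):
        x = frames[2 * k]
        earlier = H[k - 1] if k >= 1 else None   # H from the preceding pair
        later = H[k] if k < len(H) else None     # H from the following pair
        if earlier is not None and later is not None:   # CONTINUE
            L.append(x + 0.25 * (earlier + later))
        elif later is not None:                         # START (window start)
            L.append(x + 0.5 * later)
        elif earlier is not None:                       # END (window end)
            L.append(x + 0.5 * earlier)
        else:                                           # SINGLE (lone frame)
            L.append(x)
    return L, H

Applying this level recursively to the returned L frames yields the temporal multiresolution pyramid.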
2.5 A 'Sliding Window' Extension:
Thus far, we have considered processing the MCTF one GOP at a time, with symmetric extension of the motion trajectories at the GOP boundaries. Even though this is essential for perfect reconstruction, it produces a considerable PSNR drop at the boundaries, especially at the end where an H coefficient is temporally referenced7.
Figure 2 Temporal multiresolution pyramid for the 'sliding window' approach

Since we are using FIR filters, we do not have to restrict ourselves to a finite GOP size. Instead we can think of the GOP as having infinite size and implement a 'sliding window' approach. For N levels of temporal analysis we need a window of at least 2^N frames. In such an approach, instead of a symmetric extension, we use actual data at both ends of the window. This means we have to 'look ahead', causing a certain amount of delay at the receiver. If we are using a (2L+1)-tap filter, we need the L future frames at each temporal level; hence the longer the filter, the longer the delay. For N levels of temporal resolution, this delay is L x 2^N frames; for example, the 5/3 filter has L = 1, so N = 4 levels imply a look-ahead of 2^4 = 16 frames. By using Haar filters at the last stage of temporal analysis and 5/3 filters for the lower levels, we can limit this additional delay. Figure 2 shows the temporal multiresolution pyramid generated for this case. The rectangular windows indicate the frames currently involved in temporal analysis, where the solid central portion of each window is what is actually transmitted. At the decoder, to reconstruct frames 1-8, we need to wait until we receive window W2 and then do the synthesis.
2.6 Filter based weighting of temporal subbands:
All the temporal subbands generated by the MCTF are spatially analyzed and encoded using EZBC4,8. The goal of EZBC is to reduce the average distortion in a given group of frames by minimizing the sum of the distortions over all the subbands. This strategy works well for orthogonal filters9, but the 5/3 filter departs significantly from orthogonality. It has been shown that for such linear phase non-orthogonal filter banks, we need to properly scale the various subbands. Scaling coefficients were evaluated for the 2-D spatial subband/wavelet transform in [9]; we can use a similar procedure temporally.
Figure 3 Three stage temporal synthesis

For illustrative purposes, consider a non-motion-compensated three-stage temporal analysis/synthesis system, whose decoder is shown in Figure 3. The pixels in the various temporal subbands are filtered by a distinct combination of reconstruction lowpass and highpass filters (i.e. h and g). Thus we have a separate weighting coefficient for each of the temporal subbands t-H, t-LH, t-LLH and t-LLL:

For t-H: $w_3 = \frac{1}{2}\sum_n g_3^2[n]$

For t-LH: $w_2 = \frac{1}{4}\sum_n \left(h_3 * (g_2 \uparrow 2)\right)^2[n]$

For t-LLH: $w_1 = \frac{1}{8}\sum_n \left(h_3 * (h_2 \uparrow 2) * (g_1 \uparrow 4)\right)^2[n]$

For t-LLL: $w_0 = \frac{1}{8}\sum_n \left(h_3 * (h_2 \uparrow 2) * (h_1 \uparrow 4)\right)^2[n]$

where

$$(h \uparrow p)[n] = \begin{cases} h[n/p], & \text{if } n \text{ is a multiple of } p,\\ 0, & \text{otherwise.} \end{cases}$$
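These weights can be computed numerically by building the equivalent synthesis filters via upsampling and convolution. The Python sketch below is ours; the 5/3 synthesis taps shown are one common normalization and are an assumption (the paper does not list its taps), so the values printed need not match the four-level coefficients of Table 1.

import numpy as np

def upsample(h, p):
    """(h up p)[n] = h[n/p] if p divides n, else 0."""
    out = np.zeros(p * len(h) - (p - 1))
    out[::p] = h
    return out

# Assumed synthesis taps for the 5/3 filter bank (one common normalization).
h_syn = np.array([0.5, 1.0, 0.5])                       # lowpass synthesis
g_syn = np.array([-0.125, -0.25, 0.75, -0.25, -0.125])  # highpass synthesis

# Equivalent three-stage synthesis filters (same taps at every stage).
f_tH   = g_syn
f_tLH  = np.convolve(h_syn, upsample(g_syn, 2))
f_tLLH = np.convolve(np.convolve(h_syn, upsample(h_syn, 2)), upsample(g_syn, 4))
f_tLLL = np.convolve(np.convolve(h_syn, upsample(h_syn, 2)), upsample(h_syn, 4))

w3 = 0.5   * np.sum(f_tH**2)     # t-H
w2 = 0.25  * np.sum(f_tLH**2)    # t-LH
w1 = 0.125 * np.sum(f_tLLH**2)   # t-LLH
w0 = 0.125 * np.sum(f_tLLL**2)   # t-LLL
print(w3, w2, w1, w0)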
For four-level temporal analysis/synthesis, the scaling coefficients for the temporal subbands are listed in Table 1. These coefficients do not take into account the possibility of unconnected pixels, which amount to 10-15% at the full frame rate.

Temporal subband   t-LLLL   t-LLLH   t-LLH   t-LH   t-H
Scaling coeff.      1.64     1.00    1.56    1.81   2.83

Table 1 Scaling coefficients for temporal subbands
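One plausible way to use these coefficients, sketched below with our own naming, is to multiply each temporal subband by its weight before spatial analysis and EZBC coding, and to divide it back out after decoding; whether the weights enter this way or directly in the coder's rate allocation is an implementation choice the paper does not detail.

# Table 1 coefficients for the four-level temporal analysis.
WEIGHTS = {"t-LLLL": 1.64, "t-LLLH": 1.00, "t-LLH": 1.56, "t-LH": 1.81, "t-H": 2.83}

def weight_subbands(subbands, inverse=False):
    """Scale each temporal subband frame by its Table 1 coefficient.

    subbands maps a subband name to a list of numpy frames; pass
    inverse=True at the decoder to undo the weighting before synthesis.
    """
    return {name: [f / WEIGHTS[name] if inverse else f * WEIGHTS[name]
                   for f in frames]
            for name, frames in subbands.items()}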
3. EXPERIMENTAL RESULTS
The PSNR plots for the first 96 frames of Mobile Calendar and Flower Garden are shown in Figure 4, which also lists the average PSNR and its standard deviation (in brackets). We estimate the ¼ pixel accurate motion field using HVSBM2 and implement the Haar-based MCTF with the lifting approach discussed in [3]. The plots for the Haar and 5/3 filters are for 4-stage temporal analysis over a GOP of 16. We can see the PSNR drops at the GOP boundaries; this effect is more pronounced at the beginning of the GOP, where the temporal high subband has its temporal reference. For the 'sliding window' approach we use the 5/3 filter for the first 3 levels and the Haar filter for the last level of temporal analysis. With a 4-level MCTF, the window size for SW is 16, i.e. the same as the GOP size for the Haar and 5/3 filters. The GOP boundary drop in PSNR is now significantly reduced. The plots in Figure 4 employ the scaling coefficients of Table 1, which reduce the standard deviation of the PSNR from 1.39 (without scaling) to 1.06 for Flower Garden and from 1.28 (without scaling) to 1.01 for Mobile Calendar.

Figures 5, 6 and 7 show the rate-distortion plots of the Y component of Mobile Calendar and Flower Garden for integer, half and quarter pixel accurate MCTF, respectively. Both sequences are in CIF resolution, and we use 288 and 240 frames of the two sequences, respectively. Average PSNR values with ¼ pixel accurate MCTF at various rates for all three components are listed in Tables 2 and 3. For these two test sequences we get a significant improvement in PSNR at the higher bit rates; thus the longer filters help in increasing the coding gain. But in our present implementation they also require almost twice the motion information of the Haar filters, which we believe is the main cause of the PSNR deficit at low bit rates. Tables 4 and 5 present the average PSNR values for integer, ½ and ¼ pixel MCTF at a fixed rate of 2048 Kbps. The results of the 'sliding window' approach at integer and ½ pixel MCTF match the results of the Haar filters at ½ and ¼ pixel accuracy, respectively; thus the longer filters achieve the performance of Haar filters that use a more accurate motion field, thereby saving computation. The gain over the Haar filters shrinks as the accuracy of the motion field increases.
4. FUTURE WORK
Currently we encode the motion field at every temporal analysis stage independently. However, the motion fields at the different levels can be estimated and coded jointly. This could yield a more scalable and efficient motion vector coder, giving significant improvement at low bit rates.
Figure 4 PSNR performance for Haar, 5/3 filters on a fixed GOP of 16 and Sliding window approach
Figure 5 PSNR (dB) of Y component vs Rate for Integer pixel accurate motion field
Figure 6 PSNR (dB) of Y component vs Rate for 1/2 pixel accurate motion field
Figure 7 PSNR (dB) of Y component vs Rate for 1/4 pixel accurate motion field
Rate (Kbps)     384            512            768            1024           1536           2048
Scheme       Haar    SW     Haar    SW     Haar    SW     Haar    SW     Haar    SW     Haar    SW
Y (dB)      26.43  25.90   28.11  28.15   30.27  30.63   31.66  32.19   33.57  34.33   35.15  35.89
U (dB)      31.80  29.66   33.55  31.79   35.92  34.94   37.23  36.73   39.18  38.91   40.56  40.24
V (dB)      30.81  28.85   32.68  30.99   35.06  34.13   36.45  35.84   38.50  37.95   39.86  39.08

Table 2 Average PSNR with ¼ pixel accurate MCTF for Mobile Calendar (CIF, 288 frames)
Rate (Kbps)     384            512            768            1024           1536           2048
Scheme       Haar    SW     Haar    SW     Haar    SW     Haar    SW     Haar    SW     Haar    SW
Y (dB)      25.52  23.47   27.52  26.90   29.78  30.32   31.20  32.16   33.18  34.31   34.71  35.86
U (dB)      29.79  27.19   32.03  29.83   34.80  33.51   36.33  35.93   38.33  38.45   39.73  39.50
V (dB)      32.26  30.50   33.87  32.17   36.09  35.05   37.87  36.84   39.92  39.66   41.33  41.10

Table 3 Average PSNR with ¼ pixel accurate MCTF for Flower Garden (CIF, 240 frames)
MV accuracy      Integer          1/2             1/4
Scheme          Haar    SW      Haar    SW      Haar    SW
Y (dB)         29.73  33.41    33.35  35.20    35.15  35.89
U (dB)         35.22  37.33    38.35  39.45    40.56  40.24
V (dB)         34.28  36.23    37.72  38.37    39.86  39.08
Improvement
in Y (dB)         +3.68           +1.85           +0.74

Table 4 Average PSNR at 2048 Kbps for Mobile Calendar (CIF, 288 frames)
MV accuracy      Integer          1/2             1/4
Scheme          Haar    SW      Haar    SW      Haar    SW
Y (dB)         29.58  33.41    32.91  34.94    34.71  35.86
U (dB)         33.85  37.33    37.48  38.61    39.73  39.50
V (dB)         36.78  36.23    39.34  40.24    41.33  41.10
Improvement
in Y (dB)         +3.83           +2.03           +1.15

Table 5 Average PSNR at 2048 Kbps for Flower Garden (CIF, 240 frames)
5. REFERENCES
1. J.-R. Ohm, "Three-dimensional subband coding with motion compensation," IEEE Trans. Image Processing, vol. 3, pp. 559-571, Sept. 1994.
2. S.-J. Choi and J. W. Woods, "Motion-compensated 3-D subband coding of video," IEEE Trans. Image Processing, vol. 8, pp. 155-167, Feb. 1999.
3. B. Pesquet-Popescu and V. Bottreau, "Three-dimensional lifting schemes for motion compensated video compression," Proc. ICASSP, pp. 1793-1796, May 2001.
4. J. W. Woods and P. Chen, "Improved MC-EZBC with quarter-pixel motion vectors," ISO/IEC JTC1/SC29/WG11, MPEG2002/8366, Fairfax, May 2002.
5. J. Xu, Z. Xiong, S. Li, and Y.-Q. Zhang, "Three-dimensional embedded subband coding with optimized truncation (3-D ESCOT)," J. Applied Computat. Harmonic Analysis, vol. 10, pp. 290-315, May 2001.
6. A. Secker and D. Taubman, "Motion-compensated highly scalable video compression using an adaptive 3-D wavelet transform based on lifting," Proc. ICIP, Oct. 2001.
7. J. Wei, M. Pickering, M. Frater, J. Arnold, J. Boman and W. Zeng, "Boundary artefact reduction using odd tile length and the low pass first convention (OTLPF)," Proc. SPIE Conf. on Applications of Digital Image Processing, July 2001.
8. S.-T. Hsiang and J. W. Woods, "Embedded video coding using motion compensated 3-D subband/wavelet filter bank," Packet Video Workshop, Sardinia, Italy, May 2000.
9. J. W. Woods and T. Naveen, "A filter based bit allocation scheme for subband compression of HDTV," IEEE Trans. Image Processing, vol. 1, pp. 436-440, July 1992.