Spatial Scalability in 3D Wavelet Coding with Spatial Domain MCTF Encoder

Ruiqin Xiong*1, Jizheng Xu2, Feng Wu2, Shipeng Li2, Ya-Qin Zhang2
1 Institute of Computing Technology, Chinese Academy of Sciences, 100080, Beijing
2 Microsoft Research Asia, 100080, Beijing

ABSTRACT

This paper investigates spatial scalability in 3D wavelet coding with a spatial-domain MCTF encoder. We analyze the factors that cause performance degradation in the heuristic decoding scheme, where inverse MCTF is carried out on the spatial lowpass wavelet subbands as if they were the original video at the encoder. The inefficiency of interpolation and motion compensation in the wavelet domain and the commutation of the spatial DWT with MCTF are found to be two such factors. Furthermore, the power spectrum leak across subbands caused by motion compensation is another serious one. Based on this analysis, we propose several solutions that improve the decoding performance step by step, without changing the encoder or the transmitted bitstream. Together they yield a complexity-flexible decoding scheme: the decoder can reconstruct a better video at the cost of extra computation and buffers. Finally, to completely address the spatial scalability issues in spatial-domain MCTF schemes, bit allocation should be optimized to reasonably assign part of the bandwidth to spatial high-frequency coefficients instead of dropping them completely.

1. INTRODUCTION

In the scenario of streaming video over the Internet, a video server has to cope with diverse device capabilities, heterogeneous network environments and bandwidth fluctuations. In this situation, spatial, temporal and SNR scalability of the compressed bitstream are all desirable. In response to the MPEG SVC call for proposals [1], various scalable video coding schemes [2] were submitted, many of which are based on motion-aligned 3D wavelet coding.
Motion-aligned 3D wavelet coding schemes [3]~[10] provide elegant solutions that keep all the desirable scalability features with uncompromised, or even better, coding efficiency compared with state-of-the-art coding schemes [10]. In 3D wavelet coding, the whole video is decomposed into various spatial-temporal subbands through a number of motion-aligned temporal transforms (referred to as MCTF hereafter) and spatial transforms.
hereafter) and spatial transforms. These subbands are assumed to be independent with each other and each of them can be dropped arbitrarily when a certain kind of resolution scalability is demanded. For example, to support spatial scalability, the spatial highpass (SH) subbands are usually dropped and the decoder just carries out the decoding process only with the received data of spatial lowpass (SL) subbands. According to the way how motion-aligned temporal transform is performed, current 3D wavelet coding schemes can be divided into two categories. In the first category, referred as in-band MCTF (IBMCTF), video frames are first spatially decomposed and MCTF is carried out in wavelet domain, possibly with subsequent further spatial decompositions. On the contrary, in the second category, MCTF is applied to video frames directly in spatial domain before all the following spatial decompositions. This category is referred as spatial-domain MCTF (SDMCTF). According to the visual quality test results of submitted schemes [11], IBMCTF schemes currently do not achieve as good performance as SDMCTF schemes. This paper focuses on SDMCTF schemes and studies the performance of its spatial scalability. A heuristic way to decode SL resolution video is just to regard the SL subbands as the whole original video and carry out temporal and spatial inverse transforms in the SL resolution. This method is a coarse approximation of the decoding process without spatial scalability. It has the advantages of low computation complexity and low buffer requirement. Furthermore, the original resolution of the video and the occurrence of spatial scalability can be totally transparent to decoder. For SDMCTF encoding schemes, the spatial scalability performance of above decoding scheme is fairly not bad at very low bit rate, compared with non-spatialscalable codec. However, its performance is not satisfactory at median and high bit rate. 
In some cases we can not recover a high quality SL resolution video even if the bit rate is very high. Figure 1 shows the PSNR performance of the scheme when encoding Foreman CIF 30Hz video and decoding Foreman QCIF 30Hz video.
* This work was done while the author was with Microsoft Research Asia.
The rest of this paper is organized as follows. Section 2 analyzes the factors which affect the spatial scalability in schemes with spatial-domain MCTF encoder. Section 3 gives solutions to improve the decoding performance. Experimental results are given in Section 4 to show the improvement of the proposed scheme. Section 5 gives some further discussions and concludes the paper.
Figure 1: The PSNR-Rate curve of Foreman (QCIF 30Hz)
Actually, the above decoding scheme is in-band inverse MCTF: compared with the decoding process without spatial scalability, it exchanges the order of the spatial transform and MCTF. As the bit rate increases, only the frames corresponding to the temporal lowpass (TL) subband keep improving in quality; the other frames converge to a limited PSNR, as shown in Figure 2. This means that even if the SL subband coefficients of all temporal subbands are reconstructed losslessly, we cannot recover the original SL frames that existed at the encoder before MCTF. In other words, in-band IMCTF cannot achieve perfect reconstruction for SDMCTF encoding when spatial scalability is in effect.
In 3D wavelet coding schemes with a spatial-domain MCTF encoder, the whole video is decomposed by a number of motion-aligned temporal transforms and subsequent spatial transforms. The entire encoding and decoding process without spatial scalability is shown in Figure 3, where decoding is exactly the inverse of encoding. Here T denotes MCTF, S denotes the spatial transform and E denotes entropy coding. MCTF encoding and decoding are carried out in the spatial domain.
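As a rough illustration of the T/S pipeline, the following sketch implements a toy version with one level of temporal lifting and one level of spatial Haar DWT, and checks that decoding exactly inverts encoding. It is a simplification under loud assumptions: motion alignment and entropy coding (E) are omitted, the Haar filters are unnormalized, and all names (`split`, `mctf`, ...) are illustrative rather than any codec's API.

```python
import numpy as np

def split(x, axis):                      # one unnormalized Haar analysis step
    e = np.take(x, range(0, x.shape[axis], 2), axis)
    o = np.take(x, range(1, x.shape[axis], 2), axis)
    return (e + o) / 2, (e - o) / 2      # lowpass, highpass

def merge(lo, hi, axis):                 # exact inverse of split()
    axis %= lo.ndim
    shape = list(lo.shape); shape[axis] *= 2
    out, sl = np.empty(shape), [slice(None)] * lo.ndim
    sl[axis] = slice(0, None, 2); out[tuple(sl)] = lo + hi
    sl[axis] = slice(1, None, 2); out[tuple(sl)] = lo - hi
    return out

def dwt2(f):                             # S: returns (LL, LH, HL, HH)
    lo, hi = split(f, -1)
    return split(lo, -2) + split(hi, -2)

def idwt2(LL, LH, HL, HH):               # S^-1
    return merge(merge(LL, LH, -2), merge(HL, HH, -2), -1)

def mctf(v):                             # T: one temporal lifting level
    H = v[1::2] - v[0::2]                # Predict step (zero-motion stand-in)
    L = v[0::2] + H / 2                  # Update step
    return L, H

def imctf(L, H):                         # T^-1: inverse Update, then Predict
    even = L - H / 2
    v = np.empty((2 * len(L),) + L.shape[1:])
    v[0::2], v[1::2] = even, H + even
    return v

video = np.random.default_rng(0).standard_normal((4, 8, 8))
L, H = mctf(video)                                    # T
subbands = [dwt2(f) for f in np.concatenate([L, H])]  # S (E omitted)
rec_frames = np.stack([idwt2(*b) for b in subbands])  # S^-1
rec = imctf(rec_frames[:2], rec_frames[2:])           # T^-1
print(np.allclose(rec, video))                        # True
```

Because no subband is dropped, the inverse chain reproduces the input up to floating-point rounding; the rest of the paper is about what breaks when the SH subbands are discarded.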
2. PROBLEM ANALYSIS
Figure 3: The complete encoding and decoding process without spatial scalability.
Figure 2: The frame by frame PSNR of Foreman (QCIF 30Hz).
In this paper, we first investigate the factors that may affect the spatial scalability performance of schemes with a spatial-domain MCTF encoder. The inefficiency of interpolation and motion compensation in the wavelet domain and the non-commutativity of the spatial transform with MCTF are found to be possible factors; the power spectrum leak across subbands caused by motion compensation is another serious one. After analyzing these factors, we propose several solutions that improve the decoding performance step by step, without changing the encoder or the transmitted bitstream. They result in a complexity-flexible decoding scheme: the decoder can reconstruct a better video with extra computation and buffers. We also point out that, to completely address the spatial scalability issues in SDMCTF schemes, bit allocation must be optimized to reasonably assign part of the bandwidth to spatial high-frequency coefficients instead of dropping them completely.
For encoding a CIF video, when one level of spatial scalability is demanded, i.e., a QCIF version of the decoded video is required at the decoder, a direct solution is to recover the CIF video first and then down-sample it with a filter. The down-sampling can also be done with the DWT, as shown in Figure 4(a). We refer to this scheme as SDIMCTF, since the inverse MCTF occurs in the spatial domain. For comparison, Figure 4(b) shows the heuristic scheme mentioned in Section 1, where inverse MCTF occurs in the subband domain; it is referred to as IBIMCTF.
(a) SDIMCTF decoder
(b) IBIMCTF decoder
Figure 4: Two possible decoding schemes to support one-level spatial scalability. (a) Spatial-domain inverse MCTF. (b) In-band inverse MCTF.
The key difference between the two decoding schemes is how the inverse Predict and Update lifting steps are performed during inverse MCTF. Figure 5 illustrates the two schemes with one-level inverse MCTF, taking the Predict step as an example. For IBIMCTF, prediction frames are generated through MCP directly from the two SL reference frames and are added to the highpass frame during the inverse Predict lifting step to recover its image content. For SDIMCTF, full SD-resolution reference frames are first obtained by applying an IDWT to the SL reference frames with the SH bands set to zero; prediction frames are then generated by spatial-domain MCP before being added to the highpass frame. The recovered SD frame finally needs a DWT to become the target SL frame, as shown in the center of Figure 5.
Figure 5: IBIMCTF vs. SDIMCTF.
Since the DWT is a linear operation, we have

DWT(F1 + F2) = DWT(F1) + DWT(F2).   (1)
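Equation (1) can be sanity-checked numerically with any DWT implementation; the sketch below uses a hand-rolled one-level Haar transform (an illustrative assumption, not the transform used in the paper).

```python
import numpy as np

def split(x, axis):                      # one unnormalized Haar analysis step
    e = np.take(x, range(0, x.shape[axis], 2), axis)
    o = np.take(x, range(1, x.shape[axis], 2), axis)
    return (e + o) / 2, (e - o) / 2      # lowpass, highpass

def dwt2(f):                             # one-level 2D Haar: (LL, LH, HL, HH)
    lo, hi = split(f, -1)
    return split(lo, -2) + split(hi, -2)

F1, F2 = np.random.default_rng(1).standard_normal((2, 8, 8))
lhs = dwt2(F1 + F2)                                  # DWT(F1 + F2)
rhs = [a + b for a, b in zip(dwt2(F1), dwt2(F2))]    # DWT(F1) + DWT(F2)
print(all(np.allclose(a, b) for a, b in zip(lhs, rhs)))  # True: DWT is linear
```

Linearity is what allows the lifting signals to be analyzed subband by subband in the schemes that follow.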
We therefore only need to study how to generate the spatial lowpass lifting signals for the Predict and Update steps from the only available data, the SL subbands of the reference frames. The two schemes can be formulated as:

SL → IDWT → SD → INTP_SD → SD → MCP_SD → SD → DWT → SL   (2a)
SL → INTP_SL → SL → MCP_SL → SL   (2b)

Here (2a) is for SDIMCTF and (2b) is for IBIMCTF; INTP stands for interpolation. INTP_SD and INTP_SL denote interpolation of the pixels in the spatial domain and in the SL domain, respectively, and MCP_SD and MCP_SL denote motion-compensated prediction in the spatial domain and the SL domain, respectively. Many artificial arrangements, such as motion vector scaling and OBMC, can cause mismatches between an SDMCTF encoder and an IBIMCTF decoder. Here we assume that the accuracy of the motion vectors is not changed, that the interpolation is fine enough to exploit the motion information during IBIMCTF, that OBMC is disabled, and that the MCP operation is simple pixel-fetching according to the motion vectors. Even under these assumptions, the following issues may degrade the performance of scheme (2b).
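The difference between pipelines (2a) and (2b) can be seen on a toy frame. The sketch below uses a one-level Haar DWT, a global integer-pixel shift as the pixel-fetching MCP, and skips the INTP stage entirely (integer motion) — all illustrative assumptions. Even so, the two pipelines produce different SL predictions whenever the spatial-domain motion is an odd number of pixels, i.e. a half pixel at SL resolution.

```python
import numpy as np

def split(x, axis):                      # one unnormalized Haar analysis step
    e = np.take(x, range(0, x.shape[axis], 2), axis)
    o = np.take(x, range(1, x.shape[axis], 2), axis)
    return (e + o) / 2, (e - o) / 2

def merge(lo, hi, axis):                 # exact inverse of split()
    axis %= lo.ndim
    shape = list(lo.shape); shape[axis] *= 2
    out, sl = np.empty(shape), [slice(None)] * lo.ndim
    sl[axis] = slice(0, None, 2); out[tuple(sl)] = lo + hi
    sl[axis] = slice(1, None, 2); out[tuple(sl)] = lo - hi
    return out

def dwt2(f):                             # (LL, LH, HL, HH)
    lo, hi = split(f, -1)
    return split(lo, -2) + split(hi, -2)

def idwt2(LL, LH, HL, HH):
    return merge(merge(LL, LH, -2), merge(HL, HH, -2), -1)

def mcp(f, mv):                          # pixel-fetching MCP: global shift
    return np.roll(f, mv, axis=(0, 1))

ref = np.random.default_rng(2).standard_normal((8, 8))
SL = dwt2(ref)[0]                        # decoder only receives the SL band
mv = (1, 0)                              # one SD pixel = half an SL pixel

# (2b) IBIMCTF: MCP directly on the SL frame (the half-pixel part of the
# scaled motion vector is truncated here instead of interpolated)
pred_ib = mcp(SL, (mv[0] // 2, mv[1] // 2))

# (2a) SDIMCTF: IDWT with SH bands zeroed -> spatial-domain MCP -> DWT
z = np.zeros_like(SL)
pred_sd = dwt2(mcp(idwt2(SL, z, z, z), mv))[0]

print(np.allclose(pred_ib, pred_sd))     # False: the pipelines differ
```

For even-pixel motion the two predictions would coincide; the mismatch comes precisely from the sub-pixel (odd) phases, which motivates the interpolation analysis in Section 2.1.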
2.1. Interpolation

Due to the decimation of the wavelet transform, the DWT LL band alone is not a complete representation of the entire lowpass signal of the original data: without the highpass bands, the lowpass signal is lost at half of the phase positions. One consequence of this decimation is the shift-variant property of the DWT LL band. Therefore, performing interpolation directly in the DWT LL band, as in scheme (3a), may be non-optimal for obtaining an interpolated lowpass frame.

SL → INTP_SL^(m×m) → SL → MCP_SL → SL   (3a)
SL → IDWT → OCDWT → INTP_OC-LL^(m×m) → MCP_SL → SL   (3b)
SL → IDWT → SD → INTP_SD^(m×m) → DSDWT → SL → MCP_SL → SL   (3c)
An alternative is to perform half-pixel interpolation through an over-complete wavelet representation. The over-complete representation of the SL subband can be produced by the CODWT [12], or by an inverse DWT followed by an over-complete DWT, as in scheme (3b). This IDWT-plus-OCDWT operation effectively acts as one level of interpolation; the remaining levels can be done by conventional interpolation in the over-complete wavelet domain, denoted INTP_OC-LL. The conventional over-complete wavelet transform only solves the half-pixel interpolation problem in the SL domain. To support quarter-pixel or finer motion vectors in the SL domain, the continuous-phase over-complete subband transform (CPOST) can be used as another implementation of (3b). Alternatively, the interpolation for the quarter-pixel positions in the SL band can be performed in the spatial domain instead of the wavelet domain, and the lowpass signal of the interpolated frame then obtained by an OCDWT. Since the DWT is sensitive to the sampling positions, to match the DWT at the encoder it should be performed on the m-sampled pixels of the interpolated frame, where m×m is the interpolation factor applied in the spatial domain before the OCDWT. This is scheme (3c), where DSDWT means down-sampling the interpolated frame into m×m sub-frames, performing a DWT on each, and interleaving the obtained coefficients back. Schemes (3b) and (3c) are equivalent and should have exactly the same performance.

2.2. Commutation of DWT and MCP

For IBIMCTF schemes, even if we replace the in-band interpolation with spatial-domain interpolation via scheme (3c), the prediction sources for MCP are still lowpass frames and the MCP operations occur in the wavelet domain. Compared with the MCP operations at the encoder, this effectively commutes the DWT with MCP.
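The DSDWT of scheme (3c) can be sketched as follows: split the m×m-interpolated frame into its polyphase sub-frames, DWT each sub-frame separately so that the transform always sees encoder-aligned sampling positions, and interleave the resulting LL coefficients. The sketch assumes a one-level Haar DWT and crude pixel repetition in place of a real interpolation filter; with that choice, the phase-(0,0) component of the interleaved field reproduces the encoder-side LL band exactly.

```python
import numpy as np

def split(x, axis):                      # one unnormalized Haar analysis step
    e = np.take(x, range(0, x.shape[axis], 2), axis)
    o = np.take(x, range(1, x.shape[axis], 2), axis)
    return (e + o) / 2, (e - o) / 2

def dwt2(f):                             # one-level 2D Haar: (LL, LH, HL, HH)
    lo, hi = split(f, -1)
    return split(lo, -2) + split(hi, -2)

def dsdwt_ll(f_up, m=2):
    """Down-sample the interpolated frame into m*m polyphase sub-frames,
    DWT each, and interleave the LL coefficients into one over-complete field."""
    H, W = f_up.shape[0] // (2 * m), f_up.shape[1] // (2 * m)
    out = np.empty((m * H, m * W))
    for i in range(m):
        for j in range(m):
            out[i::m, j::m] = dwt2(f_up[i::m, j::m])[0]
    return out

f = np.random.default_rng(3).standard_normal((8, 8))
f_up = np.repeat(np.repeat(f, 2, axis=0), 2, axis=1)  # stand-in 2x2 "INTP_SD"
ll_field = dsdwt_ll(f_up, m=2)                        # LL at half-pixel phases
# phase (0,0) matches the DWT the encoder applied to the original frame:
print(np.allclose(ll_field[0::2, 0::2], dwt2(f)[0])) # True
```

The remaining phases of `ll_field` supply the half-pixel LL samples that in-band interpolation (3a) can only approximate.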
However, due to the shift-variant property of the DWT and the complexity of motion compensation, the DWT and MCP operations do not commute:

DWT(MCP_SD(F_SD)) ≠ MCP_SL(DWT(F_SD))   (4a)
IDWT(MCP_SL(F_SL)) ≠ MCP_SD(IDWT(F_SL))   (4b)
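Inequality (4a) is easy to reproduce numerically. In the sketch below (one-level Haar DWT and global-shift MCP, both illustrative assumptions), an even spatial shift, which maps to a whole LL-domain pixel, happens to commute with the DWT, while an odd shift, which maps to a half LL-domain pixel, does not.

```python
import numpy as np

def split(x, axis):                      # one unnormalized Haar analysis step
    e = np.take(x, range(0, x.shape[axis], 2), axis)
    o = np.take(x, range(1, x.shape[axis], 2), axis)
    return (e + o) / 2, (e - o) / 2

def dwt2(f):                             # one-level 2D Haar: (LL, LH, HL, HH)
    lo, hi = split(f, -1)
    return split(lo, -2) + split(hi, -2)

def mcp(f, mv):                          # pixel-fetching MCP: global shift
    return np.roll(f, mv, axis=(0, 1))

F = np.random.default_rng(4).standard_normal((8, 8))

# shift by 2 SD pixels = 1 SL pixel: DWT and MCP commute for this motion
eq_even = np.allclose(dwt2(mcp(F, (2, 2)))[0], mcp(dwt2(F)[0], (1, 1)))
# shift by 1 SD pixel = 0.5 SL pixel (truncated to 0): they do not commute
eq_odd = np.allclose(dwt2(mcp(F, (1, 1)))[0], mcp(dwt2(F)[0], (0, 0)))
print(eq_even, eq_odd)                   # True False
```

Real motion fields mix phases block by block, so the odd-shift case is the norm rather than the exception.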
We can show (4a) as follows. Assume that MCP is a simple block-based motion-shift operation and that the current frame is divided into blocks B = {B_i | i = 1, ..., m} with motion vectors MV = {mv_i | i = 1, ..., m}. Define F_ref^i as a version of the reference frame F_ref in which only the pixels referenced by block B_i are retained and all other pixels are set to zero. The prediction frame is then

F_pred(x) = Σ_{i=1..m} F_ref^i(x + mv_i).   (5a)

Applying the DWT to it:

DWT(F_pred) = DWT(MCP_SD(F_ref)) = DWT(Σ_{i=1..m} F_ref^i(x + mv_i)) = Σ_{i=1..m} DWT(F_ref^i(x + mv_i))
  ≠ Σ_{i=1..m} (DWT F_ref^i)(x + mv_i′) = MCP_SL(Σ_{i=1..m} DWT(F_ref^i)) = MCP_SL(DWT(Σ_{i=1..m} F_ref^i))
  ≠ MCP_SL(DWT(F_ref)).   (5b)
The first inequality in (5b) is due to the shift-variant property of the DWT; the second is due to the overlapping and uncovering that occur during the motion shift in MCP when the motion field is complex. A possible solution is to move the MCP operation into the spatial domain, before the DWT, as in scheme (6), which is exactly SDIMCTF:

SL → IDWT → SD → INTP_SD → SD → MCP_SD → SD → DWT → SL   (6)

2.3. Power Spectrum Leak by Motion Shift

We mentioned that in 3D wavelet coding the temporal-spatial subbands are usually assumed to be independent, so that some of them, especially the highpass subbands, can be dropped at will. However, the DWT lowpass and highpass subbands of neighboring frames are closely coupled, due to the power spectrum leak introduced by motion shift: when a frame containing signal in only one DWT subband is shifted according to motion, part of the signal transfers to other DWT subbands. Figure 6 illustrates the phenomenon with a simple global motion. In the first row of Figure 6, the original frame is divided into two parts: spatial lowpass signal A and spatial highpass signal B. In the second row, the frame containing only signal A is shifted in the spatial domain; the shifted frame now contains spatial highpass signal. Similarly, in the third row, the frame containing only signal B is shifted in the spatial domain and the shifted frame now contains spatial lowpass signal. The spectrum leak becomes even more serious when the motion is complex.
Figure 6: Power spectrum leak by motion shift
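The leak illustrated in Figure 6 can be reproduced numerically. The sketch below builds a frame containing only spatial lowpass signal (signal A), shifts it by one pixel, and measures the energy that appears in the SH subbands — again under illustrative assumptions (one-level Haar DWT, global one-pixel motion).

```python
import numpy as np

def split(x, axis):                      # one unnormalized Haar analysis step
    e = np.take(x, range(0, x.shape[axis], 2), axis)
    o = np.take(x, range(1, x.shape[axis], 2), axis)
    return (e + o) / 2, (e - o) / 2

def merge(lo, hi, axis):                 # exact inverse of split()
    axis %= lo.ndim
    shape = list(lo.shape); shape[axis] *= 2
    out, sl = np.empty(shape), [slice(None)] * lo.ndim
    sl[axis] = slice(0, None, 2); out[tuple(sl)] = lo + hi
    sl[axis] = slice(1, None, 2); out[tuple(sl)] = lo - hi
    return out

def dwt2(f):                             # (LL, LH, HL, HH)
    lo, hi = split(f, -1)
    return split(lo, -2) + split(hi, -2)

def idwt2(LL, LH, HL, HH):
    return merge(merge(LL, LH, -2), merge(HL, HH, -2), -1)

def sh_energy(f):                        # energy in the LH, HL, HH subbands
    _, LH, HL, HH = dwt2(f)
    return float((LH**2).sum() + (HL**2).sum() + (HH**2).sum())

LL = np.random.default_rng(5).standard_normal((4, 4))
z = np.zeros_like(LL)
A = idwt2(LL, z, z, z)                   # frame holding only lowpass signal A

print(sh_energy(A))                      # 0.0: no highpass content at all
shifted = np.roll(A, 1, axis=0)          # a one-pixel motion shift
print(sh_energy(shifted) > 1e-6)         # True: energy leaked into SH bands
```

The shifted frame has acquired genuine SH content, which is exactly the information a pure-SL decoder cannot see.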
From this phenomenon we know that, with SDMCTF at the encoder, the SL components of the reference frames predict part of the SH components of the anchor frame. Therefore, although only the SL band information is available at the decoder initially, meaningful SH band information appears in the intermediate results of each level of IMCTF. Using an SL-resolution buffer to hold the intermediate results requires the DWT in the center of Figure 5, which drops the SH band information that would be useful for the next IMCTF level. Conversely, the SH components of the reference frames predict part of the SL components of the anchor frame. This means that, at the encoder, the SH band information of the reference frames is exploited to code the SL components, but these SH components are not accessible at the decoder. Their absence results in a kind of drifting, which we call SH drifting. The consequence is a PSNR ceiling: the PSNR curve flattens out at a not very high bit rate, as shown in Figure 1.

3. POSSIBLE SOLUTIONS

Based on the above observations, we propose several schemes to improve the decoding performance step by step. Only one level of spatial scalability is considered here; the motion vectors are assumed to have 1/2-pixel accuracy at the original resolution and hence 1/4-pixel accuracy at the SL resolution. These schemes, together with the baseline, are listed below, and Table 1 compares them.
3.1. Scheme A: In-band IMCTF (Baseline)

This is the scheme used as our baseline. The decoding framework is that of Figure 4(b): the IMCTF uses only an SL buffer and all operations occur in the wavelet domain, as illustrated in Figure 7.
Figure 7: The IMCTF for the baseline scheme.
3.2. Scheme B: Optimized In-band IMCTF

In scheme B, we optimize the IMCTF by moving the interpolation and/or MCP operations into the spatial domain, while still using an SL buffer. The framework is shown in Figure 8. Three possible algorithms are given in Figure 9, referred to as schemes B1, B2 and B3; B1 and B2 are expected to have the same performance.
Figure 8: Decoding framework with optimized in-band IMCTF
Figure 9: Three schemes to optimize in-band IMCTF.

3.3. Scheme C: Spatial-domain IMCTF

Scheme C is the SDIMCTF described in Figure 4(a). Unlike scheme B3, the buffer is an SD buffer, in which the SH coefficients are initialized to zeros. The IDWT and DWT between lifting steps are no longer necessary, as shown in Figure 10, so the SH information in the intermediate results is retained.

Figure 10: Spatial-domain IMCTF

3.4. Scheme D: SDIMCTF with SH coefficients

To handle the SH drifting problem, part of the bandwidth should reasonably be allocated to the SH coefficients in the bitstream extractor. Of course, this falls outside the scope of traditional spatial scalability. To determine the optimal rate for the SH coefficients, their contribution to reducing the distortion of the low-resolution video, rather than of the whole original-resolution video, should be measured. Accordingly, the distortion scaling factors for the SH subbands should be re-evaluated; this is left for future research. Currently, scheme D simply uses the existing scaling factors, as if the original-resolution video were being extracted; coefficients of all subbands are allowed into the final bitstream, and the SL video is then decoded with the same method as scheme C.

With all the schemes listed above, we obtain a complexity-flexible decoding scheme: the decoder can choose to decode a relatively low-quality video with less computation and memory, or to reconstruct a better video with extra computation and buffers. To decode an even higher-quality video, a subband-flexible bitstream extractor should be applied so that some SH coefficients are transmitted.

Table 1: Comparison between the proposed schemes

Scheme | SL coefficients only | SL buffer only | In-band MCP | In-band interpolation
A      | Yes                  | Yes            | Yes         | Yes
B1     | Yes                  | Yes            | Yes         | Yes
B2     | Yes                  | Yes            | Yes         | No
B3     | Yes                  | Yes            | No          | No
C      | Yes                  | No             | No          | No
D      | No                   | No             | No          | No

4. EXPERIMENTAL RESULTS

We investigate the performance of the proposed schemes on the sequences Foreman, Mobile, Silence and Table. The input videos are CIF at 30 fps; four levels of MCTF and three levels of spatial wavelet transform are applied at the encoder, and the motion estimation parameters follow the CE test conditions. Since B1 and B2 indeed have the same performance, only one of them is shown. From the results in the figures below, we can see that B1 does not show any improvement, due to the absence of SH information. Schemes B3 and C achieve a gain of about 0.5~1.5 dB over the baseline scheme A, without changing the encoder or the bit allocation. In addition, scheme D allocates part of the bandwidth to the SH subbands in a tentative way and improves the performance significantly at middle and high bit rates, by 2 dB or more. Scheme D may incur some loss at low bit rates, since the current rate allocation targets the whole video rather than the SL video.

5. CONCLUSIONS

This paper investigates the spatial scalability in 3D wavelet coding schemes with a spatial-domain MCTF encoder.
We analyzed the factors that lead to performance degradation in the heuristic scheme of wavelet-domain decoding. The commutation of DWT and MCTF and the power spectrum leak across subbands are found to be two major reasons. Based on this analysis, we proposed a complexity-flexible scheme that improves the decoding performance step by step, without changing the encoder or the transmitted bitstream: the decoder can reconstruct a better video with extra computation and buffers. Furthermore, to completely address the spatial scalability issue in spatial-domain MCTF schemes, bit allocation should be optimized to reasonably assign part of the bandwidth to spatial high-frequency coefficients instead of dropping them completely. The distortion scaling factor of each subband should be re-evaluated for decoding the low-resolution video; this is one of our future works.

REFERENCES

[1] "Call for proposals on scalable video coding technology," ISO/IEC JTC1/SC29/WG11, N6193, Waikoloa, December 2003.
[2] "Registered responses to the call for proposals on scalable video coding," ISO/IEC JTC1/SC29/WG11, M10569.
[3] J.-R. Ohm, "Three dimensional subband coding with motion compensation," IEEE Trans. on Image Processing, vol. 3, no. 5, pp. 559-571, September 1994.
[4] S.-J. Choi and J. Woods, "Motion-compensated 3-D subband coding of video," IEEE Trans. on Image Processing, vol. 8, no. 2, pp. 155-167, 1999.
[5] B. Pesquet-Popescu and V. Bottreau, "Three-dimensional lifting schemes for motion compensated video compression," ICASSP, vol. 3, pp. 1793-1796, Salt Lake City, 2001.
[6] J. Xu, Z. Xiong, S. Li and Y.-Q. Zhang, "Three-dimensional embedded subband coding with optimal truncation (3-D ESCOT)," Applied and Computational Harmonic Analysis, vol. 10, pp. 290-315, 2001.
[7] L. Luo, J. Li, S. Li, Z. Zhuang and Y.-Q. Zhang, "Motion compensated lifting wavelet and its application in video coding," ICME, pp. 365-368, Tokyo, 2001.
[8] A. Secker and D. Taubman, "Lifting-based invertible motion adaptive transform (LIMAT) framework for highly scalable video compression," IEEE Trans. on Image Processing, vol. 12, December 2003.
[9] P. Chen and J. Woods, "Bidirectional MC-EZBC with lifting implementation," IEEE Trans. on Circuits and Systems for Video Technology, to appear.
[10] R. Xiong, F. Wu, S. Li, Z. Xiong and Y.-Q. Zhang, "Exploiting temporal correlation with adaptive block-size motion alignment for 3D wavelet coding," SPIE, vol. 5308, pp. 144-155, 2004.
[11] "Subjective test results for the CfP on scalable video coding technology," ISO/IEC JTC1/SC29/WG11, W6383, Munich.
[12] Y. Andreopoulos et al., "Complete-to-overcomplete discrete wavelet transforms: theory and applications," IEEE Trans. on Signal Processing, to appear.