Video Coding with Flexible MCTF Structures for Low End-to-End Delay

Grégoire Pau
GET-ENST (Telecom Paris)
Signal and Image Proc. Dept.
37-39 Rue Dareau, 75014 Paris, France
Email: [email protected]

Jérôme Viéron
Thomson R&D
35576 Cesson-Sévigné Cedex, France
Email: [email protected]

Béatrice Pesquet-Popescu
GET-ENST (Telecom Paris)
Signal and Image Proc. Dept.
37-39 Rue Dareau, 75014 Paris, France
Email: [email protected]
Abstract— Some of the most powerful schemes for scalable video coding are based on the so-called (t + 2D) paradigm and provide coding performance competitive with state-of-the-art codecs. However, the temporal multiresolution schemes used in such codecs introduce a non-negligible delay, preventing their use in applications which require low latency, like video conferencing or video streaming. In this paper, we provide a flexible approach to reduce the end-to-end delay involved in motion-compensated temporal filtering schemes. We show by simulation results the trade-offs observed between coding efficiency and end-to-end delay.
I. INTRODUCTION

The (t + 2D) spatio-temporal subband scheme exploits the temporal inter-frame redundancy by applying a temporal wavelet transform along the motion trajectories of the frames of a video sequence. A spatial decomposition of the temporal subbands is then performed to take advantage of the spatial redundancy of the filtered frames, and the resulting wavelet coefficients can be encoded by different algorithms such as 3D-SPIHT [1], MC-EZBC [2] or 3D-ESCOT [3]. Since the pioneering works exploiting the Haar motion-compensated temporal filtering (MCTF) transform [4], [5], several enhancements have been brought to this scheme, improving the compression performance either by optimizing the operators involved in the temporal filter [6], [7] or by introducing longer temporal filters inspired by the biorthogonal 5/3 ones [8].

However, due to the non-causality of their underlying temporal filter, these multiresolution schemes introduce a non-negligible delay, preventing their use in applications which require low latency, such as video conferencing, video on demand or video surveillance.

Depending on the application, several delays can be defined. In a video-conferencing context, the end-to-end delay is the total time required between the capture of a frame and its display by the receiver's device. However, many factors contribute to this end-to-end delay, such as processing time, packetization or transmission delay. In this paper, we focus only on reducing
the end-to-end delay created by the temporal filter, therefore ignoring transmission time and processing delay, which are respectively network and platform dependent.

In [9], we presented a first solution to reduce the latency involved in MCTF-based video schemes, but only the encoding delay was considered and the overall results were not satisfactory under very low encoding-delay constraints. Golwelkar [10] also proposed a 5/3 temporal analysis with a Haar filter at the last stage in order to reduce the encoding delay, but in a less flexible manner. Other works [11] have also tackled the problem by partitioning the groups of frames into independent subgroups.

We present in this paper a flexible 5/3 temporal transform able to fulfill a given end-to-end delay constraint while providing good coding efficiency. This transform consists of a sequence of temporal decomposition steps chosen among three candidate temporal transforms. We derive analytically the encoding, decoding and end-to-end delays resulting from this transform. We show by simulation results that this temporal transform leads to a small and graceful performance degradation, therefore allowing one to choose among a large range of delay/coding-efficiency trade-offs, depending on the application requirements.

The paper is organized as follows: in Section II, we review the 5/3 motion-compensated temporal filter, and we then study its delay characteristics in Section III. In Section IV, we present a flexible-delay 5/3 MCTF transform able to fulfill a given end-to-end delay constraint. Section V illustrates by simulation results the trade-offs between end-to-end delay and coding efficiency, for a range of given delays. We conclude in Section VI.

II. MCTF WITH 5/3 FILTERS

Let us denote by $x_t$ the original input frame at time $t$, by $h_t$ the detail (temporal "high-frequency") subband frames and by $l_t$ the approximation (temporal "low-frequency") subband frames. In the 5/3 MCTF, temporal lifting operations are
performed on motion-compensated frames, by predicting odd-indexed frames $x_{2t+1}$ from even ones, i.e. $x_{2t}$ and $x_{2t+2}$. The filter makes use of two motion vector fields: the forward one, predicting $x_{2t+1}$ from $x_{2t}$ and denoted by $v^+_{2t+1}$, and the backward one, predicting $x_{2t+1}$ from $x_{2t+2}$ and denoted by $v^-_{2t+1}$. If we denote by $n$ the spatial index in a frame, we can introduce the motion-compensation operator $C$ defined by $C(x, v_t)(n) = x(n - v_t(n))$. With these notations, one temporal level of the 5/3 MCTF can be expressed in the following lifting form, denoted by $T_1$:

$$h_t = x_{2t+1} - \frac{1}{2}\left[ C(x_{2t}, v^+_{2t+1}) + C(x_{2t+2}, v^-_{2t+1}) \right] \qquad (1)$$
$$l_t = x_{2t} + \frac{1}{4}\left[ C^{-1}(h_{t-1}, v^-_{2t-1}) + C^{-1}(h_t, v^+_{2t+1}) \right]$$

The $N$-level multiresolution decomposition of the input frames can then be obtained by subsequent decompositions of the approximation band with $T_1$.

The update equation makes use of the reverse motion-compensation operator $C^{-1}$, which is not defined everywhere due to the non-invertibility of $C$, but several approaches have been proposed [7], [12] to overcome this problem. Moreover, in the case of fractional-pel motion estimation, these operators should involve an interpolation step, which is not discussed here. The lifting formalism guarantees the invertibility of the scheme, and the samples can therefore be retrieved simply by applying the same lifting steps in reverse order with opposite signs.

Thanks to its good bidirectional prediction properties, the 5/3 MCTF filter was shown [6] to perform much better, from a compression point of view, than a mono-directional Haar filter bank.
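To make the lifting steps concrete, here is a minimal sketch of one temporal level of this transform, under simplifying assumptions that are ours and not part of the codecs discussed in this paper: integer-pel motion fields, nearest-neighbour warping for $C$, $C^{-1}$ approximated by warping with the negated field, and simple border handling. The helper names (`mc`, `mctf_53_level`) are hypothetical.

```python
import numpy as np

def mc(x, v):
    """C(x, v)(n) = x(n - v(n)) for a 2-D frame x and an integer motion field v of shape (H, W, 2)."""
    H, W = x.shape
    ys, xs = np.indices((H, W))
    ry = np.clip(ys - v[..., 0], 0, H - 1)
    rx = np.clip(xs - v[..., 1], 0, W - 1)
    return x[ry, rx]

def mctf_53_level(frames, v_fwd, v_bwd):
    """One level of the T1 (5/3) MCTF on a list of frames of even length.
    v_fwd[t] and v_bwd[t] play the roles of v+_{2t+1} and v-_{2t+1}."""
    n_pairs = len(frames) // 2
    h, l = [], []
    # Predict: h_t = x_{2t+1} - 1/2 [ C(x_{2t}, v+_{2t+1}) + C(x_{2t+2}, v-_{2t+1}) ]
    for t in range(n_pairs):
        x_next = frames[2 * t + 2] if 2 * t + 2 < len(frames) else frames[2 * t]  # border handling (assumption)
        h.append(frames[2 * t + 1]
                 - 0.5 * (mc(frames[2 * t], v_fwd[t]) + mc(x_next, v_bwd[t])))
    # Update: l_t = x_{2t} + 1/4 [ C^{-1}(h_{t-1}, v-_{2t-1}) + C^{-1}(h_t, v+_{2t+1}) ],
    # with C^{-1} approximated here by warping with the negated motion field.
    for t in range(n_pairs):
        u_past = mc(h[t - 1], -v_bwd[t - 1]) if t > 0 else np.zeros_like(frames[0])
        u_curr = mc(h[t], -v_fwd[t])
        l.append(frames[2 * t] + 0.25 * (u_past + u_curr))
    return l, h
```

Applying the same function recursively to the returned approximation frames gives the $N$-level decomposition mentioned above.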
III. DELAY ANALYSIS

Depending on the application, several delays may be relevant at the encoder side and at the decoder side; from an overall application point of view, the end-to-end delay is also relevant. As stated above, we focus here only on the delay created by the motion-compensated temporal filter, therefore ignoring transmission time and processing delay. We denote by $f$ the framerate in Hz of the input video sequence.

The encoding delay is defined from the encoder point of view. It is the maximum delay between capturing a frame and encoding it. It is denoted by $D_e = N_e / f$, where $N_e$ is the maximum number of frames that need to be captured in order to be able to encode the current frame.

The decoding delay is defined from the decoder point of view. It is the maximum delay between the reception of a frame and its display by the decoder. It is denoted by $D_d = N_d / f$, where $N_d$ is the maximum number of future frames, among all the subbands, that need to be decoded in order to be able to reconstruct the current frame.

The end-to-end delay is defined as the maximum delay between the capture of a frame by the encoder and its display by the decoder. It is denoted by $D_r = N_r / f$, where $N_r$ is the corresponding number of frames, as illustrated in Fig. 2.
Fig. 1. Encoding delay in a 3-level 5/3 temporal decomposition.
Fig. 1 illustrates the encoding delay for all the frames of the GOF in a 3-level 5/3 temporal transform. Bold arrows illustrate the processing path related to the frame which has the maximum encoding delay, $N_e = 14$. Fig. 2 illustrates the decoding and end-to-end delays introduced by the temporal transform. Bold arrows illustrate the processing path related to the frame which has the maximum decoding delay, $N_d = 11$. We can also observe that this frame has the largest end-to-end delay, $N_r = 21$.

If we consider an $N$-level classical 5/3 MCTF [6], it can be shown that the encoding, decoding and end-to-end delays are given by:

$$N_e = 2^{N+1} - 2, \qquad N_d = 3 \times 2^{N-1} - 1, \qquad N_r = 3 \times (2^N - 1)$$

From a practical point of view, this means that for a four-level transform of a video sequence at $f = 30$ Hz, the end-to-end delay will be $D_r = 1.5$ s, which is much too high, knowing that a delay higher than 300 ms is clearly noticeable to a video-conference user. Given the previous equations, one way of reducing the delay of the 5/3 MCTF transform would be to decrease the number of temporal levels $N$, which can however greatly degrade the coding efficiency in the case of smooth motion. This motivates us to introduce the following flexible 5/3 MCTF structure, able to fulfill a given delay constraint.
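As a quick numerical check of these formulas, the following snippet evaluates the three delays in seconds for a given number of levels (the helper name `classical_delays` is ours):

```python
def classical_delays(N, f=30.0):
    """Encoding, decoding and end-to-end delays (in seconds) of an N-level classical 5/3 MCTF."""
    Ne = 2 ** (N + 1) - 2
    Nd = 3 * 2 ** (N - 1) - 1
    Nr = 3 * (2 ** N - 1)
    return Ne / f, Nd / f, Nr / f

# For N = 4 at f = 30 Hz: De = 1.0 s, Dd ~ 0.77 s, Dr = 1.5 s.
print(classical_delays(4))
```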
Fig. 2. Decoding and end-to-end delay in a 3-level 5/3 temporal decomposition.
IV. FLEXIBLE LOW-DELAY 5/3 MCTF STRUCTURE

In order to reduce the end-to-end delay involved in the 5/3 MCTF, we consider removing some parts of the operators involved in the temporal transform, as they were identified as responsible for raising the delay. It is clear that the forward part of the predict and update operators is responsible for raising the delay involved in the transform. On the other hand, this forward part brings the bidirectionality to the filter, which is in turn a source of coding efficiency. There is therefore a clear trade-off between coding efficiency and delay, and we can consider removing these forward parts only at the coarsest temporal resolution levels.

Assuming that the predict steps are more important from a compression point of view than the update ones (which has been checked by simulations), we can design a temporal transform without the forward part of the update operator, denoted by $T_2$:

$$h_t = x_{2t+1} - \frac{1}{2}\left[ C(x_{2t}, v^+_{2t+1}) + C(x_{2t+2}, v^-_{2t+1}) \right] \qquad (2)$$
$$l_t = x_{2t} + \frac{1}{2} C^{-1}(h_{t-1}, v^-_{2t-1})$$

This transform clearly has a lower end-to-end delay than the $T_1$ temporal transform. However, the $T_2$ transform does not have a null delay, and we need another transform without the forward part of both the predict and the update steps. This was basically proposed in [9], but it led to unsatisfactory results, mainly because the two motion vector fields $v^-_{2t-1}$ and $v^+_{2t+1}$ had to be encoded. The solution we propose here as a third alternative is to consider a mono-directional temporal transform without an update step, which allows dropping the motion vector field $v^-_{2t-1}$. This transform is denoted by $T_3$:

$$h_t = x_{2t+1} - C(x_{2t}, v^+_{2t+1}) \qquad (3)$$
$$l_t = x_{2t}$$
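For illustration only, here is a sketch of the corresponding lifting steps in the same simplified setting as the $T_1$ sketch of Section II (the `mc` helper and the negated-field approximation of $C^{-1}$ are reused; the function names are ours):

```python
def predict_T2(x_2t, x_2t1, x_2t2, v_fwd, v_bwd):
    # Same bidirectional predict as T1: still needs the future even frame x_{2t+2}.
    return x_2t1 - 0.5 * (mc(x_2t, v_fwd) + mc(x_2t2, v_bwd))

def update_T2(x_2t, h_prev, v_bwd_prev):
    # Only the past detail frame h_{t-1} is used, so no future h_t is required.
    return x_2t + 0.5 * mc(h_prev, -v_bwd_prev)

def predict_T3(x_2t, x_2t1, v_fwd):
    # Mono-directional predict: only the past even frame is used.
    return x_2t1 - mc(x_2t, v_fwd)

def update_T3(x_2t):
    # No update step: the approximation frame is the even frame itself.
    return x_2t
```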
We can then build an $N$-level 5/3 temporal transform with a delay parameterized by $(P, Q)$, which consists of a sequence of temporal transforms selected among $T_1$, $T_2$ and $T_3$. In this temporal transform, the first $P$ levels are decomposed with $T_1$, the following $Q$ levels with $T_2$ and the coarsest remaining ones with $T_3$. It can then be shown that the resulting transform has the following encoding, decoding and end-to-end delays, depending on the choice of the parameters $(P, Q)$:

If $Q = 0$:
$$N_e = 2^{P+1} - 2, \qquad N_d = \lfloor 3 \times 2^{P-1} - 1 \rfloor, \qquad N_r = 3 \times (2^P - 1)$$

If $Q > 0$:
$$N_e = 2^{P+1} + 2^{P+Q-1} - 2, \qquad N_d = 2^{P+Q} - 1, \qquad N_r = 2^{P+1} + 2^{P+Q} - 3$$
This temporal transform can then be parameterized in order to fulfill a given end-to-end delay constraint. In some cases, several pairs of parameters $(P, Q)$ give the same end-to-end delay: we then prefer the one having the maximum number of bidirectional predict steps $P + Q$. One can notice that taking $(P, Q) = (N, 0)$ is equivalent to the classical 5/3 MCTF transform described in Section II, while $(0, 0)$ is equivalent to a Haar MCTF transform without the update step. Tab. I gives the optimal parameters for an $N = 4$ level transform applied to video sequences with a framerate $f = 30$ Hz, for a set of given end-to-end delay constraints.

TABLE I
OPTIMAL PARAMETERS (P, Q) TO FULFILL A GIVEN END-TO-END DELAY CONSTRAINT FOR A FRAMERATE f = 30 HZ.

Max. end-to-end delay    | N_r | Optimal (P, Q)
1500 ms (unconstrained)  | 45  | (4, 0)
300 ms                   | 10  | (1, 2)
150 ms                   |  4  | (0, 2)
70 ms                    |  2  | (0, 1)
35 ms                    |  1  | (0, 1)
0 ms                     |  0  | (0, 0)
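A possible sketch of the delay formulas and of the parameter selection rule described above is given below; the tie-break between pairs with equal $P + Q$ (favouring a larger $P$) is our own assumption, and with it the sketch returns the $(P, Q)$ pairs listed in Tab. I for the delay constraints shown there.

```python
def pq_frame_delays(P, Q):
    """Frame delays (Ne, Nd, Nr) of the flexible transform parameterized by (P, Q)."""
    if Q == 0:
        Ne = 2 ** (P + 1) - 2
        Nd = 3 * 2 ** (P - 1) - 1 if P > 0 else 0
        Nr = 3 * (2 ** P - 1)
    else:
        Ne = 2 ** (P + 1) + 2 ** (P + Q - 1) - 2
        Nd = 2 ** (P + Q) - 1
        Nr = 2 ** (P + 1) + 2 ** (P + Q) - 3
    return Ne, Nd, Nr

def choose_parameters(max_delay_s, N=4, f=30.0):
    """Pick (P, Q), with P + Q <= N, meeting the end-to-end delay constraint
    while maximizing the number of bidirectional predict levels P + Q."""
    best = (0, 0)
    for P in range(N + 1):
        for Q in range(N - P + 1):
            if pq_frame_delays(P, Q)[2] / f <= max_delay_s:
                if (P + Q, P) > (sum(best), best[0]):  # tie-break: larger P (assumption)
                    best = (P, Q)
    return best

# e.g. choose_parameters(0.3) -> (1, 2), choose_parameters(0.0) -> (0, 0)
```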
V. SIMULATION RESULTS

In our simulations, we consider two representative test color video sequences in CIF format at f = 30 Hz: "Tempete" and "Football", which have been selected for their very difficult motion and texture characteristics. The simulations were conducted in the framework of the MC-EZBC codec [2], where the temporal transform was modified according to Section IV in order to fulfill the imposed delay constraints. Motion estimation involves a Hierarchical Variable Size Block Matching (HVSBM) algorithm with block sizes varying from 64 × 64 to 4 × 4, and motion vectors are estimated with 1/8th-pixel accuracy. Motion vector field encoding and bitrate allocation are done within the MC-EZBC framework. Temporal subbands are spatially decomposed over 5 levels with biorthogonal 9/7 wavelets.

Both video sequences have been encoded in the YUV color mode, meaning that the bit budget is shared by the luminance and chrominance components, the bitstream headers and the motion vector fields. Performance is however only expressed in terms of YSNR, calculated by averaging the Y-component PSNR over all decoded frames.

We compare in Tab. II and Tab. III the coding performance achieved at several bitrates with the flexible 5/3 MCTF temporal filter over N = 4 temporal levels with optimized prediction [6], for several end-to-end delay constraints. We observe a graceful degradation of the coding performance, depending on the end-to-end delay constraints and therefore on the application requirements. Note that even though the implementation presented here is a (t + 2D) one, we performed similar tests in a multi-layer approach with MCTF at each spatial level [13] and obtained very similar results.

TABLE II
RATE-DISTORTION YSNR (IN DB) COMPARISON OF THE FLEXIBLE 5/3 MC FILTER BANK FOR SEVERAL END-TO-END DELAYS AND SEVERAL BITRATES ON THE "TEMPETE" CIF 30 HZ VIDEO SEQUENCE.

Max. delay | 384 kbps | 512 kbps | 768 kbps | 1024 kbps
1500 ms    | 29.72    | 31.00    | 32.77    | 33.95
300 ms     | 28.65    | 29.97    | 31.65    | 32.98
150 ms     | 27.90    | 29.23    | 30.84    | 32.20
35 ms      | 27.35    | 28.65    | 30.23    | 31.61
0 ms       | 26.91    | 28.11    | 29.64    | 31.02

TABLE III
RATE-DISTORTION YSNR (IN DB) COMPARISON OF THE FLEXIBLE 5/3 MC FILTER BANK FOR SEVERAL END-TO-END DELAYS AND SEVERAL BITRATES ON THE "FOOTBALL" CIF 30 HZ VIDEO SEQUENCE.

Max. delay | 384 kbps | 512 kbps | 768 kbps | 1024 kbps
1500 ms    | 27.15    | 28.49    | 30.32    | 31.79
300 ms     | 26.83    | 28.11    | 29.92    | 31.43
150 ms     | 26.51    | 27.68    | 29.44    | 30.90
35 ms      | 26.23    | 27.39    | 29.15    | 30.61
0 ms       | 25.94    | 27.08    | 28.79    | 30.30
VI. CONCLUSION

In this paper, we have presented a flexible MCTF structure able to fulfill a given end-to-end delay constraint. We have also shown, through simulation results, a graceful degradation of performance from the unconstrained case to the most constrained one.
In between these extreme cases, a large range of trade-offs exists, and the temporal structure can be chosen to take the application requirements into account.

REFERENCES

[1] B.-J. Kim, Z. Xiong, and W. Pearlman, "Very low bit-rate embedded video coding with 3-D set partitioning in hierarchical trees (3D-SPIHT)," IEEE Transactions on Circuits and Systems for Video Technology, vol. 8, pp. 1365–1374, 2000.
[2] S.-T. Hsiang, J. Woods, and J.-R. Ohm, "Invertible temporal subband/wavelet filter banks with half-pixel-accurate motion compensation," IEEE Transactions on Image Processing, vol. 13, pp. 1018–1028, August 2004.
[3] J. Xu, Z. Xiong, S. Li, and Y.-Q. Zhang, "3D embedded subband coding with optimal truncation (3D-ESCOT)," Applied and Computational Harmonic Analysis, vol. 10, p. 589, May 2001.
[4] J.-R. Ohm, "Three-dimensional subband coding with motion compensation," IEEE Transactions on Image Processing, vol. 3, pp. 559–589, 1994.
[5] S. Choi and J. Woods, "Motion-compensated 3-D subband coding of video," IEEE Transactions on Image Processing, vol. 8, pp. 155–167, 1999.
[6] G. Pau, C. Tillier, B. Pesquet-Popescu, and H. Heijmans, "Motion compensation and scalability in lifting-based video coding," Signal Processing: Image Communication, vol. 19, pp. 577–600, August 2004.
[7] C. Tillier, B. Pesquet-Popescu, and M. van der Schaar, "Improved update operators for lifting-based motion-compensated temporal filtering," Feb. 2005, to appear in IEEE Signal Processing Letters.
[8] A. Secker and D. Taubman, "Motion-compensated highly scalable video compression using an adaptive 3D wavelet transform based on lifting," in Proc. of the IEEE Int. Conf. on Image Processing, Thessaloniki, Greece, Oct. 2001, pp. 1029–1032.
[9] G. Pau, B. Pesquet-Popescu, M. van der Schaar, and J. Viéron, "Delay-performance trade-offs in motion-compensated scalable subband video compression," in Proc. of Advanced Concepts for Intelligent Vision Systems (ACIVS), September 2004.
[10] A. Golwelkar, "Motion compensated temporal filtering and motion vector coding using longer filters," Ph.D. thesis, RPI, ECSE Dept., Sept. 2004.
[11] H. Schwartz, J. Shen, D. Marpe, and T. Wiegand, "Technical description of the HHI proposal for SVC CE3," doc. M11246, Palma MPEG meeting, Oct. 2004.
[12] K. Hanke, J.-R. Ohm, and T. Rusert, "Adaptation of filters and quantization in spatio-temporal wavelet coding with motion compensation," in Proc. of the Picture Coding Symposium, St. Malo, France, April 2003, pp. 49–54.
[13] J. Viéron, G. Boisson, E. Francois, G. Pau, and B. Pesquet-Popescu, "Proposal for SVC CE1: Time and level adaptive MCTF architectures for low delay video coding," doc. M11673, Hong Kong MPEG meeting, Jan. 2005.