Spatiotemporal Subband/Wavelet Coding of Video with Object-based Motion Information
Soo-Chul Han and John W. Woods
ECSE Department and Center for Image Processing Research (CIPR)
Troy, New York 12180-3590
E-mail:
[email protected],
[email protected]
Abstract
We present a novel object-based spatiotemporal (3D) subband/wavelet coding scheme for very low bit-rate video compression. First, the image sequence is segmented into objects based on motion and color. The objects are then independently decomposed by subband/wavelet analysis, both in the temporal and spatial domain. The temporal analysis is greatly facilitated over existing methods by the object-based representation. Each object's subband/wavelet coefficients are then quantized according to a bit allocation scheme based on the generalized BFOS algorithm. Simulations at very low bit-rates indicate competitive performance to classical motion-compensated predictive (MCP) coding schemes, while allowing the flexibility of frame-rate scalability.
1 Introduction
With very low bit-rate coding emerging as an important field of video compression, object-based video coding has received considerable attention recently. This is evidenced further by the ongoing MPEG-4 standardization process, which emphasizes the object-based approach to video coding and manipulation. A plethora of object/region-based schemes have thus been proposed in the last several years [1, 2, 3, 4, 5]. Each differs in the adopted motion model, the object segmentation and representation scheme, and the color (texture) coding. However, all of them adopt the classical motion-compensated predictive (MCP) coding technique for encoding the objects.

The MCP approach, although simple in concept and widely implemented, has several inherent drawbacks. First, the coding has to be performed based on the past reconstructed frame. This may be especially ill-advised for very low bit-rate applications, where each reconstructed frame is of relatively poor quality, and thus error propagation becomes more evident. Next, the time-recursive nature of MCP prohibits temporal scalability and multi-rate coding, which are of primary importance in today's applications. Finally, the dependent structure of an MCP system makes optimal rate allocation very difficult [6].

Meanwhile, three-dimensional subband coding (3DSBC) has been proposed as an alternative to the MCP approach, mostly for medium and higher bit-rate applications. In [7] and [8], the two-tap Haar filter is used for the temporal decomposition, making them very similar to frame-difference coding. Longer filters are used in [9], where the SPIHT coder is extended to 3-D by processing a group of 16 frames at a time. To take better advantage of the temporal correlation, motion compensation has been incorporated into 3DSBC as well [10, 11, 6].

In this paper, a new object-based (OB) coding method, OB-3DSBC, is introduced in which the temporal redundancies are exploited by motion-compensated 3-D subband/wavelet analysis. The filtering for temporal subband analysis is performed along the motion trajectories within each object (Fig. 1). The object segmentation and motion, along with the covered/uncovered region information, enable the filtering to be carried out in a systematic fashion. Up to now, rather ad hoc rules have been employed in solving the temporal region consistency problem in implementing the filtering [6, 10]. In other words, the object-based approach allows us to make a better decision on where and how to implement the motion-compensated temporal analysis. Spatial subband/wavelet analysis follows the temporal decomposition, resulting in an object-based spatiotemporal multiresolution representation of the video. The rate can be distributed among the subbands optimally in the rate-distortion sense. In addition, by coding and/or decoding only selected objects and subbands, object and frame-rate scalability is incorporated into our coder.

This work was supported by the National Science Foundation under grant MIP-9528312.

Figure 1: Object-based motion-compensated filtering (axis: time)

Figure 2: Block diagram of OB-3DSBC. (a) Coder, applied per object: image sequence, motion estimation and segmentation, OB temporal analysis, OB spatial analysis, coefficient coding; the object motion/contour is passed alongside. (b) Decoder, applied per object: coded coefficients, coefficient decoding, OB spatial synthesis, OB temporal synthesis, reconstructed sequence; the object motion/contour is used as input.
The block diagram of Fig. 2 shows the main components of our OB-3DSBC. The motion estimation and segmentation stage partitions the image sequence into temporally linked objects, and finds the motion description for each object at each frame. The object-based (OB) temporal analysis/synthesis is the main contribution of this paper. The temporal bands of the objects are further split spatially, and the resulting spatiotemporal subband/wavelet coefficients are quantized.

Figure 3: Example: Object-based temporal filtering for a GOP of 4 (frames 0 to 3 of one GOP; legend: connected pixel, unconnected pixel)

2 Object-based 3-D Subband Coding (OB-3DSBC)
2.1 Motion estimation and segmentation
The video is segmented into coherent moving objects using the joint motion estimation and segmentation algorithm proposed in [5]. This algorithm is well suited to OB-3DSBC for several reasons. Most importantly, the objects are temporally linked, and thus represented as 3-D volumes in the space-time domain. This is essential if we want to perform any type of temporal processing on each object individually. In addition, the covered/uncovered regions due to object motion can be extracted readily. Also, the algorithm in [5] encourages the object labels to be consistent along the motion trajectories, which facilitates the temporal filtering approach. Furthermore, the object boundaries are forced to coincide with intensity-based boundaries, meaning that each object has some degree of spatial cohesiveness. Finally, the object motion is represented by a set of affine parameters, which can handle more complex motion such as rotation and zoom in addition to translational motion.
2.2 Object-based temporal filtering
Although filtering along the motion trajectories within each object is simple in concept, some practical issues must be resolved in the implementation. Theoretically, it would be ideal for the temporal filtering of a given object to begin when the object first appears and continue until it disappears. However, due to buffer and time delay constraints, we chose to perform the filtering by groups of pictures (GOP) of 4 frames. Although the 2-tap Haar filter has been championed in the conventional MC 3-D subband coding schemes [6, 10, 12], we implement multi-tap filters to take full advantage of the object linkage within the 4 frames. The basic mechanism of our temporal filtering is best illustrated with Fig. 3, which shows 4 frames from a hypothetical image sequence. For the sake of clarity, the frames are displayed in 1-D, where it can be seen that the object segmentation information facilitates the temporal filtering. This is because the segmentation algorithm [5] encourages the object labels to be consistent along the motion trajectories. Because the temporal filtering must be done on arbitrary-length signals (see Fig. 3), the efficient signal extension method proposed by Barnard [13] is applied at the boundaries. Finally, the "unconnected" pixel problem [10] still occurs, but can be resolved more efficiently, since we have a better idea of the region to which these pixels belong. These regions are handled separately at the encoding stage.
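The way object labels and motion produce arbitrary-length temporal signals and unconnected pixels within a GOP can be sketched in 1-D, in the spirit of the hypothetical sequence of Fig. 3. The following is only an illustration under stated assumptions, not the procedure of [5] or of this paper: the function name `link_trajectories`, the integer per-pixel displacements, and the rule that a pixel with no temporal partner counts as "unconnected" are all hypothetical simplifications.

```python
def link_trajectories(labels, motion):
    """Trace 1-D motion trajectories through one GOP (illustrative sketch).

    labels[t][x] : object label of pixel x in frame t
    motion[t][x] : assumed integer displacement of pixel x from frame t to t+1
    Returns (trajectories, unconnected). Each trajectory is a list of
    (frame, position) pairs; unconnected holds pixels with no temporal partner.
    """
    n_frames, width = len(labels), len(labels[0])
    claimed = [[False] * width for _ in range(n_frames)]
    trajectories, unconnected = [], []
    for t0 in range(n_frames):
        for x0 in range(width):
            if claimed[t0][x0]:
                continue
            # Start a new trajectory at the first pixel not yet on any path.
            path = [(t0, x0)]
            claimed[t0][x0] = True
            t, x = t0, x0
            while t + 1 < n_frames:
                nx = x + motion[t][x]
                # Stop at frame borders, object boundaries, or occupied pixels.
                if not (0 <= nx < width):
                    break
                if labels[t + 1][nx] != labels[t0][x0] or claimed[t + 1][nx]:
                    break
                t, x = t + 1, nx
                claimed[t][x] = True
                path.append((t, x))
            if len(path) == 1:
                unconnected.append(path[0])
            else:
                trajectories.append(path)
    return trajectories, unconnected


# Hypothetical GOP of 4 frames, 6 pixels wide: object 1 moves right by one
# pixel per frame over a static background (label 0).
labels = [
    [1, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 0],
]
motion = [[1 if lab == 1 else 0 for lab in row] for row in labels[:-1]]

trajectories, unconnected = link_trajectories(labels, motion)
for path in trajectories:
    print(len(path), path)
print("unconnected:", unconnected)
```

In this toy sequence the object pixels form full-length trajectories, while background pixels that are covered or uncovered by the object yield shorter signals of length 2 or 3, which is where a boundary extension method such as the one in [13] would be needed before multi-tap filtering.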
2.3 An OB-3DSBC coding scheme
The OB-3DSBC was realized with multi-stage temporal analysis as shown in Fig. 4. The temporal-Low frequency values from the motion-compensated filtering were placed in even-numbered frame pixels, while the temporal-High values were placed in odd-numbered frame pixels [6]. For multi-stage implementation, one needs motion information between two non-consecutive frames (for instance, between frames 0 and 2 in Fig. 4). This was found by cascading the motion vectors between consecutive frames. At the highest stage (lowest frame rate), the tLL (temporal Low-Low) frames were encoded by object-based MCP coding (except the first tLL frame, of course). With the two-stage scheme shown in Fig. 4, this meant a delay of 4 frames, which is tolerable in many applications. The temporal High bands were not further decomposed, as they contained very little energy due to the motion-compensated filtering.

Figure 4: Multi-stage temporal analysis: tL refers to temporal Low band (frames 0 to 7; stage 0: full frame rate; stage 1: half frame rate, bands tL and tH; stage 2: quarter frame rate, bands tLL and tLH)

Following the temporal decomposition, the objects were further analyzed by spatial subband/wavelet coding for arbitrarily-shaped regions [13]. The generalized BFOS algorithm [14] was used in distributing the bits among the spatiotemporal subbands of the objects. The unconnected pixels of uncovered regions were encoded by the mean value, while the unconnected pixels within objects (see Fig. 3) were not encoded at all but merely reconstructed by spatial interpolation at the receiver. With our multiresolution representation, one can select at the decoder the desired object(s), frame rate, and spatial resolution and decode only the relevant spatiotemporal bands.
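The two-stage temporal structure of Fig. 4 and the resulting frame-rate scalability can be illustrated with a minimal sketch. It makes several assumptions that differ from the coder described here: it uses the 2-tap orthonormal Haar filter rather than the multi-tap filters of Section 2.2, it assumes the pixels are already aligned along their motion trajectories (so no motion compensation appears), and it ignores quantization. The function names and the `stages` parameter are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_analysis(frames):
    """One temporal stage: pair frames (0,1), (2,3), ... into low/high bands."""
    low, high = [], []
    for f0, f1 in zip(frames[0::2], frames[1::2]):
        low.append([(a + b) / SQRT2 for a, b in zip(f0, f1)])
        high.append([(b - a) / SQRT2 for a, b in zip(f0, f1)])
    return low, high


def haar_synthesis(low, high):
    """Invert haar_analysis, interleaving the reconstructed frame pairs."""
    frames = []
    for lo, hi in zip(low, high):
        frames.append([(l - h) / SQRT2 for l, h in zip(lo, hi)])
        frames.append([(l + h) / SQRT2 for l, h in zip(lo, hi)])
    return frames


def decode(tLL, tLH, tH, stages):
    """Reconstruct with only the bands needed for the requested frame rate.

    stages=0: quarter frame rate (tLL only), stages=1: half frame rate
    (tLL and tLH), stages=2: full frame rate (all bands).
    """
    if stages == 0:
        return [[v / 2.0 for v in f] for f in tLL]   # rescale to pixel range
    tL = haar_synthesis(tLL, tLH)
    if stages == 1:
        return [[v / SQRT2 for v in f] for f in tL]  # rescale to pixel range
    return haar_synthesis(tL, tH)


# Hypothetical GOP of 4 frames with 2 pixels each.
gop = [[10.0, 20.0], [12.0, 22.0], [14.0, 20.0], [16.0, 18.0]]

tL, tH = haar_analysis(gop)     # stage 1: half frame rate
tLL, tLH = haar_analysis(tL)    # stage 2: quarter frame rate

full = decode(tLL, tLH, tH, 2)
half = decode(tLL, tLH, tH, 1)
quarter = decode(tLL, tLH, tH, 0)
print(len(full), len(half), len(quarter))
```

With the Haar filter, the quarter-rate frame is simply the average of the four frames along each trajectory and the half-rate frames are pairwise averages, so a decoder that discards tH (and tLH) still obtains a usable lower frame-rate sequence.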
3 Experimental results and discussion
Computer simulations were performed on the QCIF-resolution Miss America sequence at 15 fps to test the performance of the proposed OB-3DSBC system. The object contours and motion parameters were encoded as described in [5]. The overall reconstructed PSNRs were on average about 1 dB worse than those reported in [5] for a polished object-based MCP coder, while the two methods produced visually comparable results. Fig. 5 compares the PSNR plots at 20 kbps among the two object-based coders and the 3-D SBC scheme of Choi and Woods [6]. The coder from [6] compares favorably to an efficient MPEG implementation at the relatively high bit-rate of 1.2 Mbps. Fig. 6 shows the "face" of Miss America decoded at various spatial resolutions. Finally, Fig. 7 displays the reconstructed frames at different temporal resolutions. By decoding only the object spatiotemporal subbands of interest, we can get objects or the whole frame at various spatiotemporal resolutions. By incorporating multi-stage or embedded quantization, one could achieve object-based bit-rate scalability as well.

Figure 5: PSNR plot of reconstructed Miss America at the rate 20 kbps (Miss America @ 15 fps; horizontal axis: frame number, 0 to 140; vertical axis: PSNR in dB, 31 to 40; curves: (1) OB-3DSBC, (2) OB-MCP coder, (3) Choi & Woods)
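For reference, the PSNR figures quoted above follow the usual definition, 10 log10(peak^2 / MSE). The sketch below assumes 8-bit samples (peak value 255) and frames given as lists of rows; the function names are hypothetical, and it is not stated here which color components entered the reported numbers.

```python
import math


def psnr(original, reconstructed, peak=255.0):
    """PSNR in dB between two equally sized frames given as lists of rows."""
    total, count = 0.0, 0
    for row_o, row_r in zip(original, reconstructed):
        for o, r in zip(row_o, row_r):
            total += (o - r) ** 2
            count += 1
    mse = total / count
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(peak * peak / mse)


def mean_psnr(originals, reconstructions, peak=255.0):
    """Average of the per-frame PSNR values over a sequence."""
    values = [psnr(o, r, peak) for o, r in zip(originals, reconstructions)]
    return sum(values) / len(values)


# Hypothetical 2x2 frames: two samples are off by one grey level.
frame = [[100, 100], [100, 100]]
coded = [[101, 99], [100, 100]]
print(round(psnr(frame, coded), 2))
```

Averaging per-frame PSNR values, as `mean_psnr` does, is one common way to summarize a curve such as the one in Fig. 5 with a single number.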
Figure 6: Reconstructed "face" at various spatial resolutions

Figure 7: OB-3DSBC reconstructed results of Miss America: (a) full frame-rate (15 fps), (b) half frame-rate (7.5 fps), (c) quarter frame-rate (3.75 fps)

References
[1] H. Musmann, M. Hotter, and J. Ostermann, "Object-oriented analysis-synthesis coding of moving images," Signal Processing: Image Communications, vol. 1, pp. 117-138, Oct. 1989.
[2] C. Gu and M. Kunt, "Contour simplification and motion compensated coding," Signal Processing: Image Communications, vol. 7, pp. 279-296, Nov. 1995.
[3] Y. Yokoyama, Y. Miyamoto, and M. Ohta, "Very low bit rate video coding using arbitrarily shaped region-based motion compensation," IEEE Trans. Circuits and Systems for Video Technology, vol. 5, pp. 500-507, Dec. 1995.
[4] L. Wu, J. Benois-Pineau, P. Delagnes, and D. Barba, "Spatio-temporal segmentation of image sequences for object-oriented low bit-rate image coding," Signal Processing: Image Communications, vol. 8, pp. 513-543, Sept. 1996.
[5] S. Han and J. Woods, "Adaptive coding of moving objects for very low bit-rates," IEEE Journal on Selected Areas in Communications, issue on very low bit-rate video coding, to be published, Jan. 1998.
[6] S. Choi and J. Woods, "Motion-compensated 3D subband coding of video," IEEE Trans. Image Process., submitted for publication, 1996.
[7] G. Karlsson and M. Vetterli, "Three dimensional subband coding of video," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 1100-1103, Apr. 1988.
[8] C. Podilchuk, N. Jayant, and N. Farvardin, "Three-dimensional subband coding of video," IEEE Trans. Image Process., vol. 4, pp. 125-139, Feb. 1995.
[9] B. J. Kim and W. Pearlman, "An embedded wavelet video coder using three-dimensional set partitioning in hierarchical trees (SPIHT)," in Proc. DCC'97, IEEE Data Compression Conference, pp. 251-260, Snowbird, UT, Mar. 1997.
[10] J. Ohm, "Three-dimensional subband coding with motion compensation," IEEE Trans. Image Process., vol. 3, pp. 559-571, Sept. 1994.
[11] D. Taubman and A. Zakhor, "Multi-rate 3-D subband coding of video," IEEE Trans. Image Process., vol. 3, pp. 572-588, Sept. 1994.
[12] G. Lilienfield, Resolution-Scalable Coding of Video Using Motion-Compensated Temporal Filtering and Multistage Quantization. PhD thesis, Rensselaer Polytechnic Institute, Troy, New York, 1997.
[13] H. J. Barnard, Image and Video Coding Using a Wavelet Decomposition. PhD thesis, Delft University of Technology, The Netherlands, 1994.
[14] E. Riskin, "Optimal bit allocation via the generalized BFOS algorithm," IEEE Trans. Inform. Theory, vol. IT-37, pp. 400-402, Mar. 1991.