Object-Based Subband/Wavelet Video Compression

Soo-Chul Han and John W. Woods

ABSTRACT This chapter presents a subband/wavelet video coder using an object-based spatiotemporal segmentation. The moving objects in a video are extracted by means of a joint motion estimation and segmentation algorithm based on a compound Markov random field (MRF) model. The two important features of our technique are the temporal linking of the objects, and the guidance of the motion segmentation with spatial color information. This results in spatiotemporal (3-D) objects that are stable in time, and leads to a new motion-compensated temporal updating and contour coding scheme that greatly reduces the bit-rate needed to transmit the object boundaries. The object interiors can be encoded by either 2-D or 3-D subband/wavelet coding. Simulations at very low bit-rates yield performance comparable to the H.263 coder in terms of reconstructed PSNR. The object-based coder produces visually more pleasing video with less blurriness, and is devoid of block artifacts.
1 Introduction

Video compression to very low bit-rates has attracted considerable attention recently in the image processing community. This is due to the growing list of very low bit-rate applications such as video-conferencing, multimedia, video over telephone lines, wireless communications, and video over the Internet. However, it has been found that standard block-based video coders perform rather poorly at very low bit-rates due to the well-known blocking artifacts. A natural alternative to the block-based standards is object-based coding, first proposed by Musmann et al. [1]. In the object-based approach, the moving objects in the video scene are extracted, and each object is represented by its shape, motion, and texture. Parameters representing the three components are encoded and transmitted, and the reconstruction is performed by synthesizing each object. Although a plethora of work on the extraction and coding of moving objects has appeared since [1], few works carry out the entire analysis-coding process from start to finish.

1 This work was supported in part by National Science Foundation grant MIP9528312.
Thus, the widespread belief that object-based methods could outperform standard techniques at low bit-rates (or any rates) has yet to be firmly established. In this chapter, we attempt to take a step in that direction with new ideas in both the motion analysis and the source encoding. Furthermore, the object-based scheme leads to increased functionalities such as scalability, content-based manipulation, and the combination of synthetic and natural images. This is evidenced by the MPEG-4 standard, which is adopting the object-based approach.

Up to now, several roadblocks have prevented object-based coding systems from outperforming standard block-based techniques. For one thing, extracting the moving objects, such as by means of segmentation, is a very difficult problem in itself due to its ill-posedness and complexity [2]. Next, the gain in motion-compensated prediction must outweigh the additional contour information inherent in an object-based scheme; applying intraframe techniques to encode the contours at each frame has been shown to be inefficient. Finally, it is essential that some objects or regions be encoded in "Intra" mode at certain frames due to a lack of information in the temporal direction. This includes uncovered regions due to object movement, new objects that appear in a scene, and objects which undergo complex motion that cannot be properly described by the adopted motion model.

An object-based coder addressing all of the above-mentioned issues is presented in this chapter. Moreover, we make no a priori assumptions about the contents of the video scene (such as constant background, head-and-shoulders only, etc.). The extraction of the moving objects is performed by a joint motion estimation and segmentation algorithm based on compound Markov random field (MRF) models. In our approach, the object motion and shape are guided by the spatial color intensity information. This not only improves the motion estimation/segmentation process itself by extracting meaningful objects true to the scene, but also aids the coding of the object intensities, because a given object then has a certain spatial cohesiveness. The MRF formulation also allows us to temporally link objects, thus creating object volumes in the space-time domain. This helps stabilize the object segmentation process in time and, more importantly, allows the object boundaries to be predicted temporally using the motion information, reducing the boundary coding overhead. With linked objects, uncovered regions and new objects are detected by utilizing both the motion and intensity information.

Object interiors are encoded by either 2-D or 3-D subband/wavelet coding. The 2-D hybrid coding allows objects to be encoded adaptively at each frame, meaning that objects well described by the motion parameters are encoded in "Inter" mode, while those that cannot be predicted in time are encoded in "Intra" mode. This is analogous to P-blocks and I-blocks in the MPEG coding structure, where we now have P-objects and I-objects. Alternatively, the spatiotemporal objects can be encoded by 3-D subband/wavelet coding, which leads to added advantages such as frame-rate scalability and improved rate control [3]. In either case, the subband/wavelet transform must be modified to account for arbitrarily-shaped objects.
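The per-object Inter/Intra decision just described can be pictured as a threshold test on each object's motion-compensated prediction quality. The following is a minimal sketch, not the chapter's actual rule: the PSNR threshold, the 8-bit peak value, and all names are our illustrative assumptions.

```python
import numpy as np

def object_mode(I_t, I_pred, mask, psnr_thresh=30.0):
    """Classify one object as a P-object ("Inter") or I-object ("Intra")
    from the motion-compensated prediction PSNR over its support mask.
    The threshold and 8-bit peak value are illustrative assumptions."""
    err = I_t[mask].astype(np.float64) - I_pred[mask].astype(np.float64)
    mse = np.mean(err ** 2)
    if mse == 0:
        return "Inter"
    psnr = 10.0 * np.log10(255.0 ** 2 / mse)
    return "Inter" if psnr >= psnr_thresh else "Intra"
```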
FIGURE 1. The trajectory of a moving ball: (a) vertical translation and zoom of the ball over time; (b) the resulting object trajectory in space-time.
2 Joint Motion Estimation and Segmentation

In this section, a novel motion estimation and segmentation scheme is presented. Although the algorithm was specifically designed to meet the coding needs described in the previous section, the end results could very well be applied to other image sequence processing applications. The main objective is to segment the video scene into objects that are undergoing distinct motion, along with finding the parameters that describe that motion. In Fig. 1(a), the video scene consists of a ball moving against a stationary background. At each frame, we would like to segment the scene into two objects (the ball and the background) and find the motion of each. Furthermore, if the objects are linked in time, we can create 3-D objects in space-time, as shown in Fig. 1(b). We adopt a Bayesian formulation based on a Markov random field (MRF) model to solve this challenging problem. Our algorithm extends previously published works [4, 5, 6, 7, 8].
2.1 Problem formulation

Let $I_t$ represent the frame at time $t$ of the discretized image sequence. The motion field $d_t$ represents the displacement between $I_t$ and $I_{t-1}$ for each pixel. The segmentation field $z_t$ consists of numerical labels at every pixel, with each label representing one moving object, i.e., $z_t(x) = n$ $(n = 1, 2, \ldots, N)$, for each pixel location $x$ on the lattice $\Lambda$. Here, $N$ refers to the total number of moving objects. Using this notation, the goal of motion estimation/segmentation is to find $\{d_t, z_t\}$ given $I_t$ and $I_{t-1}$.
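To make this notation concrete, the fields can be held in simple arrays. The sketch below is our own illustration (the array shapes, the nearest-neighbor warping, and all names are assumptions, not the chapter's implementation); the compensate() helper evaluates terms of the form $I_{t-1}(x - d_t(x))$ that appear throughout this section.

```python
import numpy as np

H, W = 144, 176  # e.g., a QCIF-sized frame

# I_t: grayscale intensity frame at time t (and its predecessor)
I_t = np.zeros((H, W), dtype=np.float64)
I_tm1 = np.zeros((H, W), dtype=np.float64)

# d_t: per-pixel displacement (dy, dx), mapping x in frame t
# back to x - d_t(x) in frame t-1
d_t = np.zeros((H, W, 2), dtype=np.float64)

# z_t: object label per pixel, z_t(x) = n, n = 1..N
z_t = np.ones((H, W), dtype=np.int32)

def compensate(I_prev, d):
    """Return the motion-compensated prediction I_prev(x - d(x)),
    with nearest-neighbor rounding and border clipping."""
    ys, xs = np.mgrid[0:I_prev.shape[0], 0:I_prev.shape[1]]
    y_src = np.clip(np.rint(ys - d[..., 0]).astype(int), 0, I_prev.shape[0] - 1)
    x_src = np.clip(np.rint(xs - d[..., 1]).astype(int), 0, I_prev.shape[1] - 1)
    return I_prev[y_src, x_src]
```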
We adopt a maximum a posteriori (MAP) formulation:
$$\{\hat{d}_t, \hat{z}_t\} = \arg\max_{\{d_t,\, z_t\}} p(d_t, z_t \mid I_t, I_{t-1}), \qquad (1.1)$$
which can be rewritten via Bayes' rule as
$$\{\hat{d}_t, \hat{z}_t\} = \arg\max_{\{d_t,\, z_t\}} p(I_{t-1} \mid d_t, z_t, I_t)\, p(d_t \mid z_t, I_t)\, P(z_t \mid I_t). \qquad (1.2)$$
Given this formulation, the rest of the work amounts to specifying the probability densities (or the corresponding energy functions) involved and solving.
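Concretely, each factor in (1.2) is of Gibbs form $p \propto \exp\{-U\}$, with energies $U_l$, $U_d$, and $U_z$ as specified below. Assuming the normalization constants do not depend on $(d_t, z_t)$, the MAP problem is therefore equivalent to the energy minimization (our restatement of this standard step):
$$\{\hat{d}_t, \hat{z}_t\} = \arg\min_{\{d_t,\, z_t\}} \Big[\, U_l(I_{t-1} \mid d_t, I_t) + U_d(d_t \mid z_t) + U_z(z_t \mid I_t) \,\Big].$$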
2.2 Probability models
The first factor on the right-hand side of (1.2) is the likelihood functional that describes how well the observed images match the motion field data. We model the likelihood functional by
$$p(I_{t-1} \mid d_t, z_t, I_t) = Q_l^{-1} \exp\{-U_l(I_{t-1} \mid d_t, I_t)\}, \qquad (1.3)$$
which is Gaussian. Here the energy function is
$$U_l(I_{t-1} \mid d_t, I_t) = \sum_{x \in \Lambda} \big( I_t(x) - I_{t-1}(x - d_t(x)) \big)^2 / 2\sigma^2, \qquad (1.4)$$
and $Q_l$ is a normalization constant [5, 9].

The a priori density of the motion, $p(d_t \mid z_t, I_t)$, enforces prior constraints on the motion field. We adopt a coupled MRF model to govern the interaction between the motion field and the segmentation field, both spatially and temporally. The energy function is given as
$$U_d(d_t \mid z_t) = \lambda_1 \sum_{x} \sum_{y \in N_x} \| d_t(x) - d_t(y) \|^2 \, \delta(z_t(x) - z_t(y))$$
$$\qquad\qquad + \lambda_2 \sum_{x} \| d_t(x) - d_{t-1}(x - d_t(x)) \|^2$$
$$\qquad\qquad - \lambda_3 \sum_{x} \delta(z_t(x) - z_{t-1}(x - d_t(x))), \qquad (1.5)$$
where $\delta(\cdot)$ refers to the usual Kronecker delta function, $\|\cdot\|$ is the Euclidean norm in $\mathbb{R}^2$, and $N_x$ indicates a spatial neighborhood system at location $x$.
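As an illustration of (1.4) and (1.5), the following sketch evaluates the two energies on NumPy arrays, reusing the compensate() helper from the sketch in Section 2.1. It is our own illustrative code; the weights lam1 through lam3, sigma, and the 4-neighbor system are assumptions, not values specified in the chapter.

```python
import numpy as np
# Reuses compensate() from the sketch in Section 2.1.

def U_l(I_t, I_tm1, d_t, sigma=1.0):
    """Likelihood energy (1.4): summed squared displaced-frame difference."""
    dfd = I_t - compensate(I_tm1, d_t)
    return np.sum(dfd ** 2) / (2.0 * sigma ** 2)

def U_d(d_t, d_tm1, z_t, z_tm1, lam1=1.0, lam2=1.0, lam3=1.0):
    """Motion-prior energy (1.5) with an assumed 4-neighbor spatial system."""
    E = 0.0
    # Term 1: spatial smoothness of d_t, switched off across object boundaries
    for axis in (0, 1):
        dd = np.diff(d_t, axis=axis)           # d_t(x) - d_t(y) for neighbor pairs
        same = (np.diff(z_t, axis=axis) == 0)  # delta(z_t(x) - z_t(y))
        E += lam1 * np.sum(np.sum(dd ** 2, axis=-1) * same)
    # Term 2: temporal continuity of the motion along the trajectory
    d_pred = np.stack([compensate(d_tm1[..., 0], d_t),
                       compensate(d_tm1[..., 1], d_t)], axis=-1)
    E += lam2 * np.sum(np.sum((d_t - d_pred) ** 2, axis=-1))
    # Term 3 (negative): reward label consistency along the trajectory
    z_pred = compensate(z_tm1.astype(float), d_t)
    E -= lam3 * np.sum(z_t == z_pred)
    return E
```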
The first two terms of (1.5) are similar to those in [7], while the third term was added to encourage consistency of the object labels along the motion trajectories. The first term enforces the constraint that the motion vectors be locally smooth on the spatial lattice, but only within the same object label. This allows motion discontinuities at object boundaries without introducing any extra variables such as line fields [10]. The second term accounts for causal temporal continuity, under the model constraint (cf. the second factor in (1.2)) that the motion vector changes slowly from frame to frame along the motion trajectory. Finally, the third term in (1.5) encourages the object labels to be consistent along the motion trajectories. This constraint provides a framework for the object labels to be linked in time.

Lastly, the third factor on the right-hand side of (1.2), $P(z_t \mid I_t)$, models our a priori expectations about the nature of the object label field. In the temporal direction, we have already modeled the object labels to be consistent along the motion trajectories. Our model incorporates spatial intensity information ($I_t$) based on the reasonable assumption that object discontinuities coincide with spatial intensity boundaries. The segmentation field is a discrete-valued MRF with the energy function given as
$$U_z(z_t \mid I_t) = \sum_{x} \sum_{y \in N_x} V_c(z_t(x), z_t(y) \mid I_t), \qquad (1.6)$$
where we specify the clique function as
$$V_c(z_t(x), z_t(y) \mid I_t) = \begin{cases} \;\cdots & \text{if } z_t(x) = z_t(y) \text{ and } s(x) = s(y), \\ \;\cdots & \text{otherwise,} \end{cases}$$
with $s(\cdot)$ denoting a spatial (color) segmentation of the frame.
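Equation (1.6) can likewise be sketched in code. The clique potential below is a hypothetical stand-in, since the chapter's own case-by-case definition is truncated above: it favors equal object labels and waives the penalty for a label change that falls on a boundary of the spatial segmentation s, matching the stated assumption that object discontinuities coincide with intensity boundaries. The value gamma and the 4-neighbor cliques are our assumptions.

```python
def V_c(z_x, z_y, s_x, s_y, gamma=1.0):
    """Hypothetical clique potential (illustrative, not the chapter's):
    reward equal labels; penalize a label change only when it occurs
    inside a single spatial segment (s_x == s_y)."""
    if z_x == z_y:
        return -gamma
    return gamma if s_x == s_y else 0.0

def U_z(z_t, s, gamma=1.0):
    """Segmentation-prior energy (1.6), summed over horizontal and
    vertical neighbor cliques N_x."""
    E = 0.0
    H, W = z_t.shape
    for y in range(H):
        for x in range(W):
            if x + 1 < W:   # horizontal clique
                E += V_c(z_t[y, x], z_t[y, x + 1], s[y, x], s[y, x + 1], gamma)
            if y + 1 < H:   # vertical clique
                E += V_c(z_t[y, x], z_t[y + 1, x], s[y, x], s[y + 1, x], gamma)
    return E
```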