IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 10, NO. 2, MARCH 2000
Sprite Generation and Coding in Multiview Image Sequences

Nikos Grammalidis, Student Member, IEEE, Dimitris Beletsiotis, and Michael G. Strintzis, Senior Member, IEEE
Abstract—A novel algorithm for the generation of background sprite images from multiview image sequences is presented. A dynamic programming algorithm, first proposed in [1], using a multiview matching cost as well as pure geometrical constraints, is used to provide an estimate of the disparity field and to identify occluded areas. By combining motion, disparity, and occlusion information, a sprite image corresponding to the first (main) view at the first time instant is generated. Image pixels from other views that are occluded in the main view are also added to the sprite. Finally, the sprite coding method defined by MPEG-4 is extended for multiview image sequences based on the generated sprite. Experimental results are presented, demonstrating the performance of the proposed technique and comparing it with standard MPEG-4 coding methods applied independently to each view.

Index Terms—Dynamic programming, MPEG-4, multiview video applications, sprites.
I. INTRODUCTION
A background sprite is an image composed of pixels belonging to a video object that are visible throughout a video segment. For instance, a sprite generated from a panning sequence will contain all visible pixels of the background object throughout the sequence. Certain portions of this background may not be visible in certain frames, due to occlusion by foreground objects or due to the camera motion. Since the sprite contains all parts of the background that were visible at least once in the image sequence, it can be used for the reconstruction or the predictive coding of the background. Sprites for background representation are also commonly referred to in the literature as "salient stills" [2], [3] or "background mosaics" [4]–[10].

The procedure for generating background sprite images from a video sequence typically starts by detecting scene cuts (changes) and thus dividing the video sequence into subsequences containing similar content. A background mosaic (sprite) is then generated for each subsequence by warping (aligning) different instances of the background region to a fixed coordinate system, after estimating their motion using a two-dimensional (2-D) or three-dimensional (3-D) motion model. Finally, the information from all warped images is combined into the sprite image by means of median filtering or averaging operations.

Manuscript received March 15, 1999; revised October 20, 1999. This work was supported by the EU IST "INTERFACE" and the GSRT "PAVE" and "PANORAMA" projects. This paper was recommended by Guest Editor Y. Wang. The authors are with the Information Processing Laboratory, Department of Electrical and Computer Engineering, University of Thessaloniki, 540 06 Thessaloniki, Greece (e-mail: [email protected]; [email protected]; [email protected]).
A method for encoding sprite images has been included in the emerging MPEG-4 standard [11], [12]. This method is based on describing simple camera-motion models (e.g., translational, affine, or perspective) by the 2-D motion of a number of points, called reference points. Since sprite images are often much larger than the original images, their coding is complicated by the significant delay (latency) incurred when they are coded and decoded as I-frames. Since the frames following the first are coded and decoded based on the sprite image, such delays may hinder real-time implementation. However, in MPEG-4, the sprite coding syntax allows large static sprite images to be transmitted piece by piece, as well as hierarchically, so that the latency incurred in displaying a video sequence is significantly reduced.

In earlier sprite-generation procedures, no segmentation was used, and the generated sprites always corresponded to the region with the dominant motion, i.e., usually the background. In this case, foreground objects were removed by using temporal averaging or median filtering. However, in order to improve the quality of the generated sprite images and to be able to generate sprites for foreground objects, a number of techniques have been proposed to segment the scene into a number of "layers" [10], [13]–[15]. Layers are regions that typically correspond to the physical objects in the scene. If a multilayered scene description is available, sprites can easily be obtained for each layer, using standard sprite-generation techniques or more sophisticated ones involving depth and transparent objects [15]. Although much effort has been spent in the past on designing sophisticated layer-segmentation procedures, this remains in many respects an open problem.

In this paper, the sprite generation and coding procedures are generalized to the case of multiocular systems, consisting of two or more cameras. Multiocular systems provide the viewer with the appropriate monoscopic or stereoscopic view of a scene, depending on the viewer's position. Several coding schemes have been proposed for stereoscopic [16] and multiview image sequences [1], [17]. A common characteristic of these coding schemes is the use of disparity information to eliminate redundancies between images from different views. Furthermore, the detection of occlusions, i.e., of points not visible in all views, provides additional information that can improve coding results. Techniques based on dynamic programming have been applied for the purposes of disparity estimation and simultaneous occlusion detection [1], [18]–[21] in stereoscopic sequences. A significant advantage of these techniques is that they provide a global solution to the disparity estimation/occlusion detection problem under local constraints, such as constraints related to correlation, smoothness, or the disparity gradient limit.
Fig. 1. A multiview system with three viewpoints.
In previously proposed sprite-generation techniques [4]–[6], motion information has been used extensively to identify the objects in the scene (segmentation) and to determine their position in the sprite image (warping). The present paper proposes novel techniques for sprite generation in which foreground/background segmentation is based mainly on the estimated disparity and occlusion information. This is clearly a more natural way to identify the background, especially in sequences with very small motion, where segmentation based on motion may fail. Furthermore, motion information is used in this paper in a second segmentation step, in order to assign small or occluded regions to the background or foreground regions.

The main contribution of this paper is the use of disparity and occlusion information to add information from all available views to the background sprite image. For example, a part of the background that is occluded in one view may be added to the sprite from another view in which this part appears. The sprite is generated in two stages: the first involves the frames from the first (leftmost) view, uses disparity and occlusion information for segmentation purposes, and is otherwise similar to previously proposed sprite-generation procedures [5], [11]. The second stage involves the frames from the other views and is based exclusively on the estimated disparity and occlusion information.

The sprite coding mode defined by MPEG-4 is then used to code the background region in the entire multiview sequence. Full compliance with the MPEG-4 sprite coding mode is achieved by using the same 6-parameter affine model for both the motion and the disparity information describing the warping transformation between a frame and the sprite image. This model has been shown to be efficient in situations where either the structure of the imaged scene is approximately planar or the scene is sufficiently far from the camera [6], [22]. The entire multiview sequence can then be coded, according to the MPEG-4 sprite coding mode, by reordering all the frames of a group of frames (GOF) into a single sequence as follows: first the frames from the first view, then the corresponding frames
from the second view, and so on. An advantage of this technique is that no disparity or occlusion information needs to be coded for the background region. Experimental results demonstrate a significant reduction in the required bit rate when a single sprite image is used for the entire background of the multiview sequence.

The rest of the paper is organized as follows. The algorithm used for disparity estimation and occlusion detection, which was described in detail in [1], is summarized in Section II. The procedure used to generate a sprite image from the first (leftmost) view of a multiview image sequence is described in Section III. In Section IV, the procedure to generate sprites from the other available views, based on disparity and occlusion information, is presented; the sprite coding scheme defined by MPEG-4 is then generalized for the case of multiview sequences. In Section V, experimental results are obtained using a four-view sequence and a stereo sequence. Comparisons are made against standard MPEG-4 coding schemes, with or without the use of sprite images, applied independently to each view. Finally, conclusions and suggestions for future extensions of the proposed approach are presented in Section VI.

II. A METHOD FOR DISPARITY ESTIMATION AND OCCLUSION DETECTION

Consider a multiocular system with $N$ viewpoints arranged on a horizontal line. A trinocular system ($N = 3$ viewpoints) is shown in Fig. 1. Let $I_k$ be the image corresponding to viewpoint $k$, and let $x_k$ denote the $x$-coordinate of the (perspective) projection in $I_k$ of a 3-D point $P$. We shall estimate disparity and occluded areas for the first (leftmost) image $I_1$. The disparity with respect to $I_k$ of a point $p$ visible in $I_1$ is defined by

$$d_k(p) = \begin{cases} x_k - x_1, & \text{if } P \text{ is visible in } I_k \\ \text{undefined}, & \text{if } P \text{ is occluded in } I_k \end{cases} \qquad (1)$$
By assuming that central projection is used and that all optical axes are parallel, it may be shown [1] that if $P$ is visible in $I_k$, then its disparity equals

$$d_k(p) = \frac{f B_k}{Z} \qquad (2)$$

where $f$ is the focal length, $B_k$ is the baseline corresponding to the $k$th viewpoint, and $Z$ is the depth of $P$. In the example of Fig. 1, the points of the two scene segments visible in both $I_2$ and $I_3$ have both $d_2$ and $d_3$ defined. Since the depth is constant within each of these segments, (2) implies that the corresponding disparities $d_2$ and $d_3$ also remain constant within each of these segments. However, only $d_2$ is defined for points visible in $I_2$ but occluded in $I_3$, while neither $d_2$ nor $d_3$ is defined for points occluded in both $I_2$ and $I_3$.

A pixel in the first view will be said to be in state $S_k$ if it is visible in views $I_1, \ldots, I_k$. In particular, it will be in state $S_N$ if it is visible in all views, in state $S_k$, $1 < k < N$, if it is visible only in $I_1, \ldots, I_k$, and in state $S_1$ if it is invisible (occluded) in all views but $I_1$. For a pixel in state $S_N$, it is seen from (2) that

$$d_k(p) = \frac{B_k}{B_N}\, d_N(p), \qquad k = 2, \ldots, N. \qquad (3)$$

Thus, knowledge of $d_N(p)$ implies knowledge of all $d_k(p)$, $k = 2, \ldots, N$. A dynamic programming scheme was proposed in [1] and [23] so as to estimate the disparity $d_N(p)$ and the state of each pixel in $I_1$. The corresponding valid disparity values $d_k(p)$ are then found from (3). The disparity field obtained in this manner corresponds to each pixel of the first (leftmost) view and as such may be termed the L–R disparity field, to distinguish it from the converse R–L disparity field, which corresponds to each pixel of the $N$th (rightmost) view.

For pixels in state $S_N$, the multiview matching cost between all corresponding pixels is defined as

$$C(p, d_N) = \sum_{k=2}^{N} w_k\, C_k(p, d_k) \qquad (4)$$

where

$$C_k(p, d_k) = \sum_{(x, y) \in W(p)} \left| I_1(x, y) - I_k(x + d_k, y) \right| \qquad (5)$$

and the fixed weights $w_k$ are chosen heuristically. The disparity $d_k$ is given by (3), rounded to the nearest integer, and $W(p)$ is a window centered on the working pixel $p$.
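For concreteness, the cost of (4) and (5) can be written out in code. The following is a minimal sketch, not the authors' implementation: it assumes rectified grayscale views stored as NumPy arrays, the absolute-difference window cost of (5), and hypothetical argument names (`views`, `baselines`, `weights`).

```python
import numpy as np

def multiview_matching_cost(views, baselines, weights, x, y, d_N, win=2):
    """Sketch of the multiview matching cost of (4)-(5).

    views     : list of N rectified grayscale images (2-D float arrays),
                views[0] being the leftmost image I_1.
    baselines : baselines B_k, k = 2..N, w.r.t. the first viewpoint.
    weights   : heuristic fixed weights w_k, k = 2..N.
    (x, y)    : working pixel p in I_1;  d_N : candidate disparity w.r.t. I_N.
    win       : half-size of the square window W(p).
    """
    I1 = views[0]
    h, w = I1.shape
    B_N = baselines[-1]
    cost = 0.0
    for I_k, B_k, w_k in zip(views[1:], baselines, weights):
        # Valid disparities follow from (3), rounded to the nearest integer.
        d_k = int(round(B_k / B_N * d_N))
        c_k = 0.0
        for dy in range(-win, win + 1):
            for dx in range(-win, win + 1):
                yy = min(max(y + dy, 0), h - 1)
                xx = min(max(x + dx, 0), w - 1)
                xk = min(max(xx + d_k, 0), w - 1)  # matching column in I_k
                c_k += abs(float(I1[yy, xx]) - float(I_k[yy, xk]))
        cost += w_k * c_k  # weighted sum over views, as in (4)
    return cost
```

A dynamic programming search of the kind summarized next would evaluate this cost for every candidate $d_N$ along each scanline.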
A dynamic programming algorithm for the calculation of the disparity $d_N$ of the pixels in $I_1$, and for the identification of areas occluded in at least one view, may then be based on the state diagram shown schematically in Fig. 2. The multiview matching cost $C(p, d_N)$ is associated with the transitions between visible states, while a fixed occlusion cost is used for the transitions corresponding to occlusions. Using this algorithm, only two states are identified: $S_N$, assigned to pixels visible in all views, and an occluded state, indicating that the pixel is occluded in at least one view. Finer estimation of disparity and finer detection of occluded regions are achieved by iteratively applying the same algorithm within each of the occluded segments detected by the above algorithm.

Fig. 2. Allowed transitions between states in the general ($N$-view) case. The disparity $d = d_N$ is estimated, and the pixels of $I_1$ occluded in at least one view (not in state $S_N$) are detected.

As detailed in [1], the same dynamic programming algorithm can be used to provide the R–L disparity field and the corresponding state information with respect to the rightmost view.

III. MULTIVIEW SPRITE GENERATION USING INFORMATION FROM THE FIRST VIEW

Sprites are typically generated from monoscopic image sequences by first using a scene-cut detector to identify runs of frames in which a significant part of the scene (usually a large part of the background) remains substantially the same. Each of the resulting subsequences is assumed to contain similar image content, and each is processed independently of the others. Each frame of a subsequence is segmented into a number of regions, each defining a different object. Then, a binary mask representing the shape of each object is produced, which, together with the luminance (or color) information for this object, comprises the video object plane (VOP) in MPEG-4 terminology. The segmentation may be based on luminance, motion, and, in the case of multiview image sequences, disparity information. Segmentation based on disparity information has the following advantages.

1) In sequences produced in videoconferencing and other similar applications, the observed motion may be very small and inadequate for accurate segmentation. However, for such sequences, efficient segmentation into foreground and background regions is possible by using multiple cameras and exploiting disparity information.

2) Luminance-edge and motion-edge information may be conveniently exploited in disparity estimation techniques based on dynamic programming, so that disparity changes are encouraged near luminance or motion edges [20]. Thus, the produced objects have more or less constant disparity, motion, and texture information. This approach is very
suitable for situations where the disparity variations between the background region and the foreground objects are small (e.g., if the distance between the objects and the camera is very large). In such situations, even though segmentation based on disparity alone might fail, the use of luminance or motion information may significantly improve the final segmentation result.

3) Disparity provides a convenient means of layering the objects in the scene: objects with smaller absolute disparity values are at a larger distance from the viewer and are accordingly assigned to deeper layers.

Fig. 3. Object segmentation and motion estimation.

Fig. 4. Procedure for generating sprites from multiview image sequences.

For the above reasons, we use disparity information as the main cue in the proposed segmentation procedure. Specifically, in order to generate the background sprite image, the background region is identified in each frame of the first (leftmost) view. A two-stage motion- and disparity-based segmentation technique is proposed to identify the foreground and background regions. At the first stage, a simple thresholding of the disparity fields is used to initialize the segmentation map for each frame. The threshold value is determined on the basis of the disparity histogram. Pixels occluded in at least one view are left out of this initial classification procedure. A second stage is used to correct errors in the initial segmentation caused by minor local disparity changes or estimation errors. A connected-component labeling procedure is used to find small connected regions, which are labeled as artifacts and excluded from the motion-model estimation stage that follows.
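As an illustration of the two-stage initialization just described, the sketch below thresholds the disparity field and then removes small connected regions. It is only a sketch: the histogram-valley threshold rule and the `min_region` size are hypothetical choices, since the paper does not fix them.

```python
import numpy as np
from scipy import ndimage

def initial_background_mask(disparity, occluded, min_region=100):
    """Sketch of the first segmentation stage: threshold the disparity
    field and discard small connected regions as artifacts.

    disparity : L-R disparity field for the first view (2-D float array).
    occluded  : boolean mask of pixels occluded in at least one view;
                these are left out of the initial classification.
    """
    valid = disparity[~occluded]
    hist, edges = np.histogram(valid, bins=64)
    # Hypothetical threshold choice: the deepest interior valley of the
    # histogram (the paper derives the threshold from the histogram).
    valley = np.argmin(hist[1:-1]) + 1
    thresh = 0.5 * (edges[valley] + edges[valley + 1])

    # Background = small-disparity (distant) pixels, foreground = the rest.
    background = (disparity <= thresh) & ~occluded

    # Label connected components and mark very small regions as artifacts,
    # to be excluded from the motion-model estimation stage.
    labels, n = ndimage.label(background)
    sizes = ndimage.sum(background, labels, index=np.arange(1, n + 1))
    for lbl, size in enumerate(sizes, start=1):
        if size < min_region:
            background[labels == lbl] = False
    return background, thresh
```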
After having identified the background region, its motion is described using a 6-parameter 2-D affine motion model. This model can be expressed as follows:

$$\mathbf{p}_{t+1}^1 = \mathbf{A}_t \mathbf{p}_t^1 + \mathbf{b}_t = \begin{bmatrix} a_1 & a_2 \\ a_4 & a_5 \end{bmatrix} \mathbf{p}_t^1 + \begin{bmatrix} a_3 \\ a_6 \end{bmatrix} \qquad (6)$$

where $\mathbf{p}_t^1 = (x_t, y_t)^T$ is the pixel position at time $t$ and $\mathbf{p}_{t+1}^1$ is the corresponding pixel at time $t+1$. The two indices in $\mathbf{p}_t^1$ indicate the time and the view (here the first), respectively.

The estimation of the affine parameters is based on correspondences obtained using a standard exhaustive-search block-matching procedure. If $M$ matches are available in the background region, (6) yields a system of $2M$ equations with six unknowns. The unknown parameters are then estimated from this system of equations using least-squares techniques.
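The least-squares step can be sketched as follows, assuming the block-matching correspondences are given as two M-by-2 arrays; the function name and parameter ordering are illustrative, not taken from the paper.

```python
import numpy as np

def fit_affine_motion(src, dst):
    """Least-squares fit of the 6-parameter affine model of (6).

    src : (M, 2) array of background pixel positions at time t.
    dst : (M, 2) array of the matched positions at time t+1, obtained
          here from block matching (any correspondence source works).
    Returns (A, b) such that dst ~ src @ A.T + b.
    """
    M = src.shape[0]
    # Each match contributes two equations; stack them as rows of a
    # (2M x 6) design matrix for the unknowns (a1, ..., a6).
    X = np.zeros((2 * M, 6))
    X[0::2, 0:2] = src      # x' = a1*x + a2*y + a3
    X[0::2, 2] = 1.0
    X[1::2, 3:5] = src      # y' = a4*x + a5*y + a6
    X[1::2, 5] = 1.0
    rhs = dst.reshape(-1)   # interleaved (x'_1, y'_1, x'_2, y'_2, ...)
    params, *_ = np.linalg.lstsq(X, rhs, rcond=None)
    A = np.array([[params[0], params[1]],
                  [params[3], params[4]]])
    b = np.array([params[2], params[5]])
    return A, b
```

In practice, a robust variant (e.g., iteratively discarding matches with large residuals) would reduce the influence of outlier matches.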
The pixels in occluded or very small regions $R$ that were excluded from the motion estimation stage are now assigned to the background region if the average displaced frame difference (DFD)

$$\mathrm{DFD}(R) = \frac{1}{|R|} \sum_{\mathbf{p} \in R} \left| I_t^1(\mathbf{p}) - I_{t+1}^1(\mathbf{A}_t \mathbf{p} + \mathbf{b}_t) \right|$$

for this region is smaller than the average DFD in the background region. In the above, the displaced position $\mathbf{A}_t \mathbf{p} + \mathbf{b}_t$ is computed from (6) using the estimated motion parameters, $I_t^k$ denotes the frame from the $k$th view at time $t$, and $|R|$ is the number of pixels in $R$. Using this procedure, the final segmentation map is obtained, and the final motion parameters for the background region are recalculated. The entire segmentation and motion-estimation scheme is summarized in Fig. 3.

The sprite-generation procedure for the background of the main (leftmost) view uses the coordinate system of the first frame as the reference coordinate system for the sprite image. For each frame, the estimated motion parameters are used to compute the motion (warping) transformation of the object between the current frame and the reference coordinate system. Using (6), this transformation can be written as follows:

$$\mathbf{p}_1^1 = \mathbf{W}_t^1(\mathbf{p}_t^1) \qquad (7)$$

where $\mathbf{W}_t^1$ denotes the affine warping transformation from frame $t$ of the first view to the reference coordinate system, obtained by accumulating the frame-to-frame models of (6), and $\mathbf{p}_t^1$ denotes a pixel position in frame $I_t^1$ from the first view. Thus, the video object is warped toward the reference coordinate system. After processing all frames of the sequence, a temporal median is used to produce the final sprite image from the warped objects.
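The warping and median-blending steps might look as follows. This is a simplified sketch with hypothetical argument names: it uses nearest-neighbor sampling, assumes each warp is stored as an invertible matrix–vector pair, and marks empty sprite pixels with NaN.

```python
import numpy as np

def build_sprite_median(frames, masks, warps, sprite_shape):
    """Sketch of sprite generation by warping background regions to the
    reference coordinate system, as in (7), and taking a temporal median.

    frames       : list of grayscale frames from the first view.
    masks        : list of boolean background masks, one per frame.
    warps        : list of affine pairs (A, b) mapping frame coordinates
                   to sprite (reference) coordinates.
    sprite_shape : (height, width) of the sprite image.
    """
    H, W = sprite_shape
    stack = np.full((len(frames), H, W), np.nan)
    ys, xs = np.mgrid[0:H, 0:W]
    sprite_pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    for i, (frame, mask, (A, b)) in enumerate(zip(frames, masks, warps)):
        # Inverse mapping: locate each sprite pixel in the current frame
        # and sample it with a nearest-neighbor lookup.
        src = (sprite_pts - b) @ np.linalg.inv(A).T
        xf = np.rint(src[:, 0]).astype(int)
        yf = np.rint(src[:, 1]).astype(int)
        ok = (0 <= xf) & (xf < frame.shape[1]) & (0 <= yf) & (yf < frame.shape[0])
        ok[ok] &= mask[yf[ok], xf[ok]]       # keep background pixels only
        stack[i, ys.ravel()[ok], xs.ravel()[ok]] = frame[yf[ok], xf[ok]]
    # Temporal median over the available samples; NaNs mark empty pixels.
    return np.nanmedian(stack, axis=0)
```

The progressive-averaging alternative discussed next avoids storing the whole stack, at the cost of being more sensitive to outliers.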
Fig. 5. Estimated disparity fields and state maps for the left and right frame of the Claude sequence. Occluded areas are shown in black color. (a) L–R disparity field. (b) R–L disparity field. (c) State map for the L–R disparity field. (d) State map for the R–L disparity field.
$$S = \frac{1}{T} \sum_{t=1}^{T} w_t \qquad (8)$$

where $T$ is the number of frames. The averaging may also be performed on the spot, as each sample from the warped images becomes available:

$$S_t = \frac{t\, S_{t-1} + w_t}{t + 1} \qquad (9)$$

where $S_t$ and $w_t$ are the sprite image and the warped image corresponding to the $t$th time instant. This method is faster and requires less memory than the former; however, the use of median filtering in the former method improves the quality of the sprite images, since outliers (wrong samples) and noise are eliminated.

IV. MULTIVIEW SPRITE GENERATION AND CODING USING INFORMATION FROM THE REMAINING VIEWS

After the initial sprite-image generation based on information obtained from the first view, information from the remaining views may be added based on the estimated disparity and occlusion information. Specifically, the warping parameters corresponding to frame $I_t^k$, $k = 2, \ldots, N$, in the general case of $N$ views, are computed on the basis of the estimated disparity information. The position in frame $I_t^k$ of a pixel in the background of frame $I_t^1$ is modeled using an affine transformation. Using the notation of Section III, this can be written as follows:

$$\mathbf{p}_t^k = \mathbf{D}_t^k(\mathbf{p}_t^1) \qquad (10)$$
Fig. 6. (a) Initial support map after thresholding the disparity images. (b) Final support map after motion estimation.
These affine model parameters are estimated using least-squares techniques, based on the disparity of the pixels that are visible in all intermediate views. The total warping transformation between frame $I_t^k$ and the sprite image is obtained by combining this affine model for the disparity with the warping model between frame $I_t^1$ and the sprite image (which is aligned to $I_1^1$):

$$\mathbf{p}_1^1 = \mathbf{W}_t^k(\mathbf{p}_t^k) = \mathbf{W}_t^1\!\left((\mathbf{D}_t^k)^{-1}(\mathbf{p}_t^k)\right) \qquad (11)$$

where $\mathbf{W}_t^1$ and $\mathbf{p}_1^1$ are defined in (7). The multiview sprite-generation procedure is illustrated in Fig. 4.
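Since both $\mathbf{D}_t^k$ and $\mathbf{W}_t^1$ are affine, the composition in (11) is again affine and can be folded into a single parameter pair, which is what MPEG-4's affine warping mode expects. A minimal sketch under that assumption, with illustrative names, each transform stored as a matrix–vector pair:

```python
import numpy as np

def compose_view_warp(A_w, b_w, A_d, b_d):
    """Sketch of (11): total affine warp from frame I_t^k to the sprite.

    (A_w, b_w) : warp W_t^1 from frame I_t^1 to the sprite, as in (7).
    (A_d, b_d) : affine disparity model D_t^k of (10), mapping a
                 background pixel of I_t^1 to its position in I_t^k.
    Returns the pair (A, b) mapping I_t^k directly to the sprite:
        p_sprite = A_w @ D^{-1}(p_k) + b_w,
        with D^{-1}(p) = A_d^{-1} @ (p - b_d).
    """
    A_d_inv = np.linalg.inv(A_d)
    A = A_w @ A_d_inv
    b = b_w - A_w @ A_d_inv @ b_d
    return A, b
```

The resulting pair can then be fed to the same warping routine used for the frames of the first view.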
A significant advantage of constructing the sprite image from more than one view is that pixels occluded in some of the views are still retained in the sprite image. More specifically, in order to generate the sprite image, we use the L–R and R–L disparity fields and the corresponding state maps to produce the disparity field that corresponds to frame $I_t^1$. Foreground and background segmentation is based on thresholding this disparity field, after a preprocessing step in which occluded segments between pixels that have a disparity similar to that of the foreground or background regions are assigned to these regions. Assuming that the foreground region has been identified correctly, all other pixels, even those occluded in the left or the right view, can be assimilated into the background. Then, the affine disparity model for the background can be used to model the disparity in the entire background region. As a result, pixels in occluded regions provide additional samples for the warped frames used in the construction of the sprite image.

Fig. 7. Comparison of background sprites obtained from monoscopic and multiview sequences. (a) Monoscopic sprite obtained from ten frames of the left sequence. (b) Multiview sprite obtained from ten frames of all four views using method A1. (c) Multiview sprite obtained from ten frames of all four views using method A2.

Based on the information used for generating the sprite image, we have used and evaluated the following approaches.

1) In method A1, the sprite image is initially generated using only the pixels from the first (leftmost) view. Then, pixels from frames $I_t^k$, $k = 2, \ldots, N$, corresponding to pixel locations where no luminance value has been assigned, are added to the sprite image. This method yields sprite images with better visual quality, because the pixels obtained from the first view are more reliable candidates for the sprite. However, as verified by the experimental results, this leads to very good reconstruction of the leftmost view, but not equally good results for the other views.
TABLE I
CODING OF THE BACKGROUND REGION OF THE "CLAUDE" SEQUENCE
A1, A2: Coding using a single sprite image. B: Independent MPEG-4 coding of each view using the MPEG-4 sprite-coding mode. C: Independent MPEG-4 coding of each view without sprite coding.
2) Method A2 uses an averaging procedure over all available pixels from frames $I_t^k$, $k = 1, \ldots, N$, to update the sprite image. This creates a more balanced sprite image that can be used to obtain satisfactory reconstructions for all channels, since all views contribute information to the sprite image equally. A drawback, however, is that some blurring may be introduced into the sprite image when some of the averaged samples are noisy, due to the inadequacy
of the affine disparity model to describe the local disparity, or due to luminance changes among the different views.

The coding of the multiview sequence using the generated sprite image conforms to the MPEG-4 specifications for sprite coding [11], [12]. This is achieved by reordering all the frames of a group of frames (GOF) into the following order:

$$I_1^1, \ldots, I_T^1,\; I_1^2, \ldots, I_T^2,\; \ldots,\; I_1^N, \ldots, I_T^N \qquad (12)$$

where $T$ is the number of frames per view in the GOF. The warping parameters for each frame of this reordered sequence toward the sprite image are then given by (6) for the first (left) view and by (11) for the remaining views.
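The two sprite-update policies of methods A1 and A2 described above can be sketched as follows. The names are illustrative: `sprite` uses NaN to mark still-empty locations, and the incremental mean implements the per-pixel averaging assumed for A2.

```python
import numpy as np

def update_sprite(sprite, count, warped, valid, method="A2"):
    """Sketch of the sprite-update policies of methods A1 and A2.

    sprite : current sprite (float array, NaN where still empty).
    count  : per-pixel number of samples averaged so far (int array).
    warped : new warped frame, already aligned to the sprite coordinates.
    valid  : boolean mask of sprite pixels that this frame provides.
    """
    empty = np.isnan(sprite)
    fill = valid & empty
    sprite[fill] = warped[fill]     # both methods fill empty locations
    count[fill] = 1
    if method == "A2":
        # A2: every view contributes; existing pixels are averaged in,
        # balancing the views at the cost of possible blurring.
        upd = valid & ~empty
        count[upd] += 1
        sprite[upd] += (warped[upd] - sprite[upd]) / count[upd]
    # A1: pixels from views 2..N only fill locations left empty by view 1.
    return sprite, count
```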
In order to code the warping transformation used to generate the reconstructed images from the sprite image, each transformation is expressed as a set of motion trajectories of a number of reference points. The number of reference points needed to encode the warping parameters determines the transform to be used for warping; e.g., three reference points are needed to fully describe the affine transforms of (6) and (11).

V. EXPERIMENTAL RESULTS

Results are presented for the four-view sequence Claude and the stereoscopic sequence Aqua.¹ Fig. 5(a)–(d) shows the L–R and R–L disparity fields and the corresponding state maps obtained using the proposed algorithm for the first frame. Fig. 6(a) illustrates the initial segmentation map for the first frame, obtained by thresholding the disparity field of Fig. 5(a), while Fig. 6(b) shows the final segmentation map obtained by the proposed algorithm summarized in Fig. 3.

The sprite for the background generated from ten frames of the first (leftmost) view is shown in Fig. 7(a). The sprite obtained by adding occluded pixels from the other three views using method A1, which was discussed in Section IV, is shown in Fig. 7(b). The background sprite image obtained when averaging information from all four sequences according to method A2 is shown in Fig. 7(c). As seen, many new pixels are added to the sprite image. Most of the pixels are seen to be at the correct positions; however, some blurring can be observed in Fig. 7(c), produced using method A2, especially near the upper-left corner. The blurring is mainly due to averaging noisy samples at locations where the local disparity is not adequately described by the affine disparity model.

Coding results for both sprite-based and standard (nonsprite) MPEG-4 coding methods were obtained using the software implementation of the MPEG-4 Version 1 encoder/decoder provided by the ACTS 098 MoMuSys project [24], [25]. In the proposed approaches, methods A1 and A2, the coding of all four views is based on a single sprite image obtained from all views. For comparison, independent coding of each view using MPEG-4 coders with a different sprite image for each view (method B) was also evaluated. Coding results provided by independent coding of each view using standard MPEG-4 coding are also presented (method C).
¹Both sequences were prepared by THOMPSON CSF/LER for the RACE Project 2045 DISTIMA and the ACTS 092 Project PANORAMA.
Fig. 8. (a), (b) Original images for the first and fourth view. (c), (d) Reconstructed images for the first and fourth view. Sprite coding (method A2) is used for the background and standard MPEG-4 coding for the foreground image. (e), (f) Reconstructed images for the first and fourth view using method C.
Results for the coding of the background region of the Claude sequence using these coding techniques are presented in Table I. Methods A1 and A2 are seen to require approximately four times less bit rate to encode the background in all views when compared to method B, and eight times less bit rate when compared to method C. Method A1 results in negligible degradation of the reconstruction quality for the first view; however, the reconstruction quality of the other three views falls by approximately 4.5 dB. Method A2 produces the same significant bit-rate savings at a loss of approximately 3 dB in the reconstruction of all four views. In terms of visual quality, some blurring can be observed in reconstructions obtained using method A2, due to the sprite-image blurring effects discussed above.

We have also used the proposed method to reconstruct each frame by applying sprite coding (method A2) to the background object, while using standard (nonsprite) MPEG-4 coding for the foreground object. The corresponding coding results are presented in Table II. The original first frames from the first and the last view are shown in Fig. 8(a) and (b), respectively, while the corresponding reconstructed frames are shown in Fig. 8(c) and (d). Fig. 8(e) and (f) illustrates the corresponding reconstructed frames obtained using method C (standard MPEG-4 coding) for the entire image area. Similar results
for the middle (second and third) views are presented in Fig. 9(a)–(f). As seen, the visual quality of the reconstructed frames using multiview sprite coding is comparable to that obtained using standard MPEG-4 coding.

Fig. 9. (a), (b) Original images from the second and third view. (c), (d) Reconstructed images from the second and third view. Sprite coding (method A2) is used for the background and standard MPEG-4 coding for the foreground region. (e), (f) Reconstructed images for the second and third view using method C.

TABLE II
CODING OF THE BACKGROUND AND FOREGROUND REGIONS OF THE "CLAUDE" SEQUENCE. THE FOREGROUND IS CODED USING STANDARD MPEG-4 CODING
A1, A2: Coding using a single sprite image. B: Independent MPEG-4 coding of each view using the MPEG-4 sprite-coding mode. C: Independent MPEG-4 coding of each view without sprite coding.

The disparity fields estimated for the first frame of the Aqua sequence are presented in Fig. 10(a) and (b), where occluded regions are marked in black. Although significant depth variations can be observed, satisfactory segmentation of the background region is possible using the proposed approach. The sprite generated from five frames of the left sequence is shown in Fig. 11(a), while the sprites generated from both views using methods A1 and A2 are shown in Fig. 11(b) and (c), respectively. Coding results for the background region are presented in Table III, while results for the entire frames of the stereoscopic sequence are presented in Table IV (using standard MPEG-4 coding for the foreground region). Finally, two reconstructed frames obtained using method A2 are presented in Fig. 10(c) and (d).

Fig. 10. (a), (b) Disparity and occlusion maps for the left and right views (occluded regions are shown in black). (c), (d) Reconstructed images for the left and right view. Sprite coding (method A2) is used for the background and standard MPEG-4 coding for the foreground region.

VI. CONCLUSIONS AND SUGGESTIONS FOR FUTURE WORK
A method for sprite generation from multiview sequences was proposed. Disparity and occlusion estimation is based on an efficient dynamic programming algorithm using information from all views of the multiview sequence. By combining motion, disparity, and occlusion information, a sprite image corresponding to the first (main) view at the first time instant is generated. Image pixels from other views that are occluded in the main view are added to the sprite. The sprite coding method defined by MPEG-4 was extended to the case of a multiview image sequence, based on the generated sprite. Experimental results demonstrating the performance of the proposed technique, and comparing it with methods using sprite generation from monoscopic sequences, were presented. An additional advantage of this technique is that the generated sprite images (mosaics) contain more pixels; thus, additional information is available that may be exploited in other interesting sprite applications, such as object tracking, background substitution, or annotation in multiview sequences.

Significant depth changes or difficulties in the segmentation procedure may hinder successful sprite generation. In order to improve results, various approaches may be followed in the future. More complex warping models than the simple affine or perspective models used by MPEG-4 could be defined to describe the motion of nonplanar surfaces or complex camera motions. Another approach would be to segment the scene into more than two regions (multiple layers) and to use a different warping model for each layer. Efficient sprite-generation procedures for multiple layers, considering transparent objects and depth variations, have already been proposed [15]. In this case, the additional depth information that is necessary to resynthesize the images from the sprite has to be coded. However, sprite coding is not supported for more than one layer by the current version of MPEG-4, since shape coding is not supported for layers coded in the sprite coding mode; this inhibits sprite coding of more than one layer using MPEG-4-compliant methods.

In sequences where there are significant luminance changes among the different views, an interesting extension of the proposed technique would be to incorporate photometric correction methods, similar to those used in [26]. Specifically, the luminance direction and a normal vector could be estimated for the entire background region. Then, an iterative technique could be used to improve the estimation of the affine model parameters by using the photometrically corrected luminance values instead of the real ones.
Fig. 11. Background sprites generated from five frames of the Aqua sequence. (a) Monoscopic sprite generated from the left view. (b) Multiview sprite obtained using method A1. (c) Multiview sprite obtained using method A2.

TABLE III
CODING OF THE BACKGROUND REGION OF THE "AQUA" SEQUENCE
A1, A2: Coding using a single sprite image. B: Independent MPEG-4 coding of each view using the MPEG-4 sprite-coding mode. C: Independent MPEG-4 coding of each view without sprite coding.

TABLE IV
CODING OF THE BACKGROUND AND FOREGROUND REGIONS OF THE "AQUA" SEQUENCE. IN ALL CASES, THE FOREGROUND IS CODED USING STANDARD MPEG-4 CODING
A1, A2: Coding using a single sprite image. B: Independent MPEG-4 coding of each view using the MPEG-4 sprite-coding mode. C: Independent MPEG-4 coding of each view without sprite coding.
REFERENCES

[1] N. Grammalidis and M. G. Strintzis, "Disparity and occlusion estimation in multiocular systems and their coding for the communication of multiview image sequences," IEEE Trans. Circuits Syst. Video Technol., vol. 8, pp. 328–344, June 1998.
[2] M. Massey and W. Bender, "Salient stills: Process and practice," IBM Syst. J., vol. 35, no. 3/4, pp. 557–574, 1996.
[3] L. Teodosio and W. Bender, "Salient video stills: Content and context preserved," in Proc. 1st ACM Int. Conf. Multimedia (MULTIMEDIA '93), New York, Aug. 1993, pp. 39–46.
[4] F. Dufaux and F. Moscheni, "Background mosaicking for low bit rate coding," in Proc. Int. Conf. Image Processing, Lausanne, Switzerland, Sept. 1996.
[5] M. Irani, P. Anandan, J. Bergen, R. Kumar, and S. Hsu, "Efficient representations of video sequences and their applications," Signal Processing: Image Commun., vol. 8, no. 4, pp. 327–351, May 1996.
[6] M. Irani and P. Anandan, "Video indexing based on mosaic representations," Proc. IEEE, vol. 86, no. 5, pp. 905–921, May 1998.
[7] R. Szeliski, "Video mosaics for virtual environments," IEEE Comput. Graphics Applicat., vol. 16, pp. 22–30, Mar. 1996.
[8] R. Szeliski and H.-Y. Shum, "Creating full view panoramic mosaics and environment maps," in Proc. ACM SIGGRAPH '97, T. Whitted, Ed., Aug. 1997, pp. 251–258.
[9] S. Mann and R. W. Picard, "Virtual bellows: Constructing high quality stills from video," in Proc. Int. Conf. Image Processing (ICIP '94), Nov. 1994.
[10] M. Lee, W. Chen, C. B. Lin, C. Gu, T. Markoc, and R. Szeliski, "A layered video object coding system using sprite and affine motion model," IEEE Trans. Circuits Syst. Video Technol., vol. 7, pp. 130–145, Feb. 1997.
[11] MPEG-4 Video Group, "MPEG-4 verification model version 11.0," ISO/IEC JTC1/SC29/WG11/MPEG98/N2172, Tech. Rep., T. Ebrahimi, Ed., Tokyo, Japan, Mar. 1998.
[12] T. Sikora, "The MPEG-4 video standard verification model," IEEE Trans. Circuits Syst. Video Technol., vol. 7, pp. 19–31, Feb. 1997.
[13] J. Y. Wang and E. H. Adelson, "Representing moving images with layers," IEEE Trans. Image Processing, vol. 3, pp. 625–638, Sept. 1994.
[14] T. Darrell and A. Pentland, "Cooperative robust estimation using layers of support," IEEE Trans. Pattern Anal. Machine Intell., vol. 17, pp. 474–487, May 1995.
[15] S. Baker, R. Szeliski, and P. Anandan, "A layered approach to stereo reconstruction," in Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR '98), Santa Barbara, CA, June 1998, pp. 434–441.
[16] M. Ziegler, "Digital stereoscopic imaging and application—A way toward new dimensions: The RACE II Project DISTIMA," in Inst. Elect. Eng. Colloq. Stereoscopic Television, London, U.K., Oct. 1992.
[17] J.-R. Ohm and K. Müller, "Incomplete 3D-multiview representation of video objects," IEEE Trans. Circuits Syst. Video Technol., vol. 9, Feb. 1999.
[18] I. J. Cox, S. Hingorani, B. M. Maggs, and S. B. Rao, "Stereo without disparity gradient smoothing: A Bayesian sensor fusion solution," in Proc. British Machine Vision Conf., New York, 1992, pp. 337–346.
[19] S. S. Intille and A. F. Bobick, "Disparity-space images and large occlusion stereo," M.I.T. Media Lab Perceptual Computing Group, Cambridge, MA, Tech. Rep. 220, 1994.
[20] S. S. Intille and A. F. Bobick, "Incorporating intensity edges in the recovery of occlusion regions," M.I.T. Media Lab Perceptual Computing Group, Cambridge, MA, Tech. Rep. 246, 1994.
[21] L. Falkenhagen, R. Koch, A. Kopernik, and M. Strintzis, "Disparity estimation based on 3-D arbitrarily shaped regions," Digital Stereoscopic Imaging and Applications (DISTIMA), RACE Project R2045, Tech. Rep. R2045/UH/DS/P/023/b1, 1994.
[22] O. Faugeras, Three-Dimensional Computer Vision. Cambridge, MA: MIT Press, 1993.
[23] N. Grammalidis and M. G. Strintzis, "Disparity and occlusion estimation for multiview image sequences using dynamic programming," in Proc. Int. Conf. Image Processing (ICIP '96), Lausanne, Switzerland, Sept. 1996.
[24] ACTS 098 MoMuSys Project, "Software simulation of MPEG-4 video coder." [Online]. Available FTP: drogo.cselt.stet.it/pub/mpeg/mpeg4_fcd/Visual/Natural/ File: MoMuSys-VFCD-V01-980507.tar.gz.
[25] R. Koenen, F. Pereira, and L. Chiariglione, "MPEG-4: Context and objectives," Signal Processing: Image Commun., vol. 9, no. 4, May 1997.
[26] G. Bozdagi, A. M. Tekalp, and L. Onural, "3-D motion estimation and wireframe adaptation including photometric effects for model-based coding of facial image sequences," IEEE Trans. Circuits Syst. Video Technol., vol. 4, pp. 246–256, June 1994.
Nikos Grammalidis (S'93) received the Diploma in electrical engineering from the Aristotle University of Thessaloniki, Thessaloniki, Greece, in 1992. He is currently working toward the Ph.D. degree at the Information Processing Laboratory, Aristotle University of Thessaloniki. His research interests include computer vision and multiview image-sequence coding and processing.

Dimitris Beletsiotis received the Diploma in electrical engineering from the Electrical Engineering Department, Aristotle University of Thessaloniki, Thessaloniki, Greece, in 1998. Presently, he is serving in the Greek Army. His research interests include video-coding and video-processing applications.

Michael G. Strintzis (S'68–M'70–SM'80) received the Diploma in electrical engineering from the National Technical University of Athens, Athens, Greece, in 1967, and the M.A. and Ph.D. degrees in electrical engineering from Princeton University, Princeton, NJ, in 1969 and 1970, respectively. He then joined the Electrical Engineering Department, University of Pittsburgh, Pittsburgh, PA, where he served as Assistant (1970–1976) and Associate (1976–1980) Professor. Since 1980, he has been Professor of Electrical and Computer Engineering at the University of Thessaloniki, and, since 1999, Director of the Informatics and Telematics Research Institute, Thessaloniki, Greece. His current research interests include 2-D and 3-D image coding, image processing, biomedical signal and image processing, and DVD and Internet data authentication and copy protection. Dr. Strintzis was awarded one of the Centennial Medals of the IEEE in 1984.