Unsupervised Semantic Object Segmentation of Stereoscopic Video Sequences
Anastasios D. Doulamis, Nikolaos D. Doulamis, Klimis S. Ntalianis and Stefanos D. Kollias Electrical and Computer Engineering Department National Technical University of Athens Zografou 15773, Athens, Greece E-mail:
[email protected]
Abstract

In this paper we present an efficient technique for unsupervised segmentation of stereoscopic video sequences into semantically meaningful objects. The technique extracts semantic objects by exploiting the additional information that a stereoscopic pair of frames provides. Each pair is analyzed, and the disparity field, occluded areas and depth map are estimated. The key algorithm, which is applied to the stereo pair of images and performs the segmentation, is a powerful low-complexity multiresolution implementation of the RSST algorithm. Color segments are then fused, using the depth segments as constraints. Finally, experimental results are presented which demonstrate the high quality of the semantic object segmentation that this technique achieves.
1. Introduction
Work in the literature has led to the well-known standards MPEG-1, MPEG-2, H.261 and H.263, which are block-based approaches to video analysis and coding. However, as multimedia applications and content-based interactivity have greatly increased in popularity over the last decade, new standards such as MPEG-4 and MPEG-7 have recently emerged. The MPEG-4 standard revolutionizes video and image coding and representation by introducing the concept of video object planes (VOPs). A VOP describes a semantically meaningful object in a frame, and recently many efforts have been directed towards extracting such semantic visual information. Although humans can identify semantic entities effortlessly, this remains a fundamental research problem in the image analysis community.
On the other hand, the use of three-dimensional (3-D) video provides a very efficient visual representation and a spectacular experience for viewers. Virtual reality applications will attract much more attention in the coming decades, and the objects combined to produce virtual worlds will be real-life objects rather than animations or graphics. Additionally, since stereoscopic video introduces new methods of handling and manipulating video objects, it will probably prevail over conventional 2-D representations. These methods basically exploit depth information, which is provided by the perspective projection of 3-D points onto two 2-D image planes, defined by the positions of the two cameras of the stereoscopic system. Several techniques and algorithms have been proposed for video segmentation in the past [1], [2], [3], each with its particular features and applications. In general, segmentation can be categorized into three main classes: segmentation based on color, on motion, and on depth. The methods of the third class are especially easy to apply to 3-D sequences. The technique in [4] uses a 3-D watershed transformation to segment and track objects. The method in [5] is based on motion segmentation, where a dense motion field is estimated and the scene is segmented according to motion information. Other techniques use gradient information of parts of an image to locate deformable object boundaries [6]. In addition to these methods, homogeneity is the cornerstone of region-based approaches [7]. Although the aforementioned techniques work well in situations where the data are simple and the models fit well, they lack generality and the results obtained are far from perfect. The problem of semantic segmentation depends on several factors: a VOP may contain multiple colors, sometimes very similar to the colors of adjacent objects, and multiple gray-scale levels, and it may move in a complicated manner.

In this paper we describe an unsupervised 3-D video segmentation scheme based on a multiresolution implementation of the RSST algorithm, originally proposed in [8]. First, we estimate the disparity and depth field of a pair of frames (left and right channel) by incorporating the geometric properties of perspective projection. Second, we detect occluded areas and then compensate these areas so as to create a more reliable disparity field. The M-RSST algorithm [9] then performs color and depth segmentation. Knowing the depth, and assuming that all pixels of a semantically meaningful object are located at almost the same depth, we can extract semantic entities. Finally, a segmentation fusion algorithm extracts the final semantic objects with high contour accuracy, as shown by the experimental results.
2. Camera model - disparity and depth estimation
Let us assume a stereoscopic system with two cameras of focal length $\lambda$ and baseline distance $b$, whose optical axes converge at an angle $\theta$, as shown in Figure 1. A point $w$ with world coordinates $(X, Y, Z)$ is perspectively projected onto image plane $I_1$, providing the point $(x_1, y_1)$, and onto image plane $I_2$, providing the point $(x_2, y_2)$. It can be shown that the camera coordinates $(x_1, y_1)$ and $(x_2, y_2)$ are expressed with respect to the world coordinates $(X, Y, Z)$ as

$$x_1 = \lambda \frac{X}{Z}, \qquad y_1 = \lambda \frac{Y}{Z} \qquad (1)$$

$$x_2 = \lambda \frac{Xc + Zs - bc'}{-Xs + Zc + bs'}, \qquad y_2 = \lambda \frac{Y}{-Xs + Zc + bs'} \qquad (2)$$

where $s = \sin\theta$, $c = \cos\theta$, $s' = \sin(\theta/2)$ and $c' = \cos(\theta/2)$ [10]. The previous equations are the basis for estimating the depth $Z$ in a stereoscopic pair of frames. Let us define the disparity vector $\mathbf{d}(x_1, y_1)$ between the points $(x_1, y_1)$ and $(x_2, y_2)$, which are both generated by the perspective projection of the same 3-D point $w$. The disparity vector at $(x_1, y_1)$ in image plane $I_1$ with respect to image plane $I_2$ is then $\mathbf{d}(x_1, y_1) = [d_x(x_1, y_1)\; d_y(x_1, y_1)]^T$, where

$$d_x = d_x(x_1, y_1) = x_2 - x_1 = \frac{[\lambda(\lambda s + x_1 c) - x_1(\lambda c - x_1 s)]Z - \lambda b(\lambda c' + x_1 s')}{(\lambda c - x_1 s)Z + \lambda b s'} \qquad (3)$$

$$d_y = d_y(x_1, y_1) = y_2 - y_1 = \frac{[\lambda - (\lambda c - x_1 s)]\, y_1 Z - \lambda b s' y_1}{(\lambda c - x_1 s)Z + \lambda b s'} \qquad (4)$$

Hence the depth can easily be estimated once the disparity vector is known, and if disparity vectors are available for all points $(x_1, y_1)$ of image plane $I_1$, the depth map can be created directly. The main problem arises when we try to estimate the disparity field from the images on planes $I_1$ and $I_2$, because this involves matching each point $(x_1, y_1)$ on $I_1$ with a point $(x_2, y_2)$ on $I_2$. Disparity estimation is accomplished by means of a block matching algorithm originally described in [11]. Depth and disparity estimation results are illustrated in Figure 2 for the "Claude" sequence. The original left channel image $I_1$ and right channel image $I_2$ are shown in Figures 2(a) and 2(b) respectively. The horizontal disparity field $d_x(x_1, y_1)$ is presented in Figure 2(c) (the vertical disparity is negligible), where areas in white correspond to positive disparity and areas in gray to approximately zero disparity.
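As a concrete illustration of equations (1)-(4), the following Python sketch (not part of the original paper; the function names and NumPy interface are our own assumptions) evaluates the forward disparity model and recovers the depth $Z$ by algebraically inverting equation (3):

```python
import numpy as np

def disparity_from_depth(x1, y1, Z, lam, b, theta):
    """Forward model: the disparity components of eqs. (3)-(4) for a
    point (x1, y1) on image plane I1 whose 3-D depth is Z."""
    s, c = np.sin(theta), np.cos(theta)
    sp, cp = np.sin(theta / 2), np.cos(theta / 2)
    denom = (lam * c - x1 * s) * Z + lam * b * sp
    dx = ((lam * (lam * s + x1 * c) - x1 * (lam * c - x1 * s)) * Z
          - lam * b * (lam * cp + x1 * sp)) / denom
    dy = ((lam - (lam * c - x1 * s)) * y1 * Z - lam * b * sp * y1) / denom
    return dx, dy

def depth_from_disparity(x1, dx, lam, b, theta):
    """Recover Z by solving eq. (3), written as dx = (N*Z - M)/(D*Z + E),
    which gives Z = (M + dx*E) / (N - dx*D)."""
    s, c = np.sin(theta), np.cos(theta)
    sp, cp = np.sin(theta / 2), np.cos(theta / 2)
    N = s * (lam ** 2 + x1 ** 2)          # Z-coefficient of the numerator
    M = lam * b * (lam * cp + x1 * sp)
    D = lam * c - x1 * s                  # Z-coefficient of the denominator
    E = lam * b * sp
    return (M + dx * E) / (N - dx * D)
```

As a quick consistency check, `depth_from_disparity(10.0, disparity_from_depth(10.0, 5.0, 100.0, 50.0, 10.0, 0.2)[0], 50.0, 10.0, 0.2)` should return the original depth (100.0) up to floating-point error.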
Figure 1: The camera model (point $w$, image planes $I_1$ and $I_2$, projected points $(x_1, y_1)$ and $(x_2, y_2)$, baseline $b$).
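The block matching step referenced above ([11]) is not detailed in the paper; the following is a minimal sum-of-absolute-differences sketch under our own assumptions (8×8 blocks, a purely horizontal search of ±16 pixels, grayscale input):

```python
import numpy as np

def block_match(left, right, block=8, search=16):
    """Estimate one horizontal disparity per block of the left image by
    minimising the sum of absolute differences (SAD) over a horizontal
    search window in the right image."""
    h, w = left.shape
    dx = np.zeros((h // block, w // block))
    for by in range(h // block):
        for bx in range(w // block):
            y0, x0 = by * block, bx * block
            ref = left[y0:y0 + block, x0:x0 + block].astype(np.int32)
            best_sad, best_d = None, 0
            for d in range(-search, search + 1):
                xs = x0 + d
                if xs < 0 or xs + block > w:
                    continue  # candidate block falls outside the image
                cand = right[y0:y0 + block, xs:xs + block].astype(np.int32)
                sad = int(np.abs(ref - cand).sum())
                if best_sad is None or sad < best_sad:
                    best_sad, best_d = sad, d
            dx[by, bx] = best_d
    return dx
```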
Finally, the depth map is illustrated in Figure 2(d), where areas in black correspond to background and areas in gray to foreground. We should stress that, in both the disparity field and the depth map, the shaded areas of gradual intensity change (at the left of the person and at the right edge of the image) are due to occlusion. Similar results for the "Aqua" sequence are shown in Figure 3.

Figure 2: The depth and disparity for Claude. (a) Left channel image. (b) Right channel image. (c) Horizontal disparity field. (d) Depth map.

Figure 3: The depth and disparity for Aqua. (a) Left channel image. (b) Right channel image. (c) Horizontal disparity field. (d) Depth map.

2.1 Occlusion detection and compensation

In the previous section we assumed that every point $(x_1, y_1)$ of image plane $I_1$ has a corresponding point $(x_2, y_2)$ (with $y_2$ near $y_1$) on image plane $I_2$, which is generally true. However, due to the different viewpoints of the two cameras, there may be areas of $I_1$ that are occluded in $I_2$. This is illustrated in Figure 4, where a foreground and a background object are projected onto both cameras.

Figure 4: Geometry of occluded areas.

As can be seen, although the line segment $AB$ of the background object is visible to camera 1 and projected onto the line segment $A_1B_1$, it is occluded in camera 2, where only point $A$ is projected, onto point $A_2$. The leftmost point $C$ of the foreground object is projected onto $C_1 = B_1$ in image plane $I_1$ and onto $C_2 = A_2$ in $I_2$. For the occluded areas, therefore, the correct disparity vectors cannot be estimated directly, so before continuing with the estimation of depth we must first detect and compensate the occluded areas. Detection is accomplished by locating regions of $I_1$ where the horizontal disparity decreases continuously with respect to the horizontal coordinate $x_1$, with a slope approximately equal to -1. Occlusion compensation is performed by keeping the disparity constant and equal to the maximum disparity value in each occluded area [11]. The occlusion compensation technique is illustrated in Figure 5 for the left channel of the "Claude" sequence. The compensated disparity field and the corresponding depth map are presented in Figures 5(a) and 5(b) respectively. It is observed that, apart from the inaccurate object boundaries (contours), the compensated depth map provides a reliable separation between foreground and background objects. Similar results are obtained for the "Aqua" sequence, shown in Figure 6.
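The detection-and-compensation rule just described can be sketched as follows; this is our own minimal Python/NumPy rendering, in which the slope tolerance `slope_tol` is a hypothetical parameter and the disparity field is assumed to be a dense per-pixel array:

```python
import numpy as np

def compensate_occlusions(dx, slope_tol=0.2):
    """Detect, along each scanline, runs where the horizontal disparity
    decreases with slope approximately -1, and replace each run with its
    maximum disparity value, as in the compensation rule of [11]."""
    out = dx.astype(np.float64).copy()
    for row in out:
        slope = np.diff(row)                   # d(x+1) - d(x)
        occ = np.abs(slope + 1.0) < slope_tol  # slope close to -1
        x = 0
        while x < occ.size:
            if occ[x]:
                start = x
                while x < occ.size and occ[x]:
                    x += 1
                # run covers pixels start..x; hold the maximum disparity
                row[start:x + 1] = row[start:x + 1].max()
            else:
                x += 1
    return out
```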
Figure 5: Occlusion effects for Claude. (a) Compensated horizontal disparity field. (b) Compensated depth map.

Figure 6: Occlusion effects for Aqua. (a) Compensated horizontal disparity field. (b) Compensated depth map.

3. Object extraction

The next step of the algorithm is to segment the stereo image sequences into semantically meaningful objects. Semantic segmentation has been one of the fundamental research problems of the last decades, and it has attracted much more attention recently in the framework of the MPEG-4 standard [1], [12]. In this paper we are interested in a fully automatic segmentation algorithm that is generally applicable to all archives. More precisely, the proposed unsupervised segmentation algorithm is performed on both the color and the corresponding depth field of each video frame, and the color segments are then projected onto the depth segments so that all color segments belonging to the same depth level are merged.

3.1 The color and depth segmentation algorithm

A very interesting and efficient segmentation algorithm, proposed in [8], is the Recursive Shortest Spanning Tree (RSST) algorithm. Its main advantages, compared to other methods such as clustering or morphological watershed, are that it does not impose any external constraints on the images and that it allows the number of segments to be controlled very simply. Its main disadvantage is that the sorting method it adopts directly affects its computational complexity. For this reason we have proposed an improved version of the RSST, the M-RSST algorithm [9], which decreases the complexity significantly while still producing very good results. The general notion of the M-RSST algorithm is that it is applied recursively to images of increasing resolution.

Figure 7: Flowchart of the conventional RSST algorithm.
Consider an image $I$ of size $M_0 \times N_0$ pixels. A hierarchy of images $I(0) = I, I(1), \ldots, I(L_0)$ is constructed by multiresolution decomposition of the initial image $I$. The lowest resolution level is $L_0$, and each image $I(l)$ contains a quarter of the pixels of the image $I(l-1)$ at the previous resolution level. The RSST is initialized by partitioning the image $I(L_0)$ into $M_0/2^{L_0} \times N_0/2^{L_0}$ segments of size one pixel, and links are generated for all 4-connected segment pairs, as depicted in the flowchart of Figure 7. Then, in order to produce an initial segmentation, a weight is assigned to each link, indicating the distance between the two neighboring segments $S_1$ and $S_2$; in color (depth) segmentation the color (depth) distance is used. The initialization phase is completed by sorting the weights of all links. After the initialization, a merging procedure is performed iteratively, uniting at each step the two closest (smallest-distance) segments into a new one. More specifically, the sequence of steps is the following: (1) merging of the two closest segments, followed by calculation of the average intensity and the number of pixels of the new segment; (2) recalculation and sorting of the link weights between the new segment and all its neighbors; (3) removal of any duplicate links. The iteration phase terminates by setting a threshold on the number of segments or on the minimum link weight, as illustrated in Figure 8.

In the following steps of the M-RSST algorithm, an iteration begins in which images of higher resolution are taken into consideration; the algorithm terminates when the highest resolution image $I(0)$ has been processed. More specifically, the following steps are performed: (1) each boundary pixel of the current resolution level, which corresponds to four pixels of the next higher resolution level, is split into four new segments; (2) the newly produced links are assigned weights by means of the distance calculation method and are sorted; (3) a recursive merging of segments takes place during the RSST iteration phase.

Figure 8: Flowchart of the proposed multiresolution approach (M-RSST).

Figure 9: (a) Segmentation at resolution level 3, (b) segment splitting at level 3, (c) segmentation at level 2, (d) segment splitting at level 2.

In our experiments, a minimum link weight (distance) threshold is selected to terminate the segmentation process, while an initial lowest resolution level of $L_0 = 3$ (i.e., a block resolution of 8×8 pixels) is adopted.
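The RSST iteration described above can be sketched in Python. This is our own simplified rendering: a union-find forest plus a lazily updated heap approximates the paper's fully sorted link list, and termination uses a minimum-link-weight threshold as in our experiments:

```python
import heapq
import numpy as np

def rsst_segment(img, weight_threshold):
    """Greedy RSST-style merging: repeatedly unite the two 4-connected
    regions with the smallest mean-intensity distance, stopping when the
    smallest link weight exceeds the threshold."""
    h, w = img.shape
    n = h * w
    parent = list(range(n))                        # union-find forest
    total = img.astype(np.float64).ravel().copy()  # intensity sum per region
    size = np.ones(n)                              # pixels per region

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def weight(a, b):
        return abs(total[a] / size[a] - total[b] / size[b])

    heap = []
    for y in range(h):
        for x in range(w):
            i = y * w + x
            if x + 1 < w:                  # link to right neighbour
                heapq.heappush(heap, (weight(i, i + 1), i, i + 1))
            if y + 1 < h:                  # link to bottom neighbour
                heapq.heappush(heap, (weight(i, i + w), i, i + w))

    while heap:
        d, a, b = heapq.heappop(heap)
        ra, rb = find(a), find(b)
        if ra == rb:
            continue                       # link became internal: discard
        if d != weight(ra, rb):
            # stale entry: an endpoint merged meanwhile; re-push corrected
            heapq.heappush(heap, (weight(ra, rb), ra, rb))
            continue
        if d > weight_threshold:
            break                          # termination criterion reached
        parent[rb] = ra                    # merge rb into ra and update stats
        total[ra] += total[rb]
        size[ra] += size[rb]

    return np.array([find(i) for i in range(n)]).reshape(h, w)
```

For example, `rsst_segment(gray_image, weight_threshold=20.0)` returns a label map whose distinct values identify the surviving regions.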
Figure 10: Final segmentation mask. (a) Original left channel image. (b) Final color segmentation mask. (c) Final depth segmentation.
Initially, the RSST segmentation is applied to the image of the lowest resolution (level 3), as shown in Figure 9(a). Then each boundary pixel (or 8×8 block) is split into four new segments (of size 4×4) according to step (1) of the M-RSST algorithm [Figure 9(b)]. These segments are merged using the RSST iteration at resolution level 2 to produce the segmentation depicted in Figure 9(c). Similarly, Figure 9(d) illustrates the boundary pixels (or 4×4 blocks) that are split into four segments (of size 2×2) each and propagated to resolution level 1. The final color and depth segmentation masks can be seen in Figures 10 and 11: the depth segments of the first sequence are shown in Figure 10(c) and the color segments in Figure 10(b), while the depth segments of the second sequence are shown in Figure 11(c) and the color segments in Figure 11(b).
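To make the multiresolution flow concrete, here is a hedged sketch of the driver loop, under the same assumptions as the `rsst_segment` sketch above (on which it relies); `boundary_mask` and the split step are our own constructions, and the per-level RSST re-merging pass is indicated but not repeated:

```python
import numpy as np

def boundary_mask(labels):
    """True at pixels whose 4-neighbourhood contains a different label."""
    m = np.zeros(labels.shape, dtype=bool)
    vert = labels[:-1, :] != labels[1:, :]
    horiz = labels[:, :-1] != labels[:, 1:]
    m[:-1, :] |= vert
    m[1:, :] |= vert
    m[:, :-1] |= horiz
    m[:, 1:] |= horiz
    return m

def m_rsst(pyramid, weight_threshold):
    """pyramid[k] holds the image at level k (pyramid[0] is full
    resolution; each level halves width and height)."""
    labels = rsst_segment(pyramid[-1], weight_threshold)  # coarsest level
    for k in range(len(pyramid) - 2, -1, -1):
        labels = labels.repeat(2, axis=0).repeat(2, axis=1)  # upsample 2x
        assert labels.shape == pyramid[k].shape
        split = boundary_mask(labels)
        # step (1): give each boundary pixel its own provisional segment;
        # steps (2)-(3), re-weighting and RSST re-merging against the
        # intensities of pyramid[k], are omitted in this skeleton
        labels[split] = labels.max() + 1 + np.arange(split.sum())
        # ... RSST iteration at level k would run here ...
    return labels
```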
4. Constrained fusion of color segments
In the previous section we extracted the color and depth segments using the M-RSST algorithm. The next step is to fuse the color segments, since these identify the object boundaries very accurately. The depth segments, in turn, provide a coarse separation of the semantic objects, because an object usually consists of segments located at approximately the same depth. Using the depth segments as a constraining mask, we therefore project the color segments onto this mask, find for every color segment the depth segment to which it belongs, and finally fuse the color segments of the same depth into one new segment.

The mathematical expression of these notions is the following. Assume $K_c$ color and $K_d$ depth segments, extracted by the M-RSST and denoted as $S_i^c$, $i = 1, 2, \ldots, K_c$, and $S_i^d$, $i = 1, 2, \ldots, K_d$, where $S_i^c \cap S_k^c = \emptyset$ and $S_i^d \cap S_k^d = \emptyset$ for any $i, k$ with $i \neq k$. From these segments we construct two masks, the mask of color segments $G^c$ and the mask of depth segments $G^d$, where

$$G^c = \{S_i^c,\ i = 1, 2, \ldots, K_c\}, \qquad G^d = \{S_i^d,\ i = 1, 2, \ldots, K_d\} \qquad (5)$$

By projecting the color segments onto the depth segments, each color segment is associated with the depth segment to which it mostly belongs (maximum intersection between the segments). This is expressed by the projection function

$$p(S_i^c, G^d) = \arg\max_{g \in G^d} \{a(g \cap S_i^c)\}, \qquad i = 1, 2, \ldots, K_c \qquad (6)$$

where $a(\cdot)$ denotes the number of pixels of a segment. All color segments that are projected onto the same depth segment construct a set $C_i$, so after the projection we obtain $K_d$ sets of color segments, where

$$C_i = \{g \in G^c : p(g, G^d) = S_i^d\}, \qquad i = 1, 2, \ldots, K_d \qquad (7)$$

Each $C_i$ defines a semantic object, and the final segmentation mask $G$ consists of the union of all these objects:

$$G = \bigcup_i C_i \qquad (8)$$
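Equations (6)-(8) translate almost directly into code; the following is a minimal sketch under our own assumptions (both segmentation masks given as integer label maps of identical shape):

```python
import numpy as np

def fuse_segments(color_labels, depth_labels):
    """Assign each color segment to the depth segment with which it shares
    the most pixels (eq. 6), merging color segments of the same depth
    into one semantic object (eqs. 7-8)."""
    fused = np.empty_like(depth_labels)
    for c in np.unique(color_labels):
        mask = color_labels == c
        # a(g ∩ S_i^c) for every depth segment g intersecting this mask
        depths, counts = np.unique(depth_labels[mask], return_counts=True)
        fused[mask] = depths[np.argmax(counts)]  # maximum intersection
    return fused
```

Each value in the array returned by `fuse_segments(color_mask, depth_mask)` is a depth-segment label, so pixels sharing a value form one semantic object of $G$.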
Figure 11: Final segmentation mask for Aqua. (a) Original left channel image. (b) Final color segmentation mask. (c) Final depth segmentation.

Figure 12: Final segmentation after the fusion for Claude. (a) The foreground object. (b) The background object.

Segmentation fusion results are presented in Figure 12 for the "Claude" sequence and in Figure 13 for the "Aqua" sequence. As can be observed, semantically meaningful objects are extracted with high accuracy.

Figure 13: The extracted objects for Aqua.

5. Conclusions and discussion
Stereoscopic video archives are anticipated to increase rapidly in the forthcoming years. At the same time, multimedia and MPEG-4 applications will aim to give users the ability to control and manipulate video objects. This demands a system that is able to segment a frame into its semantically meaningful objects. The extraction of semantic visual information is a difficult task for several reasons: the definition of "semantic" is not obvious, noise poses a serious problem, and only limited mechanisms are available for extracting video objects. Although the human eye can separate semantic objects very easily, the same does not hold for a machine, so the extraction of semantically meaningful entities remains a major problem for the scientific community. Semantic extraction is essentially a segmentation problem, and in any segmentation scheme the selection of suitable homogeneity criteria is of critical importance; different definitions of homogeneity usually lead to widely differing results.
In our segmentation scheme we combine the region-based segmentation results (color segments) with the depth segments in order to achieve more accurate results. The main problems to be solved are the erroneous estimation of the disparity vectors for the occluded areas and the matching procedure for finding the corresponding pixels in a pair of frames. Furthermore, the threshold used by the M-RSST algorithm is chosen heuristically and cannot fit all possible cases. Apart from these drawbacks, the M-RSST algorithm can be included in very promising segmentation schemes, as it combines low complexity with the ability to control the number of segments very easily. Furthermore, it avoids oversegmentation and produces color segments that accurately describe the object boundaries. Experimental results indicate reliable performance on real-life stereoscopic video recordings.

For future work, a semiautomatic segmentation scheme based on the M-RSST may work better in more complicated situations. Additionally, motion segments could be used to extract object boundaries more precisely.

6. Acknowledgements

The authors wish to thank Professor Chas Girdwood, the project manager of the ITC (Winchester), for providing the 3-D video sequence "Eye to Eye", which was produced in the framework of the ACTS MIRAGE project. The authors also wish to express their gratitude to Professor Siegmund Pastoor of the HHI (Berlin) for providing the video sequences of the DISTIMA project.

7. References
[1] T. Meier and K. N. Ngan, "Automatic Segmentation of Moving Objects for Video Object Plane Generation," IEEE Trans. Circuits and Systems for Video Technology, Vol. 8, No. 5, pp. 525-538, Sept. 1998.
[2] N. Doulamis, A. Doulamis, D. Kalogeras and S. Kollias, "Very Low Bit-Rate Coding of Image Sequences Using Adaptive Regions of Interest," IEEE Trans. Circuits and Systems for Video Technology, Vol. 8, No. 8, pp. 928-934, Dec. 1998.
[3] A. Doulamis, N. Doulamis and S. Kollias, "On Line Retrainable Neural Networks: Improving the Performance of Neural Networks in Image Analysis Problems," IEEE Trans. Neural Networks (accepted for publication).
[4] M. Pardas and P. Salembier, "3-D morphological segmentation and motion estimation for image sequences," Signal Processing, Vol. 38, pp. 31-43, Sept. 1994.
[5] M. M. Chang, A. M. Tekalp and M. I. Sezan, "Motion-field segmentation using an adaptive MAP criterion," Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP'93), Minneapolis, MN, Apr. 1993, Vol. V, pp. 33-36.
[6] M. Kass, A. Witkin and D. Terzopoulos, "Snakes: Active contour models," Int. Journal of Computer Vision, Vol. 1, pp. 321-331, 1988.
[7] D. Geiger and A. Yuille, "A common framework for image segmentation," Int. Journal of Computer Vision, Vol. 6, pp. 227-243, 1991.
[8] O. J. Morris, M. J. Lee and A. G. Constantinides, "Graph Theory for Image Analysis: an Approach based on the Shortest Spanning Tree," IEE Proceedings, Vol. 133, pp. 146-152, April 1986.
[9] Y. Avrithis, A. Doulamis, N. Doulamis and S. Kollias, "A Stochastic Framework for Optimal Key Frame Extraction from MPEG Video Databases," Computer Vision and Image Understanding, July 1999.
[10] R. C. Gonzalez and R. E. Woods, Digital Image Processing, Addison-Wesley, 1992.
[11] D. Tzovaras, N. Grammalidis and M. G. Strintzis, "Disparity Field and Depth Map Coding for Multiview 3D Image Generation," Signal Processing: Image Communication, Vol. 11, pp. 205-230, 1998.
[12] C. Gu and M.-C. Lee, "Semiautomatic Segmentation and Tracking of Semantic Video Objects," IEEE Trans. Circuits and Systems for Video Technology, Vol. 8, No. 5, pp. 572-584, Sept. 1998.