Non-Sequential Video Structuring based on Video Object Linking: An Efficient Tool for Video Browsing and Indexing

Klimis S. Ntalianis, Anastasios D. Doulamis, Nikolaos D. Doulamis and Stefanos D. Kollias
National Technical University of Athens, Electrical and Computer Engineering Department, 15773, Athens, Greece
E-mail: [email protected]

Abstract

An efficient system for unsupervised structuring of stereoscopic sequences is presented in this paper, which generates links between similar VOPs of different shots. In particular, after shot cut detection, a fast, unsupervised VOP detection and tracking algorithm is applied to each shot. Then, for each foreground VOP of a frame, a feature vector is constructed using low-level features of the VOP, such as color and size. Afterwards, for a given shot, key-VOP poses are extracted for each VOP using an optimization method that locates the VOP poses with minimally correlated feature vectors. Finally, an iterative process links each key-VOP pose of a shot to key-VOP poses of other shots according to a correlation measure. Experimental results indicate the promising performance of the proposed system on real-life stereoscopic video sequences.

1. Introduction

Recent progress in the fields of video capturing, encoding and processing has led to an explosion in the amount of visual information being stored, accessed and transmitted. Moreover, the rapid development of video-based multimedia and Internet applications has stimulated the need for efficient tools for representation, searching and content-based retrieval. Traditionally, video sequences are represented by numerous consecutive frames. However, such a representation results in sequential access to the video content. While this may be adequate for viewing a video in movie mode, it raises a number of limitations for new applications such as fast video browsing, content-based indexing, retrieval or summarization. Currently, the only way to browse a video is to sequentially scan its frames, a time-consuming and tedious process. In this direction, several video summarization schemes have been proposed, as in [1], where frames are extracted at regular time instances. However, this scheme does not exploit frame similarity. Selection of a single key frame for each shot has been presented in [2], but a single frame cannot provide sufficient information about the video content, especially for shots of long duration and high motion activity. A method for analyzing video and building a pictorial summary for visual representation has been proposed in [3]; this work concentrates on dividing a video sequence into consecutive meaningful segments and then constructing a video poster for each segment based on shot dominance. However, none of the aforementioned techniques take into consideration the semantically meaningful content of a scene, expressed by the Video Object Planes (VOPs); instead, they use the global color, motion and texture characteristics of a frame.

On the other hand, the results provided by current visual information retrieval systems are not yet completely satisfactory. Some interesting works include [4] and [5]. In [4], VOPs are extracted using a motion homogeneity criterion, while in [5] global visual characteristics are considered. Another scheme for video representation is presented in [6], where a low-level video content description is built automatically. However, all the aforementioned schemes face two problems: (i) except for the scheme in [4], they do not consider VOPs, which constitute the information of importance, and (ii) even after a visual feature library has been built, every new visual search can be very time-consuming, since all feature vectors must be checked.

In this paper a novel system is proposed for unsupervised VOP-based structuring of stereoscopic video sequences. The system produces a tree-like structure of the sequence, which can be used for selective transmission, fast browsing or directive indexing. We consider a stereo-capturing scheme, while a 2-D sequence can be stored; this is because a reliable depth map can be produced by stereoscopic video analysis, and each VOP is considered to occupy a specific depth [7]. After shot cut detection, the first frame of each shot is analyzed and a depth segments map is constructed using a segmentation algorithm. Then, for each depth segment, an unsupervised active contour is employed to detect the VOP inside the segment, and each VOP is tracked. Once the VOPs are extracted, a feature vector is formulated for each VOP, exploiting color and size information. Afterwards, within a shot, the minimally correlated poses of each VOP are detected and selected as key-VOP poses. Finally, an iterative process links each key-VOP pose to the other key-VOP poses. An overview of the proposed system is presented in Figure 1.

2. Unsupervised VOP Segmentation

The first stage of the proposed algorithm extracts VOPs from a video sequence. VOP segmentation remains a very interesting research problem, despite the extensive attention it has received over the last decade. Even if it has been partly solved by chroma-key technology or by semiautomatic schemes, there are several cases where these solutions are not effective; characteristic examples include most real-life sequences, for which semiautomatic VOP segmentation schemes might produce accurate results but would be highly time-consuming. To perform VOP extraction in this paper, depth information is exploited, since it gives a better visualization of the shot content and can be easily estimated in stereoscopic video sequences [8]; furthermore, depth segments roughly approximate the VOPs. After generating the depth segmentation map, an active contour approach is activated for each segment, to extract the VOP it contains. Active contour formulation, initialization and convergence are described in the following subsections.

2.1 Active Contour Formulation

An active contour is a parametric curve represented by a vector $\mathbf{v}(s,t) = (x(s,t), y(s,t))$ that moves on the image plane to minimize the energy functional:

$E^*_{snake} = \int_0^1 E_{snake}(\mathbf{v}(s))\,ds = \int_0^1 [E_{int}(\mathbf{v}(s)) + E_{image}(\mathbf{v}(s)) + E_{con}(\mathbf{v}(s))]\,ds$   (1)

where $E_{int}$ represents the internal deformation energy (stretching and bending), $E_{image}$ expresses the image forces and $E_{con}$ corresponds to the external forces [11]. In the proposed scheme, the energy being minimized is formulated as:

$E^*_{snake} = -(w_1 E_{int} + w_2 E_{edge\_map})$   (2)

where $E_{edge\_map}$ is an external energy originating from a simplified edge map, described in the following subsection, and $E_{int}$ is calculated as in [9].
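To make equation (2) concrete, the following is a minimal sketch of a discretized per-point snake energy. Since $E_{int}$ is specified here only by reference to [9], the classic continuity-plus-curvature internal term of [11] is assumed; the edge-map term is taken as the distance to the nearest pixel of the simplified edge map $E_s(O_i)$; and the weights `w1`, `w2` and all function names are illustrative placeholders. Signs are arranged so that lower energy corresponds to a smoother contour lying closer to the edge map, as required by the greedy step described in Section 2.3.

```python
import numpy as np

def internal_energy(prev_pt, pt, next_pt, mean_spacing):
    # Assumed continuity + curvature term, in the spirit of [9] and [11];
    # the exact formulation of E_int in [9] is not given in this paper.
    continuity = abs(mean_spacing - np.linalg.norm(pt - prev_pt))
    curvature = np.linalg.norm(prev_pt - 2.0 * pt + next_pt) ** 2
    return continuity + curvature

def edge_map_energy(pt, edge_pixels):
    # Attraction towards the nearest pixel of the simplified edge map
    # E_s(O_i): smaller distance means lower energy.
    return np.min(np.linalg.norm(edge_pixels - pt, axis=1))

def snake_energy(prev_pt, pt, next_pt, mean_spacing, edge_pixels,
                 w1=0.5, w2=1.0):
    # Discretized stand-in for equation (2); lower is better for the
    # greedy 8-neighborhood step of Section 2.3.
    return (w1 * internal_energy(prev_pt, pt, next_pt, mean_spacing)
            + w2 * edge_map_energy(pt, edge_pixels))
```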

2.2 Simplified Edge Map Forces

The energy $E_{edge\_map}$ originates from an edge map estimated inside a considered depth segment. Let us denote by $O_i$ a depth segment. Edges inside $O_i$ are estimated using a Canny edge detector, producing the edge map $E(O_i)$. Afterwards, a line-scanning scheme begins, aiming at simplifying the resulting edge map $E(O_i)$. According to this scheme, $E(O_i)$ is scanned sequentially, line by line. Assume that line $L_i$ of edge map $E(O_i)$ is being processed and that it contains $N(L_i)$ edge pixels. For each of these pixels, the city-block distance is estimated from the two points of the depth contour that belong to line $L_i$. Finally, the left and right edge pixels holding the minimum distance from the respective depth-contour points are kept. An illustrative example is presented in Figure 2, where the left depth-contour point of line $L_i$ is denoted by the capital letter "L" and the right one by the capital letter "R". In Figure 2, the edge pixels of line $L_i$ are denoted as $e_1^{L_i}, e_2^{L_i}, \ldots, e_N^{L_i}$. By calculating the city-block distances of the edge pixels of line $L_i$ from points "L" and "R", the points $e_1^{L_i}$ and $e_N^{L_i}$ are selected, since they satisfy the minimum-distance criterion. The scanning process terminates after all lines of $E(O_i)$ have been processed. The result of the scanning scheme is a simplified edge map $E_s(O_i)$, used for the calculation of the $E_{edge\_map}$ energy [see equation (2)]. The idea behind $E_{edge\_map}$ is that each point of the active contour should be attracted by an edge pixel of an edge map, as in [10]. In our case, $E_s(O_i)$ plays the role of the directional attraction field, where each point of the active contour is attracted by its nearest-neighbor edge pixel of $E_s(O_i)$, as in [9].
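A sketch of the line-scanning simplification follows, assuming OpenCV's Canny detector, a binary mask of the depth segment, and illustrative thresholds `lo`, `hi`. Since all kept pixels lie on the scanned row, the city-block distance to a same-row depth-contour point reduces to the absolute column difference.

```python
import cv2
import numpy as np

def simplified_edge_map(gray, depth_mask, lo=50, hi=150):
    # Build E_s(O_i): per row, keep only the edge pixels nearest (in
    # city-block distance) to the left and right depth-contour points.
    edges = cv2.Canny(gray, lo, hi)
    edges[depth_mask == 0] = 0            # restrict E(O_i) to the segment
    simplified = np.zeros_like(edges)
    for y in range(edges.shape[0]):
        row_cols = np.flatnonzero(depth_mask[y])   # segment extent on row y
        edge_cols = np.flatnonzero(edges[y])
        if row_cols.size == 0 or edge_cols.size == 0:
            continue
        left, right = row_cols[0], row_cols[-1]    # contour points "L", "R"
        # On the same row, city-block distance = |column difference|.
        simplified[y, edge_cols[np.argmin(np.abs(edge_cols - left))]] = 255
        simplified[y, edge_cols[np.argmin(np.abs(edge_cols - right))]] = 255
    return simplified
```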

2.3 Active Contour Initialization and Convergence

After formulating the energy functional, the active contour is initialized in an unsupervised manner onto the depth contour and moves towards the VOP's boundary in order to minimize its energy. To address initialization, let us denote by $K$ the normalized difference of the image intensities between two points $(x_i, y_i)$ and $(x_{i+1}, y_{i+1})$ located on the depth contour:

$K = \frac{|I(x_i, y_i) - I(x_{i+1}, y_{i+1})|}{I(x_i, y_i)}$   (3)

where $I(x_i, y_i)$ denotes the intensity of a point. Let us also denote by $N_{mk}$ the minimum number of points to be kept, by $N_{dc}$ the total number of points of the depth contour and by $P$ the percentage of selected points. Then the following relation holds:

$N_{mk} = N_{dc} \cdot P \Rightarrow N_r = \frac{1}{P}$   (4)

From (4), it can be seen that $N_r$ is the maximum allowed number of points to be rejected between two successive selected points.

In our scheme, a point is selected as an initial point if it satisfies one of the following rules: (a) it presents an intensity change of at least 30% relative to the previously selected point (i.e., $K \geq 0.3$), or (b) its previous $N_r - 1$ points have been rejected. Then a greedy algorithm is employed, which allows the initial active contour to converge to the VOP's contour [11]. More particularly, each pixel of the active contour moves on a grid in an iterative scheme, towards the direction that minimizes the energy of equation (2). The grid is centered at the processed pixel and covers its 8-connected area. The total energy is computed for every pixel in this area, and the point of the active contour moves to the new position inside the area that satisfies two conditions: (a) the new position has less energy than the current position of the active-contour pixel, and (b) the new position has the least energy compared to the other eight pixels of the grid. If these conditions are not satisfied, the point of the active contour stays at its current position and the algorithm continues with the next active-contour point. The algorithm terminates when no position changes occur during an iteration. Finally, each extracted VOP is tracked throughout the whole sequence; for this purpose, the tracking part of the approach in [12] is adopted.

In Figure 3(a), we present a depth segment obtained by applying a segmentation algorithm to the depth map of the sequence, as described in [8]. The edge map inside the depth contour is presented in Figure 3(b), while the simplified edge map after the scanning procedure is depicted in Figure 3(c). The active contour initialization on the depth segment's boundary is presented in Figure 3(d). Finally, the result of the VOP detection procedure is presented in Figure 3(e). As observed, the VOP inside the analyzed depth segment is accurately detected.
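The two initialization rules and the greedy convergence step can be sketched as follows. Here `contour_pts` is assumed to be the ordered (row, column) points of the depth contour, `intensity` a grayscale image, and `energy_at(i, pt)` a wrapper around the `snake_energy` sketch of Section 2.1; names and default values are illustrative, not the authors' implementation.

```python
import numpy as np

def init_points(contour_pts, intensity, K=0.3, P=0.1):
    # Rule (a): relative intensity change >= K, as in equation (3).
    # Rule (b): the previous N_r - 1 points were rejected; N_r = 1/P (eq. (4)).
    N_r = int(round(1.0 / P))
    selected = [contour_pts[0]]
    rejected = 0
    for pt in contour_pts[1:]:
        prev = selected[-1]
        denom = max(float(intensity[tuple(prev)]), 1e-6)
        change = abs(float(intensity[tuple(pt)]) - float(intensity[tuple(prev)])) / denom
        if change >= K or rejected >= N_r - 1:
            selected.append(pt)
            rejected = 0
        else:
            rejected += 1
    return np.array(selected)

OFFSETS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def greedy_converge(snake, energy_at, max_iters=100):
    # snake: (N, 2) integer array of contour points. Each point moves to the
    # cheapest cell of its 3x3 neighborhood, but only if that cell strictly
    # improves on the current position; stop when a full pass makes no move.
    for _ in range(max_iters):
        moved = False
        for i in range(len(snake)):
            pt = snake[i]
            candidates = [pt + np.array(o) for o in OFFSETS]
            best = min(candidates, key=lambda q: energy_at(i, q))
            if energy_at(i, best) < energy_at(i, pt):
                snake[i] = best
                moved = True
        if not moved:
            break
    return snake
```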

3. Key-VOP Extraction

After detecting and tracking the foreground VOPs of each frame within a shot, a multidimensional feature vector including the mean color and size of a VOP is constructed for each VOP. In particular, for each VOP $V_i$, $i \in M = \{1, 2, 3\}$, of a shot, an $L \times 1$ vector $\mathbf{f}_{V_i}$ is formed as:

$\mathbf{f}_{V_i} = [\mathbf{c}^T(V_i) \; a(V_i)]^T$   (5)

where $\mathbf{c}$ is a $3 \times 1$ vector that includes the mean color components of $V_i$ and $a$ is the mean size of $V_i$. Thus, each vector $\mathbf{f}_{V_i}$ has 4 elements. Let us now assume that the $K_F$ most characteristic poses of a VOP should be selected from a given shot. In this paper, the key-VOP poses are selected as the ones with the minimum correlation within a shot. This "minimum correlation" approach is justified by the fact that, in a VOP-based structuring of a sequence, the most different poses of a VOP should be detected, so that all the different forms a VOP takes within a shot are available. Towards this direction, let us define as $\rho_{ij} = C_{ij} / (\sigma_i \sigma_j)$ the correlation coefficient of the feature vectors $\mathbf{f}_{V_i}, \mathbf{f}_{V_j}$, where $C_{ij} = (\mathbf{f}_{V_i} - \mathbf{m})^T (\mathbf{f}_{V_j} - \mathbf{m})$ is the covariance of the two vectors, $\mathbf{m} = \sum_{l=1}^{L_F} \mathbf{f}_{V_l} / L_F$ is the average feature vector of the considered VOP of the shot, with $L_F$ the number of different poses of the VOP within the given shot, and $\sigma_i^2 = C_{ii}$ is the variance of $\mathbf{f}_{V_i}$. In order to define a measure of correlation between $K_F$ feature vectors, we first define the index vector $\mathbf{a} = (a_1, \ldots, a_{K_F}) \in U \subset S^{K_F}$, where

$U = \{(a_1, \ldots, a_{K_F}) \in S^{K_F} : a_1 < \cdots < a_{K_F}\}$   (6)

is a subset of $S^{K_F}$ (the set of index vectors over all poses of a VOP), which contains all sorted index vectors $\mathbf{a}$. Thus, each index vector $\mathbf{a} = (a_1, \ldots, a_{K_F})$ corresponds to a set of numbers representing VOP poses. The correlation measure of the feature vectors $\mathbf{f}_{V_i}$, $i = a_1, \ldots, a_{K_F}$, is then defined as:

$R_F(\mathbf{a}) = \frac{2}{K_F (K_F - 1)} \sum_{i=1}^{K_F - 1} \sum_{j=i+1}^{K_F} (\rho_{a_i, a_j})^2$   (7)

It can be observed that searching for a set of $K_F$ minimally correlated feature vectors is equivalent to searching for an index vector $\mathbf{a}$ that minimizes $R_F(\mathbf{a})$. The search is limited to the subset $U$: since index vectors are used to construct sets of feature vectors, any permutation of the elements of $\mathbf{a}$ results in the same set. The set of the $K_F$ least correlated feature vectors, corresponding to the $K_F$ key-VOP poses, is thus represented by

$\hat{\mathbf{a}} = (\hat{a}_1, \ldots, \hat{a}_{K_F}) = \arg\min_{\mathbf{a} \in U} R_F(\mathbf{a})$   (8)

Exhaustive search for minimizing $R_F(\mathbf{a})$ is practically infeasible, since the multidimensional space $U$ includes all possible combinations of the VOP's poses. For this reason, a genetic algorithm approach similar to [13] is adopted. In Figure 4, the key-VOP poses for a shot of the "Eye to Eye" sequence are depicted. As observed, the key-VOP poses are the most uncorrelated (characteristic) poses of the VOP in this particular shot.
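A direct translation of equations (5)-(8) is sketched below. The rows of `F` are assumed to be the 4-element feature vectors of equation (5) for the $L_F$ poses of one VOP; the exhaustive `itertools.combinations` loop enumerates the sorted index set $U$ and is only a stand-in, feasible for shots with few poses, for the genetic algorithm of [13] used in the paper.

```python
import itertools
import numpy as np

def correlation_matrix(F):
    # F: (L_F, 4) matrix of pose feature vectors f_Vi = [c^T(V_i) a(V_i)]^T.
    # Returns rho[i, j] = C_ij / (sigma_i * sigma_j) as defined in Section 3.
    m = F.mean(axis=0)               # average feature vector of the shot
    D = F - m
    C = D @ D.T                      # C_ij = (f_Vi - m)^T (f_Vj - m)
    sigma = np.sqrt(np.diag(C)) + 1e-12   # guard against zero variance
    return C / np.outer(sigma, sigma)

def R_F(rho, a):
    # Correlation measure of equation (7) for a sorted index vector a.
    K = len(a)
    total = sum(rho[a[i], a[j]] ** 2
                for i in range(K - 1) for j in range(i + 1, K))
    return 2.0 / (K * (K - 1)) * total

def key_poses(F, K_F):
    # Equation (8): a_hat = argmin over U of R_F(a).
    rho = correlation_matrix(F)
    return min(itertools.combinations(range(len(F)), K_F),
               key=lambda a: R_F(rho, a))
```

Restricting the search to combinations (rather than permutations) mirrors the definition of $U$: any reordering of $\mathbf{a}$ yields the same set of poses and the same $R_F(\mathbf{a})$.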

4. VOP Linking and Applications

After extracting key-VOP poses for each VOP of a shot, the scheme proceeds to the final step of sequence structuring, in which similar key-VOP poses are linked according to their correlation. More particularly, let us assume that $K_{FS}$ key-VOP poses have been extracted from the whole video sequence, each key-VOP pose described by its feature vector $\mathbf{f}_{V_i}$; the subscript $S$ has been added to indicate the dependence on the sequence. Correlation links are then generated for all possible pairs of key-VOP poses, except those pairs where both key-VOP poses belong to the same shot. The correlation coefficient of two feature vectors is again estimated by $\rho_{ij} = C_{ij} / (\sigma_i \sigma_j)$, where now $C_{ij} = (\mathbf{f}_{V_i} - \mathbf{m})^T (\mathbf{f}_{V_j} - \mathbf{m})$ is the covariance of the two vectors, $\mathbf{m} = \sum_{l=1}^{N} \mathbf{f}_{V_l} / N$ is the average feature vector of the $N$ key-VOP poses of all shots and $\sigma_i^2 = C_{ii}$ is the variance of $\mathbf{f}_{V_i}$. Once correlation links have been generated for all possible pairs, the correlation values for each key-VOP pose are sorted and a tree-like structure of the sequence is generated.

In an indexing and retrieval scheme, the visual information to be indexed can initially be compared to the key-VOP poses of the whole sequence. A distance between key-VOP poses and the provided visual information can be incorporated, e.g. a distance according to correlation, convolution, a Euclidean metric, etc. Then the following content can be retrieved: (a) the key-VOP pose giving the minimum distance, (b) the shot from which this key-VOP pose originates, and (c) the first $K$ minimum-correlation key-VOP poses according to the tree structure, where $K$ is a threshold. This kind of structuring benefits indexing schemes, as it works much faster than searching the whole space of feature vectors every time a visual query is performed.

Furthermore, summarization schemes can also benefit from the proposed unsupervised VOP-based structuring approach. Firstly, a summary of the whole sequence can consist of the $K_{FS}$ key-VOP poses extracted for the whole sequence. Additionally, in an interactive summarization scheme, a user may ask for a key-VOP summary; the application can then, in a hierarchical framework, return the key-VOP poses of the VOP, the shot where the VOP originates and the $K$ minimum-correlation key-VOP poses of other shots, by simply parsing the tree structure. In Figure 5, we present the VOP linking of the key-VOP pose shown in Figure 5(a); only the five most highly correlated key-VOP poses are presented. As observed, the key-VOP poses associated with the key-VOP pose of Figure 5(a) are of similar content.
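The cross-shot linking step can be sketched as follows, under stated assumptions: `F` stacks the feature vectors of all key-VOP poses of the sequence, and `shot_of` is a hypothetical array mapping each pose to its shot index. The per-pose link lists, sorted by correlation coefficient, correspond to the per-node ordering of the tree-like structure described above.

```python
import numpy as np

def link_key_poses(F, shot_of):
    # F: (N, 4) feature matrix of all key-VOP poses in the sequence;
    # shot_of[i] is the shot index of pose i. Returns, for each pose, the
    # indices of other-shot poses sorted by decreasing correlation.
    m = F.mean(axis=0)                    # average over the N key-VOP poses
    D = F - m
    C = D @ D.T                           # C_ij = (f_Vi - m)^T (f_Vj - m)
    sigma = np.sqrt(np.diag(C)) + 1e-12
    rho = C / np.outer(sigma, sigma)
    links = {}
    for i in range(len(F)):
        # Same-shot pairs are excluded, as described in the text.
        others = [j for j in range(len(F)) if shot_of[j] != shot_of[i]]
        links[i] = sorted(others, key=lambda j: rho[i, j], reverse=True)
    return links
```

With this structure, a query is compared once against each of the $K_{FS}$ key-VOP poses; the link list of the best match then directly yields its shot of origin and the $K$ most related poses of other shots, without rescanning the full feature space.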

5. References

[1] M. Mills, J. Cohen and Y. Y. Wong, "A Magnifier Tool for Video Data," Proc. ACM Computer Human Interface, May 1992, pp. 93-98.
[2] S. W. Smoliar and H. J. Zhang, "Content-Based Video Indexing and Retrieval," IEEE Multimedia, pp. 62-72, Summer 1994.
[3] M. M. Yeung and B.-L. Yeo, "Video Visualization for Compact Presentation and Fast Browsing of Pictorial Content," IEEE Trans. CSVT, Vol. 7, No. 5, pp. 771-785, Oct. 1997.
[4] S.-F. Chang, W. Chen, H. J. Meng, H. Sundaram and D. Zhong, "A Fully Automated Content-Based Video Search Engine Supporting Spatiotemporal Queries," IEEE Trans. CSVT, Vol. 8, No. 5, pp. 602-615, Sept. 1998.
[5] G. Iyengar and A. B. Lippman, "VideoBook: An Experiment in Characterization of Video," in Proc. IEEE ICIP, Vol. 3, pp. 855-858, 1996.
[6] Y. Deng and B. S. Manjunath, "NeTra-V: Toward an Object-Based Video Representation," IEEE Trans. CSVT, Vol. 8, No. 5, pp. 616-627, Sept. 1998.
[7] L. Garrido, F. Marques, M. Pardas, P. Salembier and V. Vilaplana, "A Hierarchical Technique for Image Sequence Analysis," in Proc. Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), pp. 13-20, Louvain-la-Neuve, Belgium, June 1997.
[8] A. D. Doulamis, N. D. Doulamis, K. S. Ntalianis and S. D. Kollias, "Unsupervised Semantic Object Segmentation of Stereoscopic Video Sequences," Proc. IEEE Int. Conf. on Intelligence, Information and Systems, Washington, D.C., USA, November 1999.
[9] K. Ntalianis, A. Doulamis, N. Doulamis and S. Kollias, "An Active Contour-Based Scheme towards More Semantic Segmentation of Stereoscopic Video Sequences," Proc. MELECON 2000, Limassol, Cyprus.
[10] L. D. Cohen, "On Active Contour Models and Balloons," CVGIP: Image Understanding, Vol. 53, No. 2, pp. 211-218, March 1991.
[11] D. J. Williams and M. Shah, "A Fast Algorithm for Active Contours and Curvature Estimation," CVGIP: Image Understanding, Vol. 55, No. 1, pp. 14-26, January 1992.
[12] C. Gu and M.-C. Lee, "Semiautomatic Segmentation and Tracking of Semantic Video Objects," IEEE Trans. CSVT, Vol. 8, No. 5, pp. 572-584, Sept. 1998.
[13] Y. Avrithis, A. Doulamis, N. Doulamis and S. Kollias, "A Stochastic Framework for Optimal Key Frame Extraction from MPEG Video Databases," CVIU, Vol. 75, No. 1/2, pp. 3-24, July/August 1999.

Figure 1: System overview of the proposed architecture (shot cut detection; stereoscopic analysis and depth segmentation; unsupervised VOP detection and tracking; key-VOP extraction; key-VOP linking).

Figure 2: The scanning procedure. The left and right depth-contour points of line $L_i$ are denoted "L" and "R", and the edge pixels of the line are denoted $e_1^{L_i}, e_2^{L_i}, \ldots, e_N^{L_i}$.

Figure 3: (a) The depth segment estimated by applying a segmentation algorithm to the depth map. (b) The edge map inside the depth contour. (c) The simplified edge map after the scanning procedure. (d) Active contour initialization onto the depth boundary. (e) The detected video object plane (VOP).

Figure 4: Three key-VOP poses from a shot of the "Eye to Eye" sequence.

Figure 5: (a) The initial key-VOP pose. (b) The five best (most highly correlated) key-VOP poses, which are linked with the initial key-VOP pose. Each key-VOP pose corresponds to a shot.