AUTOMATIC CONTENT-BASED ORGANIZATION OF VIDEO SEQUENCES FOR MULTIMEDIA APPLICATIONS

Klimis S. Ntalianis, Anastasios D. Doulamis, Nikolaos D. Doulamis, and Stefanos D. Kollias
National Technical University of Athens, Dept. of Electrical & Computer Engineering
E-mail: (kntal, adoulam, ndoulam)@image.ntua.gr

Abstract: In this paper an efficient scheme for the automatic organization of stereo-captured video sequences is presented, which exploits the foreground VOP information of frames. More specifically, after shot cut detection, a fast, unsupervised foreground VOP extraction algorithm, based on depth information and normalized Motion Geometric Spaces, is applied to each frame of a shot. Then, for each frame, a feature vector is constructed, containing color and size characteristics of the foreground VOPs within the frame. Afterwards, for a given shot, key frames are extracted using an optimization method that locates minimally correlated feature vectors. Finally, correlation links are generated between the extracted key frames to produce a graph-like structure of the video sequence. Experimental results on real-life stereoscopic sequences indicate the promising performance of the proposed scheme.

1. Introduction

Recent progress in the fields of video capturing, encoding and processing has led to an explosion in the amount of visual information being stored, accessed and transmitted. Moreover, the rapid development of video-based and Internet applications has stimulated the need for new tools and systems that address the issues of efficiently representing, searching, indexing, retrieving and managing visual information by content. Traditionally, a video sequence is represented by numerous consecutive frames, and even if this representation is adequate for viewing a video in movie mode, it raises a number of limitations when addressing new multimedia applications, such as fast video browsing, content-based indexing, retrieving or video summarization. Currently, the only way to browse a video is to sequentially scan its frames, a process that is both time-consuming and tedious. In order to confront this problem, some summarization schemes were proposed in the literature, such as the extraction of frames at regular time instances in [1], which does not exploit frame similarity. Selection of a single key frame for each shot has been presented in [2], which cannot provide sufficient information about the video content, especially for shots of long duration and high motion activity. A method for analyzing video and building a pictorial summary for visual representation has been proposed in [3]. All these techniques use global frame characteristics such as color, motion and texture and do not exploit semantic information, like foreground VOPs, in which the user is usually more interested. On the other hand, retrieval performance can be further improved if efficient video sequence organization tools are developed. Towards this direction some interesting systems were proposed, including [4] and [5]. In the first work, video objects are extracted according to motion homogeneity, while in the second only global visual characteristics are considered. Another scheme for video representation is presented in [6], where a low-level video content description is automatically built. The main problems of the aforementioned schemes are: (I) except for [4], the other schemes do not consider VOP information; (II) even after building a visual feature library, every new visual content search can be very time-consuming, since the feature vectors of all frames of a sequence must be checked.

In this paper a novel system is presented for automatic, content-based video sequence organization. The system produces a graph-like structure of the sequence, which can be used for selective transmission, fast browsing or directive indexing / summarization. More specifically, the proposed system consists of four main modules: (A) shot cut detection, (B) foreground VOP extraction, (C) key frame extraction and (D) key frame linking. The first module is based on the fast and effective shot cut detection algorithm proposed in [7]. Then, for each frame, foreground VOPs are extracted using depth information and normalized Motion Geometric Spaces (MGSs), as it is considered that every VOP occupies a specific depth [8]. In particular, a depth segments map is produced by performing stereo pair analysis and applying a segmentation algorithm [9]. Then normalized MGSs are initialized onto the boundary of each of the foreground depth segments to extract the contained VOP, as in [10]. In the following, a feature vector is constructed for each frame, containing color and size characteristics of the foreground VOPs. Next, key frames are extracted within each shot by detecting minimally correlated feature vectors. Finally, correlation links are generated between all extracted key frames to produce the graph-like structure. An overview of the proposed system is presented in Figure 1. This paper is organized as follows. Section 2 describes the unsupervised VOP extraction technique, Section 3 presents key frame extraction, and Section 4 describes key frame linking and applications, together with experimental results. Finally, Section 5 concludes this paper.
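As a reading aid, the overall control flow of the four modules can be summarized by the following minimal Python sketch. This is only an illustrative skeleton, not the authors' implementation: every function name is a hypothetical placeholder for the corresponding algorithm ([7] for module A, Sections 2-4 for the rest), passed in as a callable so the skeleton stays self-contained.

```python
from typing import Callable, Sequence

def organize_sequence(
    frames: Sequence,                   # decoded frames of the stereo sequence
    detect_shot_cuts: Callable,         # module A: shot cut detection, e.g. [7]
    extract_foreground_vops: Callable,  # module B: depth + MGS analysis (Section 2)
    build_feature_vector: Callable,     # Eq. (6): mean color and size of the VOPs
    select_key_frames: Callable,        # module C: minimum-correlation search (Section 3)
    link_key_frames: Callable,          # module D: correlation links (Section 4)
):
    """Illustrative end-to-end flow of the proposed system (cf. Figure 1)."""
    key_frames = []
    for shot in detect_shot_cuts(frames):
        feats = [build_feature_vector(extract_foreground_vops(f)) for f in shot]
        key_frames.extend(select_key_frames(shot, feats))
    return link_key_frames(key_frames)  # graph-like structure of the sequence
```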

2. Unsupervised VOP Segmentation

VOP segmentation remains a very difficult research problem, despite the extensive attention it has received over the last decades in the image processing and computer vision communities. Some approaches towards this direction are based on semi-automatic techniques. However, semi-automatic object extraction methods generally demand extensive human assistance and thus cannot be incorporated into real-time applications. For this reason an unsupervised VOP segmentation technique is employed in the current scheme, based on the exploitation of depth, as it is contemplated that every VOP occupies a specific depth [8]. Let us consider that an occlusion-compensated depth map is available, produced by stereo pair analysis as in [9]. Then a segmentation algorithm is applied to this depth map to produce a depth segments map. In this paper the M-RSST segmentation algorithm is incorporated [11], as it provides reliable results and significantly reduced computational complexity. In the following, only foreground depth segments are addressed. More particularly, for each foreground depth segment, some feature points are selected on its contour, satisfying a specific ordering [10].

Afterwards an initial MGS is constructed for each feature point. An MGS is the only route a feature point is allowed to move onto [10]. More specifically, let us denote as $P_i = (x_i, y_i)$ and $P_{i+1} = (x_{i+1}, y_{i+1})$ two consecutive neighboring feature points, and as $\hat{u}_i$ the vector connecting $P_i$ (start point) and $P_{i+1}$ (end point). Then the initial $MGS(P_i)$ for $P_i$ is the set that contains all points $p$ satisfying the following equation:

$$MGS(x_i, y_i) = \{\, p = (x, y) : y - y_i = \lambda_i (x - x_i) \,\}, \quad \lambda_i = -\frac{1}{\lambda_{u_i}} \quad (1)$$

where $\lambda_{u_i}$ is the direction coefficient of the vector $\hat{u}_i$. After the initialization of one MGS for each feature point, an iterative normalization process begins, aiming at smoothing the directions of neighboring MGSs. Let us symbolize as $\hat{v}_{p_i}$ the unit vector located on $MGS(P_i)$, starting at initial point $P_i$ and having the direction opposite to that which leads to the background. Let us also symbolize as $\hat{v}_{p_{i-1}}$ and $\hat{v}_{p_{i+1}}$ the similar vectors for $MGS(P_{i-1})$ and $MGS(P_{i+1})$ respectively. Then, two new vectors can be produced according to the following two equations:

$$v_{p_i, i-1} = \hat{v}_{p_i} + w_{i,i-1} \cdot \hat{v}_{p_{i-1}} \quad (2a)$$
$$v_{p_i, i+1} = \hat{v}_{p_i} + w_{i,i+1} \cdot \hat{v}_{p_{i+1}} \quad (2b)$$

where $w_{k,l}$ is a weight related to the distance between two points, e.g. $P_k = (x_k, y_k)$ and $P_l = (x_l, y_l)$, and can be expressed as:

$$w_{k,l} = \frac{1}{D_{k,l}}, \quad D_{k,l} = \sqrt{(x_k - x_l)^2 + (y_k - y_l)^2} \quad (3)$$

Then vectors $v_{p_i, i-1}$ and $v_{p_i, i+1}$ are synthesized to produce the final vector $v_f$:

$$v_f = v_{p_i, i-1} + v_{p_i, i+1} \quad (4)$$

Finally the angle $\varphi$ between $\hat{v}_{p_i}$ and $v_f$ is calculated as:

$$\varphi = \cos^{-1}\left( \frac{x_i x_j + y_i y_j}{\sqrt{x_i^2 + y_i^2}\,\sqrt{x_j^2 + y_j^2}} \right) \quad (5)$$

where $(x_i, y_i)$ and $(x_j, y_j)$ are the coordinates of two vectors that start from the origin $(0,0)$ (for a defined coordinate system), are parallel to the vectors $\hat{v}_{p_i}$ and $v_f$, and have the same lengths as $\hat{v}_{p_i}$ and $v_f$ respectively. In this paper, in order to accelerate convergence, if the angle $\varphi$ is small (less than 3 degrees) then no correction of the MGS's direction is performed. The direction smoothing algorithm terminates if no direction change occurs during an iteration. Finally, each initial feature point moves along its MGS and stops according to color-change and shape-smoothness criteria to extract the VOP [10].

In Figure 2, VOP extraction is presented for one frame of the "Eye to Eye" sequence. More particularly, the original frame is depicted in Figure 2(a), while in Figure 2(b) three normalized MGSs are presented (lines starting from the depth contour) for three neighboring points. The spots on the boundary of Figure 2(b) depict the initially selected feature points. Moreover, the MGSs have been estimated for 100 points. The final, accurately extracted video object is depicted in Figure 2(c).
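To make the direction-normalization step concrete, the following Python sketch implements Eqs. (2)-(5) under stated assumptions: the contour is closed, the initial unit directions are given, and the correction applied to an MGS direction (which the text leaves to [10]) is taken to be the normalized synthesized vector $v_f$. All names are illustrative, not taken from the authors' code.

```python
import numpy as np

def smooth_mgs_directions(points, directions, angle_thresh_deg=3.0, max_iters=100):
    """Iterative MGS direction smoothing, a sketch of Eqs. (2)-(5).

    points     : (N, 2) array of contour feature points P_i (closed contour assumed).
    directions : (N, 2) array of unit vectors, one per MGS, oriented away
                 from the background.
    """
    directions = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    n = len(points)
    for _ in range(max_iters):
        changed = False
        new_dirs = directions.copy()
        for i in range(n):
            prev_i, next_i = (i - 1) % n, (i + 1) % n
            # Distance-based weights of Eq. (3): w_{k,l} = 1 / D_{k,l}
            w_prev = 1.0 / np.linalg.norm(points[i] - points[prev_i])
            w_next = 1.0 / np.linalg.norm(points[i] - points[next_i])
            # Eqs. (2a), (2b), synthesized into v_f as in Eq. (4)
            v_f = (directions[i] + w_prev * directions[prev_i]) + \
                  (directions[i] + w_next * directions[next_i])
            # Angle between the current unit direction and v_f, Eq. (5)
            cos_phi = np.clip(directions[i] @ v_f / np.linalg.norm(v_f), -1.0, 1.0)
            phi = np.degrees(np.arccos(cos_phi))
            # Small angles (< 3 degrees) are left uncorrected, as in the paper;
            # otherwise the direction is replaced by the normalized v_f
            # (an assumption: the exact correction rule is detailed in [10]).
            if phi >= angle_thresh_deg:
                new_dirs[i] = v_f / np.linalg.norm(v_f)
                changed = True
        directions = new_dirs
        if not changed:  # terminate when an iteration changes no direction
            break
    return directions
```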

3. Key Frames Extraction

After extracting the VOPs of a frame, a multidimensional feature vector is constructed for each frame, which includes the mean color and size of the foreground VOPs. In particular, for each frame $F_i$, $i \in M_k = \{1, \ldots, N_F\}$ of a shot $M_k$ containing $N_F$ frames, an $L \times 1$ vector $f_i$ is formed as:

$$f_i = [\, c^T(S_i) \;\; a(S_i) \,]^T \quad (6)$$

where $c$ is a $3 \times 1$ vector that includes the averages of the three color components of the foreground objects of frame $F_i$ and $a$ is the mean size of the foreground objects. Thus, each vector $f_i$ has 4 elements.

Let us now assume that the $K_F$ most characteristic frames should be selected from a given shot $M_k$. In this paper the key frames are selected as the ones with the minimum correlation within the shot. More particularly, let us define as $\rho_{ij} = C_{ij} / (\sigma_i \sigma_j)$ the correlation coefficient of the feature vectors $f_i$, $f_j$, with $F_i, F_j \in M_k$, where $C_{ij} = (f_i - m)^T (f_j - m)$ is the covariance of the two vectors, $m = \sum_{i=1}^{N_F} f_i / N_F$ is the average feature vector of the shot and $\sigma_i^2 = C_{ii}$ is the variance of vector $f_i$. In order to define a measure of correlation between $K_F$ feature vectors, we first define the index vector $a = (a_1, \ldots, a_{K_F}) \in U \subset V^{K_F}$, where

$$U = \{\, (a_1, \ldots, a_{K_F}) \in V^{K_F} : a_1 < \cdots < a_{K_F} \,\} \quad (7)$$

is the subset of $V^{K_F}$ that contains all sorted index vectors $a$. Thus, each index vector $a = (a_1, \ldots, a_{K_F})$ corresponds to a set of frame numbers. The correlation measure of the feature vectors $f_i$, $i = a_1, \ldots, a_{K_F}$ is then defined as

$$R_F(a) = R_F(a_1, \ldots, a_{K_F}) = \frac{2}{K_F (K_F - 1)} \sum_{i=1}^{K_F - 1} \sum_{j=i+1}^{K_F} (\rho_{a_i, a_j})^2 \quad (8)$$

Based on the above definitions, it is clear that searching for a set of $K_F$ minimally correlated feature vectors is equivalent to searching for an index vector $a$ that minimizes $R_F(a)$. Searching is limited to the subset $U$, since index vectors are used to construct sets of feature vectors, and therefore any permutation of the elements of $a$ results in the same set. The set of the $K_F$ least correlated feature vectors, corresponding to the $K_F$ key frames, is thus represented by

$$\hat{a} = (\hat{a}_1, \ldots, \hat{a}_{K_F}) = \arg\min_{a \in U} R_F(a) \quad (9)$$

Unfortunately, an exhaustive search for minimizing $R_F(a)$ is practically infeasible, since the multidimensional space $U$ includes all possible combinations of frames within a shot. In this paper the minimization problem is solved by a genetic algorithm approach similar to the one in [12]. An example of key-frame extraction for a shot is presented in Figure 3, where three key frames are depicted. Furthermore, in the same figure, the foreground object detected using the adopted normalized MGS algorithm is also depicted.
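As a concrete illustration of Eqs. (6)-(9), the following Python sketch computes the correlation coefficients and the measure $R_F(a)$, and selects the least correlated index vector. The exhaustive `min` over all combinations is only a stand-in that is viable for short shots; the paper instead solves the minimization with a genetic algorithm as in [12]. Function names are illustrative.

```python
import numpy as np
from itertools import combinations

def correlation_matrix(features):
    """rho_ij = C_ij / (sigma_i * sigma_j), with C_ij = (f_i - m)^T (f_j - m)."""
    F = np.asarray(features, dtype=float)  # (N_F, 4) feature vectors of Eq. (6)
    D = F - F.mean(axis=0)                 # subtract the average feature vector m
    C = D @ D.T                            # covariances C_ij
    sigma = np.sqrt(np.diag(C))            # sigma_i = sqrt(C_ii)
    return C / np.outer(sigma, sigma)

def R_F(rho, idx):
    """Correlation measure of Eq. (8) for a sorted index vector idx."""
    K = len(idx)
    return 2.0 / (K * (K - 1)) * sum(rho[i, j] ** 2 for i, j in combinations(idx, 2))

def select_key_frames(features, K_F):
    """Eq. (9) by brute force, for illustration only; the paper minimizes
    R_F with a genetic algorithm similar to [12]."""
    rho = correlation_matrix(features)
    return min(combinations(range(len(features)), K_F), key=lambda a: R_F(rho, a))
```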

4. Key Frames Linking and Applications

After extracting key frames for each shot, the scheme proceeds to the final step of video sequence structuring. Links are generated between key frames according to a correlation criterion. More particularly, let us assume that $K_{FS}$ key frames have been extracted from the whole video sequence, each key frame described by its vector $f_i$. Then correlation links are generated for all possible pairs of key frames, except those pairs where both key frames belong to the same shot. The correlation coefficient for two feature vectors $f_i$, $f_j$, with $F_i \in M_k$, $F_j \in M_l$, $k \neq l$, is again estimated by $\rho_{ij} = C_{ij} / (\sigma_i \sigma_j)$, where $C_{ij} = (f_i - m)^T (f_j - m)$ is the covariance of the two vectors and $m$ is the mean vector of all key frames. When correlation links have been estimated for all possible pairs, the correlation values for each key frame are sorted and a graph-like structure of the sequence is generated. In Figure 4, the proposed structure is presented. In particular, the left part depicts the sequence structure within shots, while in the right part the sorted correlation links for a key frame of a shot can be observed. L, H, R, U, ..., P in Figure 4 correspond to arbitrary shots of the sequence.

The provided video sequence structure can greatly benefit contemporary multimedia applications. For example, in an indexing and retrieval scheme, the visual information to be indexed can initially be compared to the key frames of the whole sequence. A distance between key frames and the query information can be incorporated, e.g. a correlation-based, convolution-based or Euclidean distance. Then the following content can be retrieved: (a) the key frame giving the minimum distance, (b) the shot from which the minimum-distance key frame originates and (c) the first K minimum correlation key frames provided by the graph-like structure, where K is a threshold. The provided video sequence structure can benefit indexing schemes, as visual searching can be performed directively, which accelerates the process compared to searching the whole vector space. Furthermore, summarization schemes can also benefit from the proposed automatic content-based sequence organization. Firstly, a summary of the whole sequence can consist of a subset of the $K_{FS}$ key frames extracted from the whole sequence (e.g. selection of one key frame per shot). Additionally, in an interactive summarization scheme, a user may select a key frame out of the $K_{FS}$ available, asking for a summary based on the selected content. Then the application can return, in a hierarchical manner, (a) the other key frames belonging to the same shot as the selected key frame, (b) the whole shot or (c) the K minimum correlation key frames of other shots, by simply parsing the graph structure (content-directive summarization). In Figure 5 the intershot structure for key frame 2 (#8128) of Figure 3 is presented. Only the first five minimum correlation key frames are presented. As observed, even if there are significant changes in the background, the foreground information remains the same.
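A minimal Python sketch of the linking step is given below, assuming the $K_{FS}$ key-frame feature vectors and their shot labels are available. Same-shot pairs are skipped, and each key frame's links are sorted in the direction of decreasing correlation, as in Figure 4. Names are illustrative, not the authors' API.

```python
import numpy as np

def link_key_frames(feats, shot_ids):
    """Build sorted cross-shot correlation links for every key frame.

    feats    : (K_FS, 4) array, the feature vectors f_i of all key frames.
    shot_ids : sequence of shot labels; same-shot pairs are never linked.
    Returns a dict mapping each key-frame index to the indices of key frames
    of other shots, sorted by decreasing correlation coefficient.
    """
    F = np.asarray(feats, dtype=float)
    D = F - F.mean(axis=0)            # m: mean vector of all key frames
    C = D @ D.T                       # covariances C_ij = (f_i - m)^T (f_j - m)
    sigma = np.sqrt(np.diag(C))       # sigma_i = sqrt(C_ii)
    rho = C / np.outer(sigma, sigma)  # correlation coefficients rho_ij
    links = {}
    for i in range(len(F)):
        others = [j for j in range(len(F)) if shot_ids[j] != shot_ids[i]]
        links[i] = sorted(others, key=lambda j: -rho[i, j])
    return links
```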

5. Conclusions and Discussion

Current multimedia applications need efficient structuring tools and systems for video sequences. The demand for searching and retrieving visual information, together with the desire for fast directive browsing, increases continuously. These facts led to the proposed unsupervised system for content-based sequence organization. The system takes advantage of depth information provided by stereoscopic sequence analysis in order to approximate VOPs. Then a fast scheme for VOP extraction is incorporated, based on normalized Motion Geometric Spaces. Afterwards, key frames are extracted within each shot and correlation links are generated among the key frames. The generated graph-like structure of the sequence can be used by indexing and summarization schemes, providing promising results. In future work, faster techniques for VOP extraction should be investigated (e.g. by using a tracking algorithm). Furthermore, other VOP features could be used in the formulation of the feature vectors, such as texture, shape and motion.

6. Acknowledgements

The authors wish to thank Mr. Chas Girdwood for providing the sequence "Eye to Eye", produced in the framework of the ACTS MIRAGE project, and Dr. Siegmund Pastoor for providing the video sequences of the DISTIMA project. This research is funded by the Foundation of Governmental Scholarships of Greece.

7. References

[1] M. Mills, J. Cohen and Y. Y. Wong, "A Magnifier Tool for Video Data," Proc. ACM Computer Human Interface, pp. 93-98, May 1992.
[2] S. W. Smoliar and H. J. Zhang, "Content-Based Video Indexing and Retrieval," IEEE Multimedia, pp. 62-72, Summer 1994.
[3] M. M. Yeung and B.-L. Yeo, "Video Visualization for Compact Presentation and Fast Browsing of Pictorial Content," IEEE Trans. CSVT, Vol. 7, No. 5, pp. 771-785, Oct. 1997.
[4] S.-F. Chang, W. Chen, H. J. Meng, H. Sundaram and D. Zhong, "A Fully Automated Content-Based Video Search Engine Supporting Spatiotemporal Queries," IEEE Trans. CSVT, Vol. 8, No. 5, pp. 602-615, Sept. 1998.
[5] G. Iyengar and A. B. Lippman, "VideoBook: An Experiment in Characterization of Video," in Proc. IEEE Int. Conf. Image Processing, Vol. 3, pp. 855-858, 1996.
[6] Y. Deng and B. S. Manjunath, "NeTra-V: Toward an Object-Based Video Representation," IEEE Trans. CSVT, Vol. 8, No. 5, pp. 616-627, Sept. 1998.
[7] B. Shahraray, "Scene Change Detection and Content-Based Sampling of Video Sequences," Proc. SPIE 2419: Digital Video Compression: Algorithms and Technologies, pp. 2-13, Feb. 1995.
[8] L. Garrido, F. Marques, M. Pardas, P. Salembier and V. Vilaplana, "A Hierarchical Technique for Image Sequence Analysis," in Proc. of WIAMIS, pp. 13-20, Louvain-la-Neuve, June 1997.
[9] A. D. Doulamis, N. D. Doulamis, K. S. Ntalianis and S. D. Kollias, "Efficient Unsupervised Content-Based Segmentation in Stereoscopic Video Sequences," Journal of Artificial Intelligence Tools, Vol. 9, No. 2, pp. 277-303, June 2000.
[10] K. S. Ntalianis, N. D. Doulamis, A. D. Doulamis and S. D. Kollias, "A Feature Point Based Scheme for Unsupervised Video Object Segmentation in Stereoscopic Video Sequences," in Proc. Int. Conf. on Multimedia and Expo (ICME), New York, Aug. 2000.
[11] A. D. Doulamis, N. D. Doulamis and S. D. Kollias, "A Fuzzy Video Content Representation for Video Summarization and Content-Based Retrieval," Signal Processing, Vol. 80, No. 6, pp. 1049-1067, June 2000.
[12] N. Doulamis, A. Doulamis, Y. Avrithis, K. Ntalianis and S. Kollias, "Efficient Summarization of Stereoscopic Video Sequences," IEEE Trans. CSVT, Vol. 10, No. 4, pp. 501-517, June 2000.

Figure 1: System overview. (Pipeline: Shot Cut Detection -> Stereoscopic Analysis and Depth Segmentation -> Unsupervised VOP Extraction -> Feature Vector Formulation -> Key Frames Extraction -> Key Frames Linking.)

Figure 2: (a) Original frame, (b) initial points and three MGSs for three neighboring points, and (c) extracted foreground video object.

Figure 3: Three key frames (#8122, #8128, #8150) extracted from a shot of the "Eye to Eye" sequence, containing the most uncorrelated foreground content.

Figure 4: The proposed structure for a video sequence. On the left, the intrashot structure (shots 1, 2, ..., N, each with its key frames 1, 2, ..., n); on the right, the intershot structure for a key frame: its correlation links (Link #1, #2, ..., #N) to key frames of other shots (L, H, R, U, ..., P denote arbitrary shots), sorted in the direction of decreasing correlation.

Figure 5: Intershot structure for key frame 2 of Figure 3. Only the first five minimum correlation frames are presented, containing the detected VOPs.
