EFFICIENT VIDEO SUMMARIZATION BASED ON A FUZZY VIDEO CONTENT REPRESENTATION

Anastasios D. Doulamis, Nikolaos D. Doulamis and Stefanos D. Kollias
National Technical University of Athens, E-mail: [email protected]

ABSTRACT

In this paper, a fuzzy representation of visual content is proposed, which is useful for video summarization. In particular, a multidimensional fuzzy histogram is constructed for each video frame based on a collection of appropriate features, extracted using video sequence analysis techniques. Key frames are then selected optimally by minimizing a cross-correlation criterion. Experimental results and comparisons with other known methods are presented to indicate the good performance of the proposed scheme on real-life video recordings.
1. INTRODUCTION

The increasing amount of digital image and video data has stimulated new technologies for efficient searching, indexing, content-based retrieval and management of multimedia databases. The traditional approach of keyword annotation for accessing image or video information has the drawback that it cannot efficiently characterize the rich visual content using only text. For this reason, content-based retrieval algorithms have attracted great research interest. Examples of such retrieval systems include the QBIC [1], Virage [2] and VisualSeek [3] prototypes. In this framework, MPEG is currently defining the new MPEG-7 standard.

The aforementioned systems are mainly restricted to still images and cannot easily be applied to video databases [4]. This is due to the fact that the standard representation of video as a sequence of consecutive frames results in significant temporal redundancy of the visual content. Thus, it is very inefficient and time consuming to perform queries on every video frame. Such a linear representation of video sequences is also not adequate for new multimedia applications, such as video browsing, content-based indexing and retrieval. For this reason, apart from proposing algorithms for effective network design through modeling of video sources [5], new methods for efficient video content representation should also be implemented [6]. The objective is to divide a video sequence into separate representative shots, and then to extract the most characteristic frames (key frames) within the selected shots by means of a content-based sampling algorithm [6].

Two factors mainly affect the performance of a video summarization scheme: the visual content representation and the method used for key-frame extraction. As far as the first issue is concerned, the traditional pixel-based representation suffers from the lack of a semantic meaning of the visual content. For this reason, several features are usually extracted from a video frame, resulting in a feature-based representation. For the second issue, key-frame selection should be performed in such a way that the most characteristic frames are extracted.

In the context of this paper, a fuzzy representation of visual content is proposed, which improves the performance of video summarization algorithms and content-based retrieval systems, since it provides an interpretation closer to human perception.
In contrast, the current approaches [1]-[3] are based on a "binary" classification; it is therefore possible for two similar features, located near the class boundaries, to be assigned to different classes, causing an erroneous representation. Furthermore, such approaches are sensitive to noise resulting from segmentation instabilities and erroneous estimation of feature values. As far as the second issue is concerned, an optimal method is proposed, which minimizes a cross-correlation criterion. In contrast, current techniques use ad hoc methods for key-frame selection, either by estimating the accumulated differences of the DC components of the DCT-transformed video frames [7], or by extracting frames at regular time instances [8]. However, the first approach highly depends on the selection of the threshold value, while the second exploits neither shot information nor frame similarity. Thus, important shots of small duration may have no representatives, while shots of longer duration may be represented by multiple frames with similar content.
2. CONTENT REPRESENTATION

A color/motion segmentation algorithm is applied in this paper for visual content description, using a multiresolution implementation of the Recursive Shortest Spanning Tree (RSST), called M-RSST. The use of the RSST is based on the fact that it is considered one of the most powerful tools for image segmentation, compared to other techniques such as pyramidal region growing or the morphological watershed. It has been shown by the COST211ter simulation subgroup, through several experiments, that the RSST presents the best performance and the lowest computational complexity compared to the other methods [9]. However, the complexity of the RSST still remains very high, especially for images of large size. Instead, the proposed M-RSST approach yields a much faster execution time, while keeping similar performance.
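For intuition, the following toy Python sketch (our illustration, not the authors' implementation) shows the core merging step that RSST-type methods rely on: starting from single-pixel regions, the pair of adjacent regions joined by the smallest link weight is recursively merged and the affected weights are updated. The multiresolution pyramid that gives M-RSST its speed, as well as the size-weighted link costs of the full RSST [9], are deliberately omitted here.

```python
import heapq
import numpy as np

def rsst_like_segmentation(image, n_segments):
    """Toy RSST-style region merging on a grayscale image.

    Repeatedly merges the two 4-connected regions joined by the smallest
    intensity difference until n_segments regions remain.
    """
    h, w = image.shape
    parent = list(range(h * w))          # union-find forest over regions

    def find(r):
        while parent[r] != r:
            parent[r] = parent[parent[r]]  # path halving
            r = parent[r]
        return r

    mean = [float(v) for v in image.ravel()]   # region mean intensities
    size = [1] * (h * w)                       # region sizes in pixels

    heap = []                                  # candidate links (weight, a, b)
    for y in range(h):
        for x in range(w):
            i = y * w + x
            if x + 1 < w:
                heapq.heappush(heap, (abs(mean[i] - mean[i + 1]), i, i + 1))
            if y + 1 < h:
                heapq.heappush(heap, (abs(mean[i] - mean[i + w]), i, i + w))

    n = h * w
    while heap and n > n_segments:
        d, a, b = heapq.heappop(heap)
        ra, rb = find(a), find(b)
        if ra == rb:                           # already merged
            continue
        cur = abs(mean[ra] - mean[rb])
        if cur > d + 1e-9:                     # stale weight: re-insert updated link
            heapq.heappush(heap, (cur, ra, rb))
            continue
        parent[rb] = ra                        # merge rb into ra, update the mean
        mean[ra] = (mean[ra] * size[ra] + mean[rb] * size[rb]) / (size[ra] + size[rb])
        size[ra] += size[rb]
        n -= 1

    return np.array([find(i) for i in range(h * w)]).reshape(h, w)
```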
2.1 Fuzzy Formulation

The size, location and average color (motion) components of all color (motion) segments are used as color (motion) properties. Since the number of segments is not constant from frame to frame, these properties cannot be directly included in a feature vector, because the size of such a vector would vary and direct comparison would be practically impossible. To overcome this problem, we classify the color/motion properties into predetermined classes and then assign to each class a degree of membership, resulting in a fuzzy classification formulation. Let us assume that $K_c$ color and $K_m$ motion segments have been extracted. Then, for each color segment $S_i^c$, $i = 1, \dots, K_c$, an $L_c \times 1$ vector $\mathbf{s}_i^c$ is formed, while for each motion segment $S_i^m$, $i = 1, \dots, K_m$, an $L_m \times 1$ vector $\mathbf{s}_i^m$ is formed as follows:

$\mathbf{s}_i^c = [\mathbf{c}^T(S_i^c) \;\; \mathbf{l}^T(S_i^c) \;\; a(S_i^c)]^T$  (1a)

$\mathbf{s}_i^m = [\mathbf{v}^T(S_i^m) \;\; \mathbf{l}^T(S_i^m) \;\; a(S_i^m)]^T$  (1b)

where $a$ denotes the size of the color or motion segment and $\mathbf{l}$ is a $2 \times 1$ vector indicating the horizontal and vertical location of the segment center; the $3 \times 1$ vector $\mathbf{c}$ contains the average values of the three color components of the respective color segment, while the $2 \times 1$ vector $\mathbf{v}$ contains the average motion vector of the motion segment. Thus, $L_c = 6$ for color segments and $L_m = 5$ for motion segments. For notational simplicity, the superscripts $c$ and $m$ are omitted in the sequel; each color or motion segment is denoted as $S_i$ and is described by the $L \times 1$ vector $\mathbf{s}_i = [s_{i,1} \; s_{i,2} \; \dots \; s_{i,L}]^T$, where $L = 5$ or $L = 6$ depending on the segment type, containing all properties extracted from the $i$-th segment $S_i$; for example, $s_{i,1}$ corresponds to the average value of the first color component of segment $S_i$. The range of each element $s_{i,j}$, $j = 1, \dots, L$, of vector $\mathbf{s}_i$ is then partitioned into $Q$ regions by means of $Q$ membership functions $\mu_{n_j}(s_{i,j})$, $n_j = 1, \dots, Q$, where $\mu_{n_j}(s_{i,j})$ denotes the degree of membership of $s_{i,j}$ to the $n_j$-th class.
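As an illustration of equation (1a), the following minimal Python sketch (our own helper, not part of the paper) builds the color property vector $\mathbf{s}_i^c$ from a binary segment mask; normalizing the location and size to [0, 1] is our assumption, made so that all properties share a comparable range:

```python
import numpy as np

def color_segment_vector(frame, mask):
    """Property vector s_i^c of eq. (1a) for one color segment.

    frame: (H, W, 3) array of color components.
    mask:  (H, W) boolean array marking the segment's pixels.
    """
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    c = frame[ys, xs].mean(axis=0)                # average color (3 values)
    l = np.array([xs.mean() / w, ys.mean() / h])  # segment center, normalized
    a = mask.sum() / (h * w)                      # segment size, normalized
    return np.concatenate([c, l, [a]])            # L_c = 6 elements
```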
Then, the product of $\mu_{n_j}(s_{i,j})$ over all elements $s_{i,j}$ of $\mathbf{s}_i$ defines the degree of membership of vector $\mathbf{s}_i$ to the $L$-dimensional class $\mathbf{n} = [n_1 \; n_2 \; \dots \; n_L]^T$, the elements of which express the classes to which the respective elements of $\mathbf{s}_i$ belong:

$\mu_{\mathbf{n}}(\mathbf{s}_i) = \prod_{j=1}^{L} \mu_{n_j}(s_{i,j})$  (2)

Gathering all segments of a frame, a multidimensional fuzzy histogram is created,

$H(\mathbf{n}) = \frac{1}{K} \sum_{i=1}^{K} \mu_{\mathbf{n}}(\mathbf{s}_i) = \frac{1}{K} \sum_{i=1}^{K} \prod_{j=1}^{L} \mu_{n_j}(s_{i,j})$  (3)

where $K$ is the number of segments of the respective type. $H(\mathbf{n})$ can thus be viewed as the degree of membership of a whole frame to class $\mathbf{n}$. A frame feature vector $\mathbf{f}$ is then formed by gathering the values of $H(\mathbf{n})$ for all classes $\mathbf{n}$, i.e., for all $Q^L$ combinations of indices, resulting in a vector of $Q^L$ elements.
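A minimal sketch of equations (2) and (3), assuming triangular membership functions over properties normalized to [0, 1] (the paper evaluates several function types and partition numbers; see below):

```python
import numpy as np
from itertools import product

def triangular_memberships(x, Q=3):
    """Degrees of membership of a scalar x in [0, 1] to Q triangular classes (Q >= 2)."""
    centers = np.linspace(0.0, 1.0, Q)
    width = 1.0 / (Q - 1)
    return np.maximum(0.0, 1.0 - np.abs(x - centers) / width)   # shape (Q,)

def fuzzy_histogram(segments, Q=3):
    """Multidimensional fuzzy histogram of eq. (3).

    segments: (K, L) array, one property vector s_i per row, values in [0, 1].
    Returns the frame feature vector f with Q**L elements.
    """
    K, L = segments.shape
    H = np.zeros((Q,) * L)
    for s in segments:
        mu = [triangular_memberships(v, Q) for v in s]       # per-element memberships
        for n in product(range(Q), repeat=L):                # all Q**L classes
            H[n] += np.prod([mu[j][n[j]] for j in range(L)]) # eq. (2)
    return (H / K).ravel()                                   # frame feature vector f
```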
The type of the membership functions and the number of partitions $Q$ are selected experimentally, by evaluating the content-based retrieval performance achieved with the resulting feature vectors on an image database. Particularly, for each submitted image query, the Euclidean distance between the fuzzy feature vector $\mathbf{f}_q$ of the query image and each feature vector $\mathbf{f}_i$ in the database is calculated and then normalized to the interval [0, 1]. Normalized distances close to zero (one) indicate that the respective feature vector is close to (far from) the query one. Then, for an image query, a similarity degree $t_i$ is assigned to each image of the database, which indicates how similar the content of the $i$-th image is to the query. Three similarity degrees are used: a degree of zero means that the image is quite similar to the user's query, a degree of one is assigned to irrelevant images, and a degree of 0.5 to somewhat relevant images. The absolute difference between the normalized distance and the similarity degree, averaged over the $M$ best retrieved images, is used to evaluate the system performance for the user's query:

$E = \frac{1}{|S_M|} \sum_{i \in S_M} \left| d_{nrm}(\mathbf{f}_q, \mathbf{f}_i) - t_i \right|$  (4)

where $S_M$ is the set containing the $M$ best retrieved images for a given user's query and $|S_M|$ its cardinality; in our case, $M = 10$. Figure 1 illustrates the average error $E$ over all 15 examined image queries versus the number of partitions $Q$ for different types of membership functions; the results obtained using binary classification are also depicted. It is observed that a partition number of $Q = 3$ yields the best performance for the membership functions. We also observe that the triangular functions give better results than the other examined functions for most partition numbers.
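Equation (4) can be computed as in the sketch below; min-max normalization of the distances is our assumption, since the paper only states that the distances are normalized to [0, 1]:

```python
import numpy as np

def retrieval_error(f_query, db_features, similarity_degrees, M=10):
    """Average absolute difference between normalized distance and the
    user-assigned similarity degree over the M best retrieved images, eq. (4).

    f_query:            (D,) query feature vector f_q.
    db_features:        (N, D) database feature vectors f_i.
    similarity_degrees: (N,) degrees t_i in {0.0, 0.5, 1.0}, 0 = very similar.
    """
    d = np.linalg.norm(db_features - f_query, axis=1)
    d_nrm = (d - d.min()) / (d.max() - d.min())   # normalize to [0, 1] (assumed min-max)
    best = np.argsort(d_nrm)[:M]                  # the set S_M of the M closest images
    return np.mean(np.abs(d_nrm[best] - similarity_degrees[best]))
```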
3. VIDEO SUMMARIZATION
Key frames are extracted by minimizing a cross-correlation criterion, so that the selected frames are not similar to one another. Let us denote by $\mathbf{f}_k$ the feature vector of the $k$-th frame of a shot, with $k \in V = \{1, 2, \dots, N_F\}$, where $N_F$ is the total number of frames of the given shot. Let us also denote by $K_F$ the number of key frames that should be selected. In order to define a measure of correlation among $K_F$ feature vectors, an index vector is first defined,

$\mathbf{x} = (x_1, \dots, x_{K_F}) \in W \subset V^{K_F}$

where $W = \{(x_1, \dots, x_{K_F}) \in V^{K_F} : x_1 < x_2 < \dots < x_{K_F}\}$ contains all index vectors in ascending order, so that each combination of $K_F$ frames is considered only once.
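The concrete cross-correlation measure minimized over $W$ falls outside this excerpt; as a hedged illustration only, the sketch below selects the $K_F$ frames whose (zero-mean, unit-norm) feature vectors have the smallest average pairwise correlation, by exhaustive search over the sorted index vectors of $W$:

```python
import numpy as np
from itertools import combinations

def select_key_frames(features, K_F):
    """Choose K_F frame indices with minimal average pairwise correlation.

    features: (N_F, D) array, one fuzzy frame feature vector f_k per row.
    Exhaustive search over the sorted index vectors of W; assumes K_F >= 2.
    """
    N_F = features.shape[0]
    f = features - features.mean(axis=1, keepdims=True)         # zero-mean rows
    f = f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-12)  # unit-norm rows
    R = f @ f.T                                                 # pairwise correlations
    best_x, best_cost = None, np.inf
    for x in combinations(range(N_F), K_F):                     # x1 < x2 < ... < x_KF
        cost = np.mean([R[i][j] for i, j in combinations(x, 2)])
        if cost < best_cost:
            best_x, best_cost = x, cost
    return best_x
```

Note that the exhaustive search grows combinatorially with $N_F$ and $K_F$; for realistic shot lengths it must be replaced by a more efficient optimization scheme.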