RELEVANCE FEEDBACK FOR CONTENT-BASED RETRIEVAL IN VIDEO DATABASES: A NEURAL NETWORK APPROACH

Anastasios D. Doulamis, Nikolaos D. Doulamis and Stefanos D. Kollias

National Technical University of Athens, Department of Electrical and Computer Engineering
9, Heroon Polytechniou, 157 73 Zografou, Greece, e-mail: [email protected]

ABSTRACT

A neural network scheme for adaptive video indexing and retrieval is presented in this paper. First, a limited but characteristic set of frames is extracted from each video scene, providing an efficient representation of the video content. To this end, a cross-correlation criterion is minimized using a genetic algorithm. Low-level features, such as color and motion segments, are extracted to characterize each frame. After key frame extraction, video queries are performed directly on this small number of frames. To reduce the limitations of low-level features, however, the human is considered part of the process: the user assigns a degree of appropriateness to each image retrieved by the system and then restarts the search. A feedforward neural network structure is proposed as a parametric distance for the retrieval, mainly due to its highly nonlinear capabilities. An adaptation mechanism is also proposed for updating the network weights each time a new image selection is performed by the user.

1. INTRODUCTION

The progress in capturing and encoding digital images and video sequences has led to huge and growing archives of visual information. With the rapid development of multimedia applications, new tools and systems for efficient searching, indexing, content-based retrieval and management are required [3]. This is due to the fact that traditional database management methods do not work well for multimedia data, since they cannot efficiently describe the rich content of an image or a video [15]. Furthermore, video is traditionally represented as a sequence of consecutive frames, each corresponding to a constant time interval. The storage requirements of digitized video, even in the compressed domain, are very large and challenge most multimedia servers [6], while browsing or indexing of video archives would have to be performed sequentially, which is practically impossible due to the time complexity [7], [10], [16].

To increase the flexibility of managing such visual databases, content-based tools and algorithms have been proposed in the literature [3], [4], [12], [15]. The active research effort in this area is reflected in many conferences and special issues of leading journals dedicated to this topic [13], [14]. In addition, many content-based retrieval systems, both commercial and academic, have been developed recently [8]. Furthermore, in the case of video databases, several algorithms have been proposed for providing a new representation of the video content. In particular, the redundant video information is represented using a small set of still images (key frames) that are selected by a content-based sampling algorithm [2], [7]. However, extraction of semantic features from arbitrary images or image sequences is a very arduous task [6]. For this reason, content-based retrieval algorithms usually use low-level features to perform their queries [4], [15]. While in some cases the semantic information can be represented by such features, in other cases this may not be true. Moreover, there is subjectivity in human perception as far as the similarity of visual content is concerned [15].

To reduce the above-mentioned limitations, the human can be considered part of the retrieval process, in an interactive framework. This means that the degree of importance of each feature element is dynamically adjusted based on the user's assistance: the user selects the most appropriate images among those retrieved by the system. A parametric metric distance has been proposed in [1] to adapt the image retrieval, while in [15] the variation of the feature elements is examined. Instead, in this paper, we enhance the adaptation process by introducing an adaptively trained neural network classifier. In this case, the network weights define the degree of importance of each feature element, while the network output indicates how close the examined image is to the user's query.
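To make this interactive adaptation concrete, the following minimal sketch (Python/NumPy) shows how a small feedforward network can serve as a parametric similarity measure whose weights are updated from user relevance judgments. The single hidden layer, the absolute-difference input and the squared-error update rule are illustrative assumptions, not the exact architecture and training scheme developed in the paper.

```python
import numpy as np

class RelevanceNet:
    """Feedforward network acting as a parametric (dis)similarity measure.

    Input: element-wise absolute difference between a query feature vector
    and a candidate frame's feature vector.
    Output: scalar in (0, 1), interpreted as a degree of relevance.
    """

    def __init__(self, dim, hidden=16, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (hidden, dim))
        self.W2 = rng.normal(0.0, 0.1, hidden)
        self.lr = lr

    @staticmethod
    def _sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def score(self, f_query, f_cand):
        d = np.abs(f_query - f_cand)    # feature-wise distance as input
        h = self._sigmoid(self.W1 @ d)  # hidden activations
        return self._sigmoid(self.W2 @ h)

    def feedback(self, f_query, f_cand, relevant):
        """One gradient step on squared error against the user's label."""
        d = np.abs(f_query - f_cand)
        h = self._sigmoid(self.W1 @ d)
        y = self._sigmoid(self.W2 @ h)
        target = 1.0 if relevant else 0.0
        # Backpropagate the squared-error loss through both layers.
        g_out = (y - target) * y * (1.0 - y)
        g_hid = g_out * self.W2 * h * (1.0 - h)
        self.W2 -= self.lr * g_out * h
        self.W1 -= self.lr * np.outer(g_hid, d)
```

After each feedback round, the database would be re-ranked by `score(query, candidate)`, so that frames resembling positively marked results move up in the retrieved list.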

2. VIDEO REPRESENTATION

Since we concentrate on video databases or video archives, the first stage of the proposed adaptive architecture is the automatic extraction of a small number of characteristic frames, or key frames, able to represent all the necessary information of the video content. A block diagram of the proposed architecture is illustrated in Fig. 1.

Figure 1: Block diagram of the proposed architecture for video summarization (video source → shot detection → video sequence analysis → feature-based video representation → fuzzy feature vector formulation → key-frame extraction; the extracted key frames are stored in the video database).

The first stage of the proposed scheme is to segment the video sequence into shots. In our approach, the algorithm proposed in [16] has been adopted for shot detection due to its efficiency and small time complexity compared to other algorithms. This technique is based on the dc coefficients of the DCT transform of each frame, which are directly available in MPEG video data.
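As a rough illustration of dc-based shot detection, the sketch below flags a shot boundary wherever the dc images of consecutive frames differ sharply. The plain mean-absolute-difference measure and the threshold value are assumptions made for illustration; they are not the exact criterion of [16].

```python
import numpy as np

def detect_shot_changes(dc_frames, threshold=0.3):
    """Flag shot boundaries from the dc-coefficient images of an MPEG stream.

    dc_frames: sequence of 2-D arrays, each holding the DCT dc coefficients
    of one frame (one value per 8x8 block), available without full decoding.
    Returns the frame indices where a new shot is assumed to begin.
    """
    boundaries = []
    for k in range(1, len(dc_frames)):
        prev, curr = dc_frames[k - 1], dc_frames[k]
        # Normalized mean absolute difference between consecutive dc images.
        diff = np.mean(np.abs(curr - prev)) / (np.mean(np.abs(prev)) + 1e-9)
        if diff > threshold:
            boundaries.append(k)
    return boundaries
```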

2.1 Feature Extraction

For the extraction of key frames, a color and motion segmentation algorithm is applied to each video frame. The number, size, location and average color/motion components of all segments are then used to construct a color/motion feature vector. The Recursive Shortest Spanning Tree (RSST) algorithm is the basis of both our color and motion segmentation, since it is considered one of the most powerful tools for image segmentation compared to other techniques, such as color clustering or the morphological watershed [11]. The execution time of the RSST, however, depends heavily on the choice of the sorting algorithm. For this reason, a new approach is proposed, which recursively applies the RSST algorithm to images of increasing resolution. Initially, a multiresolution decomposition of image $I$ is performed down to a lowest resolution level $L_0$, so that a hierarchy of frames $I^{(0)}=I, I^{(1)},\dots,I^{(L_0)}$ is constructed, forming a truncated image pyramid with each layer having a quarter of the pixels of the layer below. The RSST initialization takes place at the lowest resolution image $I^{(L_0)}$, and then an iteration begins, involving the following steps: (i) regions are recursively merged using the RSST iteration phase; (ii) each boundary pixel of all resulting regions is split into four new regions using the image of the next higher resolution level; (iii) new link weights are calculated and sorted. This "split-merge" procedure is repeated until image $I^{(0)}$ is reached (a simplified sketch is given below). The results of the proposed color segmentation algorithm are depicted in Fig. 2 for a target number of segments equal to 5 and an initial resolution level $L_0=3$. Figures 2(b,c) illustrate the segmentation at different levels, while Fig. 2(d) shows the final one.
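The following sketch illustrates the multiresolution split-merge procedure. For clarity it assumes image dimensions divisible by $2^{L_0}$, and it replaces the proper RSST iteration phase (link-weight sorting and recursive merging) with a simplified stand-in that repeatedly merges the pair of adjacent regions with the closest mean colors; it is not the paper's optimized implementation.

```python
import numpy as np

def downsample(img):
    """2x2 block averaging: each pyramid layer has a quarter of the pixels."""
    h, w, c = img.shape
    return img.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def region_means(img, labels):
    """Mean color of every region in the label map."""
    return {lab: img[labels == lab].mean(axis=0) for lab in np.unique(labels)}

def adjacent_pairs(labels):
    """All pairs of distinct, 4-adjacent region labels."""
    pairs = set()
    for a, b in ((labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :])):
        mask = a != b
        for x, y in zip(a[mask].tolist(), b[mask].tolist()):
            pairs.add((min(x, y), max(x, y)))
    return pairs

def merge_regions(img, labels, target):
    """Stand-in for the RSST iteration phase: repeatedly merge the
    4-adjacent region pair with the closest mean colors (unoptimized)."""
    labels = labels.copy()
    while len(np.unique(labels)) > target:
        means = region_means(img, labels)
        x, y = min(adjacent_pairs(labels),
                   key=lambda p: float(np.linalg.norm(means[p[0]] - means[p[1]])))
        labels[labels == y] = x
    return labels

def multiresolution_rsst(image, target=5, L0=3):
    """Recursive split-merge segmentation on a truncated image pyramid.

    image: H x W x C float array with H, W divisible by 2**L0.
    """
    pyramid = [image.astype(float)]
    for _ in range(L0):
        pyramid.append(downsample(pyramid[-1]))
    # Initialization at the coarsest level: one region per pixel, then merge.
    h, w, _ = pyramid[-1].shape
    labels = merge_regions(pyramid[-1], np.arange(h * w).reshape(h, w), target)
    for level in range(L0 - 1, -1, -1):
        labels = np.kron(labels, np.ones((2, 2), dtype=int))  # project to finer grid
        # Split every region-boundary pixel into a new singleton region ...
        boundary = np.zeros(labels.shape, dtype=bool)
        boundary[:, 1:] |= labels[:, 1:] != labels[:, :-1]
        boundary[1:, :] |= labels[1:, :] != labels[:-1, :]
        labels[boundary] = labels.max() + 1 + np.arange(int(boundary.sum()))
        # ... then re-merge using the finer image (new link weights).
        labels = merge_regions(pyramid[level], labels, target)
    return labels
```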

Figure 2: Color segmentation. (a) The initial image. (b) Segmentation at level 3. (c) Segmentation at level 2. (d) Final segmentation.

The same algorithm is applied for motion segmentation, using the motion vectors of the MPEG video data. Figure 3 illustrates the motion segmentation results for a frame extracted from a TV news program.

Figure 3: Motion segmentation results.

3. FUZZY FEATURE VECTOR FORMULATION

All features extracted by the video sequence analysis module (i.e., the size, location, and color or motion of each segment) can be used to describe the visual content of each video frame. However, they cannot be directly collected into a single vector for this purpose, since their number differs from frame to frame. To overcome this problem, we classify the color as well as the motion segments into pre-determined classes, forming a multidimensional histogram. Furthermore, in order to eliminate the possibility of classifying two similar segments into different classes, which would cause erroneous comparisons, a degree of membership is allocated to each class, resulting in a fuzzy classification formulation.

In particular, in our case, where color, motion and location are taken into account, a multidimensional feature vector is formed for each segment. Let us denote by $S_i^c$, $i=1,\dots,K$, the color segments, each described by an $L_c \times 1$ vector $\mathbf{s}_i^c$, and by $S_i^m$ the motion segments, each described by an $L_m \times 1$ vector $\mathbf{s}_i^m$. For the sake of notational simplicity, the superscripts $c$ and $m$ will be omitted in the sequel. The domain of each element $s_j^{(i)}$, $j=1,2,\dots,L$, of the vectors $\mathbf{s}_i$, $i=1,2,\dots,K$, is partitioned into $Q$ regions by means of $Q$ membership functions $\mu_{n_j}(s_j^{(i)})$, $n_j=0,1,\dots,Q-1$. Gathering the class indices $n_j$ for all elements $j=1,2,\dots,L$, an $L$-dimensional class vector $\mathbf{n} = [n_1,\dots,n_L]^T$ is defined. The degree of membership of each vector $\mathbf{s}_i$ to class $\mathbf{n}$ can then be computed as the product of the membership functions $\mu_{n_j}(s_j^{(i)})$ of all its individual elements with respect to the corresponding indices $n_j$.

It is now possible to construct a multidimensional fuzzy histogram, say $H(\mathbf{n})$, from the segment feature samples $\mathbf{s}_i$, $i=1,\dots,K$, as follows:

$$H(\mathbf{n}) = \frac{1}{K}\sum_{i=1}^{K} \mu_{\mathbf{n}}(\mathbf{s}_i) = \frac{1}{K}\sum_{i=1}^{K} \prod_{j=1}^{L} \mu_{n_j}\big(s_j^{(i)}\big) \tag{1}$$

Thus, $H(\mathbf{n})$ can be viewed as a degree of membership of a whole frame to class $\mathbf{n}$. In fact, since the above analysis applies to the features $\mathbf{s}_i^c$ of the color segments $S_i^c$ and the features $\mathbf{s}_i^m$ of the motion segments $S_i^m$ alike, two feature vectors are calculated: a color feature vector $\mathbf{f}^c$ for the color segments and a motion feature vector $\mathbf{f}^m$ for the motion segments. Finally, based on the color and motion feature vectors, the feature vector of length $Q^{L_c} + Q^{L_m}$ corresponding to the whole frame is formed as:

$$\mathbf{f} = \big[(\mathbf{f}^c)^T \; (\mathbf{f}^m)^T\big]^T \tag{2}$$
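A minimal sketch of the fuzzy histogram of Eq. (1) follows. The triangular membership functions, forming a partition of unity on $[0,1]$, are an assumption for illustration (the paper only requires $Q$ membership functions per feature element), and the dense rank-$L$ tensor is practical only for small $L$.

```python
import numpy as np
from functools import reduce

def triangular_memberships(x, Q):
    """Membership degrees of a scalar x in [0, 1] to Q overlapping fuzzy
    partitions (triangular functions centered on a uniform grid, Q >= 2)."""
    centers = np.linspace(0.0, 1.0, Q)
    width = 1.0 / (Q - 1)
    return np.clip(1.0 - np.abs(x - centers) / width, 0.0, 1.0)

def fuzzy_histogram(segments, Q):
    """Eq. (1): H(n) = (1/K) * sum_i prod_j mu_{n_j}(s_j^(i)).

    segments: K x L array, one row per segment, elements scaled to [0, 1].
    Returns a tensor of shape (Q,) * L, indexed by the class vector n.
    """
    K, L = segments.shape
    H = np.zeros((Q,) * L)
    for s in segments:
        mu = [triangular_memberships(s[j], Q) for j in range(L)]
        H += reduce(np.multiply.outer, mu)  # all products mu_{n_1} ... mu_{n_L}
    return H / K
```

The frame-level vector of Eq. (2) is then obtained by flattening and concatenating the two histograms, e.g. `np.concatenate([Hc.ravel(), Hm.ravel()])`, giving the stated length of `Q**Lc + Q**Lm`.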

4. EXTRACTION OF KEY FRAMES

Once a feature-based representation of each frame is available, key frames can be selected to provide a representation of the whole video sequence. An optimal solution based on the minimization of a cross-correlation criterion, which ensures that the selected frames are not similar to one another, has been adopted in this paper for key frame extraction. Let us recall that $\mathbf{f}(k)$ is the feature vector of the $k$-th frame of the shot under examination, with $k \in V = \{0,1,\dots,N_s-1\}$, where $N_s$ is the total number of frames in the shot. Let us also denote by $K_s$ the number of key frames to be selected; this number is either a priori known or can be estimated as described in [2]. The correlation coefficient of two feature vectors $\mathbf{f}(k)$, $\mathbf{f}(l)$ is then used as a comparison measure. First, an index vector is defined:



$$\mathbf{x} = (x_1,\dots,x_{K_s}) \in W \subset V^{K_s}$$

where $W = \{(x_1,\dots,x_{K_s}) \in V^{K_s} : x_1 < x_2 < \cdots < x_{K_s}\}$.
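To illustrate the criterion, the sketch below selects the $K_s$ frames of a shot whose feature vectors are least cross-correlated. Exhaustive enumeration of the index set $W$ is used here for clarity only, and the sum of absolute pairwise correlation coefficients is an assumed form of the objective; the paper performs this minimization with a genetic algorithm, which scales to realistic shot lengths.

```python
import numpy as np
from itertools import combinations

def correlation(f_k, f_l):
    """Correlation coefficient of two frame feature vectors."""
    a, b = f_k - f_k.mean(), f_l - f_l.mean()
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def select_key_frames(features, Ks):
    """Pick the Ks frames of a shot whose feature vectors are least
    cross-correlated, i.e. minimize the sum of pairwise |rho(f_i, f_j)|.

    features: list of Ns feature vectors f(0), ..., f(Ns-1) for one shot.
    Returns the ordered index tuple (x_1 < x_2 < ... < x_Ks).
    """
    Ns = len(features)
    best_x, best_cost = None, np.inf
    for x in combinations(range(Ns), Ks):  # enumerates the index set W
        cost = sum(abs(correlation(features[i], features[j]))
                   for i, j in combinations(x, 2))
        if cost < best_cost:
            best_x, best_cost = x, cost
    return best_x
```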