Automatic Video Annotation with Adaptive Number of Key Words

Fangshi Wang, Wei Lu
School of Software, Beijing Jiaotong University
{fshwang,wlu}@bjtu.edu.cn

Jingen Liu, Mubarak Shah
Computer Vision Lab, University of Central Florida
{liujg,shah}@cs.ucf.edu

De Xu
School of Computer, Beijing Jiaotong University
[email protected]

Abstract

Retrieving videos using key words requires obtaining the semantic features of the videos. Most work reported in the literature annotates a video shot with a fixed number of key words, no matter how much information the shot contains. In this paper, we propose a new approach that automatically annotates a video shot with an adaptive number of key words according to the richness of the video content. A Semantic Candidate Set (SCS) of fixed size is first discovered using visual features. The final annotation set, which has a variable number of key words, is then obtained from the SCS by Bayesian inference, which combines static and dynamic inference to remove the irrelevant candidate key words. We have applied our approach to video retrieval. The experiments demonstrate that video retrieval using our annotation approach outperforms retrieval using a fixed number of annotation words.

1. Introduction

Automatic video annotation can be considered as a classification of videos. Nowadays, one of the main techniques for video classification is the statistical approach, which uses statistical learning to bridge the semantic gap between concepts (key words) and visual features. Its performance depends on the success of the underlying classifiers and the discriminative ability of the multimodal perceptual features. It is important to notice that semantic networks exist among the concepts, which means that semantic concepts usually do not occur in isolation. For instance, the concept 'boat' has a strong correlation (co-occurrence) with the concept 'water'. Knowing this contextual relationship between concepts can help automatic video annotation. Some discriminative and graphical models have been successfully applied to learn the concept relationships [1,2,3,4]. Yan et al. [2] presented a performance



comparison of graphical models for discovering concept correlations. Jiang et al. [3] modeled the inter-conceptual relationships using a Conditional Random Field (CRF). Naphade [4] proposed the MultiNet to represent the high-level probabilistic dependencies between concepts. In these frameworks, both the concepts and the structure are either decided by experts or specified by users, and the structure grows very large as the number of concepts increases. MediaNet [5] can automatically select the salient concepts from the training images and discover the relationships between concepts using external knowledge from WordNet. However, the relationships between concepts in MediaNet are complicated, because it considers both perceptual and semantic relationships. For video annotation, we are more concerned with the co-occurrence relationship between concepts. Compared to these complicated frameworks, some simpler methods have been proposed to automatically annotate videos with multiple concepts [6] and obtain good performance. However, these systems return a fixed number of key words, no matter how rich the video content is. The number of key words is pre-determined by the user, which may affect the annotation performance. For example, the video shown in Fig. 1(c) contains two main concepts, 'waterfall' and 'mountain'. If the system fixes the number of annotation words at five, the other three key words might introduce confusion into a ranked video retrieval system. On the other hand, if a video shot contains more than five concepts, we may lose useful information because of the limit of five. Both cases degrade the performance of the video retrieval system. In this paper, we propose a new method to annotate videos with an adaptive number of key words. A video shot is represented by the visual features of its key frames. We first learn the relationships between the key words using a Bayesian Network (BN). We then extract a Semantic Candidate Set (SCS) of N concepts from each video shot. This SCS is

further fed into the trained BN. We apply dynamic and static inference to remove the irrelevant key words, so that the final annotation set contains a reasonable number of key words that fully describe the content of the video. We applied this annotation to our video retrieval system and obtained promising performance compared to using a fixed number of annotation key words. This paper is organized as follows. Section 2 describes SCS extraction. Section 3 gives the details of obtaining an adaptive number of key words. Sections 4 and 5 present the ranked video retrieval and the experimental results.

2. Extracting the Semantic Candidate Set

The SCS is a set of N candidate concepts learned from the visual features. We assume N is large enough to cover most of the concepts possibly present in the video; the set is then refined by removing the irrelevant key words using Bayesian inference. To represent a video shot, we follow [9] to adaptively extract the key frames (KFs) and compute their visual features. Each video in the training dataset generally contains one to five concepts. This dataset is used to build the BN and also to generate the SCS for each test video. Suppose there are n concepts in the dataset. A straightforward but effective method to generate the SCS consists of the following three steps.

(a) Concept Representation. A concept Ci (i=1,…,n) is represented by a Gaussian distribution over the visual features with mean μi and covariance Σi. The visual features used in this paper are

the normalized HSV color histogram and the edge histogram; the dimension of μi is 152.

(b) Concept Assignment. Given a new video shot x, its distance from concept Ci, denoted D[i], is

D[i] = \frac{1}{\sqrt{2\pi\,|\Sigma_i|}} \exp\!\left(-\frac{\|x-\mu_i\|_{\Sigma_i}^{2}}{2}\right)    (1)

The value D[i] can also be regarded as the confidence of assigning concept Ci to shot x.

(c) Generating the SCS. The values in D are sorted in descending order, and the top N concepts are picked as the SCS of the new video, denoted {C1,…,CN}. C1 is the most confident concept detected in the new video.
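As an illustration, a minimal Python sketch of this SCS extraction might look as follows. The per-concept statistics (mu, cov) and the 152-dimensional feature vector x are assumed to come from the training stage; all function and variable names here are hypothetical rather than taken from the original system.

```python
import numpy as np

def concept_confidence(x, mu, cov):
    """Confidence D[i] of Eq. (1): Gaussian score of shot features x under one concept."""
    diff = x - mu
    # Mahalanobis-style squared distance under the concept's covariance
    dist2 = diff @ np.linalg.solve(cov, diff)
    # In practice one would work in log space to avoid underflow for 152-dim features.
    norm = 1.0 / np.sqrt(2.0 * np.pi * np.linalg.det(cov))
    return norm * np.exp(-0.5 * dist2)

def extract_scs(x, concept_models, n_top=5):
    """Rank all concepts by confidence and keep the top N as the SCS."""
    scores = {c: concept_confidence(x, m["mu"], m["cov"]) for c, m in concept_models.items()}
    scs = sorted(scores, key=scores.get, reverse=True)[:n_top]
    return scs, scores
```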

3. Inferring the Final Annotation Set

With the SCS and the most confident concept C1 at hand, we need to determine which of the remaining concepts in the SCS are also present. Bayesian inference is used to provide the conditional probabilities of a new concept given the known concepts. A Bayesian Network (BN) is adopted to train the Semantic Concept Network, which captures the correlations between the semantic concepts. In this network, each node corresponds to a semantic concept and an edge between two nodes represents a dependency relationship between them. We use the three-phase dependency analysis algorithm proposed by Cheng [7] to train the semantic concept network. The directed graph of the BN is transformed into a join tree, denoted JT, which is a practical data structure for inference [8]. Our inference procedure consists of two major steps, dynamic and static inference, and proceeds as follows.

PROCEDURE Get_FAS ( JT )
    NE = {C1};  CE = ∅;  SCS = {C1,…,CN}
    WHILE NE is not empty DO                    // dynamic inference
        Input NE into JT and update the potentials in JT;
        Perform global propagation to make the potentials of JT locally consistent;
        CE = CE ∪ NE;  NE = ∅;
        FOR each concept Ci in SCS DO           // static inference
            IF Ci ∉ CE and P(Ci=1 | CE) > σ THEN
                NE = NE ∪ {Ci};  SP[f][Ci] = P(Ci=1 | CE);
            END IF
        END FOR
    END WHILE
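A minimal Python rendering of Get_FAS is sketched below. The join-tree propagation is abstracted into a helper cond_prob(concept, evidence) that returns P(Ci=1 | CE) after the evidence has been absorbed and globally propagated; that helper and all names are illustrative assumptions, not part of the original implementation.

```python
def get_fas(scs, cond_prob, sigma):
    """Adaptive annotation: alternate dynamic and static inference until no new evidence appears.

    scs       -- ordered semantic candidate set [C1, ..., CN], C1 most confident
    cond_prob -- callable(concept, evidence) -> P(concept = 1 | evidence), backed by the join tree
    sigma     -- threshold from Eq. (2)
    """
    ne = {scs[0]}      # new evidence, seeded with the most confident concept C1
    ce = set()         # current evidence: the annotation accumulated so far
    sp = {}            # SP[f][Ci]: probability of labeling the shot with Ci

    while ne:
        # dynamic inference: absorb NE into the evidence (propagation hidden inside cond_prob)
        ce |= ne
        ne = set()
        # static inference: test the remaining candidates against the updated evidence
        for c in scs:
            if c not in ce:
                p = cond_prob(c, frozenset(ce))
                if p > sigma:
                    ne.add(c)
                    sp[c] = p
    return ce, sp      # CE is taken as the final annotation set (FAS)
```

The loop stops as soon as an iteration adds no concept whose conditional probability exceeds σ, which is what makes the annotation length adaptive.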

CE (the current evidence set) contains the key words already assigned to the video shot, and NE is the new evidence set that is going to be added to CE. The inference starts by adding the most confident concept C1 of the SCS to NE. Because C1 is to be added to CE, we set P(C1)=1 in JT and update its potentials. This local potential is then propagated through the entire JT so that all potentials become locally consistent. We repeat this for every concept in NE, after which CE is updated by adding NE to it. Next comes the static inference step. It checks, for each key word in the SCS but not in CE, the probability conditioned on CE. If the conditional probability of Ci is larger than the threshold σ, Ci becomes new evidence in NE; simultaneously, CE is updated by adding Ci, and key word Ci is assigned to the shot. SP[f][Ci] is the probability of labeling the video shot f with concept Ci. Unlike [8], where the inference is stopped after a pre-defined number of iterations, our procedure ends automatically when no new evidence appears: following the static inference, NE holds the concepts whose conditional probabilities exceed σ, and this new evidence drives the dynamic inference in the next iteration. When NE is empty, the procedure ends and CE is taken as the FAS of the video shot. The threshold σ is determined as follows:

\sigma = \frac{1}{m_1} \sum_{k=1}^{m_1} \sum_{i=2}^{m_2} P(S_i^{k} \mid S_1^{k}, \ldots, S_{i-1}^{k})    (2)

where m1 is the number of samples in the training set, m2 is the number of actual concepts in the k-th sample, S_i^k is the i-th actual concept in the k-th sample, and P(S_i^k | S_1^k, …, S_{i-1}^k) is the conditional probability of S_i^k given S_1^k, …, S_{i-1}^k. This mean value σ is the average conditional probability of all concepts over the samples in the training set. The threshold can therefore be calculated automatically and adaptively for different data sets, which avoids setting it manually.
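Under the same assumptions as above (a cond_prob helper backed by the trained BN, and training annotations given as ordered concept lists), σ could be estimated roughly like this; the names are ours, not from the original system.

```python
def estimate_sigma(training_samples, cond_prob):
    """Average conditional probability of each concept given its predecessors (Eq. 2)."""
    total = 0.0
    for concepts in training_samples:        # one ordered concept list [S1, ..., Sm2] per sample
        for i in range(1, len(concepts)):    # i = 2 .. m2 in the paper's 1-based notation
            total += cond_prob(concepts[i], frozenset(concepts[:i]))
    return total / len(training_samples)     # divide by m1, the number of training samples
```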

4. Ranked Video Retrieval

Ranked video retrieval using video annotation results is similar to the text retrieval problem. Given a text query Q = {w1, w2, …, wk} and a set of video shots V, the goal of video retrieval is to return from V the relevant video shots that contain the concept set Q. We first annotate all the video shots using the proposed automatic annotation approach, and then index the video shots by the associated key words, so that video retrieval can be conducted as text retrieval on the key words. This approach is straightforward, but it has the disadvantage that the returned video shots are not ranked by their matching confidence to the query. The annotation probabilities, however, can be used to rank the returned relevant video shots. From Section 3 we obtain the conditional probability P(C=1|CE) of a concept C given a group of concepts CE. We score a video shot by the probability given the observation of the query set Q, calculated as

w(Q \mid f_j) = M_j \cdot \frac{\sum_{i=1}^{M_j} w(c_i \mid f_j)}{\sum_{i=1}^{N_j} w(c_i \mid f_j)}    (3)

where Mj is the number of key words matched between video shot f_j and query Q, Nj is the total number of key words annotated for shot f_j, and w(c_i | f_j) is the probability stored in SP[f_j][c_i] during the annotation inference. All retrieved video shots are ranked according to w(Q | f_j).
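As an illustrative sketch of Eq. (3), with hypothetical names: annotation maps each annotated key word of shot f_j to the stored probability SP[f_j][c_i], and the returned shots are then sorted by this score in descending order.

```python
def query_score(query, annotation):
    """Ranking score w(Q|f_j) of Eq. (3) for one annotated shot."""
    matched = [p for c, p in annotation.items() if c in query]   # the M_j matched key words
    if not matched:
        return 0.0                                               # shot shares no key word with Q
    return len(matched) * sum(matched) / sum(annotation.values())
```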

5. Experimental Results

We have applied our approach to a video dataset with 14 semantic concepts (mountain, sky, cloud, water, waterfall, boat, road, car, building, greenery, animal, bridge, snow and land). The videos are collected from the website www.open-video.org. We used 1823 video shots for training and 986 video shots for testing; in total the videos last around 7.8 hours. We compute the HSV color histogram and edge histogram of the KFs and normalize them.

5.1. Evaluation Metric of Annotation

The evaluation metrics used in most related works are average precision (AP) and average recall (AR). However, they may not evaluate the performance well when some concepts occur in the training dataset but are not detected in any testing video. For example, suppose there are four concepts c1–c4 and the testing set is {f1, f2, f3}, with ground truth (GT) f1{c1,c2}, f2{c2,c3}, f3{c4}. If System A annotates the three video shots as f1{c1,c2}, f2{c1,c2}, f3{c3}, then AP = (0.5+1+0+1)/4 = 0.625 (when a concept is never predicted, the precision denominator is zero and the quotient is set to 1; any fixed value would do, provided it is used consistently) and AR = (1+1+0+0)/4 = 0.50. If System B annotates them as f1{c1,c2,c3}, f2{c1,c2}, f3{c3}, then AP = 0.625 and AR = 0.5. Note that concept c4 is not assigned to any video shot. Evaluated by AP and AR alone, the two systems perform equally, yet System A clearly performs better than System B. This is because the length of the semantic annotation is not considered in the evaluation. We therefore propose a new evaluation metric:

MT = w_1 \cdot \mathrm{sum} + w_2 \cdot \left(1 - \frac{|AL - L|}{N - L}\right)    (4)

where sum = AP + AR, AL is the average annotation length produced by the system over all testing shots, L is the average GT annotation length over all testing shots, and N is the number of concepts in the SCS (N=5 gives the best performance in our experiments). The weights w1 and w2 are empirically set to 0.9 and 0.1 respectively, so the annotation-length term contributes only a small proportion of MT; AP and AR remain the main evaluation factors.
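A direct transcription of Eq. (4) into Python, with the weights and N set to the values used in our experiments (the function name is ours):

```python
def mt_score(ap, ar, avg_len_sys, avg_len_gt, n=5, w1=0.9, w2=0.1):
    """MT of Eq. (4): mainly AP+AR, with a small penalty for annotation-length mismatch."""
    length_term = 1.0 - abs(avg_len_sys - avg_len_gt) / (n - avg_len_gt)   # assumes n != avg_len_gt
    return w1 * (ap + ar) + w2 * length_term
```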

5.2. Experimental Results of Annotation

In order to compare with other methods, we implemented the Gauss kernel and EMD kernel based methods of [6]. Table 1 shows the average annotation results obtained by these methods. The EMD kernel based algorithm outperforms the Gauss kernel, which coincides with the conclusion of [6], and our results outperform both. The table also shows that the annotation performance of the FAS obtained with Bayesian inference is much better than that of the SCS.

Fig. 1 shows three testing examples of automatic annotation using the three methods. Each shot has one main concept: the main concepts in Fig. 1(a), (b) and (c) are 'water', 'boat' and 'waterfall' respectively. Our method correctly annotates the main concepts in most cases, even though it occasionally assigns an incorrect concept, such as 'animal' in Fig. 1(c).

Table 1. Annotation performance of three methods

  method          AP      AR      sum     MT
  Gauss kernel    0.214   0.387   0.601   0.541
  EMD kernel      0.317   0.469   0.786   0.707
  Ours (SCS)      0.344   0.789   1.133   1.020
  Ours (FAS)      0.541   0.645   1.186   1.067

Table 2. Retrieval performance of the three methods

  #words                   1       2       3       4
  #Queries                 14      57      42      23
  Gauss kernel   MAP       0.223   0.115   0.087   0
                 MAR       0.376   0.199   0.054   0
                 F         0.280   0.146   0.067   0
  EMD kernel     MAP       0.314   0.208   0.136   0.169
                 MAR       0.456   0.365   0.404   0.357
                 F         0.372   0.265   0.204   0.229
  Our method     MAP       0.533   0.246   0.239   0.190
                 MAR       0.649   0.504   0.427   0.374
                 F         0.585   0.331   0.307   0.252

Figure 1. Some examples showing annotation results of Gauss Kernel, EMD Kernel and our approach.

5.3. Experiments on Ranked Video Retrieval

We use four sets of queries constructed from all n-word (n=1,2,3,4) combinations of the keywords (e.g., the 2-word combinations are all pairs of keywords); every combination occurs at least twice in the testing set. Table 2 shows the retrieval performance on videos annotated with our approach, compared with retrieval on videos annotated with the Gauss kernel and EMD kernel based methods. We use the traditional metrics mean average precision (MAP), mean average recall (MAR) and F score over the entire query set. Our returned videos are ranked according to the probability in (3), while theirs are ranked by

P(Q \mid f) = \prod_{j=1}^{k} P(w_j \mid f)    (5)
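For comparison, the baseline score of Eq. (5) is simply a product of per-keyword probabilities; a small sketch (names are ours, with keywords missing from a shot's annotation treated as probability 0):

```python
import math

def baseline_score(query, keyword_probs):
    """Baseline ranking P(Q|f) of Eq. (5) for one shot."""
    return math.prod(keyword_probs.get(w, 0.0) for w in query)
```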

We observe that the EMD kernel method outperforms the Gauss kernel method, which coincides with the conclusion of [6], and our method outperforms both of them. Fig. 2 shows the top 5 video shots retrieved using query ‘mountain’.

6. Conclusion

In this paper, we proposed a new method that annotates a testing shot adaptively and automatically. The main contribution of this work is to obtain a non-fixed-length annotation from the SCS by Bayesian inference, combining static and dynamic inference. Experiments show that our method is effective in automatically annotating video shots and retrieving videos by key words.

Figure 2. Top five video shots retrieved using ‘mountain’

7. References

[1] Y. Aytar, O. B. Orhan and M. Shah. Improving Semantic Concept Detection and Retrieval Using Contextual Estimates. ICME 2006.
[2] R. Yan et al. Mining Relationship between Video Concepts using Probabilistic Graphical Model. ICME 2006.
[3] W. Jiang, S. Chang et al. Context-based Concept Fusion with Boosted Conditional Random Fields. IEEE ICASSP 2007.
[4] M. Naphade. A Probabilistic Framework for Mapping Audio-visual Features to High-Level Semantics in Terms of Concepts and Context. PhD dissertation, UIUC, 2001.
[5] A. B. Benítez. Multimedia Knowledge: Discovery, Classification, Browsing, and Retrieval. PhD dissertation, Columbia University, 2005.
[6] A. Yavlinsky et al. Automated Image Annotation Using Global Features and Robust Non-parametric Density Estimation. CIVR 2005.
[7] J. Cheng, R. Greiner, et al. Learning Belief Networks from Data: An Information Theory Based Approach. Artificial Intelligence, 137(1-2):43-90, 2002.
[8] C. Huang. Inference in Belief Networks: A Procedural Guide. International Journal of Approximate Reasoning, 11:1-158, 1994.
[9] Z. Rasheed and M. Shah. Scene Detection in Hollywood Movies and TV Shows. CVPR 2003.
