Semantically Coherent Image Annotation with a Learning-based Keyword Propagation Strategy

Chaoran Cui 1,2 ([email protected])
Jun Ma 1,2 ([email protected])
Shuai Gao 1,2 ([email protected])
Shuaiqiang Wang 3 ([email protected])
Tao Lian 1,2 ([email protected])

1 School of Computer Science and Technology, Shandong University, Jinan, China
2 Shandong Provincial Key Laboratory of Software Engineering, Jinan, China
3 School of Computer Science and Technology, Shandong University of Finance and Economics, Jinan, China

ABSTRACT

Automatic image annotation plays an important role in modern keyword-based image retrieval systems. Recently, many neighbor-based methods have been proposed and have achieved good performance for image annotation. However, existing work has mainly focused on exploring a distance metric learning algorithm to determine the neighbors of an image, and has neglected the subsequent keyword propagation process. Such methods usually rely on simple heuristic propagation rules and propagate each keyword independently, without considering the inherent semantic coherence among keywords. In this paper, we propose a novel learning-based keyword propagation strategy and incorporate it into the neighbor-based method framework. In particular, we employ the structural SVM to learn a scoring function that can evaluate different candidate keyword sets for a test image. Moreover, we explicitly enforce the semantic coherence constraint on the propagated keywords, so the annotation of a test image is propagated as a whole rather than as separate keywords. Experiments on two benchmark data sets demonstrate the effectiveness of our approach for image annotation and ranked retrieval.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing

General Terms
Algorithms, Performance, Experimentation

Keywords
image annotation, semantic coherence, structural learning


1. INTRODUCTION

In recent decades, the number of digital images has been growing rapidly, and there is an increasingly urgent demand for indexing and retrieving these images effectively. Users often prefer searching for images with a textual query, which can be achieved by first annotating images manually and then searching over the annotations using the query. However, manual image annotation is a laborious and time-consuming process. Therefore, researchers have attempted to develop automatic image annotation techniques for image retrieval. Automatic image annotation aims to assign relevant keywords to a new image from an annotation vocabulary.

Recently, neighbor-based methods [4, 6, 7] have become attractive because of their superior performance and straightforward implementation scheme. Given an unlabeled image, the neighbor-based methods first find its visually similar neighbors, and then propagate the keywords associated with these neighbors to it. Existing work has mainly focused on the first step, exploring distance metric learning algorithms to determine the neighbors of an image. However, the subsequent keyword propagation process has not been well investigated. In most cases, previous methods used heuristic propagation rules, e.g., simply transferring the most frequent keywords among the nearest neighbors to the given image. There is no guarantee that the keywords selected by such simple heuristic rules are always the most suitable candidate annotations. Besides, most neighbor-based methods propagate each keyword independently without considering the correlations among keywords. In fact, the keywords associated with an image do not appear in isolation; they appear correlatively and interact coherently with each other at the semantic level. It is therefore difficult for these methods to propagate semantically coherent keywords to the given image.

In light of the above problems, we propose a novel learning-based keyword propagation strategy and incorporate it into the neighbor-based method framework. The image annotation task is formulated as a structured prediction problem, where the goal is to learn a mapping from an image to an associated subset of keywords. To this end, we utilize the structural SVM [5] as the backbone of our learning approach. Specifically, based on the annotated training examples, the structural SVM seeks to learn a scoring function which can evaluate different candidate keyword sets for a test image. The quality of a keyword set is assessed based on its relevance to the neighbors of the test image in different aspects. Finally, the keyword set maximizing the scoring function is propagated to the test image. In addition, the semantic coherence constraint for the propagated keywords is explicitly enforced as part of the optimization objective in our approach. Therefore, the annotation of the test image is predicted as a semantically coherent whole instead of as separate keywords.

2. PROBLEM FORMULATION

Let X = {x_1, x_2, ..., x_N} denote an image collection, and let W = {w_1, w_2, ..., w_M} be the set of all unique keywords appearing in this collection. The goal of the annotation task is to learn a hypothesis h : X → Y, where Y denotes the space of all possible keyword subsets. Given an image x ∈ X, we use h to predict an associated keyword subset y ⊂ W for x. In the supervised learning scenario, we are given a set of annotated training images, S = {(x^(i), y^(i)) ∈ X × Y : i = 1, ..., T}, where y^(i) is the ground-truth annotation of the image x^(i). We hope the learned hypothesis h can minimize the empirical risk

$$R^{\Delta}_{S}(h) = \frac{1}{T} \sum_{i=1}^{T} \Delta\big(y^{(i)}, h(x^{(i)})\big). \qquad (1)$$

Here, Δ(y^(i), h(x^(i))) quantifies the loss of the predicted annotation h(x^(i)) compared to the ground-truth annotation y^(i). For the annotation task, a loss function based on the F1 measure is defined as follows:

$$\Delta(y, y') = 1 - \frac{2pr}{p+r}, \qquad p = \frac{|y \cap y'|}{|y'|}, \qquad r = \frac{|y \cap y'|}{|y|}, \qquad (2)$$

where y and y′ are two annotations, |y| denotes the number of keywords in y, and |y ∩ y′| is the number of common keywords they share.

In this paper, we adopt the structural SVM [5] as the backbone of our learning strategy to tackle the above problem. The idea behind the structural SVM is to discriminatively learn a scoring function F(x, y) : X × Y → R, which measures how well a candidate annotation y fits a given image x. We represent the image/annotation pair (x, y) by a feature vector Ψ(x, y). In analogy to the linear SVM, the scoring function F(x, y) is assumed to be linear in Ψ(x, y):

$$F(x, y) = \mathbf{w}^{T} \Psi(x, y), \qquad (3)$$

where w denotes the weight vector. Intuitively, the feature representation Ψ must provide significant discriminative power between high-quality and low-quality candidate annotations. We discuss the exact form of Ψ in the next section. Once the scoring function F(x, y) is learned, the hypothesis h predicts the annotation y* for an image x by maximizing F(x, y) over all possible y ∈ Y:

$$y^{*} = h(x) = \arg\max_{y \in \mathcal{Y}} F(x, y). \qquad (4)$$

Following previous work [4, 6, 7], we assign L (L = 5) keywords to each test image, i.e., |y*| = L.
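To make the loss concrete, the following is a minimal Python sketch of the F1-based loss in equation (2); the function name and the convention for empty annotations are our own illustrative choices, not part of the paper.

```python
# A minimal sketch of the F1-based loss of equation (2).
def annotation_loss(y_true: set, y_pred: set) -> float:
    """Delta(y, y') = 1 - 2pr / (p + r), with p and r from the set overlap."""
    if not y_true or not y_pred:
        return 1.0  # assumption: maximal loss when either annotation is empty
    common = len(y_true & y_pred)
    if common == 0:
        return 1.0  # no overlap: p = r = 0, so F1 = 0 and the loss is maximal
    p = common / len(y_pred)  # precision of the predicted keyword set
    r = common / len(y_true)  # recall w.r.t. the ground-truth set
    return 1.0 - 2 * p * r / (p + r)

# Example: two of five predicted keywords are correct -> loss 0.5.
print(annotation_loss({"sky", "sea", "boat"},
                      {"sky", "sea", "car", "tree", "road"}))
```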

3. STRUCTURAL SVM FOR IMAGE ANNOTATION

3.1 Feature Representation

In this section, we discuss the feature representation Ψ in equation (3). For training examples, Ψ should provide a set of discriminating features that can differentiate the ground-truth annotation of an image from other alternative annotations. Intuitively, visually similar images often reflect similar themes and thus are typically annotated with similar keywords. Starting from this intuition, given an image/annotation pair (x, y), we first find the visual neighbors of x, and then formulate Ψ(x, y) based on the relations between y and these neighbors. The form of Ψ(x, y) is given as

$$\Psi(x, y) = \begin{bmatrix} \frac{S_{NN_1}}{|y|} \sum_{w \in y} \phi(w, NN_1) \\ \vdots \\ \frac{S_{NN_K}}{|y|} \sum_{w \in y} \phi(w, NN_K) \end{bmatrix}, \qquad (5)$$

where NN_1, ..., NN_K are the K nearest neighbors of x, and S_{NN_1}, ..., S_{NN_K} denote their similarities to x. ϕ(w, NN_i) is a feature vector encoding the relation between the keyword w and the i-th neighbor NN_i. From the above definitions, we can see that Ψ(x, y) represents the vector composition of K relation components.

In this paper, we simply determine the visual neighbors by averaging several distances computed from different features, as [4] did. The impact of NN_i on x is assumed to be positively correlated with their similarity S_{NN_i}, which we define as

$$S_{NN_i} = \exp\left[\frac{1}{1 + d(x, NN_i)}\right], \qquad (6)$$

where d(x, NN_i) is the normalized visual distance between x and NN_i.

In equation (5), the feature vector ϕ(w, NN_i) encodes the relation between the keyword w and the i-th neighbor NN_i. It is formulated from the aspects of frequency, co-occurrence and semantic similarity of w given NN_i. According to the frequency of w, we can estimate the probability of annotating NN_i with w using a multiple Bernoulli model:

$$P(w \mid NN_i) = \frac{\mu\, \delta_{w, NN_i} + T_w}{\mu + T}. \qquad (7)$$

Here, µ is a smoothing parameter estimated using cross validation, δ_{w,NN_i} = 1 if w occurs in the annotation of NN_i and zero otherwise, T_w denotes the number of training images that contain w in their annotations, and T is the total number of training images.

To further explore the relation between the keyword w and the neighbor image NN_i, we consider two other kinds of keyword correlations, i.e., co-occurrence and WordNet semantic similarity. The co-occurrence S_co between two keywords is defined as

$$S_{co}(w_1, w_2) = \frac{tf(w_1, w_2)}{tf(w_2)}, \qquad (8)$$

where w_1 and w_2 are two keywords, tf(w_2) denotes the total frequency of w_2 in the training examples, and tf(w_1, w_2) is the number of images containing both w_1 and w_2. Moreover, we employ Lin's similarity measure [3] to estimate the WordNet semantic similarity S_wn between two keywords. According to S_co and S_wn, the co-occurrence and WordNet semantic similarity between w and the annotation of NN_i are subsequently computed, respectively, by looking for the 'closest' keyword in NN_i with respect to w:

$$R_{co}(w, NN_i) = \max_{t \in NN_i} S_{co}(w, t), \qquad (9)$$

$$R_{wn}(w, NN_i) = \max_{t \in NN_i} S_{wn}(w, t). \qquad (10)$$

Following equations (7), (9) and (10), the exact form of ϕ(w, NN_i) is a three-dimensional vector:

$$\phi(w, NN_i) = \begin{bmatrix} P(w \mid NN_i) \\ R_{co}(w, NN_i) \\ R_{wn}(w, NN_i) \end{bmatrix}. \qquad (11)$$

Therefore, the total dimension of the feature representation Ψ(x, y) is 3K when we consider the K nearest neighbors of x.
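As an illustration, the following Python sketch assembles ϕ(w, NN_i) and Ψ(x, y) from equations (5)-(11), assuming the per-keyword statistics P(w|NN_i), R_co and R_wn have been precomputed for each neighbor; the dictionary layout and all names are hypothetical, not from the paper.

```python
import math
import numpy as np

def neighbor_similarity(dist: float) -> float:
    """S_{NN_i} = exp(1 / (1 + d(x, NN_i))), equation (6)."""
    return math.exp(1.0 / (1.0 + dist))

def phi(w: str, neighbor: dict) -> np.ndarray:
    """Three-dimensional relation vector of equation (11)."""
    return np.array([
        neighbor["p_bernoulli"][w],  # P(w | NN_i), equation (7), precomputed
        neighbor["r_co"][w],         # R_co(w, NN_i), equation (9), precomputed
        neighbor["r_wn"][w],         # R_wn(w, NN_i), equation (10), precomputed
    ])

def joint_feature_map(y: set, neighbors: list) -> np.ndarray:
    """Psi(x, y), equation (5): K blocks of (S_{NN_i}/|y|) * sum_{w in y} phi(w, NN_i)."""
    blocks = []
    for nb in neighbors:  # the K nearest neighbors of x
        s = neighbor_similarity(nb["dist"])  # normalized distance d(x, NN_i)
        block = (s / len(y)) * sum(phi(w, nb) for w in y)
        blocks.append(block)
    return np.concatenate(blocks)  # total dimension is 3K
```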

3.2 Semantic Coherence of Annotation

In the above study, we utilize the relevance of each keyword in a candidate annotation individually, without considering the correlations between keywords. However, as discussed in Section 1, the keywords associated with an image are dependent on each other at the semantic level, and together they constitute the annotation as a whole for that image. To guarantee the semantic coherence of the annotated keywords for an image, we append a constraint term to the scoring function in equation (3). As a result, the objective hypothesis becomes

$$y^{*} = h(x) = \arg\max_{y \in \mathcal{Y}} F(x, y) + \Theta(y) = \arg\max_{y \in \mathcal{Y}} \mathbf{w}^{T} \Psi(x, y) + \Theta(y), \qquad (12)$$

where Θ(y) is defined as the average semantic similarity between each pair of keywords in y:

$$\Theta(y) = \frac{\sum_{p, q \in y,\, p \neq q} \alpha S_{co}(p, q) + (1 - \alpha) S_{wn}(p, q)}{C_{L}^{2}}. \qquad (13)$$

Here, L is the size of y and C_L^2 is the number of keyword pairs in y. The weighting parameter α tunes the relative importance of co-occurrence versus WordNet semantic similarity, and is determined through cross validation.
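A minimal Python sketch of the coherence term Θ(y) in equation (13), assuming the pairwise similarities S_co and S_wn are available as lookup functions; the names are illustrative.

```python
from itertools import combinations

def coherence(y: set, s_co, s_wn, alpha: float) -> float:
    """Theta(y), equation (13): mean pairwise similarity over the C(L, 2) pairs."""
    pairs = list(combinations(sorted(y), 2))
    if not pairs:
        return 0.0  # assumption: no pairwise term for a single-keyword annotation
    total = sum(alpha * s_co(p, q) + (1.0 - alpha) * s_wn(p, q) for p, q in pairs)
    return total / len(pairs)
```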

4. LEARNING WITH STRUCTURAL SVM

In this section, we employ the structural SVM to train a robust model for image annotation. Given a set of training examples S = {(x^(i), y^(i)) : i = 1, ..., T}, the structural SVM learns the weight vector w in equation (12) through the following quadratic programming problem [5]:

Optimization Problem 1. (Structural SVM)

$$\min_{\mathbf{w}, \xi \geq 0} \frac{1}{2} \|\mathbf{w}\|^{2} + \frac{C}{T} \sum_{i=1}^{T} \xi_i, \qquad (14)$$

subject to:

$$\forall i, \forall y \in \mathcal{Y} \setminus y^{(i)}: \mathbf{w}^{T} \Psi(x^{(i)}, y^{(i)}) \geq \mathbf{w}^{T} \Psi(x^{(i)}, y) + \Delta(y^{(i)}, y) - \xi_i. \qquad (15)$$

In this formulation, the constraint condition (15) associates one constraint with each incorrect annotation of each training image. Therefore, the number of constraints is exponential in the total number of unique keywords. In our study, we employ the cutting plane algorithm [2] (Algorithm 1) to solve the optimization problem. The algorithm aims at finding a subset of constraints such that the solution subject to this subset also fulfills all constraints up to a precision of ε. It iteratively finds the ŷ that generates the most violated constraint for each example (x^(i), y^(i)) (line 5). If the corresponding constraint is violated by more than ε, the algorithm adds ŷ to the working set W_i, and then re-solves (14) using the constraints in the updated working sets (lines 7-9).

Algorithm 1 Cutting plane algorithm
Input: (x^(1), y^(1)), ..., (x^(T), y^(T)), C, ε
Output: w
1:  Initialize W_i ← ∅ for all i = 1, ..., T
2:  repeat
3:    for i = 1, ..., T do
4:      H(y; w) ≡ Δ(y^(i), y) + w^T Ψ(x^(i), y)
5:      compute ŷ = arg max_{y ∈ Y} H(y; w)
6:      compute ξ_i = max{0, max_{y ∈ W_i} H(y; w)}
7:      if H(ŷ; w) > ξ_i + ε then
8:        W_i ← W_i ∪ {ŷ}
9:        optimize (14) over W = ∪_i W_i
10:     end if
11:   end for
12: until no W_i has changed during the iteration
13: return w

In Algorithm 1, we need to find the most violated constraint in each iteration by solving the maximization problem in line 5. However, since we add the semantic coherence term to the objective hypothesis (see equation (12)), the maximization problem becomes:

$$\hat{y} = \arg\max_{y \in \mathcal{Y}} \Delta(y^{(i)}, y) + \mathbf{w}^{T} \Psi(x^{(i)}, y) + \Theta(y). \qquad (16)$$

In this paper, we propose a greedy strategy (Algorithm 2) which is simple but effective in solving this problem. The algorithm repeatedly selects the keyword t̂ that brings the highest gain to the current keyword set ŷ, and stops when the size of ŷ reaches L.

Algorithm 2 Greedy keyword subset selection
Input: (x^(i), y^(i)), w, L
Output: ŷ
1: Initialize ŷ ← ∅
2: V(x, y, y′) ≡ Δ(y, y′) + w^T Ψ(x, y′) + Θ(y′)
3: for k = 1, ..., L do
4:   t̂ ← arg max_{t ∉ ŷ} V(x^(i), y^(i), ŷ ∪ {t})
5:   ŷ ← ŷ ∪ {t̂}
6: end for
7: return ŷ

Although we use an approximate constraint generation algorithm, the learned model still achieves good performance in our experiments. Given the learned weight vector w, we predict the annotation for a new image x by solving equation (12); the greedy strategy in Algorithm 2 can also be applied for this purpose (with the loss term Δ omitted at test time).
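For concreteness, the greedy selection of Algorithm 2 can be sketched in Python as follows, assuming a `score` callable that evaluates the objective for a candidate set (w^T Ψ(x, y) + Θ(y) at test time, plus Δ during constraint generation); `vocabulary` and the function names are placeholders, not the authors' code.

```python
def greedy_select(x, vocabulary: set, score, L: int = 5) -> set:
    """Greedily grow the keyword set up to size L, as in Algorithm 2."""
    y_hat: set = set()
    for _ in range(L):
        # add the keyword whose inclusion yields the largest objective value
        t_best = max(vocabulary - y_hat, key=lambda t: score(x, y_hat | {t}))
        y_hat.add(t_best)
    return y_hat
```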


5. EXPERIMENTS

5.1 Experiment Settings

We evaluate our method on two publicly available data sets: Corel 5K and IAPR TC12. Both data sets have been widely used in previous work, so we can directly compare experimental results. Each image is represented with the same features as described in [4]. The quality of the predicted annotations is assessed by retrieving test images using the keywords in the annotation vocabulary. We use the average precision (P) and recall (R) over all keywords as two evaluation measures; in addition, the number of keywords with non-zero recall (N+) is also considered (see the sketch below). The number of neighbors K in equation (5) is a parameter to be determined. We experimented with several values of K and found that the best performance is achieved with K = 100 for Corel 5K and K = 400 for IAPR TC12.
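The per-keyword evaluation protocol can be sketched as follows; `ground_truth` and `predicted` are hypothetical dictionaries mapping image ids to keyword sets, not structures from the paper.

```python
def keyword_metrics(ground_truth: dict, predicted: dict, vocabulary: set):
    """Mean per-keyword precision (P), recall (R), and non-zero-recall count (N+)."""
    precisions, recalls, nonzero_recall = [], [], 0
    for w in vocabulary:
        relevant = {i for i, kws in ground_truth.items() if w in kws}
        retrieved = {i for i, kws in predicted.items() if w in kws}
        hit = len(relevant & retrieved)
        p = hit / len(retrieved) if retrieved else 0.0
        r = hit / len(relevant) if relevant else 0.0
        precisions.append(p)
        recalls.append(r)
        nonzero_recall += int(r > 0)
    P = sum(precisions) / len(vocabulary)  # averaged over all keywords
    R = sum(recalls) / len(vocabulary)
    return P, R, nonzero_recall            # N+ = keywords with non-zero recall
```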

5.2 Experiment Results

To evaluate our method, we compare it with previous neighbor-based approaches. In addition, we design a simplified version of our method, Str-SVM, which does not enforce the semantic coherence constraint in the objective hypothesis (12). Table 1 shows the annotation results of the different algorithms. On Corel 5K, our method outperforms Str-SVM by 3% in P with almost no loss in R and N+, which emphasizes the importance of the semantic coherence requirement. Compared with the JEC method, which adopts the same features and visual distance measurement, our method achieves improvements of 4%, 4% and 12 in terms of P, R and N+, respectively. This shows the benefit brought by our proposed learning-based keyword propagation strategy. In addition, although we use a simple technique to find visual neighbors, the performance of our method is still superior to that of methods involving complicated distance metric learning, such as MSC [6], LASSO [4], and GS [7]. On IAPR TC12, our method outperforms the other algorithms as well.

Table 1: Performance comparison in terms of P%, R% and N+ between our method and previously published work

                 Corel 5K            IAPR TC12
             P%    R%    N+      P%    R%    N+
MSC [6]      25    32    136     —     —     —
JEC [4]      27    32    139     28    29    250
LASSO [4]    24    29    127     28    29    246
GS [7]       30    33    146     32    29    252
Str-SVM      28    36    153     28    32    263
Our Method   31    36    151     33    33    258

We also examine the performance of our method on the problem of ranked retrieval of images. In our model, we can directly utilize the scoring function (3) to evaluate the compatibility between an image and a query. Given a textual query q, the confidence that an image I is relevant to q can be estimated as

$$R(I, q) = F(I, q) = \mathbf{w}^{T} \Psi(I, q). \qquad (17)$$

We evaluate the performance in terms of mean average precision (MAP) on Corel 5K. To facilitate direct comparison, we adopt the same experimental setting as in [1]. There are four types of queries: single-word, multiple-word, 'easy' and 'difficult'. Table 2 shows that our results improve upon those of PAMIR, which was found to outperform a number of alternative approaches in [1], on all types of queries. This demonstrates that the retrieval ranking provided by our method is preferable.

Table 2: Ranked retrieval performance in terms of MAP% with different types of queries

             Single   Multi   Easy   Difficult
PAMIR [1]      34       26     43       22
Our Method     40       28     47       25
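As a usage illustration of equation (17), the sketch below scores and ranks images for a query keyword set, reusing the hypothetical `joint_feature_map` from the earlier sketch; this is an assumption-laden reconstruction, not the authors' code.

```python
import numpy as np

def rank_images(query_keywords: set, images: list, w_vec: np.ndarray) -> list:
    """Rank images by R(I, q) = w^T Psi(I, q), equation (17)."""
    scored = []
    for img in images:
        # Psi(I, q): treat the query keywords as the candidate annotation of I;
        # joint_feature_map is the hypothetical helper from the Section 3.1 sketch
        psi = joint_feature_map(query_keywords, img["neighbors"])
        scored.append((img["id"], float(w_vec @ psi)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```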

6. CONCLUSIONS

In this paper, we have introduced a novel image annotation approach, which augments the neighbor-based methods with a learning-based keyword propagation strategy. We utilize the structural SVM to learn a scoring function for evaluating candidate keyword sets, so the annotation of a test image is predicted as a semantically coherent whole. Experiments demonstrate the effectiveness of our approach for image annotation and ranked retrieval. For future study, we plan to investigate the scalability of our approach and experiment on realistic large-scale web image data sets.

7. ACKNOWLEDGMENTS

This work is supported by the Natural Science Foundation of China (60970047, 61103151, 61173068, 61272240), the Humanity and Social Science Foundation of the Ministry of Education of China (12YJC630211), the Doctoral Fund of the Ministry of Education of China (20110131110028), the Natural Science Foundation of Shandong Province (ZR2012FM037, BS2012DX012), and the Graduate Independent Innovation Foundation of Shandong University (YZC12084).

8. REFERENCES

[1] D. Grangier and S. Bengio. A discriminative kernel-based approach to rank images from text queries. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30:1371–1384, 2008.
[2] T. Joachims, T. Finley, and C.-N. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77:27–59, 2009.
[3] D. Lin. An information-theoretic definition of similarity. In ICML, pages 296–304, 1998.
[4] A. Makadia, V. Pavlovic, and S. Kumar. A new baseline for image annotation. In ECCV, pages 316–329, 2008.
[5] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453–1484, 2005.
[6] C. Wang, S. Yan, L. Zhang, and H.-J. Zhang. Multi-label sparse coding for automatic image annotation. In CVPR, pages 1643–1650, 2009.
[7] S. Zhang, J. Huang, Y. Huang, Y. Yu, H. Li, and D. Metaxas. Automatic image annotation using group sparsity. In CVPR, pages 3312–3319, 2010.
