
Contextual Kernel and Spectral Methods for Learning the Semantics of Images Zhiwu Lu, Horace H.S. Ip, and Yuxin Peng

Abstract—This paper presents contextual kernel and spectral methods for learning the semantics of images which allow us to automatically annotate an image with keywords. First, to exploit the context of visual words within images for automatic image annotation, we define a novel spatial string kernel to quantify the similarity between images. Specifically, we represent each image as a 2D sequence of visual words and measure the similarity between two 2D sequences using the shared occurrences of s-length 1D subsequences by decomposing each 2D sequence into two orthogonal 1D sequences. Based on our proposed spatial string kernel, we further formulate automatic image annotation as a contextual keyword propagation problem, which can be solved very efficiently by linear programming. Unlike the traditional relevance models that treat each keyword independently, the proposed contextual kernel method for keyword propagation takes into account the semantic context of annotation keywords and propagates multiple keywords simultaneously. Significantly, this type of semantic context can also be incorporated into spectral embedding for refining the annotations of images predicted by keyword propagation. Experiments on three standard image datasets demonstrate that our contextual kernel and spectral methods can achieve significantly better results than the state of the art.

Index Terms—Kernel methods, spectral embedding, visual words, string kernel, keyword propagation, linear programming, annotation refinement

I. INTRODUCTION

With the rapid growth of image archives, there is an increasing need for effectively indexing and searching these images. Although many content-based image retrieval systems [1], [2] have been proposed, it is rather difficult for users to represent their queries using the visual image features such as color and texture. Instead, most users prefer image search by textual queries, which is typically achieved by manually providing image annotations and then searching over these annotations using a textual query. However, manual annotation is an expensive and tedious task. Hence, automatic image annotation plays an important role in efficient image retrieval.

Recently, many methods for learning the semantics of images based on machine learning techniques have emerged to pave the way for automatic annotation of an image with keywords, and we can roughly classify them into two categories.

Copyright (c) 2010 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected]. Z. Lu and H. Ip are with the Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong (e-mail: [email protected], [email protected]). Y. Peng is with the Institute of Computer Science and Technology, Peking University, Beijing 100871, China (e-mail: [email protected]).

The traditional methods for automatic image annotation treat each annotation keyword or concept as an independent class and train a corresponding classifier to identify images belonging to this class. This strategy has been adopted by methods such as linguistic indexing of pictures [2] and image annotation using the support vector machine [3] or the Bayes point machine [4]. The problem with these classification-based methods is that they are not particularly scalable to a large-scale concept space. In the context of image annotation and retrieval, the concept space becomes significantly large due to the large number (i.e. hundreds or even thousands) of keywords involved in the annotation of images. Therefore, the problems of semantic overlap and data imbalance among different semantic classes become very serious, which leads to significantly degraded classification performance.

Another category of automatic image annotation methods takes a different viewpoint and learns the correlation between images and annotation keywords by means of keyword propagation. Many of these methods are based on probabilistic generative models, among which an influential work is the cross-media relevance model [5] that estimates the joint probability of image regions and annotation keywords on the training image set. The relevance model for learning the semantics of images has subsequently been improved through the development of the continuous-space relevance model [6], the multiple Bernoulli relevance model [7], the dual cross-media relevance model [8], and, more recently, our generalized relevance model [9]. Moreover, graph-based semi-supervised learning [10] has also been applied to keyword propagation for automatic image annotation in [11]. However, these keyword propagation methods ignore either the context of image regions or the correlation information of annotation keywords.

This paper focuses on keyword propagation for learning the semantics of images. To overcome the problems with the above keyword propagation methods, we propose a 2D string kernel, called the spatial spectrum kernel (SSK) [12], which quantifies the similarity between images and enables us to exploit the context of visual words within images for keyword propagation. To compute the proposed contextual kernel, we represent each image as a 2D sequence of visual words and measure the similarity between two 2D sequences using the shared occurrences of s-length 1D subsequences by decomposing each 2D sequence into two orthogonal 1D sequences (i.e. the row-wise and column-wise ones). To the best of our knowledge, this is the first application of a string kernel for matching 2D sequences of visual words. Here, it should be noted that string kernels were originally proposed for protein classification [13], and the number of amino acids


Fig. 1. Illustration of the proposed framework for learning the semantics of images using visual and semantic context. (Figure: a pipeline in which the image dataset is represented by visual words whose visual context, together with the annotation keywords and their semantic context, feeds contextual keyword propagation; the annotation results are then passed through annotation refinement by contextual spectral embedding.)

(similar to the visual words used here) typically involved in the kernel definition was very small. In contrast, in the present work, string kernels are used to capture and compare the context of a large number of visual words within an image, and the associated problem of sequence matching becomes significantly more challenging. As compared with our previous work [12], this paper presents significantly more extensive and convincing results, in particular, on the large IAPR TC-12 image dataset [14]. More importantly, we further present a new and significant technical development that addresses the issue of annotation refinement (see Section V). The novelty of the proposed refinement method is that it directly considers the manifold structure of annotation keywords, which gives rise to new and significant contributions beyond our previous work.

Moreover, to exploit the semantic context of annotation keywords for automatically learning the semantics of images, a contextual kernel method is proposed based on our spatial spectrum kernel. We first formulate automatic image annotation as a contextual keyword propagation problem where multiple keywords can be propagated simultaneously from the training images to the test images. Meanwhile, to avoid being misled by training images that are far away (i.e. not in the same manifold), each test image is limited to absorbing the keyword information (e.g. confidence scores) only from its k-nearest neighbors. Since this contextual keyword propagation problem is further solved very efficiently by linear programming [15], [16], our contextual kernel method is highly scalable and can be applied to large image datasets. It should be noted that our contextual keyword propagation distinguishes itself from previous work in that multiple keywords can be propagated simultaneously, which means that the semantic context of annotation keywords can be exploited for learning the semantics of images.

More importantly, this type of semantic context can be further used for refining the annotations predicted by keyword propagation. Here, we first obtain spectral embedding [17]–[19] by exploiting the semantic context of annotation keywords, and then perform annotation refinement in the resulting embedding space. Finally, the above contextual kernel and spectral methods for learning the semantics of images can be integrated in a unified framework as shown in Fig. 1, which contains three main components: visual context analysis with the spatial spectrum kernel, learning with visual and semantic context, and annotation refinement by contextual spectral embedding.

In this paper, the proposed framework is tested on three standard image datasets: University of Washington (UW), Corel [20], and IAPR [14]. In particular, the Corel image dataset has been widely used for the evaluation of image annotation [7],

[21]. Experimental results on these image datasets demonstrate that the proposed framework outperforms the state-of-the-art methods. In summary, the proposed framework has the following advantages: (1) Our spatial string kernel, defined as the similarity between images, can capture the context of visual words within images. (2) Our contextual spectral embedding method directly considers the manifold structure of annotation keywords for annotation refinement, and, more importantly, the semantic context of annotation keywords can be incorporated into manifold learning. (3) Our kernel and spectral methods can achieve promising results by exploiting both visual and semantic context for learning the semantics of images. (4) Our contextual kernel and spectral methods are very scalable with respect to the data size and can be used for large-scale image applications. (5) Our contextual kernel and spectral methods are very general techniques and have the potential to improve the performance of other machine learning methods that are widely used in computer vision and image processing.

The remainder of this paper is organized as follows. Section II gives a brief review of previous work. In Section III, we present our spatial spectrum kernel to capture the context of visual words, which can be further used for keyword propagation. In Section IV, we propose our contextual kernel method for keyword propagation based on the proposed spatial spectrum kernel. In Section V, the annotations predicted by our contextual keyword propagation are further refined by novel contextual spectral embedding. Section VI presents the evaluation of the proposed framework on three standard image datasets. Finally, Section VII gives conclusions drawn from our experimental results.

II. RELATED WORK

Our keyword propagation method differs from the traditional approaches that are based on relevance models [5]–[7] and graph-based semi-supervised learning [11] in that the keyword correlation information has been exploited for image annotation. Although much effort has also been made in [22] to exploit the keyword correlation information, it was limited to pairwise correlation of annotation keywords. In contrast, our method can simultaneously propagate multiple keywords from the training images to the test images. In [23], a particular structure of the annotation keywords was assumed in order to exploit the keyword correlation information. We argue that such an assumption could be violated in practice because


the relationships between annotation keywords may become too complicated. On the contrary, our method can exploit the semantic context of annotation keywords of any order. This semantic context is further exploited for refining the annotations of images predicted by our contextual keyword propagation. Specifically, we first obtain contextual spectral embedding by incorporating the semantic context of annotation keywords into graph construction, and then perform annotation refinement in the obtained more descriptive embedding space. This differs from previous methods, e.g. [21], [24], which directly exploited the semantic context of keywords for annotation refinement, without considering the manifold structure hidden among them.

More importantly, another type of context has also been incorporated into our contextual keyword propagation. This can be achieved by first representing each image as a 2D sequence of visual words and then defining a spatial string kernel to capture the context of visual words. This contextual kernel can be used as a similarity measure between images for keyword propagation. In fact, both local and global context can be captured in our work. The spatial dependency between visual words learnt with our spatial spectrum kernel can be regarded as the local context, while the spatial layout of visual words obtained with multi-scale kernel combination (see Section III-C) provides the global context. In the literature, most previous methods only considered either local or global context of visual words. For example, the collapsed graph [25] and Markov stationary analysis [26] only learnt the local context, while the constellation model [27] and spatial pyramid matching [28] only captured the global context.

To reduce the semantic gap between visual features and semantic annotations, we make use of an intermediate representation with a learnt vocabulary of visual words, which is similar to the bag-of-words methods such as probabilistic latent semantic analysis (PLSA) [29] and latent Dirichlet allocation (LDA) [30]. However, these methods typically ignore the spatial structure of images because the regions within images are assumed to be independently drawn from a mixture of latent topics. In contrast, our present work captures the spatial context of regions based on the proposed spatial spectrum kernel. It is shown in our experiments that this type of visual context is effective for keyword propagation in the challenging application of automatic image annotation.

III. VISUAL CONTEXT ANALYSIS WITH SPATIAL SPECTRUM KERNEL

To capture the context of visual words within images, we propose a spatial spectrum kernel (SSK) which can be used as a similarity measure between images for keyword propagation. We further present an efficient kernel computation method based on a tree data structure. Finally, we propose multi-scale kernel combination to capture the global layout of visual words within images. Hence, both local and global context within images can be captured and exploited in our present work.

A. Kernel Definition

Similar to the bag-of-words methods, we first divide images into equivalent blocks on a regular grid, and then extract some

representative properties from each block by incorporating the color and texture features. Through performing k-means clustering on the extracted feature vectors, we generate a vocabulary of visual words $V = \{v_i\}_{i=1}^M$ which describes the content similarities among the image blocks. Based on this universal vocabulary $V$, each block is annotated automatically with a visual word and an image is subsequently represented by a 2D sequence $Q$ of visual words.

The basic idea of defining a spatial spectrum kernel is to map the 2D sequence $Q$ into a high-dimensional feature space: $Q \mapsto \Phi(Q)$. We first scan this 2D sequence $Q$ in the horizontal and vertical directions, which results in a row-wise 1D sequence $Q_r$ and a column-wise 1D sequence $Q_c$, respectively. The feature mapping $\Phi$ can be formulated as follows:
$$\Phi(Q) = (\Phi(Q_r)^T, \Phi(Q_c)^T)^T, \qquad (1)$$
where $\Phi(Q_r)$ and $\Phi(Q_c)$ denote the feature vectors for the row-wise and column-wise sequences, respectively. The above formulation means that these two feature vectors are stacked together to form a higher dimensional feature vector for the original 2D sequence $Q$.

More formally, for an image $Q$ that is divided into $X \times Y$ blocks on a regular grid, we can now denote it as a row-wise sequence $Q_r$ and a column-wise one $Q_c$:
$$Q_r = q_{11} q_{12} \ldots q_{1Y}\, q_{21} q_{22} \ldots q_{2Y} \ldots q_{XY}, \qquad (2)$$
$$Q_c = q_{11} q_{21} \ldots q_{X1}\, q_{12} q_{22} \ldots q_{X2} \ldots q_{XY}, \qquad (3)$$
where $q_{xy} \in V$ ($1 \le x \le X$, $1 \le y \le Y$) is the visual word of block $(x, y)$ in the image. In the following, we will only give the details of the feature mapping for the row-wise sequences. The column-wise sequences can be mapped to a high dimensional feature space similarly.

Since the $s$-spectrum of an input sequence is the set of all the $s$-length subsequences that it contains, our feature mapping used to define the spatial spectrum kernel is indexed by all possible $s$-length subsequences $\alpha$ from the vocabulary $V$ (i.e. $\alpha \in V^s$). That is, we can define the following $\Phi$ that maps $Q_r$ to an $M^s$-dimensional feature space:
$$\Phi(Q_r) = (\phi_\alpha(Q_r))_{\alpha \in V^s}, \qquad (4)$$
where $\phi_\alpha(Q_r)$ is the number of times that $\alpha$ occurs in $Q_r$. We can find that $Q_r$ in the feature space is now denoted as a weighted representation of its $s$-spectrum. For example, given $V = \{A, B\}$ and $s = 2$, the feature vector of $Q_r = ABBA$ is $\Phi(Q_r) = (\phi_{AA}(Q_r), \phi_{AB}(Q_r), \phi_{BA}(Q_r), \phi_{BB}(Q_r)) = (0, 1, 1, 1)$, where all the possible $s$-length subsequences (i.e. $\alpha$) are $AA$, $AB$, $BA$, and $BB$, respectively.

Since the feature mapping for the column-wise sequences can be defined similarly, our spatial spectrum kernel can be computed as the following inner product:
$$K(Q, \tilde{Q}) = \langle \Phi(Q_r), \Phi(\tilde{Q}_r) \rangle + \langle \Phi(Q_c), \Phi(\tilde{Q}_c) \rangle, \qquad (5)$$
where $Q$ and $\tilde{Q}$ are two 2D sequences (i.e. two images).
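To make the definition concrete, the following sketch (an illustrative reimplementation under our own naming, not the authors' code) counts the s-spectrum of Eq. (4) with a hash map and evaluates the kernel of Eq. (5) as a sparse inner product over the row-wise and column-wise scans:

```python
# Sketch of the spatial spectrum kernel (Eqs. (4)-(5)) on 2D grids of visual words.
from collections import Counter

def spectrum_counts(seq, s):
    """s-spectrum of a 1D sequence of visual words, stored sparsely as a Counter."""
    return Counter(tuple(seq[i:i + s]) for i in range(len(seq) - s + 1))

def ssk(grid_a, grid_b, s=2):
    """Spatial spectrum kernel between two 2D grids (lists of rows) of visual words."""
    def row_col_scans(grid):
        rows = [w for row in grid for w in row]              # row-wise 1D sequence Q_r
        cols = [grid[x][y] for y in range(len(grid[0]))      # column-wise 1D sequence Q_c
                for x in range(len(grid))]
        return rows, cols

    value = 0
    for seq_a, seq_b in zip(row_col_scans(grid_a), row_col_scans(grid_b)):
        ca, cb = spectrum_counts(seq_a, s), spectrum_counts(seq_b, s)
        value += sum(ca[alpha] * cb[alpha] for alpha in ca)  # sparse inner product
    return value

# Toy check matching the example in the text: V = {A, B}, s = 2, Q_r = "ABBA".
print(spectrum_counts("ABBA", 2))   # Counter({('A','B'): 1, ('B','B'): 1, ('B','A'): 1})
```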


Fig. 2. The suffix tree constructed to compute the kernel value for two example sequences $Q$ = ABCA and $\tilde{Q}$ = BCAA: (a) the tree for $Q$; (b) the tree after $\tilde{Q}$ is compared to $Q$. Here, $V$ = {A, B, C} and $s = 2$. Each branch of the tree is labeled with a visual word from $V$, and each rectangular node denotes a leaf that stores two counts: one represents the number of times that an $s$-length subsequence of $Q$ ends at the leaf, while the other represents a similar count for $\tilde{Q}$. (Figure: in panel (a) the leaves for AB, BC, and CA each store (1, 0); in panel (b) the leaves for BC and CA become (1, 1), while AB remains (1, 0).)

Although the 2D sequences are mapped to a high dimensional (i.e. $2M^s$) feature space even for fairly small $s$, the feature vectors are extremely sparse: the number of non-zero coordinates is bounded by $X(Y - s + 1) + Y(X - s + 1)$. This property enables us to compute our spatial spectrum kernel very efficiently.

B. Efficient Kernel Computation

A very efficient method for computing $K(Q, \tilde{Q})$ is to build a suffix tree for the collection of $s$-length subsequences of $Q$ and $\tilde{Q}$, obtained by moving an $s$-length sliding window across either of $Q$ and $\tilde{Q}$. Each branch of the tree is labeled with a visual word from $V$. Each depth-$s$ leaf node of the tree stores two counts: one represents the number of times that an $s$-length subsequence of $Q$ ends at the leaf, while the other represents a similar count for $\tilde{Q}$.

Fig. 2 shows a suffix tree constructed to compute the kernel value for two example sequences $Q$ = ABCA and $\tilde{Q}$ = BCAA, where $V$ = {A, B, C} and $s = 2$. To compare these two sequences, we first construct a suffix tree to collect all the $s$-length subsequences of $Q$. Moreover, to make the kernel computation more efficient, we ignore the $s$-length subsequences of $\tilde{Q}$ that do not occur in $Q$, as they do not contribute to the kernel computation. Therefore, these subsequences (e.g. AA) are not shown in Fig. 2.

It should be noted that this suffix tree has $O(sXY)$ nodes because each 2D sequence on an $X \times Y$ grid only has $X(Y - s + 1)$ (or $Y(X - s + 1)$) $s$-length subsequences. Using a linear time construction algorithm for the suffix tree, we can build and annotate the suffix tree with a time cost of $O(sXY)$. The kernel value is then calculated by traversing the suffix tree and computing the sum of the products of the counts stored at the depth-$s$ nodes. Hence, the overall time cost of calculating the spatial spectrum kernel is $O(sXY)$.

Moreover, this idea of efficient kernel computation can similarly be used to build a suffix tree for all the input sequences at once and compute all the kernel values in one traversal of the tree. This is essentially the method that we adopt to compute our kernel matrices in later experiments, though we use a recursive function rather than explicitly constructing the suffix tree.
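As a rough illustration of this one-pass idea, the sketch below (our own hash-map alternative to the suffix tree, not the authors' implementation) indexes every s-length subsequence of a collection of 1D scans once and then accumulates all pairwise kernel values in a single sweep over the index:

```python
# Sketch: compute all pairwise spectrum-kernel values for a set of 1D scans.
from collections import defaultdict
import numpy as np

def ssk_matrix(scans, s=2):
    """scans: list of 1D visual-word sequences (e.g. the row-wise scans of all images)."""
    index = defaultdict(lambda: defaultdict(int))   # subsequence -> {image id: count}
    for img_id, seq in enumerate(scans):
        for i in range(len(seq) - s + 1):
            index[tuple(seq[i:i + s])][img_id] += 1

    K = np.zeros((len(scans), len(scans)))
    for per_image in index.values():                # each shared subsequence contributes
        items = list(per_image.items())             # count_a * count_b to K[a, b]
        for a, ca in items:
            for b, cb in items:
                K[a, b] += ca * cb
    return K

# Usage following Eq. (5): K_total = ssk_matrix(row_scans, s=2) + ssk_matrix(col_scans, s=2)
```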

C. Multi-Scale Kernel Combination

We further take into account multi-scale kernel combination to capture the global layout of visual words within images. Similar to the idea of wavelet transform, we place a series of increasingly finer grids over the 2D sequences of visual words. That is, each subsequence at level $l$ will be divided into $2 \times 2$ parts at level $l+1$, where $l = 0, ..., L-1$ and $L$ is the finest scale. Hence, we can obtain $4^l$ subsequences at level $l$. Based on these subsequences, we can define a series of spatial spectrum kernels and then combine them by a weighted sum.

Let $Q_i^l$ be the $i$-th subsequence at level $l$ for a 2D sequence $Q$. That is, $Q_i^l$ is in the $i$-th cell on the $2^l \times 2^l$ grid at this level. The spatial spectrum kernel at this scale can be computed as follows:
$$K^l(Q, \tilde{Q}) = \sum_{i=1}^{4^l} K(Q_i^l, \tilde{Q}_i^l), \qquad (6)$$
where $Q$ and $\tilde{Q}$ are two sequences. That is, we first define a spatial spectrum kernel for each subsequence at level $l$, and then take a sum of the obtained $4^l$ kernels. Intuitively, $K^l(Q, \tilde{Q})$ not only measures the number of the same co-occurrences (i.e. spatial dependency) of visual words found at level $l$ in both $Q$ and $\tilde{Q}$, but also captures the spatial layout (e.g., from top or from bottom) of these co-occurrences on the $2^l \times 2^l$ grid at this level.

Since the co-occurrences of visual words found at level $l$ also include all the co-occurrences found at the finer level $l+1$, the increment of the same co-occurrences found at level $l$ in both $Q$ and $\tilde{Q}$ is measured by $K^l(Q, \tilde{Q}) - K^{l+1}(Q, \tilde{Q})$ for $l = 0, ..., L-1$. The spatial spectrum kernels at multiple scales can then be combined by a weighted sum:
$$K_{ms}^L(Q, \tilde{Q}) = \sum_{l=0}^{L-1} \frac{1}{2^{L-l}} \left( K^l(Q, \tilde{Q}) - K^{l+1}(Q, \tilde{Q}) \right) + K^L(Q, \tilde{Q}) = \frac{1}{2^L} K^0(Q, \tilde{Q}) + \sum_{l=1}^{L} \frac{1}{2^{L-l+1}} K^l(Q, \tilde{Q}), \qquad (7)$$


where a coarser scale is assumed to play a less important role. When $L = 0$, the above multi-scale kernel $K_{ms}^L$ degrades to the original spatial spectrum kernel.
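The multi-scale combination of Eq. (7) can be sketched as follows; this assumes the ssk function from the sketch in Section III-A and a split_grid helper of our own that cuts a grid into the 4^l cells of level l:

```python
# Sketch of the multi-scale spatial spectrum kernel (Eqs. (6)-(7)).
def split_grid(grid, level):
    """Split a 2D grid (list of rows) into the 4**level cells of a 2^l x 2^l grid."""
    X, Y, n = len(grid), len(grid[0]), 2 ** level
    cells = []
    for bx in range(n):
        for by in range(n):
            rows = grid[bx * X // n:(bx + 1) * X // n]
            cells.append([row[by * Y // n:(by + 1) * Y // n] for row in rows])
    return cells

def K_level(grid_a, grid_b, level, s=2):
    # Eq. (6): sum the base kernel over corresponding cells at this level.
    return sum(ssk(ca, cb, s) for ca, cb in zip(split_grid(grid_a, level),
                                                split_grid(grid_b, level)))

def K_multiscale(grid_a, grid_b, L=1, s=2):
    # Eq. (7): 1/2^L * K^0 + sum_{l=1..L} 1/2^(L-l+1) * K^l.
    value = K_level(grid_a, grid_b, 0, s) / 2 ** L
    for l in range(1, L + 1):
        value += K_level(grid_a, grid_b, l, s) / 2 ** (L - l + 1)
    return value
```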

Fig. 3. The exponential functions $\Omega(x) = 1 - 2^{-\gamma x}$ used by our method. Here, we show two examples with $\gamma = 0.2$ or $0.8$. It can be observed that $\Omega(x) \le x$ ($x \in [0, m]$), where $m = 5$. (Figure: plots of $\Omega(x) = x$, $\Omega(x) = 1 - 2^{-0.8x}$, and $\Omega(x) = 1 - 2^{-0.2x}$ over $x \in [0, 5]$.)

IV. LEARNING WITH VISUAL AND SEMANTIC CONTEXT

In this section, we propose a contextual kernel method for keyword propagation based on our spatial spectrum kernel. Since the semantic context of keywords can also be exploited for keyword propagation, we succeed in learning the semantics of images using both visual and semantic context.

A. Notations and Problems

We first present the basic notations for automatic image annotation. Let $Q = \{(Q_i, W_i)\}_{i=1}^N$ denote the set of training images annotated with $m$ keywords, where $N$ is the number of training images. Here, $Q_i$ is the $i$-th training image represented as a 2D sequence of visual words, while $W_i$ contains the annotation keywords that are assigned to the image $Q_i$. We further employ a binary vector to represent a set of annotation keywords. In particular, for a keyword set $W_i$, its vector representation $t(W_i) = (t_{i,1}, ..., t_{i,m})^T$ has its $j$-th element set to 1 only when the $j$-th keyword $\in W_i$ and zero otherwise. Given a query image $Q$ from the test set, our goal is to determine a confidence vector $z = (z_1, ..., z_m)^T$ such that each element $z_j$ indicates the confidence score of assigning the $j$-th keyword to the query image $Q$.

Our contextual kernel method for keyword propagation derives from a class of single-step keyword propagation. Suppose the similarity between two images is measured by a kernel $K$. The confidence score of assigning the $j$-th keyword to the test image $Q$ could be estimated by
$$z_j = \sum_{i=1}^{N} K(Q, Q_i)\, t_{i,j}, \qquad (8)$$
where $t_{i,j}$ is set to 1 when the $j$-th keyword $\in W_i$ and zero otherwise. It should be noted that both graph-based semi-supervised learning [11] and the probabilistic relevance models [6]–[8] can be regarded as variants of the above kernel method for keyword propagation.

However, there are two problems with the above kernel method for keyword propagation. The first problem is that the confidence scores assigned to the test image $Q$ are overestimated. That is, all the training images are assumed to propagate their annotation keywords to $Q$, and in the meantime each training image $Q_i$ is assumed to propagate all of its keywords to $Q$. These two assumptions are not necessarily true in many complex real-world applications. The second problem is that each keyword is propagated from the training images to the test image $Q$ independently of the other keywords. That is, the keyword correlation information is not used for keyword propagation.
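For concreteness, here is a minimal sketch of the single-step propagation in Eq. (8), the baseline whose limitations are discussed above; T denotes the N x m binary keyword matrix with entries t_{i,j}:

```python
# Sketch of single-step keyword propagation (Eq. (8)).
import numpy as np

def propagate_all(k_query, T):
    """k_query: length-N vector of kernel values K(Q, Q_i);
    T: N x m binary keyword matrix. Returns the length-m confidence vector z."""
    return k_query @ T

# Toy example with 3 training images and 4 keywords.
k_query = np.array([0.9, 0.1, 0.4])
T = np.array([[1, 0, 1, 0],
              [0, 1, 0, 0],
              [1, 0, 0, 1]])
print(propagate_all(k_query, T))   # [1.3, 0.1, 0.9, 0.4]
```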

B. Contextual Keyword Propagation

To solve the above problems associated with automatic image annotation, we propose a contextual kernel method for keyword propagation in the following. First, so as not to overestimate the confidence scores assigned to the test image $Q$, we replace the equality constraint for keyword propagation in (8) with the following inequality constraint:
$$z_j \le \sum_{Q_i \in N(Q)} K(Q, Q_i)\, t_{i,j}, \qquad (9)$$
where $N(Q)$ is the set of $k$-nearest neighbors of the test image $Q$. The above inequality indicates that the confidence score $z_j$ propagated from the training images to the test image is upper bounded by the weighted sum of the pairwise similarity $K$ and cannot be obtained explicitly. Meanwhile, to avoid being misled by the training images that are far away (i.e. not in the same manifold structure), the test image is limited to absorbing the confidence scores only from its $k$-nearest neighbors.

Moreover, we exploit the keyword correlation information for keyword propagation so that the annotation keywords are not assigned to the test image independently. Given any set of annotation keywords $W$ represented as a binary vector $t = t(W) = (t_1, ..., t_m)^T \in \{0, 1\}^m$, it follows from (9) that
$$\sum_{j=1}^{m} z_j t_j \le \sum_{Q_i \in N(Q)} K(Q, Q_i) \sum_{j=1}^{m} t_{i,j} t_j. \qquad (10)$$
When the inequality is presented in the vector form of the annotation keywords, it can be simplified as
$$z^T t \le \sum_{Q_i \in N(Q)} K(Q, Q_i)\, t^T t(W_i). \qquad (11)$$
Hence, given $m$ different annotation keywords and the training examples $Q$, the confidence vector $z$ of assigning individual annotation keywords to the test image $Q$ is subject to the following constraints:
$$z^T t \le \sum_{Q_i \in N(Q)} K(Q, Q_i)\, t^T t(W_i), \quad \forall t \in \{0, 1\}^m; \qquad z_j \ge 0, \quad \forall j \in \{1, 2, ..., m\}. \qquad (12)$$

Actually, we can generalize the inner product of binary vectors of annotation keywords (i.e. $t^T t(W_i)$) to a concave function $\Omega$ (see examples in Fig. 3), which means that the above inequality constraints are forced to be tighter. Thus, the constraints in (12) are generalized in the following form:
$$z^T t \le \sum_{Q_i \in N(Q)} K(Q, Q_i)\, \Omega(t^T t(W_i)), \quad \forall t \in \{0, 1\}^m; \qquad z_j \ge 0, \quad \forall j \in \{1, 2, ..., m\}. \qquad (13)$$

In this paper, we only consider the exponential function $\Omega(x) = 1 - 2^{-\gamma x}$ ($\gamma > 0$), although there are other types of concave functions. As shown in Fig. 3, this function ensures that we can obtain tighter constraints in (13). Here, it should be noted that $t^T t(W_i) \in \{0, 1, ..., m\}$.

Since it is insufficient to identify the appropriate $z$ only with the constraints, we assume that among all the confidence scores that satisfy the constraints in (13), the optimal solution $z$ is the one that "maximally" satisfies the constraints. This assumption leads to the following optimization problem:
$$\max_{z \in R^m} \sum_{j=1}^{m} \beta_j z_j \quad \text{s.t.} \quad z^T t \le \sum_{Q_i \in N(Q)} K(Q, Q_i)\, \Omega(t^T t(W_i)), \ \forall t \in \{0, 1\}^m; \quad z_j \ge 0, \ \forall j \in \{1, 2, ..., m\}, \qquad (14)$$
where $\{\beta_j : \beta_j \ge 0\}_{j=1}^m$ are the weights of the annotation keywords. This is actually a linear programming problem, and we can solve it efficiently by the following discrete optimization algorithm [16]:
(1) Sort the $m$ annotation keywords so that $\beta_1 \ge \beta_2 \ge ... \ge \beta_m$.
(2) Compute $f(T_j) = \sum_{Q_i \in N(Q)} K(Q, Q_i)\, \Omega(t^T(T_j)\, t(W_i))$ for $j = 1, 2, ..., m$, where $T_j = \{1, 2, ..., j\}$.
(3) Set $f(T_0) = 0$, and output the confidence scores $z_j = f(T_j) - f(T_{j-1})$ for $j = 1, 2, ..., m$.
According to [15], the concavity of $\Omega$ ensures that the above algorithm can find the optimal solution of the linear programming problem defined in (14). Here, it should be noted that our algorithm differs from [15] in three ways: (1) the motivation of keyword propagation is explained in more detail and the constraints for linear programming are derived with fewer assumptions, (2) each test image is limited to absorbing the confidence scores only from its $k$-nearest neighbors in order to speed up the process of keyword propagation and avoid overestimating the confidence scores, and (3) the visual context is incorporated into keyword propagation by defining the similarity between images with our spatial spectrum kernel so that both visual and semantic context can be exploited for learning the semantics of images.

The above algorithm for contextual keyword propagation is denoted as CKP in the following. The time complexity of CKP is $O(km)$ for annotating a single query image. In this paper, we set $k \ll N$ (e.g. $k = 20$) to ensure that the annotation process is very efficient for a large image dataset.
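A minimal sketch of this three-step procedure is given below; the choice Omega(x) = 1 - 2**(-gamma*x) and the inverse-frequency ordering of the weights follow the text, while the variable names and data layout are our own assumptions:

```python
# Sketch of the CKP discrete optimization algorithm for the linear program (14).
import numpy as np

def ckp(k_neighbors, T_neighbors, beta, gamma=0.2):
    """k_neighbors: kernel values K(Q, Q_i) for the k nearest training images;
    T_neighbors: k x m binary keyword matrix of those neighbors;
    beta: length-m keyword weights. Returns the confidence vector z."""
    beta = np.asarray(beta, dtype=float)
    omega = lambda x: 1.0 - 2.0 ** (-gamma * x)
    order = np.argsort(-beta)                 # step (1): beta_1 >= beta_2 >= ...
    # Step (2): f(T_j), where t(T_j)^T t(W_i) counts how many of image i's
    # keywords fall inside the j highest-weighted keywords.
    overlap = np.zeros(len(k_neighbors))
    f = np.zeros(len(beta) + 1)               # f[0] = f(T_0) = 0
    for j, kw in enumerate(order, start=1):
        overlap += T_neighbors[:, kw]
        f[j] = np.dot(k_neighbors, omega(overlap))
    # Step (3): z_j = f(T_j) - f(T_{j-1}), mapped back to the original keyword order.
    z = np.zeros(len(beta))
    z[order] = np.diff(f)
    return z
```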

V. ANNOTATION REFINEMENT BY CONTEXTUAL SPECTRAL EMBEDDING

In this section, the semantic context of annotation keywords is further exploited for annotation refinement based on manifold learning techniques. That is, we first present our contextual spectral embedding using the semantic context of annotation keywords, and then perform annotation refinement in the more descriptive embedding space.

A. Contextual Spectral Embedding

To exploit the semantic context for spectral embedding, we first represent the correlation information of annotation keywords by the Pearson product moment (PPM) correlation measure [31] as follows. Given a set of $N$ training images annotated with $m$ keywords, we collect the histogram of keywords as $\{c_n(w_i) : n = 1, ..., N\}$ ($i = 1, ..., m$), where $c_n(w_i)$ is the count of times that keyword $w_i$ occurs in image $n$. The PPM correlation between two keywords $w_i$ and $w_j$ can be defined by:
$$a_{ij} = \frac{\sum_{n=1}^{N} (c_n(w_i) - \mu(w_i))(c_n(w_j) - \mu(w_j))}{(N-1)\,\sigma(w_i)\,\sigma(w_j)}, \qquad (15)$$
where $\mu(w_i)$ and $\sigma(w_i)$ are the mean and standard deviation of $\{c_n(w_i) : n = 1, ..., N\}$, respectively. It is worth noting that the semantic context of annotation keywords has actually been captured from the set of training images using the above correlation measure.

We now construct an undirected weighted graph for spectral embedding with the set of $m$ annotation keywords as the vertex set. We set the affinity matrix $A = \{a_{ij}\}_{m \times m}$ to measure the similarity between annotation keywords. The distinct advantage of using this similarity measure is that we have eliminated the need to tune any parameter for graph construction, which can significantly affect the performance and has been noted as an inherent weakness of graph-based methods. Here, it should be noted that the PPM correlation $a_{ij}$ will be negative if $w_i$ and $w_j$ are not positively correlated. In this case, we set $a_{ij} = 0$ to ensure that the affinity matrix $A$ is nonnegative. While the negative correlation does reveal useful information among the keywords and serves to measure the dissimilarity between the keywords, our goal here, however, is to compute the affinity (or similarity) between the keywords and to construct the affinity matrix of the graph used for spectral embedding. Although the dissimilarity information is not exploited directly, by setting the entries between the negatively correlated keywords to zero, we have effectively unlinked the negatively correlated keywords in the graph (e.g., given two keywords "sun" and "moon" that are unlikely to appear in the same image, we set their similarity to zero to ensure that they are not linked in the graph). In this way, we have made use of the negative correlation information for annotation refinement based on spectral embedding. In future work, we will look into other possible ways to make use of the negative correlation information for image annotation.

The goal of spectral embedding is to represent each vertex in the graph as a lower dimensional vector that preserves the similarities between the vertex pairs. Actually, this is equivalent to finding the leading eigenvectors of the normalized graph Laplacian $L = I - D^{-1}A$, where $D$ is a diagonal matrix with its $(i, i)$-element equal to the sum of the $i$-th row of the affinity matrix $A$.


In this paper, we only consider this type of normalized Laplacian [19], regardless of other normalized versions [18]. Let $\{(\lambda_i, v_i) : i = 1, ..., m\}$ be the set of eigenvalues and the associated eigenvectors of $L$, where $0 \le \lambda_1 \le ... \le \lambda_m$ and $v_i^T v_i = 1$. The spectral embedding of the graph can be represented by
$$E^{(m_e)} = (v_1, ..., v_{m_e}), \qquad (16)$$
with the $j$-th row $E^{(m_e)}_{j\cdot}$ being the new representation for vertex $w_j$. Since we usually set $m_e < m$, the annotation keywords have actually been represented as lower dimensional vectors. In the following, we will present our approach to annotation refinement using this more descriptive representation.
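A minimal sketch of this embedding step is given below; it assumes the keyword counts are collected in an N x m matrix C and uses numpy's Pearson correlation for Eq. (15), keeping the m_e smallest nontrivial eigenpairs of the normalized Laplacian as in Eq. (16):

```python
# Sketch of contextual spectral embedding (Eqs. (15)-(16)).
import numpy as np

def contextual_embedding(C, m_e):
    """C: N x m matrix of keyword counts c_n(w_i) on the training set."""
    A = np.corrcoef(C, rowvar=False)            # Pearson (PPM) correlation between keyword columns
    A = np.nan_to_num(A)                        # guard keywords with constant counts
    A[A < 0] = 0.0                              # unlink negatively correlated keywords
    d = A.sum(axis=1)
    d[d == 0] = 1.0                             # guard isolated keywords against division by zero
    L = np.eye(A.shape[0]) - A / d[:, None]     # normalized Laplacian I - D^{-1} A
    eigvals, eigvecs = np.linalg.eig(L)         # L is not symmetric, so use the general solver
    order = np.argsort(eigvals.real)
    idx = order[1:m_e + 1]                      # m_e smallest nontrivial eigenpairs
    return eigvals.real[idx], eigvecs.real[:, idx]
```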

Fig. 4. Illustration of annotation refinement in the spectral embedding space: (a) an example image associated with the ground truth, refined, and unrefined annotations (the incorrect keywords are red-highlighted); (b) annotation refinement based on linear neighborhoods. Here, N(·) denotes the set of top 7 keywords that are most highly correlated with a keyword, and in this neighborhood the keywords that also belong to the ground truth annotations of the image are blue-highlighted. (Figure: the ground truth annotations are {car, structure, overcast_sky, house, flower, yellow_lines, stone}, the refined annotations are {car, structure, overcast_sky, house, flower, yellow_lines, pole}, and the unrefined ones are {car, structure, overcast_sky, house, flower, pole, people}. In panel (b), the predicted scores of "yellow_lines", "people", and "pole" are 0.051, 0.055, and 0.061, and their refined scores are 0.067, 0.039, and 0.062; N(yellow_lines) = {car, structure, overcast_sky, house, flower, pole, white_lines}, N(people) = {north_stands, goalpost, football, banner, track, purple_stripe, husky_stadium}, and N(pole) = {structure, overcast_sky, flower, quad, sidewalk, cherry_trees, white_lines}.)

B. Annotation Refinement

To exploit the semantic context of annotation keywords for annotation refinement, the confidence scores of a query image estimated by our contextual keyword propagation can be adjusted based on linear neighborhoods in the new embedding space. The corresponding algorithm is summarized as follows:
(1) Find the $m_e$ smallest nontrivial eigenvectors $v_1, ..., v_{m_e}$ and associated eigenvalues $\lambda_1, ..., \lambda_{m_e}$ of $L = I - D^{-1}A$. Here, $A$ is the PPM correlation matrix.
(2) Form $E = [(1 - \lambda_1)v_1, ..., (1 - \lambda_{m_e})v_{m_e}]$, and normalize each row of $E$ to have unit length. Here, the $i$-th row $E_{i\cdot}$ is a new feature vector for keyword $w_i$.
(3) Compute the new affinity matrix between keywords as $R = \{r_{ij}\}_{m \times m} = EE^T$. Here, if $r_{ij} < 0$, we set $r_{ij} = 0$ to ensure that $R$ is nonnegative.
(4) Adjust the confidence scores of each query image by $\tilde{z}_i = \eta \sum_{w_j \in N(w_i)} r_{ij} z_j + (1 - \eta) z_i$, where $\eta$ is a weight parameter, $N(w_i)$ is the set of top $m_r$ keywords that are most highly correlated with keyword $w_i$, and $z_i$ is the confidence score of assigning keyword $w_i$ to this image.
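A minimal sketch of Steps (2)-(4) is given below; it reuses the eigenpairs from the embedding sketch in Section V-A and assumes, as an interpretation of the text, that N(w_i) excludes w_i itself:

```python
# Sketch of CSE annotation refinement, Steps (2)-(4).
import numpy as np

def cse_refine(z, eigvals, eigvecs, eta=0.5, m_r=7):
    """z: length-m confidence vector from CKP; eigvals/eigvecs: output of
    contextual_embedding; eta, m_r: weight and neighborhood-size parameters."""
    # Step (2): weight each eigenvector by (1 - lambda) and normalize rows to unit length.
    E = (1.0 - eigvals) * eigvecs
    E /= np.linalg.norm(E, axis=1, keepdims=True) + 1e-12
    # Step (3): new keyword-keyword affinity, with negative entries clipped to zero.
    R = np.clip(E @ E.T, 0.0, None)
    # Step (4): mix each score with those of its m_r most correlated keywords.
    z_new = np.empty_like(z, dtype=float)
    for i in range(len(z)):
        neighbors = np.argsort(-R[i])
        neighbors = neighbors[neighbors != i][:m_r]   # top m_r keywords other than w_i (our assumption)
        z_new[i] = eta * np.dot(R[i, neighbors], z[neighbors]) + (1 - eta) * z[i]
    return z_new
```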

It is worth noting that Step (2) slightly differs from Equation (16). Here, we aim to achieve better refinement results through preprocessing (i.e. weighting and normalizing) the new feature vectors. Moreover, in Step (4), we perform annotation refinement based on linear neighborhoods in the new embedding space, as illustrated in Fig. 4. More importantly, the example given by this figure presents a detailed explanation of how the semantic context of keywords encoded in the new embedding space is used to refine the annotations of the image. Before refinement, the three keywords "yellow lines", "people", and "pole" are ranked according to their predicted confidence scores as follows: "pole" > "people" > "yellow lines". Hence, the two keywords "people" and "pole" are incorrectly attached to the image, while the ground truth annotation "yellow lines" is wrongly discarded. However, we can find that the keyword "yellow lines" is highly semantically correlated with the ground truth annotations of the image (see the five blue-highlighted keywords in N(yellow lines) shown in Fig. 4). This semantic context is further exploited here for annotation refinement and the confidence score of "yellow lines" is accordingly increased to the largest among the three keywords,


i.e., this keyword can now be annotated correctly. On the contrary, since the keyword "people" is not at all semantically correlated with the ground truth annotations of the image, its confidence score is decreased to the smallest value and it is discarded successfully by our annotation refinement. Additionally, as for the keyword "pole", although it is not included in the ground truth annotations of the image, we can still consider that this keyword is semantically correlated with the image (see the three blue-highlighted keywords in N(pole) shown in Fig. 4).

The above algorithm for annotation refinement by contextual spectral embedding is denoted as CSE in the following. The time complexity of CSE is $O(m^3)$ for refining the annotations of a single query image (mainly for the spectral embedding in Step (1)). Since we have $m \ll N$, our algorithm is very efficient even for a large image dataset (see the later experiments on the IAPR dataset). Moreover, our algorithm for annotation refinement has another distinct advantage. That is, besides the semantic context of annotation keywords captured from the training images using the PPM correlation measure, other types of semantic context derived from prior knowledge (e.g. ontology) can also be readily exploited for annotation refinement by incorporating them into graph construction.

VI. EXPERIMENTAL RESULTS

In this section, our spatial spectrum kernel (SSK) combined with CKP and CSE (i.e. SSK+CKP+CSE) is compared to three other representative methods for image annotation: (1) spatial pyramid matching (SPM) [28] combined with CKP and CSE (i.e. SPM+CKP+CSE), (2) PLSA [29] combined with CKP and CSE (i.e. PLSA+CKP+CSE), and (3) multiple Bernoulli relevance models (MBRM) [7] combined with CSE (i.e. MBRM+CSE). Moreover, we also make a comparison between annotation using the semantic context and that without using the semantic context.

Fig. 5. Some annotated examples selected from UW (first row), Corel (second row), and IAPR (third row) image datasets. (Figure: twelve example images with annotations such as "tree bush grass sidewalk" and "building sky street tree grass" for UW, "coral fish ocean reefs" and "field foals horses mare" for Corel, and "city cloud mountain trees" and "hill lake mountain" for IAPR.)

through first quantizing these feature vectors into M visual words. In this paper, we consider a moderate vocabulary size M = 600 for the three image datasets. In the experiments, we divide the UW image dataset randomly into 909 training images and 200 test images, and annotate each test image with the top 7 keywords. For the Corel image dataset, we split it into 4,500 training images and 500 test images just as [20], and annotate each test image with the top 5 keywords. The IAPR image dataset is partitioned into 16,000 training images and 4,000 test images, and each test image is annotated with the top 9 keywords. After splitting the datasets, as with previous work, we evaluate the obtained annotations of the test images through the process of retrieving these test images with single keyword. For each keyword, the number of correctly annotated images is denoted as Nc , the number of retrieved images is denoted as Nr , and the number of truly related images in test set is denoted as Nt . Then the recall, precision, and F1 measures are computed as follows: recall = Nc /Nt , precision = Nc /Nr ,

(17)

F1 = 2 recall · precision/(recall + precision),

(18)

which are further averaged over all the keywords in the test set. Besides, we give a measure to evaluate the coverage of correctly annotated keywords, i.e. the number of keywords with recall >0 which is denoted by # keywords (recall>0). The measure is important because a biased model can achieve high precision and recall values by only performing quite well on a small number of common keywords. Since the solution returned by our CKP algorithm is dependent only on the relative order of the weights {βj }m j=1 of the annotation keywords, we only need to sort the weights without providing their exact values. One straightforward way is to order the weights to be in the reverse order of keyword frequency, namely βi ≥ βj ←→ pi ≤ pj , where pi is the frequency of the i-th keyword in the training set. Moreover, according to Fig. 6(a), we set k = 20 for our CKP algorithm on the UW dataset. Here, we can find that our CKP algorithm

9 0.48

0.48

0.48 SSK(L=0)

SSK(L=1)

SSK(s=2)

SSK(L=2) 0.44

0.4

0.4

0.4

0.36

F1

0.44

0.44

F1

F1

CKP

0.36

0.36

0.32

0.32

0.32

0.28

0.28

0.28

0.24

10

20

30 k

40

0.24

50

1

2 s

(a)

3

SPM

0.24

0

1 L

(b)

2

(c)

Fig. 6. The effect of different parameters on the annotation performance measured by F1 for the UW image dataset: (a) varying the neighborhood size k; (b) varying the length of subsequences s; (c) varying the scale L. 0.48

Fig. 7. Comparison between annotation using the semantic context (CKP and Refined) and that without using the semantic context (KP and Unrefined) on the UW image dataset: (a) keyword propagation vs. contextual keyword propagation; (b) unrefined annotation vs. refined annotation by contextual spectral embedding. (Figure: grouped F1 bars for SSK, SPM, and PLSA in (a), and for SSK, SPM, PLSA, and MBRM in (b).)

TABLE I
THE RESULTS OF ANNOTATION USING VISUAL AND SEMANTIC CONTEXT ON THE UW IMAGE DATASET

Methods         # keywords (recall>0)   recall   precision   F1
SSK+CKP+CSE     97                      0.486    0.398       0.438
SPM+CKP+CSE     94                      0.450    0.373       0.408
PLSA+CKP+CSE    87                      0.396    0.324       0.356
MBRM+CSE        83                      0.354    0.360       0.357

B. Results on UW Image Dataset

The results of annotation using visual and semantic context are averaged over 10 random partitions of the UW image dataset and then listed in Table I. We can observe that our annotation method (i.e. SSK+CKP+CSE) performs much better than all the other three methods. This observation may be due to the fact that our method not only exploits the context of annotation keywords for keyword propagation and annotation refinement but also captures the context of visual words within images to define the similarity between images for keyword propagation. That is, we have successfully exploited both visual and semantic context for image annotation. In particular, as compared with MBRM, which propagates a single keyword

independently of the other keywords, our annotation method leads to a 23% gain on the F1 measure through contextual keyword propagation using our spatial spectrum kernel.

We make further observations on the three methods that adopt CKP for image annotation. It is shown in Table I that both SSK and SPM achieve better results than PLSA, which does not consider the context of visual words within images. Moreover, since SPM can only capture the global context of visual words, our SSK performs better than SPM due to the fact that both local and global context are used to define the similarity between images. These observations show that the context of visual words indeed helps to improve the annotation performance of keyword propagation.

More importantly, to demonstrate the gain of exploiting the semantic context of annotation keywords for image annotation, we also compare annotation using this semantic context to annotation without using this semantic context. The comparison is shown in Fig. 7. Here, keyword propagation given by Equation (8) is denoted as KP (without using the semantic context), while our proposed contextual keyword propagation is denoted as CKP (using the semantic context). Meanwhile, the refined annotation results by contextual spectral embedding are denoted as Refined (using the semantic context), while the annotation results before refinement are denoted as Unrefined (without using the semantic context). We can observe from Fig. 7 that the semantic context of annotation keywords plays an important role in both keyword propagation and annotation refinement.


Fig. 8. Comparison between annotation using the semantic context (CKP and Refined) and that without using the semantic context (KP and Unrefined) on the Corel image dataset: (a) keyword propagation vs. contextual keyword propagation; (b) unrefined annotation vs. refined annotation by contextual spectral embedding.

TABLE II
THE RESULTS OF ANNOTATION USING VISUAL AND SEMANTIC CONTEXT ON THE COREL IMAGE DATASET

Methods              # keywords (recall>0)   recall   precision   F1
SSK+CKP+CSE          147                     0.349    0.287       0.315
SPM+CKP+CSE          138                     0.324    0.244       0.278
PLSA+CKP+CSE         113                     0.233    0.164       0.193
MBRM+CSE             114                     0.222    0.193       0.206
Liu et al. [21]      131                     0.291    0.253       0.271
Makadia et al. [32]  139                     0.320    0.270       0.293

TABLE III
THE RESULTS OF ANNOTATION USING VISUAL AND SEMANTIC CONTEXT ON THE IAPR IMAGE DATASET

Methods         # keywords (recall>0)   recall   precision   F1
SSK+CKP+CSE     145                     0.236    0.258       0.247
SSK+CKP         143                     0.233    0.243       0.238
SSK+KP          135                     0.206    0.221       0.213
SPM+CKP+CSE     129                     0.211    0.213       0.212

C. Results on Corel Image Dataset

The results of annotation using visual and semantic context on the Corel image dataset are listed in Table II. From this table, we can draw similar conclusions (compared to Table I). Through exploiting both visual and semantic context for keyword propagation and annotation refinement, our method still performs the best on this image dataset. Moreover, our method is also compared with more recent state-of-the-art methods [21], [32] using their reported results. As shown in Table II, our method outperforms [21], [32] because both visual and semantic context are used for learning the semantics of images. To the best of our knowledge, the results reported in [32] are the best in the literature. However, our method can still achieve an 8% gain on the F1 measure over this method. More importantly, we show in Fig. 8 the comparison between annotation using the semantic context of annotation keywords and annotation without using this semantic context. We can similarly find that this semantic context plays an important role in both keyword propagation and annotation refinement on this image dataset.

D. Results on IAPR Image Dataset

To verify that our annotation method is scalable to large image datasets, we present the annotation results on the IAPR dataset in Table III. In the experiments, we do not compare our annotation method with PLSA and MBRM, since PLSA needs huge memory and MBRM incurs a large time cost when

the data size is 20,000. From Table III, we find that our SSK can achieve a 17% gain over SPM (see SSK+CKP+CSE vs. SPM+CKP+CSE). That is, the visual context captured by our method indeed helps to improve the annotation performance. Moreover, we also find that both our CKP and CSE can achieve improved results by exploiting the semantic context of annotation keywords. Another distinct advantage of these kernel and spectral methods is that they are very scalable with respect to the data size. The time taken by our CKP and CSE on the large IAPR dataset is 21 minutes and 1 minute, respectively. We run these two algorithms (Matlab code) on a PC with a 2.33 GHz CPU and 2 GB RAM.

VII. CONCLUSIONS

We have proposed contextual kernel and spectral methods for learning the semantics of images in this paper. To capture the context of visual words within images, we first define a spatial string kernel to measure the similarity between images. Based on this spatial string kernel, we further formulate automatic image annotation as a contextual keyword propagation problem, which can be solved very efficiently by linear programming. Different from the traditional relevance models that treat each keyword independently, our contextual kernel method considers the semantic context of annotation keywords and propagates multiple keywords simultaneously from the training images to the test images. More importantly, such semantic context can also be incorporated into spectral embedding for refining the annotations of images predicted by keyword propagation. Experiments on three standard image datasets demonstrate that our contextual kernel and spectral methods can achieve superior results. In future work, these kernel and spectral methods will be extended to the temporal domain for problems such as video semantic learning and retrieval. Moreover, since our contextual kernel and spectral methods are very general techniques, they will be adopted to improve the performance of other machine learning methods that are widely used in computer vision and image processing.

ACKNOWLEDGEMENTS

The work was supported by the Research Council of Hong Kong under Grant CityU 114007, the City University of Hong Kong under Grant 7008040, and the National Natural Science Foundation of China under Grants 60873154 and 61073084.


REFERENCES

[1] R. Zhang and Z. Zhang, "Effective image retrieval based on hidden concept discovery in image database," IEEE Trans. Image Processing, vol. 16, no. 2, pp. 562–572, Feb. 2007.
[2] J. Li and J. Wang, "Automatic linguistic indexing of pictures by a statistical modeling approach," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 9, pp. 1075–1088, Sept. 2003.
[3] Y. Gao, J. Fan, X. Xue, and R. Jain, "Automatic image annotation by incorporating feature hierarchy and boosting to scale up SVM classifiers," in Proc. ACM Multimedia, 2006, pp. 901–910.
[4] E. Chang, G. Kingshy, G. Sychay, and G. Wu, "CBSA: Content-based soft annotation for multimodal image retrieval using Bayes point machines," IEEE Trans. Circuits and Systems for Video Technology, vol. 13, no. 1, pp. 26–38, Jan. 2003.
[5] J. Jeon, V. Lavrenko, and R. Manmatha, "Automatic image annotation and retrieval using cross-media relevance models," in Proc. SIGIR, 2003, pp. 119–126.
[6] V. Lavrenko, R. Manmatha, and J. Jeon, "A model for learning the semantics of pictures," in Advances in Neural Information Processing Systems 16, 2004, pp. 553–560.
[7] S. Feng, R. Manmatha, and V. Lavrenko, "Multiple Bernoulli relevance models for image and video annotation," in Proc. CVPR, vol. 2, 2004, pp. 1002–1009.
[8] J. Liu, B. Wang, M. Li, Z. Li, W. Ma, H. Lu, and S. Ma, "Dual cross-media relevance model for image annotation," in Proc. ACM Multimedia, 2007, pp. 605–614.
[9] Z. Lu and H. Ip, "Generalized relevance models for automatic image annotation," in Proc. Pacific Rim Conference on Multimedia, 2009, pp. 245–255.
[10] D. Zhou, J. Weston, A. Gretton, O. Bousquet, and B. Schölkopf, "Ranking on data manifolds," in Advances in Neural Information Processing Systems 16, 2004, pp. 169–176.
[11] J. Liu, M. Li, W. Ma, Q. Liu, and H. Lu, "An adaptive graph model for automatic image annotation," in Proc. ACM International Workshop on Multimedia Information Retrieval, 2006, pp. 61–70.
[12] Z. Lu, H. Ip, and Q. He, "Context-based multi-label image annotation," in Proc. ACM International Conference on Image and Video Retrieval, 2009, pp. 1–7.
[13] C. Leslie, E. Eskin, and W. Noble, "The spectrum kernel: A string kernel for SVM protein classification," in Proc. Pacific Symposium on Biocomputing, 2002, pp. 566–575.
[14] H. Escalante, C. Hernández, J. Gonzalez, A. López-López, M. Montes, E. Morales, L. Sucar, L. Villasenor, and M. Grubinger, "The segmented and annotated IAPR TC-12 benchmark," Computer Vision and Image Understanding, vol. 114, no. 4, pp. 419–428, 2010.
[15] F. Kang, R. Jin, and R. Sukthankar, "Correlated label propagation with application to multi-label learning," in Proc. CVPR, 2006, pp. 1719–1726.
[16] R. Parker and R. Rardin, Eds., Discrete Optimization. New York: Academic Press, 1988.
[17] S. Yan, D. Xu, B. Zhang, H. Zhang, Q. Yang, and S. Lin, "Graph embedding and extensions: A general framework for dimensionality reduction," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 1, pp. 40–51, Jan. 2007.
[18] A. Ng, M. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," in Advances in Neural Information Processing Systems 14, 2002, pp. 849–856.
[19] S. Lafon and A. Lee, "Diffusion maps and coarse-graining: A unified framework for dimensionality reduction, graph partitioning, and data set parameterization," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 9, pp. 1393–1403, Sept. 2006.
[20] P. Duygulu, K. Barnard, N. de Freitas, and D. Forsyth, "Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary," in Proc. ECCV, 2002, pp. 97–112.
[21] J. Liu, M. Li, Q. Liu, H. Lu, and S. Ma, "Image annotation via graph learning," Pattern Recognition, vol. 42, no. 2, pp. 218–228, 2009.
[22] S. Zhu, X. Ji, W. Xu, and Y. Gong, "Multi-labelled classification using maximum entropy method," in Proc. SIGIR, 2005, pp. 274–281.
[23] J. Rousu, C. Saunders, S. Szedmak, and J. Shawe-Taylor, "On maximum margin hierarchical multi-label classification," in Proc. NIPS Workshop on Learning with Structured Outputs, 2004, pp. 1–4.
[24] C. Wang, F. Jing, L. Zhang, and H.-J. Zhang, "Image annotation refinement using random walk with restarts," in Proc. ACM Multimedia, 2006, pp. 647–650.
[25] R. Behmo, N. Paragios, and V. Prinet, "Graph commute times for image representation," in Proc. CVPR, 2008, pp. 1–8.
[26] J. Li, W. Wu, T. Wang, and Y. Zhang, "One step beyond histograms: Image representation using Markov stationary features," in Proc. CVPR, 2008, pp. 1–8.
[27] A. Holub, M. Welling, and P. Perona, "Hybrid generative-discriminative object recognition," International Journal of Computer Vision, vol. 77, no. 1–3, pp. 239–258, 2008.
[28] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in Proc. CVPR, 2006, pp. 2169–2178.
[29] T. Hofmann, "Unsupervised learning by probabilistic latent semantic analysis," Machine Learning, vol. 41, no. 1–2, pp. 177–196, 2001.
[30] D. Blei, A. Ng, and M. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, no. 4–5, pp. 993–1022, 2003.
[31] J. Rodgers and W. Nicewander, "Thirteen ways to look at the correlation coefficient," The American Statistician, vol. 42, no. 1, pp. 59–66, 1988.
[32] A. Makadia, V. Pavlovic, and S. Kumar, "A new baseline for image annotation," in Proc. ECCV, 2008, pp. 316–329.

Zhiwu Lu received the M.Sc. degree in applied mathematics from Peking University, Beijing, China, in 2005. He is currently working toward the Ph.D. degree in the Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong. From July 2005 to August 2007, he was a Software Engineer with Founder Corporation, Beijing, China. From September 2007 to June 2008, he was a Research Assistant with the Institute of Computer Science and Technology, Peking University. He has published over 30 papers in international journals and conference proceedings including TIP, TSMC-B, TMM, CVPR, ECCV, AAAI, and ACM-MM. His research interests lie in pattern recognition, machine learning, multimedia information retrieval, and computer vision.

Horace H.S. Ip received the B.Sc. (first-class honors) degree in applied physics and the Ph.D. degree in image processing from the University College London, London, U.K., in 1980 and 1983, respectively. Currently, he is the Chair Professor of computer science, the Founding Director of the Centre for Innovative Applications of Internet and Multimedia Technologies (AIMtech Centre), and the Acting Vice-President of City University of Hong Kong, Kowloon, Hong Kong. He has published over 200 papers in international journals and conference proceedings. His research interests include pattern recognition, multimedia content analysis and retrieval, virtual reality, and technologies for education. He is a Fellow of the Hong Kong Institution of Engineers, a Fellow of the U.K. Institution of Electrical Engineers, and a Fellow of the IAPR.

Yuxin Peng received the Ph.D. degree in computer science and technology from Peking University, Beijing, China, in 2003. He joined the Institute of Computer Science and Technology, Peking University, as an assistant professor in 2003 and was promoted to a professor in 2010. From 2003 to 2004, he was a visiting scholar with the Department of Computer Science, City University of Hong Kong. His current research interests include content-based video retrieval, image processing, and pattern recognition.