IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 20, NO. 6, JUNE 2011
Contextual Kernel and Spectral Methods for Learning the Semantics of Images

Zhiwu Lu, Horace H. S. Ip, and Yuxin Peng
Abstract—This paper presents contextual kernel and spectral methods for learning the semantics of images that allow us to automatically annotate an image with keywords. First, to exploit the context of visual words within images for automatic image annotation, we define a novel spatial string kernel to quantify the similarity between images. Specifically, we represent each image as a 2-D sequence of visual words and measure the similarity between two 2-D sequences using the shared occurrences of $s$-length 1-D subsequences by decomposing each 2-D sequence into two orthogonal 1-D sequences. Based on our proposed spatial string kernel, we further formulate automatic image annotation as a contextual keyword propagation problem, which can be solved very efficiently by linear programming. Unlike the traditional relevance models that treat each keyword independently, the proposed contextual kernel method for keyword propagation takes into account the semantic context of annotation keywords and propagates multiple keywords simultaneously. Significantly, this type of semantic context can also be incorporated into spectral embedding for refining the annotations of images predicted by keyword propagation. Experiments on three standard image datasets demonstrate that our contextual kernel and spectral methods can achieve significantly better results than the state of the art.

Index Terms—Annotation refinement, kernel methods, keyword propagation, linear programming, spectral embedding, string kernel, visual words.
I. INTRODUCTION
With the rapid growth of image archives, there is an increasing need for effectively indexing and searching these images. Although many content-based image retrieval systems [1], [2] have been proposed, it is rather difficult for users to represent their queries using visual image features such as color and texture. Instead, most users prefer image search by textual queries, which is typically achieved by manually providing image annotations and then searching over these annotations using a textual query. However, manual annotation is an expensive and tedious task.
Manuscript received June 28, 2010; revised September 25, 2010; accepted December 13, 2010. Date of publication December 30, 2010; date of current version May 18, 2011. This work was supported in part by the Research Council of Hong Kong under Grant CityU 114007, the City University of Hong Kong under Grant 7008040, and the National Natural Science Foundation of China under Grants 60873154 and 61073084. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Sharath Pankanti.
Z. Lu and Y. Peng are with the Institute of Computer Science and Technology, Peking University, Beijing 100871, China (e-mail: [email protected]; [email protected]).
H. H. S. Ip is with the Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong (e-mail: [email protected]; [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIP.2010.2103082
Hence, automatic image annotation plays an important role in efficient image retrieval. Recently, many methods for learning the semantics of images based on machine learning techniques have emerged to pave the way for automatic annotation of an image with keywords, and we can roughly classify them into two categories.

The traditional methods for automatic image annotation treat each annotation keyword or concept as an independent class and train a corresponding classifier to identify images belonging to this class. This strategy has been adopted by methods such as linguistic indexing of pictures [2] and image annotation using the support vector machine [3] or Bayes point machine [4]. The problem with these classification-based methods is that they are not particularly scalable to a large-scale concept space. In the context of image annotation and retrieval, the concept space grows significantly large due to the large number (i.e., hundreds or even thousands) of keywords involved in the annotation of images. Therefore, the problems of semantic overlap and data imbalance among different semantic classes become very serious, which lead to significantly degraded classification performance.

Another category of automatic image annotation methods takes a different viewpoint and learns the correlation between images and annotation keywords by means of keyword propagation. Many of such methods are based on probabilistic generative models, among which an influential work is the cross-media relevance model [5] that estimates the joint probability of image regions and annotation keywords on the training image set. The relevance model for learning the semantics of images has subsequently been improved through the development of the continuous-space relevance model [6], the multiple Bernoulli relevance model [7], the dual cross-media relevance model [8], and, more recently, our generalized relevance model [9]. Moreover, graph-based semi-supervised learning [10] has also been applied to keyword propagation for automatic image annotation in [11]. However, these keyword propagation methods ignore either the context of image regions or the correlation information of annotation keywords.

This paper focuses on keyword propagation for learning the semantics of images. To overcome the problems with the above keyword propagation methods, we propose a 2-D string kernel, called the spatial spectrum kernel (SSK) [12], which quantifies the similarity between images and enables us to exploit the context of visual words within images for keyword propagation. To compute the proposed contextual kernel, we represent each image as a 2-D sequence of visual words and measure the similarity between two 2-D sequences using the shared occurrences of $s$-length 1-D subsequences by decomposing each 2-D sequence into two orthogonal 1-D sequences (i.e., the row-wise and column-wise ones). To the best of our knowledge, this
Fig. 1. Illustration of the proposed framework for learning the semantics of images using visual and semantic context.
is the first application of a string kernel for matching 2-D sequences of visual words. Here, it should be noted that string kernels were originally proposed for protein classification [13], and the number of amino acids (similar to the visual words used here) typically involved in the kernel definition was very small. In contrast, in the present work, string kernels are used to capture and compare the context of a large number of visual words within an image, and the associated problem of sequence matching becomes significantly more challenging. As compared with our previous work [12], this paper presents significantly more extensive and convincing results, in particular, on the large IAPR TC-12 image dataset [14]. More importantly, we further present a new and significant technical development that addresses the issue of annotation refinement (see Section V). The novelty of the proposed refinement method is that it directly considers the manifold structure of annotation keywords, which gives rise to additional new and significant contributions upon our previous work.

Moreover, to exploit the semantic context of annotation keywords for automatically learning the semantics of images, a contextual kernel method is proposed based on our spatial spectrum kernel. We first formulate automatic image annotation as a contextual keyword propagation problem where multiple keywords can be propagated simultaneously from the training images to the test images. Meanwhile, in order not to be misled by the training images that are far away (i.e., not in the same manifold), each test image is limited to absorbing the keyword information (e.g., confidence scores) only from its $k$-nearest neighbors. Since this contextual keyword propagation problem is further solved very efficiently by linear programming [15], [16], our contextual kernel method is highly scalable and can be applied to large image datasets. It should be noted that our contextual keyword propagation distinguishes itself from previous work in that multiple keywords can be propagated simultaneously, which means that the semantic context of annotation keywords can be exploited for learning the semantics of images. More importantly, this type of semantic context can be further used for refining the annotations predicted by keyword propagation. Here, we first obtain spectral embedding [17]–[19] by exploiting the semantic context of annotation keywords and then perform annotation refinement in the resulting embedding space.

Finally, the above contextual kernel and spectral methods for learning the semantics of images can be integrated in a unified framework as shown in Fig. 1, which contains three main components: visual context analysis with the spatial spectrum kernel, learning with visual and semantic context, and annotation refinement by contextual spectral embedding. In this paper, the proposed framework is tested on three standard image datasets: University of Washington (UW), Corel [20], and IAPR [14]. Particularly, the Corel image dataset has been widely used
for the evaluation of image annotation [7], [21]. Experimental results on these image datasets demonstrate that the proposed framework outperforms the state-of-the-art methods. In summary, the proposed framework has the following advantages.
1) Our spatial string kernel, defined as a similarity measure between images, can capture the context of visual words within images.
2) Our contextual spectral embedding method directly considers the manifold structure of annotation keywords for annotation refinement, and more importantly, the semantic context of annotation keywords can be incorporated into manifold learning.
3) Our kernel and spectral methods can achieve promising results by exploiting both visual and semantic context for learning the semantics of images.
4) Our contextual kernel and spectral methods are very scalable with respect to the data size and can be used for large-scale image applications.
5) Our contextual kernel and spectral methods are very general techniques and have the potential to improve the performance of other machine learning methods that are widely used in computer vision and image processing.

The remainder of this paper is organized as follows. Section II gives a brief review of previous work. In Section III, we present our spatial spectrum kernel to capture the context of visual words, which can be further used for keyword propagation. In Section IV, we propose our contextual kernel method for keyword propagation based on the proposed spatial spectrum kernel. In Section V, the annotations predicted by our contextual keyword propagation are further refined by a novel contextual spectral embedding. Section VI presents the evaluation of the proposed framework on three standard image datasets. Finally, Section VII gives conclusions drawn from our experimental results.

II. RELATED WORK

Our keyword propagation method differs from the traditional approaches that are based on the relevance model [5]–[7] and graph-based semi-supervised learning [11] in that the keyword correlation information has been exploited for image annotation. Although much effort has also been made in [22] to exploit the keyword correlation information, it was limited to pairwise correlation of annotation keywords. In contrast, our method can simultaneously propagate multiple keywords from the training images to the test images. In [23], a particular structure of the annotation keywords was assumed in order to exploit the keyword correlation information. We argue that such an assumption could be violated in practice because the relationships between annotation keywords may become too complicated. On
the contrary, our method can exploit the semantic context of annotation keywords of any order. This semantic context is further exploited for refining the annotations of images predicted by our contextual keyword propagation. Specifically, we first obtain contextual spectral embedding by incorporating the semantic context of annotation keywords into graph construction, and then perform annotation refinement in the obtained more descriptive embedding space. This differs from previous methods, e.g., [21], [24], which directly exploited the semantic context of keywords for annotation refinement, without considering the manifold structure hidden among them.

More importantly, another type of context has also been incorporated into our contextual keyword propagation. This can be achieved by first representing each image as a 2-D sequence of visual words and then defining a spatial string kernel to capture the context of visual words. This contextual kernel can be used as a similarity measure between images for keyword propagation. In fact, both local and global context can be captured in our work. The spatial dependency between visual words learnt with our spatial spectrum kernel can be regarded as the local context, while the spatial layout of visual words obtained with multiscale kernel combination (see Section III-C) provides the global context. In the literature, most previous methods only considered either local or global context of visual words. For example, the collapsed graph [25] and Markov stationary analysis [26] only learnt the local context, while the constellation model [27] and spatial pyramid matching [28] only captured the global context.

To reduce the semantic gap between visual features and semantic annotations, we make use of an intermediate representation with a learnt vocabulary of visual words, which is similar to the bag-of-words methods such as probabilistic latent semantic analysis (PLSA) [29] and latent Dirichlet allocation (LDA) [30]. However, these methods typically ignore the spatial structure of images because the regions within images are assumed to be independently drawn from a mixture of latent topics. In contrast, our present work captures the spatial context of regions based on the proposed spatial spectrum kernel. It is shown in our experiments that this type of visual context is effective for keyword propagation in the challenging application of automatic image annotation.

III. VISUAL CONTEXT ANALYSIS WITH SPATIAL SPECTRUM KERNEL (SSK)

To capture the context of visual words within images, we propose an SSK which can be used as a similarity measure between images for keyword propagation. We further present an efficient kernel computation method based on a tree data structure. Finally, we propose multiscale kernel combination to capture the global layout of visual words within images. Hence, both local and global context within images can be captured and exploited in our present work.

A. Kernel Definition

Similar to the bag-of-words methods, we first divide images into equivalent blocks on a regular grid and then extract some representative properties from each block by incorporating the
color and texture features. Through performing $k$-means clustering on the extracted feature vectors, we generate a vocabulary $V$ of visual words which describes the content similarities among the image blocks. Based on this universal vocabulary $V$, each block is annotated automatically with a visual word, and an image is subsequently represented by a 2-D sequence $Q$ of visual words.

The basic idea of defining a spatial spectrum kernel is to map the 2-D sequence $Q$ into a high-dimensional feature space: $Q \mapsto \Phi_s(Q)$. We first scan this 2-D sequence in the horizontal and vertical directions, which results in a row-wise 1-D sequence $Q^r$ and a column-wise 1-D sequence $Q^c$, respectively. The feature mapping can be formulated as follows:
$$\Phi_s(Q) = \left[\Phi_s(Q^r), \Phi_s(Q^c)\right] \tag{1}$$
where $\Phi_s(Q^r)$ and $\Phi_s(Q^c)$ denote the feature vectors for the row-wise and column-wise sequences, respectively. The above formulation means that these two feature vectors are stacked together to form a higher dimensional feature vector for the original 2-D sequence $Q$. More formally, for an image that is divided into $n_r \times n_c$ blocks on a regular grid, we can now denote it as a row-wise sequence $Q^r$ and a column-wise one $Q^c$ as
$$Q^r = q_{11} q_{12} \cdots q_{1 n_c} q_{21} \cdots q_{n_r n_c} \tag{2}$$
$$Q^c = q_{11} q_{21} \cdots q_{n_r 1} q_{12} \cdots q_{n_r n_c} \tag{3}$$
where $q_{ij} \in V$ is the visual word of block $(i, j)$ in the image. In the following, we will only give the details of the feature mapping for the row-wise sequences. The column-wise sequences can be mapped to a high-dimensional feature space similarly.

Since the $s$-spectrum of an input sequence is the set of all of the $s$-length subsequences that it contains, our feature mapping used to define the spatial spectrum kernel is indexed by all possible $s$-length subsequences $\alpha$ from the vocabulary (i.e., $\alpha \in V^s$), that is, we can define the following mapping that takes $Q^r$ to a $|V|^s$-dimensional feature space:
$$\Phi_s(Q^r) = \left(\phi_\alpha(Q^r)\right)_{\alpha \in V^s} \tag{4}$$
where $\phi_\alpha(Q^r)$ is the number of times that $\alpha$ occurs in $Q^r$. We can find that $\Phi_s(Q^r)$ in the feature space is now a weighted representation of the $s$-spectrum of $Q^r$. For example, given $V = \{A, B\}$ and $s = 2$, all of the possible $s$-length subsequences (i.e., the elements of $V^2$) are AA, AB, BA, and BB, and the feature vector of a row-wise sequence simply counts how many times each of these four subsequences occurs in it. Since the feature mapping for the column-wise sequences can be defined similarly, our spatial spectrum kernel can be computed as the following inner product:
$$K_s(Q, \tilde{Q}) = \left\langle \Phi_s(Q), \Phi_s(\tilde{Q}) \right\rangle \tag{5}$$
where $Q$ and $\tilde{Q}$ are two 2-D sequences (i.e., two images). Although the 2-D sequences are mapped to a high-dimensional (i.e., $|V|^s$-dimensional) feature space even for fairly small $s$, the feature vectors are extremely sparse: the number of nonzero coordinates is bounded by the number of $s$-length subsequences contained in the two scanned 1-D sequences. This property enables us to compute our SSK very efficiently.
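As a concrete illustration of the feature mapping (1)–(5), the following Python sketch builds the $s$-spectrum counts of the row-wise and column-wise scans of two visual-word grids and takes their inner product. It is only a naive reference implementation under assumed toy inputs (the grid sizes, vocabulary, and helper names are ours, not the paper's), and it ignores the efficient computation discussed next.

```python
# A minimal sketch of the s-spectrum feature mapping and the spatial spectrum
# kernel of (1)-(5); the grid size, vocabulary, and helper names below are
# hypothetical, not taken from the paper.
from collections import Counter

import numpy as np


def scan_sequences(grid):
    """Return the row-wise and column-wise 1-D scans of a 2-D word grid."""
    grid = np.asarray(grid)
    return list(grid.flatten(order="C")), list(grid.flatten(order="F"))


def spectrum_counts(seq, s):
    """Count every s-length subsequence (the s-spectrum) of a 1-D sequence."""
    return Counter(tuple(seq[i:i + s]) for i in range(len(seq) - s + 1))


def ssk(grid_a, grid_b, s=2):
    """Spatial spectrum kernel: inner product of stacked row/column spectra."""
    value = 0
    for scan_a, scan_b in zip(scan_sequences(grid_a), scan_sequences(grid_b)):
        counts_a, counts_b = spectrum_counts(scan_a, s), spectrum_counts(scan_b, s)
        # Only shared subsequences contribute to the inner product.
        value += sum(counts_a[w] * counts_b[w] for w in counts_a if w in counts_b)
    return value


# Toy example: two 4x4 "images" over a 3-word vocabulary {0, 1, 2}.
rng = np.random.default_rng(0)
img_a, img_b = rng.integers(0, 3, (4, 4)), rng.integers(0, 3, (4, 4))
print(ssk(img_a, img_b, s=2))
```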
Fig. 2. Suffix tree constructed to compute the kernel value for two example sequences $Q = \text{ABCA}$ and $\tilde{Q} = \text{BCAA}$, with $V = \{A, B, C\}$ and $s = 2$: (a) the tree for $Q$ and (b) the tree after $Q$ is compared with $\tilde{Q}$. Each branch of the tree is labeled with a visual word from $V$, and each rectangular node denotes a leaf that stores two counts: one represents the number of times that an $s$-length subsequence of $Q$ ends at the leaf, while the other represents a similar count for $\tilde{Q}$.
B. Efficient Kernel Computation

A very efficient method for computing $K_s(Q, \tilde{Q})$ is to build a suffix tree for the collection of $s$-length subsequences of $Q$ and $\tilde{Q}$, obtained by moving an $s$-length sliding window across either of $Q$ and $\tilde{Q}$. Each branch of the tree is labeled with a visual word from $V$. Each depth-$s$ leaf node of the tree stores two counts: one represents the number of times that an $s$-length subsequence of $Q$ ends at the leaf, while the other represents a similar count for $\tilde{Q}$. Fig. 2 shows a suffix tree constructed to compute the kernel value for two example sequences $Q$ and $\tilde{Q}$, where $V = \{A, B, C\}$ and $s = 2$. To compare these two sequences, we first construct a suffix tree to collect all of the $s$-length subsequences of $Q$. Moreover, to make the kernel computation more efficient, we ignore the $s$-length subsequences of $\tilde{Q}$ that do not occur in $Q$, as they do not contribute to the kernel computation. Therefore, these subsequences (e.g., AA) are not shown in Fig. 2. It should be noted that this suffix tree has $O(s\, n_r n_c)$ nodes because each 2-D sequence on an $n_r \times n_c$ grid only yields $O(n_r n_c)$ row-wise (or column-wise) $s$-length subsequences. Using a linear time construction algorithm for the suffix tree, we can build and annotate the suffix tree with a time cost of $O(s\, n_r n_c)$. The kernel value is then calculated by traversing the suffix tree and computing the sum of the products of the counts stored at the depth-$s$ nodes. Hence, the overall time cost of calculating the spatial spectrum kernel is $O(s\, n_r n_c)$. Moreover, this idea of efficient kernel computation can be similarly used to build a suffix tree for all of the input sequences at once and compute all of the kernel values in one traversal of the tree. This is essentially the method that we adopt to compute our kernel matrices in later experiments, though we use a recursive function rather than explicitly constructing the suffix tree.
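The following sketch mimics the "all sequences at once" strategy described above, but replaces the explicit suffix tree with a hash map over $s$-length subsequences: each map entry plays the role of a depth-$s$ leaf storing per-image counts, and one pass over the entries accumulates every pairwise kernel value. The function and variable names are illustrative assumptions, not code from the paper.

```python
# A sketch of computing the whole kernel matrix at once: a dictionary keyed by
# (scan direction, s-length subsequence) stands in for the depth-s leaves of a
# generalized suffix tree; traversing its entries yields all pairwise values.
from collections import Counter, defaultdict

import numpy as np


def ssk_gram_matrix(grids, s=2):
    """Kernel matrix over a list of 2-D visual-word grids."""
    n = len(grids)
    # leaves[(direction, subsequence)] -> Counter{image index: occurrence count}
    leaves = defaultdict(Counter)
    for idx, grid in enumerate(grids):
        grid = np.asarray(grid)
        for d, seq in enumerate((grid.flatten(order="C"), grid.flatten(order="F"))):
            for i in range(len(seq) - s + 1):
                leaves[(d, tuple(seq[i:i + s]))][idx] += 1
    # One traversal of the "leaves": multiply the stored counts per entry.
    K = np.zeros((n, n))
    for counts in leaves.values():
        items = list(counts.items())
        for a, (i, ci) in enumerate(items):
            for j, cj in items[a:]:
                K[i, j] += ci * cj
                if i != j:
                    K[j, i] += ci * cj
    return K


grids = [np.random.default_rng(t).integers(0, 5, (8, 8)) for t in range(3)]
print(ssk_gram_matrix(grids, s=2))
```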
C. Multiscale Kernel Combination

We further take into account multiscale kernel combination to capture the global layout of visual words within images. Similar to the idea of the wavelet transform, we place a series of increasingly finer grids over the 2-D sequences of visual words, that is, each subsequence at level $l$ will be divided into 2 × 2 parts at level $l + 1$, where $l = 0, 1, \ldots, L$ and $L$ is the finest scale. Hence, we can obtain $4^l$ subsequences at level $l$. Based on these subsequences, we can define a series of spatial spectrum kernels and then combine them by a weighted sum. Let $Q^{(l,i)}$ be the $i$th subsequence at level $l$ for a 2-D sequence $Q$, that is, $Q^{(l,i)}$ is in the $i$th cell on the grid at this level. The spatial spectrum kernel at this scale can be computed as follows:
$$K_s^{l}(Q, \tilde{Q}) = \sum_{i=1}^{4^l} K_s\!\left(Q^{(l,i)}, \tilde{Q}^{(l,i)}\right) \tag{6}$$
where $Q$ and $\tilde{Q}$ are two sequences, that is, we first define a spatial spectrum kernel for each subsequence at level $l$ and then take a sum of the obtained kernels. Intuitively, $K_s^{l}$ not only measures the number of the same co-occurrences (i.e., spatial dependency) of visual words found at level $l$ in both $Q$ and $\tilde{Q}$, but also captures the spatial layout (e.g., from top or from bottom) of these co-occurrences on the grid at this level. Since the co-occurrences of visual words found at level $l$ also include all of the co-occurrences found at the finer level $l + 1$, the increment of the same co-occurrences found at level $l$ in both $Q$ and $\tilde{Q}$ is measured by $K_s^{l} - K_s^{l+1}$ for $l = 0, \ldots, L - 1$. The spatial spectrum kernels at multiple scales can then be combined by a weighted sum
$$\hat{K}_s(Q, \tilde{Q}) = K_s^{L}(Q, \tilde{Q}) + \sum_{l=0}^{L-1} \frac{1}{2^{L-l}}\left(K_s^{l}(Q, \tilde{Q}) - K_s^{l+1}(Q, \tilde{Q})\right) \tag{7}$$
where a coarser scale is assumed to play a less important role. When $L = 0$, the above multiscale kernel degrades to the original spatial spectrum kernel.
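A sketch of the multiscale combination is given below. It reuses the ssk() helper from the earlier sketch, splits each grid into $2^l \times 2^l$ cells per level, and combines the level kernels with spatial-pyramid-style weights; the weighting in (7) is our reconstruction, so the weights here should be read as an assumption rather than the paper's definitive choice.

```python
# A sketch of multiscale kernel combination under assumed spatial-pyramid-style
# weights; it relies on the ssk() helper defined in the earlier sketch.
import numpy as np


def level_kernel(grid_a, grid_b, level, s=2):
    """Sum of spatial spectrum kernels over corresponding cells at one level."""
    cells = 2 ** level
    a, b = np.asarray(grid_a), np.asarray(grid_b)
    row_parts = np.array_split(np.arange(a.shape[0]), cells)
    col_parts = np.array_split(np.arange(a.shape[1]), cells)
    total = 0
    for r in row_parts:
        for c in col_parts:
            total += ssk(a[np.ix_(r, c)], b[np.ix_(r, c)], s)
    return total


def multiscale_ssk(grid_a, grid_b, L=2, s=2):
    """Weighted combination of level kernels; coarser increments weigh less."""
    K = [level_kernel(grid_a, grid_b, l, s) for l in range(L + 1)]
    return K[L] + sum((K[l] - K[l + 1]) / 2 ** (L - l) for l in range(L))
```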
IV. LEARNING WITH VISUAL AND SEMANTIC CONTEXT
Here, we propose a contextual kernel method for keyword propagation based on our spatial spectrum kernel. Since the semantic context of keywords can also be exploited for keyword propagation, we succeed in learning the semantics of images using both visual and semantic context.

A. Notations and Problems

We first present the basic notations for automatic image annotation. Let $\{(Q_i, W_i)\}_{i=1}^{n}$ denote the set of training images annotated with keywords, where $n$ is the number of training images. Here, $Q_i$ is the $i$th training image represented as a 2-D sequence of visual words, while $W_i$ contains the annotation keywords that are assigned to the image $Q_i$. We further employ a binary vector to represent a set of annotation keywords. In particular, for a keyword set $W_i$, its vector representation $\mathbf{w}_i = (w_{i1}, \ldots, w_{im})$ has its $j$th element set to 1 only when the $j$th keyword belongs to $W_i$ and zero otherwise, where $m$ is the number of annotation keywords. Given a query image $Q$ from the test set, our goal is to determine a confidence vector $\mathbf{f} = (f_1, \ldots, f_m)$ such that each element $f_j$ indicates the confidence score of assigning the $j$th keyword to the query image $Q$.

Our contextual kernel method for keyword propagation derives from a class of single-step keyword propagation. Suppose the similarity between two images is measured by a kernel $K$. The confidence score of assigning the $j$th keyword to the test image $Q$ could be estimated by
$$f_j = \sum_{i=1}^{n} K(Q, Q_i)\, w_{ij} \tag{8}$$
where $w_{ij}$ is set to 1 when the $j$th keyword belongs to $W_i$ and zero otherwise. It should be noted that both graph-based semi-supervised learning [11] and the probabilistic relevance models [6]–[8] can be regarded as variants of the above kernel method for keyword propagation.

However, there are two problems with the above kernel method for keyword propagation. The first problem is that the confidence scores assigned to the test image are overestimated, that is, all of the training images are assumed to propagate their annotation keywords to $Q$, and in the meantime each training image is assumed to propagate all of its keywords to $Q$. These two assumptions are not necessarily true in many complex real-world applications. The second problem is that each keyword is propagated from the training images to the test image independently of the other keywords, that is, the keyword correlation information is not used for keyword propagation.

B. Contextual Keyword Propagation

To solve the above problems associated with automatic image annotation, we propose a contextual kernel method for keyword propagation in the following. First, in order not to overestimate the confidence scores assigned to the test image $Q$, we replace the equality constraint for keyword propagation in (8) with the following inequality constraint:
$$f_j \le \sum_{Q_i \in \mathcal{N}_k(Q)} K(Q, Q_i)\, w_{ij} \tag{9}$$
where $\mathcal{N}_k(Q)$ is the set of $k$-nearest neighbors of the test image $Q$. The above inequality indicates that the confidence score propagated from the training images to the test image is upper bounded by the weighted sum of the pairwise similarities and cannot be obtained explicitly. Meanwhile, in order not to be misled by the training images that are far away (i.e., not in the same manifold structure), the test image is limited to absorbing the confidence scores only from its $k$-nearest neighbors.

Moreover, we exploit the keyword correlation information for keyword propagation so that the annotation keywords are not assigned to the test image independently. Given any set of annotation keywords represented as a binary vector $\mathbf{w}$, it follows from (9) that
$$\sum_{j:\, w_j = 1} f_j \le \sum_{Q_i \in \mathcal{N}_k(Q)} K(Q, Q_i) \sum_{j:\, w_j = 1} w_{ij}. \tag{10}$$
When the inequality is presented in the vector form of the annotation keywords, it can be simplified as
$$\mathbf{f}^{T}\mathbf{w} \le \sum_{Q_i \in \mathcal{N}_k(Q)} K(Q, Q_i)\, \mathbf{w}_i^{T}\mathbf{w}. \tag{11}$$
Hence, given the $m$ different annotation keywords and the training examples $\{(Q_i, W_i)\}_{i=1}^{n}$, the confidence vector $\mathbf{f}$ of assigning individual annotation keywords to the test image is subject to the following constraints:
$$\mathbf{f}^{T}\mathbf{w} \le \sum_{Q_i \in \mathcal{N}_k(Q)} K(Q, Q_i)\, \mathbf{w}_i^{T}\mathbf{w}, \quad \forall\, \mathbf{w} \in \{0, 1\}^{m}. \tag{12}$$
Actually, we can generalize the inner product of binary vectors of annotation keywords (i.e., $\mathbf{w}_i^{T}\mathbf{w}$) to a concave function $g(\mathbf{w}_i^{T}\mathbf{w})$ (see examples in Fig. 3), which means that the above inequality constraints are forced to be tighter. Thus, the constraints in (12) are generalized in the following form:
$$\mathbf{f}^{T}\mathbf{w} \le \sum_{Q_i \in \mathcal{N}_k(Q)} K(Q, Q_i)\, g\!\left(\mathbf{w}_i^{T}\mathbf{w}\right), \quad \forall\, \mathbf{w} \in \{0, 1\}^{m}. \tag{13}$$
In this paper, we only consider the exponential function $g(x) = 1 - \beta^{x}$ ($0 < \beta < 1$), although there are other types of concave functions. As shown in Fig. 3, this function ensures that we can obtain tighter constraints in (13). Here, it should be noted that $g(x) \le x$ for the integer values of $x$ that arise here.
Fig. 3. Exponential functions $g(x) = 1 - \beta^{x}$ used by our method. Here, we show two examples with $\beta = 0.2$ or $0.8$. It can be observed that $g(x) \le x$ ($x \in [0, m]$), where $m = 5$.
Since it is insufficient to identify the appropriate confidence vector $\mathbf{f}$ only with the constraints, we assume that, among all of the confidence scores that satisfy the constraints in (13), the optimal solution is the one that "maximally" satisfies the constraints. This assumption leads to the following optimization problem:
$$\max_{\mathbf{f} \ge \mathbf{0}}\; \sum_{j=1}^{m} c_j f_j \quad \text{subject to the constraints in (13)} \tag{14}$$
where $c_1, \ldots, c_m$ are the weights of annotation keywords. This is actually a linear programming problem, and we can solve it efficiently by the following discrete optimization algorithm [16].
Step 1) Sort the annotation keywords so that $c_{\pi(1)} \ge c_{\pi(2)} \ge \cdots \ge c_{\pi(m)}$.
Step 2) Compute $\delta_{ij} = g\big(\sum_{l \le j} w_{i\pi(l)}\big) - g\big(\sum_{l \le j-1} w_{i\pi(l)}\big)$ for each $Q_i \in \mathcal{N}_k(Q)$ and $j = 1, \ldots, m$, where the empty sum is taken as zero.
Step 3) Set $f_{\pi(j)} = \sum_{Q_i \in \mathcal{N}_k(Q)} K(Q, Q_i)\, \delta_{ij}$ and output the confidence scores $f_{\pi(j)}$ for $j = 1, \ldots, m$.
According to [15], the concavity of $g$ ensures that the above algorithm can find the optimal solution of the linear programming problem defined in (14). Here, it should be noted that our algorithm differs from [15] in three ways: 1) the motivation of keyword propagation is explained in more detail and the constraints for linear programming are derived with fewer assumptions; 2) each test image is limited to absorbing the confidence scores only from its $k$-nearest neighbors in order to speed up the process of keyword propagation and avoid overestimating the confidence scores; and 3) the visual context is incorporated into keyword propagation by defining the similarity between images with our spatial spectrum kernel so that both visual and semantic context can be exploited for learning the semantics of images.

The above algorithm for contextual keyword propagation is denoted as CKP in the following. The time complexity of CKP for annotating a single query image is dominated by sorting the $m$ keywords and accumulating the increments over the $k$-nearest neighbors. In this paper, we set $k$ to a small value to ensure that the annotation process is very efficient for a large image dataset.
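A minimal sketch of CKP for a single test image is given below, following the greedy construction of correlated label propagation [15] with $g(x) = 1 - \beta^{x}$ and the $k$-nearest-neighbor restriction. The kernel values, keyword weights, and all names are illustrative assumptions; the sketch is meant to show the order of operations (sort, accumulate concave increments, output scores), not the authors' exact implementation.

```python
# A sketch of contextual keyword propagation for one test image; all names and
# the toy inputs are illustrative assumptions.
import numpy as np


def ckp(kernel_row, labels, weights, k=5, beta=0.5):
    """Propagate keywords to one test image.

    kernel_row : (n,) kernel values K(Q, Q_i) to the n training images
    labels     : (n, m) binary keyword matrix of the training images
    weights    : (m,) keyword weights c_j (e.g., inversely related to frequency)
    """
    kernel_row, labels = np.asarray(kernel_row, float), np.asarray(labels, float)
    n, m = labels.shape
    neighbours = np.argsort(kernel_row)[::-1][:k]      # k-nearest neighbours
    order = np.argsort(weights)[::-1]                  # Step 1: sort keywords by weight
    g = lambda x: 1.0 - beta ** x                      # concave g with g(0) = 0
    f = np.zeros(m)
    for i in neighbours:
        cum = np.cumsum(labels[i, order])              # running keyword counts
        increments = g(cum) - g(np.concatenate(([0.0], cum[:-1])))
        f[order] += kernel_row[i] * increments         # Steps 2-3
    return f


# Toy usage: 6 training images, 4 keywords.
rng = np.random.default_rng(1)
print(ckp(rng.random(6), rng.integers(0, 2, (6, 4)), rng.random(4), k=3))
```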
V. ANNOTATION REFINEMENT BY CONTEXTUAL SPECTRAL EMBEDDING

Here, the semantic context of annotation keywords is further exploited for annotation refinement based on manifold learning techniques, that is, we first present our contextual spectral embedding using the semantic context of annotation keywords, and then perform annotation refinement in the more descriptive embedding space.

A. Contextual Spectral Embedding

To exploit the semantic context for spectral embedding, we first represent the correlation information of annotation keywords by the Pearson product moment (PPM) correlation measure [31] as follows. Given the set of $n$ training images annotated with $m$ keywords, we collect the histogram $\mathbf{h}_j = (h_{1j}, \ldots, h_{nj})$ for each keyword, where $h_{ij}$ is the count of times that keyword $j$ occurs in image $Q_i$. The PPM correlation between two keywords $j$ and $j'$ can be defined by
$$\rho_{jj'} = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(h_{ij} - \mu_j\right)\left(h_{ij'} - \mu_{j'}\right)}{\sigma_j\, \sigma_{j'}} \tag{15}$$
where $\mu_j$ ($\mu_{j'}$) and $\sigma_j$ ($\sigma_{j'}$) are the mean and standard deviation of $\mathbf{h}_j$ ($\mathbf{h}_{j'}$), respectively. It is worth noting that the semantic context of annotation keywords has actually been captured from the set of training images using the above correlation measure.

We now construct an undirected weighted graph for spectral embedding with the set of annotation keywords as the vertex set. We set the affinity matrix $A = [\rho_{jj'}]_{m \times m}$ to measure the similarity between annotation keywords. The distinct advantage of using this similarity measure is that we have eliminated the need to tune any parameter for graph construction, which can significantly affect the performance and has been noted as an inherent weakness of graph-based methods. Here, it should be noted that the PPM correlation $\rho_{jj'}$ will be negative if keywords $j$ and $j'$ are not positively correlated. In this case, we set $\rho_{jj'} = 0$ to ensure that the affinity matrix is nonnegative. While the negative correlation does reveal useful information among the keywords and serves to measure the dissimilarity between the keywords, our goal here, however, is to compute the affinity (or similarity) between the keywords and to construct the affinity matrix of the graph used for spectral embedding. Although the dissimilarity information is not exploited directly, by setting the entries between the negatively correlated keywords to zero, we have effectively unlinked the negatively correlated keywords in the graph (e.g., given two keywords "sun" and "moon" that are unlikely to appear in the same image, we set their similarity to zero to ensure that they are not linked in the graph). In this way, we have made use of the negative correlation information for annotation refinement based on spectral embedding. In future work, we will look into other possible ways to make use of the negative correlation information for image annotation.
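A small sketch of the PPM-based graph construction is shown below: it computes the Pearson correlations of (15) from a keyword count matrix and zeroes the negative entries so that the affinity matrix stays nonnegative. The matrix and function names are illustrative, and the sketch assumes only NumPy.

```python
# A sketch of the PPM-correlation affinity of (15) built from a keyword count
# matrix H (n images x m keywords); negative correlations are zeroed so the
# affinity matrix stays nonnegative. Variable names are illustrative.
import numpy as np


def ppm_affinity(H):
    """Nonnegative keyword-keyword affinity from Pearson correlations."""
    H = np.asarray(H, float)
    R = np.corrcoef(H, rowvar=False)        # Pearson correlation between columns
    R = np.nan_to_num(R, nan=0.0)           # guard against constant columns
    np.fill_diagonal(R, 1.0)
    return np.maximum(R, 0.0)               # unlink negatively correlated keywords


# Toy usage: 8 images, 5 keywords, binary occurrence counts.
H = np.random.default_rng(2).integers(0, 2, (8, 5))
print(ppm_affinity(H))
```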
Fig. 4. Illustration of annotation refinement in the spectral embedding space: (a) an example image associated with the ground truth, refined, and unrefined annotations (the incorrect keywords are red-highlighted); (b) annotation refinement based on linear neighborhoods. Here, $\mathcal{N}(\cdot)$ denotes the set of top 7 keywords that are most highly correlated with a keyword, and in this neighborhood the keywords that also belong to the ground truth annotations of the image are blue-highlighted.
The goal of spectral embedding is to represent each vertex in the graph as a lower dimensional vector that preserves the similarities between the vertex pairs. Actually, this is equivalent to finding the eigenvectors associated with the smallest nontrivial eigenvalues of the normalized graph Laplacian $\mathcal{L} = I - D^{-1}A$, where $D$ is a diagonal matrix with its $(j, j)$-element equal to the sum of the $j$th row of the affinity matrix $A$. In this paper, we only consider this type of normalized Laplacian [19], regardless of other normalized versions [18]. Let $\{(\lambda_j, \mathbf{v}_j)\}_{j=1}^{K}$ be the set of the $K$ smallest nontrivial eigenvalues of $\mathcal{L}$ and the associated eigenvectors, where $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_K$ and $\mathcal{L}\mathbf{v}_j = \lambda_j \mathbf{v}_j$. The spectral embedding of the graph can be represented by
$$E = [\mathbf{v}_1, \ldots, \mathbf{v}_K] \tag{16}$$
where the $j$th row of $E$ is the new representation for vertex (keyword) $j$. Since we usually set $K \ll m$, the annotation keywords have actually been represented as lower dimensional vectors. In the following, we will present our approach to annotation refinement using this more descriptive representation.

B. Annotation Refinement

To exploit the semantic context of annotation keywords for annotation refinement, the confidence scores of a query image estimated by our contextual keyword propagation can be adjusted based on linear neighborhoods in the new embedding space. The corresponding algorithm is summarized as follows.
Step 1) Find the $K$ smallest nontrivial eigenvectors $\mathbf{v}_1, \ldots, \mathbf{v}_K$ and the associated eigenvalues of $\mathcal{L}$. Here, $A$ is the PPM correlation matrix with negative entries set to zero.
Step 2) Form $X$ from the eigenvectors $\mathbf{v}_1, \ldots, \mathbf{v}_K$ (weighting each eigenvector according to its eigenvalue), and normalize each row of $X$ to have unit length. Here, the $j$th row $\mathbf{x}_j$ is a new feature vector for keyword $j$.
Step 3) Compute the new affinity matrix between keywords as $\tilde{a}_{jj'} = \langle \mathbf{x}_j, \mathbf{x}_{j'} \rangle$. Here, if $\tilde{a}_{jj'} < 0$, we set it to zero to ensure that the new affinity matrix is nonnegative.
Step 4) Adjust the confidence scores of each query image by blending the score of each keyword with the affinity-weighted scores of its neighboring keywords, where $\theta$ is a weight parameter that controls the blending, $\mathcal{N}(j)$ is the set of top keywords that are most highly correlated with keyword $j$ under the new affinity matrix, and $f_{j'}$ is the confidence score of assigning keyword $j'$ to this image.
It is worth noting that Step 2) slightly differs from (16). Here, we aim to achieve better refinement results through preprocessing (i.e., weighting and normalizing) the new feature vectors. Moreover, in Step 4), we perform annotation refinement based on linear neighborhoods in the new embedding space, as illustrated in Fig. 4. More importantly, the example given by this figure presents a detailed explanation of how the semantic context of keywords encoded in the new embedding space is used to refine the annotations of the image. Before refinement, the three keywords "yellow_lines", "people", and "pole" are ranked according to their predicted confidence scores as follows: "pole" > "people" > "yellow_lines". Hence, the two keywords "people" and "pole" are incorrectly attached to the image, while the ground truth annotation "yellow_lines" is wrongly discarded. However, we can find that the keyword "yellow_lines" is highly semantically correlated with the ground truth annotations of the image (see the five blue-highlighted keywords in $\mathcal{N}$(yellow_lines) shown in Fig. 4). This semantic context is further exploited here for annotation refinement and the confidence score of "yellow_lines" is accordingly increased to the largest among the three keywords, i.e., this
Fig. 5. Some annotated examples selected from UW (first row), Corel (second row), and IAPR (third row) image datasets.
keyword can now be annotated correctly. On the contrary, since the keyword "people" is not at all semantically correlated with the ground truth annotations of the image, its confidence score is decreased to the smallest value and it is discarded successfully by our annotation refinement. Additionally, as for the keyword "pole", although not included in the ground truth annotations of the image, we can still consider that this keyword is semantically correlated with the image (see the three blue-highlighted keywords in $\mathcal{N}$(pole) shown in Fig. 4).

The above algorithm for annotation refinement by contextual spectral embedding is denoted as CSE in the following. The time complexity of CSE for refining the annotations of a single query image is dominated by the spectral embedding of the $m \times m$ keyword affinity matrix in Step 1). Since the number of annotation keywords $m$ is far smaller than the number of images, our algorithm is very efficient even for a large image dataset (see the later experiments on the IAPR dataset). Moreover, our algorithm for annotation refinement has another distinct advantage. That is, besides the semantic context of annotation keywords captured from the training images using the PPM correlation measure, other types of semantic context derived from prior knowledge (e.g., ontology) can also be readily exploited for annotation refinement by incorporating them into graph construction.
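The following sketch puts the CSE steps together under stated assumptions: it embeds the keywords with the smallest nontrivial eigenvectors of the normalized Laplacian $I - D^{-1}A$, row-normalizes the embedding, rebuilds a nonnegative affinity there, and blends each confidence score with those of its most correlated keywords. The eigenvector weighting, the blending rule, and the parameter names ($K$, $r$, $\theta$) are our illustrative choices, not the paper's exact ones.

```python
# A sketch of annotation refinement by contextual spectral embedding; the
# blending rule and parameter names are assumptions for illustration only.
import numpy as np


def cse_refine(A, f, K=10, r=7, theta=0.5):
    """Refine confidence scores f (m,) with a nonnegative keyword affinity A (m, m)."""
    A, f = np.asarray(A, float), np.asarray(f, float)
    m = A.shape[0]
    d = A.sum(axis=1)
    L = np.eye(m) - A / d[:, None]                       # normalized Laplacian I - D^{-1} A
    evals, evecs = np.linalg.eig(L)
    idx = np.argsort(evals.real)[1:K + 1]                # smallest nontrivial eigenvectors
    X = evecs[:, idx].real
    X /= np.linalg.norm(X, axis=1, keepdims=True) + 1e-12   # row-normalize the embedding
    A_new = np.maximum(X @ X.T, 0.0)                     # new nonnegative affinity
    refined = np.empty(m)
    for j in range(m):
        nbrs = np.argsort(A_new[j])[::-1]
        nbrs = nbrs[nbrs != j][:r]                       # top-r correlated keywords
        refined[j] = (1 - theta) * f[j] + theta * A_new[j, nbrs] @ f[nbrs]
    return refined


# Toy usage with any nonnegative keyword affinity (e.g., the ppm_affinity() sketch).
rng = np.random.default_rng(3)
A = np.maximum(np.corrcoef(rng.random((20, 6)), rowvar=False), 0.0)
print(cse_refine(A, rng.random(6), K=3, r=2))
```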
VI. EXPERIMENTAL RESULTS

Here, our SSK combined with CKP and CSE (i.e., SSK+CKP+CSE) is compared to three other representative methods for image annotation: 1) spatial pyramid matching (SPM) [28] combined with CKP and CSE (i.e., SPM+CKP+CSE); 2) PLSA [29] combined with CKP and CSE (i.e., PLSA+CKP+CSE); and 3) multiple Bernoulli relevance models (MBRM) [7] combined with CSE (i.e., MBRM+CSE). Moreover, we also make a comparison between annotation using the semantic context and that without using the semantic context. These two groups of comparison are carried out over three image datasets: University of Washington
(UW),¹ Corel [20], and IAPR TC-12 [14]. Some annotated examples selected from these image datasets are shown in Fig. 5.

A. Experimental Setup

Our annotation method is tested on three standard image datasets. The first image dataset comes from University of Washington (UW) and contains 1109 images annotated with 338 keywords. Each image is annotated with 1–13 keywords. The images are of the size 378 × 252 pixels. The second image dataset is Corel [20], which consists of 5000 images annotated with 371 keywords. Each image is annotated with 1–5 keywords. The images are of the size 384 × 256 pixels. This image dataset has been widely used for the evaluation of image annotation in previous work, e.g., [7], [21]. The third image dataset is IAPR TC-12 [14], which contains 20 000 images annotated with 275 keywords. Each image is annotated with 1–18 keywords. The images are of the size 480 × 360 pixels. It is worth noting that the task of image annotation is very challenging on such a large image dataset.

For the three image datasets, we first divide images into blocks on a regular grid, and the size of blocks is empirically selected: 64 × 64 pixels for MBRM just as [7], but 8 × 8 pixels for the three annotation methods that adopt our CKP. Furthermore, we extract a 30-D feature vector from each block: six color features (block color average and standard deviation) and 24 texture features (average and standard deviation of Gabor outputs over three scales and four orientations). Here, it should be noted that these feature vectors are directly used by MBRM and the computational cost in the annotation process thus becomes extremely large, while this problem can be solved by the other three methods that adopt our CKP through first quantizing these feature vectors into visual words. In this paper, we consider a moderate vocabulary size for the three image datasets.

¹http://www.cs.washington.edu/research/imagedatabase/groundtruth/
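A sketch of the block-based quantization step is given below: it divides each image into fixed blocks on a regular grid, extracts simple per-block color statistics (the Gabor texture features of the paper are omitted here for brevity), and quantizes all blocks with $k$-means into 2-D grids of visual words. It assumes NumPy and scikit-learn are available; block size, vocabulary size, and function names are illustrative.

```python
# A sketch of block partitioning, per-block colour features, and k-means
# quantization into visual words; names and sizes are illustrative.
import numpy as np
from sklearn.cluster import KMeans


def block_features(image, block=8):
    """Per-block colour mean/std features; image is an (H, W, 3) float array."""
    H, W, _ = image.shape
    rows, cols = H // block, W // block
    feats = np.zeros((rows, cols, 6))
    for i in range(rows):
        for j in range(cols):
            patch = image[i * block:(i + 1) * block, j * block:(j + 1) * block]
            feats[i, j] = np.concatenate([patch.mean((0, 1)), patch.std((0, 1))])
    return feats


def visual_word_grids(images, block=8, vocab_size=50):
    """Quantize the blocks of all images into grids of visual-word indices."""
    feats = [block_features(img, block) for img in images]
    stacked = np.concatenate([f.reshape(-1, f.shape[-1]) for f in feats])
    km = KMeans(n_clusters=vocab_size, n_init=10, random_state=0).fit(stacked)
    return [km.predict(f.reshape(-1, f.shape[-1])).reshape(f.shape[:2]) for f in feats]


# Toy usage: three random 64x64 RGB "images", a 10-word vocabulary.
imgs = [np.random.default_rng(t).random((64, 64, 3)) for t in range(3)]
grids = visual_word_grids(imgs, block=8, vocab_size=10)
print(grids[0].shape)   # (8, 8) grid of visual words
```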
Fig. 6. Effect of different parameters on the annotation performance measured by the $F_1$ measure for the UW image dataset. (a) Varying the neighborhood size $k$. (b) Varying the length of subsequences $s$. (c) Varying the scale $L$.
TABLE I RESULTS OF ANNOTATION USING VISUAL AND SEMANTIC CONTEXT ON THE UW IMAGE DATASET
In the experiments, we divide the UW image dataset randomly into 909 training images and 200 test images, and annotate each test image with the top seven keywords. For the Corel image dataset, we split it into 4500 training images and 500 test images just as [20], and annotate each test image with the top five keywords. The IAPR image dataset is partitioned into 16 000 training images and 4000 test images, and each test image is annotated with the top nine keywords.

After splitting the datasets, as with previous work, we evaluate the obtained annotations of the test images through the process of retrieving these test images with a single keyword. For each keyword, the number of correctly annotated images is denoted as $N_c$, the number of retrieved images is denoted as $N_r$, and the number of truly related images in the test set is denoted as $N_g$. Then, the recall, precision, and $F_1$ measures are computed as follows:
$$\text{recall} = \frac{N_c}{N_g}, \qquad \text{precision} = \frac{N_c}{N_r} \tag{17}$$
$$F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{18}$$
which are further averaged over all of the keywords in the test set. Besides, we give a measure to evaluate the coverage of correctly annotated keywords, i.e., the number of keywords with recall $> 0$. This measure is important because a biased model can achieve high precision and recall values by only performing quite well on a small number of common keywords.

Since the solution returned by our CKP algorithm is dependent only on the relative order of the weights of the annotation keywords, we only need to sort the weights without providing their exact values. One straightforward way is to order the weights to be in the reverse order of keyword frequency, namely, the weight $c_j$ decreases as the frequency of the $j$th keyword in the training set increases.
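A small sketch of the per-keyword evaluation protocol of (17) and (18) is given below: precision, recall, and $F_1$ are computed per keyword from binary predicted and ground-truth annotation matrices and then averaged, and the number of keywords with recall $> 0$ is also reported. The array names and the averaging over all keywords are illustrative simplifications.

```python
# A sketch of the evaluation measures of (17)-(18); the array names and the
# simple averaging over all keywords are assumptions for illustration.
import numpy as np


def evaluate(pred, truth):
    """pred, truth: (num_test_images, m) binary annotation matrices."""
    pred, truth = np.asarray(pred, bool), np.asarray(truth, bool)
    n_correct = (pred & truth).sum(axis=0).astype(float)   # correctly annotated images
    n_retrieved = pred.sum(axis=0)                          # images annotated with the keyword
    n_relevant = truth.sum(axis=0)                          # truly related images
    precision = np.divide(n_correct, n_retrieved, out=np.zeros_like(n_correct), where=n_retrieved > 0)
    recall = np.divide(n_correct, n_relevant, out=np.zeros_like(n_correct), where=n_relevant > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(denom), where=denom > 0)
    return precision.mean(), recall.mean(), f1.mean(), int((recall > 0).sum())


rng = np.random.default_rng(4)
print(evaluate(rng.integers(0, 2, (50, 20)), rng.integers(0, 2, (50, 20))))
```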
Here, we set the neighborhood size $k$ for our CKP algorithm on the UW dataset. Moreover, according to Fig. 6(a), we can find that our CKP algorithm is not sensitive to this parameter. Finally, according to Fig. 6(b) and (c), we set the length of subsequences $s$ for our SSK and the scale $L$ for both SSK and SPM on the UW dataset. The other parameters are also set to their respective optimal values similarly.

B. Results on UW Image Dataset

The results of annotation using visual and semantic context are averaged over ten random partitions of the UW image dataset and then listed in Table I. We can observe that our annotation method (i.e., SSK+CKP+CSE) performs much better than all of the other three methods. This observation may be due to the fact that our method not only exploits the context of annotation keywords for keyword propagation and annotation refinement but also captures the context of visual words within images to define the similarity between images for keyword propagation. That is, we have successfully exploited both visual and semantic context for image annotation. Particularly, as compared with MBRM, which propagates a single keyword independently of the other keywords, our annotation method leads to a 23% gain on the $F_1$ measure through contextual keyword propagation using our spatial spectrum kernel.

We make further observations on the three methods that adopt CKP for image annotation. It is shown in Table I that both SSK and SPM achieve better results than PLSA, which does not consider the context of visual words within images. Moreover, since SPM can only capture the global context of visual words, our SSK performs better than SPM due to the fact that both local and global context are used to define the similarity between images. These observations show that the context of visual words indeed helps to improve the annotation performance of keyword propagation.

More importantly, to demonstrate the gain of exploiting the semantic context of annotation keywords for image annotation,
Fig. 7. Comparison between annotation using the semantic context (CKP and Refined) and that without using the semantic context (KP and Unrefined) on the UW image dataset. (a) Keyword propagation versus contextual keyword propagation. (b) Unrefined annotation versus refined annotation by contextual spectral embedding.
TABLE II RESULTS OF ANNOTATION USING VISUAL AND SEMANTIC CONTEXT ON THE COREL IMAGE DATASET
Fig. 8. Comparison between annotation using the semantic context (CKP and Refined) and that without using the semantic context (KP and Unrefined) on the Corel image dataset. (a) Keyword propagation versus contextual keyword propagation. (b) Unrefined annotation versus refined annotation by contextual spectral embedding.
we also compare annotation using this semantic context to annotation without using this semantic context. The comparison is shown in Fig. 7. Here, keyword propagation given by (8) is denoted as KP (without using the semantic context), while our proposed contextual keyword propagation is denoted as CKP (using the semantic context). Meanwhile, the refined annotation results by contextual spectral embedding are denoted as Refined (using the semantic context), while the annotation results before refinement are denoted as Unrefined (without using the semantic context). We can observe from Fig. 7 that the semantic context of annotation keywords plays an important role in both keyword propagation and annotation refinement.
C. Results on Corel Image Dataset

The results of annotation using visual and semantic context on the Corel image dataset are listed in Table II. From this table, we can draw similar conclusions (compared to Table I). Through exploiting both visual and semantic context for keyword propagation and annotation refinement, our method still performs the best on this image dataset. Moreover, our method is also compared with more recent state-of-the-art methods [21], [32] using their own results. As shown in Table II, our method outperforms [21], [32] because both visual and semantic context are used for learning the semantics of images. To the best of our knowledge,
TABLE III RESULTS OF ANNOTATION USING VISUAL AND SEMANTIC CONTEXT ON THE IAPR IMAGE DATASET
the results reported in [32] are the best in the literature. However, our method can still achieve an 8% gain on the $F_1$ measure over this method. More importantly, we show in Fig. 8 the comparison between annotation using the semantic context of annotation keywords and annotation without using this semantic context. We can similarly find that this semantic context plays an important role in both keyword propagation and annotation refinement on this image dataset.

D. Results on IAPR Image Dataset

To verify that our annotation method is scalable to large image datasets, we present the annotation results on the IAPR dataset in Table III. In the experiments, we do not compare our annotation method with PLSA and MBRM, since PLSA needs huge memory and MBRM incurs a large time cost when the data size is 20 000. From Table III, we find that our SSK can achieve a 17% gain over SPM (see SSK+CKP+CSE versus SPM+CKP+CSE). That is, the visual context captured by our method indeed helps to improve the annotation performance. Moreover, we also find that both our CKP and CSE can achieve improved results by exploiting the semantic context of annotation keywords. Another distinct advantage of these kernel and spectral methods is that they are very scalable with respect to the data size. The time taken by our CKP and CSE on the large IAPR dataset is 21 and 1 min, respectively. We run these two algorithms (Matlab code) on a PC with 2.33 GHz CPU and 2 GB RAM.

VII. CONCLUSION

We have proposed contextual kernel and spectral methods for learning the semantics of images in this paper. To capture the context of visual words within images, we first define a spatial string kernel to measure the similarity between images. Based on this spatial string kernel, we further formulate automatic image annotation as a contextual keyword propagation problem, which can be solved very efficiently by linear programming. Different from the traditional relevance models that treat each keyword independently, our contextual kernel method considers the semantic context of annotation keywords and propagates multiple keywords simultaneously from the training images to the test images. More importantly, such semantic context can also be incorporated into spectral embedding for refining the annotations of images predicted by keyword propagation. Experiments on three standard image datasets demonstrate that our contextual kernel and spectral methods can achieve superior results. In future work, these kernel and spectral methods will be extended to the temporal domain for problems such as video semantic learning and retrieval. Moreover, since our contextual kernel and spectral methods are very general techniques, they
will be adopted to improve the performance of other machine learning methods that are widely used in computer vision and image processing.

REFERENCES

[1] R. Zhang and Z. Zhang, "Effective image retrieval based on hidden concept discovery in image database," IEEE Trans. Image Process., vol. 16, no. 2, pp. 562–572, Feb. 2007.
[2] J. Li and J. Wang, "Automatic linguistic indexing of pictures by a statistical modeling approach," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 9, pp. 1075–1088, Sep. 2003.
[3] Y. Gao, J. Fan, X. Xue, and R. Jain, "Automatic image annotation by incorporating feature hierarchy and boosting to scale up SVM classifiers," in Proc. ACM Multimedia, 2006, pp. 901–910.
[4] E. Chang, G. Kingshy, G. Sychay, and G. Wu, "CBSA: Content-based soft annotation for multimodal image retrieval using Bayes point machines," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 1, pp. 26–38, Jan. 2003.
[5] J. Jeon, V. Lavrenko, and R. Manmatha, "Automatic image annotation and retrieval using cross-media relevance models," in Proc. SIGIR, 2003, pp. 119–126.
[6] V. Lavrenko, R. Manmatha, and J. Jeon, "A model for learning the semantics of pictures," Adv. Neural Inf. Process. Syst., vol. 16, pp. 553–560, 2004.
[7] S. Feng, R. Manmatha, and V. Lavrenko, "Multiple Bernoulli relevance models for image and video annotation," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., 2004, vol. 2, pp. 1002–1009.
[8] J. Liu, B. Wang, M. Li, Z. Li, W. Ma, H. Lu, and S. Ma, "Dual cross-media relevance model for image annotation," in Proc. ACM Multimedia, 2007, pp. 605–614.
[9] Z. Lu and H. Ip, "Generalized relevance models for automatic image annotation," in Proc. Pacific Rim Conf. Multimedia, 2009, pp. 245–255.
[10] D. Zhou, J. Weston, A. Gretton, O. Bousquet, and B. Schölkopf, "Ranking on data manifolds," Adv. Neural Inf. Process. Syst., vol. 16, pp. 169–176, 2004.
[11] J. Liu, M. Li, W. Ma, Q. Liu, and H. Lu, "An adaptive graph model for automatic image annotation," in Proc. ACM Int. Workshop Multimedia Inf. Retrieval, 2006, pp. 61–70.
[12] Z. Lu, H. Ip, and Q. He, "Context-based multi-label image annotation," in Proc. ACM Int. Conf. Image Video Retrieval, 2009, pp. 1–7.
[13] C. Leslie, E. Eskin, and W. Noble, "The spectrum kernel: A string kernel for SVM protein classification," in Proc. Pacific Symp. Biocomputing, 2002, pp. 566–575.
[14] H. Escalante, C. Hernández, J. Gonzalez, A. López-López, M. Montes, E. Morales, L. Sucar, L. Villasenor, and M. Grubinger, "The segmented and annotated IAPR TC-12 benchmark," Comput. Vis. Image Underst., vol. 114, no. 4, pp. 419–428, 2010.
[15] F. Kang, R. Jin, and R. Sukthankar, "Correlated label propagation with application to multi-label learning," in Proc. CVPR, 2006, pp. 1719–1726.
[16] R. Parker and R. Rardin, Discrete Optimization. New York: Academic, 1988.
[17] S. Yan, D. Xu, B. Zhang, H. Zhang, Q. Yang, and S. Lin, "Graph embedding and extensions: A general framework for dimensionality reduction," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 1, pp. 40–51, Jan. 2007.
[18] A. Ng, M. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," Adv. Neural Inf. Process. Syst., vol. 14, pp. 849–856, 2002.
[19] S. Lafon and A. Lee, "Diffusion maps and coarse-graining: A unified framework for dimensionality reduction, graph partitioning, and data set parameterization," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 9, pp. 1393–1403, Sep. 2006.
[20] P. Duygulu, K. Barnard, N. de Freitas, and D. Forsyth, "Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary," in Proc. ECCV, 2002, pp. 97–112.
[21] J. Liu, M. Li, Q. Liu, H. Lu, and S. Ma, "Image annotation via graph learning," Pattern Recognit., vol. 42, no. 2, pp. 218–228, 2009.
[22] S. Zhu, X. Ji, W. Xu, and Y. Gong, "Multi-labelled classification using maximum entropy method," in Proc. SIGIR, 2005, pp. 274–281.
[23] J. Rousu, C. Saunders, S. Szedmak, and J. Shawe-Taylor, "On maximum margin hierarchical multi-label classification," in Proc. NIPS Workshop Learning Structured Outputs, 2004, pp. 1–4.
[24] C. Wang, F. Jing, L. Zhang, and H.-J. Zhang, "Image annotation refinement using random walk with restarts," in Proc. ACM Multimedia, 2006, pp. 647–650.
[25] R. Behmo, N. Paragios, and V. Prinet, "Graph commute times for image representation," in Proc. CVPR, 2008, pp. 1–8.
[26] J. Li, W. Wu, T. Wang, and Y. Zhang, "One step beyond histograms: Image representation using Markov stationary features," in Proc. CVPR, 2008, pp. 1–8.
[27] A. Holub, M. Welling, and P. Perona, "Hybrid generative-discriminative object recognition," Int. J. Comput. Vis., vol. 77, no. 1–3, pp. 239–258, 2008.
[28] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in Proc. CVPR, 2006, pp. 2169–2178.
[29] T. Hofmann, "Unsupervised learning by probabilistic latent semantic analysis," Mach. Learn., vol. 41, no. 1–2, pp. 177–196, 2001.
[30] D. Blei, A. Ng, and M. Jordan, "Latent Dirichlet allocation," J. Mach. Learn. Res., vol. 3, no. 4–5, pp. 993–1022, 2003.
[31] J. Rodgers and W. Nicewander, "Thirteen ways to look at the correlation coefficient," Amer. Stat., vol. 42, no. 1, pp. 59–66, Feb. 1988.
[32] A. Makadia, V. Pavlovic, and S. Kumar, "A new baseline for image annotation," in Proc. ECCV, 2008, pp. 316–329.

Zhiwu Lu received the M.Sc. degree in applied mathematics from Peking University, Beijing, China, in 2005. He is currently working toward the Ph.D. degree in the Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong.
From July 2005 to August 2007, he was a Software Engineer with Founder Corporation, Beijing, China. From September 2007 to June 2008, he was a Research Assistant with the Institute of Computer Science and Technology, Peking University, Beijing, where he is currently an Assistant Professor. He has authored or coauthored over 30 papers in international journals and conference proceedings. His research interests lie in pattern recognition, machine learning, multimedia information retrieval, and computer vision.
Horace H.S. Ip received the B.Sc. (first-class honors) degree in applied physics and Ph.D. degree in image processing from the University College London, London, U.K., in 1980 and 1983, respectively. Currently, he is the Chair Professor of computer science, the Founding Director of the Centre for Innovative Applications of Internet and Multimedia Technologies (AIMtech Centre), and the Acting Vice-President of City University of Hong Kong, Kowloon, Hong Kong. He has authored or coauthored over 200 papers in international journals and conference proceedings. His research interests include pattern recognition, multimedia content analysis and retrieval, virtual reality, and technologies for education. Prof. Ip is a Fellow of the Hong Kong Institution of Engineers, the U.K. Institution of Electrical Engineers, and the IAPR.
Yuxin Peng received the Ph.D. degree in computer science and technology from Peking University, Beijing, China, in 2003. He joined the Institute of Computer Science and Technology, Peking University, as an Assistant Professor in 2003 and was promoted to a Professor in 2010. From 2003 to 2004, he was a Visiting Scholar with the Department of Computer Science, City University of Hong Kong. His current research interests include content-based video retrieval, image processing, and pattern recognition.