A Semantic Similarity Language Model to Improve Automatic Image Annotation
Tianxia Gong, Shimiao Li, Chew Lim Tan
School of Computing, National University of Singapore
13 Computing Drive, Singapore 117417
Email: {gongtianxia, lism, tancl}@comp.nus.edu.sg
Abstract— In recent years, with the rapid proliferation of digital images, the need to search and retrieve images accurately, efficiently, and conveniently has become more acute. Automatic image annotation with semantic content has attracted increasing attention, as it is the preprocessing step of annotation-based image retrieval, which offers users accurate, efficient, and convenient retrieval grounded in image understanding. Different machine learning approaches have been used to tackle the problem of automatic image annotation; however, most of them focus on the relationship between images and annotation words and neglect the relationships among the annotation words themselves. In this paper, we propose a framework that uses language models to represent word-to-word relations and thus improve the performance of existing probabilistic image annotation approaches. We also propose a specific language model - the semantic similarity language model - which estimates the semantic similarity among annotation words, so that annotations that are more semantically coherent are more likely to be chosen for an image. To illustrate the general idea of using a language model to improve current image annotation systems, we add the language model on top of two specific image annotation models - the translation model (TM) and the cross media relevance model (CMRM). We test the improved models on a widely used image annotation corpus, the Corel 5K dataset. Our results show that adding the semantic similarity language model significantly improves annotation performance compared with the original models. The proposed language model can also be applied to other image annotation approaches that use word probability conditioned on the image or word-image joint probability.
I. INTRODUCTION

With the rapid proliferation of digital images, the need to search and retrieve images accurately, efficiently, and conveniently has become more acute in recent years. Text-based image retrieval, a traditional approach that indexes images with associated keywords, has the advantages of easy implementation and fast retrieval, but it requires a large amount of manual labeling and suffers from human subjectivity. Content-based image retrieval [1] overcomes these disadvantages by retrieving images whose low-level visual features are similar to those of the query image; however, it imposes a strict requirement on the query format: the query must be an example image. Content-based image retrieval systems also suffer from the semantic gap problem. Auto-annotation based image retrieval has gained increasing popularity, as it appears to combine the advantages of both text-based and content-based image retrieval: images are automatically annotated with their semantic content in a preprocessing step, and users can then search them conveniently with text.

To tackle the problem of automatic image annotation, researchers have used various machine learning methods. Some approaches use supervised learning and view automatic image annotation as a classification problem: the objective is to classify images into different annotation classes using visual features. Other approaches use unsupervised learning, with both parametric and non-parametric models, aiming to learn a probabilistic model that represents the correlation between images and annotation words. In these approaches, the annotations for an unseen image are usually selected to maximize a conditional probability - the word probability conditioned on the image regions or image features - or the joint probability of the word and the image. Most of this work models the image-word correlation; far less research exploits the word-word correlation for image annotation.

To make use of the contextual information in annotation words to improve automatic image annotation, we propose a semantic similarity language model in this paper. In this language model, we represent each annotation word by a semantic vector that captures the distributional properties of the word in terms of the strength of its co-occurrence with a set of context words in the same annotation. Under the assumption that words in the same annotation should be semantically similar, we approximate the probability of a set of annotation words by measuring the semantic similarity of each annotation word to all the other words. By converting the word probability conditioned on the image (or the word-image joint probability) into an image probability conditioned on the annotation words before applying the language model, we reduce the word probability bias often observed in probabilistic image annotation models. The overall annotation performance also improves: by using the proposed semantic similarity language model on top of the image-word correlation model to estimate the prior probability of a set of words rather than of individual words, the system is more likely to generate semantically coherent annotations.
II. RELATED WORK

Different machine learning methods have been used to tackle the problem of automatic image annotation. Supervised learning approaches view image annotation as a classification problem: each textual word in the annotation is considered an independent class label; images or image regions are classified into these classes according to extracted visual features; and the classification result is the annotation result. Existing image classification approaches to image annotation are mainly based on global features, local features, or multi-level classification, and they assign each keyword or concept its own classifier. Supervised learning approaches to image annotation include work using SVMs [2] and HMMs [3] [4].

Under the category of unsupervised learning, both parametric and non-parametric models have been used for automatic image annotation. They aim to learn a probabilistic model that represents the correlation between images and keywords. Mori et al. [5] proposed a co-occurrence model to represent the relationship between keywords and visual features. Duygulu et al. [6] proposed a machine translation model for image annotation: they used EM to learn an "image-word lexicon" and then translated image segments into annotation words. Blei and Jordan [7] proposed the correspondence latent Dirichlet allocation (Corr-LDA) model to find a conditional relationship between image features and textual features. Monay and Gatica-Perez [8] used latent semantic analysis (LSA) and probabilistic latent semantic analysis (PLSA) for image annotation. Jeon et al. [9] viewed image annotation as analogous to the cross-lingual retrieval problem and proposed a cross-media relevance model (CMRM). Lavrenko et al. [10] proposed a continuous-space relevance model (CRM), which is similar to CMRM but uses continuous feature vectors for each image region.

Besides supervised and unsupervised approaches, semi-supervised methods have also been used for image annotation [11] [12]. They combine a generative model of the input data with a discriminative model for image labeling to form a hybrid model that automatically annotates images.

A few approaches also integrate word correlation into the annotation process. Jin et al. [13] proposed a coherent language model, extended from CMRM, that models the correlation between pairs of textual words; it defines the language model as a multinomial distribution over words. Liu et al. [14] modeled the relationships among annotation words using word-based graph learning.

III. IMAGE ANNOTATION WITH A SEMANTIC SIMILARITY LANGUAGE MODEL

The problem of automatic image annotation can be defined as follows: given a training corpus T of already annotated images, which annotation set A should be used to label
a new image I? For most image annotation approaches using probabilistic models, the goal is to find the A that maximizes the conditional probability p(A|I):

A = \arg\max_A p(A|I) \qquad (1)
where A is a set of words {w_1, \ldots, w_n} used for annotation and the image I is usually represented by a set of features {f_1, \ldots, f_k} or a set of blobs {b_1, \ldots, b_m}. p(A|I) can also be rewritten as:

p(A|I) = \frac{p(A, I)}{p(I)} \qquad (2)
Since the prior probability of a given image I is usually assumed to be uniform, instead of estimating the conditional probability p(A|I) directly, some image annotation approaches use the joint probability p(A, I) to find the best annotation set A:

A = \arg\max_A p(A, I) \qquad (3)
The two ways of finding the best A in Equation 1 and Equation 3, used by many image annotation approaches, emphasize the image-word correlation but ignore the contextual information among the annotation words themselves. In this paper, we make use of the word contextual information by incorporating a language model. We view the image annotation problem through the noisy channel model, as in Figure 1.

Fig. 1. Noisy channel model
The original signal A generated by the transmitter passes through a noisy channel and changes into a noisy signal I, which is received by the receiver. Using the noisy channel model, we can interpret the image annotation problem as follows: a person wants to express a few objects in words, but the output of the expression is a picture, because the tool he uses is, say, a camera; we have to use the image to predict the original words the person was trying to express with the picture:

p(A|I) = \frac{p(I|A) \, p(A)}{p(I)} \qquad (4)

A = \arg\max_A p(I|A) \, p(A) \qquad (5)
We then find the best set of annotation words by maximizing the combination of two probabilities, p(I|A) and p(A). p(I|A) can be derived from the original image annotation method, and p(A) is the annotation set probability derived from the language model. We weight the two probabilities differently to give different emphasis to the original annotation model and the language model and achieve the best overall annotation result:
A = \arg\max_A p(I|A)^{\lambda_1} \, p(A)^{\lambda_2} \qquad (6)

which can also be expressed in log-linear form:

A = \arg\max_A \left( \lambda_1 \log p(I|A) + \lambda_2 \log p(A) \right) \qquad (7)
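As a minimal illustration, the objective in Equation 7 can be written as a scoring function. This is a sketch only; the callables are hypothetical placeholders for the two component models introduced below:

```python
# A minimal sketch of the log-linear objective in Equation 7. The callables
# log_p_image_given_A and log_p_A are hypothetical placeholders for the
# annotation model and the language model described in the rest of the paper.
def annotation_score(A, log_p_image_given_A, log_p_A, lam1=1.0, lam2=1.0):
    """lambda_1 * log p(I|A) + lambda_2 * log p(A)."""
    return lam1 * log_p_image_given_A(A) + lam2 * log_p_A(A)
```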
Language modeling is widely used in many natural language processing applications such as speech recognition, machine translation, part-of-speech tagging, parsing, and information retrieval. A statistical language model assigns a probability to a sequence of words. Using a language model in statistical machine translation boosts the probability of translating a sentence in the source language into a well-formed sentence in the target language. Similarly, we use a language model in image annotation to boost the probability of annotating with semantically coherent words. However, the set of annotation words in the automatic image annotation task is not a sequence, as the annotation is not a sentence; the commonly used bi-gram or tri-gram models are therefore not suitable for modeling the probability p(A) of a set of words. Instead, we make use of the word co-occurrence information in the training annotations to model the probability of a keyword given the other keywords in the annotation, and hence the probability of the annotation.

In this paper, we choose a semantic vector model to represent each word and define the language model as the average pairwise similarity of the semantic vectors of the words in the annotation set. In the semantic vector model, the meaning of each word is represented as a vector over context words. We first choose a set of context words to be included in the semantic vector representing the meaning of any word. For a small image annotation corpus such as the Corel 5K dataset [6], the vocabulary (the total number of distinct words used in the annotations) is usually small as well (a few hundred words), so we can use all the words in the vocabulary as context words. A word w is represented by a semantic vector v = \langle v_1, v_2, \ldots, v_i, \ldots, v_m \rangle, where m is the number of context words in the vocabulary. In [15], the components of the semantic vectors are defined as the ratio of the conditional probability of a context word given the target word to the overall probability of the context word. We follow the definition in [15] to calculate each component v_i of the semantic vector representing w:

v_i = \frac{p(context_i \mid w)}{p(context_i)} \qquad (8)

The conditional probability is simply a relative frequency: the number of co-occurrences of the context word context_i with the word w in the same annotation, over the total number of occurrences of w in the annotations:

p(context_i \mid w) = \frac{count(context_i, w)}{count(w)} \qquad (9)
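To make the construction concrete, the following is a minimal Python sketch of Equations 8 and 9. The toy annotation list and all variable names are our own illustrative assumptions, not part of the original system:

```python
# A minimal sketch of the semantic vector construction in Equations 8 and 9.
from collections import Counter
from itertools import permutations

annotations = [                      # toy training annotations (assumed data)
    ["people", "street", "sky"],
    ["sun", "sky", "tree"],
    ["forest", "tree", "sky"],
]

word_count = Counter()               # count(w): occurrences of w over all annotations
cooc_count = Counter()               # count(context, w): co-occurrences in one annotation
for ann in annotations:
    word_count.update(ann)
    for w, c in permutations(ann, 2):
        cooc_count[(c, w)] += 1

total = sum(word_count.values())
vocab = sorted(word_count)           # use the whole vocabulary as context words

def semantic_vector(w):
    """v_i = p(context_i | w) / p(context_i), Equations 8 and 9."""
    vec = []
    for c in vocab:
        p_c_given_w = cooc_count[(c, w)] / word_count[w]   # Eq. 9
        p_c = word_count[c] / total                        # overall context probability
        vec.append(p_c_given_w / p_c)                      # Eq. 8
    return vec

print(dict(zip(vocab, semantic_vector("sky"))))
```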
TABLE I
EXAMPLE OF ANNOTATION WORDS REPRESENTED BY SEMANTIC VECTORS OF CONTEXT WORDS
(Numeric values are not recoverable from the source; the table listed words such as "city" over context words such as "grass", "buildings", "bridge", and "mountain".)

TABLE II
EXAMPLES OF PAIRWISE SEMANTIC SIMILARITY

          people     sun   street     sky  forest    tree  ...
people    1       0.0219  0.4131  0.1513  0.0174  0.2118   ...
sun       0.0219  1       0.0120  0.2121  0.0001  0.0827   ...
street    0.4131  0.0120  1       0.1833  0.0012  0.1624   ...
sky       0.1513  0.2121  0.1833  1       0.0283  0.4211   ...
forest    0.0174  0.0001  0.0012  0.0283  1       0.2140   ...
tree      0.2118  0.0827  0.1624  0.4211  0.2140  1        ...
...       ...     ...     ...     ...     ...     ...      ...
Each component thus reflects the strength of co-occurrence of the target word with a set of context words. Dividing by the overall probability of each context word prevents the vectors from being dominated by the most frequent context words, which would otherwise also have the highest conditional probabilities. Table I shows some examples of annotation words represented by semantic vectors of context words.

Assuming that words in the same annotation should be semantically similar, the probability of a set of annotation words A = {w_1, \ldots, w_n} can be measured by the similarity of each annotation word to all the other words:

p(A) \propto \frac{1}{n(n-1)} \sum_{w_i \in A} \; \sum_{w_j \in A, j \neq i} sim(w_i, w_j) \qquad (10)
where each word w_i in the annotation set A is represented by its corresponding semantic vector. Similarity can be measured by the cosine:

sim(w_1, w_2) = \frac{\mathbf{w}_1 \cdot \mathbf{w}_2}{\|\mathbf{w}_1\| \, \|\mathbf{w}_2\|} \qquad (11)
Using the semantic vector representation of Equation 8, the dot product in the similarity measure is calculated as:
\mathbf{w}_1 \cdot \mathbf{w}_2 = \sum_{i=1}^{m} v_{w_1,i} \, v_{w_2,i} = \sum_{i=1}^{m} \frac{p(c_i|w_1)}{p(c_i)} \, \frac{p(c_i|w_2)}{p(c_i)} \qquad (12)
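A small Python sketch of Equations 10-12 follows, using toy semantic vectors in place of vectors computed from a real corpus:

```python
# A sketch of the language model score: p(A) is taken to be proportional to
# the average pairwise cosine similarity of the semantic vectors (Eq. 10-12).
import math

def cosine(u, v):
    """sim(w1, w2) of Equation 11, with the dot product of Equation 12."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def lm_score(vectors):
    """p(A) up to proportionality: average pairwise similarity (Equation 10)."""
    n = len(vectors)
    total = sum(cosine(vectors[i], vectors[j])
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

# toy semantic vectors for a three-word annotation set (assumed values)
print(lm_score([[3.0, 0.0, 1.0], [2.0, 0.5, 1.0], [0.0, 2.0, 0.5]]))
```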
Table II shows the pairwise semantic similarity between some example annotation words.

We note that Equation 5 is very similar to the fundamental equation of statistical machine translation [16], shown below (translating a foreign sentence f into an English sentence e):

e = \arg\max_e p(f|e) \, p(e) \qquad (13)

As the search space for the optimal solution is huge, beam search is usually used for decoding in statistical machine translation. Similarly, we use a k-best beam search to find the k best sets of annotation words, as sketched below.
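The following is a minimal sketch of such a k-best beam search, under the assumption that the image model factors over words as in Equations 16 and 24; `log_p_img_word` and `lm_score` are hypothetical stand-ins for the component models:

```python
# A minimal k-best beam search over annotation sets, scored as in Equation 7.
import heapq
import math

def beam_search(vocab, log_p_img_word, lm_score, length=5,
                beam_width=10, lam1=1.0, lam2=1.0, k=5):
    """Grow annotation sets one word at a time, keeping the best partial sets.
    log_p_img_word(w) stands in for the per-word image model; lm_score(words)
    stands in for the language model of Equation 10."""
    beam = [((), 0.0)]                       # (words so far, summed image log-prob)
    for _ in range(length):
        candidates = []
        for words, img_lp in beam:
            for w in vocab:
                if w in words:
                    continue
                new_words = words + (w,)
                new_img_lp = img_lp + log_p_img_word(w)
                score = lam1 * new_img_lp
                if len(new_words) > 1:       # pairwise similarity needs two words
                    score += lam2 * math.log(max(lm_score(new_words), 1e-12))
                candidates.append((new_words, new_img_lp, score))
        best = heapq.nlargest(beam_width, candidates, key=lambda c: c[2])
        beam = [(w, lp) for w, lp, _ in best]
    return [list(words) for words, _ in beam[:k]]
```

In practice, the beam width trades decoding speed against the risk of pruning the optimal annotation set early.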
IV. LANGUAGE MODEL TO IMPROVE AUTOMATIC IMAGE ANNOTATION MODELS

A. Improved Machine Translation Model

We can apply our proposed language model directly to existing image annotation models that use word probability conditioned on the image, such as the machine translation model (TM) [6]. TM considers the images and annotations in the training corpus as "parallel text". The translation model of IBM Model 2 [17] was used to build the blob-to-word translation table and to obtain the blob-to-word alignment for each image in the training corpus (the explicit correspondence between image regions and words). The blob-to-word translation table in the original translation model can be viewed as a probability table in which each entry p(w_i|b_j) gives the translation probability from blob b_j to word w_i.

To apply our proposed language model to the translation model, we first rebuild the translation model with the translation direction reversed: we treat words as the source language and blobs as the target language, and construct a word-to-blob translation table instead of the original blob-to-word table. In the new table, each entry p(b_j|w_i) gives the translation probability from word w_i to blob b_j. The word-to-blob translation probability and the alignment probability are calculated as:

P(b|w) = \sum_a P(a, b|w) \qquad (14)

P(a, b|w) = \prod_{j=1}^{m} t(b_j \mid w_{a_j}) \qquad (15)

where w is the word, b is the blob, and a is a possible alignment between w and b. Finding the best alignment and the best translation is a chicken-and-egg problem, so the EM algorithm is a natural choice. The parameter estimation method of [17] is used to obtain the best alignments as well as the translation table:

Step 1: initialize the word-to-blob translation probabilities uniformly.
Step 2: apply the model to the training data to calculate the alignment probability p(a|w, b) using the chain rule: p(a|w, b) = P(a, b|w)/P(b|w).
Step 3: use the updated values to collect counts and re-estimate the model.
Step 4: check whether the values have converged; if not, repeat Step 2 and Step 3.

Fig. 2. EM algorithm to estimate word-to-blob translation and alignment probabilities
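For illustration, the sketch below runs this EM loop on toy word-blob pairs. For brevity it follows the simpler IBM Model 1 (uniform alignment probabilities) rather than Model 2's position-dependent alignments, so it is an approximation of the procedure in Fig. 2, not the exact implementation:

```python
# A simplified EM sketch for the word-to-blob translation table t(b|w),
# in the spirit of IBM Model 1. The training pairs are toy assumed data.
from collections import defaultdict

pairs = [(["sky", "tree"], ["b12", "b7"]),     # (words, blobs) per training image
         (["sky", "water"], ["b12", "b3"])]

# Step 1: initialize t(b|w) uniformly over all observed blobs.
blobs = {b for _, bs in pairs for b in bs}
t = defaultdict(lambda: 1.0 / len(blobs))

for _ in range(20):                             # Steps 2-4: iterate toward convergence
    count = defaultdict(float)                  # expected counts c(b, w)
    total_w = defaultdict(float)
    for words, bs in pairs:
        for b in bs:
            z = sum(t[(b, w)] for w in words)   # normalizer over candidate source words
            for w in words:
                c = t[(b, w)] / z               # expected alignment probability
                count[(b, w)] += c
                total_w[w] += c
    for (b, w), c in count.items():             # M-step: re-estimate t(b|w)
        t[(b, w)] = c / total_w[w]

print(t[("b12", "sky")])
```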
After the reverse translation table is built, we find the set of annotation words A for a new image I by Equation 1, where p(A|I) can be estimated as:

p(A|I) \propto p(A) \prod_{w_i \in A} p(b_i \mid w_i) \qquad (16)
where w_i is an annotation word in A, and b_i is the blob translated from the word w_i. Applying the semantic similarity language model defined in Equation 7, we find the set of annotation words by:

A = \arg\max_A \left( \lambda_1 \log \prod_{w_k \in A} p(b_k \mid w_k) + \lambda_2 \log \frac{1}{n(n-1)} \sum_{w_i \in A} \; \sum_{w_j \in A, j \neq i} sim(w_i, w_j) \right) \qquad (17)

B. Improved Cross Media Relevance Model

The semantic similarity language model we propose in this paper can also be applied to image annotation models that use the image-word joint probability. The cross media relevance model (CMRM) [9] is such a model. The original CMRM defines each training instance J in the training corpus T as J = {b_1 \ldots b_m; w_1 \ldots w_n}, where b_1 \ldots b_m are the blobs corresponding to the regions of the image and w_1 \ldots w_n are the words in the image annotation. As CMRM uses a relevance language model [18], the joint probability of observing the word w_i and the blobs b_1 \ldots b_m in the same image is estimated as an expectation over the images J in the training set:

p(w_i, b_1, \ldots, b_m) = \sum_{J \in T} p(J) \, p(w_i, b_1, \ldots, b_m \mid J) \qquad (18)
It is assumed that, given an image J, the events of observing w_i and b_1 \ldots b_m are mutually independent and identically distributed. Equation 18 can then be rewritten as:

p(w_i, b_1, \ldots, b_m) = \sum_{J \in T} p(J) \, p(w_i \mid J) \prod_{j=1}^{m} p(b_j \mid J) \qquad (19)
The prior probabilities p(J) are set to be uniform over all images in T. Smoothed maximum-likelihood estimates are used for the probabilities in Equation 19:

p(w_i \mid J) = (1 - \alpha_J) \frac{|w_i \text{ in } J|}{|J|} + \alpha_J \frac{|w_i \text{ in } T|}{|T|} \qquad (20)

p(b_j \mid J) = (1 - \beta_J) \frac{|b_j \text{ in } J|}{|J|} + \beta_J \frac{|b_j \text{ in } T|}{|T|} \qquad (21)
where |w_i in J| is the number of times the word w_i occurs in the annotation of image J and |w_i in T| is the total number of times w_i occurs in the annotations of all images in the training set T. Similarly, |b_j in J| is the number of times the blob b_j occurs in image J and |b_j in T| is the total number of times b_j occurs in all images in the training set T. |J| is the total count of all words and blobs occurring in image J, and |T| is the total size of the training set. The smoothing parameters \alpha_J and \beta_J determine the degree of interpolation between the maximum-likelihood estimates and the background probabilities for the words and the blobs. In the experiments of Jeon et al. [9], the model gives the best annotation results with \alpha_J = 0.1 and \beta_J = 0.9.

After generating the joint probability from the original CMRM, in order to apply the language model, we first derive the conditional probability p(I|w_i) (where I = {b_1, \ldots, b_m}) from the joint probability p(w_i, I):

p(I \mid w_i) = \frac{p(w_i, I)}{p(w_i)} = \frac{p(w_i, b_1, \ldots, b_m)}{p(w_i)} \qquad (22)
We calculate the prior probability of a word, p(w_i), as the count of w_i in the training corpus T over the total count of all words used in annotations in T:

p(w_i) = \frac{|w_i|}{\sum_{w_k \in T} |w_k|} \qquad (23)
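The CMRM quantities in Equations 19-23 can be sketched as follows; the training pairs are toy assumed data and the function names are our own:

```python
# A sketch of the CMRM estimates in Equations 19-23; alpha and beta follow
# the values reported from [9].
from collections import Counter

train = [(["sky", "tree"], ["b1", "b2", "b2"]),   # toy (words, blobs) per image
         (["water", "sky"], ["b3", "b1"])]

W = Counter(w for ws, _ in train for w in ws)      # |w in T|
B = Counter(b for _, bs in train for b in bs)      # |b in T|
T_size = sum(W.values()) + sum(B.values())         # |T|: all words and blobs
ALPHA, BETA = 0.1, 0.9

def p_w_given_J(w, words, blobs):                  # Equation 20
    J_size = len(words) + len(blobs)               # |J|: all words and blobs in J
    return (1 - ALPHA) * words.count(w) / J_size + ALPHA * W[w] / T_size

def p_b_given_J(b, words, blobs):                  # Equation 21
    J_size = len(words) + len(blobs)
    return (1 - BETA) * blobs.count(b) / J_size + BETA * B[b] / T_size

def p_joint(w, image_blobs):                       # Equation 19, uniform p(J)
    pJ = 1.0 / len(train)
    total = 0.0
    for words, blobs in train:
        prod = p_w_given_J(w, words, blobs)
        for b in image_blobs:
            prod *= p_b_given_J(b, words, blobs)
        total += pJ * prod
    return total

def p_image_given_w(w, image_blobs):               # Equation 22
    p_w = W[w] / sum(W.values())                   # Equation 23
    return p_joint(w, image_blobs) / p_w

print(p_image_given_w("sky", ["b1", "b2"]))
```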
Then we find the set of annotation words A for a new image I by Equation 1, where p(A|I) can be estimated as:

p(A|I) \propto p(A) \prod_{w_i \in A} p(I \mid w_i) \qquad (24)
Applying the semantic similarity language model defined in Equation 10, we can find the set of annotation words by:

A = \arg\max_A \left( \lambda_1 \log \prod_{w_k \in A} p(I \mid w_k) + \lambda_2 \log \frac{1}{n(n-1)} \sum_{w_i \in A} \; \sum_{w_j \in A, j \neq i} sim(w_i, w_j) \right) \qquad (25)

V. EXPERIMENTS

A. Dataset

The Corel 5K image corpus [6] is a publicly available and widely used dataset for evaluating image annotation methods. It contains 5000 images from 50 themes, with 100 images per theme. Each image is segmented into 1 to 10 regions using Normalized Cut [19], and 36 visual features covering color, texture, and shape are extracted for each region. All image regions are grouped into 500 visual blobs using K-Means clustering on the 36 features. Each image is annotated with 1 to 5 words, and a total of 374 words are used to annotate the entire dataset. The dataset is partitioned into a training set of 4500 images and a testing set of 500 images. We use 4000 images of the training set to train the improved models and the remaining 500 training images as a validation set to tune the weight parameters \lambda_1 and \lambda_2. After the best parameter setting is determined, we use all 4500 training images to train the improved models again and test them on the 500 testing images. There are 263 distinct words in the testing set. To provide a valid comparison with related work, we conducted the experiments on the Corel 5K dataset using the same visual features and the same visual blob clustering; a sketch of this blob quantization step follows.
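A sketch of that preprocessing with scikit-learn, using random placeholder features in place of the real 36-dimensional region features of [6]:

```python
# A sketch of the visual-blob preprocessing: region features are clustered
# into 500 blobs with K-Means. The random feature matrix is a placeholder.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
region_features = rng.random((10000, 36))     # one row per segmented image region

kmeans = KMeans(n_clusters=500, n_init=10, random_state=0).fit(region_features)
blob_ids = kmeans.labels_                     # blob index for each training region

# New image regions are mapped to the nearest cluster centroid.
new_regions = rng.random((4, 36))
print(kmeans.predict(new_regions))
```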
B. Evaluation Metrics

As the main motivation for automatic image annotation is annotation-based image retrieval, it is natural to use retrieval metrics to reflect the performance of the image annotation system. Most papers on automatic image annotation measure precision and recall through the process of retrieving testing images with single-keyword queries:

precision(w) = \frac{tp(w)}{tp(w) + fp(w)} \qquad (26)

recall(w) = \frac{tp(w)}{tp(w) + fn(w)} \qquad (27)
where tp(w) is the number of correctly retrieved images, fp(w) is the number of incorrectly retrieved images, and fn(w) is the number of relevant images not retrieved. precision(w) measures the correctness of annotating images with the word w, and recall(w) measures the completeness. In addition, the number of words with non-zero recall, i.e. the number of single-word queries for which at least one relevant image can be retrieved using the automatic annotations, is also an important metric: it indicates the range of words that contribute to the average precision and recall, and a biased model can achieve high precision and recall by performing well on only a small number of words commonly used in annotations.

C. Results

For a valid comparison with related work, we fixed the number of words used to annotate each image at 5, since the published results for the related work use this setting. From parameter tuning on the validation set, we found that the improved translation model (TM) [6] achieves its best annotation results when \lambda_1 = \lambda_2, while the improved cross media relevance model (CMRM) [9] performs best when \lambda_1 = 4\lambda_2. We performed single-word queries to retrieve images using the automatic annotations for all 263 words in the testing set. The number of words with non-zero recall ("# words" for short) for the translation model is 49; with the semantic similarity language model enhancement (TM+SSLM), it increases to 65. For the cross media relevance model, the enhancement (CMRM+SSLM) increases it from 66 to 90. Since the 263 testing words are unevenly distributed, we also take the union of the four query sets, yielding a set of 98 queries, and report performance on this set as well. The detailed results of the semantic similarity language model improved image annotation models are shown in Table III in comparison with the original models.
TABLE IV
AUTOMATIC ANNOTATION EXAMPLES (FIXED LENGTH OF 5 WORDS) OF TRANSLATION MODEL (TM) AND TM WITH SEMANTIC SIMILARITY LANGUAGE MODEL (TM+SSLM)

Images    (image 1)                            (image 2)                        (image 3)                            (image 4)
TM        flowers people mountain tree water   water tree snow buildings rocks  people buildings street cars plants  forest mare flowers tree street
TM+SSLM   flowers needles blooms cactus grass  plain snow forest coyote wolf    buildings shops street sign writing  forest horse mare foals flowers

TABLE V
AUTOMATIC ANNOTATION EXAMPLES (FIXED LENGTH OF 5 WORDS) OF CROSS MEDIA RELEVANCE MODEL (CMRM) AND CMRM WITH SEMANTIC SIMILARITY LANGUAGE MODEL (CMRM+SSLM)

Images      (image 1)                            (image 2)                    (image 3)                 (image 4)
CMRM        water sky tree people snow           water tree sky people ocean  sky water tree people plane  stone pillar tree sculpture people
CMRM+SSLM   snow fox pagoda railroad locomotive  ocean coral pool fish reefs  plane sky jet runway art     pillar shadows road stone temple

Fig. 3. 90 non-zero recall words in CMRM+SSLM annotation result, ordered by F-measure

TABLE III
EVALUATION RESULTS ON SINGLE KEYWORD RETRIEVAL

             on all 263 testing words          on 98 testing words
             # words   precision   recall      precision   recall
TM              49       4.0%       6.0%          9.9%     12.9%
TM+SSLM         65       6.5%       8.6%         16.1%     18.5%
CMRM            66       9.0%      10.0%         22.0%     25.1%
CMRM+SSLM       90      10.5%      13.1%         25.7%     33.0%
D. Discussion The translation model (TM) and cross-media relevance model (CMRM) both used unsupervised machine learning methods to tackle automatic image annotation. TM is a parametric model approach and represents the group of approaches that select annotation words based on the word probability conditioned on the image (or image blobs, image features, etc.). CMRM is a non-parametric model approach and represents the group of approaches that select annotation words based on the joint probability of each word and the
image (blobs or features). As the distribution of words in image annotations is usually highly unbalanced (in the Corel dataset, the distribution of annotation words follows Zipf's law), probabilistic models are often biased towards the words that occur more frequently. In our approach, we improve existing image annotation approaches that use word probability conditioned on the image or the word-image joint probability (such as TM and CMRM) by converting to image probability conditioned on the annotation words. In this way, the word probability bias is reduced, as reflected in the increase in non-zero recall words over the original probabilistic models in the experimental results of Table III. Since we use the semantic similarity language model on top of the image-word correlation model to estimate the prior probability of a set of words rather than of individual words, the overall annotation performance improves: the system is more likely to generate a semantically coherent annotation word set. The automatic annotation results for sample test images shown in Table IV and Table V illustrate the word bias reduction and the semantic coherence enhancement achieved by the language model. Although our experiments compared against TM and CMRM, the language model can be applied to other image annotation approaches that use word probability conditioned on the image or word-image joint probability as well.

VI. CONCLUSION

As automatic image annotation has attracted much attention in recent years, we propose in this paper a framework of using language models to improve existing probabilistic image annotation approaches. We also propose a specific language model, the semantic similarity language model, to improve the performance of image annotation methods such as the machine translation model and the cross media relevance model. By adding the language model to the original image annotation models, the overall annotation performance improves, as the annotation word bias is reduced and the system is more likely to generate semantically coherent annotation word sets. We tested the improved models on a widely used image annotation corpus, the Corel 5K dataset. Our experimental results show that, in comparison with the original models, the performance of image annotation improves significantly when the semantic similarity language model is added.

ACKNOWLEDGMENT

We thank Kobus Barnard for making the Corel dataset available. This work is supported by MOE grant R252-000-349-112.
REFERENCES
[1] H. Muller, N. Michoux, D. Bandon, and A. Geissbuhler, "A review of content-based image retrieval systems in medical applications: Clinical benefits and future directions," International Journal of Medical Informatics, vol. 73, pp. 1–23, 2004.
[2] O. Chapelle, P. Haffner, and V. Vapnik, "Support vector machines for histogram-based image classification," IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 1055–1064, September 1999.
[3] J. Z. Wang and J. Li, "Learning-based linguistic indexing of pictures with 2-D MHMMs," in Proceedings of the tenth ACM international conference on Multimedia, 2002, pp. 436–445.
[4] G. Carneiro and N. Vasconcelos, "Formulating semantic image annotation as a supervised learning problem," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2005, pp. 163–168.
[5] Y. Mori, H. Takahashi, and R. Oka, "Image-to-word transformation based on dividing and vector quantizing images with words," in Proceedings of the First International Workshop on Multimedia Intelligent Storage and Retrieval Management, 1999, pp. 405–409.
[6] P. Duygulu, K. Barnard, J. F. G. de Freitas, and D. A. Forsyth, "Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary," in Proceedings of the European Conference on Computer Vision (ECCV), 2002, pp. 97–112.
[7] D. M. Blei and M. I. Jordan, "Modeling annotated data," in Proceedings of the ACM SIGIR International Conference on Research and Development in Information Retrieval, 2003, pp. 127–134.
[8] F. Monay and D. Gatica-Perez, "PLSA-based image auto-annotation: constraining the latent space," in Proceedings of the 12th annual ACM international conference on Multimedia, 2004, pp. 348–351.
[9] J. Jeon, V. Lavrenko, and R. Manmatha, "Automatic image annotation and retrieval using cross-media relevance models," in Proceedings of the 26th annual international ACM SIGIR conference on Research and Development in Information Retrieval, 2003.
[10] V. Lavrenko, R. Manmatha, and J. Jeon, "A model for learning the semantics of pictures," in Proceedings of Advances in Neural Information Processing Systems, 2003.
[11] X. He and R. S. Zemel, "Learning hybrid models for image annotation with partially labeled data," in Proceedings of Advances in Neural Information Processing Systems (NIPS 2008), 2008.
[12] M. Kelm, C. Pal, and A. McCallum, "Combining generative and discriminative methods for pixel classification with multi-conditional learning," in Proceedings of the 18th International Conference on Pattern Recognition (ICPR 2006), vol. 2, 2006, pp. 828–832.
[13] R. Jin, J. Y. Chai, and L. Si, "Effective automatic image annotation via a coherent language model and active learning," in Proceedings of the 12th annual ACM International Conference on Multimedia, 2004, pp. 892–899.
[14] J. Liu, M. Li, Q. Liu, H. Lu, and S. Ma, "Image annotation via graph learning," Pattern Recognition, vol. 42, no. 2, pp. 218–228, February 2009.
[15] J. Mitchell and M. Lapata, "Language models based on semantic composition," in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, 2009, pp. 430–439.
[16] P. F. Brown, J. Cocke, S. A. D. Pietra, V. J. D. Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer, and P. S. Roossin, "A statistical approach to machine translation," Computational Linguistics, vol. 16, no. 2, pp. 79–85, 1990.
[17] P. F. Brown, V. J. D. Pietra, S. A. D. Pietra, and R. L. Mercer, "The mathematics of statistical machine translation: Parameter estimation," Computational Linguistics, vol. 19, pp. 263–311, 1993.
[18] V. Lavrenko and W. Croft, "Relevance-based language models," in Proceedings of the 24th annual international ACM SIGIR conference on Research and Development in Information Retrieval, 2001.
[19] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888–905, August 2000.