Bipartite Graph Reinforcement Model for Web Image Annotation

Xiaoguang Rui
MOE-MS Key Lab of MCC, University of Science and Technology of China
+86-551-3600681
[email protected]

Mingjing Li, Zhiwei Li, Wei-Ying Ma
Microsoft Research Asia, 49 Zhichun Road, Beijing 100080, China
+86-10-58968888
{mjli, zli, wyma}@microsoft.com

Nenghai Yu
MOE-MS Key Lab of MCC, University of Science and Technology of China
+86-551-3600681
[email protected]

ABSTRACT
Automatic image annotation is an effective way to manage and retrieve the abundant images on the internet. In this paper, a bipartite graph reinforcement model (BGRM) is proposed for web image annotation. Given a web image, a set of candidate annotations is extracted from its surrounding text and other textual information in the hosting web page. As this set is often incomplete, it is extended to include more potentially relevant annotations by searching and mining a large-scale image database. All candidates are modeled as a bipartite graph. Then a reinforcement algorithm is performed on the bipartite graph to re-rank the candidates. Only those with the highest ranking scores are reserved as the final annotations. Experimental results on real web images demonstrate the effectiveness of the proposed model.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Retrieval Models.

General Terms
Algorithms, Measurement, Experimentation.

Keywords
Automatic image annotation, bipartite graph model.

1. INTRODUCTION
The content on the web is shifting from text to multimedia as the amount of multimedia documents grows at a phenomenal rate. In particular, images are the major source of multimedia information available on the internet. Since 2005, Google [7] and Yahoo [27] have already indexed over one billion images. In addition, some online photo-sharing communities, such as Photo.Net [17] and PhotoSIG [18], have image collections on the order of millions, contributed entirely by their users. To access and utilize this abundant information efficiently and effectively, those images should be properly indexed. Existing image indexing methods can be roughly classified into two categories, based on either text or visual content.

The initial image management approach was to manually annotate images with semantic concepts so that people can retrieve images using keyword queries. However, this approach suffers from inconsistency and subjectivity among different annotators, and the annotation process is time-consuming and tedious as well. Consequently, it is impractical to manually annotate the huge number of images on the web. Content-based image retrieval (CBIR) was proposed to index images using visual features and to perform image retrieval based on visual similarities. However, due to the well-known semantic gap [20], the performance of CBIR systems is far from satisfactory. Thus, neither approach is feasible for web images.

To overcome the aforementioned limitations, many researchers have devoted themselves to automatic image annotation. If it can be achieved, the problem of image retrieval is reduced to a text retrieval problem, and many well-developed text retrieval algorithms can be readily applied to search for images by ranking the relevance between image annotations and textual queries. However, most image annotation algorithms are not specifically designed for web images: they are mainly based on content analysis and do not utilize the rich textual information associated with web images.

On the other hand, current commercial image search engines index web images using the surrounding text and other textual information in the hosting web pages. The underlying assumption is that web images are purposely embedded into web pages, so the text in hosting pages is more or less related to the semantic content of the web images. Therefore, such textual information can be used as approximate annotations of web images. Although very simple, this approach works reasonably well in some cases. For most web images, however, such annotations have many shortcomings. Take the pictures shown in Fig. 1 as an example. From the surrounding text of the web images, we extract some keywords as candidate annotations. The first image is annotated by "bird", "ligan", "American", "Morris" and "coot", and the second image by "color", "degging", "rose", "spacer", "flower", "card" and "multiflora". First, these candidate annotations are usually noisy and contain irrelevant words: the first image might have been taken in "American" by "Morris", which is not explicitly expressed in the image, and the words "spacer" and "card" for the second image come from advertisements. Second, they do not fully describe the semantic content of the images, such as the "blue river" in the first image and the "green leaf" in the second image. Obviously, the annotations extracted from surrounding text are inaccurate and incomplete.

In this paper, we propose a bipartite graph reinforcement model (BGRM) for web image annotation, which sufficiently utilizes both the visual features and the textual information of web images.
Figure 1. Examples of images with surrounding keywords: (a) "bird", "ligan", "American", "Morris", "coot"; (b) "color", "degging", "rose", "spacer", "flower", "card", "multiflora".
Probabilistic model based methods attempt to infer the correlations or joint probabilities between images and annotations. Representative works include the Co-occurrence Model [16], the Translation Model (TM) [5], the Latent Dirichlet Allocation Model (LDA) [2], the Cross-Media Relevance Model (CMRM) [8], the Continuous Relevance Model (CRM) [13], and the Multiple Bernoulli Relevance Model (MBRM) [6]. However, these approaches do not focus on annotating web images and neglect the available textual information of images. Furthermore, compared with the potentially unlimited vocabulary of web-scale image databases, they can only model a very limited number of concepts on a small-scale image database by learning projections or correlations between images and keywords. Therefore, these approaches cannot be directly applied to annotate web images.
Given a web image, some candidate annotations are first extracted from its surrounding text and other textual information in the hosting web page. As those candidates are incomplete, more candidates are derived from them to include additional potentially relevant annotations by searching and mining a large-scale, high-quality image collection. For each candidate, a ranking score is defined using both visual and textual information to measure how likely it is to annotate the given image. The two kinds of candidates are then modeled as a bipartite graph, on which a reinforcement algorithm is performed to iteratively refine the ranking scores. After convergence, all candidates are re-ranked and only those with the highest ranking scores are reserved as the final annotations. In this way, some noisy annotations may be removed and some correct ones may be added, so the overall annotation accuracy can be improved. Experiments on over 5,000 web images show that BGRM is more effective than traditional annotation algorithms such as the WordNet-based method [10].
As automatic image annotation is often not accurate enough, some methods have been proposed to refine the annotation result. Jin et al. [10] achieved annotation refinement based on WordNet by pruning irrelevant annotations. The basic assumption is that highly correlated annotations should be reserved while non-correlated annotations should be removed. In that work, however, only global textual information is used, and the refinement process is independent of the target image. This means that different images with the same candidate annotations obtain the same refinement result. To further improve the performance, the image content should be considered as well. Recently, several search-based methods have been proposed for image annotation [22][24][26], which combine text-based web image search and content-based image retrieval (CBIR) in the annotation process. The image annotations are obtained by leveraging a web-scale image database. However, [24] assumed that an accurate keyword for the image in consideration was available. Although an accurate keyword might speed up the search process and enhance the relevance of the retrieved images, it is not always available for web images. Wang et al. [22] discarded this assumption and estimated the annotations of images by performing CBIR first, but this leads to poorer performance than [24]. Rui et al. [26] proposed to select annotations for web images from available noisy textual information. In our work, we also adopt a search-based method and assume that several noisy and incomplete keywords are available for image annotation. This assumption is more reasonable for web images because it is easy to extract such keywords from the textual information on hosting web pages. The assumption is similar to that of [26], but [26] only considered the inaccuracy of the initial keywords and ignored their incompleteness.
Our contributions are multifold:
• We propose to extract initial candidate annotations from the surrounding text and to extend the candidates via a search-based method;
• We propose a novel method to define the ranking scores of candidates based on both visual and textual information;
• We design a bipartite graph reinforcement model to re-rank candidate annotations;
• We design several schemes to determine the final annotations;
• Based on BGRM, we develop a web image annotation system that utilizes the available textual information and leverages a large-scale image database.
The remainder of the paper is organized as follows: Section 2 lists some related work. Section 3 gives the overview of our web image annotation approach. Candidate annotation extraction and ranking are described in Sections 4 and 5. The main idea of BGRM is introduced in Section 6. In Section 7, we describe the final annotation determination schemes. The experimental results are provided in Section 8. We conclude and suggest future work in Section 9.
3. OVERVIEW OF BIPARTITE GRAPH REINFORCEMENT MODEL
The proposed bipartite graph reinforcement model (BGRM) works in the following way for image annotation. At first, images are annotated with some candidate keywords, which may be noisy and incomplete. The initial candidates may be obtained by applying traditional image annotation algorithms or by analyzing the surrounding text of a web image. On account of the incompleteness of the initial candidates, more candidate keywords are estimated by submitting each candidate as a query to an image search engine and then clustering the search results. BGRM then models all the words as a bipartite graph. To remove the noisy words, all candidate annotations are re-ranked by reinforcement on the bipartite graph. Only the top-ranked ones are reserved as the final annotations.
2. RELATED WORK
Some initial efforts have recently been devoted to automatically annotating images by leveraging decades of research in computer vision, image understanding, image processing, and statistical learning [1]. Most existing annotation approaches are either classification based or probabilistic modeling based. The classification based methods try to associate words or concepts with images by learning classifiers, such as the Bayes point machine [3], the support vector machine (SVM) [4], and two-dimensional multi-resolution hidden Markov models (2D MHMMs) [14].
Figure 2. Bipartite graph reinforcement model for web image annotation.
BGRM is shown in Fig. 2. It consists of the following components: initial candidate word extraction, extended candidate word generation, candidate ranking, bipartite graph construction, reinforcement learning, and final annotation determination. BGRM is flexible in the sense that its components are relatively independent of each other. Thus any improvement made in one component can be easily incorporated into this model to improve its overall performance. We describe each component in detail in the following.
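To illustrate the claimed independence of the components, here is a hypothetical sketch (not the authors' implementation) of how such a pipeline could be wired together behind narrow interfaces; the class and method names are placeholders introduced only for this example.

```python
# Hypothetical component interfaces for a BGRM-style pipeline (illustration only).
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class Candidate:
    """A candidate annotation with its current ranking score."""
    word: str
    score: float = 0.0
    is_extended: bool = False  # True for extended words, False for initial words


class CandidateExtractor(Protocol):
    def extract(self, page_text: str) -> List[Candidate]: ...


class CandidateExtender(Protocol):
    def extend(self, image_feature: List[float], initial: List[Candidate]) -> List[Candidate]: ...


class Reranker(Protocol):
    def rerank(self, initial: List[Candidate], extended: List[Candidate]) -> List[Candidate]: ...


@dataclass
class AnnotationPipeline:
    """Wires the independent components together; any one can be replaced."""
    extractor: CandidateExtractor
    extender: CandidateExtender
    reranker: Reranker
    top_n: int = 5

    def annotate(self, image_feature: List[float], page_text: str) -> List[str]:
        initial = self.extractor.extract(page_text)
        extended = self.extender.extend(image_feature, initial)
        ranked = self.reranker.rerank(initial, extended)
        return [c.word for c in sorted(ranked, key=lambda c: c.score, reverse=True)[: self.top_n]]
```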
For one target web image I, each word qi in its initial word set Q is used to query the text-based image search system to find semantically related images. This process is applied to every initial candidate annotation of I. Then, from the semantically related images, visually related images are found by computing the content-based image similarity between the target image and the retrieved images. After these two search stages, each target image and its initial words obtain a search result containing the semantically and visually related images together with their textual descriptions. The search result is not only highly useful for extending words, but also benefits initial word ranking. The search result of word w for image I can be represented as SR(I, w) = {(im1, sim1, de1), (im2, sim2, de2), …, (iml, siml, del)}, where im is an image obtained by querying with I and w, sim is the visual similarity between im and I, de is the textual description of im, and l is the total number of images in the search result.
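A minimal sketch of this two-stage search and of the SR(I, w) structure, assuming a toy in-memory collection in which each database image carries a visual feature vector and a textual description (all names and data here are illustrative, not from the paper):

```python
# Toy two-stage search producing SR(I, w) = [(image_id, visual_similarity, description), ...].
import math
from typing import Dict, List, Tuple

ImageDB = Dict[str, Tuple[List[float], str]]  # image_id -> (feature vector, textual description)


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search_result(target_feature: List[float], word: str, db: ImageDB,
                  top_l: int = 100) -> List[Tuple[str, float, str]]:
    # Stage 1: text-based search -- keep images whose description mentions the word.
    semantically_related = [(img_id, feat, desc) for img_id, (feat, desc) in db.items()
                            if word.lower() in desc.lower()]
    # Stage 2: content-based re-ranking against the target image.
    sr = [(img_id, cosine(target_feature, feat), desc)
          for img_id, feat, desc in semantically_related]
    sr.sort(key=lambda t: t[1], reverse=True)
    return sr[:top_l]


if __name__ == "__main__":
    db: ImageDB = {
        "im1": ([0.9, 0.1, 0.0], "a coot bird swimming on a blue river"),
        "im2": ([0.2, 0.8, 0.1], "red rose flower with green leaf"),
        "im3": ([0.8, 0.2, 0.1], "bird on the water at sunset"),
    }
    print(search_result([1.0, 0.0, 0.0], "bird", db))
```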
4. CANDIDATE ANNOTATION EXTRACTION
In our model, two sets of candidate annotations are extracted for each web image. Initially, some candidate annotations are extracted from related textual information such as the surrounding text. Because the surrounding text does not always describe the entire semantic content of the image, we also extend the annotations by searching and mining a large-scale image database. The assumption is that if certain images in the database are visually similar to the target image and semantically related to the candidate annotations, the textual descriptions of those images should also be correlated with the target image. Thus, the extended annotations for the target image can be extracted from them.
Finally, extended words are extracted by mining the search result SR of each initial word using the search result clustering (SRC) algorithm [28]. Different from traditional clustering approaches, SRC clusters documents by ranking salient phrases. It first extracts salient phrases, calculates several properties such as phrase frequencies, and combines these properties into a salience score based on a pre-learned regression model. As SRC is capable of generating highly readable cluster names, these cluster names can be used as extended candidate annotations. For each target image, SRC is used to cluster the descriptions in the search result. After all cluster names are merged and duplicate words are discarded, the extended candidate annotations X are obtained.
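The clustering itself is done by SRC [28]; the merging and de-duplication step that produces X can be sketched as follows. Whether cluster names are kept as whole phrases or split into words is an implementation choice we assume here, and the simple lower-casing stands in for the stop-word removal and stemming used elsewhere in the pipeline.

```python
# Merge cluster names from the search results of all initial words into the
# extended candidate set X, discarding duplicates and words already in Q.
from typing import Iterable, List


def merge_cluster_names(cluster_names_per_word: Iterable[List[str]],
                        initial_words: List[str]) -> List[str]:
    seen = {w.lower() for w in initial_words}
    extended: List[str] = []
    for names in cluster_names_per_word:          # one list of salient-phrase names per initial word
        for name in names:
            for token in name.lower().split():    # cluster names may be multi-word phrases
                if token not in seen:
                    seen.add(token)
                    extended.append(token)
    return extended


# Example: cluster names mined for the initial words "bird" and "coot".
print(merge_cluster_names([["water bird", "blue river"], ["american coot", "river"]],
                          ["bird", "coot"]))
# -> ['water', 'blue', 'river', 'american']
```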
4.1 Initial Annotation Extraction
Several sources of information on the hosting page are more or less related to the semantic content of a web image, e.g. the file name, ALT text, URL and surrounding text. After stop word removal and stemming, each word is ranked using a standard text processing technique (such as tf*idf), and the highest-ranked words are retained as the initial candidate annotations Q.
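For instance, a plain tf*idf ranking of the words gathered from these sources might look like the following sketch; the tiny document-frequency table is an assumption made only for illustration (in practice the statistics would come from a background corpus).

```python
# Rank candidate words from the hosting page by tf*idf and keep the top ones as Q.
import math
from collections import Counter
from typing import Dict, List


def rank_initial_candidates(page_words: List[str], doc_freq: Dict[str, int],
                            num_docs: int, top_n: int = 5) -> List[str]:
    tf = Counter(page_words)
    scores = {w: count * math.log(num_docs / (1 + doc_freq.get(w, 0)))
              for w, count in tf.items()}
    return [w for w, _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]]


# Words surviving stop-word removal and stemming, e.g. from Fig. 1(a).
words = ["bird", "bird", "coot", "american", "morris", "ligan"]
df = {"bird": 120, "coot": 3, "american": 800, "morris": 40, "ligan": 2}
print(rank_initial_candidates(words, df, num_docs=1000))
```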
A potential problem is that the initial words are noisy, which may degrade the quality of the extended words. Surprisingly, the experimental results (see Section 8.4) show that the average precision of the extended words is even a little higher than that of the initial words.
4.2 Extended Annotation Extraction
Extended annotations are obtained by a search-based method. Each initial annotation and its image are used to query an image search engine to find semantically and visually related images, and more annotations are extracted from the search result. For this purpose, about 2.4 million images were collected from photo sharing sites, e.g. Photo.Net [17] and PhotoSIG [18], and a text-based image search system was built on top of them. We notice that people are creating and sharing a large number of high-quality images on these sites. In addition, images on such sites have rich metadata, such as titles and descriptions provided by the photographers. As shown in Fig. 3, this textual information reflects the semantic content of the corresponding images to some extent, though it may be noisy. Thus those images can be used to extend the initial annotation set.
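A text-based image search system over such a photo collection can be as simple as an inverted index from metadata terms to image ids. The sketch below is a simplified assumption of how an index over titles and descriptions might be organized, not the authors' actual system.

```python
# Minimal inverted index over image metadata (title + description) for text-based image search.
from collections import defaultdict
from typing import Dict, List, Set


class TextImageIndex:
    def __init__(self) -> None:
        self.postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, image_id: str, metadata: str) -> None:
        for term in metadata.lower().split():
            self.postings[term].add(image_id)

    def query(self, keyword: str) -> List[str]:
        return sorted(self.postings.get(keyword.lower(), set()))


index = TextImageIndex()
index.add("p1", "shadow cat antique oak mission style bed")
index.add("p2", "mountain view hiked resort mountain new hampshire")
print(index.query("mountain"))  # -> ['p2']
```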
Figure 3. Example images and their descriptions (e.g., an image titled "shadow cat" described by the photographer's account of a cat on an antique oak mission-style bed, and an image titled "mountain view" described by a hike up a resort mountain in New Hampshire, shot dark on purpose to bring out the sun and clouds).
After careful observation in the experiments, we found that a precise initial word tends to propagate more precise extended words, and that even an imprecise initial word can produce precise extended words under certain conditions, because the visual information also takes effect when extending words. These facts help guarantee the quality of the extended annotations.
5. CANDIDATE ANNOTATION RANKING
After acquiring the candidate annotations for each image, a ranking value is defined for each candidate using both visual and textual information to measure how consistent it is with the target image.

5.1 Initial Annotation Ranking
The visual consistence of an initial word can be indicated by the visual similarities between the images in its search result and the target image. We utilize these scores to compute the visual ranking value of an initial word. First, the visual similarity scores are sorted in descending order. Then, the average of the top K visual similarity scores is taken as the ranking value. For each initial word qi of the target image I, the visual ranking value rankv(qi|I) is calculated as follows:

$$rank_v(q_i \mid I) = \frac{1}{K} \sum_{j=1}^{K} sim_{q_i}(j, I) \qquad (1)$$

where sim_{q_i}(j, I) denotes the j-th largest visual similarity score in the search result of q_i.

In order to estimate the textual consistence, we first compute the similarity of keywords within one web image by checking how frequently one keyword appears in the search result of another. For the target image I, we count the frequency Feq_{q_k}(q_i) of the initial word q_i appearing in the textual descriptions of images in the search result of the initial word q_k, and the frequency Feq_{q_i}(q_k) of q_k appearing in the search result of q_i. Feq_{q_k}(q_i) and Feq_{q_i}(q_k) reflect the local relation of q_i and q_k, so the similarity between them can be defined as follows:

$$sim_t(q_i, q_k \mid I) = Feq_{q_k}(q_i) + Feq_{q_i}(q_k) \qquad (2)$$

Generally speaking, the more common a keyword is, the more chance it has to associate with other keywords; however, such associations have lower reliability. Therefore, we weight the counts according to the uniqueness of each keyword, i.e. we set a lower weight for frequent keywords and a higher weight for unique keywords. Finally, the similarities of initial words in the target image I are calculated by modifying Eqn. (2):

$$sim_t(q_i, q_k \mid I) = Feq_{q_k}(q_i)\log\!\big(N_D/N(q_i)\big) + Feq_{q_i}(q_k)\log\!\big(N_D/N(q_k)\big) \qquad (3)$$

where N(q_i) is the number of training images whose descriptions contain the word q_i, and N_D is the total number of images in the dataset.

This approach measures the textual similarity between keywords in a local way. It not only considers the similarity between the words but also takes into account their relations to the image: the textual similarity between two words is high only when both are closely related to the target image and always appear together in the web page. This is different from traditional methods such as the WordNet method [10] and pairwise co-occurrence [23], which only consider the relation between the two words. Compared with these traditional methods, our local textual similarity measure is more suitable for the annotation ranking task, which is also demonstrated by the experimental results. Another reason for using this measure is that the search results used to compute the local similarity have already been obtained in the process of extending words. Hence, applying the local similarity to initial word ranking incurs little extra cost.

After calculating the textual similarity, the textual ranking value rankt(qi|I) of the initial word q_i is defined as the normalized summation of the local textual similarities between q_i and the other initial words of image I:

$$rank_t(q_i \mid I) = \sum_{k(\neq i)} sim_t(q_i, q_k \mid I) \Big/ \sum_{i(\neq k)} \sum_{k(\neq i)} sim_t(q_i, q_k \mid I) \qquad (4)$$

where the denominator is the normalization factor.

After obtaining the above two types of initial word rankings, we first normalize them into [0, 1] and then fuse them using a weighted linear combination scheme to define the ranking value of an initial word q_i:

$$F_0(q_i \mid I) = a \times rank_v(q_i \mid I) + (1 - a) \times rank_t(q_i \mid I) \qquad (5)$$

where a is a weight ranging from 0 to 1. Because text features are generally more effective than image features in web-based approaches [25], the value of a is set to less than 0.5.

5.2 Extended Annotation Ranking
The ranking value of an extended annotation is defined in a different way. As an extended candidate is actually the name of a search result cluster, its ranking value is estimated by the average similarity between the images in the corresponding cluster and the target image [24]. If the member images of a cluster are relevant to the query, the concepts learned from this cluster are likely to represent the content of the query image. Considering the uniqueness of each keyword, we also weight this value using the textual information to define the ranking score:

$$C_0(x_i \mid I) = v(x_i)\log\!\big(N_D/N(x_i)\big) \qquad (6)$$

where x_i is an extended word of image I and v(x_i) is the average visual similarity between the member images of the corresponding cluster and the target image.
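To make the ranking definitions in Eqns. (1)-(6) concrete, here is a small self-contained sketch under simplifying assumptions: the similarity scores, frequency counts and parameter values are toy numbers, and the helper names are illustrative rather than taken from the paper. Note that the paper normalizes rank_v and rank_t into [0, 1] before fusion; the sketch assumes the inputs are already in that range.

```python
# Toy illustration of the ranking values in Eqns. (1)-(6).
import math
from typing import Dict, List


def rank_v(similarities: List[float], K: int = 3) -> float:
    """Eqn. (1): average of the top-K visual similarities from the word's search result."""
    top = sorted(similarities, reverse=True)[:K]
    return sum(top) / len(top) if top else 0.0


def sim_t(feq_k_of_i: int, feq_i_of_k: int, n_qi: int, n_qk: int, ND: int) -> float:
    """Eqn. (3): co-occurrence counts weighted by keyword uniqueness (idf-like log terms)."""
    return feq_k_of_i * math.log(ND / n_qi) + feq_i_of_k * math.log(ND / n_qk)


def rank_t(word: str, sims: Dict[tuple, float]) -> float:
    """Eqn. (4): normalized sum of local textual similarities to the other initial words."""
    numer = sum(v for (a, b), v in sims.items() if a == word)
    denom = sum(sims.values())
    return numer / denom if denom else 0.0


def F0(rv: float, rt: float, a: float = 0.3) -> float:
    """Eqn. (5): fuse visual and textual rankings; a < 0.5 favours the textual part."""
    return a * rv + (1 - a) * rt


def C0(avg_member_sim: float, n_xi: int, ND: int) -> float:
    """Eqn. (6): extended-word score weighted by keyword uniqueness."""
    return avg_member_sim * math.log(ND / n_xi)


# Two initial words for one target image, with assumed search-result statistics.
ND = 2_400_000
pair_sims = {("bird", "coot"): sim_t(5, 4, 1200, 300, ND),
             ("coot", "bird"): sim_t(4, 5, 300, 1200, ND)}
rv_bird = rank_v([0.9, 0.8, 0.7, 0.2])
print("F0(bird)  =", round(F0(rv_bird, rank_t("bird", pair_sims)), 3))
print("C0(river) =", round(C0(0.6, 15_000, ND), 3))
```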
6. BIPARTITE GRAPH CONSTRUCTION AND REINFORCEMENT LEARNING
In this section, we describe the bipartite graph reinforcement model (BGRM) for re-ranking the candidate annotations of a web image. BGRM is based on a graph model, so we first introduce the construction of the graph. We then describe the iterative form of our algorithm, followed by a demonstration of its convergence. Additionally, we also give the non-iterative form of BGRM.

6.1 Graph Construction
Initial and extended candidate annotations are heterogeneous annotations for web images. First, initial words are direct descriptions of the target image, while extended words are mined from the large-scale image database and can only describe the target image indirectly, by propagating the descriptions of the semantically and visually related images. Second, extended words derived from the same initial word tend to be similar to each other, so the similarities between extended words are partly decided by their initial word; similarities between initial words do not have this characteristic. Therefore, the two kinds of words cannot be re-ranked using a unified measure. However, they do have close relations: for example, if an initial word is precise, its extended words are probably precise, and vice versa. Consequently, we model the candidate annotations of a web image as a bipartite graph.

To construct the bipartite graph G, the initial and extended candidate annotations are taken as the two disjoint sets of graph vertices, and vertices from different sets are connected by edges with proper weights.

The weight of an edge is defined using the relations between initial and extended words. A subtle point is that we assign a non-zero weight to an edge only if the relation between its two vertices is close enough. For the two vertices q_i and x_j of an edge, we consider them closely related if (1) x_j is extended by q_i, or (2) q_i is sufficiently similar to x_j. The weight of the edge is therefore calculated as follows:

$$\omega_{ij} = \begin{cases} 1 + s(q_i, x_j \mid th) & \text{if } x_j \text{ is extended by } q_i \\ s(q_i, x_j \mid th) & \text{otherwise} \end{cases} \qquad s(q_i, x_j \mid th) = \begin{cases} s(q_i, x_j) & \text{if } s(q_i, x_j) > th \\ 0 & \text{otherwise} \end{cases} \qquad (7)$$

where ω_ij is the weight, s(·) is the textual similarity between the two words, and s(·|th) is the textual similarity thresholded by a pre-defined value th. In other words, starting from an initial weight ω_ij of 0, Eqn. (7) adds 1 to ω_ij if x_j is extended by q_i, and adds s(q_i, x_j) if the similarity between the two words is above the threshold th.

6.2 Reinforcement Learning
Let F_n and C_n denote the ranking score vectors of the initial and extended words at iteration n, with F_0 and C_0 given by Eqns. (5) and (6). The two sets of scores are iteratively reinforced through the bipartite graph:

$$\begin{cases} C_{n+1} = \alpha C_0 + (1-\alpha) L^T F_n \\ F_{n+1} = \beta F_0 + (1-\beta) L C_{n+1} \end{cases} \qquad (8)$$

$$L = D_r^{-1} W D_c^{-1} \qquad (9)$$

where α and β are weighting parameters between 0 and 1, W is the original adjacency matrix of G, D_r is the diagonal matrix whose (i, i)-element equals the sum of the i-th row of W, and D_c is the diagonal matrix whose (i, i)-element equals the sum of the i-th column of W. The terms αC_0 and βF_0 keep a certain confidence on the initial values. Meanwhile, L^T F reveals initial word rankings via the link relations to reinforce extended word rankings, and LC reveals extended word rankings via their link relations to reinforce initial word rankings.

Eqn. (8) also shows that in each iteration C is reinforced first and F is then reinforced by the updated C. This reflects a stronger belief in the initial word ranking than in the extended word ranking, which is demonstrated by the experimental results on web images. This fact also affects the selection of α and β: a greater value of β is always chosen to place more confidence on the initial ranking of the initial words.

6.3 Convergence
Let us show that the sequences {C_n} and {F_n} converge. By the iteration in Eqn. (8), we have

$$\begin{cases} C_{n+1} = (\gamma L^T L)^{n+1} C_0 + \alpha\Big[\sum_{i=0}^{n} (\gamma L^T L)^i\Big] C_0 + (1-\alpha)\beta\Big[\sum_{i=0}^{n} (\gamma L^T L)^i\Big] L^T F_0 \\ F_{n+1} = (\gamma L L^T)^{n+1} F_0 + \beta\Big[\sum_{i=0}^{n} (\gamma L L^T)^i\Big] F_0 + (1-\beta)\alpha\Big[\sum_{i=0}^{n} (\gamma L L^T)^i\Big] L C_0 \end{cases}$$

where γ = (1 − α)(1 − β). Since 0 ≤ γ < 1, the first terms vanish and the geometric sums converge as n grows, so the sequences {C_n} and {F_n} converge.
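As a concrete illustration of Eqns. (7)-(9), the following self-contained sketch builds a small bipartite weight matrix and runs the reinforcement iteration until the scores stabilize. The similarity values, the extension relation, the threshold and the parameter settings are all assumptions made for illustration; this is not the authors' code.

```python
# Toy bipartite graph reinforcement following Eqns. (7)-(9).
import numpy as np

# Initial words Q (rows) and extended words X (columns).
Q = ["bird", "coot"]
X = ["river", "water", "feather"]

# Assumed textual similarities s(q_i, x_j) and extension relation for illustration.
s = np.array([[0.6, 0.7, 0.4],
              [0.5, 0.2, 0.8]])
extended_by = np.array([[1, 1, 0],        # is x_j extended by q_i?
                        [0, 0, 1]], dtype=bool)
th = 0.3

# Eqn. (7): edge weights of the bipartite graph.
s_th = np.where(s > th, s, 0.0)
W = s_th + extended_by.astype(float)

# Eqn. (9): normalize by row sums and column sums.
Dr_inv = np.diag(1.0 / W.sum(axis=1))
Dc_inv = np.diag(1.0 / W.sum(axis=0))
L = Dr_inv @ W @ Dc_inv

# Initial scores F0 (initial words, Eqn. (5)) and C0 (extended words, Eqn. (6)), assumed here.
F0 = np.array([0.8, 0.6])
C0 = np.array([0.5, 0.4, 0.3])
alpha, beta = 0.3, 0.7            # beta > alpha: more confidence in the initial word scores

# Eqn. (8): reinforce C first, then F with the updated C, until the scores stop changing.
F, C = F0.copy(), C0.copy()
for _ in range(100):
    C_new = alpha * C0 + (1 - alpha) * (L.T @ F)
    F_new = beta * F0 + (1 - beta) * (L @ C_new)
    converged = np.abs(C_new - C).max() < 1e-8 and np.abs(F_new - F).max() < 1e-8
    F, C = F_new, C_new
    if converged:
        break

print("final initial-word scores F:", dict(zip(Q, F.round(3))))
print("final extended-word scores C:", dict(zip(X, C.round(3))))
```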