Web-Data Driven Approach for Bridging the Gap between Image Content and Concept Xin-Jing Wang1,3, Jian-Tao Sun2,3, Wei-Ying Ma3, Xing Li1, 1

1 Department of Electronic Engineering, Tsinghua University, 100084 Beijing, P.R. China
[email protected], [email protected]
2 Department of Computer Science, Tsinghua University, 100084 Beijing, P.R. China
[email protected]
3 Microsoft Research Asia, 100080 Beijing, P.R. China
[email protected]

Abstract. Due to the semantic gap, the current content-based image retrieval framework cannot satisfy the complex demands created by a user's preferences and subjectivity. Retrieving images at the concept level has become an urgent need. However, a key challenge of such applications is obtaining enough training data to learn the mapping functions from low-level feature spaces to high-level semantics. In this paper, we propose to use Web images as training data and the Web link structure as manifold-learning clues to automatically convert image visual descriptors into semantic concepts. An image thesaurus is constructed by first extracting the right keywords and associating them with the corresponding regions in the images, and then organizing the descriptors according to the website link structure. A concept-based image retrieval method that effectively leverages the learned thesaurus is also presented.

1 Introduction

Although content-based image retrieval has achieved great success, the traditional visual similarity matching techniques cannot, due to the semantic gap, satisfy the complex demands created by a user's preferences and subjectivity. One solution is to convert image visual descriptors into semantic concepts. However, a key problem in this direction is the lack of training data for learning effective mapping functions from low-level feature spaces to high-level semantics. As the Web provides abundant textual descriptions for images, many researchers have begun to explore possible uses of such annotations [1][2][3]. This paper extends [1], in which an image thesaurus is constructed as a vehicle to bridge the semantic gap. The key idea is to associate the right keywords, extracted from Web image annotations, with the corresponding regions in the images through a data-driven approach.

The thesaurus contains two parts: a codebook that is trained to partition the feature space into sub-spaces, and a correlation matrix that indicates how often two given concepts co-occur in the same image, as shown in Figure 1. In [1], a vision-based web page analysis technique [4] is used to extract accurate image surrounding texts, from which a key term is selected as the main semantic concept of the image. Meanwhile, the key region of the segmented image is extracted using an attention model and then associated with that key term. In this way, image visual descriptors are converted into semantic concepts. The hypernym tree and synonyms of each key term are also obtained using WordNet [5], on which the hierarchical part of the codebook is constructed to support query-by-keyword retrieval tasks. The leaf nodes of the codeword tree are called semantic-level codewords. On the other hand, the context regions filtered out by the attention model are grouped to form the low-level codewords (the flat part of the codebook in Figure 1). Based on all the codewords, a correlation matrix is learned to capture their inter-relationships.

Fig. 1. The Structure of Image Thesaurus Constructed from the Web Data in [1]

However, one drawback of [1] is that the informative Web is not fully explored. Only isolated web-pages are considered in that approach, i.e., it uses only the image textual annotations on those web-pages, and the key term extraction scheme can be regarded as an entirely unsupervised approach. In fact, images on many websites [8][9] are categorized and structured. The website link structure (related to images) is constructed by humans and, from the homepage down to the individual web-pages, often indicates the hyponyms of one concept (e.g. animal→mammal→wolf). Normally, there are two ways to direct a user to an image: by anchor text or by image thumbnail, and the anchor text often represents the main semantic concept of the image. In this paper, we propose an image thesaurus construction method that takes advantage not only of the isolated web-pages [1], but also of the website link structure. With website link analysis, the key term extraction and codebook construction approach becomes semi-supervised, which improves both the effectiveness and the efficiency of the image thesaurus. We show how the learned thesaurus helps in discovering the intrinsic image manifolds, which can be used to convert image visual descriptors into semantic concepts. Experimental results verify the effectiveness of our thesaurus on concept-based image retrieval.

2 Using Website Link Structure to Build Hierarchical Codebook

In this section, we detail our approach to semantic codeword learning and hierarchical codebook construction. The key region extraction and correlation matrix generation approach is inherited from [1]. The learned thesaurus is shown in Figure 1. Because website links are noisy, we adopt a backward searching approach to extract the right image links: first we create a set of "seed images", and then, starting from these images, we trace their uplinks backward until a homepage is reached.

2.1 Selecting Seed Images

An image is filtered out if it satisfies any of the following rules:
- its aspect ratio exceeds 2:1
- it has no valid key term (see [1] for the key term extraction approach)
- it has no uplinks
- it has down-links
The remaining images form the seed image set.

2.2 Extracting Link Structure

The seed images downloaded from the same website are grouped together according to their URLs, and website links are extracted separately for each seed image subset. The backward tracing strategy is as follows. For each seed image:
- if its uplink points to an image, save that image's surrounding texts, keep the link, and continue tracing back;
- if its uplink points to an anchor text, check whether the anchor text is a noun. If so, save the anchor text (which will form a semantic node in the learned codebook), keep the link, and continue tracing back; otherwise remove this image;
- if the homepage is reached, stop tracing.
In this way, we obtain a semantic tree (or several trees) for each website whose leaf nodes are seed images. Note that the intermediate nodes of a tree may be either images or anchor texts. In the former case, the surrounding texts of the image are kept [1] and the image itself is dropped, because such images normally serve either as entries to a set of images with similar concepts or as thumbnails.

2.3 Merging Website Trees

We merge the website trees into one single semantic tree.
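Before moving on to merging, the seed-image filter (2.1) and backward tracing (2.2) above can be sketched in code. The dictionary schema (`width`, `height`, `key_term`, `uplinks`, `downlinks`), the `get_uplink` callback, and the `is_noun` predicate are all hypothetical names introduced here for illustration; none come from the paper.

```python
def is_seed_image(img):
    """Filtering rules of Section 2.1; `img` uses a hypothetical dict schema."""
    w, h = img["width"], img["height"]
    if max(w, h) > 2 * min(w, h):   # aspect ratio exceeds 2:1
        return False
    if not img.get("key_term"):     # no valid key term (extraction as in [1])
        return False
    if not img.get("uplinks"):      # no uplinks
        return False
    if img.get("downlinks"):        # has down-links
        return False
    return True


def trace_back(seed, get_uplink, is_noun, homepage_url):
    """Backward tracing of Section 2.2.

    `get_uplink(node)` is a caller-supplied function returning the parent
    element (an image or an anchor), or None. Returns the list of concepts
    collected on the path (leaf side first), or None if the image is dropped
    because a non-noun anchor text was met.
    """
    path, node = [], seed
    while node["url"] != homepage_url:
        up = get_uplink(node)
        if up is None:
            break                    # dangling link: stop tracing
        if up["type"] == "image":
            # intermediate image node: keep only its surrounding text
            path.append(up["surrounding_text"])
        else:                        # anchor text: must be a noun
            if not is_noun(up["text"]):
                return None
            path.append(up["text"])  # becomes a semantic node in the codebook
        node = up
    return path
```

The collected anchor texts give exactly the hyponym chains (e.g. animal→mammal→wolf) from which the semantic trees of Section 2.3 are built.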
Let X and Y be two trees, X_{i,j} denote the concept of the jth node in level i of X, and Y_{l,k} denote the kth node in level l of Y. For leaf nodes, i = 0. The father of X_{i,j} is denoted X_{i+1,j} and a child X_{i-1,j}; similar definitions hold for Y. Assume X is larger than Y and we attempt to merge Y into X; the bottom-up merging strategy is shown in Figure 2. Note that a simple starting point is adopted in step 5 (we set i = 0 for a new merge cycle); in fact, alternative strategies can be applied. Isolated images can be treated as single-node trees and processed in the same way as in Figure 2. Possibly several trees will remain after merging; we create a pseudo-root node named "entity" to combine them into a single tree. Note that each node in the final tree has a semantic meaning, to support query-by-keyword retrieval, and only leaf nodes are associated with pools of key regions (i.e. the key regions of the images with the same concept as the corresponding leaf node).

Width-First Bottom-Up Tree Merging Algorithm
1) (Initialize) i = 0, l = 0
2) if X_{i,j} = Y_{l,k} or X_{i,j} ≈ Y_{l,k}, i.e. Y_{l,k} is a synonym of X_{i,j} by WordNet, merge X_{i,j} and Y_{l,k}
3) if X_{i,j} is a hypernym (or a synonym of a hypernym) of Y_{l,k}, insert Y_{l,k} into the children set of X_{i,j}
4) if Y_{l,k} is a hypernym (or a synonym of a hypernym) of X_{i,j}, set X_{i,j} = X_{i+1,j} and go to step 2; after Y_{l,k} is inserted into X, restore X_{i,j}
5) if all the siblings of Y_{l,k} have been processed, set Y_{l,k} = Y_{l+1,k}, i = 0 and go to step 2

Fig. 2. Tree Merge Algorithm
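To illustrate the merging idea (a simplified recursive variant, not the exact width-first procedure of Fig. 2), the sketch below uses toy dictionaries in place of WordNet synonym and hypernym lookups; the dictionaries and the tree representation are assumptions made for this example.

```python
# Toy stand-ins for the WordNet relations used by the algorithm;
# a real system would query WordNet synsets instead.
SYNONYMS = {"wolf": {"canis lupus"}, "canis lupus": {"wolf"}}
HYPERNYMS = {"wolf": "mammal", "mammal": "animal"}


def same_concept(a, b):
    return a == b or b in SYNONYMS.get(a, set())


def is_hypernym(a, b):
    """True if `a` (or a synonym of `a`) is the direct hypernym of `b`."""
    h = HYPERNYMS.get(b)
    return h is not None and same_concept(a, h)


def merge(x, y):
    """Merge tree y into tree x, bottom-up.

    Trees are dicts: {'concept': str, 'children': [subtrees]}.
    Returns True if y found a place inside x.
    """
    if same_concept(x["concept"], y["concept"]):
        # matching nodes: merge each child of y into x's children
        for yc in y["children"]:
            for xc in x["children"]:
                if merge(xc, yc):
                    break
            else:
                x["children"].append(yc)
        return True
    # otherwise, try to place y deeper inside x first
    if any(merge(xc, y) for xc in x["children"]):
        return True
    if is_hypernym(x["concept"], y["concept"]):
        x["children"].append(y)   # insert y under its hypernym
        return True
    return False
```

Trees that `merge` cannot place anywhere would be attached under the pseudo-root "entity", as described above.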

2.4 Generating Image Codebook

The codebook contains two kinds of codewords: semantic-level codewords and low-level codewords. The semantic codewords are the centroids of the key-region pools associated with the leaf nodes learned above. They have meaningful concepts learned during key term and link structure extraction, and their inter-relationships are represented by the tree, which forms the hierarchical part of the codebook. The low-level codewords have no semantic meanings and are mapped from the context regions of the leaf nodes; they are the cluster centroids resulting from k-means clustering. Note that, ideally, if the Web data set used to train the thesaurus is large enough, the set of low-level codewords can shrink by mapping those context regions to the semantic codewords; in that case, all regions are converted to meaningful concepts. Based on the generated codebook, a correlation matrix is learned that indicates the probabilities of codeword co-occurrences, and the image thesaurus is accordingly constructed [1].
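A minimal sketch of this codebook construction, assuming region features are already extracted as NumPy row vectors (the pool layout, parameter names, and the plain k-means loop are illustrative, not the paper's exact implementation):

```python
import numpy as np


def build_codebook(key_pools, context_regions, n_lowlevel=4, n_iter=20, seed=0):
    """Sketch of Section 2.4.

    key_pools:        {concept: (n_i, d) array of key-region features}
    context_regions:  (m, d) array of context-region features
    Returns (semantic_codewords, low_level_codewords).
    """
    # semantic-level codewords: centroid of each leaf node's key-region pool
    semantic = {c: pool.mean(axis=0) for c, pool in key_pools.items()}

    # low-level codewords: k-means centroids over the context regions
    rng = np.random.default_rng(seed)
    centers = context_regions[
        rng.choice(len(context_regions), n_lowlevel, replace=False)
    ]
    for _ in range(n_iter):
        # assign each region to its nearest center, then re-estimate centers
        d = np.linalg.norm(context_regions[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for k in range(n_lowlevel):
            if np.any(labels == k):
                centers[k] = context_regions[labels == k].mean(axis=0)
    return semantic, centers
```

The correlation matrix would then be estimated from codeword co-occurrence counts over the training images, as in [1].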

3 Using Image Thesaurus to Discover Intrinsic Image Manifold

A manifold is a topological space that is locally Euclidean; Euclidean space is the simplest example of a manifold. On a manifold, the distance between two data points is measured by the geodesic distance (the length of the solid line in Figure 3) rather than the Euclidean distance (the length of the dotted line in Figure 3). Hence, retrieval on a manifold is a promising approach to bridging the semantic gap [6]. However, learning the manifold is still an open problem. We show that our image thesaurus is effective in image manifold learning, and that the resulting manifold approximates the intrinsic one better than current content-based approaches [6] do.
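For reference, the ranking scheme of [6] that we build on reduces to the closed form below: the affinity matrix W is symmetrically normalized and ranking scores are propagated from the query by solving a linear system (function and parameter names are ours; [6] also gives an equivalent iterative version).

```python
import numpy as np


def manifold_rank(W, query_idx, alpha=0.8):
    """Ranking on a data manifold in the style of Zhou et al. [6].

    W: (n, n) symmetric affinity matrix with zero diagonal.
    Returns scores f = (I - alpha * S)^{-1} y, where
    S = D^{-1/2} W D^{-1/2} and y is the query indicator vector.
    """
    n = W.shape[0]
    d = W.sum(axis=1)                              # node degrees
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    y = np.zeros(n)
    y[query_idx] = 1.0
    return np.linalg.solve(np.eye(n) - alpha * S, y)
```

Points on the same part of the manifold as the query receive high scores even when their straight-line Euclidean distance to the query is large, which is exactly the geodesic behaviour sketched in Figure 3.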

Fig. 3. An Example of Data Manifold: the “Swiss-roll”. Dots represent the data

3.1 Thesaurus-Guided Manifold Learning

We adopt the approach in [6] to rank images given a query. In [6], the initial affinity matrix W is calculated with the Euclidean distance, which may degrade the learned manifold due to the semantic gap. We therefore modify W according to the thesaurus, since the semantic nodes are learned and structured under the supervision of human-constructed website links. Let W_ij be an element of W. In [6], W_ij = exp[−d²(x_i, x_j)/2σ²], where d(x_i, x_j) is the Euclidean distance between data points x_i and x_j. We use the thesaurus to bridge the semantic gap when calculating d(x_i, x_j). Intuitively, the farther apart two semantic nodes are in the tree, the less similar their concepts. Assume the codebook has N codewords and the tree has L semantic nodes, and that the semantic-level codewords are numbered from 1 to M (hence the low-level codewords are indexed from M+1 to N). We assign each edge of the tree a weight of 1/L. Let p_ij be the correlation probability between x_i and x_j. The new distance d*(x_i, x_j) is given by

    d*(x_i, x_j) =
        (δ1 · k / L) · d(x_i, x_j)     if 1 ≤ i, j ≤ M
        (δ2 / p_ij) · d(x_i, x_j)      if i (or j) > M          (1)
        d(x_i, x_j)                    if i, j > M

where k is the length of the shortest path between x_i and x_j in the tree, and δ1, δ2 are two penalty parameters. Equation (1) means the following. If x_i and x_j are both semantic-level codewords, the geodesic distance on the tree is used to weight their content-based similarity measured by the Euclidean distance: the nearer two codewords are on the tree, the smaller the weighted distance. If only one of x_i and x_j is semantic, their Euclidean distance is weighted by their degree of correlation: the more frequently x_i and x_j co-occur in an image, the smaller their distance. If both x_i and x_j are low-level codewords, we simply use the Euclidean distance.

3.2 Concept-Based Image Retrieval on Manifold

For image database retrieval, we formalize the image concept-learning problem (i.e. image auto-annotation) as a problem of ranking on the manifold. For a new image, we extract its key region r^(q) in the same way as [1] and assign it to the codeword pool of the image thesaurus. The initial distance between this key region and any of the codewords is the Euclidean distance weighted by max(k)/L as in Eq. (1). The dimension of the initial affinity matrix W is now N+1, with the submatrix W_{N×N} being that of the image thesaurus. Based on W, the manifold is learned [6] and the data are ranked in descending order of their similarity to r^(q); r^(q) is then labeled with the concept selected by majority voting over the top 10 codewords. The context regions can be mapped to codewords in the same way. Thus each image in a new image database is mapped to a concept, and query-by-keyword retrieval can be applied. For the query-by-example retrieval scheme, either the concept-based retrieval described above or the traditional content-based method can be used [1].
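The case analysis of Eq. (1) can be sketched directly, with 1-based codeword indices and the tree path length k and correlation probability p_ij assumed to be precomputed elsewhere (the function signature is ours):

```python
def modified_distance(i, j, d_euc, k, p_ij, M, L, delta1=1.0, delta2=1.0):
    """Thesaurus-weighted distance of Eq. (1); codeword indices are 1-based.

    d_euc : Euclidean distance between codewords x_i and x_j
    k     : shortest-path length between their nodes in the semantic tree
    p_ij  : correlation probability from the correlation matrix
    M     : number of semantic-level codewords
    L     : number of semantic nodes in the tree (each edge weighs 1/L)
    """
    if i <= M and j <= M:            # both semantic-level: geodesic weighting
        return delta1 * k / L * d_euc
    if i > M and j > M:              # both low-level: plain Euclidean distance
        return d_euc
    return delta2 * d_euc / p_ij     # mixed: weight by degree of correlation
```

Note that a small p_ij (rarely co-occurring codewords) inflates the mixed-case distance, while a short tree path shrinks the semantic-case distance, matching the intuition stated after Eq. (1).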

4 Experiments

We crawled JPEG images from the enature website [8] to train our image thesaurus. In total, about 16,000 images were downloaded and 5,148 of them were identified as seed images. We extract 36-bin color correlograms as the low-level features of image regions. The experimental results reported in this paper use only the key regions to represent an image; the approach can easily be generalized to include the effect of context regions in the retrieval process [1]. In our current approach, we set δ1 = 1.
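For concreteness, a color auto-correlogram can be sketched as below. The paper does not specify how its 36 bins are split, so the 9-colors × 4-distances layout, the sampled offsets, and the function name are assumptions of this sketch.

```python
import numpy as np


def auto_correlogram(img_q, n_colors=9, dists=(1, 3, 5, 7)):
    """Color auto-correlogram sketch (9 quantized colors x 4 distances = 36 bins).

    img_q: 2-D int array of quantized color indices in [0, n_colors).
    bin(c, d) estimates P(neighbour at offset d has color c | pixel has color c),
    sampling only the 4 axis-aligned neighbours at each distance.
    """
    h, w = img_q.shape
    feat = np.zeros(n_colors * len(dists))
    for di, d in enumerate(dists):
        for c in range(n_colors):
            mask = img_q == c
            same = total = 0
            for dy, dx in ((d, 0), (-d, 0), (0, d), (0, -d)):
                # overlap the mask with a shifted copy of itself
                src = mask[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)]
                dst = mask[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
                total += src.sum()           # color-c pixels with a valid neighbour
                same += (src & dst).sum()    # ... whose neighbour is also color c
            feat[di * n_colors + c] = same / max(total, 1)
    return feat
```

A real pipeline would first quantize the region's pixels (e.g. in HSV space) into the n_colors indices before calling this function.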

Two groups of experiments are conducted: one on the retrieval performance of the learned thesaurus, and one on its power in mapping low-level image features to high-level semantic concepts.

4.1 Evaluation on Image Thesaurus


All the seed images used to train the thesaurus are used as queries to evaluate the retrieval performance. Figure 4 shows the precision comparison of our method and the traditional content-based method. It can be seen that a significant improvement is achieved by leveraging the Web training data. The two parameters used in manifold learning [6] are σ = 0.6 and α = 0.8.

Fig. 4. Retrieval Precision Comparison (precision at scopes 1-10, content-based method vs. our approach)

Fig. 5. Precision of Image Auto-Annotation based on the Learned Thesaurus (precision at top 1, 5, and 10)

4.2 Evaluation on Hierarchical Codebook

We randomly selected 20 queries from the Corel database whose concepts fall into the concept space of the trained thesaurus, to test its performance in bridging the semantic gap. Figure 5 shows how the precision varies with the number of codewords used in majority voting. The annotation approach is based on key-region mapping only, so the useful context information is lost, which unavoidably degrades performance. Nevertheless, using only a portion of the constructed thesaurus, i.e. the image codebook, we already achieve some improvement. In fact, the correlation matrix can help solve this problem by reconstructing the image context information [1].

5 Conclusions

We have proposed the idea of an image thesaurus for learning an automatic mapping from low-level image features to high-level semantic concepts, using the Web as the training data set. As an improvement over [1], in which only isolated web-pages are taken into account, we propose to extract the image-related website link structures as an additional source of information. A backward tracing strategy, starting from a set of seed images, produces multiple trees; the trees are then merged, and a pseudo-root is added to combine them into a single hierarchical codebook. Based on the learned thesaurus, the intrinsic image manifold can be better approximated, which is shown to be an effective way to convert image visual descriptors into semantic concepts. However, constructing the affinity matrix W in image manifold learning when an outlier exists is still an open problem. In future work, we will look for an effective way to insert a new data point into an image subspace.

References
1. Wang, X.-J., Ma, W.-Y., Li, X.: Data Driven Approach for Bridging the Cognitive Gap in Image Retrieval. IEEE Conf. on Multimedia & Expo (2004)
2. Barnard, K., Duygulu, P., Forsyth, D.: Clustering Art. Computer Vision and Pattern Recognition (2001) 434-439
3. Srihari, R.K.: Use of Multimedia Input in Automated Image Annotation and Content-Based Retrieval. Presented at SPIE'95, San Jose, CA, Feb. (1995)
4. Cai, D., Yu, S.P., Wen, J.R., Ma, W.-Y.: VIPS: a Vision-Based Page Segmentation Algorithm. Microsoft Technical Report, MSR-TR-2003-79 (2003)
5. Fellbaum, C.: WordNet: An Electronic Lexical Database. MIT Press, Cambridge, Mass. (1998)
6. Zhou, D.Y., Weston, J., Gretton, A., Bousquet, O., Schölkopf, B.: Ranking on Data Manifolds. In: Thrun, S., Saul, L., Schölkopf, B. (eds.): Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, Mass.
7. Seung, H.S., Lee, D.D.: The Manifold Ways of Perception. Science 290 (2000) 2268-2269
8. Enature website (2004). http://www.enature.com
9. Yahooligan website (2004). http://yahooligan.yahoo.com
10. Rubner, Y., Tomasi, C., Guibas, L.J.: A Metric for Distributions with Applications to Image Databases. IEEE International Conference on Computer Vision (1998) 59-66