Combining Stroke-based and Selection-based Relevance Feedback for Content-based Image Retrieval Jingyu Cui
State Key Laboratory of Intelligent Technologies and Systems Department of Automation Tsinghua University, Beijing, P.R.China
[email protected]
ABSTRACT We propose a flexible interaction mechanism for CBIR that enables relevance feedback inside images through drawn strokes. The user's interest is captured through an easy-to-use interface and fused seamlessly with traditional feedback information in a semi-supervised learning framework. Retrieval performance is boosted by the more precise description of the query concept. Region segmentation is also improved based on the collected strokes, which further increases retrieval precision. We implement these ideas in our system, the Flexible Image Search Tool (FIST). Experiments on two real-world data sets demonstrate the effectiveness of our approach.
Categories and Subject Descriptors H.3.3 [Information Search and Retrieval]: [Query formulation, Relevance feedback]; H.5.2 [User Interfaces]: [Graphical user interfaces (GUI), Prototyping]
General Terms Algorithms
Keywords Stroke-based Image Retrieval, Relevance Feedback, Image Retrieval Interface, Segmentation
1. INTRODUCTION
Content-based Image Retrieval (CBIR) [6, 7, 9] has been a widely studied topic alongside the explosion of digital images. One of the crucial issues in CBIR is bridging the gap between low-level features and high-level semantic meaning. Region-based image retrieval [3, 10] and relevance feedback [6] are currently the two major methods for attacking this problem. Region-based approaches build on the fact that the semantic meaning of an image is often carried at the object
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MM’07, September 23–28, 2007, Augsburg, Bavaria, Germany. Copyright 2007 ACM 978-1-59593-701-8/07/0009 ...$5.00.
Changshui Zhang
State Key Laboratory of Intelligent Technologies and Systems Department of Automation Tsinghua University, Beijing, P.R.China
[email protected]
level in the image. Users are often interested in specific objects in an image, not the image as a whole. For example, a user might be interested in an image of a tiger on grass only because of the tiger, without caring about the grass. Representing images at the region level makes inference at the object level possible. However, in many region-based systems, including Netra [10] and Blobworld [3], the user can only select a region of interest on a mask showing the pre-computed segmentation. This interface is not easy to use, since the user must look at both the mask and the image, and it depends heavily on segmentation accuracy: if one object is segmented into many regions, the error is difficult to correct. Only limited user knowledge about the image enters the system through the selection of a single region.

On one hand, much research has tried to make the most of region-based representations. SIMPLIcity [14] uses Integrated Region Matching (IRM) to account for all regions in two images, calculating the distance between the images as a weighted average of the distances between all region pairs. By considering distances between all regions, this method tends to under-estimate the distance: two images each containing a large region of grass will have a small distance regardless of whether a horse or a tiger stands on the grass. Moreover, user relevance feedback is given only at the image level, so the system cannot learn which region of the image the user is really interested in. Jing et al. [8] analyze image-level relevance feedback with a region-based representation, using an optimal query (a pseudo image) with weights that decay over time as the relevance feedback mechanism; relative region weights are learned and used to form the optimal query. Still, with feedback given merely at the image level, the system cannot know the region of interest exactly.
Recently, multi-instance learning [2, 4, 12] approaches have been adopted to infer the region of interest from many labeled images. However, when labeled images are scarce, identifying the user's regions of interest remains an open problem. On the other hand, while identifying regions of interest in images is very difficult for a computer, for the user it is hardly a problem at all. As interface development tools grow more capable, a web UI is no longer limited to clicking; it supports much richer interaction, such as drawing strokes with a mouse (or with a pen on a Tablet PC). A properly designed UI and a corresponding algorithm that capture more precise and detailed information about the user's query fit naturally into region-based image retrieval systems.
This paper leverages the user's ability to understand images, shifting the task of identifying regions of interest to the user. Information missing from image-level feedback is collected by letting the user draw strokes in images to mark relevant or irrelevant regions. We design an algorithm that fuses this information with image-level feedback in a unified framework, improving both retrieval performance and segmentation results. To the best of our knowledge, this paper is the first attempt in this direction.
2. USER INTERFACE DESIGN
The FIST UI consists of image boxes laid out on a grid (Fig 1). Each image box displays a retrieved image, on which the user can give relevance feedback. Several tool buttons for stroke handling are displayed below each image box. Tool tips appear when the mouse cursor hovers over an image or a tool button, giving cues to users who are entirely new to the system.
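As an illustrative sketch only (not FIST's actual implementation; the event model, function name, and button mapping below are hypothetical), the left/right-button stroke capture can be reduced to folding a stream of mouse events into labeled point lists:

```python
def collect_strokes(events):
    """Fold a stream of mouse events into labeled strokes.

    `events` is a list of (kind, button, x, y) tuples, where kind is
    'down', 'move', or 'up' and button is 'left' or 'right'.
    Left-button drags become "relevant" (green) strokes and
    right-button drags become "irrelevant" (red) strokes.
    """
    strokes = {"relevant": [], "irrelevant": []}
    current, label = None, None
    for kind, button, x, y in events:
        if kind == "down":
            label = "relevant" if button == "left" else "irrelevant"
            current = [(x, y)]
        elif kind == "move" and current is not None:
            current.append((x, y))
        elif kind == "up" and current is not None:
            strokes[label].append(current)
            current, label = None, None
    return strokes
```

With this representation, the undo button only needs to pop the last list from the corresponding stroke collection.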
Figure 1: User interface of FIST.

Besides image-level feedback given by clicking the "correct mark" or "cross mark", object-level feedback can be given by drawing strokes in the image boxes (Fig 2(a)). Holding down the left mouse button while moving the mouse leaves green strokes on the image, indicating the approximate region of "relevant" objects; with the right mouse button held down, red strokes mark regions the user considers "undesirable". Additional buttons allow the user to undo a stroke, modify a stroke, or clear all existing strokes. For Tablet PC users these interactions are even more natural and easy. Users of a standard single-button Macintosh mouse can simply hold the "Control" key while moving the mouse to access the "right button" drawing mode.

Fig 2(a) shows an example in which the user wants to retrieve images containing tigers (green stroke on the tiger) but not water (red stroke on the water). This kind of specific information is unavailable to traditional user interfaces. Compared to Blobworld [3], which selects regions directly on masks, the user experience of this UI is simply drawing on images, which is both easier and more flexible.

The recorded strokes are aligned with the pre-computed segmentation mask, and the label of each region is derived from the relative positions of strokes and regions. Fig 2(b) shows a stroke s and a region r whose boundary is ∂r. The stroke s contributes to the label of r only when it is really "inside" the region; more specifically, there must exist a point on the stroke that is far from the region boundary. We calculate the score as e^(−d_T / d(s, r)), where d(s, r) = max_{p ∈ (s ∩ r)} min_{q ∈ ∂r} d(p, q) is the distance between the stroke and the region, and d_T is a predefined parameter. This soft label is insensitive to segmentation accuracy, yet provides segment-level label information. Imperfect segmentation results can themselves be improved based on the strokes, since pixels under one stroke most likely belong to the same object; this is detailed in Section 3.3.

3. TWO-LEVEL INFORMATION FUSION
To the best of our knowledge, no existing work uses relevance feedback information at both the image level and the region level. We propose an optimization-based semi-supervised learning approach to fuse the information of the two levels. The algorithm first predicts region scores based on the labeled information and our priors, then calculates image scores, from which a ranking of images is produced. We have three priors about the region scores:

1. Scores of all regions should be smooth with respect to the underlying distribution of regions, i.e., similar regions should have close scores. This is the smoothness prior used in many semi-supervised learning algorithms, such as [15], to exploit unlabeled data.

2. Some region scores are given directly, either by green or red strokes drawn in a region (indicating relevant or irrelevant) or by negative feedback on an image. The predicted scores should be consistent with this given information.

3. For positive feedback on an image, we do not know the exact labels of all its regions, but we know that at least one region in the image has a high score.

These three priors extend multi-instance semi-supervised learning [12] by incorporating direct labels on instances.
(a) Object-level feedback by drawing strokes. (b) User interaction identification.

Figure 2: Stroke-based interaction.

3.1 Predicting Scores for Regions
Scores of unlabeled regions are predicted from the labeled regions on a graph G = ⟨V, E⟩, with regions as nodes x_i ∈ V and the similarity between regions i and j as the weight w_ij of edge e_ij ∈ E. The weights are collected in an affinity matrix W representing the similarity between every pair of nodes. A region is represented by several feature sets; for example, for region x_i, x_{1,i} may represent its color histogram and x_{2,i} its texture coarseness. The similarity between two regions x_i and x_j on feature set k is calculated as w_{k,ij} = exp(−d_k²(x_i, x_j) / 2σ_k²), where d_k(x_i, x_j) denotes the distance between x_{k,i} and x_{k,j} and can be the chi-square, Euclidean, or EMD [13] distance, depending on the feature type. Here σ_k² = (1 / (k_n |X|)) Σ_{i=1}^{|X|} Σ_{x_j ∈ N_{k_n}(x_i)} d_k²(x_i, x_j) is the estimated variance in the k-th feature space, with N_{k_n}(x_i) the set of k_n nearest neighbors of x_i; k_n can typically be taken as 10. The similarity between regions x_i and x_j is then the weighted sum of the per-feature similarities according to the relative importance of the feature sets: w_ij = Σ_k α_k w_{k,ij}, with α_k > 0 for all k and Σ_k α_k = 1.
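As a minimal sketch of this affinity construction (pure Python; the function name and the assumption that per-feature distance matrices are precomputed are ours):

```python
import math

def feature_affinity(dists, alpha, k_n=2):
    """Combine per-feature Gaussian similarities into one affinity matrix W.

    dists[k][i][j] holds d_k(x_i, x_j) for feature set k (chi-square,
    Euclidean, EMD, ... computed elsewhere); alpha[k] are the feature
    weights (positive, summing to 1). sigma_k^2 is estimated as the mean
    squared distance to each region's k_n nearest neighbors.
    """
    n = len(dists[0])
    W = [[0.0] * n for _ in range(n)]
    for k, D in enumerate(dists):
        total = 0.0
        for i in range(n):
            nbrs = sorted(D[i][j] for j in range(n) if j != i)[:k_n]
            total += sum(d * d for d in nbrs)
        sigma2 = total / (k_n * n)  # sigma_k^2 from the text
        for i in range(n):
            for j in range(n):
                W[i][j] += alpha[k] * math.exp(-D[i][j] ** 2 / (2 * sigma2))
    return W
```

The paper takes k_n = 10; a smaller default is used here only so that a toy example has enough neighbors.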
We incorporate the three priors above in a regularization framework and obtain the optimal score function f* for all regions by minimizing the loss function

J(f) = fᵀ Δ f + α ‖H_L (f − y)‖₂² + β Σ_{i=1}^{N_I} h_P(i, i) (1 − max_j f(r_ij))    (1)
where Δ is the graph Laplacian, and H_L and H_P are diagonal indicator matrices with h_L(i, i) = 1 if region x_i is directly labeled and 0 otherwise, and h_P(i, i) = 1 if image I_i is labeled positive and 0 otherwise. If region x_i is directly labeled, y_i equals the given label; otherwise y_i may take any value, since h_L(i, i) = 0 means y_i does not affect the objective J(f). α and β are hyper-parameters.

The three terms correspond to the three priors, respectively. The first term is the smoothness prior requiring that similar regions receive similar labels. The second term requires the predicted labels of the labeled regions to stay as close to the given labels as possible. Compared to semi-supervised learning, we add the third term to exploit the information in positively labeled images; compared to multi-instance learning, we incorporate instance-level labels. By integrating the three terms, we use all the information from the user interaction in a unified framework. This unification gives the user the flexibility to choose any feedback mechanism: image level, region level, or a combination of the two.

To solve the optimization problem efficiently, we compute a sparse similarity matrix Ŵ using LSH [1]. The third term of the objective is first discarded and an approximate solution obtained on the fly with Ŵ. We then impose the third term and decrease the cost by setting the score of the highest-scoring region in each positively labeled image to 1; this step can be regarded as identifying the truly positive region in that image. The process is iterated until convergence. Since the inner computation is very fast, the whole iteration costs very little time and responds to the user instantly.
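As an illustrative toy version of this alternation (our own simplified fixed-point update stands in for the LSH-accelerated solver; the function name, propagation rule, and iteration counts are assumptions, not the paper's implementation):

```python
def fuse_feedback(W, labeled, positive_images, alpha=10.0, outer=5, inner=200):
    """Alternate (i) label propagation for the first two terms of Eq. (1)
    with (ii) clamping the best region of each positively labeled image
    to score 1, which serves the third term.

    W               : dense region affinity matrix (list of lists)
    labeled         : dict region_index -> given score y_i (from strokes
                      or from negatively labeled images)
    positive_images : list of region-index lists, one per positive image
    """
    n = len(W)
    f = [0.0] * n
    clamped = dict(labeled)
    for _ in range(outer):
        # (i) Gauss-Seidel fixed point of
        #     f_i = (sum_j w_ij f_j + alpha * y_i) / (sum_j w_ij + alpha)
        for _ in range(inner):
            for i in range(n):
                num = sum(W[i][j] * f[j] for j in range(n) if j != i)
                den = sum(W[i][j] for j in range(n) if j != i)
                if i in clamped:
                    num += alpha * clamped[i]
                    den += alpha
                f[i] = num / den if den > 0 else 0.0
        # (ii) the highest-scoring region of each positive image is
        #      treated as the truly positive region and clamped to 1
        for regions in positive_images:
            best = max(regions, key=lambda r: f[r])
            clamped[best] = 1.0
    return f
```

On a toy graph with two tight clusters of regions, a positive image covering one cluster and a red-stroked region in the other, the positive cluster's scores converge near 1 while the stroked cluster stays near 0.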
3.2 Predicting Scores for Images
Figure 3: Enhancing segmentation by interaction.

The intuitive way to obtain an image score from region scores is to take the largest region score in the image. However, this approach ignores the differing importance of regions and, by keeping only the maximum and disregarding all other regions, is sensitive to noise. We instead use the region scores to form signatures of images and calculate distances between images with the Earth Mover's Distance [13]. Each image is represented by two vectors: s(I_i) = [f(r_i1), ..., f(r_iN_i)], the score of each region, and w(I_i) = [w(r_i1), ..., w(r_iN_i)], their relative importance. We apply an attention model such as [11] to obtain w(r_ij), i.e., regions more likely to be attention areas receive larger weights. The distance between two images I_i and I_j can then be calculated, and an image's score is inversely proportional to its distance to the nearest positively labeled image.
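As a hedged illustration (the paper uses the general EMD of [13]; here we assume the signatures can be treated as one-dimensional weighted distributions over region scores, for which the EMD reduces to the area between cumulative distribution functions; function names are ours):

```python
def emd_1d(u_vals, u_wts, v_vals, v_wts):
    """1-D Earth Mover's Distance between two weighted signatures,
    computed as the integral of |CDF_u - CDF_v| over the merged support."""
    su, sv = sum(u_wts), sum(v_wts)

    def cdf(x, vals, wts, total):
        return sum(w for v, w in zip(vals, wts) if v <= x) / total

    pts = sorted(set(u_vals) | set(v_vals))
    d = 0.0
    for a, b in zip(pts[:-1], pts[1:]):
        d += abs(cdf(a, u_vals, u_wts, su) - cdf(a, v_vals, v_wts, sv)) * (b - a)
    return d

def image_scores(query_sig, image_sigs, eps=1e-6):
    """Score images inversely proportional to their EMD to the query
    signature; each signature is (region_scores, attention_weights)."""
    return [1.0 / (eps + emd_1d(*query_sig, *sig)) for sig in image_sigs]
```

An image whose high-scoring regions match the query's signature thus ranks above one whose attention-weighted mass sits on low-scoring regions.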
3.3 Enhancing Segmentation by Interaction

We use Multiscale Normalized Cut [5] for segmentation. This unsupervised algorithm cannot guarantee perfect results; in our system, however, we can use the strokes to improve the segmentation. If two pixels p_i and p_j have frequently been covered by the same stroke in the interaction history, we strengthen their connection simply by increasing the affinity A(i, j) to γA(i, j), where γ > 1 is a factor controlling the intensity of the adjustment. With the new A, the segmentation improves, and with it the subsequent retrieval results. An example is shown in Fig 3.
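A minimal sketch of this affinity adjustment (representing strokes as lists of pixel indices and A as a dense matrix are our simplifications; a real implementation on Normalized-Cut affinities would operate on the sparse graph):

```python
from collections import Counter
from itertools import combinations

def reinforce_affinity(A, stroke_history, gamma=2.0, min_count=2):
    """Raise A(i, j) to gamma * A(i, j) for every pixel pair that has
    appeared together in strokes at least `min_count` times.

    Counting all pairs is quadratic in stroke length; real strokes
    would be subsampled first.
    """
    counts = Counter()
    for stroke in stroke_history:
        for i, j in combinations(sorted(set(stroke)), 2):
            counts[(i, j)] += 1
    for (i, j), c in counts.items():
        if c >= min_count:
            A[i][j] *= gamma
            A[j][i] *= gamma  # keep the affinity matrix symmetric
    return A
```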
4. EXPERIMENTAL EVALUATION

Our experiments are conducted on two data sets: COREL and SIVAL. COREL mostly contains close-ups of single objects; we use 100 categories from the full Corel set, with 100 images per category. SIVAL consists of images of 25 objects captured against various backgrounds, with 60 images per object.

Typical search results of our region-level interaction method and the EMD matching method, starting from an image of a tiger, are shown in Fig 4; the first column is the query image. In our results, since the user is interested only in the tiger, images containing tigers against different backgrounds can appear. With the traditional method, the green grass in the query image acts as a dominant bias: most of the retrieved images contain grass, while less attention is paid to the actual concept "tiger".

(a) Search results of the EMD method. (b) Search results of our system FIST.

Figure 4: Typical search results starting from an image with a tiger.

For empirical evaluation, we adopt the F1 = 2pr / (p + r) metric to measure retrieval results, where p denotes precision and r denotes recall. The comparison is made over 100 random queries with 3 feedback iterations per query, and the number of correctly retrieved images among the top k is counted after each user feedback; k = 100 for COREL and 60 for SIVAL. The average F1 is plotted against the number of relevance feedbacks in Fig 5. Since we make use of the user's understanding, the retrieval accuracy is higher than that of image-level feedback, which uses only partial information. It is also worth noticing that on SIVAL the gap between the traditional method and ours is larger, since SIVAL mostly contains images of the same object against various backgrounds, which is more difficult for traditional methods.

(a) Comparison result on COREL. (b) Comparison result on SIVAL.

Figure 5: Comparison of FIST performance with the traditional method.

5. CONCLUSION AND FUTURE WORK

In this paper, we propose a flexible interactive mechanism for relevance feedback in CBIR that allows user labeling both at the image level and, by drawing strokes, at the region level inside images. This easy-to-use mechanism makes feedback more precise and obtains more information from a comparable amount of user interaction. We design a novel algorithm that exploits information from both feedback levels in a unified semi-supervised learning framework to improve retrieval results. Region segmentation is also improved from the relevance feedback information, further enhancing retrieval performance. Experiments on two real-world data sets demonstrate the effectiveness of our approach and show that the new interaction mechanism captures the user's intent more precisely, and that our algorithm turns this information into more precise retrieval results for a comparable amount of user feedback.

We plan to provide the system as a Web service to test its performance with more users from the Internet and to gather more feedback. We are also developing more efficient segmentation and indexing methods to support efficient online import of images into our database. Logs of user interaction are also under analysis to build a long-term feedback model and improve retrieval performance. Finally, we plan to integrate this technique into a Web 2.0 social network, letting people share feedback information while searching so as to benefit others, and aggregating this information gradually to form a stronger search engine.

6. ACKNOWLEDGMENTS

This work is funded by the Basic Research Foundation of Tsinghua National Laboratory for Information Science and Technology (TNList).

7. REFERENCES
[1] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In IEEE FOCS '06, 2006.
[2] J. Bi, Y. Chen, and J. Wang. A sparse support vector machine approach to region-based image categorization. In CVPR '05, 2005.
[3] C. Carson, M. Thomas, S. Belongie, J. M. Hellerstein, and J. Malik. Blobworld: A system for region-based image indexing and retrieval. In 3rd International Conference on Visual Information Systems, 1999.
[4] Y. Chen, J. Bi, and J. Wang. MILES: Multiple-instance learning via embedded instance selection. IEEE Trans. on PAMI, 28:1931–1947, 2006.
[5] T. Cour, F. Benezit, and J. Shi. Spectral segmentation with multiscale graph decomposition. In CVPR '05, 2005.
[6] R. Datta, J. Li, and J. Z. Wang. Content-based image retrieval: approaches and trends of the new age. In MIR '05, 2005.
[7] A. Jaimes, M. Christel, S. Gilles, R. Sarukkai, and W.-Y. Ma. Multimedia information retrieval: what is it, and why isn't anyone using it? In MIR '05, 2005.
[8] F. Jing, M. Li, H. Zhang, and B. Zhang. An effective region-based image retrieval framework. In ACM Multimedia '02, 2002.
[9] M. S. Lew, N. Sebe, C. Djeraba, and R. Jain. Content-based multimedia information retrieval: State of the art and challenges. ACM Trans. Multimedia Comput. Commun. Appl., 2(1):1–19, 2006.
[10] W.-Y. Ma and B. S. Manjunath. NeTra: A toolbox for navigating large image databases. In ICIP, 1997.
[11] V. Navalpakkam and L. Itti. An integrated model of top-down and bottom-up attention for optimal object detection. In CVPR '06, 2006.
[12] R. Rahmani and S. A. Goldman. MISSL: Multiple instance semi-supervised learning. In ICML '06, 2006.
[13] Y. Rubner, C. Tomasi, and L. J. Guibas. A metric for distributions with applications to image databases. In ICCV '98, 1998.
[14] J. Z. Wang, J. Li, and G. Wiederhold. SIMPLIcity: Semantics-sensitive integrated matching for picture libraries. IEEE Trans. on PAMI, 23:947–963, 2001.
[15] D. Zhou, O. Bousquet, T. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In 18th Annual Conf. on NIPS, 2003.