Relevance Feedback for Keyword and Visual Feature-based Image Retrieval*

Feng Jing¹, Mingjing Li², Hong-Jiang Zhang², Bo Zhang³

¹ State Key Lab of Intelligent Technology and Systems, Beijing 100084, China
  [email protected]
² Microsoft Research Asia, 49 Zhichun Road, Beijing 100080, China
  {mjli, hjzhang}@microsoft.com
³ State Key Lab of Intelligent Technology and Systems, Beijing 100084, China
  [email protected]

Abstract. In this paper, a relevance feedback scheme for both keyword and visual feature-based image retrieval is proposed. For each keyword, a statistical model is trained offline on the visual features of a small set of manually labeled images and is used to propagate the keyword to unlabeled images. Besides this offline model, another model is constructed online, using the positive and negative images provided by the user as the training set. Support vector machines (SVMs) in the binary setting are adopted as both the offline and online models. To combine the two models effectively, a multi-model query refinement algorithm is introduced. Furthermore, an entropy-based active learning strategy is proposed to improve the efficiency of the relevance feedback process. Experimental results on a database of 10,000 general-purpose images demonstrate the effectiveness of the proposed relevance feedback scheme.

1 Introduction

Image retrieval based on keyword annotations [11] can be traced back to the late 1970s and was mainly developed by the database management and information retrieval communities. The semantics of images can be accurately represented by keywords, as long as the keyword annotations are accurate and complete. The challenge is that when an image database is large, manually annotating all the images becomes a tedious and expensive process. Although it is possible to extract keyword features from the text surrounding images in web pages [10], such automatically extracted keywords are far from accurate. These facts limit the scalability of keyword-based image retrieval approaches.

* This work was performed at Microsoft Research Asia. Feng Jing and Bo Zhang are supported in part by NSF Grant CDA 96-24396.

On the other hand, content-based image retrieval (CBIR) [3][5][12][15] has been

introduced and developed since the early 1990s to support image search based on visual features such as color, texture, and shape. Although these features can be extracted from images automatically, they are not accurate enough to represent the semantics of images, and after more than a decade of intensive research, retrieval results are still not satisfactory. The gap between visual features and semantic concepts is acknowledged to be the major bottleneck of CBIR approaches. One effective way to bridge this gap is relevance feedback.

Relevance feedback is an online learning technique used to improve the effectiveness of information retrieval systems [9]. Since its introduction into image retrieval in the mid-1990s, it has been shown to provide dramatic performance improvements [3][7][12][15]. There are two key issues in relevance feedback: the choice of the learning strategy and the selection of images for the user to label. For the former, one of the most effective learning techniques used in relevance feedback is the support vector machine (SVM) [4], which has both strong theoretical foundations and excellent empirical successes. An SVM classifier is trained on the positive and negative images marked by a user and is then used to classify the remaining unlabeled images into relevant and irrelevant classes [12]. For the latter, instead of selecting images randomly, several active learning algorithms have been proposed to select the most informative ones [3][12][14].

To exploit the strengths of both keyword-based and visual feature-based representations in image retrieval, a number of approaches have been proposed to integrate keyword and visual features [1][7][15]. The key issue in such integrated approaches is how to combine the two features so that they complement each other in the retrieval and/or relevance feedback processes.
For example, the framework proposed in [7] uses a semantic network and relevance feedback based on visual features to enhance keyword-based retrieval and to update the association of keywords with images. Zhang [15] and Chang [1] further improved this framework by updating unmarked images in addition to the marked ones, using the probabilistic outputs of a Gaussian model and an SVM, respectively, to perform annotation propagation.

In this paper, we propose a scheme that seamlessly integrates keyword and visual feature representations in relevance feedback. As the basis of the scheme, an ensemble of keyword models, i.e., an ensemble of SVM classifiers, is trained offline on a small set of manually labeled images and used to propagate keywords to unlabeled ones. Compared with the aforementioned methods [1][7][15], our scheme has the following characteristics:

- Not only the online constructed model, i.e., an SVM classifier that separates positive images from negative ones, but also the keyword model trained offline is considered. A multi-model query refinement (MQR) technique is proposed to combine the two models.
- To perform relevance feedback more efficiently, an entropy-based active learning algorithm is proposed to actively select the next images to present.

The organization of the paper is as follows. In Section 2, we describe the keyword propagation process based on statistical keyword models. The multi-model query refinement algorithm is introduced in Section 3. In Section 4, the active learning issue is discussed and a new entropy-based active learning algorithm is proposed. In

Section 5, we provide experimental results that evaluate all aspects of the relevance feedback scheme. Finally, we conclude in Section 6.

2 Keyword Propagation

A critical basis of the proposed relevance feedback scheme is the set of keyword models built from the visual features of a set of annotated images. The models serve as a bridge connecting the semantic keyword space with the visual feature space. Similar to [1], we use binary SVM classifiers as the models, due to their sound theoretical foundations and proven empirical successes [4]. For each keyword, an SVM is trained using the images labeled with that keyword as positive examples and the other images in the training set as negative examples.

In its basic form, an SVM finds the hyperplane that separates the positive and negative training data with maximal margin. More specifically, finding the optimal hyperplane translates into the following optimization problem:

Minimize: \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_i \xi_i   (1)

subject to: \forall k: y_k(\mathbf{w} \cdot \mathbf{x}_k + b) \ge 1 - \xi_k   (2)

where \mathbf{x}_i is the visual feature vector of image I_i, and y_i is 1 if image I_i is labeled with the current keyword and -1 otherwise. We solve this optimization problem in its dual formulation using SVM Light [6], which efficiently handles problems with many thousands of support vectors, converges quickly, and has minimal memory requirements. Moreover, it can efficiently estimate parameters using a leave-one-out (LOO) scheme.

The key purpose of building the SVM models is to obtain the association (confidence) factor, or weight, of each keyword for each image in the keyword propagation process, based on visual features. As a result of the propagation, each image is labeled with a set of keywords, each with a weighting factor. These weighted keywords form a keyword feature vector whose dimension is the total number of keywords in the database. This is similar to the content-based soft annotation approach proposed in [1]. To perform such soft labeling, calibrated probabilistic outputs of SVMs are required.
Since standard SVMs do not provide such outputs, we use the method proposed by Platt [8] to resolve this issue. It trains the parameters of an additional sigmoid function to map SVM outputs into probabilities, which are used as the confidence factor for each keyword labeling. Instead of estimating class-conditional densities, it fits the posterior directly with a parametric model. Three-fold cross-validation is used to form an unbiased training set.

By properly incorporating spatial information into the color histogram, the autocorrelogram has proven to be one of the most effective features in CBIR [5] and is therefore used as the visual feature of each image in our implementation. As in [5], the RGB color space quantized into 64 color bins is used, with the distance set D = {1, 3, 5, 7} for feature extraction. The resulting feature is a 256-dimensional vector.
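The sigmoid-fitting step can be sketched in a few lines of code. The sketch below is illustrative rather than the authors' implementation (they fit the sigmoid on a cross-validated training set produced with SVM Light); the function names are hypothetical, and plain gradient descent stands in for Platt's Newton-style optimizer:

```python
import numpy as np

def fit_platt_sigmoid(decision_values, labels, n_iter=5000, lr=0.1):
    """Fit Platt's sigmoid P(y=1 | f) = 1 / (1 + exp(A*f + B)) to raw
    SVM decision values f, with labels in {+1, -1}.  Plain gradient
    descent on the cross-entropy loss replaces Platt's Newton-style
    solver to keep this sketch short."""
    f = np.asarray(decision_values, dtype=float)
    y = np.asarray(labels)
    n_pos, n_neg = np.sum(y == 1), np.sum(y == -1)
    # Platt's regularized targets keep probabilities strictly inside (0, 1).
    t = np.where(y == 1, (n_pos + 1.0) / (n_pos + 2.0), 1.0 / (n_neg + 2.0))
    A, B = 0.0, np.log((n_neg + 1.0) / (n_pos + 1.0))
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(A * f + B))
        A += lr * np.mean(f * (p - t))  # negative-gradient step in A
        B += lr * np.mean(p - t)        # negative-gradient step in B
    return A, B

def platt_prob(A, B, f):
    """Map a raw decision value to a calibrated probability."""
    return 1.0 / (1.0 + np.exp(A * np.asarray(f, dtype=float) + B))
```

A fitted (A, B) pair maps each keyword SVM's raw outputs to the calibrated confidences used as soft labels.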

As suggested by [2], the Laplacian kernel is chosen as the SVM kernel, since it is more appropriate for histogram-based features such as the one we use. For x = (x_1, x_2, ..., x_n) and y = (y_1, y_2, ..., y_n), the Laplacian kernel has the form:

k_{Laplacian}(x, y) = \exp\left(-\frac{\sum_{i=1}^{n} |x_i - y_i|}{2\sigma^2}\right)   (3)

The value of σ is determined using a cross-validation strategy. Ideally, σ should be tuned for each keyword, but our experiments showed that the models are not very sensitive to its value; for simplicity, σ is set to the same value for all keywords.
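A direct way to see the kernel in code: the pure-NumPy sketch below (an illustration with a hypothetical function name, not the paper's SVM Light setup) evaluates Eq. (3) for all pairs of feature vectors.

```python
import numpy as np

def laplacian_kernel(X, Y, sigma=1.0):
    """Laplacian kernel of Eq. (3): k(x, y) = exp(-||x - y||_1 / (2*sigma^2)),
    evaluated for every pair of rows in X (n x d) and Y (m x d).  The L1
    distance suits histogram features such as the 256-d autocorrelogram."""
    # Pairwise L1 distances via broadcasting: (n, 1, d) vs (1, m, d).
    l1 = np.abs(X[:, None, :] - Y[None, :, :]).sum(axis=2)
    return np.exp(-l1 / (2.0 * sigma ** 2))
```

Libraries that accept a callable kernel can consume such a function directly, though the paper itself trains with SVM Light.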

3 Multi-model Query Refinement

After the keyword propagation process, each image is represented by two types of features: a keyword feature and a visual feature. Denote the keyword feature vector and visual feature vector of image I_i by F_i^K = (f_{i,1}^K, f_{i,2}^K, ..., f_{i,M}^K) and F_i^V = (f_{i,1}^V, f_{i,2}^V, ..., f_{i,D}^V) respectively, where f_{i,j}^K is the probability of keyword K_j estimated using the model of K_j, and D is the dimension of the visual feature space (256 in the current implementation). The similarity score of image I_i with respect to a query keyword K_q is:

S_i = f_{i,q}^{K}, \quad 1 \le i \le N   (4)

The initial retrieval result is obtained by sorting the images in decreasing order of their similarity scores. If the result is not satisfactory, a relevance feedback process is invoked. When the user marks a few images as feedback examples, an SVM is trained online, using the visual features of the marked images as the training set, to extend the search. More specifically, to rank an image I_i in a renewed search, the similarity of the image to the query keyword in the visual feature space is defined as P(K_q | I_i), i.e., the probability of image I_i being labeled with keyword K_q.

A straightforward way to estimate P(K_q | I_i) is to combine the training set of the online learning with that of the offline learning for K_q and re-train an SVM on the combined set. However, given the real-time requirement of relevance feedback interactions, retraining on a larger combined training set is not desirable. Instead, we compute the new model with a model ensemble scheme. That is, we have two estimates of P(K_q | I_i): one from the model of K_q trained offline, denoted P_q(I_i), which equals f_{i,q}^K; and one from the SVM trained online, denoted P_on(I_i). For the latter, the type and parameters of the SVM kernel are the same as in Section 2. Since the number of marked images is usually small in user feedback sessions, a leave-one-out strategy is used to obtain the training set for the sigmoid fitting process. More specifically, an ensemble of the two models is used to obtain a more accurate estimate:

P(K_q \mid I_i) = \lambda P_{on}(I_i) + (1 - \lambda) P_q(I_i)   (5)

This model ensemble is used as the similarity measure in the renewed retrieval: the refined results are obtained by re-sorting the images in decreasing order of P(K_q | I_i). The parameter λ in (5) is tunable and reflects our confidence in the two estimates. It is currently set to 0.3 based on the experiments reported in Section 5, which means that P_q(I_i) is assumed to be more reliable than P_on(I_i). On the other hand, the larger the value of λ, the more dynamic the feedback, though this does not necessarily lead to faster convergence to a satisfactory retrieval result.
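As an illustration of Eq. (5) and the re-sorting step, the following sketch (hypothetical function names, not tied to the paper's implementation) blends the two probability estimates and re-ranks the images:

```python
import numpy as np

def mqr_scores(p_offline, p_online, lam=0.3):
    """Multi-model query refinement, Eq. (5): blend the offline keyword
    model's probabilities P_q(I_i) with the online SVM's P_on(I_i).
    lam weights the online model; 0.3 follows the paper's tuning."""
    p_offline = np.asarray(p_offline, dtype=float)
    p_online = np.asarray(p_online, dtype=float)
    return lam * p_online + (1.0 - lam) * p_offline

def rerank(image_ids, p_offline, p_online, lam=0.3):
    """Return image ids and ensemble scores sorted in decreasing order."""
    scores = mqr_scores(p_offline, p_online, lam)
    order = np.argsort(-scores)
    return [image_ids[i] for i in order], scores[order]
```

With lam = 0.3, an image the offline model rates highly keeps most of its score even if the online model disagrees, matching the paper's assumption that the offline estimate is the more reliable of the two.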

4 Active Learning

As stated in Section 1, how to select informative images to present to the user, rather than relying purely on a similarity-ranked list, is a crucial issue in relevance feedback, since learning must be efficient with the usually small set of training samples. The pursuit of an "optimal" selection strategy by the machine itself is referred to as active learning. In contrast to passive learning, in which the learner is merely the recipient of a random data set, active learning lets the learner use its own ability to collect training data.

Tong and Chang proposed an active learning algorithm for SVM-based relevance feedback [12]. In their algorithm, images are selected so as to maximally reduce the size of the version space. Following the principle of maximal disagreement, the best strategy is to halve the version space each time. By exploiting the duality between the feature space and the parameter space, they showed that points near the decision boundary approximately achieve this goal, so the points nearest the boundary are used to approximate the most informative points [12]. We refer to this selection strategy as the nearest boundary (NB) strategy. Since we deal with two different SVMs (offline and online) at the same time, the NB strategy is inappropriate here.

Another straightforward and widely used strategy is the most positive (MP) strategy: the images with the largest probabilities are shown to the user, both as the current result and as candidates to label. Extensive comparisons between the MP and NB strategies have been made in the context of drug discovery [13]. The results show that the NB strategy is better at "exploration" (i.e., generalizing over the entire data set) while the MP strategy is better at "exploitation" (i.e., a higher number of total hits) [13]. For image retrieval, exploitation, which corresponds to precision, is usually more important than exploration.
Besides the two strategies above, we propose a new strategy based on information theory. Since the probability of image I_i being labeled with keyword K_q is P(K_q | I_i), the probability of I_i not being labeled with K_q is \bar{P}(K_q | I_i) = 1 - P(K_q | I_i). From an information-theoretic perspective, the entropy of this distribution is precisely the information value of image I_i; therefore, the images with maximal entropy should be selected. More specifically, the entropy of I_i is:

E(I_i) = -P(K_q \mid I_i)\log P(K_q \mid I_i) - \bar{P}(K_q \mid I_i)\log \bar{P}(K_q \mid I_i)   (6)

E(I_i) is maximized when P(K_q | I_i) = 0.5, and the smaller the difference between P(K_q | I_i) and 0.5, the larger the value of E(I_i). Instead of calculating the entropy explicitly, we use a simpler criterion to characterize the information value of I_i. The information value (IV) of I_i is defined as:

IV(I_i) = 0.5 - |P(K_q \mid I_i) - 0.5|   (7)
We use this maximal entropy (ME) strategy to select the images with the largest information values, ensuring faster convergence to a satisfactory retrieval result in the relevance feedback process.
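The IV criterion of Eq. (7) makes the selection rule very compact. The sketch below (hypothetical function names, our own framing of the ME strategy described above) scores candidates and picks the k most informative unlabeled images:

```python
import numpy as np

def information_value(p):
    """IV(I_i) = 0.5 - |P(K_q | I_i) - 0.5| (Eq. 7): maximal at p = 0.5,
    the point of maximal binary entropy, and zero at p = 0 or p = 1."""
    return 0.5 - np.abs(np.asarray(p, dtype=float) - 0.5)

def select_for_feedback(probs, already_labeled, k=10):
    """Maximal-entropy (ME) selection: indices of the k unlabeled images
    with the highest information value."""
    iv = information_value(probs)
    if already_labeled:
        iv[list(already_labeled)] = -np.inf  # never re-present labeled images
    return list(np.argsort(-iv)[:k])
```

Unlike the MP strategy, which would pick the images with the largest probabilities, this rule deliberately requests labels where the model is least certain.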

5 Experimental Results

We evaluated the proposed framework on a general-purpose database of 10,000 images from COREL. In our experiments, ten percent of the images in the database were labeled and used to train the keyword models. The rest were used as ground truth, since all images are categorized. Currently, each image is labeled with only one keyword, the name of the category that contains it; in other words, 79 keywords in total represent all images in the database, and these 79 keywords constitute the query set.

First, the initial retrieval was evaluated. A retrieved image is considered a match if it belongs to the category whose name is the query keyword. Precision is used as the basic evaluation measure: when the top N images are considered and R of them are relevant, the precision within the top N images is defined as P(N) = R / N. N is also called the scope in the following. The average precision vs. scope graph is shown in Figure 1, where Vis and Key denote visual and keyword feature-based retrieval, respectively. Moreover, (P) and (L) denote that the propagated keyword features F_i^K and the initially labeled keyword features F_i^L were used. For the latter, F_i^L = (f_{i,1}^L, f_{i,2}^L, ..., f_{i,M}^L) is a Boolean vector: f_{i,j}^L = 1 if image I_i is labeled with keyword K_j, and 0 otherwise. For the F_i^L features, the Hamming distance is used as the distance function.

Figure 1 shows that the retrieval accuracy using the initially labeled keyword features F_i^L is only slightly better than that using the visual features: if the labeling information is used without propagation, the improvement is marginal. Meanwhile, the retrieval accuracy using the keyword features learned from the keyword models, F_i^K, is remarkably better than that using the initial ones, which demonstrates the effectiveness of the SVM-based keyword propagation.
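The P(N) measure used throughout the experiments is straightforward to compute; a minimal sketch (with a hypothetical function name) follows.

```python
def precision_at_n(ranked_categories, query_keyword, n):
    """P(N) = R / N: the fraction of the top-N retrieved images whose
    category name matches the query keyword."""
    top = ranked_categories[:n]
    return sum(1 for c in top if c == query_keyword) / n
```

Averaging this value over all 79 query keywords at each scope N yields the curves of the kind plotted in Figure 1.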

Figure 1. Initial retrieval result comparison.

Next, the proposed multi-model query refinement (MQR) algorithm was evaluated. Users' feedback processes were simulated as follows: for each query, 5 iterations of user-and-system interaction were carried out. At each iteration, the system examined the top 10 images with the largest information values (IVs); images from the same (different) category as the query were used as new positive (negative) examples, since all images are categorized.

To determine the value of λ, the performance of MQR with different values of λ was compared. Generally speaking, the larger the value of λ, the more dynamic the feedback, though this does not necessarily lead to faster convergence to a satisfactory retrieval result. More specifically, the accuracy vs. λ graph is used for the comparison, where accuracy is defined as the average precision within the top 50 images, i.e., average P(50). The accuracies after the 1st, 3rd, and 5th rounds of feedback are shown in Figure 2. λ is currently set to 0.3, which corresponds to the peak of the curves. Furthermore, we compared MQR with a uni-model query refinement (UQR) method that uses only the online model; UQR corresponds to λ = 1. As Figure 2 shows, MQR performs far better than UQR: for example, the accuracy of MQR after three iterations is higher than that of UQR by 25%.

Finally, to show the effectiveness of active learning, three selection strategies were compared: random selection (RD), most positive (MP), and maximal entropy (ME). The accuracy vs. iteration graph is used for the comparison. Note that for the RD and ME strategies, the sorting of images for evaluation and for labeling differs. For evaluation, all positive (negative) images labeled so far are placed directly at the top (bottom) ranks, while the remaining images are sorted by their probabilities P(K_q | I_i), 1 ≤ i ≤ N. For labeling, under the ME (or RD) strategy, the 10 images with the highest IVs (or 10 images selected at random from the database), excluding those already labeled, are presented. The comparison results are shown in Figure 3, from which we can see that the two active

learning strategies, i.e., ME and MP, are consistently better than the passive learning strategy, i.e., RD, after the second iteration. After five iterations, the accuracy of ME (MP) is higher than that of RD by 14% (12%). In addition, the proposed ME strategy is better than the MP strategy.

Figure 2. The effect of different values of λ on the MQR algorithm.

Figure 3. Accuracy comparison of different selection strategies: RD, MP, and ME denote the random selection, most positive, and maximal entropy strategies, respectively.

6 Conclusion

We have presented an effective and efficient scheme to support relevance feedback in both keyword-based and visual feature-based image retrieval. For effectiveness, two models are taken into account simultaneously: a keyword model constructed offline, using an SVM classifier trained on a small set of labeled images, and a model trained online, an SVM classifier that separates positive images from negative ones. A multi-model query refinement algorithm is introduced to combine the two models. For efficiency, an entropy-based active learning strategy is proposed to actively select the next images to present. Experimental results on a large-scale database show the effectiveness and efficiency of the proposed scheme.

References

1. Chang, E., et al., "CBSA: Content-based Soft Annotation for Multimodal Image Retrieval Using Bayes Point Machines", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 13, No. 1, January 2003, pp. 26-38.
2. Chapelle, O., Haffner, P., and Vapnik, V., "SVMs for Histogram-based Image Classification", IEEE Transactions on Neural Networks, 10(5), Sep. 1999, pp. 1055-1065.
3. Cox, I.J., et al., "The Bayesian Image Retrieval System, PicHunter: Theory, Implementation and Psychophysical Experiments", IEEE Transactions on Image Processing, 9(1), 2000, pp. 20-37.
4. Cristianini, N. and Shawe-Taylor, J., "An Introduction to Support Vector Machines", Cambridge University Press, Cambridge, UK, 2000.
5. Huang, J., et al., "Image Indexing Using Color Correlograms", Proc. IEEE Conf. on Computer Vision and Pattern Recognition, San Juan, Puerto Rico, June 1997, pp. 762-768.
6. Joachims, T., "Making Large-Scale SVM Learning Practical", in Advances in Kernel Methods - Support Vector Learning, B. Schölkopf et al. (eds.), MIT Press, 1999, pp. 169-184.
7. Lu, Y., et al., "A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems", Proc. ACM International Multimedia Conference, 2000, pp. 31-38.
8. Platt, J., "Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods", in Advances in Large Margin Classifiers, MIT Press, 2000, pp. 61-74.
9. Salton, G., "Automatic Text Processing", Addison-Wesley, 1989.
10. Shen, H.T., et al., "Giving Meanings to WWW Images", Proc. ACM International Multimedia Conference, 2000, pp. 39-48.
11. Tamura, H. and Yokoya, N., "Image Database Systems: A Survey", Pattern Recognition, Vol. 17, No. 1, 1984, pp. 29-43.
12. Tong, S. and Chang, E., "Support Vector Machine Active Learning for Image Retrieval", Proc. ACM International Multimedia Conference, 2001, pp. 107-118.
13. Warmuth, M.K., Rätsch, G., Mathieson, M., Liao, J., and Lemmen, C., "Active Learning in the Drug Discovery Process", in T.G. Dietterich, S. Becker, and Z. Ghahramani (eds.), Advances in Neural Information Processing Systems 14, MIT Press, Cambridge, MA, 2002, pp. 1449-1456.
14. Zhang, C. and Chen, T., "Indexing and Retrieval of 3D Models Aided by Active Learning", Demo on ACM Multimedia 2001, pp. 615-616.
15. Zhang, H.J. and Su, Z., "Improving CBIR by Semantic Propagation and Cross Modality Query Expansion", NSF Workshop on Multimedia Content-Based Information Retrieval, Paris, Sept. 24-25, 2001.
