A Unified Framework for Image Retrieval Using Keyword and Visual Features

Feng Jing, Mingjing Li, Hong-Jiang Zhang, and Bo Zhang
Abstract—In this paper, a unified image retrieval framework based on both keyword annotations and visual features is proposed. In this framework, a set of statistical models is built on the visual features of a small set of manually labeled images to represent semantic concepts, and is used to propagate keywords to other, unlabeled images. These models are updated periodically as more images, implicitly labeled by users, become available through relevance feedback. In this sense, the keyword models serve to accumulate and memorize the knowledge learned from user-provided relevance feedback. Furthermore, two sets of effective and efficient similarity measures and relevance feedback schemes are proposed, for the query-by-keyword scenario and the query-by-image-example scenario, respectively; keyword models are combined with visual features in these schemes. In particular, a new entropy-based active learning strategy is introduced to improve the efficiency of relevance feedback for query by keyword, and a new algorithm is proposed to estimate the keyword features of the search concept for query by image example, which is shown to be more appropriate than two existing relevance feedback algorithms. Experimental results demonstrate the effectiveness of the proposed framework.

Index Terms—Image retrieval, keyword propagation, relevance feedback, support vector machine (SVM).
I. INTRODUCTION

IMAGE representation schemes designed for image retrieval systems can be categorized into three classes: keyword (text) features, visual features, and their combinations. Image retrieval based on keyword features [2], [17], [21] can be traced back to the late 1970s and was mainly developed by the database management and information retrieval communities. The typical query scenario in such image retrieval systems is query by keyword (QBK). The semantics of images can be accurately represented by keywords, as long as the keyword annotations are accurate and complete. The challenge is that, when the size of an image database is large, manual annotation becomes a tedious and expensive process. Although it is possible to use the surrounding text of images on the Web to extract keyword features [17], such automatically extracted keywords are far from accurate. These facts limit the scalability of keyword-based image retrieval approaches.
Manuscript received August 15, 2003; revised June 4, 2004. This work was supported in part by the National Natural Science Foundation of China under Grants 60135010 and 60321002. This work was performed at Microsoft Research Asia. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Christine Guillemot.

F. Jing and B. Zhang are with the Computer Science Department, Tsinghua University, Beijing 100084, China (e-mail: [email protected]; [email protected]).

M. Li and H.-J. Zhang are with Microsoft Research Asia, Beijing 100080, China (e-mail: [email protected]; [email protected]).

Digital Object Identifier 10.1109/TIP.2005.847289
Content-based image retrieval (CBIR) was proposed to overcome the difficulty of manual annotation. It relies on visual feature-based representations, such as color, texture, and shape, which can be extracted from images automatically [4], [7], [9], [11], [28]. Representative surveys of visual feature-based retrieval can be found in [14], [18], [32]. The typical query scenario in such image retrieval systems is query by image example (QBE). However, after over a decade of intensive research, the retrieval results are still not satisfactory. It is widely understood that the major bottleneck of content-based image retrieval approaches is the gap between visual feature representations and the semantic concepts of images. While improving the semantic representation power of visual features is a long-term effort, an effective approach is to incorporate a relevance feedback process and learning techniques, online and offline, to learn better representations of images and/or to refine queries.

Relevance feedback, originally developed for information retrieval [16], is an online learning technique used to improve the effectiveness of information retrieval systems. Since its introduction into image retrieval in the mid-1990s, it has been shown to provide dramatic performance improvements [4], [11], [22], [26], [28], [31]. The main idea of relevance feedback is to let the user guide the system. During the retrieval process, the user interacts with the system and rates the relevance of the retrieved images according to his/her subjective judgment. With this additional information, the system dynamically learns the user's intention and gradually presents better results. A key issue in relevance feedback is the learning strategy. One of the most effective learning techniques used in relevance feedback is the support vector machine (SVM) [23], which not only has strong theoretical foundations but also excellent empirical success. Another issue in relevance feedback is the selection of images for the user to label. Instead of passive learning, in which the images are randomly selected, several active learning algorithms have been proposed to select the most informative ones [4], [22], [25].

In contrast to online learning, fewer efforts have addressed offline learning in relevance feedback for image retrieval. The few existing methods fall into two categories: clustering based [9], [11] and classification based [1]. In [9], the offline learning process consists of initial clustering based on visual features and cluster updating based on the user's feedback. A representative classification-based approach to offline learning was presented in [1], in which an ensemble of binary classifiers was trained to estimate the keyword membership of images. The trained ensemble was then applied to each individual image to give the
image multiple soft labels, each with a certain membership factor. The membership factors help a user find relevant images rapidly via keyword search.

To exploit the strengths of both keyword-based and visual feature-based representations in image retrieval, and in view of the fact that QBK and QBE each have their own applicability, a number of approaches have been proposed to integrate keyword and visual features [1], [10], [26], [27], [29], [30]. The key issue in such integrated approaches is how to combine the two kinds of features so that they complement each other in the retrieval process and make retrieval more accurate and efficient. In [29], the latent semantic indexing [6] technique was used to exploit the underlying semantic structure of Web images based on their visual features. Zhou [30] also proposed a pseudoclassification algorithm to learn a word similarity matrix; the learned similarity matrix could facilitate keyword semantic grouping, thesaurus construction, and soft query expansion. The framework proposed in [10] used a semantic network and relevance feedback based on visual features to enhance keyword-based retrieval and to update the association of keywords with images. Zhang [26], [27] and Chang [1] further improved this framework by updating unmarked images in addition to the marked ones, using the probabilistic outputs of a Gaussian model and an SVM, respectively, to perform annotation propagation. However, a common limitation of this previous work is that the keyword properties of images, which should be helpful if properly used, are ignored in QBE.

In this paper, we propose a framework that seamlessly integrates keyword and visual feature representations in image retrieval. Keyword and visual features are combined in both the online learning process for relevance feedback and the offline learning process for keyword propagation. The main contributions of this framework are threefold: 1) the schemes for integrating visual and keyword features; 2) the offline learning algorithms for establishing the models that map keywords to visual features and for updating these models after relevance feedback sessions; and 3) the refined similarity measures used to rerank images in relevance feedback sessions in both the QBK and QBE scenarios.

An overview of the framework is shown in Fig. 1. First, in this framework, images in a database are indexed in both the keyword and visual feature spaces. The dimension of the keyword feature space is the number of keywords in the database. Realistically, we assume that only a few images in the database are labeled initially, as in [1], [10], [26], and [27]. As keyword propagation progresses, each image in the database is assigned more and more keywords and associated weights, which form the keyword feature vector of the image. More importantly, an offline learning scheme is proposed to train an SVM for each keyword based on the initially labeled images, which are only a small fraction of the database. The trained SVM models of the keywords are used to guide the keyword propagation to unlabeled images in the database; in other words, probabilistic keyword propagation is realized by using the probabilistic outputs of the SVMs. In addition, the relevance information obtained from users' feedback sessions is accumulated and used to update the SVMs periodically. Once updated, the SVMs are used to repropagate the keywords to all images in the database. The updating and repropagation process ameliorates the keyword representations of images and enables the self-improvement of the system.
Fig. 1. Overview of the framework. (a) Keyword model construction and update. (b) QBK and QBE.
Furthermore, the framework supports both the QBK and QBE query scenarios. Two sets of similarity measures and relevance feedback schemes have been developed, for QBK and for QBE, respectively. For QBK, the initial retrieval result of a query is given by searching for the query keyword in the keyword feature space. Relevance feedback is then conducted by reranking all retrieved images according to a linear combination of the probabilistic output of an SVM trained online in the visual feature space and the initial keyword similarity. An entropy-based active learning strategy is proposed to select the more informative images in the relevance feedback process, making online learning more efficient. For QBE, if the example image is from the database, the distances in both the keyword feature space and the visual feature space are computed and used to produce the retrieval results; otherwise, the search is performed in the visual feature space only. In the relevance feedback process, an SVM is trained online in the visual feature space in the same way as in the QBK case. In addition, the probability of each keyword being the underlying query keyword is estimated using the keyword SVMs and the feedback image examples. The probabilities and the SVMs of the keywords are both used to compute the similarity in the keyword feature space.
The combination of this similarity and that of the visual feature space is used to rank images in the refined retrieval.

The organization of the paper is as follows. Section II describes the construction and updating of keyword models using SVMs. In Section III, the initial retrieval and subsequent feedback process of QBK are presented, and an entropy-based active learning algorithm is introduced. The search and feedback process for QBE is presented in Section IV. In Section V, we provide experimental results that evaluate all aspects of the framework under several different situations. Finally, we conclude in Section VI.

II. KEYWORD MODEL CONSTRUCTION AND UPDATING

A critical component of the proposed framework is the set of keyword models built from the visual features of a set of images labeled with keywords. These models are then used in defining similarities in retrieval and in guiding the keyword propagation to unlabeled images. To build the models, a small set of labeled images is needed as the initial training set. Similar to [1], we use SVM binary classifiers as the models, due to SVMs' sound theoretical foundations and proven empirical successes. For each keyword, an SVM is trained using the images labeled with it as positive examples and the other images in the training set as negative examples.

In its basic form, an SVM tries to find a hyperplane that separates the positive and negative training data with maximal margin [23]. More specifically, finding the optimal hyperplane translates into the following optimization problem:

Minimize
$$\tfrac{1}{2}\,\|\mathbf{w}\|^{2} \quad\quad (1)$$

subject to
$$y_i\,(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1, \quad i = 1, \ldots, l \quad\quad (2)$$

where $\mathbf{x}_i$ is the visual feature vector of image $i$, and $y_i$ is equal to 1 if image $i$ is labeled with the current keyword and $-1$ otherwise. This optimization is a convex quadratic programming problem. Its dual is the following maximization problem:

$$\max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l}\sum_{k=1}^{l} \alpha_i \alpha_k\, y_i y_k\, (\mathbf{x}_i \cdot \mathbf{x}_k) \quad\quad (3)$$

under the constraints $\alpha_i \ge 0$ and $\sum_{i=1}^{l} \alpha_i y_i = 0$. We solve this optimization problem using SVM Light [8]. It efficiently handles problems with many thousands of support vectors, converges fast, and has minimal memory requirements. Moreover, it supports training with cost models and can efficiently estimate parameters using a leave-one-out (LOO) scheme.

The key purpose of building the SVM models is to obtain the association (confidence) factor, or weight, of each keyword for each image in the keyword propagation process, based on their visual features. As a result of the propagation, each image is associated, or labeled, with a set of keywords, each with a weighting factor. These weighted keywords form a keyword feature vector, the dimension of which is the total number of keywords in the database. This is similar to the content-based soft annotation approach proposed in [1] and [27].
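For concreteness, the one-versus-rest training setup described above can be sketched as follows. This is a minimal sketch in Python using scikit-learn's SVC in place of the SVM Light solver used in the paper; the function and variable names are illustrative only.

```python
import numpy as np
from sklearn.svm import SVC

def train_keyword_models(X, labels, vocabulary, C=1.0):
    """Train one binary SVM per keyword (one-versus-rest).

    X          : (n_labeled, d) visual feature vectors of labeled images
    labels     : keyword string assigned to each labeled image
    vocabulary : list of all keywords in the database
    """
    labels = np.asarray(labels)
    models = {}
    for kw in vocabulary:
        # Images labeled with kw are positive (+1); all others negative (-1).
        y = np.where(labels == kw, 1, -1)
        models[kw] = SVC(C=C, kernel="rbf").fit(X, y)
    return models
```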
To perform such soft labeling, calibrated probabilistic outputs of the SVMs are required. Since standard SVMs do not provide such outputs, we use the method proposed by Platt [13] to resolve this issue. Instead of estimating the class-conditional densities, it utilizes a parametric model to fit the posterior directly. The parameters of the model are adapted to give the best probability outputs. The form of the parametric model is chosen to be a sigmoid

$$P(y = 1 \mid f) = \frac{1}{1 + \exp(Af + B)} \quad\quad (4)$$

where $f$ is the SVM output, and $A$ and $B$ are the parameters to be fitted using maximum-likelihood estimation on a training set. Threefold cross validation is used to form an unbiased training set.

By properly incorporating spatial information into the color histogram, the auto-correlogram has proven to be one of the most effective features in CBIR [7] and is, therefore, used as the visual feature of each image in our implementation. As in [7], the RGB color space quantized into 64 color bins is considered, and the distance set $\{1, 3, 5, 7\}$ is used for feature extraction [7]. The resulting feature is a 256-dimensional vector. As suggested by [3], the Laplacian kernel is chosen as the kernel of the SVM, since it is more appropriate for histogram-based features like the one we use. Assuming $\mathbf{x} = (x_1, \ldots, x_n)$ and $\mathbf{y} = (y_1, \ldots, y_n)$, the form of the Laplacian kernel is

$$K(\mathbf{x}, \mathbf{y}) = \exp\!\left(-\gamma \sum_{i=1}^{n} |x_i - y_i|\right). \quad\quad (5)$$
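A direct implementation of the kernel in (5) can be passed to scikit-learn as a callable, and the following sketch also shows how Platt-style probabilistic outputs are obtained (scikit-learn fits the sigmoid of (4) internally, using five-fold rather than the threefold cross validation described here). All names are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def laplacian_kernel(X, Y, gamma=0.01):
    """Gram matrix of the Laplacian kernel (5): exp(-gamma * ||x - y||_1)."""
    # Pairwise L1 distances via broadcasting: (n, 1, d) - (1, m, d) -> (n, m).
    d1 = np.abs(X[:, None, :] - Y[None, :, :]).sum(axis=-1)
    return np.exp(-gamma * d1)

# SVC accepts a callable kernel; probability=True adds Platt's sigmoid fit,
# so predict_proba then yields the calibrated keyword probabilities k_ij.
svm = SVC(kernel=laplacian_kernel, probability=True)
```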
The value of $\gamma$ is determined using a cross-validation strategy. Ideally, $\gamma$ should be tuned for each keyword. Our experimental study showed that the models were not very sensitive to the value of $\gamma$; for simplicity, $\gamma$ is set to be the same for all keywords.

After the keyword models are built with the initial training data, they are updated periodically using the examples collected in the relevance feedback sessions of queries. The updating procedure is actually an incremental learning process. We use the algorithm proposed by Syed [20] as the learning strategy: only the support vectors of the current models are combined with the new training images to train a new SVM for each keyword.

The examples collected in the relevance feedback sessions are represented by labeling vectors in keyword model updating. Assume that there are $N$ images in the database and $M$ keywords $w_1, \ldots, w_M$. For image $i$, the labeling vector is $\mathbf{l}_i = (l_{i1}, \ldots, l_{iM})$. The $l_{ij}$'s are initialized as

$$l_{ij} = \begin{cases} 1 & \text{if image } i \text{ is initially labeled with } w_j \\ -1 & \text{if image } i \text{ is initially labeled with other keywords except } w_j \\ 0 & \text{if image } i \text{ is initially labeled with no keyword.} \end{cases}$$

When using $w_j$ as the query keyword, the $l_{ij}$'s are updated as follows:

$$l_{ij} \leftarrow \begin{cases} l_{ij} + 1 & \text{if image } i \text{ is a positive example} \\ l_{ij} - 1 & \text{if image } i \text{ is a negative example.} \end{cases} \quad\quad (6)$$

From (6), we can see that $l_{ij}$ serves as a collecting and weighting factor that collects and weights training examples for the updating process.
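The labeling-vector bookkeeping above amounts to a few lines of array code. The following is a minimal sketch of the initialization and of the update in (6); the vocabulary indexing is illustrative.

```python
import numpy as np

def init_labeling(initial_keyword, vocabulary):
    """Labeling vectors l_i as initialized above, one row per image.

    initial_keyword : the initial keyword of each image, or None if unlabeled
    """
    L = np.zeros((len(initial_keyword), len(vocabulary)))
    for i, kw in enumerate(initial_keyword):
        if kw is not None:
            L[i, :] = -1.0                      # labeled, but with other keywords
            L[i, vocabulary.index(kw)] = 1.0    # labeled with this keyword
    return L

def feedback_update(L, j, positives, negatives):
    """Update (6) for query keyword w_j: l_ij accumulates the feedback."""
    L[positives, j] += 1.0
    L[negatives, j] -= 1.0
```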
The cost model introduced in [12] is used to incorporate a different cost for each training example. More specifically, for keyword $w_j$, (1) is modified to the soft-margin objective

Minimize
$$\tfrac{1}{2}\,\|\mathbf{w}\|^{2} + C \sum_{i} |l_{ij}|\,\xi_i \quad\quad (7)$$

subject to $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$, so that $|l_{ij}|$ acts as the per-example cost.
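Under this reading of (7), in which $|l_{ij}|$ plays the role of a per-example cost, the periodic retraining can be sketched as follows; scikit-learn exposes such costs through the sample_weight argument. Note that, unlike the incremental scheme of [20], this sketch simply retrains on all weighted examples.

```python
import numpy as np
from sklearn.svm import SVC

def retrain_keyword_model(X, L, j, C=1.0):
    """Retrain the SVM of keyword j from the labeling-vector column L[:, j]."""
    lj = L[:, j]
    mask = lj != 0                    # only images carrying label/feedback info
    y = np.sign(lj[mask])             # accumulated sign decides the class
    clf = SVC(C=C, kernel="rbf", probability=True)
    clf.fit(X[mask], y, sample_weight=np.abs(lj[mask]))  # per-example costs
    return clf
```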
Based on the updated models, the keyword features of all images are recomputed. Since the new models contain the information of the newly added training examples, more accurate keyword features are expected.

III. QUERY BY KEYWORD

The proposed framework supports both the QBK and QBE query scenarios. In this section, we present in detail the similarity measure and relevance feedback scheme for the QBK scenario.

A. Initial Query

After the soft-labeling process, each image is represented by two types of features: keyword features and visual features. Denote the keyword feature vector and the visual feature vector of image $i$ by $\mathbf{k}_i = (k_{i1}, \ldots, k_{iM})$ and $\mathbf{v}_i = (v_{i1}, \ldots, v_{iD})$, respectively. Here, $k_{ij}$ is the probability of keyword $w_j$ estimated using the model of $w_j$, and $D$ is the dimension of the visual feature space, which equals 256 in the current implementation. The similarity score of image $i$ with respect to a query keyword $w_j$ is determined by

$$S(i) = k_{ij}. \quad\quad (8)$$

Query by combining multiple keywords with Boolean operators is also supported. For example, if the query is formulated as "$w_j$ and $w_k$," then $S(i) = \min(k_{ij}, k_{ik})$; if the query is formulated as "$w_j$ or $w_k$," then $S(i) = \max(k_{ij}, k_{ik})$. The initial retrieval result is given by sorting the images in decreasing order of their similarity scores.

B. Relevance Feedback

For a given query keyword $w_j$, a set of images is retrieved based on the similarity defined by (8). If the result is not satisfactory, a relevance feedback process is invoked. When the user marks a few images as feedback examples, an SVM is trained online, using the visual features of the marked images as the training set, to extend the search space. More specifically, to rank an image $i$ in a renewed search, the similarity of the image to the query keyword $w_j$ in the visual feature space is defined by $P(w_j \mid i)$, i.e., the probability of image $i$ being labeled with keyword $w_j$. A straightforward way to estimate $P(w_j \mid i)$ is to combine the training set of online learning with that of offline learning and retrain an SVM based on the combined training set.
However, considering the required real-time nature of relevance feedback interactions, the retraining process with a larger combined training set is not desirable. Instead, we compute the new model using a model ensemble scheme. That is, we have two estimations of $P(w_j \mid i)$: one from the model of $w_j$ trained offline, denoted as $P_K(w_j \mid i)$, which is equal to $k_{ij}$, and the other from the SVM trained online, denoted as $P_V(w_j \mid i)$. For the latter, the type and parameters of the SVM kernel are the same as those in Section II. Considering that the number of marked images is usually small in user feedback sessions, the LOO strategy is used to obtain the training set for the sigmoid-fitting process. An ensemble of the two models is used to obtain a more accurate estimation

$$P(w_j \mid i) = \beta\, P_V(w_j \mid i) + (1 - \beta)\, P_K(w_j \mid i). \quad\quad (9)$$

This model ensemble is used as the similarity of images in the renewed retrieval. That is, the refined retrieval results are obtained by re-sorting the images in decreasing order of $P(w_j \mid i)$. $\beta$ in (9) is a tunable parameter that reflects our confidence in the two estimations; it is currently set to 0.3, based on the experiments that will be introduced in Section V. This means that $P_K$ is assumed to be more reliable than $P_V$. On the other hand, the larger the $\beta$ is, the more dynamic the feedback will be, though this does not necessarily lead to faster convergence to a satisfactory retrieval result.

C. Active Learning

As stated in Section I, how to select the more informative images from a ranked list, based purely on similarities, to present to a user is a crucial issue in relevance feedback, to ensure efficient learning with the usually small set of training samples. The pursuit of the "optimal" selection strategy by the machine itself is referred to as active learning. In contrast to passive learning, in which the learner works as a recipient of a random data set, active learning enables the learner to use its own ability to collect data. Tong and Chang proposed an active learning algorithm for SVM-based relevance feedback [22]. In their algorithm, the images are selected so as to maximally reduce the size of the version space, which is defined to be the set of all parameters consistent with the current training set [22]. Following the principle of maximal disagreement, the best strategy is to halve the version space each time. By taking advantage of the duality between the feature space and the parameter space, they showed that the points near the decision boundary can approximately achieve this goal; therefore, the points near the boundary are used to approximate the most informative points. We refer to this selection strategy as the nearest boundary (NB) strategy. Considering that we deal with two different SVMs (offline and online) at the same time, the NB strategy is inappropriate here. Another straightforward and widely used strategy is the most positive (MP) strategy.
When the MP strategy is used, the images with the largest probabilities are shown to users, both as the current result and as candidates to label. Extensive comparisons have been made between the MP and NB strategies in the application of drug discovery [24]. The results show that the NB strategy is better at "exploration" (i.e., giving better generalization on the entire data set), while the MP strategy is better at "exploitation" (i.e., a high number of total hits) [24]. For image retrieval, exploitation, which corresponds to precision, is usually more crucial than exploration.

Besides the aforementioned two strategies, we propose a new strategy based on information theory. Since the probability of image $i$ being labeled with keyword $w_j$ is $P(w_j \mid i)$, the probability of $i$ being unlabeled with $w_j$ is $1 - P(w_j \mid i)$. From the information-theory perspective, the entropy of this distribution is precisely the information value of image $i$. Therefore, the images with maximal entropy should be selected. More specifically, the entropy of $i$ is

$$H(i) = -P(w_j \mid i)\log P(w_j \mid i) - \bigl(1 - P(w_j \mid i)\bigr)\log\bigl(1 - P(w_j \mid i)\bigr) \quad\quad (10)$$

where $H(i)$ is maximized when $P(w_j \mid i) = 0.5$, and the smaller the difference between $P(w_j \mid i)$ and 0.5, the larger the value of $H(i)$. Therefore, instead of calculating the entropy explicitly, we use a simpler criterion to characterize the information value of $i$. The information value (IV) of $i$ is defined to be

$$\mathrm{IV}(i) = 0.5 - \bigl|P(w_j \mid i) - 0.5\bigr|. \quad\quad (11)$$

We use this maximal entropy (ME) strategy to select the images with the largest information values, to ensure faster convergence to a satisfactory retrieval result in the relevance feedback process.

IV. QUERY BY IMAGE EXAMPLE

The proposed framework also provides an effective similarity measure and relevance feedback scheme to support the QBE query scenario. In this section, we present both in detail.

A. Initial Query

There are two typical querying scenarios when query by image example is used. In the first scenario, the query example is one of the images in the database. In this case, both keyword features and visual features are used in the similarity calculation. The distance between a query $q$ and an image $i$ in the database in the visual feature space is the $L_1$ distance

$$D_V(q, i) = \sum_{m=1}^{256} |v_{qm} - v_{im}|. \quad\quad (12)$$

The distance between $q$ and $i$ in the keyword feature space is a weighted $L_1$ distance

$$D_K(q, i) = \sum_{j=1}^{M} k_{qj}\,|k_{qj} - k_{ij}|. \quad\quad (13)$$

Note that the weight $k_{qj}$ ($j = 1, \ldots, M$) reflects the relevance of the $j$th keyword to the query image $q$. The final distance between $q$ and $i$ is

$$D(q, i) = \alpha\, D_V(q, i) + (1 - \alpha)\, D_K(q, i) \quad\quad (14)$$

where $\alpha$ is a tunable factor that controls the contributions of the two types of features. The setting of $\alpha$ will be discussed in Section V. If the query image is not an image in the database, then no keyword feature is available, and only the visual features of the query are extracted and used to calculate the distance. Equation (14) is still applicable, with $\alpha = 1$.

B. Relevance Feedback

As in the QBK scenario, the proposed framework also provides an efficient relevance feedback scheme to improve retrieval results in the QBE scenario. In this scheme, the relevance feedback in the visual feature space is performed in the same way as in the QBK case described in Section III-B. That is, when the user marks a few images as feedback examples, an SVM is trained online to extend the search space. More specifically, to rank an image $i$ in a renewed search, the similarity of the image in the visual feature space is the probabilistic output of the online SVM, denoted $S_V(i)$.

On the other hand, since the keyword space is also a vector space, several existing relevance feedback algorithms could be used. If only the positive examples are considered, the optimized learning algorithm proposed by Rui [15] is appropriate. In [15], optimal query estimation and weighting functions are derived in a unified framework; based on minimizing the total distance of the positive examples to the revised query, the weighted average and a whitening transform in the feature space were found to be the optimal solutions. If both positive and negative examples are considered, an SVM-based algorithm, the same as the one used in the visual feature space, is available. However, the keyword space is not a common vector space; it has its own special characteristics, e.g., the values of each dimension are probabilities. Taking advantage of these characteristics, a more effective and efficient algorithm is proposed as follows.

First, the probability of each keyword in the database being the underlying query keyword is estimated. Assume there are $N_p$ positive images $p_1, \ldots, p_{N_p}$ and $N_n$ negative images $n_1, \ldots, n_{N_n}$ in a user feedback session, and denote this set of examples by $E$. The probability of keyword $w_j$ being the underlying query keyword, given the current positive and negative examples, is, by Bayes' rule,

$$P(w_j \mid E) = \frac{P(E \mid w_j)\,P(w_j)}{\sum_{k=1}^{M} P(E \mid w_k)\,P(w_k)}. \quad\quad (15)$$

Assume that the priors of all keywords being the query keyword are equal, i.e., $P(w_j) = 1/M$ for all $j$. To calculate the value of $P(E \mid w_j)$, assume that the positive and negative images are independent; then
Denote the keyword feature vectors of $p_s$ and $n_t$ as $(k_{p_s 1}, \ldots, k_{p_s M})$ and $(k_{n_t 1}, \ldots, k_{n_t M})$, respectively. Then

$$P(E \mid w_j) = \prod_{s=1}^{N_p} k_{p_s j} \prod_{t=1}^{N_n} \bigl(1 - k_{n_t j}\bigr). \quad\quad (16)$$

Based on (16), we define the weight of the $j$th dimension in the keyword feature space as

$$u_j = \frac{P(E \mid w_j)}{\sum_{k=1}^{M} P(E \mid w_k)} \quad\quad (17)$$

so that $\sum_{j=1}^{M} u_j = 1$; with equal priors, $u_j$ equals the posterior $P(w_j \mid E)$ of (15). Hence, the similarity of image $i$ in the keyword feature space is

$$S_K(i) = \sum_{j=1}^{M} u_j\, k_{ij}. \quad\quad (18)$$
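A compact sketch of the estimation in (15)-(18), as reconstructed above, is given below; it assumes the keyword feature matrix K holds the propagated probabilities $k_{ij}$, and the index arrays are illustrative.

```python
import numpy as np

def prw_keyword_similarity(K, positives, negatives):
    """Probabilistic reweighting (PRW) in the keyword feature space.

    K          : (n_images, M) keyword feature matrix, k_ij = P(w_j | i)
    positives  : indices of positive feedback examples
    negatives  : indices of negative feedback examples
    Returns S_K(i) of (18) for every image in the database.
    """
    # Likelihood (16) under the independence assumption; with equal priors,
    # the posterior (15) is proportional to this quantity.
    like = K[positives].prod(axis=0) * (1.0 - K[negatives]).prod(axis=0)
    u = like / like.sum()      # normalized dimension weights (17)
    return K @ u               # similarity (18): S_K(i) = sum_j u_j * k_ij
```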
The final similarity used to rank an image $i$ in a renewed search is defined as

$$S(i) = S_K(i) + \lambda\, S_V(i) \quad\quad (19)$$

where $\lambda$ is a weight that controls the contributions of the two types of similarities; $\lambda$ is set to 0.01 in the current experimental implementation. The proposed ME strategy is not used here, because the similarity is no longer a probability as it is in Section III-C. The images with the maximal similarities are shown to the user, both as the refined result and as the candidates for further judgment.

V. EXPERIMENTAL EVALUATION

We have evaluated the proposed framework on a general-purpose image database of 10,000 images from COREL. In our experiments, 10% of the images in the database were labeled and used to train the initial keyword models. The rest of the images were used as ground truth, as they are all categorized as well. Currently, an image is labeled with only one keyword, the name of the category that contains it; in other words, there are 79 keywords in total representing all images in the database. To collect training data for the updating process, each of the 79 keywords was used as a query once, using the QBK scheme introduced in Section III. Users' feedback processes were simulated as follows. For each query, five iterations of user-and-system interaction were carried out. At each iteration, the system examined the top ten images with the largest information values (IVs). Images from the same (a different) category as the query were used as new positive (negative) examples, as all images are categorized. An SVM was trained for each keyword
based on the newly labeled images and the images labeled in the foregoing iterations. The SVM was combined with the model of the query keyword to present a refined result.

Based on the aforementioned data preparation, the proposed framework was evaluated as follows. First, the keyword propagation process was evaluated. On the one hand, the accuracy of the ensemble of keyword models was computed. The model, i.e., the SVM classifier, that gave the largest probabilistic output determined the class of a given image. If the corresponding keyword was the same as the category of the image, the image was classified correctly; otherwise, it was classified incorrectly. All the unlabeled images constituted the testing set. The accuracies of the initial models and of the models after updating were 43.4% and 53.47%, respectively. Although the accuracy is not very high, it is much better than random guessing ($1/79 \approx 1.3\%$). On the other hand, an example of keyword propagation for the keyword lion is shown in Fig. 2. The number below an image is the probability of the image containing the keyword lion. The 24 images with the highest probabilities are shown; fifteen of them are lion images.

Then, the initial retrieval of QBK was evaluated. All 79 keywords constituted the query set. A retrieved image is considered a match if it belongs to the category whose name is the query keyword. Precision is used as the basic evaluation measure: when the top $n$ images are considered and $r$ of them are relevant, the precision within the top $n$ images is defined to be $P(n) = r/n$; $n$ is also called the scope in the following. The average precision versus scope graph is shown in Fig. 3. In all figures in this paper, we use Vis, Key, and K&V to represent visual, keyword, and combined visual and keyword feature-based retrieval. Moreover, (U) denotes that the updated keyword features were used, and (L) denotes that the initially labeled keyword features were used without propagation. Key(L) is a Boolean vector; that is, $k_{ij} = 1$ if image $i$ was initially labeled with keyword $w_j$, and $k_{ij} = 0$ otherwise. For Key(L), the Hamming distance was used as the distance function.

It is observed from Fig. 3 that, when using QBK, the retrieval accuracy of the initially labeled keyword features, i.e., Key(L), is a little better than that of the visual features. This means that if only the labeling information is used, without propagation, the improvement in performance is marginal. In the meantime, the retrieval accuracy of the keyword features learned from the keyword models, i.e., Key, is remarkably better than that of the initial ones, which suggests the effectiveness of the SVM-based offline learning. Furthermore, the retrieval accuracy of the updated keyword features is superior to that of the features without updating.
Fig. 2. Example of keyword propagation for keyword lion.
Fig. 3. Initial retrieval result comparison in case of QBK. Vis and Key represent visual and keyword-based retrieval. (U) denotes that the updated keyword features were used. (L) denotes that the initially labeled keyword features were used without propagation.
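Before turning to the QBE evaluation below, the combined QBE distance of (12)-(14), as reconstructed in Section IV-A, can be sketched as follows; the default $\alpha = 0.1$ reflects the near-optimal setting reported next, and all names are illustrative.

```python
import numpy as np

def qbe_distance(v_q, k_q, V, K, alpha=0.1):
    """Combined QBE distance (12)-(14) of a query against the database.

    v_q, k_q : visual / keyword feature vectors of the query image
    V, K     : (n, 256) visual and (n, M) keyword features of the database
    alpha    : contribution of the visual distance; alpha = 1 recovers the
               visual-only case used for out-of-database queries.
    """
    d_v = np.abs(V - v_q).sum(axis=1)            # L1 distance (12)
    d_k = (k_q * np.abs(K - k_q)).sum(axis=1)    # weighted distance (13)
    return alpha * d_v + (1.0 - alpha) * d_k     # final distance (14)
```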
Moreover, the initial retrieval of QBE was evaluated. One thousand images were randomly chosen from the 79 categories as the query set. First, the effect of different values of $\alpha$ was examined. The average precision versus $\alpha$ graph is used for the evaluation. The average precisions within the top 10, 20, and 50 images ($P(10)$, $P(20)$, and $P(50)$) are shown in Fig. 4. From the figure, we can see two facts. One is that the combination of keyword and visual features is necessary, because it is more effective than using either feature alone. The other is that there exists an optimal setting of $\alpha$, which is near 0.1 for the keyword and visual features examined here. For example, the $P(10)$ at $\alpha = 0.1$ is higher than that at $\alpha = 0$ (using the keyword features only) and that at $\alpha = 1$ (using the visual features only) by 3% and 6%, respectively.

Fig. 4. Effect of different values of α on the initial retrieval results of QBE.
Fig. 5. Initial retrieval result comparison in case of QBE. Vis, Key, and K&V represent visual, keyword, and combined visual and keyword feature-based retrieval. (U) denotes that the updated keyword features were used.
Based on this analysis of the setting of $\alpha$, the average precision versus scope graph is used to evaluate the different representations, as shown in Fig. 5. From the figure, we can see that Vis is inferior to both Key and K&V. Also, as aforementioned, K&V is better than Key without model updating. However, with model updating, Key is superior to K&V. A possible reason for this inconsistency is that the updating improves the keyword features so much that the contribution of the visual features, i.e., the value of $\alpha$, should be lowered. In fact, $\alpha$ should be chosen adaptively to reflect the contributions of the two types of features, which is one of the future directions to pursue.

Furthermore, the online learning process of QBK was evaluated. First, the effect of different values of $\beta$ was examined. Generally speaking, the larger the $\beta$, the more dynamic the feedback will be, though this does not necessarily lead to faster convergence to a satisfactory retrieval result. More specifically, the accuracy versus $\beta$ graph is used for the comparison. The accuracy here, and in the rest of the paper, means the average precision within the top 50 images, i.e., the average $P(50)$. The accuracies after the first, third, and fifth rounds of feedback iterations are shown in Fig. 6. Currently, $\beta$ is set to 0.3, which corresponds to the peak of the curves.

To show the effectiveness of the SVM trained online and the necessity of combining it with the keyword model trained offline, three strategies were compared: using the offline keyword model (offline), the online SVM model (online), and the ensemble of the two models (ensemble). The accuracy versus iteration graph is used for the comparison, as shown in Fig. 7. Compared with the performance of offline, which is static, the accuracy of online increases dynamically: the accuracy of online after five iterations is higher than that after one iteration by 30%. However, even after five iterations, the accuracy of online is still lower than that of offline. On the other hand, with ensemble, the accuracy after one iteration is already around that of offline. Moreover, the accuracy of ensemble after five
Fig. 6. Effect of different values of β on the relevance feedback results of QBK.
Fig. 7. Learning algorithms evaluation in case of QBK. Offline, online, and ensemble denote using offline keyword model, online SVM model, and ensemble of the two models, respectively.
iterations is higher than that after one iteration by about 15%.

To show the effectiveness of active learning, three selection strategies were compared: a random selection strategy (RD), the MP strategy, and the ME strategy. The accuracy versus iteration graph is used for the comparison. Note that, for the RD and ME strategies, the sorting of images for evaluation differs from that for labeling. For evaluation, all the positive (negative) images labeled so far were placed directly at the top (bottom) ranks, while the other images were sorted by their probabilities given by (9). For labeling, if the ME (or RD) strategy was used, the ten images with the highest IVs (or ten images randomly selected from the database), excluding the already labeled ones, were presented as the retrieval result. The comparison results are shown in Fig. 8, from which we can see that the two active learning strategies, i.e., the ME and MP strategies, are consistently better than the passive learning strategy, i.e., the RD strategy, after the second iteration.
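A minimal sketch of the ME selection compared here, fed by the ensemble probabilities of (9), is as follows; the mask handling is illustrative.

```python
import numpy as np

def me_select(p_offline, p_online, labeled_mask, beta=0.3, k=10):
    """Pick the k unlabeled images with the largest information values.

    The ensemble probability of (9) is pushed through the information
    value of (11), which, like the entropy (10), peaks at p = 0.5.
    """
    p = beta * p_online + (1.0 - beta) * p_offline   # ensemble (9)
    iv = 0.5 - np.abs(p - 0.5)                       # information value (11)
    iv[labeled_mask] = -np.inf                       # skip already judged images
    return np.argsort(-iv)[:k]                       # candidates to label
```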
Fig. 8. Comparison of different selecting strategies. RD, MP, and ME denote random selecting, most positive, and maximal entropy strategy, respectively.
Fig. 9. Accuracy comparison of different learning algorithms in the keyword feature space. PRW, SVM, and Rui denote probabilistic reweighting algorithm, SVM-based algorithm, and Rui Yong’s algorithm, respectively.
After five iterations, the accuracy of ME (MP) is higher than that of RD by 14% (12%). In addition, the proposed ME strategy is better than the MP strategy. Furthermore, when the updated features were used, the accuracy of ME after five iterations reached 90%.

Finally, the online learning scheme of QBE was evaluated. To show the effectiveness of the proposed probabilistic reweighting (PRW) algorithm in the keyword feature space, it was compared with Rui Yong's algorithm (Rui) [15] and an SVM-based algorithm (SVM) using a Gaussian kernel. In this evaluation, only the keyword features were used, in both the initial retrieval and the relevance feedback process. The accuracy versus iteration graph shows the results in Fig. 9. The PRW algorithm is consistently better than both Rui's algorithm and the SVM-based algorithm. After one iteration, the accuracy of PRW is higher than that of Rui (SVM) by 9% (15%). After five iterations, the accuracy difference between PRW and Rui (SVM) is enlarged (reduced) to 19% (3%).
Fig. 10. Reaction time comparison of different learning algorithms in the keyword feature space. PRW, SVM, and Rui denote probabilistic reweighting algorithm, SVM-based algorithm, and Rui Yong’s algorithm, respectively.
Fig. 11. Learning algorithms evaluation of QBE using query images within the database. Vis and K&V(U) represent visual and combined visual and keyword feature-based retrieval. (U) denotes that the updated keyword features were used.
Besides accuracy, the efficiency of the three algorithms was also evaluated. The reaction time versus iteration graph shows the results in Fig. 10. The reaction time of SVM during the first (fifth) iteration is eight (26) times as long as that of PRW and Rui. From Figs. 9 and 10, we can see that PRW is not only more effective but also more efficient than the other two algorithms. Hereafter, the PRW algorithm is used as the relevance feedback algorithm in the keyword space for the following evaluations.

For QBE, there exist two querying scenarios: the query is an image in the database, or the query is a new image that is not in the database. Vis with the NB strategy was used as the baseline. The results of the former scenario are shown in Fig. 11. Without updating the model, K&V is better than Vis both before and after feedback.
Fig. 12. Learning algorithms evaluation of QBE using query images out of the database and a query keyword with a model. Vis and K&V(U) represent visual and combined visual and keyword feature-based retrieval. (U) denotes that the updated keyword features were used.
When the updated features were used, a 10% improvement was gained before any feedback, and the improvement was further enlarged to 12% after two iterations of feedback. This demonstrates that the updating process enables the framework to start from a higher level and boosts the retrieval performance in a faster way. To simulate the querying-by-new-images situation, $\alpha$ was set to 1 in (14). The underlying concept was assumed to be a keyword with a model. The results are shown in Fig. 12. After one iteration of feedback, the accuracy of K&V is higher than that of Vis alone by 20%; when the keyword features were updated, the difference increased to 35%. During the following iterations, the accuracy of K&V(U) is consistently better than that of K&V, and both are uniformly better than that of Vis alone.

VI. CONCLUSION

In this paper, a unified framework for image retrieval using both keyword and visual features is presented. For each keyword, a statistical model is trained using the visual features of labeled images. The models serve as a bridge that connects the semantic keyword space with the visual feature space. Based on the models and on both keyword and visual features, initial retrieval and relevance feedback can be performed effectively for both the query-by-keyword and query-by-image-example scenarios. Moreover, an effective approach is proposed to update the models periodically using newly labeled images. This updating process enables the proposed framework to self-improve continuously and progressively. Extensive experiments on a large-scale image database demonstrate the effectiveness of the framework.

REFERENCES

[1] E. Chang et al., “CBSA: Content-based soft annotation for multimodal image retrieval using Bayes point machines,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 1, pp. 26–38, Jan. 2003.
[2] S. K. Chang and A. Hsu, “Image information systems: Where do we go from here?,” IEEE Trans. Knowl. Data Eng., vol. 4, no. 5, pp. 431–442, Oct. 1992.
[3] O. Chapelle, P. Haffner, and V. Vapnik, “SVMs for histogram-based image classification,” IEEE Trans. Neural Netw., vol. 10, no. 5, pp. 1055–1065, Sep. 1999.
[4] I. J. Cox et al., “The Bayesian image retrieval system, PicHunter: Theory, implementation and psychophysical experiments,” IEEE Trans. Image Process., vol. 9, no. 1, pp. 20–37, Jan. 2000.
[5] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines. Cambridge, U.K.: Cambridge Univ. Press, 2000.
[6] S. Deerwester et al., “Indexing by latent semantic analysis,” J. Amer. Soc. Inf. Sci., vol. 41, pp. 391–407, 1990.
[7] J. Huang et al., “Image indexing using color correlograms,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, San Juan, PR, Jun. 1997, pp. 762–768.
[8] T. Joachims, “Making large-scale SVM learning practical,” in Advances in Kernel Methods – Support Vector Learning, B. Schölkopf et al., Eds. Cambridge, MA: MIT Press, 1999, pp. 169–184.
[9] C. Lee, W. Y. Ma, and H. J. Zhang, “Information embedding based on user’s relevance feedback for image retrieval,” Proc. SPIE, vol. 3846, pp. 294–304, 1999.
[10] Y. Lu et al., “A unified framework for semantics and feature based relevance feedback in image retrieval systems,” in Proc. ACM Int. Multimedia Conf., 2000, pp. 31–38.
[11] T. P. Minka and R. W. Picard, “Interactive learning using a society of models,” Pattern Recognit., vol. 30, no. 4, pp. 565–581, Apr. 1997.
[12] K. Morik, P. Brockhausen, and T. Joachims, “Combining statistical learning with a knowledge-based approach – A case study in intensive care monitoring,” presented at the 16th Int. Conf. Machine Learning, 1999.
[13] J. Platt, “Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods,” in Advances in Large Margin Classifiers. Cambridge, MA: MIT Press, 2000, pp. 61–74.
[14] Y. Rui, T. S. Huang, and S. F. Chang, “Image retrieval: Current techniques, promising directions and open issues,” J. Vis. Commun. Image Rep., vol. 10, pp. 1–23, 1999.
[15] Y. Rui and T. S. Huang, “Optimizing learning in image retrieval,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Jun. 2000, pp. 236–245.
[16] G. Salton, Automatic Text Processing. Reading, MA: Addison-Wesley, 1989.
[17] H. T. Shen et al., “Giving meanings to WWW images,” in Proc. ACM Int. Multimedia Conf., 2000, pp. 39–48.
[18] A. W. M. Smeulders et al., “Content-based image retrieval at the end of the early years,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 12, pp. 1349–1380, Dec. 2000.
[19] J. R. Smith and S.-F. Chang, “VisualSEEk: A fully automated content-based image query system,” in Proc. ACM Multimedia, Boston, MA, Nov. 1996, pp. 87–98.
[20] N. A. Syed, H. Liu, and K. K. Sung, “Incremental learning with support vector machines,” presented at the Int. Joint Conf. Artificial Intelligence, Workshop on Support Vector Machines, 1999.
[21] H. Tamura and N. Yokoya, “Image database systems: A survey,” Pattern Recognit., vol. 17, no. 1, pp. 29–43, 1984.
[22] S. Tong and E. Chang, “Support vector machine active learning for image retrieval,” in Proc. ACM Int. Multimedia Conf., 2001, pp. 107–118.
[23] V. N. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[24] M. K. Warmuth, G. Rätsch, M. Mathieson, J. Liao, and C. Lemmen, “Active learning in the drug discovery process,” in Advances in Neural Information Processing Systems, vol. 14, T. G. Dietterich, S. Becker, and Z. Ghahramani, Eds. Cambridge, MA: MIT Press, 2002, pp. 1449–1456.
[25] C. Zhang and T. Chen, “Indexing and retrieval of 3D models aided by active learning,” in Proc. ACM Multimedia (Demo), 2001, pp. 615–616.
[26] H. J. Zhang and Z. Su, “Relevance feedback in CBIR,” in Proc. 6th IFIP Working Conf. Visual Database Systems, Brisbane, Australia, May 29–31, 2002, pp. 21–35.
[27] H. J. Zhang and Z. Su, “Improving CBIR by semantic propagation and cross modality query expansion,” in Proc. NSF Workshop on Multimedia Content-Based Information Retrieval, Paris, France, Sep. 24–25, 2001.
[28] L. Zhang, F. Z. Lin, and B. Zhang, “Support vector machine learning for image retrieval,” in Proc. IEEE Int. Conf. Image Processing, Oct. 2001, pp. 721–724.
[29] R. Zhao and W. I. Grosky, “Narrowing the semantic gap – Improved text-based web document retrieval using visual features,” IEEE Trans. Multimedia, vol. 4, no. 1, pp. 189–200, Mar. 2002.
[30] X. S. Zhou and T. S. Huang, “Unifying keywords and visual contents in image retrieval,” IEEE Trans. Multimedia, vol. 4, no. 1, pp. 23–33, Mar. 2002.
[31] X. S. Zhou and T. S. Huang, “Exploring the nature and variants of relevance feedback,” in Proc. IEEE Workshop Content-Based Access of Image and Video Libraries, 2001, pp. 94–97.
[32] X. S. Zhou and T. S. Huang, “Image retrieval: Feature primitives, feature representation, and relevance feedback,” in Proc. IEEE Workshop Content-Based Access of Image and Video Libraries, 2000, pp. 10–13.
Feng Jing received the B.S. degree in computer science from Tsinghua University, Beijing, China, in 2000. He is currently pursuing the Ph.D. degree at the Computer Science and Technology Department, Tsinghua University. From 2001 to 2003, he was with Microsoft Research Asia, Beijing, as a visiting student. His research interests include content-based image processing, intelligent robotics, pattern recognition, and statistical learning.
Mingjing Li received the B.S. degree in electrical engineering from the University of Science and Technology of China, Anhui, in 1989 and the Ph.D. degree in pattern recognition from Institute of Automation, Chinese Academy of Sciences, Beijing, in 1995. He joined Microsoft Research China, Beijing, in July 1999. His research interests include handwriting recognition, statistical language modeling, search engine, and image retrieval.
Hong-Jiang Zhang received the B.S. degree in electrical engineering from Zhengzhou University, Zhengzhou, China, in 1982, and the Ph.D. degree in electrical engineering from the Technical University of Denmark, Copenhagen, in 1991. From 1992 to 1995, he was with the Institute of Systems Science, National University of Singapore, where he led several projects in video and image content analysis and retrieval and computer vision. He was also with the Massachusetts Institute of Technology Media Lab, Cambridge, in 1994 as a Visiting Researcher. From 1995 to 1999, he was a Research Manager at Hewlett-Packard Labs, Palo Alto, CA, where he was responsible for research and technology transfers in the areas of multimedia management, intelligent image processing, and Internet media. In 1999, he joined Microsoft Research Asia, Beijing, where he is currently a Senior Researcher and Assistant Managing Director in charge of media computing and information processing research. He has authored three books, over 260 refereed papers, seven special issues of international journals on image and video processing, content-based media retrieval, and computer vision, and he holds over 50 patents or pending applications. Dr. Zhang currently serves on the editorial boards of five IEEE/ACM journals and a dozen committees of international conferences.
Bo Zhang was born in Fujian, China. He received the degree from Tsinghua University, Beijing, China, in 1958. He is currently a Professor with the Computer Science and Technology Department, Tsinghua University. His main research interests include artificial intelligence, neural networks, robotics, and pattern recognition. He has published about 130 papers and three books in these fields.