A Statistical Correlation Model for Image Retrieval
Mingjing Li, Zheng Chen, Liu Wenyin, Hong-Jiang Zhang
Microsoft Research China, 49 Zhichun Road, Beijing 100080, China
+86-10-6261-7711
{mjli, zhengc, wyliu, hjzhang}@microsoft.com
ABSTRACT
A bigram correlation model for image retrieval is proposed, which captures the semantic relationships among images in a database from simple statistics of users' relevance feedback. It is used in the post-processing of image retrieval results so that more semantically related images are returned to the user. The algorithm is easy to implement and can be efficiently integrated into an image retrieval system to help improve retrieval performance. Preliminary experimental results on a database of 100,000 images are very promising.
Keywords
Image retrieval, relevance feedback, statistics, correlation.
1. INTRODUCTION
The performance of current content-based image retrieval (CBIR) systems is far from satisfactory. Most of them rely on low-level features to search for visually similar images, whereas users generally intend to find semantically similar images. The semantic information conveyed by an image is difficult to capture with today's computer vision technology; that is, it is difficult to find a mapping between semantics and low-level features. Thus, feature-based image retrieval systems usually produce poor results. However, relevance feedback can significantly improve retrieval performance in CBIR systems; hence, it has been an active research area in recent years. In a relevance feedback retrieval system, the user informs the system which of the retrieved images are relevant or irrelevant to the query, based on the user's own judgment. This feedback can be considered a kind of knowledge that reveals the semantic relationships among images. But these relationships are also difficult to capture with low-level features. The relevance feedback information can be used in many different ways to improve retrieval accuracy. It can be used dynamically during the search session, for example to modify the query vector [8] or the distance metric [5], or to update the
probability distribution of images [4]. The major drawback of these methods is that there is no mechanism to accumulate or memorize the user-provided relevance feedback information. Within a search session, the feedback iteratively improves the retrieval performance. When another search session starts, however, this information is completely lost and cannot be used to improve the future performance of the image retrieval system. Other relevance feedback approaches have been proposed to deal with this problem. For instance, a semantic network [7] can be constructed on top of keyword associations with images and iteratively improved from users' queries and feedback; but it cannot be applied in pure content-based retrieval. The information embedding scheme [6] is an excellent method for accumulating relevance feedback information, but it is complex in terms of computation and implementation, and is thus difficult to incorporate into practical image retrieval systems. Motivated by work on statistical language modeling [3], we propose a statistical bigram correlation model that accumulates and memorizes the semantic knowledge learnt from simple statistics of user-provided relevance feedback. This model estimates the probability that two images are semantically similar to each other, based on the co-occurrence frequency with which both images are marked as positive examples during a query/feedback session. The model is so simple that it can easily be incorporated into an image retrieval system to help yield better results. It can be trained from the users' relevance feedback log, and it can also be updated dynamically during the image retrieval process. The rest of this paper is organized as follows. In Section 2, previous work related to this paper is summarized. In Section 3, the definition and the training of the statistical correlation model are discussed.
In Section 4, the application of the correlation model in image retrieval is introduced. Preliminary experimental results on a database of 100,000 images are presented in Section 5. Finally, concluding remarks are given in Section 6.
2. RELATED WORK
The image retrieval system based on the above-mentioned information embedding scheme [6] starts with low-level image features and gradually embeds semantic correlations between images in the database from users' relevance feedback. Initially, images are clustered into groups by the K-means algorithm. The correlation between different image groups is set to the inverse of the distance between the clusters' centroids in the feature space.
Then the system begins to evolve by interacting with users. When a search result is shown, the user informs the system which images are relevant or irrelevant to the query. Based on this feedback, the system splits and merges groups and updates the correlations between different groups. If both positive and negative examples come from the same group, that group tends to be split, since it contains semantically unrelated images. If two neighboring clusters both contain positive examples, they are merged, since they contain semantically similar images. After that, the correlations between the clusters are updated based on the feedback: the correlations between relevant clusters are increased, while the correlations between irrelevant clusters are decreased. This process is iterated and eventually breaks the feature space into semantically related clusters. In an extreme case, each image in the database could become a cluster, and the correlation between image groups degrades into that between individual images. The proposed correlation model can be considered a special case of this scheme, in the sense that it also deals with the correlations between images. Compared to information embedding, our method is simpler and easier to implement and to integrate into an image retrieval system. It does not involve any image clustering, splitting, or merging, and it does not depend on low-level features to update the correlations between images. Furthermore, the size of the correlation model can be effectively controlled by proper pruning, as is done in statistical language modeling [3]. The application of this model in image retrieval is similar to that of the graph-theoretic approach [1], in which the problem of database search is formulated as graph clustering. The goal there is to impose an additional constraint that the retrieved images be not only close to the query image but also close to each other in the feature space. In our method, however, this constraint is imposed by the semantic correlations.
3. BIGRAM CORRELATION MODEL
The proposed correlation model estimates the semantic correlation between two images based on the number of search sessions in which both images are marked as relevant. A search session starts with a query phase and is possibly followed by one or more feedback phases. For simplicity, the number of times two images are selected as relevant examples in the same session is referred to as their bigram frequency, while the number of times a single image is marked as relevant is referred to as its unigram frequency. The maximum over all unigram and bigram frequencies is referred to as the maximum frequency. Intuitively, the larger the bigram frequency, the more likely the two images are semantically similar to each other, and so the higher the correlation between them. Ideally, the correlation would be defined as the ratio between the bigram frequency and the total number of search sessions. In practice, however, there are many images in the database and users are usually reluctant to provide feedback, so the bigram frequency is very small with respect to the number of searches. Here, we define the correlation between two images as the ratio between their bigram frequency and the maximum frequency. Since the definition of bigram frequency is symmetric, the semantic correlation is also symmetric. The self-correlation, i.e., the correlation between an image and itself, is defined in a similar way, except that the bigram frequency is replaced by the unigram frequency of the image.
To be specific:

R(I, J) = R(J, I),
R(I, J) = U(I) / T, if I = J,
R(I, J) = B(I, J) / T, if I ≠ J,
where I and J are two images, B(I, J) is their bigram frequency, U(I) is the unigram frequency of image I, T is the maximum frequency, and R(I, J) is the correlation between images I and J. It would also be reasonable to define the self-correlation as the constant 1.0, since an image is exactly identical to itself in semantics; however, the above definition yielded slightly better image retrieval results in our experiments. The unigram and bigram frequencies can easily be obtained from simple statistics of the user-provided feedback information collected in the user log. However, we encountered two problems in practice. One is how to process the irrelevant examples, which also provide important information. Because of the diversity of search intentions, an image that is relevant to one user's intention may be marked as irrelevant by another user, even if their queries are the same; the correlation model should reflect the common sense of many different users. The other problem is data sparseness: because the database is large and feedback is limited, it is not easy to collect sufficient training data. To address these two problems, the calculation of the unigram and bigram frequencies is slightly more involved. In our solution, the definitions of unigram and bigram frequency are extended to take irrelevant images into account. For a specific search session, we assume a positive correlation between two positive (relevant) examples, and the corresponding bigram frequency is increased. We assume a negative correlation between a positive example and a negative (irrelevant) example, and their bigram frequency is decreased. However, we do not assume any correlation between two negative examples, because they may be irrelevant to the user's query in many different ways. Accordingly, the unigram frequency of a positive example is increased, while that of a negative example is decreased. Non-feedback images are not automatically treated as negative examples in our model.
Therefore, these images are excluded from the calculation of the unigram and bigram frequencies. To overcome the problem of data sparseness, the feedback of search sessions with the same query, either a text query or an image example, is grouped together, so that feedback images in different sessions may obtain correlation information. Within each group of search sessions with the same query, the local unigram frequency of each image, referred to as its unigram count, is calculated first. Based on these counts, the global unigram and bigram frequencies are updated. The unigram counts in a group are calculated as follows. Initially, C(I), the unigram count of image I, is set to 0. Then C(I) is iteratively updated for every session in the group:

C(I) = C(I) + 1, if image I is marked as positive in a session;
C(I) = C(I) − 1, if image I is marked as negative in a session;
C(I) is unchanged otherwise.

This process is repeated for every image in the database. The unigram frequencies are then updated as: U(I) = U(I) + C(I).
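As an illustrative sketch (our own code, not the original implementation; representing a session as a pair of sets of positive and negative image ids is an assumption), the per-group unigram counting described above could look like this:

```python
def unigram_counts(sessions):
    """Local unigram counts C(I) for one group of sessions sharing the same query.

    Each session is a (positives, negatives) pair of sets of image ids.
    Non-feedback images are never touched, so they implicitly stay at 0.
    """
    C = {}
    for positives, negatives in sessions:
        for I in positives:
            C[I] = C.get(I, 0) + 1   # marked relevant: count up
        for I in negatives:
            C[I] = C.get(I, 0) - 1   # marked irrelevant: count down
    return C

def update_unigram(U, C):
    """Fold the local counts into the global unigram frequencies: U(I) = U(I) + C(I)."""
    for I, c in C.items():
        U[I] = U.get(I, 0) + c
    return U

# Two hypothetical sessions issued with the same query:
C = unigram_counts([({"a", "b"}, {"c"}), ({"a"}, {"b"})])
assert C == {"a": 2, "b": 0, "c": -1}
```

Grouping by query before counting is what lets feedback from different sessions reinforce (or cancel) each other.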
The bigram frequencies of image pairs are updated as:

B(I, J) = B(I, J) + min{C(I), C(J)}, if C(I) > 0 and C(J) > 0;
B(I, J) = B(I, J) − min{C(I), −C(J)}, if C(I) > 0 and C(J) < 0;
B(I, J) = B(I, J) − min{−C(I), C(J)}, if C(I) < 0 and C(J) > 0;
B(I, J) is unchanged otherwise.
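These update rules, together with the correlation definition of Section 3, can be sketched as follows (our own illustrative code; the dictionary keyed by unordered image pairs is an assumed representation that keeps B symmetric by construction):

```python
def update_bigram(B, C):
    """Fold local unigram counts C into global bigram frequencies B.

    B maps frozenset({I, J}) -> B(I, J); C maps image id -> local count.
    The four cases mirror the update rules above; two negative counts
    leave B unchanged.
    """
    images = [I for I, c in C.items() if c != 0]
    for i, I in enumerate(images):
        for J in images[i + 1:]:
            key = frozenset((I, J))
            cI, cJ = C[I], C[J]
            if cI > 0 and cJ > 0:
                B[key] = B.get(key, 0) + min(cI, cJ)
            elif cI > 0 and cJ < 0:
                B[key] = B.get(key, 0) - min(cI, -cJ)
            elif cI < 0 and cJ > 0:
                B[key] = B.get(key, 0) - min(-cI, cJ)
    return B

def correlation(I, J, U, B, T):
    """R(I, J): U(I)/T on the diagonal, B(I, J)/T otherwise.

    Negative frequencies are clamped to zero, as in step (7) of the
    training procedure.
    """
    if T == 0:
        return 0.0
    if I == J:
        return max(U.get(I, 0), 0) / T
    return max(B.get(frozenset((I, J)), 0), 0) / T

B = update_bigram({}, {"a": 2, "b": 1, "c": -1})
assert B[frozenset(("a", "b"))] == 1    # min(2, 1)
assert B[frozenset(("a", "c"))] == -1   # -min(2, 1)
```

Using an unordered pair as the key means only one entry is stored per image pair, which is also what makes a triangular-matrix representation sufficient.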
In this way, the symmetry of the bigram frequency is preserved. The procedure for training the correlation model is summarized as follows: (1) initialize all unigram and bigram frequencies to zero; (2) cluster search sessions with the same query into groups; (3) calculate the local unigram counts within a group; (4) update the global unigram frequencies; (5) update the global bigram frequencies; (6) repeat steps 3, 4, and 5 for all session groups; (7) set all negative unigram and bigram frequencies to zero; (8) calculate the correlation values according to the definition.

Because of the model's symmetry, a triangular matrix is sufficient to store all correlation information. All entries with zero correlation are excluded from the model, since they convey no information. To further reduce the model size, entries whose correlation falls below a certain threshold may also be removed; this is called pruning. The representation of the correlation model is therefore highly efficient.

4. APPLICATION IN IMAGE RETRIEVAL
In a feature-based retrieval system, the images returned by the system are consistent in the feature space, but not necessarily consistent semantically. Thus, some retrieved images are often not relevant to the user's intention at all. The proposed correlation model is used to impose semantic constraints on the retrieved results so that more semantically consistent images are presented to the user; hence, the retrieval accuracy of the system is improved. The basic idea is to apply the correlation model in the post-processing of feature-based retrieval results. Given a query, the similarity between an image and the query is measured based on their feature vectors. If the user provides any relevance feedback, the similarity measure is refined accordingly and the images are re-ranked. Based on the retrieved results, each image obtains a semantic support through the correlation model. The final ranking score of each retrieved image is then the weighted sum of its feature similarity measure and its semantic support. Images with the highest scores are returned to the user as the final retrieval results.

In more detail, the proposed correlation model is integrated with the feature-based retrieval process as follows. For a given query, there are usually at least some relevant images among the retrieved results with the highest similarities. It is assumed that images that are highly correlated with those highly ranked images should obtain a higher semantic support. Therefore, the semantic support of an image is defined as the weighted sum of the correlations between this image and the top M images returned by the system:

P(I) = Σ_{j=1..M} S(I_j) × R(I, I_j) / Σ_{j=1..M} S(I_j), 0 ≤ S(I_j) ≤ 1,

where S(I_j) is the similarity measure of image I_j, R(I, I_j) is the correlation between images I and I_j, and I_j (j = 1, ..., M) are the M images with the highest similarities. The similarity measure of positive examples is set to 1, while that of negative examples is set to 0. In this way, the contribution of non-feedback images in the retrieved list is discounted because of the uncertainty of their semantic similarity to the query. The selection of M is not critical, since the contribution of an image with low similarity is small; M is set to 20 in our experiments.

An alternative would be to define the semantic support as the average correlation between an image and all positive examples. However, that definition provides no help when there is no feedback information, so it is not adopted. The final ranking score of an image in a retrieval list for a given query is calculated as:

Score(I) = w × P(I) + (1 − w) × S(I), 0 ≤ w ≤ 1,

where S(I) is the similarity measure of image I, P(I) is its semantic support, and w is the semantic weight. In our experiments, the image retrieval performance is sensitive to the semantic weight, and there is currently no good criterion for setting its value. One possible solution is to set the weight based on the training data available: the more training data, the higher the semantic weight.

5. EXPERIMENTAL RESULTS
We have implemented the correlation model and integrated it with an image search system [9], which provides keyword-based image search, query by image example, and relevance feedback. In this system, the image database has been greatly expanded and now contains about 100,000 images collected from more than 2,000 representative websites. These images cover a variety of categories, such as "animals", "arts", and "nature". Their high-level textual features and low-level visual features are extracted from the web pages containing the images and from the images themselves, respectively [2]. The correlation model is trained using users' search and feedback data collected in the user log. After internal use for months, about 3,000 queries with relevance feedback were collected.

The experiments were conducted at four different levels: (1) the baseline system, with no feedback and no correlation model; (2) the system with the semantic correlation model; (3) the system with feedback information; (4) the system with both feedback and correlation. We chose 20 queries to evaluate the performance of the proposed method. These queries are the following keywords: car, flower, tree, cat, submarine, mars, spring, galaxy, movie star, potato, ship, space, tomb raider, woman, mountain, Clinton, Jordan, angel, dog, and summer. Since the queries are keywords, only the textual features are used in the computation of the similarity measure when there is no feedback; once feedback information is available, both textual and visual features are used.

We asked two subjects to perform the image search experiments. Neither of them had any prior knowledge of the image retrieval system. Each was required to search for images with every query and to mark all relevant and irrelevant images within the top 200 results returned by the system, according to his/her own subjective judgment. To evaluate the performance improvement of the correlation model, each subject actually searched with every query twice: first with no feedback, and then with three images selected as either positive or negative examples. Meanwhile, the semantic weight was set to zero in the system. All of this information, including the queries, feedback, and relevant and irrelevant images, was stored in the log.

Based on the log, the performance evaluation is conducted automatically. Because of the subjectivity of relevance judgment, the image retrieval accuracy is calculated for each subject separately and then averaged. The accuracy is defined as the ratio between the number of relevant images within the top N retrieved results and N, which is often referred to as precision. The ground truth for each subject is extracted from the log: given a query, images that were marked as positive but never marked as negative are selected as the answer. This information may be incomplete, since there are so many images in the database, but it is exactly the same in all experiments with different semantic weights. In the experiments with feedback, the feedback examples are also extracted from the log and applied automatically.

The experimental results without feedback are presented in Figure 1, and those with feedback in Figure 2, where the horizontal axis is the number of top images considered, the vertical axis is the corresponding retrieval precision, and w is the semantic weight. The results show that the proposed method consistently improves the retrieval accuracy, whether or not there is feedback information. The higher the semantic weight, the greater the performance improvement. When there is no feedback information and the semantic weight is set to 0.8, the accuracy is improved from 50% to 54% for the top 100 images, and from 64% to 76% for the top 10 images.
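For concreteness, the ranking score of Section 4 and the precision measure used in this evaluation can be sketched as follows (our own illustrative code; all names and example values are hypothetical):

```python
def semantic_support(I, top, R):
    """P(I): similarity-weighted average of correlations with the top-M images.

    top is a list of (image, similarity) pairs; R(I, J) returns the correlation.
    """
    denom = sum(s for _, s in top)
    if denom == 0:
        return 0.0
    return sum(s * R(I, J) for J, s in top) / denom

def score(S_I, P_I, w):
    """Final ranking score: Score(I) = w * P(I) + (1 - w) * S(I)."""
    return w * P_I + (1 - w) * S_I

def precision_at(N, ranked, relevant):
    """Fraction of the top-N ranked images that belong to the relevant set."""
    return sum(1 for I in ranked[:N] if I in relevant) / N

# Hypothetical example: two top images, one perfectly correlated with I.
P = semantic_support("I", [("a", 1.0), ("b", 1.0)],
                     lambda I, J: 1.0 if J == "a" else 0.0)
assert P == 0.5
assert precision_at(4, ["a", "b", "c", "d"], {"a", "c"}) == 0.5
```

With w = 0, the score reduces to the plain feature similarity, which is how the baseline runs above were obtained.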
6. CONCLUSION
A bigram correlation model is proposed to improve the retrieval accuracy of image retrieval systems by re-ranking the top images. The training process is very simple: the model can be built and updated from simple statistics of users' feedback information. The representation of the model is efficient: because of its symmetry, a triangular matrix is sufficient to store all correlation information, and the size of the model can be effectively controlled by proper pruning. The additional computation when the model is applied in image retrieval is minor. All of these advantages make the proposed model a good choice for practical image retrieval systems.
7. ACKNOWLEDGEMENTS
We thank Dr. Wei-Ying Ma for reviewing this paper and providing technical suggestions.
8. REFERENCES
[1] Aksoy, S. and Haralick, R.M., "Graph-theoretic clustering for image grouping and retrieval", in Proc. IEEE CVPR, 1999.
[2] Chen, Z. et al., "Web Mining for Web Image Retrieval", to appear in the special issue of the Journal of the American Society for Information Science on Visual Based Retrieval Systems and Web Mining.
[3] Clarkson, P.R. and Rosenfeld, R., “Statistical Language Modeling Using the CMU-Cambridge Toolkit”, in Proc. ESCA Eurospeech, 1997.
[4] Cox, I.J. et al., "The Bayesian Image Retrieval System, PicHunter: Theory, Implementation and Psychophysical Experiments", IEEE Transactions on Image Processing, Volume 9, Issue 1, Jan. 2000.
[5] Huang, J. et al., "Combining Supervised Learning with Color Correlograms for Content-Based Image Retrieval", in Proc. ACM Multimedia, 1997.
[6] Lee, C.S., Ma, W.Y., and Zhang, H.J., "Information Embedding Based on User's Relevance Feedback for Image Retrieval", invited paper, SPIE Symposium on Voice, Video and Data Communications, 1999.
[7] Lu, Y. et al., "A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems", in Proc. ACM Multimedia, 2000.
[8] Rui, Y. et al., "A Relevance Feedback Architecture for Content-Based Multimedia Information Retrieval Systems", in Proc. IEEE Workshop on Content-Based Access of Image and Video Libraries, 1997.
[9] Zhang, H.J. et al., "iFind – A System for Semantics and Feature Based Image Retrieval over Internet", in Proc. ACM Multimedia, 2000.

Figure 1. Precision vs. scope without feedback.

Figure 2. Precision vs. scope with feedback.