Content Quality Based Image Retrieval With Multiple Instance Boost Ranking ∗
Peng Yang, Hui Li, Qingshan Liu, Lin Zhong, Dimitris Metaxas
Piscataway, NJ, United States
[email protected], lih9@umdnj,
[email protected],
[email protected],
[email protected] ABSTRACT
Most previous work treated image retrieval as a classification problem or a similarity measurement problem. In this paper, we propose a new idea for image retrieval: we regard image retrieval as a ranking issue, solved by evaluating image content quality. Based on the content preference between images, image pairs are organized to build the data set for rank learning. Because image content is generally disclosed by image patches containing meaningful objects, each image is treated as one bag, and the regions inside it are the corresponding instances. To save computation cost, the instances are rectangular regions, and the integral histogram is applied to speed up histogram feature extraction. Because the feature dimension is high, we propose a boosting-based multiple instance learning method for image retrieval. Based on different assumptions in the multiple instance setting, Mean, Max, and TopK ranking models are developed with boost learning. Experiments on real-world images from Flickr, Picasa, and Google show the power of the proposed method.

Categories and Subject Descriptors: B.X.X [Pattern Recognition]: Statistical
General Terms: Algorithms
Keywords: Image Retrieval, Ranking, Multiple Instance

1. INTRODUCTION

An image retrieval system supports browsing, searching, and retrieving images from a large database of digital images or online. Most traditional image retrieval methods add metadata such as captions, keywords, or descriptions to the images, so that retrieval can be performed over the annotation words. To search for specific images, users provide query terms such as keywords, and the search engine returns images that "satisfy" the query. However, textual descriptions cannot always capture the essence of image content. In recent years, the growth of social web applications and the semantic web has inspired the development of image search technologies. More researchers focus on visual interpretation of picture content, and try to do image retrieval through understanding the visual content inside the images. Content-based image retrieval (CBIR), also known as query by image content (QBIC) and content-based visual information retrieval (CBVIR), aims to use computer vision techniques to handle the image retrieval problem [11][4][16]. To understand the visual content of images, much research effort has been devoted in past years to image classification [17] and automatic image annotation [12][18]. Although many methods have been proposed for CBIR [4][11], most of them treat image retrieval as a classification problem or as a similarity measurement between the query image and the images in the database or online. These methods still do not touch visual content quality evaluation among the images. For example, in Figure 1, three images are returned by Flickr for the keyword "Tiger". The ideal ranking order of these three images is (a) > (b) > (c): image (a) contains the full body of a tiger, image (b) shows only part of an occluded tiger head, and image (c) is not a tiger at all. However, such an ideal list is not always available from current image search engines, because the search procedure depends heavily on the keywords in the tags, not on the true visual content of the image. To alleviate the negative impact of noisy annotations, clustering and classification methods have been applied to visual features extracted from the images as a pre-processing step [4], but they cannot provide a good visual content ordering either. Taking advantage of the user's relevance feedback
Figure 1: Three tiger images with the tag "tiger" returned by Flickr. The ranking of image (a) should be higher than image (b), and image (b) higher than image (c).

[15] is another solution, but its idea is to re-organize the data set according to the user's labels (relevant or irrelevant) and to train the classifier again, so the problem mentioned above still exists. As shown in Figure 1, (a) and (b) are both relevant to the keyword "Tiger", but (a) is obviously much better than (b). Therefore, more detailed feedback is necessary. In this paper, both the relevance and the quality of the images are taken into account. We not only indicate whether the images are relevant, but also predict how relevant they are. In other words, besides retrieving the relevant images for a query, we list them in order, with the most relevant image at the top. In practice, it is hard to label how relevant an image is to the query with an exact score on
Area chair: Qi Tian
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MM’11, November 28–December 1, 2011, Scottsdale, Arizona, USA. Copyright 2011 ACM 978-1-4503-0616-4/11/11 ...$10.00.
with (O_{k,1} ≥ O_{k,0}), where k indexes the bag pair, and O_{k,1} and O_{k,0} are the order scores of the bags B_{k,1} and B_{k,0}, respectively.
the training set. However, it is easy to label the preference between two images, and based on these preference relationships, image pairs are readily available for learning [10]. To realize this function, a pairwise ranking model is learned from a training set containing the preference on each image pair. Recently, ranking model learning has been widely explored in text information retrieval [2][10][6], but related work on image retrieval is scarce [13][8]. Inspired by [3][8], we extend ranking model learning to the multiple instance setting. An image is treated as one bag, and the potential regions in the image are treated as its instances. Normally, image segmentation would be applied to separate an image into sub-regions. Although reliable segmentation is especially helpful for feature representation, current techniques are plagued by computational complexity, unreliable segmentation, and the lack of accepted segmentation quality assessment methods. In this paper, we do not depend on image segmentation; we simply take rectangular regions at different scales and positions as the instances. Besides avoiding the uncertainties of image segmentation, using rectangular regions as instances has two advantages: 1) it saves computation cost, since no segmentation is needed; 2) fast feature extraction algorithms can be applied to statistical features, such as the integral histogram technique used in this paper. Three different schemes are proposed for multiple instance ranking learning. We test the proposed method with extensive experiments on images from Flickr, Picasa, and Google, and the experimental results demonstrate its effectiveness and power.
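The pairwise data construction described above can be sketched in a few lines; the three-level quality labels follow the labeling scheme used in the experiments, while the image names and helper name are hypothetical:

```python
from itertools import combinations

# Sketch: build training pairs (preferred, less preferred) from per-image
# quality labels, where a higher label means better content quality
# (e.g. 2 = good, 1 = intermediate, 0 = junk). Only images with different
# labels form a preference pair.
def build_bag_pairs(labeled_images):
    pairs = []
    for (img_a, la), (img_b, lb) in combinations(labeled_images, 2):
        if la > lb:
            pairs.append((img_a, img_b))
        elif lb > la:
            pairs.append((img_b, img_a))
    return pairs

imgs = [("tiger_full", 2), ("tiger_occluded", 1), ("not_tiger", 0)]
pairs = build_bag_pairs(imgs)
# three pairs: full > occluded, full > junk, occluded > junk
```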
2. MULTIPLE INSTANCE BOOST RANKING
2.1 Mean-based Model

In this model, we define the function F as the mean of the instance scores: O(B_k) = (R(x_{k,1}) + R(x_{k,2}) + ... + R(x_{k,m})) / m. The ranking score on each instance is R(x) = Σ_{t=1}^{T} α_t r_t(x). We can rewrite the bag score as

O(B_k) = (1/m) Σ_{i=1}^{m} Σ_{t=1}^{T} α_t r_t(x_i) = Σ_{t=1}^{T} α_t ( (1/m) Σ_{i=1}^{m} r_t(x_i) )   (1)
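The rearrangement in Eq. (1), where the mean of the boosted instance scores equals the boosted combination of per-round instance means, can be checked numerically; the weak-ranker outputs and weights below are made-up values for illustration:

```python
# Sketch: verify the rearrangement in Eq. (1) on made-up weak-ranker outputs.
# r[t][i] is the {0, 1} output of weak ranker t on instance i of one bag.
alphas = [0.7, 0.4, 0.9]                        # hypothetical round weights
r = [[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 1, 0]]  # 3 rounds x 4 instances
m = 4

# Left side: mean over instances of the boosted instance scores R(x_i).
R = [sum(a * r[t][i] for t, a in enumerate(alphas)) for i in range(m)]
lhs = sum(R) / m

# Right side: boosted combination of the per-round instance means.
rhs = sum(a * (sum(r[t]) / m) for t, a in enumerate(alphas))

assert abs(lhs - rhs) < 1e-12
```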
Therefore, in the boosting framework, in each round we just need to learn the optimal weak ranker r_t(x) and the corresponding weight α_t. We take a simple threshold as the weak ranker, so the output of r_t(x) is {1, 0}. The optimal weak ranker is the one that minimizes the rank error on the bag pairs, and the corresponding coefficient is α_t = (1/2) log((1+R)/(1−R)), where R = Σ_k D(B_{k,0}, B_{k,1}) (O(B_{k,1}) − O(B_{k,0})). Here D(B_{k,0}, B_{k,1}) is the weight on each bag pair, and the rank score on bag B_{k,1} is O(B_{k,1}) = (1/m) Σ_j r(x_{k,j,1}).

Algorithm 1 Multiple Instance Ranking Based on Max Score
1: Given example image (bag) pairs (B_{1,0}, B_{1,1}), ..., (B_{n,0}, B_{n,1}). Each bag contains instances {x_{i,j}}, where j = 1, ..., m.
2: Initialize weights D_1(B_{k,0}, B_{k,1}) = 1/N.
3: for t = 1 ... T do
4:   Get weak ranker r_t : r_t(x) → H, s.t. equation ??.
5:   Choose α_t ∈ R.
6:   Based on the current weak ranker, choose the instances with max rank score as the available instances {x*_{i,j}}.
     "E"-step: 1) Get the optimal weak ranker r*_t(x) and its weight α*_t based on the available instances {x*_{i,j}}. 2) Update the rank score on all instances: R_t(x_{i,j}) = R_{t−1}(x_{i,j}) + α*_t r*_t(x_{i,j}).
     "M"-step: 1) Keep the instances with max rank score in each bag, save them as the available instances {x*_{i,j}}, and check the stop conditions.
7:   Update: α_t = α*_t, r_t = r*_t, O_t(B_i) = max_j r_t(x_{i,j}).
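The coefficient α_t can be computed directly from the weighted pairwise rank agreement R defined above; the pair weights and bag scores in this sketch are hypothetical:

```python
import math

def alpha_from_agreement(pairs, D, O):
    """alpha_t = 0.5 * log((1 + R) / (1 - R)), with
    R = sum_k D_k * (O(B_{k,1}) - O(B_{k,0}))."""
    R = sum(D[k] * (O[b1] - O[b0]) for k, (b0, b1) in enumerate(pairs))
    return 0.5 * math.log((1.0 + R) / (1.0 - R))

# Hypothetical: two bag pairs and bag scores from the current weak ranker.
pairs = [("b0", "b1"), ("c0", "c1")]
D = [0.5, 0.5]                                   # normalized pair weights
O = {"b1": 0.8, "b0": 0.2, "c1": 0.6, "c0": 0.4}
a = alpha_from_agreement(pairs, D, O)  # positive: ranker agrees with the order
```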
Multiple instance learning was first introduced in the context of drug activity prediction [5]. Different from traditional pattern-based learning, training class labels are associated only with sets of samples (bags), instead of individual samples (instances). The labels of instances are only indirectly associated with the labels of bags: if a bag is labeled positive, at least one instance inside the bag is positive; a bag is labeled negative if and only if all the instances in the bag are negative. Most previous work on image retrieval tried to learn a classifier to label images as relevant or irrelevant to the query. However, even if two images are both relevant in content, they can still differ in content quality. It is well known that the semantic content of an image is generally disclosed by one or a few interesting objects in the image, and it is not easy to localize the interesting objects automatically. To handle these issues, we propose in this section to integrate multiple instance learning into boost ranking. Assume there are many images from the same category, and each image in the training set has one rank score evaluating its content quality, i.e., whether this image contains high-quality interesting objects; the locations of the interesting objects in the images are unknown. In other words, the dominant regions in the image, which play an important role in determining the rank score, are unknown. From the view of the multiple instance setting, the ranking order is on the bag level (images), and the ranking order on the instances (regions) in a bag is not known, but it is implicitly represented by the ranking order of its bag. Define the bag set B_i and the corresponding orders O_i, where i ∈ [1, n]. Without loss of generality, the orders decrease monotonically: O_1 ≥ O_2 ≥ ... ≥ O_n.
We organize the pairwise data as bag pairs (B_i, B_j) with the associated order information (O_i, O_j). For convenience, we rearrange and index the bag pairs as (B_{k,1}, B_{k,0})
D_{t+1}(B_{i,0}, B_{i,1}) = D_t(B_{i,0}, B_{i,1}) exp(α_t (O_t(B_{i,0}) − O_t(B_{i,1}))) / Z_t,

where Z_t is a normalization factor.
8: end for
9: Output the final ranking H(x) = Σ_{t=1}^{T} α_t r_t(x).
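The pair-weight update above can be sketched as follows; the scores and the value of α are made up for illustration:

```python
import math

def update_pair_weights(D, alpha, O0, O1):
    """D_{t+1}(k) is proportional to D_t(k) * exp(alpha * (O_t(B_{k,0}) - O_t(B_{k,1}))).
    Correctly ordered pairs (O1 > O0) are down-weighted; violated pairs are
    up-weighted, so later rounds focus on the hard pairs."""
    new = [d * math.exp(alpha * (o0 - o1)) for d, o0, o1 in zip(D, O0, O1)]
    Z = sum(new)                      # normalization factor Z_t
    return [w / Z for w in new]

# Hypothetical bag scores: pair 0 is ranked correctly, pair 1 is violated.
D = [0.5, 0.5]
O0, O1 = [0.2, 0.9], [0.8, 0.3]
D_next = update_pair_weights(D, alpha=0.42, O0=O0, O1=O1)
```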
2.2 Max-based Model

The Max-based model defines the function F as the max of the instance scores: O(B_k) = max(R(x_{k,1}), R(x_{k,2}), ..., R(x_{k,m})). Therefore, in each round the max rank score is picked from the rank scores of the potentially best instances in each bag. Most of the instances in a bag contribute little to ranking the bag, so it is reasonable to remove them during training. The greedy strategy is to keep the instance with the maximum ranking score and withdraw the other instances in each bag. However, this greedy strategy is too aggressive and may lose potentially good instances. Therefore, we adopt an "EM"-based instance selection scheme in each round. In each iteration, the weak ranker r_t(x) is learned based on the available instances in each bag, and then the strong ranker Σ_{i=1}^{t−1} α_i r_i(x) + α_t r_t(x) learned by this iteration is used to cut off the instances that are not the best ones in each bag. In the "E" step, the ideal weak ranker r*_t(x) is
learned based on the remaining instances, and α*_t is the corresponding weight. In the "M" step, based on the updated strong ranker Σ_{i=1}^{t−1} α_i r_i(x) + α*_t r*_t(x), all the instances are checked to see whether they have the max rank score, and the remaining instances go to the "E" step again. The "EM" iteration stops when the change rate of the available instances is less than δ or the maximum number of iterations is reached. The whole process is summarized in Algorithm 1.
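The "M"-step instance selection can be sketched as below, with each instance reduced to a single made-up scalar feature and simple threshold weak rankers standing in for the learned ones:

```python
# Sketch of the "EM"-style instance selection for the Max-based model.

def strong_score(x, thresholds, alphas):
    # R(x) = sum_t alpha_t * r_t(x), with threshold weak rankers r_t.
    return sum(a * (1 if x > thr else 0) for a, thr in zip(alphas, thresholds))

def select_max_instances(bags, thresholds, alphas):
    """'M' step: keep only the instance with the max strong-ranker score in
    each bag; the survivors feed the next 'E' step."""
    return [max(bag, key=lambda x: strong_score(x, thresholds, alphas))
            for bag in bags]

bags = [[0.1, 0.7, 0.3], [0.4, 0.2]]   # two bags of hypothetical instances
thresholds = [0.25, 0.5]               # weak-ranker thresholds learned so far
alphas = [0.6, 0.4]
survivors = select_max_instances(bags, thresholds, alphas)
```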
2.3 Top K-based Model

The Top K-based model is a compromise between the Mean-based model and the Max-based model. It defines the function as the mean of the top K instance scores:

O(B_k) = (1/K) Σ_{x ∈ TopK(B_k)} R(x)   (2)
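The top-K bag score interpolates between the Max model (K = 1) and the Mean model (K = m); a minimal sketch with made-up instance scores:

```python
def topk_bag_score(instance_scores, k=5):
    """Top K-based model: the bag score is the mean of the K highest
    instance scores (K = 5 in the experiments; K = 1 recovers the Max
    model, K = m recovers the Mean model)."""
    top = sorted(instance_scores, reverse=True)[:k]
    return sum(top) / len(top)

scores = [0.9, 0.1, 0.6, 0.4, 0.8, 0.2, 0.7]
topk_bag_score(scores, k=5)   # mean of {0.9, 0.8, 0.7, 0.6, 0.4}
```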
Similar to the Max-based model, an EM strategy is used to select the top K instances in each bag during training. The details of the algorithm are similar to Algorithm 1, and the same "EM" stop condition is used when choosing the top K instances. We set K = 5 in the experiments.

Table 1: Average Precisions for different queries on Flickr

Query       Flickr    Ours (Mean)   Ours (Max)   Ours (TopK)
Alligator   0.4092    0.6244        0.7368       0.7232
Dalmatian   0.4088    0.6368        0.697        0.6198
Dolphin     0.6571    0.7078        0.7938       0.852
Elephant    0.5264    0.4425        0.6021       0.6085
Giraffe     0.5943    0.8874        0.7293       0.8366
Goat        0.5487    0.5991        0.5699       0.5083
Horse       0.683     0.8341        0.7287       0.7847
Kangaroo    0.3516    0.475         0.6282       0.5431
Leopard     0.5735    0.6172        0.6628       0.727
Penguin     0.2074    0.6949        0.5439       0.7107
mean        0.4960    0.6519        0.6692       0.6914

3. IMAGE REPRESENTATION
As described above, each image is taken as one bag, and potential regions that may contain interesting objects are regarded as its instances. In [8], image segmentation is used as preprocessing, and each segmented region is taken as one instance. However, image segmentation has some uncertainties. For example, the same object is often segmented into several regions, so a region-based instance can only describe limited local information about the object. Additionally, extra computation is needed for the segmentation itself. In our experiments, we instead take rectangular regions as instances, sliding rectangles of different scales over the image to build them. In real, high-quality images, the interesting object is rarely very small; based on this prior, the rectangle size is kept in a proper range. For an image of size m × n, the height of the rectangle ranges over [m/3, m] and the width over [n/3, n]. As in most previous work, statistical features are used to represent the instances. We use popular statistical features including RGB, HSV, LAB, rg, opponent, Gabor, and Haar-like features. Each feature is mapped into a normalized histogram, giving in total a 1200-dimensional histogram feature describing each instance. Because we use rectangles to represent instances, we can easily adopt the integral histogram technique to speed up the histogram computation for each kind of feature [14]. After building the integral histogram, the histogram of any rectangle in the image can be obtained with three plus-minus operations. Thus our feature extraction is very fast, and moreover, we can produce a large number of instances for an image bag, some of which capture the interesting objects exactly.
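The integral-histogram trick can be illustrated on a toy quantized image: one pass builds cumulative per-bin counts, after which the histogram of any rectangle needs only four lookups (three plus-minus operations) per bin. This is a minimal sketch, not the paper's implementation:

```python
def build_integral_hist(img, n_bins):
    """ii[y][x][b] = count of bin b in the rectangle of rows [0, y), cols [0, x)."""
    h, w = len(img), len(img[0])
    ii = [[[0] * n_bins for _ in range(w + 1)] for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            for b in range(n_bins):
                ii[y + 1][x + 1][b] = ((img[y][x] == b)
                                       + ii[y][x + 1][b]
                                       + ii[y + 1][x][b]
                                       - ii[y][x][b])
    return ii

def rect_hist(ii, y0, x0, y1, x1):
    # Histogram of rows [y0, y1), cols [x0, x1): A - B - C + D per bin.
    n_bins = len(ii[0][0])
    return [ii[y1][x1][b] - ii[y0][x1][b] - ii[y1][x0][b] + ii[y0][x0][b]
            for b in range(n_bins)]

img = [[0, 1, 1],
       [2, 1, 0],
       [0, 0, 2]]                     # toy 3x3 image quantized into 3 bins
ii = build_integral_hist(img, n_bins=3)
rect_hist(ii, 0, 0, 2, 2)             # histogram of the top-left 2x2 block
```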
4. EXPERIMENT

Same as [8], we send 10 animal-category queries to Flickr: Alligator, Dalmatian, Dolphin, Elephant, Giraffe, Goat, Horse, Kangaroo, Leopard, and Penguin. For each category, we download around 200 images. We also test the proposed method on images collected from the Google image search engine and from Picasa, two other popular photo services, downloading the same 10 categories of animal images from both. Animals are demonstrably among the most difficult classes to recognize, and there has been much recent interest in searching for animals on the web [1][7].

We manually assign one of the following three labels to the images: (1) Good image: a good example of the animal category; the animal (or animals) is the most salient object in the image and is well delineated. (2) Intermediate image: the image contains the desired animal, but the animal is not the dominant object, or there is extensive occlusion, or the view is poor. (3) Junk image: the image does not contain a real animal. Figure 1 shows an example, where the three tiger images are labeled at the three different levels respectively.

4.1 Performance Measures

To evaluate the performance of the learned ranking model on the testing images, we adopt two performance measures. The first is the Normalized Discounted Cumulative Gain (NDCG) [9]. For a list of images sorted in descending order of the scores output by a ranking model, the NDCG score at the m-th image is computed as

N_m = C_m Σ_{j=1}^{m} (2^{r(j)} − 1) / log2(1 + j),   (3)

where r(j) is the rating of the j-th image and C_m is a normalization constant equal to 1/INDCG, where INDCG is the NDCG score of the ideal list ranked by its ground-truth ratings.

The second measure is Average Precision (AP), calculated as

AP = (1/N_pos) Σ_{j=1}^{N} P(j) × rel(j),   (4)

where N_pos is the number of positive images, N is the total number of retrieved images, rel(j) is a binary function indicating whether the j-th image is relevant, and P(j) is the precision at position j. We regard the good images as positive, and the intermediate and junk images as negative.

4.2 Experiment Setting

Each image is regarded as a bag, and we extract many rectangular regions of different sizes and positions from it, taking these regions as the instances in the bag. The width of the rectangle is set to {m/3, m/2, 2m/3, m}, and the corresponding height is also {m/3, m/2, 2m/3, m}. The sampling step is half of the width and height, so in total there are 121 instances in one bag. Half of the images are randomly selected from the collected data sets as the training set, and the rest are used for testing. On the Flickr data set, we first compare the rankings obtained by our method with the original Flickr ranking. Flickr ranks the images in descending order of "relevance", which is presumably determined by click-through, comments, and other factors. We also compare our method with Hu's work, using their published experimental results directly [8]. To evaluate the generality of the proposed ranking model, we use the ranking model learned on the Flickr data set to test the images from Google and Picasa. On the
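The two measures defined in Section 4.1 can be sketched directly from Eqs. (3) and (4); the ranked lists at the bottom are hypothetical:

```python
import math

def ndcg_at_m(ratings, m):
    """NDCG (Eq. 3): ratings are listed in the model's ranked order."""
    def dcg(rs):
        return sum((2 ** r - 1) / math.log2(1 + j)
                   for j, r in enumerate(rs[:m], start=1))
    ideal = dcg(sorted(ratings, reverse=True))
    return dcg(ratings) / ideal if ideal > 0 else 0.0

def average_precision(rel):
    """AP (Eq. 4): rel is a 0/1 relevance list in ranked order."""
    n_pos = sum(rel)
    hits, ap = 0, 0.0
    for j, r in enumerate(rel, start=1):
        if r:
            hits += 1
            ap += hits / j          # P(j) * rel(j)
    return ap / n_pos if n_pos else 0.0

# Hypothetical ranked lists: ratings 2 = good, 1 = intermediate, 0 = junk.
ndcg = ndcg_at_m([2, 0, 1], m=3)
ap = average_precision([1, 0, 1, 1])
```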
Google and Picasa testing data sets, we compare the results of our method with the rankings the two services provide. We download the top 60 images for each category from Google search and the first-page images returned by Picasa, and we assign one of the three levels to each image as the ground truth.

4.3 Experiment Result

First, we test the proposed method on the Flickr data set and compare it with [8]. Table 1 reports the results. Compared with the output of Flickr, both our method and Hu's work are much better, which means that multiple instance ranking can improve the ranking performance significantly. Since both methods use the Mean, Max, and Top K settings, we list the comparison for these three settings respectively in Table 1. Our method obtains competitive performance compared to [8]. On average, 40 features are used for each category in our method; compared with [8], far fewer features are needed to reach similar performance.

To investigate the generality of our method, we apply the ranking model trained on the Flickr data set to the data sets from Google and Picasa. Table 3 reports the results on the Google and Picasa images respectively. For the Google images, we label the top 60 images returned by Google and compare them against our TopK model. From Table 3, it is clear that our method can further improve the performance of the Google search engine. We also apply the TopK model to the Picasa images; here the images returned in the first page are used for testing (normally 24 images per page in our setting). Our ranking model can significantly improve the performance of the Picasa search engine as well.

For information retrieval, users are most interested in the top N returned samples. We focus on the top 5, 10, and 20 images; the corresponding average precision and NDCG are shown in Table 2. The results further demonstrate the effectiveness and power of our method.

Table 2: Average Precision for different queries at the Nth image

            Google Image            Ours (Mean)             Ours (Max)              Ours (TopK)
Query       @5th   @10th  @20th    @5th   @10th  @20th    @5th   @10th  @20th    @5th   @10th  @20th
Alligator   0.8056 0.7608 0.7243   1.0000 0.8556 0.7310   1.0000 1.0000 1.0000   1.0000 0.7440 0.7766
Dalmatian   1.0000 1.0000 1.0000   1.0000 1.0000 0.9886   0.9167 0.8228 0.8146   1.0000 0.9526 0.9186
Dolphin     1.0000 0.9526 0.8871   1.0000 0.9889 0.9151   1.0000 1.0000 0.7528   1.0000 1.0000 0.8772
Elephant    1.0000 0.9294 0.8515   1.0000 1.0000 0.9445   1.0000 1.0000 0.9719   0.9167 0.8339 0.8376
Giraffe     0.9167 0.8073 0.8401   1.0000 1.0000 1.0000   1.0000 0.9617 0.9232   1.0000 1.0000 0.9568
Goat        1.0000 0.9283 0.8890   1.0000 1.0000 0.9914   1.0000 1.0000 1.0000   1.0000 1.0000 0.9917
Horse       0.8875 0.8044 0.7810   1.0000 0.9617 0.8657   1.0000 1.0000 0.8980   1.0000 0.9889 0.9468
Kangaroo    1.0000 0.9415 0.8501   1.0000 1.0000 0.9510   1.0000 1.0000 1.0000   1.0000 0.9379 0.8752
Leopard     1.0000 0.9160 0.8726   1.0000 1.0000 0.9764   1.0000 1.0000 0.9917   1.0000 1.0000 0.9308
Penguin     0.9500 0.9068 0.8370   1.0000 0.9765 0.9438   1.0000 0.9750 0.8986   1.0000 0.9283 0.9287
mean        0.9560 0.8947 0.8533   1.0000 0.9783 0.9308   0.9917 0.9760 0.9251   0.9917 0.9386 0.9040

Table 3: Average Precisions for different queries

Query       Google Image   Ours     Picasa   Ours
Alligator   0.6465         0.6580   0.1213   1.0000
Dalmatian   0.9528         0.9031   0.3492   0.8056
Dolphin     0.7793         0.6529   0.6456   0.8863
Elephant    0.7055         0.8154   0.2188   0.5246
Giraffe     0.8101         0.8832   0.4728   0.7198
Goat        0.7677         0.8944   0.4985   0.4889
Horse       0.7319         0.8119   0.5349   0.8606
Kangaroo    0.7455         0.8220   0.2634   0.4455
Leopard     0.8247         0.8887   0.4136   0.6120
Penguin     0.7204         0.8442   0.1421   0.8373
mean        0.768          0.820    0.418    0.724

5. CONCLUSION

In this paper, we proposed a new multiple instance ranking algorithm to attack the content-based image retrieval problem. According to the preference between images, pairwise image sets are organized to build the data set for rank learning. Each image is treated as one bag, and the possible rectangular regions inside it are taken as the corresponding instances. Based on the assumption that if one bag is better than another, it contains at least one instance that is better than all the instances in the other bag, we take the max strategy into ranking consideration. Relaxing this assumption to the average score, or to the scores of the best K instances in a bag, we built the models called the Mean, Max, and TopK models to do ranking. Because statistical features are used as the raw features, we applied the integral histogram to extract the histogram features of rectangular regions quickly. Experiments on data from Flickr, the Google image search engine, and Picasa demonstrated the effectiveness and robustness of our methods.

6. REFERENCES
[1] T. L. Berg and D. A. Forsyth. Animals on the web. In Proc. IEEE CVPR, 2006.
[2] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In Proc. ICML, pages 89-96, 2005.
[3] Y. Chen, J. Z. Wang, and D. Geman. Image categorization by learning and reasoning with regions. Journal of Machine Learning Research, 5:913-939, 2004.
[4] R. Datta, D. Joshi, J. Li, and J. Z. Wang. Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys, 40, 2008.
[5] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Perez. Solving the multiple-instance problem with axis-parallel rectangles. Artificial Intelligence, 1997.
[6] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933-969, 2003.
[7] E. Hörster, M. Slaney, M. Ranzato, and K. Weinberger. Unsupervised image ranking. In Proc. ACM Workshop on Large-Scale Multimedia Retrieval and Mining, 2009.
[8] Y. Hu, M. Li, and N. Yu. Multiple-instance ranking: Learning to rank images for image retrieval. In Proc. IEEE CVPR, 2008.
[9] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20, 2002.
[10] T. Joachims. Optimizing search engines using clickthrough data. In Proc. ACM SIGKDD, 2002.
[11] M. S. Lew. Content-based multimedia information retrieval: State of the art and challenges. ACM Transactions on Multimedia Computing, Communications, and Applications, 2:1-19, 2006.
[12] J. Li and J. Z. Wang. Real-time computerized annotation of pictures. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(1), June 2008.
[13] M. Merler, R. Yan, and J. Smith. Imbalanced rankboost for efficiently ranking large-scale image/video collections. In Proc. IEEE CVPR, 2009.
[14] F. Porikli. Integral histogram: A fast way to extract histograms in Cartesian spaces. In Proc. IEEE CVPR, 2005.
[15] Y. Rui, T. S. Huang, M. Ortega, and S. Mehrotra. Relevance feedback: A power tool for interactive content-based image retrieval. IEEE Transactions on Circuits and Systems for Video Technology, pages 644-655, 1998.
[16] N. Sebe, M. S. Lew, X. Zhou, T. S. Huang, and E. M. Bakker. The state of the art in image and video retrieval. In Proc. CIVR, 2003.
[17] A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:1349-1380, 2000.
[18] L. von Ahn and L. Dabbish. Labeling images with a computer game. In Proc. CHI, 2004.