Applying Aggregation Concepts for Image Search

Brandeis Marshall
Purdue University
Computer and Information Technology
401 N. Grant Street, West Lafayette, IN 47906
[email protected]

Dale-Marie Wilson
University of North Carolina at Charlotte
Department of Computer Science
9201 University City Blvd, Charlotte, NC 28223
[email protected]
Abstract

With the influx of information content on the Internet, a number of image search methodologies have been presented and implemented to increase the accuracy of image retrieval, including keywords, object classification and feature processing. Both keyword and object classification models rely heavily on human subjects, which is time-consuming and error-prone given inconsistent word agreement. We propose two feature processing methods that require no human intervention. The feature collage algorithm compares images based on particular features, such as the color histogram, whereas the feature independent algorithm considers each feature's dimensions as independent contributors to image quality. Using query-by-example, we organize images using rank aggregation methods previously applied in text information retrieval. We show through empirical experimentation the benefits of our feature processing algorithms over traditional CBIR approaches.
1. Introduction

Image search has gained the attention of researchers due to the complexity of scientifically evaluating image content. Image retrieval approaches designed to improve accuracy include text labeling/annotations [15], object classification [7] and feature processing [7, 8, 9, 12, 14]. Object classification makes use of a content-based image retrieval (CBIR) approach that identifies the distinctive foundation objects within an image. Both annotation and object classification models rely heavily on human subjects, are time-consuming and suffer from inconsistent word agreement when describing an image. To make either technique successful, feedback from the test subjects is necessary, and users in the real-application domain are assumed to choose the same or similar keywords as the test subjects. As digital content increases, the reliance on text labeling for images becomes increasingly challenging.

In CBIR feature processing techniques, the low-level descriptors, or features, such as color and shape, are processed to reveal the high-level semantics and determine the relevancy of an image. As more features and associated dimensions are processed, the computational cost increases with marginal improvement in accuracy. In this paper, we propose two strategies, the feature collage and feature independent algorithms, to improve the likelihood of finding similar images by ranking the low-level descriptors. Ranking allows us to isolate similar images that surface in one dimension but may not appear in another dimension; distance functions alone are unable to capture this localized similarity. To determine the best matching images, we use a rank aggregation method to compute the consensus ordering of the images amongst the ranked lists. Rank aggregation methods, typically used in Web text searching, focus on highlighting a specific characteristic or property of the ranked lists. Aggregation methods assess similarity amongst images in very different ways, resulting in distinct final ranked lists. By varying the dimensions used in our feature processing methods, we investigate the magnitude of the reduction in performance, showing when our methods would, in general, outperform previous approaches.

The contributions of this paper are: (1) we introduce two feature ranking algorithms, feature independent and feature collage; (2) we conduct an empirical study showing that ranking images increases accuracy over using a distance function alone and is competitive with text-based image retrieval; and (3) we study the effectiveness of several rank aggregation methods, including the popular PageRank algorithm.
2. Related Literature
Our work is closely related to the early and late fusion methods introduced in [13]. Focused on video semantic analysis, that work combines visual, auditory and textual features from a video shot as input to fusion algorithms, providing significant information to classify the video shot. Our work considers only the visual features, without the assistance of speech or text, and studies avenues for improving accuracy by selecting the appropriate distance function and visual descriptors.

Of the researched features, color is most commonly used in conjunction with at least one other feature such as texture [7, 8, 14], shape [9, 12, 14] or edge [14]. Each feature can be represented in different ways; for instance, color can be decomposed into the color histogram or color moment. Early CBIR techniques use color and shape to represent the content of an image and texture to explore depth, such as separating the background and foreground. CBIR aims to identify an image based on these low-level features to describe high-level semantics without a priori text labeling.

Image databases are usually searched using one of three methods: category browsing [9, 12, 14], query by concept [7] or query by example [4, 8]. There may be no explicit attachment of keywords to images, but images are classified into predefined groups. In category browsing, similar images are retrieved from specific groups identified as closely related to the query image. Category browsing is beneficial when the categories are well-defined or specific to a domain; a higher accuracy is achievable due to the initial strong correlation of the images. Query by concept decomposes each image into its objects (e.g. tree, sky, lake), in which each object has its own category, requiring heavy preprocessing to identify the correct set of categories. The third option, and our focus in this paper, is query by example, which directly compares the image database to the query image using the visual content descriptors without requiring the creation of categories.

3. Basics of CBIR Search

For an image database Q ∪ Rq, Q contains the query images and Rq the relevant images for some query q. Let Cq be the set of candidate images, i.e., all images except the query image q. An image is evaluated based on its pixels according to n pre-selected features, where each feature Fi has m dimensions (Fi1, ..., Fim). Processing an image produces a value for each dimension. These dimension values can then be organized in a ranked list, or ranker, r using Cq's identifiers. In the case of ties, the images are ordered randomly.

For a set of rankers r1, ..., rt where t ≥ 2, a rank aggregation method can be used to find the consensus ordering of images amongst the rankers, returning an aggregate ranker rA. If t = 1, then rA is simply the first K images of the single ranker; otherwise, rA contains the first K images produced by the aggregator. Precision counts the overlapping images between rA and the relevant images Rq; it is a maximization objective that aims to have all K images of the ranker appear in Rq.
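As a concrete illustration of these definitions, the Python sketch below (the helper names are ours, not the paper's) builds a ranker from per-image similarity values with random tie-breaking and scores its precision against Rq, normalized by K so that the maximum is 1:

    import random

    def make_ranker(sim_by_image, k):
        """Order candidate image ids by ascending similarity value;
        ties are broken randomly, as in the paper's setup."""
        ids = list(sim_by_image)
        random.shuffle(ids)                      # randomize so ties land in random order
        ids.sort(key=lambda i: sim_by_image[i])  # stable sort preserves shuffle for ties
        return ids[:k]

    def precision(ranker, relevant, k):
        """Overlap between the ranker's first K images and the relevant
        set Rq, divided by K (our normalization, matching the 0-1 values
        reported in the paper's tables)."""
        return len(set(ranker[:k]) & set(relevant)) / k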
3.1. Distance Functions

A common approach to determining similarity is a distance function [8, 9, 12], chosen to maximally leverage the image contents and to balance a dimension's importance and independence in relation to other dimensions. We focus on the Minkowski form and the Kullback-Leibler divergence because both functions (1) have been applied in previous image retrieval literature and (2) are flexible enough to compare subsets of dimensions as well as individual dimensions.

Minkowski Form. If the dimensions are equally important and independent, then the Minkowski distance is best suited to find the similarity between two images. If p = 2, we have the Euclidean distance.

dist(q, c) = ( \sum_{1 \le f \le n} \sum_{1 \le d \le m} |c_d^f - q_d^f|^p )^{1/p}

Kullback-Leibler Divergence. The Kullback-Leibler (KL) divergence captures how compactly one dimension's distribution can be represented using another dimension as the codebook.

dist(q, c) = \sum_{1 \le f \le n} \sum_{1 \le d \le m} q_d^f \log ( q_d^f / c_d^f )
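For concreteness, a minimal Python sketch of both distances over aligned dimension values follows; the epsilon guard and the normalization of the KL inputs to probability distributions are our assumptions, since the paper does not spell them out:

    import math

    def minkowski(q, c, p=2):
        """Minkowski form over aligned dimension values; p=2 gives Euclidean."""
        return sum(abs(cv - qv) ** p for qv, cv in zip(q, c)) ** (1.0 / p)

    def kl_divergence(q, c, eps=1e-12):
        """KL divergence of q from c; values are normalized to sum to 1
        and padded with eps to avoid log(0) -- both our assumptions."""
        zq, zc = (sum(q) or 1.0), (sum(c) or 1.0)
        return sum((qv / zq) * math.log(((qv / zq) + eps) / ((cv / zc) + eps))
                   for qv, cv in zip(q, c))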
3.2. Distance Similarity (Algorithm 1)

Algorithm 1 evaluates image similarity using a distance function alone. This methodology requires the complete processing of each candidate image c against the query image q. We iterate through the candidate images, computing the similarity of each image c to q, and the first K images in ascending order of similarity value are returned as a ranker. Either distance function above can be used to evaluate the similarity. Thus, for two images c1, c2, c1 has a lower rank than c2 (c1 < c2) iff a numerically significant majority of comparisons in terms of the dimension values place c1 closer to q than c2. If using the Euclidean distance, the sum of the differences for c1 must be less than that of c2, even though the dimension values may order c1 < c2 in one dimension but c2 < c1 in another. The restriction of using a distance function becomes determining the magnitude of the similar features. We address this limitation by (1) partitioning the dimensions according to their feature Fi and (2) computing the dimensions individually.
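Before moving on, here is a sketch of Algorithm 1 as just described, reusing the helpers above; the layout of candidates as a dict from image id to a stacked vector of all n*m dimension values is our assumption:

    def algorithm1(query_vec, candidates, dist, k):
        """Algorithm 1: compare every candidate to the query with a single
        distance function and return the first K images, most similar first."""
        sims = {cid: dist(query_vec, vec) for cid, vec in candidates.items()}
        return sorted(sims, key=sims.get)[:k]

    # e.g. top10 = algorithm1(q_vec, db, minkowski, 10)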
4. Proposed Algorithms

In this section, we introduce our algorithms, which exploit the characteristics of the images by integrating rank aggregation concepts into image retrieval. First, we discuss rank aggregation methods (aggregators) as they are applied to Web keyword search. Then we describe the feature collage and feature independent methods.
4.1. Rank Aggregation Methods

Given the multitude of existing rank aggregation methods, we select six commonly-used and/or recent aggregators: Average (Borda's Count), Median, CombMNZ, Precision Optimal, PageRank and Condorcet-fuse. We discuss key benefits and/or pitfalls of each aggregator.
Average and Median. The Average [2] (Median [6]) aggregator computes the average (or median) rank value of an image across the set of rankers and then orders these values to obtain the aggregate ranker. Average considers all rank information, which may not be desired in the case of large amounts of incorrect information. Median ignores all but one piece of rank information, which may be problematic when the median ranks are highly similar.
CombMNZ and Precision Optimal. CombMNZ [10] orders the information using a combination of the frequency of appearances and the ranks. In the Precision Optimal [1] aggregator, the frequency of appearances initially orders the rank information; in the case of ties, the Average aggregator is computed. Both CombMNZ and Precision Optimal rely on multiple appearances of the same image to provide supporting evidence of similarity.
PageRank and Condorcet-fuse. The PageRank algorithm [3], an approximation to the Markov chain aggregator MC4 [5], computes the steady state probability of each image through either the probability of navigating to that image from another or by randomly jumping to that image. We perform modifications on the original algorithm as presented in [1]. Condorcet-fuse [11] uses a graph of unweighted directed edges in which, for images ci and cj, an edge ci → cj indicates that ci dominates cj in the majority of rankers.
4.2. Feature Collage (Algorithm 2)

Instead of comparing all of a candidate's dimensions to the query at once, Algorithm 2 groups the dimensions according to their features. As shown in the pseudocode below, rank aggregation occurs twice: once within each feature and once across all features. Any of the aggregators discussed above can be substituted, and the returned results may differ. For simplicity, the aggregator chosen for the first aggregation is also used in the second.
Algorithm 2: Feature Collage

for each feature f do
    for each dimension d do
        for each candidate c do
            sim_d[c] = dist(q_d^f, c_d^f)
        r_d = sort(sim_d)
        inputs = inputs ∪ r_d
    r_f = A(inputs_K)
    inputs2 = inputs2 ∪ r_f
r_A = A(inputs2_K)
return r_K^A
Thus, for two images c1, c2 in r_K^A, c1 has a lower rank than c2 iff the ordering c1 < c2 is upheld in a majority, i.e. n/2 + 1, of the n features. Since each feature is considered separately, the number of repeatedly observed images amongst the features can be low, and irrelevant images can surface, leading to low precision. When "bad" information, an incorrect ordering, dominates within a feature, the aggregation concepts cannot accurately leverage the image properties because too many relevant images carry high (poor) ranks.
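Putting the pieces together, a minimal sketch of feature collage processing, reusing the distance and aggregation helpers sketched earlier; the dict-of-features data layout is our assumption:

    def feature_collage(query, candidates, dist, agg, k):
        """Algorithm 2 sketch: rank candidates per dimension, aggregate the
        rankers within each feature, then aggregate across features."""
        per_feature = []
        for f in query:                      # feature names, e.g. 'color_hist'
            per_dim = []
            for d in range(len(query[f])):
                sims = {cid: dist([query[f][d]], [vecs[f][d]])
                        for cid, vecs in candidates.items()}
                per_dim.append(sorted(sims, key=sims.get))  # one ranker per dimension
            per_feature.append(agg(per_dim, k))   # first aggregation (within feature)
        return agg(per_feature, k)                # second aggregation (across features)

    # e.g. result = feature_collage(q, db, minkowski, average_agg, 10)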
4.3. Feature Independent (Algorithm 3)

Algorithm 2's grouping of the rankers is beneficial in identifying those images that are clearly similar to the query image. However, the contents of an image are not uniformly similar to the query image in all features. In Algorithm 3, we remove this partitioning of
the features. The pseudocode below shows how each dimension of every feature is considered individually, as an independent contributor to the similarity evaluation. The number of repeatedly observed images increases as more dimensions are selected. Relevant images that received high (poor) ranks under Algorithm 2 can now be ranked lower (better) and be returned in the first K results.
Algorithm 3: Feature Independent

for each feature f do
    for each dimension d do
        for each candidate c do
            sim_d[c] = dist(q_d^f, c_d^f)
        r_d = sort(sim_d)
        inputs = inputs ∪ r_d
r_A = A(inputs_K)
return r_K^A
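A matching sketch of Algorithm 3 under the same assumed data layout; the only change from feature collage is that all n*m dimension rankers feed a single aggregation:

    def feature_independent(query, candidates, dist, agg, k):
        """Algorithm 3 sketch: every dimension of every feature contributes
        an independent ranker; one aggregation yields the final list."""
        rankers = []
        for f in query:
            for d in range(len(query[f])):
                sims = {cid: dist([query[f][d]], [vecs[f][d]])
                        for cid, vecs in candidates.items()}
                rankers.append(sorted(sims, key=sims.get))
        return agg(rankers, k)  # one aggregation across all n*m rankers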
Hence, for two images c1, c2 in r_K^A, c1 has a lower rank than c2 iff the ordering c1 < c2 is upheld in a majority of dimensions. For the n features with m dimensions each, the ordering c1 < c2 can only be maintained when (n · m)/2 + 1 rankers contain that ordering. If the rankers are dominated by the ordering c2 < c1, no rank aggregation method or distance function can computationally change this ordering.

5. Empirical Study

We evaluate the effectiveness of Algorithms 1-3 and the keyword image search method. We organize over 5000 images into 5 basic categories: animals, plants, mountains, sky and non-living/manmade objects. Table 1 shows the directories for each category. Taking about 10 images per directory, we test our algorithms on 230 query images and set K = 10, so each ranker returns its first 10 images.

Category    Directories
Animal      antelope, butterfly, cats, dinosaur art, dogs, horses, lizards, penguins, sea creatures, eagles, bird nests
Plant       bonsai, botany, flowers
Mountains   green, white, red
Sky         sunset, landscape
Manmade     aviation, air balloons, fireworks, rare cars

Table 1. Image database
Feature-Based Image Retrieval. We perform an image preprocessing phase that computes the characteristics of each image according to the features. Due to their small number of dimensions, we use all dimensions of the color histogram (5 dimensions), color moment (2 dimensions) and texture edge (9 dimensions) features. For edge histogram, homogeneous texture and texture tamura, which have 32 dimensions each, we select 6.25% (2-d), 12.5% (4-d), 18.75% (6-d), 25% (8-d), 50% (16-d) and 75% (24-d) of the dimensions.
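The paper fixes these percentages for the 32-dimension features but does not state how the subset is drawn; the sketch below assumes evenly spaced indices, which is purely our guess:

    def select_dims(num_dims, pct):
        """Pick round(pct * num_dims) dimension indices, evenly spaced
        (the selection rule is our assumption, not the paper's)."""
        count = max(1, round(num_dims * pct))
        return [i * num_dims // count for i in range(count)]

    # select_dims(32, 0.0625) -> [0, 16]; select_dims(32, 0.75) -> 24 indices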
Text-Based Image Retrieval. We select a best-case methodology in which we use the words in the query image annotation. A group of students from one of the co-authors' classes annotated the images, similarly to the Google Image Labeler. We manually chose 3 nouns and/or adjectives that capture the main components of each image, and no two queries had the same 3 keywords. For query images with fewer than 3 words in the annotation, we manually added more descriptions. We also manually corrected misspellings.

Table 2 shows the performance and standard deviation of the 3-keyword image search over the 230 query images. Column 1 provides the average precision (overlapping relevant images); columns 2-4 display the average number of images (with standard deviation) having a 3-, 2- and 1-keyword match.

            Precision  Match 3  Match 2  Match 1
perf.       0.817      8.4      67.4     614.9
std. dev.   0.263      13.0     92.0     568.7

Table 2. Overall results

            Precision  Match 3  Match 2  Match 1
Animal      0.868      7.4      37.7     469.5
Green Mts   0.745      14.8     193.5    1115.8
Red Mts     0.380      3.9      69.3     752.3
Sky         0.675      8.5      159.7    1363.2
Plant       0.870      12.6     40.0     285.9
Manmade     0.849      6.1      60.1     602.9

Table 3. Category results

In Table 3, we observe that the Green Mts, Red Mts and Sky categories did not achieve over 80% accuracy. These categories contain overlapping descriptions such as sky, sunset, trees, river and clouds, making relevant images more challenging to find. For the Animal, Plant and Manmade categories, the directories are relatively distinct, so the annotations include identifiable descriptions.
            Euclidean  100%   75%    50%    25%    18.75%  12.5%  6.25%
Animal      0.183      0.264  0.263  0.245  0.235  0.223   0.209  0.201
Green Mts   0.770      0.730  0.730  0.645  0.645  0.650   0.575  0.625
Red Mts     0.220      0.270  0.320  0.290  0.290  0.300   0.270  0.280
Sky         0.535      0.335  0.330  0.290  0.260  0.270   0.260  0.200
Plant       0.267      0.293  0.300  0.263  0.247  0.230   0.243  0.240
ManMade     0.165      0.223  0.218  0.193  0.193  0.213   0.198  0.185

Table 4. Algorithm 2 category precision performance
                   100%   75%    50%    25%    18.75%  12.5%  6.25%
Average            0.347  0.360  0.318  0.312  0.309   0.287  0.280
Median             0.288  0.266  0.251  0.238  0.249   0.234  0.219
Pagerank           0.269  0.252  0.253  0.244  0.242   0.235  0.251
Precision Optimal  0.254  0.252  0.238  0.240  0.249   0.223  0.224
CombMNZ            0.268  0.255  0.257  0.237  0.239   0.234  0.245
Condorcet-fuse     0.240  0.239  0.208  0.193  0.200   0.172  0.161

Table 5. Algorithm 2 aggregation performance
Algorithm 1 (Distance Function). We compare the accuracy of the Euclidean and KL divergence distances using the texture edge feature, since it provided the best performance. In general, Euclidean and KL divergence have nearly equivalent precision values, 25.5% and 25.6% respectively, as shown in Table 6. Due to the similar performance of the distances, we can compare our proposed algorithms to either Euclidean or KL divergence. We choose the Euclidean distance since it is more commonly used in the related literature.
            Euclidean  KL Divergence
Animal      0.183      0.168
Green Mts   0.770      0.755
Red Mts     0.220      0.380
Sky         0.535      0.465
Plant       0.267      0.280
ManMade     0.165      0.170

Table 6. Average precision by category

Feature Collage Results. Table 4 compares the average precision of the Euclidean distance to feature collage; the highest accuracy observed during feature collage processing varies by category. We notice that for the Green Mts and Sky categories the Euclidean distance outperformed our approach. This result is not unexpected: the strict partitioning of features may push relevant images into higher (poorer) ranks. The dimensionality variation shows decreases in accuracy ranging from insignificant (Manmade) to dramatic (Sky). A decrease in dimensionality does not necessarily produce lower precision results, as for the Plant category. We observe that using 100% or 75% of the features appears to be the best choice. The rank aggregation performance shown in Table 5 is overwhelmingly biased toward the Average aggregator (36%), with its nearest competitor, Median, over 12% lower in accuracy. We notice the higher accuracy at 75% of the features, most likely due to information overload when using 100% of the features.
Feature Independent Results. Table 7 displays the precision performance using feature independent processing. Compared to Table 4, the accuracy increased for every category except Red Mts, which remains at 32%. The average increase of 10% in precision provides evidence of the significant benefit of this algorithm, and the increased accuracy is no longer confined to using 100% or 75% of the features. In Table 8, we continue to observe the dominance of the Average aggregator at 42.8%, up about 6.25% from feature collage processing. The Median and Precision Optimal aggregators also show higher precision than under Algorithm 2, whereas the Pagerank, CombMNZ and Condorcet-fuse aggregators experience a slight decrease in performance. Pagerank, CombMNZ and Condorcet-fuse rely on the ranks of the images to effectively determine the ordering; here the rankers yield a nearly random ordering of the images, producing low performance.
            Euclidean  100%   75%    50%    25%    18.75%  12.5%  6.25%
Animal      0.183      0.396  0.373  0.359  0.341  0.337   0.332  0.324
Green Mts   0.770      0.915  0.910  0.890  0.880  0.885   0.870  0.835
Red Mts     0.220      0.240  0.210  0.240  0.240  0.290   0.300  0.320
Sky         0.535      0.370  0.405  0.460  0.340  0.385   0.415  0.410
Plant       0.267      0.400  0.387  0.370  0.350  0.390   0.397  0.380
ManMade     0.165      0.268  0.250  0.250  0.258  0.283   0.275  0.265

Table 7. Algorithm 3 category precision performance
                   100%   75%    50%    25%    18.75%  12.5%  6.25%
Average            0.428  0.419  0.415  0.389  0.404   0.409  0.396
Median             0.409  0.400  0.404  0.377  0.404   0.396  0.391
Pagerank           0.215  0.199  0.173  0.171  0.192   0.213  0.251
Precision Optimal  0.206  0.247  0.305  0.350  0.375   0.382  0.386
CombMNZ            0.215  0.200  0.174  0.173  0.190   0.217  0.250
Condorcet-fuse     0.220  0.202  0.171  0.163  0.169   0.211  0.218

Table 8. Algorithm 3 aggregation performance
6. Conclusion
We introduce two algorithms that leverage rank aggregation concepts for image search without relying on text annotations. Feature collage processing performs two aggregation phases, exploiting multiple appearances of an image confined within a feature. Feature independent processing removes the feature partitioning, allowing multiple appearances across all dimensions. Through empirical experiments, we show that, in general, both proposed algorithms outperform the traditional model of distance functions. We also observe that dimensionality is a significant contributor to accuracy, depending on the query images. When examining the rank aggregation methods, the Average aggregator is the clear choice, outperforming the other five methods by a dramatic margin.
References

[1] S. Adalı, B. Hill, and M. Magdon-Ismail. The impact of ranker quality on rank aggregation algorithms: Information vs. robustness. In Proceedings of the International Workshop on Challenges in Web Information Retrieval and Integration, pages 10-19, 2006.
[2] J. C. Borda. Mémoire sur les élections au scrutin. In Histoire de l'Académie Royale des Sciences, 1781.
[3] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proceedings of ACM WWW, pages 107-117, 1998.
[4] Y. Chen, J. Wang, and R. Krovetz. CLUE: Cluster-based retrieval of images by unsupervised learning. IEEE Transactions on Image Processing, 14(8), 2005.
[5] C. Dwork, R. Kumar, M. Naor, and D. Sivakumar. Rank aggregation methods for the web. In Proceedings of ACM WWW, pages 613-622, 2001.
[6] R. Fagin, R. Kumar, and D. Sivakumar. Efficient similarity search and classification via rank aggregation. In Proceedings of ACM SIGMOD, pages 301-312, 2003.
[7] Y. Gao, J. Fan, H. Luo, X. Xue, and R. Jain. Automatic image annotation by incorporating feature hierarchy and boosting to scale up SVM classifiers. In ACM Multimedia, 2006.
[8] Q. Iqbal and J. K. Aggarwal. CIRES: A system for content-based retrieval in digital image libraries. In International Conference on Control, Automation, Robotics and Vision (ICARCV), pages 205-210, 2002.
[9] A. Jain and A. Vailaya. Image retrieval using color and shape. Pattern Recognition, 29(8):1233-1244, 1996.
[10] J. H. Lee. Analyses of multiple evidence combination. In Proceedings of ACM SIGIR, pages 267-276, 1997.
[11] M. Montague and J. A. Aslam. Condorcet fusion for improved retrieval. In Proceedings of ACM CIKM, pages 538-548, 2002.
[12] A. Pentland, R. Picard, and S. Sclaroff. Photobook: Content-based manipulation of image databases. International Journal of Computer Vision, 1995.
[13] C. G. M. Snoek, M. Worring, and A. W. M. Smeulders. Early versus late fusion in semantic video analysis. In Proceedings of ACM Multimedia, pages 399-402, 2005.
[14] A. Vailaya, M. Figueiredo, A. Jain, and H. Zhang. Image classification for content-based indexing. IEEE Transactions on Image Processing, 10(1):117-130, 2001.
[15] L. von Ahn and L. Dabbish. Labeling images with a computer game. In Proceedings of ACM SIGCHI, pages 319-326, 2004.