Iterative Random Visual Word Selection

Thierry Urruty, Syntyche Gbèhounou, Huu Ton Le (XLIM Laboratory, UMR CNRS 7252, University of Poitiers, France)
Jean Martinet (LIFL, UMR CNRS 8022, Lille 1 University, France)
Christine Fernandez (XLIM Laboratory, UMR CNRS 7252, University of Poitiers, France)

ABSTRACT


In content-based image retrieval, one of the most important steps is the construction of image signatures. To do so, a number of state-of-the-art approaches propose to build a visual vocabulary. In this paper, we propose a new methodology for visual vocabulary construction that obtains high retrieval results. Moreover, it is computationally inexpensive to build and needs no prior knowledge of the features or dataset used. Classically, the vocabulary is built by aggregating features into centroids using a clustering algorithm; the final centroids are assimilated to visual "words". Our approach for building a visual vocabulary is based on an iterative random visual word selection mixing a saliency map and the tf-idf scheme. Experimental results show that it outperforms the original "Bag of visual words" approach in both efficiency and effectiveness.

Categories and Subject Descriptors
H.2 [Database Management]: Database Applications; H.3 [Information Storage and Retrieval]: Content Analysis and Indexing, Information Search and Retrieval; I.4 [Image Processing and Computer Vision]: Scene Analysis, Image Representation, Quantization

General Terms
Algorithms, Measurement, Performance

Keywords
Image Retrieval, Vocabulary Construction, Bags of Visual Words, Random Word Selection, Saliency Map


1. INTRODUCTION

In the last decades, online image databases have grown exponentially, especially with users sharing personal photographs on Flickr (www.flickr.com) or other free image hosting websites. However, such shared databases have heterogeneous user annotations, so semantic retrieval is not effective enough to access all images. The need for novel methods of content-based searching of image databases has become more pressing. Building an effective and efficient content-based image retrieval (CBIR) system helps bridging the semantic gap between the low-level features and the high-level semantic concepts contained in an image. Much research has tried to close this gap; with time, however, the term "close" has become "narrow down". To this end, work has been done to improve low-level features [4, 16, 25], multiplying the number of existing global and local image descriptors. Others have focused their effort on selecting the best way to use, combine, refine and select those low-level features to optimise the effectiveness and efficiency of the retrieval, using supervised and unsupervised classification methodologies. Among them, the "Bag of features" or "Bag of visual words" approach [2, 24] has been widely used to create image signatures and to index image databases. More recently, visual attention models, which extract the most salient locations of an image, have been widely applied in different applications. Some bio-inspired approaches are based on psycho-visual features [9, 17] while others make use of several low-level features [6, 28]. Saliency maps help selecting the most interesting parts of an image or a video, which reduces the computational cost of creating the image index or signature while increasing the precision of the retrieval. In this article, we present a novel approach to construct a visual vocabulary. Our main objective is to propose a simpler, faster and more robust alternative to the "bag of visual words" approach, based on an iterative random visual word selection. In our word selection process, we introduce an information gain measure that mixes ideas from saliency maps and the tf-idf scheme. The remainder of this article is structured as follows:

We provide a brief overview of related work in Section 2. Then, we explain our iterative random visual word selection approach in Section 3 and present exhaustive experiments in Section 4. Finally, we discuss the findings of our study in Section 5 and conclude in Section 6.

2. STATE OF THE ART

We do not aim to present in this section everything that has been done in content-based image retrieval: this research domain is very active, so we only present the most popular proposals and the ones that have inspired our work. We first present related work concerning low-level features, then we discuss methodologies for image representation, and finally we highlight some saliency-map-based approaches.

2.1 Image Features

The most popular descriptor in object recognition is the SIFT (Scale Invariant Feature Transform) descriptor proposed by Lowe in 1999 [13]. The image is represented by a set of vectors of local invariant features. The efficiency of SIFT and its different extensions has been demonstrated in numerous papers on object recognition and image retrieval [10, 11, 13, 14, 20, 25]. The original version of SIFT [13] is defined in greyscale, and different colour variants of SIFT have been proposed. For example, OpponentSIFT, proposed by Van de Sande et al. [25], is recommended when no prior knowledge about the data set is available. OpponentSIFT describes all the channels in the opponent colour space (Equation (1)) using SIFT descriptors.

\[
\begin{pmatrix} O_1 \\ O_2 \\ O_3 \end{pmatrix} = \begin{pmatrix} (R - G)/\sqrt{2} \\ (R + G - 2B)/\sqrt{6} \\ (R + G + B)/\sqrt{3} \end{pmatrix} \quad (1)
\]

The information in the O_3 channel is the intensity information, while the other channels describe the colour information in the image. The greyscale SIFT is 128-dimensional, whereas the colour versions of SIFT, e.g. OpponentSIFT, are 384-dimensional. This high dimensionality induces slow implementations when matching the different feature vectors. The first solution proposed to address this aspect, by Ke and Sukthankar, is PCA-SIFT [11] with 36 dimensions. This variant is faster for matching, but less distinctive than SIFT according to the comparative study of Mikolajczyk and Schmid [18]. In this study, they proposed GLOH (Gradient Location and Orientation Histogram), a new variant of SIFT which is more effective than SIFT with the same number of dimensions; however, GLOH is more computationally expensive. Again with the aim of proposing a solution that is less computationally expensive and more precise, Bay et al. [1] introduced SURF (Speeded Up Robust Features), a new detector-descriptor scheme. The detector used in the SURF scheme is based on the Hessian matrix and is applied to integral images to make it fast. Their average recognition rate on objects of art in a museum shows that SURF performs better than GLOH, SIFT and PCA-SIFT. There are other interesting local point descriptors, such as:

• Colour moments (abbreviated CM for the remainder of the paper): measures that can be used to differentiate images based on their colour features. Once computed, these moments provide a measurement of colour similarity between images. They are based on generalized colour moments [19] and are 30-dimensional. Given a colour image represented by a function I with RGB triplets, for image position (x, y), the generalized colour moments are defined by Equation (2):

\[
M_{pq}^{abc} = \iint x^p y^q \, [I_R(x, y)]^a \, [I_G(x, y)]^b \, [I_B(x, y)]^c \, dx \, dy \quad (2)
\]

M_{pq}^{abc} is referred to as a generalized colour moment of order p+q and degree a+b+c. Only generalized colour moments up to the first order and the second degree are considered; thus the resulting invariants are functions of the generalized colour moments M_{00}^{abc}, M_{10}^{abc} and M_{01}^{abc}, with

\[
(a, b, c) \in \{ (1,0,0), (0,1,0), (0,0,1), (2,0,0), (0,2,0), (0,0,2), (1,1,0), (1,0,1), (0,1,1) \}.
\]

• Colour Moment Invariants (abbreviated CMI for the remainder of the paper), computed from the algorithm proposed by Mindru et al. [19]. The authors use generalised colour moments to construct combined invariants to affine transforms of coordinates and contrast changes. There are 24 basis invariants involving generalized colour moments in all 3 colour bands.

The features presented above are interesting works, among others, that inspired the design of our approach. The next section discusses variations around the popular image representation with visual words.
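To make Equations (1) and (2) concrete, the following minimal Python/NumPy sketch (ours, not the authors' code; function names are illustrative) converts an RGB patch to the opponent colour space and computes one generalized colour moment by a discrete approximation of the integral.

```python
import numpy as np

def opponent_channels(rgb):
    """Convert an H x W x 3 float RGB image to opponent colour space, Eq. (1)."""
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    O1 = (R - G) / np.sqrt(2)
    O2 = (R + G - 2 * B) / np.sqrt(6)
    O3 = (R + G + B) / np.sqrt(3)
    return np.stack([O1, O2, O3], axis=-1)

def generalized_colour_moment(rgb, p, q, a, b, c):
    """Discrete approximation of M_pq^abc from Eq. (2) over an image patch."""
    h, w, _ = rgb.shape
    y, x = np.mgrid[0:h, 0:w].astype(float)
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return np.sum(x**p * y**q * R**a * G**b * B**c)

# Example: M_00^(1,0,0) is simply the sum of the red channel over the patch.
patch = np.random.rand(16, 16, 3)
m = generalized_colour_moment(patch, p=0, q=0, a=1, b=0, c=0)
```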

2.2 Bag-of-Words-based Approaches

Concerning image representation, plenty of solutions have been proposed. The most popular is commonly called "Bag of Visual Words" (BoVW) and was inspired by the bag-of-words representation used in text categorisation. This approach goes by different names; for example, it is called Bag of keypoints by Csurka et al. [2], but the general principle is the same. Given a visual vocabulary, the idea is to characterise an image by a vector of visual word frequencies [24]. The visual vocabulary construction is often done through low-level feature vector clustering, using for instance k-Means [2, 24] or Gaussian Mixture Models (GMM) [5, 22]. It is usual to apply a weighting value to the components of this vector. The standard weighting scheme is known as "term frequency-inverse document frequency" (tf-idf) [24], computed as described in Equation (3). Suppose there is a vocabulary of k words; then each document is represented by a k-vector V_d = (t_1, ..., t_i, ..., t_k)^T of weighted word frequencies with components:

\[
t_i = \frac{n_{id}}{n_d} \log \frac{N}{n_i} \quad (3)
\]

where n_{id} is the number of occurrences of word i in document d, n_d is the total number of words in document d, n_i is the number of occurrences of term i in the dataset, and N is the number of documents in the dataset.
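As an illustration of Equation (3), here is a minimal sketch (NumPy only; names are ours) that builds tf-idf weighted BoVW signatures, assuming each image is already represented by the list of visual word indices its keypoints were assigned to.

```python
import numpy as np

def tfidf_signatures(word_indices_per_image, k):
    """Build tf-idf weighted BoVW vectors following Eq. (3).

    word_indices_per_image: list of 1-D integer arrays, one per image, each
    entry being the index of the visual word a keypoint was assigned to.
    k: vocabulary size.
    """
    N = len(word_indices_per_image)
    counts = np.array([np.bincount(w, minlength=k) for w in word_indices_per_image],
                      dtype=float)                   # n_id
    n_d = counts.sum(axis=1, keepdims=True)          # total words per document
    n_i = counts.sum(axis=0)                         # occurrences of word i in the dataset
    idf = np.log(N / np.maximum(n_i, 1))             # log(N / n_i), guarded against n_i = 0
    return (counts / np.maximum(n_d, 1)) * idf       # t_i = (n_id / n_d) * log(N / n_i)
```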

Besides BoVW approaches, many other equally efficient methods exist, for example the Fisher Kernel or VLAD (Vector of Locally Aggregated Descriptors). The first one has been used by Perronnin and Dance [23] on visual vocabularies for image categorisation: they proposed to apply Fisher kernels to visual vocabularies represented by means of a GMM. In comparison to the BoVW representation, fewer visual words are required by this more sophisticated representation. VLAD was introduced by Jégou et al. [10] and can be seen as a simplification of the Fisher kernel. Considering a codebook C = {c_1, ..., c_k} of k visual words generated with k-Means, each local descriptor x is associated to its nearest visual word c_i = NN(x). The idea of the VLAD descriptor is to accumulate, for each visual word c_i, the differences x − c_i of the vectors x assigned to c_i.
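The VLAD aggregation just described can be sketched in a few lines (a sketch under the assumption that the codebook has already been obtained with k-Means; not the cited authors' code):

```python
import numpy as np

def vlad(descriptors, codebook):
    """Aggregate local descriptors of one image into a VLAD vector.

    descriptors: (n, d) array of local descriptors.
    codebook:    (k, d) array of visual words c_1..c_k.
    """
    codebook = np.asarray(codebook, dtype=float)
    # Assign each descriptor to its nearest visual word: c_i = NN(x).
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    nn = dists.argmin(axis=1)
    v = np.zeros_like(codebook)
    for i in range(codebook.shape[0]):
        assigned = descriptors[nn == i]
        if len(assigned):
            v[i] = (assigned - codebook[i]).sum(axis=0)   # accumulate x - c_i
    v = v.ravel()
    return v / (np.linalg.norm(v) + 1e-12)                # L2-normalise the result
```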

2.3 Saliency Map

Visual attention models are used to identify the most salient locations in an image; they are widely applied in many image-related research domains. In the last decades, many visual saliency frameworks have been published; some of them are inspired by psycho-visual features [9, 17], while others make use of several low-level features in different ways [6, 28]. The work of Itti et al. [9] can be considered an outstanding example of the former group. In [9], an input image is processed by parallel channels (colour, orientation, intensity) before combining them to generate the final saliency map. Alternatively, Zhang et al. [28] present an approach based on natural statistical information. They define an expression called "pointwise mutual information" between the visual features and the presence of the target to indicate the overall saliency. Recently, visual attention models have been used in new applications such as image retrieval, and they have achieved interesting results. The most common approach consists in taking advantage of the saliency map to decrease the amount of information to be processed [6, 12, 27]. These methods usually take the information given by the visual attention model at an early stage; image information will be either discarded or picked as input for the next stages based on its saliency value. In [6], the SIFT descriptor is used for image retrieval and the saliency value is used to rank all the keypoints in the input image; only the most distinctive points are kept for the matching stage. Research in [27] shares the same idea: a threshold is defined and SURF features are computed only for pixels with a saliency (also sometimes called cornerness) value above this threshold. Their experiments show that the number of features can be reduced without affecting the performance of the classifier. In [15], a weighting scheme is designed for visual words based on saliency, assuming that the importance of an image object can be estimated with the saliency value of its region. In [12], a visual attention model is combined with a segmentation procedure to extract the regions associated with the highest saliency values for further processing.

3. ITERATIVE RANDOM VISUAL WORD SELECTION

As discussed in the state of the art, a lot of researchers and engineers have used BoVW as a tool to develop better object recognition systems, and to construct higher semantic levels, often called Bags of Visual Phrases. In [3, 29], the authors propose hierarchical representations based on a visual word/phrase vocabulary. These methodologies have shown high performance when combining phrases and words in the retrieval phase. Their results also highlighted the importance of both the feature selection step and the visual vocabulary selection step in obtaining a more effective approach. Thus, in this paper we focus on opening the "black box" named Bags of Visual Words to simplify it and to make it more robust to the choice of descriptor. In the next sections, we discuss the limitations of BoVW-based approaches that inspired our work and then present our approach.

3.1 Robustness and BoVW

Most of the time, BoVW uses the well-known k-Means algorithm [2] or a Gaussian Mixture Model to construct the visual vocabulary. But the effectiveness of these clustering algorithms tends to drop drastically with the high dimensionality of the image features. As the authors mention in [21], the curse of dimensionality occurs and makes the result of the clustering close to random. This randomness aspect has been one of the starting points of our reflection. A second aspect of our reflection was to minimize the importance of feature selection. Without a priori knowledge about the image dataset, our approach is designed to simplify the indexing step and the parameter tuning for the selected features.

3.2 Random Selection of Words

As mentioned in the previous section, using k-Means or similar algorithms in the construction step of the visual vocabulary has no justification in high-dimensional spaces. k-Means is still widely used because it is well known and relatively inexpensive compared to other clustering algorithms. Experiments in previous work (e.g. [15]) have shown that k-Means yields near-random clustering results when the dimensionality grows, which is the case with many visual image descriptors; thus it is not well adapted for generating visual words. One of our objectives is to simplify the construction of the visual vocabulary. We propose to replace the clustering step by randomly selecting a large set of visual words. This random selection can be made by either:

• randomly picking keypoints in all images of a collection and using their computed descriptors as visual words;

• creating synthetic visual words with knowledge of the feature space distribution.

Our experimental results show that either way gives similar results, but picking keypoints randomly is probably easier without any prior knowledge of the feature space. However, the image dataset needs to contain heterogeneous images that reflect the real world, and the number of randomly picked words should be high enough (discussed in Section 4). The second step of the visual vocabulary construction is to identify the visual words that have the best information gain IG in the set of selected random visual words. To do so, we need a selection criterion. Even if many exist, as we are dealing with visual words, we made the analogy between information retrieval and the CBIR domain and chose to base part of the IG score on the tf-idf weighting scheme [26].

The other part of the IG score comes from saliency maps [9], computed using the graph-based visual saliency software (http://www.klab.caltech.edu/~harel/share/gbvs.php).

Algorithm 1: Visual vocabulary construction
  Data: set of random words, image dataset
  Result: final visual vocabulary
  extraction of random words;
  while Nb words in Vocabulary > threshold do
      Assign each keypoint to its closest word w;
      Compute IG_w for each word w with Equation (4);
      Sort words according to their IG_w value;
      Delete α% of the words with the lowest IG value;
  end

To construct the visual vocabulary, we start from a set of random visual words and a heterogeneous training image dataset. As illustrated in Algorithm 1, we first assign all keypoints extracted from each image to one of the randomly chosen visual words. Then, we compute the information gain of each visual word. As already mentioned, the information gain IG_w is based on the tf-idf weighting scheme and a saliency score, given by the following formula:

\[
IG_w = \frac{n_{wD}}{n_D} \log \frac{N}{n_w} + \frac{\sum Sal_{wD}}{n_{wD}} \quad (4)
\]

where IG_w is the IG value of the visual word w, n_{wD} is the frequency of w over all the keypoints of the dataset D, n_D is the total number of keypoints in the dataset, N is the number of images in the dataset, n_w is the number of images containing the word w, and Sal_{wD} is the saliency score of the keypoints assigned to word w. The tf-idf part is estimated dataset-wise, instead of document-wise. The whole set of random visual words is then sorted in ascending order of IG values. All words with no assigned keypoint have an IG value equal to 0, hence they are deleted. Then, a threshold α is used to set the number or percentage of words to be deleted at each iteration. The loop is repeated until the desired number of words in the visual vocabulary is reached, re-assigning only the keypoints that were orphaned by the previously deleted visual words. At the end of this step, the visual vocabulary is built. The final step of our approach consists in indexing all the target dataset images using the visual vocabulary built from the training image dataset. It creates a signature for each image with respect to the occurrence frequency of each visual word in the image. This frequency histogram is L2-normalized for better retrieval results.
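To make the construction concrete, here is a minimal Python/NumPy sketch of Algorithm 1 under our own assumptions: keypoint descriptors, their saliency values and their image indices are already extracted, the initial words are descriptors of randomly picked keypoints, assignment uses the L2 distance, and all keypoints are re-assigned at every iteration (the paper only re-assigns orphaned ones). Function and variable names are ours, not the authors'.

```python
import numpy as np

def select_words(words, keypoints, saliency, images, target=256, alpha=0.10):
    """Iterative selection of Algorithm 1: keep the words with the highest IG, Eq. (4)."""
    N = len(np.unique(images))        # number of images in the dataset
    n_D = len(keypoints)              # total number of keypoints in the dataset
    while len(words) > target:
        # Assign every keypoint to its closest word (L2 distance).
        d = np.linalg.norm(keypoints[:, None, :] - words[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        ig = np.zeros(len(words))
        for w in range(len(words)):
            idx = np.flatnonzero(assign == w)
            n_wD = len(idx)                              # keypoints assigned to word w
            if n_wD == 0:
                continue                                 # IG stays 0: deleted first
            n_w = len(np.unique(images[idx]))            # images containing word w
            ig[w] = (n_wD / n_D) * np.log(N / n_w) + saliency[idx].sum() / n_wD
        n_del = max(1, int(alpha * len(words)))          # delete the alpha % lowest-IG words
        words = words[np.argsort(ig)[n_del:]]
    return words

def build_vocabulary(keypoints, saliency, images, n_init=4096, target=256, seed=0):
    """Start from descriptors of randomly picked keypoints and run the selection."""
    rng = np.random.default_rng(seed)
    init = keypoints[rng.choice(len(keypoints), size=n_init, replace=False)]
    return select_words(init, keypoints, saliency, images, target=target)
```

The full distance matrix is computed for clarity only; a real implementation would restrict re-assignment to orphaned keypoints as the paper does.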

3.3 Stabilization process

As described above, the proposed algorithm is based on a random set of visual words. When using a random approach, it is natural to assume that several runs of the algorithm would create different visual vocabularies and therefore yield different experimental results. However, after running exhaustive experiments, the results presented in Section 4 show that the algorithm is more stable than we expected. Nevertheless, in order to avoid a hypothetical lack of stability that may appear for some descriptors, we propose a stability extension to our approach, inspired by strong clustering methods.

For this purpose, we generate β visual vocabularies and run the iterative process for each of them (see Algorithm 1). We then combine the obtained words together and run the iterative process again, until the desired number of visual words is reached. Combining a few sets of visual words improves the overall information gain of the vocabulary. However, since the obtained results are already almost stable, the experiments (see Figure 6) show that β = 3 is a good parameter value, and choosing β > 3 gives similar results for a longer construction time.
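A sketch of this stabilization step, reusing the hypothetical select_words and build_vocabulary helpers from the sketch above:

```python
import numpy as np

def stabilized_vocabulary(keypoints, saliency, images, beta=3, target=256):
    """Build beta vocabularies independently, merge them and re-run the selection."""
    vocabularies = [build_vocabulary(keypoints, saliency, images,
                                     target=target, seed=run)
                    for run in range(beta)]
    merged = np.vstack(vocabularies)          # beta * target candidate words
    return select_words(merged, keypoints, saliency, images, target=target)
```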

4. EXPERIMENTS

4.1 Experimental setup

We considered three datasets in our experiments:

- The University of Kentucky Benchmark introduced by Nistér and Stewénius [20]. In the remainder, we refer to this dataset as "UKB" to simplify the reading. UKB is really interesting for image retrieval and presents three main advantages. First, it is a large benchmark composed of 10,200 images grouped in sets of 4 images showing the same object; Figure 1 is a good illustration of the diversity within a set of 4 images (changes of point of view, illumination, rotation, etc.). Second, it is easily accessible and a lot of results are available for an effective comparison; in our case, we compare our results to those obtained by Jégou et al. [10] and Nistér and Stewénius [20]. Third, the evaluation score on UKB is simple: it counts the average number of relevant images (including the query itself) that are ranked in the first four nearest neighbours when searching the 10,200 images.

Figure 1: An example set of 4 images showing the same object in UKB.

- The PASCAL Visual Object Classes challenge 2012 [4], called PASCAL VOC2012. This benchmark is composed of 17,215 images, among which only 11,530 are classified. Note that while we only considered the 11,530 classified images when testing our approach on PASCAL VOC2012, we used the full dataset to construct the vocabulary. PASCAL VOC2012 images represent realistic scenes and are categorised in 20 object classes, e.g. person, bird, airplane, bottle, chair and dining table.

- MIRFLICKR-25000 [8]; this image collection is composed of 25,000 images supplied by a total of 9,862 Flickr users. They were annotated with tags (mainly in English), e.g. sky, sunset, graffiti/streetart, people and food. We used this image collection only to build a vocabulary, in order to have an independent dataset for visual word selection.

Note also that for all our experiments, we chose to use the Harris-Laplace point detector.
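For reference, the UKB evaluation score described above can be computed in a few lines (a sketch assuming the signatures are ordered so that images 4i..4i+3 show the same object and that a distance function is supplied; nothing here comes from the authors' code):

```python
import numpy as np

def ukb_score(signatures, distance):
    """Average number of relevant images (query included) among the top-4 results.

    signatures: (n, k) image signatures, ordered so that images 4i..4i+3 show
    the same object; distance(a, B) returns the distances from a to every row of B.
    """
    n = len(signatures)
    score = 0.0
    for q in range(n):
        top4 = np.argsort(distance(signatures[q], signatures))[:4]
        score += np.sum(top4 // 4 == q // 4)     # same group of 4 => relevant
    return score / n                             # maximum possible score is 4
```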

4.2 Observations

Our first observation concerned the choice of a distance measure for our low-level features, which is not in the scope of this work but is still needed to guarantee good results. Studies exist in the state of the art, e.g. [7]. We ran a few experiments with several distance measures, including L2, L1, Bhattacharyya and χ2 (Chi-Square), and selected the L2 distance between keypoints and the χ2 distance between image signatures. This selection gave the best results for the k-Means version of the bag of visual words; all our experiments use the same framework.
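The χ2 distance between (non-negative) image signatures can be written as follows; the paper does not give its exact formula, so this is one common definition and therefore an assumption on our part:

```python
import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    """Chi-square distance between two histogram signatures."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))
```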

Figure 3: Recursive selection scores for 2048 random initial words

Figure 2: k-Means based Bag of words vs Random selection of words

The second observation is presented in Figure 2. It shows the UKB scores with respect to the number of visual words for selected descriptors, comparing "Bag-of-words" approaches using k-Means against a simple random selection of words. The overall tendency is that the k-Means-based BoVW approach outperforms the simple random selection of words for small vocabularies. However, when the number of randomly selected words is high enough, the figure shows a stable score value for random selection that is equivalent to the k-Means scores. Thus, by simply taking a high number of random visual words, the scores are equivalent. For a thorough comparison with k-Means-based BoVW, we also need to take into account the retrieval process time, which is computationally linked to the number of words used to create the image signature. Hence, a recursive selection approach has been deployed to reduce the number of words and therefore increase efficiency without losing effectiveness.

4.3 Iterative selection approach results

In this section, we guide the reader through the results and discussion steps that lead to the construction of our final approach. To make this section clearer, we present the UKB score results only with the Colour Moment Invariants descriptor, as it gives the best UKB score with just a set of random visual words; in the next section, we compare all selected descriptors. As mentioned earlier, the random selection of visual words gives results as good as using the k-Means clustering algorithm. However, to obtain these results, more words are needed. Therefore, we apply the selection of words presented in Algorithm 1. Figure 3 presents the UKB scores with respect to a sub-selection of the 2048 initial visual words. These exhaustive results, made from different runs, highlight important facts. First, the results are very stable. Indeed, starting from a random vocabulary, we see that for a given number of words the score lies within a quite narrow window; for example, around a sub-selection of 150 words, the score value ranges approximately from 3.05 to 3.15. The second important fact is the high score values of these sub-selections. In the previous results, neither the standard BoVW nor the random selection of vocabulary gave a score over 3; in Figure 3, the visual vocabularies containing between 90 and 1000 words (1300 if the figure is extended) give a score higher than 3.

Figure 4: Recursive selection mean scores for 1024 to 65536 random initial words

Figure 4 presents the mean UKB score of 10 runs with respect to the number of visual words, using different initial random sets from 1024 to 65536 words. There is a clear evolution of the results: the initial number of visual words affects the overall results. Starting with 1024 words is not enough, but going higher than 4096 words has no further clear effect on the results. For the next experiments, we select 4096 as the maximum number of initial random visual words.

4.4 Mixing extension

Figure 5 presents the mean UKB score of our approach with its stabilization process, with respect to the number of visual words, for different initial random sets from 512 to 4096 words. Our hypothesis to group 3 × 3 the vocabularies with high IG was good. Indeed, the maximum mean value for 3 × 3 − 4096 is almost 3.20, which is far higher than the previous results (3.12 approx.), with only a set of 250 to 300 visual words. We also observe that grouping 3 × 3 vocabulary sets gives a mean result of 3 for sets of 50 words, which is far better than the original BoVW approach in terms of efficiency and effectiveness.

Figure 5: Recursive selection mean scores for 512 to 4096 random initial words, grouped 3 × 3

Similar experiments have been made grouping vocabularies 2 × 2 to 10 × 10; the results are presented in Figure 6. This figure shows the interest of mixing at least 3 × 3 vocabularies together, as the UKB score does not improve significantly for higher grouping values. Other grouping techniques have been investigated to improve the score, like mixing again the already grouped and re-selected vocabularies, or mixing the most informative vocabulary sets together. However, the results were not promising.

Figure 6: Mean scores for 1024 to 4096 random initial words, grouped 1 × 1 to 10 × 10

4.5 Global comparison

In this section, we present our experimental results using other visual descriptors than CMI, then we compare our approach (noted IteRaSel, for Iterative Random Selection) to state-of-the-art approaches, and finally we give the results on the PASCAL VOC2012 dataset. Table 1 gives the mean UKB scores of 9 vocabulary constructions of 256 words, using 4096 initial random words with a 3 × 3 vocabulary grouping approach. The results show that for most of the selected features there is a significant improvement, approximately +7%, in the UKB score compared to the BoVW approach. This demonstrates that, without adding any prior knowledge regarding which feature to use and how to use it, our iterative random selection of visual words performs better.

Descriptor   BoVW   IteRaSel   %
CM           2.62   2.81       +7%
SURF         2.69   2.75       +2.75%
CMI          2.96   3.18       +7.4%
SIFT         2.16   2.30       +6.5%
OppSIFT      2.30   2.46       +6.9%

Table 1: UKB scores

Since our approach is not based on supervised classification, it is difficult to compare it to other state-of-the-art methods. However, Table 2 presents the UKB scores given by [10], to which we add our best score obtained with the CMI visual descriptor.

Method          Best score   vs IteRaSel %
BoVW            2.95         -7%
Fisher Vector   3.07         -5%
VLAD            3.17         -1.9%
IteRaSel        3.23         0%

Table 2: UKB scores: IteRaSel vs other methods

Table 3 gives the mean PASCAL VOC2012 scores of 3 vocabulary constructions of 256 words, using 4096 initial random visual words without any mixing extension between vocabularies. As the table shows, the results of our approach on PASCAL VOC2012 are globally similar to those of the Bag of visual words approach. Therefore, the proposed easy and fast vocabulary construction for any visual feature may replace the BoVW construction stage without losing retrieval effectiveness. Note that our score is often higher with a different number of visual words. We also suppose that tuning the constructed vocabularies could improve the performance.

Descriptor   BoVW   IteRaSel   %
CM           28.8   27.5       -4.6%
CMI          31.3   31.4       0.3%
SIFT         37.7   36.5       -3.2%
OppSIFT      37.4   36.8       -1.6%
SURF         33.7   35.1       4.1%

Table 3: Pascal VOC2012 scores

4.6 Mixing features

In this section, we propose to mix the retrieval results from different features with a frequency-plus-rank-based order. Figures 7 and 8 show the precision (score) obtained with respect to all possible combinations of features on UKB and PASCAL VOC2012 respectively. The global trend in these two figures is that mixing features improves the overall precision. When mixing "good" and "not so good" features, the precision of the mix tends to remain close to the best one. Mixing all 5 features yields a precision close to the best mix, which means that without any knowledge of the nature of the dataset and of which feature to use, it is always a good option to mix more features. If we analyze these mixed results a bit more closely, we note that the best UKB score (3.275) is obtained with a four-feature combination: CM, CMI, OppSIFT and SURF, whereas on PASCAL VOC2012 we need to mix the following four features: OppSIFT, CMI, SURF and SIFT.

Figure 7: UKB scores with respect to the number of mixed features on the UKB dataset

Figure 8: Precision with respect to the number of mixed features on the Pascal dataset
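The exact frequency-plus-rank-based fusion is not detailed in the paper; the sketch below is one plausible reading (a Borda-style aggregation of the per-feature result lists, where frequency of appearance dominates and summed rank breaks ties; the names and the fusion rule are our assumptions):

```python
from collections import defaultdict

def fuse_rankings(rankings, top_k=4):
    """Merge several per-feature result lists into one ordering.

    rankings: list of lists of image ids, each ordered from best to worst
    for one feature.
    """
    freq = defaultdict(int)       # how many feature lists return the image
    rank_sum = defaultdict(int)   # summed rank positions (lower is better)
    for result_list in rankings:
        for rank, image_id in enumerate(result_list):
            freq[image_id] += 1
            rank_sum[image_id] += rank
    ordered = sorted(freq, key=lambda i: (-freq[i], rank_sum[i]))
    return ordered[:top_k]
```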

5. DISCUSSION

The results presented in this article are part of the experiments we made to confirm that the proposed approach outperforms the "bag of visual words" in several ways. First, the time required to create the visual vocabulary is O(k × w × N), where k is the number of iterations needed to reduce the number of words w, and N is the number of interest keypoints of all images in the dataset. Note that k depends on the threshold that selects the number of words to delete; it has been fixed to 10% in our experiments, which means k < 40 iterations to obtain 128 selected visual words from 4096 random words. Note also that at each iteration, only the keypoints that are no longer assigned because of word deletion are processed, which reduces the computation considerably. A second advantage of our approach is that the selection of the visual words is based on information mixing a saliency map and the tf-idf scheme. This choice has been made to get rid of the notion of "best distance" to use with one low-level feature. Thus, it makes our approach easier to implement, as it is feature independent and no a priori knowledge of the feature is needed. The image signature is constructed using the selected visual words. If several low-level features are used, one signature per feature is created; the retrieval step may then return several lists of results, or combine them with a round-robin algorithm for example. A third advantage of our framework is that the creation of the visual vocabulary is based on a random selection of words on a heterogeneous image dataset: no supervised or semi-supervised learning is needed. Some non-presented experiment results using UKB as the training set show lower precision on all testing sets, even on UKB, showing the importance of using a heterogeneous dataset. As we already mentioned, comparing our approach to others is difficult since the comparison needs to be performed using the same low-level features, with the same set of keypoints. That is why our experiments only compare the "bag of visual words" to our iterative random word selection approach, using the feature extraction tools of [25]. We are fully aware that the retrieval results are not at their best; e.g. no filter has been applied and no selection of keypoints has been made. Our results highlight that, given a standard content-based image retrieval framework, our proposed methodology to construct a visual vocabulary has a mean score above the well-known BoVW approach, with a mean score increase of 7% on the UKB dataset. We also show that mixing several low-level features generally benefits most of the features, with a mixing score close to the best feature score. Thus, combining features is a good way to obtain good results on any dataset without any knowledge of the visual content of the images.

6. CONCLUSIONS AND FUTURE WORKS

In this paper, we have introduced a new iterative random visual word selection framework for content-based image retrieval. Such an approach has the potential to "narrow down" a bit more the semantic gap issue by constructing a visual vocabulary, which is a level above low-level features. The main objective of our proposal is to present an efficient and effective alternative to the well-known "bag of visual words" approach. We propose to randomly select a large number of visual words; then, an iterative process is performed to keep the most informative visual words of the random set. The information gain we use is based on a saliency map as well as the tf-idf scheme. To improve the robustness of our proposal, we enhance our framework with a recursive word mixing process. The experimental results demonstrate the potential benefits of the proposed approach. It is shown that even when starting from a random set of visual words, the results are very stable and higher than those of BoVW. We also highlight the fact that the approach is suited to different low-level features and gives good results on several datasets. The more we work on this framework, the more perspectives we envision. A first perspective would be to try other existing saliency maps, in order to improve the information gain of the constructed vocabulary. Another perspective is an application of the same framework at the "bag of phrases" level, which is mentioned in the related work section. To conclude the perspectives with a general direction, we are interested in investigating how the iterative selection process could be applied to other domains where a space quantization is necessary, which requires defining relevant domain-specific information gain functions.

7. REFERENCES

[1] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. In A. Leonardis, H. Bischof, and A. Pinz, editors, Computer Vision - ECCV 2006, volume 3951 of Lecture Notes in Computer Science, pages 404-417. Springer Berlin Heidelberg, 2006.
[2] G. Csurka, C. Bray, C. Dance, and L. Fan. Visual categorization with bags of keypoints. Workshop on Statistical Learning in Computer Vision, ECCV, pages 1-22, 2004.
[3] I. Elsayad, J. Martinet, T. Urruty, and C. Djeraba. Toward a higher-level visual representation for content-based image retrieval. Multimedia Tools and Applications, 60(2):455-482, 2012.
[4] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
[5] J. Farquhar, S. Szedmak, H. Meng, and J. Shawe-Taylor. Improving "bag-of-keypoints" image categorisation: Generative models and pdf-kernels. PASCAL Eprint Series, 2005.
[6] K. Gao, S. Lin, Y. Zhang, S. Tang, and H. Ren. Attention model based SIFT keypoints filtration for image retrieval. In R. Y. Lee, editor, ACIS-ICIS, pages 191-196. IEEE Computer Society, 2008.
[7] M. Halvey, P. Punitha, D. Hannah, R. Villa, F. Hopfgartner, A. Goyal, and J. M. Jose. Diversity, assortment, dissimilarity, variety: A study of diversity measures using low level features for video retrieval. In European Conference on Information Retrieval, pages 126-137, 2009.
[8] M. J. Huiskes and M. S. Lew. The MIR Flickr retrieval evaluation. In MIR '08: Proceedings of the 2008 ACM International Conference on Multimedia Information Retrieval, New York, NY, USA, 2008. ACM.
[9] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254-1259, Nov. 1998.
[10] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In 23rd IEEE Conference on Computer Vision & Pattern Recognition (CVPR '10), pages 3304-3311, San Francisco, United States, 2010. IEEE Computer Society.
[11] Y. Ke and R. Sukthankar. PCA-SIFT: a more distinctive representation for local image descriptors. In Computer Vision and Pattern Recognition, 2004 (CVPR 2004), Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II-506-II-513, 2004.
[12] Y. Lei, X. Gui, and Z. Shi. Feature description and image retrieval based on visual attention model. Journal of Multimedia, 6(1):56-65, 2011.
[13] D. G. Lowe. Object recognition from local scale-invariant features. International Conference on Computer Vision, 2:1150-1157, 1999.
[14] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91-110, 2004.
[15] J. Martinet. Human-centered region selection and weighting for image retrieval. In S. Battiato and J. Braz, editors, VISAPP (1), pages 729-734. SciTePress, 2013.
[16] J. M. Martínez. MPEG-7 overview. www.chiariglione.org/mpeg/standards/mpeg-7, 2003.
[17] O. Le Meur, P. Le Callet, D. Barba, and D. Thoreau. A coherent computational approach to model the bottom-up visual attention. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 28(5):802-817, 2006.
[18] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis & Machine Intelligence, 27(10):1615-1630, 2005.
[19] F. Mindru, T. Tuytelaars, L. V. Gool, and T. Moons. Moment invariants for recognition under changing viewpoint and illumination. Computer Vision and Image Understanding, 94(1-3):3-27, 2004.
[20] D. Nistér and H. Stewénius. Scalable recognition with a vocabulary tree. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 2161-2168, June 2006.
[21] L. Parsons, E. Haque, and H. Liu. Subspace clustering for high dimensional data: a review. ACM SIGKDD Explorations Newsletter, 6:90-105, 2004.
[22] F. Perronnin, C. Dance, G. Csurka, and M. Bressan. Adapted vocabularies for generic visual categorization. In ECCV, pages 464-475, 2006.
[23] F. Perronnin and C. R. Dance. Fisher kernels on visual vocabularies for image categorization. In 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2007), 18-23 June 2007, Minneapolis, Minnesota, USA. IEEE Computer Society, 2007.
[24] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In Proceedings of the International Conference on Computer Vision, pages 1470-1477, Oct. 2003.
[25] K. E. A. van de Sande, T. Gevers, and C. G. M. Snoek. Evaluating color descriptors for object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1582-1596, 2010.
[26] C. J. van Rijsbergen. Information Retrieval. Butterworth, 1979.
[27] Z. Zdziarski and R. Dahyot. Feature selection using visual saliency for content-based image retrieval. In Signals and Systems Conference (ISSC 2012), IET Irish, pages 1-6, 2012.
[28] L. Zhang, M. H. Tong, T. K. Marks, H. Shan, and G. W. Cottrell. SUN: A Bayesian framework for saliency using natural statistics. Journal of Vision, 8(7):32.1-20, 2008.
[29] S. Zhang, Q. Tian, G. Hua, Q. Huang, and W. Gao. Generating descriptive visual words and visual phrases for large-scale image applications. IEEE Transactions on Image Processing, 20(9):2664-2677, 2011.
