Multimedia Tools and Applications, DOI 10.1007/s11042-012-1340-5
Object-based visual query suggestion

Amel Hamzaoui · Pierre Letessier · Alexis Joly · Olivier Buisson · Nozha Boujemaa
© Springer Science+Business Media New York 2013
Abstract State-of-the-art visual search systems can efficiently retrieve small rigid objects in very large datasets. They are usually based on the query-by-window paradigm: a user selects any image region containing an object of interest and the system returns a ranked list of images that are likely to contain other instances of the query object. Users' perception of these tools is however affected by the fact that many submitted queries actually return nothing or only junk results (complex non-rigid objects, higher-level visual concepts, etc.). In this paper, we address the problem of suggesting only the object queries that actually have relevant matches in the dataset. This requires first discovering accurate object clusters in the dataset (as an offline process), and then selecting the most relevant objects according to the user's intent (as an online process). We therefore introduce a new object-instance clustering framework based on a major contribution: a bipartite shared-neighbours clustering algorithm that is used to gather object seeds discovered by adaptive and weighted matching-based sampling. Shared-nearest-neighbours methods had not been studied beforehand in the case of bipartite graphs and had never been used in the context of object discovery. Experiments show that this new method outperforms state-of-the-art object mining and retrieval results on the Oxford Buildings dataset. We finally describe two object-based visual query suggestion scenarios using the proposed framework and show examples of suggested object queries.

Keywords Visual object · Object mining · Query suggestion · Shared neighbours · Clustering

A. Hamzaoui (B) · A. Joly
INRIA Rocquencourt, team-project IMEDIA, BP 105, 78153 Le Chesnay Cedex, France
e-mail: [email protected]
A. Joly e-mail: [email protected]

P. Letessier · O. Buisson
Institut National de l'Audiovisuel (INA), 4 avenue de l'Europe, 94366 Bry-sur-Marne Cedex, France
P. Letessier e-mail: [email protected]
O. Buisson e-mail: [email protected]

N. Boujemaa
Inria Saclay, 4 rue Jacques Monod, 91893 Orsay Cedex, France
e-mail: [email protected]
1 Introduction

Large-scale object retrieval systems have demonstrated impressive performance in the last few years. The underlying methods, based on local visual features and efficient indexing models, can accurately retrieve small rigid objects such as logos, buildings or manufactured objects, under varying view pose and illumination conditions [6, 11, 13, 16, 20–22]. Online object retrieval is therefore now achievable on up to 1M images with a state-of-the-art computer [11]. From the usage point of view, these methods are usually combined with a query-by-window search paradigm: the user can freely select a region of interest in any image, and the system returns a ranked list of images that are the most likely to contain an instance of the targeted object of interest [22].

This paradigm has, however, several limitations related to user perception: (i) When no (or very few) other instances of the query object exist in the dataset, the system mostly returns false positives, making the user uncomfortable with the results. Indeed, the user does not know whether there are actually no other instances of the query object or whether the system did not work correctly. (ii) When the user selects a deformable or complex object that the system is actually not able to retrieve, the system mostly returns false positives as well. As the user can freely select any object, this occurs very frequently, leaving the user with a bad impression of the effectiveness of the tool. The second remark is even more critical if the user believes that the system can retrieve semantically similar objects (e.g. object categories or visual concepts such as cats or cars). We do not argue here that such queries will never be solved effectively in the future. We just emphasize that bridging the gap between a user's understanding of the system and the actual capabilities of the underlying tools is essential to make it successful in a real-world search engine.

A first possible solution to address these limitations would be to use some adaptive thresholding method, allowing irrelevant results to be filtered out and possibly returning no results if none are found. The a contrario method of [13], for instance, allows the actual false-alarm rate of rigid object instance retrieval to be controlled very accurately. But still, as the user can select any region of interest, the system might return no results in many cases and leave the user disappointed.

In this paper, we propose to solve these user perception issues with a new visual query suggestion paradigm. Rather than letting the user select any region of interest, the system suggests only visual query regions that actually have relevant matches in the dataset. By mining object instances offline in the dataset, it is indeed
possible to suggest to the user only query objects having at least a predetermined number of instances in the collection. Figure 1 illustrates such suggested objects in several images. When a user clicks on a highlighted region, the system returns only the images containing other object instances of the same discovered cluster. From a user perception point of view, the proposed paradigm is very different from the window-query paradigm. Indeed, since all suggested objects mostly return correct results, the user might rather perceive them as visual links (or hyper-visual links, by analogy to hypertext links). To the best of our knowledge, this is the first work to detail a method for object-based visual query suggestion. Unlike existing approaches, the links produced by our method are not similarity links between images, but rather links between automatically localized image regions containing instances of the same rigid object. These object-based visual links can be used in many different retrieval paradigms. In this paper, we focus on two visual query suggestion scenarios showing the potential of the proposed method (Section 3.3):

Mouse-over visual objects suggestion When the user hovers the mouse cursor over a particular image, the system suggests object queries by highlighting the object instances present in the image. The suggested objects do not depend on a preliminary textual query but are guaranteed to match some other instances in the collection (when the user clicks on one of them).

Text-aware visual objects suggestion After a user submits a text query, the most frequent visual items discovered in the result list are suggested as new object-based
Fig. 1 Discovered visual objects are displayed as links; the user clicks on one of them to focus the retrieval on that specific object
visual queries (typically displayed as clickable thumbnails on top of the result GUI). Images containing other instances of the suggested object are returned if the user clicks one.

Simple as it seems, moving from the free window-query paradigm to the object-suggestion paradigm is not trivial. Indeed, it first requires discovering accurate object clusters in the dataset (typically as an offline process), without any supervision and without knowledge of the location and the extent of the objects. Therefore, this paper introduces a new object-instance clustering framework based on two main steps:

Object seeds discovery with adaptive weighted sampling This step, proposed in [17], discovers small rigid repeated patterns in the collection by randomly querying small image patches with an efficient geometric matching (Section 3.1).

Bipartite shared-neighbours clustering This proposed algorithm builds full object models by clustering the previously discovered object seeds (Section 3.2). The object clusters are then used both for object-based visual query suggestion and for object retrieval. Note that shared-neighbours clustering methods have never been studied before in the case of bipartite graphs, nor been applied to object discovery.

The next section starts by discussing state-of-the-art works related to our method.
2 Related works

Visual Query Suggestion was originally proposed in [33] as an extension of the Textual Query Suggestion methods that are now used in most existing search engines. The claim of the authors was that text-based predictive suggestion methods might sometimes not accurately express the intent of the users. By adding a set of representative pictures to the textual suggestion, the user can express his specific search intent more clearly. Their method was mainly based on global visual similarities, using a joint text-image re-ranking for the retrieval. Our method differs in two main points: (i) we suggest purely visual queries (although the suggested queries can be computed according to the results of a textual query); (ii) the suggested visual queries represent objects in images, and not global visual concepts associated with each image.

Beyond the large-scale object retrieval methods discussed in the introduction [6, 11, 13, 16, 20–22], our work is more related to object-based image clustering and unsupervised object mining techniques [4, 23, 24, 29]. Object-based image clustering attempts to cluster images that contain instances of the same object. Our objective differs in that we do not attempt to build image clusters, but rather clusters of image regions containing instances of the same object. An object can be, for example, a building or a part of a building, or a logo that appears in images. These images can be different, but they contain similar or different views of the same object. The problem to be solved is more challenging, since the image regions to be clustered are not predefined entities (as images are): image regions need to be segmented and clustered at the same time. This is basically what object mining methods aim at. Both objectives however share some common properties and issues, so that it is difficult to classify the related methods into two distinct groups.
Many object discovery methods are based on latent topic models such as probabilistic Latent Semantic Analysis (pLSA) [2, 9], Latent Dirichlet Allocation (LDA) [2, 23, 27, 30] or hierarchical LDA [26]. The idea of these methods is to use common bag-of-visual-words (BoW) models and to analyse the resulting term-document occurrence matrix with classical models used in text-based information retrieval. The method of Philbin et al. [23] makes a step forward by augmenting the topics of the LDA model with the spatial positions of the visual words. Latent topic models are a favourite choice for object category recognition and retrieval, but their generalization ability is rather a disadvantage when searching for particular object instances: the underlying models fail to discover accurate clusters of object instances.

Most other methods rely on graph-based clustering. They usually include a preliminary step that discovers object seeds (spatially stable image regions in the collection). The main objective of this step is to build a matching graph that is processed afterwards to cluster images or discover object instances. Nodes of the matching graph typically represent images, whereas edges correspond to common matching regions between the images. Efficiently constructing the matching graph has been a first research topic. Chum et al. [5, 7] proposed to automatically select Regions Of Interest (ROI) with a very low computational cost. Their method combines BoW models with the min-hash hashing scheme [3]. Min-hash is an algorithm commonly used in text retrieval for finding near-duplicates [3]; it works by approximating the intersection between two sets of words. Applied to visual words, it efficiently discovers very discriminant candidate visual sketches that are likely to be parts of more reliable objects. But the recall of this method for small objects is far from sufficient, as pointed out in further works of the authors [5]. To reduce this drawback, they proposed a new min-hash-based strategy called Geometric Min-Hash (GMH) [5]. In that method, the first min-hash value of the sketches is still generated from the whole image, but the second and following hash values are randomly sampled in the spatial neighbourhood of the first selected visual word. This version is able to discover more relevant local sketches and is therefore a very efficient way to discover candidate query regions that are likely to contain object instances. But the first global hashing step makes it still not robust to strong occlusions, and the performance might therefore degrade in highly cluttered contexts.

Once the matching graph has been constructed, graph-based object mining methods differ in the way they analyze or post-process the graph. One of the simplest operations for splitting a graph is to find connected components, as proposed in [1, 4, 24]. But as pointed out by [24], the main problem is that many disjoint objects are grouped in the same component due to under-segmentation. Grauman et al. [10] proposed an alternative method based on spectral clustering. They pay special attention to separating the objects from the background or from other objects present in a single image. A great disadvantage of spectral clustering is the need to specify the number of clusters, whereas in our case it is impossible to know a priori how many objects could be found. More recently, Philbin et al. [24] also used a spectral clustering approach, but in the context of spatially verified objects, which is more related to our work.
They automatically estimate the optimal number of clusters by performing multiple clusterings, which leads to a considerable cost overhead. Furthermore, the produced clusters suffer from over-segmentation and therefore require additional heuristics to merge them into consistent clusters.
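To make the min-hash machinery discussed above concrete, the following sketch (ours, not the implementation of [3] or [5]) estimates the overlap between two bags of visual-word IDs with k random hash functions; all names and parameters are illustrative.

```python
import random

def minhash_signature(visual_words, hash_params, prime=(1 << 61) - 1):
    """One min-hash value per hash function h(x) = (a*x + b) mod prime:
    the signature keeps the minimum hash over the set of visual-word IDs."""
    return [min((a * w + b) % prime for w in visual_words) for a, b in hash_params]

def estimated_jaccard(sig1, sig2):
    """Fraction of agreeing min-hash values: an unbiased estimate of the
    Jaccard similarity |A intersect B| / |A union B| of the two word sets."""
    return sum(h1 == h2 for h1, h2 in zip(sig1, sig2)) / len(sig1)

random.seed(0)
k = 128  # more hash functions -> lower variance of the estimate
hash_params = [(random.randrange(1, 1 << 31), random.randrange(1 << 31))
               for _ in range(k)]

image_a = {3, 17, 42, 256, 1001}   # toy bags of visual-word IDs
image_b = {3, 17, 42, 999, 2048}
print(estimated_jaccard(minhash_signature(image_a, hash_params),
                        minhash_signature(image_b, hash_params)))
# the true Jaccard similarity here is 3/7; the estimate fluctuates around it
```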
3 Proposed method

Our framework relies on two main steps: (i) building a matching graph by mining spatially consistent object seeds, and (ii) post-processing the graph to build clusters of object instances. Contrary to the methods discussed above, we formulate the problem as a bipartite graph clustering issue: images are considered as a first set of nodes, while object seeds form a second, disjoint one. The next section details our proposed method to discover object seeds and build the matching graph. Section 3.2 then introduces our bipartite clustering algorithm, which groups object seeds belonging to instances of the same object. Section 3.3 finally describes how the resulting object clusters are used within the two visual query suggestion scenarios discussed in the introduction.

3.1 Mining visual object seeds

As stated before, state-of-the-art large-scale object retrieval systems usually combine efficient indexing models with a spatial verification re-ranking stage to improve query performance [13, 20]. In previous work of the authors [17], it was suggested to use such an accurate two-stage matching strategy for building the input matching graph of our clustering algorithm (described in Section 3.2). The problem then rather becomes a sampling issue: how to effectively and efficiently select relevant query regions while minimizing the number of tentative probes. We therefore introduced an adaptive weighted sampling strategy.

Sampling is a statistical paradigm concerned with the selection of a subset of individual observations within a population, intended to yield some knowledge about the population without surveying it entirely. If all items have the same probability of being selected, the problem is known as uniform random sampling. In weighted sampling methods [19], the items might be weighted individually, and the probability of each item being selected is determined by its relative weight. In conventional sampling designs, either uniform or weighted, the selection of a sampling unit does not depend on the observations made during previous surveys. Adaptive sampling [28], on the other hand, is an alternative strategy aiming at selecting more relevant sampling regions based on the results observed during previous surveys.

Our object seed discovery method is composed of three main stages, processed at each iteration: adaptive sampling of a query image region, search for the selected local query region, and decision on whether this query region might be considered as an object seed in the final output matching graph. The full algorithm repeats these three steps T times, until a fixed number of seeds has been found.

More formally, let the input dataset consist of N images $I_i$, $i \in 1, \ldots, N$. Each image $I_i$ is represented by a set of $N_i$ local visual features $F_{i,j}$ (typically SIFT, like those used in [18]), localized by their positions $P_{i,j}$. $N_F = \sum_{i=1}^{N} N_i$ is the total number of features $F_{i,j}$. Each local feature $F_{i,j}$ is associated with a fixed candidate query region $R_{i,j}$, defined as the bounding box centered around $P_{i,j}$, with height $H_{i,j}$ and width $W_{i,j}$. In this paper, $H_{i,j}$ and $W_{i,j}$ (expressed as ratios of the image height and width) are set up according to:

$$H_{i,j} = \sqrt{\gamma} \, H_i \qquad W_{i,j} = \sqrt{\gamma} \, W_i$$
where γ is a parameter of the method corresponding to the percentage of the image area covered by a candidate query region ($H_i$ and $W_i$ are respectively the height and width of image $I_i$). The following three steps are then processed at the t-th iteration:

1. Local Region Sampling: This step selects a candidate query region $R_q^t$, centered around a sampled feature $F_q^t$. $F_q^t$ is randomly drawn from a probability mass function $p^t(i, j)$ over the set of all local features $F_{i,j}$ (using an inverse transformation method [8]). The method starts with a uniform probability mass function $p^0(i, j)$ over the whole set of candidate query regions $R_{i,j}$. The selected query region $R_q^0$ is processed by steps 2 and 3 (see below), providing a set of matching regions $R_m^0$, $m \in 1, \ldots, M_0$. Further probability mass functions $p^t(i, j)$ are then updated in a recursive manner:

$$p^t = f\left(p^{t-1}, R_q^{t-1}, \{R_m^{t-1}\}\right)$$

As in conventional weighted random sampling methods [19], the probability mass functions $p^t$ are in practice computed by normalizing a weighting function $w^t$:

$$p^t(i, j) = \frac{w^t(i, j)}{\sum_{i,j} w^t(i, j)} \quad (1)$$

The recursive updates are thus computed on the weights:

$$w^t = g\left(w^{t-1}, R_q^{t-1}, \{R_m^{t-1}\}\right)$$

Our proposed updating function g is defined as follows:

$$w^t(i, j) = \begin{cases} 0 & \text{if } F_{i,j} = F_q^{t-1} \\ \alpha_1 \, w^{t-1}(i, j) & \text{if } F_{i,j} \in R_q^{t-1} \\ \alpha_2 \, w^{t-1}(i, j) & \text{if } F_{i,j} \in R_m^{t-1} \\ w^{t-1}(i, j) & \text{otherwise} \end{cases} \quad (2)$$
The first condition means that the weight of an already visited query region center is set to zero, so that the probability of re-issuing it as a new query is null (i.e. to guarantee sampling without replacement). The second condition decreases the weights of the features belonging to the previous query region $R_q^{t-1}$, so that their probability of being re-issued as new query region centers decreases; this avoids selecting new query regions that overlap too much with previous ones. The third condition decreases the weights of the features belonging to already matched regions, so that their probability of being re-issued as new query region centers is also decreased. The fourth condition keeps the weights of unmatched features unchanged. The first three conditions allow us to iteratively focus the selected query regions on objects that were never found in previous steps. In practice, $\alpha_2$ is chosen to be greater than $\alpha_1$: decreasing the weights of matched regions too much might indeed degrade the overall recall. This can be related to the success of query expansion methods [13], which boost object retrieval recall by re-issuing different instances of the same object as new queries. In [13], Joly and Buisson proposed an a contrario query expansion method to improve retrieval quality; that work, however, focuses only on retrieval performance, not on discovery. In our experiments we used $\alpha_1 = 0.1$ and $\alpha_2 = 0.5$.
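As a concrete illustration of (1) and (2), here is a minimal sketch of the weighted sampling step and of the weight update, with the per-feature weights stored in a flat array; the geometric membership tests for $R_q^{t-1}$ and the matched regions are abstracted into boolean masks, and all names are ours.

```python
import numpy as np

def sample_query_center(weights, rng):
    """Draw one feature index from the normalized weights (Eq. 1), i.e.
    inverse-transform sampling over the probability mass function p^t."""
    p = weights / weights.sum()
    return rng.choice(len(weights), p=p)

def update_weights(weights, q_idx, in_query_region, in_matched_regions,
                   alpha1=0.1, alpha2=0.5):
    """Apply the four exclusive cases of Eq. (2), highest priority first."""
    w = weights.copy()
    case2 = in_query_region
    case3 = in_matched_regions & ~case2     # only if not already in case 2
    w[case2] *= alpha1                      # features of the previous query region
    w[case3] *= alpha2                      # features of already matched regions
    w[q_idx] = 0.0                          # the sampled center is never re-issued
    return w

rng = np.random.default_rng(0)
n_features = 10
weights = np.ones(n_features)               # uniform p^0 over all candidate regions
q = sample_query_center(weights, rng)
# toy boolean masks standing in for the geometric membership tests
in_query = np.zeros(n_features, bool); in_query[max(0, q - 1):q + 2] = True
in_matched = np.zeros(n_features, bool); in_matched[:2] = True
weights = update_weights(weights, q, in_query, in_matched)
```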
2. Local Region Search: The candidate query region $R_q^t$ (centered around $F_q^t$) is processed by a large-scale matching procedure, as described in [13]. It starts by searching all query features in the database, thanks to an efficient approximate similarity search technique [12]. The retrieved images are kept for the next step if they contain a sufficient number of matching features. Finally, we compute a geometric consistency score by estimating an affine transformation model with 5 degrees of freedom between the query and each of the retrieved images. This estimation is done by a RANSAC algorithm. It returns a set of geometrically verified matching regions in the dataset; we refer to any of these matched regions as $R_m^t$, $m \in 1, \ldots, M_t$.

3. Decision: Matching scores are then normalized and thresholded according to the a contrario procedure described in [13]. This technique allows us to accurately control the percentage of false alarms, which is crucial, since object seeds have to be very consistent and robust. If the final result set contains more than a user-defined number of images, it means that we found a recurrent object in the database. This threshold can be adapted, depending on the desired minimal frequency of retrieved objects. The tentative query region $R_q^t$ and the matching regions $R_m^t$ are then kept to form a visual object seed.

Finally, after T tentative probes, the algorithm outputs a set S of $|S| \leq T$ seeds $S_j$. Each seed corresponds to a spatially verified frequent visual pattern and is associated with a query image region $R_q^j$ and a set of $M_j$ matching regions $R_m^j$, $m \in 1, \ldots, M_j$. The larger the number of tentative probes, the more likely a frequent object is to be considered as a seed.

Note that the raw matching graph construction step could actually be based on Chum's GMH approach [5]. However, the complexity of GMH is also linear in the number of:
– images, for the image selection;
– features in the selected image;
– features, for the selection in the neighbourhood.
Our sampling method also has a complexity that is linear in the number of features in the considered database. Furthermore, our sampling approach generates fewer and more consistent matches than GMH collisions (GMH relies only on neighbourhood information, whereas our method uses the full geometry). Since the post-processing of the graph depends mainly on the number of nodes, it is important to keep this number low in the first step.

3.2 Object instances clustering

Although the discovered seeds correspond to consistent repeated patterns in the collection, they cannot yet be considered as full objects: (i) by construction, a seed usually covers only a subpart of an object instance, with a loose localization; (ii) furthermore, due to the imperfect recall of the retrieval, a discovered seed matches only a subset of all instances in the dataset; (iii) finally, the more frequent an object is in the collection, the more redundant the discovered seeds are. Building accurate and complete object models therefore requires grouping all seeds belonging to the same object.
This cannot be done according to the visual content of the seeds, since two seeds with distinct visual contents might still be two subparts of the same object. A more intuitive alternative is to group seeds that match correlated contents in the dataset, which can be formulated as a bipartite clustering problem. Figure 2 illustrates our proposed method to group seeds representing the same object.
Fig. 2 Illustration of the proposed method to suggest object-based visual queries in image $I_4$. $S_2$, $S_5$ and $S_9$ belong to the cluster representing the same object, obtained by the bipartite clustering. $S_3$, $S_6$ and $S_8$ are seeds belonging to the second object in image $I_4$
Let us denote by $G = (X, E) = (I, S; E)$ the bipartite matching graph resulting from the object seed discovery, with $I = \{I_i\}_{i \in [1,N]}$ the vertex set representing the images of the collection, $S = \{S_j\}_{j \in [1,|S|]}$ the vertex set of the discovered seeds, $X = I \cup S$ and $I \cap S = \emptyset$. Each directed edge $e_{i,j} \in E$ has a starting point in S, an endpoint in I and a weight $w_{i,j}$ corresponding to the matching score returned by the a contrario normalization method ($w_{i,j} = 0$ means that no edge connects seed $S_j$ to image $I_i$).

The advantage of this bipartite representation is that it allows formulating our seed clustering objective as a co-clustering problem (or dual subset clustering [31]). We indeed aim to find object clusters $O_n = (S_n, I_n)$, with $S_n \subset S$ the subset of seeds modeling a given object and $I_n \subset I$ the subset of images containing instances of the object. An ideal object cluster is one whose seeds all match the same images. It is important to notice the advantage over previous object mining methods using a single image-oriented matching graph [1, 4, 10, 24]: a given image can be accurately assigned to several object clusters (when it contains instances of distinct objects). Furthermore, each object cluster is composed of a unique set of seeds associated with localized matching regions. As discussed in the next subsection, this is useful for display purposes within our visual query suggestion scenarios.

Solving the bipartite clustering problem is not a trivial task. Some previous works proposed spectral techniques in the context of text document clustering [31, 32]. These methods are useful to partition bipartite graphs into a prefixed number of balanced clusters, but they are not appropriate for our problem: the number of objects to be discovered, as well as the number of seeds to be grouped within each object, can be highly variable. In this paper, we introduce a new bipartite clustering algorithm inspired by Shared Nearest Neighbours (SNN) clustering methods [14, 15, 25]. The principle of SNN algorithms is to group items not by virtue of their pairwise similarity, but by the degree to which their neighbourhoods resemble one another, as illustrated in Fig. 3. They are well known to overcome several shortcomings of classical clustering methods, notably high dimensionality and the limitations of similarity metrics. In [15], a shared-neighbours algorithm was proposed, which [14] extended to the multi-source case with an original oracle selection step. To the best of our knowledge, SNN methods have not been studied so far in the case of bipartite nearest-neighbour graphs, as is done in this paper.
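For illustration, the bipartite matching graph can be held as a sparse seed-to-image map; a minimal sketch (our notation, with w[j][i] the weight of the edge from seed S_j to image I_i):

```python
from collections import defaultdict

# Sparse dict-of-dicts: w[j][i] > 0 iff seed S_j matched image I_i, the value
# being the a contrario matching score; most seeds match only a few images.
w = defaultdict(dict)
w[0][3] = 12.5    # seed S_0 matched image I_3 with score 12.5
w[0][7] = 9.1
w[1][3] = 11.0    # S_0 and S_1 share image I_3: a hint they belong together

def matched_images(j):
    """B_j: the set of images matched by seed S_j (Eq. 4 below)."""
    return set(w[j])
```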
Fig. 3 The similarity between two items x and y is based on the number of shared nearest neighbours (red items)
The main difference of this new work is that the clustering is performed at the object level, not at the image level. This is a much more challenging task, notably because the input matching graph is a bipartite graph that connects candidate objects to image regions, rather than simply connecting images. The seeds to cluster and the k-nearest-neighbour lists representing the images belong to different data sets: we can represent them by a bipartite graph where the items (seeds) of the first vertex set have their neighbours (images) in the second one. The new bipartite SNN method introduced in this paper is therefore an important contribution over previous SNN methods.

As a primary SNN similarity measure between two seeds, we first reformulate the Relevant Set Correlation measure used by [15] and [14] in our bipartite context. For any two visual object seeds $S_1$ and $S_2$ in S, we define the inter-seed similarity as:

$$R(S_1, S_2) = \frac{|B_1 \cap B_2| - \frac{|B_1| \, |B_2|}{N}}{\sqrt{|B_1| \, |B_2| \left(1 - \frac{|B_1|}{N}\right) \left(1 - \frac{|B_2|}{N}\right)}} \quad (3)$$
where $B_j$ is the neighbour set of images matched by the j-th seed $S_j$, i.e.

$$B_j = \{I_i \in I \mid w_{i,j} > 0\} \quad (4)$$
This measure is unbiased with regard to the input set sizes, which allows comparing the connectivity of two seeds that match unbalanced numbers of images. From this inter-set correlation measure, Houle [15] derives a second-order intra-set significance measure that estimates the relevance of any candidate cluster. The idea is to compute the expectation of the inter-set measure over all pairs of items in the candidate cluster. In our case, we can reformulate this measure as:

$$SR(O) = \frac{1}{|O|^2} \sum_{S_v \in O} \sum_{S_w \in O} R(S_v, S_w) \quad (5)$$
where $|O|$ is the number of seeds in the candidate cluster O. Unfortunately, this measure has the disadvantage of a second-order bias relative to the size of the candidate cluster. As suggested in [15], we can remove this bias by standardizing it under a randomness hypothesis, leading to the following standard score:

$$SI(O) = |O| \sqrt{N-1} \; SR(O) \quad (6)$$
In the same way, the contribution of a given seed $S_j$ to a candidate cluster O under the randomness hypothesis can be computed as:

$$SI(S_j, O) = \sqrt{\frac{N-1}{|O|}} \sum_{S_w \in O} R(S_j, S_w) \quad (7)$$
Interestingly (for the following steps), the significance of an object cluster O can be concisely re-expressed in terms of the sum of the contributions of the visual object seeds belonging to it:

$$SI(O) = \frac{1}{\sqrt{|O|}} \sum_{S_j \in O} SI(S_j, O) \quad (8)$$
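The measures (3) to (7) translate directly into code; a sketch under our naming, where B maps each seed index to its set $B_j$ of matched images and n_images is the number N of images in the collection (we assume 0 < |B_j| < N so the denominator of Eq. (3) is non-zero):

```python
import math

def r(b1, b2, n_images):
    """Inter-seed correlation R(S1, S2) of Eq. (3), from the image sets B1, B2."""
    num = len(b1 & b2) - len(b1) * len(b2) / n_images
    den = math.sqrt(len(b1) * len(b2)
                    * (1 - len(b1) / n_images) * (1 - len(b2) / n_images))
    return num / den

def sr(cluster, B, n_images):
    """Intra-set significance SR(O) of Eq. (5): mean pairwise correlation."""
    return (sum(r(B[v], B[w], n_images) for v in cluster for w in cluster)
            / len(cluster) ** 2)

def si(cluster, B, n_images):
    """Standardized cluster significance SI(O) of Eq. (6)."""
    return len(cluster) * math.sqrt(n_images - 1) * sr(cluster, B, n_images)

def si_item(j, cluster, B, n_images):
    """Contribution SI(S_j, O) of one seed to a candidate cluster, Eq. (7)."""
    return (math.sqrt((n_images - 1) / len(cluster))
            * sum(r(B[j], B[w], n_images) for w in cluster))
```

One can check on small examples that Eq. (8) holds, i.e. that si(O) equals the sum of si_item over the cluster divided by the square root of |O|.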
Now that we have defined our raw SNN significance measures, we can describe our clustering procedure. It is based on two main steps: candidate object cluster creation and redundant cluster merging.

Candidate object cluster creation Any visual object seed $S_j \in S$ is considered as a candidate cluster center if $|B_j| > r$, i.e. if the seed matches more than r images in the dataset (in our experiments, we used r = 4). For each seed $S_j$ selected as a candidate center, we would like to build a relevant candidate cluster $O_j$ from the set of neighbouring seeds having at least one match in $B_j$. Let us first denote by $H_j$ this full set of candidate neighbouring seeds:
$$H_j = \{S_v \in S \mid \exists I_i \in B_j,\; w_{i,v} > 0\} \quad (9)$$

Since all these seeds match images that have also been matched by $S_j$, it is meaningful to consider them as candidate items for the object cluster. However, many of them might correspond to other objects, since an image can contain several objects. We therefore would like to build the candidate cluster $O_j$ as the optimal subset of $H_j$ maximizing the significance measure:

$$O_j = \underset{O \subset H_j}{\arg\max}\; SI(O) \quad (10)$$
This is unfortunately a combinatorial problem that cannot be solved efficiently. We therefore propose to relax this objective with a greedy heuristic that locally selects optimal subsets of seeds while iterating over the neighbouring images in $B_j$. All images in $B_j$ are first ranked in decreasing order of their matching score $w_{i,j}$. The candidate cluster $O_j$ is initialized with the central candidate seed $S_j$, i.e. $O_j^0 = \{S_j\}$. The algorithm then iterates over the ranked images $I_t \in B_j$ and builds locally optimal clusters as:

$$O_j^t = O_j^{t-1} \cup \underset{O \subset H_{j,t}}{\arg\max}\; \frac{1}{|O^{t-1}|} \sum_{S_h \in O} SI(S_h, O^{t-1}) \quad (11)$$

where $H_{j,t} \subset H_j$ is the set of seeds having a match in the t-th image $I_t$. Intuitively, each step simply selects the optimal set of object seeds among those matched in the t-th image retrieved by $S_j$. Note that this local optimization can now be easily solved by sorting the seeds $S_h \in H_{j,t}$ by decreasing contribution $SI(S_h, O^{t-1})$ and iterating over them. The full algorithm stops when $O_j^t = O_j^{t-1}$, meaning that no improving seed has been found among the ones matching the t-th image in $B_j$.

At this point, any visual object seed $S_j \in S$ that matches more than r images in the dataset is associated with an approximately optimal cluster $O_j$. The candidate clusters are however still highly redundant, since all seeds of a given object might produce very similar clusters. The next step is aimed at merging these candidate clusters.
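A sketch of the greedy relaxation of (10) and (11), reusing si_item from the previous sketch; it simplifies the per-image subset optimization to adding candidate seeds by decreasing contribution as long as that contribution is positive, which is one reasonable reading of the procedure, and all names are ours.

```python
def build_candidate_cluster(j, B, w, n_images):
    """Greedy construction of the candidate cluster O_j around the seed S_j."""
    cluster = {j}
    # iterate over B_j in decreasing order of the matching score w[j][i]
    for image in sorted(B[j], key=lambda i: w[j][i], reverse=True):
        # H_{j,t}: seeds not yet in the cluster having a match in this image
        candidates = [v for v in B if image in B[v] and v not in cluster]
        # try candidates by decreasing contribution to the current cluster
        candidates.sort(key=lambda v: si_item(v, cluster, B, n_images),
                        reverse=True)
        added = False
        for v in candidates:
            if si_item(v, cluster, B, n_images) > 0:   # improving seed: keep it
                cluster.add(v)
                added = True
        if not added:
            break   # O^t == O^{t-1}: no improving seed in this image, stop
    return cluster
```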
Fig. 4 The clusters on the left share few items; they are considered as two different object clusters. Those on the right are considered similar (redundant) and have to be merged efficiently

Redundant object clusters merging For this step, we use a greedy strategy similar to the one in [15]. We first sort all candidate clusters $O_j$ by decreasing order of their significance score $SI(O_j)$ and then iterate over them. If an encountered cluster has an intersection greater than a user-defined threshold with one of the previously kept clusters, it is merged with it (see Fig. 4); otherwise, it is considered as a new object cluster. To improve the quality of the final cluster when an encountered cluster has to be merged, we use a reshaping strategy: only the items of the new cluster that increase the intra-significance of the resulting cluster are kept as new items.
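The merging pass can be sketched as follows (our simplification, reusing si and si_item from above); the overlap threshold tau and its normalization by the smaller cluster are our choices, since the paper leaves the exact threshold user-defined:

```python
def merge_clusters(candidates, B, n_images, tau=0.5):
    """Greedy merging of redundant candidate clusters, most significant first."""
    kept = []
    for c in sorted(candidates, key=lambda o: si(o, B, n_images), reverse=True):
        for k in kept:
            overlap = len(c & k) / min(len(c), len(k))
            if overlap > tau:                       # redundant: merge into k
                # reshaping: absorb only items that increase the significance
                for v in sorted(c - k,
                                key=lambda v: si_item(v, k, B, n_images),
                                reverse=True):
                    if si(k | {v}, B, n_images) > si(k, B, n_images):
                        k.add(v)
                break
        else:                                       # no overlap with any kept cluster
            kept.append(set(c))
    return kept
```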
3.3 Object-based visual query suggestion

For each of the two visual query suggestion paradigms described in the introduction, we answer the following questions: What do we suggest? How do we display the suggestions? What do we return when the user clicks on a suggested object?

– Mouse-over visual objects suggestion: For any image $I_j \in I$, we suggest as many queries as there are clusters having $I_j$ in their dual image set. Each cluster is represented by a rectangular window computed from the set of all regions that have been matched by the seeds of the cluster. Taking the bounding box of all matching regions would however be affected by outlier matches; we rather keep the bounding box of all pixels that are covered by at least two matching regions, as illustrated in Fig. 5 (see also the sketch after this list). When the user clicks on one of the suggested objects, we return a ranked list of images according to their intersection with the selected object (i.e. the number of seeds matching it, or the sum of the corresponding matching weights).
Fig. 5 Steps to obtain the suggested visual object in an image from BelgaLogos: (i) selecting the matched bounding boxes, (ii) intersecting the bounding boxes, (iii) keeping only the regions covered by at least two bounding boxes. The final result is the suggested object
Fig. 6 Some object clusters discovered in the Oxford Buildings dataset. The top four rows show clusters that are in the ground truth. The first five columns are seed examples of each cluster and the last column shows the suggested query object of each cluster
– Text-aware visual objects suggestion: We suppose that an external text-based search has already returned a subset $I_x \subset I$ of images. We then select as suggested query objects the top M clusters of the dataset having the greatest intersection between the images of their dual representation and the text-based result list (i.e. the clusters representing the most frequent objects in the result list). Each suggested query object is displayed at the top of the search interface by a single representative thumbnail. This is done by first seeking the image that has the largest intersection with the cluster (in terms of the number of seeds matching it) and then by cropping the object of interest in this image with the same procedure as in the mouse-over scenario. An illustration of the resulting visual queries is given in Fig. 6 for the Oxford Buildings dataset. When the user clicks on one of the suggested objects, we return a ranked list of images according to their intersection with the object.
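The display window of the mouse-over scenario (forward-referenced above) can be computed with a simple coverage map; a sketch assuming axis-aligned matched rectangles, with names of ours:

```python
import numpy as np

def suggested_window(regions, height, width, min_cover=2):
    """Bounding box of all pixels covered by >= min_cover matched regions.

    regions: list of (x0, y0, x1, y1) rectangles matched by the cluster's seeds.
    Returns None when no pixel reaches the required coverage.
    """
    cover = np.zeros((height, width), dtype=np.int32)
    for x0, y0, x1, y1 in regions:
        cover[y0:y1, x0:x1] += 1
    ys, xs = np.nonzero(cover >= min_cover)
    if len(xs) == 0:
        return None
    return xs.min(), ys.min(), xs.max() + 1, ys.max() + 1

# e.g. two overlapping matches and one outlier: the outlier is ignored
print(suggested_window([(10, 10, 60, 50), (20, 15, 70, 55), (200, 5, 230, 25)],
                       100, 256))
```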
4 Experiments

4.1 Experimental set-up

Our method is demonstrated on three databases:

– Oxford Buildings: This dataset1 consists of 5,062 images of buildings from Oxford and miscellaneous images, all retrieved from Flickr. A ground truth is provided for 55 queries (11 different landmarks in Oxford). We describe the corpus with 30 million SIFT features. SIFT [18] is a computer vision algorithm that detects interest points (blobs) in images based on Differences of Gaussians and provides a local description around each detected point.

– BelgaLogos: This dataset2 is composed of 10,000 images, manually annotated for 26 logos. A given image can contain one or several logos, or no logo at all. We describe this corpus with 38 million SIFT features.

– GoogleCrawl: To illustrate the text-aware visual objects suggestion paradigm, we created a small dataset (2,638 images) crawled from the Google Image search engine using the five following queries: Metallica Concert, Green Peace, Disney, Khadafi and World Cup. We describe this dataset with 12 million SIFT features. The details of this dataset are presented in Table 1.
1 http://www.robots.ox.ac.uk/∼vgg/data/oxbuildings/ 2 http://www-rocq.inria.fr/imedia/belga-logo.html
Table 1 GoogleCrawl database details

Queries       Number of images
Metallica     258
Greenpeace    270
Disney        288
Khadafi       293
World Cup     264
Total         2638
For all experiments, the number of seeds was set to 5K. Note that this vocabulary size is much lower than the sizes used by common bag-of-visual-words methods applied to the Oxford Buildings dataset.

4.2 Clustering performance evaluation

We first compared our clustering method to the state-of-the-art object mining methods [4, 24] on the Oxford Buildings dataset. We used the same evaluation protocol as [24] and [4]: for each landmark, we found the cluster containing the most positive (Good and OK) images of that landmark and computed the fraction of positive ground-truth images in this cluster. Table 2 summarizes the results of our method and reproduces the results reported by Philbin et al. [24] and Chum et al. [4]. It shows that our method gives a better performance on average than these two methods. The overall gain of our method comes mainly from the two categories "Ashmolean" and "Magdalen", where the other methods do not achieve good results. For "Ashmolean", we scored a MAP of 0.9095, which is high compared to the best score (MAP = 0.68) of Philbin et al. [24] and Chum et al. [4]. For the "Magdalen" category, we scored a MAP of 0.7634, which is more than three times the best score (MAP = 0.204) of the two compared methods. The worst result we obtained is 0.5847, for the "Balliol" category, while the worst result of Philbin et al. [24] is 0.204 and that of Chum et al. [4] is 0.0556, in both cases for the category "Magdalen".
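The evaluation protocol of [24] and [4] used above boils down to a few lines; a sketch under the reading that the fraction is computed with respect to the ground-truth positives, with toy data of ours:

```python
def landmark_score(positives, clusters):
    """Find the cluster containing the most positive images of a landmark and
    return the fraction of the positive ground-truth images it contains."""
    best = max(clusters, key=lambda c: len(c & positives))
    return len(best & positives) / len(positives)

# toy example: 4 of the 5 positive images fall in the best cluster -> 0.8
positives = {"im1", "im2", "im3", "im4", "im5"}
clusters = [{"im1", "im2", "im3", "im4", "imX"}, {"im5", "imY"}]
print(landmark_score(positives, clusters))
```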
Table 2 A comparison of the MAP clustering results for the 5K Oxford Buildings dataset

GroundTruth object    Philbin et al. [24]    Chum et al. [4]    Our proposed method
All souls             0.937                  0.9744             0.9187
Ashmolean             0.627                  0.68               0.9095
Balliol               0.333                  0.3333             0.5847
Bodleian              0.612                  0.9583             0.663
Christ church         0.676                  0.8974             0.599
Cornmarket            0.651                  0.6667             0.7449
Hertford              0.705                  0.9630             0.957
Keble                 0.937                  0.8571             1
Magdalen              0.204                  0.0556             0.7634
Pitt rivers           1                      1                  1
RadCliffe camera      0.973                  0.9864             0.9087
Average               0.696                  0.7611             0.8226

The bold entries are the maximum values for each row
Table 3 Detailed MAP for the 11 landmarks of Oxford Buildings

GroundTruth object    MAP
All souls             0.967
Ashmolean             0.9045
Balliol               0.5594
Bodleian              0.922
Christ church         0.8821
Cornmarket            0.7449
Hertford              0.9631
Keble                 0.8736
Magdalen              0.7603
Pitt rivers           1
RadCliffe camera      0.9172
Average               0.8631
This can be explained by the fact that these two methods combine bag-of-words indexing models with a spatial verification re-ranking stage to improve query performance, which gives bad results when the initial results returned by the bag-of-words method are themselves very bad, whereas in our case we discover spatially verified visual words. The geometric consistency of the feature points between patches makes them reliable: even if the building covers only a small part of the image, by using small consistent objects we can cluster images that are globally different but all contain the same object.

4.3 Retrieval performance evaluation

To evaluate the accuracy of our visual query suggestion method in terms of retrieval quality, we computed different MAP scores on the Oxford Buildings dataset. We first evaluated the retrieval only for the 55 standard queries, which are provided with their bounding boxes. We therefore selected only the object clusters discovered by our method that have one of the 55 query images in their dual image set (as if the mouse-over query suggestion scenario was applied to these images). In this first case, we considered as clicked queries only the object clusters having a match within the bounding box. We then returned the list of matching images sorted by decreasing matching score (i.e. the sum of the weights $w_{i,j}$ over all seeds belonging to the selected clusters). Detailed results for each landmark are presented in Table 3. We also give in Table 4 the MAP over all queries, compared to the retrieval results reported by Jegou et al. [11] and Philbin [20]. The results show that our method outperforms both.

To demonstrate that our proposed method is not only good for the 55 standard queries but for any image, we then evaluated the retrieval on all images annotated as positive in the ground truth. Since we do not have bounding boxes for these images, we
Table 4 A comparison of the MAP retrieval results for the 5K Oxford Buildings dataset

        Jegou [11]    Philbin [20]    Our method
MAP     0.74          0.82            0.86
Table 5 MAP retrieval results of the 55 queries compared to all images annotated as positive in the ground truth

        55 queries (with bounding box)    55 queries (without bounding box)    All images in ground truth
MAP     0.86                              0.84                                 0.836

This means that whatever the object and the image in which we suggest a query, the returned results will be as good as if the user had himself selected one of the 55 window queries
considered as clicked queries all the objects suggested in these images. We did the same for the 55 standard queries, for a fair comparison. Results are reported in Table 5. They show that the MAP remains very good, although some of the images belonging to the full ground truth contain very small instances of the buildings, more partial views and more complex viewpoints. This is important in the sense that it proves the feasibility of our new object suggestion paradigm: whatever the object and the image in which we suggest a query, the returned results will be as good as if the user had himself selected one of the 55 window queries.

We finally computed some statistics on the produced clusters to evaluate the completeness of the suggested visual queries. Figure 7 gives the percentage of images having at least m suggested query objects, for increasing values of m. It shows that when using only 5K seeds, 42 percent of the images have at least one suggested visual query. Remember that the number of seeds
Fig. 7 Histogram of the percentage of images that have more than m suggested query objects (x-axis: number of suggested query objects m, from 2 to 14; y-axis: percentage of images, from 0 to 100)
being a parameter of the method, a more complete coverage can simply be obtained by running the seed discovery algorithm longer. However, the more we iterate, the more we discover small and infrequent objects.

4.4 Visual query suggestion illustration

To illustrate our suggested visual queries qualitatively on other databases, we used the BelgaLogos and GoogleCrawl datasets. The text-aware visual objects suggestion scenario is illustrated using the GoogleCrawl dataset: Figure 8 shows the top three suggested objects for each of the five text queries. To better understand what is behind
Fig. 8 Some suggested visual queries for each of the five text queries (Metallica Concert, Greenpeace, Disney, Khadafi, World Cup) in the set of images crawled from Google Images
Fig. 9 Some suggested queries and the top three images returned for each one
such suggested objects, we also provide in Fig. 9, for four suggested queries, the top three images returned when the user clicks on them. Figure 10 illustrates the mouse-over visual objects suggestion scenario on the BelgaLogos dataset. The first two images are illustrations of images having two suggested visual queries; the last three illustrate images with only one suggested visual query. The right column gives the top three returned images for each suggested query.
5 Conclusions and perspectives

We believe that our object-based visual query suggestion is an original paradigm that really goes one step beyond standard window queries from a usage perspective. It poses several new problems that have all been addressed with original methods, including the efficient building of the matching graph, the clustering of object seeds and the display of the discovered objects.

All SNN similarity measures are redefined according to this dual context, and the candidate cluster creation step of the clustering framework is completely
Fig. 10 Some discovered object clusters in BelgaLogos
different from the previous work. The second main contribution of the paper is the querying paradigm itself, i.e. object-based visual query suggestion. It was never published before and, as explained in the paper, it is, from the user perception point of view, completely different from the classical query-by-window paradigm. Indeed, since all recommended objects are instantiated several times in the collection and therefore mostly return correct results, the user might rather perceive them as "visual links". Furthermore, we introduced two practical ways to implement this new paradigm, i.e. mouse-over visual objects suggestion and text-aware visual objects suggestion. Both of them involve representation issues that were addressed in the paper thanks to the dual clusters produced by our method.

In comparison to recent work, experiments show that our method succeeds in increasing the clustering and retrieval effectiveness by discovering frequent consistent visual object seeds and grouping those that match the same images in the dataset. Our clustering framework obtains strong performance for all objects in the dataset, and not only for user-selected queries.

As future work, we plan to study the influence of the size and shape of the query region on the object seed generation, as well as the impact of the size of the discovered seeds on the performance and accuracy of our approach.

Acknowledgment
A part of this work has been supported by the EU FP7 project I-SEARCH.
References

1. Anjulan A, Canagarajah N (2009) A unified framework for object retrieval and mining. IEEE Trans Circuits Syst Video Technol 19(1):63–76
2. Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
3. Broder A (1997) On the resemblance and containment of documents. In: Proceedings of the compression and complexity of sequences 1997. IEEE Computer Society, Washington, DC, USA, pp 21–29
4. Chum O, Matas J (2010) Large-scale discovery of spatially related images. IEEE Trans Pattern Anal Mach Intell 32:371–377
5. Chum O, Perdoch M, Matas J (2009) Geometric min-hashing: finding a (thick) needle in a haystack. In: IEEE computer society conference on computer vision and pattern recognition. Miami, Florida, pp 17–24
6. Chum O, Philbin J, Sivic J, Isard M, Zisserman A (2007) Total recall: automatic query expansion with a generative feature model for object retrieval. In: Proceedings of the 11th international conference on computer vision. Rio de Janeiro, Brazil, pp 1–8
7. Chum O, Philbin J, Zisserman A (2008) Near duplicate image detection: min-hash and tf-idf weighting. In: Proceedings of the British machine vision conference. Leeds, UK, pp 493–502
8. Devroye L (1986) Non-uniform random variate generation. Springer
9. Hofmann T (2001) Unsupervised learning by probabilistic latent semantic analysis. Mach Learn 42:177–196
10. Grauman K, Darrell T (2006) Unsupervised learning of categories from sets of partially matching image features. In: IEEE computer society conference on computer vision and pattern recognition, vol 1. New York, NY, pp 19–25
11. Jégou H, Douze M, Schmid C (2010) Improving bag-of-features for large scale image search. Int J Comput Vis 87:316–336
12. Joly A, Buisson O (2008) A posteriori multi-probe locality sensitive hashing. In: ACM international conference on multimedia (MM'08). Vancouver, British Columbia, Canada, pp 209–218
13. Joly A, Buisson O (2009) Logo retrieval with a contrario visual query expansion. In: Proceedings of the seventeenth ACM international conference on multimedia, MM '09. ACM, Beijing, China, pp 581–584
14. Hamzaoui A, Joly A, Boujemaa N (2011) Multi-source shared nearest neighbours for multi-modal image clustering. Multimed Tools Appl 51:479–503
15. Houle ME (2008) The relevant-set correlation model for data clustering. Stat Anal Data Min 1:157–176
16. Kuo Y-H, Chen K-T, Chiang C-H, Hsu WH (2009) Query expansion for hash-based image object retrieval. In: Proceedings of the 17th ACM international conference on multimedia, MM '09. Beijing, China, pp 65–74
17. Letessier P, Buisson O, Joly A (2011) Consistent visual words mining with adaptive sampling. In: Proceedings of the 1st ACM international conference on multimedia retrieval, ICMR '11. ACM, Trento, Italy, pp 49:1–49:8
18. Lowe DG (1999) Object recognition from local scale-invariant features. In: Proceedings of the seventh IEEE international conference on computer vision, IEEE Computer Society, vol 2. Kerkyra, Greece, pp 1150–1157
19. Olken F (1993) Random sampling from databases. Ph.D. thesis, U.C. Berkeley
20. Philbin J (2010) Scalable object retrieval in very large image collections. Ph.D. thesis, University of Oxford
21. Philbin J, Chum O, Isard M, Sivic J, Zisserman A (2008) Lost in quantization: improving particular object retrieval in large scale image databases. In: Proceedings of the IEEE conference on computer vision and pattern recognition. Anchorage, Alaska
22. Philbin J, Chum O, Isard M, Sivic J, Zisserman A (2007) Object retrieval with large vocabularies and fast spatial matching. In: Proceedings of the IEEE conference on computer vision and pattern recognition
23. Philbin J, Sivic J, Zisserman A (2008) Geometric LDA: a generative model for particular object discovery. In: Proceedings of the British machine vision conference. Leeds, UK
24. Philbin J, Zisserman A (2008) Object mining using a matching graph on very large image collections. In: Sixth Indian conference on computer vision, graphics and image processing, ICVGIP '08. Bhubaneswar, India, pp 738–745
25. Guha S, Rastogi R, Shim K (1999) ROCK: a robust clustering algorithm for categorical attributes. In: Information systems, pp 512–521
26. Sivic J, Russell BC, Zisserman A, Freeman WT, Efros AA (2008) Unsupervised discovery of visual object class hierarchies. In: IEEE conference on computer vision and pattern recognition, CVPR 2008. Anchorage, Alaska, pp 1–8
27. Tang J, Lewis P (2008) Non-negative matrix factorisation for object class discovery and image auto-annotation. In: ACM international conference on image and video retrieval. Niagara Falls, Canada, pp 105–112
28. Thompson SK (1995) Adaptive sampling. In: The survey statistician, pp 13–15
29. Tuytelaars T, Lampert CH, Blaschko MB, Buntine W (2010) Unsupervised object discovery: a comparison. Int J Comput Vis 88:284–302
30. Wang X, Grimson E (2007) Spatial latent Dirichlet allocation. In: Platt JC, Koller D, Singer Y, Roweis S (eds) Advances in neural information processing systems, vol 20. MIT Press, Cambridge, MA, pp 1577–1584
31. Xu G, Zong Y, Dolog P, Zhang Y (2010) Co-clustering analysis of weblogs using bipartite spectral projection approach. In: Proceedings of the 14th international conference on knowledge-based and intelligent information and engineering systems: Part III, KES'10. Cardiff, Wales, UK, pp 398–407
32. Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the tenth international conference on information and knowledge management, CIKM '01. ACM, Atlanta, Georgia, pp 25–32
33. Zha Z-J, Yang L, Mei T, Wang M, Wang Z (2009) Visual query suggestion. In: Proceedings of the 17th ACM international conference on multimedia. Beijing, China, pp 15–24
Amel Hamzaoui received her engineering degree from the High Institute of Informatics (ISI) in Tunisia in 2007 and her Master's degree in image processing and artificial intelligence from Pierre and Marie Curie University in Paris (France) in 2008. She joined the IMEDIA team at INRIA Rocquencourt and is currently a PhD candidate focusing on image clustering.
Pierre Letessier is a PhD student in computer vision in the research group of INA and in the IMEDIA team at INRIA Rocquencourt. He received his M.S. degree in image processing from Pierre et Marie Curie University and Telecom ParisTech (Paris, France) in 2009. His research focuses on discovering and exploiting frequent visual objects in multimedia datasets. He is currently involved in the OTMedia project.
Alexis Joly is a permanent research scientist at INRIA Rocquencourt in France. His topics of interest include content-based image and video retrieval, visual object mining and large-scale similarity search issues. He received his Engineer degree in telecommunications from the National Institute of Applied Sciences (INSA Lyon, France) in 2001 and his Ph.D. degree in computer science from the University of La Rochelle (France) in 2005. During his PhD, he collaborated with the French National Audiovisual Institute (INA) and developed a TV monitoring system working on huge datasets. In 2005, he worked as a visiting researcher at the National Institute of Informatics in Tokyo and then joined the IMEDIA team at INRIA Rocquencourt. He was involved in numerous European initiatives (MUSCLE NoE, VITALAS IP, TRENDS STREP and CHORUS CA) as well as national projects covering different application areas such as audio-visual archives, photo stock agencies and biodiversity. In 2007 and 2008, he co-organized the CIVR and TRECVID video copy detection evaluation campaigns, which were the first international events related to this topic. Dr. Joly has served on numerous scientific program committees of international journals (PAMI, Trans. on Multimedia, etc.) and conferences (ACM Multimedia, ACM CIVR, IEEE ICME, etc.). He also serves as a scientific expert for the French National Research Agency.
Olivier Buisson received his PhD degree in computer science from the University of La Rochelle, France, in 1997. From 1998 to 1999, he developed colour movie restoration software for the ExMachina company. In 1999, he joined the research group of INA. His research focuses on visual descriptors for images and videos, similarity measures, and visual search engines for very large databases of videos and images. Dr. Buisson has been involved in the following projects: VITALAS, INFOMAGIC and OTMedia.
Nozha Boujemaa is Director of Research at INRIA Paris-Rocquencourt. She obtained her PhD degree in mathematics and computer science in 1993 (Paris V) and her "Habilitation à Diriger des Recherches" in computer science in 2000 (University of Versailles). She previously graduated with a Master's degree with honours from the University of Tunis. Her topics of interest include multimedia content search, image analysis, pattern recognition and machine learning. Her research activities contribute to the next generation of multimedia search engines and affect several application domains such as audio-visual archives, the Internet, security and biodiversity. Prof. Boujemaa has authored more than 100 international journal and conference papers.