Mining and Cropping Common Objects from Images

Gangqiang Zhao, Junsong Yuan
School of Electrical and Electronic Engineering
Nanyang Technological University, Singapore

[email protected], [email protected]

ABSTRACT
Discovering common objects that appear frequently in a number of images is a challenging problem, due to (1) the appearance variations of the same common object and (2) the enormous computational cost of exploring the huge solution space, which covers the location, scale, and number of common objects. We characterize each image as a collection of visual primitives and propose a novel bottom-up approach that gradually prunes local primitives to recover the whole common object. A multi-layer candidate pruning procedure is designed to accelerate the image data mining process. Our solution provides accurate localization of the common object and is therefore able to crop common objects despite variations in scale, viewpoint, and lighting conditions. Moreover, it can extract common objects even from a small number of images. Experiments on challenging image and video datasets validate the effectiveness and efficiency of our method.

Categories and Subject Descriptors H.3.3 [Information Search and Retrieval]: Retrieval models

General Terms Algorithms, Experimentation

1. INTRODUCTION
Given a collection of images, it is of great interest to crop the common objects that occur many times. Instead of mining and segmenting object categories [13] [18] [5] [1], in this paper we focus on mining identical common objects [17] [14] [7] [16]. Discovering and cropping common objects from images has many applications, such as object retrieval, video summarization, and near-duplicate image detection. Automatically discovering and locating common objects poses two major challenges.


Figure 1: Illustration of our common object mining. (1) 1st row: each ellipse represents an extracted MSER region. (2) 2nd row: a red '+' marks a local region with positive commonness score, while a green 'o' marks a local region with negative commonness score. (3) 3rd row: mining the common object amounts to finding a bounding box with maximum commonness score.

First, we lack a priori knowledge of the common pattern: it is not known in advance (i) the shapes and appearances of the common objects; (ii) the locations and scales of common objects in different images; or (iii) the total number of common objects and the number of their instances. Moreover, each image may contain no common object or several. Second, the same object can look quite different when seen from different viewpoints, at different scales, or under different lighting conditions, not to mention partial occlusions. It is not trivial to handle these variations and accurately locate the object's occurrences. Although invariant local features greatly improve image matching, accurate localization of common objects at the sub-image level remains a challenging problem. To address these problems, we present a novel bottom-up approach for identical common object discovery. Our emphasis is on accurate localization of the common objects, so that they can be cropped out of the background clutter. Figure 1 illustrates the major steps of our method. First, each image is characterized by a collection of local features, which we refer to as visual primitives. We match visual primitives and gradually expand them spatially to recover the whole common object. In the initialization phase, "uncommon" visual primitives that have few matches in

other images are discarded, because they cannot belong to any common pattern. For each remaining visual primitive, we consider its local spatial neighborhood as a larger visual group and check the commonness score of this spatial pattern. After a multi-layer commonness check at different spatial scales, each local feature is assigned a commonness score, which indicates its likelihood of belonging to a common object. The commonness score of any sub-image is the sum of the scores of its local features. By searching for the sub-image with the highest commonness score in each image, we can locate and crop the common object. Our method has several advantages. First, unlike top-down generative models that rely on a visual vocabulary for topic (i.e., visual category) discovery, our method relies only on the matching of visual primitives. It can automatically detect and locate common objects without needing to know their scales and shapes, or the total number of such objects. Moreover, it can handle object variations such as scale and slight viewpoint changes, as well as color and lighting variations, and it is insensitive to partial occlusion. Finally, it does not require a large number of images for data mining and works well even on a very limited number of images. We test our approach on both image datasets and video sequences. The results validate the robustness and effectiveness of our method.

2. RELATED WORK
To discover common objects, some previous methods characterize each image as a graph composed of visual primitives such as corners, interest points, and image segments [14] [7] [10] [6]: each visual primitive is treated as a vertex of the graph, and spatial relations among visual primitives are represented as edges. The method in [14] applies color histograms to characterize image segments and uses an EM algorithm to solve the graph matching problem; however, the result depends on the initialization, and the global optimum is not guaranteed [7]. In [10], spatially coherent constraints are employed for graph matching, but the proposed solution is specifically designed for finding common objects in a pair of images rather than in a collection of images. Another category of approaches applies the "bag of visual words" representation, translating each image into a collection of "visual words" by clustering primitive visual features [15]. The mining results depend on the quality of the visual vocabulary, and the spatial configuration of the visual primitives is usually not considered. Moreover, although the bag-of-words model has been successfully applied to discover object categories from images [13], it is less suitable for discovering identical common objects from a very limited number of images, for example when cropping common objects from several or tens of images. In such cases, the visual-document representation is less effective because there are too few visual primitives to train the visual vocabulary, and alternative methods should be considered. Besides mining common objects from images, there is also recent work on discovering common objects from video sequences [9] [4]. These methods need supervision about the common object; for example, user labeling is required in [9] to initialize the search, so it is not fully unsupervised.

3. PROPOSED APPROACH
3.1 Overview
Given a number of unlabeled images, our objective is to discover and crop common objects that appear frequently. For each image, we extract local features using the Maximally Stable Extremal Regions (MSER) operator [12]. Each local region is denoted by $p = \{x, y, d\}$, where $(x, y)$ is its spatial location and $d$ is its SIFT descriptor [11]. Each image $I = \{p_i\}$ is thus characterized by a collection of local features, and an image dataset is a collection of images $D = \{I_i\}$. For each $p$, we assign a commonness score $C(p)$ that estimates its likelihood of belonging to a common object. Intuitively, $C(p)$ should relate to the frequency of $p$'s occurrence: $C(p) > 0$ if $p$ appears frequently in the dataset $D$, and $C(p) < 0$ if $p$ rarely repeats. The formal definition of the commonness score is given in the next section. Once $C(p)$ is obtained, we formulate common object mining as finding the sub-image bounding box with the highest commonness score. Namely, for each image $I_i$, we search for the bounding box $R^*$ with maximum score:

$$R^* = \arg\max_{R \subseteq I_i} \sum_{p \in R} C(p) = \arg\max_{R \in \Lambda} f(R) \qquad (1)$$

where $f(R) = \sum_{p \in R} C(p)$ is the objective function and $\Lambda$ denotes the candidate set of all valid sub-images in $I_i$. To speed up the search for $R^*$, we use the branch-and-bound search proposed in [8].
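To make this representation concrete, the following is a minimal sketch of a visual primitive and of evaluating the objective f(R) for a candidate box. The Primitive container and the function names are our own illustrative choices, and exhaustive search over a supplied box list stands in for the branch-and-bound search of [8]; the paper itself prescribes no implementation.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Primitive:
    x: float            # spatial location of the MSER region center
    y: float
    d: np.ndarray       # SIFT descriptor of the region (128-dim)
    score: float = 0.0  # commonness score C(p), assigned in Section 3.2

def f(box, primitives):
    """Objective f(R): sum of commonness scores C(p) of primitives inside R."""
    x0, y0, x1, y1 = box
    return sum(p.score for p in primitives
               if x0 <= p.x <= x1 and y0 <= p.y <= y1)

def best_box_bruteforce(primitives, boxes):
    """Exhaustive stand-in for the branch-and-bound search of Eq. (1):
    return the candidate box with the maximum commonness score."""
    return max(boxes, key=lambda b: f(b, primitives))
```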

3.2 Multi-Layer Commonness Checking

To estimate the commonness score $C(p)$, we first check the frequency of each individual $p \in D$. For each $p$, we define its matching set as

$$M_p = \{t : \|d_t - d_p\| \le \epsilon,\ t \in D \setminus I_i\}$$

where $M_p$ contains all matches of $p$ in $D$ except those in the same image $I_i$ as $p$, $\epsilon \ge 0$ is a matching threshold, and $\|\cdot\|$ denotes the Euclidean distance. To save computation, we apply locality-sensitive hashing (LSH) [3] to search for the $\epsilon$-nearest neighbors ($\epsilon$-NN). An uncommon local feature $p$ does not belong to any common object, so it is pruned if $M_p = \emptyset$. On the remaining set $D_0 = \{p : M_p \neq \emptyset\}$ we perform a further check, because even if an individual $p$ is common, it may not belong to any common object. For example, a texton region $p$ can appear frequently in a single image due to many self-repetitions, yet it cannot belong to any common object. For an object $O = \{p_i\}$ to be common, the whole set $\{p_i\}$ must re-occur many times in $D$. Therefore, besides evaluating each individual $p$, we also evaluate its $k$ spatial nearest neighbors ($k$-SNN). For each $p \in D_0$, its $k$-SNN in the image form a spatial group $N_p$. To count the frequency of $N_p$, we need to match it against the groups of other features $\{q \in D_0 \setminus I_i\}$. Matching two groups $N_p$ and $N_q$ can be formulated as a partial matching problem [2], which in general is non-trivial and computationally demanding. Following the method in [17], we use the approximate matching score

$$\tilde{M}s(N_p, N_q) = \min\{|N_p \cap G_{N_q}|,\ |N_q \cap G_{N_p}|\}$$

where $G_{N_p} = \bigcup_{t \in N_p} M_t$ is the matching set of the group $N_p$. According to [17], $\tilde{M}s(N_p, N_q)$ is an upper bound of the optimal matching score $Ms(N_p, N_q)$. Therefore, using

$\tilde{M}s(N_p, N_q)$, we can estimate $N_p$'s frequency. Let

$$S_p = \{q : \tilde{M}s(N_p, N_q) > 0.5\,k_1\} \qquad (2)$$

where $\tilde{M}s(N_p, N_q) > 0.5\,k_1$ indicates a potential match. Now $|S_p|$ is an upper-bound estimate of $N_p$'s frequency, and we can prune $p$ from $D_0$ if $N_p$ is uncommon. We end up with a smaller candidate set $D_1 = \{p : |S_p| > \lambda \times T\} \subseteq D_0$, where $\lambda$ is a threshold and $T$ is the total number of images. For each $p \in D_1$, we expand its spatial group $N_p$ with a larger size $k$. Estimating the frequency of the enlarged $N_p$ prunes further candidates and yields an even smaller candidate set $D_2$. Supposing there are $L$ layers in total and denoting the final set by $D_L$, we obtain a filtration $D_L \subseteq \ldots \subseteq D_1 \subseteq D_0$ with corresponding spatial neighborhood sizes $k_L > \ldots > k_1 > 0$. Compared with $D_1$, a visual primitive $p \in D_l$ ($2 \le l \le L$) corresponds to a larger spatial neighborhood and is more likely to be part of a common object. Based on this multi-layer checking, each visual primitive is assigned a commonness score. For each $p \in \{D_L, \ldots, D_1\}$, the commonness score is positive; the more layers $p$ passes, the higher its score. For the primitives in $D \setminus D_1 = \{p : p \in D,\ p \notin D_1\}$, the commonness score is negative, which is reasonable since these primitives are non-repetitive by themselves. Finally, we assign the commonness score of each region $p$ as

$$C(p) = \begin{cases} k_l & \text{if } p \in D_l \setminus D_{l+1},\ 1 \le l \le L \\ \tau & \text{if } p \in D \setminus D_1 \end{cases} \qquad (3)$$

where $\tau$ is a predefined negative value (with the convention $D_{L+1} = \emptyset$).
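The sketch below implements this filtration end to end. It is illustrative rather than the authors' code: brute-force ε-NN matching stands in for LSH, and all function names (matching_set, spatial_knn, group_match, multilayer_scores) are ours.

```python
import numpy as np

def matching_set(i, img_of, descs, eps):
    """M_p: indices of primitives in *other* images whose descriptors lie
    within eps of primitive i (brute-force stand-in for LSH)."""
    dist = np.linalg.norm(descs - descs[i], axis=1)
    return {t for t in np.flatnonzero(dist <= eps) if img_of[t] != img_of[i]}

def spatial_knn(i, img_of, xy, k):
    """N_p: the k spatial nearest neighbors of primitive i in its own image."""
    same = np.flatnonzero(img_of == img_of[i])
    same = same[same != i]
    d = np.linalg.norm(xy[same] - xy[i], axis=1)
    return set(same[np.argsort(d)[:k]])

def group_match(Np, Nq, M):
    """Approximate score  M~s(N_p, N_q) = min(|N_p ∩ G_Nq|, |N_q ∩ G_Np|)."""
    GNp = set().union(*(M[t] for t in Np)) if Np else set()
    GNq = set().union(*(M[t] for t in Nq)) if Nq else set()
    return min(len(Np & GNq), len(Nq & GNp))

def multilayer_scores(xy, descs, img_of, ks, eps, lam, T, tau):
    """Assign C(p) to every primitive via the filtration
    D_L ⊆ ... ⊆ D_1 ⊆ D_0 and Eq. (3)."""
    n = len(descs)
    M = [matching_set(i, img_of, descs, eps) for i in range(n)]
    D = [i for i in range(n) if M[i]]          # D_0: prune unmatched primitives
    C = np.full(n, tau, dtype=float)           # default score: tau (negative)
    for k in ks:                               # layers l = 1..L, k = k_l
        N = {i: spatial_knn(i, img_of, xy, k) for i in D}
        D = [i for i in D
             if sum(group_match(N[i], N[j], M) > 0.5 * k
                    for j in D if img_of[j] != img_of[i]) > lam * T]
        for i in D:
            C[i] = k                           # C(p) = k_l for deepest layer passed
    return C
```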

3.3 Common Object Localization
After obtaining the commonness scores $C(p)$, we can locate the common object in each image. To speed up the localization, we apply branch-and-bound search. With our objective function $f(R)$, we briefly explain the upper-bound estimation below; the details of the branch-and-bound search can be found in [8]. Denote by $\Lambda'$ a collection of sub-images, and assume there exist two regions $R_{min}$ and $R_{max}$ such that $R_{min} \subseteq R \subseteq R_{max}$ for every $R \in \Lambda'$. Then $f(R) \le f^+(R_{max}) + f^-(R_{min})$, where $f^+(R) = \sum_{p \in R} \max(C(p), 0)$ sums only the positive commonness scores and $f^-(R) = \sum_{p \in R} \min(C(p), 0)$ sums only the negative ones. We denote the upper bound of $f(R)$ over all $R \in \Lambda'$ by

$$\hat{f}(\Lambda') = f^+(R_{max}) + f^-(R_{min}) \ge f(R) \qquad (4)$$
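In code, the bound of Eq. (4) is a few lines; this sketch reuses the hypothetical Primitive objects from the Section 3.1 sketch.

```python
def f_plus(box, primitives):
    """f+(R): sum of only the positive commonness scores inside R."""
    x0, y0, x1, y1 = box
    return sum(max(p.score, 0.0) for p in primitives
               if x0 <= p.x <= x1 and y0 <= p.y <= y1)

def f_minus(box, primitives):
    """f-(R): sum of only the negative commonness scores inside R."""
    x0, y0, x1, y1 = box
    return sum(min(p.score, 0.0) for p in primitives
               if x0 <= p.x <= x1 and y0 <= p.y <= y1)

def upper_bound(r_max, r_min, primitives):
    """Eq. (4): f+(R_max) + f-(R_min) >= f(R) for every R with
    R_min ⊆ R ⊆ R_max."""
    return f_plus(r_max, primitives) + f_minus(r_min, primitives)
```

The inequality holds because any R in the set contains at most the positive-score mass of R_max and at least the negative-score mass of R_min, so branch-and-bound can discard a whole set of candidate boxes whenever this bound falls below the current best score.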

With this upper bound, we can detect the optimal sub-image $R_i^*$ in each image using the branch-and-bound algorithm [8]. In an image collection, some images may not contain any common object. We therefore select the sub-image $\hat{R}^*$ with the highest commonness score among all detected sub-images, and remove the detection $R_i^*$ of image $I_i$ if its commonness score is too small, i.e., if $f(R_i^*) < 0.1\,f(\hat{R}^*)$. Our algorithm is summarized in Algorithm 1.
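A compact sketch of this per-image detection and cross-image filtering step, reusing the hypothetical f and best_box_bruteforce helpers from the Section 3.1 sketch (again, exhaustive search stands in for branch-and-bound):

```python
def mine_common_objects(images):
    """images: list of (primitives, candidate_boxes) pairs, one per image.
    Pick the best box per image, then drop any detection whose score falls
    below 0.1 * f(R^*), where R^* is the best detection over the collection."""
    best = [(best_box_bruteforce(prims, boxes), prims)
            for prims, boxes in images]
    top = max(f(box, prims) for box, prims in best)        # f(R^*)
    return [box if f(box, prims) >= 0.1 * top else None    # None: no detection
            for box, prims in best]
```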

Algorithm 1: Common Object Mining
Input: $T$ unlabeled images with extracted local regions $D = I_1 \cup I_2 \cup \ldots \cup I_T$, and threshold $\lambda$
Output: common object regions $R_i^*$ in each image $I_i$ (if no valid detection, return $\emptyset$)

/* Multi-layer commonness checking */
foreach $p \in D$: if $M_p \neq \emptyset$, add $p$ to $D_0$
for $l = 1, \ldots, L$:
    foreach $p \in D_{l-1}$:
        $S_p = \{q \in D_{l-1} : \tilde{M}s(N_p, N_q) > 0.5\,k_l\}$
        if $|S_p| > \lambda T$, add $p$ to $D_l$
foreach $p \in D$:
    if $p \in D_l \setminus D_{l+1}$ for some $1 \le l \le L$: $C(p) = k_l$
    else if $p \in D \setminus D_1$: $C(p) = \tau$

/* Common object localization */
foreach image $I_i$:
    $R_i^* = \arg\max_{R \in \Lambda} f(R)$; add $R_i^*$ to a set $\Lambda^*$
get the highest-score region $\hat{R}^*$ from $\Lambda^*$ based on $f(R)$
foreach $R_i^* \in \Lambda^*$: if $f(R_i^*) < 0.1\,f(\hat{R}^*)$, remove it from $\Lambda^*$
return $\Lambda^*$

4. PERFORMANCE EVALUATION
4.1 Experimental Setting
To evaluate our algorithm, we test 10 image sets and 5 video sequences for common object mining. Each dataset contains 4-49 images with different cluttered backgrounds, and the common objects undergo variations such as rotation, partial occlusion, and scale, viewpoint, and lighting changes. One image may contain multiple common objects, and some images contain none. The multi-layer commonness checking has three layers (L = 3) with spatial neighborhood sizes $k_1 = 5$, $k_2 = 10$, and $k_3 = 15$. The threshold $\lambda$ was set to 0.1 and the parameter $\tau$ to -2.
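For illustration, a run of the hypothetical multilayer_scores helper sketched in Section 3.2 under these reported settings might look as follows. The data here is synthetic, and the descriptor matching threshold eps is our placeholder; the paper does not report its value.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 6                                        # number of images in the collection
xy = rng.uniform(0, 480, size=(40 * T, 2))   # synthetic MSER region centers
descs = rng.normal(size=(40 * T, 128))       # synthetic stand-in SIFT descriptors
img_of = np.repeat(np.arange(T), 40)         # image index of each primitive

# Reported settings: L = 3 layers, k = (5, 10, 15), lambda = 0.1, tau = -2.
# eps = 12.0 is an assumed placeholder, not a value from the paper.
C = multilayer_scores(xy, descs, img_of, ks=(5, 10, 15),
                      eps=12.0, lam=0.1, T=T, tau=-2.0)
print("primitives with positive commonness score:", int((C > 0).sum()))
```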

4.2 Common Object Mining

For each object, we apply the mining algorithm to the image collection containing it. Fig. 2 presents three example results, with the common objects highlighted by bounding boxes. From top to bottom, the rows correspond to Dataset A, Dataset B, and Dataset C. Dataset A is an image collection of one Oxford building. Dataset B is a collection of faces of 10 persons [10], nine male and one female. Dataset C is an image collection captured from a video clip at two frames per second; the clip is an advertisement used in [9], and among its 49 images the common object appears 21 times. Datasets D through O are shown in the supplementary material. Fig. 2 shows that the proposed approach can also discover and locate the faces of different persons. To quantify the performance of the proposed approach, we manually label ground truth bounding boxes for the common objects in each image, and measure performance by the F-measure [14]. We compare the proposed method with two other methods: (1) Random Bounding Box and (2) Topic Discovery. The first method randomly generates a bounding box to guess the location and scale of the common object.

Figure 2: Sample results of common object mining. Each row corresponds to an image collection (at least six images), and the common object is marked by a red bounding box. The third row comes from a video clip.

The second method, proposed in [13], finds common objects by employing the "bag of visual words" representation and a topic discovery algorithm. Figure 3 compares the approaches. Overall, our approach outperforms both Random Bounding Box and Topic Discovery in terms of the F-measure, with an average score of 0.63 (Proposed) compared to 0.14 (Random) and 0.43 (Topic Discovery). This validates the advantages of our method.

Figure 3: Performance comparison of the proposed approach (Proposed), Topic Discovery (Topic) [13], and Random Bounding Box (Random) on Datasets A-O, with the average F-measure.

5. CONCLUSION
We propose a novel bottom-up approach for automatically cropping common objects from images. Instead of modeling each image as a visual document and discovering common patterns through conventional text mining, we evaluate individual visual primitives and gradually expand them to recover the common object. To speed up the image data mining, we propose a multi-layer candidate pruning method that efficiently discards unqualified candidates. Once each visual primitive has obtained a commonness score, these local evidences of common patterns are fused by finding the bounding box with the highest commonness score. Experiments on both image datasets and video sequences show that our method can crop common objects automatically despite variations in scale, viewpoint, and lighting conditions.

Acknowledgments This work is supported in part by the Nanyang Assistant Professorship (SUG M58040015).

6. REFERENCES

[1] N. Ahuja and S. Todorovic. Connected segmentation tree: a joint representation of region layout and hierarchy. In CVPR, 2008.
[2] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press, 2001.
[3] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the 20th Annual Symposium on Computational Geometry, pages 253-262, 2004.
[4] S. Drouin, P. Hebert, and M. Parizeau. Incremental discovery of object parts in video sequences. Computer Vision and Image Understanding, 110(1):60-74, 2008.
[5] J. Gao, Y. Hu, J. Liu, and R. Yang. Unsupervised learning of high-order structural semantics from images. In ICCV, 2009.
[6] K. Heath, N. Gelfand, M. Ovsjanikov, M. Aanjaneya, and L. J. Guibas. Image webs: Computing and exploiting connectivity in image collections. In CVPR, 2010.


[7] P. Hong and T. S. Huang. Spatial pattern discovery by learning a probabilistic parametric model from multiple attributed relational graphs. Discrete Applied Mathematics, 139:113-135, 2004.
[8] C. H. Lampert, M. B. Blaschko, and T. Hofmann. Efficient subwindow search: A branch and bound framework for object localization. TPAMI, 31(12):2129-2142, 2009.
[9] D. Liu, G. Hua, and T. Chen. A hierarchical visual model for video object summarization. TPAMI, 2010.
[10] H. Liu and S. Yan. Common visual pattern discovery via spatially coherent correspondences. In CVPR, 2010.
[11] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
[12] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In BMVC, 2002.
[13] B. C. Russell, W. T. Freeman, A. A. Efros, J. Sivic, and A. Zisserman. Using multiple segmentations to discover objects and their extent in image collections. In CVPR, 2006.
[14] H.-K. Tan and C.-W. Ngo. Localized matching using earth mover's distance towards discovery of common patterns from small image samples. Image and Vision Computing, 27(10):1470-1483, 2009.
[15] T. Tuytelaars, C. H. Lampert, M. B. Blaschko, and W. Buntine. Unsupervised object discovery: A comparison. IJCV, 88:284-302, 2010.
[16] J. Yuan, Z. Li, Y. Fu, Y. Wu, and T. S. Huang. Common spatial pattern discovery by efficient candidate pruning. In ICIP, 2007.
[17] J. Yuan and Y. Wu. Spatial random partition for common visual pattern discovery. In ICCV, 2007.
[18] J. Yuan, Y. Wu, and M. Yang. From frequent itemsets to semantically meaningful visual patterns. In SIGKDD, 2007.
