Unsupervised Learning of High-order Structural Semantics from Images

Jizhou Gao, Yin Hu, Jinze Liu and Ruigang Yang
Center for Visualization and Virtual Environments, University of Kentucky, USA

Abstract

Structural semantics are fundamental to understanding both natural and man-made objects, from languages to buildings. They are manifested as repeated structures or patterns and are often captured in images. Finding repeated patterns in images, therefore, has important applications in scene understanding, 3D reconstruction, and image retrieval, as well as image compression. Previous approaches in visual-pattern mining limited themselves by looking for frequently co-occurring features within a small neighborhood in an image. However, the semantics of a visual pattern are typically defined by specific spatial relationships between features regardless of their spatial proximity. In this paper, semantics are represented as visual elements and the geometric relationships between them. A novel unsupervised learning algorithm finds pair-wise associations of visual elements that have consistent geometric relationships sufficiently often. The algorithms are efficient: maximal matchings are determined without combinatorial search. High-order structural semantics are extracted by mining patterns that are composed of pairwise spatially consistent associations of visual elements. We demonstrate the effectiveness of our approach for discovering repeated visual patterns on a variety of image collections.

1. Introduction

Extending the tremendous success of text-based retrieval to images or videos has been tantalizing. However, given the complex interplay of lighting, perspective, and occlusions, the same visual pattern (e.g., a window) can have dramatically different appearances in different images. The general approach to address this problem is to extract semantics from low-level visual words. Though visual words, such as SIFT descriptors, are typically invariant or less sensitive to lighting or perspective changes, they are limited to small local image patches (e.g., a corner of a window). Composing visual patterns from these descriptors that represent high-order structural semantics is very difficult, if not impossible. In this paper, we study the extraction of structural semantics in images.


Figure 1. Sample Images with Repeated Visual Patterns

We define semantics as a specific set of relationships that connect visual words carrying specific meanings. For example, the four corners of a window and the relationships between these corners, as shown in Figure 1(a), suggest the existence of the window. We assume that semantics manifest themselves through repetition; for example, the structure of windows is repeatedly exhibited in buildings. The ultimate goal, therefore, is to find frequently occurring patterns that are composed of visual features in images. To approach this goal, existing methods typically sample features randomly within a small spatial neighborhood and search for frequently co-occurring features within the sampled neighborhood (e.g., [4, 22]), and/or require a supervised training set of features for a particular classification purpose (e.g., [5, 6, 7]). These approaches lack a systematic method to capture visual structural semantics or patterns globally, and are often unable to detect complex patterns that appear at random locations within one or multiple images, vary in size or shape, or have missing features. We propose a method for unsupervised learning of high-order structural semantics in images, represented by frequently co-occurring features with consistent spatial relationships.

Our approach first extracts scale-invariant visual primitives from the input images; these primitives are further clustered into a small set of visual words or clusters, where each visual cluster represents primitives with similar appearance. We formulate the problem of finding meaningful pairwise visual word associations as a minimal-cost bipartite graph matching, where the cost is defined by the spatial consistency of the candidate pairings. This allows us to find an optimal pairwise mapping between pairs of visual clusters in polynomial time, independent of their spatial proximity. The method is further extended to allow multiple associations (multi-modal) between visual clusters. For example, while eyes might form a single visual cluster, the subset of left eyes has a different consistent association with the nose than the subset of right eyes. Finally, rather than specifying the number of parts as a strong prior, we seek to compose the set of all pairwise association models Ψ between visual clusters into visual patterns with more complex semantics. To achieve this, we apply frequent subgraph mining to the graphs induced by Ψ to automatically find all frequent composite visual patterns at different levels. Compared to previous methods, our approach contributes the following to the state of the art:

• The spatial distributions of features are taken into account. Rather than treating the image as "a bag of words", we keep track of the relative position and orientation of features and use the consistency of the spatial relationships to measure the strength of the semantics.

• An efficient polynomial-time algorithm is developed to search for meaningful and strong associations between visual features globally, i.e., over the entire image space. Previous methods (e.g., [4, 22]) limited the search range to a pre-defined local neighborhood due to complexity considerations. Our method can find long-range associations such as the cluttered long-range patterns shown in Figure 1(b). It also allows multi-modal (e.g., one-to-many) associations to deal with complex patterns such as a human face.

• It is unsupervised and requires no prior knowledge of any kind. Many image classification or feature selection methods require a labeled training set with a known number of visual patterns; their limited efficiency and flexibility make them hard to generalize to large and complex image datasets.

1.1. Overview of the Pipeline

Given a set of images, we first extract visual primitives and cluster them into visual clusters. The next step is to identify consistent pairwise associations among clusters (Section 3).

These associations between clusters are then used to label the relationships between primitives in images. Finally, frequent image patterns are extracted by finding frequent maximal subgraphs that are connected by the same sets of associations (Section 4).
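As a concrete, purely illustrative sketch of the first two pipeline stages (feature detection and clustering), the snippet below extracts SIFT primitives and groups them by appearance. It assumes OpenCV (>= 4.4, for cv2.SIFT_create) and scikit-learn; the dictionary layout and the number of clusters are our own choices, not values prescribed by the paper.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def extract_and_cluster(image_paths, n_clusters=200):
    """Extract SIFT primitives f_i = [x_i, s_i, theta_i, d_i] from a set of
    images and group them into visual clusters by appearance (k-means)."""
    sift = cv2.SIFT_create()
    primitives, descriptors = [], []
    for m, path in enumerate(image_paths):
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        keypoints, desc = sift.detectAndCompute(img, None)
        if desc is None:
            continue
        for kp, d in zip(keypoints, desc):
            primitives.append({'image': m,
                               'x': np.array(kp.pt),           # centroid x_i
                               's': kp.size,                   # scale s_i
                               'theta': np.deg2rad(kp.angle)}) # orientation theta_i
            descriptors.append(d)                              # appearance d_i
    # k-means on the appearance vectors yields the visual clusters C_1..C_n
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit(np.asarray(descriptors)).labels_
    for p, label in zip(primitives, labels):
        p['cluster'] = int(label)
    return primitives
```

The later sections operate on the resulting primitives and their cluster labels.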

2. Related Work

Previous part-based approaches try to learn, detect, and recognize object models in images, such as the sparse flexible model [4], the constellation model [6], the star model [7], and the pictorial structure [5]. They typically summarize a frequently occurring pattern as a connected graphical structure built upon a collection of parts, where each part corresponds to a local image patch. In general, these methods are computationally expensive and require restrictive priors, such as a known number of parts and spatial proximity.

Yuan et al. [22] recently proposed a technique to translate image features into a transactional database from which co-located features can be mined. However, features that are co-located do not always suggest accurate and meaningful associations. Our work is substantially different, since we build upon detecting strong pairwise associations. SpIBag (Spatial Item Bag Mining) [10] also discovers frequent spatial patterns in images that persist under rotation, scaling, and translation. Its pattern mining algorithm also relies on frequent itemset mining, but it assumes the model parts can be identified without ambiguity or missed detections.

Given a target object, image retrieval systems (such as [17, 16, 14]) output a set of representative images from a large database based on the bag-of-visual-words model. Essentially, they treat an image as a collection of visual words and then apply text retrieval approaches to search for the given target. Some loose spatial constraints can be used to improve the retrieval results [16]. Different from image retrieval, however, we do not have any target object to begin with, and instead of looking for an entire image, we seek to find similar patterns at all levels within or between images. Thus, given the same number of images, the search space of our problem is significantly larger than that of image retrieval.

Methods used in texel discovery also relate to our work. A texel is a texture element that repeatedly occurs in a particular texture. Hays et al. fit a 2D lattice structure to detect texels in [8]. In [2], Ahuja et al. detect and infer partially occluded texels by learning substructures in a segmentation tree. Texels are helpful for understanding the regularity of a textured object, but they are distributed on an almost regular grid with many occurrences; these techniques will have difficulty finding large patterns with fewer occurrences. In addition, we do not assume spatial regularity of similar patterns.

Figure 2. A Pipeline for the Extraction of Structural Semantics from Images: feature detection on the image dataset yields visual primitives; clustering groups them into visual clusters; pair-wise association links the clusters; and pattern composition assembles composite patterns that carry the structural semantics.

In [15], Pauly et al. successfully detect regular geometric structures in a 3D model. The structure discovery is made possible by mapping pairwise similarity transformations between sampled 3D points into a transformation space and then revealing prominent lattice structures in a suitable model of that space. In [18], the authors present a grouping system that detects regular repetitions of purely planar patterns.

3. Pair-wise Associations

3.1. Preliminaries

We start by building the representation of an image in the form needed to discover frequent visual patterns. We assume a set of local image patches representing salient features in the image can be extracted automatically, e.g., using SIFT features [13]. Each such visual primitive of an image I_m is represented by a feature vector f_i = [x_i, s_i, θ_i, d_i], where the 2D vector x_i is the centroid, s_i is the scale, θ_i is the orientation, and the high-dimensional vector d_i encodes the appearance of the feature. Mathematically, an image I_m is a set of visual primitives I_m = {f_i}.

Primitives with similar appearance often occur at multiple locations in the image set {I_m}. Clustering algorithms, such as k-means, can group the primitives in {I_m} according to the similarity of their appearance vectors, yielding n sets of similar visual primitives {C_1, C_2, ..., C_n}. Each set denotes a visual cluster, and the primitives in each set are instances of that visual cluster.

Given two features from the same image, f_i, f_j ∈ I_m, where f_i is an instance of visual cluster C_k and f_j is an instance of C_t, the link l_{i,j} denotes the spatial relationship between the two primitives. Following the definition in [3, 9], l_{i,j} is formulated as the 4D vector [D_{i,j}, S_{i,j}, H_{i,j}, H_{j,i}] of Eq. (1), where D_{i,j} is the relative spatial distance between f_i and f_j (1a), S_{i,j} is their relative scale difference (1b), H_{i,j} is the heading from f_i to f_j (1c), and H_{j,i} is the heading from f_j to f_i (1d):

  D_{i,j} = \|x_i - x_j\|_2 / \sqrt{s_i^2 + s_j^2}          (1a)
  S_{i,j} = (s_i - s_j) / \sqrt{s_i^2 + s_j^2}              (1b)
  H_{i,j} = \Delta_\theta(\arctan(x_i - x_j) - \theta_i)    (1c)
  H_{j,i} = \Delta_\theta(\arctan(x_j - x_i) - \theta_j)    (1d)

where function Δθ (·) ∈ [−π, +π] calculates the principal angle. The representation is invariant to translation, scale and rotation, and robust to small distortion. To determine the similarity between two links li,j and li ,j , the Mahalanobis distance is computed as  li,j − li ,j Σ = (li,j − li ,j )T Σ−1 (li,j − li ,j ) (2) where Σ = diag(σd2 , σs2 , σh2 , σh2 ), a 4 × 4 diagonal matrix with variances of distance, scale and heading. Two links li,j and li ,j are consistent if li,j −li ,j Σ < ε. In addition, we define that two links are independent if they are not incident to a common feature. Due to the inherent complexity of a visual pattern, visual clusters that are co-located do not always suggest accurate and meaningful associations. We believe that a set of visual clusters in a pattern not only need to co-occur, but also need to maintain consistent spatial relationships among the set of visual clusters to encode more rich semantical meanings. Ideally, we expect there exists a consistent spatial relationship that associates any feature in one cluster Ck with another unique feature in the other cluster Ct and vice versa. Therefore, we say there is an association or a model ψk,t between two clusters Ck and Ct , if there exists independent and consistent links connecting any instance in Ck and another instance in Ct ; namely ψk,t is a set of feature pairs, ψk,t = {(fi , fj )|fi ∈ Ck , fj ∈ Ct }, where given any two pairs (fi , fj ) and (fi , fj ) in ψk,t , they must satisfy the following conditions: (1) fi and fj are extracted from the same image and so are fi and fj , mathematically fi , fj ∈ Im and fi , fj ∈ Im , where Im and Im may or may not be the same; (2) fi and fi are grouped into the same cluster and so are fj and fj , namely fi , fi ∈ Ck and fj , fj ∈ Ct ; (3) links are independent, as follows, i = i and j = j  ; (4) links

are consistent, defined as li,j − li ,j Σ < ε, where ε is the maximum deviation allowed among links. Figure 3 shows a real example of how we apply the pair-wise association to identify the consistent relationships between caps and tips of the pencils.
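To make the link representation concrete, the following minimal sketch (ours, not the authors' code) computes the 4D link vector of Eq. (1) and the distance of Eq. (2) for two primitives, using the same dictionary layout as the extraction sketch above; the σ and ε values are illustrative placeholders, not tuned parameters.

```python
import numpy as np

def principal_angle(a):
    """Wrap an angle into [-pi, +pi], i.e., the Delta_theta(.) function."""
    return (a + np.pi) % (2 * np.pi) - np.pi

def link_vector(f_i, f_j):
    """4D link [D, S, H_ij, H_ji] of Eq. (1) between two primitives.

    Each primitive is a dict with 'x' (2D centroid), 's' (scale) and
    'theta' (orientation); this layout is an illustrative assumption."""
    d = f_i['x'] - f_j['x']                        # x_i - x_j
    norm = np.sqrt(f_i['s'] ** 2 + f_j['s'] ** 2)
    D = np.linalg.norm(d) / norm                   # relative distance (1a)
    S = (f_i['s'] - f_j['s']) / norm               # relative scale (1b)
    H_ij = principal_angle(np.arctan2(d[1], d[0]) - f_i['theta'])    # (1c)
    H_ji = principal_angle(np.arctan2(-d[1], -d[0]) - f_j['theta'])  # (1d)
    return np.array([D, S, H_ij, H_ji])

def link_distance(l1, l2, sigma=(1.0, 0.5, 0.3, 0.3)):
    """Mahalanobis-style distance of Eq. (2) with diagonal Sigma;
    the sigma values are placeholders."""
    diff = l1 - l2
    # wrap heading differences back into [-pi, +pi] before comparing
    # (an implementation detail not spelled out in the text)
    diff[2], diff[3] = principal_angle(diff[2]), principal_angle(diff[3])
    return np.sqrt(np.sum(diff ** 2 / np.asarray(sigma) ** 2))

# Two links are "consistent" if link_distance(l1, l2) < eps for a chosen eps.
```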

Figure 3. Example of pair-wise association. The left image contains several pencils in different poses. The middle one depicts the detected features centered at red dots, with blue and green ellipses representing two dominant visual clusters. The right image reflects the association between the two clusters. Based on the link vector (1) and the link similarity measure (2), the pair-wise association algorithm correctly links the caps and tips of the same pencils together.

However, in reality the problem is more challenging for several reasons. First, more often than not, some visual primitives may not be present in an image due to artifacts or noise, or they may be lost during feature detection or clustering; in this case, not all associations are present. Second, there may exist multiple valid associations between two clusters. For example, both eyes are associated with the nose, but the association between the left eye and the nose differs from the association between the right eye and the nose. Therefore, our goal is to find all associations between two visual clusters in the presence of noise. To address this, we transform the problem into a bipartite graph problem and develop a polynomial-time algorithm to find all maximal pair-wise associations exhibited between two visual clusters.

3.2. Multi-Modal Pair-wise Association

A weighted complete bipartite graph G = ⟨U, V, E, ω⟩ can be generated to represent the associations between two visual clusters C_k and C_t. Every vertex u in U represents an instance of visual cluster C_k, and every vertex v in V represents an instance of C_t. E is the set of all edges between U and V, each of which corresponds to a link between one instance of C_k and one instance of C_t. A weight function ω(e) is defined for every edge e ∈ E; in this paper, ω(e) is the 4D link vector l of Eq. (1), and edges can be sorted by the relative distance D of Eq. (1a). Within this bipartite graph G, a matching M ⊆ E is a set of edges that share no common vertices. Let U_M ⊆ U and V_M ⊆ V be the sets of vertices incident with some edge in M, and let G_M = ⟨U_M, V_M, M, ω⟩ be the subgraph of G induced by the matching M. A matching-induced subgraph G_M ⊆ G is a maximal association subgraph if it is:

1. weight-consistent: MSE(G_M) ≤ φ, where

  MSE(G_M) = \frac{1}{|M|} \sum_{e \in M} (ω(e) − ω(ê))²,

  ê = arg min_{e ∈ M} ω(e), and the difference ω(e) − ω(ê) is measured using Eq. (2);

2. maximal: there exists no G_{M'} such that G_M ⊂ G_{M'} and G_{M'} is weight-consistent.

We choose ê as a representative edge: the edge in M that has the least weight among all edges in the matching-induced subgraph. The weights of all edges in M should then deviate little from ω(ê), which is ensured by a small mean squared error (MSE) between the weight of ê and those of the remaining edges. Meanwhile, the second property makes the candidates optimal in size, since models with higher support are always preferred. A matching that induces a maximal association subgraph is thus a candidate association between C_k and C_t. Note that the sorted list of edges is not used to find a set of least-weighted edges connecting features within a limited spatial distance; instead, it is used to find a maximal set of edges with similar weights, which can represent either long- or short-range associations.
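As a small illustration, the weight-consistency test could be realized as sketched below, reusing a link-distance function like the one given earlier. Treating the "least weight" edge as the one with the smallest relative distance D, and the threshold phi, are our interpretations, not details fixed by the paper.

```python
import numpy as np

def mse_weight_consistent(matching_weights, link_distance, phi):
    """Weight-consistency test for a matching-induced subgraph G_M.

    matching_weights : list of 4D link vectors omega(e), one per edge in M
    link_distance    : function computing the Eq. (2) distance between links
    phi              : MSE threshold (illustrative parameter)
    """
    if not matching_weights:
        return True
    # Representative edge e_hat = argmin omega(e); "least weight" is taken
    # here as the smallest relative distance D (component 0), matching the
    # sort order used in Section 3.2.
    e_hat = min(matching_weights, key=lambda w: w[0])
    mse = np.mean([link_distance(w, e_hat) ** 2 for w in matching_weights])
    return mse <= phi
```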

3.3. A Polynomial-time Algorithm

The naive way to find the maximal pair-wise association subgraphs is to enumerate all potential subgraphs and identify those that satisfy the constraints. However, this approach is intractable, since there exist \sum_{i=1}^{|U|} \frac{|V|!}{i!} possible subgraphs, assuming |U| ≤ |V|. We propose a polynomial-time algorithm for finding all maximal pair-wise association subgraphs. Intuitively, since we are looking for consistent links within an association subgraph, these links must be similar. Assuming an ordered list of the edges based on their weights, our algorithm looks for a window on the sorted edges that embeds a maximal association subgraph. The algorithm therefore first sorts all edge weights {ω(e) | e ∈ E} in G by relative distance D (Eq. 1a). Next, we grow a window W on the non-descending list of edges. The window W starts as an empty set and is grown dynamically by adding or deleting edges. Let G_W ⊆ G be the subgraph induced by the set of edges in W. At each step, given the edges in the current window W, a maximum matching in G_W is found. If it satisfies the weight-consistency property, the smallest-weighted edge from the remaining edge pool is added to the window W. Once the weight-consistency property no longer holds after a new edge is included in W, a local optimum has been reached just before the operation, and the corresponding model is admitted as a candidate

model. To escape the local optimum, the window W then discards its smallest-weighted edge and recomputes the maximum matching M_W, repeating until M_W again meets weight-consistency. Note that the maximum matching M_W of the subgraph G_W induced by W may change as the window W grows or shrinks. These two processes occur alternately, making the window wriggle through the edge list. After all edges have been visited, the algorithm terminates and returns all candidate models.

Complexity. The algorithm visits each edge at most twice: once to include it in the window and once to discard it. When adding an edge e, a maximum matching M_{W∪{e}} for G_{W∪{e}} must be found based on M_W, the maximum matching of G_W. This can be done by finding an augmenting path in G_{W∪{e}} with respect to M_W, i.e., a path of alternating edges in M_W and edges not in M_W with free end nodes, where a node is free if it is not incident with any edge of the current matching. This process is similar to a breadth-first search in G_{W∪{e}} from free vertices of U_{W∪{e}} to some free vertex of V_{W∪{e}}, and its cost is O(|W|), where |W| is the number of edges in the window. Given the maximum matching M_W before adding the new edge e, this augmentation needs to be performed only once when searching for the new maximum matching M_{W'}, where W' = W ∪ {e}, since |M_{W'}| is at most |M_W| + 1. When discarding an edge e, the cost is O(1) if e is not in the current maximum matching M_W. Otherwise, the augmenting-path procedure is applied once in the remaining window W − {e} with respect to the remaining matching M_W − {e}; one augmentation again suffices, since the maximum matching in W − {e} is at most as large as M_W, which has just one more edge than M_W − {e}. This costs O(|W| − 1) time. In summary, updating the maximum matching of the window takes O(|W|) ≤ O(|E|) time per edge addition or removal. Since the window visits each edge at most twice, O(|E|²) time suffices for mining the pair-wise associations between U and V.
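The sketch below illustrates this window-growing procedure. For clarity it recomputes the maximum matching from scratch with Kuhn's augmenting-path algorithm instead of updating it incrementally as described above, and the edge layout and the weight_consistent predicate (e.g. a closure over the MSE test sketched earlier) are our assumptions, not the authors' implementation.

```python
from collections import defaultdict

def maximum_matching(edge_list):
    """Maximum bipartite matching via Kuhn's augmenting paths.
    edge_list holds (u, v, weight) tuples; recomputed from scratch here for
    simplicity, whereas the paper updates the matching with a single
    augmentation per added or removed edge."""
    adj = defaultdict(list)
    for u, v, w in edge_list:
        adj[u].append((v, w))
    match_v = {}                                    # v -> (u, weight)

    def augment(u, visited):
        for v, w in adj[u]:
            if v in visited:
                continue
            visited.add(v)
            if v not in match_v or augment(match_v[v][0], visited):
                match_v[v] = (u, w)
                return True
        return False

    for u in adj:
        augment(u, set())
    return [(u, v, w) for v, (u, w) in match_v.items()]

def mine_pairwise_associations(edges, weight_consistent):
    """Slide a window over edges sorted by relative distance D (Eq. 1a) and
    collect weight-consistent maximum matchings as candidate associations."""
    edges = sorted(edges, key=lambda e: e[2][0])    # non-descending by D
    window, candidates, last_ok = [], [], []
    for e in edges:
        window.append(e)                            # grow: next lightest edge
        matching = maximum_matching(window)
        if weight_consistent(matching):
            last_ok = matching
            continue
        if last_ok:                                 # consistency just broke:
            candidates.append(last_ok)              # previous matching is a
            last_ok = []                            # candidate model
        while window:                               # shrink from the light end
            window.pop(0)
            matching = maximum_matching(window)
            if weight_consistent(matching):
                last_ok = matching
                break
    if last_ok:
        candidates.append(last_ok)                  # final window's model
    return candidates

# e.g.: weight_consistent = lambda m: mse_weight_consistent(
#           [w for _, _, w in m], link_distance, phi=0.5)
```

This sketch omits duplicate elimination and the explicit maximality check on the returned candidates, which the full algorithm would also need.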

4. Composition of Patterns with Multiple Associations

Based on the set of associations Ψ obtained by searching all pairwise associations between clusters, we would like to build visual patterns with more complex semantics that consist of multiple primitives.

We turn the primitives from the image set {I_m} and the set of associations Ψ into an undirected labeled graph G = ⟨V, E⟩. Each vertex is represented by a 2-tuple v_i = (i, β), where i indexes the primitive f_i and the label β denotes the cluster C_β with f_i ∈ C_β. Each edge is represented by a 3-tuple e_{i,j} = (i, j, γ), where i and j index the primitives f_i and f_j, respectively, and γ is the edge label. The edge e_{i,j} is created if there exists an association ψ_{k,t,τ} ∈ Ψ (here τ differentiates a particular association from the others between C_k and C_t in the multi-modal case) such that the pair (f_i, f_j) ∈ ψ_{k,t,τ}; its edge label is γ = Υ(k, t, τ), where the function Υ(·) maps the 3-tuple (k, t, τ) to a unique identifier. Consequently, all pairs in an association ψ_{k,t,τ} share the same edge label γ.

An instance of a pattern composed of multiple primitives and associations is now embedded as a subgraph in G, and mining maximal frequent subgraphs in G finds all frequent patterns. If a subgraph is a clique, i.e., there exists an edge between every pair of nodes in the subgraph, we call the corresponding pattern a strong pattern, since all pair-wise associations are consistent across many instances of the pattern. Otherwise, we call it a weak pattern: although there exists a set of associations that connects the elements of the pattern together, some pair-wise associations might not be preserved across sufficiently many copies of the pattern. This can happen due to the constraints of the association model, an artifact of the pattern, or missing features. In our case, we are interested in finding both strong and weak patterns. Frequent subgraph mining in both a single graph and multiple graphs has been well studied, e.g., [20, 11], so we do not elaborate further on how the subgraph mining algorithm is applied.
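As a sketch of this construction (with illustrative data structures, not the authors' interfaces), the labeled graph could be assembled as follows; the final mining call is a placeholder for an off-the-shelf maximal frequent-subgraph miner such as a gSpan-style algorithm [20, 11].

```python
def build_labeled_graph(primitives, associations):
    """Turn primitives and mined pair-wise associations into an undirected
    labeled graph G = <V, E>.

    primitives   : list of dicts with a 'cluster' field (vertex label beta)
    associations : dict mapping (k, t, tau) -> set of index pairs (i, j)
    """
    # Vertex label beta = cluster id of primitive i.
    vertex_labels = {i: p['cluster'] for i, p in enumerate(primitives)}
    # Edge label gamma = Upsilon(k, t, tau); the tuple itself serves here as
    # the unique identifier.
    edge_labels = {}
    for (k, t, tau), pairs in associations.items():
        for i, j in pairs:
            edge_labels[frozenset((i, j))] = (k, t, tau)
    return vertex_labels, edge_labels

# vertex_labels, edge_labels = build_labeled_graph(primitives, associations)
# patterns = frequent_subgraph_mining(vertex_labels, edge_labels, min_support=3)
#   ^ placeholder call: any maximal frequent-subgraph miner can be plugged in;
#     cliques among the results are "strong" patterns, the rest are "weak".
```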

5. Experiments

In this section, we demonstrate the utility of the proposed algorithm for unsupervised learning of visual semantics from both a single image containing many repeated patterns and an image database with repeated patterns. Given the extensive amount of labeling required in the figures, we rely on color to differentiate different groups and depict composite patterns; please refer to the electronic PDF copy for better viewing.

5.1. Evaluation on Single Images

Repeated visual patterns can be found in a single image in which multiple copies of patterns are embedded. Besides the simple pencil example shown in Figure 3, our first experiment tested the algorithm on a section of shelves in a grocery store. Due to space limitations, we only show a small segment of a long panorama image [1] in Figure 4(a). Repeated patterns representing different snack boxes are present in the image, but their locations are random. Our goal is to discover the set of features that can

represent or summarize each pattern in the image. There are 14,288 SIFT features extracted from the original image shown in Figure 4(a). These features are further clustered into 1,000 clusters using k-means. Associations are then searched between every pair of feature clusters and are used to label the edges between features in the image. Frequent subgraph mining is then applied to discover complex patterns with consistent structural relationships, following the method detailed in Section 4, as shown in Figure 4(b)(c). Note that most of the discovered patterns in each dataset are strong patterns or near-strong patterns (almost cliques), meaning that all (or most) pair-wise associations among the clusters within a pattern are present. It is also worth mentioning that our algorithm automatically discovers the global occurrences of patterns independent of their spatial locations and neighborhoods. For example, the snack box "CHEEZE IT" appears at several locations in the image, and these instances are identified because they share consistent structural patterns. Approaches based on the spatial random partition technique [21] may find some of the patterns, but there is no guarantee of detecting and locating the repeated sub-images in each retrieval. Using our approach, we can successfully and efficiently find most of the common image parts.

Our next experiment concerns repeated patterns in buildings. Recent work on identifying patterns in images typically finds only simple and highly repetitive patterns [19]. However, as shown in Figure 5(a), structures such as windows, doors, and sculptures can be versatile and vary greatly in length and shape; the approach of [19] fails on this image. Our approach, however, is able to detect almost all repeated structural patterns in the building facade, as shown in Figure 5(b)(c). In this image, there are 3,427 SIFT features detected, and 200 clusters are computed. Independent of spatial proximity, our approach discovers long associations, as shown by the repeated long windows in the middle part of the image (the red and blue patterns; the third one in Figure 5(c)). For both of these images, our method extracts almost all of the embedded structural patterns without any prior knowledge of the location, number, or size of the patterns.

5.2. Evaluation on a Face Database

Beyond single images, our method can easily be adapted to mine patterns across multiple images. Here we demonstrate the results by applying the algorithm to the face dataset from the Caltech-101 database [12] (435 images of 23 persons). Figure 6 shows a few examples of the discovered visual patterns together with their precision and recall scores, where precision = positive detects / (positive detects + false detects) and recall = positive detects / (positive detects + missed detects).

The face dataset is more challenging due to the weak textures and the individual differences of human faces. Consequently, the repeatability of SIFT features is quite limited and the ambiguity of the SIFT descriptor is higher than in the experiments above. The high precision of each pattern indicates that it captures important facial semantics. Still, the recall of each pattern is not very high, especially for the more complicated structural patterns such as the one shown in the fourth row, because the chance of missing features is too high to support the complex patterns. Each of the discovered patterns contains a long-range structure and indeed carries meaningful semantics. For example, the first row contains a pattern composed of the forehead and the mouth. The second row represents a pattern consisting of the forehead and two eyes. The third row depicts a pattern linking the forehead, two eyes, and the mouth. The fourth row shows a pattern with the forehead, nose, left eye, and mouth; notice that there exists a clique linking the left eye, nose, and mouth. Therefore, our method is able to detect more robust long-range patterns compared to the results demonstrated in [22], where only local patterns, such as features localized at eye corners, can be detected.

6. Conclusions and Future Work

This paper presented a novel framework to find semantically meaningful frequent patterns in unstructured images. In contrast to previous visual pattern mining approaches that translate an image into a transactional database based on heuristic rules (such as spatial proximity), we regard spatial coherence as the major criterion for linking different visual clusters or primitives into meaningful patterns. Toward that goal, we developed a set of novel and efficient algorithms to find optimal pair-wise matches among all detected visual clusters. This allows us to find visual patterns of all sizes, from small ones whose parts are close by to large ones whose parts are relatively far apart. As demonstrated by the results, our approach is robust in finding meaningful visual patterns from a variety of images.

Looking into the future, we plan to utilize the semantic information to perform object extraction and modeling. In particular, it can be used to automatically recover partially occluded regions, as long as they are part of repeated visual patterns. The discovered repetitive patterns can also be used for compression; we already see some progress in this regard. In [19], Wang et al. use a brute-force method to find repeated patches within a single image for compression. While some very good results have been obtained, their method does not scale up. Our efficient method will enable compression on a large scale and potentially increase the compression ratio.

Figure 4. Visual patterns discovered in the grocery image. (a) is the input image. (b) shows the frequent visual patterns marked in different colors according to the actual items they belong to, for better viewing. (c) shows some patterns with magnification and the corresponding support values (3, 5, 7, 11, and 4). This figure is best viewed in color and with magnification.

Figure 5. Visual patterns discovered in an image of a building facade. (a) is the input image. (b) shows the frequent visual patterns in different colors. (c) highlights the patterns with magnification and reports the corresponding support values (5, 3, 3, and 4).

References

[1] A. Agarwala, M. Agrawala, M. Cohen, D. Salesin, and R. Szeliski. Photographing long scenes with multi-viewpoint panoramas. ACM Trans. Graph., 25(3):853–861, 2006.
[2] N. Ahuja and S. Todorovic. Extracting texels in 2.1D natural textures. In ICCV, 2007.

[3] G. Carneiro and A. D. Jepson. Flexible spatial configuration of local image features. PAMI, 29(12):2089–2104, 2007.
[4] G. Carneiro and D. Lowe. Sparse flexible models of local features. In ECCV, pages 29–43, 2006.
[5] P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial structures for object recognition. IJCV, pages 55–79, 2005.

Figure 6. Visual patterns discovered in the face dataset. Each row shows the same pattern across different images; the corresponding precision and recall scores are, from the first to the fourth row, 87%/30%, 94%/26%, 98%/18%, and 100%/13%.

[6] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In CVPR, 2003.
[7] R. Fergus, P. Perona, and A. Zisserman. A sparse object category model for efficient learning and exhaustive recognition. In CVPR, 2005.
[8] J. H. Hays, M. Leordeanu, A. A. Efros, and Y. Liu. Discovering texture regularity as a higher-order correspondence problem. In ECCV, 2006.
[9] M. Jamieson, A. Fazly, S. Dickinson, S. Stevenson, and S. Wachsmuth. Learning structured appearance models from captioned images of cluttered scenes. In ICCV, pages 1–8, Oct. 2007.
[10] S. Kim, X. Jin, and J. Han. SpaRClus: Spatial relationship pattern-based hierarchical clustering. In SDM, pages 49–60, 2008.
[11] M. Kuramochi and G. Karypis. Finding frequent patterns in a large sparse graph. Data Min. Knowl. Discov., 11(3):243–271, 2005.
[12] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In CVPR 2004, Workshop on Generative-Model Based Vision, 2004.
[13] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.

[14] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In CVPR, 2006.
[15] M. Pauly, N. J. Mitra, J. Wallner, H. Pottmann, and L. Guibas. Discovering structural regularity in 3D geometry. ACM Transactions on Graphics, 27(3), 2008.
[16] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In CVPR, 2007.
[17] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, 2003.
[18] T. Tuytelaars, A. Turina, and L. V. Gool. Noncombinatorial detection of regular repetitions under perspective skew. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(4):418–432, 2003.
[19] H. Wang, Y. Wexler, E. Ofek, and H. Hoppe. Factoring repeated content within and among images. ACM Transactions on Graphics, 27(3):1–10, 2008.
[20] X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. In ICDM, page 721, Washington, DC, USA, 2002. IEEE Computer Society.
[21] J. Yuan and Y. Wu. Spatial random partition for common visual pattern discovery. In ICCV, pages 1–8, 2007.
[22] J. Yuan, Y. Wu, and M. Yang. From frequent itemsets to semantically meaningful visual patterns. In KDD '07: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 864–873, New York, NY, USA, 2007. ACM.