Image Segmentation as Learning on Hypergraphs Lei Ding Computer Science and Engineering The Ohio State University Columbus, OH 43210, USA
[email protected]
Abstract
In this paper, we propose to use hypergraphs as the model for images and pose image segmentation as a machine learning problem in which some pixels (called seeds) are labeled as object or background. Using the seed pixels, our method predicts the labels of all unlabeled pixels. We present the relations of the proposed method to other hypergraph based learning techniques, give an adaptive procedure for constructing image hypergraphs, and achieve promising results on a real image dataset.
1 Introduction
Image segmentation, the process of partitioning an image into meaningful segments corresponding to objects and background, is important to many computer vision and multimedia applications. Although a human can delineate object boundaries with ease, image segmentation is a time-consuming and error-prone process for a computer. Despite many years of research, purely unsupervised image segmentation techniques without human input are still not able to produce satisfactory results. Therefore, there has been a lot of research on semi-supervised or interactive segmentation techniques such as snakes [14], intelligent scissors [15], and interactive graph cuts [5]. Our paper is in line with a recent direction of semi-supervised image segmentation that leverages developments in graph-based machine learning [12, 21, 11]. Since image segmentation can be treated as a labeling problem, it shares the basic assumption of graph-based function estimation, i.e., that the target function to estimate is smooth with respect to the affinity graph constructed from the data, so state-of-the-art graph-based learning methods can be applied to image segmentation. Previous authors have introduced various methods for graph-based learning; harmonic energy minimization [23], Laplacian Eigenmaps [1], and Laplacian regularized regression [2] are some examples.
Alper Yilmaz Photogrammetric Computer Vision Lab. The Ohio State University Columbus, OH 43210, USA
[email protected]
Different from standard graph-based approaches, in this paper we use the hypergraph, a generalization of the graph, as the underlying model for images. Compared with graphs, which model pairwise pixel similarity, hypergraphs model patches in an image; when used with superpixels [16], they naturally define a hierarchical image segmentation framework that respects both hard and soft constraints. In particular, we consider the hypergraph Laplacian matrices, by which graph-based image segmentation algorithms can be redefined on hypergraphs. At the theoretical level, we prove the convergence of an iterative update method to hypergraph based interpolation and also make a connection to the well-known normalized-cut algorithm. It is worthwhile to highlight several desirable aspects of hypergraph based image segmentation: (1) different from the commonly used pixel-wise similarity, hypergraphs consider patch based homogeneity, which is arguably more meaningful; (2) hypergraphs enable a user to directly specify soft constraints to be considered as edges (called hyperedges); (3) hypergraphs suit superpixels, which are not defined on a uniform grid, and the two combined have favorable computational efficiency and enable a hierarchical segmentation scheme. We organize the paper as follows: we introduce the hypergraph-based image model in Section 2, discuss learning on hypergraphs in Section 3, and present experimental results in Section 4. We finally conclude the paper with future directions.
2 Hypergraph as Image Representation
In this section we introduce hypergraphs, a natural generalization of graphs, and describe our procedure for constructing hypergraph based image models, followed by the hypergraph Laplacian matrices, which play a role for hypergraphs similar to that of the graph Laplacian matrices for traditional graphs.
Figure 1. An example hypergraph with 8 vertices and 3 hyperedges. V = {v1, v2, v3, v4, v5, v6, v7, v8}, E = {e1, e2, e3} = {{v1, v2}, {v2, v3, v4, v6, v7}, {v6, v7, v8}}.
2.1 Hypergraph
A hypergraph [3] generalizes a graph: its edges can connect any number of vertices. Formally, a hypergraph is a pair (V, E), where V is a set of vertices and E is a set of non-empty subsets of V called hyperedges; therefore E is a subset of P(V), the power set of V. See Figure 1 for an illustration. We present several useful definitions as follows. A hypergraph G can be represented by a |V| × |E| matrix H called the incidence matrix of G, with H(v, e) = 1 if v ∈ e and H(v, e) = 0 otherwise. A hypergraph G can be weighted, where an edge e has a weight w(e), and W is the diagonal matrix containing the weights of the hyperedges. For a vertex v ∈ V, its degree is defined as

d(v) = \sum_{e \in E : v \in e} w(e),

and D_v is the diagonal matrix containing the vertex degrees. For a hyperedge e ∈ E, its degree is defined as δ(e) = |e|, the cardinality of e as a finite set; D_e is the diagonal matrix containing the hyperedge degrees. Thus hypergraphs go beyond traditional graphs: an edge in a traditional graph defines a binary relation between two vertices, while a hyperedge defines a relation among a number of vertices. In image segmentation, a traditional graph edge models pairwise pixel similarity, whereas a hyperedge corresponds to an image patch.
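The incidence matrix and the degree matrices defined above are straightforward to compute. As a minimal sketch (our illustration, not code from the paper), here is the hypergraph of Figure 1 in numpy, with 0-based vertex indices:

```python
import numpy as np

# Hypergraph from Figure 1: 8 vertices, 3 hyperedges.
# E = {e1, e2, e3} = {{v1,v2}, {v2,v3,v4,v6,v7}, {v6,v7,v8}}
hyperedges = [[0, 1], [1, 2, 3, 5, 6], [5, 6, 7]]   # 0-based vertex indices
weights = np.array([1.0, 1.0, 1.0])                 # w(e); unit weights here

n_v, n_e = 8, len(hyperedges)
H = np.zeros((n_v, n_e))                            # |V| x |E| incidence matrix
for j, e in enumerate(hyperedges):
    H[e, j] = 1.0

W = np.diag(weights)                      # diagonal hyperedge weight matrix
d_v = H @ weights                         # d(v) = sum of w(e) over e containing v
delta_e = H.sum(axis=0)                   # delta(e) = |e|, hyperedge degrees
Dv, De = np.diag(d_v), np.diag(delta_e)

print(d_v)       # [1. 2. 1. 1. 0. 2. 2. 1.]  (v5 belongs to no hyperedge)
print(delta_e)   # [2. 5. 3.]
```

Note that v5 (index 4) is isolated in this example, so its degree is 0; in the image model every superpixel generates its own hyperedges, so isolated vertices do not occur there.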
2.2 Hypergraph as Image Model
Graph-based image segmentation methods depend on neighborhood graphs constructed from images based on pairwise (distance, brightness, color) similarity. In our hypergraph-based approach, several hyperedges are constructed around each pixel, containing nearby pixels, and
Figure 2. Fine-scale superpixels overlaid on the penguin image from GrabCut [18].
their weights are assigned based on the homogeneity of pixel colors in the hyperedge. Our hypergraph image model utilizes the concept of superpixels [16] from the computer vision literature: groups of pixels formed in a preprocessing stage that are local, coherent, and preserve most of the structure necessary for segmentation at the scale of interest. The motivations for this grouping are: (1) pixels are merely a consequence of the discrete representation of images; (2) the number of pixels is typically very high, which makes optimization at the level of pixels intractable. The normalized cuts algorithm [19] is used to produce the superpixel map. In Figure 2, fine-scale superpixels are plotted on top of the original image using the code available online at http://www.cs.sfu.ca/~mori/research/superpixels/. We observe from this example that the superpixels are roughly homogeneous in size and shape, which makes a hierarchical treatment of image segmentation possible: from the finest-scale pixels to superpixels (hard label constraints), then to hyperedges (soft label constraints) that merge the superpixels, and finally to the learned meaningful segments. Although some structures may get lost when we use superpixels, they are usually minor details.
The critical aspects of a hypergraph are the construction of hyperedges and the weight of each hyperedge. Ideally, hyperedges correspond to the sets of superpixels that we want to combine. We derive our image model by constructing multiple hyperedges at each superpixel and assigning a weight to each hyperedge based on patch homogeneity, so that image-dependent constraints for combining superpixels are incorporated into our hypergraph image model in a systematic way. Our model has an advantage over the Image Adaptive Neighborhood Hypergraph (IANH) proposed in [6], since IANH considers only one hyperedge at each location, which is more prone to segmentation errors; our model, however, has multiple hyperedges
at each location (superpixel) and assigns a weight to each hyperedge, making the resulting segmentation more stable.
Hyperedge Formation. Let us first introduce some notation. An image hypergraph is G = (V, E, W), with V = {v_S | S is a superpixel} and E = {E_v | v ∈ V}, where E_v is generated at each vertex (superpixel) according to the method described next, and W is the diagonal hyperedge weight matrix. At each superpixel, we wish to combine all of its neighboring superpixels with similar appearance into a hyperedge. Our approach for hyperedge construction uses image structure tensors (see, e.g., [20]) to find the direction of maximum variation in appearance. We define the structure tensor of an image patch P (P is the set of pixel locations) as follows, where I is the image brightness function and p ∈ R²:

ST(P) = \begin{pmatrix} \sum_{p \in P} I_x^2(p) & \sum_{p \in P} I_x I_y(p) \\ \sum_{p \in P} I_x I_y(p) & \sum_{p \in P} I_y^2(p) \end{pmatrix},
which can be seen as an empirical version of E(∇I ∇I^T), where ∇I = [∂I/∂x, ∂I/∂y]^T. If we model an image patch as a random field with the following autocovariance function [10]:

C_P(d) = \sigma_P^2 \exp(-d^T \Sigma_P^{-1} d),   (1)

where d ∈ R² is the spatial lag, then it can be shown that:

ST(P) \propto \sigma_P^2 \Sigma_P^{-1},   (2)
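As an illustration (our own sketch, not the authors' code), the empirical structure tensor can be computed from finite-difference gradients. The patch below varies only in the x direction, so all of the gradient energy lands on the eigenvector along x:

```python
import numpy as np

def structure_tensor(patch):
    """Empirical ST(P): the 2x2 matrix of summed gradient products over the
    pixel locations of a grayscale patch, an estimate of E(grad I grad I^T)."""
    Iy, Ix = np.gradient(patch.astype(float))   # d/dy along rows, d/dx along cols
    return np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                     [np.sum(Ix * Iy), np.sum(Iy * Iy)]])

# A 16x16 ramp that increases only in the x direction.
patch = np.tile(np.arange(16, dtype=float), (16, 1))
ST = structure_tensor(patch)
evals, evecs = np.linalg.eigh(ST)   # eigenvalues in ascending order
# The eigenvector with the larger eigenvalue points along x, the direction of
# maximum variation; the other eigenvalue is zero (no variation along y).
```

In the paper's setting the "patch" P is the set of pixels inside the current ellipse around a superpixel, and the eigen-decomposition of ST(P) drives the shrinking procedure described next.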
which essentially says that the eigenvector direction of ST(P) with the larger eigenvalue is characterized by more variation. Therefore, we construct E_v at vertex v as follows: we start from a circle around the superpixel that v represents, take the superpixels inside as P, and iteratively shrink and rotate the two ellipse axes to align with the eigenvectors of ST(P). In the direction with the large eigenvalue we shrink the axis by a factor α1, and in the direction with the small eigenvalue we shrink by α2, where α1 < α2. At each step t, we get an ellipse L_t(a, b, θ), where a and b are the semi-major and semi-minor axes respectively and θ is the angle to the y coordinate axis. We add a hyperedge:

E_v = E_v \cup \{ v_S \mid \text{center of } S \text{ within } L_t(a, b, \theta) \}.   (3)
We iterate until we hit a single superpixel, and stop.
Weighting. The weight of hyperedge e is defined by the variance of the colors of the pixels constituting the hyperedge. Intuitively, a homogeneous area should be assigned a higher weight than one with large contrast. It is inevitable that the hyperedges formed in the shrinking stage include some that are not reasonably homogeneous; however, our weighting scheme helps to filter out these "bad" hyperedges by assigning them a very small weight. We define the weight as:

W(e) = \exp\left( -\frac{\sum_i (\mathrm{var}\{c_i(e)\})^2}{2\sigma^2} \right),   (4)
where the c_i(·) are the three color functions operating on the pixels that e includes (they can be replaced by other photometric features), and σ is a parameter characterizing the homogeneity of the region of interest. A large σ allows for an object with different color and texture components, while a small σ tends to prohibit them. A hyperedge with large color variance has only a small weight, which means it will have little influence on the optimization problem. The hyperedges with large weights are the important ones, in that the superpixels in them are more likely to share the same segment label. The proposed hypergraph based image model can be summarized as follows:
• (Vertices) Vertex v_i corresponds to superpixel S_i.
• (Hyperedges) At each S_i, form hyperedges that include nearby superpixels iteratively, according to the eigen-directions of structure tensors, all the way down to a single superpixel.
• (Weights) The weight of each hyperedge is determined by the variances of the pixel color values.
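Equation (4) is simple to implement. The following sketch (our illustration, with color values assumed normalized to [0, 1] and a hypothetical σ = 0.1) shows that a homogeneous patch receives a weight near 1 while a high-contrast patch is suppressed:

```python
import numpy as np

def hyperedge_weight(colors, sigma=0.1):
    """Eq. (4): colors is an (n_pixels, 3) array of the color values of the
    pixels covered by hyperedge e; var{c_i(e)} is the per-channel variance."""
    v = colors.var(axis=0)
    return float(np.exp(-np.sum(v ** 2) / (2.0 * sigma ** 2)))

rng = np.random.default_rng(0)
homogeneous = 0.4 + 0.01 * rng.standard_normal((50, 3))       # nearly constant
contrasty = np.vstack([np.zeros((25, 3)), np.ones((25, 3))])  # half black, half white
w_good = hyperedge_weight(homogeneous)
w_bad = hyperedge_weight(contrasty)
# w_good is close to 1; w_bad is close to 0, so the "bad" hyperedge barely
# influences the optimization.
```

The choice of σ plays exactly the role described in the text: increasing it keeps heterogeneous hyperedges in play, decreasing it suppresses them.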
2.3 Hypergraph Laplacian
Laplacian matrices are discrete analogs of the Laplace operator on a manifold [17, 8], and have played important roles in machine learning and spectral image segmentation. Previous authors used two basic graph Laplacian matrices: the unnormalized one and the normalized one. Using the matrices introduced before, we now define the unnormalized Laplacian matrix ∆_un and the normalized Laplacian matrix ∆_n for a hypergraph [9]:

\Delta_{un} = D_v - H W D_e^{-1} H^T,   (5)

\Delta_n = D_v^{-1/2} \Delta_{un} D_v^{-1/2}   (6)
         = I - D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2}.   (7)
The following equalities can be verified:

\sum_{e \in E} \sum_{\{u,v\} \subseteq e} \frac{w(e)}{\delta(e)} (f(u) - f(v))^2 = 2 f^T \Delta_{un} f,

\sum_{e \in E} \sum_{\{u,v\} \subseteq e} \frac{w(e)}{\delta(e)} \left( \frac{f(u)}{\sqrt{d(u)}} - \frac{f(v)}{\sqrt{d(v)}} \right)^2 = 2 f^T \Delta_n f,
which basically measure the “smoothness” of the function f defined on vertices of the hypergraph with respect to the underlying hyperedge system. It can also be seen that both ∆un and ∆n are symmetric and positive semi-definite, and therefore their eigenvectors provide vector space bases.
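To make the definitions concrete, here is a small numerical check (our own sketch) that builds ∆_un for the Figure 1 hypergraph and confirms its symmetry, positive semi-definiteness, and the smoothness identity; we take the inner pair sum over ordered vertex pairs within each hyperedge, which matches the factor of 2:

```python
import numpy as np

# Figure 1 hypergraph with unit weights (0-based vertex indices).
hyperedges = [[0, 1], [1, 2, 3, 5, 6], [5, 6, 7]]
w = np.array([1.0, 1.0, 1.0])
H = np.zeros((8, 3))
for j, e in enumerate(hyperedges):
    H[e, j] = 1.0
Dv = np.diag(H @ w)
De_inv = np.diag(1.0 / H.sum(axis=0))

L_un = Dv - H @ np.diag(w) @ De_inv @ H.T   # Eq. (5)

# Smoothness of a vertex function f: for each hyperedge, sum the squared
# differences over ordered vertex pairs, weighted by w(e)/delta(e).
f = np.array([1.0, 1.0, 0.0, 0.0, 0.3, -1.0, -1.0, -1.0])
smooth = sum(w[j] / len(e) * (f[u] - f[v]) ** 2
             for j, e in enumerate(hyperedges) for u in e for v in e)
# smooth equals 2 * f^T L_un f, and L_un is symmetric positive semi-definite.
```

A function that is nearly constant within every hyperedge makes the left-hand side small, which is exactly the smoothness the segmentation objective rewards.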
3 Learning on Hypergraphs
In this section we first present our hypergraph based semi-supervised learning algorithm and give its random walk interpretation together with an iterative version. We then discuss the relation of our method to hypergraph cuts.
3.1 Hypergraph Based Semi-supervised Learning
We consider image segmentation in a transductive setting, where labels are available for some parts of the objects and their backgrounds, and, using an optimization with a Laplacian matrix, we predict the labels of all other pixels. From the definition of the Laplacian matrices, f^T ∆ f measures the smoothness of the function f with respect to the hypergraph on whose vertices it is defined. We consider the following constrained optimization problem:

f^* = \arg\min_{f : f|_L = f_L} f^T \Delta f,   (8)

where L is the set of locations with known segment labels and f_L is the vector of the known labels. We denote by U the set of unlabeled locations and by f_U the vector of their estimated labels. The graphical version of this problem is studied in [23]; its minimizer satisfies ∆f^* = 0 on the unlabeled points U and equals f_L on the labeled points. With block matrix notation, it follows that:

f_U = -\Delta_{UU}^{-1} \Delta_{UL} f_L.   (9)

For segmentation into a single object and background, +1 and −1 are used in f_L, and points with positive and negative values in f_U are assigned to the object and background respectively. In experiments, we use the normalized Laplacian matrix, since it leads to better performance. Extending hypergraph based learning to multiple classes is straightforward: one only needs to use a separate function f^{(i)} for each class and pick, at a given point, the function with the largest value as its estimated label. Note that although there are various ways of doing segmentation using the hypergraph Laplacian matrices, we choose this particular formulation for its simplicity and robustness. A regularized regression method typically involves a trade-off parameter between the mean squared error and the regularizer; methods using the bottom Laplacian eigenvectors as a basis require deciding the number of such eigenvectors a priori. Computationally, our method is also less expensive.

3.2 A Random Walk Interpretation

The natural random walk on a hypergraph [9] is defined by the probability transition matrix P whose entry p(u, v) is given by

p(u, v) = \sum_{e \in E} w(e) \frac{h(u, e) h(v, e)}{d(u) \delta(e)},

which basically says that there is a nonzero transition probability between two vertices only when they belong to one or more common hyperedges, and that the size of the probability depends on the number of hyperedges they both belong to and their weights. It can be seen that P = D_v^{-1} H W D_e^{-1} H^T. Next we study the update rule given by the iterative formula

f_U^t = P_{UL} f_L + P_{UU} f_U^{t-1},

where f_L and f_U are the known and predicted labels respectively; this can be interpreted as label propagation while clamping the known labels at the labeled points [22]. The following theorem shows what this iteration converges to.

Theorem 1. The above iterative update rule for f_U^t leads to \tilde{f}_U = (I - P_{UU})^{-1} P_{UL} f_L.

Proof: The iterative algorithm leads to:

\tilde{f}_U = \lim_{n \to \infty} \left( \sum_{i=1}^{n} P_{UU}^{i-1} P_{UL} f_L + P_{UU}^{n} f_U^0 \right).   (10)

The second term approaches zero as n goes to ∞, which follows from the fact that the maximum absolute eigenvalue (spectral radius) of P_{UU} is less than 1, and therefore \lim_{n \to \infty} P_{UU}^{n} = 0. Thus we have the convergent \tilde{f}_U = (I - P_{UU})^{-1} P_{UL} f_L.

Theorem 2. The convergent \tilde{f}_U is the same as the graph interpolation result f_U.

Proof: From the interpolation result f_U, we have:

f_U = -\Delta_{UU}^{-1} \Delta_{UL} f_L = -(D_{v,UU} - D_{v,UU} P_{UU})^{-1} (-D_{v,UU} P_{UL}) f_L = (I - P_{UU})^{-1} P_{UL} f_L = \tilde{f}_U.

Based on these results, we can either use graph interpolation for image segmentation or use the equivalent iterative update rule when ∆_{UU} is close to singular.
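The equivalence stated in Theorems 1 and 2 can be verified on a toy hypergraph. The following sketch (our illustration, not code from the paper) computes the harmonic interpolant of Eq. (9) and confirms that clamped label propagation converges to it:

```python
import numpy as np

# A small chain-like hypergraph; vertices 0 and 6 are seeds (+1 object, -1 bg).
hyperedges = [[0, 1, 2], [2, 3, 4], [4, 5, 6]]
w = np.array([2.0, 1.0, 2.0])
H = np.zeros((7, 3))
for j, e in enumerate(hyperedges):
    H[e, j] = 1.0
Dv = np.diag(H @ w)
De_inv = np.diag(1.0 / H.sum(axis=0))
L = Dv - H @ np.diag(w) @ De_inv @ H.T                  # unnormalized Laplacian
P = np.linalg.inv(Dv) @ H @ np.diag(w) @ De_inv @ H.T   # random-walk matrix

lab, unl = np.array([0, 6]), np.array([1, 2, 3, 4, 5])
fL = np.array([1.0, -1.0])

# Eq. (9): direct harmonic interpolation on the unlabeled block.
fU = -np.linalg.solve(L[np.ix_(unl, unl)], L[np.ix_(unl, lab)] @ fL)

# Theorem 1: iterate label propagation with the seeds clamped.
PUU, PUL = P[np.ix_(unl, unl)], P[np.ix_(unl, lab)]
g = np.zeros(len(unl))
for _ in range(2000):
    g = PUL @ fL + PUU @ g
# g converges to (I - PUU)^{-1} PUL fL, which equals fU (Theorem 2).
```

Vertices near the object seed come out positive and those near the background seed negative, so thresholding at zero yields the segmentation.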
3.3 Relation to Hypergraph Cuts
In the context of graph based spectral clustering, there are two popular graph cut methods: the ratio cut [13] and the normalized cut [19]. Approximate optimization of the ratio cut leads to eigenvector computation of the unnormalized graph Laplacian matrix, and that of the normalized cut involves the normalized graph Laplacian matrix. Empirical studies suggest that the normalized cut often leads to better clustering results, especially for image segmentation. The normalized cut approach in a hypergraphical setting solves
Figure 3. Segmentation error rates (percentage of mislabeled pixels in the region to be classified), plotted by image number in three panels: low error rates (below 6%), middle error rates (6%–10%), and high error rates (above 10%). The high-error group comprises images 1, 4, 9, 14, 16, 19, 23, 34, 41, and 42.
the following minimization problem:

R(f) = \frac{\sum_{e \in E} \sum_{\{u,v\} \subseteq e} \frac{w(e)}{\delta(e)} (f(u) - f(v))^2}{\sum_i d_i f(i)^2} = \frac{2 f^T \Delta_{un} f}{f^T D_v f}.

It can be minimized by solving ∆_{un} f = λ D_v f, which can be rewritten as D_v^{-1/2} ∆_{un} D_v^{-1/2} y = ∆_n y = λ y, where y = D_v^{1/2} f. In our method, if we take L = ∅ (no labeled information), the best we can achieve is the nontrivial minimizer of y^T ∆_n y, the eigenvector of ∆_n with the second smallest eigenvalue, which, multiplied by D_v^{-1/2}, produces the result of the normalized hypergraph cut.
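To illustrate the unlabeled case, the following sketch (our own toy example) builds ∆_n for a hypergraph with two tight vertex groups joined by a weak bridging hyperedge; the sign of the second-smallest eigenvector, rescaled by D_v^{-1/2}, recovers the two-way normalized hypergraph cut:

```python
import numpy as np

# Two tight groups {0,1,2,3} and {4,5,6,7}, weakly bridged by hyperedge {3,4}.
hyperedges = [[0, 1, 2], [1, 2, 3], [4, 5, 6], [5, 6, 7], [3, 4]]
w = np.array([2.0, 2.0, 2.0, 2.0, 0.1])
H = np.zeros((8, len(hyperedges)))
for j, e in enumerate(hyperedges):
    H[e, j] = 1.0
Dv_isqrt = np.diag(1.0 / np.sqrt(H @ w))
De_inv = np.diag(1.0 / H.sum(axis=0))
L_n = np.eye(8) - Dv_isqrt @ H @ np.diag(w) @ De_inv @ H.T @ Dv_isqrt  # Eq. (7)

evals, evecs = np.linalg.eigh(L_n)    # ascending eigenvalues; evals[0] is ~0
f = Dv_isqrt @ evecs[:, 1]            # second-smallest eigenvector, rescaled
labels = f > 0
# The sign pattern splits the vertices into the two groups (the overall sign
# of an eigenvector is arbitrary, so either group may come out "positive").
```

With seeds available, the transductive formulation of Section 3.1 is used instead; this spectral fallback only applies when L = ∅.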
4 Experiments
We present quantitative results in Figure 3 and some qualitative segmentation results in Figure 4. We use the Microsoft GrabCut dataset [18], which has been used in previous studies. The dataset provides seed regions that facilitate our experimentation. Our average error rate 7.3% is better than 7.9% achieved in [4]. Note that a good error rate based on subjective ground truth does not necessarily imply good segmentation [7]. For instance, the third and fourth columns of Figure 4 have perceptual quality opposite to their errors.
4.1 Discussion
First, our image segmentation method addresses two kinds of constraints: the hard ones are modeled by superpixels, where we require that constituent pixels take the same
label; the soft ones are modeled by hyperedges, where label consistency is encouraged. Based on our method, one can build an interactive interface in which a user supplies soft constraints (regions likely to come from an object) and the system takes them into consideration by adding new hyperedges to those constructed a priori. Second, it is possible to combine active learning, the framework that allows the learner to ask for informative examples (in our case, seeds), with hypergraph-based image segmentation. There are sometimes regions in an image with varied texture and color, which can be very hard to segment given only a few labeled pixels. Allowing a user to interactively provide segment labels at critical points could greatly enhance the segmentation quality. Third, the efficiency of our approach comes from the combination of superpixels and hypergraphs. The majority of the computational time is spent on calculating the fine-scale superpixel map, for which there are methods in the literature to accelerate the computation. Suppose N is the number of superpixels. The time needed for hypergraph construction is O(N), and the linear coefficient is bounded by the maximum number of hyperedges at one superpixel, which is typically small. Empirically, the segmentation process once superpixels are available takes from 0.5 to 2.5 minutes on a 1.5 GHz Pentium machine; the computation time depends on the size and texture content of the image.
5 Conclusion
In this paper, we have demonstrated the effectiveness of learning on hypergraphs for image segmentation. In particular, we use hypergraph based interpolation and show its equivalence to an iterative procedure based on random walks on hypergraphs. Experimentally, we have achieved competitive results on the GrabCut dataset with our method. We are planning to use more sophisticated probabilistic models to address the natural hierarchy of hard and soft constraints in image segmentation.
References
[1] M. Belkin and P. Niyogi. Semi-supervised learning on Riemannian manifolds. Machine Learning, 56(1-3):209–239, 2004. [2] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399–2434, 2006. [3] C. Berge. Hypergraphs. North-Holland, Amsterdam, 1989. [4] A. Blake, C. Rother, M. Brown, P. Perez, and P. Torr. Interactive image segmentation using an adaptive GMMRF model. In ECCV, 2006.
Figure 4. Top: images with computed object contours; middle: masks (light gray denotes the unlabeled areas); bottom: segmented objects. The first two columns have error rates 4.56% and 3.17% respectively. The third column shows a hard-to-segment image. Our error rate 8.09% is better than 9.15% in [11]. The last column shows the image with the highest error (13.24%); however, the quality is not low, since the misclassified part of the spoiler is perceptually very similar to the background.
[5] Y. Y. Boykov and M.-P. Jolly. Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images. In ICCV, 2001. [6] A. Bretto and L. Gillibert. Hypergraph-based image representation. In Graph-Based Representations in Pattern Recognition, 2005. [7] A. Cavallaro, E. D. Gelasca, and T. Ebrahimi. Objective evaluation of segmentation quality using spatio-temporal context. In ICIP, 2002. [8] F. R. K. Chung. Spectral Graph Theory. Regional Conference Series in Mathematics, Number 92, 1997. [9] D. Zhou, J. Huang, and B. Schölkopf. Learning with hypergraphs: Clustering, classification, and embedding. In NIPS, 2006. [10] O. D'Hondt, L. Ferro-Famil, and E. Pottier. The gradient structure tensor as an efficient descriptor of spatial texture in polarimetric SAR data. In IEEE International Geoscience and Remote Sensing Symposium, 2006. [11] O. Duchenne, J.-Y. Audibert, R. Keriven, J. Ponce, and F. Segonne. Segmentation by transduction. In CVPR, 2008. [12] L. Grady. Random walks for image segmentation. IEEE PAMI, 28(11):1768–1783, 2006. [13] L. Hagen and A. B. Kahng. New spectral methods for ratio cut partitioning and clustering. IEEE Transactions on Computer-Aided Design, 11(9):1074–1085, 1992.
[14] M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. International Journal of Computer Vision, 1(4):321–331, 1987. [15] E. N. Mortensen and W. A. Barrett. Intelligent scissors for image composition. In SIGGRAPH, 1995. [16] X. Ren and J. Malik. Learning a classification model for segmentation. In ICCV, 2003. [17] S. Rosenberg. The Laplacian on a Riemannian Manifold. Cambridge University Press, 1997. [18] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground extraction using iterated graph cuts. In SIGGRAPH, 2004. [19] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE PAMI, 22(8):888–905, 2000. [20] D. Tschumperlé and R. Deriche. Vector-valued image regularization with PDE's: A common framework for different applications. In CVPR, 2003. [21] F. Wang, X. Wang, and T. Li. Efficient label propagation for interactive image segmentation. In ICMLA, 2007. [22] X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 2002. [23] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, 2003.