2010 International Conference on Pattern Recognition

Consistent Estimators of Median and Mean Graph

Brijnesh Jain and Klaus Obermayer
Berlin University of Technology, Germany
{jbj|oby}@cs.tu-berlin.de

Abstract

The median and mean graph are basic building blocks for statistical graph analysis and for unsupervised pattern recognition methods such as central clustering and graph quantization. This contribution provides sufficient conditions for consistent estimators of the true but unknown central points of a distribution on graphs.

1. Introduction

Measures of central tendency such as the sample median and mean of a finite set of graphs find applications in central clustering of graphs [7, 11, 12, 13, 22], graph quantization [16], frequent substructure mining [20], and multiple alignment of protein structures [17]. Because of their elementary importance, first theoretical results on measures of central tendency in the domain of graphs have been established [8, 14, 18]. In addition, methods for approximating sample medians and means have been devised [6, 8, 17, 18]. The proposed algorithms aim at minimizing appropriate empirical distortion functions, such as the sum of distances for the sample median and the sum of squared distances for the sample mean. Since minimizing an empirical distortion function is usually computationally intractable, the ultimate challenge consists in constructing efficient algorithms that return optimal or at least good suboptimal solutions.

From the point of view of statistical pattern recognition, however, the ultimate goal is not to determine a global minimum of an empirical distortion function, but rather to discover the true but unknown central points of a distribution defined on graphs. From this perspective, we may regard empirical measures of central tendency such as the sample median and mean as estimators of the true but unknown central points of the underlying distribution. One gap between statistical and structural pattern recognition is the lack of consistency results for existing estimators of the true medians and means of a distribution of graphs.

This contribution establishes sufficient conditions for consistency of estimators of the true medians and means of a distribution on graphs. The key idea is to regard graphs as points of a Riemannian orbifold in order to gain access to mathematical results on the so-called Fréchet [9] and Karcher [19] central points, as well as to results on stochastic optimization of nonsmooth distortion functions. Orbifolds as considered here are a convenient representation of graph spaces, because they generalize the notion of manifolds to spaces that are locally a quotient of R^n by a finite group action.

This paper is organized as follows: Section 2 represents attributed graphs as points in an orbifold. Section 3 introduces the concepts of Fréchet and Karcher central points and presents consistency results. Section 4 concludes. To keep the treatment simple, we delegate the technical parts of the paper to the appendix.

2. Representation of Attributed Graphs

Let E be a d-dimensional Euclidean space. An attributed graph is a triple X = (V, E, α) consisting of a set V of vertices, a set E ⊆ V × V of edges, and an attribute function α : V × V → E such that α(i, j) ≠ 0 for each edge and α(i, j) = 0 for each non-edge. Attributes α(i, i) of vertices i may take any value from E.

To simplify the mathematical treatment, we assume that all graphs are of order n, where n is chosen sufficiently large. A graph of order m < n can be extended to order n by including isolated vertices with attribute zero. For practical purposes, it is important to note that limiting the maximum order to some arbitrarily large number n and extending smaller graphs to order n are purely technical assumptions that simplify the mathematics. For pattern recognition problems, these limitations have no practical impact: the bound n need not be specified explicitly, and no extension of all graphs to an identical order need actually be performed. When applying the theory, all we require is that the graphs are finite.
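To make the padding convention concrete, the following sketch (in Python, assuming scalar attributes, i.e. d = 1, and a hypothetical helper name) extends the attribute matrix of a graph of order m < n to order n by adding zero-attributed isolated vertices.

```python
import numpy as np

def pad_to_order(A: np.ndarray, n: int) -> np.ndarray:
    """Extend the attribute matrix A of a graph of order m <= n to order n
    by adding isolated vertices with attribute zero.

    A[i, j] holds the attribute of edge (i, j); A[i, i] holds the attribute
    of vertex i; zero off-diagonal entries mark non-edges.
    """
    m = A.shape[0]
    if m > n:
        raise ValueError("graph order exceeds the bound n")
    X = np.zeros((n, n))      # new vertices are isolated with attribute zero
    X[:m, :m] = A             # embed the original graph
    return X

# Example: a 2-vertex graph embedded into order 4.
A = np.array([[1.0, 0.5],
              [0.5, 2.0]])
X = pad_to_order(A, 4)
```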

A graph X is completely specified by its matrix representation X = (x_ij) with elements x_ij = α(i, j) for all 1 ≤ i, j ≤ n. By concatenating the columns of X, we obtain a vector representation x of X. Let X = E^(n×n) be the Euclidean space of all (n × n)-matrices with elements from E, and let T denote a subgroup of the (n × n)-permutation matrices. Two matrices X, X′ ∈ X are said to be equivalent if there is a permutation matrix P ∈ T such that P^T X P = X′. The quotient set

X_T = X/T = {[X] : X ∈ X}

consisting of all equivalence classes [X] is the T-space over the representation space X. A T-space is the simplest form of a Riemannian orbifold. In the following, we identify X with E^N (N = n²) and consider vector rather than matrix representations of abstract graphs. We use capital letters X, Y, Z, ... to denote graphs from X_T and write, by abuse of notation, x ∈ X if x is a vector representation of X. Since E is Euclidean, so is X. By ‖·‖ we denote the Euclidean norm on X.

We consider graph distance functions of the form

d(X, Y) = min { d̃(x, y) : x ∈ X, y ∈ Y },

where d̃ is a distance function on X. This definition includes graph edit distances based on costs for insertion, deletion, and substitution of vertices and edges. A pair (x, y) ∈ X × Y of vector representations is called an optimal alignment if d(X, Y) = d̃(x, y).

An important example of a graph distance is the intrinsic metric of X_T given by

d*(X, Y) = min { ‖x − y‖ : x ∈ X, y ∈ Y }.

Note that the intrinsic metric is not a fancy construction for analytical purposes; rather, it appears in different guises as a common choice of proximity measure for graphs [3, 4, 10, 23].
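As an illustration, the following sketch evaluates the intrinsic metric d*(X, Y) by brute force; the helper name is hypothetical, attributes are taken to be scalar, and T is assumed to be the full group of (n × n)-permutation matrices. Practical implementations would replace the enumeration by a graph matching heuristic [3, 4, 10, 23].

```python
import numpy as np
from itertools import permutations

def intrinsic_metric(X: np.ndarray, Y: np.ndarray) -> float:
    """d*(X, Y) = min { ||x - y|| : x in X, y in Y }, realized here as the
    minimum Frobenius distance over all simultaneous row/column
    permutations of X. Brute force; feasible only for small n."""
    n = X.shape[0]
    best = np.inf
    for perm in permutations(range(n)):
        P = np.eye(n)[list(perm)]     # permutation matrix for this relabeling
        best = min(best, np.linalg.norm(P.T @ X @ P - Y))
    return best
```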

3. Fréchet and Karcher Central Points

Suppose that (X_T, d) is a metric T-space of graphs over the representation space X. For a given probability measure P_T on the Borel σ-field of X_T, we define the Fréchet function of order p as

F^p(Y) = E_X[d(X, Y)^p] = ∫_{X_T} d(X, Y)^p dP_T(X).

The set F^p of Fréchet central points of order p is the set of all global minima of the Fréchet function F^p. For p = 1, we obtain the set of Fréchet medians, and for p = 2 the set of Fréchet means. Note that the Fréchet medians and means coincide with the graph medians and means of the pattern recognition literature. In a practical setting, it is often more convenient to consider the set of local minima of the Fréchet function F^p. We therefore define the set K^p of Karcher central points of order p as the set of all local minima of F^p. In particular, K^1 is the set of Karcher medians and K^2 is the set of Karcher means.

3.1. Estimating Central Points

Since the distribution P_T is usually unknown and the underlying metric space often lacks sufficient mathematical structure, the Fréchet function F^p can neither be computed nor minimized directly. Instead, we estimate the Fréchet and Karcher central points from empirical data. An important question when estimating central points is that of consistency. We present conditions under which an estimator converges almost surely to the true but unknown central points.

Estimators based on Empirical Measures. The goal is to minimize the Fréchet function F^p, where the probability measure P_T is unknown but an independent and identically distributed random sample X_1, X_2, ..., X_N ∈ X_T is given. For this, we replace the function F^p by the empirical Fréchet function

F̂_N^p(Y) = (1/N) Σ_{i=1}^{N} d(X_i, Y)^p

and approximate a Fréchet central point by a global minimum of the empirical Fréchet function. By F̂_N^p we denote the set of Fréchet sample central points of order p, consisting of all global minima of F̂_N^p.
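To illustrate the empirical Fréchet function, the sketch below evaluates F̂_N^p and, as a simple stand-in for its global minimizer, restricts the search to the sample itself (the set median for p = 1). The function names are hypothetical, and `dist` may be any graph distance, e.g. the brute-force intrinsic metric sketched above.

```python
import numpy as np

def empirical_frechet(Y, sample, dist, p=2):
    """Empirical Fréchet function: (1/N) * sum_i dist(X_i, Y)**p."""
    return np.mean([dist(X, Y) ** p for X in sample])

def set_central_point(sample, dist, p=1):
    """Cheap surrogate estimator: minimize the empirical Fréchet function
    over the sample only (p = 1 gives the set median). Note that a global
    minimum over all of X_T is what Theorem 1 below actually requires."""
    return min(sample, key=lambda Y: empirical_frechet(Y, sample, dist, p))
```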

The next result, due to [1], establishes strong consistency of the Fréchet sample central points.

Theorem 1. Let (X_T, d) be a metric T-space over X. Consider the Fréchet function F^p of a probability measure P_T with compact support. Then the following assertion holds: for any ε > 0, there is a random variable n(ω, ε) ∈ N and a P_T-null set N(ω, ε) such that

F̂_N^p ⊆ F_ε^p = {X ∈ X_T : d(X, F^p) < ε}

outside of N(ω, ε) for all N ≥ n(ω, ε). In particular, if the set F^p = {μ} of Fréchet central points is a singleton μ, then every measurable selection μ̂_N from F̂_N^p is a strongly consistent estimator of μ.

Existing algorithms for estimating the Fréchet median [6, 18] or Fréchet mean [14, 17] are local optimization techniques for which Theorem 1 is inapplicable, because the theorem assumes global rather than local minimizers of the Fréchet function F^p as estimators.

Estimators based on Stochastic Optimization. Stochastic optimization methods directly minimize the Fréchet function F^p using independent and identically distributed random structures X_1, X_2, ..., X_t drawn from X_T. For convenience, we assume that the underlying graph distance function is the intrinsic metric d*(X, Y). Since d*(X, Y)^p is generalized differentiable (Appendix B), the interchange of integral and generalized gradient remains valid for generalized differentiable loss functions, that is,

∂F^p(Y) = E_X[∂d*(X, Y)^p],

under mild assumptions (see [5, 21]). Hence we can minimize the Fréchet function F^p by the following stochastic generalized gradient (SGG) method:

y^{t+1} = y^t + η_t (x^t − y^t),        (1)

where x^t is a vector representation of X_t that is optimally aligned to the vector representation y^t ∈ Y_t. The random elements s^t = x^t − y^t ∈ S_t are vector representations of stochastic generalized gradients S_t, i.e. random variables defined on the probability space (X_T, Σ_T, P_T) such that

E[S_t | Y_0, ..., Y_t] ∈ ∂F^p(Y_t).        (2)

We consider the following conditions for almost sure convergence of stochastic optimization:

A1. The sequence (η_t)_{t≥0} of positive step sizes satisfies lim_{t→∞} η_t = 0, Σ_{t=1}^{∞} η_t = ∞, and Σ_{t=1}^{∞} η_t² < ∞.

A2. The random elements (S_t)_{t≥0} satisfy (2).

A3. E[‖S_t‖²] < +∞.
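The following sketch shows one way the SGG update (1) might look for the Fréchet mean (p = 2), with step sizes η_t = 1/(t + 1) satisfying (A1). The helper names are hypothetical, the alignment is brute force over all relabelings, and the sketch is an illustration rather than a practical implementation.

```python
import numpy as np
from itertools import permutations

def optimal_alignment(X, Y):
    """Matrix representation x of X minimizing ||x - Y||, found by brute
    force over all vertex relabelings (small n only)."""
    n = X.shape[0]
    best, best_x = np.inf, X
    for perm in permutations(range(n)):
        P = np.eye(n)[list(perm)]
        x = P.T @ X @ P
        d = np.linalg.norm(x - Y)
        if d < best:
            best, best_x = d, x
    return best_x

def sgg_mean(draw, y0, steps=1000):
    """SGG iteration (1): y_{t+1} = y_t + eta_t * (x_t - y_t), where x_t is
    optimally aligned to the current iterate y_t and draw() returns an
    i.i.d. sample graph from P_T."""
    y = y0.copy()
    for t in range(steps):
        x = optimal_alignment(draw(), y)
        eta = 1.0 / (t + 1)      # (A1): sum eta = inf, sum eta^2 < inf
        y = y + eta * (x - y)
    return y
```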

The next result shows that the SGG method yields a consistent estimator.

Theorem 2. Let (X_T, d*) be a metric T-space over the Euclidean representation space (X, ‖·‖). Suppose that assumptions (A1)–(A3) hold. Then the sequence (Y_t)_{t≥0} generated by the SGG method converges almost surely to structures satisfying the necessary extremum condition

K_*^p = {Y ∈ X_T : 0 ∈ ∂F^p(Y)}.

Moreover, the sequence (F^p(Y_t))_{t≥0} converges almost surely, and lim_{t→∞} F^p(Y_t) ∈ F^p(K_*^p).

The proof is a direct consequence of Ermoliev and Norkin's theorem [5], applied to the lift d̃_*^p of the graph distance d_*^p and using that d̃_*(x, y)^p → ∞ as ‖y‖ → ∞. Theorem 2 states that the empirical Karcher central points are consistent estimators of the expected Karcher central points. Note that this result extends to any generalized differentiable graph edit distance function. Hence, Theorem 2 justifies existing estimators for the Karcher median and mean of graphs, provided the underlying graph distance is generalized differentiable. Examples of consistent estimators can be found, for instance, in [17].

4. Conclusion

Choosing an appropriate representation of graphs is crucial for accessing an arsenal of mathematical results. In this contribution, we identified graphs with points of a Riemannian orbifold in order to establish sufficient conditions for consistent estimators of medians and means of a distribution on graphs. In retrospect, the proposed consistency results justify existing research on unsupervised pattern recognition in the domain of graphs and suggest how to bridge the gap between statistical and structural pattern recognition, namely by posing learning problems in the domain of graphs as stochastic optimization problems in Riemannian orbifolds. An open problem is to establish consistency results when the underlying graph edit distance is discontinuous.

References

[1] R. Bhattacharya and A. Bhattacharya. Statistics on manifolds with applications to shape spaces. In Perspectives in Mathematical Sciences, ISI, Bangalore, 2008.
[2] J.E. Borzellino. Riemannian geometry of orbifolds. PhD thesis, University of California, Los Angeles, 1992.
[3] T.S. Caetano et al. Learning graph matching. ICCV 2007 Conf. Proc., p. 1–8, 2007.
[4] T. Cour, P. Srinivasan, and J. Shi. Balanced graph matching. NIPS 2006 Conf. Proc., 2006.
[5] Y.M. Ermoliev and V.I. Norkin. Stochastic generalized gradient method for nonconvex nonsmooth stochastic optimization. Cybernetics and Systems Analysis, 34(2):196–215, 1998.
[6] M. Ferrer. Theory and algorithms on the median graph. Application to graph-based classification and clustering. PhD thesis, Universitat Autònoma de Barcelona, 2007.
[7] M. Ferrer et al. Graph-based k-means clustering: A comparison of the set median versus the generalized median graph. CAIP 2009 Conf. Proc., 2009.
[8] M. Ferrer, E. Valveny, and F. Serratosa. Median graphs: A genetic approach based on new theoretical properties. Pattern Recognition, 42(9):2003–2012, 2009.
[9] M. Fréchet. Les éléments aléatoires de nature quelconque dans un espace distancié. Annales de l'Institut Henri Poincaré, 10(3):215–310, 1948.
[10] S. Gold and A. Rangarajan. A graduated assignment algorithm for graph matching. IEEE Trans. PAMI, 18:377–388, 1996.
[11] S. Gold, A. Rangarajan, and E. Mjolsness. Learning with preknowledge: clustering with point and graph matching distance measures. Neural Comp., 8(4):787–804, 1996.
[12] S. Günter and H. Bunke. Self-organizing map for clustering in the graph domain. Pattern Recognition Letters, 23(4):405–417, 2002.
[13] B. Jain and F. Wysotzki. Central clustering of attributed graphs. Machine Learning, 56:169–207, 2004.
[14] B. Jain and K. Obermayer. On the sample mean of graphs. IJCNN 2008 Conf. Proc., p. 993–1000, 2008.
[15] B. Jain and K. Obermayer. Structure spaces. Journal of Machine Learning Research, 10:2667–2714, 2009.
[16] B. Jain and K. Obermayer. Graph quantization. arXiv:1001.0921v1 [cs.AI], 2009.
[17] B. Jain et al. Multiple alignment of contact maps. IJCNN 2009 Conf. Proc., 2009.
[18] X. Jiang, A. Münger, and H. Bunke. On median graphs: Properties, algorithms, and applications. IEEE Trans. PAMI, 23(10):1144–1151, 2001.
[19] H. Karcher. Riemannian center of mass and mollifier smoothing. Communications on Pure and Applied Mathematics, 30:509–541, 1977.
[20] L. Mukherjee et al. Generalized median graphs and applications. Journal of Combinatorial Optimization, 17:21–44, 2009.
[21] V.I. Norkin. Stochastic generalized-differentiable functions in the problem of nonconvex nonsmooth stochastic optimization. Cybernetics, 22(6):804–809, 1986.
[22] A. Schenker et al. Clustering of web documents using a graph model. In Web Document Analysis: Challenges and Opportunities, p. 1–16, 2003.
[23] S. Umeyama. An eigendecomposition approach to weighted graph matching problems. IEEE Trans. PAMI, 10(5):695–703, 1988.

A. Calculus in T-Spaces

This section summarizes calculus in T-spaces. For proofs we refer to [15]. A more detailed treatment of Riemannian orbifolds can be found in [2]. Throughout this section, we assume that X_T is a T-space over the representation space X. The orbifold chart of X_T is the surjective continuous mapping π : X → X_T that projects each vector representation x to its orbit [x].

Orbifold Functions. An orbifold function is a map f : X_T → R. The lift of f is the function f̃ : X → R satisfying f̃ = f ∘ π, where π is the orbifold chart that projects vector representations to graphs. The lift f̃ is invariant under permutations of T, that is, f̃(x) = f̃(φ(x)) for all φ ∈ T. We say that an orbifold function f : X_T → R is continuous (locally Lipschitz, differentiable, generalized differentiable) at X ∈ X_T if its lift f̃ is continuous (locally Lipschitz, differentiable, generalized differentiable) at some vector representation x ∈ X. The definition is independent of the choice of the vector representation that projects to X. For a definition of generalized differentiable functions and their basic properties we refer to Appendix B.

Generalized Gradients. We extend the notions of gradient and generalized gradient to differentiable and generalized differentiable orbifold functions.

Suppose that f : X_T → R is differentiable at X ∈ X_T. Then its lift f̃ : X → R is differentiable at all vector representations that project to X. The gradient ∇f(X) of f at X is defined as the projection ∇f(X) = π(∇f̃(x)) of the gradient ∇f̃(x) of f̃ at a vector representation x ∈ X. This definition is independent of the choice of the vector representation: we have ∇f̃(φ(x)) = φ(∇f̃(x)) for all φ ∈ T, which implies that the gradients of f̃ at x and φ(x) are vector representations of the same structure, namely the gradient ∇f(X) of the orbifold function f at X. Thus, the gradient of f at X is a well-defined structure pointing in the direction of steepest ascent.

Now suppose that f : X_T → R is generalized differentiable at X ∈ X_T. Then its lift f̃ : X → R is generalized differentiable at all vector representations that project to X. The subdifferential ∂f(X) of f at X is defined as the projection ∂f(X) = π(∂f̃(x)) of the subdifferential ∂f̃(x) of f̃ at a vector representation x ∈ X. This definition is again independent of the choice of the vector representation: we have ∂f̃(φ(x)) = φ(∂f̃(x)) for all φ ∈ T, which implies that the subdifferentials ∂f̃(x) ⊆ X and ∂f̃(φ(x)) ⊆ X project to the same subset of X_T, namely the subdifferential ∂f(X).

The properties of generalized differentiable functions listed in Appendix B carry over to generalized differentiable orbifold functions via their lifts. For example, a generalized differentiable orbifold function is locally Lipschitz and thus differentiable almost everywhere.
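As a toy numerical check of these equivariance properties, the following sketch uses the permutation-invariant lift f̃(X) = ‖X‖² (a stand-in example, not a function from the paper) and verifies that f̃(φ(x)) = f̃(x) and ∇f̃(φ(x)) = φ(∇f̃(x)) for a random representation and permutation.

```python
import numpy as np

def lift(X):
    """Lift of the orbifold function f([X]) = ||X||^2; invariant under
    vertex relabelings since permutations preserve the Frobenius norm."""
    return float(np.sum(X ** 2))

def grad_lift(X):
    """Gradient of the lift: grad f~(X) = 2X."""
    return 2.0 * X

n = 3
X = np.random.randn(n, n)     # a matrix representation x of a graph [X]
P = np.eye(n)[[2, 0, 1]]      # a permutation phi in T
Xp = P.T @ X @ P              # another representation of the same graph

assert np.isclose(lift(X), lift(Xp))                          # f~(x) = f~(phi(x))
assert np.allclose(grad_lift(Xp), P.T @ grad_lift(X) @ P)     # gradient equivariance
```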

B. Generalized Differentiable Functions

Let X = R^n be a finite-dimensional Euclidean space. A function f : X → R is generalized differentiable at x ∈ X in the sense of Norkin [21] if there is a multi-valued map ∂f : X → 2^X in a neighborhood of x such that

1. ∂f(x) is a convex and compact set;

2. ∂f(x) is upper semicontinuous at x, that is, if y_i → x and g_i ∈ ∂f(y_i) for each i ∈ N, then each accumulation point g of (g_i) is in ∂f(x);

3. for each y ∈ X there is a g ∈ ∂f(y) with f(y) = f(x) + ⟨g, y − x⟩ + o(x, y, g), where

lim_{i→∞} |o(x, y_i, g_i)| / ‖y_i − x‖ = 0

for all sequences y_i → x and g_i → g with g_i ∈ ∂f(y_i).

We call f generalized differentiable if it is generalized differentiable at each point x ∈ X. The set ∂f(x) is the subdifferential of f at x, and its elements are called generalized gradients. Generalized differentiable functions have the following properties [21]:

1. Generalized differentiable functions are locally Lipschitz and therefore continuous and differentiable almost everywhere.

2. Continuously differentiable, convex, and concave functions are generalized differentiable.

3. Suppose that f_1, ..., f_m : X → R are generalized differentiable at x ∈ X. Then

f_*(x) = min(f_1(x), ..., f_m(x)) and f^*(x) = max(f_1(x), ..., f_m(x))

are generalized differentiable at x ∈ X.

4. Suppose that f_1, ..., f_m : X → R are generalized differentiable at x ∈ X and f_0 : R^m → R is generalized differentiable at y = (f_1(x), ..., f_m(x)) ∈ R^m. Then f(x) = f_0(f_1(x), ..., f_m(x)) is generalized differentiable at x ∈ X. The subdifferential of f at x is of the form

∂f(x) = con{ g ∈ X : g = [g_1 g_2 ... g_m] g_0, g_0 ∈ ∂f_0(y), g_i ∈ ∂f_i(x), 1 ≤ i ≤ m },

where [g_1 g_2 ... g_m] is the (n × m)-matrix with columns g_1, ..., g_m.

5. Suppose that F(x) = E_z[f(x, z)], where f(·, z) is generalized differentiable. Then F is generalized differentiable, and its subdifferential at x ∈ X is of the form ∂F(x) = E_z[∂f(x, z)].
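Property 3 is the one relevant for graph distances: the (squared) intrinsic metric is a pointwise minimum of finitely many smooth functions, one per permutation. A minimal sketch of the corresponding generalized gradient selection is given below; the helper names are hypothetical, and it uses the fact that the gradient of any smooth function attaining the minimum is a valid generalized gradient.

```python
import numpy as np

def min_generalized_gradient(fs, grads, x):
    """For f(x) = min_i f_i(x) with continuously differentiable f_i, the
    gradient of an f_i attaining the minimum at x is a generalized
    gradient of f at x (property 3)."""
    i = int(np.argmin([f(x) for f in fs]))   # index of an active function
    return grads[i](x)

# Example: f(x) = min(||x||^2, ||x - c||^2), the squared distance to the
# two-point set {0, c} -- a toy analogue of the squared intrinsic metric.
c = np.array([1.0, 0.0])
fs = [lambda x: float(x @ x), lambda x: float((x - c) @ (x - c))]
grads = [lambda x: 2.0 * x, lambda x: 2.0 * (x - c)]
g = min_generalized_gradient(fs, grads, np.array([0.2, 0.3]))
```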

