A general framework for adaptive regularization based on diffusion processes on graphs

Arthur D. Szlam
[email protected]
Mathematics Department U.C.L.A., Box 951555 Los Angeles, CA 90095-1555
Mauro Maggioni
[email protected]
Department of Mathematics Duke University, Box 90320 Durham, NC, 27708
Ronald R Coifman
[email protected]
Program in Applied Mathematics, Department of Mathematics, Yale University, Box 208283, New Haven, CT 06510
Abstract

The use of data-dependent kernels has been shown to lead to state-of-the-art results in machine learning tasks, especially in the context of semi-supervised learning. We introduce a general framework for the analysis both of data sets and of functions defined on them. Our approach is based on diffusion operators, tuned not only to the intrinsic geometry of the data, but also to the function being analyzed. Among the many possible applications of this framework, we consider two apparently dissimilar tasks: image denoising and semi-supervised classification. We show that both tasks can be tackled within our framework, conceptually as well as algorithmically. Our results are comparable to, or better than, those obtained with state-of-the-art, application-specific techniques.

Keywords: Diffusion processes, spectral graph theory, image denoising, semi-supervised learning.
1. Introduction

Recently developed techniques in the analysis of data sets and machine learning use the geometry of the data set in order to study functions on it (Coifman and Maggioni, 2006; Coifman et al., 2005a,b; Belkin and Niyogi, 2003a). In particular, the idea of analyzing the data set and functions on it intrinsically has led to novel algorithms with state-of-the-art performance in various problems in machine learning (Szummer and Jaakkola, 2001; Belkin and Niyogi, 2003a; Mahadevan and Maggioni, 2006; Maggioni and Mahadevan, 2006; Mahadevan and Maggioni, 2005).
Table 1: Diffusion Geometry and Analysis

Large time:
  Geometry: Global geometric properties (e.g., connectedness, bottlenecks); clustering, segmentation.
  Analysis: Global properties of functions (e.g., global smoothness); eigenfunctions of the diffusion kernel; global low-pass filtering.

Multi-scale times:
  Geometry: Properties at all scales; hierarchical clustering and organization.
  Analysis: Multiscale analysis of functions, adaptive local approximation; diffusion wavelets, other multi-scale bases; adaptive denoising and compression.

Small time:
  Geometry: Local geometric properties; denoising of data sets.
  Analysis: Local properties of functions; diffusion process; local denoising of functions.
Diffusion analysis and diffusion geometry (Coifman and Lafon, 2004; Coifman and Maggioni, 2006) are novel techniques for the intrinsic analysis of a data set. They are based on the construction of a diffusion, or local averaging operator, K on the data set, dependent on the local, fine-scale geometry of the data set. This diffusion operator K and its powers can be used to study the geometry of the data set and to analyze functions on it. Among other things, diffusion analysis allows one to introduce a notion of smoothness in discrete settings that preserves the relationships between smoothness, sparsity in a “Fourier” basis, and evolution of heat that are well known in Euclidean spaces. At least three types of analysis associated with the diffusion process K are possible:

(i) A global analysis, corresponding to large powers of K. This analysis can be performed most efficiently by using the lowest-frequency eigenvectors of K, and is associated with a generalized Fourier analysis. Projection of a function on the data set onto these lowest-frequency eigenfunctions corresponds to a band-limited projection, or ideal low-pass filter. This type of analysis is classical, for example, in the global geometry of Riemannian manifolds (Bérard et al., 1994) and in the large-time analysis of random walks on graphs, and it has been applied, among other things, to graph layout problems, dimensionality reduction (Belkin and Niyogi, 2003b; Lafon, 2004), clustering (Belkin and Niyogi, 2001), image segmentation (Shi and Malik, 2000a) and semi-supervised learning (Belkin and Niyogi, 2003a).
(ii) A multi-scale analysis, corresponding to a set of increasing powers of K, such as $\{K^{2^j}\}_{j \ge 0}$. This analysis corresponds to multi-scale and classical wavelet analysis. It allows one to analyze functions on the data at different resolutions. This type of analysis has been introduced recently in (Coifman and Maggioni, 2006), and developed further in (Bremer et al., 2004; Maggioni and Coifman, 2006; Szlam et al., 2005; Maggioni et al., 2005). Recent applications include the analysis of document corpora (Coifman and Maggioni, 2006) and Markov Decision Processes (Mahadevan and Maggioni, 2006; Maggioni and Mahadevan, 2006; Mahadevan and Maggioni, 2005; Maggioni and Mahadevan, 2005; Mahadevan et al., 2006).

(iii) A local analysis, corresponding to small powers of K, which can be used to locally smooth the data, as well as any function on the data set. This can be thought of as a local convolution with a low-pass filter (K itself!) which is data-driven. This operation differs in at least two
respects from the low-pass filtering in (i). The Fourier modes $\phi_l$ are global functions, and hence the projection of a function f onto the top eigenvectors of the diffusion operator is affected by global properties of the space and of f, and may destroy important local features of f. The smoothing operators we construct are local and do not correspond to a perfect cut-off filter, in which only the top k eigenfunction coefficients are retained and the remaining ones are set to zero. Using small powers of K corresponds in general to a smooth attenuation of the higher frequencies. Moreover, in applications of (i), k is in general quite small compared to the number of data points (which is equal to the number of possible Fourier modes), and hence (i) is a very low-band projection, while (iii) is in general much wider-band.

We see from the above that K and bases (eigenfunctions or diffusion wavelets) play intertwined roles in the different types of analysis sketched above. In general we can distinguish between analyses based on expansions in basis functions and analyses based only on K (and functions of K, such as its powers). We will discuss both below, in the context of image analysis and denoising as well as semi-supervised learning.

In this work we introduce a general framework for both denoising and semi-supervised learning, using local diffusion operators. One of the main contributions of this work is the observation that the geometry of the space is not the only important factor to be considered: the geometry and the properties of the function f to be studied (denoised/learned) should also affect the smoothing operation. We will therefore modify our data set with features from f, and build K on the modified data set. The result is nonlinear in the sense that it depends on the input function f; on the other hand, on the modified data set, K is linear and very efficiently computable. One could generalize the proposed constructions to various types of processes (e.g., nonlinear diffusions), most simply by updating the modification of the data set as f is smoothed by K.

We will demonstrate these techniques in the context of two problems: the analysis and denoising of images, and semi-supervised learning. We tackle these two seemingly different problems within a single framework. We find it remarkable that the general framework of anisotropic diffusions on graphs links such apparently dissimilar tasks, and leads to algorithms whose performance is comparable to, or better than, that of state-of-the-art, application-specific algorithms.
2. Diffusion on graphs associated with data sets

An intrinsic analysis of a data set, modeled as a graph or a manifold, can be developed by considering a natural random walk K on it (Chung, 1997; Szummer and Jaakkola, 2001; Ng et al., 2001; Belkin and Niyogi, 2001; Zha et al., 2001; Lafon, 2004; Coifman et al., 2005a,b). The random walk allows one to diffuse on the data set, exploring it and discovering clusters and regions separated by bottlenecks. If K represents one step of the random walk, then $K^t$ represents t steps. For an initial condition $\delta_x$, $K^t\delta_x(y)$ represents the probability of being at y at time t, conditioned on starting at x.
2.1 Setup and Notation

We consider the following general situation: the space is a finite weighted graph G = (V, E, W), consisting of a set V (vertices), a subset E (edges) of V × V, and a nonnegative function W : E → $\mathbb{R}^+$ (weights). We say that there is an edge from x to y ∈ V, denoted by x ∼ y, if and only if W(x, y) > 0. Notice that in this work W will usually be symmetric, that is, the edges will be undirected. The techniques we propose, however, do not require this property.
We interpret the weight W(x, y) as a measure of similarity between the vertices x and y. A natural filter acting on functions on V can be defined by normalization of the weight matrix as follows: let
$$d(x) = \sum_{y \in V} W(x, y),$$
and let¹ the filter be
$$K(x, y) = d^{-1}(x)\, W(x, y), \qquad (1)$$
so that $\sum_{y \in V} K(x, y) = 1$, and so that the multiplication Kf of a vector from the left is a local averaging operation, with locality measured by the similarities W. There are other ways of defining averaging operators. For example, one could consider the heat kernel $e^{-tL}$ (where L is defined in (7), see (Chung, 1997)), or a bi-Markov matrix similar to W (see Sinkhorn (1964); Sinkhorn and Knopp (1967); Soules (1991); Linial et al. (1998); Amnon Shashua and Hazan (2005); Zass and Shashua (2005)). In general K is not column-stochastic², but the operation fK of multiplication on the right by a (row) vector can be thought of as a diffusion of the vector f. This filter can be iterated several times by considering the power $K^t$. From the point of view of the diffusion process, this corresponds to taking t steps of the random walk, which, by the Markov property, has transition probabilities $K^t$.
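To make the normalization in (1) concrete, the following minimal sketch (not taken from the paper; the use of NumPy and the function names are our own illustrative choices) builds $K = D^{-1}W$ from a precomputed symmetric weight matrix W and applies t steps of the resulting random walk to a function f on the vertices.

```python
import numpy as np

def diffusion_operator(W):
    """Row-normalize a nonnegative weight matrix: K(x, y) = d^{-1}(x) W(x, y),
    with d(x) = sum_y W(x, y) and the convention d^{-1}(x) = 0 for isolated vertices."""
    d = W.sum(axis=1)
    d_inv = np.zeros_like(d, dtype=float)
    nz = d > 0
    d_inv[nz] = 1.0 / d[nz]            # avoid division by zero for isolated vertices
    return d_inv[:, None] * W          # each non-isolated row of K sums to 1

def smooth(K, f, t=1):
    """Apply t steps of the random walk to a function f on the vertices: K^t f."""
    for _ in range(t):
        f = K @ f
    return f
```

Multiplying f on the left by K averages it locally; iterating the multiplication corresponds to the t-step transition probabilities $K^t$ described above.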
2.2 Graphs associated with data sets

Let a data set $X = \{x_i\}_{i \in I}$ be given. From a data set we will construct a graph G: the vertices of G are the data points in X, and weighted edges are constructed that connect nearby data points, with a weight that measures the similarity between data points. The first step is to define local similarities. This is a step which is data- and application-dependent. It is important to stress the attribute local. Similarities between far-away data points are not required, and are deemed unreliable, since they would not take into account the geometric structure of the data set. Local similarities are assumed to be more reliable, and non-local similarities will be inferred from local similarities through diffusion processes on the graph.
2.2.1 Local Similarities

Local similarities are collected in a (possibly infinite) matrix W, whose rows and columns are indexed by X, and whose entry W(x, y) is the similarity between x and y. In the examples we consider here, W will usually be symmetric, that is, the edges will be undirected, but these assumptions are not necessary. If the data set lies in $\mathbb{R}^d$, or in any other metric space with metric ρ, then the most standard construction is to choose a number σ and let
$$W_\sigma(x, y) = h\!\left(\frac{\rho(x, y)^2}{\sigma}\right), \qquad (2)$$
for some function h with, say, exponential decay at infinity. A common choice is h(a) = exp(−a). The idea is that we expect very close data points (with respect to ρ) to be similar, but do not want to assume that far-away data points are necessarily different.

1. Note that d(x) = 0 if and only if x is not connected to any other vertex, in which case we trivially define $d^{-1}(x) = 0$, or simply remove x from the graph.
2. In particular cases K is a scalar multiple of a column-stochastic matrix, for example when D is a multiple of the identity, which happens, for example, if G is regular and all the edges have the same weight.
Let D be the diagonal matrix with entries given by d. Suppose the data set is, or lies on, a manifold in Euclidean space. In Lafon (2004) (see also (Belkin, 2003; von Luxburg et al., 2004; Singer, 2006)), it is proved that in this case the choice of h in the construction of the weight matrix is in some asymptotic sense irrelevant. For a rather generic symmetric function h, say with exponential decay at infinity, $(I - D_\sigma^{-\frac{1}{2}} W_\sigma D_\sigma^{-\frac{1}{2}})/\sigma$ approaches the Laplacian on the manifold, at least in a weak sense, as the number of points goes to infinity and σ goes to zero. Thus this choice of weights is related to the heat equation on the manifold, and is very natural. On the other hand, for many data sets, which either are far from asymptopia or simply do not lie on a manifold, the choice of weights can make a large difference and is not always easy. Even if we use Gaussian weights, the choice of the “local time parameter” σ can be hard; this will be discussed more below.

In some applications (e.g., the analysis of groups of linked web pages) we are already presented with a data set carrying a graph structure that it is natural to preserve. In this situation we may consider metrics ρ on the data which are induced by the graph structure, and use these in the construction of the similarities described above. Let $W_G$ be the weights on the given graph G. One may consider the diffusion metric associated with the random walk on G. Or one may consider the geodesic distance
$$\rho(x, y) = \inf_{\gamma} \sum_{(u, v) \in \gamma} w(u, v),$$
where the infimum is taken over paths γ on the graph connecting x to y. The exponential weight h in (2) gives a large preference to very close points. The measure of distance obtained by integrating over paths, called a “diffusion metric”, is often more robust to misplaced data points than the original metric. See (Lafon, 2004; Lafon and Lee, to appear, 2006) for a more complete discussion of these ideas.

For each x, one usually limits the maximum number of points y such that W(x, y) ≠ 0 (or is non-negligible). Two common modifications of the construction above are to use either $\rho_\epsilon(x, y)$ or $\rho_k(x, y)$ instead of ρ, where
$$\rho_\epsilon(x, y) = \begin{cases} \rho(x, y) & \text{if } \rho(x, y) \le \epsilon, \\ +\infty & \text{if } \rho(x, y) > \epsilon, \end{cases}$$
and where usually ε is chosen so that $h(\epsilon^2/\sigma) \approx 0$.

[...] this corresponds to letting $\tilde{f} = K^t(f)$, i.e., kernel smoothing on the data set with a data-dependent kernel. We can interpret $K^t f$ as evolving the diffusion process on the graph with an initial condition specified by f. Note that typically K is sparse (i.e., it is an N × N matrix with O(N) entries), and hence this smoothing involves only O(tN) computations.
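The sketch below (again an illustration, not code from the paper) combines the pieces described so far: it builds the Gaussian weights of (2) with h(a) = exp(−a) on Euclidean data, applies a k-nearest-neighbor truncation in the spirit of $\rho_k$, and then smooths a noisy function by $\tilde{f} = K^t f$ using the diffusion_operator and smooth helpers from the earlier sketch. The parameters σ, k, and t, and the availability of SciPy, are assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist

def gaussian_knn_weights(X, sigma, k):
    """W_sigma(x, y) = exp(-||x - y||^2 / sigma), kept only between k-nearest
    neighbors and then symmetrized so that the resulting graph is undirected."""
    D2 = cdist(X, X, metric="sqeuclidean")        # pairwise squared distances rho(x, y)^2
    W = np.exp(-D2 / sigma)
    keep = np.argsort(D2, axis=1)[:, : k + 1]     # each point keeps itself and its k nearest neighbors
    mask = np.zeros_like(W, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    W[~mask] = 0.0
    return np.maximum(W, W.T)                     # symmetrize

# Illustrative usage on a point cloud X with noisy function values f_noisy:
# W = gaussian_knn_weights(X, sigma=0.5, k=10)
# K = diffusion_operator(W)
# f_tilde = smooth(K, f_noisy, t=3)               # f_tilde = K^t f
```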
One can also consider families of nonlinear denoising techniques, of the form
$$\tilde{f} = \sum_i m(\langle f, \phi_i \rangle)\, \phi_i, \qquad (10)$$
where for example m is a (soft-)thresholding function (see, e.g., (Donoho and Johnstone, 1994)). While these techniques are classical and well understood in Euclidean space (mostly in view of applications to signal processing), it is only recently that research in their application to the analysis of functions on data sets has begun (in view of applications to learning tasks). One of the main assumptions is that the data has some geometry, and that functions of interest on the data are smooth with respect to that geometry. The algorithms we just described use the geometry to construct the kernel K, and the analysis is performed with respect to the notion of smoothness associated with this kernel. We refer the reader to (Smola and Kondor, 2003; Chapelle et al., 2006) and references therein. All of these techniques clearly have a regularization effect. This can be easily measured in terms of the Sobolev norm defined in (8): the methods above correspond to removing or damping the components of f (or f + η) in the subspace spanned by the high-frequency $\phi_i$, which are the ones with larger Sobolev norm.
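As a deliberately simplified illustration of (10), the sketch below uses the eigenvectors of the symmetric matrix $D^{-1/2} W D^{-1/2}$ (which is conjugate to K and shares its spectrum) in place of the $\phi_i$, and applies soft-thresholding to the expansion coefficients. The choice of basis normalization and of the threshold τ are our own assumptions, not prescriptions from the paper.

```python
import numpy as np

def soft_threshold(c, tau):
    """m(c) = sign(c) * max(|c| - tau, 0): the standard soft-thresholding nonlinearity."""
    return np.sign(c) * np.maximum(np.abs(c) - tau, 0.0)

def denoise_by_thresholding(W, f, tau):
    """Expand f in the eigenbasis of D^{-1/2} W D^{-1/2}, shrink the coefficients,
    and reconstruct: f_tilde = sum_i m(<f, phi_i>) phi_i, as in equation (10)."""
    d = W.sum(axis=1)
    d_inv_sqrt = np.zeros_like(d, dtype=float)
    nz = d > 0
    d_inv_sqrt[nz] = d[nz] ** -0.5
    S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]   # symmetric conjugate of K
    _, phi = np.linalg.eigh(S)                          # columns are orthonormal eigenvectors phi_i
    coeffs = phi.T @ f                                  # <f, phi_i>
    return phi @ soft_threshold(coeffs, tau)
```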
3.2 Data and Function-dependent kernels

It is possible to go even further, by incorporating the geometry of the function f (or of a family of functions F) in the construction of the kernel K. The way the function f affects the construction of K will be application- and data-specific, as we shall see in the applications to image denoising and semi-supervised learning. The basic idea is that, in order to smooth a function, we should take averages between points where f has similar structure. One could let
$$W^f(x, y) = e^{-\frac{\|x - y\|^2}{\sigma_1} - \frac{|f(x) - f(y)|^2}{\sigma_2}}. \qquad (11)$$
So for example when $\sigma_2$