Geometry on Probability Spaces

Steve Smale
Toyota Technological Institute at Chicago
1427 East 60th Street, Chicago, IL 60637, USA
E-mail: [email protected]

Ding-Xuan Zhou
Department of Mathematics, City University of Hong Kong
Tat Chee Avenue, Kowloon, Hong Kong, CHINA
E-mail: [email protected]

December 14, 2007

Preliminary Report
1 Introduction
Partial differential equations and the Laplacian operator on domains in Euclidean spaces have played a central role in understanding natural phenomena. However, this avenue has been limited in many areas where calculus is obstructed, as in singular spaces and in spaces of functions on a space $X$ which is itself a function space. Examples of the latter occur in vision and in quantum field theory: in vision it would be useful to do analysis on the space of images, and an image is a function on a patch. Moreover, in analysis and geometry the Lebesgue measure and its counterpart on manifolds are central; these measures are unavailable in the vision example and, more generally, in learning theory.

There is one situation where, over the last several decades, the problem has been studied with some success: when the underlying space is finite (or, more generally, discrete). The introduction of the graph Laplacian has been a major development in algorithm research and is certainly useful for unsupervised learning theory.

The point of view taken here is to benefit from both the classical research and the newer graph-theoretic ideas to develop geometry on probability spaces. This starts with a space $X$ equipped with a kernel (like a Mercer kernel), which gives a topology and geometry; $X$ is to be equipped as well with a probability measure. The main focus is on the construction of a (normalized) Laplacian, an associated heat equation, a diffusion distance, etc. In this setting the point estimates of calculus are replaced by integral quantities; one thinks of secants rather than tangents. Our main result, Theorem 1 below, bounds the error of an empirical approximation to this Laplacian on $X$.
2 Kernels and Metric Spaces
Let $X$ be a set and $K : X \times X \to \mathbb{R}$ be a reproducing kernel, that is, $K$ is symmetric and for any finite set of points $\{x_1, \ldots, x_\ell\} \subset X$, the matrix $(K(x_i, x_j))_{i,j=1}^{\ell}$ is positive semidefinite. The Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}_K$ associated with the kernel $K$ is defined to be the completion of the linear span of the set of functions $\{K_x = K(x, \cdot) : x \in X\}$ with the inner product, denoted $\langle \cdot, \cdot \rangle_K$, satisfying $\langle K_x, K_y \rangle_K = K(x, y)$. See [1, 10, 11]. We assume that the feature map $F : X \to \mathcal{H}_K$ mapping $x \in X$ to $K_x \in \mathcal{H}_K$ is injective.

Definition 1. Define a metric $d = d_K$ induced by $K$ on $X$ as $d(x, y) = \|K_x - K_y\|_K$. We assume that $(X, d)$ is complete and separable (if it is not complete, we may complete it).

The feature map is an isometry. Observe that
$$
K(x, y) = \langle K_x, K_y \rangle_K = \frac{1}{2}\Bigl[\langle K_x, K_x \rangle_K + \langle K_y, K_y \rangle_K - \bigl(d(x, y)\bigr)^2\Bigr].
$$
Then $K$ is continuous on $X \times X$, and hence is a Mercer kernel. Let $\rho_X$ be a Borel probability measure on the metric space $X$.

Example 1. Let $X = \mathbb{R}^n$ and $K$ be the reproducing kernel $K(x, y) = \langle x, y \rangle_{\mathbb{R}^n}$. Then $d_K(x, y) = \|x - y\|_{\mathbb{R}^n}$ and $\mathcal{H}_K$ is the space of linear functions.

Example 2. Let $X$ be a closed subset of $\mathbb{R}^n$ and $K$ be the kernel of the above example restricted to $X \times X$. Then $X$ is complete and separable, and $K$ is a Mercer kernel. Moreover, one may take $\rho_X$ to be any Borel probability measure on $X$.

Remark 1. In Examples 1 and 2, one can replace $\mathbb{R}^n$ by a separable Hilbert space.

Example 3. In Example 2, let $\mathbf{x} = \{x_i\}_{i=1}^m$ be a finite subset of $X$ drawn from $\rho_X$. Then the restriction $K|_{\mathbf{x}}$ is a Mercer kernel on $\mathbf{x}$ and it is natural to take $\rho_{\mathbf{x}}$ to be the uniform measure on $\mathbf{x}$.
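To make Definition 1 concrete, here is a minimal numerical sketch (using NumPy; the Gaussian kernel below is a hypothetical choice made only for illustration). It relies on the identity $\|K_x - K_y\|_K^2 = K(x, x) + K(y, y) - 2K(x, y)$, so $d_K$ can be evaluated from kernel values alone; with the linear kernel of Example 1 the same function returns the Euclidean distance.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Illustrative Mercer kernel K(x, y) = exp(-|x - y|^2 / (2 sigma^2)) on R^n."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def d_K(x, y, kernel):
    """Metric induced by the kernel: d(x, y) = ||K_x - K_y||_K."""
    return np.sqrt(kernel(x, x) + kernel(y, y) - 2.0 * kernel(x, y))

x, y = np.array([0.0, 0.0]), np.array([1.0, 2.0])
print(d_K(x, y, gaussian_kernel))        # kernel-induced distance
print(d_K(x, y, lambda a, b: a @ b))     # Example 1: the linear kernel recovers the Euclidean distance
```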
3 Approximation of Integral Operators
The integral operator $L_K : L^2_{\rho_X} \to \mathcal{H}_K$ associated with the pair $(K, \rho_X)$ is defined by
$$
L_K(f)(x) := \int_X K(x, y) f(y)\, d\rho_X(y), \qquad x \in X. \tag{3.1}
$$
It may be considered as a self-adjoint operator on $L^2_{\rho_X}$ or on $\mathcal{H}_K$. See [10, 11]. We shall use the same notation $L_K$ for these operators defined on different domains.

Assume that $\mathbf{x} := \{x_i\}_{i=1}^m$ is a sample independently drawn according to $\rho_X$. Recall the sampling operator $S_{\mathbf{x}} : \mathcal{H}_K \to \ell^2(\mathbf{x})$ associated with $\mathbf{x}$, defined [17, 18] by $S_{\mathbf{x}}(f) = \bigl(f(x_i)\bigr)_{i=1}^m$. By the reproducing property of $\mathcal{H}_K$, taking the form $\langle K_x, f \rangle_K = f(x)$ for $x \in X$ and $f \in \mathcal{H}_K$, the adjoint of the sampling operator, $S_{\mathbf{x}}^T : \ell^2(\mathbf{x}) \to \mathcal{H}_K$, is given by
$$
S_{\mathbf{x}}^T c = \sum_{i=1}^m c_i K_{x_i}, \qquad c \in \ell^2(\mathbf{x}).
$$
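For intuition, here is a small numerical sketch of the sampling operator and its adjoint (assuming NumPy; the kernel, the measure $\rho_X = \mathrm{Uniform}[0,1]$ and the test function are hypothetical choices for the demonstration). Applied to $f$, the empirical operator is simply $\frac{1}{m} S_{\mathbf{x}}^T S_{\mathbf{x}} f = \frac{1}{m} \sum_{i=1}^m f(x_i) K_{x_i}$, a Monte Carlo counterpart of $L_K f$.

```python
import numpy as np

rng = np.random.default_rng(0)

def K(s, t, sigma=0.3):
    """Illustrative Mercer kernel on X = [0, 1] (Gaussian; a hypothetical choice)."""
    return np.exp(-(s - t) ** 2 / (2.0 * sigma ** 2))

m = 2000
x = rng.uniform(0.0, 1.0, size=m)         # sample independently drawn from rho_X = Uniform[0, 1]
f = lambda t: np.sin(2.0 * np.pi * t)     # a test function

S_x_f = f(x)                              # S_x(f) = (f(x_1), ..., f(x_m))

def empirical_operator(t):
    """(1/m) S_x^T S_x f evaluated at t, i.e. (1/m) sum_i f(x_i) K(t, x_i)."""
    return np.mean(S_x_f * K(t, x))

def integral_operator(t, n=5001):
    """L_K f(t) = int_0^1 K(t, y) f(y) dy, approximated on a fine grid."""
    grid = np.linspace(0.0, 1.0, n)
    return np.mean(K(t, grid) * f(grid))

for t in (0.25, 0.5, 0.75):
    print(t, empirical_operator(t), integral_operator(t))   # close for large m (cf. Proposition 1 below)
```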
As the sample size $m$ tends to infinity, the operator $\frac{1}{m} S_{\mathbf{x}}^T S_{\mathbf{x}}$ converges to the integral operator $L_K$. We prove the following convergence rate.

Proposition 1. Assume
$$
\kappa := \sup_{x \in X} \sqrt{K(x, x)} < \infty. \tag{3.2}
$$
Let $\mathbf{x}$ be a sample independently drawn from $(X, \rho_X)$. With confidence $1 - \delta$, we have
$$
\Bigl\| \frac{1}{m} S_{\mathbf{x}}^T S_{\mathbf{x}} - L_K \Bigr\|_{\mathcal{H}_K \to \mathcal{H}_K} \le \frac{4\kappa^2 \log(2/\delta)}{\sqrt{m}}. \tag{3.3}
$$
Proof. The proof follows [18]. We need a probability inequality [15, 18] for a random variable $\xi$ on $(X, \rho_X)$ with values in a Hilbert space $(H, \|\cdot\|)$. It says that if $\|\xi\| \le \widetilde{M} < \infty$ almost surely, then with confidence $1 - \delta$,
$$
\Bigl\| \frac{1}{m} \sum_{i=1}^m \xi(x_i) - E(\xi) \Bigr\| \le \frac{4\widetilde{M} \log(2/\delta)}{\sqrt{m}}. \tag{3.4}
$$

We apply this probability inequality to the separable Hilbert space $H$ of Hilbert-Schmidt operators on $\mathcal{H}_K$, denoted $\mathrm{HS}(\mathcal{H}_K)$. The inner product of $A_1, A_2 \in \mathrm{HS}(\mathcal{H}_K)$ is defined as $\langle A_1, A_2 \rangle_{\mathrm{HS}} = \sum_{j \ge 1} \langle A_1 e_j, A_2 e_j \rangle_K$, where $\{e_j\}_{j \ge 1}$ is an orthonormal basis of $\mathcal{H}_K$. The space $\mathrm{HS}(\mathcal{H}_K)$ is the subspace of the space of bounded linear operators on $\mathcal{H}_K$ consisting of those $A$ with $\|A\|_{\mathrm{HS}} < \infty$. The random variable $\xi$ on $(X, \rho_X)$ in (3.4) is given by
$$
\xi(x) = K_x \langle \cdot, K_x \rangle_K, \qquad x \in X.
$$
The values of $\xi$ are rank-one operators in $\mathrm{HS}(\mathcal{H}_K)$. For each $x \in X$, if we choose an orthonormal basis $\{e_j\}_{j \ge 1}$ of $\mathcal{H}_K$ with $e_1 = K_x / \sqrt{K(x, x)}$, we see that $\|\xi(x)\|_{\mathrm{HS}}^2 = \sum_{j \ge 1} \|\xi(x) e_j\|_K^2 = (K(x, x))^2 \le \kappa^4$. Observe that $\frac{1}{m} S_{\mathbf{x}}^T S_{\mathbf{x}} = \frac{1}{m} \sum_{i=1}^m K_{x_i} \langle \cdot, K_{x_i} \rangle_K = \frac{1}{m} \sum_{i=1}^m \xi(x_i)$ and $E(\xi) = L_K$. Then the desired bound (3.3) follows from (3.4) with $\widetilde{M} = \kappa^2$ and the norm relation
$$
\|A\|_{\mathcal{H}_K \to \mathcal{H}_K} \le \|A\|_{\mathrm{HS}}. \tag{3.5}
$$
This proves the proposition.

Some analysis related to Proposition 1 can be found in [19, 14, 5, 12].

Definition 2. Denote $p = \int_X K_x \, d\rho_X \in \mathcal{H}_K$. We assume that $p$ is positive on $X$. The normalized Mercer kernel on $X$ associated with the pair $(K, \rho_X)$ is defined by
$$
\widetilde{K}(x, y) = \frac{K(x, y)}{\sqrt{p(x)}\, \sqrt{p(y)}}, \qquad x, y \in X. \tag{3.6}
$$

This construction is in [19], except that we have replaced their positive weighting function by a Mercer kernel. In the following, $\widetilde{\ }$ denotes expressions with respect to the normalized kernel $\widetilde{K}$; in particular $\widetilde{K}_{\mathbf{x}} := \bigl(\widetilde{K}(x_i, x_j)\bigr)_{i,j=1}^m$.

Consider the RKHS $\mathcal{H}_{\widetilde{K}}$ and the integral operator $L_{\widetilde{K}} : \mathcal{H}_{\widetilde{K}} \to \mathcal{H}_{\widetilde{K}}$ associated with the normalized Mercer kernel $\widetilde{K}$.

Denote the sampling operator $\mathcal{H}_{\widetilde{K}} \to \ell^2(\mathbf{x})$ associated with the pair $(\widetilde{K}, \mathbf{x})$ as $\widetilde{S}_{\mathbf{x}}$. If we denote $\mathcal{H}_{\widetilde{K}, \mathbf{x}} = \mathrm{span}\{\widetilde{K}_{x_i}\}_{i=1}^m$ in $\mathcal{H}_{\widetilde{K}}$ and its orthogonal complement as $\mathcal{H}^{\perp}_{\widetilde{K}, \mathbf{x}}$, then we see that $\widetilde{S}_{\mathbf{x}}(f) = 0$ for any $f \in \mathcal{H}^{\perp}_{\widetilde{K}, \mathbf{x}}$. Hence $\widetilde{S}_{\mathbf{x}}^T \widetilde{S}_{\mathbf{x}}|_{\mathcal{H}^{\perp}_{\widetilde{K}, \mathbf{x}}} = 0$. Moreover, $\widetilde{K}_{\mathbf{x}}$ is the matrix representation of the operator $\widetilde{S}_{\mathbf{x}}^T \widetilde{S}_{\mathbf{x}}|_{\mathcal{H}_{\widetilde{K}, \mathbf{x}}}$.
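As a sketch of how the normalized kernel (3.6) and the matrix $\widetilde{K}_{\mathbf{x}}$ can be computed (illustrative only: it assumes that $p$ can be evaluated, here by quadrature for $\rho_X = \mathrm{Uniform}[0,1]$ and a Gaussian kernel, both hypothetical choices; the algorithmic issue of an unknown $\rho_X$, hence an unknown $p$, is taken up in Section 8):

```python
import numpy as np

rng = np.random.default_rng(0)

def K(s, t, sigma=0.3):
    """Illustrative Mercer kernel on X = [0, 1] (a hypothetical choice)."""
    return np.exp(-np.subtract.outer(s, t) ** 2 / (2.0 * sigma ** 2))

grid = np.linspace(0.0, 1.0, 2001)                # quadrature grid for rho_X = Uniform[0, 1]

def p(s):
    """p(s) = int_X K(s, y) d rho_X(y), approximated by a Riemann sum."""
    return K(s, grid).mean(axis=-1)

m = 200
x = rng.uniform(0.0, 1.0, size=m)                 # sample drawn from rho_X
K_x = K(x, x)                                     # the matrix (K(x_i, x_j))_{i,j}
K_tilde_x = K_x / np.sqrt(np.outer(p(x), p(x)))   # the matrix (K~(x_i, x_j))_{i,j} from (3.6)

# (1/m) K_tilde_x represents (1/m) S~_x^T S~_x on H_{K~,x}; its spectrum approximates
# the top of the spectrum of L_{K~} (quantified by Theorem 1 below).
print(np.linalg.eigvalsh(K_tilde_x / m)[::-1][:5])
```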
Proposition 1, with $\widetilde{K}$ replacing $K$, yields the following bound.

Theorem 1. Denote $p_0 = \min_{x \in X} p(x)$. For any $0 < \delta < 1$, with confidence $1 - \delta$,
$$
\Bigl\| L_{\widetilde{K}} - \frac{1}{m} \widetilde{S}_{\mathbf{x}}^T \widetilde{S}_{\mathbf{x}} \Bigr\|_{\mathcal{H}_{\widetilde{K}} \to \mathcal{H}_{\widetilde{K}}} \le \frac{4\kappa^2 \log(2/\delta)}{\sqrt{m}\, p_0}. \tag{3.7}
$$
Note that $\frac{1}{m} \widetilde{S}_{\mathbf{x}}^T \widetilde{S}_{\mathbf{x}}$ is given from the data $\mathbf{x}$ by a linear algorithm [16], so Theorem 1 is an error estimate for that algorithm.

The analysis in [19] deeply involves the constant $\min_{x, y \in X} K(x, y)$. In the case of the Gaussian kernel $K_\sigma(x, y) = \exp\{-\frac{|x - y|^2}{2\sigma^2}\}$, this constant decays exponentially fast as the width $\sigma$ of the Gaussian becomes small, while in Theorem 1 the constant $p_0$ decays only polynomially fast, at least in the manifold setting [20].
4 Tame Heat Equation

Define a tame version of the Laplacian as $\Delta_{\widetilde{K}} = I - L_{\widetilde{K}}$. The tame heat equation takes the form
$$
\frac{\partial u}{\partial t} = -\Delta_{\widetilde{K}} u. \tag{4.1}
$$
One can see that the solution to (4.1) with the initial condition $u(0) = u_0$ is given by
$$
u(t) = \exp\bigl(-t \Delta_{\widetilde{K}}\bigr) u_0. \tag{4.2}
$$
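In finite dimensions the solution (4.2) can be computed spectrally. The sketch below is purely illustrative: a small symmetric matrix stands in for $L_{\widetilde{K}}$ (in practice one might use an empirical surrogate such as $\frac{1}{m}\widetilde{K}_{\mathbf{x}}$), and $\exp(-t\Delta)$ is evaluated through an eigendecomposition.

```python
import numpy as np

def heat_solution(L_tilde, u0, t):
    """Solve du/dt = -(I - L_tilde) u with u(0) = u0, i.e. u(t) = exp(-t (I - L_tilde)) u0,
    via the spectral decomposition of the symmetric matrix L_tilde."""
    lam, V = np.linalg.eigh(L_tilde)                 # eigenpairs of L_tilde
    return V @ (np.exp(-t * (1.0 - lam)) * (V.T @ u0))

# toy example: a random symmetric PSD matrix rescaled so its top eigenvalue is 1
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
L_tilde = (A @ A.T) / np.linalg.eigvalsh(A @ A.T).max()
u0 = rng.standard_normal(5)

for t in (0.0, 1.0, 10.0, 100.0):
    print(t, heat_solution(L_tilde, u0, t))          # as t grows, only the top eigencomponent survives
```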
5 Diffusion Distances
Let $1 = \lambda_1 \ge \lambda_2 \ge \cdots \ge 0$ be the eigenvalues of $L_{\widetilde{K}} : L^2_{\rho_X} \to L^2_{\rho_X}$ and $\{\varphi^{(i)}\}_{i \ge 1}$ be corresponding normalized eigenfunctions, which form an orthonormal basis of $L^2_{\rho_X}$. The Mercer Theorem asserts that
$$
\widetilde{K}(x, y) = \sum_{i=1}^{\infty} \lambda_i \varphi^{(i)}(x) \varphi^{(i)}(y),
$$
where the convergence holds in $L^2_{\rho_X}$ and uniformly.

Definition 3. For $t > 0$ we define a reproducing kernel on $X$ as
$$
\widetilde{K}^t(x, y) = \sum_{i=1}^{\infty} \lambda_i^t \varphi^{(i)}(x) \varphi^{(i)}(y). \tag{5.1}
$$
Then the tame diffusion distance $D^t$ on $X$ is defined as $d_{\widetilde{K}^t}$ by
$$
D^t(x, y) = \bigl\| \widetilde{K}^t_x - \widetilde{K}^t_y \bigr\|_{\widetilde{K}^t} = \Bigl\{ \sum_{i=1}^{\infty} \lambda_i^t \bigl( \varphi^{(i)}(x) - \varphi^{(i)}(y) \bigr)^2 \Bigr\}^{1/2}. \tag{5.2}
$$
The kernel $\widetilde{K}^t$ may be interpreted as a tame version of the heat kernel on a manifold. The functions $(\widetilde{K}^t)_x$ are in $\mathcal{H}_{\widetilde{K}}$ for $t \ge 1$, but only in $L^2_{\rho_X}$ for $t \ge 0$. The quantity corresponding to $D^t$ for the classical heat kernel and other varieties has been developed by Coifman et al. [8, 9], where it is used in data analysis and more. Note that $\widetilde{K}^t$ is continuous for $t \ge 1$, in contrast to the classical kernel (Green's function), which is of course singular on the diagonal of $X \times X$.

The integral operator $L_{\widetilde{K}^t}$ associated with the reproducing kernel $\widetilde{K}^t$ is the $t$-th power of the positive operator $L_{\widetilde{K}}$.

Note that for $t = 1$, the distance $D^1$ is the one induced by the kernel $\widetilde{K}$. When $t$ becomes large, the effect of the eigenfunctions $\varphi^{(i)}$ with small eigenvalues $\lambda_i$ is small. In particular, when $\lambda_1$ has multiplicity $1$, since $\varphi^{(1)} = \sqrt{p}$ we have
$$
\lim_{t \to \infty} D^t(x, y) = \bigl| \sqrt{p(x)} - \sqrt{p(y)} \bigr|.
$$
In the same way, since the tame Laplacian $\Delta_{\widetilde{K}}$ has eigenpairs $(1 - \lambda_i, \varphi^{(i)})$, we know that the solution (4.2) to the tame heat equation (4.1) with the initial condition $u(0) = u_0 \in L^2_{\rho_X}$ satisfies
$$
\lim_{t \to \infty} u(t) = \langle u_0, \sqrt{p} \rangle_{L^2_{\rho_X}} \, \sqrt{p}.
$$
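A sketch of the diffusion distance (5.2) from a finite number of eigenpairs (the eigenvalues and eigenfunction values below are hypothetical numbers; in practice one would truncate the series and use empirical eigenpairs, for example those of $\frac{1}{m}\widetilde{K}_{\mathbf{x}}$, as surrogates for $(\lambda_i, \varphi^{(i)})$). With $\lambda_1 = 1$, the printed values approach $|\varphi^{(1)}(x) - \varphi^{(1)}(y)| = |\sqrt{p(x)} - \sqrt{p(y)}|$ as $t$ grows, consistent with the limit above.

```python
import numpy as np

def diffusion_distance(t, lam, phi_x, phi_y):
    """D^t(x, y) = ( sum_i lam_i^t (phi_i(x) - phi_i(y))^2 )^{1/2}, truncated to the
    eigenpairs supplied: lam[i] = lambda_i, phi_x[i] = phi^(i)(x), phi_y[i] = phi^(i)(y)."""
    return np.sqrt(np.sum(lam ** t * (phi_x - phi_y) ** 2))

# hypothetical eigenvalues (lambda_1 = 1) and eigenfunction values at two points x, y
lam   = np.array([1.0, 0.7, 0.3, 0.05])
phi_x = np.array([1.1, 0.8, -0.5, 1.2])    # phi^(1)(x) = sqrt(p(x)), etc.
phi_y = np.array([0.9, -0.4, 0.9, -0.3])

for t in (1, 2, 8, 64):
    print(t, diffusion_distance(t, lam, phi_x, phi_y))   # tends to |1.1 - 0.9| = 0.2 as t grows
```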
6 Graph Laplacian
Consider the reproducing kernel $K$ and the finite subset $\mathbf{x} = \{x_i\}_{i=1}^m$ of $X$.

Definition 4. Let $K_{\mathbf{x}}$ be the symmetric matrix $K_{\mathbf{x}} = (K(x_i, x_j))_{i,j=1}^m$ and $D_{\mathbf{x}}$ be the diagonal matrix with $(D_{\mathbf{x}})_{i,i} = \sum_{j=1}^m (K_{\mathbf{x}})_{i,j} = \sum_{j=1}^m K(x_i, x_j)$. Define a discrete Laplacian matrix $L = L_{\mathbf{x}} = D_{\mathbf{x}} - K_{\mathbf{x}}$. The normalized discrete Laplacian is defined by
$$
\widehat{\Delta}_{K, \mathbf{x}} = I - D_{\mathbf{x}}^{-\frac{1}{2}} K_{\mathbf{x}} D_{\mathbf{x}}^{-\frac{1}{2}} = I - \frac{1}{m} \widehat{K}_{\mathbf{x}}, \tag{6.1}
$$
where $\widehat{K}_{\mathbf{x}}$ is the special case of the matrix $\widetilde{K}_{\mathbf{x}}$ when $\rho_X$ is the uniform distribution on the finite set $\mathbf{x}$, that is,
$$
\widehat{K}_{\mathbf{x}} := \left( \frac{K(x_i, x_j)}{\sqrt{\frac{1}{m} \sum_{\ell=1}^m K(x_i, x_\ell)}\, \sqrt{\frac{1}{m} \sum_{\ell=1}^m K(x_j, x_\ell)}} \right)_{i,j=1}^m. \tag{6.2}
$$
The above construction is the same as that for graph Laplacians [7]; see [2] for a discussion in learning theory. But the setting here is different: the entries $K(x_i, x_j)$ are induced by a reproducing kernel, so they may take negative values (subject only to the positive semidefiniteness condition) and differ from the weights of a graph or the entries of an adjacency matrix. In the following, $\widehat{\ }$ denotes expressions with respect to the implicit sample $\mathbf{x}$.
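A sketch of Definition 4 computed from a finite sample (illustrative; the Gaussian kernel is a convenient hypothetical choice that keeps all entries, hence all row sums, positive):

```python
import numpy as np

rng = np.random.default_rng(0)

def K(s, t, sigma=0.3):
    """Illustrative Mercer kernel (Gaussian), so that all entries are positive."""
    return np.exp(-np.subtract.outer(s, t) ** 2 / (2.0 * sigma ** 2))

m = 100
x = rng.uniform(0.0, 1.0, size=m)                # the sample x = {x_1, ..., x_m}

K_x = K(x, x)                                    # kernel matrix (K(x_i, x_j))
row_sums = K_x.sum(axis=1)                       # (D_x)_{ii} = sum_j K(x_i, x_j)
D_x = np.diag(row_sums)
L_x = D_x - K_x                                  # discrete Laplacian L = D_x - K_x

d = 1.0 / np.sqrt(row_sums)
K_hat_x = m * d[:, None] * K_x * d[None, :]      # the matrix \hat{K}_x of (6.2)
Delta_hat = np.eye(m) - K_hat_x / m              # normalized discrete Laplacian (6.1)

# sanity check: (1/m) \hat{K}_x equals D_x^{-1/2} K_x D_x^{-1/2}
print(np.allclose(K_hat_x / m, np.diag(d) @ K_x @ np.diag(d)))
```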
7 Approximation of Eigenfunctions
In this section we study the approximation of eigenfunctions from the bound for the approximation of linear operators. We need the following result, which follows from the general discussion on perturbation of eigenvalues and eigenvectors; see e.g. [4]. For completeness, we give a detailed proof for the approximation of eigenvectors.

Proposition 2. Let $A$ and $\widehat{A}$ be two compact positive self-adjoint operators on a Hilbert space $H$, with nonincreasing eigenvalues $\{\lambda_j\}$ and $\{\widehat{\lambda}_j\}$, counted with multiplicity. Then there holds
$$
\max_{j \ge 1} |\lambda_j - \widehat{\lambda}_j| \le \|A - \widehat{A}\|. \tag{7.1}
$$
Let $w_k$ be a normalized eigenvector of $A$ associated with the eigenvalue $\lambda_k$. If $r > 0$ satisfies
$$
\lambda_{k-1} - \lambda_k \ge r, \qquad \lambda_k - \lambda_{k+1} \ge r, \qquad \|A - \widehat{A}\| \le \frac{r}{2}, \tag{7.2}
$$
then
$$
\|w_k - \widehat{w}_k\| \le \frac{4}{r} \|A - \widehat{A}\|,
$$
where $\widehat{w}_k$ is a normalized eigenvector of $\widehat{A}$ associated with the eigenvalue $\widehat{\lambda}_k$.

Proof. The eigenvalue estimate (7.1) follows from Weyl's Perturbation Theorem (see e.g. [4]).

Let $\{w_j\}_{j \ge 1}$ be an orthonormal basis of $H$ consisting of eigenvectors of $A$ associated with the eigenvalues $\{\lambda_j\}$. Then $\widehat{w}_k = \sum_{j \ge 1} \alpha_j w_j$, where $\alpha_j = \langle \widehat{w}_k, w_j \rangle$; replacing $\widehat{w}_k$ by $-\widehat{w}_k$ if necessary, we may assume $\alpha_k \ge 0$.

Consider $\widehat{A} \widehat{w}_k - A \widehat{w}_k$. It can be expressed as
$$
\widehat{A} \widehat{w}_k - A \widehat{w}_k = \widehat{\lambda}_k \widehat{w}_k - \sum_{j \ge 1} \alpha_j A w_j = \sum_{j \ge 1} \alpha_j \bigl( \widehat{\lambda}_k - \lambda_j \bigr) w_j.
$$
When $j \ne k$, we see from (7.2) and (7.1) that
$$
|\widehat{\lambda}_k - \lambda_j| \ge |\lambda_k - \lambda_j| - |\widehat{\lambda}_k - \lambda_k| \ge \min\{\lambda_{k-1} - \lambda_k, \lambda_k - \lambda_{k+1}\} - \|A - \widehat{A}\| \ge \frac{r}{2}.
$$
Hence
$$
\|\widehat{A} \widehat{w}_k - A \widehat{w}_k\|^2 = \sum_{j \ge 1} \alpha_j^2 \bigl( \widehat{\lambda}_k - \lambda_j \bigr)^2 \ge \sum_{j \ne k} \alpha_j^2 \bigl( \widehat{\lambda}_k - \lambda_j \bigr)^2 \ge \frac{r^2}{4} \sum_{j \ne k} \alpha_j^2.
$$
But $\|\widehat{A} \widehat{w}_k - A \widehat{w}_k\| \le \|\widehat{A} - A\|$. It follows that
$$
\Bigl\{ \sum_{j \ne k} \alpha_j^2 \Bigr\}^{1/2} \le \frac{2}{r} \|A - \widehat{A}\|.
$$
From $\|\widehat{w}_k\|^2 = \sum_{j \ge 1} \alpha_j^2 = 1$, we also have $\alpha_k = \bigl\{ 1 - \sum_{j \ne k} \alpha_j^2 \bigr\}^{1/2} \ge 1 - \bigl\{ \sum_{j \ne k} \alpha_j^2 \bigr\}^{1/2}$. Therefore,
$$
\|w_k - \widehat{w}_k\| = \Bigl\| (1 - \alpha_k) w_k + \alpha_k w_k - \widehat{w}_k \Bigr\| = \Bigl\| (1 - \alpha_k) w_k - \sum_{j \ne k} \alpha_j w_j \Bigr\| \le |1 - \alpha_k| + \Bigl\| \sum_{j \ne k} \alpha_j w_j \Bigr\| \le 2 \Bigl\{ \sum_{j \ne k} \alpha_j^2 \Bigr\}^{1/2} \le \frac{4}{r} \|A - \widehat{A}\|.
$$
This proves the desired bound.
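A quick numerical check of Proposition 2, with small symmetric positive semidefinite matrices standing in for $A$ and $\widehat{A}$ (a synthetic example; the spectrum is chosen by hand so that the gaps around $\lambda_k$ are easy to read off, and the sign of the computed eigenvector is aligned before comparison):

```python
import numpy as np

rng = np.random.default_rng(1)

n, k = 8, 2                                            # compare eigenvectors for the k-th largest eigenvalue
lam = np.array([1.0, 0.7, 0.4, 0.25, 0.15, 0.1, 0.05, 0.02])   # chosen (nonincreasing) spectrum of A
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(lam) @ Q.T                             # A: symmetric PSD with the spectrum above
E = rng.standard_normal((n, n)) * 1e-3
A_hat = A + (E + E.T) / 2.0                            # A_hat: a small symmetric perturbation of A

_, V = np.linalg.eigh(A)
_, V_hat = np.linalg.eigh(A_hat)
w_k, w_k_hat = V[:, -k], V_hat[:, -k]                  # eigenvectors for the k-th largest eigenvalues
if w_k @ w_k_hat < 0:                                  # eigenvectors are defined up to sign
    w_k_hat = -w_k_hat

r = min(lam[k - 2] - lam[k - 1], lam[k - 1] - lam[k])  # gaps lambda_{k-1}-lambda_k and lambda_k-lambda_{k+1}
bound = (4.0 / r) * np.linalg.norm(A - A_hat, 2)
print(np.linalg.norm(w_k - w_k_hat), "<=", bound)      # Proposition 2: left-hand side <= right-hand side
```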
An immediate corollary of Proposition 2 and Theorem 1 concerns the approximation of eigenfunctions of $L_{\widetilde{K}}$ by those of $\frac{1}{m} \widetilde{S}_{\mathbf{x}}^T \widetilde{S}_{\mathbf{x}}$.

Recall the eigenpairs $\{\lambda_i, \varphi^{(i)}\}_{i \ge 1}$ of the compact self-adjoint positive operator $L_{\widetilde{K}}$ on $L^2_{\rho_X}$. Observe that $\|\varphi^{(k)}\|_{L^2_{\rho_X}} = 1$, but
$$
\|\varphi^{(k)}\|_{\widetilde{K}}^2 = \Bigl\langle \tfrac{1}{\lambda_k} L_{\widetilde{K}} \varphi^{(k)}, \varphi^{(k)} \Bigr\rangle_{\widetilde{K}} = \frac{1}{\lambda_k} \|\varphi^{(k)}\|_{L^2_{\rho_X}}^2 = \frac{1}{\lambda_k}.
$$
So $\frac{1}{\sqrt{\lambda_k}} \varphi^{(k)}$ is a normalized eigenfunction of $L_{\widetilde{K}}$ on $\mathcal{H}_{\widetilde{K}}$ associated with the eigenvalue $\lambda_k$. Also, $\frac{1}{m} \widetilde{S}_{\mathbf{x}}^T \widetilde{S}_{\mathbf{x}}|_{\mathcal{H}^{\perp}_{\widetilde{K}, \mathbf{x}}} = 0$ and $\frac{1}{m} \widetilde{K}_{\mathbf{x}}$ is the matrix representation of $\frac{1}{m} \widetilde{S}_{\mathbf{x}}^T \widetilde{S}_{\mathbf{x}}|_{\mathcal{H}_{\widetilde{K}, \mathbf{x}}}$.

Corollary 1. Let $k \in \{1, \ldots, m\}$ be such that $r := \min\{\lambda_{k-1} - \lambda_k, \lambda_k - \lambda_{k+1}\} > 0$. If $0 < \delta < 1$ and $m \in \mathbb{N}$ satisfy
$$
\frac{4\kappa^2 \log(4/\delta)}{\sqrt{m}\, p_0} \le \frac{r}{2},
$$
then with confidence $1 - \delta$ we have
$$
\Bigl\| \frac{\varphi^{(k)}}{\sqrt{\lambda_k}} - \Phi^{(k)} \Bigr\|_{\widetilde{K}} \le \frac{16\kappa^2 \log(4/\delta)}{r \sqrt{m}\, p_0},
$$
where $\Phi^{(k)}$ is a normalized eigenfunction of $\frac{1}{m} \widetilde{S}_{\mathbf{x}}^T \widetilde{S}_{\mathbf{x}}$ associated with its $k$-th largest eigenvalue $\lambda_{k, \mathbf{x}}$. Moreover, $\lambda_{k, \mathbf{x}}$ is the $k$-th largest eigenvalue of the matrix $\frac{1}{m} \widetilde{K}_{\mathbf{x}}$ and, with an associated normalized eigenvector $v$, we have $\Phi^{(k)} = \frac{1}{\sqrt{\lambda_{k, \mathbf{x}}}} \frac{1}{\sqrt{m}} \widetilde{S}_{\mathbf{x}}^T v$.
8 Algorithmic Issue for Approximating Eigenfunctions
Corollary 1 provides a quantitative understanding of the approximation of eigenfunctions of $L_{\widetilde{K}}$ by those of the operator $\frac{1}{m} \widetilde{S}_{\mathbf{x}}^T \widetilde{S}_{\mathbf{x}}$. The matrix representation $\frac{1}{m} \widetilde{K}_{\mathbf{x}}$ is given by the kernel $\widetilde{K}$, which involves $K$ and the function $p$ defined by $\rho_X$. Since $\rho_X$ is unknown algorithmically, we want to study the approximation of eigenfunctions by methods involving only $K$ and the sample $\mathbf{x}$. This is done here.

We shall apply Proposition 2 to the operator $A = L_{\widetilde{K}}$ and an operator $\widehat{A}$ defined by means of the matrix $\widehat{K}_{\mathbf{x}}$ given by $(K, \mathbf{x})$ in (6.2), and derive estimates for the approximation of eigenfunctions.

Let $\widehat{\lambda}_{1, \mathbf{x}} \ge \widehat{\lambda}_{2, \mathbf{x}} \ge \cdots \ge \widehat{\lambda}_{m, \mathbf{x}} \ge 0$ be the eigenvalues of the matrix $\frac{1}{m} \widehat{K}_{\mathbf{x}}$.
Theorem 2. Let $k \in \{1, \ldots, m\}$ be such that $r := \min\{\lambda_{k-1} - \lambda_k, \lambda_k - \lambda_{k+1}\} > 0$. If $0 < \delta < 1$ and $m \in \mathbb{N}$ satisfy
$$
\frac{4\kappa^2 \log(4/\delta)}{\sqrt{m}\, p_0} \le \frac{r}{8}, \tag{8.1}
$$
then with confidence $1 - \delta$ we have
$$
\Bigl\| \varphi^{(k)} - \frac{1}{\sqrt{m}} \widetilde{S}_{\mathbf{x}}^T v \Bigr\|_{\widetilde{K}} \le \frac{72 \sqrt{\lambda_k}\, \kappa^2 \log(4/\delta)}{r \sqrt{m}\, p_0}, \tag{8.2}
$$
where $v \in \ell^2(\mathbf{x})$ is a normalized eigenvector of the matrix $\frac{1}{m} \widehat{K}_{\mathbf{x}}$ with eigenvalue $\widehat{\lambda}_{k, \mathbf{x}}$.
Remark 2. Since $\Delta_{\widetilde{K}} = I - L_{\widetilde{K}}$ and $\widehat{\Delta}_{K, \mathbf{x}} = I - D_{\mathbf{x}}^{-\frac{1}{2}} K_{\mathbf{x}} D_{\mathbf{x}}^{-\frac{1}{2}} = I - \frac{1}{m} \widehat{K}_{\mathbf{x}}$, Theorem 2 provides error bounds for the approximation of eigenfunctions of the tame Laplacian $\Delta_{\widetilde{K}}$ by those of the normalized discrete Laplacian $\widehat{\Delta}_{K, \mathbf{x}}$. Note that $\frac{1}{\sqrt{m}} \widetilde{S}_{\mathbf{x}}^T v = \frac{1}{\sqrt{m}} \sum_{i=1}^m v_i \widetilde{K}_{x_i}$.

We need some preliminary estimates for the approximation of linear operators. Applying the probability inequality (3.4) to the random variable $\xi$ on $(X, \rho_X)$ with values in the Hilbert space $\mathcal{H}_K$ given by $\xi(x) = K_x$, we have

Lemma 1. Denote $p_{\mathbf{x}} = \frac{1}{m} \sum_{j=1}^m K_{x_j} \in \mathcal{H}_K$. With confidence $1 - \delta/2$, we have
$$
\|p_{\mathbf{x}} - p\|_K \le \frac{4\kappa \log(4/\delta)}{\sqrt{m}}. \tag{8.3}
$$

The error bound (8.3) implies
$$
\max_{i=1, \ldots, m} |p_{\mathbf{x}}(x_i) - p(x_i)| \le \frac{4\kappa^2 \log(4/\delta)}{\sqrt{m}}. \tag{8.4}
$$
This yields the following error bound between the matrices $\widetilde{K}_{\mathbf{x}}$ and $\widehat{K}_{\mathbf{x}}$.

Lemma 2. Let $\mathbf{x} \in X^m$. When (8.3) holds, we have
$$
\Bigl\| \frac{1}{m} \widetilde{K}_{\mathbf{x}} - \frac{1}{m} \widehat{K}_{\mathbf{x}} \Bigr\| \le \left( 2 + \frac{4\kappa^2 \log(4/\delta)}{\sqrt{m}\, p_0} \right) \frac{4\kappa^2 \log(4/\delta)}{\sqrt{m}\, p_0}. \tag{8.5}
$$

Proof. Let $\Sigma = \mathrm{diag}\{\Sigma_i\}_{i=1}^m$ with $\Sigma_i := \frac{\sqrt{p_{\mathbf{x}}(x_i)}}{\sqrt{p(x_i)}}$. Then $\frac{1}{m} \widetilde{K}_{\mathbf{x}} = \Sigma \frac{1}{m} \widehat{K}_{\mathbf{x}} \Sigma$. Decompose
$$
\frac{1}{m} \widetilde{K}_{\mathbf{x}} - \frac{1}{m} \widehat{K}_{\mathbf{x}} = \Sigma \frac{1}{m} \widehat{K}_{\mathbf{x}} (\Sigma - I) + (\Sigma - I) \frac{1}{m} \widehat{K}_{\mathbf{x}}.
$$
We see that
$$
\Bigl\| \frac{1}{m} \widetilde{K}_{\mathbf{x}} - \frac{1}{m} \widehat{K}_{\mathbf{x}} \Bigr\| \le \|\Sigma\| \, \Bigl\| \frac{1}{m} \widehat{K}_{\mathbf{x}} \Bigr\| \, \|\Sigma - I\| + \|\Sigma - I\| \, \Bigl\| \frac{1}{m} \widehat{K}_{\mathbf{x}} \Bigr\|.
$$
Since $\frac{1}{m} \widehat{K}_{\mathbf{x}} \le I$, we have $\bigl\| \frac{1}{m} \widehat{K}_{\mathbf{x}} \bigr\| \le 1$. Also, for each $i$,
$$
\left| \frac{\sqrt{p_{\mathbf{x}}(x_i)}}{\sqrt{p(x_i)}} - 1 \right| \le \frac{|p_{\mathbf{x}}(x_i) - p(x_i)|}{p(x_i)}.
$$
Hence
$$
\|\Sigma - I\| \le \max_{i=1, \ldots, m} \frac{|p_{\mathbf{x}}(x_i) - p(x_i)|}{p(x_i)}.
$$
This in connection with (8.4) proves the statement.
Now we can prove the estimates for the approximation of eigenfunctions.

Proof of Theorem 2. First, we define a linear operator which plays the role of $\widehat{A}$ in Proposition 2. This is an operator, denoted $\widehat{L}_{K, \mathbf{x}}$, on $\mathcal{H}_{\widetilde{K}}$ such that $\widehat{L}_{K, \mathbf{x}}(f) = 0$ for any $f \in \mathcal{H}^{\perp}_{\widetilde{K}, \mathbf{x}}$, and $\mathcal{H}_{\widetilde{K}, \mathbf{x}}$ is an invariant subspace of $\widehat{L}_{K, \mathbf{x}}$ with matrix representation equal to the matrix $\frac{1}{m} \widehat{K}_{\mathbf{x}}$ given by (6.2). That is, in $\mathcal{H}_{\widetilde{K}, \mathbf{x}} \oplus \mathcal{H}^{\perp}_{\widetilde{K}, \mathbf{x}}$, we have
$$
\widehat{L}_{K, \mathbf{x}} \bigl[ \widetilde{K}_{x_1}, \ldots, \widetilde{K}_{x_m} \bigr] = \bigl[ \widetilde{K}_{x_1}, \ldots, \widetilde{K}_{x_m} \bigr] \frac{1}{m} \widehat{K}_{\mathbf{x}}, \qquad \widehat{L}_{K, \mathbf{x}}|_{\mathcal{H}^{\perp}_{\widetilde{K}, \mathbf{x}}} = 0.
$$

Second, we consider the difference between the operators $L_{\widetilde{K}}$ and $\widehat{L}_{K, \mathbf{x}}$. By Theorem 1, there exists a subset $X_1$ of $X^m$ of measure at least $1 - \delta/2$ such that (3.7) holds true for $\mathbf{x} \in X_1$. By Lemmas 1 and 2, we know that there exists another subset $X_2$ of $X^m$ of measure at least $1 - \delta/2$ such that (8.3) and (8.5) hold for $\mathbf{x} \in X_2$. Recall that $\widetilde{S}_{\mathbf{x}}^T \widetilde{S}_{\mathbf{x}}|_{\mathcal{H}^{\perp}_{\widetilde{K}, \mathbf{x}}} = 0$ and $\widetilde{K}_{\mathbf{x}}$ is the matrix representation of the operator $\widetilde{S}_{\mathbf{x}}^T \widetilde{S}_{\mathbf{x}}|_{\mathcal{H}_{\widetilde{K}, \mathbf{x}}}$. This in connection with (8.5) and the definition of $\widehat{L}_{K, \mathbf{x}}$ tells us that for $\mathbf{x} \in X_2$,
$$
\Bigl\| \frac{1}{m} \widetilde{S}_{\mathbf{x}}^T \widetilde{S}_{\mathbf{x}} - \widehat{L}_{K, \mathbf{x}} \Bigr\| = \Bigl\| \frac{1}{m} \widetilde{K}_{\mathbf{x}} - \frac{1}{m} \widehat{K}_{\mathbf{x}} \Bigr\| \le \left( 2 + \frac{4\kappa^2 \log(4/\delta)}{\sqrt{m}\, p_0} \right) \frac{4\kappa^2 \log(4/\delta)}{\sqrt{m}\, p_0}.
$$
Together with (3.7), we have
$$
\bigl\| L_{\widetilde{K}} - \widehat{L}_{K, \mathbf{x}} \bigr\| \le \left( 3 + \frac{4\kappa^2 \log(4/\delta)}{\sqrt{m}\, p_0} \right) \frac{4\kappa^2 \log(4/\delta)}{\sqrt{m}\, p_0}, \qquad \forall\, \mathbf{x} \in X_1 \cap X_2. \tag{8.6}
$$
Third, we apply Proposition 2 to the operators $A = L_{\widetilde{K}}$ and $\widehat{A} = \widehat{L}_{K, \mathbf{x}}$ on the Hilbert space $\mathcal{H}_{\widetilde{K}}$. Let $\mathbf{x}$ be in the subset $X_1 \cap X_2$, which has measure at least $1 - \delta$. Since $r \le 1$, the estimate (8.6) together with the assumption (8.1) tells us that (7.2) holds. Then by Proposition 2, we have
$$
\Bigl\| \frac{1}{\sqrt{\lambda_k}} \varphi^{(k)} - \widehat{\varphi}^{(k)} \Bigr\|_{\widetilde{K}} \le \frac{4}{r} \bigl\| L_{\widetilde{K}} - \widehat{L}_{K, \mathbf{x}} \bigr\| \le \frac{50 \kappa^2 \log(4/\delta)}{r \sqrt{m}\, p_0}, \tag{8.7}
$$
where $\widehat{\varphi}^{(k)}$ is a normalized eigenfunction of $\widehat{L}_{K, \mathbf{x}}$ associated with the eigenvalue $\widehat{\lambda}_{k, \mathbf{x}}$.

Finally, we specify $\widehat{\varphi}^{(k)}$. From the definition of the operator $\widehat{L}_{K, \mathbf{x}}$, we see that for $u \in \ell^2(\mathbf{x})$, we have
$$
\widehat{L}_{K, \mathbf{x}} \left( \sum_{i=1}^m u_i \widetilde{K}_{x_i} \right) = \sum_{i=1}^m \Bigl( \frac{1}{m} \widehat{K}_{\mathbf{x}} u \Bigr)_i \widetilde{K}_{x_i}.
$$
Since $\bigl\| \sum_{i=1}^m u_i \widetilde{K}_{x_i} \bigr\|_{\widetilde{K}}^2 = u^T \widetilde{K}_{\mathbf{x}} u$, we know that the normalized eigenfunction $\widehat{\varphi}^{(k)}$ of $\widehat{L}_{K, \mathbf{x}}$ associated with the eigenvalue $\widehat{\lambda}_{k, \mathbf{x}}$ can be taken as
$$
\widehat{\varphi}^{(k)} = \bigl( v^T \widetilde{K}_{\mathbf{x}} v \bigr)^{-1/2} \sum_{i=1}^m v_i \widetilde{K}_{x_i} = \Bigl( \frac{1}{m} v^T \widetilde{K}_{\mathbf{x}} v \Bigr)^{-1/2} \frac{1}{\sqrt{m}} \widetilde{S}_{\mathbf{x}}^T v,
$$
where $v \in \ell^2(\mathbf{x})$ is a normalized eigenvector of the matrix $\frac{1}{m} \widehat{K}_{\mathbf{x}}$ with eigenvalue $\widehat{\lambda}_{k, \mathbf{x}}$.

Since (8.5) holds, we see that
$$
\Bigl| \frac{1}{m} v^T \widetilde{K}_{\mathbf{x}} v - \widehat{\lambda}_{k, \mathbf{x}} \Bigr| = \Bigl| \frac{1}{m} v^T \widetilde{K}_{\mathbf{x}} v - \frac{1}{m} v^T \widehat{K}_{\mathbf{x}} v \Bigr| = \Bigl| v^T \Bigl( \frac{1}{m} \widetilde{K}_{\mathbf{x}} - \frac{1}{m} \widehat{K}_{\mathbf{x}} \Bigr) v \Bigr| \le \frac{9 \kappa^2 \log(4/\delta)}{\sqrt{m}\, p_0}.
$$
Hence
$$
\Bigl| \frac{1}{m} v^T \widetilde{K}_{\mathbf{x}} v - \lambda_k \Bigr| \le \Bigl| \frac{1}{m} v^T \widetilde{K}_{\mathbf{x}} v - \widehat{\lambda}_{k, \mathbf{x}} \Bigr| + \bigl| \widehat{\lambda}_{k, \mathbf{x}} - \lambda_k \bigr| \le \frac{22 \kappa^2 \log(4/\delta)}{\sqrt{m}\, p_0}.
$$
Since $\bigl\| ( \frac{1}{m} v^T \widetilde{K}_{\mathbf{x}} v )^{-1/2} \frac{1}{\sqrt{m}} \widetilde{S}_{\mathbf{x}}^T v \bigr\|_{\widetilde{K}} = 1$, we have
$$
\Bigl\| \frac{1}{\sqrt{\lambda_k}} \frac{1}{\sqrt{m}} \widetilde{S}_{\mathbf{x}}^T v - \widehat{\varphi}^{(k)} \Bigr\|_{\widetilde{K}}
= \left| \frac{1}{\sqrt{\lambda_k}} - \frac{1}{\sqrt{\frac{1}{m} v^T \widetilde{K}_{\mathbf{x}} v}} \right| \Bigl\| \frac{1}{\sqrt{m}} \widetilde{S}_{\mathbf{x}}^T v \Bigr\|_{\widetilde{K}}
= \frac{\bigl| \frac{1}{m} v^T \widetilde{K}_{\mathbf{x}} v - \lambda_k \bigr|}{\sqrt{\lambda_k} \Bigl( \sqrt{\lambda_k} + \sqrt{\frac{1}{m} v^T \widetilde{K}_{\mathbf{x}} v} \Bigr)}
\le \frac{22 \kappa^2 \log(4/\delta)}{\lambda_k \sqrt{m}\, p_0}.
$$
This in connection with (8.7) tells us that for $\mathbf{x} \in X_1 \cap X_2$ we have
$$
\Bigl\| \frac{1}{\sqrt{\lambda_k}} \varphi^{(k)} - \frac{1}{\sqrt{\lambda_k}} \frac{1}{\sqrt{m}} \widetilde{S}_{\mathbf{x}}^T v \Bigr\|_{\widetilde{K}} \le \frac{50 \kappa^2 \log(4/\delta)}{r \sqrt{m}\, p_0} + \frac{22 \kappa^2 \log(4/\delta)}{\lambda_k \sqrt{m}\, p_0}.
$$
Since $r \le \lambda_k - \lambda_{k+1} \le \lambda_k$, this proves the stated error bound in Theorem 2.
References

[1] N. Aronszajn, Theory of reproducing kernels, Trans. Amer. Math. Soc. 68 (1950), 337–404.

[2] M. Belkin and P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput. 15 (2003), 1373–1396.

[3] M. Belkin and P. Niyogi, Convergence of Laplacian eigenmaps, preprint, 2007.

[4] R. Bhatia, Matrix Analysis, Graduate Texts in Mathematics 169, Springer-Verlag, New York, 1997.

[5] G. Blanchard, O. Bousquet, and L. Zwald, Statistical properties of kernel principal component analysis, Mach. Learn. 66 (2007), 259–294.

[6] S. Bougleux, A. Elmoataz, and M. Melkemi, Discrete regularization on weighted graphs for image and mesh filtering, Lecture Notes in Computer Science 4485 (2007), 128–139.

[7] F. R. K. Chung, Spectral Graph Theory, Regional Conference Series in Mathematics 92, SIAM, Philadelphia, 1997.

[8] R. Coifman, S. Lafon, A. Lee, M. Maggioni, B. Nadler, F. Warner, and S. Zucker, Geometric diffusions as a tool for harmonic analysis and structure definition of data: diffusion maps, Proc. Natl. Acad. Sci. USA 102 (2005), 7426–7431.

[9] R. Coifman and M. Maggioni, Diffusion wavelets, Appl. Comput. Harmonic Anal. 21 (2006), 53–94.

[10] F. Cucker and S. Smale, On the mathematical foundations of learning, Bull. Amer. Math. Soc. 39 (2001), 1–49.

[11] F. Cucker and D. X. Zhou, Learning Theory: An Approximation Theory Viewpoint, Cambridge University Press, 2007.

[12] E. De Vito, A. Caponnetto, and L. Rosasco, Model selection for regularized least-squares algorithm in learning theory, Found. Comput. Math. 5 (2005), 59–85.

[13] G. Gilboa and S. Osher, Nonlocal operators with applications to image processing, UCLA CAM Report 07-23, July 2007.

[14] V. Koltchinskii and E. Giné, Random matrix approximation of spectra of integral operators, Bernoulli 6 (2000), 113–167.

[15] I. Pinelis, Optimum bounds for the distributions of martingales in Banach spaces, Ann. Probab. 22 (1994), 1679–1706.

[16] S. Smale and D. X. Zhou, Shannon sampling and function reconstruction from point values, Bull. Amer. Math. Soc. 41 (2004), 279–305.

[17] S. Smale and D. X. Zhou, Shannon sampling II. Connections to learning theory, Appl. Comput. Harmonic Anal. 19 (2005), 285–302.

[18] S. Smale and D. X. Zhou, Learning theory estimates via integral operators and their approximations, Constr. Approx. 26 (2007), 153–172.

[19] U. von Luxburg, M. Belkin, and O. Bousquet, Consistency of spectral clustering, Ann. Stat., to appear.

[20] G. B. Ye and D. X. Zhou, Learning and approximation by Gaussians on Riemannian manifolds, Adv. Comput. Math., in press. DOI 10.1007/s10444-007-9049-0.

[21] D. Zhou and B. Schölkopf, Regularization on discrete spaces, in Pattern Recognition, Proc. 27th DAGM Symposium, Berlin, 2005, pp. 361–368.