Capacity of Reproducing Kernel Spaces in Learning Theory

Ding-Xuan Zhou
Department of Mathematics, City University of Hong Kong
Tat Chee Avenue, Kowloon, Hong Kong, China
E-mail: [email protected]
(The author is supported by CERG grant No. CityU 1087/02P and City University grant No. 7001342.)

Abstract. The capacity of reproducing kernel Hilbert spaces plays an essential role in the analysis of learning theory. Covering numbers and packing numbers of balls of these reproducing kernel spaces are important measurements of this capacity. In this paper we first present lower bound estimates for the packing numbers by means of nodal functions. Then we show that if a Mercer kernel is $C^s$ (for some $s>0$ that is not an even integer), the reproducing kernel Hilbert space associated with this kernel can be embedded into $C^{s/2}$. This gives upper bound estimates for the covering number for Sobolev smooth kernels. Examples and applications to the $V_\gamma$-dimension and Tikhonov regularization are presented to illustrate the upper and lower bound estimates.

Keywords and Phrases: Learning theory, reproducing kernel Hilbert space, capacity, covering number, packing number, nodal function
§1. Introduction

Learning theory investigates how to find an unknown function $f: X \to Y$ from random samples $(x_i, y_i)_{i=1}^l$. Here $(X, d(\cdot,\cdot))$ is a compact metric space and we take $Y = \mathbb{R}$. As the random samples possess noise and other uncertainties, we do not expect that $f(x_i) = y_i$, but regard $f(x_i)$ as approximately equal to $y_i$. Thus we can only find an approximation for the desired function $f$. In learning theory this approximation is usually realized by minimizing some loss function defined in terms of the samples,
$$\frac{1}{l}\sum_{i=1}^{l} V(y_i, g(x_i)) + \lambda \|g\|^2, \tag{1.1}$$
over a set of functions $g$. Here $\|\cdot\|$ is some function space norm and $\lambda$ is a penalty parameter. For regularization networks, $V(y_i, g(x_i)) = (y_i - g(x_i))^2$; for support vector machine regression, $V(y_i, g(x_i)) = \max\{|y_i - g(x_i)| - \varepsilon, 0\}$; for support vector machine classification, $V(y_i, g(x_i)) = |1 - y_i g(x_i)|_+ = \max\{1 - y_i g(x_i), 0\}$.

Given the samples $z := (x_i, y_i)_{i=1}^l$ and a set $H$ of functions called the hypothesis space, we find a function $f_z$ in $H$ that best fits the samples $z$ with respect to the loss function $V$:
$$f_z := \arg\min_{g \in H}\; \frac{1}{l}\sum_{i=1}^{l} V(y_i, g(x_i)) + \lambda \|g\|^2. \tag{1.2}$$
The function $f_z$ is called the empirical target function and is used as an approximation of the desired function $f$.

For example, consider the regression problem. Assume [20] that the samples $z = (x_i, y_i)_{i=1}^l$ are governed by an (unknown) probability measure $\rho$ on $X \times Y$. If we use the least-square error for a function $g$,
$$\mathcal{E}(g) := \int_{X \times Y} (g(x) - y)^2 \, d\rho,$$
then the function $f_\rho$ which minimizes the error, called the regression function, is given by
$$f_\rho(x) = \int_Y y \, d\rho(y|x), \qquad x \in X, \tag{1.3}$$
where $\rho(y|x)$ is the conditional probability measure. The purpose of the regression problem in learning theory is to learn the regression function $f_\rho$ from the random samples $z$. Though $f_\rho$ has the explicit formula (1.3), the probability measure $\rho$ is unknown, hence the regression function cannot be computed directly. What one can compute is the regularized least-square empirical error, defined for a function $g$ and the samples $z$ as
$$\frac{1}{l}\sum_{i=1}^{l} (g(x_i) - y_i)^2 + \lambda \|g\|^2.$$
Notice that the regression function $f_\rho$ need not be in the hypothesis space $H$. Thus we can only expect the empirical target function $f_z$ to be a good approximation of the target function given by
$$f_H := \arg\min_{g \in H} \mathcal{E}(g).$$
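As a quick numerical illustration of these definitions, the following Python sketch (an added toy example, not part of the original text; the particular joint distribution and functions are arbitrary choices) estimates the least-square error $\mathcal{E}(g)$ by Monte Carlo and verifies that the conditional mean, i.e. the regression function (1.3), does better than a perturbed competitor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy measure rho on X x Y: x uniform on [0, 1], y = sin(2*pi*x) + Gaussian noise.
x = rng.uniform(0.0, 1.0, size=200_000)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(x.size)

def E(g):
    """Monte Carlo estimate of the least-square error E(g) = int (g(x) - y)^2 drho."""
    return np.mean((g(x) - y) ** 2)

f_rho = lambda t: np.sin(2 * np.pi * t)          # regression function (conditional mean)
competitor = lambda t: np.sin(2 * np.pi * t) + 0.2  # a biased competitor

print(E(f_rho), E(competitor))   # roughly 0.09 (noise variance) versus 0.13
```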
For the least-square empirical error, this approximation (the empirical target function) can be computed by solving a linear system when the hypothesis space is a reproducing kernel Hilbert space.

In kernel machine learning one often uses reproducing kernel Hilbert spaces or their balls as hypothesis spaces. Let $K: X \times X \to \mathbb{R}$ be continuous, symmetric and positive semidefinite, i.e., for any finite set of distinct points $\{x_1, \cdots, x_m\} \subset X$, the matrix $(K(x_i, x_j))_{i,j=1}^m$ is positive semidefinite. Such a kernel is called a Mercer kernel. It is called positive definite if the matrix $(K(x_i, x_j))_{i,j=1}^m$ is positive definite.

The Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}_K$ associated with the kernel $K$ is defined (see [2]) to be the closure of the linear span of the set of functions $\{K_x := K(x, \cdot) : x \in X\}$ with the inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}_K} = \langle \cdot, \cdot \rangle_K$ satisfying
$$\Big\| \sum_{i=1}^m c_i K_{x_i} \Big\|_K^2 = \Big\langle \sum_{i=1}^m c_i K_{x_i},\; \sum_{i=1}^m c_i K_{x_i} \Big\rangle_K = \sum_{i,j=1}^m c_i K(x_i, x_j) c_j.$$
The reproducing kernel property is given by
$$\langle K_x, g \rangle_K = g(x), \qquad \forall x \in X,\ g \in \mathcal{H}_K. \tag{1.4}$$
This space can be embedded into $C(X)$, the space of continuous functions on $X$, and we denote the inclusion as $I_K: \mathcal{H}_K \to C(X)$.
If the hypothesis space $H$ is taken to be the RKHS $\mathcal{H}_K$ and $\lambda > 0$, then the empirical target function $f_z$ defined by (1.2) lies in $\mathrm{span}\{K_{x_i}\}_{i=1}^l$:
$$f_z(x) = \sum_{i=1}^{l} c_i K_{x_i}(x).$$
See [9, 22, 20, 8].

As shown in [5, 11, 8], an equivalent method is to take the hypothesis space to be a ball of the RKHS $\mathcal{H}_K$:
$$B_R := \{f \in \mathcal{H}_K : \|f\|_K \le R\}.$$
Here $R > 0$ is the radius. The set $I_K(B_R)$ is a subset of $C(X)$, and we denote by $I_K(B_R)$ also its closure in $C(X)$, which is a compact subset. This is the hypothesis space $H$. In this method, the penalty parameter $\lambda$ vanishes, as the penalty has been controlled by the bound $R$ on the norms of functions in $H$. (The results go back to regularization theory, see [19].) We shall discuss this equivalence again in Section 6.

To estimate the error between the empirical target function $f_z$ and the target function $f_H$, the capacity of the reproducing kernel Hilbert space $\mathcal{H}_K$ is needed in various ways, see e.g. [20, 21, 6, 1, 5, 8, 23]. Here we shall consider the covering number and packing number of the ball $I_K(B_R)$ of the RKHS as a compact subset of $C(X)$.

Definition 1. For a compact set $S$ in a metric space and $\eta > 0$, the covering number $\mathcal{N}(S, \eta)$ is defined to be the minimal integer $m \in \mathbb{N}$ such that there exist $m$ disks with radius $\eta$ covering $S$. The packing number $\mathcal{M}(S, \eta)$ is defined to be the maximal integer $m \in \mathbb{N}$ such that there exist $m$ points $\{x_1, \cdots, x_m\} \subset S$ that are $\eta$-separated, i.e., the distance between $x_i$ and $x_j$ is at least $\eta$ whenever $x_i \ne x_j$.

It is easily seen, and well known, that for any $\eta > 0$,
$$\mathcal{M}(S, 2\eta) \le \mathcal{N}(S, \eta) \le \mathcal{M}(S, \eta). \tag{1.5}$$
Hence the covering number and the packing number are equivalent.
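Since the square loss will be used repeatedly below, it may help to see how the minimization (1.2) over an RKHS is carried out in practice. The Python sketch below is an illustration added here for concreteness, not part of the original text: a standard computation shows that for the square loss the minimizer $f_z = \sum_{i=1}^l c_i K_{x_i}$ is obtained by solving the linear system $(\mathbf{K} + \lambda l I)c = y$, where $\mathbf{K} = (K(x_i, x_j))_{i,j}$ is the kernel matrix on the sample points; the kernel, data and parameter values are arbitrary choices for the demonstration.

```python
import numpy as np

def kernel(x, t, sigma=0.5):
    """Gaussian Mercer kernel K(x, t) = exp(-|x - t|^2 / sigma^2)."""
    return np.exp(-(x - t) ** 2 / sigma ** 2)

# Toy samples z = (x_i, y_i), i = 1, ..., l.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)

lam = 1e-2                                   # penalty parameter lambda in (1.2)
Kmat = kernel(x[:, None], x[None, :])        # kernel matrix (K(x_i, x_j))_{i,j}
c = np.linalg.solve(Kmat + lam * len(x) * np.eye(len(x)), y)

def f_z(t):
    """Empirical target function f_z(t) = sum_i c_i K(x_i, t)."""
    return kernel(x, t) @ c

print(f_z(0.25))   # evaluate the learned function at a new point
```

For the other loss functions listed above the minimization is a convex program rather than a linear system, which is why support vector machines are usually trained by quadratic programming.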
To see how the covering number can be used to bound the error between the empirical target function $f_z$ and the target function $f_H$, we mention the recent work of Cucker and Smale [5]: Let $H$ be a compact and convex subset of $C(X)$. Assume that for all $f \in H$, $|f(x) - y| \le M$ almost everywhere. Then for every $\varepsilon > 0$,
$$\mathrm{Prob}_{z \in (X \times Y)^l}\Big\{ \mathcal{E}(f_z) - \mathcal{E}(f_H) \le \varepsilon \Big\} \ge 1 - \mathcal{N}\Big(H, \frac{\varepsilon}{24M}\Big)\, e^{-\frac{l\varepsilon}{288 M^2}}.$$
Thus, estimates for the covering number of $I_K(B_R)$ are needed for the quantitative error analysis in kernel machine learning.

The upper bounds for the covering number $\mathcal{N}(I_K(B_R), \eta)$ have been investigated in the literature. Some previous methods for estimating the covering number in learning theory (for discrete versions of $I_K(B_R)$) are based on the assumption that the normalized (in $L^2$) eigenfunctions $\{\phi_j\}_{j=1}^\infty$ of the integral operator $L_K$ associated with the kernel $K$ are uniformly bounded, see e.g. [23]. Here, with a Borel measure $\nu$ on $X$, $L_K$ is the compact operator on $L^2_\nu(X)$ given by
$$L_K f(x) = \int_X K(x, y) f(y)\, d\nu(y), \qquad f \in L^2_\nu(X).$$
However, Steve Smale observed that this assumption is not necessarily true, and the author [25] presented a concrete example of a $C^\infty$ Mercer kernel such that the eigenfunctions of $L_K$ are not uniformly bounded. Thus the assumption of uniform boundedness of the eigenfunctions is hard to check. Even for the Gaussian kernel $K(x, t) = k(x - t) = e^{-\frac{|x-t|^2}{2}}$ on $[-1, 1]$, it is unknown whether this assumption holds. The example of a $C^\infty$ kernel presented in [25] tells us that the regularity of the kernel function (or the decay of the eigenvalues of the integral operator $L_K$) does not ensure the validity of this assumption. This assumption was weakened in [10] to the requirement that $\lambda_j \|\phi_j\|_{L^\infty(X)}^2 < \infty$ for each positive eigenvalue $\lambda_j$ of $L_K$, but this weaker assumption is still hard to check. For the Gaussian kernel, the estimate given in [10, Theorem 7] for the covering number depends on the number of samples, compared with those in [5] and [25].
Let us mention that the decay of the eigenvalues of integral operators in terms of the regularity of the kernel is well understood, but the eigenfunctions are much more difficult to handle, and little is known about them. The class of Mercer kernels handled in [23, 10] consists of periodic kernels. However, the RKHS associated with the periodized kernel contains only periodic functions, while most functions in the original RKHS associated with a nonperiodic kernel are not periodic. Hence the estimates for the covering number of the periodized RKHS cannot be used to bound that of the original RKHS. For more details, see the discussion in [25].

Though the eigenfunctions of $L_K$ need not be uniformly bounded, Cucker and Smale [5] were able to show that for a $C^\infty$ Mercer kernel on $X \subset \mathbb{R}^n$, the RKHS $\mathcal{H}_K$ can be embedded into the Sobolev space $H^h$ for any $h \in \mathbb{N}$, using the results on approximation errors given by Smale and Zhou in [18]. Here the space $H^h$ consists of all $L^2$ functions $f$ such that $D^\alpha f \in L^2$ for every $\alpha = (\alpha_1, \cdots, \alpha_n) \in \mathbb{Z}_+^n$ with $\alpha_1 + \cdots + \alpha_n \le h$. With this approach, using the rich knowledge of covering numbers of Sobolev spaces (see, e.g., [7]), they proved in [5] that
$$\ln \mathcal{N}(I_K(B_R), \eta) \le \Big(\frac{R C_h}{\eta}\Big)^{\frac{2n}{h}}. \tag{1.6}$$
Here $C_h$ is a constant independent of $R$ and $\eta$, but it depends on $h$ and tends to infinity as $h \to \infty$.

Many kernels used in kernel machine learning are analytic. For such kernels, estimates of the type (1.6) can be improved from powers to logarithmic powers. More explicitly, if the kernel $K$ takes the convolution form $K(x, t) = k(x - t)$ on $[0, 1]^n$, and if the Fourier transform $\hat{k}$ of $k$ decays exponentially, the covering number of the ball with radius $R$, $\mathcal{N}(I_K(B_R), \eta)$, is shown in [25] to satisfy
$$\ln \mathcal{N}(I_K(B_R), \eta) \le C_{k,n} \Big(\ln \frac{R}{\eta}\Big)^{n+1},$$
where the constant $C_{k,n}$ depends on the kernel and the dimension. This covers many important Mercer kernels. For example, for the Gaussian kernel
$$K(x, t) = \exp\Big\{-\frac{|x - t|^2}{\sigma^2}\Big\}, \qquad x, t \in [0, 1]^n,$$
with $\sigma > 0$, when $0 < \eta < R \exp\{-\frac{90n^2}{\sigma^2} - 11n - 3\}$, we have
$$\ln \mathcal{N}(I_K(B_R), \eta) \le 4^n (6n + 2) \Big(\ln \frac{R}{\eta}\Big)^{n+1}.$$
A natural question to ask is how to deal with Sobolev smooth kernels. In [26] the author proved for $0 < s \le 1$ that $\mathcal{H}_K$ can be embedded into $\mathrm{Lip}\, s/2$ if $K$ is in $\mathrm{Lip}\, s$. The embedding is sharp. Here $\mathrm{Lip}\, s$ denotes the space of all Lipschitz-$s$ functions on $X$ with the norm $\|g\|_{\mathrm{Lip}\, s} := |g|_{\mathrm{Lip}\, s} + \|g\|_{C(X)}$ and the seminorm
$$|g|_{\mathrm{Lip}\, s} := \sup_{x \ne y \in X} \frac{|g(x) - g(y)|}{(d(x, y))^s}.$$
This is a Banach space. The covering number for $I_K(B_R)$ can then be bounded by that of Lipschitz spaces. While differences of function values suffice for studying Lipschitz-$s$ functions with $s \le 1$, higher order divided differences involving the Euclidean structure are needed for $s > 1$. The first purpose of the current paper is to realize this point and extend the above result for $s \le 1$ to higher order regularity. We shall show in Section 5 that if $K$ is $C^s$ (with $s > 0$ not an even integer), then $\mathcal{H}_K$ can be embedded into $C^{s/2}$. Hence the covering number of $I_K(B_R)$ can be estimated using that of Sobolev spaces.

With all the above mentioned results, the upper bounds for the covering numbers of $I_K(B_R)$ are well understood. The second purpose of this paper is to provide some lower bound estimates for the covering numbers and the packing numbers of balls of reproducing kernel Hilbert spaces in learning theory.
For the reader to have some idea of our main results, we briefly mention the estimate for the Gaussian kernels, which follows from Example 1 in Section 4. The upper bound was verified in [25].

Proposition 1. Let $\sigma > 0$, $n \in \mathbb{N}$ and
$$k(x) = \exp\Big\{-\frac{|x|^2}{\sigma^2}\Big\}, \qquad x \in \mathbb{R}^n.$$
Set $X = [0, 1]^n$ and the kernel $K$ as
$$K(x, t) = k(x - t), \qquad x, t \in [0, 1]^n.$$
Then for $0 < \eta \le \frac{R}{2}(\sigma\sqrt{\pi})^{n/2} e^{-\frac{n\sigma^2\pi^2}{8}}$, there holds
$$\ln \mathcal{N}(I_K(B_R), \eta) \ge \ln 2 \,\Big(\frac{\sqrt{2}}{\sqrt{n}\,\sigma\pi}\Big)^n \Big(\ln\frac{R}{\eta} + \frac{n}{2}\ln(\sigma\sqrt{\pi}) - \ln 2\Big)^{n/2} - \ln 2.$$
For $0 < \eta \le R/2$, we have the upper bound
$$\ln \mathcal{N}(I_K(B_R), \eta) \le \Big(3\ln\frac{R}{\eta} + \frac{54n}{\sigma^2} + 6\Big)^n \Big((6n+1)\ln\frac{R}{\eta} + \frac{90n^2}{\sigma^2} + 11n + 3\Big).$$
Another example is the multiquadric kernel $K(x, y) = (\sigma^2 + |x - y|^2)^{-\alpha/2}$ on $[0, 1]^n$. In [25], when $\alpha > n$ and $\sigma > 4 + 2n\ln 4$, the author gave an upper bound for the covering number:
$$\ln \mathcal{N}(I_K(B_R), \eta) \le C' \Big(\ln\frac{R}{\eta}\Big)^{n+1}.$$
By the analysis in this paper we shall show the following lower bound (see Example 2 in Section 4):
$$\ln \mathcal{N}(I_K(B_R), \eta) \ge C \Big(\ln\frac{R}{\eta}\Big)^{n}.$$
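As an aside (this numerical remark is not part of the original text), covering and packing numbers of discretized RKHS balls can also be probed empirically, which is sometimes a useful sanity check against bounds like the ones above. The sketch below uses a simple greedy procedure; the kernel, grid and all parameter values are arbitrary choices, and by (1.5) the size of a $2\eta$-separated set is a lower bound for the covering number at scale $\eta$.

```python
import numpy as np

def greedy_separated(points, eta):
    """Greedily pick an eta-separated subset (sup-norm) of the rows of `points`.
    Its size is a lower bound for the packing number M(S, eta)."""
    centers = []
    for p in points:
        if all(np.max(np.abs(p - q)) >= eta for q in centers):
            centers.append(p)
    return len(centers)

# Random functions from the unit ball of a Gaussian RKHS, restricted to a grid.
rng = np.random.default_rng(1)
grid = np.linspace(0.0, 1.0, 50)
nodes = np.linspace(0.0, 1.0, 10)
K = np.exp(-(nodes[:, None] - nodes[None, :]) ** 2 / 0.1)

samples = []
for _ in range(500):
    c = rng.standard_normal(10)
    c /= np.sqrt(c @ K @ c)          # normalize so that ||f||_K = 1 (radius R = 1)
    samples.append(np.exp(-(grid[:, None] - nodes[None, :]) ** 2 / 0.1) @ c)

m = greedy_separated(np.array(samples), eta=0.5)
print(m)   # a 0.5-separated set of size m, so M(S, 0.5) >= m and, by (1.5), N(S, 0.25) >= m
```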
§2. The Packing Numbers and Nodal Functions

Let $K$ be a Mercer kernel on $X$, and let $\mathbf{x} := \{x_1, \cdots, x_l\} \subset X$ be a set of nodes in $X$. Consider the restriction of functions in $I_K(B_R)$ to these points:
$$I_K(B_R)|_{\mathbf{x}} := \{(f(x_i))_{i=1}^l \in \mathbb{R}^l : f \in I_K(B_R)\}.$$
This is a bounded subset of $\mathbb{R}^l$. We denote by $\mathcal{N}_p(I_K(B_R)|_{\mathbf{x}}, \eta)$ and $\mathcal{M}_p(I_K(B_R)|_{\mathbf{x}}, \eta)$ the covering number and the packing number, respectively, of this subset of $\mathbb{R}^l$ with the $\ell^p$ metric
$$\ell^p\big((a_i)_{i=1}^l, (b_i)_{i=1}^l\big) := \begin{cases} \big\{\sum_{i=1}^l |a_i - b_i|^p\big\}^{1/p}, & \text{if } 1 \le p < \infty, \\ \max_{1 \le i \le l} |a_i - b_i|, & \text{if } p = \infty. \end{cases}$$
It is easily seen that if $I_K(B_R)$ is covered by the union of balls in $C(X)$ with radius $\eta$ and centers $\{f_j \in C(X) : j = 1, \cdots, m\}$, then $I_K(B_R)|_{\mathbf{x}}$ is covered by the union of balls in $(\mathbb{R}^l, \ell^p)$ with radius $l^{1/p}\eta$ and centers $\{(f_j(x_i))_{i=1}^l \in \mathbb{R}^l : j = 1, \cdots, m\}$. Therefore,
$$\mathcal{N}_p(I_K(B_R)|_{\mathbf{x}}, l^{1/p}\eta) \le \mathcal{N}(I_K(B_R), \eta), \qquad \forall \eta > 0.$$
This in connection with (1.5) yields
$$\sup_{\mathbf{x} \in X^l} \mathcal{M}_p(I_K(B_R)|_{\mathbf{x}}, 2l^{1/p}\eta) \le \sup_{\mathbf{x} \in X^l} \mathcal{N}_p(I_K(B_R)|_{\mathbf{x}}, l^{1/p}\eta) \le \mathcal{N}(I_K(B_R), \eta) \le \mathcal{M}(I_K(B_R), \eta). \tag{2.1}$$
Thus, lower bounds for the packing number of the discrete sets $I_K(B_R)|_{\mathbf{x}}$ provide lower bounds for the covering number $\mathcal{N}(I_K(B_R), \eta)$. The covering number and packing number of the discrete version $I_K(B_R)|_{\mathbf{x}}$ are also often used in learning theory. For the equivalence discussion, see [16, 25].

Our lower bound estimates will be obtained by means of nodal functions. These functions are associated with sets of nodes and provide lower bound estimates for the covering number and the packing number of balls of reproducing kernel Hilbert spaces and their discrete versions. According to (2.1), we shall present lower bounds for the covering number and packing number by choosing suitable sets of nodes $\{x_1, \cdots, x_l\}$ and the related nodal functions.

Definition 2. We say that $\{u_i(x) := u_{i,\mathbf{x}}(x)\}_{i=1}^l$ is the set of nodal functions associated with the nodes $\mathbf{x} := \{x_1, \cdots, x_l\} \subset X$ if $u_i \in \mathrm{span}\{K_{x_j}\}_{j=1}^l$ and
$$u_i(x_j) = \delta_{ij} = \begin{cases} 1, & \text{if } j = i, \\ 0, & \text{otherwise.} \end{cases} \tag{2.2}$$
The nodal functions have some nice minimization properties, see [18, 25]. The following proposition characterizes the existence of nodal functions.

Proposition 2. Let $K$ be a Mercer kernel on $X$ and $\mathbf{x} := \{x_1, \cdots, x_l\} \subset X$. Then the following statements are equivalent:
(i) The nodal functions $\{u_i(x) = u_{i,\mathbf{x}}(x)\}_{i=1}^l$ exist.
(ii) The functions $\{K_{x_i}\}_{i=1}^l$ are linearly independent.
(iii) The Gramian matrix $A_{\mathbf{x}} := (K(x_i, x_j))_{i,j=1}^l$ is nonsingular.
(iv) There exists a set of functions $\{f_i\}_{i=1}^l \subset \mathcal{H}_K$ such that $f_i(x_j) = \delta_{ij}$ for $i, j = 1, \cdots, l$.
In this case, the nodal functions are uniquely given by
$$u_i(x) = \sum_{j=1}^{l} (A_{\mathbf{x}}^{-1})_{i,j} K_{x_j}(x), \qquad i = 1, \cdots, l. \tag{2.3}$$

Proof. (i) $\Rightarrow$ (ii). The nodal function property (2.2) tells us that the nodal functions $\{u_i\}$ are linearly independent. Hence (i) implies (ii), as the $l$-dimensional space $\mathrm{span}\{u_i\}_{i=1}^l$ is contained in $\mathrm{span}\{K_{x_i}\}_{i=1}^l$.

(ii) $\Rightarrow$ (iii). A solution $\{d_j\}_{j=1}^l \in \mathbb{R}^l$ of the linear system
$$\sum_{j=1}^{l} K(x_i, x_j) d_j = 0, \qquad i = 1, \cdots, l,$$
satisfies
$$\Big\| \sum_{j=1}^{l} d_j K_{x_j} \Big\|_K^2 = \sum_{i=1}^{l} d_i \sum_{j=1}^{l} K(x_i, x_j) d_j = 0.$$
Then the linear independence of $\{K_{x_j}\}_{j=1}^l$ implies that the linear system has only the zero solution, i.e., $A_{\mathbf{x}}$ is invertible.

(iii) $\Rightarrow$ (iv). When $A_{\mathbf{x}}$ is invertible, the functions $\{f_i\}_{i=1}^l$ given by (2.3) satisfy
$$f_i(x_j) = \sum_{m=1}^{l} (A_{\mathbf{x}}^{-1})_{i,m} K(x_m, x_j) = (A_{\mathbf{x}}^{-1} A_{\mathbf{x}})_{i,j} = \delta_{ij}.$$
These are the desired functions; in fact they are the nodal functions.

(iv) $\Rightarrow$ (i). Let $P_{\mathbf{x}}$ be the orthogonal projection from $\mathcal{H}_K$ onto $\mathrm{span}\{K_{x_i}\}_{i=1}^l$. Then for $i, j = 1, \cdots, l$,
$$P_{\mathbf{x}}(f_i)(x_j) = \langle P_{\mathbf{x}}(f_i), K_{x_j} \rangle_K = \langle f_i, K_{x_j} \rangle_K = f_i(x_j) = \delta_{ij}.$$
So $\{u_i = P_{\mathbf{x}}(f_i)\}_{i=1}^l$ are the desired nodal functions.

When the RKHS has finite dimension $m$, then for any $l \le m$ we can find nodal functions $\{u_j\}_{j=1}^l$ associated with some subset $\mathbf{x} = \{x_1, \cdots, x_l\} \subset X$, while for $l > m$ no such nodal functions exist. When $\dim \mathcal{H}_K = \infty$, then for any $l \in \mathbb{N}$ we can find a subset $\mathbf{x} = \{x_1, \cdots, x_l\} \subset X$ which possesses a set of nodal functions.

§3. Lower Bounds for the Packing Numbers

In this section we present lower bounds for the packing numbers of balls of reproducing kernel Hilbert spaces and their discrete versions. The bounds are realized by nodal functions, and the estimates will be proved in terms of Gramian matrices. For an $m \times m$ matrix $B$, denote by $\|B\|_2$ the operator norm of $B$ on $\ell^2(\mathbb{R}^m)$.

Theorem 1. Let $K$ be a Mercer kernel on $X$, $l \in \mathbb{N}$, and let $\mathbf{x} = \{x_1, \cdots, x_l\} \subset X$ yield an invertible Gramian matrix $A_{\mathbf{x}} := (K(x_i, x_j))_{i,j=1}^l$. Then $\mathcal{M}_p(I_K(B_R)|_{\mathbf{x}}, \eta) \ge 2^l - 1$ provided that
$$\|A_{\mathbf{x}}^{-1}\|_2 \le \frac{1}{l}\Big(\frac{R}{\eta}\Big)^2. \tag{3.1}$$
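Before turning to the proof, the quantities in Proposition 2 and Theorem 1 are easy to explore numerically. The following Python sketch is an illustrative addition, not part of the original argument; the Gaussian kernel, the grid of nodes and all parameter values are arbitrary choices. It builds the Gramian matrix $A_{\mathbf{x}}$, forms the nodal functions via (2.3), verifies the defining property (2.2), and checks condition (3.1).

```python
import numpy as np

def gaussian_kernel(x, t, sigma=0.2):
    """Mercer kernel K(x, t) = exp(-|x - t|^2 / sigma^2) on [0, 1]."""
    return np.exp(-np.abs(x - t) ** 2 / sigma ** 2)

# Nodes x = {x_1, ..., x_l} and the Gramian matrix A_x = (K(x_i, x_j)).
nodes = np.linspace(0.0, 1.0, 6)
A = gaussian_kernel(nodes[:, None], nodes[None, :])
A_inv = np.linalg.inv(A)

def nodal(i, x):
    """Nodal function u_i(x) = sum_j (A^{-1})_{i,j} K_{x_j}(x), as in (2.3)."""
    return A_inv[i, :] @ gaussian_kernel(nodes, x)

# Check the defining property (2.2): u_i(x_j) = delta_{ij}.
U = np.array([[nodal(i, xj) for xj in nodes] for i in range(len(nodes))])
assert np.allclose(U, np.eye(len(nodes)), atol=1e-8)

# Condition (3.1): ||A_x^{-1}||_2 <= (1/l)(R/eta)^2 guarantees
# M_p(I_K(B_R)|_x, eta) >= 2^l - 1 by Theorem 1.
l, R, eta = len(nodes), 1.0, 1e-3
norm_A_inv = 1.0 / np.linalg.eigvalsh(A).min()   # ||A_x^{-1}||_2 for a symmetric matrix
print("||A_x^{-1}||_2 =", norm_A_inv)
print("(1/l)(R/eta)^2 =", (R / eta) ** 2 / l)
print("condition (3.1) holds:", norm_A_inv <= (R / eta) ** 2 / l)
```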
Proof. By Proposition 2, the set of nodal functions $\{u_j(x)\}_{j=1}^l$ associated with $\mathbf{x}$ exists and can be expressed by (2.3). For each nonempty subset $J$ of $\{1, \cdots, l\}$, we define the function $u_J(x) := \sum_{j \in J} u_j(x)$ and choose the vector $(\eta u_J(x_i))_{i=1}^l \in \mathbb{R}^l$ from the set $\{(f(x_i))_{i=1}^l : f \in \mathcal{H}_K\}$. These $2^l - 1$ vectors are $\eta$-separated in $(\mathbb{R}^l, \ell^p)$, since for $J_1 \ne J_2$,
$$\ell^p\big((\eta u_{J_1}(x_i))_{i=1}^l, (\eta u_{J_2}(x_i))_{i=1}^l\big) = \Big(\sum_{i=1}^{l} |\eta u_{J_1}(x_i) - \eta u_{J_2}(x_i)|^p\Big)^{1/p} \ge \eta.$$
What is left is to show that the functions $\eta u_J$ lie in $B_R$. To see this, take $\emptyset \ne J \subset \{1, \cdots, l\}$. Then
$$\|\eta u_J\|_K^2 = \Big\langle \eta \sum_{j \in J}\sum_{m=1}^{l} (A_{\mathbf{x}}^{-1})_{j,m} K_{x_m},\; \eta \sum_{j' \in J}\sum_{s=1}^{l} (A_{\mathbf{x}}^{-1})_{j',s} K_{x_s} \Big\rangle_K = \eta^2 \sum_{j, j' \in J}\sum_{m=1}^{l} (A_{\mathbf{x}}^{-1})_{j,m} \sum_{s=1}^{l} (A_{\mathbf{x}}^{-1})_{j',s} (A_{\mathbf{x}})_{s,m} = \eta^2 \sum_{j, j' \in J} (A_{\mathbf{x}}^{-1})_{j,j'}.$$
P
√ 1/2 It follows that kηuJ kK ≤ η lkA−1 x k2 , and the function ηuJ lies in BR if (3.1) holds. This proves Theorem 1. The estimates in Theorem 1 can be improved by taking integer multiples of uJ , depending on R and η. This is of particular interest when HK is finite dimensional, which includes the important example K(x, y) = (x · y)m with m ∈ IN in [20]. The detailed discussion will be presented somewhere else. To obtain lower bounds for the packing numbers and covering numbers, we need to estimate the norm of inverse of Gramian matrices. Note that kA−1 x k2 equals to the inverse of the smallest eigenvalue of the matrix Ax . Then to provide upper bounds for kA−1 x k2 is the same as having lower bounds for the smallest eigenvalue. However, such lower bounds are difficult to obtain, though upper bounds for the eigenvalues by means of regularity of kernels are standard. §4. Norm of Inverse of the Gramian Matrix In the literature of radial basis functions in approximation theory, the norm of inverse of Gramian matrix (K(xi , xj ))li,j=1 has been extensively investigated in the case that the kernel is a convolution-type, K(x, y) = k(x − y), and k takes the radial form: Z
k(x) =
+∞
2
e−ρ|x| dβ(ρ)
0
12
(4.1)
with β being a finite, nonnegative Borel measure on [0, +∞). For many examples of such kernels, upper bounds for kA−1 x k2 were obtained by Ball [3], Narcowich and Ward [13, 14], see also [12]. Lower bounds have also been estimated in the literature, see e.g., Schaback [17]. Applying the upper bounds there easily yields some lower bound estimates for the packing numbers studied for our purpose of learning theory. Let us illustrate the above idea by a simple example, the Gaussian kernels. Denote α/N as (α1 /N, · · · , αn /N ) for α = (α1 , · · · , αn ) ∈ ZZn . Example 1. Let σ > 0, n ∈ IN and k(x) = exp{−
|x|2 }, σ2
x ∈ IRn .
Set X = [0, 1]n and the kernel K as K(x, t) = k(x − t),
x, t ∈ [0, 1]n .
Take x = {α/N }α∈XN with N ∈ IN and XN := {0, 1, · · · , N − 1}n . Then √ −n −n nσ 2 (N π)2 N exp{ }. kA−1 x k2 ≤ (σ π) 4 √ n/2 − nσ π R 8 e 2 (σ π)
(4.2)
2 2
Hence for 0 < η ≤ ln N (IK (BR ), η)
≥ ln 2
, there holds
≥ ln M∞ (IK (BR ) |x , 2η) n/2 √ n √ R n √ 2 ln( η ) + 2 ln(σ π) − ln 2 nσπ
− ln 2.
Here N is taken to be the integer satisfying √ √ nσ 2 (N π)2 R nσ 2 ((N + 1)π)2 (σ π)−n exp{ } ≤ ( )2 < (σ π)−n exp{ }. 4 2η 4 The methods from the radial basis function literature enable us to provide estimates of norm of inverse of the Gramian matrix kA−1 x k2 . The proof of Example 1 follows from the next Theorem 2 and will be given after Theorem 2.
13
Theorem 2. Suppose K(x, y) = k(x − y) is a Mercer kernel on X = [0, 1]n and the Fourier transform of k is positive: ˆ k(ξ) =
Z IRn
k(x)e−ix·ξ dx > 0.
Take x = {α/N }α∈XN with N ∈ IN and XN := {0, 1, · · · , N }n or {0, 1, · · · , N − 1}n . Then kA−1 x k2
≤N
−n
−1
inf
ξ∈[−N π,N π]
ˆ k(ξ) n
.
(4.3)
Proof. By the inverse Fourier transform, k(x) = (2π)−n
Z IRn
ix·ξ ˆ k(ξ)e dξ,
we know that for any vector c := (cα )α∈XN , there holds R β α −N )·ξ i( N −n ˆ dξ α,β cα cβ (2π) IRn k(ξ)e R P 2 iα·ξ/N | dξ ˆ = (2π)−n IRn k(ξ)| α cα e R P −n N n iα·ξ |2 dξ ˆ = (2π) IRn k(N ξ)| α cα e
cT Ax c
=
P
R P iα·ξ |2 dξ ˆ ≥ (2π)−n N n minξ∈[−N π,N π]n k(ξ) [−π,π]n | α cα e
ˆ = kck2`2 (XN ) N n minξ∈[−N π,N π]n k(ξ) . It follows that the minimal eigenvalue of the matrix Ax is at least Nn
min
ξ∈[−N π,N π]
ˆ k(ξ) . n
Then (4.3) follows. Combining Theorem 2 and Theorem 1 yields lower bound estimates for the covering number and packing number of balls of reproducing kernel Hilbert spaces when K is convolution type and kˆ is positive. We demonstrate this idea by proving the bound in Example 1. Proof of Example 1. The Fourier transform of the Gaussian function k is: ˆ k(ξ) =
√ σ 2 |ξ|2 } > 0. k(x)e−ix·ξ dx = (σ π)n exp{− 4 IRn
Z
14
Then for any N ∈ IN, √ n nσ 2 (N π)2 ˆ π) exp{− }. k(ξ) = (σ 4 ξ∈[−N π,N π]n inf
This in connection with (4.3) in Theorem 2 yields (4.2). Our next step is to derive the lower bound for the packing number. Note that the number of nodes in x is l = N n . Then (4.2) implies √ −n nσ 2 (N π)2 lkA−1 k ≤ (σ π) exp{ }. 2 x 4 It follows that for 0 < η ≤ R, the requirement (3.1) is valid if N satisfies √ nσ 2 (N π)2 R (σ π)−n exp{ } ≤ ( )2 . 4 η √ nσ 2 π 2 Therefore, for 0 < η ≤ R(σ π)n/2 e− 8 , we choose N ∈ IN such that √ √ nσ 2 (N π)2 R nσ 2 ((N + 1)π)2 (σ π)−n exp{ } ≤ ( )2 < (σ π)−n exp{ }. 4 η 4 With this choice, (3.1) holds and √ 1 R N +1 >√ ln( )2 + ln(σ π)n 2 nσπ η
N≥
1/2
.
Thus Theorem 1 gives the lower bound M∞ (IK (BR ) |x , η) ≥ 2l − 1 ≥ 2l−1 = 2N
n −1
,
which yields
ln M∞ (IK (BR ) |x , η) ≥ ln 2 √
1 nσπ
n
√ R ln( )2 + ln(σ π)n η
n/2
− ln 2.
This in connection with (2.1) proves the desired lower bound in Example 1.
We conjecture that the order of the lower bound presented in Example 1 is the same as that of an upper bound, i.e., for 0 < η ≤ R/2, there holds n/2
R ln N (IK (BR ), η) ≤ C ln( ) η
15
.
To show our method, we present another example of multiquadric kernels. Example 2. Let σ > 0, α > n, and k(x) = (σ 2 + |x|2 )−α/2 ,
x ∈ IRn .
Set the Mercer kernel K on X = [0, 1]n as K(x, y) = k(x − y). Then there exists a fixed positive constant Cσ,α,n depending only on σ, α and n such that for 0 < η ≤ Cσ,α,n R, there holds R ln N (IK (BR ), η) ≥ C ln η
n
,
where C is a positive constant independent of R and η. Proof. It is well known in the literature of radial basis functions (see [17]) that for any ε > 0, there are positive constants C1 , C2 such that the Fourier transform of k satisfies ˆ C1 e−(σ+ε)|ξ| ≤ k(ξ) ≤ C2 e−σ|ξ| It follows that inf
ξ∈[−N π,N π]
ˆ k(ξ) ≥ C1 e−(σ+ε) n
∀ξ ∈ IRn . √
nN π
.
Take x = {α/N }α∈XN with N ∈ IN and XN := {0, 1, · · · , N − 1}n . Then Theorem 2 tells us that kA−1 x k2 ≤
1 −n (σ+ε)√nN π N e . C1
√
Set Cσ,α,n := min{
√ C1 −(σ+ε) nπ/2 1 , 2 }, 2 e
a constant depending only on σ, α, n.
For 0 < η ≤ Cσ,α,n R, we choose N ∈ IN such that 1 (σ+ε)√nN π ≤ e C1
R 2η
2
0. Its decay cannot be faster than the Gaussians. This is a strong restriction. Hence the above method does not work for kernel functions whose Fourier transforms have zeros or decay very fast. For example, spline kernels [22] can not be covered. Let us present a simple example to show how to overcome this difficulty. Example 3. Let m ∈ IN, K(x, t) = B2m (x−t) be the Mercer kernel on X = [0, 1], where B2m , the 2m-th cardinal centered B-spline, is the convolution of 2m folds of the characteristic function χ[−1/2,1/2] , i.e., −iξ 2m
ˆ2m (ξ) = ( sin(ξ/2) )2m = 1 − e B ξ/2 iξ
ξ ∈ IR.
,
N
Take x = {j2−N }2j=0−1 . Then 2m (2m−1)N kA−1 2 . x k2 ≤ π
Hence for 0 < η ≤
1 R, 2m+1 π m
there holds
ln 2 R ln N (IK (BR ), η) ≥ ln M∞ (IK (BR ) |x , 2η) ≥ 2π 2η
17
1/m
− ln 2.
n
.
N
2 −1 Proof. For any real sequence c := (cj )j=0 , we have T
c Ax c =
P2N −1
1 j,l=0 cj cl 2π
≥
R
2 1−e−i2N ξ 2m R P2N −1 ijξ dξ IR j=0 cj e i2N ξ 2 Rπ (1 − e−i2N ξ )m P2N −1 cj e−ijξ dξ. j=0 −π j−l
iξ· N ˆ 2 dξ = IR B2m (ξ)e
2(1−2m)N 2π 1+2m
2N 2π
If we define the sequence d := (dj )j∈ZZ by X
j
dj z = (1 − z
2N m
)
N −1 2X
cj z j ,
z ∈ C \ {0},
j=0
j∈ZZ
then cT Ax c ≥
2(1−2m)N 2π 1+2m
Z
π
|
X
−π j∈ZZ
dj e−ijξ |2 dξ =
2(1−2m)N kdk2`2 . π 2m
But dj = cj ,
∀j = 0, 1, · · · , 2N − 1.
Hence cT Ax c ≥
2(1−2m)N kck2`2 . π 2m
Thus, 2m (2m−1)N kA−1 2 . x k2 ≤ π
For 0 < η ≤
1 R, 2m+1 π m
we can choose N ∈ IN such that
π 2m 22mN ≤ (
R 2 ) < π 2m 22m(N +1) . 2η
With this choice, (3.1) holds. By Theorem 1, this implies N
M∞ (IK (BR ) |x , 2η) ≥ 2l − 1 = 22 − 1. Therefore the statements in Example 3 are true.
18
§5. The Regularity of Reproducing Kernel Spaces The analysis for general metric spaces given in [26] applied to closed subsets of IRn . Hence the embedding result can be obtained for Lipschitz continuous kernels. But the special feature of Euclidean spaces enables us to analyze the higher order regularity of reproducing kernel Hilbert spaces, which is the objective of this section. Ideas from approximation theory are used for our purpose here. To measure the higher order regularity, we shall use the generalized Lipschitz spaces which are defined by means of divided differences. Let X be a closed subset of IRn . For r ∈ IN, t ∈ IRn , and a function f on X, define the divided difference ∆rt f (x) :=
r X r j=0
j
(−1)r−j f (x + jt),
if x, x + t, · · · , x + rt ∈ X.
In particular, when r = 1, ∆1t f (x) = f (x + t) − f (x). The divided difference can be used to characterize various kinds of function spaces. For 0 < s < r, the generalized Lipschitz space Lip*(s, C(X)) (or the Zygmund-H¨older space) consists of continuous functions f on X with the norm kf kLip*s := |f |Lip*s + kf k∞ and the seminorm |∆rt f (x)| : |f |Lip*s := sup |t|s
x, x + t, · · · , x + rt ∈ X .
This is a Banach space. Under some mild regularity conditions for X (for example, when the boundary of X is minimally smooth), for s = l + s0 with 0 < s0 < 1, Lip*(s, C(X)) consists of continuous functions on X such that Dµ f ∈ Lip s0 for any multi-integers µ with |µ| ≤ l. In particular, this is the case for X = [0, 1]n or IRn : for s not being an integer, Lip*(s, C(X)) = C s (X) which consists of C l functions f with Dµ f ∈ Lip s0 for any |µ| = l; while for 19
s being an integer C s ⊂ Lip*(s, C(X)) (The Zygmund class is involved for Lip*(s, C(X)) in this case). Now we can state our embedding result for subsets of IRn . Theorem 3. Let X be a closed subset of IRn , and K : X × X → IR be a Mercer kernel. If s > 0 and K ∈ Lip*(s, C(X × X)), then HK ⊂ Lip*( 2s , C(X)) and kf kLip* s ≤
q
2r+1 kKkLip*s kf kK ,
2
∀f ∈ HK .
Hence HK is embedded into Lip*( 2s , C(X)). Proof. Let s < r ∈ IN and f ∈ HK . Let x, t ∈ IRn such that x, x+t, · · · , x+ rt ∈ X. The reproducing propert (1.4) tells us that ∆rt f (x)
=
HK .
It follows that |∆rt f (x)|
≤ kf kK
Pr
r j
(−1)r−j
j=0
= kf kK
Pr
r j
j=0
Pr
i=0
r i
(−1)r−i K(x
(−1)r−j ∆r(0,t) K(x
1/2
+ jt, x + it) 1/2
+ jt, x)
.
Here (0, t) denotes the vector in IR2n with the first n components being zero. By our assumption, K ∈ Lip*(s, C(X × X)). Hence r s ∆ ≤ |K| K(x + jt, x) (0,t) Lip*s |t| .
This yields |∆rt f (x)|
≤ kf kK
X r j=0
r j
s
|K|Lip*s |t|
1/2
≤
q
2r |K|Lip*s |t|s/2 kf kK .
Therefore, |f |Lip* s ≤
q
2
2r |K|Lip*s kf kK .
Combining with the fact that kf k∞ ≤
q
kKk∞ kf kK ,
20
we know that f ∈ Lip*( 2s , C(X)) and kf kLip* s ≤ 2
q
2r+1 kKkLip*s kf kK ,
∀f ∈ HK .
The proof of Theorem 3 is complete. Recall that C s (X) ⊂ Lip*(s, C(X)) for any s > 0. When s is not an integer, and the boundary of X is piecewisely smooth, C s (X) = Lip*(s, C(X)). As a corollary of Theorem 3, we have Proposition 3. Let X = [0, 1]n or IRn , and K : X × X → IR be a Mercer kernel. If s > 0 is not an even integer, and K ∈ C s (X × X), then s
HK ⊂ C 2 (X) and kf kLip* s ≤ 2
q
2r+1 kKkLip*s kf kK ,
∀f ∈ HK .
s
Hence HK is embedded into C 2 (X). Our embedding result yields the upper bound estimates for the covering number concerning reproducing kernel Hilbert spaces. It is a classical result from the theory of function spaces (see [7]) that the covering number for the ball of the generalized Lipschitz space BR (Lip*(s, C[0, 1]n )) has the asymptotic behavior R R Cs ( )n/s ≤ ln N (BR (Lip*(s, C[0, 1]n )), η) ≤ Cs0 ( )n/s , η η where the positive constants Cs , Cs0 are independent of R and η. This in connection with Proposition 3 gives the following upper bound estimate. Theorem 4. Let K : [0, 1]n × [0, 1]n → IR be a Mercer kernel. If s > 0 and K lies in Lip*(s, C([0, 1]n × [0, 1]n )), then R ln N (IK (BR ), η) ≤ C( )2n/s , η where C is a constant independent of R, η > 0. Let us present some more examples for upper bound estimates. The first is the upper bound for Example 3. Notice that B2m lies in Lip∗ (2m − 1) but not in Lip∗ (2m − 1 + δ) for any δ > 0. 21
Example 4. Let m ∈ IN, K(x, t) = B2m (x − t) be the Mercer kernel on X = [0, 1] given in Example 3, where −iξ 2m
ˆ2m (ξ) = ( sin(ξ/2) )2m = 1 − e B ξ/2 iξ
ξ ∈ IR.
,
Then there holds for 0 < η ≤ R,
C
R η
1
m
≤ ln N (IK (BR ), η) ≤ C 0
R η
1 m−1/2
.
The next example consists of a family of reproducing kernel Hilbert spaces constructed from compactly supported positive definite radial basis functions in [24]. These examples are dimension dependent. Denote (x)+ := min{0, x} for x ∈ IR. Let d ∈ ZZ+ and fd (r) := (1 − r2 )d+ for r ∈ IR. Define the univariate function φd,0 as the convolution of fd with itself: Z
φd,0 (r) :=
+∞
−∞
fd (r − t)fd (t)dt,
r ∈ IR
and for m = 1, · · · , d, φd,m (r) = (−
1 d )φd,m−1 (r). r dr
Define the radial basis function Φd,m on IRn by Φd,m (x) := φd,m (|x|),
x ∈ IRn .
Then Φd,m induces a Mercer kernel K(x, y) = Φd,m (x − y) on any closed subsets of IRn . Moreover, the kernel is in Lip*(2d − 2m + 1), but not in Lip*(2d − 2m + 1 + δ) for any δ > 0. It follows from Theorem 3 that the RKHS induced by Φd,m can be imbedded into Lip*(d − m + 21 ). Example 5. Let d ∈ IN, 1 ≤ m ≤ d and Φd,m , K be defined as above. Then there holds R ln N (IK (BR ), η) ≤ C η
22
n d−m+1/2
.
§6. Connections and Conclusions In this paper we have provided some estimates for upper and lower bounds of the covering number and packing number of balls of reproducing kernel Hilbert spaces. These give some connections to other quantitative analysis of sample errors for learning. We first show this by applying the lower bound to the Vγ -dimension [1]. Definition 3. Let H ⊂ C(X) and γ > 0. We say H Vγ -shatters a set A ⊆ X if there exists a constant α ∈ IR such that, for every E ⊆ A, there is some function fE ∈ H satisfying
fE (x)
≥ α + γ, ≤ α − γ,
∀x ∈ E, ∀x ∈ A \ E.
The Vγ -dimension of H, denoted Vγ -dim (H), is defined to be the maximal cardinality of a set A ⊆ X that is Vγ -shattered by H. The nodal functions used in Theorem 1 yields the following lower bound for the Vγ -dimension of balls of RKHS. Proposition 4. Let K be a Mercer kernel on X, l ∈ IN, and x = {x1 , · · · , xl } ⊂ X yield an invertible Gramian matrix Ax := (K(xi , xj ))li,j=1 . If (3.1) holds, then Vη/2 -dim(IK (BR )) ≥ l. The proof follows from that of Theorem 1 by taking A = {xi }li=1 , α = γ = η/2 and fE (x) = ηuJ (x) for E = {xj }j∈J ⊆ A with J ⊆ {1, . . . , l}. More generally, lower bounds of the packing number and covering number for general hypothesis spaces also provide lower bounds for the Pγ dimension. This can be found in [1, Lemmas 3.3 and 3.5]. Let us then show an application of the upper bounds to the relation between Tikhonov and Ivanov regularizations. Take V in (1.2) as the square loss. Recall that the empirical target function for the Tikhonov regularization is given by (1.2) with HK , while for the Ivanov regularization with
23
(R)
R > 0, the empirical target function fz fz(R) := arg
min
Ez (f ) = arg
f ∈IK (BR )
is defined as
l 1X (f (xi ) − yi )2 . l f ∈IK (BR ) i=1
min
As IK (BR )|x is a finite dimensional space, it equals to the restriction of (R)
span {Kxi }li=1 onto x. As in [26], we may take the minimizer fz
to be in
BR ⊂ HK . Proposition 5. Let ρ be a probability measure on X ×Y such that |y| ≤ M almost everywhere, and λ > 0, R > 0. Set Rλ :=
p
R2 + M 2 /λ. Then for
every ε > 0, we have
E fz(Rλ ) − ε ≤ E(fz ) ≤ E fz(R) + ε + λR2 with confidence at least
ε lε2 p . exp − p 16(M + Rλ kKk∞ ) 32(M + Rλ kKk∞ )4
1 − 2N IK (BRλ ),
Proof. By the definition (1.2), Ez (fz ) + λkfz k2K ≤ Ez (fz(R) ) + λkfz(R) k2K . (R)
As fz
(R)
∈ IK (BR ), we have kfz kK ≤ R. Since IK (BR ) contains the zero (R)
function, Ez (fz ) ≤ Ez (0) =
1 l
Pl
2 i=1 |yi |
≤ M 2 . Thus we have
Ez (fz ) ≤ Ez (fz(R) ) + λR2
(5.1)
and kfz k2K ≤ R2 + M 2 /λ = Rλ2 which implies that fz ∈ BRλ and Ez (fz(Rλ ) ) ≤ Ez (fz ).
(5.2)
Recall that Ez is a good approximation of E in probability, which can be seen from the following error estimates (e.g. [5]):
Prob
sup
|Ez (f )−E(f )| ≥ η ≤ 2N IK (BRλ ),
f ∈IK (BRλ )
24
η lη 2 exp − , 8MR 4(2MR4 + MR2 η/3)
where the constant MR bounds |f (x) − y| almost everywhere. For the hypothesis space IK (BRλ ), this constant can be chosen as MR = M + Rλ kKk∞ since for every f in this space, |f (x)| ≤ Rλ (supx∈X K(x, x))1/2 . p
(R)
(Rλ )
Notice that fz , fz , fz
∈ IK (BRλ ). Then we can take η = ε/2 and
conclude from (5.1) and (5.2) that
E
fz(Rλ )
− ε ≤ E(fz ) ≤ E
fz(R)
+ ε + λR2
with confidence at least
1 − 2N IK (BRλ ),
ε lε2 . exp − 16MR 32MR4
This proves Proposition 5. The last comparison we want to make is the one with learning schemes used in the literature of neural networks. There the hypothesis space H depends on the number N of nodes, and is taken to be linear combinations of N fundamental functions with uniformly bounded coefficients. In [4], the fundamental functions have the form φ(ak · x + bk ) for some sigmoidal function. In [15], the fundamental functions have the form exp{−|x − ti |2 /σ 2 }. For such hypothesis space, one can estimate the sample error by bounding the capacity for individial component fundamental functions. Compared with that, the empirical target function fz discussed in this paper takes the form fz =
Pl
i=1 ci Kxi ,
and is a linear combination of l fundamental func-
tions in the RKHS, hence the combination coefficients are not necessarily uniformly bounded. Therefore, the setting we consider here is different from those in [4, 15] and the upper bounds for the covering number are needed for our purposes. To finish our discussion we mention an open problem of getting upper and lower bounds for families of reproducing kernel Hilbert spaces in order to study Glivenko-Cantelli classes [6] for the purpose of multiscale kernels. For example, consider the family of Gaussian kernels Kσ (x, y) = exp{−|x − 25
y|2 /σ 2 } on [0, 1]n with 0 < σ < ∞. Take the union of balls of HKσ with a uniform radius 1 as the hypothesis space:
H := ∪σ>0 f (x) =
l X
ci Kσ (xi , x) :
n
xi ∈ [0, 1] ,
i=1
l X
ci Kσ (xi , xj )cj ≤ 1 .
i,j=1
It is unknown to us whether this is a uniform Glivenko-Cantelli class. Notice that the coefficients (ci ) in the linear combination are not necessarily uniformly bounded, compared with [4, 15]. To solve our problem, consistent upper and lower bounds for the capacity of H would be desirable. The upper bounds for the covering number of HKσ given by the author in [25] (see also Proposition 1 in this paper) depends on σ and are not sufficient to verify the uniform convergence. The lower bound presented in Example 1 in √ nσ 2 π 2 this paper is valid only for η < 21 (σ π)n/2 e− 8 which tends to zero when σ → 0. So it is not sufficient to disprove the uniform convergence. Similar difficulty exists for the family of B-splines B2m with various order m ∈ IN investigated in Examples 3 and 4. Acknowledgments The author would like to thank the referees for valuable comments and suggestions. References [1]
N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler, Scale-sensitive dimensions, uniform convergence, and learnability, J. ACM 44 (1997), 615–631.
[2]
N. Aronszajn, Theory of reproducing kernels, Trans. Amer. Math. Soc. 68 (1950), 337–404.
[3]
K. Ball, Eigenvalues of Euclidean distance matrices, J. Approx. Theory 68 (1992), 74–82.
[4]
A. Barron, Approximation and estimation bounds for artificial neural networks, Mach. Learn. 14 (1994), 115–133.
[5]
F. Cucker and S. Smale, On the mathematical foundations of learning, Bull. Amer. Math. Soc. 39 (2002), 1–49.
[6]
R. Dudley, E. Giné, and J. Zinn, Uniform and universal Glivenko-Cantelli classes, J. Theoret. Prob. 4 (1991), 485–510.
[7]
D. Edmunds and H. Triebel, Function Spaces, Entropy Numbers, Differential Operators, Cambridge University Press, 1996.
[8]
T. Evgeniou, M. Pontil, and T. Poggio, Regularization networks and support vector machines, Adv. Comput. Math. 13 (2000), 1–50.
[9]
F. Girosi and T. Poggio, Networks and the best approximation property, Biological Cybernetics 63 (1990), 169–176.
[10]
Y. Guo, P. L. Bartlett, J. Shawe-Taylor, and R. C. Williamson, Covering numbers for support vector machines, IEEE Trans. Inform. Theory 48 (2002), 239–250.
[11]
S. Mukherjee, R. Rifkin, and T. Poggio, Regression and classification with regularization, preprint, 2001.
[12]
F. J. Narcowich, N. Sivakumar, and J. D. Ward, On condition numbers associated with radial-function interpolation, J. Math. Anal. Appl. 186 (1994), 457–485.
[13]
F. J. Narcowich and J. D. Ward, Norms of inverses and condition numbers for matrices associated with scattered data, J. Approx. Theory 64 (1991), 69–94.
[14]
F. J. Narcowich and J. D. Ward, Norm estimates for the inverses of a general class of scattered-data radial-function interpolation matrices, J. Approx. Theory 69 (1992), 84–109.
[15]
P. Niyogi and F. Girosi, On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions, Neural Comp. 8 (1996), 819–842.
[16]
M. Pontil, A note on different covering numbers in learning theory, preprint, 2001.
[17]
R. Schaback, Lower bounds for norms of inverses of interpolation matrices for radial basis functions, J. Approx. Theory 79 (1994), 287–306.
[18]
S. Smale and D. X. Zhou, Estimating the approximation error in learning theory, Anal. Appl. 1 (2003), 1–25.
[19]
A. N. Tikhonov and V. Y. Arsenin, Solution of Ill-Posed Problems, W. H. Winston, 1977.
[20]
V. Vapnik, Statistical Learning Theory, John Wiley & Sons, 1998.
[21]
V. Vapnik and A. Chervonenkis, On the uniform convergence of relative frequencies of events to their probabilities, Theory Prob. Appl. 16 (1971), 264–280.
[22]
G. Wahba, Spline models for observational data, SIAM, 1990.
[23]
R. C. Williamson, A. J. Smola, and B. Schölkopf, Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators, IEEE Trans. Inform. Theory 47 (2001), 2516–2532.
[24]
Z. Wu, Compactly supported positive definite radial functions, Adv. Comput. Math. 4 (1995), 283–292.
[25]
D. X. Zhou, The covering number in learning theory, J. Complexity 18 (2002), 739–767.
[26]
D. X. Zhou, Conditionally reproducing kernel spaces in learning theory, preprint, 2001.