Generalization Bounds of Regularization Algorithm with Gaussian Kernels

Feilong Cao1†, Yufang Liu2∗, Weiguo Zhang2

1. Department of Information and Mathematics Sciences, China Jiliang University, Hangzhou 310018, Zhejiang Province, P R China.
2. School of Business Administration, South China University of Technology, Guangzhou 510641, Guangdong Province, P R China.
Abstract: In many practical applications, the performance of a learning algorithm is not determined by a single factor alone, such as the complexity of the hypothesis space, the stability of the algorithm, or the quality of the data. This paper addresses the performance of the regularization algorithm associated with Gaussian kernels. The main purpose is to provide a framework for evaluating the generalization performance of the algorithm jointly in terms of hypothesis space complexity, algorithmic stability and data quality. New bounds on the generalization error of this algorithm, measured by the regularization error and the sample error, are established. It is shown that the regularization error decays polynomially under some conditions, and that the new bounds depend simultaneously on the uniform stability of the algorithm, the covering number of the hypothesis space, and the data information. As an application, the obtained results are applied to several special regularization algorithms, and some new results for these algorithms are deduced.

Keywords: Regularization algorithm; Gaussian kernels; Uniform stability; Hypothesis space; SVM; Generalization error
∗ This research was supported by the National Natural Science Foundation of China (Nos. 61272023, 61101240) and the major program of the National Social Science Foundation of China (No. 11&ZD156).
† Corresponding author: Feilong Cao, E-mail: [email protected]
1 Introduction
In recent years, there has been a great increase in interest in theoretical issues in machine learning. A main concern among these issues is generalization performance, which has been explored in much literature, such as [2], [4], [8], [10], [20], [24] and [26]. Perhaps the first work in this direction belongs to Vapnik and Chervonenkis [23], where generalization bounds for classification algorithms based on uniform convergence were derived. Until recently, most approaches to studying the generalization performance of a learning machine were based on the theory of uniform convergence of empirical risks to their expected risks. Usually, a measure of space complexity, for instance, the covering number (see [8], [10], [12], [26]), the VC-dimension (see [24]), or the Vγ-dimension (see [14]), is utilized to estimate the bound on the difference between empirical risks and their expected risks. For example, Vapnik [24] established exponential bounds on the rate of uniform convergence and relative uniform convergence for a set of loss functions based on the VC-dimension, and then obtained generalization bounds for empirical risk minimization (ERM) algorithms. Cucker and Smale [10] considered the least squares error and obtained generalization bounds based on the covering number of the hypothesis space. Bousquet [5] derived a generalization of Vapnik and Chervonenkis' relative error inequality by using a new measure of the size of function classes, the local Rademacher average. Zou et al. [29] established bounds on the rate of uniform convergence of learning machines for independent and identically distributed sequences over sets of admissible functions from which noise has been eliminated. On the other hand, the notion of algorithmic stability has also been used effectively to derive tight generalization bounds. Roughly speaking, an algorithm is called stable if its output changes only slightly when a single point of the training set is changed. This concept was first introduced and used by Devroye and Wagner [13] in 1979. Since then, Bousquet [4], Kearns and Ron [16], and Kutin and Niyogi [17] have introduced various definitions of algorithmic stability, such as uniform hypothesis stability, error stability and training stability, and in terms of these notions they obtained generalization bounds for learning algorithms. However, all of these derived bounds are independent of the complexity
of the hypothesis space. In real applications of machine learning, the performance of a learning algorithm is affected not only by the complexity of the hypothesis space, the stability of the learning algorithm and the data information, but also by other factors such as the sampling mechanism and the sample quality. Chang et al. [7] studied regularization algorithms simultaneously through hypothesis space complexity, algorithmic stability and data quality, where general kernel functions are considered and K-functional techniques are used. Since Gaussian kernels are the most widely used kernels and in many cases give better performance, we study in this paper the generalization performance of the regularization algorithm associated with Gaussian kernels, measured by the regularization error and the sample error. We derive generalization bounds for the regularization algorithm based simultaneously on uniform stability, the covering number and the data information. We also establish polynomial decay of the regularization error under some assumptions, following the method used in [27], and we adopt the notion of uniform stability to bound the sample error. The obtained bounds are thus affected not only by the algorithmic stability and the complexity of the hypothesis space, but also by the data information and the sample quality. The outline of this paper is as follows. In Section 2, we introduce the mathematical framework and notations used throughout the paper and present several important inequality tools. In Section 3, we state our main result for the generalization error by estimating the bounds for the regularization error and the sample error respectively. In Section 4, we demonstrate some concrete applications of our results. Finally, we conclude the paper in Section 5.
2 Preliminaries
In this section, we introduce some notations and main tools which will be used throughout the paper.
2.1 Definitions and notations
Let X and Y be the input and output spaces respectively, where X is a compact subset of $\mathbb{R}^n$ and Y ⊂ R. The product space Z = X × Y is assumed to be endowed with an unknown probability distribution ρ. We consider a training set z = {z_1 = (x_1, y_1), z_2 = (x_2, y_2), ..., z_m = (x_m, y_m)} of size m in Z = X × Y drawn independently and identically distributed (i.i.d.) from the unknown distribution ρ. Given a training set z of size m, we build, for all i = 1, 2, ..., m, modified training sets as follows: replace the i-th element of the training set z to obtain $z^i = \{z_1, z_2, \ldots, z_{i-1}, z_i', z_{i+1}, \ldots, z_m\}$, where the replacement sample $z_i'$ is drawn from Z according to ρ, independently of z.

Let k : X × X → R be continuous, symmetric, and positive semidefinite, i.e., for any finite set of distinct points {x_1, x_2, ..., x_m} ⊂ X, the matrix $(k(x_i, x_j))_{i,j=1}^m$ is positive semidefinite. Such a function is called a Mercer kernel. The reproducing kernel Hilbert space (RKHS) $H_k$ associated with the kernel k is defined (see [1]) to be the closure of the linear span of the set of functions $k_x := \{k(x, \cdot) : x \in X\}$ with the inner product $\langle \cdot, \cdot \rangle_{H_k} = \langle \cdot, \cdot \rangle_k$ satisfying $\langle k_x, k_y \rangle_k = k(x, y)$. The reproducing property takes the form
$$\langle k_x, f \rangle_k = f(x), \qquad \forall x \in X,\ f \in H_k.$$
Let $\kappa = \sup_{x \in X} \sqrt{k(x, x)}$; then the above reproducing property tells us that
$$\|f\|_\infty \le \kappa \|f\|_k, \qquad f \in H_k.$$
In this paper we are mainly interested in Gaussian kernels, which are the most widely used kernels in practice. Recall that these kernels are of the form
$$k_\sigma(x, x') = \exp\left(-2\sigma^2 \|x - x'\|^2\right), \qquad x, x' \in X,$$
where σ > 0 is a free parameter whose inverse 1/σ is called the width of $k_\sigma$. We denote the corresponding RKHS, which is thoroughly described in [21], by $H_\sigma(X)$ or simply $H_\sigma$. For Gaussian kernels we easily get κ = 1.
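To make the kernel concrete, the following short Python sketch (our illustration, not part of the original paper; the helper names are ours) builds the Gaussian kernel matrix $k_\sigma(x, x') = \exp(-2\sigma^2\|x - x'\|^2)$ for a small sample and checks that its diagonal equals 1, i.e., κ = 1, and that the matrix is positive semidefinite as a Mercer kernel matrix should be.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    """Gaussian kernel k_sigma(x, x') = exp(-2 * sigma^2 * ||x - x'||^2)."""
    # Squared Euclidean distances between all pairs of rows of X1 and X2.
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-2.0 * sigma**2 * sq_dists)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(-1.0, 1.0, size=(5, 3))          # 5 points in R^3
    K = gaussian_kernel(X, X, sigma=0.5)
    # k_sigma(x, x) = exp(0) = 1, so kappa = sup_x sqrt(k(x, x)) = 1.
    print(np.allclose(np.diag(K), 1.0))              # True
    # The kernel matrix is positive semidefinite (Mercer kernel).
    print(np.all(np.linalg.eigvalsh(K) >= -1e-10))   # True
```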
For a compact set H in a metric space and ε > 0, the covering number N(H, ε) of the function class H is the minimal integer k ∈ N such that there exist k balls in H with radius ε covering H. Suppose that $H_\sigma$ is contained in a ball $B_R = \{f : \|f\|_{H_\sigma} \le R\}$, R > 0, on X. This implies that there exists a constant $C_0$ depending only on X and n such that (see [28])
$$\log N(B_R, \varepsilon) = \log N\Big(B_1, \frac{\varepsilon}{R}\Big) \le C_0\left(\Big(\log\frac{R}{\varepsilon}\Big)^{n+1} + \frac{1}{\sigma^{2(n+1)}}\right), \qquad 0 < \varepsilon < 1,\ 0 < \sigma \le 1. \tag{2.1}$$
The goal of machine learning from random samples is to find a function f that assigns values to objects such that, when new objects are given, f forecasts them correctly. Let
$$E(f) = E[l(f, z)] = \int_Z l(f, z)\, d\rho$$
be the expected risk (error) of the function f, where the nonnegative function l, which is integrable for any f and depends on f, is called the loss function. In this paper, we suppose that
$$B = \sup_{f \in H_\sigma} \max_{z \in Z} l(f, z), \qquad L = \sup_{f_1, f_2 \in H_\sigma,\, f_1 \neq f_2} \max_{z \in Z} \frac{|l(f_1, z) - l(f_2, z)|}{|f_1 - f_2|}, \tag{2.2}$$
where B and L are both finite.

Remark 1 Notice that if the loss function l is convex when viewed as a univariate function (of the value f(x)), then, whenever the boundedness assumption in (2.2) is met, the quantity
$$L = \sup_{f_1, f_2 \in H_\sigma,\, f_1 \neq f_2} \max_{z \in Z} \frac{|l(f_1, z) - l(f_2, z)|}{|f_1 - f_2|}$$
is well defined and finite.

According to the idea that the quality of the chosen function can be evaluated by the expected risk E(f), the required function should be chosen from the set $H_\sigma$ so as to minimize the expected risk E(f) based on the sample set z. We cannot minimize the expected risk E(f) directly since the distribution ρ is unknown. By the principle of empirical risk minimization (ERM for short), we minimize, instead of the expected risk E(f), the so-called empirical risk (error)
$$E_z(f) = \frac{1}{m} \sum_{i=1}^{m} l(f, z_i).$$
Now the regularization algorithm associated with the Gaussian kernel $k_\sigma$ is defined to be the minimizer of the following optimization problem:
$$f_z = \arg\min_{f \in H_\sigma} \left\{ \frac{1}{m} \sum_{i=1}^{m} l(f, z_i) + \lambda \|f\|_{H_\sigma}^2 \right\}. \tag{2.3}$$
Here λ is a positive constant called the regularization parameter. Usually it is chosen to depend on m: λ = λ(m), and $\lim_{m \to \infty} \lambda(m) = 0$. Let $f_\rho$ be a function minimizing the expected risk E(f) over all measurable functions, i.e.,
$$f_\rho = \arg\min_{f} \int_Z l(f, z)\, d\rho. \tag{2.4}$$
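As an illustration of (2.3) (ours, not part of the original paper), the following Python sketch solves the regularized problem for the special case of the squared loss l(f, z) = (f(x) − y)². By the representer theorem the minimizer has the form $f_z = \sum_j \alpha_j k_\sigma(x_j, \cdot)$, and under this assumption the coefficients solve the linear system (K + λmI)α = y; the helper names and the synthetic data are ours.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-2.0 * sigma**2 * sq_dists)

def fit_regularized(X, y, sigma, lam):
    """Minimize (1/m) * sum (f(x_i) - y_i)^2 + lam * ||f||_{H_sigma}^2.

    With f_z = sum_j alpha_j k_sigma(x_j, .), setting the gradient to zero
    gives (K + lam * m * I) alpha = y.
    """
    m = len(y)
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * m * np.eye(m), y)

def predict(X_train, alpha, X_new, sigma):
    return gaussian_kernel(X_new, X_train, sigma) @ alpha

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.uniform(-1.0, 1.0, size=(50, 1))
    y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(50)
    alpha = fit_regularized(X, y, sigma=1.0, lam=1e-2)   # lam = lambda(m), small for large m
    print(predict(X, alpha, np.array([[0.2]]), sigma=1.0))
```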
Remark 2 We will assume throughout this paper that $f_\rho$ is bounded by M and satisfies (2.2). Our main aim is to estimate the difference $E(f_z) - E(f_\rho)$, which is the so-called generalization error.
2.2 Main tools
In this subsection we present some useful tools. We first introduce the notion of uniform stability. Let $f_{z^i}$ be the function minimizing the regularized empirical risk over $H_\sigma$ defined by the set $z^i$, 1 ≤ i ≤ m, i.e.,
$$f_{z^i} = \arg\min_{f \in H_\sigma} \left\{ \frac{1}{m} \sum_{j=1}^{m} l(f, z_j) + \lambda \|f\|_{H_\sigma}^2 \right\}, \qquad z_j \in z^i. \tag{2.5}$$

Definition 2.1 (see [4]) The regularization algorithm (2.3) is said to be $\beta_m$-uniformly stable with respect to the loss function l if there is a nonnegative constant $\beta_m$ (where m is the size of the sample set z) such that
$$\forall z \in Z,\ \forall i \in \{1, 2, \ldots, m\}, \qquad \|l(f_z, z) - l(f_{z^i}, z)\|_\infty \le \beta_m, \tag{2.6}$$
where $\beta_m \to 0$ as $m \to \infty$.

Remark 3 The above definition differs slightly from [4], where one uses samples obtained by removing the i-th element, whereas here we use samples whose i-th element is replaced by a new one.
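The uniform stability of a concrete algorithm can be probed numerically: retrain after replacing one sample and record the largest change in the loss. The sketch below (our illustration, re-deriving the squared-loss solver from the previous snippet for self-containment; the data-generating model and helper names are ours) only lower-bounds the supremum in (2.6) over the points actually tried, so it is a diagnostic, not part of the paper's analysis.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-2.0 * sigma**2 * sq)

def fit(X, y, sigma, lam):
    m = len(y)
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * m * np.eye(m), y)

def sq_loss(f_val, y):
    return (f_val - y) ** 2

def empirical_beta(X, y, sigma, lam, X_test, y_test, rng):
    """Lower bound on beta_m: max change of the loss when one sample is replaced."""
    m = len(y)
    alpha = fit(X, y, sigma, lam)
    f_test = gaussian_kernel(X_test, X, sigma) @ alpha
    beta = 0.0
    for i in range(m):
        Xi, yi = X.copy(), y.copy()
        # Replace z_i by a fresh sample z_i' drawn from the same distribution.
        Xi[i] = rng.uniform(-1.0, 1.0, size=X.shape[1])
        yi[i] = np.sin(3 * Xi[i, 0]) + 0.1 * rng.standard_normal()
        alpha_i = fit(Xi, yi, sigma, lam)
        f_i_test = gaussian_kernel(X_test, Xi, sigma) @ alpha_i
        beta = max(beta, np.max(np.abs(sq_loss(f_test, y_test) - sq_loss(f_i_test, y_test))))
    return beta

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = rng.uniform(-1.0, 1.0, size=(40, 1))
    y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(40)
    Xt = rng.uniform(-1.0, 1.0, size=(200, 1))
    yt = np.sin(3 * Xt[:, 0]) + 0.1 * rng.standard_normal(200)
    print(empirical_beta(X, y, sigma=1.0, lam=1e-2, X_test=Xt, y_test=yt, rng=rng))
```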
As a very important quantity which simultaneously contains information about the hypothesis space and the training samples, the empirical covering number is defined as follows.

Definition 2.2 (see [18] and [24]) Let H be a class of bounded functions defined on X, and let $x = (x_i)_{i=1}^m \in X^m$,
$$H|_x = \{(f(x_i))_{i=1}^m : f \in H\} \subset \mathbb{R}^m.$$
For 1 ≤ p ≤ ∞, we define the p-norm empirical covering number of H by
$$N_{p,x}(H, \varepsilon) = N(H|_x, \varepsilon, d_p), \qquad N_p(H, \varepsilon, m) = \sup_{x \in X^m} N_{p,x}(H, \varepsilon), \qquad N_p(H, \varepsilon) = \sup_{m \in \mathbb{N}_+} N_p(H, \varepsilon, m),$$
where $d_p(r_1, r_2) = \big( \frac{1}{m} \sum_{i=1}^{m} |r_{1i} - r_{2i}|^p \big)^{1/p}$ is the $\ell^p$-metric on the Euclidean space $\mathbb{R}^m$ for all $r_1 = (r_{1i})_{i=1}^m,\ r_2 = (r_{2i})_{i=1}^m \in \mathbb{R}^m$.
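For intuition (our illustration, not part of the definition), the 1-norm empirical covering number of a finite collection of functions can be upper-bounded by a greedy ε-net on the projected vectors $H|_x$ under the metric $d_1$; the sine family used below is an arbitrary stand-in for a hypothesis class, chosen only so the snippet runs.

```python
import numpy as np

def d_p(r1, r2, p=1):
    """Empirical l^p metric d_p(r1, r2) = ((1/m) * sum |r1_i - r2_i|^p)^(1/p)."""
    return np.mean(np.abs(r1 - r2) ** p) ** (1.0 / p)

def greedy_cover_size(projected, eps, p=1):
    """Size of a greedy epsilon-net on H|_x: an upper bound on N_{p,x}(H, eps)."""
    centers = []
    for v in projected:
        if not any(d_p(v, c, p) <= eps for c in centers):
            centers.append(v)
    return len(centers)

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    x = rng.uniform(-1.0, 1.0, size=20)             # sample points x = (x_i)_{i=1}^m
    # A finite stand-in for H: functions f_w(t) = sin(w * t), w in a grid.
    H_proj = np.array([np.sin(w * x) for w in np.linspace(0.5, 5.0, 200)])
    print(greedy_cover_size(H_proj, eps=0.1, p=1))
```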
We also need the following useful inequalities. The first one is the classical Bernstein inequality (see [10]).

Lemma 2.1 Suppose ξ is a random variable on Z with expectation µ = E(ξ). If $|\xi(z) - \mu| \le M_1$ for almost all z ∈ Z, and the variance $\sigma^2(\xi) = \sigma^2$ is known, then for every ε > 0 there holds
$$\mathrm{Prob}_{z \in Z^m} \left\{ \frac{1}{m} \sum_{i=1}^{m} \xi(z_i) - \mu \ge \varepsilon \right\} \le \exp\left( - \frac{m\varepsilon^2}{2\big(\sigma^2 + \frac{1}{3} M_1 \varepsilon\big)} \right).$$
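As a quick numerical sanity check of Lemma 2.1 (our illustration, using an arbitrary bounded random variable), one can compare the empirical tail probability of the sample mean against the Bernstein bound.

```python
import numpy as np

def bernstein_bound(m, eps, var, M1):
    """Right-hand side of Lemma 2.1."""
    return np.exp(-m * eps**2 / (2.0 * (var + M1 * eps / 3.0)))

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    m, eps, trials = 200, 0.1, 20000
    # xi uniform on [0, 1]: mu = 1/2, variance = 1/12, |xi - mu| <= 1/2 =: M1.
    samples = rng.uniform(0.0, 1.0, size=(trials, m))
    tail = np.mean(samples.mean(axis=1) - 0.5 >= eps)
    print(tail, "<=", bernstein_bound(m, eps, var=1.0 / 12.0, M1=0.5))
```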
The following Lemma 2.2 is a probability inequality proved by means of Bernstein's inequality (see [26]), which will play an important role in our estimation of the bounds for the sample error term involving $f_z$.

Lemma 2.2 Let 0 ≤ α ≤ 1, c, $M_1$ ≥ 0, and let G be a set of functions on Z such that for every g ∈ G, E(g) ≥ 0, $|g - E(g)| \le M_1$ and $E(g^2) \le c(E(g))^\alpha$. Then for all ε > 0,
$$\mathrm{Prob}_{z \in Z^m} \left\{ \sup_{g \in G} \frac{E(g) - \frac{1}{m} \sum_{i=1}^{m} g(z_i)}{\sqrt{(E(g))^\alpha + \varepsilon^\alpha}} > 4 \varepsilon^{1 - \frac{\alpha}{2}} \right\} \le N(G, \varepsilon) \exp\left( - \frac{m \varepsilon^{2-\alpha}}{2\big(c + \frac{1}{3} M_1 \varepsilon^{1-\alpha}\big)} \right).$$
In the end, we present an inequality tool.

Lemma 2.3 (see [12]) Let $q, q^* > 1$ be such that $\frac{1}{q} + \frac{1}{q^*} = 1$. Then
$$a \cdot b \le \frac{1}{q} a^q + \frac{1}{q^*} b^{q^*}, \qquad \forall a, b > 0.$$
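For completeness, Lemma 2.3 (Young's inequality) follows from the concavity of the logarithm; the short derivation below is ours, not part of the original paper, and assumes a, b > 0 and $\frac{1}{q} + \frac{1}{q^*} = 1$:
$$\log(ab) = \frac{1}{q}\log(a^{q}) + \frac{1}{q^{*}}\log(b^{q^{*}}) \le \log\Big(\frac{1}{q}a^{q} + \frac{1}{q^{*}}b^{q^{*}}\Big), \quad\text{so}\quad ab \le \frac{1}{q}a^{q} + \frac{1}{q^{*}}b^{q^{*}}.$$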
3 Main results
In this section, we present the bounds for the generalization error of the regularization algorithm by estimating the regularization error and the sample error respectively.

Proposition 3.1 (see [8]) Let $f_{\sigma,\lambda} \in H_\sigma$, and let $f_z$ be defined by (2.3). Then $E(f_z) - E(f_\rho)$ is bounded by
$$\Big\{ \big[E(f_z) - E_z(f_z)\big] + \big[E_z(f_{\sigma,\lambda}) - E(f_{\sigma,\lambda})\big] \Big\} + \Big\{ E(f_{\sigma,\lambda}) - E(f_\rho) + \lambda \|f_{\sigma,\lambda}\|_{H_\sigma}^2 \Big\}. \tag{3.7}$$

Proof. It follows from our definitions that
$$\begin{aligned}
E(f_z) - E(f_\rho) &\le E(f_z) - E(f_\rho) + \lambda \|f_z\|_{H_\sigma}^2 \\
&= \big[E(f_z) - E_z(f_z)\big] + \Big\{ \big[E_z(f_z) + \lambda \|f_z\|_{H_\sigma}^2\big] - \big[E_z(f_{\sigma,\lambda}) + \lambda \|f_{\sigma,\lambda}\|_{H_\sigma}^2\big] \Big\} \\
&\qquad + \big[E_z(f_{\sigma,\lambda}) - E(f_{\sigma,\lambda})\big] + \Big\{ E(f_{\sigma,\lambda}) - E(f_\rho) + \lambda \|f_{\sigma,\lambda}\|_{H_\sigma}^2 \Big\} \\
&\le \Big\{ \big[E(f_z) - E_z(f_z)\big] + \big[E_z(f_{\sigma,\lambda}) - E(f_{\sigma,\lambda})\big] \Big\} + \Big\{ E(f_{\sigma,\lambda}) - E(f_\rho) + \lambda \|f_{\sigma,\lambda}\|_{H_\sigma}^2 \Big\},
\end{aligned}$$
where the last inequality holds because $f_z$ minimizes the regularized empirical risk, so the middle bracketed difference is nonpositive. This finishes the proof of Proposition 3.1.

The first term of (3.7) is called the sample error, and the second one is called the regularization error. Based on this, in order to obtain the bound for the generalization error, we only need to bound the regularization error and the sample error respectively. This will be done in Subsections 3.1 and 3.2.
3.1 Bound for regularization error
Let $\tilde{f}_{\sigma,\lambda}$ be a function minimizing the regularized expected risk over the RKHS $H_\sigma$, i.e.,
$$\tilde{f}_{\sigma,\lambda} := \arg\min_{f \in H_\sigma} \Big\{ E(f) + \lambda \|f\|_{H_\sigma}^2 \Big\}.$$
This is the regularizing function usually used for the error analysis. However, for some loss functions it may lead to a random variable $l(\tilde{f}_{\sigma,\lambda}, z)$ with a large bound. In order to overcome this, under a Sobolev smoothness condition on $f_\rho$, Xiang and Zhou [27] constructed a function $f_{\sigma,\lambda}$ which plays the role of a regularizing function for the classification algorithm associated with Gaussian kernels and convex losses. The constructed function has two advantages. On the one hand, it is uniformly bounded, so the random variable l(f, z) involved in the error analysis is bounded. On the other hand, it plays the same role as $\tilde{f}_{\sigma,\lambda}$ in achieving nice bounds for the regularization error. Based on this method, we can find such a function which serves the general regularization algorithm associated with Gaussian kernels and loss functions satisfying (2.2).

Let $H^s(\mathbb{R}^n)$ be the Sobolev space with index s > 0, which consists of all functions in $L^2(\mathbb{R}^n)$ with finite semi-norm
$$|f|_{H^s(\mathbb{R}^n)} = \left( (2\pi)^{-n} \int_{\mathbb{R}^n} \|\mu\|^{2s} |\hat{f}(\mu)|^2 \, d\mu \right)^{\frac{1}{2}}.$$

Theorem 3.1 Suppose that there exists a function $\tilde{f}_\rho \in H^s(\mathbb{R}^n) \cap L^\infty(\mathbb{R}^n)$ for some s > 0 such that $f_\rho = \tilde{f}_\rho|_X$, and $\phi = \frac{d\rho_X}{dx} \in L^2(X)$. Then for 0 < σ ≤ 1 and λ > 0, we can find functions $f_{\sigma,\lambda} \in H_\sigma$ such that
$$\|f_{\sigma,\lambda}\|_{L^\infty(X)} \le \widetilde{C}, \tag{3.8}$$
$$D(\sigma, \lambda) := E(f_{\sigma,\lambda}) - E(f_\rho) + \lambda \|f_{\sigma,\lambda}\|_{H_\sigma}^2 \le \widetilde{C}\big(\sigma^s + \lambda \sigma^{-n}\big), \tag{3.9}$$
where $\widetilde{C} \ge 1$ is a constant independent of σ or λ.
3.2 Key bounds for the sample error
To achieve the key bounds for the sample error, we need the following assumption involving l and ρ.
Assumption. There exist some α ∈ [0, 1] and a constant $C_1 > 0$ such that
$$E\big[ l(f, z) - l(f_\rho, z) \big]^2 \le C_1 \big[ E(f) - E(f_\rho) \big]^\alpha, \qquad \forall f : X \to [-R, R].$$

Remark 4 This assumption always holds for α = 0 and $C_1 = M^2$.

Based on uniform stability, hypothesis space complexity and data quality, we can establish a bound on the sample error, stated as Theorem 3.2; for details of the proof, see [7].

Theorem 3.2 Suppose that the regularization algorithm (2.3) is $\beta_m$-uniformly stable with respect to the loss function. Then for any ε > 0,
$$\mathrm{Prob}_{z \in Z^m}\big\{ E(f_z) - E_z(f_z) \ge 2\varepsilon + 2\beta_m \big\} \le 2 N_1\Big(\frac{\varepsilon}{4}, H_\sigma, m\Big) \exp\Big( - \frac{m\varepsilon^2}{8 B^2} \Big).$$

Combining this with Theorem 3.1, we can obtain a bound on the generalization error that depends simultaneously on uniform stability, hypothesis space complexity and data quality. Comparing Theorem 3.2 with the result in [4], we see that when the empirical covering number satisfies the condition
$$\ln 2 N_1\Big(\frac{\varepsilon}{4}, H_\sigma, m\Big) + \frac{8 m \varepsilon^2}{(4 m \beta_m + B)^2} = \frac{m \varepsilon^2}{8 B^2},$$
the two bounds coincide, which means that Theorem 3.2 generalizes the result based on the single measure of uniform stability alone. However, it is very hard to estimate the empirical covering number. Note that
$$\big[E(f_z) - E_z(f_z)\big] + \big[E_z(f_{\sigma,\lambda}) - E(f_{\sigma,\lambda})\big] = \Big\{\big[E(f_z) - E(f_\rho)\big] - \big[E_z(f_z) - E_z(f_\rho)\big]\Big\} + \Big\{\big[E_z(f_{\sigma,\lambda}) - E_z(f_\rho)\big] - \big[E(f_{\sigma,\lambda}) - E(f_\rho)\big]\Big\},$$
so the sample error can be divided into two terms, from which we can get the generalization bound in a more tractable way. The first term involves $f_z$ and the second one involves $f_{\sigma,\lambda}$. We first bound the sample error term involving $f_{\sigma,\lambda}$.

Theorem 3.3 Let α ∈ [0, 1] and let $f_{\sigma,\lambda} \in H_\sigma$ satisfy (3.8). Then for any 0 < δ < 1, with confidence at least 1 − δ/2, the quantity $\big[E_z(f_{\sigma,\lambda}) - E_z(f_\rho)\big] - \big[E(f_{\sigma,\lambda}) - E(f_\rho)\big]$ is bounded by
$$2(B + C_1)\, m^{-\frac{1}{2-\alpha}} \log(2/\delta) + \frac{1}{2} D(\sigma, \lambda).$$
Proof. Consider the random variable $\xi(z) = l(f_{\sigma,\lambda}, z) - l(f_\rho, z)$ with $f_{\sigma,\lambda} \in H_\sigma$. We have
$$E(\xi) = E(f_{\sigma,\lambda}) - E(f_\rho) \ge 0, \qquad E_z(\xi) = E_z(f_{\sigma,\lambda}) - E_z(f_\rho).$$
By (2.2), we find that $|\xi(z)| = |l(f_{\sigma,\lambda}, z) - l(f_\rho, z)| \le B$, so $|\xi - E(\xi)| \le 2B$. By Lemma 2.1, we know that for any ε > 0,
$$\mathrm{Prob}_{z \in Z^m}\left\{ \frac{1}{m}\sum_{i=1}^{m}\xi(z_i) - E(\xi) \ge \varepsilon \right\} \le \exp\left( - \frac{m\varepsilon^2}{2\big(\sigma^2(\xi) + \frac{2}{3} B \varepsilon\big)} \right).$$
Solving the quadratic equation for ε obtained by setting the above probability bound equal to δ/2, we see that with confidence at least 1 − δ/2,
$$\frac{1}{m}\sum_{i=1}^{m}\xi(z_i) - E(\xi) \le \frac{4B\log(2/\delta)}{3m} + \frac{\sqrt{2m\sigma^2(\xi)\log(2/\delta)}}{m}.$$
By our assumption we get that $\sigma^2(\xi) \le E(\xi^2) \le C_1[E(\xi)]^\alpha$. This in connection with Lemma 2.3 implies
$$\frac{\sqrt{2m\sigma^2(\xi)\log(2/\delta)}}{m} \le \sqrt{\frac{2 C_1 [E(\xi)]^\alpha \log(2/\delta)}{m}} \le \Big(1 - \frac{\alpha}{2}\Big)\left(\frac{2 C_1 \log(2/\delta)}{m}\right)^{\frac{1}{2-\alpha}} + \frac{\alpha}{2} E(\xi).$$
Therefore, with confidence at least 1 − δ/2,
$$\frac{1}{m}\sum_{i=1}^{m}\xi(z_i) - E(\xi) \le \frac{4B\log(2/\delta)}{3m} + \left(\frac{2 C_1 \log(2/\delta)}{m}\right)^{\frac{1}{2-\alpha}} + \frac{1}{2} E(\xi).$$
Since $E(\xi) = E(f_{\sigma,\lambda}) - E(f_\rho) \le D(\sigma, \lambda)$, the proof of Theorem 3.3 is completed.

The sample error term $\big[E(f_z) - E(f_\rho)\big] - \big[E_z(f_z) - E_z(f_\rho)\big]$ involving the function $f_z$ runs over a set of functions. To bound it, we will simultaneously use uniform stability and the covering number, which measures the complexity of the hypothesis space.

Lemma 3.1 (see [25]) For any λ > 0 and $z \in Z^m$, there holds $\|f_z\|_{H_\sigma} \le \sqrt{l(0, z)/\lambda}$.

Proof. The definition of $f_z$ tells us that, taking f = 0 as a competitor,
$$\lambda \|f_z\|_{H_\sigma}^2 \le E_z(f_z) + \lambda \|f_z\|_{H_\sigma}^2 \le E_z(0) + 0 = l(0, z),$$
where l(0, z) stands for the loss evaluated at f = 0. This proves Lemma 3.1.
Theorem 3.4 Assume that the regularization learning algorithm (2.3) is $\beta_m$-uniformly stable with respect to the loss function l. Let α ∈ [0, 1], and let τ ≥ 1 be arbitrary. Then for any 0 < δ < 1, the inequality
$$\big[E(f_z) - E(f_\rho)\big] - \big[E_z(f_z) - E_z(f_\rho)\big] \le \frac{5}{2}\beta_m + \frac{1}{2}\big[E(f_z) - E(f_\rho)\big] + 4^{\frac{2}{2-\alpha}} \varepsilon(m, \lambda, \sigma, \delta/2) + 4\varepsilon(m, \lambda, \sigma, \delta/2)$$
holds with confidence at least 1 − δ/2, provided that
$$\varepsilon(m, \lambda, \sigma, \delta/2) \le D\left( \frac{\log(2/\delta) + \sigma^{-2(n+1)/(2-\alpha)} + (\log m)^{(n+1)/(2-\alpha)}}{m^{\frac{1}{2-\alpha}}} + \frac{\sigma^{-2(n+1)}}{m} + \frac{\sqrt{l(0, z)}}{\sqrt{\lambda}\, m^{\tau}} \right),$$
where
$$D = \max\left\{ L,\ (4C_1)^{\frac{1}{2-\alpha}} + 3B,\ \big[(4C_0 C_1)^{\frac{1}{2-\alpha}} + 3BC_0\big]\tau^{n+1} \right\}.$$

Proof. By (2.5) and (2.6), we have that for all i ∈ {1, 2, . . . , m},
$$E(f_{z^i}) - E(f_z) \le \int_Z |l(f_z, z) - l(f_{z^i}, z)| \, d\rho \le \beta_m$$
and
$$\begin{aligned}
&\Big\{\big[E(f_z) - E(f_\rho)\big] - \big[E_z(f_z) - E_z(f_\rho)\big]\Big\} - \Big\{\big[E(f_{z^i}) - E(f_\rho)\big] - \big[E_z(f_{z^i}) - E_z(f_\rho)\big]\Big\} \\
&\qquad = \big[E(f_z) - E_z(f_z)\big] - \big[E(f_{z^i}) - E_z(f_{z^i})\big] \\
&\qquad \le \big|E(f_z) - E(f_{z^i})\big| + \big|E_z(f_z) - E_z(f_{z^i})\big| \\
&\qquad \le \int_Z |l(f_z, z) - l(f_{z^i}, z)| \, d\rho + \frac{1}{m}\sum_{j=1}^{m} |l(f_z, z_j) - l(f_{z^i}, z_j)| \\
&\qquad \le 2\beta_m.
\end{aligned}$$
Then we have
$$E(f_{z^i}) \le E(f_z) + \beta_m, \tag{3.10}$$
and
$$\big[E(f_z) - E(f_\rho)\big] - \big[E_z(f_z) - E_z(f_\rho)\big] \le 2\beta_m + \Big\{\big[E(f_{z^i}) - E(f_\rho)\big] - \big[E_z(f_{z^i}) - E_z(f_\rho)\big]\Big\}. \tag{3.11}$$
Suppose that for all i ∈ {1, 2, . . . , m},
$$\sup_{1 \le i \le m} \frac{\big[E(f_{z^i}) - E(f_\rho)\big] - \big[E_z(f_{z^i}) - E_z(f_\rho)\big]}{\sqrt{\big[E(f_{z^i}) - E(f_\rho)\big]^\alpha + \varepsilon^\alpha}} \le 4\varepsilon^{1-\frac{\alpha}{2}}.$$
It follows from (3.10) and (3.11) that
$$\big[E(f_z) - E(f_\rho)\big] - \big[E_z(f_z) - E_z(f_\rho)\big] \le 2\beta_m + 4\varepsilon^{1-\frac{\alpha}{2}} \sqrt{\big[E(f_{z^i}) - E(f_\rho)\big]^\alpha + \varepsilon^\alpha} \le 2\beta_m + 4\varepsilon^{1-\frac{\alpha}{2}} \sqrt{\big[\beta_m + E(f_z) - E(f_\rho)\big]^\alpha + \varepsilon^\alpha}.$$
Let
$$\eta = 4\varepsilon^{1-\frac{\alpha}{2}} \sqrt{\big[\beta_m + E(f_z) - E(f_\rho)\big]^\alpha + \varepsilon^\alpha}.$$
Therefore, we have
$$\begin{aligned}
&\mathrm{Prob}_{z \in Z^m}\Big\{ \big[E(f_z) - E(f_\rho)\big] - \big[E_z(f_z) - E_z(f_\rho)\big] > 2\beta_m + \eta \Big\} \\
&\quad \le \mathrm{Prob}_{z \in Z^m}\Big\{ \big[E(f_{z^i}) - E(f_\rho)\big] - \big[E_z(f_{z^i}) - E_z(f_\rho)\big] > \eta \Big\} \\
&\quad \le \mathrm{Prob}_{z \in Z^m}\left\{ \sup_{1 \le i \le m} \frac{\big[E(f_{z^i}) - E(f_\rho)\big] - \big[E_z(f_{z^i}) - E_z(f_\rho)\big]}{\sqrt{\big[E(f_{z^i}) - E(f_\rho)\big]^\alpha + \varepsilon^\alpha}} > 4\varepsilon^{1-\frac{\alpha}{2}} \right\} \\
&\quad \le \mathrm{Prob}_{z \in Z^m}\left\{ \sup_{f \in H_\sigma} \frac{\big[E(f) - E(f_\rho)\big] - \big[E_z(f) - E_z(f_\rho)\big]}{\sqrt{\big[E(f) - E(f_\rho)\big]^\alpha + \varepsilon^\alpha}} > 4\varepsilon^{1-\frac{\alpha}{2}} \right\}.
\end{aligned} \tag{3.12}$$
Let $G = \{l(f, z) - l(f_\rho, z) : f \in H_\sigma\}$ and g ∈ G. Then for all $f_1, f_2 \in H_\sigma$,
$$\big|[l(f_1, z) - l(f_\rho, z)] - [l(f_2, z) - l(f_\rho, z)]\big| = |l(f_1, z) - l(f_2, z)| \le L\|f_1 - f_2\|_\infty.$$
This in connection with (2.1) means that
$$\log N(G, \varepsilon) \le \log N\Big(B_1, \frac{\varepsilon}{LR}\Big) \le C_0\left( \Big(\log \frac{LR}{\varepsilon}\Big)^{n+1} + \frac{1}{\sigma^{2(n+1)}} \right). \tag{3.13}$$
Since $E(g) = E(f) - E(f_\rho) \ge 0$ and $|g| = |l(f, z) - l(f_\rho, z)| \le B$, we have $|g - E(g)| \le 2B$. Moreover, our assumption ensures that $E(g^2) \le C_1[E(g)]^\alpha$. Then by Lemma 2.2 and (3.13) we get
$$\mathrm{Prob}_{z \in Z^m}\left\{ \sup_{f \in H_\sigma} \frac{\big[E(f) - E(f_\rho)\big] - \big[E_z(f) - E_z(f_\rho)\big]}{\sqrt{\big[E(f) - E(f_\rho)\big]^\alpha + \varepsilon^\alpha}} > 4\varepsilon^{1-\frac{\alpha}{2}} \right\} \le \exp\left( C_0\Big( \big(\log \tfrac{LR}{\varepsilon}\big)^{n+1} + \frac{1}{\sigma^{2(n+1)}} \Big) - \frac{m\varepsilon^{2-\alpha}}{2C_1 + (4/3)B\varepsilon^{1-\alpha}} \right).$$
Choose ε(m, λ, σ, δ/2) to be the positive solution of the equation
$$C_0\left( \Big(\log \frac{LR}{\varepsilon}\Big)^{n+1} + \frac{1}{\sigma^{2(n+1)}} \right) - \frac{m\varepsilon^{2-\alpha}}{2C_1 + (4/3)B\varepsilon^{1-\alpha}} = \log(\delta/2).$$
Then, combining this with (3.12), we see that
$$\begin{aligned}
\big[E(f_z) - E(f_\rho)\big] - \big[E_z(f_z) - E_z(f_\rho)\big] &\le 2\beta_m + 4[\varepsilon(m, \lambda, \sigma, \delta/2)]^{1-\frac{\alpha}{2}} \sqrt{\big[\beta_m + E(f_z) - E(f_\rho)\big]^\alpha + [\varepsilon(m, \lambda, \sigma, \delta/2)]^\alpha} \\
&\le 2\beta_m + 4[\varepsilon(m, \lambda, \sigma, \delta/2)]^{1-\frac{\alpha}{2}} \big[\beta_m + E(f_z) - E(f_\rho)\big]^{\frac{\alpha}{2}} + 4\varepsilon(m, \lambda, \sigma, \delta/2)
\end{aligned}$$
holds with probability at least 1 − δ/2. By Lemma 2.3 with q = 2/α and $q^* = 2/(2-\alpha)$, we have
$$\big[E(f_z) - E(f_\rho)\big] - \big[E_z(f_z) - E_z(f_\rho)\big] \le \frac{5}{2}\beta_m + \frac{1}{2}\big[E(f_z) - E(f_\rho)\big] + 4^{\frac{2}{2-\alpha}} \varepsilon(m, \lambda, \sigma, \delta/2) + 4\varepsilon(m, \lambda, \sigma, \delta/2).$$

We still need to bound ε(m, λ, σ, δ/2). Let τ ≥ 1 be arbitrary and set $T(\varepsilon) = \frac{m\varepsilon^{2-\alpha}}{2C_1 + (4/3)B\varepsilon^{1-\alpha}}$. One can easily check that T(ε) is strictly increasing on (0, ∞). By Lemma 3.1, for $f_z$ we can take $R = \sqrt{l(0, z)/\lambda}$. Set
$$h(\varepsilon) := C_0\left( \Big(\log \frac{L\sqrt{l(0, z)}}{\varepsilon\sqrt{\lambda}}\Big)^{n+1} + \frac{1}{\sigma^{2(n+1)}} \right) - T(\varepsilon)$$
and
$$\varsigma := \left( \frac{4C_1\big(\log(2/\delta) + \frac{C_0}{\sigma^{2(n+1)}}\big) + 4C_0 C_1 (\tau\log m)^{n+1}}{m} \right)^{\frac{1}{2-\alpha}} + \frac{L\sqrt{l(0, z)}}{\sqrt{\lambda}\, m^{\tau}} + \frac{8B}{3m}\left( \log(2/\delta) + \frac{C_0}{\sigma^{2(n+1)}} + C_0(\tau\log m)^{n+1} \right).$$
If $\frac{4}{3}B\varsigma^{1-\alpha} \le 2C_1$, then
$$T(\varsigma) \ge \frac{m\varsigma^{2-\alpha}}{4C_1} \ge \log(2/\delta) + \frac{C_0}{\sigma^{2(n+1)}} + C_0(\tau\log m)^{n+1}.$$
If $\frac{4}{3}B\varsigma^{1-\alpha} > 2C_1$, then
$$T(\varsigma) \ge \frac{m\varsigma^{2-\alpha}}{\frac{8}{3}B\varsigma^{1-\alpha}} = \frac{m\varsigma}{\frac{8}{3}B} \ge \log(2/\delta) + \frac{C_0}{\sigma^{2(n+1)}} + C_0(\tau\log m)^{n+1}.$$
So in either case we have
$$T(\varsigma) \ge \log(2/\delta) + \frac{C_0}{\sigma^{2(n+1)}} + C_0(\tau\log m)^{n+1}.$$
Since $\varsigma \ge \frac{L\sqrt{l(0, z)}}{\sqrt{\lambda}\, m^{\tau}}$, we get that $\log \frac{L\sqrt{l(0, z)}}{\varsigma\sqrt{\lambda}} \le \tau\log m$. It follows that
$$h(\varsigma) \le C_0(\tau\log m)^{n+1} + \frac{C_0}{\sigma^{2(n+1)}} - \left(\log(2/\delta) + \frac{C_0}{\sigma^{2(n+1)}} + C_0(\tau\log m)^{n+1}\right) = \log(\delta/2).$$
We can easily see that h is strictly decreasing, so ε(m, λ, σ, δ/2) ≤ ς, i.e.,
$$\varepsilon(m, \lambda, \sigma, \delta/2) \le D\left( \frac{\log(2/\delta) + \sigma^{-2(n+1)/(2-\alpha)} + (\log m)^{(n+1)/(2-\alpha)}}{m^{\frac{1}{2-\alpha}}} + \frac{\sigma^{-2(n+1)}}{m} + \frac{\sqrt{l(0, z)}}{\sqrt{\lambda}\, m^{\tau}} \right),$$
where
$$D = \max\left\{ L,\ (4C_1)^{\frac{1}{2-\alpha}} + 3B,\ \big[(4C_0 C_1)^{\frac{1}{2-\alpha}} + 3BC_0\big]\tau^{n+1} \right\}.$$
This finishes the proof of Theorem 3.4.
3.3 Deriving generalization error bound
We are now in a position to present the bound for the generalization error.

Theorem 3.5 Assume that the regularization learning algorithm (2.3) is $\beta_m$-uniformly stable with respect to the loss function l, and suppose that there exists a function $\tilde{f}_\rho \in H^s(\mathbb{R}^n) \cap L^\infty(\mathbb{R}^n)$ for some s > 0 such that $f_\rho = \tilde{f}_\rho|_X$ and $\phi = \frac{d\rho_X}{dx} \in L^2(X)$. Let α ∈ [0, 1], and let τ ≥ 1 be arbitrary. Then for any 0 < δ < 1, with probability at least 1 − δ, the quantity $E(f_z) - E(f_\rho)$ is bounded by
$$5\beta_m + 3\widetilde{C}\big(\sigma^s + \lambda\sigma^{-n}\big) + 4(B + C_1)\, m^{-\frac{1}{2-\alpha}}\log(2/\delta) + 4^{\frac{6-\alpha}{2(2-\alpha)}}\varepsilon(m, \lambda, \sigma, \delta/2) + 8\varepsilon(m, \lambda, \sigma, \delta/2),$$
where ε(m, λ, σ, δ/2) is the same as in Theorem 3.4.
Theorem 3.5 can be proved by explicitly combining the bounds for the sample error and the regularization error.

Remark 5 Note that the result in Theorem 3.5 is very similar to Proposition 2 in [27], while an additional term $\beta_m$ is involved in this paper; this term should also be taken into account, in view of the fact that the performance of a learning algorithm is actually affected by all of these factors simultaneously.

Theorem 3.6 Assume that the regularization learning algorithm (2.3) satisfies all the conditions in Theorem 3.5, and fix $\tau = \frac{\gamma}{2} + 1$. Then for any 0 < δ < 1, with probability at least 1 − δ we have
$$E(f_z) - E(f_\rho) \le \widetilde{B}\, m^{-\theta}\log(2/\delta) + 5\beta_m,$$
where
$$\theta = \min\left\{ s\varsigma\gamma,\ \gamma(1 - n\xi),\ \frac{1 - 2\gamma\varsigma(n+1)}{2-\alpha} \right\}.$$
This can be derived by combining Theorem 3.5 with the proof of Lemma 2 in [27]. Note that $\beta_m \to 0$ as $m \to \infty$; then $E(f_z) - E(f_\rho) \to 0$ as $m \to \infty$, which shows that the generalization bound we obtain is asymptotically convergent.
4 Applications
There are many algorithms, such as support vector machines (SVM) (see [6], [9]) and regularization networks (see [19]), which perform the minimization of a regularized objective function. In the following, as an application of our generic results, we evaluate the learning rates of SVM and regularization networks, and propose a new strategy for setting the regularization parameter.

Example 1 (SVM classification) Here Y = {−1, 1} and the loss function considered is the hinge loss, i.e.,
$$l(f, z) = (1 - yf(x))_+ = \max\{1 - yf(x), 0\}.$$
Suppose |f(x)| ≤ b for any x ∈ X; then B = b + 1 and L = 1. So the real-valued classifier obtained by the SVM optimization procedure with Gaussian kernels has
uniform stability $\beta_m$ with (see [4])
$$\beta_m \le \frac{1}{2\lambda m}. \tag{4.14}$$
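To connect (4.14) with practice (our illustration, not part of the paper), note that the objective $\frac{1}{m}\sum_i (1 - y_i f(x_i))_+ + \lambda\|f\|_{H_\sigma}^2$ is, up to the constant factor $1/(2\lambda)$, the standard soft-margin SVM objective, so a solver such as scikit-learn's SVC can be used with C = 1/(2λm) and, for the kernel $k_\sigma(x, x') = \exp(-2\sigma^2\|x - x'\|^2)$, gamma = 2σ²; the data set and the parameter values below are arbitrary choices of ours.

```python
import numpy as np
from sklearn.svm import SVC

def svc_for_regularized_hinge(m, lam, sigma):
    """SVC matching (1/m) * sum hinge + lam * ||f||^2 with kernel exp(-2*sigma^2*||x-x'||^2)."""
    return SVC(C=1.0 / (2.0 * lam * m), kernel="rbf", gamma=2.0 * sigma**2)

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    m = 400
    X = rng.normal(size=(m, 2))
    y = np.where(X[:, 0] * X[:, 1] + 0.1 * rng.standard_normal(m) > 0, 1, -1)
    lam, sigma = m ** (-0.5), 1.0     # e.g. lambda = m^{-gamma} with gamma = 1/2 (arbitrary choice)
    clf = svc_for_regularized_hinge(m, lam, sigma).fit(X, y)
    # Stability bound (4.14): beta_m <= 1 / (2 * lam * m).
    print("beta_m bound:", 1.0 / (2.0 * lam * m), "train acc:", clf.score(X, y))
```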
We use Theorem 3.5 and thus get the following proposition.

Proposition 4.1 For SVM classification, suppose |f(x)| ≤ b for any x ∈ X. Then for any 0 < δ < 1, with confidence at least 1 − δ we have
$$E(f_z) - E(f_\rho) \le \frac{5}{2\lambda m} + 3\widetilde{C}\big(\sigma^s + \lambda\sigma^{-n}\big) + 4(1 + b + C_1)\, m^{-\frac{1}{2-\alpha}}\log(2/\delta) + 4^{\frac{6-\alpha}{2(2-\alpha)}}\varepsilon(m, \lambda, \sigma, \delta/2) + 8\varepsilon(m, \lambda, \sigma, \delta/2) \tag{4.15}$$
provided that
$$\varepsilon(m, \lambda, \sigma, \delta/2) \le D_1\left( \frac{\log(2/\delta) + \sigma^{-2(n+1)/(2-\alpha)} + (\log m)^{(n+1)/(2-\alpha)}}{m^{\frac{1}{2-\alpha}}} + \frac{\sigma^{-2(n+1)}}{m} + \frac{\sqrt{l(0, z)}}{\sqrt{\lambda}\, m^{\tau}} \right),$$
where
$$D_1 = \max\left\{ 1,\ (4C_1)^{\frac{1}{2-\alpha}} + 3(b+1),\ \big[(4C_0 C_1)^{\frac{1}{2-\alpha}} + 3C_0(b+1)\big]\tau^{n+1} \right\}.$$
In addition, by (4.15) we can suggest the strategy of setting the regularization parameter of SVMs as $\lambda = m^{-\gamma}$, with $\sigma = \lambda^{\zeta} = m^{-\gamma\zeta}$ and τ = 1 + 0