Cross-validation based Adaptation for Regularization Operators in Learning

Andrea Caponnetto
Department of Mathematics, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon Tong, Hong Kong

Yuan Yao
Department of Mathematics, University of California, Berkeley, CA 94720
Abstract. We consider learning algorithms induced by regularization methods in the regression setting. We show that previously obtained error bounds for these algorithms, using a-priori choices of the regularization parameter, can be attained using a suitable a-posteriori choice based on cross-validation. In particular, these results prove adaptation of the rate of convergence of the estimators to the minimax rate induced by the "effective dimension" of the problem. We also show universal consistency for this broad class of methods, which includes regularized least-squares, truncated SVD, Landweber iteration and the ν-method.
1. Introduction

Adaptation, one of the central problems in nonparametric regression [13], refers to the phenomenon that a convergence rate depending on the complexity class of the regression function can be achieved by data-based schemes without any prior knowledge of the actual complexity class of the function. In this paper we investigate the adaptation property of learning a function in reproducing kernel Hilbert spaces. In particular, we show that previous results in [4] about rates of convergence for regularization methods using a-priori choices of the regularization parameter can be attained using a suitable a-posteriori choice based on cross-validation. We also show universal consistency for this class of regularization methods, which includes popular algorithms such as regularized least-squares, truncated SVD, Landweber iteration and the ν-method. Our results are inspired by the work on cross-validation based adaptation in [10], but differ in exploiting a novel semi-supervised learning framework which has been studied in learning theory [4]. The general algorithms we consider are based on the formalism of regularization methods for linear ill-posed inverse problems in their classical setting (see for example [11] for a general reference). We note that our results can easily be extended to vector-valued function learning following the treatment in [3].

The paper is organized as follows. In Section 2 we focus on a-priori choices of the regularization parameter for regularization methods. Theorem 1 shows universal consistency for a large class of choice rules, and Theorem 2 shows specific rates of convergence under suitable prior assumptions (parameterized by the constants $r$, $s$, $C_r$ and $D_s$) on the unknown probability measure $\rho$. Unlabelled data are added to the training set in order to improve the rates for a certain range of the parameters $r$ and $s$.
In Section 3 we consider a validation technique for the a-posteriori choice of the regularization parameter. Theorem 3 shows how error bounds for the estimators $f_{\tilde z,\lambda}$, with a-priori choices of $\lambda$, can be transferred to the estimators $f_{z^{tot}}$ which use the validation examples $z^v$ in $z^{tot}=(\tilde z,z^v)$ to determine $\lambda$. The subsequent corollaries are applications of Theorem 3 to the choices of $\lambda$ described in Section 2. In Sections 4 and 5 we give the proofs of the results stated in the previous sections, using some lemmas from [4].

2. A-priori choice of the regularization parameter.

We consider the setting of semi-supervised statistical learning. We assume that $Y\subset[-M,M]$ and we let the supervised part of the training set be $z=(z_1,\dots,z_m)$, with $z_i=(x_i,y_i)$ drawn i.i.d. according to the probability measure $\rho$ over $Z=X\times Y$. Moreover we assume that the unsupervised part of the training set is $(x^u_{m+1},\dots,x^u_{\tilde m})$, with $x^u_i$ drawn i.i.d. according to the marginal probability measure $\rho_X$ of $\rho$ over $X$. For the sake of brevity we also introduce the complete training set $\tilde z=(\tilde z_1,\dots,\tilde z_{\tilde m})$, with $\tilde z_i=(\tilde x_i,\tilde y_i)$, where the compact notations $\tilde x_i$ and $\tilde y_i$ are defined by

$$\tilde x_i=\begin{cases}x_i&\text{if }1\le i\le m,\\ x^u_i&\text{if }m<i\le\tilde m,\end{cases}\qquad \tilde y_i=\begin{cases}\frac{\tilde m}{m}\,y_i&\text{if }1\le i\le m,\\ 0&\text{if }m<i\le\tilde m.\end{cases}$$

It is clear that, in the purely supervised setting, the unsupervised part of the training set is missing, whence $\tilde m=m$ and $\tilde z=z$.
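The construction above is mechanical and can be sketched in a few lines. The following snippet is our own illustration (the function name and NumPy usage are not from the paper); it builds the pair $(\tilde x_i,\tilde y_i)$ with the $\tilde m/m$ rescaling of the labelled outputs:

```python
import numpy as np

def augment_training_set(x, y, x_unlab):
    """Sketch of the semi-supervised training set of Section 2:
    labelled outputs are rescaled by m~/m, unlabelled points get output 0."""
    m = len(x)
    m_tilde = m + len(x_unlab)                      # m~ = total number of inputs
    x_t = np.concatenate([np.asarray(x, float), np.asarray(x_unlab, float)])
    y_t = np.concatenate([(m_tilde / m) * np.asarray(y, float),
                          np.zeros(len(x_unlab))])
    return x_t, y_t
```

With this scaling, the empirical average $\frac1{\tilde m}\sum_i\tilde y_i K_{\tilde x_i}$ coincides with $\frac1m\sum_i y_i K_{x_i}$, so the vector $g_z$ defined below is unchanged by the unlabelled points.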
In the following we study the generalization properties of a class of estimators $f_{\tilde z,\lambda}$ belonging to the hypothesis space $\mathcal H$: the RKHS of functions on $X$ induced by the bounded Mercer kernel $K$ (in the following $\kappa=\sup_{x\in X}K(x,x)$). The learning algorithms that we consider have the general form

(1)    $f_{\tilde z,\lambda}=G_\lambda(T_{\tilde x})\,g_z,$

where $T_{\tilde x}\in\mathcal L(\mathcal H)$ is given by

$$T_{\tilde x}f=\frac1{\tilde m}\sum_{i=1}^{\tilde m}K_{\tilde x_i}\langle K_{\tilde x_i},f\rangle_{\mathcal H},$$

$g_z\in\mathcal H$ is given by

$$g_z=\frac1{\tilde m}\sum_{i=1}^{\tilde m}K_{\tilde x_i}\tilde y_i=\frac1m\sum_{i=1}^m K_{x_i}y_i,$$

and the regularization parameter $\lambda$ lies in the range $(0,\kappa]$. We will often use the shortcut notation $\dot\lambda=\lambda/\kappa$. The functions $G_\lambda:[0,\kappa]\to\mathbb R$, which select the regularization method, will be characterized in terms of the constants $A$ and $B_r$ in $[0,+\infty]$, defined as follows

(2)    $A=\sup_{\lambda\in(0,\kappa]}\ \sup_{\sigma\in[0,\kappa]}\ |(\sigma+\lambda)G_\lambda(\sigma)|,$

(3)    $B_r=\sup_{t\in[0,r]}\ \sup_{\lambda\in(0,\kappa]}\ \sup_{\sigma\in[0,\kappa]}\ |1-G_\lambda(\sigma)\sigma|\,\sigma^t\lambda^{-t},\qquad r>0.$
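To make the abstract scheme (1) concrete, here is a small numerical sketch (our own illustration, not code from the paper; it assumes NumPy and a precomputed kernel matrix). The filters correspond to two of the methods mentioned in the introduction, and the expansion coefficients of $f_{\tilde z,\lambda}$ are obtained by applying $G_\lambda$ to the spectrum of the empirical operator $T_{\tilde x}$:

```python
import numpy as np

def g_tikhonov(sig, lam):
    # Regularized least-squares: G_lam(sigma) = 1 / (sigma + lam)
    return 1.0 / (sig + lam)

def g_tsvd(sig, lam):
    # Truncated SVD: invert the spectrum only above the cut-off lam
    keep = sig >= lam
    return np.where(keep, 1.0 / np.where(keep, sig, 1.0), 0.0)

def spectral_coef(K, y, lam, g):
    """Coefficients c with f_{z,lam}(x) = sum_i c_i K(x_i, x), i.e.
    f_{z,lam} = G_lam(T_x) g_z computed via the eigensystem of K/m."""
    m = len(y)
    sig, U = np.linalg.eigh(K / m)          # spectrum of the empirical operator
    sig = np.clip(sig, 0.0, None)           # guard tiny negative eigenvalues
    return U @ (g(sig, lam) * (U.T @ (np.asarray(y, float) / m)))
```

For the Tikhonov filter this reduces to the familiar kernel ridge regression coefficients $(K/m+\lambda I)^{-1}y/m$.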
Finiteness of $A$ and $B_r$ (with $r$ over a suitable range) is standard in the literature on ill-posed inverse problems (see [11] for reference). Regularization methods have recently been studied in the context of learning theory in [12, 7, 9, 8, 1]. The main results of the paper, Theorems 1 and 2, describe the convergence rates of $f_{\tilde z,\lambda}$ to the target function $f_{\mathcal H}$. Here, the target function is the "best" function which can be arbitrarily well approximated by elements of our hypothesis space $\mathcal H$. More formally, $f_{\mathcal H}$ is the projection of the regression function $f_\rho(x)=\int_Y y\,d\rho_{|x}(y)$ onto the closure of $\mathcal H$ in $L^2(X,\rho_X)$.

The convergence rates in Theorem 2 will be described in terms of the constants $C_r$ and $D_s$ in $[0,+\infty]$ characterizing the probability measure $\rho$. These constants can be described in terms of the integral operator $L_K:L^2(X,\rho_X)\to L^2(X,\rho_X)$ of kernel $K$. Note that the same integral operator is denoted by $T$ when seen as a bounded operator from $\mathcal H$ to $\mathcal H$. The constants $C_r$ characterize the conditional distributions $\rho_{|x}$ through $f_{\mathcal H}$; they are defined as follows

(4)    $C_r=\begin{cases}\kappa^r\,\|L_K^{-r}f_{\mathcal H}\|_\rho&\text{if }f_{\mathcal H}\in\operatorname{Im}L_K^r,\\ +\infty&\text{if }f_{\mathcal H}\notin\operatorname{Im}L_K^r,\end{cases}\qquad r>0.$

Finiteness of $C_r$ is a common source condition in the inverse problems literature (see [11] for reference). This type of condition has been introduced in the statistical learning literature in [6, 17, 2, 16, 3]. The constants $D_s$ characterize the marginal distribution $\rho_X$ through the effective dimension $\mathcal N(\lambda)=\operatorname{Tr}\bigl[T(T+\lambda)^{-1}\bigr]$; they are defined as follows

(5)    $D_s=1\vee\sup_{\dot\lambda\in(0,1]}\sqrt{\mathcal N(\lambda)\,\dot\lambda^{s}},\qquad s\in(0,1].$

Finiteness of $D_s$ was implicitly assumed in [2, 3]. The next theorem shows (strong) universal consistency (in probability) for the estimators $f_{\tilde z,\lambda}$ under mild assumptions on the choice of $\lambda$. The function $|x|_+$, appearing in the text of Theorem 1, is the "positive part" of $x$, that is $\frac{x+|x|}{2}$.
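The effective dimension admits a direct spectral expression, $\mathcal N(\lambda)=\sum_i\lambda_i/(\lambda_i+\lambda)$ in terms of the eigenvalues $\lambda_i$ of $T$. A minimal sketch (our own illustration, assuming the eigenvalues are available, e.g. estimated from the kernel matrix):

```python
import numpy as np

def effective_dimension(eigvals, lam):
    """N(lam) = Tr[T (T + lam)^(-1)] = sum_i lam_i / (lam_i + lam)."""
    eig = np.clip(np.asarray(eigvals, dtype=float), 0.0, None)
    return float(np.sum(eig / (eig + lam)))
```

For a polynomially decaying spectrum $\lambda_i\asymp i^{-1/s}$ the effective dimension grows like $\lambda^{-s}$, which is exactly the regime controlled by the constant $D_s$ of eq. (5).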
Theorem 1. Let $\{\tilde z_m\}_{m=1}^\infty$ be a sequence of training sets composed of $m$ labelled examples drawn i.i.d. from a probability measure $\rho$ over $Z$, and $\tilde m_m-m\ge0$ unlabelled examples drawn i.i.d. from the marginal measure of $\rho$ over $X$. Let the regularization parameter choice $\lambda_m:\mathbb N\to(0,\kappa]$ fulfill the conditions

(6)    $\lim_{m\to\infty}\lambda_m=0,$

(7)    $\lim_{m\to\infty}\sqrt m\,\lambda_m=\infty.$

Then, if $B_{\bar r}<+\infty$ for some $\bar r>0$, it holds¹

$$\lim_{m\to\infty}\|f_{\tilde z_m,\lambda_m}-f_{\mathcal H}\|_\rho\stackrel{P}{=}0.$$
Theorem 2 below is a restatement, in a slightly modified form, of Theorem 2 in [4]. In particular, the introduction of the parameter $q$ will be useful when we merge this result with Theorem 3 in the proof of Corollary 2.

Theorem 2. Let $r>0$, $s\in(0,1]$ and $\alpha\in[0,|2-2r-s|_+]$. Furthermore, let $m$ and $\lambda$ satisfy the constraints $\lambda\le\|T\|$ and

(8)    $\dot\lambda=q\left(\frac{4D_s\log\frac6\delta}{\sqrt m}\right)^{\frac{2}{2r+s+t_1}},$

for some $q\ge1$, $\delta\in(0,1/3)$ and $t_1$ defined in eq. (10). Finally, assume $\tilde m\ge4\vee m\dot\lambda^{-\alpha}$. Then, with probability greater than $1-3\delta$, it holds

$$\|f_{\tilde z,\lambda}-f_{\mathcal H}\|_\rho\le q^rE_r\left(\frac{4D_s\log\frac6\delta}{\sqrt m}\right)^{\frac{2r-t_2}{2r+s+t_1}},$$

where

(9)    $E_r=C_r\bigl(30A+2(3+r)B_r+1\bigr)+9MA,$

(10)   $t_1=|2-2r-s|_+-\alpha,$

(11)   $t_2=|1-2r-2s-t_1|_+.$
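The interplay between $r$, $s$, $\alpha$ and the exponents $t_1$, $t_2$ is pure arithmetic and easy to tabulate. The snippet below (our own illustration) evaluates the exponent $e$ such that the bound of Theorem 2 scales as $m^{-e}$, with $t_1$, $t_2$ as in eqs. (10) and (11):

```python
def pos(x):
    """|x|_+ , the positive part of x."""
    return max(x, 0.0)

def rate_exponent(r, s, alpha=0.0):
    """Exponent e with rate m^(-e) = m^(-(1/2)(2r - t2)/(2r + s + t1))."""
    t1 = pos(2.0 - 2.0 * r - s) - alpha
    t2 = pos(1.0 - 2.0 * r - 2.0 * s - t1)
    return 0.5 * (2.0 * r - t2) / (2.0 * r + s + t1)
```

With $\alpha=|2-2r-s|_+$ (i.e. enough unlabelled data) and $r+s\ge\frac12$, both $t_1$ and $t_2$ vanish and the exponent becomes $r/(2r+s)$, the minimax rate appearing in Corollary 2 below.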
The proofs of the above theorems are postponed to Section 4.

3. Adaptation.

In this section we show the adaptation properties of the estimators obtained by a suitable data-dependent choice of the regularization parameter. The main results of this section are obtained assuming that

(12)   $f_{\mathcal H}=f_\rho;$

this is true for every $\rho$ when the underlying kernel $K$ is universal (see [18]). In fact, for this class of kernels the RKHS $\mathcal H$ is always dense in $L^2(X,\rho_X)$. The Gaussian kernel is a popular instance of a kernel in this family. Let the validation set

$$z^v=(z^v_1,\dots,z^v_{m^v})$$

be composed of $m^v$ labelled examples $z^v_i=(x^v_i,y^v_i)$ drawn i.i.d. from the probability measure $\rho$ over $Z=X\times Y$. The validation set $z^v$ is, by assumption, independent of the training set $\tilde z$, and these two sets define the learning set

$$z^{tot}=(\tilde z,z^v),$$

¹ We say that the sequence of random variables $\{X_m\}_{m\in\mathbb N}$ converges in probability to the random variable $X$ (and we write $\lim_{m\to\infty}X_m\stackrel{P}{=}X$ or $X_m\to^PX$) if for every $\epsilon>0$, $\lim_{m\to\infty}P[|X_m-X|\ge\epsilon]=0$. This is equivalent to saying that, for every $\delta\in(0,1)$, $P[|X_m-X|\ge\epsilon(m,\delta)]\le\delta$, with $\lim_{m\to\infty}\epsilon(m,\delta)=0$.
which represents the total input of the adaptive learning algorithm. Following the notations of the previous section, we let $\tilde m$ be the total number of examples in $\tilde z$, and $m$ the number of its labelled examples.

Now let us explain how $z^v$ is used for the choice of $\lambda$. We consider a finite set of positive reals $\Lambda_m$, depending on $m$, the number of labelled examples in $\tilde z$; the data-dependent choice of the regularization parameter is

(13)   $\hat\lambda_{z^v}=\operatorname*{argmin}_{\lambda\in\Lambda_m}\frac1{m^v}\sum_{i=1}^{m^v}\bigl(T_Mf_{\tilde z,\lambda}(x^v_i)-y^v_i\bigr)^2,$

where the truncation operator $T_M:L^2(X,\rho_X)\to L^2(X,\rho_X)$ is defined by $T_Mf(x)=(|f(x)|\wedge M)\operatorname{sign}f(x)$. The final learning estimator, whose adaptation properties are investigated in this section, is defined as follows

(14)   $f_{z^{tot}}=T_Mf_{\tilde z,\hat\lambda_{z^v}}.$
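Operationally, eq. (13) is a hold-out grid search with clipped predictions. A sketch (our own illustration; `predict(lam, x)` stands for the map $x\mapsto f_{\tilde z,\lambda}(x)$ produced by any of the regularization methods, and is an assumed interface, not the paper's notation):

```python
import numpy as np

def truncate(vals, M):
    """The truncation operator T_M: clip values to [-M, M]."""
    return np.clip(vals, -M, M)

def choose_lambda(predict, grid, x_val, y_val, M):
    """Eq. (13): the grid value minimizing the truncated validation error."""
    errors = [np.mean((truncate(predict(lam, x_val), M) - y_val) ** 2)
              for lam in grid]
    return grid[int(np.argmin(errors))]
```

The final estimator of eq. (14) is then the truncated fit $T_Mf_{\tilde z,\hat\lambda}$ evaluated with the selected parameter.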
Theorem 3 below is the main result of this section and shows an important property of the estimator $f_{z^{tot}}$. It will be used to extend to $f_{z^{tot}}$ convergence results similar to the ones obtained in the previous section.

Theorem 3. Let $\rho$, $K$, $m$, $\tilde m$, $m^v$, $\Lambda_m$, $\delta\in(0,1)$, $\epsilon>0$ and $\lambda_m\in\Lambda_m$ be such that, with probability greater than $1-\delta$, it holds $\|f_{\tilde z,\lambda_m}-f_\rho\|_\rho\le\epsilon$. Then, with probability greater than $1-2\delta$, it holds $\|f_{z^{tot}}-f_\rho\|_\rho\le\hat\epsilon$, with

$$\hat\epsilon^2=2\epsilon^2+\frac{80M^2}{m^v}\log\frac{2|\Lambda_m|}{\delta}.$$

The proof of Theorem 3 is postponed to Section 5. The first corollary of Theorem 3 proves universal consistency for the estimators $f_{z^{tot}}$ under mild assumptions on the cardinalities of the grids $\Lambda_m$ and validation sets $z^v$.
Corollary 1. Let $K$ be a universal kernel, $Q$ be a constant greater than 1, and define

(15)   $\Lambda_m=\{\kappa,\kappa Q^{-1},\dots,\kappa Q^{-|\Lambda_m|+1}\},$

with

(16)   $|\Lambda_m|=\omega(1).$

Moreover let $\{z^{tot}_m\}_{m=1}^\infty$ be a sequence of learning sets drawn according to a probability measure $\rho$ over $Z$. Assume $z^{tot}_m=(\tilde z_m,z^v_m)$, with the training sets $\tilde z_m$ composed of $m$ labelled examples and $\tilde m_m-m\ge0$ unlabelled examples, and the validation sets $z^v_m$ composed of $m^v_m=\omega(\log|\Lambda_m|)$ examples. Then, if $B_{\bar r}<+\infty$ for some $\bar r>0$, it holds

$$\lim_{m\to\infty}\|f_{z^{tot}_m}-f_\rho\|_\rho\stackrel{P}{=}0.$$

Proof. The result is a corollary of Theorems 1 and 3. The universality of $K$ enforces the equality (12) (see [18]). Condition (16) implies that the regularization parameter $\lambda_m=\kappa Q^{-(\lfloor\log\log m\rfloor\wedge|\Lambda_m|)}$, which belongs to $\Lambda_m$, fulfills the assumptions (6) and (7). Hence, using the assumption on $m^v_m$, we get that for every $\delta\in(0,1)$, with probability greater than $1-2\delta$,

$$\|f_{z^{tot}_m}-f_\rho\|_\rho^2\le o(1)+\frac{80M^2}{\omega(\log|\Lambda_m|)}\log\frac{2|\Lambda_m|}{\delta}\ \to\ 0.$$
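The geometric grid of eq. (15) is trivial to generate; a minimal sketch (our own illustration):

```python
def lambda_grid(kappa, Q, size):
    """Eq. (15): the geometric grid {kappa, kappa/Q, ..., kappa * Q**(-(size-1))}."""
    assert Q > 1 and size >= 1
    return [kappa * Q ** (-j) for j in range(size)]
```

Its largest element is $\kappa$ and its smallest is $\kappa Q^{-|\Lambda_m|+1}$, so under condition (16) the grid becomes arbitrarily fine on a logarithmic scale as $m$ grows.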
The second corollary proves explicit rates for the convergence of $f_{z^{tot}}$ to $f_\rho$ over specific prior classes defined in terms of finiteness of the constants $C_r$ and $D_s$. The main assumption is the requirement $m^v\ge m/\log m$. Since this constraint can be fulfilled with $m^v$ still asymptotically negligible with respect to $m$, the rates (expressed in terms of $m$) obtained in the second part of the corollary are minimax optimal over the corresponding priors (see [3]).

Corollary 2. Let $K$ be a universal kernel. Consider a learning set $z^{tot}$ with $m^v\ge\frac m{\log m}$ and $\tilde m\ge4\vee m^{1+\eta}$, for some constants $\eta\ge0$, $r>0$, and $s\in(0,1]$. Define $\Lambda_m$ as in eq. (15) with $Q$ an arbitrary constant greater than 1 and

(17)   $\frac\eta\alpha\log_Qm+1\le|\Lambda_m|\le m,$

with $\alpha$ defined by eq. (20). Moreover assume that for some $\delta\in(0,1/6)$, $m$ is large enough that it holds

(18)   $Q\left(\frac{4D_s\log\frac6\delta}{\sqrt m}\right)^{\frac{2\eta}\alpha}\le\kappa^{-1}\|T\|.$

Then, with probability greater than $1-6\delta$,

(19)   $\|f_{z^{tot}}-f_\rho\|_\rho\le4(Q^rE_rD_s+3M)\log(6m/\delta)\,m^{-\frac12\frac{2r-t_2}{2r+s+t_1}},$

where $E_r$, $t_1$ and $t_2$ are the constants defined in equations (9), (10) and (11), substituting

(20)   $\alpha=|2-2r-s|_+\wedge\frac\eta{1+\eta}\bigl(2r+s+|2-2r-s|_+\bigr).$

In particular, if $r+s\ge\frac12$ and $\eta=\frac{|2-2r-s|_+}{2r+s}$, and assuming

(21)   $2\log_Qm+1\le|\Lambda_m|\le m,$

and

$$Q\left(\frac{4D_s\log\frac6\delta}{\sqrt m}\right)^{\frac2{2r+s}}\le\kappa^{-1}\|T\|,$$

with probability greater than $1-6\delta$, it holds

$$\|f_{z^{tot}}-f_\rho\|_\rho\le4(Q^rE_rD_s+3M)\log(6m/\delta)\,m^{-\frac12\frac{2r}{2r+s}}.$$

Proof. The result is a corollary of Theorems 2 and 3. The universality of $K$ enforces the equality (12) (see [18]). First, from equations (20) and (10), by simple algebra we get

$$\frac\eta\alpha=\frac1{2r+s+t_1}.$$

Therefore condition (18) is equivalent to

$$\dot\lambda_q=q\left(\frac{4D_s\log\frac6\delta}{\sqrt m}\right)^{\frac2{2r+s+t_1}}\le\kappa^{-1}\|T\|\qquad\forall q\in[1,Q],$$

and the condition $\dot\lambda\le\kappa^{-1}\|T\|$ in the text of Theorem 2 is verified by $\dot\lambda_q$ for every $q\in[1,Q]$. Moreover, since $D_s\ge1$ and $\delta\le1/6$, for every $q\in[1,Q]$ we can write

$$\tilde m\ge4\vee m^{1+\eta}=4\vee m\left(m^{-\frac\eta\alpha}\right)^{-\alpha}\ge4\vee m\dot\lambda_q^{-\alpha},$$

which shows that also the other assumption of Theorem 2 is verified. Hence, by Theorem 2 we get that for every $q\in[1,Q]$, with probability greater than $1-3\delta$, it holds

$$\|f_{\tilde z,\lambda_q}-f_\rho\|_\rho\le Q^rE_r\left(\frac{4D_s\log\frac6\delta}{\sqrt m}\right)^{\frac{2r-t_2}{2r+s+t_1}}=:\epsilon.$$
The next step is verifying that for some $\bar q\in[1,Q]$, $\lambda_{\bar q}=\kappa\dot\lambda_{\bar q}\in\Lambda_m$, and hence applying Theorem 3. In fact, from definition (15), assumption (17) and Proposition 5, it is clear that

$$\min\Lambda_m\le\kappa m^{-\frac\eta\alpha}\le\lambda_1\le\lambda_{\bar q}\le\lambda_Q\le\|T\|\le\kappa=\max\Lambda_m,$$

for some $\bar q$. Applying Theorem 3, we get that with probability greater than $1-6\delta$ it holds

$$\|f_{z^{tot}}-f_\rho\|_\rho\le\hat\epsilon,$$

with, using again condition (17) and the assumption $m^v\ge\frac m{\log m}$, the chain of inequalities

$$\hat\epsilon=\left(2\epsilon^2+\frac{80M^2}{m^v}\log\frac{2|\Lambda_m|}\delta\right)^{\frac12}\le\left(2\epsilon^2+\frac{80M^2\log m}m\log\frac{2m}\delta\right)^{\frac12}\le\sqrt2\,\epsilon+12M\log(6m/\delta)\,m^{-\frac12}$$
$$\le4Q^rE_rD_s\log(6/\delta)\,m^{-\frac12\frac{2r-t_2}{2r+s+t_1}}+12M\log(6m/\delta)\,m^{-\frac12}\le4(Q^rE_rD_s+3M)\log(6m/\delta)\,m^{-\frac12\frac{2r-t_2}{2r+s+t_1}},$$
which concludes the proof of the first part of the corollary.

The second part of the corollary is an instantiation of the previous result. In fact, by equations (20) and (10), the assumption $\eta=\frac{|2-2r-s|_+}{2r+s}$ implies $\alpha=|2-2r-s|_+$ and $t_1=0$. Moreover, from the assumption $r+s\ge\frac12$ and eq. (11) we get $t_2=0$, and noticing that $\frac\eta\alpha=\frac1{2r+s}\le2$, it is clear that condition (21) implies condition (17).

4. Proofs of Theorems 1 and 2

In this section we give the proofs of Theorems 1 and 2. We use various propositions taken from [4], which we state without proof.

4.1. Before proving Theorem 1, we begin by showing some preliminary propositions. The first one is a technical result about sequences of real numbers.

Proposition 1. Let $\{a_i\}_{i\in\mathbb N}$ and $\{b_i\}_{i\in\mathbb N}$ be two non-increasing sequences of reals in the interval $(0,1)$ with

$$\lim_{i\to\infty}a_i=0,\qquad\lim_{i\to\infty}b_i=0.$$

Then there exists a sequence $\{c_i\}_{i\in\mathbb N}$ of reals in the interval $(0,1)$ such that, defining $d_i=\log c_i/\log b_i$, the following properties hold:

i) $\{d_i\}_{i\in\mathbb N}$ is a non-increasing sequence of positive reals;

ii) $\{c_i\}_{i\in\mathbb N}$ is a non-increasing sequence of positive reals, with $c_i\ge a_i$ for all $i\in\mathbb N$, and $\lim_{i\to\infty}c_i=0$.

Proof. We consider the sequence $\{c_i\}_{i\in\mathbb N}$ of positive numbers constructed by the recursive rule

$$c_1=a_1,\qquad c_{i+1}=a_{i+1}\vee(b_{i+1})^{\frac{\log c_i}{\log b_i}}.$$

Let us prove point i) by induction.
Since by assumption $a_1$ and $b_1$ belong to $(0,1)$, by construction $d_1=\frac{\log c_1}{\log b_1}=\frac{\log a_1}{\log b_1}>0$. Now, for $i\ge1$ assume $d_i>0$; then by construction, either $c_{i+1}=(b_{i+1})^{d_i}$, and hence $d_{i+1}=\frac{\log c_{i+1}}{\log b_{i+1}}=d_i>0$, or $c_{i+1}=(b_{i+1})^{d_{i+1}}=a_{i+1}\ge(b_{i+1})^{d_i}$, and hence, since $a_{i+1}$ and $b_{i+1}$ belong to $(0,1)$, it holds

$$d_{i+1}=\frac{\log a_{i+1}}{\log b_{i+1}}>0,\qquad(b_{i+1})^{d_{i+1}}\ge(b_{i+1})^{d_i}\ \Rightarrow\ d_{i+1}\le d_i.$$

Let us now prove point ii). First, by construction $c_i\ge a_i>0$. Moreover, again by construction, either $c_{i+1}=a_{i+1}$, and hence $c_{i+1}=a_{i+1}\le a_i\le c_i$, or $c_{i+1}=(b_{i+1})^{d_i}$, and hence, since $d_i>0$ by point i), it holds

$$c_{i+1}=(b_{i+1})^{d_i}\le(b_i)^{d_i}=c_i.$$
Therefore the sequence $\{c_i\}_{i\in\mathbb N}$ is non-increasing and $c_i\le c_1=a_1<1$. Finally, we prove that $\lim_ic_i=0$. Let us assume that there exists an infinite increasing sequence of naturals $\{i(k)\}_{k\in\mathbb N}$ such that

$$c_{i(k)}=a_{i(k)}\qquad\forall k\in\mathbb N.$$

Since, by assumption, $\lim_ia_i=0$, then $\lim_kc_{i(k)}=0$. Therefore, since we already proved that $\{c_i\}_{i\in\mathbb N}$ is non-increasing, $\lim_ic_i=0$, which proves the proposition if $\{i(k)\}_{k\in\mathbb N}$ exists. If $\{i(k)\}_{k\in\mathbb N}$ does not exist, by construction there exists $I\in\mathbb N$ such that

$$c_{i+1}=(b_{i+1})^{d_i}\qquad\forall i\ge I.$$

Therefore, recalling the definition of $d_i$, by induction it follows

$$c_i=(b_i)^{d_I}\qquad\forall i>I.$$

Recalling that $d_I>0$ and $\lim_ib_i=0$, the relation above proves that, also in this case, $\lim_ic_i=0$.

The next proposition introduces the functions $f^{tr}_\lambda$ and shows some simple results related to them.

Proposition 2. For any $\lambda>0$ let the truncated function $f^{tr}_\lambda$ be defined by

(22)   $f^{tr}_\lambda=P_\lambda f_{\mathcal H},$

where $P_\lambda$ is the orthogonal projector in $L^2(X,\rho_X)$ defined by

(23)   $P_\lambda=\Theta_\lambda(L_K),$

with

(24)   $\Theta_\lambda(\sigma)=\begin{cases}1&\text{if }\sigma\ge\lambda,\\0&\text{if }\sigma<\lambda.\end{cases}$

Then the function $a:(0,\kappa]\to\mathbb R$, defined by

(25)   $a(\lambda)=\|f^{tr}_\lambda-f_{\mathcal H}\|_\rho,$

is non-decreasing and fulfills the following properties

(26)   $0\le a(\lambda)\le M\qquad\forall\lambda\in(0,\kappa],$

(27)   $\lim_{\lambda\to0}a(\lambda)=0.$
Proof. Recall that the self-adjoint integral operator $L_K$ has a countable eigensystem $\{(\lambda_i,\varphi_i)\}_{i=1}^\infty$ with positive eigenvalues decreasing to zero (see [5]). Moreover, $L_K^{1/2}$ is an isometry between $L^2(X,\rho_X)$ and $\mathcal H$ (again, see [5]). Therefore, since $f_{\mathcal H}$ is the projection of $f_\rho$ onto the closure of $\mathcal H$ in $L^2(X,\rho_X)$, it holds

$$f_{\mathcal H}=\sum_{i=1}^\infty\langle f_\rho,\varphi_i\rangle_\rho\,\varphi_i.$$

Hence, by the definition of $f^{tr}_\lambda$, and recalling that $Y\subset[-M,M]$, we get

$$0\le a(\lambda)^2=\sum_{\lambda_i<\lambda}|\langle f_\rho,\varphi_i\rangle_\rho|^2\le\sum_{i=1}^\infty|\langle f_\rho,\varphi_i\rangle_\rho|^2\le\|f_\rho\|_\rho^2\le M^2.$$
We can assume that $\lambda_i>0$ and $a(\lambda_i)>0$ for every $i\in\mathbb N$. Moreover, from Proposition 5, $\lambda_i\le\kappa$, and by eq. (26), $a(\lambda)\le M$. Hence we can apply Proposition 1 to the non-increasing sequences $\{a_i\}_i$ and $\{b_i\}_i$ defined by

$$a_i=\frac{a(\lambda_i)}{2M},\qquad b_i=\frac{\lambda_i}{2\kappa}.$$

The function $R$ is defined in terms of the sequence $\{d_i\}_i$ constructed in Proposition 1 as follows:

$$R(\lambda)=\begin{cases}\bar r&\text{if }\lambda_1<\lambda\le1,\\ \bar r\,d_i/(\bar r\vee d_1)&\text{if }\lambda_{i+1}<\lambda\le\lambda_i,\ i\ge1.\end{cases}$$

Equality (29) can be proved recalling that, by Proposition 1, $c_i=b_i^{d_i}\le\dot\lambda_i^{d_i}$ goes to zero as $i\to\infty$, and hence

$$\lim_{\lambda\to0}\dot\lambda^{R(\dot\lambda)}=\lim_{i\to\infty}\dot\lambda_i^{R(\dot\lambda_i)}=\lim_{i\to\infty}(2b_i)^{d_i\bar r/(\bar r\vee d_1)}\le2^{\bar r}\lim_{i\to\infty}c_i^{\bar r/(\bar r\vee d_1)}=0.$$
Since by Proposition 1 $\{d_i\}_i$ is a non-increasing sequence of positive reals, $R$ is non-decreasing. Therefore, defining $f_i=\langle f_{\mathcal H},\varphi_i\rangle_\rho$, we can write

$$\kappa^{2R(\dot\lambda)}\left\|L_K^{-R(\dot\lambda)}P_\lambda f_{\mathcal H}\right\|_\rho^2=\sum_{\dot\lambda_i\ge\dot\lambda}f_i^2\,\dot\lambda_i^{-2R(\dot\lambda)}\le\sum_if_i^2\,\dot\lambda_i^{-2R(\dot\lambda_i)}\le\sum_if_i^2\,c_i^{-1}\qquad(b_i^{d_i}=c_i<1,\ c_i\ge a_i)$$
$$\le2M\sum_if_i^2\,a(\lambda_i)^{-1}=2M\sum_{k=0}^\infty\ \sum_{i:\,2^{-k-1}<a_i\le2^{-k}}f_i^2\,a(\lambda_i)^{-1}\le16M^2.$$

Proposition 6. Let $r>0$ and let $f$ be such that $\|L_K^{-r}f\|_\rho<+\infty$. Then, if $\lambda\in(0,\kappa]$, it holds
$$\left\|\sqrt T\bigl(G_\lambda(T_{\tilde x})T_{\tilde x}-\mathrm{Id}\bigr)P_\lambda f\right\|_{\mathcal H}\le B_r\left\|L_K^{-r}f\right\|_\rho(1+\sqrt\gamma)\left(2+r\gamma\dot\lambda^{\frac32-r}+\gamma^\eta\right)\lambda^r,$$

where $P_\lambda$ is defined in eq. (23), and

(33)   $\gamma=\lambda^{-1}\|T-T_{\tilde x}\|,\qquad\eta=\left|r-\tfrac12\right|-\left\lfloor\left|r-\tfrac12\right|\right\rfloor.$

Proof. See Proposition 6 in [4].
Proposition 7. Let the operator $\Omega_\lambda$ be defined by

(34)   $\Omega_\lambda=\sqrt T\,G_\lambda(T_{\tilde x})(T_{\tilde x}+\lambda)(T+\lambda)^{-\frac12}.$

Then, if $\lambda\in(0,\kappa]$, it holds

$$\|\Omega_\lambda\|\le(1+2\sqrt\gamma)\,A,$$

with $\gamma$ defined in eq. (33).

Proof. See Proposition 7 in [4].

We finally need the following probabilistic inequality based on a result of [15]; see also Th. 3.3.4 of [19]. We report it without proof.

Proposition 8. Let $(\Omega,\mathcal F,P)$ be a probability space and $\xi$ be a random variable on $\Omega$ taking values in a real separable Hilbert space $\mathcal K$. Assume that there are two positive constants $H$ and $\sigma$ such that

$$\|\xi(\omega)\|_{\mathcal K}\le\frac H2\ \text{a.s.},\qquad E\bigl[\|\xi\|_{\mathcal K}^2\bigr]\le\sigma^2.$$

Then, for all $m\in\mathbb N$ and $0<\delta<1$,

(35)   $P_{(\omega_1,\dots,\omega_m)\sim P^m}\left[\left\|\frac1m\sum_{i=1}^m\xi(\omega_i)-E[\xi]\right\|_{\mathcal K}\le2\left(\frac\sigma{\sqrt m}+\frac Hm\right)\log\frac2\delta\right]\ge1-\delta.$
We are now ready to prove Theorem 1.

Proof of Theorem 1. Let us consider the expansion

$$\sqrt T(f_{\tilde z,\lambda}-f_{\mathcal H})=\sqrt T\bigl(G_\lambda(T_{\tilde x})g_z-f^{tr}_\lambda\bigr)+\sqrt T(f^{tr}_\lambda-f_{\mathcal H})$$
$$=\Omega_\lambda(T+\lambda)^{\frac12}(f^{ls}_{\tilde z,\lambda}-f^{ls}_{\tilde z',\lambda})+\sqrt T\bigl(G_\lambda(T_{\tilde x})T_{\tilde x}-\mathrm{Id}\bigr)f^{tr}_\lambda+\sqrt T(f^{tr}_\lambda-f_{\mathcal H})$$
$$=\Omega_\lambda\Bigl[(T+\lambda)^{\frac12}(f^{ls}_{\tilde z,\lambda}-f^{ls}_\lambda)+(T+\lambda)^{\frac12}(f^{ls}_\lambda-\bar f^{ls}_\lambda)+(T+\lambda)^{\frac12}(\bar f^{ls}_\lambda-f^{ls}_{\tilde z',\lambda})\Bigr]+\sqrt T\bigl(G_\lambda(T_{\tilde x})T_{\tilde x}-\mathrm{Id}\bigr)f^{tr}_\lambda+\sqrt T(f^{tr}_\lambda-f_{\mathcal H}),$$

where the operator $\Omega_\lambda$ is defined by equation (34), the ideal RLS estimators are $f^{ls}_\lambda=(T+\lambda)^{-1}Tf_{\mathcal H}$ and $\bar f^{ls}_\lambda=(T+\lambda)^{-1}Tf^{tr}_\lambda$, and $f^{ls}_{\tilde z',\lambda}=(T_{\tilde x}+\lambda)^{-1}T_{\tilde x}f^{tr}_\lambda$ is the RLS estimator constructed by the training set

$$\tilde z'=\bigl((\tilde x_1,f^{tr}_\lambda(\tilde x_1)),\dots,(\tilde x_{\tilde m},f^{tr}_\lambda(\tilde x_{\tilde m}))\bigr).$$

Hence we get the following decomposition,

(36)   $\|f_{\tilde z,\lambda}-f_{\mathcal H}\|_\rho\le D(\tilde z,\lambda)\bigl[S^{ls}(\tilde z,\lambda)+R(\lambda)+\bar S^{ls}(\tilde z,\lambda)\bigr]+P(\tilde z,\lambda)+P^{tr}(\lambda),$
with

(37)
$$S^{ls}(\tilde z,\lambda)=\left\|(T+\lambda)^{\frac12}(f^{ls}_{\tilde z,\lambda}-f^{ls}_\lambda)\right\|_{\mathcal H},\qquad \bar S^{ls}(\tilde z,\lambda)=\left\|(T+\lambda)^{\frac12}(f^{ls}_{\tilde z',\lambda}-\bar f^{ls}_\lambda)\right\|_{\mathcal H},$$
$$D(\tilde z,\lambda)=\|\Omega_\lambda\|,\qquad P(\tilde z,\lambda)=\left\|\sqrt T\bigl(G_\lambda(T_{\tilde x})T_{\tilde x}-\mathrm{Id}\bigr)f^{tr}_\lambda\right\|_{\mathcal H},$$
$$P^{tr}(\lambda)=\left\|f^{tr}_\lambda-f_{\mathcal H}\right\|_\rho,\qquad R(\lambda)=\left\|(T+\lambda)^{\frac12}(\bar f^{ls}_\lambda-f^{ls}_\lambda)\right\|_{\mathcal H}.$$

Terms $S^{ls}$ and $\bar S^{ls}$ will be estimated by Proposition 4, terms $P^{tr}$ and $R$ by Proposition 2, term $D$ by Proposition 7, and finally term $P$ by Propositions 6, 3 and 2.
Step 1: Estimate of $S^{ls}$. Since $L_K^{1/2}$ is an isometry between $L^2(X,\rho_X)$ and $\mathcal H$ (see [5]), we obtain

(38)   $\left\|f^{ls}_\lambda\right\|_{\mathcal H}\le\left\|\sqrt T(T+\lambda)^{-1}\right\|\,\|f_{\mathcal H}\|_\rho\le\frac{\|f_{\mathcal H}\|_\rho}{\sqrt\lambda}\le\frac M{\sqrt\lambda}.$

Now, let $\delta$ be an arbitrary real in $(0,1)$. From the assumptions on $\lambda_m$, for large enough $m$ we have

$$\lambda_m\sqrt m\ge4\kappa\log\frac6\delta,\qquad\lambda_m\le\|T\|.$$

Hence, by Proposition 5, for large enough $m$ the assumptions of Proposition 4 are verified, and we get that with probability greater than $1-\delta$

$$S^{ls}(\tilde z_m,\lambda_m)\le8\left(M+\sqrt\kappa\,\|f^{ls}_{\lambda_m}\|_{\mathcal H}\right)\left(\sqrt{\frac{\mathcal N(\lambda_m)}m}+\frac{2\kappa}{\lambda_mm}\right)\log\frac6\delta$$
$$(\text{Prop. 5, eq. (38)})\quad\le8M\left(1+\sqrt{\frac\kappa{\lambda_m}}\right)\left(\sqrt{\frac\kappa{\lambda_mm}}+\frac{2\kappa}{\lambda_mm}\right)\log\frac6\delta\qquad(\lambda_m\le\kappa,\ m\ge4)\quad\le\frac{32M\kappa}{\lambda_m\sqrt m}\log\frac6\delta\ \to\ 0.$$

Hence it holds

(39)   $\lim_{m\to\infty}S^{ls}(\tilde z_m,\lambda_m)\stackrel{P}{=}0.$
Step 2: Estimate of $\bar S^{ls}$. This term can be estimated observing that $\tilde z'$ is a training set of $\tilde m$ supervised samples drawn i.i.d. from the probability measure $\rho'$ with marginal $\rho_X$ and conditional $\rho'_{|x}(y)=\delta(y-f^{tr}_\lambda(x))$. Therefore the regression function induced by $\rho'$ is $f_{\rho'}=f^{tr}_\lambda$, and the support of $\rho'$ is included in $X\times[-M',M']$, with $M'=\sup_{x\in X}|f_{\rho'}(x)|\le\sqrt\kappa\,\|f^{tr}_\lambda\|_{\mathcal H}$. Reasoning as in the analysis of $S^{ls}$, we obtain that, for every $\delta\in(0,1)$ and large enough $m$, with probability greater than $1-\delta$ it holds

$$\bar S^{ls}(\tilde z_m,\lambda_m)\le8\left(M'+\sqrt\kappa\,\|\bar f^{ls}_{\lambda_m}\|_{\mathcal H}\right)\left(\sqrt{\frac{\mathcal N(\lambda_m)}{\tilde m_m}}+\frac{2\kappa}{\lambda_m\tilde m_m}\right)\log\frac6\delta$$
$$(\text{Prop. 5})\quad\le16\sqrt\kappa\,\|f^{tr}_{\lambda_m}\|_{\mathcal H}\,\frac2{\sqrt m}\sqrt{\frac\kappa{\lambda_m}}\log\frac6\delta\qquad(m\ge4)\quad\le\frac{32\kappa}{\sqrt{\lambda_m}\sqrt m}\left\|P_{\lambda_m}L_K^{-\frac12}P_{\lambda_m}f_{\mathcal H}\right\|_\rho\log\frac6\delta\le\frac{32\kappa M}{\lambda_m\sqrt m}\log\frac6\delta\ \to\ 0.$$

Hence it holds

(40)   $\lim_{m\to\infty}\bar S^{ls}(\tilde z_m,\lambda_m)\stackrel{P}{=}0.$
Step 3: Estimate of $P^{tr}$. By definition (25), $P^{tr}(\lambda)=a(\lambda)$. Hence from eq. (27)

(41)   $\lim_{m\to\infty}P^{tr}(\lambda_m)=\lim_{m\to\infty}a(\lambda_m)=\lim_{\lambda\to0}a(\lambda)=0,$

where we used the assumption (6).

Step 4: Estimate of $R$. Since, from the definitions of $f^{ls}_\lambda$ and $\bar f^{ls}_\lambda$,

$$R(\lambda)=\left\|(T+\lambda)^{-\frac12}T(\bar f^{ls}_\lambda-f^{ls}_\lambda)\right\|_{\mathcal H}\le\left\|\sqrt T(f^{tr}_\lambda-f_{\mathcal H})\right\|_{\mathcal H}\le P^{tr}(\lambda),$$

from (41) we get

(42)   $\lim_{m\to\infty}R(\lambda_m)=0.$
Step 5: Estimate of $D$. In order to estimate $D(\tilde z,\lambda)$, we first have to estimate the quantity $\gamma=\gamma(\tilde z,\lambda)$ (see definition (33)) appearing in Proposition 7. Our estimate for $\gamma(\tilde z,\lambda)$ follows from Proposition 8 applied to the random variable $\xi:X\to\mathcal L_{HS}(\mathcal H)$ defined by

$$\xi(x)[\cdot]=\lambda^{-1}K_x\langle K_x,\cdot\rangle_{\mathcal H}.$$

We can set $H=\frac{2\kappa}\lambda$ and $\sigma=\frac\kappa\lambda$, and obtain that, for every $\delta\in(0,1)$ and $m\ge4$, with probability greater than $1-\delta$,

$$\gamma(\tilde z_m,\lambda)\le\lambda^{-1}\|T-T_{\tilde x}\|_{HS}\le\frac2\lambda\left(\frac{2\kappa}{\tilde m_m}+\frac\kappa{\sqrt{\tilde m_m}}\right)\log\frac2\delta\le\frac{4\kappa}{\lambda\sqrt m}\log\frac2\delta=:\epsilon(m,\lambda,\delta).$$

From the expression of $\epsilon(m,\lambda,\delta)$ we see that, by assumption (7), for every $\delta\in(0,1)$,

$$\lim_{m\to\infty}\epsilon(m,\lambda_m,\delta)=0,$$

and hence

(43)   $\lim_{m\to\infty}\gamma(\tilde z_m,\lambda_m)\stackrel{P}{=}0.$

Finally, from eq. (43) and Proposition 7 we find

(44)   $D(\tilde z_m,\lambda_m)\le\left(1+2\sqrt{\gamma(\tilde z_m,\lambda_m)}\right)A\le3A$ with probability tending to one.
Finally, from eq. (43) and Proposition 7 we find p P D(˜ zm , λm ) ≤ 1 + 2 γ(˜ (44) zm , λm ) A → 3A. Step 6: Estimate of P . First, notice that by the definition (3), WLOG we can assume r¯ < 12 . Moreover by condition (6), we can assume m large enough that λm ≤ κ. We consider the function R introduced by Proposition 3, and apply Proposition 6, with f = Pλm fH and rm = R(κ−1 λm ) ≤ r¯, getting 1 P (˜ zm , λm ) ≤ Br¯ 1 + γ(˜ zm , λ m ) 2
1 m 2 + rm γ(˜ zm , λm ) + γ(˜ zm , λm ) 2 −rm κrm L−r Pλm fH ρ (κ−1 λm )rm . K This result together with eq. (43), and recalling that by Proposition 3 and assumption (6), the sequence {rm }m verifies the two conditions
m κrm L−r Pλm fH ρ ≤ 4M ∀m, K lim (κ−1 λm )rm = 0,
m→∞
proves that (45)
P
lim P (˜ zm , λm ) = 0.
m→∞
The proof of the Theorem is completed considering the limit m → ∞ of estimate (36), and using equations (39), (40), (41), (42), (44) and (45).
4.2. Before showing the proof of Theorem 2, we state two propositions from [4] which describe properties of the functions $f^{tr}_\lambda$ and $f^{ls}_\lambda$ (defined in eq. (22) and eq. (32), respectively) when $f_{\mathcal H}\in\operatorname{Im}L_K^r$.

Proposition 9. Let $f_{\mathcal H}\in\operatorname{Im}L_K^r$ for some $r>0$. Then the following estimates hold:

$$\left\|f^{tr}_\lambda-f_{\mathcal H}\right\|_\rho\le\lambda^r\left\|L_K^{-r}f_{\mathcal H}\right\|_\rho,\qquad\left\|f^{tr}_\lambda\right\|_{\mathcal H}\le\begin{cases}\lambda^{-\frac12+r}\left\|L_K^{-r}f_{\mathcal H}\right\|_\rho&\text{if }r\le\frac12,\\ \kappa^{-\frac12+r}\left\|L_K^{-r}f_{\mathcal H}\right\|_\rho&\text{if }r>\frac12.\end{cases}$$

Proof. See Proposition 3 in [4].

Proposition 10. Let $f_{\mathcal H}\in\operatorname{Im}L_K^r$ for some $r>0$. Then the following estimates hold:

$$\left\|f^{ls}_\lambda-f_{\mathcal H}\right\|_\rho\le\lambda^r\left\|L_K^{-r}f_{\mathcal H}\right\|_\rho\ \text{ if }r\le1,\qquad\left\|f^{ls}_\lambda\right\|_{\mathcal H}\le\begin{cases}\lambda^{-\frac12+r}\left\|L_K^{-r}f_{\mathcal H}\right\|_\rho&\text{if }r\le\frac12,\\ \kappa^{-\frac12+r}\left\|L_K^{-r}f_{\mathcal H}\right\|_\rho&\text{if }r>\frac12.\end{cases}$$

Proof. See Proposition 1 in [4].
We are now ready to prove Theorem 2.

Proof of Theorem 2. We consider the same decomposition (see equations (36) and (37)) for $\|f_{\tilde z,\lambda}-f_{\mathcal H}\|_\rho$ that we used in the proof of Theorem 1. Terms $S^{ls}$ and $\bar S^{ls}$ will be estimated by Proposition 4, term $D$ by Proposition 7, term $P$ by Proposition 6 and finally terms $P^{tr}$ and $R$ by Proposition 9.

Let us begin with the estimates of $S^{ls}$ and $\bar S^{ls}$. First observe that, by Proposition 5, it holds $\dot\lambda\le\kappa^{-1}\|T\|\le1$; therefore, since by assumption $\tilde m\ge m\dot\lambda^{-|2-2r-s|_++t_1}\ge m\dot\lambda^{-|1-2r|_++t_1}$, we get

$$\dot\lambda\tilde m\ge\dot\lambda^{-|1-2r|_++1+t_1}m\ge\dot\lambda^{2r+t_1}m.$$

Moreover, by eq. (8) and definition (5), we find

$$\dot\lambda^{2r+t_1}m=16q^{2r+s+t_1}D_s^2\dot\lambda^{-s}\log^2\frac6\delta\ge16\mathcal N(\lambda)\log^2\frac6\delta,$$

hence the hypothesis (30) in the text of Proposition 4 is verified.

Regarding the estimate of $S^{ls}$: applying Proposition 4 and recalling that, from Proposition 10, $\sqrt\kappa\,\|f^{ls}_\lambda\|_{\mathcal H}\le C_r\dot\lambda^{-|\frac12-r|_+}$, we get that with probability greater than $1-\delta$

(46)   $S^{ls}(\tilde z,\lambda)\le8\left(M+\sqrt\kappa\,\|f^{ls}_\lambda\|_{\mathcal H}\right)\left(\sqrt{\frac{\mathcal N(\lambda)}m}+\frac{2\kappa}{\lambda m}\right)\log\frac6\delta\le8\left(M+C_r\dot\lambda^{-|\frac12-r|_+}\right)\left(\frac{D_s}{\sqrt{m\dot\lambda^s}}+\frac2{m\dot\lambda}\right)\log\frac6\delta\le\cdots\le3(M+C_r)\,\dot\lambda^{\,r-\frac{t_2}2},$

where the last steps use eq. (8) together with $t_1\ge0$ and $q\ge1$.
The term $\bar S^{ls}$ can be estimated observing that $\tilde z'$ is a training set of $\tilde m$ supervised samples drawn i.i.d. from the probability measure $\rho'$ with marginal $\rho_X$ and conditional $\rho'_{|x}(y)=\delta(y-f^{tr}_\lambda(x))$. Therefore the regression function induced by $\rho'$ is $f_{\rho'}=f^{tr}_\lambda$, and the support of $\rho'$ is included in $X\times[-M',M']$, with $M'=\sup_{x\in X}|f_{\rho'}(x)|\le\sqrt\kappa\,\|f^{tr}_\lambda\|_{\mathcal H}$. Again applying Proposition 4, we obtain that with probability greater than $1-\delta$ it holds

(47)   $\bar S^{ls}(\tilde z,\lambda)\le8\left(M'+\sqrt\kappa\,\|\bar f^{ls}_\lambda\|_{\mathcal H}\right)\left(\sqrt{\frac{\mathcal N(\lambda)}{\tilde m}}+\frac{2\kappa}{\lambda\tilde m}\right)\log\frac6\delta\le16\sqrt\kappa\,\|f^{tr}_\lambda\|_{\mathcal H}\left(\sqrt{\frac{\mathcal N(\lambda)}{\tilde m}}+\frac{2\kappa}{\lambda\tilde m}\right)\log\frac6\delta$
$$(\text{Prop. 9})\quad\le16\,C_r\dot\lambda^{-|\frac12-r|_+}\left(\frac{D_s}{\sqrt{m\dot\lambda^s}}+\frac2{m\dot\lambda}\right)\log\frac6\delta\le\cdots\le6\,C_r\,\dot\lambda^{\,r-\frac{t_2}2},$$

where the last steps again use eq. (8) together with $t_1\ge0$ and $q\ge1$.
In order to get an upper bound for $D$ and $P$, we first have to estimate the quantity $\gamma=\gamma(\tilde z,\lambda)$ (see definition (33)) appearing in Propositions 6 and 7. Our estimate for $\gamma(\tilde z,\lambda)$ follows from Proposition 8 applied to the random variable $\xi:X\to\mathcal L_{HS}(\mathcal H)$ defined by

$$\xi(x)[\cdot]=\lambda^{-1}K_x\langle K_x,\cdot\rangle_{\mathcal H}.$$

We can set $H=\frac{2\kappa}\lambda$ and $\sigma=\frac\kappa\lambda$, and obtain that with probability greater than $1-\delta$

$$\gamma(\tilde z,\lambda)\le\lambda^{-1}\|T-T_{\tilde x}\|_{HS}\le\frac2\lambda\left(\frac{2\kappa}{\tilde m}+\frac\kappa{\sqrt{\tilde m}}\right)\log\frac2\delta\le\frac4{\dot\lambda\sqrt{\tilde m}}\log\frac2\delta\le\dot\lambda^{\,|1-r-\frac s2|_+-(1-r-\frac s2)}\le\dot\lambda^{\,|r+\frac s2-1|_+}\le1,$$

where we used the assumption $\tilde m\ge4\vee m\dot\lambda^{-|2-2r-s|_++t_1}$ and the expression (8) for $\dot\lambda$ in the text of the theorem.

Hence, since $\gamma(\tilde z,\lambda)\le\dot\lambda^{\,|r+\frac s2-1|_+}$, from Proposition 7 we get

(48)   $D(\tilde z,\lambda)\le(1+2\sqrt\gamma)A\le3A,$

and from Proposition 6
P (˜ z, λ)
√
3
γ)(2 + rγ λ˙ 2 −r + γ η )λ˙ r
≤
Br Cr (1 +
≤ ≤
2Br Cr (3 + rγ λ˙ 2 −r )λ˙ r |r+ 2s −1|+ + 23 −r ˙ r )λ 2Br Cr (3 + rλ˙
≤
2Br Cr (3 + rλ˙
3
s+1 2
)λ˙ r ≤ 2Br Cr (3 + r)λ˙ r .
Regarding terms $P^{tr}$ and $R$: from Proposition 9 we get

(50)   $P^{tr}(\lambda)\le C_r\dot\lambda^r,$

and hence

(51)   $R(\lambda)=\left\|(T+\lambda)^{-\frac12}T(\bar f^{ls}_\lambda-f^{ls}_\lambda)\right\|_{\mathcal H}\le\left\|\sqrt T(f^{tr}_\lambda-f_{\mathcal H})\right\|_{\mathcal H}\le P^{tr}(\lambda)\le C_r\dot\lambda^r.$
The proof is completed by plugging inequalities (46), (47), (48), (49), (50) and (51) into (36), recalling the expression (8) for $\dot\lambda$.

5. Proof of Theorem 3

The following result is due to [14], adapted to a form suitable for this paper.

Proposition 11. Let $\{X_i\}_{i=1}^n$ be real-valued i.i.d. random variables with mean $\mu$, $|X_i|\le B$ and $E[(X_i-\mu)^2]\le\sigma^2$, for all $i\in\{1,\dots,n\}$. Then for arbitrary $\alpha>0$, $\epsilon>0$,

(52)   $P\left[\frac1n\sum_{i=1}^nX_i-\mu\ge\alpha\sigma^2+\epsilon\right]\le e^{-\frac{6n\alpha\epsilon}{3+4\alpha B}},$

and

(53)   $P\left[\mu-\frac1n\sum_{i=1}^nX_i\ge\alpha\sigma^2+\epsilon\right]\le e^{-\frac{6n\alpha\epsilon}{3+4\alpha B}}.$

Proof. It suffices to prove the one-sided inequality (52). For any $s>0$,

$$P\left[\frac1n\sum_{i=1}^nX_i-\mu\ge\alpha\sigma^2+\epsilon\right]=P\left[e^{\frac sn\sum_{i=1}^n(X_i-\mu)}\ge e^{s(\alpha\sigma^2+\epsilon)}\right]\le e^{-s\epsilon-s\alpha\sigma^2}\,E\,e^{\frac sn\sum_{i=1}^n(X_i-\mu)}\quad\text{(Markov's inequality)}$$
$$=e^{-s\epsilon-s\alpha\sigma^2}\prod_{i=1}^nE\,e^{\frac sn(X_i-\mu)}\quad\text{(independence of the }X_i\text{)}.$$

Denote $Z_i=X_i-\mu$, $t=s/n$ and $B_1=2B$. Then, for those $s$ such that $sB<3n/2$ (or equivalently $B_1t/3<1$),

$$E\,e^{tZ_i}=1+\sum_{k=2}^\infty\frac{t^k}{k!}E[Z_i^k]\le1+\frac{t^2\sigma^2}2\sum_{k=0}^\infty\left(\frac{B_1t}3\right)^k\le1+\frac{3t^2\sigma^2}{6-4Bt}\le\exp\left(\frac{3t^2\sigma^2}{6-4Bt}\right)=\exp\left(\frac{3s^2\sigma^2}{n^2(6-4sB/n)}\right),$$

whence

$$e^{-s\epsilon-s\alpha\sigma^2}\prod_{i=1}^nE\,e^{\frac sn(X_i-\mu)}\le e^{-s\epsilon}\exp\left(\frac{s\sigma^2}n\left(\frac{3s}{6-4sB/n}-n\alpha\right)\right).$$

Setting $s=s_0=\frac{6\alpha n}{3+4\alpha B}$ (one can check that $s_0B=\frac{6n\alpha B}{3+4\alpha B}<3n/2$), we have $\frac{3s_0}{6-4s_0B/n}-n\alpha=0$, and thus the right-hand side is at most $e^{-s_0\epsilon}=\exp\left(-\frac{6n\alpha\epsilon}{3+4\alpha B}\right)$, which gives estimate (52).
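As a sanity check of Proposition 11, one can compare the empirical frequency of the deviation event with the bound (52) on simulated data. The sketch below uses our own toy choices (uniform variables on $[-1,1]$, so $\mu=0$, $B=1$ and $\sigma^2=1/3$; nothing here is from the paper):

```python
import math
import random

def bernstein_bound(n, alpha, B, eps):
    """Right-hand side of eq. (52)."""
    return math.exp(-6.0 * n * alpha * eps / (3.0 + 4.0 * alpha * B))

def deviation_frequency(n, trials, alpha, sigma2, eps, seed=0):
    """Empirical frequency of { mean(X) - mu >= alpha*sigma^2 + eps }
    for X_i uniform on [-1, 1] (so mu = 0)."""
    rng = random.Random(seed)
    threshold = alpha * sigma2 + eps
    hits = 0
    for _ in range(trials):
        mean = sum(rng.uniform(-1.0, 1.0) for _ in range(n)) / n
        if mean >= threshold:
            hits += 1
    return hits / trials
```

With, say, $n=200$, $\alpha=1$ and $\epsilon=0.1$, the bound is already astronomically small and the event is essentially never observed, consistent with the exponential decay in $n$.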
We are now ready to prove Theorem 3.

Proof of Theorem 3. The strategy of the proof is the following. Define

(54)   $\lambda^*_m=\operatorname*{argmin}_{\lambda\in\Lambda_m}\int_Z\bigl(T_Mf_{\tilde z,\lambda}(x)-y\bigr)^2\,d\rho.$

Notice that, since for every $f\in L^2(X,\rho_X)$,

$$\int_Z(f(x)-y)^2\,d\rho=\|f-f_\rho\|_\rho^2+\int_Z(f_\rho(x)-y)^2\,d\rho,$$

definition (54) of $\lambda^*_m$ is equivalent to

$$\lambda^*_m=\operatorname*{argmin}_{\lambda\in\Lambda_m}\|T_Mf_{\tilde z,\lambda}-f_\rho\|_\rho.$$

Now, from the equality above, the assumption of the theorem, and recalling that $f_\rho(x)\in Y\subset[-M,M]$, we get that with probability greater than $1-\delta$ it holds

(55)   $\left\|T_Mf_{\tilde z,\lambda^*_m}-f_\rho\right\|_\rho\le\|T_Mf_{\tilde z,\lambda_m}-f_\rho\|_\rho\le\|f_{\tilde z,\lambda_m}-f_\rho\|_\rho\le\epsilon.$
We claim that for every $\tilde z$, $\lambda>0$ and $\delta\in(0,1)$, with probability greater than $1-\delta$ over the probability measure $\rho^{m^v}$, it holds

(56)   $\|f_{z^{tot}}-f_\rho\|_\rho^2\le2\left\|T_Mf_{\tilde z,\lambda^*_m}-f_\rho\right\|_\rho^2+\frac{80M^2}{m^v}\log\frac{2|\Lambda_m|}\delta.$

Estimates (55) and (56) together will complete the proof of the theorem. We now proceed to proving eq. (56). For $i=1,\dots,m^v$, let us define the random variables

$$\xi^\lambda_i=\bigl(T_Mf_{\tilde z,\lambda}(x^v_i)-y^v_i\bigr)^2-\bigl(f_\rho(x^v_i)-y^v_i\bigr)^2.$$

Clearly

$$|\xi^\lambda_i|\le4M^2,$$
$$E[\xi^\lambda_i]=\int_Z\bigl(T_Mf_{\tilde z,\lambda}(x)-y\bigr)^2\,d\rho-\int_Z\bigl(f_\rho(x)-y\bigr)^2\,d\rho=\|T_Mf_{\tilde z,\lambda}-f_\rho\|_\rho^2,$$
$$E[(\xi^\lambda_i)^2]=\int_Z\bigl(T_Mf_{\tilde z,\lambda}(x)-f_\rho(x)\bigr)^2\bigl(T_Mf_{\tilde z,\lambda}(x)+f_\rho(x)-2y\bigr)^2\,d\rho\le16M^2\,\|T_Mf_{\tilde z,\lambda}-f_\rho\|_\rho^2=16M^2\,E[\xi^\lambda_i].$$
Hence, using Proposition 11 with $X_i=\xi^\lambda_i$, $\mu=E[\xi^\lambda_i]$, $B=4M^2$ and $\sigma^2=E[(\xi^\lambda_i)^2]\le16M^2\mu$, we obtain that for all $\lambda\in\Lambda_m$, with probability greater than $1-\delta$,

$$\frac1{m^v}\sum_{i=1}^{m^v}\xi^\lambda_i\le(1+\alpha')\,E[\xi^\lambda_i]+\epsilon,$$

and

$$E[\xi^\lambda_i]\le\frac1{1-\alpha'}\left(\frac1{m^v}\sum_{i=1}^{m^v}\xi^\lambda_i\right)+\frac\epsilon{1-\alpha'},$$

where $\alpha'=16\alpha M^2$ and $\epsilon=\frac{3+\alpha'}{6\alpha m^v}\log\frac{2|\Lambda_m|}\delta$. Therefore

$$\|f_{z^{tot}}-f_\rho\|_\rho^2=E\bigl[\xi^{\hat\lambda_{z^v}}_i\bigr]\le\frac1{1-\alpha'}\left(\frac1{m^v}\sum_{i=1}^{m^v}\xi^{\hat\lambda_{z^v}}_i\right)+\frac\epsilon{1-\alpha'}\le\frac1{1-\alpha'}\left(\frac1{m^v}\sum_{i=1}^{m^v}\xi^{\lambda^*_m}_i\right)+\frac\epsilon{1-\alpha'}$$
$$\le\frac{1+\alpha'}{1-\alpha'}\,E\bigl[\xi^{\lambda^*_m}_i\bigr]+\frac{2\epsilon}{1-\alpha'}=\frac{1+\alpha'}{1-\alpha'}\left\|T_Mf_{\tilde z,\lambda^*_m}-f_\rho\right\|_\rho^2+\frac{2\epsilon}{1-\alpha'}.$$

Setting $\alpha=1/(48M^2)$, which gives $\alpha'=1/3$, we obtain

$$\|f_{z^{tot}}-f_\rho\|_\rho^2\le2\left\|T_Mf_{\tilde z,\lambda^*_m}-f_\rho\right\|_\rho^2+\frac{80M^2}{m^v}\log\frac{2|\Lambda_m|}\delta,$$

which proves eq. (56), as desired.
Acknowledgements. The authors wish to thank Peter Bickel for an inspiring discussion leading to this work, Bo Li for pointing out reference [10], and E. De Vito, T. Poggio, L. Rosasco, S. Smale and A. Verri for useful discussions and suggestions. The first author was supported by City University of Hong Kong grant No. 7200111(MA). The second author was supported by NSF grant 0325113.

References

[1] Frank Bauer, Sergei Pereverzev, and Lorenzo Rosasco. On regularization algorithms in learning theory. J. Complexity, 23(1):52–72, 2007.
[2] A. Caponnetto and E. De Vito. Fast rates for regularized least-squares algorithm. Technical Report CBCL Paper #248/AI Memo #2005-013, Massachusetts Institute of Technology, Cambridge, MA, April 2005.
[3] A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Found. Comput. Math., 7(3):331–368, 2007.
[4] Andrea Caponnetto. Optimal rates for regularization operators in learning theory. Technical Report CBCL Paper #264/CSAIL-TR #2006-062, Massachusetts Institute of Technology, Cambridge, MA, 2006.
[5] F. Cucker and S. Smale. On the mathematical foundations of learning. Bull. Amer. Math. Soc. (N.S.), 39(1):1–49, 2002.
[6] E. De Vito, A. Caponnetto, and L. Rosasco. Model selection for regularized least-squares algorithm in learning theory. Found. Comput. Math., 5(1):59–85, February 2005.
[7] E. De Vito, L. Rosasco, A. Caponnetto, U. De Giovannini, and F. Odone. Learning from examples as an inverse problem. Journal of Machine Learning Research, 6:883–904, 2005.
[8] E. De Vito, L. Rosasco, and A. Verri. Spectral methods for regularization in learning theory. Preprint, 2005.
[9] Ernesto De Vito, Lorenzo Rosasco, and Andrea Caponnetto. Discretization error analysis for Tikhonov regularization. Anal. Appl. (Singap.), 4(1):81–99, 2006.
[10] S. Dudoit and M. J. van der Laan. Asymptotics of cross-validated risk estimation in estimator selection and performance assessment. Statistical Methodology, 2(2):131–154, 2001.
[11] H. W. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems, volume 375 of Mathematics and its Applications. Kluwer Academic Publishers, Dordrecht, 1996.
[12] T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Adv. Comp. Math., 13:1–50, 2000.
[13] L. Györfi, M. Kohler, A. Krzyzak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer Series in Statistics, Springer, 2002.
[14] M. Hamers and M. Kohler. A bound on the expected maximal deviations of sample averages from their means. Preprint 2001-9, Mathematical Institute A, University of Stuttgart, 2001.
[15] I. F. Pinelis and A. I. Sakhanenko. Remarks on inequalities for probabilities of large deviations. Theory Probab. Appl., 30(1):143–148, 1985.
[16] S. Smale and D. Zhou. Learning theory estimates via integral operators and their approximations. Preprint, Toyota Technological Institute, Chicago, 2005.
[17] Steve Smale and Ding-Xuan Zhou. Shannon sampling II: Connections to learning theory. Appl. Comput. Harmon. Anal., 19(3):285–302, 2005.
[18] I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67–93, 2002.
[19] V. Yurinsky. Sums and Gaussian Vectors, volume 1617 of Lecture Notes in Mathematics. Springer-Verlag, Berlin, 1995.