Optimal Learning for Multi-pass Stochastic Gradient Methods

Junhong Lin† and Lorenzo Rosasco†‡
[email protected], [email protected]

† LCSL, Massachusetts Institute of Technology and Istituto Italiano di Tecnologia, Cambridge, MA 02139, USA
‡ DIBRIS, Università degli Studi di Genova, Via Dodecaneso 35, Genova, Italy

arXiv:1605.08882v1 [cs.LG] 28 May 2016

May 31, 2016

Abstract

We analyze the learning properties of the stochastic gradient method when multiple passes over the data and mini-batches are allowed. In particular, we consider the square loss and show that for a universal step-size choice, the number of passes acts as a regularization parameter, and optimal finite sample bounds can be achieved by early stopping. Moreover, we show that larger step-sizes are allowed when considering mini-batches. Our analysis is based on a unifying approach, encompassing both batch and stochastic gradient methods as special cases.

1 Introduction

Modern machine learning applications require computational approaches that are at the same time statistically accurate and numerically efficient [1]. This has motivated a recent interest in stochastic gradient methods (SGM), since on the one hand they enjoy good practical performance, especially in large-scale scenarios, and on the other hand they are amenable to theoretical analysis. In particular, unlike other learning approaches, such as empirical risk minimization or Tikhonov regularization, theoretical results on SGM naturally integrate statistical and computational aspects. Most generalization studies on SGM consider the case where only one pass over the data is allowed and the step-size is appropriately chosen [2, 3, 4, 5, 6, 7] (possibly considering averaging [8]). In particular, recent works show how the step-size can be seen to play the role of a regularization parameter whose choice controls the bias and variance properties of the obtained solution [4, 5, 6]. These latter works show that, by balancing these contributions, it is possible to derive a step-size choice leading to optimal learning bounds. Such a choice typically depends on some unknown properties of the data generating distribution, and in practice it can be made by cross-validation. While processing each data point only once is natural in streaming/online scenarios, in practice SGM is often used as a tool for processing large data-sets and multiple passes over the data are typically considered. In this case, the number of passes over the data, as well as the step-size, then need to be determined. While the role of multiple passes is well understood when the goal is empirical risk minimization [9], its effect on generalization is less clear, and a few works have recently started to tackle this question. In particular, results in this direction have been derived in [10] and [11]. The former work considers a general stochastic optimization

setting and studies stability properties of SGM, allowing one to derive convergence results as well as finite sample bounds. The latter work, restricted to supervised learning, further develops these results to compare the respective roles of step-size and number of passes, and shows how different parameter settings can lead to optimal error bounds. In particular, it shows that there are two extreme cases: either the step-size or the number of passes is fixed a priori, while the other one acts as a regularization parameter and needs to be chosen adaptively. The main shortcoming of these latter results is that they hold only in the worst case, in the sense that they do not consider the possible effect of capacity assumptions [12, 13], shown to lead to faster rates for other learning approaches such as Tikhonov regularization. Further, these results do not consider the possible effect of mini-batches, that is, of using more than a single point in each gradient step [14, 15, 16, 17]. This latter strategy is often considered especially for parallel implementations of SGM.

The study in this paper fills these gaps in the case where the loss function is the least squares loss. We consider a variant of SGM for least squares, where gradients are sampled uniformly at random and mini-batches are allowed. The number of passes, the step-size and the mini-batch size are then parameters to be determined. Our main results highlight the respective roles of these parameters and show how they can be chosen so that the corresponding solutions achieve optimal learning errors. In particular, we show for the first time that multi-pass SGM with early stopping and a universal step-size choice can achieve optimal learning bounds, matching those of ridge regression [18, 13]. Further, our analysis shows how the mini-batch size and the step-size choice are tightly related. Indeed, larger mini-batch sizes allow for larger step-sizes while keeping the optimal learning bounds. This result gives insight into how mini-batches can be exploited for parallel computations while preserving optimal statistical accuracy. Finally, we note that a recent work [19] is closely related to the analysis in this paper. The generalization properties of a multi-pass incremental gradient method are analyzed in [19], for a cyclic, rather than a stochastic, choice of the gradients and with no mini-batches. The analysis in this latter case appears to be harder, and the results in [19] give good learning bounds only in a restricted setting, considering the iterates rather than the excess risk. Compared to [19], our results show how stochasticity can be exploited to obtain faster, capacity-dependent rates, and they analyze the role of mini-batches.

The rest of this paper is organized as follows. Section 2 introduces the learning setting and the SGM algorithm. Main results, with discussions and proof sketches, are presented in Section 3. Finally, simple numerical simulations are given in Section 4 to complement our theoretical results.

Notation

For any a, b ∈ R, a ∨ b denotes the maximum of a and b. N is the set of all positive integers. For any T ∈ N, [T] denotes the set {1, · · · , T}. For any two positive sequences {a_t}_{t∈[T]} and {b_t}_{t∈[T]}, the notation a_t ≲ b_t for all t ∈ [T] means that there exists a positive constant C, independent of t, such that a_t ≤ C b_t for all t ∈ [T].

2 Learning with SGM

We begin by introducing the learning setting we consider, and then describe the SGM learning algorithm. Following [19], the formulation we consider is close to the setting of functional regression, and covers the reproducing kernel Hilbert space (RKHS) setting as a special case. In particular, it reduces to standard linear regression in finite dimensions.

2.1 Learning Problems

Let H be a separable Hilbert space, with inner product and induced norm denoted by ⟨·,·⟩_H and ‖·‖_H, respectively. Let the input space X ⊆ H and the output space Y ⊆ R. Let ρ be an unknown probability measure on Z = X × Y, ρ_X(·) the induced marginal measure on X, and ρ(·|x) the conditional probability measure on Y with respect to x ∈ X and ρ. Considering the square loss function, the problem under study is the minimization of the risk,

    inf_{ω∈H} E(ω),    E(ω) = ∫_{X×Y} (⟨ω, x⟩_H − y)² dρ(x, y),    (1)

when the measure ρ is known only through a sample z = {z_i = (x_i, y_i)}_{i=1}^m of size m ∈ N, independently and identically distributed (i.i.d.) according to ρ. In the following, we measure the quality of an approximate solution ω̂ ∈ H (an estimator) considering the excess risk, i.e.,

    E(ω̂) − inf_{ω∈H} E(ω).    (2)

Throughout this paper, we assume that there exists a constant κ ∈ [1, ∞[ such that

    ⟨x, x′⟩_H ≤ κ²,    ∀x, x′ ∈ X.    (3)

2.2 Stochastic Gradient Method

We study the following SGM (with mini-batches).

Algorithm 1. Let b ∈ [m]. Given any sample z, the b-minibatch stochastic gradient method is defined by ω_1 = 0 and

    ω_{t+1} = ω_t − η_t (1/b) Σ_{i=b(t−1)+1}^{bt} (⟨ω_t, x_{j_i}⟩_H − y_{j_i}) x_{j_i},    t = 1, . . . , T,    (4)

where {η_t > 0} is a step-size sequence. Here, j_1, j_2, · · · , j_{bT} are independent and identically distributed (i.i.d.) random variables from the uniform distribution on [m].¹

Different choices for the (mini-)batch size b can lead to different algorithms. In particular, for b = 1, the above algorithm corresponds to a simple SGM, while for b = m, it is a stochastic version of the batch gradient descent. The aim of this paper is to derive excess risk bounds for the above algorithm under appropriate assumptions. Throughout this paper, we assume that {η_t}_t is non-increasing, and T ∈ N with T ≥ 3. We denote by J_t the set {j_l : l = b(t − 1) + 1, · · · , bt} and by J the set {j_l : l = 1, · · · , bT}.
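To make the update (4) concrete, the following minimal NumPy sketch implements the b-minibatch iteration in the finite-dimensional case H = R^d, so that ⟨ω, x⟩_H is an ordinary dot product. It is only an illustration of Algorithm 1, not the implementation used for the experiments in Section 4; the function and variable names are ours.

```python
import numpy as np

def minibatch_sgm(X, y, b, step_sizes, T, rng=None):
    """Sketch of Algorithm 1 for H = R^d (linear least squares).

    X: (m, d) inputs, y: (m,) outputs, b: mini-batch size,
    step_sizes: sequence (eta_1, ..., eta_T). Returns omega_1, ..., omega_{T+1}.
    """
    rng = np.random.default_rng() if rng is None else rng
    m, d = X.shape
    omega = np.zeros(d)                      # omega_1 = 0
    iterates = [omega.copy()]
    for t in range(T):
        # draw b indices i.i.d. uniformly from [m] (with replacement)
        J_t = rng.integers(0, m, size=b)
        residuals = X[J_t] @ omega - y[J_t]  # <omega_t, x_{j_i}> - y_{j_i}
        grad = X[J_t].T @ residuals / b      # averaged stochastic gradient
        omega = omega - step_sizes[t] * grad
        iterates.append(omega.copy())
    return iterates
```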

3 Main Results with Discussions

In this section, we first state some basic assumptions. Then, we present and discuss our main results.

¹ Note that the random variables j_1, · · · , j_{bT} are conditionally independent given the sample z.

3.1 Assumptions

We first make the following assumption.

Assumption 1. There exist constants M ∈]0, ∞[ and v ∈]1, ∞[ such that

    ∫_Y y^{2l} dρ(y|x) ≤ l! M^l v,    ∀l ∈ N,    (5)

ρ_X-almost surely.

Assumption (5) is related to a moment hypothesis on |y|². It is weaker than the often considered bounded output assumption, and is trivially verified in binary classification problems where Y = {−1, 1}.

To present our next assumption, we introduce the operator L : L²(H, ρ_X) → L²(H, ρ_X), defined by L(f) = ∫_X ⟨x, ·⟩_H f(x) dρ_X(x). Under Assumption (3), L can be proved to be a positive trace class operator, and hence L^ζ with ζ ∈ R can be defined by using spectral theory [20]. The Hilbert space of square-integrable functions from H to R with respect to ρ_X, with norm given by ‖f‖_ρ = (∫_X |f(x)|² dρ_X(x))^{1/2}, is denoted by (L²(H, ρ_X), ‖·‖_ρ). It is well known that the function minimizing ∫_Z (f(x) − y)² dρ(z) over all measurable functions f : H → R is the regression function, given by

    f_ρ(x) = ∫_Y y dρ(y|x),    x ∈ X.    (6)

Define another Hilbert space H_ρ = {f : X → R | ∃ω ∈ H with f(x) = ⟨ω, x⟩_H, ρ_X-almost surely}. Under Assumption (3), it is easy to see that H_ρ is a subspace of L²(H, ρ_X). Let f_H be the projection of the regression function f_ρ onto the closure of H_ρ in L²(H, ρ_X). It is easy to see that the search for a solution of Problem (1) is equivalent to the search of a function in H_ρ approximating f_H. From this point of view, bounds on the excess risk of a learning algorithm naturally depend on the following assumption, which quantifies how well the target function f_H can be approximated by H_ρ.

Assumption 2. There exist ζ > 0 and R > 0 such that ‖L^{−ζ} f_H‖_ρ ≤ R.

The above assumption is fairly standard [20, 19] in non-parametric regression. The bigger ζ is, the more stringent the assumption is, since L^{ζ_1}(L²(H, ρ_X)) ⊆ L^{ζ_2}(L²(H, ρ_X)) when ζ_1 ≥ ζ_2. In particular, for ζ = 0 we are assuming ‖f_H‖_ρ < ∞, while for ζ = 1/2 we are requiring f_H ∈ H_ρ, since [21, 19] H_ρ = L^{1/2}(L²(H, ρ_X)).

Finally, the last assumption relates to the capacity of the hypothesis space.

Assumption 3. For some γ ∈]0, 1] and c_γ > 0, L satisfies

    tr(L(L + λI)^{−1}) ≤ c_γ λ^{−γ},    for all λ > 0.    (7)

The left-hand side of (7) is the so-called effective dimension, or the degrees of freedom [12, 13]. It can be related to covering/entropy number conditions; see [21] for further details. Assumption 3 is always true for γ = 1 and c_γ = κ², since L is a trace class operator, which implies that its eigenvalues, denoted by σ_i, satisfy tr(L) = Σ_i σ_i ≤ κ². This is referred to as the capacity-independent setting. A smaller γ in Assumption 3 allows one to derive better error rates. It is satisfied, for example, if the eigenvalues of L satisfy a polynomial decay condition σ_i ∼ i^{−1/γ}, or with γ = 0 if L is finite rank.

3.2 Main Results

We start with the following corollary, which is a simplified version of our main results stated next.

Corollary 3.1. Under Assumptions 2 and 3, let ζ ≥ 1/2 and |y| ≤ M almost surely for some M > 0. Consider the SGM with

1) p* = ⌈m^{1/(2ζ+γ)}⌉, b = 1, η_t ≃ 1/m for all t ∈ [p*m], and ω̃_{p*} = ω_{p*m+1}.

If m is large enough, with high probability², there holds

    E_J[E(ω̃_{p*})] − inf_{ω∈H} E ≲ m^{−2ζ/(2ζ+γ)}.

Furthermore, the above also holds for the SGM with³

2) p* = ⌈m^{1/(2ζ+γ)}⌉, b = √m, η_t ≃ 1/√m for all t ∈ [p*√m], and ω̃_{p*} = ω_{p*√m+1}.

In the above, p* is the number of 'passes' over the data, which is defined as ⌈bt/m⌉ at t iterations.

² Here, 'high probability' refers to the sample z.
³ Here, we assume that √m is an integer.

The above result asserts that, at p* passes over the data, the simple SGM with a fixed step-size achieves optimal learning error bounds, matching those of ridge regression [13]. Furthermore, using mini-batches allows one to use a larger step-size while achieving the same optimal error bounds.

Our main theorem of this paper is stated next, and provides error bounds for the studied algorithm. For the sake of readability, we only consider the case ζ ≥ 1/2 in a fixed step-size setting. General results in a more general setting (η_t = η_1 t^{−θ} with 0 ≤ θ < 1, and/or the case ζ ∈]0, 1/2]) can be found in the appendix.

Theorem 3.2. Under Assumptions 1, 2 and 3, let ζ ≥ 1/2, δ ∈]0, 1[, and η_t = ηκ^{−2} for all t ∈ [T], with η ≤ 1/(8(log T + 1)). If m ≥ m_δ, then the following holds with probability at least 1 − δ: for all t ∈ [T],

    E_J[E(ω_{t+1})] − inf_{ω∈H} E ≤ q_1 (ηt)^{−2ζ} + q_2 m^{−2ζ/(2ζ+γ)} (1 + m^{−1/(2ζ+γ)} ηt)² log² T log²(1/δ)
                                    + q_3 η b^{−1} (1 ∨ m^{−1/(2ζ+γ)} ηt) log T.    (8)

Here, m_δ, q_1, q_2 and q_3 are positive constants depending on κ², ‖T‖, M, v, ζ, R, c_γ and γ, with m_δ also depending on δ (they will be given explicitly in the proof).

There are three terms in the upper bound of (8). The first term depends on the regularity of the target function and arises from bounding the bias, while the last two terms result from estimating the sample variance and the computational variance (due to the random choice of the points), respectively. To derive optimal rates, it is necessary to balance these three terms. Solving this trade-off problem leads to different choices of η, T and b, corresponding to different regularization strategies, as shown in the subsequent corollaries.

The first corollary gives generalization error bounds for SGM with a universal step-size depending on the number of sample points.

Corollary 3.3. Under Assumptions 1, 2 and 3, let ζ ≥ 1/2, b = 1 and η_t ≃ 1/m for all t ∈ [T], where T ≤ m². If m ≥ m_0, then with probability at least 1 − 1/m, there holds

    E_J[E(ω_{t+1})] − inf_{ω∈H} E ≲ ( (m/t)^{2ζ} + m^{−(2ζ+2)/(2ζ+γ)} (t/m)² ) · log⁴ m,    ∀t ∈ [T],    (9)

and in particular,

    E_J[E(ω_{T*+1})] − inf_{ω∈H} E ≲ m^{−2ζ/(2ζ+γ)} log⁴ m,    (10)

where T* = ⌈m^{(2ζ+γ+1)/(2ζ+γ)}⌉. Here, m_0 is a positive integer depending only on κ, ‖T‖, ζ and γ, and it will be given explicitly in the proof.

Remark 3.4. Ignoring the logarithmic term and letting t = pm, Eq. (9) becomes

    E_J[E(ω_{pm+1})] − inf_{ω∈H} E ≲ p^{−2ζ} + m^{−(2ζ+2)/(2ζ+γ)} p².

A smaller p may lead to a larger bias, while a larger p may lead to a larger sample error. From this point of view, p has a regularization effect.

The second corollary provides error bounds for SGM with a fixed mini-batch size and a fixed step-size (which depend on the number of sample points).

Corollary 3.5. Under Assumptions 1, 2 and 3, let ζ ≥ 1/2, b = ⌈√m⌉ and η_t ≃ 1/√m for all t ∈ [T], where T ≤ m². If m ≥ m_0, then with probability at least 1 − 1/m, there holds

    E_J[E(ω_{t+1})] − inf_{ω∈H} E ≲ ( (√m/t)^{2ζ} + m^{−(2ζ+2)/(2ζ+γ)} (t/√m)² ) log⁴ m,    ∀t ∈ [T],    (11)

and particularly,

    E_J[E(ω_{T*+1})] − inf_{ω∈H} E ≲ m^{−2ζ/(2ζ+γ)} log⁴ m,    (12)

where T* = ⌈m^{1/(2ζ+γ)+1/2}⌉.

The above two corollaries follow from Theorem 3.2 with the simple observation that, when a small step-size is chosen, the dominating terms in (8) are the terms related to the bias and the sample variance. The only free parameter in (9) and (11) is the number of iterations/passes. The ideal stopping rule is achieved by balancing the two terms related to the bias and the sample variance, showing the regularization effect of the number of passes. Since the ideal stopping rule depends on the unknown parameters ζ and γ, a hold-out cross-validation procedure is often used to tune the stopping rule in practice. Using an argument similar to that in Chapter 6 of [21], it is possible to show that this procedure achieves the same convergence rate.

We give some further remarks. First, the upper bound in (10) is optimal up to a logarithmic factor, in the sense that it matches the minimax lower rate in [13]. Second, according to Corollaries 3.3 and 3.5,

bT*/m ≃ m^{1/(2ζ+γ)} passes over the data are needed to obtain optimal rates in both cases. Finally, in comparing the simple SGM and the mini-batch SGM, Corollaries 3.3 and 3.5 show that a larger step-size can be used for the latter.

In the next result, both the step-size and the stopping rule are tuned to obtain optimal rates for simple SGM with multiple passes. In this case, the step-size and the number of iterations are the regularization parameters.

Corollary 3.6. Under Assumptions 1, 2 and 3, let ζ ≥ 1/2, b = 1 and η_t ≃ m^{−2ζ/(2ζ+γ)} for all t ∈ [T], where T ≤ m². If m ≥ m_0 and T* = ⌈m^{(2ζ+1)/(2ζ+γ)}⌉, then (10) holds with probability at least 1 − 1/m.

Remark 3.7. If we make no assumption on the capacity, i.e., γ = 1, Corollary 3.6 recovers the result in [4] for one-pass SGM.

The next corollary shows that for suitable mini-batch sizes, optimal rates can be achieved with a constant step-size (which is nearly independent of the number of sample points) by early stopping.

Corollary 3.8. Under Assumptions 1, 2 and 3, let ζ ≥ 1/2, b = ⌈m^{2ζ/(2ζ+γ)}⌉ and η_t ≃ 1/log m for all t ∈ [T], where T ≤ m². If m ≥ m_0 and T* = ⌈m^{1/(2ζ+γ)}⌉, then (10) holds with probability at least 1 − 1/m.

According to Corollaries 3.6 and 3.8, around m^{(1−γ)/(2ζ+γ)} passes over the data are needed to achieve the best performance with the above two strategies. In comparison with Corollaries 3.3 and 3.5, where around m^{1/(2ζ+γ)} passes are required, the strategies of Corollaries 3.6 and 3.8 seem to require fewer passes over the data. However, in this case, one might have to run the algorithms multiple times to tune the step-size or the mini-batch size.

Finally, the last result gives generalization error bounds for 'batch' SGM with a constant step-size (nearly independent of the number of sample points).

Corollary 3.9. Under Assumptions 1, 2 and 3, let ζ ≥ 1/2, b = m and η_t ≃ 1/log m for all t ∈ [T], where T ≤ m². If m ≥ m_0 and T* = ⌈m^{1/(2ζ+γ)}⌉, then (10) holds with probability at least 1 − 1/m.

As will be seen in the proof in the appendix, the above result also holds when the sequence {ω_t} is replaced by the sequence {ν_t}_t generated by the real batch GM in (14). In this sense, we study the gradient-based learning algorithms simultaneously.
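For reference, the following helper collects the parameter choices (mini-batch size b, step-size, stopping iteration T*) suggested by Corollaries 3.3, 3.5, 3.6, 3.8 and 3.9, together with the implied number of passes bT*/m. Only the scalings are reproduced; the unspecified numerical constants behind the '≃' notation are set to 1, so this is an illustration of the rates rather than a prescription.

```python
import math

def sgm_parameters(m, zeta, gamma, scheme="cor3.3"):
    """Scaling of (b, eta, T*, passes) from Corollaries 3.3, 3.5, 3.6, 3.8, 3.9 (constants ignored)."""
    if scheme == "cor3.3":    # simple SGM, universal step-size ~ 1/m
        b, eta = 1, 1.0 / m
        T_star = math.ceil(m ** ((2 * zeta + gamma + 1) / (2 * zeta + gamma)))
    elif scheme == "cor3.5":  # mini-batch SGM, universal step-size ~ 1/sqrt(m)
        b, eta = math.ceil(math.sqrt(m)), 1.0 / math.sqrt(m)
        T_star = math.ceil(m ** (1 / (2 * zeta + gamma) + 0.5))
    elif scheme == "cor3.6":  # simple SGM, tuned step-size
        b, eta = 1, m ** (-2 * zeta / (2 * zeta + gamma))
        T_star = math.ceil(m ** ((2 * zeta + 1) / (2 * zeta + gamma)))
    elif scheme == "cor3.8":  # mini-batch SGM, nearly constant step-size
        b, eta = math.ceil(m ** (2 * zeta / (2 * zeta + gamma))), 1.0 / math.log(m)
        T_star = math.ceil(m ** (1 / (2 * zeta + gamma)))
    else:                     # "cor3.9": batch-size-m SGM, nearly constant step-size
        b, eta = m, 1.0 / math.log(m)
        T_star = math.ceil(m ** (1 / (2 * zeta + gamma)))
    return b, eta, T_star, b * T_star / m

print(sgm_parameters(10_000, zeta=0.5, gamma=1.0, scheme="cor3.3"))
```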

3.3 Discussions

We compare our results with previous works. For non-parametric regression with the square loss, one-pass SGM has been studied in, e.g., [4, 22, 5, 6]. In particular, [4] proved a capacity-independent rate of order O(m^{−2ζ/(2ζ+1)} log m) with a fixed step-size η ≃ m^{−2ζ/(2ζ+1)}, and [6] derived capacity-dependent error bounds of order O(m^{−2min(ζ,1)/(2min(ζ,1)+γ)}) (when 2ζ + γ > 1) for the average. Note also that a regularized version of SGM has been studied in [5], where the derived convergence rate is of order O(m^{−2ζ/(2ζ+1)}), assuming that ζ ∈ [1/2, 1]. In comparison with these existing convergence rates, our rates from (10) are comparable, either involving the capacity condition or allowing a broader range of the regularity parameter ζ (which thus improves the rates).

More recently, [19] studied multiple-pass SGM with a fixed ordering at each pass, also called the incremental gradient method. Making no assumption on the capacity, rates of order O(m^{−ζ/(ζ+1)}) (in L²(H, ρ_X)-norm) with a universal step-size η ≃ 1/m are derived. In comparison, Corollary 3.3 achieves better rates, while also taking the capacity assumption into account. Note also that [19] proved sharp rates in H-norm for ζ ≥ 1/2 in the capacity-independent case. In fact, we can extend our analysis to the H-norm for Algorithm 1; we postpone this extension to a longer version of this paper.

The idea of using mini-batches (and parallel implementations) to speed up SGM in a general stochastic optimization setting can be found, e.g., in [14, 15, 16, 17]. Our theoretical findings, especially the interplay between the mini-batch size and the step-size, can give further insight into parallelized learning. Besides, it has been shown in [23, 15] that for one-pass mini-batch SGM with a fixed step-size η ≃ b/√m and a smooth loss function, assuming the existence of at least one solution in the hypothesis space for the expected risk minimization, the convergence rate is of order O(√(1/m) + b/m) when considering an averaging scheme. Adapted to the learning setting we consider, this means that if f_H ∈ H_ρ, i.e., ζ = 1/2, the convergence rate for the average is O(√(1/m) + b/m). Note that f_H does not necessarily belong to H_ρ in general. Also, our derived convergence rate from Corollary 3.5 is better when the regularity parameter ζ is greater than 1/2 or γ is smaller than 1.

3.4 Error Decomposition

The key to our proof is a novel error decomposition, which may also be used in analysing other learning algorithms. We first introduce two sequences. The population iteration is defined by μ_1 = 0 and

    μ_{t+1} = μ_t − η_t ∫_X (⟨μ_t, x⟩_H − f_ρ(x)) x dρ_X(x),    t = 1, . . . , T.    (13)

The above iterative procedure is ideal and cannot be implemented in practice, since the distribution ρ_X is unknown in general. Replacing ρ_X by the empirical measure and f_ρ(x_i) by y_i, we derive the sample iteration (associated with the sample z), i.e., ν_1 = 0 and

    ν_{t+1} = ν_t − η_t (1/m) Σ_{i=1}^m (⟨ν_t, x_i⟩_H − y_i) x_i,    t = 1, . . . , T.    (14)

Clearly, μ_t is deterministic and ν_t is an H-valued random variable depending on z. Given the sample z, the sequence {ν_t}_t has a natural relationship with the learning sequence {ω_t}_t, since

    E_J[ω_t] = ν_t.    (15)

Indeed, taking the expectation with respect to J_t on both sides of (4), and noting that ω_t depends only on J_1, · · · , J_{t−1} (given any z), one has

    E_{J_t}[ω_{t+1}] = ω_t − η_t (1/m) Σ_{i=1}^m (⟨ω_t, x_i⟩_H − y_i) x_i,

and thus,

    E_J[ω_{t+1}] = E_J[ω_t] − η_t (1/m) Σ_{i=1}^m (⟨E_J[ω_t], x_i⟩_H − y_i) x_i,    t = 1, . . . , T,

which satisfies the iterative relationship given in (14). By an induction argument, (15) can then be proved.

Let S_ρ : H → L²(H, ρ_X) be the linear map defined by (S_ρ ω)(x) = ⟨ω, x⟩_H, ∀ω, x ∈ H. We have the following error decomposition.

Proposition 3.10. We have

    E_J[E(ω_t)] − inf_{f∈H} E(f) ≤ 2‖S_ρ μ_t − f_H‖²_ρ + 2‖S_ρ ν_t − S_ρ μ_t‖²_ρ + E_J[‖S_ρ ω_t − S_ρ ν_t‖²_ρ].    (16)

Proof. For any ω ∈ H, we have [21, 19]

    E(ω) − inf_{f∈H} E(f) = ‖S_ρ ω − f_H‖²_ρ.    (17)

Thus, E(ω_t) − inf_{f∈H} E(f) = ‖S_ρ ω_t − f_H‖²_ρ, and

    E_J[‖S_ρ ω_t − f_H‖²_ρ] = E_J[‖S_ρ ω_t − S_ρ ν_t + S_ρ ν_t − f_H‖²_ρ]
                           = E_J[‖S_ρ ω_t − S_ρ ν_t‖²_ρ + ‖S_ρ ν_t − f_H‖²_ρ] + 2 E_J⟨S_ρ ω_t − S_ρ ν_t, S_ρ ν_t − f_H⟩_ρ.

Applying (15) to the above, we get

    E_J[‖S_ρ ω_t − f_H‖²_ρ] = E_J[‖S_ρ ω_t − S_ρ ν_t‖²_ρ + ‖S_ρ ν_t − f_H‖²_ρ].

Now the proof can be finished by considering

    ‖S_ρ ν_t − f_H‖²_ρ = ‖S_ρ ν_t − S_ρ μ_t + S_ρ μ_t − f_H‖²_ρ ≤ 2‖S_ρ ν_t − S_ρ μ_t‖²_ρ + 2‖S_ρ μ_t − f_H‖²_ρ.

There are three terms in the upper bound of the error decomposition (16). We refer to the deterministic term ‖S_ρ μ_t − f_H‖²_ρ as the bias, to the term ‖S_ρ ν_t − S_ρ μ_t‖²_ρ, which depends on z, as the sample variance, and to E_J[‖S_ρ ω_t − S_ρ ν_t‖²_ρ] as the computational variance. These three terms will be estimated in the appendix; see Proposition B.2, Theorem C.6 and Theorem D.9. The bound in Theorem 3.2 then follows by plugging these estimates into the error decomposition.
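The decomposition (16) and the identity (15) can be checked numerically in the finite-dimensional linear case, where the population quantities T = E[x ⊗ x] and S_ρ* f_ρ are available in closed form. The sketch below uses a synthetic Gaussian design (dimensions, covariance and noise level are our choices, not taken from the paper) and approximates E_J by averaging 50 independent SGM runs.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, T_iter, eta, n_runs = 5, 200, 300, 0.05, 50
Sigma = np.diag(1.0 / np.arange(1, d + 1))       # population covariance T = E[x x^T]
w_star = rng.standard_normal(d)                   # target: f_H(x) = <w_star, x>

X = rng.multivariate_normal(np.zeros(d), Sigma, size=m)
y = X @ w_star + 0.1 * rng.standard_normal(m)     # outputs with zero-mean noise

Tx, Sxy = X.T @ X / m, X.T @ y / m                # empirical covariance T_x and S_x^* y
Srho_frho = Sigma @ w_star                        # S_rho^* f_rho (noise has zero mean)

def rho_norm_sq(v):                               # ||S_rho v||_rho^2 = <v, T v>
    return float(v @ Sigma @ v)

mu, nu = np.zeros(d), np.zeros(d)
omegas = np.zeros((n_runs, d))                    # independent SGM runs, to approximate E_J
for t in range(T_iter):
    mu = mu - eta * (Sigma @ mu - Srho_frho)      # population iteration (13)/(19)
    nu = nu - eta * (Tx @ nu - Sxy)               # sample iteration (14)/(20), i.e. batch GM
    for r in range(n_runs):                       # simple SGM (b = 1), Algorithm 1
        j = rng.integers(m)
        omegas[r] -= eta * (X[j] @ omegas[r] - y[j]) * X[j]

print("||mean_J omega_t - nu_t|| =", np.linalg.norm(omegas.mean(axis=0) - nu))  # checks (15) approximately
print("bias                =", rho_norm_sq(mu - w_star))     # ||S_rho mu_t - f_H||_rho^2
print("sample variance     =", rho_norm_sq(nu - mu))         # ||S_rho nu_t - S_rho mu_t||_rho^2
print("computational var.  =", np.mean([rho_norm_sq(w - nu) for w in omegas]))
```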

4 Numerical Simulations

[Figure 1: Error decompositions for gradient-based learning algorithms on synthetic data, where m = 100. Panels: (a) Minibatch SGM, (b) SGM, (c) Batch GM. Each panel plots the bias, the sample error, the computational error (for the stochastic methods) and the total error against the number of passes (0 to 200).]

[Figure 2: Misclassification errors for gradient-based learning algorithms on the BreastCancer dataset. Panels: (a) Minibatch SGM, (b) SGM, (c) Batch GM. Each panel plots the training and validation errors against the number of passes (up to 2 × 10⁴).]

In order to illustrate our theoretical results and the error decomposition, we first performed some simulations on a simple problem. We constructed m = 100 i.i.d. training examples of the form y_i = f_ρ(x_i) + ε_i. Here, the regression function is f_ρ(x) = |x − 1/2| − 1/2, the input point x_i is uniformly distributed in [0, 1], and ε_i is a Gaussian noise with zero mean and standard deviation 1, for each i ∈ [m]. We perform three experiments with the same H, an RKHS associated with a Gaussian kernel K(x, x′) = exp(−(x − x′)²/(2σ²)) where σ = 0.2. In the first experiment, we run mini-batch SGM, where the mini-batch size is b = √m and the step-size is η_t = 1/(8√m). In the second experiment, we run simple SGM where the step-size is fixed as η_t = 1/(8m), while in the third experiment, we run batch GM using the fixed step-size η_t = 1/8. For each experiment,

we run the algorithm 50 times. For mini-batch SGM and SGM, the total error ‖S_ρ ω_t − f_ρ‖²_{L²_ρ̂}, the bias ‖S_ρ μ̂_t − f_ρ‖²_{L²_ρ̂}, the sample variance ‖S_ρ ν_t − S_ρ μ̂_t‖²_{L²_ρ̂} and the computational variance ‖S_ρ ω_t − S_ρ ν_t‖²_{L²_ρ̂}, averaged over 50 trials, are depicted in Figures 1a and 1b, respectively. For batch GM, the total error ‖S_ρ ν_t − f_ρ‖²_{L²_ρ̂}, the bias ‖S_ρ μ̂_t − f_ρ‖²_{L²_ρ̂} and the sample variance ‖S_ρ ν_t − S_ρ μ̂_t‖²_{L²_ρ̂}, averaged over 50 trials, are depicted in Figure 1c. Here, we replace the unknown marginal distribution ρ_X by an empirical measure ρ̂ = (1/2000) Σ_{i=1}^{2000} δ_{x̂_i}, where each x̂_i is uniformly distributed in [0, 1]. From Figure 1a or 1b, we see that as the number of passes increases⁴, the bias decreases, while the sample error increases. Furthermore, we see that, compared with the bias and the sample error, the computational error is negligible. In all these experiments, the minimal total error is achieved when the bias and the sample error are balanced. These empirical results show the effects of the three terms from the error decomposition, and complement the derived bound (8), as well as the regularization effect of the number of passes over the data.

Finally, we tested the simple SGM, mini-batch SGM, and batch GM, using step-sizes similar to those in the first simulation, on the BreastCancer data-set [24]. The classification errors on the training set and the testing set of these three algorithms are depicted in Figure 2. We see that all of these algorithms perform similarly, which complements the bounds in Corollaries 3.3, 3.5 and 3.9.

⁴ Note that the terminology 'running the algorithm with p passes' means 'running the algorithm with ⌈mp/b⌉ iterations', where b is the mini-batch size.
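A minimal sketch of the first experiment (mini-batch SGM with the Gaussian kernel) is given below. In the RKHS the iterate is represented by its coefficients on the training points, f_t = Σ_i α_i K(x_i, ·). For brevity only the total error, measured on an empirical measure of 2000 uniform points, is tracked; the 50-trial averaging and the bias/variance decomposition reported in Figure 1 are omitted, so this only illustrates the setup.

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma, n_pass = 100, 0.2, 50
x = rng.uniform(0, 1, m)
f_rho = lambda s: np.abs(s - 0.5) - 0.5
y = f_rho(x) + rng.standard_normal(m)             # Gaussian noise with standard deviation 1

K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))   # kernel matrix on the sample

b = int(np.sqrt(m))                               # mini-batch size of the first experiment
eta = 1.0 / (8 * np.sqrt(m))                      # step-size 1/(8 sqrt(m))
alpha = np.zeros(m)                               # f_t = sum_i alpha_i K(x_i, .)
x_test = rng.uniform(0, 1, 2000)                  # empirical measure rho_hat on 2000 points
K_test = np.exp(-(x_test[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))

T_iter = int(np.ceil(n_pass * m / b))             # p passes = ceil(m p / b) iterations
for t in range(T_iter):
    J_t = rng.integers(0, m, size=b)              # i.i.d. uniform indices (with replacement)
    residuals = K[J_t] @ alpha - y[J_t]           # f_t(x_{j_i}) - y_{j_i}
    np.add.at(alpha, J_t, -eta * residuals / b)   # coefficient update for the sampled points
    if (t + 1) % (10 * m // b) == 0:              # report every 10 passes
        err = np.mean((K_test @ alpha - f_rho(x_test)) ** 2)
        print(f"~{(t + 1) * b / m:.0f} passes, total error {err:.4f}")
```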

Acknowledgments

This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216. L. R. acknowledges the financial support of the Italian Ministry of Education, University and Research FIRB project RBFR12M3AC.

References

[1] Olivier Bousquet and Léon Bottou. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pages 161–168, 2008.
[2] Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.
[3] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
[4] Yiming Ying and Massimiliano Pontil. Online gradient descent learning algorithms. Foundations of Computational Mathematics, 8(5):561–596, 2008.
[5] Pierre Tarrès and Yuan Yao. Online learning as stochastic approximation of regularization paths: Optimality and almost-sure convergence. IEEE Transactions on Information Theory, 60(9):5716–5735, 2014.
[6] Aymeric Dieuleveut and Francis Bach. Non-parametric stochastic approximation with large step sizes. arXiv preprint arXiv:1408.0361, 2014.
[7] Francesco Orabona. Simultaneous model selection and optimization through parameter-free stochastic learning. In Advances in Neural Information Processing Systems, pages 1116–1124, 2014.
[8] Boris T. Poljak. Introduction to Optimization. Optimization Software, 1987.
[9] Stephen Boyd and Almir Mutapcic. Stochastic subgradient methods. Notes for EE364b, Stanford University, Winter 2007.
[10] Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2016.
[11] Junhong Lin, Raffaello Camoriano, and Lorenzo Rosasco. Generalization properties and implicit regularization of multiple passes SGM. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
[12] Tong Zhang. Learning bounds for kernel regression using effective data dimensionality. Neural Computation, 17(9):2077–2098, 2005.
[13] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
[14] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.
[15] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. The Journal of Machine Learning Research, 13(1):165–202, 2012.
[16] Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright. Optimization for Machine Learning. MIT Press, 2012.
[17] Andrew Ng. Machine learning. Coursera, Stanford University, 2016.
[18] Steve Smale and Ding-Xuan Zhou. Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26(2):153–172, 2007.
[19] Lorenzo Rosasco and Silvia Villa. Learning with incremental iterative regularization. In Advances in Neural Information Processing Systems, pages 1621–1629, 2015.
[20] Felipe Cucker and Ding-Xuan Zhou. Learning Theory: An Approximation Theory Viewpoint, volume 24. Cambridge University Press, 2007.
[21] Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Science & Business Media, 2008.
[22] Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In Proceedings of the 30th International Conference on Machine Learning, pages 71–79, 2013.
[23] Andrew Cotter, Ohad Shamir, Nati Srebro, and Karthik Sridharan. Better mini-batch algorithms via accelerated gradient methods. In Advances in Neural Information Processing Systems, pages 1647–1655, 2011.
[24] https://archive.ics.uci.edu/ml/datasets/.
[25] I. F. Pinelis and A. I. Sakhanenko. Remarks on inequalities for large deviation probabilities. Theory of Probability & Its Applications, 30(1):143–148, 1986.
[26] Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315, 2007.
[27] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström computational regularization. In Advances in Neural Information Processing Systems, pages 1648–1656, 2015.
[28] Joel A. Tropp. User-friendly tools for random matrices: An introduction. Technical report, DTIC Document, 2012.
[29] Stanislav Minsker. On some extensions of Bernstein's inequality for self-adjoint operators. arXiv preprint arXiv:1112.5448, 2011.
[30] Junhong Lin, Lorenzo Rosasco, and Ding-Xuan Zhou. Iterative regularization for learning with convex loss functions. The Journal of Machine Learning Research, to appear, 2016.

A Preliminary

A.1 Notation

We first introduce some notation. For t ∈ N, Π^T_{t+1}(L) = Π_{k=t+1}^T (I − η_k L) for t ∈ [T − 1] and Π^T_{T+1}(L) = I, for any operator L : H → H, where H is a Hilbert space and I denotes the identity operator on H. E[ξ] denotes the expectation of a random variable ξ. For a given bounded operator L : L²(H, ρ_X) → H, ‖L‖ denotes the operator norm of L, i.e., ‖L‖ = sup_{f∈L²(H,ρ_X), ‖f‖_ρ=1} ‖Lf‖_H. We will use the conventional conventions for sums and products: Σ_{i=t+1}^t = 0 and Π_{i=t+1}^t = 1.

We next introduce some auxiliary operators. Let S_ρ : H → L²(H, ρ_X) be the linear map ω → ⟨ω, ·⟩_H, which is bounded by κ under Assumption (3). Furthermore, we consider the adjoint operator S_ρ* : L²(H, ρ_X) → H, the covariance operator T : H → H given by T = S_ρ* S_ρ, and the operator L : L²(H, ρ_X) → L²(H, ρ_X) given by S_ρ S_ρ*. It can be easily proved that S_ρ* g = ∫_X x g(x) dρ_X(x) and T = ∫_X ⟨·, x⟩_H x dρ_X(x). The operators T and L can be proved to be positive trace class operators (and hence compact). For any ω ∈ H, it is easy to prove the following isometry property [21]:

    ‖S_ρ ω‖_ρ = ‖√T ω‖_H.    (18)

We define the sampling operator S_x : H → R^m by (S_x ω)_i = ⟨ω, x_i⟩_H, i ∈ [m], where R^m is endowed with the inner product ⟨y, y′⟩_{R^m} = (1/m) Σ_{i=1}^m y_i y′_i. Its adjoint operator S_x* : R^m → H, defined by ⟨S_x* y, ω⟩_H = ⟨y, S_x ω⟩_{R^m} for y ∈ R^m, is thus given by S_x* y = (1/m) Σ_{i=1}^m y_i x_i. Moreover, we can define the empirical covariance operator T_x : H → H such that T_x = S_x* S_x. Obviously,

    T_x = (1/m) Σ_{i=1}^m ⟨·, x_i⟩_H x_i.

With these notations, (13) and (14) can be rewritten as

    μ_{t+1} = μ_t − η_t (T μ_t − S_ρ* f_ρ),    t = 1, . . . , T,    (19)

and

    ν_{t+1} = ν_t − η_t (T_x ν_t − S_x* y),    t = 1, . . . , T,    (20)

respectively.

Using the projection theorem, one can prove that

    S_ρ* f_ρ = S_ρ* f_H.    (21)

Indeed, since f_H is the projection of the regression function f_ρ onto the closure of H_ρ in L²(H, ρ_X), the projection theorem gives

    ⟨f_H − f_ρ, S_ρ ω⟩_ρ = 0,    ∀ω ∈ H,

which can be written as

    ⟨S_ρ* f_H − S_ρ* f_ρ, ω⟩_H = 0,    ∀ω ∈ H,

and thus leads to (21).
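In the finite-dimensional case H = R^d, the operators S_x, S_x* and T_x introduced above are just (scaled) matrices, which makes the adjoint relation and the identity T_x = S_x* S_x easy to verify numerically. The following sketch, with arbitrary random data, is only meant to make the definitions concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 6, 3
X = rng.standard_normal((m, d))   # rows are the sample points x_i in H = R^d

# S_x omega = (<omega, x_i>)_i, with R^m carrying the inner product <u, v> = (1/m) sum u_i v_i.
S_x = X                           # as a matrix: S_x omega = X @ omega
S_x_adj = X.T / m                 # adjoint w.r.t. the scaled inner product: S_x^* y = (1/m) sum y_i x_i
T_x = S_x_adj @ S_x               # empirical covariance T_x = S_x^* S_x = (1/m) X^T X

omega, v = rng.standard_normal(d), rng.standard_normal(m)
lhs = np.dot(v, S_x @ omega) / m  # <v, S_x omega>_{R^m}
rhs = np.dot(S_x_adj @ v, omega)  # <S_x^* v, omega>_H
print(np.isclose(lhs, rhs), np.allclose(T_x, X.T @ X / m))
```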

A.2 Concentration Inequality

We need the following concentration result for Hilbert space valued random variables, used in Caponnetto and De Vito [13] and based on the results in Pinelis and Sakhanenko [25].

Lemma A.1. Let w_1, · · · , w_m be i.i.d. random variables in a Hilbert space with norm ‖·‖. Suppose that there are two positive constants B and σ² such that

    E[‖w_1 − E[w_1]‖^l] ≤ (1/2) l! B^{l−2} σ²,    ∀l ≥ 2.    (22)

Then for any 0 < δ < 1, the following holds with probability at least 1 − δ:

    ‖ (1/m) Σ_{k=1}^m w_k − E[w_1] ‖ ≤ 2 ( B/m + σ/√m ) log(2/δ).

In particular, (22) holds if

    ‖w_1‖ ≤ B/2 a.s.,    and    E[‖w_1‖²] ≤ σ².    (23)

A.3 Basic Estimates

Lemma A.2. Let θ ∈ [0, 1[ and t ∈ N. Then

    t^{1−θ}/2 ≤ Σ_{k=1}^t k^{−θ} ≤ t^{1−θ}/(1−θ).

Lemma A.2. Let θ ∈ [0, 1[, and t ∈ N. Then t

X t1−θ t1−θ ≤ k −θ ≤ . 2 1−θ k=1

Proof. Note that t X

k −θ ≤ 1 +

t Z X

u−θ du = 1 +

k−1

k=2

k=1

k

Z

t

u−θ du =

1

t1−θ − θ , 1−θ

which leads to the first part of the desired result. Similarly, t X k=1

k

−θ



t Z X k=1

k+1 −θ

u

t+1

Z

u−θ du =

du =

k

1

(t + 1)1−θ − 1 , 1−θ

and by mean value theorem, (t + 1)1−θ − 1 ≥ (1 − θ)t(t + 1)−θ ≥ (1 − θ)t1−θ /2. This proves the second part of the desired result. The proof is complete. Lemma A.3. Let θ ∈ R and t ∈ N. Then t X

k −θ ≤ tmax(1−θ,0) (1 + log t).

k=1

Proof. Note that t X

k −θ =

k=1

t X

k −1 k 1−θ ≤ tmax(1−θ,0)

k=1

t X

k −1 ,

k=1

and t X k=1

k −1 ≤ 1 +

t Z X k=2

k

u−1 du = 1 + log t.

k−1

14

Lemma A.4. Let q ∈ R and t ∈ N with t ≥ 3. Then t−1 X k=1

1 −q k ≤ 2t− min(q,1) (1 + log t). t−k

Proof. Note that t−1 X k=1

t−1

t−1

k=1

k=1

X 1 −q X k 1−q 1 k = ≤ tmax(1−q,0) , t−k (t − k)k (t − k)k

and that by Lemma A.3, t−1 X k=1

B

t−1

1 1X = (t − k)k t

k=1



1 1 + t−k k



t−1

2X1 2 = ≤ (1 + log t). t k t k=1

Bias

In this section, we develop upper bounds for the bias, i.e., kSρ µt − fH k2ρ . Towards this end, we introduce the following lemma, whose proof borrows idea from [4, 5]. Lemma B.1. Let L be a compact self-adjoint operator on a separable Hilbert space H. Assume that η1 kLk ≤ 1. Then for t ∈ N and any non-negative integer k ≤ t − 1, kΠtk+1 (L)Lζ k



ζ



e

Pt

.

ηj

j=k+1

(24)

Proof. Let {σ_i} be the sequence of eigenvalues of L. We have

    ‖Π^t_{k+1}(L) L^ζ‖ = sup_i Π_{l=k+1}^t (1 − η_l σ_i) σ_i^ζ.

Using the basic inequality

    1 + x ≤ e^x    for all x ≥ −1,    (25)

with η_l‖L‖ ≤ 1, we get

    ‖Π^t_{k+1}(L) L^ζ‖ ≤ sup_i exp{ −σ_i Σ_{l=k+1}^t η_l } σ_i^ζ ≤ sup_{x≥0} exp{ −x Σ_{l=k+1}^t η_l } x^ζ.

The maximum of the function g(x) = e^{−cx} x^ζ (with c > 0) over R_+ is achieved at x_max = ζ/c, and thus

    sup_{x≥0} e^{−cx} x^ζ = ( ζ/(ec) )^ζ.    (26)

Using this inequality, one can get the desired result (24).

With the above lemma and Lemma A.2 from the appendix, we can derive the following result for the bias.

Proposition B.2. Under Assumption 2, let η_1 κ² ≤ 1. Then, for any t ∈ N,

    ‖S_ρ μ_{t+1} − f_H‖_ρ ≤ R ( ζ / (2 Σ_{j=1}^t η_j) )^ζ.    (27)

In particular, if η_t = η t^{−θ} for all t ∈ N, with η ∈]0, κ^{−2}] and θ ∈ [0, 1[, then

    ‖S_ρ μ_{t+1} − f_H‖_ρ ≤ R ζ^ζ η^{−ζ} t^{(θ−1)ζ}.    (28)

Proof. The result is essentially proved in [26]; see also [19]. For the sake of completeness, we provide a proof here. Since μ_{t+1} is given by (19), using (21),

    μ_{t+1} = μ_t − η_t (T μ_t − S_ρ* f_H).    (29)

Thus,

    S_ρ μ_{t+1} = S_ρ μ_t − η_t S_ρ (T μ_t − S_ρ* f_H) = S_ρ μ_t − η_t L (S_ρ μ_t − f_H).    (30)

Subtracting f_H from both sides,

    S_ρ μ_{t+1} − f_H = (I − η_t L)(S_ρ μ_t − f_H).

Using this equality iteratively, with μ_1 = 0,

    S_ρ μ_{t+1} − f_H = −Π^t_1(L) f_H.

Taking the L²(H, ρ_X)-norm, by Assumption 2,

    ‖S_ρ μ_{t+1} − f_H‖_ρ = ‖Π^t_1(L) f_H‖_ρ ≤ ‖Π^t_1(L) L^ζ‖ R.

By applying Lemma B.1, we get (27). Combining (27) with Lemma A.2, we get (28). The proof is complete.

The following lemma gives upper bounds for the sequence {μ_t}_{t∈N} in H-norm. It will be used for the estimation of the sample variance in the next section.

Lemma B.3. Under Assumption 2, the following holds for all t ∈ N:

1) If ζ ≥ 1/2,

    ‖μ_t‖_H ≤ R κ^{2ζ−1}.    (31)

2) If ζ ∈]0, 1/2],

    ‖μ_t‖_H ≤ κ^{2ζ−1} ∨ ( Σ_{k=1}^t η_k )^{1/2−ζ}.    (32)

Proof. The proof for a fixed step-size can be found in [19]. Following from (29), we have

    μ_{t+1} = (I − η_t T) μ_t + η_t S_ρ* f_H.

Applying this relationship iteratively, with μ_1 = 0, we get

    μ_{t+1} = Σ_{k=1}^t η_k Π^t_{k+1}(T) S_ρ* f_H = Σ_{k=1}^t η_k S_ρ* Π^t_{k+1}(L) f_H.

Therefore, using Assumption 2 and spectral theory,

    ‖μ_{t+1}‖_H ≤ ‖ Σ_{k=1}^t η_k S_ρ* Π^t_{k+1}(L) L^ζ ‖ R ≤ R max_{σ∈]0,κ²]} σ^{1/2+ζ} Σ_{k=1}^t η_k Π^t_{k+1}(σ).

If ζ ≥ 1/2, for any σ ∈]0, κ²],

    σ^{1/2+ζ} Σ_{k=1}^t η_k Π^t_{k+1}(σ) ≤ κ^{2ζ−1} σ Σ_{k=1}^t η_k Π^t_{k+1}(σ) ≤ κ^{2ζ−1},

where for the last inequality we used

    Σ_{k=1}^t η_k σ Π^t_{k+1}(σ) = Σ_{k=1}^t (1 − (1 − η_k σ)) Π^t_{k+1}(σ) = Σ_{k=1}^t Π^t_{k+1}(σ) − Σ_{k=1}^t Π^t_k(σ) = 1 − Π^t_1(σ).

Thus, ‖μ_{t+1}‖_H ≤ R κ^{2ζ−1}. The case ζ ∈]0, 1/2] is similar to that in [19]; we omit it. The proof is complete.

Sample Variance

In this section, we aim to estimate the sample variance, i.e., E[kSρ µt − Sρ νt k2ρ ]. Towards this end, we need some preliminary analysis. We first introduce the following key inequality, which provides the hinge idea on estimating E[kSρ µt − Sρ νt k2ρ ]. Lemma C.1. For all t ∈ [T ], we have kSρ νt+1 − Sρ µt+1 kρ ≤

t X

1

ηk T 2 Πtk+1 (Tx )Nk ,

(33)

H

k=1

where Nk = (T µk − Sρ∗ fρ ) − (Tx µk − Sx∗ y),

∀k ∈ [T ].

(34)

Proof. Since νt+1 and µt+1 are given by (20) and (19), respectively, νt+1 − µt+1

 νt − µt + ηt (T µt − Sρ∗ fρ ) − (Tx νt − Sx∗ y)  (I − ηt Tx )(νt − µt ) + ηt (T µt − Sρ∗ fρ ) − (Tx µt − Sx∗ y) ,

= =

which is exactly νt+1 − µt+1 = (I − ηt Tx )(νt − µt ) + ηt Nt . Applying this relationship iteratively, with ν1 = µ1 = 0, νt+1 − µt+1 = Πt1 (Tx )(ν1 − µ1 ) +

t X

ηk Πtk+1 (Tx )Nk =

k=1

t X

ηk Πtk+1 (Tx )Nk .

k=1

By (18), we have

t

X 1

t kSρ νt+1 − Sρ µt+1 kρ = ηk T 2 Πk+1 (Tx )Nk ,

k=1

H

which leads to the desired result (33). The proof is complete. 2 The above lemma demonstrates that

in order to upper bound E[kSρ µt − Sρ νt kρ ], one may

12 t

only need to bound T Πk+1 (Tx )Nk . A detailed look at this latter term indicates that one 1

H

may analysis the terms T 2 Πtk+1 (Tx ) and Nk separately, since Ez [Nk ] = 0 and the properties of the deterministic sequence {µk }k are well developed in Section B. 17

Lemma C.2. Under Assumptions 2 and 3 , let ζ ≥ 1/2. Then for any fixed λ > 0, with probability at least 1 − δ1 , the following holds for all k ∈ N : 1) If ζ ≥ 1/2, − 21

Nk kH ≤ 4(Rκ2ζ +

k(T + λ)

! p √ 2 vcγ 4 κ √ + √ log . γ δ m λ mλ 1

√ M)

(35)

2) If ζ ∈]0, 1/2],   − 21

k(T + λ)

2ζ−1

Nk kH ≤ 4 κ κ

k X



! 21 −ζ  ηi



+

i=1

 M

! p √ 2 vcγ 4 κ √ + √ log . (36) δ1 m λ mλγ

Proof. We will apply Berstein inequality from Lemma A.1 to prove the result.  1

Bounding (T + λ)− 2 Sρ∗ fρ − Sx∗ y H

1

For all i ∈ [m], let wi = yi (T + λI)− 2 xi . Obviously, from the definitions of fρ (see (6)) and Sρ , 1

1

E[w1 ] = Ex1 [fρ (x1 )(T + λI)− 2 x1 ] = (T + λI)− 2 Sρ∗ fρ . Thus,  1 1 X (T + λ)− 2 Sρ∗ fρ − Sx∗ y = (E[wi ] − wi ). m i=1 We next estimate the constants B and σ 2 (w1 ) in (22). Note that for any l ≥ 2, E[kw1 − E[w1 ]klH ] ≤ E[(kw1 kH + E[kw1 kH ])l ]. By using H¨ older’s inequality twice, E[kw1 − E[w1 ]klH ] ≤ 2l−1 E[kw1 klH + (E[kw1 kH ])l ] ≤ 2l−1 E[kw1 klH + E[kw1 klH ]]. The right-hand side is exactly 2l E[kw1 klH ]. Therefore, by recalling the definition of w1 and expanding the integration, E[kw1 −

E[w1 ]klH ]

≤2

l

Z

Z

l

y dρ(y|x) Y

1

k(T + λI)− 2 xklH dρX (x).

X

Note that by using H¨ older’s inequality, Z

Z

Z

l



y dρ(y|x) Y

X

Using Assumption 1 to the above, Z Z y l dρ(y|x) Y



 21 |y| dρ(y|x) . 2l

Y



√ √ l!M l v ≤ l!( M )l v.

X

Plugging the above into (37), we reach Z √ √ 1 E[kw1 − E[w1 ]klH ] ≤ l!(2 M )l v k(T + λI)− 2 xklH dρX (x). X

Using Assumption (3) which imples 1 kxkH κ k(T + λI)− 2 xkH ≤ √ ≤ √ , λ λ

18

(37)

we get that √ √ E[kw1 − E[w1 ]klK ] ≤ l!(2 M )l v



κ √ λ

l−2 Z

1

k(T + λI)− 2 xk2H dρX (x).

X

Using the fact that E[kξk2H ] = E[tr(ξ ⊗ ξ)] = tr(E[ξ ⊗ ξ]) and E[x ⊗ x] = T , we know that Z 1 1 1 k(T + λI)− 2 xk2H dρX (x) = tr((T + λI)− 2 T (T + λI)− 2 ) = tr((T + λI)−1 T ), X

and as a result of the above and Assumption 3, Z 1 k(T + λI)− 2 xk2H dρX (x) ≤ cγ λ−γ . X

Therefore, E[kw1 −

E[w1 ]klH ]

√ √ ≤ l!(2 M )l v



κ √ λ

Applying Berstein inequality with B = bility at least 1 −

δ1 2 ,

l−2

√ 2κ√ M λ

−γ

cγ λ

1 = l! 2 p

and σ =

√ !l−2 √ 2κ M √ 8M vcγ λ−γ . λ

√ 8M vcγ λ−γ , we get that with proba-

there holds

 1

(T + λ)− 2 Sρ∗ fρ − Sx∗ y

H



1 X

(E[wi ] − wi ) =

m i=1

H

√ ≤4 M

! p √ 2 vcγ κ 4 √ + √ log . γ δ1 m λ mλ (38)

1

Bounding k(T + λ)− 2 (T − Tx )k 1

1

Let ξi = (T + λ)− 2 xi ⊗ xi , for all i ∈ [m]. It is easy to see that E[ξi ] = (T + λ)− 2 T , and Pm 1 1 that (T + λ)− 2 (T − Tx ) = m i=1 (E[ξi ] − ξi ). Denote the Hilbert-Schmidt norm of a bounded operator from H to H by k · kHS . Note that kξ1 k2HS = kx1 k2H Trace((T + λ)−1/2 x1 ⊗ x1 (T + λ)−1/2 ) = kx1 k2H Trace((T + λ)−1 x1 ⊗ x1 ). By Assumption (3), kξ1 kHS ≤

p p √ κ2 Trace((T + λ)−1 x1 ⊗ x1 ) ≤ κ2 Trace(x1 ⊗ x1 )/λ ≤ κ2 / λ,

and furthermore, by Assumption 3, E[kξ1 k2HS ] ≤ κ2 ETrace((T + λ)−1 x1 ⊗ x1 ) = κ2 Trace((T + λ)−1 T ) ≤ κ2 cγ λ−γ . According to Lemma A.1, we get that with probability at least 1 − δ21 , there holds   √ cγ 1 2κ 4 √ +√ k(T + λ)− 2 (T − Tx )kHS ≤ 2κ log . δ1 m λ mλγ

(39)

Finally, using the triangle inequality, we have,

 1 1 1

k(T + λ)− 2 Nk kH ≤ k(T + λ)− 2 (T − Tx )kkµk kH + (T + λ)− 2 Sρ∗ fρ − Sx∗ y . H

Applying Lemma B.3 to the above, introducing with (38) and (39), and then noting that κ ≥ 1 and v ≥ 1, one can prove the desired results. The next lemma is borrowed from [27], derived by applying a recent Bernstein inequality from [28, 29] for a sum of random operators. 19

Lemma C.3. Let δ2 ∈ (0, 1) and

9κ2 m

log

1

1

m δ2

≤ λ ≤ kT k. Then the following holds with probability

at least 1 − δ2 , 1

1

k(Tx + λI)− 2 T 2 k ≤ k(Tx + λ)− 2 (T + λ) 2 k ≤ 2.

(40)

Now we are in a position to estimate the sample variance. Proposition C.4. Let η1 κ2 ≤ 1 and (35) for all k ∈ [T ]. Assume that (40) holds. Then the following holds for all t ∈ [T ] : 1) If ζ ≥ 1/2, kSρ νt+1 − Sρ µt+1 kρ ≤4(Rκ2ζ +

! t−1 ! p √ t−1 X X √ 2 2 vcγ κ ηk /2 4 √ + √ +λ ηk + 2κ ηt log . Pt γ δ m λ mλ 1 i=k+1 ηi k=1 k=1

√ M)

(41)

2) If ζ ≤ 1/2,   kSρ νt+1 − Sρ µt+1 kρ ≤ 4 κ κ2ζ−1 ∨

k X

! 21 −ζ  +

ηi





M

i=1 t−1 X

×

ηk /2 Pt

i=k+1

k=1

ηi



t−1 X

ηk +



! p √ 2 vcγ κ 4 √ + √ log . (42) γ δ1 m λ mλ

! 2

2κ ηt

k=1

Proof. For notational simplicity, we let Tλ = T + λI and Tx,λ = Tx + λI. Note that by Lemma 1

C.1, we have (33). When k ∈ [t − 1], by rewriting T 2 Πtk+1 (Tx )Nk as 1

−1

1

1

−1

1

− 21

2 2 T 2 Tx,λ2 Tx,λ Πtk+1 (Tx )Tx,λ Tx,λ2 Tλ2 Tλ

Nk ,

1

we can upper bound kT 2 Πtk+1 (Tx )Nk kH as 1

1

−1

1

−1

1

1

− 12

2 2 kT 2 Πtk+1 (Tx )Nk kH ≤ kT 2 Tx,λ2 kkTx,λ Πtk+1 (Tx )Tx,λ kkTx,λ2 Tλ2 kkTλ

Applying (40), the above can be relaxed as 1

1

1

− 21

2 2 kT 2 Πtk+1 (Tx )Nk kH ≤ 4kTx,λ Πtk+1 (Tx )Tx,λ kkTλ

Nk kH ,

which is equivalent to 1

− 12

kTλ2 Πtk+1 (Tx )Nk kH ≤ 4kTx,λ Πtk+1 (Tx )kkTλ

Nk kH .

Thus, following from ηk κ2 ≤ 1 which implies ηk kTx k ≤ 1, kTx,λ Πtk+1 (Tx )k



kTx Πtk+1 (Tx )k + kλΠtk+1 (Tx )k



kTx Πtk+1 (Tx )k + λ.

Applying Lemma B.1 with ζ = 1 to bound kTx Πtk+1 (Tx )k, we get kTx,λ Πtk+1 (Tx )k ≤

1 e

Pt

j=k+1

ηj

+ λ.

When k = t, 1

1

1

1

− 12

kT 2 Πtk+1 (Tx )Nk kH = kT 2 Nt kH ≤ kT 2 kkTλ2 kkTλ 1 2

1 2

− 12

≤ kT k (kT k + λ) kTλ 20

Nt kH

Nt kH .

Nk kH .

Since λ ≤ kT k ≤ tr(T ) ≤ κ2 , we derive 1

kT 2 Πtk+1 (Tx )Nt kH ≤



−1

2κ2 kTλ 2 Nt kH .

Pt

1

From the above analysis, we conclude that k=1 ηk T 2 Πtk+1 (Tx )Nk

can be upper bounded

H

by ≤ sup k∈[t]

t−1 X

−1 kTλ 2 Nk kH

ηk /2 Pt

ηi

i=k+1

k=1



t−1 X

ηk +



! 2

2κ ηt

.

k=1

Plugging (35) (or (36)) into the above, and then combining with (33), we get the desired bound (41) (or (42)). The proof is complete. Setting ηt = η1 t−θ in the above proposition, with some basic estimates from Appendix A, we get the following explicit bounds for the sample variance. Proposition C.5. Let ηt = η1 t−θ and (35) for all t ∈ [T ], with η1 ∈]0, κ−2 ] and θ ∈ [0, 1[. Assume that (40) holds. Then the following holds for all t ∈ [T ]: 1) If ζ ≥ 1/2, kSρ νt+1 − Sρ µt+1 kρ   √ √ 2λη1 t1−θ 2 2ζ + log t + 1 + 2η1 κ ≤4(Rκ + M ) 1−θ

! p √ 2 vcγ 4 κ √ + √ log . γ δ m λ mλ 1

(43)

2) If ζ ≤ 1/2, kSρ νt+1 − Sρ µt+1 kρ ≤ 4 κ κ2ζ−1 ∨  ×



2η1 t1−θ 1−θ

 21 −ζ !

!

√ +

√ 2λη1 t1−θ + log t + 1 + 2η1 κ2 1−θ

M



! p √ 2 vcγ κ 4 √ + √ log . (44) γ δ1 m λ mλ

Proof. By Proposition C.4, we have (41). Note that t−1 X

ηk Pt

k=1

i=k+1

ηi

=

t−1 X

k −θ Pt

i=k+1

k=1

i−θ



t−1 X k=1

k −θ . (t − k)t−θ

Applying Lemma A.4, we get t−1 X

ηk Pt

k=1

i=k+1

≤ 2 + 2 log t,

ηi

and by Lemma A.2, t−1 X k=1

ηk = η1

t−1 X

k −θ ≤

k=1

2η1 t1−θ . 1−θ

Introducing the last two estimates into (41) and (43), one can get the desired results. The proof is complete. In conclusion, we get the following result for the sample variance. Theorem C.6. Under Assumptions 1, 2 and 3, let δ1 , δ2 ∈]0, 1[ and Let ηt = η1 t

−θ

−2

for all t ∈ [T ], with η1 ∈]0, κ

9κ2 m

log

m δ2

≤ λ ≤ kT k.

] and θ ∈ [0, 1[. Then with probability at least

1 − δ1 − δ2 , the following holds for all t ∈ [T ] : 1) if ζ ≥ 1/2, we have (43). 2) if ζ < 1/2, we have (44). 21

D Computational Variance

In this section, we estimate the computational variance, E_J[‖S_ρ ω_t − S_ρ ν_t‖²_ρ]. For this, a series of lemmas is needed.

D.1 Bounding the Empirical Risk

This subsection is devoted to upper bounding E_J[E_z(ω_l)]. The process relies on some tools from convex analysis and a decomposition related to the weighted averages and the last iterates from [22, 30]. We begin by introducing the following lemma, a fact based on special properties of the square loss.

Lemma D.1. Given any sample z and l ∈ N, let ω ∈ H be independent of J_l. Then

    η_l (E_z(ω_l) − E_z(ω)) ≤ ‖ω_l − ω‖²_H − E_{J_l}‖ω_{l+1} − ω‖²_H + η_l² κ² E_z(ω_l).    (45)

Proof. Since ωt+1 is given be (4), subtracting both sides of (4) by ω, taking the square H-norm, and expanding the inner product, kωl+1 − ωk2H = kωl − ωk2H +

ηl2 b2

2

X

bl



(hω , x i − y )x l ji H ji ji

i=b(l−1)+1

H

+

2ηl b

bl X

(hωl , xji iH − yji )hω − ωl , xji iH .

i=b(l−1)+1

By Assumption (3), kxji kH ≤ κ, and thus

2

bl

X

(hω , x i − y )x l ji H ji ji

i=b(l−1)+1

 ≤

bl X



2 |hωl , xji iH − yji |κ

i=b(l−1)+1

H



bl X

κ2 b

(hωl , xji iH − yji )2 ,

i=b(l−1)+1

where for the last inequality, we used Cauchy-Schwarz inequality. Thus, ωk2H

kωl+1 −

+

2ηl b

ωk2H

≤ kωl − bl X

η 2 κ2 + l b

bl X

(hωl , xji iH − yji )2

i=b(l−1)+1

(hωl , xji iH − yji )(hω, xji iH − hωl , xji iH ).

i=b(l−1)+1

Using the basic inequality a(b − a) ≤ (b2 − a2 )/2, ∀a, b ∈ R, kωl+1 − ωk2H ≤ kωl − ωk2H +

+

ηl b

bl X

ηl2 κ2 b

bl X

(hωl , xji iH − yji )2

i=b(l−1)+1

 (hω, xji iH − yji )2 − (hωl , xji iH − yji )2 .

i=b(l−1)+1

Noting that ωl and ω are independent from Jl , and taking the expectation on both sides with respect to Jl , EJl kωl+1 − ωk2H ≤ kωl − ωk2H + ηl2 κ2 Ez (ωl ) + ηl (Ez (ω) − Ez (ωl )) , which leads to the desired result by rearranging terms. The proof is complete. 22

Using the above lemma and a decomposition related to the weighted averages and the last iterates from [22, 30], we can prove the following relationship. Lemma D.2. Let η1 κ2 ≤ 1/2 for all t ∈ N. Then ηt EJ [Ez (ωt )] ≤ 4Ez (0)

t t−1 t−1 X X 1X 1 ηl + 2κ2 ηi2 EJ [Ez (ωi )]. t k(k + 1) l=1

(46)

i=t−k

k=1

Proof. For k = 1, · · · , t − 1, 1 k

t X i=t−k+1

1 k(k + 1)

=

1 k(k + 1)

=

ηi EJ [Ez (ωi )] −

t X 1 ηi EJ [Ez (ωi )] k+1 i=t−k

(

t X

(k + 1)

ηi EJ [Ez (ωi )] − k

i=t−k+1 t X

t X

) ηi EJ [Ez (ωi )]

i=t−k

(ηi EJ [Ez (ωi )] − ηt−k EJ [Ez (ωt−k )]).

i=t−k+1

Summing over k = 1, · · · , t − 1, and rearranging terms, we get [30] t

ηt EJ [Ez (ωt )] =

t−1

t X

k=1

i=t−k+1

X 1X 1 ηi EJ [Ez (ωi )] + t i=1 k(k + 1)

(ηi EJ [Ez (ωi )] − ηt−k EJ [Ez (ωt−k )]).

Since {ηt }t is decreasing and EJ [Ez (ωt−k )] is non-negative, the above can be relaxed as t

ηt EJ [Ez (ωt )] ≤

t−1

t X

k=1

i=t−k+1

X 1 1X ηi EJ [Ez (ωi )] + t i=1 k(k + 1)

ηi EJ [Ez (ωi ) − Ez (ωt−k )].

(47)

In the rest of the proof, we will upper bound the last two terms of the above. To bound the first term of the right side of (47), we apply Lemma D.1 with ω = 0 to get ηl EJ (Ez (ωl ) − Ez (0)) ≤ EJ [kωl k2H − kωl+1 k2H ] + ηl2 κ2 EJ [Ez (ωl )]. Rearranging terms, ηl (1 − ηl κ2 )EJ [Ez (ωl )] ≤ EJ [kωl k2H − kωl+1 k2H ] + ηl Ez (0). It thus follows from the above and ηl κ2 ≤ 1/2 that ηl EJ [Ez (ωl )]/2 ≤ EJ [kωl k2H − kωl+1 k2H ] + ηl Ez (0). Summing up over l = 1, · · · , t, t X

ηl EJ [Ez (ωl )]/2 ≤ EJ [kw1 k2H − kωt+1 k2H ] + Ez (0)

l=1

t X

ηl .

l=1

Introducing with ω1 = 0, kωt+1 k2H ≥ 0, and then multiplying both sides by 2/t, we get t

t

l=1

l=1

1X 1X ηl EJ [Ez (ωl )] ≤ 2Ez (0) ηl . t t

(48)

It remains to bound the last term of (47). Let k ∈ [t − 1] and i ∈ {t − k, · · · , t}. Note that given the sample z, ωi is depending only on J1 , · · · , Ji−1 when i > 1 and ω1 = 0. Thus, we can apply Lemma D.1 with ω = ωt−k to derive ηi (Ez (ωi ) − Ez (ωt−k )) ≤ kωi − ωt−k k2H − EJi kωi+1 − ωt−k k2H + ηi2 κ2 Ez (ωi ). 23

Therefore, ηi EJ [Ez (ωi ) − Ez (ωt−k )] ≤ EJ [kωi − ωt−k k2H − kωi+1 − ωt−k k2H ] + ηi2 κ2 EJ [Ez (ωi )]. Summing up over i = t − k, · · · , t, t X

ηi EJ [Ez (ωi ) − Ez (ωt−k )] ≤ κ2

i=t−k

t X

ηi2 EJ [Ez (ωi )].

i=t−k

Note that the left hand side is exactly

Pt

i=t−k+1

ηi EJ [Ez (ωi ) − Ez (ωt−k )]. We thus know that

the last term of (47) can be upper bounded by κ2

t−1 X k=1

= κ2

t−1 X k=1

t X 1 ηi2 EJ [Ez (ωi )] k(k + 1) i=t−k

t−1 t−1 X X 1 1 ηi2 EJ [Ez (ωi )] + κ2 ηt2 EJ [Ez (ωt )] . k(k + 1) k(k + 1) i=t−k

k=1

Using the fact that t−1 X k=1

t−1

X 1 = k(k + 1)



k=1

1 1 − k k+1

 =1−

1 ≤ 1, t

and κ2 ηt ≤ 1/2, we get that the last term of (47) can be bounded as t−1 X k=1



κ2

1 k(k + 1)

t−1 X k=1

t X

ηi (EJ [Ez (ωi )] − EJ [Ez (ωt−k )])

i=t−k+1

t−1 X 1 ηi2 EJ [Ez (ωi )] + ηt EJ [Ez (ωt )]/2. k(k + 1) i=t−k

Plugging the above and (48) into the decomposition (47), and rearranging terms ηt EJ [Ez (ωt )]/2 ≤ 2M

21

t

t X

2

ηl + κ

l=1

t−1 X k=1

t−1 X 1 ηi2 EJ [Ez (ωi )], k(k + 1) i=t−k

which leads to the desired result by multiplying both sides by 2. The proof is complete. We also need to the following lemma, whose proof can be done by using an induction argument. Lemma D.3. Let {ut }Tt=1 , {At }Tt=1 and {Bt }Tt=1 be three sequences of non-negative numbers such that u1 ≤ A1 and ut ≤ At + Bt sup ui ,

∀t ∈ {2, 3, · · · , T }.

(49)

i∈[t−1]

Let supt∈[T ] Bt ≤ B < 1. Then for all t ∈ [T ], sup ut ≤ k∈[t]

1 sup Ak . 1 − B k∈[t]

(50)

Proof. When t = 1, (50) holds trivially since u1 ≤ A1 and B < 1. Now assume for some t ∈ N with 2 ≤ t ≤ T, sup ui ≤ i∈[t−1]

1 sup Ai . 1 − B i∈[t−1] 24

Then, by (49), the above hypothesis, and Bt ≤ B, we have   Bt Bt 1 ut ≤ At + Bt sup ui ≤ At + sup Ai ≤ sup Ai 1 + ≤ sup Ai . 1 − B 1 − B 1 − B i∈[t−1] i∈[t−1] i∈[t] i∈[t] Consequently, sup ut ≤ k∈[t]

1 sup Ak , 1 − B k∈[t]

thereby showing that indeed (50) holds for t. By mathematical induction, (50) holds for every t ∈ [T ]. The proof is complete. Now we can bound EJ [Ez (fk )] as follows. Lemma D.4. Let η1 κ2 ≤ 1/2 and for all t ∈ [T ] with t ≥ 2, t−1 t−1 X 1 1 1 X ηi2 ≤ 2 . ηt k(k + 1) 4κ k=1

(51)

i=t−k

Then for all t ∈ [T ], ( sup EJ [Ez (fk )] ≤ 8Ez (0) sup k∈[t]

k∈[t]

k 1 X ηl ηk k

) .

(52)

l=1

Proof. By Lemma D.2, we have (46). Dividing both sides by ηt , we can relax the inequality as EJ [Ez (ωt )] ≤ 4Ez (0)

t t−1 t−1 X 1 X 1 X 1 ηl + 2κ2 ηi2 sup EJ [Ez (ωi )]. ηt t ηt k(k + 1) i∈[t−1] l=1

k=1

i=t−k

In Lemma D.3, we let ut = EJ [Ez (ωt )], At = 4Ez (0) η1t t Bt = 2κ2

Pt

l=1

ηl and

t−1 t−1 X 1 1 X ηi2 . ηt k(k + 1) k=1

i=t−k

Condition (51) guarantees that supt∈[T ] Bt ≤ 1/2. Thus, (50) holds, and the desired result follows by plugging with B = 1/2. The proof is complete. Finally, we need the following lemma to bound Ez (0), whose proof follows from applying the Bernstein Inequality from Lemma A.1. Lemma D.5. Under Assumption 1, with probability at least 1 − δ3 (δ3 ∈]0, 1[), there holds √ ! 1 2v 2 Ez (0) ≤ M v + 2M + √ log . m δ3 m In particular, if m ≥ 32 log2

2 δ3 ,

then Ez (0) ≤ 2M v.

Proof. Following from (5), Z Z

y 2l dρ ≤

1 l!M l−2 · (2M 2 v), 2

(53)

∀l ∈ N.

√ Applying Lemma A.1, with ωi = yi2 for all i ∈ [m], B = M and σ = M 2v, we know that with probability at least 1 − δ3 , there holds Z m 1 X 2 yi − y 2 dρ ≤ 2M m i=1 Z 25

√ ! 1 2v 2 + √ log . m δ3 m

By setting l = 1 in (5), Z

y 2 dρ ≤ M v.

Z

It thus follows that Z m 1 X 2 yi ≤ y 2 dρ + 2M m i=1 Z

√ ! 2v 2 1 log + √ ≤ M v + 2M m δ3 m

√ ! 2v 2 1 log , + √ m δ3 m

which leads to the desired results by noting that the left-hand side is exactly Ez (0) and ν ≥ 1. The proof is complete.

D.2 Bounding ‖T^{1/2} Π^t_{k+1}(T_x)‖

Lemma D.6. Assume (40) holds for some λ > 0 and η1 κ2 ≤ 1. Then 1

1

kT 2 Πtk+1 (Tx )k2 ≤ Pt

i=k+1

ηi

+ 4λ.

Proof. Note that we have 1

1

1

1

kT 2 Πtk+1 (Tx )k ≤ kT 2 (Tx + λI)− 2 kk(Tx + λI) 2 Πtk+1 (Tx )k. Using (40), we can relax the above as 1

1

kT 2 Πtk+1 (Tx )k ≤ 2k(Tx + λI) 2 Πtk+1 (Tx )k, which leads to 1

1

kT 2 Πtk+1 (Tx )k2 ≤ 4k(Tx + λI) 2 Πtk+1 (Tx )k2 . Since 1

k(Tx + λI) 2 Πtk+1 (Tx )k2

=

k(Tx + λI)Πtk+1 (Tx )Πtk+1 (Tx )k



kTx Πtk+1 (Tx )Πtk+1 (Tx )k + λ

=

kTx2 Πtk+1 (Tx )k2 + λ,

1

and with ηt κ2 ≤ 1, kTx k ≤ tr(Tx ) ≤ κ2 , by Lemma B.1, 1

1

kTx2 Πtk+1 (Tx )k2 ≤

2e

Pt

i=k+1

ηi



1 4

Pt

i=k+1

ηi

,

we thus derive the desired result. The proof is complete.

D.3

Deriving Error Bounds

With Lemmas D.4 and D.6, we are ready to estimate the computational variance , EJ kft − gt k2ρ , as follows. Proposition D.7. Assume (40) holds for some λ > 0, η1 κ2 ≤ 1/2, (51) and (53). Then, we have for all t ∈ [T ], 16M vκ2 EJ kSρ ωt+1 − Sρ νt+1 k2ρ ≤ sup b k∈[t]

(

k 1 X ηl ηk k l=1

)

t−1 X

ηk2 Pt

k=1

i=k+1

ηi

+ 4λ

t−1 X

! ηk2 + ηt2 κ2

.

k=1

(54) 26

Proof. Since ωt+1 and νt+1 are given by (4) and (20), respectively,   bt   X 1 ωt+1 − νt+1 = (ωt − νt ) + ηt (Tx νt − Sx∗ y) − (hωt , xji iH − yji )xji   b i=b(t−1)+1

=

(I − ηt Tx )(ωt − νt ) +

ηt b

bt X

{(Tx ωt − Sx∗ y) − (hωt , xji iH − yji )xji } .

i=b(t−1)+1

Applying this relationship iteratively, t

ωt+1 − νt+1 = Πt1 (Tx )(ω1 − ν1 ) +

bk X

1X b

ηk Πtk+1 (Tx )Mk,i ,

k=1 i=b(k−1)+1

where we denote Mk,i = (Tx ωk − Sx∗ y) − (hωk , xji iH − yji )xji .

(55)

Introducing with ω1 = ν1 = 0, t

ωt+1 − νt+1 =

bk X

1X b

ηk Πtk+1 (Tx )Mk,i .

k=1 i=b(k−1)+1

Therefore, EJ kSρ ωt+1 − Sρ νt+1 k2ρ

2

t ηk Πk+1 (Tx )Mk,i

k=1 i=b(k−1)+1

=

X

t 1

E J

b2

=

t 1 X b2

bk X

bk X

ρ

2

ηk2 EJ Πtk+1 (Tx )Mk,i ρ ,

(56)

k=1 i=b(k−1)+1

where for the last equality, we use the fact that if k 6= k 0 , or k = k 0 but i 6= i05 , then EJ hΠtk+1 (Tx )Mk,i , Πtk0 +1 (Tx )Mk0 ,i0 iρ = 0. Indeed, if k 6= k 0 , without loss of generality, we consider the case k < k 0 . Recalling that Mk,i is given by (55) and that given any z, fk is depending only on J1 , · · · , Jk−1 , we thus have EJ hΠtk+1 (Tx )Mk,i , Πtk0 +1 (Tx )Mk0 ,i0 iρ = EJ1 ,··· ,Jk0 −1 hΠtk+1 (Tx )Mk,i , Πtl+1 (Tx )EJk0 [Mk0 ,i0 ]iρ = 0. If k = k 0 but i 6= i0 , without loss of generality, we assume i < i0 . By noting that ωk is depending only on J1 , · · · , Jk−1 and Mk,i is depending only on ωk and zji (given any sample z), EJ hΠtk+1 (Tx )Mk,i , Πtk+1 (Tx )Mk,i0 iρ = EJ1 ,··· ,Jk−1 hΠtk+1 (Tx )Eji [Mk,i ], Πtl+1 (Tx )Eji0 [Mk,i0 ]iρ = 0. Using the isometry property (18) to (56),

1

2

2

1

2



2 EJ Πtk+1 (Tx )Mk,i ρ = EJ T 2 Πtk+1 (Tx )Mk,i ≤ T 2 Πtk+1 (Tx ) EJ kMk,i kH , H

and by applying the inequality E[kξ − 2

E[ξ]k2H ]

≤ E[kξk2H ],

2

EJ kMk,i kH ≤ EJ k(hωk , xji iH − yji )xji kH ≤ κ2 EJ [(hωk , xji iH − yji )2 ] = κ2 EJ [Ez (ωk )], 5 This

is possible only when b ≥ 2.

27

where for the last inequality we use (3). Therefore, EJ kSρ ωt+1 − Sρ νt+1 k2ρ ≤

t

2 κ2 X 2

1

ηk T 2 Πtk+1 (Tx ) EJ [Ez (ωk )]. b k=1

According to Lemma D.4, we have (52). It thus follows that ( ) t k

2 X 8Ez (0)κ2 1 X

1 2 sup ηl ηk2 T 2 Πtk+1 (Tx ) . EJ kSρ ωt+1 − Sρ νt+1 kρ ≤ b η k k k∈[t] l=1

k=1

Now the proof can be finished by applying Lemma D.6 which tells us that t X

1

2

ηk2 T 2 Πtk+1 (Tx )

t−1 X

=

k=1

2

1 2

1



ηk2 T 2 Πtk+1 (Tx ) + ηt2 T 2

k=1 t−1 X



ηk2 Pt

i=k+1 ηi

k=1

+ 4λ

t−1 X

ηk2 + ηt2 κ2 ,

k=1

and (53) to the above. The proof is complete. Setting ηt = η1 t−θ for some appropriate η1 and θ in the above proposition, we get the following explicitly upper bounds for EJ kSρ ωt − Sρ ωt k2ρ . Proposition D.8. Assume (40) holds for some λ > 0 and (53). Let ηt = η1 t−θ for all t ∈ [T ], with θ ∈ [0, 1[ and 0 < η1 ≤

tmin(θ,1−θ) , 8κ2 (log t + 1)

∀t ∈ [T ].

(57)

Then, for all t ∈ [T ], EJ kωt+1 − νt+1 k2ρ ≤

 16M vκ2  5η1 t− min(θ,1−θ) + 8λη12 t(1−2θ)+ (1 ∨ log t). b(1 − θ)

(58)

Proof. We will use Proposition D.7 to prove the result. Thus, we need to verify the condition (51). Note that t−1 X k=1

 X  t−1 t−1 t−1 t−1 t−1 X X X X ηi2 1 1 1 1 2 2 2 ηi = ηi = ηi − ≤ . k(k + 1) k(k + 1) t−i t t−i i=1 i=1 i=1 i=t−k

k=t−i

Substituting with ηi = ηi−θ , and by Lemma A.4, t−1 X k=1

t−1 t−1 −2θ X X 1 i ηi2 ≤ η12 ≤ 2η12 t− min(2θ,1) (log t + 1). k(k + 1) t − i i=1 i=t−k

Dividing both sides by ηt (= ηt−θ ), and then using (57), t−1 t−1 X 1 X 1 1 ηi2 ≤ 2η1 t− min(θ,1−θ) (log t + 1) ≤ 2 . ηt k(k + 1) 4κ k=1

i=t−k

This verifies (51). Note also that by taking t = 1 in (57), for all t ∈ [T ] , ηt κ2 ≤ η1 κ2 ≤

1 1 ≤ . 8κ2 2

We thus can apply Proposition D.7 to derive (54). What remains is to control the right hand side of (54). Since t−1 X

ηk2 Pt

k=1

i=k+1

ηi

= η1

t−1 X

k −2θ Pt

k=1

i=k+1

28

i−θ

≤ η1

t−1 X k=1

k −2θ , (t − k)t−θ

combining with Lemma A.4, t−1 X

ηk2 Pt

i=k+1

k=1

ηi

≤ 2η1 t− min(θ,1−θ) (log t + 1).

Also, by Lemma A.2, k k 1 X 1 X −θ 1 ηl = 1−θ l ≤ , ηk k k 1−θ l=1

l=1

and by Lemma A.3, t−1 X

ηk2 = η12

k=1

t−1 X

k −2θ ≤ η12 tmax(1−2θ,0) (log t + 1).

k=1

Introducing the last three estimates into (54) and using that ηt2 κ2 ≤ η1 t−θ by (57), we get the desired result. The proof is complete. Collect some of the above analysis, we get the following result for the computational variance. Theorem D.9. Under Assumptions 1 and 3, let δ2 ∈]0, 1[, m≥

32 log2 δ23 ,

−θ

and ηt = ηt

9κ2 m

log

m δ2

≤ λ ≤ kT k, δ3 ∈]0, 1[,

for all t ∈ [T ], with θ ∈ [0, 1[ and η such that (57). Then, with

probability at least 1 − δ2 − δ3 , (58) holds for all t ∈ [T ].

E Deriving Total Error Bounds

The purpose of this section is to derive total error bounds.

E.1 Attainable Case

We have the following general theorem for ζ ≥ 1/2, with which we prove our main results stated in Section 3. Theorem E.1. Under Assumptions 1, 2 and 3, let ζ ≥ 1/2, T ∈ N with T ≥ 3, δ ∈]0, 1[, ηt = ηκ−2 t−θ for all t ∈ [T ], with θ ∈ [0, 1[ and η such that 0
