Riemannian stochastic quasi-Newton algorithm with variance reduction and its convergence analysis

Hiroyuki Kasai∗
Hiroyuki Sato†
Bamdev Mishra‡
arXiv:1703.04890v2 [cs.LG] 12 Apr 2017
April 13, 2017
Abstract. Stochastic gradient algorithms have recently gained significant attention for minimizing the average of a large but finite number of loss functions. The present paper proposes a Riemannian stochastic quasi-Newton algorithm with variance reduction (R-SQN-VR). We tackle the key challenges of averaging, adding, and subtracting multiple gradients by exploiting the notions of retraction and vector transport. We present a global convergence analysis and a local convergence rate analysis of R-SQN-VR under some natural assumptions. The proposed algorithm is applied to the Karcher mean computation on the symmetric positive-definite manifold and to low-rank matrix completion on the Grassmann manifold. Exhaustive experiments reveal the superior performance of the proposed algorithm in comparison with the Riemannian stochastic gradient descent and the Riemannian stochastic variance reduction algorithms.
1 Introduction
Let f : M → R be a smooth real-valued function on a Riemannian manifold M. The problem under consideration in the present paper is the minimization of the expected risk of f for a given model variable w ∈ M taken with respect to the distribution of z, i.e., min_{w∈M} f(w), where f(w) = E_z[f(w; z)] = ∫ f(w; z) dP(z) and z is a random seed representing a single sample or a set of samples. When given a set of realizations {z_[n]}_{n=1}^N of z, we define the loss incurred by the parameter vector w with respect to the n-th sample as f_n(w) := f(w; z_[n]), and then the empirical risk is defined as the average of the sample losses:

\[
\min_{w \in \mathcal{M}}\; f(w) := \frac{1}{N}\sum_{n=1}^{N} f_n(w), \tag{1}
\]
where N is the total number of samples. This problem has many applications, including, to name a few, principal component analysis (PCA) and the subspace tracking problem [1] on the Grassmann manifold, which is the set of r-dimensional linear subspaces in R^d. The low-rank matrix/tensor completion problem is a promising example on the manifold
∗ Graduate School of Informatics and Engineering, The University of Electro-Communications, Tokyo, Japan ([email protected]).
† Department of Information and Computer Technology, Tokyo University of Science, Tokyo, Japan ([email protected]).
‡ Core Machine Learning Team, Amazon.com, Bangalore, India ([email protected]).
of fixed-rank matrices/tensors [2, 3]. The linear regression problem is also defined on the manifold of fixed-rank matrices [4].

Riemannian gradient descent requires the Riemannian full gradient estimation, i.e., gradf(w) = (1/N) Σ_{n=1}^N gradf_n(w), at every iteration, where gradf_n(w) is the Riemannian stochastic gradient of f_n(w) on the Riemannian manifold M for the n-th sample. This estimation is computationally heavy when N is extremely large. A popular alternative is Riemannian stochastic gradient descent (R-SGD), which extends stochastic gradient descent (SGD) in the Euclidean space [5]. Because it uses only one gradf_n(w), the complexity per iteration is independent of N. However, similarly to SGD [6], R-SGD requires decaying step-size sequences, which start with a large step-size and decrease with the iterations, to guarantee its convergence, and this causes slow convergence. Variance reduction (VR) methods have been proposed recently to accelerate the convergence of SGD in the Euclidean space [7–11]. One distinguishing feature of these methods is that they calculate a full gradient estimate periodically and re-use it to reduce the variance of the noisy stochastic gradient. However, all of the previously described algorithms are first-order algorithms, and their convergence can be slow owing to poor curvature approximation in ill-conditioned problems. One promising remedy is second-order algorithms such as stochastic quasi-Newton (QN) methods using Hessian evaluations [12–15]. They achieve faster convergence by exploiting the curvature information of the objective function f. Furthermore, combining these two acceleration techniques, [16] and [17] propose hybrid algorithms of the stochastic QN method with the VR method.

Turning to Riemannian manifolds, the QN method has been addressed in deterministic settings [18–20]. The VR method in the Euclidean space has also been extended to Riemannian manifolds, the so-called R-SVRG [21, 22]. Nevertheless, a second-order stochastic algorithm with the VR method has not been explored thoroughly for the problem (1). To this end, we propose a Riemannian stochastic QN method based on L-BFGS and the VR method. Our contributions are three-fold:

• We propose a novel (and, to the best of our knowledge, the first) Riemannian limited-memory QN algorithm with a VR method.

• Our convergence analysis deals separately with a global convergence analysis and a local convergence rate analysis. This structure is typical for batch algorithms on Riemannian manifolds [23]. Our assumptions for the latter are imposed only in a local neighborhood around a minimum, which is milder and more natural. Consequently, our analysis is expected to be applicable to a wider variety of manifolds.

• We present the convergence analysis of the algorithm with retraction and vector transport operations instead of the more restrictive exponential mapping and parallel translation operations. This makes the algorithm appealing for practical problems.

The specific features of the algorithm are two-fold:

• We update the curvature pair of the QN method at every outer loop by exploiting the full gradient estimations of the VR method, and thereby capture more precise and more stable curvature information without relying on unreliable stochastic gradient estimations. This avoids the additional sweep over samples required in the Euclidean stochastic QN [15] and the additional gradient estimations required in the Euclidean online BFGS (oBFGS) [12, 13, 24].
• Compared with a simple Riemannian extension of the QN method, a noteworthy advantage of its combination with the VR method is that, as revealed below, the frequent transports of curvature information between different tangent spaces, which are unavoidable in such a simple Riemannian extension, can be reduced drastically by the VR algorithm. This is a special benefit of the Riemannian hybrid algorithm that does not arise in the Euclidean case [16, 17]. More specifically, the update of the curvature information and the calculation of the second-order modified Riemannian stochastic gradient are performed uniformly on the tangent space of the outer loop.

The paper is organized as follows. Section 2 presents the details of the proposed R-SQN-VR. Section 3 presents two theorems of the convergence analysis. Section 4 presents numerical comparisons with R-SGD and R-SVRG on two problems, with results suggesting the superior performance of R-SQN-VR. The proposed R-SQN-VR is implemented in the Matlab toolbox Manopt [25]. A brief explanation of optimization on manifolds, the proofs of the theorems, and additional experiments are provided as supplementary material.
2 Riemannian stochastic quasi-Newton algorithm with variance reduction (R-SQN-VR)
We assume that the manifold M is endowed with a Riemannian metric structure, i.e., a smooth inner product ⟨·, ·⟩_w of tangent vectors is associated with the tangent space T_w M for all w ∈ M [23]. The norm ‖·‖_w of a tangent vector is the norm associated with the Riemannian metric. The metric structure allows a systematic framework for optimization over manifolds. Conceptually, the constrained optimization problem (1) is translated into an unconstrained optimization problem over M. Consequently, notions such as the Riemannian gradient (the first-order derivative of an objective function), the tangent space, and moving along a search direction have well-known expressions for a number of manifolds.
2.1 R-SGD and R-SVRG
R-SGD: Given a starting point w_0 ∈ M, the R-SGD algorithm produces a sequence (w_t)_{t≥0} in M that converges to a first-order critical point of (1). Specifically, the R-SGD algorithm updates w as w_{t+1} = R_{w_t}(−α_t gradf_n(w_t, z_t)), where α_t is the step-size and gradf_n(w_t, z_t) is a Riemannian stochastic gradient, which is a tangent vector at w_t ∈ M. gradf_n(w_t, z_t) is an unbiased estimator of the Riemannian full gradient gradf(w_t); that is, the expectation of gradf_n(w_t, z_t) over the choices of z_t is gradf(w_t), i.e., E_{z_t}[gradf_n(w_t, z_t)] = gradf(w_t). The update moves from w_t in the stochastic direction −gradf_n(w_t, z_t) with a step-size α_t while remaining on the manifold M. The mapping R_w : T_w M → M : ζ_w ↦ R_w(ζ_w) used here is called a retraction at w; it maps the tangent space T_w M onto M with a local rigidity condition that preserves gradients at w. The exponential mapping Exp is an instance of a retraction.

Riemannian stochastic variance reduced gradient (R-SVRG): R-SVRG has double loops, where the k-th outer loop, called an epoch, has m_k inner iterations. R-SVRG keeps w̃^k ∈ M after the m_{k−1} inner iterations of the (k−1)-th epoch, and computes the full Riemannian gradient gradf(w̃^k) only for this stored w̃^k. The algorithm also computes the Riemannian stochastic gradient gradf_{i_t^k}(w̃^k) that corresponds to the i_t^k-th sample. Then, picking the i_t^k-th sample for each t-th inner iteration of the k-th epoch at w_t^k, we calculate ξ_t^k by modifying the
Algorithm 1 Riemannian stochastic quasi-Newton with gradient variance reduction (R-SQN-VR).
Require: Update frequency m_k > 0, step-size α_t^k > 0, memory size L, and cautious update threshold ε.
1: Initialize w̃^0, and calculate the Riemannian full gradient gradf(w̃^0).
2: for k = 0, 1, 2, . . . do
3:   Store w_0^k = w̃^k.
4:   for t = 0, 1, 2, . . . , m_k − 1 do
5:     Choose i_t^k ∈ {1, . . . , N} uniformly at random.
6:     Calculate the tangent vector η̃_t^k from w̃^k to w_t^k by η̃_t^k = R_{w̃^k}^{−1}(w_t^k).
7:     if k > 1 then
8:       Transport the stochastic gradient gradf_{i_t^k}(w_t^k) to T_{w̃^k}M by (T_{η̃_t^k})^{−1} gradf_{i_t^k}(w_t^k).
9:       Calculate ξ̃_t^k as ξ̃_t^k = (T_{η̃_t^k})^{−1} gradf_{i_t^k}(w_t^k) − (gradf_{i_t^k}(w̃^k) − gradf(w̃^k)).
10:      Calculate H̃_t^k ξ̃_t^k, transport H̃_t^k ξ̃_t^k back to T_{w_t^k}M by T_{η̃_t^k} H̃_t^k ξ̃_t^k, and obtain H_t^k ξ_t^k.
11:      Update w_{t+1}^k from w_t^k as w_{t+1}^k = R_{w_t^k}(−α_t^k H_t^k ξ_t^k).
12:    else
13:      Calculate ξ_t^k as ξ_t^k = gradf_{i_t^k}(w_t^k) − T_{η̃_t^k}(gradf_{i_t^k}(w̃^k) − gradf(w̃^k)).
14:      Update w_{t+1}^k from w_t^k as w_{t+1}^k = R_{w_t^k}(−α_t^k ξ_t^k).
15:    end if
16:  end for
17:  option I: w̃^{k+1} = g_{m_k}(w_1^k, . . . , w_{m_k}^k) (or w̃^{k+1} = w_t^k for randomly chosen t ∈ {1, . . . , m_k}).
18:  option II: w̃^{k+1} = w_{m_k}^k.
19:  Calculate the Riemannian full gradient gradf(w̃^{k+1}).
20:  Calculate the tangent vector η_k from w̃^k to w̃^{k+1} by η_k = R_{w̃^k}^{−1}(w̃^{k+1}).
21:  Compute s_k^{k+1} = T_{η_k} η_k and y_k^{k+1} = κ_k^{−1} gradf(w̃^{k+1}) − T_{η_k} gradf(w̃^k), where κ_k = ‖η_k‖/‖T_{R_{η_k}} η_k‖.
22:  if ⟨y_k^{k+1}, s_k^{k+1}⟩_{w̃^{k+1}} ≥ ε‖s_k^{k+1}‖²_{w̃^{k+1}} then
23:    Discard the pair (s_{k−L}, y_{k−L}) when k > L, and store the pair (s_k^{k+1}, y_k^{k+1}).
24:  end if
25:  Transport {(s_j^k, y_j^k)}_{j=k−τ+1}^{k−1} ∈ T_{w̃^k}M to {(s_j^{k+1}, y_j^{k+1})}_{j=k−τ+1}^{k−1} ∈ T_{w̃^{k+1}}M by T_{η_k}, where τ = min(k, L).
26: end for
stochastic gradient gradf_{i_t^k}(w_t^k) using both gradf(w̃^k) and gradf_{i_t^k}(w̃^k). Because these belong to different tangent spaces, simply adding them is not well-defined, because Riemannian manifolds are not vector spaces. Therefore, after gradf_{i_t^k}(w̃^k) and gradf(w̃^k) are transported to T_{w_t^k}M by T_{η̃_t^k}, ξ_t^k is set as ξ_t^k = gradf_{i_t^k}(w_t^k) − T_{η̃_t^k}(gradf_{i_t^k}(w̃^k) − gradf(w̃^k)), where T represents the vector transport from w̃^k to w_t^k, and η̃_t^k ∈ T_{w̃^k}M satisfies R_{w̃^k}(η̃_t^k) = w_t^k. The vector transport T : TM ⊕ TM → TM, (η_w, ξ_w) ↦ T_{η_w}ξ_w, is associated with the retraction R and satisfies, for all ξ_w, ζ_w ∈ T_w M, (i) T_{η_w}ξ_w ∈ T_{R(η_w)}M, (ii) T_{0_w}ξ_w = ξ_w, and (iii) T_{η_w} is a linear map. The parallel translation P is an instance of a vector transport. Consequently, the final update is defined as w_{t+1}^k = R_{w_t^k}(−α_t^k ξ_t^k).
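To make the variance-reduced direction concrete, the following is a minimal sketch of ξ_t^k on the unit sphere with a projection-based retraction and vector transport; the PCA-type component costs f_n, the synthetic data, and the step size are illustrative assumptions rather than the paper's experimental setup.

```python
import numpy as np

# Tangent-space projection, projection-based vector transport, projective retraction,
# and the Riemannian gradient of f_n(w) = (||x_n||^2 - (w^T x_n)^2)/2 on the sphere.
proj      = lambda w, v: v - (w @ v) * w
transport = lambda w_to, v: proj(w_to, v)
retract   = lambda w, xi: (w + xi) / np.linalg.norm(w + xi)
grad_fn   = lambda w, x: proj(w, -(w @ x) * x)

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))                                   # N samples z_[n]
w_tilde = rng.standard_normal(20); w_tilde /= np.linalg.norm(w_tilde) # outer-loop point
full_grad = np.mean([grad_fn(w_tilde, x) for x in X], axis=0)         # gradf(w_tilde)

w_t, alpha = w_tilde.copy(), 0.05
for _ in range(100):                                                  # inner loop
    i = rng.integers(len(X))
    # xi_t^k = gradf_i(w_t) - T_eta( gradf_i(w_tilde) - gradf(w_tilde) )
    xi = grad_fn(w_t, X[i]) - transport(w_t, grad_fn(w_tilde, X[i]) - full_grad)
    w_t = retract(w_t, -alpha * xi)                                   # R-SVRG update
```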
2.2 Proposed R-SQN-VR
We propose a Riemannian stochastic QN method accompanied with a VR method (R-SQN-VR). A straightforward extension is to update the modified stochastic gradient ξ_t^k by premultiplying a linear inverse Hessian approximation operator H_t^k at w_t^k as

\[
w_{t+1}^k = R_{w_t^k}\bigl(-\alpha_t^k \mathcal{H}_t^k \xi_t^k\bigr), \tag{2}
\]

where H_t^k := T_{η̃_t^k} ∘ H̃^k ∘ (T_{η̃_t^k})^{−1}, denoting the inverse Hessian approximation at w̃^k simply as H̃^k. Here, T is an isometric vector transport explained in Section 3. H_t^k should be positive definite, i.e., H_t^k ≻ 0, and close to the Hessian of f, i.e., Hessf(w_t^k). It is noteworthy that H̃^k is calculated only once per outer epoch and is used for H_t^k throughout the corresponding k-th epoch.

Curvature pair (s_k^{k+1}, y_k^{k+1}): This paper particularly addresses the operator H̃^k used in L-BFGS, intended for large-scale data. Thus, let s_k^{k+1} and y_k^{k+1} be the variable variation and the gradient variation in T_{w̃^{k+1}}M, respectively, where the superscript expresses explicitly that they belong to T_{w̃^{k+1}}M. It should be noted that the curvature pair (s_k^{k+1}, y_k^{k+1}) is calculated in the new T_{w̃^{k+1}}M just after the k-th epoch finishes. Furthermore, after the epoch index k is incremented, the curvature pair must be used only in T_{w̃^k}M because the calculation of H̃^k is performed only in T_{w̃^k}M.

The variable variation s_k^{k+1} is calculated from the difference between w̃^{k+1} and w̃^k. This is represented by the tangent vector η_k from w̃^k to w̃^{k+1}, which is calculated using the inverse of the retraction, R_{w̃^k}^{−1}(w̃^{k+1}). Since η_k belongs to T_{w̃^k}M, transporting it onto T_{w̃^{k+1}}M yields

\[
s_k^{k+1} = T_{\eta_k}\eta_k \;\bigl(= T_{\eta_k} R_{\tilde{w}^k}^{-1}(\tilde{w}^{k+1})\bigr). \tag{3}
\]

The gradient variation y_k^{k+1} is calculated from the difference between the newly calculated full gradient gradf(w̃^{k+1}) ∈ T_{w̃^{k+1}}M and the previous one gradf(w̃^k) ∈ T_{w̃^k}M transported by T_{η_k} [20], as

\[
y_k^{k+1} = \kappa_k^{-1}\,\mathrm{grad}f(\tilde{w}^{k+1}) - T_{\eta_k}\mathrm{grad}f(\tilde{w}^k), \tag{4}
\]

where κ_k > 0 is explained in Section 3.

Inverse Hessian approximation operator H̃^k: H̃^k is calculated using the past curvature pairs. More specifically, H̃^{k+1} is updated as H̃^{k+1} = (V̌_k)^♭ Ȟ^k V̌_k + ρ_k s_k s_k^♭, where Ȟ^k = T_{η_k} ∘ H̃^k ∘ T_{η_k}^{−1}, ρ_k = 1/⟨y_k, s_k⟩, and V̌_k = id − ρ_k y_k s_k^♭ with the identity mapping id [20]. Therein, a^♭ denotes the flat of a ∈ T_w M, i.e., a^♭ : T_w M → R : v ↦ ⟨a, v⟩_w. Thus, H̃^k depends on H̃^{k−1} and (s_{k−1}, y_{k−1}), and similarly H̃^{k−1} depends on H̃^{k−2} and (s_{k−2}, y_{k−2}). Proceeding recursively, H̃^k is a function of the initial H̃^0 and all previous k curvature pairs {(s_j, y_j)}_{j=0}^{k−1}. Meanwhile, L-BFGS restricts use to the most recent L pairs {(s_j, y_j)}_{j=k−L}^{k−1}, since the pairs (s_j, y_j) with j < k − L are likely to carry little curvature information. Based on this idea, L-BFGS performs L updates starting from the initial H̃^0. We use the k pairs {(s_j, y_j)}_{j=0}^{k−1} when k < L.

Now, we consider the final calculation of H̃^k used for H_t^k in the inner iterations (2) of the k-th outer epoch using the L most recent curvature pairs. Since this calculation is executed in T_{w̃^k}M and a Riemannian manifold is in general not a vector space, all of the L curvature pairs must be located in T_{w̃^k}M. To this end, just after the curvature pair is calculated in (3) and (4), the past (L − 1) pairs {(s_j^k, y_j^k)}_{j=k−L+1}^{k−1} ∈ T_{w̃^k}M are transported into T_{w̃^{k+1}}M by the same vector transport T_{η_k} used when calculating s_k^{k+1} and y_k^{k+1}. It should be emphasized that this transport is necessary only once per outer epoch instead of at every inner iteration, which results in a drastic reduction of computational complexity in comparison with a straightforward extension of the Euclidean stochastic L-BFGS [24] to the manifold setting. Consequently, the update is defined as

\[
\tilde{\mathcal{H}}^k = \bigl((\check{V}_{k-1}^k)^\flat \cdots (\check{V}_{k-L}^k)^\flat\bigr)\, \check{\mathcal{H}}_0^k\, \bigl(\check{V}_{k-L}^k \cdots \check{V}_{k-1}^k\bigr)
+ \cdots
+ \rho_{k-2}\, (\check{V}_{k-1}^k)^\flat\, s_{k-2}^k (s_{k-2}^k)^\flat\, \check{V}_{k-1}^k
+ \rho_{k-1}\, s_{k-1}^k (s_{k-1}^k)^\flat, \tag{5}
\]
where V̌_j^k = id − ρ_j y_j^k (s_j^k)^♭, and Ȟ_0^k is the initial inverse Hessian approximation. Because Ȟ_0^k is not necessarily Ȟ^{k−L}, and because it can be any positive-definite self-adjoint operator, we use Ȟ_0^k = (⟨s_{k−1}^k, y_{k−1}^k⟩_{w̃^k} / ⟨y_{k−1}^k, y_{k−1}^k⟩_{w̃^k}) id, similar to the Euclidean case. The practical update of H̃^k uses the two-loop recursion algorithm [26, Section 7.2] shown in Algorithm A.1 of the supplementary material.

Cautious update: Euclidean L-BFGS can fail on non-convex problems because the Hessian approximation is not guaranteed to have eigenvalues that are bounded away from zero and uniformly bounded above. To circumvent this issue, the cautious update has been proposed in the Euclidean space [27]. Following this, we skip the update of the curvature pair when the following condition is not satisfied:

\[
\langle y_k^{k+1}, s_k^{k+1} \rangle_{\tilde{w}^{k+1}} \;\ge\; \epsilon\, \| s_k^{k+1} \|_{\tilde{w}^{k+1}}^2, \tag{6}
\]

where ε > 0 is a predefined constant parameter. With this update, the positive definiteness of H̃^k is guaranteed as long as H̃^{k−1} is positive definite, as seen in the proof of Proposition 3.1 in the supplementary material.

Second-order modified stochastic gradient H_t^k ξ_t^k: R-SVRG transports the full gradient gradf(w̃^k) and the stochastic gradient gradf_{i_t^k}(w̃^k) from T_{w̃^k}M into T_{w_t^k}M in order to combine them with the stochastic gradient gradf_{i_t^k}(w_t^k) in T_{w_t^k}M. If we followed the same strategy, we would also have to transport the L pairs {(s_j^k, y_j^k)}_{j=k−L}^{k−1} ∈ T_{w̃^k}M into the current T_{w_t^k}M at every inner iteration. Addressing this problem, and exploiting the fact that both the full gradient and the curvature pairs belong to the same tangent space T_{w̃^k}M, we instead transport gradf_{i_t^k}(w_t^k) from T_{w_t^k}M into T_{w̃^k}M and complete all of the calculations in T_{w̃^k}M. More specifically, after transporting gradf_{i_t^k}(w_t^k) as (T_{η̃_t^k})^{−1} gradf_{i_t^k}(w_t^k) from w_t^k to w̃^k using η̃_t^k (= R_{w̃^k}^{−1}(w_t^k)), the modified stochastic gradient ξ̃_t^k ∈ T_{w̃^k}M is computed as

\[
\tilde{\xi}_t^k = (T_{\tilde{\eta}_t^k})^{-1} \mathrm{grad}f_{i_t^k}(w_t^k) - \bigl(\mathrm{grad}f_{i_t^k}(\tilde{w}^k) - \mathrm{grad}f(\tilde{w}^k)\bigr). \tag{7}
\]

After calculating H̃_t^k ξ̃_t^k ∈ T_{w̃^k}M using the two-loop recursion algorithm, we obtain H_t^k ξ_t^k ∈ T_{w_t^k}M by transporting H̃_t^k ξ̃_t^k back to T_{w_t^k}M as T_{η̃_t^k} H̃_t^k ξ̃_t^k. Finally, we update w_{t+1}^k from w_t^k as in (2).
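As a concrete illustration of the end-of-epoch steps, the following minimal sketch computes the curvature pair (3)–(4) and the cautious check (6) on the unit sphere with the projection retraction; the cost, the data, and the threshold ε are illustrative assumptions, and the κ_k scaling from the locking condition is taken as approximately 1 because the projection-based transport below is only a stand-in for the paper's isometric vector transport.

```python
import numpy as np

proj        = lambda w, v: v - (w @ v) * w     # projection onto T_w S^{d-1}
transport   = lambda w_to, v: proj(w_to, v)    # projection-based vector transport T_eta
inv_retract = lambda w, z: z / (w @ z) - w     # R_w^{-1}(z) for R_w(xi) = (w+xi)/||w+xi||

def grad_f(w, X):
    # Full Riemannian gradient of the PCA-type cost f(w) = (1/2N) sum_n (||x_n||^2 - (w^T x_n)^2).
    return proj(w, -(X.T @ (X @ w)) / len(X))

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))
w_old = rng.standard_normal(20); w_old /= np.linalg.norm(w_old)   # w_tilde^k
step = -0.1 * grad_f(w_old, X)
w_new = (w_old + step) / np.linalg.norm(w_old + step)             # w_tilde^{k+1}

eta = inv_retract(w_old, w_new)                     # eta_k = R_{w_tilde^k}^{-1}(w_tilde^{k+1})
s = transport(w_new, eta)                           # eq. (3): s_k^{k+1} in T_{w_tilde^{k+1}} M
y = grad_f(w_new, X) - transport(w_new, grad_f(w_old, X))   # eq. (4) with kappa_k ~ 1 assumed
eps = 1e-8
if y @ s >= eps * (s @ s):                          # cautious update, eq. (6)
    pair = (s, y)                                   # store; older pairs would be transported
                                                    # to T_{w_tilde^{k+1}} by the same T_eta
```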
3 Convergence analysis
In this section, after defining the necessary assumptions, we bound the eigenvalues of H_t^k. Then, we present a global convergence analysis and a local convergence rate analysis. The complete proofs are given in the supplementary file.

Assumption 1. We assume the following [20]:

1. The objective function f and its components f_1, . . . , f_N are twice continuously differentiable.

2. For a sequence {w_t^k} generated by Algorithm 1, there exists a compact and connected set K ⊂ M such that w_t^k ∈ K for all k, t ≥ 0. Also, for each k ≥ 1, there exists a totally retractive neighborhood Θ_k of w̃^k such that w_t^k stays in Θ_k for any t ≥ 0, where a ρ-totally retractive neighborhood Θ of w is a set such that, for all z ∈ Θ, Θ ⊂ R_z(B(0_z, ρ)) and R_z(·) is a diffeomorphism on B(0_z, ρ), the ball in T_z M with center 0_z and radius ρ, where 0_z is the zero vector in T_z M. Furthermore, suppose that there exists I > 0 such that inf_{k≥1} {sup_{z∈Θ_k} ‖R_{w̃^k}^{−1}(z)‖_{w̃^k}} ≥ I.

3. The objective function f is strongly retraction-convex with respect to R in Θ. Here, f is said to be strongly retraction-convex in Θ if f(R_w(tη_w)), for all w ∈ M and η_w ∈ T_w M, is strongly convex, i.e., there exist two constants 0 < λ < Λ such that

\[
\lambda \;\le\; \frac{d^2 f(R_w(t\eta_w))}{dt^2} \;\le\; \Lambda, \tag{8}
\]

for all w ∈ Θ, all ‖η_w‖_w = 1, and all t such that R_w(τη_w) ∈ Θ for all τ ∈ [0, t].

4. The vector transport T is isometric on M, i.e., it satisfies ⟨T_{ξ_w}η_w, T_{ξ_w}ζ_w⟩_{R_w(ξ_w)} = ⟨η_w, ζ_w⟩_w for any w ∈ M and ξ_w, η_w, ζ_w ∈ T_w M.

5. There exists a constant c_0 such that the vector transport T satisfies the following conditions for all w, z ∈ U, which is a neighborhood of w̄:

\[
\| T_{\eta_w} - T_{R_{\eta_w}} \| \le c_0 \| \eta_w \|_w, \qquad \| T_{\eta_w}^{-1} - T_{R_{\eta_w}}^{-1} \| \le c_0 \| \eta_w \|_w,
\]

where T_R denotes the differentiated retraction, i.e., T_{R_{η_w}} ξ_w = DR_w(η_w)[ξ_w], η_w, ξ_w ∈ T_w M, and η_w = R_w^{−1}(z).

6. The vector transport T satisfies the locking condition, which is defined as

\[
T_{\eta_w}\xi_w = \kappa\, T_{R_{\eta_w}}\xi_w, \qquad \kappa = \frac{\|\xi_w\|_w}{\|T_{R_{\eta_w}}\xi_w\|_{R_w(\eta_w)}}, \tag{9}
\]

for all η_w, ξ_w ∈ T_w M and all w ∈ M.
The following proposition bounds the eigenvalues of H_t^k.

Proposition 3.1 (Eigenvalue bounds of H_t^k). Consider the operator H_t^k := T_{η̃_t^k} ∘ H̃^k ∘ (T_{η̃_t^k})^{−1}, where H̃^k is defined by the recursion in (5). If Assumption 1 holds, then there exist constants 0 < γ < Γ < ∞ such that the eigenvalues of H_t^k are bounded by γ and Γ for all k ≥ 1 and t ≥ 1, i.e.,

\[
\gamma\,\mathrm{id} \;\preceq\; \mathcal{H}_t^k \;\preceq\; \Gamma\,\mathrm{id}. \tag{10}
\]

It should be noted that, although −ξ_t^k is not in general guaranteed to be a descent direction, E_{i_t^k}[−ξ_t^k] = −gradf(w_t^k) is a descent direction. Furthermore, the positive definiteness of H_t^k implies that −H_t^k ξ_t^k is a descent direction on average, because E_{i_t^k}[−H_t^k ξ_t^k] = −H_t^k gradf(w_t^k).
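The following short calculation makes this remark explicit; it uses only the unbiasedness E_{i_t^k}[ξ_t^k] = gradf(w_t^k), the fact that H_t^k is deterministic once w_t^k and the stored curvature pairs are fixed, and the lower bound in (10):

\[
\mathbb{E}_{i_t^k}\!\left[ \bigl\langle \mathrm{grad}f(w_t^k),\, -\mathcal{H}_t^k \xi_t^k \bigr\rangle_{w_t^k} \right]
= -\bigl\langle \mathrm{grad}f(w_t^k),\, \mathcal{H}_t^k\, \mathrm{grad}f(w_t^k) \bigr\rangle_{w_t^k}
\;\le\; -\gamma\, \bigl\| \mathrm{grad}f(w_t^k) \bigr\|_{w_t^k}^2
\;<\; 0 \quad \text{whenever } \mathrm{grad}f(w_t^k) \neq 0.
\]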
3.1 Global convergence analysis
This section analyses global convergence to a critical point starting from any initialization point, which is common in a non-convex setting. We first state the additional assumptions required for this analysis.

Assumption 2. We assume the following:

1. f is bounded below by a scalar f_inf.

2. Since Θ is compact, all continuous functions on Θ are bounded. Therefore, there exists S > 0 such that, for all w ∈ Θ and n ∈ {1, . . . , N},

\[
\|\mathrm{grad}f(w)\|_w \le S \quad \text{and} \quad \|\mathrm{grad}f_n(w)\|_w \le S. \tag{11}
\]

3. The step-size sequence {α_t^k} satisfies

\[
\sum \alpha_t^k = \infty \quad \text{and} \quad \sum (\alpha_t^k)^2 < \infty. \tag{12}
\]
Now, we present the global convergence result.

Theorem 3.2. Consider Algorithm 1, and suppose that Assumptions 1 and 2 hold and that there exists a positive real number bounding the largest eigenvalue of the Riemannian Hessian of the mapping w ↦ ‖gradf(w)‖²_w for all w ∈ M. Then, we have lim_{k→∞} E[‖gradf(w_t^k)‖²_{w_t^k}] = 0.
3.2 Local convergence rate analysis
The analysis below presents a convergence rate in a neighborhood of a local minimum. This setting is also very common and standard in manifold optimization. For this purpose, we first briefly summarize the essential inequalities; they are detailed in the supplementary material. For all w, z ∈ U, which is a neighborhood of w̄, the difference between the parallel translation and the vector transport is bounded with a constant θ as (Lemma D.2)

\[
\| T_\eta \xi - P_\eta \xi \|_z \;\le\; \theta\, \|\xi\|_w \|\eta\|_w, \tag{13}
\]

where ξ, η ∈ T_w M and R_w(η) = z. Similarly, for the difference between the exponential mapping and the retraction, there exist τ_1 > 0 and τ_2 > 0 such that, for all w ∈ U and all ξ ∈ T_w M of sufficiently small length (Lemma D.3),

\[
\tau_1\, \mathrm{dist}(w, R_w(\xi)) \;\le\; \|\xi\|_w \;\le\; \tau_2\, \mathrm{dist}(w, R_w(\xi)). \tag{14}
\]

Then, the variance of ξ_t^k is upper bounded by (Lemma D.5)

\[
\mathbb{E}_{i_t^k}\bigl[\|\xi_t^k\|_{w_t^k}^2\bigr] \;\le\; 4(\beta^2 + \tau_2^2 C^2 \theta^2)\bigl(7(\mathrm{dist}(w_t^k, w^*))^2 + 4(\mathrm{dist}(\tilde{w}^k, w^*))^2\bigr), \tag{15}
\]

where C is the upper bound of ‖gradf_n(w)‖_w for w ∈ Θ, β is a Lipschitz constant, and θ is the constant in (13). Now, we are ready to give a local convergence rate.

Theorem 3.3. Let M be a Riemannian manifold and w* ∈ M be a non-degenerate local minimizer of f (i.e., gradf(w*) = 0 and the Hessian Hessf(w*) of f at w* is positive definite). Suppose Assumption 1 holds. Let λ and Λ be the constants in (8), γ and Γ the constants in Proposition 3.1, θ the constant in (13), τ_1 and τ_2 the constants in (14), and β and C the constants in (15). Let α be a positive number satisfying λτ_1² > 2α(λ²τ_1² − 14αΛΓ²(β² + τ_2²C²θ²)) and γλ²τ_1² > 14αΛΓ²(β² + τ_2²C²θ²). It then follows that, for any sequence {w̃^k} generated by Algorithm 1 with a fixed step-size α_t^k := α and m_k := m converging to w*, there exists K > 0 such that, for all k > K,

\[
\mathbb{E}\bigl[(\mathrm{dist}(\tilde{w}^{k+1}, w^*))^2\bigr] \;\le\;
\frac{2\bigl(\Lambda\tau_2^2 + 16 m \alpha^2 \Lambda \Gamma^2 (\beta^2 + \tau_2^2 C^2 \theta^2)\bigr)}{m\alpha\bigl(\gamma\lambda^2\tau_1^2 - 14\alpha\Lambda\Gamma^2(\beta^2 + \tau_2^2 C^2\theta^2)\bigr)}\;
\mathbb{E}\bigl[(\mathrm{dist}(\tilde{w}^k, w^*))^2\bigr]. \tag{16}
\]
It should also be noted that using the parallel translation P as the vector transport can lead to a smaller Γ and a larger γ in (10). As a result, the coefficient above becomes smaller, which can result in a faster local convergence rate.
4 Numerical comparisons
This section compares the performance of R-SQN-VR with R-SGD and R-SVRG. While R-SQN-VR and R-SVRG use a fixed step-size α_t^k = α_0, R-SGD uses a decaying step-size sequence α_k = α_0(1 + α_0 ν ⌊k/m_k⌋)^{−1}, where ⌊·⌋ denotes the floor function. We select multiple choices of α_0 and set ν = 10^{−3}. We also compare them with R-SD, the steepest descent algorithm on Riemannian manifolds with backtracking line search [23, Chapter 4]. All experiments herein use m_k = 5N, as recommended by [7]. It should be noted that all results except those of R-SD are the best-tuned results. All experiments are executed in Matlab on a 3.0 GHz Intel Core i7 PC with 16 GB RAM. This paper addresses two problems: the Karcher mean computation problem of symmetric positive-definite (SPD) matrices, and the low-rank matrix completion (MC) problem on the Grassmann manifold. In these problems, full gradient methods become prohibitively expensive when N is very large; the stochastic gradient approach is one promising way to achieve scalability.
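For concreteness, the decaying schedule used by R-SGD can be written as the small helper below; α_0 = 0.1 is only an illustrative value from the tuned grid, not the reported best setting.

```python
def rsgd_stepsize(k, alpha0, nu=1e-3, m=5000):
    """Decaying R-SGD step-size alpha_k = alpha0 * (1 + alpha0 * nu * floor(k/m))**(-1)."""
    return alpha0 / (1.0 + alpha0 * nu * (k // m))

# e.g. with alpha0 = 0.1 and m_k = 5N for N = 1000 samples:
print([round(rsgd_stepsize(k, 0.1), 5) for k in (0, 5000, 50000, 500000)])
```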
4.1 SPD manifold and Karcher mean problem
SPD manifold S_{++}^d. Let S_{++}^d be the manifold of d × d SPD matrices. If we endow S_{++}^d with the Riemannian metric defined by ⟨ξ_X, η_X⟩_X = trace(ξ_X X^{−1} η_X X^{−1}) at X ∈ S_{++}^d, the SPD manifold S_{++}^d becomes a Riemannian manifold. The explicit formula for the exponential mapping is given by Exp_X(ξ_X) = X^{1/2} exp(X^{−1/2} ξ_X X^{−1/2}) X^{1/2} for any ξ_X ∈ T_X S_{++}^d and X ∈ S_{++}^d. On the other hand, R_X(ξ_X) = X + ξ_X + (1/2) ξ_X X^{−1} ξ_X, proposed in [28], is a retraction that remains symmetric positive-definite for all ξ_X ∈ T_X S_{++}^d and X ∈ S_{++}^d. The parallel translation on S_{++}^d along η_X is given by P_{η_X}(ξ_X) = X^{1/2} Y X^{−1/2} ξ_X X^{−1/2} Y X^{1/2}, where
Figure 1: Performance evaluations on the Karcher mean problem (KM) and the low-rank matrix completion problem (MC). (a) Case KM-1: small size instance. (b) Case KM-2: large size instance. (c) Case MC-S1: baseline. (d) Case MC-S2: influence of low sampling. (e) Case MC-S3: influence of ill-conditioning. (f) Case MC-S4: influence of large memory size. (g) Case MC-S5: influence of noise. (h) Case MC-S6: influence of higher rank. (i) Case MC-R: MovieLens-1M.
Y = exp(X^{−1/2} η_X X^{−1/2}/2). A more efficient construction of an isometric vector transport, based on a field of orthonormal tangent bases, is proposed in [29] and satisfies the locking condition (9). We use it in this experiment; the details are in Appendix E. The logarithm map of Y at X is given by Log_X(Y) = X^{1/2} log(X^{−1/2} Y X^{−1/2}) X^{1/2} = log(YX^{−1})X.

Karcher mean problem on S_{++}^d. The Karcher mean was introduced as a notion of mean on Riemannian manifolds by Karcher [30]. It generalizes the notion of an "average" on a manifold. Given N points on S_{++}^d with matrix representations Q_1, . . . , Q_N, the Karcher mean is defined as the solution to the problem

\[
\min_{X \in S_{++}^d}\; \frac{1}{2N}\sum_{n=1}^{N} (\mathrm{dist}(X, Q_n))^2, \tag{17}
\]
where dist(p, q) = ‖log(p^{−1/2} q p^{−1/2})‖_F represents the distance along the corresponding geodesic between elements of S_{++}^d with respect to the affine-invariant metric. The gradient of the loss function in (17) is computed as (1/N) Σ_{n=1}^N −log(Q_n X^{−1})X. The Karcher mean on S_{++}^d is frequently used in computer vision problems, such as visual object categorization and pose categorization [31]. Since recursive calculations are needed for each visual image, stochastic gradient algorithms become an appealing choice for large datasets. All experiments use a batch size fixed to 1 and L = 2, are initialized randomly, and are stopped when the number of iterations reaches 60 for R-SD and R-SGD, and 12 for R-SQN-VR and R-SVRG, or when the gradient norm drops below 10^{−11}. Figures 1(a) and (b) show the optimality gap for the small size instance with N = 1000 and d = 3 (Case KM-1) and the large size instance with N = 10000 and d = 10 (Case KM-2), respectively. These results reveal that R-SQN-VR outperforms R-SGD, R-SVRG, and R-SD in convergence speed.
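As an illustration of this gradient, the following minimal sketch evaluates grad f(X) with the formula above and runs a few plain Riemannian gradient descent steps using the retraction R_X(ξ_X) = X + ξ_X + (1/2)ξ_X X^{−1}ξ_X quoted from [28]; the random SPD data, the step size, and the iteration count are illustrative assumptions, not the paper's experimental cases.

```python
import numpy as np
from scipy.linalg import logm

def random_spd(d, rng):
    # A well-conditioned random SPD matrix (illustrative data, not the paper's instances).
    A = rng.standard_normal((d, d))
    return A @ A.T + d * np.eye(d)

def karcher_grad(X, Qs):
    # grad f(X) = (1/N) * sum_n -logm(Q_n X^{-1}) X, a symmetric tangent vector at X.
    Xinv = np.linalg.inv(X)
    return -np.mean([logm(Q @ Xinv).real for Q in Qs], axis=0) @ X

def retract_spd(X, xi):
    # R_X(xi) = X + xi + 0.5 * xi X^{-1} xi, the retraction from [28] quoted in the text.
    return X + xi + 0.5 * xi @ np.linalg.solve(X, xi)

rng = np.random.default_rng(0)
Qs = [random_spd(3, rng) for _ in range(10)]
X = np.mean(Qs, axis=0)                       # arithmetic mean as a starting point
for _ in range(50):
    X = retract_spd(X, -0.5 * karcher_grad(X, Qs))
print(np.linalg.norm(karcher_grad(X, Qs)))    # the gradient norm should shrink toward zero
```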
4.2 Grassmann manifold and MC problem
Grassmann manifold Gr(r, d). A point on the Grassmann manifold is an equivalence class represented by a d × r orthogonal matrix U with orthonormal columns, i.e., UT U = I. Two orthogonal matrices express the same element on the Grassmann manifold if they are related by right multiplication of an r × r orthogonal matrix O ∈ O(r). Equivalently, an element of Gr(r, d) is identified with a set of d × r orthogonal matrices [U] := {UO : O ∈ O(r)}. That is, Gr(r, d) := St(r, d)/O(r), where St(r, d) is the Stiefel manifold that is the set of matrices of size d × r with orthonormal columns. The Grassmann manifold has the structure of a Riemannian quotient manifold [23, Section 3.4]. The exponential mapping for the Grassmann manifold from U(0) := U ∈ Gr(r, d) in the direction of ξ ∈ TU(0) Gr(r, d) is given in a closed form as [23, Section 5.4] U(t) = [U(0)V W] [cos tΣ; sin tΣ] VT , where ξ = WΣVT is the singular value decomposition (SVD) of ξ with rank r. The sin(·) and cos(·) operations are performed only on the diagonal entries. The parallel translation of ζ ∈ TU(0) Gr(r, d) on the Grassmann manifold along γ(t) with γ(0) ˙ = WΣVT is given in a closed form by ζ(t) = ([U(0)V W] [− sin tΣ; cos tΣ] WT + (I − WWT ))ζ. The logarithm map of U(t) at U(0) on the Grassmann manifold is given by ξ = LogU(0) (U(t)) = W arctan(Σ)VT , where WΣVT is the SVD of (U(t)−U(0)U(0)T U(t)) (U(0)T U(t))−1 with rank r. Furthermore, a popular retraction is RU(0) (ξ) = qf(U(0) + tξ)(= U(t)) which extracts the orthonormal factor based on QR decomposition, and a popular
vector transport uses the orthogonal projection of tξ onto the horizontal space at U(t), i.e., (I − U(t)U(t)^T)tξ [23].

Matrix completion problem. The matrix completion problem is that of completing an incomplete matrix X, say of size d × N, from a small number of entries by assuming that the latent structure of the matrix is low-rank. If Ω is the set of known indices in X, the rank-r matrix completion problem amounts to solving min_{U,A} ‖P_Ω(UA) − P_Ω(X)‖²_F, where U ∈ R^{d×r}, A ∈ R^{r×N}, and the operator P_Ω acts as P_Ω(X_{ij}) = X_{ij} if (i, j) ∈ Ω and P_Ω(X_{ij}) = 0 otherwise. Partitioning X = [x_1, . . . , x_N], the previous problem is equivalent to

\[
\min_{U \in \mathbb{R}^{d \times r},\; a_n \in \mathbb{R}^r}\; \frac{1}{N}\sum_{n=1}^{N} \| P_{\Omega_n}(U a_n) - P_{\Omega_n}(x_n) \|_2^2, \tag{18}
\]
where x_n ∈ R^d and the operator P_{Ω_n} is the sampling operator for the n-th column. Given U, a_n admits a closed-form solution. Consequently, the problem depends only on the column space of U and is defined on Gr(r, d) [32]. We first consider a synthetic dataset. The proposed algorithm is also compared with Grouse [1], a state-of-the-art stochastic gradient algorithm on the Grassmann manifold. The algorithms are initialized randomly, as suggested in [33]. The multiple choices of α_0 are {10^{−3}, 5×10^{−3}, . . . , 10^{−1}, 5×10^{−1}} for R-SGD, R-SVRG, and R-SQN-VR, and {5×10^{−1}, 1, 5, 10} for Grouse. We explicitly set the condition number (CN) of the matrix, which is the ratio of the maximal and minimal singular values of the matrix. We also set the over-sampling ratio (OS), which prescribes the number of known entries: an OS of 5 means that we select 5(N + d − r)r known entries uniformly at random a priori out of the total Nd entries. Gaussian noise is also added with noise level σ, as suggested in [33]. The batch size is fixed to 10. This experiment evaluates two combinations of vector transport and retraction, namely, the parallel translation with the exponential mapping, and the projection-based vector transport with the QR-decomposition-based retraction. While the former combination satisfies the conditions required for convergence, the latter does not, but it is computationally more efficient. The baseline problem instance is the case of N = 5000 samples, dimension d = 500, memory size L = 5, OS = 6, CN = 5, and σ = 10^{−10} (Case MC-S1). Additionally, we evaluate the lower-sampling case with OS = 4 (Case MC-S2), the ill-conditioned case with CN = 10 (Case MC-S3), the larger memory size case with L = 10 (Case MC-S4), the higher noise case with σ = 10^{−6} (Case MC-S5), and the higher rank case with r = 10 (Case MC-S6). The resulting MSE on a test set Φ, which is disjoint from the training set Ω, is shown in Figures 1(c)-(h), respectively; this gives the prediction accuracy for the missing elements. From the figures, we confirm the superior performance of R-SQN-VR.

Finally, we compare the algorithms on a real-world dataset, the MovieLens-1M dataset (http://grouplens.org/datasets/movielens/). It contains a million ratings for 3952 movies (N) by 6040 users (d). We randomly split the data into 80/10/10 percent train/validation/test partitions. α_0 is chosen from {10^{−5}, 5×10^{−5}, . . . , 10^{−2}, 5×10^{−2}}, the rank r is set to {10, 20}, and L is 5. The QR-decomposition-based vector transport is used, assuming a practical deployment, owing to the size of the dataset. The algorithms are terminated when the MSE on the validation set starts to increase or the number of iterations reaches 100. Figure 1(i) shows the results on the test set when r = 20 for all the algorithms except Grouse, which faces issues
with convergence on this dataset (Case MC-R). R-SQN-VR shows much faster convergence than the others. Table 1 also shows, for r = 10 and r = 20, the best result for each algorithm, i.e., the lowest test MSE at termination over five runs. This also shows that the proposed R-SQN-VR gives faster convergence and lower test MSE than the other algorithms, especially when r is larger.
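To make the per-sample computation concrete, the following minimal sketch performs Riemannian stochastic gradient steps for the cost (18): it solves the least-squares subproblem for a_n in closed form, projects the Euclidean gradient onto the horizontal space, and applies the QR-based retraction; the synthetic data, the sampling rate, and the step size are illustrative assumptions rather than the paper's Case MC-* settings.

```python
import numpy as np

def column_grad(U, x_obs, idx):
    # Closed-form a_n for the observed entries of column n, then the Euclidean
    # gradient of ||P_{Omega_n}(U a_n) - P_{Omega_n}(x_n)||^2 / 2 with respect to U.
    a, *_ = np.linalg.lstsq(U[idx], x_obs, rcond=None)
    resid = U[idx] @ a - x_obs
    egrad = np.zeros_like(U)
    egrad[idx] = np.outer(resid, a)
    return egrad

def grassmann_sgd_step(U, egrad, alpha):
    rgrad = egrad - U @ (U.T @ egrad)          # projection onto the horizontal space
    Q, _ = np.linalg.qr(U - alpha * rgrad)     # QR-based retraction qf(U + xi)
    return Q

rng = np.random.default_rng(0)
d, N, r = 100, 500, 5
X = rng.standard_normal((d, r)) @ rng.standard_normal((r, N))   # exactly rank-r data
mask = rng.random((d, N)) < 0.3                                  # observed entries Omega
U = np.linalg.qr(rng.standard_normal((d, r)))[0]                 # random initial subspace
for _ in range(2000):
    n = rng.integers(N)
    idx = np.flatnonzero(mask[:, n])
    U = grassmann_sgd_step(U, column_grad(U, X[idx, n], idx), alpha=0.1)
```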
Table 1: Test MSE on Φ and number of gradient evaluations (MovieLens-1M).

r    Algorithm    #grad/N            MSE on Φ
10   R-SD         399.6 ± 8.9_{−1}   2.597056_{−4} ± 3.2_{−6}
10   R-SGD        375.0 ± 1.8_{1}    2.629046_{−4} ± 1.0_{−6}
10   R-SVRG       205.4 ± 4.1_{1}    2.529259_{−4} ± 4.7_{−7}
10   R-SQN-VR     116.6 ± 3.6_{1}    2.523758_{−4} ± 3.4_{−7}
20   R-SD         400 ± 0            1.950905_{−4} ± 2.1_{−6}
20   R-SGD        382.0 ± 2.8        1.975479_{−4} ± 9.6_{−7}
20   R-SVRG       356.0 ± 5.5_{1}    1.896678_{−4} ± 4.8_{−7}
20   R-SQN-VR     137.0 ± 5.1_{1}    1.888096_{−4} ± 5.0_{−7}

The subscript k indicates a scale of 10^k.
5 Conclusions
We have proposed a Riemannian stochastic quasi-Newton algorithm with variance reduction (R-SQN-VR). We particularly addressed retraction and vector transport operations. The proposed algorithm stems from its counterpart in the Euclidean space, but is extended to Riemannian manifolds. The central difficulty of averaging, adding, and subtracting multiple gradients on a Riemannian manifold is handled by exploiting vector transport and retraction. We proved that R-SQN-VR generates globally convergent sequences under a decaying step-size condition and is locally linearly convergent under some natural assumptions. Numerical comparisons suggested the superior performance of R-SQN-VR on various benchmarks.
References [1] L. Balzano, R. Nowak, and B. Recht. Online identification and tracking of subspaces from highly incomplete information. In Allerton, pages 704–711, 2010. [2] B. Mishra and R. Sepulchre. R3MC: A Riemannian three-factor algorithm for low-rank matrix completion. In IEEE CDC, pages 1137–1142, 2014. [3] H. Kasai and B. Mishra. Low-rank tensor completion: a Riemannian manifold preconditioning approach. In ICML, 2016. [4] G. Meyer, S. Bonnabel, and R. Sepulchre. Linear regression under fixed-rank constraints: A Riemannian approach. In ICML, 2011. [5] S. Bonnabel. Stochastic gradient descent on Riemannian manifolds. IEEE Trans. on Automatic Control, 58(9):2217–2229, 2013. [6] H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statistics, pages 400–407, 1951. [7] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pages 315–323, 2013. [8] N. L. Roux, M. Schmidt, and F. R. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In NIPS, pages 2663–2671, 2012. [9] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. JMLR, 14:567–599, 2013. [10] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, 2014. [11] S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola. Stochastic variance reduction for nonconvex optimization. In ICML, 2016. [12] N. N. Schraudolph, J. Yu, and S. Gunter. A stochastic quasi-Newton method for online convex optimization. In AISTATS, 2007. [13] A. Mokhtari and A. Ribeiro. RES: Regularized stochastic BFGS algorithm. IEEE Trans. on Signal Process., 62(23):6089–6104, 2014. [14] X. Wang, S. Ma, D. Goldfarb, and W. Liu. Stochastic quasi-Newton methods for nonconvex stochastic optimization. arXiv preprint arXiv:1607.0123, 2016. [15] R. H. Byrd, S. L. Hansen, J. Nocedal, and Y. Singer. A stochastic quasi-Newton method for large-scale optimization. SIAM Journal on Optimization, 26(2), 2016. [16] P. Moritz, R. Nishihara, and M. I. Jordan. A linearly-convergent stochastic L-BFGS algorithm. In AISTATS, pages 249–258, 2016. [17] R. Kolte, M. Erdogdu, and A. Ozgur. Accelerating SVRG via second-order information,. In OPT2015, 2015.
[18] D. Gabay. Minimizing a differentiable function over a differential manifold. Journal of Optimization Theory and Applications, 37(2):177–219, 1982. [19] W. Ring and B. Wirth. Optimization methods on Riemannian manifolds and their application to shape space. SIAM Journal on Optimization, 22(2):596–627, 2012. [20] W. Huang, K. A. Gallivan, and P.-A. Absil. A Broyden class of quasi-Newton methods for Riemannian optimization. SIAM J. Optim., 25(3):1660–1685, 2015. [21] H. Sato, H. Kasai, and B. Mishra. Riemannian stochastic variance reduced gradient. arXiv preprint arXiv:1702.05594, 2017. [22] H. Zhang, S. J. Reddi, and S. Sra. Fast stochastic optimization on Riemannian manifolds. In NIPS, 2016. [23] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008. [24] A. Mokhtari and A. Ribeiro. Global convergence of online limited memory BFGS. JMLR, 16:3151–3181, 2015. [25] N. Boumal, B. Mishra, P.-A. Absil, and R. Sepulchre. Manopt: a Matlab toolbox for optimization on manifolds. JMLR, 15(1):1455–1459, 2014. [26] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, New York, USA, 2006. [27] D. Li and M. Fukushima. On the global convergence of the BFGS method for nonconvex unconstrained optimization. SIAM J. Optim., 11(4):1054–1064, 2001. [28] B. Jeuris, R. Vandebril, and B. Vandereycken. A survey and comparison of contemporary algorithms for computing the matrix geometric mean. ETNA, 2012. [29] X. Yuan, W. Huang, P.-A. Absil, and K. A. Gallivan. A Riemannian limited-memory BFGS algorithm for computing the matrix geometric mean. In ICCS, 2016. [30] H. Karcher. Riemannian center of mass and mollifier smoothing. Comm. Pure Appl. Math., 30(5):509–541, 1977. [31] S. Jayasumana, R. Hartley, M. Salzmann, H. Li, and M. Harandi. Kernel methods on Riemannian manifolds with Gaussian RBF kernels. IEEE Trans. Pattern Anal. Mach. Intell., 37(12), 2015. [32] N. Boumal and P.-A. Absil. Low-rank matrix completion via preconditioned optimization on the Grassmann manifold. Linear Algebra and its Applications, 475:200–239, 2015. [33] D. Kressner, M. Steinlechner, and B. Vandereycken. Low-rank tensor completion by Riemannian optimization. BIT Numer. Math., 54(2):447–468, 2014. [34] B. Mishra and R. Sepulchre. Riemannian preconditioning. SIAM J. Optim., 26(1):635–660, 2016. [35] L. Bottou, F. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.
[36] W. Huang, P.-A. Absil, and K. A. Gallivan. A Riemannian symmetric rank-one trust-region method. Math. Program., Ser. A, 150:179–216, 2015. [37] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
A Two-loop Hessian inverse updating algorithm
Algorithm A.1 Hessian Inverse Updating
Require: Pair-updating counter t, memory depth τ, correction pairs {s_u^k, y_u^k}_{u=k−τ}^{k−1}, gradient p.
1: p_0 = p.
2: H_0^k = γ_k id = (⟨s_t^k, y_t^k⟩ / ⟨y_t^k, y_t^k⟩) id.
3: for u = 0, 1, 2, . . . , τ − 1 do
4:   ρ_{k−u−1} = 1/⟨s_{k−u−1}^k, y_{k−u−1}^k⟩.
5:   α_u = ρ_{k−u−1} ⟨s_{k−u−1}^k, p_u⟩.
6:   p_{u+1} = p_u − α_u y_{k−u−1}^k.
7: end for
8: q_0 = H_0^k p_τ.
9: for u = 0, 1, 2, . . . , τ − 1 do
10:  β_u = ρ_{k−τ+u} ⟨y_{k−τ+u}^k, q_u⟩.
11:  q_{u+1} = q_u + (α_{τ−u−1} − β_u) s_{k−τ+u}^k.
12: end for
13: q = q_τ.
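For reference, the following is a functionally equivalent Python sketch of this two-loop recursion, assuming that all tangent vectors have already been expressed as coordinate vectors in T_{w̃^k}M (so plain array arithmetic is valid); the variable names are illustrative, not part of the paper's implementation.

```python
import numpy as np

def two_loop(p, pairs):
    """Approximate H_tilde^k p from the stored curvature pairs, given oldest to newest."""
    q = p.copy()
    alphas = []
    for s, y in reversed(pairs):                     # first loop: newest to oldest
        a = (s @ q) / (y @ s)
        alphas.append(a)
        q = q - a * y
    if pairs:                                        # initial scaling H_0^k = (<s,y>/<y,y>) id
        s, y = pairs[-1]
        q = ((s @ y) / (y @ y)) * q
    for (s, y), a in zip(pairs, reversed(alphas)):   # second loop: oldest to newest
        beta = (y @ q) / (y @ s)
        q = q + (a - beta) * s
    return q

# e.g. with an empty memory the recursion reduces to the identity:
# two_loop(np.ones(3), []) -> array([1., 1., 1.])
```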
B Proof of Proposition 3.1
This section presents the proof of Proposition 3.1, an essential proposition that bounds the eigenvalues of H_t^k at w_t^k, i.e., of H_t^k := T_{η̃_t^k} ∘ H̃^k ∘ (T_{η̃_t^k})^{−1}. To this end, we particularly use the Hessian approximation operator B̃^k = (H̃^k)^{−1} as opposed to H̃^k. Since, as mentioned in the algorithm description, we compute the curvature information for H̃^k at w̃^k, i.e., once every outer epoch, and reuse this H̃^k in the calculation of the second-order modified stochastic gradient H_t^k ξ_t^k at w_t^k, the proof consists of the following two steps:

1. We first address the bounds of H̃^k at w̃^k. The main task of the proof is to bound the Hessian operator B̃^k = (H̃^k)^{−1}.

2. Next, we bound H_t^k at w_t^k based on the bounds of H̃^k at w̃^k.

It should be noted that, in this section, the curvature pairs {s_j^k, y_j^k}_{j=k−L}^{k−1} ∈ T_{w̃^k}M are simply denoted {s_j, y_j}_{j=k−L}^{k−1}, and we omit the subscript w̃^k of the Riemannian metric ⟨·, ·⟩_{w̃^k} when the tangent space under consideration is clear.
B.1 Preliminary lemmas
This section first states some essential lemmas. The literature [23] generalizes a Taylor’s theorem to Riemannian manifolds. However, it addresses the exponential mapping instead of the retraction. Therefore, [20] applys Taylor’s theorem on the retraction by newly introducing a function along a curve on the manifold. Here, we denote f (Rwk (tηk /kηk k)) for a twice continuously differentiable objective function. Since f is strongly retraction-convex on Θ by
17
Assumption 1.3, there exist constants 0 < λ < Λ such that λ ≤ t ∈ [0, αtk kηk k] as in (8). From Taylor’s theorem, we obtain
d2 f (Rwk (tηk /kηk k)) dt2
≤ Λ for all
Lemma B.1 (In Lemma 3.2 in [20]). Under Assumptions 1.1, 1.2 and 1.3, there exists λ such that 1 k f (wt+1 ) − f (wtk ) ≥ hgradf (wtk ), αtk ηk iwk + λ(αtk kηk k)2 . t 2
(A.1)
There also exists Λ such that 1 k f (wt+1 ) − f (wtk ) ≤ hgradf (wtk ), αtk ηk iwk + Λ(αtk kηk k)2 . t 2
(A.2)
Proof. From Taylor’s theorem, we have k f (wt+1 ) − f (wtk ) = f (Rwk (αtk ηk )) − f (Rwk (0))
df (Rwk (0)) k 1 d2 f (Rwk (pηk /kηk k)) k αt kηk k + (αt kηk k)2 dt 2 dt2 1 d2 f (Rwk (pηk /kηk k)) k = hgradf (wtk ), αtk ηk iwk + (αt kηk k)2 t 2 dt2 1 ≥ hgradf (wtk ), αtk ηk iwk + λ(αtk kηk k)2 , t 2
=
where 0 ≤ p ≤ αtk kηk k, and the inequality uses (8). This yields (A.1). Similarly, (A.2) is also derived. This completes the proof. Lemma B.2 (Lemma 3.3 in [20]). Under Assumptions 1.1, 1.2, 1.3, 1.4 and 1.6, there exist two constants 0 < λ < Λ such that λ ≤
hsk , yk i ≤ Λ hsk , sk i
(A.3)
for all k. Constants λ and Λ can be chosen as in (8). Proof. From Lemma B.1, the proof of this lemma is given, but we omit it. The reader can see the full proof in Lemma 3.3 in [20]. Lemma B.3. Suppose Assumption 1 holds, there exist a constant 0 < µ1 then for all k such that hyk , yk i . hsk , yk i
µ1 ≤
(A.4)
Proof. This is given by applying Cauchy-Schwarz inequality to the condition (6) recursively. More specifically, (6) yields that ksk k2 ≤ hyk , sk i ≤ kyk kksk k, and considering the most left and right terms, we obtain ksk k ≤ 18
1 kyk k.
Substituting this into the above equation yields hsk , yk i ≤ ksk kkyk k ≤
1 kyk k2 .
Consequently, we obtain kyk k2 hsk , yk i
≥ .
This complete the claim by denoting as µ1 . Lemma B.4 (Lemma 3.9 in [20]). Suppose Assumption 1 holds, there exist a constant 0 < µ2 for all k such that hyk , yk i ≤ µ2 . hsk , yk i
(A.5)
Proof. (Lemma 3.9 in [20]) The complete proof is given in [20], but this paper provides it ˜ k+1 ) − Pγ1←0 for the subsequent analysis. Define ykP = gradf (w gradf (w ˜ k ), where γk (t) = k Rw˜k (tαk ηk ), i.e., the retraction curve connecting wk and wk+1 , and Pγk is the parallel trans¯ k αk ηk k ≤ b0 kαk ηk k2 = b0 ksk k2 , where H ¯k = lation along γk (t). We have kPγ1←0 ykP − H k R 1 0←t t←0 0 Pγk Hessf (γk (t))Pγk dt and b0 > 0. It follows that kyk k ≤ kyk − ykP k + kykP k ykP k = kyk − ykP k + kPγ0←1 k ¯ k αk ηk k + kH ¯ k αk ηk k ykP − H ≤ kyk − ykP k + kPγ0←1 k ≤ kgradf (w ˜ k+1 )/κk − Tαk ηk gradf (w ˜ k ) − gradf (w ˜ k+1 ) + Pγ0←1 gradf (w ˜ k )k k ¯ k αk ηk k + b0 ksk k2 +kH ≤ kgradf (w ˜ k+1 )/κk − gradf (w ˜ k+1 )k + kPγ0←1 gradf (w ˜ k ) − Tαk ηk gradf (w ˜ k )k k ¯ k αk ηk k + b0 ksk k2 +kH ≤ b1 ksk kkgradf (w ˜ k+1 )k + b2 ksk kkgradf (w ˜ k )k + b3 ksk k + b0 ksk k2
(A.6)
≤ b4 ksk k, where b1 , b2 , b3 , and b4 > 0. Therefore, by Lemma B.2, we have hyk , yk i hyk , yk i b2 ≤ ≤ 4. hsk , yk i λhsk , sk i λ
(A.7)
This complete the proof. Remark B.5. From the proof of Lemma B.4, if the parallel translation is used for vector transport, i.e., T = P , the first two terms in (A.6) are equal to zero, and the upper bound µ2 in (A.5) can get smaller than that of the case in the vector transport. ˆ˜ and det(B) ˆ˜ in order to bound the eigenvalues of H ˜k, Now, we attempt to bound trace(B) where a hat denotes the coordinate expression of the operator. The basic structure of the proof follows stochastic L-BFGS methods in the Euclidean space, e.g., [15, 24]. Nevertheless, some special treatments considering the Riemannian setting and the lemmas earlier are required. ˆ ˆ˜ do not depend on the chosen basis. ˜ and det(B) It should be noted that trace(B) 19
Lemma B.6 (Bounds of trace and determinant of B˜k ). Consider the recursion of B˜uk as [ yk−τ +t yk−τ Bˇk sk−τ +u (Bˇuk sk−τ +u )[ +u k B˜u+1 = Bˇuk − u k + , [ (Bˇu sk−τ +u )[ sk−τ +u yk−τ s +u k−τ +u
(A.8)
where Bˇuk = Tηk B˜uk (Tηk )−1 for u = 0, . . . , τ − 1. The Hessian approximation at k-th outer epoch is B˜k = B˜τk when u = τ − 1. Then, consider the Hessian approximation B˜k = B˜τk in (A.8) with B˜0k = γk−1 id. If Assumption 1 holds, the trace(Bˆ˜k ) in a coordinate expression of B˜k is uniformly upper bounded for all k ≥ 1, trace(Bˆ˜k ) ≤ (M + τ )µ2 .
(A.9)
where M is the dimension ofM. Similarly, if Assumption 1 holds, the det(Bˆ˜k ) in a coordinate expression of B˜k is uniformly lower bounded for all k, τ λ ˆ k M ˜ det(B ) ≥ µ1 . (A.10) (M + τ )µ2 Here, a hat expression represents the coordinate expression of an operator. Proof. The proof can be completed parallel to the Euclidean case [16]. We use a hat symbol k in order to represent the coordinate expression of the operator B˜u+1 and Bˇuk in update formula (A.8). Because T is an isometry vector transport, Tηk is invertible for all k. Accordingly, ˜k ) can be reformulated as trace(Bˆ˜k ) and det(Bˆ ˇk ) = trace(Tˆη Bˆ˜k Tˆη−1 ) = trace(Bˆ˜k ), trace(Bˆ k k ˆ ˆ k k ˆ −1 ˇ ˆ ˜ det(B ) = det(Tηk B Tηk ) = det(Bˆ˜k ).
(A.11) (A.12)
We first consider the trace lower bound of trace(Bˆ˜τk ) from (A.11) and (A.8) as k ˜u+1 ˇuk ) − trace(Bˆ ) = trace(Bˆ
hBˆˇuk sk−τ +u , Bˆˇuk sk−τ +u i hyk−τ +u , yk−τ +u i + hyk−τ +u , sk−τ +u i hBˆˇk sk−τ +u , sk−τ +u i u
˜uk ) − = trace(Bˆ
kBˆˇuk sk−τ +u k2 hBˆˇuk sk−τ +u , sk−τ +u i
+
kyk−τ +u k2 . hyk−τ +u , sk−τ +u i
ˇk guarantees the negativity of the second term. Therefore, Here, the positive definiteness of Bˆ u the bound of the third term yields from Lemma B.4 as ˜k ) ≤ trace(Bˆ˜k ) + µ2 . trace(Bˆ u+1 u By calculating recursively this for u = 0, · · · , τ − 1, we can conclude that trace(Bˆ˜uk ) ≤ trace(Bˆ˜0k ) + uµ2 . ˜k ). For this purpose, we consider the definition Bˆ˜k = id/γk All that is left is to bound trace(Bˆ 0 0 hsk ,yk i where, as a common choice in L-BFGS in the Euclidean, γk = hy , and we obtain k ,yk i M hyk , yk i ˜0k ) = trace I = M ≤ M µ2 . trace(Bˆ = γk γk hsk , yk i 20
Consequently, we obtain trace(Bˆ˜uk ) ≤ (M + u)µ2 .
(A.13)
Plugging u = τ into (A.13) yields the claim (A.9). For k = 0, we have γk = γ0 = 1 and (A.13) ˜k ) = M while (A.13) results in trace(Bˆ˜k ) ≤ (1 + u)µ2 . reduces to trace(Bˆ u 0 ˆ ˜ We consider the determinant lower bound of det(B ) from (A.8) as k,τ
ˆˇk s T ˇˆuk )−1 yk−τ +u y T B ( B ) s ( k−τ +u k ˇuk )det I − k−τ +u u k−τ +u + ˜u+1 det(Bˆ ) = det(Bˆ ˆ hy , s k ˇ k−τ +u k−τ +u i hBu sk−τ +u , sk−τ +u i ! ˆˇk s T ( B ) k−τ +u ˆ ˆ u k −1 k (Bˇu ) yk−τ +u = det(Bˇu )det hBˆˇuk sk−τ +u , sk−τ +u i ˇk ) hsk−τ +u , yk−τ +u i = det(Bˆ u ˇuk sk−τ +u , sk−τ +u i hBˆ ˇk ) = det(Bˆ u
hsk−τ +u , yk−τ +u ) ksk−τ +u k2 ksk−τ +u k2 hBˇˆk sk−τ +u , sk−τ +u i u
ˇuk ) ≥ det(Bˆ
λ
λmax (Bˆuk ) λ ˇk ) ≥ det(Bˆ u trace(Bˆk ) u
λ ˇk ) . ≥ det(Bˆ u (M + τ )µ2
(A.14)
Regarding the second equality, we obtain it from the formula det(I + u1 v T1 + u2 v T2 ) = (1 + T u1 v 1 )(1+uT2 v 2 )−(uT1 v 1 )(uT2 v 2 ) by setting u1 = −sk−τ +u , v 1 = Bˆuk sk−τ +u /hBˆuk sk−τ +u , sk−τ +u i, −1 u2 = (Bˆuk ) yk−τ +u , and v 2 = yk−τ +u /hsk−τ +u , yk−τ +u i. The first inequality follows from (A.3) in Lemma B.2 and the fact hBˆuk sk−τ +u , sk−τ +u i ≤ λmax (Bˆuk )ksk−τ +u k2 . Actually, we use the fact the trace of a positive definite matrix bounds its maximal eigenvalue for the second inequality. The last inequality follows (A.13). Then, applying (A.12), (A.14) turns to be k ˜u+1 det(Bˆ ) ≥ det(Bˆ˜uk )
λ . (M + τ )µ2
(A.15)
Applying (A.15) recursively from u = 0 to u = τ − 1, we obtain that τ λ ˜τk ) ≥ det(Bˆ˜0k ). det(Bˆ (M + τ )µ2 ˜k , considering Bˆ˜k = id/γk as above and Lemma B.3, we To bound the determinant of Bˆ 0 0 can rewrite for k ≥ 1 as I 1 hyk , yk i M ˆ k ˜ det(B0 ) = det = M = ≥ µM 1 . γk hs , y i γk k k
21
Consequently, we obtain as ˜τk ) ≥ µM det(Bˆ 1
λ (M + τ )µ2
τ .
Thus, this yields (A.10), and these complete the proof. Now we prove the main lemma for Proposition 3.1. ˜ k defined by the recursion in (5). Define the constant Lemma B.7. Consider the operator H ˜ k is bounded by γ and Γ for all 0 < γ < Γ < ∞. If Assumption 1 holds, the eigenvalues of H k ≥ 1 as ˜ k Γid. γid H Proof. The proof is obtained as parallel to the Euclidean case [24]. The sum and product ˜k correspond to the bounds on the trace and determinant. Here, we of its eigenvalues of Bˆ denote λi as the i-th largest eigenvalue of the operator matrix Bˆ˜k . From (A.9), the sum of ˜k satisfies below the eigenvalues of Bˆ M X
λi = trace(Bˆ˜k ) ≤ (M + τ )µ2 .
(A.16)
i=1
Because all the eigenvalues are positive due to the positive definiteness of Bˆ˜k , it is obvious that every eigenvalues is less than the upper bound of the sum of all of the eigenvalues. Consequently, we obtain λi ≤ (M + τ )µ2 for all i, and finally obtain B˜k (M + τ )µ2 id. On the other hand, because the determinant of a matrix is the product of its eigenvalues, the lower bound in (A.10) bounds the product of the eigenvalues of Bˆ˜k from below. This Q λτ µ M ˆ˜k 1 means that M i=1 λi ≥ [(M +τ )µ2 ]τ . Thus, we have below for any given eigenvalue of B , say λj , λj
≥
1 QM
k=1,k6=j
λk
·
λτ µM 1 . [(M + τ )µ2 ]τ
(A.17)
ˆ˜k M −1 Considering that (M + τ )µ2 is an upper bound for the eigenvalues QMof B , [(M + τ )µ2 ] gives the upper bound of the product of the (M − 1) eigenvalues k=1,k6=j λk . As a result, we obtain that any eigenvalues of Bˆk is lower bounded as λj
≥
1 λ τ µM λ τ µM 1 1 · = . [(M + τ )µ2 ]M −1 [(M + τ )µ2 ]τ [(M + τ )µ2 ]M +τ −1
(A.18)
λτ µM
Consequently, we finally obtain [(M +τ )µ21]M +τ −1 id B˜k . Now, we obtain the claim. The bounds in (A.16) and (A.18) imply that their inverses are ˆ ˜ k = (Bˆ˜k )−1 as bounds for the eigenvalues of H M +τ −1 1 ˜ k [(M + τ )µ2 ] id H id. M (M + τ )µ2 λ τ µ1
Denoting proof.
1 (M +τ )µ2
as γ, and
[(M +τ )µ2 ]M +τ −1 λτ µ M 1
(A.19)
as Γ, we obtain the claim. This completes the
22
B.2 Main proof of Proposition 3.1
Finally we prove Proposition 3.1. ˜ k := T k ◦ H ˜ k ◦ (T k )−1 , where H ˜ k is defined by Proposition 3.1. Consider the operator H η˜t η˜t the recursion in (5). Define the constant 0 < γ < Γ < ∞. If Assumption 1 holds, the range of eigenvalues of Htk is bounded by γ and Γ for all k ≥ 1, t ≥ 1, i.e., γid Htk Γid.
(A.20)
˜ k ◦ (T k )−1 , where η˜k = R−1k (wk ), since T k is a liner Proof. Considering Htk := Tη˜k ◦ H t t η˜t η˜t w ˜ t ˜ k are identical. transformation operator, we can conclude that the eigenvalues of Htk and H Actually, let hat expressions be representation matrices with some bases of Twk M and Tw˜k M, t we have the relation below; ˆ˜ k (Tˆ )−1 ) ˆ k ) = det(λid − Tˆ k H det(λid − H t η˜ η˜k t
= det(TˆSη˜k
t
ˆ˜ k )(Tˆ )−1 ) (λid − H k
t
η˜t
ˆ˜ k )det((Tˆ )−1 ) = det(TˆSη˜k )det(λid − H η˜k t
t
ˆ˜ k )det(Tˆ )−1 = det(TˆSη˜k )det(λid − H η˜k t
t
ˆ˜ k ). = det(λid − H Therefore, Lemma B.7 directly yields the claim. This completes the proof.
C Proof of Theorem 3.2
This section shows the global convergence analysis of the proposed R-SQN-VR. Hereinafter, we use E[·] to express expectation with respect to the joint distribution of all random variables. For example, wk is determined by the realizations of the independent random variables {z1 , z2 , . . . , zk−1 }, the total expectation of f (wk ) for any k ∈ N can be taken as E[f (wtk )] = Ez1 Ez2 . . . Ezk−1 [f (wtk )]. We also use Ezt [·] to denote an expected value taken with respect to the distribution of the random variable zt . This analysis partially extends the expectation based analysis of SGD in the Euclidean [35] into the proposed algorithm.
C.1 Essential lemmas
k )] is We first obtain the following lemma from (A.2) in Lemma B.1. Subsequently, Ezt [f (wt+1 k a meaningful quantity because wt+1 depends on zt through the update in Algorithm 1.
Lemma C.1. Under Lemma B.1, the iterates of Algorithm 1 satisfy the following inequality for all k ∈ N: 1 k Ezt [f (wt+1 )] − f (wtk ) ≤ −αtk hgradf (wtk ), Ezt [Htk ξtk ]iwk + (αtk )2 ΛEzt [kHtk ξtk k2wk ].(A.21) t t 2
23
k Proof. When wt+1 = Rwk (−αtk Htk ξtk ), substituting −Htk ξtk into ηk , the iterates generated by t Algorithm 1 satisfy from (A.2) in Lemma B.1
1 k f (wt+1 ) − f (wtk ) ≤ hgradf (wtk ), −αtk Htk ξtk iwk + Λk − αtk Htk ξtk k2wk t t 2 1 = −αtk hgradf (wtk ), Htk ξtk iwk + (αtk )2 ΛkHtk ξtk k2wk . t t 2
(A.22)
Taking expectations in the inequalities above with respect to the distribution of zt , and noting k , but not w k , depends on z , we obtain the desired bound. that wt+1 t t This lemma shows that, regardless of how Algorithm 1 arrived at wtk , the expected decrease in the objective function yielded by the k-th step is bounded above by a quantity involving: (i) the expected directional derivative of f at wtk along −Htk ξtk and (ii) the second moment of Htk ξtk . Next, we derive the following lemma; Lemma C.2. Under Assumptions 1 and Assumption 2.2, the sequence of average function f (wtk ) satisfies k E[f (wt+1 )] ≤ f (wtk ) − αtk γkgradf (wtk )k2wk + t
9Λ(αtk )2 Γ2 S 2 . 2
(A.23)
Proof. Taking expectation (A.21) in Lemma C.1 with regard to wtk considering that Htk is deterministic when wtk is given, we write k E[f (wt+1 )]
(αtk )2 Λ Ezt [kHtk ξtk k2wk ]. t t 2 k )2 Λ (α ≤ f (wtk ) − αtk hgradf (wtk ), Htk gradf (wtk )iwk + t Ezt [kHtk ξtk k2wk ]. t t 2 k )2 Λ (α ≤ f (wtk ) − αtk hgradf (wtk ), Htk gradf (wtk )iwk + t Ezt [Γ2 kξtk k2wk ]. t t 2 k )2 Γ2 S 2 9Λ(α t ≤ f (wtk ) − αtk γkgradf (wtk )k2wk + , t 2
≤ f (wtk ) − αtk hgradf (wtk ), Htk Ezt [ξtk ]iwk +
where the second inequality is obtained from Ezt [ξtk ] = gradf (wtk ) because ξtk is an unbiased estimate of gradf (wtk ). The last inequality comes from (11) in Assumption 2.2 since kξtk kwk = kgradfik (wtk ) − Tη˜k gradfik (w ˜ k ) + Tη˜k gradf (w ˜ k ) kwk t
t
t
t
≤ S + S + S = 3S,
t
t
(A.24)
where η˜tk ∈ Tw˜k M satisfies Rw˜k (˜ ηtk ) = wtk . This completes the proof. Proposition C.3. Under Assumptions 1, Assumptions 2, and Lemma C.2, suppose that Algorithm 1 is run with a step-size sequence satisfying (12) in Assumption 2.3. Then, with
24
AK :=
PK Pmk k=1
k t=1 αt ,
" E "
1 E AK
and theref ore
mk K X X k=1 t=1 mk K X X k=1 t=1
# αtk kgradf (wtk )k2wk t
0 such that E[kgradf (wtk )k2wk ] > δ for all k sufficiently large, say, k > N . We have t
" E
mk ∞ X X k=N t=1
# αtk kgradf (wtk )k2wk t
≥
mk ∞ X X
αtk E[kgradf (wtk )k2wk ] > δ t
k=N t=1
This contradicts (A.25). 25
mk ∞ X X k=N t=1
αtk
= ∞.
A “lim inf” result of this type should be familiar to those knowledgeable of the nonlinear optimization literature. This intuition is that, for the R-SGD with diminishing step-sizes, the expected gradient norms cannot stay bounded away from zero.
C.2 Main proof of Theorem 3.2
Theorem. 3.2. Consider Algorithm 1 and suppose Assumptions 1 and 2, and that the mapping w 7→ kgradf (w)k2w has the positive real number that the largest eigenvalue of its Riemannian Hessian is bounded by for all w ∈ M. Then, we have lim E[kgradf (wtk )k2wk ] = 0.
k→∞
t
Proof. We define $h(w) := \|\mathrm{grad} f(w)\|_w^2$ and let $\Lambda_h$ be the absolute value of the eigenvalue with the largest magnitude of the Hessian of $h$. Then, from Taylor's theorem, we obtain
$$
h(w_{t+1}^k) - h(w_t^k) \le -2\alpha_t^k \langle \mathrm{grad} f(w_t^k), \mathrm{Hess} f(w_t^k)[H_t^k \xi_t^k] \rangle_{w_t^k} + \frac{(\alpha_t^k)^2 \Lambda_h}{2} \|H_t^k \xi_t^k\|_{w_t^k}^2.
$$
Taking the expectation with respect to the distribution of $z_t$, we obtain
$$
\begin{aligned}
\mathbb{E}_{z_t}[h(w_{t+1}^k)] - h(w_t^k)
&\le -2\alpha_t^k \langle \mathrm{grad} f(w_t^k), \mathbb{E}_{z_t}[\mathrm{Hess} f(w_t^k)[H_t^k \xi_t^k]] \rangle_{w_t^k} + \frac{(\alpha_t^k)^2 \Lambda_h}{2} \mathbb{E}_{z_t}[\|H_t^k \xi_t^k\|_{w_t^k}^2] \\
&= -2\alpha_t^k \langle \mathrm{grad} f(w_t^k), \mathrm{Hess} f(w_t^k)[H_t^k \mathbb{E}_{z_t}[\xi_t^k]] \rangle_{w_t^k} + \frac{(\alpha_t^k)^2 \Lambda_h}{2} \mathbb{E}_{z_t}[\|H_t^k \xi_t^k\|_{w_t^k}^2] \\
&\le 2\alpha_t^k \|\mathrm{grad} f(w_t^k)\|_{w_t^k} \|\mathrm{Hess} f(w_t^k)[H_t^k \mathrm{grad} f(w_t^k)]\|_{w_t^k} + \frac{(\alpha_t^k)^2 \Lambda_h \Gamma^2}{2} \mathbb{E}_{z_t}[\|\xi_t^k\|_{w_t^k}^2] \\
&\le 2\alpha_t^k \Lambda \Gamma \|\mathrm{grad} f(w_t^k)\|_{w_t^k}^2 + \frac{9}{2} (\alpha_t^k)^2 \Lambda_h S^2 \Gamma^2,
\end{aligned}
$$
where the last inequality comes from $\|\mathrm{Hess} f(w_t^k)[H_t^k \mathrm{grad} f(w_t^k)]\|_{w_t^k} \le \Lambda \|H_t^k \mathrm{grad} f(w_t^k)\|_{w_t^k} \le \Lambda \Gamma \|\mathrm{grad} f(w_t^k)\|_{w_t^k}$. Taking the total expectation simply yields
$$
\mathbb{E}[h(w_{t+1}^k)] - \mathbb{E}[h(w_t^k)] \le 2\alpha_t^k \Lambda \Gamma\, \mathbb{E}[\|\mathrm{grad} f(w_t^k)\|_{w_t^k}^2] + \frac{9}{2}(\alpha_t^k)^2 \Lambda_h S^2 \Gamma^2. \tag{A.28}
$$

Recall that Proposition C.3 establishes that the first component of this bound forms a convergent sum. The second component also forms a convergent sum since $\sum_{k=1}^{\infty}\sum_{t=1}^{m_k} (\alpha_t^k)^2$ converges, so the same summability argument applies again. Therefore, the right-hand side of (A.28) forms a convergent sum. Let us now define
$$
S_K^{+} := \sum_{k=1}^{K}\sum_{t=1}^{m_k} \max\big(0, \mathbb{E}[h(w_{t+1}^k)] - \mathbb{E}[h(w_t^k)]\big), \qquad
S_K^{-} := \sum_{k=1}^{K}\sum_{t=1}^{m_k} \max\big(0, \mathbb{E}[h(w_t^k)] - \mathbb{E}[h(w_{t+1}^k)]\big).
$$
Since the bound (A.28) is positive and forms a convergent sum, the nondecreasing sequence $S_K^{+}$ is upper bounded and therefore converges. Since, for any $K \in \mathbb{N}$, one has $\mathbb{E}[h(w_K)] = h(w_1) + S_K^{+} - S_K^{-} \ge 0$, the nondecreasing sequence $S_K^{-}$ is also upper bounded and therefore converges. Consequently, $\mathbb{E}[h(w_K)]$ converges, and by Proposition C.4 this limit must be zero. This completes the proof.
D Proof of Theorem 3.3
This section first introduces some essential lemmas, and then gives the main proof of Theorem 3.3. At the end, it also derives a corollary for the case of the exponential mapping and the parallel translation, which are special cases of the retraction and the vector transport.
D.1 Essential lemmas
We first introduce the following lemmas.

Lemma D.1 (In the proof of Lemma 3.9 in [20]). Under Assumptions 1.1 and 1.2, there exists a constant $\beta > 0$ such that
$$
\|P_{\gamma}^{w \leftarrow z}(\mathrm{grad} f(z)) - \mathrm{grad} f(w)\|_w \le \beta\, \mathrm{dist}(z, w), \tag{A.29}
$$
where $w$ and $z$ are in $\Theta$ in Assumption 1.2 and $\gamma$ is the curve $\gamma(\tau) := R_z(\tau\eta)$ for $\eta \in T_z\mathcal{M}$ defined by a retraction $R$ on $\mathcal{M}$. $P_{\gamma}^{w \leftarrow z}(\cdot)$ is the parallel translation operator along the curve $\gamma$ from $z$ to $w$. Note that the curve $\gamma$ in this lemma is not necessarily a geodesic. The relation (A.29) is a generalization of the Lipschitz continuity condition.

Lemma D.2 (Lemma 3.5 in [20]). Let $T \in C^0$ be a vector transport associated with the same retraction $R$ as that of the parallel translation $P \in C^{\infty}$. Under Assumption 1.5, for any $\bar{w} \in \mathcal{M}$ there exist a constant $\theta > 0$ and a neighborhood $\mathcal{U}$ of $\bar{w}$ such that for all $w, z \in \mathcal{U}$,
$$
\|T_{\eta}\xi - P_{\eta}\xi\|_z \le \theta \|\xi\|_w \|\eta\|_w, \tag{A.30}
$$
where $\xi, \eta \in T_w\mathcal{M}$ and $R_w(\eta) = z$.

Slightly modifying Lemma 3 in [36], we have the following lemma.

Lemma D.3 (Lemma 3 in [36]). Let $\mathcal{M}$ be a Riemannian manifold endowed with a retraction $R$ and let $\bar{w} \in \mathcal{M}$. Then there exist $\tau_1 > 0$, $\tau_2 > 0$, and $\delta_{\tau_1, \tau_2} > 0$ such that for all $w$ in a sufficiently small neighborhood of $\bar{w}$ and all $\xi \in T_w\mathcal{M}$ with $\|\xi\|_w \le \delta_{\tau_1, \tau_2}$, the inequalities
$$
\tau_1\, \mathrm{dist}(w, R_w(\xi)) \le \|\xi\|_w \le \tau_2\, \mathrm{dist}(w, R_w(\xi)) \tag{A.31}
$$
hold.

We also show a property of the Karcher mean on a general Riemannian manifold.

Lemma D.4 (Lemma C.2 in [21]). Let $w_1, \dots, w_m$ be points on a Riemannian manifold $\mathcal{M}$ and let $w$ be the Karcher mean of the $m$ points. For an arbitrary point $p$ on $\mathcal{M}$, we have
$$
(\mathrm{dist}(p, w))^2 \le \frac{4}{m} \sum_{i=1}^{m} (\mathrm{dist}(p, w_i))^2.
$$
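As a quick numerical illustration of Lemma D.4, the following toy check works on the unit sphere rather than on the manifolds used in the paper; the exponential map, logarithm, and the fixed-point iteration used to compute the Karcher mean below are standard choices assumed only for this sketch.

import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 8

def dist(a, b):
    # Geodesic distance on the unit sphere S^{d-1}.
    return float(np.arccos(np.clip(a @ b, -1.0, 1.0)))

def exp_map(w, v):
    # Exponential map at w applied to a tangent vector v (v orthogonal to w).
    nv = np.linalg.norm(v)
    return w if nv < 1e-16 else np.cos(nv) * w + np.sin(nv) * v / nv

def log_map(w, z):
    # Inverse exponential map (logarithm) of z at w.
    p = z - (w @ z) * w
    n = np.linalg.norm(p)
    return np.zeros_like(w) if n < 1e-16 else dist(w, z) * p / n

def karcher_mean(points, iters=200):
    # Fixed-point iteration: move along the mean of the logarithms.
    w = points[0].copy()
    for _ in range(iters):
        w = exp_map(w, np.mean([log_map(w, z) for z in points], axis=0))
    return w

# Points clustered around a base point so that the Karcher mean is well defined.
base = np.zeros(d); base[0] = 1.0
pts = [v / np.linalg.norm(v) for v in base + 0.3 * rng.normal(size=(m, d))]
w = karcher_mean(pts)

p = rng.normal(size=d); p /= np.linalg.norm(p)
lhs = dist(p, w) ** 2
rhs = 4.0 / m * sum(dist(p, z) ** 2 for z in pts)
print(lhs <= rhs, round(lhs, 4), round(rhs, 4))   # the bound of Lemma D.4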
We now bound the variance of $\xi_t^k$ as follows.
Lemma D.5 (Lemma 5.8 in [21]). Suppose Assumptions 1.1, 1.2, 1.4, and 1.5, which guarantee Lemmas D.1, D.2, and D.3 for $\bar{w} = w^*$. Let $\beta > 0$ be a constant such that
$$
\|P_{\gamma}^{w \leftarrow z}(\mathrm{grad} f_n(z)) - \mathrm{grad} f_n(w)\|_w \le \beta\, \mathrm{dist}(z, w), \qquad w, z \in \Theta,\ n = 1, 2, \dots, N.
$$
The existence of such a $\beta$ is guaranteed by Lemma D.1. The upper bound of the variance of $\xi_t^k$ is given by
$$
\mathbb{E}_{i_t^k}\big[\|\xi_t^k\|_{w_t^k}^2\big] \le 4(\beta^2 + \tau_2^2 C^2 \theta^2)\big(7(\mathrm{dist}(w_t^k, w^*))^2 + 4(\mathrm{dist}(\tilde{w}^k, w^*))^2\big), \tag{A.32}
$$
where the constant $\theta$ corresponds to that in Lemma D.2, $C$ is an upper bound of $\|\mathrm{grad} f_n(w)\|_w$, $n = 1, 2, \dots, N$, for $w \in \Theta$, and $\tau_2 > 0$ appears in (A.31).

We also have the following corollary of the previous lemma for the case $R = \mathrm{Exp}$ and $T = P$.

Corollary D.6 (Corollary 5.1 in [21]). Consider Algorithm 1 with $T = P$ and $R = \mathrm{Exp}$, i.e., the parallel translation and the exponential mapping case. When each $\mathrm{grad} f_n$ is $\beta_0$-Lipschitz continuous, the upper bound of the variance of $\xi_t^k$ is given by
$$
\mathbb{E}_{i_t^k}\big[\|\xi_t^k\|_{w_t^k}^2\big] \le \beta_0^2\big(14(\mathrm{dist}(w_t^k, w^*))^2 + 8(\mathrm{dist}(\tilde{w}^k, w^*))^2\big). \tag{A.33}
$$
Next, we show a lemma that gives a lower bound for $\|\mathrm{grad} f(w_t^k)\|_{w_t^k}$ in terms of the error $f(w_t^k) - f(w^*)$. This is a standard derivation in the Euclidean space (see, e.g., [37]), which we extend to manifolds.

Lemma D.7. Let $w \in \mathcal{M}$ and let $z$ be in a totally retractive neighborhood of $w$. It holds that
$$
2\lambda(f(w) - f(z)) \le \|\mathrm{grad} f(w)\|_w^2. \tag{A.34}
$$

Proof. Let $\zeta = R_w^{-1}(z)$. Using (A.1) in Lemma B.1, which corresponds to the strong convexity of $f$, we obtain
$$
\begin{aligned}
f(z) &\ge f(w) + \langle \mathrm{grad} f(w), \zeta \rangle_w + \frac{\lambda}{2}\|\zeta\|_w^2 \\
&\ge f(w) + \min_{\xi \in T_w\mathcal{M}} \Big( \langle \mathrm{grad} f(w), \xi \rangle_w + \frac{\lambda}{2}\|\xi\|_w^2 \Big) \\
&\ge f(w) - \frac{1}{2\lambda}\|\mathrm{grad} f(w)\|_w^2.
\end{aligned}
$$
Rearranging this inequality completes the proof.
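As a sanity check of (A.34) in the Euclidean special case (an illustration not taken from the paper), consider the $\lambda$-strongly convex quadratic $f(w) = \frac{\lambda}{2}\|w - w^*\|^2$, for which $\mathrm{grad} f(w) = \lambda(w - w^*)$. With $z = w^*$, the bound holds with equality:
$$
2\lambda\big(f(w) - f(w^*)\big) = \lambda^2\|w - w^*\|^2 = \|\mathrm{grad} f(w)\|^2,
$$
and the minimum in the proof is attained at $\xi = -\mathrm{grad} f(w)/\lambda$.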
D.2 Main proof of Theorem 3.3
Theorem 3.3. Let $\mathcal{M}$ be a Riemannian manifold and let $w^* \in \mathcal{M}$ be a non-degenerate local minimizer of $f$ (i.e., $\mathrm{grad} f(w^*) = 0$ and the Hessian $\mathrm{Hess} f(w^*)$ of $f$ at $w^*$ is positive definite). Suppose Assumption 1 holds. Let $\beta$, $\theta$, and $C$ be the constants in Lemma D.5, $\tau_1$ and $\tau_2$ the constants in Lemma D.3, $\lambda$ and $\Lambda$ the constants in Lemma B.1, and $\gamma$ and $\Gamma$ the constants in Proposition 3.1. Let $\alpha$ be a positive number satisfying $\lambda\tau_1^2 > 2\alpha(\gamma\lambda^2\tau_1^2 - 14\alpha\Lambda\Gamma^2(\beta^2 + \tau_2^2 C^2\theta^2))$ and $\gamma\lambda^2\tau_1^2 > 14\alpha\Lambda\Gamma^2(\beta^2 + \tau_2^2 C^2\theta^2)$. It then follows that, for any sequence $\{\tilde{w}^k\}$ generated by Algorithm 1 under a fixed step-size $\alpha_t^k := \alpha$ and $m_k := m$ converging to $w^*$, there exists $K > 0$ such that for all $k > K$,
$$
\mathbb{E}[(\mathrm{dist}(\tilde{w}^{k+1}, w^*))^2] \le \frac{2\big(\Lambda\tau_2^2 + 16m\alpha^2\Lambda\Gamma^2(\beta^2 + \tau_2^2 C^2\theta^2)\big)}{m\alpha\big(\gamma\lambda^2\tau_1^2 - 14\alpha\Lambda\Gamma^2(\beta^2 + \tau_2^2 C^2\theta^2)\big)}\, \mathbb{E}[(\mathrm{dist}(\tilde{w}^k, w^*))^2].
$$
Proof. Using (A.2) in Lemma B.1, which is equivalent to the Lipschitz continuity of $\mathrm{grad} f$ from Assumption 3, we obtain
$$
f(w_{t+1}^k) - f(w_t^k) \le \langle \mathrm{grad} f(w_t^k), -\alpha H_t^k \xi_t^k \rangle_{w_t^k} + \frac{\Lambda}{2}\|{-\alpha H_t^k \xi_t^k}\|_{w_t^k}^2.
$$
Taking the expectation with respect to $z_t$, this becomes
$$
\begin{aligned}
\mathbb{E}_{z_t}[f(w_{t+1}^k)] - f(w_t^k)
&\le \mathbb{E}_{z_t}\Big[\langle \mathrm{grad} f(w_t^k), -\alpha H_t^k \xi_t^k \rangle_{w_t^k} + \frac{\alpha^2\Lambda}{2}\|H_t^k \xi_t^k\|_{w_t^k}^2\Big] \\
&\le -\alpha \langle \mathrm{grad} f(w_t^k), \mathbb{E}_{z_t}[H_t^k \xi_t^k] \rangle_{w_t^k} + \frac{\alpha^2\Lambda}{2}\mathbb{E}_{z_t}[\|H_t^k \xi_t^k\|_{w_t^k}^2] \\
&\le -\alpha \langle \mathrm{grad} f(w_t^k), H_t^k \mathrm{grad} f(w_t^k) \rangle_{w_t^k} + \frac{\alpha^2\Lambda}{2}\mathbb{E}_{z_t}[\|H_t^k \xi_t^k\|_{w_t^k}^2] \\
&\le -\alpha\gamma \|\mathrm{grad} f(w_t^k)\|_{w_t^k}^2 + \frac{\alpha^2\Lambda\Gamma^2}{2}\mathbb{E}_{z_t}[\|\xi_t^k\|_{w_t^k}^2],
\end{aligned} \tag{A.35}
$$
where the third inequality uses the fact that $\mathbb{E}_{z_t}[H_t^k \xi_t^k] = H_t^k \mathrm{grad} f(w_t^k)$, and the last inequality uses the bounds on $H_t^k$ in Proposition 3.1. From Lemma D.7, (A.35) yields
$$
\mathbb{E}_{z_t}[f(w_{t+1}^k)] - f(w_t^k) \le -2\alpha\gamma\lambda(f(w_t^k) - f(w^*)) + \frac{\alpha^2\Lambda\Gamma^2}{2}\mathbb{E}_{z_t}[\|\xi_t^k\|_{w_t^k}^2]. \tag{A.36}
$$
Using (A.1) in Lemma B.1 with $\mathrm{grad} f(w^*) = 0$, together with Lemma D.3, we obtain
$$
f(w_t^k) - f(w^*) \ge \frac{\lambda}{2}\|R_{w^*}^{-1}(w_t^k)\|_{w^*}^2 \ge \frac{\lambda\tau_1^2}{2}(\mathrm{dist}(w_t^k, w^*))^2. \tag{A.37}
$$
Plugging (A.37) and the bound on $\mathbb{E}_{i_t^k}[\|\xi_t^k\|_{w_t^k}^2]$ in (A.32) of Lemma D.5 into (A.36) yields
$$
\begin{aligned}
\mathbb{E}_{z_t}[f(w_{t+1}^k)] - f(w_t^k)
&\le -\alpha\gamma\lambda^2\tau_1^2(\mathrm{dist}(w_t^k, w^*))^2 + \frac{\alpha^2\Lambda\Gamma^2}{2}\mathbb{E}_{z_t}[\|\xi_t^k\|_{w_t^k}^2] \\
&\le -\alpha\gamma\lambda^2\tau_1^2(\mathrm{dist}(w_t^k, w^*))^2 + \frac{\alpha^2\Lambda\Gamma^2}{2}\Big\{4(\beta^2 + \tau_2^2 C^2\theta^2)\big(7(\mathrm{dist}(w_t^k, w^*))^2 + 4(\mathrm{dist}(\tilde{w}^k, w^*))^2\big)\Big\} \\
&\le \big(-\alpha\gamma\lambda^2\tau_1^2 + 14\alpha^2\Lambda\Gamma^2(\beta^2 + \tau_2^2 C^2\theta^2)\big)(\mathrm{dist}(w_t^k, w^*))^2 + 8\alpha^2\Lambda\Gamma^2(\beta^2 + \tau_2^2 C^2\theta^2)(\mathrm{dist}(\tilde{w}^k, w^*))^2.
\end{aligned}
$$
Taking expectations over all random variables and summing over $t = 0, \dots, m-1$ of the inner loop in the $k$-th epoch, we obtain
$$
\begin{aligned}
\mathbb{E}[f(w_m^k) - f(w_0^k)] \le\ & -\big(\alpha\gamma\lambda^2\tau_1^2 - 14\alpha^2\Lambda\Gamma^2(\beta^2 + \tau_2^2 C^2\theta^2)\big)\sum_{t=0}^{m-1}\mathbb{E}[(\mathrm{dist}(w_t^k, w^*))^2] \\
& + 8m\alpha^2\Lambda\Gamma^2(\beta^2 + \tau_2^2 C^2\theta^2)\,\mathbb{E}[(\mathrm{dist}(\tilde{w}^k, w^*))^2].
\end{aligned} \tag{A.38}
$$
Here, considering the difference from the solution $w^*$ in terms of the cost function value, we obtain
$$
\mathbb{E}[f(w_m^k) - f(w_0^k)] = \mathbb{E}[f(w_m^k) - f(w^*) - (f(w_0^k) - f(w^*))] \ge \frac{1}{2}\mathbb{E}\big[\lambda\tau_1^2(\mathrm{dist}(w_m^k, w^*))^2 - \Lambda\tau_2^2(\mathrm{dist}(w_0^k, w^*))^2\big].
$$
Plugging the above into (A.38) yields
$$
\begin{aligned}
\mathbb{E}\big[\lambda\tau_1^2(\mathrm{dist}(w_m^k, w^*))^2 - \Lambda\tau_2^2(\mathrm{dist}(w_0^k, w^*))^2\big]
\le\ & -2\alpha\big(\gamma\lambda^2\tau_1^2 - 14\alpha\Lambda\Gamma^2(\beta^2 + \tau_2^2 C^2\theta^2)\big)\sum_{t=0}^{m-1}\mathbb{E}[(\mathrm{dist}(w_t^k, w^*))^2] \\
& + 16m\alpha^2\Lambda\Gamma^2(\beta^2 + \tau_2^2 C^2\theta^2)\,\mathbb{E}[(\mathrm{dist}(\tilde{w}^k, w^*))^2].
\end{aligned}
$$
Rearranging this gives
$$
\begin{aligned}
2\alpha\big(\gamma\lambda^2\tau_1^2 - 14\alpha\Lambda\Gamma^2(\beta^2 + \tau_2^2 C^2\theta^2)\big)\sum_{t=0}^{m-1}\mathbb{E}[(\mathrm{dist}(w_t^k, w^*))^2]
\le\ & \mathbb{E}[\Lambda\tau_2^2(\mathrm{dist}(w_0^k, w^*))^2] - \mathbb{E}[\lambda\tau_1^2(\mathrm{dist}(w_m^k, w^*))^2] \\
& + 16m\alpha^2\Lambda\Gamma^2(\beta^2 + \tau_2^2 C^2\theta^2)\,\mathbb{E}[(\mathrm{dist}(\tilde{w}^k, w^*))^2].
\end{aligned} \tag{A.39}
$$
Now, addressing option I in Algorithm 1, which uses $\tilde{w}^{k+1} = g_{m_k}(w_1^k, \dots, w_m^k)$, we derive the following from Lemma D.4:
$$
\begin{aligned}
\frac{m}{4}\, 2\alpha\big(\gamma\lambda^2\tau_1^2 - 14\alpha\Lambda\Gamma^2(\beta^2 + \tau_2^2 C^2\theta^2)\big)\,\mathbb{E}[(\mathrm{dist}(\tilde{w}^{k+1}, w^*))^2]
\le\ & 2\alpha\big(\gamma\lambda^2\tau_1^2 - 14\alpha\Lambda\Gamma^2(\beta^2 + \tau_2^2 C^2\theta^2)\big) \\
& \times \mathbb{E}\Big[\sum_{t=0}^{m-1}(\mathrm{dist}(w_t^k, w^*))^2 + (\mathrm{dist}(w_m^k, w^*))^2 - (\mathrm{dist}(w_0^k, w^*))^2\Big] \\
\overset{\text{(A.39)}}{\le}\ & \mathbb{E}[\Lambda\tau_2^2(\mathrm{dist}(w_0^k, w^*))^2] - \mathbb{E}[\lambda\tau_1^2(\mathrm{dist}(w_m^k, w^*))^2] \\
& + 16m\alpha^2\Lambda\Gamma^2(\beta^2 + \tau_2^2 C^2\theta^2)\,\mathbb{E}[(\mathrm{dist}(\tilde{w}^k, w^*))^2] \\
& + 2\alpha\big(\gamma\lambda^2\tau_1^2 - 14\alpha\Lambda\Gamma^2(\beta^2 + \tau_2^2 C^2\theta^2)\big)\,\mathbb{E}[(\mathrm{dist}(w_m^k, w^*))^2 - (\mathrm{dist}(w_0^k, w^*))^2] \\
\le\ & \big(\Lambda\tau_2^2 - 2\alpha(\gamma\lambda^2\tau_1^2 - 14\alpha\Lambda\Gamma^2(\beta^2 + \tau_2^2 C^2\theta^2))\big)\,\mathbb{E}[(\mathrm{dist}(w_0^k, w^*))^2] \\
& + 16m\alpha^2\Lambda\Gamma^2(\beta^2 + \tau_2^2 C^2\theta^2)\,\mathbb{E}[(\mathrm{dist}(\tilde{w}^k, w^*))^2] \\
& - \big(\lambda\tau_1^2 - 2\alpha(\gamma\lambda^2\tau_1^2 - 14\alpha\Lambda\Gamma^2(\beta^2 + \tau_2^2 C^2\theta^2))\big)\,\mathbb{E}[(\mathrm{dist}(w_m^k, w^*))^2].
\end{aligned}
$$
Combining the relation $\Lambda\tau_2^2 > \lambda\tau_1^2$ with the assumption $\lambda\tau_1^2 > 2\alpha(\gamma\lambda^2\tau_1^2 - 14\alpha\Lambda\Gamma^2(\beta^2 + \tau_2^2 C^2\theta^2))$, and using $w_0^k = \tilde{w}^k$, we obtain
$$
\frac{m}{4}\, 2\alpha\big(\gamma\lambda^2\tau_1^2 - 14\alpha\Lambda\Gamma^2(\beta^2 + \tau_2^2 C^2\theta^2)\big)\,\mathbb{E}[(\mathrm{dist}(\tilde{w}^{k+1}, w^*))^2] \le \big(\Lambda\tau_2^2 + 16m\alpha^2\Lambda\Gamma^2(\beta^2 + \tau_2^2 C^2\theta^2)\big)\,\mathbb{E}[(\mathrm{dist}(\tilde{w}^k, w^*))^2].
$$
Finally, we obtain
$$
\mathbb{E}[(\mathrm{dist}(\tilde{w}^{k+1}, w^*))^2] \le \frac{2\big(\Lambda\tau_2^2 + 16m\alpha^2\Lambda\Gamma^2(\beta^2 + \tau_2^2 C^2\theta^2)\big)}{m\alpha\big(\gamma\lambda^2\tau_1^2 - 14\alpha\Lambda\Gamma^2(\beta^2 + \tau_2^2 C^2\theta^2)\big)}\,\mathbb{E}[(\mathrm{dist}(\tilde{w}^k, w^*))^2]. \tag{A.40}
$$
This completes the proof.
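For (A.40) to yield a linear local convergence rate, the contraction factor must be below one; one reading of the theorem (an interpretation, not an explicit statement in the source) is that
$$
\rho := \frac{2\big(\Lambda\tau_2^2 + 16m\alpha^2\Lambda\Gamma^2(\beta^2 + \tau_2^2 C^2\theta^2)\big)}{m\alpha\big(\gamma\lambda^2\tau_1^2 - 14\alpha\Lambda\Gamma^2(\beta^2 + \tau_2^2 C^2\theta^2)\big)} < 1
$$
can be enforced by first taking $\alpha$ small enough that the denominator is positive and the $m$-proportional parts of the numerator and denominator balance favorably, and then taking the number of inner iterations $m$ large enough, in which case $\mathbb{E}[(\mathrm{dist}(\tilde{w}^k, w^*))^2]$ decreases geometrically for $k > K$.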
Remark D.8. From the proof of Lemma B.4, if we adopt the parallel translation as the vector transport, i.e., $T = P$, the first two terms in (A.6) are equal to zero and $\mu_2$ in (A.5) becomes smaller than in the general vector transport case. This leads to a smaller $\Gamma$ and a larger $\gamma$ in (10) of Proposition 3.1, which in turn yields a smaller coefficient in (A.40) of Theorem 3.3. Consequently, the parallel translation can result in a faster local convergence rate.

We obtain the following corollary of the previous theorem for the case $R = \mathrm{Exp}$ and $T = P$.

Corollary D.9. Consider Algorithm 1 with $T = P$ and $R = \mathrm{Exp}$, i.e., the parallel translation and the exponential mapping case. Let $\mathcal{M}$ be a Riemannian manifold and let $w^* \in \mathcal{M}$ be a non-degenerate local minimizer of $f$ (i.e., $\mathrm{grad} f(w^*) = 0$ and the Hessian $\mathrm{Hess} f(w^*)$ of $f$ at $w^*$ is positive definite). Suppose Assumption 1 holds. Let $\theta$ and $C$ be the constants in Lemma D.5, $\beta_0$ the constant in Corollary D.6, $\lambda$ and $\Lambda$ the constants in Lemma B.1 satisfying (A.1), and $\gamma$ and $\Gamma$ the constants in Proposition 3.1. Let $\alpha$ be a positive number satisfying $\lambda > 2\alpha(\gamma\lambda^2 - 7\alpha\Lambda\Gamma^2\beta_0^2)$ and $\gamma\lambda^2 > 7\alpha\Lambda\Gamma^2\beta_0^2$. It then follows that, for any sequence $\{\tilde{w}^k\}$ generated by Algorithm 1 with a fixed step-size $\alpha_t^k := \alpha$ and $m_k := m$ converging to $w^*$, there exists $K > 0$ such that for all $k > K$,
$$
\mathbb{E}[(\mathrm{dist}(\tilde{w}^{k+1}, w^*))^2] \le \frac{2(\Lambda + 8m\alpha^2\Lambda\Gamma^2\beta_0^2)}{m\alpha(\gamma\lambda^2 - 7\alpha\Lambda\Gamma^2\beta_0^2)}\,\mathbb{E}[(\mathrm{dist}(\tilde{w}^k, w^*))^2]. \tag{A.41}
$$

Proof. The proof proceeds in the same way as that of Theorem 3.3, using Corollary D.6 and setting $\theta = 0$ in Lemma D.2 and $\tau_1 = \tau_2 = 1$ in Lemma D.3.
E Isometric vector transport on $S_{++}^d$
This section introduces the isometric vector transport that fulfills the locking condition for $S_{++}^d$; [20, 29] describe the details and provide comprehensive explanations. Let $B$ denote the function giving a basis of $T_w\mathcal{M}$, i.e., $B : w \mapsto B(w) = (b_1, b_2, \dots, b_d)$, where $d$ denotes the dimension of $\mathcal{M}$ and $b_i$ is the $i$-th orthonormal basis vector of $T_w\mathcal{M}$. Consider $w \in \mathcal{M}$, $\eta, \xi \in T_w\mathcal{M}$, $z = R_w(\eta)$, $B_1 = B(w)$, and $B_2 = B(z)$. Then, the isometric vector transport is defined as
$$
T_{\eta}\xi = B_2\Big(I - \frac{2 v_2 v_2^T}{v_2^T v_2}\Big)\Big(I - \frac{2 v_1 v_1^T}{v_1^T v_1}\Big) B_1^{\flat}\xi, \tag{A.42}
$$
where $v_1 = B_1^{\flat}\eta - y$ and $v_2 = y - \beta B_2^{\flat} T_{R_\eta}\eta$. Here, $y$ can be any vector satisfying $\|y\| = \|B_1^{\flat}\eta\| = \|\beta B_2^{\flat} T_{R_\eta}\eta\|$. The orthonormal basis of $T_X S_{++}^d$ is defined by
$$
B_X = \{L e_i e_i^T L^T : i = 1, \dots, n\} \cup \Big\{\tfrac{1}{\sqrt{2}} L (e_i e_j^T + e_j e_i^T) L^T : i < j,\ i = 1, \dots, n,\ j = 1, \dots, n\Big\},
$$
where $\{e_1, \dots, e_n\}$ is the canonical basis of the $n$-dimensional Euclidean space and $LL^T$ is the Cholesky decomposition of $X$.
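As an illustration of the basis $B_X$ described above, the following sketch constructs it numerically and checks its orthonormality. It assumes the affine-invariant metric $\langle \xi, \eta \rangle_X = \mathrm{tr}(X^{-1}\xi X^{-1}\eta)$, which is the standard choice for $S_{++}^d$ but is not restated in this section.

import numpy as np

rng = np.random.default_rng(0)
n = 4

# Random symmetric positive-definite X and its Cholesky factor L (X = L L^T).
A = rng.normal(size=(n, n))
X = A @ A.T + n * np.eye(n)
L = np.linalg.cholesky(X)
X_inv = np.linalg.inv(X)

def inner(xi, eta):
    # Affine-invariant metric on T_X S_{++}^n (assumed metric).
    return np.trace(X_inv @ xi @ X_inv @ eta)

E = np.eye(n)
# Basis B_X from the text: diagonal-type and symmetrized off-diagonal-type elements.
basis = [L @ np.outer(E[i], E[i]) @ L.T for i in range(n)]
basis += [L @ (np.outer(E[i], E[j]) + np.outer(E[j], E[i])) @ L.T / np.sqrt(2)
          for i in range(n) for j in range(i + 1, n)]

# The Gram matrix under this metric should be the identity of size n(n+1)/2.
G = np.array([[inner(b1, b2) for b2 in basis] for b1 in basis])
print(G.shape, np.allclose(G, np.eye(len(basis)), atol=1e-8))

Under this metric the Gram matrix is the identity, which is what makes the coordinate representation via $B^{\flat}$ in (A.42) convenient.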
F Additional numerical experiments
In this section, we show additional numerical experiments which do not appear in the main text.
F.1 Karcher mean problem
We consider the PSD Karcher mean problem with $N = 1000$ and $d = 3$ (Case KM-1: small size instance). Figure A.1 shows the convergence plots of the cost function, the optimality gap, and the norm of the gradient for all five runs. They show that the proposed R-SQN-VR performs better than the other stochastic algorithms. Additionally, we consider the case of $N = 10000$ and $d = 10$ (Case KM-2: large size instance). From Figure A.2, we again find that the proposed R-SQN-VR outperforms the other stochastic algorithms.
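For reference, the following minimal sketch sets up the PSD Karcher mean cost and runs a few plain R-SGD steps under the affine-invariant geometry. The cost scaling, the step-size, and the use of the exponential mapping are illustrative assumptions; the experiments above use the actual R-SGD, R-SVRG, and R-SQN-VR implementations, whose conventions may differ.

import numpy as np
from scipy.linalg import expm, logm, sqrtm, inv

rng = np.random.default_rng(0)
N, d = 20, 3

def random_spd(d):
    A = rng.normal(size=(d, d))
    return A @ A.T + d * np.eye(d)

X = [random_spd(d) for _ in range(N)]

def spd_log(W, Z):
    # Inverse exponential map Log_W(Z) under the affine-invariant metric.
    Ws = np.real(sqrtm(W)); Wis = inv(Ws)
    return Ws @ np.real(logm(Wis @ Z @ Wis)) @ Ws

def spd_exp(W, xi):
    # Exponential map Exp_W(xi) under the affine-invariant metric.
    Ws = np.real(sqrtm(W)); Wis = inv(Ws)
    return Ws @ expm(Wis @ xi @ Wis) @ Ws

def cost(W):
    # Karcher mean cost: average of halved squared geodesic distances.
    Wis = inv(np.real(sqrtm(W)))
    return sum(np.linalg.norm(np.real(logm(Wis @ Z @ Wis)), 'fro') ** 2 for Z in X) / (2 * N)

def rgrad_fn(W, n):
    # Riemannian gradient of f_n(W) = dist(W, X_n)^2 / 2, namely -Log_W(X_n).
    return -spd_log(W, X[n])

W = np.eye(d)
alpha = 0.5   # arbitrary fixed step-size for this illustration
for it in range(50):
    n = rng.integers(N)
    W = spd_exp(W, -alpha * rgrad_fn(W, n))
print(cost(np.eye(d)), cost(W))   # the cost should typically decrease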
F.2 Matrix completion problem on synthetic datasets
This section shows the results of six problem instances. Due to space limitations, we only show the loss on a test set $\Phi$, which is different from the training set $\Omega$. The loss on the test set indicates how quickly each algorithm reaches a good prediction accuracy for the missing entries.

Case MC-S1: baseline. We first show the results when the number of samples is $N = 5000$, the dimension is $d = 500$, the memory size is $L = 5$, the oversampling ratio (OS) is 6, and the condition number (CN) is 5. We also add Gaussian noise with $\sigma = 10^{-10}$. Figure A.3 shows the results of five runs, in which the proposed algorithm performs better than the other algorithms. (A sketch of how such a synthetic instance can be generated is given at the end of this subsection.)

Case MC-S2: influence of lower sampling. We look into problem instances with scarcely sampled data, i.e., OS is 4. The other conditions are the same as in Case MC-S1. From Figure A.4, the proposed algorithm gives a much better and more stable performance than the other algorithms.

Case MC-S3: influence of ill-conditioning. We consider problem instances with a higher condition number, CN = 10. The other conditions are the same as in Case MC-S1. Figure A.5 shows the superior performance of the proposed algorithm over the other algorithms.

Case MC-S4: influence of larger memory size. We evaluate the performance when the memory size $L$ is larger, namely $L = 10$. The other conditions are the same as in Case MC-S1. From Figure A.6, the proposed R-SQN-VR still outperforms the other algorithms.

Case MC-S5: influence of higher noise. We consider noisy problem instances with $\sigma = 10^{-6}$. The other conditions are the same as in Case MC-S1. Figure A.7 shows that the MSE values at convergence are much higher than in the other cases; nevertheless, the proposed R-SQN-VR remains superior to the other algorithms.

Case MC-S6: influence of higher rank. We consider problem instances with a higher rank, $r = 10$. The other conditions are the same as in Case MC-S1. From Figure A.8, the proposed R-SQN-VR still outperforms the other algorithms. In particular, the exponential mapping and parallel translation variant of the proposed method performs much better than the retraction and vector transport variant. In addition, Grouse decreases the MSE faster at the beginning of the iterations, but its MSE values at convergence are much higher than those of the others.
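The following sketch shows one way to generate a synthetic instance with a prescribed oversampling ratio and condition number. It assumes the common conventions $\mathrm{OS} = |\Omega| / (r(N + d - r))$ and $\mathrm{CN} = \sigma_1/\sigma_r$, and the rank value used below is an arbitrary illustrative choice; the exact data-generation procedure used for the experiments may differ.

import numpy as np

rng = np.random.default_rng(0)

def make_mc_instance(N=5000, d=500, r=5, OS=6, CN=5, sigma=1e-10):
    # Ground-truth rank-r factors; singular values spread over the target
    # condition number CN (assumed convention: CN = sigma_1 / sigma_r).
    U, _ = np.linalg.qr(rng.normal(size=(N, r)))
    V, _ = np.linalg.qr(rng.normal(size=(d, r)))
    s = np.logspace(0, -np.log10(CN), r)          # s[0] / s[-1] = CN
    A = (U * s) @ V.T

    # Observed entries: OS times the degrees of freedom of a rank-r matrix
    # (assumed convention: |Omega| = OS * r * (N + d - r)).
    dof = r * (N + d - r)
    num_obs = min(int(OS * dof), N * d)
    idx = rng.choice(N * d, size=num_obs, replace=False)
    rows, cols = np.unravel_index(idx, (N, d))

    # Noisy observations restricted to Omega.
    vals = A[rows, cols] + sigma * rng.normal(size=num_obs)
    return A, (rows, cols, vals)

A, (rows, cols, vals) = make_mc_instance()
print(len(vals) / (5 * (5000 + 500 - 5)))   # should be about 6 (the OS)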
Figure A.1: Performance evaluations on the Karcher mean problem (Case KM-1: small size instance). For each of Runs 1 to 5, panels (a), (b), and (c) show the train loss, the optimality gap, and the gradient norm, respectively.
Figure A.2: Performance evaluations on the Karcher mean problem (Case KM-2: large size instance). For each of Runs 1 to 5, panels (a), (b), and (c) show the train loss, the optimality gap, and the gradient norm, respectively.
Figure A.3: Performance evaluations on the low-rank matrix completion problem (Case MC-S1: baseline). Panels show Runs 1 to 5.
Figure A.4: Performance evaluations on the low-rank matrix completion problem (Case MC-S2: influence of lower sampling). Panels show Runs 1 to 5.
Figure A.5: Performance evaluations on the low-rank matrix completion problem (Case MC-S3: influence of ill-conditioning). Panels show Runs 1 to 5.
Figure A.6: Performance evaluations on the low-rank matrix completion problem (Case MC-S4: influence of larger memory size). Panels show Runs 1 to 5.
Figure A.7: Performance evaluations on the low-rank matrix completion problem (Case MC-S5: influence of higher noise). Panels show Runs 1 to 5.
Figure A.8: Performance evaluations on the low-rank matrix completion problem (Case MC-S6: influence of higher rank). Panels show Runs 1 to 5.
Case MC-S7: comparison in terms of processing time. Finally, we examine the performance of the proposed R-SQN-VR in terms of processing time. Because all the algorithms are implemented in Matlab, it is difficult, and arguably not very meaningful, to evaluate the absolute processing speed. However, the comparison with R-SGD and R-SVRG, whose code structures are similar to that of R-SQN-VR, still gives useful insights. Figure A.9 shows the relationship between the test MSE and the processing time [sec]; the result of "run 1" of each case is shown. First, it is noteworthy, but not surprising, that while the retraction and vector transport variants converge more slowly than the exponential mapping and parallel translation variants when measured by the number of gradient evaluations, they are much faster when measured by processing time. These results confirm the practical advantage of the retraction and the vector transport. Next, the SGD algorithm converges much faster than the others at the beginning because each of its iterations is much lighter, but, as in the earlier results, it converges to higher MSEs. Finally, and most remarkably, R-SQN-VR is still superior to R-SVRG in terms of processing time, even though R-SQN-VR requires one additional vector transport of a gradient in each inner iteration and L vector transports of the curvature pairs at every outer epoch. Consequently, we confirm the effectiveness of the proposed R-SQN-VR.
Figure A.9: Performance evaluations on the low-rank matrix completion problem (Case MC-S7: comparison in terms of processing time). Panels correspond to Case MC-S1 (baseline), Case MC-S2 (lower sampling), Case MC-S3 (ill-conditioning), Case MC-S4 (larger memory size), Case MC-S5 (higher noise), and Case MC-S6 (higher rank).
F.3 Matrix completion problem on MovieLens 1M dataset
Figures A.10 and A.11 show the results for the cases $r = 10$ (MC-R1: lower rank) and $r = 20$ (MC-R2: higher rank), respectively. They show the convergence plots of the training error on $\Omega$, the test error on $\Phi$, and the norm of the gradient for all five runs. The proposed R-SQN-VR gives a better performance than the other algorithms, especially when the rank is larger, i.e., $r = 20$.
Figure A.10: Performance evaluations on the low-rank matrix completion problem (MC-R1: lower rank). For each of Runs 1 to 5, panels (a), (b), and (c) show the train loss, the test loss, and the gradient norm, respectively.
Figure A.11: Performance evaluations on the low-rank matrix completion problem (MC-R2: higher rank). For each of Runs 1 to 5, panels (a), (b), and (c) show the train loss, the test loss, and the gradient norm, respectively.