arXiv:1802.05426v1 [math.OC] 15 Feb 2018
On Adaptive Cubic Regularized Newton's Methods for Convex Optimization via Random Sampling

Xi CHEN∗, Bo JIANG†, Tianyi LIN‡, Shuzhong ZHANG§

February 16, 2018
Abstract. In this paper, we consider an unconstrained optimization model where the objective is a sum of a large number of possibly nonconvex functions, though the overall objective is assumed to be smooth and convex. Our approach to solving such a model uses the framework of cubic regularization of Newton's method. As is well known, the crux of cubic regularization is its use of Hessian information, which may be computationally expensive for large-scale problems. To tackle this, we approximate the Hessian matrix via sub-sampling. In particular, we propose to compute an approximate Hessian matrix by either uniformly or non-uniformly sub-sampling the components of the objective. Based upon sub-sampling, we develop both standard and accelerated adaptive cubic regularization approaches and provide theoretical guarantees on the global iteration complexity. We show that the standard and accelerated sub-sampled cubic regularization methods achieve iteration complexities of order O(ε^{−1/2}) and O(ε^{−1/3}), respectively, which match those of the original standard and accelerated cubic regularization methods [12, 27] that use full Hessian information. The performance of the proposed methods on regularized logistic regression problems, measured in epochs on several real data sets, shows a clear effect of acceleration.
Keywords: sum of nonconvex functions, acceleration, parameter-free adaptive algorithm, cubic regularization, Newton’s method, random sampling, iteration complexity
1 Introduction
In this paper, we consider the following generic unconstrained sum-of-nonconvex optimization problem:

f^* := min_{x∈R^d} f(x) = min_{x∈R^d} (1/n) ∑_{i=1}^n f_i(x),   (1)
∗ Leonard N. Stern School of Business, New York University, New York, NY 10012, USA. Email: [email protected].
† Research Institute for Interdisciplinary Sciences, School of Information Management and Engineering, Shanghai University of Finance and Economics, Shanghai 200433, China. Email: [email protected].
‡ Department of Industrial Engineering and Operations Research, UC Berkeley, Berkeley, CA 94720, USA. Email: darren [email protected].
§ Department of Industrial and Systems Engineering, University of Minnesota, Minneapolis, MN 55455, USA. Email: [email protected].
where f : R^d → R is smooth and convex, while each component function f_i : R^d → R is smooth but possibly nonconvex. In addition, we assume f^* > −∞. A variety of machine learning and statistics applications can be cast into problem (1), where f_i is interpreted as the loss on the i-th observation, e.g., [16, 50, 29, 7, 23]. An important special case of problem (1) is

f^* := min_{x∈R^d} f(x) = min_{x∈R^d} (1/n) ∑_{i=1}^n f̂_i(a_i^⊤ x),   (2)
where f̂_i : R → R and a_i is the i-th observation. The formulation (2) finds a wide range of applications, for instance (regularized) maximum likelihood estimation for generalized linear models, including regularized least squares and regularized logistic regression. We refer interested readers to Section 1.1 for more applications of (1) and (2). Up till now, much of the effort devoted to solving problem (1) has been on developing stochastic first-order approaches (see, e.g., [47, 4]), due primarily to their simplicity in both theoretical analysis and practical implementation. However, stochastic gradient type algorithms are known to be sensitive to the conditioning of the problem and to the parameters to be tuned in the algorithm. By contrast, second-order optimization methods (see, e.g., [34]) have been shown to be generally robust [45, 46, 53] and less sensitive to parameter choices [5]. A downside, however, is that second-order type algorithms are prone to higher computational costs for large-scale problems, by nature of requiring second-order information (viz., the Hessian matrix). To alleviate this, one effective approach is the so-called sub-sampled second-order methods, which approximate the Hessian matrix via some randomized sampling scheme [13]. Recent work in the optimization community has sought to improve existing methods along two directions. The first is acceleration. [38, 39] pioneered the study of accelerated gradient-based algorithms for convex optimization. For stochastic convex optimization, [30] developed an accelerated stochastic gradient-based algorithm. Since then, numerous accelerated stochastic first-order methods have been proposed (see, e.g., [48, 17, 21, 2, 26]).
In contrast to stochastic first-order methods, results on accelerated stochastic (or sub-sampled) second-order approaches have been quite limited, as acceleration with second-order information is technically difficult. A recent paper [54] proposed a scheme to accelerate regularized sub-sampled Newton methods. However, the proposed algorithm requires knowledge of certain problem parameters, and the theoretical guarantee is established only for strongly convex quadratic objective functions. The second direction of improvement is to investigate adaptive optimization algorithms that require no prior knowledge of problem parameters such as the first- and second-order Lipschitz constants. From an implementation standpoint, it is desirable to design algorithms that adjust these parameters adaptively, since they are normally unknown a priori. A typical example is the adaptive gradient method (e.g., AdaGrad; see [14]), which has been popular in the machine learning community due to its robustness and effectiveness. However, such improvements, though highly desirable due to their relevance in machine learning, are largely lacking in the context of stochastic or sub-sampled second-order algorithms. As a matter of fact, we are unaware of any existing accelerated sub-sampling second-order method that is fully independent of problem parameters while maintaining a superior convergence rate. When the objective function f is nonconvex, sub-sampling adaptive cubic regularized Newton's methods [28, 52] are capable of reaching a critical point within an iteration bound of O(ε^{−3/2}). However, to the best of our knowledge, similar sub-sampling algorithms have not been studied for generic convex optimization problems, even
without acceleration. Recall that [40] proposed an accelerated cubic regularized Newton's method with a provable overall iteration complexity of O(ε^{−1/3}) for convex optimization. Therefore, a natural question arises: Can one develop an adaptive and accelerated sub-sampling cubic regularized method with an iteration complexity of O(ε^{−1/3})?
In this paper, we provide an affirmative answer to the above question and develop a novel sub-sampling cubic regularization method that is both adaptive and accelerated. To this end, we first investigate the standard sub-sampling cubic regularization method and then propose an accelerated version of the algorithm. The advantage of our algorithms is threefold. First, the size of the sub-sampled set is increased gradually and can be very small in the first few steps of the algorithm, leading to relatively low per-iteration computational cost. Second, our algorithms are fully adaptive and do not require any problem parameters. Third, we propose an accelerated second-order approach to stochastic optimization, which is far less studied than its first-order counterpart. In addition, our methods and analysis are flexible enough to allow both uniform (Lemma 2.3) and non-uniform (Lemma 2.4) sub-sampling techniques. In terms of iteration complexity, we establish a global convergence rate of O(ε^{−1/2}) (Theorem 3.2) for the standard sub-sampling cubic regularized method, and further show that an O(ε^{−1/3}) convergence rate (Theorem 4.5) can be achieved by the accelerated version. Both results match their deterministic counterparts presented in [12, 27], which assume the availability of full Hessian information. Besides, our algorithms only require an approximate solution to the cubic regularized sub-problem (see Condition 3.1), echoing similar conditions considered in [12, 27].
1.1 Examples
In this subsection, we provide a few examples in the form of (1) and (2) arising from machine learning applications. Examples with convex component functions are well known, e.g., the regularized least-squares problem

min_{x∈R^d} f(x) = (1/n) ∑_{i=1}^n (a_i^⊤ x − b_i)² + λ‖x‖²,

and the regularized logistic regression problem

min_{x∈R^d} f(x) = (1/n) ∑_{i=1}^n ln(1 + exp(−b_i · a_i^⊤ x)) + λ‖x‖²,
where a_i ∈ R^d and b_i denote the feature vector and the response of the i-th data point, respectively. We have b_i ∈ R for the least-squares loss and b_i ∈ {−1, +1} for logistic regression. The parameter λ > 0 is known as the regularization parameter. Below we provide a few examples where some components in the finite sum may be nonconvex. Consider, for instance, the nonconvex support vector machine [35, 51], where the objective function takes the form

min_{x∈R^d} f(x) := (1/n) ∑_{i=1}^n [1 − tanh(b_i · a_i^⊤ x)] + λ‖x‖²,
which is an instance of (1) with

f_i(x) = 1 − tanh(b_i · a_i^⊤ x) + λ‖x‖².
Indeed, for some choices of λ > 0 the overall objective is convex, while a few component functions may be nonconvex. Another example comes from principal component analysis (PCA). Consider a set of n data vectors a_1, …, a_n in R^d and the normalized covariance matrix A = (1/n) ∑_{j=1}^n a_j a_j^⊤; PCA aims to find the leading principal component. [18] proposed a new efficient optimization method for PCA by reducing the problem to solving a small number of convex optimization problems. One critical subroutine in the method of [18] is to solve

min_{x∈R^d} (1/2) x^⊤(μI − A)x + b^⊤x = min_{x∈R^d} (1/n) ∑_{j=1}^n [ (1/2) x^⊤(μI − a_j a_j^⊤)x + b^⊤x ],
where μ is larger than or equal to the maximum eigenvalue of A. Although the above formulation is a convex optimization problem, its component functions may be nonconvex.
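To make this concrete, here is a small numerical sketch (the data and names are ours, purely illustrative): choosing μ = λ_max(A) keeps the averaged objective convex, while every component j with ‖a_j‖² > μ has an indefinite Hessian μI − a_j a_j^⊤ and is therefore nonconvex.

```python
import numpy as np

# Component j of the PCA subroutine is f_j(x) = ½ xᵀ(μI − a_j a_jᵀ)x + bᵀx,
# which is convex iff μ ≥ ‖a_j‖², while the average f = (1/n)Σ f_j is convex
# as soon as μ ≥ λ_max(A).  Synthetic data, for illustration only.
rng = np.random.default_rng(0)
n, d = 50, 5
a = rng.normal(size=(n, d))
A = (a.T @ a) / n                      # normalized covariance (1/n)Σ a_j a_jᵀ
mu = np.linalg.eigvalsh(A)[-1]         # take μ = λ_max(A), so μI − A ⪰ 0
nonconvex = [j for j in range(n) if a[j] @ a[j] > mu]
print(f"{len(nonconvex)} of {n} components are nonconvex, yet f is convex")
```

Since λ_max(A) ≤ trace(A) = (1/n)∑_j ‖a_j‖², generic data of this kind always produces some nonconvex components.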
1.2 Related Works
The seminal work [44] triggered a burst of research interest in developing stochastic first-order methods. The main focus of these efforts, particularly within the machine learning community, has been on accelerating this type of optimization method (see, e.g., [30, 25, 19, 20, 21, 48, 49, 17, 32, 2, 31, 3, 26]). Despite their popularity and simplicity, stochastic first-order methods are known to be overly sensitive to ill-conditioned instances [45, 46] as well as to algorithmic parameters such as the choice of stepsize [5]. Regarding second-order methods (in particular Newton's method), there has recently been intensive research on designing stochastic variants suitable for large-scale applications, e.g., sub-sampling methods [8, 15, 45, 46, 53, 6, 1, 41, 33]. All these works assume that all the component functions are convex. For nonconvex optimization, [28] proposed a uniform sub-sampling strategy to approximate the Hessian in the cubic regularized Newton's method. However, in each step of that algorithm the sample size used to approximate the Hessian and the gradient is unknown until the cubic subproblem of the iteration is solved. [52] fixed this flaw by employing appropriate uniform and non-uniform sub-sampling strategies to construct Hessian approximations within the cubic regularization scheme for Newton's method. However, the iteration complexity established in [52] is O(ε^{−3/2}).
The literature on the acceleration of second-order methods for convex optimization is somewhat limited compared to its first-order counterpart. [40] improved the overall iteration complexity for convex optimization from O(ε^{−1/2}) to O(ε^{−1/3}) by means of the cubic regularized Newton's method. [37] managed to accelerate the Newton proximal extragradient method [36] with an improved iteration complexity of O(ε^{−2/7}). Notwithstanding their theoretical superiority, the accelerated second-order schemes presented in [40, 37] are not easily implementable in practice, since they assume knowledge of some Lipschitz constant of the Hessian. To alleviate this, [27] incorporated an adaptive strategy [10, 11, 12] into Nesterov's approach [40], and further relaxed the criterion for solving each sub-problem while
maintaining the iteration complexity of O(ε^{−1/3}) for convex optimization. However, a deterministic second-order method such as the one in [27] may be computationally costly, as it requires the full second-order information. Recently, [22] proposed an accelerated Newton's method with cubic regularization using inexact second-order information. However, a worst-case iteration bound of O(ε^{−1/3}) was not established in [22], although acceleration is indeed observed in their numerical experiments. Another recent work [54] proposes a novel way to accelerate stochastic second-order methods; however, the theoretical guarantee is established only for strongly convex quadratic objectives, and the algorithm also requires knowledge of some problem parameters.
1.3 Notations and Organization
Throughout the paper, we denote vectors by bold lower-case letters, e.g., x, and matrices by regular upper-case letters, e.g., X. The transpose of a real vector x is denoted by x^⊤. For a vector x and a matrix X, ‖x‖ and ‖X‖ denote the ℓ2 norm and the matrix spectral norm, respectively. ∇f(x) and ∇²f(x) are the gradient and the Hessian of f at x, respectively, and I denotes the identity matrix. For two symmetric matrices A and B, A ⪰ B indicates that A − B is symmetric positive semi-definite. The subscript, e.g., x_i, denotes the iteration counter. log(α) denotes the natural logarithm of a positive number α. The convention 0⁰ = 0 is imposed in the non-uniform setting. The inexact Hessian is denoted by H(x); for notational simplicity, we also write H_i for the inexact Hessian evaluated at the iterate x_i in iteration i, i.e., H_i := H(x_i). The calligraphic letter S denotes a collection of indices from {1, 2, …, n}, with potentially repeated items; its cardinality is denoted by |S|.

The rest of the paper is organized as follows. In Section 2, we introduce the assumptions used throughout this paper, together with two lemmas on the sample size needed to randomly construct an inexact Hessian matrix. The sub-sampling cubic regularized Newton's method and its accelerated counterpart are presented and analyzed in Sections 3 and 4, respectively. In Section 5, we present preliminary numerical results on regularized logistic regression, where the effect of acceleration together with low per-iteration computational cost is clearly observed. The details of all proofs can be found in the appendix.
2 Preliminaries
In this section, we first introduce the main definitions and assumptions used in the paper, and then present two lemmas on the construction of the inexact Hessian in random sampling, leaving the proofs to Appendix A.
2.1 Assumptions
Throughout this paper, we refer to the following definition of ε-optimality.

Definition 2.1 (ε-optimality). Given ε ∈ (0, 1), x ∈ R^d is said to be an ε-optimal solution to problem (1) if

f(x) − f(x^*) ≤ ε,   (3)
where x^* ∈ R^d is the global optimal solution to problem (1). To proceed, we make the following standard assumption regarding the gradient and the Hessian of the objective function f.

Assumption 2.1 The objective function f(x) in problem (1) is convex and twice differentiable. Each f_j(x) is possibly nonconvex but twice differentiable, with both the gradient and the Hessian Lipschitz continuous; i.e., there are 0 < L_j, ρ_j < ∞ such that for any x, y ∈ R^d we have

‖∇f_j(x) − ∇f_j(y)‖ ≤ L_j ‖x − y‖,   (4)

‖∇²f_j(x) − ∇²f_j(y)‖ ≤ ρ_j ‖x − y‖.   (5)

A consequence of (4) is that

‖∇²f_j(y)‖ ≤ L_j, ∀ y.   (6)

In the rest of the paper, we define L = max_j L_j > 0, L̄ = (1/n) ∑_{j=1}^n L_j > 0, and ρ̄ = (1/n) ∑_{j=1}^n ρ_j.
Assumption 2.2 The objective function f(x) in problem (1) has bounded level sets; namely, ‖x − x^*‖ ≤ D for all x ∈ R^d such that f(x) ≤ f(x0), where x^* is any global minimizer of f and D ≥ 1.
2.2 Random Sampling
When each f_i in (1) is convex, random sampling has proven to be a very effective approach to reducing the computational cost; see [15, 45, 46, 6, 53]. In this subsection, we show that such random sampling can indeed be employed in the setting considered in this paper. Suppose that the probability distribution of the sampling over the index set {1, 2, …, n} is p = {p_i}_{i=1}^n with Prob(ξ = i) = p_i ≥ 0 for i = 1, 2, …, n. Let S and |S| denote the sample collection and its cardinality, respectively, and define

H(x) = (1/n) (1/|S|) ∑_{j∈S} (1/p_j) ∇²f_j(x)   (7)
to be the sub-sampled Hessian. When n is very large, such random sampling can significantly reduce the per-iteration computational cost as |S| ≪ n.
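As an illustrative sketch of (7) (the helper name and synthetic data below are ours, not the paper's), the estimator draws |S| indices with replacement from a distribution p and averages the re-weighted component Hessians; by construction it is an unbiased estimator of ∇²f(x) for any valid p:

```python
import numpy as np

def subsampled_hessian(hessians, probs, m, rng):
    """Eq. (7): H(x) = (1/(n·|S|)) Σ_{j∈S} (1/p_j)·∇²f_j(x), with |S| = m
    indices drawn with replacement from the distribution p."""
    n = len(hessians)
    S = rng.choice(n, size=m, p=probs, replace=True)
    return sum(hessians[j] / probs[j] for j in S) / (n * m)

# Sanity check on synthetic components: the estimate approaches the full Hessian.
rng = np.random.default_rng(1)
n, d = 200, 3
comps = [np.diag(rng.normal(size=d)) for _ in range(n)]  # stand-ins for ∇²f_j(x)
full = sum(comps) / n
H = subsampled_hessian(comps, np.full(n, 1.0 / n), m=5000, rng=rng)
print(np.linalg.norm(H - full))   # small for large |S|
```

The estimator is unbiased because each draw j contributes (1/(n p_j))∇²f_j(x), whose expectation under p is exactly (1/n)∑_j ∇²f_j(x) = ∇²f(x).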
One natural strategy is to sample {1, 2, …, n} uniformly, i.e., p_i = 1/n. The following lemma reveals how many samples are required to obtain an approximate Hessian within a given accuracy when the indices are sampled uniformly with replacement.

Lemma 2.3 (Sample Size of Uniform Sampling) Suppose Assumption 2.1 holds for problem (1), and uniform sampling with replacement is performed to form the sub-sampled Hessian; that is, for x ∈ R^d, H(x) is constructed from (7) with p_j = 1/n and sample size

|S| ≥ max{ 16L²/ε², 4L/ε } · log(2d/δ)
for given 0 < ε, δ < 1, where L is defined as in Assumption 2.1. Then we have

Prob( ‖H(x) − ∇²f(x)‖ ≥ ε ) < δ.
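For a feel for the magnitudes involved, the bound of Lemma 2.3 can be evaluated directly (the function name is ours):

```python
import math

def uniform_sample_size(L, eps, delta, d):
    """Sample-size bound of Lemma 2.3: max(16L²/ε², 4L/ε)·log(2d/δ), rounded up."""
    return math.ceil(max(16.0 * L**2 / eps**2, 4.0 * L / eps) * math.log(2.0 * d / delta))

print(uniform_sample_size(L=1.0, eps=0.1, delta=0.01, d=100))
```

Note that the bound grows like 1/ε² but only logarithmically in the dimension d and in 1/δ, and it is independent of n, which is what makes sub-sampling attractive when n is very large.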
Intuitively, a more "informative" distribution can be constructed for problem (2), as opposed to simple uniform sampling. In fact, we can bias the probability distribution and pick the relevant f_i's in some sense to form the approximate Hessian, which can result in a much smaller sample set than that of uniform sampling. Specifically, the Hessian of f in problem (2) is

∇²f(x) = A^⊤ D A = (1/n) ∑_{j=1}^n f̂_j''(a_j^⊤ x) · a_j a_j^⊤,

where the rows of A are a_1^⊤, …, a_n^⊤ and D = (1/n) diag{ f̂_j''(a_j^⊤ x) }_{j=1}^n. Next we consider a non-uniform sampling
distribution p for problem (2).

Definition 2.2 Let

N = { j ∈ {1, 2, …, n} : f̂_j''(a_j^⊤ x) ≠ 0 },

and denote

p_j = |f̂_j''(a_j^⊤ x)| ‖a_j‖² / ∑_{k=1}^n |f̂_k''(a_k^⊤ x)| ‖a_k‖²,

where the absolute values are taken since f̂_j is possibly nonconvex, and let p_min = min_{j∈N} p_j.
The following lemma provides a sampling complexity for the construction of the approximate Hessian of problem (2).

Lemma 2.4 (Sample Size of Non-Uniform Sampling) Suppose Assumption 2.1 holds for problem (1), and non-uniform sampling is performed to form the sub-sampled Hessian; that is, for x ∈ R^d, H(x) is constructed from (7) with p as defined in Definition 2.2 and sample size

|S| ≥ max{ 4L̄²/ε², (2L̄/ε) · (n + 1/p_min − 2)/n } · log(2d/δ),

for given 0 < ε, δ < 1, where L̄ and p_min are defined in Assumption 2.1 and Definition 2.2, respectively. Then we have

Prob( ‖H(x) − ∇²f(x)‖ ≥ ε ) < δ.
Compared to Lemma 2.3, the sampling complexity provided in Lemma 2.4 can be much lower because L̄ ≤ L. In this case, non-uniform sampling is preferable when the distribution of the L_j is skewed, i.e., some L_j are much larger than the others and L̄ ≪ L. Such an advantage has been observed in the practical performance of randomized coordinate descent methods and sub-sampled Newton methods (see [42, 43, 53]).
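A hypothetical sketch of the probabilities of Definition 2.2 for problem (2) with the logistic loss f̂_j(t) = ln(1 + exp(−b_j t)), whose second derivative at t = a_j^⊤x is s(1 − s) with s = 1/(1 + exp(−b_j a_j^⊤x)); all names and data below are illustrative, not from the paper:

```python
import numpy as np

def nonuniform_probs(fpp, A):
    """Definition 2.2: p_j ∝ |f̂_j''(a_jᵀx)|·‖a_j‖², normalized to sum to one."""
    w = np.abs(fpp) * np.sum(A ** 2, axis=1)
    return w / w.sum()

# Illustrative data: logistic loss curvatures s(1−s) at a random point x.
rng = np.random.default_rng(2)
n, d = 6, 4
A = rng.normal(size=(n, d))
b = rng.choice([-1.0, 1.0], size=n)
x = rng.normal(size=d)
s = 1.0 / (1.0 + np.exp(-b * (A @ x)))
p = nonuniform_probs(s * (1.0 - s), A)
print(p)   # rows with large curvature and large ‖a_j‖² are sampled more often
```

Rows whose loss is locally flat (s near 0 or 1) receive probability close to zero, which is precisely the skew that makes non-uniform sampling pay off.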
Algorithm 1 Sub-Sampling Adaptive Cubic Regularized Newton's Method (SARC)
Given γ2 > γ1 > 1, η ∈ (0, 1), σ_min ∈ (0, 1), optimality tolerance ε ∈ (0, 1) and δ ∈ (0, 1).
Choose x0 ∈ R^d, σ0 ≥ σ_min, κθ ∈ (0, 1/2), and initial tolerance of Hessian approximation ε0 = min{1, (1 − κθ)‖∇f(x0)‖/3}.
for i = 0, 1, 2, … do
    if i = 0 or iteration i − 1 was successful then
        Construct H̃(x_i) according to (7) with sample size |S| satisfying
            |S| ≥ max{ 16L²/(ε_i/2)², 4L/(ε_i/2) } · log(2d ε^{−1/2}/δ)   (uniform sampling)
            |S| ≥ max{ 4L̄²/(ε_i/2)², (2L̄/(ε_i/2)) · (n + 1/p_min − 2)/n } · log(2d ε^{−1/2}/δ)   (non-uniform sampling)
        Let H(x_i) = H̃(x_i) + (ε_i/2) I.
    else
        H(x_i) = H(x_{i−1}).
    end if
    Compute s_i ∈ R^d such that s_i ≈ argmin_{s∈R^d} m(x_i, s, σ_i) according to Condition 3.1.
    Set θ_i = ( f(x_i) − f(x_i + s_i) ) / ( f(x_i) − m(x_i, s_i, σ_i) ).
    if θ_i ≥ η then   [successful iteration]
        Set x_{i+1} = x_i + s_i and σ_{i+1} ∈ [σ_min, σ_i];
        update tolerance of Hessian approximation ε_{i+1} = min{ ε_i, (1 − κθ)‖∇f(x_{i+1})‖/3 }.
    else
        x_{i+1} = x_i; σ_{i+1} ∈ [γ1σ_i, γ2σ_i].
    end if
end for
3 Sub-Sampling Adaptive Cubic Regularized Newton's Method
In this section, we propose the sub-sampling cubic regularized Newton’s method and provide its convergence rate.
3.1 The Algorithm
We consider the following approximation of f evaluated at x_i with cubic regularization [10, 11]:

m(x_i, s, σ_i) = f(x_i) + s^⊤∇f(x_i) + (1/2) s^⊤H(x_i)s + (σ_i/3)‖s‖³,   (8)

where σ_i > 0 is a regularization parameter adjusted as the algorithm progresses. In each iteration, we approximately solve

s_i ≈ argmin_{s∈R^d} m(x_i, s, σ_i),   (9)
where m(x_i, s, σ_i) is defined in (8) and the symbol "≈" is quantified as follows:

Condition 3.1 We call s_i an approximate solution, denoted s_i ≈ argmin_{s∈R^d} m(x_i, s, σ_i), of min_{s∈R^d} m(x_i, s, σ_i) if, for a certain subspace L_i with ∇f(x_i) ∈ L_i, we have

s_i = argmin_{s∈L_i} m(x_i, s, σ_i),   (10)

and, for a pre-specified constant 0 < κθ ≤ 2σ_min/3 < 1,

‖∇m(x_i, s_i, σ_i)‖ ≤ κθ min{ ‖∇f(x_i)‖, ‖∇f(x_i)‖³, ‖s_i‖² }.   (11)
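For intuition about the sub-problem in (9), note that at a minimizer the model gradient ∇m(s) = ∇f(x_i) + H(x_i)s + σ_i‖s‖s vanishes, i.e., (H + σ‖s‖I)s = −∇f(x_i). In the convex case (H ⪰ 0) this can be solved exactly by bisecting on r = ‖s‖; the sketch below is a toy solver for illustration, not the subspace scheme of Condition 3.1:

```python
import numpy as np

def cubic_subproblem(g, H, sigma, tol=1e-12):
    """Exact minimizer of m(s) = gᵀs + ½ sᵀHs + (σ/3)‖s‖³ for H ⪰ 0.
    Optimality gives (H + σ‖s‖·I)s = −g, so bisect on r = ‖s‖."""
    solve = lambda r: np.linalg.solve(H + sigma * r * np.eye(len(g)), -g)
    lo, hi = 0.0, 1.0
    while np.linalg.norm(solve(hi)) > hi:    # grow hi until it brackets the fixed point
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if np.linalg.norm(solve(mid)) > mid else (lo, mid)
    return solve(hi)

# Illustrative data: verify that the model gradient vanishes at the solution.
g = np.array([1.0, -2.0, 0.5])
Hm = np.diag([1.0, 2.0, 3.0])
sigma = 0.7
s = cubic_subproblem(g, Hm, sigma)
print(np.linalg.norm(g + Hm @ s + sigma * np.linalg.norm(s) * s))  # ≈ 0
```

The map r ↦ ‖(H + σrI)^{−1}g‖ is decreasing in r, so the bisection converges to the unique r with ‖s(r)‖ = r.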
Lemma 3.1 Suppose that in each iteration i of Algorithm 1 we have

‖∇²f(x_i) − H(x_i)‖ ≤ ε_i.   (12)

Let σ̄ = max{ σ0, (3γ2 + γ2ρ̄)/2, γ2L + γ2ρ̄ } > 0, and let SC denote the set of successful iterations. It holds that

T ≤ (1 + (2/log γ1) log(σ̄/σ_min)) |SC|.

The iteration complexity of Algorithm 1 is then described as follows.

Theorem 3.2 Let x^* be the global minimum of f, ε be the tolerance of optimality, ε_i be the tolerance of the Hessian approximation in (12) for iteration i, and δ be the probability that inequality (12) fails in at least one iteration. When Algorithm 1 runs

T = (1 + (2/log γ1) log(σ̄/σ_min)) · 2/(β√ε) = O(ε^{−1/2})   (13)

iterations, then with probability 1 − δ we have f(x_T) − f(x^*) ≤ ε, where β is defined as

β = min{ 2σ_min√η/(3√6 D^{3/2}), 1/(6CD), (1 − κθ)^{3/2}/(11943936 · κθ³ C² L⁶ D^{15/2}) } > 0.

The proofs of Lemma 3.1 and Theorem 3.2 can be found in Appendix B.
4 Accelerated Sub-Sampling Adaptive Cubic Regularized Newton's Method

In this section, we investigate the accelerated sub-sampling cubic regularization method and establish an even lower complexity bound.
4.1 The Algorithm
We consider the same cubic-regularized approximation of f as in (9), now evaluated at an extrapolation point y_i, and use the same symbol "≈" with a different meaning, quantified as follows:

Condition 4.1 We call s_i an approximate solution, denoted s_i ≈ argmin_{s∈R^d} m(y_i, s, σ_i), of min_{s∈R^d} m(y_i, s, σ_i) if the following holds:

‖∇m(y_i, s_i, σ_i)‖ ≤ κθ min(1, ‖s_i‖) min(‖s_i‖, ‖∇f(y_i)‖),   (14)

where κθ ∈ (0, 1/2) is a pre-specified constant.
Note that the bound on the right-hand side of (14) is slightly different from that of (11) in Condition 3.1. More importantly, compared to Condition 3.1 for the non-accelerated algorithm, the above approximation criterion does not require s_i to be optimal over a certain subspace L_i (i.e., (10)), and is hence weaker than the previous one. This relaxation opens up possibilities for other approximation methods to solve the sub-problem. For instance, [9] proposed to use the gradient descent method and proved that it works well even when the cubic regularized sub-problem is nonconvex. In our case, m(y_i, s, σ_i) is strongly convex, which implies that the gradient descent subroutine is expected to exhibit fast (linear) convergence. Now we propose the accelerated sub-sampling adaptive cubic regularization method in Algorithm 2. In particular, we adopt a two-phase scheme, where the acceleration is implemented in Phase II and an initial point to start the acceleration is obtained in Phase I. For the acceleration, we need to solve an additional cubic sub-problem in Phase II:

z_l = argmin_{z∈R^d} ψ_l(z),

where ψ_1(z) = (1/6) ς_1 ‖z − x̄_1‖³ and

ψ_l(z) = ψ_{l−1}(z) + (l(l+1)/2) [ f(x̄_{l−1}) + (z − x̄_{l−1})^⊤ ∇f(x̄_{l−1}) ] + (1/6)(ς_l − ς_{l−1}) ‖z − x̄_1‖³.

Fortunately, this problem admits a closed-form solution (see [40, 27] for details): writing ℓ_l for the linear part of ψ_l,

z_l = x̄_1 − √( 2 / (ς_l ‖∇ℓ_l(z_l)‖) ) ∇ℓ_l(z_l).
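The closed form can be verified directly: ψ_l is a linear function ℓ_l plus the cubic term (ς_l/6)‖z − x̄_1‖³, and the gradient of that sum vanishes at the stated point. A quick numerical check on synthetic data (all values below are illustrative):

```python
import numpy as np

# ψ(z) = c + gᵀz + (ς/6)‖z − x̄₁‖³, with g the (constant) gradient of the linear
# part, is minimized at z* = x̄₁ − sqrt(2/(ς‖g‖))·g.  Verify that
# ∇ψ(z*) = g + (ς/2)‖z* − x̄₁‖(z* − x̄₁) = 0 up to rounding.
rng = np.random.default_rng(3)
d = 5
x_bar, g, varsigma = rng.normal(size=d), rng.normal(size=d), 2.5
z = x_bar - np.sqrt(2.0 / (varsigma * np.linalg.norm(g))) * g
grad_psi = g + 0.5 * varsigma * np.linalg.norm(z - x_bar) * (z - x_bar)
print(np.linalg.norm(grad_psi))   # ≈ 0
```

Indeed, with z − x̄₁ = −√(2/(ς‖g‖))·g we get ‖z − x̄₁‖ = √(2‖g‖/ς), and the cubic term's gradient exactly cancels g.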
We remark that a direct extension of the accelerated cubic regularization method under inexact Hessian information fails to retain the superior convergence property theoretically (see [22]). Therefore, the two-phase scheme is necessary in our analysis to establish the accelerated rate of convergence.
Algorithm 2 Accelerated Sub-Sampling Adaptive Cubic Regularization for Newton's Method
Given γ2 > γ1 > 1, γ3 > 1, η ∈ (0, 1), σ_min ∈ (0, 1), optimality tolerance ε ∈ (0, 1) and δ ∈ (0, 1).
Choose x0 ∈ R^d, σ0 ≥ σ_min, κθ ∈ (0, 1) and initial tolerance of Hessian approximation ε0 = min{1, (1 − κθ)‖∇f(x0)‖/3}.

Begin Phase I:
Construct H̃(x0) according to (7) with sample size |S| satisfying
    |S| ≥ max{ 16L²/(ε0/2)², 4L/(ε0/2) } · log(2d ε^{−1/3}/δ)   (uniform sampling)
    |S| ≥ max{ 4L̄²/(ε0/2)², (2L̄/(ε0/2)) · (n + 1/p_min − 2)/n } · log(2d ε^{−1/3}/δ)   (non-uniform sampling)
Let H(x0) = H̃(x0) + (ε0/2) I.
for i = 0, 1, 2, … do
    Compute s_i ∈ R^d such that s_i ≈ argmin_{s∈R^d} m(x_i, s, σ_i) according to Condition 4.1;
    compute θ_i = m(x_i, s_i, σ_i) − f(x_i + s_i).
    if θ_i > 0 then   [successful iteration]
        x_{i+1} = x_i + s_i, σ_{i+1} ∈ [σ_min, σ_i], and update tolerance of Hessian approximation ε_{i+1} = min{1, (1 − κθ)‖∇f(x_{i+1})‖/3};
        record the total number of iterations T1 = i + 1; break.
    else
        x_{i+1} = x_i, σ_{i+1} ∈ [γ1σ_i, γ2σ_i], ε_{i+1} = ε_i.
    end if
end for
End Phase I.

Begin Phase II:
Set the count of successful iterations l = 1 and let x̄1 = x_{T1}.
Construct ψ1(z) = f(x̄1) + (1/6) ς1 ‖z − x̄1‖³, let z1 = argmin_{z∈R^d} ψ1(z), and choose y1 = (x̄1 + 3z1)/4.
Construct H̃(y1) according to (7) with sample size |S| satisfying
    |S| ≥ max{ 16L²/(ε_{T1}/2)², 4L/(ε_{T1}/2) } · log(2d ε^{−1/3}/δ)   (uniform sampling)
    |S| ≥ max{ 4L̄²/(ε_{T1}/2)², (2L̄/(ε_{T1}/2)) · (n + 1/p_min − 2)/n } · log(2d ε^{−1/3}/δ)   (non-uniform sampling)
Let H(y1) = H̃(y1) + (ε_{T1}/2) I.
for j = 0, 1, 2, … do
    Compute s_{T1+j} ∈ R^d such that s_{T1+j} ≈ argmin_{s∈R^d} m(y_l, s, σ_{T1+j}) according to Condition 4.1, and set
        ρ_{T1+j} = − s_{T1+j}^⊤ ∇f(y_l + s_{T1+j}) / ‖s_{T1+j}‖³.
    if ρ_{T1+j} ≥ η then   [successful iteration]
        x_{T1+j+1} = y_l + s_{T1+j}, σ_{T1+j+1} ∈ [σ_min, σ_{T1+j}], and update tolerance of Hessian approximation ε_{T1+j+1} = min{1, (1 − κθ)‖∇f(x_{T1+j+1})‖/3};
        set l = l + 1 and ς = ς_{l−1};
        update ψ_l(z) as illustrated above using ς_l = ς, and compute z_l = argmin_{z∈R^d} ψ_l(z);
        while ψ_l(z_l) < (l(l+1)(l+2)/6) f(x̄_l) do
            set ς = γ3ς, update ψ_l(z) using ς_l = ς, and recompute z_l = argmin_{z∈R^d} ψ_l(z)
        end while
3. T3, the total number of successful updates of ς > 0, is upper bounded by a constant (Lemma 4.3) if ‖H(x_{T1+j}) − ∇²f(x_{T1+j})‖ < ε_{T1+j}, ∀ j ≤ T2.

4. We relate the objective function to the count of successful iterations in Phase II (Theorem 4.4) if ‖H(x_i) − ∇²f(x_i)‖ < ε_i, ∀ i ≤ T1 + T2.

5. Denoting T = T1 + T2 + T3 ∈ O(1/ε^{1/3}), we set the per-iteration failure probability as δε^{1/3}; according to Lemmas 2.3 and 2.4, this ensures that ‖H(x_i) − ∇²f(x_i)‖ < ε_i, ∀ i ≤ T, with probability 1 − δ.

Putting all the pieces together, we obtain the iteration complexity result (Theorem 4.5).
We first prove Lemma 4.1 and Lemma 4.2, which describe the relation between the total number of iterations of Algorithm 2 and the number of successful iterations |SC| in Phase II.
Lemma 4.1 Suppose in each iteration i of Algorithm 2, we have ‖∇²f(x_i) − H(x_i)‖ ≤ ε_i, ∀ i ≤ T1. Denoting σ̄1 = max{ σ0, (3γ2 + γ2ρ̄)/2, γ2L + γ2ρ̄ } > 0, it holds that

T1 ≤ 1 + (2/log γ1) log(σ̄1/σ_min).
Lemma 4.2 Suppose in each iteration i of Algorithm 2, we have ‖∇²f(x_i) − H(x_i)‖ ≤ ε_i, ∀ T1 + 1 ≤ i ≤ T1 + T2. Denoting σ̄2 = max{ σ̄1, γ2ρ̄/2 + γ2κθ + γ2η + γ2, γ2L + γ2ρ̄ + 2γ2η } > 0 and letting SC be the set of successful iterations in Phase II of Algorithm 2, it holds that

T2 ≤ (1 + (2/log γ1) log(σ̄2/σ_min)) |SC|.
Then we estimate an upper bound on T3, i.e., the total number of successful updates of ς > 0.
Lemma 4.3 Suppose in each iteration i of Algorithm 2, we have ‖∇²f(x_i) − H(x_i)‖ ≤ ε_i, ∀ T1 + 1 ≤ i ≤ T1 + T2. It holds that

ψ_l(z_l) ≥ (l(l+1)(l+2)/6) f(x̄_l)   (15)

when ς_l ≥ ((ρ̄ + 2L + 2σ̄2 + 1)/(1 − κθ))³ · (1/η²) + 1, which further implies

T3 ≤ ⌈ (1/log γ3) · log( ((ρ̄ + 2L + 2σ̄2 + 1)/(1 − κθ))³ · 1/(η²ς1) ) ⌉.
Recall that l = 1, 2, … is the count of successful iterations, and the sequence {x̄_l, l = 1, 2, …} is updated whenever a successful iteration is identified.
Theorem 4.4 Suppose in each iteration i of Algorithm 2, we have ‖∇²f(x_i) − H(x_i)‖ ≤ ε_i, ∀ 1 ≤ i ≤ T1 + T2. Let x^* be the global minimum of f; then the sequence {x̄_l, l = 1, 2, …} generated by Algorithm 2 satisfies

(l(l+1)(l+2)/6) f(x̄_l) ≤ ψ_l(z_l) ≤ ψ_l(z)
≤ (l(l+1)(l+2)/6) f(z) + ((ρ̄ + 2σ̄1)/6) ‖z − x0‖³ + 2κθ(1 + κθ)L² ( (1/2)‖z − x0‖² + (1/σ_min)‖x0 − x^*‖² ) + (1/6) ς_l ‖z − x̄1‖³,

where σ̄1 is defined as

σ̄1 = max{ σ0, (3γ2 + γ2ρ̄)/2, γ2L + γ2ρ̄ } > 0.
After establishing Theorem 4.4, the iteration complexity of Algorithm 2 readily follows.

Theorem 4.5 Let x^* be the global minimum of f, ε be the tolerance of optimality, ε_i be the tolerance of the Hessian approximation in (12) for iteration i, and δ be the probability that inequality (12) fails in at least one iteration. When Algorithm 2 runs

T = [1 + (2/log γ1) log(σ̄1/σ_min)] + [1 + (2/log γ1) log(σ̄2/σ_min)] · [ (C/ε)^{1/3} + 1 ] + ⌈ (1/log γ3) log( ((ρ̄ + 2L + 2σ̄2 + 1)/(1 − κθ))³ · 1/(η²ς1) ) ⌉ = O(ε^{−1/3})

iterations (including the successful iterations used to update ς), then with probability 1 − δ we have

f(x_T) − f(x^*) ≤ ε,

where C is defined as

C = (ρ̄ + 2σ̄1) ‖x0 − x^*‖³ + ((ρ̄ + 2L + 2σ̄2 + 1)/(1 − κθ))³ (1/η²) ‖x̄1 − x^*‖³ + 12κθ(1 + κθ)L² (1/σ_min + 3) ‖x0 − x^*‖²,

and σ̄2 is defined as

σ̄2 = max{ σ̄1, γ2ρ̄/2 + γ2κθ + γ2η + γ2, γ2L + γ2ρ̄ + 2γ2η } > 0.

We postpone all the proofs in this section to Appendix C.
5 Numerical Experiments
In this section, we test the performance of our algorithms on the following regularized logistic regression problem:

min_{x∈R^d} f(x) = (1/n) ∑_{i=1}^n ln(1 + exp(−b_i · a_i^⊤ x)) + (λ/2)‖x‖²,   (16)

where (a_i, b_i)_{i=1}^n are the samples in the data set, and the regularization parameter is set to λ = 10^{−5}. The experiments are conducted on 6 LIBSVM datasets¹ for binary classification; a summary of these datasets is shown in Table 1.
Table 1: Statistics of datasets.

Dataset        | Number of Samples | Dimension
a9a            | 32,561            | 123
skin_nonskin   | 245,057           | 3
covtype        | 581,012           | 54
phishing       | 11,055            | 68
w8a            | 49,749            | 300
SUSY           | 5,000,000         | 18
In the tests, we implement Algorithm 1, referred to as SCR. Since the standard cubic regularized Newton's method admits a local quadratic convergence rate [26], we also implement a hybrid of Algorithm 1 and Algorithm 2, referred to as SACR, which starts with Algorithm 2 and switches to Algorithm 1 upon entering the region of local quadratic convergence. Specifically, we check the progress made by each iteration of Algorithm 2 and switch to Algorithm 1, with stopping criterion ‖∇f(x)‖ ≤ 10^{−9}, once |f(x_{i+1}) − f(x_i)| / |f(x_i)| ≤ 0.1 is identified. To observe the acceleration, the starting point is randomly generated from a Gaussian distribution with zero mean and a large variance (say 5000), so that initial solutions are likely to lie outside the "local quadratic" region.

We apply the so-called Lanczos process to approximately solve the cubic subproblem min_{s∈R^d} m(x_i, s, σ_i) in the implementation. In other words, m(x_i, s, σ_i) is minimized over a Krylov subspace

K := span{ ∇f(x_i), ∇²f(x_i)∇f(x_i), (∇²f(x_i))²∇f(x_i), … },

where the dimension of K is gradually increased and an orthogonal basis of each subspace K is built up, which typically involves one matrix-vector product per added dimension. Moreover, minimizing m(x_i, s, σ_i) over the Krylov subspace only involves factorizing a tridiagonal matrix, which can be done at a complexity of O(d). Conditions (11) and (14) are used as the termination criteria for the Lanczos process, in the hope of finding a suitable trial step before the dimension of K approaches d. We also apply 5 baseline algorithms to solve (16) for comparison: the adaptive cubic regularized Newton's method (CR), the accelerated adaptive cubic regularized Newton's method (ACR), the limited-memory Broyden-Fletcher-Goldfarb-Shanno method (L-BFGS), accelerated gradient descent (AGD), and standard stochastic gradient descent (SGD). We resort to SCIPY Solvers² to call L-BFGS. To make a fair comparison between the sub-sampling algorithms and the deterministic algorithms, we measure how the log-residual of the gradient decreases with respect to epochs. In particular, one epoch is counted whenever a full batch (i.e., n queries) of gradients or Hessians of the component functions has been consumed. Since the sample size in the sub-sampling algorithms is less than n, one epoch may span the queries of several iterations. The results are presented in Figure 1. We can see that SACR outperforms all the other variants of cubic regularization methods, with a clear effect of acceleration. Moreover, SACR is even comparable to L-BFGS, which is carefully tuned and optimized and available in an open solver (i.e., SCIPY Solvers).

¹ https://www.csie.ntu.edu.tw/~cjlin/libsvm/
References

[1] N. Agarwal, B. Bullins, and E. Hazan. Second order stochastic optimization in linear time. ArXiv Preprint: 1602.03943, 2016.
[2] Z. Allen-Zhu. Katyusha: the first direct acceleration of stochastic gradient methods. In STOC, pages 1200–1205. ACM, 2017.
[3] Z. Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. ArXiv Preprint: 1708.08694, 2017.
[4] Z. Allen-Zhu and Y. Yuan. Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. In ICML, pages 1080–1089, 2016.
[5] A. S. Berahas, R. Bollapragada, and J. Nocedal. An investigation of Newton-sketch and subsampled Newton methods. ArXiv Preprint: 1705.06211, 2017.

² https://docs.scipy.org/doc/scipy/reference/optimize.html#module-scipy.optimize
[Figure 1 consists of six panels, one per data set: a9a (n=32561, d=123), SUSY (n=5000000, d=18), skin_nonskin (n=245057, d=3), w8a (n=49749, d=300), phishing (n=11055, d=68), and covtype (n=581012, d=54). Each panel plots $\log(\|\nabla f\|)$ against epochs for SACR, SCR, ACR, CR, L-BFGS, AGD, and SGD.]

Figure 1: $\log(\|\nabla f\|)$ vs. epochs for SACR, SCR, and 5 baseline algorithms
[6] R. Bollapragada, R. Byrd, and J. Nocedal. Exact and inexact subsampled Newton methods for optimization. ArXiv Preprint: 1609.08502, 2016.
[7] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. ArXiv Preprint: 1606.04838, 2016.
[8] R. H. Byrd, G. M. Chin, W. Neveitt, and J. Nocedal. On the use of stochastic Hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3):977–995, 2011.
[9] Y. Carmon and J. Duchi. Gradient descent efficiently finds the cubic-regularized non-convex Newton step. ArXiv Preprint: 1612.00547v2, 2016.
[10] C. Cartis, N. I. M. Gould, and P. L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part I: Motivation, convergence and numerical results. Mathematical Programming, 127(2):245–295, 2011.
[11] C. Cartis, N. I. M. Gould, and P. L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part II: Worst-case function- and derivative-evaluation complexity. Mathematical Programming, 130(2):295–319, 2011.
[12] C. Cartis, N. I. M. Gould, and P. L. Toint. Evaluation complexity of adaptive cubic regularization methods for convex unconstrained optimization. Optimization Methods and Software, 27(2):197–219, 2012.
[13] P. Drineas and M. W. Mahoney. Lectures on randomized numerical linear algebra. ArXiv Preprint: 1712.08880, 2017.
[14] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
[15] M. A. Erdogdu and A. Montanari. Convergence rates of sub-sampled Newton methods. In NIPS, pages 3052–3060. MIT Press, 2015.
[16] J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, New York, 2001.
[17] R. Frostig, R. Ge, S. Kakade, and A. Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In ICML, pages 2540–2548, 2015.
[18] D. Garber and E. Hazan. Fast and simple PCA via convex optimization. ArXiv Preprint: 1509.05647, 2015.
[19] S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469–1492, 2012.
[20] S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 23(4):2061–2089, 2013.
[21] S. Ghadimi and G. Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1-2):59–99, 2016.
[22] S. Ghadimi, H. Liu, and T. Zhang. Second-order methods with cubic regularization under inexact information. ArXiv Preprint: 1710.05782, 2017.
[23] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
[24] D. Gross and V. Nesme. Note on sampling without replacing from a finite collection of matrices. ArXiv Preprint: 1001.2738, 2010.
[25] C. Hu, W. Pan, and J. T. Kwok. Accelerated gradient methods for stochastic optimization and online learning. In NIPS, pages 781–789, 2009.
[26] P. Jain, S. M. Kakade, R. Kidambi, P. Netrapalli, and A. Sidford. Accelerating stochastic gradient descent. ArXiv Preprint: 1704.08227, 2017.
[27] B. Jiang, T. Lin, and S. Zhang. A unified scheme to accelerate adaptive cubic regularization and gradient methods for convex optimization. ArXiv Preprint: 1710.04788, 2017.
[28] J. M. Kohler and A. Lucchi. Sub-sampled cubic regularization for non-convex optimization. ArXiv Preprint: 1705.05933, 2017.
[29] B. Kulis. Metric learning: A survey. Foundations and Trends in Machine Learning, 5(4):287–364, 2013.
[30] G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1):365–397, 2012.
[31] L. Lei, C. Ju, J. Chen, and M. I. Jordan. Non-convex finite-sum optimization via SCSG methods. In NIPS, pages 2345–2355, 2017.
[32] H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In NIPS, pages 3384–3392, 2015.
[33] X. Liu, C-J. Hsieh, J. D. Lee, and Y. Sun. An inexact subsampled proximal Newton-type method for large-scale machine learning. ArXiv Preprint: 1708.08552, 2017.
[34] D. G. Luenberger and Y. Ye. Linear and Nonlinear Programming, volume 2. Springer, 1984.
[35] L. Mason, J. Baxter, P. L. Bartlett, and M. R. Frean. Boosting algorithms as gradient descent. In NIPS, pages 512–518, 2000.
[36] R. D. C. Monteiro and B. F. Svaiter. Iteration-complexity of a Newton proximal extragradient method for monotone variational inequalities and inclusion problems. SIAM Journal on Optimization, 22(3):914–935, 2012.
[37] R. D. C. Monteiro and B. F. Svaiter. An accelerated hybrid proximal extragradient method for convex optimization and its implications to second-order methods. SIAM Journal on Optimization, 23(3):1092–1125, 2013.
[38] Yu. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k²). Dokl. Akad. Nauk SSSR, pages 543–547, 1983. (in Russian).
[39] Yu. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media, 2004.
[40] Yu. Nesterov. Accelerating the cubic regularization of Newton's method on convex problems. Mathematical Programming, 112(1):159–181, 2008.
[41] M. Pilanci and M. J. Wainwright. Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM Journal on Optimization, 27(1):205–245, 2017.
[42] Z. Qu and P. Richtárik. Coordinate descent with arbitrary sampling I: Algorithms and complexity. Optimization Methods and Software, 31(5):829–857, 2016.
[43] Z. Qu and P. Richtárik. Coordinate descent with arbitrary sampling II: Expected separable overapproximation. Optimization Methods and Software, 31(5):858–884, 2016.
[44] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
[45] F. Roosta-Khorasani and M. W. Mahoney. Sub-sampled Newton methods I: Globally convergent algorithms. ArXiv Preprint: 1601.04737, 2016.
[46] F. Roosta-Khorasani and M. W. Mahoney. Sub-sampled Newton methods II: Local convergence rates. ArXiv Preprint: 1601.04738, 2016.
[47] S. Shalev-Shwartz. SDCA without duality, regularization, and individual convexity. In ICML, pages 747–754, 2016.
[48] S. Shalev-Shwartz and T. Zhang. Accelerated mini-batch stochastic dual coordinate ascent. In NIPS, pages 378–385, 2013.
[49] S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In ICML, pages 64–72, 2014.
[50] S. Sra, S. Nowozin, and S. J. Wright. Optimization for Machine Learning. MIT Press, 2012.
[51] X. Wang, S. Ma, D. Goldfarb, and W. Liu. Stochastic quasi-Newton methods for nonconvex stochastic optimization. SIAM Journal on Optimization, 27(2):927–956, 2017.
[52] P. Xu, F. Roosta-Khorasani, and M. W. Mahoney. Newton-type methods for non-convex optimization under inexact Hessian information. ArXiv Preprint: 1708.07164, 2017.
[53] P. Xu, J. Yang, F. Roosta-Khorasani, C. Ré, and M. W. Mahoney. Sub-sampled Newton methods with non-uniform sampling. In NIPS, pages 3000–3008, 2016.
[54] H. Ye and Z. Zhang. Nesterov's acceleration for second order method. ArXiv Preprint: 1705.07171, 2017.
A  Proofs in Section 2.2
We first introduce the Operator-Bernstein Inequality from [24]. Let $C$ be a finite set and $1\le m\le|C|$. Define $X_j$ as a random variable taking values in $C$ with uniform probability, and let $X=(X_1,\ldots,X_m)$ be drawn from $C$ with replacement, so that the $X_j$'s are independently and identically distributed.

Theorem A.1 (Operator-Bernstein Inequality). Assume that $\mathbb{E}[X_j]=0$, the operator norm of $X_j$ satisfies $\|X_j\|\le c$, and the variance of $X_j$ satisfies $\left\|\mathbb{E}\left[X_j^2\right]\right\|\le\sigma_0^2$, where $c$ and $\sigma_0$ are constants. Define $S=\sum_{j=1}^m X_j$. Then
$$\mathrm{Prob}\left(\|S\|\ge t\right)\le\begin{cases}2d\cdot\exp\left(-\dfrac{t^2}{4m\sigma_0^2}\right), & \text{if } t\le\dfrac{2m\sigma_0^2}{c},\\[6pt] 2d\cdot\exp\left(-\dfrac{t}{2c}\right), & \text{if } t\ge\dfrac{2m\sigma_0^2}{c}.\end{cases}$$

Then we proceed to prove Lemma 2.3 and Lemma 2.4.
Proof of Lemma 2.3: Let $C=\{1,2,\ldots,n\}$ and let $S\subset C$ be a sample set. For $k=1,2,\ldots,|S|$, we construct a sequence of independently and identically distributed random matrices $Z_k$ such that
$$\mathrm{Prob}\left(Z_k=\nabla^2 f_j(x)-\nabla^2 f(x)\right)=\frac1n,\quad\forall\,j=1,2,\ldots,n.$$
Note that for each $j=1,2,\ldots,n$,
$$\left\|\nabla^2 f_j(x)-\nabla^2 f(x)\right\|=\left\|\frac{n-1}{n}\nabla^2 f_j(x)-\frac1n\sum_{\ell\ne j}\nabla^2 f_\ell(x)\right\|\overset{(4)}{\le}\frac{2(n-1)}{n}L\le 2L,$$
and thus $\|Z_k\|\le 2L$. As a result,
$$\left\|\mathbb{E}\left[Z_k^2\right]\right\|\le\mathbb{E}\left[\|Z_k\|^2\right]\le 4L^2.$$
Furthermore,
$$\mathbb{E}[Z_k]=\frac1n\sum_{j=1}^n\left(\nabla^2 f_j(x)-\nabla^2 f(x)\right)=0,$$
and observe that
$$H(x)-\nabla^2 f(x)=\frac{1}{|S|}\sum_{k=1}^{|S|}Z_k.$$
Therefore, according to Theorem A.1, it holds that
$$\mathrm{Prob}\left(\left\|H(x)-\nabla^2 f(x)\right\|\ge\epsilon\right)=\mathrm{Prob}\left(\left\|\sum_{k=1}^{|S|}Z_k\right\|\ge|S|\epsilon\right)\le\begin{cases}2d\cdot\exp\left(-\dfrac{\epsilon^2|S|}{16L^2}\right), & \text{if }\epsilon\le 4L,\\[6pt] 2d\cdot\exp\left(-\dfrac{\epsilon|S|}{4L}\right), & \text{if }\epsilon\ge 4L.\end{cases}$$
Simply letting
$$|S|\ge\max\left\{\frac{16L^2}{\epsilon^2},\,\frac{4L}{\epsilon}\right\}\cdot\log\left(\frac{2d}{\delta}\right)$$
yields that
$$\mathrm{Prob}\left(\left\|H(x)-\nabla^2 f(x)\right\|\ge\epsilon\right)<\delta. \qquad\Box$$
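As an illustration of the uniform sub-sampling scheme analyzed in Lemma 2.3, the following Python sketch averages component Hessians drawn uniformly with replacement. The toy diagonal component Hessians are hypothetical stand-ins for $\nabla^2 f_j(x)$; the sample size required for a given $(\epsilon,\delta)$ is the one stated in the lemma, not anything the snippet enforces.

```python
import numpy as np

def subsampled_hessian_uniform(hessians, sample_size, rng):
    """Uniform sub-sampling with replacement (the scheme of Lemma 2.3):
    H(x) = (1/|S|) * sum over sampled j of the component Hessians."""
    n = len(hessians)
    idx = rng.integers(0, n, size=sample_size)
    return sum(hessians[j] for j in idx) / sample_size

rng = np.random.default_rng(0)
n, d = 200, 6
# Toy component Hessians (diagonal, positive), standing in for grad^2 f_j(x).
comps = [np.diag(rng.uniform(0.0, 1.0, size=d)) for _ in range(n)]
full = sum(comps) / n                                # exact Hessian of f
L = max(np.linalg.norm(Hj, 2) for Hj in comps)       # bound (4): ||grad^2 f_j|| <= L

H_approx = subsampled_hessian_uniform(comps, sample_size=500, rng=rng)
err = np.linalg.norm(H_approx - full, 2)             # spectral-norm error
```

By the triangle inequality the error can never exceed $2L$ (the bound $\|Z_k\|\le 2L$ from the proof), and the centering step $\mathbb{E}[Z_k]=0$ corresponds to the fact that the component Hessians average exactly to the full Hessian.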
Proof of Lemma 2.4: Let $C=\{1,2,\ldots,n\}$ and let $S\subset C$ be a sample set. For $k=1,2,\ldots,|S|$, we construct a sequence of independently and identically distributed random matrices $Z_k$ such that
$$\mathrm{Prob}\left(Z_k=\frac{\hat f_j''(a_j^\top x)}{np_j}a_ja_j^\top-\nabla^2 f(x)\right)=p_j,\quad\forall\,j=1,2,\ldots,n,$$
where $p=\{p_1,\ldots,p_n\}$ is defined as
$$p_j=\frac{\left|\hat f_j''(a_j^\top x)\right|\|a_j\|^2}{\sum_{\ell=1}^n\left|\hat f_\ell''(a_\ell^\top x)\right|\|a_\ell\|^2}.$$
Recall that
$$N=\left\{j\in\{1,2,\ldots,n\}:\hat f_j''(a_j^\top x)\ne0\right\}.$$
For any $j\in N$ it holds that
$$\left\|\frac{\hat f_j''(a_j^\top x)}{np_j}a_ja_j^\top-\nabla^2 f(x)\right\|=\left\|\frac{1-p_j}{np_j}\hat f_j''(a_j^\top x)a_ja_j^\top-\frac1n\sum_{\ell\ne j}\hat f_\ell''(a_\ell^\top x)a_\ell a_\ell^\top\right\|\overset{(4)}{\le}\frac{n+1/p_{\min}-2}{n}L,$$
where $p_{\min}=\min_{j\in N}\{p_j\}>0$, and thus $\|Z_k\|\le\frac{n+1/p_{\min}-2}{n}L$. Furthermore, we have
$$\mathbb{E}[Z_k]=\sum_{j\in N}p_j\cdot\frac{\hat f_j''(a_j^\top x)}{np_j}a_ja_j^\top-\sum_{j\in N}p_j\cdot\nabla^2 f(x)=\frac1n\sum_{j\in N}\hat f_j''(a_j^\top x)a_ja_j^\top-\nabla^2 f(x)=0,$$
and (noting that terms with $j\notin N$ vanish)
$$\begin{aligned}
\left\|\mathbb{E}\left[Z_k^2\right]\right\|&=\left\|\sum_{j\in N}p_j\left(\frac{\hat f_j''(a_j^\top x)}{np_j}a_ja_j^\top-\nabla^2 f(x)\right)^2\right\|\\
&=\left\|\sum_{j\in N}p_j\left(\frac{\hat f_j''(a_j^\top x)}{np_j}a_ja_j^\top\right)^2-\left(\nabla^2 f(x)\right)^2\right\|\\
&\le\left\|\sum_{j\in N}p_j\left(\frac{\hat f_j''(a_j^\top x)}{np_j}a_ja_j^\top\right)^2\right\|\\
&=\left\|\frac{1}{n^2}\sum_{j\in N}\frac{1}{p_j}\left(\hat f_j''(a_j^\top x)\right)^2\|a_j\|^2\,a_ja_j^\top\right\|\\
&=\left(\frac1n\sum_{\ell=1}^n\left|\hat f_\ell''(a_\ell^\top x)\right|\|a_\ell\|^2\right)\cdot\left\|\frac1n\sum_{j\in N}\left|\hat f_j''(a_j^\top x)\right|a_ja_j^\top\right\|\\
&\le\left(\frac1n\sum_{j=1}^n\left|\hat f_j''(a_j^\top x)\right|\|a_j\|^2\right)^2\le\bar L^2,
\end{aligned}$$
where the first inequality is due to
$$\sum_{j\in N}p_j\left(\frac{\hat f_j''(a_j^\top x)}{np_j}a_ja_j^\top\right)^2-\left(\nabla^2 f(x)\right)^2=\mathbb{E}\left[Z_k^2\right]\succeq0.$$
Note that $H(x)-\nabla^2 f(x)=\frac{1}{|S|}\sum_{k=1}^{|S|}Z_k$. Therefore, we can apply Theorem A.1 and get
$$\mathrm{Prob}\left(\left\|H(x)-\nabla^2 f(x)\right\|\ge\epsilon\right)=\mathrm{Prob}\left(\left\|\sum_{k=1}^{|S|}Z_k\right\|\ge|S|\epsilon\right)\le\begin{cases}2d\cdot\exp\left(-\dfrac{\epsilon^2|S|}{4\bar L^2}\right), & \text{if }\epsilon\le\dfrac{2\bar L^2}{L}\cdot\dfrac{n}{n+1/p_{\min}-2},\\[6pt] 2d\cdot\exp\left(-\dfrac{\epsilon|S|}{2L}\cdot\dfrac{n}{n+1/p_{\min}-2}\right), & \text{if }\epsilon\ge\dfrac{2\bar L^2}{L}\cdot\dfrac{n}{n+1/p_{\min}-2}.\end{cases}$$
Simply letting
$$|S|\ge\max\left\{\frac{4\bar L^2}{\epsilon^2},\,\frac{2L}{\epsilon}\cdot\frac{n+1/p_{\min}-2}{n}\right\}\cdot\log\left(\frac{2d}{\delta}\right)$$
yields that
$$\mathrm{Prob}\left(\left\|H(x)-\nabla^2 f(x)\right\|\ge\epsilon\right)<\delta. \qquad\Box$$
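The non-uniform scheme from the proof of Lemma 2.4 can be sketched numerically as follows. The matrix `A` of rows $a_j$ and the positive curvature values `curv`, standing in for $\hat f_j''(a_j^\top x)$, are made-up toy data; the point of the snippet is that sampling index $j$ with probability $p_j\propto|\hat f_j''(a_j^\top x)|\,\|a_j\|^2$ and reweighting each term by $1/(np_j)$ keeps the estimator unbiased.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 5
A = rng.standard_normal((n, d))          # rows a_j
curv = rng.uniform(0.1, 2.0, size=n)     # hypothetical values of f_j''(a_j^T x) > 0

# Non-uniform probabilities from the proof of Lemma 2.4: p_j ∝ |f_j''| * ||a_j||^2.
weights = curv * (A ** 2).sum(axis=1)
p = weights / weights.sum()

def subsampled_hessian_nonuniform(sample_size):
    idx = rng.choice(n, size=sample_size, p=p)
    # Importance-weighted rank-one terms (f_j'' / (n p_j)) a_j a_j^T.
    return sum(curv[j] / (n * p[j]) * np.outer(A[j], A[j]) for j in idx) / sample_size

full = (A.T * curv) @ A / n              # exact Hessian (1/n) sum_j f_j'' a_j a_j^T
H_approx = subsampled_hessian_nonuniform(400)
# Unbiasedness: sum_j p_j * (f_j'' / (n p_j)) a_j a_j^T recovers the exact Hessian.
expectation = sum(p[j] * curv[j] / (n * p[j]) * np.outer(A[j], A[j]) for j in range(n))
```

The `expectation` computation mirrors the step $\mathbb{E}[Z_k]=0$ in the proof: the probabilities cancel exactly, so the estimator's mean equals the full Hessian regardless of how skewed $p$ is.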
B  Proofs in Section 3
We first prove Lemma 3.1, which describes the relation between the total number of iterations $T$ in Algorithm 1 and the number of successful iterations $|SC|$.

Proof of Lemma 3.1: We have
$$\begin{aligned}
f(x_i+s_i)&=f(x_i)+s_i^\top\nabla f(x_i)+\frac12 s_i^\top\nabla^2 f(x_i)s_i+\int_0^1(1-\tau)\,s_i^\top\left[\nabla^2 f(x_i+\tau s_i)-\nabla^2 f(x_i)\right]s_i\,d\tau \qquad (17)\\
&\le f(x_i)+s_i^\top\nabla f(x_i)+\frac12 s_i^\top\nabla^2 f(x_i)s_i+\frac{\bar\rho}{6}\|s_i\|^3\\
&=m(x_i,s_i,\sigma_i)+\frac12 s_i^\top\left[\nabla^2 f(x_i)-H(x_i)\right]s_i+\left(\frac{\bar\rho}{6}-\frac{\sigma_i}{3}\right)\|s_i\|^3\\
&\overset{(12)}{\le}m(x_i,s_i,\sigma_i)+\frac{\epsilon_i}{2}\|s_i\|^2+\left(\frac{\bar\rho}{6}-\frac{\sigma_i}{3}\right)\|s_i\|^3,
\end{aligned}$$
where the inequalities hold true due to Assumption 2.1. Next we argue that when $\sigma_i$ exceeds a certain constant, it holds that $f(x_i+s_i)\le m(x_i,s_i,\sigma_i)$. The analysis is conducted according to the value of $\|s_i\|$ in two cases.

1. When $\|s_i\|\ge1$, we have
$$f(x_i+s_i)\le m(x_i,s_i,\sigma_i)+\left(\frac{\epsilon_i}{2}+\frac{\bar\rho}{6}-\frac{\sigma_i}{3}\right)\|s_i\|^3,$$
which in combination with the fact that $\epsilon_i\le\epsilon_0\le1$ leads to
$$\sigma_i\ge\frac{3+\bar\rho}{2}\quad\Longrightarrow\quad f(x_i+s_i)\le m(x_i,s_i,\sigma_i).$$

2. When $\|s_i\|<1$, according to Condition 3.1, it holds that
$$\begin{aligned}
\kappa_\theta\|\nabla f(x_i)\|&\ge\|\nabla m(x_i,s_i,\sigma_i)\|=\left\|\nabla f(x_i)+H(x_i)s_i+\sigma_i\|s_i\|\cdot s_i\right\|\\
&\ge\|\nabla f(x_i)\|-\|H(x_i)\|\,\|s_i\|-\sigma_i\|s_i\|^2\ge\|\nabla f(x_i)\|-(L+\sigma_i)\|s_i\|,
\end{aligned}$$
where the last inequality holds true since $\|s_i\|<1$ and each component function has a bounded Hessian as in (6). This implies that
$$\|s_i\|\ge\frac{(1-\kappa_\theta)\|\nabla f(x_i)\|}{L+\sigma_i}. \qquad (18)$$
Moreover, note that
$$f(x_i+s_i)\le m(x_i,s_i,\sigma_i)+\left(\frac{\epsilon_i}{2\|s_i\|}+\frac{\bar\rho}{6}-\frac{\sigma_i}{3}\right)\|s_i\|^3,$$
and combining the above two inequalities yields that
$$\frac{\epsilon_i(L+\sigma_i)}{2(1-\kappa_\theta)\|\nabla f(x_i)\|}+\frac{\bar\rho}{6}-\frac{\sigma_i}{3}\le0\quad\Longrightarrow\quad f(x_i+s_i)\le m(x_i,s_i,\sigma_i).$$
Recall that
$$\epsilon_i=\min\left\{\epsilon_{i-1},\,\frac{(1-\kappa_\theta)\|\nabla f(x_i)\|}{3}\right\}\le\frac{(1-\kappa_\theta)\|\nabla f(x_i)\|}{3}, \qquad (19)$$
then it suffices to show
$$\frac{L+\sigma_i}{6}+\frac{\bar\rho}{6}-\frac{\sigma_i}{3}\le0.$$
That is,
$$\sigma_i\ge L+\bar\rho\quad\Longrightarrow\quad f(x_i+s_i)\le m(x_i,s_i,\sigma_i).$$

In summary, we have concluded that
$$\sigma_i\ge\max\left\{\frac{3+\bar\rho}{2},\,L+\bar\rho\right\}\quad\Longrightarrow\quad f(x_i+s_i)\le m(x_i,s_i,\sigma_i). \qquad (20)$$
On the other hand, we observe that
$$\begin{aligned}
f(x_i)-m(x_i,s_i,\sigma_i)&=-s_i^\top\nabla f(x_i)-\frac12 s_i^\top H(x_i)s_i-\frac{\sigma_i}{3}\|s_i\|^3\\
&=\frac12 s_i^\top H(x_i)s_i+\frac{2\sigma_i}{3}\|s_i\|^3-s_i^\top\left(\nabla f(x_i)+H(x_i)s_i+\sigma_i\|s_i\|s_i\right)\\
&\ge\frac12 s_i^\top H(x_i)s_i+\frac{2\sigma_{\min}}{3}\|s_i\|^3-\left\|\nabla f(x_i)+H(x_i)s_i+\sigma_i\|s_i\|s_i\right\|\,\|s_i\|\\
&\overset{(11)}{\ge}\frac12 s_i^\top H(x_i)s_i+\frac{2\sigma_{\min}}{3}\|s_i\|^3-\kappa_\theta\|s_i\|^3\\
&\ge\frac12 s_i^\top H(x_i)s_i\ge0.
\end{aligned}$$
Therefore,
$$\sigma_i\ge\max\left\{\frac{3+\bar\rho}{2},\,L+\bar\rho\right\}\quad\Longrightarrow\quad\frac{f(x_i)-f(x_i+s_i)}{f(x_i)-m(x_i,s_i,\sigma_i)}\ge1>\eta,$$
which further implies that
$$\sigma_{i+1}\le\sigma_i\le\gamma_2\cdot\sigma_{i-1}\le\gamma_2\cdot\max\left\{\frac{3+\bar\rho}{2},\,L+\bar\rho\right\},\quad\forall\,i\in SC\setminus\{0\}.$$
Hence, for any $i\in SC$, $\sigma_i$ can be bounded above by $\bar\sigma=\max\left\{\sigma_0,\frac{3\gamma_2+\gamma_2\bar\rho}{2},\gamma_2L+\gamma_2\bar\rho\right\}$. In addition, it follows from Algorithm 1 that $\sigma_{\min}\le\sigma_i$ for all iterations, and $\gamma_1\sigma_i\le\sigma_{i+1}$ for all unsuccessful iterations. Therefore, we have
$$\frac{\bar\sigma}{\sigma_{\min}}\ge\frac{\sigma_T}{\sigma_0}=\prod_{i\in SC}\frac{\sigma_{i+1}}{\sigma_i}\cdot\prod_{i\notin SC}\frac{\sigma_{i+1}}{\sigma_i}\ge\gamma_1^{T-|SC|}\left(\frac{\sigma_{\min}}{\bar\sigma}\right)^{|SC|}.$$
Consequently,
$$T\le|SC|+\frac{(|SC|+1)\log\left(\frac{\bar\sigma}{\sigma_{\min}}\right)}{\log\gamma_1}\le\left(1+\frac{2\log\left(\frac{\bar\sigma}{\sigma_{\min}}\right)}{\log\gamma_1}\right)|SC|. \qquad\Box$$
To establish the iteration complexity of Algorithm 1, we provide the following lemma as a technical preparation.

Lemma B.1 Assume that in each iteration of Algorithm 1 the Hessian approximation $H(x_i)$ satisfies (12). Suppose that $\mathcal L_i$ is a subspace of $\mathbb R^d$ with $\nabla f(x_i)\in\mathcal L_i$. Then for any successful iteration $i\in SC$, it holds that
$$\min_{s\in\mathcal L_i}m(x_i,s,\sigma_i)\le f(x_i)-\frac{1}{3D\sqrt{6CD}}\left(f(x_i)-f(x_i+s_i^*)\right)^{\frac32}, \qquad (21)$$
where $s_i^*\in\operatorname{argmin}_{s\in\mathcal L_i}f(x_i+s)$, and $C=\max\left\{\frac{3+\bar\rho+2\bar\sigma}{6},\frac{L+\bar\rho+3\bar\sigma}{6}\right\}$.

Proof. We have
$$\begin{aligned}
m(x_i,s_i,\sigma_i)&=f(x_i)+s_i^\top\nabla f(x_i)+\frac12 s_i^\top H(x_i)s_i+\frac{\sigma_i}{3}\|s_i\|^3\\
&=f(x_i)+s_i^\top\nabla f(x_i)+\frac12 s_i^\top\nabla^2 f(x_i)s_i+\frac12 s_i^\top\left[H(x_i)-\nabla^2 f(x_i)\right]s_i+\frac{\sigma_i}{3}\|s_i\|^3\\
&\le f(x_i+s_i)+\left(\frac{\bar\rho}{6}+\frac{\sigma_i}{3}\right)\|s_i\|^3+\frac{\epsilon_i}{2}\|s_i\|^2,
\end{aligned}$$
where the inequality is due to (17), (12) and Assumption 2.1. The subsequent analysis is conducted according to the value of $\|s_i\|$ in two cases.

1. When $\|s_i\|\ge1$, since $\epsilon_i\le\epsilon_0\le1$, it holds that
$$m(x_i,s_i,\sigma_i)\le f(x_i+s_i)+\left(\frac{\epsilon_i}{2}+\frac{\bar\rho}{6}+\frac{\sigma_i}{3}\right)\|s_i\|^3\le f(x_i+s_i)+\frac{3+\bar\rho+2\bar\sigma}{6}\|s_i\|^3.$$

2. When $\|s_i\|<1$, inequalities (18) and (19) still hold. Therefore, we have
$$m(x_i,s_i,\sigma_i)\le f(x_i+s_i)+\left(\frac{\epsilon_i}{2\|s_i\|}+\frac{\bar\rho}{6}+\frac{\sigma_i}{3}\right)\|s_i\|^3\le f(x_i+s_i)+\left(\frac{L+\sigma_i}{6}+\frac{\bar\rho}{6}+\frac{\sigma_i}{3}\right)\|s_i\|^3\le f(x_i+s_i)+\frac{L+\bar\rho+3\bar\sigma}{6}\|s_i\|^3.$$

In summary, we have concluded that
$$m(x_i,s_i,\sigma_i)\le f(x_i+s_i)+C\|s_i\|^3. \qquad (22)$$
Minimizing both sides of (22) gives that
$$\min_{s\in\mathcal L_i}m(x_i,s,\sigma_i)\le\min_{s\in\mathcal L_i}\left[f(x_i+s)+C\|s\|^3\right]\le\min_{\alpha\in[0,1]}\left[f(x_i+\alpha s_i^*)+C\alpha^3\|s_i^*\|^3\right],$$
where the second inequality follows from the fact that $\alpha s_i^*\in\mathcal L_i$ for all $\alpha\in[0,1]$. Moreover, from the convexity of $f$, $f(x_i+\alpha s_i^*)\le(1-\alpha)f(x_i)+\alpha f(x_i+s_i^*)$ for all $\alpha\in[0,1]$, and hence
$$\min_{s\in\mathcal L_i}m(x_i,s,\sigma_i)\le f(x_i)+\min_{\alpha\in[0,1]}\left[\alpha\left(f(x_i+s_i^*)-f(x_i)\right)+C\alpha^3\|s_i^*\|^3\right].$$
Recall that in the proof of Lemma 3.1 we have shown $f(x_i)\ge m(x_i,s_i,\sigma_i)$. Thus $f(x_{i+1})\le f(x_i)\le f(x_0)$ for any $i\in SC$. Then Assumption 2.2 guarantees that $\|x_i-x^*\|\le D$ and $\|x_i+s_i^*-x^*\|\le D$ since $f(x_i+s_i^*)\le f(x_i)$, where $x^*$ is a global minimizer of problem (1). Consequently, $\|s_i^*\|\le2D$ and we have
$$\min_{s\in\mathcal L_i}m(x_i,s,\sigma_i)\le f(x_i)+\min_{\alpha\in[0,1]}\left[\alpha\left(f(x_i+s_i^*)-f(x_i)\right)+8C\alpha^3D^3\right]. \qquad (23)$$
The convexity of $f$ and $\nabla f(x^*)=0$ yield that
$$f(x_i)-f(x_i+s_i^*)\le-(s_i^*)^\top\nabla f(x_i)\le\|s_i^*\|\cdot\|\nabla f(x_i)\|\le2D\cdot\|\nabla f(x_i)-\nabla f(x^*)\|\le2LD^2\le24CD^3.$$
Therefore the minimum on the right-hand side of (23) is attained at
$$\alpha_i^*=\min\left\{1,\frac{\sqrt{f(x_i)-f(x_i+s_i^*)}}{2D\sqrt{6CD}}\right\}=\frac{\sqrt{f(x_i)-f(x_i+s_i^*)}}{2D\sqrt{6CD}}.$$
Finally, plugging $\alpha_i^*$ into (23) gives that
$$\min_{s\in\mathcal L_i}m(x_i,s,\sigma_i)\le f(x_i)-\frac{1}{3D\sqrt{6CD}}\left(f(x_i)-f(x_i+s_i^*)\right)^{\frac32}. \qquad\Box$$
Lemma B.2 Suppose $H(x_i)$ is constructed as in Algorithm 1. Then $H(x_i)\succeq0$ and $\left\|\nabla^2 f(x_i)-H(x_i)\right\|\le\epsilon_i$ with probability $1-\epsilon^{1/2}\delta$, where $\epsilon$ is the tolerance of optimality.

Proof. Recall that $H(x_i)=\tilde H(x_i)+\frac{\epsilon_i}{2}I$, where $\tilde H(x_i)$ is formed according to (7). By Lemma 2.3 and Lemma 2.4,
$$\left\|\nabla^2 f(x_i)-\tilde H(x_i)\right\|\le\frac{\epsilon_i}{2}$$
with probability $1-\epsilon^{1/2}\delta$. Then
$$\left\|\nabla^2 f(x_i)-H(x_i)\right\|=\left\|\nabla^2 f(x_i)-\tilde H(x_i)-\frac{\epsilon_i}{2}I\right\|\le\left\|\nabla^2 f(x_i)-\tilde H(x_i)\right\|+\frac{\epsilon_i}{2}\|I\|\le\epsilon_i,$$
and since $f$ is convex,
$$H(x_i)=\tilde H(x_i)+\frac{\epsilon_i}{2}I\succeq\nabla^2 f(x_i)-\frac{\epsilon_i}{2}I+\frac{\epsilon_i}{2}I=\nabla^2 f(x_i)\succeq0. \qquad\Box$$

The following result is standard. However, we provide a proof here for completeness.

Lemma B.3 Let $x^*$ be a global minimizer of $f$. Suppose the gradient of $f$ is Lipschitz continuous with Lipschitz constant $L$. Then it holds that
$$\|\nabla f(x)\|^2\le2L\left(f(x)-f(x^*)\right). \qquad (24)$$

Proof. Since $\nabla f(x)$ is Lipschitz continuous, it holds that
$$f(y)\le f(x)+(y-x)^\top\nabla f(x)+\frac L2\|y-x\|^2.$$
Replacing $y$ with $x-\frac1L\nabla f(x)$ in the above inequality gives that
$$f(y)-f(x)\le-\frac{1}{2L}\|\nabla f(x)\|^2. \qquad (25)$$
Equivalently, we have $2L(f(x)-f(x^*))\ge2L(f(x)-f(y))\ge\|\nabla f(x)\|^2=\|\nabla f(x)-\nabla f(x^*)\|^2$. $\qquad\Box$

Now we are ready to establish the iteration complexity.

Proof of Theorem 3.2: According to Lemma B.2, and since $T=O(\epsilon^{-1/2})$, we have
$$\left\|\nabla^2 f(x_i)-H(x_i)\right\|\le\epsilon_i$$
for all $i\le T$ with probability $1-\delta$. For any $i\in SC$,
$$f(x_{i+1})\le(1-\eta)f(x_i)+\eta\cdot m(x_i,s_i,\sigma_i)=(1-\eta)f(x_i)+\eta\cdot\left[m(x_i,s_i,\sigma_i)-m(x_i,s_i^m,\sigma_i)\right]+\eta\cdot m(x_i,s_i^m,\sigma_i), \qquad (26)$$
where $s_i^m$ denotes the global minimizer of $m(x_i,s,\sigma_i)$ over $\mathbb R^d$. From the choice of $T$ and Lemma B.2, we have that $H(x_i)$ is positive semi-definite and inequality (12) holds for all $i\le T$ with probability $1-\delta$. Therefore, $m(x_i,s,\sigma_i)$ is convex and
$$m(x_i,s_i,\sigma_i)-m(x_i,s_i^m,\sigma_i)\le\nabla_s m(x_i,s_i,\sigma_i)^\top(s_i-s_i^m)\le\left\|\nabla_s m(x_i,s_i,\sigma_i)\right\|\left\|s_i-s_i^m\right\|\overset{(11)}{\le}\kappa_\theta\|\nabla f(x_i)\|^3\left\|s_i-s_i^m\right\|.$$
To bound $\|s_i-s_i^m\|$, we observe that
$$\sigma_{\min}\|s\|^3\le\sigma_i\|s\|^3=s^\top\left[\nabla_s m(x_i,s,\sigma_i)-\nabla f(x_i)-H(x_i)s\right]\le s^\top\left[\nabla_s m(x_i,s,\sigma_i)-\nabla f(x_i)\right]\le\|s\|\left[\|\nabla f(x_i)\|+\|\nabla m(x_i,s,\sigma_i)\|\right]\overset{(11)}{\le}(1+\kappa_\theta)\|s\|\,\|\nabla f(x_i)\|,$$
where $s=s_i$ or $s=s_i^m$. This implies
$$\left\|s_i-s_i^m\right\|\le\|s_i\|+\|s_i^m\|\le2\sqrt{\frac{(1+\kappa_\theta)\|\nabla f(x_i)\|}{\sigma_{\min}}}.$$
Further combining with Assumption 2.1 yields that
$$\begin{aligned}
m(x_i,s_i,\sigma_i)-m(x_i,s_i^m,\sigma_i)&\le\frac{2\kappa_\theta\sqrt{1+\kappa_\theta}}{\sqrt{\sigma_{\min}}}\|\nabla f(x_i)\|^{\frac72}=\frac{2\kappa_\theta\sqrt{1+\kappa_\theta}}{\sqrt{\sigma_{\min}}}\sqrt{\|\nabla f(x_i)\|}\cdot\|\nabla f(x_i)\|^3\\
&\overset{(24)}{\le}\frac{2\kappa_\theta\sqrt{1+\kappa_\theta}}{\sqrt{\sigma_{\min}}}\sqrt{\|\nabla f(x_i)\|}\cdot\left[2L\left(f(x_i)-f(x^*)\right)\right]^{\frac32}=\frac{4\kappa_\theta L\sqrt{2L(1+\kappa_\theta)}}{\sqrt{\sigma_{\min}}}\sqrt{\|\nabla f(x_i)\|}\cdot\left[f(x_i)-f(x^*)\right]^{\frac32}.
\end{aligned}$$
We let $\mathcal L_i=\mathbb R^d$ and $s_i^*=x^*-x_i$ in Lemma B.1; then $s_i^*$ is a global minimizer of $f(x_i+s)$ and $f(x_i+s_i^*)=f(x^*)$. Now invoking Lemma B.1 yields that
$$m(x_i,s_i^m,\sigma_i)\le f(x_i)-\frac{1}{3D\sqrt{6CD}}\left[f(x_i)-f(x^*)\right]^{\frac32}.$$
Combining the above two inequalities with (26) and recalling $\eta\le1$, we conclude that
$$f(x_{i+1})\le f(x_i)+\left[\frac{4\eta\kappa_\theta L\sqrt{2L(1+\kappa_\theta)}}{\sqrt{\sigma_{\min}}}\sqrt{\|\nabla f(x_i)\|}-\frac{\eta}{3D\sqrt{6CD}}\right]\left[f(x_i)-f(x^*)\right]^{\frac32}. \qquad (27)$$
Note that $x_i$ can only be updated in the successful iterations. Without loss of generality, we rename the indices in $SC$ as $\{1,2,\cdots\}$, with $\hat x_i$ being the point obtained in the $i$-th successful iteration. Let $\Delta_i=f(\hat x_i)-f(x^*)$; then (27) implies that
$$\Delta_i-\Delta_{i+1}\ge\left[\frac{\eta}{3D\sqrt{6CD}}-\frac{4\eta\kappa_\theta L\sqrt{2L(1+\kappa_\theta)}}{\sqrt{\sigma_{\min}}}\sqrt{\|\nabla f(\hat x_i)\|}\right]\Delta_i^{\frac32}=\frac{\eta}{3D\sqrt{6CD}}\cdot\left[1-\frac{12D\sqrt{6CD}\,\kappa_\theta L\sqrt{2L(1+\kappa_\theta)}}{\sqrt{\sigma_{\min}}}\sqrt{\|\nabla f(\hat x_i)\|}\right]\Delta_i^{\frac32}.$$
Next, we conduct our analysis in terms of the value of $\|\nabla f(\hat x_i)\|$ in two cases:

1. When $\sqrt{\|\nabla f(\hat x_i)\|}\le\frac{\sqrt{\sigma_{\min}}}{24D\sqrt{6CD}\,\kappa_\theta L\sqrt{2L(1+\kappa_\theta)}}$, we have
$$\Delta_i-\Delta_{i+1}\ge\frac{\eta}{6D\sqrt{6CD}}\cdot\Delta_i^{\frac32}\ge\beta\Delta_i^{\frac32}.$$

2. When $\sqrt{\|\nabla f(\hat x_i)\|}>\frac{\sqrt{\sigma_{\min}}}{24D\sqrt{6CD}\,\kappa_\theta L\sqrt{2L(1+\kappa_\theta)}}$, we denote $\hat s_i\approx\operatorname{argmin}_{s\in\mathbb R^d}m(\hat x_i,s,\hat\sigma_i)$ and $s_i^*=\operatorname{argmin}_{s\in\mathcal L_i}f(\hat x_i+s)$, and thus $\hat s_i=\operatorname{argmin}_{s\in\mathcal L_i}m(\hat x_i,s,\hat\sigma_i)$. By Lemma B.1, we have
$$f(\hat x_{i+1})\le(1-\eta)f(\hat x_i)+\eta\cdot m(\hat x_i,\hat s_i,\hat\sigma_i)\le f(\hat x_i)-\frac{\eta}{3D\sqrt{6CD}}\left(f(\hat x_i)-f(\hat x_i+s_i^*)\right)^{\frac32},\quad\forall\,i\in SC.$$
Since $\nabla f(\hat x_i)\in\mathcal L_i$,
$$f(\hat x_i)-f(\hat x_i+s_i^*)\ge f(\hat x_i)-f\left(\hat x_i-\frac1L\nabla f(\hat x_i)\right)\overset{(25)}{\ge}\frac{1}{2L}\|\nabla f(\hat x_i)\|^2\ge\frac{\|\nabla f(\hat x_i)\|}{2LD}\Delta_i,$$
where the last inequality follows from Assumption 2.2. Therefore, we have
$$\Delta_i-\Delta_{i+1}=f(\hat x_i)-f(\hat x_{i+1})\ge\frac{\eta}{3D\sqrt{6CD}}\cdot\left(\frac{\|\nabla f(\hat x_i)\|}{2LD}\right)^{\frac32}\Delta_i^{\frac32}\ge\frac{\eta\,\sigma_{\min}^{\frac32}}{11943936\cdot\kappa_\theta^3C^2L^6D^{\frac{15}{2}}(1+\kappa_\theta)^{\frac32}}\,\Delta_i^{\frac32}\ge\beta\Delta_i^{\frac32}.$$

In summary, we have concluded that
$$\Delta_i-\Delta_{i+1}\ge\beta\Delta_i^{\frac32},\quad\forall\,i\in SC.$$
This implies that
$$\frac{1}{\sqrt{\Delta_{i+1}}}-\frac{1}{\sqrt{\Delta_i}}=\frac{\Delta_i-\Delta_{i+1}}{\sqrt{\Delta_i}\sqrt{\Delta_{i+1}}\left(\sqrt{\Delta_i}+\sqrt{\Delta_{i+1}}\right)}\ge\frac{\beta\Delta_i^{\frac32}}{\sqrt{\Delta_i}\sqrt{\Delta_{i+1}}\left(\sqrt{\Delta_i}+\sqrt{\Delta_{i+1}}\right)}\ge\frac\beta2,$$
where the last inequality holds true since $f(\hat x_{i+1})\le m(\hat x_i,\hat s_i,\hat\sigma_i)\le f(\hat x_i)$ and thus $\Delta_i\ge\Delta_{i+1}$. Summing up the above inequalities over all $i\in SC$, we obtain that
$$\frac{1}{\sqrt{\Delta_i}}\ge\frac{1}{\sqrt{\Delta_0}}+|SC|\cdot\frac\beta2\ge|SC|\cdot\frac\beta2.$$
As a result,
$$|SC|\ge\frac{2}{\beta\sqrt\epsilon}\ \Longrightarrow\ \frac{1}{\sqrt{\Delta_i}}\ge\frac{1}{\sqrt\epsilon}\ \Longrightarrow\ 0<\Delta_i\le\epsilon,$$
completing the proof. $\qquad\Box$
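The counting argument at the end of the proof can be checked numerically: iterating the worst-case recursion $\Delta_{i+1}=\Delta_i-\beta\Delta_i^{3/2}$ drives $\Delta_i$ below $\epsilon$ within $2/(\beta\sqrt\epsilon)$ successful iterations, which is the $O(\epsilon^{-1/2})$ complexity. The values of $\beta$, $\Delta_0$ and $\epsilon$ below are arbitrary toy choices, not constants from the theorem.

```python
def successful_iterations_to_eps(delta0, beta, eps):
    """Simulate the per-successful-iteration decrease
    Delta_{i+1} = Delta_i - beta * Delta_i^{3/2}
    and count the iterations until Delta_i <= eps."""
    delta, count = delta0, 0
    while delta > eps:
        delta -= beta * delta ** 1.5
        count += 1
    return count

eps = 1e-4
count = successful_iterations_to_eps(delta0=1.0, beta=0.01, eps=eps)
bound = 2.0 / (0.01 * eps ** 0.5)   # the bound |SC| <= 2/(beta*sqrt(eps))
```

The simulated count always sits below the theoretical bound, because the proof's telescoping inequality $1/\sqrt{\Delta_{i+1}}-1/\sqrt{\Delta_i}\ge\beta/2$ holds exactly for this recursion.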
C  Proofs in Section 4
We first prove Lemma 4.1 and Lemma 4.2, which describe the relation between the total numbers of iterations in Algorithm 2 and the number of successful iterations $|SC|$ in Phase II.

Proof of Lemma 4.1: Obviously, (20) in the proof of Lemma 3.1 still holds for the iterates $\{x_i\}$ in Phase I of Algorithm 2, which implies that $\sigma_i<\max\left\{\frac{3+\bar\rho}{2},L+\bar\rho\right\}$ for $i\le T_1-2$. Moreover,
$$\sigma_{T_1}\le\sigma_{T_1-1}\le\gamma_2\,\sigma_{T_1-2}\le\gamma_2\max\left\{\frac{3+\bar\rho}{2},\,L+\bar\rho\right\}.$$
Then it holds that $\sigma_i\le\bar\sigma_1$ for any $i\le T_1$, since $\bar\sigma_1=\max\left\{\sigma_0,\frac{3\gamma_2+\gamma_2\bar\rho}{2},\gamma_2L+\gamma_2\bar\rho\right\}$. On the other hand, it follows from the construction of Algorithm 2 that $\sigma_{\min}\le\sigma_i$ for all iterations, and $\gamma_1\sigma_i\le\sigma_{i+1}$ for all unsuccessful iterations. Consequently, we have
$$\frac{\bar\sigma_1}{\sigma_{\min}}\ge\frac{\sigma_{T_1}}{\sigma_0}=\frac{\sigma_{T_1}}{\sigma_{T_1-1}}\cdot\prod_{j=0}^{T_1-2}\frac{\sigma_{j+1}}{\sigma_j}\ge\gamma_1^{T_1-1}\,\frac{\sigma_{\min}}{\bar\sigma_1},$$
and hence
$$T_1\le1+\frac{2\log\left(\frac{\bar\sigma_1}{\sigma_{\min}}\right)}{\log\gamma_1}. \qquad\Box$$
Proof of Lemma 4.2: We have
$$\begin{aligned}
&s_{T_1+j}^\top\nabla f(y_l+s_{T_1+j})\\
&=s_{T_1+j}^\top\left[\nabla f(y_l+s_{T_1+j})-\nabla f(y_l)-\nabla^2 f(y_l)s_{T_1+j}\right]+s_{T_1+j}^\top\left[\nabla f(y_l)+\nabla^2 f(y_l)s_{T_1+j}\right]\\
&\le\left\|\nabla f(y_l+s_{T_1+j})-\nabla f(y_l)-\nabla^2 f(y_l)s_{T_1+j}\right\|\|s_{T_1+j}\|+s_{T_1+j}^\top\nabla m(y_l,s_{T_1+j},\sigma_{T_1+j})\\
&\qquad+s_{T_1+j}^\top\left[\nabla^2 f(y_l)-H(y_l)\right]s_{T_1+j}-\sigma_{T_1+j}\|s_{T_1+j}\|^3\\
&\overset{\text{Condition 4.1}}{\le}\left\|\nabla f(y_l+s_{T_1+j})-\nabla f(y_l)-\nabla^2 f(y_l)s_{T_1+j}\right\|\|s_{T_1+j}\|+(\kappa_\theta-\sigma_{T_1+j})\|s_{T_1+j}\|^3+\epsilon_{T_1+j}\|s_{T_1+j}\|^2\\
&=\left\|\int_0^1\left[\nabla^2 f(y_l+\tau\cdot s_{T_1+j})-\nabla^2 f(y_l)\right]s_{T_1+j}\,d\tau\right\|\|s_{T_1+j}\|+(\kappa_\theta-\sigma_{T_1+j})\|s_{T_1+j}\|^3+\epsilon_{T_1+j}\|s_{T_1+j}\|^2\\
&\le\left(\frac{\bar\rho}{2}+\kappa_\theta-\sigma_{T_1+j}\right)\|s_{T_1+j}\|^3+\epsilon_{T_1+j}\|s_{T_1+j}\|^2,
\end{aligned}$$
where the last inequality is due to Assumption 2.1. Next we argue that when $\sigma_{T_1+j}$ exceeds a certain constant, it holds that
$$-\frac{s_{T_1+j}^\top\nabla f(y_l+s_{T_1+j})}{\|s_{T_1+j}\|^3}\ge\eta.$$
The analysis is conducted according to the value of $\|s_{T_1+j}\|$ in two cases.

1. When $\|s_{T_1+j}\|\ge1$, we have
$$s_{T_1+j}^\top\nabla f(y_l+s_{T_1+j})\le\left(\frac{\bar\rho}{2}+\kappa_\theta-\sigma_{T_1+j}+\epsilon_{T_1+j}\right)\|s_{T_1+j}\|^3,$$
which combined with $\epsilon_{T_1+j}\le\epsilon_0\le1$ implies that
$$\sigma_{T_1+j}\ge\frac{\bar\rho}{2}+\kappa_\theta+\eta+1\quad\Longrightarrow\quad-\frac{s_{T_1+j}^\top\nabla f(y_l+s_{T_1+j})}{\|s_{T_1+j}\|^3}\ge\eta.$$

2. When $\|s_{T_1+j}\|<1$, according to Condition 4.1, it holds that
$$\begin{aligned}
\kappa_\theta\|\nabla f(y_l)\|&\ge\|\nabla m(y_l,s_{T_1+j},\sigma_{T_1+j})\|=\left\|\nabla f(y_l)+H(y_l)s_{T_1+j}+\sigma_{T_1+j}\|s_{T_1+j}\|\cdot s_{T_1+j}\right\|\\
&\ge\|\nabla f(y_l)\|-\|H(y_l)\|\,\|s_{T_1+j}\|-\sigma_{T_1+j}\|s_{T_1+j}\|^2\ge\|\nabla f(y_l)\|-(L+\sigma_{T_1+j})\|s_{T_1+j}\|,
\end{aligned}$$
where the last inequality holds true since $\|s_{T_1+j}\|<1$ and each component function has a bounded Hessian as in (6). This implies that
$$\|s_{T_1+j}\|\ge\frac{(1-\kappa_\theta)\|\nabla f(y_l)\|}{L+\sigma_{T_1+j}}. \qquad (28)$$
Moreover, note that
$$s_{T_1+j}^\top\nabla f(y_l+s_{T_1+j})\le\left(\frac{\bar\rho}{2}+\kappa_\theta-\sigma_{T_1+j}+\frac{\epsilon_{T_1+j}}{\|s_{T_1+j}\|}\right)\|s_{T_1+j}\|^3.$$
Combining the above two inequalities yields that
$$\frac{\epsilon_{T_1+j}(L+\sigma_{T_1+j})}{(1-\kappa_\theta)\|\nabla f(y_l)\|}+\frac{\bar\rho}{2}-\sigma_{T_1+j}+\eta\le0\quad\Longrightarrow\quad-\frac{s_{T_1+j}^\top\nabla f(y_l+s_{T_1+j})}{\|s_{T_1+j}\|^3}\ge\eta.$$
Recall that
$$\epsilon_{T_1+j}=\min\left\{\epsilon_{T_1+j-1},\,\frac{(1-\kappa_\theta)\|\nabla f(y_l)\|}{2}\right\}\le\frac{(1-\kappa_\theta)\|\nabla f(y_l)\|}{2},$$
then it suffices to show
$$\frac{L+\sigma_{T_1+j}}{2}+\frac{\bar\rho}{2}-\sigma_{T_1+j}+\eta\le0.$$
That is,
$$\sigma_{T_1+j}\ge L+\bar\rho+2\eta\quad\Longrightarrow\quad-\frac{s_{T_1+j}^\top\nabla f(y_l+s_{T_1+j})}{\|s_{T_1+j}\|^3}\ge\eta.$$

In summary, we have concluded that
$$\sigma_{T_1+j}\ge\max\left\{\frac{\bar\rho}{2}+\kappa_\theta+\eta+1,\ L+\bar\rho+2\eta\right\}\quad\Longrightarrow\quad-\frac{s_{T_1+j}^\top\nabla f(y_l+s_{T_1+j})}{\|s_{T_1+j}\|^3}\ge\eta,$$
which further implies that for any unsuccessful iteration $j\notin SC$,
$$\sigma_{T_1+j}\le\max\left\{\frac{\bar\rho}{2}+\kappa_\theta+\eta+1,\ L+\bar\rho+2\eta\right\}.$$
Therefore, for any successful iteration $j\in SC$,
$$\sigma_{T_1+j+1}\le\sigma_{T_1+j}\le\gamma_2\cdot\sigma_{T_1+j-1}\le\gamma_2\max\left\{\frac{\bar\rho}{2}+\kappa_\theta+\eta+1,\ L+\bar\rho+2\eta\right\}.$$
Consequently, for any $0\le j\le T_2$, $\sigma_{T_1+j}$ is bounded above by
$$\bar\sigma_2=\max\left\{\bar\sigma_1,\ \frac{\gamma_2\bar\rho}{2}+\gamma_2\kappa_\theta+\gamma_2\eta+\gamma_2,\ \gamma_2L+\gamma_2\bar\rho+2\gamma_2\eta\right\},$$
where $\bar\sigma_1$ accounts for the upper bound of $\sigma_{T_1}$. In addition, it follows from the construction of Algorithm 2 that $\sigma_{\min}\le\sigma_{T_1+j}$ for all iterations, and $\gamma_1\sigma_{T_1+j}\le\sigma_{T_1+j+1}$ for all unsuccessful iterations. Therefore, we have
$$\frac{\bar\sigma_2}{\sigma_{\min}}\ge\frac{\sigma_{T_1+T_2}}{\sigma_{T_1}}=\prod_{j\in SC}\frac{\sigma_{T_1+j+1}}{\sigma_{T_1+j}}\cdot\prod_{j\notin SC}\frac{\sigma_{T_1+j+1}}{\sigma_{T_1+j}}\ge\gamma_1^{T_2-|SC|}\left(\frac{\sigma_{\min}}{\bar\sigma_2}\right)^{|SC|},$$
hence
$$T_2\le|SC|+\frac{(|SC|+1)\log\left(\frac{\bar\sigma_2}{\sigma_{\min}}\right)}{\log\gamma_1}\le\left(1+\frac{2\log\left(\frac{\bar\sigma_2}{\sigma_{\min}}\right)}{\log\gamma_1}\right)|SC|. \qquad\Box$$

To proceed, we need Lemma 3.3 and Lemma 3.4 in [27], which are restated as follows.
Lemma C.1 For any $s\in\mathbb R^d$ and $g\in\mathbb R^d$, it holds that
$$s^\top g+\frac\sigma3\|s\|^3\ge-\frac{2}{3\sqrt\sigma}\|g\|^{\frac32}.$$

Lemma C.2 Letting $z_l=\operatorname{argmin}_{z\in\mathbb R^d}\psi_l(z)$, we have
$$\psi_l(z)-\psi_l(z_l)\ge\frac{1}{12}\varsigma_l\|z-z_l\|^3.$$
Now we provide an upper bound for $T_3$, i.e., the total number of successful updates of $\varsigma>0$.

Lemma C.3 Suppose that in each iteration $i$ of Algorithm 2, we have $\left\|\nabla^2 f(x_i)-H(x_i)\right\|\le\epsilon_i$ for all $T_1+1\le i\le T_1+T_2$. For each successful iteration $j$ in Phase II, we have
$$(1-\kappa_\theta)\|\nabla f(x_{j+1})\|\le(\bar\rho+2L+2\bar\sigma_2+1)\|s_j\|^2,$$
where $\kappa_\theta\in(0,1)$ is used in Condition 4.1.

Proof. Suppose the $j$-th iteration is the $l$-th successful iteration, and note that $\nabla_s m(y_l,s_j,\sigma_j)=\nabla f(y_l)+H(y_l)s_j+\sigma_j\|s_j\|\cdot s_j$. Then we have
$$\begin{aligned}
\|\nabla f(x_{j+1})\|&\le\left\|\nabla f(y_l+s_j)-\nabla_s m(y_l,s_j,\sigma_j)\right\|+\left\|\nabla_s m(y_l,s_j,\sigma_j)\right\|\\
&\le\left\|\nabla f(y_l+s_j)-\nabla_s m(y_l,s_j,\sigma_j)\right\|+\kappa_\theta\cdot\min(1,\|s_j\|)\cdot\|\nabla f(y_l)\|\\
&\le\left\|\int_0^1\left[\nabla^2 f(y_l+\tau s_j)-\nabla^2 f(y_l)\right]s_j\,d\tau\right\|+\left\|\nabla^2 f(y_l)-H(y_l)\right\|\|s_j\|+\sigma_j\|s_j\|^2+\kappa_\theta\cdot\min(1,\|s_j\|)\cdot\|\nabla f(y_l)\|\\
&\le\frac{\bar\rho}{2}\|s_j\|^2+\epsilon_j\|s_j\|+\sigma_j\|s_j\|^2+\kappa_\theta\cdot\|s_j\|\cdot\|\nabla f(y_l)-\nabla f(y_l+s_j)\|+\kappa_\theta\|\nabla f(x_{j+1})\|\\
&\le\frac{\bar\rho}{2}\|s_j\|^2+\epsilon_j\|s_j\|+\bar\sigma_2\|s_j\|^2+\kappa_\theta L\|s_j\|^2+\kappa_\theta\|\nabla f(x_{j+1})\|,
\end{aligned}$$
where the second inequality holds true due to Condition 4.1, and the last two inequalities follow from Assumption 2.1. The subsequent analysis is conducted according to the value of $\|s_j\|$ in two cases.

1. When $\|s_j\|\ge1$, we have
$$(1-\kappa_\theta)\|\nabla f(x_{j+1})\|\le\left(\frac{\bar\rho}{2}+\kappa_\theta L+\bar\sigma_2+\epsilon_j\right)\|s_j\|^2,$$
which combined with $\epsilon_j\le\epsilon_0\le1$ implies that
$$(1-\kappa_\theta)\|\nabla f(x_{j+1})\|\le\left(\frac{\bar\rho}{2}+\kappa_\theta L+\bar\sigma_2+1\right)\|s_j\|^2.$$

2. When $\|s_j\|<1$, according to the fact that
$$\epsilon_j=\min\left\{\epsilon_{j-1},\,\frac{(1-\kappa_\theta)\|\nabla f(y_l)\|}{2}\right\}\le\frac{(1-\kappa_\theta)\|\nabla f(y_l)\|}{2},$$
it holds that
$$\begin{aligned}
(1-\kappa_\theta)\|\nabla f(x_{j+1})\|&\le\left(\frac{\bar\rho}{2}+\kappa_\theta L+\bar\sigma_2\right)\|s_j\|^2+\frac{(1-\kappa_\theta)\|\nabla f(y_l)\|\,\|s_j\|}{2}\\
&\le\left(\frac{\bar\rho}{2}+\kappa_\theta L+\bar\sigma_2\right)\|s_j\|^2+\frac{(1-\kappa_\theta)\|\nabla f(y_l)-\nabla f(y_l+s_j)\|\,\|s_j\|}{2}+\frac{(1-\kappa_\theta)\|\nabla f(x_{j+1})\|\,\|s_j\|}{2}\\
&\le\left(\frac{\bar\rho}{2}+\kappa_\theta L+\bar\sigma_2\right)\|s_j\|^2+(1-\kappa_\theta)L\|s_j\|^2+\frac{(1-\kappa_\theta)\|\nabla f(x_{j+1})\|}{2},
\end{aligned}$$
where the last inequality holds true since $\|s_j\|<1$. Hence, we have
$$(1-\kappa_\theta)\|\nabla f(x_{j+1})\|\le(\bar\rho+2L+2\bar\sigma_2)\|s_j\|^2.$$

In summary, we have concluded that
$$(1-\kappa_\theta)\|\nabla f(x_{j+1})\|\le(\bar\rho+2L+2\bar\sigma_2+1)\|s_j\|^2. \qquad\Box$$

Now we are ready to estimate an upper bound of $T_3$.

Proof of Lemma 4.3: When $l=1$, it trivially holds that $\psi_l(z_l)\ge\frac{l(l+1)(l+2)}{6}f(\bar x_l)$, since $\psi_1(z_1)=f(\bar x_1)$. Next we shall establish the general case, when $\varsigma_l\ge\left(\frac{\bar\rho+2L+2\bar\sigma_2+1}{1-\kappa_\theta}\right)^3\frac{1}{\eta^2}$, by mathematical induction. Without loss of generality, we assume (15) holds true for some $l-1\ge1$. Then it follows from Lemma C.2 and the construction of $\psi_l(z)$ that
$$\psi_{l-1}(z)\ge\psi_{l-1}(z_{l-1})+\frac{1}{12}\varsigma_{l-1}\|z-z_{l-1}\|^3\ge\frac{(l-1)l(l+1)}{6}f(\bar x_{l-1})+\frac{1}{12}\varsigma_{l-1}\|z-z_{l-1}\|^3.$$
As a result, we have
$$\begin{aligned}
\psi_l(z_l)&=\min_{z\in\mathbb R^d}\left\{\psi_{l-1}(z)+\frac{l(l+1)}{2}\left[f(\bar x_l)+(z-\bar x_l)^\top\nabla f(\bar x_l)\right]+\frac16(\varsigma_l-\varsigma_{l-1})\|z-\bar x_1\|^3\right\}\\
&\ge\min_{z\in\mathbb R^d}\left\{\frac{(l-1)l(l+1)}{6}f(\bar x_{l-1})+\frac{1}{12}\varsigma_l\|z-z_{l-1}\|^3+\frac{l(l+1)}{2}\left[f(\bar x_l)+(z-\bar x_l)^\top\nabla f(\bar x_l)\right]\right\}\\
&\ge\min_{z\in\mathbb R^d}\left\{\frac{(l-1)l(l+1)}{6}\left[f(\bar x_l)+(\bar x_{l-1}-\bar x_l)^\top\nabla f(\bar x_l)\right]+\frac{1}{12}\varsigma_l\|z-z_{l-1}\|^3+\frac{l(l+1)}{2}\left[f(\bar x_l)+(z-\bar x_l)^\top\nabla f(\bar x_l)\right]\right\}\\
&=\frac{l(l+1)(l+2)}{6}f(\bar x_l)+\min_{z\in\mathbb R^d}\left\{\frac{(l-1)l(l+1)}{6}(\bar x_{l-1}-\bar x_l)^\top\nabla f(\bar x_l)+\frac{1}{12}\varsigma_l\|z-z_{l-1}\|^3+\frac{l(l+1)}{2}(z-\bar x_l)^\top\nabla f(\bar x_l)\right\}.
\end{aligned}$$
By the construction of $y_{l-1}$, we have
$$\frac{(l-1)l(l+1)}{6}\bar x_{l-1}=\frac{l(l+1)(l+2)}{6}\cdot\frac{l-1}{l+2}\bar x_{l-1}=\frac{l(l+1)(l+2)}{6}\left(y_{l-1}-\frac{3}{l+2}z_{l-1}\right)=\frac{l(l+1)(l+2)}{6}y_{l-1}-\frac{l(l+1)}{2}z_{l-1}.$$
Combining the above two formulas yields
$$\psi_l(z_l)\ge\frac{l(l+1)(l+2)}{6}f(\bar x_l)+\min_{z\in\mathbb R^d}\left\{\frac{l(l+1)(l+2)}{6}(y_{l-1}-\bar x_l)^\top\nabla f(\bar x_l)+\frac{1}{12}\varsigma_l\|z-z_{l-1}\|^3+\frac{l(l+1)}{2}(z-z_{l-1})^\top\nabla f(\bar x_l)\right\}.$$
Then, by the criterion of a successful iteration in the second phase and Lemma C.3, we have
$$(y_{l-1}-\bar x_l)^\top\nabla f(\bar x_l)=-s_{T_1+j}^\top\nabla f(y_{l-1}+s_{T_1+j})\ge\eta\|s_{T_1+j}\|^3\ge\eta\left(\frac{1-\kappa_\theta}{\bar\rho+2L+2\bar\sigma_2+1}\right)^{\frac32}\|\nabla f(\bar x_l)\|^{\frac32},$$
where the $l$-th successful iteration refers to the $(j-1)$-th iteration count. Hence, it suffices to establish
$$\frac{l(l+1)(l+2)\eta}{6}\left(\frac{1-\kappa_\theta}{\bar\rho+2L+2\bar\sigma_2+1}\right)^{\frac32}\|\nabla f(\bar x_l)\|^{\frac32}+\frac{1}{12}\varsigma_l\|z-z_{l-1}\|^3+\frac{l(l+1)}{2}(z-z_{l-1})^\top\nabla f(\bar x_l)\ge0.$$
Using Lemma C.1 and setting $g=\frac{l(l+1)}{2}\nabla f(\bar x_l)$, $s=z-z_{l-1}$, and $\sigma=\frac14\varsigma_l$, the above is implied by
$$\frac{l(l+1)(l+2)\eta}{6}\left(\frac{1-\kappa_\theta}{\bar\rho+2L+2\bar\sigma_2+1}\right)^{\frac32}\ge\frac{4}{3\sqrt{\varsigma_l}}\left(\frac{l(l+1)}{2}\right)^{\frac32}. \qquad (29)$$
Therefore, the conclusion follows if
$$\varsigma_l\ge\left(\frac{\bar\rho+2L+2\bar\sigma_2+1}{1-\kappa_\theta}\right)^3\frac{1}{\eta^2}. \qquad\Box$$
Finally let us go back to prove Theorem 4.4. Proof of Theorem 4.4: The proof is based on mathematical induction. Let’s first prove the base ¯ 1 = xT1 , we have case (l = 1). By the definition of ψ1 (z) and the fact that x f (¯ x1 ) = f (xT1 ) = ψ1 (z1 ). 36
Furthermore, by the criterion of successful iteration in Phase I, f (¯ x1 ) = f (xT1 ) ≤ m(xT1 −1 , sT1 −1 , σT1 −1 ) m = m(xT1 −1 , sT1 −1 , σT1 −1 ) − m(xT1 −1 , sm T1 −1 , σT1 −1 ) + m(xT1 −1 , sT1 −1 , σT1 −1 ),
d where sm T1 −1 denotes the global minimizer of m(xT1 −1 , s, σT1 −1 ) over R . Since f is convex, so is m(xT1 −1 , s, σT1 −1 ). Therefore, we have
≤
≤
(14) ≤
m(xT1 −1 , sT1 −1 , σT1 −1 ) − m(xT1 −1 , sm T1 −1 , σT1 −1 ) ⊤ ∇s m(xT1 −1 , sT1 −1 , σT1 −1 ) sT1 −1 − sm T1 −1
k∇s m(xT1 −1 , sT1 −1 , σT1 −1 )k sT1 −1 − sm T1 −1
κθ k∇f (xT1 −1 )k ksT1 −1 k sT1 −1 − sm T1 −1 .
To bound $\left\|s_{T_1-1} - s^m_{T_1-1}\right\|$, we observe that
\begin{align*}
\sigma_{\min} \|s\|^3 \;\le\; \sigma_{T_1-1}\|s\|^3
&= s^\top \left[ \nabla m(x_{T_1-1}, s, \sigma_{T_1-1}) - \nabla f(x_{T_1-1}) - H(x_{T_1-1})\, s \right] \\
&\le \|s\| \left[ \|\nabla f(x_{T_1-1})\| + \|\nabla m(x_{T_1-1}, s, \sigma_{T_1-1})\| \right] \\
&\overset{(14)}{\le} (1+\kappa_\theta)\, \|s\| \, \|\nabla f(x_{T_1-1})\|,
\end{align*}
where $s = s_{T_1-1}$ or $s = s^m_{T_1-1}$. Thus, we conclude that
\[
\left\|s_{T_1-1} - s^m_{T_1-1}\right\| \;\le\; \|s_{T_1-1}\| + \left\|s^m_{T_1-1}\right\| \;\le\; 2\sqrt{\frac{(1+\kappa_\theta)\,\|\nabla f(x_{T_1-1})\|}{\sigma_{\min}}},
\]
which combined with Assumption 2.1 implies that
\begin{align*}
m(x_{T_1-1}, s_{T_1-1}, \sigma_{T_1-1}) - m(x_{T_1-1}, s^m_{T_1-1}, \sigma_{T_1-1})
&\le \frac{2\kappa_\theta(1+\kappa_\theta)}{\sigma_{\min}} \|\nabla f(x_{T_1-1})\|^2 \\
&= \frac{2\kappa_\theta(1+\kappa_\theta)}{\sigma_{\min}} \|\nabla f(x_{T_1-1}) - \nabla f(x^*)\|^2 \\
&\le \frac{2\kappa_\theta(1+\kappa_\theta)L^2}{\sigma_{\min}} \|x_{T_1-1} - x^*\|^2 \\
&\le \frac{2\kappa_\theta(1+\kappa_\theta)L^2}{\sigma_{\min}} \|x_0 - x^*\|^2.
\end{align*}
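Spelling out the first inequality of this chain (a sketch; $g$ abbreviates $\|\nabla f(x_{T_1-1})\|$ here and nowhere else), the two bounds just derived combine as:

```latex
% \|s_{T_1-1}\| \le \sqrt{(1+\kappa_\theta)g/\sigma_{\min}} follows from
% \sigma_{\min}\|s\|^2 \le (1+\kappa_\theta)g after dividing the display
% above by \|s\|; the bound on \|s_{T_1-1}-s^m_{T_1-1}\| is twice that.
\kappa_\theta\, g\, \|s_{T_1-1}\| \left\|s_{T_1-1}-s^m_{T_1-1}\right\|
  \;\le\; \kappa_\theta\, g \cdot \sqrt{\frac{(1+\kappa_\theta)\,g}{\sigma_{\min}}}
          \cdot 2\sqrt{\frac{(1+\kappa_\theta)\,g}{\sigma_{\min}}}
  \;=\; \frac{2\kappa_\theta(1+\kappa_\theta)}{\sigma_{\min}}\, g^2 .
```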
On the other hand, we have
\begin{align*}
m(x_{T_1-1}, s^m_{T_1-1}, \sigma_{T_1-1})
&= f(x_{T_1-1}) + (s^m_{T_1-1})^\top \nabla f(x_{T_1-1}) + \frac{1}{2}(s^m_{T_1-1})^\top H(x_{T_1-1})\, s^m_{T_1-1} + \frac{1}{3}\sigma_{T_1-1}\left\|s^m_{T_1-1}\right\|^3 \\
&\le f(x_{T_1-1}) + (z-x_{T_1-1})^\top \nabla f(x_{T_1-1}) + \frac{1}{2}(z-x_{T_1-1})^\top \nabla^2 f(x_{T_1-1})(z-x_{T_1-1}) \\
&\quad + \frac{\epsilon_{T_1-1}}{2}\|z-x_{T_1-1}\|^2 + \frac{1}{3}\sigma_{T_1-1}\|z-x_{T_1-1}\|^3 \\
&\le f(z) + \frac{\bar{\rho}}{6}\|z-x_{T_1-1}\|^3 + \frac{1}{2}\|z-x_{T_1-1}\|^2 + \frac{\sigma_{T_1-1}}{3}\|z-x_{T_1-1}\|^3 \\
&\le f(z) + \frac{\bar{\rho}+2\bar{\sigma}_1}{6}\|z-x_{T_1-1}\|^3 + \frac{1}{2}\|z-x_{T_1-1}\|^2 \\
&= f(z) + \frac{\bar{\rho}+2\bar{\sigma}_1}{6}\|z-x_0\|^3 + \frac{1}{2}\|z-x_0\|^2,
\end{align*}
where the second inequality is due to the convexity of $f$, Assumption 2.1, and $\epsilon_{T_1-1} \le \epsilon_0 \le 1$. Therefore, we conclude that
\begin{align*}
\psi_1(z) &= f(\bar{x}_1) + \frac{1}{6}\varsigma_1 \|z-\bar{x}_1\|^3 \\
&\le f(z) + \frac{\bar{\rho}+2\bar{\sigma}_1}{6}\|z-x_0\|^3 + \frac{1}{2}\|z-x_0\|^2 + \frac{2\kappa_\theta(1+\kappa_\theta)L^2}{\sigma_{\min}}\|x_0-x^*\|^2 + \frac{1}{6}\varsigma_1\|z-\bar{x}_1\|^3.
\end{align*}
Now suppose the theorem is true for some $l \ge 1$, and let us consider the case of $l+1$:
\begin{align*}
\psi_{l+1}(z_{l+1}) \le \psi_{l+1}(z)
&\le \frac{l(l+1)(l+2)}{6} f(z) + \frac{\bar{\rho}+2\bar{\sigma}_1}{6}\|z-x_0\|^3 + \frac{1}{2}\|z-x_0\|^2 + \frac{2\kappa_\theta(1+\kappa_\theta)L^2}{\sigma_{\min}}\|x_0-x^*\|^2 + \frac{1}{6}\varsigma_l\|z-\bar{x}_1\|^3 \\
&\quad + \frac{(l+1)(l+2)}{2}\left[ f(\bar{x}_l) + (z-\bar{x}_l)^\top \nabla f(\bar{x}_l) \right] + \frac{1}{6}(\varsigma_{l+1}-\varsigma_l)\|z-\bar{x}_1\|^3 \\
&\le \frac{(l+1)(l+2)(l+3)}{6} f(z) + \frac{\bar{\rho}+2\bar{\sigma}_1}{6}\|z-x_0\|^3 + \frac{1}{2}\|z-x_0\|^2 + \frac{2\kappa_\theta(1+\kappa_\theta)L^2}{\sigma_{\min}}\|x_0-x^*\|^2 \\
&\quad + \frac{1}{6}\varsigma_{l+1}\|z-\bar{x}_1\|^3,
\end{align*}
where the last inequality is due to the convexity of $f$. On the other hand, it follows from the way that $\psi_{l+1}(z)$ is updated that $\frac{(l+1)(l+2)(l+3)}{6} f(\bar{x}_{l+1}) \le \psi_{l+1}(z_{l+1})$, and thus Theorem 4.4 is proven.

Proof of Theorem 4.5: We note that Lemma B.2 still holds for the approximated Hessian $H(x_i)$ constructed in Algorithm 2. Therefore, when $T = O(\epsilon^{-1/3})$, we have that $\left\|\nabla^2 f(x_i) - H(x_i)\right\| \le \epsilon_i$ for all $i \le T_1+T_2$ with probability $1-\delta$. Then from Theorem 4.4 and by taking $z = x^*$, we have that
\[
\frac{l(l+1)(l+2)}{6} f(\bar{x}_l) \;\le\; \frac{l(l+1)(l+2)}{6} f(x^*) + \frac{\bar{\rho}+2\bar{\sigma}_1}{6}\|x^*-x_0\|^3 + \frac{1}{2}\|x^*-x_0\|^2 + \frac{2\kappa_\theta(1+\kappa_\theta)L^2}{\sigma_{\min}}\|x_0-x^*\|^2 + \frac{1}{6}\varsigma_l\|x^*-\bar{x}_1\|^3.
\]
Rearranging the terms, and combining with Lemmas 4.1, 4.2 and 4.3, the theorem is proven.
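To make the rearrangement explicit (a sketch; $C$ collects the distance-dependent terms, none of which grows with $l$), dividing the last display by $l(l+1)(l+2)/6$ gives

```latex
f(\bar{x}_l) - f(x^*)
  \;\le\; \frac{6}{l(l+1)(l+2)}
  \left( \frac{\bar{\rho}+2\bar{\sigma}_1}{6}\|x^*-x_0\|^3
       + \frac{1}{2}\|x^*-x_0\|^2
       + \frac{2\kappa_\theta(1+\kappa_\theta)L^2}{\sigma_{\min}}\|x_0-x^*\|^2
       + \frac{\varsigma_l}{6}\|x^*-\bar{x}_1\|^3 \right)
  \;\le\; \frac{C}{l^3},
```

so an $\epsilon$-optimal solution is reached after $l = O(\epsilon^{-1/3})$ successful iterations of the second phase, in line with the accelerated rate stated in the abstract.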