by a modification of the preconditioned Beale's method with Powell restart, which performs a plane search at each step. Asymptotically the line search and the ...
The modification of Newton’s method for unconstrained optimization by bounding cubic terms A. Griewank1 Abstract The discrepancy between an objective function f and its local quadratic model f (x) + ϕ(s) ≈ f (x + s) at the current iterate x is estimated using an ellipsoidal norm 1 |s| ≡ (sT Qs) 2 . Steps are chosen such that they reduce the model function ϕ(s) + |s|3 /3 significantly. This implies f (x + s) < f (x) unless the matrix Q = QT > 0 is too small, in which case it is updated accordingly. Global convergence results are given for exact and inexact implementations of the proposed approach. The modification can be made invariant with respect to affine transformations and applies only at finitely many iterates if the limiting stationary point is a strongly isolated minimizer.
1
Introduction
In this paper we consider the unconstrained problem Minimize f (x),
f ∈ C 2 (IRn )
given that the gradient g(x) ≡ ∇f (x) and the Hessian H(x) = ∇2 f (x) are explicitly available. Provided the initial point x0 is sufficiently close to a minimizer x∗ ∈ g −1 (0) with nonsingular Hessian H ∗ = H(x∗ ), the classical Newton iteration for the solution of the stationary equation g(x) = 0 is well defined by xj+1 = xj − H(xj )−1 g(xj ) and converges Q-superlinearly to x∗ . Furthermore, it was shown in [1] that sup kxj+1 − x∗ k/kxj − x∗ k1+p < ∞ j≥0
if the Hessian H(x) is H¨older continuous with exponent p > 0 at x∗ . For simplicity we make the stronger assumption that H(x) is globally H¨older continuous so that there are positive constants p and q such that for all x, x + s ∈ IRn kH(x + s) − H(x)k ≤ 1
1 kskp q 1+p/2 . 1 + p/2
Department of Applied Mathematics and Theoretical Physics, University of Cambridge Cambridge, United Kingdom The work was partially supported by a research grant of the Deutsche Forschungsgemeinschaft, Bonn
1
(1.1)
Here as always in this paper k · k denotes the spectral or Euclidean norm of matrices and vectors, respectively. We are mainly interested in the case p = 1, where H is Lipschitz continuous and will therefore refer to quantities of O(kskp+2 ) as cubic terms. Dennis and Mor´e have shown in [2] that an arbitrary sequence {xj+1 = xj + sj }j≥0 ∈ IRn converges Q-superlinearly to x∗ ∈ g −1 (0) with det(H ∗ ) 6= 0 if and only if lim ksj − Hj−1 gj k/ksj k = 0
j→∞
where Hj ≡ H(xj ) and gj = g(xj ). Therefore we may view the Newton correction −Hj−1 gj as the optimal step on the final approach to a minimizer unless third or higher order derivative information is available. At an early iterate xj far away from any particular local minimizer of f the local quadratic model 1 (1.2) ϕj (s) ≡ gjT s + sT Hj s ≈ f (xj + s) − f (xj ) 2 may not reflect the essential features of f (xj + s) sufficiently accurate to make the Newton step well defined and worthwhile with respect to the minimization of f . This becomes immediately obvious if Hj has a nonpositive eigenvalue since then the Newton step is either not defined or leads to a stationary point of ϕj which is not a minimizer. Even if Hj > 0, i.e., the Hessian is positive definite, the full Newton correction sj = −Hj−1 gj may fail to reduce f because the expected gain −ϕj (sj ) is smaller than the discrepancy r(xj , sj ) ≡ |f (xj + sj ) − f (xj ) − ϕj (sj )| ≤
1 (qksj k2 )1+p/2 , 2+p
(1.3)
where the inequality on the RHS follows from (1.1). If the constant q was known beforehand, we could choose corrections sj which are ’safe’ in that they must yield a reduction of f . In practice the size of the cubic term will rarely be known beforehand and the reliability of the quadratic model ϕj (s) may depend not only on the Euclidean norm of s but also its direction s/ksk. Therefore we attempt to bound 1/(1+p/2) %j ≡ %(xj , sj ) ≡ (2 + p)r(xj , sj ) ≤ qksj k2
(1.4)
by a suitable ellipsoidal norm |s|2j ≡ sT Qj s,
(1.5)
where the symmetric matrix Qj > 0 is modified whenever necessary. Then any step that reduces the model function 1 |s|2+p (1.6) ϕ¯j (s) ≡ ϕj (s) + p+2 j will also decrease f itself unless | · |j is too small, in which case Qj must be enlarged by a suitable matrix update. The model function ϕ¯j allows us to make effective use of second derivative information when the quadratic model ϕj is unbounded below or otherwise unreliable. 2
For certain choices of the sj we obtain a modified Newton iteration of the form xj+1 = xj + µj sj ,
(1.7)
(Hj + αj Qj )sj = −gj .
(1.8)
where Here the scalar αj ≥ 0 is chosen such that Hj + αj Qj ≥ 0,
(1.9)
gjT sj < 0 or gjT sj = 0 and sTj Hj sj < 0
(1.10)
which implies unless xj = x∗ satisfies already the second order necessary conditions g(x∗ ) = 0 and H(x∗ ) ≥ 0.
(1.11)
If H ∗ > 0, the full Newton step sj = −Hj−1 gj or a suitable approximation will eventually reduce ϕ¯j so that the bound on the cubic term becomes inactive after finitely many iterations.
1.1
Comparison to other modified Newton methods
As it turns out most modified Newton methods for unconstrained minimization can be written in the form (1.7)-(1.9) with Qj usually diagonal and αj set to zero whenever Hj > 0. If the full step from xj to xj + sj does not achieve a significant reduction of f , we may either perform a line search to find a suitable multiplier µj ∈ (0, 1) or calculate a completely new step after adjusting αj and possibly Qj . The first approach was used by Gill and Murray [3] and the second was recommended by Levenberg [4], Marquardt [5], and Goldfeld, Quandt and Trotter [6]. For fixed Qj the steps sj as a function of αj form a smooth curve in Rn . Since the recalculation of sj for various values of αj may be quite expensive, several authors (e.g. [7] [8] and [9]) have chosen a compromise between the two approaches mentioned above by performing a curviliniear search in a two dimensional subspace of IRn . This plane may for example be spanned by the gradient together with the Newton correction or a direction of negative curvature. By minimizing ϕ¯j within the plane we can calculate a step sj which is likely to reduce f significantly. As we have noted above the actual form of the new approach is identical to that of several other proposals but the author feels the idea of bounding the cubic term yields more natural criteria for the choice of αj and Qj . There are few ad hoc choices and the whole procedure can be made invariant with respect to nonsingular affine transformations on the domain and range of f provided x0 and Q0 are adjusted accordingly. The main difference to the closely related trust region approach [10] is that we attempt to bound the discrepancy r(xj , sj ) relative to the expected gain −ϕj (sj ) rather than in absolute terms. A similar discussion of restricted step methods was given in [11]. The paper is organized as follows. In Section 2 we consider the problem of updating the Qj such that they tend to a limit Q∗ > 0 and all steps are eventually successful. In Section 3 we analyze a model function of the form (1.6) and characterize its stationary values, which include one global and possibly a second local minimum. In Section 4 we 3
discuss suitable choices of the sj and establish, under mild conditions on f , convergence to a point where (1.11) holds from any initial x0 ∈ IRn and Q0 > 0. In Section 5 we consider inexact implementations of the proposed approach and in the final Section 6 we discuss some preliminary numerical results.
2
Updating ellipsoidal norms to bound the cubic term
Ideally we would like to find a matrix Q = QT > 0 such that %(x, s) ≤ |s|2 = sT Qs for all x, x + s ∈ IRn ,
(2.1)
where % is defined by (1.4). The set of matrices Q for which (2.1) holds is obviously unbounded and convex so that it contains unique minimal elements with respect to any particular matrix norm. Whereas it would certainly be desirable to make Q as small as possible, it is in general computationally impossible to determine one of these minimal solutions or simply to establish that (2.1) holds for a given Q. For the purpose of our iterative minimization routine we will use a sequence of estimates {Qj }j≥0 which may be restricted to a certain class E of positive definite matrices, e.g. multiples of a fixed diagonal matrix D > 0. Whenever it is found after a step from xj to xj + sj that (2.1) is violated such that the ratio τj ≡ %j /|sj |2j = %j /sTj Qj sj is greater than a constant τ¯ ≥ 1, then we enlarge Qj to some Qj+1 ∈ E such that Qj+1 ≥ Qj
and |sj |2j+1 = sTj Qj+s sj = %j .
(2.2)
From a theoretical point of view Qj need never be reduced, but such an extremely conservative approach may slow down the convergence considerably. Therefore we want to make Qj somewhat smaller whenever τj < 1/¯ τ . However we have to be cautious because %j may happen to be small or zero even if f is in general highly nonquadratic in the direction sj /ksj k. Therefore we choose an update factor τˆj ∈ [τj , 1/¯ τ] and reduce Qj to some Qj+1 ∈ E such that and |sj |2j+1 = τˆj |sj |2j . (2.3) Combining (2.2) with (2.3) and leaving Qj unchanged if τj ∈ 1/¯ τ , τ¯ we obtain the update condition = τj if τj > τ¯ ≥ 1 = 1 if τj ∈ 1/¯ τ , τ¯ (2.4) |sj |2j+1 /|sj |2j = τˆj ∈ τj , 1/¯ τ otherwise. Qj+1 ≤ Qj
In order to ensure that the Q−1 j can be bounded even if for all j ≥ 0 sj = s0 6= 0 and %j = 0, 4
we have to impose the condition that ∞ Y
min(1, τˆj ) > 0
(2.5)
max{0, 1 − τˆj } < ∞.
(2.6)
j=0
which is equivalent to ∞ X j=0
This condition is necessary and sufficient to ensure the boundedness of the Qj and Q−1 j for suitable update functions U : E × (IRn − {0}) × (0, ∞) → E. For example, if the Qj are restricted to E ≡ {˜ q D}q˜>0
with D = DT > 0,
then (2.4) leads to the update function Qj+1 = U (Qj , sj , τˆj ) = τˆj Qj .
(2.7)
More interesting is the rank one update Qj+1 = U (Qj , sj , τˆj ) ≡ Qj + (ˆ τj − 1)Qj sj sTj Qj /sTj Qj sj ,
(2.8)
which was used by Shor [12] in his ellipsoidal algorithm and has the distinct advantage of being invariant with respect to nonsingular linear transformations. In this case E consists of all Q = QT > 0. It can be easily checked that (2.4) holds if U is given by (2.7) or (2.8). Furthermore we obtain the following convergence result. Lemma 2.1 Let {sj }j≥0 ⊂ IRn −{0} and {%j }j≥0 ⊂ [0, ∞) be infinite sequences such that the ratios {%j /ksj k2 }j≥0 are bounded. Then any sequence {Qj }j≥0 generated from some Q0 ∈ E according to (2.7) or (2.8) with {ˆ τj }j≥0 ∈ (0, ∞) satisfying (2.4) and (2.5) converges to a limit Q∗ = lim Qj > 0. (2.9) j→∞
This implies lim sup τj ≤ τ¯
(2.10)
j→∞
and if furthermore τ¯ > 1, then there is a first index j0 ≥ 0 such that Qj+1 = Qj
for all 5
j ≥ j0 .
(2.11)
Proof: Firstly we consider the simple case when U is given by (2.7) so that we must have for all j≥0 Qj = qj D with qj ∈ (0, ∞). Rewriting (2.7) in terms of the scalar qj we obtain the recurrence qj+1 = τˆj qj ≤ kD−1 k sup %j /ksj k2j < ∞ j≥0
so that by (2.5) for all k ≥ 1 ≥ 0 qk ≥ q l
k−1 Y
min(1, τˆj ) ≥ q0
∞ Y
min(1, τˆj ) > 0.
(2.12)
j=0
j=l
Consequently {qj }j≥0 must have a cluster point q ∗ > 0 and since (2.5) implies lim
l→∞
∞ Y
min(1, τˆj ) = 1,
j=l
it can easily be derived from (2.12) that q ∗ is in fact the limit of {qj }j≥0 . Thus (2.9) holds with Q∗ = q ∗ D. As an immediate consequence of (2.9) we have lim τˆj = 1, which implies j→∞
(2.10) and (2.11) because of (2.4). In the second case when U is given by (2.8) we obtain from the Sherman and Morrison formula the inverse recurrence −1 Q−1 τj − 1) sj sTj /|sj |2j , j+1 = Qj + (1/ˆ
which implies for the traces −1 T r(Q−1 τj − 1)ksj k2 /|sj |2j j+1 ) = T r(Qj ) + (1/ˆ ≤ T r(Q−1 τj − 1 kQ−1 j ) + max 0, 1/ˆ j k ≤ T r(Q−1 τj j ) max 1, 1/ˆ ∞ Q min {1, τˆj }−1 < ∞ ≤ T r(Q−1 0 )
(2.13)
j=0
where we have used (2.5) and the elementary inequality −1 ksj k2 /sTj Qj sj ≤ kQ−1 j k ≤ T r(Qj ). −1 Hence T r(Q−1 j ) and consequently kQj k are bounded so that ∞ P
τj − 1 ksj k2 /|sj |2j max 0, 1/ˆ
j=0
≤ sup(kQ−1 τj ) j k/ˆ j≥1
∞ P
(2.14) max {0, 1 − τˆj } < ∞,
j=0
6
where we have used (2.6) to obtain the last inequality. Now we find by summing (2.13) over j ≥ 0 that k P 1 T r(Q−1 ) + max 0, 1 − ksj k2 /|sj |2j k+1 τˆj j=0 (2.15) ∞ 1 P −1 2 2 ≤ T r(Q0 ) + max 0, τˆj − 1 ksj k /|sj |j < ∞. j=0
Since Qk+1 > 0 for all k ≥ 0, the first sum must therefore be bounded as k tends to infinity. Multiplying each term by %j /ksj k2 , which is by assumption bounded, we find that ∞ X
max {0, τˆj − 1} < ∞,
(2.16)
j=0
where we have used that by (2.4) 1−
1 > 0 =⇒ τˆj = τj . τˆj
Now we can bound the Qk since for all k ≥ 0 det(Qk+1 ) = det(Qk )ˆ τk k Q ≤ det(Q0 ) max {1, τˆj } < ∞, j=0
where the last inequality follows from (2.16). Consequently det(Q−1 j ) is bounded away from zero. Furthermore we have by (2.14) and (2.15) ∞ X j=0
kQ−1 j+1
−
Q−1 j k
≤
∞ X
−1 |T r(Q−1 j+1 − Qj )| < ∞,
j=0
where we have used the fact that each correction Qj+1 − Qj is a symmetric rank one matrix. ∗ −1 Therefore the Q−1 j converge to a nonsingular limit (Q ) , which implies the assertion (2.9). Equations (2.10) and (2.11) follow as above. 2 In order to prove convergence of our modified Newton algorithm we do not actually need the fact that the Qj converge to a nonsingular limit. It would suffice if the Qj were bounded or had a bounded condition number. However, for the rank one update (2.2) neither of these properties can be ensured unless we impose the condition (2.5) or enforce some apriori bound on the condition number of Qj . In any case, it is highly desirable that the Qj do ultimately settle down. Then Qj will eventually be sufficiently close to Q∗ such that all steps of our modified Newton method are successful and by (2.11) no further updates are necessary if we choose τ¯ > 1. This is important because %j /ksj k2 is essentially a second order divided difference and will therefore blow up due to rounding errors when the steps become small. The following analysis applies to any update function U for which Lemma 2.1 can be established; possibly under even stronger assumptions on the update factors τˆj . Now suppose we have some way of computing at any point xj ∈ IRn where the second order necessary condition (1.11) is not satisfied, a step sj which significantly reduces the model function ϕ¯j as defined in (1.6). Then we can apply the following conceptual algorithm A. 7
¯ 1], τ¯ ∈ [1, (δ/δ) ¯ 1/(1+p/2) initial values x0 ∈ IRn , Q0 ∈ 0. Choose constants δ¯ ∈ (0, 1), δ ∈ (δ, E and set j = 0. 1. If gj ≡ g(xj ) = 0 and Hj ≡ H(xj ) ≥ 0, stop . 2. Select a step sj ∈ IRn − {0} such that ¯ ϕj (sj ) ≤ −|sj |2+p /[(2 + p)δ]. j
(2.17)
δj = 1 − (f (xj + sj ) − f (xj ))/ϕj (sj )
(2.18)
%j ≡ |(2 + p)δj ϕj (sj )|1/(1+p/2) .
(2.19)
3. Evaluate and 4. Select τˆj ∈ (0, ∞) such that (2.4) holds and update Qj+1 = U (Qj , sj , τˆj ). 5. Set xj+1
x +s j j = x j
if δj < δ
(2.20)
otherwise
increment j = j + 1 and go to 1. This specification is still very general and allows various strategies for choosing sj . For instance, if δj ≥ δ, we may simply rescale sj by a suitable step multiplier, thus performing a line search instead of recalculating a new step in a different direction. The inequality (2.17) is equivalent to the condition 1 − δ¯ ≤ ϕ¯j (sj )/ϕj (sj ) < 1, which requires that sj satisfies some kind of Goldstein test with respect to ϕ¯j . ¯ 1], which occurs in the acceptance test (2.20), can be chosen equal The constant δ ∈ (δ, to 1. so that xj+1 = xj + sj whenever f (xj + sj ) < f (xj ) However, this weak test may not be advantageous if the switch to a new iterate is computationally costly and therefore only worthwhile if f (xj + sj ) is substantially smaller than f (xj ). In any case the δj will eventually be bounded below δ as shown in the following theorem. Theorem 2.2 Let the Hessian of f : IRn → IR be uniformly H¨older continuous with exponent p > 0 on IRn . Then we find for any infinite sequence {xj+1 = xj + sj }j≥0 generated by the algorithm A described above with {ˆ τj }j≥0 ⊂ (0, ∞) satisfying (2.5) that δ ∗ ≡ lim sup δj < δ.
(2.21)
j→∞
This means that all but finitely many steps are successful, which implies either lim f (xj ) = −∞
j→∞
or lim sj = 0.
j→∞
8
(2.22)
Proof: Firstly we derive from (2.17) that ϕj (sj ) < 0 and 1 ¯ j (sj )|. |sj |2+p ≤ δ|ϕ 2+p j
(2.23)
Similarly it follows from (2.19) that 1 1+p/2 % = |δj ϕj (sj )|. 2+p j
(2.24)
Dividing (2.24) by (2.23) we obtain the crucial inequality 1+p/2 δj /δ¯ ≤ |δj |/δ¯ ≤ τj .
By assumption H(x) is globally H¨older continuous so that we can apply Lemma 2.1 because of (1.4). Hence we derive from (2.10) that lim sup δj ≤ τ¯1+p/2 · δ¯ < δ, j→∞
where the last inequality follows from the condition on τ¯ in step 0 of algorithm A. Now suppose that the decreasing sequence of function values fj ≡ f (xj ) is bounded below. Then it follows from (2.21) and (2.18) that lim |ϕj (sj )| = lim (fj − fj+1 )/(1 − δj ) = 0,
j→∞
j→∞
2
which implies (2.22) by (2.17).
Theorem 2.2 can hardly be called a convergence result. It merely establishes that for large j the model function ϕ¯j (s) provides effectively an upper bound on f (xj + s) − f (xj ). If f does not have the expected degree of H¨older continuity, the ratios %j /ksj k2 and consequently the matrices Qj may grow undbounded. Then the condition (2.17) can force the steps sj to zero even though gj may be bounded away from zero. However, if our assumptions hold, then we can choose steps such that sj → 0 implies that any cluster point of {xj }j≥0 must satisfy the second order necessary conditions (1.11). This will be shown in the following two sections.
3
Stationary Values and Points of the Model Function ψη
In this section we analyse a given model function of the form η 1 |s|2+p ψη (s) ≡ g T s + sT Hs + 2 2+p where
1
|s| ≡ (sT Qs) 2
with 0 < Q = QT ∈ IRn×n 9
(3.1)
(3.2)
and g ∈ IRn ,
H = H T ∈ IRn×n ,
p > 0 < η.
The constant η could be subsumed into Q and we are mainly interested in the case η = 1 where ψη = ϕ¯ as defined in (1.6) without the index j. However, as we will see later, it may be advantageous to choose the step s as a minimizer of ψη for some η ∈ [0, 1) rather than η = 1. Therefore we will allow for nonunitary values of η, which hardly affects the analysis at all. Differentiating (3.1) we obtain the gradient ∇ψη (s) = g + (H + η|s|p Q)s
(3.3)
∇2 ψη (s) = H + η|s|p Q + ηpQssT Q/|s|2−p ,
(3.4)
and the Hessian which is H¨older continuous with exponent p > 0 at the origin and differentiable at all other points in IRn .
3.1
Stationary values as functions of |s∗ |
According to (3.3) any stationary point s∗ of ψη must satisfy the system of equations H(α∗ )s∗ = −g,
α∗ = η|s∗ |p
(3.5)
where H(α) ≡ H + αQ for all α ∈ IR.
(3.6)
Substituting (3.5) into (3.1) we obtain the corresponding stationary value pη 1 |s∗ |2+p . ψη (s∗ ) = g T s∗ − 2 4 + 2p
(3.7)
Obviously this implies that, if the origin is a stationary point, i.e. g = 0, then all other stationary values are less than ψη (0) = 0. As suggested by (3.7) the stationary values ψη (s∗ ) are indeed strictly decreasing in |s∗ |, which is proved in the following theorem. Theorem 3.1 There exists a strictly increasing, convex function hp : [1, ∞) → [0, ∞) such that for any pair of stationary points s∗ , t∗ of ψη with |s∗ | ≤ |t∗ | the corresponding function values satisfy the inequality ψη (t∗ ) ≤ ψη (s∗ ) − η|s∗ |2+p hp (|t∗ |/|s∗ |).
(3.8)
In the general case p = 1 we have simply η ψη (t∗ ) ≤ ψη (s∗ ) − (|t∗ | − |s∗ |)3 . 6 10
(3.9)
Proof: Since s∗ and t∗ are by assumption stationary points we must have by (3.5) H(α∗ )s∗ = −g = H(β ∗ )s∗
(3.10)
where α∗ = η|s∗ |p ≤ β ∗ = η|t∗ |p . Multiplying the equations (3.10) from the left and right by t∗T and s∗T , respectively, we obtain −g T t∗ = t∗T Hs∗ + α∗ t∗T Qs∗ −g T s∗ = s∗T Ht∗ + β ∗ s∗T Qt∗ so that g T (t∗ − s∗ ) = (β ∗ − α∗ )s∗T Qt∗ ≤ (β ∗ − α∗ )|s∗ ||t∗ |,
(3.11)
where we have used the Schwarz inequality to derive the last relation. Substituting (3.11) into (3.7) we obtain pη 1 T ∗ ∗ ∗ ∗ ∗ 2+p ∗ 2+p ψη (t ) − ψη (s ) = g (t − s ) − (|t | − |s | ) , 2 2+p which implies (3.8) with 1 p 2+p p (ξ − 1) − ξ(ξ − 1) . hp (ξ) = 2 2+p
(3.12)
It can easily be checked that the continuous derivatives of hp satisfy h0p (1) = 0 < h00p (ξ) for ξ > 1, which implies the asserted properties of hp . Equation (3.9) follows directly from (3.8) and (3.12). 2 As an immediate consequence of Theorem 3.1 we note that s∗ ∈ IRn is a global minimizer 1 of ψη if and only if it is a stationary point of ψη for which the weighted distance |s∗ | = (α∗ /η) p to the origin is maximal. Two stationary points s∗ , t∗ have the same stationary value if and only if |s∗ | = |t∗ |, which requires in general s∗ = t∗ as we can see from the following analysis.
3.2
Correspondence between the stationary points of ψη and ψˆη : IR → IR.
In order to limit the number of stationary values and to determine when the corresponding stationary points are isolated we use the generalized eigenvalue decomposition Λ = −V T HV,
V T QV = I
where Λ ≡ diag(λ1 , λ2 , . . . , λn ) 11
with λ1 ≤ λ2 , ≤ · · · ≤ λn−m ≤ 0 < λn−m+1 ≤ · · · ≤ λn . The index m = ind(H) ∈ [0, n] is the number of negative eigenvalues of H. Now we can rewrite the stationary condition (3.5) as (Λ − α∗ I)ˆ s = gˆ, α∗ = ηkˆ sk p , where sˆ = V T Qs∗
and gˆ ≡ (γ1 , γ2 , . . . , γn )T = V T g.
Firstly let us assume for simplicity that γi 6= 0 for i ∈ [n − m + 1, n].
(3.13)
Then all stationary points of ψη belong to the vector family s(α) ≡ −H(α)−1 g = V (Λ − αI)−1 gˆ,
(3.14)
where α ∈ [0, ∞) − {λj }nj=n−m+1 . Any such vector s∗ = s(α∗ ) is a stationary point of ϕ¯η if and only if α∗ is a nonnegative root of the function ψˆ0 η (α) ≡ 12 s(α)T Qs(α) − 21 (α/η)2/p n (3.15) P γi2 1 α 2/p = 12 − , 2 (λi −α) 2 η i=1
which is the derivative of n
1X p γi2 ψˆη (α) = − 2 i=1 (λi − α) 4 + 2p
1+2/p α . η
(3.16)
Consequently, s∗ = s(α∗ ) is a stationary point ψη if and only if α∗ is a nonnegative stationary point of ψˆη , in which case we have by comparison of (3.7) and (3.16) ψη (s∗ ) = ψˆη (α∗ ) since by definition of s(α) in (3.14) n X
γi2 /(λi − α) = −g T s(α).
i=1
Substituting (3.5) into (3.4) we find furthermore after some elementary manipulations that det(∇2 ψη (s∗ )) = −det(H(α∗ ))ψˆη00 (α∗ )ηp/|s∗ |2−p , where ψˆη00 (α)
=
n X i=1
γi2 1 − (λi − α)3 pη 12
2/p−1 α . η
(3.17)
(3.18)
If ψˆη00 (α∗ ) = 0 so that α∗ is not a strongly isolated stationary point of ψˆη , the Hessian ∇2 ψη (s∗ ) is singular but s∗ is nevertheless an isolated stationary point of ψη since det(H(α∗ )) 6= 0. As the matrix ∇2 ψη (s∗ ) − H(α∗ ) ≥ 0 is of rank one, it follows from the interlacing eigenvalue theorem that 0 if ψˆ00 (α∗ ) < 0 η ind(∇2 ψη (s∗ )) − ind(H(α∗ )) = −1 if ψˆ00 (α∗ ) ≥ 0.
(3.19)
η
Since α∗ ∈ [λi , λi+1 ) implies ind(H(α∗ )) = ind(α∗ I − Λ) = n − i,
(3.20)
we can therefore completely characterize all stationary points of ψη by analyzing the graph of ψˆη . To illustrate the | − | correspondence between the stationary points of ψη and ψˆη we consider the simple example .1 1 T 1.2 0 1 T ψ1. (s) = s − s (3.21) s + ksk3 .2 2 0 2.0 3 with p = 1 = η, Q = I, λ1 = 1.2, λ2 = 2. and m = 2 . The corresponding function ψˆ1. (α) takes the form 1 .01 .04 1 3 ˆ ψ1. (α) = + − α 2 (1.2 − α) (2 − α) 3 and its derivative is given by 0 ψˆ1. (α)
1 .01 .04 2 = + −α . 2 (1.2 − α)2 (2 − α)2
These two functions are plotted in Figure 1. Since the nondegeneracy assumption (3.13) is satisfied, ψ1. has exactly five stationary points s∗k = s(αk∗ ) = −H(αk∗ )−1 g for k = 1, . . . , 5 0 which correspond to the five roots {α1∗ ≤ α2∗ ≤ · · · ≤ α5∗ } of ψˆ1. Using (3.19) and (3.20) we ∗ ∗ can easily derive that s5 is the global minimizer, s4 is a second local minimizer, s∗3 and s∗2 are two saddle points and s∗1 is the local maximizer of ψ1 . In agreement with Theorem 3.1 the stationary values {ψ1. (s∗k ) = ψˆ1. (αk∗ )}5k=1 are strictly decreasing in k. This example in two variables is of particular interest since we may want to confine the exact analysis of a general model function ψη ∈ C(IRn ) to a suitable two dimensional plane in order to reduce the computational cost. This approach will be discussed in some detail in Section 5. The function ψ1. given in (3.21) has five stationary points because H is negative definite and g
13
Figure 1: is comparatively small. In general ψη can have up to 2m + 1 stationary points as will be shown below. On each of the m bounded intervals (0, λ if i=n−m n−m+1 ) Ji ≡ (λ , λ ) for i = n − m + 1, . . . , n − 2, n − 1 i i+1 the function ψˆη is continuous and by Theorem 3.1 its stationary values are strictly decreasing. Since ψˆη0 (α) is by (3.15) with (3.13) positive for α sufficiently close to either boundary, ψˆη is either strictly increasing on Ji or has a maximizer αi− followed by a minimizer αi+ which may merge to a single stationary point where ψˆη00 is also zero. We exclude this possibility by the nondegeneracy assumption ψˆη0 (α) = 0 ≤ α =⇒ ψˆη00 (α) 6= 0,
(3.22)
which implies by (3.17) also that all stationary points of ψη are isolated. If αi− , αi+ ∈ Ji exist, + + ∗ the corresponding stationary points s− i = s(αi ) and si = s(αi ) have by (3.19) and (3.20) the indices ind(∇2 ψη (s+ )) = n − i and ind(∇2 ψη (s− )) = n − i − 1. (3.23) On the last interval Jn = (max{0, λn }, ∞), ψˆη is strictly concave since ψˆη00 (α) as given in (3.18) is negative for all α ∈ Jn . Since furthermore ψˆη0 (α) is positive for α sufficiently close to max{0, λn } and negative for α sufficiently large, there is a unique stationary point αn∗ ∈ Jn . Because of Theorem 3.1 the corresponding vector s∗n = −H(αn∗ )−1 g 14
is the only global minimizer of ψη withe ∇2 ψη (s∗n ) being positive definite by (3.19) and (3.20). If H is indefinite so that m > 0, λn > 0 and consequently Jn−1 6= ∅, there may be − − 2 a second local minimizer s+ n−1 and a corresponding saddle point sn−1 where ∇ ψη (sn−1 ) has + one negative eigenvalue. Over all ψη can have up to m pairs of stationary points {s− i , si } besides s∗n , possibly including one local maximizer s− 0 if H is negative definite. At the very end of this section we consider an example where ψ1 has one local maximum and n pairs − of stationary points {s+ i−1 , si } with the same index n − i for i = 1, . . . , n. This example demonstrates that our bound on the number of stationary points with a certain index is sharp.
3.3
Analysis of certain degeneracies
+ If the nondegeneracy assumption (3.23) is violated, some of the pairs {s− i , si } may reduce 2 ∗ ∗ to a single stationary point si where ∇ ψη (si ) is singular. This kind of degeneracy cannot affect the global minimizer s∗n and is therefore of little interest. However, if for vn the last column vector of V
γn = g T vn ≈ 0 and λn > 0, then the second local minimizer s+ n−1 is likely to exist with ∗ ψη (s+ n−1 ) ≈ ψη (sn ).
In general this does not imply that the two local minimizers are close but it may be difficult to compute them or to decide numerically which one of them is global. To see this we consider the degenerate case γn−1 6= γn = 0 ≤ λn > λn−1 .
(3.24)
By simply deleting the n-th term in (3.15) and (3.16) we can make ψˆη continuously differentiable on (λn−1 , ∞). If the resulting value n−1
γi2 1 1X ψˆη (λn ) = − 2 2 i=1 (λn − λi ) 2
λn η
2/p
is positive, nothing has changed since ψˆη must have a maximizer αn∗ ∈ Jn = (λn , ∞) and − + may have a pair of stationary values {αn−1 , αn−1 } ⊂ Jn−1 if λn > 0. If, on the other hand, 0 ∗ ˆ ψη (λn ) is negative, then there must be one stationary point αn−1 ∈ Jn−1 and the nonlinear ∗ ∗ system (3.5) has for α = αn ≡ λn the two solutions q γ1 γ2 γn−1 ± 0 sn ≡ V , , ··· , , ± −2ψη (λn ) (3.25) λn − λ1 λn − λ2 λn − λn−1 with + ˆ ψη (s− n ) = ψη (λn ) = ψη (sn ).
15
This last equation follows from (3.7) and the fact that the difference s+ − s− n is a multiple of vn and thus orthogonal to g. If ψˆη0 (λn ) 6= 0, the vectors Qs± are nonorthogonal to the n nullvector vn of H(λn ) ≥ 0 so that by (3.4) 2 + ∇2 ψη (s− n ) > 0 < ∇ ψη (sn ).
Thus we have two strongly isolated global minimizers s− , s+ n which must be the limit points + ∗ of the previously defined local minimizers sn−1 and sn as γn tends to zero. The only real degeneracies occur if either ψˆη0 (λn ) = γn = 0 ≤ λn
(3.26)
or for some k ≥ 1 ψˆη0 (λn ) ≤ 0 = γn ≤ λn = λn−1 = · · · = λn−k . ∗ s+ n−1 , sn
(3.27) s− n−1
In the first case (3.26) the two minimizers and the saddle point reduce to a unique global minimizer with Hessian of rank ≤ n − 1 as γn tends to zero. In the second case (3.27) the global minimizers form an ellipsoidal manifold of dimension k which reduces again to a single point if ψˆη0 (λn ) = 0. The situation is very similar if for any other i ∈ [n − m + 1, n − 1] ψη0 (λi ) ≤ γi = 0 ≤ λi . − In general this means only that the values of ψη at the stationary points s+ i−1 and si with 0 the same index n − i become equal. These two points must merge if ψη (λi ) = 0 and may degenerate into ellipsoidal surfaces if λi is a multiple eigenvalue. Therefore ψη can never have more than 2m + 1 stationary values even if the nondegenerate condition (3.13) is violated. As a simple degenerate example, which also demonstrates that our bound on the number of stationary points is sharp if all eigenvalues of H are simple, we consider the function
1 1 ψ1. (s) = − sT diag(1, 2, . . . , n)s + ksk2+p , 2 2+p
(3.28)
where p > 0, η = 1. , g = 0, Q ≡ I and λi = i for i = 1, 2, . . . , n. The function p ψˆ1. (α) = − α1+2/p 4 + 2p has only one stationary point α0∗ = 0 that corresponds to the only local maximizer s∗0 = 0 of ψ1 (s). With ei ∈ IRn , the i-th Cartesian basis vector, the system (3.5), has for each αi∗ = λi = i ∈ [1, n] the two solutions 1/p s− ei i = −i
with ψ1. (s− i ) = −
1/p and s+ ei i = i
pi1+2/p = ψ1 (s+ i ) 4 + 2p
and + 2 ind(∇2 ψ1 (s− i )) = n − i = ind(∇ ψ1 (si )).
Thus we see that ψ1. as defined in (3.28) has exactly 2n + 1 stationary points including the two local minimizers s± n , which are both global. 16
3.4
The suitability of ψη for unconstrained minimization
At the end of this section we discuss briefly the usefulness of ψη (s) with g = g(x) and H = H(x) as a model function of f (x + s) − f (x) at the current iterate x ∈ IRn . With the exception of highly degenerate cases satisfying (3.27) there is always one global minimizer s∗n of ψη . If H has a sizable negative eigenvalue −λn and the directional derivative γn of f along the corresponding eigenvector vn is rather small, then ψ1 may have a second local − minimizer s+ n−1 and at least one corresponding saddle point sn−1 . This is particularly likely if all other eigenvalues −λi of H are positive or the corresponding gradient components γi are small. Thus we can imagine that x lies near the crest of a ridge or close to a saddle point of f . Then it seems indeed quite likely that f itself has at least two local minimizers in the vicinity of x and it is a matter of chance which one would be located by any particular minimization procedure. Therefore we conclude that ψη can quite appropriately reflect the essential features of the objective function f outside its region of convexity. This is particularly important in the case of Newton’s method, which can be attracted by any stationary point so that the combination g ≈ 0 and H 6> 0 may quite possibly arise. Mathematically the function ψη is considerably more involved than the quadratic model ϕ, but still much less complicated than a fully fledged cubic model f (x) + ϕ(s) + T s3 ≈ f (x + s),
(3.29)
where T is a symmetric third order tensor. Such a model is never bounded below and may have between 0 and 2n stationary points compared to a maximum of 2n+1 for ψη . Therefore the cubic expansion (3.29) is unlikely to be useful far away from any particular solution but it can be used to accelerate local convergence especially if the quadratic model is unsatisfactory due to singularity or near singularity of the Hessian. In contrast, the model function ψη does not yield higher order corrections for the final approach to a solution but provides merely a measure of conservativeness by balancing the predicted gain ϕ(s) against an estimate of the possible disruption due to the cubic term. Therefore all calculations related to ψη do not have to be performed very accurately and Q can be restricted to a small class of positive matrices in order to save storage and computing expense per step. Nevertheless, the computational effort and possibly also the storage requirement is greater than for some other modifications of Newton’s method.
4
Step Selection and Convergence Analysis
Since the step to the global minimizer of ψ1. (s) is in general too conservative, we consider the solutions of the nonlinear system H(α)s∗ = −g,
α = η|s∗ |p
(4.1)
as functions of the multiplier η ∈ (0, ∞). It will be shown that for η within a certain interval the steps s∗ satisfy the descent condition (2.17) and lead to global convergence with Q-order 1 + p under mild assumptions on f . The solutions s∗ of (4.1) may be viewed either as unconstrained stationary points of ψη as given in (3.1) or as stationary points of ψ1. = ϕ¯ subject to the constraint |s| = |s∗ | = constant. 17
We are mainly interested in the global minimizers of the functions ψη but will later on also allow for the possibility that s∗ is another stationary point, preferably the second local minimizer of ψη . For each ˆ n ≡ max{0, λn } α>λ (4.2) the system (4.1) has the unique solution s∗ (α) = −H(α)−1 g
(4.3)
η = α/|s∗ (α)|p .
(4.4)
with Unless g = 0 we have because of H(α) > 0 d 1 ∗ |s (α)|2 = −s∗ (α)T QH(α)−1 Qs∗ (α) < 0 dα 2
(4.5)
ˆ n with so that |s∗ (α)| is a strictly decreasing function of α > λ lim |s∗ (α)| = 0.
α→∞
Now it follows from (4.4) that η is a strictly increasing differentiable function of α with lim η = ∞.
α→∞
Consequently there is a limit ηˆ ≡ lim ≥ 0 ˆn α→λ
(4.6)
which is by (4.2) and (4.4) zero if and only if λn ≤ 0 or
lim |s∗ (α)| = ∞.
α→λn
(4.7)
Either of these conditions will usually be satisfied, in which case it follows from the implicit function theorem applied to the equation (4.4) or equivalently ψˆη0 (α) = 0 that ˆ n , ∞) (α) ≡ (α)η ∈ (λ is a strictly increasing differentiable function of η ∈ (0, ∞). Then the global minimizers s∗η ≡ s∗ (αη ) of the functions ψη form a smooth curve with |s∗η | strictly decreasing in η > 0. For η = 1. we have the global minimizer s∗1. of ψ1. and as η tends to infinity s∗η reduces to the origin. If H ≥ 0 and the linear system Hs = −g (4.8) is consistent, then the vectors {s∗η }η>0 are bounded and tend to the limit s∗0. ≡ lim s∗η = lim H(α)−1 g. η→0
α→0
18
It is well known that this vector is the minimal solution of (4.8) with respect to the ellipsoidal norm |.|. If H > 0, then ψη is strictly convex even for η = ηˆ = 0 and the limiting step s∗0 is simply the full Newton correction −H −1 g. This must always be the case when the current iterate x is sufficiently close to a strongly isolated minimizer x∗ of f . Then we may select a step s = s∗η between the conservative global minimizer s∗1. of ψ1. and the more daring, longer Newton step s∗0. , which is eventually also safe as we will see later. ˆ n = λn ≥ 0, Further away from x∗ the Hessian H may be indefinite or singular so that λ in which case the linear system H(λn )s = −g (4.9) is inconsistent unless for all i ∈ [1, n] λn = λi =⇒ γi = 0.
(4.10)
Except in this degenerate case equation (4.7) is always true and if λn ≥ 0 we must have lim |s∗η | = lim |H(α)−1 g| = ∞.
η→ˆ η =0
α→λn
If (4.10) holds with λn−1 < λn ≥ 0, then (4.9) has a solution s∗ηˆ with minimal |s∗ | where ηˆ = lim η = λn /|s∗ηˆ|p α→λn
equals zero if and only if λn = 0. Then any function ψη with η ∈ (0, ηˆ) has the two global ± − + minimizers s± η ) and {sη }η∈(0,ˆ η) η ≡ sn as given in (3.25). The two smooth curves {sη }η∈(0,ˆ emanate from the point s∗ηˆ, which is the unique global minimizer of ψηˆ. Provided λn > 0, we obtain from (4.4) with α = λn that for η ∈ (0, ηˆ) the norms + 1/p |s− η | = |sη | = (λn /η)
are strictly decreasing in η with lim |s+ η | = ∞.
η→0
Thus we can theoretically distinguish three different possibilities. If the quadratic model ψ0. = ϕ is bounded below, the vectors {s∗η }η≥0 form a smooth bounded curve between the origin s∗∞ = 0 and the Newton correction s∗0. . If ψ0. is unbounded below, the functions {ψη }η≥ˆη have unique global minimizers {s∗η }η≥ˆη with |s∗η | tending to infinity as η reduces to zero unless ηˆ > 0. This third possibility is rather exceptional since it occurs only if (4.10) holds with λn > 0. Then the original curve {s∗η }η>ˆη bifurcates at the point s∗ηˆ into two + branches {s− η ) and {sη }η∈(0,ˆ η ) , which may degenerate into ellipsoidal surfaces if γn is η }η∈(0,ˆ a multiple eigenvalue. In all three cases |s∗η | or |s∓ η | grows monotonically as the weighting 2+p factor η on the cubic term |s| is reduced. Unless a successful Cholesky decomposition of the Hessian H establishes its positive definiteness, we have to compute at least the eigenvalue λn and the associated eigenspace in order to determine with any accuracy which one of the three cases described above applies for a given model function ψ1. If it turns out that λn ≈ 0 and γn ≈ 0, then it will still be difficult to decide whether ηˆ as defined in (4.6) is zero or not. Furthermore the two local 19
minima of some ψη may be so close in value that it would be meaningless to declare one of them as global in the presence of rounding errors. Therefore we assume from now on only that the vector s∗η solves (4.1) for a suitable η ≥ 0 such that g T s∗η ≤ 0 and ∇2 ψη (s∗η ) ≥ 0, which must be satisfied if s∗ is a global minimizer of ψη . Then we derive from (4.1) for the reduction of the quadratic model ϕ(s) = ψ0. (s) ϕ(s∗η ) = =
∗ g T s∗η + 21 s∗T η Hsη 1 T ∗ g sη 2
− 21 η|s∗η |2+p
∗ ∗ 2+p = − 12 s∗T . η Hsη − η|sη |
Therefore the descent condition ¯ ϕ(s∗η ) ≤ −|s∗η |2+p /[(2 + p)δ]
(4.11)
¯ −1 η ≥ η¯ ≡ [(1 + p/2)δ]
(4.12)
is satisfied if either or
1 ∗ ¯. (4.13) s∗T η Hsη ≥ 0 and η ≥ η 2 Obviously neither (4.12) nor (4.13) is a necessary condition for (4.11) and if x is sufficiently close to a strongly isolated minimizer x∗ of f , the full Newton correction s∗0. must certainly ∗ satisfy (4.11) since the cubic term |s∗0. |2+p will be negligible compared to ϕ(s∗0 ) = − 21 s∗T 0 Hs0 . ∗ ∗ At some distance from x the Hessian H may still be positive definite but s0 can be so large that the condition (4.11) is violated. Then we are guaranteed to achieve a significant reduction of ψ1. by choosing a step s∗η with η ≥ 21 η¯. Even further away from x∗ the Hessian is likely to be indefinite and we may have to choose η ≥ η¯ in order to ensure (4.11). Thus we can apply the algorithm A described in Section 2 such that at each step (Hj + ηj |sj |p Qj )sj = −gj , ηj ∈ [0, η¯]
(4.14)
≥ 0. Hj + ηj |sj |p Qj + pηj Qj sj sTj Qj /|sj |2−p j
(4.15)
and Under the assumptions of Theorem 2 these conditions on sj lead to the following convergence result. Theorem 4.1 Let {xj+1 = xj + sj }j≥0 ⊂ IRn ba an infinite sequence generated by algorithm A with {ˆ τj }j≥0 and {sj }j≥0 satisfying (2.5), (4.14) and (4.15). If the Hessians {Hj }j≥0 are bounded and f ∗ ≡ lim f (xj ) > −∞, (4.16) j→∞
then 20
(i) lim sj = 0 = lim gj
(4.17)
lim inf v T Hj v ≥ 0 for all v ∈ IRn
(4.18)
sup kHj sj + gj k/ksj k1+p < ∞.
(4.19)
j→∞
j→∞
j→∞
j≥0
(ii) The cluster points {xj }j≥0 form a closed connected subset C ⊆ C¯ ≡ {x∗ ∈ g −1 (0) ∩ f −1 (f ∗ )|H(x∗ ) ≥ 0}.
(4.20)
(iii) If furthermore H(x∗ ) > 0 for some x∗ ∈ C, then it follows from (4.19) that sup kxj+1 − x∗ k/kxj − x∗ k1+p < ∞,
(4.21)
j≥0
which means that the xj converge to the strongly isolated minimizer {x∗ } = C with a Q-order not less than 1 + p. Proof (i) By Lemma 2.1 and Theorem 2.2 the matrices Qj as well as Q−1 are bounded and j the steps sj tend to zero since the fj are bounded below by assumption (4.16). Therefore all three assertions in (i) are immediate consequences of (4.14), (4.15) and the assumed boundedness of the Hj . (ii) It is well known (e.g. NR.14.1-2 in [1] ) that because of (4.17) the set C must be closed and connected. Because of (4.16), (4.17) and (4.18) any x∗ ∈ C must obviously satisfy f (x∗ ) = f ∗ , g(x∗ ) = 0 and H(x∗ ) ≥ 0. (iii) Since H(x∗ ) is nonsingular, x∗ must be isolated in the set C, which cannot contain any other point because it is by (ii) connected. For sufficiently large j we must have Hj > 0 so that as a consequence of (4.5), |sj |j ≤ |Hj−1 gj |j = 0(kgj k), where we have used again the boundedness of Qj and Q−1 j . Then we have by (4.14) and Taylor’s theorem gj+1 = gj + Hj sj + 0(ksj k1+p ) = 0(kgj k1+p ), which implies that the {gj }j≥0 converge to zero with a Q-order not less than 1 + p. As H(x∗ ) is nonsingular, this is equivalent to assertion (4.21), which completes the proof. 2 According to Theorem 4.1 the modified Newton method defined by algorithm A and the step conditions (4.14), (4.15) has very satisfactory global and local convergence properties. If any one of the level sets Lj ≡ {x ∈ IRn |f (x) ≤ fj } is bounded and the set C¯ defined in (4.20) is discrete, then the iterates {xj }j≥0 converge to a limit point x∗ ∈ C¯ which must be a minimizer of f if det(H(x∗ )) 6= 0. If H(x∗ ) is singular, the stationary point x∗ need not be a minimizer. 21
For example, if f ∈ C(IRn ) is given by f (x) = x3 ,
(4.22)
then our method like many others would converge from any x0 > 0 linearly to the saddle point x∗ = 0. This is not surprising since x∗ = 0 is, after all, the global minimizer of the twice Lipschitz continuously differentiable function |f (x)|, which has for x ≥ 0 the same values as f (x). The example (4.22) can also be used to show that in contrast to the nonsingular case we may never be able to choose ηj = 0. Then all steps are shorter than the full Newton correction 1 −f 0 (xj )/f 00 (xj ) = − xj 2 and we obtain linear convergence with a limiting ratio between 12 and 1. The situation is even worse if H(x) is singular and differentiable at a minimizer x∗ of f . Whereas it was shown in [13] under the assumption rank(H(x∗ )) = n − 1 that the unmodified Newton iteration is likely to converge with a limiting ratio between 2/3 and 4/5, it can be easily seen that the proposed modification will lead to sublinear convergence. Thus we conclude that our convergence result is of a purely theoretical nature if the Hessian is singular at the limiting point x∗ . If H(x∗ ) is nonsingular, it follows from the boundedness of the Qj that we may choose ηj = 0 so that sj = −Hj−1 gj for all sufficiently large j. Even if we fail to do that, the convergence order is not worse than that of the unmodified Newton iteration unless H is in fact H¨older continuous with an exponent p˜ > p at x∗ . In the latter case Newton’s method has a Q-order not less than 1 + p˜ at x∗ whereas it can be easily derived from (4.14) that lim sup ηj > 0 j→∞
implies lim sup kxj+1 − x∗ k/kxj − x∗ k1+p > 0, j→∞
which means together with (4.21) that the Q-order of the modified method is exactly equal to 1 + p. Therefore one should choose ηj = 0 whenever compatible with (4.11), which is in any case desirable because it minimizes the computational effort per step. The numerical problem of computing a minimizer s∗η of ψη for a suitable η ≥ 0 is very similar to that of solving the trust region system H(α)s∗η = −g, |s∗ | = η −p , α ≥ max{0, λn }, which has been considered by Hebden [14], Gay [15] and Mor´e and Sorensen [16]. They developed a scheme which requires on average no more than two Cholesky factorizations to compute a solution s∗ with η within an acceptable tolerance of the desired value. Their method can be easily adopted to our approach and numerical results will be reported in a forthcoming paper. 22
5
Inexact Implementation and Some Numerical Results
Especially as long as the full Newton step does not satisfy the condition (4.11) we may not want to expand the computational effort required for solving the system (4.1) accurately. In large scale applications it may even be impossible or uneconomical to form a single Cholesky decomposition of H(α). Therefore we adopt the philosophy of Dembo, Eisenstadt and Steihaug [17] by allowing for an inexact solution of (4.1) by a suitable iterative scheme. One such possibility is to perform an unconstrained minimization of some ψη with η ∈ [0, η¯] until ψ1. is significantly reduced or the gradient ∇ψη is sufficiently small. This approach seems particularly attractive because given an iterate s and a search direction t ∈ IRn we can easily compute the global minimum of ψη on the straight line {s + µt|µ ∈ IRn } or on the two dimensional subspace {sξ + tµ |ξ, µ ∈ IRn }. We will refer to these two restricted minimization calculations as a line search and a plane search, respectively. Both are based on the matrix vector products Hs, Ht, Qs and Qt but require otherwise a negligible computational effort due to the special algebraic structure of ψη . For the calculations reported in Section 6 the following minor iteration was used to compute a step s for the major iteration defined by algorithm A. Since it is hoped that the full Newton step is eventually well defined and safe, we start by minimizing the quadratic function ϕ = ψ0. with the standard conjugate gradient method preconditioned by Q such that successive gradients are orthogonal with respect to Q−1 . T. Steihaug has shown in [18] that the norm |s| of the resulting iterates s is strictly increasing so that we can expect that the method achieves a significant reduction of ϕ before the cubic term |s|2+p can build up to comparable size. If we reach an iterate s with ¯ |s|2+p > −ϕ(s)[(2 + p)δ] or encounter a search direction t ∈ IRn with tT Ht ≤ 0,
(5.1)
then we reset η to 1/2¯ η or η¯ as defined in (4.12) respectively. Subsequently ψη is minimized by a modification of the preconditioned Beale’s method with Powell restart, which performs a plane search at each step. Asymptotically the line search and the plane search lead to the same point, but we hope that the latter is more effective in the early stages of the minor iteration. If we encounter a direction of nonpositive curvature t satisfying (5.1) while still η = 12 η¯, then we reset η = η¯ and restart the minimization with the initial direction t. ∗ In the limit the procedure sketched above converges to a stationary point s of ψη with 1 η ∈ 0, 2 η¯, η¯ . Moreover, due to the use of plane searches at each iteration we must have for all iterates s ψη (s) = min{ψη (µs)}. (5.2) µ∈IR
Since
η 1 ψη (µs) = µg T s + µ2 sT Hs + |µ|2+p |s|2+p , 2 p+2 the identity (5.2) implies g T s + sT Hs = −η|s|2+p 23
(5.3)
and furthermore gT s ≤ 0 since otherwise ψη (−s) < ψη (s) As in Section 4 it can be easily seen that the condition ¯ ϕ(s) < −|s|2+p /[(2 + p)δ] must hold if
1 η ≥ η¯ or η ≥ η¯ and sT Hs ≥ 0. 2 An elementary calculation shows that equation (5.3) implies for the residual ∇ψη (s) kQ−1/2 (g + Hs + η|s|p Qs)k2 = (g + Hs)T (Q−1 − ssT /|s|2 )(g + Hs),
(5.4)
which equals zero if and only if s = s∗ is a stationary point of ψη . Thus we can apply algorithm A described in Section 2 such that at each step for some constant ε¯ < 1 ηj = −(gjT sj + sTj Hj sj )/|s|2+p ∈ [0, η¯] (5.5) and T 2 T −1 ε2j ≡ (gj + Hj sj )T (Q−1 ¯2 . j − sj sj /|sj |j )(gj + Hj sj )/gj Qj gj ≤ ε
(5.6)
Furthermore we may hope that, if the step sj has been obtained by minimizing some function ψη,j , it is sufficiently close to a minimizer of ψη,j such that ∇ψη,j (sj ) ≥ 0 and consequently by (3.4) minn v T Hj v/kvk2 ≥ −¯ η (1 + p)kQj k |sj |pj . (5.7) v∈IR
Under the assumptions of Theorem 2.2 these conditions on the steps sj lead to the following convergence result. Theorem 5.1 Let {xj+1 = xj + sj }j≥0 ⊂ Rn be an infinite sequence generated by algorithm A with {ˆ τj }j≥0 and {sj }j≥0 satisfying (2.5), (5.5) and (5.6). If the Hessians {Hj }j≥0 are bounded and f ∗ ≡ lim f (xj ) > −∞, j→∞
then (i) lim sj = 0 = lim gj
j→∞
j→∞
and the cluster points of {xj }j≥0 form a closed connected subset C˜ ⊆ g −1 (0) ∩ f −1 (f ∗ ) which contains at least one point x∗ with H(x∗ ) ≥ 0 if (5.7) holds for infinitely many j ≥ 0. 24
˜ then the xj converge to the strongly (ii) If furthermore det(H(x∗ )) 6= 0 for some x∗ ∈ C, ∗ isolated stationary point {x } = C˜ such that with Q∗ = (Q∗ )T > 0 as defined in Lemma 2.1 T gj+1 (Q∗ )−1 gj+1 (5.8) lim T ∗ −1 2 = 1 j→∞ g (Q ) gj ε j J provided lim kgj kp /εj = 0.
j→∞
(5.9)
Proof: (i) By Lemma 2.1 and Theorem 2.2 the Qj and sj tend to Q∗ and 0, respectively, so that by (5.4) and (5.6) 0 ≥ lim sup[(gj + Hj sj )T Q−1 ¯2 gjT Q−1 j (gj + Hj sj ) − ε j gj ] j→∞
≥ lim sup(1 − ε¯2 )gjT Q−1 j gj j→∞
=
lim (1 − ε¯2 )gjT (Q∗ )−1 gj = 0,
j→∞
which proves gj → 0. Together with sj → 0 this ensures again the closedness and connectedness of C˜ ⊆ g −1 (0) ∩ f −1 (f ∗ ), which must contain at least one cluster point x∗ of any subsequence of iterates at which (5.7) is satisfied. ˜ (ii) Since by assumption det(H(x∗ )) 6= 0, the stationary point x∗ must be isolated in C, which can therefore not contain any other point due to its connectedness. Since lim Hj + ηj |sj |pj Qj = H(x∗ )
j→∞
is nonsingular, there is a constant c such that k(Hj + ηj |sj |pj Qj )−1 k ≤ c for j sufficiently large, which implies by (5.4) and (5.6) ksj k = k(Hj + ηj |sj |pj Qj )−1 (gj + Hj sj + ηj |sj |pj Qj sj − gj )k −1/2
≤ c(¯ εkQj
gj k + kgj k) ≤ c˜kgj k,
where c˜ is a suitable positive constant. Now it follows again from (1.1) and Taylor’s theorem that gj+1 = gj + Hj sj + 0(ksj k1+p ) = gj + Hj sj + ηj |sj |pj Qj sj + 0(kgj k1+p ), which implies by (5.4) and (5.6) that −1/2
kQj
−1/2
gj+1 k = εj kQj
25
gj k + 0(kgj k1+p ).
−1/2
Dividing by εj kQj
gj k we obtain with (5.9) −1/2
lim kQj
j→∞
−1/2
gj+1 k/(kQj
gj kεj ) = 1,
which implies (5.8) since Qj → Q∗ by Lemma 2.1.
2
Equation (5.8) shows that the rate at which the sequence gj converges to zero can be exactly controlled by the choice of the forcing sequence {εj }j≥0 ⊆ [0, ε¯]. In particular we have Q-superlinear convergence of gj → 0 and consequently also of xj → x∗ if εj → 0. If H(x∗ ) > 0 and the quadratic model ϕj were minimized by conjugate gradients, then εj → 0 would almost certainly require that the system Hj s = −gj is eventually solved exactly at each step so that the method reduces to the exact Newton iteration. In practice one may prefer to settle for a satisfactory linear convergence by choosing a nonzero limit ε∗ ≡ lim εj ∈ (0, ε¯]. j→∞
Then it follows from (5.8) that both sequences gj and xj converge with the same linear R-factor lim kgj k1/j = ε∗ = lim kxj − x∗ k1/j . j→∞
j→∞
It should be noted that in contrast to the original result in [17] the rate of convergence results does not depend on the assumption H ∗ > 0 but merely requires det(H ∗ ) 6= 0. Without a complete triangular or eigenvalue decomposition of the Hessian we can never exclude the possibility of convergence to a saddle point, in which case (5.8) ensures at least that no excessive amount of computing time is wasted. To force convergence as such the condition (5.6) is not necessary at all. It can be easily seen that if at the end of each minor iteration a plane search with respect to some function {ψη }η∈[0,¯η] is performed in the span of −g and the suggested step s, then the resulting iterates converge to a stationary point provided they are bounded. However, the rate of convergence could be as bad as that of steepest descent, which seems certainly unacceptable when the Hessian is explicitly available, as we have assumed throughout the paper.
6
Computational Experience
A subroutine implementing our method as described in Section 4 has been used extensively in a program for minimizing energy potentials of molecular lattices. In this program the number of independent variables is restricted to seven at a time so that the system (4.1) can be solved exactly on the basis of the eigenvalue decomposition of Q−1/2 HQ−1/2 , which requires a negligible computational effort compared to the evaluation of the potential function and its derivatives. The routine was found to be reliable and efficient but its performance is somewhat sensitive to the choice of the initial Q0 . However, the rules for selecting the update factor τˆj and the weights ηj used in this routine were different from those given in Section 2 and 4, respectively. Numerical results were reported in the computational study [19], which places particular emphasis on the suitable parametrization of rigid body orientations. If this is done in terms 26
of the classical Euler angles, the condition number of the Hessian H ∗ at the minimizer depends on the definition of the neutral molecular orientation or, equivalently, the choice of the laboratory frame. It was found that if the Qj were restricted to multiples of the identity, the performance of the modified Newton routine would be strongly dependend on the conditioning of the Hessian in the neighbourhood of the minimizer. If, on the other hand, the Qj were updated by the rank one formula (2.8), then the performance would deteriorate only slightly up to near singularity of H ∗ . When the laboratory frame is chosen optimally, the condition number of H ∗ is less than ten and the performance of the modified Newton routine is virtually the same whether the Qj are updated according to (2.7) or (2.8). In large scale applications one cannot store or manipulate a full matrix Q = QT > 0 or its Cholesky factor. This would not even be desirable because at least n rank one updates are necessary to enlarge Q in all directions to a certain size if it was chosen far too small initially. Throughout the remainder of this section Q will be restricted to a multiple of the identity. Firstly we consider briefly the problem of minimizing iteratively the model function ψη as given in (3.1) with Q = I, p = 1 = η, n = 50, g = (1, 1, . . . , 1)T and for k ∈ {−50, −25, 0, 25, 50} H = U (diag(1, 2, . . . , 50) + (k − 25)I)U T , where the orthogonal matrix U T = U −1 was generated at random. The following table gives the number of iterations/evaluations required to reduce k∇ψη k below .1 for Powell’s Harwell routine VA14AD and a special purpose conjugate gradient routine PCGCMF which performs plane searches at each step. Table 1 -50
-25
0
25
50
PCGCMF
36/37
21/22
29/30
14/15
3/4
VA14AD
45/95
20/41
31/60
13/24
3/6
k=
A closer inspection of the output reveals that PCGCMF reduces ψη much faster during the first few iterations, in particular in the cases k ∈ {−50, −25, 0}, where H has a negative eigenvalue. Each iteration of PCGCMF and each function + gradient evaluation for VA14AD involves essentially the multiplication of H and Q by a single vector. Since VA14AD cannot exploit the special algebraic structure of ψη it uses on average one additional evaluation per step, but the number of iterations is practically the same for both routines except in the case of k = −50 where H is negative definite. Even though we require only that the norm k∇ψη k must be reduced by a facor of about 1/71 compared to its value kgk at the initial point s = 0 the approximate minimization of ψη takes some 30 iterations when H has a negative eigenvalue. In these three cases k ∈ {−50, −25, 0} the norm of ∇ψη is at some intermediate points much larger than at the origin so that its reduction below kgk/71 is not a trivial requirement. In fact the minimal value of ψη was computed up to at least six significant digits in these three cases. This need not be a disadvantage since we may expect that if H is indefinite, then a thorough 27
minimization of ψη increases the probability of reaching the neighbourhood of one of its local minima rather than a saddle point. In practice we hope that H ≥ 0 at most points of the major iteration, in which case the η |s|2+p hardly affects the performance of conjugate gradient routines adding of the term 2+p compared to the quadratic case η = 0. The following numerical results were obtained exclusively on the generalized Rosenbrock function 50 X 2 f (ξ1 , ξ2 , . . . , ξ50 ) ≡ )2 + (1 − ξi )2 ], [100(ξi − ξi−1 i=2
which allows us to study the effects of various parameter settings on the performance of the inexact modified Newton method described in Section 5; a code sketch of this test function follows this paragraph. A more comprehensive performance evaluation of the proposed method and a detailed discussion of its implementation will be the subject of a forthcoming report. Gill and Murray reported in [20] that their modified Newton method [3] required between 62/201 and 88/392 iterations/evaluations, depending on a certain line search parameter. Thus their method uses on average more than three evaluations per step, because it tries the full Newton correction −H^{−1}g whenever H > 0 and the next iterate is required to satisfy a certain line search condition. It is interesting to note that a more accurate line search increased not only the total number of evaluations but also the number of iterations and thus matrix factorizations. The same effect occurred on the test function Chebyquad, so that it seems doubtful whether line searches are very useful if the Hessian is explicitly available. With the parameter setting described below the new method took ?? / ?? iterations/evaluations to minimize f from the starting point

x_0 = (1/51)(1, 2, . . . , 50)^T

used by Gill and Murray.
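As a point of reference, here is a minimal sketch of the generalized Rosenbrock function above and its gradient, assuming NumPy; the names are illustrative.

```python
import numpy as np

def rosenbrock_gen(x):
    """f(x) = sum_{i=2}^{n} [100 (x_i - x_{i-1}^2)^2 + (1 - x_i)^2];
    the minimizer is x = (1, ..., 1) with f = 0."""
    return np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1.0 - x[1:]) ** 2)

def rosenbrock_gen_grad(x):
    """Gradient assembled from the two contributions of each summand."""
    t = x[1:] - x[:-1] ** 2                    # residuals x_i - x_{i-1}^2
    g = np.zeros_like(x)
    g[1:] += 200.0 * t - 2.0 * (1.0 - x[1:])   # d/dx_i of summand i
    g[:-1] += -400.0 * x[:-1] * t              # d/dx_{i-1} of summand i
    return g

# The two starting points mentioned in the text:
x0_gm  = np.arange(1, 51) / 51.0               # (1/51)(1, 2, ..., 50)^T
x0_alt = np.tile([1.0, -1.0], 25)              # (1, -1, 1, -1, ..., 1, -1)^T
```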
In order to ensure that, in agreement with (2.6),

b_j ≡ ∑_{i=j}^{∞} max{0, 1 − τ̂_i} ≤ b_{j−1} ≤ b_0 ,

the bounds b_j and the update factors τ̂_j were chosen such that

b_{j+1} = b_j  and  τ̂_j ≥ 1   if τ_j ≥ 1/τ̂_j ,

and otherwise

τ̂_j ≡ max{ (1 + τ_j)/2 , 1/(1 + b_j) } ,   b_{j+1} = b_j − (1 − τ̂_j) ,      (6.1)

which implies that in any case b_{j+1} ≥ b_j^2/(1 + b_j) ≥ 0.
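A minimal sketch of this bookkeeping, under one reading of (6.1) in which the first branch is taken whenever τ_j ≥ 1 (so that τ̂_j = τ_j ≥ 1 is an admissible choice); the function name is illustrative.

```python
def budget_update(tau_j, b_j):
    """One step of the bound update (6.1).  Returns (tau_hat_j, b_{j+1}).

    When tau_hat_j >= 1 the term max(0, 1 - tau_hat_j) vanishes, so the
    budget b_j is unchanged; otherwise tau_hat_j is raised to at least
    1/(1 + b_j) and the shortfall 1 - tau_hat_j is charged to the budget.
    """
    if tau_j >= 1.0:
        return tau_j, b_j      # tau_hat_j = tau_j >= 1, so tau_j >= 1/tau_hat_j
    tau_hat = max(0.5 * (1.0 + tau_j), 1.0 / (1.0 + b_j))
    b_next = b_j - (1.0 - tau_hat)
    # Since tau_hat >= 1/(1 + b_j), we get b_next >= b_j**2 / (1 + b_j) >= 0.
    return tau_hat, b_next
```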
The initial bound b_0, the parameters p, δ̄, δ and τ̄ occurring in the description of algorithm A, and the bound ε̄ on the ratios ε_j as defined in (5.6) had, unless otherwise specified, the following values:

b_0 = 10. ,  p = 1. ,  δ̄ = .99 ,  δ = 1. ,  τ̄ = 1.005 ,  ε̄ = .1 .      (6.2)
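For reference, the defaults (6.2) in code form; the identifier names transliterate the paper's symbols and are purely illustrative, and the assertion checks the consistency condition τ̄ < (δ/δ̄)^{1/(1+p/2)} quoted later in this section.

```python
# Default parameter values (6.2).
DEFAULTS = dict(b0=10.0, p=1.0, delta_bar=0.99, delta=1.0,
                tau_bar=1.005, eps_bar=0.1)

# tau_bar must stay below (delta/delta_bar)**(1/(1 + p/2)), about 1.0067 here.
assert DEFAULTS["tau_bar"] < (DEFAULTS["delta"] / DEFAULTS["delta_bar"]) \
    ** (1.0 / (1.0 + DEFAULTS["p"] / 2.0))
```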
All minimization calculations were started from the initial point x_0 = (1, −1, 1, −1, . . . , 1, −1)^T and stopped when f ≥ 0 = f(1, 1, . . . , 1) had been reduced below 10^{−5}. Due to the fast local convergence even of inexact Newton methods, the form of the stopping criterion hardly affects the overall iteration and evaluation count. With the parameter setting (6.2) and Q_0 = I the inexact modified Newton method described above took 82 steps, of which only one was rejected because it failed to reduce f at all. All but 12 of the steps were full inexact Newton corrections, which suggests that on this particular test function the method succeeds in modifying the Newton step at the right iterates without being too conservative in general. The conjugate gradient routine PCGCMF was only required to reduce ‖∇ψη‖ by a factor of ε̄????_j (this inner stopping rule is sketched in code after Table 2), for which it performed on average less than 11 plane searches. Overall the minimization of f involved 828 matrix–vector products with H and Q, respectively. Since Q was restricted to a multiple of the identity, this is essentially equivalent to the computation of 17 exact Newton steps by conjugate gradients. This result is not very sensitive to the choice of ε̄, as we can see from the following table.

Table 2

ε̄ =             .5      .3      .1      .01     .001
Iterations      110     87      81      77      77
Evaluations     119     89      83      78      78
Newton steps    36      50      70      69      70
Mat.-Vect.Pr.   984     817     828     1006    1214
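A hedged sketch of that inner stopping rule, assuming SciPy is available; SciPy's CG performs ordinary line searches rather than the plane searches of PCGCMF, so this merely stands in for the actual routine.

```python
import numpy as np
from scipy.optimize import minimize

def inexact_step(g, H, eta, p=1.0, eps_bar=0.1):
    """Approximately minimize psi_eta(s) from s = 0, stopping once
    ||grad psi_eta(s)|| <= eps_bar * ||g||, i.e. once the gradient norm has
    been reduced by the factor eps_bar relative to its value at the origin."""
    psi = lambda s: g @ s + 0.5 * s @ (H @ s) \
        + eta / (2.0 + p) * np.linalg.norm(s) ** (2.0 + p)
    dpsi = lambda s: g + H @ s + eta * np.linalg.norm(s) ** p * s
    res = minimize(psi, np.zeros_like(g), jac=dpsi, method="CG",
                   options={"gtol": eps_bar * np.linalg.norm(g)})
    return res.x
```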
The optimal choice of ε̄ obviously depends on the relative cost of evaluating f and its derivatives compared to multiplying H by a given vector.

The numbers in Table 2 are almost unchanged by minor variations of δ̄, δ and τ̄ provided the condition τ̄ < (δ/δ̄)^{1/(1+p/2)} is satisfied. Unfortunately the performance of the algorithm is more sensitive to the choice of the essentially arbitrary bound b_0, as can be seen in the following table.
Table 3

b_0 =           .0      5.      10.     20.
Iterations      101     86      81      81
Evaluations     102     88      83      83
Mat.-Vect.Pr.   1108    911     828     828
Final ‖Q‖       38.2    33.4    4.31    3.69
The last two columns are almost identical because for b_0 ≥ 10. the updating of the Q_j over the 81 iterations leading to the minimizer is essentially unaffected by the condition τ̂_j ≥ 1/(1 + b_j) implicit in (6.1). If b_0 = 0, the matrices Q_j are never reduced, so that Q_j ≤ Q_{j+1} for all j ≥ 0, which results in only 4 and 2 full inexact Newton steps being taken at the very beginning and end of the iteration, respectively. This must naturally slow down the convergence, but the increase in the number of evaluations is only about 25% compared to the fairly optimal choice b_0 = 10. The increase in the number of matrix–vector products is about 30%, which means that PCGCMF completes the approximate minimization of the model functions ψη in roughly the same number of plane searches whether ψη is quadratic (i.e. η = 0) or not.

Finally we consider the effect of a wrong assumption regarding the Hölder continuity of f. Since the generalized Rosenbrock function has a nontrivial third derivative tensor at its minimizer, the correct choice is clearly p = 1. Contrary to what one might expect, the method works surprisingly well if p is set to other values.

Table 4

p =             .5      1.      2.      3.
Iterations      84      81      83      81
Evaluations     86      83      85      85
Newton steps    60      70      70      61
Mat.-Vect.Pr.   876     828     826     845
In all four cases the final steps were full inexact Newton steps, even though Q grows rapidly at the end if p = 2 or p = 3. This observation may not carry over to other test functions, since it is a peculiarity of calculations on the Rosenbrock function that all but the first and last few steps are of similar size. Therefore our method can choose, for any p, a cubic term |s|^{p+2} which provides over that particular distance a reasonable estimate of the discrepancy between f and its local quadratic model. As we have mentioned above, these results were presented not to establish the competitiveness of the proposed method but to illustrate its essential characteristics.
7 Summary and Conclusion
The modification of Newton's method for minimization developed in this paper is based on the view that the failure of the local quadratic model ϕ to provide successful steps at early iterates is due to the disruptive effect of cubic terms. Under the assumption that the Hessian is uniformly Hölder continuous it was shown in Section 2 that the cubic terms can be bounded by ellipsoidal norms such that all steps are eventually successful in reducing the objective function f. In Section 3 we examined the model function

ψη(s) = ϕ(s) + (η/(2 + p)) |s|^{2+p}
and found that it may have up to 2n + 1 stationary values, including always one global minimum and possibly a second local minimum. In Section 4 it was shown that by choosing steps s which minimize ψη for a suitable value of η ∈ [0, η̄] one obtains a modified Newton method with very satisfactory global and local convergence results. Since the computation of a minimizer of ψη can be quite expensive, we considered in Section 5 inexact implementations of our method. Finally, some encouraging numerical experiments were discussed in Section 6. The idea of bounding the cubic term seems rather natural and leads to strong theoretical results, including the possibility of invariance with respect to nonsingular affine transformations. It is hoped that the relation (2.5) on the update factors τ̂_j can be implicitly ensured by a condition on the individual τ̂_j rather than by the imposition of an a priori bound on their product. Despite the encouraging numerical results it remains to be seen for which class of problems the added conceptual and computational complexity is justified by a significant gain in efficiency. The approach can be extended to quasi-Newton methods, where it seems particularly attractive if the approximating Hessian is not guaranteed to be positive definite.
References

[1] J.M. Ortega and W.C. Rheinboldt (1970), Iterative Solution of Nonlinear Equations in Several Variables, Academic Press (New York).

[2] J.E. Dennis Jr. and J.J. Moré (1974), A characterization of superlinear convergence and its application to quasi-Newton methods, Math. of Comp., Vol. 28, pp. 549-560.

[3] P.E. Gill and W. Murray (1974), Newton type methods for unconstrained and linearly constrained optimization, Math. Programming, Vol. 7, pp. 311-350.

[4] K. Levenberg (1944), A method for the solution of certain nonlinear problems in least squares, Quart. Appl. Math., Vol. 5, pp. 105-136.

[5] D.W. Marquardt (1963), An algorithm for least squares estimation of nonlinear parameters, J. SIAM, Vol. 11, pp. 431-441.

[6] S.M. Goldfeld, R.E. Quandt and H.F. Trotter (1966), Maximization by quadratic hill-climbing, Econometrica, Vol. 34, pp. 541-551.

[7] M.J.D. Powell (1970), A new algorithm for unconstrained optimization, in Nonlinear Programming, eds. J.B. Rosen, O.L. Mangasarian and K. Ritter, Academic Press (New York).

[8] D. Goldfarb (1980), Curvilinear path steplength algorithms for minimization which use directions of negative curvature, Math. Programming, Vol. 18, pp. 31-40.

[9] J.E. Dennis and H.H.W. Mei (1979), Two new unconstrained optimization algorithms which use function and gradient values, J. Optim. Theory Appl., Vol. 28, pp. 453-462.

[10] D.C. Sorensen (1982), Trust region methods for unconstrained optimization, in Nonlinear Optimization, ed. M.J.D. Powell, Academic Press (New York).

[11] R. Fletcher (1980), Practical Methods of Optimization, Vol. 1: Unconstrained Optimization, John Wiley & Sons (Chichester).

[12] N.Z. Shor (1970), Utilization of the operation of space dilation in the minimization of convex functions, Kibernetika (Kiev), Vol. 6, pp. 6-12 (translated in Cybernetics, Vol. 6, pp. 7-15).

[13] A. Griewank and M.R. Osborne, Analysis of Newton's method at irregular singularities, SIAM J. Numer. Anal., Vol. 20, pp. 747-773.

[14] M.D. Hebden (1973), An algorithm for minimization using exact second derivatives, Report TP 515, A.E.R.E., Harwell.

[15] D.M. Gay (1981), Computing optimal locally constrained steps, SIAM J. Sci. Statist. Comput., Vol. 2, pp. 186-197.

[16] J.J. Moré and D.C. Sorensen (1981), Computing a trust region step, in preparation.

[17] R.S. Dembo, S.C. Eisenstat and T. Steihaug (1980), Inexact Newton methods, Report 47, School of Organization and Management, Yale University (to be published in SIAM J. Numer. Anal.).

[18] T. Steihaug (1980), Quasi-Newton methods for large scale nonlinear problems, Report 49, School of Organization and Management, Yale University (to be published in SIAM J. Numer. Anal.).

[19] A. Griewank, B.R. Markey and D.J. Evans (1979), Singularity-free static lattice energy minimization, J. Chem. Phys., Vol. 71, pp. 3449-3454.

[20] P.E. Gill and W. Murray (1979), Conjugate gradient methods for large scale nonlinear optimization, Technical Report SOL 79-15, Department of Operations Research, Stanford University.