Convergence conditions, line search algorithms and trust region ...

Optimization Methods and Software Vol. 20, No. 1, February 2005, 71–98

Convergence conditions, line search algorithms and trust region implementations for the Polak–Ribière conjugate gradient method L. GRIPPO* and S. LUCIDI Dipartimento di Informatica e Sistemistica, Università di Roma ‘La Sapienza’, Via Buonarroti 12, 00185 Roma, Italy (Received 8 August 2003; in final form 30 January 2004) This paper is dedicated to Professor Yury Evtushenko in honor of his 65th birthday We study globally convergent implementations of the Polak–Ribière (PR) conjugate gradient method for the unconstrained minimization of continuously differentiable functions. More specifically, first we state sufficient convergence conditions, which imply that limit points produced by the PR iteration are stationary points of the objective function and we prove that these conditions are satisfied, in particular, when the objective function has some generalized convexity property and exact line searches are performed. In the general case, we show that the convergence conditions can be enforced by means of various inexact line search schemes where, in addition to the usual acceptance criteria, further conditions are imposed on the stepsize. Then we define a new trust region implementation, which is compatible with the behavior of the PR method in the quadratic case, and may perform different linesearches in dependence of the norm of the search direction. In this framework, we show also that it is possible to define globally convergent modified PR iterations that permit exact linesearches at every iteration. Finally, we report the results of a numerical experimentation on a set of large problems. Keywords: Unconstrained optimization; Conjugate gradient method; Polak–Ribière method

1.

Introduction

We consider the problem minimize n x∈R

f (x),

(1)

where f : R n → R is a continuously differentiable function with gradient g: R n → R n . For the solution of problem (1), we consider conjugate gradient algorithms of the form xk+1 = xk + αk dk , with

dk =

−gk −gk + βk dk−1

for k = 0 for k ≥ 1,

*Corresponding author. Tel.: +39 06 48299233; Fax: +39 06 4782516; Email: [email protected]

Optimization Methods and Software ISSN 1055-6788 print/ISSN 1029-4937 online © 2005 Taylor & Francis Group Ltd http://www.tandf.co.uk/journals DOI: 10.1080/1055678042000208570

(2)

(3)

72

L. Grippo and S. Lucidi

where x0 is a given initial point, αk is a steplength along dk , gk := g(xk ), and βk is a suitable scalar. When f is a quadratic function with a definite positive Hessian matrix Q, if the stepsize αk is chosen as the exact one-dimensional minimizer along dk and the scalar βk is defined by βk =

gkT Qdk−1 T dk−1 Qdk−1

k ≥ 1,

(4)

the algorithms (2) and (3) is the well-known (linear) conjugate gradient method of Hestenes and Stiefel [1], which determines the minimizer of f in n iterations at most. The extensions to the general case [see, e.g., refs. 2–12] are based on the adoption of some (possibly inexact) line search technique for the computation of αk and make use of formulae for the evaluation of βk that do not contain explicitly the Hessian matrix of f . The best-known formulae for βk are the Fletcher and Reeves (FR) [4] formula gk 2 gk−1 2

(5)

gkT (gk − gk−1 ) , gk−1 2

(6)

βkFR = and the Polak–Ribière (PR) [9] formula βkPR =

where · denotes the Euclidean norm on R n . Numerical experience indicates that the PR formula is, in general, the most convenient choice. The convergence properties of the various conjugate gradient methods in the nonquadratic case have been the subject of many investigations. In particular, global convergence results for the FR method have been obtained by Zoutendijk [13] in case of exact line searches and by AlBaali [14] in connection with inexact line searches based on the strong Wolfe conditions. The global convergence of the PR method with exact line searches has been proved in ref. [9] under strong convexity assumptions on f . However, it was shown by Powell [15] that in the general (nonconvex) case the PR method, employing a line search technique that accepts the first local minimizer along dk , can cycle infinitely without converging towards a stationary point. An additional (but related) difficulty arises in the implementation of the PR method; in fact, as remarked in ref. [5], there is no known linesearch algorithm that can guarantee, in the general T case, both satisfaction of the Wolfe conditions at xk and the descent condition gk+1 dk+1 < 0 at the next step. These difficulties can be overcome by restricting βk to nonnegative values, that is, by letting βk = max{βkPR , 0}, as suggested by Powell [16]. In connection with this choice, in ref. [5], it has been proved that the property limk→∞ inf gk = 0 can be enforced either through exact linesearches or also using an implementable inexact algorithm that guarantees satisfaction of the strong Wolfe conditions and a strong descent condition of the form: gkT dk ≤ −δgk 2 ,

δ ∈ (0, 1).

In ref. [17], it has also been shown that this last requirement can be weakened and that only the descent condition gkT dk < 0 must be imposed. An alternative approach, which does not require restarting along the steepest descent direction and leaves unmodified the PR direction has been proposed in ref. [18], where line search rules have been defined that guarantee the property limk→∞ gk = 0. In essence, the technique proposed there is based on an Armijo-type linesearch method such that: (i) The initial tentative stepsize is suitably restricted.

Polak–Ribière conjugate gradient method

73

(ii) The acceptability condition on the stepsize imposes both a sufficient reduction of f satisfying a ‘parabolic bound’, by requiring that f (xk + αk dk ) ≤ f (xk ) − γ αk2 dk 2 , and a descent condition on the next search direction of the form T −δ2 gk+1 2 ≤ gk+1 dk+1 ≤ −δ1 gk+1 2 .

It has also been proved that the parameters appearing in the acceptance rules can be adaptively updated, so that, asymptotically, the first one-dimensional minimizer along the search direction can be eventually accepted when the algorithm is converging to a point where the Hessian matrix is positive definite. Globally convergent linesearch methods for nonlinear conjugate gradient methods have been also proposed in ref. [19], where it is shown that the condition limk→∞ gk = 0 can be enforced in the PR method, through an Armijo-type linesearch that also ensures satisfaction T of the condition gk+1 dk+1 ≤ −σ dk+1 2 for some σ ∈ (0, 1). In this article, we reconsider the convergence properties of the (unmodified) PR method and extend and improve the results of ref. [18] in several directions. In particular, first we identify sufficient convergence conditions that imply lim inf gk = 0 k→∞

and we show that this property depends, essentially, on the fact that in addition to some standard acceptance condition on the stepsize we can establish the limit lim xk+1 − xk = 0,

k→∞

which can be enforced through a linesearch technique. A theoretical consequence of this is that the PR method with exact linesearches has stronger convergence properties than it is usually believed. More specifically, if the first stationary point is chosen at each step, it can be shown that the PR method converges if f is hemivariate [20], that is, if it is not constant on any line segment. This in turn implies that the PR method with exact linesearches converges under generalized convexity assumptions on f , so that the strong convexity assumption of ref. [9] can be weakened. We also consider sufficient conditions that imply the stronger result gk → 0 and we prove that this property depends on the existence of an asymptotic bound on αk dk 2 , which can also be enforced through a linesearch. Starting from these results, we can define various inexact linesearch algorithms that are simpler and less demanding of those defined in ref. [18]. However, these techniques do not guarantee, in principle, that in the quadratic case the resulting algorithm can be identified right at the start with the linear conjugate gradient method, unless the parameters are appropriately chosen in relation to the optimal solution. In order to overcome this difficulty, we also introduce a different model, based on a ‘trust region’ approach, in which a linesearch algorithm (compatible with Wolfe conditions and even with exact linesearches) is employed whenever the norm of dk does not exceed an adaptive bound, defined on the basis of the behavior of the method in the quadratic case. When this bound is violated, we can adopt any of the convergent Armijo-type linesearch techniques defined in this article. To the authors’ knowledge, this trust region version of the PR method (with the additional requirement that the initial tentative stepsize is chosen using quadratic interpolation) is the only globally convergent algorithm proposed so far that employs the unmodified PR iteration and reduces to the linear conjugate

74


gradient method in the quadratic case. On the basis of these results, we can also define a modified globally convergent PR iteration that consists of distinguishing (when required) the stepsize used for defining βk+1 from that used for computing xk+1 , and of rescaling βk+1 , each time that an adaptive bound on dk is violated. This strategy is again compatible with the behavior of the linear conjugate gradient method in the quadratic case and admits the possibility of performing exact linesearches at each step. The article is organized as follows. In section 2, we describe our notation and we state the basic assumptions employed in the sequel. In section 3, we establish sufficient convergence conditions and we derive convergence results for the PR method under generalized convexity assumptions. In section 4, we define various basic inexact linesearch schemes that ensure global convergence and include the algorithms proposed in ref. [18] as special cases. In section 5, we define our trust region implementation of the PR method and in section 6, we define a globally convergent algorithm based on a modified PR iteration. In section 7, we report the numerical results obtained for a set of large problems. Some concluding remarks are given in section 8.

2.

Notation and assumptions

Given a point xk produced by an iteration of the form (2) we set fk := f (xk ) and gk := g(xk ). We indicate by {xk } the sequence of points generated by an algorithm. A subsequence of {xk } will be denoted by {xk }K , where K is an infinite index set. We call forcing function [20] a function σ : R + → R + such that for every sequence {tk } such that σ (tk ) → 0 we have tk → 0. Therefore, a constant function σ (t) ≡ const for all t ∈ R + will be a (special) forcing function that satisfies vacuously the preceding condition. We also note that if σa and σb are two given forcing functions, then the function σ (t) = min{σa (t), σb (t)} is another forcing function. Finally, we denote by · the Euclidean norm on R n . We suppose that the following assumption is satisfied. ASSUMPTION 1 (i) The level set L = {x ∈ R n : f (x) ≤ f (x0 )} is compact. (ii) For every given r > 0, there exists L > 0 such that g(x) − g(y) ≤ Lx − y,

(7)

for all x, y ∈ Br , where Br := {x ∈ R n : x < r}. In the sequel, we will (tacitly) assume that the radius r is sufficiently large to have that all the points of interest remain in Br , so that we can suppose that equation (7) is valid for any given pair (x, y). By Assumption 1, there exists a number ≥ 1 such that g(x) ≤ ,

for all x ∈ L.

(8)

We consider conjugate gradient methods defined by the iterations (2) and (3), where βk is a scalar that satisfies |βk | ≤ C|βkPR |, with C > 0.

(9)


3.

75

Convergence conditions for the PR method

In this section, we consider sufficient conditions for the global convergence of the PR method. In particular, in the next theorem we state conditions that ensure the existence of, at least, one limit point of the sequence {xk }, which is a stationary point of f . In essence, we require that some usual condition on the linesearch is satisfied and moreover that the distance xk+1 − xk goes to zero if the sequence {gk } is bounded away from zero. THEOREM 1 Let {xk } be the sequence generated by Algorithms (2) and (3) and assume that equation (9) holds. Let σi : R + → R + , for i = 0, 1, 2, be forcing functions and suppose that the stepsize αk is computed in a way that the following conditions are satisfied: (c1 ) xk ∈ L for all k; (c2 ) limk→∞ [σ0 (gk ) σ1 (|gkT dk |)]/dk = 0. (c3 ) limk→∞ σ0 (gk ) σ2 (αk dk ) = 0. Then we have: lim inf gk = 0 k→∞

and, hence, there exists a limit point of {xk }, which is a stationary point of f . Proof By (c1 ) we have that xk ∈ L for all k; as L is compact, the sequence {xk } is bounded and admits limit points in L. Reasoning by contradiction, suppose there exists a number ε > 0 such that gk ≥ ε, for all k. (10) By definition of forcing function, this implies that there exists a number δ > 0 such that σ0 (gk ) ≥ δ,

for all k.

(11)

Then condition (c3 ) implies that lim αk dk = lim xk+1 − xk = 0.

k→∞

k→∞

(12)

Now, by (c1 ), equation (10), and the assumptions made, we can write for all k: dk ≤ gk + C|βkPR |dk−1 ≤ gk + C

(13)

gk gk − gk−1 Lαk−1 dk−1 dk−1 ≤ + C dk−1 . gk−1 2 ε2

Recalling equation (12) and letting q ∈ (0, 1), we have that there exists a sufficiently large index k1 such that: CL αk−1 dk−1 ≤ q < 1, ε2

for all k ≥ k1 ,

(14)

so that dk ≤ + qdk−1 ,

for all k ≥ k1 .

(15)

From the preceding inequality [see, e.g., ref. 21, Lemma 1, p. 44], we obtain immediately: dk ≤ + dk1 − q k−k1 , for all k ≥ k1 , (16) 1−q 1−q

76


and this proves that dk is bounded for all k. Therefore, from equation (12) we get: lim αk dk 2 = 0;

k→∞

(17)

moreover, as dk is bounded, by condition (c2 ) and equation (11) we obtain: lim |gkT dk | = 0.

k→∞

(18)

Recalling equation (9) and the PR formula, we can write: gk 2 ≤ |gkT dk | +

gk 2 CLαk−1 dk−1 2 , gk−1 2

(19)

so that by equations (10), (17), and (18) and the compactness of L, taking limits we have: lim gk = 0,

k→∞

which contradicts equation (10). This proves our assertion.

We note that conditions (c1 ) and (c2 ) of the preceding theorem can be satisfied through any standard line search technique that can guarantee a ‘sufficient decrease’ of f and a ‘sufficient displacement’from the current point along dk . Additional restrictions on αk must be introduced for ensuring that the search directions are descent directions and for imposing satisfaction of (c3 ). The acceptability conditions on the stepsize that can be used in the general case for enforcing all these requirements will be discussed later. Here, we show that an immediate consequence of Theorem 1 is that the convergence of the PR method with exact linesearches can be established under weaker convexity conditions than those used in ref. [9], where strong convexity of f was assumed. First, we recall from refs. [20,22] the following definitions. DEFINITION 1 A function f is hemivariate on a set D ⊆ R n if it is not a constant on any line segment on D, that is, if there exist no distinct points x, y ∈ D such that (1 − t)x + ty ∈ D and f ((1 − t)x + ty) = f (x) for all t ∈ [0, 1]. DEFINITION 2 [22] A function f is strongly quasiconvex on a convex set C ⊆ R n if, for all x, y ∈ C, with x = y we have: max{f (x), f (y)} > f ((1 − t)x + ty), for all t ∈ (0, 1). DEFINITION 3 A function f is strictly pseudoconvex on a convex set C ⊆ R n if, for all x, y ∈ C, with x = y we have that ∇f (x)T (y − x) ≥ 0 implies f (y) > f (x). Now we prove that convergence is achieved if the objective function is hemivariate on L and we choose αk as the first stationary point along dk . THEOREM 2 Suppose that f is hemivariate on the level set L. Let {xk } be the sequence generated by Algorithms (2) and (3), where βk satisfies equation (9) and αk is the first stationary point of f (xk + αdk ) along dk , that is, αk the smallest nonnegative number such that g(xk + αk dk )T dk = 0. Then we have: lim inf gk = 0 k→∞

and, hence, there exists a limit point of {xk }, which is a stationary point of f .

(20)


77

Proof We prove that the conditions of Theorem 1 are satisfied, provided that we choose σ0 (t) ≡ 1, σ1 (t) ≡ t, and σ2 (t) ≡ t. Recalling equation (3) and using equation (20), it is easily seen, by induction, that dk is a descent direction so that αk > 0 and f (xk ) > f (xk + αdk ) ≥ f (xk+1 ),

for all α ∈ (0, αk ],

(21)

whence it follows that xk ∈ L for all k so that condition (c1 ) holds. By equation (21), recalling the compactness assumption on L and using result 14.1.3 of ref. [20], we have that: lim xk+1 − xk = 0,

k→∞

(22)

which implies that condition (c3 ) of Theorem 1 is satisfied. Finally, by the orthogonality condition (20) and the Lipschitz continuity assumption we can write: T dk |gkT dk | gk+1 (gk+1 − gk )T dk = − ≤ Lxk+1 − xk , dk dk dk and, hence, (c2 ) follows from equation (22); this completes the proof.

We note that if we choose αk as the global minimizer of f along dk , condition (21) may not be satisfied for a general hemivariate function; however, if we assume that f is also quasiconvex we have that equation (21) is satisfied even when αk is the global minimizer, but not necessarily the first stationary point. As quasiconvex functions are hemivariate if and only if they are strongly quasiconvex [20], we get the following result. THEOREM 3 Suppose that f is strongly quasiconvex on a convex set C ⊇ L. Let {xk } be the sequence generated by Algorithms (2) and (3), where βk satisfies equation (9) and αk is the global minimizer of f (xk + αdk ) along dk . Then we have: lim inf gk = 0 k→∞

and, hence, there exists a limit point of {xk }, which is a stationary point of f . Proof Noting that equation (21) is satisfied and that strongly quasiconvex functions are hemivariate, we can repeat the proof of Theorem 2 above. If we introduce the stronger assumption that f is strictly pseudoconvex, we obtain a convergence result for the PR method with exact linesearches, which can be viewed as an extension of the result given in ref. [9] for strongly convex functions. THEOREM 4 Suppose that f is strictly pseudoconvex on a convex set C ⊇ L. Let {xk } be the sequence generated by Algorithms (2) and (3), where βk satisfies equation (9) and αk is the global minimizer of f (xk + αdk ) along dk . Then the sequence {xk } converges to the global minimizer of f . Proof As f is strictly pseudoconvex on C ⊇ L, we have that f is also strongly quasiconvex on C. Therefore, by Theorem 3, there exists a subsequence converging to a stationary point x ∗ , which is the only global minimum point of f on R n , because of the strict pseudoconvexity assumption. On the other hand, as f (xk+1 ) ≤ f (xk ), the sequence {f (xk )} converges to the minimum value f (x ∗ ) and, hence, there cannot exist limit points that are distinct from x ∗ ; this implies that {xk } converges to x ∗ .

78


In the general (nonconvex) case, in order to establish the result that every limit point of the sequence {xk } is a stationary point of f, in addition to the requirements considered in Theorem 1, we must impose a condition that bounds the growth of dk . Before stating this result we recall from ref. [18] the following lemma. LEMMA 1 Let {xk } be a sequence of points in L, let ε > 0, and let {xk }K be a subsequence such that gk−1 ≥ ε,

for all k ∈ K,

(23)

and lim

k∈K, k→∞

xk − xk−1 = 0.

(24)

Then, there exist a number ε∗ > 0 and a subsequence {xk }K ∗ , with K ∗ ⊆ K, such that: gk ≥ ε∗ ,

for all k ∈ K ∗ .

(25)

We can state the following theorem. THEOREM 5 Let {xk } be the sequence generated by Algorithms (2) and (3) and assume that equation (9) holds. Let σi : R + → R + , for i = 0, 1, 2, be forcing functions and suppose that the stepsize αk is computed in a way that the following conditions are satisfied: (c1 ) xk ∈ L for all k; (c2 ) limk→∞ σ0 (gk ) [σ1 (|gkT dk |)]/dk = 0. (c3 ) limk→∞ σ0 (gk ) σ2 (αk dk ) = 0. (c4 ) lim supk→∞ σ0 (gk ) αk dk 2 < ∞. Then, we have limk→∞ gk = 0 and, hence, every limit point of {xk } is a stationary point of f . Proof As xk ∈ L for all k, the sequence {xk } is bounded and, hence, by the continuity assumption on g, we must only prove that lim g(xk ) = 0.

k→∞

Reasoning by contradiction, suppose there exist a subsequence {xk }K1 and a number ε1 > 0 such that gk−2 ≥ ε1 ,

for all k ∈ K1 .

By definition of forcing function, this implies that there exists a number δ1 > 0 such that σ0 (gk−2 ) ≥ δ1 ,

for all k ∈ K1 .

(26)

Then, by (c3 ) we have: lim

k∈K1 , k→∞

xk−1 − xk−2 = 0,

and, hence, by Lemma 1 we can assert that there exist a subsequence {xk }K2 , with K2 ⊆ K1 and a number ε2 > 0 such that gk−1 ≥ ε2 ,

for all k ∈ K2 .

(27)


79

Thus we can write, for some δ2 > 0: σ0 (gk−1 ) ≥ δ2 ,

for all k ∈ K2 ,

(28)

and, therefore, using again condition (c3 ), we get: lim

k∈K2 , k→∞

xk − xk−1 = 0.

(29)

Starting from equation (27) and reasoning as above, we can find a further subsequence {xk }K3 , with K3 ⊆ K2 and numbers ε3 > 0 and δ3 > 0 such that gk ≥ ε3 ,

for all k ∈ K3 ,

(30)

σ0 (gk ) ≥ δ3 ,

for all k ∈ K3 .

(31)

Now, by condition (c4 ) and equations (26) and (28) we can find a number M > 0 such that M , δ2

αk−1 dk−1 2 ≤

αk−2 dk−2 2 ≤

M δ1

for all k ∈ K3 ,

(32)

and, hence, recalling our assumptions and the inequalities established above, we obtain: gk−1 gk−1 − gk−2 LCM dk−2 ≤ 1 + 2 dk−1 ≤ gk−1 + C gk−2 2 ε1 δ1 and dk ≤ gk + C

gk gk − gk−1 LCM , d ≤ 1 + k−1 gk−1 2 ε22 δ2

(33)

(34)

so that both dk−1 and dk are bounded for all k ∈ K3 . Therefore, from equation (29) we get: lim

k∈K3 , k→∞

αk−1 dk−1 2 = 0;

(35)

moreover, by condition (c2 ), equation (31), and the boundedness of {dk }, we obtain: lim

k∈K3 , k→∞

|gkT dk | = 0,

(36)

and, hence, we can write: gk 2 ≤ |gkT dk | +

gk 2 LCxk − xk−1 dk−1 , gk−1 2

(37)

so that by equations (27), (35), and (36), taking limits for k ∈ K3 , we have: lim

k∈K3 , k→∞

gk = 0,

which contradicts equation (30). This proves our assertion.

80

4.


Convergent linesearch algorithms

In this section, we suppose that βk is the PR parameter and we refer to the following scheme. ALGORITHM POLAK–RIBIÈRE (PR) Data. x0 ∈ R n . Initialization. Set d0 = −g0 and k = 0. While gk = 0 do Compute αk using a linesearch procedure and set xk+1 = xk + αk dk ; (gk+1 − gk )T gk+1 Compute gk+1 and βk+1 = ; gk 2 Set dk+1 = −gk+1 + βk+1 dk and k = k + 1. End while The convergence conditions considered in the preceding section can be enforced by means of an appropriate definition of the linesearch algorithm for the computation of αk . We consider first conditions (c1 ) and (c2 ) in Theorems 1 and 5. As remarked earlier, these conditions can be satisfied by introducing standard acceptability criteria for the stepsize. In particular, there are several well-known rules (such as Armijo’s method, Goldstein conditions, Wolfe conditions) [23], and also derivative-free rules [24,25] that ensure ‘sufficiently large’ values of the steplength and guarantee a sufficient reduction of the objective function values. Because of the structure of the PR iteration, we also have that a sufficiently large step is obtained each time that at some tentative point z(α) = xk + αdk the number g(z(α))T (g(z(α)) − gk ) d g(z(α))T −g(z(α)) + k gk 2 is significantly different from −g(z(α))2 . Taking this into account, in the next Lemma we collect most of the known conditions that yield useful lower bounds on the stepsize. For later purposes, in order to avoid the repetition of similar arguments, we enclose also conditions that are not strictly related to the algorithms introduced in this section. We refer, in particular, to a tentative point xk + ηk dk with ηk > 0, which can be different, in general, from the accepted point xk+1 = xk + αk dk ,

(38)

and we state conditions that imply a lower bound on ηk . The relationships between αk and ηk will be made more explicit in correspondence to the specific algorithms considered in the sequel. LEMMA 2 Let {xk } be the sequence defined by equation (38) and assume that gkT dk < 0 for all k. Suppose also that f (xk + αk dk ) ≤ f (xk + ηk dk ) < f (xk ),

for all k

(39)

and assume that there exist index sets K1 and K2 ( possibly empty) such that: (1) For all k ∈ K1 , at least one of the following conditions (a1 ), (a2 ), (a3 ), (a4 ) is satisfied (a1 ) ηk ≥ ρ min{φ, |gkT dk |}/dk 2 , where ρ > 0 and ∞ ≥ φ > 0; (a2 ) f (xk + ληk dk ) ≥ f (xk + ηk dk ), where λ > 1;


81

(a3 ) f (xk + ληk dk ) ≥ f (xk ) + γ˜1 ληk gkT dk − γ˜2 (ληk )2 dk 2 , where λ ≥ 1, 1 > γ˜1 ≥ 0, γ˜2 ≥ 0, and γ˜1 + γ˜2 > 0; (a4 ) g(xk + ηk dk )T dk ≥ µgkT dk , where 1 > µ > 0. (2) For all k ∈ K2 , at least one of the following conditions (a5 ), (a6 ) is satisfied: g(yk )T (g(yk ) − gk ) ≥ −δ1 g(yk )2 , d (a5 ) g(yk )T −g(yk ) + k gk 2 g(yk )T (g(yk ) − gk ) T (a6 ) g(yk ) −g(yk ) + dk ≤ −δ2 g(yk )2 , gk 2 where yk = xk + νηk dk , g(yk ) = 0, ν > 0, and δ2 > 1 > δ1 ≥ 0. Then: (i) We have xk ∈ L for all k; (ii) There exists ρ ∗ > 0 and ∞ ≥ φ > 0 such that ηk ≥ ρ ∗ (min{φ, |gkT dk |}/dk 2 ), for all k ∈ K1 ; (iii) There exists τ ∗ > 0 such that ηk ≥ τ ∗ (gk 2 /dk 2 ), for all k ∈ K2 . Proof Assertion (i) follows from equation (39). Now suppose that k ∈ K1 ; then assertion (ii) is obviously true if (a1 ) holds; therefore, let us assume first that (a2 ) is verified, that is: f (xk + ληk dk ) ≥ f (xk + ηk dk ),

(40)

for some λ > 1. As xk and xk + ηk dk are both in L, we can assume that also xk + ληk dk belongs to the ball Br introduced in Assumption 1. Using the Theorem of the Mean, we can write f (xk + ληk dk ) = f (xk + ηk dk ) + (λ − 1)ηk g(zk )T dk − (λ − 1)ηk gkT dk + (λ − 1)ηk gkT dk , where zk := xk + ξk (λ − 1)ηk dk , for some ξk ∈ (0, 1). Then, substituting the above expression into equation (40), dividing both members by (λ − 1)ηk > 0, and rearranging, we obtain (g(zk ) − gk )T dk ≥ −gkT dk , whence it follows, using the Lipschitz continuity assumption on g: ηk ≥

|gkT dk | 1 . (λ − 1)L dk 2

(41)

Now assume that (a3 ) is satisfied, that is: f (xk + ληk dk ) ≥ f (xk ) + γ˜1 ληk gkT dk − γ˜2 (ληk )2 dk 2 ,

(42)

for some λ ≥ 1. Using again the Theorem of the Mean, we can write: f (xk + ληk dk ) = f (xk ) + ληk g(wk )T dk − ληk gkT dk + ληk gkT dk , where wk := xk + ζk ηk λdk , for some ζk ∈ (0, 1). By substituting this expression into equation (42), dividing both members by ληk > 0, and rearranging, we obtain: γ˜2 ληk dk 2 + (g(wk ) − gk )T dk ≥ (γ˜1 − 1)gkT dk , whence, we get: ηk ≥

(1 − γ˜1 ) |gkT dk | . λ(L + γ˜2 ) dk 2

(43)

82


Next assume that (a4 ) holds. In this case, we can write: g(xk + ηk dk )T dk ≥ µgkT dk + gkT dk − gkT dk , whence, it follows: (g(xk + ηk dk ) − gk )T dk ≥ (1 − µ)|gkT dk |, which implies ηk ≥

(1 − µ) |gkT dk | . L dk 2

(44)

Using (a1 ) and equations (41), (43), and (44), it can be concluded that if at least one of the conditions (a1 ), (a2 ), (a3 ), (a4 ) is satisfied there must exist a number 1 (1 − γ˜1 ) (1 − µ) , , >0 ρ ∗ = min ρ, (λ − 1)L λ(L + γ˜2 ) L such that assertion (ii) is valid. Consider now the case k ∈ K2 . Letting yk = xk + νηk , we can suppose that yk belongs to Br . Assume first that (a5 ) holds; in this case we can write: −(1 − δ1 )g(yk )2 + g(yk )2

g(yk ) − gk dk ≥ 0; gk 2

similarly, if (a6 ) is satisfied we have −(δ2 − 1)g(yk )2 + g(yk )2

g(yk ) − gk dk ≥ 0, gk 2

whence, recalling the Lipschitz continuity of g, dividing both members of each inequality by g(yk )2 and rearranging, we get (iii) with (1 − δ1 ) (δ2 − 1) ∗ , . τ = min νL νL Now, let us consider condition (c3 ) appearing in Theorem 1 and in Theorem 5. We can observe that the key point is that of ensuring that the steplength xk+1 − xk is driven to zero through the linesearch, at least when the gradient norm is bounded away from zero. This can be obtained in two different ways: (a) By imposing a suitable (adaptive) upper bound on αk . (b) By employing a ‘parabolic’ acceptance rule on the objective function values that also forces the distance xk+1 − xk to zero. T In both cases, the descent condition gk+1 dk+1 < 0 must be imposed during the linesearch and sufficiently large values for the stepsizes must be guaranteed using some of the conditions considered in the preceding lemma. T An upper bound on αk and, possibly, additional restrictions on gk+1 dk+1 are required in order to satisfy condition (c4 ) of Theorem 5. Some of the simplest possibilities for satisfying all these requirements are combined into the single procedure described below, where we admit infinite values for some parameter, with an obvious meaning. In order to simplify our analysis, we refer to a conceptual Armijo-type model; however, alternative schemes (possibly more convenient from a computational point of view) can be adopted, taking into account the results of Lemma 2.


83

ALGORITHM LSA (modified Armijo linesearch) Data. ∞ ≥ ρ2 > ρ1 > 0, ∞ ≥ φ > 0, 1 > γ1 ≥ 0, γ2 ≥ 0, γ1 + γ2 > 0, 1 > θ > 0, ∞ ≥ δ2 > 1 > δ1 ≥ 0. Step 1. τk = min{φ, | gkT dk |}/ dk 2 and choose k ∈ [ρ1 τk , ρ2 τk ]. Step 2. Compute αk = max{θ j k , j = 0, 1, . . .} such that the vectors xk+1 = xk + αk dk and dk+1 = −gk+1 + βk+1 dk satisfy the conditions: (i) fk+1 ≤ fk + γ1 αk gkT dk − γ2 αk2 dk 2 ; T (ii) −δ2 gk+1 2 ≤ gk+1 dk+1 < −δ1 gk+1 2

(if gk+1 = 0).

We prove first that the preceding algorithm is well defined. PROPOSITION 1 Suppose that gk = 0 for all k. Then for every k, there exists a finite value jk of j such that the stepsize α k = θ jk k computed by Algorithm LSA satisfies conditions (i) and (ii) at Step 2. Proof We start by proving that if gkT dk < 0, then the number α = k θ j satisfies conditions (i) and (ii) for all sufficiently large j . By contradiction, assume first that there exists an infinite set J of j -values such that condition (i) is violated, so that for every j ∈ J we have: f (y (j ) ) − fk > γ1 gkT dk − γ2 k θ j dk 2 , θ j k where y (j ) := xk + θ j k dk . Then, taking limits for j ∈ J , j → ∞, we have gkT dk ≥ γ1 gkT dk , which contradicts the assumptions gkT dk < 0 and γ1 < 1. Suppose now that there exists an infinite set, say it again J , such that for j ∈ J condition (ii) is violated and we have g(y (j ) )T (g(y (j ) ) − gk ) (j ) 2 g(y (j ) )T −g(y (j ) ) + d k ≥ −δ1 g(y ) . gk 2 Then, taking limits for j ∈ J , j → ∞, we obtain that the inequality (1 − δ1 )gk 2 ≤ 0 must be valid, but this contradicts the assumptions that gk = 0 and 1 > δ1 . Finally, suppose that δ2 < ∞ and that for all j ∈ J we have: g(y (j ) )T (g(y (j ) ) − gk ) (j ) T (j ) g(y ) −g(y ) + dk < −δ2 g(y (j ) )2 . gk 2 In this case, taking limits for j ∈ J , j → ∞, we obtain that the inequality (δ2 − 1)gk 2 ≤ 0 and this contradicts the assumptions gk = 0 and δ2 > 1. Under the assumption gkT dk < 0, we can conclude that Step 2 is well defined, by taking jk as the largest index for which both conditions (i) and (ii) are satisfied, and letting αk = θ jk k . Then the proof can be completed, by induction, noting that as d0T g0 < 0, we will have T gk dk < 0 for all k. The convergence of Algorithm PR is proved in the following theorem, in correspondence to some admissible choices of the parameters. THEOREM 6 Suppose that the stepsize is computed by means of Algorithm LSA, with the conditions stated on the parameters, and let xk , for k = 0, 1, . . . be the points generated by

84


Algorithm PR; then either there exists an index ν such that g(xν ) = 0 and the algorithm terminates, or it produces an infinite sequence with the following properties, depending on the choice of the parameters. (a) Assume that ρ2 < ∞ if γ2 = 0,

(45)

then we have lim inf k→∞ gk = 0 and hence there exists a limit point of {xk }, which is a stationary point of f . (b) Assume that: ρ2 < ∞ and either φ < ∞ or δ2 < ∞,

(46)

then we have limk→∞ gk = 0 and every limit point of {xk } is a stationary point of f . Proof If the algorithm does not terminate at a stationary point, it will produce an infinite sequence {xk } such that gkT dk < 0 for all k. We prove first the conditions of Theorem 1 are satisfied if the stepsize is computed through Algorithm LSA and equation (45) is satisfied. As gkT dk < 0, condition (i) at Step 2 implies fk − fk+1 ≥ γ1 αk |gkT dk | + γ2 αk2 dk 2 ,

(47)

so that xk ∈ L for all k and hence (c1 ) of Theorem 1 is valid. The instructions at Steps 1 and 2, and the assumption (45) imply also that: αk dk 2 ≤ k dk 2 ≤ ρ2 min{φ, |gkT dk |} ≤ ρ2 |gkT dk |,

where

ρ2 < ∞

if

γ2 = 0, (48)

so that by equations (47) and (48) we obtain: fk − fk+1 ≥

γ1 ρ2

+ γ2 αk2 dk 2

γ2 αk2 dk 2

if ρ2 < ∞, γ2 ≥ 0 if ρ2 = ∞, γ2 > 0.

(49)

As {fk } is decreasing and bounded below, it admits a limit, and hence by equation (49) we have that lim αk dk = 0.

k→∞

(50)

and thus condition (c3 ) of Theorem 1 holds with σ2 (t) ≡ t and σ0 (t) ≡ 1 for all t. In order to establish (c2 ) we shall make use of Lemma 2, where we assume ηk = αk . First, we observe that condition (i) implies that f (xk+1 ) < f (xk ). Then we can distinguish the two cases: αk = k and αk < k , where k is the number defined at Step 2. In the first case, we have obviously αk dk 2 ≥ ρ1 min{φ, |gkT dk |}

(51)

and thus condition (a1 ) of Lemma 2 holds with ρ = ρ1 . If αk < k , this implies that αk /θ violates one of the conditions of Step 2. If αk /θ violates condition (i) we have that (a3 ) of Lemma 2 holds with λ = 1/θ , γ˜1 = γ1 , and γ˜2 = γ2 ; on the other hand, if αk /θ violates (ii), then at least one of the conditions (a5 ) or (a6 ) of Lemma 2


85

holds with ν = 1/θ . Thus, by Lemma 2 we have either that αk ≥ ρ ∗

min{φ, |gkT dk |} dk 2

for some ρ ∗ > 0 and ∞ ≥ φ > 0, or that αk ≥ τ ∗

gk 2 dk 2

for some τ ∗ > 0. Using equation (50), we have that (c2 ) of Theorem 1 holds with σ1 (t) ≡ min{1, ρ ∗ φ, ρ ∗ t}

σ0 (t) ≡ min{1, τ ∗ t 2 }.

It can be concluded that the conditions of Theorem 1 are satisfied and thus, assertion (a) follows from Theorem 1. In order to complete the proof we must show that in case (b) also condition (c4 ) of Theorem 5 is valid. On the other hand, recalling equation (46) and the instructions at Steps 1 and 2, we have αk dk 2 ≤ k dk 2 ≤ ρ2 min{φ, |gkT dk |} ≤ ρ2 min{φ, δ2 gk 2 } ≤ ρ2 min{φ, δ2 2 }, where is the bound for gk on L. By equation (46) we have that either φ < ∞ or δ2 < ∞, so that condition (c4 ) is satisfied for σ0 (t) ≡ 1. Then, recalling the proof of assertion (a), we can conclude that there exist forcing functions such that all conditions of Theorem 5 are satisfied and this establishes (b). The preceeding results show that there exist, in principle, various globally convergent implementations of the unmodified PR method, which are simpler and less demanding of those defined in ref. [18]. A first simple model can be obtained by specializing Algorithm LSA into the following restricted Armijo-type linesearch algorithm. ALGORITHM LSA1 (restricted Armijo) Data. ∞ > ρ2 > ρ1 > 0, 1 > γ > 0, 1 > θ > 0. Step 1. Set τk = | gkT dk |/dk 2 and choose k ∈ [ρ1 τk , ρ2 τk ]. Step 2. Compute αk = max{θ j k , j = 0, 1, . . .} such that the vectors xk+1 = xk + αk dk and dk+1 = −gk+1 + βk+1 dk satisfy: (i) fk+1 ≤ fk + γ αk gkT dk ; T (ii) gk+1 dk+1 < 0 (if gk+1 = 0). It is easily seen that Algorithm LSA1 can be viewed as a special case of Algorithm LSA, where we choose γ1 = γ ,

γ2 = 0,

δ1 = 0,

δ2 = ∞.

Then it follows from Theorem 6 that if Algorithm LSA1 is employed and the PR algorithm does not terminate, we have lim inf k→∞ gk = 0. If we want to impose the stronger property

86


gk → 0, we can replace condition (ii) at Step 2 with the stronger condition T −δgk+1 2 ≤ gk+1 dk+1 < 0,

with δ < ∞,

(52)

which essentially imposes a bound on the component of dk+1 parallel to gk+1 . Alternatively, we can keep unchanged condition (ii) at Step 2 and then set at Step 1: τk =

min{φ, |gkT dk |} , dk 2

with ∞ > φ > 0.

A different model can be defined by replacing condition (i) of Step 2 with a ‘parabolic’ acceptance rule on the objective function values. In this case, we get the following algorithm that can be viewed as a simplified version of that introduced in ref. [18]. ALGORITHM LSA2 (parabolic search) Data. ρ > 0, γ > 0, 1 > θ > 0. Step 1. Set τk = | gkT dk |/dk 2 and choose k ≥ ρτk . Step 2. Compute αk = max{θ j k , j = 0, 1, . . .} such that the vectors xk+1 = xk + αk dk and dk+1 = −gk+1 + βk+1 dk satisfy the conditions: (i) fk+1 ≤ fk − γ αk2 dk 2 ; T (ii) gk+1 dk+1 < 0

(if gk+1 = 0).

Algorithm LSA2 can be obtained from Algorithm LSA by choosing ρ2 = ∞,

ρ1 = ρ,

γ2 = γ ,

γ1 = 0,

δ1 = 0,

δ2 = ∞.

By Theorem 6, we have that if an infinite sequence is generated using Algorithm LSA2 within the PR algorithm, there holds the limit lim inf gk = 0. k→∞

In comparison with Algorithm LSA1, we can note that an upper bound on the initial stepsize is no more needed. However, if the stronger property lim gk = 0

k→∞

has to be enforced, we must introduce the same modifications described above in connection with Algorithm LSA1. We note that the interpolation phase in the two preceding algorithms can be conveniently performed using a safeguarded cubic interpolation, as both the function and the gradient must be evaluated at each tentative step. This would be compatible with the conditions of Lemma 2, provided that the reduction factor θk for the stepsize is uniformly bounded, that is if we impose the safeguards 0 < θl ≤ θk ≤ θu < 1, where θl and θu are given constant bounds. Another simple modification could be that of replacing the Armijo-type acceptability criterion with Goldstein-type conditions; in this case the lower bound on the initial stepsize k can be eliminated. The specific implementation used in computations will be illustrated in the sequel in more detail. Alternative globally convergent line search techniques that ensure the property limk→∞ gk = 0 can also be derived from the results given in ref. [19]. In fact, it is shown


87

there that an Armijo-type algorithm, starting from a unit stepsize, can produce a sufficiently small value of αk that satisfies the conditions: fk+1 ≤ fk + γ αk gkT dk , T dk+1 ≤ −σ dk+1 2 , gk+1

and it is proved that these conditions imply that lim gk = 0.

k→∞

An inherent limitation of the algorithms considered in this section (at least from a theoretical point of view) is that we cannot guarantee that accurate line searches can be performed, since the acceptance rules used in these algorithms may have the effect of rejecting a ‘good’ stepsize (satisfying, for instance, Wolfe conditions), even when this is not strictly required for enforcing global convergence. In order to overcome this difficulty, in ref. [18] an adaptive choice of the parameters was introduced and it was shown that the algorithms defined there eventually accept the optimal stepsize in a neighborhood of a stationary point, where the Hessian is positive definite. This technique can easily be extended to the algorithms considered here. As an example, letting ψk = min{1, gk }τ , for some τ > 0, we can replace the parameters ρ1 , ρ2 , ρ, and γ in Algorithm LSA1 and Algorithm LSA2 with the numbers ρ1 ψk , ρ2 /ψk , ρψk , and γ ψk , respectively, in a way that the acceptance conditions become less demanding as the gradient is converging to zero. However, this is still not entirely satisfactory, since in the quadratic case it does not ensure, in principle, that the algorithm can be identified right at the start with the PR algorithm, unless the parameters are appropriately chosen in relation to the optimal solution.

5. A trust region implementation of the PR method We introduce here a different model, based on a ‘trust region’approach, in which the linesearch algorithm is compatible with Wolfe conditions and even with exact linesearches, whenever the norm of dk does not exceed a suitable adaptive bound bk . The theoretical motivation is that of defining an algorithm model with the following properties: – Global convergence in the general nonquadratic case is guaranteed. – The PR formula for the computation of the search directions is unmodified. – The conjugate gradient method of Hestenes and Stiefel is reobtained in the quadratic case. From a computational point of view, the objective is that of defining a computational scheme, in which the acceptance rules defined in the preceding sections can be relaxed, so that line searches of any desired accuracy can be performed, at least when dk is not too large. Before describing this new algorithm, we define formally a procedure based on Wolfe conditions, which also ensures satisfaction of a descent condition on dk+1 or terminates with an arbitrarily small value of gk+1 .

88


ALGORITHM LSW (modified Wolfe conditions) Data. 21 > γ > 0, µ > γ , εk > 0, µ∗ > 0, δ1 > 0. Step 1. Compute ηk such that (i) f (xk + ηk dk ) ≤ fk + γ ηk gkT dk , (ii) g(xk + ηk dk )T dk ≥ µ gkT dk (or: |g(xk + ηk dk )T dk | ≤ µ |gkT dk |) Step 2. Compute αk such that either (case 1) the vectors xk+1 = xk + αk dk and dk+1 = −gk+1 + βk+1 dk satisfy: (a1 ) fk+1 ≤ f (xk + ηk dk ), T (a2 ) gk+1 dk+1 < −δ1 gk+1 2 T T (a3 ) gk+1 dk ≥ µ∗ gkT dk (or: | gk+1 dk | ≤ µ∗ |gkT dk |), or (case 2) the vector xk+1 = xk + αk dk satisfies gk+1 ≤ εk .

It can be easily shown that the algorithm is well defined, under the assumption that the level set L is compact and xk ∈ L. In fact, under this assumption, it is well known that there exist a finite procedure for computing a point ηk where the Wolfe conditions are satisfied. Starting from this point we can define a convergent minimization process that generates a sequence of stepsizes α(j ), for j = 0, 1, . . . with α(0) = ηk and such that for j → ∞ we have f (xk + α(j )dk ) < f (xk + ηk dk )

and

g(xk + α(j )dk )T dk → 0.

Recalling the PR formulas, we have that conditions (a1 )–(a3 ) will be satisfied in a finite number of steps, unless g(xk + α(j )dk ) converges to zero. On the other hand, in the latter case, the algorithm will terminate because of the test g(xk + α(j )dk ) ≤ εk . Then we can define the following scheme, in which we admit the possibility of using either Algorithm LSW defined above or the Armijo-type Algorithm LSA described in the preceding section. We will refer to Algorithm LSW by using the notation LSW(x, d, ε) for indicating that the algorithm computes a stepsize along d starting from the point x, with termination criterion defined by ε in Case 2. ALGORITHM PRTR (trust-region PR method) Data. δ ∈ (0, 1), x0 ∈ R n and a sequence {εk } such that εk → 0. Initialization. Set x˜0 = x0 , d0 = −g0 and k = 0. While gk = 0 do Step 1. Define a bound bk on the search direction Step 2. If dk ≤ bk then Compute αk and βk+1 using Algorithm LSW(x˜k , dk , εk ). If termination occurs in Case 1 then set xk+1 = x˜k + αk dk , x˜k+1 = xk+1 and dk+1 = −gk+1 + βk+1 dk Else (termination occurs in Case 2) set xk+1 = x˜k + αk dk , x˜k+1 = x˜k and dk+1 = dk End if


89

Else Compute αk using Algorithm LSA and set xk+1 = x˜k + αk dk , x˜k+1 = xk+1 and dk+1 = −gk+1 + βk+1 dk End if Step 3. Set k = k + 1 End While We note that some of the technical complications in the preceding scheme and, in particular, the introduction of the variables x˜k are only motivated by the objective of stating a convergence results for an infinite sequence. In practice, we can replace the test on the gradient norm at Step 2 of Algorithm LSW by the termination test used in the code. The convergence of Algorithm PRTR is proved in the following theorem under suitable assumptions on the bound bk specified at each step on the norm of dk . THEOREM 7 Let xk be the points generated by Algorithm PRTR and suppose that bk is defined in a way that the following condition holds: (H) if lim inf k→∞ gk > 0 then there exists B > 0 such that bk ≤ B for all k. Suppose that we choose in Algorithm LSA δ1 > 0 and (ρ2 < ∞ if

γ2 = 0).

(53)

Then either there exists an index ν such that g(xν ) = 0 and the algorithm terminates, or it produces an infinite sequence such that lim inf gk = 0, k→∞

and hence there exists a limit point of {xk } that is a stationary point of f . Proof If the algorithm does not terminate at a stationary point, it will produce an infinite sequence {xk } such that gk = 0 for all k. Reasoning by contradiction, we can assume that gk ≥ ε for some ε > 0. Because of Assumption (H) we have that bk ≤ B,

for all

k.

As the acceptance rules imply that xk ∈ L for all k, every subsequence will have limit points in L. Suppose that there exists an infinite subsequence such that Algorithm LSW is used at xk and termination occurs in Case 2; in this case, as εk converges to zero, the test at Step 2 of Algorithm LSW implies that the corresponding subsequence of points xk+1 will converge towards a stationary point and the assertion is proved. Therefore, we can assume that for sufficiently large values of k, when Algorithm LSW is used, termination occurs in Case 1. Under this assumption, because of the instructions of Algorithm LSW and Algorithm LSA, we have that gkT dk < −δ1 gk 2 ,

for all k.

(54)

90


Suppose first that there exists an infinite subsequence {xk }K such that dk ≤ bk ≤ B and Algorithm LSW is employed. By the instructions of Algorithm LSW, we have that there exists ηk > 0 such that fk+1 = f (xk + αk dk ) ≤ f (xk + ηk dk ) ≤ fk + γ ηk gkT dk ,

k∈K

(55)

and g(xk + ηk dk )T dk ≥ µgkT dk ,

k ∈ K.

(56)

Using equations (54) and (55), we can write fk − fk+1 ≥ γ ηk |gkT dk | > γ δ1 ηk gk 2 ,

k ∈ K.

(57)

Moreover, the assumptions of Lemma 2 hold and equation (56) implies that condition (a3 ) of this Lemma is valid. This implies that for some ρ ∗ > 0, we have ηk ≥ ρ ∗

|gkT dk | , dk 2

k ∈ K.

(58)

Therefore, by equations (54), (57), and (58) and the assumption dk ≤ M we get: fk − fk+1 >

γ δ12 ρ ∗ gk 4 , B2

k ∈ K.

(59)

As {fk } is converging to a limit, we have limk→∞ (fk − fk+1 ) = 0 and hence from equation (59) we obtain lim gk = 0, k∈K, k→∞

which yields a contradiction. Assume now that K is a finite set; in this case we have that algorithm LSA will be used for all sufficiently large k and hence the assertion follows directly from Theorem 6. This completes the proof. In order to complete the description of the algorithm defined above, we must also specify some rule for defining the bound bk used at each k. In particular, we require that condition (H) of Theorem 7 is satisfied and that exact linesearches are accepted in Algorithm PRTR when f is quadratic. When f is quadratic and exact linesearches are employed, we obviously have that βk(PR) = βk(FR) and hence we can write [see, e.g., ref. 14], for each k: dk 2 = gk 4

k

gk−j −2 .

(60)

j =0

It follows that a reasonable bound for dk can be defined by assuming  1/2 min{k,n} bk = bgk 2  gk−j −2  ,

(61)

j =0

where b ≥ 1 is a given constant. It is easily seen that condition (H) is satisfied under the assumptions of section 2, so that the bound bk in Algorithm PRTR can be defined through equation (61). Finally, in order to guarantee that the linear conjugate gradient method is reobtained in the quadratic case, we must also require that the initial stepsize in the linesearch performed at Step 1 of Algorithm LSW is the global minimizer of f along dk . This can be achieved, for instance, by performing two function evaluations along dk and then using a (safeguarded) quadratic interpolation formula for computing the initial tentative stepsize.


91

6. Trust region version of a modified PR method In the general case, it follows from Powell’s counterexamples that the algorithm defined in the preceding section may not permit to perform exact linesearches when the bound on dk is violated. However, we can remove this restriction by modifying the PR method, on the basis of the trust region approach proposed in this section. This yields an alternative technique for enforcing global convergence, which does not require resetting along the steepest descent direction as in ref. [16], but still admits the possibility of performing exact linesearches. This version of the PR method can be based on the following two modifications of the basic scheme: (i) A distinction is introduced between the stepsize ηk used for determining βk+1 (which is identified with the value that satisfies Wolfe conditions) and the actual stepsize αk used for computing the new point xk+1 . (ii) A rescaling of βk+1 is performed each time that an adaptive bound on the size of the current search direction dk is violated. We introduce these new features by modifying Algorithm LSW according to the following scheme, where we assume that a suitable bound bk is given in input. ALGORITHM LSW1 Data. 21 > γ > 0, µ > γ , εk > 0, µ∗ > 0, δ1 > 0, bk > 0. Step 1. Compute ηk such that (i) f (xk + ηk dk ) ≤ fk + γ ηk gkT dk , (ii) g(xk + ηk dk )T dk ≥ µ gkT dk (or: |g(xk + ηk dk )T dk | ≤ µ|gkT dk |) Step 2. Compute g(xk + ηk dk )T (g(xk + ηk dk ) − gk ) bk ∗ βk+1 = min 1, . dk gk 2

(62)

Step 3. Compute αk such that either (case 1) ∗ the vectors xk+1 = xk + αk dk and dk+1 = −gk+1 + βk+1 dk satisfy: (a1 ) fk+1 ≤ f (xk + ηk dk ), T (a2 ) gk+1 dk+1 < −δ1 gk+1 2 T T (a3 ) gk+1 dk ≥ µ∗ gkT dk (or: |gk+1 dk | ≤ µ∗ |gkT dk |), or (case 2) the vector xk+1 = xk + αk dk satisfies gk+1 ≤ εk .

Then we can define the following algorithm, where we use again the notation LSW1(x, d, ε) for indicating that Algorithm LSW1 computes a stepsize along d starting from the point x, with termination criterion defined by ε in Case 2. ALGORITHM

MPRTR (modified trust-region PR method)

Data. δ ∈ (0, 1), x0 ∈ R n and a sequence {εk } such that εk → 0. Initialization. Set d0 = −g0 , x˜0 = x0 and k = 0. While gk = 0 do Step 1. Define a bound bk on the search direction

92


∗ Step 2. Compute βk+1 and αk through Algorithm LSW1(x˜k , dk , εk ), If termination occurs in Case 1 then set xk+1 = x˜k + αk dk , x˜k+1 = xk+1 ∗ dk+1 = −gk+1 + βk+1 dk else (termination occurs in Case 2) set xk+1 = x˜k + αk dk , x˜k+1 = x˜k , bk+1 = bk and dk+1 = dk end if Step 3. Set k = k + 1 End While

The convergence of this scheme is established in the following theorem, whose proof can be derived from the proof of the preceding theorem and the proof of Theorem 1. THEOREM 8 Let xk be the points generated by Algorithm MPRTR and suppose that bk is defined in a way that condition (H) of Theorem 7 is satisfied. Then either there exists an index ν such that g(xν ) = 0 and the algorithm terminates, or it produces an infinite sequence such that lim inf gk = 0, n→∞

and hence there exists a limit point of {xk } that is a stationary point of f . Proof Reasoning by contradiction, as in the proof of Theorem 7, we can assume that the algorithm produces an infinite sequence of points xk ∈ L such that, for all k, we have that gk ≥ ε for some ε > 0. By (a2 ) of Algorithm LSW1 we have |gkT dk | > δ1 gk 2 .

(63)

Moreover, recalling the instructions of Algorithm LSW1 and using the same arguments employed in the proof of the preceding theorem, we can establish that there exists ηk > 0 such that, for all k, we have: fk − fk+1 ≥ γ ηk |gkT dk | > γ δ1 ηk gk 2

(64)

and ηk ≥ ρ ∗

|gkT dk | , dk 2

(65)

for some ρ ∗ > 0. By equation (64) and the compactness assumptions on L we have that the sequence {f (xk )} has a limit, so that lim ηk = 0.

k→∞

(66)

Now, taking into account the expression of dk , the definition of βk∗ and the assumption on bk ,we can write for all k: ∗ |dk dk+1 ≤ gk+1 + |βk+1 bk g(xk + ηk dk )g(xk + ηk dk ) − gk ≤ gk+1 + min 1, dk dk gk 2

≤+

BLηk dk , ε2

(67)


93

where is a bound on the gradient norm. Recalling equation (66) and letting q ∈ (0, 1), we have that there exists a sufficiently large index k1 such that: BL ηk ≤ q < 1, ε2

for all k ≥ k1 ,

(68)

so that dk+1 ≤ + qdk ,

for all k ≥ k1 .

(69)

From equation (69), recalling again a known inequality (ref. 21, Lemma 1, p. 44) we obtain: dk+1 ≤ + dk1 − q k+1−k1 , for all k ≥ k1 , 1−q 1−q and this proves that dk is bounded for all k, that is dk ≤ M for some M > 0. Therefore, from this inequality, recalling equations (63)–(65) we get: fk − fk+1 >

γ δ12 ρ ∗ gk 4 . M2

As (fk − fk+1 ) → 0 from equation (70) we get a contradiction.

7.

(70)

Numerical results

Some of the algorithms introduced in this article have been tested on a set of large problems, already used in refs. [5,26], where appropriate references can be found. The main motivation of these experiments was that of verifying whether the strategies defined here for enforcing global convergence may have negative effects from a computational point of view on some of the best known problems, where the (unmodified) PR method shows a relatively good behavior. More specifically, two different codes have been tested: one (Algorithm PRTR) based on the trust region implementation introduced in section 5 and the other (Algorithm MPRTR) based on the modified PR iteration defined in section 6. In both cases, the condition (61) was employed, in correspondence to different values of b > 1, and in each linesearch the objective function value was computed at least at two different points. All the experiments were performed using double precision Compaq Fortran 90 codes on a Windows NT workstation, with the termination criterion gk ∞ ≤ 10−5 (1 + |f (xk )|). Some of the choices that have been made are illustrated below in correspondence to each algorithm. Algorithm PRTR: The algorithm has been implemented by associating the modified Wolfe linesearch (Algorithm LSW) with a parabolic linesearch based on Algorithm LSA2, using condition (61) as switching rule. In Algorithm LSW, we set γ = 10−4 , µ = µ∗ = 0.1, δ1 = 0.8, and the test at Step 2 on the gradient norm was replaced by the termination test defined above. The Wolfe linesearch was performed using a safeguarded cubic interpolation and employing a constant extrapolation factor equal to 2. Algorithm LSA2 was implemented by employing modified Goldstein-type acceptability conditions (without specifying a lower bound on the initial stepsize, but including a tentative expansion step when all acceptance conditions are satisfied); the interpolation and the extrapolation phases were carried out as in Algorithm LSW.

94

L. Grippo and S. Lucidi Table 1.

Problem Calculus of variations 2 Calculus of variations 3 Generalized Rosenbrock Calculus of variations 2 Calculus of variations 3 Variably dimensioned Linear min. surface Strictly convex 1 Strictly convex 2 Oren’s power Generalized Rosenbrock Penalty 1 Penalty 3 Ext. Powell singular Tridiagonal 1 Boundary-value prob. Broyden trid. nonlinear Ext. Freud. and Roth Wrong. extended Wood Matrix square root ns = 1 Matrix square root ns = 2 Sp. matrix square root Extended Rosenbrock Extended Powell Tridiagonal 2 Trigonometric Variably dimensioned Strictly convex 1 Strictly convex 2 Oren’s power Penalty 1 Ext. Powell singular Tridiagonal 1 Extended ENGVL1 Ext. Freud. and Roth Wrong. extended Wood Sp. matrix square root Extended Rosenbrock Extended Powell Tridiagonal 2 Trigonometric Penalty 1 2nd ver.

Results with Algorithm PRTR for b = 1.5.

n

ni

nf

f

g

nmod

200 200 500 500 500 500 961 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 10,000 10,000 10,000 10,000 10,000 10,000 10,000 10,000 10,000 10,000 10,000 10,000 10,000 10,000 10,000

686 2523 1045 1900 6582 0 143 2 11 106 2065 3 188 47 263 39 20 7 42 1314 1314 109 23 47 273 40 0 1 4 417 1 47 868 6 6 58 172 23 47 901 4 1

1381 5053 2100 3817 13,171 19 297 16 32 337 4140 36 396 132 530 81 50 39 108 2638 2638 229 75 132 555 96 58 13 18 964 27 132 1767 35 37 142 355 75 132 1842 17 27

0.52180D + 02 −0.14720D + 00 0.10000D + 01 0.52180D + 02 −0.14720D + 00 0.19870D − 24 0.90000D + 03 0.10000D + 04 0.50052D + 05 0.12487D − 07 0.10000D + 01 0.96771D + 00 0.17690D + 02 0.12666D − 06 0.62386D + 00 0.28904D − 08 0.78130D − 11 0.12147D + 06 0.39379D + 01 0.49514D − 08 0.49514D − 08 0.88914D − 09 0.30738D − 17 0.12666D − 06 0.49252D − 12 0.22754D − 06 0.23088D − 17 0.10000D + 05 0.50041D + 07 0.37180D − 07 0.98918D + 01 0.12666D − 05 0.62386D + 00 0.11099D + 05 0.12165D + 07 0.39379D + 01 0.54795D − 08 0.30738D − 16 0.12666D − 05 0.58199D − 12 0.49172D − 06 0.98918D + 01

0.20238D − 02 0.37441D − 04 0.29356D − 04 0.20509D − 02 0.61570D − 04 0.57633D − 08 0.72369D − 01 0.87888D − 01 0.52712D + 01 0.32025D − 04 0.40163D − 04 0.15658D − 04 0.16258D − 02 0.70344D − 04 0.47621D − 04 0.35714D − 04 0.27774D − 04 0.24113D + 01 0.80084D − 04 0.89296D − 04 0.89296D − 04 0.50968D − 04 0.38721D − 08 0.70346D − 04 0.28351D − 04 0.21198D − 04 0.55525D − 04 0.97246D + 00 0.11052D + 04 0.10137D − 03 0.85829D − 03 0.22242D − 03 0.85535D − 04 0.12119D + 00 0.37368D + 01 0.38725D − 04 0.10657D − 03 0.12245D − 07 0.22247D − 03 0.57112D − 04 0.41844D − 03 0.85829D − 03

0 0 0 0 0 0 0 0 0 69 0 0 0 19 0 0 0 1 0 0 0 0 7 19 0 0 0 0 0 7 0 19 0 0 0 0 0 7 19 0 0 0

Algorithm MPRTR: We used Algorithm LSW1 with the choices defined above within a slightly modified version of Algorithm MPRTR. In fact, when the bound on dk is violated, the tentative value of βk+1 must be rescaled by the factor bk /dk as indicated; however, if ηk is the stepsize satisfying Wolfe conditions and if g(xk + ηk dk )T dk ≥ 0, then a local minimizer has been bracketed and an accurate linesearch would produce αk ≤ ηk , so that the convergence proof is unaffected if we compute βk+1 in correspondence to αk . In this situation, we defined ∗ βk+1 during the line search as bk g(xk + αk dk )T (g(xk + αk dk ) − gk ) ∗ = min 1, . βk+1 dk gk 2 In each table, we show the number of iterations (ni), the number of function evaluations (nf), the objective function value (f ), the gradient norm at the solution (g), and the number of times that the bound on dk is violated (nmod), so that the standard PR algorithm is modified.


95

In table 1, we report the results obtained by employing Algorithm PRTR, using weak Wolfe conditions, and letting b = 1.5 in condition (61). By comparing, whenever possible, the results obtained with those given in references cited above, we can note that the behavior of the algorithm is essentially similar to that of the standard PR method in most of cases. Only on a very limited number of cases, the number nmod is greater than zero. This is probably one of the most significant result of our experimentation, since it implies that the bound defined in equation (61) is an effective way of monitoring the behavior of the PR method, in spite of the fact that the test problems are not quadratic. We note also that when nmod > 0, the modification does not deteriorate the behavior of the PR method. If the value of b is reduced, say to b = 1.1 we did not observe significant changes in the results, although nmod may increase considerably in some problems. In table 2, we report the results obtained by running Algorithm MPRTR on the same test set and for the same value of b = 1.5. We obviously have a set of problems that are unaffected Table 2. Problem Calculus of variations 2 Calculus of variations 3 Generalized Rosenbrock Calculus of variations 2 Calculus of variations 3 Variably dimensioned Linear min. surface Strictly convex 1 Strictly convex 2 Oren’s power Generalized Rosenbrock Penalty 1 Penalty 3 Ext. Powell singular Tridiagonal 1 Boundary-value prob. Broyden trid. nonlinear Ext. Freud. and Roth Wrong. extended Wood Matrix square root ns = 1 Matrix square root ns = 2 Sp. matrix square root Extended Rosenbrock Extended Powell Tridiagonal 2 Trigonometric Variably dimensioned Strictly convex 1 Strictly convex 2 Oren’s power Penalty 1 Ext. Powell singular Tridiagonal 1 Extended ENGVL1 Ext. Freud. and Roth Wrong. extended Wood Sp. matrix square root Extended Rosenbrock Extended Powell Tridiagonal 2 Trigonometric Penalty 1 2nd ver.

Results with Algorithm MPRTR for b = 1.5.

n

ni

nf

f

g

nmod

200 200 500 500 500 500 961 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 10,000 10,000 10,000 10,000 10,000 10,000 10,000 10,000 10,000 10,000 10,000 10,000 10,000 10,000 10,000

686 2523 1045 1900 6582 0 143 2 11 133 2065 3 188 65 263 39 20 7 42 1314 1314 109 52 65 273 40 0 1 4 413 1 66 868 6 6 58 172 52 66 901 4 1

1381 5053 2100 3817 13,171 19 297 16 32 398 4140 36 396 169 530 81 50 39 108 2638 2638 229 141 169 555 96 58 13 18 954 27 170 1767 35 37 142 355 141 170 1842 17 27



0 0 0 0 0 0 0 2 0 15 0 0 0 15 0 0 0 1 2 0 0 0 39 15 0 0 0 1 0 7 0 16 0 0 0 2 0 39 16 0 0 0

96


by the test on dk and in almost all cases we get the same results obtained with Algorithm PRTR; the only exceptions are the problems: Oren, Extend Powell, and Extended Rosenbrock, where a (marginal) deterioration is observed. However, if b is changed to the value b = 1.1 the behavior of several problems is appreciably affected (negatively in most of cases). This seems to indicate that the modified PR iteration may not be particularly advantageous, but further experimentation may be needed. The complete results are shown in table 3.

Table 3. Problem Calculus of variations 2 Calculus of variations 3 Generalized Rosenbrock Calculus of variations 2 Calculus of variations 3 Variably dimensioned Linear min. surface Strictly convex 1 Strictly convex 2 Oren’s power Generalized Rosenbrock Penalty 1 Penalty 3 Ext. Powell singular Tridiagonal 1 Boundary-value prob. Broyden trid. nonlinear Ext. Freud. and Roth Wrong. extended Wood Matrix square root ns = 1 Matrix square root ns = 2 Sp. matrix square root Extended Rosenbrock Extended Powell Tridiagonal 2 Trigonometric Variably dimensioned Strictly convex 1 Strictly convex 2 Oren’s power Penalty 1 Ext. Powell singular Tridiagonal 1 Extended ENGVL1 Ext. Freud. and Roth Wrong. extended Wood Sp. matrix square root Extended Rosenbrock Extended Powell Tridiagonal 2 Trigonometric Penalty 1 2nd ver.

Results with Algorithm MPRTR for b = 1.1.

n

ni

nf

f

g

200 200 500 500 500 500 961 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 10,000 10,000 10,000 10,000 10,000 10,000 10,000 10,000 10,000 10,000 10,000 10,000 10,000 10,000 10,000

686 4021 1045 1900 6231 0 143 2 11 147 2065 3 188 83 263 39 20 8 33 1314 1314 109 127 80 273 40 0 1 4 467 1 79 868 6 6 62 172 127 78 901 4 1

1381 8049 2100 3817 12,469 19 297 16 32 419 4140 36 396 210 530 81 50 41 90 2638 2638 229 283 204 555 96 58 13 18 1062 27 204 1767 35 37 151 355 283 201 1842 17 27



nmod 0 3470 0 0 4899 0 0 2 0 45 0 0 0 42 0 0 0 2 5 0 0 0 117 38 0 0 0 1 0 91 0 40 0 0 1 6 0 117 39 0 0 0


8.

97

Concluding remarks

The results obtained in this article show that there are different ways of implementing the PR method in a way that global convergence is guaranteed and the computational efficiency is not deteriorated. Generally speaking, we can follow two basic approaches: (a) Keep unmodified the PR iteration and employ modified linesearch rules, which are able to guarantee that the sequence {dk } remains bounded, whenever the gradient norm is bounded away from zero. (b) Modify the PR iteration in a way that the growth of {dk } is controlled, but permit exact linesearches along the (modified) search directions. Both strategies can be realized in a way that the linear conjugate gradient method is reobtained in the quadratic cases. The method proposed by Powell [16] can be viewed as one (extreme) example of the latter strategy, although it is not easily comparable with the trust region modification introduced here. Numerical results do not indicate that the convergence results given here may be useful for improving the behavior of the PR method in ‘nonpathological’ cases. However, the trust region approach proposed in this article could be made the basis of alternative, more sophisticated implementations. Acknowledgements The authors are grateful to the referees for their useful comments and suggestions. This work was supported by MIUR, FIRB Research Program Large-Scale Nonlinear Optimization, Rome, Italy.

References [1] Hestenes, M.R. and Stiefel, E.L., 1952, Methods of conjugate gradients for solving linear systems. Journal of Reseach Natural Bureau of Standards Section 5, 49, 409–436. [2] Dixon, L.C.W., 1973, Nonlinear optimization: A survey of the state of the art. In: D.J. Evans (Ed) Software for Numerical Mathematics (New York: Academic Press), pp. 193–216. [3] Fletcher, R., 1987, Practical Methods of Optimization (New York: John Wiley and Sons). [4] Fletcher, R. and Reeves, C.M., 1964, Function minimization by conjugate gradients. Computer Journal, 7, 149–154. [5] Gilbert, J.C. and Nocedal, J., 1992, Global convergence of conjugate gradient methods for optimization. SIAM Journal on Optimization, 2, 21–42. [6] Hestenes, M.R., 1980, Conjugate Direction Methods in Optimization (New York: Springer Verlag). [7] Khoda, K.M., Liu, Y. and Storey, C., 1992, Generalized Polak–Ribiere algorithm. Journal on Optimization Theory and Applications, 75, 345–354. [8] Liu, Y. and Storey, C., 1991, Efficient generalized conjugate gradient algorithms, part 1: Theory. Journal on Optimization Theory and Applications, 69, 129–137. [9] Polak, E. and Ribière, G., 1969, Note sur la convergence de méthodes de directions conjuguées. Revue Francaise d’Informatique et de Recherche Opérationnelle, 16, 35–43. [10] Shanno, D.F., 1985, Globally convergent conjugate gradient algorithms. Mathematical Programming, 33, 61–67. [11] Shanno, D.F., 1978, Conjugate gradient methods with inexact searches. Mathematics of Operations Research, 3, 244–2567. [12] Shanno, D.F., 1978, On the convergence of a new conjugate gradient algorithm. SIAM Journal on Numerical Analysis, 15, 1247–1257. [13] Zoutendijk, G., 1970, Nonlinear programming computational methods. In: J. Abadie (Ed) Integer and Nonlinear Programming (Amsterdam: North-Holland), pp. 37–86. [14] Al-Baali, M., 1985, Descent property and global convergence of the Fletcher-Reeves method with inexact line search. IMA Journal on Numerical Analysis, 5, 121–124.

98


[15] Powell, M.J.D., 1984, Nonconvex minimization calculations and the conjugate gradient method. In: Lecture Notes in Mathematics 1066 (Berlin: Spring-Verlag), pp. 122–141. [16] Powell, M.J.D., 1986, Convergence properties of algorithms for nonlinear optimization. SIAM Review, 28, 487–500. [17] Dai, Y.H., Han, J.Y., Liu, G.H., Sun, D.F., Yin, H.X. and Yuan, Y., 1999, Convergence properties of nonlinear conjugate gradient methods. SIAM Journal on Optimization, 10, 345–358. [18] Grippo, L. and Lucidi, S., 1997, A globally convergent version of the Polak–Ribiere conjugate gradient method. Mathematical Programming, 78, 375–391. [19] Dai, Y.H., 2002, Conjugate gradient methods with Armijo-type line searches. Acta Mathematicae Applicatae Sinica (English Series), 18(1), 123–130. [20] Ortega, J.M. and Rheinboldt, W.C., 1970, Iterative Solution of Nonlinear Equations in Several Variables (New York: Academic Press). [21] Polijak, B.T., 1987, Introduction to Optimization. Optimization Software Inc. [22] Bazaraa, M.S., Sherali, H.D. and Shetty, C.M., 1993, Nonlinear Programming (New York: John Wiley and Sons). [23] Bertsekas, D.P., 1999, Nonlinear Programming, (2nd edn) (Athena Scientific). [24] De Leone, R., Gaudioso, M. and Grippo, L., 1984, Stopping criteria for linesearch methods without derivatives. Mathematical Programming, 30, 285–300. [25] Grippo, L., Lampariello, F. and Lucidi, S., 1988, Global convergence and stabilization of unconstrained minimization methods without derivatives. Journal of Optimization Theory and Applications, 56, 385–406. [26] Raydan, M., 1997, The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem. SIAM Journal on Optimization, 7, 26–33.

Convergence conditions, line search algorithms and trust region ...

Convergence conditions, line search algorithms and trust region ...

Suggest Documents

On the Convergence Theory of Trust-region-based Algorithms for

Global Convergence of Trust Region Algorithms for ... - CiteSeerX

Combining line search and trust-region methods for 1 ...

Line search and trust region strategies for canonical

A line search trust-region algorithm with ...

Convergence Conditions for Variational Inequality Algorithms Thomas ...

Nonlinear Optimization: Trust Region Algorithms - CiteSeerX

Convergence properties of trust region methods for linear and convex ...

Convergence of linesearch and trust-region methods using the ...

Convergence and Cycling in Walker-type Saddle Search Algorithms

A Study of Harmony Search Algorithms: Exploration and Convergence ...

A Combined Filter Line Search and Trust Region Method for Nonlinear ...

GLOBAL CONVERGENCE OF A TRUST-REGION SQP-FILTER ...

On the Convergence of Recursive Trust-Region Methods for ...

A Comparative Study between Genetic Algorithms and Line Search

Efficient maze-running and line-search algorithms for VLSI layout ...

Trust-Region Interior-Point Algorithms for Minimization Problems with ...

Sensitivity of trust-region algorithms to their parameters - Springer Link

Trust-Region Algorithms for Nonlinear Stochastic ... - DIAL@UCL

A review of trust region algorithms for optimization - CiteSeerX

Trust{Region Interior{Point Algorithms for a Class of Nonlinear ...

Robust Trust-Region Space-Mapping Algorithms for Microwave ... - SOS

Convergence of Stochastic Search Algorithms to Gap-Free Pareto ...

Combining Trust Region Techniques and