NOTES ON LIMITED MEMORY BFGS UPDATING IN A TRUST-REGION FRAMEWORK

JAMES V. BURKE AND ANDREAS WIEGMANN

Abstract. The limited memory BFGS method pioneered by Jorge Nocedal is usually implemented as a line search method where the search direction is computed from a BFGS approximation to the inverse of the Hessian. The advantage of inverse updating is that the search directions are obtained by a matrix-vector multiplication. Furthermore, experience shows that when the BFGS approximation is appropriately re-scaled (or re-sized) at each iteration, the line search stopping criteria are often satisfied for the first trial step. In this note it is observed that limited memory updates to the Hessian approximations can also be applied in the context of a trust-region algorithm with only a modest increase in the linear algebra costs. This is true even though in the trust-region framework one maintains approximations to the Hessian rather than its inverse. The key to this observation is the compact form of the limited memory updates derived by Byrd, Nocedal, and Schnabel. Numerical results on a few of the MINPACK-2 test problems indicate that an implementation that incorporates re-scaling directly into the trust-region updating procedure exhibits convergence behavior comparable to a standard implementation of the algorithm by Liu and Nocedal.
1. Introduction

In 1980 Nocedal [14] introduced limited memory BFGS (L-BFGS) updating for unconstrained optimization. Subsequent numerical studies on large-scale problems have shown that methods based on this updating scheme can be very effective if the updated inverse Hessian approximations are re-scaled at every iteration [5, 9, 13, 24]. Indeed, the L-BFGS method is currently the winner on many classes of problems and competes with truncated Newton methods on a variety of very large-scale nonlinear problems [13]. In this note we observe that L-BFGS updating can be employed within the framework of a trust-region algorithm with only a modest increase in the linear algebra costs. On most iterations these costs are the same as for the line search based algorithm. Since trust-region strategies are known to be quite efficient in the minimization of highly nonlinear objective functions, such an approach may improve the efficiency of limited memory strategies. Moreover, a trust-region approach opens the door to methods based on non-symmetric updating such as a limited memory SR1 update. A general theory for low dimensional quasi-Newton updating strategies for large-scale unconstrained optimization is developed in [3].

The L-BFGS method is a matrix secant method specifically designed for low storage and linear algebra costs in the computation of a Newton-like search direction. This is done by employing a clever representation for the L-BFGS update. Recall that the L-BFGS update is obtained by applying the BFGS update to an initial positive definite diagonal matrix (a scaling matrix) using data from a few of the most recent iterations.

Date: January 15, 1997. This manuscript was submitted to the SIAM Journal on Optimization on July 2, 1996. This research was supported by National Science Foundation Grant DMS-9303772.
The update can then be stored either as a recursion [14, 13] or in the compact form derived in [4]. The search direction is computed by a simple matrix-vector multiplication. Changes in the initial scaling matrix and the data from past iterations can be introduced into the update at low cost. This is especially true if the initial scaling matrix is a multiple of the identity, as it is usually taken to be in practice. Once the search direction is computed, an appropriate step-length is obtained from one of the standard line search procedures. However, in practice a line search is rarely required if one re-scales at each iteration using the scaling (3.2) suggested in [5, 9, 13, 19, 24].

The key difference between the L-BFGS method described above and the strategy proposed in this note is that trust-region methods require the maintenance of an approximate Hessian, not its inverse. This means that the next iterate is obtained by solving an equation rather than by matrix multiplication. Moreover, if the solution to this equation is found to be unsatisfactory, then a new system must be formed and solved. This may occur several times before a successful iterate is finally obtained. Thus, it seems that the linear algebra costs should be much greater for this approach. However, we give a discussion in Section 4 of why this is not the case.

2. Trust-Region Methods

Consider the problem of minimizing the smooth function $f : \mathbb{R}^n \to \mathbb{R}$ over $\mathbb{R}^n$. Each iteration of a trust-region method begins with an estimate of the solution $x \in \mathbb{R}^n$, the value $g$ of the gradient of $f$ at $x$, an estimate $H$ of the Hessian of $f$ at $x$, and a trust-region radius $\Delta > 0$. The step to the next iterate is obtained by solving subproblems of the form
(2.1)   $P(g, H, \Delta)$: minimize $g^T s + \tfrac{1}{2} s^T H s$ subject to $\|s\| \le \Delta$,
where the norm used here and throughout is the usual Euclidean norm (to simplify the notation, we often suppress the iteration index). The step $s$ is either accepted or rejected, and the trust-region radius is kept, increased, or decreased depending on the value of the ratio
(2.2)   $r(s) = \dfrac{f(x + s) - f(x)}{g^T s + \tfrac{1}{2} s^T H s}$.
This ratio measures the ability of the quadratic model to accurately predict the behavior of the objective function $f$. If the step is rejected, then $\Delta$ is decreased and the procedure is repeated.

Trust-region methods rely on the fact that the subproblem $P$ can be quickly and accurately solved [11, 20]. Research on this subproblem and the development of new solution methods remains a very active area of study to this day [17, 18, 21, 23]. We only use the following fact concerning the solution to $P(g, H, \Delta)$ [11, Lemma 2.1]: if $H$ is positive semi-definite, then the vector $s(\Delta) \in \mathbb{R}^n$ with $\|s(\Delta)\| \le \Delta$ solves $P$ if and only if there exists a $\lambda(\Delta) \ge 0$ such that
(2.3)   $(\lambda(\Delta) I + H)\, s(\Delta) = -g$ and $\lambda(\Delta)\,(\|s(\Delta)\| - \Delta) = 0$.
If $\lambda(\Delta) > 0$, then $(\lambda(\Delta) I + H)$ is positive definite and
(2.4)   $s(\Delta) = -(\lambda(\Delta) I + H)^{-1} g$.
Next consider the problem
(2.5)   $\hat P(g, H, \lambda)$: minimize $g^T s + \tfrac{1}{2} s^T H s + \tfrac{\lambda}{2} s^T s$,
where $H$ is assumed to be positive semi-definite and $\lambda \ge 0$. The solution to $\hat P(g, H, \lambda)$ is the vector
(2.6)   $s_\lambda = -(\lambda I + H)^{-1} g$.
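The relationship between $P$ and $\hat P$ discussed next can be checked numerically. The following is our own NumPy sketch (not code from the paper): it solves $\hat P$ for a fixed $\lambda$ via (2.6) and verifies that the result satisfies the optimality conditions (2.3) for $P$ with $\Delta = \|s_\lambda\|$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))
H = A @ A.T                       # a positive semidefinite model Hessian
g = rng.standard_normal(n)
lam = 0.5                         # the parameter lambda in P-hat

# (2.6): the unique solution of P-hat(g, H, lam).
s_lam = -np.linalg.solve(lam * np.eye(n) + H, g)

# By the correspondence, s_lam solves P(g, H, Delta) for Delta = ||s_lam||
# with optimal multiplier lam; the conditions (2.3) hold by construction.
Delta = np.linalg.norm(s_lam)
stationarity = (lam * np.eye(n) + H) @ s_lam + g        # should be ~ 0
complementarity = lam * (np.linalg.norm(s_lam) - Delta) # exactly 0 here
print(np.linalg.norm(stationarity), complementarity)
```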
The expressions (2.4) and (2.6) reveal a simple relationship between the solutions to the subproblems $P$ and $\hat P$. That is, if $s(\Delta)$ solves $P(g, H, \Delta)$ with optimal multiplier $\lambda(\Delta)$, then $s(\Delta)$ solves $\hat P(g, H, \lambda(\Delta))$. Conversely, if $s_\lambda$ solves $\hat P(g, H, \lambda)$, then $s_\lambda$ solves $P(g, H, \|s_\lambda\|)$ with optimal multiplier $\lambda$.

In the context of least-squares minimization, Levenberg [8] was the first to consider subproblems of the form $\hat P$ as a way to constrain the magnitude of the step. Later, Morrison [12] observed and exploited the relationship between the subproblems $P$ and $\hat P$. He proposed the idea of adjusting the parameter $\lambda$ in $\hat P$ as a way to adjust the magnitude of the trial step and proved the first results about the behavior of the solution $s_\lambda$ as a function of $\lambda$. Building on the work of Morrison, Marquardt [10] published a well-known trust-region-like solution procedure for problems of nonlinear least squares arising in parameter estimation. Following Morrison, Marquardt also adjusted the magnitude of the trial step by adjusting $\lambda$, but he went further and introduced a clear strategy for the acceptance and rejection of trial steps and for updating the parameter $\lambda$. Shortly thereafter, these ideas were extended to the general nonlinear setting by Goldfeld, Quandt, and Trotter [6].

Note that as $\lambda$ increases, the magnitude of the solution $s_\lambda$ to $\hat P$ decreases. In this way, the parameter $\lambda$ in $\hat P$ can be used to adjust the magnitude of $s_\lambda$. However, adjusting $\lambda$ in $\hat P(g, H, \lambda)$ provides only implicit control over the magnitude of the step, while adjusting $\Delta$ in $P(g, H, \Delta)$ provides explicit control. For this reason, methods that use $P(g, H, \Delta)$ are called (explicit) trust-region methods, while those that use $\hat P(g, H, \lambda)$ are called implicit trust-region methods.

3. Trust-Regions and the L-BFGS Update

Again, let $f : \mathbb{R}^n \to \mathbb{R}$ be the smooth function that is to be minimized, and suppose the algorithm generates the sequence of iterates $\{x_k\}$. Set $g_k = \nabla f(x_k)$ for all $k \in \{0, 1, \dots\}$. In limited memory methods one chooses an integer value $\hat m$ (typically $\hat m = 5$) and obtains an approximate Hessian by applying an updating scheme to an initial Hessian approximation using the differences of successive iterates and gradients from the $m$ most recent iterations. Define $s_k = x_k - x_{k-1}$ and $y_k = g_k - g_{k-1}$ for all $k$, and set
\[ S = [\,s_{k-m+1}, \dots, s_k\,], \qquad Y = [\,y_{k-m+1}, \dots, y_k\,], \qquad\text{and}\qquad S^T Y = L + D + R, \]
where $L$ is strictly lower triangular, $D$ is diagonal, and $R$ is strictly upper triangular. Under the assumption that the matrix $D$ is positive definite, the L-BFGS Hessian approximation at $x_k$ with initial Hessian approximation $H_0 := \delta I$ is given by
(3.1)   $H = \delta I - \Psi \Gamma^{-1} \Psi^T$,
where
\[ \Psi = \begin{bmatrix} \delta S & Y \end{bmatrix} \qquad\text{and}\qquad \Gamma = \begin{bmatrix} \delta S^T S & L \\ L^T & -D \end{bmatrix}. \]
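To fix ideas, the following sketch (our NumPy illustration, not the authors' code) assembles $\Psi$ and $\Gamma$ from stored pairs and forms $H$ explicitly; a production code would never build the $n \times n$ matrix, but on a single stored pair the formula can be checked against one exact BFGS update from $H_0 = \delta I$.

```python
import numpy as np

def lbfgs_compact_hessian(S, Y, delta):
    """Form H = delta*I - Psi Gamma^{-1} Psi^T as in (3.1).

    Columns of the n-by-m arrays S, Y are the stored pairs (oldest first);
    delta > 0 is the initial scaling H_0 = delta*I.  For illustration only:
    a real implementation keeps H in this factored form.
    """
    n, m = S.shape
    StY = S.T @ Y
    L = np.tril(StY, k=-1)          # strictly lower triangular part of S^T Y
    D = np.diag(np.diag(StY))       # diagonal part of S^T Y
    Psi = np.hstack([delta * S, Y])                   # n x 2m
    Gamma = np.block([[delta * (S.T @ S), L],
                      [L.T, -D]])                     # 2m x 2m
    return delta * np.eye(n) - Psi @ np.linalg.solve(Gamma, Psi.T)

# Check against a single exact BFGS update started from H0 = delta*I.
rng = np.random.default_rng(1)
n = 5
s = rng.standard_normal(n)
y = rng.standard_normal(n)
if s @ y < 0:
    y = -y                          # enforce the curvature condition s^T y > 0
delta = (y @ y) / (s @ y)           # the scaling (3.2), introduced below
H0 = delta * np.eye(n)
H_bfgs = (H0 - np.outer(H0 @ s, H0 @ s) / (s @ (H0 @ s))
             + np.outer(y, y) / (s @ y))
H_compact = lbfgs_compact_hessian(s[:, None], y[:, None], delta)
print(np.max(np.abs(H_bfgs - H_compact)))   # ~ 1e-15
```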
It is the positive definiteness of $D$ that guarantees the invertibility of $\Gamma$. This beautiful form of the L-BFGS update appears in Byrd, Nocedal, and Schnabel [4, Theorem 2.3]. The condition that $D$ be positive definite is usually ensured with the addition of a suitable line search procedure. In our implementations we use the scaling
(3.2)   $\delta = \|y_k\|^2 / s_k^T y_k$
suggested by Shanno and Phua in [19]. This scaling appears as one of a class of optimal scalings first proposed in Oren and Spedicato [15]. In practice, this choice of scaling has a profound impact on the performance of the L-BFGS method [5, 9, 13, 19, 24].

In order to apply either an explicit or an implicit trust-region strategy, we must be able to quickly solve equations of the form
(3.3)   $(\lambda I + H)\, s = -g$
for a trial step $s$, where $\lambda \ge 0$ and $g = g_k$. Moreover, we may have to solve this equation for more than one value of $\lambda$ if the initial step is found to be unsuitable. To see how this can be done, note that
\[ \lambda I + H = \tau I - \Psi \Gamma^{-1} \Psi^T, \qquad\text{where } \tau = \delta + \lambda. \]
Using the Sherman-Morrison-Woodbury formula, the inverse of this matrix has the form
(3.4)   $(\lambda I + H)^{-1} = \frac{1}{\tau}\left[ I + \Psi (\tau \Gamma - \Psi^T \Psi)^{-1} \Psi^T \right]$
whenever the matrices $(\lambda I + H)$ and $\Gamma$ are invertible (in which case $(\tau \Gamma - \Psi^T \Psi)$ is also invertible). To solve the trust-region subproblems, we apply the formula (3.4) to compute the step $s$ in (3.3). This requires the solution of one or more $2m \times 2m$ matrix equations involving the matrix $(\tau \Gamma - \Psi^T \Psi)$ for various values of $\lambda$. The predominant cost in this approach is the formation of the vector $\Psi^T g_k$, the matrix $(\tau \Gamma - \Psi^T \Psi)$, and the final recovery of the solution by multiplication with the matrix $\Psi$. Regarding the matrix $(\tau \Gamma - \Psi^T \Psi)$, observe that
(3.5)   $\tau \Gamma - \Psi^T \Psi = \begin{bmatrix} \delta \lambda\, S^T S & \lambda L - \delta (D + R) \\ \lambda L^T - \delta (D + R)^T & -\left( Y^T Y + (\delta + \lambda) D \right) \end{bmatrix}$.
This representation clearly illustrates the increased linear algebra costs associated with a trust-region approach. The parameter $\lambda$ always takes the value $0$ when the compact form of the update is used in the Liu-Nocedal algorithm with line search. Hence the matrices $S^T S$ and $L$ do not need to be computed. This observation has important consequences for how we implement the L-BFGS update in a trust-region framework.
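In code, (3.4) reduces each solve of (3.3) to a single $2m \times 2m$ linear system plus two products with $\Psi$. The sketch below is our own NumPy illustration of this reduction under the notation above (random data is used only to check the algebraic identity; in the algorithm itself $D$ would be positive definite):

```python
import numpy as np

def solve_shifted(Psi, Gamma, delta, lam, g):
    """Solve (lam*I + H)s = -g for H = delta*I - Psi Gamma^{-1} Psi^T
    via the Sherman-Morrison-Woodbury form (3.4); only the products with
    Psi touch the large dimension n."""
    tau = delta + lam
    M = tau * Gamma - Psi.T @ Psi        # the 2m x 2m matrix of (3.5)
    v1 = np.linalg.solve(M, Psi.T @ g)
    return -(g + Psi @ v1) / tau

# Consistency check against a dense solve.
rng = np.random.default_rng(2)
n, m = 8, 3
S = rng.standard_normal((n, m))
Y = rng.standard_normal((n, m))
StY = S.T @ Y
L, D = np.tril(StY, k=-1), np.diag(np.diag(StY))
delta, lam = 2.0, 0.7
Psi = np.hstack([delta * S, Y])
Gamma = np.block([[delta * (S.T @ S), L], [L.T, -D]])
H = delta * np.eye(n) - Psi @ np.linalg.solve(Gamma, Psi.T)
g = rng.standard_normal(n)
print(np.max(np.abs(solve_shifted(Psi, Gamma, delta, lam, g)
                    + np.linalg.solve(lam * np.eye(n) + H, g))))  # ~ 1e-14
```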
4. Trust-Region Implementations Using L-BFGS Updates

Empirical evidence shows that the L-BFGS update implemented with the scaling (3.2) provides a step of such quality that a line search is rarely required. We exploit this observation in our trust-region implementations. That is, we always test the step $\hat s$ computed with $\lambda = 0$ to determine if it provides a sufficient decrease in the objective function. If it does, then we accept this step and proceed to the next iteration. If the step is unacceptable, then we increase $\lambda$ in an implicit strategy or we decrease $\Delta$ in an explicit strategy. This deviates from standard trust-region implementations in that we do not pass forward a trust-region parameter from one iteration to the next. Instead, we generate this parameter anew at the beginning of each iteration. This approach brings the linear algebra costs into line with those associated with a line search implementation, since on those iterations when the step $\hat s$ is accepted the linear algebra costs are the same.

By both carefully organizing the computations and storing the appropriate information from one iteration to the next, one can show that the number of multiplications required to compute a step in the standard L-BFGS implementation is approximately $(4m + 2)n$ (see [4]). In a trust-region implementation initialized with $\lambda = 0$ at each iteration, the first trial step is identical to the standard L-BFGS step, so the number of multiplications required to compute this step is also $(4m + 2)n$. However, if this step is rejected and a trial step with $\lambda > 0$ needs to be computed, then greater costs are incurred. There are two ways to handle this extra cost. One approach is to update the matrices $S^T S$ and $L$ at each iteration. This incurs an additional $2mn$ multiplications per iteration, with an additional $2mn$ multiplications for each value of $\lambda > 0$ for which equation (3.3) is solved. In this approach, the total number of multiplications on iteration $k$ is $(6m + 2(t - 1)m + 1)n$, where $t$ is the number of times equation (3.3) is solved on iteration $k$. However, we do not advocate this approach due to the infrequency with which the trial step with $\lambda = 0$ is rejected. Instead, we suggest that the matrices $S^T S$ and $L$ be updated only when they are required for the computation of a trial step with $\lambda > 0$. This approach adds at most an additional $(m^2 + 2(t - 1)m)n$ multiplications to the basic $(4m + 2)n$ multiplications whenever $t > 1$. If the initial trial step with $\lambda = 0$ is rejected on a sequence of iterations, then the $m^2 n$ term disappears from all but the first iteration in this sequence, with subsequent iterations in this sequence incurring a cost of $(6m + 2(t - 1)m + 1)n$ multiplications. In all of this, one must keep in mind that $m$ is usually taken to be quite small, e.g. $m = 5$.

We now give concise descriptions of both the implicit and explicit trust-region implementations discussed above.
Algorithm 1: Explicit Trust-Region Updating

Initialization: Let $x_0 \in \mathbb{R}^n$, $g_0 = \nabla f(x_0)$, $\Delta_0 > 0$, $m = 0$, $\hat m \in \{1, 2, \dots, n\}$, $\kappa \in \mathbb{R}$, $0 < \omega < 1$, and let $S$, $Y$, and $H$ be as in (3.1).
Iteration: Obtain $x_{k+1}$ from $x_k$ as follows:
1. Let $s$ solve the trust-region subproblem $\hat P(g_k, H, 0)$.
2. If the ratio $r(s)$ given in (2.2) exceeds $\kappa$, set $x_{k+1} = x_k + s$ and go to 4; otherwise, set $\Delta = \omega \|s\|$.
3. Let $s$ solve the trust-region subproblem $P(g_k, H, \Delta)$, and return to 2.
4. Set $m = \min\{m + 1, \hat m\}$.
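The control flow of Algorithm 1 can be sketched as follows (our illustration, with a fixed dense model Hessian; the limited memory machinery, the memory update of Step 4, and the Newton iteration of Section 4 below are elided, Step 3 being done here by simple bisection on $\lambda$ so that the sketch is self-contained; the parameter defaults are hypothetical):

```python
import numpy as np

def algorithm1(f, grad, x, H, kappa=0.1, omega=0.2, tol=1e-8, max_iter=500):
    """Explicit trust-region updating with a fixed positive definite model
    Hessian H (our sketch of Algorithm 1)."""
    n = len(x)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        lam, s = 0.0, -np.linalg.solve(H, g)      # Step 1: solve P-hat(g,H,0)
        while True:
            pred = g @ s + 0.5 * s @ (H @ s)      # model decrease, cf. (2.2)
            if (f(x + s) - f(x)) / pred > kappa:  # Step 2: acceptance test
                x = x + s
                break
            Delta = omega * np.linalg.norm(s)     # Step 2: shrink the radius
            # Step 3: find lam with ||s(lam)|| = Delta (bisection stand-in
            # for the small-dimension Newton iteration of Section 4).
            lo, hi = lam, max(1.0, 2.0 * (lam + 1.0))
            while np.linalg.norm(np.linalg.solve(hi * np.eye(n) + H, g)) > Delta:
                hi *= 2.0
            for _ in range(60):
                lam = 0.5 * (lo + hi)
                s = -np.linalg.solve(lam * np.eye(n) + H, g)
                lo, hi = (lam, hi) if np.linalg.norm(s) > Delta else (lo, lam)
    return x

# e.g. on a convex quadratic with the exact Hessian as the model:
A = np.diag([1.0, 10.0])
x_star = algorithm1(lambda x: 0.5 * x @ A @ x, lambda x: A @ x,
                    np.array([3.0, -2.0]), H=A)
print(x_star)    # ~ [0, 0]
```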
Some care must be taken in the execution of Step 3 of this algorithm. Our implementation adapts the Newton iteration described in Moré and Sorensen [11] to the limited memory setting. A straightforward implementation of this procedure may require solving equation (3.3) for many values of $\lambda$. Such an implementation performs $2mn$ multiplications each time the solution $s_\lambda$ to (2.6) is formed. This is untenable. Fortunately, this problem can be avoided since it is possible to organize the computations so that the value $\lambda(\Delta)$ corresponding to $s(\Delta)$ can be found by computations performed entirely in the small dimension $2m$. The details of how this can be done in a more general framework are given in [3]. We now discuss these details for the special case of L-BFGS updating.

In Algorithm 1, we first test the unconstrained minimizer of the quadratic model $g_k^T s + \tfrac{1}{2} s^T H s$. Consequently, if this step is rejected, then, by Step 2 of the algorithm, the solution to the resulting trust-region subproblem necessarily lies on the boundary of the trust region. Thus, the solution $s(\Delta)$ to $P(g, H, \Delta)$ satisfies $\|s(\Delta)\| = \Delta$ with $\lambda(\Delta) > 0$ (in particular, the so-called hard case never occurs). Moré and Sorensen [11] locate the solution $s(\Delta)$ by using Newton's method to solve the equation $\phi(\lambda) = 0$, where
\[ \phi(\lambda) = \frac{1}{\Delta} - \frac{1}{\|s(\lambda)\|} \qquad\text{and}\qquad s(\lambda) = -(\lambda I + H)^{-1} g. \]
The choice of the function $\phi$ was proposed by Reinsch [16] in the context of smoothing by spline functions, where similar kinds of equations arise within a Lagrangian framework. This function is particularly suitable since Reinsch showed that it is both convex and nearly linear. The Newton iteration takes the form
(4.1)   $\lambda_+ = \lambda - \dfrac{\phi(\lambda)}{\phi'(\lambda)}$, where $\phi'(\lambda) = -\dfrac{g^T (\lambda I + H)^{-3} g}{\|s(\lambda)\|^3}$.
To compute these iterates we make use of the formulas
\[ (\lambda I + H)^{-1} = \frac{1}{\tau}\left[ I + \Psi (\tau \Gamma - \Psi^T \Psi)^{-1} \Psi^T \right] \]
and
\[ (\lambda I + H)^{-2} = \frac{1}{\tau^2}\left[ I + \Psi (\tau \Gamma - \Psi^T \Psi)^{-1} \Psi^T + \Psi (\tau \Gamma - \Psi^T \Psi)^{-1} (\tau \Gamma) (\tau \Gamma - \Psi^T \Psi)^{-1} \Psi^T \right], \]
where $\tau = \delta + \lambda$ (see [2] for a general expression for $(\lambda I + H)^{-k}$). Setting $v_0 = \Psi^T g$, $v_1 = (\tau \Gamma - \Psi^T \Psi)^{-1} v_0$, and $v_2 = (\tau \Gamma - \Psi^T \Psi)^{-1} (\tau \Gamma) v_1$, iteration (4.1) can be written as
\[ \lambda_+ = \lambda + \frac{\alpha}{\gamma}\left[ \frac{\sqrt{\alpha}}{\Delta} - \tau \right], \]
where
\[ \alpha = \tau^2 \|s(\lambda)\|^2 = v_1^T \Psi^T \Psi v_1 + 2 v_0^T v_1 + \|g\|^2 \qquad\text{and}\qquad \gamma = \tau^3 g^T (\lambda I + H)^{-3} g = \alpha + \left[ v_1^T \Psi^T \Psi v_2 + v_0^T v_2 \right]. \]
These observations yield the following implementation of Newton's method applied to the function $\phi$.
Newton Iteration: Let $H$ be as given in (3.1) and set $\Omega = \Psi^T \Psi$, $\beta = \|g_k\|$, and $v_0 = \Psi^T g_k$. Let $\varepsilon$ be the stopping tolerance.
1. Set $\tau = \delta + \lambda$.
2. Set $v_1 = (\tau \Gamma - \Omega)^{-1} v_0$.
3. Set $v_2 = (\tau \Gamma - \Omega)^{-1} (\tau \Gamma) v_1$.
4. Set $\alpha = v_1^T \Omega v_1 + 2 v_0^T v_1 + \beta^2$.
5. If $|\sqrt{\alpha} - \tau \Delta| < \varepsilon$, go to 9.
6. Set $\gamma = \alpha + [\,v_1^T \Omega v_2 + v_0^T v_2\,]$.
7. Set $\lambda_+ = \lambda + \frac{\alpha}{\gamma}\left[ \frac{\sqrt{\alpha}}{\Delta} - \tau \right]$.
8. If $\lambda_+ \ge 0$, set $\lambda = \lambda_+$; otherwise, set $\lambda = 0.2\,\tau$. Go to 1.
9. Set $s = -\frac{1}{\tau}(g + \Psi v_1)$.

We reiterate that once the matrix $\Omega$ and the vector $v_0$ are formed, the only operation involving the large dimension $n$ occurs when the final solution $s$ needs to be recovered from the lower dimensional variables in Step 9. The value $\beta$ is obtained as part of the formation of the matrix $\Omega$. Moreover, as $\lambda$ is varied, expression (3.5) shows that only operations in the small dimension are required to update the matrix $(\tau \Gamma - \Omega)$.
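The iteration translates directly into code. The following is our NumPy sketch (the Step 8 safeguard follows our reconstruction above, and the boundary case $\|s(0)\| > \Delta$ is assumed, as guaranteed by Step 2 of Algorithm 1):

```python
import numpy as np

def newton_multiplier(Psi, Gamma, delta, g, Delta, lam=0.0, eps=1e-10,
                      max_iter=50):
    """Small-dimension Newton iteration for lam with ||s(lam)|| = Delta,
    where H = delta*I - Psi Gamma^{-1} Psi^T.  After Omega and v0 are
    formed, only Step 9 touches the large dimension n."""
    Omega = Psi.T @ Psi                  # 2m x 2m, formed once
    beta = np.linalg.norm(g)
    v0 = Psi.T @ g
    for _ in range(max_iter):
        tau = delta + lam                                      # Step 1
        M = tau * Gamma - Omega
        v1 = np.linalg.solve(M, v0)                            # Step 2
        v2 = np.linalg.solve(M, tau * (Gamma @ v1))            # Step 3
        alpha = v1 @ (Omega @ v1) + 2.0 * (v0 @ v1) + beta**2  # Step 4
        if abs(np.sqrt(alpha) - tau * Delta) < eps:            # Step 5
            break
        gamma = alpha + v1 @ (Omega @ v2) + v0 @ v2            # Step 6
        lam_plus = lam + (alpha / gamma) * (np.sqrt(alpha) / Delta - tau)
        lam = lam_plus if lam_plus >= 0.0 else 0.2 * tau       # Steps 7-8
    return lam, -(g + Psi @ v1) / tau                          # Step 9

# Demo on pairs drawn from a convex quadratic (so that D > 0 and H is
# positive definite); the computed multiplier reproduces ||s|| = Delta.
rng = np.random.default_rng(3)
n, m = 30, 4
Aq = np.diag(rng.uniform(1.0, 5.0, n))
X = rng.standard_normal((n, m + 1))
S = np.diff(X, axis=1)
Y = Aq @ S
delta = (Y[:, -1] @ Y[:, -1]) / (S[:, -1] @ Y[:, -1])   # the scaling (3.2)
StY = S.T @ Y
Lm, Dm = np.tril(StY, k=-1), np.diag(np.diag(StY))
Psi = np.hstack([delta * S, Y])
Gamma = np.block([[delta * (S.T @ S), Lm], [Lm.T, -Dm]])
H = delta * np.eye(n) - Psi @ np.linalg.solve(Gamma, Psi.T)
g = rng.standard_normal(n)
Delta = 0.5 * np.linalg.norm(np.linalg.solve(H, g))     # force a boundary step
lam, s = newton_multiplier(Psi, Gamma, delta, g, Delta)
print(abs(np.linalg.norm(s) - Delta))                   # ~ 0
```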
Algorithm 2: Implicit Trust-Region Updating

Initialization: Let $x_0 \in \mathbb{R}^n$, $g_0 = \nabla f(x_0)$, $\omega > 0$, $m = 0$, $\hat m \in \{1, 2, \dots, n\}$, $\kappa \in \mathbb{R}$, $\lambda = 0$, and let $S$, $Y$, and $H$ be as in (3.1).
Iteration: Obtain $x_{k+1}$ from $x_k$ as follows:
1. Let $s$ solve the trust-region subproblem $\hat P(g_k, H, \lambda)$.
2. If the ratio $r(s)$ given in (2.2) exceeds $\kappa$, set $x_{k+1} = x_k + s$ and go to Step 3; otherwise, set $\lambda = \lambda + \omega(\delta + \lambda)$ and return to Step 1.
3. Set $m = \min\{m + 1, \hat m\}$.

In our limited experience with both the explicit and implicit strategies, we have found that the initial trial step taken as the solution to $\hat P(g_k, H, 0)$ is extremely good. Indeed, on the problems we have tested, this step is accepted more than 98% of the time with a value for the ratio $r(s)$ in (2.2) that is nearly 1 or exceeds 1. This is quite surprising given the very limited amount of second-order information that the L-BFGS update possesses. As discussed above, it was the great practical success of the straightforward L-BFGS update that prompted us to use this update as the initial trial step in our trust-region implementations. But it also suggested to us two further modifications that may be of some value. The first of these is that on some iterations we should trust the initial L-BFGS step even in cases when the ratio $r$ is slightly negative. Such a strategy is not new. In the literature on line search methods, this is called a non-monotone method [7, 22]. However, the approach has not yet been tested in the trust-region context. In our trial runs, we have implemented this strategy in the following way:
Non-Monotone Trust-Region Updating: Let $\ell \in \{1, 2, \dots\}$, $\kappa_1 < 0$, and $0 < \kappa_2 < 1$ be given.
1. Set $M_k = \max\{f(x_{k-\ell+1}), f(x_{k-\ell+2}), \dots, f(x_k)\}$ and define
\[ \hat r(s) = \frac{f(x_k + s) - M_k}{g_k^T s + \tfrac{1}{2} s^T H s}. \]
2. If $r(s) > \kappa_1$ and $\hat r(s) > \kappa_2$, then set $x_{k+1} = x_k + s$; otherwise, adjust the trust-region parameter and compute a new $s$.
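As an illustration (our sketch; the threshold defaults are hypothetical), the test of Step 2 might be implemented as follows. Note that $\hat r$ compares the new function value against the worst of the last $\ell$ values, so a step with a slightly negative $r$ can still be accepted:

```python
from collections import deque

def nonmonotone_accept(f_new, f_hist, pred, kappa1=-1.0, kappa2=0.1):
    """Step 2 of the non-monotone test: f_hist holds the last ell objective
    values f(x_{k-ell+1}), ..., f(x_k); pred = g^T s + 0.5 s^T H s < 0 is
    the model decrease appearing in (2.2)."""
    M_k = max(f_hist)
    r = (f_new - f_hist[-1]) / pred    # the usual ratio (2.2)
    r_hat = (f_new - M_k) / pred       # ratio against the worst recent value
    return r > kappa1 and r_hat > kappa2

# A step that increases f from 0.99 to 0.992 is still accepted, because it
# remains well below the worst recent value 1.00:
f_hist = deque([1.00, 0.98, 0.99], maxlen=3)          # ell = 3
print(nonmonotone_accept(0.992, f_hist, pred=-0.05))  # True
```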
The second modification concerns rejected steps. These steps contain important information about the behavior of the objective. Indeed, if we had had this information beforehand, then we would not have expended the effort required to compute the step and assess its quality. Nonetheless, having computed the step, we should incorporate this information into our local model of the objective. We do this by updating the Hessian approximation just as if we had accepted the step. That is, we update the Hessian using the rejected step $s$ and the change in the gradient $y = \nabla f(x_k + s) - \nabla f(x_k)$. The impact on the L-BFGS update is quite significant since the initial scaling parameter $\delta$ in (3.2) is also recomputed using these vectors. In order to fully exploit this new Hessian information, we do not adjust the trust-region parameter $\Delta$ (or $\lambda$) immediately. Rather, we allow two or three unsuccessful steps (updating the Hessian on each occasion) before we give up on this strategy and revert to adjusting the trust region. In addition, if it is ever the case that $s^T y \le 0$ on a rejected step, then we immediately revert to adjusting the trust region. This approach to updating the Hessian produced the most successful runs in our numerical experiments.

5. Some numerical results

We compare the performance of the trust-region updates against a Fortran limited memory BFGS (L-BFGS) implementation of the Liu and Nocedal algorithm [9] (with a line search by Moré and Thuente), as it comes with the Minpack-2 test problem collection [1] (by Averick, Carter, and Moré). The Minpack-2 implementation does not use the compact form of the BFGS update, but rather the original recursion formula by Nocedal [14]. We consider four unconstrained minimization problems from that collection on which L-BFGS is known to perform well (suggested by Moré), and two more unconstrained versions of constrained problems in that collection, on which L-BFGS also performs well. Our results on these selected problems using the trust-region updates are comparable to the results obtained with L-BFGS.

The unconstrained versions of constrained problems are: Elastic-Plastic Torsion (EPT) and Pressure in a Journal Bearing (PJB). The unconstrained problems are: (Enneper's) Minimal Surface Area (MSA), Optimal Design with Composite materials (ODC), Steady State Combustion (SSC), and homogeneous superconductors: 2-D Ginzburg-Landau (GL2). In all cases we used the default parameter values and computed for 2,500, 10,000, 40,000 and 160,000 variables (see Figure 1).

We tested several different implementations of the trust-region algorithms described in Section 4 against the Minpack-2 L-BFGS code. The results of 4 of these experiments are given in this section. The stopping criterion for all tests was to terminate once the function value equaled the value at which the Minpack-2 code terminated (to 8 significant digits). This change in $f_{\min}$ changes the initial line search, so the number of function evaluations for the modified Minpack-2 implementation differs from the numbers for the original implementation.
prob       n        par      iters   nfev   ngev       f           ||g||       time       task
-----------------------------------------------------------------------------------------------
DEPT      2500    0.5D+01      88      94     94   -0.4388D+00   0.8D-04   0.2336D+01    CONV
DEPT     10000    0.5D+01     176     187    187   -0.4392D+00   0.7D-04   0.2027D+02    CONV
DEPT     40000    0.5D+01     303     310    310   -0.4393D+00   0.2D-03   0.2383D+03    CONV
DEPT    160000    0.5D+01     656     677    677   -0.4393D+00   0.5D-04   0.2849D+04    CONV
DPJB      2500    0.1D+00     184     191    191   -0.2826D+00   0.1D-03   0.8086D+01    CONV
DPJB     10000    0.1D+00     383     401    401   -0.2828D+00   0.1D-03   0.7434D+02    CONV
DPJB     40000    0.1D+00     699     724    724   -0.2829D+00   0.1D-03   0.6904D+03    CONV
DPJB    160000    0.1D+00    1534    1582   1582   -0.2829D+00   0.2D-03   0.5065D+04    CONV
DMSA      2500    0.0D+00      71      74     74    0.1421D+01   0.3D-03   0.3125D+01    CONV
DMSA     10000    0.0D+00     161     169    169    0.1421D+01   0.2D-03   0.3013D+02    CONV
DMSA     40000    0.0D+00     350     361    361    0.1421D+01   0.2D-03   0.2806D+03    CONV
DMSA    160000    0.0D+00     812     826    826    0.1421D+01   0.2D-03   0.2755D+04    CONV
DODC      2500    0.8D-02     176     179    179   -0.1136D-01   0.4D-04   0.9523D+01    CONV
DODC     10000    0.8D-02     361     366    366   -0.1138D-01   0.2D-04   0.8141D+02    CONV
DODC     40000    0.8D-02     972     978    978   -0.1138D-01   0.2D-04   0.9423D+03    CONV
DODC    160000    0.8D-02    2020    2026   2026   -0.1138D-01   0.3D-04   0.8238D+04    CONV
DSSC      2500    0.1D+01     120     127    127   -0.1018D+01   0.2D-03   0.6469D+01    CONV
DSSC     10000    0.1D+01     256     267    267   -0.1018D+01   0.2D-03   0.5743D+02    CONV
DSSC     40000    0.1D+01     390     410    410   -0.1018D+01   0.1D-03   0.3827D+03    CONV
DSSC    160000    0.1D+01     858     881    881   -0.1018D+01   0.2D-03   0.1948D+04    CONV
DGL2      2500    0.2D+01     536     551    551    0.1623D+02   0.2D-02   0.1170D+02    CONV
DGL2     10000    0.2D+01     763     792    792    0.1623D+02   0.1D-02   0.7698D+02    CONV
DGL2     40000    0.2D+01    1541    1598   1598    0.1623D+02   0.1D-02   0.7503D+03    CONV
DGL2    160000    0.2D+01    2273    2373   2373    0.1623D+02   0.2D-02   0.5075D+04    CONV
Figure 1. The output from the Minpack-2 L-BFGS routine, modified to stop after a fixed function value is reached. The first column gives the name of the test problem, and the second column gives the number of variables. The third column gives the value of a problem-specific parameter; the next three columns give the number of iterations (accepted steps) and of function and gradient evaluations. Then follow the function value upon termination, the norm of the gradient upon termination, and the run time (in seconds). The last column has the status upon termination; for all problems the requested function value was reached.
Figures 2-5 show the performance of various trust-region implementations relative to the performance of the line search based code described in Figure 1. The 24 bars indicate the four sizes (2,500, 10,000, 40,000, 160,000) for each of the six test problems (EPT: 1-4, PJB: 5-8, MSA: 9-12, ODC: 13-16, SSC: 17-20, GL2: 21-24). To make the relative performance on the different problems visible (the number of function evaluations ranges from 70 to 2,400), the bars are scaled so that height 1 corresponds to the function evaluations needed by the Minpack-2 L-BFGS code, given in Figure 1. Runtime comparisons are not included because they could not be obtained with accuracy; we did not have exclusive access to the CPUs.

We have modified the Minpack-2 L-BFGS implementation to perform Algorithm 2 with non-monotone trust-region updating and using information at the rejected steps (see Figure 3).
[Figure 2: bar chart, panel title "Delta-Explicit Trust Region (Matlab)", kappa = 0.1.]

Figure 2. The number of function evaluations needed, relative to L-BFGS. Matlab implementation of Algorithm 1, $\kappa = 0.1$ (descent), $\omega = 0.2$. The horizontal axis has four problem sizes (2,500, 10,000, 40,000, 160,000) for each of the six test problems (EPT: 1-4, PJB: 5-8, MSA: 9-12, ODC: 13-16, SSC: 17-20, GL2: 21-24). Height one corresponds to the number of function evaluations for the problem and size as given in Figure 1.

[Figure 3: bar chart, panel title "Non-monotone Trust Region (Fortran)", kappa = -1.0.]

Figure 3. The number of function evaluations needed, relative to L-BFGS. Fortran implementation of Algorithm 2, $\kappa = -1$ (non-monotone). Hessians are updated even for rejected steps. The horizontal axis has four problem sizes (2,500, 10,000, 40,000, 160,000) for each of the six test problems (EPT: 1-4, PJB: 5-8, MSA: 9-12, ODC: 13-16, SSC: 17-20, GL2: 21-24). Height one corresponds to the number of function evaluations for the problem and size as given in Figure 1.
We have also written Matlab code based on the compact form in [4]. Function and gradient evaluations are performed by the original Fortran code via MEX files.

Figure 2 displays the results for a strict descent version of Algorithm 1 (explicit trust-region updating), while Figures 3-5 are based on various implementations of Algorithm 2 (implicit trust-region updating). The results of these experiments are quite sensitive to machine arithmetic. Compiling and running the original L-BFGS code on DEC ALPHA and HP 700 computers produced different results, which also differ from the numbers reported for SUN SPARC workstations by Moré. We report only results for DEC ALPHA, which seemed consistently better than those for HP 700. We see that the performance is comparable to L-BFGS, especially if we take into account the variations in these numbers introduced by running the problems on different machines or with a different form of the Hessian update.
[Figure 4: bar chart, panel title "Implicit Trust Region (Matlab)", kappa = 0.01.]

Figure 4. The number of function evaluations needed, relative to L-BFGS. Matlab implementation of Algorithm 2, $\kappa = 0.01$ (descent), $\omega = 0.2$. The horizontal axis has four problem sizes (2,500, 10,000, 40,000, 160,000) for each of the six test problems (EPT: 1-4, PJB: 5-8, MSA: 9-12, ODC: 13-16, SSC: 17-20, GL2: 21-24). Height one corresponds to the number of function evaluations for the problem and size as given in Figure 1.

[Figure 5: bar chart, panel title "Trust Region using rejected step information (Matlab)", kappa = 0.6.]

Figure 5. The number of function evaluations needed, relative to L-BFGS. Matlab implementation of Algorithm 2, $\kappa = 0.6$ (descent), $\omega = 0.2$. Hessians are updated for rejected steps. The horizontal axis has four problem sizes (2,500, 10,000, 40,000, 160,000) for each of the six test problems (EPT: 1-4, PJB: 5-8, MSA: 9-12, ODC: 13-16, SSC: 17-20, GL2: 21-24). Height one corresponds to the number of function evaluations for the problem and size as given in Figure 1.
We observed some consistent differences between the Fortran implementations, which use the recursive form of the L-BFGS update, and the Matlab implementations, which use the compact form of the L-BFGS update. First, our Matlab line search version could not be tuned to perform as well as the Fortran version, possibly because our Matlab line search routine differs from the Fortran routine. On the other hand, the Matlab implementations of the overall code performed consistently better than comparable Fortran implementations (in terms of function evaluations), hinting at a possible advantage of the compact Byrd-Nocedal-Schnabel formulae over Nocedal's recursive formulae. In addition, further numerical experimentation has shown that for larger (especially positive) values of $\kappa$ the performance of the Fortran trust-region implementation degrades, while the Matlab (compact form) implementation performs well. To illustrate this, observe that Figure 5 displays the results from a successful run; however, when
this version of Algorithm 2 is run using the Fortran code, the performance is so poor that the scale used to display the ratios in Figures 2-5 becomes entirely inappropriate. On the other hand, when the implementation of Algorithm 2 described in Figure 3 is run in Matlab, the performance is comparable to that displayed in Figure 5.

6. Conclusion

In this note, we have shown how to use limited memory BFGS updating in a trust-region context. Explicit and implicit trust-region methods are described, and numerical experiments on six MINPACK-2 test problems for problem sizes between 2,500 and 160,000 variables are presented. These experiments indicate that it is possible to devise trust-region algorithms that can compete in performance with the line search version of the L-BFGS update, in this case with the implementation of the Liu-Nocedal algorithm provided in Minpack-2. However, our limited experience indicates that the key to the success of L-BFGS methods is the choice of scaling (3.2) and not whether the method is implemented within a line search or trust-region framework.

References

[1] B.M. Averick, R.G. Carter, and J.J. Moré. The MINPACK-2 Test Problem Collection (preliminary version). Technical Report TM-150, Mathematics and Computer Science Division, Argonne National Laboratory, 1991.
[2] J.V. Burke. Sherman-Morrison-Woodbury formula for powers of the inverse. Preprint, 1996.
[3] J.V. Burke and A. Wiegmann. Low dimensional quasi-Newton updating strategies for large-scale unconstrained optimization. Preprint, 1996.
[4] R.H. Byrd, J. Nocedal, and R.B. Schnabel. Representations of quasi-Newton matrices and their use in limited memory methods. Math. Prog., 63:129-156, 1994.
[5] J.C. Gilbert and C. Lemaréchal. Some numerical experiments with variable storage quasi-Newton algorithms. Math. Prog., 45:407-435, 1989.
[6] S.M. Goldfeld, R.E. Quandt, and H.F. Trotter. Maximization by quadratic hill-climbing. Econometrica, 34:541-551, 1966.
[7] L. Grippo, F. Lampariello, and S. Lucidi. A non-monotone line search technique for Newton's method. SIAM J. Numer. Anal., 23:707-716, 1986.
[8] K. Levenberg. A method for the solution of certain nonlinear problems in least squares. Quart. Appl. Math., 2:164-168, 1944.
[9] D.C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Math. Prog., 45:503-528, 1989.
[10] D.W. Marquardt. An algorithm for least squares estimation of nonlinear parameters. SIAM J. Appl. Math., 11:431-441, 1963.
[11] J.J. Moré and D.C. Sorensen. Computing a trust region step. SIAM J. Sci. Stat. Comput., 4:553-572, 1983.
[12] D.D. Morrison. Methods for least squares problems with convergence proofs, tracking programs and orbit determination. In Tracking Programs and Orbit Determination; Seminar Proceedings, pages 1-9. Jet Propulsion Laboratory, 1960.
[13] S.G. Nash and J. Nocedal. A numerical study of the limited memory BFGS method and the truncated Newton method for large scale optimization. SIAM J. Optimization, 1:358-372, 1991.
[14] J. Nocedal. Updating quasi-Newton matrices with limited storage. Math. Comp., 35:773-782, 1980.
[15] S. Oren and E. Spedicato. Optimal conditioning of self scaling variable metric algorithms. Math. Programming, 10:70-90, 1976.
[16] C.H. Reinsch. Smoothing by spline functions. II. Numer. Math., 16:451-454, 1971.
[17] F. Rendl and H. Wolkowicz. A semidefinite framework for trust region subproblems with application to large scale minimization. Technical Report CORR 94-32, Department of Combinatorics and Optimization, University of Waterloo, 1994.
[18] S.A. Santos and D.C. Sorensen. A new matrix-free algorithm for the large-scale trust-region subproblem. Technical Report TR95-20, Department of Computational and Applied Mathematics, Rice University, 1995.
[19] D.F. Shanno and K.H. Phua. Matrix conditioning and nonlinear optimization. Math. Programming, 14:149-160, 1978.
[20] D.C. Sorensen. Newton's method with a model trust region modification. SIAM J. Numer. Anal., 19(2):409-426, 1982.
[21] D.C. Sorensen. Minimization of large-scale quadratic functions subject to an ellipsoidal constraint. Technical Report TR94-27, Department of Computational and Applied Mathematics, Rice University, 1994.
[22] P.L. Toint. An assessment of nonmonotone linesearch techniques for unconstrained optimization. Technical Report 94-14, Department of Mathematics, Facultés Universitaires N.D. de la Paix, B-5000 Namur, Belgium, 1994.
[23] Y. Ye. A new complexity result on minimization of a quadratic function with a sphere constraint. In C.A. Floudas and P.M. Pardalos, editors, Recent Advances in Global Optimization, Princeton Series in Computer Science, pages 19-31. Princeton University Press, 1992.
[24] X. Zou, I.M. Navon, M. Berger, K.H. Phua, T. Schlick, and F.X. Le Dimet. Numerical experience with limited-memory quasi-Newton and truncated Newton methods. SIAM J. Optimization, 3:582-608, 1993.

(J.V. Burke, A. Wiegmann) University of Washington, Dept. of Mathematics, Box 354350, Seattle, WA 98195