Mathematical Programming 7 (1974) 311-350. North-Holland Publishing Company
NEWTON-TYPE METHODS FOR UNCONSTRAINED AND LINEARLY CONSTRAINED OPTIMIZATION

Philip E. GILL and Walter MURRAY
National Physical Laboratory, Teddington, Middlesex, England

Received 11 December 1972
Revised manuscript received 26 July 1974
This paper describes two numerically stable methods for unconstrained optimization and their generalization when linear inequality constraints are added. The difference between the two methods is simply that one requires the Hessian matrix explicitly and the other does not. The methods are intimately based on the recurrence of matrix factorizations and are linked to earlier work on quasi-Newton methods and quadratic programming.
1. Introduction and notation

In this paper we discuss the numerical solution of the following two optimization problems:

P:    minimize F(x),  x ∈ E^n;

LCP:  minimize F(x),  x ∈ E^n,  subject to A^T x ≥ b,
where A^T is an m × n matrix and b an m × 1 vector. A recent report [12] includes ALGOL 60 procedures to solve P together with the results of applying the procedures to a comprehensive set of test problems. A similar report containing procedures to solve LCP will appear shortly [13]. All the methods described assume the availability of a subroutine to compute the function value and gradient vector. In addition, one set of algorithms requires the evaluation of the n × n Hessian matrix. Parts of the algorithms given here are common to those given by Gill and Murray
[7-10] and Murray [20], and for this reason are readily incorporated into a large-scale mathematical programming system. The methods are numerically stable (as are those given in the references) and can be adapted to take full advantage of situations in which the Hessian matrix has a large number of zero elements.

We shall often write F^(k), g^(k) and G^(k) for the function value, gradient vector and Hessian matrix at x^(k). The transpose of the matrix A will be written as A^T. The elements of any matrix A will be written as a_ij and elements of vectors x as x_j. If x is an n-vector and A an n × n matrix, then the max-norms of x and A are defined as

||x||_∞ = max_{1≤i≤n} |x_i|,    ||A||_∞ = max_{1≤i≤n} Σ_{j=1}^n |a_ij|.

... > 0, since its corresponding diagonal element has to be adjusted to force d_q l_rq² = β². It follows from this that even if φ_s = 0, p^T G^(k) p < 0.
2.4. MNA - a modified Newton algorithm

The k-th step of the basic iteration for performing unconstrained optimization is as follows.
Step I: Calculate G^(k), the Hessian matrix of F(x) at x^(k).

Step II: Form the modified Cholesky factorization of G^(k), which gives

L^(k) D^(k) (L^(k))^T = G^(k) + E^(k).

Step III: If ||g^(k)||_2 ≤ ε, where ε is a small scalar (ε ≥ 0), and ||E^(k)||_∞ = 0, then x^(k) is an adequate approximation to a strong local minimum x* and the algorithm is terminated. (In effect ε defines a neighbourhood, all points of which are regarded as equivalent to x*.) If ||g^(k)||_2 > ε, then p^(k) is determined by solving the equations

L^(k) D^(k) (L^(k))^T p^(k) = -g^(k).

If ||g^(k)|| ... > 0, where N(x*, μ) is the neighbourhood

N(x*, μ) = {x : ||x - x*|| ≤ μ, x ≠ x*}.

These algorithms require the specification of the parameters η, ε and λ. The value of η affects only the efficiency of the algorithms, but ε and λ are relevant to the proof of convergence.
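As an illustration of steps II and III, the fragment below is a simplified Python sketch, not the ALGOL 60 procedures of [12]: the modified Cholesky shown here only forces each pivot d_j to be positive, whereas the factorization used by MNA also bounds the off-diagonal growth (the role of the quantities δ^(k) and β^(k) referred to in Lemma 2.6.1 below); the function names are our own.

```python
import numpy as np

def modified_cholesky(G, delta=1e-8):
    """Compute unit lower-triangular L, positive diagonal d and diagonal
    E >= 0 with L diag(d) L' = G + E.  Simplified: each pivot is merely
    forced up to at least delta; the full algorithm also bounds the
    off-diagonal growth."""
    n = G.shape[0]
    L = np.eye(n)
    d = np.zeros(n)
    E = np.zeros(n)
    for j in range(n):
        c_jj = G[j, j] - np.dot(d[:j] * L[j, :j], L[j, :j])
        d[j] = max(abs(c_jj), delta)          # modified pivot d_j > 0
        E[j] = d[j] - c_jj                    # diagonal correction
        for i in range(j + 1, n):
            c_ij = G[i, j] - np.dot(d[:j] * L[i, :j], L[j, :j])
            L[i, j] = c_ij / d[j]
    return L, d, E

def modified_newton_direction(G, g):
    """Step III: solve L D L' p = -g for the direction of search."""
    L, d, E = modified_cholesky(G)
    y = np.linalg.solve(L, -g)                # forward substitution
    p = np.linalg.solve(L.T, y / d)           # scale by D^{-1}, back-substitute
    return p, E
```

Because d is kept positive, p satisfies g^T p < 0 whenever g ≠ 0, so it is a descent direction even when G^(k) itself is indefinite.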
Let F : Γ ⊂ E^n → E^1 be defined on the open set Γ. Let Ω(t) denote the closure of the level set, Ω(t) = {x : x ∈ Γ, F(x) < t}, and Co[Ω] the closed convex hull of Ω.

Lemma 2.6.1. Let {G^(k)}_{k=0}^∞ be a sequence of symmetric matrices and ρ a positive scalar such that

||G^(k)||_∞ ≤ ρ    for all k.
Let L^(k), D^(k) and E^(k) denote the factorization of G^(k) obtained using the modified Cholesky algorithm, with δ^(k) and β^(k) chosen according to (7) and (12). Then
x^T Ḡ^(k) x ≥ ν ||x||²    for all x ∈ E^n,

where Ḡ^(k) = G^(k) + E^(k) = L^(k) D^(k) (L^(k))^T and ν is a fixed positive constant.

Proof. If x is any non-zero vector, then

x^T Ḡ^(k) x = x^T L^(k) D^(k) (L^(k))^T x ≥ d_min^(k) (σ^(k))² ||x||²,

where σ^(k) is the smallest singular value of the matrix L^(k) and

d_min^(k) = min_{1≤i≤n} d_i^(k) ≥ 2^(-t).

This bound, together with the bound on |l_rj| d_j^(1/2), implies that the elements of L^(k) are uniformly bounded and consequently each L^(k) is non-singular with σ^(k) ≥ σ for all k, where σ is a fixed positive constant. This gives the result with ν = 2^(-t) σ².
The proof of the following lemma is given by Ortega and Rheinboldt [22, pp. 475-476].

Lemma 2.6.2. Let S be the set of stationary points of F(x),

S = {x̄ : x̄ ∈ Ω, g(x̄) = 0},

and assume that S is finite. If {x^(k)} is any sequence such that

lim_{k→∞} ||x^(k+1) - x^(k)|| = 0    and    lim_{k→∞} ||g(x^(k))|| = 0,

then lim_{k→∞} x^(k) = x̄, where x̄ ∈ S.
Theorem 2.6.1. Let the function F(x) be continuously differentiable for all x ∈ Γ and let S, the set of stationary points of F(x), be finite. Let x^(0) be chosen such that:
(i) Ω(F^(0)) is closed and bounded,
(ii) Co[Ω(F^(0))] ⊂ Γ,
(iii) the Hessian matrix G(x) is such that ||G(x)|| ≤ ...

... ≥ ν ||p^(k)|| ≥ (ν/ρ) ||g^(k)||.

On the few occasions when (13b) holds, a similar analysis gives

-(g^(k))^T p^(k) ≥ (μ^(k)/ρ²) ||g^(k)||²_2.
Clearly, in either case, lim_{k→∞} {F^(k) - F^(k+1)} = 0 implies lim_{k→∞} ||g^(k)|| = 0.
From the definition of x^(k+1) we have

||x^(k+1) - x^(k)|| = α^(k) ||p^(k)|| ≤ λ ||(L^(k))^(-1)||² ||(D^(k))^(-1)|| ||g^(k)||,

where λ is the scalar defined in step IV of the basic iteration. The modified Cholesky algorithm gives uniform upper bounds upon (L^(k))^(-1) and (D^(k))^(-1), which implies that lim_{k→∞} ||x^(k+1) - x^(k)|| = 0.
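The middle inequality is just the definition of p^(k) from step III written out; in LaTeX form (our expansion, with 2-norms throughout and using ||(L^(k))^{-T}|| = ||(L^(k))^{-1}||):

\[
\|p^{(k)}\| = \bigl\|(L^{(k)})^{-\mathsf T}(D^{(k)})^{-1}(L^{(k)})^{-1}g^{(k)}\bigr\|
\le \|(L^{(k)})^{-1}\|^{2}\,\|(D^{(k)})^{-1}\|\,\|g^{(k)}\|,
\]

so that with \(\alpha^{(k)} \le \lambda\) (step IV) the displayed bound on \(\|x^{(k+1)}-x^{(k)}\|\) follows.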
Lemma 2.6.2 now proves the result.

Theorem 2.6.2. Let ε be the parameter defined in step III of the basic iteration. Under the same conditions as those stated for Theorem 2.6.1, the sequence of points {x^(k)(ε)}_{k=0}^∞ generated by MNA is such that

lim_{ε→0} lim_{k→∞} x^(k)(ε) = x*,

where x* is a strong local minimum of P, provided the Hessian matrix is not positive semi-definite at any of the stationary points x̄ ∈ S.

Proof. It can be established as in Theorem 2.6.1 that a point, say x^(k₁)(ε), will ultimately be determined which is in a small neighbourhood of a point, say x̄^(1) ∈ S.
If G^(k₁) is positive definite, then x^(k₁) is in a small neighbourhood of a strong local minimum of P. Suppose, therefore, that G^(k₁) is indefinite and define

y = x^(k₁) - x̄^(1);

then

F(x̄^(1) + y) = F(x^(k₁)) = F(x̄^(1)) + ½ y^T G^(1) y + O(||y||³).

We have

F(x^(k₁)) - F(x̄^(1)) = χ(ε),

where, from Theorem 2.6.1, lim_{ε→0} χ(ε) = 0.
Since the direction of search p^(k) determined by the alternative search procedure of step III is always a descent direction, even when ||g(x^(k₁))|| = 0, we must have

F^(k₁) - F^(k₁+1) = τ(ε),

where τ(ε) > 0 for all ε ≥ 0. Clearly, if ε is sufficiently small, then τ(ε) > χ(ε) and F(x̄^(1)) - F^(k₁+1) > 0. Since the set S is finite, a point will eventually be reached in the neighbourhood of a strong local minimum.

A similar theorem cannot be proved for MNADIFF because it is not always possible to determine from a finite-difference approximation whether or not the actual Hessian matrix is positive definite. Also, the alternative search procedure cannot guarantee that p^(k) is always a descent direction. However, if κ is the maximum condition number of all the Hessian matrices for x̄ ∈ S, then, provided κ is not too large, Theorem 2.6.2 also holds for MNADIFF. This is because the truncation error of the finite-difference approximation will not be sufficient to perturb the alternative search direction from being a descent direction or to change the result of the test for positive definiteness.

2.7. Rate of convergence

If the sequence {x^(k)} ultimately lies in a region for which the Hessian matrix is uniformly positive definite, then the modified Newton algorithm is identical to Newton's method. In this case there exists a
sequence {α^(k)} such that {x^(k)} converges to a strong local minimum, and there is a k₀ such that α^(k) = 1 for k ≥ k₀, with the rate of convergence superlinear. For this reason the value α₀ = 1 is chosen as an initial estimate of the steplength, and the results of the modified Newton algorithm with η = 0.9 will simulate the usual undamped Newton method without exhibiting its defects.

2.8. General comments
The performance of the Newton-type methods presented in Sections 2.4 and 2.5 depends upon the three parameters
λ: the bound on the steplength,
η: the tolerance on the reduction of the projected gradient,
ε: the final accuracy.
The parameter λ should be given the value λ = λ₀/(||p^(k)||₂ + 2^(-t)) at each iteration, where λ₀ is a preassigned constant. This choice leads to the uniform bound α^(k) ≤ λ₀ 2^t for all k (as required in Theorem 2.6.1) and enables ||x^(k+1) - x^(k)||₂ to be restricted in magnitude during early iterations to prevent possible overflow during the computation of the objective function.

The number of iterations required to locate a strong local minimum with an "exact" univariate search along every p^(k) (this implies setting η = 0) is generally smaller than with a "slack" linear search (η = 0.9), although the overall number of function evaluations is greater. Thus the value of η can be chosen to vary the number of iterations and, in particular, smaller values of η should be used when a Hessian evaluation is significantly more expensive than a function or gradient evaluation. There are two alternative versions of the steplength algorithm (see [11]); the first requires gradients and the second only function values. The second alternative should be used if the cost of computing a gradient is large compared with the cost of computing a function value.

Even with the availability of the second derivative matrix and a convergent, numerically stable algorithm, the question still arises as to whether a quasi-Newton algorithm, which generates approximations to the Hessian matrix, would be preferable. For a given value of x the numerical computation of the Hessian matrix can often be costly, and the initial programming required is not insignificant even for modest-size problems. Those who have access to a program which can analytically differentiate a function are fortunate, but its use initially involves an even greater amount of sophisticated coding. There are, however, a number of advantages.
(i) It is relatively easy to check the necessary and sufficient conditions which determine whether a point is an adequate approximation to x*.
(ii) Compared with quasi-Newton methods, fewer iterations are usually required to find the solution.
(iii) The steplength α^(k) and the performance of the algorithm are less sensitive to the choice of η than in quasi-Newton methods. Moreover, for a given η a satisfactory α^(k) can be determined with fewer function evaluations than are required by a quasi-Newton method: the initial estimate α^(k) = 1 is usually close to the optimum for modified Newton (see Section 2.7), but this is not the case for quasi-Newton methods.
(iv) On practical problems the algorithm is far more robust and will often be successful where a quasi-Newton method will fail.
(v) Convergence can be proven for a large class of functions.

When the algorithms MNA and MNADIFF are applied to a problem where G is a banded matrix, advantage can be taken of the fact that the Cholesky factors are band matrices: the direction of search can then be obtained in order m²n operations, where m is the bandwidth. In this case, the discrete modified Newton algorithm has clear superiority over quasi-Newton methods although the same gradient information is used. A modification similar to that suggested by Curtis, Powell and Reid [4] can be used to reduce the number of gradient evaluations. Further details of the algorithm for banded Hessian matrices can be found in [9]. Even if the Hessian matrix is dense, the finite-difference algorithm behaves almost identically to MNA (see [12]). (This offers the possibility of a mixture of analytical second derivatives for some variables and finite differences for the remainder.)
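As an illustration of the kind of approximation MNADIFF relies on, the following sketch builds a dense Hessian estimate by forward-differencing the gradient; the interval h and the final symmetrization are our own illustrative choices, not the prescription of [11] or [12], and for a banded Hessian far fewer gradient evaluations are needed, as in [4].

```python
import numpy as np

def fd_hessian(grad, x, h=None):
    """Approximate the Hessian at x from n extra gradient evaluations:
    column j is (grad(x + h_j e_j) - grad(x)) / h_j."""
    n = x.size
    if h is None:
        h = np.sqrt(np.finfo(float).eps) * np.maximum(1.0, np.abs(x))
    g0 = grad(x)
    G = np.empty((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h[j]
        G[:, j] = (grad(x + e) - g0) / h[j]
    return 0.5 * (G + G.T)           # symmetrize the estimate
```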
PART II

3. Newton-type methods for the linearly constrained optimization problem
3.1. Linearly constrained optimization

A constraint a_i^T x ≥ b_i will be termed active at a point x if it is satisfied exactly at the point x, i.e., a_i^T x = b_i.
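In code, the active set can be pictured as below; the tolerance tol is our own addition (in exact arithmetic the test would be a_i^T x = b_i), and the routine assumes the columns of A hold the constraint normals a_i.

```python
import numpy as np

def active_set(A, b, x, tol=1e-10):
    """Indices i with a_i' x = b_i (to within tol), where A is n x m and
    its i-th column is the normal a_i of the constraint a_i' x >= b_i."""
    residuals = A.T @ x - b          # r_i = a_i' x - b_i >= 0 if x is feasible
    return [i for i in range(b.size) if abs(residuals[i]) <= tol]
```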
The determination of a strong local minimum x* in the linearly constrained problem involves the solution of two subproblems:
(i) the identification of the set of constraints which are active at the solution x*;
(ii) the determination of the minimum of F(x) over a set of constraints which are temporarily active.
In order to clarify the nature of these subproblems we introduce some additional notation. We define J(x) as the set of indices of constraints active at the point x. This implies that if j_s ∈ J(x), then a_{j_s}^T x = b_{j_s}. We shall denote by (A^(k))^T the t × n matrix of constraints active at the k-th approximation x^(k) to a strong local minimum x* of LCP, i.e.,
A^(k) = [a_{j₁}, ..., a_{j_t}],    j_s ∈ J^(k),
where J^(k) denotes J(x^(k)), the set of indices of constraints active at x^(k). We define the linear subspace, which is the null space of (A^(k))^T,

M₀^(k)(x) = {x : a_j^T x = 0, j ∈ J^(k)}

and its associated linear manifold

M^(k)(x) = {x : a_j^T x = b_j, j ∈ J^(k)}.

We will now show that the subproblem (ii) can be viewed as an unconstrained minimization over the manifold M^(k)(x). This unconstrained problem, which will be solved using methods outlined in Part I of this paper, has associated with it an (n-t) × (n-t) Hessian matrix of second derivatives G_A and gradient vector g_A of order n-t. If at any stage of the unconstrained minimization an approximation to the minimum on
the manifold lies outside the region

Π^(k) = M^(k)(x) ∩ Π,

where Π is the polyhedron of all feasible points, Π = {x : A^T x ≥ b}, then M^(k)(x) is redefined to include any new constraints encountered. If a feasible point x^(k) is known, the equality-constrained problem

minimize F(x),  subject to (A^(k))^T x = b^(k),

is equivalent to the problem

minimize_{p ∈ M₀^(k)(x)} F(x^(k) + p).    (14)
We define Z^(k) to be a matrix whose columns are mutually orthogonal and span M₀^(k)(x). Thus Z^(k) has dimension n × (n-t) and is such that

(Z^(k))^T Z^(k) = I_{n-t},    (A^(k))^T Z^(k) = 0.

The problem defined at (14) can be redefined as

minimize_{v ∈ E^(n-t)} F(x^(k) + Z^(k) v).

This is now an unconstrained problem in the (n-t)-vector v, and the methods already given in Part I for the unconstrained problem can be applied. When evaluating first and second derivatives with respect to the variables v we see that the gradient vector is given by (Z^(k))^T g^(k). This vector can be interpreted as the projection of the gradient on the subspace of active constraints M₀^(k)(x) and will be denoted by g_A(x^(k)) or, equivalently, g_A^(k). Similarly, the (n-t) × (n-t) Hessian matrix with respect to v is given by (Z^(k))^T G^(k) Z^(k). This will be defined as the projected Hessian matrix at x^(k), and written as G_A^(k). The matrix Z^(k) can be recurred separately or as part of some factorization of the matrix of active constraints. In Section 4 we shall discuss methods for computing and updating Z^(k) which are based on the superior numerical properties of orthogonal matrices.

Having characterized the subproblem (ii), we must be able to ascertain whether a particular set of constraints temporarily regarded as equalities are those active at the solution and, if not, which of them must be dropped from the basis in order to obtain a lower point.
In both cases we can make use of the conditions on LCP for the existence of a strong local minimum. Let Ā^T x̄ = b̄ be the subset of constraints active at the point x̄ and Z̄ the matrix whose columns span the null space of Ā^T. The point x̄ is said to be a stationary point of LCP if x̄ is feasible and g_A(x̄) = 0. We shall make the following assumptions about LCP.
(a1) The matrix A satisfies the Haar condition (that is, all n × n matrices formed by selecting n columns from A are non-singular).
(a2) The projected Hessian matrix G_A is non-singular at all stationary points of LCP.
(a3) The set S of all stationary points of LCP is finite.
(a4) For all x̄ ∈ S, the smallest Kuhn-Tucker multiplier is non-zero.
It is possible to allow for zero multipliers in practice, but assumption (a4) has been made in order to clarify the description of a typical iteration.

Let A* be the matrix of constraints active at x*, and Z* the matrix spanning the null space of (A*)^T. Under the assumptions just given we state without proof the following conditions necessary for x* to be a strong local minimum of LCP.
(c1) There exists a vector of Kuhn-Tucker multipliers u such that

A* u = g(x*),

with every element of u positive.
(c2) The projected gradient (Z*)^T g at x* is zero, i.e., g_A(x*) = 0.
(c3) The projected Hessian matrix G_A evaluated at x* is positive definite.

The multipliers u indicate whether the set J^(k) characterises the constraints active at the solution: a negative multiplier implies that the function can be decreased when the associated constraint is deleted from the basis. In the basic iteration described in the following section a constraint will be dropped from the active set only if a point has been found in the neighbourhood of a minimum on the relevant manifold. (However, this is not the only possible strategy and in Section 7 we shall discuss some alternatives.) Generally, during the course of the iteration the step will violate a constraint and this constraint will then be added to the basis. Ultimately, enough constraints will be added to give a positive-definite projected Hessian matrix.
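The quantities introduced above are easy to picture with a small numpy sketch. This is only an illustration of the definitions: Z is built here from a dense QR factorization of the active-constraint matrix (one of the options discussed in Section 4), and the multiplier estimates are obtained from the first-order condition A* u ≈ g by least squares rather than by the L D L^T equations used in LCMNA below; the function names are our own.

```python
import numpy as np

def null_space_basis(A_active):
    """Orthonormal Z with A_active.T @ Z = 0; A_active is n x t with
    linearly independent columns (the active constraint normals)."""
    n, t = A_active.shape
    Q, _ = np.linalg.qr(A_active, mode='complete')
    return Q[:, t:]                          # last n - t columns of Q

def projected_quantities(G, g, A_active):
    """Projected gradient g_A = Z' g and projected Hessian G_A = Z' G Z."""
    Z = null_space_basis(A_active)
    return Z, Z.T @ g, Z.T @ G @ Z

def multiplier_estimates(g, A_active):
    """Least-squares estimates of the Kuhn-Tucker multipliers in A u = g."""
    u, *_ = np.linalg.lstsq(A_active, g, rcond=None)
    return u
```

Conditions (c1)-(c3) can then be checked numerically: all entries of u positive, ||Z^T g|| small, and Z^T G Z positive definite.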
3.2. LCMNA - a linearly constrained modified Newton algorithm

At the beginning of the k-th step of the basic iteration the following matrices and vectors are stored:
(I) the current approximation to the solution, x^(k);
(II) an n × t matrix A^(k) of active constraint coefficients;
(III) an n × (m-t) matrix Ā^(k) of inactive constraint coefficients;
(IV) a t × 1 vector b^(k) of right-hand sides corresponding to the set of active constraints;
(V) an (m-t) × 1 vector b̄^(k) of right-hand sides corresponding to the set of inactive constraints;
(VI) either a t × t unit lower-triangular matrix L^(k) and diagonal matrix D^(k) such that (A^(k))^T A^(k) = L^(k) D^(k) (L^(k))^T, or a t × t upper-triangular matrix R^(k) and an orthogonal matrix Q^(k) such that (see Section 4)

Q^(k) A^(k) = [ R^(k) ]
              [   0   ] ;

(VII) an n × (n-t) matrix Z^(k) whose columns span the linear subspace M₀^(k)(x) and is such that (Z^(k))^T Z^(k) = I. (It will be shown in Section 4 that if Q^(k) is stored, then Z^(k) can be obtained as the last n-t columns of (Q^(k))^T.)
(VIII) the vector of residuals of the inactive constraints, defined by

r^(k) = (Ā^(k))^T x^(k) - b̄^(k)

with r_j^(k) > 0 for j = 1, 2, 3, ..., m-t;
(IX) the gradient vector g^(k) and function value F^(k).

The k-th step of the basic iteration is as follows.
Step I: Calculate the Hessian matrix G^(k) and form the projected Hessian matrix G_A^(k), where

G_A^(k) = (Z^(k))^T G^(k) Z^(k).

Step II: Form the modified Cholesky factorization of G_A^(k). The factors will satisfy the relationship

L_A^(k) D_A^(k) (L_A^(k))^T = G_A^(k) + E_A^(k) = Ḡ_A^(k),

where L_A^(k) is an (n-t) × (n-t) unit lower-triangular matrix and D_A^(k) and E_A^(k) are diagonal matrices.
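Steps I-III of LCMNA then amount to a modified Newton step in the v variables. A hedged sketch, reusing the illustrative modified_cholesky and null_space_basis routines from the earlier sketches (which stand in for the factorizations actually recurred in Sections 4 and 5):

```python
import numpy as np

def lcmna_direction(G, g, A_active, eps=1e-8):
    """Return p = Z p_A from steps I-III, or None when ||g_A|| <= eps
    (in which case step IV of the text applies)."""
    Z = null_space_basis(A_active)        # spans the null space of (A^(k))^T
    g_A = Z.T @ g                         # projected gradient
    G_A = Z.T @ G @ Z                     # projected Hessian
    if np.linalg.norm(g_A) <= eps:
        return None
    L, d, _ = modified_cholesky(G_A)      # L_A D_A L_A' = G_A + E_A
    y = np.linalg.solve(L, -g_A)          # forward substitution
    p_A = np.linalg.solve(L.T, y / d)     # D_A^{-1}, then back substitution
    return Z @ p_A                        # search direction in the x variables
```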
Step III: If ||g_A^(k)||₂ ≤ ε, where ε is a pre-assigned small positive scalar, proceed to step IV. Otherwise the direction of search p^(k) is determined by solving the constrained modified Newton equations

L_A^(k) D_A^(k) (L_A^(k))^T p_A^(k) = -g_A^(k)

and setting p^(k) = Z^(k) p_A^(k). As in the unconstrained case, p_A^(k) can be computed using the Cholesky factors of Ḡ_A^(k). Proceed to step V.

Step IV: (This step is executed only if ||g_A^(k)||₂ ≤ ε, that is, if x^(k) is in the neighbourhood of a stationary point on the manifold M^(k)(x).) We have two possible situations, depending upon whether or not the projected Hessian G_A^(k) is positive definite.

(a) ||E_A^(k)||_∞ = 0. In this case, x^(k) lies in the neighbourhood of a local minimum on the manifold M^(k)(x) and estimates of the Kuhn-Tucker multipliers are computed by solving the equations

L^(k) D^(k) (L^(k))^T u = (A^(k))^T g^(k).

(This method for computing the vector u is based on the assumption that Z^(k) is recurred separately. The reader should refer to Section 4 if Z^(k) is recurred as part of the orthogonal factorization of (A^(k))^T.) Let s be the index such that

u_s = min_r u_r.

If u_s > 0, then x^(k) is regarded as an adequate approximation to a strong local minimum of LCP and the algorithm is terminated. If u_s is negative, the constraint a_{j_s}^T x ≥ b_{j_s} is deleted from the set of active constraints and the index set of active constraints is modified accordingly. The deletion of a row from (A^(k))^T has the effect of adding a column to Z^(k) to give Z^(k+1); methods are given in Sections 4 and 5 for modifying L^(k), Z^(k), L_A^(k) and D_A^(k), respectively. Since it is possible for the projected Hessian matrix over the new manifold M^(k+1)(x) to have a single negative eigenvalue, a non-zero correction E_A^(k+1) may be generated implicitly during the updating. In this case, E_A^(k+1) will be zero except for its (n-t+1, n-t+1) element (see Section 5). Finally, the vector of residuals r^(k) is modified to include the new
inactive constraint, t is set to t-1, k to k+1, and the iteration is continued at step III.

(b) ||E_A^(k)||_∞ > 0. In this case, x^(k) is in the neighbourhood of a saddle point on the manifold M^(k)(x) and a direction of search p^(k) is determined using the alternative search procedure

(L_A^(k))^T v = e_j,

p^(k) = -sign(v^T g_A^(k)) Z^(k) v    if ||g_A^(k)||₂ > 0,
p^(k) = Z^(k) v                       if ||g_A^(k)||₂ = 0,

where the index j is such that
f]