Globally Convergent Optimization Algorithms on Riemannian Manifolds: Uniform Framework for Unconstrained and Constrained Optimization (1)

Y. Yang (2)

Communicated by T. Rapcsak

(1) This paper is based on part of the Ph.D. Thesis of the author under the supervision of Professor Tits, University of Maryland, College Park, Maryland. The author is indebted to him for invaluable suggestions on earlier versions of this paper. The author is grateful to the Associate Editor and the anonymous reviewers, who pointed out a number of papers that have been included in the references; they also made detailed suggestions that led to significant improvements of the paper. Finally, the author thanks Dr. S.T. Smith for providing his Ph.D. Thesis.

(2) Principal Control System Design Engineer, ITT Industries, Reston, Virginia, USA.

Abstract. This paper proposes several globally convergent geometric optimization algorithms on Riemannian manifolds, which extend some existing geometric optimization techniques. Since any set of smooth constraints in the Euclidean space $R^n$ (corresponding to constrained optimization) and the space $R^n$ itself (corresponding to unconstrained optimization) are both special Riemannian manifolds, and since these algorithms are developed on general Riemannian manifolds, the techniques discussed in this paper provide a uniform framework for constrained and unconstrained optimization problems. Unlike some earlier works, the new algorithms have fewer restrictions both in the convergence results and in practice; for example, global minimization in the one-dimensional search is not required. All algorithms addressed in this paper are globally convergent. For some special Riemannian manifolds other than $R^n$, the new algorithms are very efficient. Convergence rates are obtained. Applications are discussed.

Key Words. Riemannian manifolds, globally convergent optimization, nonlinear programming, geodesics.


1 Introduction

Compared to nonlinear unconstrained optimization problems, nonlinear constrained optimization problems are more difficult (Refs. 1-2). The search for the optimizer must find a path that not only improves the objective function, but also leads to a feasible point prescribed by the nonlinear constraints. Since the procedure of finding optimizers is exactly a search based on the local geometric information of both the constraints and the objective function, it is very important to develop search techniques that use the intrinsic geometric properties of the constraints and the objective function. In fact, differential geometry provides a powerful tool to characterize and analyze these geometric properties for systems of smooth functions. However, differential geometry was not adopted by researchers working on optimization techniques for several decades. It seems that the earliest insight into using these techniques came from Luenberger (Ref. 3). In 1972, he used a search along geodesics to prove convergence theorems for the gradient projection method; in (Ref. 4) he further suggested that, for smooth constrained optimization problems, the search should be carried out along a geodesic of the surface prescribed by the constraint set. However, this idea was not fully studied for a long time. An early development in this direction was the paper by Botsaris (Ref. 5). In 1981, he developed a method to search for optimizers along a curve


that approximates a geodesic. The first serious effort in this direction is probably due to Gabay (Ref. 6). Noting that the optimization of a function on a Riemannian manifold is, at least locally, equivalent to smoothly constrained optimization in Euclidean space, he developed search algorithms based mainly on differential geometry and analyzed the convergence of these algorithms (without giving all details). He also established relations between this new technique and some traditional methods such as the gradient projection method and reduced gradient methods. Since geodesics are in general very difficult to obtain, most papers following this direction addressed convergence issues in nonlinear optimization algorithms rather than numerical computation details. Rapcsak and Thang (Ref. 7) developed coordinate representations on Riemannian manifolds and derived some convergence results for several nonlinear programming algorithms. More details on the proofs were provided in (Ref. 8). They also derived (Ref. 9) a class of polynomial interior point algorithms for linear optimization based on a subclass of Riemannian metrics. Many of these developments have been summarized in two books (Refs. 10-11). A few other researchers also contributed to this field, for example, Brockett (Ref. 12) and Chu (Ref. 13). Another important development appeared in 1993. Smith (Refs. 14-15) essentially re-developed the optimization techniques on Riemannian manifolds, similar to what Gabay had done, but in pure differential geometry language with no assumption

on a local coordinate system. Namely, he developed a gradient algorithm (under the assumption that global optimization can be achieved in the one-dimensional search along geodesics), a Newton-type algorithm (which may not converge globally), and a conjugate gradient algorithm (which also needs the former assumption) on Riemannian manifolds. Under the additional assumption that these algorithms converge, he obtained estimates of their convergence rates. In addition, he noticed that, for some Riemannian manifolds, the geodesics are not only easy to obtain but also have explicit expressions. He actually used these properties and implemented his algorithms to solve some filtering problems. Further efforts on practical problems were made for the robust pole assignment problem (Ref. 16) and a fiber optical communication problem (Ref. 17). The advantages of optimization techniques on Riemannian manifolds are obvious. First, these algorithms are conceptually much simpler than the traditional algorithms for equality constrained optimization problems. Next, these results reduce to those of unconstrained optimization problems if the Riemannian manifold under consideration is the n-dimensional Euclidean space; therefore, the new techniques are natural generalizations of the traditional ones. Third, if the geodesics of the underlying problems can easily be obtained (which is true for some special Riemannian manifolds other than Euclidean space), one can expect these algorithms to be much more efficient than

traditional algorithms for nonlinear equality constrained optimization problems. Finally, the geometric implications of these algorithms are much clearer than those of the traditional algorithms for nonlinear equality constrained optimization problems, and therefore these techniques are mathematically more elegant. Still, several improvements are needed to arrive at implementable algorithms. First, the unrealistic assumption that the global optimum can be achieved in the one-dimensional search should be removed (3). Instead, a systematic one-dimensional search scheme for the implementable algorithms which terminates in finitely many steps (but still guarantees convergence) should be introduced. Convergence properties for the revised algorithms should be established, and the convergence conditions for the specific algorithms should be less restrictive than the conditions for general algorithms. Conditions for convergence to an isolated stationary point should also be given, so that we know under what circumstances it is meaningful to discuss convergence rates. The estimate of the convergence rate for these specific algorithms with finitely many steps in the one-dimensional

(3) If the global optimum has to be achieved in the one-dimensional search, as assumed in (Ref. 14), then steps requiring infinite computation are, in general, necessary to find the global optimizer(s) (Refs. 18-19), which is impossible in practice; therefore, in a real implementation, any search has to be stopped after finitely many steps, and this may not guarantee the convergence of the algorithms in (Ref. 14).


search should be re-examined. These will be our main topics in this paper.

2 Notations

In this section, the notations used in the rest of the paper are summarized. We adopt the style of (Ref. 20, Chapters 1-3) and (Ref. 21, Chapter 1). Let $\|\cdot\|$ be the Euclidean norm of its argument. Let M denote a Riemannian manifold embedded in the n-dimensional Euclidean space $R^n$. Let $\gamma(t) : [t_0, t_1] \to M$ be a piecewise smooth curve segment on M. The arc length of $\gamma$ is defined as

$$L(\gamma) = \int_{t_0}^{t_1} \|\gamma'(t)\| \, dt.$$

If M is connected, for any two points $p = \gamma(t_0)$ and $q = \gamma(t_1)$ on M, the Riemannian distance $d(p,q)$ from p to q is defined as the greatest lower bound of $\{L(\gamma) : \gamma \in Q(p,q)\}$, where $Q(p,q)$ is the set of all piecewise smooth curve segments on M from p to q. Since the distance function is a metric on the manifold M (Ref. 20, p. 136), from now on we will denote by M the Riemannian manifold equipped with the Riemannian metric. Let the tangent vector to a curve $\gamma(t) = [x_1(t), \ldots, x_n(t)]$ on M at $p = \gamma(t_0)$ be the real valued function $X = \gamma'(t_0) : f \in F(M) \to R$ defined by

$$Xf = \frac{d(f \circ \gamma)(t)}{dt}\Big|_{t_0},$$


where $F(M)$ denotes the set of all real valued smooth functions on M, and R denotes the one-dimensional Euclidean space. Let $\Gamma_p$ be the collection of all curves $\gamma(t)$ on M through $p = \gamma(t_0)$. The set of vectors, each tangent at $t_0$ to some curve in $\Gamma_p$, is denoted by $T_p(M)$ and is referred to as the tangent space. We then denote by V the vector field on a curve $\gamma$ that assigns to each point $p \in \gamma(t)$ a tangent vector $V_p$ to $\gamma(t)$ at p, and by $X(M)$ the set of all smooth vector fields on M. We also need the notion of the dual space $T_p(M)^*$ of the tangent space $T_p(M)$. At a point p of M, the elements of $T_p(M)^*$, the co-vectors, are linear maps of $T_p(M)$ into R. A one-form $\mu$ on a manifold M is a function that assigns to each point p an element $\mu_p$ of the cotangent space $T_p(M)^*$. A bilinear form on $T_p(M)$ is an R-bilinear function $b : T_p(M) \times T_p(M) \to R$, and we consider only the symmetric case $b(v,w) = b(w,v)$ for all $v, w \in T_p(M)$. Next, for all $X, Y, W, V \in X(M)$, $a, b \in R$, and $f, g \in F(M)$, let $aX, bY$ be the multiplications of a constant and a vector field, and let $fX, gY$ be the multiplications of a real valued smooth function and a vector field. We define a connection, or covariant derivative, $\nabla$ on M as a function $\nabla : X(M) \times X(M) \to X(M)$ such that

(C 1) $\nabla_V W$ is $F(M)$-linear in V; i.e., $\nabla_{fX+gY} W = f\nabla_X W + g\nabla_Y W$;

(C 2) $\nabla_V W$ is R-linear in W; i.e., $\nabla_V(aX + bY) = a\nabla_V X + b\nabla_V Y$;

(C 3) $\nabla_V(fW) = (Vf)W + f\nabla_V W$.

For our purposes, we use $\nabla$ for a more specific connection, the so-called Levi-Civita connection, defined as follows. Let $V, W \in X(M)$; then $[V,W] = VW - WV$ is a function from $F(M)$ to $F(M)$ sending each $f \in F(M)$ to $V(Wf) - W(Vf)$, and $[V,W]$ is called the bracket of V and W. The Levi-Civita connection is a connection that satisfies two additional relations:

(C 4) $[V,W] = \nabla_V W - \nabla_W V$;

(C 5) $X\langle V,W\rangle = \langle \nabla_X V, W\rangle + \langle V, \nabla_X W\rangle$ for all $X, V, W \in X(M)$.

Let $D_p$ be the set of vectors $v \in T_p(M)$ (with $\gamma(0) = p$ and $\gamma'(0) = v$) such that the maximal geodesic $\gamma_v(t)$ is defined at least on $[0,1]$. The exponential map of M at p is the function $\exp_p : D_p \to M$ such that $\exp_p(v) = \gamma_v(1)$ for all $v \in D_p$. Fix a vector $v \in T_p(M)$ and let $t \in R$; then the geodesic $s \to \gamma_v(ts)$ has initial velocity $t\gamma_v'(0) = tv$, hence $\gamma_{tv}(s) = \gamma_v(ts)$ for all s and t below a suitable limit. In particular, if $v \in D_p$, then

$$\exp_p(tv) = \gamma_{tv}(1) = \gamma_v(t). \qquad (1)$$

Thus the exponential map carries a straight line in $T_p(M)$ to a geodesic of M through p. This map is also a diffeomorphism from a star-shaped neighborhood of $0 \in T_p(M)$

onto a so-called normal neighborhood $N_p$ of p in M (Ref. 22, p. 339). For $\gamma(t) \in M$ with $\gamma(0) = p$ and $\gamma(c) = q$, parallelism along $\gamma(t)$ induces an isomorphism $\tau_{pq} : T_p \to T_q$ defined by $\tau_{pq} Y(0) = Y(c)$, where $Y(t) \in T_{\gamma(t)}(M)$ with $Y(0)$ a vector in $T_{\gamma(0)}(M)$. Similarly, we define the vector field $\tilde{X}$ on $N_p$ adapted to the tangent vector X in $T_p$ by $\tilde{X}_q = \tau_{pq} X$, the parallel translation of X along the geodesic joining p and q. It is known that parallel translation is a linear isometry (Ref. 20, p. 66). Finally, to describe the rate of change of a function on a manifold M, we denote by $(\mathrm{grad} f)(p)$ the gradient of $f \in F(M)$ on M at p, which satisfies $df_p(X) := \langle(\mathrm{grad} f)(p), X\rangle = Xf$ for all $X \in T_p(M)$.

3 Globally Convergent Optimization Algorithms on Riemannian Manifolds

Let $\Delta$ denote a set of desirable points on the Riemannian manifold M, and let $a(\cdot)$ denote the iteration map of some algorithm defined on M, i.e., $a(\cdot) : M \to M$. Let $0 < \epsilon \in R$ and $\tilde{x} \in M$; let $B(\tilde{x}, \epsilon) := \{x \in M : d(x, \tilde{x}) < \epsilon\}$ and $\bar{B}(\tilde{x}, \epsilon) := \{x \in M : d(x, \tilde{x}) \leq \epsilon\}$, where $d(\cdot,\cdot)$ is the Riemannian distance defined on Riemannian manifolds. We will consider the following algorithm.

Algorithm 3.1 (Abstract Algorithm):

Step 1: Data $x_0 \in M$, initial iteration count $k = 0$.

Step 2: If $x_k \notin \Delta$, pick $x_{k+1} = a(x_k) \in M$, and set $k = k + 1$.

Step 3: If $x_k \in \Delta$, stop.
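To make the iteration pattern concrete, the following minimal sketch (Python is used only for illustration; the function names and the membership test for $\Delta$ are hypothetical) implements Algorithm 3.1 for a user-supplied iteration map $a(\cdot)$.

```python
def abstract_algorithm(x0, a, in_delta, max_iter=1000):
    """Algorithm 3.1: iterate the map a(.) until a desirable point in Delta is found."""
    x, k = x0, 0
    while k < max_iter:
        if in_delta(x):      # Step 3: x_k is desirable, stop
            return x
        x = a(x)             # Step 2: apply the iteration map
        k += 1
    return x                 # last iterate if the iteration budget is exhausted
```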

A global convergence property of this algorithm is stated as the following theorem, which is an extension of a theorem of Polak (Ref. 23).

Theorem 3.1: Suppose that there exists a function $c : M \to R$ such that

(i) $c(\cdot)$ is continuous on M;

(ii) for every $x \in M$ with $x \notin \Delta$, there exist $\epsilon > 0$, $\delta > 0$ such that

$$c(y') - c(x') \leq -\delta, \qquad \forall x' \in B(x, \epsilon), \ y' \in a(x') \subset M.$$

Then, if the sequence $\{x_k\}$ is constructed by the abstract algorithm above, every accumulation point $\hat{x} \in M$ of $\{x_k\}$ is desirable, i.e., $\hat{x} \in \Delta$.


Proof: Since $x_k \in M$ and $x_k \to_K \hat{x}$, it is obvious that $\hat{x} \in M$. The rest of the proof is by contradiction. Suppose $x_k \to_K \hat{x} \notin \Delta$. Since $c(\cdot)$ is continuous,

$$c(x_k) \to_K c(\hat{x}). \qquad (2)$$

Next, let $\epsilon, \delta$ be associated with $\hat{x}$ as in (ii). Since $x_k \to_K \hat{x}$, there exists a $k_0$ such that $x_k \in B(\hat{x}, \epsilon)$ for all $k \geq k_0$, $k \in K$. Thus

$$c(y) - c(x_k) \leq -\delta, \qquad \forall y \in a(x_k), \ k \geq k_0,$$

and in particular,

$$c(x_{k+1}) - c(x_k) \leq -\delta, \qquad k \geq k_0.$$

But this contradicts (2), and hence the result follows. □

A special case of the abstract algorithm is a general algorithm with generalized Armijo one-dimensional search described below.

Algorithm 3.2 (Generalized Armijo Algorithm):

Data: $\alpha, \beta \in (0,1)$.

Step 0: Set initial iteration count $k = 0$ and select $x_0 \in M$.

Step 1: If $(\mathrm{grad} f)(x_k) = 0$, stop. Otherwise, determine some tangent vector $h_k \in T_{x_k}(M)$ at $x_k \in M$ such that $\langle(\mathrm{grad} f)(x_k), h_k\rangle < 0$.

Step 2: Find $i_k$ = smallest integer $i \geq 0$ such that

$$f(\exp_{x_k}(\beta^i h_k)) - f(x_k) \leq \alpha\beta^i \langle(\mathrm{grad} f)(x_k), h_k\rangle. \qquad (3)$$

Step 3: Update $x_{k+1} = \exp_{x_k}(\beta^{i_k} h_k)$. Increment k by one, then go to Step 1.
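A minimal sketch of one pass through Steps 1-3 is given below (Python, illustrative only): `exp_map(x, v)` stands for the exponential map $\exp_x(v)$ and `grad_f(x)` for the Riemannian gradient, both assumed to be supplied by the user for the manifold at hand; the inner product is taken in the embedding space.

```python
import numpy as np

def armijo_step(f, grad_f, exp_map, x, h, alpha=0.1, beta=0.5, max_backtracks=50):
    """One iteration of Algorithm 3.2: backtrack along the geodesic exp_x(beta^i h)."""
    slope = float(np.dot(grad_f(x), h))      # <grad f(x_k), h_k>, must be negative
    if slope >= 0:
        raise ValueError("h is not a descent direction")
    t = 1.0                                  # corresponds to beta^0
    for _ in range(max_backtracks):
        if f(exp_map(x, t * h)) - f(x) <= alpha * t * slope:   # condition (3)
            break
        t *= beta
    return exp_map(x, t * h)                 # x_{k+1}
```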

Remark: If $h_k = -(\mathrm{grad} f)(x_k)$, the algorithm reduces to the generalized gradient algorithm with the generalized Armijo search. We will refer to it as the generalized Armijo-gradient algorithm. This algorithm may look similar to the discrete gradient flow method (Ref. 24), but there is a major difference between the two. In the generalized Armijo-gradient algorithm, the search is carried out along geodesics; this guarantees that all intermediate trials are on the manifold, and any rounding error can be corrected in each iteration if the algorithm is carefully implemented. In the gradient flow method, geodesics are not calculated, which is an advantage from the viewpoint of reducing operation counts; however, for a structurally unstable gradient flow (Ref. 24), the rounding error in the gradient flow method may drive the limit points away from the manifold because the geodesic is not calculated.

The above algorithm has a convergence result parallel to that of the Armijo-gradient algorithm defined in $R^n$. To facilitate the statement and the proof, we mention here a fact (Ref. 25) that, for $f \in F(M)$ with M embedded in $R^n$, the derivative of f in $R^n$ and the gradient of f in M have the following relationship:

$$\frac{\partial f}{\partial x} = (\mathrm{grad} f) + (\mathrm{grad} f)^\perp$$

for any $x \in M$, where $(\mathrm{grad} f)^\perp \in T(M)^\perp$, the orthogonal complement of $T(M)$. Also, in the rest of this section, we denote by $\phi(\lambda) = f(\exp_x \lambda h_x) - f(x)$ a univariate function. Moreover, letting $\gamma_{h_x}(\lambda)$ be the geodesic emanating from x in the direction $h_x$, we denote by $H_\lambda = \gamma_{h_x}'(\lambda)$ a tangent vector field defined along $\gamma_{h_x}(\lambda)$ with $H_0 = h_x$.
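This decomposition suggests computing the Riemannian gradient by projecting the Euclidean derivative onto the tangent space. The sketch below does this for the unit sphere, where the projector is $I - vv^T$; the function name and the choice of the sphere are illustrative, not part of the paper.

```python
import numpy as np

def sphere_riemannian_grad(euclidean_grad, v):
    """grad f = df/dx - (df/dx)^perp: project the Euclidean derivative onto
    the tangent space T_v = {w : w^T v = 0} of the unit sphere (||v|| = 1)."""
    return euclidean_grad - np.dot(euclidean_grad, v) * v
```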

Proposition 3.1 : The step size rule of (3) is well defined, i.e., there always exists an ik satisfying (3).

Proof: Since, from (1),

$$\frac{d(\exp_x(\lambda h_x))}{d\lambda} = H_\lambda,$$

we have

$$f(\exp_x \lambda h_x) - f(x) = \phi(\lambda) - \phi(0) = \lambda\phi'(0) + o(\lambda) = \lambda\left\langle \frac{\partial f}{\partial x}(x), h_x \right\rangle + o(\lambda) = \lambda\langle \mathrm{grad} f(x), h_x\rangle + o(\lambda)$$
$$= \lambda\alpha\langle \mathrm{grad} f(x), h_x\rangle + \lambda\left[(1-\alpha)\langle \mathrm{grad} f(x), h_x\rangle + \frac{o(\lambda)}{\lambda}\right], \qquad (4)$$

and since $\lim_{\lambda\to 0} o(\lambda)/\lambda = 0$, $\alpha < 1$, and $\langle(\mathrm{grad} f)(x_k), h_k\rangle < 0$, the bracketed term in (4) is negative for $\lambda$ small enough. The result follows from the fact that $f(\exp_x \lambda h_x) - f(x) \leq \lambda\alpha\langle \mathrm{grad} f(x), h_x\rangle$ holds for $\lambda$ small enough. □

The Proposition also says that the objective function in the generalized Armijo-gradient algorithm is monotonically decreasing as k increases. Similar to the case in $R^n$, we can prove that the generalized Armijo-gradient algorithm is convergent.

Lemma 3.1: Let $\hat{x}$ be any point in M. Let $\epsilon > 0$, and let $h_x \in D_x$ be a tangent vector at $x \in \bar{B}(\hat{x}, \epsilon) \subset M$, $H_\lambda \in X(M)$, and $f \in F(M)$. Then $\|h_x\| \geq c\|\mathrm{grad} f(x)\|$ implies, for any $\lambda$,

$$\|H_\lambda\| \geq c\|\mathrm{grad} f(x)\|; \qquad (5)$$

and

$$\langle \mathrm{grad} f(x), h_x\rangle \leq -\rho_0\|\mathrm{grad} f(x)\|\,\|h_x\|, \qquad \rho_0 > 0, \ \forall x \in B(\hat{x}, \epsilon)$$

implies that, for any fixed $\rho < \rho_0$, there exists some $\bar\lambda > 0$ such that

$$\left\langle \frac{\partial f}{\partial x}(x), H_\lambda \right\rangle \leq -\rho\|\mathrm{grad} f(x)\|\,\|H_\lambda\|, \qquad \forall x \in \bar{B}(\hat{x}, \epsilon), \ \forall\lambda \in [0, \bar\lambda]. \qquad (6)$$

Proof: Since the isometric property of parallel translation implies that $\|H_\lambda\| = \|h_x\|$, the first relation holds trivially. Since

$$\left\langle \frac{\partial f}{\partial x}(x), H_0 \right\rangle = \langle \mathrm{grad} f(x), h_x\rangle \leq -\rho_0\|\mathrm{grad} f(x)\|\,\|H_\lambda\|,$$

by uniform continuity of $\left\langle \frac{\partial f}{\partial x}(x), H_\lambda \right\rangle$ for $x \in \bar{B}(\hat{x}, \epsilon)$ and $\lambda \in [0,1]$, there must exist a $\bar\lambda$ such that (6) holds with $\rho < \rho_0$. □

Theorem 3.2: Let $f \in F(M)$, and $H_\lambda \in X(M)$ with $h_x \in D_x$ a tangent vector at $x \in M$ satisfying $\langle(\mathrm{grad} f)(x), h_x\rangle < 0$ whenever $(\mathrm{grad} f)(x) \neq 0$. Suppose that, for any $\hat{x} \in M$ with $\|(\mathrm{grad} f)(\hat{x})\| \neq 0$, there exist positive numbers $\epsilon$, $\rho_0$, and c such that, for all $x \in B(\hat{x}, \epsilon)$,

$$\langle \mathrm{grad} f(x), h_x\rangle \leq -\rho_0\|\mathrm{grad} f(x)\|\,\|h_x\| \qquad (7)$$

and

$$\|h_x\| \geq c\|\mathrm{grad} f(x)\|. \qquad (8)$$

Moreover, let $\{x_k\}$ be constructed by the generalized Armijo algorithm, and let K be any subset of $\{0, 1, 2, \ldots\}$ with infinitely many members. Then $x_k \to_K x^*$ implies $\|(\mathrm{grad} f)(x^*)\| = 0$.

Proof: By assumption, f is continuous. Therefore, we merely need to show that condition (ii) of Theorem 3.1 holds. By Lemma 3.1, (6) and (5) are applicable here. Noting that $\frac{d(\exp_x(\lambda h_x))}{d\lambda} = H_\lambda$ from (1), we have

$$f(\exp_x \lambda h_x) - f(x) = \phi(\lambda) - \phi(0) = \int_0^1 \frac{d\phi(t\lambda)}{d(\lambda t)}\,\lambda\,dt = \int_0^1 \left\langle \frac{\partial f}{\partial x}(\exp_x t\lambda h_x), H_{t\lambda} \right\rangle \lambda\,dt, \qquad \forall x \in M, \ \forall \lambda h_x \in D_x.$$

By the mean value theorem for integrals, since $\frac{\partial f}{\partial x}$ and $H_{t\lambda}$ are continuous, there exists a $\bar{t} \in [0,1]$ such that, for any fixed $x \in M$,

$$\int_0^1 \left\langle \frac{\partial f}{\partial x}(x), H_{t\lambda} \right\rangle dt = \left\langle \frac{\partial f}{\partial x}(x), H_{\bar{t}\lambda} \right\rangle.$$

Therefore,

$$f(\exp_x \lambda h_x) - f(x) - \alpha\lambda\left\langle \frac{\partial f}{\partial x}(x), H_{\bar{t}\lambda} \right\rangle$$
$$= \lambda\int_0^1 \left\langle \frac{\partial f}{\partial x}(\exp_x t\lambda h_x) - \frac{\partial f}{\partial x}(x), H_{t\lambda} \right\rangle dt + \lambda(1-\alpha)\left\langle \frac{\partial f}{\partial x}(x), H_{\bar{t}\lambda} \right\rangle$$
$$\leq \lambda\left\{ \sup_{t\in[0,1]} \left\langle \frac{\partial f}{\partial x}(\exp_x t\lambda h_x) - \frac{\partial f}{\partial x}(x), H_{t\lambda} \right\rangle + (1-\alpha)\left\langle \frac{\partial f}{\partial x}(x), H_{\bar{t}\lambda} \right\rangle \right\}, \qquad \lambda h_x \in D_x,$$

where $\alpha \in (0,1)$. Now suppose that $\hat{x} \notin \Delta := \{x : (\mathrm{grad} f)(x) = 0\}$; therefore $\frac{\partial f}{\partial x}(\hat{x}) \neq 0$. Using the Schwartz inequality, (6), and the isometry property of parallel translation, we get, for any fixed $0 < \rho < \rho_0$,

$$f(\exp_x \lambda h_x) - f(x) - \alpha\lambda\left\langle \frac{\partial f}{\partial x}(x), H_{\bar{t}\lambda} \right\rangle$$
$$\leq \lambda\left\{ \sup_{t\in[0,1]} \left\| \frac{\partial f}{\partial x}(\exp_x t\lambda h_x) - \frac{\partial f}{\partial x}(x) \right\| \cdot \|H_{t\lambda}\| - \rho(1-\alpha)\|\mathrm{grad} f(x)\|\cdot\|H_{\bar{t}\lambda}\| \right\}$$
$$\leq \lambda\|h_x\|\left\{ \sup_{t\in[0,1]} \left\| \frac{\partial f}{\partial x}(\exp_x t\lambda h_x) - \frac{\partial f}{\partial x}(x) \right\| - \rho(1-\alpha)\|\mathrm{grad} f(x)\| \right\}, \qquad \forall x \in \bar{B}(\hat{x}, \epsilon), \ \forall\lambda \in [0, \bar\lambda], \ \text{with some } \bar\lambda > 0. \qquad (9)$$

Pick $\epsilon' \in (0, \epsilon]$ (using continuity of $(\mathrm{grad} f)$, $\hat{x} \notin \Delta$, and $\alpha < 1$) such that, for some $\eta > 0$,

$$\rho(1-\alpha)\|(\mathrm{grad} f)(x)\| > \eta > 0, \qquad \forall x \in \bar{B}(\hat{x}, \epsilon'). \qquad (10)$$

Since $\bar{B}(\hat{x}, \epsilon')$ is compact, $\frac{\partial f}{\partial x}(x)$ is uniformly continuous in $\bar{B}(\hat{x}, \epsilon')$. Therefore, there exists an $\bar\epsilon > 0$ such that, if $\|\exp_x t\lambda h_x - x\| < \bar\epsilon$ for all $t \in [0,1]$ and some $\lambda \leq \delta$, then

$$\left\| \frac{\partial f}{\partial x}(\exp_x t\lambda h_x) - \frac{\partial f}{\partial x}(x) \right\| < \eta, \qquad \forall x \in \bar{B}(\hat{x}, \epsilon').$$

Again, by continuity of $\exp_x \lambda h_x$, there exists a $\delta$ such that

$$\|\exp_x t\lambda h_x - x\| < \bar\epsilon, \qquad \forall\lambda \leq \delta, \ \forall t \in [0,1].$$

Thus

$$\sup_{t\in[0,1]} \left\| \frac{\partial f}{\partial x}(\exp_x t\lambda h_x) - \frac{\partial f}{\partial x}(x) \right\| < \eta, \qquad \forall x \in \bar{B}(\hat{x}, \epsilon'), \ \forall\lambda \leq \delta.$$

The above relations indicate that the right-hand side of (9) is less than zero for $\lambda$ small enough. Again from (9),

$$f(\exp_x \lambda h_x) - f(x) \leq \alpha\lambda\left\langle \frac{\partial f}{\partial x}(x), H_{\bar{t}\lambda} \right\rangle.$$

Since $\left\langle \frac{\partial f}{\partial x}(x), H_0 \right\rangle < 0$ and $\left\langle \frac{\partial f}{\partial x}(x), H_\lambda \right\rangle$ is uniformly continuous in $\bar{B}(\hat{x}, \epsilon') \times [0, \delta]$, for any fixed $C \in (0,1)$ there must exist a $\delta'$ with $0 < \delta' < \delta$ such that

$$\left\langle \frac{\partial f}{\partial x}(x), H_{\bar{t}\lambda} \right\rangle < C\left\langle \frac{\partial f}{\partial x}(x), H_0 \right\rangle$$

for any $x \in \bar{B}(\hat{x}, \epsilon')$ and $\lambda \in [0, \delta']$. Using (6) and (5) yields

$$f(\exp_x \lambda h_x) - f(x) \leq (\alpha C)\lambda\left\langle \frac{\partial f}{\partial x}(x), H_0 \right\rangle \leq -(\alpha C)\lambda\rho\|(\mathrm{grad} f)(x)\|\,\|H_0\| \leq -(\alpha C)\lambda\rho c\|\mathrm{grad} f\|^2, \qquad \forall x \in \bar{B}(\hat{x}, \epsilon'), \ \forall\lambda \in [0, \delta'].$$

Let $\alpha' = \alpha C \in (0,1)$. Now note that the iteration defined by the generalized Armijo algorithm is $a(x) = \exp_x \beta^{i_k} h_x$; substituting $\lambda = \beta^{i_k}$ and using (10), we have

$$f(a(x)) - f(x) \leq \alpha'\beta^{i_k}\langle(\mathrm{grad} f)(x), h_x\rangle \leq -\alpha'\rho c\beta^{i_k}\|\mathrm{grad} f\|^2 \leq -\alpha'\beta^{i_k}\frac{\eta^2 c}{(1-\alpha)^2\rho}, \qquad \forall x \in \bar{B}(\hat{x}, \epsilon).$$

The result follows from Theorem 3.1. □

Corollary 3.1: The generalized Armijo-gradient algorithm is globally convergent in the sense that, for any initial point, every limit point $x^*$ of the sequence constructed by the algorithm satisfies $\|(\mathrm{grad} f)(x^*)\| = 0$.

It is shown in (Ref. 14) that, if the generalized gradient algorithm on M (without Armijo search) converges, the convergence rate is linear. It can be shown that the generalized Armijo-gradient algorithm converges linearly as well. On the other hand, if the generalized Newton algorithm on M (without Armijo search) converges, the convergence rate is at least quadratic or even cubic; however, the generalized Newton algorithm may only be locally convergent. This suggests investigating a so-called generalized Armijo-Newton algorithm on Riemannian manifolds, a combination of the generalized Armijo one-dimensional search on Riemannian manifolds and the generalized Newton direction on Riemannian manifolds, because we expect this scheme to enjoy both the global convergence property and a better-than-linear convergence rate.

First, we introduce the definition of the Hessian of $f \in F(M)$, which is given by $\mathrm{Hess} f = \nabla\nabla f = \nabla^2 f$; it is a bilinear form such that, for all $x \in M$ (Ref. 20, p. 86),

$$\mathrm{Hess} f(X,Y) = XYf - (\nabla_X Y)f, \qquad \forall X, Y \in T_x(M). \qquad (11)$$

Therefore $\mathrm{Hess} f(X,Y) : T_x(M) \times T_x(M) \to R$. Let $\mu$ be a $C^\infty$ one-form on M. Since the covariant derivative of $\mu$ along h, $(\nabla_h\mu)_x$, is a Hessian, the Hessian also defines the map $h \in T_x(M) \to (\nabla\mu)_x(\cdot, h) \in T_x^*(M)$.

Definition 3.1: $\hat{x} \in M$ is a non-degenerate critical point of $f \in F(M)$ if $\mathrm{grad} f(\hat{x}) = 0$ and the Hessian $\nabla^2 f(\hat{x})$ is positive definite.

Roughly speaking, a non-degenerate critical point is an isolated local minimizer. Let M be an n-dimensional Riemannian manifold with Riemannian structure and Levi-Civita connection $\nabla$. Let $N_x$ be a normal neighborhood of $x \in M$, and let $\tau$ be the parallelism with respect to $\exp(th)$. If there exists some $\hat{x} \in N_x$ such that $\mu_{\hat{x}} = 0$, then (Ref. 15, pp. 122-123)

$$0 = \tau^{-1}\mu_{\hat{x}} = \mu_x + (\nabla\mu)_x(\cdot, h) + \text{h.o.t.}$$

The following algorithm is proposed by Smith in (Ref. 15).

Algorithm 3.3 (Generalized Newton's Algorithm):

Step 0: Select $x_0 \in M$, and set $k = 0$.

Step 1: Evaluate $\mu_{x_k}$ and $(\nabla\mu)_{x_k}$.

Step 2: Find $h_k \in T_{x_k}(M)$ satisfying

$$0 = \tau^{-1}\mu_{\hat{x}} = \mu_{x_k} + (\nabla\mu)_{x_k}(\cdot, h_k). \qquad (12)$$

Step 3: Update $x_{k+1} = \exp_{x_k} h_k$. Increment k by one, and go to Step 1.

Since $h_k \in T_{x_k}(M)$ must hold, it is worthwhile to pay a little more attention to the details of the solution of (12). Let $T_{x_k}$ denote an orthonormal basis of $T_{x_k}(M)$, the tangent plane of M at $x_k$. Then $P^k = T_{x_k}T_{x_k}^T$ is a projection matrix (for more about projection matrices, see Ref. 26) such that $\mathrm{grad} f(x_k) = P^k \frac{\partial f}{\partial x}(x_k)$. Since $h_k = -(\nabla^2 f(x_k))^{-1}\mathrm{grad} f(x_k)$ generally does not belong to $T_{x_k}(M)$, a natural extension is to solve

$$0 = P^k \mathrm{grad} f(x_k) + P^k(\nabla^2 f(x_k))h_k. \qquad (13)$$

We will show that there always exists an $h_k \in T_{x_k}(M)$ satisfying (13); a numerical sketch of this computation is given below, after which the generalized Armijo-Newton algorithm is stated.
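A minimal sketch of solving (13), assuming an orthonormal basis matrix T of the tangent space is available (columns spanning $T_{x_k}(M)$): writing $h = Tz$ reduces (13) to a small linear system on the tangent space. The function name and the use of numpy are illustrative only.

```python
import numpy as np

def newton_direction(T, grad, hess):
    """Solve 0 = P grad f + P (hess) h with P = T T^T for h in the tangent space.
    Substituting h = T z gives (T^T hess T) z = -T^T grad, an m x m system."""
    A = T.T @ hess @ T                    # reduced Hessian on the tangent space
    z = np.linalg.solve(A, -T.T @ grad)   # assumes the reduced Hessian is nonsingular
    return T @ z                          # h_k lies in T_{x_k}(M) by construction
```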

Algorithm 3.4 (Generalized Armijo-Newton Algorithm):

Data: $\alpha \in (0,1)$, $\beta \in (0,1)$.

Step 0: Set $k = 0$ and select $x_0 \in M$.

Step 1: Evaluate $(\mathrm{grad} f)(x_k)$. If $(\mathrm{grad} f)(x_k) = 0$, stop. Otherwise, evaluate $\nabla^2 f(x_k)$ and compute $h_k$ by solving (13).

Step 2: Find $i_k$ = smallest integer $i \geq 0$ such that

$$f(\exp_{x_k}\beta^i h_k) - f(x_k) \leq \alpha\beta^i\langle \mathrm{grad} f(x_k), h_k\rangle. \qquad (14)$$

Step 3: Update $x_{k+1} = \exp_{x_k}(\beta^{i_k} h_k)$. Increment k by one, and then go to Step 1.

The convergence of the algorithm depends on a certain property of f, defined below.

Definition 3.2: Let M be a Riemannian manifold embedded in $R^n$. A function $f \in F(M)$ is said to be locally strongly convex around $x \in M$ if there is a compact set $U \subset M$ with $x \in U$ and an $m > 0$ such that

$$\langle y, \nabla^2 f(x)y\rangle \geq m\|y\|^2, \qquad \forall x \in U, \ \forall y \in R^n. \qquad (15)$$
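Condition (15) amounts to a uniform lower bound m on the smallest eigenvalue of the Hessian over U. A small illustrative check at a single point, assuming the Hessian is available as a matrix, is:

```python
import numpy as np

def satisfies_strong_convexity(hess, m):
    """Check (15) at one point: y^T (hess) y >= m ||y||^2 for all y, i.e. the
    smallest eigenvalue of the symmetrized Hessian is at least m."""
    H = 0.5 * (hess + hess.T)
    return float(np.linalg.eigvalsh(H).min()) >= m
```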

Theorem 3.3: Assume that $\mathrm{grad} f(x) \neq 0$ and that $f \in F(M)$ is locally strongly convex around $x \in M$. Then (13) has some solution(s) $h_x \in T_x(M)$. Moreover, there exist positive numbers c, $\rho$ such that, for all $x \in U \subset M$,

$$\|h_x\| \geq c\|\mathrm{grad} f(x)\| \qquad (16)$$

and

$$\langle h_x, \mathrm{grad} f(x)\rangle \leq -\rho\|\mathrm{grad} f(x)\|\cdot\|h_x\|. \qquad (17)$$

Proof: Let $T_x$ be an orthonormal basis of $T_x(M)$. Since $\mathrm{grad} f(x) \in T_x(M)$, there is a unique y which solves $-\mathrm{grad} f(x) = T_x y$. Since f is locally strongly convex, $T_x^T\nabla^2 f(x)T_x T_x^T$ has full row rank, so there is at least one z which solves $T_x^T\nabla^2 f(x)T_x T_x^T z = y$. Let $h_x = T_x T_x^T z$; clearly, $h_x \in T_x(M)$ solves (13). Let $P_x$ denote the orthogonal projection matrix $P_x = T_x T_x^T$. Since $-\mathrm{grad} f(x) = P_x\nabla^2 f(x)h_x$,

$$\|\mathrm{grad} f(x)\| \leq \|P_x\|\cdot\|\nabla^2 f(x)\|\cdot\|h_x\|,$$

or

$$\|h_x\| \geq \frac{\|\mathrm{grad} f(x)\|}{\|P_x\|\cdot\|\nabla^2 f(x)\|} \geq c\cdot\|\mathrm{grad} f(x)\|, \qquad \forall x \in U,$$

where

$$c := \left(\frac{1}{\sup_{x\in U}\|\nabla^2 f(x)\|}\right)\left(\frac{1}{\sup_{x\in U}\|P_x\|}\right) > 0,$$

since $\sup_{x\in U}\|\nabla^2 f(x)\|$ and $\sup_{x\in U}\|P_x\|$ are bounded, by the assumptions that $f \in F(M)$, U is compact, and $P_x$ is a projection matrix. Since f is locally strongly convex,

$$\langle \mathrm{grad} f(x), h_x\rangle = \langle -P_x\nabla^2 f(x)h_x, h_x\rangle = -h_x^T\nabla^2 f(x)h_x$$
$$\leq -m\cdot\|h_x\|^2 \leq -mc\cdot\|h_x\|\cdot\|\mathrm{grad} f(x)\| := -\rho\cdot\|h_x\|\cdot\|\mathrm{grad} f(x)\|, \qquad \forall x \in U,$$

with $\rho = m\cdot c$. □

In conjunction with Theorem 3.2, Theorem 3.3 implies the following result.

Corollary 3.2: If, for every $x \in M$ with $\mathrm{grad} f(x) \neq 0$, the objective function $f \in F(M)$ is locally strongly convex around x, then the generalized Armijo-Newton algorithm is globally convergent in the sense that, for any initial point, every limit point $x^*$ of the sequence constructed by the algorithm satisfies $\|(\mathrm{grad} f)(x^*)\| = 0$.

All the above results assert that, if certain conditions hold, then all accumulation points are in the desired set $\{x : \|\mathrm{grad} f(x)\| = 0\}$. However, one is often interested in whether a sequence $\{x_k\}$ generated by a certain algorithm converges to a fixed point $\hat{x}$, i.e., whether the iteration map $a(\cdot) : M \to M$ of the algorithm leads to $a(\hat{x}) = \hat{x}$. In fact, it is not hard to establish such a result under an additional assumption.

Definition 3.3: $\hat{x}$ is a stationary point if $\|\mathrm{grad} f(\hat{x})\| = 0$. It is an isolated stationary point if (i) $\|\mathrm{grad} f(\hat{x})\| = 0$, and (ii) there exists an $\epsilon > 0$ such that, for all $x \in M$, $0 < d(x,\hat{x}) \leq \epsilon$ implies $\|\mathrm{grad} f(x)\| \neq 0$.

Theorem 3.4: Let $\hat{x}$ be an isolated stationary point, and let $\{x_k\}$ be a sequence such that all its accumulation points are stationary. Suppose that $x_k \to_K \hat{x}$ for some nonnegative integer index set K and that there exists some neighborhood $U \subset M$ of $\hat{x}$ such that $d(x_{k+1}, x_k) \to 0$ as $k \to \infty$, $x_k \in U$. Then $x_k \to \hat{x}$.

Proof: If the stationary point is reached in finitely many steps, the proof is trivial. Therefore, we assume that the stationary point is not reached in finitely many steps. Since $\hat{x}$ is an isolated stationary point, there is an $\epsilon > 0$ such that, for all $x \in M$, $0 < d(x,\hat{x}) \leq \epsilon$ implies $\|\mathrm{grad} f(x)\| \neq 0$. Since $x_k \to_K \hat{x}$ for some nonnegative integer index set K, there are $\{k_i\} \in K$ and an integer N such that, for all $k_i > N$, $d(x_{k_i}, \hat{x}) < \epsilon/2$. By contradiction, assume that $\{x_k\}$ also generates another accumulation point $x^*$; it is a stationary point by assumption. Since $\hat{x}$ is isolated, $d(x^*, \hat{x}) > \epsilon$. Noticing that $d(x_{k+1}, x_k) \to 0$ as $k \to \infty$, there must be infinitely many $x_k$ satisfying $\epsilon/2 \leq d(x_k, \hat{x}) \leq \epsilon$. This means that there is at least one other accumulation point, and hence at least one other stationary point, in the set $\{x : \epsilon/2 \leq d(x, \hat{x}) \leq \epsilon\}$, which contradicts the assumption that $\hat{x}$ is an isolated stationary point. This proves the theorem. □

4 Convergence Rate

In the previous section, we established that, under certain conditions, a sequence constructed by the generalized Armijo-gradient or the generalized Armijo-Newton algorithm will converge to a fixed point. In this section, estimates of the convergence rates for the algorithms discussed above are derived. It is always assumed, in this section,

that a sequence constructed by one of these algorithms converges to a fixed point. We start the derivation with some preliminary results obtained in (Refs. 14-15). The first one is a Taylor series expansion on Riemannian manifolds.

Lemma 4.1 (Refs. 14-15): Let M be a Riemannian manifold with Levi-Civita connection $\nabla$, $N_x$ a normal neighborhood of $x \in M$, $\tilde{X}$ the vector field on $N_x$ adapted to X in $T_x$, and $f \in F(M)$. Then there exists an $\epsilon > 0$ such that for every $\lambda \in [0, \epsilon)$

$$f(\exp_x \lambda X) = f(x) + \lambda(\nabla_{\tilde{X}} f)(x) + \cdots + \frac{\lambda^{n-1}}{(n-1)!}(\nabla_{\tilde{X}}^{n-1} f)(x) + \frac{\lambda^n}{(n-1)!}\int_0^1 (1-t)^{n-1}(\nabla_{\tilde{X}}^n f)(\exp_x t\lambda X)\,dt. \qquad (18)$$

Lemma 4.2 (Refs. 14-15): Let M be a Riemannian manifold with Levi-Civita connection $\nabla$. Let $f \in F(M)$ have a non-degenerate critical point at $\hat{x} \in M$, let $h_k$ be a sequence of tangent vectors, and let

$$x_{k+1} = \exp_{x_k}\lambda_k h_k, \qquad k = 0, 1, \ldots, \qquad (19)$$

with $x_k \to \hat{x}$ as $k \to \infty$, and

$$\langle \mathrm{grad} f(x_k), h_k\rangle \leq -\rho\|\mathrm{grad} f(x_k)\|\,\|h_k\|, \qquad \text{for some fixed } \rho \in (0,1]. \qquad (20)$$

Then, for some positive constants $\epsilon$, c, and C, the following inequalities hold:

$$c\,d^2(x_k, \hat{x}) \leq 2(f(x_k) - f(\hat{x})) \leq C\,d^2(x_k, \hat{x}), \qquad x_k \in B(\hat{x}, \epsilon), \qquad (21)$$

$$\|\mathrm{grad} f(x_k)\| \geq c\,d(x_k, \hat{x}), \qquad x_k \in B(\hat{x}, \epsilon). \qquad (22)$$

Theorem 4.1: Let M be a Riemannian manifold with Levi-Civita connection $\nabla$. Let $f \in F(M)$ have a non-degenerate critical point at $\hat{x} \in M$. Let $\{x_k\}$ be a sequence of points, converging to $\hat{x}$ on M, constructed by the generalized Armijo-gradient algorithm with $\alpha < 1$ and $h_k = -\mathrm{grad} f(x_k)$. Then there exist a constant $E'$, an integer $K_0 \geq 0$, and a $\theta \in (0,1)$ such that

$$d(x_{k+K_0}, \hat{x}) \leq E'\theta^{k/2}.$$

Proof: Since $\{h_k = -\mathrm{grad} f(x_k)\} \in T_{x_k}(M)$ satisfies (20) with $\rho = 1$, the previous lemma is applicable here. Let $\gamma_{h_k}(\lambda)$ be the geodesic emanating from $x_k$ in the direction $h_k$, and let $H_\lambda = \gamma_{h_k}'(\lambda)$ be a tangent vector field defined along $\gamma_{h_k}(\lambda)$ with $H_0 = h_k$. By making use of (18) with $n = 2$, we have

$$f(x_{k+1}) - f(x_k) = \lambda(H_\lambda f)(x_k) + \lambda^2\int_0^1(1-t)(\nabla^2_{H_\lambda}f)(\exp_{x_k}t\lambda h_k)\,dt. \qquad (23)$$

By the smoothness of f, there exists an open neighborhood $U \subset M$ of $\hat{x}$ such that, for all $X \in T_x(M)$ and all $x \in U$,

$$c\|X\|^2 \leq (\nabla^2_X f(x)) \leq C\|X\|^2. \qquad (24)$$

Since, by assumption, $x_k$ is convergent, there exists a $K_0$ such that, for all $k > K_0$, (24) holds for $x = x_k$. For any $0 < \alpha < 1$, we can choose $\rho_0$ such that $1 > \rho_0 = \alpha + \frac{1-\alpha}{2} > 0$. Since (20) holds with $\rho = 1$, by Lemma 3.1, there exists a $\bar\lambda$ such that, for $\rho_0 < 1$,

$$(H_\lambda f)(x_k) \leq -\rho_0\|\mathrm{grad} f(x_k)\|\,\|H_\lambda\|, \qquad \forall x_k \in \bar{B}(\hat{x}, \epsilon), \ \forall\lambda \in [0, \bar\lambda].$$

By (24) and the isometry of parallel translation, we can derive from (23) that

$$f(x_{k+1}) - f(x_k) \leq -\lambda\rho_0\|h_k\|^2 + \frac{1}{2}\lambda^2 C\|h_k\|^2, \qquad \forall\lambda \in [0, \bar\lambda].$$

Therefore,

$$f(x_{k+1}) - f(x_k) + \alpha\lambda\|h_k\|^2 \leq -\lambda(\rho_0 - \alpha)\|h_k\|^2 + \frac{1}{2}\lambda^2 C\|h_k\|^2, \qquad \forall\lambda \in [0, \bar\lambda].$$

In view of $\alpha < \rho_0 < 1$ and $\lambda \in [0, \bar\lambda]$, the right-hand side is nonpositive, i.e., $-\lambda(\rho_0 - \alpha) + \frac{1}{2}\lambda^2 C \leq 0$, whenever

$$0 < \lambda \leq \frac{2(\rho_0 - \alpha)}{C},$$

so that the Armijo condition (3) is satisfied for all such $\lambda$. Hence there is an integer $I \geq 0$, independent of k, with $\beta^I \leq \min\{\bar\lambda, 2(\rho_0 - \alpha)/C\}$, such that $\beta^{i_k} \geq \beta^I$ for all $k > K_0$. Since $\beta^I > 0$, by using (22) and (21), we obtain

$$f(x_{k+1}) - f(x_k) \leq -\beta^{i_k}\alpha\|h_k\|^2 \leq -\beta^I\alpha\|\mathrm{grad} f(x_k)\|^2 \leq -\beta^I\alpha c^2 d^2(x_k, \hat{x}) \leq -\frac{2\beta^I\alpha c^2}{C}(f(x_k) - f(\hat{x})).$$

This proves

$$f(x_{k+1}) - f(\hat{x}) \leq \left(1 - \frac{2\beta^I\alpha c^2}{C}\right)(f(x_k) - f(\hat{x})).$$

Let $\theta = 1 - \frac{2\beta^I\alpha c^2}{C}$. Substituting (21) into the above relation yields

$$\frac{c}{2}d^2(x_{k+K_0}, \hat{x}) \leq \theta^k(f(x_{K_0}) - f(\hat{x})).$$

Letting $E = f(x_{K_0}) - f(\hat{x})$, we finally get

$$d(x_{k+K_0}, \hat{x}) \leq \sqrt{\frac{2E}{c}}\,(\sqrt{\theta})^k.$$

Since $\alpha$, $\beta^I$, c, and C are all greater than zero, $\theta < 1$; since $\beta^I \leq 2(\rho_0 - \alpha)/C$, $\theta > 0$; and since $\beta^I$, $\alpha$, c, and C are all independent of $x_k$, the result follows. □

Combining Theorems 3.2, 3.4, and 4.1 gives a complete result.

Theorem 4.2: Let M be a Riemannian manifold with Levi-Civita connection $\nabla$, and let $f \in F(M)$. The generalized Armijo-gradient algorithm is globally convergent in the sense that, for any initial point, every limit point $x^*$ of the sequence constructed by the algorithm satisfies $\|(\mathrm{grad} f)(x^*)\| = 0$. Moreover, if $f \in F(M)$ has only isolated stationary points, and there exists some neighborhood $U \subset M$ of an isolated stationary point $\hat{x}$ such that $d(x_{k+1}, x_k) \to 0$ as $k \to \infty$, $x_k \in U$, then the algorithm will generate a sequence converging to $\hat{x}$. Suppose further that $\hat{x}$ is a non-degenerate critical point; then the algorithm is at least linearly convergent.

We have a similar result for the Armijo-Newton method.

Theorem 4.3: Let M be a Riemannian manifold with Levi-Civita connection $\nabla$. Let $f \in F(M)$ be locally strongly convex around $x \in M$ for every x with $(\mathrm{grad} f)(x) \neq 0$. The generalized Armijo-Newton algorithm is globally convergent in the sense that, for any initial point, every limit point $x^*$ of the sequence constructed by the algorithm satisfies $\|(\mathrm{grad} f)(x^*)\| = 0$. Moreover, if $f \in F(M)$ has only isolated stationary points, and there exists some neighborhood $U \subset M$ of an isolated stationary point $\hat{x}$ such that $d(x_{k+1}, x_k) \to 0$ as $k \to \infty$, $x_k \in U$, then the algorithm will generate a sequence converging to $\hat{x}$. Suppose further that $\hat{x}$ is a non-degenerate critical point and that there exists a $K \geq 0$ such that $\beta^{i_k} = 1$ for all $k > K$. Then $\{x_k\}$ converges to $\hat{x}$ quadratically.

Proof: When $\beta^{i_k} = 1$, or equivalently $i_k = 0$, the algorithm reduces to the generalized Newton's algorithm. The proof for this special case is given in (Refs. 14-15). □

Remark: The key assumption in the above theorem is that $\beta^{i_k} = 1$ for all $k > K$, which, unlike in the unconstrained optimization algorithm in $R^n$ with some fixed positive number $\alpha < 1/2$, does not always hold.

5 Applications

The algorithms have been used to solve the robust pole assignment problem (Ref. 16) and the polarization mode dispersion compensation problem (Ref. 17). They can also be applied to the Riemannian SVD problem given in (Ref. 27). The problem is described as follows:

$$\min_v f(v) = v^T A^T D^{-1} A v \quad \text{s.t.} \quad v^T v = 1,$$

where D is a symmetric positive definite matrix function, the elements of D are quadratic functions of the components of v, and A is a constant matrix. For the sake of simplicity, only the generalized Armijo-gradient algorithm is illustrated. The Riemannian gradient is

$$\mathrm{grad} f(v) = (I - vv^T)\left(2A^T D^{-1}Av - \left(v^T A^T D^{-1}\frac{\partial D}{\partial v_1}D^{-1}Av, \ \ldots, \ v^T A^T D^{-1}\frac{\partial D}{\partial v_n}D^{-1}Av\right)^T\right).$$

For the sphere manifold given by

$$BS^{n-1} = \{v \in R^n : v^T v = 1\},$$

the unique geodesic on $BS^{n-1}$ emanating from v along the direction $h = -\mathrm{grad} f(v)$ is given by (Ref. 25, Theorem 2.1.4).

Lemma 5.1: Let $v \in BS^{n-1}$ and let $h \in T_v(BS^{n-1})$ be any tangent vector at v having unit length. Then the unique geodesic on $BS^{n-1}$ emanating from v along the direction h is given by $t \to v\cos t + h\sin t$.

Therefore we have all the information needed to carry out the generalized Armijo-gradient algorithm, and we know that the algorithm is globally convergent with a linear convergence rate.
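As an illustration, the sketch below runs the generalized Armijo-gradient algorithm on the sphere for a simplified instance in which D is held constant, so that the objective reduces to $f(v) = v^TQv$ with $Q = A^TD^{-1}A$ symmetric; this simplification, the function name, and the parameter values are assumptions made only to keep the example short. Geodesics follow Lemma 5.1.

```python
import numpy as np

def sphere_armijo_gradient(Q, v0, alpha=0.1, beta=0.5, tol=1e-8, max_iter=500):
    """Generalized Armijo-gradient on BS^{n-1} for f(v) = v^T Q v, Q symmetric."""
    f = lambda v: float(v @ Q @ v)
    v = v0 / np.linalg.norm(v0)
    for _ in range(max_iter):
        g = 2.0 * (Q @ v - f(v) * v)              # (I - v v^T)(2 Q v): Riemannian gradient
        ng = np.linalg.norm(g)
        if ng < tol:
            break
        geo = lambda t: np.cos(t * ng) * v - (np.sin(t * ng) / ng) * g   # exp_v(-t g), Lemma 5.1
        t = 1.0
        while f(geo(t)) - f(v) > -alpha * t * ng**2:   # Armijo condition (3) with h = -g
            t *= beta
        v = geo(t)
    return v        # approaches an eigenvector for the smallest eigenvalue of Q
```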

6 Conclusions

Globally convergent optimization algorithms have been investigated in a uniform framework that deals with both unconstrained and constrained optimization problems. In this uniform framework, both unconstrained and constrained optimization problems are optimizations on Riemannian manifolds. We have established a complete analysis for this uniform framework. Global convergence is guaranteed by introducing the Armijo one-dimensional search; the search does not need to find the global optimum along the geodesic but still guarantees global convergence. Convergence rates for these algorithms are obtained under certain conditions. It is shown that these algorithms can be very efficient for some special manifolds whose geodesics enjoy simple analytic expressions. However, there is at present no easy way to obtain the geodesics of general Riemannian manifolds; this is where more effort should be made.


References

[1] Fletcher, R., Practical Methods of Optimization, John Wiley and Sons, New York, New York, 1987.

[2] Nocedal, J., and Wright, S.J., Numerical Optimization, Springer Series in Operations Research, Springer, New York, New York, 1999.

[3] Luenberger, D.G., The Gradient Projection Method along Geodesics, Management Science, Vol. 18, pp. 620-631, 1972.

[4] Luenberger, D.G., Linear and Nonlinear Programming, Addison-Wesley, Reading, Massachusetts, 1984.

[5] Botsaris, C.A., Constrained Optimization along Geodesics, Journal of Mathematical Analysis and Applications, Vol. 79, pp. 295-306, 1981.

[6] Gabay, D., Minimizing a Differentiable Function over a Differentiable Manifold, Journal of Optimization Theory and Applications, Vol. 37, pp. 117-219, 1982.

[7] Rapcsak, T. and Thang, T.T., On Coordinate Representations of Smooth Optimization Problems, Journal of Optimization Theory and Applications, Vol. 86, pp. 459-489, 1995.


[8] Rapcsak, T., Variable Metric Methods along Geodesics, New Trends in Mathematical Programming, Edited by F. Giannessi, S. Komlosi, and T. Rapcsak, Kluwer Academic Publishers, Dordrecht, Holland, pp. 257-275, 1998.

[9] Rapcsak, T. and Thang, T.T., A Class of Polynomial Variable Metric Algorithms for Linear Optimization, Mathematical Programming, Vol. 74, pp. 319-331, 1996.

[10] Rapcsak, T., Smooth Nonlinear Optimization in Rn , Kluwer Academic Publishers, Dordrecht, Holland, 1997.

[11] Udriste, C., Convex Functions and Optimization Methods on Riemannian Manifolds, Kluwer Academic Publishers, Dordrecht, Holland, 1994.

[12] Brockett, R.W., Differential Geometry and the Design of Gradient Algorithms, Proceedings of Symposia in Pure Mathematics, Edited by R. Greene and S.T. Yau, Providence, Rhode Island, Vol. 54, pp. 69-92, 1993.

[13] Chu, M.T., Curves on $S^{n-1}$ that Lead to Eigenvalues or Their Means of a Matrix, SIAM Journal on Algebraic and Discrete Methods, Vol. 7, pp. 425-432, 1986.

[14] Smith, S.T., Geometric Optimization Methods for Adaptive Filtering, PhD Thesis, Harvard University, Cambridge, Massachusetts, 1993.

32

[15] Smith, S.T., Optimization Techniques on Riemannian Manifolds, Fields Institute Communications, Vol. 3, pp. 113-135, 1994.

[16] Yang, Y., Robust System Design: Pole Assignment Approach, PhD Thesis, University of Maryland, College Park, Maryland, 1996.

[17] Yang, Y., Sahinci, E., and Mahmood, W., A New Algorithm on PMD Pulse-Width Compression, OPTIK: International Journal of Light and Electron Optics, Vol. 114, pp. 365-369, 2003.

[18] Hansen, P., Jaumard, B., and Lu, S., Global Optimization of Univariate Lipschitz Functions, I: Survey and Properties, Mathematical Programming, Vol. 55, pp. 251-271, 1992.

[19] Hansen, P., Jaumard, B., and Lu, S., Global Optimization of Univariate Lipschitz Function, II: New Algorithms and Computational Comparison, Mathematical Programming, Vol. 55, pp. 273-292, 1992.

[20] O’Neill, B., Semi-Riemannian Geometry with Applications to Relativity, Academic Press, New York, New York, 1983.

[21] Helgason, S., Differential Geometry, Lie Groups, and Symmetric Spaces, Academic Press, New York, New York, 1978.

[22] Boothby, W.M., An Introduction to Differentiable Manifolds and Riemannian Geometry, Academic Press, Orlando, Florida, 1986.

[23] Polak, E., Computational Methods in Optimization, Academic Press, New York, New York, 1971.

[24] Helmke, U., and Moore, J.B., Optimization and Dynamical Systems, Springer-Verlag, New York, New York, 1994.

[25] Ratcliffe, J.G., Foundations of Hyperbolic Manifolds, Springer-Verlag, London, United Kingdom, 1994.

[26] Rao, C.R., and Mitra, S.K., Generalized Inverse of Matrices and Its Applications, John Wiley, New York, New York, 1971.

[27] De Moor, B., Convergence of an Algorithm for the Riemannian SVD, Open Problems in Mathematical Systems Theory (Communications and Control Engineering), Edited by V. Blondel, E.D. Sontag, J. Willems, and M. Vidyasagar, Springer-Verlag, Berlin, Germany, pp. 95-98, 1999.

