Nonlinear Programming
Marc Teboulle
School of Mathematical Sciences, Tel-Aviv University, Ramat-Aviv, Israel
[email protected], http://www.math.tau.ac.il/teboulle

Tutorial talk presented at the Summer School of ICCOPT-I, July 30-August 4, 2004, RPI, Troy

Opening Remarks...
Most optimization problems are not solvable! This is very good news for us....
An example: present in 2 hours the basic and recent results on NLP!..... a typical constrained nonlinear (ill-posed) problem.....
What "minimal" (another optimization problem!) material should we learn/know? Two main issues: Theory and Computation.
Many commercial packages exist to solve NLP; however, they come as black-box tools. To understand how optimization methods work, their power and their limitations, and whether they do (or do not) solve a problem, we must understand the basic underlying theory.


Contents
Part A. Optimization Theory: Ideas and Principles; Convexity and Duality; Optimality Conditions.
Part B. Optimization Algorithms: Basic and Classical Iterative Schemes; Convergence and Complexity Issues; Modern Interior and Polynomial Methods.
Part C. Some Recent Developments (..biased, of course!..): Interior Proximal Algorithms; Smooth Lagrangian Multiplier Methods; Elementary Algorithms: interior gradient-like schemes, suitable for very large scale problems.

A Short History of Optimization....
Fermat (1629): unconstrained minimization principle
...+160... Lagrange (1789): equality constrained problems (mechanics)
Calculus of variations, 18th-19th century [Euler, Lagrange, Legendre, Hamilton...]
...+150... Karush (1939), Fritz John (47), Kuhn-Tucker (51): the KKT theorem for inequality constraints; the modern optimization era begins...
Engineering applications (1960): optimal control [Bellman, Pontryagin...]
Major developments: the 50's with LP, the 60's-80's for NLP
Polynomial interior point methods for convex optimization: Nesterov-Nemirovsky (1988)
Combinatorial problems via continuous approximations: the 90's
....More theory, algorithmic developments, and many more specific models and applications....

Nonlinear Programming: Formulation
(O)   minimize{f(x) : x ∈ X ∩ C}
• X ⊂ Rn for implicit or simple constraints (here X ≡ Rn)
• C a set of explicit constraints, described by
C = {x ∈ Rn : gi(x) ≤ 0, i = 1, . . . , m;  hi(x) = 0, i = 1, . . . , p}.
All the functions in problem (O) are real valued functions on Rn.
Important special case: X ∩ C ≡ Rn. The unconstrained minimization problem
(U)   minimize{f(x) : x ∈ Rn}
Many methods for constrained problems eventually need to solve some type of problem (U).


Applications of NLP
OPTIMIZATION APPEARS TO BE PRESENT "ALMOST" EVERYWHERE....
Planning, management operations, logistics (it all started with LP..)
Data networks, finance-economics, VLSI design
Pattern recognition, data analysis/mining, resource allocation
Mechanical/structural design, chemical engineering,...
Machine learning, classification
Signal processing, communication systems, tomography......
.....and of course in mathematics itself...!


Definitions and Terminology
(O)   minimize{f(x) : x ∈ C}
A point x ∈ C is called a feasible solution of (O). An optimal solution is any feasible point where the local or global minimum of f relative to C is actually attained.
Definition: Let Nε := Nε(x∗) ≡ a neighborhood of x∗. Then,
x∗ local minimum: f(x∗) ≤ f(x), ∀x ∈ C ∩ Nε
x∗ global minimum: f(x∗) ≤ f(x), ∀x ∈ C
x∗ strict local minimum: f(x∗) < f(x), ∀x ∈ C ∩ Nε, x ≠ x∗
Note: There are also "max" problems... But max F ≡ −min[−F].


How to Solve an Optimization Problem?
Analytically/explicitly: very rarely.... or never....
We try to generate an iterative (descent) algorithm to approximately solve the problem to a prescribed accuracy.
Algorithm: a map A : x → y (start with x to get some new point y)
Iterative: generate a sequence of points, each calculated from the prior point (or points)
Descent: each new point y is such that f(y) < f(x)
Accuracy: eventually, we find some x̂ such that f(x̂) − f(x∗) ≤ ε


A Powerful Algorithm..
Set k = 0, start with x0 somewhere
While xk ∉ D ≡ {set of desirable points} Do { xk+1 = A(xk); k ← k + 1 }
Stop
Expected output(s): {xk} is a minimizing sequence: f(xk) → f∗ (the optimal value) as k → ∞,
and/or, even more, xk → x∗, an optimal solution, denoted via x∗ ∈ argmin{f(x) : x ∈ C} ≡ {x ∈ C : f(x) = inf f}
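As a plain illustration of this loop, here is a minimal Python sketch; the update map A, the target-set test, and the iteration cap are all user-supplied, hypothetical ingredients:

```python
def iterate(A, in_target_set, x0, max_iter=1000):
    """Generic iterative scheme: apply the map A until x lands in the
    desired set D (e.g., approximate stationary points) or we give up."""
    x = x0
    for k in range(max_iter):
        if in_target_set(x):      # x in D: stop
            return x, k
        x = A(x)                  # x^{k+1} = A(x^k)
    return x, max_iter
```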


Some Basic Questions
How do we pick the initial starting point?
How do we construct A so that xk converges to an optimal x∗?
How do we stop the algorithm?
How close is the approximate solution to the optimal one (which we do not know!!)?
How sensitive is the whole process to data perturbations (small and large!)?
How do we measure the efficiency of an algorithm convergent to optimality? Computational cost per iteration? Total complexity?


Emerging Topics and Tools
To answer these questions, we need an appropriate mathematical theory and tools. For example:
Existence of optimal solutions
Optimality conditions
Convexity and duality
Convergence and numerical analysis
Error and complexity analysis
While each algorithm for each type of problem will often require a specific analysis (e.g., exploiting special structure of the problem), the above tools remain essential and fundamental.


Convexity—(See more in [A2-A4])
S ⊂ Rn is convex if the line segment joining any two points of S is contained in it:
x, y ∈ S, λ ∈ [0, 1] =⇒ λx + (1 − λ)y ∈ S
f : S → R is convex if for any x, y ∈ S and any λ ∈ [0, 1],
f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)
A key fact: local minima are also global under convexity.
♣ Convexity plays a fundamental role in optimization, even in nonconvex problems...!
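The defining inequality is easy to probe numerically. A small sketch, using log-sum-exp as one standard convex example; such random testing can refute convexity but never prove it:

```python
import numpy as np

# Probe f(lx + (1-l)y) <= l f(x) + (1-l) f(y) at random points.
f = lambda x: np.log(np.exp(x).sum())        # log-sum-exp, a known convex function

rng = np.random.default_rng(0)
ok = True
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    lam = rng.uniform()
    ok &= f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y) + 1e-12
print(ok)   # True: no violation found
```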


A Simple and Powerful Geometric Result: Separation
Any point outside a nonempty closed convex set C of Rn can be separated from C by a hyperplane H, i.e., let y ∉ C; then ∃ a ∈ Rn, a ≠ 0, and α ∈ R such that
⟨a, x⟩ ≤ α < ⟨a, y⟩, ∀x ∈ C,
and H := {x ∈ Rn : ⟨a, x⟩ = α}.
This result is fundamental, with far reaching consequences, e.g.,
Alternative theorems [see A6]
Optimality conditions
Duality


Existence of Minimizers: inf{f(x) : x ∈ C}
When will f : C → R attain its infimum over C ⊂ Rn?
Classical answer—Weierstrass theorem: a continuous function defined on a compact subset of Rn attains its minimum. This is a topological problem.
How do we get "useful" conditions for testing existence? Mimic Weierstrass: pick x0 ∈ C, set Lf := {x | f(x) ≤ f(x0)}, and consider the equivalent problem
inf{f(x) : x ∈ Lf}
Suitable "compactness" and "continuity" w.r.t. Lf, i.e.,
♣ study the behavior of subsets of Rn at infinity

Asymptotic Cones and Functions: A Short Appetizer
Let ∅ ≠ C ⊂ Rn be closed convex, and f : Rn → R ∪ {+∞} proper, lsc, convex [see A1].
Definition [A-Cone]: The asymptotic cone of C is C∞ := {d ∈ Rn : d + C ⊂ C}.
Proposition: A set C ⊂ Rn is bounded iff C∞ = {0}.
Definition [A-Function]: The asymptotic function f∞ of f is defined by epi(f∞) = (epi f)∞.
Proposition: The A-function is also convex, and for any d ∈ Rn,
f∞(d) = lim_{t→∞} (f(x + td) − f(x))/t,   ∀x ∈ dom f


Back to the Existence of Minimizers
One has to study (Lf)∞. It turns out that (in the convex case)
(Lf)∞ = {d ∈ Rn | f∞(d) ≤ 0}
♣ Topological questions can be handled via calculus rules at infinity.
Example: (P) inf{f0(x) : fi(x) ≤ 0, i ∈ [1, m]}, with convex data.
Result: The optimal solution set is nonempty and compact iff
(fi)∞(d) ≤ 0, ∀i ∈ [0, m] =⇒ d = 0
A shameless commercial! For general results and applications, see: Asymptotic Cones and Functions in Optimization and Variational Inequalities, A. Auslender and M. Teboulle, Springer Monographs in Mathematics, 2003.


Optimality for Unconstrained Minimization
(U)   inf{f(x) : x ∈ Rn},   f : Rn → R a smooth function
Fermat Principle: Let x∗ ∈ Rn be a local minimum. Then,
♠   ∇f(x∗) = 0
This is a first order necessary condition. If ∇f(x∗) = 0, then x∗ is a stationary point; a local minimum must be a stationary point.
Second order necessary condition: nonnegative curvature at x∗, i.e., the Hessian matrix ∇2f(x∗) ⪰ 0 (positive semidefinite).
Sufficient conditions for x∗ to be a local minimum: ∇f(x∗) = 0 and ∇2f(x∗) ≻ 0 (positive definite).
Whenever f is assumed convex, ♠ becomes a sufficient condition for x∗ to be a global minimum of f.
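These conditions can be checked numerically at a candidate point. A minimal sketch for the illustrative choice f(x) = x1^4 + x2^2: the necessary conditions hold at the origin while the sufficient one fails (the Hessian is only semidefinite there), even though the origin is a global minimum:

```python
import numpy as np

# First/second order conditions at x* = 0 for f(x) = x1^4 + x2^2.
def grad(x):
    return np.array([4 * x[0]**3, 2 * x[1]])

def hess(x):
    return np.diag([12 * x[0]**2, 2.0])

x_star = np.zeros(2)
print("stationary :", np.allclose(grad(x_star), 0))        # True
eigvals = np.linalg.eigvalsh(hess(x_star))
print("Hessian PSD:", (eigvals >= -1e-12).all())           # True (necessary holds)
print("Hessian PD :", (eigvals > 0).all())                 # False: sufficient condition
# fails here, yet x* is a global minimum -- PD is not necessary.
```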

Equality Constraints: Lagrange Theorem
(E)   min{f(x) : h(x) = 0, x ∈ Rn}
where f : Rn → R, h : Rn → Rp, with (f, h) ∈ C1.
Lagrange Theorem (necessary conditions): Let x∗ be a local minimum for problem (E). Assume:
(A) {∇h1(x∗), . . . , ∇hp(x∗)} are linearly independent.
Then there exists a unique y∗ ∈ Rp satisfying:
∇f(x∗) + Σ_{k=1}^{p} y∗k ∇hk(x∗) = 0

Inequality constraints lead to more complications....


Optimality Conditions in NLP
(P)   inf{f(x) : x ∈ C},   C ⊂ Rn; f : C → R an arbitrary function
A Basic Optimality Criterion (BOC): Let x∗ ∈ C and assume that the directional derivative of f at x∗ exists.
a. Necessary condition: If x∗ is a local minimum of f over C, then f′(x∗; x − x∗) ≥ 0, ∀x ∈ C. If f is C1, this reduces to ⟨x − x∗, ∇f(x∗)⟩ ≥ 0, ∀x ∈ C.
b. Sufficient condition: Suppose that f is also convex. Then condition (a) is also sufficient for x∗ to be a (global) minimum.
Geometric reformulation: through the normal cone to a set C at a point x̄, defined by NC(x̄) := {d ∈ Rn | ⟨d, x − x̄⟩ ≤ 0, ∀x ∈ C}. Thus one has
⟨x − x∗, ∇f(x∗)⟩ ≥ 0, ∀x ∈ C ⇐⇒ 0 ∈ ∇f(x∗) + NC(x∗)
A variational inequality and a generalized equation.

A Useful Formula for Directional Derivatives
Let {gi}_{i=0}^{m} be smooth functions over Rn and define g(x) := max_{0≤i≤m} gi(x).
Note: even though the gi are smooth, this is not necessarily the case for g.
Proposition: Assume for each i that gi is continuous and differentiable at x∗. Then
g′(x∗; d) = max{⟨d, ∇gi(x∗)⟩ : i ∈ K(x∗)}, ∀d ∈ Rn,
where K(x∗) := {i ∈ [0, m] : gi(x∗) = g(x∗) = max_{0≤i≤m} gi(x∗)}.
Now observe that with F(x) := max{f(x) − f(x∗), g1(x), . . . , gm(x)},
x∗ ∈ argmin{f(x) : g(x) ≤ 0} solves inf_x F(x)
Applying the BOC with an alternative theorem [namely, the use of separation!] plus the above formula is all we need to derive the fundamental theorems characterizing optimality for constrained problems.


First Order Optimality Conditions: The Fritz-John Theorem
(P)   inf{f(x) : g(x) ≤ 0, x ∈ Rn}
where f : Rn → R, g : Rn → Rm, and (f, g) ∈ C1. Let x∗ be a local minimum for problem (P).
Primal form: Then there is no d ∈ Rn s.t. ⟨d, ∇f(x∗)⟩ < 0 and ⟨d, ∇gi(x∗)⟩ < 0, ∀i ∈ I(x∗) := {i : gi(x∗) = 0}.
Dual form: Then there exist λ∗0, λ∗i ∈ R+ (i ∈ I(x∗)), not all zero, satisfying:
λ∗0 ∇f(x∗) + Σ_{i∈I(x∗)} λ∗i ∇gi(x∗) = 0
The weakness of the FJ conditions: λ∗0 ∈ R+ can be equal to zero. To avoid this, we need a further hypothesis on the problem's data, called a constraint qualification.

Constraint Qualifications
(P)   inf{f(x) : g(x) ≤ 0, x ∈ Rn}
with f : Rn → R, g : Rn → Rm smooth. I(x) := {i : gi(x) = 0} is the set of active constraints. CQs are crucial regularity conditions on the problem's data needed to derive optimality and duality results.
Linear independence (LI): {∇gi(x∗)}_{i∈I(x∗)} are linearly independent.
Mangasarian-Fromovitz (MF): ∃ d ∈ Rn : ⟨d, ∇gi(x∗)⟩ < 0, ∀i ∈ I(x∗).
Slater (S): ∃ x̂ : gi(x̂) < 0, ∀i = 1, . . . , m.
One has the following relations between these (CQ): (LI) =⇒ (MF) =⇒ (S)


The KKT Theorem: A System of Equations and Inequalities
(P)   inf{f(x) : g(x) ≤ 0, x ∈ Rn}
Let x∗ be a local minimum for problem (P) and assume that (MF-CQ) holds. Then ∃ y∗ ∈ Rm+ s.t.
∇f(x∗) + Σ_{i=1}^{m} y∗i ∇gi(x∗) = 0   [saddle point in x∗]
gi(x∗) ≤ 0, ∀i ∈ [1, m]   [feasibility ≡ saddle point in y∗]
y∗i gi(x∗) = 0, i = 1, . . . , m   [complementarity]
With convex data + (CQ), the KKT conditions become necessary and sufficient for global optimality. For general NLP (mixed equalities/inequalities), more optimality conditions (first and second order), see [A7].
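For a concrete feel, here is a minimal sketch checking the KKT system at the (analytically known) solution of a toy problem; the problem and its solution are chosen purely for illustration:

```python
import numpy as np

# KKT residuals for: min (x1-1)^2 + (x2-1)^2  s.t.  x1 + x2 - 1 <= 0.
# Analytic solution: x* = (0.5, 0.5) with multiplier y* = 1.
x = np.array([0.5, 0.5]); y = 1.0
grad_f = 2 * (x - 1)                  # gradient of the objective
grad_g = np.array([1.0, 1.0])         # gradient of the constraint
g = x.sum() - 1
print("stationarity   :", np.allclose(grad_f + y * grad_g, 0))  # True
print("feasibility    :", g <= 1e-12)                            # True
print("dual sign      :", y >= 0)                                # True
print("complementarity:", abs(y * g) <= 1e-12)                   # True
```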


Duality: The Lagrangian
(P)   f∗ := inf{f(x) : g(x) ≤ 0, x ∈ Rn},   f : Rn → R, g : Rn → Rm
We assume that there exists a feasible solution for (P) and f∗ ∈ R.
Observation: (P) ⇐⇒ inf_{x∈Rn} sup_{y≥0} {f(x) + ⟨y, g(x)⟩}
Lagrangian associated with (P): L : Rn × Rm+ → R,
L(x, y) = f(x) + ⟨y, g(x)⟩ ≡ f(x) + Σ_{i=1}^{m} yi gi(x).
Definition: A vector y∗ ∈ Rm is called a Lagrangian multiplier for (P) if y∗ ≥ 0 and f∗ = inf{L(x, y∗) : x ∈ Rn}.


Lagrangian Duality
• inf_{x∈Rn} sup_{y∈Rm+} L(x, y): hidden in this equivalent min-max formulation of (P) is another problem, called the dual. Suppose we reverse the inf-sup operations:
sup_{y∈Rm+} inf_{x∈Rn} L(x, y)
Define the dual function: h(y) := inf_{x∈Rn} L(x, y),   dom h = {y ∈ Rm : h(y) > −∞},
and the dual problem:
(D)   h∗ := sup{h(y) : y ∈ Rm+ ∩ dom h}
Note: To avoid h(·) = −∞, additional constraints often emerge through y ∈ dom h.


Dual Problem Properties
The dual problem uses the same data:
(D)   h∗ = sup{h(y) : y ∈ Rm+ ∩ dom h},   h(y) = inf_x L(x, y)
Properties of the pair (P)-(D):
The dual objective h is always concave
The dual problem (D) is always convex (a max of a concave function)
Weak duality holds: f∗ ≥ h∗, and f(x) ≥ h(y) for any feasible pair of (P)-(D)
Valid for any optimization problem: no convexity and/or any other assumptions on the primal data!


Duality: Key Questions for the Pair (P)-(D)
f∗ = inf{f(x) : g(x) ≤ 0, x ∈ Rn};   h∗ = sup{h(y) : y ∈ Rm+}
• Zero duality gap: when is f∗ = h∗?
• Strong duality: when are the inf/sup attained?
• Structure/relations of the primal-dual optimal sets/solutions
Convex data + a constraint qualification on the constraints deliver the answers. Proof: based on the simple geometric separation argument. The inf/sup attainment + the structure of the optimal sets follow via asymptotic functions calculus.
Convex problems are the "nice NLPs"... and much more...

Are There Many Convex Problems?
..More than we used to think... [sometimes after transformation, e.g., geometric programs]
Remember, the dual of any optimization problem is always convex... it can be used at least to approximate the original primal...
Useful convex models: conic problems
min{⟨c, x⟩ : A(x) = b, x ∈ K}
K is a closed convex cone in some finite dimensional space X; ⟨·, ·⟩ an appropriate inner product on X; A a linear map.
Example: Linear programming. X ≡ Rn, K ≡ Rn+, A ∈ Rm×n, b ∈ Rm, c ∈ Rn, and ⟨·, ·⟩ the scalar product in Rn. ....Other examples...?

Semidefinite Programming
Primal:   min_{x∈Rm} {cT x : A(x) ⪰ 0}
Dual:   max_{Z∈Sn} {−tr(A0 Z) : tr(Ai Z) = ci, i ∈ [1, m], Z ⪰ 0}
Here, tr is the trace operator and A(x) := A0 + Σ_{i=1}^{m} xi Ai, each Ai ∈ Sn ≡ the symmetric matrices.
Primal: x ∈ Rm are the decision variables; A(x) ⪰ 0 is a linear matrix inequality.
Dual in conic form: Z ∈ Sn are the decision variables; K ≡ Sn+ is the closed convex cone of p.s.d. matrices.


SDP Features and Applications
♦ Features
SDPs are a special class of convex (nondifferentiable) problems
Computationally tractable: can be approximately solved to a desired accuracy in polynomial time
A very active research area since the mid 90's
♦ Applications—a short list...!
Combinatorial optimization, computational geometry
Control theory, statistics, classification problems
Another useful conic model: second order cone programming...


Part B. Optimization Algorithms
Tractability is a key issue: What optimization problems can we solve? How do we solve them? At what cost? [Our current computers have limited memory... and we do not want to wait too much time [..forever..] to get a solution!]
We need to draw a line between easy and hard problems. Convexity plays a key role in this distinction.


Easy/Hard: Example
(P1)   max{Σ_{j=1}^{n} xj : xj² − xj = 0, j = 1, . . . , n;  xi xj = 0 ∀(i, j) ∈ E}
(P2)   inf x0 subject to
λmin(A(x, xl)) ≥ 0,  l = 1, . . . , k
Σ_{j=1}^{m} aj xjl = bl,  l = 1, . . . , k
Σ_{j=1}^{m} xj = 1
x = (x0, x1, . . . , xm) ∈ Rm+1,  xl ∈ Rm, l = 1, . . . , k
where each A(x, xl) is a symmetric (m+1) × (m+1) matrix, affine in x0, x1, . . . , xm and x1l, . . . , xml, with x1, . . . , xm on its border and x1l, . . . , xml, x0 on its diagonal.
(P1) "looks" much easier than (P2)...

Easy/Hard: Example Ctd.
(P1)   max{Σ_{j=1}^{n} xj : xj² − xj = 0, j = 1, . . . , n;  xi xj = 0 ∀i ≠ j ∈ Γ}
(P2)   min{x0 : λmin(A(x, xl)) ≥ 0,  Σ_{j=1}^{m} aj xjl = bl, l ∈ [1, k],  Σ_{j=1}^{m} xj = 1}
where A(x, xl) is affine in x0, x1, . . . , xm, x1l, . . . , xml.
♠ (P1) has an easy formulation, but is as difficult as an optimization problem can be! The worst case computational effort for n = 256 is 2^256 ≈ 10^77 ≈ +∞!
♠ (P2) has a complicated formulation, but is easy to solve! For m = 100, k = 6 =⇒ 701 variables (≈ 3 times larger), solved in less than 2 minutes to 6 digits of accuracy!
Convex (P2) [complexity grows slowly with (n, ε)] vs. nonconvex (P1) [complexity grows very fast with (n, ε)]


Toward Computation: Approximation Models
Approximation: replace a complicated function by a "simpler" one, close enough to the original. This is the "bread" (and butter..) of numerical analysis.
Linear approximation: Suppose f is differentiable at x. Then, for any y ∈ Rn (notation f′(x) ≡ ∇f(x)):
f(y) = f(x) + ⟨f′(x), y − x⟩ + o(||y − x||);   lim_{t↓0} t−1 o(t) = 0
Quadratic approximation: Suppose f is twice differentiable at x (with Hessian f″(x) ≡ ∇2f(x)). Then,
f(y) = f(x) + ⟨f′(x), y − x⟩ + (1/2)⟨f″(x)(y − x), y − x⟩ + o(||y − x||²).
These models are local; thus, the resulting schemes based on them will share the same (local) properties.


A Generic Unconstrained Minimization Algorithm (U )

min{f (x) : x ∈ Rn }, f ∈ C 1 (Rn )

Start with x ∈ Rn such that ∇f(x) ≠ 0. Compute a new point x+ = x + td, with d ∈ Rn and t > 0 chosen such that we can guarantee
f(x+) = f(x + td) < f(x)
Since f(x + td) = f(x) + t⟨d, ∇f(x)⟩ + o(t), a simple choice is as follows:
d ∈ Rn is called a descent direction if ⟨d, ∇f(x)⟩ < 0
t ∈ (0, +∞) is a stepsize: how far to go in the direction d, called line search
This leads to the simplest scheme: the gradient method.

The Gradient Method
x0 ∈ Rn,   xk+1 = xk + tk dk,   dk = −∇f(xk),
where tk > 0 is the step size. Indeed, with d = −f′(x) ≠ 0, we have ⟨d, f′(x)⟩ = −||f′(x)||² < 0. Thus, it is reasonable to choose a positive step size.
There exist many variants for the choice of tk:
Fixed step size: tk := t > 0, ∀k.
Full/exact line search: find tk := argmin_{t≥0} f(xk + tdk). Used only when this can be solved analytically or efficiently.
Inexact line search: a step size chosen to approximately minimize f along the ray {x + td | t ≥ 0}. This is the most commonly used in practical algorithms, e.g., the Armijo line search, see [B3].
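A minimal sketch of the fixed-step variant on a convex quadratic, where the Lipschitz constant L of the gradient is the largest eigenvalue of Q (problem data illustrative):

```python
import numpy as np

# Gradient method with fixed step t = 1/L on f(x) = 0.5 x^T Q x - b^T x.
def gradient_method(Q, b, x0, n_iter=500):
    L = np.linalg.eigvalsh(Q).max()      # Lipschitz constant of the gradient
    t = 1.0 / L                          # fixed step size, 0 < t < 2/L
    x = x0
    for _ in range(n_iter):
        x = x - t * (Q @ x - b)          # x^{k+1} = x^k - t * grad f(x^k)
    return x

Q = np.array([[3.0, 1.0], [1.0, 2.0]]); b = np.array([1.0, 1.0])
x = gradient_method(Q, b, np.zeros(2))
print(np.allclose(Q @ x, b, atol=1e-6))  # stationarity: Qx = b
```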


A Convergence Result for the Gradient Method
Theorem CGM: Assume f ∈ C1,1_L(Rn) and bounded below. Then, with tk = t, 0 < t < 2L−1, one has
lim_{k→∞} f′(xk) = 0.
Moreover, we have the rate of convergence:
gk := min_{0≤l≤k} ||f′(xl)|| ≤ (k + 1)−1/2 [2L(f(x0) − f∗)]1/2
Thus, we can obtain an (upper) complexity estimate to achieve gk ≤ ε:
k + 1 ≥ (2L/ε²)(f(x0) − f∗) =⇒ gk ≤ ε
Note: the estimate does not depend on the problem's dimension n!


Newton’s Method (U )

minimize {f (x) : x ∈ Rn }

Assumptions: Let x∗ ≡ a local minimum of f. Let f ∈ C2(Rn) with ∇2f(x∗) ⪰ lI, l > 0;
||∇2f(x) − ∇2f(y)|| ≤ M||x − y||, ∀x, y;
x0 is close enough to x∗: ||x0 − x∗|| ≤ r̄ ≡ 2l(3M)−1.
Then, the sequence {xk} produced by
xk+1 = xk − (∇2f(xk))−1 ∇f(xk)
converges locally quadratically to x∗ [see B1-2].
Note the dependence on knowledge of specific (and generally unknown/hard to compute) constants....
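A minimal sketch of the pure Newton iteration; the test function is an illustrative smooth strictly convex choice, not from the lecture:

```python
import numpy as np

# Newton iteration x^{k+1} = x^k - [hess f(x^k)]^{-1} grad f(x^k).
def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    x = x0
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            return x, k
        x = x - np.linalg.solve(hess(x), g)   # Newton step
    return x, max_iter

# f(x) = exp(x1) + x1^2 + x2^2 (strictly convex)
grad = lambda x: np.array([np.exp(x[0]) + 2*x[0], 2*x[1]])
hess = lambda x: np.array([[np.exp(x[0]) + 2, 0.0], [0.0, 2.0]])
x, k = newton(grad, hess, np.array([1.0, 1.0]))
print(x, "after", k, "iterations")   # converges in a handful of steps
```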

Basic Unconstrained Schemes: Summary
x0 ∈ Rn,   xk+1 = xk + tk Wk dk,
where Wk ≻ 0 and tk ≈ argmin_t f(xk + tWk dk).
Wk ≡ I, dk ≡ −∇f(xk): the gradient method.
Wk ≡ ∇2f(xk)−1, tk ≡ 1: Newton's method. Fast local convergence, but it can diverge and even break down (∇2f(xk) degenerate), and it is quite expensive...
Global rate of convergence needs information on topological properties of ∇f, ∇2f... Soon we will see how to avoid that...
Other methods: quasi-Newton [e.g., BFGS] (replaces ∇2f(xk) by some PD matrix), conjugate gradient, trust region.... see [B4]


Constrained Optimization Algorithms
Richer, but much more difficult.... In most algorithms we will face one (or both) of the following:
Solve a sequence of unconstrained/constrained minimization problems
Solve a nonlinear system of equations and inequalities
Thus the importance of having efficient linear algebra methods and software, and a fast and reliable unconstrained routine.
Numerical Optimization ↔ Numerical Linear Algebra


Some Classes of Constrained Optimization Algorithms
Sequential unconstrained minimization: penalty and barrier methods
Sequential linear/quadratic/convex programming
Lagrangian multiplier methods
Interior point/primal-dual methods
Dual methods: decomposition/subgradient/cutting plane
Active set methods
....and more...


Sequential Unconstrained Minimization
(C)   min{f(x) : x ∈ S ⊂ Rn}
Idea: approximate (C) by a sequence of solutions of unconstrained minimization problems.
• Penalty [Courant 1943]: A continuous P(·) is a penalty function for S if P(·) ≥ 0, and P(x) = 0 if and only if x ∈ S.
Replace (C) by
(Ct)   min_{x∈Rn} {Ft(x) ≡ f(x) + tP(x)},   x(t) = argmin{Ft(x)} (t > 0)
For large t, the minimum of (Ct) will be in a region where P is small. We thus expect that, as t → ∞: tP(x(t)) → 0 and x(t) → x∗.

The Penalty Method
Examples of penalty functions:
For inequality constraints S = {x : gi(x) ≤ 0, i = 1, . . . , m}:
P(x) = Σ_{i=1}^{m} max(0, gi(x));   P(x) = Σ_{i=1}^{m} max(0, gi(x))² ← smooth
For equality constraints S = {x : hi(x) = 0, i = 1, . . . , m}:
P(x) = ||h(x)||²,   h : Rn → Rm
The Penalty Algorithm: Let 0 < tk < tk+1, ∀k, with tk → ∞. For each k solve
xk = argmin_x {Ftk(x) ≡ f(x) + tk P(x)}
Convergence: If xk is an exact global minimizer of Ftk, then every limit point of {xk} is a solution of (C).
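A minimal sketch of the penalty algorithm with the smooth penalty max(0, g)², assuming SciPy's general-purpose minimize as the unconstrained inner solver (toy data, chosen for illustration):

```python
import numpy as np
from scipy.optimize import minimize

# Quadratic penalty for min f(x) s.t. g(x) <= 0; t_k increases, the
# subproblems stay unconstrained, warm-started from the previous x^k.
f = lambda x: (x[0] - 2)**2 + (x[1] - 2)**2       # objective
g = lambda x: x[0] + x[1] - 2                      # constraint g(x) <= 0

x = np.zeros(2)
for t in [1.0, 10.0, 100.0, 1000.0]:              # t_k -> infinity
    Ft = lambda x, t=t: f(x) + t * max(0.0, g(x))**2
    x = minimize(Ft, x).x
print(x)   # approaches the solution (1, 1) from the exterior
```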

The Barrier Method: Frisch 58, Fiacco-McCormick 68
Similar idea, but acting from the interior, preventing the iterates from leaving the feasible region.
• Barrier [interior]: A barrier function for S, with int S ≠ ∅, is a continuous function such that B(x) → ∞ as x → boundary of S.
Examples: B(x) = −Σ_{i=1}^{m} [gi(x)]−1,   B(x) = −Σ_{i=1}^{m} log(−gi(x))
Barrier Algorithm: Let 0 < tk < tk+1, ∀k, with tk → ∞. For each k solve
xk = argmin_x {f(x) + (1/tk) B(x)}.
Convergence: Every limit point of {xk} is a solution of (C).
In both penalty/barrier methods there is a compromise: t must be chosen sufficiently large so that x(t) approaches S from the exterior (interior).... BUT.. we do not know how to pick t.. (if chosen too large, ill-conditioning may occur).
To avoid ill-conditioning, do not send t → ∞; one approach: .....use augmented Lagrangian/multiplier methods.....

A Generic Multiplier Method
(P)   min{f(x) : g(x) ≤ 0},   g : Rn → Rm
Lagrangian: L(x, u) = f(x) + uT g(x) (linear in u)
An augmented/general Lagrangian: A(x, u, c) = f(x) + G(g(x), u, c), (u ≥ 0, c > 0)
Multiplier Method: Given {uk, ck}, generate (xk, uk) via:
Find xk+1 = argmin{A(x, uk, ck) : x ∈ Rn}
Dual update rule: uk+1 = E(g(xk+1), uk, ck)
Increase ck > 0 (if necessary).
• G should be explicit and preserve the data properties of (P) (e.g., smoothness)
• E should be a simple explicit formula to update u
• How do we get these objects?

Example: Multiplier Method for Inequality Constraints
(C)   min{f(x) : gi(x) ≤ 0, i = 1, . . . , m},   g := (g1, . . . , gm)T
Quadratic method of multipliers [see B5 for equality constraints]:
xk+1 ∈ argmin{A(x, uk, ck) : x ∈ Rn}
uk+1 = (uk + ck g(xk+1))+,   (ck > 0, z+ := max{0, z})
A(x, u, c) := f(x) + (2c)−1 {||(u + cg(x))+||² − ||u||²}
Drawbacks/advantages:
Separability is lost (if the original problem is separable)
Not C2, and Newton's method can break down
No need to increase the penalty parameter; more robust; uses dual information
More recent approaches allow constructing smooth Lagrangians, so that Newton's method can be applied. Later on.....
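A minimal sketch of this scheme on a toy inequality-constrained problem; SciPy's minimize is assumed for the inner unconstrained step, and the problem data are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

# Quadratic method of multipliers:
# A(x,u,c) = f(x) + (2c)^{-1} ( ||(u + c g(x))_+||^2 - ||u||^2 ),
# u^{k+1} = (u^k + c g(x^{k+1}))_+ , with the penalty parameter c kept fixed.
f = lambda x: (x[0] - 2)**2 + (x[1] - 2)**2
g = lambda x: np.array([x[0] + x[1] - 2.0])

u, c = np.zeros(1), 10.0                   # multiplier and fixed penalty parameter
x = np.zeros(2)
for _ in range(20):
    A = lambda x: f(x) + (np.maximum(u + c * g(x), 0)**2 - u**2).sum() / (2*c)
    x = minimize(A, x).x                   # primal step
    u = np.maximum(u + c * g(x), 0)        # dual update
print(x, u)    # x -> (1, 1), u -> the optimal multiplier (= 2 here)
```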

Sequential "Simpler" Constrained Problems Idea: Given xk ∈ Rn , uk ∈ Rm , solve a sequence of approximate simpler problems: inf{f k (x) : g k (x) ≤ 0} where (f k , g k ) are (local) approximations of the objective and constraint functions. Possible Choices Include: Linear [SLP], Convex [SCP], Quadratic [SQP] approximations of f , or g or both. SQP: quadratic approximation of objective and linearized constraints, i.e., solves a sequence of Quadratic Programs Most attractive feature of SQP: superlinear convergence in neighb. of a solution. A Drawback: Need functions/gradients with high precision


Back to Newton: inf{f(x) : x ∈ dom f}
Self-Concordance Theory [Nesterov-Nemirovsky-90]
Idea: make the convergence analysis coordinate invariant [Newton's method is coordinate invariant.. but its convergence analysis is not!]. Achieved for self-concordant (SC) convex functions:
θ is SC ⇐⇒ ∃M : |θ‴(t)| ≤ M θ″(t)3/2, ∀t ∈ dom θ
Newton revisited with an SC function—The Damped Newton Method (DNM): Start with x0 ∈ dom f. Generate {xk} via
xk+1 = xk − (1 + λ(xk))−1 (f″(xk))−1 f′(xk);   λ(x) := ⟨(f″(x))−1 f′(x), f′(x)⟩1/2
♣ ∀x ∈ dom f with λ(x) > η > 0, one iteration of DNM decreases the value of f at least by the constant h(η) := η − log(1 + η). This is a global result.
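A minimal sketch of the damped Newton step; the test function −Σ log xi + cᵀx is a standard self-concordant example (data illustrative):

```python
import numpy as np

# Damped Newton: lambda(x) = sqrt(grad^T hess^{-1} grad) (the Newton decrement),
# x+ = x - hess^{-1} grad / (1 + lambda(x)).
def damped_newton_step(grad, hess, x):
    g, H = grad(x), hess(x)
    d = np.linalg.solve(H, g)              # Newton direction
    lam = np.sqrt(g @ d)                   # Newton decrement lambda(x)
    return x - d / (1.0 + lam), lam

# f(x) = -sum(log(x_i)) + c^T x is self-concordant on x > 0.
c = np.array([1.0, 2.0])
grad = lambda x: -1.0 / x + c
hess = lambda x: np.diag(1.0 / x**2)
x = np.array([5.0, 5.0])
for _ in range(10):
    x, lam = damped_newton_step(grad, hess, x)
print(x)   # -> (1, 0.5), the unconstrained minimizer 1/c
```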

Newton for Self-Concordant Functions [see B6-8]
f∗ = inf{f(x) : x ∈ dom f}
Given some γ > 0:
Damped phase: λ(xk) > γ; apply damped Newton, which ensures f(xk+1) − f(xk) ≤ −h(γ)
Quadratic phase: λ(xk) ≤ γ; then apply pure Newton, which converges quadratically
Complexity analysis: the total number of iterations to ε accuracy satisfies
# of Newton steps ≤ (f(x0) − f∗)/h(γ) + log log(1/ε)

Note: Absence of unknown constants and problem dimension


Interior Point Methods for Convex Programs
inf{cT x : x ∈ C},   C ⊂ Rn closed convex
The idea goes back to barrier methods, but within a different methodology: basically, one tries to approximately follow the central path generated within the interior of the corresponding feasible set.
Computation of the central path: x∗(µ) = argmin_x {µ⟨c, x⟩ + S(x)},
where S is a self-concordant barrier for the closed convex feasible set of the given optimization problem.
x∗(µ) remains strictly feasible for every µ > 0
x∗(µ) → x∗ optimal as µ → ∞
Can be computed in polynomial time with the use of Newton's method and suitable updating of µ

Primal-Dual Interior Methods
(C)   inf{f(x) : g(x) ≥ 0} ⇐⇒ inf{f(x) : g(x) − v = 0, v ≥ 0}   (g := (g1, . . . , gm)T)
The KKT system with perturbed complementarity:
∇f(x) − ∇g(x)y = 0
VYe = µe;   µ > 0, V := Diag(v), Y := Diag(y), e := (1, . . . , 1)T
g(x) − v = 0
Apply Newton's method to generate the new point: (x+, v+, y+) = (x, v, y) + t(∆x, ∆v, ∆y), with t chosen to ensure (v+, y+) > 0 and that the merit function is sufficiently reduced [see B10].
Advantage: feasibility of x is not required (infeasible interior).
Suitable for large problem sizes (n + m up to 10,000 and more).

Optimization—Summary
Nonconvex problems: most are just not solvable.... lacking theory....
Convex problems:
Local minima are global
Computationally tractable: can be approximately solved to a desired accuracy in polynomial time [though not always efficiently, e.g., the ellipsoid method]
Model many interesting problems
Enjoy a powerful duality theory, which can be used to find bounds/approximations for hard problems, in particular nonconvex quadratic models arising in combinatorial problems.


Mathematical and Computational Challenges
To solve very large scale optimization problems, keeping in mind the trade-off between efficiency and practicality/simplicity.
For self-concordant convex problems, we have efficient polynomial algorithms with high accuracy.
Polynomial algorithms are highly sophisticated: they require information on the Hessian of the objective and constraints [often not available], and a heavy computational cost at each iteration [e.g., computing Newton's step], which is not affordable for very large scale problems.
..... Thus the need to:
Further study the potential of elementary/simple methods (e.g., first order methods, using function and/or gradient information only)
Produce more efficient algorithms within these methods

Part C—Interior Gradient/Prox Schemes
This lecture is based on joint research with A. Auslender, University of Lyon I, France. Details, proofs, and more results can be found in our two recent works:
Interior gradient and epsilon-subgradient descent methods for constrained convex minimization. Mathematics of Operations Research, 29, 2004, 1-26.
A unified framework for interior gradient/subgradient and proximal methods in convex optimization. February 2003 (submitted for publication).
More references on related works we developed on Lagrangian methods, decomposition schemes, semidefinite programming, and variational inequalities are listed at the end of these notes.


Gradient Based Methods: Why?
A main drawback: can be very slow.... But... the main advantages:
Use minimal information, e.g., (f, g)
Often lead to very simple iterative schemes
Complexity per iteration mildly dependent on the problem's dimension
Suitable when high accuracy is not crucial [in many large scale applications, the data is anyway known only roughly..]
For very large scale problems, often remain the only choice
Examples: three gradient-based algorithms widely used in applications:
Clustering: the k-means algorithm
Neuro-computation: the backpropagation (perceptron) algorithm
The EM (expectation-maximization) algorithm in statistical estimation

Main Results: Overview
A unifying framework for analyzing interior gradient and proximal based algorithms for constrained minimization
Global convergence results under minimal assumptions
Smooth Lagrangian multiplier methods
Derivation and analysis of corresponding new (sub)gradient interior schemes for constrained problems
Modified methods with better complexity/efficiency
Applications of the results to specific problem instances, in particular to conic optimization, i.e., semidefinite and second-order conic programs, producing elementary algorithms.


Part I. Interior Proximal Methods
Ideas and a unifying framework
Convergence analysis mechanism
Applications/examples


Two Classical Algorithms
(P)   f∗ = inf{f(x) : x ∈ C̄},
f : Rn → R ∪ {+∞} a lsc, proper, convex function; C ⊂ Rn nonempty, convex, open; C̄ denotes the closure of C. Let d(x, y) := 2−1||x − y||², λk > 0.
• Prox:   xk ∈ argmin{λk f(x) + d(x, xk−1) : x ∈ C̄} ⇐⇒ 0 ∈ λk ∂f(xk) + xk − xk−1 + NC̄(xk)
• (Sub)Grad:   xk ∈ argmin{λk⟨gk−1, x⟩ + d(x, xk−1) : x ∈ C̄} ⇐⇒ 0 ∈ λk gk−1 + xk − xk−1 + NC̄(xk) ⇐⇒ xk = ΠC̄(xk−1 − λk gk−1), where ΠC̄ ≡ (I + NC̄)−1 ≡ the projection map
** Difference: implicit versus explicit schemes **
Note: {xk} produced by either one of the above algorithms does not necessarily belong to C

A Proximal Term Exploiting the Geometry of C
We use a proximal term d(x, y) that plays the role of a distance-like function, satisfying certain desirable properties which in particular:
Force the iterates of the produced sequence to stay in C, and thus automatically eliminate the constraints (hence Interior)
Allow deriving explicit and simple iterative schemes for various interesting optimization models
Lead to convergent and improved methods
Minimal required properties for d:
d(·, v) is a convex function, ∀v
d(·, ·) ≥ 0, and d(u, v) = 0 iff u = v, ∀u, v
• d is not a distance: no symmetry and/or triangle inequality


The Basic Ingredients
(P)   f∗ = inf{f(x) : x ∈ C̄}
Interior Proximal Algorithm (IPA):
x0 ∈ C;   xk ∈ argmin{λk f(x) + d(x, xk−1) : x ∈ C̄},   k = 1, 2, . . . (λk > 0),
where d is some proximal distance. The basic ingredients needed to achieve our goals are:
Pick an appropriate proximal distance d which allows eliminating the constraints.
Given d, find an induced proximal distance H, which will control the behavior of the resulting method, to analyze convergence and complexity.
We begin by defining an appropriate proximal distance d for problem (P).

A Family of Proximal Distances F
Definition: A function d : Rn × Rn → R+ ∪ {+∞} is called a proximal distance with respect to an open convex set C ⊂ Rn if for each y ∈ C it satisfies the following properties:
(P1) d(·, y) is proper, lsc, convex, and C1 on C.
(P2) dom d(·, y) ⊂ C̄, and dom ∂1d(·, y) = C.
(P3) d(·, y) is level bounded on Rn, i.e., lim_{||u||→∞} d(u, y) = +∞.
(P4) d(y, y) = 0.
We denote by F the family of functions d satisfying the definition.
(P1) is needed to preserve convexity of d
(P2) forces the iterate xk to stay in C
(P3) guarantees the existence of such an iterate
Note: By definition d(·, ·) ≥ 0, so that from (P4) we get ∇1d(y, y) = 0, ∀y ∈ C.

The Main Tool
For each given d ∈ F, we generate an induced proximal distance satisfying some desirable properties.
Definition: Given C ⊂ Rn open and convex, and d ∈ F, a function H : Rn × Rn → R+ ∪ {+∞} is called the induced proximal distance to d if
(1) H is finite valued on C × C, with H(a, a) = 0, ∀a ∈ C
(2) ⟨c − b, ∇1d(b, a)⟩ ≤ H(c, a) − H(c, b), ∀a, b, c ∈ C
♠ We write (d, H) ∈ F(C) to identify such a triple [C, d, H]. Likewise, we write (d, H) ∈ F(C̄) for the triple [C̄, d, H] such that there exists H which is finite valued on C̄ × C, satisfies (1)-(2) for any c ∈ C̄, and such that ∀c ∈ C̄ one has H(c, ·) level bounded on C. Clearly, one thus has F(C̄) ⊂ F(C).

Mechanism's Motivation
Not as mysterious as it might look at first sight... Example: the quadratic prox corresponds to the special case C = C̄ = Rn, d(x, y) = 2−1||x − y||², ∇1d(x, y) = x − y. Then
♠   ⟨c − b, ∇1d(b, a)⟩ = H(c, a) − H(c, b) − H(b, a)
holds with equality (Pythagoras theorem!) and with induced H ≡ d.
Methods with d ≡ H are called self-proximal. There are several examples of more general self-proximal methods, for various types of constraint sets C [see C].


Main Results
Within this framework, we can derive:
Global rate of convergence/efficiency estimates in terms of function values
Convergence of limit points of the sequence produced by IPA
Global convergence of the sequence {xk} to an optimal solution of (P), under additional assumptions on the induced proximal distance H, akin to the properties of norms; for reference this is denoted by F+(C), see [C1-5].


A Typical Convergence Result
Theorem (prox-convergence): Let (d, H) ∈ F+(C) and let {xk} be the sequence generated by the interior prox: xk ∈ C, gk ∈ ∂f(xk) s.t.
λk gk + ∇1d(xk, xk−1) = 0.
Set σn := Σ_{k=1}^{n} λk. Then the following hold:
limn→∞ σn = +∞ =⇒ {f(xk)} converges to f∗
Global rate of convergence estimate: f(xn) − f(x) = O(σn−1), ∀x ∈ C
If X∗ ≠ ∅, then xk → x∗, an optimal solution of (P).


Self-Proximal Methods
We take d(x, y) := H(x, y) = Dh(x, y), with Dh a Bregman proximal distance given by:
Dh(x, y) := h(x) − h(y) − ⟨∇h(y), x − y⟩
It can be verified that (P1)-(P4) hold for H. Depending on the choice of h, one has either (d, H) = (Dh, Dh) ∈ F(C) or (d, H) = (Dh, Dh) ∈ F+(C).
When C = Rn with h = 2−1||·||², then Dh(x, y) = 2−1||x − y||², and with (d, H) = (Dh, Dh) ∈ F+(Rn), the IPA is exactly the classical prox.
Several interesting special cases for the pair (d, H), leading to self-proximal schemes for various types of constraints, include: SDP, SOC, and convex programs, see [C7-11]

Example: Semidefinite Constraints, the Cone C = Sn+
Sn+ (Sn++) ≡ symmetric p.s.d. (p.d.) matrices. Let
h1 : Sn+ → R, h1(x) = tr(x log x);
h3 : Sn++ → R, h3(x) = −tr(log x) = −log det(x)
For any y ∈ Sn++, let
d1(x, y) = tr(x log x − x log y + y − x), with dom d1(·, y) = Sn+,
d3(x, y) = −log det(xy−1) + tr(xy−1) − n, with dom d3(·, y) = Sn++.
With H ≡ di, one has (d1, H) ∈ F(Sn+) and (d3, H) ∈ F(Sn++).
Similarly, we can handle
C = {x ∈ Rm : B(x) − B0 ∈ Sn+};   B(x) = Σ_{i=1}^{m} xi Bi, Bi ∈ Sn, ∀i ∈ [0, m]

IPA that are Not Self-Proximal: ϕ-Divergences
Given a scalar convex function ϕ satisfying some conditions (call the class Φr, see [C12-13]), we define a ϕ-divergence proximal distance by
dϕ(x, y) = Σ_{i=1}^{n} yi^r ϕ(xi/yi),   r = 1, 2
For any ϕ ∈ Φr, one verifies that dϕ(·, ·) ≥ 0, with equality iff the arguments coincide. Can be easily extended to handle polyhedral constraints.
Examples of functions in Φ1, Φ2:
ϕ1(t) = t log t − t + 1, dom ϕ1 = [0, +∞)
ϕ2(t) = −log t + t − 1, dom ϕ2 = (0, +∞)
ϕ3(t) = 2(√t − 1)², dom ϕ3 = [0, +∞)


Example: The Class Φ2, with dϕ(x, y) = Σ_{j=1}^{n} yj² ϕ(xj/yj)
Let ϕ(t) = µp(t) + (ν/2)(t − 1)², with p ∈ Φ2. One has dϕ ∈ F, and it can be proven that, ∀a, b ∈ Rn++, ∀c ∈ Rn+:
♣   ⟨c − b, ∇1dϕ(b, a)⟩ ≤ η(||c − a||² − ||c − b||²),
where η = 2−1(µ + ν). With H(x, y) := η||x − y||², one obtains (dϕ, H) ∈ F+(C) and all the convergence results hold.
Note: ♣ behaves like the usual prox in Rn... but here it is valid on the nonnegative orthant!


A Powerful Application of Interior Prox: General Augmented Lagrangian Methods
An example: take dϕ(u, v) = Σ_{j=1}^{m} vj² ϕ(uj/vj), with ϕ ∈ Φ2.
Applying IPA on the dual (D) of (..remember, (D) is defined on Rm+...)
(P)   min{f0(x) : fi(x) ≤ 0, i ∈ [1, m]}
yields the multiplier method:
LMM: Let u0 ∈ Rm++ and λk ≥ λ > 0, ∀k ≥ 1; generate {xk, uk} via
xk ∈ argmin{H(x, uk−1, λk) : x ∈ Rn}
uik = uik−1 (ϕ∗)′(λk fi(xk)/uik−1),   i = 1, . . . , m
Here H(x, u, λ) := f0(x) + λ−1 Σ_{i=1}^{m} ui² ϕ∗(λfi(x)/ui), (λ > 0, u > 0).
Using other d yields many other old and new LMMs: this provides a unified framework for their analysis, and can be extended to SDP and VI problems [see refs].

Important Example: The Log-Quad Proximal Kernel
Let ν > µ > 0 be given fixed parameters, and
Φ2 ∋ ϕ(t) = (ν/2)(t − 1)² + µ(t − log t − 1),   t > 0
The conjugate ϕ∗ is explicitly given [see A-LQM] and satisfies some remarkable properties:
dom ϕ∗ = R, and ϕ∗ ∈ C∞(R)
(ϕ∗)′(s) = (ϕ′)−1(s) is Lipschitz for all s ∈ R, with constant ν−1: (ϕ∗)″(s) ≤ ν−1, ∀s ∈ R.
For given data fi ∈ C∞(Rn) in (P), the resulting H ∈ C∞(Rn)


Computation with the Log-Quad Multiplier Method: An Example
Robust [no parameter tuning] + the computational effort does not increase with dimension: n × m = 1000 × 50,000. Average results over 100 executions on a quadratically constrained random model.


A Typical Convergence Result
Applying the theorem on prox-convergence, under very mild and standard assumptions on the primal (P), e.g., the optimal solution set of (P) bounded + Slater, one obtains:
Theorem (convergence for LMM): Let {xk, uk} be the sequence generated by the previous LMM with ϕ ∈ Φ2. Then:
All limit points of {xk, uk} are optimal solutions of (P) × (D)
The dual sequence {uk} → u∗, an optimal solution of the dual (D)
In particular [and without any further assumptions, such as strict complementarity], the following improved global convergence rate estimate holds for the dual objective h: h(u∗) − h(un) = o(n−1)


Part II. Interior (Sub)Gradient Methods for Constrained Minimization
Basic interior (sub)gradient methods
A general convergence result
Algorithms for conic optimization: theory and examples
A more efficient O(1/k²) interior gradient algorithm


A Basic Interior Gradient Algorithm for Constrained Minimization over C
The basic step of a (sub)gradient method over Rn is
xk = xk−1 − λk gk−1 ⇐⇒ xk ∈ argmin{λk⟨gk−1, x⟩ + 2−1||x − xk−1||²}
Thus, to solve min{f(x) : x ∈ C}, replace ||·||² by some d ∈ F.
Basic Interior Gradient (BIG): Take d ∈ F. Let λk > 0 and generate the sequence {xk} via
xk ∈ argmin{λk⟨gk−1, x⟩ + d(x, xk−1)}
Building on the material previously developed, it is possible to establish various types of convergence results for various instances of the triple [C, d, H]. We focus on conic models.


Conic Optimization Models
(M)   inf{f(x) : x ∈ C ∩ V},
where V := {x : Ax = b}, with b ∈ Rm, A ∈ Rm×n, n ≥ m; f : Rn → R ∪ {+∞} is convex, lsc. We assume that ∃x0 ∈ dom f ∩ C : Ax0 = b.
We assume also that f is continuously differentiable with ∇f Lipschitz on C ∩ V, with Lipschitz constant L, i.e., there exists L > 0 such that
||∇f(x) − ∇f(y)|| ≤ L||x − y||,   ∀x, y ∈ C ∩ V
Notation: f ∈ C1,1(C ∩ V).


Applying the Basic Scheme BIG to Solve Problem (M)
For solving problem (M), we propose the following basic iteration:
Given d(·, y) σ-strongly convex over C ∩ V
Given a step-size rule for choosing λk (various step-size rules are possible)
At each step k, starting from a point x0 ∈ C ∩ V, the sequence xk ∈ C ∩ V is computed via the relation
xk = u(λk∇f(xk−1), xk−1) = argmin{λk⟨∇f(xk−1), z⟩ + d(z, xk−1) | z ∈ V}
Theorem (convergence of BIG): With (d, H) ∈ F+(C), {xk} converges to an optimal solution of (P), and the following global rate estimate holds:
f(xn) − f∗ = O(n−1)


Application Examples [C14-17]
Consider the functions d, with (d, H) ∈ F(C), which are regularized distances of the following form:
d(x, y) = p(x, y) + (σ/2)||x − y||², with p ∈ F
We can derive explicit gradient-like algorithms via the formula
u(v, x) = argmin_z {⟨v, z⟩ + d(z, x)}
for:
Semidefinite programs
Second-order conic problems
Convex minimization over the unit simplex


Convex Minimization over the Unit Simplex, C = ∆
Take C = Rn+, A = eT, b = 1, i.e., V = {x : Σ_{j=1}^{n} xj = 1}; then
(M)   inf{f(x) : x ∈ ∆},   with ∆ = {x ∈ Rn : Σ_{j=1}^{n} xj = 1, x ≥ 0}
This is an interesting special case of standard conic optimization which arises in applications. We will concentrate on Mirror Descent type Algorithms (MDA) [Nemirovsky-Yudin-1983].
[Beck-Teboulle-2003] have shown that the MDA can be simply viewed as a projected subgradient algorithm with strongly convex Bregman proximal distances. As a result, they proposed to use the entropy kernel.

The Entropic Mirror Descent Algorithm (EMDA)
h(x) := Σ_{j=1}^{n} xj log xj if x ∈ ∆, +∞ otherwise
The entropy kernel is 1-strongly convex w.r.t. the norm ||·||1, i.e.,
⟨∇h(x) − ∇h(y), x − y⟩ ≥ ||x − y||1²,   ∀x, y ∈ ∆
Hence so is the resulting dh ≡ E, defined by:
d(z, x) ≡ E(z, x) = Σ_{j=1}^{n} zj log(zj/xj) if (z, x) ∈ ∆ × ∆+, +∞ otherwise
This produces the Entropic Mirror Descent Algorithm (EMDA).


Simple Formula for the EMDA with d ≡ E
The problem u(v, x) = argmin_{z∈∆} {⟨v, z⟩ + E(z, x)} can be easily solved:
uj(v, x) = xj exp(−vj) / Σ_{i=1}^{n} xi exp(−vi),   j = 1, . . . , n
EMDA: Start with x0 = n−1e. For k = 1, . . . , with vj = gjk−1, ∀j = 1, . . . , n:
xk = u(λk v, xk−1),   λk = (√(2 log n)/Lf) · k−1/2
Theorem: The sequence generated by EMDA satisfies, for all k ≥ 1,
min_{1≤s≤k} f(xs) − min_{x∈∆} f(x) ≤ √(2 log n) · max_{1≤s≤k} ||gs||∞ / √k
Here the objective function is supposed Lf-Lipschitz on ∆.
Outperforms the classical projected gradient by a factor of (n/log n)1/2
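A minimal sketch of EMDA with the closed-form multiplicative update; a linear objective is used for illustration, and shifting by v.min() is a numerical safeguard that cancels after normalization:

```python
import numpy as np

def emda(grad, n, Lf, n_iter=2000):
    x = np.full(n, 1.0 / n)                                  # x0 = e/n
    for k in range(1, n_iter + 1):
        lam = np.sqrt(2 * np.log(n)) / (Lf * np.sqrt(k))     # step size from the theorem
        v = lam * grad(x)
        v -= v.min()                                         # safeguard against overflow
        w = x * np.exp(-v)                                   # x_j exp(-lam g_j)
        x = w / w.sum()                                      # closed-form prox step on the simplex
    return x

# min <c, x> over the simplex: the solution puts all mass on argmin(c).
c = np.array([3.0, 1.0, 2.0])
x = emda(lambda x: c, n=3, Lf=np.abs(c).max())
print(np.round(x, 3))   # concentrates on coordinate 1
```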


Improving the Efficiency of Interior Gradient Methods
The classical gradient method for minimizing a C1,1_L function over Rn exhibits an O(k−1) global convergence rate estimate.
[Nesterov-1988] developed what he called an "optimal algorithm" for smooth convex minimization: he was able to improve the efficiency of the gradient method by constructing a method that keeps its simplicity, but with the faster rate O(k−2).
♣ Question: Can this be extended to constrained problems using interior gradient methods?
We answer this question positively for a class of interior gradient methods, leading to an equally simple but more efficient interior gradient algorithm for convex conic problems.


Problem's Setting: Back to the Conic Model
(M)   inf{f(x) : x ∈ C ∩ V},
where V := {x : Ax = b}, with b ∈ Rm, A ∈ Rm×n, n ≥ m; f : Rn → R ∪ {+∞} is convex, lsc; ∃x0 ∈ dom f ∩ C : Ax0 = b.
The optimal solution set X∗ is nonempty.
f is continuously differentiable with ∇f Lipschitz on C ∩ V, with Lipschitz constant L.


Generating the Sequence {qk}
Basic idea: Build a sequence of functions {qk}_{k≥0} that approximates f.
We take d ≡ H ∈ F, with H a Bregman proximal distance with kernel h, σ-strongly convex on C ∩ V. For every k ≥ 0, we construct the sequence {qk(x)} recursively via:
q0(x) = f(x0) + cH(x, x0),
qk+1(x) = (1 − αk)qk(x) + αk lk(x, yk),
lk(x, yk) = f(yk) + ⟨x − yk, ∇f(yk)⟩.
Here c > 0 and αk ∈ [0, 1). The point x0 is chosen such that x0 ∈ C ∩ V. The point yk ∈ C is arbitrary and built-in within the algorithm [see C18].


The Improved Interior Gradient Algorithm (IGA)
Step 0. Choose a point x0 ∈ C ∩ V and a constant c > 0. Set z0 = x0 = y0, c0 = c, λ = σL−1.
Step k. For k ≥ 0, compute:
αk = [√((ckλ)² + 4ckλ) − ckλ]/2
yk = (1 − αk)xk + αk zk
zk+1 = argmin_{x∈C∩V} {⟨x, (λ/αk)∇f(yk)⟩ + H(x, zk)} = u((λ/αk)∇f(yk), zk)
xk+1 = (1 − αk)xk + αk zk+1,   ck+1 = (1 − αk)ck
• The computational work is exactly that of the interior gradient method, through zk+1; the remaining steps involve trivial computations.
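A minimal sketch of IGA on the unit simplex with the entropy kernel, so that the z-update u(·, ·) has the same closed form as in the EMDA slide; σ = 1 is assumed for the entropy kernel, and the problem data are illustrative:

```python
import numpy as np

def iga_simplex(grad, L, n, c=1.0, n_iter=300):
    x = z = np.full(n, 1.0 / n)                 # x0 = z0 = e/n, interior point
    lam, ck = 1.0 / L, c                        # lambda = sigma/L with sigma = 1
    for _ in range(n_iter):
        ak = (np.sqrt((ck*lam)**2 + 4*ck*lam) - ck*lam) / 2
        y = (1 - ak) * x + ak * z               # y^k
        v = (lam / ak) * grad(y)
        v -= v.min()                            # safeguard: cancels after normalization
        w = z * np.exp(-v)                      # entropic update u((lam/ak) grad f(y), z)
        z = w / w.sum()                         # z^{k+1}
        x = (1 - ak) * x + ak * z               # x^{k+1}
        ck *= (1 - ak)                          # c_{k+1}
    return x

# Toy test: min ||x - p||^2 over the simplex; since p is in the simplex, x* = p.
p = np.array([0.2, 0.5, 0.3])
x = iga_simplex(lambda x: 2 * (x - p), L=2.0, n=3)
print(np.round(x, 4))   # approximately p
```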

The Improved Convergence Rate Estimate for IGA
Theorem: Let {xk}, {yk} be the sequences generated by IGA and let x∗ be an optimal solution of (P). Then, for any k ≥ 0, we have
f(xk) − f(x∗) = O(1/k²)
and the sequence {xk} is minimizing, i.e., f(xk) → f(x∗). Thus, to solve (P) to accuracy ε > 0, one needs no more than O(1/√ε) iterations of IGA. This is a reduction by a square root factor in comparison to BIG.
Note: IGA can be used to solve convex minimization over the unit simplex and over the spectrahedron in Sn+ with this improved global convergence rate estimate.
Extension with d ≠ H: Open?

Conclusion

Optimizers are not (yet!) out of a job......

Thank you for listening.


Additional Material and References
In the following you will find some complementary material, with more details and results, on which I will talk very quickly (or not at all!) during the lecture. It is organized in three appendices [A, B, C], referring to the corresponding parts of the lecture, and [R] for the references.
Appendix A: Some complements for optimization theory
Appendix B: More on algorithms
Appendix C: Further details on interior gradient/prox
References [R]: Pointers to some basic books, and our recent works related to Part C


A1. General Definitions/Properties for Arbitrary Functions
In optimization problems it is convenient to work with extended-real-valued functions, i.e., functions which take values in R ∪ {+∞} = (−∞, +∞] instead of just finite valued functions, i.e., taking values in R = (−∞, +∞). This allows rewriting the constrained problem:
inf{h(x) : x ∈ C} ⇐⇒ inf{f(x) : x ∈ Rn}, with f := h + δC, where δC(x) = 0 if x ∈ C and +∞ otherwise.
The rules of arithmetic are thus extended to include ∞ + ∞ = ∞; α · ∞ = ∞, ∀α > 0; and 0 · ∞ = 0.
Let f : Rn → R ∪ {+∞}. The effective domain of f is the set dom f := {x ∈ Rn | f(x) < +∞}. A function is called proper if f(x) < ∞ for at least one x ∈ Rn and f(x) > −∞, ∀x ∈ Rn; otherwise the function is called improper.
Geometrical objects associated with f, the epigraph and level sets of f:
epi f := {(x, α) ∈ Rn × R | α ≥ f(x)},
lev(f, α) := {x ∈ Rn | f(x) ≤ α}.


A2. Lower Semicontinuity (lsc)
For f : Rn → R ∪ {+∞}, we write
inf f := inf{f(x) : x ∈ Rn},
argmin f := argmin{f(x) : x ∈ Rn} := {x ∈ Rn : f(x) = inf f}.
Lower limits are characterized via
lim inf_{x→y} f(x) = min{α ∈ R̄ := [−∞, ∞] | ∃xn → y with f(xn) → α}.
Note that one always has lim inf_{x→y} f(x) ≤ f(y).
Definition: The function f : Rn → R ∪ {+∞} is lower semicontinuous (lsc) at x if
f(x) = lim inf_{y→x} f(y),
and lower semicontinuous on Rn if this holds for every x ∈ Rn.
Lower semicontinuity on Rn of a function can be characterized through its level sets and epigraph.
Theorem: Let f : Rn → R̄. The following statements are equivalent:
(a) f is lsc on Rn;
(b) the epigraph epi f is closed in Rn × R;
(c) the level sets lev(f, α) are closed in Rn.


A3. A Few More Definitions from Convex Analysis
For a proper, convex, and lower semicontinuous (lsc) function f : Rn → R ∪ {+∞}:
dom f = {x | f(x) < +∞} ≠ ∅ is its effective domain
f∗(y) = sup{⟨x, y⟩ − f(x) | x ∈ Rn} is its conjugate
For all ε ≥ 0, its ε-subdifferential is ∂εf(x) = {g ∈ Rn | ∀z ∈ Rn, f(z) + ε ≥ f(x) + ⟨g, z − x⟩}. It coincides with the usual subdifferential ∂f ≡ ∂0f whenever ε = 0, and we set dom ∂f = {x ∈ Rn | ∂f(x) ≠ ∅}.
For any closed convex set S ⊂ Rn:
δS denotes the indicator function of S, ri S its relative interior
NS(x) = ∂δS(x) = {ν ∈ Rn | ⟨ν, z − x⟩ ≤ 0, ∀z ∈ S} is the normal cone to S at x ∈ S.

A4. Differentiability of Convex Functions
Under differentiability assumptions, one can check the convexity of a function via the following useful tests. f is convex iff:
(a) when f ∈ C1: f(x) − f(y) ≥ ⟨x − y, ∇f(y)⟩, ∀x, y;
(b) when f ∈ C2: ∇2f(x) is positive semidefinite. (∇2f(x) positive definite =⇒ f strictly convex; the converse is false.)
Directional derivative: Let f : Rn → R ∪ {+∞} be a convex function, and let x be any point where f is finite and d ∈ Rn. Then the limit
f′(x; d) := lim_{τ→0+} (f(x + τd) − f(x))/τ
exists (finite or equal to −∞) for all d ∈ Rn and is called the directional derivative of f at x.


A5. Coercivity and the Asymptotic Function
Definition: The function f : Rn → R ∪ {+∞} is called
(a) level bounded if for each λ > inf f the level set lev(f, λ) is bounded,
(b) coercive if f∞(d) > 0, ∀d ≠ 0.
As an immediate consequence of the definition, we remark that f is level bounded if and only if lim_{||x||→∞} f(x) = +∞, which means that the values of f(x) cannot remain bounded on any unbounded subset of Rn. In the convex case, all these concepts are in fact equivalent.
Proposition: Let f : Rn → R ∪ {+∞} be lsc and proper. If f is coercive, then it is level bounded. Furthermore, if f is also convex, then the following statements are equivalent:
(a) f is coercive.
(b) f is level bounded.
(c) The optimal set {x ∈ Rn | f(x) = inf f} is nonempty and compact.
(d) 0 ∈ int dom f∗.

A6. Alternative Theorems
Two theorems very useful in the study of optimality conditions.
Farkas: Exactly one of the following two systems has a solution:
(F1) Ax = b, x ≥ 0 (A ∈ Rm×n, b ∈ Rm the given data)
(F2) bTy > 0, ATy ≤ 0, y ∈ Rm
Gordan: Exactly one of the following two systems has a solution:
(G1) Ax < 0, x ∈ Rn
(G2) ATy = 0, 0 ≠ y ∈ Rm+ (i.e., y ≥ 0, not all components zero)


A7. More Optimality Conditions for General NLP
(NLP)   min{f(x) : g(x) ≤ 0, h(x) = 0, x ∈ Rn}
with smooth (C2(Rn)) f : Rn → R, g : Rn → Rm, h : Rn → Rp.
Define L : Rn × Rm+ × Rp → R,
L(x, λ, µ) = f(x) + Σ_{i=1}^{m} λi gi(x) + Σ_{k=1}^{p} µk hk(x),
I(x) = {i : gi(x) = 0},
∇2L(x∗, λ∗, µ∗) = the Hessian of L at (x∗, λ∗, µ∗) w.r.t. x,
and the tangent subspace:
M(x) = {d : dT∇gi(x) = 0, i ∈ I(x); dT∇hk(x) = 0, k ∈ [1, p]}


A7-b. First and Second Order Optimality Conditions
Theorem [NC] (necessary conditions): Let x∗ be a local minimum for (NLP). Assume x∗ is regular, namely, for k = 1, . . . , p and i ∈ I(x∗), {∇hk(x∗), ∇gi(x∗)} are linearly independent. Then there exist unique λ∗, µ∗ such that:
∇xL(x∗, λ∗, µ∗) = 0, λ∗i ≥ 0, i = 1, . . . , m, λ∗i = 0 ∀i ∉ I(x∗)   (first order conditions)
dT∇2L(x∗, λ∗, µ∗)d ≥ 0, ∀d ∈ M(x∗)   (second order conditions)
Theorem [SC] (sufficient conditions): Suppose that a feasible point x∗ for (NLP) satisfies:
∇xL(x∗, λ∗, µ∗) = 0, λ∗i ≥ 0, i = 1, . . . , m, λ∗i = 0 ∀i ∉ I(x∗)   (first order conditions)
dT∇2L(x∗, λ∗, µ∗)d > 0, ∀0 ≠ d ∈ M(x∗)   (second order conditions)
λ∗i > 0, ∀i ∈ I(x∗)   (strict complementarity)
Then x∗ is a strict local minimum point for (NLP), i.e., ∃Nε(x∗) s.t. f(x∗) < f(x), ∀x ∈ Nε ∩ S, x ≠ x∗.

A8. Primal-Dual Optimal Solutions
Definition: The pair (x∗, y∗) ∈ Rn × Rm+ is called a saddle point for L if
L(x∗, y) ≤ L(x∗, y∗) ≤ L(x, y∗), ∀x ∈ Rn, ∀y ∈ Rm+.
Proposition (saddle point characterization): (x∗, y∗) ∈ Rn × Rm+ is a saddle point for L iff
(a) x∗ = argmin_{x∈Rn} L(x, y∗) (L-optimality)
(b) x∗ ∈ Rn, g(x∗) ≤ 0 (primal feasibility)
(c) y∗ ∈ Rm+ (dual feasibility)
(d) y∗i gi(x∗) = 0, i = 1, . . . , m (complementarity).
Proposition (sufficient condition for optimality): If (x∗, y∗) ∈ Rn × Rm+ is a saddle point for L, then x∗ is a global optimal solution for NLP.
Note: valid with zero assumptions on the problem's data! However, for nonconvex problems it is in general difficult to find a saddle point.


A9. Closing the Loop...
In the case of convex data (f, gi), the KKT theorem becomes necessary and sufficient for optimality for the convex program
(CP)   min{f(x) : g(x) ≤ 0, x ∈ Rn} = f∗
with f : Rn → R, g : Rn → Rm convex.
KKT Theorem for Convex Programs: Let (CP) be convex with optimal value f∗ < ∞, and assume that a (CQ) holds (for example, Slater). Then x∗ is a global minimum for problem (CP) if and only if there exists λ∗ ∈ Rm+ satisfying the KKT system.
...Equivalent to duality (zero gap)....
Note: Linear equality constraints can also be treated easily, with multipliers λ∗ ∈ Rm (no sign restriction in that case).


B1. Convergence and Rate of Convergence
Convergence of an algorithm by itself is important, but not enough: we want to know how fast/efficiently it happens. Possible approaches:
Computational complexity: theoretically estimates the number of elementary operations needed by a given method to find an exact/approximate optimal solution. Provides worst case estimates, i.e., an upper bound on the number of required operations for a class of problems.
Informational complexity: estimates the number of function/gradient evaluations needed to find an optimal solution [as opposed to the number of computational operations].
Local analysis: local behavior of a method near an optimal solution, but ignores its behavior far from the solution.
Which approach is best and/or should be used? Each has advantages and drawbacks!

Marc Teboulle–Tel-Aviv University – p. 99

B2. Local Asymptotic Rate of Convergence Measures

Let {s_k} ⊂ R be a positive real sequence converging to zero: lim_{k→∞} s_k = 0 (e.g. s_k := ‖x∗ − x^k‖; s_k := |f(x^k) − f(x∗)|)

(Q)-Linear-[fairly fast]: ∃ρ ∈ (0, 1) : s_{k+1}/s_k ≤ ρ, ∀k sufficiently large

Superlinear-[faster]: lim_{k→∞} s_{k+1}/s_k = 0

Quadratic-[very fast]: ∃ρ : s_{k+1} ≤ ρ s_k^2, ∀k sufficiently large

Quadratic =⇒ Superlinear =⇒ Linear
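
These regimes can be read off numerically from the successive ratios s_{k+1}/s_k and s_{k+1}/s_k^2. A small sketch with synthetic sequences (assuming NumPy; the sequences are ours, purely illustrative):

import numpy as np

def rate_report(s):
    """Print the successive ratios s_{k+1}/s_k and s_{k+1}/s_k^2."""
    s = np.asarray(s, dtype=float)
    print("linear ratios   :", s[1:] / s[:-1])
    print("quadratic ratios:", s[1:] / s[:-1] ** 2)

rate_report([0.5 ** k for k in range(1, 7)])         # ratios -> 0.5 : Q-linear
rate_report([0.5 ** (2 ** k) for k in range(1, 6)])  # s_{k+1} = s_k^2 : quadratic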

Marc Teboulle–Tel-Aviv University – p. 100

B3. The Armijo Line Search

This is a successive step-size rule where we require more than just reducing the cost.
Armijo Rule. Fix scalars s > 0, β ∈ (0, 1), σ ∈ (0, 1). Set t_k := β^{m_k} s, where m_k = first integer m ≥ 0 such that: (♦)

f(x^k + β^m s d^k) − f(x^k) ≤ σ β^m s ⟨f′(x^k), d^k⟩

Stepsizes β^m s, m = 0, 1, . . . are tried successively until (♦) is satisfied for m = m_k. So, here we are not satisfied with just "cost improvement"; the amount of improvement has to be sufficiently large, as defined in (♦).
PRACTICAL CHOICES: σ ∈ [10^{−5}, 10^{−1}], β = 0.5 or 0.1.
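
A minimal sketch of the rule, assuming NumPy (the test function, starting point, and default scalars are illustrative):

import numpy as np

def armijo_step(f, grad_f, x, d, s=1.0, beta=0.5, sigma=1e-4):
    """Try t = s, beta*s, beta^2*s, ... until the sufficient-decrease test
    f(x + t d) - f(x) <= sigma * t * <f'(x), d> holds (d a descent direction)."""
    fx, slope = f(x), grad_f(x) @ d      # slope < 0 for a descent direction
    t = s
    while f(x + t * d) - fx > sigma * t * slope:
        t *= beta
    return t

# One gradient step on f(x) = ||x||^2 from x = (1, -2).
f = lambda x: x @ x
g = lambda x: 2 * x
x = np.array([1.0, -2.0])
t = armijo_step(f, g, x, -g(x))
print(t, x - t * g(x))                   # t = 0.5 lands exactly at the minimizer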

Marc Teboulle–Tel-Aviv University – p. 101

B4. Other Unconstrained Minimization Algorithms

Quasi-Newton Methods. Idea: replace the Hessian (or its inverse) by some P.D. matrix. Done by "mimicking" the minimization of a quadratic function f(x) = (1/2) x^T Q x − b^T x. In that case one has f′(x) − f′(y) = Q(x − y), ∀x, y ∈ Rn.
Thus, given a P.D. matrix H_k, we search for a matrix H_{k+1} s.t.

H_{k+1}(f′(x^{k+1}) − f′(x^k)) = x^{k+1} − x^k ←− [QN–Quasi-Newton condition]

There are many solutions satisfying [QN]. One example is BFGS, given by

H_{k+1} = H_k + (1 + (y_k^T H_k y_k)/(d_k^T y_k)) (d_k d_k^T)/(d_k^T y_k) − (d_k y_k^T H_k + H_k y_k d_k^T)/(d_k^T y_k)

where y_k := f′(x^{k+1}) − f′(x^k); d_k := x^{k+1} − x^k.
QN scheme: Start with x^0 ∈ Rn, H_0 ≻ 0.
Iteration k: Set d^k = −H_k f′(x^k); x^{k+1} = x^k + t_k d^k [t_k via stepsize rules]
Compute: d_k = x^{k+1} − x^k; y_k := f′(x^{k+1}) − f′(x^k)
Update matrix: H_k −→ H_{k+1} [e.g. via BFGS or other rules]

Marc Teboulle–Tel-Aviv University – p. 102
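
A sketch of the BFGS update and QN loop, assuming NumPy; the quadratic test problem and the exact line search (valid only for quadratics, where f′(x + p) − f′(x) = Qp) are illustrative choices:

import numpy as np

def bfgs_update(H, d, y):
    """Inverse-Hessian BFGS update; d = x_{k+1}-x_k, y = f'(x_{k+1})-f'(x_k)."""
    dy, Hy = d @ y, H @ y                # curvature d^T y assumed positive
    return (H + (1 + (y @ Hy) / dy) * np.outer(d, d) / dy
              - (np.outer(d, Hy) + np.outer(Hy, d)) / dy)

# Demo on f(x) = 0.5 x^T Q x - b^T x (minimizer solves Qx = b).
Q, b = np.array([[3.0, 1.0], [1.0, 2.0]]), np.array([1.0, 1.0])
grad = lambda x: Q @ x - b

x, H = np.zeros(2), np.eye(2)
g = grad(x)
for _ in range(10):
    if np.linalg.norm(g) < 1e-10:
        break
    p = -H @ g                           # d^k = -H_k f'(x^k)
    Qp = grad(x + p) - g                 # equals Q p for a quadratic f
    t = -(g @ p) / (p @ Qp)              # exact minimizer along p
    x_new = x + t * p
    g_new = grad(x_new)
    H = bfgs_update(H, x_new - x, g_new - g)
    x, g = x_new, g_new
print(x, np.linalg.solve(Q, b))          # the two should agree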

B4-b Other Unconstrained Minimization Algorithms Ctd.

Conjugate Gradients. Initially designed for quadratic problems.
CG: Start with x^0 ∈ Rn; compute f(x^0), d^0 = −f′(x^0)
Iteration k: x^{k+1} = x^k + t_k d^k [t_k exact line search]
Compute: f(x^{k+1}), f′(x^{k+1}), and β_k = ‖f′(x^k)‖^{−2} f′(x^{k+1})^T (f′(x^{k+1}) − f′(x^k)) [Polak–Ribière]
Update: d^{k+1} = −f′(x^{k+1}) + β_k d^k
Various other formulas exist for β_k.

Trust Region Methods. Replace H_k in Newton’s method with A_k := H_k + µ_k I, with µ_k ≥ 0 such that A_k ≻ 0. Equivalent to imposing a constraint on the length of the direction (the "trust region") in the quadratic approximation model:

min_{d∈Rn} {(1/2) d^T H d + g^T d : ‖d‖ ≤ l} (l > 0)

If the constraint is inactive at the Newton direction d^k = −A_k^{−1} f′(x^k), we set µ_k = 0; otherwise µ_k > 0 is chosen such that ‖t_k d^k‖ = l_k.

Marc Teboulle–Tel-Aviv University – p. 103
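
A Polak–Ribière CG sketch under the same illustrative assumptions (NumPy; quadratic objective, so the exact line search has the simple closed form used in the BFGS sketch above):

import numpy as np

def cg_polak_ribiere(grad, x, steps=50, tol=1e-10):
    """Nonlinear CG with the Polak-Ribiere beta; the line search below uses
    f'(x + d) - f'(x) = Q d and is therefore exact only for quadratics."""
    g = grad(x)
    d = -g
    for _ in range(steps):
        if np.linalg.norm(g) < tol:
            break
        Qd = grad(x + d) - g
        t = -(g @ d) / (d @ Qd)              # exact step along d
        x = x + t * d
        g_new = grad(x)
        beta = g_new @ (g_new - g) / (g @ g) # Polak-Ribiere formula
        d = -g_new + beta * d
        g = g_new
    return x

Q, b = np.array([[4.0, 1.0], [1.0, 3.0]]), np.array([1.0, 2.0])
print(cg_polak_ribiere(lambda x: Q @ x - b, np.zeros(2)))
print(np.linalg.solve(Q, b))                 # reached in two CG steps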

B5. A Basic Multiplier Method for Equality Constraints

min{f(x) : h(x) = 0}, h : Rn → Rm

Lagrangian: L(x, u) = f(x) + u^T h(x)
Augmented Lagrangian: A(x, u, c) = L(x, u) + (c/2)‖h(x)‖^2 [AL = Penalized Lagrangian]
Multiplier Method. Given {u^k, c_k}:
1. Find x^{k+1} = argmin{A(x, u^k, c_k) : x ∈ Rn}
2. Update rule: u^{k+1} = u^k + c_k h(x^{k+1})
3. Increase c_k > 0 if necessary.
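
A minimal sketch of steps 1–3 on an illustrative problem, assuming NumPy and SciPy (scipy.optimize.minimize) for the inner unconstrained minimization; the penalty schedule is an arbitrary choice:

import numpy as np
from scipy.optimize import minimize

# Illustrative problem: min x1^2 + x2^2  s.t.  h(x) = x1 + x2 - 1 = 0,
# with solution x* = (0.5, 0.5) and optimal multiplier u* = -1.
f = lambda x: x @ x
h = lambda x: x[0] + x[1] - 1.0

def A(x, u, c):                          # augmented Lagrangian A(x, u, c)
    return f(x) + u * h(x) + 0.5 * c * h(x) ** 2

x, u, c = np.zeros(2), 0.0, 1.0
for _ in range(10):
    x = minimize(A, x, args=(u, c)).x    # step 1: minimize A(., u^k, c_k)
    u = u + c * h(x)                     # step 2: multiplier update
    c = min(10 * c, 1e6)                 # step 3: increase c_k if necessary
print(x, u)                              # -> (0.5, 0.5) and -1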

Marc Teboulle–Tel-Aviv University – p. 104

B5-b Features of Multiplier Methods

A key advantage: it is not necessary to increase c_k to ∞ for convergence (as opposed to "Penalty/Barrier" methods). As a result, A is less subject to ill-conditioning, and more robust.
The AL depends on c but also on the dual multiplier u: better/faster convergence can be expected (rather than keeping u constant).
Useful for designing well-behaved decomposition/splitting schemes.
Extendible to various models such as variational inequalities and semidefinite programs.

Marc Teboulle–Tel-Aviv University – p. 105

B6. Self-Concordance Theory

Idea: make the convergence analysis coordinate invariant [Newton’s method is coordinate invariant... but its convergence analysis is not!]. Achieved for self-concordant convex functions.
Definition-SCF. Let f ∈ C^3(dom f) be convex. Then f is called self-concordant if ∃M_f ≥ 0 such that: (SC)

|D^3 f(x)[d, d, d]| ≤ M_f (D^2 f(x)[d, d])^{3/2}, ∀x ∈ dom f, d ∈ Rn

(Here D^3 f(x)[d, d, d] = (d^3/dt^3) f(x + td)|_{t=0} = ⟨f′′′(x)[d]d, d⟩; with θ(t) ≡ f(x + td):
θ is SC ⇐⇒ |θ′′′(t)| ≤ M θ′′(t)^{3/2}, ∀t ∈ dom θ.)
I.e., the Hessian does not vary too fast in its own metric.

Marc Teboulle–Tel-Aviv University – p. 106

B7. Examples of Self-Concordant Functions

1. Linear and convex quadratic: f(x) = x^T A x − 2b^T x + c, A ∈ S^n_+, dom f = Rn.
Then f′(x) = 2(Ax − b), f′′(x) = 2A, f′′′(x) = 0 =⇒ M_f = 0.
2. Logarithmic barrier: f(x) = − log x, dom f = (0, +∞).
Then f′(x) = −x^{−1}, f′′(x) = x^{−2}, f′′′(x) = −2x^{−3} =⇒ M_f = 2.
3. Log-barrier of a quadratic region: f(x) = − log q(x), q(x) = c + b^T x − 0.5 x^T A x, dom f = {x : q(x) > 0}. Then one verifies that f is SC with M_f = 2.
4. The following functions on R are NOT SC: e^x; x^{−p} (x > 0, p > 0).
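
Items 2 and 4 can be checked symbolically; a small sketch assuming SymPy (the sup of |f′′′|/(f′′)^{3/2} is the best constant M_f):

import sympy as sp

# Log-barrier: the ratio |f'''| / (f'')^{3/2} is identically 2, so M_f = 2.
t = sp.symbols('t', positive=True)
f = -sp.log(t)
print(sp.simplify(sp.Abs(sp.diff(f, t, 3)) / sp.diff(f, t, 2) ** sp.Rational(3, 2)))

# Exponential: the same ratio is exp(-s/2), unbounded as s -> -oo,
# so e^s is not self-concordant on R (no finite M_f works).
s = sp.symbols('s', real=True)
g = sp.exp(s)
print(sp.simplify(sp.Abs(sp.diff(g, s, 3)) / sp.diff(g, s, 2) ** sp.Rational(3, 2)))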

Marc Teboulle–Tel-Aviv University – p. 107

B8. Calculus Rules for SC Functions

Affine invariance of SC: Let A : Rn → Rm be an affine map, A(x) = Ax + b. If f is M_f-self-concordant, then g(x) := f(Ax + b) is self-concordant with M_g ≡ M_f.
f SC =⇒ g = af is SC ∀a ≥ 1, with M_g = a^{−1/2} M_f.
f, g SC =⇒ h = f + g SC, with M_h = max{M_f, M_g}.
Composition with logarithm: Let h : R → R be convex with dom h = (0, +∞) and such that (L): |h′′′(x)| ≤ 3x^{−1} h′′(x), ∀x > 0. Then f(x) := − log(−h(x)) − log x is SC on R++ ∩ {x : h(x) < 0}.
Many functions satisfy (L): −x^p (0 < p ≤ 1); x log x; − log x; (ax + b)^2/x; x^p (−1 ≤ p ≤ 0).
Useful to establish self-concordance of the following (important) functions:

f(x) = − ∑_{i=1}^m log(b_i − a_i^T x); dom f = {x : a_i^T x < b_i, i ∈ [1, m]}
f(x) = − log det X; dom f = S^n_{++}
f(x) = − log(t^2 − ‖x‖^2); dom f = {(x, t) : ‖x‖ < t}

Marc Teboulle–Tel-Aviv University – p. 108

B9. Newton with Self-Concordant Functions

Consider the problem min{f(x) : x ∈ dom f} and the Newton scheme

x_+ = x − f′′(x)^{−1} f′(x)

Theorem - Existence. f attains its minimum over dom f iff there exists x ∈ dom f such that λ(x) < 1 [λ(x) := (f′(x)^T f′′(x)^{−1} f′(x))^{1/2}, the Newton decrement]. For every x with the latter property we can establish the following key results (all estimates are parameter free!):

f(x) − f(x∗) ≤ h∗(λ(x)) [conjugate of h, h∗(s) := −s − log(1 − s)]
(x − x∗)^T f′′(x)(x − x∗) ≤ (h∗)′(λ(x))
λ(x_+) ≤ 2λ^2(x)

The last result provides the region of quadratic convergence (with γ ∈ (0, q), where q solves λ = (1 − λ)^2):

λ(x) < q = 2^{−1}(3 − √5) =⇒ λ(x_+) < λ(x)
Marc Teboulle–Tel-Aviv University – p. 109
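
A one-dimensional illustration, assuming NumPy: for the SC function f(t) = t − log t (our choice; minimizer t∗ = 1) the decrement is λ(t) = |f′(t)|/√(f′′(t)) = |t − 1|, and pure Newton gives exactly λ(x_+) = λ(x)^2, inside the bound λ(x_+) ≤ 2λ^2(x):

import numpy as np

fp  = lambda t: 1 - 1 / t                # f'(t)  for f(t) = t - log t
fpp = lambda t: 1 / t ** 2               # f''(t)
decrement = lambda t: abs(fp(t)) / np.sqrt(fpp(t))   # λ(t) = |t - 1|

t = 0.7                                  # λ = 0.3 < (3 - sqrt(5))/2 ≈ 0.382
for k in range(5):
    print(k, t, decrement(t))            # λ squares at each step: 0.3, 0.09, ...
    t = t - fp(t) / fpp(t)               # pure Newton step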

B9.b Self-Concordant Barrier

Definition. Let F be a self-concordant function. The function F is called a ν-self-concordant barrier [SCB] for the set dom F if for any x ∈ dom F:

max_{u∈Rn} {2⟨F′(x), u⟩ − ⟨F′′(x)u, u⟩} ≤ ν

ν is called the parameter of the barrier. This very general definition can be simplified, assuming that F′′(x) is non-singular:

⟨F′′(x)^{−1} F′(x), F′(x)⟩ ≤ ν

or to:

⟨F′(x), u⟩^2 ≤ ν ⟨F′′(x)u, u⟩, ∀u ∈ Rn, ∀x ∈ dom F

Linear and quadratic functions are not SCB.
Examples of SCB: F(x) = − log x, dom F = R_+; and F(x) = − log q(x), q(x) = −0.5 x^T Q x + ⟨c, x⟩ + d, dom F = {x : q(x) > 0}, Q ∈ S^n_+, are 1-SCB.
Marc Teboulle–Tel-Aviv University – p. 110

B10. Primal-Dual Interior Methods

The KKT system with perturbed complementarity
⇐⇒ ♠ argmin_{x,v} {f(x) − (g(x) − v)^T y − µ ∑_{i=1}^m log v_i}:

∇f(x) − ∇g(x)y = 0 [V := Diag(v), Y := Diag(y)]
V Y e = µe [µ > 0, e := (1, . . . , 1)^T]
g(x) − v = 0

Apply Newton’s method to generate the new point (x_+, v_+, y_+) = (x, v, y) + t(∆x, ∆v, ∆y):

( −∇2L(x, y)  −∇g(x)  ) (∆x)   ( ∇f(x) − ∇g(x)y )
( −∇g(x)^T   V Y^{−1} ) (∆y) = ( µY^{−1}e − g(x) )

t is chosen to ensure (v_+, y_+) > 0 and that the merit function is sufficiently reduced:

M(x, v) = f(x) − µ ∑_{i=1}^m log v_i + (β/2)‖g(x) − v‖^2; µ = δ (v^T y)/m, δ ∈ (0, 1), β > 0

See LOQO for convex and nonconvex problems [Vanderbei-Shanno, 1999].
Marc Teboulle–Tel-Aviv University – p. 111

C1. The Class F+(C)

This class allows us to derive pointwise convergence results; the two properties requested below try to mimic "norms". We write (d, H) ∈ F+(C) (⊂ F(C)) when the function H satisfies the following two additional properties:
(a1) ∀y ∈ C and ∀{y^k} ⊂ C bounded with lim_{k→+∞} H(y, y^k) = 0, one has lim_{k→+∞} y^k = y;
(a2) ∀y ∈ C and every sequence C ⊃ {y^k} −→ y, we have lim_{k→+∞} H(y, y^k) = 0.

Marc Teboulle–Tel-Aviv University – p. 112

C2. The Interior Proximal Algorithm–IPA

Given d ∈ F, λ_k > 0, ε_k ≥ 0. (IPA is well defined, see [].)
Start from a point x^0 ∈ C.
Generate a sequence {x^k} ∈ C, with g^k ∈ ∂_{ε_k} f(x^k) (the ε_k-subdifferential), such that

λ_k g^k + ∇_1 d(x^k, x^{k−1}) = 0.

The IPA can be viewed as an approximate interior proximal method when ε_k > 0 ∀k ∈ N; it becomes exact in the special case ε_k = 0 ∀k ∈ N.
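
For a feel of a single exact step (ε_k = 0): with a linear objective and the entropy kernel, the corresponding entropic prox over the unit simplex has a closed multiplicative form. A minimal NumPy sketch (the simplex setup is our illustrative special case; a general f and d require an inner solver):

import numpy as np

# Entropic prox step for f(x) = <c, x> over the simplex: the minimizer of
# lam*<c, x> + KL(x, y) subject to sum(x) = 1 is y * exp(-lam*c), normalized.
def ipa_step(c, y, lam):
    x = y * np.exp(-lam * c)
    return x / x.sum()

c = np.array([0.3, 1.0, 2.0])
x = np.ones(3) / 3                       # strictly feasible start x^0 in C
for _ in range(200):
    x = ipa_step(c, x, lam=0.5)
print(x)                                 # mass concentrates on argmin_j c_j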

Marc Teboulle–Tel-Aviv University – p. 113

C3. Convergence Results I: Global Rate

Theorem G1. Let (d, H) ∈ F(C) and let {x^k} be the sequence generated by IPA. Set σ_n = ∑_{k=1}^n λ_k. Then the following hold:
(i) f(x^n) − f(x) ≤ σ_n^{−1} H(x, x^0) + σ_n^{−1} ∑_{k=1}^n σ_k ε_k, ∀x ∈ C.
(ii) If lim_{n→∞} σ_n = +∞ and ε_k → 0, then lim inf_{n→∞} f(x^n) = f∗, and the sequence {f(x^k)} converges to f∗ whenever ∑_{k=1}^∞ ε_k < ∞.
(iii) Furthermore, suppose X∗ ≠ ∅, and consider the following cases: (a) X∗ is bounded; (b) ∑_{k=1}^∞ λ_k ε_k < ∞ and (d, H) ∈ F̄(C). Then, under either (a) or (b), the sequence {x^k} is bounded with all its limit points in X∗.

An immediate by-product yields the following global rate of convergence estimate for the exact version of IPA (ε_k = 0, ∀k).
Theorem G2. Let (d, H) ∈ F(C) and let {x^k} be the sequence generated by IPA with ε_k = 0, ∀k. Then f(x^n) − f(x) = O(σ_n^{−1}), ∀x ∈ C.
Marc Teboulle–Tel-Aviv University – p. 114

C4. Convergence Results II: Pointwise Convergence

To establish the global convergence of the sequence {x^k} to an optimal solution of problem (P), we use the class F+(C).
Theorem G3. Let (d, H) ∈ F+(C) and let {x^k} be the sequence generated by IPA. Suppose that the optimal set X∗ of (P) is nonempty, that σ_n = ∑_{k=1}^n λ_k → ∞, that ∑_{k=1}^∞ λ_k ε_k < ∞, and that ∑_{k=1}^∞ ε_k < ∞. Then the sequence {x^k} converges to an optimal solution of (P).

Marc Teboulle–Tel-Aviv University – p. 115

C5. Comments

Note that we have separated the two types of convergence results to emphasize:
The differences and roles played by each of the three classes F+(C) ⊂ F̄(C) ⊂ F(C).
To show that the largest, and least demanding, class F(C) already provides reasonable convergence properties for IPA, with minimal assumptions on the problem’s data.
These aspects are now illustrated by several application examples.

Marc Teboulle–Tel-Aviv University – p. 116

C6. Proximal Distances (d, H): Application Examples

In most situations, when constructing an IPA for solving the convex problem (P), the proximal distance H induced by d will have a special structure, known as a Bregman proximal distance D_h, which is generated by some convex kernel h.
We first recall the special features of a Bregman proximal distance.
We then consider various types of constraint sets C for problem (P), and give many examples of pairs (d, H) for which our convergence results hold.

Marc Teboulle–Tel-Aviv University – p. 117

C7. Bregman-Proximal Distances: Definition

Let h : Rn → R ∪ {+∞} be a proper, lsc, convex function with dom h ⊂ C̄ and dom ∇h = C, strictly convex and continuous on dom h, and C1 on int dom h = C. Define, ∀x ∈ Rn, ∀y ∈ dom ∇h:

H(x, y) := D_h(x, y) := h(x) − [h(y) + ⟨∇h(y), x − y⟩]   (1)

The function D_h enjoys a remarkable three-point identity that plays a central role in the analysis:

H(c, a) = H(c, b) + H(b, a) + ⟨c − b, ∇_1 H(b, a)⟩, ∀a, b ∈ C, ∀c ∈ dom h

To handle the constraint cases C versus C̄, we need to consider two types of convex kernels h.
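
A quick numerical check of the three-point identity for the entropy kernel h(x) = ∑ x_j log x_j on Rn++ (so D_h is the Kullback-Leibler divergence); the random points and NumPy usage are our illustrative choices:

import numpy as np

def D(x, y):                             # D_h(x,y) = h(x) - h(y) - <∇h(y), x-y>
    return np.sum(x * np.log(x / y) - x + y)

grad1_D = lambda b, a: np.log(b / a)     # ∇_1 D_h(b,a) = ∇h(b) - ∇h(a)

rng = np.random.default_rng(0)
a, b, c = (rng.random(4) + 0.1 for _ in range(3))
lhs = D(c, a)
rhs = D(c, b) + D(b, a) + (c - b) @ grad1_D(b, a)
print(np.isclose(lhs, rhs))              # True: the three-point identity holds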

Marc Teboulle–Tel-Aviv University – p. 118

C8. Difference between F(C) and F+(C): Some Examples

We let C = Rn++. More examples will follow.
Example: Separable Bregman proximal distances are the most commonly used in the literature. Let θ : R → R ∪ {+∞} be a proper, convex, lsc function with (0, +∞) ⊂ dom θ ⊂ [0, +∞) and

θ ∈ C2(0, +∞), θ′′(t) > 0 ∀t > 0, lim_{t→0+} θ′(t) = −∞

We denote this class by Θ0 if θ(0) < +∞, and by Θ+ whenever θ(0) = +∞ and θ is nonincreasing. Given θ in either class, define

h(x) = ∑_{j=1}^n θ(x_j) =⇒ D_h is separable

Marc Teboulle–Tel-Aviv University – p. 119

C9. Typical Choices for θ

The first two examples are functions θ ∈ Θ0, i.e., with dom θ = [0, +∞); the last two are in Θ+, i.e., with dom θ = (0, +∞):
θ1(t) = t log t (Shannon entropy)
θ2(t) = (pt − t^p)/(1 − p), with p ∈ (0, 1)
θ3(t) = − log t (Burg’s entropy)
θ4(t) = t^{−1}

Then one can verify that, for the corresponding proximal distances: D_{h1}, D_{h2} ∈ F+(C), while D_{h3}, D_{h4} ∈ F(C).

Marc Teboulle–Tel-Aviv University – p. 120

C.10 Convex Programming: C = {x : f_i(x) ≥ 0, i ∈ [1, m]}

[CP] min{⟨c, x⟩ : f_i(x) ≥ 0, i ∈ [1, m]}

Let f_i : Rn → R be concave and C1 on Rn for each i ∈ [1, m]. We suppose that Slater’s condition holds: ∃x^0 ∈ Rn : f_i(x^0) > 0, ∀i ∈ [1, m]. For θ ∈ Θ+ and x ∈ C let

h_ν(x) = ∑_{i=1}^m θ(f_i(x)) + (ν/2)‖x‖^2, with ν > 0

Set d(x, y) = D_{h_ν}(x, y); then (d, D_{h_ν}) ∈ F(C).

Marc Teboulle–Tel-Aviv University – p. 121

Convex Programming–Continued

An interesting algorithm is then obtained for solving [CP] by choosing θ(t) ≡ θ3(t) = − log t. In this case we obtain:

d(x, y) = D_{h_ν}(x, y) = ∑_{i=1}^m [− log(f_i(x)/f_i(y)) + ⟨∇f_i(y), x − y⟩/f_i(y)] + (ν/2)‖x − y‖^2

The constrained convex program has thus been reduced to performing, at each step, an unconstrained minimization with an objective of the form:

− ∑_{i=1}^m log f_i(x) + (ν/2)‖x‖^2 + ⟨x, L_k⟩

(All "constant" terms depending on k through y^k are collected in L_k.)
Bears similarity with barrier and center methods....
Note: This d(·, y) enjoys other interesting properties; e.g., when the f_i are concave quadratics, d(·, y) is self-concordant for each y ∈ C.
Marc Teboulle–Tel-Aviv University – p. 122

C11. Second Order Cone Constraints: C = Ln+

Let Ln+ := {x ∈ Rn : x_n ≥ (x_1^2 + . . . + x_{n−1}^2)^{1/2}} be the Lorentz cone, and let D_n be the diagonal matrix D_n = diag(−1, . . . , −1, 1).
Define h : Ln++ → R by h(x) = − log(x^T D_n x) + (ν/2)‖x‖^2. Then h is proper, lsc and convex with dom h = Ln++. The Bregman proximal distance associated to h is given by

D_h(x, y) = − log(x^T D_n x / y^T D_n y) + 2 x^T D_n y / (y^T D_n y) − 2 + (ν/2)‖x − y‖^2.

Thus, with d = D_h, we have (D_h, D_h) ∈ F(Ln++). Similarly, we can handle the case C = {x ∈ Rn : Ax − b ∈ Ln+}.

Marc Teboulle–Tel-Aviv University – p. 123

C12. Not Self-Proximal: ϕ-Divergence Kernels on Rn+

Let ϕ : R → R ∪ {+∞} be a lsc, convex, proper function such that dom ϕ ⊂ R+ and dom ∂ϕ = R++. We suppose in addition that ϕ is C2, strictly convex, and nonnegative on R++, with ϕ(1) = ϕ′(1) = 0. We denote by Φ the class of such kernels, by Φ1 the subclass of kernels satisfying

ϕ′′(1)(1 − t^{−1}) ≤ ϕ′(t) ≤ ϕ′′(1) log t, ∀t > 0,

and by Φ2 the subclass satisfying

ϕ′′(1)(1 − t^{−1}) ≤ ϕ′(t) ≤ ϕ′′(1)(t − 1), ∀t > 0.

Examples of functions in Φ1, Φ2 are:
ϕ1(t) = t log t − t + 1, dom ϕ = [0, +∞),
ϕ2(t) = − log t + t − 1, dom ϕ = (0, +∞),
ϕ3(t) = 2(√t − 1)^2, dom ϕ = [0, +∞).
Marc Teboulle–Tel-Aviv University – p. 124

C13. An Important Example: The Log-Quad Proximal Kernel

Let ν > µ > 0 be given fixed parameters, and take ϕ ∈ Φ2 defined by

ϕ(t) = (ν/2)(t − 1)^2 + µ(t − log t − 1), t > 0

Proposition.
(i) ϕ is strongly convex on R++ with modulus ν > 0.
(ii) The conjugate of ϕ is given by ϕ∗(s) = (ν/2) t^2(s) + µ log t(s) − ν/2, where
t(s) := (2ν)^{−1} {(ν − µ) + s + √(((ν − µ) + s)^2 + 4µν)} = (ϕ∗)′(s).
(iv) dom ϕ∗ = R, and ϕ∗ ∈ C∞(R).
(v) (ϕ∗)′(s) = (ϕ′)^{−1}(s) is Lipschitz for all s ∈ R, with constant ν^{−1}.
(vi) (ϕ∗)′′(s) ≤ ν^{−1}, ∀s ∈ R.

Smooth Lagrangian Multiplier method with Log-Quad: handles easily very large scale instances, e.g., (n × m = 1,000 × 50,000); the number of Newton steps does not increase with dimension... Also solves (to local minima) nonconvex problems.... (No proofs for that....!)
Marc Teboulle–Tel-Aviv University – p. 125
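
The closed form for t(s) in (ii) and the Lipschitz bound in (v)-(vi) are easy to sanity check numerically; a sketch with illustrative parameters, assuming NumPy:

import numpy as np

nu, mu = 2.0, 0.5                        # illustrative parameters, nu > mu > 0
phi_prime = lambda t: nu * (t - 1) + mu * (1 - 1 / t)

def t_of_s(s):                           # (phi*)'(s), per the closed-form formula
    a = (nu - mu) + s
    return (a + np.sqrt(a ** 2 + 4 * mu * nu)) / (2 * nu)

s = np.linspace(-5.0, 5.0, 201)
print(np.allclose(phi_prime(t_of_s(s)), s))                 # t(s) solves phi'(t) = s
print(np.max(np.gradient(t_of_s(s), s)) <= 1 / nu + 1e-6)   # slope <= 1/nu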

C14. BIG with Armijo-Goldstein Stepsize Rule

We use a generalized stepsize rule, reminiscent of the one used in the classical projected gradient method.
Algorithm 1: Armijo-Goldstein stepsize rule. Let β ∈ (0, 1), m ∈ (0, 1) and s > 0 be fixed chosen scalars.
Step 0. Start from a point x^0 ∈ C ∩ V.
Step k. Generate the sequence {x^k} ∈ C ∩ V as follows: if ∇f(x^{k−1}) ∈ V⊥, stop. Otherwise, with x^k(λ) = u(λ∇f(x^{k−1}), x^{k−1}), set λ_k = β^{j_k} s, where j_k is the first nonnegative integer j such that

f(x^k(β^j s)) − f(x^{k−1}) ≤ m ⟨∇f(x^{k−1}), x^k(β^j s) − x^{k−1}⟩. [AG]

Set x^k = x^k(λ_k); k ←− k + 1; go to Step k.
With some work... it can be proved that the [AG] stepsize rule is well defined.
Marc Teboulle–Tel-Aviv University – p. 126

C15. Convergence of Algorithm 1

Theorem A1. Let (d, H) ∈ F(C) and let {x^k} be the sequence produced by Algorithm 1. Then:
The sequence {f(x^k)} is nonincreasing and converges to f∗.
Suppose that the optimal set X∗ of problem (M) is nonempty. Then:
(a) if X∗ is bounded, {x^k} is bounded with all its limit points in X∗;
(b) if (d, H) ∈ F+(C), {x^k} converges to an optimal solution of (M), and the following global rate estimate holds: f(x^n) − f∗ = O(n^{−1}).

Marc Teboulle–Tel-Aviv University – p. 127

C16. Cases C = Rn++; Sn++; Ln++

Let d(x, y) = p(x, y) + (σ/2)‖x − y‖^2.

• C = Rn++: take p(z, x) = µ ∑_{j=1}^n x_j^r ϕ(x_j^{−1} z_j), σ ≥ µ > 0, for (z, x) ∈ C × C, with ϕ(t) = − log t + t − 1 [r = 2: the log-quad function]. One obtains:

u_i(v, x) = x_i (ϕ∗)′(−v_i x_i^{−1}), i = 1, . . . , n

• (SDP) Take p(x, y) = tr(− log x + log y + xy^{−1}) − n, ∀x, y ∈ Sn++. One has, ∀x ∈ Sn++, v ∈ Sn:

u(v, x) = (2σ)^{−1} (A(v, x) + (A^2(v, x) + 4σI)^{1/2}), with A(v, x) := σx − v − x^{−1}.

• (SOC) Take p(x, y) = − log(x^T D_n x / y^T D_n y) + 2 x^T D_n y / (y^T D_n y) − 2, ∀x, y ∈ Ln++. It can be shown that:

u(v, x) = s w, with s := (2σ)^{−1} (√(1 + 8σ‖w‖^{−2}) − 1), w := 2τ(x)^{−1} D_n x + v − σx, τ(x) = x^T D_n x.
Marc Teboulle–Tel-Aviv University – p. 128

C17. Modifying the EMDA

We can modify the EMDA with an Armijo-Goldstein step-size rule (since here d ≡ E is 1-strongly convex). Therefore, we can apply Theorem A1, proving that the sequence {x^k} of EMDA, with λ_k defined by the Armijo-Goldstein stepsize rule [AG], converges to an optimal solution of (M). This modified version can be more practical since, in particular, we do not need to know/compute the Lipschitz constant L_f.
Another advantage of the entropy kernel: it is extendible to the SDP constraint set ∆ ≡ {x ∈ Sn : tr(x) = 1, x ⪰ 0}, with d(x, y) := tr(x log x − x log y) on ∆.

C18. The Key Result to Update {x^k}, {y^k} in IGA

Theorem. Let σ > 0, L > 0 be given. Suppose that for some k ≥ 0 we have a point x^k ∈ C ∩ V such that f(x^k) ≤ q_k∗ = min{q_k(x) : x ∈ C ∩ V}. Let α_k ∈ [0, 1), c_{k+1} = (1 − α_k)c_k, and let C ∩ V ∋ {z^k} be generated by

z^{k+1} = argmin{⟨x, (α_k/c_{k+1}) ∇f(y^k)⟩ + H(x, z^k) : x ∈ C ∩ V}

Define

y^k = (1 − α_k)x^k + α_k z^k,
x^{k+1} = (1 − α_k)x^k + α_k z^{k+1}.

Then q∗_{k+1} ≥ f(x^{k+1}) + (1/2)(σ c_{k+1}/α_k^2 − L) ‖x^{k+1} − y^k‖^2.

Therefore, by taking for example L α_k^2 = σ c_k(1 − α_k) = σ c_{k+1}, we can guarantee that q∗_{k+1} ≥ f(x^{k+1}). This leads to the desired interior gradient algorithm.
Marc Teboulle–Tel-Aviv University – p. 130

Marc Teboulle–Tel-Aviv University – p. 130

R. Short Bibliography—Some Books for Parts [A]-[B]

A. Auslender and M. Teboulle, Asymptotic Cones and Functions in Optimization and Variational Inequalities, Springer Monographs in Mathematics, Springer-Verlag, New York, 2003.
A. Ben-Tal and A. Nemirovski, Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications, SIAM Publications, Philadelphia, PA, 2001.
D. Bertsekas, Nonlinear Programming, Athena Scientific, Belmont, Massachusetts, 1999.
A. V. Fiacco and G. P. McCormick, Nonlinear Programming: Sequential Unconstrained Minimization Techniques, Classics in Applied Mathematics, SIAM, Philadelphia, 1990.
O. L. Mangasarian, Nonlinear Programming, McGraw-Hill Publishing Company, 1969.
A. Nemirovski and D. Yudin, Problem Complexity and Method Efficiency in Optimization, John Wiley, New York, 1983.
Y. Nesterov and A. Nemirovski, Interior Point Polynomial Algorithms in Convex Programming, SIAM Publications, Philadelphia, PA, 1994.
J. Nocedal and S. J. Wright, Numerical Optimization, Springer-Verlag, New York, 1999.
J. M. Ortega and W. C. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables, Academic Press, 1970.
R. T. Rockafellar, Convex Analysis, Princeton University Press, 1970.
Marc Teboulle–Tel-Aviv University – p. 131

Refs. on Some of Our Recent Works Related to Part C

A. Auslender, M. Teboulle and S. Ben-Tiba, "Interior proximal and multiplier methods based on second order homogeneous kernels", Mathematics of Operations Research, 24 (1999), 645–668.
A. Auslender and M. Teboulle, "Lagrangian duality and related multiplier methods for variational inequalities", SIAM J. Optimization, 10 (2000), 1097–1115.
A. Auslender and M. Teboulle, "Entropic proximal decomposition methods for convex programs and variational inequalities", Mathematical Programming, 91 (2001), 33–47.
A. Auslender and M. Teboulle, "Interior gradient and epsilon-subgradient descent methods for constrained convex minimization", Mathematics of Operations Research, 29 (2004), 1–26.
A. Auslender and M. Teboulle, "A unified framework for interior gradient/subgradient and proximal methods in convex optimization", February 2003 (submitted for publication).
A. Beck and M. Teboulle, "Mirror descent and nonlinear projected subgradient methods for convex optimization", Operations Research Letters, 31 (2003), 167–175.
J. Bolte and M. Teboulle, "Barrier operators and associated gradient-like dynamical systems for constrained minimization problems", SIAM J. of Control Optimization, 42 (2003), 1266–1292.
M. Doljansky and M. Teboulle, "An interior proximal algorithm and the exponential multiplier method for semidefinite programming", SIAM J. of Optimization, 9 (1998), 1–13.
Marc Teboulle–Tel-Aviv University – p. 132