Convex Optimization and Applications
Lin Xiao
Center for the Mathematics of Information, California Institute of Technology

Acknowledgment: Stephen Boyd (Stanford) and Lieven Vandenberghe (UCLA) and their research groups
CS286a: Mathematics of Information seminar, 10/15/04
Two problems

polyhedron P described by linear inequalities, a_i^T x ≤ b_i, i = 1, . . . , L

Problem 1: find the minimum volume ellipsoid ⊇ P
Problem 2: find the maximum volume ellipsoid ⊆ P

are these (computationally) difficult? or easy?
problem 1 is very difficult
• in practice
• in theory (NP-hard)

problem 2 is very easy
• in practice (readily solved on a small computer)
• in theory (polynomial complexity)
Moral
very difficult and very easy problems can look quite similar
. . . unless we are trained to recognize the difference
Linear program (LP)

    minimize    c^T x
    subject to  a_i^T x ≤ b_i,  i = 1, . . . , m

c, a_i ∈ R^n are parameters; x ∈ R^n is the variable

• easy to solve, in theory and practice
• can easily solve dense problems with n = 1000 variables, m = 10000 constraints; far larger for sparse or structured problems
Polynomial minimization

    minimize    p(x)

p is a polynomial of degree d; x ∈ R^n is the variable
• except for special cases (e.g., d = 2) this is a very difficult problem
• even sparse problems with size n = 20, d = 10 are essentially intractable
• all algorithms known to solve this problem require effort exponential in n
Moral
• a problem can appear∗ hard, but be easy
• a problem can appear∗ easy, but be hard

∗ if we are not trained to recognize them
What makes a problem easy or hard?
classical view:
• linear is easy
• nonlinear is hard(er)
What makes a problem easy or hard?
emerging (and correct) view:

. . . the great watershed in optimization isn’t between linearity and nonlinearity, but convexity and nonconvexity.
— R. Rockafellar, SIAM Review 1993
Convex optimization

    minimize    f_0(x)
    subject to  f_1(x) ≤ 0, . . . , f_m(x) ≤ 0,
                Ax = b

x ∈ R^n is the optimization variable; f_i : R^n → R are convex:

    f_i(λx + (1 − λ)y) ≤ λ f_i(x) + (1 − λ) f_i(y)   for all x, y, 0 ≤ λ ≤ 1

• includes least-squares, linear programming, maximum volume ellipsoid in a polyhedron, and many others
• convex problems are fundamentally tractable
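A minimal sketch of this standard form using the CVXPY modeling package (an assumption of this example, not something used in the talk); the particular objective, constraint functions, and data are illustrative.

    import cvxpy as cp
    import numpy as np

    np.random.seed(0)
    n, m = 5, 3
    A = np.random.randn(m, n)
    b = np.random.randn(m)

    x = cp.Variable(n)
    # f_0: a convex quadratic-plus-l1 objective; f_1: a norm constraint (both convex atoms)
    objective = cp.Minimize(cp.sum_squares(x) + cp.norm(x, 1))
    constraints = [cp.norm(x, 2) - 10 <= 0,   # f_1(x) <= 0
                   A @ x == b]                # Ax = b
    prob = cp.Problem(objective, constraints)
    prob.solve()
    print("optimal value:", prob.value)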
Example: Robust LP

    minimize    c^T x
    subject to  Prob(a_i^T x ≤ b_i) ≥ η,  i = 1, . . . , m

coefficient vectors a_i IID, N(ā_i, Σ_i); η is the required reliability

• for fixed x, a_i^T x is N(ā_i^T x, x^T Σ_i x)
• so for η = 50%, the robust LP reduces to the LP

    minimize    c^T x
    subject to  ā_i^T x ≤ b_i,  i = 1, . . . , m

  and so is easily solved
• what about other values of η, e.g., η = 10%? η = 90%?
constraint Prob(a_i^T x ≤ b_i) ≥ η is equivalent to

    ā_i^T x + Φ^{-1}(η) ‖Σ_i^{1/2} x‖_2 − b_i ≤ 0

Φ is the CDF of the unit Gaussian

is the LHS a convex function?
Hint
{x | Prob(a_i^T x ≤ b_i) ≥ η, i = 1, . . . , m}

[figure: the set above, shown for η = 10%, η = 50%, and η = 90%]
That’s right
robust LP with reliability η = 90% is convex, and very easily solved
robust LP with reliability η = 10% is not convex, and extremely difficult
Maximum volume ellipsoid in polyhedron

• polyhedron: P = {x | a_i^T x ≤ b_i, i = 1, . . . , m}
• ellipsoid: E = {By + d | ‖y‖ ≤ 1}, with B = B^T ≻ 0

maximum volume E ⊆ P, as a convex problem in the variables B, d:

    maximize    log det B
    subject to  B = B^T ≻ 0,
                ‖B a_i‖ + a_i^T d ≤ b_i,  i = 1, . . . , m
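A sketch of this formulation, again assuming CVXPY (with a solver that supports log_det, such as SCS); the polyhedron below is just the unit box, chosen so the answer is easy to check.

    import cvxpy as cp
    import numpy as np

    # illustrative polyhedron: the unit box |x_i| <= 1 in R^2
    A = np.array([[1.0, 0], [-1, 0], [0, 1], [0, -1]])
    b = np.ones(4)
    n = A.shape[1]

    B = cp.Variable((n, n), symmetric=True)   # B = B^T; positive definiteness enters via log_det's domain
    d = cp.Variable(n)

    constraints = [cp.norm(B @ A[i], 2) + A[i] @ d <= b[i] for i in range(A.shape[0])]
    prob = cp.Problem(cp.Maximize(cp.log_det(B)), constraints)
    prob.solve()
    print("B =\n", B.value, "\nd =", d.value)   # expect B close to I and d close to 0 for the unit box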
Moral
• it’s not easy to recognize convex functions and convex optimization problems
• huge benefit, though, when we do
Convex Analysis and Optimization
Convex analysis & optimization
nice properties of convex optimization problems known since the 1960s
• local solutions are global
• duality theory, optimality conditions
• simple solution methods like alternating projections

convex analysis well developed by the 1970s (Rockafellar)
• separating & supporting hyperplanes
• subgradient calculus
What’s new (since 1990 or so)

• primal-dual interior-point (IP) methods: extremely efficient, handle nonlinear large-scale problems, polynomial-time complexity results, software implementations
• new standard problem classes: generalizations of LP, with theory, algorithms, software
• extension to generalized inequalities: semidefinite and cone programming
Applications and uses

• lots of applications: control, combinatorial optimization, signal processing, circuit design, communications, machine learning, . . .
• robust optimization: robust versions of LP, least-squares, and other problems
• relaxations and randomization: provide bounds and heuristics for solving hard (e.g., combinatorial optimization) problems
Recent history

• 1984–97: interior-point methods for LP
  – 1984: Karmarkar’s interior-point LP method
  – theory (Ye, Renegar, Kojima, Todd, Monteiro, Roos, . . . )
  – practice (Wright, Mehrotra, Vanderbei, Shanno, Lustig, . . . )
• 1988: Nesterov & Nemirovsky’s self-concordance analysis
• 1989–: LMIs and semidefinite programming in control
• 1990–: semidefinite programming in combinatorial optimization (Alizadeh, Goemans, Williamson, Lovász & Schrijver, Parrilo, . . . )
• 1994: interior-point methods for nonlinear convex problems (Nesterov & Nemirovsky, Overton, Todd, Ye, Sturm, . . . )
• 1997–: robust optimization (Ben Tal, Nemirovsky, El Ghaoui, . . . )
New Standard Convex Problem Classes
Some new standard convex problem classes

• second-order cone program (SOCP)
• geometric program (GP) (and entropy problems)
• semidefinite program (SDP)

all these new problem classes have
• complete duality theory, similar to LP
• good algorithms, and robust, reliable software
• wide variety of new applications
Second-order cone program

second-order cone program (SOCP) has the form

    minimize    c_0^T x
    subject to  ‖A_i x + b_i‖_2 ≤ c_i^T x + d_i,  i = 1, . . . , m

with variable x ∈ R^n

• includes LP and QP as special cases
• nondifferentiable when A_i x + b_i = 0
• new IP methods can solve (almost) as fast as LPs
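A minimal SOCP in exactly this form, assuming CVXPY; all data are random, and the extra norm bound is only there to keep the illustrative problem bounded.

    import cvxpy as cp
    import numpy as np

    np.random.seed(1)
    n, m, k = 4, 3, 5
    c0 = np.random.randn(n)
    A = [np.random.randn(k, n) for _ in range(m)]
    b = [np.random.randn(k) for _ in range(m)]
    c = [np.random.randn(n) for _ in range(m)]
    d = [10.0] * m                        # generous offsets so x = 0 is feasible

    x = cp.Variable(n)
    constraints = [cp.norm(A[i] @ x + b[i], 2) <= c[i] @ x + d[i] for i in range(m)]
    constraints.append(cp.norm(x, 2) <= 10)   # also a second-order cone constraint; keeps the example bounded
    prob = cp.Problem(cp.Minimize(c0 @ x), constraints)
    prob.solve()
    print("optimal value:", prob.value)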
Example: robust linear program

    minimize    c^T x
    subject to  Prob(a_i^T x ≤ b_i) ≥ η,  i = 1, . . . , m

where a_i ∼ N(ā_i, Σ_i)

equivalent to

    minimize    c^T x
    subject to  ā_i^T x + Φ^{-1}(η) ‖Σ_i^{1/2} x‖_2 ≤ b_i,  i = 1, . . . , m

where Φ is the (unit) normal CDF

robust LP is an SOCP for η ≥ 0.5 (Φ^{-1}(η) ≥ 0)
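A sketch of the resulting SOCP, assuming CVXPY and SciPy (norm.ppf for Φ^{-1}); the means, covariances, right-hand sides, and η below are made up for illustration.

    import cvxpy as cp
    import numpy as np
    from scipy.stats import norm
    from scipy.linalg import sqrtm

    np.random.seed(2)
    n, m, eta = 3, 4, 0.9
    c = np.random.randn(n)
    a_bar = [np.random.randn(n) for _ in range(m)]
    Sigma = []
    for _ in range(m):
        M = np.random.randn(n, n)
        Sigma.append(M @ M.T + np.eye(n))        # positive definite covariances
    b = 10.0 * np.ones(m)                         # generous right-hand sides, so x = 0 is feasible

    kappa = norm.ppf(eta)                         # Phi^{-1}(eta) >= 0 since eta >= 0.5
    x = cp.Variable(n)
    constraints = [a_bar[i] @ x + kappa * cp.norm(sqrtm(Sigma[i]).real @ x, 2) <= b[i]
                   for i in range(m)]
    constraints.append(cp.norm(x, 2) <= 10)       # keeps the illustrative problem bounded
    prob = cp.Problem(cp.Minimize(c @ x), constraints)
    prob.solve()
    print("robust optimal value:", prob.value)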
Geometric program (GP)

log-sum-exp function:  lse(x) = log(e^{x_1} + · · · + e^{x_n})
. . . a smooth convex approximation of the max function

geometric program:

    minimize    lse(A_0 x + b_0)
    subject to  lse(A_i x + b_i) ≤ 0,  i = 1, . . . , m

A_i ∈ R^{m_i×n}, b_i ∈ R^{m_i}; variable x ∈ R^n
Posynomial form geometric program

x = (x_1, . . . , x_n): vector of positive variables

function f of the form

    f(x) = Σ_{k=1}^{t} c_k x_1^{α_{1k}} x_2^{α_{2k}} · · · x_n^{α_{nk}}

with c_k ≥ 0, α_{ik} ∈ R, is called a posynomial

like a polynomial, but
• coefficients must be positive
• exponents can be fractional or negative
Posynomial form geometric program

posynomial form GP:

    minimize    f_0(x)
    subject to  f_i(x) ≤ 1,  i = 1, . . . , m

f_i are posynomials; x_i > 0 are the variables

to convert to (convex form) GP, express as

    minimize    log f_0(e^y)
    subject to  log f_i(e^y) ≤ 0,  i = 1, . . . , m

objective and constraints have the form lse(A_i y + b_i)
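A small numerical check of this change of variables (plain NumPy/SciPy, with made-up posynomial data): each monomial c_k x_1^{α_{1k}} · · · x_n^{α_{nk}} becomes the affine term α_k^T y + log c_k, so log f(e^y) = lse(Ay + b).

    import numpy as np
    from scipy.special import logsumexp

    rng = np.random.default_rng(0)
    n, t = 3, 4                          # n variables, t monomial terms
    alpha = rng.normal(size=(t, n))      # exponents (may be fractional or negative)
    c = rng.uniform(0.5, 2.0, size=t)    # positive coefficients

    def posynomial(x):
        return sum(c[k] * np.prod(x ** alpha[k]) for k in range(t))

    y = rng.normal(size=n)
    x = np.exp(y)                        # change of variables x = e^y

    lhs = np.log(posynomial(x))
    rhs = logsumexp(alpha @ y + np.log(c))   # lse(Ay + b) with A = alpha, b = log c
    print(lhs, rhs)                          # the two agree up to rounding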
Entropy problems

unnormalized negative entropy is a convex function

    −entr(x) = Σ_{i=1}^{n} x_i log(x_i / 1^T x)

defined for x_i ≥ 0, 1^T x > 0

entropy problem:

    minimize    −entr(A_0 x + b_0)
    subject to  −entr(A_i x + b_i) ≤ 0,  i = 1, . . . , m

A_i ∈ R^{m_i×n}, b_i ∈ R^{m_i}
Solving GPs (and entropy problems)
• GP and entropy problems are duals (if we solve one, we solve the other)
• new IP methods can solve large-scale GPs (and entropy problems) almost as fast as LPs
• applications in many areas:
  – information theory, statistics
  – communications, wireless power control
  – digital and analog circuit design
Generalized inequalities

with a proper convex cone K ⊆ R^k we associate the generalized inequality

    x ⪯_K y  ⟺  y − x ∈ K

convex optimization problem with generalized inequalities:

    minimize    f_0(x)
    subject to  f_1(x) ⪯_{K_1} 0, . . . , f_L(x) ⪯_{K_L} 0,
                Ax = b

f_i : R^n → R^{k_i} are K_i-convex: for all x, y, 0 ≤ λ ≤ 1,

    f_i(λx + (1 − λ)y) ⪯_{K_i} λ f_i(x) + (1 − λ) f_i(y)
Semidefinite program

semidefinite program (SDP):

    minimize    c^T x
    subject to  x_1 A_1 + · · · + x_n A_n ⪯ B

• B, A_i are symmetric matrices; the variable is x ∈ R^n
• ⪯ is the matrix inequality; the constraint is a linear matrix inequality (LMI)
• SDP can be expressed as the convex problem

    minimize    c^T x
    subject to  λ_max(x_1 A_1 + · · · + x_n A_n − B) ≤ 0

  or handled directly as a cone problem
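A minimal sketch of an SDP in this inequality form, assuming CVXPY with an SDP-capable solver; the matrices are random, B is shifted so x = 0 is strictly feasible, and the norm bound keeps the example bounded.

    import cvxpy as cp
    import numpy as np

    rng = np.random.default_rng(3)
    n, k = 2, 3
    c = rng.normal(size=n)

    def rand_sym(k):
        M = rng.normal(size=(k, k))
        return (M + M.T) / 2

    A = [rand_sym(k) for _ in range(n)]
    B = rand_sym(k) + 10 * np.eye(k)      # shifted so B is positive definite, hence x = 0 is strictly feasible

    x = cp.Variable(n)
    lmi = sum(x[i] * A[i] for i in range(n)) << B    # LMI constraint: x_1 A_1 + ... + x_n A_n <= B
    prob = cp.Problem(cp.Minimize(c @ x), [lmi, cp.norm(x, 2) <= 10])
    prob.solve()
    print("optimal value:", prob.value)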
Semidefinite programming

• nearly complete duality theory, similar to LP
• interior-point algorithms that are efficient in theory & practice
• applications in many areas:
  – control theory
  – combinatorial optimization & graph theory
  – structural optimization
  – statistics
  – signal processing
  – circuit design
  – geometrical problems
  – communications and information theory
  – quantum computing
  – algebraic geometry
  – machine learning
Continuous time symmetric Markov chain on a graph

• connected graph G = (V, E), V = {1, . . . , n}, no self-loops
• Markov process on V, with symmetric rate matrix Q
  – Q_ij = 0 for (i, j) ∉ E, i ≠ j
  – Q_ij ≥ 0 for (i, j) ∈ E
  – Q_ii = −Σ_{j≠i} Q_ij
• eigenvalues of Q ordered as 0 = λ_1(Q) > λ_2(Q) ≥ · · · ≥ λ_n(Q)
• state distribution given by π(t) = e^{tQ} π(0)
• distribution converges to uniform with rate determined by λ_2:

    ‖π(t) − 1/n‖_tv ≤ (√n/2) e^{λ_2(Q) t}
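A small numerical illustration of these definitions and of the mixing bound (plain NumPy/SciPy; the graph, a 4-node path with unit rates, is made up).

    import numpy as np
    from scipy.linalg import expm

    # path graph on 4 nodes with unit rates on each edge
    n = 4
    Q = np.zeros((n, n))
    for i, j in [(0, 1), (1, 2), (2, 3)]:
        Q[i, j] = Q[j, i] = 1.0
    np.fill_diagonal(Q, -Q.sum(axis=1))           # Q_ii = -sum_{j != i} Q_ij

    eigvals = np.sort(np.linalg.eigvalsh(Q))[::-1]   # 0 = lambda_1 > lambda_2 >= ...
    lam2 = eigvals[1]

    pi0 = np.array([1.0, 0, 0, 0])                 # start in state 1
    for t in [0.5, 2.0, 8.0]:
        pi_t = expm(t * Q) @ pi0
        tv = 0.5 * np.abs(pi_t - 1.0 / n).sum()    # total variation distance to uniform
        print(f"t={t}: tv={tv:.4f}, bound={(np.sqrt(n)/2)*np.exp(lam2*t):.4f}")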
Fastest mixing Markov chain (FMMC) problem

    minimize    λ_2(Q)
    subject to  Q1 = 0,  Q = Q^T
                Q_ij = 0 for (i, j) ∉ E, i ≠ j
                Q_ij ≥ 0 for (i, j) ∈ E
                Σ_{(i,j)∈E} d_ij^2 Q_ij ≤ 1

• variable is the matrix Q; problem data are the graph and the constants d_ij on E
• need to add a constraint on the rates since λ_2 is homogeneous
• optimal Q gives the fastest diffusion process on the graph (subject to the rate constraint)
Fast diffusion processes

[figure: analogous diffusion processes on a graph: an electrical network with conductances g_ij and a mechanical network with spring constants k_ij]
Swap objective and constraint

• λ_2(Q) and Σ_{(i,j)∈E} d_ij^2 Q_ij are homogeneous
• hence, can just as well minimize the weighted rate sum subject to a bound on λ_2(Q):

    minimize    Σ_{(i,j)∈E} d_ij^2 Q_ij
    subject to  Q1 = 0,  Q = Q^T
                Q_ij = 0 for (i, j) ∉ E, i ≠ j
                Q_ij ≥ 0 for (i, j) ∈ E
                λ_2(Q) ≤ −1
SDP formulation of FMMC

using Q1 = 0, we have

    λ_2(Q) ≤ −1  ⟺  Q − 11^T/n ⪯ −I

so FMMC reduces to the SDP

    minimize    Σ_{(i,j)∈E} d_ij^2 Q_ij
    subject to  Q1 = 0,  Q = Q^T
                Q_ij = 0 for (i, j) ∉ E, i ≠ j
                Q_ij ≥ 0 for (i, j) ∈ E
                Q − 11^T/n ⪯ −I

hence: can solve efficiently, duality theory, . . .
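A sketch of this SDP on a small example graph (a 4-cycle with d_ij = 1), assuming CVXPY with an SDP-capable solver; the graph and distances are illustrative.

    import cvxpy as cp
    import numpy as np

    n = 4
    edges = [(0, 1), (1, 2), (2, 3), (0, 3)]     # a 4-cycle, with d_ij = 1 on every edge
    non_edges = [(i, j) for i in range(n) for j in range(i + 1, n)
                 if (i, j) not in edges]

    Q = cp.Variable((n, n), symmetric=True)
    constraints = [Q @ np.ones(n) == 0]                              # Q1 = 0
    constraints += [Q[i, j] == 0 for (i, j) in non_edges]            # zero off the edge set
    constraints += [Q[i, j] >= 0 for (i, j) in edges]                # nonnegative edge rates
    constraints += [Q - np.ones((n, n)) / n << -np.eye(n)]           # lambda_2(Q) <= -1

    rate_sum = sum(Q[i, j] for (i, j) in edges)   # sum of d_ij^2 Q_ij with d_ij = 1
    prob = cp.Problem(cp.Minimize(rate_sum), constraints)
    prob.solve()
    print("optimal weighted rate sum:", prob.value)
    print("Q =\n", np.round(Q.value, 3))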
Robust Optimization
Robust optimization problem

robust optimization problem:

    minimize    max_{a∈A} f_0(a, x)
    subject to  max_{a∈A} f_i(a, x) ≤ 0,  i = 1, . . . , m
• x is optimization variable
• a is uncertain parameter
• A is parameter set (e.g., box, polyhedron, ellipsoid)
heuristics (detuning, sensitivity penalty term) have been used since beginning of optimization
Robust optimization via convex optimization (El Ghaoui, Ben Tal & Nemirovsky, . . . )
• robust versions of LP, QP, SOCP problems, with ellipsoidal or polyhedral uncertainty, can be formulated as SDPs (or simpler) • other robust problems (e.g., SDP) intractable, but there are good convex approximations
Robust least-squares

robust LS problem with uncertain parameters v_1, . . . , v_p:

    minimize    sup_{‖v‖≤1} ‖(A_0 + v_1 A_1 + · · · + v_p A_p)x − b‖_2

equivalent SDP (variables x, t_1, t_2):

    minimize    t_1 + t_2
    subject to  [ I        P(x)    q(x) ]
                [ P(x)^T   t_1 I   0    ]  ⪰ 0
                [ q(x)^T   0       t_2  ]

where P(x) = [A_1 x  A_2 x  · · ·  A_p x] ∈ R^{m×p},  q(x) = A_0 x − b
example:  minimize sup_{‖v‖≤ρ} ‖(A_0 + v_1 A_1 + v_2 A_2)x − b‖_2

[figure: worst-case residual versus ρ for the LS solution and the robust LS solution]
(‖A_0‖ = 10, ‖A_1‖ = ‖A_2‖ = 1)
example:  minimize sup_{‖v‖≤1} ‖(A_0 + v_1 A_1 + v_2 A_2)x − b‖_2

[figure: histogram of the residual ‖(A_0 + v_1 A_1 + v_2 A_2)x − b‖_2 for the solutions x_rls, x_tych, and x_ls]
distribution assuming v is uniformly distributed
Relaxations & Randomization
Relaxations & randomization
convex optimization is increasingly used
• to find good bounds for hard (i.e., nonconvex) problems, via relaxation
• as a heuristic for finding good suboptimal points, often via randomization
Example: Boolean least-squares

Boolean least-squares problem:

    minimize    ‖Ax − b‖^2
    subject to  x_i^2 = 1,  i = 1, . . . , n
• basic problem in digital communications
• could check all 2^n possible values of x . . .
• an NP-hard problem, and very hard in practice
• many heuristics for approximate solution
Boolean least-squares as matrix problem
    ‖Ax − b‖^2 = x^T A^T A x − 2 b^T A x + b^T b
               = Tr(A^T A X) − 2 b^T A x + b^T b

where X = xx^T

hence can express BLS as

    minimize    Tr(A^T A X) − 2 b^T A x + b^T b
    subject to  X_ii = 1,  X ⪰ xx^T,  rank(X) = 1

. . . still a very hard problem
SDP relaxation for BLS

ignore the rank-one constraint, and use

    X ⪰ xx^T  ⟺  [ X     x ]
                  [ x^T   1 ]  ⪰ 0

to obtain the SDP relaxation (with variables X, x)

    minimize    Tr(A^T A X) − 2 b^T A x + b^T b
    subject to  X_ii = 1,  [ X     x ]
                           [ x^T   1 ]  ⪰ 0

• optimal value of the SDP gives a lower bound for BLS
• if the optimal matrix is rank one, we’re done
Interpretation via randomization

• can think of the variables X, x in the SDP relaxation as defining a normal distribution z ∼ N(x, X − xx^T), with E z_i^2 = 1
• SDP objective is E ‖Az − b‖^2

suggests a randomized method for BLS (sketched below):
• find X*, x*, optimal for the SDP relaxation
• generate z from N(x*, X* − x* x*^T)
• take x = sgn(z) as an approximate solution of BLS
(can repeat many times and take the best one)
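An end-to-end sketch of the relaxation and the randomized rounding, assuming CVXPY (with an SDP-capable solver) and NumPy; the problem size and data are tiny and made up.

    import cvxpy as cp
    import numpy as np

    rng = np.random.default_rng(4)
    m, n = 15, 10
    A = rng.normal(size=(m, n))
    b = rng.normal(size=m)

    # SDP relaxation: Y = [[X, x], [x^T, 1]] >> 0 with diag(X) = 1
    Y = cp.Variable((n + 1, n + 1), symmetric=True)
    X, x = Y[:n, :n], Y[:n, n]
    obj = cp.Minimize(cp.trace(A.T @ A @ X) - 2 * (b @ A) @ x + b @ b)
    prob = cp.Problem(obj, [Y >> 0, cp.diag(Y) == 1])
    prob.solve()
    print("SDP lower bound:", prob.value)

    # randomized rounding from the SDP-optimal distribution z ~ N(x*, X* - x* x*^T)
    xs, Xs = x.value, X.value
    cov = Xs - np.outer(xs, xs) + 1e-8 * np.eye(n)   # small ridge for numerical safety
    best = np.inf
    for _ in range(100):
        z = rng.multivariate_normal(xs, cov)
        xb = np.sign(z)
        best = min(best, np.linalg.norm(A @ xb - b) ** 2)
    print("best rounded objective:", best)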
Example

• (randomly chosen) parameters A ∈ R^{150×100}, b ∈ R^{150}
• x ∈ R^{100}, so the feasible set has 2^100 ≈ 10^30 points

LS approximate solution: minimize ‖Ax − b‖ s.t. ‖x‖^2 = n, then round
yields objective 8.7% over the SDP relaxation bound

randomized method (using the SDP optimal distribution):
• best of 20 samples: 3.1% over SDP bound
• best of 1000 samples: 2.6% over SDP bound
[figure: histogram of ‖Ax − b‖/(SDP bound) over rounded samples, with the SDP bound and the LS solution marked]
Calculus of Convex Functions
Approach
• basic examples or atoms
• calculus rules or transformations that preserve convexity
Convex functions: Basic examples

• x^p for p ≥ 1 or p ≤ 0;  −x^p for 0 ≤ p ≤ 1
• e^x, − log x, x log x
• x^T x;  x^T x / y (for y > 0);  (x^T x)^{1/2}
• ‖x‖ (any norm)
• max(x_1, . . . , x_n),  log(e^{x_1} + · · · + e^{x_n})
• − log Φ(x) (Φ is the Gaussian CDF)
• log det X^{−1} (for X ≻ 0)
Calculus rules

• convexity preserved under sums, nonnegative scaling
• if f cvx, then g(x) = f(Ax + b) cvx
• pointwise sup: if f_α cvx for each α ∈ A, then g(x) = sup_{α∈A} f_α(x) cvx
• minimization: if f(x, y) cvx, then g(x) = inf_y f(x, y) cvx
• composition rules: if h cvx & increasing, f cvx, then g(x) = h(f(x)) cvx
• perspective transformation: if f cvx, then g(x, t) = t f(x/t) cvx for t > 0

. . . and many, many others
More examples

• λ_max(X) (for X = X^T)
• f(x) = x_[1] + · · · + x_[k]  (sum of the largest k elements of x)
• −Σ_{i=1}^{m} log(−f_i(x))  (on {x | f_i(x) < 0}; f_i cvx)
• f(x) = − log Prob(x + z ∈ C)  (C convex, z ∼ N(0, Σ))
• x^T Y^{−1} x is cvx in (x, Y) for Y = Y^T ≻ 0
Duality
Lagrangian and dual function

primal problem:

    minimize    f_0(x)
    subject to  f_i(x) ≤ 0,  i = 1, . . . , m

Lagrangian:  L(x, λ) = f_0(x) + Σ_{i=1}^{m} λ_i f_i(x)

dual function:  g(λ) = inf_x L(x, λ)
Lower bound property
• for any primal feasible x and any λ ≥ 0, we have g(λ) ≤ f_0(x)
• hence for any λ ≥ 0, g(λ) ≤ p* (the optimal value of the primal problem)
• the duality gap of x, λ is defined as f_0(x) − g(λ)
  (the gap is nonnegative; it bounds the suboptimality of x)
Dual problem

find the best lower bound on p*:

    maximize    g(λ)
    subject to  λ_i ≥ 0,  i = 1, . . . , m

p* is the optimal value of the primal, and d* is the optimal value of the dual
• weak duality: even when the primal is not convex, d* ≤ p*
• for a convex primal problem, we have strong duality: d* = p* (provided a technical condition holds)
Example: linear program

primal:

    minimize    c^T x
    subject to  Ax ⪯ b

dual:

    maximize    −b^T λ
    subject to  A^T λ + c = 0,  λ ⪰ 0
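A small numerical check of this primal-dual pair, assuming CVXPY; the data are random, with b > 0 so that x = 0 is primal feasible and c chosen so the dual is feasible (hence the primal is bounded).

    import cvxpy as cp
    import numpy as np

    rng = np.random.default_rng(5)
    m, n = 6, 3
    A = rng.normal(size=(m, n))
    b = rng.uniform(1, 2, size=m)              # b > 0, so x = 0 is primal feasible
    c = -A.T @ rng.uniform(0.1, 1, size=m)     # c = -A^T y with y > 0, so the dual is feasible

    x = cp.Variable(n)
    primal = cp.Problem(cp.Minimize(c @ x), [A @ x <= b])
    primal.solve()

    lam = cp.Variable(m)
    dual = cp.Problem(cp.Maximize(-b @ lam), [A.T @ lam + c == 0, lam >= 0])
    dual.solve()

    print("p* =", primal.value, " d* =", dual.value)   # equal, by strong duality for LP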
Example: unconstrained geometric program

primal:

    minimize    log ( Σ_{i=1}^{m} exp(a_i^T x + b_i) )

dual:

    maximize    b^T ν − Σ_{i=1}^{m} ν_i log ν_i
    subject to  1^T ν = 1,  A^T ν = 0,  ν ⪰ 0

. . . an entropy maximization problem
Example: duality between FMMC and MVU

FMMC problem

    minimize    Σ_{(i,j)∈E} d_ij^2 Q_ij
    subject to  Q1 = 0,  Q = Q^T
                Q_ij = 0 for (i, j) ∉ E, i ≠ j
                Q_ij ≥ 0 for (i, j) ∈ E
                Q − 11^T/n ⪯ −I

dual of the FMMC problem

    maximize    Tr X
    subject to  X_ii + X_jj − X_ij − X_ji ≤ d_ij^2,  (i, j) ∈ E
                X1 = 0,  X ⪰ 0
FMMC dual as maximum-variance unfolding

• use variables x_1, . . . , x_n, with X = [x_1^T; · · · ; x_n^T] [x_1 · · · x_n], i.e., X_ij = x_i^T x_j
• the dual FMMC problem becomes

    maximize    Σ_{i=1}^{n} ‖x_i‖^2
    subject to  Σ_i x_i = 0,
                ‖x_i − x_j‖ ≤ d_ij,  (i, j) ∈ E

• position n points in R^n to maximize variance, respecting local distance constraints, i.e., the maximum-variance unfolding problem
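A small sketch (plain NumPy) of the correspondence used here: given a feasible Gram matrix X with X ⪰ 0 and X1 = 0, a point configuration can be recovered from a factorization of X, and Tr X and the edge constraints translate exactly as stated. The data below are made up.

    import numpy as np

    rng = np.random.default_rng(6)
    n = 5
    # build a feasible Gram matrix: centered points, then X = P P^T
    P = rng.normal(size=(n, 3))
    P -= P.mean(axis=0)                # sum of points is 0, so X 1 = 0
    X = P @ P.T                        # X_ij = x_i^T x_j, and X >= 0

    # recover a point configuration from X by eigendecomposition
    w, V = np.linalg.eigh(X)
    pts = V * np.sqrt(np.clip(w, 0, None))     # rows are recovered x_i (up to rotation)

    print("Tr X =", np.trace(X), " sum ||x_i||^2 =", (pts**2).sum())            # equal
    i, j = 0, 1
    print("X_ii+X_jj-X_ij-X_ji =", X[i, i] + X[j, j] - X[i, j] - X[j, i],
          " ||x_i - x_j||^2 =", np.sum((pts[i] - pts[j])**2))                    # equal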
Semidefinite embedding

• similar to the semidefinite embedding for unsupervised learning of manifolds (which has distance equality constraints) (Weinberger & Saul, 2003)
• surprise: fastest diffusion on a graph and maximum-variance unfolding are duals
Interior-Point Methods
Interior-point methods

• handle linear and nonlinear convex problems (Nesterov & Nemirovsky)
• based on Newton’s method applied to ‘barrier’ functions that trap x in the interior of the feasible region (hence the name IP)
• worst-case complexity theory: # Newton steps ∼ √(problem size)
• in practice: # Newton steps between 20 & 50 (!) — over a wide range of problem dimensions, type, and data
• 1000 variables, 10000 constraints feasible on a PC; far larger if structure is exploited
• readily available (commercial and noncommercial) packages
Log barrier

for the convex problem

    minimize    f_0(x)
    subject to  f_i(x) ≤ 0,  i = 1, . . . , m

we define the logarithmic barrier as

    φ(x) = − Σ_{i=1}^{m} log(−f_i(x))

• φ is convex, smooth on the interior of the feasible set
• φ → ∞ as x approaches the boundary of the feasible set
Central path

the central path is the curve

    x*(t) = argmin_x ( t f_0(x) + φ(x) ),   t ≥ 0

• x*(t) is strictly feasible, i.e., f_i(x*(t)) < 0
• x*(t) can be computed by, e.g., Newton’s method
• intuition suggests x*(t) converges to optimal as t → ∞
• using duality we can prove x*(t) is m/t-suboptimal
Central path & duality from ?
?
?
∇x(tf0(x ) + φ(x )) = t∇f0(x ) +
m X i=1
1 ? ∇f (x )=0 i −fi(x?)
we find that
1 , i = 1, . . . , m = −tfi(x?) is dual feasible, with g(λ?(t)) = f0(x?) − m/t λ?i(t)
• duality gap associated with pair x?(t), λ?(t) is m/t • hence, x?(t) is m/t-suboptimal
CS286a seminar 10/15/04
63
Example: central path for LP

    x*(t) = argmin_x ( t c^T x − Σ_{i=1}^{6} log(b_i − a_i^T x) )

[figure: central path inside a polyhedron, showing x*(0), x*(10), and x_opt, with the objective direction c]
Barrier method

a.k.a. path-following method

given strictly feasible x, t > 0, µ > 1
repeat
  1. compute x := x*(t) (using Newton’s method, starting from x)
  2. exit if m/t < tol
  3. t := µt

duality gap is reduced by µ each outer iteration (a sketch follows below)
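A minimal NumPy sketch of the barrier method for an inequality-form LP; the centering step is plain Newton with a feasibility-only backtracking line search, and the example LP over a box is made up.

    import numpy as np

    def lp_barrier(c, A, b, x0, mu=10.0, tol=1e-6, t0=1.0):
        """Minimize c^T x s.t. A x <= b by the barrier method (illustrative sketch)."""
        x, t, m = x0.copy(), t0, A.shape[0]
        while m / t >= tol:
            # centering: Newton's method on t*c^T x - sum log(b - A x)
            for _ in range(50):
                s = b - A @ x                          # slacks, must stay > 0
                grad = t * c + A.T @ (1.0 / s)
                hess = A.T @ np.diag(1.0 / s**2) @ A
                dx = np.linalg.solve(hess, -grad)
                if -grad @ dx / 2 < 1e-10:             # Newton decrement small: centered
                    break
                step = 1.0
                while np.min(b - A @ (x + step * dx)) <= 0:   # stay strictly feasible
                    step *= 0.5
                x = x + step * dx
            t *= mu
        return x

    # example: a tiny LP over the box -1 <= x_i <= 1
    n = 3
    c = np.array([1.0, -2.0, 0.5])
    A = np.vstack([np.eye(n), -np.eye(n)])
    b = np.ones(2 * n)
    x = lp_barrier(c, A, b, x0=np.zeros(n))
    print(np.round(x, 4))    # approaches the vertex (-1, 1, -1)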
Trade-off in choice of µ
large µ means
• fast duality gap reduction (fewer outer iterations), but
• many Newton steps to compute x*(t+) (more Newton steps per outer iteration)

total effort measured by total number of Newton steps
Typical example
GP with n = 50 variables, m = 100 constraints, m_i = 5

• wide range of µ works well
• very typical behavior (even for large m, n)

[figure: duality gap versus Newton iterations for µ = 2, µ = 50, and µ = 150]
Effect of µ
[figure: total Newton iterations versus µ]

barrier method works well for µ in a large range
Typical convergence of IP method
LP, GP, SOCP, SDP with 100 variables

[figure: duality gap versus # Newton steps for the LP, GP, SOCP, and SDP instances]
Typical effort versus problem dimensions
• LPs with n variables, 2n constraints
• 100 instances for each of 20 problem sizes
• avg & std dev shown

[figure: Newton steps versus problem size n, on a log scale]
Complexity analysis
• based on self-concordance (Nesterov & Nemirovsky, 1988)
• for any choice of µ, #steps is O(m log(1/ε)), where ε is the final accuracy
• to optimize the complexity bound, can take µ = 1 + 1/√m, which yields #steps O(√m log(1/ε))
• in any case, IP methods work extremely well in practice
Computational effort per Newton step

• Newton step effort dominated by solving linear equations to find the primal-dual search direction
• equations inherit structure from the underlying problem (e.g., sparsity, symmetry, Toeplitz, circulant, Hankel, Kronecker)
• equations same as for a least-squares problem of similar size and structure

conclusion: we can solve a convex problem with about the same effort as solving 20–50 least-squares problems
Other interior-point methods
more sophisticated IP algorithms
• primal-dual, incomplete centering, infeasible start
• use the same ideas, e.g., central path, log barrier
• readily available (commercial and noncommercial packages)
typical performance: 20 – 50 Newton steps (!) — over wide range of problem dimensions, problem type, and problem data
Exploiting structure

sparsity
• well developed, since the late 1970s
• direct (sparse factorizations) and iterative methods (CG, LSQR)
• standard in general-purpose LP, QP, GP, SOCP implementations
• can solve problems with 10^5, 10^6 variables and constraints (depending on sparsity pattern)

symmetry
• reduce the number of variables
• reduce the size of matrices (particularly helpful for SDP)
A return to subgradient-type methods

• very large-scale problems: 10^5 – 10^7 variables
• applications: medical imaging, shape design of mechanical structures, machine learning and data mining, . . .
• IPM out of consideration (even a single iteration is prohibitive); can use only first-order information (function values and subgradients)
• a reasonable algorithm should have
  – computational effort per iteration at most linear in the design dimension
  – potential to obtain (and willingness to accept) medium-accuracy solutions
  – error reduction factor essentially independent of problem dimension

(Ben Tal & Nemirovsky, Nesterov, . . . )
Conclusions
Conclusions
convex optimization
• theory fairly mature; practice has advanced tremendously in the last decade
• qualitatively different from general nonlinear programming
• cost only 30× more than least-squares, but far more expressive
• lots of applications still to be discovered
Some references

• Convex Optimization, Boyd & Vandenberghe, 2004. www.stanford.edu/~boyd/cvxbook.html (pdf of full text on the web)
• Introductory Lectures on Convex Optimization, Nesterov, 2003
• Lectures on Modern Convex Optimization, Ben Tal & Nemirovsky, 2001
• Interior-Point Polynomial Algorithms in Convex Programming, Nesterov & Nemirovsky, 1994
• Linear Matrix Inequalities in System and Control Theory, Boyd, El Ghaoui, Feron, & Balakrishnan, 1994