Linear Conic Programming Yinyu Ye December 2004
Preface

This monograph is developed for MS&E 314, “Semidefinite Programming”, which I am teaching at Stanford. Information, supporting materials, and computer programs related to this book may be found at the following address on the World Wide Web: http://www.stanford.edu/class/msande314 Please report any questions, comments, and errors to the address:
[email protected]

A little story in the development of semidefinite programming. One day in 1990, I visited the Computer Science Department of the University of Minnesota and met a young graduate student, Farid Alizadeh. He, working then on combinatorial optimization, introduced me to “semidefinite optimization,” or linear programming over the positive definite matrix cone. We had a very extensive discussion that afternoon and concluded that interior-point linear programming algorithms could be applicable to solving SDP. I urged Farid to look at the LP potential functions and to develop an SDP primal potential reduction algorithm. He worked hard for several months, and one afternoon showed up in my office in Iowa City, about 300 miles from Minneapolis. He had everything worked out, including potential, algorithm, complexity bound, and even a “dictionary” from LP to SDP, but was stuck on one problem: how to keep the symmetry of the matrix. We went to a bar nearby on Clinton Street in Iowa City (I paid for him since I was a third-year professor then and eager to demonstrate that I could take care of students). After chatting for a while, I suggested that he use $X^{-1/2} \Delta X^{-1/2}$ to keep symmetry instead of $X^{-1} \Delta$, which he was using, where $X$ is the current symmetric positive definite matrix and $\Delta$ is the symmetric directional matrix. He returned to Minneapolis and moved to Berkeley shortly after, and a few weeks later sent me an email message telling me that everything had worked out beautifully.

At the same time or even earlier, Nesterov and Nemirovskii developed a more general and powerful theory in extending interior-point algorithms for solving conic programs, where SDP was a special case. Boyd and his group later presented a wide range of SDP applications and formulations, many of which were incredibly novel and elegant. Then came the primal-dual algorithm, the Max-Cut, ... and SDP established its full popularity.

YINYU YE
Stanford, 2002
Chapter 1
Introduction and Preliminaries

1.1 Introduction
Semidefinite Programming, hereafter SDP, is a natural extension of Linear Programming (LP), which is a central decision model in Management Science and Operations Research. LP plays an extremely important role in the theory and application of Optimization. In one sense it is a continuous optimization problem of minimizing a linear objective function over a convex polyhedron; but it is also a combinatorial problem involving selecting an extreme point among a finite set of possible vertices. Businesses, large and small, use linear programming models to optimize communication systems, to schedule transportation networks, to control inventories, to adjust investments, and to maximize productivity.

In LP, the variables form a vector which is required to be nonnegative, whereas in SDP they are components of a matrix which is constrained to be positive semidefinite. Both of them may have linear equality constraints as well. One thing in common is that interior-point algorithms developed in the past two decades for LP are naturally applied to solving SDP. Interior-point algorithms are continuous iterative algorithms. Computational experience with sophisticated procedures suggests that the number of iterations needed grows much more slowly than the dimension. Furthermore, they have an established worst-case polynomial iteration bound, providing the potential for dramatic improvement in computational effectiveness.

The goal of the monograph is to provide a textbook for teaching Semidefinite Programming, a modern extension of the Linear Programming decision model, and its applications in other scientific and engineering fields. One theme of the monograph is the “mapping” between SDP and LP, so that the reader, with knowledge of LP, can understand SDP with little effort.

The monograph is organized as follows. In Chapter 1, we discuss some necessary mathematical preliminaries. We also present several decision and optimization problems and several basic numerical procedures used throughout the text.

Chapter 2 is devoted to studying the theories and geometries of linear and matrix inequalities, convexity, and semidefinite programming. Almost all interior-point methods exploit rich geometric properties of linear and matrix inequalities, such as “center,” “volume,” “potential,” etc. These geometries are also helpful for teaching, learning, and research.

Chapter 3 is focused on interior-point algorithms. Here, we select two types of algorithms: the path-following algorithm and the potential reduction algorithm. Each algorithm has three forms: the primal, the dual, and the primal-dual form. We analyze the worst-case complexity bound for them, using the real number computation model in our analysis because of the continuous nature of interior-point algorithms. We also compare the complexity theory with the convergence rate used in numerical analysis.

Not only has the convergence speed of SDP algorithms been significantly improved during the last decade, but the problem domain to which SDP applies has also dramatically widened. Chapters 4, 5, and 6 describe some SDP applications and newly established results in Engineering, Combinatorial Optimization, Robust Optimization, Euclidean Geometry Computation, etc.

Finally, we discuss major computational issues in Chapter 7. We discuss several effective implementation techniques frequently used in interior-point SDP software, such as the sparse linear system, the predictor and corrector step, and the homogeneous and self-dual formulation. We also present major difficulties and challenges faced by SDP.
1.2 Mathematical Preliminaries
This section summarizes mathematical background material for linear algebra, linear programming, and nonlinear optimization.
1.2.1 Basic notations
The notation described below will be followed in general. There may be some deviation where appropriate.

By $\mathbb{R}$ we denote the set of real numbers. $\mathbb{R}_+$ denotes the set of nonnegative real numbers, and $\mathring{\mathbb{R}}_+$ denotes the set of positive numbers. For a natural number $n$, the symbol $\mathbb{R}^n$ ($\mathbb{R}^n_+$, $\mathring{\mathbb{R}}^n_+$) denotes the set of vectors with $n$ components in $\mathbb{R}$ ($\mathbb{R}_+$, $\mathring{\mathbb{R}}_+$). We call $\mathring{\mathbb{R}}^n_+$ the interior of $\mathbb{R}^n_+$.

The vector inequality $x \ge y$ means $x_j \ge y_j$ for $j = 1, 2, ..., n$. Zero represents a vector whose entries are all zeros and $e$ represents a vector whose entries are all ones, where their dimensions may vary according to other vectors in expressions. A vector is always considered as a column vector, unless otherwise stated. Uppercase letters will be used to represent matrices. Greek letters will typically be used to represent scalars.

For convenience, we sometimes write a column vector $x$ as $x = (x_1; x_2; \ldots; x_n)$ and a row vector as $x = (x_1, x_2, \ldots, x_n)$. Addition of matrices and multiplication of matrices with scalars are standard. The superscript “T” denotes the transpose operation. The inner product in $\mathbb{R}^n$ is defined as follows:

$$\langle x, y \rangle := x^T y = \sum_{j=1}^n x_j y_j \quad \text{for } x, y \in \mathbb{R}^n.$$

The $l_2$ norm of a vector $x$ is given by

$$\|x\|_2 = \sqrt{x^T x},$$

and the $l_\infty$ norm is

$$\|x\|_\infty = \max\{|x_1|, |x_2|, ..., |x_n|\}.$$

In general, the $p$ norm is

$$\|x\|_p = \left( \sum_{j=1}^n |x_j|^p \right)^{1/p}, \quad p = 1, 2, ...$$

The dual of the $p$ norm, denoted by $\|\cdot\|^*$, is the $q$ norm, where

$$\frac{1}{p} + \frac{1}{q} = 1.$$

In this book, $\|\cdot\|$ generally represents the $l_2$ norm.

For natural numbers $m$ and $n$, $\mathbb{R}^{m \times n}$ denotes the set of real matrices with $m$ rows and $n$ columns. For $A \in \mathbb{R}^{m \times n}$, we assume that the row index set of $A$ is $\{1, 2, ..., m\}$ and the column index set is $\{1, 2, ..., n\}$. The $i$th row of $A$ is denoted by $a_{i.}$ and the $j$th column of $A$ is denoted by $a_{.j}$; the $(i, j)$th component of $A$ is denoted by $a_{ij}$. If $I$ is a subset of the row index set and $J$ is a subset of the column index set, then $A_I$ denotes the submatrix of $A$ whose rows belong to $I$, $A_J$ denotes the submatrix of $A$ whose columns belong to $J$, and $A_{IJ}$ denotes the submatrix of $A$ induced by those components of $A$ whose indices belong to $I$ and $J$, respectively.

The identity matrix is denoted by $I$. The null space of $A$ is denoted $\mathcal{N}(A)$ and the range of $A$ is $\mathcal{R}(A)$. The determinant of an $n \times n$ matrix $A$ is denoted by $\det(A)$. The trace of $A$, denoted by $\mathrm{tr}(A)$, is the sum of the diagonal entries in $A$. The operator norm of $A$, denoted by $\|A\|$, is

$$\|A\|_2 := \max_{0 \ne x \in \mathbb{R}^n} \frac{\|Ax\|_2}{\|x\|_2}.$$
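The vector norms defined above are easy to evaluate directly. The following is a minimal sketch in plain Python (no libraries); the helper names `p_norm` and `inf_norm` and the sample vectors are illustrative assumptions, not part of the text. It also checks Hölder's inequality $|x^T y| \le \|x\|_p \|y\|_q$, a standard fact (not stated above) that underlies the pairing of the $p$ norm with its dual $q$ norm.

```python
def p_norm(x, p):
    """l_p norm of a vector given as a list of floats (p >= 1)."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

def inf_norm(x):
    """l_infinity norm: the largest absolute value among the entries."""
    return max(abs(v) for v in x)

# Hypothetical sample vectors, chosen only for illustration.
x = [3.0, -4.0]
y = [1.0, 2.0]

norm1 = p_norm(x, 1)       # |3| + |-4|
norm2 = p_norm(x, 2)       # sqrt(9 + 16)
norm_inf = inf_norm(x)     # max(|3|, |-4|)

# Dual pair: p = 3 and q = 1.5 satisfy 1/p + 1/q = 1.
p, q = 3.0, 1.5
inner = sum(a * b for a, b in zip(x, y))
holder_ok = abs(inner) <= p_norm(x, p) * p_norm(y, q)
```

Note that the absolute values matter: without them the $l_\infty$ "norm" of $(3; -4)$ would be 3 rather than 4.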
For a vector $x \in \mathbb{R}^n$, $D_x$ represents a diagonal matrix in $\mathbb{R}^{n \times n}$ whose diagonal entries are the entries of $x$, i.e., $D_x = \mathrm{diag}(x)$.

A matrix $Q \in \mathbb{R}^{n \times n}$ is said to be positive definite (PD), denoted by $Q \succ 0$, if

$$x^T Q x > 0 \quad \text{for all } x \ne 0,$$

and positive semidefinite (PSD), denoted by $Q \succeq 0$, if

$$x^T Q x \ge 0 \quad \text{for all } x.$$

If $Q \succ 0$, then $-Q$ is called negative definite (ND), denoted by $-Q \prec 0$; if $Q \succeq 0$, then $-Q$ is called negative semidefinite (NSD), denoted by $-Q \preceq 0$. If $Q$ is symmetric, then its eigenvalues are all real numbers; furthermore, $Q$ is PSD if and only if all its eigenvalues are nonnegative, and $Q$ is PD if and only if all its eigenvalues are positive. Given a PD matrix $Q$ we can define a $Q$-norm, $\|\cdot\|_Q$, for a vector $x$ as

$$\|x\|_Q = \sqrt{x^T Q x}.$$

$\mathcal{M}^n$ denotes the space of symmetric matrices in $\mathbb{R}^{n \times n}$. The inner product in $\mathcal{M}^n$ is defined as follows:

$$\langle X, Y \rangle := X \bullet Y = \mathrm{tr}\, X^T Y = \sum_{i,j} X_{i,j} Y_{i,j} \quad \text{for } X, Y \in \mathcal{M}^n.$$

This is a generalization of the vector inner product to matrices. The matrix norm associated with the inner product is called the Frobenius norm:

$$\|X\|_f = \sqrt{\mathrm{tr}\, X^T X}.$$

$\mathcal{M}^n_+$ denotes the set of positive semidefinite matrices in $\mathcal{M}^n$. $\mathring{\mathcal{M}}^n_+$ denotes the set of positive definite matrices in $\mathcal{M}^n$. We call $\mathring{\mathcal{M}}^n_+$ the interior of $\mathcal{M}^n_+$.

$\{x^k\}_0^\infty$ is an ordered sequence $x^0, x^1, x^2, ..., x^k, ...$. A sequence $\{x^k\}_0^\infty$ is convergent to $\bar{x}$, denoted $x^k \to \bar{x}$, if

$$\|x^k - \bar{x}\| \to 0.$$

A point $x$ is a limit point of $\{x^k\}_0^\infty$ if there is a subsequence of $\{x^k\}$ convergent to $x$.

If $g(x) \ge 0$ is a real-valued function of a real nonnegative variable, the notation $g(x) = O(x)$ means that $g(x) \le \bar{c} x$ for some constant $\bar{c}$; the notation $g(x) = \Omega(x)$ means that $g(x) \ge \underline{c} x$ for some constant $\underline{c}$; the notation $g(x) = \theta(x)$ means that $\underline{c} x \le g(x) \le \bar{c} x$. Another notation is $g(x) = o(x)$, which means that $g(x)$ goes to zero faster than $x$ does:

$$\lim_{x \to 0} \frac{g(x)}{x} = 0.$$
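The eigenvalue characterization of PD and PSD matrices, and the matrix inner product $X \bullet Y$, can be tried out on small symmetric matrices. The sketch below uses plain Python and the closed-form eigenvalues of a symmetric $2 \times 2$ matrix; the function names and the two sample matrices are assumptions made for illustration only.

```python
import math

def eig_sym_2x2(Q):
    """Eigenvalues (smallest first) of a symmetric 2x2 matrix [[a, b], [b, d]]."""
    a, b, d = Q[0][0], Q[0][1], Q[1][1]
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return mean - radius, mean + radius

def inner(X, Y):
    """X . Y = tr(X^T Y), computed as the sum of entrywise products."""
    return sum(X[i][j] * Y[i][j] for i in range(len(X)) for j in range(len(X[0])))

def frobenius(X):
    """Frobenius norm, the norm induced by the inner product above."""
    return math.sqrt(inner(X, X))

Q = [[2.0, 1.0], [1.0, 2.0]]   # eigenvalues 1 and 3, so Q is positive definite
R = [[1.0, 1.0], [1.0, 1.0]]   # eigenvalues 0 and 2, so R is PSD but not PD

lam_Q = eig_sym_2x2(Q)
lam_R = eig_sym_2x2(R)
tol = 1e-12
Q_is_pd = lam_Q[0] > tol
R_is_psd = lam_R[0] >= -tol
R_is_pd = lam_R[0] > tol

# Q-norm of a vector x for the PD matrix Q: sqrt(x^T Q x)
x = [1.0, -1.0]
Qx = [Q[0][0] * x[0] + Q[0][1] * x[1], Q[1][0] * x[0] + Q[1][1] * x[1]]
x_Q_norm = math.sqrt(x[0] * Qx[0] + x[1] * Qx[1])
```

For larger matrices one would normally rely on a numerical eigenvalue or Cholesky routine instead of a closed form.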
1.2.2 Convex sets
If $x$ is a member of the set $\Omega$, we write $x \in \Omega$; if $y$ is not a member of $\Omega$, we write $y \notin \Omega$. The union of two sets $S$ and $T$ is denoted $S \cup T$; the intersection of them is denoted $S \cap T$. A set can be specified in the form $\Omega = \{x : P(x)\}$ as the set of all elements satisfying property $P$.

For $y \in \mathbb{R}^n$ and $\epsilon > 0$, $B(y, \epsilon) = \{x : \|x - y\| \le \epsilon\}$ is the ball or sphere of radius $\epsilon$ with center $y$. In addition, for a positive definite matrix $Q$ of dimension $n$, $E(y, Q) = \{x : (x - y)^T Q (x - y) \le 1\}$ is called an ellipsoid. The vector $y$ is the center of $E(y, Q)$.

A set $\Omega$ is closed if $x^k \to x$, where $x^k \in \Omega$, implies $x \in \Omega$. A set $\Omega$ is open if around every point $y \in \Omega$ there is a ball that is contained in $\Omega$, i.e., there is an $\epsilon > 0$ such that $B(y, \epsilon) \subset \Omega$. A set is bounded if it is contained within a ball with finite radius. A set is compact if it is both closed and bounded. The (topological) interior of any set $\Omega$, denoted $\mathring{\Omega}$, is the set of points in $\Omega$ which are the centers of some balls contained in $\Omega$. The closure of $\Omega$, denoted $\hat{\Omega}$, is the smallest closed set containing $\Omega$. The boundary of $\Omega$ is the part of $\hat{\Omega}$ that is not in $\mathring{\Omega}$.

A set $C$ is said to be convex if for every $x^1, x^2 \in C$ and every real number $\alpha$, $0 < \alpha < 1$, the point $\alpha x^1 + (1 - \alpha) x^2 \in C$. The convex hull of a set $\Omega$ is the intersection of all convex sets containing $\Omega$.

A set $C$ is a cone if $x \in C$ implies $\alpha x \in C$ for all $\alpha > 0$. A cone that is also convex is a convex cone. For a cone $C \subset E$, the dual of $C$ is the cone

$$C^* := \{y : \langle x, y \rangle \ge 0 \text{ for all } x \in C\},$$

where $\langle \cdot, \cdot \rangle$ is an inner product operation for the space $E$.

Example 1.1 The $n$-dimensional nonnegative orthant, $\mathbb{R}^n_+ = \{x \in \mathbb{R}^n : x \ge 0\}$, is a convex cone. The dual of the cone is also $\mathbb{R}^n_+$; it is self-dual.

Example 1.2 The set of all positive semidefinite matrices in $\mathcal{M}^n$, $\mathcal{M}^n_+$, is a convex cone, called the positive semidefinite matrix cone. The dual of the cone is also $\mathcal{M}^n_+$; it is self-dual.

Example 1.3 The set $\{(t; x) \in \mathbb{R}^{n+1} : t \ge \|x\|\}$ is a convex cone in $\mathbb{R}^{n+1}$, called the second-order cone. The dual of the cone is also the second-order cone in $\mathbb{R}^{n+1}$; it is self-dual.

A cone $C$ is (convex) polyhedral if $C$ can be represented by

$$C = \{x : Ax \le 0\}$$

for some matrix $A$ (Figure 1.1).

Example 1.4 The nonnegative orthant is a polyhedral cone, and neither the positive semidefinite matrix cone nor the second-order cone is polyhedral.
[Figure 1.1: Polyhedral and nonpolyhedral cones. Panels: Polyhedral Cone; Nonpolyhedral Cone.]

The most important type of convex set is a hyperplane. Hyperplanes dominate the entire theory of optimization. Let $a$ be a nonzero $n$-dimensional vector, and let $b$ be a real number. The set

$$H = \{x \in \mathbb{R}^n : a^T x = b\}$$

is a hyperplane in $\mathbb{R}^n$ (Figure 1.2). Relating to the hyperplane, positive and negative closed half spaces are given by

$$H_+ = \{x : a^T x \ge b\}, \qquad H_- = \{x : a^T x \le b\}.$$

[Figure 1.2: A hyperplane and half spaces.]

A set which can be expressed as the intersection of a finite number of closed half spaces is said to be a convex polyhedron:

$$P = \{x : Ax \le b\}.$$
A bounded polyhedron is called a polytope. Let $P$ be a polyhedron in $\mathbb{R}^n$; $F$ is a face of $P$ if and only if there is a vector $c$ for which $F$ is the set of points attaining $\max\{c^T x : x \in P\}$, provided that this maximum is finite. A polyhedron has only finitely many faces; each face is a nonempty polyhedron.

The most important theorem about convex sets is the following separating theorem (Figure 1.3).

Theorem 1.1 (Separating hyperplane theorem) Let $C \subset E$, where $E$ is either $\mathbb{R}^n$ or $\mathcal{M}^n$, be a closed convex set and let $y$ be a point exterior to $C$. Then there is a vector $a \in E$ such that

$$\langle a, y \rangle < \inf_{x \in C} \langle a, x \rangle.$$

The geometric interpretation of the theorem is that, given a convex set $C$ and a point $y$ outside of $C$, there is a hyperplane containing $y$ that contains $C$ in one of its open half spaces.

[Figure 1.3: Illustration of the separating hyperplane theorem; an exterior point y is separated by a hyperplane from a convex set C.]

Example 1.5 Let $C$ be a unit disk centered at the point $(1; 1)$. That is, $C = \{x \in \mathbb{R}^2 : (x_1 - 1)^2 + (x_2 - 1)^2 \le 1\}$. If $y = (2; 0)$, $a = (-1; 1)$ is a separating hyperplane vector. If $y = (0; -1)$, $a = (0; 1)$ is a separating hyperplane vector. It is worth noting that these separating hyperplanes are not unique.

We use the notation $E$ to represent either $\mathbb{R}^n$ or $\mathcal{M}^n$, depending on the context, throughout this book, because all our decision and optimization problems take variables from one or both of these two vector spaces.
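The two claims in Example 1.5 can be checked numerically. The sketch below is an illustration in plain Python; it relies on the standard fact (not stated in the text) that the infimum of $\langle a, x \rangle$ over a disk with center $z$ and radius $r$ equals $\langle a, z \rangle - r\|a\|$. The function names and the extra test vectors are hypothetical.

```python
import math

def inf_over_disk(a, center, radius):
    """inf of <a, x> over {x : ||x - center|| <= radius}.

    The infimum is attained at center - radius * a / ||a||.
    """
    dot = a[0] * center[0] + a[1] * center[1]
    return dot - radius * math.hypot(a[0], a[1])

def separates(a, y, center=(1.0, 1.0), radius=1.0):
    """True if <a, y> < inf_{x in C} <a, x>, the condition in Theorem 1.1."""
    return a[0] * y[0] + a[1] * y[1] < inf_over_disk(a, center, radius)

case1 = separates((-1.0, 1.0), (2.0, 0.0))    # -2 < -sqrt(2)
case2 = separates((0.0, 1.0), (0.0, -1.0))    # -1 < 0

# Non-uniqueness: another vector that also separates y = (2; 0) from C.
case3 = separates((-1.0, 0.5), (2.0, 0.0))

# A vector that does not separate y = (2; 0) from C.
case4 = separates((1.0, 0.0), (2.0, 0.0))
```

Here `case1`, `case2`, and `case3` evaluate to `True` and `case4` to `False`, which matches the remark that separating hyperplanes are not unique.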
1.2.3 Real functions
The real function $f(x)$ is said to be continuous at $x$ if $x^k \to x$ implies $f(x^k) \to f(x)$. The real function $f(x)$ is said to be continuous on a set $\Omega \subset E$ (recall that $E$ is either $\mathbb{R}^n$ or $\mathcal{M}^n$) if $f(x)$ is continuous at $x$ for every $x \in \Omega$.
A function $f(x)$ is called homogeneous of degree $k$ if $f(\alpha x) = \alpha^k f(x)$ for all $\alpha \ge 0$.

Example 1.6 Let $c \in \mathbb{R}^n$ be given and $x \in \mathring{\mathbb{R}}^n_+$. Then $c^T x$ is homogeneous of degree 1 and

$$P(x) = n \log(c^T x) - \sum_{j=1}^n \log x_j$$

is homogeneous of degree 0, where log is the natural logarithmic function. Let $C \in \mathcal{M}^n$ be given and $X \in \mathring{\mathcal{M}}^n_+$. Then $x^T C x$ is homogeneous of degree 2, $C \bullet X$ is homogeneous of degree 1, $\det(X)$ is homogeneous of degree $n$, and

$$P(X) = n \log(C \bullet X) - \log \det(X)$$

is homogeneous of degree 0.

A set of real-valued functions $f_1, f_2, ..., f_m$ defined on $E$ can be written as a single vector function $f = (f_1, f_2, ..., f_m)^T \in \mathbb{R}^m$. If $f$ has continuous partial derivatives of order $p$, we say $f \in C^p$. The gradient vector or matrix of a real-valued function $f \in C^1$ is a vector or matrix

$$\nabla f(x) = \{\partial f / \partial x_{ij}\}, \quad \text{for } i, j = 1, ..., n.$$

If $f \in C^2$, we define the Hessian of $f$ to be the $n^2$-dimensional symmetric matrix

$$\nabla^2 f(x) = \left\{ \frac{\partial^2 f}{\partial x_{ij} \partial x_{kl}} \right\} \quad \text{for } i, j, k, l = 1, ..., n.$$

If $f = (f_1, f_2, ..., f_m)^T \in \mathbb{R}^m$, the Jacobian matrix of $f$ is

$$\nabla f(x) = \begin{pmatrix} \nabla f_1(x) \\ \vdots \\ \nabla f_m(x) \end{pmatrix}.$$

$f$ is a (continuous) convex function if and only if for $0 \le \alpha \le 1$,

$$f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y).$$

$f$ is a (continuous) quasi-convex function if and only if for $0 \le \alpha \le 1$,

$$f(\alpha x + (1 - \alpha) y) \le \max[f(x), f(y)].$$

Thus, a convex function is a quasi-convex function. The level set of $f$ is given by

$$L(z) = \{x : f(x) \le z\}.$$

If $f$ is a quasi-convex function, then the level set of $f$ is convex for any given $z$ (see Exercise 1.9).

A group of results that are used frequently in analysis are under the heading of Taylor's theorem or the mean-value theorem. The theorem establishes the linear and quadratic approximations of a function.
Theorem 1.2 (Taylor expansion) Let $f \in C^1$ be in a region containing the line segment $[x, y]$. Then there is an $\alpha$, $0 \le \alpha \le 1$, such that

$$f(y) = f(x) + \nabla f(\alpha x + (1 - \alpha) y)(y - x).$$

Furthermore, if $f \in C^2$ then there is an $\alpha$, $0 \le \alpha \le 1$, such that

$$f(y) = f(x) + \nabla f(x)(y - x) + (1/2)(y - x)^T \nabla^2 f(\alpha x + (1 - \alpha) y)(y - x).$$

We also have several propositions for real functions. The first indicates that the linear approximation of a convex function is an underestimate.

Proposition 1.3 Let $f \in C^1$. Then $f$ is convex over a convex set $\Omega$ if and only if

$$f(y) \ge f(x) + \nabla f(x)(y - x)$$

for all $x, y \in \Omega$.

The following proposition states that the Hessian of a convex function is positive semidefinite.

Proposition 1.4 Let $f \in C^2$. Then $f$ is convex over a convex set $\Omega$ if and only if the Hessian matrix of $f$ is positive semidefinite throughout $\Omega$.
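Two facts from this subsection lend themselves to a quick numerical check: the degree-0 homogeneity of the function $P(x)$ in Example 1.6, and the underestimate property of Proposition 1.3. The sketch below uses plain Python; the sample data and the choice of test function $f(x) = -\sum_j \log x_j$ (which is convex on the positive orthant) are assumptions made for illustration.

```python
import math

def potential(c, x):
    """P(x) = n log(c^T x) - sum_j log x_j, for x > 0 with c^T x > 0."""
    n = len(x)
    ctx = sum(ci * xi for ci, xi in zip(c, x))
    return n * math.log(ctx) - sum(math.log(xi) for xi in x)

# Hypothetical data with x strictly positive.
c = [1.0, 2.0, 3.0]
x = [0.5, 1.5, 2.0]

# Degree-0 homogeneity: scaling x by alpha > 0 leaves P unchanged.
scaled = [10.0 * xi for xi in x]
degree_zero = math.isclose(potential(c, x), potential(c, scaled))

def f(x):
    """A convex function on the positive orthant: f(x) = -sum_j log x_j."""
    return -sum(math.log(xi) for xi in x)

def grad_f(x):
    """Gradient of f: the j-th entry is -1 / x_j."""
    return [-1.0 / xi for xi in x]

# Proposition 1.3: f(y) >= f(x) + grad f(x) (y - x) for a second point y > 0.
y = [2.0, 0.3, 1.0]
linear_part = f(x) + sum(g * (yi - xi) for g, yi, xi in zip(grad_f(x), y, x))
underestimates = f(y) >= linear_part
```

A single pair of points does not prove convexity, of course; the check only illustrates the direction of the inequality.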
1.2.4 Inequalities
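There are several important inequalities that are frequently used in algorithm design and complexity analysis.

Cauchy-Schwarz: given $x, y \in \mathbb{R}^n$, then

$$x^T y \le \|x\| \|y\|.$$

Arithmetic-geometric mean: given $x \in \mathbb{R}^n_+$,

$$\frac{\sum x_j}{n} \ge \left( \prod x_j \right)^{1/n}.$$

Harmonic: given $x \in \mathring{\mathbb{R}}^n_+$,

$$\left( \sum x_j \right) \left( \sum 1/x_j \right) \ge n^2.$$

Hadamard: given $A \in \mathbb{R}^{m \times n}$ with columns $a_1, a_2, ..., a_n$, then

$$\sqrt{\det(A^T A)} \le \prod \|a_j\|.$$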
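As a sanity check, the four inequalities above can be evaluated on small sample data. This is only an illustrative sketch in plain Python with made-up vectors; for the Hadamard inequality it uses a square $2 \times 2$ matrix, for which $\sqrt{\det(A^T A)} = |\det(A)|$.

```python
import math

# Hypothetical sample data: x is strictly positive, y is arbitrary.
x = [1.0, 2.0, 4.0]
y = [3.0, -1.0, 0.5]
n = len(x)

def norm2(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(t * t for t in v))

# Cauchy-Schwarz: x^T y <= ||x|| ||y||
cauchy_schwarz = sum(a * b for a, b in zip(x, y)) <= norm2(x) * norm2(y)

# Arithmetic-geometric mean: (sum x_j) / n >= (prod x_j)^(1/n)
am_gm = sum(x) / n >= math.prod(x) ** (1.0 / n)

# Harmonic: (sum x_j)(sum 1/x_j) >= n^2
harmonic = sum(x) * sum(1.0 / t for t in x) >= n ** 2

# Hadamard for a 2x2 matrix with columns a1 and a2:
# sqrt(det(A^T A)) = |det(A)| <= ||a1|| ||a2||
a1, a2 = [1.0, 2.0], [3.0, 1.0]
det_A = a1[0] * a2[1] - a2[0] * a1[1]
hadamard = abs(det_A) <= norm2(a1) * norm2(a2)
```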
1.3 Some Basic Decision and Optimization Problems
A decision or optimization problem has a form that is usually characterized by the decision variables and the constraints. A problem, $P$, consists of two sets, the data set $\mathcal{Z}_p$ and the solution set $\mathcal{S}_p$. In general, $\mathcal{S}_p$ can be implicitly defined by the so-called optimality conditions. The solution set may be empty, i.e., problem $P$ may have no solution.

Theorem 1.5 (Weierstrass theorem) A continuous function $f$ defined on a compact set (bounded and closed) $\Omega \subset E$ has a minimizer in $\Omega$; that is, there is an $x^* \in \Omega$ such that for all $x \in \Omega$, $f(x) \ge f(x^*)$.

In what follows, we list several decision and optimization problems. More problems will be listed later when we address them.
1.3.1 System of linear equations
Given $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$, the problem is to solve $m$ linear equations for $n$ unknowns:

$$Ax = b.$$

The data and solution sets are

$$\mathcal{Z}_p = \{A \in \mathbb{R}^{m \times n}, b \in \mathbb{R}^m\} \quad \text{and} \quad \mathcal{S}_p = \{x \in \mathbb{R}^n : Ax = b\}.$$

$\mathcal{S}_p$ in this case is an affine set. Given an $x$, one can easily check whether $x$ is in $\mathcal{S}_p$ by a matrix-vector multiplication and a vector-vector comparison. We say that a solution of this problem is easy to recognize.

To highlight the analogy with the theories of linear inequalities and linear programming, we list several well-known results of linear algebra. The first theorem provides two basic representations, the null and row spaces, of a linear subspace.

Theorem 1.6 Each linear subspace of $\mathbb{R}^n$ is generated by finitely many vectors, and is also the intersection of finitely many linear hyperplanes; that is, for each linear subspace $L$ of $\mathbb{R}^n$ there are matrices $A$ and $C$ such that $L = \mathcal{N}(A) = \mathcal{R}(C)$.

The following theorem was observed by Gauss. It is sometimes called the fundamental theorem of linear algebra. It gives an example of a characterization in terms of necessary and sufficient conditions, where necessity is straightforward, and sufficiency is the key of the characterization.

Theorem 1.7 Let $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$. The system $\{x : Ax = b\}$ has a solution if and only if $A^T y = 0$ implies $b^T y = 0$.

A vector $y$, with $A^T y = 0$ and $b^T y = 1$, is called an infeasibility certificate for the system $\{x : Ax = b\}$.

Example 1.7 Let $A = (1; -1)$ and $b = (1; 1)$. Then, $y = (1/2; 1/2)$ is an infeasibility certificate for $\{x : Ax = b\}$.
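Checking an infeasibility certificate is as easy as checking a solution: it takes one matrix-vector product and one inner product. The following minimal sketch in plain Python verifies Example 1.7; the function names and the tolerance are assumptions introduced here for illustration.

```python
def mat_T_vec(A, y):
    """Compute A^T y for a matrix A stored as a list of rows."""
    m, n = len(A), len(A[0])
    return [sum(A[i][j] * y[i] for i in range(m)) for j in range(n)]

def is_infeasibility_certificate(A, b, y, tol=1e-12):
    """True if A^T y = 0 and b^T y = 1, which proves that Ax = b has no solution."""
    aty = mat_T_vec(A, y)
    bty = sum(bi * yi for bi, yi in zip(b, y))
    return all(abs(v) <= tol for v in aty) and abs(bty - 1.0) <= tol

# Example 1.7: A = (1; -1) is a 2x1 matrix, i.e. two equations in one unknown.
A = [[1.0], [-1.0]]
b = [1.0, 1.0]
y = [0.5, 0.5]
certified = is_infeasibility_certificate(A, b, y)

# With b = (1; -1) the system is solvable (x = 1), so the same y certifies nothing.
not_certified = is_infeasibility_certificate(A, [1.0, -1.0], y)
```

The existence of such a $y$ rules out any solution, since $Ax = b$ would give $0 = y^T A x = y^T b = 1$.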
1.3.2 Linear least-squares problem
Given $A \in \mathbb{R}^{m \times n}$ and $c \in \mathbb{R}^n$, the system of equations $A^T y = c$ may be over-determined or have no solution. Such a case usually occurs when the number of equations is greater than the number of variables. Then, the problem is to find a $y \in \mathbb{R}^m$ or $s \in \mathcal{R}(A^T)$ such that $\|A^T y - c\|$ or $\|s - c\|$ is minimized. We can write the problem in the following format:

(LS)  minimize    $\|A^T y - c\|^2$
      subject to  $y \in \mathbb{R}^m$,

or

(LS)  minimize    $\|s - c\|^2$
      subject to  $s \in \mathcal{R}(A^T)$.

In the former format, the term $\|A^T y - c\|^2$ is called the objective function, and $y$ is called the decision variable. Since $y$ can be any point in $\mathbb{R}^m$, we say this (optimization) problem is unconstrained. The data and solution sets are

$$\mathcal{Z}_p = \{A \in \mathbb{R}^{m \times n}, c \in \mathbb{R}^n\}$$

and

$$\mathcal{S}_p = \{y \in \mathbb{R}^m : \|A^T y - c\|^2 \le \|A^T x - c\|^2 \text{ for every } x \in \mathbb{R}^m\}.$$

Given a $y$, checking whether $y \in \mathcal{S}_p$ is as difficult as solving the original problem. However, from a projection theorem in linear algebra, the solution set can be characterized and represented as

$$\mathcal{S}_p = \{y \in \mathbb{R}^m : A A^T y = A c\},$$

which becomes a system of linear equations and always has a solution. The vector $s = A^T y = A^T (A A^T)^+ A c$ is the projection of $c$ onto the range of $A^T$, where $A A^T$ is called the normal matrix and $(A A^T)^+$ is called the pseudo-inverse. If $A$ has full row rank then $(A A^T)^+ = (A A^T)^{-1}$, the standard inverse of the full rank matrix $A A^T$. If $A$ is not of full rank, neither is $A A^T$, and $(A A^T)^+ A A^T x = x$ only for $x \in \mathcal{R}(A)$.

The vector $c - A^T y = (I - A^T (A A^T)^+ A) c$ is the projection of $c$ onto the null space of $A$. It is the solution of the following least-squares problem:

(LS)  minimize    $\|x - c\|^2$
      subject to  $x \in \mathcal{N}(A)$.

In the full rank case, both matrices $A^T (A A^T)^{-1} A$ and $I - A^T (A A^T)^{-1} A$ are called projection matrices. These symmetric matrices have several desired properties (see Exercise 1.15).
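The normal equations $A A^T y = A c$ and the two projections can be traced on a small full-row-rank instance. The sketch below is a plain-Python illustration under that full-rank assumption; the data, the $2 \times 2$ Cramer's-rule helper `solve_2x2`, and the variable names are hypothetical. A practical implementation would use a numerical linear algebra routine instead.

```python
def solve_2x2(M, r):
    """Solve the 2x2 system M z = r by Cramer's rule (M assumed nonsingular)."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(r[0] * M[1][1] - M[0][1] * r[1]) / det,
            (M[0][0] * r[1] - M[1][0] * r[0]) / det]

# Hypothetical data: A has full row rank with m = 2, n = 3.
A = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
c = [1.0, 2.0, 0.0]
m, n = len(A), len(A[0])

# Normal equations: (A A^T) y = A c
AAT = [[sum(A[i][k] * A[j][k] for k in range(n)) for j in range(m)] for i in range(m)]
Ac = [sum(A[i][k] * c[k] for k in range(n)) for i in range(m)]
y = solve_2x2(AAT, Ac)

# s = A^T y is the projection of c onto the range of A^T.
s = [sum(A[i][k] * y[i] for i in range(m)) for k in range(n)]

# r = c - A^T y is the projection of c onto the null space of A.
r = [c[k] - s[k] for k in range(n)]

# Sanity checks: A r = 0, and s is orthogonal to r.
Ar = [sum(A[i][k] * r[k] for k in range(n)) for i in range(m)]
s_dot_r = sum(s[k] * r[k] for k in range(n))
```

For this data the least-squares solution is $y = (0; 1)$, giving $s = (0; 1; 1)$ and residual $c - s = (1; 1; -1)$.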
1.3.3 System of linear inequalities
Given $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$, the problem is to find a solution $x \in \mathbb{R}^n$ satisfying $Ax \le b$ or prove that the solution set is empty. The inequality problem includes other forms such as finding an $x$ that satisfies the combination of linear equations $Ax = b$ and inequalities $x \ge 0$. The data and solution sets of the latter are

$$\mathcal{Z}_p = \{A \in \mathbb{R}^{m \times n}, b \in \mathbb{R}^m\} \quad \text{and} \quad \mathcal{S}_p = \{x \in \mathbb{R}^n : Ax = b, x \ge 0\}.$$

Traditionally, a point in $\mathcal{S}_p$ is called a feasible solution, and a strictly positive point in $\mathcal{S}_p$ is called a strictly feasible or interior feasible solution.

The following results are Farkas' lemma and its variants.

Theorem 1.8 (Farkas' lemma) Let $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$. Then, the system $\{x : Ax = b, x \ge 0\}$ has a feasible solution $x$ if and only if $A^T y \le 0$ implies $b^T y \le 0$.

A vector $y$, with $A^T y \le 0$ and $b^T y = 1$, is called a (primal) infeasibility certificate for the system $\{x : Ax = b, x \ge 0\}$. Geometrically, Farkas' lemma means that if a vector $b \in \mathbb{R}^m$ does not belong to the cone generated by $a_{.1}, ..., a_{.n}$, then there is a hyperplane separating $b$ from $\mathrm{cone}(a_{.1}, ..., a_{.n})$.

Example 1.8 Let $A = (1, 1)$ and $b = -1$. Then, $y = -1$ is an infeasibility certificate for $\{x : Ax = b, x \ge 0\}$.

Theorem 1.9 (Farkas' lemma variant) Let $A \in \mathbb{R}^{m \times n}$ and $c \in \mathbb{R}^n$. Then, the system $\{y : A^T y \le c\}$ has a solution $y$ if and only if $Ax = 0$ and $x \ge 0$ imply $c^T x \ge 0$.

Again, a vector $x \ge 0$, with $Ax = 0$ and $c^T x = -1$, is called a (dual) infeasibility certificate for the system $\{y : A^T y \le c\}$.

Example 1.9 Let $A = (1, -1)$ and $c = (1; -2)$. Then, $x = (1; 1)$ is an infeasibility certificate for $\{y : A^T y \le c\}$.

We say $\{x : Ax = b, x \ge 0\}$ or $\{y : A^T y \le c\}$ is approximately feasible in the sense that we have an approximate solution to the equations and inequalities. In this case we can show that any certificate proving their infeasibility must have large norm. Conversely, if $\{x : Ax = b, x \ge 0\}$ or $\{y : A^T y \le c\}$ is “approximately infeasible” in the sense that we have an approximate certificate in Farkas' lemma, then any feasible solution must have large norm.

Example 1.10 Let $\epsilon > 0$ be given but small. Let $A = (1, 1)$ and $b = -\epsilon$. Then, $x = (0; 0)$ is approximately feasible for $\{x : Ax = b, x \ge 0\}$, and the infeasibility certificate $y = -1/\epsilon$ has a large norm. Let $A = (1, -1)$ and $c = (1; -1 - \epsilon)$. Then, $y = 1$ is approximately feasible for $\{y : A^T y \le c\}$, and the infeasibility certificate $x = (1/\epsilon; 1/\epsilon)$ has a large norm.
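The certificates in Examples 1.8 to 1.10 can be verified mechanically. The following plain-Python sketch is illustrative only; the function names and tolerance are assumptions, and in the dual examples $A$ is taken as the $1 \times 2$ row matrix $(1, -1)$ so that the dimensions agree with $c \in \mathbb{R}^2$.

```python
def is_primal_certificate(A, b, y, tol=1e-12):
    """A^T y <= 0 and b^T y = 1: proves that {x : Ax = b, x >= 0} is empty."""
    m, n = len(A), len(A[0])
    aty = [sum(A[i][j] * y[i] for i in range(m)) for j in range(n)]
    bty = sum(b[i] * y[i] for i in range(m))
    return all(v <= tol for v in aty) and abs(bty - 1.0) <= tol

def is_dual_certificate(A, c, x, tol=1e-12):
    """x >= 0, Ax = 0 and c^T x = -1: proves that {y : A^T y <= c} is empty."""
    m, n = len(A), len(A[0])
    ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(m)]
    ctx = sum(c[j] * x[j] for j in range(n))
    return (all(v >= -tol for v in x)
            and all(abs(v) <= tol for v in ax)
            and abs(ctx + 1.0) <= tol)

# Example 1.8: A = (1, 1), b = -1, y = -1
ex_1_8 = is_primal_certificate([[1.0, 1.0]], [-1.0], [-1.0])

# Example 1.9: A = (1, -1), c = (1; -2), x = (1; 1)
ex_1_9 = is_dual_certificate([[1.0, -1.0]], [1.0, -2.0], [1.0, 1.0])

# Example 1.10, first part, with eps = 1e-3: the certificate y = -1/eps is large.
eps = 1e-3
y_cert = [-1.0 / eps]
ex_1_10 = is_primal_certificate([[1.0, 1.0]], [-eps], y_cert)
cert_norm = abs(y_cert[0])
```

As $\epsilon$ shrinks, the system gets closer to feasibility and `cert_norm` grows like $1/\epsilon$, which is the behavior described before Example 1.10.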
1.3.4 Linear programming (LP)
Given $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$ and $c, l, u \in \mathbb{R}^n$, the linear programming (LP) problem is the following optimization problem:

      minimize    $c^T x$
      subject to  $Ax = b$, $l \le x \le u$,

where some elements in $l$ may be $-\infty$, meaning that the associated variables are unbounded from below, and some elements in $u$ may be $\infty$, meaning that the associated variables are unbounded from above. If a variable is unbounded both from below and from above, then it is called a “free” variable.

The standard form linear programming problem is given below, which we will use throughout this book:

(LP)  minimize    $c^T x$
      subject to  $Ax = b$, $x \ge 0$.

The linear function $c^T x$ is called the objective function, and $x$ is called the decision variables. In this problem, $Ax = b$ and $x \ge 0$ enforce constraints on the selection of $x$. The set $\mathcal{F}_p = \{x : Ax = b, x \ge 0\}$ is called the feasible set or feasible region. A point $x \in \mathcal{F}_p$ is called a feasible point, and a feasible point $x^*$ is called an optimal solution if $c^T x^* \le c^T x$ for all feasible points $x$. If there is a sequence $\{x^k\}$ such that $x^k$ is feasible and $c^T x^k \to -\infty$, then (LP) is said to be unbounded.

The data and solution sets for (LP), respectively, are

$$\mathcal{Z}_p = \{A \in \mathbb{R}^{m \times n}, b \in \mathbb{R}^m, c \in \mathbb{R}^n\}$$

and

$$\mathcal{S}_p = \{x \in \mathcal{F}_p : c^T x \le c^T y \text{ for every } y \in \mathcal{F}_p\}.$$

Again, given an $x$, checking whether $x \in \mathcal{S}_p$ is as difficult as solving the original problem. However, due to the duality theorem, we can simplify the representation of the solution set significantly.

With every (LP), another linear program, called the dual (LD), is the following problem:

(LD)  maximize    $b^T y$
      subject to  $A^T y + s = c$, $s \ge 0$,

where $y \in \mathbb{R}^m$ and $s \in \mathbb{R}^n$. The components of $s$ are called dual slacks. Denote by $\mathcal{F}_d$ the set of all $(y, s)$ that are feasible for the dual. We see that (LD) is also a linear programming problem where $y$ is a “free” vector.

The following theorems give us an important relation between the two problems.

Theorem 1.10 (Weak duality theorem) Let $\mathcal{F}_p$ and $\mathcal{F}_d$ be non-empty. Then,

$$c^T x \ge b^T y \quad \text{where } x \in \mathcal{F}_p, \ (y, s) \in \mathcal{F}_d.$$
This theorem shows that a feasible solution to either problem yields a bound on the value of the other problem. We call $c^T x - b^T y$ the duality gap. From this we have important results.

Theorem 1.11 (Strong duality theorem) Let $\mathcal{F}_p$ and $\mathcal{F}_d$ be non-empty. Then, $x^*$ is optimal for (LP) if and only if the following conditions hold:

i) $x^* \in \mathcal{F}_p$;
ii) there is $(y^*, s^*) \in \mathcal{F}_d$;
iii) $c^T x^* = b^T y^*$.

Theorem 1.12 (LP duality theorem) If (LP) and (LD) both have feasible solutions then both problems have optimal solutions and the optimal values of the objective functions are equal. If one of (LP) or (LD) has no feasible solution, then the other is either unbounded or has no feasible solution. If one of (LP) or (LD) is unbounded then the other has no feasible solution.

The above theorems show that if a pair of feasible solutions can be found to the primal and dual problems with equal objective values, then these are both optimal. The converse is also true; there is no “gap.” From this condition, the solution set for (LP) and (LD) is

$$\mathcal{S}_p = \left\{ (x, y, s) \in (\mathbb{R}^n_+, \mathbb{R}^m, \mathbb{R}^n_+) : \begin{array}{rcl} c^T x - b^T y &=& 0 \\ Ax &=& b \\ -A^T y - s &=& -c \end{array} \right\}, \qquad (1.1)$$

which is a system of linear inequalities and equations. Now it is easy to verify whether or not a pair $(x, y, s)$ is optimal.

For feasible $x$ and $(y, s)$, $x^T s = x^T (c - A^T y) = c^T x - b^T y$ is called the complementarity gap. If $x^T s = 0$, then we say $x$ and $s$ are complementary to each other. Since both $x$ and $s$ are nonnegative, $x^T s = 0$ implies that $x_j s_j = 0$ for all $j = 1, \ldots, n$. Thus, one equation plus nonnegativity are transformed into $n$ equations. Equations in (1.1) become

$$\begin{array}{rcl} D_x s &=& 0 \\ Ax &=& b \\ -A^T y - s &=& -c. \end{array} \qquad (1.2)$$

This system has a total of $2n + m$ unknowns and $2n + m$ equations, including $n$ nonlinear equations.

The following theorem plays an important role in analyzing LP interior-point algorithms. It gives a unique partition of the LP variables in terms of complementarity.
Theorem 1.13 (Strict complementarity theorem) If (LP) and (LD) both have feasible solutions then both problems have a pair of strictly complementary solutions $x^* \ge 0$ and $s^* \ge 0$, meaning

$$X^* s^* = 0 \quad \text{and} \quad x^* + s^* > 0,$$

where $X^* = \mathrm{diag}(x^*)$. Moreover, the supports

$$P^* = \{j : x^*_j > 0\} \quad \text{and} \quad Z^* = \{j : s^*_j > 0\}$$

are invariant for all pairs of strictly complementary solutions.

Given (LP) or (LD), the pair of $P^*$ and $Z^*$ is called the (strict) complementarity partition. $\{x : A_{P^*} x_{P^*} = b, \ x_{P^*} \ge 0, \ x_{Z^*} = 0\}$ is called the primal optimal face, and $\{y : c_{Z^*} - A_{Z^*}^T y \ge 0, \ c_{P^*} - A_{P^*}^T y = 0\}$ is called the dual optimal face.

Select $m$ linearly independent columns, denoted by the index set $B$, from $A$. Then the matrix $A_B$ is nonsingular and we may uniquely solve

$$A_B x_B = b$$

for the $m$-vector $x_B$. By setting the variables, $x_N$, of $x$ corresponding to the remaining columns of $A$ equal to zero, we obtain a solution $x$ such that $Ax = b$. Then, $x$ is said to be a (primal) basic solution to (LP) with respect to the basis $A_B$. The components of $x_B$ are called basic variables. A dual vector $y$ satisfying

$$A_B^T y = c_B$$

is said to be the corresponding dual basic solution. If a basic solution $x \ge 0$, then $x$ is called a basic feasible solution. If the dual solution is also feasible, that is, $s = c - A^T y \ge 0$, then $x$ is called an optimal basic solution and $A_B$ an optimal basis. A basic feasible solution is a vertex on the boundary of the feasible region. An optimal basic solution is an optimal vertex of the feasible region.

If one or more components in $x_B$ has value zero, that basic solution $x$ is said to be (primal) degenerate. Note that in a nondegenerate basic solution the basic variables and the basis can be immediately identified from the nonzero components of the basic solution. If all components, $s_N$, in the corresponding dual slack vector $s$, except for $s_B$, are non-zero, then $y$ is said to be (dual) nondegenerate. If both primal and dual basic solutions are nondegenerate, $A_B$ is called a nondegenerate basis.

Theorem 1.14 (LP fundamental theorem) Given (LP) and (LD) where $A$ has full row rank $m$,
i) if there is a feasible solution, there is a basic feasible solution;
ii) if there is an optimal solution, there is an optimal basic solution.

The above theorem reduces the task of solving a linear program to that of searching over basic feasible solutions. By expanding upon this result, the simplex method, a finite search procedure, is derived. The simplex method proceeds from one basic feasible solution (an extreme point of the feasible region) to an adjacent one, in such a way as to continuously decrease the value of the objective function until a minimizer is reached. In contrast, interior-point algorithms move in the interior of the feasible region and reduce the value of the objective function, hoping to bypass many extreme points on the boundary of the region.
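To make the notions of basic solution, dual basic solution, and complementarity concrete, the sketch below enumerates every basis of a tiny standard-form LP, keeps the best basic feasible solution, and then checks the optimality conditions (1.1) and (1.2). It is a brute-force illustration in plain Python, not the simplex method or an interior-point algorithm; the problem data and the helper `solve_2x2` are assumptions made for this example.

```python
from itertools import combinations

# Hypothetical standard-form LP: minimize c^T x subject to Ax = b, x >= 0.
A = [[1.0, 1.0, 1.0, 0.0],
     [1.0, 3.0, 0.0, 1.0]]
b = [4.0, 6.0]
c = [-1.0, -2.0, 0.0, 0.0]
m, n = 2, 4

def solve_2x2(M, r):
    """Solve M z = r by Cramer's rule; return None if M is singular."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    if abs(det) < 1e-12:
        return None
    return [(r[0] * M[1][1] - M[0][1] * r[1]) / det,
            (M[0][0] * r[1] - M[1][0] * r[0]) / det]

best = None
for B in combinations(range(n), m):
    AB = [[A[i][j] for j in B] for i in range(m)]
    xB = solve_2x2(AB, b)
    if xB is None or min(xB) < 0:      # not a basis, or basic but infeasible
        continue
    x = [0.0] * n                      # nonbasic variables stay at zero
    for j, v in zip(B, xB):
        x[j] = v
    value = sum(cj * xj for cj, xj in zip(c, x))
    if best is None or value < best[0]:
        best = (value, B, x)

value, B, x = best

# Dual basic solution: solve A_B^T y = c_B, then form the slacks s = c - A^T y.
ABT = [[A[i][j] for i in range(m)] for j in B]
y = solve_2x2(ABT, [c[j] for j in B])
s = [c[j] - sum(A[i][j] * y[i] for i in range(m)) for j in range(n)]

dual_value = sum(bi * yi for bi, yi in zip(b, y))
gap = sum(xj * sj for xj, sj in zip(x, s))                      # x^T s
strictly_complementary = all(xj + sj > 0 for xj, sj in zip(x, s))
```

For this data the best basis is $B = \{1, 2\}$ (indices 0 and 1 in the code) with $x = (3; 1; 0; 0)$, $y = (-1/2; -1/2)$, and $s = (0; 0; 1/2; 1/2) \ge 0$, so the basis is optimal, the duality gap is zero, and the pair is strictly complementary. Enumeration is exponential in general, which is why the simplex and interior-point methods matter.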
1.3.5
Quadratic programming (QP)
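Given $Q \in R^{n \times n}$, $A \in R^{m \times n}$, $b \in R^m$ and $c \in R^n$, the quadratic programming (QP) problem is the following optimization problem:
$$(QP) \quad \begin{array}{ll} \mbox{minimize} & q(x) := (1/2) x^T Q x + c^T x \\ \mbox{subject to} & Ax = b,\ x \ge 0. \end{array}$$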
We may denote the feasible set by $F_p$. The data and solution sets for (QP) are
$$Z_p = \{Q \in R^{n \times n}, A \in R^{m \times n}, b \in R^m, c \in R^n\}$$
and
$$S_p = \{x \in F_p : q(x) \le q(y) \mbox{ for every } y \in F_p\}.$$
A feasible point $x^*$ is called a KKT point, where KKT stands for Karush-Kuhn-Tucker, if the following KKT conditions hold: there exists $(y^* \in R^m, s^* \in R^n)$ such that $(x^*, y^*, s^*)$ is feasible for the following dual problem:
$$(QD) \quad \begin{array}{ll} \mbox{maximize} & d(x, y) := b^T y - (1/2) x^T Q x \\ \mbox{subject to} & A^T y + s - Qx = c,\ x, s \ge 0, \end{array}$$
and satisfies the complementarity condition
$$(x^*)^T s^* = (1/2)(x^*)^T Q x^* + c^T x^* - \left( b^T y^* - (1/2)(x^*)^T Q x^* \right) = 0.$$
Similar to LP, we can write the KKT condition as: $(x, y, s) \in (R^n_+, R^m, R^n_+)$ and
$$\begin{array}{rcl} D_x s & = & 0 \\ Ax & = & b \\ -A^T y + Qx - s & = & -c. \end{array} \qquad (1.3)$$
Again, this system has total 2n + m unknowns and 2n + m equations including n nonlinear equations.
The above condition is also called the first-order necessary condition. If $Q$ is positive semidefinite, then $x^*$ is an optimal solution for (QP) if and only if $x^*$ is a KKT point for (QP). In this case, the solution set for (QP) is characterized by a system of linear inequalities and equations. One can see that (LP) is a special case of (QP).
1.4
Algorithms and Computations
An algorithm is a list of instructions to solve a problem. For every instance of problem P, i.e., for every given data $Z \in Z_p$, an algorithm for solving P either determines that $S_p$ is empty or generates an output $x$ such that $x \in S_p$ or $x$ is close to $S_p$ in a certain measure. The latter $x$ is called an approximate solution.
Let us use $A_p$ to denote the collection of all possible algorithms for solving every instance in P. Then, the (operation) complexity of an algorithm $A \in A_p$ for solving an instance $Z \in Z_p$ is defined as the total number of arithmetic operations: $+$, $-$, $*$, $/$, and comparison on real numbers. Denote it by $c_o(A, Z)$. Sometimes it is convenient to define the iteration complexity, denoted by $c_i(A, Z)$, where we assume that each iteration costs a polynomial number (in $m$ and $n$) of arithmetic operations. In most iterative algorithms, each iteration can be performed efficiently both sequentially and in parallel, such as solving a system of linear equations, rank-one updating the inverse of a matrix, pivoting operation of a matrix, multiplying a matrix by a vector, etc.
In the real number model, we introduce $\epsilon$, the error for an approximate solution, as a parameter. Let $c(A, Z, \epsilon)$ be the total number of operations of algorithm $A$ for generating an $\epsilon$-approximate solution, with a well-defined measure, to problem P. Then,
$$c(A, \epsilon) := \sup_{Z \in Z_p} c(A, Z, \epsilon) \le f_A(m, n, \epsilon) \quad \mbox{for any } \epsilon > 0.$$
We call this complexity model error-based. One may also view an approximate solution as an exact solution to a problem $\epsilon$-near to P with a well-defined measure in the data space. This is the so-called backward analysis model in numerical analysis.
If $f_A(m, n, \epsilon)$ is a polynomial in $m$, $n$, and $\log(1/\epsilon)$, then algorithm $A$ is a polynomial algorithm and problem P is polynomially solvable. Furthermore, if $f_A(m, n, \epsilon)$ is independent of $\epsilon$ and polynomial in $m$ and $n$, then we say algorithm $A$ is a strongly polynomial algorithm. If $f_A(m, n, \epsilon)$ is a polynomial in $m$, $n$, and $(1/\epsilon)$, then algorithm $A$ is a polynomial approximation scheme or pseudo-polynomial algorithm. For some optimization problems, the complexity theory can be applied to prove not only that they cannot be solved in polynomial time, but also that they do not have polynomial approximation schemes. In practice, approximation algorithms are widely used and accepted.
Example 1.11 There is a strongly polynomial algorithm for sorting a vector in
descending or ascending order, for matrix-vector multiplication, and for computing the norm of a vector.
Example 1.12 Consider the bisection method to locate a root of a continuous function $f(x) : R \to R$ within the interval $[0, 1]$, where $f(0) > 0$ and $f(1) < 0$. The method calls the oracle to evaluate $f(1/2)$ (counted as one operation). If $f(1/2) > 0$, we throw away $[0, 1/2)$; if $f(1/2) < 0$, we throw away $(1/2, 1]$. Then we repeat this process on the remaining half interval. Each step of the method halves the interval that contains the root. Thus, in about $\log_2(1/\epsilon)$ steps, we must have an approximate root whose distance to the root is less than $\epsilon$. Therefore, the bisection method is a polynomial algorithm.
We have to admit that the criterion of polynomiality is somewhat controversial. Many algorithms may not be polynomial but work fine in practice. This is because polynomiality is built upon worst-case analysis. However, this criterion generally provides a qualitative statement: if a problem is polynomially solvable, then the problem is indeed relatively easy to solve regardless of the algorithm used. Furthermore, it is ideal to develop an algorithm with both polynomiality and practical efficiency.
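The bisection method of Example 1.12 can be sketched in a few lines. This is only an illustration: the function name bisect_root and the test function cos(3x) are assumptions made here, not part of the text.

```python
import math

def bisect_root(f, eps):
    """Bisection on [0, 1], assuming f is continuous with f(0) > 0 > f(1)."""
    lo, hi = 0.0, 1.0
    steps = 0
    while hi - lo >= eps:
        mid = 0.5 * (lo + hi)
        value = f(mid)           # one oracle call
        steps += 1
        if value > 0:
            lo = mid             # throw away [lo, mid)
        elif value < 0:
            hi = mid             # throw away (mid, hi]
        else:
            return mid, steps    # landed exactly on a root
    return 0.5 * (lo + hi), steps

# Hypothetical test function: cos(3x) is positive at 0 and negative at 1.
root, steps = bisect_root(lambda x: math.cos(3.0 * x), 1e-6)
```

The number of oracle calls is bounded by the smallest integer k with 2^(-k) below the tolerance, which is the log(1/ε) count in the example.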
1.4.1
Convergence rate
Most algorithms are iterative in nature. They generate a sequence of ever-improving points $x^0, x^1, ..., x^k, ...$ approaching the solution set. For many optimization problems and/or algorithms, the sequence will never exactly reach the solution set. One theory of iterative algorithms, referred to as local or asymptotic convergence analysis, is concerned with the rate at which the optimality error of the generated sequence converges to zero.
Obviously, if each iteration of competing algorithms requires the same amount of work, the speed of the convergence of the error reflects the speed of the algorithm. This convergence rate, although it may hold locally or asymptotically, provides evaluation and comparison of different algorithms. It has been widely used by the nonlinear optimization and numerical analysis community as an efficiency criterion. In many cases, this criterion does explain the practical behavior of iterative algorithms.
Consider a sequence of real numbers $\{r_k\}$ converging to zero. One can define several notions related to the speed of convergence of such a sequence.
Definition 1.1. Let the sequence $\{r_k\}$ converge to zero. The order of convergence of $\{r_k\}$ is defined as the supremum of the nonnegative numbers $p$ satisfying
$$0 \le \limsup_{k \to \infty} \frac{|r_{k+1}|}{|r_k|^p} < \infty.$$
Definition 1.2. Let the sequence $\{r_k\}$ converge to zero such that
$$\limsup_{k \to \infty} \frac{|r_{k+1}|}{|r_k|^2} < \infty.$$
Then, the sequence is said to converge quadratically to zero.
It should be noted that the order of convergence is determined only by the properties of the sequence that hold as $k \to \infty$. In this sense we might say that the order of convergence is a measure of how good the tail of $\{r_k\}$ is. Larger values of $p$ imply faster convergence of the tail.
Definition 1.3. Let the sequence $\{r_k\}$ converge to zero such that
$$\limsup_{k \to \infty} \frac{|r_{k+1}|}{|r_k|} = \beta < 1.$$
Then, the sequence is said to converge linearly to zero with convergence ratio $\beta$.
Linear convergence is the most important type of convergence behavior. A linearly convergent sequence, with convergence ratio $\beta$, can be said to have a tail that converges to zero at least as fast as the geometric sequence $c\beta^k$ for a fixed number $c$. Thus, we also call linear convergence geometric convergence.
As a rule, when comparing the relative effectiveness of two competing algorithms both of which produce linearly convergent sequences, the comparison is based on their corresponding convergence ratios: the smaller the ratio, the faster the algorithm. The ultimate case where $\beta = 0$ is referred to as superlinear convergence.
Example 1.13 Consider the conjugate gradient algorithm for minimizing $\frac{1}{2} x^T Q x + c^T x$. Starting from an $x^0 \in R^n$ and $d^0 = Qx^0 + c$, the method uses the iterative formula
$$x^{k+1} = x^k - \alpha^k d^k \quad \mbox{where} \quad \alpha^k = \frac{(d^k)^T (Qx^k + c)}{\|d^k\|_Q^2},$$
and
$$d^{k+1} = Qx^{k+1} + c - \theta^k d^k \quad \mbox{where} \quad \theta^k = \frac{(d^k)^T Q (Qx^{k+1} + c)}{\|d^k\|_Q^2}.$$
This algorithm is superlinearly convergent (in fact, it converges in a finite number of steps).
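A minimal sketch of the iteration in Example 1.13, assuming Q is symmetric positive definite and taking the new direction as the gradient Qx + c minus θ times the old direction. The helper names and the 2x2 data are illustrative assumptions.

```python
def matvec(Q, v):
    return [sum(q * x for q, x in zip(row, v)) for row in Q]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(Q, c, x0, tol=1e-12):
    """Minimize (1/2) x^T Q x + c^T x, assuming Q is symmetric positive definite."""
    x = list(x0)
    g = [a + b for a, b in zip(matvec(Q, x), c)]   # gradient Q x + c
    d = list(g)                                     # d^0 = Q x^0 + c
    for k in range(len(c)):
        if dot(g, g) ** 0.5 <= tol:
            return x, k
        Qd = matvec(Q, d)
        dQd = dot(d, Qd)                            # ||d^k||_Q^2
        alpha = dot(d, g) / dQd
        x = [xi - alpha * di for xi, di in zip(x, d)]
        g = [a + b for a, b in zip(matvec(Q, x), c)]
        theta = dot(Qd, g) / dQd                    # makes d^{k+1} Q-conjugate to d^k
        d = [gi - theta * di for gi, di in zip(g, d)]
    return x, len(c)

# Hypothetical data: the minimizer solves Q x = -c, i.e. x = (1/11, 7/11).
x_min, iterations = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [-1.0, -2.0], [0.0, 0.0])
```

In exact arithmetic the loop needs at most n passes, which is the finite-termination property mentioned in the example.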
1.5
Basic Computational Procedures
There are several basic numerical problems frequently solved by interior-point algorithms.
1.5.1
Gaussian elimination method
Probably the best-known algorithm for solving a system of linear equations is the Gaussian elimination method. Suppose we want to solve $Ax = b$. We may assume $a_{11} \ne 0$ after some row switching, where $a_{ij}$ is the component of $A$ in row $i$ and column $j$. Then we can subtract appropriate multiples of the first equation from the other equations so as to have an equivalent system:
$$\begin{pmatrix} a_{11} & A_{1.} \\ 0 & A' \end{pmatrix} \begin{pmatrix} x_1 \\ x' \end{pmatrix} = \begin{pmatrix} b_1 \\ b' \end{pmatrix}.$$
This is a pivot step, where $a_{11}$ is called a pivot, and $A'$ is called a Schur complement. Now, recursively, we solve the system of the last $m - 1$ equations for $x'$. Substituting the solution $x'$ found into the first equation yields a value for $x_1$. The last process is called back-substitution.
In matrix form, the Gaussian elimination method transforms $A$ into the form
$$\begin{pmatrix} U & C \\ 0 & 0 \end{pmatrix}$$
where $U$ is a nonsingular, upper-triangular matrix,
$$A = L \begin{pmatrix} U & C \\ 0 & 0 \end{pmatrix},$$
and $L$ is a nonsingular, lower-triangular matrix. This is called the LU decomposition.
Sometimes, the matrix is transformed further to a form
$$\begin{pmatrix} D & C \\ 0 & 0 \end{pmatrix}$$
where $D$ is a nonsingular, diagonal matrix. This whole procedure uses about $nm^2$ arithmetic operations. Thus, it is a strongly polynomial-time algorithm.
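The pivot and back-substitution steps can be sketched as follows for the square nonsingular case. The row switching here picks the largest entry in the column, a common choice that the text does not prescribe; the function name and data are illustrative.

```python
def gaussian_solve(A, b):
    """Solve A x = b for a square nonsingular A by elimination with row switching."""
    n = len(A)
    M = [list(map(float, row)) + [float(bi)] for row, bi in zip(A, b)]  # augmented matrix
    for k in range(n):
        # Row switching: bring the largest remaining entry of column k to the pivot position.
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        if M[p][k] == 0.0:
            raise ValueError("matrix is singular")
        M[k], M[p] = M[p], M[k]
        # Pivot step: subtract multiples of row k from the rows below it.
        for i in range(k + 1, n):
            factor = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= factor * M[k][j]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

# Hypothetical data with a zero in the (1,1) position, so a row switch is needed.
solution = gaussian_solve([[0, 2, 1], [1, 1, 1], [2, 1, 3]], [7, 6, 13])
```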
1.5.2
Choleski decomposition method
Another useful method is to solve the least squares problem:
$$(LS) \quad \mbox{minimize} \quad \|A^T y - c\|.$$
The theory says that $y^*$ minimizes $\|A^T y - c\|$ if and only if
$$AA^T y^* = Ac.$$
So the problem is reduced to solving a system of linear equations with a symmetric positive semidefinite matrix.
One method is Choleski's decomposition. In matrix form, the method transforms $AA^T$ into the form
$$AA^T = L \Lambda L^T,$$
where $L$ is a lower-triangular matrix and $\Lambda$ is a diagonal matrix. (Such a transformation can be done in about $nm^2$ arithmetic operations as indicated in the preceding section.) $L$ is called the Choleski factor of $AA^T$. Thus, the above linear system becomes
$$L \Lambda L^T y^* = Ac,$$
and $y^*$ can be obtained by solving two triangular systems of linear equations.
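A small sketch of this route to the least squares solution, assuming A has full row rank so that the matrix A A^T is positive definite and no pivoting is needed. The function names and the data are illustrative assumptions.

```python
def ldl_factor(M):
    """Return (L, d) with M = L diag(d) L^T and L unit lower triangular.
    Assumes M is symmetric positive definite; no pivoting is performed."""
    n = len(M)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    d = [0.0] * n
    for j in range(n):
        d[j] = M[j][j] - sum(L[j][k] ** 2 * d[k] for k in range(j))
        for i in range(j + 1, n):
            L[i][j] = (M[i][j] - sum(L[i][k] * L[j][k] * d[k] for k in range(j))) / d[j]
    return L, d

def least_squares(A, c):
    """Minimize ||A^T y - c|| by solving A A^T y = A c (A given as a list of rows)."""
    m = len(A)
    M = [[sum(a * b for a, b in zip(A[i], A[j])) for j in range(m)] for i in range(m)]
    rhs = [sum(a * ci for a, ci in zip(A[i], c)) for i in range(m)]
    L, d = ldl_factor(M)
    w = [0.0] * m
    for i in range(m):                       # forward solve: L w = rhs
        w[i] = rhs[i] - sum(L[i][k] * w[k] for k in range(i))
    y = [0.0] * m
    for i in reversed(range(m)):             # scale by the diagonal, then solve L^T y = w / d
        y[i] = w[i] / d[i] - sum(L[k][i] * y[k] for k in range(i + 1, m))
    return y

# Hypothetical data: here c lies in the range of A^T, so the residual is zero.
y_star = least_squares([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]], [1.0, 2.0, 3.0])
```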
1.5.3
The Newton method
The Newton method is used to solve a system of nonlinear equations: given $f(x) : R^n \to R^n$, the problem is to solve $n$ equations for $n$ unknowns such that $f(x) = 0$. The idea behind Newton's method is to use the Taylor linear approximation at the current iterate $x^k$ and let the approximation be zero:
$$f(x) \simeq f(x^k) + \nabla f(x^k)(x - x^k) = 0.$$
The Newton method is thus defined by the following iterative formula:
$$x^{k+1} = x^k - \alpha (\nabla f(x^k))^{-1} f(x^k),$$
where the scalar $\alpha \ge 0$ is called the step-size. Rarely, however, is the Jacobian matrix $\nabla f$ inverted. Generally the system of linear equations
$$\nabla f(x^k) d_x = -f(x^k)$$
is solved and $x^{k+1} = x^k + \alpha d_x$ is used. The direction vector $d_x$ is called a Newton step, which can be carried out in strongly polynomial time.
A modified or quasi-Newton method is defined by
$$x^{k+1} = x^k - \alpha M^k f(x^k),$$
where $M^k$ is an $n \times n$ symmetric matrix. In particular, if $M^k = I$, the method is called the gradient method, where $f$ is viewed as the gradient vector of a real function.
The Newton method has a superior asymptotic convergence order equal to 2 for $\|f(x^k)\|$. It is frequently used in interior-point algorithms, and believed to be the key to their effectiveness.
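As an illustration, the Newton iteration for two equations in two unknowns can be sketched as below, solving the linear system for the step by Cramer's rule instead of inverting the Jacobian. The function name and the example system (a circle intersected with a line) are assumptions made for this sketch.

```python
def newton_2d(f, jac, x, alpha=1.0, tol=1e-12, max_iter=50):
    """Newton iteration for f(x) = 0 with x in R^2; the step solves J d = -f."""
    for k in range(max_iter):
        f1, f2 = f(x)
        if (f1 * f1 + f2 * f2) ** 0.5 <= tol:
            return x, k
        (a, b), (c, d) = jac(x)
        det = a * d - b * c
        # Solve J d_x = -f by Cramer's rule (valid for this 2x2 case only).
        dx1 = (-f1 * d + b * f2) / det
        dx2 = (-a * f2 + c * f1) / det
        x = (x[0] + alpha * dx1, x[1] + alpha * dx2)
    return x, max_iter

# Hypothetical system: x1^2 + x2^2 = 4 and x1 = x2, with root (sqrt(2), sqrt(2)).
f = lambda x: (x[0] ** 2 + x[1] ** 2 - 4.0, x[0] - x[1])
jac = lambda x: ((2.0 * x[0], 2.0 * x[1]), (1.0, -1.0))
root, iterations = newton_2d(f, jac, (1.0, 2.0))
```

With step-size 1 the residual norm roughly squares from one iterate to the next near the root, which is the order-2 behavior described above.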
1.5.4
Solving ball-constrained linear problem
The ball-constrained linear problem has the following form:
$$(BP) \quad \begin{array}{ll} \mbox{minimize} & c^T x \\ \mbox{subject to} & Ax = 0,\ \|x\|^2 \le 1, \end{array}$$
or
$$(BD) \quad \begin{array}{ll} \mbox{minimize} & b^T y \\ \mbox{subject to} & \|A^T y\|^2 \le 1. \end{array}$$
$x^*$ minimizes (BP) if and only if there exists a $y$ satisfying
$$AA^T y = Ac,$$
such that, if $c - A^T y \ne 0$, then
$$x^* = -(c - A^T y)/\|c - A^T y\|;$$
otherwise any feasible $x$ is a solution. The solution $y^*$ for (BD) is given as follows: solve
$$AA^T \bar{y} = b,$$
and if $\bar{y} \ne 0$ then set
$$y^* = -\bar{y}/\|A^T \bar{y}\|;$$
otherwise any feasible $y$ is a solution. So these two problems can be reduced to solving a system of linear equations.
1.5.5
Solving ball-constrained quadratic problem
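The ball-constrained quadratic problem has the following form:
$$(BP) \quad \begin{array}{ll} \mbox{minimize} & (1/2) x^T Q x + c^T x \\ \mbox{subject to} & Ax = 0,\ \|x\|^2 \le 1, \end{array}$$
or simply
$$(BD) \quad \begin{array}{ll} \mbox{minimize} & (1/2) y^T Q y + b^T y \\ \mbox{subject to} & \|y\|^2 \le 1. \end{array}$$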
This problem is used by the classical trust region method for nonlinear optimization. The optimality conditions for the minimizer $y^*$ of (BD) are
$$(Q + \mu^* I) y^* = -b, \quad \mu^* \ge 0, \quad \|y^*\|^2 \le 1, \quad \mu^* (1 - \|y^*\|^2) = 0,$$
and
$$(Q + \mu^* I) \succeq 0.$$
These conditions are necessary and sufficient. This problem can be solved in polynomial time, with bounds on the order of $\log(1/\epsilon)$ and $\log(\log(1/\epsilon))$, by the bisection method or a hybrid of the bisection and Newton methods, respectively. In practice, several trust region procedures have been very effective in solving this problem.
The ball-constrained quadratic problem will be used as a subproblem by several interior-point algorithms in solving complex optimization problems. We will discuss them later in the book.
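A rough sketch of the bisection idea for (BD), under two simplifying assumptions that are not in the text: Q is diagonal (so the linear system is solved entrywise), and, when Q is not positive definite, b has a nonzero entry at a smallest diagonal element so that the norm of y(μ) blows up as μ approaches its lower limit. The function name is hypothetical.

```python
def ball_qp_diagonal(q, b, eps=1e-10):
    """Minimize (1/2) sum_i q_i y_i^2 + sum_i b_i y_i subject to ||y||^2 <= 1.
    Assumes Q = diag(q) and the 'easy case' described in the lead-in."""
    def y_of(mu):
        return [-bi / (qi + mu) for qi, bi in zip(q, b)]   # (Q + mu I) y = -b

    def norm2(y):
        return sum(v * v for v in y)

    lam_min = min(q)
    if lam_min > 0 and norm2(y_of(0.0)) <= 1.0:
        return y_of(0.0), 0.0                  # unconstrained minimizer is feasible, mu* = 0
    lo = max(0.0, -lam_min)                    # Q + mu I must be positive semidefinite
    hi = lo + 1.0
    while norm2(y_of(hi)) > 1.0:               # enlarge until y(hi) is inside the ball
        hi = lo + 2.0 * (hi - lo)
    while hi - lo > eps:                       # ||y(mu)|| decreases in mu on (lo, hi]
        mid = 0.5 * (lo + hi)
        if norm2(y_of(mid)) > 1.0:
            lo = mid
        else:
            hi = mid
    return y_of(hi), hi

# Hypothetical data: an indefinite Q, so the minimizer lies on the boundary of the ball.
y_b, mu_b = ball_qp_diagonal([-1.0, 2.0], [1.0, 1.0])
# Hypothetical data: a positive definite Q whose unconstrained minimizer is interior.
y_i, mu_i = ball_qp_diagonal([2.0, 3.0], [1.0, 1.0])
```

For a general symmetric Q the same search over μ would need a factorization or eigenvalue information at each trial value; that part is left out here.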
1.6
Notes
The term “complexity” was introduced by Hartmanis and Stearns [155]. Also see Garey and Johnson [118] and Papadimitriou and Steiglitz [248]. The NP theory was due to Cook [70] and Karp [180]. The importance of P was observed by Edmonds [88].
Linear programming and the simplex method were introduced by Dantzig [73]. Other inequality problems and convexity theories can be seen in Gritzmann and Klee [141], Grötschel, Lovász and Schrijver [142], Grünbaum [143], Rockafellar [264], and Schrijver [271]. Various complementarity problems can be found in Cottle, Pang and Stone [72]. The positive semidefinite programming, an optimization problem in nonpolyhedral cones, and its applications can be seen in Nesterov and Nemirovskii [241], Alizadeh [8], and Boyd, Ghaoui, Feron and Balakrishnan [56]. Recently, Goemans and Williamson [125] obtained several breakthrough results on approximation algorithms using positive semidefinite programming. The KKT condition for nonlinear programming was given by Karush, Kuhn and Tucker [195].
It was shown by Klee and Minty [184] that the simplex method is not a polynomial-time algorithm. The ellipsoid method, the first polynomial-time algorithm for linear programming with rational data, was proven by Khachiyan [181]; also see Bland, Goldfarb and Todd [52]. The method was devised independently by Shor [277] and by Nemirovskii and Yudin [239]. The interior-point method, another polynomial-time algorithm for linear programming, was developed by Karmarkar. It is related to the classical barrier-function method studied by Frisch [109] and Fiacco and McCormick [104]; see Gill, Murray, Saunders, Tomlin and Wright [124], and Anstreicher [21]. For a brief LP history, see the excellent article by Wright [323].
The real computation model was developed by Blum, Shub and Smale [53] and Nemirovskii and Yudin [239]. Other complexity issues in numerical optimization were discussed in Vavasis [318].
Many basic numerical procedures listed in this chapter can be found in Golub and Van Loan [133]. The ball-constrained quadratic problem and its solution methods can be seen in Moré [229], Sorensen [281], and Dennis and Schnabel [76]. The complexity result of the ball-constrained quadratic problem was proved by Vavasis [318] and Ye [332].
1.7
Exercises
1.1 Let $Q \in R^{n \times n}$ be a given nonsingular matrix, and $a$ and $b$ be given $R^n$ vectors. Show
$$(Q + ab^T)^{-1} = Q^{-1} - \frac{1}{1 + b^T Q^{-1} a} Q^{-1} a b^T Q^{-1}.$$
This formula is called the Sherman-Morrison-Woodbury formula.
1.2 Prove that the eigenvalues of all matrices $Q \in M^{n \times n}$ are real. Furthermore, show that $Q$ is PSD if and only if all its eigenvalues are nonnegative, and $Q$ is PD if and only if all its eigenvalues are positive.
1.3 Using the ellipsoid representation in Section 1.2.2, find the matrix $Q$ and vector $y$ that describe the following ellipsoids:
1. The 3-dimensional sphere of radius 2 centered at the origin;
2. The 2-dimensional ellipsoid centered at (1; 2) that passes through the points (0; 2), (1; 0), (2; 2), and (1; 4);
3. The 2-dimensional ellipsoid centered at (1; 2) with axes parallel to the lines $y = x$ and $y = -x$, and passing through (−1; 0), (3; 4), (0; 3), and (2; 1).
1.4 Show that the biggest coordinate-aligned ellipsoid that is entirely contained in $R^n_+$ and has its center at $x^a \in \mathring{R}^n_+$ can be written as:
$$E(x^a) = \{x \in R^n : \|(X^a)^{-1}(x - x^a)\| \le 1\}.$$
1.5 Show that the nonnegative orthant, the positive semidefinite cone, and the second-order cone are all self-dual.
1.6 Consider the convex set $C = \{x \in R^2 : (x_1 - 1)^2 + (x_2 - 1)^2 \le 1\}$ and let $y \in R^2$. Assuming $y \notin C$,
1. Find the point in $C$ that is closest to $y$;
2. Find a separating hyperplane vector as a function of $y$.
1.7 Using the idea of Exercise 1.6, prove the separating hyperplane theorem 1.1.
1.8
• Given an $m \times n$ matrix $A$ and a vector $c \in R^n$, consider the function $B(y) = -\sum_{j=1}^n \log s_j$ where $s = c - A^T y > 0$. Find $\nabla B(y)$ and $\nabla^2 B(y)$ in terms of $s$.
• Given $C \in M^n$, $A_i \in M^n$, $i = 1, \cdots, m$, and $b \in R^m$, consider the function $B(y) := -\log \det(S)$, where $S = C - \sum_{i=1}^m y_i A_i \succ 0$. Find $\nabla B(y)$ and $\nabla^2 B(y)$ in terms of $S$.
The best way to do this is to use the definition of the partial derivative
$$\nabla f(y)_i = \lim_{\delta \to 0} \frac{f(y_1, y_2, ..., y_i + \delta, ..., y_m) - f(y_1, y_2, ..., y_i, ..., y_m)}{\delta}.$$
1.9 Prove that the level set of a quasi-convex function is convex.
1.10 Prove Propositions 1.3 and 1.4 for convex functions in Section 1.2.3.
1.11 Let $f_1, \ldots, f_m$ be convex functions. Then, the function $\bar{f}(x)$ defined below is also convex:
• $\max_{i=1,...,m} f_i(x)$
• $\sum_{i=1}^m f_i(x)$
1.12 Prove the Harmonic inequality described in Section 1.2.4.
1.13 Prove Farkas' lemma 1.7 for linear equations.
1.14 Prove the linear least-squares problem always has a solution.
1.15 Let $P = A^T (AA^T)^{-1} A$ or $P = I - A^T (AA^T)^{-1} A$. Then prove
1. $P = P^2$.
2. $P$ is positive semidefinite.
3. The eigenvalues of $P$ are either 0 or 1.
1.16 Using the separating theorem, prove Farkas' lemmas 1.8 and 1.9.
1.17 If a system $A^T y \le c$ of linear inequalities in $m$ variables has no solution, show that $A^T y \le c$ has a subsystem $(A')^T y \le c'$ of at most $m + 1$ inequalities having no solution.
1.18 Prove the LP fundamental theorem 1.14.
1.19 If (LP) and (LD) have a nondegenerate optimal basis $A_B$, prove that the strict complementarity partition in Theorem 1.13 is $P^* = B$.
1.20 If $Q$ is positive semidefinite, prove that $x^*$ is an optimal solution for (QP) if and only if $x^*$ is a KKT point for (QP).
1.21 Prove $X \bullet S \ge 0$ if both $X$ and $S$ are positive semidefinite matrices.
1.22 Prove that two positive semidefinite matrices are complementary to each other, $X \bullet S = 0$, if and only if $XS = 0$.
1.23 Let both (LP) and (LD) for a given data set $(A, b, c)$ have interior feasible points. Then consider the level set
$$\Omega(z) = \{y : c - A^T y \ge 0,\ -z + b^T y \ge 0\}$$
where $z < z^*$ and $z^*$ designates the optimal objective value. Prove that $\Omega(z)$ is bounded and has an interior for any finite $z < z^*$, even if $F_d$ is unbounded.
1.24 Given an (LP) data set $(A, b, c)$ and an interior feasible point $x^0$, find the feasible direction $d_x$ ($Ad_x = 0$) that achieves the steepest decrease in the objective function.
1.25 Given an (LP) data set $(A, b, c)$ and a feasible point $(x^0, y^0, s^0) \in (R^n_+, R^m, R^n_+)$ for the primal and dual, and ignoring the nonnegativity condition, write the systems of linear equations used to calculate the Newton steps for finding points that satisfy the optimality equations (1.2) and (1.3), respectively.
1.26 Show that the optimality conditions for the minimizer $y^*$ of (BD) in Section 1.5.5:
$$(Q + \mu^* I) y^* = -b, \quad \mu^* \ge 0, \quad \|y^*\| \le 1, \quad \mu^* (1 - \|y^*\|) = 0,$$
and
$$(Q + \mu^* I) \succeq 0,$$
are necessary and sufficient.
Chapter 2
Semidefinite Programming
2.0.1
Semidefinite programming (SDP)
Given $C \in M^n$, $A_i \in M^n$, $i = 1, 2, ..., m$, and $b \in R^m$, the semidefinite programming problem is to find a matrix $X \in M^n$ for the optimization problem:
$$(SDP) \quad \begin{array}{ll} \inf & C \bullet X \\ \mbox{subject to} & A_i \bullet X = b_i,\ i = 1, 2, ..., m,\ X \succeq 0. \end{array}$$
Recall that the $\bullet$ operation is the matrix inner product
$$A \bullet B := \mbox{tr} A^T B.$$
The notation $X \succeq 0$ means that $X$ is a positive semidefinite matrix, and $X \succ 0$ means that $X$ is a positive definite matrix. If a point $X \succ 0$ satisfies all equations in (SDP), it is called a (primal) strictly or interior feasible solution.
The dual problem to (SDP) can be written as:
$$(SDD) \quad \begin{array}{ll} \sup & b^T y \\ \mbox{subject to} & \sum_i^m y_i A_i + S = C,\ S \succeq 0, \end{array}$$
which is analogous to the dual (LD) of LP. Here $y \in R^m$ and $S \in M^n$. If a point $(y, S \succ 0)$ satisfies all equations in (SDD), it is called a dual interior feasible solution.
Example 2.1 Let $P(y \in R^m) = -C + \sum_i^m y_i A_i$, where $C$ and $A_i$, $i = 1, \ldots, m$, are given symmetric matrices. The problem of minimizing the max-eigenvalue of $P(y)$ can be cast as an (SDD) problem.
In semidefinite programming, we minimize a linear function of a matrix in the positive semidefinite matrix cone subject to affine constraints. In contrast to the positive orthant cone of linear programming, the positive semidefinite
matrix cone is nonpolyhedral (or “nonlinear”), but convex. So positive semidefinite programs are convex optimization problems. Semidefinite programming unifies several standard problems, such as linear programming, quadratic programming, and convex quadratic minimization with convex quadratic constraints, and finds many applications in engineering, control, and combinatorial optimization.
From Farkas' lemma for linear programming, whenever the system $\{x : Ax = b,\ x \ge 0\}$ is infeasible, a vector $y$ with $A^T y \le 0$ and $b^T y = 1$ always exists; it is called a (primal) infeasibility certificate for the system. But this does not hold for matrix equations in the positive semidefinite matrix cone.
Example 2.2 Consider
$$A_1 = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}, \quad A_2 = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad \mbox{and} \quad b = \begin{pmatrix} 0 \\ 2 \end{pmatrix},$$
where we have the following matrix system
$$A_1 \bullet X = 0, \quad A_2 \bullet X = 2, \quad X \in M^2_+.$$
(The system is infeasible: $X_{11} = 0$ and $X \succeq 0$ force $X_{12} = 0$, contradicting $2X_{12} = 2$.)
The problem is that $\{y : y_i = A_i \bullet X,\ i = 1, 2, ..., m,\ X \succeq 0\}$ is not a closed set. Similarly,
$$C = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \quad \mbox{and} \quad A_1 = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$$
makes $C \succeq A_1 y_1$ infeasible but it does not have an infeasibility certificate.
We have several theorems analogous to Farkas' lemma.
Theorem 2.1 (Farkas' lemma in SDP) Let $A_i \in M^n$, $i = 1, ..., m$, have rank $m$ (i.e., $\sum_i^m y_i A_i = 0$ implies $y = 0$) and $b \in R^m$. Then, there exists a symmetric matrix $X \succ 0$ with
$$A_i \bullet X = b_i, \quad i = 1, ..., m,$$
if and only if $\sum_i^m y_i A_i \preceq 0$ and $\sum_i^m y_i A_i \ne 0$ implies $b^T y < 0$.
Corollary 2.2 (Farkas' lemma in SDP) Let $A_i \in M^n$, $i = 1, ..., m$, have rank $m$ (i.e., $\sum_i^m y_i A_i = 0$ implies $y = 0$) and $C \in M^n$. Then, $\sum_i^m y_i A_i \prec C$ has a solution $y$ if and only if $X \succeq 0$, $X \ne 0$ and
$$A_i \bullet X = 0, \quad i = 1, ..., m,$$
implies $C \bullet X > 0$.
In other words, an $X \succeq 0$, $X \ne 0$, $A_i \bullet X = 0$, $i = 1, ..., m$, and $C \bullet X \le 0$ proves that $\sum_i^m y_i A_i \prec C$ is impossible. Note the difference between the LP and SDP.
The weak duality theorem for SDP is identical to that of (LP) and (LD).
Corollary 2.3 (Weak duality theorem in SDP) Let $F_p$ and $F_d$, the feasible sets for the primal and dual, be nonempty. Then,
$$C \bullet X \ge b^T y \quad \mbox{where} \quad X \in F_p,\ (y, S) \in F_d.$$
But we need more to make the strong duality theorem hold.
Theorem 2.4 (Strong duality theorem in SDP) Let $F_p$ and $F_d$ be nonempty and at least one of them has an interior. Then, $X$ is optimal for (SDP) if and only if the following conditions hold:
i) $X \in F_p$;
ii) there is $(y, S) \in F_d$;
iii) $C \bullet X = b^T y$ or $X \bullet S = 0$.
Again note the difference between the above theorem and the strong duality theorem for LP.
Example 2.3 The following SDP has a duality gap:
$$C = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \quad A_1 = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \quad A_2 = \begin{pmatrix} 0 & -1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 2 \end{pmatrix}$$
and
$$b = \begin{pmatrix} 0 \\ 10 \end{pmatrix}.$$
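The gap in Example 2.3 can be checked by hand. The short derivation below assumes the data are read as follows (this reading of the flattened matrices is an assumption): $C$ has ones in positions (1,2) and (2,1); the first constraint matrix has a one in position (2,2); the second has $-1$ in positions (1,2) and (2,1) and $2$ in position (3,3); and $b = (0; 10)$.

```latex
% Primal: minimize C . X = 2 X_{12}
%         subject to X_{22} = 0,  -2 X_{12} + 2 X_{33} = 10,  X psd.
\begin{align*}
X \succeq 0,\ X_{22} = 0 \;&\Longrightarrow\; X_{12} = 0
  \;\Longrightarrow\; X_{33} = 5, \qquad C \bullet X = 2 X_{12} = 0. \\
% Dual: maximize b^T y = 10 y_2 subject to S = C - y_1 A_1 - y_2 A_2 psd.
S = \begin{pmatrix} 0 & 1 + y_2 & 0 \\ 1 + y_2 & -y_1 & 0 \\ 0 & 0 & -2 y_2 \end{pmatrix} \succeq 0,\ S_{11} = 0
  \;&\Longrightarrow\; 1 + y_2 = 0
  \;\Longrightarrow\; y_2 = -1, \qquad b^T y = 10 y_2 = -10.
\end{align*}
% Every primal feasible point has objective value 0 and every dual feasible
% point has objective value -10, so the duality gap equals 10.
```

Both problems are feasible (take $X_{33} = 5$ with all other entries zero, and $y = (0, -1)$), but neither feasible set contains a positive definite point, which is consistent with the interior assumption in Theorem 2.4 failing.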
Two positive semidefinite matrices are complementary to each other, $X \bullet S = 0$, if and only if $XS = 0$ (Exercise 1.22). From the optimality conditions, the solution set for certain (SDP) and (SDD) is
$$S_p = \{X \in F_p,\ (y, S) \in F_d : C \bullet X - b^T y = 0\},$$
or
$$S_p = \{X \in F_p,\ (y, S) \in F_d : XS = 0\},$$
which is a system of linear matrix inequalities and equations.
In general, we have
Theorem 2.5 (SDP duality theorem) If one of (SDP) or (SDD) has a strictly or interior feasible solution and its optimal value is finite, then the other is feasible and has the same optimal value. If one of (SDP) or (SDD) is unbounded then the other has no feasible solution.
Note that a duality gap may exist if neither (SDP) nor (SDD) has a strictly feasible point. This is in contrast to (LP) and (LD), where no duality gap exists if both are feasible.
Although semidefinite programs are much more general than linear programs, they are not much harder to solve. It has turned out that most interior-point methods for LP have been generalized to semidefinite programs. As in LP, these algorithms possess polynomial worst-case complexity under certain computation models. They also perform well in practice. We will describe such extensions later in this book.
2.1
Analytic Center
2.1.1
AC for polytope
Let $\Omega$ be a bounded polytope in $R^m$ represented by $n$ ($> m$) linear inequalities, i.e.,
$$\Omega = \{y \in R^m : c - A^T y \ge 0\},$$
where $A \in R^{m \times n}$ and $c \in R^n$ are given and $A$ has rank $m$. Denote the interior of $\Omega$ by
$$\mathring{\Omega} = \{y \in R^m : c - A^T y > 0\}.$$
Define
$$d(y) = \prod_{j=1}^n (c_j - a_{.j}^T y), \quad y \in \Omega,$$
where $a_{.j}$ is the $j$th column of $A$. Traditionally, we let $s := c - A^T y$ and call it a slack vector. Thus, the function is the product of all slack variables. Its logarithm is called the (dual) potential function,
$$B(y) := \log d(y) = \sum_{j=1}^n \log(c_j - a_{.j}^T y) = \sum_{j=1}^n \log s_j, \qquad (2.1)$$
and $-B(y)$ is the classical logarithmic barrier function. For convenience, in what follows we may write $B(s)$ to replace $B(y)$ where $s$ is always equal to $c - A^T y$.
Example 2.4 Let $A = (1, -1)$ and $c = (1; 1)$. Then the set $\Omega$ is the interval $[-1, 1]$. Let $A' = (1, -1, -1)$ and $c' = (1; 1; 1)$. Then the set $\Omega'$ is also the interval $[-1, 1]$. Note that
$$d(-1/2) = (3/2)(1/2) = 3/4 \quad \mbox{and} \quad B(-1/2) = \log(3/4),$$
and
$$d'(-1/2) = (3/2)(1/2)(1/2) = 3/8 \quad \mbox{and} \quad B'(-1/2) = \log(3/8).$$
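The numbers in Example 2.4 can be reproduced with a few lines. This sketch handles only the one-row case (so y is a scalar), and the function names are illustrative.

```python
import math

def slacks(a_row, c, y):
    """s = c - A^T y when A has a single row a_row, so y is a scalar."""
    return [cj - aj * y for aj, cj in zip(a_row, c)]

def d_value(a_row, c, y):
    """Product of the slack variables."""
    return math.prod(slacks(a_row, c, y))

def potential(a_row, c, y):
    """B(y) = sum_j log s_j, the logarithm of d(y)."""
    return sum(math.log(sj) for sj in slacks(a_row, c, y))

# Data of Example 2.4: the same interval described with two and with three inequalities.
d_two = d_value([1.0, -1.0], [1.0, 1.0], -0.5)
d_three = d_value([1.0, -1.0, -1.0], [1.0, 1.0, 1.0], -0.5)
B_two = potential([1.0, -1.0], [1.0, 1.0], -0.5)
B_three = potential([1.0, -1.0, -1.0], [1.0, 1.0, 1.0], -0.5)
```

The set is the same in both descriptions, yet the potential values differ, because the potential depends on the inequalities used to represent the set.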
The interior point, denoted by $y^a$ and $s^a = c - A^T y^a$, in $\Omega$ that maximizes the potential function is called the analytic center of $\Omega$, i.e.,
$$B(\Omega) := B(y^a, \Omega) = \max_{y \in \Omega} \log d(y, \Omega).$$
$(y^a, s^a)$ is uniquely defined, since the potential function is strictly concave in a bounded convex $\mathring{\Omega}$. Setting $\nabla B(y, \Omega) = 0$ and letting $x^a = (S^a)^{-1} e$, the analytic center $(y^a, s^a)$ together with $x^a$ satisfy the following optimality conditions:
$$\begin{array}{rcl} Xs & = & e \\ Ax & = & 0 \\ -A^T y - s & = & -c. \end{array} \qquad (2.2)$$
Note that adding or deleting a redundant inequality changes the location of the analytic center.
Example 2.5 Consider $\Omega = \{y \in R : -y \le 0,\ y \le 1\}$, which is the interval $[0, 1]$. The analytic center is $y^a = 1/2$ with $x^a = (2, 2)^T$.
Consider
$$\Omega' = \{y \in R : \overbrace{-y \le 0, \cdots, -y \le 0}^{n \mbox{ times}},\ y \le 1\},$$
which is, again, the interval $[0, 1]$ but “$-y \le 0$” is copied $n$ times. The analytic center for this system is $y^a = n/(n + 1)$ with $x^a = ((n + 1)/n, \cdots, (n + 1)/n, (n + 1))^T$.
The analytic center can be defined when the interior is empty or equalities are present, such as
$$\Omega = \{y \in R^m : c - A^T y \ge 0,\ By = b\}.$$
Then the analytic center is chosen on the hyperplane $\{y : By = b\}$ to maximize the product of the slack variables $s = c - A^T y$. Thus, the interior of $\Omega$ is not used in the sense that the topological interior for a set is used. Rather, it refers to the interior of the positive orthant of slack variables: $R^n_+ := \{s : s \ge 0\}$. When we say $\Omega$ has an interior, we mean that
$$\mathring{R}^n_+ \cap \{s : s = c - A^T y \mbox{ for some } y \mbox{ where } By = b\} \ne \emptyset.$$
Again $\mathring{R}^n_+ := \{s \in R^n_+ : s > 0\}$, i.e., the interior of the orthant $R^n_+$. Thus, if $\Omega$ has only a single point $y$ with $s = c - A^T y > 0$, we still say that the interior of $\Omega$ is not empty.
Example 2.6 Consider the system $\Omega = \{x : Ax = 0,\ e^T x = n,\ x \ge 0\}$, which is called Karmarkar's canonical set. If $x = e$ is in $\Omega$ then $e$ is the analytic center of $\Omega$, the intersection of the simplex $\{x : e^T x = n,\ x \ge 0\}$ and the hyperplane $\{x : Ax = 0\}$ (Figure 2.1).
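The effect of duplicated inequalities in Example 2.5 can be checked numerically. The sketch below maximizes n log y + log(1 − y) by bisection on its derivative; the function name is hypothetical and the choice of bisection is only for simplicity.

```python
def analytic_center_interval(n_copies, tol=1e-12):
    """Analytic center of {y : -y <= 0 (repeated n_copies times), y <= 1}."""
    def grad(y):
        # Derivative of n_copies * log(y) + log(1 - y); it is decreasing on (0, 1).
        return n_copies / y - 1.0 / (1.0 - y)

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if grad(mid) > 0:
            lo = mid
        else:
            hi = mid
    y = 0.5 * (lo + hi)
    s = [y] * n_copies + [1.0 - y]       # slack of each inequality
    x = [1.0 / sj for sj in s]           # x^a = (S^a)^{-1} e
    return y, x

y_one, x_one = analytic_center_interval(1)
y_three, x_three = analytic_center_interval(3)
```

With three copies the center moves from 1/2 to 3/4, away from the repeated inequality, and the computed x satisfies Ax = 0 for the coefficient row (−1, −1, −1, 1) as in (2.2).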
2.1.2
AC for SDP
Let $\Omega$ be a bounded convex set in $R^m$ represented by a matrix inequality, i.e.,
$$\Omega = \{y \in R^m : C - \sum_i^m y_i A_i \succeq 0\}.$$
Let $S = C - \sum_i^m y_i A_i$ and
$$B(y) := \log \det(S) = \log \det(C - \sum_i^m y_i A_i). \qquad (2.3)$$
The interior point, denoted by $y^a$ and $S^a = C - \sum_i^m y_i^a A_i$, in $\Omega$ that maximizes the potential function is called the analytic center of $\Omega$, i.e.,
$$\max_{y \in \Omega} B(y).$$
Figure 2.1: Illustration of the Karmarkar (simplex) polytope and its analytic center. (Axes $x_1$, $x_2$, $x_3$; simplex vertices (3,0,0), (0,3,0), (0,0,3); the hyperplane $Ax = 0$ passes through the center (1,1,1).)
$(y^a, S^a)$ is uniquely defined, since the potential function is strictly concave in a bounded convex $\mathring{\Omega}$. Setting $\nabla B(y, \Omega) = 0$ and letting $X^a = (S^a)^{-1}$, the analytic center $(y^a, S^a)$ together with $X^a$ satisfy the following optimality conditions:
$$\begin{array}{rcl} XS & = & I \\ AX & = & 0 \\ -A^T y - S & = & -C. \end{array} \qquad (2.4)$$
2.2
Potential Functions for LP and SDP
We show how potential functions can be defined to solve linear programming and semidefinite programming problems.
We assume that for a given LP data set $(A, b, c)$, both the primal and dual have interior feasible points. We also let $z^*$ be the optimal value of the standard form (LP) and (LD). Denote the feasible sets of (LP) and (LD) by $F_p$ and $F_d$, respectively. Denote $F = F_p \times F_d$, and the interior of $F$ by $\mathring{F}$.
2.2.1
Primal potential function for LP
Consider the level set
$$\Omega(z) = \{y \in R^m : c - A^T y \ge 0,\ -z + b^T y \ge 0\}, \qquad (2.5)$$
where $z < z^*$. Since both (LP) and (LD) have interior feasible points for given $(A, b, c)$, $\Omega(z)$ is bounded and has an interior for any finite $z$, even if $\Omega := F_d$ is unbounded (Exercise 1.23). Clearly, $\Omega(z) \subset \Omega$, and if $z_2 \ge z_1$, then $\Omega(z_2) \subset \Omega(z_1)$ and the inequality $-z + b^T y \ge 0$ is translated from $z = z_1$ to $z = z_2$.
From the duality theorem again, finding a point in $\Omega(z)$ has a homogeneous primal problem
$$\begin{array}{ll} \mbox{minimize} & c^T x' - z x'_0 \\ \mbox{s.t.} & Ax' - b x'_0 = 0,\ (x', x'_0) \ge 0. \end{array}$$
For $(x', x'_0)$ satisfying
$$Ax' - b x'_0 = 0, \quad (x', x'_0) > 0,$$
let $x := x'/x'_0 \in \mathring{F}_p$, i.e.,
$$Ax = b, \quad x > 0.$$
Then, the primal potential function for $\Omega(z)$ (Figure 2.2), as described in the preceding section, is
$$P(x', \Omega(z)) = (n + 1) \log(c^T x' - z x'_0) - \sum_{j=0}^n \log x'_j = (n + 1) \log(c^T x - z) - \sum_{j=1}^n \log x_j =: P_{n+1}(x, z).$$
The latter, $P_{n+1}(x, z)$, is the Karmarkar potential function in the standard LP form with a lower bound $z$ for $z^*$.
Figure 2.2: Intersections of a dual feasible region and the objective hyperplane; $b^T y \ge z$ on the left and $b^T y \ge b^T y^a$ on the right. (Left panel: the objective hyperplane $b^T y = z$ and the analytic center $y^a$; right panel: the updated objective hyperplane $b^T y = b^T y^a$.)
One algorithm for solving (LD) is suggested in Figure 2.2. If the objective hyperplane is repeatedly translated to the analytic center, the sequence of new
analytic centers will converge to an optimal solution and the potentials of the new polytopes will decrease to $-\infty$.
As we illustrated before, one can represent $\Omega(z)$ differently:
$$\Omega(z) = \{y : c - A^T y \ge 0,\ \overbrace{-z + b^T y \ge 0, \cdots, -z + b^T y \ge 0}^{\rho \mbox{ times}}\}, \qquad (2.6)$$
i.e., “$-z + b^T y \ge 0$” is copied $\rho$ times. Geometrically, this representation does not change $\Omega(z)$, but it changes the location of its analytic center. Since the last $\rho$ inequalities in $\Omega(z)$ are identical, they must share the same slack value and the same corresponding primal variable. Let $(x', x'_0)$ be the primal variables. Then the primal problem can be written as
$$\begin{array}{ll} \mbox{minimize} & c^T x' \overbrace{- z x'_0 - \cdots - z x'_0}^{\rho \mbox{ times}} \\ \mbox{s.t.} & Ax' \overbrace{- b x'_0 - \cdots - b x'_0}^{\rho \mbox{ times}} = 0,\ (x', x'_0) \ge 0. \end{array}$$
Let $x = x'/(\rho x'_0) \in \mathring{F}_p$. Then, the primal potential function for the new $\Omega(z)$ given by (2.6) is
$$\begin{array}{rcl} P(x, \Omega(z)) & = & (n + \rho) \log(c^T x' - z(\rho x'_0)) - \sum_{j=1}^n \log x'_j - \rho \log x'_0 \\ & = & (n + \rho) \log(c^T x - z) - \sum_{j=1}^n \log x_j + \rho \log \rho \\ & =: & P_{n+\rho}(x, z) + \rho \log \rho. \end{array}$$
The function
$$P_{n+\rho}(x, z) = (n + \rho) \log(c^T x - z) - \sum_{j=1}^n \log x_j \qquad (2.7)$$
is an extension of the Karmarkar potential function in the standard LP form with a lower bound $z$ for $z^*$. It represents the volume of a coordinate-aligned ellipsoid whose intersection with $A_{\Omega(z)}$ contains $S_{\Omega(z)}$, where “$-z + b^T y \ge 0$” is duplicated $\rho$ times.
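The identity relating the homogeneous potential to $P_{n+\rho}(x, z) + \rho \log \rho$ is purely algebraic, so it can be sanity-checked on arbitrary positive numbers. The function names and the data below are illustrative assumptions; feasibility (Ax = b) plays no role in the identity and is not checked.

```python
import math

def primal_potential(c, z, x, rho):
    """P_{n+rho}(x, z) = (n + rho) log(c^T x - z) - sum_j log x_j."""
    n = len(x)
    gap = sum(cj * xj for cj, xj in zip(c, x)) - z
    return (n + rho) * math.log(gap) - sum(math.log(xj) for xj in x)

def homogeneous_potential(c, z, x_prime, x0_prime, rho):
    """(n + rho) log(c^T x' - z rho x'_0) - sum_j log x'_j - rho log x'_0."""
    n = len(x_prime)
    gap = sum(cj * xj for cj, xj in zip(c, x_prime)) - z * rho * x0_prime
    return ((n + rho) * math.log(gap)
            - sum(math.log(xj) for xj in x_prime)
            - rho * math.log(x0_prime))

# Hypothetical positive data with c^T x > z.
c, z, rho = [1.0, 2.0, 3.0], 0.5, 2.0
x = [1.0, 0.5, 2.0]
x0_prime = 0.3
x_prime = [rho * x0_prime * xj for xj in x]      # so that x = x' / (rho x'_0)

lhs = homogeneous_potential(c, z, x_prime, x0_prime, rho)
rhs = primal_potential(c, z, x, rho) + rho * math.log(rho)
```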
2.2.2 Dual potential function for LP

We can also develop a dual potential function, symmetric to the primal, for $(y, s) \in \mathring{\mathcal F}_d$:
\[
\mathcal B_{n+\rho}(y, s, z) = (n+\rho)\log(z - b^T y) - \sum_{j=1}^{n} \log s_j, \tag{2.8}
\]
where $z$ is an upper bound of $z^*$. One can show that it represents the volume of a coordinate-aligned ellipsoid whose intersection with the affine set $\{x : Ax = b\}$ contains the primal level set
\[
\{x \in \mathcal F_p : \overbrace{c^T x - z \le 0, \cdots, c^T x - z \le 0}^{\rho \text{ times}}\},
\]
where “$c^T x - z \le 0$” is copied $\rho$ times (Exercise 2.1). For symmetry, we may write $\mathcal B_{n+\rho}(y, s, z)$ simply as $\mathcal B_{n+\rho}(s, z)$, since we can always recover $y$ from $s$ using the equation $A^T y = c - s$.
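2.2.3 Primal-dual potential function for LP

A primal-dual potential function for linear programming will be used later. For $x \in \mathring{\mathcal F}_p$ and $(y, s) \in \mathring{\mathcal F}_d$ it is defined by
\[
\psi_{n+\rho}(x, s) := (n+\rho)\log(x^T s) - \sum_{j=1}^{n} \log(x_j s_j), \tag{2.9}
\]
where $\rho \ge 0$. We have the relation:
\[
\begin{aligned}
\psi_{n+\rho}(x, s) &= (n+\rho)\log(c^T x - b^T y) - \sum_{j=1}^{n} \log x_j - \sum_{j=1}^{n} \log s_j \\
&= \mathcal P_{n+\rho}(x, b^T y) - \sum_{j=1}^{n} \log s_j \\
&= \mathcal B_{n+\rho}(s, c^T x) - \sum_{j=1}^{n} \log x_j.
\end{aligned}
\]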
Since
\[
\psi_{n+\rho}(x, s) = \rho \log(x^T s) + \psi_n(x, s) \ge \rho \log(x^T s) + n \log n,
\]
then, for $\rho > 0$, $\psi_{n+\rho}(x, s) \to -\infty$ implies that $x^T s \to 0$. More precisely, we have
\[
x^T s \le \exp\left(\frac{\psi_{n+\rho}(x, s) - n \log n}{\rho}\right).
\]
We have the following theorem:

Theorem 2.6 Define the level set
\[
\Psi(\delta) := \{(x, y, s) \in \mathring{\mathcal F} : \psi_{n+\rho}(x, s) \le \delta\}.
\]
i) $\Psi(\delta^1) \subset \Psi(\delta^2)$ if $\delta^1 \le \delta^2$.

ii) $\mathring{\Psi}(\delta) = \{(x, y, s) \in \mathcal F : \psi_{n+\rho}(x, s) < \delta\}$.

iii) For every $\delta$, $\Psi(\delta)$ is bounded and its closure $\hat\Psi(\delta)$ has nonempty intersection with the solution set.

Later we will show that a potential reduction algorithm generates sequences $\{x^k, y^k, s^k\} \in \mathring{\mathcal F}$ such that
\[
\psi_{n+\sqrt n}(x^{k+1}, y^{k+1}, s^{k+1}) \le \psi_{n+\sqrt n}(x^k, y^k, s^k) - .05
\]
for $k = 0, 1, 2, \ldots$. This indicates that the level sets shrink at least at a constant rate, independently of $m$ or $n$.
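The gap bound above can be checked numerically. The following is a minimal sketch, not part of the original notes: the function names `psi` and `gap_bound` and the sample vectors are illustrative assumptions, and any strictly positive pair would do.

```python
import math

def psi(x, s, rho):
    """Primal-dual potential (2.9): (n + rho) log(x^T s) - sum_j log(x_j s_j)."""
    n = len(x)
    gap = sum(xj * sj for xj, sj in zip(x, s))
    return (n + rho) * math.log(gap) - sum(math.log(xj * sj) for xj, sj in zip(x, s))

def gap_bound(x, s, rho):
    """Right-hand side of x^T s <= exp((psi_{n+rho}(x, s) - n log n) / rho)."""
    n = len(x)
    return math.exp((psi(x, s, rho) - n * math.log(n)) / rho)

# Hypothetical sample data: an arbitrary positive pair and a "centered" pair
# whose products x_j * s_j are all equal (here all equal to 2).
x, s = [0.5, 1.0, 2.0], [1.0, 0.2, 0.3]
x_c, s_c = [1.0, 2.0, 4.0], [2.0, 1.0, 0.5]
rho = math.sqrt(len(x))

gap = sum(xj * sj for xj, sj in zip(x, s))
print("gap:", gap, "bound:", gap_bound(x, s, rho))
print("psi_n at centered pair:", psi(x_c, s_c, 0.0), "n log n:", 3 * math.log(3))
```

For the centered pair the bound is tight, which matches the fact that $\psi_n(x, s) = n \log n$ exactly when all products $x_j s_j$ are equal.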
2.2.4 Potential function for SDP

The potential functions for SDP of Section 2.0.1 are analogous to those for LP. For given data, we assume that both (SDP) and (SDD) have interior feasible points. Then, for any $X \in \mathring{\mathcal F}_p$ and $(y, S) \in \mathring{\mathcal F}_d$, the primal potential function is defined by
\[
\mathcal P_{n+\rho}(X, z) := (n+\rho)\log(C \bullet X - z) - \log\det(X), \quad z \le z^*;
\]
the dual potential function is defined by
\[
\mathcal B_{n+\rho}(y, S, z) := (n+\rho)\log(z - b^T y) - \log\det(S), \quad z \ge z^*,
\]
where $\rho \ge 0$ and $z^*$ designates the optimal objective value.

For $X \in \mathring{\mathcal F}_p$ and $(y, S) \in \mathring{\mathcal F}_d$ the primal-dual potential function for SDP is defined by
\[
\begin{aligned}
\psi_{n+\rho}(X, S) &:= (n+\rho)\log(X \bullet S) - \log(\det(X) \cdot \det(S)) \\
&= (n+\rho)\log(C \bullet X - b^T y) - \log\det(X) - \log\det(S) \\
&= \mathcal P_{n+\rho}(X, b^T y) - \log\det(S) \\
&= \mathcal B_{n+\rho}(S, C \bullet X) - \log\det(X),
\end{aligned}
\]
where $\rho \ge 0$. Note that if $X$ and $S$ are diagonal matrices, these definitions reduce to those for LP. Note that we still have (Exercise 2.2)
\[
\psi_{n+\rho}(X, S) = \rho \log(X \bullet S) + \psi_n(X, S) \ge \rho \log(X \bullet S) + n \log n.
\]
Then, for $\rho > 0$, $\psi_{n+\rho}(X, S) \to -\infty$ implies that $X \bullet S \to 0$. More precisely, we have
\[
X \bullet S \le \exp\left(\frac{\psi_{n+\rho}(X, S) - n \log n}{\rho}\right).
\]
We also have the following corollary:
Corollary 2.7 Let (SDP) and (SDD) have nonempty interior and define the level set
\[
\Psi(\delta) := \{(X, y, S) \in \mathring{\mathcal F} : \psi_{n+\rho}(X, S) \le \delta\}.
\]
i) $\Psi(\delta^1) \subset \Psi(\delta^2)$ if $\delta^1 \le \delta^2$.

ii) $\mathring{\Psi}(\delta) = \{(X, y, S) \in \mathcal F : \psi_{n+\rho}(X, S) < \delta\}$.

iii) For every $\delta$, $\Psi(\delta)$ is bounded and its closure $\hat\Psi(\delta)$ has nonempty intersection with the solution set.
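The SDP potential and its lower bound $\psi_n(X, S) \ge n \log n$ can be illustrated numerically. This is a sketch under stated assumptions: the helper name `psi_sdp` and the randomly generated positive definite matrices are illustrative and not from the text, and NumPy is assumed to be available.

```python
import numpy as np

def psi_sdp(X, S, rho):
    """(n + rho) log(X • S) - log det X - log det S for symmetric positive definite X, S."""
    n = X.shape[0]
    gap = np.sum(X * S)                      # X • S = tr(XS) for symmetric matrices
    return (n + rho) * np.log(gap) - np.linalg.slogdet(X)[1] - np.linalg.slogdet(S)[1]

# Hypothetical test data: random positive definite matrices.
rng = np.random.default_rng(0)
n = 4
G, H = rng.standard_normal((n, n)), rng.standard_normal((n, n))
X = G @ G.T + np.eye(n)
S = H @ H.T + np.eye(n)
rho = np.sqrt(n)

gap = np.sum(X * S)
bound = np.exp((psi_sdp(X, S, rho) - n * np.log(n)) / rho)
S_central = 0.7 * np.linalg.inv(X)           # XS = mu * I with mu = 0.7

print("X • S =", gap, "<= bound =", bound)
print("psi_n at XS = mu I:", psi_sdp(X, S_central, 0.0), "n log n:", n * np.log(n))
```

The second printout shows the lower bound being attained when $XS = \mu I$.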
2.3 Central Paths of LP and SDP

Many interior-point algorithms find a sequence of feasible points along a “central” path that connects the analytic center and the solution set. We now present this path, one of the most important foundations for the development of interior-point algorithms.

2.3.1 Central path for LP

Consider a linear program in the standard form (LP) and (LD). Assume that $\mathring{\mathcal F} \ne \emptyset$, i.e., both $\mathring{\mathcal F}_p \ne \emptyset$ and $\mathring{\mathcal F}_d \ne \emptyset$, and denote by $z^*$ the optimal objective value. The central path can be expressed as
\[
\mathcal C = \left\{(x, y, s) \in \mathring{\mathcal F} : Xs = \frac{x^T s}{n} e\right\}
\]
in the primal-dual form. We also see
\[
\mathcal C = \left\{(x, y, s) \in \mathring{\mathcal F} : \psi_n(x, s) = n \log n\right\}.
\]
For any $\mu > 0$ one can derive the central path simply by minimizing the primal LP with a logarithmic barrier function:
\[
(P) \quad \begin{array}{ll}
\text{minimize} & c^T x - \mu \sum_{j=1}^{n} \log x_j \\
\text{s.t.} & Ax = b,\ x \ge 0.
\end{array}
\]
Let $x(\mu) \in \mathring{\mathcal F}_p$ be the (unique) minimizer of (P). Then, for some $y \in R^m$ it satisfies the optimality conditions
\[
\begin{aligned}
Xs &= \mu e \\
Ax &= b \\
-A^T y - s &= -c.
\end{aligned} \tag{2.10}
\]
Consider maximizing the dual LP with the barrier function:
\[
(D) \quad \begin{array}{ll}
\text{maximize} & b^T y + \mu \sum_{j=1}^{n} \log s_j \\
\text{s.t.} & A^T y + s = c,\ s \ge 0.
\end{array}
\]
Let $(y(\mu), s(\mu)) \in \mathring{\mathcal F}_d$ be the (unique) maximizer of (D). Then, for some $x \in R^n$ it satisfies the optimality conditions (2.10) as well. Thus, both $x(\mu)$ and $(y(\mu), s(\mu))$ are on the central path with $x(\mu)^T s(\mu) = n\mu$.

Another way to derive the central path is to consider again the dual level set $\Omega(z)$ of (2.5) for any $z < z^*$ (Figure 2.3). Then, the analytic center $(y(z), s(z))$ of $\Omega(z)$ and a unique point $(x'(z), x''(z))$ satisfy
\[
Ax'(z) - bx''(z) = 0, \quad X'(z)s = e, \quad s = c - A^T y, \quad \text{and} \quad x''(z)(b^T y - z) = 1.
\]
Let $x(z) = x'(z)/x''(z)$; then we have
\[
Ax(z) = b, \quad X(z)s(z) = e/x''(z) = (b^T y(z) - z)e.
\]
Thus, the point $(x(z), y(z), s(z))$ is on the central path with $\mu = b^T y(z) - z$ and $c^T x(z) - b^T y(z) = x(z)^T s(z) = n(b^T y(z) - z) = n\mu$. As we proved earlier in Section 2.2, $(x(z), y(z), s(z))$ exists and is uniquely defined, which implies the following theorem:

Theorem 2.8 Let both (LP) and (LD) have interior feasible points for the given data set $(A, b, c)$. Then for any $0 < \mu < \infty$, the central path point $(x(\mu), y(\mu), s(\mu))$ exists and is unique.
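Central path points can be computed numerically by applying a damped Newton iteration to the conditions (2.10). The sketch below is one possible way to do this, not the method of the notes: the function name `central_path_point`, the fraction-to-boundary factor 0.9, the tolerances, and the tiny LP instance are all assumptions made for illustration, and NumPy is assumed to be available.

```python
import numpy as np

def central_path_point(A, b, c, x, y, s, mu, tol=1e-12, max_iter=200):
    """Damped Newton iteration on (2.10), started from a strictly feasible (x, y, s)."""
    for _ in range(max_iter):
        r = mu - x * s                          # residual of Xs = mu e
        if np.linalg.norm(r) <= tol:
            break
        # Newton system: S dx + X ds = r, A dx = 0, A^T dy + ds = 0.
        d = x / s
        dy = np.linalg.solve((A * d) @ A.T, -A @ (r / s))
        ds = -A.T @ dy
        dx = (r - x * ds) / s
        # Fraction-to-boundary rule keeps x and s strictly positive.
        step = 1.0
        for v, dv in ((x, dx), (s, ds)):
            neg = dv < 0
            if neg.any():
                step = min(step, 0.9 * np.min(-v[neg] / dv[neg]))
        x, y, s = x + step * dx, y + step * dy, s + step * ds
    return x, y, s

# Hypothetical instance: minimize x1 + 2 x2 + 3 x3 s.t. x1 + x2 + x3 = 1, x >= 0.
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([1.0])
c = np.array([1.0, 2.0, 3.0])
x, y, s = np.full(3, 1.0 / 3.0), np.zeros(1), c.copy()   # strictly feasible start

path = []
for mu in (1.0, 0.1, 0.01):                     # warm start along decreasing mu
    x, y, s = central_path_point(A, b, c, x, y, s, mu)
    path.append((mu, x.copy(), y.copy(), s.copy()))
    print(f"mu={mu}: c^T x={c @ x:.6f}, b^T y={b @ y:.6f}, x^T s={x @ s:.6f}")
```

On this instance the printed primal objective values decrease and the dual objective values increase as $\mu$ decreases, and each duality gap equals $n\mu$.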
[Figure 2.3: The central path of $y(z)$ in a dual feasible region. Labels in the figure: the analytic center $y^a$ and the objective hyperplanes.]

The following theorem further characterizes the central path and uses it to solve linear programs.
Theorem 2.9 Let $(x(\mu), y(\mu), s(\mu))$ be on the central path.

i) The central path point $(x(\mu), s(\mu))$ is bounded for $0 < \mu \le \mu^0$ and any given $0 < \mu^0 < \infty$.

ii) For $0 < \mu' < \mu$,
\[
c^T x(\mu') < c^T x(\mu) \quad \text{and} \quad b^T y(\mu') > b^T y(\mu).
\]
iii) $(x(\mu), s(\mu))$ converges to an optimal solution pair for (LP) and (LD). Moreover, the limit point $x(0)_{P^*}$ is the analytic center on the primal optimal face, and the limit point $s(0)_{Z^*}$ is the analytic center on the dual optimal face, where $(P^*, Z^*)$ is the strict complementarity partition of the index set $\{1, 2, \ldots, n\}$.

Proof. Note that $(x(\mu^0) - x(\mu))^T (s(\mu^0) - s(\mu)) = 0$, since $(x(\mu^0) - x(\mu)) \in \mathcal N(A)$ and $(s(\mu^0) - s(\mu)) \in \mathcal R(A^T)$. This can be rewritten as
\[
\sum_{j}^{n} \left( s(\mu^0)_j x(\mu)_j + x(\mu^0)_j s(\mu)_j \right) = n(\mu^0 + \mu) \le 2n\mu^0,
\]
or
\[
\sum_{j}^{n} \left( \frac{x(\mu)_j}{x(\mu^0)_j} + \frac{s(\mu)_j}{s(\mu^0)_j} \right) \le 2n.
\]
Thus, $x(\mu)$ and $s(\mu)$ are bounded, which proves (i).

We leave the proof of (ii) as an exercise.

Since $x(\mu)$ and $s(\mu)$ are both bounded, they have at least one limit point, which we denote by $x(0)$ and $s(0)$. Let $x^*_{P^*}$ ($x^*_{Z^*} = 0$) and $s^*_{Z^*}$ ($s^*_{P^*} = 0$), respectively, be the unique analytic centers on the primal and dual optimal faces: $\{x_{P^*} : A_{P^*} x_{P^*} = b,\ x_{P^*} \ge 0\}$ and $\{s_{Z^*} : s_{Z^*} = c_{Z^*} - A_{Z^*}^T y \ge 0,\ c_{P^*} - A_{P^*}^T y = 0\}$. Again, we have
\[
\sum_{j}^{n} \left( s^*_j x(\mu)_j + x^*_j s(\mu)_j \right) = n\mu,
\]
or
\[
\sum_{j \in P^*} \left( \frac{x^*_j}{x(\mu)_j} \right) + \sum_{j \in Z^*} \left( \frac{s^*_j}{s(\mu)_j} \right) = n.
\]
Thus, we have
\[
x(\mu)_j \ge x^*_j / n > 0,\ j \in P^* \quad \text{and} \quad s(\mu)_j \ge s^*_j / n > 0,\ j \in Z^*.
\]
This implies that
\[
x(\mu)_j \to 0,\ j \in Z^* \quad \text{and} \quad s(\mu)_j \to 0,\ j \in P^*.
\]
Furthermore, by the arithmetic-geometric mean inequality applied to the $n$ terms of the sum above,
\[
\prod_{j \in P^*} \frac{x^*_j}{x(\mu)_j} \prod_{j \in Z^*} \frac{s^*_j}{s(\mu)_j} \le 1,
\]
or
\[
\prod_{j \in P^*} x^*_j \prod_{j \in Z^*} s^*_j \le \prod_{j \in P^*} x(\mu)_j \prod_{j \in Z^*} s(\mu)_j.
\]
However, $(\prod_{j \in P^*} x^*_j)(\prod_{j \in Z^*} s^*_j)$ is the maximal value of the potential function over all interior point pairs on the optimal faces, and $x(0)_{P^*}$ and $s(0)_{Z^*}$ is one interior point pair on the optimal face. Thus, we must have
\[
\prod_{j \in P^*} x^*_j \prod_{j \in Z^*} s^*_j = \prod_{j \in P^*} x(0)_j \prod_{j \in Z^*} s(0)_j.
\]
Therefore,
\[
x(0)_{P^*} = x^*_{P^*} \quad \text{and} \quad s(0)_{Z^*} = s^*_{Z^*},
\]
since $x^*_{P^*}$ and $s^*_{Z^*}$ are the unique maximizer pair of the potential function. This also implies that the limit point of the central path is unique.

We usually define a neighborhood of the central path as
\[
\mathcal N(\eta) = \left\{ (x, y, s) \in \mathring{\mathcal F} : \|Xs - \mu e\| \le \eta\mu \ \text{and} \ \mu = \frac{x^T s}{n} \right\},
\]
where $\|\cdot\|$ can be any norm, or even a one-sided “norm” such as
\[
\|x\|_{-\infty} = |\min(0, \min(x))|.
\]
We have the following theorem:
Theorem 2.10 Let $(x, y, s) \in \mathcal N(\eta)$ for constant $0 < \eta < 1$.

i) The set $\mathcal N(\eta) \cap \{(x, y, s) : x^T s \le n\mu^0\}$ is bounded for any given $\mu^0 < \infty$.

ii) Any limit point of $\mathcal N(\eta)$ as $\mu \to 0$ is an optimal solution pair for (LP) and (LD). Moreover, for any $j \in P^*$
\[
x_j \ge \frac{(1 - \eta) x^*_j}{n},
\]
where $x^*$ is any optimal primal solution; for any $j \in Z^*$
\[
s_j \ge \frac{(1 - \eta) s^*_j}{n},
\]
where $s^*$ is any optimal dual solution.
2.3.2 Central path for SDP

Consider an SDP problem as in Section 2.0.1 and assume that $\mathring{\mathcal F} \ne \emptyset$, i.e., both $\mathring{\mathcal F}_p \ne \emptyset$ and $\mathring{\mathcal F}_d \ne \emptyset$. The central path can be expressed as
\[
\mathcal C = \left\{ (X, y, S) \in \mathring{\mathcal F} : XS = \mu I,\ 0 < \mu < \infty \right\},
\]
or in a symmetric form
\[
\mathcal C = \left\{ (X, y, S) \in \mathring{\mathcal F} : X^{.5} S X^{.5} = \mu I,\ 0 < \mu < \infty \right\},
\]
where $X^{.5} \in \mathcal M^n_+$ is the “square root” matrix of $X \in \mathcal M^n_+$, i.e., $X^{.5} X^{.5} = X$. We also see
\[
\mathcal C = \left\{ (X, y, S) \in \mathring{\mathcal F} : \psi_n(X, S) = n \log n \right\}.
\]
When $X$ and $S$ are diagonal matrices, this definition is identical to that for LP. We also have the following corollary:

Corollary 2.11 Let both (SDP) and (SDD) have interior feasible points. Then for any $0 < \mu < \infty$, the central path point $(X(\mu), y(\mu), S(\mu))$ exists and is unique. Moreover,

i) the central path point $(X(\mu), S(\mu))$ is bounded where $0 < \mu \le \mu^0$ for any given $0 < \mu^0 < \infty$.

ii) For $0 < \mu' < \mu$,
\[
C \bullet X(\mu') < C \bullet X(\mu) \quad \text{and} \quad b^T y(\mu') > b^T y(\mu).
\]
iii) $(X(\mu), S(\mu))$ converges to an optimal solution pair for (SDP) and (SDD), and the rank of the limit of $X(\mu)$ is maximal among all optimal solutions of (SDP) and the rank of the limit of $S(\mu)$ is maximal among all optimal solutions of (SDD).
2.4 Notes

The SDP with a duality gap was constructed by Freund.

2.5 Exercises

2.1 Let (LP) and (LD) have interior. Prove that the dual potential function $\mathcal B_{n+1}(y, s, z)$, where $z$ is an upper bound of $z^*$, represents the volume of a coordinate-aligned ellipsoid whose intersection with the affine set $\{x : Ax = b\}$ contains the primal level set $\{x \in \mathcal F_p : c^T x \le z\}$.

2.2 Let $X, S \in \mathcal M^n$ be both positive definite. Then prove
\[
\psi_n(X, S) = n \log(X \bullet S) - \log(\det(X) \cdot \det(S)) \ge n \log n.
\]
2.3 Consider linear programming and the level set
\[
\Psi(\delta) := \{(x, y, s) \in \mathring{\mathcal F} : \psi_{n+\rho}(x, s) \le \delta\}.
\]
Prove that
\[
\Psi(\delta^1) \subset \Psi(\delta^2) \quad \text{if} \quad \delta^1 \le \delta^2,
\]
and that for every $\delta$, $\Psi(\delta)$ is bounded and its closure $\hat\Psi(\delta)$ has nonempty intersection with the solution set.

2.4 Prove (ii) of Theorem 2.9.

2.5 Prove Theorem 2.10.

2.6 Prove Corollary 2.11. Here we assume that $X(\mu) \ne X(\mu')$ and $y(\mu) \ne y(\mu')$.
Chapter 3

Interior-Point Algorithms

This pair of semidefinite programs can be solved in “polynomial time”. There are actually several polynomial algorithms. One is the primal-scaling algorithm, which uses only $X$ to generate the iterate direction. In other words,
\[
\begin{pmatrix} X^{k+1} \\ S^{k+1} \end{pmatrix} = F_p(X^k),
\]
where $F_p$ is the primal algorithm iterative mapping.

Another is the dual-scaling algorithm, which is the analogue of the dual potential reduction algorithm for linear programming. The dual-scaling algorithm uses only $S$ to generate the new iterate:
\[
\begin{pmatrix} X^{k+1} \\ S^{k+1} \end{pmatrix} = F_d(S^k),
\]
where $F_d$ is the dual algorithm iterative mapping.

The third is the primal-dual scaling algorithm, which uses both $X$ and $S$ to generate the new iterate:
\[
\begin{pmatrix} X^{k+1} \\ S^{k+1} \end{pmatrix} = F_{pd}(X^k, S^k),
\]
where $F_{pd}$ is the primal-dual algorithm iterative mapping.

All these algorithms generate primal and dual iterates simultaneously, and possess $O(\sqrt n \ln(1/\epsilon))$ iteration complexity to yield the duality gap accuracy $\epsilon$.

Other scaling algorithms have been proposed in the past. For example, an SDP equivalent of Dikin's affine-scaling algorithm could be very fast. However, this algorithm may not even converge.

Recall that $\mathcal M^n$ denotes the set of symmetric matrices in $R^{n \times n}$. Let $\mathcal M^n_+$ denote the set of positive semidefinite matrices and $\mathring{\mathcal M}^n_+$ the set of positive definite matrices in $\mathcal M^n$. The goal of this section is to extend interior-point algorithms to solving the positive semidefinite programming problem (SDP) and (SDD) presented in Section 2.0.1.
(SDP) and (SDD) are analogues to linear programming (LP) and (LD). In fact, as the notation suggests, (LP) and (LD) can be expressed as a positive semidefinite program by defining
\[
C = \mathrm{diag}(c), \quad A_i = \mathrm{diag}(a_{i\cdot}), \quad b = b,
\]
where $a_{i\cdot}$ is the $i$th row of matrix $A$. Many of the theorems and algorithms used in LP have analogues in SDP. However, while interior-point algorithms for LP are generally considered competitive with the simplex method in practice and outperform it as problems become large, interior-point methods for SDP outperform other methods on even small problems.

Denote the primal feasible set by $\mathcal F_p$ and the dual by $\mathcal F_d$. We assume that both $\mathring{\mathcal F}_p$ and $\mathring{\mathcal F}_d$ are nonempty. Thus, the optimal solution sets for both (SDP) and (SDD) are bounded and the central path exists; see Section 2.3. Let $z^*$ denote the optimal value and $\mathcal F = \mathcal F_p \times \mathcal F_d$. In this section, we are interested in finding an $\epsilon$-approximate solution for the SDP problem:
\[
C \bullet X - b^T y = S \bullet X \le \epsilon.
\]
For simplicity, we assume that a central path pair $(X^0, y^0, S^0)$, which satisfies
\[
(X^0)^{.5} S^0 (X^0)^{.5} = \mu^0 I \quad \text{and} \quad \mu^0 = X^0 \bullet S^0 / n,
\]
is known. We will use it as our initial point throughout this section.

Let $X \in \mathring{\mathcal F}_p$, $(y, S) \in \mathring{\mathcal F}_d$, and $z \le z^*$. Then consider the primal potential function
\[
\mathcal P(X, z) = (n+\rho)\log(C \bullet X - z) - \log\det X,
\]
and the primal-dual potential function
\[
\psi(X, S) = (n+\rho)\log(S \bullet X) - \log\det XS,
\]
where $\rho = \sqrt n$. Let $z = b^T y$. Then $S \bullet X = C \bullet X - z$, and we have
\[
\psi(X, S) = \mathcal P(X, z) - \log\det S.
\]
As in Chapter 4, these functions will be used to solve SDP problems. Define the “$\infty$-norm”, which is the traditional $l_2$ operator norm for matrices, of $\mathcal M^n$ by
\[
\|X\|_\infty := \max_{j \in \{1, \ldots, n\}} \{|\lambda_j(X)|\},
\]
where $\lambda_j(X)$ is the $j$th eigenvalue of $X$, and the “Euclidean” or $l_2$ norm, which is the traditional Frobenius norm, by
\[
\|X\| := \|X\|_f = \sqrt{X \bullet X} = \sqrt{\sum_{j=1}^{n} (\lambda_j(X))^2}.
\]
We rename these norms because they are perfect analogues to the norms of vectors used in LP. Furthermore, note that, for $X \in \mathcal M^n$,
\[
\mathrm{tr}(X) = \sum_{j=1}^{n} \lambda_j(X) \quad \text{and} \quad \det(I + X) = \prod_{j=1}^{n} (1 + \lambda_j(X)).
\]
We first present two important lemmas; Lemma 3.2 is the matrix analogue of Lemma 3.1.

Lemma 3.1 If $d \in R^n$ is such that $\|d\|_\infty < 1$, then
\[
e^T d \ge \sum_{i=1}^{n} \log(1 + d_i) \ge e^T d - \frac{\|d\|^2}{2(1 - \|d\|_\infty)}.
\]
Lemma 3.2 Let $X \in \mathcal M^n$ and $\|X\|_\infty < 1$. Then,
\[
\mathrm{tr}(X) \ge \log\det(I + X) \ge \mathrm{tr}(X) - \frac{\|X\|^2}{2(1 - \|X\|_\infty)}.
\]

3.1 Potential Reduction Algorithm for LP

Let $(x, y, s) \in \mathring{\mathcal F}$. Then consider the primal-dual potential function:
\[
\psi_{n+\rho}(x, s) = (n+\rho)\log(x^T s) - \sum_{j=1}^{n} \log(x_j s_j),
\]
where $\rho \ge 0$. Let $z = b^T y$; then $s^T x = c^T x - z$ and we have
\[
\psi_{n+\rho}(x, s) = \mathcal P_{n+\rho}(x, z) - \sum_{j=1}^{n} \log s_j.
\]
Recall that when $\rho = 0$, $\psi_{n+\rho}(x, s)$ is minimized along the central path. However, when $\rho > 0$, $\psi_{n+\rho}(x, s) \to -\infty$ means that $x$ and $s$ converge to the optimal face, and the descent gets steeper as $\rho$ increases. In this section we choose $\rho = \sqrt n$.

The process calculates steps for $x$ and $s$, which guarantee a constant reduction in the primal-dual potential function. As the potential function decreases, both $x$ and $s$ are forced to an optimal solution pair.

Consider a pair $(x^k, y^k, s^k) \in \mathring{\mathcal F}$. Fix $z^k = b^T y^k$; then the gradient vector of the primal potential function at $x^k$ is
\[
\nabla \mathcal P_{n+\rho}(x^k, z^k) = \frac{(n+\rho)}{(s^k)^T x^k} c - (X^k)^{-1} e = \frac{(n+\rho)}{c^T x^k - z^k} c - (X^k)^{-1} e.
\]
We directly solve the ball-constrained linear problem for direction $d_x$:
\[
\begin{array}{ll}
\text{minimize} & \nabla \mathcal P_{n+\rho}(x^k, z^k)^T d_x \\
\text{s.t.} & A d_x = 0,\ \|(X^k)^{-1} d_x\| \le \alpha.
\end{array}
\]
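Let the minimizer be $d_x$. Then
\[
d_x = -\alpha \frac{X^k p^k}{\|p^k\|},
\]
where
\[
p^k = p(z^k) := (I - X^k A^T (A (X^k)^2 A^T)^{-1} A X^k) X^k \nabla \mathcal P_{n+\rho}(x^k, z^k).
\]
Update
\[
x^{k+1} = x^k + d_x = x^k - \alpha \frac{X^k p^k}{\|p^k\|}, \tag{3.1}
\]
and,
\[
\mathcal P_{n+\rho}(x^{k+1}, z^k) - \mathcal P_{n+\rho}(x^k, z^k) \le -\alpha \|p^k\| + \frac{\alpha^2}{2(1 - \alpha)}.
\]
Here, we have used the fact
\[
\begin{aligned}
\mathcal P_{n+\rho}(x^{k+1}, z^k) - \mathcal P_{n+\rho}(x^k, z^k)
&\le \frac{n+\rho}{c^T x^k - z^k} c^T d_x - e^T (X^k)^{-1} d_x + \frac{\|(X^k)^{-1} d_x\|^2}{2(1 - \|(X^k)^{-1} d_x\|_\infty)} \\
&= \nabla \mathcal P(x^k, z^k)^T d_x + \frac{\|(X^k)^{-1} d_x\|^2}{2(1 - \|(X^k)^{-1} d_x\|_\infty)} \\
&\le -\alpha \|p^k\| + \frac{\alpha^2}{2(1 - \alpha)},
\end{aligned}
\]
where the last step uses $\|(X^k)^{-1} d_x\|_\infty \le \|(X^k)^{-1} d_x\| = \alpha$.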
Thus, as long as $\|p^k\| \ge \eta > 0$, we may choose an appropriate $\alpha$ such that
\[
\mathcal P_{n+\rho}(x^{k+1}, z^k) - \mathcal P_{n+\rho}(x^k, z^k) \le -\delta
\]
for some positive constant $\delta$. By the relation between $\psi_{n+\rho}(x, s)$ and $\mathcal P_{n+\rho}(x, z)$, the primal-dual potential function is also reduced. That is,
\[
\psi_{n+\rho}(x^{k+1}, s^k) - \psi_{n+\rho}(x^k, s^k) \le -\delta.
\]
However, even if $\|p^k\|$ is small, we will show that the primal-dual potential function can be reduced by a constant $\delta$ by increasing $z^k$ and updating $(y^k, s^k)$.

We focus on the expression of $p^k$, which can be rewritten as
\[
\begin{aligned}
p^k &= (I - X^k A^T (A (X^k)^2 A^T)^{-1} A X^k) \left( \frac{(n+\rho)}{c^T x^k - z^k} X^k c - e \right) \\
&= \frac{(n+\rho)}{c^T x^k - z^k} X^k s(z^k) - e,
\end{aligned} \tag{3.2}
\]
where
\[
s(z^k) = c - A^T y(z^k) \tag{3.3}
\]
and
\[
\begin{aligned}
y(z^k) &= y_2 - \frac{c^T x^k - z^k}{(n+\rho)} y_1, \\
y_1 &= (A (X^k)^2 A^T)^{-1} b, \\
y_2 &= (A (X^k)^2 A^T)^{-1} A (X^k)^2 c.
\end{aligned} \tag{3.4}
\]
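One primal step (3.1), together with the dual estimate from (3.3) and (3.4), can be sketched as follows. This is an illustrative sketch rather than the notes' implementation: the function names, the step size $\alpha = 0.3$, and the small LP instance are assumptions, and NumPy is assumed to be available. The script checks numerically that the projected vector agrees with expression (3.2) and that the potential reduction respects the bound $-\alpha\|p^k\| + \alpha^2/(2(1-\alpha))$.

```python
import numpy as np

def primal_potential(c, x, z, rho):
    """P_{n+rho}(x, z) = (n + rho) log(c^T x - z) - sum_j log x_j."""
    return (x.size + rho) * np.log(c @ x - z) - np.sum(np.log(x))

def primal_step(A, b, c, x, z, rho, alpha=0.3):
    """One step of (3.1); also returns p^k and the dual estimate (y(z), s(z))."""
    n = x.size
    kappa = (n + rho) / (c @ x - z)
    AX = A * x                                   # A X^k (scales the columns of A)
    M = AX @ AX.T                                # A (X^k)^2 A^T
    g = kappa * x * c - 1.0                      # X^k times the gradient of P
    p = g - AX.T @ np.linalg.solve(M, AX @ g)    # projection onto the null space of A X^k
    y1 = np.linalg.solve(M, b)                   # (3.4)
    y2 = np.linalg.solve(M, AX @ (x * c))
    y_z = y2 - y1 / kappa
    s_z = c - A.T @ y_z                          # (3.3)
    x_new = x - alpha * x * p / np.linalg.norm(p)
    return x_new, p, y_z, s_z

# Hypothetical instance: minimize x1 + 2 x2 + 3 x3 s.t. x1 + x2 + x3 = 1, x >= 0,
# with the dual feasible point y = 0, so that z = b^T y = 0 is a lower bound.
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([1.0])
c = np.array([1.0, 2.0, 3.0])
x0, z0 = np.full(3, 1.0 / 3.0), 0.0
rho, alpha = np.sqrt(3.0), 0.3

x1, p, y_z, s_z = primal_step(A, b, c, x0, z0, rho, alpha)
kappa0 = (3 + rho) / (c @ x0 - z0)
drop = primal_potential(c, x1, z0, rho) - primal_potential(c, x0, z0, rho)
bound = -alpha * np.linalg.norm(p) + alpha**2 / (2 * (1 - alpha))
print("potential change:", drop, "bound:", bound)
```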
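Regarding $\|p^k\| = \|p(z^k)\|$, we have the following lemma:

Lemma 3.3 Let
\[
\mu^k = \frac{(x^k)^T s^k}{n} = \frac{c^T x^k - z^k}{n} \quad \text{and} \quad \mu = \frac{(x^k)^T s(z^k)}{n}.
\]
If
\[
\|p(z^k)\| < \min\left( \eta \sqrt{\frac{n}{n + \eta^2}},\ 1 - \eta \right), \tag{3.5}
\]
then the following three inequalities hold:
\[
s(z^k) > 0, \quad \|X^k s(z^k) - \mu e\| < \eta\mu, \quad \text{and} \quad \mu < (1 - .5\eta/\sqrt n)\mu^k. \tag{3.6}
\]
Proof. The proof is by contradiction.

i) If the first inequality of (3.6) is not true, then $\exists\ j$ such that $s_j(z^k) \le 0$ and
\[
\|p(z^k)\| \ge 1 - \frac{(n+\rho)}{n\mu^k} x_j s_j(z^k) \ge 1.
\]
ii) If the second inequality of (3.6) does not hold, then
\[
\begin{aligned}
\|p(z^k)\|^2 &= \left\| \frac{(n+\rho)}{n\mu^k} X^k s(z^k) - \frac{(n+\rho)\mu}{n\mu^k} e + \frac{(n+\rho)\mu}{n\mu^k} e - e \right\|^2 \\
&= \left( \frac{(n+\rho)}{n\mu^k} \right)^2 \|X^k s(z^k) - \mu e\|^2 + \left\| \frac{(n+\rho)\mu}{n\mu^k} e - e \right\|^2 \\
&\ge \left( \frac{(n+\rho)\mu}{n\mu^k} \right)^2 \eta^2 + \left( \frac{(n+\rho)\mu}{n\mu^k} - 1 \right)^2 n \\
&\ge \eta^2 \frac{n}{n + \eta^2},
\end{aligned} \tag{3.7}
\]
where the last relation prevails since the quadratic term yields the minimum at
\[
\frac{(n+\rho)\mu}{n\mu^k} = \frac{n}{n + \eta^2}.
\]
iii) If the third inequality of (3.6) is violated, then
\[
\frac{(n+\rho)\mu}{n\mu^k} \ge \left(1 + \frac{1}{\sqrt n}\right)\left(1 - \frac{.5\eta}{\sqrt n}\right) \ge 1,
\]
which, in view of (3.7), leads to
\[
\begin{aligned}
\|p(z^k)\|^2 &\ge \left( \frac{(n+\rho)\mu}{n\mu^k} - 1 \right)^2 n \\
&\ge \left( \left(1 + \frac{1}{\sqrt n}\right)\left(1 - \frac{.5\eta}{\sqrt n}\right) - 1 \right)^2 n \\
&\ge \left( 1 - \frac{\eta}{2} - \frac{\eta}{2\sqrt n} \right)^2 \\
&\ge (1 - \eta)^2.
\end{aligned}
\]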
The lemma says that, when $\|p(z^k)\|$ is small, then $(x^k, y(z^k), s(z^k))$ is in the neighborhood of the central path and $b^T y(z^k) > z^k$. Thus, we can increase $z^k$ to $b^T y(z^k)$ to cut the dual level set $\Omega(z^k)$. We have the following potential reduction theorem to evaluate the progress.

Theorem 3.4 Given $(x^k, y^k, s^k) \in \mathring{\mathcal F}$. Let $\rho = \sqrt n$, $z^k = b^T y^k$, $x^{k+1}$ be given by (3.1), and $y^{k+1} = y(z^k)$ in (3.4) and $s^{k+1} = s(z^k)$ in (3.3). Then, either
\[
\psi_{n+\rho}(x^{k+1}, s^k) \le \psi_{n+\rho}(x^k, s^k) - \delta
\]
or
\[
\psi_{n+\rho}(x^k, s^{k+1}) \le \psi_{n+\rho}(x^k, s^k) - \delta,
\]
where $\delta > 1/20$.

Proof. If (3.5) does not hold, i.e.,
\[
\|p(z^k)\| \ge \min\left( \eta \sqrt{\frac{n}{n + \eta^2}},\ 1 - \eta \right),
\]
then
\[
\mathcal P_{n+\rho}(x^{k+1}, z^k) - \mathcal P_{n+\rho}(x^k, z^k) \le -\alpha \min\left( \eta \sqrt{\frac{n}{n + \eta^2}},\ 1 - \eta \right) + \frac{\alpha^2}{2(1 - \alpha)},
\]
hence from the relation between $\mathcal P_{n+\rho}$ and $\psi_{n+\rho}$,
\[
\psi_{n+\rho}(x^{k+1}, s^k) - \psi_{n+\rho}(x^k, s^k) \le -\alpha \min\left( \eta \sqrt{\frac{n}{n + \eta^2}},\ 1 - \eta \right) + \frac{\alpha^2}{2(1 - \alpha)}.
\]
Otherwise, from Lemma 3.3 the inequalities of (3.6) hold:

i) The first of (3.6) indicates that $y^{k+1}$ and $s^{k+1}$ are in $\mathring{\mathcal F}_d$.

ii) Using the second of (3.6) and applying Lemma 3.1 to the vector $X^k s^{k+1}/\mu$, we have
\[
\begin{aligned}
n \log (x^k)^T s^{k+1} - \sum_{j=1}^{n} \log(x^k_j s^{k+1}_j)
&= n \log n - \sum_{j=1}^{n} \log(x^k_j s^{k+1}_j / \mu) \\
&\le n \log n + \frac{\|X^k s^{k+1}/\mu - e\|^2}{2(1 - \|X^k s^{k+1}/\mu - e\|_\infty)} \\
&\le n \log n + \frac{\eta^2}{2(1 - \eta)} \\
&\le n \log (x^k)^T s^k - \sum_{j=1}^{n} \log(x^k_j s^k_j) + \frac{\eta^2}{2(1 - \eta)}.
\end{aligned}
\]
iii) According to the third of (3.6), we have
\[
\sqrt n \left( \log (x^k)^T s^{k+1} - \log (x^k)^T s^k \right) = \sqrt n \log \frac{\mu}{\mu^k} \le -\frac{\eta}{2}.
\]
Adding the two inequalities in ii) and iii), we have
\[
\psi_{n+\rho}(x^k, s^{k+1}) \le \psi_{n+\rho}(x^k, s^k) - \frac{\eta}{2} + \frac{\eta^2}{2(1 - \eta)}.
\]
Thus, by choosing $\eta = .43$ and $\alpha = .3$ we have the desired result.
Theorem 3.4 establishes an important fact: the primal-dual potential function can be reduced by a constant no matter where $x^k$ and $y^k$ are. In practice, one can perform the line search to minimize the primal-dual potential function. This results in the following primal-dual potential reduction algorithm.

Algorithm 3.1 Given a central path point $(x^0, y^0, s^0) \in \mathring{\mathcal F}$. Let $z^0 = b^T y^0$. Set $k := 0$.

While $(s^k)^T x^k \ge \epsilon$ do

1. Compute $y_1$ and $y_2$ from (3.4).

2. If there exists $z$ such that $s(z) > 0$, compute
\[
\bar z = \arg\min_z \psi_{n+\rho}(x^k, s(z)),
\]
and if $\psi_{n+\rho}(x^k, s(\bar z)) < \psi_{n+\rho}(x^k, s^k)$ then $y^{k+1} = y(\bar z)$, $s^{k+1} = s(\bar z)$ and $z^{k+1} = b^T y^{k+1}$; otherwise, $y^{k+1} = y^k$, $s^{k+1} = s^k$ and $z^{k+1} = z^k$.

3. Let $x^{k+1} = x^k - \bar\alpha X^k p(z^{k+1})$ with
\[
\bar\alpha = \arg\min_{\alpha \ge 0} \psi_{n+\rho}(x^k - \alpha X^k p(z^{k+1}), s^{k+1}).
\]
4. Let $k := k + 1$ and return to Step 1.

The performance of the algorithm results from the following corollary:

Corollary 3.5 Let $\rho = \sqrt n$. Then, Algorithm 3.1 terminates in at most $O(\sqrt n \log((c^T x^0 - b^T y^0)/\epsilon))$ iterations with
\[
c^T x^k - b^T y^k \le \epsilon.
\]
Proof. In $O(\sqrt n \log((x^0)^T s^0/\epsilon))$ iterations
\[
\begin{aligned}
-\sqrt n \log((x^0)^T s^0/\epsilon) &= \psi_{n+\rho}(x^k, s^k) - \psi_{n+\rho}(x^0, s^0) \\
&\ge \sqrt n \log (x^k)^T s^k + n \log n - \psi_{n+\rho}(x^0, s^0) \\
&= \sqrt n \log((x^k)^T s^k / (x^0)^T s^0).
\end{aligned}
\]
Thus,
\[
\sqrt n \log(c^T x^k - b^T y^k) = \sqrt n \log (x^k)^T s^k \le \sqrt n \log \epsilon,
\]
i.e.,
\[
c^T x^k - b^T y^k = (x^k)^T s^k \le \epsilon.
\]
3.2 Primal-Dual (Symmetric) Algorithm for LP

Another technique for solving linear programs is the symmetric primal-dual algorithm. Once we have a pair $(x, y, s) \in \mathring{\mathcal F}$ with $\mu = x^T s/n$, we can generate a new iterate $x^+$ and $(y^+, s^+)$ by solving for $d_x$, $d_y$ and $d_s$ from the system of linear equations:
\[
\begin{aligned}
S d_x + X d_s &= \gamma\mu e - Xs, \\
A d_x &= 0, \\
-A^T d_y - d_s &= 0.
\end{aligned} \tag{3.8}
\]
Let $d := (d_x, d_y, d_s)$. To show the dependence of $d$ on the current pair $(x, s)$ and the parameter $\gamma$, we write $d = d(x, s, \gamma)$. Note that $d_x^T d_s = -d_x^T A^T d_y = 0$ here.

The system (3.8) is the Newton step starting from $(x, s)$ which helps to find the point on the central path with duality gap $\gamma n \mu$; see Section 2.3.1. If $\gamma = 0$, it steps toward the optimal solution characterized by the system of equations (1.2); if $\gamma = 1$, it steps toward the central path point $(x(\mu), y(\mu), s(\mu))$ characterized by the system of equations (2.10); if $0 < \gamma < 1$, it steps toward a central path point with a smaller complementarity gap. In the algorithm presented in this section, we choose $\gamma = n/(n+\rho) < 1$. Each iterate reduces the primal-dual potential function by at least a constant $\delta$, as does the previous potential reduction algorithm.

To analyze this algorithm, we present the following lemma, whose proof is omitted.

Lemma 3.6 Let the direction $d = (d_x, d_y, d_s)$ be generated by equation (3.8) with $\gamma = n/(n+\rho)$, and let
\[
\theta = \frac{\alpha \sqrt{\min(Xs)}}{\left\| (XS)^{-1/2} \left( \frac{x^T s}{(n+\rho)} e - Xs \right) \right\|}, \tag{3.9}
\]
where $\alpha$ is a positive constant less than 1. Let
\[
x^+ = x + \theta d_x, \quad y^+ = y + \theta d_y, \quad \text{and} \quad s^+ = s + \theta d_s.
\]
Then, we have $(x^+, y^+, s^+) \in \mathring{\mathcal F}$ and
\[
\psi_{n+\rho}(x^+, s^+) - \psi_{n+\rho}(x, s) \le -\alpha \sqrt{\min(Xs)} \left\| (XS)^{-1/2} \left( e - \frac{(n+\rho)}{x^T s} Xs \right) \right\| + \frac{\alpha^2}{2(1 - \alpha)}.
\]
Let $v = Xs$. Then, we can prove the following lemma (Exercise 3.3):

Lemma 3.7 Let $v \in R^n$ be a positive vector and $\rho \ge \sqrt n$. Then,
\[
\sqrt{\min(v)} \left\| V^{-1/2} \left( e - \frac{(n+\rho)}{e^T v} v \right) \right\| \ge \sqrt{3/4}.
\]
Combining these two lemmas we have
\[
\psi_{n+\rho}(x^+, s^+) - \psi_{n+\rho}(x, s) \le -\alpha \sqrt{3/4} + \frac{\alpha^2}{2(1 - \alpha)} = -\delta
\]
for a constant $\delta$. This result will provide a competitive theoretical iteration bound, but a faster algorithm may again be implemented by conducting a line search along direction $d$ to achieve the greatest reduction in the primal-dual potential function. This leads to

Algorithm 3.2 Given $(x^0, y^0, s^0) \in \mathring{\mathcal F}$. Set $\rho \ge \sqrt n$ and $k := 0$.

While $(s^k)^T x^k \ge \epsilon$ do

1. Set $(x, s) = (x^k, s^k)$ and $\gamma = n/(n+\rho)$ and compute $(d_x, d_y, d_s)$ from (3.8).

2. Let $x^{k+1} = x^k + \bar\alpha d_x$, $y^{k+1} = y^k + \bar\alpha d_y$, and $s^{k+1} = s^k + \bar\alpha d_s$ where
\[
\bar\alpha = \arg\min_{\alpha \ge 0} \psi_{n+\rho}(x^k + \alpha d_x, s^k + \alpha d_s).
\]
3. Let $k := k + 1$ and return to Step 1.

Theorem 3.8 Let $\rho = O(\sqrt n)$. Then, Algorithm 3.2 terminates in at most $O(\sqrt n \log((x^0)^T s^0/\epsilon))$ iterations with
\[
c^T x^k - b^T y^k \le \epsilon.
\]
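The iteration above can be sketched in a few lines. Note the deliberate simplification: instead of the line search in Step 2 of Algorithm 3.2, this sketch uses the fixed step size $\theta$ from (3.9) in Lemma 3.6, so every iteration should reduce the potential by at least $\delta = \alpha\sqrt{3/4} - \alpha^2/(2(1-\alpha))$. The function names, $\alpha = 0.5$, the tolerance, and the small LP instance are assumptions for illustration, and NumPy is assumed to be available.

```python
import numpy as np

def psi(x, s, rho):
    """Primal-dual potential (n + rho) log(x^T s) - sum_j log(x_j s_j)."""
    return (x.size + rho) * np.log(x @ s) - np.sum(np.log(x * s))

def primal_dual_step(A, x, y, s, rho, alpha=0.5):
    """Direction from (3.8) with gamma = n/(n + rho), step size theta from (3.9)."""
    n = x.size
    r = (x @ s) / (n + rho) - x * s              # gamma * mu * e - Xs
    d = x / s
    dy = np.linalg.solve((A * d) @ A.T, -A @ (r / s))
    ds = -A.T @ dy                               # from -A^T dy - ds = 0
    dx = (r - x * ds) / s                        # from S dx + X ds = r
    theta = alpha * np.sqrt(np.min(x * s)) / np.linalg.norm(r / np.sqrt(x * s))
    return x + theta * dx, y + theta * dy, s + theta * ds

def solve_lp(A, x, y, s, eps=1e-6, alpha=0.5, max_iter=1000):
    """Iterate until the duality gap x^T s drops below eps; record potential drops."""
    rho = np.sqrt(x.size)
    drops = []
    for _ in range(max_iter):
        if x @ s < eps:
            break
        before = psi(x, s, rho)
        x, y, s = primal_dual_step(A, x, y, s, rho, alpha)
        drops.append(before - psi(x, s, rho))
    return x, y, s, drops

# Hypothetical instance: minimize x1 + 2 x2 + 3 x3 s.t. x1 + x2 + x3 = 1, x >= 0.
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([1.0])
c = np.array([1.0, 2.0, 3.0])
x, y, s, drops = solve_lp(A, np.full(3, 1.0 / 3.0), np.zeros(1), c.copy())
delta = 0.5 * np.sqrt(0.75) - 0.5**2 / (2 * (1 - 0.5))
print("iterations:", len(drops), "gap:", x @ s, "smallest drop:", min(drops), "delta:", delta)
```

A line search, as in the algorithm proper, can only do better per iteration than this fixed step; the fixed step is used here because it makes the guaranteed reduction directly checkable.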
3.3 Potential Reduction Algorithm for SDP

Consider a pair $(X^k, y^k, S^k) \in \mathring{\mathcal F}$. Fix $z^k = b^T y^k$; then the gradient matrix of the primal potential function at $X^k$ is
\[
\nabla \mathcal P(X^k, z^k) = \frac{n+\rho}{S^k \bullet X^k} C - (X^k)^{-1}.
\]
The following corollary is an analog to LP.

Corollary 3.9 Let $X^k \in \mathring{\mathcal M}^n_+$ and $\|(X^k)^{-.5}(X - X^k)(X^k)^{-.5}\|_\infty < 1$. Then, $X \in \mathring{\mathcal M}^n_+$ and
\[
\mathcal P(X, z^k) - \mathcal P(X^k, z^k) \le \nabla \mathcal P(X^k, z^k) \bullet (X - X^k) + \frac{\|(X^k)^{-.5}(X - X^k)(X^k)^{-.5}\|^2}{2(1 - \|(X^k)^{-.5}(X - X^k)(X^k)^{-.5}\|_\infty)}.
\]
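The inequality in Corollary 3.9 can be sanity-checked on random data. In this sketch the helper names, the random positive definite $C$ and $X^k$, and the choice $z = 0$ are assumptions made only so that $C \bullet X - z > 0$ holds for every positive definite $X$; the gradient is written with $C \bullet X^k - z$ in place of $S^k \bullet X^k$, which coincide when $z = b^T y^k$. NumPy is assumed to be available.

```python
import numpy as np

def sym_power(M, p):
    """M^p for a symmetric positive definite M via its eigendecomposition."""
    w, Q = np.linalg.eigh(M)
    return (Q * w**p) @ Q.T

def primal_potential(C, X, z, rho):
    """P(X, z) = (n + rho) log(C • X - z) - log det X."""
    n = X.shape[0]
    return (n + rho) * np.log(np.sum(C * X) - z) - np.linalg.slogdet(X)[1]

def corollary_sides(C, Xk, X, z, rho):
    """Return (lhs, rhs) of the inequality in Corollary 3.9."""
    n = Xk.shape[0]
    Xmh = sym_power(Xk, -0.5)
    D = Xmh @ (X - Xk) @ Xmh                     # scaled displacement
    grad = (n + rho) / (np.sum(C * Xk) - z) * C - np.linalg.inv(Xk)
    lhs = primal_potential(C, X, z, rho) - primal_potential(C, Xk, z, rho)
    quad = np.linalg.norm(D, 'fro')**2 / (2 * (1 - np.linalg.norm(D, 2)))
    return lhs, np.sum(grad * (X - Xk)) + quad

# Hypothetical random data.
rng = np.random.default_rng(0)
n = 4
rho, z = np.sqrt(n), 0.0
G, H = rng.standard_normal((n, n)), rng.standard_normal((n, n))
Xk = G @ G.T + np.eye(n)
C = H @ H.T + np.eye(n)
Xh = sym_power(Xk, 0.5)

results = []
for _ in range(20):
    T = rng.standard_normal((n, n))
    T = (T + T.T) / 2
    T *= 0.9 * rng.random() / np.linalg.norm(T, 2)   # spectral norm below 1
    X = Xh @ (np.eye(n) + T) @ Xh
    results.append(corollary_sides(C, Xk, X, z, rho))
print(all(lhs <= rhs + 1e-9 for lhs, rhs in results))
```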
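Let
\[
A = \begin{pmatrix} A_1 \\ A_2 \\ \vdots \\ A_m \end{pmatrix}.
\]
Then, define
\[
AX = \begin{pmatrix} A_1 \bullet X \\ A_2 \bullet X \\ \vdots \\ A_m \bullet X \end{pmatrix} = b,
\]
and
\[
A^T y = \sum_{i=1}^{m} y_i A_i.
\]
Then, we directly solve the following “ball-constrained” problem:
\[
\begin{array}{ll}
\text{minimize} & \nabla \mathcal P(X^k, z^k) \bullet (X - X^k) \\
\text{s.t.} & A(X - X^k) = 0, \\
& \|(X^k)^{-.5}(X - X^k)(X^k)^{-.5}\| \le \alpha < 1.
\end{array}
\]
Let $X' = (X^k)^{-.5} X (X^k)^{-.5}$. Note that for any symmetric matrices $Q, T \in \mathcal M^n$ and $X \in \mathring{\mathcal M}^n_+$,
\[
Q \bullet X^{.5} T X^{.5} = X^{.5} Q X^{.5} \bullet T \quad \text{and} \quad \|XQ\|_\cdot = \|QX\|_\cdot = \|X^{.5} Q X^{.5}\|_\cdot.
\]
Then we transform the above problem into
\[
\begin{array}{ll}
\text{minimize} & (X^k)^{.5} \nabla \mathcal P(X^k, z^k) (X^k)^{.5} \bullet (X' - I) \\
\text{s.t.} & A'_i \bullet (X' - I) = 0, \quad i = 1, 2, \ldots, m, \\
& \|X' - I\| \le \alpha,
\end{array}
\]
where
\[
A' = \begin{pmatrix} A'_1 \\ A'_2 \\ \vdots \\ A'_m \end{pmatrix} := \begin{pmatrix} (X^k)^{.5} A_1 (X^k)^{.5} \\ (X^k)^{.5} A_2 (X^k)^{.5} \\ \vdots \\ (X^k)^{.5} A_m (X^k)^{.5} \end{pmatrix}.
\]
Let the minimizer be $X'$ and let $X^{k+1} = (X^k)^{.5} X' (X^k)^{.5}$. Then
\[
X' - I = -\alpha \frac{P^k}{\|P^k\|},
\]
\[
X^{k+1} - X^k = -\alpha \frac{(X^k)^{.5} P^k (X^k)^{.5}}{\|P^k\|}, \tag{3.10}
\]
where
\[
\begin{aligned}
P^k &= P_{A'} (X^k)^{.5} \nabla \mathcal P(X^k, z^k) (X^k)^{.5} \\
&= (X^k)^{.5} \nabla \mathcal P(X^k, z^k) (X^k)^{.5} - \frac{n+\rho}{S^k \bullet X^k} A'^T y^k
\end{aligned}
\]
or
\[
P^k = \frac{n+\rho}{S^k \bullet X^k} (X^k)^{.5} (C - A^T y^k) (X^k)^{.5} - I,
\]
and
\[
y^k = \frac{S^k \bullet X^k}{n+\rho} (A' A'^T)^{-1} A' (X^k)^{.5} \nabla \mathcal P(X^k, z^k) (X^k)^{.5}.
\]
Here, $P_{A'}$ is the projection operator onto the null space of $A'$, and
\[
A' A'^T := \begin{pmatrix}
A'_1 \bullet A'_1 & A'_1 \bullet A'_2 & \ldots & A'_1 \bullet A'_m \\
A'_2 \bullet A'_1 & A'_2 \bullet A'_2 & \ldots & A'_2 \bullet A'_m \\
\ldots & \ldots & \ldots & \ldots \\
A'_m \bullet A'_1 & A'_m \bullet A'_2 & \ldots & A'_m \bullet A'_m
\end{pmatrix} \in \mathcal M^m.
\]
In view of Corollary 3.9 and
\[
\begin{aligned}
\nabla \mathcal P(X^k, z^k) \bullet (X^{k+1} - X^k) &= -\alpha \frac{\nabla \mathcal P(X^k, z^k) \bullet (X^k)^{.5} P^k (X^k)^{.5}}{\|P^k\|} \\
&= -\alpha \frac{(X^k)^{.5} \nabla \mathcal P(X^k, z^k) (X^k)^{.5} \bullet P^k}{\|P^k\|} \\
&= -\alpha \frac{\|P^k\|^2}{\|P^k\|} = -\alpha \|P^k\|,
\end{aligned}
\]
we have
\[
\mathcal P(X^{k+1}, z^k) - \mathcal P(X^k, z^k) \le -\alpha \|P^k\| + \frac{\alpha^2}{2(1 - \alpha)}.
\]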
Thus, as long as $\|P^k\| \ge \beta > 0$, we may choose an appropriate $\alpha$ such that
\[
\mathcal P(X^{k+1}, z^k) - \mathcal P(X^k, z^k) \le -\delta
\]
for some positive constant $\delta$.

Now, we focus on the expression of $P^k$, which can be rewritten as
\[
P(z^k) := P^k = \frac{n+\rho}{S^k \bullet X^k} (X^k)^{.5} S(z^k) (X^k)^{.5} - I \tag{3.11}
\]
with
\[
S(z^k) = C - A^T y(z^k) \tag{3.12}
\]
and
\[
y(z^k) := y^k = y_2 - \frac{S^k \bullet X^k}{n+\rho} y_1 = y_2 - \frac{C \bullet X^k - z^k}{n+\rho} y_1, \tag{3.13}
\]
where $y_1$ and $y_2$ are given by
\[
\begin{aligned}
y_1 &= (A' A'^T)^{-1} A' I = (A' A'^T)^{-1} b, \\
y_2 &= (A' A'^T)^{-1} A' (X^k)^{.5} C (X^k)^{.5}.
\end{aligned} \tag{3.14}
\]
Regarding $\|P^k\| = \|P(z^k)\|$, we have the following lemma resembling Lemma 3.3.

Lemma 3.10 Let
\[
\mu^k = \frac{S^k \bullet X^k}{n} = \frac{C \bullet X^k - z^k}{n} \quad \text{and} \quad \mu = \frac{S(z^k) \bullet X^k}{n}.
\]
If
\[
\|P(z^k)\| < \min\left( \beta \sqrt{\frac{n}{n + \beta^2}},\ 1 - \beta \right), \tag{3.15}
\]
then the following three inequalities hold:
\[
S(z^k) \succ 0, \quad \|(X^k)^{.5} S(z^k) (X^k)^{.5} - \mu I\| < \beta\mu, \quad \text{and} \quad \mu < (1 - .5\beta/\sqrt n)\mu^k. \tag{3.16}
\]
Proof. The proof is by contradiction. For example, if the first inequality of (3.16) is not true, then $(X^k)^{.5} S(z^k) (X^k)^{.5}$ has at least one eigenvalue less than or equal to zero, and
\[
\|P(z^k)\| \ge 1.
\]
The proofs of the second and third inequalities are similar to those of Lemma 3.3.

Based on this lemma, we have the following potential reduction theorem.
Theorem 3.11 Given $X^k \in \mathring{\mathcal F}_p$ and $(y^k, S^k) \in \mathring{\mathcal F}_d$, let $\rho = \sqrt n$, $z^k = b^T y^k$, $X^{k+1}$ be given by (3.10), and $y^{k+1} = y(z^k)$ in (3.13) and $S^{k+1} = S(z^k)$ in (3.12). Then, either
\[
\psi(X^{k+1}, S^k) \le \psi(X^k, S^k) - \delta
\]
or
\[
\psi(X^k, S^{k+1}) \le \psi(X^k, S^k) - \delta,
\]
where $\delta > 1/20$.

Proof. If (3.15) does not hold, i.e.,
\[
\|P(z^k)\| \ge \min\left( \beta \sqrt{\frac{n}{n + \beta^2}},\ 1 - \beta \right),
\]
then, since $\psi(X^{k+1}, S^k) - \psi(X^k, S^k) = \mathcal P(X^{k+1}, z^k) - \mathcal P(X^k, z^k)$,
\[
\psi(X^{k+1}, S^k) - \psi(X^k, S^k) \le -\alpha \min\left( \beta \sqrt{\frac{n}{n + \beta^2}},\ 1 - \beta \right) + \frac{\alpha^2}{2(1 - \alpha)}.
\]
Otherwise, from Lemma 3.10 the inequalities of (3.16) hold:

i) The first of (3.16) indicates that $y^{k+1}$ and $S^{k+1}$ are in $\mathring{\mathcal F}_d$.

ii) Using the second of (3.16) and applying Lemma 3.2 to the matrix $(X^k)^{.5} S^{k+1} (X^k)^{.5}/\mu$, we have
\[
\begin{aligned}
n \log S^{k+1} \bullet X^k - \log\det S^{k+1} X^k
&= n \log (S^{k+1} \bullet X^k/\mu) - \log\det (X^k)^{.5} S^{k+1} (X^k)^{.5}/\mu \\
&= n \log n - \log\det (X^k)^{.5} S^{k+1} (X^k)^{.5}/\mu \\
&\le n \log n + \frac{\|(X^k)^{.5} S^{k+1} (X^k)^{.5}/\mu - I\|^2}{2(1 - \|(X^k)^{.5} S^{k+1} (X^k)^{.5}/\mu - I\|_\infty)} \\
&\le n \log n + \frac{\beta^2}{2(1 - \beta)} \\
&\le n \log S^k \bullet X^k - \log\det S^k X^k + \frac{\beta^2}{2(1 - \beta)}.
\end{aligned}
\]
iii) According to the third of (3.16), we have
\[
\sqrt n \left( \log S^{k+1} \bullet X^k - \log S^k \bullet X^k \right) = \sqrt n \log \frac{\mu}{\mu^k} \le -\frac{\beta}{2}.
\]
Adding the two inequalities in ii) and iii), we have
\[
\psi(X^k, S^{k+1}) \le \psi(X^k, S^k) - \frac{\beta}{2} + \frac{\beta^2}{2(1 - \beta)}.
\]
Thus, by choosing $\beta = .43$ and $\alpha = .3$ we have the desired result.
Theorem 3.11 establishes an important fact: the primal-dual potential function can be reduced by a constant no matter where $X^k$ and $y^k$ are. In practice, one can perform the line search to minimize the primal-dual potential function. This results in the following potential reduction algorithm.

Algorithm 3.3 Given $X^0 \in \mathring{\mathcal F}_p$ and $(y^0, S^0) \in \mathring{\mathcal F}_d$. Let $z^0 = b^T y^0$. Set $k := 0$.

While $S^k \bullet X^k \ge \epsilon$ do

1. Compute $y_1$ and $y_2$ from (3.14).

2. Set $y^{k+1} = y(\bar z)$, $S^{k+1} = S(\bar z)$, $z^{k+1} = b^T y^{k+1}$ with
\[
\bar z = \arg\min_{z \ge z^k} \psi(X^k, S(z)).
\]
If $\psi(X^k, S^{k+1}) > \psi(X^k, S^k)$ then $y^{k+1} = y^k$, $S^{k+1} = S^k$, $z^{k+1} = z^k$.

3. Let $X^{k+1} = X^k - \bar\alpha (X^k)^{.5} P(z^{k+1}) (X^k)^{.5}$ with
\[
\bar\alpha = \arg\min_{\alpha \ge 0} \psi(X^k - \alpha (X^k)^{.5} P(z^{k+1}) (X^k)^{.5}, S^{k+1}).
\]
4. Let $k := k + 1$ and return to Step 1.

The performance of the algorithm results from the following corollary:

Corollary 3.12 Let $\rho = \sqrt n$. Then, Algorithm 3.3 terminates in at most $O(\sqrt n \log((C \bullet X^0 - b^T y^0)/\epsilon))$ iterations with
\[
C \bullet X^k - b^T y^k \le \epsilon.
\]
Proof. In $O(\sqrt n \log(S^0 \bullet X^0/\epsilon))$ iterations
\[
\begin{aligned}
-\sqrt n \log(S^0 \bullet X^0/\epsilon) &= \psi(X^k, S^k) - \psi(X^0, S^0) \\
&\ge \sqrt n \log S^k \bullet X^k + n \log n - \psi(X^0, S^0) \\
&= \sqrt n \log(S^k \bullet X^k / S^0 \bullet X^0).
\end{aligned}
\]
Thus,
\[
\sqrt n \log(C \bullet X^k - b^T y^k) = \sqrt n \log S^k \bullet X^k \le \sqrt n \log \epsilon,
\]
i.e.,
\[
C \bullet X^k - b^T y^k = S^k \bullet X^k \le \epsilon.
\]
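3.4 Primal-Dual (Symmetric) Algorithm for SDP

Once we have a pair $(X, y, S) \in \mathring{\mathcal F}$ with $\mu = S \bullet X/n$, we can apply the primal-dual Newton method to generate a new iterate $X^+$ and $(y^+, S^+)$ as follows: Solve for $d_X$, $d_y$ and $d_S$ from the system of linear equations:
\[
\begin{aligned}
D^{-1} d_X D^{-1} + d_S &= R := \gamma\mu X^{-1} - S, \\
A d_X &= 0, \\
-A^T d_y - d_S &= 0,
\end{aligned} \tag{3.17}
\]
where
\[
D = X^{.5} (X^{.5} S X^{.5})^{-.5} X^{.5}.
\]
Note that $d_S \bullet d_X = 0$.

This system can be written as
\[
\begin{aligned}
d_{X'} + d_{S'} &= R', \\
A' d_{X'} &= 0, \\
-A'^T d_y - d_{S'} &= 0,
\end{aligned} \tag{3.18}
\]
where
\[
d_{X'} = D^{-.5} d_X D^{-.5}, \quad d_{S'} = D^{.5} d_S D^{.5}, \quad R' = D^{.5} (\gamma\mu X^{-1} - S) D^{.5},
\]
and
\[
A' = \begin{pmatrix} A'_1 \\ A'_2 \\ \vdots \\ A'_m \end{pmatrix} := \begin{pmatrix} D^{.5} A_1 D^{.5} \\ D^{.5} A_2 D^{.5} \\ \vdots \\ D^{.5} A_m D^{.5} \end{pmatrix}.
\]
Again, we have $d_{S'} \bullet d_{X'} = 0$, and, solving (3.18) with the signs it prescribes,
\[
d_y = -(A' A'^T)^{-1} A' R', \quad d_{S'} = -A'^T d_y, \quad \text{and} \quad d_{X'} = R' - d_{S'}.
\]
Then, assign
\[
d_S = -A^T d_y \quad \text{and} \quad d_X = D(R - d_S)D.
\]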
Let
\[
V^{1/2} = D^{-.5} X D^{-.5} = D^{.5} S D^{.5} \in \mathring{\mathcal M}^n_+.
\]
Then, we can verify that
\[
S \bullet X = I \bullet V.
\]
We now present the following lemma, whose proof is very similar to that of Lemma 3.6 for LP and will be omitted.

Lemma 3.13 Let the direction $d_X$, $d_y$ and $d_S$ be generated by equation (3.17) with $\gamma = n/(n+\rho)$, and let
\[
\theta = \frac{\alpha}{\|V^{-1/2}\|_\infty \left\| \frac{I \bullet V}{n+\rho} V^{-1/2} - V^{1/2} \right\|}, \tag{3.19}
\]
where $\alpha$ is a positive constant less than 1. Let
\[
X^+ = X + \theta d_X, \quad y^+ = y + \theta d_y, \quad \text{and} \quad S^+ = S + \theta d_S.
\]
Then, we have $(X^+, y^+, S^+) \in \mathring{\mathcal F}$ and
\[
\psi(X^+, S^+) - \psi(X, S) \le -\alpha \frac{\left\| V^{-1/2} - \frac{n+\rho}{I \bullet V} V^{1/2} \right\|}{\|V^{-1/2}\|_\infty} + \frac{\alpha^2}{2(1 - \alpha)}.
\]
Applying Lemma 3.7 to $v \in R^n$ as the vector of the $n$ eigenvalues of $V$, we can prove the following lemma:

Lemma 3.14 Let $V \in \mathring{\mathcal M}^n_+$ and $\rho \ge \sqrt n$. Then,
\[
\frac{\left\| V^{-1/2} - \frac{n+\rho}{I \bullet V} V^{1/2} \right\|}{\|V^{-1/2}\|_\infty} \ge \sqrt{3/4}.
\]
From these two lemmas we have
\[
\psi(X^+, S^+) - \psi(X, S) \le -\alpha \sqrt{3/4} + \frac{\alpha^2}{2(1 - \alpha)} = -\delta
\]
for a constant $\delta$. This leads to Algorithm 3.4.

Algorithm 3.4 Given $(X^0, y^0, S^0) \in \mathring{\mathcal F}$. Set $\rho = \sqrt n$ and $k := 0$.

While $S^k \bullet X^k \ge \epsilon$ do

1. Set $(X, S) = (X^k, S^k)$ and $\gamma = n/(n+\rho)$ and compute $(d_X, d_y, d_S)$ from (3.17).

2. Let $X^{k+1} = X^k + \bar\alpha d_X$, $y^{k+1} = y^k + \bar\alpha d_y$, and $S^{k+1} = S^k + \bar\alpha d_S$, where
\[
\bar\alpha = \arg\min_{\alpha \ge 0} \psi(X^k + \alpha d_X, S^k + \alpha d_S).
\]
3. Let $k := k + 1$ and return to Step 1.

Theorem 3.15 Let $\rho = \sqrt n$. Then, Algorithm 3.4 terminates in at most $O(\sqrt n \log(S^0 \bullet X^0/\epsilon))$ iterations with
\[
C \bullet X^k - b^T y^k \le \epsilon.
\]
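The SDP iteration can be sketched along the same lines as for LP. As a simplification, this sketch replaces the line search of Algorithm 3.4 with the fixed step size $\theta$ from (3.19), and it solves the scaled system (3.18) with signs chosen so that $-A'^T d_y - d_{S'} = 0$ and $A' d_{X'} = 0$ hold exactly. The helper names, $\alpha = 0.5$, the stopping tolerance, and the random problem data are assumptions for illustration, and NumPy is assumed to be available.

```python
import numpy as np

def sym_power(M, p):
    """M^p for a symmetric positive definite M via its eigendecomposition."""
    w, Q = np.linalg.eigh(M)
    return (Q * w**p) @ Q.T

def psi(X, S, rho):
    """(n + rho) log(S • X) - log det X - log det S."""
    n = X.shape[0]
    return ((n + rho) * np.log(np.sum(X * S))
            - np.linalg.slogdet(X)[1] - np.linalg.slogdet(S)[1])

def primal_dual_step(As, X, y, S, rho, alpha=0.5):
    """Direction from the scaled system (3.18), step size theta from (3.19)."""
    n = X.shape[0]
    Xh = sym_power(X, 0.5)
    D = Xh @ sym_power(Xh @ S @ Xh, -0.5) @ Xh
    Dh, Dmh = sym_power(D, 0.5), sym_power(D, -0.5)
    Vh = Dh @ S @ Dh                              # V^{1/2}
    Vmh = np.linalg.inv(Vh)                       # V^{-1/2}
    Rp = np.sum(X * S) / (n + rho) * Vmh - Vh     # R' with gamma = n/(n + rho)
    Ap = [Dh @ Ai @ Dh for Ai in As]              # A'_i
    M = np.array([[np.sum(Ai * Aj) for Aj in Ap] for Ai in Ap])
    dy = -np.linalg.solve(M, np.array([np.sum(Ai * Rp) for Ai in Ap]))
    dSp = -sum(d * Ai for d, Ai in zip(dy, Ap))   # d_{S'} = -A'^T d_y
    dXp = Rp - dSp
    dX, dS = Dh @ dXp @ Dh, Dmh @ dSp @ Dmh
    theta = alpha / (np.linalg.norm(Vmh, 2) * np.linalg.norm(Rp, 'fro'))
    return X + theta * dX, y + theta * dy, S + theta * dS

# Hypothetical instance with a known strictly feasible start:
# X^0 = I (so b_i = tr(A_i)), y^0 = 0 and S^0 = C positive definite.
rng = np.random.default_rng(1)
n, m = 3, 2
As = []
for _ in range(m):
    T = rng.standard_normal((n, n))
    As.append((T + T.T) / 2)
G = rng.standard_normal((n, n))
C = G @ G.T + np.eye(n)
b = np.array([np.trace(Ai) for Ai in As])
X, y, S = np.eye(n), np.zeros(m), C.copy()
rho, alpha, eps = np.sqrt(n), 0.5, 1e-4

drops = []
for _ in range(500):
    if np.sum(X * S) < eps:
        break
    before = psi(X, S, rho)
    X, y, S = primal_dual_step(As, X, y, S, rho, alpha)
    drops.append(before - psi(X, S, rho))
delta = alpha * np.sqrt(0.75) - alpha**2 / (2 * (1 - alpha))
print("iterations:", len(drops), "gap:", np.sum(X * S), "smallest drop:", min(drops), "delta:", delta)
```

Each recorded potential reduction should be at least $\delta$, in line with Lemmas 3.13 and 3.14, while the iterates stay strictly feasible.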
3.5 Dual Algorithm for SDP
An open question is how to exploit the sparsity structure by polynomial interior-point algorithms so that they can also solve large-scale problems in practice. In this section we try to respond to this question. We show that many large-scale semidefinite programs arising from combinatorial and quadratic optimization have features which make the dual-scaling interior-point algorithm the most suitable choice:

1. The computational cost of each iteration in the dual algorithm is less than the cost of the primal-dual iterations. Although primal-dual algorithms may possess superlinear convergence, the approximation problems under consideration require less accuracy than some other applications. Therefore, the superlinear convergence exhibited by primal-dual algorithms may not be utilized in our applications. The dual-scaling algorithm has been shown to perform equally well when only a lower precision answer is required.
2. In most combinatorial applications, we need only a lower bound for the optimal objective value of (SDP). Solving (SDD) alone would be sufficient to provide such a lower bound. Thus, we may not need to generate an $X$ at all. Even if an optimal primal solution is necessary, our dual-scaling algorithm can generate an optimal $X$ at the termination of the algorithm with little additional cost.
3. For large-scale problems, $S$ tends to be very sparse and structured since it is the linear combination of $C$ and the $A_i$'s. This sparsity allows considerable savings in both memory and computation time. The primal matrix, $X$, may be much less sparse and have a structure unknown beforehand. Consequently, primal and primal-dual algorithms may not fully exploit the sparseness and structure of the data.

These problems include the semidefinite relaxations of the graph-partition problem, the box-constrained quadratic optimization problem, the 0-1 integer set covering problem, etc.
The dual-scaling algorithm, which is a modification of the dual-scaling linear programming algorithm, reduces the Tanabe-Todd-Ye primal-dual potential function
\[
\Psi(X, S) = \rho \ln(X \bullet S) - \ln\det X - \ln\det S .
\]
The first term decreases the duality gap, while the second and third terms keep $X$ and $S$ in the interior of the positive semidefinite matrix cone. When $\rho > n$, the infimum of the potential function occurs at an optimal solution. Also note that, using the arithmetic-geometric mean inequality, we have
\[
n \ln(X \bullet S) - \ln\det X - \ln\det S \ge n \ln n .
\]
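The potential function and the arithmetic-geometric mean bound can be checked numerically. This is a small NumPy sketch; the function name `potential` and the random positive definite test matrices are assumptions made for illustration.

```python
import numpy as np

def potential(X, S, rho):
    """Psi(X, S) = rho * ln(X . S) - ln det X - ln det S for positive definite X, S."""
    _, logdet_X = np.linalg.slogdet(X)
    _, logdet_S = np.linalg.slogdet(S)
    # For symmetric matrices, X . S = trace(X S) = sum of elementwise products.
    return rho * np.log(np.sum(X * S)) - logdet_X - logdet_S

rng = np.random.default_rng(0)
n = 5
B1, B2 = rng.standard_normal((n, n)), rng.standard_normal((n, n))
X = B1 @ B1.T + np.eye(n)
S = B2 @ B2.T + np.eye(n)

gap_to_bound = potential(X, S, n) - n * np.log(n)            # nonnegative by AM-GM
tight_case = potential(X, np.linalg.inv(X), n) - n * np.log(n)  # zero when XS is a multiple of I
```

Equality in the bound holds exactly when $XS$ is a multiple of the identity, which is the situation on the central path.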
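Let operator $\mathcal{A}(X) : \mathcal{S}^n \to$ [...]
\[
\frac{\rho\mu}{n\mu^k} > \left(1 + \frac{1}{\sqrt{n}}\right)\left(1 - \frac{\alpha}{2\sqrt{n}}\right) \ge 1,
\]
which leads to
\[
\begin{aligned}
\|P(\bar z^k)\|^2 &\ge \left(\frac{\rho\mu}{n\mu^k} - 1\right)^2 n\\
&\ge \left(\left(1 + \frac{1}{\sqrt{n}}\right)\left(1 - \frac{\alpha}{2\sqrt{n}}\right) - 1\right)^2 n\\
&= \left(1 - \frac{\alpha}{2} - \frac{\alpha}{2\sqrt{n}}\right)^2\\
&\ge (1-\alpha)^2 .
\end{aligned}
\]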
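Focusing on the expression $P(\bar z^k)$, it can be rewritten as
\[
\begin{aligned}
P(\bar z^k) &= \frac{\rho}{\Delta^k}(S^k)^{.5}\left(\frac{\Delta^k}{\rho}(S^k)^{-1}\left(\mathcal{A}^T(d(\bar z^k)_y) + S^k\right)(S^k)^{-1}\right)(S^k)^{.5} - I\\
&= (S^k)^{-.5}\,\mathcal{A}^T\!\left(d(\bar z^k)_y\right)(S^k)^{-.5}\\
&= (S^k)^{-.5}\,\mathcal{A}^T\!\left(\frac{y^{k+1} - y^k}{\beta}\right)(S^k)^{-.5},
\end{aligned}
\]
which by (3.22), makes
\[
\nabla\psi^T(y^k, \bar z^k)\, d(\bar z^k)_y = -\|P(\bar z^k)\|^2
\tag{3.31}
\]
and
\[
\nabla\psi^T(y^k, \bar z^k)\,(y^{k+1} - y^k) = -\alpha\|P(\bar z^k)\| .
\tag{3.32}
\]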
Updating the dual variables according to
\[
y^{k+1} = y^k + \frac{\alpha}{\|P(\bar z^{k+1})\|}\, d(\bar z^{k+1})_y \qquad \text{and} \qquad S^{k+1} = C - \mathcal{A}^T(y^{k+1}),
\tag{3.33}
\]
assures the positive definiteness of $S^{k+1}$ when $\alpha < 1$, which assures that they are feasible. Using (3.32) and (3.21), the reduction in the potential function satisfies the inequality
\[
\psi(y^{k+1}, \bar z^k) - \psi(y^k, \bar z^k) \le -\alpha\|P(\bar z^k)\| + \frac{\alpha^2}{2(1-\alpha)} .
\tag{3.34}
\]
The theoretical algorithm can be stated as follows.

DUAL ALGORITHM. Given an upper bound $\bar z^0$ and a dual point $(y^0, S^0)$ such that $S^0 = C - \mathcal{A}^T y^0 \succ 0$, set $k = 0$, $\rho > n + \sqrt{n}$, $\alpha \in (0, 1)$, and do the following:

while $\bar z^k - b^T y^k \ge \epsilon$ do
begin
1. Compute $\mathcal{A}((S^k)^{-1})$ and the Gram matrix $M^k$ (3.26) using Algorithm M or M'.
2. Solve (3.25) for the dual step direction $d(\bar z^k)_y$.
3. Calculate $\|P(\bar z^k)\|$ using (3.31).
4. If (3.30) is true, then $X^{k+1} = X(\bar z^k)$, $\bar z^{k+1} = C \bullet X^{k+1}$, and $(y^{k+1}, S^{k+1}) = (y^k, S^k)$;
   else $y^{k+1} = y^k + \frac{\alpha}{\|P(\bar z^k)\|}\, d(\bar z^{k+1})_y$, $S^{k+1} = C - \mathcal{A}^T(y^{k+1})$, $X^{k+1} = X^k$, and $\bar z^{k+1} = \bar z^k$.
   endif
5. $k := k + 1$.
end

We can derive the following potential reduction theorem based on the above lemma:

Theorem 3.17 $\Psi(X^{k+1}, S^{k+1}) \le \Psi(X^k, S^k) - \delta$ where $\delta > 1/50$ for a suitable $\alpha$.

Proof.
\[
\Psi(X^{k+1}, S^{k+1}) - \Psi(X^k, S^k) = \left(\Psi(X^{k+1}, S^{k+1}) - \Psi(X^{k+1}, S^k)\right) + \left(\Psi(X^{k+1}, S^k) - \Psi(X^k, S^k)\right).
\]
In each iteration, one of the differences is zero. If $\|P(\bar z^k)\|$ does not satisfy (3.30), the dual variables get updated and (3.34) shows sufficient improvement in the potential function when $\alpha = 0.4$. On the other hand, if the primal matrix gets updated, then using Lemma 3.2 and the first two parts of Lemma 3.16,
\[
\begin{aligned}
& n \ln\left(X^{k+1} \bullet S^k\right) - \ln\det X^{k+1} - \ln\det S^k\\
&\quad= n \ln\left(X^{k+1} \bullet S^k\right) - \ln\det\left(X^{k+1} S^k\right)\\
&\quad= n \ln\left(X^{k+1} \bullet S^k/\mu\right) - \ln\det\left(X^{k+1} S^k/\mu\right)\\
&\quad= n \ln n - \ln\det\left((S^k)^{.5} X^{k+1} (S^k)^{.5}/\mu\right)\\
&\quad\le n \ln n + \frac{\|(S^k)^{.5} X^{k+1} (S^k)^{.5}/\mu - I\|^2}{2\left(1 - \|(S^k)^{.5} X^{k+1} (S^k)^{.5}/\mu - I\|_\infty\right)}\\
&\quad\le n \ln n + \frac{\alpha^2}{2(1-\alpha)}\\
&\quad\le n \ln\left(X^k \bullet S^k\right) - \ln\det X^k - \ln\det S^k + \frac{\alpha^2}{2(1-\alpha)} .
\end{aligned}
\]
Additionally, by the third part of Lemma 3.16
\[
\sqrt{n}\left(\ln(X^{k+1} \bullet S^k) - \ln(X^k \bullet S^k)\right) = \sqrt{n} \ln\frac{\mu}{\mu^k} \le -\frac{\alpha}{2} .
\]
Adding the two inequalities gives
\[
\Psi(X^{k+1}, S^k) \le \Psi(X^k, S^k) - \frac{\alpha}{2} + \frac{\alpha^2}{2(1-\alpha)} .
\]
By choosing $\alpha = 0.4$ again, we have the desired result.

This theorem leads to

Corollary 3.18 Let $\rho \ge n + \sqrt{n}$ and $\Psi(X^0, S^0) \le (\rho - n) \ln(X^0 \bullet S^0)$. Then, the algorithm terminates in at most $O((\rho - n) \ln(X^0 \bullet S^0/\epsilon))$ iterations.

Proof. In $O((\rho - n) \ln(X^0 \bullet S^0/\epsilon))$ iterations, $\Psi(X^k, S^k) \le (\rho - n) \ln(\epsilon)$. Also,
\[
(\rho - n) \ln(C \bullet X^k - b^T y^k) = (\rho - n) \ln(X^k \bullet S^k) \le \Psi(X^k, S^k) - n \ln n \le \Psi(X^k, S^k).
\]
Combining the two inequalities, $C \bullet X^k - b^T y^k = X^k \bullet S^k < \epsilon$.
Again, from (3.28) we see that the algorithm can generate an $X^k$ as a by-product. However, it is not needed in generating the iterate direction, and it is only explicitly used for proving convergence and complexity.

Theorem 3.19 Each iteration of the dual algorithm uses $O(m^3 + nm^2 + n^2 m + n^3)$ floating point operations.

Proof. Creating $S$, or $S + \mathcal{A}^T(d(\bar z^k))$, uses matrix additions and $O(mn^2)$ operations; factoring it uses $O(n^3)$ operations. Creating the Gram matrix uses $nm^2 + 2n^2 m + O(nm)$ operations, and solving the system of equations uses $O(m^3)$ operations. Dot products for $\bar z^{k+1}$ and $\|P(\bar z^k)\|$, and the calculation of $y^{k+1}$ use only $O(m)$ operations. These give the desired result.
3.6 Notes

The primal potential reduction algorithm for positive semidefinite programming is due to Alizadeh [9, 8], in which Ye has "suggested studying the primal-dual potential function for this problem" and "looking at symmetric preserving scalings of the form $X_0^{-1/2} X X_0^{-1/2}$," and to Nesterov and Nemirovskii [241], and the primal-dual algorithm described here is due to Nesterov and Todd [242, 243]. One can also develop a dual potential reduction algorithm. In general, consider
\[
\begin{array}{lll}
(PSP) & \inf & C \bullet X\\
& \text{s.t.} & \mathcal{A} \bullet X = b,\ X \in K,
\end{array}
\]
and its dual
\[
\begin{array}{lll}
(PSD) & \sup & b^T y\\
& \text{s.t.} & \mathcal{A}^* \bullet Y + S = C,\ S \in K,
\end{array}
\]
where $K$ is a convex homogeneous cone. Interior-point algorithms compute a search direction $(d_X, d_Y, d_S)$ and a new strictly feasible primal-dual pair $X^+$ and $(Y^+; S^+)$ is generated from
\[
X^+ = X + \alpha d_X, \qquad Y^+ = Y + \beta d_Y, \qquad S^+ = S + \beta d_S,
\]
for some step-sizes $\alpha$ and $\beta$. The search direction $(d_X, d_Y, d_S)$ is determined by the following equations:
\[
\mathcal{A} \bullet d_X = 0, \qquad d_S = -\mathcal{A}^* \bullet d_Y \qquad \text{(feasibility)}
\tag{3.35}
\]
and
\[
d_X + F''(S) d_S = -\frac{n+\rho}{X \bullet S} X - F'(S) \qquad \text{(dual scaling)},
\tag{3.36}
\]
or
\[
d_S + F''(X) d_X = -\frac{n+\rho}{X \bullet S} S - F'(X) \qquad \text{(primal scaling)},
\tag{3.37}
\]
or
\[
d_S + F''(Z) d_X = -\frac{n+\rho}{X \bullet S} S - F'(X) \qquad \text{(joint scaling)},
\tag{3.38}
\]
where $Z$ is chosen to satisfy
\[
S = F''(Z) X .
\tag{3.39}
\]
The differences among the three algorithms are the computation of the search direction and their theoretical close-form step-sizes. All three generate an $\epsilon$-optimal solution $(X, Y, S)$, i.e., $X \bullet S \le \epsilon$, in a guaranteed polynomial time.

Other primal-dual algorithms for positive semidefinite programming are in Alizadeh, Haeberly and Overton [10, 12], Boyd, Ghaoui, Feron and Balakrishnan [56], Helmberg, Rendl, Vanderbei and Wolkowicz [156], Jarre [170], de Klerk, Roos and Terlaky [185], Kojima, Shindoh and Hara [190], Monteiro and Zhang [228], Nesterov, Todd and Ye [244], Potra and Sheng [257], Shida, Shindoh and Kojima [276], Sturm and Zhang [285], Tseng [305], Vandenberghe and Boyd [313, 314], and references therein. Efficient interior-point algorithms are also developed for optimization over the second-order cone; see Andersen and Christiansen [17], Lobo, Vandenberghe and Boyd [204], and Xue and Ye [326]. These algorithms have established the best approximation complexity results for some combinatorial problems.

Primal-dual adaptive path-following algorithms, the predictor-corrector algorithms and the wide-neighborhood algorithms can also be developed for solving (SDP).
3.7 Exercises
3.1 Prove a slightly stronger variant of Lemma 3.1: If $D \in \mathcal{M}^n$ such that $0 \le \|D\|_\infty < 1$ then
\[
-\mathrm{Trace}(D) \ge \log\det(I - D) \ge -\mathrm{Trace}(D) - \frac{\|D\|^2}{2(1 - \|D\|_\infty)} .
\]

3.2 Given $S \succ 0$, find the minimizer of the least-squares problem
\[
\begin{array}{ll}
\text{minimize} & \|S^{1/2} X S^{1/2} - I\|\\
\text{s.t.} & \mathcal{A}X = 0.
\end{array}
\]
Given $X \succ 0$ find the minimizer of the least-squares problem
\[
\begin{array}{ll}
\text{minimize} & \|X^{1/2} S X^{1/2} - I\|\\
\text{s.t.} & S = C - \mathcal{A}^T y.
\end{array}
\]

3.3 Let $v \in R^n$ be a positive vector and $\rho \ge \sqrt{n}$. Prove
\[
\sqrt{\min(v)}\,\left\|V^{-1/2}\left(e - \frac{n+\rho}{e^T v}\, v\right)\right\| \ge \sqrt{3/4} .
\]

3.4 Prove the following convex quadratic inequality
\[
(Ay + b)^T(Ay + b) - c^T y - d \le 0
\]
is equivalent to a matrix inequality
\[
\begin{pmatrix} I & Ay + b\\ (Ay + b)^T & c^T y + d \end{pmatrix} \succeq 0 .
\]
Use this relation to formulate a convex quadratic minimization problem with convex quadratic inequalities as an (SDD) problem.

3.5 Prove Corollary 3.9.

3.6 Prove Lemma 3.13.

3.7 Describe and analyze a dual potential algorithm for positive semidefinite programming in the standard form.
Chapter 4
SDP for Combinatorial Optimization

4.1 Approximation
A (randomized) algorithm for a maximization problem is called a (randomized) $r$-approximation algorithm, where $0 < r \le 1$, if it outputs a feasible solution with its (expected) value at least $r$ times the optimum value for all instances of the problem. More precisely, let $w^*$ ($> 0$) be the (global) maximum value of a given problem instance. Then, an $r$-approximate maximizer $x$ satisfies
\[
w(x) \ge r \cdot w^* \qquad \text{or} \qquad E[w(x)] \ge r \cdot w^* .
\]
A (randomized) algorithm for a minimization problem is called a (randomized) $r$-approximation algorithm, where $1 \le r$, if it outputs a feasible solution with its (expected) value at most $r$ times the optimum value for all instances of the problem. More precisely, let $w^*$ ($> 0$) be the (global) minimal value of a given problem instance. Then, an $r$-approximate minimizer $x$ satisfies
\[
w(x) \le r \cdot w^* \qquad \text{or} \qquad E[w(x)] \le r \cdot w^* .
\]
4.2 Ball-Constrained Quadratic Minimization

\[
\begin{array}{lll}
\text{(BQP)} & z^* := \text{Minimize} & x^T Q x + 2 q^T x\\
& \text{Subject to} & \|x\|^2 \le 1.
\end{array}
\tag{4.1}
\]
Here, the given matrix $Q \in \mathcal{M}^n$, the set of $n$-dimensional symmetric matrices; vector $q \in R^n$; and $\|\cdot\|$ is the Euclidean norm.
4.2.1 Homogeneous Case: q = 0

Matrix-formulation: Let $X = xx^T$.
\[
\begin{array}{lll}
\text{(BQP)} & z^* = \text{Minimize} & Q \bullet X\\
& \text{Subject to} & I \bullet X \le 1,\ X \succeq 0,\ \mathrm{Rank}(X) = 1.
\end{array}
\]
SDP-relaxation: Remove the rank-one constraint.
\[
\begin{array}{lll}
\text{(SDP)} & z^{SDP} := \text{Minimize} & Q \bullet X\\
& \text{Subject to} & I \bullet X \le 1,\ X \succeq 0.
\end{array}
\]
The dual of (SDP) can be written as:
\[
\begin{array}{lll}
\text{(DSDP)} & z^{SDP} = \text{Maximize} & y\\
& \text{Subject to} & yI + S = Q,\ y \le 0,\ S \succeq 0.
\end{array}
\]
$X^*$ is a minimal matrix solution to SDP if and only if there exists a feasible dual variable $y^* \le 0$ such that
\[
S^* = Q - y^* I \succeq 0, \qquad y^*(1 - I \bullet X^*) = 0, \qquad S^* \bullet X^* = 0.
\]
Observation: $z^{SDP} \le z^*$.

Theorem 4.1 The SDP relaxation is exact, meaning $z^{SDP} \ge z^*$. Moreover, let a decomposition of $X^*$ be
\[
X^* = \sum_{j=1}^{r} x_j x_j^T,
\]
where $r$ is the rank of $X^*$. Then, for any $j$, $x_j^* = (I \bullet X^*) \cdot x_j/\|x_j\|$ is a solution to (BQP).
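For the homogeneous case the optimality conditions above can be verified directly from an eigendecomposition of $Q$. The following NumPy sketch builds a primal-dual pair from the smallest eigenpair; the function name `solve_homogeneous_bqp` and the random test matrix are assumptions for illustration, and the closed form used here (optimal value equal to the smaller of $0$ and the least eigenvalue of $Q$) is a standard fact about this problem rather than something stated in the text.

```python
import numpy as np

def solve_homogeneous_bqp(Q):
    """Minimize x^T Q x over the unit ball using the smallest eigenpair of Q."""
    n = Q.shape[0]
    w, U = np.linalg.eigh(Q)              # eigenvalues in ascending order
    if w[0] < 0:
        y, x = w[0], U[:, 0]              # optimum on the boundary, ||x|| = 1
    else:
        y, x = 0.0, np.zeros(n)           # Q is positive semidefinite, x = 0 is optimal
    X = np.outer(x, x)                    # rank-one solution of the relaxation
    S = Q - y * np.eye(n)                 # dual slack matrix of (DSDP)
    return x, X, y, S

rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5))
Q = (B + B.T) / 2
x, X, y, S = solve_homogeneous_bqp(Q)
```

Because the primal value $Q \bullet X$ equals the dual value $y$, the rank-one matrix $X$ is optimal for the relaxation, which illustrates the exactness claimed in Theorem 4.1.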
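4.2.2 Non-Homogeneous Case

Matrix-formulation: Let
\[
X = (1; x)(1; x)^T \in \mathcal{M}^{n+1}, \qquad
Q' = \begin{pmatrix} 0 & q^T\\ q & Q \end{pmatrix}, \qquad
I' = \begin{pmatrix} 0 & 0^T\\ 0 & I \end{pmatrix}, \qquad \text{and} \qquad
I_1 = \begin{pmatrix} 1 & 0^T\\ 0 & 0 \end{pmatrix}.
\]
\[
\begin{array}{lll}
\text{(BQP)} & z^* = \text{Minimize} & Q' \bullet X\\
& \text{Subject to} & I' \bullet X \le 1,\ I_1 \bullet X = 1,\ X \succeq 0,\ \mathrm{Rank}(X) = 1.
\end{array}
\]
SDP-relaxation: Remove the rank-one constraint.
\[
\begin{array}{lll}
\text{(SDP)} & z^{SDP} := \text{Minimize} & Q' \bullet X\\
& \text{Subject to} & I' \bullet X \le 1,\ I_1 \bullet X = 1,\ X \succeq 0.
\end{array}
\]
The dual of (SDP) can be written as:
\[
\begin{array}{lll}
\text{(DSDP)} & z^{SDP} = \text{Maximize} & y_1 + y_2\\
& \text{Subject to} & y_1 I' + y_2 I_1 + S = Q',\ y_1 \le 0,\ S \succeq 0.
\end{array}
\]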
$X^*$ is a minimal matrix solution to SDP if and only if there exist feasible dual variables $(y_1^* \le 0, y_2^*)$ such that
\[
S^* = Q' - y_1^* I' - y_2^* I_1 \succeq 0, \qquad y_1^*(1 - I' \bullet X^*) = 0, \qquad S^* \bullet X^* = 0.
\]
Observation: $z^{SDP} \le z^*$.

Lemma 4.2 Let $X$ be a positive semidefinite matrix of rank $r$, $A$ be a given symmetric matrix. Then, there is a decomposition of $X$
\[
X = \sum_{j=1}^{r} x_j x_j^T,
\]
such that for all $j$,
\[
x_j^T A x_j = A \bullet (x_j x_j^T) = A \bullet X/r .
\]
Proof. We prove for the case $A \bullet X = 0$. Let
\[
X = \sum_{j=1}^{r} p_j p_j^T
\]
be any decomposition of $X$. Assume that
\[
p_1^T A p_1 < 0 \qquad \text{and} \qquad p_2^T A p_2 > 0.
\]
Now let
\[
x_1 = (p_1 + \gamma p_2)/\sqrt{1 + \gamma^2} \qquad \text{and} \qquad x_2 = (p_2 - \gamma p_1)/\sqrt{1 + \gamma^2},
\]
where $\gamma$ makes
\[
(p_1 + \gamma p_2)^T A (p_1 + \gamma p_2) = 0.
\]
We have
\[
p_1 p_1^T + p_2 p_2^T = x_1 x_1^T + x_2 x_2^T
\]
and
\[
A \bullet (X - x_1 x_1^T) = 0.
\]
Note that $X - x_1 x_1^T$ is still positive semidefinite and its rank is $r - 1$.

Theorem 4.3 The SDP relaxation is exact, meaning $z^{SDP} \ge z^*$. Moreover, there is a decomposition of $X^*$,
\[
X^* = \sum_{j=1}^{r} x_j x_j^T,
\]
where $r$ is the rank of $X^*$, such that for any $j$, $x_j^* = x_j/(x_j^0)$ is a solution to (BQP).
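The pairwise rotation used in the proof of Lemma 4.2 can be turned into a short procedure. The NumPy sketch below handles the general target value $A \bullet X/r$ by choosing $\gamma$ so that the rotated vector attains that value (for $A \bullet X = 0$ this reduces to the equation for $\gamma$ in the proof). The function name `equal_value_decomposition`, the tolerance, and the random test data are assumptions for illustration.

```python
import numpy as np

def equal_value_decomposition(X, A, tol=1e-10):
    """Columns x_j with X = sum_j x_j x_j^T and x_j^T A x_j = (A . X) / r for all j."""
    w, Q = np.linalg.eigh(X)
    keep = w > tol * max(w.max(), 1.0)
    P = Q[:, keep] * np.sqrt(w[keep])          # X = P P^T, one column per p_j
    r = P.shape[1]
    c = np.sum(A * X) / r                      # common target value A . X / r
    for _ in range(r):                         # each pass fixes at least one column
        dev = np.einsum('ij,ik,kj->j', P, A, P) - c   # p_j^T A p_j - c
        i, j = np.argmin(dev), np.argmax(dev)
        if dev[j] - dev[i] <= tol * max(1.0, abs(c)):
            break
        a1, a2, a12 = dev[i], dev[j], P[:, i] @ A @ P[:, j]
        # gamma solves a2*g^2 + 2*a12*g + a1 = 0; real roots exist since a1 < 0 < a2.
        g = (-a12 + np.sqrt(a12 ** 2 - a1 * a2)) / a2
        p1, p2 = P[:, i].copy(), P[:, j].copy()
        P[:, i] = (p1 + g * p2) / np.sqrt(1 + g ** 2)   # now attains the value c
        P[:, j] = (p2 - g * p1) / np.sqrt(1 + g ** 2)
    return P

rng = np.random.default_rng(2)
B = rng.standard_normal((5, 3))
X = B @ B.T                                    # positive semidefinite, rank 3
C0 = rng.standard_normal((5, 5))
A = (C0 + C0.T) / 2
P = equal_value_decomposition(X, A)
values = np.array([p @ A @ p for p in P.T])
```

Each rotation leaves $p_1 p_1^T + p_2 p_2^T$ unchanged, so the sum of the outer products stays equal to $X$ throughout.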
4.3 Homogeneous Quadratic Minimization

\[
\begin{array}{lll}
\text{(QP2)} & z^* := \text{Minimize} & x^T Q x\\
& \text{Subject to} & x^T A_1 x \le 1,\ x^T A_2 x \le 1.
\end{array}
\tag{4.2}
\]
Matrix-formulation: Let $X = xx^T$.
\[
\begin{array}{lll}
\text{(QP2)} & z^* = \text{Minimize} & Q \bullet X\\
& \text{Subject to} & A_1 \bullet X \le 1,\ A_2 \bullet X \le 1,\ X \succeq 0,\ \mathrm{Rank}(X) = 1.
\end{array}
\]
SDP-relaxation: Remove the rank-one constraint.
\[
\begin{array}{lll}
\text{(SDP)} & z^{SDP} := \text{Minimize} & Q \bullet X\\
& \text{Subject to} & A_1 \bullet X \le 1,\ A_2 \bullet X \le 1,\ X \succeq 0.
\end{array}
\]
The dual of (SDP) can be written as:
\[
\begin{array}{lll}
\text{(DSDP)} & z^{SDP} = \text{Maximize} & y_1 + y_2\\
& \text{Subject to} & y_1 A_1 + y_2 A_2 + S = Q,\ y_1, y_2 \le 0,\ S \succeq 0.
\end{array}
\]
Assume that there is no gap between the primal and dual. Then, $X^*$ is a minimal matrix solution to SDP if and only if there exist feasible dual variables $(y_1^*, y_2^*) \le 0$ such that
\[
S^* = Q - y_1^* A_1 - y_2^* A_2 \succeq 0, \qquad y_1^*(1 - A_1 \bullet X^*) = 0, \qquad y_2^*(1 - A_2 \bullet X^*) = 0, \qquad S^* \bullet X^* = 0.
\]
Observation: $z^{SDP} \le z^*$.

Theorem 4.4 The SDP relaxation is exact, meaning $z^{SDP} \ge z^*$. Moreover, there is a decomposition of $X^*$,
\[
X^* = \sum_{j=1}^{r} x_j x_j^T,
\]
where $r$ is the rank of $X^*$, such that for any $j$, $x_j^* = \alpha x_j$ for some $\alpha$ is a solution to (QP2).
4.4 Max-Cut Problem

Consider the Max-Cut problem on an undirected graph $G = (V, E)$ with non-negative weights $w_{ij}$ for each edge in $E$ (and $w_{ij} = 0$ if $(i, j) \notin E$), which is the problem of partitioning the nodes of $V$ into two sets $S$ and $V \setminus S$ so that
\[
w(S) := \sum_{i \in S,\ j \in V \setminus S} w_{ij}
\]
is maximized. A problem of this type arises from many network planning, circuit design, and scheduling applications. This problem can be formulated by assigning each node a binary variable $x_j$:
\[
\begin{array}{lll}
\text{(MC)} & z^* = \text{Maximize} & w(x) := \frac{1}{4}\sum_{i,j} w_{ij}(1 - x_i x_j)\\
& \text{Subject to} & x_i^2 = 1,\ i = 1, ..., n.
\end{array}
\]
The Coin-Toss Method: Let each node be selected to one side, or $x_i$ be 1, independently with probability .5. Then, swap nodes from the majority side to the minority side using the greedy method.
\[
E[w(x)] \ge 0.5 \cdot z^* .
\]
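The formulation (MC) and the coin-toss bound can be checked on a tiny graph by enumerating all sign vectors. In this sketch the function name `cut_weight` and the 4-node weight matrix are hypothetical, chosen only for illustration; the greedy swapping step is not modeled.

```python
import itertools
import numpy as np

def cut_weight(W, x):
    """w(x) = (1/4) * sum_{i,j} w_ij * (1 - x_i * x_j) for x in {-1, +1}^n."""
    x = np.asarray(x, dtype=float)
    return 0.25 * np.sum(W * (1.0 - np.outer(x, x)))

# Hypothetical symmetric weight matrix with zero diagonal.
W = np.array([[0, 2, 1, 0],
              [2, 0, 3, 1],
              [1, 3, 0, 4],
              [0, 1, 4, 0]], dtype=float)

total_weight = np.sum(np.triu(W, 1))                 # sum over edges i < j
values = [cut_weight(W, x) for x in itertools.product([-1, 1], repeat=4)]
coin_toss_mean = np.mean(values)                     # exact expectation of a fair coin toss
best_cut = max(values)                               # z* by brute force
```

Each edge is cut with probability one half under independent fair coin tosses, so the expected cut is half the total weight, which is at least half of $z^*$.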
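4.4.1 Semidefinite relaxation

\[
\begin{array}{lll}
z^{SDP} := & \text{minimize} & Q \bullet X\\
& \text{s.t.} & I_j \bullet X = 1,\ j = 1, ..., n,\\
& & X \succeq 0.
\end{array}
\tag{4.3}
\]
The dual is
\[
\begin{array}{lll}
z^{SDP} = & \text{maximize} & e^T y\\
& \text{s.t.} & Q \succeq D(y).
\end{array}
\tag{4.4}
\]
Let $V = (v_1, \ldots, v_n) \in R^{n \times n}$, i.e., $v_j$ is the $j$th column of $V$, such that $X^* = V^T V$. Generate a random vector $u \in N(0, I)$:
\[
\hat x = \mathrm{sign}(V^T u), \qquad
\mathrm{sign}(x_j) = \begin{cases} 1 & \text{if } x_j \ge 0\\ -1 & \text{otherwise.} \end{cases}
\tag{4.5}
\]

4.4.2 Approximation analysis

Then, one can prove from Sheppard [274] (see Goemans and Williamson [125] and Bertsimas and Ye [50]):
\[
E[\hat x_i \hat x_j] = \frac{2}{\pi}\arcsin(\bar X_{ij}), \qquad i, j = 1, 2, \ldots, n.
\]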
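The rounding rule (4.5) and the arcsin identity can be illustrated with a Monte Carlo experiment. This NumPy sketch uses a hypothetical matrix with unit diagonal built from random unit vectors; the function name `hyperplane_round`, the sample count, and the seed are assumptions for illustration.

```python
import numpy as np

def hyperplane_round(X, rng, samples=1):
    """Draw x_hat = sign(V^T u) with u ~ N(0, I), where X = V^T V."""
    w, Q = np.linalg.eigh(X)
    V = np.sqrt(np.clip(w, 0.0, None))[:, None] * Q.T   # V^T V = Q diag(w) Q^T = X
    U = rng.standard_normal((samples, X.shape[0]))      # each row is one draw of u
    return np.where(U @ V >= 0, 1, -1)                  # row k equals sign(V^T u_k)

rng = np.random.default_rng(0)
G = rng.standard_normal((3, 4))
G /= np.linalg.norm(G, axis=0)        # unit columns, so the diagonal of X is 1
X = G.T @ G

draws = hyperplane_round(X, rng, samples=200_000)
empirical = draws.T @ draws / draws.shape[0]            # estimate of E[x_i x_j]
theory = (2 / np.pi) * np.arcsin(np.clip(X, -1.0, 1.0))
max_deviation = np.abs(empirical - theory).max()
```

The deviation shrinks at the usual Monte Carlo rate of one over the square root of the number of samples.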
Figure 4.1: Illustration of the product $\sigma\!\left(\frac{v_i^T u}{\|v_i\|}\right) \cdot \sigma\!\left(\frac{v_j^T u}{\|v_j\|}\right)$ on the 2-dimensional unit circle. As the unit vector $u$ is uniformly generated along the circle, the product is either 1 or $-1$.
Lemma 4.5 For $x \in [-1, 1)$
\[
\frac{1 - (2/\pi) \cdot \arcsin(x)}{1 - x} \ge .878 .
\]

Lemma 4.6 Let $X \succeq 0$ and $d(X) \le 1$. Then $\arcsin[X] \succeq X$.

Theorem 4.7 We have

i) If $Q$ is a Laplacian matrix, then
\[
E(\hat x^T Q \hat x) \ge .878\, z^{SDP} \ge .878\, z^*,
\]
so that $z^* \ge .878\, z^{SDP}$.

ii) If $Q$ is positive semidefinite
\[
E(\hat x^T Q \hat x) \ge \frac{2}{\pi} z^{SDP} \ge \frac{2}{\pi} z^*,
\]
so that $z^* \ge \frac{2}{\pi} z^{SDP}$.

4.5 Multiple Ellipsoid-Constrained Quadratic Maximization

\[
\begin{array}{lll}
(HQP) & z^* := \text{maximize} & x^T Q_0 x\\
& \text{subject to} & x^T Q_i x \le 1,\ i = 1, ..., m.
\end{array}
\]
The previous best approximation ratio, when $Q_i \succeq 0$ for all $i$, was
\[
\frac{1}{\min(m^2, n^2)},
\]
that is, we can compute a feasible solution $x$, in polynomial time, such that
\[
x^T Q_0 x \ge \frac{1}{\min(m^2, n^2)} \cdot z^* .
\]

4.5.1 SDP relaxation

\[
\begin{array}{lll}
(SDP) & z^{SDP} := \text{maximize} & Q_0 \bullet X\\
& \text{subject to} & Q_i \bullet X \le 1,\ i = 1, ..., m\\
& & X \succeq 0,
\end{array}
\]
and its dual
\[
\begin{array}{lll}
(SDD) & z^{SDD} := \text{minimize} & \sum_{i=1}^{m} y_i\\
& \text{subject to} & Z = \sum_{i=1}^{m} y_i Q_i - Q_0\\
& & Z \succeq 0,\ y_i \ge 0,\ i = 1, ..., m.
\end{array}
\]
We assume that (SDD) is feasible. Then (SDP) and (SDD) have no duality gap, that is, $z^{SDP} = z^{SDD}$. Moreover, $z^* = z^{SDP}$ if and only if (SDP) has a feasible rank-one matrix solution that meets the complementarity condition.
4.5.2 Bound on Rank

Lemma 4.8 Let $X^*$ be a maximizer of (SDP). Then we can compute another maximizer of (SDP) whose rank is no more than $m$ by solving a linear program.

Suppose the rank of initial $X^*$ is $r$. Then, there exist column vectors $x_j$, $j = 1, ..., r$, such that
\[
X^* = \sum_{j=1}^{r} x_j x_j^T .
\]
Let
\[
a_{ij} = Q_i \bullet x_j x_j^T = x_j^T Q_i x_j .
\]
Then consider the linear program
\[
\begin{array}{ll}
\text{maximize} & \sum_{j=1}^{r} a_{0j} v_j\\
\text{subject to} & \sum_{j=1}^{r} a_{ij} v_j \le 1,\ i = 1, ..., m\\
& v_j \ge 0,\ j = 1, ..., r.
\end{array}
\]
Note that $v_1 = ... = v_r = 1$ is an optimal solution for the problem. But we should have at least $r$ inequalities active at an optimal basic solution. Thus, at most $m$ of the $v_j$ are positive at an optimal basic solution.

Let $X^*$ be an SDP maximizer whose rank $r \le \min(m, n)$. Then, there exist column vectors $x_j$, $j = 1, ..., r$, such that
\[
X^* = \sum_{j=1}^{r} x_j x_j^T .
\]
Note that for each $i \ge 1$ and $j$
\[
x_j^T Q_i x_j = Q_i \bullet x_j x_j^T \le Q_i \bullet X^* \le 1.
\]
Thus, each $x_j$ is a feasible solution of (HQP). Then, we can select an $x_j$ such that $x_j^T Q_0 x_j$ is the largest, and have
\[
\max_{j=1,...,r} x_j^T Q_0 x_j \ge \frac{1}{r}\sum_{j=1}^{r} x_j^T Q_0 x_j = \frac{1}{r} \cdot Q_0 \bullet X^* = \frac{1}{r} \cdot z^{SDP} \ge \frac{1}{r} \cdot z^* \ge \frac{1}{\min(m, n)} \cdot z^* .
\]
Thus, we have

Theorem 4.9 For solving HQP-$m$, where $Q_i \succeq 0$ for $i = 1, ..., m$, we have approximation ratio
\[
\frac{1}{\min(m, n)} .
\]
4.5.3 Improvements

Lemma 4.10 Let $X^*$ be a maximizer of (SDP). Then we can compute another maximizer of (SDP) whose rank is no more than $m - 1$ by solving a linear program.

Thus, we have

Theorem 4.11 For solving HQP-$m$, where $Q_i \succeq 0$ for $i = 1, ..., m$, we have approximation ratio
\[
\frac{1}{\min(m - 1, n)} .
\]
For large $m$, consider SDP in the standard form:
\[
\begin{array}{lll}
z^* := & \text{Minimize} & C \bullet X\\
& \text{Subject to} & A_i \bullet X = b_i,\ i = 1, ..., m\\
& & X \succeq 0.
\end{array}
\]
Theorem 4.12 (Pataki [252], Barvinok [34]) Let $X^*$ be a minimizer of the SDP in the standard form. Then we can compute a minimizer of the problem whose rank $r$ satisfies $r(r+1)/2 \le m$ in polynomial time.

Proof. If the rank of $X^*$, $r$, satisfies the inequality, then we need do nothing. Thus, we assume $r(r+1)/2 > m$, and let
\[
V V^T = X^*, \qquad V \in R^{n \times r} .
\]
Then consider
\[
\begin{array}{ll}
\text{Minimize} & V^T C V \bullet U\\
\text{Subject to} & V^T A_i V \bullet U = b_i,\ i = 1, ..., m\\
& U \succeq 0.
\end{array}
\tag{4.6}
\]
Note that $V^T C V$, the $V^T A_i V$'s and $U$ are $r \times r$ symmetric matrices and
\[
V^T C V \bullet I = C \bullet V V^T = C \bullet X^* = z^* .
\]
Moreover, for any feasible solution of (4.6) one can construct a feasible solution for the original SDP using
\[
X(U) = V U V^T \qquad \text{and} \qquad C \bullet X(U) = V^T C V \bullet U .
\tag{4.7}
\]
Thus, the minimal value of (4.6) is also $z^*$, and $U = I$ is a minimizer of (4.6). Now we show that any feasible solution $U$ to (4.6) is a minimizer for (4.6); thereby $X(U)$ of (4.7) is a minimizer for the original SDP. Consider the dual of (4.6)
\[
\begin{array}{lll}
z^* := & \text{Maximize} & b^T y = \sum_{i=1}^{m} b_i y_i\\
& \text{Subject to} & V^T C V \succeq \sum_{i=1}^{m} y_i V^T A_i V .
\end{array}
\tag{4.8}
\]
Let $y^*$ be a dual maximizer. Since $U = I$ is an interior optimizer for the primal, the strong duality condition holds, i.e.,
\[
I \bullet \left(V^T C V - \sum_{i=1}^{m} y_i^* V^T A_i V\right) = 0
\]
so that we have
\[
V^T C V - \sum_{i=1}^{m} y_i^* V^T A_i V = 0 .
\]
Then, any feasible solution of (4.6) satisfies the strong duality condition so that it must be also optimal.

Consider the system of homogeneous linear equations
\[
V^T A_i V \bullet W = 0,\ i = 1, ..., m,
\]
where $W$ is an $r \times r$ symmetric matrix (it does not need to be definite). This system has $r(r+1)/2$ real number variables and $m$ equations. Thus, as long as $r(r+1)/2 > m$, we must be able to find a symmetric matrix $W \ne 0$ to satisfy all $m$ equations. Without loss of generality, let $W$ be either indefinite or negative semidefinite (if it is positive semidefinite, we take $-W$ as $W$), that is, $W$ has at least one negative eigenvalue, and consider
\[
U(\alpha) = I + \alpha W .
\]
Choosing $\alpha^* = -1/\bar\lambda$, where $\bar\lambda < 0$ is the least eigenvalue of $W$, we have
\[
U(\alpha^*) \succeq 0
\]
and it has at least one 0 eigenvalue, or $\mathrm{rank}(U(\alpha^*)) < r$, and
\[
V^T A_i V \bullet U(\alpha^*) = V^T A_i V \bullet (I + \alpha^* W) = V^T A_i V \bullet I = b_i,\ i = 1, ..., m.
\]
That is, $U(\alpha^*)$ is feasible and so it is an optimal solution for (4.6). Then,
\[
X(U(\alpha^*)) = V U(\alpha^*) V^T
\]
is a new minimizer for the original SDP, and $\mathrm{rank}(X(U(\alpha^*))) < r$.

This process can be repeated till the system of homogeneous linear equations has only the all-zero solution, which requires $r(r+1)/2 \le m$. Thus, we must be able to find an SDP solution whose rank satisfies the inequality. The total number of such reduction steps is bounded by $n - 1$ and each step uses no more than $O(m^2 n)$ arithmetic operations. Therefore, the total number of arithmetic operations is a polynomial in $m$ and $n$, i.e., in (strongly) polynomial time given the least eigenvalue of $W$.
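The rank-reduction step in the proof can be sketched in a few lines of NumPy. The sketch keeps every constraint value $A_i \bullet X$ fixed and repeats the step until $r(r+1)/2 \le m$; when the starting matrix is optimal, the argument in the proof shows the objective value is unchanged as well. The function name `reduce_rank`, the tolerance, and the random data are assumptions. The sign of $W$ is chosen so that its most negative eigenvalue has the largest magnitude, a numerical convenience not taken from the text, and the step length is the positive value $-1/\bar\lambda$. The sketch assumes $m \ge 1$.

```python
import numpy as np

def reduce_rank(X, A_list, tol=1e-9):
    """Return X' >= 0 with A_i . X' = A_i . X and rank r satisfying r(r+1)/2 <= m."""
    m = len(A_list)
    for _ in range(X.shape[0]):                    # rank drops by at least one per pass
        w, Q = np.linalg.eigh(X)
        keep = w > tol * max(w.max(), 1.0)
        V = Q[:, keep] * np.sqrt(w[keep])          # X = V V^T with V of size n x r
        r = V.shape[1]
        if r * (r + 1) // 2 <= m:
            break
        iu = np.triu_indices(r)
        scale = np.where(iu[0] == iu[1], 1.0, 2.0)  # off-diagonal entries count twice
        M = np.array([scale * (V.T @ Ai @ V)[iu] for Ai in A_list])
        w_vec = np.linalg.svd(M)[2][-1]            # null vector: M has more columns than rows
        W = np.zeros((r, r))
        W[iu] = w_vec
        W = W + W.T - np.diag(np.diag(W))          # symmetric W with V^T A_i V . W = 0
        lam = np.linalg.eigvalsh(W)
        if lam[-1] > -lam[0]:                      # flip sign so lam[0] dominates in magnitude
            W, lam = -W, -lam[::-1]
        U = np.eye(r) - W / lam[0]                 # U = I + alpha W with alpha = -1/lam_min
        X = V @ U @ V.T
    return X

rng = np.random.default_rng(3)
n, m = 6, 3
B = rng.standard_normal((n, n))
X0 = B @ B.T + np.eye(n)                           # full-rank feasible point (hypothetical)
A_list = []
for _ in range(m):
    G = rng.standard_normal((n, n))
    A_list.append((G + G.T) / 2)
X1 = reduce_rank(X0, A_list)
eigs = np.linalg.eigvalsh(X1)
rank_X1 = int(np.sum(eigs > 1e-8 * eigs.max()))
```

With $m = 3$ the bound $r(r+1)/2 \le m$ allows rank at most 2, so the full-rank starting point is reduced through ranks 5, 4 and 3 before the loop stops.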
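4.5.4 Randomized Rounding

Let $X^* = \bar V^T \bar V$ where $\bar V$ is $r \times n$, and consider
\[
\begin{array}{lll}
z^{SDP} := & \text{maximize} & \bar V Q_0 \bar V^T \bullet U\\
& \text{subject to} & \bar V Q_i \bar V^T \bullet U \le 1,\ i = 1, ..., m\\
& & U \succeq 0.
\end{array}
\]
Note that $U = I$ is an optimal solution for the problem. Without loss of generality, assume that $D := \bar V Q_0 \bar V^T$ is a diagonal matrix and $A_i = \bar V Q_i \bar V^T$, $i = 1, ..., m$. Rewrite the problem
\[
\begin{array}{lll}
z^{SDP} := & \text{maximize} & D \bullet U\\
& \text{subject to} & A_i \bullet U \le 1,\ i = 1, ..., m\\
& & U \succeq 0,
\end{array}
\]
and $U^* = I$ is an optimal solution for the problem. Generate a random $r$-dimensional vector $u$ where, independently,
\[
u_j = \begin{cases} 1 & \text{with probability } .5\\ -1 & \text{otherwise.} \end{cases}
\]
Note that
\[
E[uu^T] = I \qquad \text{and} \qquad D \bullet uu^T = D \bullet I = z^{SDP} .
\]
Let
\[
\hat u = \frac{1}{\sqrt{\max_i A_i \bullet uu^T}} \cdot u .
\]
Then, for $i = 1, ..., m$,
\[
A_i \bullet \hat u \hat u^T \le 1 .
\]
That is, $\hat u \hat u^T$ is a feasible solution, and
\[
D \bullet \hat u \hat u^T = \frac{z^{SDP}}{\max_i (A_i \bullet uu^T)} .
\]
Using the linear transformation $\hat x = \bar V^T \hat u$, $\hat x \hat x^T$ is a feasible solution for (SDP), and
\[
Q_0 \bullet \hat x \hat x^T = \frac{z^{SDP}}{\max_i (A_i \bullet uu^T)},
\]
where for $i = 1, ..., m$
\[
A_i \bullet I \le 1 .
\]

Lemma 4.13 For any given vector $a$, the probability
\[
\Pr(aa^T \bullet uu^T > \alpha\|a\|^2) \le 2\exp(-\alpha/2) .
\]

Lemma 4.14 The probability
\[
\Pr(A_i \bullet uu^T > \alpha) \le \Pr(A_i \bullet uu^T > \alpha(A_i \bullet I)) \le 2r\exp(-\alpha/2) .
\]

Lemma 4.15 The probability
\[
\Pr(\max_i (A_i \bullet uu^T) > \alpha) \le 2mr\exp(-\alpha/2) .
\]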
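The rounding step described above (draw a random sign vector $u$, then rescale by $\max_i A_i \bullet uu^T$) can be sketched as follows. The function name `scaled_sign_rounding` and the randomly generated $A_i$ and $D$ are assumptions for illustration. The $A_i$ are taken positive definite with $A_i \bullet I = 1$ so that the rescaling factor is strictly positive.

```python
import numpy as np

def scaled_sign_rounding(A_list, r, rng):
    """Draw u uniformly in {-1, +1}^r and rescale so that A_i . (u_hat u_hat^T) <= 1."""
    u = rng.choice([-1.0, 1.0], size=r)
    worst = max(u @ Ai @ u for Ai in A_list)       # max_i A_i . u u^T, assumed > 0
    return u, u / np.sqrt(worst), worst

rng = np.random.default_rng(4)
r, m = 4, 3
A_list = []
for _ in range(m):
    B = rng.standard_normal((r, r))
    Ai = B @ B.T + 0.1 * np.eye(r)                 # positive definite
    A_list.append(Ai / np.trace(Ai))               # normalized so that A_i . I = 1
D = np.diag(rng.uniform(0.5, 2.0, size=r))         # hypothetical diagonal objective

u, u_hat, worst = scaled_sign_rounding(A_list, r, rng)
objective_u = u @ D @ u                            # equals trace(D) since u_j^2 = 1
objective_u_hat = u_hat @ D @ u_hat                # equals trace(D) / worst
```

Since $D$ is diagonal and every $u_j^2 = 1$, the objective at $u$ is always $D \bullet I$; only the rescaling factor is random.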
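Choose $\alpha = 2\ln(4m^2)$, then
\[
\Pr(\max_i (A_i \bullet uu^T) > 2\ln(4m^2)) \le \frac{r}{2m} \le \frac{1}{2},
\]
or
\[
\Pr\left(\frac{1}{\max_i (A_i \bullet uu^T)} \ge \frac{1}{2\ln(4m^2)}\right) \ge \frac{1}{2} .
\]

Theorem 4.16 With probability at least .5
\[
\hat x^T Q_0 \hat x \ge \frac{1}{2\ln(4m^2)} \cdot z^{SDP} \ge \frac{1}{2\ln(4m^2)} \cdot z^* .
\]

4.6 Max-Bisection Problem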
Consider the Max-Bisection problem on an undirected graph $G = (V, E)$ with non-negative weights $w_{ij}$ for each edge in $E$ (and $w_{ij} = 0$ if $(i, j) \notin E$), which is the problem of partitioning the even number of nodes in $V$ into two sets $S$ and $V \setminus S$ of equal cardinality so that
\[
w(S) := \sum_{i \in S,\ j \in V \setminus S} w_{ij}
\]
is maximized. This problem can be formulated by assigning each node a binary variable $x_j$:
\[
\begin{array}{lll}
\text{(MB)} & w^* := \text{Maximize} & \frac{1}{4}\sum_{i,j} w_{ij}(1 - x_i x_j)\\
& \text{subject to} & \sum_{j=1}^{n} x_j = 0 \quad \text{or} \quad e^T x = 0\\
& & x_j^2 = 1,\ j = 1, \ldots, n,
\end{array}
\]
where $e \in$ [...] $0)$, then
\[
w(\tilde S) \ge \frac{\alpha}{1 + \sqrt{1 - \beta}} \cdot w^* .
\]
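Proof. Suppose
\[
w(S) = \lambda w^* \qquad \text{and} \qquad |S| = \delta n,
\]
which from (4.18) and $z(\gamma) \ge \alpha + \gamma\beta$ implies that
\[
\lambda \ge \alpha + \gamma\beta - 4\gamma\delta(1 - \delta).
\]
Applying (4.12) we see that
\[
\begin{aligned}
w(\tilde S) &\ge \frac{w(S)}{2\delta}\\
&= \frac{\lambda w^*}{2\delta}\\
&\ge \frac{\alpha + \gamma\beta - 4\gamma\delta(1 - \delta)}{2\delta} \cdot w^*\\
&\ge 2\left(\sqrt{\gamma(\alpha + \gamma\beta)} - \gamma\right) \cdot w^* .
\end{aligned}
\]
The last inequality follows from simple calculus that
\[
\delta = \frac{\sqrt{\alpha + \gamma\beta}}{2\sqrt{\gamma}}
\]
yields the minimal value for $(\alpha + \gamma\beta - 4\gamma\delta(1 - \delta))/(2\delta)$ in the interval $[0, 1]$, if $\gamma \ge \alpha/(4 - \beta)$. In particular, substituting
\[
\gamma = \frac{\alpha}{2\beta}\left(\frac{1}{\sqrt{1 - \beta}} - 1\right)
\]
into the first inequality, we have the second desired result in the lemma.

The motivation to select $\gamma = \frac{\alpha}{2\beta}\left(\frac{1}{\sqrt{1-\beta}} - 1\right)$ is that it yields the maximal value for $2(\sqrt{\gamma(\alpha + \gamma\beta)} - \gamma)$. In fact, when both $\alpha = \beta = \alpha(1) \ge .878567$ as in the case of Frieze and Jerrum,
\[
\frac{\alpha}{1 + \sqrt{1 - \beta}} > 0.6515,
\]
which is just slightly better than $2(\sqrt{2\alpha(1)} - 1) \ge 0.6511$ proved by Frieze and Jerrum. So their choice $\gamma = 1$ is almost "optimal". We emphasize that $\gamma$ is only used in the analysis of the quality bound, and is not used in the rounding method.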
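The constants quoted above can be reproduced numerically. This short NumPy sketch evaluates the bound $2(\sqrt{\gamma(\alpha + \gamma\beta)} - \gamma)$ at $\gamma = 1$ and at the maximizing $\gamma$, using $\alpha = \beta = 0.878567$ as in the text; the grid used to confirm the maximum is an illustrative assumption.

```python
import numpy as np

alpha = beta = 0.878567

def bound(gamma):
    """Quality bound 2 * (sqrt(gamma * (alpha + gamma * beta)) - gamma)."""
    return 2.0 * (np.sqrt(gamma * (alpha + gamma * beta)) - gamma)

gamma_star = alpha / (2.0 * beta) * (1.0 / np.sqrt(1.0 - beta) - 1.0)
closed_form = alpha / (1.0 + np.sqrt(1.0 - beta))   # value of the bound at gamma_star
frieze_jerrum = bound(1.0)                           # 2 * (sqrt(2 * alpha) - 1)

# Grid check over gamma >= alpha / (4 - beta), the range required by the proof.
grid = np.linspace(0.3, 3.0, 270_001)
grid_max = bound(grid).max()
```

The two values differ only in the fourth decimal place, which is the sense in which the choice $\gamma = 1$ is almost optimal.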
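4.6.4 A simple .5-approximation

To see the impact of $\theta$ in the new rounding method, we analyze the other extreme case where $\theta = 0$ and
\[
P = \frac{n}{n-1}\left(I - \frac{1}{n} ee^T\right).
\]
That is, we generate $u \in N(0, P)$, then $\hat x$ and $S$. Now, we have
\[
E[w(S)] = E\left[\frac{1}{4}\sum_{i,j} w_{ij}(1 - \hat x_i \hat x_j)\right] = \frac{1}{4}\left(1 + \frac{2}{\pi}\arcsin\left(\frac{1}{n-1}\right)\right)\sum_{i \ne j} w_{ij} \ge .5 \cdot w^*
\tag{4.19}
\]
and
\[
E\left[\frac{n^2}{4} - \frac{(e^T \hat x)^2}{4}\right] \ge \frac{n^2}{4} - \frac{n}{4} + \frac{n(n-1)}{4}\,\frac{2}{\pi}\arcsin\left(\frac{1}{n-1}\right) \ge \left(1 - \frac{1}{n}\right) \cdot \frac{n^2}{4},
\tag{4.20}
\]
where from (4.9) we have used the facts that
\[
E[\hat x_i \hat x_j] = \frac{2}{\pi}\arcsin\left(\frac{-1}{n-1}\right), \qquad i \ne j
\]
and
\[
\frac{1}{2}\sum_{i \ne j} w_{ij} = \sum_{i < j} w_{ij} \ge w^* .
\]
[...] $1 + \sqrt{1 - \beta}$ [...] $1 + 1/n$ [...] approximation method. The same ratio can be established for $P = I$.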
4.6.5 A .699-approximation

We now prove our main result. For simplicity, we will use $P = I$ in our rounding method. Therefore, we discuss using the convex combination of $\theta\bar X + (1 - \theta)I$ as the covariance matrix to generate $u$, $\hat x$, $S$ and $\tilde S$ for a given $0 \le \theta \le 1$, i.e.,
\[
u \in N(0, \theta\bar X + (1 - \theta)P), \qquad \hat x = \mathrm{sign}(u),
\]
\[
S = \{i : \hat x_i = 1\} \qquad \text{or} \qquad S = \{i : \hat x_i = -1\}
\]
such that $|S| \ge n/2$, and then $\tilde S$ from the Frieze and Jerrum swapping procedure. Define
\[
\alpha(\theta) := \min_{-1 \le y < 1} \frac{1 - \frac{2}{\pi}\arcsin(\theta y)}{1 - y};
\tag{4.21}
\]
[...] $.9620$, which imply
\[
r_{MB} = \frac{\alpha(\theta^*)}{1 + \sqrt{1 - \beta(\theta^*)}} \ge \frac{\alpha(.89)}{1 + \sqrt{1 - \beta(.89)}} > .69920 .
\]
This bound yields the final result:

Theorem 4.19 There is a polynomial-time approximation algorithm for Max-Bisection whose expected cut is at least $r_{MB}$ times the maximal bisection cut, if the number of nodes in the graph is sufficiently large. In particular, if parameter $\theta = .89$ is used, our rounding method is a .699-approximation for Max-Bisection.

The reader may ask why we have used two different formulations in defining $\alpha(\theta)$ of (4.21) and $\beta(\theta)$ of (4.22). The reason is that we have no control on the ratio, $\rho$, of the maximal bisection cut $w^*$ over the total weight [...]