Derivative-Free Optimization Methods for Finite Minimax Problems

Warren Hare∗
Mason Macklem†
August 23, 2011
Abstract. Derivative-free optimization focuses on designing methods to solve optimization problems without the analytical knowledge of the function. In this paper we consider the problem of designing derivative-free methods for finite minimax problems: min_x max_{i=1,2,...,N} f_i(x). In order to solve the problem efficiently, we seek to exploit the smooth substructure within the problem. Using ideas developed in [BLO02, BLO05], we introduce the notion of a robust simplex gradient descent direction, and use it to accelerate convergence. Convergence is proven by showing that the resulting algorithm fits into the Directional Direct-Search framework. Numerical tests demonstrate the algorithm's effectiveness on finite minimax problems.
Keywords: Optimization, Derivative-Free Optimization, Active Set, Active Manifold, Gradient Sampling
AMS subject classifications: Primary, 90C47, 90C56; Secondary, 49M25, 90C26.
1 Introduction
The area of derivative-free optimization (DFO), or direct search methods, focuses on designing methods to solve optimization problems without the analytical knowledge of the objective function (or constraint set). This is particularly useful when the objective function is a blackbox function or an oracle function, where the only available information is the value of the objective function for an input point x ∈ R^n. The study of DFO methods has grown in recent years due to the flexibility of DFO methods across a variety of applied problems [BDF+98,
∗ University of British Columbia, Okanagan Campus (UBCO), Mathematics, 3333 University Way, Kelowna, BC, V1V 1V7, Canada. e: [email protected]. Research for this author was partially supported by NSERC DG 355571-2008.
† UBCO, Mathematics. e: [email protected]. Research for this author was partially supported by NSERC DG 355571-2008.
DV04, MWDM07, MFT08, Har10] (among many other examples), and the development of elegant frameworks to ensure convergence [AD02, KLT03, AD06, KLT06, BGP09]. For a general overview of DFO methods, along with a comprehensive study of many convergence results, see [CSV09].

In this paper we consider the challenge of designing DFO methods for the specific unconstrained minimization problem

    min_x F(x)  where  F(x) := max_{i=1,2,...,N} f_i(x),  f_i ∈ C^1,        (1)
in settings where the individual f_i's are difficult to handle analytically. More precisely, we assume that at a given point x, for each function f_i (i = 1, 2, ..., N), it is possible to obtain a function value f_i(x), but it is impossible to obtain an accurate gradient value ∇f_i(x). It is therefore possible to compute function values for F, but not to compute any (sub-)gradient information for F. This setting is particularly appropriate for applying DFO methods. However, traditional DFO methods consider only the function F(x), and lose information that is available from consideration of the f_i's individually.

Before discussing our techniques, it is enlightening to discuss the smooth substructure in problem (1). Although not explicitly required herein, for the sake of discussion, consider the case where each function f_i is continuously differentiable, but not necessarily convex. In this setting, it is clear that the function F is not necessarily smooth. However, under basic non-degeneracy conditions, F will still enjoy an appealing subdifferential calculus [Cla83]. Moreover, in this setting, F contains an underlying smooth substructure that can be exploited to design more effective optimization algorithms; see [BM88, AKK91, Wri93, MS02, MS05, HL04, HL07, Har09] among many others.

To understand how to exploit the smooth substructure in problem (1), we first define the active manifold of the function F at a point x̂. For a given point x̂ we define the active set of F at x̂ by A(x̂) := {i : f_i(x̂) = F(x̂)}. It is clear that F agrees with a smooth function when restricted to the set M(x̂) = {x : A(x) = A(x̂)}. Moreover, if the active gradients {∇f_i(x̂) : i ∈ A(x̂)} form a linearly independent set, then M(x̂) defines a smooth manifold near x̂. In applied optimization, it has been recognized that the minimizer for problem (1) usually appears on such a manifold, and that the manifold is a strict subset of the domain of F [BLO05, MS05, HL07].
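In a blackbox setting the active set must be identified from function values alone, and a numerical tolerance is needed since exact floating-point ties are rare. The following sketch (ours, not from the paper; the tolerance `tol` and function name are our own devices) computes A(x) from the sub-function values:

```python
import numpy as np

def active_set(sub_values, tol=1e-8):
    """Indices i with f_i(x) within tol of F(x) = max_i f_i(x)."""
    sub_values = np.asarray(sub_values, dtype=float)
    F = sub_values.max()
    return set(np.flatnonzero(sub_values >= F - tol))

# Example with f_1(x) = x_1, f_2(x) = -x_1, f_3(x) = x_2 - 1:
subs = lambda x: [x[0], -x[0], x[1] - 1.0]
A = active_set(subs([0.0, 0.5]))   # f_1 and f_2 tie at 0
```

Two points x and x' then lie on the same candidate manifold when `active_set` returns the same index set at both.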
We call this manifold the active manifold. To exploit this smooth substructure, the goal is to identify the active manifold and then treat F as a smooth function restricted to M [MS05]. Essentially, the goal is to determine when an iterate is sufficiently close to M, and then encourage the optimization algorithm to
move in a descent direction approximately parallel to M instead of towards M [BLO05]. A variety of algorithms that exploit smooth substructure in this manner have been proposed, researched, implemented, and tested ([BLO05, MS05, DHM06, Wri10] among others). However, all of these algorithms employ the use of (sub-)gradients. Similar approaches have been considered in DFO in the context of linearly-constrained problems, where movement is required along the constraint boundaries [KLT06]. However, unlike higher-order methods, very little research has explored how DFO methods might exploit smooth substructure in the objective function.

In this paper we introduce a framework for DFO that can be employed to exploit the active manifold of problem (1). Essentially, the framework uses a DFO pattern search that benefits from the inclusion of a search direction that the algorithm detects to be approximately parallel to the active manifold. Based on the ideas of [BLO02, BLO05], we introduce several methods of detecting a search direction parallel to the active manifold that fit within our framework. We show that the introduced DFO framework falls under the Directional Direct-Search framework [CSV09, Alg 7.2], and is therefore proven to converge to a local minimum under a wide variety of settings [CSV09, §7].

We further implement and numerically compare several derivative-free methods for computing a search direction that is approximately parallel to the active manifold. We compare these methods against two similar derivative-free methods that do not exploit the smooth substructure of problem (1). The results convincingly show that by properly exploiting the smooth substructure in an optimization problem, even without the use of (sub-)gradients, we can vastly improve algorithmic performance.

The remainder of this paper is organized as follows. In Section 2 we provide some background information useful in understanding this research.
This includes details on the robust gradient sampling methods of [BLO02, BLO05] and on other related works. In Section 3 we provide a theoretical framework and prove convergence for our algorithm. In Subsection 3.2 we explain our four methods for approximating descent directions that run parallel to the active manifold. In Section 4 we present some numerical testing for our methods. We demonstrate that by applying a descent direction that runs approximately parallel to the active manifold we can significantly improve convergence. We provide some concluding remarks in Section 5.
2 Background

2.1 Robust Gradient Sampling
Within a nonsmooth optimization setting, a common method to exploit smooth substructure is to use the fact that proximal points identify the active manifold [MS02, Har09]. Since proximal points can be computed quickly and accurately via bundle methods, this fact has led to predictor-corrector style methods (such as [MS05]), where bundle methods are employed to find the active manifold and higher-order methods are used to traverse quickly
in directions parallel to the active manifold. Unfortunately, from the viewpoint of DFO, bundle methods require a large number of accurate subgradients to be computed during each iteration. This makes it very difficult to rework such an approach into a derivative-free method.

A more promising idea, from the perspective of DFO, is presented in [BLO02, BLO05]. For the sake of discussion, let us define what we call the robust descent direction (RD-direction), based on the ideas in [BLO02, BLO05].

Definition 1 (Robust Descent Direction) Let f be a locally-Lipschitz function and x̄ be a point. Fix ε > 0 and select m ≥ 0 points {x_i}_{i=1}^m ⊆ B_ε(x̄), where B_ε(x̄) is the open ball of radius ε centered at x̄. Assume that ∇f(x̄) and ∇f(x_i) are well-defined for each i. Create the robust approximate sub-differential

    G = conv{∇f(x̄), ∇f(x_1), ∇f(x_2), ..., ∇f(x_m)}.

The robust descent direction (RD-direction) d for f at x̄ generated by {x_i}_{i=1}^m is the negative of the projection of 0 onto G: d = −proj(0, G).

Remark 1 Note that in the above definition, f is locally-Lipschitz and x̄ is in the interior of its domain. Therefore, for ε sufficiently small, f is differentiable at each x_i with probability 1. This justifies the assumption that ∇f(x̄) and ∇f(x_i) are well-defined for each i. (For further discussion see [BLO02, BLO05].)

In [BLO05], a robust gradient sampling method is created. At each iteration, the method (randomly) generates an RD-direction for the current iterate and then performs a line search in that direction. If the RD-direction is non-zero and sufficient decrease is found, then a new iterate is created. If the RD-direction is zero, or no sufficient decrease can be found, then the search radius ε for the RD-direction is decreased and a new RD-direction is created.
The idea behind this approach is that the robust approximate sub-differential G is an approximation of the subdifferential at a nearby point on the active manifold, and therefore the RD-direction is a direction of descent that is closer to being parallel to the active manifold than the direction of steepest descent. (For further details, proof of convergence, and numerical examples, see [BLO05].)

This idea seems promising from the viewpoint of DFO, as we are seeking to approximate nearby subdifferentials, and therefore the demand for exact gradient values can be relaxed (see [Kiw10] and the discussion below). However, this approach has a disadvantage from the viewpoint of DFO, in that many (approximate) gradients are required. In Subsection 3.2 we explain our methods for exploiting the structure of problem (1) to generate an approximate RD-direction without requiring excessive function evaluations.
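The only nontrivial computation in Definition 1 is the projection of 0 onto the polytope G. The sketch below is ours, not from [BLO02, BLO05]; it computes the projection with a simple Frank-Wolfe iteration over the convex weights, though any small quadratic-program solver would serve equally well:

```python
import numpy as np

def rd_direction(gradients, iters=2000, tol=1e-12):
    """RD-direction: the negative of the projection of 0 onto conv(gradients).

    gradients: (m, n) array whose rows are the sampled gradients.
    Frank-Wolfe on min_lam 0.5*||lam @ G||^2 over the unit simplex."""
    G = np.asarray(gradients, dtype=float)
    lam = np.full(G.shape[0], 1.0 / G.shape[0])   # start at the barycenter
    for _ in range(iters):
        v = lam @ G                   # current point in conv(G)
        scores = G @ v                # linearized objective at each vertex
        i = int(np.argmin(scores))
        if v @ v - scores[i] <= tol:  # Frank-Wolfe gap: v is (near) optimal
            break
        d = G[i] - v                  # move toward the best vertex
        t = np.clip(-(v @ d) / (d @ d), 0.0, 1.0)  # exact line search
        lam *= 1.0 - t
        lam[i] += t
    return -(lam @ G)

# Gradients of two smooth pieces straddling a "valley": the RD-direction
# points down the valley rather than down either individual gradient.
d = rd_direction([[1.0, 1.0], [1.0, -1.0]])   # approximately (-1, 0)
```

In this two-gradient example, steepest descent on either piece would cut across the valley, while the RD-direction (−1, 0) runs parallel to it, which is exactly the behavior the robust gradient sampling method seeks.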
2.2 Connections to other works
Some recent work has explored customizing DFO methods to exploit known structures in an optimization problem. For example, in 2006, Liuzzi, Lucidi, and Sciandrone examined the
problem of using DFO to solve linearly constrained finite minimax problems [LLS06]. Their approach exploits the finite minimax structure to allow for a smoothing technique based on an exponential penalty function. Essentially they approximate the constrained nonsmooth problem

    min_{Ax ≤ b} F(x)  where  F(x) = max_{i=1,...,N} f_i(x),  f_i ∈ C^2,

via the constrained smoothed problem

    min_{Ax ≤ b} F(x, µ)  where  F(x, µ) = F(x) + µ ln( Σ_{i=1}^N exp( (f_i(x) − F(x)) / µ ) ),  f_i ∈ C^2.
Variational analysis is used to prove that F(x, µ) is smooth and that F(x, µ) → F(x) as µ → 0. The DFO method then applies a pattern search method to F(x, µ) while adaptively reducing µ. Conversely, our framework does not attempt to smooth the problem; instead it directly uses the finite minimax structure to adjust search directions to be more parallel to active manifolds.

Another work related to this paper is the recent results of Bogani, Gasparo, and Papini. In [BGP09], the authors explore methods of applying DFO to functions of the form F(x) = f_i(x) for x ∈ E_i, i = 1, 2, ..., m, where the sets E_i are polyhedral and form a finite partition of the domain of F. The authors note that if an iterate is close to (or on) a boundary of E_i, then it is good to ensure that directions parallel to the boundary are in the search set. In essence, the authors are determining the active manifold of nearby points and attempting to move parallel to it. However, unlike this paper, the framework in [BGP09] assumes that the active manifolds are affine sets and assumes a priori knowledge of the structure of the active manifolds at each point in the domain of F. Our framework makes neither assumption.

Another direction of recent research closely related to this paper is generalizing the robust gradient sampling method of [BLO05] to a derivative-free setting. For example, in [BKS08] and [Kiw10], the authors employ discrete gradients to create an approximate subdifferential and then use the approximate subdifferential to determine a descent direction for the objective function. In [BKS08], Bagirov, Karasözen, and Sezer use the techniques from [Bag03] to create a derivative-free subdifferential approximation and then employ a line search in the descent direction suggested by the approximation. Although the authors mention [BLO05], they clearly do not consider it an inspiration for their approach.
Conversely, in [Kiw10], Kiwiel develops a DFO method explicitly motivated by the robust gradient sampling method by asking what properties are required by a discrete gradient in order to generalize the method. He then demonstrates that Gupal’s estimators (see [Gup77]) can be used to create a derivative-free variant of the robust gradient sampling method. In both methods, the authors assume no knowledge of any underlying smooth substructure to the objective function. As such, both methods require a high number of function
evaluations per iteration. In particular, [BKS08] requires p(n + 1) function evaluations per iteration, where p is the number of points used to make an approximate subdifferential and n is the problem dimension. Note that p is determined during each iteration and is dependent on the particular test problem; it is unclear what an average value would be. The method of [Kiw10] requires 2mn function evaluations per iteration, where m is the number of sample points used per iteration. Although [Kiw10] does not contain any numerical results, the research in [BLO05] suggests that m ≈ 2n may be an appropriate number.

By using the assumption that the objective function takes the form of a finite max function, our method only requires n + 1 function evaluations per iteration. (Of course, depending on the poll steps and the spanning sets used, our method may use more function evaluations if desired.) However, it should be noted that function evaluations per iteration is not necessarily representative of the total function evaluations used. In this work we do not numerically compare our method with any of the above methods; future research may explore such benchmarking.
3 Algorithm Design and Convergence

3.1 Frameworks and Convergence
Directional direct-search is a simple framework that encompasses a wide variety of DFO methods. The essence of this framework is that at each iteration the algorithm evaluates the objective function at a finite number of points around the current algorithmic center and then relocates the algorithmic center if decrease is detected. In order to ensure convergence, the points evaluated at each iteration must form a positive spanning set.

Definition 2 (Positive Spanning Set) A set of vectors {d_i : i = 1, 2, ..., m} ⊆ R^n is a positive spanning set for R^n if

    { x : x = Σ_{i=1}^m λ_i d_i, λ_i ≥ 0 } = R^n.
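As a concrete example, the 2n coordinate directions D = {±e_1, ..., ±e_n} form a positive spanning set: any x is reproduced with the nonnegative weights λ = (max(x, 0), max(−x, 0)). A small sketch (ours) verifying this decomposition:

```python
import numpy as np

def positive_coords(x):
    """Nonnegative weights expressing x over D = [e_1..e_n, -e_1..-e_n]."""
    x = np.asarray(x, dtype=float)
    return np.concatenate([np.maximum(x, 0.0), np.maximum(-x, 0.0)])

n = 3
D = np.vstack([np.eye(n), -np.eye(n)])   # rows are the 2n directions
x = np.array([1.5, -2.0, 0.0])
lam = positive_coords(x)
# lam >= 0 and lam @ D reconstructs x, as Definition 2 requires.
```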
The positive spanning set may change across iterations. Provided only a finite number of different positive spanning sets are used, various convergence results can be established. The directional direct-search framework follows, from [CSV09, Alg 7.2].

Algorithm 1 (Directional Direct-Search Framework)

1. Initialization: Choose x_0 ∈ Ω with f(x_0) < +∞, α_0 > 0, 0 < β_1 ≤ β_2 < 1, and γ ≥ 1.

2. Main loop: For k = 0, 1, 2, ...:
(a) Search step (optional): Evaluate f at a finite number of points. If a point x̂ is found such that f(x̂) < f(x_k), then set x_{k+1} = x̂, declare the iteration successful, and skip the poll step.

(b) Poll step: Choose a positive spanning set D_k and let m_k = |D_k|. Define and order the set of poll points P_k = {x_k + α_k d_i : d_i ∈ D_k}_{i=1}^{m_k}. Evaluate f at the poll points until either i) a point x_k + α_k d_k is found that satisfies f(x_k + α_k d_k) < f(x_k), or ii) all poll points have been evaluated and no such point is found. If an improvement is found (i), then set x_{k+1} = x_k + α_k d_k and declare the iteration successful. If no improvement is found (ii), then set x_{k+1} = x_k and declare the iteration unsuccessful.

(c) Mesh parameter update: If the iteration was successful, then maintain or increase the step-size parameter: α_{k+1} ∈ [α_k, γα_k]. Otherwise, decrease the step-size parameter: α_{k+1} ∈ [β_1 α_k, β_2 α_k].

Convergence of the directional direct-search framework is established in [CSV09, Thm 7.4 & Thm 7.5]. In the following theorem recall that, for a locally Lipschitz function f, the generalized directional derivative df at the point x in the direction d is defined by

    df(x, d) := limsup_{x' → x, τ ↓ 0} ( f(x' + τd) − f(x') ) / τ.
Note that the assumption that f is locally Lipschitz ensures that this limit is well-defined [Cla83, p. 10].

Theorem 1 (Convergence of Directional Direct-Search) Let f be locally Lipschitz continuous and x_0 ∈ dom(f). Suppose the level set L = {x : f(x) ≤ f(x_0)} is compact. Suppose an algorithm conforming to Algorithm 1 is employed. Suppose that the step-size parameters α_k converge to 0 and the set of positive spanning sets used {D_k} is finite. Then the sequence of iterates {x_k} has a limit point x* such that df(x*, d) ≥ 0 for all d ∈ D*, where D* is one of the sets from {D_k}. Further, if f ∈ C^1 (at x*), then ∇f(x*) = 0.

Proof: This is a mild simplification of Theorems 7.4 and 7.5 of [CSV09] (see also [AD02] and [AD06] for original statements and proofs).
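As a minimal concrete instance of Algorithm 1, consider coordinate search: no search step, D_k fixed to the 2n coordinate directions, γ = 1, and β_1 = β_2 = 1/2. The sketch below is ours (parameter choices and names are illustrative, not from [CSV09]):

```python
import numpy as np

def coordinate_search(f, x0, alpha0=1.0, alpha_min=1e-8, max_iter=10_000):
    """Minimal instance of the directional direct-search framework:
    poll over D = {+-e_i}, halve the step on failure, keep it on success."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    D = np.vstack([np.eye(n), -np.eye(n)])   # fixed positive spanning set
    alpha, fx = alpha0, f(x)
    for _ in range(max_iter):
        if alpha < alpha_min:
            break
        for d in D:                          # poll step
            trial = x + alpha * d
            ft = f(trial)
            if ft < fx:                      # successful iteration
                x, fx = trial, ft
                break
        else:                                # unsuccessful: shrink the mesh
            alpha *= 0.5
    return x, fx

x_star, f_star = coordinate_search(lambda x: float(x @ x), np.array([0.7, -0.3]))
```

On a smooth function such as this quadratic, Theorem 1 gives ∇f(x*) = 0; on a nonsmooth F, however, the theorem only guarantees nonnegative generalized directional derivatives along the polled directions, which is precisely what motivates adapting the poll set D_k to the problem substructure.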
To adapt the Directional Direct-Search Framework to incorporate RD-directions, we next discuss a Direction-Adjusted Direct-Search Framework. The framework is an adaptation of the Directional Direct-Search Framework that uses an expected descent direction in both the search step and the poll step.

In the initialization of the Direction-Adjusted Direct-Search Framework we fix a finite collection of positive spanning sets D. This fulfills the assumption in Theorem 1 that the set of positive spanning sets used is finite. In the search step we use previous function evaluation information to create an expected descent direction. The search step then performs a finite line search in this direction.

In the poll step of the Direction-Adjusted Direct-Search Framework we select a positive spanning set from the finite collection of positive spanning sets D. To select the best set, we seek the positive spanning set that best aligns itself with the expected descent direction. In particular, we seek D̂ ∈ D such that

    min{ |v_k/|v_k| − d/|d|| : d ∈ D̂ } = min_{D∈D} min{ |v_k/|v_k| − d/|d|| : d ∈ D }.

As D is a finite collection of finite sets, these minimizations are easily determined. (Ties may be broken arbitrarily.) The generation of an expected descent direction is intentionally left undiscussed in this framework, and indeed the framework converges regardless of how the expected descent direction is approximated. We now present our Direction-Adjusted Direct-Search Framework.

Algorithm 2 (Direction-Adjusted Direct-Search Framework)

1. Initialization: Choose x_0 ∈ Ω with f(x_0) < +∞, α_0 > 0, 0 < β_1 ≤ β_2 < 1, and γ ≥ 1. Select a finite collection of positive spanning sets D and a finite line search sequence {δ_1, δ_2, ..., δ_m}, δ_i > 0.

2. Main loop: For k = 0, 1, 2, ...:

(a) Direction Selection and Search step: (If k = 0, then skip this step.) If k > 0, then use past function evaluation information to select an expected descent direction v_k. Select a subset I_k ⊆ {1, 2, ..., m}. If min_{i∈I_k} f(x_k + α_k δ_i v_k) < f(x_k), then set x_{k+1} ∈ argmin_{i∈I_k} f(x_k + α_k δ_i v_k) and declare the iteration successful.

(b) Poll step: If k = 0, then select any D_k ∈ D. If k > 0, then determine D̂ ∈ D such that

    min{ |v_k/|v_k| − d/|d|| : d ∈ D̂ } = min_{D∈D} min{ |v_k/|v_k| − d/|d|| : d ∈ D },
and set D_k = D̂. Order the set of poll points P_k = {x_k + α_k d_i : d_i ∈ D_k}. Evaluate f at the poll points until either i) a point x_k + α_k d_k is found that satisfies f(x_k + α_k d_k) < f(x_k), or ii) all poll points have been evaluated and no such point is found. If an improvement is found (i), then set x_{k+1} = x_k + α_k d_k and declare the iteration successful. If no improvement is found (ii), then set x_{k+1} = x_k and declare the iteration unsuccessful.

(c) Mesh parameter update: If the iteration was successful, then maintain or increase the step-size parameter: α_{k+1} ∈ [α_k, γα_k]. Otherwise, decrease the step-size parameter: α_{k+1} ∈ [β_1 α_k, β_2 α_k].

Corollary 2 (Convergence of Direction-Adjusted Direct-Search) Let f be locally Lipschitz continuous and x_0 ∈ dom(f). Suppose the level set L = {x : f(x) ≤ f(x_0)} is compact. Suppose a Direction-Adjusted Direct-Search is employed, and that the step-size parameters α_k converge to 0. Then the sequence of iterates {x_k} has a limit point x* such that df(x*, d) ≥ 0 for all d ∈ D*, where D* is one of the positive spanning sets in D. Further, if f ∈ C^1 (at x*), then ∇f(x*) = 0.

Proof: The Direction-Adjusted Direct-Search falls under the Directional Direct-Search Framework, so Theorem 1 applies. (Notice that a successful search step does not demand skipping the poll step. This can be considered skipping the poll step and then performing a null search step in the subsequent iteration.)

Remark 2 Given the general openness of the Direction-Adjusted Direct-Search Framework, it is not difficult to adjust the framework slightly in order to make it fit into the Generating Set Search Framework ([KLT06]) or the Mesh Adaptive Direct Search Framework ([AD06]). Similar convergence results are available in these frameworks.
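The set-selection rule in the poll step compares directions only after normalization, so it is scale-free. A sketch of the selection (ours; the variable names are illustrative):

```python
import numpy as np

def select_spanning_set(v, collection):
    """Pick the positive spanning set whose best direction is closest
    (after normalization) to the expected descent direction v."""
    v = np.asarray(v, dtype=float) / np.linalg.norm(v)
    def misalignment(D):
        D = np.asarray(D, dtype=float)
        U = D / np.linalg.norm(D, axis=1, keepdims=True)   # d / |d|
        return np.min(np.linalg.norm(U - v, axis=1))       # min over d in D
    return min(collection, key=misalignment)               # min over D in D

axes = [(1, 0), (-1, 0), (0, 1), (0, -1)]
diag = [(1, 1), (-1, 1), (1, -1), (-1, -1)]
best = select_spanning_set([1.0, 0.9], [axes, diag])   # the diagonal set wins
```

Since D contains finitely many finite sets, this is an exhaustive comparison, exactly as the framework requires; ties are broken by whichever set appears first in the collection.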
3.2 Selecting an approximate direction
As mentioned above, the Direction-Adjusted Direct-Search framework convergence analysis holds regardless of how an expected descent direction is created. In this section we present four methods that fit within this framework. We include a baseline method that does not update the positive spanning sets, and three methods that perform updates in order to
generate an expected descent direction. In each method we seek to create the direction based on the most recent function evaluations {x_{k−1}} ∪ {x_{k−1} + α_{k−1} d_i : d_i ∈ D_{k−1}}.

Our goal in designing these methods is to see how updating the positive spanning sets impacts the performance of the algorithm relative to using a fixed set of search directions across iterations, and to determine how much the performance varies across different choices of update direction. In particular, we want to perform this comparison over methods which all attempt to approximately align to a manifold for problems of the form (1).

For the purposes of comparison, our first method is to use the poll direction d_k that generated the last iterate.

Algorithm 3 (Direction 1: Last Descent)

1. Input: Provide points {x_{k−1}} ∪ {x_{k−1} + α_{k−1} d_i : d_i ∈ D_{k−1}} and their corresponding function values.

2. Direction: Determine the index i such that x_k = x_{k−1} + α_{k−1} d_i and set v_k = d_i.

The second method uses a simplex gradient based on the previous n + 1 function evaluations. This is a classical method that ignores the structure in F.

Algorithm 4 (Direction 2: Simplex Gradient)

1. Input: Provide points {x_{k−1}} ∪ {x_{k−1} + α_{k−1} d_i : d_i ∈ D_{k−1}} and their corresponding function values.

2. Trim: Select the n + 1 points of {x_{k−1}} ∪ {x_{k−1} + α_{k−1} d_i : d_i ∈ D_{k−1}} with the n + 1 lowest function values: {y^0, y^1, ..., y^n}.

3. Direction: Set v_k equal to the simplex gradient ∇_s F(y^0), the solution of S ∇_s F(y^0) = δ(F), where

    S = [ (y^1 − y^0)^T ; ... ; (y^n − y^0)^T ]  and  δ(F) = [ F(y^1) − F(y^0) ; ... ; F(y^n) − F(y^0) ].
The simplex gradient ∇_s F(y^0) is the gradient of the linear model formed by interpolating the function F through the points {y^0, y^1, ..., y^n}. Further information on the simplex gradient can be found in [BK98, CV07, CDV08] and [CSV09, §2.6].

Our third and fourth methods are based on the robust gradient sampling ideas of [BLO02, BLO05]. The third method creates a simplex gradient for any function f_i that is active at any of the past n + 1 trial points, and then determines the projection of 0 onto the convex hull of these simplex gradients. In effect, we generate an approximate gradient for F at each of the past n + 1 trial points, and determine the approximate RD-direction for these approximate gradients.
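Computing a simplex gradient, whether of F itself (Direction 2) or of an individual active f_i (Directions 3 and 4), amounts to a single n × n linear solve. A minimal sketch (ours; function and variable names are not from the paper):

```python
import numpy as np

def simplex_gradient(points, values):
    """Gradient of the linear model interpolating (points, values).

    points: (n+1, n) array [y^0; y^1; ...; y^n]; values: length n+1.
    Solves S g = delta(F) with row i of S equal to (y^i - y^0)^T."""
    Y = np.asarray(points, dtype=float)
    F = np.asarray(values, dtype=float)
    S = Y[1:] - Y[0]          # rows (y^i - y^0)^T
    delta = F[1:] - F[0]      # F(y^i) - F(y^0)
    return np.linalg.solve(S, delta)

# For an affine function the simplex gradient is exact:
a = np.array([2.0, -3.0])
f = lambda x: a @ x + 1.0
Y = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5]])
g = simplex_gradient(Y, [f(y) for y in Y])    # recovers a
```

The solve requires the trimmed points to be affinely independent (S nonsingular); in the framework this is inherited from the poll points being built on a positive spanning set.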
Algorithm 5 (Direction 3: RD-Simplex Gradient)

1. Input: Provide points {x_{k−1}} ∪ {x_{k−1} + α_{k−1} d_i : d_i ∈ D_{k−1}} and their corresponding function values.

2. Trim: Select the n + 1 points of {x_{k−1}} ∪ {x_{k−1} + α_{k−1} d_i : d_i ∈ D_{k−1}} with the n + 1 lowest function values: {y^0, y^1, ..., y^n}.

3. Direction: For each i such that F(y^j) = f_i(y^j) for some y^j ∈ {y^0, y^1, ..., y^n}, create G_i = ∇_s f_i(y^0). Set v_k = −proj(0, conv{G_i}):

    −v_k ∈ argmin_v { |v| : v = Σ_{i∈A} λ_i G_i, λ_i ≥ 0, Σ_{i∈A} λ_i = 1 },

where A = {i : F(y^j) = f_i(y^j) for some y^j ∈ {y^0, y^1, ..., y^n}}.

(Note that, in step 3 of the above algorithm, ∇_s f_i(y^0) = ∇_s f_i(y^j) for any y^j ∈ {y^0, y^1, ..., y^n}. So setting G_i = ∇_s f_i(y^0) is simply a matter of convenience.)

Our final method is similar to the third method, but avoids the need to solve a quadratic program to find the expected descent direction. Like the third method, we begin by generating a simplex gradient for any function f_i that is active at any of the past n + 1 trial points. Then, instead of determining a projection, we take a weighted average of these simplex gradients to find a vector in the convex hull of these approximate gradients.

Algorithm 6 (Direction 4: Weighted Average of Simplex Gradients)

1. Input: Provide points {x_{k−1}} ∪ {x_{k−1} + α_{k−1} d_i : d_i ∈ D_{k−1}} and their corresponding function values.

2. Trim: Select the n + 1 points of {x_{k−1}} ∪ {x_{k−1} + α_{k−1} d_i : d_i ∈ D_{k−1}} with the n + 1 lowest function values: {y^0, y^1, ..., y^n}.

3. Direction: For each i such that F(y^j) = f_i(y^j) for some y^j ∈ {y^0, y^1, ..., y^n}, create G_i = ∇_s f_i(y^0). Set

    v_k = − Σ_{i=1}^N (A_i / (n + 1)) G_i,

where A_i is the number of times f_i is active over the set {y^0, y^1, ..., y^n}:

    A_i = |{j : F(y^j) = f_i(y^j), j = 0, 1, ..., n}|.
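Steps 2 and 3 of Algorithm 6 are cheap to implement once the per-point sub-function values are recorded. The sketch below is ours (the layout `sub_values[j, i] = f_i(y^j)` and the tolerance are assumptions, not from the paper); it counts the activities A_i and forms the weighted average:

```python
import numpy as np

def wasg_direction(simplex_gradients, sub_values, tol=1e-8):
    """Direction 4: v_k = -sum_i (A_i / (n+1)) * G_i.

    simplex_gradients: dict {i: G_i} for the active sub-functions f_i.
    sub_values: (n+1, N) array with sub_values[j, i] = f_i(y^j)."""
    V = np.asarray(sub_values, dtype=float)
    n_plus_1 = V.shape[0]
    Fmax = V.max(axis=1, keepdims=True)    # F(y^j) for each trimmed point
    active = V >= Fmax - tol               # activity indicators
    counts = active.sum(axis=0)            # A_i for each sub-function
    v = np.zeros_like(next(iter(simplex_gradients.values())), dtype=float)
    for i, G in simplex_gradients.items():
        v -= (counts[i] / n_plus_1) * G
    return v

# Two sub-functions; f_0 active at 2 of the 3 points, f_1 at 1 (no ties),
# so v_k = -(2/3) G_0 - (1/3) G_1.
grads = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0])}
vals = np.array([[3.0, 1.0], [2.0, 1.5], [0.0, 2.0]])
v = wasg_direction(grads, vals)
```

Since the weights A_i / (n + 1) are nonnegative and sum to at least 1 (exactly 1 when no ties occur), −v_k lies in the convex hull of the G_i in the tie-free case, mirroring the projection used by Algorithm 5 without solving a quadratic program.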
4 Numerical Results
4.1 Test Sets and Software
The algorithms described in Section 3 were implemented in MATLAB (v. 7.8.0.347, R2009a). Software is available by request to the corresponding author.

Our test problems come from the Lukšan-Vlček test set [LV00]. This test set consists of three subsets of problems, the first of which are 25 problems of the desired form

    min_x F(x)  where  F(x) := max_{i=1,2,...,N} f_i(x).

From this list, we omit problem 2.17 because the sub-functions are complex-valued (not real-valued). This leaves a test set of 24 problems of the desired form. The test problems' dimensions range from 2 to 20 variables. Note that for several problems the functions f_i take the form f_i = |f̃_i| where f̃_i is a smooth function. These functions are rewritten as f_i = max{f̃_i, −f̃_i} to create the desired form. With this in mind, the test problems' number of sub-functions f_i ranges from 2 to 130. A summary of the test problems appears in Table 1. For further details, we refer readers to [LV00].

Table 1: Test Set Summary: problem number and name (as given in [LV00]), problem dimension (N), and number of sub-functions (M); ∗ denotes a doubling of the number of sub-functions due to an absolute value operation.

Prob. #  Name          N   M      Prob. #  Name       N   M
2.1      CB2           2   3      2.13     GAMMA      4   122∗
2.2      WF            2   3      2.14     EXP        5   21
2.3      SPIRAL        2   2      2.15     PBC1       5   60∗
2.4      EVD52         3   6      2.16     EVD61      6   102∗
2.5      Rosen-Suzuki  4   4      2.18     Filter     9   82∗
2.6      Polak 6       4   4      2.19     Wong 1     7   5
2.7      PCB3          3   42∗    2.20     Wong 2     10  9
2.8      Bard          3   30∗    2.21     Wong 3     20  18
2.9      Kow.-Osborne  4   22∗    2.22     Polak 2    10  2
2.10     Davidon 2     4   40∗    2.23     Polak 3    11  10
2.11     OET 5         4   42∗    2.24     Watson     20  62∗
2.12     OET 6         4   42∗    2.25     Osborne 2  11  130∗
All test problems were implemented in MATLAB (v. 7.8.0.347, R2009a).
4.2 Results
For each test problem we run the Direction-Adjusted Direct-Search using each of the four methods for creating an expected descent direction proposed in Subsection 3.2. We refer to the resulting algorithms as LD (Last Descent), SG (Simplex Gradient), RSG (RD-Simplex Gradient), and WASG (Weighted Average of Simplex Gradients). Each test is run until 250 × N function evaluations are used. For each problem and algorithm we compute the number of digits of accuracy obtained using the formula

    − log( |F_min − F*| / |F_0 − F*| ),

where F_min is the true minimum value of the problem (as given by [LV00]), F* is the function value of the best point found by the algorithm, and F_0 is the function value at the starting point of the algorithm. Results appear in Table 2.

Table 2 makes it clear that, for the majority of problems, it is most effective to create an expected direction using the idea of robust descent directions. In particular, for all but 2 test problems, the RSG algorithm results in greater accuracy than any other method. The exceptions are problems 2.10 (in which all algorithms perform very similarly) and 2.13 (in which all algorithms tie). Relative to our original algorithm goals, these results tell us two things: updating the positive spanning sets can improve performance, and the choice of descent direction matters.

Updating positive spanning sets: When comparing the performance of the LD algorithm to the SG and WASG algorithms we see that, in most cases, the performance is better for the algorithms that update the positive spanning sets to adapt to the problem substructure. In particular, the SG and WASG algorithms yield more accuracy on 18 and 16 of the test problems, respectively. In addition, we see that in some problems the LD algorithm performs quite well (e.g., problems 2.2, 2.18, and 2.22), while in others it performs very badly (e.g., problems 2.9, 2.11, 2.14, 2.20, and 2.24).
This potentially indicates that some of the performance of the LD algorithm results from the initial choice of search directions. While a well-chosen initial positive spanning set may result in good convergence, the inability to update the search directions limits performance for problems where the initial positive spanning set was not well-aligned to the problem structure.

Choice of descent direction: In most of the cases (23 of 24 problems), the RSG algorithm outperforms the SG and WASG algorithms. In many cases the improvement in performance is of orders of magnitude (e.g., problems 2.1, 2.2, 2.4, 2.7, 2.8, 2.14, 2.16, 2.19, 2.20, 2.22, and 2.23). As all three of these methods use information about the problem in their choice of update direction, this indicates that the improvement in performance of the RSG algorithm is due to the specific choice of the search direction.
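The accuracy measure above is straightforward to compute; a sketch (ours; we take log base 10, which is what "digits of accuracy" suggests, though the paper writes only "log"):

```python
import math

def digits_of_accuracy(F_min, F_star, F_0):
    """-log10(|F_min - F_star| / |F_0 - F_star|): digits of accuracy gained
    relative to the starting error. F_min: true minimum value, F_star: best
    value found by the algorithm, F_0: value at the starting point."""
    return -math.log10(abs(F_min - F_star) / abs(F_0 - F_star))

# A run that reduces the error from 1 to 1e-3 earns about 3 digits:
acc = digits_of_accuracy(0.0, 1e-3, 1.0)
```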
Table 2: Accuracy obtained by each method.

Prob. #   LD      SG      RSG     WASG
2.1       3.266   3.145   8.613   1.726
2.2       2.653   1.746   4.470   1.750
2.3       0.002   0.003   0.014   0.010
2.4       1.591   2.339   10.566  1.603
2.5       1.045   1.275   1.307   1.140
2.6       0.849   0.944   1.215   1.034
2.7       1.122   1.893   9.747   1.505
2.8       1.098   1.911   6.079   1.924
2.9       0.117   0.930   1.171   0.463
2.10      3.237   3.240   3.234   3.210
2.11      0.766   2.296   2.590   1.598
2.12      0.600   1.060   2.025   0.557
2.13      0.111   0.111   0.111   0.111
2.14      0.058   0.719   2.097   1.244
2.15      0.135   0.216   0.240   0.181
2.16      0.694   1.117   3.504   1.018
2.18      1.162   0.833   1.354   0.864
2.19      0.688   0.587   2.685   0.631
2.20      0.857   1.630   3.022   1.212
2.21      0.599   0.974   1.492   0.792
2.22      3.836   0.870   3.839   0.692
2.23      1.002   1.243   6.103   1.461
2.24      0.004   0.153   0.173   0.164
2.25      0.125   0.309   0.322   0.268
Average   1.067   1.231   3.165   1.048
5 Conclusions and Future Work
In this paper we have developed novel methods for solving derivative-free optimization problems that take the specific form
$$\min_x \{F(x)\} \quad \text{where} \quad F(x) := \max_{i=1,2,\ldots,N} \{f_i(x)\},$$
where each fi is available only through an oracle function. The methods exploit the smooth substructure within such problems by generating a robust descent direction. In order to prove convergence of the methods, we develop a Direction-Adjusted Direct-Search Framework that is a mild generalization of the Directional Direct-Search Framework. Convergence theory for the Directional Direct-Search Framework then provides convergence theory for all methods within this work.

More central to this work is the question of how to generate an (approximate) robust descent direction without the use of derivatives. Through numerical testing we find that a simple adaptation of the robust gradient method of [BLO02, BLO05] works extremely well in a derivative-free setting. In particular, we demonstrate a significant improvement in performance by creating a robust descent direction via the projection of 0 onto the convex hull of a collection of simplex gradients. We refer to this direction as the robust simplex gradient, and to the resulting method as the RSG (Robust Simplex Gradient) method.

In order to benchmark the new approach, the RSG method is compared against three other methods for generating an expected descent direction. The first method (LD), suggested only as a very simple benchmark, is to reuse the last direction that yielded improvement. The second method (SG) employs a classical simplex gradient without invoking the substructure of the problem. The final method (WASG) uses a weighted average of local simplex gradients.

It is not a surprise that the RSG method outperforms LD and SG, as RSG uses structure within the problem that LD and SG do not. However, the magnitude of the observed improvement is worth noting. Also surprising is the fact that RSG noticeably outperforms the WASG method; in fact, WASG performs only slightly better than the LD and SG methods. These results demonstrate that the use of a robust descent direction can improve performance, and that this improvement depends strongly on the choice of direction. In particular, our results demonstrate that the robust simplex gradient is a good choice of direction in this context.
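The projection step behind the robust simplex gradient amounts to a small quadratic program: minimize the norm of a convex combination of the sampled simplex gradients over the unit simplex. The paper does not prescribe a solver for this subproblem; as one hedged sketch (all names are ours), a pure-Python Frank–Wolfe iteration suffices for small instances:

```python
def project_zero_onto_hull(grads, iters=2000):
    """Approximate the projection of the origin onto conv{grads} by
    Frank-Wolfe on  min ||sum_i lam_i * g_i||^2  over the unit simplex.

    Returns the projection point d; a robust descent direction is
    then -d, by analogy with the direction used in [BLO05].
    """
    m, n = len(grads), len(grads[0])
    lam = [1.0 / m] * m  # start at the barycenter of the simplex
    d = [sum(lam[i] * grads[i][j] for i in range(m)) for j in range(n)]
    for _ in range(iters):
        # Linear minimization step: vertex g_k most aligned with -d.
        k = min(range(m),
                key=lambda i: sum(grads[i][j] * d[j] for j in range(n)))
        diff = [d[j] - grads[k][j] for j in range(n)]
        denom = sum(t * t for t in diff)
        if denom == 0.0:
            break
        # Exact line search for gamma in [0, 1] along d -> g_k.
        gamma = sum(d[j] * diff[j] for j in range(n)) / denom
        gamma = max(0.0, min(1.0, gamma))
        if gamma == 0.0:
            break  # first-order optimality reached
        d = [(1 - gamma) * d[j] + gamma * grads[k][j] for j in range(n)]
    return d
```

When 0 lies inside the hull the returned point is (numerically) 0, signalling approximate stationarity; otherwise the negative of the returned point is a direction of common descent for the sampled gradients.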
Future work in this area will begin by examining the potential to weight the remaining directions in the positive spanning set to focus more on the directions approximately parallel to manifolds when they are detected. Although this will require additional checks on the shape of the positive spanning set to avoid degeneracy, such as in [KLT03], this approach will enable the search directions to better adapt to changes in the behavior of the manifold. Future work will also consider how a robust simplex gradient might be created when the specific structure of the problem is harder to detect.
References

[AD02]
C. Audet and J. E. Dennis, Jr. Analysis of generalized pattern searches. SIAM J. Optim., 13(3):889–903 (electronic) (2003), 2002.
[AD06]
C. Audet and J. E. Dennis, Jr. Mesh adaptive direct search algorithms for constrained optimization. SIAM J. Optim., 17(1):188–217 (electronic), 2006.
[AKK91]
F. Al-Khayyal and J. Kyparisis. Finite convergence of algorithms for nonlinear programs and variational inequalities. J. Optim. Theory Appl., 70(2):319–332, 1991.
[Bag03]
A. M. Bagirov. Continuous subdifferential approximations and their applications. J. Math. Sci. (N. Y.), 115(5):2567–2609, 2003. Optimization and related topics, 2.
[BDF+ 98]
A. J. Booker, J. E. Dennis, Jr., P. D. Frank, D. B. Serafini, and V. Torczon. Optimization using surrogate objectives on a helicopter test example. In Computational methods for optimal design and control (Arlington, VA, 1997), volume 24 of Progr. Systems Control Theory, pages 49–58. Birkhäuser Boston, Boston, MA, 1998.
[BGP09]
C. Bogani, M. G. Gasparo, and A. Papini. Generating set search methods for piecewise smooth problems. SIAM J. Optim., 20(1):321–335, 2009.
[BK98]
D. M. Bortz and C. T. Kelley. The simplex gradient and noisy optimization problems. In J. T. Borggaard, J. Burns, E. Cliff, and S. Schreck, editors, Computational Methods in Optimal Design and Control, volume 24 of Progress in Systems and Control Theory, pages 77–90. Birkhäuser, 1998.
[BKS08]
A. M. Bagirov, B. Karasözen, and M. Sezer. Discrete gradient method: derivative-free method for nonsmooth optimization. J. Optim. Theory Appl., 137(2):317–334, 2008.
[BLO02]
J. V. Burke, A. S. Lewis, and M. L. Overton. Approximating subdifferentials by random sampling of gradients. Math. Oper. Res., 27(3):567–584, 2002.
[BLO05]
J. V. Burke, A. S. Lewis, and M. L. Overton. A robust gradient sampling algorithm for nonsmooth, nonconvex optimization. SIAM J. Optim., 15(3):751–779 (electronic), 2005.
[BM88]
J. V. Burke and J. J. Moré. On the identification of active constraints. SIAM J. Numer. Anal., 25(5):1197–1211, 1988.
[Cla83]
F. H. Clarke. Optimization and Nonsmooth Analysis. Wiley, 1983; reprinted by SIAM, 1990.
[CSV09]
A. R. Conn, K. Scheinberg, and L. N. Vicente. Introduction to derivative-free optimization, volume 8 of MPS/SIAM Series on Optimization. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2009.
[CDV08]
A. L. Custódio, J. E. Dennis, Jr., and L. N. Vicente. Using simplex gradients of nonsmooth functions in direct search methods. IMA J. Numer. Anal., 28(4):770–784, 2008.
[CV07]
A. L. Custódio and L. N. Vicente. Using sampling and simplex derivatives in pattern search methods. SIAM J. Optim., 18(2):537–555, 2007.
[DHM06]
A. Daniilidis, W. Hare, and J. Malick. Geometrical interpretation of the predictor-corrector type algorithms in structured optimization problems. Optimization, 55(5-6):481–503, 2006.
[DV04]
R. Duvigneau and M. Visonneau. Hydrodynamic design using a derivative-free method. Structural and Multidisciplinary Optimization, 28:195–205, 2004. 10.1007/s00158-004-0414-z.
[Gup77]
A. M. Gupal. A method for the minimization of almost differentiable functions. Kibernetika (Kiev), (1):114–116, 1977.
[Har09]
W. L. Hare. A proximal method for identifying active manifolds. Comput. Optim. Appl., 43(2):295–306, 2009.
[Har10]
W. L. Hare. Using derivative free optimization for constrained parameter selection in a home and community care forecasting model. In International Perspectives on Operations Research and Health Care, Proceedings of the 34th Meeting of the EURO Working Group on Operational Research Applied to Health Sciences, pages 61–73, 2010.
[HL04]
W. L. Hare and A. S. Lewis. Identifying active constraints via partial smoothness and prox-regularity. J. Convex Anal., 11(2):251–266, 2004.
[HL07]
W. L. Hare and A. S. Lewis. Identifying active manifolds. Algorithmic Oper. Res., 2(2):75–82, 2007.
[Kiw10]
K. C. Kiwiel. A nonderivative version of the gradient sampling algorithm for nonsmooth nonconvex optimization. SIAM J. Optim., 20(4):1983–1994, 2010.
[KLT03]
T. G. Kolda, R. M. Lewis, and V. Torczon. Optimization by direct search: New perspectives on some classical and modern methods. SIAM Rev., 45(3):385–482, 2003.
[KLT06]
T. G. Kolda, R. M. Lewis, and V. Torczon. Stationarity results for generating set search for linearly constrained optimization. SIAM J. Optim., 17(4):943–968, 2006.
[LLS06]
G. Liuzzi, S. Lucidi, and M. Sciandrone. A derivative-free algorithm for linearly constrained finite minimax problems. SIAM J. Optim., 16(4):1054–1075, 2006.
[LV00]
L. Lukšan and J. Vlček. Test problems for nonsmooth unconstrained and linearly constrained optimization. Technical Report 798, Academy of Sciences of the Czech Republic, 2000. http://www3.cs.cas.cz/research/library/reports_700.shtml.
[MFT08]
A. L. Marsden, J. A. Feinstein, and C. A. Taylor. A computational framework for derivative-free optimization of cardiovascular geometries. Comput. Methods Appl. Mech. Engrg., 197(21-24):1890–1905, 2008.
[MS02]
R. Mifflin and C. Sagastizábal. Proximal points are on the fast track. J. Convex Anal., 9(2):563–579, 2002. Special issue on optimization (Montpellier, 2000).
[MS05]
R. Mifflin and C. Sagastizábal. A VU-algorithm for convex minimization. Math. Program., 104(2-3, Ser. B):583–608, 2005.
[MWDM07] A. L. Marsden, M. Wang, J. E. Dennis, Jr., and P. Moin. Trailing-edge noise reduction using derivative-free optimization and large-eddy simulation. J. Fluid Mech., 572:13–36, 2007.

[Wri93]
S. J. Wright. Identifiable surfaces in constrained optimization. SIAM J. Control Optim., 31(4):1063–1079, 1993.
[Wri10]
S. J. Wright. Accelerated block-coordinate relaxation for regularized optimization. Technical Report, http://www.optimization-online.org/DB_HTML/2010/08/ 2702.html, 2010.