Statistical approximations for stochastic linear programming problems

Annals of Operations Research 85(1999)173–192


Julia L. Higle and Suvrajeet Sen ★
Systems and Industrial Engineering Department, University of Arizona, Tucson, AZ 85721, USA

Sampling and decomposition constitute two of the most successful approaches for addressing large-scale problems arising in statistics and optimization, respectively. In recent years, these two approaches have been combined for the solution of large-scale stochastic linear programming problems. This paper presents the algorithmic motivation for such methods, as well as a broad overview of issues in algorithm design. We discuss basic schemes as well as computational enhancements and stopping rules. We also introduce a generalization of current algorithms to handle problems with random recourse.

Keywords: cutting plane algorithms, stochastic programming

1. Introduction

Optimal planning under uncertainty presents a wide spectrum of challenges, from model inception to the development of solution methodologies. Indeed, stochastic linear programming (SLP) models have given rise to some of the largest instances of linear programs (LP) on record. It is widely recognized that for all but the smallest of problems, SLP models often exceed the capabilities of commercially available LP solvers. Consequently, the solution of SLPs requires the use of specialized routines that exploit their structure. Decomposition methods have long proven to be useful in the solution of large-scale linear programs. Similarly, statistical estimation procedures involving randomly generated data are indispensable when addressing questions regarding the expected values of random variables. However, algorithms combining these approaches in the stochastic programming (SP) arena have been limited. Within the SP context, deterministic decomposition methods include the L-shaped method (Van Slyke and Wets [21]) and its variants (Birge [2]), and primal–dual methods such as the scenario aggregation algorithm (Rockafellar and Wets [17]) and the method of diagonal

★ This work was supported in part by Grant No. NSF-DMII-9414680 from the National Science Foundation.



quadratic approximations (Mulvey and Ruszczyński [15]). On the other hand, purely statistical approximations are the basis for successive sample mean optimization (SSMO) methods, which have been forwarded mainly in the context of simulation optimization (Healy [5], Rubinstein and Shapiro [18] and Plambeck et al. [16]). While it is not difficult to see that the advantages of decomposition methods and statistical approximations can be harnessed within one SP algorithm, there have been but a few attempts at developing such methods (Higle and Sen [6] and Infanger [11]). This apparent separation of the two approaches may be attributed to the disparity in the settings in which the two classes of methods have been developed. Decomposition methods are usually studied for specially structured problems, whereas SSMO methods are intended to be applicable to general purpose stochastic optimization problems arising in simulation. SLP problems are sufficiently well structured to warrant the use of decomposition techniques, yet difficult enough that the advantages of statistical approximations bear fruit. In this paper, we explore combinations of these two fundamental constructs (decomposition and statistical approximation) which result in powerful algorithmic approaches for SLP problems.

We begin this paper with brief discussions of traditional decomposition methods and methods based solely on statistical approximations. Following these discussions, we illustrate the combination of these constructs within a basic stochastic decomposition (SD) algorithm. We then present enhancements of the basic SD method which provide greater stability. Next, we present stopping rules that consider both "in-sample" and "out-of-sample" scenarios. Finally, we present a framework that allows us to solve problems with random recourse matrices.

2. Background and motivation

A two-stage stochastic linear program with recourse may be stated as

    (SLP)    Minimize    c^T x + E[h(x, ω̃)]
             subject to  A x ≤ b,

where ω̃ is a random variable defined on the probability space (Ω, A, P), and for each ω ∈ Ω,

    h(x, ω) = Minimize    g(ω)^T y        (1)
              subject to  W(ω) y = r(ω) − T(ω) x,  y ≥ 0.
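To make the two-stage structure concrete, the following sketch (not from the paper) evaluates the recourse function h(x, ω) of (1) for a single outcome with SciPy's LP solver; the matrices W, T and vectors g, r, x are small illustrative placeholders, and the dual vector is read from SciPy's HiGHS interface (available in recent SciPy versions).

```python
import numpy as np
from scipy.optimize import linprog

def recourse_value(x, W, T, r, g):
    """h(x, w) = min { g^T y : W y = r(w) - T(w) x, y >= 0 } for one outcome w."""
    res = linprog(c=g, A_eq=W, b_eq=r - T @ x,
                  bounds=[(0, None)] * len(g), method="highs")
    if res.status != 0:
        raise ValueError("second-stage LP is infeasible or unbounded for this outcome")
    # res.eqlin.marginals holds the dual multipliers pi(w) of the equality constraints.
    return res.fun, np.asarray(res.eqlin.marginals)

# Illustrative placeholder data (not from the paper).
W = np.array([[1.0, 1.0, -1.0], [2.0, 1.0, 0.0]])
T = np.array([[1.0, 0.0], [0.0, 1.0]])
r = np.array([4.0, 6.0])
g = np.array([1.0, 2.0, 5.0])
x = np.array([1.0, 1.0])

h, pi = recourse_value(x, W, T, r, g)
print("h(x, w) =", h, "  dual pi(w) =", pi)
```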

Randomness appears only in the second-stage data, whereas the first-stage data are assumed to be known with certainty. Two-stage SLP problems often arise in the context of strategic planning models. In such models, x denotes the vector of strategic decisions, which must be planned


while information regarding the future is imprecise or uncertain. This uncertainty is modeled via the random variable ω̃. The strategic decisions, x, are required to satisfy some linear constraints. After the strategic plan is in place, more accurate information becomes available (i.e. an outcome ω is observed), and tactical decisions y are undertaken by solving another linear program. There are several ways to generalize this model, allowing for multiple decision stages, alternative risk preferences etc. For the purposes of this paper, we will restrict our presentation to two-stage problems.

2.1. Deterministic decomposition in SLP

For brevity in exposition, we assume throughout that h(x, ω̃) < ∞ with probability one. That is, for all x satisfying Ax ≤ b, the LP in (1) is feasible with probability one. In the stochastic programming literature, this property is known as "relatively complete recourse". As stated in (1), the function h(x, ω) is a convex function of x (see Wets [22]). The recourse function,

    E[h(x, ω̃)] = ∫_Ω h(x, ω) P(dω),

is also a convex function of x. As such, it lends itself to solution via Kelley's cutting plane algorithm (Kelley [12]). Let H(x) = E[h(x, ω̃)], and define f(x) = c^T x + H(x) and X = {x | Ax ≤ b}. The kth piecewise linear lower bounding approximation of H(x) will be denoted by ν_k(x), and f_k(x) = c^T x + ν_k(x). Assuming X is bounded, Kelley's algorithm may be stated as:

Kelley's cutting plane algorithm

Step 0. Initialize. Let ε > 0 and x^1 ∈ X be given. Let k ← 0 and define ν_0(x) = −∞, u_0 = ∞, l_0 = −∞.

Step 1. Define/update the piecewise linear approximation. k ← k + 1. Evaluate H(x^k) and β^k ∈ ∂H(x^k). Define α_k = H(x^k) − (β^k)^T x^k.
  (a) ν_k(x) = Max{ν_{k−1}(x), α_k + (β^k)^T x}, and f_k(x) = c^T x + ν_k(x).
  (b) Update the upper bound: u_k = Min{u_{k−1}, f(x^k)}.

Step 2. Solve the LP master problem. Let x^{k+1} ∈ argmin{f_k(x) | x ∈ X}.

Step 3. Stopping rule. l_k = f_k(x^{k+1}). If u_k − l_k ≤ ε, stop. Otherwise, repeat from step 1.

The manner in which (α_k, β^k) are specified in step 1 of the algorithm is critical to ensuring that an optimal solution to Min{f(x) | x ∈ X} is eventually identified


through step 3 of the algorithm. Coefficients of the supporting hyperplane required in step 1 may be obtained from a dual solution of (1). Whenever (1) has a finite optimum, the LP dual yields

    h(x, ω) = Maximize    [r(ω) − T(ω) x]^T π        (2)
              subject to  W(ω)^T π ≤ g(ω).

Thus, if x̄ ∈ X is given and if we let π(ω) ∈ argmax{[r(ω) − T(ω) x̄]^T π | W(ω)^T π ≤ g(ω)}, then

    h(x, ω) ≥ [r(ω) − T(ω) x]^T π(ω)   ∀x ∈ X,

with equality ensured if x = x̄. Note that π(ω) depends on both x̄ and ω. Thus, given x^k and ω, let π^k(ω) ∈ argmax{[r(ω) − T(ω) x^k]^T π | W(ω)^T π ≤ g(ω)}. The subgradient coefficients in step 1 of Kelley's method may be obtained as

    α_k = ∫ r(ω)^T π^k(ω) P(dω) = E[r(ω̃)^T π^k(ω̃)],        (3a)

    β_k = ∫ −T(ω)^T π^k(ω) P(dω) = E[−T(ω̃)^T π^k(ω̃)].        (3b)
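For a small instance with finitely many equally likely outcomes, the expectations in (3a)–(3b) can be computed exactly by solving the dual LP (2) for every outcome, which is precisely the per-iteration burden discussed below. The following sketch (not from the paper) implements Kelley's cutting plane algorithm this way; the data, the box description of X, and the tolerance are illustrative placeholders, and dual values are taken from SciPy's HiGHS interface.

```python
import numpy as np
from scipy.optimize import linprog

# Small illustrative instance (placeholder data, not from the paper):
#   min c^T x + E[h(x, w)],  X = {0 <= x <= 10},
#   h(x, w) = min { g^T y : W y = r(w) - T x, y >= 0 },  outcomes equally likely.
# W = [I, -I] gives complete recourse, so every second-stage LP is feasible.
c = np.array([0.5, 0.5])
T = np.eye(2)
W = np.hstack([np.eye(2), -np.eye(2)])
g = np.array([1.0, 1.0, 2.0, 2.0])
scenarios = [np.array([4.0, 6.0]), np.array([6.0, 3.0]), np.array([5.0, 5.0])]
probs = np.full(len(scenarios), 1.0 / len(scenarios))
x_bounds = [(0.0, 10.0)] * 2

def second_stage(x, r):
    """Solve (1) for one outcome; return h(x, w) and the dual vector pi(w) of (2)."""
    res = linprog(c=g, A_eq=W, b_eq=r - T @ x, bounds=[(0, None)] * len(g), method="highs")
    return res.fun, np.asarray(res.eqlin.marginals)

x_k, cuts, u, eps = np.zeros(2), [], np.inf, 1e-8
for k in range(1, 50):
    # Step 1: evaluate H(x^k) and the cut coefficients (3a)-(3b) over all outcomes.
    H_k, alpha, beta = 0.0, 0.0, np.zeros(2)
    for p, r in zip(probs, scenarios):
        val, pi = second_stage(x_k, r)
        H_k += p * val
        alpha += p * (r @ pi)          # (3a)
        beta += p * (-T.T @ pi)        # (3b)
    u = min(u, c @ x_k + H_k)          # step 1(b): upper bound
    cuts.append((alpha, beta))
    # Step 2: master LP in (x, eta):  min c^T x + eta  s.t.  eta >= alpha_t + beta_t^T x.
    A_ub = np.array([np.append(b_t, -1.0) for _, b_t in cuts])
    b_ub = np.array([-a_t for a_t, _ in cuts])
    res = linprog(np.append(c, 1.0), A_ub=A_ub, b_ub=b_ub,
                  bounds=x_bounds + [(None, None)], method="highs")
    x_k, l = res.x[:-1], res.fun       # next iterate and lower bound l_k
    if u - l <= eps:                   # step 3: stopping rule
        break
print("x* ≈", x_k, "  gap:", u - l)
```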


This algorithm can be interpreted as a decomposition method for dual block angular linear programs. Hence, it is sometimes referred to as Benders' decomposition (Benders [1]). In the stochastic programming literature, this is also known as the L-shaped method (Van Slyke and Wets [21]). In order to appreciate the computational challenges inherent in the solution of SLP, it is important to recognize the magnitude of the requirements associated with (3). Specifically, note that the subgradient coefficients specified in (3) require the implicit solution of the linear program in (1)–(2) for every possible realization of the random variable ω̃. If there are only a few possible realizations, this poses no computational burden. However, in most cases, this exact evaluation easily exceeds computational capabilities. For example, if there are only 10 independent random variables with 3 outcomes each, then there are a total of 3^10, or 59,049, possible outcomes. This figure represents the number of linear programs that would have to be solved in each iteration of Kelley's method. Hence, for larger and more realistic problems, the computational burden associated with such algorithms makes them impractical.

2.2. Successive sample mean optimization (SSMO) in SLP

The SLP objective function involves the minimization of an expected value function, E[h(x, ω̃)]. As such, it is natural to approximate SLP problems with a large


number of outcomes by using sample mean approximations of this function. To do so, let n_k denote the sample size in iteration k and let {ω^t}_{t=1}^{n_k} denote the outcomes associated with the sample. Define

    H_k(x) = (1/n_k) Σ_{t=1}^{n_k} h(x, ω^t),

and let

    (SSMO)    x^k ∈ argmin{c^T x + H_k(x) | x ∈ X}.

It can be shown that if n_k → ∞ as k → ∞, then the sequence {x^k} converges to an optimal solution with probability one (see King and Wets [13], or chapter 2 of Higle and Sen [10]). This statistical approximation scheme goes under several names: the stochastic counterpart method (Rubinstein and Shapiro [18]), sample path optimization (Plambeck et al. [16]) and retrospective optimization (Healy [5]). While these algorithms originated in the simulation literature, their application to SP has been reported for a PERT problem in Plambeck et al. [16]. In any event, it is important to observe that unlike decomposition methods, where errors (in any iteration) result from relaxations or restrictions, errors in SSMO methods are statistical in nature, unavoidable, and difficult to quantify. Moreover, the problems solved in (SSMO) are themselves SLP problems. Since sampling methods typically call for a fairly large sample size, the SLP approximations may themselves become unwieldy. We are therefore interested in methods that will work with more convenient approximations. This motivates us to combine statistical approximations and decomposition methods in such a manner that the advantages of each of these classes can be realized within one algorithm.
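As an illustration of the (SSMO) recursion, the sketch below (not from the paper) solves each sample mean problem by assembling its extensive form LP over the sampled outcomes, so the LP grows with n_k; the complete-recourse data (W = [I, −I]) and the sampling distribution are placeholders chosen only to keep every subproblem feasible.

```python
import numpy as np
from scipy.optimize import linprog
rng = np.random.default_rng(0)

# Illustrative data (placeholders): complete recourse, W = [I, -I].
c = np.array([0.5, 0.5])
T = np.eye(2)
W = np.hstack([np.eye(2), -np.eye(2)])
g = np.array([1.0, 1.0, 2.0, 2.0])
x_bounds = [(0.0, 10.0)] * 2

def sample_mean_solution(outcomes):
    """Solve (SSMO): the extensive form of min c^T x + H_k(x) over the sampled outcomes."""
    n, m, ny = len(outcomes), T.shape[0], W.shape[1]
    nvar = 2 + n * ny                          # x followed by one y-block per outcome
    obj = np.concatenate([c, np.tile(g / n, n)])
    A_eq = np.zeros((n * m, nvar))
    b_eq = np.zeros(n * m)
    for t, r in enumerate(outcomes):           # T x + W y_t = r(w^t)
        A_eq[t*m:(t+1)*m, :2] = T
        A_eq[t*m:(t+1)*m, 2 + t*ny: 2 + (t+1)*ny] = W
        b_eq[t*m:(t+1)*m] = r
    bounds = x_bounds + [(0, None)] * (n * ny)
    res = linprog(c=obj, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    return res.x[:2], res.fun

outcomes = []
for k in [10, 50, 250]:                        # increasing sample sizes n_k
    while len(outcomes) < k:
        outcomes.append(rng.uniform(3.0, 7.0, size=2))   # a draw of r(w)
    x_k, val = sample_mean_solution(outcomes)
    print(k, x_k, val)
```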

3. Stochastic decomposition basics

Stochastic decomposition (SD) refers to a class of methods that incorporate statistical approximations within decomposition algorithms. The method, proposed in Higle and Sen [6], incorporates sample-based approximations within Kelley's method (i.e., the L-shaped method or Benders' decomposition). In each iteration of the SD method, a new observation of ω̃ is generated and used to form an approximation of the sample mean function. That is, instead of optimizing the sample mean function as in SSMO methods, an approximation consisting of hyperplanes that provide a lower bound on the sample mean function is used. This approach allows us to optimize a "Max" function as in Kelley's method, rather than the more complicated sample mean function. However, the approach is subject to two sources of error: one is due to the statistical approximation used, and the other is due to the lower bounding approximation used. In order to ensure asymptotic convergence (with probability one), these errors must vanish appropriately.


In the kth iteration, the SD algorithm creates a piecewise linear function that provides an approximation of the sample mean function,

    H_k(x) = (1/k) Σ_{t=1}^{k} h(x, ω^t).        (4)

This is a lower bounding approximation, and unlike the approximations derived in Kelley's method, it is in general a strict lower bound. Unlike Kelley's method, where the cuts are denoted by (α_t, β_t), t = 1,…, k, we will use two identifiers to specify cuts in the SD method. In this notation, the cuts are denoted by (α_t^k, β_t^k), where the superscript indicates that the cuts define the kth approximation, f_k, and the subscript provides the iteration in which a particular cut was generated, t = 1,…, k. This extended notation is necessary because cuts in the SD method are updated from one iteration to the next. Thus, in each iteration, the SD algorithm constructs one new hyperplane denoted by (α_k^k, β_k^k), and updates previously generated hyperplanes (α_t^k, β_t^k), where t ∈ J_{k−1} = {1,…, k − 1}. Letting J_k = J_{k−1} ∪ {k}, the approximation process used in SD will ensure that

    α_t^k + (β_t^k)^T x ≤ H_k(x)   ∀t ∈ J_k and x ∈ X.

The objective function approximation used in iteration k is

    f_k(x) = c^T x + Max_{t ∈ J_k} {α_t^k + (β_t^k)^T x}.        (5)

The details underlying the calculation of cut coefficients are discussed next.

3.1. Generating a new cut

In applying Kelley's method to SLP, the formation of a new cut required the value and subgradient associated with h(x^k, ω) for almost all ω. If we replace the expectations in (3) with sample means using a sample {ω^1,…, ω^k}, the resulting algorithm is a stochastic cutting plane algorithm. However, since the sample size increases with k, the requirement that we calculate h(x^k, ω^t) and its subgradient for t = 1,…, k becomes time-consuming. For the class of SLP problems in which W(ω̃) = W, a fixed matrix, and g(ω̃) = g, a fixed second-stage cost vector, the cuts can be generated by calculating the value of h for only one outcome (i.e., h(x^k, ω^k)) and calculating a lower bound on the value of the others (i.e., {h(x^k, ω^t)}_{t=1}^{k−1}). This is the cut formation methodology suggested in Higle and Sen [6] and is summarized below. At each iteration, we will solve one second-stage LP (of the form (1)) so that the function value h(x^k, ω^k) and its subgradient can be evaluated. Let π_k^k denote an optimal dual multiplier revealed in iteration k. That is,

    π_k^k ∈ argmax{[r(ω^k) − T(ω^k) x^k]^T π | W^T π ≤ g}.        (6)

Within an SD algorithm, these dual multipliers are retained in a set Π_k, so that Π_k = Π_{k−1} ∪ {π_k^k}. This information is used to calculate lower bounding functions on h(x, ω^t), t = 1,…, k − 1. Let


    π_t^k ∈ argmax{[r(ω^t) − T(ω^t) x^k]^T π | π ∈ Π_k}.        (7)

Since Π_k is a subset of the dual vertices of the second-stage problem,

    h(x, ω^t) ≥ [r(ω^t) − T(ω^t) x]^T π_t^k.

Moreover, this inequality may be strict even at x = x^k if t ≠ k. For computational purposes, it is important to note that the calculations in (7) constitute a matrix multiplication which can be performed in a recursive manner. For a detailed discussion, the reader is referred to Sen et al. [20]. Finally, we obtain the new cut by using {π_t^k}_{t=1}^{k} defined in (6)–(7) as follows:

    α_k^k = (1/k) Σ_{t=1}^{k} [r(ω^t)]^T π_t^k,    β_k^k = (1/k) Σ_{t=1}^{k} [−T(ω^t)]^T π_t^k,        (8)

and note that by construction

    (1/k) Σ_{t=1}^{k} h(x, ω^t) ≥ α_k^k + (β_k^k)^T x.
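The cut formation in (6)–(8) can be sketched as follows (not from the paper), assuming fixed W and g: one second-stage LP is solved for the newest outcome, its dual vertex is added to Π_k, and all outcomes are priced against the stored vertices via the argmax in (7), which is just a matrix–vector product; the function and argument names are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def new_sd_cut(x_k, outcomes, Pi, W, T, g):
    """Form (alpha_k^k, beta_k^k) as in (8).

    outcomes : list of right-hand-side vectors r(w^1), ..., r(w^k); the last one is new.
    Pi       : list of previously stored dual vertices; the vertex for w^k is appended here.
    Assumes relatively complete recourse, so the second-stage LP is feasible.
    """
    r_new = outcomes[-1]
    res = linprog(c=g, A_eq=W, b_eq=r_new - T @ x_k,
                  bounds=[(0, None)] * len(g), method="highs")     # solve (1) once
    Pi.append(np.asarray(res.eqlin.marginals))                     # pi_k^k from (6)

    k, alpha, beta = len(outcomes), 0.0, np.zeros(len(x_k))
    duals = np.array(Pi)                                           # |Pi_k| x m matrix
    for r in outcomes:
        scores = duals @ (r - T @ x_k)                             # the argmax in (7)
        pi_t = duals[np.argmax(scores)]
        alpha += (r @ pi_t) / k
        beta += (-T.T @ pi_t) / k
    return alpha, beta, Pi
```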

3.2. Updating previously generated cuts

Higle and Sen [8] explore the use of algorithmically generated approximations in the solution of optimization problems. Although their setting extends well beyond SLP problems, their results prove useful in this context nonetheless. Based upon their results, to obtain an optimal solution it is necessary to ensure that the cutting plane approximations provide lower bounds on the recourse function asymptotically. When they are initially defined, the cutting planes provide lower bounds on the sample mean function. However, as iterations progress, the sample size increases, so that the sample mean function is actually a dynamic entity. To retain their lower bounding nature, the cut coefficients must be similarly dynamic. For this reason, all previously generated cuts must be updated. In the following, we assume that a lower bound on the recourse function is available. For example, suppose that h(x, ω̃) ≥ 0 (with probability 1). Then

    H_k(x) = (1/k) Σ_{t=1}^{k} h(x, ω^t)
           = ((k−1)/k) [ (1/(k−1)) Σ_{t=1}^{k−1} h(x, ω^t) ] + (1/k) h(x, ω^k)
           ≥ ((k−1)/k) [ (1/(k−1)) Σ_{t=1}^{k−1} h(x, ω^t) ]
           = ((k−1)/k) H_{k−1}(x).

Now, suppose that in the first k − 1 iterations we have accumulated a collection of cutting plane coefficients, {(α_t^{k−1}, β_t^{k−1})}_{t=1}^{k−1}, such that


    H_{k−1}(x) = (1/(k−1)) Σ_{t=1}^{k−1} h(x, ω^t) ≥ α_t^{k−1} + (β_t^{k−1})^T x   ∀x ∈ X, t = 1,…, k − 1.

It follows that

    H_k(x) = (1/k) Σ_{t=1}^{k} h(x, ω^t) ≥ ((k−1)/k) {α_t^{k−1} + (β_t^{k−1})^T x}   ∀x ∈ X, t = 1,…, k − 1.

This leads to a simple mechanism for updating previously derived cutting plane coefficients which preserves the required lower bounding nature as iterations progress (and hence, as the sample size increases). That is, we simply require

    α_t^k ← ((k−1)/k) α_t^{k−1},    β_t^k ← ((k−1)/k) β_t^{k−1},    t = 1,…, k − 1.        (9)

Of course, if the lower bound is given as L ≠ 0, then we use α_t^k ← ((k−1)/k) α_t^{k−1} + (1/k) L, and the update of β_t^k is not altered.

3.3. Summary of a basic SD algorithm

A basic SD algorithm is summarized below. The structure follows Kelley's method discussed in the previous section. However, since SD uses randomly generated observations of ω̃, one only has statistical estimates of f(x^k). It follows that termination criteria are necessarily more involved than one finds in deterministic methods (see section 5). As a result, steps analogous to 1(b) and 3 of Kelley's method are not included in the algorithmic statement below.

A basic stochastic decomposition method

Step 0. Initialize. Let x^1 ∈ X be given. Let k ← 0.

Step 1. Define/update the piecewise linear approximation. k ← k + 1. Generate ω^k, an outcome of ω̃. Evaluate cut coefficients (α_k^k, β_k^k) as given in (8) and update previous cuts as in (9). Update J_k, and define f_k(x) as in (5).

Step 2. Solve the LP master problem. Let x^{k+1} ∈ argmin{f_k(x) | x ∈ X}.

Step 3. Repeat from step 1.
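Putting the pieces together, the following sketch (not from the paper) runs the basic SD iteration: draw one new outcome, rescale the old cuts as in (9) (with L = 0), add the new cut via the new_sd_cut routine sketched after (8), and re-solve the piecewise linear master over a box approximation of X; all names and data are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def basic_sd(sample, x0, W, T, g, c, x_bounds, iters=200):
    """Basic SD loop; new_sd_cut is the illustrative cut routine sketched after (8)."""
    x_k, Pi, outcomes, cuts = np.asarray(x0, float), [], [], []
    for k in range(1, iters + 1):
        outcomes.append(next(sample))                                  # step 1: new outcome of w~
        cuts = [((k - 1) / k * a, (k - 1) / k * b) for a, b in cuts]   # update old cuts, (9) with L = 0
        a_k, b_k, Pi = new_sd_cut(x_k, outcomes, Pi, W, T, g)          # new cut from (6)-(8)
        cuts.append((a_k, b_k))
        # step 2: master  min c^T x + eta  s.t.  eta >= a_t + b_t^T x,  x in X (a box here)
        A_ub = np.array([np.append(b, -1.0) for _, b in cuts])
        b_ub = np.array([-a for a, _ in cuts])
        res = linprog(np.append(c, 1.0), A_ub=A_ub, b_ub=b_ub,
                      bounds=list(x_bounds) + [(None, None)], method="highs")
        x_k = res.x[:-1]
    return x_k

# Illustrative use, with the complete-recourse data and rng of the earlier sketches:
# stream = (rng.uniform(3.0, 7.0, size=2) for _ in iter(int, 1))   # endless outcome generator
# x_hat = basic_sd(stream, [0.0, 0.0], W, T, g, c, x_bounds)
```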

4. Computational enhancements

The approximations used by cutting plane algorithms (deterministic or stochastic) are piecewise linear functions which have the potential to provide arbitrarily close approximations, especially near an optimal solution. However, these basic methods suffer from the following drawback: proofs of convergence are often based on retaining


all cuts generated during the course of the algorithm, that is, J_k = {1,…, k}. The cuts often result in dense linear inequalities, and the approximations may require arbitrarily many cuts. Consequently, computer memory can become a scarce resource. On the other hand, deleting inequalities indiscriminately can cause the solutions and approximations to oscillate. In this section, we will summarize enhancements that guarantee convergence, while ensuring that there is a fixed upper limit on the size of the master program. To discuss the enhancements of SD, it is convenient to first discuss enhancements of Kelley's method. The adaptation to SD will then follow readily.

4.1. Some enhancements for Kelley's algorithm

Recall that the sequence of points in Kelley's method is generated according to the recursion

    x^{k+1} ∈ argmin{f_k(x) | x ∈ X}.

It turns out that the sequence of values {f(x^k)} need not be a descending sequence. However, it is possible to extract a descending subsequence during the iterative process; the solutions associated with such a subsequence will be referred to as incumbent solutions, the sequence of which we denote as {x̄^k}. The exact manner in which an incumbent is identified is discussed subsequently. For the remainder of this section, it will be convenient to rewrite the approximation in terms of the direction of change (displacement) d from an incumbent x̄^k; that is, x ≡ x̄^k + d. In order to write the cuts as a function of d, we define υ_t^k = α_t + (β_t)^T x̄^k; that is, υ_t^k denotes the value of the cut α_t + (β_t)^T x at the incumbent point x̄^k. Hence, the approximation, in terms of the direction vector d, may be written as

    υ_k(d) = Max_{t ∈ J_k} {υ_t^k + (β_t)^T d}.        (10)

Given the relation x = x̄^k + d, we have υ_k(d) = ν_k(x), where the latter has the same meaning as in step 1(a) of Kelley's method. With the incumbents defined above, we now proceed to a discussion of regularization. The idea here is to augment the piecewise linear approximation with a strictly convex casting function (Wets [23]) or an auxiliary function (Culioli and Cohen [3]). The most commonly used casting function is a quadratic proximal term (see Kiwiel [14], Ruszczyński [19]). Given σ > 0, the regularized master problem, which is a direction finding problem, in the kth iteration is defined as

    (PRM^k)    Minimize    υ + c^T d + (σ/2) ‖d‖²
               subject to  υ − (β_t)^T d ≥ υ_t^k   ∀t ∈ J_k,
                           x̄^k + d ∈ X.
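As an illustration of the regularized master, the sketch below (not from the paper) solves one instance of (PRM^k) with CVXPY, taking X to be a simple box; the cut data, σ, and the incumbent are placeholders, and the cut multipliers θ_t^k are read from the constraint dual values.

```python
import numpy as np
import cvxpy as cp

def solve_prm(c, cuts, x_inc, sigma, lb, ub):
    """Solve (PRM^k): min v + c^T d + (sigma/2)||d||^2  s.t.  v >= v_t + beta_t^T d, x_inc + d in X.

    cuts is a list of (alpha_t, beta_t); v_t = alpha_t + beta_t^T x_inc is the cut value
    at the incumbent. X is taken to be the box lb <= x <= ub for this sketch.
    Returns the step d^k and the Lagrange multipliers theta_t^k of the cut constraints.
    """
    n = len(c)
    d, v = cp.Variable(n), cp.Variable()
    cut_cons = [v >= (a + b @ x_inc) + b @ d for a, b in cuts]
    box_cons = [x_inc + d >= lb, x_inc + d <= ub]
    prob = cp.Problem(cp.Minimize(v + c @ d + (sigma / 2) * cp.sum_squares(d)),
                      cut_cons + box_cons)
    prob.solve()
    theta = np.array([con.dual_value for con in cut_cons])
    return d.value, theta

# Illustrative use: two cuts around an incumbent at the origin.
cuts = [(1.0, np.array([-1.0, 0.5])), (0.5, np.array([0.2, -0.8]))]
d_k, theta_k = solve_prm(np.array([0.5, 0.5]), cuts, np.zeros(2), sigma=1.0,
                         lb=np.zeros(2), ub=10 * np.ones(2))
print(d_k, theta_k)
```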


Since Kelley’s method is a deterministic algorithm, υk (0) = ν k (x k ) = H( x k ).

(11)

We now state a prototypical regularized deterministic cutting plane algorithm.

A regularized version of Kelley's algorithm

Step 0. Initialize. Let x^1 ∈ X be given. Let k ← 0, x̄^0 ← x^1, J_0 ← ∅, ∆_1 ← ∞, and let σ > 0 and γ ∈ (0, 1) be given.

Step 1. Define/update the piecewise linear approximation. k ← k + 1. Evaluate H(x^k) and β^k ∈ ∂H(x^k). Define α_k = H(x^k) − (β^k)^T x^k. Let J_k = {t ∈ J_{k−1} | θ_t^{k−1} > 0} ∪ {k}.
  (a) f_k(x) = c^T x + Max_{t ∈ J_k} {α_t + (β_t)^T x}.
  (b) If f(x^k) ≤ f(x̄^{k−1}) − γ ∆_k, then x̄^k = x^k. Else, x̄^k = x̄^{k−1}.

Step 2. Solve the regularized master. Let x^{k+1} = x̄^k + d^k, where d^k denotes an optimal solution to PRM^k. Similarly, let θ_t^k denote the optimal Lagrange multipliers associated with the cuts indexed by t ∈ J_k.

Step 3. Stopping rule. Let ∆_{k+1} = H(x̄^k) − [υ_k(d^k) + c^T d^k]. If ‖d^k‖ = 0, then stop. Else repeat from step 1.

Because of the need to calculate the value H(x^k) as well as its subgradient, this regularized algorithm suffers from some of the same drawbacks as Kelley's method. Nevertheless, this variant has several important properties. First note from (10), (11) and the definition of d^k (the optimal solution of PRM^k) that

    υ_k(d^k) + c^T d^k ≤ υ_k(0) = ν_k(x̄^k) = H(x̄^k)

    ⇒ υ_k(d^k) + c^T(x̄^k + d^k) ≤ c^T x̄^k + H(x̄^k)
    ⇒ c^T x^{k+1} + ν_k(x^{k+1}) ≤ c^T x̄^k + H(x̄^k).

Hence, in step 3, we see that ∆_k ≥ 0, and consequently the sequence of incumbents provides a descending sequence of objective values. However, in step 1(b), we note that the incumbent is updated based upon how well ∆_{k+1} (calculated at iteration k, prior to the evaluation of H(x^{k+1})) predicts the observed change in the subsequent iteration, after H(x^{k+1}) has been evaluated. With large values of γ, the incumbent change occurs only when the prediction is accurate. On the other hand, smaller values of γ yield a less stringent criterion. The analysis of this algorithm relies heavily on the dual to the primal regularized master (PRM^k). In fact, Kiwiel [14] states the algorithm in terms of the dual to (PRM^k). To write the dual, let V_k denote the vector of scalars {υ_t^k}_{t ∈ J_k}, and similarly, let B_k denote a matrix whose rows are given by the cut coefficients {β_t^T}_{t ∈ J_k}. Furthermore, let b_k denote the vector A x̄^k − b. Then the dual to PRM^k is


    (DRM^k)    Max         V_k^T θ + b_k^T λ − (1/(2σ)) ‖c + B_k^T θ + A^T λ‖²
               subject to  1^T θ = 1,  θ ≥ 0,  λ ≥ 0.

One of the important relationships between the primal and dual optimal solutions for the pair PRM^k and DRM^k is the following:

    d = −(1/σ) (c + B_k^T θ + A^T λ).        (12)

Finally, the total number of cutting planes retained in step 1 cannot exceed n + 2. Because σ > 0, both primal and dual master problems are strictly convex programs, and consequently both have unique optimal solutions. Suppose that d = d^k in (12), where d^k is the optimal solution for PRM^k. The optimal dual solution (θ^k, λ^k) must satisfy (12) together with the dual feasibility conditions in DRM^k. These constitute n + 1 equalities (n from (12) and one from the convexity constraint in DRM^k) in nonnegative variables (θ, λ). Note that (θ^k, λ^k) must be an extreme point of the polyhedron determined by (12) and the dual feasibility restrictions; for if not, then (θ^k, λ^k) could be written as a convex combination of two other points with the same dual objective value, which contradicts the uniqueness of the dual optimum. Since there are at most n + 1 equations (other than the nonnegativity restrictions), it follows that at most n + 1 components of θ^k can be positive. Thus, by including the new cut, we conclude that the cardinality of J_k cannot exceed n + 2.

4.2. Enhancements for the SD algorithm

In extending the ideas of section 4.1 to the SD setting, we need to specify rules for retaining cuts that will maintain asymptotic results for the algorithm. In addition, when defining the incumbent solution, it is important to recognize that the objective function approximations are based on statistical estimates. These two issues are intimately related, and are discussed here. In selecting an incumbent solution, the function H is replaced by the lower bounding statistically approximated function ν_k. These estimates are sufficient to retain asymptotic optimality, as verified in Higle and Sen [6]. A critical component in the verification of optimality is a requirement that ν_k(x̄^k) provides an accurate estimate of H(x̄^k) as k → ∞, while the errors in the subgradient estimates at the incumbent must also vanish asymptotically. Because of these requirements, estimates of the objective function value and its subgradients at the incumbent solution must be updated periodically. Suppose that at iteration k, we know that the most recent iteration at which an incumbent cut was generated (or updated) was i_k < k. Furthermore, let us suppose that we intend to update the incumbent cut every τ iterations (1 ≤ τ < ∞). Then, at an


iteration in which k = i_k + τ, the incumbent cut is updated to reflect the impact of outcomes as well as the dual vertices that may have been generated since iteration i_k. These updates then guarantee that the sample size used for the incumbent cut grows indefinitely, as required by the law of large numbers. Thus, it is important to ensure that the cut derived from the incumbent solution is retained. At iteration k, let t_k denote the index associated with the cut derived from the incumbent solution, x̄^k. Then the rule used for cut retention is

    J_k = {t ∈ J_{k−1} | θ_t^{k−1} > 0} ∪ {k, t_k}.        (13)

Hence, the maximum number of cuts used in the regularized master for SD is n + 3. Combining the above enhancements with the SD method, we obtain the following algorithm (Higle and Sen [9]).

A regularized version of the SD algorithm

Step 0. Initialize. Assume that x^1 ∈ X is given. Let an integer τ, and real scalars σ > 0 and γ ∈ (0, 1) be given. Let k ← 0, x̄^0 ← x^1, J_0 ← ∅, ∆_1 ← ∞, i_1 = t_1 = 1.

Step 1. Define/update the piecewise linear approximation. k ← k + 1. Generate ω^k, an outcome of ω̃. Evaluate cut coefficients (α_k^k, β_k^k) as given in (8) and update previous cuts as in (9). Update J_k as in (13).
  (a) If k = i_k + τ, then update the coefficients (α_t^k, β_t^k) for t = t_k, and i_k ← k. Define f_k(x) = c^T x + Max_{t ∈ J_k} {α_t^k + (β_t^k)^T x}.
  (b) If f_k(x^k) ≤ f_k(x̄^{k−1}) − γ ∆_k, then x̄^k ← x^k, i_k ← k, and t_k ← k. Else, x̄^k = x̄^{k−1}.

Step 2. Solve the regularized master. Let x^{k+1} = x̄^k + d^k, where d^k denotes an optimal solution to PRM^k. Similarly, let θ_t^k denote the optimal Lagrange multipliers associated with the cuts indexed by t ∈ J_k. ∆_{k+1} = υ_k(0) − [υ_k(d^k) + c^T d^k].

Step 3. Repeat from step 1.

We remind the reader that, as in the basic SD algorithm, the stopping rule should be designed with greater care than for the deterministic counterpart. The following section addresses these issues.
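A condensed sketch (not from the paper) of one pass of the regularized SD method is given below, reusing the illustrative new_sd_cut (after (8)) and solve_prm (after (PRM^k)) routines. Several simplifications are made for brevity: the incumbent cut refresh of step 1(a) is omitted, the retention rule keeps cuts with positive multipliers plus the incumbent and newest cuts as in (13), and all names are illustrative.

```python
import numpy as np

def regularized_sd(sample, x0, W, T, g, c, lb, ub, sigma=1.0, gamma=0.2, iters=100):
    """Sketch of regularized SD; relies on new_sd_cut (eq. (8)) and solve_prm (PRM^k) above."""
    x_inc = x_k = np.asarray(x0, float)
    Pi, outcomes, cuts, inc_idx, delta = [], [], [], 0, np.inf
    for k in range(1, iters + 1):
        outcomes.append(next(sample))
        cuts = [((k - 1) / k * a, (k - 1) / k * b) for a, b in cuts]   # update (9), L = 0
        a_k, b_k, Pi = new_sd_cut(x_k, outcomes, Pi, W, T, g)
        cuts.append((a_k, b_k))
        f = lambda z: c @ z + max(a + b @ z for a, b in cuts)          # f_k of (5)
        if f(x_k) <= f(x_inc) - gamma * delta:                         # incumbent test, step 1(b)
            x_inc, inc_idx = x_k, len(cuts) - 1
        d_k, theta = solve_prm(c, cuts, x_inc, sigma, lb, ub)          # regularized master
        delta = (max(a + b @ x_inc for a, b in cuts)
                 - (max(a + b @ (x_inc + d_k) for a, b in cuts) + c @ d_k))
        keep = {i for i, th in enumerate(theta) if th > 1e-8} | {inc_idx, len(cuts) - 1}
        cuts = [cuts[i] for i in sorted(keep)]                         # cut retention, (13)
        inc_idx = sorted(keep).index(inc_idx)                          # re-index the incumbent cut
        x_k = x_inc + d_k
    return x_inc
```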

5. Stopping rules

In stochastic programming, sampling is used in a variety of ways; in some algorithms, sampling is performed prior to using an optimization algorithm (as in SSMO), and in others, sampling and optimization are intimately intertwined (as in SD). In either case, it is important to investigate whether a given solution may be sensitive to


additional data (sampling). In this section, we discuss two methods for investigating this sensitivity. The first, based on bootstrapping, is appropriate when the optimization procedure uses sampled data, although our presentation is customized for the SD setting. The second method is applicable to any algorithm that replaces the original sample space with an approximation.

5.1. Stopping rule using "in-sample" scenarios

The stopping rule in this subsection uses the empirical distribution associated with the sampled observations {ω^t}_{t=1}^{k} in place of the original distribution. In particular, we use bootstrapping to re-sample primal and dual problems associated with the regularized master problem. The regularized primal and dual master problems that we use in these tests have the same form as the primal–dual pair PRM^k and DRM^k discussed in section 4.1. Suppose that in iteration K we wish to test whether the primal incumbent (x̄^K) and the corresponding dual solutions (θ^K, λ^K) are sensitive to additional sampling. Statistical tests rely upon replication with independent data as a means to indicate that the appearance of optimality is not due to random variation. The bootstrap method (Efron [4]) undertakes this replication using the observed empirical distribution, thereby increasing the efficiency of the computations involved. Thus, using the bootstrap method, we test the quality of (x̄^K, θ^K, λ^K) by re-sampling the cuts and then evaluating the duality gap which results from using the given solutions in the re-sampled primal and dual problems. If a large proportion of the re-sampled pairs of problems (primal–dual) indicate that the point (x̄^K, θ^K, λ^K) is acceptable, then we may conclude that the given solution is sufficiently good. We now proceed to discuss details of the method.

To start the bootstrapping procedure, suppose that we have K observations of ω̃, {ω^t}_{t=1}^{K}. To sample from the resulting empirical distribution, we may randomly draw K values from the set of indices {1,…, K}, with replacement, which we label as {t(i)}_{i=1}^{K}. The bootstrapped sample then consists of the observations {ω^{t(i)}}_{i=1}^{K}. This bootstrapped sample is used to reconstruct the primal and dual representations of the master program in an effort to assess the extent to which the apparent quality of the solution (x̄^K, θ^K, λ^K) is sample dependent. This reconstruction requires a recalculation of the cutting plane coefficients. In order to illustrate the manner in which this reconstruction is undertaken, consider a cut that was generated by the SD algorithm at iteration t. Recall that the first t outcomes were used to form this cut. The re-sampled cut that we create uses those bootstrapped observations for which t(i) ≤ t. For the cut indexed by t, let π_j^t denote the subproblem dual vertex used in iteration t with the observation ω^j. The cut coefficients of the re-sampled cut are

    α̂_t^K = (1/K) Σ_{i : t(i) ≤ t, i ≤ K} r(ω^{t(i)})^T π_{t(i)}^{t},        (14a)

and

    β̂_t^K = −(1/K) Σ_{i : t(i) ≤ t, i ≤ K} T(ω^{t(i)})^T π_{t(i)}^{t}.        (14b)
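A small sketch (not from the paper) of the re-sampled cut coefficients (14a)–(14b), assuming that for the cut formed at iteration t the pairs (ω^j, π_j^t) have been stored as required below; list layouts and names are illustrative.

```python
import numpy as np
rng = np.random.default_rng(0)

def bootstrap_cut(t, K, r_list, T_list, pi_list):
    """Re-sampled coefficients (alpha_hat_t^K, beta_hat_t^K) of the cut formed at iteration t.

    r_list[j], T_list[j] : data r(w^{j+1}), T(w^{j+1}) for the first K observations.
    pi_list[j]           : dual vertex paired with observation j+1 when cut t was formed (j < t).
    """
    idx = rng.integers(1, K + 1, size=K)       # bootstrap draw t(1), ..., t(K), with replacement
    n = T_list[0].shape[1]
    alpha_hat, beta_hat = 0.0, np.zeros(n)
    for ti in idx:
        if ti <= t:                            # only draws with t(i) <= t enter the sums
            pi = pi_list[ti - 1]
            alpha_hat += r_list[ti - 1] @ pi / K
            beta_hat += -T_list[ti - 1].T @ pi / K
    return alpha_hat, beta_hat
```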

The formation of the coefficients in (14) requires the collection of dual vectors that were used to form the cut being replicated. Hence, for each of the cuts, this form of bootstrapping requires us to store the correspondence between observation ω^t and the dual multipliers used for cut generation. Given a set of replicated cutting plane coefficients denoted by {(α̂_t^K, β̂_t^K)}_{t ∈ J_K}, we can formulate replicated primal and dual problems, denoted by P̂^K and D̂^K, respectively. These primal and dual problems have the same form as PRM^K and DRM^K, respectively, and we use x̄^K as the incumbent. Accordingly, let υ̂_t^K = α̂_t^K + (β̂_t^K)^T x̄^K and define

    û = Max_{t ∈ J_K} υ̂_t^K.        (15a)

Then û is an upper bound on the optimal value of P̂^K. Analogously, we underestimate the optimal value of D̂^K. Letting V̂_K denote the vector of scalars {υ̂_t^K}_{t ∈ J_K} and B̂_K denote the matrix of cut coefficients {β̂_t^K}_{t ∈ J_K}, the dual objective evaluated at (θ^K, λ^K) is

    l̂ = V̂_K^T θ^K + b_K^T λ^K − (1/(2σ)) ‖c + B̂_K^T θ^K + A^T λ^K‖².        (15b)

Since (θ^K, λ^K) is feasible for D̂^K, it follows that l̂ is a lower bound on its optimal value. Hence, û − l̂ ≤ ε (for a small enough value of ε) implies that the primal–dual vector (x̄^K, θ^K, λ^K) is acceptable for the replication. By repeating this re-sampling process several times, we are able to obtain an empirical estimate of the duality gap for the proposed primal–dual pair. If a significant proportion of these empirical estimates are sufficiently close to zero, then we can conclude that the proposed solution is acceptable. Unlike previous attempts at designing stopping rules for a sampling-based algorithm (Higle and Sen [7]), the rules proposed in this subsection can be implemented without solving any LPs: the calculations only require function evaluations as prescribed in (15a) and (15b).

5.2. Stopping rule using "out-of-sample" scenarios

The bootstrap procedure discussed in the previous section provides a computationally efficient method to assess the variability of the duality gap using previously generated observations. Alternatively, one may prefer to assess the quality of the solution using independently generated observations. The test described in this subsection can be used with any algorithm that uses approximations of the objective function.


Suppose that x̄ is proposed as a solution to (SLP). To test the quality of the solution, we will perform L independent tests, each involving M observations of ω̃, for a total of M · L observations, {ω^t}_{t=1}^{ML}. That is, L denotes the number of batches that will be investigated, and M the size of each batch. In order to form the lth batch, we use the observations ω^t, t = M(l − 1) + 1,…, Ml. With batches generated in this manner, we can adopt the duality-based criteria described in Higle and Sen [10]. In the remainder of this subsection, we illustrate one such procedure based on estimating duality gaps. For any outcome ω, we define a primal value as

    φ(x̄, ω) = c^T x̄ + Minimize    g^T y        (16a)
                       subject to  W y = r(ω) − T(ω) x̄,        (16b)
                                   y ≥ 0.        (16c)

Using LP duality, one can show that φ(x̄, ω) can also be calculated as follows:

    φ(x̄, ω) = Maximize    λ(ω)^T b + π(ω)^T r(ω) + ξ(ω)^T x̄        (17a)
               subject to  A^T λ(ω) + T(ω)^T π(ω) + ξ(ω) = c,        (17b)
                           W^T π(ω) ≤ g,        (17c)
                           λ(ω) ≤ 0.        (17d)

Moreover, an optimal solution is characterized by the existence of an optimal dual solution for which E[ξ(ω̃)] = 0. Thus, in order to estimate the duality gap, suppose that for each observation ω^t, we are given a vector of nonanticipativity multipliers denoted ξ̂^t. For any particular batch of observations denoted B_l, we will impose the restriction that the nonanticipativity multipliers satisfy (1/M) Σ_{t ∈ B_l} ξ̂^t = 0. In the procedure described below, we show how such multipliers can be obtained. In any event, we associate the following dual value with the outcome ω^t:

    −φ*(ξ̂^t, ω^t) = Minimize    (c − ξ̂^t)^T x + g^T y        (18a)
                     subject to  A x ≤ b,        (18b)
                                 T(ω^t) x + W y = r(ω^t),        (18c)
                                 y ≥ 0.        (18d)

We now present a criterion that uses out-of-sample observations to assess the quality of the solution x̄.

Criterion: Estimated duality gap

Step 1. Let x̄ ∈ X, integers L and M, a random sample {ω^t}_{t=1}^{LM}, and ε > 0 be given.

Step 2. (a) Using (17), calculate φ(x̄, ω^t) for t = 1,…, LM. Let ξ^t be part of a solution to (17).
        (b) For l = 1,…, L, let ξ̄_l = (1/M) Σ_{t ∈ B_l} ξ^t.

Step 3. Let ξ̂^t = ξ^t − ξ̄_l, t ∈ B_l, l = 1,…, L. Evaluate −φ*(ξ̂^t, ω^t) for t = 1,…, LM, as in (18).

Step 4. For l = 1,…, L, define δ_l = (1/M) Σ_{t ∈ B_l} (φ(x̄, ω^t) + φ*(ξ̂^t, ω^t)). If δ_l < ε for a sufficiently large fraction of the estimates {δ_l}_{l=1}^{L}, then declare x̄ an acceptable solution.
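The following sketch (not from the paper) implements steps 2–4 of this criterion for a fixed-recourse instance, using SciPy LP solves for (17) (via the primal LP with x fixed, whose equality multipliers supply ξ^t) and for (18); it assumes the problems in (18) remain bounded for the data used, and all data structures and names are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def primal_value_and_duals(x_bar, r_t, A, b, T, W, c, g):
    """phi(x_bar, w^t) via the LP behind (16)-(17): fix x = x_bar and read off xi."""
    n, ny, m = len(c), W.shape[1], W.shape[0]
    obj = np.concatenate([c, g])
    A_ub = np.hstack([A, np.zeros((A.shape[0], ny))])
    A_eq = np.vstack([np.hstack([T, W]),                               # T x + W y = r(w)
                      np.hstack([np.eye(n), np.zeros((n, ny))])])      # x = x_bar (duals give xi)
    b_eq = np.concatenate([r_t, x_bar])
    res = linprog(obj, A_ub=A_ub, b_ub=b, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(None, None)] * n + [(0, None)] * ny, method="highs")
    xi = np.asarray(res.eqlin.marginals)[m:]                           # multipliers of x = x_bar
    return res.fun, xi

def dual_value(xi_hat, r_t, A, b, T, W, c, g):
    """-phi*(xi_hat, w^t) as in (18): the scenario LP with perturbed first-stage cost."""
    n, ny = len(c), W.shape[1]
    obj = np.concatenate([c - xi_hat, g])
    A_ub = np.hstack([A, np.zeros((A.shape[0], ny))])
    res = linprog(obj, A_ub=A_ub, b_ub=b, A_eq=np.hstack([T, W]), b_eq=r_t,
                  bounds=[(None, None)] * n + [(0, None)] * ny, method="highs")
    return res.fun

def estimated_duality_gaps(x_bar, outcomes, L, M, A, b, T, W, c, g):
    """Steps 2-4 of the criterion: batch the multipliers and return the gap estimates delta_l."""
    vals, xis = zip(*(primal_value_and_duals(x_bar, r, A, b, T, W, c, g) for r in outcomes))
    deltas = []
    for l in range(L):
        batch = range(l * M, (l + 1) * M)
        xi_bar = sum(xis[t] for t in batch) / M        # centering enforces (1/M) sum xi_hat = 0
        gap = sum(vals[t] - dual_value(xis[t] - xi_bar, outcomes[t], A, b, T, W, c, g)
                  for t in batch) / M
        deltas.append(gap)
    return deltas
```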

6. Stochastic decomposition for problems with random recourse

The details of the SD algorithm have usually been discussed under the assumption that the technology matrix in the second-stage LP is deterministic, as are the objective function coefficients of the second-stage LP. In some applications, especially in financial planning, these parameters can also depend on random variables. Such problems are sometimes designated as "random recourse problems". Consider the second-stage LP of a stochastic program, and suppose that we partition the objective function coefficients g into deterministic and stochastic components, i.e., g(ω)^T = (g_1^T, g_2(ω)^T). Similarly, the recourse matrix W(ω) may be partitioned as W(ω) = [W_1, W_2(ω)]. In many real-world problems, the number of deterministic components is considerably larger than the number of stochastic components, and decomposing these parts may lead to a large deterministic part, together with a collection of smaller stochastic subproblems. This is the motivation behind our approach to random recourse. In any event, the second-stage problem may be represented as follows:

    h(x, ω) = Minimize    g_1^T y_1 + g_2(ω)^T y_2        (19)
              subject to  W_1 y_1 + W_2(ω) y_2 = r(ω) − T(ω) x,
                          y_1, y_2 ≥ 0.

Maintaining our assumption that the value of h is finite, the dual to (19) is given by

    h(x, ω) = Maximize    [r(ω) − T(ω) x]^T π        (20)
              subject to  W_1^T π ≤ g_1,
                          W_2(ω)^T π ≤ g_2(ω).

Unlike the fixed recourse formulation, in which W(ω) is fixed, the set of feasible multipliers varies with the outcome ω. That is, the set of feasible multipliers may be represented as Π(ω) = Π_1 ∩ Π_2(ω), where

    Π_1 = {π | W_1^T π ≤ g_1}   and   Π_2(ω) = {π | W_2(ω)^T π ≤ g_2(ω)}.

To date, SD algorithms have not addressed issues related to such problems (with random recourse). In order to accommodate them, we observe that the vertices of Π_1 may be used to represent solutions in Π_2(ω). Hence, we may use the vertices of Π_1 to create restricted subproblems. We outline one such method below; several variations of the algorithm are possible. Our algorithm will utilize a Dantzig–Wolfe (D–W) type approach in which the extreme points and directions of Π_1 are used for points in Π_2(ω). In iteration k, we have the observations {ω^t}_{t=1}^{k}, each of which may be associated with an LP subproblem. As with standard SD, the most recent outcome, ω^k, will be used to generate vertices of the set Π_1. These vertices are then used to approximate the solution of a subproblem. The details are provided next.

In each iteration, one vertex or extreme direction of the convex polyhedron Π_1 is identified. Let Π_1^{k−1} denote the subset of dual vertices of Π_1 that has been generated during the first k − 1 iterations. Similarly, let E_1^{k−1} denote the set of extreme directions generated during the first k − 1 iterations. In iteration k, we first generate an outcome (ω^k). Given this outcome and a solution vector x^k, we generate a linear program whose variables correspond to the weights of the vertices in Π_1^{k−1}. The resulting problem (which is reminiscent of the Dantzig–Wolfe master) is as follows:

    Maximize_{λ, μ}   [r(ω^k) − T(ω^k) x^k]^T ( Σ_{π^j ∈ Π_1^{k−1}} π^j λ_j + Σ_{e^i ∈ E_1^{k−1}} e^i μ_i )        (21a)

    subject to        W_2(ω^k)^T ( Σ_{π^j ∈ Π_1^{k−1}} π^j λ_j + Σ_{e^i ∈ E_1^{k−1}} e^i μ_i ) ≤ g_2(ω^k),        (21b)

                      Σ_j λ_j = 1,  λ_j ≥ 0 ∀j,  μ_i ≥ 0 ∀i.        (21c)

Let y(ω^k) denote an optimal dual multiplier associated with (21b). Then we use the following as a vertex/direction generation problem:

    Maximize_π    [r(ω^k) − T(ω^k) x^k − W_2(ω^k) y(ω^k)]^T π        (22)
    subject to    W_1^T π ≤ g_1.

If (22) has a finite optimum, then we generate a vertex π^k ∈ Π_1 and update Π_1^k = Π_1^{k−1} ∪ {π^k}; in this case, E_1^k = E_1^{k−1}. On the other hand, if (22) is unbounded, then an extreme direction e^k is generated, and we update E_1^k = E_1^{k−1} ∪ {e^k} and Π_1^k = Π_1^{k−1}. Using the updated sets Π_1^k and E_1^k, we now solve the following restriction for all outcomes generated thus far ({ω^1,…, ω^k}):