Computing Worst-case Input Models in Stochastic Simulation


arXiv:1507.05609v1 [math.PR] 19 Jul 2015

Soumyadip Ghosh Business Analytics and Math Sciences, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, [email protected]

Henry Lam Department of Industrial and Operations Engineering, University of Michigan, Ann Arbor, MI 48109, [email protected]

Any performance analysis based on stochastic simulation is subject to the errors inherent in misspecifying the modeling assumptions, particularly the input distributions. One way to quantify such errors, especially in situations with little support from data, is to carry out a worst-case analysis, with the partial, nonparametric knowledge of the input model represented via optimization constraints. We design and analyze a numerical scheme for a general class of simulation objectives and uncertainty specifications, based on a Frank-Wolfe (FW) variant of the stochastic approximation (SA) method (which we call FWSA). A key step involves the construction of simulable unbiased gradient estimators using a suitable nonparametric analog of the classical likelihood ratio method. A convergence analysis for FWSA on non-convex problems is provided. We illustrate our approach numerically with robustness analyses for a canonical M/GI/1 queue under uncertain service time distribution specification, and for an energy storage facility acting as a hedge against uncertainty in local renewable generation.

1. Introduction Simulation-based performance analysis of stochastic models, or what is usually known as stochastic simulation, is built on input model assumptions that to some extent deviate from the truth. For example, estimating the performance of a queueing system using a metric such as expected workload requires specifying the input distributions for interarrival and service times. Since the choices of input distributions are often approximately correct at best, the performance evaluation is subject to input model uncertainty. Common solution techniques often aim to obtain measures of the output variability due to model uncertainty in contexts where input data are available. These


include bootstrapping (e.g. Barton et al. (2013)), its metamodel-assisted counterparts (e.g. Xie et al. (2015)), various goodness-of-fit tests (e.g. Banks et al. (2009)), and Bayesian model selection (e.g. Chick (2001)), among others. In this paper, we focus on a setting that diverges from most of the past literature: we are primarily interested in situations with insufficient data, or when the modeler wants to assess risk beyond what the data indicate. One way to handle uncertainty in such settings is a worst-case approach. The modeler first represents the partial, nonparametric beliefs about the input model as constraints, and then finds the worst-case model among all those satisfying the constraints. This line of analysis yields not only measurements of the worst-case performance but also, importantly, characterizes the model that attains the most extreme downside risk. Indeed, worst-case (often known synonymously as robust) approaches have been popular in areas like stochastic control (e.g. Hansen and Sargent (2008), Petersen et al. (2000)) and distributionally robust optimization (e.g. Delage and Ye (2010), Lim et al. (2006)). However, the methodologies presented in these literatures often target a specific structured decision problem and form of model uncertainty. The focus of this paper is to build a corresponding general methodology for handling broader classes of simulation models and uncertainty constraints. 1.1. Problem Statement In this paper, we confine ourselves to analyzing the effects of uncertainty in specifying the marginal distributions of an i.i.d. input process. Although this certainly does not cover all situations, the i.i.d. assumption is rather prevalent in simulation modeling and analysis; see, for instance, Law and Kelton (2000) for applications in manufacturing and production management. To facilitate discussion, we denote a target performance measure as $\mathbb{E}_p[h(X)]$, where $X = (X_1, X_2, \ldots, X_T) \in \mathcal{X}^T$ are T i.i.d.
random variables on the space X , each with distribution p, and h(·) : X T → R is a cost function that can be evaluated given the realization of X. We denote Ep [·] as the expectation under the T -fold product measure of p. As an example, one can think of X as a sequence of interarrival times in a queue, and p as the interarrival time distribution. h(X)


can be the indicator function of waiting times exceeding a threshold. Under uncertainty of p, we formulate the worst-case optimization as

$$\min_{p \in \mathcal{A}} Z(p) = \mathbb{E}_p[h(X)] \quad \text{and/or} \quad \max_{p \in \mathcal{A}} Z(p) = \mathbb{E}_p[h(X)] \tag{1}$$

where the probability distribution p is the decision variable, ranging over the set of possible models given by A. Adopting the terminology in robust optimization (Ben-Tal et al. (2009)), we shall call A the uncertainty set. One example where (1) is applicable is the following. Suppose that, given few data points, the modeler initially takes a parametric approach to choosing an input model. The best fit indicated by statistical goodness-of-fit tests may not be a perfect fit, and the true model, while "close" to the chosen parametric form with optimal parameters, is a non-zero distance away in the model space. In this case, the modeler can resort to (1) where A is the set of all models that are close to the fit. This closeness can be represented via a small statistical distance (a distance notion defined between two probability distributions that is not restricted to any parametric class), or by matching some moments, or both. Our objective is to find an efficient method to compute the worst-case scenarios defined by formulation (1) for a broad class of simulation models (as reflected in the cost function h) and a wide variety of input model uncertainty characterizations (as modeled by the uncertainty set A). While previous studies posing similar types of optimizations frequently make structural assumptions on the objective function h that allow decomposing the objective Z(p) into linear functions of p (e.g. Hansen and Sargent (2008), Iyengar (2005), Nilim and El Ghaoui (2005)), these assumptions do not hold for simulation models with nonlinear input-output relations. Therefore, we design a simulation-based iterative procedure for finding local optima of (1) using a modified version of the celebrated stochastic approximation (SA) method (e.g. Kushner and Yin (2003)), which is a stochasticization of first-order methods in deterministic nonlinear optimization.


1.2. Our Contributions We make several contributions in implementing the SA method for worst-case optimizations. The first is the construction of an unbiased gradient estimator for Z w.r.t. p based on the idea of the Gateaux derivative for functionals of probability distributions (Serfling (2009)), which is used to obtain the direction of each subsequent iterate in the SA scheme. The need for such a construction is motivated by the difficulty in naïve implementation of standard gradient estimators: as a probability distribution, not all perturbation directions of p lead to a gradient representation that has probabilistic meaning and subsequently allows simulation-based estimation. The Gateaux idea results in a simulable gradient that can be viewed as a nonparametric version of the classical likelihood ratio (also called the score function) method (Glynn (1990), Reiman and Weiss (1989)). The second contribution is the design and analysis of our constrained SA scheme. We choose to use a stochastic counterpart of the conditional gradient, or so-called Frank-Wolfe (FW) (Frank and Wolfe (1956)), method in deterministic nonlinear programming. For convenience we call our scheme FWSA. Note that a standard SA iteration follows the estimated gradient up to a prespecified step size to find the next candidate iterate. When the formulation includes constraints, the common approach in the SA literature projects the candidate solution onto the feasible region in order to define the next iterate (e.g. Kushner and Yin (2003)). Instead, our method looks in advance for a feasible direction along which the next iterate is guaranteed to lie in the feasible region. In order to find this feasible direction, an optimization subprogram with a linear objective function and the (convex) feasible region A is solved in each iteration. We base our choice of FWSA on its computational benefit in solving these subprograms.
For common types of uncertainty sets, the linear-objective programs can be solved efficiently for high-dimensional p, whereas for quadratic or other nonlinear objectives (which appear in projection-type SA methods), efficient solution schemes are not as readily available. We characterize the convergence rate of FWSA as a function of the pre-specified parameters that, in each iteration, define the step size selections and the number of simulation replications used to


estimate the gradients. The form of our convergence bounds suggests prescriptions for the step-size and sample-size sequences that are work-efficient, where "work" refers to the cumulative number of sample paths of X that have been simulated to generate all the gradients until the current iterate. To the best of our knowledge, this is the first convergence rate analysis for a stochastic FW method on non-convex problems (the only related work is Kushner (1974), which proves almost sure convergence under assumptions that can prescribe algorithmic specifications only for one-dimensional settings). This analysis will thus be of independent interest when the FW variant of the SA method is preferred for other settings of constrained stochastic optimization, though this general formulation is not the primary focus here. Finally, we provide numerical validation of our approach using two stochastic systems. The first is the canonical M/GI/1 queue, whose optimal solution can be approximated directly via steady-state analysis, thus providing a good testing ground for our procedure. The second example is an emerging decision-making problem from the energy analytics area (Harsha and Dahleh (2014)) that balances the uncertainty in renewables generation (solar panels, windmills, etc.) with energy storage devices (large batteries, etc.) to provide stable electrical output. We analyze the robustness of the average operational cost of the system to uncertainty in the fitted distribution of the renewables output for specific threshold-based operation policies.

1.3. Literature Review We briefly survey three lines of related work. First, our paper is related to the large literature on input model uncertainty. In the parametric regime, studies have focused on the construction of confidence intervals or variance decompositions to account for both parameter and stochastic uncertainty, using for instance the delta method (Cheng and Holland (1997)), bootstrapping (Barton et al. (2013), Ankenman and Nelson (2012)), the Bayesian approach (Zouaoui and Wilson (2003), Saltelli et al. (2010, 2008)), and metamodel-assisted analysis (Xie et al. (2014, 2015)). Beyond a single parametric model, model risk is conventionally handled through directly comparing


among several models (Morini (2011)), or by Bayesian model selection and averaging (Chick (2001), Zouaoui and Wilson (2004)), which select or combine the output estimates from them. Second, our formulation (1) relates to the literature on distributionally robust optimization (Delage and Ye (2010), Goh and Sim (2010), Ben-Tal et al. (2013)), where the focus is to make decision rules under stochastic environments that are robust against the ambiguity of the underlying probability distributions. This is usually cast in the form of a minimax problem where the inner maximization is over the space of distributions. This idea has spanned multiple areas like economics (Hansen and Sargent (2001, 2008)), finance (Glasserman and Xu (2013), Lim et al. (2011)), and control theory (Petersen et al. (2000), Iyengar (2005), Nilim and El Ghaoui (2005)). Optimizations over probability distributions have also arisen as so-called moment problems, which have been applied to decision analysis (Smith (1995, 1993)) and stochastic programming (Birge and Wets (1987), Bertsimas and Popescu (2005)). Recently, Glasserman and Xu (2014) proposed a simulation-based scheme to solve a class of optimizations that quantify model risk, where they used the Kullback-Leibler (KL) divergence to measure model discrepancy from a baseline model. Along this line, Lam (2013) developed infinitesimal approximations in a similar setup but under the presence of an i.i.d. input process, as the KL divergence shrinks to zero. Finally, our algorithm relates to the literature on the FW method (Frank and Wolfe (1956)) and constrained SA. For the FW method, the classical work of Canon and Cullum (1968), Dunn (1979) and Dunn (1980) analyzed convergence properties for deterministic convex programs. Recently, Jaggi (2013) and Freund and Grigas (2014) carried out new finite-time analyses of the FW method motivated by machine learning applications.
A general study of the convergence rate of FW for non-convex problems appears to be open. Regarding SA, Fu (1994), Kushner and Yin (2003) and Pasupathy and Kim (2011) reviewed convergence results. In the case of stochastic adaptations of FW, Kushner (1974) focused on almost sure convergence based on a set of assumptions about the probabilistic behavior of the iterations, which were then used to tune the algorithm for one-dimensional problems. Other types of constrained SA schemes include the Lagrangian method (Buche and Kushner (2002)) and mirror descent SA (Nemirovski et al. (2009)).


1.4. Organization of the Paper The remainder of this paper is organized as follows. Section 2 describes our assumptions and notation. Section 3 focuses on gradient estimation on the space of probability distributions. Section 4 introduces the FWSA procedure. Section 5 presents some theoretical convergence results on FWSA, and Section 6 shows some numerical performance. Section 7 concludes and suggests future research. The Appendix contains two additional results: the extension to multiple input models (Appendix EC.1) and an analysis on the discretization errors of continuous input distributions (Appendix EC.2). Finally, Appendix EC.3 gives the proofs for most of the results in the paper.

2. Assumptions and Notations
We assume a discrete finite-support distribution p for our input model over some space $\mathcal{X}$. This assumption is made mainly because our optimization involves p as the decision variables, and being finite-dimensional allows us to construct tractable algorithms. (We discuss some results on discretization approximations to continuous distributions in Appendix EC.2.) We use $p = (p^{(i)})_{i=1,\ldots,n}$ to represent the distribution on n support points $\{x^{(1)}, \ldots, x^{(n)}\}$. For convenience, we write
$$\mathcal{P} = \Big\{ p = (p^{(i)})_{i=1,\ldots,n} \in \mathbb{R}^n : \sum_{i=1}^n p^{(i)} = 1,\ p^{(i)} \geq 0,\ i = 1, \ldots, n \Big\}$$
as the probability simplex on $\mathcal{X} = \{x^{(1)}, \ldots, x^{(n)}\}$. Note that, as p is discrete, the performance measure can be expressed as
$$Z(p) = \mathbb{E}_p[h(X)] = \sum_{i_1} \cdots \sum_{i_T} h(x^{(i_1)}, \ldots, x^{(i_T)})\, p^{(i_1)} \cdots p^{(i_T)}. \tag{2}$$
In other words, $\mathbb{E}_p[h(X)]$ is a high-degree polynomial in p. In practice, exhaustive enumeration of all the summands in (2) is often difficult and simulation is needed to approximate $\mathbb{E}_p[h(X)]$; for example, when n = 50 and T = 20, exact computation of (2) requires $50^{20}$ summands. Throughout the paper $I(E)$ denotes the indicator function for the event E, $'$ denotes transpose, and $\|x\|$ denotes the Euclidean norm of a vector x. We also write $\mathrm{Var}_p(\cdot)$ for the variance under the input distribution p. We consider a feasible region $\mathcal{A}$ that is convex. Examples of $\mathcal{A}$ include:


1. Constraints given as a neighborhood surrounding a baseline model measured by statistical distance: Denote a fixed distribution $p_b = (p_b^{(i)})_{i=1,\ldots,n}$ with $p_b^{(i)} > 0$ for all i (this can be, for instance, the best selected model from some statistical procedure). Denote
$$d_\phi(p, p_b) = \sum_{i=1}^n p_b^{(i)} \phi\big(p^{(i)}/p_b^{(i)}\big)$$
as the φ-divergence between an arbitrary distribution $p \in \mathcal{P}$ and the baseline model $p_b$. Set $\mathcal{A} = \{p \in \mathcal{P} : d_\phi(p, p_b) \leq \eta\}$. The function φ is typically a convex function satisfying φ(1) = 0. Common examples are $\phi(x) = x \log x - x + 1$, which gives $d_\phi$ as the KL divergence, $\phi(x) = (x-1)^2$ for the (modified) $\chi^2$-distance, and $\phi(x) = (1 - \theta + \theta x - x^\theta)/(\theta(1-\theta))$, $\theta \neq 0, 1$, for the Cressie-Read divergence. Details of φ-divergences can be found in, e.g., Pardo (2005) and Ben-Tal et al. (2013).
2. Moment constraints: Without loss of generality, we can write $\mathcal{A} = \{p \in \mathcal{P} : \mathbb{E}_p[r_j(X)] \leq \mu_j,\ j = 1, \ldots, s\}$, where X is a generic random variable with distribution p and $r_j : \mathcal{X} \to \mathbb{R}$. For instance, $r_j(x)$ can be x or $x^2$, corresponding to the first two moments.
3. Covariance constraints: For $\mathcal{X} \subset \mathbb{R}^d$, we can use matrix inequality constraints to incorporate the mean and correlation information among the different components of X. For example (Delage and Ye (2010)),
$$\mathcal{A} = \{p \in \mathcal{P} : (\mathbb{E}_p X - \mu)' \Sigma^{-1} (\mathbb{E}_p X - \mu) \leq \gamma_1,\ \mathbb{E}_p (X - \mu)(X - \mu)' \preceq \gamma_2 \Sigma\}$$
where $\mu \in \mathbb{R}^d$ is some baseline mean vector, $\Sigma \in \mathbb{R}^{d \times d}$ is some baseline covariance matrix, $\gamma_1$ and $\gamma_2$ are constants, and $\preceq$ denotes matrix inequality, i.e. for square matrices A and B of the same dimension, $A \preceq B$ if and only if $B - A$ is positive semidefinite.
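The set constructions above are straightforward to check numerically. The following is a minimal sketch (the helper names `in_phi_ball` and `in_moment_set` are ours, not from the paper) that tests membership of a candidate distribution p in a φ-divergence ball or in a moment-constrained set:

```python
import numpy as np

def in_phi_ball(p, pb, eta, phi):
    """Check d_phi(p, pb) = sum_i pb_i * phi(p_i / pb_i) <= eta (assumes pb_i > 0)."""
    p, pb = np.asarray(p, float), np.asarray(pb, float)
    return np.sum(pb * phi(p / pb)) <= eta

# phi for the KL divergence (value 1 at x = 0 by continuity) and modified chi^2
kl = lambda x: np.where(x > 0, x * np.log(np.where(x > 0, x, 1.0)) - x + 1, 1.0)
chi2 = lambda x: (x - 1) ** 2

def in_moment_set(p, x, r_list, mu_list):
    """Check E_p[r_j(X)] <= mu_j for each moment function r_j on support x."""
    p, x = np.asarray(p, float), np.asarray(x, float)
    return all(np.sum(p * r(x)) <= mu for r, mu in zip(r_list, mu_list))
```

For example, with a uniform baseline on three points, the distribution (0.5, 0.25, 0.25) sits at KL divergence about 0.059 and modified χ² distance 0.125 from the baseline, so it lies inside an η = 0.1 KL ball but outside an η = 0.1 χ² ball.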

3. Gradient Estimation on Probability Simplices via a Nonparametric Likelihood Ratio Method
We describe an implementable method to extract the gradient information of Z(p). Note that the standard gradient of Z(p), which we denote $\nabla Z(p)$, obtained through naïve differentiation of Z(p), may not lead to any simulable object. This is because an arbitrary perturbation of p may shoot out of the probability simplex, and the resulting gradient will be a high-degree polynomial in p that may have no probabilistic interpretation. This issue can be addressed by looking only


at perturbations within the simplex, which resembles the Gateaux derivative of a functional of probability distributions (Serfling (2009)), as follows. Given any p, define a mixture distribution $(1-\epsilon)p + \epsilon \mathbf{1}^{(i)}$, where $\mathbf{1}^{(i)}$ is a point mass on $x^{(i)}$, i.e. $\mathbf{1}^{(i)} = (0, 0, \ldots, 1, \ldots, 0)$ with the 1 at the i-th coordinate. The number $0 \leq \epsilon \leq 1$ is the mixture parameter. When ε = 0, this reduces to the given distribution p. We now treat ε as a parameter and differentiate $Z((1-\epsilon)p + \epsilon \mathbf{1}^{(i)})$ with respect to ε for each i. The resulting vector of derivatives, which we call ψ(p), possesses two properties:

Theorem 1. Given $p = (p^{(i)})_{i=1,\ldots,n}$ where $p^{(i)} > 0$ for all i, define $\psi(p) = (\psi^{(i)}(p))_{i=1,\ldots,n} \in \mathbb{R}^n$, where $\psi^{(i)}(p) = (d/d\epsilon) Z((1-\epsilon)p + \epsilon \mathbf{1}^{(i)}) \big|_{\epsilon=0}$. Then:
1. We have
$$\nabla Z(p)'(q - p) = \psi(p)'(q - p) \tag{3}$$
for any $q \in \mathcal{P}$.
2. ψ(p) satisfies
$$\psi^{(i)}(p) = \mathbb{E}_p[h(X) s^{(i)}(X)] \tag{4}$$
where $s^{(i)}(\cdot)$ is defined as
$$s^{(i)}(x_1, \ldots, x_T) = \sum_{t=1}^T \frac{I(x_t = x^{(i)})}{p^{(i)}} - T \tag{5}$$

The first property above states that ψ(p) and $\nabla Z(p)$ are identical when viewed as directional derivatives, as long as the direction lies within $\mathcal{P}$. Hence, working on the space $\mathcal{P}$, it suffices to focus on ψ(p). The second property states that ψ(p) can be estimated unbiasedly in a way similar to the classical likelihood ratio method (Glynn (1990), Reiman and Weiss (1989)), with $s^{(i)}(\cdot)$ playing the role of the score function. Since this representation holds without assuming any specific parametric form for p, it can be viewed as a nonparametric version of the likelihood ratio method.

Proof of Theorem 1.

To prove 1., consider first a mixture of p with an arbitrary $q \in \mathcal{P}$, in the form $(1-\epsilon)p + \epsilon q$. It satisfies
$$\frac{d}{d\epsilon} Z((1-\epsilon)p + \epsilon q)\Big|_{\epsilon=0} = \nabla Z(p)'(q - p)$$
by the chain rule. In particular, we must have
$$\psi^{(i)}(p) = \nabla Z(p)'(\mathbf{1}^{(i)} - p) = \partial^{(i)} Z(p) - \nabla Z(p)' p \tag{6}$$
where $\partial^{(i)} Z(p)$ denotes the i-th coordinate partial derivative of Z. Writing (6) for all i together gives $\psi(p) = \nabla Z(p) - (\nabla Z(p)' p)\mathbf{1}$, where $\mathbf{1} \in \mathbb{R}^n$ is a vector of ones. Therefore
$$\psi(p)'(q - p) = (\nabla Z(p) - (\nabla Z(p)' p)\mathbf{1})'(q - p) = \nabla Z(p)'(q - p)$$
which shows (3). To prove 2., note that, by the likelihood ratio method, we have
$$\psi^{(i)}(p) = \frac{d}{d\epsilon} Z((1-\epsilon)p + \epsilon \mathbf{1}^{(i)})\Big|_{\epsilon=0} = \frac{d}{d\epsilon} \mathbb{E}_{(1-\epsilon)p + \epsilon \mathbf{1}^{(i)}}[h(X)]\Big|_{\epsilon=0} = \mathbb{E}_p[h(X) s^{(i)}(X)] \tag{7}$$
where $s^{(i)}(\cdot)$ is the score function defined as
$$s^{(i)}(x_1, \ldots, x_T) = \sum_{t=1}^T \frac{d}{d\epsilon} \log\big((1-\epsilon)p(x_t) + \epsilon I(x_t = x^{(i)})\big)\Big|_{\epsilon=0}. \tag{8}$$
Here $p(x_t) = p^{(j)}$ where j is chosen such that $x_t = x^{(j)}$. Note that (8) can be further written as
$$\sum_{t=1}^T \frac{-p(x_t) + I(x_t = x^{(i)})}{p(x_t)} = -T + \sum_{t=1}^T \frac{I(x_t = x^{(i)})}{p(x_t)} = -T + \sum_{t=1}^T \frac{I(x_t = x^{(i)})}{p^{(i)}}$$
which leads to (4).

The following provides a bound on the variance of the estimator for $\psi^{(i)}(p)$ (see Appendix EC.3 for the proof):

Lemma 1. Assume h(X) is bounded a.s., i.e. $|h(X)| \leq M$ for some M > 0, and that $p = (p^{(i)})_{i=1,\ldots,n}$ satisfies $p^{(i)} > 0$ for all $i = 1, \ldots, n$. Each sample for estimating $\psi^{(i)}(p)$, using one sample path of X, possesses a variance bounded from above by $M^2 T (1 - p^{(i)})/p^{(i)}$.


4. Frank-Wolfe Stochastic Approximation (FWSA)
With the implementable form of the gradient ψ(p) described in Section 3, we design a stochastic version of nonlinear programming techniques to solve $\min_{p \in \mathcal{A}} Z(p)$. We choose the Frank-Wolfe method because, for the types of $\mathcal{A}$ we consider in Section 2, effective routines exist for solving the induced linearized subprograms (as will be shown in Section 4.2).

4.1. Description of the Algorithm
To avoid repetition we focus only on the minimization formulation in (1). FWSA works as follows. First, pretending that $\nabla Z(p)$ can be computed exactly, it iteratively updates a solution sequence $p_1, p_2, \ldots$ as follows. Given a current solution $p_k$, solve
$$\min_{p \in \mathcal{A}} \nabla Z(p_k)'(p - p_k) \tag{9}$$
Let the optimal solution to (9) be $q_k$. The quantity $q_k - p_k$ gives a feasible minimization direction starting from $p_k$ (note that $\mathcal{A}$ is convex). This is then used to update $p_k$ to $p_{k+1}$ via $p_{k+1} = p_k + \epsilon_k(q_k - p_k)$ for some step size $\epsilon_k$. This expression can be rewritten as $p_{k+1} = (1-\epsilon_k)p_k + \epsilon_k q_k$, which can be interpreted as a mixture of the distributions $p_k$ and $q_k$. When $\nabla Z(p_k)$ is not exactly known, one can replace it by an empirical counterpart. Theorem 1 suggests that we can replace $\nabla Z(p_k)$ by $\psi(p_k)$, and so the empirical counterpart of (9) is
$$\min_{p \in \mathcal{A}} \hat\psi(p_k)'(p - p_k) \tag{10}$$
where $\hat\psi(p_k)$ is an estimator of $\psi(p_k)$ using a sample size $m_k$. Letting $\hat q_k$ be the optimal solution to (10), the update rule becomes $p_{k+1} = (1-\epsilon_k)p_k + \epsilon_k \hat q_k$ for some step size $\epsilon_k$. The sample size $m_k$ at each step needs to grow suitably to compensate for the bias introduced in solving (10). All these steps are summarized in Algorithm 1.

Algorithm 1 FWSA for solving (1)
Initialization: $p_1 = (p_1^{(i)})_{i=1,\ldots,n}$ with $p_1^{(i)} > 0$ for all i.
Input: Step size sequence $\epsilon_k$ and sample size sequence $m_k$, $k = 1, 2, \ldots$.
Procedure: For each iteration $k = 1, 2, \ldots$, given $p_k$:
1. Repeat $m_k$ times: compute $h(X)s^{(i)}(X)$ for all $i = 1, \ldots, n$ using one sample path X, where $s^{(i)}(X) = \sum_{t=1}^T I(X_t = x^{(i)})/p_k^{(i)} - T$ for $i = 1, \ldots, n$. Call these $m_k$ replications $\xi_r^{(i)}$, for $i = 1, \ldots, n$, $r = 1, \ldots, m_k$.
2. Estimate $\psi(p_k)$ by
$$\hat\psi(p_k) = (\hat\psi^{(i)}(p_k))_{i=1,\ldots,n} = \bigg( \frac{1}{m_k} \sum_{r=1}^{m_k} \xi_r^{(i)} \bigg)_{i=1,\ldots,n}.$$
3. Solve $\hat q_k = \arg\min_{p \in \mathcal{A}} \hat\psi(p_k)'(p - p_k)$.
4. Update $p_{k+1} = (1-\epsilon_k)p_k + \epsilon_k \hat q_k$.

4.2. Solving the Subprogram
The subprogram (10) can be efficiently solved in all the examples given in Section 2. For notational convenience, in this subsection we denote a generic form of (10) as
$$\min_{p \in \mathcal{A}} \xi' p \tag{11}$$
for an arbitrary vector $\xi = (\xi^{(i)})_{i=1,\ldots,n} \in \mathbb{R}^n$.
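A compact sketch of Algorithm 1 follows. To keep the demonstration self-contained we take $\mathcal{A} = \mathcal{P}$, the whole simplex, so that step 3 reduces to picking the vertex at the smallest coordinate of $\hat\psi$; for the constrained sets of Section 2, this step would instead call one of the subprogram solvers of Section 4.2. The function name and toy parameters are ours, not from the paper:

```python
import numpy as np

def fwsa(h, x, p0, T, K=30, a=0.5, b=30, beta=2.0, seed=0):
    """Sketch of Algorithm 1 with A = P: step sizes eps_k = a/k, sample sizes m_k = b*k^beta."""
    rng = np.random.default_rng(seed)
    p = np.asarray(p0, float).copy()
    n = len(p)
    for k in range(1, K + 1):
        m = int(b * k ** beta)
        # Steps 1-2: unbiased estimate of psi(p_k) via the score function (5)
        idx = rng.choice(n, size=(m, T), p=p)                 # m i.i.d. sample paths under p_k
        hx = np.array([h(x[row]) for row in idx])             # cost of each path
        counts = np.stack([(idx == i).sum(axis=1) for i in range(n)], axis=1)
        psi_hat = (hx[:, None] * (counts / p - T)).mean(axis=0)
        # Step 3: linearized subprogram over A = P -> vertex at the argmin coordinate
        q = np.zeros(n)
        q[np.argmin(psi_hat)] = 1.0
        # Step 4: mixture update; eps_k < 1 keeps every p^(i) strictly positive
        eps = a / k
        p = (1 - eps) * p + eps * q
    return p

# Toy run: minimize Z(p) = E_p[X_1 + X_2]; over the whole simplex the minimizer is
# a point mass on the smallest support point x = 1, so p should concentrate there.
x = np.array([1.0, 2.0, 3.0])
p = fwsa(lambda path: path.sum(), x, np.ones(3) / 3, T=2)
```

Because the update is a mixture, feasibility is preserved automatically at every iteration, which is exactly the computational advantage of the FW step over a projection step.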

Example 1: Constraint based on the KL divergence, i.e. $\mathcal{A} = \{p \in \mathcal{P} : D(p \| p_b) \leq \eta\}$, for some baseline model $p_b$ with $p_b^{(i)} > 0$ for all i, and $D(p \| p_b) = \sum_{i=1}^n p^{(i)} \log(p^{(i)}/p_b^{(i)})$. The solution is characterized by solving at most a one-dimensional root-finding problem. More precisely, we have

Proposition 1. Denote $\mathcal{M} = \arg\min_j \xi^{(j)}$, the set of indices i from $\{1, \ldots, n\}$ that attain the minimum $\xi^{(i)}$. The optimal solution $q = (q^{(i)})_{i=1,\ldots,n}$ to (11) is:
1. If $-\log \sum_{i \in \mathcal{M}} p_b^{(i)} \leq \eta$, then
$$q^{(i)} = \begin{cases} \dfrac{p_b^{(i)}}{\sum_{i \in \mathcal{M}} p_b^{(i)}} & \text{for } i \in \mathcal{M} \\ 0 & \text{otherwise} \end{cases} \tag{12}$$
2. If $-\log \sum_{i \in \mathcal{M}} p_b^{(i)} > \eta$, then
$$q^{(i)} = \frac{p_b^{(i)} e^{\beta \xi^{(i)}}}{\sum_{i=1}^n p_b^{(i)} e^{\beta \xi^{(i)}}} \tag{13}$$
for all i, where β < 0 satisfies
$$\beta \varphi_\xi'(\beta) - \varphi_\xi(\beta) = \eta \tag{14}$$
Here $\varphi_\xi(\beta) = \log \sum_i p_b^{(i)} e^{\beta \xi^{(i)}}$ is the logarithmic moment generating function of ξ under $p_b$.

Proposition 1 can be proved by a fairly standard technique; the proof is left to Appendix EC.3.
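Proposition 1 translates directly into a short routine: check the vertex case first, and otherwise bisect on β < 0 in (14), using the fact that $\beta \varphi_\xi'(\beta) - \varphi_\xi(\beta)$ equals the KL divergence of the tilted distribution (13) from $p_b$. A sketch (function name ours), assuming the general setup of the proposition:

```python
import numpy as np

def kl_subprogram(xi, pb, eta, tol=1e-12):
    """Solve min_q xi'q over {q in P : D(q||pb) <= eta} via Proposition 1."""
    xi, pb = np.asarray(xi, float), np.asarray(pb, float)
    m = xi == xi.min()                        # index set M of minimizers of xi
    if -np.log(pb[m].sum()) <= eta:           # case 1: the ball reaches the vertex solution
        q = np.where(m, pb, 0.0)
        return q / q.sum()
    def tilt(beta):                           # exponential tilting (13), shifted for stability
        w = pb * np.exp(beta * (xi - xi.min()))
        return w / w.sum()
    def kl(beta):                             # equals beta*phi'(beta) - phi(beta) in (14)
        q = tilt(beta)
        return np.sum(q * np.log(q / pb))
    lo, hi = -1.0, 0.0                        # kl(0) = 0 < eta; kl grows as beta -> -infinity
    while kl(lo) < eta:                       # expand until the root is bracketed
        lo *= 2.0
    while hi - lo > tol:                      # bisection for the root of kl(beta) = eta
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if kl(mid) >= eta else (lo, mid)
    return tilt(0.5 * (lo + hi))
```

The bisection is valid because the divergence of the tilted distribution is monotone in β on the negative axis, so the root of (14) is unique whenever case 2 applies.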

Example 2: Constraint based on a general φ-divergence, i.e. $\mathcal{A} = \{p \in \mathcal{P} : d_\phi(p, p_b) \leq \eta\}$, where $p_b = (p_b^{(i)})_{i=1,\ldots,n}$ with $p_b^{(i)} > 0$ for all i, and $d_\phi(p, p_b) = \sum_{i=1}^n p_b^{(i)} \phi(p^{(i)}/p_b^{(i)})$. This is a generalization of Example 1.

Proposition 2. Let $\phi^*(t) = \sup_{x \geq 0} \{tx - \phi(x)\}$ be the conjugate function of φ, and define $0\phi^*(s/0) = 0$ if $s \leq 0$ and $0\phi^*(s/0) = +\infty$ if $s > 0$. Solve the program
$$(\alpha^*, \lambda^*) \in \arg\max_{\alpha \geq 0,\, \lambda \in \mathbb{R}} \left\{ -\alpha \sum_{i=1}^n p_b^{(i)} \phi^*\!\left( \frac{\xi^{(i)} + \lambda}{-\alpha} \right) - \alpha\eta - \lambda \right\} \tag{15}$$
Then the optimal solution $q = (q^{(i)})_{i=1,\ldots,n}$ to (11) is:
1. If $\alpha^* > 0$, then
$$q^{(i)} = p_b^{(i)} \cdot \arg\max_{r \geq 0} \left\{ -\frac{\xi^{(i)} + \lambda^*}{\alpha^*}\, r - \phi(r) \right\} \tag{16}$$
2. If $\alpha^* = 0$, then
$$q^{(i)} = \begin{cases} \dfrac{p_b^{(i)}}{\sum_{i \in \mathcal{M}} p_b^{(i)}} & \text{for } i \in \mathcal{M} \\ 0 & \text{otherwise} \end{cases} \tag{17}$$
where $\mathcal{M} = \arg\min_j \xi^{(j)}$.

Operation (15) involves a two-dimensional convex optimization. Note that both the function $\phi^*$ and the solutions to the n one-dimensional maximizations in (16) have closed-form expressions for all common φ-divergences (Pardo (2005)). The proof of Proposition 2 is left to Appendix EC.3.
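As an illustration of Proposition 2, for the modified χ²-divergence $\phi(x) = (x-1)^2$ the inner maximization (16) gives $r = \max(0,\, 1 - (\xi^{(i)} + \lambda^*)/(2\alpha^*))$, and when no coordinate needs clipping at 0 the dual variables admit closed forms, $\lambda^* = -\sum_i p_b^{(i)} \xi^{(i)}$ and $\alpha^* = \sqrt{\mathrm{Var}_{p_b}(\xi)/(4\eta)}$, obtained from the KKT conditions. A sketch of this unclipped branch (function name ours; the general case would solve the two-dimensional dual (15) numerically):

```python
import numpy as np

def chi2_subprogram(xi, pb, eta):
    """min xi'q over {q in P : sum_i pb_i (q_i/pb_i - 1)^2 <= eta}: the unclipped
    alpha* > 0 branch of Proposition 2 for phi(x) = (x-1)^2, in closed form."""
    xi, pb = np.asarray(xi, float), np.asarray(pb, float)
    mean = pb @ xi                            # lambda* = -mean makes q sum to one
    var = pb @ (xi - mean) ** 2
    alpha = np.sqrt(var / (4 * eta))          # alpha* > 0 makes the ball constraint tight
    q = pb * (1 - (xi - mean) / (2 * alpha))
    assert (q >= 0).all(), "a coordinate clips at 0: solve the dual (15) numerically instead"
    return q
```

The solution shifts mass from high-ξ to low-ξ support points linearly in ξ, with the shift magnitude set so that the χ² ball constraint is exactly binding.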


Example 3: Moment constraints, i.e. A = {p ∈ P : Ep [rj (X)] ≤ µj , j = 1, . . . , s}. With this constraint set A, (11) is a linear program.

Example 4: Covariance constraints, i.e. $\mathcal{A} = \{p \in \mathcal{P} : (\mathbb{E}_p X - \mu)'\Sigma^{-1}(\mathbb{E}_p X - \mu) \leq \gamma_1,\ \mathbb{E}_p(X-\mu)(X-\mu)' \preceq \gamma_2 \Sigma\}$, where X lies on the support $\{x^{(1)}, \ldots, x^{(n)}\} \subset \mathbb{R}^d$. This can be transformed into a semidefinite program (see Delage and Ye (2010)) by rewriting $\mathcal{A}$ as
$$\mathcal{A} = \left\{ p \in \mathcal{P} : \sum_{i=1}^n p^{(i)} \begin{pmatrix} \Sigma & x^{(i)} - \mu \\ (x^{(i)} - \mu)' & \gamma_1 \end{pmatrix} \succeq 0,\ \sum_{i=1}^n p^{(i)} (x^{(i)} - \mu)(x^{(i)} - \mu)' \preceq \gamma_2 \Sigma \right\}$$
which can be solved efficiently by interior-point methods (e.g., Boyd and Vandenberghe (2009)).
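For Example 3, the subprogram is a plain linear program over the simplex with the moment inequalities as extra rows. A sketch using `scipy.optimize.linprog` (helper name and toy instance are ours):

```python
import numpy as np
from scipy.optimize import linprog

def moment_subprogram(xi, x, r_list, mu_list):
    """Solve (11) over A = {p in P : E_p[r_j(X)] <= mu_j, j = 1..s} as an LP."""
    n = len(xi)
    A_ub = np.array([[r(xv) for xv in x] for r in r_list])  # row j holds r_j(x^(i))
    A_eq = np.ones((1, n))                                  # probabilities sum to one
    res = linprog(xi, A_ub=A_ub, b_ub=np.array(mu_list, float),
                  A_eq=A_eq, b_eq=np.array([1.0]),
                  bounds=[(0, None)] * n, method="highs")
    return res.x

# One first-moment constraint E_p[X] <= 1.5 on the support {1, 2, 3}
x = np.array([1.0, 2.0, 3.0])
p_star = moment_subprogram(np.array([2.0, 1.0, 0.0]), x, [lambda t: t], [1.5])
```

In this toy instance, the cost vector favors putting all mass on x = 3, but the mean constraint forces a mixture, and the LP optimum has objective value 1.5.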

5. Theoretical Guarantees of FWSA
5.1. Almost Sure Convergence
Throughout our analysis we assume that the subprogram at each iteration can be solved by a deterministic optimization routine to negligible error. An important object in our analysis is the so-called Frank-Wolfe (FW) gap (Frank and Wolfe (1956)): for any $\tilde p \in \mathcal{A}$, let $g(\tilde p) = -\min_{p \in \mathcal{A}} \psi(\tilde p)'(p - \tilde p)$, the negation of the optimal value of the next subprogram when the current solution is $\tilde p$. Note that $g(\tilde p)$ is non-negative for any $\tilde p \in \mathcal{A}$, since one can always take $p = \tilde p$ in the definition of $g(\tilde p)$ to obtain the lower bound 0. For a convex objective function, it is well known that $g(\tilde p)$ provides an upper bound on the actual optimality gap (Frank and Wolfe (1956)). However, we shall make no convexity assumption in our subsequent analysis, and will see that $g(\tilde p)$ still plays an important role in bounding the local convergence rate of our procedure under the conditions we impose. Below is our key assumption on the optimality property of Z(p):

Assumption 1. There exists a unique minimizer $p^*$ for $\min_{p \in \mathcal{A}} Z(p)$. Moreover, $g(\cdot)$ is continuous over $\mathcal{A}$ and $p^*$ is the only feasible solution such that $g(p^*) = 0$.


In light of Assumption 1, g plays a role similar to the first-order gradient in unconstrained problems. The condition $g(p^*) = 0$ in Assumption 1 is a simple implication of the optimality of $p^*$ (since $g(p^*) > 0$ would imply the existence of a better solution). Our choices of the step size $\epsilon_k$ and sample size $m_k$ for the procedure are as follows:

Assumption 2. We choose $\epsilon_k$, $k = 1, 2, \ldots$ that satisfy
$$\sum_{k=1}^\infty \epsilon_k = \infty \quad \text{and} \quad \sum_{k=1}^\infty \epsilon_k^2 < \infty$$

Assumption 3. The sample sizes $m_k$, $k = 1, 2, \ldots$ are chosen such that
$$\sum_{k=1}^\infty \frac{\epsilon_k}{\sqrt{m_k}} \prod_{j=1}^{k-1} (1 - \epsilon_j)^{-1/2} < \infty$$
where for convenience we denote $\prod_{j=1}^0 (1 - \epsilon_j)^{-1/2} = 1$.

Note that among all $\epsilon_k$ of the form $c/k^\alpha$ for c > 0 and α > 0, only α = 1 satisfies both Assumptions 2 and 3 while simultaneously avoiding super-polynomial growth in $m_k$ (recall that $m_k$ represents the simulation effort expended in iteration k, which can be expensive). To see this, observe that Assumption 2 requires $\alpha \in (1/2, 1]$. Now, if α < 1, then it is easy to see that $\prod_{j=1}^{k-1} (1 - \epsilon_j)^{-1/2}$ grows faster than any polynomial, so that $m_k$ cannot be polynomial if Assumption 3 is to hold. On the other hand, when α = 1, $\prod_{j=1}^{k-1} (1 - \epsilon_j)^{-1/2}$ grows at rate $\sqrt{k}$ and it is legitimate to choose $m_k$ growing at rate $k^\beta$ with β > 1.

We also note that the expression $\prod_{j=1}^{k-1} (1 - \epsilon_j)^{-1/2}$ in Assumption 3 is really due to the form of the gradient estimator depicted in (4) and (5), which possesses $p^{(i)}$ in the denominator and can thus have larger variance as the iterations progress (see the proof of Theorem 2 in Appendix EC.3). The following states our result on almost sure convergence, whose proof is left to Appendix EC.3:

Theorem 2. Suppose that h(X) is uniformly bounded a.s. and that Assumptions 1-3 hold. Then the iterates $p_k$ generated by Algorithm 1 converge to $p^*$ a.s.


5.2. Local Convergence Rate
We impose several additional assumptions. The first is a Lipschitz continuity condition on the optimal solution of the generic subprogram (11). Denote by $r(\xi)$ an optimal solution of (11). We assume:

Assumption 4. For any $\xi_1, \xi_2 \in \mathbb{R}^n$, we have
$$\|r(\xi_1) - r(\xi_2)\| \leq L \|\xi_1 - \xi_2\|$$
for some L > 0.

Next, we denote by $q(\tilde p)$ an optimizer in the definition of the FW gap at $\tilde p$, i.e. $q(\tilde p) \in \arg\min_{p \in \mathcal{A}} \psi(\tilde p)'(p - \tilde p)$. We assume:

Assumption 5. $g(p) \geq c \|\psi(p)\| \|q(p) - p\|$ for any $p \in \mathcal{A}$, where c > 0 is a small constant.

Assumption 6. $\|\psi(p)\| > \tau > 0$ for any $p \in \mathcal{A}$, for some constant τ.

Assumption 5 guarantees that the angle between the descent direction and the gradient is bounded away from 90° uniformly at any point p. This assumption has been used in the design and analysis of gradient descent methods for nonlinear programs that are singular (i.e. without assuming the existence of the Hessian matrix; Bertsekas (1999), Proposition 1.3.3). The non-zero-gradient condition in Assumption 6 essentially implies that the local optimum must occur on the relative boundary of $\mathcal{A}$. We argue that this assumption is natural in our setting, because it is rarely the case that the gradient of a high-dimensional polynomial (i.e. our objective function Z(·)), which is itself a vector of high-dimensional polynomials, has a root in a small region $\mathcal{A}$ (note that $\mathcal{A}$ lies on a simplex, which is small relative to the space $\mathbb{R}^n$).


The following are our main results on convergence rates, first on the FW gap g(pk) and then on the optimality gap Z(pk) − Z(p∗), in terms of the number of iterations k. As with the almost sure convergence, we assume here that the deterministic routine for the stepwise optimization can be solved to high precision.

Theorem 3. Suppose |h(X)| ≤ M for some M > 0 and that Assumptions 1–6 hold. Additionally, set the step size εk = a/k and the sample size mk = bk^β when k > a, and arbitrary εk < 1 when k ≤ a. Given any 0 < ε < 1, it holds that, with probability 1 − ε, there exists a large enough positive integer k0 and small enough positive constants ν, ϑ, ϱ such that 0 < g(pk0) ≤ ν, and for k ≥ k0,

g(pk) ≤ A/k^C + B × { 1/((C − γ) k^γ)                    if 0 < γ < C
                      1/((γ − C)(k0 − 1)^{γ−C} k^C)      if γ > C
                      log((k − 1)/(k0 − 1)) / k^C        if γ = C        (18)

where A = g(pk0) k0^C,

B = (1 + 1/k0)^C ( aϱν/(cτ) + 2a²ϱK/(cτ k0) + Lϑ )

and

C = a (1 − 2KLϑ/(cτ) − 2Kν/(c²τ²))        (19)

Here the constants L, c, τ appear in Assumptions 4, 5 and 6 respectively. The sample size power β needs to be chosen such that β > γ + a + 1. More precisely, the constants a, b, β that appear in the specification of the algorithm, the other constants k0, ϑ, ϱ, γ, K, and two new constants ρ > 1 and δ > 0 are chosen to satisfy:

1. k0 ≥ 2a ( 4KMT/(c²τ²) + KLϑ/(cτ) )


2. −(1 − 2KLϑ/(cτ) − 2Kν/(c²τ²)) ν + ϱ/k0^{1+γ} + 2aϱK/(cτ k0^γ) + 2aKLϑϱ/(c²τ² k0) ≤ 0

3. 2KLϑ/(cτ) + 2Kν/(c²τ²) < 1

4. k0 ≥ aρ/(ρ − 1)

5. β > γ + ρa + 1

6. ∏_{j=1}^{k0−1} (1 − εj)^{−1} · (M²Tn)/(δb) · [ 1/(ϑ²(β − ρa − 1)(k0 − 1)^{β−1}) + L/(ϱ(β − γ − ρa − 1)(k0 − 1)^{β−γ−1}) ] < ε

7. K > 0 is a constant such that |x′∇²Z(p)y| ≤ K‖x‖‖y‖ for any x, y ∈ Rn and p ∈ A (which must exist because Z(·) is a polynomial defined over a bounded set).

8. δ = min_{i=1,...,n} p1^{(i)}
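The step-size and sample-size schedules in Theorem 3 are simple to generate; a sketch with illustrative constants (the particular values of a, b, β here are tuning choices of ours, not values mandated by the theorem):

```python
import math

def fwsa_schedule(k, a=1.1, b=10.0, beta=3.1):
    """Step size eps_k = a/k (for k > a; any value below 1 earlier) and
    per-iteration gradient sample size m_k = ceil(b * k**beta)."""
    eps_k = a / k if k > a else 0.5   # arbitrary choice below 1 for early iterations
    m_k = int(math.ceil(b * k ** beta))
    return eps_k, m_k

# Cumulative simulation effort after k iterations grows like k**(beta + 1),
# which is what converts iteration counts into replication counts below.
total_reps = sum(fwsa_schedule(j)[1] for j in range(1, 101))
```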

Corollary 1. Suppose that all the assumptions are satisfied and all the constants are chosen as indicated in Theorem 3. Then with probability 1 − ε, there exists a large enough positive integer k0 and small enough positive constants ν, ϑ, ϱ such that 0 ≤ g(pk0) ≤ ν, and for k ≥ k0,

Z(pk) − Z(p∗) ≤ E/(k − 1) + D/(k − 1)^C + F × { 1/((C − γ) γ (k − 1)^γ)                  if 0 < γ < C
                                                1/((γ − C)(k0 − 1)^{γ−C} C(k − 1)^C)    if γ > C
                                                log((k − 1)/(k0 − 1)) / (C(k − 1)^C)    if γ = C        (20)

where D = aA/C, E = a²K/2, F = aB, and a, A, B, C, K are the same constants as in Theorem 3.


The bounds in Theorem 3 and Corollary 1 are local asymptotic statements, since they hold only from k ≥ k0 onward and once g(pk0) ≤ ν for some large k0 and small ν. They say nothing about the behavior of the algorithm before it reaches the small neighborhood of p∗ characterized by 0 ≤ g(pk0) ≤ ν.

A quick summary extracted from Theorem 3 and Corollary 1 is the following. Consider the local convergence rate denominated by workload, i.e. in terms of the number of simulation replications rather than the number of iterations. To achieve the most efficient rate, approximately speaking, a should be chosen to be 1 + ω and β chosen to be 3 + ζ + ω for some small ω, ζ > 0. The local convergence rate is then O(W^{−1/(4+ζ+ω)}), where W is the total number of simulation replications. This convergence rate becomes effective only after W0(ε) = Θ((Tn²/(εω²ζ))^{(4+ζ+ω)/(ζ+ω)}) replications.

The above estimate on the number of replications required to initiate convergence warrants some discussion. The right interpretation is that, given that the algorithm has already run W0(ε) replications and that g(pk) ≤ ν for a suitably small ν (which occurs with probability 1 by Theorem 2), the convergence rate of O(W^{−1/(4+ζ+ω)}) is guaranteed with probability 1 − ε from that point on. We should mention that this estimate of W0(ε) can be loose.

The summary above is derived from the following observations:
1. The local convergence rate of the optimality gap, in terms of the number of iterations k, is at best O(1/k^{C∧γ∧1}). This is seen from (20).
2. We now consider the convergence rate in terms of simulation replications. At iteration k, the cumulative number of replications is of order Σ_{j=1}^k j^β ≈ k^{β+1}. Thus, from Point 1 above, the convergence rate of the optimality gap in terms of replications is of order 1/W^{(C∧γ∧1)/(β+1)}.
3. The constants C and γ depend respectively on a, the constant factor in the step size, and β, the geometric growth rate of the sample size, as follows:
(a) (19) defines C = a(1 − 2KLϑ/(cτ) − 2Kν/(c²τ²)). For convenience, we let ω = 2KLϑ/(cτ) + 2Kν/(c²τ²), so that C = a(1 − ω).


(b) From Condition 5 in Theorem 3, we have β = γ + ρa + 1 + ζ for some ζ > 0; in other words, γ = β − ρa − ζ − 1.
4. Therefore, the convergence rate in terms of replications is 1/W^{((a(1−ω))∧(β−ρa−ζ−1)∧1)/(β+1)}. Let us focus on maximizing

((a(1 − ω)) ∧ (β − ρa − ζ − 1) ∧ 1)/(β + 1)        (21)

over a and β, whose solution is given by the following lemma:

Lemma 2. The maximizer of (21) is given by a = 1/(1 − ω) and β = ρ/(1 − ω) + ζ + 2, and the optimal value is 1/(ρ/(1 − ω) + ζ + 3).

The proof is in Appendix EC.3. With Lemma 2, let us choose ϑ and ν, and hence ω, to be small. We also choose ρ to be close to 1. (Unfortunately, these choices lead to a small neighborhood around p∗ in which the convergence rate holds; see Point 5 below.) This gives rise to the approximate choices a ≈ 1 + ω and β ≈ 3 + ζ + ω. The convergence rate is then O(W^{−1/(4+ζ+ω)}).
5. Lastly, we investigate the neighborhood in which the convergence rate comes into effect. Condition 1 in Theorem 3 states that k0 needs to be Ω(aT). Condition 4 states that k0 ≥ aρ/(ρ − 1). Both of these are of second-order importance compared to Condition 6. Note that in Condition 6, ∏_{j=1}^{k0−1}(1 − εj)^{−1} is of order Θ(k0), and taking δ = Ω(1/n) in Condition 8 (since p1^{(i)} > 0 and the support size is n) and ϱ = ϑ², we need to pick

k0 = Θ((Tn²/(εϑ²(β − γ − ρa − 1)))^{1/(β−γ−2)}) = Θ((Tn²/(εω²ζ))^{1/(ρa+ζ−1)})

Plugging in a = 1 + ω and β = 3 + ζ + ω, and expressing in terms of the number of simulation replications, we get

Θ((Tn²/(εω²ζ))^{(β+1)/(ρa+ζ−1)}) = Θ((Tn²/(εω²ζ))^{(4+ζ+ω)/(ζ+ω)})

replications.
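Lemma 2's maximizer can be checked by direct arithmetic; a small sketch (the function name and the particular parameter values are ours):

```python
def optimal_tuning(omega, zeta, rho):
    """Maximizer of ((a(1-omega)) ^ (beta - rho*a - zeta - 1) ^ 1)/(beta + 1)
    from Lemma 2, where ^ denotes the minimum."""
    a = 1.0 / (1.0 - omega)
    beta = rho / (1.0 - omega) + zeta + 2.0
    value = 1.0 / (rho / (1.0 - omega) + zeta + 3.0)
    return a, beta, value

# At the maximizer the three terms inside the minimum all equal 1, so the
# objective reduces to 1/(beta + 1) = 1/(rho/(1 - omega) + zeta + 3).
a, beta, value = optimal_tuning(omega=0.01, zeta=0.05, rho=1.01)
check = min(a * (1 - 0.01), beta - 1.01 * a - 0.05 - 1, 1.0) / (beta + 1)
```

With ω, ζ small and ρ close to 1 this reproduces the approximate choices a ≈ 1 + ω and β ≈ 3 + ζ + ω discussed above.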


Remark 1. Note that Assumption 4 may not hold for certain types of uncertainty sets described in Section 2, such as moment constraints. One way to address this issue is by adding a “smoothing” constraint such as a large entropy condition (See Appendix EC.4 for some details). It is also worth pointing out that Assumption 4 is in a sense consistent with the deterministic FW literature in that smoothness of constraints (i.e. having a positive curvature property, such as a sphere) appears to play an important role in achieving fast convergence (Bertsekas (1999), p.222).

6. Numerical Experiments

This section describes numerical experiments on two stochastic systems that characterize the performance of the FWSA algorithm in computing worst-case input models. The key parameters in the FWSA algorithm are the sample-size growth rate β and the step-size constant a. Varying these two parameters, we empirically test the rate of convergence of the FW gap to zero analyzed in Theorem 3, and of the objective value Z(pk) to the true optimal value Z(p∗) analyzed in Corollary 1. We also investigate the magnitude of the optimal objective value and the form of the identified optimal solution.

In all experiments we terminate the FWSA algorithm at iteration k if at least one of the following criteria is met:
• The cumulative number of simulation replications Wk reaches 5 × 10^8, or
• The relative difference between the objective value Z(pk) and the average of the values observed in the 30 previous iterations, (Σ_{v=1}^{30} Z(pk−v))/30, is below 5 × 10^{−5}, or
• The gradient estimate ψ̂(pk) has an ℓ2-norm smaller than 1 × 10^{−3}.

Section 6.1 considers an M/GI/1 queue, and Section 6.2 considers an operational decision problem for an energy storage device in the context of managing renewable generation.
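The three stopping rules above can be combined into a single check; a sketch (the function name and the early-iteration handling are ours):

```python
def should_stop(W_k, Z_hist, grad_norm,
                max_reps=5e8, rel_tol=5e-5, grad_tol=1e-3, window=30):
    """Termination test for FWSA combining the three listed criteria.

    Z_hist holds the objective estimates Z(p_1), ..., Z(p_k) observed so far.
    """
    if W_k >= max_reps:
        return True                                   # simulation budget exhausted
    if len(Z_hist) > window:
        avg = sum(Z_hist[-window - 1:-1]) / window    # average of the 30 previous iterations
        if abs(Z_hist[-1] - avg) / abs(avg) < rel_tol:
            return True                               # objective has flattened out
    if grad_norm < grad_tol:
        return True                                   # gradient estimate nearly zero
    return False
```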

6.1. An M/GI/1 Queue

Consider an M/GI/1 queue where the arrival process is Poisson with rate λ = 1, known with high accuracy. On the other hand, the time Xt taken to provide service to the t-th customer


is uncertain, but {Xt, t = 1, 2, . . .} are assumed i.i.d. A simulation model is used to estimate the expected long-run average of the waiting times Z(p) = Ep[h(X_T)], where

h(X_T) = (1/T) Σ_{t=1}^T wt,  and  wt = max{0, wt−1 + Xt − At}.

The t-th customer's waiting time wt is calculated by iterating the celebrated Lindley recursion, where At ∼ Exp(1) is the interarrival time between the (t − 1)-th and the t-th customers. We consider the scenario where a baseline input model for Xt is chosen to be the mixture distribution pb,cont given by 0.3 × Beta(2, 6) + 0.7 × Beta(6, 2). Since this distributional choice is uncertain, we consider finding the worst-case performance over service time distributions p within a certain neighborhood of pb,cont, given by

max / min  Z(p)
s.t.  Σ_i p^{(i)} log(p^{(i)}/pb^{(i)}) ≤ η        (22)
      p ∈ P

where max / min denotes the pair of maximization and minimization problems. The space P is the probability simplex on the discrete support {x^{(1)}, . . . , x^{(n)}}, obtained by uniformly discretizing the interval [0, 1] into n points, i.e. x^{(i)} = (i + 1)/n. The discrete distribution pb = (pb^{(1)}, . . . , pb^{(n)}) is determined from the probability assigned to the intervals (x^{(i−1)}, x^{(i)}] by the continuous distribution pb,cont.

We have imposed a KL divergence constraint in (22) centered at pb. This popular type of uncertainty set is used in finance (Glasserman and Xu (2014, 2013)), economics (Hansen and Sargent (2008)), stochastic control (Petersen et al. (2000), Iyengar (2005), Nilim and El Ghaoui (2005)), etc. Formulation (22) provides a good testing ground because steady-state analysis yields an approximate optimal solution directly, which serves as a benchmark for verifying the convergence of our FWSA algorithm. As T grows, the average waiting time converges to the corresponding steady-state value, which, when the traffic intensity ρp = Ep[Xt] is less than 1, is given in closed form by the Pollaczek–Khinchine formula (Khintchine (1932)) as

Z∞(p) = (ρp Ep[X1] + Varp(X1)) / (2(1 − ρp)).
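As a concrete illustration of this setup, the following sketch simulates the waiting-time average via the Lindley recursion and evaluates the Pollaczek–Khinchine value (function names are ours; for a deterministic service time of 0.5, i.e. an M/D/1 check, the steady-state value is exactly 0.25):

```python
import random

def simulate_avg_wait(service_sampler, T=500, lam=1.0, seed=0):
    """Estimate h(X_T): the average waiting time over T customers
    via Lindley's recursion w_t = max(0, w_{t-1} + X_t - A_t)."""
    rng = random.Random(seed)
    w, total = 0.0, 0.0
    for _ in range(T):
        x = service_sampler(rng)       # service time X_t ~ p
        a = rng.expovariate(lam)       # interarrival time A_t ~ Exp(1)
        w = max(0.0, w + x - a)
        total += w
    return total / T

def pk_steady_state(mean_x, var_x):
    """Pollaczek-Khinchine value Z_inf(p) = (rho E[X] + Var(X)) / (2(1 - rho)),
    with rho = E[X] since lambda = 1."""
    rho = mean_x
    return (rho * mean_x + var_x) / (2.0 * (1.0 - rho))

z_inf = pk_steady_state(0.5, 0.0)                  # = 0.25 for the M/D/1 check
z_sim = simulate_avg_wait(lambda rng: 0.5, T=2000)
```

For finite T the simulated average deviates slightly from Z∞(p), which is the kind of approximation error discussed below when using (SS) as a benchmark.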


So, when T is large, an approximation Z∞∗ to the worst-case performance estimate can be obtained by replacing Z(p) in problem (22) with Z∞(p). (In experiments, a choice of T = 500 shows close agreement.) With Ep[X1] = Σ_i p^{(i)} x^{(i)} and Ep[X1²] = Σ_i p^{(i)} (x^{(i)})², the steady-state approximation to (22) is given by

(SS)  min_p  Σ_i p^{(i)} (x^{(i)})² / (2(1 − Σ_i p^{(i)} x^{(i)}))
      s.t.   Σ_i p^{(i)} log(p^{(i)}/pb^{(i)}) ≤ η
             Σ_i p^{(i)} = 1
             0 ≤ p^{(i)} ≤ 1,  ∀i = 1, . . . , n

which, via the variable substitutions t = 1/(2(1 − Σ_i p^{(i)} x^{(i)})) and y^{(i)} = p^{(i)} t, is equivalent to

(SS′)  min_{y,t}  Σ_i y^{(i)} (x^{(i)})²
       s.t.   Σ_i y^{(i)} log(y^{(i)}/(t pb^{(i)})) ≤ ηt
              2t − 2 Σ_i y^{(i)} x^{(i)} = 1
              Σ_i y^{(i)} = t
              0 ≤ y^{(i)} ≤ t,  ∀i = 1, . . . , n

The optimization problem (SS) is a so-called convex–concave fractional program that can be reformulated as the equivalent convex optimization problem (SS′) using these substitutions; see p. 191 in Boyd and Vandenberghe (2009).
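The substitution behind (SS′) can be verified numerically on a small instance; a sketch (all values illustrative):

```python
def ss_objective(p, x):
    """Objective of (SS): sum_i p_i x_i^2 / (2 (1 - sum_i p_i x_i))."""
    num = sum(pi * xi ** 2 for pi, xi in zip(p, x))
    rho = sum(pi * xi for pi, xi in zip(p, x))
    return num / (2.0 * (1.0 - rho))

def to_ss_prime(p, x):
    """Variable substitution t = 1/(2(1 - sum_i p_i x_i)), y_i = p_i t."""
    t = 1.0 / (2.0 * (1.0 - sum(pi * xi for pi, xi in zip(p, x))))
    return [pi * t for pi in p], t

# On any feasible p, the (SS') objective sum_i y_i x_i^2 must equal the (SS)
# objective, and the linear constraint 2t - 2 sum_i y_i x_i = 1 must hold.
p = [0.2, 0.5, 0.3]
x = [0.1, 0.4, 0.8]
y, t = to_ss_prime(p, x)
obj_ss = ss_objective(p, x)
obj_ssp = sum(yi * xi ** 2 for yi, xi in zip(y, x))
lin = 2 * t - 2 * sum(yi * xi for yi, xi in zip(y, x))
```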

Figs. 1 and 2 capture the performance of our FWSA algorithm as a function of the a and β parameters. Figs. 1a–1c plot the (approximate) optimality gap as a function of the cumulative simulation replications Wk for the maximization version of (22). We set the parameters η = 0.025, n = 100 and T = 500. Recall from the remarks following Corollary 1 that setting a = 1 + ω and β = 3 + ζ + ω, where ω, ζ > 0 are small, provides the best upper bound on the asymptotic rate of convergence. Figs. 1a, 1b and 1c provide further insights into the actual observed finite-sample performance. (When interpreting these graphs, note that they are plotted in log-log scale; roughly speaking, the slope of a curve represents the power of the cumulative samples, whereas the intercept represents the multiplicative constant in the rate.)
• Fig. 1a vs. 1b–1c: Convergence is much slower when a < 1, no matter the value of β.
• Fig. 1b: For a > 1, convergence is again slow if β > 4.
• Fig. 1b: For a slightly greater than 1, the convergence rates are similar for β ∈ [2.75, 3.25], with better performance toward the lower end.


Figure 1. Panels (a), (b) and (c) plot the optimality gap of the FWSA algorithm for the M/GI/1 example as a function of the cumulative simulation samples (both in log scale), under various combinations of the step-size parameter a and the sample-size growth parameter β: (a) small a, β varied; (b) a = 1, β varied; (c) β = 3.1, a varied. The three panels have the same ranges on both axes. Panel (d) shows the FW gap as a function of the iteration count (both in log scale). All panels give the legend as a, β.

• Fig. 1c: For β = 3.1, the rate of convergence generally improves as a increases in the range [1.10, 2.75].


• Figs. 1a, 1b and 1c: The approximation Z∞∗ from (SS) of the true Z(p∗) has an error of about 0.006 for the chosen T, as observed by the leveling off of all plots around this value as the sampling effort grows.

Fig. 1d shows the FW gap as a function of the iteration count. In general, sample paths with similar β are clustered together, indicating that more effort expended in estimating the gradient at each iterate leads to a faster drop in the FW gap per iteration. Within each cluster, performance is inferior when a < 1, consistent with Theorem 3. Since most runs terminate when the maximum allowed budget of simulation replications is expended, the end points of the curves indicate that a combination of a ≥ 1 and β around 3 gives the best finite-sample performance in terms of the FW gap.

Figure 2. Intervals contained between the max and min worst-case objectives as η and n vary: (a) robust interval vs. η; (b) robust interval vs. n.

Next we investigate the worst-case objective values computed by FWSA as the maximum allowed KL divergence η increases, fixing n = 100. Fig. 2a plots the intervals contained between the maximum and minimum values in (22) as η varies. As expected, the size of the interval increases with η, and the interval identified by FWSA matches well with that provided by the steady-state approximation (SS). Fig. 2b shows the effect of discretizing pb,cont onto discrete support sets of increasing size n, fixing η = 0.025. The intervals are slightly over-estimated for n below 75, and appear relatively insensitive to n beyond 75.
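The discretization of pb,cont onto an n-point support, used throughout this example, can be sketched as follows (this assumes SciPy is available; for simplicity the grid endpoints here are taken as i/n):

```python
from scipy.stats import beta

def discretize_baseline(n):
    """Discretize p_{b,cont} = 0.3 Beta(2,6) + 0.7 Beta(6,2) onto n points by
    assigning each interval (x^{(i-1)}, x^{(i)}] its mixture probability."""
    def mix_cdf(x):
        return 0.3 * beta.cdf(x, 2, 6) + 0.7 * beta.cdf(x, 6, 2)
    grid = [i / n for i in range(n + 1)]
    return [mix_cdf(grid[i + 1]) - mix_cdf(grid[i]) for i in range(n)]

pb = discretize_baseline(100)   # sums to 1, since the mixture is supported on [0, 1]
```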


Figure 3. Optimal solutions p∗ identified by the FWSA algorithm with n = 100 and η = 0.05, setting a = 1.5, β = 2.75: (a) minimization, (b) maximization. The gray bars represent the baseline p.m.f. pb.

Finally, Figure 3 shows the form of the optimal distributions p∗ identified by the FWSA algorithm for the minimization (Fig. 3a) and maximization (Fig. 3b) cases of (22). The optimal distributions follow a bimodal structure similar to the baseline distribution pb. The maximization version assigns probability mass unequally to the two modes in order to drive up both the mean and the variance of p, as (SS) leads us to expect, whereas the minimization version makes the mass allocation more equal in order to drive down the mean and the variance of p while maintaining the maximum allowed KL divergence.

6.2. Energy Storage Policies

Balancing the sizing and operation of energy storage devices against the fluctuations in renewable generation at small household scales forms a decision-making problem that has garnered recent interest in the energy analytics area. One key difficulty is the tremendous distributional uncertainty in the renewable sources. This section explains how we can quantify the performance of a given policy under such uncertainty.

The system dynamics presented here are a simplified version of the discussion in Harsha and Dahleh (2014). Storage of capacity S is available to balance a local wind or solar generator's


uncertain output by storing any excess generation over the same period's local demand for use in later periods. Let {Xt, t ≥ 0} represent the sequence of random net-demands, defined as the differences between the local demand and the renewable generation in each period. At the beginning of period t, the value of the net-demand Xt is revealed. This leads to a decision to change the current state st of the storage device by an amount ut, where a positive value indicates that energy is being stored and a negative value indicates withdrawal of stored energy. The ut are bounded by −min{st, Ro} ≤ ut ≤ min{S − st, Ri}, where Ri and Ro are engineering limits on the input to and output from the device in a single period. Separately, the grid operator imposes a charge curve rt(·) for supplying power, and a charge curve ct(·) for absorbing any excess renewable generation. We focus on the finite-horizon average cost

Z(p) = Ep[h(X_T)] = (1/T) Σ_{t=1}^T Ep[rt((Xt + ut)+) − ct((Xt + ut)−)],        (23)

where the Xt are i.i.d., each distributed under p, and (z)+ = max(0, z) and (z)− = min(0, z). Harsha and Dahleh (2014) study the case of minimizing the infinite-horizon average cost, namely min_{ut, t≥0} lim_{T→∞} Z(p) under ct ≡ 0. They show that control decisions ut described by a simple threshold function depending on rt(·) and Xt are optimal under various conditions on the distribution of Xt. In the spirit of their results, we extend their family of threshold policies to include ct(·) and study the following operation policy:
1. If Xt < 0, store the excess renewable energy up to capacity S and export the remainder to the grid, charged at ct(·).
2. If Xt ≥ 0, then
(a) if st > g(rt, ct), use the stored energy until st+1 is down to g(rt, ct), and purchase the rest at cost rt(·);
(b) if st ≤ g(rt, ct), purchase all of Xt and additionally purchase from the grid to fill storage up to g(rt, ct).


Suppose the nominal distribution of the Xt is uniform over [−S, S]. Suppose also that the cost curves rt and ct are piecewise linear, given by

rt(x) = 1.5 dt x for 0 < x ≤ 5,  dt x for 5 < x,
ct(x) = 0.75 dt x for 0 < x ≤ 5,  0.5 dt x for 5 < x,

where dt alternates between the values 0.4 and 1.0 in consecutive periods t, so that rt = 2ct always. We investigate the control policy defined by the threshold function

g(rt, ct) = g(dt) = S/2 for dt = 0.4,  0 for dt = 1.

We are interested in evaluating the robustness of this policy when the true distribution of Xt is not uniform. We focus on the average cost over a horizon of T = 200 periods. We adopt a KL divergence constraint around the discretized uniform distribution pb. In practice, the maximum allowed KL divergence η in the constraint can be calibrated through entropy estimation techniques from data (e.g., Pál et al. (2010)), or, in the case of non-stationarity, by using the most conservative estimates from a range of periods.
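One period of the threshold policy can be sketched as follows (the ramp limits Ri, Ro are omitted for brevity, the export charge is applied to the exported magnitude, and all names and numeric values are ours):

```python
def step(s, x, S, g):
    """One period of the threshold policy.

    Returns the new storage level and the grid exchange z = x + u, whose
    positive part is charged at r_t and whose negative part at c_t.
    """
    if x < 0:
        u = min(-x, S - s)      # store the excess renewable generation, up to capacity
    elif s > g:
        u = -min(x, s - g)      # withdraw stored energy down to the threshold g
    else:
        u = g - s               # buy all of x plus a refill of storage up to g
    return s + u, x + u

def cost(z, d):
    """Piecewise-linear charges: r_t for purchases (z > 0), c_t for exports (z < 0)."""
    if z > 0:
        return 1.5 * d * z if z <= 5 else d * z
    if z < 0:
        a = -z
        return 0.75 * d * a if a <= 5 else 0.5 * d * a
    return 0.0

# Cheap period (d = 0.4, threshold S/2) then expensive period (d = 1, threshold 0):
s, z1 = step(s=2.0, x=-3.0, S=10.0, g=5.0)   # store all 3 excess units; z1 = 0
s, z2 = step(s, x=4.0, S=10.0, g=0.0)        # serve the demand from storage; z2 = 0
```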

Figure 4. FW gap against FWSA iterates under different values of the step-size parameter a and the sample-size growth parameter β: (a) a = 1 and β varied; (b) a and β varied (legend as a, β).


We use FWSA to solve for the max and min worst-case average costs. We first study the effect of the algorithm's parameters on the convergence rate. Figure 4 plots the trends of the FW gap as the FWSA iterations progress, with η set at 0.025 and the discretization size n set at 150. Fig. 4a shows the predictable behavior that a higher β leads to faster convergence with respect to the number of iterations. Fig. 4b shows the effect of varying a and β. Similar to the M/GI/1 example, we find that a = 0.5 is sub-optimal for all choices of β, and that values of a above 1 seem to narrow the gap to zero even faster.

η        min_{p∈A} Z(p)   max_{p∈A} Z(p)
0.0063   2.39             2.81
0.0125   2.31             2.90
0.0187   2.25             2.92
0.0250   2.20             3.01
0.0312   2.16             3.12
0.0375   2.12             3.18
0.0437   2.08             3.23
0.0500   2.05             3.29
(a) Robust interval as the maximum KL divergence η grows.

n     min_{p∈A} Z(p)   max_{p∈A} Z(p)
20    2.31             3.30
30    2.27             3.21
50    2.22             3.14
75    2.20             3.09
125   2.25             3.02
150   2.20             3.01
(b) Robust interval as the distribution support size n grows.

Table 1. Dependence of the worst-case values on the support size n and the allowed KL divergence η from the baseline distribution. The FWSA algorithm was run with a = 1, β = 2.

Table 1 describes the effect on the interval bounded by the worst-case values of the maximum allowed KL divergence η and the discretized support size n. Table 1a shows the gradual increase in interval length as η increases, fixing n = 150. Table 1b shows the decreasing trend in interval length as n increases, fixing η at 0.025. The effects on the max and min problems are not


symmetric: the maximization values decrease steadily with n, while the minimization values are relatively unaffected for n ≥ 50.

Figure 5 plots the optimal distributions p∗ identified by the FWSA algorithm for n = 101 and η = 0.025. The maximization solution in Fig. 5b favors the extreme ends of the support in order to induce more charges, governed by rt(·) for purchases and by ct(·) for the grid operator's absorption of excess renewables. The right side of the distribution is favored more because a higher likelihood of positive net-demand Xt leads to more purchases at cost rt(·), which in all cases is double the absorption cost ct(·). The minimization solution in Fig. 5a, on the other hand, concentrates on the middle part of the support, where the likelihood of charges is minimized; again the reduction is greater on the right side, which carries the higher purchase cost rt.

Figure 5. Optimal p.m.f. identified using FWSA with a = 1.5, β = 2.5, for η = 0.025 and n = 101: (a) minimization, (b) maximization. The gray bars represent the baseline p.m.f. pb.

7. Conclusion

In this paper we investigated a methodology for solving optimization formulations posed to compute worst-case input probability distributions in stochastic simulation, under constraints representing the uncertainty in the model. The procedure involved gradient estimation


based on a nonparametric version of the likelihood ratio method, and a stochastic variant of the FW method from nonlinear programming to iteratively update the solution. We derived the gradient estimators and established convergence properties of the proposed FWSA algorithm. The FWSA algorithm was tested on a canonical single-server queue and on a decision problem for operating energy storage devices in the renewable-generation context. In both experiments the algorithm empirically converged, and its convergence rate was studied in relation to the choices of the tuning parameters. We also investigated the shapes of the optimal distributions and the optimal values produced by the algorithm.

We suggest several lines of future research. First is the extension of the methodology to dependent input models, such as Markovian or more general time-series inputs, which would involve new sets of constraints in the optimizations. Second is the design and analysis of alternative numerical procedures, and their comparison with the proposed method. Third is the use of our worst-case optimization method in robust decision-making, by posing and analyzing minimax formulations.

Acknowledgments

We gratefully acknowledge support from the National Science Foundation under grants CMMI-1400391 and CMMI-1436247.

References

Ankenman, Bruce E, Barry L Nelson. 2012. A quick assessment of input uncertainty. Proceedings of the 2012 Winter Simulation Conference (WSC). IEEE, 1–10.
Banks, J, JS Carson, BL Nelson, DM Nicol. 2009. Discrete-Event System Simulation. 5th ed. Prentice Hall, Englewood Cliffs, NJ.
Barton, Russell R, Barry L Nelson, Wei Xie. 2013. Quantifying input uncertainty via simulation confidence intervals. INFORMS Journal on Computing 26(1) 74–87.
Ben-Tal, Aharon, Dick Den Hertog, Anja De Waegenaere, Bertrand Melenberg, Gijs Rennen. 2013. Robust solutions of optimization problems affected by uncertain probabilities. Management Science 59(2) 341–357.
Ben-Tal, Aharon, Laurent El Ghaoui, Arkadi Nemirovski. 2009. Robust Optimization. Princeton University Press.
Bertsekas, Dimitri P. 1999. Nonlinear Programming. Athena Scientific.
Bertsimas, Dimitris, Ioana Popescu. 2005. Optimal inequalities in probability theory: A convex optimization approach. SIAM Journal on Optimization 15(3) 780–804.
Birge, John R, Roger J-B Wets. 1987. Computing bounds for stochastic programming problems by means of a generalized moment problem. Mathematics of Operations Research 12(1) 149–162.
Blum, Julius R. 1954. Multidimensional stochastic approximation methods. The Annals of Mathematical Statistics 737–744.
Boyd, Stephen, Lieven Vandenberghe. 2009. Convex Optimization. Cambridge University Press.
Buche, Robert, Harold J Kushner. 2002. Rate of convergence for constrained stochastic approximation algorithms. SIAM Journal on Control and Optimization 40(4) 1011–1041.
Canon, MD, CD Cullum. 1968. A tight upper bound on the rate of convergence of Frank-Wolfe algorithm. SIAM Journal on Control 6(4) 509–516.
Cheng, Russell CH, Wayne Holland. 1997. Sensitivity of computer simulation experiments to errors in input data. Journal of Statistical Computation and Simulation 57(1-4) 219–241.
Chick, Stephen E. 2001. Input distribution selection for simulation experiments: accounting for input uncertainty. Operations Research 49(5) 744–758.
Delage, Erick, Yinyu Ye. 2010. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research 58(3) 595–612.
Dunn, Joseph C. 1979. Rates of convergence for conditional gradient algorithms near singular and nonsingular extremals. SIAM Journal on Control and Optimization 17(2) 187–211.
Dunn, Joseph C. 1980. Convergence rates for conditional gradient sequences generated by implicit step length rules. SIAM Journal on Control and Optimization 18(5) 473–487.
Frank, Marguerite, Philip Wolfe. 1956. An algorithm for quadratic programming. Naval Research Logistics Quarterly 3(1-2) 95–110.
Freund, Robert M, Paul Grigas. 2014. New analysis and results for the Frank-Wolfe method. arXiv preprint arXiv:1307.0873v2.
Fu, Michael C. 1994. Optimization via simulation: A review. Annals of Operations Research 53(1) 199–247.
Glasserman, Paul, Xingbo Xu. 2013. Robust portfolio control with stochastic factor dynamics. Operations Research 61(4) 874–893.
Glasserman, Paul, Xingbo Xu. 2014. Robust risk measurement and model risk. Quantitative Finance 14(1) 29–58.
Glynn, Peter W. 1990. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM 33(10) 75–84.
Goh, Joel, Melvyn Sim. 2010. Distributionally robust optimization and its tractable approximations. Operations Research 58(4-Part-1) 902–917.
Hansen, Lars Peter, Thomas J Sargent. 2001. Robust control and model uncertainty. The American Economic Review 91(2) 60–66.
Hansen, Lars Peter, Thomas J Sargent. 2008. Robustness. Princeton University Press.
Harsha, P., M. Dahleh. 2014. Optimal management and sizing of energy storage under dynamic pricing for the efficient integration of renewable energy. IEEE Transactions on Power Systems PP(99) 1–18.
Iyengar, Garud N. 2005. Robust dynamic programming. Mathematics of Operations Research 30(2) 257–280.
Jaggi, Martin. 2013. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. Proceedings of the 30th International Conference on Machine Learning (ICML-13). 427–435.
Khintchine, A. Y. 1932. Mathematical theory of a stationary queue. Matematicheskii Sbornik 39 73–84.
Kushner, H., G. Yin. 2003. Stochastic Approximation and Recursive Algorithms and Applications. 2nd ed. Springer-Verlag, New York.
Kushner, Harold J. 1974. Stochastic approximation algorithms for constrained optimization problems. The Annals of Statistics 713–723.
Lam, Henry. 2013. Robust sensitivity analysis for stochastic systems. Under minor revision in Mathematics of Operations Research, arXiv preprint arXiv:1303.0326.
Law, Averill M, W David Kelton. 2000. Simulation Modeling and Analysis. 3rd ed. McGraw-Hill, New York.
Lim, Andrew E. B., J. George Shanthikumar, Thaisiri Watewai. 2011. Robust asset allocation with benchmarked objectives. Mathematical Finance 21(4) 643–679.
Lim, Andrew EB, J George Shanthikumar, ZJ Max Shen. 2006. Model uncertainty, robust optimization and learning. Tutorials in Operations Research 66–94.
Luenberger, David G. 1969. Optimization by Vector Space Methods. John Wiley & Sons.
Morini, Massimo. 2011. Understanding and Managing Model Risk: A Practical Guide for Quants, Traders and Validators. John Wiley & Sons.
Nemirovski, Arkadi, Anatoli Juditsky, Guanghui Lan, Alexander Shapiro. 2009. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization 19(4) 1574–1609.
Nilim, Arnab, Laurent El Ghaoui. 2005. Robust control of Markov decision processes with uncertain transition matrices. Operations Research 53(5) 780–798.
Pál, Dávid, Barnabás Póczos, Csaba Szepesvári. 2010. Estimation of Rényi entropy and mutual information based on generalized nearest-neighbor graphs. Advances in Neural Information Processing Systems. 1849–1857.
Pardo, Leandro. 2005. Statistical Inference Based on Divergence Measures. CRC Press.
Pasupathy, Raghu, Sujin Kim. 2011. The stochastic root-finding problem: Overview, solutions, and open questions. ACM Transactions on Modeling and Computer Simulation (TOMACS) 21(3) 19.
Petersen, I.R., M.R. James, P. Dupuis. 2000. Minimax optimal control of stochastic uncertain systems with relative entropy constraints. IEEE Transactions on Automatic Control 45(3) 398–412.
Reiman, Martin I, Alan Weiss. 1989. Sensitivity analysis for simulations via likelihood ratios. Operations Research 37(5) 830–844.
Saltelli, Andrea, Paola Annoni, Ivano Azzini, Francesca Campolongo, Marco Ratto, Stefano Tarantola. 2010. Variance based sensitivity analysis of model output. Design and estimator for the total sensitivity index. Computer Physics Communications 181(2) 259–270.
Saltelli, Andrea, Marco Ratto, Terry Andres, Francesca Campolongo, Jessica Cariboni, Debora Gatelli, Michaela Saisana, Stefano Tarantola. 2008. Global Sensitivity Analysis: The Primer. Wiley.
Serfling, Robert J. 2009. Approximation Theorems of Mathematical Statistics, vol. 162. John Wiley & Sons.
Smith, J. E. 1995. Generalized Chebyshev inequalities: theory and applications in decision analysis. Operations Research 43(5) 807–825.
Smith, James E. 1993. Moment methods for decision analysis. Management Science 39(3) 340–358.
Xie, Wei, Barry L Nelson, Russell R Barton. 2014. A Bayesian framework for quantifying uncertainty in stochastic simulation. Operations Research 62(6) 1439–1452.
Xie, Wei, Barry L Nelson, Russell R Barton. 2015. Statistical uncertainty analysis for stochastic simulation. Under review in Operations Research.
Zouaoui, Faker, James R Wilson. 2003. Accounting for parameter uncertainty in simulation input modeling. IIE Transactions 35(9) 781–792.
Zouaoui, Faker, James R Wilson. 2004. Accounting for input-model and input-parameter uncertainties in simulation. IIE Transactions 36(11) 1135–1151.


e-companion to Ghosh and Lam: Computing Worst-case Input Models

Appendix EC.1. Separability of Subprograms under Several Input Models

Procedure 1 can be generalized to problems with several independent input models, given that the uncertainty sets for these models are independently specified. For this discussion, consider a performance measure

Z(p^1, p^2, . . . , p^R) = E_{p^1,...,p^R}[h(X^1, . . . , X^R)]

where p^1 ∈ P^1, . . . , p^R ∈ P^R denote R independent probability distributions representing R independent input models. Here P^j denotes the probability simplex that p^j lies in (which can be different for different j). Let each P^j have support size n^j on {x^{j,(1)}, . . . , x^{j,(n^j)}}. Also, X^j = (X^j_1, . . . , X^j_{T^j}) denotes the i.i.d. sequence of random variables of the j-th input model over time horizon T^j, and the expectation E_{p^1,...,p^R}[·] is taken under the product measure of all the T^1 + · · · + T^R independent distributions. We pose the worst-case formulation as

min_{p^1∈A^1,...,p^R∈A^R} Z(p^1, . . . , p^R)        (EC.1)

where A^j is the individual uncertainty set for the j-th model. We posit that the FWSA method can be applied to (EC.1), with each iteration consisting of solving R separate convex subprograms. First, denote ∇Z(p^1, . . . , p^R) = (∇^1 Z(p^1, . . . , p^R), . . . , ∇^R Z(p^1, . . . , p^R)) as the gradient of Z(·), where each ∇^j Z(p^1, . . . , p^R) denotes the portion of the derivative taken with respect to p^j. It is straightforward to extend the gradient representation in Theorem 1 to the following:

Theorem EC.1. Suppose p^j > 0 for all j. Define ψ^j(p^1, . . . , p^R) = (ψ^{j,(i)}(p^1, . . . , p^R))_{i=1,...,n^j} ∈ R^{n^j}, where ψ^{j,(i)}(p^1, . . . , p^R) = (d/dε) Z(p^1, . . . , p^{j−1}, (1 − ε)p^j + ε 1^{j,(i)}, p^{j+1}, . . . , p^R)|_{ε=0}. Here 1^{j,(i)} denotes the vector of length n^j with 1 at component i and 0 elsewhere. We have:
1.

∇Z(p^1, . . . , p^R)′ (q^1 − p^1, . . . , q^R − p^R) = Σ_{j=1}^R ∇^j Z(p^1, . . . , p^R)′ (q^j − p^j) = Σ_{j=1}^R ψ^j(p^1, . . . , p^R)′ (q^j − p^j)        (EC.2)

ec2

e-companion to Ghosh and Lam: Computing Worst-case Input Models

for any qj ∈ P j , j = 1, . . . , R. 2. ψ j,(i) (p1 , . . . , pR ) = Ep1 ,...,pR [h(X1 , . . . , XR )sj,(i) (Xj )] where sj,(i) (·) is defined as j

j,(i)

s

(x1 , . . . , xT j ) =

T X I(xt = xj,(i) )

pj,(i)

t=1

−Tj

Next we look at each step of FWSA analogous to Algorithm 1. By (EC.2) and the separability of uncertainty sets, the stepwise optimization at each iteration can be written as

min

q1 ∈A1 ,...,qR ∈AR

R X

ψˆj (p1 , . . . , pR )0 (qj − pj ) =

j=1

R X j=1

min ψˆj (p1 , . . . , pR )0 (qj − pj )

qj ∈Aj

(EC.3)

where ψˆj (p1 , . . . , pR ), j = 1, . . . , R are the empirical counterparts of ψ j (p1 , . . . , pR ), j = 1, . . . , R, which can be obtained simultaneously using mk sample paths of (X1 , . . . , XR ) and calculating the sample means of h(X1 , . . . , XR )sj,(i) (Xj ) for each j = 1, . . . , R and i = 1, . . . , nj . Then (EC.3) can be solved by R individual convex subprograms. Given pjk , j = 1, . . . , R, we update pjk+1 = ˆ jk where q ˆ jk solves the subprogram minqj ∈Aj ψˆj Z(p1 , . . . , pR )0 (qj − pj ). (1 − k )pjk + k q
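The per-iteration computation described above can be sketched as follows. This is a minimal illustration rather than the paper's implementation: the cost function `h`, the supports, and the choice of $\mathcal A^j$ as the whole simplex (which makes each linear subprogram a vertex problem) are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent input models (R = 2) on small finite supports; the
# cost h and the supports below are illustrative placeholders.
supports = [np.array([1.0, 2.0, 3.0]), np.array([0.5, 1.5])]
T = [4, 3]                             # time horizons T^1, T^2
p = [np.ones(3) / 3, np.ones(2) / 2]   # current iterates p^1_k, p^2_k

def h(x1, x2):
    # toy performance functional of the two input sequences
    return max(x1.sum() - x2.sum(), 0.0)

def estimate_psi(p, m=5000):
    """One batch of m sample paths yields all R score-function
    gradient estimates psi-hat^j simultaneously (Theorem EC.1, part 2)."""
    psi = [np.zeros(len(pj)) for pj in p]
    for _ in range(m):
        idx = [rng.choice(len(pj), size=Tj, p=pj) for pj, Tj in zip(p, T)]
        cost = h(supports[0][idx[0]], supports[1][idx[1]])
        for j, pj in enumerate(p):
            counts = np.bincount(idx[j], minlength=len(pj))
            score = counts / pj - T[j]        # s^{j,(i)}(X^j)
            psi[j] += cost * score
    return [v / m for v in psi]

psi_hat = estimate_psi(p)
# With A^j taken (for illustration) to be the whole simplex, each linear
# subprogram min_{q^j} psi-hat^j' (q^j - p^j) is solved at a vertex:
q_hat = [np.eye(len(pj))[np.argmin(ps)] for pj, ps in zip(p, psi_hat)]
eps = 0.1
p_next = [(1 - eps) * pj + eps * qj for pj, qj in zip(p, q_hat)]
```

With a genuinely constrained $\mathcal A^j$ (e.g. a divergence ball), the vertex step would be replaced by the corresponding convex subprogram, but the separability across $j$ is unchanged.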

EC.2. Discrete Approximation for Continuous Input Distribution

The FWSA algorithm assumes a discrete input distribution over a finite support set. Handling a continuous input distribution therefore requires a discretization scheme. This section discusses the resulting discretization errors. We focus on an input distribution $p_c$ over a compact interval $U=[\underline u,\bar u]\subset\mathbb R$. Define $\mathcal P_U$ as the set of all distributions on $U$. We consider two types of convex feasible regions $\mathcal A_c$ for $p_c$: 1) the neighborhood of some continuous baseline distribution $p_{b,c}\in\mathcal P_U$ under $\phi$-divergence, i.e. $\mathcal A_c=\{p_c\in\mathcal P_U:E_{p_{b,c}}[\phi(dp_c/dp_{b,c})]\le\eta\}$ for some $\eta>0$, where we define $E_{p_{b,c}}[\phi(dp_c/dp_{b,c})]=\infty$ if $p_c$ is not absolutely continuous with respect to $p_{b,c}$; 2) moment constraints, i.e. $\mathcal A_c=\{p_c\in\mathcal P_U:E_{p_c}[r_j(X)]\le\mu_j,\ j=1,\dots,s\}$.


We let
\[
Z^*=\min_{p_c\in\mathcal A_c}Z(p_c)=E_{p_c}[h(\mathbf X)]=\int_U\cdots\int_U h(x_1,\dots,x_T)\,dF_c(x_1)\cdots dF_c(x_T) \tag{EC.4}
\]
be the optimal value we would like to find, where $F_c(\cdot)$ denotes the distribution function of $p_c$. We shall study how discretization of $p_c$ affects the approximation of (EC.4), presuming that the discretized problem can be solved exactly. We make the following assumptions:

Assumption EC.1. Assume an optimal solution $p_c^*$ exists for (EC.4) and is a continuous distribution. Suppose that $|h(\mathbf X)|\le M$ for some $M>0$ uniformly over $\mathbf X\in U^T$. Assume further that there exists a constant $H$ such that:

1. For any $j=1,\dots,T-1$, $E[h(\mathbf X)|X_1=x_1,\dots,X_{j-1}=x_{j-1},X_j=x_j]$ is Lipschitz continuous in $x_j$ with Lipschitz constant $H$, for any fixed $x_1,\dots,x_{j-1}\in U$. In other words, for fixed $x_1,\dots,x_{j-1}$,
\[
|E[h(\mathbf X)|X_1=x_1,\dots,X_{j-1}=x_{j-1},X_j=x_j]-E[h(\mathbf X)|X_1=x_1,\dots,X_{j-1}=x_{j-1},X_j=x_j']|\le H|x_j-x_j'|
\]
for any $x_j,x_j'\in U$.

2. Given any fixed $x_1,\dots,x_{T-1}\in U$, the function $h(x_1,\dots,x_{T-1},x_T)$, viewed as a function of $x_T\in U$, has at most $J$ jump discontinuities and is Lipschitz continuous between jumps with Lipschitz constant $H$. In other words, for given $x_1,\dots,x_{T-1}\in U$, letting $y_1,\dots,y_k\in U$ with $k\le J$ be the discontinuity points of $h(x_1,\dots,x_{T-1},\cdot)$, we have
\[
|h(x_1,\dots,x_{T-1},x_T)-h(x_1,\dots,x_{T-1},x_T')|\le H|x_T-x_T'|
\]
for any $y_{j-1}<x_T,x_T'<y_j$, $j=1,\dots,k+1$, where we define $y_0=\underline u$ and $y_{k+1}=\bar u$.

Conditions 1 and 2 on the cost function $h$ in Assumption EC.1 are smoothness conditions that guarantee a discretized distribution can approximate expectation functionals to high accuracy. In general, $h$ may not be continuous everywhere (consider, for instance, an indicator function), so


Condition 2 addresses the possibility of jumps. On the other hand, conditional expectations are often continuous thanks to the smoothing effect of conditioning, hence Condition 1.

The following bounds describe how discretization over $U$ gives a lower bound for $Z^*$ up to an error:

Theorem EC.2. Suppose Assumption EC.1 holds. We make additional assumptions and use the following discretization rules for the two types of uncertainty sets:

Case 1: Consider $\mathcal A_c=\{p_c\in\mathcal P_U:E_{p_{b,c}}[\phi(dp_c/dp_{b,c})]\le\eta\}$. Let $L^*=L^*(x)=dp_c^*/dp_{b,c}$ be the likelihood ratio between the optimal distribution $p_c^*$ for (EC.4) and the baseline $p_{b,c}$. Suppose that:

1. $L^*(x)\le C$ for some $C$ uniformly over $x\in U$.

2. For any $x\in U$,
\[
L^*(x)=\frac{P_{p_c^*}(\underline x\le X\le\bar x)}{P_{p_{b,c}}(\underline x\le X\le\bar x)}+\epsilon(\Delta)
\]
for any $\underline x\le x\le\bar x$ such that $\bar x-\underline x=\Delta$, where $|\epsilon(\Delta)|\le K\Delta$ for some $K$.

3. $\phi$ is Lipschitz continuous over $[0,C+V]$ for some $V>0$, i.e. $|\phi(y)-\phi(y')|\le R|y-y'|$ for any $y,y'\in[0,C+V]$.

Consider the following discretization scheme. Divide $U$ into $n$ intervals, separated by the points $x^{(0)}=\underline u<x^{(1)}<x^{(2)}<\dots<x^{(n)}=\bar u$. Approximate the baseline distribution $p_{b,c}$ with a histogram, i.e. consider $\mathbf p_b=(p_b^{(i)})_{i=1,\dots,n}\in\mathcal P$ on the support points $x^{(1)},\dots,x^{(n)}$, with probability weights $p_b^{(i)}=P_{p_{b,c}}(x^{(i-1)}\le X\le x^{(i)})$. Write $\Delta^{(i)}=x^{(i)}-x^{(i-1)}$ and define $\bar\Delta=\max_i\Delta^{(i)}$ as the mesh of the discretization. Now define $\mathbf p=(p^{(i)})_{i=1,\dots,n}\in\mathcal P$ as the decision variables, where $\mathcal P$ is the probability simplex on the support points $x^{(1)},\dots,x^{(n)}$. Assume the discretization is fine enough that $\bar\Delta\le V/K$. Let $\hat Z^*$ be the optimal value of the problem
\[
\begin{aligned}
\min_{\mathbf p\in\mathcal P}\quad & E_{\mathbf p}[h(\mathbf X)]=\sum_{x^{(i_1)}}\cdots\sum_{x^{(i_T)}}h(x^{(i_1)},\dots,x^{(i_T)})\,p^{(i_1)}\cdots p^{(i_T)}\\
\text{subject to}\quad & \sum_{i=1}^np_b^{(i)}\,\phi\!\left(\frac{p^{(i)}}{p_b^{(i)}}\right)\le\eta+KR\bar\Delta
\end{aligned}
\tag{EC.5}
\]


Case 2: Consider $\mathcal A_c=\{p_c\in\mathcal P_U:E_{p_c}[r_j(X)]\le\mu_j,\ j=1,\dots,s\}$. Suppose that each $r_j(\cdot)$ is Lipschitz continuous over $U$ with Lipschitz constant $S$, i.e. $|r_j(x)-r_j(x')|\le S|x-x'|$ for all $x,x'\in U$.

Consider the following discretization scheme. Divide $U$ into $n$ intervals, separated by the points $x^{(0)}=\underline u<x^{(1)}<x^{(2)}<\dots<x^{(n)}=\bar u$. Write $\Delta^{(i)}=x^{(i)}-x^{(i-1)}$ and define $\bar\Delta=\max_i\Delta^{(i)}$ as the mesh of the discretization. Let $\mathbf p=(p^{(i)})_{i=1,\dots,n}\in\mathcal P$, on the support points $x^{(1)},\dots,x^{(n)}$, be the decision variables. Let $\hat Z^*$ be the optimal value of the problem
\[
\begin{aligned}
\min_{\mathbf p\in\mathcal P}\quad & E_{\mathbf p}[h(\mathbf X)]=\sum_{x^{(i_1)}}\cdots\sum_{x^{(i_T)}}h(x^{(i_1)},\dots,x^{(i_T)})\,p^{(i_1)}\cdots p^{(i_T)}\\
\text{subject to}\quad & \sum_{i=1}^np^{(i)}r_j(x^{(i)})\le\mu_j+S\bar\Delta,\quad j=1,\dots,s
\end{aligned}
\tag{EC.6}
\]
In both cases we have
\[
\hat Z^*\le Z^*+(TH+JM)\bar\Delta. \tag{EC.7}
\]

The relaxations of the constraints in (EC.5) and (EC.6) guarantee feasibility, in the discretized problems, of the discretized counterpart of the optimal solution. Note that the constant $K$ in (EC.5) is typically unknown; hence Case 1 of the theorem is mainly of theoretical interest: as the mesh $\bar\Delta$ goes to zero, the discrete version is guaranteed to provide an accurate lower bound. Formulation (EC.6), in contrast, does not involve any constants related to the optimal solution.

Proof of Theorem EC.2. We first estimate the discrepancy between the original objective function $E_{p_c}[h(\mathbf X)]$ and its discretized counterpart $E_{\mathbf p}[h(\mathbf X)]$. For a given $p_c\in\mathcal P_U$, consider its discretized counterpart obtained by dividing $U$ into $n$ intervals separated by the points $x^{(0)}=\underline u<x^{(1)}<x^{(2)}<\dots<x^{(n)}=\bar u$, and define $\mathbf p=(p^{(i)})_{i=1,\dots,n}\in\mathcal P$, on the support points $x^{(1)},\dots,x^{(n)}$, by $p^{(i)}=F_c(x^{(i)})-F_c(x^{(i-1)})$, where $F_c(\cdot)$ denotes the distribution function of $p_c$. Write $\Delta^{(i)}=x^{(i)}-x^{(i-1)}$ and $\bar\Delta=\max_i\Delta^{(i)}$. Consider


\begin{align*}
E_{p_c}[h(\mathbf X)] &= \int_U E_{p_c}[h(\mathbf X)|X_1=x_1]\,dF_c(x_1)\\
&= \sum_{i=1}^n\int_{x^{(i-1)}}^{x^{(i)}}E_{p_c}[h(\mathbf X)|X_1=x_1]\,dF_c(x_1)\\
&= \sum_{i=1}^nE_{p_c}[h(\mathbf X)|X_1=\zeta_1^{(i)}]\,p^{(i)}\quad\text{for some }\zeta_1^{(i)}\in[x^{(i-1)},x^{(i)}]\text{ by the mean value theorem}\\
&\ge \sum_{i=1}^n\big(E_{p_c}[h(\mathbf X)|X_1=x^{(i)}]-H\Delta^{(i)}\big)p^{(i)}\quad\text{by Condition 1 in Assumption EC.1}\\
&\ge \sum_{i=1}^nE_{p_c}[h(\mathbf X)|X_1=x^{(i)}]\,p^{(i)}-H\bar\Delta \tag{EC.8}
\end{align*}
Now iterate the above calculation on $E_{p_c}[h(\mathbf X)|X_1=x^{(i)}]=\int_UE_{p_c}[h(\mathbf X)|X_1=x^{(i)},X_2=x_2]\,dF_c(x_2)$ in (EC.8), using Condition 1 in Assumption EC.1 at each step. We get that (EC.8) is greater than or equal to
\[
\sum_{i_1=1}^n\sum_{i_2=1}^nE_{p_c}[h(\mathbf X)|X_1=x^{(i_1)},X_2=x^{(i_2)}]\,p^{(i_1)}p^{(i_2)}-2H\bar\Delta
\]
and, continuing,
\begin{align*}
&\ge \sum_{i_1=1}^n\cdots\sum_{i_{T-1}=1}^nE_{p_c}[h(\mathbf X)|X_1=x^{(i_1)},\dots,X_{T-1}=x^{(i_{T-1})}]\,p^{(i_1)}\cdots p^{(i_{T-1})}-(T-1)H\bar\Delta\\
&= \sum_{i_1=1}^n\cdots\sum_{i_{T-1}=1}^n\int_Uh(x^{(i_1)},\dots,x^{(i_{T-1})},x_T)\,dF_c(x_T)\,p^{(i_1)}\cdots p^{(i_{T-1})}-(T-1)H\bar\Delta\\
&\ge \sum_{i_1=1}^n\cdots\sum_{i_{T-1}=1}^n\Bigg(\sum_{i_T=1}^nh(x^{(i_1)},\dots,x^{(i_{T-1})},\zeta_T^{(i_T)})\,p^{(i_T)}-H\bar\Delta-JM\bar\Delta\Bigg)p^{(i_1)}\cdots p^{(i_{T-1})}-(T-1)H\bar\Delta\\
&\qquad\text{for some }\zeta_T^{(i_T)}\in[x^{(i_T-1)},x^{(i_T)}]\text{, by the mean value theorem and Condition 2 in Assumption EC.1}\\
&\ge \sum_{i_1=1}^n\cdots\sum_{i_T=1}^nh(x^{(i_1)},\dots,x^{(i_{T-1})},x^{(i_T)})\,p^{(i_1)}\cdots p^{(i_T)}-TH\bar\Delta-JM\bar\Delta\\
&= E_{\mathbf p}[h(\mathbf X)]-TH\bar\Delta-JM\bar\Delta
\end{align*}

Therefore, if we can show that the discretized counterpart of the original optimal distribution $p_c^*$ is feasible for (EC.5) and (EC.6) respectively, then (EC.7) follows in each case. Consider the discretized counterpart of $p_c^*$, denoted $\mathbf p^*=(p^{*(i)})_{i=1,\dots,n}\in\mathcal P$, on the support points $x^{(1)},\dots,x^{(n)}$, where $p^{*(i)}=F_c^*(x^{(i)})-F_c^*(x^{(i-1)})$ and $F_c^*$ is the distribution function of $p_c^*$.


We consider the two cases of $\mathcal A_c$ individually. For Case 1, the $\phi$-divergence in the original constraint, evaluated at $p_c^*$, is
\begin{align*}
E_{p_{b,c}}[\phi(dp_c^*/dp_{b,c})] &= E_{p_{b,c}}[\phi(L^*(X))]\\
&= \sum_{i=1}^n\int_{x^{(i-1)}}^{x^{(i)}}\phi(L^*(x))\,dF_{b,c}(x)\quad\text{where }F_{b,c}(\cdot)\text{ is the distribution function of }p_{b,c}\\
&= \sum_{i=1}^n\phi(L^*(\zeta^{(i)}))\,p_b^{(i)}\quad\text{for some }\zeta^{(i)}\in[x^{(i-1)},x^{(i)}]\text{ by the mean value theorem}\\
&= \sum_{i=1}^n\phi\!\left(\frac{p^{*(i)}}{p_b^{(i)}}+\epsilon(\Delta^{(i)})\right)p_b^{(i)}\quad\text{by Condition 2 in Case 1}\\
&\ge \sum_{i=1}^n\phi\!\left(\frac{p^{*(i)}}{p_b^{(i)}}\right)p_b^{(i)}-R\sum_{i=1}^n|\epsilon(\Delta^{(i)})|\,p_b^{(i)}\quad\text{by Condition 3 in Case 1}\\
&\ge \sum_{i=1}^n\phi\!\left(\frac{p^{*(i)}}{p_b^{(i)}}\right)p_b^{(i)}-RK\bar\Delta\quad\text{by Condition 2 in Case 1}
\end{align*}
where the use of the Lipschitz property of $\phi$ is valid since $|\epsilon(\Delta^{(i)})|\le K\bar\Delta\le V$, i.e. $\bar\Delta\le V/K$. Therefore $\mathbf p^*$, the discretized counterpart of $p_c^*$, satisfies $\sum_{i=1}^np_b^{(i)}\phi(p^{*(i)}/p_b^{(i)})\le E_{p_{b,c}}[\phi(dp_c^*/dp_{b,c})]+RK\bar\Delta$. Since by definition $p_c^*$ satisfies $E_{p_{b,c}}[\phi(dp_c^*/dp_{b,c})]\le\eta$, we have $\sum_{i=1}^np_b^{(i)}\phi(p^{*(i)}/p_b^{(i)})\le\eta+RK\bar\Delta$. So $\mathbf p^*$ is feasible in formulation (EC.5).

Now consider Case 2. The $j$-th moment constraint in the original feasible region, evaluated at $p_c^*$, is
\begin{align*}
E_{p_c^*}[r_j(X)] &= \sum_{i=1}^n\int_{x^{(i-1)}}^{x^{(i)}}r_j(x)\,dF_c^*(x)\\
&= \sum_{i=1}^nr_j(\zeta^{(i)})\,p^{*(i)}\quad\text{for some }\zeta^{(i)}\in[x^{(i-1)},x^{(i)}]\text{ by the mean value theorem}\\
&\ge \sum_{i=1}^n\big(r_j(x^{(i)})-S\Delta^{(i)}\big)p^{*(i)}\quad\text{by the Lipschitz condition in Case 2}\\
&\ge \sum_{i=1}^nr_j(x^{(i)})\,p^{*(i)}-S\bar\Delta
\end{align*}
Therefore $\mathbf p^*$, the discretized counterpart of $p_c^*$, satisfies $\sum_{i=1}^nr_j(x^{(i)})p^{*(i)}\le E_{p_c^*}[r_j(X)]+S\bar\Delta$. Since by definition $p_c^*$ satisfies $E_{p_c^*}[r_j(X)]\le\mu_j$, we have $\sum_{i=1}^nr_j(x^{(i)})p^{*(i)}\le\mu_j+S\bar\Delta$ for each $j$. So $\mathbf p^*$ is feasible in formulation (EC.6).


Therefore, feasibility is satisfied in both cases. We conclude the theorem.
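The Case 2 discretization scheme above is straightforward to set up in code. A minimal sketch, assuming a truncated-exponential baseline and the linear moment function $r(x)=x$ (both illustrative choices, not from the paper):

```python
import numpy as np

# Discretization per Theorem EC.2, Case 2 (moment constraints).
u_lo, u_hi, n = 0.0, 4.0, 50
grid = np.linspace(u_lo, u_hi, n + 1)   # x^(0) < x^(1) < ... < x^(n)
support = grid[1:]                      # support points x^(1..n)
mesh = np.max(np.diff(grid))            # mesh  Delta-bar

def r(x):                               # moment function r_1 (Lipschitz)
    return x

S = 1.0                                 # its Lipschitz constant
mu = 1.2                                # original moment bound mu_1
mu_relaxed = mu + S * mesh              # relaxed bound in (EC.6)

# Discretized counterpart p* of a continuous candidate p_c (here an
# exponential with rate 1 truncated to U): p^(i) = F_c(x^(i)) - F_c(x^(i-1)).
F = lambda x: (1 - np.exp(-x)) / (1 - np.exp(-u_hi))
p_star = F(grid[1:]) - F(grid[:-1])
```

The theorem then guarantees that whenever the continuous candidate satisfies the original moment constraint, its discretized counterpart satisfies the relaxed constraint in (EC.6).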

EC.3. Technical Proofs

Proof of Lemma 1. Write $\mathrm{Var}_{\mathbf p}(\cdot)$ for the variance under $\mathbf p$. We have
\[
\mathrm{Var}_{\mathbf p}(h(\mathbf X)s^{(i)}(\mathbf X))\le E_{\mathbf p}[(h(\mathbf X)s^{(i)}(\mathbf X))^2]\le M^2E_{\mathbf p}[(s^{(i)}(\mathbf X))^2]=M^2\big(\mathrm{Var}_{\mathbf p}(s^{(i)}(\mathbf X))+(E_{\mathbf p}[s^{(i)}(\mathbf X)])^2\big) \tag{EC.9}
\]
Now note that, by the definition of $s^{(i)}(\mathbf X)$ in (5), we have $E_{\mathbf p}[s^{(i)}(\mathbf X)]=0$ and
\[
\mathrm{Var}_{\mathbf p}(s^{(i)}(\mathbf X))=\frac{T\,\mathrm{Var}_{\mathbf p}(I(X_t=x^{(i)}))}{(p^{(i)})^2}=\frac{T(1-p^{(i)})}{p^{(i)}}
\]
With these observations and (EC.9), we conclude that $\mathrm{Var}_{\mathbf p}(h(\mathbf X)s^{(i)}(\mathbf X))\le M^2T(1-p^{(i)})/p^{(i)}$.

Proof of Proposition 1.

Consider the Lagrangian for the optimization (11):
\[
\min_{(p^{(i)})_{i=1,\dots,n}\in\mathcal P}\ \sum_i\xi^{(i)}p^{(i)}+\alpha\left(\sum_ip^{(i)}\log\frac{p^{(i)}}{p_b^{(i)}}-\eta\right) \tag{EC.10}
\]
By Theorem 1, p. 220 of Luenberger (1969), if one can find $\alpha^*\ge0$ such that $\mathbf q=(q^{(i)})_{i=1,\dots,n}\in\mathcal P$ minimizes (EC.10) for $\alpha=\alpha^*$, and moreover $\alpha^*\big(\sum_iq^{(i)}\log(q^{(i)}/p_b^{(i)})-\eta\big)=0$, then $\mathbf q$ is optimal for (11).

Suppose $\alpha^*=0$. Then the minimizer of (EC.10) can be any probability distribution whose mass is concentrated on the set of indices in $\mathcal M$. Any such distribution that lies in $\mathcal A$ is an optimal solution to (11). To check whether any of them lies in $\mathcal A$, consider the one with minimum $D(\mathbf q\|\mathbf p_b)$ and check whether that minimum is at most $\eta$. In other words, we want to find $\min_{p^{(i)},i\in\mathcal M:\sum_{i\in\mathcal M}p^{(i)}=1}\sum_{i\in\mathcal M}p^{(i)}\log(p^{(i)}/p_b^{(i)})$. The optimal solution to this minimization is $p^{(i)}=p_b^{(i)}/\sum_{i\in\mathcal M}p_b^{(i)}$ for $i\in\mathcal M$, which gives optimal value $-\log\sum_{i\in\mathcal M}p_b^{(i)}$. Thus, if $-\log\sum_{i\in\mathcal M}p_b^{(i)}\le\eta$, we obtain an optimal solution $\mathbf q$ to (11) given by (12).

In the case that $\alpha^*=0$ does not lead to an optimal solution, or equivalently $-\log\sum_{i\in\mathcal M}p_b^{(i)}>\eta$, we consider $\alpha^*>0$. It can be shown by an elementary convexity argument (Hansen and Sargent 2008) that
\[
q^{(i)}=\frac{p_b^{(i)}e^{-\xi^{(i)}/\alpha^*}}{\sum_{i=1}^np_b^{(i)}e^{-\xi^{(i)}/\alpha^*}}
\]


minimizes (EC.10). Moreover, $\alpha^*>0$ can be chosen such that
\[
\sum_iq^{(i)}\log\frac{q^{(i)}}{p_b^{(i)}}=-\frac{\sum_{i=1}^n\xi^{(i)}p_b^{(i)}e^{-\xi^{(i)}/\alpha^*}}{\alpha^*\sum_{i=1}^np_b^{(i)}e^{-\xi^{(i)}/\alpha^*}}-\log\sum_{i=1}^np_b^{(i)}e^{-\xi^{(i)}/\alpha^*}=\eta
\]
Letting $\beta^*=-1/\alpha^*$, we obtain (13) and (14). Note that (14) must have a negative root: the left-hand side of (14) is $0$ at $\beta=0$, while as $\beta\to-\infty$ we have $\varphi_\xi(\beta)=\beta\min_i\xi^{(i)}+\log\sum_{i\in\mathcal M}p_b^{(i)}+O(e^{c_1\beta})$ for some positive constant $c_1$, and $\varphi_\xi'(\beta)=\min_i\xi^{(i)}+O(e^{c_2\beta})$ for some positive constant $c_2$, so that $\beta\varphi_\xi'(\beta)-\varphi_\xi(\beta)=-\log\sum_{i\in\mathcal M}p_b^{(i)}+O(e^{(c_1\wedge c_2)\beta})>\eta$ when $\beta$ is negative enough.

Proof of Proposition 2.

Consider the Lagrangian relaxation
\[
\max_{\alpha\ge0,\lambda\in\mathbb R}\ \min_{\mathbf p\ge0}\ \sum_{i=1}^np^{(i)}\xi^{(i)}+\alpha\left(\sum_{i=1}^np_b^{(i)}\phi\!\left(\frac{p^{(i)}}{p_b^{(i)}}\right)-\eta\right)+\lambda\left(\sum_{i=1}^np^{(i)}-1\right) \tag{EC.11}
\]
Carrying out the inner minimization componentwise,
\[
\min_{p^{(i)}\ge0}\left\{p^{(i)}(\xi^{(i)}+\lambda)+\alpha\,p_b^{(i)}\phi\!\left(\frac{p^{(i)}}{p_b^{(i)}}\right)\right\}=-\alpha\,p_b^{(i)}\phi^*\!\left(-\frac{\xi^{(i)}+\lambda}{\alpha}\right)
\]
so that (EC.11) equals
\[
\max_{\alpha\ge0,\lambda\in\mathbb R}\ -\alpha\sum_{i=1}^np_b^{(i)}\phi^*\!\left(-\frac{\xi^{(i)}+\lambda}{\alpha}\right)-\alpha\eta-\lambda
\]
where the case $\alpha=0$ is handled by defining $0\,\phi^*(s/0)=0$ for $s\le0$ and $0\,\phi^*(s/0)=+\infty$ for $s>0$. This gives (15) and (16).

In the case $\alpha^*=0$, the optimal value of (EC.11) is the same as

\[
\max_{\lambda\in\mathbb R}\ \min_{\mathbf p\ge0}\ \sum_{i=1}^np^{(i)}\xi^{(i)}+\lambda\left(\sum_{i=1}^np^{(i)}-1\right)
\]
which is equivalent to $\min_{\mathbf p\in\mathcal P}\sum_{i=1}^np^{(i)}\xi^{(i)}=\min_{j\in\{1,\dots,n\}}\xi^{(j)}$. Hence there must be an optimal solution given by solving
\[
\min_{p^{(i)},i\in\mathcal M:\ \sum_{i\in\mathcal M}p^{(i)}=1}\ \sum_{i\in\mathcal M}p_b^{(i)}\phi\!\left(\frac{p^{(i)}}{p_b^{(i)}}\right) \tag{EC.12}
\]
Now note that, by convexity of $\phi$, for any $p^{(i)},i\in\mathcal M$, with $\sum_{i\in\mathcal M}p^{(i)}=1$, and since $\sum_{i\in\mathcal M}p_b^{(i)}/\sum_{i\in\mathcal M}p_b^{(i)}=1$, we have
\[
\sum_{i\in\mathcal M}p_b^{(i)}\phi\!\left(\frac{p^{(i)}}{p_b^{(i)}}\right)\ge\left(\sum_{i\in\mathcal M}p_b^{(i)}\right)\phi\!\left(\frac{\sum_{i\in\mathcal M}p^{(i)}}{\sum_{i\in\mathcal M}p_b^{(i)}}\right)=\left(\sum_{i\in\mathcal M}p_b^{(i)}\right)\phi\!\left(\frac{1}{\sum_{i\in\mathcal M}p_b^{(i)}}\right) \tag{EC.13}
\]
It is easy to see that choosing $p^{(i)}$ as the $q^{(i)}$ depicted in (17) achieves the lower bound in (EC.13), which concludes the proposition.
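The closed-form characterizations in Propositions 1 and 2 make the FWSA subprogram cheap to solve. As a sketch of the Kullback-Leibler case of Proposition 1, using a one-dimensional bisection for the multiplier $\alpha^*$ (the inputs `xi`, `p_b`, and `eta` are illustrative, and the example assumes a strictly positive $\alpha^*$ exists, which the initial bracket reflects):

```python
import numpy as np

# Solve  min_q  xi'q  s.t.  KL(q || p_b) <= eta  over the simplex,
# via the exponential-tilting form of Proposition 1.
xi = np.array([0.3, -0.1, 0.7, 0.2])
p_b = np.array([0.25, 0.25, 0.25, 0.25])
eta = 0.1

def tilt(alpha):
    # q^(i) proportional to p_b^(i) exp(-xi^(i)/alpha); shifted for stability
    w = p_b * np.exp(-(xi - xi.min()) / alpha)
    return w / w.sum()

def kl(q):
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p_b[mask])))

# KL(tilt(alpha) || p_b) decreases in alpha; bisect (on log scale) for KL = eta.
lo, hi = 1e-6, 1e6
for _ in range(200):
    mid = np.sqrt(lo * hi)
    if kl(tilt(mid)) > eta:
        lo = mid
    else:
        hi = mid
q = tilt(hi)
```

For a general $\phi$-divergence, the same one- or two-dimensional multiplier search applies to the dual form in Proposition 2.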


Proof of Theorem 2. The proof is an adaptation of Blum (1954). Since $h(\mathbf X)$ is uniformly bounded a.s., we have $|h(\mathbf X)|\le M$ a.s. for some $M$. Without loss of generality, we assume that $Z(\mathbf p)\ge0$ for all $\mathbf p$. Also note that $Z(\mathbf p)$, being a high-dimensional polynomial, is continuous everywhere in $\mathcal A$.

For notational convenience, write $\mathbf d_k=\mathbf q(\mathbf p_k)-\mathbf p_k$ and $\hat{\mathbf d}_k=\hat{\mathbf q}(\mathbf p_k)-\mathbf p_k$; i.e., $\mathbf d_k$ is the $k$-th best feasible direction given the exact gradient, and $\hat{\mathbf d}_k$ is the one obtained from the estimated gradient.

Now, given $\mathbf p_k$, consider the iterative update $\mathbf p_{k+1}=(1-\epsilon_k)\mathbf p_k+\epsilon_k\hat{\mathbf q}(\mathbf p_k)=\mathbf p_k+\epsilon_k\hat{\mathbf d}_k$. By Taylor series expansion,
\[
Z(\mathbf p_{k+1})=Z(\mathbf p_k)+\epsilon_k\nabla Z(\mathbf p_k)'\hat{\mathbf d}_k+\frac{\epsilon_k^2}{2}\hat{\mathbf d}_k'\nabla^2Z(\mathbf p_k+\theta_k\epsilon_k\hat{\mathbf d}_k)\hat{\mathbf d}_k
\]
for some $\theta_k$ between 0 and 1. By Theorem 1, we can rewrite this as
\[
Z(\mathbf p_{k+1})=Z(\mathbf p_k)+\epsilon_k\psi(\mathbf p_k)'\hat{\mathbf d}_k+\frac{\epsilon_k^2}{2}\hat{\mathbf d}_k'\nabla^2Z(\mathbf p_k+\theta_k\epsilon_k\hat{\mathbf d}_k)\hat{\mathbf d}_k \tag{EC.14}
\]
Consider the second term on the right-hand side of (EC.14). We can write
\begin{align*}
\psi(\mathbf p_k)'\hat{\mathbf d}_k &= \hat\psi(\mathbf p_k)'\hat{\mathbf d}_k+(\psi(\mathbf p_k)-\hat\psi(\mathbf p_k))'\hat{\mathbf d}_k\\
&\le \hat\psi(\mathbf p_k)'\mathbf d_k+(\psi(\mathbf p_k)-\hat\psi(\mathbf p_k))'\hat{\mathbf d}_k\quad\text{by the definition of }\hat{\mathbf d}_k\\
&= \psi(\mathbf p_k)'\mathbf d_k+(\hat\psi(\mathbf p_k)-\psi(\mathbf p_k))'\mathbf d_k+(\psi(\mathbf p_k)-\hat\psi(\mathbf p_k))'\hat{\mathbf d}_k\\
&= \psi(\mathbf p_k)'\mathbf d_k+(\hat\psi(\mathbf p_k)-\psi(\mathbf p_k))'(\mathbf d_k-\hat{\mathbf d}_k) \tag{EC.15}
\end{align*}
Hence (EC.14) and (EC.15) together imply
\[
Z(\mathbf p_{k+1})\le Z(\mathbf p_k)+\epsilon_k\psi(\mathbf p_k)'\mathbf d_k+\epsilon_k(\hat\psi(\mathbf p_k)-\psi(\mathbf p_k))'(\mathbf d_k-\hat{\mathbf d}_k)+\frac{\epsilon_k^2}{2}\hat{\mathbf d}_k'\nabla^2Z(\mathbf p_k+\theta_k\epsilon_k\hat{\mathbf d}_k)\hat{\mathbf d}_k
\]
Let $\mathcal F_k$ be the filtration generated by $\mathbf p_1,\dots,\mathbf p_k$. We then have
\[
E[Z(\mathbf p_{k+1})|\mathcal F_k]\le Z(\mathbf p_k)+\epsilon_k\psi(\mathbf p_k)'\mathbf d_k+\epsilon_kE[(\hat\psi(\mathbf p_k)-\psi(\mathbf p_k))'(\mathbf d_k-\hat{\mathbf d}_k)|\mathcal F_k]+\frac{\epsilon_k^2}{2}E[\hat{\mathbf d}_k'\nabla^2Z(\mathbf p_k+\theta_k\epsilon_k\hat{\mathbf d}_k)\hat{\mathbf d}_k|\mathcal F_k] \tag{EC.16}
\]


We analyze (EC.16) term by term. First, since $Z(\mathbf p)$ is a high-dimensional polynomial and $\mathcal A$ is a bounded set, the largest eigenvalue of the Hessian matrix $\nabla^2Z(\mathbf p)$, for any $\mathbf p\in\mathcal A$, is uniformly bounded by a constant $H>0$. Hence
\[
E[\hat{\mathbf d}_k'\nabla^2Z(\mathbf p_k+\theta_k\epsilon_k\hat{\mathbf d}_k)\hat{\mathbf d}_k|\mathcal F_k]\le H\,E[\|\hat{\mathbf d}_k\|^2|\mathcal F_k]\le V<\infty \tag{EC.17}
\]
for some $V>0$. Next,
\begin{align}
&E[(\hat\psi(\mathbf p_k)-\psi(\mathbf p_k))'(\mathbf d_k-\hat{\mathbf d}_k)|\mathcal F_k] \tag{EC.18}\\
&\le \sqrt{E[\|\hat\psi(\mathbf p_k)-\psi(\mathbf p_k)\|_2^2|\mathcal F_k]\,E[\|\mathbf d_k-\hat{\mathbf d}_k\|_2^2|\mathcal F_k]}\quad\text{by the Cauchy-Schwarz inequality}\notag\\
&\le \sqrt{E[\|\hat\psi(\mathbf p_k)-\psi(\mathbf p_k)\|_2^2|\mathcal F_k]\,E[2(\|\mathbf d_k\|_2^2+\|\hat{\mathbf d}_k\|_2^2)|\mathcal F_k]}\quad\text{by the parallelogram law}\notag\\
&\le \sqrt{8E[\|\hat\psi(\mathbf p_k)-\psi(\mathbf p_k)\|_2^2|\mathcal F_k]}\quad\text{since }\|\mathbf d_k\|_2^2,\|\hat{\mathbf d}_k\|_2^2\le2\text{ because }\mathbf p_k,\mathbf q(\mathbf p_k),\hat{\mathbf q}(\mathbf p_k)\in\mathcal P\notag\\
&\le \sqrt{\frac{8M^2T}{m_k}\sum_{i=1}^n\frac{1-p_k^{(i)}}{p_k^{(i)}}}\quad\text{by Lemma 1}\notag\\
&\le M\sqrt{\frac{8Tn}{m_k\min_{i=1,\dots,n}p_k^{(i)}}} \tag{EC.19}
\end{align}
Note that by iterating the update rule $\mathbf p_{k+1}=(1-\epsilon_k)\mathbf p_k+\epsilon_k\mathbf q_k$, we have
\[
\min_{i=1,\dots,n}p_k^{(i)}\ge\prod_{j=1}^{k-1}(1-\epsilon_j)\,\delta
\]
where $\delta=\min_{i=1,\dots,n}p_1^{(i)}>0$. Noting that $\psi(\mathbf p_k)'\mathbf d_k\le0$ by the definition of $\mathbf d_k$, (EC.19) is thus at most
\[
M\sqrt{\frac{8Tn}{\delta m_k}}\prod_{j=1}^{k-1}(1-\epsilon_j)^{-1/2} \tag{EC.20}
\]
Therefore, from (EC.16) we have
\[
E[Z(\mathbf p_{k+1})-Z(\mathbf p_k)|\mathcal F_k]\le\epsilon_kM\sqrt{\frac{8Tn}{\delta m_k}}\prod_{j=1}^{k-1}(1-\epsilon_j)^{-1/2}+\frac{\epsilon_k^2V}{2}
\]
and hence
\[
\sum_{k=1}^\infty E\big[E[Z(\mathbf p_{k+1})-Z(\mathbf p_k)|\mathcal F_k]^+\big]\le M\sqrt{\frac{8Tn}{\delta}}\sum_{k=1}^\infty\frac{\epsilon_k}{\sqrt{m_k}}\prod_{j=1}^{k-1}(1-\epsilon_j)^{-1/2}+\sum_{k=1}^\infty\frac{\epsilon_k^2V}{2} \tag{EC.21}
\]


By Assumptions 2 and 3, and Lemma EC.1 (depicted after this proof), $Z(\mathbf p_k)$ converges to an integrable random variable. Now take expectations in (EC.16) to get
\[
E[Z(\mathbf p_{k+1})]\le E[Z(\mathbf p_k)]+\epsilon_kE[\psi(\mathbf p_k)'\mathbf d_k]+\epsilon_kE[(\hat\psi(\mathbf p_k)-\psi(\mathbf p_k))'(\mathbf d_k-\hat{\mathbf d}_k)]+\frac{\epsilon_k^2}{2}E[\hat{\mathbf d}_k'\nabla^2Z(\mathbf p_k+\theta_k\epsilon_k\hat{\mathbf d}_k)\hat{\mathbf d}_k]
\]
and telescope to get
\[
E[Z(\mathbf p_{k+1})]\le E[Z(\mathbf p_1)]+\sum_{j=1}^k\epsilon_jE[\psi(\mathbf p_j)'\mathbf d_j]+\sum_{j=1}^k\epsilon_jE[(\hat\psi(\mathbf p_j)-\psi(\mathbf p_j))'(\mathbf d_j-\hat{\mathbf d}_j)]+\sum_{j=1}^k\frac{\epsilon_j^2}{2}E[\hat{\mathbf d}_j'\nabla^2Z(\mathbf p_j+\theta_j\epsilon_j\hat{\mathbf d}_j)\hat{\mathbf d}_j] \tag{EC.22}
\]
Now take limits on both sides of (EC.22). Note that $E[Z(\mathbf p_{k+1})]\to E[Z_\infty]$ for some integrable $Z_\infty$ by the dominated convergence theorem. Also $Z(\mathbf p_1)<\infty$, and by (EC.17) and (EC.20) respectively, we have
\[
\lim_{k\to\infty}\sum_{j=1}^k\frac{\epsilon_j^2}{2}E[\hat{\mathbf d}_j'\nabla^2Z(\mathbf p_j+\theta_j\epsilon_j\hat{\mathbf d}_j)\hat{\mathbf d}_j]\le\sum_{j=1}^\infty\frac{\epsilon_j^2V}{2}<\infty
\]

Proof of Theorem 3. Define the events
\[
\mathcal E_k=\big\{\|\hat\psi_k-\psi_k\|_2>\vartheta\big\}
\]
and
\[
\mathcal E_k'=\left\{|(\hat\psi_k-\psi_k)'(\hat{\mathbf d}_k-\mathbf d_k)|>\frac{\varrho}{k^\gamma}\right\}
\]
for $k\ge k_0$, and let $\mathcal E=\bigcup_{k\ge k_0}(\mathcal E_k\cup\mathcal E_k')$. Note that by the Markov inequality,
\[
P(\mathcal E_k)\le\frac{E\|\hat\psi_k-\psi_k\|_2^2}{\vartheta^2}\le\frac{M^2T}{\vartheta^2m_k}\sum_{i=1}^n\frac{1-p_k^{(i)}}{p_k^{(i)}}\le\frac{M^2Tn}{\vartheta^2m_k\delta}\prod_{j=1}^{k-1}(1-\epsilon_j)^{-1}
\]


where the second inequality follows from Lemma 1 and the last follows as in the derivation of (EC.19) and (EC.20). On the other hand, we have
\begin{align*}
P(\mathcal E_k') &\le \frac{k^\gamma}{\varrho}E|(\hat\psi_k-\psi_k)'(\hat{\mathbf d}_k-\mathbf d_k)|\\
&\le \frac{k^\gamma}{\varrho}\sqrt{E[\|\hat\psi_k-\psi_k\|_2^2]\,E[\|\mathbf d_k-\hat{\mathbf d}_k\|_2^2]}\quad\text{by the Cauchy-Schwarz inequality}\\
&\le \frac{k^\gamma L}{\varrho}E[\|\hat\psi_k-\psi_k\|_2^2]\quad\text{by Assumption 4}\\
&\le \frac{LM^2Tnk^\gamma}{m_k\varrho\delta}\prod_{j=1}^{k-1}(1-\epsilon_j)^{-1}\quad\text{by following the derivation of (EC.19) and (EC.20)}
\end{align*}
Therefore,
\begin{align*}
P(\mathcal E) &\le \sum_{k=k_0}^\infty P(\mathcal E_k)+\sum_{k=k_0}^\infty P(\mathcal E_k')\\
&\le \frac{M^2Tn}{\delta}\sum_{k=k_0}^\infty\left(\frac{1}{\vartheta^2}+\frac{Lk^\gamma}{\varrho}\right)\frac{1}{m_k}\prod_{j=1}^{k-1}(1-\epsilon_j)^{-1}\\
&\le \frac{M^2Tn}{\delta}\prod_{j=1}^{k_0-1}(1-\epsilon_j)^{-1}\sum_{k=k_0}^\infty\left(\frac{1}{\vartheta^2}+\frac{Lk^\gamma}{\varrho}\right)\frac{1}{m_k}\prod_{j=k_0}^{k-1}(1-\epsilon_j)^{-1} \tag{EC.23}
\end{align*}

Now recall that $\epsilon_k=a/k$. Using the fact that $1-x\ge e^{-\rho x}$ for any $0\le x\le(\rho-1)/\rho$ and $\rho>1$, we have, for any $k$ with $a/k\le(\rho-1)/\rho$, or equivalently
\[
k\ge\frac{a\rho}{\rho-1},
\]
that
\[
1-\epsilon_k=1-\frac ak\ge e^{-\rho a/k}
\]
Hence, choosing $k_0$ satisfying Condition 4, we get
\[
\prod_{j=k_0}^{k-1}(1-\epsilon_j)^{-1}\le e^{\rho a\sum_{j=k_0}^{k-1}1/j}\le\left(\frac{k-1}{k_0-1}\right)^{\rho a} \tag{EC.24}
\]

Therefore, picking $m_k=bk^\beta$ and using (EC.24), (EC.23) is bounded from above by
\begin{align*}
&\frac{M^2Tn}{\delta}\prod_{j=1}^{k_0-1}(1-\epsilon_j)^{-1}\sum_{k=k_0}^\infty\left(\frac{1}{\vartheta^2b(k_0-1)^{\rho a}k^{\beta-\rho a}}+\frac{L}{\varrho b(k_0-1)^{\rho a}k^{\beta-\gamma-\rho a}}\right)\\
&\le \frac{M^2Tn}{\delta b}\prod_{j=1}^{k_0-1}(1-\epsilon_j)^{-1}\left(\frac{1}{\vartheta^2(\beta-\rho a-1)(k_0-1)^{\beta-1}}+\frac{L}{\varrho(\beta-\gamma-\rho a-1)(k_0-1)^{\beta-\gamma-1}}\right)
\end{align*}


if Condition 5 holds. Then Condition 6 guarantees that $P(\mathcal E)<\varepsilon$. The rest of the proof shows that under the event $\mathcal E^c$ the bound (18) must hold, which concludes the theorem.

To this end, we first set up a recursive representation of $g_k$. Consider
\begin{align*}
g_{k+1} &= -\psi_{k+1}'\mathbf d_{k+1}=-\psi_{k+1}'(\mathbf q_{k+1}-\mathbf p_{k+1})\\
&= -\psi_k'(\mathbf q_{k+1}-\mathbf p_{k+1})+(\psi_k-\psi_{k+1})'(\mathbf q_{k+1}-\mathbf p_{k+1})\\
&= -\psi_k'(\mathbf q_{k+1}-\mathbf p_k)+\psi_k'(\mathbf p_{k+1}-\mathbf p_k)+(\psi_k-\psi_{k+1})'(\mathbf q_{k+1}-\mathbf p_{k+1})\\
&\le g_k+\epsilon_k\psi_k'\hat{\mathbf d}_k+(\psi_k-\psi_{k+1})'\mathbf d_{k+1}\quad\text{by the definition of }g_k,\hat{\mathbf d}_k\text{ and }\mathbf d_{k+1}\\
&\le g_k-\epsilon_kg_k+\epsilon_k(\hat\psi_k-\psi_k)'(\mathbf d_k-\hat{\mathbf d}_k)+(\psi_k-\psi_{k+1})'\mathbf d_{k+1}\quad\text{by (EC.15)}\\
&= (1-\epsilon_k)g_k+(\nabla Z_k-\nabla Z_{k+1})'\mathbf d_{k+1}+\epsilon_k(\hat\psi_k-\psi_k)'(\mathbf d_k-\hat{\mathbf d}_k) \tag{EC.25}
\end{align*}

Now, since $\nabla Z(\cdot)$ is continuously differentiable, we have $\nabla Z_{k+1}=\nabla Z_k+\epsilon_k\nabla^2Z(\mathbf p_k+\tilde\theta_k\epsilon_k\hat{\mathbf d}_k)\hat{\mathbf d}_k$ for some $\tilde\theta_k$ between 0 and 1. Therefore (EC.25) equals
\begin{align*}
&(1-\epsilon_k)g_k-\epsilon_k\hat{\mathbf d}_k'\nabla^2Z(\mathbf p_k+\tilde\theta_k\epsilon_k\hat{\mathbf d}_k)\mathbf d_{k+1}+\epsilon_k(\hat\psi_k-\psi_k)'(\mathbf d_k-\hat{\mathbf d}_k)\\
&\le (1-\epsilon_k)g_k+\epsilon_kK\|\hat{\mathbf d}_k\|\|\mathbf d_{k+1}\|+\epsilon_k(\hat\psi_k-\psi_k)'(\mathbf d_k-\hat{\mathbf d}_k)\quad\text{by Condition 7}\\
&\le (1-\epsilon_k)g_k+\epsilon_kK\|\mathbf d_k\|\|\mathbf d_{k+1}\|+\epsilon_kK\|\hat{\mathbf d}_k-\mathbf d_k\|\|\mathbf d_{k+1}\|+\epsilon_k(\hat\psi_k-\psi_k)'(\mathbf d_k-\hat{\mathbf d}_k)\quad\text{by the triangle inequality}\\
&\le (1-\epsilon_k)g_k+\epsilon_kK\frac{g_kg_{k+1}}{c^2\|\psi_k\|\|\psi_{k+1}\|}+\epsilon_kKL\|\hat\psi_k-\psi_k\|\frac{g_{k+1}}{c\|\psi_{k+1}\|}+\epsilon_k(\hat\psi_k-\psi_k)'(\mathbf d_k-\hat{\mathbf d}_k)\quad\text{by Assumptions 4 and 5}\\
&\le (1-\epsilon_k)g_k+\epsilon_k\frac{Kg_kg_{k+1}}{c^2\tau^2}+\epsilon_k\frac{KL}{c\tau}\|\hat\psi_k-\psi_k\|g_{k+1}+\epsilon_k(\hat\psi_k-\psi_k)'(\mathbf d_k-\hat{\mathbf d}_k)\quad\text{by Assumption 6} \tag{EC.26}
\end{align*}
Now under the event $\mathcal E^c$, and noting that $\epsilon_k=a/k$, (EC.26) implies that
\[
g_{k+1}\le\left(1-\frac ak\right)g_k+\frac{aK}{c^2\tau^2k}g_kg_{k+1}+\frac{aKL\vartheta}{c\tau k}g_{k+1}+\frac{a\varrho}{k^{1+\gamma}}
\]
or
\[
\left(1-\frac{aK}{c^2\tau^2k}g_k-\frac{aKL\vartheta}{c\tau k}\right)g_{k+1}\le\left(1-\frac ak\right)g_k+\frac{a\varrho}{k^{1+\gamma}}
\]


We claim that $|g_k|=|\psi_k'\mathbf d_k|\le4MT$. This can be seen by writing
\[
\psi^{(i)}(\mathbf p)=E_{\mathbf p}[h(\mathbf X)s^{(i)}(\mathbf X)]=E_{\mathbf p}\left[h(\mathbf X)\sum_{t=1}^T\frac{I(X_t=x^{(i)})}{p^{(i)}}\right]-TE_{\mathbf p}[h(\mathbf X)]=\sum_{t=1}^TE_{\mathbf p}[h(\mathbf X)|X_t=x^{(i)}]-TE_{\mathbf p}[h(\mathbf X)] \tag{EC.27}
\]
so that $|\psi^{(i)}(\mathbf p)|\le2MT$ for any $\mathbf p$ and $i$. Using this and the fact that $1/(1-x)\le1+2x$ for any $0\le x\le1/2$, we have, whenever
\[
\frac{4aKMT}{c^2\tau^2k}+\frac{aKL\vartheta}{c\tau k}\le\frac12, \tag{EC.28}
\]
that
\[
g_{k+1}\le\left(1+\frac{2aK}{c^2\tau^2k}g_k+\frac{2aKL\vartheta}{c\tau k}\right)\left(\left(1-\frac ak\right)g_k+\frac{a\varrho}{k^{1+\gamma}}\right) \tag{EC.29}
\]
Note that (EC.28) holds if
\[
k\ge2a\left(\frac{4KMT}{c^2\tau^2}+\frac{KL\vartheta}{c\tau}\right)
\]
which is Condition 1 in the theorem. Now (EC.29) can be expanded as
\begin{align*}
g_{k+1} &\le \left(1-\frac ak+\frac{2aKL\vartheta}{c\tau k}\left(1-\frac ak\right)\right)g_k+\frac{a\varrho}{k^{1+\gamma}}+\frac{2a^2K\varrho}{c^2\tau^2k^{2+\gamma}}g_k+\frac{2a^2KL\vartheta\varrho}{c\tau k^{2+\gamma}}+\left(1-\frac ak\right)\frac{2aK}{c^2\tau^2k}g_k^2\\
&\le \left(1-\frac ak+\frac{2aKL\vartheta}{c\tau k}+\frac{2a^2K\varrho}{c^2\tau^2k^{2+\gamma}}\right)g_k+\frac{a\varrho}{k^{1+\gamma}}+\frac{2a^2KL\vartheta\varrho}{c\tau k^{2+\gamma}}+\left(1-\frac ak\right)\frac{2aK}{c^2\tau^2k}g_k^2 \tag{EC.30}
\end{align*}
We argue that under Condition 2 we must have $g_k\le\nu$ for all $k\ge k_0$. This follows easily by induction using (EC.30). By our setting at the beginning of this proof we have $g_{k_0}\le\nu$. Suppose $g_k\le\nu$ for some $k$. We then have
\begin{align*}
g_{k+1} &\le \left(1-\frac ak+\frac{2aKL\vartheta}{c\tau k}+\frac{2a^2K\varrho}{c^2\tau^2k^{2+\gamma}}\right)\nu+\frac{a\varrho}{k^{1+\gamma}}+\frac{2a^2KL\vartheta\varrho}{c\tau k^{2+\gamma}}+\left(1-\frac ak\right)\frac{2aK}{c^2\tau^2k}\nu^2\\
&\le \nu+\frac ak\left(\left(-1+\frac{2KL\vartheta}{c\tau}+\frac{2aK\varrho}{c^2\tau^2k_0^{1+\gamma}}\right)\nu+\frac{\varrho}{k_0^\gamma}+\frac{2aKL\vartheta\varrho}{c\tau k_0^{1+\gamma}}+\frac{2K\nu^2}{c^2\tau^2}\right)\\
&\le \nu \tag{EC.31}
\end{align*}
by Condition 2. This concludes the claim.


Given that $g_k\le\nu$ for all $k\ge k_0$, (EC.29) implies
\begin{align*}
g_{k+1} &\le \left(1-\frac ak\left(1-\frac{2KL\vartheta}{c\tau}-\frac{2K\nu}{c^2\tau^2}\left(1-\frac ak\right)\right)\right)g_k+\frac{a\varrho}{k^{1+\gamma}}+\frac{a^2\varrho}{k^{2+\gamma}}\left(\frac{2K\nu}{c^2\tau^2}+\frac{2KL\vartheta}{c\tau}\right)\\
&\le \left(1-\frac ak\left(1-\frac{2KL\vartheta}{c\tau}-\frac{2K\nu}{c^2\tau^2}\right)\right)g_k+\frac{a\varrho}{k^{1+\gamma}}+\frac{a^2\varrho}{k^{2+\gamma}}\left(\frac{2K\nu}{c^2\tau^2}+\frac{2KL\vartheta}{c\tau}\right)\\
&\le \left(1-\frac Ck\right)g_k+\frac{G}{k^{1+\gamma}} \tag{EC.32}
\end{align*}
where
\[
C=a\left(1-\frac{2KL\vartheta}{c\tau}-\frac{2K\nu}{c^2\tau^2}\right)\quad\text{and}\quad G=a\varrho+\frac{a^2\varrho}{k_0}\left(\frac{2K\nu}{c^2\tau^2}+\frac{2KL\vartheta}{c\tau}\right)
\]
Now note that Condition 3 implies $C>0$. Recursing the relation (EC.32), we get
\begin{align*}
g_{k+1} &\le \prod_{j=k_0}^k\left(1-\frac Cj\right)g_{k_0}+\sum_{j=k_0}^k\prod_{i=j+1}^k\left(1-\frac Ci\right)\frac{G}{j^{1+\gamma}}\\
&\le e^{-C\sum_{j=k_0}^k1/j}g_{k_0}+\sum_{j=k_0}^ke^{-C\sum_{i=j+1}^k1/i}\frac{G}{j^{1+\gamma}}\\
&\le \left(\frac{k_0}{k+1}\right)^Cg_{k_0}+\sum_{j=k_0}^k\left(\frac{j+1}{k+1}\right)^C\frac{G}{j^{1+\gamma}}\\
&\le \left(\frac{k_0}{k+1}\right)^Cg_{k_0}+\left(1+\frac1{k_0}\right)^CG\times
\begin{cases}
\dfrac{1}{(C-\gamma)(k+1)^\gamma} & \text{if }0<\gamma<C\\[2mm]
\dfrac{1}{(\gamma-C)(k_0-1)^{\gamma-C}}\cdot\dfrac{1}{(k+1)^C} & \text{if }\gamma>C\\[2mm]
\dfrac{\log(k/(k_0-1))}{(k+1)^C} & \text{if }\gamma=C
\end{cases}
\end{align*}
which gives (18). This concludes the proof.

Proof of Corollary 1. We use the notation from the proof of Theorem 3. Our analysis starts from

(EC.14), namely
\[
Z_{k+1}=Z_k+\epsilon_k\psi_k'\hat{\mathbf d}_k+\frac{\epsilon_k^2}{2}\hat{\mathbf d}_k'\nabla^2Z(\mathbf p_k+\theta_k\epsilon_k\hat{\mathbf d}_k)\hat{\mathbf d}_k
\]
for some $\theta_k$ between 0 and 1. Using the fact that $\psi_k'\hat{\mathbf d}_k\ge\psi_k'\mathbf d_k$ by the definition of $\mathbf d_k$, we have
\[
Z_{k+1}\ge Z_k+\epsilon_k\psi_k'\mathbf d_k+\frac{\epsilon_k^2}{2}\hat{\mathbf d}_k'\nabla^2Z(\mathbf p_k+\theta_k\epsilon_k\hat{\mathbf d}_k)\hat{\mathbf d}_k=Z_k-\epsilon_kg_k+\frac{\epsilon_k^2}{2}\hat{\mathbf d}_k'\nabla^2Z(\mathbf p_k+\theta_k\epsilon_k\hat{\mathbf d}_k)\hat{\mathbf d}_k
\]
Now, using (18) and Condition 7 in Theorem 3, we have
\begin{align*}
Z_{k+1} &\ge Z_k-\epsilon_k\left(\frac{A}{k^C}+B\times
\begin{cases}\frac{1}{(C-\gamma)k^\gamma}&\text{if }0<\gamma<C\\[1mm]\frac{1}{(\gamma-C)(k_0-1)^{\gamma-C}}\cdot\frac1{k^C}&\text{if }\gamma>C\\[1mm]\frac{\log((k-1)/(k_0-1))}{k^C}&\text{if }\gamma=C\end{cases}
\right)-\frac{\epsilon_k^2K}{2}\\
&= Z_k-\frac{aA}{k^{1+C}}-aB\times
\begin{cases}\frac{1}{(C-\gamma)k^{1+\gamma}}&\text{if }0<\gamma<C\\[1mm]\frac{1}{(\gamma-C)(k_0-1)^{\gamma-C}}\cdot\frac1{k^{1+C}}&\text{if }\gamma>C\\[1mm]\frac{\log((k-1)/(k_0-1))}{k^{1+C}}&\text{if }\gamma=C\end{cases}
-\frac{a^2K}{2k^2} \tag{EC.33}
\end{align*}
Now, iterating (EC.33) from $k$ to $l$, we have
\[
Z_l\ge Z_k-aA\sum_{j=k}^{l-1}\frac1{j^{1+C}}-aB\times
\begin{cases}\frac1{C-\gamma}\sum_{j=k}^{l-1}\frac1{j^{1+\gamma}}&\text{if }0<\gamma<C\\[1mm]\frac{1}{(\gamma-C)(k_0-1)^{\gamma-C}}\sum_{j=k}^{l-1}\frac1{j^{1+C}}&\text{if }\gamma>C\\[1mm]\sum_{j=k}^{l-1}\frac{\log((j-1)/(k_0-1))}{j^{1+C}}&\text{if }\gamma=C\end{cases}
-\frac{a^2K}{2}\sum_{j=k}^{l-1}\frac1{j^2}
\]
and, letting $l\to\infty$,
\[
Z^*\ge Z_k-aA\sum_{j=k}^\infty\frac1{j^{1+C}}-aB\times
\begin{cases}\frac1{C-\gamma}\sum_{j=k}^\infty\frac1{j^{1+\gamma}}&\text{if }0<\gamma<C\\[1mm]\frac{1}{(\gamma-C)(k_0-1)^{\gamma-C}}\sum_{j=k}^\infty\frac1{j^{1+C}}&\text{if }\gamma>C\\[1mm]\sum_{j=k}^\infty\frac{\log((j-1)/(k_0-1))}{j^{1+C}}&\text{if }\gamma=C\end{cases}
-\frac{a^2K}{2}\sum_{j=k}^\infty\frac1{j^2} \tag{EC.34}
\]
where the convergence to $Z^*$ is guaranteed by Theorem 2. Bounding the sums by integrals, (EC.34) implies
\begin{align*}
Z^* &\ge Z_k-\frac{aA}{C(k-1)^C}-aB\times
\begin{cases}\frac1{(C-\gamma)\gamma(k-1)^\gamma}&\text{if }0<\gamma<C\\[1mm]\frac{1}{(\gamma-C)(k_0-1)^{\gamma-C}}\cdot\frac1{C(k-1)^C}&\text{if }\gamma>C\\[1mm]\frac{\log((k-1)/(k_0-1))}{C(k-1)^C}&\text{if }\gamma=C\end{cases}
-\frac{a^2K}{2(k-1)}\\
&\ge Z_k-\frac{D}{k-1}-\frac{E}{(k-1)^C}-F\times
\begin{cases}\frac1{(C-\gamma)\gamma(k-1)^\gamma}&\text{if }0<\gamma<C\\[1mm]\frac{1}{(\gamma-C)(k_0-1)^{\gamma-C}}\cdot\frac1{C(k-1)^C}&\text{if }\gamma>C\\[1mm]\frac{\log((k-1)/(k_0-1))}{C(k-1)^C}&\text{if }\gamma=C\end{cases}
\end{align*}
where $D=a^2K/2$, $E=aA/C$, and $F=aB$. This gives (20).

Proof of Lemma 2. Consider first a fixed $a$. When $a(1-\omega)>1$, (21) reduces to $\frac{\beta-\rho a-\zeta-1}{\beta+1}\wedge\frac1{\beta+1}$. Since $\frac{\beta-\rho a-\zeta-1}{\beta+1}$ is increasing in $\beta$ and $\frac1{\beta+1}$ is decreasing in $\beta$, the maximizer of $\frac{\beta-\rho a-\zeta-1}{\beta+1}\wedge\frac1{\beta+1}$ occurs at the intersection of $\frac{\beta-\rho a-\zeta-1}{\beta+1}$ and $\frac1{\beta+1}$, which is $\beta=\rho a+\zeta+2$. The associated value of (21) is $\frac1{\rho a+\zeta+3}$. When $a(1-\omega)\le1$, (21) reduces to $\frac{a(1-\omega)}{\beta+1}\wedge\frac{\beta-\rho a-\zeta-1}{\beta+1}$. By a similar argument, the maximizer is $\beta=a(1-\omega+\rho)+\zeta+1$, with the value of (21) equal to $\frac{a(1-\omega)}{a(1-\omega+\rho)+\zeta+2}$.

Thus, overall, given $a$, the optimal choice of $\beta$ is $\beta=\rho a+\zeta+1+(a(1-\omega))\wedge1$, with the value of (21) given by $\frac{(a(1-\omega))\wedge1}{\rho a+\zeta+2+(a(1-\omega))\wedge1}$. When $a(1-\omega)>1$, the value of (21) is $\frac1{\rho a+\zeta+3}$, which is decreasing in $a$, whereas when $a(1-\omega)\le1$, the value of (21) is $\frac{a(1-\omega)}{a(1-\omega+\rho)+\zeta+2}$, which is increasing in $a$. Thus the maximum occurs when $a(1-\omega)=1$, i.e. $a=\frac1{1-\omega}$. The associated value of (21) is $\frac1{\rho/(1-\omega)+\zeta+3}$.
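Lemma 2's tuning reduces to closed-form arithmetic, so the step-size and sample-size schedules can be computed directly. A minimal sketch (the values of ρ, ω, ζ are illustrative placeholders):

```python
# Optimal FWSA tuning per Lemma 2: step size eps_k = a/k with a = 1/(1-omega),
# sample size m_k = b k^beta with beta = rho*a + zeta + 1 + min(a(1-omega), 1),
# and the resulting convergence-rate exponent, which is the value of (21).
def lemma2_tuning(rho, omega, zeta):
    a = 1.0 / (1.0 - omega)
    beta = rho * a + zeta + 1.0 + min(a * (1.0 - omega), 1.0)
    rate = min(a * (1.0 - omega), 1.0) / (beta + 1.0)
    return a, beta, rate

a, beta, rate = lemma2_tuning(rho=1.5, omega=0.2, zeta=0.1)
```

With $a(1-\omega)=1$ the rate expression collapses to $1/(\rho/(1-\omega)+\zeta+3)$, matching the conclusion of the proof above.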

EC.4. Discussion of Remark 1

The moment-constrained formulation may violate Assumption 4. We consider fixing this by adding a "smoothing" constraint in the form of a large-entropy condition $-\sum_{i=1}^np^{(i)}\log p^{(i)}\ge\tilde\eta$, or equivalently $D(\mathbf p\|\tilde{\mathbf p})\le\kappa$, where $\tilde{\mathbf p}=(1/n)_{i=1,\dots,n}$ is the uniform distribution on all support points and $\kappa$ is a suitably large positive constant (so that the disruption to the original problem is negligible). Adding this on top of the moment constraints, the solution to (11) is
\[
q^{(i)}=\frac{\exp\left(-\dfrac{\xi^{(i)}+\sum_{j=1}^s\gamma_j^*r_j(x^{(i)})}{\alpha^*}\right)}{\sum_{k=1}^n\exp\left(-\dfrac{\xi^{(k)}+\sum_{j=1}^s\gamma_j^*r_j(x^{(k)})}{\alpha^*}\right)} \tag{EC.35}
\]
where
\[
(\alpha^*,\gamma_j^*,j=1,\dots,s)=\mathop{\mathrm{argmin}}_{\alpha\ge0,\gamma_j\ge0,j=1,\dots,s}\left\{-\alpha\log\sum_{i=1}^n\frac1n\exp\left(-\frac{\xi^{(i)}+\sum_{j=1}^s\gamma_jr_j(x^{(i)})}{\alpha}\right)-\alpha\kappa-\sum_{j=1}^s\gamma_j\mu_j\right\}
\]
and, in the case $\alpha^*=0$, we define $q^{(i)}=1/|\tilde{\mathcal M}|$ for $i\in\tilde{\mathcal M}$ and $0$ otherwise, where $\tilde{\mathcal M}=\mathop{\mathrm{argmin}}_k\{\xi^{(k)}+\sum_{j=1}^s\gamma_j^*r_j(x^{(k)})\}$ and $|\tilde{\mathcal M}|$ denotes the cardinality of $\tilde{\mathcal M}$. All of this can be seen from the Lagrangian relaxation
\[
\max_{\alpha\ge0,\gamma_j\ge0,j=1,\dots,s}\ \min_{\mathbf p\in\mathcal P}\ \sum_{i=1}^np^{(i)}\xi^{(i)}+\alpha\left(\sum_{k=1}^np^{(k)}\log(np^{(k)})-\kappa\right)+\sum_{j=1}^s\gamma_j\left(\sum_{i=1}^np^{(i)}r_j(x^{(i)})-\mu_j\right)
\]
and an argument similar to the proof of Proposition 1.