
arXiv:1604.02573v1 [stat.ME] 9 Apr 2016

The Empirical Likelihood Approach to Quantifying Uncertainty in Sample Average Approximation∗

Henry Lam†   Enlu Zhou‡

Abstract

We study the empirical likelihood approach to constructing confidence intervals for the optimal values, and thereby quantifying the statistical uncertainty of sample average approximation, for stochastic optimization problems with expected value objectives and constraints where the underlying probability distributions are observed via limited data. This approach casts the task of confidence interval construction as a pair of distributionally robust optimizations posited over the uncertain distribution, with a divergence-based uncertainty set that is suitably calibrated to provide asymptotic statistical guarantees.

Keywords: empirical likelihood; sample average approximation; confidence interval; constrained optimization; stochastic program; statistical uncertainty

1 Introduction

We consider a stochastic optimization problem of the form

$$\min_{x \in \Theta} \ \{ h(x) := E[H(x;\xi)] \}, \qquad (1)$$

where x is a continuous decision variable in the deterministic feasible region Θ ⊆ ℝ^p, and ξ ∈ ℝ^d is a random variable. We are interested in situations where the underlying probability distribution that governs the expectation E[·] is not fully known and can only be accessed via limited data ξ1, . . . , ξn. It is customary in this setting to work with an empirical counterpart of the problem, namely by solving the sample average approximation (SAA) (e.g., [13]):

$$\min_{x \in \Theta} \ \frac{1}{n} \sum_{i=1}^n H(x;\xi_i). \qquad (2)$$



∗ A preliminary conference version of this work has appeared in [9]. Research of the first author was partially supported by the National Science Foundation under Grants CMMI-1400391/1542020 and CMMI-1436247/1523453. Research of the second author was partially supported by the National Science Foundation under Grant CAREER CMMI-1453934, and by the Air Force Office of Scientific Research under Grant YIP FA-9550-14-1-0059.
† Department of Industrial and Operations Engineering, University of Michigan, Ann Arbor, MI. Email: [email protected]
‡ H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA. Email: [email protected]


We further consider problems with expected value constraints, which, without loss of generality, can be written as

$$\begin{array}{rl}
\min & h(x) = E[H(x;\xi)] \\
\text{subject to} & f_k(x) = E[F_k(x;\xi)] \le 0, \quad k = 1,\ldots,m' \\
& f_k(x) = E[F_k(x;\xi)] = 0, \quad k = m'+1,\ldots,m \\
& x \in \Lambda
\end{array} \qquad (3)$$

where

$$\Lambda = \left\{ x : \ g_k(x) \le 0,\ k = 1,\ldots,s'; \quad g_k(x) = 0,\ k = s'+1,\ldots,s; \quad x_j \ge 0,\ j = 1,\ldots,p' \right\}$$

is a deterministic set, and m′ ≤ m, s′ ≤ s, p′ ≤ p. Thus (3) is a general constrained optimization formulation with both equality and inequality constraints that can be either stochastic or deterministic. Again, under limited data ξ1, . . . , ξn, an SAA version of (3) is (e.g., [14])

$$\begin{array}{rl}
\min & \frac{1}{n}\sum_{i=1}^n H(x;\xi_i) \\
\text{subject to} & \frac{1}{n}\sum_{i=1}^n F_k(x;\xi_i) \le 0, \quad k = 1,\ldots,m' \\
& \frac{1}{n}\sum_{i=1}^n F_k(x;\xi_i) = 0, \quad k = m'+1,\ldots,m \\
& x \in \Lambda
\end{array} \qquad (4)$$

Our premise is that, beyond the n observed samples, new samples are not easily accessible, either because of a lack of data or because of limited computational capacity for running further Monte Carlo simulation. The optimal values and solutions obtained from (2) and (4) thus deviate from those under the genuine distribution in (1) and (3). The focus of this paper is to quantify the uncertainty in SAA by constructing confidence intervals (CIs) for the true optimal values. In this regard, a common existing approach is to employ asymptotics provided by the central limit theorem (CLT) and the delta method (e.g., [8], Section 5.1 in [13]).

The main contribution of this paper is to study a new approach to constructing CIs, as an alternative to directly using the CLT, that has asymptotic statistical guarantees and could possess better finite-sample performance. This approach uses the so-called empirical likelihood (EL) method in statistics. It reformulates the problem of finding the upper and lower bounds of a CI as a pair of distributionally robust optimization problems, where the uncertainty set is a divergence-based ball cast over the uncertain probability distribution. The size of the ball is suitably calibrated so that it provides asymptotic guarantees for the coverage probability of the resulting CI. We shall study the theory giving rise to such guarantees. We shall also discuss the tractability of the constituent robust optimization problems. We demonstrate, through two numerical examples on conditional value-at-risk and portfolio optimization, that our method compares favorably with CLT-based bounds in terms of finite-sample performance. The theory and the numerical aspects are studied in Sections 2 and 3 respectively.
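To fix ideas, here is a minimal sketch (ours, not from the paper) of forming and solving the SAA (2) with an off-the-shelf solver; the cost function H, the data, and the starting point x0 are user-supplied placeholders, and a smooth H is assumed for the default solver.

```python
# Illustrative sketch of the SAA (2): replace the expectation by the empirical
# average over the observed data and minimize over x.  H, data, x0 are
# placeholders supplied by the user.
import numpy as np
from scipy.optimize import minimize

def saa_optimize(H, data, x0):
    """Return the SAA solution and value of min_x (1/n) sum_i H(x; xi_i)."""
    objective = lambda x: np.mean([H(x, xi) for xi in data])
    res = minimize(objective, x0)
    return res.x, res.fun
```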

2 The Empirical Likelihood Method for Constructing Confidence Bounds for Optimal Values

This section studies in detail the EL method in constructing CIs for the optimal values. Section 2.1 focuses on the unconstrained case (1), and Section 2.2 generalizes to the general constrained case (3).


2.1 Unconstrained Optimization

Let us first fix some notation. Given the data ξ1, ξ2, . . . , ξn, we denote a probability vector over {ξ1, . . . , ξn} as w = (w1, . . . , wn), where Σ_{i=1}^n wi = 1 and wi ≥ 0 for all i. We denote χ²_{q,β} as the 1 − β quantile of a χ² distribution with q degrees of freedom. The EL method utilizes the optimization pair

$$\begin{array}{rl}
\max_w / \min_w & \min_{x\in\Theta} \sum_{i=1}^n w_i H(x;\xi_i) \\
\text{subject to} & -2\sum_{i=1}^n \log(n w_i) \le \chi^2_{p+1,\beta} \\
& \sum_{i=1}^n w_i = 1 \\
& w_i \ge 0 \ \text{for all } i
\end{array} \qquad (5)$$

where max/min denotes a pair of maximization and minimization problems. The quantity −(1/n) Σ_{i=1}^n log(nwi) can be interpreted as the Burg-entropy divergence ([12], [2]) between the probability distributions represented by the weights w and by the uniform weights (1/n, . . . , 1/n) on the support {ξ1, . . . , ξn}. Thus, the first constraint in (5) restricts w to a Burg-entropy divergence ball centered at the uniform weights, with radius χ²_{p+1,β}/(2n); for instance, with β = 0.05 and a scalar decision variable (p = 1), the radius is χ²_{2,0.05}/(2n) ≈ 5.99/(2n). Interpreted from the viewpoint of distributionally robust optimization (e.g., [5, 2]), the pair of optimizations in (5) outputs the worst-case scenarios of the optimal value min_{x∈Θ} E[H(x; ξ)] when E[·] is uncertain and its underlying distribution is believed to lie inside the divergence ball. From this viewpoint, the EL method is a mechanism that stipulates that using the ball size χ²_{p+1,β}/(2n) in the optimization pair (5) gives rise to statistically valid confidence bounds.

The method originates as a nonparametric analog of maximum likelihood estimation, first proposed by [10]. A celebrated result in parametric likelihood inference is Wilks' theorem [4], stating that the ratio between the maximum likelihood and the true likelihood, the so-called likelihood ratio, converges to a chi-square distribution on a suitable logarithmic scale. The theory underpinning the EL method is a nonparametric counterpart of this theorem. To explain, on the data set {ξ1, . . . , ξn}, we first define a nonparametric likelihood ∏_{i=1}^n wi, where wi is the weight applied to each data point. It is not difficult to see that the maximum value of ∏_{i=1}^n wi, among all w in the probability simplex, is ∏_{i=1}^n (1/n). In fact, the same conclusion holds even if one allows putting weights outside the support of the data, which could only make the likelihood ∏_{i=1}^n wi smaller. Hence ∏_{i=1}^n (1/n) can be viewed as a maximum likelihood in the nonparametric space. Correspondingly, we define the nonparametric likelihood ratio between the weights w and the maximum likelihood weights as ∏_{i=1}^n wi / ∏_{i=1}^n (1/n) = ∏_{i=1}^n (nwi).

The next step is to incorporate a target parameter of interest, i.e., the quantity whose statistical uncertainty is to be assessed. For this we introduce the concept of estimating equations. Suppose we want to estimate some parameters θ ∈ ℝ^p. The true parameters are known to satisfy the set of equations E[t(ξ; θ)] = 0, where E[·] is the expectation for the random object ξ ∈ ℝ^d, t(ξ; θ) ∈ ℝ^s, and 0 is a zero vector. We define the nonparametric profile likelihood ratio as

$$R(\theta) = \max\left\{ \prod_{i=1}^n n w_i : \ \sum_{i=1}^n w_i t(\xi_i;\theta) = 0, \ \sum_{i=1}^n w_i = 1, \ w_i \ge 0 \ \text{for all } i \right\} \qquad (6)$$

where profiling refers to the maximization over all weights that respect the empirical version of the equations E[t(ξ; θ)] = 0. With the above definitions, the crux is the empirical likelihood theorem (ELT):

Theorem 1 (Theorem 3.4 in [11]). Let ξ1, . . . , ξn ∈ ℝ^d be i.i.d. data. Let θ0 ∈ ℝ^p be the value of the parameter that satisfies E[t(ξ; θ0)] = 0, where t(ξ; θ) ∈ ℝ^s. Assume Var(t(ξ; θ0)) is finite and has rank q > 0. Then −2 log R(θ0) ⇒ χ²_q, where R(·) is defined in (6).
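As a concrete illustration (ours, not the authors' code), the statistic −2 log R(θ) in (6) can be evaluated by solving the convex program over the weights w directly; the sketch below uses SciPy's SLSQP solver, with the estimating function t and the data array assumed to be supplied by the user.

```python
# Evaluate -2 log R(theta) of (6) by direct convex optimization over w.
# t(xi_i, theta) returns an s-vector; xi is an array of data points.
import numpy as np
from scipy.optimize import minimize

def neg2_log_R(t, xi, theta):
    n = len(xi)
    tvals = np.array([t(x_i, theta) for x_i in xi])          # n x s estimating functions
    obj = lambda w: -2.0 * np.log(n * w).sum()                # -2 sum_i log(n w_i)
    cons = [{'type': 'eq', 'fun': lambda w: w.sum() - 1.0},   # sum_i w_i = 1
            {'type': 'eq', 'fun': lambda w: tvals.T @ w}]     # sum_i w_i t(xi_i; theta) = 0
    res = minimize(obj, np.full(n, 1.0 / n), bounds=[(1e-10, 1.0)] * n,
                   constraints=cons, method='SLSQP')
    return res.fun if res.success else np.inf                 # infeasible: +infinity
```

For instance, taking t(ξ; θ) = ξ − θ (estimating a mean) and comparing the returned statistic against χ²_{1,β} recovers the classical EL confidence region for a mean.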

The quantity −2 log R(θ) is defined as ∞ if the optimization (6) is infeasible. We now turn to explaining how (5) provides confidence bounds for the unconstrained optimization (1). We make the following assumptions:

Assumption 1.

1. h(x) is differentiable in x with ∇_x h(x) = E[∇_x H(x; ξ)] for x ∈ Θ.

2. ∇_x E[H(x∗; ξ)] = 0 if and only if x∗ ∈ argmin_{x∈Θ} E[H(x; ξ)], and Σ_{i=1}^n wi ∇_x H(x̂∗; ξi) = 0 if and only if x̂∗ ∈ argmin_{x∈Θ} Σ_{i=1}^n wi H(x; ξi) for any ξi, where w = (w1, . . . , wn) is an arbitrary probability vector.

3. There exists an x∗ ∈ argmin_{x∈Θ} h(x) such that the covariance matrix of the random vector (∇_x H(x∗; ξ), H(x∗; ξ)) ∈ ℝ^{p+1} has rank p + 1 > 0.

The interchangeability of derivative and expectation in Condition 1 can generally be justified by the pathwise Lipschitz continuity condition

$$|H((x_1,\ldots,x_{j-1},u,x_{j+1},\ldots,x_p);\xi) - H((x_1,\ldots,x_{j-1},v,x_{j+1},\ldots,x_p);\xi)| \le M_j |u - v| \ \text{ a.s.}$$

for any (x1, . . . , x_{j−1}, u, x_{j+1}, . . . , xp), (x1, . . . , x_{j−1}, v, x_{j+1}, . . . , xp) ∈ Θ, j = 1, . . . , p, where the Mj are measurable with E Mj < ∞ (e.g., [1]). Condition 2 is a first-order condition that is satisfied if, for instance, H(·; ξ) is a coercive convex function for any ξ a.s. and Θ = ℝ^p. Condition 3 states that the partial derivatives of H(x∗; ξ), together with H(x∗; ξ) itself, are neither perfectly correlated nor degenerate. Note that x∗ is not necessarily unique. We have the following statistical guarantee:

Theorem 2. Let z∗ be the optimal value of (1), and let $\bar{z}$ and $\underline{z}$ be the maximum and minimum values of the programs (5), respectively. Then, under Assumption 1, we have

$$\liminf_{n\to\infty} P\left(z^* \in [\underline{z}, \bar{z}]\right) \ge 1 - \beta.$$

Proof. We denote x∗ as an element of argmin_{x∈Θ} h(x) such that the covariance matrix of the random vector (∇_x H(x∗; ξ), H(x∗; ξ)) has rank p + 1 > 0, which exists by Assumption 1 Condition 3. Also, by Assumption 1 Conditions 1 and 2, x∗ satisfies E[∇_x H(x∗; ξ)] = 0. We define the nonparametric profile likelihood ratio as

$$R(x,z) = \max\left\{ \prod_{i=1}^n n w_i : \ \sum_{i=1}^n w_i \nabla_x H(x;\xi_i) = 0, \ \sum_{i=1}^n w_i H(x;\xi_i) = z, \ \sum_{i=1}^n w_i = 1, \ w_i \ge 0 \ \text{for all } i \right\} \qquad (7)$$

Let z∗ be the true optimal value of min_{x∈Θ} h(x), which is equal to h(x∗). From Theorem 1, the nonparametric profile likelihood ratio (7) satisfies −2 log R(x∗, z∗) ⇒ χ²_{p+1} as n → ∞. This implies

$$P\left(-2\log R(x^*, z^*) \le \chi^2_{p+1,\beta}\right) \to 1 - \beta.$$

Write

$$-2\log R(x^*, z^*) = \min\left\{ -2\sum_{i=1}^n \log(n w_i) : \ \sum_{i=1}^n w_i \nabla_x H(x^*;\xi_i) = 0, \ \sum_{i=1}^n w_i H(x^*;\xi_i) = z^*, \ \sum_{i=1}^n w_i = 1, \ w_i \ge 0 \ \text{for all } i \right\} \qquad (8)$$

We focus on the event −2 log R(x∗, z∗) ≤ χ²_{p+1,β}. We argue that −2 log R(x∗, z∗) ≤ χ²_{p+1,β} implies the existence of a probability vector w such that Σ_{i=1}^n wi ∇_x H(x∗; ξi) = 0, Σ_{i=1}^n wi H(x∗; ξi) = z∗, and −2 Σ_{i=1}^n log(nwi) ≤ χ²_{p+1,β}. Notice that −2 Σ_{i=1}^n log(nwi) = ∞ if wi = 0 for any i. Hence it suffices to replace, in (8), wi ≥ 0 with wi ≥ ε for all i, for some small enough ε > 0. In this modified, compact feasible set, −2 Σ_{i=1}^n log(nwi) is bounded and hence must possess an optimal solution w, which satisfies the properties we set out to establish. This further implies that z∗ is bounded by the pair of optimizations

$$\begin{array}{rl}
\max_w / \min_w & \sum_{i=1}^n w_i H(x^*;\xi_i) \\
\text{subject to} & \sum_{i=1}^n w_i \nabla_x H(x^*;\xi_i) = 0 \\
& -2\sum_{i=1}^n \log(n w_i) \le \chi^2_{p+1,\beta} \\
& \sum_{i=1}^n w_i = 1, \quad w_i \ge 0 \ \text{for all } i
\end{array}$$

which is equivalent to

$$\begin{array}{rl}
\max_w / \min_w & \sum_{i=1}^n w_i H(x^*;\xi_i) \\
\text{subject to} & x^* \ \text{is an optimal solution for} \ \min_{x\in\Theta} \sum_{i=1}^n w_i H(x;\xi_i) \\
& -2\sum_{i=1}^n \log(n w_i) \le \chi^2_{p+1,\beta} \\
& \sum_{i=1}^n w_i = 1, \quad w_i \ge 0 \ \text{for all } i
\end{array}$$

by using Assumption 1 Condition 2, and this in turn is equivalent to

$$\begin{array}{rl}
\max_w / \min_w & \min_{x\in\Theta} \sum_{i=1}^n w_i H(x;\xi_i) \\
\text{subject to} & x^* \ \text{is an optimal solution for} \ \min_{x\in\Theta} \sum_{i=1}^n w_i H(x;\xi_i) \\
& -2\sum_{i=1}^n \log(n w_i) \le \chi^2_{p+1,\beta} \\
& \sum_{i=1}^n w_i = 1, \quad w_i \ge 0 \ \text{for all } i
\end{array} \qquad (9)$$

By relaxing the first constraint in (9), we conclude that −2 log R(x∗, z∗) ≤ χ²_{p+1,β} implies that z∗ is bounded by the optimization pair (5). Therefore, we have

$$\liminf_{n\to\infty} P(\underline{z} \le z^* \le \bar{z}) \ge \lim_{n\to\infty} P\left(-2\log R(x^*,z^*) \le \chi^2_{p+1,\beta}\right) = 1 - \beta.$$

Note that we have obtained the bounds by relaxing a constraint in (9). Thus the degree of freedom in the χ²-distribution may not be optimally chosen. Nevertheless, our numerical examples show that, at least for small p, the EL method provides reasonably tight CIs. We also mention that there exist techniques (e.g., bootstrap calibration or Bartlett correction; [10, 6]) that can further improve the coverage of the EL method in estimation problems. Investigation of such improvements for optimization is left to future work.

2.2 Stochastically Constrained Optimization

We generalize the EL method to the stochastically constrained problem (3). In this setting, we consider the following optimization pair

$$\begin{array}{rl}
\max_w / \min_w & \left\{\begin{array}{rl}
\min_x & \sum_{i=1}^n w_i H(x;\xi_i) \\
\text{subject to} & \sum_{i=1}^n w_i F_k(x;\xi_i) \le 0, \quad k = 1,\ldots,m' \\
& \sum_{i=1}^n w_i F_k(x;\xi_i) = 0, \quad k = m'+1,\ldots,m \\
& x \in \Lambda
\end{array}\right. \\
\text{subject to} & -2\sum_{i=1}^n \log(n w_i) \le \chi^2_{p+m+1,\beta} \\
& \sum_{i=1}^n w_i = 1, \quad w_i \ge 0 \ \text{for all } i = 1,\ldots,n
\end{array} \qquad (10)$$

While resembling (5), we note that the degree of freedom of the χ²-distribution is now set at p + m + 1, which additionally accounts for the number of stochastic constraints compared to (5). To describe our main results on statistical guarantees for stochastically constrained problems, we first make the following assumptions:

Assumption 2. We assume:

1. h(x) = E[H(x; ξ)], fk(x) = E[Fk(x; ξ)], k = 1, . . . , m, and gk(x), k = 1, . . . , s, are all differentiable in x ∈ Θ, and ∇_x h(x) = E[∇_x H(x; ξ)], ∇_x fk(x) = E[∇_x Fk(x; ξ)].

2. Program (3) possesses an optimal solution and has a KKT condition that is distributionally stable. This means that the optimization

$$\begin{array}{rl}
\min & \tilde h(x) \\
\text{subject to} & \tilde f_k(x) \le 0, \quad k = 1,\ldots,m' \\
& \tilde f_k(x) = 0, \quad k = m'+1,\ldots,m \\
& x \in \Lambda
\end{array} \qquad (11)$$

where h̃(x) = Ẽ[H(x; ξ)] and f̃k(x) = Ẽ[Fk(x; ξ)], with Ẽ denoting the expectation with respect to an arbitrary distribution P̃ such that

$$\sup_{x\in\Lambda} |\tilde h(x) - h(x)| < \varepsilon, \qquad \sup_{x\in\Lambda} |\tilde f_k(x) - f_k(x)| < \varepsilon \ \text{ for all } k = 1,\ldots,m$$

for small enough ε > 0, has an optimal solution that satisfies the same active constraints in the KKT condition as (3).

3. Let x∗ be an optimal solution and λ∗ = (λ∗_1, . . . , λ∗_m), ν∗ = (ν∗_1, . . . , ν∗_s) be the Lagrange multipliers for the functional constraints in (3). The covariance matrix of the variables H(x∗; ξ), ∂/∂x_j H(x∗; ξ) + Σ_{k=1}^m λ∗_k ∂/∂x_j F_k(x∗; ξ), and F_k(x∗; ξ), for all indices j and k corresponding to the active constraints, is finite and has positive rank.

4. $\frac{1}{n}\sum_{i=1}^n H(x;\xi_i) \to h(x)$ and $\frac{1}{n}\sum_{i=1}^n F_k(x;\xi_i) \to f_k(x)$, k = 1, . . . , m, uniformly over x ∈ Λ a.s.

5. $E\left[\sup_{x\in\Theta} H(x;\xi)^2\right] < \infty$ and $E\left[\sup_{x\in\Theta} F_k(x;\xi)^2\right] < \infty$ for k = 1, . . . , m.

In Conditions 2 and 3 above, the active constraints of the KKT condition satisfied by (x∗, λ∗, ν∗) are of the form

$$E\left[\frac{\partial}{\partial x_j} H(x;\xi) + \sum_{k=1}^m \lambda_k \frac{\partial}{\partial x_j} F_k(x;\xi) + \sum_{k=1}^s \nu_k \frac{\partial}{\partial x_j} g_k(x)\right] = 0, \quad j \in A_1 \subset \{1,\ldots,p\}$$
$$E[F_k(x;\xi)] = 0, \quad k \in A_2 \subset \{1,\ldots,m\}$$
$$g_k(x) = 0, \quad k \in A_3 \subset \{1,\ldots,s\}$$
$$x_j = 0, \quad j \in \{1,\ldots,p\}\setminus A_1$$
$$\lambda_k = 0, \quad k \in \{1,\ldots,m\}\setminus A_2$$
$$\nu_k = 0, \quad k \in \{1,\ldots,s\}\setminus A_3$$

Note that Conditions 1 and 3 are similar to those of Assumption 1. Conditions 2 and 4 together ensure that the active set of constraints is unchanged when the true distribution is replaced by its (weighted) empirical version. Condition 5 is a technical condition required to bound the error between the empirical distribution and its weighted version within the divergence ball. We have the following result:

Theorem 3. Under Assumption 2, we have

$$\liminf_{n\to\infty} P\left(z^* \in [\underline{z}, \bar{z}]\right) \ge 1 - \beta$$

where z∗ is the optimal value of (3), and $\underline{z}$ and $\bar{z}$ are the minimum and maximum values of (10), respectively.

Proof. Consider the nonparametric profile likelihood ratio

$$R(x,\lambda,\nu,z) = \max\left\{ \prod_{i=1}^n n w_i : \ \begin{array}{l}
\sum_{i=1}^n w_i H(x;\xi_i) = z \\
\sum_{i=1}^n w_i \left( \frac{\partial}{\partial x_j} H(x;\xi_i) + \sum_{k=1}^m \lambda_k \frac{\partial}{\partial x_j} F_k(x;\xi_i) \right) + \sum_{k=1}^s \nu_k \frac{\partial}{\partial x_j} g_k(x) = 0, \ j \in A_1 \\
\sum_{i=1}^n w_i F_k(x;\xi_i) = 0, \ k \in A_2 \\
g_k(x) = 0, \ k \in A_3 \\
x_j = 0, \ j \in \{1,\ldots,p\}\setminus A_1 \\
\lambda_k = 0, \ k \in \{1,\ldots,m\}\setminus A_2 \\
\nu_k = 0, \ k \in \{1,\ldots,s\}\setminus A_3 \\
\sum_{i=1}^n w_i = 1, \ w_i \ge 0 \ \text{for all } i
\end{array} \right\} \qquad (12)$$

Let x∗ be the optimal solution and λ∗ = (λ∗_1, . . . , λ∗_m), ν∗ = (ν∗_1, . . . , ν∗_s) be the Lagrange multipliers. Then (x∗, λ∗, ν∗) satisfies the equations

$$E\left[\frac{\partial}{\partial x_j} H(x;\xi) + \sum_{k=1}^m \lambda_k \frac{\partial}{\partial x_j} F_k(x;\xi) + \sum_{k=1}^s \nu_k \frac{\partial}{\partial x_j} g_k(x)\right] = 0, \quad j \in A_1$$

$$E[F_k(x;\xi)] = 0, \quad k \in A_2$$

By Assumption 2 Condition 3, the covariance matrix of the quantities

$$H(x^*;\xi), \qquad \frac{\partial}{\partial x_j} H(x^*;\xi) + \sum_{k=1}^m \lambda^*_k \frac{\partial}{\partial x_j} F_k(x^*;\xi) + \sum_{k=1}^s \nu^*_k \frac{\partial}{\partial x_j} g_k(x^*) \ \text{ for } j \in A_1,$$
$$F_k(x^*;\xi) \ \text{ for } k \in A_2, \qquad g_k(x^*) \ \text{ for } k \in A_3, \qquad x_j \ \text{ for } j \in \{1,\ldots,p\}\setminus A_1,$$
$$\lambda_k \ \text{ for } k \in \{1,\ldots,m\}\setminus A_2, \qquad \nu_k \ \text{ for } k \in \{1,\ldots,s\}\setminus A_3$$

has rank r for some r > 0. Let z∗ = E[H(x∗; ξ)]. By Theorem 1, −2 log R(x∗, λ∗, ν∗, z∗) ⇒ χ²_r. This implies that

$$P\left(-2\log R(x^*,\lambda^*,\nu^*,z^*) \le \chi^2_{r,\beta}\right) \to 1 - \beta.$$

Similar to the proof of Theorem 2, we can argue that −2 log R(x∗, λ∗, ν∗, z∗) ≤ χ²_{r,β} implies the existence of a w that satisfies −2 Σ_{i=1}^n log(nwi) ≤ χ²_{r,β} and all constraints in (12) evaluated at x∗, λ∗, ν∗, z∗. This in turn implies that z∗ is bounded by

$$\begin{array}{rl}
\max / \min_{w,x,\lambda,\nu} & \sum_{i=1}^n w_i H(x;\xi_i) \\
\text{subject to} & \sum_{i=1}^n w_i\left(\frac{\partial}{\partial x_j} H(x;\xi_i) + \sum_{k=1}^m \lambda_k \frac{\partial}{\partial x_j} F_k(x;\xi_i)\right) + \sum_{k=1}^s \nu_k \frac{\partial}{\partial x_j} g_k(x) = 0, \ j \in A_1 \\
& \sum_{i=1}^n w_i F_k(x;\xi_i) = 0, \ k \in A_2 \\
& g_k(x) = 0, \ k \in A_3 \\
& x_j = 0, \ j \in \{1,\ldots,p\}\setminus A_1 \\
& \lambda_k = 0, \ k \in \{1,\ldots,m\}\setminus A_2 \\
& \nu_k = 0, \ k \in \{1,\ldots,s\}\setminus A_3 \\
& -2\sum_{i=1}^n \log(n w_i) \le \chi^2_{r,\beta} \\
& \sum_{i=1}^n w_i = 1, \quad w_i \ge 0 \ \text{for all } i = 1,\ldots,n
\end{array} \qquad (13)$$

We shall argue that (13) is equivalent to

$$\begin{array}{rl}
\max_w / \min_w & \left\{\begin{array}{rl}
\min_x & \sum_{i=1}^n w_i H(x;\xi_i) \\
\text{subject to} & \sum_{i=1}^n w_i F_k(x;\xi_i) \le 0, \quad k = 1,\ldots,m' \\
& \sum_{i=1}^n w_i F_k(x;\xi_i) = 0, \quad k = m'+1,\ldots,m \\
& x \in \Lambda
\end{array}\right. \\
\text{subject to} & -2\sum_{i=1}^n \log(n w_i) \le \chi^2_{r,\beta} \\
& \sum_{i=1}^n w_i = 1, \quad w_i \ge 0 \ \text{for all } i = 1,\ldots,n
\end{array}$$

a.s. for large enough n. By Assumption 2 Condition 2, it suffices to show that the distribution represented by w on the data support, which we denote as P^w, with w such that −2 Σ_{i=1}^n log(nwi) ≤ χ²_{r,β}, satisfies

$$\sup_{x\in\Lambda} |h^w(x) - h(x)| \to 0 \ \text{a.s.}, \qquad \sup_{x\in\Lambda} |f_k^w(x) - f_k(x)| \to 0 \ \text{a.s. for all } k = 1,\ldots,m,$$

where h^w(x) = E^w[H(x; ξ)] and f_k^w(x) = E^w[Fk(x; ξ)], with E^w denoting the expectation with respect to P^w. Note that

$$\sup_{x\in\Lambda} |h^w(x) - h(x)| = \sup_{x\in\Lambda} \left| \int H(x;\xi)\, d(P^w - P)(\xi) \right| \le \sup_{x\in\Lambda} \left| \int H(x;\xi)\, d(P^w - \hat P)(\xi) \right| + \sup_{x\in\Lambda} \left| \int H(x;\xi)\, d(\hat P - P)(\xi) \right|$$
$$\le \sup_{x\in\Lambda,\, 1\le i\le n} |H(x;\xi_i)|\, d_{TV}(P^w, \hat P) + \sup_{x\in\Lambda} \left| \int H(x;\xi)\, d(\hat P - P)(\xi) \right| \qquad (14)$$

where P̂ denotes the empirical distribution and d_{TV} denotes the total variation distance. Now by Lemma 11.2 in [11] (restated in the Appendix) and Assumption 2 Condition 5, we have

$$\sup_{x\in\Lambda,\, 1\le i\le n} |H(x;\xi_i)| = \max_{1\le i\le n} \sup_{x\in\Lambda} |H(x;\xi_i)| = o(n^{1/2}) \ \text{a.s.} \qquad (15)$$

On the other hand, by Pinsker's inequality,

$$d_{TV}(P^w, \hat P) \le \sqrt{\frac{d_{KL}(P^w, \hat P)}{2}} = \sqrt{\frac{-\sum_{i=1}^n \log(n w_i)}{2n}} \le \sqrt{\frac{\chi^2_{r,\beta}}{4n}} \qquad (16)$$

where d_{KL} denotes the Kullback–Leibler divergence. Combining (15) and (16), the first term in (14) goes to 0 a.s. The second term in (14) converges to 0 a.s. by Assumption 2 Condition 4. Hence sup_{x∈Λ} |h^w(x) − h(x)| → 0 a.s. The same argument shows that sup_{x∈Λ} |f_k^w(x) − f_k(x)| → 0 a.s. for all k = 1, . . . , m. Therefore, (13) is equivalent to the optimization pair displayed above. Finally, since g_k(x), λ_k, ν_k are deterministic,

$$r \le 1 + |A_1| + |A_2| \le 1 + p + m \qquad (17)$$

where |·| denotes cardinality. This implies that χ²_{r,β} ≤ χ²_{p+m+1,β}. Therefore, $\underline{v} \ge \underline{z}$ and $\bar{v} \le \bar{z}$, where $\underline{v}$ and $\bar{v}$ denote the minimum and maximum values of the pair displayed above. We conclude that

$$\liminf_{n\to\infty} P(\underline{z} \le z^* \le \bar{z}) \ge \liminf_{n\to\infty} P(\underline{v} \le z^* \le \bar{v}) \ge \lim_{n\to\infty} P\left(-2\log R(x^*,\lambda^*,\nu^*,z^*) \le \chi^2_{r,\beta}\right) = 1 - \beta.$$

Note that, much like the proof of Theorem 2, we have placed a conservative bound on the degree of freedom of the χ2 -distribution in (17), which could potentially be improved with more refined analysis.

3 Numerical Examples

Before going into the numerical details, we discuss the tractability of the optimization pairs (5) and (10), each of which consists of a max-min and a min-min problem. For the unconstrained case (5), if H is a convex function and Θ is a convex set, then the max-min program is convex and its global solution can be found by a standard convex optimization routine. The min-min program, on the other hand, is more challenging, because the outer optimization involves minimizing the concave function min_{x∈Θ} Σ_{i=1}^n wi H(x; ξi) over w. This is not a convex problem in general. However, fixing either w or x, optimizing over the other variable becomes a convex problem. Thus one approach is alternating minimization, iteratively minimizing over w and over x while fixing the other, until no improvement is observed (a sketch is given below). Schemes of this type have appeared in chance-constrained programming (e.g., [3, 15, 7]), and the approach appears to work well in our examples despite the lack of a global convergence guarantee. All of this discussion carries over to the constrained case (10) if the constraints are all convex.

We demonstrate the presented method numerically on two examples. The first is an unconstrained problem of estimating conditional value-at-risk (CVaR), while the second is a constrained portfolio optimization problem.
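To make the alternating scheme concrete, the following is a minimal sketch (our illustration under stated assumptions, not the authors' implementation) of approximating the min-min value of (5), i.e., the lower confidence bound, with SciPy. The cost function H, the data array xi, and the starting point x0 are assumed to be supplied by the user, and H(·; ξ) is assumed smooth enough for the default solver in the x-step.

```python
# Alternating minimization sketch for the min-min program in (5):
# alternately optimize the weights w (fixed x) and the decision x (fixed w).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def el_lower_bound(H, xi, x0, beta=0.05, iters=50, tol=1e-8):
    n, p = len(xi), np.size(x0)
    radius = chi2.ppf(1 - beta, df=p + 1)            # chi^2_{p+1,beta}
    w = np.full(n, 1.0 / n)                          # start from the uniform weights
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    prev = np.inf
    for _ in range(iters):
        # Step 1: fix w, minimize sum_i w_i H(x; xi_i) over x (convex if H(.; xi) is).
        x = minimize(lambda z: np.dot(w, [H(z, d) for d in xi]), x).x
        h = np.array([H(x, d) for d in xi])
        # Step 2: fix x, minimize w . h over the Burg-entropy divergence ball (convex in w).
        cons = [{'type': 'eq',   'fun': lambda w: w.sum() - 1.0},
                {'type': 'ineq', 'fun': lambda w: radius + 2.0 * np.log(n * w).sum()}]
        res = minimize(lambda w: np.dot(w, h), w, bounds=[(1e-10, 1.0)] * n,
                       constraints=cons, method='SLSQP')
        w, val = res.x, res.fun
        if prev - val < tol:                         # stop when no further improvement
            break
        prev = val
    return val
```

The max-min (upper-bound) program can be handled analogously; since it is convex when H(·; ξ) is convex, a single call to a convex solver suffices there.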

3.1 CVaR Estimation

In this example, we consider estimating CVaR_{α,F_c}(ξ), the α-level conditional value-at-risk of a random variable ξ, which we assume follows an unknown distribution F_c. This can be rewritten as a stochastic optimization problem:

$$\min_x \left\{ x + \frac{1}{1-\alpha} E[(\xi - x)^+] \right\}, \qquad (18)$$

where (·)^+ is short for max(·, 0). We set F_c as a standard normal distribution, and α = 0.9. Assuming we are given n data points from this normal distribution, we implement the EL method to obtain the inferred 95% confidence upper and lower bounds for the optimal value of (18). Moreover, we also compare with the CIs obtained from the CLT and the delta method ([13], Theorem 5.7), given by

$$\hat H^* \pm z_{1-\beta/2} \frac{\hat\sigma(\hat x^*)}{\sqrt{n}} \qquad (19)$$

where z_{1−β/2} is the 1 − β/2 quantile of the standard normal distribution, x̂∗ is the empirical optimal solution, Ĥ∗ is the empirical optimal value given by (1/n) Σ_{i=1}^n H(x̂∗; ξi), and σ̂(x̂∗) is the empirical standard deviation of the objective values at the optimal solution, given by σ̂(x̂∗) = ((1/(n−1)) Σ_{i=1}^n (H(x̂∗; ξi) − Ĥ∗)²)^{1/2}.

We test three cases in which the data size n is 10, 50, and 100, respectively. For each case, we repeat the experiment 100 times, and record the empirical coverage probability, the mean upper and lower bounds, and the mean and standard deviation of the interval width for each method. The results are summarized in Table 1.

            Coverage     Mean lower  Mean upper  Mean interval  Std. dev. of
            probability  bound       bound       width          interval width
n = 10  EL  0.39         0.80        1.65        0.85           0.56
        CLT 0.63         0.94        2.35        1.41           1.02
n = 50  EL  0.90         1.21        2.31        1.10           0.40
        CLT 0.86         1.22        2.23        1.01           0.42
n = 100 EL  0.98         1.34        2.28        0.94           0.27
        CLT 0.89         1.36        2.07        0.71           0.21

Table 1: Comparison of EL and CLT methods on the CVaR estimation problem

Note that the true optimal value can be accurately calculated in our setting, and is given by 1.755. We see that the two methods are relatively comparable. When there are relatively few data (n = 10), the CLT method has a higher coverage probability, but the CI is also much wider on average, with a much larger standard deviation of the interval width. When the data size increases, the coverage probability of the EL method improves significantly, while the mean CI width only increases moderately. Roughly speaking, the EL method is preferable in this example.
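For completeness, here is a minimal sketch (ours, not the authors' code) of how the CLT/delta-method benchmark (19) can be computed for the CVaR problem (18): solve the SAA for x̂∗, evaluate H(x̂∗; ξi), and form the normal-approximation interval.

```python
# CLT/delta-method interval (19) for the CVaR SAA (18); xi is the data array.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def clt_interval_cvar(xi, alpha=0.9, beta=0.05):
    n = len(xi)
    saa = lambda x: x + np.mean(np.maximum(xi - x, 0.0)) / (1 - alpha)
    x_hat = minimize_scalar(saa).x                               # SAA solution x_hat*
    vals = x_hat + np.maximum(xi - x_hat, 0.0) / (1 - alpha)     # H(x_hat*; xi_i)
    H_hat, sigma_hat = vals.mean(), vals.std(ddof=1)             # empirical value and std
    half = norm.ppf(1 - beta / 2) * sigma_hat / np.sqrt(n)
    return H_hat - half, H_hat + half
```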

3.2 Portfolio Optimization

In the second example, we want to minimize the CVaR risk associated with the loss of an investment, subject to the condition that the expected return should exceed a certain threshold. Let us denote by x = [x1, . . . , xd]′ the vector of holding proportions in d assets, ξ = [ξ1, . . . , ξd]′ the random vector of asset returns, and r_b the threshold for the expected return. We assume short selling is not allowed. The problem can be written as

$$\begin{array}{rl}
\min_x & \mathrm{CVaR}_\alpha(-\xi' x) \\
\text{subject to} & E[\xi' x] \ge r_b \\
& \sum_{i=1}^d x_i = 1 \\
& x_i \ge 0, \quad i = 1,\ldots,d
\end{array} \qquad (20)$$

Note we can rewrite the problem in the form of (3) as

$$\begin{array}{rl}
\min_{x,c} & c + \frac{1}{1-\alpha} E[(-\xi' x - c)^+] \\
\text{subject to} & E[\xi' x] \ge r_b \\
& \sum_{i=1}^d x_i = 1 \\
& x_i \ge 0, \quad i = 1,\ldots,d
\end{array} \qquad (21)$$
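As an aside, the SAA counterpart of (21) can be solved as a linear program via the standard auxiliary-variable reformulation of CVaR. The sketch below (ours, with illustrative inputs returns, alpha, and r_b, not necessarily the authors' implementation) is one way to do this with SciPy.

```python
# SAA of (21) as an LP: minimize c + (1/((1-alpha) n)) sum_i u_i
# subject to u_i >= -xi_i' x - c, u_i >= 0, mean return >= r_b, sum x = 1, x >= 0.
import numpy as np
from scipy.optimize import linprog

def saa_portfolio_cvar(returns, alpha=0.9, r_b=1.0):
    n, d = returns.shape
    # Decision vector layout: [x (d), c (1), u (n)]
    cost = np.concatenate([np.zeros(d), [1.0], np.full(n, 1.0 / ((1 - alpha) * n))])
    # u_i >= -xi_i' x - c  <=>  -xi_i' x - c - u_i <= 0
    A_ub = np.hstack([-returns, -np.ones((n, 1)), -np.eye(n)])
    b_ub = np.zeros(n)
    # Expected-return constraint: (1/n) sum_i xi_i' x >= r_b
    A_ub = np.vstack([A_ub, np.concatenate([-returns.mean(axis=0), [0.0], np.zeros(n)])])
    b_ub = np.append(b_ub, -r_b)
    # Budget constraint: sum_j x_j = 1
    A_eq = np.concatenate([np.ones(d), [0.0], np.zeros(n)]).reshape(1, -1)
    b_eq = [1.0]
    # x >= 0 (no short selling), c free, u >= 0
    bounds = [(0, None)] * d + [(None, None)] + [(0, None)] * n
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:d], res.fun            # SAA portfolio and SAA optimal value
```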

The parameter setting is as follows: ξ follows a normal distribution with mean µ = [0.8, 1.2]′ and covariance Σ = [1 0; 0 4]; the minimum expected return is r_b = 1; the CVaR level is α = 0.9, and the confidence level is 1 − β = 0.95. It is easy to verify that the optimal solution to (21) is x∗ = [0.5, 0.5]′, and the associated optimal value can be evaluated by Monte Carlo simulation with a large number (10^8) of samples, which yields H∗ ≈ 0.96. To compare with the CLT method, we extend it from the unconstrained setting to the constrained problem by evaluating the objective values H(x̂∗; ξi), i = 1, . . . , n, where x̂∗ is the optimal solution of the SAA problem, and then computing the CI according to (19). Although this extension of the CLT method to constrained problems is not rigorously proved, we treat it as a heuristic way to provide a benchmark. For each case of n = 10, 50, 100, 200, we repeat the experiment 100 times, and summarize the numerical results in Table 2.

            Coverage     Mean lower  Mean upper  Mean interval  Std. dev. of
            probability  bound       bound       width          interval width
n = 10  EL  0.55         -0.10       4.11        4.21           11.51
        CLT 0.25         0.29        1.29        1.00           1.72
n = 50  EL  0.71         0.08        2.40        2.32           3.03
        CLT 0.40         0.57        1.72        1.16           0.67
n = 100 EL  0.78         0.30        2.20        1.90           1.87
        CLT 0.54         0.70        1.57        0.87           0.41
n = 200 EL  0.91         0.48        2.31        1.83           2.33
        CLT 0.60         0.86        1.52        0.66           0.23

Table 2: Comparison of EL and CLT methods on the portfolio optimization problem

We can see that the empirical coverage probability in this example is in general smaller than in the first example. The reason is that this problem has a higher dimension than the first example, and naturally requires more data to achieve a similar level of accuracy in the SAA solution. The constrained problem is also more challenging to solve than the unconstrained problem, which may lead to more numerical imprecision when applying the EL method. Nevertheless, we can see that the EL method still produces CIs with a higher coverage probability than the CLT method, and the coverage probability improves as the data size increases.

4 Conclusion

We have studied the EL method to construct statistically valid CIs to quantify the uncertainty of the optimal values of optimizations with expected value objectives and constraints. The method

builds on positing a pair of optimizations that resembles distributionally robust optimizations with Burg-entropy divergence ball constraints, with the ball size calibrated by a χ²-quantile with an appropriate degree of freedom. We have studied the theory leading to the statistical guarantees, and have demonstrated through two numerical examples that our method compares favorably to the approach directly suggested by the CLT. Built on a rigorous foundation, our method could provide a competitive approach for constructing CIs for stochastic optimization problems. In future work, we plan to further refine the accuracy of our method.

Appendix

Lemma 1 (Lemma 11.2 in [11]). Let Yi be i.i.d. random variables in ℝ with E Y_i² < ∞. Then max_{1≤i≤n} |Yi| = o(n^{1/2}) a.s.

References

[1] S. Asmussen and P. W. Glynn. Stochastic Simulation: Algorithms and Analysis, volume 57. Springer Science & Business Media, 2007.

[2] A. Ben-Tal, D. Den Hertog, A. De Waegenaere, B. Melenberg, and G. Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357, 2013.

[3] W. Chen, M. Sim, J. Sun, and C.-P. Teo. From CVaR to uncertainty set: Implications in joint chance-constrained optimization. Operations Research, 58(2):470–485, 2010.

[4] D. R. Cox and D. V. Hinkley. Theoretical Statistics. CRC Press, 1979.

[5] E. Delage and Y. Ye. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research, 58(3):595–612, 2010.

[6] T. DiCiccio, P. Hall, and J. Romano. Empirical likelihood is Bartlett-correctable. The Annals of Statistics, pages 1053–1061, 1991.

[7] R. Jiang and Y. Guan. Data-driven chance constrained stochastic program. Mathematical Programming, Series A, pages 1–37, 2012.

[8] A. J. Kleywegt, A. Shapiro, and T. Homem-de-Mello. The sample average approximation method for stochastic discrete optimization. SIAM Journal on Optimization, 12(2):479–502, 2002.

[9] H. Lam and E. Zhou. Quantifying uncertainty in sample average approximation. In Proceedings of the 2015 Winter Simulation Conference, pages 3846–3857. IEEE Press, 2015.

[10] A. B. Owen. Empirical likelihood ratio confidence intervals for a single functional. Biometrika, 75(2):237–249, 1988.

[11] A. B. Owen. Empirical Likelihood. CRC Press, 2001.

[12] L. Pardo. Statistical Inference Based on Divergence Measures. CRC Press, 2005.

[13] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory, volume 16. SIAM, 2014.

[14] W. Wang and S. Ahmed. Sample average approximation of expected value constrained stochastic programs. Operations Research Letters, 36(5):515–519, 2008.

[15] S. Zymler, D. Kuhn, and B. Rustem. Distributionally robust joint chance constraints with second-order moment information. Mathematical Programming, Series A, 137(1-2):167–198, 2013.
