arXiv:1604.02199v2 [math.OC] 16 Jul 2016
Distributionally Robust Stochastic Optimization with Wasserstein Distance Rui Gao, Anton J. Kleywegt


School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA 30332-0205 [email protected], [email protected]

Stochastic programming is a powerful approach for decision-making under uncertainty. Unfortunately, the solution may be misleading if the underlying distribution of the involved random parameters is not known exactly. In this paper, we study distributionally robust stochastic programming (DRSP), in which the decision hedges against the worst possible distribution belonging to an ambiguity set, which comprises all distributions that are close to some reference distribution in terms of the Wasserstein distance. We derive a tractable reformulation of the DRSP problem by constructing the worst-case distribution explicitly via the first-order optimality condition of the dual problem. Using the precise structure of the worst-case distribution, we show that the DRSP can be approximated by robust programs to arbitrary accuracy. We then apply our results to a variety of stochastic optimization problems, including the newsvendor problem, two-stage linear programming, worst-case value-at-risk analysis, point process control, and distributionally robust transportation problems.
Key words: distributionally robust optimization; Wasserstein ambiguity set; data-driven; Mirror-Prox algorithm
MSC2000 subject classification: Primary: 90C15; secondary: 90C46
OR/MS subject classification: Primary: programming: stochastic

1. Introduction In decision making under uncertainty, a large class of problems can be modeled as stochastic programs. In its classical form, the problem is formulated as the minimization of an expected value function

  min_{x∈X} E_µ[Ψ(x, ξ)],

where x is the decision variable with (deterministic) feasible region X ⊂ R^n, the random element ξ has distribution µ supported on a Polish space Ξ, and Ψ : X × Ξ → R. We refer to Shapiro et al. [42] for a thorough treatment of stochastic programming. One major criticism of stochastic programming is its assumption that the exact probability distribution µ is known. After all, in real-world problems µ can never be obtained exactly, and deviations from µ may result in bad decisions. This motivates the notion of distributionally robust stochastic programming (DRSP), as follows. Suppose one can construct a family M of probability distributions containing all conceivable distributions that are close to the unknown underlying distribution. Then the decision maker finds an optimal solution that hedges against the worst-case expected value among these distributions by solving the minimax stochastic program [DRSP]

  min_{x∈X} sup_{µ∈M} E_µ[Ψ(x, ξ)].   (1)

Such an approach has its roots in von Neumann's game theory and has been used in many fields such as inventory management (Scarf et al. [41], Gallego and Moon [19]), statistical decision analysis (Berger [7]), and stochastic programming (Žáčková [52], Shapiro and Kleywegt [43]). Recently, with the development of Big Data, it has regained attention in Operations Research, where it is sometimes called data-driven stochastic programming or ambiguous stochastic programming. The idea is that, in many practical applications, historical observations of ξ, with or without noise, are available to help decision makers infer the relevant distributions. A good choice of

Table 1. Examples of φ-divergence

  Divergence         φ       φ(t), t ≥ 0        I_φ(p, q)
  Kullback-Leibler   φ_kl    t log t − t + 1    Σ_i p_i log(p_i/q_i)
  Burg entropy       φ_b     −log t + t − 1     Σ_i q_i log(q_i/p_i)
  χ²-distance        φ_χ²    (t − 1)²/t         Σ_i (p_i − q_i)²/p_i
  Modified χ²        φ_mχ²   (t − 1)²           Σ_i (p_i − q_i)²/q_i
  Hellinger          φ_h     (√t − 1)²          Σ_i (√p_i − √q_i)²
  Total Variation    φ_tv    |t − 1|            Σ_i |p_i − q_i|

ambiguity set M should take into account both the practical meaning and the tractability of the resulting optimization problem. In general, there are two main ways to construct the ambiguity set M: moment-based and statistical-distance-based methods.

Moment-based ambiguity set. In this approach, the ambiguity set contains a family of distributions whose moments satisfy certain conditions, such as fixed mean and variance (Scarf et al. [41], Delage and Ye [15], Popescu [38], Zymler et al. [54]). It has been shown that in many cases the resulting optimization problem can be reformulated as a conic quadratic or semi-definite program. However, this method does not fully utilize the information in the data, and the worst-case distribution sometimes yields over-conservative decisions (Wang et al. [50], Goh and Sim [22]). For example, Wang et al. [50] show that for the newsvendor problem, Scarf's moment approach yields a two-point worst-case distribution, whose associated optimal solution does not perform well under other, more likely distributions.

Statistical-distance-based ambiguity set. The other approach considers a set of probability measures that are close, in the sense of some statistical distance, to a reference distribution ν, such as the empirical distribution of ξ or a Gaussian distribution with fixed mean and variance (Ghaoui et al. [20], Calafiore and El Ghaoui [11]). Popular choices for the statistical distance are φ-divergence (Bayraksan and Love [3], Ben-Tal et al. [5]), which includes the Kullback-Leibler divergence (Jiang and Guan [27]), Burg entropy (Wang et al. [50]) and the Total Variation distance (Sun and Xu [46]) as special cases, the Prokhorov metric (Erdoğan and Iyengar [16]), and the Wasserstein distance (Esfahani and Kuhn [17], Zhao and Guan [53]).

Due to its popularity and importance in statistical inference, we provide here a brief discussion of φ-divergence ambiguity sets. For more details, we refer the reader to Bayraksan and Love [3] for a recent survey.
Let ∆_N be the standard N-simplex ∆_N := {(p_0, …, p_N) ∈ R^{N+1} : Σ_{i=0}^N p_i = 1, p_i ≥ 0 ∀i}. To keep the discussion simple, let us temporarily focus on finitely supported discrete distributions. The φ-divergence between two discrete probability distributions p, q ∈ ∆_N is defined by

  I_φ(p, q) := Σ_{i=0}^N q_i φ(p_i/q_i),   (2)

where φ is a convex function on [0, ∞) with φ(1) = 0, 0·φ(a/0) := a lim_{t→∞} φ(t)/t for all a > 0, and 0·φ(0/0) := 0. Let φ* be its convex conjugate. Different choices of φ lead to different φ-divergences (see Table 1). Given the historical data {ξ̂_i}_{i=0}^N, the DRSP is then formulated as

  min_{x∈X} max_{p∈∆_N} { Σ_{i=0}^N p_i Ψ(x, ξ̂_i) : I_φ(p, q) ≤ θ }   (3)

for some θ > 0. It has been shown in Ben-Tal et al. [5] that for most choices of φ, the inner maximization problem in (3) can be reformulated as a linear, conic quadratic, or semidefinite program. Moreover, the φ-divergence ambiguity set is statistically meaningful and related to a statistical confidence region (Pardo [34]); hence it has appealed to many researchers. Nevertheless, we point out that φ-divergence may not be favorable in some settings. Let us consider the following example.
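As a quick concrete check of definition (2), the following Python sketch (illustrative only, not part of the paper) evaluates two entries of Table 1, including the convention 0·φ(a/0) = a·lim_{t→∞} φ(t)/t, which makes the Kullback-Leibler divergence infinite off the support of q:

```python
import numpy as np

def kl_divergence(p, q):
    """I_phi for phi(t) = t*log(t) - t + 1; equals sum_i p_i log(p_i/q_i)
    for probability vectors, and +inf if p has mass where q has none
    (since lim_{t->inf} phi(t)/t = inf)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if np.any((q == 0) & (p > 0)):
        return np.inf
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def total_variation(p, q):
    """I_phi for phi(t) = |t - 1|; equals sum_i |p_i - q_i| and stays
    finite even off the support of q (lim phi(t)/t = 1)."""
    return float(np.sum(np.abs(np.asarray(p, float) - np.asarray(q, float))))

p = np.array([0.5, 0.5, 0.0])
q = np.array([0.25, 0.25, 0.5])
print(kl_divergence(p, q))    # log 2 ≈ 0.693
print(kl_divergence(q, p))    # inf: q is not absolutely continuous w.r.t. p
print(total_variation(q, p))  # 1.0
```

The second call illustrates the absolute-continuity issue discussed later: for φ_kl the divergence blows up as soon as the first argument puts mass outside the support of the second, whereas φ_tv remains finite.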

[Figure 1 appears here: three panels, each plotting frequency against 8-bit gray-scale level (0–255), showing (a) the observed image with histogram q, (b) the true image with histogram ptrue, and (c) the pathological image with histogram ppathol.]

Figure 1. Three images and their respective gray-scale histograms. With Kullback-Leibler divergence, we have Iφkl (q, ptrue ) = 5.05 > Iφkl (q, ppathol ) = 2.33, while in contrast, with Wasserstein distance W1 (q, ptrue ) = 30.70 < W1 (q, ppathol ) = 84.03.
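The mechanism behind the numbers in Figure 1 can be illustrated with synthetic histograms (a sketch with made-up data, not the paper's images): a bin-by-bin divergence such as Kullback-Leibler cannot distinguish a one-level shift from a hundred-level shift, while W1, which for 1-D histograms equals the area between the cumulative distribution functions, can.

```python
import numpy as np

def kl(p, q, eps=1e-9):
    # bin-by-bin Kullback-Leibler divergence (smoothed to keep it finite)
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def w1(p, q):
    # order-1 Wasserstein distance on the grid {0,...,255}:
    # the area between the two cumulative distribution functions
    return float(np.sum(np.abs(np.cumsum(p) - np.cumsum(q))))

n = 256
q  = np.zeros(n); q[100]  = 1.0   # all mass at gray level 100
p1 = np.zeros(n); p1[101] = 1.0   # mass one level away
p2 = np.zeros(n); p2[200] = 1.0   # mass a hundred levels away

# KL treats the two alternatives identically: both are bin-wise disjoint
print(abs(kl(q, p1) - kl(q, p2)) < 1e-6)   # True
# W1 sees the metric structure of the gray-scale axis
print(w1(q, p1), w1(q, p2))                # 1.0 100.0
```

This is precisely the "bin-by-bin versus cross-bin" distinction drawn in the example that follows.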

Example 1. Suppose there is an underlying true image (Figure 1b). The decision maker is not aware of it and has only a vintage-style observation (Figure 1a). In fact, Figure 1a is obtained from Figure 1b by a low-contrast intensity transformation (cf. Gonzalez and Woods [23]), in which black pixels become brighter and white pixels become darker. In terms of the histogram, the observation q is obtained by shifting the true histogram ptrue inwards. On the other hand, the pathological image (Figure 1c) with histogram ppathol is too dark to show any detail. Suppose the decision maker constructs an ambiguity set M by considering all distributions whose Kullback-Leibler divergence to the observation q is less than or equal to θ. Since Iφkl(q, ptrue) > Iφkl(q, ppathol), when the radius θ is chosen such that the pathological image (Figure 1c) is excluded, the true image (Figure 1b) is also excluded, so the robustification of decisions cannot be achieved. Conversely, when the radius θ is large enough to include the true image (Figure 1b), it has to include the pathological image (Figure 1c) as well, and then the decision may be over-conservative due to hedging against irrelevant distributions.

The reason for this inconsistency between the perceptual (dis)similarity of images and the Kullback-Leibler divergence of their histograms is that φ-divergence ignores the metric structure of the histogram's underlying space Ξ. More specifically, in the previous example, Ξ = {0, 1, …, 255} represents the 8-bit gray-scale levels. The smaller the absolute difference between two points ξ, ξ′ ∈ Ξ, the perceptually closer in color they are. In the definition (2) of the φ-divergence Iφ(p, q), however, only the relative ratio pi/qi at the same gray-scale level i is compared; the distance between different gray-scale levels is not reflected. This phenomenon has been observed in computer science, especially in image retrieval (Rubner et al. [40], Ling and Okada [31]), in which histogram comparison is used to extract image features such as gray-scale, heat-map and texture. It has been pointed out in Rubner et al. [40] that many φ-divergences focus on bin-by-bin comparison of the histograms, fail to capture (dis)similarity across bins, and are sensitive to the bin size.

The second drawback of the φ-divergence ambiguity set is the well-known fact that when limt→∞ φ(t)/t = ∞, as for φkl and φmχ², the φ-divergence ambiguity set fails to include sufficiently


many relevant distributions. In fact, since 0·φ(pi/0) = pi limt→∞ φ(t)/t = ∞ for all pi > 0, the ambiguity set defined by such a φ-divergence does not include any distribution that is not absolutely continuous with respect to the reference distribution q. For example, suppose Ξ = {0, 1}^s and the reference distribution q is supported on {ξ̂i}_{i=1}^N ⊂ Ξ. Then the ambiguity set defined by the Kullback-Leibler divergence contains only distributions supported on the given N data points {ξ̂i}_{i=1}^N. Unless N is large relative to 2^s, the ambiguity set is far from rich enough to robustify the decision.

For φ-divergences with limt→∞ φ(t)/t < ∞, such as φb, φχ², φh and φtv, there is another potential problem, which relates to the behavior of the worst-case distribution. Define I0 := {1 ≤ i ≤ N : qi > 0}. Let us assume X is a singleton, i.e., Ψ(x, ξ̂i) = Ψ(ξ̂i), so that we can focus on the inner maximization of problem (3), and without loss of generality assume Ψ(ξ̂1) < · · · < Ψ(ξ̂N). Then, according to Ben-Tal et al. [5] and Bayraksan and Love [3], the worst-case distribution satisfies

  p*i/qi ∈ ∂φ*((Ψ(ξ̂i) − µ*)/λ*), ∀i ∈ I0,   (4a)
  p*j = 0, ∀j ∉ I0 ∪ {N},   (4b)
  p*N = 1 − Σ_{i∈I0} p*i, if µ* = Ψ(ξ̂N) − λ* limt→∞ φ(t)/t;  p*N = 0, if µ* > Ψ(ξ̂N) − λ* limt→∞ φ(t)/t,   (4c)

for some λ* ≥ 0 and µ* ≥ Ψ(ξ̂N) − λ* limt→∞ φ(t)/t. When limt→∞ φ(t)/t < ∞, although the ambiguity set is allowed to contain distributions that are not absolutely continuous with respect to the reference, (4b) shows that the support of the worst-case distribution and that of the reference distribution can differ by at most one point, ξ̂N. If p*N > 0, (4c) shows that probability mass is moved away from all scenarios in I0 to the worst scenario ξ̂N. This is called "popping" behavior in Bayraksan and Love [3]. Note that in many applications where the support of ξ is unknown, the choice of the underlying space Ξ may be arbitrary given the historical observations, so the worst-case behavior depends heavily on the specification of Ξ and the shape of the function Ψ. We refer to Section 4.1 for a numerical illustration.

In the above discussion we assumed that both the underlying distribution and the reference are discrete. Now suppose, for example, that the underlying distribution is continuous. Given the N observed data points, to avoid some of the drawbacks of φ-divergence discussed above, one may be willing to turn the discrete reference distribution into a continuous one via parametric or nonparametric density estimation (see Ben-Tal et al. [5], Jiang and Guan [27] for more details). However, such manipulation introduces unnecessary additional ambiguity, and the phenomenon in the previous examples is not fully eliminated.

In this paper, we advocate the use of the Wasserstein ambiguity set, and our main focus is to provide a tractable reformulation of the problem

  min_{x∈X} sup_{µ∈M} Eµ[Ψ(x, ξ)],   (5)

with

  M := {µ ∈ P : Wp(µ, ν) ≤ θ},   (6)

where P is the set of Borel probability distributions on Ξ, ν is the reference measure, θ ∈ [0, ∞) is the radius of the Wasserstein ball, representing in some sense the confidence level or the degree of risk aversion, and Wp is the Wasserstein distance of order p ∈ [1, ∞), defined by

  W_p^p(µ, ν) = min_γ { Eγ[d^p(ξ, ζ)] : γ has marginal distributions µ and ν },

where d is the metric on Ξ. A more formal and general definition of the Wasserstein distance for probability measures on a Polish space is given in Section 2. For now, let us roughly view the joint distribution


γ as a transportation plan that splits and moves mass from µ to ν, so the Wasserstein distance is the minimal expected total distance over all transportation plans. (This is the reason why the Wasserstein distance is also called the Earth Mover's distance in the computer science literature.) We note that, by definition, the Wasserstein distance explicitly exploits the metric structure d on Ξ, as opposed to the φ-divergence discussed in Example 1.

Example 2 (Revisiting Example 1). Let us compute the Wasserstein distance between the histograms in Example 1. Roughly speaking, the cheapest way of transporting mass from q to ptrue is to transport the mass near the boundary inwards. Then a large proportion of the total mass is transported over a small distance, so the total expected traveling distance is relatively small. In contrast, to compute W1(q, ppathol), one has to transport all the mass on the right to the left. Such a long traveling distance results in a large expected total distance, whence W1(q, ppathol) > W1(q, ptrue). Therefore, the Wasserstein distance is consistent with the perceptual intuition.

Thanks to results on concentration inequalities for the Wasserstein distance (cf. Bolley et al. [8], Fournier and Guillin [18]), it has been pointed out in Esfahani and Kuhn [17] that using the Wasserstein ambiguity set provides good out-of-sample performance for the resulting DRSP. Moreover, as suggested in Carlsson et al. [12], the Wasserstein distance is a natural choice for certain transportation problems, as it inherits the Euclidean structure if the Euclidean metric d is used to define the ambiguity set.

The Wasserstein distance is actually not new to Operations Researchers. Indeed, in 1942, alongside the newborn theory of linear programming (Kantorovich [29]), Leonid Kantorovich [28] tackled Monge's problem, originally posed in 1781, in the field of optimal transport; the problem can be viewed as a generalized transportation problem between two probability distributions. This seminal work also gives rise to the name Monge-Kantorovich distance, which is now known as the W1-Wasserstein distance. In the field of stochastic programming, the Wasserstein distance has been used for multistage stochastic programming (see Pflug and Pichler [36] and the references therein), and a Wasserstein ambiguity set is considered in Wozabal [51]. Only recently did Esfahani and Kuhn [17] and Zhao and Guan [53] show that, under certain convexity and/or compactness assumptions, the DRSP with Wasserstein distance is tractable; there, problem (5) is reformulated as a finite convex program by deriving the dual of the inner supremum problem

  sup_{µ∈M} Eµ[Ψ(x, ξ)].   (7)

In this paper, we prove the duality using a constructive approach in a general setting, thereby obtaining the precise structure of the worst-case distribution. Its power is demonstrated by applications to various problems in stochastic optimization.

1.1. Main Contributions
• General Setting. The two works most closely related to ours are Esfahani and Kuhn [17] and Zhao and Guan [53]. Both papers consider only an ambiguity set defined by a W1-Wasserstein ball on a finite-dimensional normed vector space, centered at a reference distribution with finite support. Moreover, additional technical assumptions are used in their work: Esfahani and Kuhn [17] make certain convexity assumptions to derive the worst-case distribution, and Zhao and Guan [53] assume the random parameter is compactly supported and has a density. In contrast, in our setting the reference measure ν can be any probability measure on a Polish space, the Wasserstein distance of any order p ∈ [1, ∞) is considered, and the only assumption on Ψ(x, ξ) is upper semi-continuity with respect to ξ.
• Elementary Proof. Our proof approach is elementary in the sense that we do not resort to tools from infinite-dimensional convex programming, as is done in Esfahani and Kuhn [17], Zhao and Guan [53]. The proof is constructive, so we obtain a detailed representation of the worst-case distribution. Moreover, the proof method can be applied to other distributionally robust


problems, such as a large class of distributionally robust transportation problems considered in Carlsson et al. [12] (Section 4.6). The basic idea of the proof is to derive the weak dual and then use the first-order optimality condition of the dual program to construct a primal feasible solution that turns out to achieve the same objective value as the associated dual feasible solution. Except for some technicalities in measure theory (which can be circumvented if one considers only distributions with finite support, as in Esfahani and Kuhn [17], Zhao and Guan [53]), such an approach is more straightforward than the methods therein.
• Insightful Structure of the Worst-case Distribution. Thanks to its concise form, the worst-case distribution has an intuitive meaning and establishes a connection between classical robust programming and the DRSP. More specifically, we show that in the case of a finitely supported reference distribution, the distributionally robust stochastic program can be approximated by robust programs to any accuracy, and the approximation becomes exact when the objective function Ψ(x, ξ) is concave in the uncertainty ξ.
• Broad Applicability. To illustrate the power of the duality results, we apply them to the newsvendor problem, two-stage linear programming, worst-case value-at-risk analysis, and point process control, and we develop a version of the Mirror-Prox algorithm for saddle-point problems to solve the distributionally robust stochastic program, provided that the objective is concave in the uncertainty.

The rest of this paper is organized as follows. In Section 2, we review some main results on the Wasserstein distance from the theory of optimal transport. Next, we prove the duality results for a general reference distribution and for distributions with finite support in Sections 3.1 and 3.2, respectively.
Then, in Section 4, we apply the strong duality results and the structural description of the worst-case distribution to a variety of stochastic optimization problems. We conclude the paper in Section 5 and provide some lemmas and proofs in the technical appendix.

2. Preliminaries on Wasserstein Distance In this section, we introduce notation and briefly outline some results on the Wasserstein distance from the theory of optimal transport. For a more detailed discussion we refer the reader to Villani [47, 48]. Let Ξ be a Polish space equipped with metric d. We denote by P(Ξ) the set of Borel probability distributions (measures) on Ξ, and by Pp(Ξ) the set of probability distributions on Ξ with finite p-th moment:

  Pp(Ξ) := { µ ∈ P(Ξ) : ∫_Ξ d^p(ξ, ζ0) µ(dξ) < ∞ for some ζ0 ∈ Ξ }.

Note that by the triangle inequality the above definition does not depend on the particular choice of ζ0. Throughout this paper, we assume that p ∈ [1, ∞). Given a Borel probability distribution ν ∈ P and a Borel map T : Ξ → Ξ, we say µ ∈ P is the image distribution of ν by T, or that T transports ν onto µ, denoted by µ = T#ν, if µ(A) = ν({ζ ∈ Ξ : T(ζ) ∈ A}) for every Borel set A ⊂ Ξ. We define

  Γ(µ, ν) := { γ ∈ P(Ξ × Ξ) : π¹#γ = µ, π²#γ = ν },

the set of Borel probability distributions on Ξ × Ξ that have marginal distributions µ and ν, where π^i : Ξ × Ξ → Ξ, i = 1, 2, are the canonical projections. We call γ ∈ Γ(µ, ν) a transportation plan from µ to ν. We denote by Cb(Ξ) the space of continuous and bounded real functions on Ξ and by L1(ν) the L1 space of ν-measurable functions.

Definition 1 (Wasserstein distance). Let p ∈ [1, ∞). The Wasserstein distance of order p between µ, ν ∈ Pp(Ξ) is defined by

  W_p^p(µ, ν) := min_γ { ∫_{Ξ×Ξ} d^p(ξ, ζ) γ(dξ, dζ) : γ ∈ Γ(µ, ν) }.   (8)

Example 3 (Transportation problem). Consider the discrete case, where ν = Σ_{i=1}^N qi δ_{ξ̂i} and µ = Σ_{j=1}^M pj δ_{ξj}, with qi, pj ≥ 0 for all i, j, and δξ denotes the Dirac probability distribution concentrated at ξ ∈ Ξ. Let cij := d^p(ξ̂i, ξj) for i = 1, …, N, j = 1, …, M. Assuming Σ_i qi = Σ_j pj = 1, problem (8) becomes the classical transportation problem of linear programming:

  min_{γij ≥ 0} { Σ_{i,j} cij γij : Σ_j γij = qi ∀i, Σ_i γij = pj ∀j }.

In particular, when M = N and all weights equal 1/N, it reduces to the classical assignment problem. By Birkhoff's theorem, there exists an N-permutation, denoted TN, such that γ⁰ij := (1/N)·1{j = TN(i)} is an optimal transportation plan. In this case, the µ-marginal of γ⁰ is µ⁰ := TN#ν; that is, TN transports ν onto µ⁰.

Example 3 suggests the names "transport map" for T and "transportation plan" for γ ∈ Γ(µ, ν). By Definition 1, the Wasserstein distance between µ and ν is the minimal transportation cost (in terms of d^p) among all transportation plans in Γ(µ, ν). It can be shown that Wp defined above is indeed a metric (cf. Theorem 7.3 in Villani [47]). From Hölder's inequality, it is easily checked that p1 ≥ p2 ≥ 1 implies Wp1 ≥ Wp2 (cf. Section 7.1.2 in Villani [47]). By Kantorovich duality (Theorem 1.3 in Villani [47]),

  W_p^p(µ, ν) = max_{u∈L1(µ), v∈L1(ν)} { ∫_Ξ u(ξ)µ(dξ) + ∫_Ξ v(ζ)ν(dζ) : u(ξ) + v(ζ) ≤ d^p(ξ, ζ), ∀ξ, ζ ∈ Ξ }.   (9)
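For finitely supported distributions, definition (8) is exactly the transportation LP of Example 3 and can be solved with an off-the-shelf LP solver. A minimal sketch (assuming SciPy is available; the instance below is made up for illustration):

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein(xs, q, ys, pw, p=1):
    """Order-p Wasserstein distance between nu = sum_i q_i delta_{xs_i}
    and mu = sum_j pw_j delta_{ys_j} on the real line, via the
    transportation LP of Example 3."""
    N, M = len(xs), len(ys)
    # cost c_ij = d^p(xs_i, ys_j); gamma is raveled row-major as gamma[i, j]
    c = (np.abs(np.subtract.outer(xs, ys)) ** p).ravel()
    A_eq = np.zeros((N + M, N * M))
    for i in range(N):            # sum_j gamma_ij = q_i
        A_eq[i, i * M:(i + 1) * M] = 1.0
    for j in range(M):            # sum_i gamma_ij = pw_j
        A_eq[N + j, j::M] = 1.0
    b_eq = np.concatenate([q, pw])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return max(res.fun, 0.0) ** (1.0 / p)

# shifting a uniform three-point distribution one unit to the right
xs = np.array([0.0, 1.0, 2.0]); q  = np.full(3, 1/3)
ys = np.array([1.0, 2.0, 3.0]); pw = np.full(3, 1/3)
print(wasserstein(xs, q, ys, pw, p=1))   # ≈ 1.0: every atom moves distance 1
```

The optimal plan found by the LP moves each atom to its shifted copy, in line with the "minimal transportation cost" interpretation above.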

Note that the set of functions under the maximum above can be replaced by u, v ∈ Cb(Ξ). In particular, when p = 1, by the Kantorovich-Rubinstein theorem (cf. Theorem 1.14 in Villani [47]), (9) simplifies to

  W1(µ, ν) = max_{u∈Cb(Ξ)} { ∫_Ξ u(ξ) d(µ − ν)(ξ) : u is 1-Lipschitz }.

So for an L-Lipschitz function Ψ, Eµ′[Ψ(ξ)] − Eν[Ψ(ξ)] ≤ Lθ for all µ′ ∈ M = {µ ∈ Pp(Ξ) : Wp(µ, ν) ≤ θ}. The following lemma generalizes this statement.

Lemma 1. Let Ψ : Ξ → R satisfy

  |Ψ(ξ) − Ψ(ζ)| ≤ κ(M + d^p(ξ, ζ))   (10)

for some κ > 0, M ≥ 0 and all ξ, ζ ∈ Ξ. Then

  Eµ[Ψ(ξ)] − Eν[Ψ(ξ)] ≤ κ(M + θ^p), ∀µ ∈ M.

We close this section by pointing out an important feature of the Wasserstein distance: Wp metrizes weak convergence in Pp(Ξ) (cf. Theorem 6.9 in Villani [48]). In other words, if {µk}_{k=1}^∞ is a sequence of measures in Pp(Ξ) and µ ∈ Pp(Ξ), then limk→∞ Wp(µk, µ) = 0 if and only if µk converges weakly in Pp(Ξ) to µ. Therefore, convergence in the Wasserstein distance of order p implies convergence of moments up to order p. We refer the reader to Chapter 6 of Villani [48] for a discussion of the advantages of the Wasserstein distance over other distances that metrize weak convergence, such as the Prokhorov metric.
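This metrization can be observed numerically (an illustrative sketch using SciPy's 1-D W1 routine, not from the paper): the W1 distance between the empirical distribution of n i.i.d. samples and a large sample standing in for their common law decays towards 0 as n grows.

```python
import numpy as np
from scipy.stats import wasserstein_distance  # 1-D W1 between samples

rng = np.random.default_rng(0)
ref = rng.normal(size=200_000)   # proxy for the limit law N(0, 1)
dists = [wasserstein_distance(rng.normal(size=n), ref)
         for n in (10, 100, 1_000, 10_000)]
print(dists)   # shrinking towards 0 as n grows
```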


3. Tractable Reformulation via Duality The DRSP problem (5) involves a supremum of expectations over infinitely many distributions, which makes it difficult to solve. In this section we develop a tractable reformulation of the inner problem (7) by deriving its strong dual, which turns out to be a finite-dimensional program. To emphasize that x is fixed when only the inner supremum is considered, we write Ψx(ξ) := Ψ(x, ξ), and thus (7) is rewritten as

  [Primal] vP := sup_{µ∈P} { ∫_Ξ Ψx(ξ) µ(dξ) : Wp(µ, ν) ≤ θ },   (11)

where p ∈ [1, ∞), θ ≥ 0, µ, ν ∈ P := Pp(Ξ) and Ψx ∈ L1(ν). In Proposition 2, we will derive the (weak) dual program

  [Dual] vD := inf_{λ≥0} { λθ^p − ∫_Ξ inf_{ξ∈Ξ} [λ d^p(ξ, ζ) − Ψx(ξ)] ν(dζ) }.   (12)
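When ν has finite support and Ξ is discretized, (12) is a one-dimensional convex minimization that can be evaluated directly. A minimal sketch (the instance Ψ, Ξ, ν below is made up for illustration):

```python
import numpy as np
from scipy.optimize import minimize_scalar

Xi = np.linspace(0.0, 4.0, 81)            # discretized support for mu
zetas = np.array([1.0, 2.0, 3.0])         # supp(nu), uniform weights
Psi = -(Xi - 2.5) ** 2 + Xi               # values of Psi_x on the grid
p, theta = 1, 0.5

def h(lam):
    # dual objective: lam*theta^p - E_nu[ inf_xi (lam*d^p(xi,zeta) - Psi_x(xi)) ]
    inner = np.min(lam * np.abs(Xi[:, None] - zetas[None, :]) ** p
                   - Psi[:, None], axis=0)
    return lam * theta ** p - inner.mean()

res = minimize_scalar(h, bounds=(0.0, 50.0), method="bounded")
v_D = res.fun    # a finite value, equal to v_P by Theorem 1 below
print(v_D)
```

Weak duality can be sanity-checked on this instance: µ = ν is primal feasible, so v_D must be at least E_ν[Ψx], and λ = 0 is dual feasible, so v_D is at most h(0) = sup_ξ Ψx(ξ).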

Our main goal is to show that strong duality holds, i.e., vP = vD, and to identify conditions for the existence of the worst-case distribution. We shall see that the existence of the worst-case distribution is related to the growth rate of Ψx(ξ) as ξ approaches infinity. More specifically, for some fixed ζ0 ∈ Ξ, we define the growth rate κ by

  [Growth Rate] κ := limsup_{d(ξ,ζ0)→∞} (Ψx(ξ) − Ψx(ζ0)) / (d^p(ξ, ζ0) + 1).   (13)

We note that the value of κ does not depend on the particular choice of ζ0, as proved in Lemma 5 in the Appendix.

3.1. General Reference Distribution In this subsection, we allow ν to be any distribution. While much of the existing literature on DRSP focuses on the "data-driven" approach, in which the reference distribution is the empirical distribution obtained from data, our result is not confined to this setting. Such generality broadens the applicability of the DRSP. For example, the result is useful when the reference distribution is a parametric distribution such as a Gaussian (Section 4.4), or a stochastic process (Section 4.5). To begin with, let us assume θ > 0. We first consider the case κ = ∞.

Proposition 1. Suppose κ = ∞ and θ > 0. Then vP = vD = ∞.

Remark 1 (Choosing the Wasserstein order p). Let ζ0 ∈ Ξ. Define

  p₀ := inf { p ≥ 1 : lim_{d(ζ′,ζ0)→∞} (Ψx(ζ′) − Ψx(ζ0)) / d^p(ζ′, ζ0) < ∞ }.

Proposition 1 suggests that the Wasserstein order p should be at least p₀. Note that in the definition (15) of Φ(λ, ζ), besides λ, the order p also controls the extent of the perturbation: when p is much greater than p₀, only small perturbations are allowed. Both Esfahani and Kuhn [17] and Zhao and Guan [53] consider only p = 1. By allowing higher orders p in our analysis, we have more flexibility to choose the ambiguity set and to control the degree of conservativeness based on information about the function Ψ(x, ξ). Moreover, as will be seen in Section 4, some choices of p may have computational benefits.

Now let us consider κ ∈ [0, ∞). The main idea of the proof is rather straightforward. We first derive weak duality from the Lagrangian, which is a relatively simple routine; the resulting dual problem is a one-dimensional convex minimization problem, since there is only one constraint in the primal problem (11). Then, by exploiting first-order optimality for the dual problem (12), we construct a primal feasible solution that yields the same value as the associated dual solution, and thus strong duality follows. Let us first derive the weak duality.


Proposition 2 (Weak duality). Suppose κ < ∞. Then vP ≤ vD.

Proof. Writing the Lagrangian and applying the minimax inequality yields

  vP = sup_{µ∈P} inf_{λ≥0} { ∫_Ξ Ψx(ξ)µ(dξ) + λ(θ^p − W_p^p(µ, ν)) }
     ≤ inf_{λ≥0} { λθ^p + sup_{µ∈P} [ ∫_Ξ Ψx(ξ)µ(dξ) − λW_p^p(µ, ν) ] }.   (14)

To provide an upper bound on sup_{µ∈P} { ∫_Ξ Ψx(ξ)µ(dξ) − λW_p^p(µ, ν) }, we use the equivalent definition (9) of Wp to obtain

  sup_{µ∈P} { ∫_Ξ Ψx(ξ)µ(dξ) − λW_p^p(µ, ν) }
  = sup_{µ∈P} { ∫_Ξ Ψx(ξ)µ(dξ) − λ sup_{u∈L1(µ), v∈L1(ν)} [ ∫_Ξ u(ξ)µ(dξ) + ∫_Ξ v(ζ)ν(dζ) : v(ζ) ≤ inf_{ξ∈Ξ} d^p(ξ, ζ) − u(ξ), ∀ζ ∈ Ξ ] }.

Set uλ := Ψx/λ for λ > 0; then uλ ∈ L1(µ), by Ψx ∈ L1(ν) and Lemma 1. Plugging uλ into the inner supremum over u, we obtain that for λ > 0,

  sup_{µ∈P} { ∫_Ξ Ψx(ξ)µ(dξ) − λW_p^p(µ, ν) } ≤ − ∫_Ξ inf_{ξ∈Ξ} [λd^p(ξ, ζ) − Ψx(ξ)] ν(dζ) = − ∫_Ξ Φ(λ, ζ) ν(dζ).

The above inequality also holds for λ = 0, so combining it with (14) we finally have

  vP ≤ inf_{λ≥0} { λθ^p − ∫_Ξ Φ(λ, ζ) ν(dζ) } = vD.  □
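The chain of inequalities above can be checked numerically on a toy instance with finite candidate support (an illustrative sketch with a made-up instance, not the paper's): the primal (11), restricted to distributions on a finite grid, is a linear program, and its value coincides with that of the one-dimensional dual (12); for a finite grid the equality is just LP duality.

```python
import numpy as np
from scipy.optimize import linprog, minimize_scalar

Xi = np.linspace(0.0, 4.0, 41)                     # candidate support for mu
zetas = np.array([1.0, 2.0, 3.0]); nu = np.full(3, 1/3)
Psi = np.sin(3 * Xi) + Xi                          # values of Psi_x on the grid
cost = np.abs(zetas[:, None] - Xi[None, :])        # d^p(zeta_i, xi_j), p = 1
theta = 0.4
N, M = len(zetas), len(Xi)

# primal: maximize sum_ij gamma_ij * Psi_j over transport plans gamma with
# nu-marginal fixed and transport budget sum_ij gamma_ij d^p <= theta^p
A_eq = np.zeros((N, N * M))
for i in range(N):
    A_eq[i, i * M:(i + 1) * M] = 1.0
res_p = linprog(-np.tile(Psi, N), A_eq=A_eq, b_eq=nu,
                A_ub=cost.ravel()[None, :], b_ub=[theta], bounds=(0, None))
v_P = -res_p.fun

# dual: the one-dimensional convex problem (12)
h = lambda lam: lam * theta - np.min(lam * cost - Psi[None, :], axis=1) @ nu
v_D = minimize_scalar(h, bounds=(0.0, 100.0), method="bounded").fun

print(v_P, v_D)   # nearly identical values
```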

As mentioned before, the inner infimum in the dual problem (12) can be viewed as a perturbation of each ζ ∈ supp ν, which plays an important role in the analysis of the worst-case distribution. Let us define

  Φ(λ, ζ) := inf_{ξ∈Ξ} { λd^p(ξ, ζ) − Ψx(ξ) }   (15)

for every λ ≥ 0 and ζ ∈ Ξ. Observe that when p = 2 and λ > 0, Φ(λ, ζ) is the classical Moreau-Yosida regularization (cf. Parikh and Boyd [35]) of −Ψx with parameter 1/λ at ζ. The parameter λ controls the extent to which the point ζ is perturbed towards the supremum of Ψx over Ξ: a smaller value of λ lets ζ move towards the supremum of Ψx, while a larger value of λ keeps the perturbation within a small neighborhood of ζ. We collect some properties of Φ(λ, ζ) needed for the proof of strong duality.

Lemma 2. Assume Ψx is upper semi-continuous and κ < ∞. Then for any ζ ∈ Ξ and λ > κ, the following hold:
(i) The right-hand side of (15) admits a minimizer.
(ii) Φ(λ, ζ) is non-decreasing and concave in λ, and continuous in ζ.
(iii) For any ζ ∈ Ξ, define

  T̂−(λ, ζ) := argmin_ξ { d(ξ, ζ) : ξ ∈ argmin_{ξ∈Ξ} [λd^p(ξ, ζ) − Ψx(ξ)] },
  T̂+(λ, ζ) := argmax_ξ { d(ξ, ζ) : ξ ∈ argmin_{ξ∈Ξ} [λd^p(ξ, ζ) − Ψx(ξ)] }.   (16)


Then there exists a Borel measurable selection T−(λ, ζ) (resp. T+(λ, ζ)) of the multi-valued map T̂−(λ, ζ) (resp. T̂+(λ, ζ)) such that d(T−(λ, ζ), ζ) and Ψx(T−(λ, ζ)) (resp. d(T+(λ, ζ), ζ) and Ψx(T+(λ, ζ))) are left (resp. right) continuous with respect to λ.
(iv) Φ(λ, ζ)/λ has both left and right partial derivatives in λ:

  ∂λ− [Φ(λ, ζ)/λ] = Ψx(T−(λ, ζ))/λ²,  ∂λ+ [Φ(λ, ζ)/λ] = Ψx(T+(λ, ζ))/λ².

(v) Suppose Φ(κ, ζ) is finite. Then limλ→κ+ Φ(λ, ζ) = Φ(κ, ζ); in particular, when κ = 0, it holds that limλ→0+ Ψx(T−(λ, ζ)) = sup_{ξ∈Ξ} Ψx(ξ).
(vi) There exists Dλ > 0 such that d^p(T+(λ, ζ), ζ) < Dλ for all ζ.

Statements (i), (ii) and (iv) are similar to properties of the Moreau-Yosida regularization. The maps T±(λ, ζ) defined in (iii) represent the nearest and the furthest minimizers associated with Φ(λ, ζ), and will help us construct the worst-case distribution. Their Borel measurability is needed only when deriving strong duality for a general reference distribution, and can be avoided if we focus on finitely supported reference distributions. Property (v) will be used in the threshold case in which the dual minimizer equals κ. Finally, (vi) shows that the furthest distance between ζ and the minimizers of inf_{ξ∈Ξ} λd^p(ξ, ζ) − Ψx(ξ) is uniformly bounded with respect to ζ. We are now ready to show that strong duality holds.

Theorem 1 (Strong duality and worst-case distribution). Let θ > 0. Assume Ψx is upper semi-continuous and κ ∈ [0, ∞). Then the following hold:
(i) vP = vD < ∞.
(ii) The dual problem (12) always admits a minimizer λ*. If there exists a dual minimizer λ* > κ, then the distribution µ* ∈ P defined by

  µ*(A) = ∫_Ξ [ p(ζ) 1_A(T+(λ*, ζ)) + (1 − p(ζ)) 1_A(T−(λ*, ζ)) ] ν(dζ), ∀ Borel set A ⊂ Ξ,   (17)

is primal optimal, where T+(λ*, ζ), T−(λ*, ζ) are defined in Lemma 2 and

  p(ζ) = ( λ*(θ^p − ∫_Ξ (Φ(λ*, ζ′)/λ*) ν(dζ′)) − Ψx(T−(λ*, ζ)) ) / ( Ψx(T+(λ*, ζ)) − Ψx(T−(λ*, ζ)) ) · 1{Ψx(T+(λ*, ζ)) ≠ Ψx(T−(λ*, ζ))}.

(iii) If there exists a primal optimal solution µ* satisfying µ* = T*#ν for some Borel map T* : supp ν → Ξ, that is, µ*(A) = ν(T*⁻¹(A)) for all Borel sets A, then we have the equivalence

  vP = vD = sup_{µ∈M0} Eµ[Ψx(ξ)],   (18)

where

  M0 := { µ = T#ν : T : supp ν → Ξ, ∫_Ξ d^p(ξ, T(ξ)) ν(dξ) ≤ θ^p }.   (19)

The existence of T ∗ can be ensured, for example, when • For ζ ∈ supp ν, the problem inf ξ λ∗ dp (ξ, ζ) − Ψx (ξ) has a unique minimizer. • Ξ is convex and Ψx (ξ) is concave in ξ. Remark 2. Combining Theorem 1(i) with Proposition 1, we have proved the strong duality with the only assumption of upper semi-continuity of Ψx , which can hardly be eliminated in order to guarantee the existence of the worst-case distribution. Moreover, the results holds for any reference distribution supported on a Polish space. Therefore we have established the result in a very general setting. In Esfahani and Kuhn [17], Zhao and Guan [53], strong duality is established only for W1 -Wasserstein ball centered at a reference distribution with finite support in finite-dimensional normed vector space, and additional assumptions are needed – convexity assumption of Ψx (ξ) and Ξ in Esfahani and Kuhn [17], and compactness of Ξ and absolute continuity of µ in Zhao and Guan [53]. We will demonstrate the power of such generality in Section 4.4 and 4.5.


Remark 3 (Perturbation by splitting and transporting). Part (ii) also shows that when λ∗ > κ, there exists a worst-case distribution µ∗ that can be viewed as a perturbation of the reference distribution ν in the following way: each point ζ ∈ supp ν (with infinitesimal probability mass) is split into two parts with weights p(ζ) and 1 − p(ζ), which are then transported to the nearest and the furthest minimizers T±(λ∗, ζ) associated with Φ(λ∗, ζ). As will be seen in Section 3.2, this structure helps us derive interesting results for many problems.
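To make the splitting-and-transporting picture concrete, the following self-contained sketch (not from the paper; the piecewise-linear Ψx, the grid, and all numbers are illustrative assumptions) computes the nearest and furthest minimizers T±(λ, ζ) on a discretized Ξ ⊂ R with p = 1:

```python
import numpy as np

def nearest_furthest_minimizers(lam, zeta, grid, psi, p=1):
    """Minimizers of lam * d(xi, zeta)^p - psi(xi) over a finite grid:
    returns (T_minus, T_plus) = (nearest, furthest) minimizer to zeta."""
    vals = lam * np.abs(grid - zeta) ** p - psi(grid)
    argmins = grid[np.isclose(vals, vals.min())]
    dist = np.abs(argmins - zeta)
    return argmins[np.argmin(dist)], argmins[np.argmax(dist)]

grid = np.linspace(-2.0, 2.0, 4001)
psi = lambda xi: 1.0 - np.abs(xi)   # toy concave objective, maximized at xi = 0

# At lam = 1 the minimizers of |xi - 1| - psi(xi) form the whole interval [0, 1]:
t_minus, t_plus = nearest_furthest_minimizers(lam=1.0, zeta=1.0, grid=grid, psi=psi)
# t_minus is (approximately) 1.0, i.e. zeta stays put; t_plus is 0.0, the peak of psi
```

Here the atom of ν at ζ = 1 can be split between ζ itself and the furthest minimizer at the peak of Ψx, exactly the two-point structure in (17).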

Proof of Theorem 1. In view of the weak duality (Proposition 2), it suffices to show that vP ≥ vD. Set

h(λ) := λθ^p − ∫_Ξ Φ(λ, ζ) ν(dζ).

Note that Φ(λ, ζ) = −∞ for all λ < κ. By Lemma 2(ii)(v), h(λ) is the sum of a linear function λθ^p and an extended real-valued convex function −∫_Ξ Φ(λ, ζ) ν(dζ) on [κ, ∞); hence h(λ) is also convex and continuous on [κ, ∞). In addition, since Φ(λ, ζ) ≥ −Ψx(ζ), it follows that h(λ) ≥ λθ^p − ∫_Ξ Ψx(ζ) ν(dζ) → ∞ as λ → ∞. Thus h(λ) is a convex function on [κ, ∞) which tends to ∞, so it admits a minimizer λ∗ in [κ, ∞). Let us consider the following three cases.
• Case 1. There exists a minimizer λ∗ > κ. It follows that vD > −∞ and ∫_Ξ Φ(λ∗, ζ) ν(dζ) = λ∗θ^p − vD < ∞. We rewrite the dual objective as

h(λ) = λθ^p − λ ∫_Ξ [Φ(λ, ζ)/λ] ν(dζ).

The first-order optimality condition reads ∂λ− h(λ∗) ≥ 0 and ∂λ+ h(λ∗) ≤ 0, that is,

θ^p − ∫_Ξ [Φ(λ∗, ζ)/λ∗] ν(dζ) − λ∗ ∂λ+ ∫_Ξ [Φ(λ∗, ζ)/λ∗] ν(dζ) ≤ 0,
θ^p − ∫_Ξ [Φ(λ∗, ζ)/λ∗] ν(dζ) − λ∗ ∂λ− ∫_Ξ [Φ(λ∗, ζ)/λ∗] ν(dζ) ≥ 0.   (20)

Observe that d^p(T+(λ∗, ζ), ζ) ≤ Dλ∗ by Lemma 2(vi), so Ψx(T+(λ∗, ζ)) ≤ −Φ(λ∗, ζ) + λ∗Dλ∗. Observe also that Ψx(T+(λ∗, ζ)) ≥ Ψx(ζ). Hence Ψx(T+(λ∗, ·)) ∈ L¹(ν), and similarly for Ψx(T−(λ∗, ·)). Thus we can apply the dominated convergence theorem on the left-hand side of (20) and use Lemma 2(iv) to obtain

θ^p ≤ ∫_Ξ [Ψx(T+(λ∗, ζ))/λ∗] ν(dζ) + ∫_Ξ [Φ(λ∗, ζ)/λ∗] ν(dζ),
∫_Ξ [Ψx(T−(λ∗, ζ))/λ∗] ν(dζ) + ∫_Ξ [Φ(λ∗, ζ)/λ∗] ν(dζ) ≤ θ^p.   (21)

Based on the optimality condition (21), we construct a primal solution µ∗ as follows. Define p : supp ν → [0, 1] by

p(ζ) = [θ^p − ∫_Ξ Φ(λ∗, ζ)/λ∗ ν(dζ) − Ψx(T−(λ∗, ζ))/λ∗] / [Ψx(T+(λ∗, ζ))/λ∗ − Ψx(T−(λ∗, ζ))/λ∗] · 1{Ψx(T+(λ∗, ζ)) ≠ Ψx(T−(λ∗, ζ))}.   (22)

It follows from (21) that

θ^p = ∫_Ξ [Φ(λ∗, ζ)/λ∗] ν(dζ) + (1/λ∗) ∫_Ξ [p(ζ)Ψx(T+(λ∗, ζ)) + (1 − p(ζ))Ψx(T−(λ∗, ζ))] ν(dζ).   (23)

Since T+(λ∗, ζ) and T−(λ∗, ζ) are Borel measurable,

µ∗(A) = ∫_Ξ [p(ζ) 1_A(T+(λ∗, ζ)) + (1 − p(ζ)) 1_A(T−(λ∗, ζ))] ν(dζ)


is well defined for every Borel measurable set A ⊂ Ξ, and thus µ∗ ∈ P. We next verify the feasibility and optimality of µ∗. We compute

Wp^p(µ∗, ν) = max_{u,v∈Cb(Ξ)} { ∫_Ξ u(ξ) µ∗(dξ) + ∫_Ξ v(ζ) ν(dζ) : u(ξ) + v(ζ) ≤ d^p(ξ, ζ), ∀ξ, ζ ∈ Ξ }
= max_{u,v∈Cb(Ξ)} { ∫_Ξ [p(ζ)u(T+(λ∗, ζ)) + (1 − p(ζ))u(T−(λ∗, ζ))] ν(dζ) + ∫_Ξ v(ζ) ν(dζ) : u(ξ) + v(ζ) ≤ d^p(ξ, ζ), ∀ξ, ζ ∈ Ξ }
≤ ∫_Ξ p(ζ) d^p(T+(λ∗, ζ), ζ) ν(dζ) + ∫_Ξ (1 − p(ζ)) d^p(T−(λ∗, ζ), ζ) ν(dζ)
= ∫_Ξ p(ζ)[Ψx(T+(λ∗, ζ))/λ∗ + Φ(λ∗, ζ)/λ∗] ν(dζ) + ∫_Ξ (1 − p(ζ))[Ψx(T−(λ∗, ζ))/λ∗ + Φ(λ∗, ζ)/λ∗] ν(dζ)
= θ^p,

where the last equality follows from (23). Hence µ∗ is primal feasible. Meanwhile,

vP ≥ ∫_Ξ Ψx(ξ) µ∗(dξ) = ∫_Ξ [p(ζ)Ψx(T+(λ∗, ζ)) + (1 − p(ζ))Ψx(T−(λ∗, ζ))] ν(dζ) = λ∗(θ^p − ∫_Ξ [Φ(λ∗, ζ)/λ∗] ν(dζ)) = vD,

and therefore we obtain the result.
• Case 2. λ∗ = 0 is the unique dual minimizer. In this case, vD = h(0) = sup_{ξ∈Ξ} Ψx(ξ) < ∞. Since zero is the unique minimizer, we have

λθ^p − ∫_Ξ inf_{ξ∈Ξ} [λ d^p(ξ, ζ) − Ψx(ξ)] ν(dζ) > sup_{ξ∈Ξ} Ψx(ξ), ∀λ > 0.

Dividing both sides by λ and rearranging terms, we obtain

θ^p > ∫_Ξ inf_{ξ∈Ξ} [d^p(ξ, ζ) − Ψx(ξ)/λ] ν(dζ) + sup_{ξ∈Ξ} Ψx(ξ)/λ, ∀λ > 0.   (24)

We next show that {µλ}λ is a sequence of feasible distributions approaching optimality. Using the notation of Lemma 2(iii), for each λ > 0 we define

µλ(A) := ν({ζ : T−(λ, ζ) ∈ A}), for any Borel set A ⊂ Ξ.

From Lemma 2(iii) we know that T−(λ, ζ) is measurable, and thus µλ ∈ P(Ξ). In addition, we compute

Wp^p(µλ, ν) = max_{u,v∈Cb(Ξ)} { ∫_Ξ u(ξ) µλ(dξ) + ∫_Ξ v(ζ) ν(dζ) : u(ξ) + v(ζ) ≤ d^p(ξ, ζ), ∀ξ, ζ ∈ Ξ }
= max_{u,v∈Cb(Ξ)} { ∫_Ξ [u(T−(λ, ζ)) + v(ζ)] ν(dζ) : u(ξ) + v(ζ) ≤ d^p(ξ, ζ), ∀ξ, ζ ∈ Ξ }
≤ ∫_Ξ d^p(T−(λ, ζ), ζ) ν(dζ).

But by construction d^p(T−(λ, ζ), ζ) = Ψx(T−(λ, ζ))/λ + Φ(λ, ζ)/λ, and thereby

Wp^p(µλ, ν) ≤ ∫_Ξ [Ψx(T−(λ, ζ))/λ + Φ(λ, ζ)/λ] ν(dζ) ≤ ∫_Ξ [sup_{ξ∈Ξ} Ψx(ξ)/λ + Φ(λ, ζ)/λ] ν(dζ) < θ^p,

where the last inequality follows from (24). Hence each µλ is primal feasible, and by Lemma 2(v), Eµλ[Ψx(ξ)] = ∫_Ξ Ψx(T−(λ, ζ)) ν(dζ) → sup_{ξ∈Ξ} Ψx(ξ) = vD as λ → 0+, so vP ≥ vD.
• Case 3. λ∗ = κ > 0 is the unique dual minimizer. In this case,

vD = κθ^p − ∫_Ξ Φ(κ, ζ) ν(dζ) < λθ^p − ∫_Ξ Φ(λ, ζ) ν(dζ), ∀λ > κ.

Thus vD < ∞. Rearranging terms and using the fact that Φ(κ, ζ) ≤ κ d^p(T−(λ, ζ), ζ) − Ψx(T−(λ, ζ)) for any λ > κ, we obtain

(λ − κ)θ^p > ∫_Ξ [Φ(λ, ζ) − Φ(κ, ζ)] ν(dζ) ≥ ∫_Ξ (λ − κ) d^p(T−(λ, ζ), ζ) ν(dζ),

or ∫_Ξ d^p(T−(λ, ζ), ζ) ν(dζ) < θ^p for any λ > κ. Choose any compact set E ⊂ Ξ with ν(E) > 0, and let ξ̄ ∈ arg max_{ξ∈E} Ψx(ξ), which is well defined since Ψx is upper semi-continuous. For any ε ∈ (0, κ) and R > 0, there exists ζε such that d^p(ζε, ξ̄) > R and

Ψx(ζε) − Ψx(ξ̄) > (κ − ε)(d^p(ζε, ξ̄) + diam(E) + 1),

where diam(E) := max_{ξ,ξ′∈E} d(ξ, ξ′). It follows that

Ψx(ζε) − Ψx(ξ) > (κ − ε)(d^p(ζε, ξ) + 1), ∀ξ ∈ E.

As before, we are going to define a sequence of distributions approaching optimality. To this end, define a map Tε : Ξ → Ξ by

Tε(ζ) := ζε 1{ζ∈E} + ζ 1{ζ∈Ξ\E}.   (25)

Note that we can choose R sufficiently large that ∫_E d^p(ζε, ζ) ν(dζ) > θ^p. We then define µλ,ε by

µλ,ε(A) := (1 − pλ,ε) ν({ζ : T−(λ, ζ) ∈ A}) + pλ,ε ν({ζ : Tε(ζ) ∈ A}), for any Borel set A ⊂ Ξ,

where

pλ,ε := [θ^p − ∫_Ξ d^p(T−(λ, ζ), ζ) ν(dζ)] / [∫_E d^p(ζε, ζ) ν(dζ) − ∫_Ξ d^p(T−(λ, ζ), ζ) ν(dζ)].

Then µλ,ε is primal feasible, since

Wp^p(µλ,ε, ν) ≤ (1 − pλ,ε) ∫_Ξ d^p(T−(λ, ζ), ζ) ν(dζ) + pλ,ε ∫_E d^p(ζε, ζ) ν(dζ) = θ^p,

and

vP ≥ ∫_Ξ Ψx(ζ) µλ,ε(dζ)
= (1 − pλ,ε) ∫_Ξ Ψx(T−(λ, ζ)) ν(dζ) + pλ,ε [∫_E Ψx(ζε) ν(dζ) + ∫_{Ξ\E} Ψx(ζ) ν(dζ)]
≥ (1 − pλ,ε) [λ ∫_Ξ d^p(T−(λ, ζ), ζ) ν(dζ) − ∫_Ξ Φ(λ, ζ) ν(dζ)] + pλ,ε(κ − ε) ∫_E d^p(ζε, ζ) ν(dζ) + pλ,ε ∫_{Ξ\E} Ψx(ζ) ν(dζ)
≥ (κ − ε)θ^p − (1 − pλ,ε) ∫_Ξ Φ(λ, ζ) ν(dζ) + pλ,ε ∫_{Ξ\E} Ψx(ζ) ν(dζ).

Note that pλ,ε → 0 as R → ∞, and by Lemma 2(v), Φ(λ, ζ) → Φ(κ, ζ) as λ → κ+. Hence, letting ε → 0, R → ∞ and λ → κ+, we obtain vP ≥ κθ^p − ∫_Ξ Φ(κ, ζ) ν(dζ) = vD. This proves (i) and (ii).
To prove (iii), suppose µ∗ = T∗#ν is a primal optimal solution for some Borel map T∗ : supp ν → Ξ. From Definition 1 and the fact that (T × id)#ν ∈ Γ(T#ν, ν), we have M0 ⊂ M and thus vP ≥ sup_{µ∈M0} Eµ[Ψx(ξ)]. But at optimality Wp^p(µ∗, ν) ≤ ∫_Ξ d^p(ξ, T∗(ξ)) ν(dξ) ≤ θ^p, whence µ∗ ∈ M0 and vP ≤ sup_{µ∈M0} Eµ[Ψx(ξ)]. Therefore (18) is proved.
Suppose that for every ζ ∈ supp ν the problem inf_ξ λ∗ d^p(ξ, ζ) − Ψx(ξ) has a unique minimizer. Setting T∗(ζ) = arg min_ξ λ∗ d^p(ξ, ζ) − Ψx(ξ) and using an argument similar to Case 1, one can show that µ∗ = T∗#ν is feasible and optimal.
Suppose Ξ is convex and Ψx(ξ) is concave; then κ < ∞. If λ∗ > κ, set T∗(ζ) := p∗ T−(λ∗, ζ) + (1 − p∗) T+(λ∗, ζ) for all ζ ∈ supp ν, where

p∗ := [θ^p − ∫_Ξ Φ(λ∗, ζ)/λ∗ ν(dζ) − ∫_Ξ Ψx(T−(λ∗, ζ))/λ∗ ν(dζ)] / [∫_Ξ Ψx(T+(λ∗, ζ))/λ∗ ν(dζ) − ∫_Ξ Ψx(T−(λ∗, ζ))/λ∗ ν(dζ)],

provided that the denominator is nonzero; otherwise set p∗ = 1. By concavity, T∗(ζ) is also an optimal solution of the right-hand side of (15). Then, using the same argument as in Case 1 above, one can show that µ∗ = T∗#ν is primal optimal. If λ∗ = κ = 0, then in Case 2 above there is a sequence of distributions Tλ#ν converging to optimality as λ → κ+. If λ∗ = κ > 0, define

Tλ,ε(ζ) = (1 − pλ,ε) T−(λ, ζ) + pλ,ε Tε(ζ),

where pλ,ε := θ^p / ∫_Ξ d^p(Tε(ζ), ζ) ν(dζ). Similarly to Case 3 above, one can verify that µλ,ε := Tλ,ε#ν approaches optimality as λ → κ+ and ε → 0. This completes the proof. □

Remark 4 (Existence of a worst-case distribution). Part (ii) shows that the existence of a worst-case distribution is essentially determined by the growth parameter κ. When λ∗ > κ, a worst-case distribution always exists; otherwise a worst-case distribution may fail to exist, but there is a sequence of distributions approaching the optimal value. From the proofs of Cases 2 and 3, a necessary condition for λ∗ > κ is ∫_Ξ d^p(T−(λ, ζ), ζ) ν(dζ) < θ^p for all λ > κ. Hence, unless supp ν ⊂ arg min_ξ λ d^p(ξ, ζ) − Ψx(ξ) for all λ > κ — that is, unless Ψx(ξ) − Ψx(ζ) ≤ κ d^p(ξ, ζ) for all ζ ∈ supp ν and ξ ∈ Ξ — there is a sufficiently small θ such that the worst-case distribution exists. We also note that Example 1 in Esfahani and Kuhn [17] corresponds to the threshold case λ∗ = κ = 1. Compared with Corollary 4.7 in Esfahani and Kuhn [17], we here provide a complete description of the conditions for the existence of a worst-case distribution.


Remark 5. In the above three cases we repeatedly used the same proof idea: derive the first-order optimality condition, and based on it construct a distribution, or a sequence of distributions, that is provably optimal or approaches optimality. We will use this idea again to prove strong duality for another class of problems in Section 4.
We close this section by considering the degenerate case θ = 0. Note that when κ = ∞ we may fail to have strong duality. For example, let ν = δ_{ξ0} for some ξ0 ∈ Ξ. Then Wp(µ, ν) = 0 implies µ = ν, and thus vP = Ψx(ξ0). However, if Φ(λ, ξ0) = inf_{ξ∈Ξ} λ d^p(ξ, ξ0) − Ψx(ξ) = −∞ for every λ ≥ 0, then vD = ∞. Nevertheless, when κ < ∞ we still have strong duality.

Proposition 3. Suppose θ = 0 and κ < ∞. Then vP = vD.

Proof. Since Ψx is upper semi-continuous and κ < ∞, there exists κ̄ < ∞ such that Ψx(ξ) ≤ κ̄(d^p(ξ, ζ0) + 1) for all ξ ∈ Ξ. Then by Lemma 1, vP = ∫_Ξ Ψx(ζ) ν(dζ). Since Φ(λ, ζ) ≤ −Ψx(ζ) for all ζ, we have

vD = inf_{λ≥0} { −∫_Ξ Φ(λ, ζ) ν(dζ) } ≥ ∫_Ξ Ψx(ζ) ν(dζ) = vP.

On the other hand, by Moreau's theorem (cf. Theorem 1.25 in Rockafellar and Wets [39]), limλ→∞ Φ(λ, ζ) = −Ψx(ζ) for all ζ ∈ Ξ. Moreover, since d^p(ξ, ζ0) ≤ (d(ξ, ζ) + d(ζ, ζ0))^p ≤ 2^{p−1}(d^p(ξ, ζ) + d^p(ζ, ζ0)), we have

Φ(λ, ζ) ≥ inf_{ξ∈Ξ} λ d^p(ξ, ζ) − κ̄(d^p(ξ, ζ0) + 1) ≥ inf_{ξ∈Ξ} (λ − 2^{p−1}κ̄) d^p(ξ, ζ) − 2^{p−1}κ̄(d^p(ζ, ζ0) + 1).

Hence when λ > 2^{p−1}κ̄ we have Φ(λ, ζ) ≥ −2^{p−1}κ̄(d^p(ζ, ζ0) + 1). Together with Φ(λ, ζ) ≤ −Ψx(ζ), this shows that Φ(λ, ·) is dominated by a function in L¹(ν). Therefore, applying dominated convergence, vD ≤ −limλ→∞ ∫_Ξ Φ(λ, ζ) ν(dζ) = ∫_Ξ Ψx(ζ) ν(dζ) = vP. □

3.2. Finite-supported Reference Distribution In this subsection, we restrict attention to the case where the reference distribution is ν = (1/N) Σ_{i=1}^N δ_{ξ̂^i} for some ξ̂^i ∈ Ξ, i = 1, ..., N. This occurs when the decision maker collects N observations that constitute an empirical distribution.

Corollary 1 (Data-driven DRSP). Let ν = (1/N) Σ_{i=1}^N δ_{ξ̂^i}. Suppose θ > 0 and assume Ψx is upper semi-continuous. Then the following holds:
(i) The primal problem (11) has a strong dual program

min_{λ≥0} { λθ^p + (1/N) Σ_{i=1}^N max_{ξ∈Ξ} [Ψx(ξ) − λ d^p(ξ, ξ̂^i)] },   (26)

which always admits a minimizer λ∗.
(ii) When λ∗ > κ, there exists a worst-case distribution supported on at most N + 1 points of the form

µ∗ = (1/N) Σ_{i≠i0} δ_{ξ∗^i} + (p∗/N) δ_{ξ∗^{i0+}} + ((1 − p∗)/N) δ_{ξ∗^{i0−}},   (27)

where 1 ≤ i0 ≤ N, p∗ ∈ [0, 1], ξ∗^{i0±} := T±(λ∗, ξ̂^{i0}), and ξ∗^i ∈ {T+(λ∗, ξ̂^i), T−(λ∗, ξ̂^i)} for all i ≠ i0.
(iii) Suppose there exist L, M ≥ 0 such that |Ψx(ξ) − Ψx(ζ)| < M + L d(ξ, ζ) for all ξ, ζ ∈ Ξ. Let K be any positive integer and define the robust program

vK := sup_{(ξ^{ik})_{i,k} ∈ MK} (1/(NK)) Σ_{i=1}^N Σ_{k=1}^K Ψx(ξ^{ik}),

where

MK := { (ξ^{ik})_{i,k} : (1/(NK)) Σ_{i=1}^N Σ_{k=1}^K d^p(ξ^{ik}, ξ̂^i) ≤ θ^p, ξ^{ik} ∈ Ξ, ∀i, k }.   (28)

Then vK ↑ sup_{µ∈M} Eµ[Ψx(ξ)] as K → ∞. In particular, if λ∗ = κ = 0, then v1 = sup_{µ∈M} Eµ[Ψx(ξ)], while if λ∗ > κ, it holds that

vK ≤ sup_{µ∈M} Eµ[Ψx(ξ)] ≤ vK + (M + L Dλ∗)/(NK),

where Dλ∗ is the constant defined in Lemma 2(vi).
(iv) Assume κ < ∞. When Ξ is convex and Ψx is concave, (26) reduces further to

sup_{ξ^i∈Ξ} { (1/N) Σ_{i=1}^N Ψx(ξ^i) : (1/N) Σ_{i=1}^N d^p(ξ^i, ξ̂^i) ≤ θ^p }.   (29)
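For concreteness, here is a small numerical sketch of the data-driven dual (26) (a hedged illustration, not the paper's code: the objective Ψx, the discretization of Ξ, and all data below are made-up choices), evaluated by brute force over a grid of λ:

```python
import numpy as np

# Hedged sketch of evaluating the dual (26):
#   v_D = min_{lam>=0}  lam*theta^p + (1/N) sum_i max_xi [ Psi_x(xi) - lam*d(xi, xhat_i)^p ].

def dual_value(lam, xhat, grid, psi, theta, p=2):
    inner = (psi(grid)[None, :] - lam * np.abs(grid[None, :] - xhat[:, None]) ** p).max(axis=1)
    return lam * theta ** p + inner.mean()

rng = np.random.default_rng(0)
xhat = rng.normal(size=20)                 # N = 20 observations
grid = np.linspace(-5.0, 5.0, 2001)        # discretization of Xi = R
psi = lambda xi: -(xi - 1.0) ** 2          # illustrative concave objective
theta = 0.5

lams = np.linspace(0.0, 10.0, 501)
vals = [dual_value(lam, xhat, grid, psi, theta) for lam in lams]
lam_star, v_D = lams[int(np.argmin(vals))], min(vals)

# sanity: the robust value dominates the SAA value (1/N) sum_i Psi_x(xhat_i)
assert v_D >= psi(xhat).mean() - 1e-9
```

By weak duality, `v_D` upper-bounds the worst-case expectation over the Wasserstein ball of radius `theta` around the empirical distribution.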

Remark 6. Parts (i) and (iv) are similar to the results in Esfahani and Kuhn [17] and Zhao and Guan [53]. In (ii) we provide a more detailed form of the worst-case distribution, which further implies (iii): any distributionally robust stochastic program (11) with ambiguity set M can be approximated by a robust program with uncertainty set MK, a subset of M containing distributions supported on NK points with equal probability. Such robust-program approximations expand the tractability of DRSP problems; we show some examples in Section 4.

Remark 7 (Hedging against noise in the data values). Given N historical observations, as pointed out in Remark 3, the structure (27) of the worst-case distribution means that each observation is perturbed to ξ∗^{i±} according to a Bernoulli distribution with parameter p^{i+}. From this point of view, the Wasserstein ambiguity set hedges against noise in the data values.

Remark 8 (Choosing the metric d, and the Total Variation metric). We have discussed how λ and p control the extent of the perturbation by means of Φ(λ, ζ); the metric d plays a similar role. For example, when d is induced by an lq-norm, larger q allows larger perturbations. As another example, by choosing the discrete metric d(ξ, ζ) = 1{ξ≠ζ} on Ξ, the Wasserstein distance equals the Total Variation distance (Gibbs and Su [21]), which suits situations where the distance of the perturbation does not matter, and provides a rather conservative decision. In this case, if θ is chosen so that Nθ is an integer, then no fractional point appears in (27) and the problem reduces to (29). Note that both Esfahani and Kuhn [17] and Zhao and Guan [53] consider only Euclidean spaces with norm-induced metrics, whereas our results hold for any complete and separable metric space. This provides more flexibility to apply the reformulation and worst-case analysis to a broader class of problems, such as ν being the distribution of continuous random vectors or even stochastic processes.

Proof of Corollary 1. Parts (i) and (iv) follow directly from Theorem 1 and Proposition 1. To prove (ii), by Theorem 1(ii), when λ∗ > κ there exists a worst-case distribution supported on at most 2N points of the form

µ∗ = (1/N) Σ_{i=1}^N [p^{i+} δ_{ξ∗^{i+}} + p^{i−} δ_{ξ∗^{i−}}],   (30)


where p^{i±} ≥ 0, p^{i+} + p^{i−} = 1 and ξ∗^{i±} = T±(λ∗, ξ̂^i). Therefore, we obtain the following (possibly non-convex) reformulation of (11):

max_{ξ^{i±}∈Ξ, p^{i±}≥0} { (1/N) Σ_{i=1}^N Ψx(ξ̂^i) + (1/N) Σ_{i=1}^N [p^{i+}(Ψx(ξ^{i+}) − Ψx(ξ̂^i)) + p^{i−}(Ψx(ξ^{i−}) − Ψx(ξ̂^i))] :
(1/N) Σ_{i=1}^N [p^{i+} d^p(ξ^{i+}, ξ̂^i) + p^{i−} d^p(ξ^{i−}, ξ̂^i)] ≤ θ^p, p^{i+} + p^{i−} = 1, ∀i }.

Given ξ^{i±} for all i, the problem

max_{0≤p^{i+}≤1} { (1/N) Σ_{i=1}^N [p^{i+}(Ψx(ξ^{i+}) − Ψx(ξ̂^i)) + (1 − p^{i+})(Ψx(ξ^{i−}) − Ψx(ξ̂^i))] :
(1/N) Σ_{i=1}^N [p^{i+} d^p(ξ^{i+}, ξ̂^i) + (1 − p^{i+}) d^p(ξ^{i−}, ξ̂^i)] ≤ θ^p }

is a linear program and admits an optimal solution with at most one fractional component. Thus there exists a worst-case distribution supported on at most N + 1 points of the form

µ∗ = (p∗/N) δ_{ξ∗^{i0+}} + ((1 − p∗)/N) δ_{ξ∗^{i0−}} + (1/N) Σ_{i≠i0} δ_{ξ∗^i},

where i0 ∈ {1, ..., N}, ξ∗^{i0±} = T±(λ∗, ξ̂^{i0}), and ξ∗^i ∈ {T+(λ∗, ξ̂^i), T−(λ∗, ξ̂^i)} for all i ≠ i0.
To prove (iii), first consider λ∗ > κ. By the assumption on Ψx we have κ ≤ L < ∞. Observe that the collection (ξ^{ik})_{i,k} given by

ξ^{ik} = ξ∗^i for all 1 ≤ k ≤ K and i ≠ i0; ξ^{ik} = ξ∗^{i0+} for 1 ≤ k ≤ ⌊Kp∗⌋, i = i0; ξ^{ik} = ξ∗^{i0−} for ⌊Kp∗⌋ < k ≤ K, i = i0

belongs to MK. Since p∗ − ⌊Kp∗⌋/K < 1/K and d^p(ξ∗^{i0±}, ξ̂^{i0}) ≤ Dλ∗ by Lemma 2(vi), it follows that

Eµ∗[Ψx(ξ)] − vK ≤ (1/N)(p∗ − ⌊Kp∗⌋/K)(Ψx(ξ∗^{i0+}) − Ψx(ξ∗^{i0−})) ≤ (1/(NK))(Ψx(ξ∗^{i0+}) − Ψx(ξ̂^{i0})) ≤ (M + L d(ξ∗^{i0+}, ξ̂^{i0}))/(NK) ≤ (M + L Dλ∗)/(NK).

When λ∗ = κ = 0, from Case 2 in the proof of Theorem 1 there exists a sequence of distributions supported on N points with equal probability, so K = 1 gives the exact reformulation. When λ∗ = κ > 0, from Case 3 in the proof of Theorem 1 there exists a sequence of distributions {µλ}λ approaching optimality as λ → κ+ of the form

µλ = (1/N) Σ_{i=1}^N [(1 − p^i_λ) δ_{ξ^{i+}_λ} + p^i_λ δ_{ξ^{i−}_λ}],   (31)

where p^i_λ ∈ (0, 1), ξ^{i+}_λ = T−(λ, ξ̂^i) and ξ^{i−}_λ = Tε(ξ̂^i) (see (25) for the definition). For each p^i_λ, let K^i_λ be such that 1/K^i_λ < p^i_λ < 1/(K^i_λ − 1), and define

µ̃λ = (1/N) Σ_{i=1}^N [(1 − 1/K^i_λ) δ_{ξ^{i+}_λ} + (1/K^i_λ) δ_{ξ^{i−}_λ}].


Then, using the same argument as in Case 3, one can show that µ̃λ is a sequence of primal feasible solutions approaching optimality. □

Example 4 (Piecewise concave objective). Suppose p = 1 and Ψx(ξ) = max_{1≤j≤J} Ψ^j_x(ξ), as considered in Esfahani and Kuhn [17]. From the structure (30)–(31) of the worst-case distribution, for each 1 ≤ i ≤ N there exist j_{i+}, j_{i−} ∈ {1, ..., J} such that ξ̂^i is perturbed to the pieces at which the j_{i±}-th functions Ψ^{j_{i±}}_x achieve the maximum. So, without decreasing the optimal value, we can restrict the set M to the smaller set

{ µ = (1/N) Σ_{i=1}^N Σ_{j=1}^J p_{ij} δ_{ξ^{ij}} : (1/N) Σ_{i=1}^N Σ_{j=1}^J p_{ij} d(ξ^{ij}, ξ̂^i) ≤ θ, Σ_{j=1}^J p_{ij} = 1, ∀i, p_{ij} ≥ 0, ξ^{ij} ∈ Ξ, ∀i, j }.

Replacing ξ^{ij} by ξ̂^i + (ξ^{ij} − ξ̂^i)/p_{ij}, by the positive homogeneity of norms and the convexity-preserving property of perspective functions (cf. Section 2.3.3 in Boyd and Vandenberghe [10]), we obtain an equivalent convex program reformulation of (11):

sup_{p_{ij}≥0, ξ^{ij}} { (1/N) Σ_{i=1}^N Σ_{j=1}^J p_{ij} Ψ^j_x(ξ̂^i + (ξ^{ij} − ξ̂^i)/p_{ij}) : (1/N) Σ_{i=1}^N Σ_{j=1}^J d(ξ^{ij}, ξ̂^i) ≤ θ, Σ_{j=1}^J p_{ij} = 1, ∀i, ξ̂^i + (ξ^{ij} − ξ̂^i)/p_{ij} ∈ Ξ, ∀i, j }.

So we recover Theorem 4.5 in Esfahani and Kuhn [17], which was obtained there by a separate procedure of dualizing the reformulation in (i) twice.

Now let us consider the special case where Ξ is also finite, namely Ξ = {ξ^1, ..., ξ^{B̄}} with distinct points ξ^i. In this case, let N_i be the number of samples equal to ξ^i and let q_i = N_i/N, i = 1, ..., B̄; the reference distribution is then ν = Σ_{i=1}^{B̄} q_i δ_{ξ^i}. Let q := (q_1, ..., q_{B̄})ᵀ ∈ Δ_{B̄}, where Δ_{B̄} is the standard B̄-simplex. The DRSP problem becomes

min_{x∈X} max_{p∈Δ_{B̄}} { Σ_{i=1}^{B̄} p_i Ψ(x, ξ^i) : Wp(p, q) ≤ θ }.   (32)

Corollary 2 (Hedging against noise in the empirical frequencies). Suppose Ξ = {ξ^i}_{i=1}^{B̄} and ν = Σ_{i=1}^{B̄} q_i δ_{ξ^i}. Then problem (32) has the strong dual program

min_{x∈X, λ≥0} { λθ^p + Σ_{i=1}^{B̄} q_i y_i : y_i ≥ Ψ(x, ξ^j) − λ d^p(ξ^j, ξ^i), ∀i, j = 1, ..., B̄ }.   (33)

Given the optimal solution (x∗, λ∗), the associated worst-case distribution can be computed from

max_{p∈Δ_{B̄}, γ∈R_+^{B̄×B̄}} { Σ_{i=1}^{B̄} p_i Ψ(x∗, ξ^i) : Σ_{i,j} d(ξ^i, ξ^j) γ_{ij} ≤ θ, Σ_j γ_{ij} = p_i, ∀i, Σ_i γ_{ij} = q_j, ∀j }.   (34)

Proof. Reformulation (33) follows from Theorem 1(i), and (34) can be obtained using the equivalent definition of Wasserstein distance in Example 3. 
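The finite-support duality in Corollary 2 is easy to verify numerically. The sketch below (illustrative data, not the paper's code) minimizes the piecewise-linear dual over a λ grid and compares it with the transportation linear program (34) solved by scipy, for p = 1:

```python
import numpy as np
from scipy.optimize import linprog

xi = np.array([0.0, 1.0, 2.0])
Psi = np.array([0.0, 0.5, 2.0])        # Psi(x*, xi_i) for a fixed decision
q = np.array([0.5, 0.3, 0.2])
theta = 0.4
D = np.abs(xi[:, None] - xi[None, :])  # d(xi_i, xi_j)

# Dual: min_lam  theta*lam + sum_j q_j * max_i [Psi_i - lam*d(xi_i, xi_j)],
# piecewise linear in lam, minimized here by a fine grid search.
lams = np.linspace(0.0, 20.0, 20001)
inner = (Psi[:, None, None] - lams[None, None, :] * D[:, :, None]).max(axis=0)
v_dual = (theta * lams + (q[:, None] * inner).sum(axis=0)).min()

# Primal (34): maximize sum_i Psi_i p_i with p_i = sum_j gamma_ij,
# subject to sum_i gamma_ij = q_j and sum_ij d_ij gamma_ij <= theta.
B = len(xi)
c = -np.repeat(Psi, B)                 # gamma flattened row-major as gamma[i, j]
A_eq = np.zeros((B, B * B))
for j in range(B):
    A_eq[j, j::B] = 1.0                # column sums equal q_j
res = linprog(c, A_ub=[D.ravel()], b_ub=[theta], A_eq=A_eq, b_eq=q, bounds=(0, None))
v_primal = -res.fun

assert abs(v_primal - v_dual) < 1e-2   # strong duality, up to grid error
```

On this instance the transport budget moves the mass at ξ = 1 (and part of the mass at ξ = 0) toward ξ = 2, and both sides evaluate to 1.10.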


4. Applications In this section, we apply our duality results and the structural description of the worst-case distribution to a variety of stochastic optimization problems. In Section 4.1, we perform a numerical study of the newsvendor problem and show that using a Wasserstein ambiguity set results in a reasonable worst-case distribution. Owing to the close connection between DRSP and robust programming (Corollary 1(iii)(iv)), in Sections 4.2 and 4.3 we illustrate the tractability of DRSPs via two-stage linear programming and via DRSPs with objective concave in the uncertainty. In Sections 4.4 and 4.5, we illustrate the generality of the strong duality results by studying the worst-case Value-at-Risk problem and point process control. Finally, we demonstrate the power of our constructive proof method by applying it to a class of distributionally robust transportation problems in Section 4.6.

4.1. Newsvendor Problems In this subsection, we perform a numerical study of distributionally robust newsvendor problems, with an emphasis on the worst-case distribution. The newsvendor model is one of the most fundamental problems in operations management: the decision maker has to choose the inventory level before the random demand is realized, facing both overage and underage costs. The problem can be formulated as

min_{x≥0} Eµ[h(x − ξ)⁺ + b(ξ − x)⁺],   (35)

where x is the decision variable for the initial inventory level, ξ is the random demand, and h, b represent the overage and underage costs per unit, respectively. We assume b, h ≥ 0 and that ξ is supported on {0, 1, ..., B̄} for some positive integer B̄. If the underlying distribution µ is known exactly, it is well known that the set of optimal solutions is the whole interval of b/(b+h)-quantiles, [inf{t : Hµ(t) ≥ b/(b+h)}, sup{t : Hµ(t) ≤ b/(b+h)}].
Now suppose we are given N_i observations of historical demand i, i = 0, ..., B̄, and let N = Σ_{i=0}^{B̄} N_i and q_i := N_i/N. Classical stochastic programming theory suggests using the Sample Average Approximation (SAA) method to solve the SAA counterpart

min_{x≥0} Σ_{i=0}^{B̄} q_i [h(x − i)⁺ + b(i − x)⁺].   (36)

Then the empirical b/(b+h)-quantile gives an optimal solution x̂ := inf{i ≥ 0 : Σ_{j=0}^i q_j ≥ b/(b+h)}. Since the problem is merely one-dimensional, the SAA solution x̂ gives an accurate approximation provided that the sample size N is not too small (Levi et al. [30]). On the other hand, in some practical settings it is rather expensive to gather demand data. For example, a company planning to introduce a new product may collect demand data either by setting up pilot stores or by simulating complex systems. In such cases, with fewer data in hand, the decision maker may prefer the distributionally robust counterpart

min_{x≥0} sup_{µ∈M} Eµ[h(x − ξ)⁺ + b(ξ − x)⁺].

Using Corollary 2, we obtain a convex programming reformulation of (35):

min_{x,λ≥0} { λθ^p + Σ_{i=0}^{B̄} q_i y_i : y_i ≥ max(h(x − j), b(j − x)) − λ|i − j|^p, ∀ 0 ≤ i, j ≤ B̄ }.   (37)

In addition, given the optimal solution (x∗, λ∗), the worst-case distribution can be computed by solving

max_{p^h_i, p^b_i, γ_{ij} ≥ 0} { Σ_{i=0}^{B̄} [p^h_i h(x − i) + p^b_i b(i − x)] : Σ_{i,j} γ_{ij} |i − j|^p ≤ θ^p, Σ_j γ_{ij} = p^h_i + p^b_i, ∀i, Σ_i γ_{ij} = q_j, ∀j }.
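A direct way to solve the reformulation (37) is to scan the one-dimensional inventory level x and, for each x, minimize the inner dual over λ on a grid. The following sketch (synthetic q and θ; a brute-force illustration rather than the convex solver one would use in practice) does exactly that for p = 1:

```python
import numpy as np

# For each inventory level x, evaluate
#   min_lam  lam*theta + sum_i q_i * max_j [ max(h*(x-j), b*(j-x)) - lam*|i-j| ]
# over a grid of lam >= 0, then pick the best x.

def drsp_newsvendor(q, theta, b=1.0, h=1.0, lam_grid=np.linspace(0.0, 5.0, 501)):
    B = len(q) - 1
    demand = np.arange(B + 1)
    dist = np.abs(demand[None, :] - demand[:, None])           # |i - j|
    best_x, best_val = 0, np.inf
    for x in range(B + 1):
        loss = np.maximum(h * (x - demand), b * (demand - x))  # cost if demand = j
        pen = loss[None, None, :] - lam_grid[:, None, None] * dist[None, :, :]
        inner = pen.max(axis=2)                 # worst relocation of observation i
        vals = lam_grid * theta + (inner * q).sum(axis=1)
        if vals.min() < best_val:
            best_x, best_val = x, vals.min()
    return best_x, best_val

x_star, v_star = drsp_newsvendor(q=np.array([0.2] * 5), theta=0.5)
```

With θ = 0 the worst case collapses to the SAA objective; for uniform q on {0, ..., 4} and b = h = 1, the optimal level is then the median x = 2 with cost 1.2, and any θ > 0 can only increase the robust cost.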


[Figure 2 appears here: four histograms of relative frequency versus random demand on [0, 100], each comparing worst-case distributions under the Wasserstein and Burg entropy ambiguity sets with the empirical distribution. Panels: (a) Binomial(200, 0.5), N = 500; (b) Binomial(200, 0.5), N = 50; (c) truncated Geometric(0.1), N = 500; (d) truncated Geometric(0.1), N = 50.]

Figure 2. Histograms of worst-case distributions obtained from the Wasserstein distance and the Burg entropy

We perform a numerical test whose setup is similar to Wang et al. [50] and Ben-Tal et al. [5]. We set b = h = 1, B̄ = 100, and N ∈ {50, 500}, representing a small and a large data set. The data are generated from Binomial(100, 0.5) and Geometric(0.1) distributions truncated on [0, 100]. For a fair comparison, we estimate the radius of the ambiguity set so that it covers the underlying distribution with probability greater than 95%. For the Burg entropy, we use the estimate developed in Wang et al. [50], while for the Wasserstein distance, the radius is determined by a classical concentration inequality for the Wasserstein distance developed in Bolley et al. [8] (see Appendix B for more details). Figure 2 shows the histograms of the worst-case distributions under the optimal initial inventory level determined by solving the DRSPs with the Wasserstein and Burg entropy ambiguity sets.
When the underlying distribution is Binomial, intuitively, the symmetry of the Binomial distribution and b = h = 1 imply that the optimal initial inventory level is close to B̄/2 = 50, and the corresponding worst-case distribution should resemble a mixture distribution with two modes, representing high and low demand respectively. This intuition is consistent with the solid curves in Figures 2a and 2b, for which the Wasserstein distance is used to construct the ambiguity set. In addition, the tail distributions are smooth and reasonable for both the small and the large data set. In contrast, if the Burg entropy is used to define the ambiguity set (dashed curves in Figures 2a and 2b), the worst-case distribution is less realistic. First, it has disconnected support. Second, it is not symmetric, and there is a large spike at the boundary 100, corresponding to the "popping" behavior mentioned in Section 1. Especially when the data set is small, the spike is huge, which makes the solution too conservative.
When the underlying distribution is Geometric, conceivably, the worst-case distribution should have one spike for low demand and a heavy tail for high demand.


Using the Wasserstein distance, we indeed observe reasonable worst-case distributions (solid curves in Figures 2c and 2d). With the Burg entropy (dashed curves in Figures 2c and 2d), the worst-case distribution has disconnected support and unrealistic spikes at the boundary. In the Geometric case the tail behavior is even more problematic: for a distribution with unbounded support there is much flexibility in specifying the truncation threshold B̄, and since the spike appears only at the boundary, the tail distribution is very sensitive to the choice of B̄. Hence, the conclusion of this numerical test is that the Wasserstein ambiguity set is likely to yield a more reasonable, robust and realistic worst-case distribution.

4.2. Two-stage Linear Programming In Corollary 1(iii)(iv) we established the close connection between the DRSP problem and robust programming: every DRSP problem can be approximated by robust programs to rather high accuracy, which significantly enlarges the applicability of the DRSP approach. To illustrate this point, in this subsection we study the tractability of the two-stage DRSP problem. Consider the two-stage distributionally robust stochastic program

min_{x∈X} cᵀx + sup_{µ∈M} Eµ[Ψ(x, ξ)],   (38)

where Ψ(x, ξ) is the optimal value of the second-stage problem

min_{y∈R^m} { q(ξ)ᵀy : T(ξ)x + W(ξ)y ≤ h(ξ) },   (39)

and

q(ξ) = q⁰ + Σ_{l=1}^s ξ_l q^l,  T(ξ) = T⁰ + Σ_{l=1}^s ξ_l T^l,  W(ξ) = W⁰ + Σ_{l=1}^s ξ_l W^l,  h(ξ) = h⁰ + Σ_{l=1}^s ξ_l h^l.
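For a fixed first-stage decision x and scenario ξ, the recourse value Ψ(x, ξ) in (39) is a single linear program. A minimal sketch (hypothetical data shapes, with scipy's `linprog` as the LP solver):

```python
import numpy as np
from scipy.optimize import linprog

# Psi(x, xi) = min_y { q(xi)^T y : T(xi) x + W(xi) y <= h(xi) }.

def recourse(x, xi, q0, qs, T0, Ts, W0, Ws, h0, hs):
    q = q0 + np.tensordot(xi, qs, axes=1)     # q(xi) = q^0 + sum_l xi_l q^l, etc.
    T = T0 + np.tensordot(xi, Ts, axes=1)
    W = W0 + np.tensordot(xi, Ws, axes=1)
    h = h0 + np.tensordot(xi, hs, axes=1)
    res = linprog(q, A_ub=W, b_ub=h - T @ x, bounds=(None, None))
    return res.fun if res.success else None   # caller handles infeasible/unbounded cases

# tiny 1-D sanity instance: min { y : -y <= -2 } = 2
val = recourse(np.zeros(1), np.zeros(1),
               q0=np.array([1.0]), qs=np.zeros((1, 1)),
               T0=np.zeros((1, 1)), Ts=np.zeros((1, 1, 1)),
               W0=np.array([[-1.0]]), Ws=np.zeros((1, 1, 1)),
               h0=np.array([-2.0]), hs=np.zeros((1, 1)))
```

Evaluating (38) additionally requires the worst case of this value over the ambiguity set, which is precisely what makes the two-stage problem hard, as discussed next.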

We assume p = 2 and Ξ = R^s with the Euclidean distance d. In general, the two-stage problem (38) is hard to solve. Indeed, let N = 1, ν = δ_ξ̂, let α_l > β_l > 0, l = 1, ..., s, be given constants, and consider the following special case of problem (38):

min_{x∈X} cᵀx + sup_{µ∈M} Eµ[ min_{y_l^±≥0} { Σ_{l=1}^s α_l y_l⁺ − β_l y_l⁻ : y_l⁺ + y_l⁻ = ξ_l − x_l, ∀l } ].   (40)

Note that

sup_{µ∈M} Eµ[ min_{y_l^±≥0} { Σ_{l=1}^s α_l y_l⁺ − β_l y_l⁻ : y_l⁺ + y_l⁻ = ξ_l − x_l, ∀l } ] = sup_{µ∈M} Eµ[ Σ_{l=1}^s [α_l(ξ_l − x_l)⁺ − β_l(x_l − ξ_l)⁺] ].

Recalling the definition (15) of Φ, for any λ > 0 the problem

Φ(λ, ξ̂) = inf_{ξ∈Ξ} { λ||ξ − ξ̂||₂² − Σ_{l=1}^s [α_l(ξ_l − x_l)⁺ − β_l(x_l − ξ_l)⁺] }

has a unique solution, while for λ = 0, Φ(0, ξ̂) = −∞. Hence by Theorem 1(iii), problem (40) is equivalent to

min_{x∈X} cᵀx + sup_{ξ: ||ξ−ξ̂||₂ ≤ θ} [ min_{y_l^±≥0} { Σ_{l=1}^s α_l y_l⁺ − β_l y_l⁻ : y_l⁺ + y_l⁻ = ξ_l − x_l, ∀l } ].   (41)

l=1

22

It has been shown in Minoux [32] that problem (41) is strongly NP-hard. As a consequence, the two-stage linear DRSP (38) is strongly NP-hard even when only the right-hand side vector h(ξ) is uncertain. However, we are going to show that with tools from robust programming, we are able to obtain PN a tractable approximation of (38). Let M1 := {(ξ 1 , . . . , ξ N ) ∈ ΞN : N1 i=1 ||ξ i − ξbi ||22 ≤ θ2 }. Using Theorem 1(ii) with K = 1, we obtain an adjustable-robust-programming approximation ( ) N 1 X > i min c x + sup Ψ(x, ξ ) x∈X (ξ i )i ∈M1 N i=1 ) ( (42) P t ≥ c> x + N1 i q(ξ i )> y(ξ i ), ∀(ξ i )i ∈ M1 , √ SN , = min t: x∈X,t∈R T (ξ)x + W (ξ)y(ξ) ≤ h(ξ), ∀ξ ∈ i=1 {ξ 0 ∈ Ξ : ||ξ 0 − ξbi ||2 ≤ θ N } y:Ξ→R

where the second set of inequalities follows PN from the fact that T (ξ)x + W (ξ)y(ξ) ≤ h(ξ) should hold for all realizations of distribution N1 i=1 δξi . Although problem (42) is still intractable in general, there has been a substantial literature on different approximations to problem (42). One popular approach is to consider the so-called affinely adjustable robust counterpart (AARC) as follows. We assume that y is an affine function of ξ: y(ξ) = y 0 +

s X

ξl y l , ∀ξ ∈

N [

Bi,

i=1

l=1

√ for some y 0 , y l ∈ Rm , where B i := {ξ 0 ∈ Ξ : ||ξ 0 − ξbi ||2 ≤ θ N }. Then the AARC of (42) is ( s N s 1 X  0 X i l >  0 X i l  > y + ξl y − t ≤ 0, ∀(ξ i )i ∈ M1 , min q + ξl q t: c x+ x∈X,t∈R N i=1 l=1 l=1 y l ∈Rm ,l=0,...,s



0

T +

s X

ξl T

l





0

x+ W +

l=1

s X

ξl W

l



0

y +

s X

l=1

ξl y

l



) s N   X [ 0 l i − h + h ξl ≤ 0, ∀ξ ∈ B .

l=1

l=1

i=1

(43) Set ζil := ξli − ξbi for i = 1, . . . , N and l = 1, . . . , s. In view of M1 , ζ belongs to the ellipsoidal uncertainty set X Uζ = {(ζil )i,l : ζil2 ≤ N θ2 }. i,l

Set z = [x; t; {y l }sl=0 ], and define          

s s h i X X α0 (z) := − c> x + (q 0 + ξbli q l )> (y 0 + ξbli y l ) − t , l=1

h

0

(q +

l=1

Ps bi l l> 0 bi l > l l=1 ξl q ) y + q (y + l=1 ξl y )

Ps

 β0il (z) := −      l l0 l0 l    Γ(l,l0 ) (z) := − q y + q y , 0 2N

i

2N

,

∀1 ≤ i ≤ N, 1 ≤ l, l0 ≤ s.

Then the first set of constraints in (43) is equivalent to α0 (z) + 2

X i,l

β0il (z)ζil +

XX i

l,l0

(l,l0 )

Γ0

ζil ζil0 ≥ 0,

∀(ζil )i,l ∈ Uζ .

(44)

23

It follows from Theorem 4.2 in Ben-Tal et al. [6] that (44) takes place if and only if there exists λ0 ≥ 0 such that (α0 (z) − λ0 )v 2 + 2v

X

β0il (z)wil +

XX i

i,l

(l,l0 )

Γ0

wil wi0 l0 +

l,l0

λ0 X 2 w ≥ 0, N θ2 i,l il

∀v ∈ R, ∀wil ∈ R, ∀i, l.

Or in matrix form,  ∃λ0 ≥ 0 :

Γ0 ⊗ IN + Nλθ02 · IsN vec(β0 ) vec> (β0 ) α0 (z) − λ0

  0,

(45)

where IN (resp. IsN ) is N (resp. sN ) dimensional identity matrix, ⊗ is the Kronecker product of matrices and vec is the vectorization of a matrix. Now we reformulate the second set of constraints in (43). For all 1 ≤ i ≤ N , 1 ≤ j ≤ m and 1 ≤ l, l0 ≤ s, we set  s s s s h i X X X X  bi T l )x + (W 0 + bi W l )(y 0 + bi y l ) − (h0 + bi hl ) ,  αij (z) := − (T 0 + ξ ξ ξ ξ  j l j j l j l j l j    l=1 l=1 l=1 l=1   i h  Ps Ps Tjl x + (Wj0 + l=1 ξbli Wjl )y l + Wjl (y 0 + l=1 ξbli y l ) − hl l  , βij (z) := −   2   0 0  l l l l    Γ(l,l0 ) (z) := − Wj y + Wj y . j 2 Let η i := ξ − ξbi for 1 ≤ i ≤ N . Then the second set of constraints in (43) is equivalent to √ > αij (z) + 2βij (z)> η i + η i Γj (z)η i ≥ 0, ∀η i ∈ {η 0 ∈ Rs : ||η 0 ||2 ≤ θ N }, ∀1 ≤ i ≤ N, 1 ≤ j ≤ m. Again by Theorem 4.2 in Ben-Tal et al. [6] we have further equivalence   λij Γ β j (z) + N θ 2 · Is ij (z)  0, ∀1 ≤ i ≤ N, 1 ≤ j ≤ m. ∃λij ≥ 0 : βij (z)> αij (z) − λij

(46)
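The equivalences behind (45) and (46) are instances of the S-lemma (Theorem 4.2 in Ben-Tal et al. [6]): a quadratic is nonnegative on a ball if and only if a small semidefinite certificate exists. The following sketch checks the scalar case numerically; the function names and the grid search over the multiplier are illustrative assumptions, not part of the paper.

```python
def robust_quad_ok(alpha, beta, gamma, r):
    """Check alpha + 2*beta*w + gamma*w**2 >= 0 for all |w| <= r directly,
    by minimizing the quadratic over the candidate points of [-r, r]."""
    cands = [-r, r]
    if gamma > 0 and -r <= -beta / gamma <= r:
        cands.append(-beta / gamma)  # interior stationary point (only a min if gamma > 0)
    return min(alpha + 2 * beta * w + gamma * w * w for w in cands) >= -1e-12

def sdp_certificate_ok(alpha, beta, gamma, r):
    """Scalar analogue of (45)-(46): existence of lam >= 0 with the 2x2 matrix
    [[gamma + lam/r^2, beta], [beta, alpha - lam]] PSD, scanned over a grid.
    A 2x2 symmetric matrix is PSD iff its diagonal and determinant are nonnegative."""
    for k in range(1000):
        lam = k * 0.01
        a11 = gamma + lam / r ** 2
        a22 = alpha - lam
        if a11 >= 0 and a22 >= 0 and a11 * a22 - beta * beta >= 0:
            return True
    return False
```

On test instances the two criteria agree, as the S-lemma predicts.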

Combining (45) and (46) we obtain the following result.

Proposition 4. An exact reformulation of the AARC of (42) is given by
\[
\min_{\substack{x\in X,\ t\in\mathbb{R},\ y^l\in\mathbb{R}^m,\ l=0,\dots,s\\ \lambda_0\ge 0,\ \lambda_{ij}\ge 0,\ i=1,\dots,N,\ j=1,\dots,m}}\ \{\,t : \text{(45) and (46) hold}\,\}. \tag{47}
\]

Since (42) is a fairly good approximation of the original two-stage DRSP problem (38), the semidefinite program (47) provides a tractable approximation of (38).

4.3. Concave objective In Corollary 1(iii), we pointed out that problem (7) with a finitely supported reference distribution and concave $\Psi_x$ is equivalent to the robust optimization problem (29) with uncertainty set
\[
\Big\{(\xi^1,\dots,\xi^N)\in\Xi^N : \frac{1}{N}\sum_{i=1}^N d^p(\xi^i,\hat\xi^i)\le\theta^p\Big\},
\]
which expresses that the average distance (measured in $d^p$) between the noisy data $\hat\xi^i$ and the "true" scenarios is controlled by $\theta^p$. In some situations, this reformulation yields an efficiently computable program.


As an example, consider $\Psi(x,\xi)$ affine in $x$, i.e.,
\[
\Psi(x,\xi) = a^\top x + b, \tag{48}
\]
where $\xi = [a;b]$, and let $\xi^i = [a_i;b_i]$ and $\hat\xi^i = [\hat a_i;\hat b_i]$, $i=1,\dots,N$. Assume the metric $d$ is induced by some norm $\|\cdot\|_q$. Set $u=(u_1,\dots,u_N)$ and
\[
U = \Big\{u : \frac{1}{N}\sum_{i=1}^N\|u_i\|_q^p\le\theta^p\Big\}.
\]
Then by (29), problem (5) is equivalent to
\[
\min_{x\in X,\,t\in\mathbb{R}}\Big\{t : \frac{1}{N}\sum_{i=1}^N(a_i[u]^\top x + b_i[u])\le t,\ \begin{bmatrix}a_i[u]\\ b_i[u]\end{bmatrix} = \begin{bmatrix}\hat a_i\\ \hat b_i\end{bmatrix} + u_i,\ \forall u\in U\Big\}.
\]
We have the following result.

Corollary 3 (Affinely-perturbed objective). With the above setup, problem (5) is equivalent to
\[
\min_{x\in X,\,t\in\mathbb{R}}\Big\{t : \frac{1}{N}\sum_{i=1}^N(\hat a_i^\top x + \hat b_i) + \theta\|x\|_{q^*}\le t\Big\}.
\]
Now we turn to a general concave objective. With the help of reformulation (29), we can apply the Mirror-Prox algorithm to solve the convex–concave saddle point problem
\[
\min_{x\in X}\ \max_{(\xi^1,\dots,\xi^N)\in Y}\ \frac{1}{N}\sum_{i=1}^N\Psi(x,\xi^i), \tag{49}
\]
where
\[
Y = \Big\{(\xi^1,\dots,\xi^N)\in\Xi^N : \frac{1}{N}\sum_{i=1}^N d^p(\xi^i,\hat\xi^i)\le\theta^p\Big\}.
\]

In the following, we briefly describe and set up the algorithm; for a detailed description, we refer the reader to Nemirovski [33]. For ease of notation, set $y := (\xi^1,\dots,\xi^N)$. We assume that $\Xi$ is a separable Hilbert space with the metric $d$ induced by an inner product $\langle\cdot,\cdot\rangle$. Let $\Xi_i$ be the translate of $\Xi$ under the mapping $\xi\mapsto\xi-\hat\xi^i$; a natural norm on $\Xi_i$ is given by $\|\xi\|_{\Xi_i} := d(\xi,\hat\xi^i)$ for all $\xi\in\Xi_i$. On the product space $\Xi^N := \prod_{i=1}^N\Xi_i$, we define a norm $\|\cdot\|_Y$ by
\[
\|y\|_Y = \|(\xi^1,\dots,\xi^N)\|_Y := \Big(\sum_{i=1}^N\|\xi^i\|_{\Xi_i}^p\Big)^{1/p},\qquad \xi^i\in\Xi_i,\ i=1,\dots,N.
\]
We introduce the distance-generating function
\[
\omega_Y(y) = \omega_Y(\xi^1,\dots,\xi^N) := \frac{1}{m\gamma}\sum_{i=1}^N\|\xi^i\|_{\Xi_i}^m,
\]
where $m,\gamma$ are chosen later such that $\omega_Y$ is strongly convex with modulus 1 with respect to $\|\cdot\|_Y$. We also assume that there exists a norm $\|\cdot\|_X$ on $X$ and a distance-generating function $\omega_X(\cdot)$ which


is continuous and strongly convex with modulus 1 with respect to $\|\cdot\|_X$, and admits a continuous selection $\omega_X'(x)$ of subgradients. Let $\Theta_X := \sup_{x\in X}\omega_X(x) - \inf_{x\in X}\omega_X(x)$. On the product space $Z := X\times Y$, we define the norm
\[
\|z\|_Z := \sqrt{\frac{1}{\Theta_X^2}\|x\|_X^2 + \frac{1}{\Theta_Y^2}\|y\|_Y^2}
\]
for any $z\in Z$. It can be easily checked that
\[
\|z\|_{Z,*} = \sqrt{\Theta_X^2\|x\|_{X,*}^2 + \Theta_Y^2\|y\|_{Y,*}^2}
\]
defines the dual norm. Suppose there exist $L_{11},L_{12},L_{21},L_{22},M_{11},M_{12},M_{21},M_{22}\ge 0$ such that for any $x,x'\in X$ and $\xi,\xi'\in\Xi$,
\[
\begin{cases}
\|\partial_x\Psi(x,\xi)-\partial_x\Psi(x',\xi)\|_{X,*}\le L_{11}\|x-x'\|_X + M_{11},\\
\|\partial_x\Psi(x,\xi)-\partial_x\Psi(x,\xi')\|_{X,*}\le L_{12}\|\xi-\xi'\|_\Xi + M_{12},\\
\|\partial_\xi\Psi(x,\xi)-\partial_\xi\Psi(x',\xi)\|_{\Xi,*}\le L_{21}\|x-x'\|_X + M_{21},\\
\|\partial_\xi\Psi(x,\xi)-\partial_\xi\Psi(x,\xi')\|_{\Xi,*}\le L_{22}\|\xi-\xi'\|_\Xi + M_{22}.
\end{cases} \tag{50}
\]
Set

\[
L := \sqrt{2\Theta_X^4 L_{11}^2 + 2\Theta_X^2\Theta_Y^2(L_{12}^2+L_{21}^2) + 2\Theta_Y^4 L_{22}^2},\qquad
M := \sqrt{2\Theta_X^2(M_{11}^2+M_{12}^2) + 2\Theta_Y^2(M_{21}^2+M_{22}^2)}. \tag{51}
\]
Define the vector field
\[
F(z) := \Big[\frac{1}{N}\sum_{i=1}^N\nabla_x\Psi(x,\xi^i)\,;\ \{-\nabla_\xi\Psi(x,\xi^i)\}_{i=1}^N\Big],\qquad z\in X\times Y. \tag{52}
\]

It follows from Lemma 6 in the Appendix that
\[
\|F(z)-F(z')\|_{Z,*}\le L\|z-z'\|_Z + M.
\]
Set
\[
\omega(z) := \frac{1}{\Theta_X^2}\omega_X(x) + \frac{1}{\Theta_Y^2}\omega_Y(y);
\]
then $\omega$ is a distance-generating function compatible with $\|\cdot\|_Z$, and $\Theta := \sup_{z\in Z}\omega(z) - \inf_{z\in Z}\omega(z) = 1$. Suppose there is a first-order oracle which computes $F(z)$ on each call. The accuracy of a candidate solution $(x,y)\in X\times Y$ is characterized by
\[
\epsilon_{\mathrm{sad}}(x,y) := \max_{(\xi'^1,\dots,\xi'^N)\in Y}\frac{1}{N}\sum_{i=1}^N\Psi(x,\xi'^i) - \min_{x'\in X}\frac{1}{N}\sum_{i=1}^N\Psi(x',\xi^i).
\]

Given $N$ and $\theta$, the best accuracy guarantee achievable for (49) by first-order methods after $t$ oracle calls is $O(M/\sqrt{t})$ when $L = 0$ and $O(L/t)$ when $M = 0$, and the Mirror-Prox algorithm achieves these bounds up to constant factors. The Mirror-Prox algorithm is shown in Algorithm 1.

Proposition 5. Assume (50) holds and let $L, M$ be defined in (51). Set
\[
m := \begin{cases}1+\frac{1}{\ln N}, & p=1,\\ p, & 1<p\le 2,\\ 2, & p>2,\end{cases}
\qquad
\gamma := (m-1)\,\theta^{m-2}\,N^{-\max\left(\frac{2-m}{p},\,1\right)}. \tag{53}
\]


Algorithm 1 Mirror-Prox Algorithm
1: Initialize $z_1 := (x_1,y_1)\in X\times Y$.
2: For $\tau = 1,2,\dots$: compute $w_\tau = \mathrm{Prox}_{z_\tau}(\gamma_\tau F(z_\tau))$ and $z_{\tau+1} = \mathrm{Prox}_{z_\tau}(\gamma_\tau F(w_\tau))$, where $\mathrm{Prox}_{z}(\zeta) := \arg\min_{z'\in Z}\,[V(z',z) + \langle\zeta, z'\rangle]$ and $V(z',z) := \omega(z') - \omega(z) - \langle\nabla\omega(z), z'-z\rangle$.
3: Output $z^t := \sum_{\tau=1}^t\gamma_\tau w_\tau \big/ \sum_{\tau=1}^t\gamma_\tau$.
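As a concrete illustration of Algorithm 1, the following self-contained Python sketch runs Mirror-Prox (the extragradient scheme) with a Euclidean prox on a toy bilinear saddle point $\min_x\max_y xy$ over $[-1,1]^2$. The constant step size and box projection are illustrative choices, not the settings (53) of the paper.

```python
def mirror_prox(F, z0, proj, step, iters):
    """Generic Mirror-Prox with Euclidean prox (so Prox is a projected gradient step).
    Returns the step-weighted average of the intermediate points w_t, as in step 3."""
    z = list(z0)
    avg = [0.0] * len(z0)
    total = 0.0
    for _ in range(iters):
        # w_t = Prox_{z_t}(step * F(z_t)); z_{t+1} = Prox_{z_t}(step * F(w_t))
        w = proj([zi - step * gi for zi, gi in zip(z, F(z))])
        z = proj([zi - step * gi for zi, gi in zip(z, F(w))])
        avg = [a + step * wi for a, wi in zip(avg, w)]
        total += step
    return [a / total for a in avg]

# toy bilinear saddle point min_x max_y x*y on [-1,1]^2: F(x, y) = (y, -x)
F = lambda z: [z[1], -z[0]]
proj = lambda z: [max(-1.0, min(1.0, v)) for v in z]
xbar, ybar = mirror_prox(F, [0.9, -0.7], proj, 0.5, 2000)
```

For this problem the averaged iterate converges to the saddle point $(0,0)$, while plain gradient descent–ascent would spiral outward; the extragradient correction is exactly what makes Algorithm 1 work on bilinear couplings.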

Let $\gamma_\tau$, $1\le\tau\le t$, satisfy
\[
\gamma_\tau\langle F(z_\tau)-F(w_\tau),\, w_\tau - z_{\tau+1}\rangle - V(z_\tau,w_\tau) - V(w_\tau,z_{\tau+1}) \le 0,
\]
which is in particular satisfied when $\gamma_\tau\in(0,\frac{1}{\sqrt{2}L}]$, or $\gamma_\tau\in(0,1/L]$ when $M=0$. Then it holds that
\[
\epsilon_{\mathrm{sad}}(z^t) \le \frac{1 + M^2\sum_{\tau=1}^t\gamma_\tau^2}{\sum_{\tau=1}^t\gamma_\tau}, \tag{54}
\]
and
\[
\Theta_Y \le \begin{cases}\dfrac{e\,\theta^2 N^2\ln N}{1+1/\ln N}, & p=1,\\[6pt] \dfrac{\theta^2}{p(p-1)}N^2, & 1<p\le 2,\\[6pt] \dfrac{\theta^2}{2}N^2, & p>2.\end{cases} \tag{55}
\]

Note that if $\gamma_\tau = O(1/L)$, then by (51) and (54) we have $\epsilon_{\mathrm{sad}}(z^t) = O(\Theta_Y^2)$. Thus, according to (55), $\epsilon_{\mathrm{sad}}(z^t) = O(\theta^4 N^4)$ when $p>1$ and $\epsilon_{\mathrm{sad}}(z^t) = O(\theta^4 N^4\ln^2 N)$ when $p=1$. We argue that the complexity of the algorithm may not be as large as it appears. For one thing, we only use the DRSP formulation when $N$ is not very large, since otherwise the Sample Average Approximation method already provides a relatively accurate solution. For another, the radius $\theta$ goes to zero as $N$ goes to infinity. Therefore, we believe that the Mirror-Prox algorithm (Algorithm 1) is useful in many practical applications.

4.4. Worst-case Value-at-Risk In the strong duality result (Theorem 1), the reference distribution can be any measure on a Polish space. Such generality greatly enhances the applicability of our results. As an example, in this subsection we study the worst-case Value-at-Risk with a Wasserstein ambiguity set. Value-at-Risk is a popular risk measure in the financial industry. Given a random variable $Z$ and $\alpha\in(0,1)$, the Value-at-Risk $\mathrm{VaR}_\alpha[Z]$ of $Z$ with respect to the measure $\nu$ is defined by
\[
\mathrm{VaR}_\alpha[Z] := \inf\{t : P_\nu\{Z\le t\}\ge 1-\alpha\}. \tag{56}
\]
Note that $\mathrm{VaR}_\alpha[Z]\le 0$ is equivalent to the chance constraint $P\{Z\le 0\}\ge 1-\alpha$. In the spirit of Ghaoui et al. [20], we consider the following worst-case VaR problem. Suppose we are given a portfolio consisting of $n$ assets with allocation weights $w$ satisfying $\sum_{i=1}^n w_i = 1$ and $w\ge 0$. Let $\xi_i$ be the (random) return rate of asset $i$, $i=1,\dots,n$, and $r = E[\xi]$ the vector of expected return rates. Assume the metric $d$ associated with the set $M$ in (6) is induced by the infinity norm $\|\cdot\|_\infty$ on $\mathbb{R}^n$. The worst-case VaR with respect to the set of probability distributions $M$ is defined as
\[
\mathrm{VaR}^{\mathrm{wc}}_\alpha(w) := \min_q\Big\{q : \inf_{\mu\in M}P_\mu\{-w^\top\xi\le q\}\ge 1-\alpha\Big\}. \tag{57}
\]
For simplicity we assume $p=1$. The following lemma establishes the continuity of the probability bound with respect to the boundary $\partial C$ of a Borel set $C$.


Lemma 3. Let $C$ be a Borel set and $\theta > 0$. Then
\[
\inf_{\mu\in M}\mu(C) = \min_{\mu\in M}\mu(\mathrm{int}(C)).
\]

Hence by Lemma 3, the distributionally robust chance constraint is equivalent to its strict counterpart $\min_{\mu\in M}P_\mu\{-w^\top\xi < q\}\ge 1-\alpha$.

Proposition 6. Let $\theta > 0$, $\alpha\in(0,1)$, and $w\in\{w'\in\mathbb{R}^n : \sum_{i=1}^n w_i' = 1,\ w'\ge 0\}$. Define
\[
\beta_0 := \min\Bigg\{1,\ \frac{\big(\alpha - \nu\{\xi : -w^\top\xi > \mathrm{VaR}_\alpha[-w^\top\xi]\}\big)\big(q - \mathrm{VaR}_\alpha[-w^\top\xi]\big)^p}{\theta^p - E_\nu\big[(q+w^\top\xi)^p\,1_{\{-w^\top\xi>\mathrm{VaR}_\alpha[-w^\top\xi]\}}\big]}\Bigg\}.
\]
Then
\[
\inf_{\mu\in M}P_\mu\{-w^\top\xi\le q\}\ge 1-\alpha \iff E_\nu\big[((q+w^\top\xi)^+)^p\cdot 1_{\{-w^\top\xi>\mathrm{VaR}_\alpha[-w^\top\xi]\}}\big] + \beta_0\,E_\nu\big[((q+w^\top\xi)^+)^p\cdot 1_{\{-w^\top\xi=\mathrm{VaR}_\alpha[-w^\top\xi]\}}\big] \ge \theta^p.
\]
In particular, for $p=1$, when $\nu$ is a continuous distribution or $\nu = \frac{1}{N}\sum_{i=1}^N\delta_{\hat\xi^i}$ with $\alpha N$ an integer,
\[
\inf_{\mu\in M}P_\mu\{-w^\top\xi\le q\}\ge 1-\alpha \iff -\alpha\,\mathrm{CVaR}_\alpha[-(q+w^\top\xi)^+]\le -\theta^p,
\]

where CVaRα [Z] is the conditional value-at-risk defined by CVaRα [Z] := inf t∈R {t + α−1 Eν [Z − t]+ }.
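The conditional value-at-risk appearing in the reformulation can be computed directly from its variational definition. A minimal Python sketch for an empirical distribution (the sample data in the test are hypothetical):

```python
def cvar(samples, alpha):
    """Empirical CVaR via CVaR_a[Z] = inf_t { t + E[(Z - t)^+] / a }.
    The objective is piecewise linear and convex with breakpoints at the sample
    values, so the infimum is attained at one of the samples; when alpha * N is
    an integer this equals the average of the alpha * N largest samples."""
    n = len(samples)
    return min(t + sum(max(z - t, 0.0) for z in samples) / (alpha * n)
               for t in samples)
```

For example, with samples $\{1,2,3,4\}$ and $\alpha = 0.5$ this returns $3.5$, the average of the two largest values.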

Figure 3. Worst-case VaR. When $-w^\top\xi$ is continuously distributed, $\mathrm{VaR}^{\mathrm{wc}}_\alpha$ equals the $q$ for which the area of the shaded region equals $\theta$.

Example 5 (Worst-case VaR with Gaussian reference distribution). Suppose $\nu\sim N(\bar\mu,\Sigma)$ and consider $p=1$. It follows that $-w^\top\xi\sim N(-w^\top\bar\mu,\,w^\top\Sigma w)$ and $\mathrm{VaR}_\alpha[-w^\top\xi] = -w^\top\bar\mu + \sqrt{w^\top\Sigma w}\,\Phi^{-1}(1-\alpha)$. By Proposition 6, $\mathrm{VaR}^{\mathrm{wc}}_\alpha(-w^\top\xi)$ is the minimal $q$ such that (see Figure 3)
\[
\int_{\mathrm{VaR}_\alpha[-w^\top\xi]}^{q}\frac{1}{\sqrt{2\pi w^\top\Sigma w}}\,(q-y)\,e^{-\frac{(y+w^\top\bar\mu)^2}{2w^\top\Sigma w}}\,dy \ge \theta.
\]
Let $q'$ be the solution of the equation
\[
q'\big[\Phi(q') - (1-\alpha)\big] - \frac{1}{\sqrt{2\pi}}\Big(e^{-\frac{(\Phi^{-1}(1-\alpha))^2}{2}} - e^{-\frac{q'^2}{2}}\Big) = \theta, \tag{58}
\]
where $\Phi$ is the cumulative distribution function of the standard normal distribution. Then $\mathrm{VaR}^{\mathrm{wc}}_\alpha(-w^\top\xi) = q'\sqrt{w^\top\Sigma w} - w^\top\bar\mu$. Note that the one-dimensional nonlinear equation (58) can be solved efficiently via any root-finding algorithm.
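A root-finder for (58) takes only a few lines. The following self-contained Python sketch uses bisection, with the normal CDF built from math.erf; it is an illustration under the reconstruction of (58) above, not the paper's implementation. The left-hand side of (58) equals $0$ at $q' = \Phi^{-1}(1-\alpha)$ and is increasing beyond it (its derivative is $\Phi(q') - (1-\alpha)\ge 0$), so a bracketing interval for the level $\theta$ always exists.

```python
import math

def norm_cdf(x):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_ppf(p):
    # standard normal quantile by bisection (the CDF is strictly increasing)
    lo, hi = -10.0, 10.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if norm_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def worst_case_var_root(alpha, theta):
    """Solve (58): q [Phi(q) - (1-alpha)] - (e^{-q0^2/2} - e^{-q^2/2}) / sqrt(2 pi) = theta,
    where q0 = Phi^{-1}(1 - alpha). The residual is -theta at q = q0 and
    increasing for q >= q0, so plain bisection applies."""
    q0 = norm_ppf(1.0 - alpha)
    s = math.sqrt(2.0 * math.pi)

    def f(q):
        return (q * (norm_cdf(q) - (1.0 - alpha))
                - (math.exp(-q0 * q0 / 2.0) - math.exp(-q * q / 2.0)) / s
                - theta)

    lo, hi = q0, q0 + 1.0
    while f(hi) < 0.0:          # expand until the root is bracketed
        hi += 1.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Given the root $q'$, the worst-case VaR is $q'\sqrt{w^\top\Sigma w} - w^\top\bar\mu$; for $\alpha = 0.05$ one has $q' > \Phi^{-1}(0.95)\approx 1.645$ for any $\theta > 0$.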


4.5. On/Off Process Control In this subsection, we consider a distributionally robust process control problem in which the reference distribution $\nu$ is a point process. In this problem, the decision maker controls a two-state (on/off) system facing an exogenous demand process. When the system is on, it meets any arriving demand, each unit of which contributes 1 dollar of profit, but maintaining the on state costs $c$ dollars per unit time. When the system is off, there is no cost, but no demand can be met, so there is no profit either. The decision maker chooses a control policy that maximizes the overall profit over a finite time horizon. Such problems are prototypes of many problems in sensor networks and revenue management. In many practical settings, the demand process is not known exactly. Instead, the decision maker has i.i.d. historical sample paths, which constitute an empirical point process. Using the empirical point process, the classical Sample Average Approximation method yields a degenerate control policy, in which the system is switched on only at the arrival times of the empirical point process. Consequently, if the true arrivals deviate even slightly from the empirical arrival times, the system is off and no demand is met. Due to this degeneracy and instability of the SAA method, we resort to the distributionally robust approach. Before giving a formal mathematical definition of the problem, we remark that Definition 1 of the Wasserstein distance extends to any Borel measures $\mu,\nu\in B(\Xi)$ provided that $\nu(\Xi) < \infty$. In fact, when $\nu(\Xi) = 0$, by Lemma 8, if $\mu(\Xi) > 0$ then $W_p(\mu,\nu) = \infty$, with the corresponding problem (8) infeasible; thus the ambiguity set $M$ contains only the zero measure $\nu$ itself. When $\nu(\Xi) > 0$, without loss of generality we can replace $\nu$ by $\nu/\nu(\Xi)$ so that it is a Borel probability measure, and the strong duality developed in Section 3 remains valid.
Therefore, in the remainder of this section, when we write the $W_1$ distance between two Borel measures, we use the above extended definition.

We set up the problem as follows. We scale the finite time horizon to $[0,1]$. Let $\Xi = \{\sum_{t=1}^m\delta_{\xi_t} : m\in\mathbb{Z}_+,\ \xi_t\in[0,1],\ t=1,\dots,m\}$ be the space of finite counting measures on $[0,1]$. We assume the metric $d$ on $\Xi$ satisfies the following conditions:
1) For any $\hat\eta = \sum_{t=1}^m\delta_{\zeta_t}$ and $\eta = \sum_{t=1}^m\delta_{\xi_t}$, where $m$ is a positive integer and $\{\zeta_t\}_{t=1}^m,\{\xi_t\}_{t=1}^m\subset[0,1]$, it holds that
\[
d(\eta,\hat\eta) = W_1(\eta,\hat\eta) = \sum_{t=1}^m|\xi_{(t)} - \zeta_{(t)}|, \tag{59}
\]
where $\xi_{(t)}$ (resp. $\zeta_{(t)}$) are the order statistics of $\xi_t$ (resp. $\zeta_t$).
2) For any Borel set $C\subset[0,1]$, any $\bar\theta\ge 0$, and any $\hat\eta = \sum_{t=1}^m\delta_{\zeta_t}$, where $m$ is a positive integer and $\{\zeta_t\}_{t=1}^m\subset[0,1]$, it holds that
\[
\inf_{\eta\in\Xi}\big\{\eta(C) : d(\eta,\hat\eta)\le\bar\theta\big\} \ge \inf_{\tilde\eta\in B([0,1])}\big\{\tilde\eta(C) : W_1(\tilde\eta,\hat\eta)\le\bar\theta\big\}. \tag{60}
\]
3) The metric space $(\Xi,d)$ is Polish, i.e., complete and separable.
We note that condition (59) is only imposed on $\eta,\hat\eta\in\Xi$ such that $\eta([0,1]) = \hat\eta([0,1])$. Possible choices for $d$ include
\[
d\Big(\sum_{t=1}^m\delta_{\xi_t},\ \sum_{\tau=1}^l\delta_{\zeta_\tau}\Big) = \sum_{t=1}^{\min(m,l)}|\xi_{(t)} - \zeta_{(t)}| + |m-l|,
\]
or
\[
d\Big(\sum_{t=1}^m\delta_{\xi_t},\ \sum_{\tau=1}^l\delta_{\zeta_\tau}\Big) := \begin{cases}\max(m,l), & l\neq m,\\ \sum_{t=1}^m|\xi_{(t)} - \zeta_{(t)}|, & l=m>0,\\ 0, & l=m=0.\end{cases}
\]
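The order-statistics formula (59), extended to unequal counts with the first metric choice above (charging $|m-l|$ for the unmatched points), can be sketched in a few lines of Python; this is an illustrative reading of the definition, not code from the paper.

```python
def w1_point_patterns(xs, ys):
    """Distance between two point patterns on [0, 1]: sum of absolute differences
    of order statistics over the matched points, plus |m - l| for the count gap."""
    xs, ys = sorted(xs), sorted(ys)
    k = min(len(xs), len(ys))
    return sum(abs(a - b) for a, b in zip(xs[:k], ys[:k])) + abs(len(xs) - len(ys))
```

For equal counts this reduces exactly to (59); the sorted matching is the optimal transport plan on the line.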

Such metrics are similar to the ones developed in Barbour and Brown [1], Chen and Xia [14]. Given the metric $d$, the point processes on $[0,1]$ are then given by the set $P(\Xi)$ of Borel probability measures on $\Xi$. For simplicity, we choose the distance between two point processes $\mu,\nu\in P(\Xi)$ to be $W_1(\mu,\nu)$ as defined in (8). Suppose we are given $N$ sample paths $\hat\eta^i = \sum_{t=1}^{m_i}\delta_{\hat\xi^i_t}$, $i=1,\dots,N$, where $m_i$ is a nonnegative integer and $\hat\xi^i_t\in[0,1]$ for all $i,t$. Then the reference


distribution is $\nu = \frac{1}{N}\sum_{i=1}^N\delta_{\hat\eta^i}$, and the ambiguity set is $M = \{\mu\in P(\Xi) : W_1(\mu,\nu)\le\theta\}$. Denote by $X$ the set of all functions $x : [0,1]\to\{0,1\}$ such that $x^{-1}(1)$ is a Borel set, where $x^{-1}(1) := \{t\in[0,1] : x(t) = 1\}$. The decision maker looks for a policy $x\in X$ that maximizes the total revenue by solving
\[
v^* := \sup_{x\in X}\Big\{v(x) := -c\int_0^1 x(t)\,dt + \inf_{\mu\in M}E_{\eta\sim\mu}\big[\eta(x^{-1}(1))\big]\Big\}. \tag{61}
\]
We first investigate the structure of the optimal policy. Let $\mathrm{int}(x^{-1}(1))$ be the interior of the set $x^{-1}(1)$ in the space $[0,1]$ with its canonical topology (and thus $0,1\in\mathrm{int}([0,1])$).
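To see concretely the objective that (61) reduces to when $\theta = 0$ (the plain SAA case), one can evaluate an interval policy against empirical sample paths; a minimal sketch with hypothetical data:

```python
def saa_value(policy_intervals, sample_paths, c):
    """SAA (theta = 0) value of an on/off policy given as a list of (start, end)
    intervals in [0, 1]: average number of arrivals covered minus c * total on-time."""
    on_time = sum(b - a for a, b in policy_intervals)

    def covered(path):
        return sum(1 for t in path if any(a <= t <= b for a, b in policy_intervals))

    avg = sum(covered(p) for p in sample_paths) / len(sample_paths)
    return avg - c * on_time
```

With $\theta = 0$ the optimizer shrinks the on-intervals to vanishingly short windows around the empirical arrival times; this is exactly the degeneracy described above that the robust formulation with $\theta > 0$ removes.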

Lemma 4. For any $\nu\in P(\Xi)$ and policy $x$, it holds that
\[
\inf_{\mu\in M}E_{\eta\sim\mu}\big[\eta(x^{-1}(1))\big] = \inf_{\rho\in P(B([0,1])\times\Xi)}\Big\{E_{(\tilde\eta,\hat\eta)\sim\rho}\big[\tilde\eta(\mathrm{int}(x^{-1}(1)))\big] : E_{(\tilde\eta,\hat\eta)\sim\rho}\big[W_1(\tilde\eta,\hat\eta)\big]\le\theta,\ \pi^2_\#\rho = \nu\Big\}. \tag{62}
\]
Suppose $\nu = \frac{1}{N}\sum_{i=1}^N\delta_{\hat\eta^i}$ with $\hat\eta^i = \sum_{t=1}^{m_i}\delta_{\hat\xi^i_t}$. Then there exists a positive integer $M$ such that
\[
v^* = \sup_{x_{j\pm}\in[0,1]}\Big\{v\Big(\sum_{j=1}^M 1_{[x_{j-},x_{j+}]}\Big) := -c\sum_{j=1}^M(x_{j+}-x_{j-}) + \inf_{\mu\in M}E_{\eta\sim\mu}\big[\eta\big(\cup_{j=1}^M[x_{j-},x_{j+}]\big)\big]\Big\}.
\]

Let $A$ be a reference Borel measure on $\Xi$ with $A(\Xi) > 0$, and let $d\mu/dA$ denote the Radon–Nikodym derivative. We denote by $B$ the set of finite Borel measures on $\Xi$ that are absolutely continuous with respect to $A$, and by $P\subset B$ the set of Borel probability measures on $\Xi$ that are absolutely continuous with respect to $A$. We use the overloaded notation $W_p(f,\nu)$ to represent the $W_p$ distance between the reference distribution $\nu$ and the distribution with density $f$ with respect to $A$. Let $L:\mathbb{R}\to\mathbb{R}$ be a non-decreasing concave function, and consider the problem
\[
v_P = \sup_{f\in P\,:\,W_p(f,\nu)\le\theta}\ \int_\Xi L\circ f\,dA. \tag{66}
\]


Our goal is to derive the strong dual of (66) and obtain a representation of the worst-case distribution using the same proof method as in Section 3.1.

Step 1. Derive the weak duality. Writing the Lagrangian and applying reasoning similar to the proof of Proposition 2, we have
\[
\begin{aligned}
v_P &= \sup_{f\in P}\ \inf_{\lambda\ge 0}\ \Big\{\int_\Xi L\circ f\,dA + \lambda\big(\theta^p - W_p^p(f,\nu)\big)\Big\}\\
&= \sup_{f\in B}\ \inf_{\lambda\ge 0}\ \Big\{\int_\Xi L\circ f\,dA + \lambda\big(\theta^p - W_p^p(f,\nu)\big)\Big\}\\
&\le \inf_{\lambda\ge 0}\ \Big\{\lambda\theta^p + \sup_{f\in B}\Big[\int_\Xi L\circ f\,dA - \lambda W_p^p(f,\nu)\Big]\Big\}\\
&= \inf_{\lambda\ge 0}\ \Big\{\lambda\theta^p + \sup_{f\in B}\Big[\int_\Xi L\circ f\,dA - \lambda\sup_{u,v\in C_b(\Xi)}\Big\{\int_\Xi uf\,dA + \int_\Xi v\,d\nu : u(\xi)\le\inf_{\zeta\in\Xi}d^p(\xi,\zeta)-v(\zeta)\Big\}\Big]\Big\}\\
&= \inf_{\substack{\lambda\ge 0\\ v\in C_b(\Xi):\,\int_\Xi v\,d\nu=0}}\ \Big\{\lambda\theta^p + \sup_{f\in B}\Big[\int_\Xi L\circ f\,dA - \lambda\int_\Xi\Big[\inf_{\zeta\in\Xi}d^p(\xi,\zeta)-v(\zeta)\Big]f(\xi)\,A(d\xi)\Big]\Big\}\\
&= \inf_{\substack{\lambda\ge 0\\ v\in C_b(\Xi):\,\int_\Xi v\,d\nu=0}}\ \Big\{\lambda\theta^p + \sup_{f\in B}\int_\Xi\big[L\circ f(\xi) - \lambda\Phi_v(\xi)f(\xi)\big]\,A(d\xi)\Big\},
\end{aligned}
\]
where the first inequality follows from Lemma 8 and in the last equality $\Phi_v(\xi) := \inf_{\zeta\in\Xi}\{d^p(\xi,\zeta)-v(\zeta)\}$. Define
\[
L^*(a) := \sup_{x\ge 0}\{L(x) - ax\},\qquad a\in\mathbb{R},
\]
which can be viewed as the Legendre transform of the concave function $L$. Thus $L^*$ is convex; we denote by $\partial L^*(a)$ its subdifferential at $a\in\mathrm{dom}\,L^*$, where $\mathrm{dom}\,L^* := \{a\ge 0 : L^*(a) < \infty\}$. It follows that
\[
\begin{aligned}
v_P &\le \inf_{\substack{\lambda\ge 0\\ v\in C_b(\Xi):\,\int_\Xi v\,d\nu=0}}\ \Big\{\lambda\theta^p + \sup_{f\in B}\int_\Xi\big[L\circ f(\xi) - \lambda\Phi_v(\xi)f(\xi)\big]\,A(d\xi)\ :\ \lambda\Phi_v(\xi)\in\mathrm{dom}\,L^*,\ \forall\xi\in\mathrm{supp}\,A\Big\}\\
&= \inf_{\substack{\lambda\ge 0\\ v\in C_b(\Xi):\,\int_\Xi v\,d\nu=0}}\ \Big\{\lambda\theta^p + \int_\Xi L^*(\lambda\Phi_v(\xi))\,A(d\xi)\Big\}\\
&=: \inf_{\substack{\lambda\ge 0\\ v\in C_b(\Xi):\,\int_\Xi v\,d\nu=0}}\ h_v(\lambda)\\
&=: v_D.
\end{aligned}
\]

Step 2. Show the existence of the dual minimizer. Let $(\lambda_n,v_n)_n$ be a sequence approaching $v_D > -\infty$. Since the $v_n$ are continuous and $\mathrm{supp}\,A$ is compact, by the Arzelà–Ascoli theorem there exists a subsequence converging to some $v^*$ on $\mathrm{supp}\,A$, and $h_{v^*}(\lambda)\le\lim_{n\to\infty}h_{v_n}(\lambda)$ for all $\lambda\ge 0$. Also, since $\Phi_v(\xi)\ge 0$ for all $\xi$ such that $\lambda\Phi_v(\xi)\in\mathrm{dom}\,L^*$, the function $\lambda\mapsto L^*(\lambda\Phi_{v^*}(\xi))$ is non-increasing and convex. Hence $h_{v^*}(\lambda)$ is the sum of an increasing linear function and a non-increasing convex function and goes to infinity as $\lambda\to\infty$, so $h_{v^*}(\lambda)$ attains its minimum at some $\lambda^*\in[0,\infty)$. This proves the existence of a dual minimizer $(\lambda^*,v^*)$.

Step 3. Use first-order optimality to construct a primal solution.


We take the derivative and obtain the first-order optimality conditions at $\lambda^*$:
\[
\theta^p + \int_\Xi\partial_{\lambda-}L^*(\lambda^*\Phi_{v^*}(\xi))\,\Phi_{v^*}(\xi)\,A(d\xi) \le 0,\qquad
\theta^p + \int_\Xi\partial_{\lambda+}L^*(\lambda^*\Phi_{v^*}(\xi))\,\Phi_{v^*}(\xi)\,A(d\xi) \ge 0. \tag{67}
\]
We define
\[
f^*(\xi) := -\big[p^*\,\partial_{\lambda-}L^*(\lambda^*\Phi_{v^*}(\xi)) + (1-p^*)\,\partial_{\lambda+}L^*(\lambda^*\Phi_{v^*}(\xi))\big],\qquad \forall\xi\in\mathrm{supp}\,A, \tag{68}
\]
where
\[
p^* := \frac{\theta^p + \int_\Xi\partial_{\lambda+}L^*(\lambda^*\Phi_{v^*}(\xi))\,\Phi_{v^*}(\xi)\,A(d\xi)}{\int_\Xi\partial_{\lambda+}L^*(\lambda^*\Phi_{v^*}(\xi))\,\Phi_{v^*}(\xi)\,A(d\xi) - \int_\Xi\partial_{\lambda-}L^*(\lambda^*\Phi_{v^*}(\xi))\,\Phi_{v^*}(\xi)\,A(d\xi)},
\]
provided that the denominator is nonzero; otherwise we set $p^* = 1$. By the definition of $L^*$, $f^*$ is nonnegative; since $L^*$ is convex, $f^*$ is measurable.

Step 4. Verify the feasibility and optimality. We verify that $f^*$ is primal optimal. From the concavity of $L$ and using (67), we have
\[
\begin{aligned}
v_P &\ge \int_\Xi L(f^*(\xi))\,A(d\xi)\\
&\ge p^*\int_\Xi L\big(-\partial_{\lambda-}L^*(\lambda^*\Phi_{v^*}(\xi))\big)\,A(d\xi) + (1-p^*)\int_\Xi L\big(-\partial_{\lambda+}L^*(\lambda^*\Phi_{v^*}(\xi))\big)\,A(d\xi)\\
&= \lambda^*\theta^p + \int_\Xi L^*(\lambda^*\Phi_{v^*}(\xi))\,A(d\xi)\\
&\ge v_D,
\end{aligned}
\]
where the equality uses $L(-\partial L^*(a)) = L^*(a) - a\,\partial L^*(a)$ together with the definition of $p^*$,

and that
\[
\begin{aligned}
W_p^p(f^*,\nu) &= \max_{u,v\in C_b(\Xi):\,\int_\Xi v\,d\nu=0}\Big\{\int_\Xi u f^*\,dA : u(\xi)+v(\zeta)\le d^p(\xi,\zeta),\ \forall\xi,\zeta\in\Xi\Big\}\\
&= \max_{u\in C_b(\Xi)}\Big\{\int_\Xi u f^*\,dA : u(\xi)\le\Phi_{v^*}(\xi),\ \forall\xi\in\Xi\Big\}\\
&= \int_\Xi\Phi_{v^*}(\xi)\,f^*(\xi)\,A(d\xi) = \theta^p.
\end{aligned}
\]
Hence we conclude that there exists a worst-case distribution of the form (68). In particular, when $L(\cdot) = \sqrt{\cdot}$, we have $\partial L^*(a) = -\frac{1}{4a^2}$, $f^*(\xi) = \frac{1}{4\lambda^{*2}\Phi_{v^*}(\xi)^2}$, and $\lambda^* = \sqrt{\int_\Xi\frac{1}{4\theta^p\,\Phi_{v^*}(\xi)}\,A(d\xi)}$. We remark that we obtain a slightly more compact form than that in Carlsson et al. [12].

5. Conclusions In this paper, we developed a constructive proof method to derive the dual reformulation of distributionally robust stochastic programming with the Wasserstein distance in a general setting. This approach allows us to obtain a precise structural description of the worst-case distribution, which connects distributionally robust stochastic programming to classical robust programming. Based on our results, we obtained a number of theoretical and computational implications. As future work, extensions to multi-stage distributionally robust stochastic programming will be explored.


Appendix A: Technical Lemmas and Proofs
The technical lemmas and proofs are presented in the order in which they appear in the paper.

Proof of Lemma 1. Let $\gamma_0$ be the minimizer of (8). It follows that
\[
E_\mu[\Psi(\xi)] - E_\nu[\Psi(\xi)] \le \int_{\Xi\times\Xi}|\Psi(\xi)-\Psi(\zeta)|\,\gamma_0(d\xi,d\zeta) \le \int_{\Xi\times\Xi}L(1+d^p(\xi,\zeta))\,\gamma_0(d\xi,d\zeta) = L(1+W_p^p(\mu,\nu)) \le L(1+\delta).\qquad\square
\]

Lemma 5. Let $\kappa := \limsup_{d(\xi,\zeta_0)\to\infty}\frac{\Psi_x(\xi)-\Psi_x(\zeta_0)}{d^p(\xi,\zeta_0)+1}$ for some $\zeta_0\in\Xi$. Then for any $\zeta\in\Xi$, it holds that
\[
\limsup_{d(\xi,\zeta)\to\infty}\frac{\Psi_x(\xi)-\Psi_x(\zeta)}{d^p(\xi,\zeta)+1} = \kappa.
\]

Proof of Lemma 5. Suppose that for some $\zeta\in\Xi$, $\kappa' := \limsup_{d(\xi,\zeta)\to\infty}\frac{\Psi_x(\xi)-\Psi_x(\zeta)}{d^p(\xi,\zeta)+1} < \kappa$. Then for any $\epsilon\in(0,\kappa-\kappa')$, there exists $R > 0$ such that for any $\xi$ with $d(\xi,\zeta)\ge R$,
\[
\Psi_x(\xi)-\Psi_x(\zeta_0) = \Psi_x(\xi)-\Psi_x(\zeta) + \Psi_x(\zeta)-\Psi_x(\zeta_0) < (\kappa'+\epsilon)(d^p(\xi,\zeta)+1) + [\Psi_x(\zeta)-\Psi_x(\zeta_0)].
\]
It follows that
\[
\limsup_{d(\xi,\zeta_0)\to\infty}\frac{\Psi_x(\xi)-\Psi_x(\zeta_0)}{d^p(\xi,\zeta_0)+1} \le \limsup_{d(\xi,\zeta)\to\infty}\frac{(\kappa'+\epsilon)(d^p(\xi,\zeta)+1) + [\Psi_x(\zeta)-\Psi_x(\zeta_0)]}{d^p(\xi,\zeta)-d^p(\zeta_0,\zeta)+1} = \kappa'+\epsilon < \kappa,
\]
which is a contradiction. The result is then obtained by switching the roles of $\zeta$ and $\zeta_0$. We note that the above proof also holds for $\kappa = \infty$. $\square$

Proof of Proposition 1. Choose any compact set $E\subset\Xi$ with $\nu(E) > 0$ and let $\bar\xi\in\arg\max_{\xi\in E}\Psi_x(\xi)$. By Lemma 5 it holds that
\[
\limsup_{d(\xi,\bar\xi)\to\infty}\frac{\Psi_x(\xi)-\Psi_x(\bar\xi)}{d^p(\xi,\bar\xi)+1} = \infty.
\]
Thus for any $M, R > 0$, there exists $\zeta_M$ such that $d(\zeta_M,\bar\xi) > R$ and
\[
\Psi_x(\zeta_M)-\Psi_x(\bar\xi) > M\big(d^p(\zeta_M,\bar\xi) + \mathrm{diam}(E) + 1\big),
\]
where $\mathrm{diam}(E) := \max_{\xi,\xi'\in E}d(\xi,\xi')$. It follows that
\[
\Psi_x(\zeta_M)-\Psi_x(\xi) > M\big(d^p(\zeta_M,\xi)+1\big),\qquad \forall\xi\in E.
\]
Thereby we define a map $T_M : \Xi\to\Xi$ by
\[
T_M(\zeta) := \zeta_M 1_{\{\zeta\in E\}} + \zeta 1_{\{\zeta\in\Xi\setminus E\}},\qquad \forall\zeta\in\Xi,
\]


and let
\[
\mu_M := p_M\,{T_M}_\#\nu + (1-p_M)\,\nu,\qquad
p_M := \frac{\theta^p}{\int_E d^p(\zeta_M,\zeta)\,\nu(d\zeta)}.
\]
Note that we can always make $p_M\le 1$ by choosing $R$ sufficiently large. Then $\mu_M$ is primal feasible, and
\[
v_P - \int_\Xi\Psi_x(\xi)\,\nu(d\xi) \ge p_M\int_E[\Psi_x(T_M(\zeta))-\Psi_x(\zeta)]\,\nu(d\zeta) > p_M M\int_E d^p(T_M(\zeta),\zeta)\,\nu(d\zeta) = M\theta^p,
\]
which goes to $\infty$ as $M\to\infty$; hence $v_P = \infty$. On the other hand, for any $\lambda\ge 0$ and $\zeta\in E$, $\Phi(\lambda,\zeta) = -\infty$, whence $v_D = \infty$. Therefore $v_P = v_D$. $\square$

Figure 4. Illustration of the duality $v_P = v_D = \infty$ when $\kappa = \infty$.

Proof of Lemma 2. (i) For any $\zeta\in\Xi$ and $\epsilon > 0$, there exists $R > 0$ such that for any $\xi\in\Xi$ with $d(\xi,\zeta) > R$, it holds that $\Psi_x(\xi)-\Psi_x(\zeta)\le(\kappa+\epsilon)(d^p(\xi,\zeta)+1)$. Let $\epsilon < \lambda-\kappa$; it follows that $\lambda d^p(\xi,\zeta)-\Psi_x(\xi) > -\Psi_x(\zeta)\ge\Phi(\lambda,\zeta)$ for all $\xi$ with $d(\xi,\zeta) > R$. Also note that $-\Psi_x$ is lower semi-continuous; thus the infimum on the right-hand side of (15) is attained in $\{\xi : d(\xi,\zeta)\le R\}$, and so $\Phi(\lambda,\zeta)$ is finite-valued.
(ii) From (i) we know that $\Phi(\lambda,\zeta)$ is finite-valued. Since $d^p(\xi,\zeta)$ is non-negative, for each fixed $\zeta\in\Xi$, $\Phi(\lambda,\zeta)$ is the infimum of non-decreasing affine functions of $\lambda$, and therefore it is non-decreasing and concave in $\lambda$. To show its continuity with respect to $\zeta$, observe that $|\Phi(\lambda,\zeta)-\Phi(\lambda,\zeta')|\le\lambda\sup_{\xi\in\Xi}|d^p(\xi,\zeta)-d^p(\xi,\zeta')|\le\lambda d^p(\zeta,\zeta')$, so we conclude that $\Phi(\lambda,\zeta)$ is continuous in $\zeta$.
(iii) Fix any $\lambda > \kappa$. By (i), $\arg\min_{\xi\in\Xi}\{\lambda d^p(\xi,\zeta)-\Psi_x(\xi)\}$ is non-empty and compact, thus the multi-valued mappings
\[
\phi(\lambda,\zeta) := \arg\min_{\xi\in\Xi}\{\lambda d^p(\xi,\zeta)-\Psi_x(\xi)\}
\]
and
\[
\hat T_-(\lambda,\zeta) := \arg\max_\xi\Big\{d(\xi,\zeta) : \xi\in\arg\min_{\xi\in\Xi}\{\lambda d^p(\xi,\zeta)-\Psi_x(\xi)\}\Big\}
\]
are well-defined. We claim that $\hat T_-(\lambda,\zeta)$ is closed-valued. Indeed, let $\{\xi_k\}_k\subset\hat T_-(\lambda,\zeta)$ converge to $\xi_0$; from the continuity of $d$ and the lower semi-continuity of $-\Psi_x$, it follows that $\xi_0$ minimizes $\lambda d^p(\xi,\zeta)-\Psi_x(\xi)$ and $\xi_0$ also has the largest distance to $\zeta$ among the minimizers. We further claim that $\hat T_-(\lambda,\zeta)$ is Borel measurable. Indeed, define the mapping $f:\Xi\times\Xi\to\mathbb{R}$ by $f(\zeta,\xi) := \lambda d^p(\xi,\zeta)-\Psi_x(\xi)$. Since $\Psi_x(\xi)$ is upper semi-continuous, the epigraphical mapping $\mathrm{epi}\,f(\zeta,\cdot) := \{(\xi,\alpha) : f(\zeta,\xi)\le\alpha\}$ is closed-valued and Borel measurable, that is, $f$ is a normal integrand (cf. Castaing and Valadier [13]). Hence the mappings $\Phi(\lambda,\cdot)$ and $\phi(\lambda,\cdot)$ are Borel measurable (cf. Theorem 14.37 in Rockafellar and Wets [39] and Theorem 3.5 in Himmelberg et al. [25]). Thereby the mapping $\bar d:\Xi\to\mathbb{R}$ defined by $\bar d(\zeta) :=$


$\max_\xi\{d(\xi,\zeta) : \xi\in\phi(\lambda,\zeta)\}$ is also Borel measurable. For any closed set $A\subset\Xi$, define the closed-valued mapping $D:\zeta\mapsto A\times\{\beta\in\mathbb{R} : \beta\le\bar d(\zeta)\}$, and note that $(\hat T_-(\lambda,\cdot))^{-1}(A) = \{\zeta : \mathrm{epi}\,\bar d(\zeta,\cdot)\cap D(\zeta)\neq\emptyset\}$; from Theorem 4.2(ii) of Wagner [49] we finally conclude that $\hat T_-(\lambda,\zeta)$ is Borel measurable. Consequently, by the Kuratowski–Ryll-Nardzewski measurable selection theorem (cf. Theorem 4.1 in Wagner [49]), $\hat T_-(\lambda,\zeta)$ has a Borel measurable selection, denoted $T_-(\lambda,\zeta)$.

Now we show that $d(T_-(\lambda,\zeta),\zeta)$ is left continuous in $\lambda$. Let $\{\lambda_n\}_n$ be a sequence converging to $\lambda$ with $\lambda_n\in((\lambda+\kappa)/2,\lambda)$ for all $n$. We claim that $T_-(\lambda_n,\zeta)\in\{\xi\in\Xi : d(T_-(\lambda,\zeta),\zeta)\le d(\xi,\zeta)\le R_\lambda\}$ for all $n$. In fact, by (ii), $\Phi(\lambda',\zeta)$ has an upper bound on the compact set $[(\lambda+\kappa)/2,\lambda]\times\mathrm{supp}\,\nu$. Then, using the same argument as in (i), there exists $R_\lambda > 0$ such that for any $\xi\in\Xi$ with $d(\xi,\zeta) > R_\lambda$,
\[
\lambda' d^p(\xi,\zeta)-\Psi_x(\xi)\ge\Phi(\lambda',\zeta),\qquad \forall(\lambda',\zeta)\in[(\lambda+\kappa)/2,\lambda]\times\mathrm{supp}\,\nu. \tag{69}
\]
Hence $d(T_-(\lambda_n,\zeta),\zeta)\le R_\lambda$ for all $n$. On the other hand, for all $\xi\in\Xi$ with $d(\xi,\zeta) < d(T_-(\lambda,\zeta),\zeta)$,
\[
\Psi_x(T_-(\lambda,\zeta))-\Psi_x(\xi)\ge\lambda[d^p(T_-(\lambda,\zeta),\zeta)-d^p(\xi,\zeta)] > \lambda_n[d^p(T_-(\lambda,\zeta),\zeta)-d^p(\xi,\zeta)],
\]
that is, $\lambda_n d^p(T_-(\lambda,\zeta),\zeta)-\Psi_x(T_-(\lambda,\zeta)) < \lambda_n d^p(\xi,\zeta)-\Psi_x(\xi)$ for all $\xi$ with $d(\xi,\zeta) < d(T_-(\lambda,\zeta),\zeta)$. Therefore the claim holds.

We then show that $d(T_-(\lambda_n,\zeta),\zeta)$ converges to $d(T_-(\lambda,\zeta),\zeta)$ as $n\to\infty$. Suppose on the contrary that there exist $\delta > 0$ and a subsequence $\{n_k\}_k$ such that $d(T_-(\lambda_{n_k},\zeta),\zeta)\ge d(T_-(\lambda,\zeta),\zeta)+\delta$ for all $k$. By the definition of $T_-(\lambda_{n_k},\zeta)$,
\[
\lambda_{n_k}d^p(T_-(\lambda_{n_k},\zeta),\zeta)-\Psi_x(T_-(\lambda_{n_k},\zeta))\le\lambda_{n_k}d^p(T_-(\lambda,\zeta),\zeta)-\Psi_x(T_-(\lambda,\zeta)).
\]
By (69), and possibly passing to a further subsequence, we can assume $T_-(\lambda_{n_k},\zeta)$ converges to some $\xi_0\in\Xi$. Letting $k\to\infty$ and using the lower semi-continuity of $-\Psi_x$,
\[
\lambda d^p(\xi_0,\zeta)-\Psi_x(\xi_0)\le\liminf_{k\to\infty}\big[\lambda_{n_k}d^p(T_-(\lambda_{n_k},\zeta),\zeta)-\Psi_x(T_-(\lambda_{n_k},\zeta))\big]\le\lambda d^p(T_-(\lambda,\zeta),\zeta)-\Psi_x(T_-(\lambda,\zeta)),
\]
so $\xi_0$ is a minimizer of $\inf_{\xi\in\Xi}\{\lambda d^p(\xi,\zeta)-\Psi_x(\xi)\}$ with $d(\xi_0,\zeta)\ge d(T_-(\lambda,\zeta),\zeta)+\delta$, which contradicts the fact that $T_-(\lambda,\zeta)$ is a minimizer with the largest distance to $\zeta$. Therefore $d(T_-(\lambda_n,\zeta),\zeta)\to d(T_-(\lambda,\zeta),\zeta)$, that is, $d(T_-(\lambda,\zeta),\zeta)$ is left continuous in $\lambda$. Moreover, since $\Psi_x(T_-(\lambda,\zeta)) = \lambda d^p(T_-(\lambda,\zeta),\zeta)-\Phi(\lambda,\zeta)$ and $\Phi(\lambda,\zeta)$ is continuous in $\lambda$, we conclude that $\Psi_x(T_-(\lambda,\zeta))$ is left continuous in $\lambda$. A similar argument establishes the second part of (iii).

(iv) For any $0 < \epsilon < \lambda-\kappa$, by definition we have
\[
\frac{\Phi(\lambda-\epsilon,\zeta)}{\lambda-\epsilon} + \frac{\Psi_x(T_-(\lambda-\epsilon,\zeta))}{\lambda-\epsilon} \ge d^p(T_-(\lambda,\zeta),\zeta) = \frac{\Phi(\lambda,\zeta)}{\lambda} + \frac{\Psi_x(T_-(\lambda,\zeta))}{\lambda},
\]
thereby
\[
\Psi_x(T_-(\lambda-\epsilon,\zeta))\Big(\frac{1}{\lambda}-\frac{1}{\lambda-\epsilon}\Big) \le \frac{\Phi(\lambda-\epsilon,\zeta)}{\lambda-\epsilon} - \frac{\Phi(\lambda,\zeta)}{\lambda},
\]
and similarly,
\[
\frac{\Phi(\lambda-\epsilon,\zeta)}{\lambda-\epsilon} - \frac{\Phi(\lambda,\zeta)}{\lambda} \le \Psi_x(T_-(\lambda,\zeta))\Big(\frac{1}{\lambda}-\frac{1}{\lambda-\epsilon}\Big).
\]


Dividing both sides of the above two inequalities by $\epsilon$ and letting $\epsilon\to 0$, from the left continuity of $T_-(\cdot,\zeta)$ it follows that
\[
\partial_{\lambda-}\Big(\frac{\Phi(\lambda,\zeta)}{\lambda}\Big) = \frac{\Psi_x(T_-(\lambda,\zeta))}{\lambda^2}.
\]
The expression for $\partial_{\lambda+}\Phi(\lambda,\zeta)$ can be obtained similarly.
(v) For any $\epsilon > 0$, there exists $\xi_\epsilon$ such that $\kappa d^p(\xi_\epsilon,\zeta)-\Psi_x(\xi_\epsilon) < \Phi(\kappa,\zeta)+\epsilon/2$, and so for any $0 < \lambda-\kappa < \frac{\epsilon}{2d^p(\xi_\epsilon,\zeta)}$,
\[
0 \le \Phi(\lambda,\zeta)-\Phi(\kappa,\zeta) < (\lambda-\kappa)\,d^p(\xi_\epsilon,\zeta)+\epsilon/2 < \epsilon.
\]
It follows that $\lim_{\lambda\to\kappa+}\Phi(\lambda,\zeta) = \Phi(\kappa,\zeta)$. In particular, when $\kappa = 0$, $\Phi(0,\zeta) = -\sup_{\xi\in\Xi}\Psi_x(\xi)$.
(vi) By definition we have $\lambda d^p(T_+(\lambda,\zeta),\zeta)-\Psi_x(T_+(\lambda,\zeta))\le-\Psi_x(\zeta)$, or $\Psi_x(T_+(\lambda,\zeta))-\Psi_x(\zeta)\ge\lambda d^p(T_+(\lambda,\zeta),\zeta)$. On the other hand, since $\kappa < \infty$, for any $\epsilon > 0$ there exists $R_\epsilon > 1$ such that for any $\zeta,\zeta'\in\Xi$ with $d^p(\zeta,\zeta') > R_\epsilon$, it holds that $|\Psi_x(\zeta)-\Psi_x(\zeta')| < (\kappa+\epsilon)\,d^p(\zeta,\zeta')$. In particular, taking $\epsilon := \lambda-\kappa$, we have $\Psi_x(\zeta') < \Psi_x(\zeta)+\lambda d^p(\zeta',\zeta)$ for all $\zeta'$ with $d^p(\zeta',\zeta) > R_{\lambda-\kappa}$; therefore $d^p(T_+(\lambda,\zeta),\zeta)\le D_\lambda := R_{\lambda-\kappa}$. $\square$

Proof of Corollary 3.

Note that $x$ is feasible if and only if
\[
t \ge \max_{u\in U}\Big\{\frac{1}{N}\sum_{i=1}^N(a_i[u]^\top x + b_i[u]) : \begin{bmatrix}a_i[u]\\ b_i[u]\end{bmatrix} = \begin{bmatrix}\hat a_i\\ \hat b_i\end{bmatrix}+u_i\Big\}
= \frac{1}{N}\sum_{i=1}^N(\hat a_i^\top x + \hat b_i) + \frac{1}{N}\max_{u\in U}\sum_{i=1}^N u_i^\top x.
\]
Then by Hölder's inequality it follows that
\[
\frac{1}{N}\max_{u\in U}\sum_{i=1}^N u_i^\top x \le \frac{1}{N}\max_{u\in U}\sum_{i=1}^N\|u_i\|_q\,\|x\|_{q^*} \le \theta\|x\|_{q^*},
\]
where $1/q+1/q^* = 1$, and equality holds at, for example, $u_i = \theta y$, $i=1,\dots,N$, where $y$ satisfies $y^\top x = \|x\|_{q^*}$ and $\|y\|_q = 1$. Hence $x$ is feasible if and only if
\[
\frac{1}{N}\sum_{i=1}^N(\hat a_i^\top x + \hat b_i) + \theta\|x\|_{q^*} \le t.\qquad\square
\]
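The equality case in the Hölder step above can be checked numerically. The following sketch fixes $q = 2$ (self-dual) and $p = 1$ and scans perturbation directions on the budget boundary; all data values are hypothetical.

```python
import math

def robust_value(x, a_hat, b_hat, theta):
    """Corollary 3 with q = 2, p = 1: the worst-case average affine objective
    equals the empirical average plus theta * ||x||_2 (the 2-norm is self-dual)."""
    n = len(a_hat)
    avg = sum(sum(ai * xi for ai, xi in zip(a, x)) + b
              for a, b in zip(a_hat, b_hat)) / n
    return avg + theta * math.sqrt(sum(xi * xi for xi in x))

# hypothetical instance; the same perturbation u is applied to every a_i,
# which keeps (1/N) sum ||u_i||_2 = theta on the budget boundary
x = [1.0, -2.0]
a_hat = [[0.5, 0.3], [-0.1, 0.2]]
b_hat = [1.0, 2.0]
theta = 0.4
rv = robust_value(x, a_hat, b_hat, theta)

def perturbed(u):
    n = len(a_hat)
    return sum(sum((ai + ui) * xi for ai, ui, xi in zip(a, u, x)) + b
               for a, b in zip(a_hat, b_hat)) / n

# scan directions u = theta * (cos t, sin t): no direction beats rv, and the
# direction parallel to x attains it (up to grid resolution)
worst = max(perturbed([theta * math.cos(t), theta * math.sin(t)])
            for t in [k * 0.001 for k in range(6284)])
```

The scan confirms both the upper bound and its tightness, matching the closed form of Corollary 3.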

Lemma 6. Suppose (50) holds and let the constants $L, M$ be defined in (51). Then the vector field defined in (52) satisfies $\|F(z)-F(z')\|_{Z,*}\le L\|z-z'\|_Z + M$ for all $z,z'\in Z$.

Proof of Lemma 6. The proof is a simple exercise in calculus. Using condition (50), we obtain the bounds
\[
\|F_1(x,y)-F_1(x',y)\|_{X,*} \le \frac{1}{N}\sum_{i=1}^N\|\partial_x\Psi(x,y^i+\hat\xi^i)-\partial_x\Psi(x',y^i+\hat\xi^i)\|_{X,*} \le L_{11}\|x-x'\|_X + M_{11},
\]

\[
\|F_1(x',y)-F_1(x',y')\|_{X,*} \le \frac{1}{N}\sum_{i=1}^N\|\partial_x\Psi(x',y^i+\hat\xi^i)-\partial_x\Psi(x',y'^i+\hat\xi^i)\|_{X,*}
\]

0

0

N 1 X i L12 ||y i − y 0 ||Ξ + M12 N i=1 !1/p N L12 X i i ≤ ||y − y 0 ||pΞ N 1/q + M12 N i=1



≤ L12 ||y − y 0 ||Y + M12 , ||F2 (x, y) − F2 (x0 , y)||Y,∗ = N −1 ≤ N −1

N X

!1/q ||∂ξ Ψ(x, y i + ξbi ) − ∂ξ Ψ(x0 , y i + ξbi )||qΞ

i=1 N X

[L21 ||x − x0 ||X + M21 ]

i=1

= L21 ||x − x0 ||X + M21 , 0

0

0

||F2 (x , y) − F2 (x , y )||Y,∗ = N

−1

N X

!1/q 0

i

0

bi

0i

bi

||∂ξ Ψ(x , y + ξ ) − ∂ξ Ψ(x , y + ξ

)||qΞ

i=1

≤N

−1

N X i [L22 ||y i − y 0 ||Ξ + M22 ]q

!1/q

i=1

≤N ≤N

−1

N X

L22

!1/q ||y

i

i=1 max(1/q−1/p,0)−1

i − y 0 ||qΞ

+ M22

L22 ||y − y 0 ||Y + M22 .

Combining the above inequalities we have q ||F (z) − F (z 0 )||Z,∗ = ||F1 (z) − F1 (z 0 )||2X,∗ + ||F2 (z) − F2 (z 0 )||2Y,∗ ≤ 2 max{||F1 (z) − F1 (z 0 )||X,∗ , ||F2 (z) − F2 (z 0 )||Y,∗ } ≤ 4 max{L11 ||x − x0 ||X + M11 , L12 ||y − y 0 ||Y + M12 ,

L21 ||x − x0 ||X + M21 , N max(1/q−1/p,0)−1 L22 ||y − y 0 ||Y + M22 }.  Proof of Proposition 5. According to Theorem 4.1 in Nemirovski [33], the results follows if we can provide an upper bound on ΘY . Let h = (h1 , . . . , hN ) ∈ ΞN and m ≤ 2. The first-order directional derivative of ωY at y along h is given by γDωY (y)[h] =

N X

i ||ξ i ||m−2 Ξi hξ , hi i.

i=1

The second-order directional derivative of ωY at y along [h, h] is given by 2

γD ωY (y)[h, h] =

N X

2 ||ξ i ||m−2 Ξi ||hi ||Ξi

+ (m − 2)

i=1

≥ (m − 1)

N X i=1

N X i=1

||ξ i ||Ξm−2 ||hi ||2Ξi , i

||ξ i ||Ξm−4 hξ i , hi i2 i


where the inequality follows from Cauchy-Schwarz inequality and the condition m ≤ 2. On the other hand, ||h||2Y =

N X

!2/p ||hi ||pΞi

N X



i=1

!2 ||hi ||Ξi

N X



i=1

! 2 ||ξ ||m−2 Ξi ||hi ||Ξi

i=1

N X

! ||ξ i ||2−m , Ξi

i=1

where the first inequality follows from p ≥ 1 and the second inequality follows from Cauchy-Schwarz inequality. Combining the above two inequalities yields ! N X γ ||h||2Y ≤ D2 ωY (y)[h, h] ||ξ i ||2−m . (70) Ξi m−1 i=1 PN PN Noting that i=1 ||ξ i ||pΞi ≤ N θp , set ti = ||ξ i ||pΞi , whence i=1 ti ≤ N θp . It follows from H¨older’s inequality that N X m max(m/p,1) ||ξ i ||m . (71) Ξi ≤ θ N i=1

PN

≤ θ2−m N Hence i=1 ||ξ i ||2−m Ξi 1 with respect to || · ||Y if

max( 2−m p ,1)

. Thus in view of (70), ωY is strongly convex with modulus

2−m γ θ2−m N max( p ,1) = 1. m−1

(72)

Now let us compute an upper bound of ΘY . According to (72), we have N N X 2−m 1 1 X i m ||ξ ||Ξi ≤ θ2−m N max( p ,1) ||ξ i ||m ΘY = max ωY (y) − min ωY (y) ≤ Ξi y∈Y y∈Y mγ i=1 m(m − 1) i=1

! .

Using (71) we obtain that ΘY ≤

2−m m θ2 N max( p ,1)+max( p ,1) . m(m − 1) 2

θ When p > 1, choosing m = min(2, p), we have ΘY ≤ m(m−1) N 2 ; when p = 1, choosing m = 1 + ln1N , 2 2 we have ΘY ≤ eθ N ln N . It can be easily checked that γ defined in (53) satisfies condition (72). 

Lemma 7. Let $C$ be a Borel set with nonempty boundary $\partial C$. Then for any $\epsilon > 0$, there exists a Borel map $T:\partial C\to\Xi\setminus\mathrm{cl}(C)$ such that $d(\xi,T(\xi)) < \epsilon$ for all $\xi\in\partial C$.

Proof of Lemma 7. Since $\Xi$ is separable, $\partial C$ has a dense subset $\{\xi^i\}_{i=1}^\infty$. For each $\xi^i$, there exists $\xi'^i\in\Xi\setminus\mathrm{cl}(C)$ such that $\epsilon_i := 2d(\xi^i,\xi'^i) < \epsilon$. Thus $\partial C\subset\bigcup_{i=1}^\infty B_{\epsilon_i}(\xi^i)$, where $B_{\epsilon_i}(\xi^i)$ is the open ball centered at $\xi^i$ with radius $\epsilon_i$. Define

\[
i^*(\xi) := \min\{i\ge 1 : \xi\in B_{\epsilon_i}(\xi^i)\},\qquad \xi\in\partial C,
\]
and
\[
T(\xi) := \xi'^{\,i^*(\xi)},\qquad \xi\in\partial C.
\]
Then $T$ satisfies the requirements of the lemma. $\square$


Proof of Lemma 3. Since $-1\{\xi\in\mathrm{int}(C)\}$ is upper semi-continuous and binary-valued, Theorem 1 ensures the existence of a worst-case distribution for $\min_{\mu\in M}\mu(\mathrm{int}(C))$. Thus it suffices to show that for any $\epsilon > 0$, there exists $\mu\in M$ such that $\mu(C)\le\min_{\mu\in M}\mu(\mathrm{int}(C))+\epsilon$. Note that
\[
\begin{aligned}
\min_{\mu\in M}\mu(\mathrm{int}(C)) &= \min_\gamma\Big\{\int_{\Xi\times\mathrm{int}(C)}\gamma(d\xi,d\zeta) : \pi^1_\#\gamma = \nu,\ \int_{\Xi^2}d^p(\xi,\zeta)\,\gamma(d\xi,d\zeta)\le\theta^p\Big\}\\
&= \nu(\mathrm{int}(C)) - \max_\gamma\Big\{\int_{\mathrm{int}(C)\times(\Xi\setminus\mathrm{int}(C))}\gamma(d\xi,d\zeta) : \pi^1_\#\gamma = \nu,\ \int_{\Xi^2}d^p(\xi,\zeta)\,\gamma(d\xi,d\zeta)\le\theta^p\Big\}.
\end{aligned}
\]

The second term suggests to find a transportation plan which transports mass from distribution ν such that the total cost is no greater than θp and that the total mass from int(C) to its complement Ξ \ int(C) is maximized. Since it is unnecessary to transport mass from int(C) further once it hits ∂C, nor to transport mass from int(C) to int(C) \ supp ν, nor to transport mass in Ξ \ int(C), therefore, there exists an optimal transportation plan γ0 such that   supp γ0 ⊂ supp ν × supp ν ∪ (supp ν ∩ int(C)) × ∂C . 2 We set µ0 := π# γ0 . If µ0 (∂C) = 0, there is nothing to show, so we assume that µ0 (∂C) > 0. We first consider the case ν(int(C)) = 0 (and thus µ0 can be chosen to be ν). By Lemma 7, we can define a Borel map T which maps each ξ ∈ ∂C to some ξ 0 ∈ Ξ \ cl(C) with d(ξ, ξ 0 ) <  ∈ (0, θ) and is an identity map elsewhere. We further define a distribution µ by

µ (A) := µ0 (A \ ∂C) + µ0 {ξ ∈ ∂C : T (ξ) ∈ A}, for all Borel set A ⊂ Ξ. Then Wp (µ , µ0 ) ≤  < θ and µ (C) = µ0 (int(C)). Now let us consider ν(int(C)) > 0. For any  ∈ (0, θ), we define a distribution µ0 by   µ0 (A) :=µ0 (A ∩ int(C)) + µ0 {(A ∩ int(C)) × ∂C } + ν(A ∩ ∂C) θ   + 1 − µ0 {ξ ∈ ∂C : T (ξ) ∈ A} + µ0 (A \ cl(C)), θ for all Borel set A ⊂ Ξ. Then  µ0 (C) = µ0 (int(C)) + [µ0 (∂C) − ν(∂C) + ν(∂C)] + 0 + 0 θ  ≤ µ0 (int(C)) + . θ R Note that Wpp (µ0 , ν) = int(C)×∂C dp (ξ, ζ)γ0 (dξ, dζ), it follows that Z Z    0 p Wp (µ , ν) ≤ d (ξ, ζ)γ0 (dξ, dζ) − dp (ξ, ζ)γ0 (dξ, dζ) + 1 −  + 0 θ int(C)×∂C θ int(C)×∂C ≤ θ. Hence the proof is completed.


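The boundary-shifting step in the case ν(int(C)) = 0 can be checked numerically: moving the mass that µ_0 places on ∂C a distance less than ε outside cl(C) costs at most ε in Wasserstein distance and replaces µ_0(C) by µ_0(int(C)). A small sketch with C = [0, 1] and illustrative atoms:

```python
# Sanity check of the boundary-shifting step in the proof of Lemma 3
# (case nu(int(C)) = 0); the set C = [0, 1] and the atoms are illustrative.
eps = 1e-2
mu0 = {0.5: 0.4, 1.0: 0.6}           # 0.6 of the mass sits on the boundary of C

def shift_boundary(mu, eps):
    """Move mass on the boundary {0, 1} a distance eps/2 outside cl(C)."""
    out = {}
    for point, mass in mu.items():
        if point == 0.0:
            target = -eps / 2.0
        elif point == 1.0:
            target = 1.0 + eps / 2.0
        else:
            target = point
        out[target] = out.get(target, 0.0) + mass
    return out

mu_eps = shift_boundary(mu0, eps)
# transport cost: each shifted atom moves exactly eps/2
w1 = sum(m * (eps / 2.0) for p, m in mu0.items() if p in (0.0, 1.0))
mass_in_C = sum(m for p, m in mu_eps.items() if 0.0 <= p <= 1.0)
mass_in_intC = sum(m for p, m in mu0.items() if 0.0 < p < 1.0)
assert w1 <= eps                                # W_1(mu_eps, mu0) <= eps
assert abs(mass_in_C - mass_in_intC) < 1e-12    # mu_eps(C) = mu0(int(C))
```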

Proof of Proposition 6. Define C_w := {ξ : −w⊤ξ < q} for all w. According to the structure of the worst-case distribution (Theorem 1(ii)(iii)), there exists a worst-case distribution µ* which attains the infimum inf_{µ∈M} P_µ{−w⊤ξ < q} and has the form (17). Let λ* be the associated dual optimizer. Since 1_{C_w} is binary-valued, by the definition of T_±(λ*, ·) (Lemma 2(iii)) it holds that T_+(λ*, ζ), T_−(λ*, ζ) ∈ {ζ, arg min_{ξ∈Ξ\C_w} d(ξ, ζ)}. That is to say, for ζ ∈ supp ν, the worst-case distribution µ* either transports ζ (possibly with splitting) to its closest point in Ξ \ C_w, or lets it stay at ζ. In addition, we claim that there exists a worst-case distribution with the following property: if ζ is transported to arg min_{ξ∈Ξ\C_w} d(ξ, ζ), then all the points closer to Ξ \ C_w than ζ are also transported. This is simply because any other transportation plan that achieves the same objective has a larger total transportation distance. With this in mind, let γ* be the optimal transport plan between ν and µ*, and let

t* := ess sup_{ζ∈supp ν} { min_{ξ∈Ξ\C_w} d(ξ, ζ) : ζ ≠ T_+(λ*, ζ) }.

So t* is the longest transportation distance among all the points that are transported. (We note that t* = ∞ is allowed in this definition; however, as will be shown, this violates the probability bound.) Then µ* transports all the points in supp ν ∩ {ξ : q − t* < −w⊤ξ < q}, and possibly a fraction β* ∈ [0, 1] of the mass in supp ν ∩ {ξ : −w⊤ξ = q − t*}. Also note that, by Hölder's inequality, the d-distance between the two hyperplanes {ξ : −w⊤ξ = s} and {ξ : −w⊤ξ = s'} equals |s − s'|/‖w‖₁ = |s − s'|. Using the above characterization, for any w define a probability measure ν_w on ℝ by

ν_w{(−∞, s)} := ν{ξ : −w⊤ξ < s}, ∀s ∈ ℝ;

then, using the change of measure, the total cost of transportation can be computed as

∫_{(Ξ\C_w)×C_w} d^p(ξ, ζ)γ*(dξ, dζ) = ∫_{(q−t*)⁺}^{q⁻} (q − s)^p ν_w(ds) + β* ν_w({q − t*}) t*^p ≤ θ^p.  (73)

On the other hand, using the property of marginal distributions and the characterization of γ*,

µ*(C_w) = ∫_{C_w×Ξ} γ*(dξ, dζ) = ν(C_w) − ∫_{(Ξ\C_w)×C_w} γ*(dξ, dζ)
= 1 − ν_w([q, ∞)) − β* ν_w({q − t*}) − ν_w((q − t*, q))
= 1 − ν_w((q − t*, ∞)) − β* ν_w({q − t*}).

Thereby the condition inf_{µ∈M} µ(C_w) ≥ 1 − α is equivalent to

β* ν_w({q − t*}) + ν_w((q − t*, ∞)) ≤ α.  (74)

Now consider the quantity

J := ∫_{(VaR_α[−w⊤ξ])⁺}^{q⁻} (q − s)^p ν_w(ds) + β₀ ν_w({VaR_α[−w⊤ξ]}) (q − VaR_α[−w⊤ξ])^p − θ^p.

If J < 0, due to the monotonicity in t* of the left-hand side of (73), either q − t* < VaR_α[−w⊤ξ], or q − t* = VaR_α[−w⊤ξ] and β* > β₀. But in both cases (74) is violated. On the other hand, if J ≥ 0, again by monotonicity, either q − t* > VaR_α[−w⊤ξ], or q − t* = VaR_α[−w⊤ξ] and β* ≤ β₀, and thus (74) is satisfied. □

Lemma 8. For any Borel measure µ with µ(supp ν) ≠ 1, it holds that

max_{u∈L¹(µ), v∈L¹(ν)} { ∫_Ξ u(ξ)µ(dξ) + ∫_Ξ v(ζ)ν(dζ) : u(ξ) + v(ζ) ≤ d^p(ξ, ζ), ∀ξ, ζ ∈ Ξ } = ∞.

Proof of Lemma 8. Let (u_0, v_0) be any feasible solution to the above maximization problem, and for any t ∈ ℝ define u_t(ξ) := u_0(ξ) + t for ξ ∈ Ξ and v_t(ζ) := v_0(ζ) − t for ζ ∈ supp ν. Then it follows that u_t(ξ) + v_t(ζ) ≤ d^p(ξ, ζ) and

∫_Ξ u_t(ξ)µ(dξ) + ∫_Ξ v_t(ζ)ν(dζ) = ∫_Ξ u_0(ξ)µ(dξ) + ∫_Ξ v_0(ζ)ν(dζ) + t[µ(supp ν) − 1].

Since µ(supp ν) ≠ 1,

sup_{t∈ℝ} { ∫_Ξ u_t(ξ)µ(dξ) + ∫_Ξ v_t(ζ)ν(dζ) } = ∞,

which proves the lemma. □
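The shift argument can be replayed on a two-point space (measures illustrative): adding t to u and subtracting t from v preserves feasibility, while the objective changes by t[µ(supp ν) − 1], hence diverges when µ(supp ν) ≠ 1:

```python
# Discrete check of the shift argument in the proof of Lemma 8 (illustrative
# measures): nu = delta_0 and mu = 0.5 * delta_0, so mu(supp nu) = 0.5 != 1.
points = [0.0, 1.0]
nu = {0.0: 1.0}
mu = {0.0: 0.5}

u0 = {x: 0.0 for x in points}   # (u0, v0) = (0, 0) is feasible: 0 <= d(x, y)**p
v0 = {x: 0.0 for x in points}

def objective(t):
    # u_t = u0 + t and v_t = v0 - t remain feasible, since the shifts cancel:
    # u_t(x) + v_t(y) = u0(x) + v0(y) <= d(x, y)**p
    return sum((u0[x] + t) * m for x, m in mu.items()) + \
           sum((v0[y] - t) * m for y, m in nu.items())

# objective(t) - objective(0) = t * (mu(supp nu) - 1) = -0.5 * t here,
# so the dual objective is unbounded as t -> -infinity
assert objective(-10.0) > objective(0.0)
assert objective(-1000.0) > objective(-10.0)
```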

Proof of Lemma 4. If θ = 0, clearly v* ≤ (1/N) Σ_{i=1}^N m_i. Let X̂ := {ξ̂_t^i : i = 1, …, N, t = 1, …, m_i}. Suppose the elements in X̂ are sorted in increasing order as ξ̂_(1) < … < ξ̂_(M), where M = card(X̂). Then for any ε > 0, let x_{j±} := ξ̂_(j) ± ε/(2M). Then v(Σ_{j=1}^M 1_{[x_{j−}, x_{j+}]}) = −cε + (1/N) Σ_{i=1}^N m_i. Letting ε → 0 we obtain (63). So in the sequel we assume θ > 0. Observe that

inf_{µ∈M} E_{η∼µ}[η(x⁻¹(1))] = inf_{µ∈P(Ξ)} { E_{η∼µ}[η(x⁻¹(1))] : min_{γ∈Γ(µ,ν)} E_γ[d(η, η̂)] ≤ θ }
= inf_{γ∈P(Ξ²)} { E_{(η,η̂)∼γ}[η(x⁻¹(1))] : E_γ[d(η, η̂)] ≤ θ, π²#γ = ν }.  (75)

For any γ ∈ P(Ξ²), denote by γ_η̂ the conditional distribution of θ̄ := d(η, η̂) given η̂, and by γ_{η̂,θ̄} the conditional distribution of η given η̂ and θ̄. Using the tower property of conditional expectation, we have that for any γ ∈ P(Ξ²) with π²#γ = ν,

E_{(η,η̂)∼γ}[η(x⁻¹(1))] = E_{η̂∼ν}[ E_{θ̄∼γ_η̂}[ E_{η∼γ_{η̂,θ̄}}[η(x⁻¹(1))] ] ],

and

E_γ[d(η, η̂)] = E_{η̂∼ν}[ E_{θ̄∼γ_η̂}[θ̄] ].

Note that the right-hand side of the equation above does not depend on γ_{η̂,θ̄}. Thereby (75) can be reformulated as

inf_{µ∈M} E_{η∼µ}[η(x⁻¹(1))]
= inf_{{γ_η̂}_η̂, {γ_{η̂,θ̄}}_{η̂,θ̄}} { E_{η̂∼ν}[ E_{θ̄∼γ_η̂}[ E_{η∼γ_{η̂,θ̄}}[η(x⁻¹(1))] ] ] : E_{η̂∼ν}[E_{θ̄∼γ_η̂}[θ̄]] ≤ θ }
= inf_{{γ_η̂}_η̂} { E_{η̂∼ν}[ E_{θ̄∼γ_η̂}[ inf_{{γ_{η̂,θ̄}}_{η̂,θ̄}} E_{η∼γ_{η̂,θ̄}}[η(x⁻¹(1))] ] ] : E_{η̂∼ν}[E_{θ̄∼γ_η̂}[θ̄]] ≤ θ },  (76)

where the second equality follows from the interchangeability principle (cf. Theorem 14.60 in Rockafellar and Wets [39]). We claim that

inf_{{γ_η̂}_η̂} { E_{η̂∼ν}[ E_{θ̄∼γ_η̂}[ inf_{{γ_{η̂,θ̄}}_{η̂,θ̄}} E_{η∼γ_{η̂,θ̄}}[η(x⁻¹(1))] ] ] : E_{η̂∼ν}[E_{θ̄∼γ_η̂}[θ̄]] ≤ θ }
= inf_{ρ∈P(B([0,1])×Ξ)} { E_{(η̃,η̂)∼ρ}[η̃(int(x⁻¹(1)))] : E_{(η̃,η̂)∼ρ}[W₁(η̃, η̂)] ≤ θ, π²#ρ = ν }.  (77)

Indeed, let ρ be any feasible solution of the right-hand side of (77). We denote by ρ_η̂ the conditional distribution of θ̄ := W₁(η̃, η̂) given η̂, and by ρ_{η̂,θ̄} the conditional distribution of η̃ given η̂ and θ̄. When η̂ = 0 or θ̄ = 0, set γ̄_η̂ = δ_0 and γ̄_{η̂,θ̄} = δ_η̂; that is, we choose γ̄_η̂ and γ̄_{η̂,θ̄} such that η = η̂. When η̂ ≠ 0 and θ̄ > 0, applying Corollary 1 and Lemma 3 to the problem min_{η̃∈B([0,1])} {η̃(x⁻¹(1)) : W₁(η̃, η̂) ≤ θ̄}, we have that for any ε > 0, there exist 1 ≤ i₀ ≤ η̂([0,1]), p_{η̂,θ̄} ∈ [0,1], and

η̃ = Σ_{i=1, i≠i₀}^{η̂([0,1])} δ_{ξ̃_i} + p_{η̂,θ̄} δ_{ξ̃⁺_{i₀}} + (1 − p_{η̂,θ̄}) δ_{ξ̃⁻_{i₀}},

where ξ̃_i ∈ [0,1] for all i ≠ i₀ and ξ̃^±_{i₀} ∈ [0,1], such that η̃(x⁻¹(1)) ≤ ε + min_{η̃∈B([0,1])} {η̃(int(x⁻¹(1))) : W₁(η̃, η̂) ≤ θ̄}. Define

η^±_{η̂,θ̄} := Σ_{i=1, i≠i₀}^{η̂([0,1])} δ_{ξ̃_i} + δ_{ξ̃^±_{i₀}};

it follows that η^±_{η̂,θ̄}([0,1]) = η̂([0,1]), and

p_{η̂,θ̄} η⁺_{η̂,θ̄}(x⁻¹(1)) + (1 − p_{η̂,θ̄}) η⁻_{η̂,θ̄}(x⁻¹(1)) ≤ ε + min_{η̃∈B([0,1])} {η̃(int(x⁻¹(1))) : W₁(η̃, η̂) ≤ θ̄},  (78)

and

p_{η̂,θ̄} W₁(η⁺_{η̂,θ̄}, η̂) + (1 − p_{η̂,θ̄}) W₁(η⁻_{η̂,θ̄}, η̂) ≤ θ̄.  (79)

Define γ̄_η̂ and γ̄_{η̂,θ̄} by

γ̄_η̂(C) := ∫_0^∞ [ p_{η̂,θ̄} 1{W₁(η⁺_{η̂,θ̄}, η̂) ∈ C} + (1 − p_{η̂,θ̄}) 1{W₁(η⁻_{η̂,θ̄}, η̂) ∈ C} ] ρ_η̂(dθ̄), ∀ Borel sets C ⊂ [0, ∞),

and

γ̄_{η̂,θ̄}(A) := ∫_0^∞ ∫_Ξ [ p_{η̂,θ̄} 1{η⁺_{η̂,θ̄} ∈ A, W₁(η⁺_{η̂,θ̄}, η̂) = θ̄} + (1 − p_{η̂,θ̄}) 1{η⁻_{η̂,θ̄} ∈ A, W₁(η⁻_{η̂,θ̄}, η̂) = θ̄} ] ρ_{η̂,θ̄}(dη̃) ρ_η̂(dθ̄), ∀ Borel sets A ⊂ Ξ.

We verify that ({γ̄_η̂}_η̂, {γ̄_{η̂,θ̄}}_{η̂,θ̄}) is a feasible solution to the left-hand side of (77). By condition (59), we have d(η^±_{η̂,θ̄}, η̂) = W₁(η^±_{η̂,θ̄}, η̂); hence (79) implies that p_{η̂,θ̄} d(η⁺_{η̂,θ̄}, η̂) + (1 − p_{η̂,θ̄}) d(η⁻_{η̂,θ̄}, η̂) ≤ θ̄. Then, taking expectations on both sides,

E_{η̂∼ν}[E_{θ̄∼γ̄_η̂}[θ̄]] = ∫_Ξ ∫_0^∞ [ p_{η̂,θ̄} d(η⁺_{η̂,θ̄}, η̂) + (1 − p_{η̂,θ̄}) d(η⁻_{η̂,θ̄}, η̂) ] ρ_η̂(dθ̄) ν(dη̂) ≤ E_{η̂∼ν}[E_{θ̄∼ρ_η̂}[θ̄]] ≤ θ,

hence {γ̄_η̂}_η̂ is feasible. Similarly, taking expectations on both sides of (78), we have that E_{η̂∼ν}[E_{θ̄∼γ̄_η̂}[E_{η∼γ̄_{η̂,θ̄}}[η(x⁻¹(1))]]] ≤ ε + E_ρ[η̃(int(x⁻¹(1)))]. Letting ε → 0, we obtain that

inf_{{γ_η̂}_η̂} { E_{η̂∼ν}[ E_{θ̄∼γ_η̂}[ inf_{{γ_{η̂,θ̄}}_{η̂,θ̄}} E_{η∼γ_{η̂,θ̄}}[η(x⁻¹(1))] ] ] : E_{η̂∼ν}[E_{θ̄∼γ_η̂}[θ̄]] ≤ θ }
≤ inf_{ρ∈P(B([0,1])×Ξ)} { E_{(η̃,η̂)∼ρ}[η̃(int(x⁻¹(1)))] : E_{(η̃,η̂)∼ρ}[W₁(η̃, η̂)] ≤ θ, π²#ρ = ν }.

To show the opposite direction, observe that inf_{γ_{η̂,θ̄}} E_{η∼γ_{η̂,θ̄}}[η(x⁻¹(1))] = inf_{η∈Ξ} {η(x⁻¹(1)) : d(η, η̂) = θ̄}. Hence

inf_{{γ_η̂}_η̂} { E_{η̂∼ν}[ E_{θ̄∼γ_η̂}[ inf_{{γ_{η̂,θ̄}}_{η̂,θ̄}} E_{η∼γ_{η̂,θ̄}}[η(x⁻¹(1))] ] ] : E_{η̂∼ν}[E_{θ̄∼γ_η̂}[θ̄]] ≤ θ }
= inf_{{γ_η̂}_η̂} { E_{η̂∼ν}[ E_{θ̄∼γ_η̂}[ inf_{η∈Ξ} {η(x⁻¹(1)) : d(η, η̂) = θ̄} ] ] : E_{η̂∼ν}[E_{θ̄∼γ_η̂}[θ̄]] ≤ θ }.  (80)

Let ({γ_η̂}_η̂, {η_{η̂,θ̄}}_{η̂,θ̄}) be a feasible solution of the right-hand side of (80). Then the joint distribution ρ̄ ∈ P(B([0,1])×Ξ) defined by

ρ̄(B) := ∫_{π²(B)} ∫_0^∞ 1{η_{η̂,θ̄} ∈ π¹(B)} γ_η̂(dθ̄) ν(dη̂), ∀ Borel sets B ⊂ B([0,1])×Ξ,

is a feasible solution of the right-hand side of (77). By condition (60), we have that

inf_{η∈Ξ} {η(x⁻¹(1)) : d(η, η̂) = θ̄} ≥ inf_{η̃∈B([0,1])} {η̃(int(x⁻¹(1))) : W₁(η̃, η̂) ≤ θ̄},

and thus E_{η̂∼ν}[E_{θ̄∼γ_η̂}[inf_{η∈Ξ} {η(x⁻¹(1)) : d(η, η̂) = θ̄}]] ≥ E_{(η̃,η̂)∼ρ̄}[η̃(int(x⁻¹(1)))]. Therefore the opposite direction holds and (77) is proved. Together with (76), we obtain that

inf_{µ∈M} E_{η∼µ}[η(x⁻¹(1))] = inf_{ρ∈P(B([0,1])×Ξ)} { E_{(η̃,η̂)∼ρ}[η̃(int(x⁻¹(1)))] : E_{(η̃,η̂)∼ρ}[W₁(η̃, η̂)] ≤ θ, π²#ρ = ν }.

It then follows that it suffices to consider only policies x such that x⁻¹(1) is an open set. Then by strong duality (Theorem 1), the problem min_{η̃∈B([0,1])} {η̃(x⁻¹(1)) : W₁(η̃, η̂) ≤ θ̄} admits a worst-case distribution η_{η̂,θ̄}; let λ_{η̂,θ̄} be the associated dual optimizer. Recall that X̂ = {ξ̂_t^i : i = 1, …, N, t = 1, …, m_i}. We claim that it suffices to further restrict attention to those policies x such that each connected component of x⁻¹(1) contains at least one point of X̂. Indeed, suppose there exists a connected component C₀ of x⁻¹(1) such that C₀ ∩ X̂ = ∅. Then for every ζ ∈ supp η̂, recall that T_±(λ_{η̂,θ̄}, ζ) ∈ arg min_{ξ∈[0,1]} {1_{x⁻¹(1)}(ξ) + λ_{η̂,θ̄}|ξ − ζ|}, so T_±(λ_{η̂,θ̄}, ζ) ∉ C₀, and thus η_{η̂,θ̄}(x⁻¹(1)) = η_{η̂,θ̄}(x⁻¹(1) \ C₀). Hence, in view of (61), x' := 1_{x⁻¹(1)\C₀} achieves a higher objective value v(x') than v(x), and thus x cannot be optimal. We finally conclude that there exist {x_{j±}}_{j=1}^M, where M ≤ card(X̂), such that

v(x) = −c Σ_{j=1}^M (x_{j+} − x_{j−}) + inf_{µ∈M} E_{η∼µ}[η(∪_{j=1}^M [x_{j−}, x_{j+}])]. □
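The reduction in the proof compares counting measures on [0, 1] with equal total mass (recall η^±_{η̂,θ̄}([0,1]) = η̂([0,1])). On the real line, the W₁ distance between two such sums of n unit atoms can be computed by sorting both atom lists and matching in order, a standard fact used here only for illustration:

```python
# W_1 distance between two counting measures with the same number of unit
# atoms on R: sort both atom lists and match in order (optimal on the line).
def w1_counting(xs, ys):
    assert len(xs) == len(ys)        # equal total mass is required for W_1
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys)))

eta_hat = [0.1, 0.4, 0.9]            # illustrative atom locations
eta = [0.2, 0.35, 1.0]
assert abs(w1_counting(eta_hat, eta) - 0.25) < 1e-12
```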

Appendix B: Selecting radius θ

We mainly use a classical result on the Wasserstein distance from Bolley et al. [8]. Let ν_N be the empirical distribution of ξ obtained from the underlying distribution ν_0. In Theorem 1.1 (see also Remark 1.4) of Bolley et al. [8], it is shown that P{W₁(ν_N, ν_0) > θ} ≤ C(θ)e^{−(λ/2)Nθ²} for some constant λ depending on ξ and C depending on θ. Since their result holds for general distributions, we here simplify it for our purpose and explicitly compute the constants λ and C. For a more detailed analysis, we refer the reader to Section 2.1 in Bolley et al. [8].

Noticing that by assumption ξ ∈ [0, B̄], the truncation step in their proof is no longer needed; thus the probability bound (2.12) (see also (2.15)) is reduced to

P{W₁(ν_N, ν_0) > θ} ≤ max( 8e^{N(δ/2)}, 1 ) e^{−(λ/8)N(θ−δ)²}

for some constant λ > 0 and δ ∈ (0, θ), where e is the base of the natural logarithm and N(δ/2) is the minimal number of balls of radius δ/2 needed to cover the support of ξ; in our case, N(δ/2) = B̄/δ. Now let us compute λ. By Theorem 1.1, λ is the constant appearing in the Talagrand inequality

W₁(µ, ν_0) ≤ √( (2/λ) I_{φ_kl}(µ, ν_0) ),

where the Kullback-Leibler divergence of µ with respect to ν_0 is defined by I_{φ_kl}(µ, ν_0) = +∞ if µ is not absolutely continuous with respect to ν_0, and otherwise I_{φ_kl}(µ, ν_0) = ∫ f log f dν_0, where f is the Radon-Nikodym derivative dµ/dν_0. Corollary 4 in Bolley and Villani [9] shows that λ can be chosen as

λ = ( inf_{ζ_0∈Ξ, α>0} (1/α)(1 + log ∫_Ξ e^{αd²(ξ,ζ_0)} ν(dξ)) )^{−1},

which can be estimated from data. Finally, we obtain the concentration inequality

P{W₁(ν_N, ν_0) > θ} ≤ max( 8e^{B̄/δ}, 1 ) e^{−(λ/8)N(θ−δ)²}.  (81)

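As a sketch of how the bound (81) can be used in practice (the cutoff 0.05, the parameter values, and the crude grids below are illustrative; B̄ and λ must be supplied or estimated from data): minimize the right-hand side, read here as max(8e^{B̄/δ}, 1)·exp(−(λ/8)N(θ−δ)²), over δ ∈ (0, θ) numerically, then increase θ until the bound falls below the desired level. Computations are done in log space to avoid overflow of e^{B̄/δ}:

```python
import math

def log_bound(theta, N, B_bar, lam, n_grid=200):
    """log of min over delta in (0, theta) of
    max(8 * exp(B_bar / delta), 1) * exp(-(lam / 8) * N * (theta - delta)**2)."""
    best = float("inf")
    for k in range(1, n_grid):
        delta = theta * k / n_grid
        log_rhs = max(math.log(8.0) + B_bar / delta, 0.0) \
                  - (lam / 8.0) * N * (theta - delta) ** 2
        best = min(best, log_rhs)
    return best

def choose_theta(N, B_bar, lam, alpha=0.05):
    """Smallest theta on a crude geometric grid with bound <= alpha."""
    theta = 1e-3
    while log_bound(theta, N, B_bar, lam) > math.log(alpha):
        theta *= 1.1
        if theta > 1e6:
            raise RuntimeError("bound never drops below alpha")
    return theta

# illustrative numbers: N = 1000 samples on [0, 1], a rough estimate lam = 2
theta = choose_theta(N=1000, B_bar=1.0, lam=2.0)
assert log_bound(theta, 1000, 1.0, 2.0) <= math.log(0.05)
```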

In the numerical experiment, we choose δ to make the right-hand side of (81) as small as possible, and θ is chosen such that the right-hand side of (81) is equal to 0.05. Acknowledgments. The authors would like to thank Wilfrid Gangbo, David Goldberg, Alex Shapiro, and Weijun Xie for several stimulating discussions and Yuan Li for providing the image (Figure (1b)). References [1] Barbour AD, Brown TC (1992) Stein’s method and point process approximation. Stochastic Processes and their Applications 43(1):9–31. [2] Bartholdi III JJ, Platzman LK (1988) Heuristics based on spacefilling curves for combinatorial problems in euclidean space. Management Science 34(3):291–305. [3] Bayraksan G, Love DK (2015) Data-driven stochastic programming using phi-divergences. Tutorials in Operations Research . [4] Beardwood J, Halton JH, Hammersley JM (1959) The shortest path through many points. Mathematical Proceedings of the Cambridge Philosophical Society, volume 55, 299–327 (Cambridge Univ Press). [5] Ben-Tal A, Den Hertog D, De Waegenaere A, Melenberg B, Rennen G (2013) Robust solutions of optimization problems affected by uncertain probabilities. Management Science 59(2):341–357. [6] Ben-Tal A, Goryashko A, Guslitzer E, Nemirovski A (2004) Adjustable robust solutions of uncertain linear programs. Mathematical Programming 99(2):351–376. [7] Berger JO (1984) The robust bayesian viewpoint. Studies in Bayesian Econometrics and Statistics : In Honor of Leonard J. Savage., Edited by Joseph B. Kadane 4(2):63–124, ISSN 00220515. [8] Bolley F, Guillin A, Villani C (2007) Quantitative concentration inequalities for empirical measures on non-compact spaces. Probability Theory and Related Fields 137(3-4):541–593. [9] Bolley F, Villani C (2005) Weighted csisz´ar-kullback-pinsker inequalities and applications to transportation inequalities. Annales de la Facult´e des sciences de Toulouse: Math´ematiques, volume 14, 331–352. 
[10] Boyd S, Vandenberghe L (2004) Convex Optimization (Cambridge University Press).
[11] Calafiore GC, El Ghaoui L (2006) On distributionally robust chance-constrained linear programs. Journal of Optimization Theory and Applications 130(1):1–22.
[12] Carlsson JG, Behroozi M, Mihic K (2015) Wasserstein distance and the distributionally robust TSP.
[13] Castaing C, Valadier M (2006) Convex Analysis and Measurable Multifunctions, volume 580 (Springer).
[14] Chen LH, Xia A (2004) Stein's method, Palm theory and Poisson process approximation. Annals of Probability 2545–2569.

[15] Delage E, Ye Y (2010) Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research 58(3):595–612.
[16] Erdoğan E, Iyengar G (2006) Ambiguous chance constrained problems and robust optimization. Mathematical Programming 107(1-2):37–61.
[17] Esfahani PM, Kuhn D (2015) Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations. arXiv preprint.
[18] Fournier N, Guillin A (2014) On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields 1–32.
[19] Gallego G, Moon I (1993) The distribution free newsboy problem: review and extensions. Journal of the Operational Research Society 825–834.
[20] Ghaoui LE, Oks M, Oustry F (2003) Worst-case value-at-risk and robust portfolio optimization: a conic programming approach. Operations Research 51(4):543–556.
[21] Gibbs AL, Su FE (2002) On choosing and bounding probability metrics. International Statistical Review 70(3):419–435.
[22] Goh J, Sim M (2010) Distributionally robust optimization and its tractable approximations. Operations Research 58(4-part-1):902–917.
[23] Gonzalez RC, Woods RE (2006) Digital Image Processing, 3rd edition (Upper Saddle River, NJ: Prentice-Hall).
[24] Haimovich M, Rinnooy Kan A (1985) Bounds and heuristics for capacitated routing problems. Mathematics of Operations Research 10(4):527–542.
[25] Himmelberg C, Parthasarathy T, Van Vleck F (1981) On measurable relations. Fundamenta Mathematicae 111(2):161–167.
[26] Hwang FK, Richards DS (1992) Steiner tree problems. Networks 22(1):55–89.
[27] Jiang R, Guan Y (2015) Data-driven chance constrained stochastic program. Mathematical Programming 1–37.
[28] Kantorovich LV (1942) On the translocation of masses. Dokl. Akad. Nauk SSSR, volume 37, 199–201.
[29] Kantorovich LV (1960) Mathematical methods of organizing and planning production. Management Science 6(4):366–422.
[30] Levi R, Perakis G, Uichanco J (2015) The data-driven newsvendor problem: new bounds and insights. Operations Research 63(6):1294–1306.
[31] Ling H, Okada K (2007) An efficient earth mover's distance algorithm for robust histogram comparison. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(5):840–853.
[32] Minoux M (2012) Two-stage robust LP with ellipsoidal right-hand side uncertainty is NP-hard. Optimization Letters 6(7):1463–1475.
[33] Nemirovski A (2004) Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization 15(1):229–251.
[34] Pardo L (2005) Statistical Inference Based on Divergence Measures (CRC Press).
[35] Parikh N, Boyd SP (2014) Proximal algorithms. Foundations and Trends in Optimization 1(3):127–239.
[36] Pflug GC, Pichler A (2014) Multistage Stochastic Optimization (Springer).
[37] Platzman LK, Bartholdi III JJ (1989) Spacefilling curves and the planar travelling salesman problem. Journal of the ACM 36(4):719–737.
[38] Popescu I (2007) Robust mean-covariance solutions for stochastic optimization. Operations Research 55(1):98–112.
[39] Rockafellar RT, Wets RJB (2009) Variational Analysis, volume 317 (Springer Science & Business Media).
[40] Rubner Y, Tomasi C, Guibas LJ (2000) The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision 40(2):99–121.
[41] Scarf H, Arrow K, Karlin S (1958) A min-max solution of an inventory problem. Studies in the Mathematical Theory of Inventory and Production 10:201–209.

[42] Shapiro A, Dentcheva D, et al. (2014) Lectures on Stochastic Programming: Modeling and Theory, volume 16 (SIAM).
[43] Shapiro A, Kleywegt A (2002) Minimax analysis of stochastic problems. Optimization Methods and Software 17(3):523–542.
[44] Steele JM (1981) Subadditive Euclidean functionals and nonlinear growth in geometric probability. The Annals of Probability 365–376.
[45] Steele JM (1997) Probability Theory and Combinatorial Optimization, volume 69 (SIAM).
[46] Sun H, Xu H (2015) Convergence analysis for distributionally robust optimization and equilibrium problems. Mathematics of Operations Research.
[47] Villani C (2003) Topics in Optimal Transportation. Number 58 (American Mathematical Society).
[48] Villani C (2008) Optimal Transport: Old and New, volume 338 (Springer Science & Business Media).
[49] Wagner DH (1977) Survey of measurable selection theorems. SIAM Journal on Control and Optimization 15(5):859–903.
[50] Wang Z, Glynn PW, Ye Y (2015) Likelihood robust optimization for data-driven problems. Computational Management Science 1–21.
[51] Wozabal D (2012) A framework for optimization under ambiguity. Annals of Operations Research 193(1):21–47.
[52] Žáčková J (1966) On minimax solutions of stochastic linear programming problems. Časopis pro pěstování matematiky 91(4):423–430.
[53] Zhao C, Guan Y (2015) Data-driven risk-averse stochastic optimization with Wasserstein metric. Available on Optimization Online.
[54] Zymler S, Kuhn D, Rustem B (2013) Distributionally robust joint chance constraints with second-order moment information. Mathematical Programming 137(1-2):167–198.