SAMPLE AVERAGE APPROXIMATION WITH HEAVIER TAILS I: NON-ASYMPTOTIC BOUNDS WITH WEAK ASSUMPTIONS AND STOCHASTIC CONSTRAINTS∗
ROBERTO I. OLIVEIRA† AND PHILIP THOMPSON‡

Abstract. We give statistical guarantees for the sample average approximation (SAA) of stochastic optimization problems. Precisely, we derive exponential non-asymptotic finite-sample deviation inequalities for the approximate optimal solutions and the optimal value of the SAA estimator. In that respect, we give three main contributions. First, our bounds do not require sub-Gaussian assumptions on the data, as in the previous literature on stochastic optimization (SO). Instead, we only assume Hölder continuous and heavy-tailed data (i.e., finite second moments), a framework suited for risk-averse portfolio optimization. Second, we derive new deviation inequalities for SO problems with expected-valued stochastic constraints which guarantee joint approximate feasibility and optimality without metric regularity of the solution set nor the use of reformulations. Thus, unlike previous works, we do not require strong growth conditions on the objective function, the use of penalization, nor necessary first-order conditions. Instead, we use metric regularity of the feasible set as a sufficient condition, making our analysis general enough for many classes of problems. Our bounds imply exact feasibility and approximate optimality for convex feasible sets with strict feasibility, and approximate feasibility and optimality for metric regular sets which are non-convex or which are convex but not strictly feasible. In our bounds, the feasible set's metric regularity constant appears as an additional condition number. For convex sets, we use localization arguments for concentration of measure, obtaining feasibility estimates in terms of smaller metric entropies. Third, we obtain a general uniform concentration inequality for heavy-tailed Hölder continuous random functions using empirical process theory. This is the main tool in our analysis, but it is also a result of independent interest.
1. Introduction. In this work, we study the following general optimization problem:

(1)   $f^* := \min_{x\in Y} f_0(x)$  s.t.  $f_i(x)\le 0,\ \forall i\in I$,

where $Y$ is a closed subset of $\mathbb{R}^d$, $I$ is a set of indexes and, for $i\in I_0:=\{0\}\cup I$, $f_i: Y\to\mathbb{R}$ is a function. In this work we will consider the case where $Y$ is a compact set. We shall denote the solution set of (1) by $X^*$ and use the notation $f:=f_0$. The central aspect of this work is to address the following basic question:

Problem 1. Suppose we only have access to "randomly perturbed" versions $\{\hat F_i\}_{i\in I_0}$ of the original functions $\{f_i\}_{i\in I_0}$. Then, how can we ensure that nearly optimal solutions of the optimization problem

$\hat F^* := \min_{x\in Y} \hat F_0(x)$  s.t.  $\hat F_i(x)\le \hat\epsilon_i,\ \forall i\in I$,

are nearly optimal solutions of the original problem (1)?

∗ Submitted to the editors DATE.
† Instituto de Matemática Pura e Aplicada (IMPA), Rio de Janeiro, RJ, Brazil ([email protected]). Roberto I. Oliveira's work was supported by a Bolsa de Produtividade em Pesquisa from CNPq, Brazil. His work in this article is part of the activities of FAPESP Center for Neuromathematics (grant #2013/07699-0, FAPESP - S. Paulo Research Foundation).
‡ Center for Mathematical Modeling (CMM), Santiago, Chile ([email protected]).
Another possibly useful question is how to derive lower and upper bounds on the optimal value $f^*$ from the computation of the perturbed problem's optimal value $\hat F^*$. In the above, $\{\hat\epsilon_i\}_{i\in I}$ are "tuning" parameters computed from the acquired data $\{\hat F_i\}_{i\in I_0}$. In this context, we distinguish the hard constraint $Y$ from the soft constraint set

(2)   $X := \{x\in Y : f_i(x)\le 0,\ \forall i\in I\}$,

which is allowed to be relaxed by the decision maker due to intrinsic perturbations in his model. The relaxed constraint set is then given by

$\hat X := \{x\in Y : \hat F_i(x)\le \hat\epsilon_i,\ \forall i\in I\}$.
The above type of question is frequent in optimization. If the perturbations are deterministic, we are within the realm of perturbation theory for optimization problems. Our paper is focused on the case where the data of the original problem (1) is only accessible through random perturbations obtained from data acquisitions. In this context, a standard general model in Stochastic Optimization (SO) is given as follows. We consider a distribution $\mathsf{P}$ over a sample space $\Xi$. For clarity, from now on we shall denote by $\mathbb{P}$ the probability measure over a common sample space $\Omega$, that is, a random variable $\xi:\Omega\to\Xi$ with distribution $\mathsf{P}$ satisfies $\mathbb{P}(A)=\mathbb{P}(\xi\in A)$ for any measurable set $A$ in the $\sigma$-algebra of $\Xi$. We suppose the data of problem (1) is given, for any $i\in I_0$, by

(3)   $f_i(x) := \mathsf{P} F_i(x,\cdot) := \int_\Xi F_i(x,\xi)\, d\mathsf{P}(\xi), \quad (x\in Y)$,

where $F_i: Y\times\Xi\to\mathbb{R}$ is such that $F_i(x,\cdot)$ is an integrable random variable for any $x\in Y$. It is then assumed that, although we do not have access to $\{f_i\}_{i\in I_0}$, the decision maker can sample the distribution $\mathsf{P}$ and thus evaluate $\{F_i\}_{i\in I_0}$. Within this framework, the Sample Average Approximation (SAA) approach consists in solving the problem whose data is constructed by empirical averages of $\{F_i\}_{i\in I_0}$ associated to an acquired sample of $\mathsf{P}$. The precise methodology is as follows. Given a size-$N$ independent identically distributed (i.i.d.) sample $\{\xi_j\}_{j=1}^N$ from the distribution $\mathsf{P}$, denote the empirical measure over $\Xi$ by $\hat{\mathsf{P}}_N := \frac1N\sum_{j=1}^N \delta_{\xi_j}$, where $\delta_\xi$ denotes the Dirac measure at the point $\xi\in\Xi$.¹ Define, for $x\in Y$ and $i\in I_0$,

$\hat F_{i,N}(x) := \hat{\mathsf{P}}_N F_i(x,\cdot) := \int_\Xi F_i(x,\xi)\, d\hat{\mathsf{P}}_N(\xi) = \frac1N\sum_{j=1}^N F_i(x,\xi_j)$.
The SAA approach then consists in solving

(4)   $\hat F_N^* := \min_{x\in Y} \hat F_{0,N}(x)$
(5)   s.t. $\hat F_{i,N}(x)\le \hat\epsilon_{i,N},\ \forall i\in I$.

We will denote the solution set of (4) by $\hat X_N^*$, and use the notations $\hat F_N := \hat F_{0,N}$ and

$\hat X_N := \{x\in Y : \hat F_{i,N}(x)\le\hat\epsilon_{i,N},\ \forall i\in I\}$.

¹ If no confusion arises, the dependence on the sample $\xi^N := \{\xi_j\}_{j=1}^N$ is omitted in the notation $\hat{\mathsf{P}}_N$.
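As an illustration of how (4)-(5) is formed in practice, the following minimal Python sketch builds $\hat F_{0,N}$ and a relaxed empirical constraint from an i.i.d. sample and solves the resulting deterministic problem. The specific data model, the sampler, the relaxation parameter and the use of scipy.optimize are illustrative assumptions and are not part of the paper's development.

```python
import numpy as np
from scipy.optimize import NonlinearConstraint, minimize

rng = np.random.default_rng(0)

def sample_xi(n):
    # Hypothetical sampler for the distribution P: heavy-tailed data
    # with finite second moment (Student-t with 3 degrees of freedom).
    return rng.standard_t(df=3, size=(n, 2))

def F0(x, xi):
    # Random objective F_0(x, xi): a stochastic linear loss (illustrative).
    return -float(xi @ x)

def F1(x, xi):
    # Random constraint F_1(x, xi), required to be <= 0 in expectation (illustrative).
    return float((xi @ x) ** 2) - 1.0

N = 2000
sample = sample_xi(N)          # i.i.d. sample xi_1, ..., xi_N
eps_hat = 0.05                 # relaxation parameter eps_hat_{1,N}

def F0_N(x):                   # empirical average P_N F_0(x, .)
    return np.mean([F0(x, xi) for xi in sample])

def F1_N(x):                   # empirical average P_N F_1(x, .)
    return np.mean([F1(x, xi) for xi in sample])

# SAA problem (4)-(5): minimize F0_N over the hard constraint Y (a box here),
# subject to the relaxed empirical constraint F1_N(x) <= eps_hat.
res = minimize(F0_N, x0=np.zeros(2), method="trust-constr",
               bounds=[(-1.0, 1.0), (-1.0, 1.0)],
               constraints=[NonlinearConstraint(F1_N, -np.inf, eps_hat)])
print("SAA value:", res.fun, "SAA solution:", res.x)
```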
We should mention that there exists a competing methodology in SO called Stochastic Approximation (SA). In this case, a specific algorithm is chosen and samples are used in an online fashion to approximate the objective function along the iterations of the algorithm. The quality of such methods is given in terms of optimization and statistical estimation errors simultaneously. The first relates to the optimality gap, while the second is related to the variance error induced by the use of random perturbations. See e.g. [57, 17, 18, 19]. Differently, the quality of the SAA methodology is measured by the statistical estimation error present when approximating problem (1)-(3) by problem (4). This analysis is valid uniformly over the class of algorithms chosen to solve (4). The SAA methodology is adopted in many decision-making problems in SO (see, e.g., [29]) but also in the construction of the so-called M-estimators studied in Mathematical Statistics and Statistical Machine Learning (see, e.g., [31]). In the first setting, knowledge of the data is limited,² but one can resort to samples using Monte Carlo simulation. In the second setting, a limited number of samples is acquired from measurements and an empirical contrast estimator is built to fit the data to a certain criterion given by a loss function over a hypothesis class [31]. In the latter context, the SAA problem is often termed Empirical Risk Minimization (ERM). However, randomly perturbed constraints are not typical in this setting.

Referring to the general perturbation theory framework of Problem 1, central questions in the SAA methodology are to give conditions on the data and good guarantees on the estimation error such that computable (nearly) optimal solutions of (4) are (nearly) optimal solutions of the original problem (1)-(3). This can be cast in different forms. One important type of analysis is to guarantee that such a methodology is asymptotically consistent, i.e., it ensures that solutions of the SAA problem converge to true solutions as $N\to\infty$. This may include different kinds of convergence modes.³ Almost sure convergence, for instance, provides a Strong Law of Large Numbers (SLLN) for the SAA problem, while convergence in distribution provides a Central Limit Theorem (CLT) and may be used to give rates of convergence of the variance associated to the sample average approximation,⁴ as well as the construction of asymptotic confidence intervals for the true solution. It should be remarked that a key aspect in the analysis of the SAA estimator is the need of uniform SLLNs and CLTs, since we are dealing with random functions instead of random variables [1]. Such kind of asymptotic analysis has been carried out in numerous works, e.g., [2, 9, 23, 24, 37, 38, 39, 44, 45, 46, 47]. See e.g. [47, 14, 22] for an extensive review.

However, a preferred mode of analysis, which will be the one pursued in this work, is exponentially non-asymptotic in nature. By this we mean the construction of deviation inequalities giving explicit non-asymptotic rates in $N$ which guarantee that, with exponentially high probability,⁵ the optimal value and (nearly) optimal solutions of (4) are close enough to the optimal value and (nearly) optimal solutions of (1)-(3) given a prescribed tolerance $\epsilon>0$. In such inequalities, the prescribed deviation error $\epsilon$ typically depends on the number of samples $N$ and on parameters depending on the

² Sources of this limitation are manifold, but may include: (i) expensive computation of the expectation due to a large dimensional sample space, (ii) an unknown distribution $\mathsf{P}$ or (iii) no closed form of the data $\{F_i\}_{i\in I_0}$.
³ Such as almost sure convergence, convergence in probability and convergence in distribution.
⁴ Typically of order $O(N^{-\frac12})$, dictated by the Central Limit Theorem.
⁵ Meaning that the deviation depends polynomially on $\sqrt{\ln p^{-1}}$ for a chosen probability level $p\in(0,1)$. This is also termed exponential convergence, in contrast to polynomial convergence obtained by mere use of Markov-type inequalities.
problem's structure.⁶ As an example, in the case of a compact $X$ with no stochastic constraints ($I=\emptyset$), diameter $D(X)$ and a random Lipschitz continuous function $F_0$ with bounded Lipschitz modulus satisfying $L_0(\xi)\le L$ for almost every (a.e.) $\xi\in\Xi$, the following non-asymptotic inequality for the optimal value deviation can be obtained: for some constant $C>0$,

$\mathbb{P}\left\{\hat F_N^* - f^* \ge C\,\frac{\sqrt d\, D(X)\, L}{\sqrt N}\,(1+t)\right\} \le e^{-t}, \quad \forall t>0,\ \forall N\in\mathbb{N}.$

For optimal solution deviations, such inequalities can be obtained using different metrics, such as $\mathbb{H}(\hat X_N^*, X^*)$ or the event $[\hat X_N^*\not\subset X^*]$, where $\mathbb{H}$ denotes the Hausdorff distance between sets (see Section 3 and [47]). The strong advantage of such inequalities in comparison to asymptotic results is that the exponential tail decay is not just valid in the limit but also valid uniformly for any $N$. It is thus true in the regime of a small finite number of samples. This kind of result is a manifestation of the concentration of measure phenomenon [28].

In asymptotic convergence analysis, the mild assumption of a data distribution with finite second moment is usually sufficient. Moreover, smoothness is usually required to establish results of asymptotic convergence in distribution. To derive stronger non-asymptotic results, however, more demanding assumptions on the distribution are made in the existing SO literature. The typical assumption in this context is light-tailed data (see Section 1.1, item (i)). While this is satisfied by many problems (for instance, bounded or sub-Gaussian data), the much weaker assumption of data with a finite second moment is desirable in many practical situations where information on the distribution is limited. This is particularly true in the context of risk-averse SO, where heavy-tailed data is expected. One central aspect of this work is to provide a non-asymptotic analysis of the SAA problem with heavier-tailed data.

Another crucial aspect in Problem 1 is the presence of perturbations in the constraints. Already in the case of deterministic constraint perturbations, the feasible set must be sufficiently regular to ensure stability of solutions. Usual conditions to ensure this are, e.g., Slater or Mangasarian-Fromovitz constraint qualifications (CQ). In the specific case of the SAA methodology, most of the developed work has been done without stochastic constraints (i.e., $I=\emptyset$). We refer the reader to the recent review [15] highlighting this gap in the literature. In that respect, another central question in this work is to give non-asymptotic guarantees for the SAA problem in the presence of stochastic perturbations of the constraints. Importantly, we will present results of this type which ensure feasibility and optimality simultaneously. In this quest, we wish to avoid penalization or Lagrangian reformulations of the original problem. While these have the advantage of coping with the constraints in the objective, they implicitly require determining the associated penalization parameters or multipliers. The computation of penalization parameters and multipliers is another hard problem by itself and, in the stochastic case, they are data-dependent random variables. Moreover, these reformulations may express necessary but not sufficient conditions of the original problem. Our adopted framework is to establish guarantees for the SAA estimator in terms of the original optimization problem itself.

⁶ These parameters are typically associated to intrinsic "condition numbers" of the problem, such as the feasible set's diameter and dimension, as well as parameters associated to growth, curvature or regularity properties of the data. These may include, e.g., Lipschitz and strong-convexity moduli.
In this quest, we explore metric regularity of the feasible set as a suitable general property to handle random perturbations of the constraints (see, e.g., [36]).

1.1. Related work and contributions. We next summarize the main contributions of this paper and later compare them with previous works.

(i) Heavy-tailed and Hölder continuous data. The standard non-asymptotic exponential convergence results for SO problems were given for sub-Gaussian data. We say a random variable $q:\Xi\to\mathbb{R}$ is a centered sub-Gaussian random variable with parameter $\sigma>0$ if

$\mathsf{P}\, e^{tq} \le \exp\left(\frac{\sigma^2 t^2}{2}\right), \quad \forall t\in\mathbb{R}.$

This is equivalent to the tail of $q$ not exceeding the tail of a centered Gaussian random variable with variance $\sigma^2$ (whose tail decreases exponentially fast in $t^2$). As an example, bounded random variables are sub-Gaussian by Hoeffding's inequality [12]. To establish exponential non-asymptotic deviations, a standard requirement in SO used so far is the following uniform sub-Gaussian assumption: for all $i\in I_0$, there exists $\sigma_i>0$ such that for all $x\in Y$, $\delta_i(x) := F_i(x,\cdot) - \mathsf{P}F_i(x,\cdot)$ is sub-Gaussian with parameter $\sigma_i$. Explicitly,

(6)   $\mathsf{P}\, e^{t\delta_i(x)} \le \exp\left(\frac{\sigma_i^2 t^2}{2}\right), \quad \forall t\in\mathbb{R},\ \forall x\in Y.$

From now on we will assume the following condition.

Assumption 1 (Heavy-tailed and Hölder continuous random data). For all $i\in I_0$, there exist a random variable $L_i:\Xi\to\mathbb{R}_+$ with $\mathsf{P}L_i(\cdot)^2<\infty$ and $\alpha_i\in(0,1]$, such that for a.e. $\xi\in\Xi$ and for all $x,y\in Y$,

$|F_i(x,\xi) - F_i(y,\xi)| \le L_i(\xi)\,\|x-y\|^{\alpha_i}.$

In the above, $\|\cdot\|$ is a generic norm. The above assumption is standard in SO [47], where typically the stronger condition $\alpha_i:=1$ (i.e., Lipschitz continuity) is also assumed. In the current SO literature, the data is assumed to satisfy both condition (6) and Assumption 1. In that case, to establish uniform sub-Gaussianity of $F_i$ one requires $L_i$ to be sub-Gaussian. We give significant improvements by obtaining non-asymptotic exponential deviation inequalities of order $O(N^{-\frac12})$ for the optimal value and nearly optimal solutions of the SAA estimator assuming just Assumption 1, that is, $L_i$ is only required to have a finite second moment. Note that in this setting, even for a compact $Y$, the multiplicative noise $L_i$ may result in much heavier fluctuations of $F_i$ when compared to a bounded or Gaussian random variable. One motivation for obtaining this kind of new result in SO is the use of risk measures in decision-making under uncertainty (such as CV@R [41]). In this framework, the assumption of sub-Gaussian data (with a tail decreasing exponentially fast) may be too optimistic, since risk measures are often used to hedge against the tail behavior of the random variable modeling the uncertainty. The price to pay for our assumptions is that the deviation bounds typically depend on empirical quantities, such as the Hölder moduli of the empirical losses.

(ii) Stochastic optimization with expected-valued stochastic constraints. A remarkable difficulty in problem (1)-(3) is that the constraints are randomly perturbed. In this generalized framework, besides the behaviour of the objective, the feasible set's geometry plays an additional important role for ensuring simultaneous feasibility and
optimality. We obtain exponential non-asymptotic deviations for the optimal value and nearly optimal solutions of SO problems with expected-valued stochastic constraints, also in the context of the heavy-tailed data of item (i). In our analysis, we do not use reformulations based on penalization or Lagrange multipliers. For the mentioned purposes, we explore metric regularity (MR) of the feasible set as a sufficient condition.

Definition 1.1 (Metric regular set). We say the set $X$ in (2) is metric regular if there exists $c := c(Y,\{f_i\}_{i\in I})>0$ such that

$d(x,X) \le c\,\sup_{i\in I}\,[f_i(x)]_+, \quad \forall x\in Y.$

In the above, $d(\cdot,X)$ denotes the Euclidean distance to $X$ and $[a]_+$ denotes the positive part of $a\in\mathbb{R}$.⁷ MR is a fairly general property of sets used in the perturbation analysis and computation of problems in Optimization and Variational Analysis [36, 5, 17, 57, 33]. We can thus make our analysis for random perturbations general enough to cope with many classes of problems. This property is implied by standard CQs. For instance, as proved by Robinson [40], the error bound in Definition 1.1 holds for any compact convex set with an interior point (the so-called Slater condition). Nevertheless, it is still true for a larger class of sets which are not strictly feasible nor convex and may not satisfy standard CQs. We refer the reader to the excellent review [36] for a detailed discussion. A polyhedron is an important example of a convex set which is always metric regular even without a strict feasibility assumption, as implied by the celebrated Hoffman's Lemma [13]. In our results, we do explore the benefits of convexity and strict feasibility in the Slater condition by showing that near optimality and exact feasibility of nearly optimal solutions of the SAA problem (4) are jointly satisfied with high probability when the Slater condition holds and the error tolerance $\epsilon>0$ is smaller than a feasibility threshold (see Definition 6.1 and Theorem 5). To the best of our knowledge, this is a new result (see Section 6.2). Nevertheless, we also derive exponential non-asymptotic deviations which guarantee simultaneous near optimality and near feasibility of solutions of the SAA problem (4) assuming just a metric regular feasible set (possibly nonconvex), without requiring the Slater condition and valid for any tolerance level $\epsilon>0$. Importantly, we do not impose metric regularity on the solution set map, which restricts significantly the problems in consideration (see e.g. [43] for a precise definition). To the best of our knowledge, this is also a new result (see Section 6.3 and Theorem 6). In the small existing literature of SAA with stochastic constraints, metric regularity of the solution set map has been assumed for establishing non-asymptotic deviation guarantees for optimal solutions (that is, guarantees which ensure feasibility and optimality simultaneously). We should mention that requiring a metric regular solution set typically imposes strong growth conditions on the objective function and sometimes uniqueness of solutions [43, 38]. By exploring metric regularity of the feasible set we do not require such regularity properties of the objective function.

(iii) A uniform concentration inequality for heavy-tailed Hölder continuous random functions. Since the statistical problem in consideration involves optimizing over a feasible set, there is the need of obtaining uniform deviation inequalities for random functions. In the quest of items (i)-(ii) above, instead of Large Deviations Theory, as used, e.g., in [50, 49, 25, 47], we use techniques from Empirical Process Theory. In this approach, we use chaining arguments to obtain a general concentration inequal-

⁷ Other error bounds which generalize Definition 1.1 are possible [36]. For simplicity we use the standard one given in the mentioned definition.
ity for the uniform empirical deviation of random Hölder continuous functions (see Section 4, Theorem 1 and the Appendix). Since we assume heavy-tailed data, we incorporate in our arguments self-normalization techniques [35, 7], instead of postulating boundedness assumptions on the data and then directly invoking Talagrand's inequality for bounded empirical processes [51] as in [43, 39, 38], for example.

We make some important remarks regarding the parameters appearing in our deviation inequalities in the framework of items (i)-(ii). Typically, our deviations in feasibility and optimality depend on the metric entropy of $Y$ or $X$ and of the index set $I$ (see Section 3 for a precise definition). This quantity is associated to the complexities of $Y$ or $X$ and $I$. As an example, if $Y$ is the Euclidean ball in $\mathbb{R}^d$, its metric entropy is of $O(d)$. We now focus on the case of stochastic constraints ($I\neq\emptyset$). For a finite (but possibly very large) number of constraints, our deviations depend on $O(\ln|I|)$. We also explore the convexity of $X$ as a useful localization property for using concentration of measure in feasibility estimation. We show that exact or near feasibility holds true depending on sets with smaller metric entropies than $X$ or $Y$. Precisely, if $X$ is convex and $\epsilon>0$ is a prescribed tolerance, our feasibility deviation inequalities depend on the metric entropy of $\epsilon$-active level sets of the form $\{x\in X_{O(\epsilon)} : f_i(x)=\epsilon\}$, corresponding to constraint $i\in I$. In the above, for given $\gamma>0$, $X_\gamma := \{x\in Y : f_i(x)\le\gamma,\ \forall i\in I\}$ is a relaxation of $X$. We believe this is a new type of result (see Section 6.3.1 and Theorem 7). Finally, our deviation bounds depend on the metric regularity constant $c$ in Definition 1.1 as a condition number and on deterministic errors associated to set approximation due to the inevitable constraint relaxation in the stochastic setting.⁸ These approximation errors depend on the optimization problem and vanish as $\epsilon\to 0^+$ at a prescribed rate. A very conservative upper bound on these rates is $[\mathsf{P}L_0(\cdot)]\,\epsilon^{\alpha_0}$. We refer to Section 6. To the best of our knowledge, these kinds of exponential non-asymptotic error bounds are also new.

To conclude, we remark that we do not consider chance constrained problems. These are problems where the constraints are required to be satisfied within a confidence level, that is, problem (1) with $f_i(x) := \mathsf{P}[G_i(x,\cdot)\le 0] - \alpha$ for given $G_i:Y\times\Xi\to\mathbb{R}$, $\alpha\in(0,1)$ and $i\in I$. This problem is equivalently expressed in the form (1)-(3) with $F_i(x,\xi) := \mathbb{I}_{[G_i(x,\cdot)\le 0]}(\xi) - \alpha$, where $\mathbb{I}_A$ stands for the characteristic function of the measurable set $A\subset\Xi$. Hence, they are of a very distinct nature: the constraints $F_i$ are bounded but discontinuous (not satisfying Assumption 1) and usually not convex even if $G_i(\cdot,\xi)$ is convex.

We now compare our results with previous work. Related to our work, we mention [38, 43, 39, 3, 25, 48, 49, 56, 46, 47, 53, 54, 55, 16, 34, 27, 42, 10, 8, 21, 20, 11, 4]. Except for [42, 20, 21], all other papers assume light-tailed data. The analysis in [42, 20, 21], however, is asymptotic, a regime where data with finite second moments may be expected. Moreover, in [42], asymptotic consistency is given for a necessary stationary reformulation based on optimality functions, while in [20, 21] the analysis is restricted to the optimal value, with no report on consistency for optimal solutions. In these two papers, the asymptotic rate in terms of convergence in probability is $O(N^{-\beta})$ with $\beta\in(0,\frac12)$ and, hence, sub-optimal, since the optimal rate $O(N^{-\frac12})$ is not achieved. Our results assuming heavy-tailed data are non-asymptotic, do not use reformulations of the optimization

⁸ We thus provide deviation bounds of the form "variance error + approximation error", often seen in statistics.
problem and achieve the optimal rate $O(N^{-\frac12})$ with joint guarantees of feasibility and optimality.

Expected-valued stochastic constraints were analyzed in [42, 8, 16, 56, 53, 43, 3, 54, 55, 21, 20, 11]. As mentioned, an asymptotic analysis is given in [42] in terms of a necessary stationary reformulation based on optimality functions. In [8, 16, 56, 53], exponential non-asymptotic convergence is obtained only for the feasibility requirement, with no report on simultaneous feasibility and optimality guarantees. Moreover, they assume light-tailed data. In [43, 3, 54, 55], besides assuming light-tailed data, the exponential non-asymptotic guarantees for optimal solutions assume metric regularity of the solution set. We, on the other hand, obtain exponential non-asymptotic guarantees for nearly optimal solutions assuming heavy-tailed data and only metric regularity of the feasible set. This last assumption is much weaker, since no qualification structure of the solution set nor growth conditions on the objective function need to be verified. In [21, 20, 11], the analysis is only given for optimal values. Moreover, [21, 20] provide asymptotic sub-optimal rates, while light-tailed data is assumed in [11]. For completeness, we also mention [26], where algorithms based on the different SA methodology are proposed for convex problems with stochastic constraints and light-tailed data.

Finally, we compare our approach with the mentioned works with respect to the methods used in obtaining concentration of measure, an intrinsic property of random perturbations. As explained before, this is the property required to derive exponential non-asymptotic convergence. Instead of using Large Deviations Theory as in [3, 25, 48, 49, 56, 46, 47, 16, 8, 4], we use methods from Empirical Process Theory. Ready-to-use inequalities from this theory were invoked in [38, 43, 39, 53, 54, 55, 10]. One essential point in all of these works is the assumption of light-tailed data, which we avoid. To achieve this, we neither postulate bounded data nor directly use concentration inequalities for bounded empirical processes, as done in [38, 43, 39, 53, 54, 55]. Instead, we work with basic assumptions on the data (heavy-tailed Hölder continuous random functions) and derive a suitable concentration inequality for this class of random functions. The work in [10] also chooses this line of analysis. Using techniques based on Rademacher averages and McDiarmid's inequality [32], they require Hölder continuous random functions with a constant Hölder modulus and, as a consequence, require bounded data. Moreover, besides not analyzing stochastic constraints, their rate of convergence is severely sub-optimal. Precisely, their deviation is of order $O\!\left(\frac{N^{-a}}{\sqrt{1-2a}}\right)$ for $a\in(0,\frac12)$. Hence, even in the setting of bounded data, their deviation satisfies $\lim_{a\to\frac12} N^{-a}(1-2a)^{-\frac12}=\infty$, implying $N^{-a}(1-2a)^{-\frac12}\gg N^{-\frac12}$ as $a$ approaches the optimal exponent $\frac12$. Our deviations have the optimal rate of $O(N^{-\frac12})$ for heavy-tailed data and include expected-valued stochastic constraints. To obtain these sharper results, our concentration inequalities are derived from different techniques, using chaining and metric entropy arguments as well as self-normalization theory [35, 7].

2. Motivating applications. Our theory is motivated by many stochastic optimization problems which can be cast in the framework given in the Introduction, see e.g. [47]. In this section we present two specific sets of problems. In subsequent papers, we apply the methodology developed here specially tailored to these applications.

Example 1 (Risk-averse portfolio optimization). Let $\xi_1,\dots,\xi_d$ be random return rates of assets $1,\dots,d$. The objective of portfolio optimization is to invest some initial capital in these assets subject to a desirable feature of the total return rate on the
investment. If $x_1,\dots,x_d$ are the fractions of the initial capital invested in assets $1,\dots,d$, then the total return rate is $R(x,\xi):=\xi^T x$, where $T$ denotes the transpose operation, $\xi^T:=[\xi_1,\dots,\xi_d]$ and $x^T:=[x_1,\dots,x_d]$. An obvious hard constraint for the capital is the simplex

$Y := \left\{x\ge 0 : \sum_{i=1}^d x_i = 1\right\}.$

One option to hedge against uncertain risks is to solve the problem

(7)   $\min_{x\in Y}\ \mathsf{P}[-R(x,\cdot)]$  s.t.  $\mathrm{CV@R}_p[-R(x,\cdot)]\le\beta,$

where, given a random variable $G:\Xi\to\mathbb{R}$, the Conditional Value-at-Risk of level $p\in(0,1]$ corresponding to the distribution $\mathsf{P}$ is defined as

$\mathrm{CV@R}_p[G] := \min_{t\in\mathbb{R}}\left\{t + \frac1p\,\mathsf{P}[G-t]_+\right\}.$

Problem (7) minimizes the expected loss subject to a CV@R constraint hedging against more aggressive tail losses. The above problem can be equivalently solved via the problem

$\min_{x\in Y,\,t\in\mathbb{R}}\ \mathsf{P}[-R(x,\cdot)]$  s.t.  $t + p^{-1}\,\mathsf{P}[-R(x,\cdot)-t]_+ \le \beta.$
We refer to [41].

Example 2 (The Lasso estimator). In Least-Squares-type problems, the loss function to be minimized is

$F(x,\xi) := [\,y(\xi) - \langle \mathrm{x}(\xi), x\rangle\,]^2, \quad (x\in\mathbb{R}^d),$

for $\mathrm{x}(\xi)=(\mathrm{x}_i(\xi))_{i=1}^d\in\mathbb{R}^d$ and $y(\xi)\in\mathbb{R}$. If $\{\xi_j\}_{j=1}^N$ is a sample of $\mathsf{P}$, the usual ordinary least squares method minimizes $\hat F_N(x) := \hat{\mathsf{P}}_N F(x,\cdot)$ over $\mathbb{R}^d$. When $N\gg d$, this method typically produces a good approximation of the minimizer of $f(x):=\mathsf{P}F(x,\cdot)$. The above is not true when $N\ll d$, where the least-squares estimator is not consistent. For this setting, Tibshirani [52] proposed the Lasso estimator, given by the problem

(8)   $\min_{x\in Y}\ \hat F_N(x),$

with $Y := \{x\in\mathbb{R}^d : \|x\|_1\le R\}$ for some $R>0$, where $\|\cdot\|_1$ denotes the $\ell_1$-norm. Bickel, Ritov and Tsybakov [6] analyze the penalized estimator given by the problem

$\min_{x\in\mathbb{R}^d}\ \hat F_N(x) + \lambda\|\hat D_{2,N}\,x\|_1,$
where $\hat D_{2,N}$ is the diagonal matrix with $i$-th entry given by $\sqrt{\hat{\mathsf{P}}_N|\mathrm{x}_i(\cdot)|^2}$. Up to a penalization parameter, the above problem is equivalent to (8) for some $R>0$ with $Y := \{x\in\mathbb{R}^d : \|\hat D_{2,N}\,x\|_1\le R\}$.
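Returning to Example 1, the SAA version of problem (7) replaces $\mathsf{P}$ by the empirical measure $\hat{\mathsf{P}}_N$ in both the objective and the reformulated CV@R constraint. The following sketch is only illustrative: the return distribution, the parameter values and the use of scipy.optimize are assumptions made for this example, not part of the paper.

```python
import numpy as np
from scipy.optimize import LinearConstraint, NonlinearConstraint, minimize

rng = np.random.default_rng(1)
d, N, p_level, beta = 4, 5000, 0.05, 0.10

# Hypothetical heavy-tailed return rates xi_1, ..., xi_N (finite second moment).
returns = 0.03 + 0.02 * rng.standard_t(df=4, size=(N, d))

def objective(z):
    # z = (x, t); empirical expected loss P_N[-R(x, .)] with R(x, xi) = xi^T x.
    x = z[:d]
    return float(np.mean(-returns @ x))

def cvar_term(z):
    # Empirical version of t + p^{-1} P_N[-R(x, .) - t]_+ (should be <= beta).
    x, t = z[:d], z[d]
    return float(t + np.mean(np.maximum(-returns @ x - t, 0.0)) / p_level)

# Hard constraint Y: the simplex sum(x) = 1, x >= 0; the auxiliary variable t is free.
simplex = LinearConstraint(np.append(np.ones(d), 0.0).reshape(1, -1), 1.0, 1.0)
bounds = [(0.0, 1.0)] * d + [(None, None)]

z0 = np.append(np.ones(d) / d, 0.0)
res = minimize(objective, z0, method="trust-constr", bounds=bounds,
               constraints=[simplex, NonlinearConstraint(cvar_term, -np.inf, beta)])
print("portfolio:", np.round(res.x[:d], 3),
      "empirical CV@R of loss:", round(cvar_term(res.x), 4))
```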
3. Preliminaries and notation. For $x,y\in\mathbb{R}$, we write $y=O(x)$ if there exists a constant $C>0$ such that $|y|\le C|x|$. For $m\in\mathbb{N}$, we denote $[m]:=\{1,\dots,m\}$. For a set $Z\subset\mathbb{R}^d$, we will denote its topological interior and frontier respectively by $\mathring{Z}$ and $\partial Z$, its diameter by $D(Z)$ (associated to the norm $\|\cdot\|$ in Assumption 1), and $d(\cdot,Z):=\inf_{y\in Z}\|\cdot-y\|_2$, where $\|\cdot\|_2$ is the Euclidean norm. The Hausdorff distance between compact sets $A,B\subset\mathbb{R}^d$ is defined as $\mathbb{H}(A,B):=\max\{\mathbb{D}(A,B),\mathbb{D}(B,A)\}$, where $\mathbb{D}(A,B):=\sup_{x\in A} d(x,B)$ is the excess of $A$ over $B$. For a finite set $\mathcal{V}$ we denote its cardinality by $|\mathcal{V}|$. For a given set $S$, its complement is denoted by $S^c$. For random variables $\{\eta_j\}_{j=1}^n$, $\sigma(\eta_1,\dots,\eta_n)$ denotes the $\sigma$-algebra generated by $\{\eta_j\}_{j=1}^n$. Given a $\sigma$-algebra $\mathcal{F}$ of sets in $\Omega$, $\mathbb{E}[\cdot|\mathcal{F}]$ denotes the conditional expectation with respect to $\mathcal{F}$.

When dealing with perturbations, it will be useful to make the following set definitions, corresponding to $\epsilon$-approximate solutions of problem (1), Problem 1 and problem (4), respectively. For $Z\subset Y$, we define

$Z^*_\epsilon := \{x\in Z : f(x)\le f^*+\epsilon\},$
$\hat X^*_\epsilon := \{x\in \hat X : \hat F(x)\le \hat F^*+\epsilon\},$
$\hat X^*_{N,\epsilon} := \{x\in \hat X_N : \hat F_N(x)\le \hat F_N^*+\epsilon\}.$

In particular, $X^*_\epsilon$ is the set of $\epsilon$-approximate solutions of problem (1). To be consistent with the notation in the Introduction, when $\epsilon:=0$ we simply use $Z^*$, $\hat X^*$ and $\hat X^*_N$ to denote the corresponding exact solution sets. In order to consider stochastic constraints the following definitions will be important: for $\gamma\in\mathbb{R}$,

(9)   $X_\gamma := \{x\in Y : f_i(x)\le\gamma,\ \forall i\in I\},$
(10)  $\breve X_\gamma := (X+\gamma\mathbb{B})\cap Y,$

where $+$ denotes the Minkowski sum and $\mathbb{B}$ denotes the Euclidean unit ball in $\mathbb{R}^d$. The set $X_\gamma$ may be regarded as a perturbation of $X$ in terms of its representation functions $\{f_i\}_{i\in I}$. For $\gamma>0$, it is an "exterior" relaxation of $X$, while for $\gamma<0$ it is an "interior" relaxation. For $\gamma>0$, the set $\breve X_\gamma$ is an "exterior" relaxation of $X$ in terms of the Hausdorff distance associated to the Euclidean distance in $\mathbb{R}^d$. We also define, for $\gamma\in\mathbb{R}$, the $\gamma$-active level sets

(11)  $X_{i,\gamma} := \{x\in X_\gamma : f_i(x)=\gamma\}, \quad (i\in I).$

As mentioned in the introduction, these sets will be useful when exploiting convexity of $X$ as a localization property for feasibility estimates in high probability. From a schematic point of view, our analysis follows essentially two subsequent steps:
• Deterministic step: derive feasibility and optimality guarantees in terms of the deviations $\{\hat F_i - f_i\}_{i\in I_0}$. These deviations may need to be controlled pointwise or uniformly over a precise set (see Lemmas 2-7).
• Probabilistic step: specialize the previous step to the random perturbation $\{\hat F_{i,N}\}_{i\in I_0}$ and use pointwise or uniform concentration inequalities (see the next Section 4 and Theorems 4-7).

In relation to the above scheme, it will be useful to define the following deviations. Given functions $h,g:Y\to\mathbb{R}$, we define, for $x,y\in Y$,

$\delta_{h,g}(x) := h(x)-g(x), \qquad \tilde\delta_{h,g}(x,y) := \delta_{h,g}(x)+\delta_{g,h}(y).$

For $Z\subset Y$, we also define

$\delta_{h,g}(Z) := \sup_{z\in Z}\delta_{h,g}(z).$

With respect to the random perturbation $\{\hat F_i\}_{i\in I_0} := \{\hat F_{i,N}\}_{i\in I_0}$, the above deviations are expressed, for $x,y\in Y$ and $i\in I_0$, as

$\underline\delta_{i,N}(x) := \delta_{f_i,\hat F_{i,N}}(x) = \big(\mathsf{P}-\hat{\mathsf{P}}_N\big)F_i(x,\cdot),$
$\overline\delta_{i,N}(x) := \delta_{\hat F_{i,N},f_i}(x) = \big(\hat{\mathsf{P}}_N-\mathsf{P}\big)F_i(x,\cdot),$
$\tilde\delta_N(x,y) := \tilde\delta_{f_i,\hat F_{i,N}}(x,y) = \underline\delta_{i,N}(x)+\overline\delta_{i,N}(y).$

For given $Z\subset Y$ and $i\in I_0$, we also define⁹

$\hat\delta_{i,N}(x) := \big|\big(\mathsf{P}-\hat{\mathsf{P}}_N\big)F_i(x,\cdot)\big|, \quad \underline\delta_{i,N}(Z) := \sup_{x\in Z}\underline\delta_{i,N}(x), \quad \overline\delta_{i,N}(Z) := \sup_{x\in Z}\overline\delta_{i,N}(x), \quad \hat\delta_{i,N}(Z) := \sup_{x\in Z}\hat\delta_{i,N}(x).$

We will also need the following definitions. For $x\in Y$ and $i\in I_0$, we define

$\sigma_i(x) := \sqrt{\mathsf{P}\,[F_i(x,\cdot)-\mathsf{P}F_i(x,\cdot)]^2}, \qquad \hat\sigma_{i,N}(x) := \sqrt{\hat{\mathsf{P}}_N\,[F_i(x,\cdot)-\mathsf{P}F_i(x,\cdot)]^2},$
$L_i := \sqrt{\mathsf{P}L_i(\cdot)^2}, \qquad \hat L_{i,N} := \sqrt{\hat{\mathsf{P}}_N L_i(\cdot)^2}.$
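All of the quantities above except $\sigma_i(x)$ and $L_i$ are computable from the sample, and the bounds below are stated in terms of them. The following small sketch shows how they would be formed in practice; the random function and its Hölder modulus are illustrative assumptions, and the unknown true mean is replaced by the sample mean as a proxy.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 1000
xi = rng.standard_t(df=3, size=N)      # heavy-tailed sample, finite second moment

def F(x, xi):
    # Illustrative random function: |F(x,xi) - F(y,xi)| <= |xi| |x - y|, so L(xi) = |xi|.
    return xi * abs(x)

x = 0.7
vals = F(x, xi)
# sigma_hat_{N}(x): empirical second moment of F(x,.) around its mean
# (the true mean P F(x,.) is unknown, so the sample mean is used as a proxy here).
sigma_hat = float(np.sqrt(np.mean((vals - vals.mean()) ** 2)))
# L_hat_N: empirical Hoelder modulus, the square root of P_N L(.)^2.
L_hat = float(np.sqrt(np.mean(np.abs(xi) ** 2)))
print(sigma_hat, L_hat)
```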
4. The cornerstone: a uniform concentration inequality for random functions. In this section we present a uniform concentration inequality for Hölder continuous random functions without sub-Gaussianity assumptions. We shall only require a finite second moment condition. It is the main probabilistic tool applied to control deviations when using sample average approximation. The proof is presented in the Appendix. In the following, $(\mathbb{M},\mathrm{d})$ is a metric space with Borel $\sigma$-algebra $\mathcal{B}(\mathbb{M})$. We will assume that
(i) $Z\subset\mathbb{M}$ is a compact set with diameter $D(Z)$,
(ii) $\{\xi_j\}_{j=1}^N$ is an i.i.d. sample drawn from $\mathsf{P}$,

⁹ In many instances, we shall use uniform deviations of the form $\sup_{x\in Z}G(x,\xi)$ for a closed set $Z\subset Y$ and a measurable function $G:Y\times\Xi\to\mathbb{R}$ such that $G(\cdot,\xi)$ is continuous for a.e. $\xi\in\Xi$. Such properties of $Z$ and $G$ imply that $\xi\mapsto\sup_{x\in Z}G(x,\xi)$ is measurable, a fact which we will assume implicitly from now on.
(iii) $G:Z\times\Xi\to\mathbb{R}$ is a measurable function such that $G(\cdot,\xi)$ is continuous for a.e. $\xi\in\Xi$ and $\mathsf{P}G(x_0,\cdot)^2<\infty$ for some $x_0\in Z$,
(iv) there exist $\alpha\in(0,1]$ and a non-negative random variable $L(\cdot)$ with $\mathsf{P}L(\cdot)^2<\infty$ such that a.e. for all $x,y\in Z$, $|G(x,\xi)-G(y,\xi)|\le L(\xi)\,\mathrm{d}(x,y)^\alpha$.

Given a subset $S$ of $\mathbb{M}$ and $\delta>0$, we say $V\subset S$ is a $\delta$-net for $S$ if for every $s\in S$ there is $v\in V$ such that $\mathrm{d}(s,v)\le\delta$. If additionally $V$ is finite with minimum cardinality $\mathbb{N}(\delta,S)$, then the $\delta$-metric entropy of $S$ is $\mathbb{H}(\delta,S):=\ln\mathbb{N}(\delta,S)$. In the following we define

$\delta_N(x) := \big(\hat{\mathsf{P}}_N-\mathsf{P}\big)G(x,\cdot) \quad (x\in Z), \qquad \hat L_N := \sqrt{\hat{\mathsf{P}}_N L(\cdot)^2},$

(12)  $\mathsf{A}_\alpha(Z) := \sum_{i=1}^\infty \frac{D(Z)^\alpha}{2^{i\alpha}}\sqrt{\mathbb{H}\!\left(\frac{D(Z)}{2^i},Z\right) + \mathbb{H}\!\left(\frac{D(Z)}{2^{i-1}},Z\right) + \ln[i(i+1)]}.$

Theorem 1 (Sub-Gaussian type uniform concentration inequality for heavy-tailed Hölder continuous random functions). Assume items (i)-(iv) above hold. Then there exists a constant $C>0$ such that, for any $y\in Z$ and $t>0$,

$\mathbb{P}\left\{\sup_{x\in Z}|\delta_N(x)-\delta_N(y)| \ge C\,\mathsf{A}_\alpha(Z)\sqrt{\frac{(1+t)}{N}\Big[\hat L_N^2+\mathsf{P}L^2(\cdot)\Big]}\right\} \le e^{-t}.$

The quantity $\mathsf{A}_\alpha(Z)$ measures the "size" of $Z$ with respect to diameter and metric entropy for the uniform concentration property stated above to hold. To obtain the previous fundamental result, we will use the following inequality due to Panchenko [35].

Theorem 2 (Panchenko's inequality). Let $\mathcal{F}$ be a finite family of measurable functions $f:\Xi\to\mathbb{R}$ such that $\mathsf{P}f^2(\cdot)<\infty$. Let also $\{\xi_j\}_{j=1}^N$ and $\{\eta_j\}_{j=1}^N$ both be i.i.d. samples drawn from $\mathsf{P}$ which are independent of each other. If $\mathcal{F}_N:=\sigma(\xi_1,\dots,\xi_N)$, define

$\hat V_N := \mathbb{E}\left[\,\sup_{f\in\mathcal{F}}\sum_{j=1}^N\big[f(\xi_j)-f(\eta_j)\big]^2\,\Big|\,\mathcal{F}_N\right].$

Then, there exists a constant $C>0$ such that, for all $t>0$,

$\mathbb{P}\left\{\sup_{f\in\mathcal{F}}\sum_{j=1}^N f(\xi_j) \ge C\sqrt{(1+t)\,\hat V_N}\right\} \le e^{-t}.$

Finally, we state the following result, which is a corollary of Panchenko's inequality applied to $\mathcal{F}:=\{G(x,\cdot)\}$.

Theorem 3 (Sub-Gaussian type concentration inequality for the empirical mean of heavy-tailed random variables). Suppose items (ii)-(iii) above hold with $\mathsf{P}G(x,\cdot)^2<\infty$ for all $x\in Z$. Then there exists $C>0$ such that, for any $x\in Z$ and $t>0$,

$\mathbb{P}\left\{|\delta_N(x)| \ge C\sqrt{\frac{(1+t)}{N}\Big[\hat{\mathsf{P}}_N\big[G(x,\cdot)-\mathsf{P}G(x,\cdot)\big]^2 + \mathsf{P}\big[G(x,\cdot)-\mathsf{P}G(x,\cdot)\big]^2\Big]}\right\} \le e^{-t}.$
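As a quick numerical sanity check of the $O(N^{-1/2})$ behaviour in Theorems 1 and 3 for a heavy-tailed (non-sub-Gaussian) Hölder continuous random function, one may estimate the uniform deviation by Monte Carlo. The specific choice $G(x,\xi)=\xi x$ on $Z=[-1,1]$ and the Student-t data below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
grid = np.linspace(-1.0, 1.0, 201)        # a fine grid over the compact set Z = [-1, 1]

def mean_sup_deviation(N, reps=300):
    # Average over repetitions of sup_{x in Z} |delta_N(x)| for G(x, xi) = xi * x,
    # where P G(x, .) = 0 and xi is heavy-tailed with finite second moment.
    sups = []
    for _ in range(reps):
        xi = rng.standard_t(df=3, size=N)
        delta = xi.mean() * grid           # delta_N(x) = (P_N - P) G(x, .) = mean(xi) * x
        sups.append(np.max(np.abs(delta)))
    return float(np.mean(sups))

for N in (100, 400, 1600, 6400):
    # The printed values roughly halve each time N is multiplied by 4, i.e. O(N^{-1/2}).
    print(N, round(mean_sup_deviation(N), 4))
```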
5. The case with no stochastic constraints. In this section, we assume that there are no stochastic constraints ($I=\emptyset$). We start with the following deterministic lemma relating the deviation in optimality to the objective function's deviation in Problem 1.

Lemma 1 (Optimality deviation for a fixed feasible set). For all $t,\delta\ge 0$ and $x^*\in X^*$,

$\Big[\exists x\in\hat X^*_\delta,\ f(x)>f^*+t+\delta\Big] \subset \Big[\sup_{x\in X}\tilde\delta_{f,\hat F}(x,x^*)>t\Big].$
Proof. Let $t,\delta\ge 0$ and $x^*\in X^*$ be as stated in the lemma and let $E$ denote the event on the left hand side of the above inclusion. On $E$, let $x\in\hat X^*_\delta$ be such that $f(x)>f^*+t+\delta$. Then

(13)  $\tilde\delta_{\hat F,f}(x,x^*) = \hat F(x)-\hat F(x^*) - [f(x)-f(x^*)] \le \delta - [f(x)-f(x^*)] \le \delta-(t+\delta) = -t,$

where in the first inequality we used $\hat F(x)\le\hat F(x^*)+\delta$, since $x\in\hat X^*_\delta$ and $x^*\in X$, while in the second inequality we used $f(x)>f(x^*)+t+\delta$, since $x^*\in X^*$. On $E$, the above inequality, the definition of $\tilde\delta_{f,\hat F}(x,x^*)$ and $x\in X$ imply that

$\sup_{y\in X}\tilde\delta_{f,\hat F}(y,x^*) \ge \tilde\delta_{f,\hat F}(x,x^*) > t,$

proving the claim.
The statement of the next theorem requires the following variance definitions associated to the objective function: for $x\in Y$ and $Z\subset Y$,

(14)  $\breve\sigma_N(x) := \sqrt{\hat\sigma_{0,N}(x)^2+\sigma_0(x)^2},$
(15)  $\overline\sigma_N(Z) := \mathsf{A}_{\alpha_0}(Z)\sqrt{\hat L_{0,N}^2 + L_0^2},$

where we refer to the definitions of Section 3 and definition (12).
Theorem 4 (Exponential convergence of SAA for a fixed feasible set). Let $\epsilon>0$, $p\in(0,1]$, $x^*\in X^*$ and $z\in Y$. Then, with probability at least $1-p$, the optimal value deviation

$f^*-\epsilon \le \hat F_N^*$

holds for $N\ge O(\epsilon^{-2})\max\{\overline\sigma_N(X)^2,\breve\sigma_N(z)^2\}\ln(p^{-1})$, and

$\hat F_N^* \le f^*+\epsilon$

holds for $N\ge O(\epsilon^{-2})\,\breve\sigma_N(x^*)^2\ln(p^{-1})$, while the optimal solutions deviation

$\hat X^*_{N,\epsilon}\subset X^*_{2\epsilon}$

holds for $N\ge O(\epsilon^{-2})\,\overline\sigma_N(X)^2\ln(p^{-1})$.
Proof. In the following, we will consider the deterministic Lemma 1 for the particular case of the random perturbations given in the SAA problem (4), that is, $\hat F:=\hat F_N$. Let $x^*\in X^*$, $z\in Y$, $p\in(0,1]$ and $\epsilon>0$ be as stated in the theorem. We invoke Lemma 1 with $t=\epsilon$ and $\delta\ge 0$ to be determined later. In the following, $C>0$ is a constant which may change from line to line and $\beta\in(0,1]$ is a constant to be determined later in terms of $p$, depending on $N$ as stated in the theorem. We shall denote the events

$E_{1,\delta} := \Big[\exists x_N^*\in\hat X^*_{N,\delta},\ f(x_N^*)>f^*+t+\delta\Big], \quad E_2 := \Big[\sup_{x\in X}\tilde\delta_N(x,x^*)>t\Big],$
$E_3 := \big[\underline\delta_{0,N}(X)\le\epsilon\big], \quad E_4 := \big[\overline\delta_{0,N}(x^*)\le\epsilon\big].$

PART 1 (Optimal value deviation): Set $\delta=0$. On the event $E_3$, we have, for any $x_N^*\in\hat X_N^*$,

$f^*-\hat F_N^* \le f(x_N^*)-\hat F_N^* = f(x_N^*)-\hat F_N(x_N^*) \le \underline\delta_{0,N}(X) \le \epsilon,$

using $x_N^*\in\hat X_N^*\subset X$. On the event $E_4$, for any $x_N^*\in\hat X_N^*$,

$\hat F_N^*-f^* = \hat F_N(x_N^*)-f(x^*) \le \hat F_N(x^*)-f(x^*) \le \overline\delta_{0,N}(x^*) \le \epsilon,$

using $x^*\in X^*\subset\hat X_N$. Hence

(16)  $E_3\subset\big[f^*-\epsilon\le\hat F_N^*\big], \qquad E_4\subset\big[\hat F_N^*\le f^*+\epsilon\big].$

By the strong law of large numbers (SLLN), there exists $C>0$ such that a.s. $\hat L_{0,N}^2\le CL_0^2$ and $\hat\sigma_{0,N}(z)^2\le C\sigma_0(z)^2$. Hence, up to a constant, the event

(17)  $E_{N,a} := \big[N\ge O(\epsilon^{-2})\max\{\overline\sigma_N(X)^2,\breve\sigma_N(z)^2\}\ln\beta^{-1}\big]$

has probability 1. From Theorems 1 and 3 we get

$\mathbb{P}[E_3^c\cap E_{N,a}] = \mathbb{P}\Big[\Big\{\sup_{x\in X}\underline\delta_{0,N}(x)>\epsilon\Big\}\cap E_{N,a}\Big]$
$\le \mathbb{P}\Big[\Big\{\sup_{x\in X}|\tilde\delta_{0,N}(x,z)|>\tfrac\epsilon2\Big\}\cap E_{N,a}\Big] + \mathbb{P}\Big[\Big\{\hat\delta_{0,N}(z)>\tfrac\epsilon2\Big\}\cap E_{N,a}\Big]$
$\le \mathbb{P}\Big[\sup_{x\in X}|\tilde\delta_{0,N}(x,z)|\ge O(1)\,\mathsf{A}_{\alpha_0}(X)\sqrt{\tfrac{(1+\ln\beta^{-1})}{N}\big[\hat L_{0,N}^2+L_0^2\big]}\Big] + \mathbb{P}\Big[\hat\delta_{0,N}(z)\ge O(1)\sqrt{\tfrac{(1+\ln\beta^{-1})}{N}\big[\hat\sigma_{0,N}(z)^2+\sigma_0(z)^2\big]}\Big]$
(18)  $\le 2\beta \le p,$

by choosing $\beta$ and changing constants accordingly. By the SLLN, there exists $C>0$ such that a.s. $\hat\sigma_{0,N}(x^*)^2\le C\sigma_0(x^*)^2$. Hence, up to a constant, the event

(19)  $E_{N,b} := \big[N\ge O(\epsilon^{-2})\,\breve\sigma_N(x^*)^2\ln\beta^{-1}\big]$

has probability 1. From Theorem 3 we get

$\mathbb{P}[E_4^c\cap E_{N,b}] = \mathbb{P}\big[\big\{\overline\delta_{0,N}(x^*)>\epsilon\big\}\cap E_{N,b}\big] \le \mathbb{P}\Big[\hat\delta_{0,N}(x^*)\ge O(1)\sqrt{\tfrac{(1+\ln\beta^{-1})}{N}\big[\hat\sigma_{0,N}(x^*)^2+\sigma_0(x^*)^2\big]}\Big]$
(20)  $\le \beta \le p,$
by choosing $\beta$ and changing constants accordingly. Relations (16)-(20) prove the claims for the optimal value deviations.

PART 2 (Optimal solutions deviation): Set $\delta=\epsilon$. On the event $E_{1,\delta}^c$, for any $x_N^*\in\hat X^*_{N,\epsilon}$,

$f(x_N^*) \le f^*+t+\delta = f^*+2\epsilon,$

and also $x_N^*\in\hat X_N = X$. Hence

(21)  $E_{1,\delta}^c \subset \big[\hat X^*_{N,\epsilon}\subset X^*_{2\epsilon}\big].$

From Lemma 1, we have

(22)  $\mathbb{P}[E_{1,\delta}] \le \mathbb{P}[E_2].$

Also, by the SLLN, there exists $C>0$ such that a.s. $\hat L_{0,N}^2\le CL_0^2$. Hence, up to a constant, the event

(23)  $E_{N,c} := \big[N\ge O(\epsilon^{-2})\,\overline\sigma_N(X)^2\ln\beta^{-1}\big]$

has probability 1. From Theorem 1 we get

$\mathbb{P}[E_2\cap E_{N,c}] = \mathbb{P}\Big[\Big\{\sup_{x\in X}\tilde\delta_N(x,x^*)>\epsilon\Big\}\cap E_{N,c}\Big] \le \mathbb{P}\Big[\sup_{x\in X}\tilde\delta_N(x,x^*)\ge O(1)\,\mathsf{A}_{\alpha_0}(X)\sqrt{\tfrac{(1+\ln\beta^{-1})}{N}\big[\hat L_{0,N}^2+L_0^2\big]}\Big]$
(24)  $\le \beta \le p,$
by choosing $\beta$ and changing constants accordingly. Relations (21)-(24) prove the claim for the optimal solutions deviation.

6. The case with stochastic constraints. In this section we assume that there are stochastic constraints ($I\neq\emptyset$). The presence of stochastic constraints requires relaxing the feasibility constraints, since the concentration phenomenon is guaranteed only over a confidence band. Consequently, the convergence rate of the solutions of the SAA problem (4) to solutions of problem (1) will inevitably have two error components, which depend on the properties of $(\{F_i\}_{i\in I_0},X,Y)$. The first, as in the case of a fixed feasible set, is the variance associated to the estimation error when using sample average approximation. The second is the deterministic error associated to set approximation when using constraint relaxation.¹⁰ The first error above is controlled via concentration of measure. We next make some comments regarding the error associated to set approximation. When inner approximating $X$ by $X_{-\gamma}$ for some tolerance $\gamma>0$, we will use the error

$\mathrm{gap}(\gamma) := \min_{X_{-\gamma}} f - f^* \ge 0.$

The use of an interior approximation has the advantage of guaranteeing exact feasibility with high probability for a convex feasible set satisfying the Slater constraint qualification. This will be presented in Section 6.2.

¹⁰ This is (in a loose sense) methodologically analogous to the approach used in nonparametric regression, where the statistician imposes the choice of a finite base of functions which approximates the infinite-dimensional family of functions where regression ideally should be made.
When outer approximating $X$ by $X_\gamma$ for some tolerance $\gamma>0$, we will use the error

$\mathrm{Gap}(\gamma) := f^* - \min_{X_\gamma} f \ge 0.$

The corresponding benchmark problem is $\min_{X_\gamma} f$, whose $\epsilon$-approximate solution set is denoted by $(X_\gamma)^*_\epsilon$. The use of an exterior approximation does not guarantee exact feasibility. Moreover, in general $X_\gamma$ may be "far" from $X$. However, if $X$ is metric regular as in Definition 1.1, then $X_\gamma\subset\breve X_{c\gamma}$ and we can obtain approximate feasibility in high probability in terms of the set $\breve X_{c\gamma}=(X+c\gamma\mathbb{B})\cap Y$. Precisely, rates can be obtained in terms of the Hausdorff distance $\mathbb{H}(\breve X_{c\gamma},X)\le c\gamma$. In introducing approximate feasibility, we do not require convexity nor an interior point condition, but we will require metric regularity of $X$. In this setting, we also explore convexity as a useful property for concentration, but without the strict feasibility of the Slater condition. This will be presented in Section 6.3.

We note that the error $\mathrm{gap}(\gamma)$ vanishes as $\gamma\to 0^+$ if $X$ satisfies the Slater condition, while $\mathrm{Gap}(\gamma)$ vanishes as $\gamma\to 0^+$ if it is metric regular. Moreover, a bound can be obtained. Indeed, if $L:=\mathsf{P}L_0(\cdot)$, the Hölder continuity of $f$ and the metric regularity of $X$ imply, for any $\gamma>0$,

(25)  $0 \le \mathrm{Gap}(\gamma) \le L(c\gamma)^{\alpha_0},$

while, if $X$ satisfies the Slater condition (see Definition 6.1), then for all sufficiently small $\gamma>0$,

(26)  $0 \le \mathrm{gap}(\gamma) \le L(c\gamma)^{\alpha_0},$

where we have used that if $X$ satisfies the Slater condition then it is metric regular as in Definition 1.1 (by Robinson's theorem [40]).¹¹ Anyhow, the above global bounds are rather conservative, since $\mathrm{gap}(\gamma)$ and $\mathrm{Gap}(\gamma)$ only depend on the local Hölder continuity of $f$ around a neighborhood¹² of a solution in $X^*$ which is also in the frontier $\partial X$. A more drastic manifestation of this local behavior is seen as follows: if there exists an interior solution, then $X^*\cap X_{-\gamma}\neq\emptyset$ for sufficiently small $\gamma>0$ and consequently $\mathrm{gap}(\gamma)=0$. Analogously, $\mathrm{Gap}(\gamma)=0$ if $(X_\gamma)^*\cap X\neq\emptyset$ for sufficiently small $\gamma>0$.

6.1. A feasibility lemma for convex feasible sets. We start with a lemma for the large class of problems where the feasible set is convex. This will be used in Subsection 6.2, when the Slater condition is satisfied, and in Subsection 6.3.1, for a convex metric regular set.

¹¹ The precise argument is as follows. If $(X_\gamma)^*\cap X\neq\emptyset$ then it is not difficult to check that $\mathrm{Gap}(\gamma)=0$. Otherwise, there exist $x^*_\gamma\in(X_\gamma)^*\cap X^c$ and $x\in X$ such that $\|x^*_\gamma-x\| = d(x^*_\gamma,X)\le c\gamma$, by the metric regularity of $X$ and the definition of $X_\gamma$. Hence, by Hölder continuity, $\mathrm{Gap}(\gamma)=f^*-f(x^*_\gamma)\le f(x)-f(x^*_\gamma)\le L\|x-x^*_\gamma\|^{\alpha_0}\le L(c\gamma)^{\alpha_0}$. To bound $\mathrm{gap}(\gamma)$ for $X$ satisfying the Slater condition, we proceed analogously. If $X^*\cap X_{-\gamma}\neq\emptyset$ then $\mathrm{gap}(\gamma)=0$. Otherwise, we use an analogous argument as above, but using the metric regularity of $X_{-\gamma}$ as concluded from Robinson's theorem [40] (noting that it has the same metric regularity constant $c$ as $X$).
¹² That is, with respect to a Hölder modulus $L_\gamma$ potentially satisfying $L_\gamma\ll L$.
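As a toy illustration of these bounds (ours, not taken from the paper), let $Y=[0,2]\subset\mathbb{R}$, $f(x)=x$ and a single constraint $f_1(x)=1-x$, so that $X=[1,2]$, $f^*=1$ and $X$ is metric regular with constant $c=1$, since $d(x,X)=[1-x]_+$ on $Y$. Then, for $\gamma\in(0,1]$, $X_\gamma=[1-\gamma,2]$ and $X_{-\gamma}=[1+\gamma,2]$, so $\mathrm{Gap}(\gamma)=1-(1-\gamma)=\gamma$ and $\mathrm{gap}(\gamma)=(1+\gamma)-1=\gamma$, which exactly matches the bounds (25)-(26) with $L=1$ and $\alpha_0=1$.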
Lemma 2 (Approximation of a convex feasible set). Suppose $X_\epsilon\neq\emptyset$ for some $\epsilon\in\mathbb{R}$ and $\hat F_i$ and $f_i$ are convex on $Y$ for all $i\in I$. Then, for $\epsilon<\gamma$ and $y\in X_\epsilon$,

$\hat A_{\epsilon,\gamma}(y) := \bigcap_{i\in I}\Big[\delta_{f_i,\hat F_i}(X_{i,\gamma})+\delta_{\hat F_i,f_i}(y)<\gamma-\epsilon\Big]\cap\Big[\delta_{f_i,\hat F_i}(X_{i,\gamma})<\gamma-\hat\epsilon_i\Big] \subset \big[\hat X\subset X_\gamma\big].$

Proof. Let $y\in X_\epsilon$ and $\gamma>\epsilon$ be as stated in the lemma. We observe that such $y$ exists, since $X_\epsilon\neq\emptyset$ by assumption. To prove the inclusion $\hat X\subset X_\gamma$ on the event $\hat A_{\epsilon,\gamma}(y)$, we will prove, equivalently, that on $\hat A_{\epsilon,\gamma}(y)$ the inclusion $Y\setminus X_\gamma\subset Y\setminus\hat X$ holds.

Let $x\in Y\setminus X_\gamma$. By definition of $X_\gamma$ and convexity, we have $y\in\mathring{X}_\gamma$, since $y\in X_\epsilon$ and $\epsilon<\gamma$. Again by convexity of $X_\gamma$, there exists $\tilde x\in\partial X_\gamma$ lying in the open segment $(y,x):=\{ty+(1-t)x : t\in(0,1)\}$, implying that there exists $i\in I$ such that $f_i(\tilde x)=\gamma$. Hence, by definition of $\delta_{f_i,\hat F_i}(X_{i,\gamma})$,

(27)  $\hat F_i(\tilde x) \ge \gamma - \delta_{f_i,\hat F_i}(X_{i,\gamma}).$

Using the definition of $\hat A_{\epsilon,\gamma}(y)$, the identity $\hat F_i(y)=\delta_{\hat F_i,f_i}(y)+f_i(y)$ and (27), we have that, on the event $\hat A_{\epsilon,\gamma}(y)$,

$\hat F_i(\tilde x)-\hat F_i(y) \ge \gamma-\delta_{f_i,\hat F_i}(X_{i,\gamma})-\delta_{\hat F_i,f_i}(y)-f_i(y) \ge \gamma-\delta_{f_i,\hat F_i}(X_{i,\gamma})-\delta_{\hat F_i,f_i}(y)-\epsilon > 0,$

where we used that $y\in X_\epsilon$. The above relation, the convexity of $\hat F_i$ on $Y$ and $\tilde x\in(y,x)$ imply that

$\hat F_i(x)-\hat F_i(\tilde x) > 0.$

Hence, by the above relation, (27) and $\delta_{f_i,\hat F_i}(X_{i,\gamma})\le\gamma-\hat\epsilon_i$, which holds on $\hat A_{\epsilon,\gamma}(y)$, we get that

$\hat F_i(x) > \gamma-\delta_{f_i,\hat F_i}(X_{i,\gamma}) \ge \hat\epsilon_i.$

By definition of $\hat X$, the above relation implies that $x\in Y\setminus\hat X$, as claimed.
6.2. Interior approximation of a feasible set with Slater constraint qualification. In this subsection we will consider an interior set approximation of $X$ satisfying the Slater constraint qualification (SCQ).

Definition 6.1 (Slater constraint qualification). The set $X$ satisfies the SCQ if there exists $\bar x\in Y$ satisfying

$\sup_{i\in I} f_i(\bar x) < 0,$

and a.s. for all $i\in I$, $F_i(\cdot,\xi)$ is convex on $Y$.

We use the following notation: for $i\in I$,

$\mathring\epsilon_i(\bar x) := -f_i(\bar x), \qquad \mathring\epsilon(\bar x) := \inf_{i\in I}\mathring\epsilon_i(\bar x).$

Definition 6.1 implies that $X$ is a non-empty compact convex set and that, for some $\bar x\in Y$, $\mathring\epsilon(\bar x)>0$ and, for any $\gamma\in(0,\mathring\epsilon(\bar x)]$, $X_{-\gamma}$ is a non-empty compact convex set such that $X_{-\gamma}\subset\mathring X\subset X$.
Lemma 3 (Interior approximation of a feasible set with SCQ). Suppose $X$ satisfies the SCQ and, for all $i\in I$, $\hat F_i$ is convex on $Y$. Then, for any $\gamma\in(0,\mathring\epsilon(\bar x)]$ and $y\in X_{-\gamma}$,

$\hat A_{-\gamma,0}(y) \subset \big[\hat X\subset X\big],$
$\hat B_\gamma := \bigcap_{i\in I}\big[\delta_{\hat F_i,f_i}(X)\le\gamma+\hat\epsilon_i\big] \subset \big[X_{-\gamma}\subset\hat X\big],$

where $\hat A_{-\gamma,0}(y)$ is as in Lemma 2.

Proof. Let $\gamma\in(0,\mathring\epsilon(\bar x)]$ and $y\in X_{-\gamma}$, which exist since $X_{-\gamma}\neq\emptyset$ by Definition 6.1. The inclusion $\hat A_{-\gamma,0}(y)\subset[\hat X\subset X]$ follows from Lemma 2. For any $\gamma>0$, the inclusion $\hat B_\gamma\subset[X_{-\gamma}\subset\hat X]$ follows from the definitions of $\hat X$ and $\hat B_\gamma$. Indeed, for all $x\in X_{-\gamma}$,

$\hat F_i(x) \le \delta_{\hat F_i,f_i}(X)+f_i(x) \le \delta_{\hat F_i,f_i}(X)-\gamma \le \hat\epsilon_i, \quad \forall i\in I.$

We have thus proved the claims.

Lemma 4 (Optimality deviation for interior approximation of a feasible set with SCQ). Suppose $X$ satisfies the SCQ and, for all $i\in I$, $\hat F_i$ is convex on $Y$. Then, for any $t,\delta\ge 0$, $\gamma\in(0,\mathring\epsilon(\bar x)]$, $y\in X_{-\gamma}$ and $z\in Y$,

$\hat A_{-\gamma,0}(y)\cap\hat B_\gamma\cap\Big[\exists x\in\hat X^*_\delta,\ f(x)>f^*+\mathrm{gap}(\gamma)+t+\delta\Big] \subset \Big[\sup_{x\in X}|\tilde\delta_{f,\hat F}(x,z)|>\frac t2\Big],$

where $\hat A_{-\gamma,0}(y)$ and $\hat B_\gamma$ are as in Lemmas 2-3.
Proof. Let the variables be chosen as stated in the lemma and let $E$ denote the event on the left hand side of the above inclusion. Let also $y^*\in(X_{-\gamma})^*$. On the event $E$, let $x\in\hat X^*_\delta$ be such that $f(x)>f^*+\mathrm{gap}(\gamma)+t+\delta$. Then

(28)  $\tilde\delta_{\hat F,f}(x,y^*) = \hat F(x)-\hat F(y^*)-[f(x)-f(y^*)] \le \delta-[f(x)-f(y^*)] < \delta-(t+\delta) = -t,$

where in the first inequality we used $\hat F(x)\le\hat F(y^*)+\delta$, since $x\in\hat X^*_\delta$ and $y^*\in X_{-\gamma}\subset\hat X$ on the event $\hat B_\gamma$ by Lemma 3, while in the second inequality we used $f(x)>f^*+\mathrm{gap}(\gamma)+t+\delta = f(y^*)+t+\delta$, since $y^*\in(X_{-\gamma})^*$. On $E$, the above relation implies $\tilde\delta_{f,\hat F}(x,y^*)>t$, by the definition of $\tilde\delta_{f,\hat F}(x,y^*)$. Since $\tilde\delta_{f,\hat F}(x,y^*)=\tilde\delta_{f,\hat F}(x,z)+\tilde\delta_{f,\hat F}(z,y^*)$, we also have

$\tilde\delta_{f,\hat F}(x,z)>\frac t2 \quad\text{or}\quad -\tilde\delta_{f,\hat F}(y^*,z)=\tilde\delta_{f,\hat F}(z,y^*)>\frac t2.$

Additionally, we have $x\in\hat X\subset X$, since $\hat A_{-\gamma,0}(y)\subset[\hat X\subset X]$ by Lemma 3. This conclusion, the above inequalities, $y^*\in(X_{-\gamma})^*\subset X$ and $x\in X$ imply

$\sup_{u\in X}|\tilde\delta_{f,\hat F}(u,z)| > \frac t2,$

proving the claim.
The statement of the next theorem requires the following variance definitions associated to feasibility: we set C := {Fi }i∈I for the constraint data and for γ ≥ 0, x, y ∈ Y and i ∈ I we define q σ ˘N (x, C) := max (29) σ bi,N (x)2 + σi (x)2 , i∈I q b 2 + L2 , (30) σ N (γ, C) := max Aαi (Xi,γ ) L i i,N i∈I
(31)
σN (γ, C, x, y) := max {σ N (γ, C), σ ˘N (x, C), σ ˘N (y, C)} ,
where we refer to definitions in Section 3 and definitions (12) and (14)-(15). Theorem 5 (Exponential convergence of SAA for interior approximation of feasible set with SCQ). Suppose X satisfies the SCQ and I = [m] for some m ∈ N. Let ǫ ∈ (0,˚ ǫ(¯ x)/2], and b ǫi,N = −ǫ, ∀i ∈ I.
Let p ∈ (0, 1], y ∈ X−2ǫ and z ∈ Y . Then, with probability at least 1 − p, the optimal value deviation holds for
f ∗ − ǫ ≤ FbN∗ ≤ f ∗ + gap(2ǫ) + ǫ,
N ≥ O(ǫ−2 ) max σN (0, C, y, z)2 , σ N (X, C)2 , σ N (X)2 , σ ˘N (z)2 ln m + ln p−1 ,
the approximate optimality deviation
b∗ ⊂ X∗ X N,ǫ gap(2ǫ)+2ǫ ,
holds for N ≥ O(ǫ−2 ) max σN (0, C, y, z)2 , σ N (X, C)2 , σ N (X)2 ln m + ln p−1 , and the exact feasibility bN ⊂ X, X holds for N ≥ O(ǫ−2 )σN (0, C, y, z)2 ln m + ln p−1 . Proof. In the following, we will consider the deterministic Lemmas 3-4 for the particular case of random perturbations given in the SAA problem (4), that is, Fb := FbN and for all i ∈ I, Fbi := Fbi,N ,
b ǫi := b ǫi,N ,
b = X bN . In this setting, we will use the notations AN,γ (y) := A b−γ,0 (y), so that X bγ and ǫN := ǫ as defined in Lemmas 2-3. BN,γ := B Let ǫ > 0, p ∈ (0, 1], y ∈ X−2ǫ and z ∈ Y as stated in the theorem. We invoke Lemmas 3-4 for t := ǫ, γ := 2ǫ and δ ≥ 0 to be determined later. In the following, C > 0 is a constant that may change from line to line and β ∈ (0, 1] is a constant to be determined later in terms of m and p depending on N as stated in the theorem. We shall define the following events h i b ∗ , f (x∗ ) > f ∗ + gap(γ) + t + δ , E1 := AN,γ (y), E2 := BN,γ , E3,δ := ∃x∗N ∈ X N,δ N t c , E5 := [δb0,N (X) ≤ ǫ], Eo := E1 ∩ E3,0 ∩ E5 , E4 := sup |δeN (x, z)| > 2 x∈X c Es,δ := E1 ∩ E3,δ .
By Lemma 4 we get P[E3,δ ] = P[E1 ∩ E2 ∩ E3,δ ] + P[(E1 ∩ E2 )c ∩ E3,δ ] ≤ P[E4 ] + P[E1c ] + P[E2c ].
(32)
2 2 b 2 ≤ CL2 , σ By the SLLN, there exists C > 0 such that a.s. L i bi,N (z) ≤ Cσi (z) i,N 2 2 and σ bi,N (y) ≤ Cσi (y) for all i ∈ [m]. Hence, up to a constant, the event (33) EN,a := N ≥ O(ǫ−2 )σN (0, C, y, z)2 ln β −1 has probability 1.
By definition of AN,γ (y), the union bound and Theorems 1 and 3 we get ! ! [h [h γi γi c δ i,N (y) ≥ ∩ EN,a + P ∩ EN,a P[E1 ∩ EN,a ] ≤ P δ i,N (Xi,0 ) ≥ 2 2 i∈I i∈I ! [ +P δ i,N (Xi,0 ) ≥ ǫ ∩ EN,a i∈I
≤2 ≤2
m X
P
i=1
m X
m X P δ i,N (y) ≥ ǫ ∩ EN,a δ i,N (Xi,0 ) ≥ ǫ ∩ EN,a +
"
P
i=1 m X
i=1
sup
x∈Xi,0
! # ǫ δ i,N (x) − δ i,N (z) ≥ ∩ EN,a 2
m h X ǫi P δ i,N (y) ≥ ǫ ∩ EN,a ∩ EN,a + δ i,N (z) ≥ 2 i=1 i=1 ( ) r m X (1 + ln β −1 ) b 2 P sup δ i,N (x) − δ i,N (z) ≥ O(1)Aαi (Xi,0 ) ≤2 Li,N + L2i N x∈Xi,0 i=1 ) ( r m i X (1 + ln β −1 ) h 2 2 +2 P δ i,N (z) ≥ O(1) σ bi,N (z) + σi (z) N i=1 ( ) r m i X (1 + ln β −1 ) h 2 P δ i,N (y) ≥ O(1) + σ bi,N (y) + σi2 (y) N i=1
+2
P
≤ (2 + 2 + 1)mβ = 5mβ.
(34)
Similarly, by definition of BN,γ , P[E2c
∩ EN,a ] = P
[ ǫi,N ∩ EN,a δ i,N (X) > γ + b
i∈I
!
≤
m X i=1
P
δ i,N (X) > ǫ ∩ EN,a
X m m h X ǫ ǫi P δ i,N (z) > ∩ EN,a + ∩ EN,a P sup δ i,N (x) − δ i,N (z) > ≤ 2 2 x∈X i=1 i=1 ( ) r m X (1 + ln β −1 ) b 2 2 P sup δ i,N (x) − δ i,N (z) ≥ O(1)Aαi (X) ≤ Li,N + Li N x∈X i=1 ( ) r m i X (1 + ln β −1 ) h 2 P δ i,N (z) ≥ O(1) (35) + ≤ 2mβ. σ bi,N (z) + σi2 (z) N i=1
b 2 ≤ CL2 . Hence, up to a By the SLLN, there exists C > 0 such that a.e. L 0 0,N constant, the event EN,b := N ≥ O(ǫ−2 )σ N (X)2 ln β −1 has probability 1. (36) From Theorem 1,
P[E4 ∩ EN,b ] = P ≤P (37)
(
≤ β.
sup |δeN (x, z)| >
x∈X
ǫ ∩ EN,b 2 r
sup |δeN (x, z)| ≥ O(1)Aα0 (X)
x∈X
(1 + ln β −1 ) b 2 L0,N + L20 N
)
PART 1 (Optimal value deviation): Set δ = 0. On the event Eo , we have for any b∗ , x∗N ∈ X N f ∗ − FbN∗ ≤ f (x∗N ) − FbN∗ = f (x∗N ) − FbN (x∗N ) ≤ δ 0,N (X) ≤ ǫ,
using that, on AN,γ (y), x∗N ∈ X by Lemma 3 in first and second inequalities and b∗ , Eo ⊂ E5 in last inequality. Moreover, on the event E0 , for any x∗N ∈ X N FbN∗ − f ∗ = FbN (x∗N ) − f (x∗N ) + f (x∗N ) − f (x∗ )
≤ δ 0,N (X) + gap(γ) + t ≤ gap(2ǫ) + ǫ,
c using that on AN,γ (y), x∗N ∈ X by Lemma 3 and Eo ⊂ E3,0 in first inequality and Eo ⊂ E5 in last inequality. Hence h i (38) Eo ⊂ f ∗ − ǫ ≤ FbN∗ ≤ f ∗ + gap(2ǫ) + ǫ .
Using (32), we get
P[Eoc ] ≤ P[E1c ] + P[E3,0 ] + P[E5c ] ≤ 2P[E1c ] + P[E2c ] + P[E4 ] + P[E5c ].
(39)
b 2 ≤ CL2 and σ By the SLLN, there exists C > 0 such that a.s. L b0,N (z)2 ≤ 0 0,N 2 Cσ0 (z) . Hence, up to a constant, the event 2 (40) EN,c := N ≥ O(ǫ−2 ) max{σ N (X)2 , σ ˘N (z)} ln β −1 has probability 1. From Theorems 1-3 we get P[E5c ∩ EN,c ] = P sup δb0,N (x) > ǫ ∩ EN,c x∈X h ǫi ǫ ∩ EN,c + P δb0,N (z) > ∩ EN,c ≤P sup |δe0,N (x, z)| > 2 2 x∈X ( ) r (1 + ln β −1 ) b 2 2 e ≤ P sup |δ0,N (x, z)| ≥ O(1)Aα0 (X) L0,N + L0 N x∈X ) ( r i −1 ) h (1 + ln β 2 (z) + σ 2 (z) σ b0,N +P δb0,N (z) ≥ O(1) 0 N
(41)
≤ 2β.
From (33)-(41) we conclude that P[Eoc ] ≤ 2 · 5mβ + 2mβ + β + 2β = (12m + 3) β ≤ p, by choosing β and changing constants accordingly. The above relation and (38) prove the claim for the optimal value deviation. PART2 (Optimal solutions deviation): Set δ = ǫ. On the event Es,δ , for any b∗ , x∗N ∈ X N,δ f (x∗N ) ≤ f ∗ + gap(2ǫ) + t + δ = f ∗ + gap(2ǫ) + 2ǫ, c bN ⊂ X since Es,δ ⊂ AN,γ (y). since Es,δ ⊂ E3,δ ; also x∗N ∈ X since by Lemma 3, X Hence i h b∗ ⊂ X∗ (42) Es,δ ⊂ X N,ǫ gap(2ǫ)+2ǫ .
Using (32), we get
c P[Es,δ ] ≤ P[E1c ] + P[E3,0 ]
≤ 2P[E1c ] + P[E2c ] + P[E4 ].
From the above relation and (33)-(37) we conclude that c P[Es,δ ] ≤ 2 · 5mβ + 2mβ + β = (12m + 1) β ≤ p,
by choosing $\beta$ and changing constants accordingly. The above relation and (42) prove the claim for the optimal solutions deviation.

PART 3 (Exact feasibility): by Lemma 3,
\[
A_{N,\gamma}(y) \subset \big[\widehat X_N \subset X\big] =: E_f.
\]
From the above and (33)-(34),
\[
\mathbb{P}[E_f^c] \le \mathbb{P}[E_1^c] \le 2\cdot 5m\beta = 10m\beta \le p,
\]
by choosing $\beta$ and changing constants accordingly. The above relation proves the claim for exact feasibility.

Remark 1 (Simultaneous approximate optimality and exact feasibility bounds for interior approximation of feasible set with SCQ). In particular, Theorem 5 states that, if $X$ satisfies the SCQ and $I=[m]$ for some $m\in\mathbb{N}$, and we set $\epsilon \in (0, \mathring\epsilon(\bar x)/2]$ and
\[
\widehat\epsilon_{i,N} = -\epsilon,\qquad \forall i \in I,
\]
$p\in(0,1]$, $y\in X_{-2\epsilon}$ and $z\in Y$, then, with probability at least $1-p$, the exact feasibility and approximate optimality deviations
\[
\widehat X_N \subset X,\qquad f(x_N^*) \le f^* + \mathrm{gap}(2\epsilon) + 2\epsilon,\qquad \forall x_N^* \in \widehat X_N^*,
\]
hold for $N \ge O(\epsilon^{-2})\max\big\{\sigma_N(0,C,y,z)^2, \bar\sigma_N(X,C)^2, \bar\sigma_N(X)^2\big\}\big(\ln m + \ln p^{-1}\big)$.
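To make the setting of Remark 1 concrete, the following self-contained Python sketch (a hypothetical toy instance of ours, not taken from the paper: the distributions, function names and tolerance are illustrative assumptions) solves an SAA problem over a grid of the hard constraint $Y=[0,1]$ with the interior tolerance $\widehat\epsilon_{i,N}=-\epsilon$, using objective noise that is heavy-tailed but has a finite second moment, in the spirit of the paper's assumptions.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy instance (ours, not the paper's):
#   Y = [0, 1],  f0(x) = E[(x - xi0)^2],  f1(x) = E[x * xi1] - 0.6,
# so the true soft feasible set is X = [0, 0.6] and the true solution is
# x* = 0.6 (the unconstrained minimizer E[xi0] = 0.8 is infeasible).
# xi0 is heavy-tailed with finite variance (Student-t, 3 dof), xi1 exponential.
def sample_xi(n):
    xi0 = 0.8 + rng.standard_t(df=3, size=n)
    xi1 = rng.exponential(scale=1.0, size=n)
    return xi0, xi1

def saa_solve(n, eps, grid_size=601):
    """SAA over a grid of Y = [0, 1] with the interior tolerance
    eps_hat_{i,N} = -eps of Remark 1 (the exterior setting would use +eps)."""
    xi0, xi1 = sample_xi(n)
    x = np.linspace(0.0, 1.0, grid_size)
    # SAA objective (1/N) sum_j (x - xi0_j)^2, expanded in closed form.
    F0 = x**2 - 2.0 * x * xi0.mean() + (xi0**2).mean()
    # SAA constraint (1/N) sum_j x * xi1_j - 0.6.
    F1 = x * xi1.mean() - 0.6
    feasible = F1 <= -eps            # tightened (interior) soft constraint
    return x[np.argmin(np.where(feasible, F0, np.inf))]

eps = 0.05
for n in (100, 1000, 10000):
    sols = [saa_solve(n, eps) for _ in range(50)]
    exact_feas = np.mean([s <= 0.6 for s in sols])
    print(f"N={n:6d}  mean SAA solution={np.mean(sols):.3f}  "
          f"fraction exactly feasible={exact_feas:.2f}")
\end{verbatim}
On this instance, as $N$ grows the SAA solution is exactly feasible for the true $X$ in an increasing fraction of replications while staying within an $O(\epsilon)$ margin of optimality, which is the qualitative behavior that Remark 1 quantifies non-asymptotically.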
6.3. Exterior approximation of a metric regular feasible set. In this subsection we will consider an exterior set approximation assuming metric regularity of $X$ as stated in Definition 1.1.

Lemma 5 (Exterior approximation of feasible set). Define
\[
\bar\epsilon := \sup_{i\in I}\big[\delta_{f_i,\widehat F_i}(Y) + \widehat\epsilon_i\big].
\]
Then
\[
\widehat B := \bigcap_{i\in I}\big[\delta_{\widehat F_i, f_i}(Y) \le \widehat\epsilon_i\big] \subset \big[X \subset \widehat X \subset X_{\bar\epsilon}\big] \cap \big[\bar\epsilon \ge 0\big].
\]
Proof. Define, for every $i\in I$, the following quantities:
\[
\bar\epsilon_i := \widehat\epsilon_i + \delta_{f_i,\widehat F_i}(Y),\qquad \bar\epsilon := \sup_{i\in I}\bar\epsilon_i,\qquad
\underline\epsilon_i := \widehat\epsilon_i - \delta_{\widehat F_i,f_i}(Y),\qquad \underline\epsilon := \inf_{i\in I}\underline\epsilon_i.
\]
Given $x \in \widehat X \subset Y$ and $i\in I$,
\[
f_i(x) \le f_i(x) - \widehat F_i(x) + \widehat\epsilon_i \le \delta_{f_i,\widehat F_i}(Y) + \widehat\epsilon_i \le \bar\epsilon,
\]
implying that $x\in X_{\bar\epsilon}$. Hence $\widehat X \subset X_{\bar\epsilon}$.
Similarly, given $x\in X_{\underline\epsilon}\subset Y$ and $i\in I$,
\[
\widehat F_i(x) \le \widehat F_i(x) - f_i(x) + \underline\epsilon_i \le \delta_{\widehat F_i,f_i}(Y) + \underline\epsilon_i = \widehat\epsilon_i,
\]
implying that $x\in\widehat X$. Hence $X_{\underline\epsilon}\subset\widehat X$. From this conclusion and the previous paragraph, we obtain that
\[
(43)\qquad X_{\underline\epsilon} \subset \widehat X \subset X_{\bar\epsilon}.
\]
On the event $\widehat B$, we have $\underline\epsilon_i = \widehat\epsilon_i - \delta_{\widehat F_i,f_i}(Y) \ge 0$ for all $i\in I$. Hence
\[
\underline\epsilon \ge 0,\qquad X \subset X_{\underline\epsilon}.
\]
Using the above conclusion and (43) we get that $X \subset \widehat X \subset X_{\bar\epsilon}$. Moreover, on the event $\widehat B$ we have, for all $i\in I$,
\[
\bar\epsilon_i = \delta_{f_i,\widehat F_i}(Y) + \widehat\epsilon_i \ge \delta_{f_i,\widehat F_i}(Y) + \delta_{\widehat F_i,f_i}(Y)
= \sup_{x\in Y}(f_i - \widehat F_i)(x) - \inf_{x\in Y}(f_i - \widehat F_i)(x) \ge 0,
\]
implying $\bar\epsilon \ge 0$. We have thus proved the claim.
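As a quick sanity check of Lemma 5 (an illustrative instance of ours, not from the paper, written under the convention $\delta_{f,g}(Y)=\sup_{Y}(f-g)$ used in the proof above): take $Y=[-1,1]$, a single constraint $f_1(x)=x$, so that $X=[-1,0]$, a deterministic perturbation $\widehat F_1=f_1+0.1$ and tolerance $\widehat\epsilon_1=0.2$. Then $\delta_{f_1,\widehat F_1}(Y)=-0.1$ and $\delta_{\widehat F_1,f_1}(Y)=0.1$, so $\underline\epsilon_1=\bar\epsilon_1=0.1$ and the event $\widehat B$ holds; accordingly,
\[
X_{\underline\epsilon}=[-1,0.1]=\widehat X=\{x\in Y:\,x+0.1\le 0.2\}=X_{\bar\epsilon},
\]
so the sandwich (43) holds (here with equality) and $X=[-1,0]\subset\widehat X$, as the lemma predicts.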
Lemma 6 (Optimality deviation for exterior approximation of feasible set). For any $t,\gamma,\delta\ge 0$ and $x^*\in X^*$,
\[
\big[X \subset \widehat X \subset X_\gamma\big] \cap \big[\exists x\in\widehat X_\delta^*,\; f(x) > f^* + t + \delta\big]
\subset \Big[\sup_{x\in X_\gamma}\widetilde\delta_{f,\widehat F}(x,x^*) > t\Big].
\]
Proof. Let the variables be chosen as stated in the lemma and let $E$ denote the event on the left-hand side of the above inclusion. On $E$, let $x\in\widehat X_\delta^*$ be such that $f(x) > f^* + t + \delta$. On such event,
\[
\widetilde\delta_{\widehat F,f}(x,x^*) = \widehat F(x) - \widehat F(x^*) - [f(x)-f(x^*)] \le \delta - [f(x)-f(x^*)] < \delta - (t+\delta) = -t,
\]
where in the first inequality we used $\widehat F(x) \le \widehat F(x^*) + \delta$, since $x\in\widehat X_\delta^*$ and $x^*\in X\subset\widehat X$ by definition of $E$, while in the second inequality we used $f(x) > f^* + t + \delta = f(x^*) + t + \delta$ since $x^*\in X^*$.
On the event $E$, the above inequality and $x\in\widehat X\subset X_\gamma$ (by definition of $E$) imply
\[
\sup_{u\in X_\gamma}\widetilde\delta_{f,\widehat F}(u,x^*) \ge \widetilde\delta_{f,\widehat F}(x,x^*) > t,
\]
proving the required claim.
The following proposition will be the main step to obtain exponential non-asymptotic bounds for metric regular feasible sets. Its statement requires the following variance definitions associated to feasibility: if we set $C := \{F_i\}_{i\in I}$ for the constraint data, we define
\[
(44)\qquad \bar\sigma_N(Y,C) := \max_{i\in I}\mathsf{A}_{\alpha_i}(Y)\sqrt{\widehat L_{i,N}^2 + L_i^2},
\]
\[
(45)\qquad \sigma_N(C,Y,x) := \max\big\{\bar\sigma_N(Y,C),\ \breve\sigma_N(x,C)\big\},
\]
where we refer to definitions in Section 3 and definitions (12), (14)-(15) and (29).

Proposition 1. Suppose $I=[m]$ for some $m\in\mathbb{N}$. Let $\epsilon>0$ and set
\[
\widehat\epsilon_{i,N} := \epsilon,\qquad\forall i\in I.
\]
Let $p\in(0,1]$, $z\in Y$. Then, with probability at least $1-p$, the optimal value deviation
\[
f^* - \epsilon - \mathrm{Gap}(2\epsilon) \le \widehat F_N^* \le f^* + 2\epsilon
\]
holds for $N \ge O(\epsilon^{-2})\max\big\{\sigma_N(C,Y,z)^2, \bar\sigma_N(X_{2\epsilon})^2, \breve\sigma_N(z)^2\big\}\big(\ln m + \ln p^{-1}\big)$, the approximate optimality deviations
\[
\widehat X_{N,\epsilon}^* \subset (X_{2\epsilon})^*_{\mathrm{Gap}(2\epsilon)+2\epsilon},\qquad
f^* - \mathrm{Gap}(2\epsilon) \le f(x_N^*) \le f^* + 2\epsilon,\qquad\forall x_N^*\in\widehat X_{N,\epsilon}^*,
\]
hold for $N \ge O(\epsilon^{-2})\max\big\{\sigma_N(C,Y,z)^2, \bar\sigma_N(X_{2\epsilon})^2\big\}\big(\ln m + \ln p^{-1}\big)$, while the approximate feasibility
\[
X \subset \widehat X_N \subset X_{2\epsilon}
\]
holds for $N \ge O(\epsilon^{-2})\sigma_N(C,Y,z)^2\big(\ln m + \ln p^{-1}\big)$.
Proof. In the following, we will consider the deterministic Lemmas 5 and 6 for the particular case of random perturbations given in the SAA problem (4), that is, $\widehat F := \widehat F_N$ and, for all $i\in I$,
\[
\widehat F_i := \widehat F_{i,N},\qquad \widehat\epsilon_i := \widehat\epsilon_{i,N},
\]
so that $\widehat X = \widehat X_N$. In this setting, we will denote $\widehat B_N := \widehat B$ and $\bar\epsilon_N := \bar\epsilon$ as defined in Lemma 5. Let $\epsilon>0$, $p\in(0,1]$ and $z\in Y$ as stated in the proposition. We invoke Lemmas 5-6 for $t:=\epsilon$, $\gamma:=2\epsilon$, some $x^*\in X^*$ and $\delta\ge 0$ to be determined later. In the following, $C>0$ is a constant that may change from line to line and $\beta\in(0,1]$ is a constant to be determined later in terms of $m$ and $p$ depending on $N$ as stated in the theorem. We shall define the events
\begin{align*}
&E_1 := \widehat B_N,\qquad E_2 := [\bar\epsilon_N \le \gamma],\qquad
E_{3,\delta} := \big[\exists x_N^*\in\widehat X_{N,\delta}^*,\; f(x_N^*) > f^* + t + \delta\big],\\
&E_4 := \Big[\sup_{x\in X_\gamma}\widetilde\delta_N(x,x^*) > t\Big],\qquad
E_5 := \big[\widehat\delta_{0,N}(X_\gamma) \le \epsilon\big],\\
&E_o := E_1\cap E_2\cap E_{3,0}^c\cap E_5,\qquad
E_{s,\delta} := E_1\cap E_2\cap E_{3,\delta}^c.
\end{align*}
We have
\begin{align*}
\mathbb{P}[E_{3,\delta}] &= \mathbb{P}[E_1\cap E_2\cap E_{3,\delta}] + \mathbb{P}[(E_1\cap E_2)^c\cap E_{3,\delta}]\\
&\le \mathbb{P}\big(\big[X\subset\widehat X_N\subset X_\gamma\big]\cap E_{3,\delta}\big) + \mathbb{P}[E_1^c] + \mathbb{P}[E_2^c]\\
(46)\qquad &\le \mathbb{P}[E_4] + \mathbb{P}[E_1^c] + \mathbb{P}[E_2^c],
\end{align*}
where we used the union bound in the equality, Lemma 5 in the first inequality and Lemma 6 in the last inequality.
By the SLLN, there exists $C>0$ such that a.s. $\widehat L_{i,N}^2\le CL_i^2$, $\widehat\sigma_{i,N}^2(z)\le C\sigma_i^2(z)$ and $\widehat\sigma_{i,N}^2(y)\le C\sigma_i^2(y)$ for all $i\in[m]$. Hence, up to a constant, the event
\[
(47)\qquad E_{N,a} := \big[N\ge O(\epsilon^{-2})\sigma_N(C,Y,z)^2\ln\beta^{-1}\big]\quad\text{has probability }1.
\]
By definition of $\widehat B_N$, the union bound and Theorems 1 and 3,
\begin{align*}
\mathbb{P}[E_1^c\cap E_{N,a}] &= \mathbb{P}\Bigg(\bigcup_{i\in I}\big[\bar\delta_{i,N}(Y) > \epsilon\big]\cap E_{N,a}\Bigg)
\le \sum_{i=1}^m \mathbb{P}\big(\big[\bar\delta_{i,N}(Y)>\epsilon\big]\cap E_{N,a}\big)\\
&\le \sum_{i=1}^m \mathbb{P}\Bigg(\Big[\sup_{x\in Y}\bar\delta_{i,N}(x)-\bar\delta_{i,N}(z) > \tfrac{\epsilon}{2}\Big]\cap E_{N,a}\Bigg)
+ \sum_{i=1}^m \mathbb{P}\Big(\Big[\bar\delta_{i,N}(z) > \tfrac{\epsilon}{2}\Big]\cap E_{N,a}\Big)\\
&\le \sum_{i=1}^m \mathbb{P}\Bigg\{\sup_{x\in Y}\bar\delta_{i,N}(x)-\bar\delta_{i,N}(z) \ge O(1)\mathsf{A}_{\alpha_i}(Y)\sqrt{\frac{(1+\ln\beta^{-1})}{N}\Big[\widehat L_{i,N}^2+L_i^2\Big]}\Bigg\}
+ \sum_{i=1}^m \mathbb{P}\Bigg\{\bar\delta_{i,N}(z)\ge O(1)\sqrt{\frac{(1+\ln\beta^{-1})}{N}\Big[\widehat\sigma_{i,N}^2(z)+\sigma_i^2(z)\Big]}\Bigg\}\\
(48)\qquad &\le 2m\beta.
\end{align*}
Similarly, by the definition of $\bar\epsilon_N$, the union bound and Theorems 1 and 3,
\begin{align*}
\mathbb{P}[E_2^c\cap E_{N,a}] &= \mathbb{P}\Bigg(\bigcup_{i\in I}\big[\bar\delta_{i,N}(Y)+\widehat\epsilon_{i,N} > \gamma\big]\cap E_{N,a}\Bigg)
\le \sum_{i=1}^m \mathbb{P}\big(\big[\bar\delta_{i,N}(Y)>\epsilon\big]\cap E_{N,a}\big)\\
(49)\qquad &\le 2m\beta.
\end{align*}
By the SLLN, there exists $C>0$ such that a.s. $\widehat L_{0,N}^2\le CL_0^2$. Hence, up to a constant, the event
\[
(50)\qquad E_{N,b} := \big[N\ge O(\epsilon^{-2})\bar\sigma_N(X_{2\epsilon})^2\ln\beta^{-1}\big]\quad\text{has probability }1.
\]
By Theorem 1,
\begin{align*}
\mathbb{P}[E_4\cap E_{N,b}] &= \mathbb{P}\Bigg(\Big[\sup_{x\in X_{2\epsilon}}\widetilde\delta_N(x,x^*) > \epsilon\Big]\cap E_{N,b}\Bigg)\\
&\le \mathbb{P}\Bigg\{\sup_{x\in X_{2\epsilon}}|\widetilde\delta_N(x,x^*)| \ge O(1)\mathsf{A}_{\alpha_0}(X_{2\epsilon})\sqrt{\frac{(1+\ln\beta^{-1})}{N}\Big[\widehat L_{0,N}^2+L_0^2\Big]}\Bigg\}\\
(51)\qquad &\le \beta.
\end{align*}
PART 1 (Optimal value deviation): Set $\delta=0$. On the event $E_o$, we have, for any $x_N^*\in\widehat X_N^*$,
\begin{align*}
f^* - \widehat F_N^* &= \min_{X_{2\epsilon}} f + \mathrm{Gap}(2\epsilon) - \widehat F_N^*\\
&\le f(x_N^*) - \widehat F_N(x_N^*) + \mathrm{Gap}(2\epsilon)\\
&\le \bar\delta_{0,N}(X_{2\epsilon}) + \mathrm{Gap}(2\epsilon) \le \epsilon + \mathrm{Gap}(2\epsilon),
\end{align*}
using that, on $\widehat B_N\cap[\bar\epsilon_N\le\gamma]$, $x_N^*\in X_\gamma$ by Lemma 5 in the first and second inequalities, and $E_o\subset E_5$ in the last inequality. Moreover, on the event $E_o$, for any $x_N^*\in\widehat X_N^*$,
\[
\widehat F_N^* - f^* = \widehat F_N(x_N^*) - f(x_N^*) + f(x_N^*) - f^* \le \bar\delta_{0,N}(X_{2\epsilon}) + t \le 2\epsilon,
\]
using that, on $\widehat B_N\cap[\bar\epsilon_N\le\gamma]$, $x_N^*\in X_\gamma$ by Lemma 5 and $E_o\subset E_{3,0}^c$ in the first inequality, and $E_o\subset E_5$ in the last inequality. Hence
\[
(52)\qquad E_o\subset\big[f^*-\epsilon-\mathrm{Gap}(2\epsilon)\le\widehat F_N^*\le f^*+2\epsilon\big].
\]
Using (46), we get
\[
(53)\qquad \mathbb{P}[E_o^c]\le\mathbb{P}[E_1^c]+\mathbb{P}[E_2^c]+\mathbb{P}[E_{3,0}]+\mathbb{P}[E_5^c]\le 2\mathbb{P}[E_1^c]+2\mathbb{P}[E_2^c]+\mathbb{P}[E_4]+\mathbb{P}[E_5^c].
\]
By the SLLN, there exists $C>0$ such that a.s. $\widehat L_{0,N}^2\le CL_0^2$ and $\widehat\sigma_{0,N}(z)^2\le C\sigma_0^2(z)$. Hence, up to a constant, the event
\[
(54)\qquad E_{N,c} := \big[N\ge O(\epsilon^{-2})\max\{\bar\sigma_N(X_{2\epsilon})^2,\breve\sigma_N(z)^2\}\ln\beta^{-1}\big]\quad\text{has probability }1.
\]
From Theorems 1 and 3,
\begin{align*}
\mathbb{P}[E_5^c\cap E_{N,c}] &= \mathbb{P}\Bigg(\Big[\sup_{x\in X_{2\epsilon}}\widehat\delta_{0,N}(x)>\epsilon\Big]\cap E_{N,c}\Bigg)\\
&\le \mathbb{P}\Bigg(\Big[\sup_{x\in X_{2\epsilon}}|\widetilde\delta_{0,N}(x,z)|>\tfrac{\epsilon}{2}\Big]\cap E_{N,c}\Bigg)+\mathbb{P}\Big(\Big[\widehat\delta_{0,N}(z)>\tfrac{\epsilon}{2}\Big]\cap E_{N,c}\Big)\\
&\le \mathbb{P}\Bigg\{\sup_{x\in X_{2\epsilon}}|\widetilde\delta_{0,N}(x,z)|\ge O(1)\mathsf{A}_{\alpha_0}(X_{2\epsilon})\sqrt{\frac{(1+\ln\beta^{-1})}{N}\Big[\widehat L_{0,N}^2+L_0^2\Big]}\Bigg\}
+\mathbb{P}\Bigg\{\widehat\delta_{0,N}(z)\ge O(1)\sqrt{\frac{(1+\ln\beta^{-1})}{N}\Big[\widehat\sigma_{0,N}^2(z)+\sigma_0^2(z)\Big]}\Bigg\}\\
(55)\qquad &\le 2\beta.
\end{align*}
From (47)-(55) we conclude that
\[
\mathbb{P}[E_o^c]\le 2\cdot 2m\beta + 2\cdot 2m\beta + \beta + 2\beta = (8m+3)\beta\le p,
\]
by choosing $\beta$ and changing constants accordingly. The above relation and (52) prove the claim for the optimal value deviation.

PART 2 (Optimal solutions deviation): Set $\delta=\epsilon$. On the event $E_{s,\delta}$, for any $x_N^*\in\widehat X_{N,\epsilon}^*$,
\[
f(x_N^*)\le f^*+t+\delta=\min_{X_{2\epsilon}}f+\mathrm{Gap}(2\epsilon)+2\epsilon,
\]
since $E_{s,\delta}\subset E_{3,\delta}^c$; also $x_N^*\in X_{2\epsilon}$ since, by Lemma 5, $\widehat X_N\subset X_\gamma$ since $E_{s,\delta}\subset\widehat B_N\cap[\bar\epsilon_N\le\gamma]$. Hence
\[
(56)\qquad E_{s,\delta}\subset\big[\widehat X_{N,\epsilon}^*\subset(X_{2\epsilon})^*_{\mathrm{Gap}(2\epsilon)+2\epsilon}\big].
\]
In addition to the above derivation, by noting that, on $E_{s,\delta}$, by Lemma 5, $X\subset\widehat X_N\subset X_\gamma$ and, for any $x_N^*\in\widehat X_{N,\delta}^*$, $f^*-f(x_N^*)=\min_{X_{2\epsilon}}f-f(x_N^*)+\mathrm{Gap}(2\epsilon)\le\mathrm{Gap}(2\epsilon)$ since $x_N^*\in X_{2\epsilon}$, we also have
\[
(57)\qquad E_{s,\delta}\subset\big[X\subset\widehat X_N\subset X_{2\epsilon},\;\ f^*-\mathrm{Gap}(2\epsilon)\le f(x_N^*)\le f^*+2\epsilon,\;\forall x_N^*\in\widehat X_{N,\epsilon}^*\big].
\]
Using (46), we get
\[
\mathbb{P}[E_{s,\delta}^c]\le\mathbb{P}[E_1^c]+\mathbb{P}[E_2^c]+\mathbb{P}[E_{3,\delta}]\le 2\mathbb{P}[E_1^c]+2\mathbb{P}[E_2^c]+\mathbb{P}[E_4].
\]
From the above relation and (47)-(51) we conclude that
\[
\mathbb{P}[E_{s,\delta}^c]\le 2\cdot 2m\beta + 2\cdot 2m\beta + \beta = (8m+1)\beta\le p,
\]
by choosing $\beta$ and changing constants accordingly. The above relation and (56)-(57) prove the claim for the optimal solutions deviation and the final claim of the theorem on feasibility and optimality bounds.

PART 3 (Approximate feasibility): by Lemma 5,
\[
E_1\cap E_2\subset\big[X\subset\widehat X_N\subset X_\gamma\big]=:E_f.
\]
Hence, from the above and (47)-(49),
\[
\mathbb{P}[E_f^c]\le\mathbb{P}[E_1^c]+\mathbb{P}[E_2^c]\le 2m\beta+2m\beta=4m\beta\le p,
\]
by choosing $\beta$ and changing constants accordingly. The relation above proves the claim of approximate feasibility.

We now observe that, if $X$ is metric regular as in Definition 1.1, then, for any $\gamma\ge 0$, $X_\gamma\subset\breve X_{c\gamma}$. Indeed, for all $x\in X_\gamma\subset Y$, $d(x,X)\le c\sup_{i\in I}[f_i(x)]_+\le c\gamma$. Hence $X_\gamma\subset X+c\gamma B$, and since also $X_\gamma\subset Y$ we get the required claim. The above relation implies that
\[
(58)\qquad \bar\sigma_N(X_\gamma)\le\bar\sigma_N(\breve X_{c\gamma}),
\]
where we used definitions (12) and (15). The above inequality and Proposition 1 immediately imply the following main theorem, which states non-asymptotic approximate optimality and feasibility bounds in terms of the set $\breve X_{c\gamma}$. This is a better benchmark since it approaches $X$ in the Hausdorff metric.

Theorem 6 (Exponential convergence of SAA for exterior approximation of metric regular feasible sets). Suppose $X$ is metric regular as in Definition 1.1 and $I=[m]$ for some $m\in\mathbb{N}$. Let $\epsilon>0$ and set
\[
\widehat\epsilon_{i,N}:=\epsilon,\qquad\forall i\in I.
\]
Let $p\in(0,1]$, $z\in Y$. Then, with probability at least $1-p$, the optimal value deviation
\[
f^*-\epsilon-\mathrm{Gap}(2\epsilon)\le\widehat F_N^*\le f^*+2\epsilon
\]
holds for $N\ge O(\epsilon^{-2})\max\big\{\sigma_N(C,Y,z)^2,\bar\sigma_N(\breve X_{2c\epsilon})^2,\breve\sigma_N(z)^2\big\}\big(\ln m+\ln p^{-1}\big)$, the approximate optimality deviations
\[
\widehat X_{N,\epsilon}^*\subset(X_{2\epsilon})^*_{\mathrm{Gap}(2\epsilon)+2\epsilon},\qquad
f^*-\mathrm{Gap}(2\epsilon)\le f(x_N^*)\le f^*+2\epsilon,\qquad\forall x_N^*\in\widehat X_{N,\epsilon}^*,
\]
hold for $N\ge O(\epsilon^{-2})\max\big\{\sigma_N(C,Y,z)^2,\bar\sigma_N(\breve X_{2c\epsilon})^2\big\}\big(\ln m+\ln p^{-1}\big)$, while the approximate feasibility
\[
X\subset\widehat X_N\subset X_{2\epsilon}\subset\breve X_{2c\epsilon}
\]
holds for $N\ge O(\epsilon^{-2})\sigma_N(C,Y,z)^2\big(\ln m+\ln p^{-1}\big)$.

Remark 2 (Simultaneous approximate feasibility and optimality bounds for metric regular feasible set). In particular, Theorem 6 states that, if $X$ is metric regular and $I=[m]$ for some $m\in\mathbb{N}$, for any $\epsilon>0$, $p\in(0,1]$ and $z\in Y$, if we set
\[
\widehat\epsilon_{i,N}=\epsilon,\qquad\forall i\in I,
\]
then, with probability at least $1-p$, the approximate feasibility and optimality bounds
\[
X\subset\widehat X_N,\qquad \mathrm{D}(\widehat X_N,X)\le 2c\epsilon,\qquad
f^*-\mathrm{Gap}(2\epsilon)\le f(x_N^*)\le f^*+2\epsilon,\quad\forall x_N^*\in\widehat X_{N,\epsilon}^*,
\]
hold for $N\ge O(\epsilon^{-2})\max\big\{\sigma_N(C,Y,z)^2,\bar\sigma_N(\breve X_{2c\epsilon})^2\big\}\big(\ln m+\ln p^{-1}\big)$.
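A concrete illustration of the metric regular constant $c$ entering these bounds (this example is ours, not the paper's, and invokes the classical Hoffman error bound for polyhedra): when the soft constraints are affine, $f_i(x)=a_i^\top x-b_i$, so that $X=\{x\in Y:Ax\le b\}$ with $A$ the matrix with rows $a_i^\top$, Hoffman's lemma guarantees a constant $c_A>0$, depending only on the constraint matrix, such that
\[
d\big(x,\{u:Au\le b\}\big)\ \le\ c_A\,\max_{i\in[m]}\,[\,a_i^\top x-b_i\,]_+\qquad\text{for all }x.
\]
In particular, if the hard constraint $Y$ is itself polyhedral, applying Hoffman's bound to the combined system describing $X$ gives, for every $x\in Y$, $d(x,X)\le c\max_{i\in[m]}[f_i(x)]_+$ with $c$ depending only on the constraint data, which is exactly the inequality used after Proposition 1 above, and no Slater point is needed. The constant $c$ then enters Remark 2 only through the benchmark radius $2c\epsilon$ in $\mathrm{D}(\widehat X_N,X)\le 2c\epsilon$ and $\breve X_{2c\epsilon}$, i.e., it acts as a condition number of the feasible set.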
6.3.1. Convexity of feasible set as a localization property for feasibility and optimality. As mentioned in the introduction, there exist convex feasible sets which are metric regular even though they do not satisfy the strict feasibility required in the Slater condition. One relevant example is a polyhedron. For metric regular convex sets we can exploit convexity as a localization procedure for concentration of measure and provide feasibility rates in terms of $\{X_{i,O(\epsilon)}\}_{i\in I}$ as defined in (11) and optimality rates in terms of $\{X_{i,O(\epsilon)}\}_{i\in I}$ and $\breve X_{O(c\epsilon)}$ as defined in (10). The active level sets $\{X_{i,O(\epsilon)}\}_{i\in I}$ have potentially smaller metric entropies than the feasible set $X$, while the set $\breve X_{O(c\epsilon)}$ has potentially a smaller metric entropy than the hard constraint $Y$. Up to a "neighborhood gap" of $O(c\epsilon)$ in terms of Hausdorff distance, these results recover the ones of Subsection 6.2 but without requiring the stronger condition of strict feasibility of the SCQ. The crucial properties used here are metric regularity and convexity of $X$.

Lemma 7 (Exterior approximation of convex feasible set). Suppose $X\neq\emptyset$ and, for all $i\in I$, $\widehat F_i$ and $f_i$ are convex on $Y$. Then, for given $\gamma>0$ and $y\in X$,
\[
\widehat A_{0,\gamma}(y)\subset\big[\widehat X\subset X_\gamma\big],
\]
where $\widehat A_{0,\gamma}(y)$ is defined as in Lemma 2. Moreover,
\[
B:=\bigcap_{i\in I}\big[\widehat\epsilon_i\ge 0\big]\subset\big[X\subset\widehat X\big].
\]
Proof. For $\gamma>0$ and $y\in X$, the inclusion $\widehat A_{0,\gamma}(y)\subset[\widehat X\subset X_\gamma]$ follows from Lemma 2 and $X\neq\emptyset$. The inclusion $B\subset[X\subset\widehat X]$ is immediate from the definitions of $\widehat X$ and $B$.
In the next Proposition 2 and Theorem 7, we recall definitions of Section 3, (12), (14)-(15) and (29)-(31).
Proposition 2. Suppose that, a.e., for all $i\in I$, $F_i(\cdot,\xi)$ is convex on $Y$ and that $I=[m]$ for some $m\in\mathbb{N}$. Let $\epsilon>0$ and set
\[
\widehat\epsilon_{i,N}:=\epsilon,\qquad\forall i\in I.
\]
Let $p\in(0,1]$, $y\in X$ and $z\in Y$. Then, with probability at least $1-p$, the optimal value deviation
\[
f^*-\epsilon-\mathrm{Gap}(2\epsilon)\le\widehat F_N^*\le f^*+2\epsilon
\]
holds for $N\ge O(\epsilon^{-2})\max\big\{\sigma_N(2\epsilon,C,y,z)^2,\bar\sigma_N(X_{2\epsilon})^2,\breve\sigma_N(z)^2\big\}\big(\ln m+\ln p^{-1}\big)$, the approximate optimality deviations
\[
\widehat X_{N,\epsilon}^*\subset(X_{2\epsilon})^*_{\mathrm{Gap}(2\epsilon)+2\epsilon},\qquad
f^*-\mathrm{Gap}(2\epsilon)\le f(x_N^*)\le f^*+2\epsilon,\qquad\forall x_N^*\in\widehat X_{N,\epsilon}^*,
\]
hold for $N\ge O(\epsilon^{-2})\max\big\{\sigma_N(2\epsilon,C,y,z)^2,\bar\sigma_N(X_{2\epsilon})^2\big\}\big(\ln m+\ln p^{-1}\big)$, while the approximate feasibility
\[
X\subset\widehat X_N\subset X_{2\epsilon}
\]
holds for $N\ge O(\epsilon^{-2})\sigma_N(2\epsilon,C,y,z)^2\big(\ln m+\ln p^{-1}\big)$.
Sketch of proof. In the following, we will consider the deterministic Lemmas 6 and 7 for the particular case of random perturbations given in the SAA problem (4), that is, $\widehat F:=\widehat F_N$ and, for all $i\in I$,
\[
\widehat F_i:=\widehat F_{i,N},\qquad \widehat\epsilon_i:=\widehat\epsilon_{i,N},
\]
so that $\widehat X=\widehat X_N$. In this setting, we will denote $A_{N,0,\gamma}(y):=\widehat A_{0,\gamma}(y)$ as defined in Lemma 7. Let $\epsilon>0$, $p\in(0,1]$, $y\in X$ and $z\in Y$ as stated in the proposition. We invoke Lemmas 6-7 for $t:=\epsilon$, $\gamma:=2\epsilon$, some $x^*\in X^*$ and $\delta\ge 0$ to be determined later. In the following, $C>0$ is a constant that may change from line to line and $\beta\in(0,1]$ is a constant to be determined later in terms of $m$ and $p$ depending on $N$ as stated in the theorem. The proof requires just one change in the proof of Proposition 1, which corresponds to the feasibility bound. Precisely, we will use Lemma 7 and the events
\[
E_1:=B,\qquad E_2:=A_{N,0,\gamma}(y),
\]
instead of invoking Lemma 5 and using the events $E_1:=\widehat B_N$ and $E_2:=[\bar\epsilon_N\le\gamma]$ as done in Proposition 1. We make the change precise in the following. Note that, since $\widehat\epsilon_{i,N}=\epsilon>0$ for all $i\in I$, the event $E_1$ is certain, that is,
\[
(59)\qquad E_1\ \text{has probability }1.
\]
By the SLLN, there exists $C>0$ such that a.s. $\widehat L_{i,N}^2\le CL_i^2$, $\widehat\sigma_{i,N}^2(z)\le C\sigma_i^2(z)$ and $\widehat\sigma_{i,N}^2(y)\le C\sigma_i^2(y)$ for all $i\in[m]$. Hence, up to a constant, the event
\[
(60)\qquad E_{N,a}:=\big[N\ge O(\epsilon^{-2})\sigma_N(2\epsilon,C,y,z)^2\ln\beta^{-1}\big]\quad\text{has probability }1.
\]
By definition of $A_{N,0,\gamma}(y)$ and Theorems 1 and 3 we have
\begin{align*}
\mathbb{P}[E_2^c\cap E_{N,a}]
&\le \mathbb{P}\Bigg(\bigcup_{i\in I}\Big[\bar\delta_{i,N}(X_{i,2\epsilon})\ge\tfrac{\gamma}{2}\Big]\cap E_{N,a}\Bigg)
+\mathbb{P}\Bigg(\bigcup_{i\in I}\big[\bar\delta_{i,N}(X_{i,2\epsilon})\ge\epsilon\big]\cap E_{N,a}\Bigg)
+\mathbb{P}\Bigg(\bigcup_{i\in I}\Big[\bar\delta_{i,N}(y)\ge\tfrac{\gamma}{2}\Big]\cap E_{N,a}\Bigg)\\
&\le 2\sum_{i=1}^m\mathbb{P}\big(\big[\bar\delta_{i,N}(X_{i,2\epsilon})\ge\epsilon\big]\cap E_{N,a}\big)
+\sum_{i=1}^m\mathbb{P}\big(\big[\bar\delta_{i,N}(y)\ge\epsilon\big]\cap E_{N,a}\big)\\
&\le 2\sum_{i=1}^m\mathbb{P}\Bigg(\Big[\sup_{x\in X_{i,2\epsilon}}\bar\delta_{i,N}(x)-\bar\delta_{i,N}(z)\ge\tfrac{\epsilon}{2}\Big]\cap E_{N,a}\Bigg)
+2\sum_{i=1}^m\mathbb{P}\Big(\Big[\bar\delta_{i,N}(z)\ge\tfrac{\epsilon}{2}\Big]\cap E_{N,a}\Big)
+\sum_{i=1}^m\mathbb{P}\big(\big[\bar\delta_{i,N}(y)\ge\epsilon\big]\cap E_{N,a}\big)\\
&\le 2\sum_{i=1}^m\mathbb{P}\Bigg\{\sup_{x\in X_{i,2\epsilon}}\bar\delta_{i,N}(x)-\bar\delta_{i,N}(z)\ge O(1)\mathsf{A}_{\alpha_i}(X_{i,2\epsilon})\sqrt{\frac{(1+\ln\beta^{-1})}{N}\Big[\widehat L_{i,N}^2+L_i^2\Big]}\Bigg\}\\
&\quad+2\sum_{i=1}^m\mathbb{P}\Bigg\{\bar\delta_{i,N}(z)\ge O(1)\sqrt{\frac{(1+\ln\beta^{-1})}{N}\Big[\widehat\sigma_{i,N}^2(z)+\sigma_i^2(z)\Big]}\Bigg\}
+\sum_{i=1}^m\mathbb{P}\Bigg\{\bar\delta_{i,N}(y)\ge O(1)\sqrt{\frac{(1+\ln\beta^{-1})}{N}\Big[\widehat\sigma_{i,N}^2(y)+\sigma_i^2(y)\Big]}\Bigg\}\\
(61)\qquad&\le(2+2+1)m\beta=5m\beta.
\end{align*}
With the above change, the rest of the proof follows as in Proposition 1, ignoring (47)-(49) and using (59)-(61) instead.

In case $X$ is convex and metric regular, Proposition 2 and the bound (58) immediately imply the following main theorem, which states feasibility bounds in terms of the metric entropy of $\{X_{i,2\epsilon}\}_{i\in I}$ and optimality bounds in terms of the metric entropies of $\{X_{i,2\epsilon}\}_{i\in I}$ and $\breve X_{2c\epsilon}$. As in Theorem 6, $\breve X_{2c\epsilon}$ is a good benchmark since $\mathrm{H}(\breve X_{2c\epsilon},X)\le 2c\epsilon$.
Theorem 7 (Simultaneous approximate feasibility and optimality bounds for metric regular convex feasible set). Suppose $X$ is metric regular, a.e., for all $i\in I$, $F_i(\cdot,\xi)$ is convex on $Y$, and $I=[m]$ for some $m\in\mathbb{N}$. For any $\epsilon>0$, $p\in(0,1]$, $y\in X$ and $z\in Y$, if we set
\[
\widehat\epsilon_{i,N}=\epsilon,\qquad\forall i\in I,
\]
then, with probability at least $1-p$, the approximate feasibility and optimality bounds
\[
X\subset\widehat X_N,\qquad \mathrm{D}(\widehat X_N,X)\le 2c\epsilon,\qquad
f^*-\mathrm{Gap}(2\epsilon)\le f(x_N^*)\le f^*+2\epsilon,\quad\forall x_N^*\in\widehat X_{N,\epsilon}^*,
\]
hold for $N\ge O(\epsilon^{-2})\max\big\{\sigma_N(2\epsilon,C,y,z)^2,\bar\sigma_N(\breve X_{2c\epsilon})^2\big\}\big(\ln m+\ln p^{-1}\big)$.

Appendix. We now give the proof of Theorem 1.
Proof of Theorem 1. In the following, $C$ is a universal constant that might change from line to line. Let $y\in Z$ and, for any $t>0$, denote by $E_t$ the event in the statement of the theorem. We define, for any $x\in M$, $\delta(x,\xi):=G(x,\xi)-PG(x,\cdot)$. We have
\[
(62)\qquad \mathbb{P}(E_t)=\mathbb{P}\Bigg[\sup_{x\in Z}\max_{v\in\{-1,1\}}\frac{v}{N}\sum_{j=1}^N\big[\delta(x,\xi_j)-\delta(y,\xi_j)\big]\ge C\mathsf{A}_\alpha(Z)\sqrt{\frac{(1+t)}{N}\Big[\widehat L_N^2+PL(\cdot)^2\Big]}\Bigg].
\]
In the following we will show that, for any $v\in\{-1,1\}$ and $t>0$,
\[
(63)\qquad \mathbb{P}\Bigg[\sup_{x\in Z}\frac{v}{N}\sum_{j=1}^N\big[\delta(x,\xi_j)-\delta(y,\xi_j)\big]\ge C\mathsf{A}_\alpha(Z)\sqrt{\frac{(1+t)}{N}\Big[\widehat L_N^2+PL(\cdot)^2\Big]}\Bigg]\le e^{-t},
\]
which, together with (62) and a union bound, proves the required claim by changing $C$ accordingly.
We proceed with the proof of (63). We only prove the bound for $v=1$ since the argument is analogous for $v=-1$. Set $t>0$. Define $V_0:=\{y\}$ and, for any integer $i\ge 1$, let $V_i$ be any $\frac{D(Z)}{2^i}$-net for $Z$. Denote by $\Pi_i:Z\to V_i$ the projection operator onto $V_i$, that is, $\Pi_i(x)\in\operatorname{argmin}_{z\in V_i}d(z,x)$ for any $x\in Z$. Since $V_i$ is a $\frac{D(Z)}{2^i}$-net for $Z$, clearly $d(x,\Pi_i(x))\le\frac{D(Z)}{2^i}$ for any $x\in Z$ and $i\ge 1$. Hence, for any $x\in Z$ and $i\ge 1$,
\[
d(\Pi_i(x),\Pi_{i-1}(x))\le D(Z)\left[\frac{1}{2^i}+\frac{1}{2^{i-1}}\right]=\frac{3D(Z)}{2^i}.
\]
Since $\Pi_0(x)=y$ and $\lim_{i\to\infty}\Pi_i(x)=x$, we have that, for any $x\in Z$,
\begin{align*}
\frac{1}{N}\sum_{j=1}^N\big[\delta(x,\xi_j)-\delta(y,\xi_j)\big]
&=\frac{1}{N}\sum_{j=1}^N\sum_{i=1}^\infty\big[\delta(\Pi_i(x),\xi_j)-\delta(\Pi_{i-1}(x),\xi_j)\big]\\
(64)\qquad&\le\sum_{i=1}^\infty\max_{(a,b)\in A_i}\frac{1}{N}\sum_{j=1}^N\big[\delta(a,\xi_j)-\delta(b,\xi_j)\big],
\end{align*}
where $A_i:=\{(a,b)\in V_i\times V_{i-1}:d(a,b)\le 3D(Z)/2^i\}$ for any $i\ge 1$. For $(a,b)\in A_i$, define
\[
\Delta_N(a,b):=\frac{1}{N}\sum_{j=1}^N\big[\delta(a,\xi_j)-\delta(b,\xi_j)\big].
\]
We shall need concentration bounds for the above term. From Theorem 2,
\[
(65)\qquad \mathbb{P}\Big\{|\Delta_N(a,b)|\ge C\sqrt{V_N(a,b)(1+t)}\Big\}\le e^{-t},
\]
for some constant $C>0$ and the random variable $V_N(a,b)\ge 0$ defined as
\[
V_N(a,b):=\mathbb{E}\left[\sum_{j=1}^N\left(\frac{\delta(a,\xi_j)-\delta(b,\xi_j)-\delta(a,\eta_j)+\delta(b,\eta_j)}{N}\right)^2\,\Bigg|\,\mathcal{F}_N\right],
\]
where $\mathcal{F}_N:=\sigma(\xi_j:j\in[N])$ and $\{\eta_j\}_{j=1}^N$ is an i.i.d. sample of $\xi$ independent of $\{\xi_j\}_{j=1}^N$. From the H\"older continuity of $G(\cdot,\xi)$, we get
\begin{align*}
V_N(a,b)&\le\mathbb{E}\left[\sum_{j=1}^N\frac{2\big[L(\xi_j)^2+L(\eta_j)^2\big]}{N^2}\,d(a,b)^{2\alpha}\,\Bigg|\,\mathcal{F}_N\right]\\
(66)\qquad&\le\frac{2}{N}\Big[\widehat L_N^2+PL(\cdot)^2\Big]d(a,b)^{2\alpha},
\end{align*}
where we used that $\{\eta_j\}_{j=1}^N$ is an i.i.d. sample of $\xi$ independent of $\mathcal{F}_N$. Relations (65)-(66) imply
\[
(67)\qquad \mathbb{P}\Bigg\{|\Delta_N(a,b)|\ge C\frac{\sqrt{\widehat L_N^2+PL(\cdot)^2}}{\sqrt{N}}\,d(a,b)^\alpha\sqrt{1+t}\Bigg\}\le e^{-t}.
\]
From (67), if we set, for $i\ge 1$, $T_i:=\ln[|V_i||V_{i-1}|]+\ln[i(i+1)]$, we obtain
\begin{align*}
\mathbb{P}\Bigg\{\max_{(a,b)\in A_i}|\Delta_N(a,b)|&\ge C\frac{\sqrt{\widehat L_N^2+PL(\cdot)^2}}{\sqrt{N}}\cdot\frac{D(Z)^\alpha}{2^{i\alpha}}\sqrt{1+T_i+t}\Bigg\}\\
&\le|V_i||V_{i-1}|\max_{(a,b)\in A_i}\mathbb{P}\Bigg\{|\Delta_N(a,b)|\ge C\frac{\sqrt{\widehat L_N^2+PL(\cdot)^2}}{\sqrt{N}}\,d(a,b)^\alpha\sqrt{1+T_i+t}\Bigg\}\\
(68)\qquad&\le|V_i||V_{i-1}|e^{-(T_i+t)}\le\frac{e^{-t}}{i(i+1)},
\end{align*}
where we used the union bound in the second inequality and (67) in the third inequality. By the union bound over $i\ge 1$ and $\sum_{i=1}^\infty\frac{1}{i(i+1)}\le 1$, relation (68) implies that, with probability greater than $1-e^{-t}$, for all $i\ge 1$ we have
\[
(69)\qquad \max_{(a,b)\in A_i}|\Delta_N(a,b)|\le C\frac{\sqrt{\widehat L_N^2+PL(\cdot)^2}}{\sqrt{N}}\cdot\frac{D(Z)^\alpha}{2^{i\alpha}}\sqrt{1+T_i+t}.
\]
Taking the supremum over $x\in Z$ and summing over $i\ge 1$ in relations (64) and (69), we obtain that, with probability greater than $1-e^{-t}$, there holds
\begin{align*}
\sup_{x\in Z}\frac{1}{N}\sum_{j=1}^N\big[\delta(x,\xi_j)-\delta(y,\xi_j)\big]
&\le C\frac{\sqrt{\widehat L_N^2+PL(\cdot)^2}}{\sqrt{N}}\sum_{i=1}^\infty\frac{D(Z)^\alpha}{2^{i\alpha}}\sqrt{1+T_i+t}\\
&\le C\frac{\sqrt{\widehat L_N^2+PL(\cdot)^2}}{\sqrt{N}}\,\mathsf{A}_\alpha(Z)\sqrt{1+t},
\end{align*}
where we used $\sqrt{1+T_i+t}\le\sqrt{T_i}+\sqrt{1+t}$ and the definitions of $T_i$ and $\mathsf{A}_\alpha(Z)$. The above inequality proves the required claim in (63).
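To visualize the net construction behind the chaining argument above, the following small numerical sketch (our illustration, not part of the paper: the point cloud, scales and function names are arbitrary assumptions) builds greedy $D(Z)/2^i$-nets $V_i$ on a finite point cloud standing in for $Z$, takes $V_0=\{y\}$, and checks the link bound $d(\Pi_i(x),\Pi_{i-1}(x))\le 3D(Z)/2^i$ that drives (64).
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
Z = rng.uniform(-1.0, 1.0, size=(400, 2))   # finite cloud standing in for Z

def diameter(points):
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    return d.max()

def greedy_net(points, radius):
    """Greedy radius-net: every point lies within `radius` of some center."""
    centers, covered = [], np.zeros(len(points), dtype=bool)
    while not covered.all():
        j = int(np.argmax(~covered))         # first uncovered point -> center
        centers.append(points[j])
        covered |= np.linalg.norm(points - points[j], axis=1) <= radius
    return np.array(centers)

def project(x, net):
    """Pi_i(x): a nearest point of the net."""
    return net[np.argmin(np.linalg.norm(net - x, axis=1))]

D = diameter(Z)
nets = {0: Z[:1].copy()}                     # V_0 = {y}: a single anchor point
for i in range(1, 7):
    nets[i] = greedy_net(Z, D / 2**i)

# Check the chaining link bound d(Pi_i(x), Pi_{i-1}(x)) <= 3 D(Z) / 2^i.
for i in range(1, 7):
    link = max(np.linalg.norm(project(x, nets[i]) - project(x, nets[i - 1]))
               for x in Z)
    print(f"i={i}  |V_i|={len(nets[i]):4d}  max link={link:.3f}  "
          f"bound={3 * D / 2**i:.3f}")
\end{verbatim}
The printed cardinalities $|V_i|$ grow as the scale shrinks; the weights $T_i$ in (68) depend on them only through $\ln[|V_i||V_{i-1}|]$, which (together with the geometric scales $D(Z)/2^i$) is what the entropy quantity $\mathsf{A}_\alpha(Z)$ of Section 3 aggregates.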