On the optimality of the empirical risk minimization procedure for the Convex Aggregation problem

Guillaume Lecué (1,3)    Shahar Mendelson (2,4)

September 24, 2010

Abstract. We study the performance of the empirical risk minimization procedure (ERM for short), relative to the quadratic risk, in the context of (C) aggregation (the "C" is for "convex"), in which one wants to construct a procedure whose risk is as close as possible to the risk of the best function in the convex hull of an arbitrary finite class F. We show that ERM performed in the convex hull of F is an optimal aggregation procedure for the (C) aggregation problem. We also show that if this procedure is used for the problem of (MS) aggregation, in which one wants to mimic the performance of the best function in F itself, then its rate is the same as the one achieved for the (C) aggregation problem, and thus it is far from optimal. These results are obtained in deviation and are sharp up to logarithmic factors.

1 Introduction and main results

In this note, we study the optimality of the empirical risk minimization procedure in the aggregation framework. Let X be a probability space and let (X, Y) and (X_1, Y_1), ..., (X_n, Y_n) be n + 1 i.i.d. random variables with values in X × R. From the statistical point of view, D = ((X_1, Y_1), ..., (X_n, Y_n)) is the set of given data. The quadratic risk of a real-valued function f defined on X is given by

R(f) = E(Y − f(X))².

1. CNRS, LAMA, Université Paris-Est Marne-la-Vallée, 77454 France, and Centre for Mathematics and its Applications, The Australian National University, Canberra, ACT 0200, Australia.
2. Department of Mathematics, Technion, I.I.T, Haifa 32000, Israel, and Centre for Mathematics and its Applications, The Australian National University, Canberra, ACT 0200, Australia.
3. Email: [email protected]
4. Email: [email protected]


If f̂ is a random function constructed using the data D, the quadratic risk of f̂ is the random variable

R(f̂) = E[(Y − f̂(X))² | D].

For the sake of simplicity, throughout this article we will restrict ourselves to functions f and random variables (X, Y) for which |Y|, |f(X)| ≤ b almost surely, for some fixed b ≥ 1. One should note, though, that it is possible to extend the results beyond this case, to functions with well behaved tail – though at a high technical price.
In the aggregation framework, one is given a finite set F of real-valued functions defined on X (usually called a dictionary) of cardinality M. There are three main types of aggregation problems:

1. In the Model Selection (MS) aggregation problem, one has to construct a procedure that produces a function whose risk is as close as possible to the risk of the best element in the given class F (cf. [2, 8, 9, 10, 11, 15, 24, 25, 27]).
2. In the Convex (C) aggregation problem (cf. [1, 5, 7, 8, 11, 24, 28]), one wants to construct a procedure whose risk is as close as possible to the risk of the best function in the convex hull of F.
3. In the Linear (L) aggregation problem (cf. [8, 10, 14, 24]), one wants to construct a procedure whose risk is as close as possible to the risk of the best function in the linear span of F.

One can define the optimal rates of the (MS), (C) and (L) aggregation problems, respectively denoted by ψ_n^{(MS)}(M), ψ_n^{(C)}(M) and ψ_n^{(L)}(M) (see, for example, [24]). The optimal rates are the smallest prices that one has to pay to solve the (MS), (C) or (L) aggregation problems in expectation, as a function of the cardinality of the dictionary M and of the sample size n. It has been proved in [24] that

ψ_n^{(MS)}(M) ∼ (log M)/n,   ψ_n^{(L)}(M) ∼ M/n   and   ψ_n^{(C)}(M) ∼ min( M/n , √( log(eM/√n) / n ) ),

where we denote a ∼ b if there are absolute positive constants c and C such that cb ≤ a ≤ Cb. Note that the rates obtained in [24] hold in expectation and their optimality holds only in that context. Nevertheless, lower bounds in deviation follow from the argument of [24] for the three aggregation problems with the same rates ψ_n^{(MS)}(M), ψ_n^{(C)}(M) and ψ_n^{(L)}(M). In other words, there exist two absolute constants c_0, c_1 > 0 such that for any sample cardinality n ≥ 1, any cardinality of dictionary M ≥ 1 and any aggregation procedure f̄_n, there exists a dictionary F of size M such that, with probability larger than c_0,

R(f̄_n) ≥ min_{f ∈ Δ(F)} R(f) + c_1 ψ_n^{Δ(F)}(M),   (1.1)

where the residual term ψ_n^{Δ(F)}(M) is ψ_n^{(MS)}(M) (resp. ψ_n^{(C)}(M) or ψ_n^{(L)}(M)) when Δ(F) = F (resp. Δ(F) = conv(F) or Δ(F) = span(F)).
Procedures achieving these rates in deviation have been constructed for the (MS) aggregation problem ([15]) and the (L) aggregation problem ([14]). So far, there was no example of a procedure that achieves the rate of aggregation ψ_n^{(C)}(M) with high probability for the (C) aggregation problem. Moreover, the rate ψ_n^{(C)}(M) was achieved in [24] (in expectation) in the Gaussian regression model with a known variance and a known marginal distribution of the design. In [7], the authors were able to remove these assumptions at the price of an extra log n factor for 1 ≤ M ≤ √n. Moreover, the procedures of [24] and [7] depend on a parameter T, called the temperature parameter, which has to be chosen larger than certain parameters of the model (depending on the tail of the output and on the parameter b). On one hand the residue is monotone in T, but on the other, when T is too small, these procedures are far from optimal in general (cf. [16]). This makes the use of these aggregates "risky", since there is no natural (deterministic or data-dependent) way of choosing T.
Although no procedure has been shown to achieve the rate ψ_n^{(C)}(M) in deviation, we will call this rate the optimal rate of (C) aggregation, and the aim of this note is to prove that the most natural procedure, empirical risk minimization over the convex hull of F, achieves the rate ψ_n^{(C)}(M) in deviation (up to a log n factor for values of M close to √n) without requiring the choice of some "model dependent" parameter. Indeed, we will show that the procedure f̃^{ERM-C} minimizing the empirical risk functional

f ↦ R_n(f) = (1/n) Σ_{i=1}^n (Y_i − f(X_i))²   (1.2)

in conv(F) achieves, with high probability, the rate min( M/n , √( (log M)/n ) ) for the (C) aggregation problem (see the exact formulation in Theorem B below). Moreover, we will show that the optimal rate ψ_n^{(C)}(M) can be achieved by f̃^{ERM-C} for any orthogonal dictionary (formulated in Theorem C). On the other hand, it turns out that the same algorithm is far from the conjectured optimal rate ψ_n^{(MS)}(M) for the (MS) aggregation problem (see Theorem A).
Another motivation for this work comes from what is known about ERM in the context of the three aggregation schemes mentioned above. It is well-known that ERM in F is, in general, a suboptimal aggregation procedure for the (MS) aggregation problem (see [12], [19] or [17]). It is also known that performing ERM in the convex hull of F is a suboptimal aggregation procedure for the (MS) aggregation problem for M = √n (see [15]). Concerning the (L) aggregation problem, ERM in the linear span of F is an optimal procedure ([14]). It is straightforward to see that ERM in

the convex hull of F in the context of (L) aggregation or ERM in F in the context of (C) or (L) aggregation are suboptimal. Therefore, studying the performance of ERM in the convex hull of F in the context of (C) aggregation can be seen as an "intermediate" problem which remained open.
In fact, a lot of effort has been invested in finding any procedure that would achieve the (conjectured) optimal rate of the (C) aggregation problem. For example, many boosting algorithms (see [23] or [6] for recent results on this topic) are based on finding the best convex combination in a large dictionary (for instance, dictionaries consisting of "decision stumps"), while random forest algorithms can be seen as procedures that try to find the best convex combination of decision trees. Thus, finding an optimal procedure for the problem of (C) aggregation for a general dictionary is of high practical importance.
Our first main result is a lower bound on the performance of f̃^{ERM-C} (ERM in the convex hull) in the context of the (MS) aggregation problem. In [15], it was proved that this procedure is suboptimal for the problem of (MS) aggregation when the size of the dictionary is of the order of √n. Here we complement the result by providing a lower bound for all values of M and n.

Theorem A. There exist absolute positive constants c_1, c_2 for which the following holds. For any integers n and M there exists a dictionary F of cardinality M such that, with probability greater than c_1,

R(f̃^{ERM-C}) ≥ min_{f ∈ F} R(f) + c_2 ψ_n(M),

where ψ_n(M) = M/n when M ≤ √n and ψ_n(M) = [n log(eM/√n)]^{-1/2} when M > √n.

Note that the residual term ψ_n(M) of Theorem A is much larger than the optimal rate ψ_n^{(MS)}(M) = (log M)/n for the (MS) aggregation problem. Theorem A shows that ERM in the convex hull satisfies a much stronger lower bound than the one mentioned in (1.1), which holds for any algorithm. This result is of particular importance since optimal aggregation procedures for the (MS) aggregation problem take their values in conv(F), and it was thus conjectured that f̃^{ERM-C} could be an optimal aggregation procedure for the (MS) aggregation problem. In [15] it was proved that this is not the case for M = √n; Theorem A shows that this is not the case for any M and n.
The proof of Theorem A requires two separate arguments. The case M ≤ √n is easier, and follows an identical path to the one used in [15] for M = √n. Its proof is presented for the sake of completeness, and to allow the reader a comparison with the situation in the other case, when M > √n. In the "large M" range things are very different. The difficulty here stems from the fact that if M > √n, the "complexity" of the sets of functions in the convex hull whose excess risk is at most r ≈ √( log(cM/√n)/n ) does not change much with a proportional change in r. Thus, accurate estimates on the risk of the ERM in that range require a far more delicate analysis than in the "small dictionary" case. Let us mention that for M > √n the result is optimal up to a logarithmic factor, and the conjectured rate is √( log(cM/√n)/n ).
The performance of ERM in the convex hull has been studied for an infinite dictionary in [5], in which estimates on its performance have been obtained in terms of the metric entropy of F. The resulting upper bounds were conjectured to be suboptimal in the case of a finite dictionary, since they provide an upper bound of M/n for every n and M. And indeed, we establish the following upper bound on the risk of f̃^{ERM-C} as a (C)-aggregation procedure:

Theorem B. For every b > 0 there is a constant c_1(b) and an absolute constant c_2 for which the following holds. Let n and M be integers which satisfy log M ≤ c_1(b)√n. For any couple (X, Y) and any finite dictionary F of cardinality M such that |Y|, sup_{f ∈ F} |f(X)| ≤ b, and for any u > 0, with probability greater than 1 − exp(−u),

R(f̃^{ERM-C}) ≤ min_{f ∈ conv(F)} R(f) + c_2 b² max[ min( M/n , √( (log M)/n ) ) , u/n ].

Although Theorem B is new, it is probably known to experts, and its proof is based on what is now rather standard machinery. Note that Theorem B is optimal except for values of M for which n^{1/2} < M ≤ c(ε)n^{1/2+ε} for ε > 0. Although there is a gap in this range in the general case, under the additional assumption that the dictionary is orthogonal this gap can be removed.

Theorem C. Under the assumptions of Theorem B, if F = {f_1, ..., f_M} is such that Ef_i(X)f_j(X) = 0 for any i ≠ j ∈ {1, ..., M}, then f̃^{ERM-C} achieves the optimal rate of (C) aggregation: for any u > 0, with probability greater than 1 − exp(−u),

R(f̃^{ERM-C}) ≤ min_{f ∈ conv(F)} R(f) + c_2 b² max[ min( M/n , √( (1/n) log(eMb²/√n) ) ) , u/n ].

Removing the gap in the general case is likely to be a much harder problem, although we believe that the orthogonal case is the "worst" one.
Combining Theorem A with Theorem B, it follows that, up to some logarithmic terms, the rate ψ_n^{(C)}(M) is the rate of aggregation of f̃^{ERM-C} for both the (MS) and (C) aggregation problems. In particular, it achieves the conjectured optimal rate for the (C) aggregation problem, up to a logarithmic factor that appears when √n < M ≤ c_1(ε)n^{1/2+ε}. However, it is far from the conjectured optimal rate for the (MS) aggregation problem.
Finally, a word about notation. Throughout, we denote absolute constants or constants that depend on other parameters by c, C, c_1, c_2, etc. (and, of course, we will specify when a constant is absolute and when it depends on other parameters). The values of constants may change from line to line. The notation x ∼ y (resp. x ≲ y) means that there exist absolute constants 0 < c < C such that cy ≤ x ≤ Cy (resp. x ≤ Cy). If b > 0 is a parameter then x ≲_b y means that x ≤ C(b)y for some constant C(b) depending only on b. We denote by ℓ_p^M the space R^M endowed with the ℓ_p norm; the unit ball there is denoted by B_p^M. We also denote the unit Euclidean sphere in R^M by S^{M−1}. If F is a class of functions, then let f* be a minimizer in F of the true risk; in our case, f* is the minimizer of E(f(X) − Y)². For every f ∈ F set L_f = (Y − f(X))² − (Y − f*(X))², and let L_F = {L_f : f ∈ F} be the excess loss class associated with F, the target Y and the quadratic risk.
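To make the procedure under study concrete, here is a minimal numerical sketch (not from the paper) of f̃^{ERM-C}: empirical risk minimization over the convex hull of a finite dictionary, written as least squares over the simplex of weights. The toy dictionary, the data-generating choice and the use of scipy's SLSQP solver are illustrative assumptions, not part of the analysis above.

```python
import numpy as np
from scipy.optimize import minimize

def erm_convex_hull(F_vals, y):
    """ERM over conv(F): find simplex weights lam minimizing
    (1/n) * sum_i (y_i - sum_j lam_j * f_j(X_i))^2.

    F_vals : (n, M) array whose columns are the dictionary functions
             evaluated at the sample points X_1, ..., X_n.
    y      : (n,) array of responses Y_1, ..., Y_n.
    """
    n, M = F_vals.shape
    def empirical_risk(lam):
        return np.mean((y - F_vals @ lam) ** 2)
    # constraints: lam lies in the simplex (a convex combination of f_1, ..., f_M)
    cons = ({'type': 'eq', 'fun': lambda lam: np.sum(lam) - 1.0},)
    bounds = [(0.0, 1.0)] * M
    lam0 = np.full(M, 1.0 / M)
    res = minimize(empirical_risk, lam0, bounds=bounds, constraints=cons,
                   method='SLSQP')
    return res.x  # weights of f_tilde^{ERM-C} in conv(F)

# toy illustration with a random dictionary (purely for demonstration)
rng = np.random.default_rng(0)
n, M = 200, 10
F_vals = rng.uniform(-1, 1, size=(n, M))      # f_j(X_i) evaluated pointwise
y = F_vals[:, :3].mean(axis=1) + 0.1 * rng.standard_normal(n)
print(np.round(erm_convex_hull(F_vals, y), 3))
```

For the dictionaries used in Section 2 below, which contain 0 and ±φ_j, the same computation runs over the ℓ_1 ball of coefficients rather than the simplex.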

2 Proof of the lower bound for the (MS) aggregation problem (Theorem A)

The proof of Theorem A consists of two parts. The first, simpler part is the case M ≤ √n. This is due to the fact that if 0 < θ < 1 and ρ = θr ∼ M/n, the set B_1^M ∩ √r S^{M−1} is much "larger" than the set B_1^M ∩ √ρ B_2^M. This results in much larger "oscillations" of the appropriate empirical process on the former set than on the latter one, leading to very negative values of the empirical excess risk functional for functions whose excess risk is larger than ρ. The case M ≥ √n is much harder because, for the required values of r and ρ, the complexity of the two sets is very close, and comparing the two oscillations accurately involves a far more delicate analysis.

2.1 The case M ≤ √n

We will follow the method used in [15]. Let (φ_i)_{i∈N} be a sequence of functions defined on [0, 1] and set µ to be a probability measure on [0, 1] such that (φ_i : i ∈ N) is a sequence of independent Rademacher variables in L_2([0, 1], µ). Let M ≤ √n be fixed and put (X, Y) to be a couple of random variables; X is distributed according to µ and Y = φ_{M+1}(X). Let F = {0, ±φ_1, ..., ±φ_M} be the dictionary, and note that any function in the convex hull of F can be written as f_λ = Σ_{j=1}^M λ_j φ_j for λ ∈ B_1^M. Since, relative to conv(F), f* = 0, the excess quadratic loss function is

L_λ(X, Y) = ⟨λ, Φ(X)⟩² − 2φ_{M+1}(X) ⟨λ, Φ(X)⟩,

where we set Φ(·) = (φ_1(·), ..., φ_M(·)). The following is a reformulation of Lemma 5.4 in [15].

Lemma 2.1. There exist absolute constants c_0, c_1 and c_2 for which the following holds. Let (X_i, Y_i)_{i=1,...,n} be n independent copies of (X, Y). Then, for every r > 0, with probability greater than 1 − 8 exp(−c_0 M), for any λ ∈ R^M,

| (1/n) Σ_{i=1}^n ⟨λ, Φ(X_i)⟩² − ||λ||²_2 | ≤ ||λ||²_2 / 2   (2.1)

and

c_1 √(rM/n) ≤ sup_{λ ∈ √r B_2^M} (1/n) Σ_{i=1}^n ⟨λ, Φ(X_i)⟩ φ_{M+1}(X_i) ≤ c_2 √(rM/n).   (2.2)

Set r = βM/n for some 0 < β ≤ 1 to be named later, and observe that B_1^M ∩ √r S^{M−1} = √r S^{M−1} because r ≤ 1/M. Applying (2.1) and (2.2), it is evident that with probability greater than 1 − 8 exp(−c_0 M),

inf_{λ ∈ B_1^M ∩ √r S^{M−1}} P_n L_λ = r − sup_{λ ∈ √r S^{M−1}} (P − P_n) L_λ
  ≤ r + sup_{λ ∈ √r S^{M−1}} | ||λ||²_2 − (1/n) Σ_{i=1}^n ⟨λ, Φ(X_i)⟩² | − (2/n) sup_{λ ∈ √r S^{M−1}} Σ_{i=1}^n ⟨λ, Φ(X_i)⟩ φ_{M+1}(X_i)
  ≤ 3r/2 − 2c_1 √(rM/n) = ( 3β/2 − 2c_1 √β ) (M/n) ≤ −c_1 √β (M/n),

provided that β ≤ (2c_1/3)².
On the other hand, let ρ = αM/n for some α to be chosen later. Using (2.1) and (2.2) again, it follows that with probability at least 1 − 8 exp(−c_0 M), for any λ ∈ B_1^M ∩ √ρ B_2^M,

P_n L_λ ≤ P L_λ + | ||λ||²_2 − (1/n) Σ_{i=1}^n ⟨λ, Φ(X_i)⟩² | + (2/n) | Σ_{i=1}^n ⟨λ, Φ(X_i)⟩ φ_{M+1}(X_i) |
  ≤ 3ρ/2 + 2c_2 √(ρM/n) = ( 3α/2 + 2c_2 √α ) (M/n).

Therefore, if 0 < α < β satisfies 3α/2 + 2c_2 √α < c_1 √β for some 0 < β ≤ (2c_1/3)², then, with probability greater than 1 − 16 exp(−c_0 M), λ ↦ R_n(f_λ) will be "more negative" on B_1^M ∩ √r S^{M−1} than on B_1^M ∩ √ρ B_2^M. Hence, with that probability, R(f̃^{ERM-C}) ≥ min_{f ∈ F} R(f) + ρ = min_{f ∈ F} R(f) + αM/n.


2.2 The case M ≥ √n

Let us reformulate the second part of Theorem A.

Theorem 2.2. There exist absolute constants c and n_0 for which the following holds. For all integers n ≥ n_0 and M, if M ≥ √n, there is a function class F_M of cardinality M consisting of functions that are bounded by 1, and a couple (X, Y) distributed according to a probability measure µ, such that with µ^{⊗n}-probability at least 1/2,

R(f̂) ≥ min_{f ∈ F_M} R(f) + c / √( n log(eM/√n) ),

where f̂ is the empirical minimizer in conv(F_M).

The proof will require accurate information on a monotone rearrangement of almost Gaussian random variables.

Lemma 2.3. There exists an absolute constant C for which the following holds. Let g be a standard Gaussian random variable, set H(x) = P(|g| > x) and put W(p) = H^{-1}(p) (the inverse function of H). Then for every 0 < p < 1,

| W²(p) − [ log(2/(πp²)) − log(log(2/(πp²))) ] | ≤ C · log log(2/(πp²)) / log(2/(πp²)).

Moreover, for every 0 < ε < 1/2 and 0 < p < 1/(1 + ε),

| W²(p) − W²((1 + ε)p) | ≤ Cε,   | W²(p) − W²((1 − ε)p) | ≤ Cε.

Proof.

The proof of the first part follows from the observation that for every x > 0,

(√(2/π)/x) exp(−x²/2) (1 − 1/x²) ≤ P(|g| > x) ≤ (√(2/π)/x) exp(−x²/2),   (2.3)

where c is a suitable absolute constant (see, e.g. [22]), combined with a straightforward (yet tedious) computation. The second part of the claim follows from the first one, and is omitted.

The next step is a Gaussian approximation of a variable Y = n^{-1/2} Σ_{i=1}^n X_i, where X_1, ..., X_n are i.i.d. random variables with mean zero and variance 1, under the additional assumption that X has well behaved tails.

Definition 2.4. [18, 26] Let 1 ≤ α ≤ 2. We say that a random variable X belongs to L_{ψ_α} if there exists a constant C such that

E exp(|X|^α / C^α) ≤ 2.   (2.4)

The infimum over all constants C for which (2.4) holds defines a norm called the ψ_α norm of X, and we denote it by ||X||_{ψ_α}.
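As a quick sanity check of Lemma 2.3 (not part of the original argument), the approximation W²(p) ≈ log(2/(πp²)) − log log(2/(πp²)) can be compared with the exact Gaussian quantile; the use of scipy.stats.norm below is an illustrative assumption.

```python
import numpy as np
from scipy.stats import norm

def W(p):
    """Inverse of H(x) = P(|g| > x) for a standard Gaussian g."""
    return norm.isf(p / 2.0)          # since P(|g| > x) = 2 * P(g > x)

for p in [1e-1, 1e-2, 1e-3, 1e-4]:
    t = np.log(2.0 / (np.pi * p ** 2))
    approx = t - np.log(t)            # log(2/(pi p^2)) - log log(2/(pi p^2))
    print(f"p={p:g}   W(p)^2={W(p)**2:.3f}   approx={approx:.3f}")
```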

Proposition 2.5 ([22], pg. 183). For every L there exist constants c_1 and c_2 that depend only on L and for which the following holds. Let (X_n)_{n∈N} be a sequence of i.i.d., mean zero random variables with variance 1, for which ||X||_{ψ_1} ≤ L. If Y = (1/√n) Σ_{i=1}^n X_i then for any 0 < x ≤ c_1 n^{1/6},

P[Y ≥ x] = P[g ≥ x] exp( EX_1³ x³ / (6√n) ) [ 1 + c_2 (x + 1)/√n ]

and

P[Y ≤ −x] = P[g ≤ −x] exp( −EX_1³ x³ / (6√n) ) [ 1 + c_2 (x + 1)/√n ].

In particular, if 0 < x ≤ c_1 n^{1/6} and EX_1³ = 0 then

| P[|Y| ≥ x] − P[|g| ≥ x] | = c_2 P[|g| ≥ x] (x + 1)/√n.

Since Proposition 2.5 implies a better Gaussian approximation than the standard Berry-Esséen bounds, one may consider the following family of random variables that will be used in the construction.

Definition 2.6. We say that a random variable Y is (L, n)-almost gaussian for L > 0 and n ∈ N, if Y = n^{-1/2} Σ_{i=1}^n X_i, where X_1, ..., X_n are independent copies of X, which is a non-atomic random variable with mean 0, variance 1, and satisfies EX³ = 0 and ||X||_{ψ_1} ≤ L.

Let X_1, ..., X_n and Y be such that Y = n^{-1/2} Σ_{i=1}^n X_i is (L, n)-almost gaussian. For 0 < p < 1 set U(p) = {x > 0 : P(|Y| > x) = p}. Since X is non-atomic, U(p) is non-empty; let u⁺(p) = sup U(p) and u⁻(p) = inf U(p). We shall apply Lemma 2.3 and Proposition 2.5 in the following case to bound u⁺(i/M) and u⁻(i/M) for every i, as long as M is not too large (i.e. log M ≤ c_1 n^{1/3}). To that end, set ε_{M,n} = [(log M)/n]^{1/2}, and for fixed values of M and n, and 1 ≤ i ≤ M, let u_i⁺ = u⁺(i/M) and u_i⁻ = u⁻(i/M).

Corollary 2.7. For every L > 0 there exist a constant C_0 that depends on L and an absolute constant C_1 for which the following holds. Assume that Y is (L, n)-almost gaussian and that log M ≤ C_0 n^{1/3}. Then, for every 1 ≤ i ≤ M/2,

(u_i⁺)² ≤ log( 2M²/(πi²) ) − log log( 2M²/(πi²) ) + C_1 max{ log log(2M²/(πi²)) / log(2M²/(πi²)) , ε_{M,n} }

and

(u_i⁻)² ≥ log( 2M²/(πi²) ) − log log( 2M²/(πi²) ) − C_1 max{ log log(2M²/(πi²)) / log(2M²/(πi²)) , ε_{M,n} }.

Proof. Since √(log M) ≤ C_0 n^{1/6}, one may use the Gaussian approximation from Proposition 2.5 to obtain

P[|Y| ≥ √(4 log M)] ≤ P[|g| ≥ √(4 log M)] ( 1 + c_1 (√(4 log M) + 1)/√n )
  ≤ exp(−2 log M) √( 2/(4π log M) ) ( 1 + c_1 (√(4 log M) + 1)/√n ) ≤ c_2/M².

Thus, for every 1 ≤ i ≤ M, if x ∈ U(i/M) then x ≤ √(4 log M).
Let 1 ≤ i ≤ M/2 and x ∈ U(i/M). Since x ≤ 2C_0 n^{1/6} (because x ≤ √(4 log M) ≤ 2C_0 n^{1/6}), it follows from Proposition 2.5 that

| i/M − H(x) | ≤ c_3 H(x) (x + 1)/√n ≤ c_4 H(x) ε_{M,n},   (2.5)

where H(x) = P[|g| ≥ x]. Observe that if W(p) = H^{-1}(p), then

| W²(i/M) − x² | ≤ c_5 ε_{M,n}.

Indeed, since H(x)(1 − c_4 ε_{M,n}) ≤ i/M ≤ H(x)(1 + c_4 ε_{M,n}), then by the monotonicity of W and the second part of Lemma 2.3, setting p = H(x),

W²(i/M) ≤ W²((1 + c_4 ε_{M,n})H(x)) ≤ W²(H(x)) + c_6 ε_{M,n} = x² + c_6 ε_{M,n}.

One obtains the lower bound in a similar way. The claim follows by using the approximate value of W²(i/M) provided in the first part of Lemma 2.3.

The parameters u_i⁺ and u_i⁻ can be used to estimate the distribution of a non-increasing rearrangement (Y_i*)_{i=1}^M of the absolute values of M independent copies of Y.

Lemma 2.8. There exist constants c > 0 and j_0 ∈ N for which the following holds. Let Y_1, ..., Y_M be i.i.d. non-atomic random variables. For every 1 ≤ s ≤ M, with probability at least 1 − 2 exp(−cs),

|{i : |Y_i| ≥ u_s⁻}| ≥ s/2   and   |{i : |Y_i| ≥ u_s⁺}| ≤ 3s/2.

In particular, with probability at least 5/6, for every j_0 ≤ j ≤ M/2,

u⁻_{2j} ≤ Y_j* ≤ u⁺_{⌈2(j−1)/3⌉},

where ⌈x⌉ = min{n ∈ N : x ≤ n}.

Proof. Fix 0 < p < 1 to be named later and let (δ_i)_{i=1}^M be independent {0, 1}-valued random variables with Eδ_i = p. A straightforward application of Bernstein's inequality [26] shows that

P( | (1/M) Σ_{i=1}^M δ_i − p | ≥ t ) ≤ 2 exp(−cM min{t²/p, t}).

In particular, with probability at least 1 − 2 exp(−c_1 Mp),

(1/2)Mp ≤ Σ_{i=1}^M δ_i ≤ (3/2)Mp.

We will apply this observation to the independent random variables δ_i = 1I_{|Y_i| > a}, 1 ≤ i ≤ M, for an appropriate choice of a. Indeed, if we take a for which P(|Y_1| > a) = s/M (such an a exists because Y_1 is non-atomic), then with probability at least 1 − 2 exp(−c_1 s), at least s/2 of the |Y_i| will be larger than a, and at most 3s/2 will be larger than a. Since this result holds for any a ∈ U(s/M), the first part of the claim follows.
Now take s_0 to be the smallest integer such that 1 − 2 Σ_{s=s_0}^M exp(−cs) ≥ 5/6 and apply a union bound and a change of variable. Hence, with probability at least 5/6, for every ⌊(3s_0)/2⌋ + 1 ≤ j ≤ ⌈M/2⌉,

|{i : |Y_i| ≥ u⁻_{2j}}| ≥ j   and   |{i : |Y_i| ≥ u⁺_{⌈2(j−1)/3⌉}}| ≤ j − 1,

and thus u⁻_{2j} ≤ Y_j* ≤ u⁺_{⌈2(j−1)/3⌉}.

With Lemma 2.8 and Corollary 2.7 in hand, we can bound the following functional of the random variables (Y_i*)_{i=1}^M.

Lemma 2.9. For every L > 0 there exist constants c_1, ..., c_4, j_0 and α < 1 that depend only on L for which the following holds. Let Y be (L, n)-almost gaussian and let Y_1, ..., Y_M be independent copies of Y. Then, with probability at least 5/6, for every j_0 ≤ ℓ ≤ αk and k ≤ αM,

c_1 ( log(ek/ℓ) − ε_{M,n} ) / √(log(eM/ℓ)) ≤ Y_ℓ* − Y_k* ≤ c_2 ( log(ek/ℓ) + ε_{M,n} ) / √(log(eM/ℓ)).

Moreover, with probability at least 2/3,

Y_ℓ* − Y_k* − ( (1/k) Σ_{i=1}^k (Y_i* − Y_k*)² )^{1/2} ≥ c_3 log(ek/ℓ)/√(log(eM/ℓ)) − c_4 /√(log(eM/k)),

provided that log² M ≲_L k and that ε_{M,n} = √((log M)/n) ≤ 1.

Proof. The first part of the claim follows from Lemma 2.8 and Corollary 2.7, combined with a straightforward computation. For the second part, observe that, for some well chosen constant c_1(L) depending only on L, with probability at least 5/6, Y_1* ≤ c_1(L)√(log M). Hence, applying the first part of the claim, with probability at least 2/3,

(1/k) Σ_{i=1}^k (Y_i* − Y_k*)² ≤ c_1(L) (j_0 log M)/k + (1/k) Σ_{i=j_0}^k (Y_i* − Y_k*)²
  ≤ c_1(L) (j_0 log M)/k + (c_2/k) Σ_{i=j_0}^k ( log²(ek/i)/log(eM/i) + ε²_{M,n}/log(eM/i) )
  ≤ c_1(L) (j_0 log M)/k + c_3 (1 + ε²_{M,n})/log(eM/k) ≤ c_4 / log(eM/k),

provided that log² M ≲_L k and that ε_{M,n} ≤ 1. Note that to estimate the sum we used that

(1/k) Σ_{i=j_0}^k log²(ek/i)/log(eM/i) ≤ (1/log(eM/k)) (1/k) Σ_{i=j_0}^k log²(ek/i) ≤ c_3/log(eM/k).

Now the second part follows from the first one.

The next preliminary step we need is a simple bound on the dual norm to the one whose unit ball is A_r = B_1^M ∩ √r B_2^M. From here on, given v ∈ R^M we shall set

||v||_{A_r°} = sup_{w ∈ A_r} ⟨v, w⟩,

and, as always, (v_i*)_{i=1}^M is the monotone rearrangement of (|v_i|)_{i=1}^M.

Lemma 2.10. For every v ∈ R^M and any 0 < ρ < r ≤ 1 such that 1/r and 1/ρ are integers,

||v||_{A_r°} − ||v||_{A_ρ°} ≥ v*_{1/r} − v*_{1/ρ} − ( ρ Σ_{i=1}^{1/ρ} (v_i* − v*_{1/ρ})² )^{1/2}.

Proof. First, observe that for every v ∈ R^M,

||v||_{A_r°} = inf_{1 ≤ j ≤ M} [ √r ( Σ_{i=1}^j (v_i* − v_j*)² )^{1/2} + v_j* ].   (2.6)

Indeed, since A_r° = conv( B_∞^M ∪ (1/√r) B_2^M ), it is evident that ||v||_{A_r°} = inf{ ||u||_∞ + √r ||w||_2 : v = u + w }. One may verify that if v = u + w is an optimal decomposition then supp(w) ⊂ {i : |u_i| = ||u||_∞}. Hence, if ||u||_∞ = K then u_i = K sgn(v_i) 1I_{|v_i| ≥ K} + v_i 1I_{|v_i| < K}, and

||v||_{A_r°} = inf_{K > 0} { K + √r ( Σ_{i : |v_i| ≥ K} (|v_i| − K)² )^{1/2} }.

Moreover, since it is enough to consider only values of K in {v_j* : 1 ≤ j ≤ M}, (2.6) is verified. In particular, if 1/r is an integer then

||v||_{A_r°} ≤ √r ( Σ_{i=1}^{1/r} (v_i* − v*_{1/r})² )^{1/2} + v*_{1/r}.

On the other hand, if T_r = {u ∈ R^M : ||u||_2 ≤ √r, |supp(u)| ≤ 1/r} then T_r ⊂ B_1^M ∩ √r B_2^M. Hence,

||v||_{A_r°} ≥ sup_{w ∈ T_r} ⟨v, w⟩ = ( r Σ_{i=1}^{1/r} (v_i*)² )^{1/2}.

Therefore, if 1/r and 1/ρ are integers, it follows that

||v||_{A_r°} − ||v||_{A_ρ°} ≥ ( r Σ_{i=1}^{1/r} (v_i*)² )^{1/2} − √ρ ( Σ_{i=1}^{1/ρ} (v_i* − v*_{1/ρ})² )^{1/2} − v*_{1/ρ}
  ≥ v*_{1/r} − v*_{1/ρ} − √ρ ( Σ_{i=1}^{1/ρ} (v_i* − v*_{1/ρ})² )^{1/2},

because ( r Σ_{i=1}^{1/r} (v_i*)² )^{1/2} ≥ v*_{1/r}.
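As an illustration only (not part of the proof), formula (2.6) can be checked numerically against a direct computation of sup_{w ∈ A_r} ⟨v, w⟩; the random vector and the scipy SLSQP solver below are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
M = 20
r = 1.0 / 5                                   # 1/r is an integer, as in Lemma 2.10
v = rng.standard_normal(M)

# right-hand side of (2.6): inf over j of sqrt(r)*(sum_{i<=j}(v_i* - v_j*)^2)^{1/2} + v_j*
vs = np.sort(np.abs(v))[::-1]                 # monotone rearrangement v_1* >= v_2* >= ...
formula = min(np.sqrt(r) * np.sqrt(np.sum((vs[:j] - vs[j - 1]) ** 2)) + vs[j - 1]
              for j in range(1, M + 1))

# direct computation of sup <v, w> over A_r = B_1^M intersect sqrt(r)*B_2^M,
# writing w = p - q with p, q >= 0 so that the l1 constraint is smooth
def neg_objective(z):
    return -(v @ (z[:M] - z[M:]))

constraints = ({'type': 'ineq', 'fun': lambda z: 1.0 - np.sum(z)},
               {'type': 'ineq', 'fun': lambda z: r - np.sum((z[:M] - z[M:]) ** 2)})
res = minimize(neg_objective, np.full(2 * M, 1e-3), bounds=[(0.0, None)] * (2 * M),
               constraints=constraints, method='SLSQP')
print(round(formula, 4), round(-res.fun, 4))  # the two numbers should agree closely
```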

Proof of Theorem 2.2. Let Φ_1, ..., Φ_M, X and a > 0 be such that Φ_1(X), ..., Φ_M(X) are uniformly distributed on [−a, a] and have variance 1. Set T(X) = Φ_{M+1}(X) = Y to be a Rademacher variable. Assume further that (Φ_i)_{i=1}^{M+1} are independent in L_2(P^X) and let F_M = {0, ±Φ_1, ..., ±Φ_M}. Note that the functions in conv(F_M) are given by f_λ = ⟨Φ, λ⟩ where Φ = (Φ_1, ..., Φ_M) and λ ∈ B_1^M. It is straightforward to verify that the excess loss function of f_λ relative to conv(F_M) is

L_{f_λ} = (f_λ − Φ_{M+1})² − (0 − Φ_{M+1})² = ⟨Φ, λ⟩² − 2⟨Φ, λ⟩Φ_{M+1}

(since f* = 0), implying that EL_{f_λ} = ||λ||²_2.

Let us consider the problem of empirical minimization in conv(F_M) = {⟨λ, Φ⟩ : λ ∈ B_1^M}. Recall that A_r = B_1^M ∩ √r B_2^M and, for an independent sample (Φ(X_i), Φ_{M+1}(X_i))_{i=1}^n, define the functional

ψ(r, ρ) = n ( inf_{λ ∈ A_r} R_n(f_λ) − inf_{µ ∈ A_ρ} R_n(f_µ) )
  = inf_{λ ∈ A_r} Σ_{i=1}^n [ ⟨Φ(X_i), λ⟩² − 2⟨Φ(X_i), λ⟩ Φ_{M+1}(X_i) ] − inf_{µ ∈ A_ρ} Σ_{i=1}^n [ ⟨Φ(X_i), µ⟩² − 2⟨Φ(X_i), µ⟩ Φ_{M+1}(X_i) ].

If we show that for some r ≥ ρ, ψ(r, ρ) < 0, then for that sample, EL_{f̂} ≥ ρ. Note that

ψ(r, ρ) ≤ −2 sup_{λ ∈ A_r} Σ_{i=1}^n ⟨Φ(X_i), λ⟩ Φ_{M+1}(X_i) + 2 sup_{µ ∈ A_ρ} Σ_{i=1}^n ⟨Φ(X_i), µ⟩ Φ_{M+1}(X_i) + sup_{λ ∈ A_r} Σ_{i=1}^n ⟨Φ(X_i), λ⟩²,

and let us estimate the supremum of the process

λ ∈ A_r ↦ Σ_{i=1}^n ⟨Φ(X_i), λ⟩² = n ( (P_n − P)(⟨Φ, λ⟩²) + ||λ||²_2 ).

Observe that Φ(X) is isotropic (that is, for every λ ∈ R^M, E⟨λ, Φ(X)⟩² = ||λ||²_2) and subgaussian – since ||⟨λ, Φ(X)⟩||_{ψ_2} ≤ 4a ||λ||_2. Hence, applying the results from [21], it is evident that with probability at least 3/4,

sup_{λ ∈ A_r} (P_n − P)(⟨λ, Φ⟩²) ≤ c(a) max( diam(A_r, ||·||_2) γ_2(A_r, ||·||_2)/√n , γ_2²(A_r, ||·||_2)/n ).

Recall that for r ≥ 1/M, γ_2(A_r, ||·||_2) ∼ √(log(eMr)) (see, for instance, [21]), and thus, if r ≥ max(1/M, 1/n), then with probability at least 3/4,

n sup_{λ ∈ A_r} [ (P_n − P)(⟨Φ, λ⟩²) + ||λ||²_2 ] ≤ nr + c_1 √( nr log(eMr) ),


where c_1 is a constant that depends only on a.
Next, to estimate the first two terms, let

Y_j = n^{-1/2} Σ_{i=1}^n Φ_{M+1}(X_i) ⟨Φ(X_i), e_j⟩

and observe that (Y_j)_{j=1}^M are independent copies of a (2, n)-almost gaussian variable. If we set V = (Y_i)_{i=1}^M then

sup_{λ ∈ A_r} Σ_{i=1}^n ⟨λ, Φ(X_i)⟩ Φ_{M+1}(X_i) − sup_{θ ∈ A_ρ} Σ_{i=1}^n ⟨θ, Φ(X_i)⟩ Φ_{M+1}(X_i)
  = √n ( sup_{λ ∈ A_r} ⟨λ, V⟩ − sup_{θ ∈ A_ρ} ⟨θ, V⟩ ) = √n ( ||V||_{A_r°} − ||V||_{A_ρ°} ) = (∗).

By Lemma 2.10, if 1/r = ℓ and 1/ρ = k are integers, then

(∗) ≥ √n [ Y_ℓ* − Y_k* − ( (1/k) Σ_{i=1}^k (Y_i* − Y_k*)² )^{1/2} ]

and thus, if ℓ, k, M and n are as in Lemma 2.9, then with probability at least 2/3,

(∗) ≥ √n ( c_2 log(ek/ℓ)/√(log(eM/ℓ)) − c_3/√(log(eM/k)) ) ≥ c_4 √n log(ek/ℓ)/√(log(eM/ℓ)),

provided that k ≥ c_5 ℓ for c_5 large enough. Hence, with probability at least 1/2,

ψ(r, ρ) ≤ −2c_4 √n log(ek/ℓ)/√(log(eM/ℓ)) + nr + c_1 √( rn log(eMr) ).

It follows that if we select r ∼ 1/√( n log(eM/√n) ) and ρ ∼ r with ρ < r so that the conditions of Lemma 2.9 are satisfied, then with probability at least 1/2, ψ(r, ρ) < 0. Hence, with the same probability,

R(f̂) − min_{f ∈ F_M} R(f) = EL_{f̂} ≥ c_6 / √( n log(eM/√n) ).


c6

√ . n log(eM/ n)

3

Proof of the upper bound for the (C) aggregation problem (Theorem B)

Our starting point is to describe the machinery developed in [3], leading to the desired estimates on the performance of ERM in a general class of functions. Let G be a class ∗ (x))2 : g ∈ G} the of functions and denote by LG = {(x, y) 7−→ (y − g(x))2 − (y − gG ∗ associated class of quadratic excess loss functions, where gG is the minimizer of the quadratic risk in G. Let V = star(LG , 0) = {θL : 0 ≤ θ ≤ 1, L ∈ LG } and for every λ > 0 set Vλ = {h ∈ V : Eh ≤ λ}. Theorem 3.1 ([3]) For every positive B and b there exists a constant c0 = c0 (B, b) for which the following holds. Let G be a class of functions for which LG consists of functions that are bounded by b almost surely. Assume further that for any L ∈ LG , EL2 ≤ BEL. If x > 0, λ∗ > 0 satisfies that E kP − Pn k Vλ∗ ≤ λ∗ /8 and  x λ∗ (x) = c0 max λ∗ , , n then with probability greater than 1 − exp(−x), the empirical risk minimization procedure gb in G satisfies R(b g ) ≤ inf R(g) + λ∗ (x). g∈G

Let F be the given dictionary and set G = conv(F ). Using the notation of Theorem 3.1, put  Lconv(F ) = Lf : f ∈ conv(F ) , consider the star-shaped hull V = star(Lconv(F ) , 0) and its localizations Vλ = {g ∈ V : Eg ≤ λ} for any λ > 0. Thanks to convexity, the following observation holds in our case (see [20] for the proof). Proposition 3.2 If f ∈ conv(F ) then ELf ≥ kf − f ∗ k2L2 (P X ) where f ∗ is the minimizer of the quadratic risk in conv(F ). In particular, 1. EL2 ≤ 4b2 EL for any L ∈ Lconv(F ) ; 2. For µ > 0, if f ∈ conv(F ) satisfies that ELf ≤ µ, then f ∈ f ∗ + Kµ , where Kµ = 2[conv{±f1 , . . . , ±fM } ∩



µB(L2 (P X ))].

The first part of Proposition 3.2 shows that Lconv(F ) satisfies the assumptions of Theorem 3.1 with B = 4b2 .

16

To apply Theorem 3.1 one has to find λ∗ > 0 for which E kP − Pn kVλ∗ ≤ λ∗ /8, and to that end we will use the second part of Proposition 3.2. Observe that Vλ ⊂

∞ [

{θLf : 0 ≤ θ ≤ 2−i , ELf ≤ 2i+1 λ},

i=0

and it was shown in [4] that E kP − Pn kVλ ≤

X

2−i E kP − Pn kL

2i+1 λ

,

(3.1)

i≥0



where from here on we set Lµ = L ∈ Lconv(F ) : EL ≤ µ . Applying the second part of Proposition 3.2 it is evident that {f ∈ conv(F ) : ELf ≤ µ} ⊂ f ∗ + Kµ . There are two trivial ways of approximating Kµ , both of which will be used. First, an “`1 ” approximation, that Kµ ⊂ 2conv{±f1 , . . . , ±fM }. Second, an “L2 ” approxi√ mation, that Kµ ⊂ 2 µB(L2 (P X )) ∩ span(F ). Proposition 3.3 There exists an absolute constant c1 > 0 for which the following holds. For any µ > 0, ( r ) r log M M µ E kP − Pn kLµ ≤ c1 min b2 ,b . n n Proof.

By the Gin´e-Zinn symmetrization Theorem [26], E kP − Pn kLµ

n 1 X ≤ 2EE sup i L(Xi , Yi ) . L∈Lµ n i=1

Note that if L ∈ Lµ and f ∈ conv(F ) satisfies that L = Lf , then for any (x, y), |L(x, y)| = |(y − f (x))2 − (y − f ∗ (x))2 | = |(f ∗ (x) − f (x))(2y − f (x) − f ∗ (x))| ≤ 4b|f (x) − f ∗ (x)|. Thus, by the contraction principle (see, e.g. [18]) and Proposition 3.2, n 1 X 8b i f (Xi ) . E kP − Pn kLµ ≤ √ EE sup √ n n f ∈Kµ i=1

17

(3.2)

Set F (·) = (f1 (·), . . . , fM (·)). By the comparison theorem between Rademacher and Gaussian processes [18] and the “`1 ” approximation for Kµ , it is evident that conditionally on X1 , . . . , Xn , n n 1 X 1 X E sup √ i f (Xi ) ≤ c1 Eg sup √ gi f (Xi ) n n f ∈Kµ f ∈Kµ i=1 i=1 n

X 1 gi F (Xi ) = c2 Eg max |γj |, ≤ c2 Eg sup λ, √ 1≤j≤M n λ∈B M i=1

1

P where γj is the Gaussian random variable n−1/2 ni=1 gi fj (Xi ) defined for all j = P 1, . . . , M . The variance of γj is n−1 ni=1 fj (Xi )2 ≤ b2 . Hence, by the Gaussian maximal inequality [18], p Eg max |γj | ≤ c3 b log M , 1≤j≤M

which completes the first part of the proof. √ Turning to the second part, when M ≤ n, we will use the strategy developed in [14]. Let S be the linear subspace of L2 (P X ) spanned by F and take (e1 , . . . , eM 0 ) to be an orthonormal basis of S (where M 0 = dim(S) ≤ M ). Since Kµ ⊂ S ∩ √ 2 µB(L2 (P X )), then n n M0 1 X 1 X X  E sup i f (Xi ) ≤ 2E sup √ i λj ej (Xi ) n n f ∈Kµ kλk 0 ≤2 λ i=1

`M 2

M0

i=1

n 2 1/2 √ X1 X i ej (Xi ) ≤ c4 ≤ c4 µE n j=1

r

i=1

j=1

M 0µ . n

Proof of Theorem B: By Proposition 3.3 it follows that for any µ > 0,   p p E kP − Pn kLµ ≤ min c1 b2 (log M )/n, b M µ/n . Then, by (3.1), for every λ > 0, X E kP − Pn kVλ ≤ 2−i E kP − Pn kL

2i+1 λ

i≥0

r  r log M M 2i+1 λ  2 ≤ c1 2 min b ,b n n i≥0  r log M r M λ  ≤ c2 min b2 ,b . n n X

−i

18

Hence, E kP − Pn kVλ∗ ≤ λ∗ /8 if r log M M  λ ∼ b min , , n n ∗

2

and Theorem B follows from Theorem 3.1. Next, we turn to the proof of Theorem C, in which the “trivial” approximations of Kµ are too weak to give the desired upper bound. Proof of Theorem C. Observe that since the dictionary consists of an orthogonal family, if F (x) = (f1 (x), ..., fM (x)) and if (e1 , . . . , eM ) is the standard basis in `M 2 , then 

√ Kµ = 2 λ, F : λ ∈ B1M ∩ µE , where E is an ellipsoid with principal axes (kfi kL2 ei )M i=1 . From here on we will assume M that (kfi kL2 )i=1 is a non-increasing sequence. Following the path used in the proof of Theorem B, it is enough to bound n n 1 X 2 X

E sup i f (Xi ) = E sup i λ, F (Xi ) √ f ∈Kµ n λ∈B M ∩ µE n i=1

i=1

1

n

2

X 1

√ √ = E i F (Xi ) M √ ◦ , n n (B1 ∩ µE) i=1

√ where k·k(B M ∩√µE)◦ denotes the dual norm to the one whose unit ball is B1M ∩ µE. 1 Since both B1M and E are unconditional with respect to the coordinate structure given by (ei )M i=1 , it follows that   !1/2 X  vi 2 √  µ kvk(B M ∩√µE)◦ ∼ inf + max |vi | , (3.3) 1 i∈I c kfi kL2 I⊂{1,...,M } i∈I

Pn √ M and in our case, v = (vj )M j=1 = ((1/ n) · i=1 i fj (Xi ))j=1 . Let p √ J0 = {j : kfj kL2 ≥ c0 b log M / n}, where c0 is a constant to be named later. A straightforward application of Bernstein inequality [26] shows that, for t ≥ c1 ,  X  P ∃j ∈ J0 : Pn fj2 ≥ (t + 1)kfj k2L2 ≤ exp − c2 n(kfj k2L2 /b2 ) min(t2 , t) j∈J0

≤M exp(−c3 t log M ) ≤ exp(−c4 t log M ),

19

and P(∃j ∈ J0c : Pn fj2 ≥ (t + 1)b2 n−1 log M ) ≤ exp(−c4 t log M ). For every integer ` ≥ c1 , let A` = {∀j ∈ J0 : Pn fj2 ≤ (` + 1)kfj k2L2 }

\ {∀j ∈ J0c : Pn fj2 ≤ (` + 1)b2 n−1 log M }.

Set B` = A`+1 ∩ Ac` and note that P(B` ) ≤ P(Ac` ) ≤ 2 exp(−c4 ` log M ) for any ` ≥ c1 . For every ` ≥ c1 , consider the random variables conditioned on B` , n

1 X Uj,` = √ i fj (Xi )/kfj kL2 |B` n

∀j ∈ J0

i=1

and

n

Uj,`

1 X =√ i fj (Xi )|B` n

∀j ∈ J0c .

i=1

Hence, by H¨ offding inequality (cf. [26]), there exists an absolute constant c5 such that, for any j ∈ J0 , P n−1 ni=1 fj2 (Xi ) 2 kUj,` kψ2 () ≤ c5 ≤ c5 (` + 1), kfj k2L2 and for any j ∈ J0c ,

kUj,` k2ψ2 () ≤ c5 (` + 1)b2 (log M )/n.

By a result due to Klartag [13], it follows that for every such ` and any 1 ≤ j ≤ |J0 |,

E

j X

!1/2 2 ∗ (Ui,` )

≤ c6 `

p j log(e|J0 |/j),

i=1 |J |

∗ ) 0 is a monotone rearrangement of (|U |) where (Uj,` j,` j∈J0 . Moreover, by a standard j=1 maximal inequality (see, e.g. [26])

E maxc Uj,` j∈J0

q √ log M ≤ c7 log |J0c | maxc kUj,` kψ2 () ≤ c8 `b √ . j∈J0 n

20

For every 1 ≤ j ≤ |J0 |, let I be the set of the j largest coordinates of (|Uj,` |)j∈J0 . Hence, by (3.3) and since kfj kL2 ≤ b, ! n

X

1

√ i F (Xi ) M √ E |(Xi )ni=1 ∈ B` n (B1 ∩ µE)◦ i=1 !1/2 j   X √ 2 ∗ ∗ .E µ (Ui,` ) , maxc |Uj,` | + E max kfj kL2 Uj,` j∈J0

i=1

 p √ √ p .c6 ` µ j log(e|J0 |/j) + b log(e|J0 |/j) + E maxc |Uj,` | j∈J0

 p √ √ p √ log M µ j log(e|J0 |/j) + b log(e|J0 |/j) + c10 `b √ . ≤c9 ` n Therefore, if we take j = min{[1/µ], |J0 |} it is evident that !   n

X √ p log M 1

n √ √ i F (Xi ) M √ ◦ |(Xi )i=1 ∈ B` . b ` log(eM µ) + . E n n (B1 ∩ µE) i=1

Thus, integration with respect to X1 , ..., Xn and applying the estimates on the measure of B` , r n

2 log(eM µ)

X 1

√ E √ i F (Xi ) M √ .b . ◦ n n n (B1 ∩ µE) i=1

Finally, by (3.1), for any λ > 1/M , X 2−i E kP − Pn kL E kP − Pn kVλ ≤

2i+1 λ

i≥0

r 2

.b

X i≥0

−i

2

log(eM 2i+1 λ) . b2 n

r

log(eM λ) , n

and, if s λ∗ ∼ b2

 eM b2  1 log √ , n n

then E kP − Pn kVλ∗ ≤ λ∗ /8, as required. Note that here, the localization is truly needed to obtain the correct rate of aggregation. To generalize Theorem C beyond the orthogonal case, one must study the geometry of the intersection of B1M with an ellipsoid, but now the two bodies need not have the same unconditional structure, making the analysis considerably harder.

21

4 Concluding Remarks

The best rate of (MS) aggregation for the ERM in F is √((log M)/n) (which is larger than the rate achieved in Theorem B). Therefore, if one decides to use the ERM procedure, it is in general better to use it in conv(F) than in F itself – even for the problem of (MS) aggregation. In particular, for dictionaries of small cardinality (smaller than √n), ERM in the convex hull will outperform ERM in F (since M/n ≤ √((log M)/n)). Moreover, note that when M ≤ c_0 for some absolute constant c_0, the rate M/n achieved by f̃^{ERM-C} is of the same order of magnitude as the optimal residue 1/n, and so ERM in conv(F) is an optimal aggregation procedure in deviation even for the (MS) aggregation problem for a very small dictionary. Finally, from a computational point of view, minimizing over the convex set conv(F) is reasonable, and thus for a finite function class F, ERM in conv(F) should be preferred to ERM in F.
Our final comment has to do with the logarithmic gaps that both the lower and upper bounds have. We believe that the correct rate in both cases is √( log(eM/√n)/n ) when M > √n, and the gap in both cases has a similar source. To obtain the correct upper bound one must study the empirical process indexed by the intersection body of B_1^M and an ellipsoid coming from the localization. The proof of Theorem C requires that B_1^M and an ellipsoid E share the same unconditional structure, and to obtain the general result, one must remove that assumption. Although it seems reasonable that the worst case is when the two share the same unconditional basis, proving it is nontrivial.
For the lower bound, one has to study the structure of B_1^M ∩ √r B_2^M in a range where a proportional change of the radius has a very small impact on the complexity of the sets and thus on the oscillation of the process indexed by the set. Although we believe that the example we gave here should give the desired bound, we have not been able to improve our estimate. The main difficulty is that for the optimal bound one has to compare two very close levels – r and ρ – where √n(r − ρ) ∼ 1/√(log(eM/√n)), rather than r ∼ ρ, which is what we have been able to do here.
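For intuition only, here is a small numerical comparison (not from the paper) of the two rates discussed above; the sample size and dictionary sizes are arbitrary illustrative choices.

```python
import numpy as np

n = 10_000
for M in [5, 50, 100, 1_000]:
    erm_in_F = np.sqrt(np.log(M) / n)                  # (MS) rate of ERM performed in F
    erm_in_conv = min(M / n, np.sqrt(np.log(M) / n))   # Theorem B bound for ERM in conv(F)
    print(f"M={M:5d}   ERM in F: {erm_in_F:.4f}   ERM in conv(F): {erm_in_conv:.4f}")
```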

References [1] Jean-Yves Audibert. Aggregated estimators and empirical complexity for least square regression. Ann. Inst. H. Poincar´e Probab. Statist., 40(6):685–736, 2004. [2] Jean-Yves Audibert. Fast learning rates in statistical inference through aggregation. Ann. Statist., 37(4):1591–1646, 2009. [3] Peter L. Bartlett and Shahar Mendelson. Empirical minimization. Probab. Theory Related Fields, 135(3):311–334, 2006.


[4] Peter L. Bartlett, Shahar Mendelson, and Joseph Neeman. `1 -regularized linear regression: Persistence and oracle inequalities. Submitted, 2009. [5] Olivier Bousquet, Vladimir Koltchinskii, and Dmitriy Panchenko. Some local measures of complexity of convex hulls and generalization bounds. In Computational learning theory (Sydney, 2002), volume 2375 of Lecture Notes in Comput. Sci., pages 59–73. Springer, Berlin, 2002. [6] Peter B¨ uhlmann and Torsten Hothorn. Boosting algorithms: regularization, prediction and model fitting. Statist. Sci., 22(4):477–505, 2007. [7] Florentina Bunea and Andrew Nobel. Sequential procedures for aggregating arbitrary estimators of a conditional mean. IEEE Trans. Inform. Theory, 54(4):1725– 1735, 2008. [8] Florentina Bunea, Alexandre B. Tsybakov, and Marten H. Wegkamp. Aggregation for Gaussian regression. Ann. Statist., 35(4):1674–1697, 2007. [9] Arnak S. Dalalyan and Alexandre B. Tsybakov. Aggregation by exponential weighting and sharp oracle inequalities. In Learning theory, volume 4539 of Lecture Notes in Comput. Sci., pages 97–111. Springer, Berlin, 2007. [10] M. Emery, A. Nemirovski, and D. Voiculescu. Lectures on probability theory and statistics, volume 1738 of Lecture Notes in Mathematics. Springer-Verlag, Berlin, 2000. Lectures from the 28th Summer School on Probability Theory held in Saint-Flour, August 17–September 3, 1998, Edited by Pierre Bernard. [11] Anatoli Juditsky and Arkadii Nemirovski. Functional aggregation for nonparametric regression. Ann. Statist., 28(3):681–712, 2000. [12] Anatoli Juditsky, Philippe Rigollet, and Alexandre B. Tsybakov. Learning by mirror averaging. Ann. Statist., 36(5):2183–2206, 2008. [13] B. Klartag. 5n Minkowski symmetrizations suffice to arrive at an approximate Euclidean ball. Ann. of Math. (2), 156(3):947–960, 2002. [14] Vladimir Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Statist., 34(6):2593–2656, 2006. [15] Guillaume Lecu´e and Shahar Mendelson. Aggregation via empirical risk minimization. Probab. Theory Related Fields, 145(3-4):591–613, 2009. [16] Guillaume Lecu´e and Shahar Mendelson. On the optimality of the aggregate with exponential weights. Submitted, 2010. 23

[17] Guillaume Lecu´e and Shahar Mendelson. Sharper lower bounds on the performance of the empirical risk minimization algorithm. Bernoulli, 16(3):605–613, 2010. [18] Michel Ledoux and Michel Talagrand. Probability in Banach spaces, volume 23 of Ergebnisse der Mathematik und ihrer Grenzgebiete (3) [Results in Mathematics and Related Areas (3)]. Springer-Verlag, Berlin, 1991. Isoperimetry and processes. [19] Shahar Mendelson. Lower bounds for the empirical minimization algorithm. IEEE Trans. Inform. Theory, 54(8):3797–3803, 2008. [20] Shahar Mendelson. Obtaining fast error rates in nonconvex situations. J. Complexity, 24(3):380–397, 2008. [21] Shahar Mendelson, Alain Pajor, and Nicole Tomczak-Jaegermann. Reconstruction and subgaussian operators in asymptotic geometric analysis. Geom. Funct. Anal., 17(4):1248–1282, 2007. [22] Valentin V. Petrov. Limit theorems of probability theory, volume 4 of Oxford Studies in Probability. The Clarendon Press Oxford University Press, New York, 1995. [23] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. Ann. Statist., 26(5):1651–1686, 1998. [24] Alexandre B. Tsybakov. Optimal rate of aggregation. In Computational Learning Theory and Kernel Machines (COLT-2003), volume 2777 of Lecture Notes in Artificial Intelligence, pages 303–313. Springer, Heidelberg, 2003. [25] Alexandre B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Ann. Statist., 32(1):135–166, 2004. [26] Aad W. van der Vaart and Jon A. Wellner. Weak convergence and empirical processes. Springer Series in Statistics. Springer-Verlag, New York, 1996. With applications to statistics. [27] Yuhong Yang. Mixing strategies for density estimation. Ann. Statist., 28(1):75– 87, 2000. [28] Yuhong Yang. Aggregating regression procedures to improve performance. Bernoulli, 10(1):25–47, 2004.

