arXiv:1312.4349v1 [math.ST] 16 Dec 2013

Bernoulli 19(5B), 2013, 2153–2166 DOI: 10.3150/12-BEJ447

Empirical risk minimization is optimal for the convex aggregation problem

GUILLAUME LECUÉ

CNRS, LAMA, Université Paris-Est Marne-la-vallée, 77454 France. E-mail: [email protected]

Let F be a finite model of cardinality M and denote by conv(F) its convex hull. The problem of convex aggregation is to construct a procedure having a risk as close as possible to the minimal risk over conv(F). Consider the bounded regression model with respect to the squared risk denoted by R(·). If f̂_n^{ERM-C} denotes the empirical risk minimization procedure over conv(F), then we prove that for any x > 0, with probability greater than 1 − 4 exp(−x),
\[
R(\hat f_n^{\mathrm{ERM\text{-}C}}) \le \min_{f \in \mathrm{conv}(F)} R(f) + c_0 \max\Big(\psi_n^{(C)}(M), \frac{x}{n}\Big),
\]

where c_0 > 0 is an absolute constant and ψ_n^{(C)}(M) is the optimal rate of convex aggregation defined in (In Computational Learning Theory and Kernel Machines (COLT-2003) (2003) 303–313, Springer) by ψ_n^{(C)}(M) = M/n when M ≤ √n and ψ_n^{(C)}(M) = √(log(eM/√n)/n) when M > √n.

Keywords: aggregation; empirical processes theory; empirical risk minimization; learning theory

1. Introduction and main results

Let X be a probability space and let (X, Y) and (X_1, Y_1), ..., (X_n, Y_n) be n + 1 i.i.d. random variables with values in X × R. From the statistical point of view, the set D = {(X_1, Y_1), ..., (X_n, Y_n)} is the set of given data, where the X_i's are usually considered as input data taking their values in some space X and the Y_i's are outputs or labels associated with these inputs. We are interested in the prediction of Y associated with a new observation X. The data D are thus used to construct functions f : X → R such that f(X) provides a good guess of Y. We measure the quality of this prediction by means of the squared risk
\[
R(f) = \mathbb{E}(Y - f(X))^2,
\]
when f is a real-valued function defined on X, and by
\[
R(\hat f) = \mathbb{E}[(Y - \hat f(X))^2 \mid D]
\]


when f̂ is a function constructed using the data D. For the sake of simplicity, throughout this article, we restrict ourselves to functions f and random variables (X, Y) for which |Y| ≤ b and |f(X)| ≤ b almost surely, for some fixed b ≥ 0. Note that b does not have to be known by the statistician for the construction of the procedures we are studying in this note. Given a finite set F of real-valued measurable functions defined on X (usually called a dictionary), there are three main types of aggregation problems:

1. Model selection aggregation: construct a procedure whose risk is as close as possible to the risk of the best element in F (cf. [2, 3, 12, 14, 15, 17–19, 23, 37, 38, 41, 42]).
2. Convex aggregation: construct a procedure whose risk is as close as possible to the risk of the best function in the convex hull of F (cf. [1, 9, 11, 12, 18, 27, 37, 43]).
3. Linear aggregation: construct a procedure whose risk is as close as possible to the risk of the best function in the linear span of F (cf. [4, 12, 17, 21, 37]).

In this note, we focus on the convex aggregation problem. We want to construct a procedure f̃ for which, with high probability,
\[
R(\tilde f) \le \min_{f \in \mathrm{conv}(F)} R(f) + \psi_n(M), \tag{1.1}
\]

where ψ_n(M) is called the residual term. The residual term is the quantity that we want to be as small as possible. Results in expectation are also of interest: construct a procedure f̃ such that ER(f̃) ≤ min_{f∈conv(F)} R(f) + ψ_n(M). In [37], the author defined the optimal rate of convex aggregation as the smallest price, in the minimax sense, that one has to pay to solve the convex aggregation problem. The definition of [37] is given in expectation, as a function of the cardinality M of the dictionary F and of the sample size n. It has been proved in [37] (see also [18] and [43]) that the optimal rate of convex aggregation is
\[
\psi_n^{(C)}(M) =
\begin{cases}
\dfrac{M}{n}, & \text{if } M \le \sqrt{n},\\[2mm]
\sqrt{\dfrac{1}{n}\log\Big(\dfrac{eM}{\sqrt{n}}\Big)}, & \text{if } M > \sqrt{n}.
\end{cases}
\]
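As a concrete reading of the two regimes, a tiny helper evaluating this rate might look as follows (illustrative only; the function name is ours, and the rate is in any case only defined up to a multiplicative constant).

```python
import math

def psi_C(M: int, n: int) -> float:
    """Optimal convex-aggregation rate psi_n^(C)(M), up to a multiplicative constant."""
    if M <= math.sqrt(n):
        return M / n
    return math.sqrt(math.log(math.e * M / math.sqrt(n)) / n)

# With n = 10_000 observations the regime switches around M = sqrt(n) = 100:
print(psi_C(50, 10_000), psi_C(100, 10_000), psi_C(5_000, 10_000))
```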

This rate is defined up to some multiplicative constant. Note that the rate ψ_n^{(C)}(M) was achieved in [37] in expectation for the Gaussian regression model with a known variance and a known marginal distribution of the design. In [11], the authors were able to remove these assumptions at the price of an extra log n factor for 1 ≤ M ≤ √n (results are still in expectation). In the last year, there have been some striking results on different problems of aggregation, including the convex aggregation problem. To mention a few of them, we refer the reader to [32, 33, 40]. Finally, we also refer the reader to [7, 43] for non-exact oracle inequalities (inequalities like (1.1) where min_{f∈conv(F)} R(f) is multiplied by a constant strictly larger than 1) in the context of convex aggregation. A lower bound in deviation for the convex aggregation problem follows from the arguments of [37]: there exist absolute positive constants c_0, c_1 and c_2 such that for any


sample cardinality n ≥ 1 and any cardinality of dictionary M ≥ 1 such that log M ≤ c_0 n, there exists a dictionary F of size M such that for any aggregation procedure f̄_n, there exists a random couple (X, Y) such that |Y| ≤ b and max_{f∈F} |f(X)| ≤ b a.s. and, with probability larger than c_1,
\[
R(\bar f_n) \ge \min_{f \in \mathrm{conv}(F)} R(f) + c_2 b^2 \psi_n^{(C)}(M). \tag{1.2}
\]

This means that, from a minimax point of view, one cannot do better than the rate ψ_n^{(C)}(M) for the convex aggregation problem. Therefore, any procedure achieving the rate ψ_n^{(C)}(M), for any dictionary F and couple (X, Y) such that |Y| ≤ b and max_{f∈F} |f(X)| ≤ b a.s., in an oracle inequality like (1.1) is called an optimal procedure in deviation for the convex aggregation problem.

The procedure constructed in [37] achieves the rate ψ_n^{(C)}(M) in expectation (i.e., it is a procedure satisfying (1.1) in expectation with the optimal residual term ψ_n^{(C)}(M)). An optimal procedure in deviation has been constructed in Theorem 2.8.1 in [22]. In both cases, the construction of these optimal aggregation procedures requires the aggregation of a number of functions in conv(F) that is exponential in M, and thus they cannot be used in practice. On the other hand, it would be much simpler and more natural to consider the classical procedure of empirical risk minimization (cf. [39]) over the convex hull of F to solve the convex aggregation problem:
\[
\hat f_n^{\mathrm{ERM\text{-}C}} \in \operatorname*{argmin}_{f \in \mathrm{conv}(F)} R_n(f), \qquad \text{where } R_n(f) = \frac{1}{n}\sum_{i=1}^{n} (Y_i - f(X_i))^2. \tag{1.3}
\]
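In practice, since every f ∈ conv(F) can be written as a convex combination of the dictionary elements, (1.3) is a convex quadratic program in the weights. The following sketch is one possible implementation for a finite dictionary (illustrative only; the helper names and the projected-gradient solver are our own choices, not part of the paper).

```python
import numpy as np

def project_simplex(v: np.ndarray) -> np.ndarray:
    """Euclidean projection of v onto the probability simplex (sort-based method)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - (css - 1.0) / k > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def erm_convex_hull(F_mat: np.ndarray, y: np.ndarray, n_iter: int = 2000) -> np.ndarray:
    """Projected gradient descent for (1.3): minimize the empirical squared risk
    over convex combinations of the dictionary columns of F_mat (shape n x M)."""
    n, M = F_mat.shape
    lam = np.full(M, 1.0 / M)                          # start at the uniform mixture
    step = n / (2.0 * np.linalg.norm(F_mat, 2) ** 2)   # 1/L for this quadratic objective
    for _ in range(n_iter):
        grad = 2.0 / n * F_mat.T @ (F_mat @ lam - y)
        lam = project_simplex(lam - step * grad)
    return lam

# Toy usage: M random dictionary functions evaluated on n design points.
rng = np.random.default_rng(0)
n, M = 200, 50
F_mat = rng.uniform(-1.0, 1.0, size=(n, M))
y = F_mat[:, :3].mean(axis=1) + 0.1 * rng.standard_normal(n)
lam_hat = erm_convex_hull(F_mat, y)
print(lam_hat.sum(), (lam_hat >= 0).all())             # the weights lie on the simplex
```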

In [17, 18, 25], the authors prove that, for every x > 0, with probability greater than 1 − 4 exp(−x),
\[
R(\hat f_n^{\mathrm{ERM\text{-}C}}) \le \min_{f \in \mathrm{conv}(F)} R(f) + c_0 \max\Big(\phi_n(M), \frac{x}{n}\Big), \qquad \text{where } \phi_n(M) = \min\Big(\frac{M}{n}, \sqrt{\frac{\log M}{n}}\Big).
\]

The rate φ_n(M) behaves like the optimal rate ψ_n^{(C)}(M) except for values of M such that n^{1/2} < M ≤ c(ε) n^{1/2+ε} for some ε > 0, for which there is a logarithmic gap. In this note, we remove this logarithmic loss, proving that f̂_n^{ERM-C} is indeed optimal for the convex aggregation problem. Finally, note that in [25], the authors show that the rate ψ_n^{(C)}(M) can be achieved by f̂_n^{ERM-C} for any orthogonal dictionary (i.e., such that Ef(X)g(X) = 0 for all f ≠ g ∈ F). The performance of ERM in the convex hull has been studied for an infinite dictionary in [9]. The resulting upper bounds, in the case of a finite dictionary, are of the order of M/n for every n and M. Another motivation for this work comes from what is known about ERM in the context of the three aggregation schemes mentioned above. It is well known that ERM in F is, in general, a suboptimal aggregation procedure for the model selection aggregation problem (see [19, 30] or [24]). It is also known that ERM in the linear span of F is an optimal procedure for the linear aggregation problem (see [21], Theorem 13 and Example 1, or [4]).


Therefore, studying the performance of ERM in the convex hull of F in the context of convex aggregation can be seen as an "intermediate" problem which remained open. In fact, a lot of effort has been invested in finding any procedure that would be optimal for the convex aggregation problem. For example, many boosting algorithms (see [34] or [10] for recent results on this topic) are based on finding the best convex combination in a large dictionary (for instance, dictionaries consisting of "decision stumps"), while random forest algorithms can be seen as procedures that try to find the best convex combination of decision trees. Thus, finding an optimal procedure for the problem of convex aggregation for a general dictionary is of high practical importance. In the following result, we prove that empirical risk minimization is an optimal procedure for the convex aggregation problem.

Theorem A. There exist absolute constants c_0 and c_1 such that the following holds. Let F be a finite dictionary of cardinality M and let (X, Y) be a random couple with values in X × R such that |Y| ≤ b and max_{f∈F} |f(X)| ≤ b a.s. for some b > 0. Then, for any x > 0, with probability greater than 1 − 4 exp(−x),
\[
R(\hat f_n^{\mathrm{ERM\text{-}C}}) \le \min_{f \in \mathrm{conv}(F)} R(f) + c_0 b^2 \max\Big(\psi_n^{(C)}(M), \frac{x}{n}\Big).
\]
The optimality also holds in expectation:
\[
\mathbb{E} R(\hat f_n^{\mathrm{ERM\text{-}C}}) \le \min_{f \in \mathrm{conv}(F)} R(f) + c_1 b^2 \psi_n^{(C)}(M).
\]

2. Preliminaries on isomorphic properties of functions classes

We recall the machinery developed in [5] to prove isomorphic results between the empirical and actual structures of functions classes. Let (Z, σ) be a measurable space, let Z, Z_1, ..., Z_n be n + 1 i.i.d. random variables with values in Z distributed according to P_Z and let G be a class of real-valued measurable functions defined on Z. We consider the star-shaped hull of G in zero and its localized set at some level λ > 0:
\[
V(G) = \{\alpha g : 0 \le \alpha \le 1, g \in G\} \qquad \text{and} \qquad V(G)_\lambda = \{h \in V(G) : P h \le \lambda\}.
\]
For any functions class H (in particular for H being G, V(G) or V(G)_λ for some λ), we denote
\[
\|P - P_n\|_H = \sup_{h \in H} |(P - P_n)h|, \qquad \text{where } P h = \mathbb{E} h(Z) \text{ and } P_n h = \frac{1}{n}\sum_{i=1}^{n} h(Z_i),
\]
as well as σ²(H) = sup_{h∈H} P h² and ‖H‖_∞ = sup_{h∈H} ‖h‖_{L_∞(P_Z)}. We also recall the separability condition of [29] (cf. Condition (M)) for which Talagrand's concentration inequality holds:

(M) There exists G_0 ⊂ G such that G_0 is countable and for any g ∈ G, there exists a sequence (g_k)_k in G_0 such that for any z ∈ Z, (g_k(z))_k tends to g(z) when k tends to infinity.


Theorem 2.1 ([5]). There exists an absolute constant c_0 > 0 such that the following holds. Let G be a class of real-valued measurable functions defined on Z satisfying condition (M) and such that P g² ≤ B P g, ∀g ∈ G, for some constant B > 0. Let λ* > 0 be such that
\[
\mathbb{E}\|P - P_n\|_{V(G)_{\lambda^*}} \le \tfrac{1}{8}\lambda^*. \tag{2.1}
\]
For every x > 0, with probability greater than 1 − 4 exp(−x), for every g ∈ G,
\[
|P g - P_n g| \le \tfrac{1}{2}\max(P g, \rho_n(x)), \qquad \text{where } \rho_n(x) = \max\Big(\lambda^*, \frac{c_0 (B + \|G\|_\infty) x}{n}\Big).
\]

For the reader's convenience, we recall the short proof of [5].

Proof of Theorem 2.1. Without loss of generality, we can assume that G is countable: by a limit argument, the result then holds for classes of functions satisfying condition (M). Fix λ > 0 and x > 0, and note that by Talagrand's concentration inequality (cf. [8, 20, 28, 35, 36]), with probability larger than 1 − 4 exp(−x),
\[
\|P - P_n\|_{V(G)_\lambda} \le 2\mathbb{E}\|P - P_n\|_{V(G)_\lambda} + K\sigma(V(G)_\lambda)\sqrt{\frac{x}{n}} + K\|V(G)_\lambda\|_\infty \frac{x}{n}, \tag{2.2}
\]
where K is an absolute constant. Clearly, we have ‖V(G)_λ‖_∞ ≤ ‖G‖_∞ and σ²(V(G)_λ) = sup(P(αg)² : 0 ≤ α ≤ 1, g ∈ G, P(αg) ≤ λ) ≤ Bλ. Moreover, since V(G) is star-shaped, λ > 0 → φ(λ) = E‖P − P_n‖_{V(G)_λ}/λ is nonincreasing, and since φ(λ*) ≤ 1/8 and ρ_n(x) ≥ λ*, then E‖P − P_n‖_{V(G)_{ρ_n(x)}} ≤ (1/8)ρ_n(x). Combined with (2.2), there exists an event Ω_0(x) of probability greater than 1 − 4 exp(−x) such that, on Ω_0(x),
\[
\|P - P_n\|_{V(G)_{\rho_n(x)}} \le \tfrac{1}{4}\rho_n(x) + K\sqrt{\frac{B\rho_n(x)x}{n}} + K\frac{\|G\|_\infty x}{n} \le \tfrac{1}{2}\rho_n(x)
\]
as long as c_0 ≥ 64(K² + K). Hence, on Ω_0(x), if g ∈ V(G) satisfies P g ≤ ρ_n(x), then |P g − P_n g| ≤ (1/2)ρ_n(x). Moreover, if g ∈ V(G) is such that P g > ρ_n(x), then h = ρ_n(x) g / P g ∈ V(G)_{ρ_n(x)}; hence |P h − P_n h| ≤ (1/2)ρ_n(x), and so in both cases |P g − P_n g| ≤ (1/2) max(P g, ρ_n(x)). □

Therefore, if one applies Theorem 2.1 to obtain isomorphic properties between the empirical and actual structures, one has to check the condition P g² ≤ B P g, ∀g ∈ G, called the Bernstein condition in [5], and to find a point λ* satisfying (2.1).


A point λ* such that (2.1) holds can be found thanks to the peeling argument of [6]: for any λ > 0,
\[
V(G)_\lambda \subset \bigcup_{i=0}^{\infty} \{\alpha g : 0 \le \alpha \le 2^{-i}, g \in G, P g \le 2^{i+1}\lambda\},
\]
which implies
\[
\mathbb{E}\|P - P_n\|_{V(G)_\lambda} \le \sum_{i=0}^{\infty} 2^{-i}\, \mathbb{E}\|P - P_n\|_{G_{2^{i+1}\lambda}}, \tag{2.3}
\]

where, for any μ > 0, G_μ = {g ∈ G : P g ≤ μ}. Then, if λ* > 0 is such that λ*/8 upper bounds the right-hand side of (2.3), this point also satisfies (2.1).

The Bernstein condition usually follows from some convexity argument. For instance, it is now standard to check the Bernstein condition for the excess loss functions class L_F = {L_f : f ∈ F} associated with a convex model F with respect to the squared loss function ℓ_f(x, y) = (y − f(x))², ∀x ∈ X, y ∈ R, where f*_F ∈ argmin_{f∈F} E(Y − f(X))² and L_f = ℓ_f − ℓ_{f*_F}. Indeed, if F is a convex set of functions and (X, Y) is a random couple on X × R such that |Y| ≤ b and sup_{f∈F} |f(X)| ≤ b a.s., then it follows from convexity and the definition of f*_F that, for any f ∈ F, E[(f*_F(X) − Y)(f*_F(X) − f(X))] ≤ 0 and so
\[
\mathbb{E} L_f = 2\mathbb{E}(f^*_F(X) - f(X))(Y - f^*_F(X)) + \mathbb{E}(f^*_F(X) - f(X))^2 \ge \mathbb{E}(f^*_F(X) - f(X))^2. \tag{2.4}
\]
Moreover, since |Y| ≤ b and sup_{f∈F} |f(X)| ≤ b a.s., then
\[
\mathbb{E} L_f^2 = \mathbb{E}(2Y - f^*_F(X) - f(X))^2 (f(X) - f^*_F(X))^2 \le (4b)^2 \mathbb{E}(f(X) - f^*_F(X))^2. \tag{2.5}
\]
Therefore, any f in F is such that E L_f² ≤ (4b)² E L_f.
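As a quick numerical illustration of this Bernstein condition (a toy Monte Carlo check on a single segment; the data, functions and names below are ours, not the paper's), one can estimate E L_f and E L_f² at a few points of a convex model and compare them with the (4b)² E L_f bound:

```python
import numpy as np

rng = np.random.default_rng(0)
b, n_mc = 1.0, 200_000

# Toy convex model: the segment [f0, f1] of bounded functions on [-1, 1].
f0 = lambda x: np.clip(np.sin(3 * x), -b, b)
f1 = lambda x: np.clip(0.5 * x, -b, b)

X = rng.uniform(-1.0, 1.0, n_mc)
Y = np.clip(0.3 * X + 0.2 * rng.standard_normal(n_mc), -b, b)

def pred(t):                      # f_t = (1 - t) f0 + t f1, a point of the segment
    return (1 - t) * f0(X) + t * f1(X)

# Approximate the oracle f_F^* by a grid search over the segment.
ts = np.linspace(0.0, 1.0, 1001)
t_star = ts[np.argmin([np.mean((Y - pred(t)) ** 2) for t in ts])]

for t in [0.0, 0.3, 0.7, 1.0]:
    L = (Y - pred(t)) ** 2 - (Y - pred(t_star)) ** 2     # excess loss L_f
    print(f"t={t}: E L_f^2 = {np.mean(L**2):.4f}, (4b)^2 E L_f = {16 * b**2 * np.mean(L):.4f}")
```

Up to Monte Carlo and grid-search error, the first printed quantity stays below the second, as (2.4)–(2.5) predict.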

3. Proof of Theorem A

The proof of Theorem A for the case M ≤ √n is by now very classical and can be found in [21] (cf. Theorem 13 and Example 1). Nevertheless, we reproduce this short proof here in order to provide a self-contained note. The proof for the case M > √n is trickier and relies on isomorphic properties of an exponential number of segments in conv(F) together with Maurey's empirical method (cf. [13, 31]), which was first used in the context of convex aggregation in [17] and [37]. Note that segments are models of particular interest in Learning theory because they are convex models (in particular, they satisfy the Bernstein condition) and they are of small complexity (essentially the same complexity as a model of cardinality two). In contrast to the classical entropy-based approach, which essentially consists in approximating a set by finite sets, approximating models by unions of segments may be of particular interest in Learning theory beyond the convex aggregation problem. Note that finite models have no particular geometrical structure and therefore are somehow "bad models" as far as ERM procedures are concerned. Proofs are given for the deviation result of Theorem A. The result in expectation of Theorem A follows from a direct integration argument.

3.1. The case M > √n

We apply Theorem 2.1 to excess loss functions classes indexed by segments. First, note that segments of bounded functions are functions classes satisfying condition (M). We consider a set C′ = {g_1, ..., g_N} of real-valued measurable functions defined on X such that max_{g∈C′} |g(X)| ≤ b a.s. For every i, j ∈ {1, ..., N}, we consider the segment [g_i, g_j] = {θ g_i + (1 − θ) g_j : 0 ≤ θ ≤ 1} and take g*_{ij} ∈ argmin_{g∈[g_i,g_j]} R(g), where R(·) is the squared risk. We consider the excess loss functions class
\[
\mathcal{L}^{ij} = \{L^{ij}_g : g \in [g_i, g_j]\}, \qquad \text{where } L^{ij}_g = \ell_g - \ell_{g^*_{ij}}
\]
for ℓ_g(x, y) = (y − g(x))², ∀x ∈ X, y ∈ R.

As a consequence of the convexity of segments, we have, for any g ∈ [g_i, g_j], E(L^{ij}_g)² ≤ (4b)² E L^{ij}_g (cf. (2.4) and (2.5) in Section 2). This implies that the functions class L^{ij} satisfies the Bernstein condition of Theorem 2.1. Now, it remains to find λ* > 0 such that E‖P − P_n‖_{V(L^{ij})_{λ*}} ≤ (1/8)λ*. Let μ > 0 and let ε_1, ..., ε_n be n i.i.d. Rademacher variables. Note that for any g ∈ [g_i, g_j], P L^{ij}_g ≥ P(g − g*_{ij})² = E(g(X) − g*_{ij}(X))² (cf. (2.4)). It follows from the symmetrization argument and the contraction principle (cf. [26], page 95) that if g_i ≠ g_j, then
\[
\begin{aligned}
\mathbb{E}\|P - P_n\|_{(\mathcal{L}^{ij})_\mu}
&\le 2\,\mathbb{E}\sup_{g \in [g_i,g_j]:\, P L^{ij}_g \le \mu} \Big|\frac{1}{n}\sum_{k=1}^{n} \varepsilon_k L^{ij}_g(X_k, Y_k)\Big| \\
&\le 8b\,\mathbb{E}\sup_{g \in [g_i,g_j]:\, P L^{ij}_g \le \mu} \Big|\frac{1}{n}\sum_{k=1}^{n} \varepsilon_k \big(g(X_k) - g^*_{ij}(X_k)\big)\Big| \\
&\le 8b\,\mathbb{E}\sup_{g \in [g_i,g_j]:\, P(g - g^*_{ij})^2 \le \mu} \Big|\frac{1}{n}\sum_{k=1}^{n} \varepsilon_k \big(g(X_k) - g^*_{ij}(X_k)\big)\Big| \\
&= 8b\,\mathbb{E}\sup_{g \in [g_i,g_j] - g^*_{ij}:\, P g^2 \le \mu} \Big|\frac{1}{n}\sum_{k=1}^{n} \varepsilon_k g(X_k)\Big| \\
&\le 8b\,\mathbb{E}\sup_{g \in \mathrm{span}(g_i - g_j):\, P g^2 \le \mu} \Big|\frac{1}{n}\sum_{k=1}^{n} \varepsilon_k g(X_k)\Big| \\
&= \frac{8b\sqrt{\mu}}{\sqrt{P(g_i - g_j)^2}}\, \mathbb{E}\Big|\frac{1}{n}\sum_{k=1}^{n} \varepsilon_k (g_i - g_j)(X_k)\Big| \\
&\le \frac{8b\sqrt{\mu}}{\sqrt{P(g_i - g_j)^2}} \Bigg(\mathbb{E}\Big(\frac{1}{n}\sum_{k=1}^{n} \varepsilon_k (g_i - g_j)(X_k)\Big)^2\Bigg)^{1/2} = 8b\sqrt{\frac{\mu}{n}}.
\end{aligned}
\]


Note that when g_i = g_j the result is also true. Now, we use the peeling argument of (2.3) to obtain
\[
\mathbb{E}\|P - P_n\|_{V(\mathcal{L}^{ij})_\lambda} \le \sum_{k=0}^{\infty} 2^{-k}\, \mathbb{E}\|P - P_n\|_{(\mathcal{L}^{ij})_{2^{k+1}\lambda}} \le \sum_{k=0}^{\infty} 2^{-k}\, 8b\sqrt{\frac{2^{k+1}\lambda}{n}} \le c_0 b\sqrt{\frac{\lambda}{n}}.
\]
Therefore, for λ* = (8 c_0 b)²/n, we have E‖P − P_n‖_{V(L^{ij})_{λ*}} ≤ (1/8)λ*. Now, we can apply Theorem 2.1 to the family of excess loss functions classes (L^{ij})_{1≤i,j≤N} together with a union bound to obtain the following result.

Proposition 3.1. There exists an absolute constant c_0 > 0 such that the following holds. Let C′ = {g_1, ..., g_N} be a set of measurable real-valued functions defined on X. Let (X, Y) be a random couple with values in X × R such that |Y| ≤ b and max_{g∈C′} |g(X)| ≤ b a.s. For any x > 0, with probability greater than 1 − 4 exp(−x), for any i, j ∈ {1, ..., N} and any g ∈ [g_i, g_j],
\[
|P L^{ij}_g - P_n L^{ij}_g| \le \tfrac{1}{2}\max\big(P L^{ij}_g, \gamma(x)\big), \qquad \text{where } \gamma(x) = \frac{c_0 b^2 (x + 2\log N)}{n}.
\]

Now, we want to apply the isomorphic result of Proposition 3.1 to a wisely chosen subset C′ of C = conv(F). For that, we consider the integer
\[
m = \Big\lceil \sqrt{\frac{n}{\log(eM/\sqrt{n})}}\, \Big\rceil
\]
and the set C′ is defined by

Now, we want to apply the isomorphic result of Proposition 3.1 to a wisely chosen subset C ′ of C = conv(F ). For that, we consider the integer  r n √ m= log(eM/ n) and the set C ′ is defined by ′

\[
C' = \Big\{ \frac{1}{m}\sum_{i=1}^{m} h_i : h_1, \ldots, h_m \in F \Big\}.
\]
The set C′ is an approximating set of the convex hull conv(F). We will, for instance, use the following approximation property:
\[
\min_{f \in C'} R(f) \le \min_{f \in C} R(f) + \frac{4b^2}{m}. \tag{3.1}
\]

Indeed, to obtain such a result, we use Maurey's empirical method. Let f*_C ∈ argmin_{f∈C} R(f) and write f*_C = ∑_{j=1}^{M} λ_j f_j, where λ_j ≥ 0, ∀j = 1, ..., M, and ∑_{j=1}^{M} λ_j = 1. Consider a random variable Θ : Ω → F such that P[Θ = f_j] = λ_j, ∀j = 1, ..., M, and let Θ_1, ..., Θ_m be m i.i.d. random variables distributed according to Θ and independent of (X, Y). Denote by E_Θ the expectation with respect to Θ_1, ..., Θ_m. Since E_Θ Θ_j = f*_C for any j = 1, ..., m, we have
\[
\begin{aligned}
\min_{f \in C'} R(f) \le \mathbb{E}_\Theta R\Big(\frac{1}{m}\sum_{j=1}^{m}\Theta_j\Big)
&= \mathbb{E}_\Theta \mathbb{E}\Big(\frac{1}{m}\sum_{j=1}^{m}\Theta_j(X) - Y\Big)^2 \\
&= \mathbb{E}\Big(\frac{1}{m^2}\sum_{j,k=1}^{m} \mathbb{E}_\Theta (Y - \Theta_j(X))(Y - \Theta_k(X))\Big)
= R(f^*_C) + \frac{\mathbb{E} V_\Theta(Y - \Theta(X))}{m},
\end{aligned}
\]
where V_Θ stands for the variance with respect to Θ. Equation (3.1) follows since |Y| ≤ b and max_{f∈F} |f(X)| ≤ b a.s.

Denote by N = |C′| the cardinality of C′ and by g_1, ..., g_N the functions in C′. For simplicity, assume that R(g_1) = min_{g∈C′} R(g). Thanks to [13] for the first inequality and [16], page 218, or [28], Proposition 2, for the second inequality, we know that
\[
|C'| = N \le \binom{M + m - 1}{m} \le \Big(\frac{2eM}{m}\Big)^m. \tag{3.2}
\]
Let x > 0. Consider the event Ω(x) ⊂ Ω such that the following isomorphic property holds for all the segments [g_1, g_j], j = 1, ..., N:
\[
|P_n L^{1j}_g - P L^{1j}_g| \le \tfrac{1}{2}\max\big(P L^{1j}_g, \gamma(x)\big) \qquad \forall g \in [g_1, g_j], \tag{3.3}
\]

where we recall that L^{1j}_g = ℓ_g − ℓ_{g*_{1j}} is the excess loss function of g ∈ [g_1, g_j] for the model [g_1, g_j] and
\[
\gamma(x) = \frac{c_0 b^2 (x + 2\log N)}{n}.
\]
Thanks to Proposition 3.1, we know that P[Ω(x)] ≥ 1 − 4 exp(−x). We are going to work on the event Ω(x), but for the moment we use Maurey's empirical method a second time. Fix X_1, ..., X_n and write f̂_n^{ERM-C} = ∑_{j=1}^{M} β_j f_j. Consider a random variable Θ : Ω′ → F defined on another probability space (Ω′, A′, P′) such that P′[Θ = f_j] = β_j, ∀j = 1, ..., M, and let Θ_1, ..., Θ_m be m i.i.d. random variables having the same probability distribution as Θ. Once again, denote by E′_Θ the expectation with respect to Θ_1, ..., Θ_m and by V′_Θ the variance with respect to Θ. Since E′_Θ Θ_j = f̂_n^{ERM-C} for any j = 1, ..., m, it follows from the same method used to obtain (3.1) that
\[
\mathbb{E}'_\Theta R\Big(\frac{1}{m}\sum_{j=1}^{m}\Theta_j\Big) = R(\hat f_n^{\mathrm{ERM\text{-}C}}) + \frac{\mathbb{E} V'_\Theta(Y - \Theta(X))}{m} \tag{3.4}
\]
and the same holds for the empirical risk:
\[
\mathbb{E}'_\Theta R_n\Big(\frac{1}{m}\sum_{j=1}^{m}\Theta_j\Big) = R_n(\hat f_n^{\mathrm{ERM\text{-}C}}) + \frac{1}{m}\Big(\frac{1}{n}\sum_{i=1}^{n} V'_\Theta(Y_i - \Theta(X_i))\Big). \tag{3.5}
\]
Consider the following notation:
\[
g_\Theta = \frac{1}{m}\sum_{j=1}^{m}\Theta_j \qquad \text{and} \qquad i_\Theta \in \{1, \ldots, N\} \text{ such that } g_{i_\Theta} = g_\Theta.
\]
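As a side illustration of Maurey's empirical method used twice above, the following toy simulation (purely illustrative; the data, dictionary and names are ours, not the paper's) samples m i.i.d. dictionary elements according to the weights of a convex combination and checks that the best m-term average has risk within roughly 4b²/m of the full mixture, as in (3.1).

```python
import numpy as np

rng = np.random.default_rng(1)
n, M, b, m = 5_000, 200, 1.0, 20

# Toy bounded dictionary evaluated on n design points, and bounded responses.
F_mat = np.clip(rng.standard_normal((n, M)), -b, b)
y = np.clip(F_mat[:, :5].mean(axis=1) + 0.2 * rng.standard_normal(n), -b, b)

# A convex combination playing the role of f_C^* (arbitrary weights here).
lam = rng.dirichlet(np.ones(M))
risk_full = np.mean((y - F_mat @ lam) ** 2)

# Maurey's empirical method: draw Theta_1, ..., Theta_m i.i.d. with P[Theta = f_j] = lam_j
# and replace the mixture by the m-term average (1/m) * sum_j Theta_j.
best = np.inf
for _ in range(200):
    idx = rng.choice(M, size=m, p=lam)
    g = F_mat[:, idx].mean(axis=1)
    best = min(best, np.mean((y - g) ** 2))

# The best m-term average should essentially not exceed risk_full + 4 b^2 / m, cf. (3.1).
print(risk_full, best, risk_full + 4 * b**2 / m)
```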


Note that g_Θ is a random point in C′ (as a measurable function from Ω′ to C′) and that, on the event Ω(x), the following isomorphic property on the segment [g_1, g_Θ] holds:
\[
|P_n L^{1 i_\Theta}_g - P L^{1 i_\Theta}_g| \le \tfrac{1}{2}\max\big(P L^{1 i_\Theta}_g, \gamma(x)\big) \qquad \forall g \in [g_1, g_{i_\Theta}]. \tag{3.6}
\]
First note that, for every Θ_1, ..., Θ_m, we have
\[
R(\hat f_n^{\mathrm{ERM\text{-}C}}) = R(g^*_{1 i_\Theta}) + R(g_\Theta) - R(g^*_{1 i_\Theta}) + R(\hat f_n^{\mathrm{ERM\text{-}C}}) - R(g_\Theta). \tag{3.7}
\]

By definition of g*_{1 i_Θ} ∈ argmin_{g∈[g_{i_Θ}, g_1]} R(g), we have R(g*_{1 i_Θ}) ≤ R(g_1) = min_{g∈C′} R(g) and, according to (3.1), we have min_{f∈C′} R(f) ≤ min_{f∈C} R(f) + 4b²/m. Therefore, it follows from (3.7) that
\[
R(\hat f_n^{\mathrm{ERM\text{-}C}}) \le \min_{f \in C} R(f) + \frac{4b^2}{m} + P L^{1 i_\Theta}_{g_\Theta} + R(\hat f_n^{\mathrm{ERM\text{-}C}}) - R(g_\Theta). \tag{3.8}
\]

On the event Ω(x), we use (3.6) to obtain, for every Θ_1, ..., Θ_m,
\[
R(\hat f_n^{\mathrm{ERM\text{-}C}}) \le \min_{f \in C} R(f) + \frac{4b^2}{m} + 2 P_n L^{1 i_\Theta}_{g_\Theta} + \gamma(x) + R(\hat f_n^{\mathrm{ERM\text{-}C}}) - R(g_\Theta).
\]
Moreover, by definition of f̂_n^{ERM-C}, we have
\[
P_n L^{1 i_\Theta}_{g_\Theta} = R_n(g_\Theta) - R_n(g^*_{1 i_\Theta}) \le R_n(g_\Theta) - R_n(\hat f_n^{\mathrm{ERM\text{-}C}}).
\]

Therefore, on the event Ω(x), we have, for every Θ_1, ..., Θ_m,
\[
R(\hat f_n^{\mathrm{ERM\text{-}C}}) \le \min_{f \in C} R(f) + \frac{4b^2}{m} + \gamma(x) + 2\big(R_n(g_\Theta) - R_n(\hat f_n^{\mathrm{ERM\text{-}C}})\big) + R(\hat f_n^{\mathrm{ERM\text{-}C}}) - R(g_\Theta).
\]
In particular, one can take the expectation with respect to Θ_1, ..., Θ_m (defined on Ω′) in the last inequality. We have, on Ω(x),
\[
R(\hat f_n^{\mathrm{ERM\text{-}C}}) \le \min_{f \in C} R(f) + \frac{4b^2}{m} + \gamma(x) + 2\mathbb{E}'_\Theta\big(R_n(g_\Theta) - R_n(\hat f_n^{\mathrm{ERM\text{-}C}})\big) + \mathbb{E}'_\Theta\big(R(\hat f_n^{\mathrm{ERM\text{-}C}}) - R(g_\Theta)\big).
\]
Thanks to (3.4), we have E′_Θ(R(f̂_n^{ERM-C}) − R(g_Θ)) ≤ 0, and it follows from (3.5) that E′_Θ(R_n(g_Θ) − R_n(f̂_n^{ERM-C})) ≤ (2b)²/m. Therefore, on the event Ω(x), we have
\[
R(\hat f_n^{\mathrm{ERM\text{-}C}}) \le \min_{f \in C} R(f) + \frac{8b^2}{m} + \gamma(x) \le \min_{f \in C} R(f) + c_1 b^2 \max\Big(\psi_n^{(C)}(M), \frac{x}{n}\Big),
\]
where the last inequality follows from (3.2) and the definition of m.
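To make this last step explicit, here is one way the constant chasing can be carried out, up to absolute constants (this sketch is ours and is only meant as a reading aid). In the non-trivial regime log(eM/√n) ≤ n, the integer m = ⌈√(n/log(eM/√n))⌉ satisfies
\[
\sqrt{\frac{n}{\log(eM/\sqrt{n})}} \le m \le 2\sqrt{\frac{n}{\log(eM/\sqrt{n})}},
\]
so that 1/m ≤ √(log(eM/√n)/n) = ψ_n^{(C)}(M), while (3.2) gives log N ≤ m log(2eM/m) and hence
\[
\frac{\log N}{n} \le \frac{m}{n}\log\Big(\frac{2eM}{m}\Big) \le \frac{2}{\sqrt{n\log(eM/\sqrt{n})}}\,\log\Big(2e\frac{M}{\sqrt{n}}\sqrt{\log(eM/\sqrt{n})}\Big) \le c\sqrt{\frac{\log(eM/\sqrt{n})}{n}} = c\,\psi_n^{(C)}(M)
\]
for an absolute constant c, because log(eM/√n) ≥ 1 when M > √n. Plugging these two bounds into 8b²/m + γ(x) = 8b²/m + c_0 b²(x + 2 log N)/n yields the stated c_1 b² max(ψ_n^{(C)}(M), x/n).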

3.2. The case M ≤ √n

We use the strategy developed in [5] together with the one of [21] (cf. Example 1) to prove Theorem A in the case M ≤ √n. Define C = conv(F) and L_C = {L_f : f ∈ C} the excess loss class associated with C, where L_f = ℓ_f − ℓ_{f*_C}, ∀f ∈ C, and f*_C ∈ argmin_{f∈C} R(f). Let x > 0. Assume that we can find some ρ_n(x) > 0 such that, with probability greater than 1 − 4 exp(−x), for any f ∈ C,
\[
|P_n L_f - P L_f| \le \tfrac{1}{2}\max\big(P L_f, \rho_n(x)\big). \tag{3.9}
\]
Then, the ERM over conv(F) would satisfy, with probability greater than 1 − 4 exp(−x),
\[
R(\hat f_n^{\mathrm{ERM\text{-}C}}) - \min_{f \in \mathrm{conv}(F)} R(f) = P L_{\hat f_n^{\mathrm{ERM\text{-}C}}} \le 2 P_n L_{\hat f_n^{\mathrm{ERM\text{-}C}}} + \rho_n(x) \le \rho_n(x).
\]
This means that if we can prove some isomorphic properties between the empirical and the actual structures of the functions class L_C like in (3.9), then we can derive oracle inequalities for f̂_n^{ERM-C}. This is the strategy used in [5] that we follow here. According to Theorem 2.1 in Section 2, a function ρ_n(x) satisfying (3.9) can be constructed if we prove that L_C satisfies some Bernstein condition and if we find some fixed point λ* > 0 such that E‖P − P_n‖_{V(L_C)_{λ*}} ≤ (1/8)λ*. The Bernstein condition follows from the convexity of conv(F) and the strategy used in Section 2: for any f ∈ C, P L_f² ≤ (4b)² P L_f. We use the peeling argument of Section 2 together with the following observations, due to [21] (cf. Example 1), to find a fixed point λ*. Let S be the linear subspace of L_2(P_X) spanned by the dictionary F and consider an orthonormal basis (e_1, ..., e_{M′}) of S in L_2(P_X) (where M′ = dim(S) ≤ M). For any μ > 0, it follows from the symmetrization argument and the contraction principle (cf. Chapter 4 in [26]) that
\[
\begin{aligned}
\mathbb{E}\|P - P_n\|_{(\mathcal{L}_C)_\mu}
&\le 8b\,\mathbb{E}\sup_{f \in S:\, P f^2 \le \mu} \Big|\frac{1}{n}\sum_{i=1}^{n} \varepsilon_i f(X_i)\Big|
\le 8b\,\mathbb{E}\sup_{\beta \in \mathbb{R}^{M'}:\, \|\beta\|_2 \le \sqrt{\mu}} \Big|\frac{1}{n}\sum_{i=1}^{n} \varepsilon_i \sum_{j=1}^{M'} \beta_j e_j(X_i)\Big| \\
&\le 8b\sqrt{\mu}\,\mathbb{E}\Bigg(\sum_{j=1}^{M'}\Big(\frac{1}{n}\sum_{i=1}^{n} \varepsilon_i e_j(X_i)\Big)^2\Bigg)^{1/2}
\le 8b\sqrt{\frac{M'\mu}{n}}.
\end{aligned}
\]
We use the peeling argument of (2.3) to prove that, for λ* = c_0 b² M/n with c_0 an absolute constant large enough, we indeed have E‖P − P_n‖_{V(L_C)_{λ*}} ≤ (1/8)λ*. Now, it follows from Theorem 2.1 that for any x > 0, with probability greater than 1 − 4 exp(−x),
\[
R(\hat f_n^{\mathrm{ERM\text{-}C}}) \le \min_{f \in C} R(f) + c_1 b^2 \max\Big(\frac{M}{n}, \frac{x}{n}\Big).
\]
This concludes the proof for the case M ≤ √n.

Remark 3.2. We did not use the condition M ≤ √n in the last proof. In fact, the result holds in the following more general framework. Let Λ be any closed convex subset of R^M and, for any dictionary F = {f_1, ..., f_M}, denote by Λ(F) the set of all functions ∑_{j=1}^{M} λ_j f_j with (λ_1, ..., λ_M)^⊤ ∈ Λ. Let (X, Y) be a random couple with values in X × R such that |Y| ≤ b and max_{f∈F} |f(X)| ≤ b a.s. Consider the ERM procedure
\[
\hat f_n \in \operatorname*{argmin}_{f \in \Lambda(F)} R_n(f).
\]
Then, it follows from Theorem 2.1 and the argument used previously in this section that, for any x > 0, with probability greater than 1 − 4 exp(−x),
\[
R(\hat f_n) \le \min_{f \in \Lambda(F)} R(f) + c_1 b^2 \max\Big(\frac{M}{n}, \frac{x}{n}\Big).
\]
The same result can be found in [4] under very weak moment assumptions.
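As a practical aside (ours, not the paper's): for a norm-ball or polyhedral Λ, the ERM procedure of Remark 3.2 is a convex quadratic program in the coefficients and can be solved with a generic convex solver. A hedged sketch for the particular choice of Λ equal to an ℓ1-ball, assuming the cvxpy library is available:

```python
import cvxpy as cp
import numpy as np

def erm_over_Lambda(F_mat: np.ndarray, y: np.ndarray, radius: float = 1.0) -> np.ndarray:
    """ERM over Lambda(F) for Lambda = {lambda : ||lambda||_1 <= radius} (one possible Lambda)."""
    n, M = F_mat.shape
    lam = cp.Variable(M)
    objective = cp.Minimize(cp.sum_squares(F_mat @ lam - y) / n)  # empirical squared risk R_n
    constraints = [cp.norm(lam, 1) <= radius]                     # coefficients constrained to Lambda
    cp.Problem(objective, constraints).solve()
    return lam.value

# Toy usage on synthetic data.
rng = np.random.default_rng(0)
F_mat = rng.standard_normal((100, 10))
y = F_mat[:, 0] - 0.5 * F_mat[:, 1] + 0.1 * rng.standard_normal(100)
print(np.round(erm_over_Lambda(F_mat, y), 3))
```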

Acknowledgements

We would like to thank Alexandre Tsybakov for helping us with the presentation of this result. Supported by French Agence Nationale de la Recherche (ANR) Grant "Prognostic" ANR-09-JCJC-0101-01.

References

[1] Audibert, J.Y. (2004). Aggregated estimators and empirical complexity for least square regression. Ann. Inst. Henri Poincaré Probab. Stat. 40 685–736. MR2096215
[2] Audibert, J.Y. (2007). Progressive mixture rules are deviation suboptimal. In Adv. Neural Inf. Process. Syst. 20 41–48. Cambridge: MIT Press.
[3] Audibert, J.Y. (2009). Fast learning rates in statistical inference through aggregation. Ann. Statist. 37 1591–1646. MR2533466
[4] Audibert, J.Y. and Catoni, O. (2011). Robust linear least squares regression. Ann. Statist. 39 2766–2794. MR2906886
[5] Bartlett, P.L. and Mendelson, S. (2006). Empirical minimization. Probab. Theory Related Fields 135 311–334. MR2240689
[6] Bartlett, P.L., Mendelson, S. and Neeman, J. (2012). ℓ1-regularized linear regression: Persistence and oracle inequalities. Probab. Theory Related Fields 154 193–224. MR2981422
[7] Birgé, L. (2006). Model selection via testing: An alternative to (penalized) maximum likelihood estimators. Ann. Inst. Henri Poincaré Probab. Stat. 42 273–325. MR2219712
[8] Bousquet, O. (2002). A Bennett concentration inequality and its application to suprema of empirical processes. C. R. Math. Acad. Sci. Paris 334 495–500. MR1890640
[9] Bousquet, O., Koltchinskii, V. and Panchenko, D. (2002). Some local measures of complexity of convex hulls and generalization bounds. In Computational Learning Theory (Sydney, 2002). Lecture Notes in Computer Science 2375 59–73. Berlin: Springer. MR2040405
[10] Bühlmann, P. and Hothorn, T. (2007). Boosting algorithms: Regularization, prediction and model fitting. Statist. Sci. 22 477–505. MR2420454
[11] Bunea, F. and Nobel, A. (2008). Sequential procedures for aggregating arbitrary estimators of a conditional mean. IEEE Trans. Inform. Theory 54 1725–1735. MR2450298
[12] Bunea, F., Tsybakov, A.B. and Wegkamp, M.H. (2007). Aggregation for Gaussian regression. Ann. Statist. 35 1674–1697. MR2351101
[13] Carl, B. (1985). Inequalities of Bernstein–Jackson-type and the degree of compactness of operators in Banach spaces. Ann. Inst. Fourier (Grenoble) 35 79–118. MR0810669
[14] Catoni, O. (2004). Statistical Learning Theory and Stochastic Optimization. Lecture Notes in Math. 1851. Berlin: Springer. Lecture notes from the 31st Summer School on Probability Theory held in Saint-Flour, July 8–25, 2001. MR2163920
[15] Dalalyan, A.S. and Tsybakov, A.B. (2008). Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Machine Learning 72 39–61.
[16] Devroye, L., Györfi, L. and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Applications of Mathematics (New York) 31. New York: Springer. MR1383093
[17] Emery, M., Nemirovski, A. and Voiculescu, D. (2000). Lectures on Probability Theory and Statistics (P. Bernard, ed.). Lecture Notes in Math. 1738. Berlin: Springer. Lectures from the 28th Summer School on Probability Theory held in Saint-Flour, August 17–September 3, 1998. MR1775638
[18] Juditsky, A. and Nemirovski, A. (2000). Functional aggregation for nonparametric regression. Ann. Statist. 28 681–712. MR1792783
[19] Juditsky, A., Rigollet, P. and Tsybakov, A.B. (2008). Learning by mirror averaging. Ann. Statist. 36 2183–2206. MR2458184
[20] Klein, T. and Rio, E. (2005). Concentration around the mean for maxima of empirical processes. Ann. Probab. 33 1060–1077. MR2135312
[21] Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Statist. 34 2593–2656. MR2329442
[22] Lecué, G. (2011). Interplay between concentration, complexity and geometry in learning theory with applications to high dimensional data analysis. Habilitation à diriger des recherches.
[23] Lecué, G. and Mendelson, S. (2009). Aggregation via empirical risk minimization. Probab. Theory Related Fields 145 591–613. MR2529440
[24] Lecué, G. and Mendelson, S. (2010). Sharper lower bounds on the performance of the empirical risk minimization algorithm. Bernoulli 16 605–613. MR2730641
[25] Lecué, G. and Mendelson, S. (2013). On the optimality of the empirical risk minimization procedure for the Convex aggregation problem. Ann. Inst. Henri Poincaré Probab. Stat. 49 288–306. MR3060158
[26] Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and Processes. Ergebnisse der Mathematik und Ihrer Grenzgebiete (3) [Results in Mathematics and Related Areas (3)] 23. Berlin: Springer. MR1102015
[27] Lounici, K. (2007). Generalized mirror averaging and D-convex aggregation. Math. Methods Statist. 16 246–259. MR2356820
[28] Massart, P. (2007). Concentration Inequalities and Model Selection. Lecture Notes in Math. 1896. Berlin: Springer. Lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, July 6–23, 2003, with a foreword by Jean Picard. MR2319879
[29] Massart, P. and Nédélec, É. (2006). Risk bounds for statistical learning. Ann. Statist. 34 2326–2366. MR2291502
[30] Mendelson, S. (2008). Lower bounds for the empirical minimization algorithm. IEEE Trans. Inform. Theory 54 3797–3803. MR2451042
[31] Pisier, G. (1981). Remarques sur un résultat non publié de B. Maurey. In Seminar on Functional Analysis, 1980–1981, Exp. No. V, 13. Palaiseau: École Polytech. MR0659306
[32] Rigollet, P. (2012). Kullback–Leibler aggregation and misspecified generalized linear models. Ann. Statist. 40 639–665. MR2933661
[33] Rigollet, P. and Tsybakov, A. (2011). Exponential screening and optimal rates of sparse estimation. Ann. Statist. 39 731–771. MR2816337
[34] Schapire, R.E., Freund, Y., Bartlett, P. and Lee, W.S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. Ann. Statist. 26 1651–1686. MR1673273
[35] Talagrand, M. (1995). Concentration of measure and isoperimetric inequalities in product spaces. Inst. Hautes Études Sci. Publ. Math. 81 73–205. MR1361756
[36] Talagrand, M. (1996). New concentration inequalities in product spaces. Invent. Math. 126 505–563. MR1419006
[37] Tsybakov, A.B. (2003). Optimal rate of aggregation. In Computational Learning Theory and Kernel Machines (COLT-2003). Lecture Notes in Artificial Intelligence 2777 303–313. Heidelberg: Springer.
[38] Tsybakov, A.B. (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32 135–166. MR2051002
[39] Vapnik, V.N. (1998). Statistical Learning Theory. Adaptive and Learning Systems for Signal Processing, Communications, and Control. New York: Wiley. MR1641250
[40] Wang, Z., Paterlini, S., Gao, F. and Yang, Y. (2012). Adaptive minimax estimation over sparse ℓq-hulls. Technical report. Available at arXiv:1108.1961.
[41] Yang, Y. (2000). Combining different procedures for adaptive regression. J. Multivariate Anal. 74 135–161. MR1790617
[42] Yang, Y. (2000). Mixing strategies for density estimation. Ann. Statist. 28 75–87. MR1762904
[43] Yang, Y. (2004). Aggregating regression procedures to improve performance. Bernoulli 10 25–47. MR2044592

Received October 2011 and revised January 2012
