No-Regret Learnability for Piecewise Linear Losses

Arthur Flajolet (FLAJOLET@MIT.EDU)
Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA 02139

Patrick Jaillet (JAILLET@MIT.EDU)
Department of Electrical Engineering and Computer Science, Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA 02139

arXiv:1411.5649v3 [cs.LG] 8 Jan 2015

Abstract
In the convex optimization approach to online regret minimization, many methods have been developed to guarantee an O(√T) regret bound for subdifferentiable convex loss functions with bounded subgradients, by means of a reduction to bounded linear loss functions. This suggests that the latter tend to be the hardest loss functions to learn against. We investigate this question in a systematic fashion as a function of the decision set and the environment's set of moves. On the one hand, we exhibit a localization property for linear losses leading to o(√T) learning rates and provide examples where this property holds. On the other hand, we establish Ω(√T) lower bounds on the minimum achievable regret for a class of piecewise linear loss functions that subsumes the class of bounded linear loss functions, and for polyhedral decision sets. These results hold in a completely adversarial setting. In contrast, we show that the minimum achievable regret can be significantly smaller when the opponent is "greedy".
Keywords: Online learning, Regret minimization, Lower bounds.

1. Introduction

Online learning has emerged as a general framework to tackle problems where repeated decisions need to be made in an unknown, possibly adversarial, environment. The underlying premise is that the decision maker can assess the quality of past actions, learn from his mistakes, and improve as time goes by. Many approaches to online learning have been proposed that predominantly differ in the assumptions made on the environment, the actual learning criterion used to measure improvements, and the mathematical tools involved in the design of online algorithms. These approaches are rooted in a diverse array of fields such as statistical learning, information theory, and game theory. Online convex optimization is one such approach, bringing convex optimization methods to bear on online learning problems. Perhaps it is best described as a "robust" approach to online learning, as no distributional assumptions are made on the input data, making the framework quite general and the results amenable to other approaches. In that respect, it significantly deviates from statistical learning theory, where inputs are usually assumed to be independent and identically distributed. For this reason, it has found many applications in diverse areas such as online classification, online regression, and online resource allocation, to mention a few.

A full-information online convex optimization instance is defined as a zero-sum game between a learner (the player) and the environment (the opponent). There are T time periods. At each round, the learner has to play f_t in a convex set F. Subsequent to the choice of f_t, the environment discloses a vector z_t ∈ Z, and the loss incurred by the learner is ℓ(z_t, f_t) for a given loss function ℓ that is convex in its second argument.


The convexity property is a minimal assumption, as convex optimization is at the core of the framework. In fact, many non-convex online problems are amenable to this setting by introducing a convex surrogate loss function that upper bounds the original loss and by extending F to conv(F). This convexification technique is used for the online classification algorithm known as Winnow. In the online convex optimization framework, both players are aware of all the parameters of the game, namely F, Z and ℓ, prior to starting the game. Additionally, at the end of each period, the environment's move is revealed to the learner. In this framework, the performance of the learner is measured in terms of a quantity called "regret". It is defined as the gap between the accumulated losses actually incurred by the learner and the best performance he could have achieved in hindsight with a non-adaptive strategy. Hence, the goal for the decision maker is to pick each move f_t so as to keep the regret, expressed as:

$$r_T((z_t)_{t=1,\dots,T}, (f_t)_{t=1,\dots,T}) = \sum_{t=1}^{T} \ell(z_t, f_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(z_t, f),$$

as small as possible for any choice of (z_t)_{t=1,...,T} ∈ Z^T. Here r_T((z_t)_{t=1,...,T}, (f_t)_{t=1,...,T}) refers to the regret experienced by the learner for the sequences of decisions (z_t)_{t=1,...,T} and (f_t)_{t=1,...,T} made by the opponent and the learner, respectively. To achieve this, the learner will rely on past observations, captured by the sequence of moves z_1, ..., z_{t-1} played by the opponent so far, in order to pick f_t. As a consequence, for a strategy to perform well, it should generally depend on the decisions made by the opponent up to this point in time. Hence, we should write f_t(z_1, ..., z_{t-1}), but this notation is rather cumbersome and we omit any reference to the past history for clarity of exposition. In this new field, one of the primary research focuses is to design algorithms, i.e. strategies to select (f_t)_{t=1,...,T} based on the available information, with a low upper bound guarantee on the regret r_T even for the most adversarial opponents. To simplify notation, we omit these sequences and simply write r_T when they are clear from the context. Particular emphasis is placed on the dependence with respect to T, the number of rounds, for the following reason. Denote by u(T) an upper bound on r_T that holds for any sequence z_t ∈ Z. Equipped with an algorithm guaranteeing an upper bound such that u(T)/T → 0, the player is learning on average in the long run, and the property carries over to the underlying learning problem. The smaller the growth rate of u, the faster we learn. If ℓ is both bounded and convex in its second argument, Zinkevich (2003) provides an algorithm with a u(T) = O(√T) guarantee by means of a reduction to linear loss functions. This result can be greatly improved to u(T) = O(log(T)) when the loss function is curved, for instance if ℓ is α-strongly convex for α > 0 in its second argument, see Hazan et al. (2007a), a striking parallel to the faster rates of convergence obtained in convex optimization for strongly convex functions. Moreover, there exist examples for which these results are tight, i.e. any algorithm incurs at least Ω(√T) regret on some instances, see, for example, Cesa-Bianchi and Lugosi (2006) and Abernethy et al. (2008). For these reasons, curvature seems to play a prominent role in the learnability of the problem, and linear functions (e.g. the pure linear loss ℓ(z, f) = z · f) are in some sense the "hardest" to play against. It is then tempting to conclude that linear pieces of the loss function doom us to an Ω(√T) regret bound even for the most tailored algorithm. We argue that the minimum achievable learning rate is determined by a more involved interplay between F, Z, ℓ and the opponent's behavior, so that this assertion requires further examination. Consider, as a very basic example, F of cardinality one. In this situation, the game is trivial and the regret is zero irrespective of Z (we define a broader notion of trivial games in a later section).

The purpose of this paper is threefold:

1. We characterize the minimum achievable regret for a class of piecewise linear games. Specifically, we establish that, when ℓ is a piecewise linear function of a particular type that subsumes the linear loss and when F is a polyhedron, then either there exists a trivial algorithm achieving r_T = 0 or r_T = Ω(√T) however good the algorithm used to select f_t may be. We also prove that this result remains valid for more general convex compact sets F when 0 is in the interior of the convex hull of Z and the loss is purely linear;

2. We identify a localization property for linear loss functions that leads to o(√T) upper bounds on the achievable regret and exhibit situations where this property is satisfied;

3. We show that, for a particular class of linear loss functions, if the opponent is known to be greedy then there exists an algorithm leveraging this additional information that achieves O(log(T)) regret upper bounds. A greedy opponent is one maximizing the immediate gain at every round.

2. Problem statement and literature review

The game is fully characterized by the triple (ℓ, Z, F). The regret of the game (ℓ, Z, F) over T time periods is defined as the minimax regret that can be attained by a player against a completely adversarial opponent in T periods:

$$R_T(\ell, \mathcal{Z}, \mathcal{F}) = \inf_{f_1 \in \mathcal{F}} \sup_{z_1 \in \mathcal{Z}} \cdots \inf_{f_{T-1} \in \mathcal{F}} \sup_{z_{T-1} \in \mathcal{Z}} \inf_{f_T \in \mathcal{F}} \sup_{z_T \in \mathcal{Z}} \Big[ \sum_{t=1}^{T} \ell(z_t, f_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(z_t, f) \Big], \tag{1}$$

which we refer to as the value of the game (ℓ, Z, F). R_T(ℓ, Z, F) differs from r_T in that this quantity is not tied to any particular learner's strategy. Instead, it corresponds to the best regret achievable in the present context. This precisely means that the inequality R_T(ℓ, Z, F) ≤ r_T((z_t)_{t=1,...,T}, (f_t)_{t=1,...,T}) holds for any strategy as long as the opponent is completely adversarial, i.e. for any 1 ≤ τ ≤ T, z_τ maximizes:

$$\sup_{z_{\tau+1} \in \mathcal{Z}} \cdots \sup_{z_T \in \mathcal{Z}} \Big[ \sum_{t=\tau}^{T} \ell(z_t, f_t(z_1, \dots, z_{t-1})) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(z_t, f) \Big].$$

On the other hand, if the opponent is greedy, this inequality will still hold, but for a different loss function: the greedy version of ℓ; we discuss this point in Section 8. If the game at hand is obvious from the context, we omit any reference to the game and simply write R_T. Throughout the paper, we will be led to make assumptions on the nature of the game in order to be able to characterize its value, R_T, for any number of periods T. These assumptions differ in the parameters of the game they relate to. However, for each type of assumption, there are two versions: one that will stand throughout the paper and a strengthened version that we occasionally use in addition to the first one. The first set of assumptions relates to F and Z, the sets of moves that can be selected by the player and the opponent, respectively.


Assumption 1 Z is a non-empty subset of R^d and F is a non-empty convex compact subset of R^n.

Assumption 2 Z is a compact subset of R^d.

The second set of assumptions relates to the regularity of the loss function ℓ.

Assumption 3 ℓ(z, ·) is convex and continuous for any choice of z ∈ Z, and ℓ is bounded on Z × F.

Assumption 4 ℓ is jointly continuous.

Assumptions 1 and 3 stand throughout the paper. The convexity assumptions on ℓ and F are at the core of the online convex optimization framework, allowing convex optimization methods to be brought to bear on the online learning problem. A bounded loss function is also a standard assumption in the literature, in order to avoid giving too much power to either party. To illustrate this last point, Abernethy et al. (2008) show that R_T ≤ −αT with α > 0 for the quadratic loss function and an unbounded set F, while R_T = +∞ when F is a unit ball in R^d, Z = R^d, and the loss is purely linear. The purpose of this paper is to study asymptotic lower and upper bounds on the learning rate, i.e. R_T as a function of T while ignoring constant factors, for games with losses that are piecewise linear in their second argument. We are particularly interested in the linear setting, referred to as linear games, where ℓ(z, f) = z · f, denoted by ℓ = ·. The latter is considered intrinsically difficult from a pure regret standpoint since, under mild assumptions on ℓ, upper bounds on R_T can be derived by means of a reduction to the linear case, see Zinkevich (2003). We make explicit reference to the specific set of assumptions that we rely on to prove each result.

Literature Review Asymptotically matching lower and upper bounds on R_T can be found in the literature for a variety of loss functions, although the discussion tends to be restrictive as far as the decision sets F and Z are concerned. The value of the game is shown to be Θ(log T) for three standard examples of "curved" loss functions. The first example, studied in Abernethy et al. (2008), is the quadratic loss ℓ(z, f) = z · f + σ‖f‖₂² for σ > 0, with Z and F bounded L2 balls. The second, studied in Vovk (1998), is the online linear regression setting where the opponent plays z = (y, x) ∈ Z = [−C_y, C_y] × B∞(0, 1) for C_y > 0 (B∞(0, 1) denotes the unit L∞ ball), the loss is ℓ((y, x), f) = (y − x · f)², and F is an L2 ball. The last one, from Ordentlich and Cover (1998), is the log-loss ℓ(z, f) = −log(z · f) with Z any compact set in R^d and F the simplex of dimension d. In the expert setting, i.e. when the cumulative loss is compared to the best static decision in a finite subset of F, the regret can be even smaller for loss functions bearing a specific property called mixability, and is shown to be Θ(1) in Haussler et al. (1998). For non-curved and non-mixable losses, evidence suggests that the value of the game increases dramatically to Θ(√T). Ω(√T) lower bounds are proved for several instances of the absolute loss ℓ(z, f) = |z − f| in Cesa-Bianchi and Lugosi (1999), typically with Z = {0, 1} and F = [0, 1]. For purely linear loss functions, Abernethy et al. (2008) establish an Ω(√T) lower bound on R_T when Z is an L2 ball centered at 0 and F is either an L2 ball or a bounded rectangle. This result was later generalized in Abernethy et al. (2009) and shown to hold for F a unit ball in any norm centered at 0 and Z its dual ball. Cesa-Bianchi and Lugosi (2006) investigate the expert setting, i.e. Z = [0, 1]^d and F the simplex in dimension d, and prove the same Ω(√T) lower bound (which also holds if Z is the simplex in dimension d, see Abernethy et al. (2009)). The framework considered in Abernethy et al. (2008) is slightly more general in that the opponent's decision set may change at each period (precisely, the opponent picks z_t in an L2 ball with time-dependent width) and some of the underlying parameters may be time-dependent too (e.g. σ for the quadratic loss).


Contributions The list of results recalled above is far from exhaustive but draws a good picture of the current state of the art. For each loss function, the intrinsic limitations of online algorithms are well understood through a particular example of F and Z for which a lower bound on R_T asymptotically matches the best guarantee achieved by one of these algorithms. We aim to study lower bounds on the value of the game in a more systematic fashion. Abernethy et al. (2009) develop a general and fruitful minimax tool for re-expressing the value of the game in order to obtain lower and upper bounds on it. We build on these results and improve upon them to characterize the learnability of a large class of piecewise linear functions.

Notations For a set S ⊂ R^d, conv(S) refers to the convex hull of this set, int(S) is its interior, relint(S) denotes its relative interior, and S̄ is its closure. If S is convex, ext(S) corresponds to the set of extreme points of S. |S| refers to the cardinality of this set, which may be equal to ∞. When S is a compact subset of R^d, we define P(S) as the set of probability measures on this set. For x ∈ R^d, ‖x‖ refers to the L2 norm of x, while B₂(x, ε) (resp. B∞(x, ε)) denotes the L2 (resp. L∞) ball centered at x with radius ε. For a matrix A, |||A||| is the operator norm induced by the L2 norm. For a collection of random variables (Z₁, ..., Z_t), σ(Z₁, ..., Z_t) refers to the sigma-field generated by Z₁, ..., Z_t.

3. First steps and overall objective

Any lower bound derived in this paper involves some form of an equalizing strategy. This concept, formalized in Rakhlin et al. (2011a), roughly refers to randomized strategies played by the opponent that cause the player's decisions to be completely irrelevant from a regret standpoint. These strategies are intrinsically hard to play against, by definition, and often lead to tight lower bounds. However, they may not always exist and might be hard to exhibit. In the linear setting, a basic example where an equalizing strategy arises is when 0 ∈ int(Z), in which case the opponent has the freedom to play, at each period, a random vector with expected value 0, making every strategy available to the player equally bad. This standard argument is stated in more depth in the following lemma.

Lemma 1 Suppose that the game is linear, that 0 ∈ int(Z), and that |F| > 1. Then R_T = Ω(√T).

Proof Consider two distinct points f₁, f₂ ∈ F (which exist since |F| > 1). As F is convex, we get [f₁, f₂] ⊂ F. Define f* = (f₁ + f₂)/2, β = ‖f₁ − f₂‖ > 0, and e = (f₁ − f₂)/β. There exists ε > 0 such that B̄₂(0, ε) ⊂ Z. Consider T ∈ N and the collection of random variables Z_t = (ε_t · ε)e for t = 1, ..., T with ε_t i.i.d. Rademacher random variables, i.e. ε_t takes value 1 with probability 1/2 and value −1 with probability 1/2. Then for any player's strategy we have, as f_t ∈ σ(Z₁, ..., Z_{t−1}):

$$E[r_T((Z_t)_{t=1,\dots,T}, (f_t)_{t=1,\dots,T})] = \sum_{t=1}^{T} E[Z_t] \cdot f_t - E\Big[\inf_{f \in \mathcal{F}} f \cdot \sum_{t=1}^{T} Z_t\Big].$$

This yields:

$$E[r_T((Z_t)_{t=1,\dots,T}, (f_t)_{t=1,\dots,T})] = -E\Big[\inf_{f \in \mathcal{F}} f \cdot \sum_{t=1}^{T} Z_t\Big].$$

We can lower bound the last quantity by:

$$E[r_T((Z_t)_{t=1,\dots,T}, (f_t)_{t=1,\dots,T})] \geq -E\Big[\inf_{f \in [f_1, f_2]} f \cdot \sum_{t=1}^{T} Z_t\Big].$$


This implies:

$$E[r_T((Z_t)_{t=1,\dots,T}, (f_t)_{t=1,\dots,T})] \geq -E\Big[\inf_{\alpha \in [-\frac{\beta}{2}, \frac{\beta}{2}]} (f^* + \alpha \cdot e) \cdot \Big(\sum_{t=1}^{T} (\epsilon_t \cdot \epsilon) \cdot e\Big)\Big].$$

Expanding the last line, we get:

$$E[r_T] \geq -\epsilon \cdot E\Big[\sum_{t=1}^{T} \epsilon_t\Big] \cdot (f^* \cdot e) - \epsilon \|e\|^2 \cdot E\Big[\inf_{\alpha \in [-\frac{\beta}{2}, \frac{\beta}{2}]} \alpha \cdot \Big(\sum_{t=1}^{T} \epsilon_t\Big)\Big].$$

If $\sum_{t=1}^{T} \epsilon_t \geq 0$ (resp. $\sum_{t=1}^{T} \epsilon_t \leq 0$), the infimum is attained at α = −β/2 (resp. α = β/2). Therefore, we obtain:

$$E[r_T] \geq -\epsilon \cdot E\Big[\sum_{t=1}^{T} \epsilon_t\Big] \cdot (f^* \cdot e) + \frac{\beta}{2} \cdot \epsilon \cdot E\Big[\Big|\sum_{t=1}^{T} \epsilon_t\Big|\Big].$$

The first term cancels out as $E[\sum_{t=1}^{T} \epsilon_t] = \sum_{t=1}^{T} E[\epsilon_t] = 0$. Furthermore, by Khintchine's inequality, $E[|\sum_{t=1}^{T} \epsilon_t|] \geq \frac{1}{\sqrt{2}}\sqrt{T}$, and we finally derive:

$$E[r_T((Z_t)_{t=1,\dots,T}, (f_t)_{t=1,\dots,T})] \geq \frac{\beta \epsilon}{2\sqrt{2}} \cdot \sqrt{T}.$$

This enables us to conclude that R_T = Ω(√T) since, for any player's strategy, there exists a sequence z₁, ..., z_T such that r_T((z_t)_{t=1,...,T}, (f_t)_{t=1,...,T}) ≥ E[r_T((Z_t)_{t=1,...,T}, (f_t)_{t=1,...,T})] ≥ (βε/(2√2)) · √T.
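As a sanity check (not part of the original paper), the following Python sketch simulates the Rademacher opponent used in the proof of Lemma 1 and estimates the resulting expected regret, which grows like √T regardless of the player's strategy; the values of eps, beta, and the number of trials are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_regret(T, eps=0.5, beta=1.0, n_trials=20000):
    """Monte Carlo estimate of E[r_T] against the opponent Z_t = (eps_t * eps) e.

    Since E[Z_t] = 0, the player's choices drop out and the regret reduces to
    -E[inf_{alpha in [-beta/2, beta/2]} alpha * eps * S] = (beta/2) * eps * E[|S|],
    where S is the sum of T Rademacher signs.
    """
    signs = rng.choice([-1.0, 1.0], size=(n_trials, T))
    s = signs.sum(axis=1)
    return (beta / 2.0) * eps * np.abs(s).mean()

for T in (100, 400, 1600, 6400):
    r = expected_regret(T)
    print(f"T={T:5d}  regret ~ {r:8.2f}  regret/sqrt(T) ~ {r / np.sqrt(T):.3f}")
```

The printed ratio stabilizes around (β/2)·ε·√(2/π), consistent with the Ω(√T) behavior (Khintchine's inequality only guarantees the slightly smaller constant βε/(2√2)).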

Lemma 1 sheds some light on the intrinsic hardness of linear games. This result does not come as a surprise, as it has long been noticed that pure linear games tend to be either trivial or as "hard" as a bounded convex game can be, in the sense that there are well-known examples of F and Z for which R_T = Θ(√T), while R_T = O(√T) for the loss functions identified in Assumption 3, see Zinkevich (2003). However, this has not been formally proven, and it turns out to be false for general non-trivial decision sets F and Z. One of the objectives of this paper is to develop lower and upper bounds on R_T for a variety of decision sets. Pursuing in that direction, we are led to introduce the following definition.

Definition 2 The game (ℓ, Z, F) is said to be trivial if and only if there exists f* ∈ F such that ℓ(z, f*) ≤ ℓ(z, f) for all z ∈ Z and all f ∈ F.

If the game is trivial, then the player can always choose f*, regardless of the planning horizon and of the opponent's strategy observed so far, to obtain regret 0. As it turns out, this uniquely identifies trivial games for a wide range of losses, as we establish in Lemma 4. A simple example of a trivial game is (·, [0, 1]^d, [0, 1]^d), where ℓ(z, f) ≥ 0 for all f ∈ [0, 1]^d and all z ∈ [0, 1]^d, with equality if f = 0 irrespective of z. Observe that the game considered in Lemma 1 is, in fact, non-trivial, since both ℓ((f₂ − f₁)/‖f₂ − f₁‖, f₂) > ℓ((f₂ − f₁)/‖f₂ − f₁‖, f₁) and ℓ((f₁ − f₂)/‖f₂ − f₁‖, f₁) > ℓ((f₁ − f₂)/‖f₂ − f₁‖, f₂) for any distinct f₁, f₂ ∈ F, and we have shown that, as a result, the game has regret Ω(√T).


We establish the same lower bound for a class of piecewise linear losses when F is a polyhedron in Section 6, and for linear games when 0 ∈ int(conv(Z)) in Section 5. However, we stress that this Ω(√T) lower bound is not a general feature of non-trivial linear games, as we show in Section 7. Specifically, there exist non-trivial linear games with regret o(√T). To further investigate the value of the game, we need to introduce new tools, as Lemma 1 is quite restrictive and the method appears to be of limited applicability. Indeed, even having 0 in Z might be of little help in proving an Ω(√T) regret bound if 0 is not in the interior of Z. For instance, consider Z = conv(z₁, z₂, z₃, z₄), with z₁, z₂, z₃, z₄ extreme points of Z, such that 0 ∈ [z₁, z₂], and F = [f*, f**] such that f* ∈ argmin_{f∈F} z₁ · f = argmin_{f∈F} z₂ · f and argmin_{f∈F} z₃ · f = {f*} but argmin_{f∈F} z₄ · f = {f**}, so that the game is not trivial. In this case, any lower bound derived from an i.i.d. opponent with a zero-mean distribution is 0 (a concrete example is provided in the Appendix). Considering this limitation, we instead build on a more powerful tool developed in Abernethy et al. (2009), rooted in von Neumann's minimax theorem, that readily enables the derivation of tight lower and upper bounds on R_T by recasting the value of the game in a backward order. We present this core result and derive consequences on the value of the game for general loss functions in the next section.
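Before moving on, note that Definition 2 can be checked mechanically in small linear games. The sketch below (not from the paper) does so with scipy for a finite set of opponent moves and a box-shaped F; for the linear loss, checking the extreme points of conv(Z) suffices, since equality of the affine map z → z · f* with the concave functional Φ on the extreme points propagates to the whole convex hull.

```python
import numpy as np
from scipy.optimize import linprog

def is_trivial(Z_pts, lo, hi):
    """Is the linear game with opponent moves Z_pts and F = [lo, hi] trivial?

    Trivial (Definition 2) iff a single f in F attains Phi(z) = min_f z . f
    simultaneously for every z, which is a linear feasibility problem.
    """
    bounds = list(zip(lo, hi))
    phi = [linprog(z, bounds=bounds).fun for z in Z_pts]        # Phi(z) per move
    res = linprog(np.zeros(len(lo)), A_ub=np.array(Z_pts),
                  b_ub=np.array(phi), bounds=bounds)            # find a common f*
    return res.success

# Trivial: the ([0,1]^2, [0,1]^2) example from the text (f* = 0 works).
print(is_trivial([np.array([1.0, 0.0]), np.array([0.0, 1.0])], [0, 0], [1, 1]))
# Non-trivial: opposite opponent moves as in Lemma 1, with F = [-1, 1]^2.
print(is_trivial([np.array([1.0, 0.0]), np.array([-1.0, 0.0])], [-1, -1], [1, 1]))
```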

4. Tools for the general setting

We start with a reformulation of the value of the game proved in Abernethy et al. (2009) that we rely on throughout the paper.

Proposition 3 (From Abernethy et al. (2009)) Under Assumptions 1 and 3, the value of the T-time-period game (ℓ, Z, F) can be rewritten as:

$$R_T = \sup_{p} E\Big[ \sum_{t=1}^{T} \inf_{f_t \in \mathcal{F}} E[\ell(Z_t, f_t) \mid Z_1, \dots, Z_{t-1}] - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(Z_t, f) \Big],$$

where the supremum is taken over the joint distributions p of the random variables (Z₁, ..., Z_T) in Z^T.

Proof We sketch the proof, as we are going to use the same tools a number of times. Just like in Abernethy et al. (2009), we present the ideas for T = 2, as this case captures most of the techniques. We are interested in re-expressing:

$$R_2 = \inf_{f_1 \in \mathcal{F}} \sup_{z_1 \in \mathcal{Z}} \inf_{f_2 \in \mathcal{F}} \sup_{z_2 \in \mathcal{Z}} \Big[ \sum_{t=1}^{2} \ell(z_t, f_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{2} \ell(z_t, f) \Big].$$

Optimizing over z₂ ∈ Z or over p₂ ∈ P(Z) and subsequently taking the average over z₂ ∼ p₂ yields the same value (the maximum is attained at a Dirac distribution). Hence:

$$R_2 = \inf_{f_1 \in \mathcal{F}} \sup_{z_1 \in \mathcal{Z}} \inf_{f_2 \in \mathcal{F}} \sup_{p_2 \in \mathcal{P}(\mathcal{Z})} M(p_2, f_2),$$

with $M(p_2, f_2) = E_{z_2 \sim p_2}\big[ \sum_{t=1}^{2} \ell(z_t, f_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{2} \ell(z_t, f) \big]$ for a fixed pair (f₁, z₁) ∈ F × Z.

This trick makes the cost-to-go linear in the opponent's move, i.e. in p₂, which enables us to appeal to a minimax result. As M is linear in its first argument, continuous and convex in the second (continuity can be shown by a standard dominated convergence type argument since ℓ is bounded and continuous in its second argument), F and P(Z) are convex sets, and F is compact, we obtain from Theorem 7.1 in Cesa-Bianchi and Lugosi (2006) that:

$$\inf_{f_2 \in \mathcal{F}} \sup_{p_2 \in \mathcal{P}(\mathcal{Z})} M(p_2, f_2) = \sup_{p_2 \in \mathcal{P}(\mathcal{Z})} \inf_{f_2 \in \mathcal{F}} M(p_2, f_2).$$

Plugging this back into R₂ yields:

$$R_2 = \inf_{f_1 \in \mathcal{F}} \sup_{z_1 \in \mathcal{Z}} \Big\{ \ell(z_1, f_1) + \sup_{p_2 \in \mathcal{P}(\mathcal{Z})} \Big\{ \inf_{f_2 \in \mathcal{F}} E_{z \sim p_2} \ell(z, f_2) - E_{z_2 \sim p_2} \inf_{f \in \mathcal{F}} \sum_{t=1}^{2} \ell(z_t, f) \Big\} \Big\}.$$

Proceeding as before, we obtain:

$$R_2 = \inf_{f_1 \in \mathcal{F}} \sup_{p_1 \in \mathcal{P}(\mathcal{Z})} A(p_1, f_1),$$

with

$$A(p_1, f_1) = E_{z_1 \sim p_1}\Big[ \ell(z_1, f_1) + \sup_{p_2 \in \mathcal{P}(\mathcal{Z})} \Big\{ \inf_{f_2 \in \mathcal{F}} E_{z \sim p_2} \ell(z, f_2) - E_{z_2 \sim p_2} \inf_{f \in \mathcal{F}} \sum_{t=1}^{2} \ell(z_t, f) \Big\} \Big].$$

Observe that A is continuous and convex in f₁ and linear in p₁, so we can again use the minimax theorem mentioned above to obtain:

$$R_2 = \sup_{p_1 \in \mathcal{P}(\mathcal{Z})} \Big\{ \inf_{f_1 \in \mathcal{F}} E_{z \sim p_1} \ell(z, f_1) + E_{z_1 \sim p_1}\Big[ \sup_{p_2 \in \mathcal{P}(\mathcal{Z})} \Big\{ \inf_{f_2 \in \mathcal{F}} E_{z \sim p_2} \ell(z, f_2) - E_{z_2 \sim p_2} \inf_{f \in \mathcal{F}} \sum_{t=1}^{2} \ell(z_t, f) \Big\} \Big] \Big\},$$

which can be rewritten as:

$$R_2 = \sup_{p_1 \in \mathcal{P}(\mathcal{Z})} E_{z_1 \sim p_1} \sup_{p_2 \in \mathcal{P}(\mathcal{Z})} \Big\{ \inf_{f_1 \in \mathcal{F}} E_{z \sim p_1} \ell(z, f_1) + \inf_{f_2 \in \mathcal{F}} E_{z \sim p_2} \ell(z, f_2) - E_{z_2 \sim p_2} \inf_{f \in \mathcal{F}} \sum_{t=1}^{2} \ell(z_t, f) \Big\}.$$

Optimizing over p₁, averaging over p₁, and then optimizing over p₂ amounts to optimizing over the joint distribution and averaging out over z₁ ∼ p₁. Hence:

$$R_2 = \sup_{p} E_{z_1 \sim p_1}\Big[ \inf_{f_1 \in \mathcal{F}} E_{z \sim p_1} \ell(z, f_1) + \inf_{f_2 \in \mathcal{F}} E_{z \sim p_2} \ell(z, f_2) - E_{z_2 \sim p_2} \inf_{f \in \mathcal{F}} \sum_{t=1}^{2} \ell(z_t, f) \Big],$$

and the property follows for T = 2.

This result, and more importantly the ideas found in the proof, go beyond the development of lower bounds, providing insight into the minimax optimal strategies for both players. For instance, Rakhlin et al. (2011b) further show, going through the proof of Proposition 3, that the value of the game remains the same when the learner is allowed to play randomized strategies, as opposed to deterministic ones. Specifically, instead of playing a deterministic vector f_t, the player picks a distribution p_t on F, reveals it to the opponent, and randomizes according to p_t upon disclosure of the opponent's decision. Mathematically, this translates into the equality:

$$R_T = \inf_{p_1 \in \mathcal{P}(\mathcal{F})} \sup_{z_1 \in \mathcal{Z}} E_{f_1 \sim p_1} \cdots \inf_{p_T \in \mathcal{P}(\mathcal{F})} \sup_{z_T \in \mathcal{Z}} E_{f_T \sim p_T} \Big[ \sum_{t=1}^{T} \ell(z_t, f_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(z_t, f) \Big].$$
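For intuition, the value (1) can be computed exactly in tiny games by backward induction over the opponent's history, since f_t may only depend on z₁, ..., z_{t−1}. The following sketch (not from the paper) does so for the purely linear loss with Z = {−1, +1} and a grid over F = [−1, 1]; the grid resolution and horizons are illustrative assumptions.

```python
import numpy as np
from functools import lru_cache

Z = (-1.0, 1.0)                          # opponent's moves
F = tuple(np.linspace(-1.0, 1.0, 21))    # grid over the player's decision set

@lru_cache(maxsize=None)
def value(z_hist, T):
    """inf_f sup_z continuation of the game after opponent history z_hist."""
    if len(z_hist) == T:
        # End of game: subtract the loss of the best fixed action in hindsight.
        return -min(f * sum(z_hist) for f in F)
    return min(max(z * f + value(z_hist + (z,), T) for z in Z) for f in F)

for T in (1, 2, 4, 8, 12):
    R = value((), T)
    print(f"T={T:2d}  R_T = {R:.3f}  R_T/sqrt(T) = {R / np.sqrt(T):.3f}")
```

The printed ratio R_T/√T remains bounded away from 0, in line with the Ω(√T) lower bounds of this paper (here 0 ∈ int(conv(Z)) and |F| > 1, so Lemma 1 combined with Lemma 9 applies).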

Let us start with a first application of Proposition 3, providing another equivalent characterization of trivial games in terms of the achievable regret.


Lemma 4 Under Assumptions 1 and 3 strengthened by 2 and 4, for any number of periods T ∈ N, R_T ≥ 0, and R_T = 0 if and only if the game is trivial.

Proof The fact that R_T ≥ 0 is proved in Lemma 3 of Abernethy et al. (2009) and follows from Proposition 3 by taking p as the product of T copies of any Dirac distribution on Z. Clearly, if the game is trivial then R_T = 0, because this value is attained for f₁ = ··· = f_T = f* irrespective of the decisions made by the opponent. Conversely, suppose R_T = 0. Consider p to be the product of T uniform distributions on Z. Then, by Proposition 3:

$$R_T \geq E\Big[ \sum_{t=1}^{T} \inf_{f_t \in \mathcal{F}} E[\ell(Z_t, f_t)] - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(Z_t, f) \Big],$$

as Z₁, ..., Z_T are independent random variables. Since they are also identically distributed, we obtain:

$$R_T \geq T \cdot \inf_{f \in \mathcal{F}} E[\ell(Z, f)] - E\Big[\inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(Z_t, f)\Big].$$

Yet $E[\inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(Z_t, f)] \leq \inf_{f \in \mathcal{F}} E[\sum_{t=1}^{T} \ell(Z_t, f)] = T \cdot \inf_{f \in \mathcal{F}} E[\ell(Z, f)]$, and we derive R_T ≥ 0.

Since we know R_T = 0, all the above inequalities are binding. As Z is a compact set and ℓ(·, ·) is jointly continuous, the map f → E[ℓ(Z, f)] is continuous by dominated convergence. Consider f* ∈ argmin_{f∈F} E[ℓ(Z, f)] (which exists since F is compact). We obtain:

$$E\Big[ \sum_{t=1}^{T} \ell(Z_t, f^*) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(Z_t, f) \Big] = 0.$$

As $\sum_{t=1}^{T} \ell(z_t, f^*) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(z_t, f) \geq 0$, we get that $(z_1, \dots, z_T) \to \sum_{t=1}^{T} \ell(z_t, f^*)$ and $(z_1, \dots, z_T) \to \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(z_t, f)$ are equal almost everywhere on Z^T. Since ℓ is jointly continuous and F is compact, the two latter functions are continuous, and therefore they are equal everywhere on Z^T. In particular, ℓ(z, f*) = inf_{f∈F} ℓ(z, f) for all z ∈ Z, and the game is trivial.

Proposition 3 also proves to be a powerful tool to develop lower bounds on R_T. We will make use of the following result, also established in Abernethy et al. (2009) as a consequence of Proposition 3, or, more precisely, of Lemma 8 below.

Lemma 5 (From Abernethy et al. (2009)) Suppose that Assumptions 1 and 3 hold. Take a distribution p on Z and define F* = argmin_{f∈F} E_p[ℓ(Z, f)]. Consider f* ∈ F*. For F̃ a finite subset of F* that contains f*, define Q(F̃) = {ℓ(·, f) | f ∈ F̃} ⊂ L²(p) and Q̄(F̃) = Q(F̃) − ℓ(·, f*). Then for T > T₀(F̃):

$$R_T \geq c \cdot \sqrt{T} \cdot \sup_{\tilde{\mathcal{F}}} E\Big[\sup_{q \in \bar{Q}(\tilde{\mathcal{F}})} G_q\Big],$$

where G_q is the centered Gaussian process indexed by the functions in Q̄(F̃), i.e. E[G_q G_s] = ⟨q, s⟩_{L²(p)} for q, s ∈ Q̄(F̃), and c > 0 is a constant independent of T.



Proof We sketch the general ideas of the proof. This lower bound is based on an application of Proposition 3 to i.i.d. random variables Z₁, ..., Z_T ∼ p. The lower bound derived is:

$$\frac{R_T}{T} \geq \inf_{f \in \mathcal{F}} E_{Z \sim p}[\ell(Z, f)] - E\Big[\inf_{f \in \mathcal{F}} \frac{1}{T}\sum_{t=1}^{T} \ell(Z_t, f)\Big].$$

We can further lower bound the last quantity by:

$$\frac{R_T}{T} \geq \inf_{f \in \mathcal{F}} E_{Z \sim p}[\ell(Z, f)] - E\Big[\inf_{f \in \mathcal{F}^*} \frac{1}{T}\sum_{t=1}^{T} \ell(Z_t, f)\Big].$$

Noting that $\inf_{f \in \mathcal{F}} E_{Z \sim p}[\ell(Z, f)] = E_{Z \sim p}[\ell(Z, f)]$ for any f ∈ F*, we get:

$$\frac{R_T}{T} \geq E\Big[\sup_{f \in \mathcal{F}^*} \Big\{ E_{Z \sim p}[\ell(Z, f)] - \frac{1}{T}\sum_{t=1}^{T} \ell(Z_t, f) \Big\}\Big],$$

which implies:

$$\frac{R_T}{T} \geq \sup_{\tilde{\mathcal{F}}} E\Big[\sup_{q \in Q(\tilde{\mathcal{F}})} \Big\{ E_{Z \sim p}[q(Z)] - \frac{1}{T}\sum_{t=1}^{T} q(Z_t) \Big\}\Big].$$

Based on this last expression, the conclusion draws on a lower bound on the performance of the empirical risk minimization procedure, established when several functions achieve the minimum risk in the learning class, specifically when |Q̄(F̃)| > 1 (when the cardinality is one, the lower bound is 0 and the result follows from Lemma 4).

The lower bound derived in Lemma 5 arises from a statistical learning framework where the objective is to learn a function ℓ(·, f) from a class parametrized by a vector f ∈ F, rather than to learn the actual underlying parameter. In fact, this particular problem is ill-defined in the game setting considered, as some vectors in F may be indistinguishable from a regret standpoint. For a practical example, consider the game (·, Z, F) with Z ⊂ {z | a · z = 0} and F = B̄₂(0, 1). In this game, playing f or f + λa, for a small enough λ and for f ∈ B₂(0, 1), makes absolutely no difference. For this reason, we need to think in terms of the classes of equivalence on F induced by ℓ; we make this definition precise in the linear setting.

Definition 6 When the loss is linear, we define the equivalence relationship ∼ℓ on F by f_a ∼ℓ f_b if and only if ℓ(·, f_a) = ℓ(·, f_b) on aff(Z).

Armed with Lemma 5, it is tempting to assert that finding p ∈ P(Z) such that |argmin_{f∈F} E_p[ℓ(Z, f)]| > 1 is enough to conclude R_T = Ω(√T). In Remark 7, we point out that this is not necessarily the case, and we specify a sufficient condition for the result to hold in Lemma 8.



Remark 7 The lower bound derived in Lemma 5 does not necessarily yield an Ω(√T) lower bound on R_T, even when |F*| > 1 for a given distribution p on Z. Consider, for instance, the online linear regression game ℓ(z, f) = (z · f)², F = [f₁, f₂] with f₁ = (1, 0, ..., 0) and f₂ = (0, 1, 0, ..., 0), and Z = B₂(z*, 1) with z* = (1, 1, 0, ..., 0). Observe that for all z ∈ Z and f ∈ F, z · f ≥ 0; hence argmin_{f∈F} ℓ(z, f) = argmin_{f∈F} z · f. Furthermore, argmin_{f∈F} z* · f = [f₁, f₂]. Choosing p to be the Dirac distribution supported on z*, we have F* = [f₁, f₂]. Yet, even though ℓ is not strongly convex in f, there exists an algorithm achieving O(log(T)) regret, see Vovk (1998). Therefore, the bound of Lemma 5 cannot be Ω(√T).

Remark 7 shows that we need to think in terms of the set of loss functions induced by the set of minimizers, i.e.:

$$\{\ell(\cdot, f) \mid f \in \operatorname*{argmin}_{f \in \mathcal{F}} E_p[\ell(Z, f)]\}.$$

This is because any choice of f ∈ F is expressed through the induced loss function ℓ(·, f). Hence, the actual learning problem is posed on this functional space. Lemma 8 simply states that when this functional set has cardinality greater than one, we get an Ω(√T) lower bound.

Lemma 8 Suppose we can find a distribution p on Z and two points in argmin_{f∈F} E_p[ℓ(Z, f)], say f₁ and f₂, such that ℓ(Z, f₁) is not almost surely equal to ℓ(Z, f₂) for Z ∼ p. Then the lower bound derived in Lemma 5 is Ω(√T).

Proof Defining q = ℓ(·, f₂) − ℓ(·, f₁), Lemma 5 implies:

$$R_T \geq c \cdot \sqrt{T} \cdot E[\sup(0, G_q)].$$

Since ℓ(·, f₂) ≠ ℓ(·, f₁) in L²(p), we get E[G_q²] = ⟨q, q⟩_{L²(p)} > 0 and Var(G_q) > 0. We conclude that E[sup(0, G_q)] > 0; otherwise G_q ≤ 0 almost surely, which would be a contradiction.

Note, in particular, that the condition of Lemma 8 rules out the choice of a Dirac distribution: finding z ∈ Z such that |argmin_{f∈F} z · f| > 1 is not sufficient to conclude R_T = Ω(√T), as highlighted in Remark 7. Lemma 8 is a powerful tool to derive lower bounds. As an illustration, consider the game ℓ(z, f) = |z − f| with Z = {0, 1} and F = [0, 1], whose regret is known to be lower bounded by Ω(√T), see Cesa-Bianchi and Lugosi (2006). This result can be readily obtained from Lemma 8 with p taken as the Bernoulli distribution with parameter 1/2. Indeed, argmin_{f∈F} E_p[ℓ(Z, f)] = argmin_{f∈F} {(1/2)f + (1/2)(1 − f)} = [0, 1], while clearly ℓ(0, 0) ≠ ℓ(0, 1). It is insightful to put Lemma 8 into perspective by relating the condition for it to hold to the non-differentiability of a functional defined on P(Z), originally introduced in Abernethy et al. (2009), which we denote by Φ and whose regularity is shown to be intrinsically tied to the asymptotic value of the game. Φ is defined as:

$$\Phi(p) = \inf_{f \in \mathcal{F}} E_{Z \sim p}[\ell(Z, f)], \quad \text{for } p \in \mathcal{P}(\mathcal{Z}). \tag{2}$$
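To make the absolute-loss illustration above concrete, the short sketch below (not from the paper) evaluates E_p[ℓ(Z, f)] on a grid over F = [0, 1] for the Bernoulli(1/2) opponent, recovers the flat argmin set, and checks the distinguishability condition of Lemma 8; the grid resolution is an assumption.

```python
import numpy as np

F = np.linspace(0.0, 1.0, 101)                                  # grid over F = [0, 1]
expected_loss = 0.5 * np.abs(0.0 - F) + 0.5 * np.abs(1.0 - F)   # E_p[|Z - f|]

minimizers = F[np.isclose(expected_loss, expected_loss.min())]
print("Phi(p) =", expected_loss.min())                # constant 1/2 on [0, 1]
print("argmin spans", minimizers.min(), "to", minimizers.max())

# Two minimizers f1 = 0 and f2 = 1 with l(0, f1) = 0 != 1 = l(0, f2):
# the condition of Lemma 8 holds, hence R_T = Omega(sqrt(T)) for this game.
print(abs(0.0 - 0.0), abs(0.0 - 1.0))
```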

As the authors of Abernethy et al. (2009) point out, when Φ is not differentiable everywhere on P(Z) (or, equivalently, when the condition of Lemma 8 holds), we have R_T = Ω(√T). Alternatively, Abernethy et al. (2009) show that when Φ is differentiable and α-smooth with respect to the total variation distance on P(Z), then R_T = O(log(T)). In particular, they prove that if ℓ is strongly convex in its second argument, this condition is satisfied. In Section 6, we show that Φ always has a point of non-differentiability when F is a polytope and ℓ belongs to a class of piecewise linear losses, thus incurring an Ω(√T) regret bound. In Section 7, we show that Φ can be α-smooth and everywhere differentiable when the loss function is linear, depending on both F and Z, thus incurring an O(log(T)) regret bound.

5. Linear setting

We proceed to establish general properties of the online game in the linear setup, in the same spirit as in Section 4. Linear games tend to have additional features that enable us to draw stronger conclusions about the game. Let us begin with a couple of remarks. Observe that the value of the game remains unchanged after applying a translation to F. Precisely, R_T(·, Z, F + a) = R_T(·, Z, F) for any a ∈ R^d. This is straightforward to verify as:

$$\sum_{t=1}^{T} z_t \cdot (f_t + a) - \inf_{f \in \mathcal{F} + a} \sum_{t=1}^{T} z_t \cdot f = \sum_{t=1}^{T} z_t \cdot f_t - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} z_t \cdot f.$$

Also note that Φ defined in (2) only depends on p through E_{Z∼p}[Z]. Hence, when the loss is linear, Φ can be understood as a functional on Z:

$$\Phi(z) = \inf_{f \in \mathcal{F}} z \cdot f, \quad \text{for } z \in \mathcal{Z}. \tag{3}$$

The following result, based on Proposition 3, is a powerful feature of linear games. Lemma 9 shows that we may assume without loss of generality that Z is a convex set.

Lemma 9 Under Assumptions 1 and 2 for the linear game (·, Z, F), the two games (·, Z, F) and (·, conv(Z), F) are equivalent, i.e. R_T(·, Z, F) = R_T(·, conv(Z), F). Moreover, Assumptions 1 and 2 also hold for the game (·, conv(Z), F).

Proof First observe that, armed with Assumptions 1 and 2, the loss function · satisfies Assumptions 3 and 4, so we are free to use any of the results derived in the last section. We proceed along the same lines as in Proposition 3 and prove the result for T = 2; the general proof follows the same principle. We have:

$$R_2 = \inf_{f_1 \in \mathcal{F}} \sup_{z_1 \in \mathcal{Z}} \inf_{f_2 \in \mathcal{F}} \sup_{z_2 \in \mathcal{Z}} \Big[ \sum_{t=1}^{2} \ell(z_t, f_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{2} \ell(z_t, f) \Big].$$

Consider fixed vectors f₁, f₂ ∈ F and z₁ ∈ Z, and define the function $M(z_2) = \sum_{t=1}^{2} \ell(z_t, f_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{2} \ell(z_t, f)$. Observe that M is convex. Indeed, $z_2 \to \ell(z_1, f_1) + \ell(z_2, f_2)$ is affine and $z_2 \to \inf_{f \in \mathcal{F}} \sum_{t=1}^{2} \ell(z_t, f)$ is concave as the infimum of affine functions. Therefore, $\sup_{z_2 \in \mathcal{Z}} M(z_2) = \sup_{z_2 \in \mathrm{conv}(\mathcal{Z})} M(z_2)$. We obtain:

$$R_2 = \inf_{f_1 \in \mathcal{F}} \sup_{z_1 \in \mathcal{Z}} \inf_{f_2 \in \mathcal{F}} \sup_{z_2 \in \mathrm{conv}(\mathcal{Z})} \Big[ \sum_{t=1}^{2} \ell(z_t, f_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{2} \ell(z_t, f) \Big].$$


We can use the same minimax theorem as in Proposition 3 to derive:

$$R_2 = \inf_{f_1 \in \mathcal{F}} \sup_{z_1 \in \mathcal{Z}} \Big\{ \ell(z_1, f_1) + \sup_{p_2 \in \mathcal{P}(\mathrm{conv}(\mathcal{Z}))} \Big\{ \inf_{f_2 \in \mathcal{F}} E_{z \sim p_2} \ell(z, f_2) - E_{z_2 \sim p_2} \inf_{f \in \mathcal{F}} \sum_{t=1}^{2} \ell(z_t, f) \Big\} \Big\}.$$

For a fixed f₁ ∈ F, define:

$$A(z_1) = \ell(z_1, f_1) + \sup_{p_2 \in \mathcal{P}(\mathrm{conv}(\mathcal{Z}))} \Big\{ \inf_{f_2 \in \mathcal{F}} E_{z \sim p_2} \ell(z, f_2) - E_{z_2 \sim p_2} \inf_{f \in \mathcal{F}} \sum_{t=1}^{2} \ell(z_t, f) \Big\}.$$

Observe that, for a fixed p₂ ∈ P(conv(Z)), $z_1 \to \inf_{f_2 \in \mathcal{F}} E_{z \sim p_2} \ell(z, f_2) - E_{z_2 \sim p_2} \inf_{f \in \mathcal{F}} \sum_{t=1}^{2} \ell(z_t, f)$ is convex as the difference between a constant and the infimum of affine functions. Since the supremum of convex functions is convex, A is convex and $\sup_{z_1 \in \mathcal{Z}} A(z_1) = \sup_{z_1 \in \mathrm{conv}(\mathcal{Z})} A(z_1)$. We derive:

$$R_2 = \inf_{f_1 \in \mathcal{F}} \sup_{z_1 \in \mathrm{conv}(\mathcal{Z})} \Big[ \ell(z_1, f_1) + \sup_{p_2 \in \mathcal{P}(\mathrm{conv}(\mathcal{Z}))} \Big\{ \inf_{f_2 \in \mathcal{F}} E_{z \sim p_2} \ell(z, f_2) - E_{z_2 \sim p_2} \inf_{f \in \mathcal{F}} \sum_{t=1}^{2} \ell(z_t, f) \Big\} \Big].$$

To conclude, we unwind the first step, i.e. we use the minimax theorem in reverse order. This yields:

$$R_2 = \inf_{f_1 \in \mathcal{F}} \sup_{z_1 \in \mathrm{conv}(\mathcal{Z})} \inf_{f_2 \in \mathcal{F}} \sup_{z_2 \in \mathrm{conv}(\mathcal{Z})} \Big[ \sum_{t=1}^{2} \ell(z_t, f_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{2} \ell(z_t, f) \Big],$$

i.e. R₂(·, Z, F) = R₂(·, conv(Z), F). Moreover, Z is a compact set, which implies that conv(Z) is also a compact set by a standard topological argument. As a result, the game (·, conv(Z), F) satisfies Assumption 2.

As a byproduct of Lemma 1 and Lemma 9 used in combination, we conclude that R_T = Ω(√T) whenever 0 ∈ int(conv(Z)) and |F| > 1. When the goal is to establish Ω(√T) lower bounds, the following lemma proves useful by reducing the general case to a one-dimensional decision set for the opponent.

Lemma 10 Suppose that Assumptions 1 and 2 hold, and further assume that the game is linear and non-trivial. Then at least one of the following two alternatives holds: either R_T(ℓ, Z, F) = Ω(√T), or we can find z₁ ≠ z₂ ∈ Z such that:

1. for any f₁ ∈ argmin_{f∈F} ℓ(z₁, f), f₁ ∉ argmin_{f∈F} ℓ(z₂, f), and symmetrically;

2. for any z ∈ [z₁, z₂], all points in argmin_{f∈F} ℓ(z, f) belong to the same class of equivalence under ∼ℓ.

In particular, the game (·, [z₁, z₂], F) is non-trivial. Furthermore, when int(Z) ≠ ∅, points 1 and 2 translate into: |argmin_{f∈F} ℓ(z, f)| = 1 for all z ∈ [z₁, z₂], and argmin_{f∈F} ℓ(z₁, f) ≠ argmin_{f∈F} ℓ(z₂, f).

Proof As before, note that Assumptions 3 and 4 hold under Assumptions 1 and 2 when ℓ = ·. Therefore, we can use Lemma 8. Furthermore, we may assume, without loss of generality, that Z is a convex set because, as a result of Lemma 9, the two games are equivalent. We begin with the case int(Z) ≠ ∅ to present the main ideas, avoiding technical difficulties that naturally arise when Z is not full-dimensional.


Take z ∈ int(Z) with ε > 0 such that B̄₂(z, ε) ⊂ Z, and suppose that |argmin_{f∈F} z · f| > 1. We show that R_T = Ω(√T). Observe, as a side note, that these two assumptions alone imply that the game is not trivial. Indeed, for any distinct f₁, f₂ ∈ argmin_{f∈F} z · f, consider z₁ = z + ε·(f₁ − f₂)/‖f₁ − f₂‖ ∈ Z and z₂ = z + ε·(f₂ − f₁)/‖f₁ − f₂‖ ∈ Z. We have f₁ · z = f₂ · z and (f₁ − f₂) · z₁ = ε·‖f₁ − f₂‖ > 0, which yields f₁ · z₁ > f₂ · z₁ and, symmetrically, f₂ · z₂ > f₁ · z₂. We conclude that there does not exist a point in F minimizing z · f for all z ∈ Z, i.e. the game is not trivial; hence this property does not contradict Lemma 4. Take p to be the uniform distribution on B̄₂(z, ε). Clearly, ℓ(·, f₁) and ℓ(·, f₂) are not equal in L²(p); otherwise, as these functions are continuous, they would be equal everywhere on B̄₂(z, ε), but f₁ · z₁ > f₂ · z₁ for z₁ defined above, a contradiction. We now invoke Lemma 8 to conclude that R_T = Ω(√T). Thus, without loss of generality, |argmin_{f∈F} z · f| = 1 for all z ∈ int(Z). Take any z₁ ∈ int(Z) and f₁ ∈ argmin_{f∈F} z₁ · f. Since the game is not trivial, we can find z₂ ∈ Z and f₂ ∈ F such that z₂ · f₂ < z₂ · f₁. Without loss of generality, we may assume that z₂ ∈ int(Z) by slightly perturbing z₂ in the direction of z₁, i.e. redefining z₂ as αz₁ + (1 − α)z₂ for α > 0 small enough to preserve the inequality z₂ · f₂ < z₂ · f₁. By the property proved above, we may assume |argmin_{f∈F} ℓ(z₂, f)| = 1, otherwise R_T = Ω(√T) and we would be done. Clearly, by construction, f₁ ∉ argmin_{f∈F} ℓ(z₂, f). This establishes the property when int(Z) ≠ ∅. For the general case, we proceed along the same lines, except that we work with the relative interior instead. Define d as the dimension of Z; note that d ≥ 1, otherwise the game is necessarily trivial. For z ∈ relint(Z), we claim that if there is more than one class of equivalence in argmin_{f∈F} ℓ(z, f), then R_T = Ω(√T). Suppose otherwise; take f₁, f₂ respectively in two different classes included in argmin_{f∈F} ℓ(z, f), and p to be the uniform distribution on B₂(z, ε) ∩ aff(Z) ⊂ Z. ℓ(Z, f₁) and ℓ(Z, f₂) cannot be equal almost surely for Z ∼ p, otherwise f₁ and f₂ would belong to the same class of equivalence. By Lemma 8, we have R_T = Ω(√T). Take any z₁ ∈ relint(Z) and f₁ ∈ argmin_{f∈F} z₁ · f. Since the game is not trivial, there must exist z₂ ∈ Z and f₂ ∈ F such that f₂ · z₂ < f₁ · z₂. Just like in the previous case, we may assume z₂ ∈ relint(Z) and, as proved above, we may assume that there exists a single class of equivalence in argmin_{f∈F} z₂ · f. Clearly, f₁ and f₂ do not belong to the same class of equivalence, which establishes the property.

For the aforementioned reasons, when dim(Z) < n, we cannot guarantee the uniqueness of minimizers in F, as there may exist indistinguishable vectors in F; this requires some additional work and a more precise statement. Nevertheless, Lemma 10 essentially enables us to systematically reduce the general study of the linear setting to a lower dimensional game with stronger assumptions on the decision sets. Indeed, since z₁, z₂ ∈ Z, we have R_T(·, Z, F) ≥ R_T(·, {z₁, z₂}, F) = R_T(·, [z₁, z₂], F). In a number of practical cases, Lemma 9 allows us to settle the study directly without even looking into this simplified game. For instance, R_T = Ω(√T) follows if int(Z) ≠ ∅ and we can find a point z ∈ int(Z) such that |argmin_{f∈F} z · f| > 1. Hence, this drives us to restrict our attention to p with Bernoulli marginal distributions in Proposition 3. Once we have carried out this reduction, we can characterize the minimax optimal strategies for both the player and the opponent.



Lemma 11 Suppose that Assumptions 1 and 2 hold, that Z = {z¹, z²}, and that the game is linear. Then there exist minimax strategies for both the player and the opponent, (f₁, ..., f_T) and (z₁, ..., z_T) respectively, in the sense that:

$$R_T = \sum_{t=1}^{T} \ell(z_t, f_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(z_t, f).$$

Moreover, for t = 1, ..., T, either f_t ∈ argmin_{f∈F} z_t · f or it is equivalent for the opponent to play z_t = z¹ or z_t = z². In addition, if 0 ∉ [z¹, z²] then f_t ∉ int(F).

Proof The proof relies on Proposition 3 and, as before, we show the result for T = 2. We start with the equality shown in Proposition 3:

$$R_2 = \inf_{f_1 \in \mathcal{F}} \sup_{z_1 \in [z^1, z^2]} \Big\{ \ell(z_1, f_1) + \sup_{p_2 \in \mathcal{P}(\mathcal{Z})} \Big\{ \inf_{f_2 \in \mathcal{F}} E_{z \sim p_2} \ell(z, f_2) - E_{z_2 \sim p_2} \inf_{f \in \mathcal{F}} \sum_{t=1}^{2} \ell(z_t, f) \Big\} \Big\},$$

which we can rewrite as:

$$R_2 = \inf_{f_1 \in \mathcal{F}} \max\{ z^1 \cdot f_1 + A(z^1),\; z^2 \cdot f_1 + A(z^2) \},$$

with $A(z_1) = \sup_{p_2 \in \mathcal{P}(\mathcal{Z})} \big\{ \inf_{f_2 \in \mathcal{F}} E_{z \sim p_2} \ell(z, f_2) - E_{z_2 \sim p_2} \inf_{f \in \mathcal{F}} \sum_{t=1}^{2} \ell(z_t, f) \big\}$ for z₁ ∈ Z. Observe that f₁ → max(z¹ · f₁ + A(z¹), z² · f₁ + A(z²)) is continuous as the maximum of two affine functions and that F is compact. Thus, there exists f₁ ∈ F such that R₂ = max(z¹ · f₁ + A(z¹), z² · f₁ + A(z²)). There are two possibilities. Either z¹ · f₁ + A(z¹) ≠ z² · f₁ + A(z²), say for instance z¹ · f₁ + A(z¹) > z² · f₁ + A(z²); in this case, f₁ ∈ argmin_{f∈F} z¹ · f, because (1 − α)f₁ + αf* would otherwise have a lower overall regret than f₁ for f* ∈ argmin_{f∈F} z¹ · f and a small enough α > 0. Or z² · f₁ + A(z²) = z¹ · f₁ + A(z¹), in which case the opponent can decide between playing z¹ or z² arbitrarily (because both yield the same overall regret). If, in addition, 0 ∉ [z¹, z²], we can strictly separate this closed convex set from 0. Hence, there exists h ∈ Rⁿ such that h · z¹ < 0 and h · z² < 0. Suppose by contradiction that f₁ ∈ int(F); then for a small enough ε > 0, f₁ + εh ∈ F and this point has lower overall regret than f₁, contradicting the optimality of f₁. We can now proceed to establish the characterization for f₂ and z₂ by writing:

$$R_2 = \inf_{f_2 \in \mathcal{F}} \sup_{z_2 \in [z^1, z^2]} \Big[ \sum_{t=1}^{2} \ell(z_t, f_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{2} \ell(z_t, f) \Big]$$

and proceed similarly now that f1 and z1 are fixed.

Lemma 11 provides insight into the capabilities of the minimax player. The minimax strategy alternates between periods where the player has an immediate edge over the opponent, i.e. when f_t ∈ argmin_{f∈F} z_t · f, and periods when his strategy turns out to be equalizing. In this respect, it is interesting to observe that the situation is reversed when compared to Lemma 1, where we exhibit an equalizing strategy for the opponent.



6. Ω(√T) lower bounds

This section is dedicated to the development of lower bounds on R_T when the loss function is piecewise linear. We build on the tools developed so far in Sections 4 and 5. Proposition 12 settles the value of linear games with a polyhedral set F. This result comes in addition to the one demonstrated in Section 5, namely that if 0 ∈ int(conv(Z)) and |F| > 1 then R_T = Ω(√T).

Proposition 12 Suppose that Assumptions 1 and 2 hold and that F is a polyhedron; further assume that the game is linear and non-trivial. Then R_T = Ω(√T).

Proof The assumptions made on the nature of the game are sufficient to apply Lemma 10. To establish the lower bound, we show that the second alternative of Lemma 10 cannot hold. Suppose by contradiction that it does; the game at hand is then (·, [z₁, z₂], F) with F satisfying points 1 and 2 of Lemma 10. Define z(α) = α·z₁ + (1 − α)·z₂ for α ∈ [0, 1]. We are interested in the set-valued map φ : α → argmin_{f∈F} z(α) · f. F being a polyhedron, I(f) = {α ∈ [0, 1] | f ∈ φ(α)} is a closed interval in [0, 1] for any f ∈ ext(F), by standard linear sensitivity analysis. Furthermore, ext(F) is finite and we can write ext(F) = {f₁, ..., f_N} for some N ∈ N. For any α ∈ [0, 1], there exists at least one point f ∈ ext(F) such that f ∈ φ(α); hence we can decompose [0, 1] into ∪_{k=1}^N I(f_k). We can further refine this description by thinking in terms of classes of equivalence: there exist M ≤ N classes of equivalence F_k, k = 1, ..., M, of ∼ℓ such that [0, 1] = ∪_{k=1}^M I(f_k) for some f_k ∈ F_k ∩ ext(F) (reordering the f_k's if necessary). By point 2 of Lemma 10, for all k ≠ j ≤ M, I(f_k) ∩ I(f_j) = ∅. Yet, the only way to partition [0, 1] into M < ∞ closed intervals is to have M = 1 and [0, 1] = I(f₁). However, this contradicts point 1 of Lemma 10, because the latter would imply that the game (·, [z₁, z₂], F) is trivial.

The proof heavily relies on Lemma 10 and may, as a result, seem rather obscure. We stress that, in all cases, this lower bound is derived by means of a kind of equalizing strategy. We present this more intuitive view in the case int(Z) ≠ ∅. The above proof boils down to finding a point z ∈ int(Z) such that |argmin_{f∈F} z · f| > 1 and, upon further examination, this property can be extended to ensure that there exist ε > 0, e ∈ Rⁿ, and f¹, f² ∈ argmin_{f∈F} z · f such that f¹ ∈ argmin_{f∈F} (z − xe) · f while f² ∉ argmin_{f∈F} (z − xe) · f for all x ∈ (0, ε], and symmetrically for x ∈ [−ε, 0). Now, just like in Lemma 1, we consider a randomized opponent: Z_t = z + (ε_t · ε)e for (ε_t)_{t=1,...,T} i.i.d. Rademacher random variables. Then for any player's strategy:

$$E[r_T((Z_t)_{t=1,\dots,T}, (f_t)_{t=1,\dots,T})] = \sum_{t=1}^{T} E[Z_t] \cdot f_t - E\Big[\inf_{f \in \mathcal{F}} f \cdot \sum_{t=1}^{T} Z_t\Big].$$

This yields:

$$E[r_T] = \sum_{t=1}^{T} z \cdot f_t - E\Big[\inf_{f \in \mathcal{F}} f \cdot \sum_{t=1}^{T} Z_t\Big].$$

We can lower bound the last quantity by:

$$E[r_T] \geq T(z \cdot f^1) - T \cdot E\Big[\inf_{f \in \mathcal{F}} f \cdot \Big(z + \Big(\epsilon \cdot \frac{\sum_{t=1}^{T} \epsilon_t}{T}\Big) e\Big)\Big],$$


as f¹ ∈ argmin_{f∈F} z · f; we could equivalently have picked f², since f¹ · z = f² · z. Furthermore, as $\big|\frac{1}{T}\sum_{t=1}^{T} \epsilon_t\big| \leq 1$, f¹ is optimal in the inner optimization problem if $\sum_{t=1}^{T} \epsilon_t \leq 0$, while f² is optimal if $\sum_{t=1}^{T} \epsilon_t \geq 0$. Hence:

$$E[r_T] \geq T(z \cdot f^1) - T \cdot E\Big[ f^1 \cdot \Big(z + \Big(\epsilon \cdot \tfrac{\sum_{t=1}^{T} \epsilon_t}{T}\Big) e\Big) \mathbb{1}_{\{\sum_{t=1}^{T} \epsilon_t \leq 0\}} + f^2 \cdot \Big(z + \Big(\epsilon \cdot \tfrac{\sum_{t=1}^{T} \epsilon_t}{T}\Big) e\Big) \mathbb{1}_{\{\sum_{t=1}^{T} \epsilon_t \geq 0\}} \Big].$$

Observe that the term T(z · f¹) cancels out and we get:

$$E[r_T] \geq \frac{E\big[\big|\sum_{t=1}^{T} \epsilon_t\big|\big]}{2} \cdot \epsilon \cdot (f^1 \cdot e - f^2 \cdot e).$$

By Khintchine's inequality, $E[|\sum_{t=1}^{T} \epsilon_t|] \geq \frac{1}{\sqrt{2}}\sqrt{T}$. Moreover, f¹ · e − f² · e > 0 because f² ∈ argmin_{f∈F} (z + εe) · f while f¹ is not, and f¹ · z = f² · z. We finally derive:

$$E[r_T((Z_t)_{t=1,\dots,T}, (f_t)_{t=1,\dots,T})] \geq \frac{\epsilon \, (f^1 \cdot e - f^2 \cdot e)}{2\sqrt{2}} \cdot \sqrt{T}.$$

This enables us to conclude that R_T = Ω(√T), as this holds for any player's strategy.

The result of Proposition 12 can be generalized to larger classes of piecewise linear loss functions. The following result is based on the same ideas but comes at the price of more technicalities. To the best of our knowledge, this is the first lower bound derived for classes of piecewise linear loss functions.

Proposition 13 Suppose that Assumptions 1 and 2 hold, that int(Z) ≠ ∅, that Z ⊂ int(Z̄), and that F is a polyhedron. Further assume that ℓ(z, f) = max_{y∈P(z)} y · f, with P(z) = {y | Ay ≤ z} for a matrix A with full column rank such that the recession cone of P(z) is reduced to {0}, namely {y | Ay ≤ 0} = {0}; also assume that, for all z ∈ Z, P(z) is not empty, i.e. the opponent is not allowed to play a z that would make the value of the game undefined. Finally, assume that the game is not trivial. Then R_T = Ω(√T).

Proof The proof is more technical than that of Proposition 12 but follows the same principle: we will prove that we can always exhibit a distribution p on Z satisfying the assumptions of Lemma 8. We first note that we may apply Lemma 5, since ℓ satisfies Assumption 3. Define Q(f) = {q ≥ 0 | Aᵀq = f}. Observe that for any z ∈ Z and f ∈ F, ℓ(z, f) = min_{q∈Q(f)} q · z by linear programming duality, since P(z) is not empty by assumption. Denote by $N = \binom{d}{n}$ a standard upper bound on the number of extreme points of Q(f) that holds regardless of f. We break down the proof into three steps. We first show that we can find a set H of d·N² + 1 points in R^d such that any collection of d + 1 points in that set is affinely independent. In a second step, we show that for any point z ∈ int(Z) and ε > 0 such that B̄₂(z, ε) ⊂ Z, and for a discrete uniform distribution p (to be specified) with mean z, if |argmin_{f∈F} E_p[ℓ(Z, f)]| > 1 then R_T = Ω(√T). In a final step, we show that we can always find such a point z.

Fact 1 There exists a set H of d·N² + 1 points in R^d with the property that any collection of d + 1 points in that set is affinely independent, and such that H ⊂ B̄₂(0, 1).


Proof We proceed by induction. Start with H, a subset of B̄₂(0, 1) comprised of d + 1 affinely independent points. Consider H̄ = ∪_{(x₁,...,x_d) ∈ H distinct} aff({x₁, ..., x_d}). Observe that H̄ is strictly included in R^d because it is a finite union of (d − 1)-dimensional affine spaces; hence we can take x ∈ R^d − H̄, and we may take x such that ‖x‖ ≤ 1. Observe that H ∪ {x} satisfies the assumption. Indeed, for any collection of d + 1 points in H ∪ {x}, there are two scenarios. Either x does not belong to that collection and we are done by the initial construction of H, or x is in this set and we denote by x₁, ..., x_d the other elements. These last d points are affinely independent by construction of H. Moreover, x ∉ aff({x₁, ..., x_d}) by construction of x. Hence the whole collection is affinely independent. We proceed inductively, taking x ∈ R^d − ∪_{(x₁,...,x_d) ∈ H distinct} aff({x₁, ..., x_d}) with ‖x‖ ≤ 1, until |H| reaches the target value.

Now, for any point z ∈ int(Z), take ε such that B̄₂(z, ε) ⊂ Z and define p to be the uniform distribution on the finite set z + εH ⊂ Z.

Fact 2 If |argmin_{f∈F} E_p[ℓ(Z, f)]| > 1, then R_T = Ω(√T).

Proof With Lemma 8, it suffices to show that there exists h ∈ z + εH with the property ℓ(h, f₁) ≠ ℓ(h, f₂) for some f₁, f₂ ∈ argmin_{f∈F} E_p[ℓ(Z, f)]. Suppose by contradiction that there does not exist such a point. Take f₁, f₂ ∈ argmin_{f∈F} E_p[ℓ(Z, f)] distinct. For any h ∈ z + εH ⊂ Z, there exist (q₁, q₂) ∈ ext(Q(f₁)) × ext(Q(f₂)) such that ℓ(h, f₁) = q₁ · h and ℓ(h, f₂) = q₂ · h (because, by duality, ℓ is always finite). Since ℓ(h, f₁) = ℓ(h, f₂) for all h ∈ z + εH ⊂ Z, and since there are at most N² distinct pairs of such (q₁, q₂) while |z + εH| = d·N² + 1, we conclude that there exists at least one such pair (q₁, q₂) with (q₁ − q₂) · h = 0 for d + 1 points h in z + εH. By construction of H, these points form a family of d + 1 affinely independent points. Hence, by linearity, q₁ − q₂ ∈ (R^d)^⊥, i.e. q₁ = q₂. This implies f₁ = Aᵀq₁ = Aᵀq₂ = f₂, a contradiction.

We now proceed to the last step, which consists in proving that there always exists z ∈ int(Z) such that |argmin_{f∈F} E_p[ℓ(Z, f)]| > 1 for the distribution p defined above.

Fact 3 There exists z ∈ int(Z) such that |argmin_{f∈F} E_p[ℓ(Z, f)]| > 1 for p taken as the uniform distribution on z + εH, for a small enough ε > 0 such that B̄₂(z, ε) ⊂ Z.

Proof Take any z₁ ∈ int(Z) and ε₁ > 0 such that B̄₂(z₁, ε₁) ⊂ Z. Define p₁ as the uniform distribution on z₁ + ε₁H. Without loss of generality, we may assume that |argmin_{f∈F} E_{p₁}[ℓ(Z, f)]| = 1, otherwise we are done. Take f₁ ∈ argmin_{f∈F} E_{p₁}[ℓ(Z, f)]. Since the game is not trivial, there exist f₂ ∈ F and z₂ ∈ Z such that ℓ(z₂, f₂) < ℓ(z₂, f₁). We may also assume without loss of generality that z₂ ∈ int(Z), as Z ⊂ int(Z̄) and ℓ is continuous in its first argument (as the minimum of a finite number of linear functions). Define β = (ℓ(z₂, f₁) − ℓ(z₂, f₂)) / (2·(d·N² + 1)) > 0. Observe that ℓ(·, f₁) and ℓ(·, f₂) are continuous, z₂ ∈ int(Z), and H is finite. Therefore, we can take ε₂ > 0 small enough that B̄₂(z₂, ε₂) ⊂ Z, |ℓ(z₂ + ε₂h, f₂) − ℓ(z₂, f₂)| < β, and |ℓ(z₂ + ε₂h, f₁) − ℓ(z₂, f₁)| < β for all h ∈ H. Define p₂ as the uniform distribution on z₂ + ε₂H. Observe that, as a consequence of the above construction, E_{p₂}[ℓ(Z, f₂)] < E_{p₂}[ℓ(Z, f₁)] and f₁ ∉ argmin_{f∈F} E_{p₂}[ℓ(Z, f)]. If |argmin_{f∈F} E_{p₂}[ℓ(Z, f)]| > 1, we are done. Hence, we may assume |argmin_{f∈F} E_{p₂}[ℓ(Z, f)]| = 1 and redefine f₂ as f₂ ∈ argmin_{f∈F} E_{p₂}[ℓ(Z, f)] (still with f₂ ≠ f₁).


For any α ∈ [0, 1], define p_α as the uniform distribution on αz₁ + (1 − α)z₂ + (αε₁ + (1 − α)ε₂)H ⊂ Z. For α ∈ [0, 1], min_{f∈F} E_{p_α}[ℓ(Z, f)] is equal to the value of the following linear program:

$$\begin{aligned}
\inf_{q_1, \dots, q_M, f} \quad & q \cdot (\alpha x_1 + (1 - \alpha) x_2) \\
\text{subject to} \quad & q = (q_1, \dots, q_M) \\
& A^{\mathsf{T}} q_m = f, \quad \forall m = 1, \dots, M \\
& f \in \mathcal{F}, \; q_m \geq 0, \quad \forall m = 1, \dots, M,
\end{aligned}$$

where M = d·N² + 1, x₁ = (1/M)(z₁ + ε₁h₁, z₁ + ε₁h₂, ..., z₁ + ε₁h_M), and x₂ = (1/M)(z₂ + ε₂h₁, z₂ + ε₂h₂, ..., z₂ + ε₂h_M). At this point, the result follows from a sensitivity analysis similar to the one carried out in Proposition 12: there must exist α ∈ [0, 1] such that the linear program above has two optimal extreme points with distinct f coordinates, i.e. |argmin_{f∈F} E_{p_α}[ℓ(Z, f)]| > 1.

We conclude with Fact 2 that R_T = Ω(√T) in all cases.
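The loss of Proposition 13 is easy to evaluate in practice through either of its two linear programming formulations. The sketch below (not from the paper) computes ℓ(z, f) = max_{y: Ay ≤ z} y · f with scipy; the matrix A and the test points are illustrative, chosen so that A has full column rank and {y | Ay ≤ 0} = {0}.

```python
import numpy as np
from scipy.optimize import linprog

A = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])           # full column rank, {y : Ay <= 0} = {0}

def loss(z, f):
    """l(z, f) = max_{y : Ay <= z} y . f, via linprog (which minimizes -y . f)."""
    res = linprog(c=-f, A_ub=A, b_ub=z, bounds=[(None, None)] * A.shape[1])
    assert res.success, "P(z) must be non-empty (bounded by the recession cone)"
    return -res.fun

z = np.array([1.0, 1.0, 0.5])          # the opponent acts through the right-hand side
print(loss(z, np.array([1.0, 0.0])))   # -> 1.0
print(loss(z, np.array([1.0, 1.0])))   # -> 2.0
```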

It might seem that we make a lot of assumptions in the course of proving Proposition 13. However, we point out that all the assumptions on P(z) are meant to make the game well-defined and to ensure that ℓ satisfies Assumption 3, so that we can use the minimax reformulation of Proposition 3. For instance, the assumption on the recession cone, along with Assumptions 1 and 2, guarantees that ℓ is bounded. With Proposition 13, we establish, for a range of piecewise linear functions, that the resulting game is as difficult as a purely linear game when F is a polyhedron. This particular type of loss function can be interpreted as facing a greedy opponent, i.e. one maximizing the immediate reward at each period while solving its own optimization program, and whose action is indirect, by means of the right-hand side of the constraints. For instance, this last parameter could represent the supply to a system. Note that switching from a completely adversarial opponent to a rational opponent simply leads to a redefinition of the loss function, and hence the study falls back on the online convex optimization framework. In some sense, the learner knows more about this greedy opponent than about a standard opponent in the linear setting. Yet, Proposition 13 shows that the game is just as difficult asymptotically. In Section 8, we further study the impact of a greedy behavior on the game and show that greed can cause its value to decrease dramatically for other loss functions. We conclude this section with a general remark, deduced from Proposition 12, regarding the interplay between the relation of set inclusion and the difficulty of the game.

Remark 14 Remark that for a shrinking set Z, the game gets easier for the player, as can be readily deduced from (1). That is to say, other things being equal, R_T is a decreasing function of Z with respect to set inclusion. On the other hand, we point out that there is no counterpart result for F: F₁ ⊂ F₂ implies neither R_T(ℓ, Z, F₁) ≤ R_T(ℓ, Z, F₂) nor R_T(ℓ, Z, F₂) ≤ R_T(ℓ, Z, F₁), regardless of the dimension of these sets. Consider the three linear games (·, B₂(z*, 1), F), where z* = (2n, 2n, 1, ..., 1) ∈ Rⁿ and F is taken as F_a = {f ∈ Rⁿ | ‖f‖∞ ≤ 1}, F_b = conv(F_a ∪ {(−4, 0, ..., 0), (0, −4, 0, ..., 0)}), and F_c = {f ∈ Rⁿ | ‖f‖∞ ≤ 4}, respectively. This example is illustrated in Figure 1 in a two-dimensional setting. Note that F_a ⊂ F_b ⊂ F_c.

F LAJOLET JAILLET

Figure 1: Counterexample showing that there is no connection between the relation inclusion for the player’s decision set and the value of the game.

game is trivial for both Fa and Fc (it is always optimal to play (−1, · · · , −1) and (−4, · · · , −4) respectively since Z ⊂ Rn+ ). Hence RT (·, Z, Fa ) = RT (·, Z, Fc ) = 0 for all T ∈ N. On the other hand, the game (·, Z, Fb ) is not trivial as we next show. First remark that, for any z ∈ Z, argminf ∈Fb z · f ⊂ [(−4, 0, · · · ), (0, −4, · · · )]. Indeed, as Z ⊂ Rn+ , minf ∈Fa minz∈Z z · f = minz∈Z z · (−1, · · · , −1) = z ∗ · (−1, · · · , −1) + minz∈B2 (0,1) z · (−1, · · · , −1) = −5n + 2 − n = −6n + 2 > maxz∈Z z · (−4, 0, · · · ) = maxz∈Z z · (0, −4, · · · ) = −8n + 4 for n ≥ 1. Next, argminf ∈Fb (2n + 1, 2n, 0, · · · ) · f = {(−4, 0, · · · )} and argminf ∈Fb (2n, 2n + 1, 0, · · · ) · f = {(0, −4, · · · )}. This shows that the game is √ not trivial. Since Fb is a polyhedron, we conclude with Proposition 12 that RT (·, Z, Fb ) = Ω( T ). As a result, we have neither RT (·, Z, Fb ) ≤ RT (·, Z, Fc ) nor RT (·, Z, Fb ) ≤ RT (·, Z, Fa ).
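As a quick numerical sanity check of Remark 14 (our illustration, not part of the paper), the sketch below takes n = 2, so z∗ = (4, 4), and evaluates the linear loss at the candidate extreme points of Fa, Fb, and Fc for random z ∈ B2(z∗, 1). The minimizer is constant for Fa and Fc but alternates between (−4, 0) and (0, −4) for Fb. numpy is assumed.

```python
import numpy as np

rng = np.random.default_rng(8)
z_star = np.array([4.0, 4.0])        # z* = (2n, 2n) for n = 2

cube = [np.array(v, dtype=float) for v in [(1, 1), (1, -1), (-1, 1), (-1, -1)]]
F_a = cube
F_b = cube + [np.array([-4.0, 0.0]), np.array([0.0, -4.0])]
F_c = [4.0 * v for v in cube]

def minimizers(F, zs):
    """Vertex minimizers of the linear loss z . f over conv(F), one per z."""
    return {tuple(float(x) for x in min(F, key=lambda f: z @ f)) for z in zs}

# Sample z from the ball B2(z*, 1); all such z have positive coordinates.
dirs = rng.normal(size=(1000, 2))
zs = [z_star + rng.uniform() * w / np.linalg.norm(w) for w in dirs]

for name, F in [("F_a", F_a), ("F_b", F_b), ("F_c", F_c)]:
    print(name, minimizers(F, zs))
# Expected: a single minimizer for F_a and F_c, but two alternating
# minimizers, (-4.0, 0.0) and (0.0, -4.0), for F_b.
```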

7. o(√T) upper bounds for linear losses

The upper bounds developed in this section are based on a localization idea expressed through the regularity of the functional Φ defined in (2). For mathematical convenience, we will rather rely on the restriction of Φ defined in (3), as we are interested in the purely linear loss function. The following smoothness property is crucial to obtain low regret bounds.

Definition 15 Φ defined in (3) is (α, γ)-flat on Z, for α, γ > 0, if it is differentiable on Z and the following property holds:

    Φ(z1) − Φ(z2) ≤ ∇Φ(z1) · (z1 − z2) + α·‖z1 − z2‖^γ,    ∀z1, z2 ∈ Z.    (4)

Flatness of Φ is closely related to stability of the minimizers of the optimization problem involved in the definition of Φ. For instance, if ℓ is strongly convex and globally Lipschitz, Φ of (2) is shown to be (α, 2)-flat with respect to the total variation distance for some α > 0 in Abernethy et al. (2009). Thus, flatness of Φ is a telltale sign of low regret, and this can be shown analytically, see Lemma 17 below. However, strong convexity of ℓ is not required to induce flatness in Φ. Even for linear loss functions, the interplay between Z, F, and ℓ may cause Φ to exhibit the same property, as we will explore in this section. The next lemma is a first step in that direction:

Lemma 16 Adapted from Abernethy et al. (2009) If there exist α > 0, γ > 0 such that, for all z1, z2 ∈ Z, there exist f1 ∈ argmin_{f∈F} z1 · f and f2 ∈ argmin_{f∈F} z2 · f with:

    ‖f1 − f2‖ ≤ α·‖z1 − z2‖^γ,    (5)

then Φ is (α, 1 + γ)-flat.

Proof The proof follows the steps of Theorem 12 in Abernethy et al. (2009). First write:

    Φ(z1) − Φ(z2) = z1 · f1 − z2 · f2 = (z1 − z2) · f1 + z2 · (f1 − f2).

As f1 ∈ argmin_{f∈F} z1 · f, we have z1 · (f2 − f1) ≥ 0 and, symmetrically, z2 · (f1 − f2) ≥ 0. Hence:

    0 ≤ z2 · (f1 − f2) ≤ z2 · (f1 − f2) + z1 · (f2 − f1) = (z2 − z1) · (f1 − f2) ≤ ‖z2 − z1‖ · ‖f1 − f2‖.

Since (5) is satisfied, we have ‖f1 − f2‖ ≤ α·‖z2 − z1‖^γ. This yields:

    0 ≤ z2 · (f1 − f2) ≤ α · ‖z1 − z2‖^{1+γ}.

Therefore:

    Φ(z1) − Φ(z2) = (z1 − z2) · f1 + o(‖z1 − z2‖),

which proves that Φ is differentiable at z1 with ∇Φ(z1) = f1. Furthermore, we obtain:

    Φ(z1) − Φ(z2) ≤ ∇Φ(z1) · (z1 − z2) + α·‖z1 − z2‖^{1+γ}.

The condition of Lemma 16 makes this localization idea mathematically sound by requiring the argmin operator to satisfy a Hölder condition. Note that this definition involves both F and Z, thereby enabling us to capture the interplay between the two decision sets through the loss function. Observe that (5) cannot hold when the game is non-trivial and F is a polyhedron, because we prove in Proposition 12 that there always exists z ∈ relint(Z) for which |argmin_{f∈F} z · f| > 1. We draw the same conclusion when 0 ∈ int(Z) and |F| > 1. Abernethy et al. (2009) proves that when Φ is (α, γ)-flat with γ > 3/2, the game has value o(√T). For the sake of completeness, we resketch the proof in the linear setting.

Lemma 17 Adapted from Abernethy et al. (2009) Under Assumption 1 strengthened by Assumption 2, if Φ is (α, γ)-flat with γ < 2, then RT = O(T^{2−γ}). If γ ≥ 2, then RT = O(log(T)).

Proof For any joint distribution p on (Z1, ···, ZT), we upper bound the quantity appearing in the equivalent reformulation of the game in Proposition 3, namely:

    E[ ∑_{t=1}^T inf_{ft∈F} E[ℓ(Zt, ft) | σ_{t−1}] − inf_{f∈F} ∑_{t=1}^T ℓ(Zt, f) ],

where σt is the sigma-field generated by (Z1, ···, Zt) and ℓ is the linear loss ℓ(z, f) = z · f. The latter can be broken down into two terms:

    T·( E[ (1/T)·Φ(E[ZT | σ_{T−1}]) − Φ( (1/T)·∑_{t=1}^T Zt ) ] ) + ∑_{t=1}^{T−1} E[Φ(E[Zt | σ_{t−1}])].

The first term can be written:

    T·E[ (1/T)·Φ(E[ZT | σ_{T−1}]) − Φ( (1/T)·E[ZT | σ_{T−1}] + ((T−1)/T)·(1/(T−1))·∑_{t=1}^{T−1} Zt )
         + Φ( (1/T)·E[ZT | σ_{T−1}] + ((T−1)/T)·(1/(T−1))·∑_{t=1}^{T−1} Zt ) − Φ( (1/T)·∑_{t=1}^T Zt ) ].

As Φ is concave, we can use Jensen's inequality to upper bound the first term of this last expression by −(T−1)·E[Φ( (1/(T−1))·∑_{t=1}^{T−1} Zt )]. The second term can be controlled with (4):

    T·( Φ( (1/T)·E[ZT | σ_{T−1}] + ((T−1)/T)·(1/(T−1))·∑_{t=1}^{T−1} Zt ) − Φ( (1/T)·∑_{t=1}^T Zt ) )
      ≤ ∇Φ( (1/T)·E[ZT | σ_{T−1}] + ((T−1)/T)·(1/(T−1))·∑_{t=1}^{T−1} Zt ) · (E[ZT | σ_{T−1}] − ZT) + (α/T^{γ−1})·‖E[ZT | σ_{T−1}] − ZT‖^γ.

Taking expectations on both sides, the first term cancels out by conditioning on σ_{T−1}, since (1/T)·E[ZT | σ_{T−1}] + ((T−1)/T)·(1/(T−1))·∑_{t=1}^{T−1} Zt is σ_{T−1}-measurable. The second term can be upper bounded by α·(2C_Z)^γ/T^{γ−1}, where C_Z is a bound on Z. Plugging everything back into the second equation yields:

    α·(2C_Z)^γ/T^{γ−1} + ∑_{t=1}^{T−1} E[Φ(E[Zt | σ_{t−1}])] − (T−1)·E[Φ( (1/(T−1))·∑_{t=1}^{T−1} Zt )].

By induction, we conclude:

    E[ ∑_{t=1}^T inf_{ft∈F} E[ℓ(Zt, ft) | σ_{t−1}] − inf_{f∈F} ∑_{t=1}^T ℓ(Zt, f) ] ≤ α·(2C_Z)^γ · ∑_{t=1}^T 1/t^{γ−1}.

Since this holds for all p, we obtain RT ≤ α·(2C_Z)^γ · ∑_{t=1}^T 1/t^{γ−1}. If γ < 2, ∑_{t=1}^T 1/t^{γ−1} ∼ T^{2−γ}. If γ ≥ 2, ∑_{t=1}^T 1/t^{γ−1} ≤ ∑_{t=1}^T 1/t ∼ log(T).

The next lemma shows that the definition of Lemma 16 is not vacuous: there exist non-trivial linear games satisfying this property.

Lemma 18 The linear game (·, Z, F) with Z = B̄2(z∗, 1), for ‖z∗‖ > 1, and F = B̄2(0, 1) is non-trivial and satisfies the assumptions of Lemma 16 for α = 2·(‖z∗‖ + 1)/(‖z∗‖ − 1)² and γ = 1. Hence, 0 < RT(·, B̄2(z∗, 1), B̄2(0, 1)) = O(log(T)).


Proof First observe that the game is not trivial since, for any z ∈ Z, argmin_{f∈F} z · f = {−z/‖z‖}. Consider z1, z2 ∈ Z and let f1 = −z1/‖z1‖ and f2 = −z2/‖z2‖ denote the corresponding minimizers. We have:

    ‖f1 − f2‖ = ‖ z2/‖z2‖ − z1/‖z1‖ ‖ = (1/(‖z1‖·‖z2‖)) · ‖ ‖z1‖·z2 − ‖z2‖·z1 ‖.

Hence:

    ‖f1 − f2‖ ≤ (1/(‖z∗‖ − 1)²) · ‖ (‖z1‖ − ‖z2‖)·z2 + ‖z2‖·(z2 − z1) ‖,

which implies:

    ‖f1 − f2‖ ≤ (1/(‖z∗‖ − 1)²) · ‖z2‖ · ( |‖z1‖ − ‖z2‖| + ‖z2 − z1‖ ) ≤ 2·(‖z∗‖ + 1)/(‖z∗‖ − 1)² · ‖z1 − z2‖.
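As a quick numerical sanity check (our illustration, not part of the original argument), the sketch below samples pairs z1, z2 from B̄2(z∗, 1) and verifies that the minimizers f_i = −z_i/‖z_i‖ satisfy the Lipschitz condition (5) with the constant from Lemma 18; numpy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_pairs = 5, 10_000
z_star = np.zeros(d); z_star[0] = 3.0          # ||z*|| = 3 > 1
alpha = 2 * (np.linalg.norm(z_star) + 1) / (np.linalg.norm(z_star) - 1) ** 2

def sample_Z():
    """Uniform sample from the ball B2(z*, 1)."""
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)
    return z_star + rng.uniform() ** (1 / d) * u

worst_ratio = 0.0
for _ in range(n_pairs):
    z1, z2 = sample_Z(), sample_Z()
    f1, f2 = -z1 / np.linalg.norm(z1), -z2 / np.linalg.norm(z2)
    worst_ratio = max(worst_ratio,
                      np.linalg.norm(f1 - f2) / np.linalg.norm(z1 - z2))

# Empirically, the worst observed ratio stays below alpha from Lemma 18.
print(f"worst ||f1-f2||/||z1-z2|| = {worst_ratio:.3f} <= alpha = {alpha:.3f}")
```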

There is a straightforward generalization of Lemma 18 to ellipsoidal decision sets F. Lemma 18 also has some implications for piecewise linear losses. Consider the loss function of Proposition 13 with F an ellipsoid, and suppose that the matrix A and the set Z are such that there exists y∗ with ‖y∗‖ > 1 such that P(z) ⊂ B̄2(y∗, 1) for all z ∈ Z. Then, for any sequence of moves (z1, ···, zT), (f1, ···, fT):

    rT((zt)t=1,···,T, (ft)t=1,···,T) ≤ ∑_{t=1}^T yt · ft − inf_{f∈F} ∑_{t=1}^T yt · f,

where yt ∈ argmax_{y∈P(zt)} y · ft. By construction, we have yt ∈ B̄2(y∗, 1), and we deduce from Lemma 18 that the value of this game is upper bounded by O(log(T)).

Remark that the upper bound derived in Lemma 18 is not constructive, in the sense that we do not provide an explicit algorithm to achieve O(log(T)) regret. As it turns out, this regret can be achieved by a simple Follow-the-Leader approach, a strategy introduced in Kalai and Vempala (2005), as shown below.

Lemma 19 The Follow-The-Leader strategy guarantees a O(log(T)) regret for the linear game described in Lemma 18.

Proof This strategy consists in playing ft ∈ F at period t ≥ 2 such that:

    ft ∈ argmin_{f∈F} ∑_{τ=1}^{t−1} ℓ(zτ, f).

For the particular game studied here, ft can be written in closed form as ft = −(∑_{τ=1}^{t−1} zτ)/‖∑_{τ=1}^{t−1} zτ‖. Observe that this last quantity is always well-defined, as ‖∑_{τ=1}^{t−1} zτ‖ ≥ (t−1)·(‖z∗‖ − 1) > 0, as shown in the proof of Lemma 21. A common inequality on the regret incurred by a Follow-the-Leader strategy is:

    rT((zt)t=1,···,T, (ft)t=1,···,T) ≤ ∑_{t=1}^T (zt · ft − zt · f_{t+1}).

Observe now that:

    zt · f_{t+1} = ( ‖∑_{τ=1}^{t−1} zτ‖ / ‖∑_{τ=1}^{t} zτ‖ ) · zt · ft − ‖zt‖² / ‖∑_{τ=1}^{t} zτ‖.

Plugging this identity back into the upper bound yields:

    rT((zt)t=1,···,T, (ft)t=1,···,T) ≤ ∑_{t=1}^T [ ‖zt‖²/‖∑_{τ=1}^t zτ‖ + zt · ft · (1 − ‖∑_{τ=1}^{t−1} zτ‖/‖∑_{τ=1}^t zτ‖) ].

Observe that ‖zt‖ ≤ ‖z∗‖ + 1 and ‖ft‖ ≤ 1, hence |zt · ft| ≤ ‖z∗‖ + 1. Moreover,

    |1 − ‖∑_{τ=1}^{t−1} zτ‖/‖∑_{τ=1}^t zτ‖| = |‖∑_{τ=1}^t zτ‖ − ‖∑_{τ=1}^{t−1} zτ‖| / ‖∑_{τ=1}^t zτ‖ ≤ ‖zt‖/‖∑_{τ=1}^t zτ‖ ≤ (‖z∗‖ + 1)/‖∑_{τ=1}^t zτ‖.

We get:

    rT((zt)t=1,···,T, (ft)t=1,···,T) ≤ 2·(‖z∗‖ + 1)² · ∑_{t=1}^T 1/‖∑_{τ=1}^t zτ‖.

With ‖∑_{τ=1}^t zτ‖ ≥ t·(‖z∗‖ − 1), we conclude:

    rT((zt)t=1,···,T, (ft)t=1,···,T) ≤ 2·(‖z∗‖ + 1)²/(‖z∗‖ − 1) · ∑_{t=1}^T 1/t.

This holds for any of the opponent's sequences of moves, hence RT = O(log(T)).
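Since Lemma 19 is constructive, it is easy to simulate. The sketch below is our illustration (the adversary's heuristic is an arbitrary choice, not from the paper): it runs Follow-the-Leader on the game of Lemma 18 and prints the realized regret at a few horizons, which grows slowly, consistent with the O(log(T)) bound. numpy is assumed.

```python
import numpy as np

rng = np.random.default_rng(1)
d, T = 3, 20_000
z_star = np.array([2.0] + [0.0] * (d - 1))      # ||z*|| = 2 > 1

S = np.zeros(d)          # running sum of the opponent's moves
losses, zs = [], []
for t in range(T):
    # FTL move: f_t = -S/||S|| (any feasible move at t = 1, here 0)
    f = -S / np.linalg.norm(S) if t > 0 else np.zeros(d)
    # Adversary heuristic: pick z in B2(z*, 1) roughly maximizing z . f
    u = rng.normal(size=d); u /= np.linalg.norm(u)
    z = z_star + (u if u @ f > 0 else -u)
    losses.append(z @ f)
    zs.append(z)
    S += z

# Best fixed action in hindsight is -S_t/||S_t||, with cumulative loss -||S_t||.
prefix_norms = np.linalg.norm(np.cumsum(np.array(zs), axis=0), axis=1)
regret = np.cumsum(losses) + prefix_norms
for T_check in (100, 1_000, 10_000):
    # Grows slowly, consistent with the O(log T) bound of Lemma 19.
    print(T_check, round(regret[T_check - 1], 2))
```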

The "smoothness" of the boundary of the decision set F in the game studied in Lemma 18, combined with the fact that 0 ∉ conv(Z), accounts for the low value of the game. Let us illustrate this last point with a slight generalization of Lemma 18.

Lemma 20 Suppose that 0 ∉ conv(Z) and that Z is compact. Consider F : R^d → R a strongly convex, continuously differentiable function and define the decision set F = {f ∈ R^d | F(f) ≤ 0}. Suppose that F is not empty; then RT(·, Z, F) = O(log(T)).

Proof First observe that F is convex by convexity of F. It is also closed, as F is continuous, and bounded, as F is strongly convex (which implies that there exist a and b > 0 such that F(f) ≥ a + b·‖f‖²), thus F is a compact set. From this we conclude that we may apply Lemma 17. The remainder of the proof is dedicated to showing that the conditions of Lemma 16 are met for γ = 1. Without loss of generality, we may assume that Z is a convex set, as shown in Lemma 9. Since Z is closed and convex, we can strictly separate 0 from Z. Hence, there exist a ∈ R^d and c ∈ R such that a · z > c > a · 0 = 0 for all z ∈ Z. We get a ≠ 0 and ‖z‖ ≥ c/‖a‖ > 0 for all z ∈ Z. Let us use the shorthand C = c/‖a‖. As F is compact, there exists a solution, which we denote by f∗, to the optimization problem

    inf_{f∈F} f · z,

for any z ∈ Z. Furthermore, we can always find a solution located on the boundary of F, as F is continuous. If F is a singleton, we are done since in that case RT(·, Z, F) = 0. Otherwise, observe that the constraint qualifications are automatically satisfied at f∗, as ∇F cannot vanish on the boundary since F does not attain its minimum on this set. Hence, there exists λ ≥ 0 such that z + λ·∇F(f∗) = 0. As in Lemma 16, we consider any pair of points, z1 and z2, in Z. Let us denote by f1, λ1 and f2, λ2 the solutions and the associated Lagrange multipliers exhibited by the above procedure for z1 and z2 respectively. As z1, z2 ≠ 0, we must have λ1, λ2 ≠ 0. We obtain ∇F(f1) = −(1/λ1)·z1 and ∇F(f2) = −(1/λ2)·z2. Since F is strongly convex, there exists β > 0 such that:

    (∇F(f′) − ∇F(f″)) · (f′ − f″) ≥ β·‖f′ − f″‖²,

for all f′, f″ ∈ F. Applying this inequality to f′ = f1 and f″ = f2, we obtain:

    ( (1/λ2)·z2 − (1/λ1)·z1 ) · (f1 − f2) ≥ β·‖f1 − f2‖².

We can break the last expression down into two pieces:

    (1/λ2)·z2 · (f1 − f2) + (1/λ1)·z1 · (f2 − f1) ≥ β·‖f1 − f2‖².

Observe that z2 · (f1 − f2) ≥ 0, since both f1 and f2 belong to F and f2 is the minimizer of z2 · f for f ranging over F. Symmetrically, z1 · (f2 − f1) ≥ 0. Note that 1/λ1 = |1/λ1| = ‖∇F(f1)‖/‖z1‖. As ∇F is continuous and F is compact, there exists K > 0 such that ‖∇F(f)‖ ≤ K for any f ∈ F. Hence, we get 1/λ1 ≤ K/C, and the same inequality holds for 1/λ2. Plugging this upper bound back into the last inequality yields:

    (K/C)·(z2 − z1) · (f1 − f2) ≥ β·‖f1 − f2‖²,

and:

    ‖z2 − z1‖·‖f1 − f2‖ ≥ (β·C/K)·‖f1 − f2‖².

Simplifying on both sides yields:

    ‖z2 − z1‖ ≥ (β·C/K)·‖f1 − f2‖.

This concludes the proof, as this is equivalent to (5) with γ = 1 and α = K/(β·C) > 0.

Interestingly, an upper bound based on Rademacher complexity from Rakhlin et al. (2011b) fails to yield the O(log(T)) regret bound for the linear game of Lemma 18. Indeed, taking the expression from Theorem 2 of Rakhlin et al. (2011b), this upper bound is:

    2·sup_{z1,···,zT} E_{ε1,···,εT}[ sup_{f∈F} ∑_{t=1}^T εt·ℓ(zt(ε), f) ] = 2·sup_{z1,···,zT} E_{ε1,···,εT}[ ‖∑_{t=1}^T εt·zt(ε)‖ ],

where (εt)t=1,···,T are independent Rademacher random variables and zt ∈ Z is allowed to depend on ε1, ···, ε_{t−1}. This quantity is clearly lower bounded by:

    2·E_{ε1,···,εT}[ ‖∑_{t=1}^T εt·z∗‖ ],

which is further lower bounded by √2·‖z∗‖·√T by Khintchine's inequality. On a side note, quite surprisingly, i.i.d. opponents appear to be particularly weak for this particular game, incurring a O(1) regret as shown in the following lemma. This is in stark contrast with the situations identified in Lemma 1 and Proposition 12.
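As an aside (our illustration, not from the paper), the √T growth of this Rademacher average is easy to confirm by Monte Carlo: the sketch below estimates 2·E‖∑ εt·z∗‖ for a few horizons and compares it with the Khintchine lower bound. numpy is assumed.

```python
import numpy as np

rng = np.random.default_rng(2)
z_star = np.array([2.0, 0.0])                    # ||z*|| = 2
for T in (100, 1_000, 10_000):
    # eps: n_runs x T matrix of independent Rademacher signs
    eps = rng.choice([-1.0, 1.0], size=(5_000, T))
    # ||sum_t eps_t z*|| = |sum_t eps_t| * ||z*||
    est = 2 * np.mean(np.abs(eps.sum(axis=1))) * np.linalg.norm(z_star)
    khintchine = np.sqrt(2 * T) * np.linalg.norm(z_star)
    print(T, round(est, 1), ">=", round(khintchine, 1))   # both grow like sqrt(T)
```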


Lemma 21 Consider the non-trivial linear game of Lemma 18. Any lower bound derived from Proposition 3 with i.i.d. random variables Z1, ···, ZT ∼ p is O(1), for any choice of p ∈ P(Z).

Proof We prove more generally that, for any choice of random variables (Z1, ···, ZT) such that E[Zt | σ(Z1, ···, Z_{t−1})] is constant almost surely, the lower bound derived is O(1). In the sequel, σ_{t−1} serves as a shorthand for σ(Z1, ···, Z_{t−1}). Write Zt = z∗ + Wt and E[Wt | σ_{t−1}] = ct, with ‖Wt‖ ≤ 1 and ‖ct‖ ≤ 1. Define w∗ = T·z∗ + ∑_{t=1}^T ct. Observe that ‖w∗‖ ≥ T·‖z∗‖ − ‖∑_{t=1}^T ct‖ ≥ T·(‖z∗‖ − 1) > 0. Write Wt = Xt·w∗/‖w∗‖ + W̃t + ct with W̃t · w∗ = 0. Projecting down the equality E[Wt − ct | σ_{t−1}] = 0, we get E[Xt | σ_{t−1}] = 0 and E[W̃t | σ_{t−1}] = 0. The bound derived from Proposition 3 is:

    RT ≥ E[ ‖w∗ + ∑_{t=1}^T (Wt − ct)‖ ] − ∑_{t=1}^T ‖z∗ + ct‖.

Expanding the first term:

    ‖w∗ + ∑_{t=1}^T (Wt − ct)‖ = √( (1 + ∑_{t=1}^T Xt/‖w∗‖)² · ‖w∗‖² + ‖∑_{t=1}^T W̃t‖² ).

By concavity of the square root function:

    E[ ‖w∗ + ∑_{t=1}^T (Wt − ct)‖ ] ≤ √( ‖w∗‖² · E[(1 + ∑_{t=1}^T Xt/‖w∗‖)²] + E[‖∑_{t=1}^T W̃t‖²] ).

We expand the two inner terms. First:

    E[(1 + ∑_{t=1}^T Xt/‖w∗‖)²] = 1 + 2·∑_{t=1}^T E[Xt]/‖w∗‖ + (1/‖w∗‖²)·E[(∑_{t=1}^T Xt)²].

Looking at each term individually, we have E[Xt] = E[E[Xt | σ_{t−1}]] = 0 and E[(∑_{t=1}^T Xt)²] = E[(∑_{t=1}^{T−1} Xt)²] + 2·E[XT·(∑_{t=1}^{T−1} Xt)] + E[XT²], where E[XT·(∑_{t=1}^{T−1} Xt)] = E[E[XT | σ_{T−1}]·(∑_{t=1}^{T−1} Xt)] = 0. Hence, by induction, E[(1 + ∑_{t=1}^T Xt/‖w∗‖)²] = 1 + E[∑_{t=1}^T Xt²]/‖w∗‖². Similarly, E[‖∑_{t=1}^T W̃t‖²] = ∑_{t=1}^T E[‖W̃t‖²]. We obtain:

    E[ ‖w∗ + ∑_{t=1}^T (Wt − ct)‖ ] ≤ √( ‖w∗‖² + ∑_{t=1}^T E[Xt² + ‖W̃t‖²] ).

Remark that ‖Wt − ct‖ ≤ ‖Wt‖ + ‖ct‖ ≤ 2, hence Xt² + ‖W̃t‖² = ‖Wt − ct‖² ≤ 4. We obtain:

    E[ ‖w∗ + ∑_{t=1}^T (Wt − ct)‖ ] ≤ √( ‖w∗‖² + 4·T ).

We have √(‖w∗‖² + 4·T) ≤ ‖w∗‖ + 2·T/‖w∗‖. Yet ‖w∗‖ = ‖∑_{t=1}^T (z∗ + ct)‖ ≤ ∑_{t=1}^T ‖z∗ + ct‖. Hence, the lower bound derived is:

    E[ ‖w∗ + ∑_{t=1}^T (Wt − ct)‖ ] − ∑_{t=1}^T ‖z∗ + ct‖ ≤ 2·T/‖w∗‖ ≤ 2/(‖z∗‖ − 1) = O(1).
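As a numerical illustration of Lemma 21 (ours, under the stated sampling assumption), the sketch below estimates the Proposition 3 lower bound E[‖∑ Zt‖] − ∑‖E[Zt]‖ for an i.i.d. opponent drawing Zt uniformly from the unit sphere centered at z∗; the estimate stays bounded as T grows. numpy is assumed.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_runs = 3, 500
z_star = np.array([2.0, 0.0, 0.0])            # ||z*|| = 2 > 1

for T in (100, 1_000, 10_000):
    norms = []
    for _ in range(n_runs):
        u = rng.normal(size=(T, d))
        u /= np.linalg.norm(u, axis=1, keepdims=True)   # Z_t = z* + u_t, u_t on the sphere
        norms.append(np.linalg.norm(T * z_star + u.sum(axis=0)))
    lb = np.mean(norms) - T * np.linalg.norm(z_star)    # E||sum Z_t|| - sum ||E Z_t||
    print(T, round(lb, 3))                              # stays bounded as T grows
```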


Abernethy et al. (2009) remarks that restricting the study to i.i.d. sequences is not enough to get tight bounds in a more general setting for non-linear losses. Lemma 21 above shows that this may in fact be the case even for arguably the simplest games to study. We stress that there are very few examples of online games for which tight lower bounds on RT are derived from a non-i.i.d. random opponent; we refer to one such example in Abernethy et al. (2009) for the quadratic loss ℓ(z, f) = ‖z − f‖².

8. Achievable regret when playing against a greedy opponent in the linear setting

In Section 6, we established intrinsic limits on the learner's capabilities when the loss is linear in f. As we now argue, these limits highly depend on the perceived opponent's decision set. Being overly pessimistic, the learner may wrongly think that the opponent has the liberty to pick any z ∈ Z while, in fact, the latter is constrained in his choice and has only an indirect influence on the next move. Such a possibility is explored in Proposition 13 for a greedy opponent in the linear setting. In this example, the opponent acts greedily by maximizing his immediate payoff at each period. In this scenario, the loss incurred to the learner is y · f, just like in the linear setting, with the important distinction that y is not an arbitrary point in a decision set Z but has to be taken in argmax_{y∈P(z)} y · f. The actual underlying decision made by the opponent concerns z. Recall that the study of a greedy, as opposed to completely adversarial, opponent simply leads to a redefinition of the loss function, so that the study of the game can still be carried out in the online convex optimization framework. In the example depicted in Proposition 13, as it turns out, even if the learner were aware of the actual opponent's decision scheme, this would not enable him to develop a better strategy achieving lower regret (because both games have the same asymptotic value Θ(√T)). In this section, we emphasize that, in some circumstances, having an accurate knowledge of the opponent's decision mechanism pays off in terms of regret. Precisely, if the original loss is linear and the opponent is greedy, the value of the game may drop to O(log T). In other words, leveraging this extra information enables faster learning rates. For this reason, gathering as much information as possible on the opponent's strategy should be a primary concern. The rationale behind this dramatic change is that incorporating this additional information into the loss function experienced by the learner may result in curving it. To provide more intuition, we start with a simple example, which we extend to more general opponent decision sets in the remainder of this section.

Example 1 Consider the game defined by ℓ(z, f) = z · f − (1/2)·z · Qz, with Q a symmetric matrix, Z a bounded set, and F a polytope. The loss is purely linear with respect to f and we conclude, invoking Proposition 12, that the value of the game is Θ(√T) regardless of Q. Alternatively, suppose that the opponent is being greedy and, instead of making a decision on z, chooses Q in {Q | Qᵀ = Q, βI_n ⪰ Q ⪰ αI_n}, where β > α > 0 and ⪰ denotes the positive semidefinite ordering. The actual loss incurred to the learner turns into ℓ(Q, f) = max_z z · f − (1/2)·z · Qz = (1/2)·f · Q⁻¹f, which is (1/β)-strongly convex in f. Thus, the value of this game is O(log T), see Hazan et al. (2007a).

This last example can be extended by resorting to the concept of the convex conjugate of a function, a common tool in online optimization. Consider the more general class of games defined by ℓ(z, f) = z · f − g(z) for g(·) a real-valued convex function, Z a bounded open convex set, and F a polytope. As stated above, the value of the game is always Θ(√T). Suppose the opponent plays greedily and, instead of choosing z, plays g(·) in {g(·) | g(·) is lower semicontinuous, differentiable on Z, and ∇g(·) is β-Lipschitz continuous with β > 0}. The actual loss, ℓ(g(·), f) = max_{z∈Z} z · f − g(z), is the conjugate of g(·) and can be shown to be (1/β)-strongly convex for any choice of g(·) by standard convex analysis. Hence, once again, the achievable regret is O(log(T)). Note that, in this particular case, the learner need not even be aware of the behavior of his opponent to obtain this regret bound, because online algorithms have been developed to adaptively learn this information, see for instance Hazan et al. (2007b).

We are interested in the linear game defined by ℓ((z, w), f) = z · f − (1/2)·w · Qw, where the opponent plays z and w. The value of this game is Θ(√T) for any choice of decision sets satisfying Assumptions 1 and 2 and such that F is a polyhedron, see Proposition 12. We consider the greedy counterpart where, instead, the loss is hereafter defined as ℓ((Q, c), f) = max_{(z,w)∈P(Q,c)} z · f − (1/2)·w · Qw, with P(Q, c) = {(z, w) | Aᵀz − Qw ≤ c} for a given matrix A. The opponent now chooses Q ∈ Z_Q = {Q | Qᵀ = Q, βI_n ⪰ Q ⪰ αI_n}, with β > α > 0, and c in a compact set Z_c bounded by D_c. As before, the learner's decision set is assumed to be a polytope bounded by D_F. To enforce that the game (ℓ, Z_Q × Z_c, F) is well-posed, we make some assumptions on A and F.

Assumption 5 A has full row rank and F ⊂ {f | ∃x ≥ 0, f = Ax}.

For ease of presentation, let us write down explicitly the convex quadratic program defining the value of ℓ((Q, c), f):

    sup_{(z,w,s)}   z · f − (1/2)·w · Qw
    subject to      Aᵀz − Qw + s = c,
                    s ≥ 0,                                   (6)

along with its Wolfe dual:

    inf_x   c · x + (1/2)·x · Qx
    subject to   Ax = f,  x ≥ 0.                             (7)
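As a small numerical aside (ours, not from the paper), the closed form in Example 1 can be checked directly: for Q ≻ 0, maximizing z · f − (1/2)·z · Qz over z ∈ Rⁿ yields (1/2)·f · Q⁻¹f. The sketch below verifies this with a generic solver; numpy and scipy are assumed.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n = 4
M = rng.normal(size=(n, n))
Q = M @ M.T + 0.5 * np.eye(n)        # symmetric, Q >= 0.5 I (so alpha = 0.5)
f = rng.normal(size=n)

# Unconstrained concave maximization of z.f - 0.5 z.Qz, solved as a minimization.
res = minimize(lambda z: -(z @ f - 0.5 * z @ Q @ z), x0=np.zeros(n))
closed_form = 0.5 * f @ np.linalg.solve(Q, f)     # = 0.5 f.Q^{-1}f
print(round(-res.fun, 6), round(closed_form, 6))  # the two values agree
```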

Lemma 22 Under Assumption 5, ℓ is jointly bounded and convex in f. Additionally, there exists (z, w) ∈ P(Q, c) such that ℓ((Q, c), f) = z · f − (1/2)·w · Qw for any choice of (Q, c) in Z_Q × Z_c and f in F. Furthermore, z can be chosen such that ‖z‖ and ‖w‖ are upper bounded by quantities independent of f, Q, and c.

Proof By assumption, (7) is feasible. Since Q is constrained to be symmetric positive definite, the objective is strongly convex and there always exists an optimal solution x to (7). By weak duality, we conclude that (6) is feasible and bounded above. As a result, we may apply the Frank-Wolfe theorem from Frank and Wolfe (1956), showing that (6) admits an optimal solution, which we denote by (z, w, s). In addition, Berkelaar et al. (1996) shows that we must have w = x. Furthermore, it follows from the proof of the Frank-Wolfe theorem that

    ‖x‖ ≤ sup_{B basis of A} |||B⁻¹|||·D_F + ( D_c + β·sup_{B basis of A} |||B⁻¹|||·D_F ) / α < ∞,

independent of f, Q, and c. We denote this last upper bound by B. This last fact, combined with weak duality, enables us to show that ℓ is jointly upper bounded by β·B² + D_c·B and that ‖w‖ = ‖x‖ is upper bounded by a quantity independent of f, Q, and c. For a lower bound on ℓ, observe that (0, −Q⁻¹c, 0) is a feasible solution to (6), from which we obtain the lower bound −D_c²/(2α). By a linear programming argument, z can be taken as a basic feasible point of the polyhedron {y | Aᵀy ≤ Qw + c} and is, as a result, upper bounded by sup_{B basis of Aᵀ} |||B⁻¹|||·(D_c + β·B), independent of f, Q, and c. The convexity of ℓ in f follows from a standard argument because the feasibility set of (6) does not involve f, see Berkelaar et al. (1996).
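As a minimal computational sketch (ours), ℓ((Q, c), f) can be evaluated through the Wolfe dual (7), which is a convex quadratic program in x with constraints Ax = f and x ≥ 0; by the equality of optimal values discussed above, its optimum equals the value of (6). The instance data below are made up, and scipy's SLSQP solver is assumed.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
m, n = 2, 4                                  # A is m x n with full row rank
A = rng.normal(size=(m, n))
M = rng.normal(size=(n, n))
Q = M @ M.T + np.eye(n)                      # Q symmetric, Q >= I
c = rng.normal(size=n)
x_feas = rng.uniform(0.1, 1.0, size=n)       # choose f = A x_feas so that (7) is feasible
f = A @ x_feas

# Wolfe dual (7): minimize c.x + 0.5 x.Qx  subject to  Ax = f, x >= 0
res = minimize(
    lambda x: c @ x + 0.5 * x @ Q @ x,
    x0=x_feas,
    jac=lambda x: c + Q @ x,
    bounds=[(0, None)] * n,
    constraints=[{"type": "eq", "fun": lambda x: A @ x - f}],
    method="SLSQP",
)
print("loss value l((Q,c),f) =", round(res.fun, 6))
```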

Lemma 22 essentially shows that the game is properly defined, but also that the framework is consistent with the original linear game, because the z and w experienced by the learner are indeed bounded. We now proceed to introduce the necessary tools to study the curvature of ℓ((Q, c), ·) for a fixed opponent's move (Q, c). First note, as mentioned in the proof of Lemma 22, that a pair of optimal solutions to (7) and (6) (resp. x and (z, w, s)) must satisfy w = x, see Berkelaar et al. (1996). An important step in performing sensitivity analysis on the quadratic program (6) is to introduce the notion of optimal tripartitions. Borrowing the notations from Berkelaar et al. (1996), for a fixed f ∈ F, we define B_f = {i | x_i > 0 for an optimal solution x of (7)}, N_f = {i | s_i > 0 for an optimal solution (z, w, s) of (6)}, and T_f = {1, ···, n} − (B_f ∪ N_f). Güler and Ye (1993) shows by duality that B_f ∩ N_f = ∅. Hence, the previous decomposition defines a tripartition π_f = (B_f, N_f, T_f) of {1, ···, n}. Furthermore, Güler and Ye (1993) establishes that there always exists a maximal complementary optimal solution (z, x, s) corresponding to π_f, in the sense that:

    x_i > 0 ⟺ i ∈ B_f,        s_i > 0 ⟺ i ∈ N_f.

We will rely on the following strong property of the operator π.

Lemma 23 From Berkelaar et al. (1996) π_{f1} = π_{f2} implies π_f = π_{f1} = π_{f2} for all f ∈ [f1, f2].

The introduction of optimal tripartitions, along with the last two characterizations, comes in handy to study the effect of a change in f on the optimal value of (6).

Lemma 24 Adapted from Berkelaar et al. (1996) If f1 and f2 are such that π_{f1} = π_{f2}, then ℓ is quadratic convex and 4α/|||A|||²-strongly convex in f on [f1, f2]. Moreover, ℓ((Q, c), ·) is globally strictly convex and has subgradients bounded by quantities independent of f, Q, c.

Proof Denote by (z1, x1, s1) and (z2, x2, s2) some maximal complementary solutions to (6) for f = f1 and f = f2, respectively. Berkelaar et al. (1996) shows that π_{f1} = π_{f2} implies that (z^λ, x^λ, s^λ) = λ·(z1, x1, s1) + (1−λ)·(z2, x2, s2) is an optimal solution to (6) for f = f^λ = λ·f1 + (1−λ)·f2. Hence, plugging these two expressions into the objective of (6), we obtain:

    ℓ((Q, c), f^λ) − λ·ℓ((Q, c), f1) − (1−λ)·ℓ((Q, c), f2) = −λ·(1−λ)·( (z1 − z2)·(f1 − f2) + (x1 − x2)·Q(x1 − x2) ).

This establishes that ℓ((Q, c), ·) is quadratic on [f1, f2]. Using the fact that (z1, x1, s1) and (z2, x2, s2) are maximal complementary solutions, and with some basic algebra, Berkelaar et al. (1996) points out that the last two terms are in fact equal, i.e. (z1 − z2)·(f1 − f2) = (x1 − x2)·Q(x1 − x2). Since Ax1 = f1 and Ax2 = f2, we obtain ‖x1 − x2‖ ≥ (1/|||A|||)·‖f1 − f2‖. So we obtain:

    ℓ((Q, c), f^λ) ≤ λ·ℓ((Q, c), f1) + (1−λ)·ℓ((Q, c), f2) − (2α/|||A|||²)·λ·(1−λ)·‖f1 − f2‖².

Hence, ℓ((Q, c), ·) is 4α/|||A|||²-strongly convex on [f1, f2]. We move on to show global strict convexity. Consider any two vectors f1 and f2 ∈ F, and assume that they do not necessarily satisfy π_{f1} = π_{f2}. As before, define f^λ = λ·f1 + (1−λ)·f2. Define (z^λ, x^λ, s^λ), (z1, x1, s1), and (z2, x2, s2) as maximal complementary solutions to (6) for f = f^λ, f = f1, and f = f2, respectively. Because (z^λ, x^λ, s^λ) is a feasible solution to (6) for any choice of f, we have:

    ℓ((Q, c), f^λ) = λ·( f1 · z^λ − (1/2)·x^λ · Qx^λ ) + (1−λ)·( f2 · z^λ − (1/2)·x^λ · Qx^λ ) ≤ λ·ℓ((Q, c), f1) + (1−λ)·ℓ((Q, c), f2).

Hence, we have equality if and only if (z^λ, x^λ, s^λ) is an optimal solution to (6) for f = f1 and f = f2. By uniqueness, we get x^λ = x1 = x2. Therefore, f1 = Ax1 = Ax^λ = f^λ, a contradiction for λ ∈ (0, 1) when f1 ≠ f2. Finally, observe that, for any f ∈ F, any optimal solution z to (6) is a subgradient of ℓ((Q, c), ·) at f and can be bounded by quantities independent of f, Q, c, as proved in Lemma 22.

As a consequence of Lemma 24, and because there are only finitely many possible tripartitions, F can be partitioned into a finite number of convex pieces on which ℓ is 4α/|||A|||²-strongly convex. As a consequence, ℓ satisfies a property called "directional strong convexity", which is the inequality defined in Lemma 3 of Hazan et al. (2007a). For such a loss function, the achievable regret is always O(log T) as long as the subgradients of the loss function are bounded everywhere, see Theorem 2 of Hazan et al. (2007a) for a detailed derivation; a sketch of the standard strategy that attains this rate is given after the next paragraph.

Extensions The results above can be further extended to a time-dependent setting where the constraint matrix A becomes time-dependent, provided its operator norm remains bounded.
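To make the preceding discussion concrete, here is a minimal sketch (ours, not the paper's algorithm) of online gradient descent with step sizes ηt = 1/(m·t), which attains O(log T) regret for m-strongly convex losses with bounded gradients in the spirit of Hazan et al. (2007a); the quadratic loss and the Euclidean-ball decision set used here are illustrative assumptions. numpy is assumed.

```python
import numpy as np

rng = np.random.default_rng(6)
d, T, m = 3, 5_000, 1.0          # m: strong-convexity modulus of the losses

def project(f, radius=1.0):
    """Euclidean projection onto the ball B2(0, radius), our stand-in for F."""
    norm = np.linalg.norm(f)
    return f if norm <= radius else f * (radius / norm)

f = np.zeros(d)
total, targets = 0.0, []
for t in range(1, T + 1):
    target = rng.uniform(-1, 1, size=d)          # opponent's move
    total += 0.5 * m * np.sum((f - target) ** 2) # m-strongly convex loss
    targets.append(target)
    grad = m * (f - target)
    f = project(f - grad / (m * t))              # step size eta_t = 1/(m t)

# Best fixed action in hindsight for this quadratic loss: the projected mean.
f_star = project(np.mean(targets, axis=0))
best = sum(0.5 * m * np.sum((f_star - z) ** 2) for z in targets)
print("regret:", round(total - best, 2))         # grows like log(T)
```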

Acknowledgments

The authors would like to thank Alexander Rakhlin for his valuable input, and in particular, for bringing to our attention the possibility of having o(√T) upper bounds on the achievable regret in the linear setting.

References

J. Abernethy, P. L. Bartlett, A. Rakhlin, and A. Tewari. Optimal strategies and minimax lower bounds for online convex games. In Proceedings of the 21st Annual Conference on Learning Theory (COLT), pages 415–424, 2008.

J. Abernethy, A. Agarwal, P. L. Bartlett, and A. Rakhlin. A stochastic view of optimal regret through minimax duality. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT), 2009.

A. B. Berkelaar, B. Jansen, K. Roos, and T. Terlaky. Sensitivity analysis in (degenerate) quadratic programming. Technical Report 26, Delft University of Technology, Delft, The Netherlands, 1996.

N. Cesa-Bianchi and G. Lugosi. On prediction of individual sequences. Annals of Statistics, 27(6):1865–1895, 1999.

N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 1956.

O. Güler and Y. Ye. Convergence behavior of interior-point algorithms. Mathematical Programming, 60(1-3):215–228, 1993.

D. Haussler, J. Kivinen, and M. K. Warmuth. Sequential prediction of individual sequences under general loss functions. IEEE Transactions on Information Theory, 44(5):1906–1925, 1998.

E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007a.

E. Hazan, A. Rakhlin, and P. L. Bartlett. Adaptive online gradient descent. In Advances in Neural Information Processing Systems 21 (NIPS 2007), pages 65–72, 2007b.

A. Kalai and S. Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.

E. Ordentlich and T. M. Cover. The cost of achieving the best portfolio in hindsight. Mathematics of Operations Research, 23(4):960–982, 1998.

A. Rakhlin, K. Sridharan, and A. Tewari. Online learning: Beyond regret. In Proceedings of the 24th Annual Conference on Learning Theory (COLT), pages 559–594, 2011a.

A. Rakhlin, K. Sridharan, and A. Tewari. Online learning: Random averages, combinatorial parameters, and learnability. In Advances in Neural Information Processing Systems 23 (NIPS 2010), pages 1984–1992, 2011b.

V. Vovk. Competitive on-line linear regression. In Advances in Neural Information Processing Systems 10 (NIPS 1997), pages 364–370. MIT Press, 1998.

M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML), pages 928–936, 2003.

Appendix A. Illustration: having 0 ∈ Z may not help to prove RT = Ω(√T)

Define z1 = (−1, 1, 0, 0), z2 = (1, −1, 0, 0), z3 = (0, 0, 0, 1), z4 = (0, 0, 1, 0), take Z = conv({z1, z2, z3, z4}) (so that 0 = (z1 + z2)/2 ∈ Z), and let F = [f∗, f∗∗] with f∗ = (0, 0, 0, 0) and f∗∗ = (1, 1, −1, 1). The game (ℓ, Z, F) is not trivial because argmin_{f∈F} f · z3 = {f∗} while argmin_{f∈F} f · z4 = {f∗∗}. For any i.i.d. opponent Z1, ···, ZT with zero mean, the only possibility is to have Zt ∈ [z1, z2]. We get, for any player's strategy:

    E[rT((Zt)t=1,···,T, (ft)t=1,···,T)] = −E[ inf_{f∈F} f · ∑_{t=1}^T Zt ].

We have f∗ ∈ argmin_{f∈F} f · z for all z ∈ [z1, z2]. Hence:

    E[rT((Zt)t=1,···,T, (ft)t=1,···,T)] = −E[ f∗ · ∑_{t=1}^T Zt ],

which finally gives E[rT((Zt)t=1,···,T, (ft)t=1,···,T)] = 0.
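A quick numerical check of this example (our addition; it takes Z = conv({z1, z2, z3, z4}) as above): since the loss is linear and F is a segment, comparing the two endpoint values suffices. numpy is assumed.

```python
import numpy as np

z1, z2 = np.array([-1.0, 1, 0, 0]), np.array([1.0, -1, 0, 0])
z3, z4 = np.array([0.0, 0, 0, 1]), np.array([0.0, 0, 1, 0])
f_star, f_sstar = np.zeros(4), np.array([1.0, 1, -1, 1])

def best_endpoint(z):
    """Minimizer of f.z over F = [f*, f**]; a linear loss is minimized at an endpoint."""
    return "f*" if f_star @ z <= f_sstar @ z else "f**"

# Distinct minimizers for z3 and z4: the game is not trivial.
print(best_endpoint(z3), best_endpoint(z4))

# A zero-mean i.i.d. opponent must play on [z1, z2]; there the whole segment F
# achieves loss 0, so the benchmark term vanishes and the expected regret is 0.
rng = np.random.default_rng(7)
lam = rng.uniform(size=10_000)                       # E[Z_t] = 0 by symmetry
S = (np.outer(lam, z1) + np.outer(1 - lam, z2)).sum(axis=0)
print(f_star @ S, round(f_sstar @ S, 12))            # both evaluate to 0
```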