Journal of Machine Learning Research 0 (0) 0-0

Submitted 0/0; Published 0/0

Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints

arXiv:1111.6082v1 [cs.LG] 25 Nov 2011

Mehrdad Mahdavi Rong Jin Tianbao Yang

[email protected] [email protected] [email protected]

Department of Computer Science and Engineering Michigan State University East Lansing, MI, 48824, USA

Editor: ?

Abstract

In this paper we propose a framework for solving constrained online convex optimization problems. Our motivation stems from the observation that most algorithms proposed for online convex optimization require a projection onto the convex set K from which the decisions are made. While the projection is straightforward for simple shapes (e.g., the Euclidean ball), for arbitrary complex sets it is the main computational challenge and may be inefficient in practice. In this paper, we consider an alternative online convex optimization problem. Instead of requiring that decisions belong to K for all rounds, we only require that the constraints which define the set K be satisfied in the long run. We show that our framework can be utilized to solve a relaxed version of online learning with side constraints addressed in Mannor and Tsitsiklis (2006) and Kveton et al. (2008). By turning the problem into an online convex-concave optimization problem, we propose an efficient algorithm which achieves an Õ(√T) regret bound and an Õ(T^{3/4}) bound for the violation of constraints. Then we modify the algorithm in order to guarantee that the constraints are satisfied in the long run. This gain is achieved at the price of an Õ(T^{3/4}) regret bound. Our second algorithm is based on the Mirror Prox method (Nemirovski, 2005) for solving variational inequalities, and achieves an Õ(T^{2/3}) bound for both regret and the violation of constraints when the domain K can be described by a finite number of linear constraints. Finally, we extend the results to the setting where we only have partial access to the convex set K and propose a multipoint bandit feedback algorithm with the same bounds in expectation as our first algorithm.

Keywords: online convex optimization, convex-concave optimization, bandit feedback, variational inequality

1. Introduction

Online convex optimization has recently emerged as a primitive framework for designing efficient algorithms for a wide variety of machine learning applications (Cesa-Bianchi and Lugosi, 2006). In general, an online convex optimization problem can be formulated as a repeated game between a learner and an adversary: at each iteration t, the learner first presents a solution x_t ∈ K, where K ⊆ R^d is a convex domain; it then receives a convex function f_t(x) : K → R_+ and suffers the loss f_t(x_t) for the submitted solution x_t. The objective of online convex



optimization is to generate a sequence of solutions x_t ∈ K, t ∈ [T], that minimizes the regret R_T defined as

R_T = ∑_{t=1}^T f_t(x_t) − min_{x∈K} ∑_{t=1}^T f_t(x)   (1)

Regret measures the difference between the cumulative loss of the learner's strategy and the minimum possible loss had the sequence of loss functions been known in advance and the learner been able to choose the best fixed action in hindsight. When R_T is sublinear in the number of rounds, i.e., o(T), we call the solution Hannan consistent (Cesa-Bianchi and Lugosi, 2006), implying that the learner's average per-round loss approaches the average per-round loss of the best fixed action in hindsight. Notably, the performance bound must hold for any sequence of loss functions, in particular when the sequence is chosen adversarially. Many successful algorithms have been developed over the past decade to minimize regret in online convex optimization. The problem was initiated in the remarkable work of Zinkevich (Zinkevich, 2003), which presents an algorithm based on gradient descent with projection that guarantees a regret of Õ(√T)¹ when the set K is convex and the loss functions are Lipschitz continuous within the domain K. In (Hazan et al., 2007) and (Shalev-Shwartz and Kakade, 2008), algorithms with logarithmic regret bounds are proposed for strongly convex loss functions. In particular, the algorithm in Hazan et al. (2007) is based on the online Newton step and covers the class of exp-concave loss functions. Notably, the simple gradient algorithm also achieves an O(log T) regret bound for strongly convex loss functions with an appropriately chosen step size. (Bartlett et al., 2007) generalizes the results of previous works to a setting where the algorithm can adapt to the curvature of the loss functions without any prior information. A modern view of these algorithms casts the problem as the task of following the regularized leader (Rakhlin, 2009). In (Abernethy et al., 2009), using a game-theoretic analysis, it has been shown that both the Õ(√T) bound for Lipschitz continuous loss functions and the O(log T) bound for strongly convex loss functions are tight in the minimax sense.
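As a sanity check on definition (1), the regret of a sequence of plays can be computed numerically; the sketch below approximates the best fixed action by a grid search over a one-dimensional domain (a toy illustration with quadratic losses; the function names are ours, not part of the paper):

```python
import numpy as np

def regret(losses, xs, K_grid):
    """R_T = sum_t f_t(x_t) - min_{x in K} sum_t f_t(x); the minimum over K
    is approximated by a grid search over K_grid."""
    learner_loss = sum(f(x) for f, x in zip(losses, xs))
    best_fixed = min(sum(f(x) for f in losses) for x in K_grid)
    return learner_loss - best_fixed

# toy run: quadratic losses f_t(x) = (x - c_t)^2 on K = [-1, 1]
rng = np.random.default_rng(0)
cs = rng.uniform(-1.0, 1.0, size=50)
losses = [lambda x, c=c: (x - c) ** 2 for c in cs]
grid = np.linspace(-1.0, 1.0, 2001)
R = regret(losses, np.zeros(50), grid)  # learner always plays x_t = 0
```

Note that regret against the best *fixed* action can even be negative for a learner that adapts to each loss, which is why sublinearity rather than nonnegativity is the relevant property.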
To motivate the setting addressed in this paper, let us first examine a popular online learning algorithm for minimizing the regret R_T: the Online Gradient Descent (OGD) method (Zinkevich, 2003). At each iteration t, after receiving the convex function f_t(x), the learner computes the gradient ∇f_t(x_t) and updates the solution x_t by solving the following projection problem:

x_{t+1} = Π_K(x_t − η∇f_t(x_t)) = arg min_{x∈K} ‖x − x_t + η∇f_t(x_t)‖²   (2)

where Π_K(·) denotes the projection onto K and η > 0 is a predefined step size. While efficient algorithms are available for projection onto simple shapes, such as the ℓ2 ball, for general convex domains solving the optimization problem in (2) is an offline convex optimization problem by itself and can be computationally expensive. Recently, several efficient algorithms have been developed for projection onto specific domains, e.g. the ℓ1 ball (Duchi et al., 2008; Liu and Ye, 2009); however, when the domain K is complex, the projection remains an involved and computationally burdensome task.

1. The notation Õ(·) hides the constant factors.
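To make the role of the projection in (2) concrete, here is a minimal sketch of OGD for the easy case where K is an ℓ2 ball, so that Π_K has a closed form (the function names are ours):

```python
import numpy as np

def project_l2_ball(v, radius=1.0):
    """Closed-form Euclidean projection onto {x : ||x||_2 <= radius}."""
    n = np.linalg.norm(v)
    return v if n <= radius else (radius / n) * v

def ogd(grads, x0, eta, project):
    """The update in (2): x_{t+1} = Pi_K(x_t - eta * grad f_t(x_t))."""
    x = np.asarray(x0, dtype=float)
    played = []
    for grad in grads:
        played.append(x.copy())
        x = project(x - eta * grad(x))
    return played
```

Replacing `project` with a projection onto an arbitrary intersection of constraints would require solving a quadratic program at every round, which is precisely the cost the long term constraint formulation below avoids.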


To tackle the computational challenge arising from the projection, we consider an alternative online learning problem. Instead of requiring x_t ∈ K, we only require the constraints which define the convex domain K to be satisfied in the long run. The online learning problem then becomes to find a sequence of solutions x_t, t ∈ [T], that minimizes the regret defined in (1) under the long term constraint ∑_{t=1}^T x_t / T ∈ K. We refer to this problem as online learning with long term constraints. In other words, instead of solving the projection problem in (2) on each round, we allow the learner to make decisions which do not belong to the set K, as long as the overall sequence of decisions obeys the constraints on average by the end.

The proposed online optimization with long term constraints is closely related to the problem of regret minimization with side constraints (Mannor and Tsitsiklis, 2006), motivated by applications in wireless communication. In this setting, beyond minimizing regret, the learner has side constraints that need to be satisfied on average over all rounds. Unlike our setting, in learning with side constraints the set K is controlled by nature and can vary from trial to trial in an arbitrary way. They showed that if the convex set is affected by both the decisions and the loss functions, the minimax optimal regret is generally unattainable online; but for degenerate cases where the domain is affected only by the decisions made by the learner, vanishing regret is attainable. (Kveton et al., 2008) investigated the expert advice setting (Cesa-Bianchi and Lugosi, 2006) with side constraints. We show that regret minimization with side constraints can be solved by the proposed technique in the degenerate case. More specifically, we show that our framework with changing constraint functions resembles online learning with side constraints, and our algorithms can be directly applied to this setting.
In this paper, we describe and analyze a general framework for solving online convex optimization with long term constraints. We first show that a direct application of OGD fails to achieve a sublinear bound for the violation of constraints and an Õ(√T) regret bound simultaneously. Then, by turning the problem into an online convex-concave optimization problem, we propose an efficient algorithm for online learning with long term constraints, which is an adaptation of OGD. The proposed algorithm achieves the same Õ(√T) regret bound as the general setting and an Õ(T^{3/4}) bound for the violation of constraints. We show that by a simple trick we can turn the proposed method into an algorithm which exactly satisfies the constraints in the long run, at the price of an Õ(T^{3/4}) regret bound. When the convex domain K can be described by a finite number of linear constraints, we propose an alternative algorithm based on the Mirror Prox method (Nemirovski, 2005), which achieves an Õ(T^{2/3}) bound for both the regret and the violation of constraints. Our framework also handles the case where we do not have full access to the domain K except through a limited number of oracle evaluations. We show that we can generalize the proposed OGD based algorithm to this setting by accessing the value oracle for the domain K at only two points, which achieves the same bounds in expectation as the case with full knowledge of the domain K. Finally, it is worth mentioning that the proposed setting can be used in certain classes of online learning tasks such as online-to-batch conversion (Cesa-Bianchi et al., 2004), where it is sufficient to guarantee that the constraints are satisfied in the long run. More specifically, under the assumption that the received examples are i.i.d. samples, the solution for batch learning is the average of the solutions obtained over all the trials. As a result, if the long term constraints are satisfied, the average solution is guaranteed to belong to the domain K.


The remainder of the paper is structured as follows. In Section 3 we formulate regret minimization as an online convex-concave optimization problem and apply the OGD algorithm to solve it. Our first algorithm allows the constraints to be violated in a controlled way. It is then generalized so that the constraints are exactly satisfied in the long run. We also show that a slightly modified version of the proposed algorithm can handle online convex optimization with side constraints that vary from trial to trial. Section 4 presents our second algorithm, which is an adaptation of the Mirror Prox method. Section 5 generalizes the problem to the setting where we only have partial access to the convex domain K. Section 6 concludes the work with a list of open questions.

2. Notation and Setting

Before proceeding, we define our notation and state the assumptions made for the analysis of the algorithms. Vectors are shown by lower case bold letters, such as x ∈ R^d. Matrices are indicated by upper case letters, such as A, and their pseudoinverse is represented by A†. We use [m] as a shorthand for the set of integers {1, 2, . . . , m}. Throughout the paper, ‖·‖ and ‖·‖₁ denote the ℓ2 (Euclidean) norm and the ℓ1 norm, respectively. We use E and E_t to denote the expectation and the conditional expectation with respect to all randomness in the first t − 1 trials, respectively. To facilitate our analysis, we assume that the domain K can be written as an intersection of a finite number of convex constraints, i.e., K = {x ∈ R^d : g_i(x) ≤ 0, i ∈ [m]}, where g_i(·), i ∈ [m], are Lipschitz continuous functions. Like many other works on online convex optimization, such as (Flaxman et al., 2005), we assume that K is a bounded domain, i.e., there exist constants R > 0 and r < 1 such that K ⊆ RB and rB ⊆ K, where B denotes the unit ball centered at the origin. For ease of notation, we use B = RB.

We focus on the problem of online convex optimization, in which the goal is to achieve a low regret with respect to a fixed decision on a sequence of loss functions. The difference between the setting considered here and general online convex optimization is that, instead of requiring x_t ∈ K, or equivalently g_i(x_t) ≤ 0, i ∈ [m], for all t ∈ [T], we only require the constraints to be satisfied in the long run, namely ∑_{t=1}^T g_i(x_t) ≤ 0, i ∈ [m]. The problem then becomes to find a sequence of solutions x_t, t ∈ [T], that minimizes the regret defined in (1) under the long term constraints ∑_{t=1}^T g_i(x_t) ≤ 0, i ∈ [m]. Formally, we would like to solve the following optimization problem online:

min_{x_1,...,x_T ∈ B} ∑_{t=1}^T f_t(x_t) − min_{x∈K} ∑_{t=1}^T f_t(x)
s.t. ∑_{t=1}^T g_i(x_t) ≤ 0, i ∈ [m]   (3)

For simplicity, we will focus on the finite-horizon setting where the number of rounds T is known in advance. This assumption can be relaxed using standard techniques (see, e.g., (Cesa-Bianchi and Lugosi, 2006)).
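The long term constraints in (3) are a property of the whole sequence rather than of individual iterates; a small helper illustrating the distinction (the names are ours):

```python
def long_term_feasible(xs, constraints, tol=0.0):
    """Long term constraints of (3): sum_t g_i(x_t) <= 0 for every i.
    Individual iterates may violate g_i(x_t) <= 0 as long as the sums do not."""
    return all(sum(g(x) for x in xs) <= tol for g in constraints)

# iterates alternating between 0.9 (infeasible for K = {x <= 0.5}) and 0.0
xs = [0.9, 0.0] * 10
g = lambda x: x - 0.5
pointwise_ok = all(g(x) <= 0 for x in xs)   # False: half the iterates violate
long_term_ok = long_term_feasible(xs, [g])  # True: the sum is -1.0 <= 0
```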


Note that in (3), (i) the solutions come from the ball B ⊇ K instead of K, and (ii) the constraint functions are fixed and given in advance. To cast our problem as online convex optimization with side constraints, as discussed before, we need to handle constraints that vary from trial to trial. In this setting, at each round t, after the learner makes a decision x_t, in addition to the loss function f_t(x) it receives a constraint (cost) function g_t(x). Note that, in contrast to the previous setting, g_t(x) is given only after the learner reveals his solution. Similar to (3), the goal of the learner is to minimize the regret and simultaneously satisfy the constraints in the long term. Formally, the online optimization problem becomes

min_{x_1,...,x_T ∈ K} ∑_{t=1}^T f_t(x_t) − min_{x∈K} ∑_{t=1}^T f_t(x)   (4)
s.t. ∑_{t=1}^T g_t(x_t) ≤ 0   (5)

We tackle this problem in Section 3.4. Like most online learning algorithms, we assume that both the loss functions and the constraint functions are Lipschitz continuous, i.e., there exist constants L_f and L_g such that

|f_t(x) − f_t(x′)| ≤ L_f ‖x − x′‖,   |g_i(x) − g_i(x′)| ≤ L_g ‖x − x′‖   for any x and x′, i ∈ [m]

Finally, for simplicity of analysis, we use

F = max_{t∈[T]} max_{x,x′∈K} f_t(x) − f_t(x′) ≤ 2L_f R,   D = max_{i∈[m]} max_{x∈B} g_i(x) ≤ L_g R,   and G = max{L_f, L_g}.

3. Online Gradient based Convex-Concave Optimization for Long Term Constraints

Our general strategy is to turn online convex optimization with long term constraints into a convex-concave optimization problem. Instead of generating a sequence of solutions that satisfies the long term constraints, we first consider an online optimization strategy that allows for the violation of the long term constraints. We then modify the online optimization strategy to obtain a sequence of solutions that obeys the long term constraints. Although online convex optimization with long term constraints is clearly easier than the standard online convex optimization problem, it is not difficult to show that the optimal regret bound for online optimization with long term constraints is on the order of Õ(√T), no better than for the standard online convex optimization problem.

Before discussing the proposed algorithm, let us investigate why a simple penalty based OGD method may fail in solving the online learning problem in (3). In order to address the constraints, we define f̂_t(·) as

f̂_t(x) = f_t(x) + (δ/2) ∑_{i=1}^m [g_i(x)]_+^2   (6)

where [z]_+ = max(0, z) and δ > 0 is a positive constant used to penalize the violation of constraints. The auxiliary function f̂_t(x) consists of the loss function f_t(x) and


the penalty term [g_i(x)]_+^2 defined for each constraint. The algorithm is analyzed using the following lemma from (Zinkevich, 2003).

Lemma 1 Let x_1, x_2, . . . , x_T be the sequence of solutions obtained by applying OGD on the sequence of bounded convex functions f_1, f_2, . . . , f_T. Then, for any solution x_* ∈ K we have

∑_{t=1}^T f_t(x_t) − ∑_{t=1}^T f_t(x_*) ≤ R²/(2η) + (η/2) ∑_{t=1}^T ‖∇f_t(x_t)‖²

We apply OGD to the functions f̂_t(x), t ∈ [T], i.e., instead of updating the solution based on the gradient of f_t(x), we update the solution by the gradient of f̂_t(x). Using Lemma 1, expanding the functions f̂_t(x) based on (6), and using the fact that ∑_{i=1}^m [g_i(x_*)]_+^2 = 0, we have

∑_{t=1}^T f_t(x_t) − ∑_{t=1}^T f_t(x_*) + (δ/2) ∑_{t=1}^T ∑_{i=1}^m [g_i(x_t)]_+^2 ≤ R²/(2η) + (η/2) ∑_{t=1}^T ‖∇f̂_t(x_t)‖²   (7)

From the definition of f̂_t(x), the norm of the gradient ∇f̂_t(x_t) is bounded as follows:

‖∇f̂_t(x)‖² = ‖∇f_t(x) + δ ∑_{i=1}^m [g_i(x)]_+ ∇g_i(x)‖² ≤ 2G²(1 + mδ²D²)   (8)

where the inequality holds because (a_1 + a_2)² ≤ 2(a_1² + a_2²). By substituting (8) into (7) we have:

∑_{t=1}^T f_t(x_t) − ∑_{t=1}^T f_t(x_*) + (δ/2) ∑_{t=1}^T ∑_{i=1}^m [g_i(x_t)]_+^2 ≤ R²/(2η) + ηG²(1 + mδ²D²)T   (9)

Since [·]_+^2 is a convex function, from Jensen's inequality, and following the fact that ∑_{t=1}^T f_t(x_t) − f_t(x_*) ≥ −FT, we have:

(δ/(2T)) ∑_{i=1}^m [∑_{t=1}^T g_i(x_t)]_+^2 ≤ (δ/2) ∑_{t=1}^T ∑_{i=1}^m [g_i(x_t)]_+^2 ≤ R²/(2η) + ηG²(1 + mδ²D²)T + FT   (10)

By minimizing the r.h.s of (9) with respect to η, we get the regret bound as: T X t=1

ft (x) −

T X t=1

p √ ˜ T) ft (x∗ ) ≤ RG 2(1 + mδ2 D 2 )T = O(δ

and the bound for the violation of constraints as s  T X R2 2T 2 2 2 ˜ 1/4 δ1/2 + T δ−1/2 ) gi (xt ) ≤ + ηG (1 + mδ D )T + F T = O(T 2η δ t=1

(11)

(12)

√Examining the bounds obtained in (11) and (12), it turns out that in order to recover ˜ ) bound for the ˜ O( T ) regret bound, we need to set δ to be a constant, leading to O(T violation of constraints in the long run, which is not satisfactory at all. In the next subsection we show that by turning √ the problem into an online convex concave formulation, we are ˜ able to recover the O( T ) regret bound and simultaneously achieve sublinear bound on the long term violation of constraints. 6
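To see the penalty method of (6) in action, the following sketch runs OGD on f̂_t for a single constraint, projecting only onto the enclosing ball B; as the analysis predicts, a fixed δ leaves a persistent per-round violation, hence a cumulative violation that grows linearly in T (a toy example; all names are ours):

```python
import numpy as np

def penalized_ogd(grads, g, grad_g, x0, eta, delta, R):
    """OGD on the penalized losses (6) with a single constraint:
    f_hat_t(x) = f_t(x) + (delta/2) * [g(x)]_+^2, projecting only onto R*B."""
    x = np.asarray(x0, dtype=float)
    played = []
    for grad_f in grads:
        played.append(x.copy())
        slack = max(0.0, g(x))                       # [g(x)]_+
        x = x - eta * (grad_f(x) + delta * slack * grad_g(x))
        n = np.linalg.norm(x)
        if n > R:                                    # cheap ball projection
            x *= R / n
    return played

# toy run: f_t(x) = -x pushes the iterate across the boundary of K = {x <= 0.5}
played = penalized_ogd([lambda x: np.array([-1.0])] * 200,
                       lambda x: float(x[0] - 0.5), lambda x: np.array([1.0]),
                       np.array([0.0]), eta=0.05, delta=10.0, R=1.0)
```

Here the iterate settles where the penalty gradient balances the loss gradient (near x = 0.6 for these parameters, outside K = {x ≤ 0.5}), so each round contributes a constant violation, consistent with the Õ(T) conclusion above.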


3.1 An efficient algorithm with Õ(√T) regret bound and Õ(T^{3/4}) bound on the violation of constraints

The intuition behind our approach stems from the observation that the constrained optimization problem min_{x∈K} ∑_{t=1}^T f_t(x) is equivalent to the following convex-concave optimization problem:

min_{x∈B} max_{λ∈R^m_+} ∑_{t=1}^T f_t(x) + ∑_{i=1}^m λ_i g_i(x)   (13)

where λ = (λ_1, . . . , λ_m)^⊤ is the vector of Lagrangian multipliers associated with the constraints g_i, i = 1, . . . , m, and belongs to the nonnegative orthant R^m_+. To solve the online convex-concave optimization problem, we extend the gradient based approach for variational inequalities (Nemirovski) to (13). Define

L_t(x, λ) = f_t(x) + ∑_{i=1}^m ( λ_i g_i(x) − (δη/2) λ_i² )   (14)

where δ > 0 is a constant whose value will be decided by the analysis. Note that in (14) we introduce a regularizer δηλ_i²/2 to prevent λ_i from becoming too large. This is because, when λ_i is large, we may encounter a large gradient for x, since ∇_x L_t(x, λ) ∝ ∑_{i=1}^m λ_i ∇g_i(x), leading to unstable solutions and a poor regret bound. Algorithm 1 shows the detailed steps of the proposed algorithm. Unlike standard online learning, which only updates x, Algorithm 1 updates both x and λ. Note that in Algorithm 1 the projection is onto the ball B ⊇ K instead of the set K itself, which can be efficiently implemented in O(dm) by using appropriate data structures. In order to bound the regret and the violation of constraints for Algorithm 1, we first state the following lemma, analogous to Lemma 1 for online convex optimization.

Lemma 2 Let L_t(·, ·) be the function defined in (14), which is convex in its first argument and concave in its second argument. Then for any (x, λ) ∈ B × R^m_+ we have

L_t(x_t, λ) − L_t(x, λ_t) ≤ (1/(2η)) ( ‖x − x_t‖² + ‖λ − λ_t‖² − ‖x − x_{t+1}‖² − ‖λ − λ_{t+1}‖² + η²‖∇_x L_t‖² + η²‖∇_λ L_t‖² )

Proof By convexity of L_t(·, λ), we have

L_t(x_t, λ_t) − L_t(x, λ_t) ≤ (x_t − x)^⊤ ∇_x L_t(x_t, λ_t)   (15)

and by concavity of L_t(x_t, ·),

L_t(x_t, λ) − L_t(x_t, λ_t) ≤ (λ − λ_t)^⊤ ∇_λ L_t(x_t, λ_t)   (16)

Combining the inequalities (15) and (16) results in

L_t(x_t, λ) − L_t(x, λ_t) ≤ (x_t − x)^⊤ ∇_x L_t(x_t, λ_t) + (λ − λ_t)^⊤ ∇_λ L_t(x_t, λ_t)   (17)

Mahdavi, Jin and Yang

Algorithm 1 An OGD based Convex-Concave Optimization Method with Long Term Constraints
1: Input: constraints g_i(x) ≤ 0, i ∈ [m], step size η, parameter δ > 0
2: Initialization: x_1 = 0 and λ_1 = 0
3: for t = 1, 2, . . . , T do
4:   Submit solution x_t
5:   Receive the convex function f_t(x) and experience loss f_t(x_t)
6:   Compute ∇_x L_t(x_t, λ_t) = ∇f_t(x_t) + ∑_{i=1}^m λ_t^i ∇g_i(x_t) and ∇_{λ_i} L_t(x_t, λ_t) = g_i(x_t) − ηδλ_t^i
7:   Update x_t and λ_t by
       x_{t+1} = Π_B(x_t − η∇_x L_t(x_t, λ_t))
       λ_{t+1} = Π_{[0,+∞)^m}(λ_t + η∇_λ L_t(x_t, λ_t))
8: end for
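The updates of Algorithm 1 are straightforward to implement; the following is a minimal sketch (the naming is ours; the projection onto B is the cheap ball projection discussed above):

```python
import numpy as np

def ogd_long_term(loss_grads, gs, g_grads, d, eta, delta, R):
    """A sketch of Algorithm 1: primal-dual OGD on L_t(x, lam) from (14).
    loss_grads[t](x) is grad f_t(x); gs and g_grads hold g_i and grad g_i."""
    x, lam = np.zeros(d), np.zeros(len(gs))
    played = []
    for grad_f in loss_grads:
        played.append(x.copy())
        # step 6: gradients of the Lagrangian (14)
        grad_x = grad_f(x) + sum(l * dg(x) for l, dg in zip(lam, g_grads))
        grad_lam = np.array([g(x) for g in gs]) - eta * delta * lam
        # step 7: project x onto the ball B = R*B, lam onto the nonnegative orthant
        x = x - eta * grad_x
        n = np.linalg.norm(x)
        if n > R:
            x *= R / n
        lam = np.maximum(0.0, lam + eta * grad_lam)
    return played
```

The dual variable λ only grows while a constraint is violated, and the regularizer −ηδλ² pulls it back down once the violation stops, which is the mechanism behind the bounds of Theorem 4.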

Using the update rule for x_{t+1} in terms of x_t and expanding, we get

‖x − x_{t+1}‖² ≤ ‖x − x_t‖² − 2η(x_t − x)^⊤ ∇_x L_t(x_t, λ_t) + η²‖∇_x L_t(x_t, λ_t)‖²   (18)

where the inequality follows from the nonexpansiveness of the projection operation. Expanding ‖λ − λ_{t+1}‖² in terms of λ_t similarly and plugging both bounds back into (17) establishes the desired inequality.

Proposition 3 Let x_t and λ_t, t ∈ [T], be the sequence of solutions obtained by Algorithm 1. Then for any x ∈ B and λ ∈ R^m_+, we have

∑_{t=1}^T L_t(x_t, λ) − L_t(x, λ_t) ≤ (R² + ‖λ‖²)/(2η) + (ηT/2)((m + 1)G² + 2m²D²) + (η/2)((m + 1)G² + 2mδ²η²) ∑_{t=1}^T ‖λ_t‖²

Proof Using the inequality (a_1 + a_2 + . . . + a_n)² ≤ n(a_1² + a_2² + . . . + a_n²), we have ‖∇_x L_t‖² ≤ (m + 1)G²(1 + ‖λ_t‖²) and ‖∇_λ L_t‖² ≤ 2m(mD² + δ²η²‖λ_t‖²). Using Lemma 2, summing the inequalities over all iterations, and using the fact that ‖x‖ ≤ R, we complete the proof.

The following theorem shows the regret bound and the bound on the violation of the constraints in the long run for Algorithm 1.

Theorem 4 Define a = R √((m + 1)G² + 2m²D²). Set η = R²/(a√T). Assume T is large enough such that 2√2 G(m + 1)η ≤ 1, and choose δ such that δ ≥ (m + 1)G² + 2mδ²η². Let x_t, t ∈ [T], be the sequence of solutions obtained by Algorithm 1. Then for x_* = arg min_{x∈K} ∑_{t=1}^T f_t(x) we have

∑_{t=1}^T f_t(x_t) − f_t(x_*) ≤ a√T,   ∑_{t=1}^T g_i(x_t) ≤ √( 2(FT + a√T) √T (δR²/a + ma/R²) ) = Õ(T^{3/4})

Proof Using (14), we expand the bound in Proposition 3 as

∑_{t=1}^T [f_t(x_t) − f_t(x)] + ∑_{i=1}^m { λ_i ∑_{t=1}^T g_i(x_t) − (δηT/2) λ_i² } − ∑_{t=1}^T ∑_{i=1}^m λ_t^i g_i(x)
≤ −(δη/2) ∑_{t=1}^T ‖λ_t‖² + (R² + ‖λ‖²)/(2η) + (ηT/2)((m + 1)G² + 2m²D²) + (η/2)((m + 1)G² + 2mδ²η²) ∑_{t=1}^T ‖λ_t‖²

Since δ ≥ (m + 1)G² + 2mδ²η², we can drop the ‖λ_t‖² terms and obtain

∑_{t=1}^T [f_t(x_t) − f_t(x)] + ∑_{i=1}^m { λ_i ∑_{t=1}^T g_i(x_t) − (δηT/2 + m/(2η)) λ_i² } − ∑_{t=1}^T ∑_{i=1}^m λ_t^i g_i(x)
≤ R²/(2η) + (ηT/2)((m + 1)G² + 2m²D²)

By maximizing over λ in the range [0, +∞)^m, we have

∑_{t=1}^T [f_t(x_t) − f_t(x)] + ∑_{i=1}^m [∑_{t=1}^T g_i(x_t)]_+² / (2(δηT + m/η)) − ∑_{t=1}^T ∑_{i=1}^m λ_t^i g_i(x)
≤ R²/(2η) + (ηT/2)((m + 1)G² + 2m²D²)

Since x_* ∈ K, we have g_i(x_*) ≤ 0, i ∈ [m], and λ_t^i ≥ 0; setting x = x_*, the resulting inequality becomes

∑_{t=1}^T [f_t(x_t) − f_t(x_*)] + ∑_{i=1}^m [∑_{t=1}^T g_i(x_t)]_+² / (2(δηT + m/η)) ≤ R²/(2η) + (ηT/2)((m + 1)G² + 2m²D²)

The regret bound follows from the expression for η. The bound for ∑_{t=1}^T g_i(x_t) follows from the fact that ∑_{t=1}^T f_t(x_t) − f_t(x_*) ≥ −FT.


Remark 5 Note that the constraint for δ mentioned in Theorem 4 is equivalent to

( 1/(m + 1) − √((m + 1)^{−2} − 8G²η²) ) / (4η²) ≤ δ ≤ ( 1/(m + 1) + √((m + 1)^{−2} − 8G²η²) ) / (4η²)   (19)

When T is large enough (i.e., η is small enough), we can simply set δ = 2(m + 1)G², which obeys the constraint in (19). By investigating Lemma 2, it turns out that the boundedness of the gradients is essential for obtaining the bounds for Algorithm 1. Although in each iteration of Algorithm 1, λ_t is only projected onto R^m_+, since K is a compact set and the functions f_t and g_i, i ∈ [m], are convex, the boundedness of the functions implies that the gradients are bounded (Bertsekas et al., 2003, Proposition 4.2.3).

3.2 An efficient algorithm with Õ(T^{3/4}) regret bound and without violation of constraints

In this section we generalize Algorithm 1 such that the constraints are satisfied in the long run. To create a sequence of solutions x_t, t ∈ [T], that satisfies the long term constraints ∑_{t=1}^T g_i(x_t) ≤ 0, i ∈ [m], we change Algorithm 1 by modifying the definition of L_t as

L_t(x, λ) = f_t(x) + ∑_{i=1}^m λ_i (g_i(x) + γ)   (20)

where γ > 0 will be decided later. We make the following assumption about the constraints g_i, i ∈ [m]:

Assumption 1 For any x ∈ B define A(x) = (∇g_1(x), . . . , ∇g_m(x)). We assume that

min_{x∈B} σ_min(A(x)) ≥ σ > 0

where σ_min(A) is the minimum singular value of A, which is equal to √(λ_min(A^⊤A)), where λ_min(B) denotes the minimum eigenvalue of a matrix B.
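Assumption 1 can be verified numerically at any given point by forming the matrix of constraint gradients and computing its smallest singular value; a small sketch (the helper name is ours):

```python
import numpy as np

def min_singular_value(constraint_grads, x):
    """sigma_min(A(x)) for A(x) = (grad g_1(x), ..., grad g_m(x)),
    i.e. the square root of the smallest eigenvalue of A^T A, via the SVD."""
    A = np.column_stack([dg(x) for dg in constraint_grads])
    return np.linalg.svd(A, compute_uv=False).min()

# example: g_1(x) = x_0 - 1 and g_2(x) = x_1 - 1 in R^2, so A(x) = I
grads = [lambda x: np.array([1.0, 0.0]), lambda x: np.array([0.0, 1.0])]
sigma = min_singular_value(grads, np.zeros(2))
```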

We then have the following bound.

Theorem 6 Set η and δ as in Theorem 4. Let x_t, t ∈ [T], be the sequence of solutions obtained by Algorithm 1 with the functions defined in (20), with γ = bT^{−1/4} and b = 2√(F(R²δ/a + ma/R²)). For sufficiently large T, i.e., FT ≥ a√T + (b√(2m)/σ) GT^{3/4}, and under Assumption 1, the solutions x_t, t ∈ [T], satisfy the global constraints ∑_{t=1}^T g_i(x_t) ≤ 0, i ∈ [m], and the regret R_T is bounded by

R_T = ∑_{t=1}^T f_t(x_t) − f_t(x_*) ≤ a√T + (b√(2m)/σ) GT^{3/4}


Proof Similar to the proof of Theorem 4, we have

∑_{t=1}^T f_t(x_t) + ∑_{i=1}^m [∑_{t=1}^T g_i(x_t) + γT]_+² / (2(δηT + m/η))
≤ R²/(2η) + (ηT/2)((m + 1)G² + 2m²D²) + min_{x∈B} { ∑_{t=1}^T f_t(x) + ∑_{i=1}^m (∑_{t=1}^T λ_t^i)(g_i(x) + γ) }

We consider the following minimax optimization problem:

h(γ) = min_{x∈B} max_{λ∈R^m_+} ∑_{t=1}^T f_t(x) + ∑_{i=1}^m λ_i (g_i(x) + γ)   (21)

We denote by x_γ and λ_γ the optimal solutions to (21). We have

h(γ) = min_{x∈B} max_{λ∈R^m_+} ∑_{t=1}^T f_t(x) + ∑_{i=1}^m λ_i (g_i(x) + γ) = max_{λ∈R^m_+} ∑_{t=1}^T f_t(x_γ) + ∑_{i=1}^m λ_i (g_i(x_γ) + γ)
≥ ∑_{t=1}^T f_t(x_γ) + ∑_{i=1}^m (∑_{t=1}^T λ_t^i)(g_i(x_γ) + γ)   (22)

We then upper bound h(γ) by f(x_*), i.e.,

h(γ) = min_{x∈B} max_{λ∈R^m_+} ∑_{t=1}^T f_t(x) + ∑_{i=1}^m λ_i (g_i(x) + γ) = min_{x∈B} ∑_{t=1}^T f_t(x) + ∑_{i=1}^m λ_γ^i (g_i(x) + γ) ≤ ∑_{t=1}^T f_t(x_*) + γ ∑_{i=1}^m λ_γ^i   (23)

where λ_γ^i, i ∈ [m], are the optimal solutions to the minimax problem in (21), and the last inequality uses g_i(x_*) ≤ 0.

We now bound ∑_{i=1}^m λ_γ^i. Since x_γ is the minimizer of (21), we have

−∑_{t=1}^T ∇f_t(x_γ) = ∑_{i=1}^m λ_γ^i ∇g_i(x_γ)   (24)

By setting A = (∇g_1(x_γ), . . . , ∇g_m(x_γ)) and v = −∑_{t=1}^T ∇f_t(x_γ), we can rewrite (24) as λ_γ = A†v. According to Assumption 1 we have:

∑_{i=1}^m λ_γ^i ≤ √m ‖λ_γ‖ = √m ‖A†v‖ ≤ √m ‖v‖ / σ_min(A) ≤ (√(2m)/σ) GT   (25)

where we used the facts ‖∇f_t(x_γ)‖ ≤ G and ‖·‖₁ ≤ √m ‖·‖.

Putting (22), (23), and (25) together, we have

∑_{t=1}^T f_t(x_γ) + ∑_{i=1}^m (∑_{t=1}^T λ_t^i)(g_i(x_γ) + γ) ≤ ∑_{t=1}^T f_t(x_*) + γ (√(2m)/σ) GT

and therefore

∑_{t=1}^T [f_t(x_t) − f_t(x_*)] + ∑_{i=1}^m [∑_{t=1}^T g_i(x_t) + γT]_+² / (2(δηT + m/η)) ≤ R²/(2η) + (ηT/2)((m + 1)G² + 2m²D²) + γ (√(2m)/σ) GT

Using γ = bT^{−1/4}, and when T is sufficiently large, i.e., FT ≥ a√T + (b√(2m)/σ) GT^{3/4}, we have

∑_{t=1}^T g_i(x_t) + bT^{3/4} ≤ 2T^{3/4} √(F(R²δ/a + ma/R²))

Setting b = 2√(F(R²δa^{−1} + maR^{−2})), we get ∑_{t=1}^T g_i(x_t) ≤ 0, i ∈ [m].

3.3 Approachability Based Strategy

In this subsection we show how Blackwell's celebrated approachability theorem (Cesa-Bianchi and Lugosi, 2006, Section 7.7) can be used to solve online convex optimization with long term constraints. We first cast online convex optimization with long term constraints as a multi-round vector-valued game between the learner and an adversary as follows. Let B ⊆ R^d, a closed convex set, be the set of moves of the learner, and let F, a set of convex functions from R^d to R, be the set of moves of the adversary. Denote the one shot vector valued loss by m_t = (f_t(x_t), g_1(x_t), . . . , g_m(x_t)) and its average by m̂_t = (1/t) ∑_{s=1}^t m_s. At each round t, the player makes a move x_t ∈ B, the adversary responds with a convex function f_t ∈ F, and the learner then receives the loss vector m_t.

Blackwell (Blackwell, 1956) considered a finite repeated game with vector valued payoffs, where two players play in discrete time, and introduced the notion of approachability, which is a generalization of the minimax theorem for one-shot games with real payoffs. Namely, a closed subset S of R^d is approachable for the learner if it has a strategy which guarantees that the average loss vector asymptotically reaches S, regardless of the strategy employed by the adversary. The beauty of approachability is that, while we may not be able to land in S in a one shot game, we can approach the set on average if we may play the game indefinitely. Blackwell also introduced a sufficient condition on the target set and described an explicit approachability strategy in this case. Roughly speaking, it says that for any point x outside the set, the learner can force the expected one-stage outcome to lie on the other side of the tangent hyperplane to the set at the projection of x onto the set. This strategy relies on convex projections and requires solving a linear program at each round. He also proved that this condition is necessary in the convex case.


The following characterization of the approachability of closed convex sets follows from (Blackwell, 1956, Theorem 3); using this theorem, we will see that the learner can approach the set S.

Theorem 7 Consider a repeated vector valued game between the learner and an adversary who make decisions from the convex sets B and F, respectively. Let m : ∆(B) × ∆(F) → R^d denote the vector valued loss function. Then, a closed convex set S ⊆ R^d is approachable if and only if

∀β ∈ ∆(F), ∃α ∈ ∆(B) : m(α, β) ∈ S

Let x_* = arg min_{x∈K} ∑_{t=1}^T f_t(x) with average optimal value f_* = (1/T) ∑_{t=1}^T f_t(x_*). We define the target set that the learner tries to approach in the mentioned game as

S = {(x, y) ∈ R × R^m : x ≤ f_*; y ≤ 0}

Obviously, S is a convex set. So we have described our setting in terms of an approachability game. As is clear from the definition of the target set S, approachability of S implies that any strategy which approaches S achieves vanishing per-round regret and satisfies the constraints in the long run.

In the original approachability theorem it is assumed that the set of available actions for each player is finite, but in our setting the actions of each player come from a compact convex set. A similar setting is addressed in (Even-Dar et al., 2009). We denote the sets of probability distributions over B and F by ∆(B) and ∆(F), respectively. Let m : ∆(B) × ∆(F) → R^{m+1} denote the expected vector valued loss

m(α, β) = ∫_{x∈B} ∫_{f∈F} α_x β_f (f(x), g_1(x), . . . , g_m(x))

We claim that the set S is approachable. Following Theorem 7, we need to prove that for every β ∈ ∆(F) there exists an α ∈ ∆(B) such that m(α, β) ∈ S. Let α_* be the solution to the following optimization problem:

α_* = arg min_{x∈B} ∫_{f∈F} β_f f(x)
s.t. g_i(x) ≤ 0, i = 1, . . . , m   (26)

Since the minimization is taken over the convex domain K, it follows that m(α_*, β) ∈ S. Thus, by formulating the online convex optimization problem with long term constraints as a repeated game with vector valued losses, we have shown that the target set S, which captures both vanishing regret and long term satisfaction of the constraints, is approachable. However, any strategy which tries to approach the target set S needs to solve the optimization problem in (26), which is computationally inefficient due to the optimization over the convex domain K.

Mahdavi, Jin and Yang

3.4 Extension to Online Learning with Side Constraints

In (Mannor and Tsitsiklis, 2006), online learning with path constraints is investigated in a two player game setting. In (Kveton et al., 2008), that setting is extended to online convex optimization with constraints and an algorithm is proposed for learning from expert advice with side constraints. Roughly speaking, the proposed algorithm divides the time horizon into fixed size epochs and tries to minimize the number of switches between experts in order to guarantee a vanishing regret bound and a vanishing long term violation of the constraints. Here, we show that the proposed framework can solve online convex optimization with side constraints in the degenerate case where the domain is affected only by the decisions made by the learner. We consider the following setting: at every round t, the learner makes a decision x_t ∈ K and receives the loss function f_t and the cost function g_t. The goal of the learner is to achieve both vanishing regret and vanishing violation of the constraints in the long run. For simplicity of analysis, we keep the projection onto the convex set K and only consider the handling of the cost constraints. We define the convex-concave function L_t as

L_t(x, λ) = f_t(x) + λ g_t(x) − (ηδ/2) λ².   (27)

It is straightforward to extend the results obtained in Theorem 4 to this setting. As in (Kveton et al., 2008), we consider two cases for the optimal decision x* ∈ K to which our decisions are compared: the case where the optimal solution exactly satisfies the constraints for all rounds, and the case where the violation of the constraints by the optimal solution is bounded, i.e., Σ_{t=1}^T g_t(x*) ≤ V. For the first case we obtain the same bounds as in Theorem 4, since Σ_{t=1}^T g_t(x*) = 0. For the second case we have the following corollary.

Corollary 8 Define α = R²/2 + DV/δ and β = G² + D², and set η = √(α/(βT)). Let x_t, t ∈ [T], be the sequence of solutions obtained by performing Algorithm 1 on the functions (27) with G-Lipschitz side constraints g_t, t ∈ [T]. Then for any solution x* ∈ K whose violation of the long term constraints is bounded as Σ_{t=1}^T g_t(x*) ≤ V, we have

Σ_{t=1}^T f_t(x_t) − f_t(x*) ≤ 2√(αβT) = Õ(√T),

Σ_{t=1}^T g_t(x_t) ≤ √( 2( 2√(αβT) + FT )( δ√(αT/β) + √(βT/α) ) ) = Õ(T^{3/4}).
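To make the primal-dual update behind (27) concrete, the following Python sketch runs online gradient descent on x and gradient ascent on λ for the function L_t. The toy loss f_t(x) = ‖x − c_t‖², the single linear side constraint, and all parameter values are our own illustrative assumptions rather than data from the paper, and K is taken to be the unit Euclidean ball so that the projection is available in closed form.

```python
import numpy as np

def project_ball(x, radius=1.0):
    # Euclidean projection onto a centered ball (closed form)
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

rng = np.random.default_rng(0)
a, b = np.array([1.0, 1.0]) / np.sqrt(2.0), 0.2   # side constraint g_t(x) = a.x - b
T = 1000
eta = 1.0 / np.sqrt(T)                            # step size (our choice)
delta = 1.0                                       # regularization parameter

x = np.zeros(2)
lam = 0.0
total_violation = 0.0
for t in range(T):
    c_t = rng.normal(0.5, 0.1, size=2)            # adversary picks f_t(x) = ||x - c_t||^2
    grad_f = 2.0 * (x - c_t)
    g_t = float(a @ x - b)
    total_violation += g_t
    # descent on x and ascent on lam for
    # L_t(x, lam) = f_t(x) + lam * g_t(x) - (eta * delta / 2) * lam ** 2
    x = project_ball(x - eta * (grad_f + lam * a))
    lam = max(0.0, lam + eta * (g_t - eta * delta * lam))

print(total_violation / T)   # long term average violation stays small
```

The dual variable λ grows whenever the constraint is violated and in turn pushes x back toward feasibility, while the −(ηδ/2)λ² term keeps λ bounded; this is exactly the mechanism that trades a controlled amount of violation for avoiding a projection onto K.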

Trading Regret for Efficiency

4. Mirror Prox Based Approach

The bound on the violation of the constraints for Algorithm 1 is unsatisfying, since it is significantly worse than Õ(√T). In this section, we pursue a different approach, based on the Mirror Prox method of (Nemirovski, 2005), to improve the regret bound and the long term violation of the constraints. The basic idea is that solving (13) can be reduced to the problem of approximating a saddle point (x, λ) ∈ B × [0, ∞)^m by solving the associated variational inequality. We first define an auxiliary function F(x, λ) as

F(x, λ) = Σ_{i=1}^m ( λ_i g_i(x) − (δη/2) λ_i² ).

In order to successfully apply the Mirror Prox method, we use the fact that any convex domain can be written as an intersection of linear constraints, and assume

Assumption 2 The functions g_i(x), i ∈ [m], are linear, i.e., K = {x ∈ R^d : g_i(x) = x⊤a_i − b_i ≤ 0, i ∈ [m]}, where each a_i ∈ R^d is a normalized vector with ‖a_i‖ = 1 and b_i ∈ R.

The following proposition shows that under Assumption 2 the function F(x, λ) has Lipschitz continuous gradients, the basis for the application of the Mirror Prox method.

Proposition 9 Under Assumption 2, F(x, λ) has Lipschitz continuous gradients, i.e.,

‖∇_x F(x, λ) − ∇_x F(x′, λ′)‖² + ‖∇_λ F(x, λ) − ∇_λ F(x′, λ′)‖² ≤ 2(m + δ²η²)( ‖x − x′‖² + ‖λ − λ′‖² ).

Proof Let A = (a_1, . . . , a_m). Then

‖∇_x F(x, λ) − ∇_x F(x′, λ′)‖² + ‖∇_λ F(x, λ) − ∇_λ F(x′, λ′)‖²
= ‖ Σ_{i=1}^m (λ_i − λ_i′) a_i ‖² + Σ_{i=1}^m ( a_i⊤(x − x′) + δη(λ_i′ − λ_i) )²
≤ ‖A(λ − λ′)‖² + 2‖A⊤(x − x′)‖² + 2δ²η² ‖λ − λ′‖²
≤ 2σ²_max(A) ‖x − x′‖² + ( σ²_max(A) + 2δ²η² ) ‖λ − λ′‖².

Since σ_max(A) = √(λ_max(AA⊤)) ≤ √(Tr(AA⊤)) = √m, we have σ²_max(A) ≤ m, leading to the result in the proposition.

Algorithm 2 shows the detailed steps of the Mirror Prox based algorithm for online convex optimization with long term constraints defined in (13). Compared to Algorithm 1, there are two key features of Algorithm 2. First, it introduces auxiliary variables z_t and µ_t besides the variables x_t and λ_t. At each iteration t, it first computes the solutions x_t and λ_t based on the auxiliary variables z_t and µ_t; it then updates the auxiliary variables based on the gradients computed from x_t and λ_t. Second, two different functions are used for


Algorithm 2 Prox Method with Long Term Constraints
1: Input: constraints g_i(x) ≤ 0, i ∈ [m], step size η, parameter δ
2: Initialization: z_1 = 0 and µ_1 = 0
3: for t = 1, 2, . . . , T do
4:   Compute the solutions x_t and λ_t as
       x_t = Π_B( z_t − η∇_x F(z_t, µ_t) ),  λ_t = Π_{[0,+∞)^m}( µ_t + η∇_λ F(z_t, µ_t) )
5:   Submit solution x_t
6:   Receive the convex function f_t(x) and experience loss f_t(x_t)
7:   Compute L_t(x, λ) = f_t(x) + F(x, λ) = f_t(x) + Σ_{i=1}^m { λ_i g_i(x) − (δη/2) λ_i² }
8:   Update z_t and µ_t by
       z_{t+1} = Π_B( z_t − η∇_x L_t(x_t, λ_t) ),  µ_{t+1} = Π_{[0,+∞)^m}( µ_t + η∇_λ L_t(x_t, λ_t) )
9: end for

updating (x_t, λ_t) and (z_t, µ_t): the function F(x, λ) is used for computing the solutions x_t and λ_t, while the function L_t(x, λ) is used for updating the auxiliary variables z_t and µ_t. Our analysis is based on Lemma 3.1 from (Nemirovski, 2005), which is restated here for completeness.

Lemma 10 Let D(x, x′) be a Bregman distance function that has modulus α with respect to a norm ‖·‖, i.e., D(x, x′) ≥ α‖x − x′‖²/2. Given u ∈ B, a, and b, set

w = arg min_{x ∈ B} a⊤(x − u) + D(x, u),  u⁺ = arg min_{x ∈ B} b⊤(x − u) + D(x, u).

For any x ∈ B and η > 0, we have

η b⊤(w − x) ≤ D(x, u) − D(x, u⁺) + (η²/(2α)) ‖a − b‖²_* − (α/2)( ‖w − u‖² + ‖w − u⁺‖² ).

We equip B × [0, +∞)^m with the norm ‖·‖ given by

‖(z, µ)‖² = ( ‖z‖₂² + ‖µ‖₂² ) / 2,

where ‖·‖₂ is the Euclidean norm defined separately on each domain. It is immediately seen that the Bregman distance function defined as

D(z_t, µ_t, z_{t+1}, µ_{t+1}) = (1/2) ‖z_t − z_{t+1}‖² + (1/2) ‖µ_t − µ_{t+1}‖²

has modulus α = 1 with respect to the norm ‖·‖.


Lemma 11 If η(m + δ²η²) ≤ 1/4 holds, we have

L_t(x_t, λ) − L_t(x, λ_t) ≤ ( ‖x − z_t‖² − ‖x − z_{t+1}‖² )/(2η) + ( ‖λ − µ_t‖² − ‖λ − µ_{t+1}‖² )/(2η) + η ‖∇f_t(x_t)‖².

Proof To apply Lemma 10, we define u, w, u⁺, a, and b as follows:

u = (z_t, µ_t), u⁺ = (z_{t+1}, µ_{t+1}), w = (x_t, λ_t),
a = ( ∇_x F(z_t, µ_t), −∇_λ F(z_t, µ_t) ), b = ( ∇_x L_t(x_t, λ_t), −∇_λ L_t(x_t, λ_t) ).

Using Lemmas 2 and 10, we have

L_t(x_t, λ) − L_t(x, λ_t) − ( ‖x − z_t‖² − ‖x − z_{t+1}‖² )/(2η) − ( ‖λ − µ_t‖² − ‖λ − µ_{t+1}‖² )/(2η)
≤ (η/2){ ‖∇_x F(z_t, µ_t) − ∇_x L_t(x_t, λ_t)‖² + ‖∇_λ F(z_t, µ_t) − ∇_λ L_t(x_t, λ_t)‖² }   (I)
  − (1/2)( ‖z_t − x_t‖² + ‖µ_t − λ_t‖² ).   (II)

By expanding the gradient terms and applying the inequality (a + b)² ≤ 2(a² + b²), we upper bound (I) as

(I) ≤ (η/2)( 2‖∇f_t(x_t)‖² + 2‖∇_x F(z_t, µ_t) − ∇_x F(x_t, λ_t)‖² + ‖∇_λ F(z_t, µ_t) − ∇_λ F(x_t, λ_t)‖² )   (28)
   ≤ η‖∇f_t(x_t)‖² + η( ‖∇_x F(z_t, µ_t) − ∇_x F(x_t, λ_t)‖² + ‖∇_λ F(z_t, µ_t) − ∇_λ F(x_t, λ_t)‖² )
   ≤ η‖∇f_t(x_t)‖² + 2η(m + δ²η²)( ‖z_t − x_t‖² + ‖µ_t − λ_t‖² ),   (29)

where the last inequality follows from Proposition 9. Combining (I) with (II), we get

L_t(x_t, λ) − L_t(x, λ_t) − ( ‖x − z_t‖² − ‖x − z_{t+1}‖² )/(2η) − ( ‖λ − µ_t‖² − ‖λ − µ_{t+1}‖² )/(2η)
≤ η‖∇f_t(x_t)‖² + ( 2η(m + δ²η²) − 1/2 )( ‖z_t − x_t‖² + ‖µ_t − λ_t‖² ).

We complete the proof by setting η(m + δ²η²) ≤ 1/4, which makes the last term nonpositive.


Theorem 12 Set η = T^{−1/3} and δ = T^{−2/3}. Let x_t, t ∈ [T], be the sequence of solutions obtained by Algorithm 2. Then for T ≥ 164(m + 1)³ we have

Σ_{t=1}^T f_t(x_t) − f_t(x*) ≤ Õ(T^{2/3})  and  Σ_{t=1}^T g_i(x_t) ≤ Õ(T^{2/3}), i ∈ [m].

Proof Similar to the proof of Theorem 4, applying Lemma 11 for all rounds t = 1, . . . , T, we have for any x* ∈ K

Σ_{t=1}^T [ f_t(x_t) − f_t(x*) ] + Σ_{i=1}^m [ Σ_{t=1}^T g_i(x_t) ]² / ( 2(δηT + m/η) ) ≤ R²/(2η) + (ηT/2) G².

By setting δ = 1/(ηT), so that 2(δηT + m/η) = 2(1 + m/η), and using the fact that Σ_{t=1}^T f_t(x_t) − f_t(x*) ≥ −FT, we have

Σ_{t=1}^T [ f_t(x_t) − f_t(x*) ] ≤ R²/(2η) + (ηT/2) G²  and  Σ_{t=1}^T g_i(x_t) ≤ √( (1 + m/η)( R²/η + ηT G² + 2FT ) ).

The stated values of η and δ, which optimize both inequalities over η up to constants, give the desired bounds. Note that the condition η(m + δ²η²) ≤ 1/4 in Lemma 11 is satisfied for these values of η and δ as long as T ≥ 164(m + 1)³.

Using the same trick as in Theorem 6, by introducing an appropriate γ we are able to produce solutions that exactly satisfy the constraints in the long run with an Õ(T^{2/3}) regret bound, as shown in the following corollary.

Corollary 13 Let η = δ = T^{−1/3}, and let x_t, t ∈ [T], be the sequence of solutions obtained by Algorithm 2 with γ = bT^{−1/3} and b = √(2F(1 + m)). For sufficiently large T, i.e., FT ≥ (R²/2)T^{1/3} + (G²/2 + bσ^{−1}√(2m))T^{2/3}, and under Assumptions 1 and 2, the solutions x_t, t ∈ [T], satisfy the global constraints Σ_{t=1}^T g_i(x_t) ≤ 0, i ∈ [m], and the regret R_T is bounded by

R_T = Σ_{t=1}^T f_t(x_t) − f_t(x*) ≤ (R²/2) T^{1/3} + ( G²/2 + (2G/σ)√(2F m(1 + m)) ) T^{2/3} = Õ(T^{2/3}).

The bound obtained in Corollary 13 is interesting: it indicates that for any convex domain defined by a finite number of halfspaces, i.e., a polyhedral set, one can simply replace the projection onto the polyhedron with the projection onto a ball containing it, at the price of satisfying the constraints only in the long run and obtaining an Õ(T^{2/3}) regret bound.
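The practical content of Corollary 13 is a cost gap between two projections. A sketch of that gap (our own illustration, not code from the paper): projecting onto a ball is a closed form rescaling, while projecting onto a polyhedron {x : Ax ≤ b} generally needs an iterative solver, approximated below by cyclic projection onto the violated halfspaces (an exact projection would use Dykstra's algorithm or a QP solver).

```python
import numpy as np

def proj_ball(y, R):
    # closed form, O(d) work
    n = np.linalg.norm(y)
    return y if n <= R else y * (R / n)

def proj_polyhedron(y, A, b, iters=200):
    # Approximate a feasible point of {x : Ax <= b} by cyclically
    # projecting onto each violated halfspace.
    x = y.copy()
    for _ in range(iters):
        for a_i, b_i in zip(A, b):
            v = a_i @ x - b_i
            if v > 0:
                x -= v * a_i / (a_i @ a_i)
    return x

A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([0.5, 0.5, 0.7])
y = np.array([2.0, 2.0])

x_poly = proj_polyhedron(y, A, b)
x_ball = proj_ball(y, R=1.0)
print(A @ x_poly - b)            # all entries <= 0 up to roundoff: feasible
print(np.linalg.norm(x_ball))    # lands on the ball boundary
```

Corollary 13 says the learner may use only the cheap second operation and still satisfy the polyhedral constraints on average, at the price of an Õ(T^{2/3}) regret.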


Algorithm 3 An online gradient based convex-concave optimization method for long term constraints under bandit feedback for the domain
1: Input: constraint g(x), step size η, parameter δ > 0, exploration parameter ζ > 0, shrinkage coefficient ξ
2: Initialization: x_1 = 0 and λ_1 = 0
3: for t = 1, 2, . . . , T do
4:   Submit solution x_t
5:   Select a unit vector u_t uniformly at random
6:   Query g(x) at the points x_t + ζu_t and x_t − ζu_t and incur their average as the violation of the constraints
7:   Compute g̃_{x,t} = ∇f_t(x_t) + λ_t [ (d/(2ζ))( g(x_t + ζu_t) − g(x_t − ζu_t) ) u_t ]
8:   Compute g̃_{λ,t} = (1/2)( g(x_t + ζu_t) + g(x_t − ζu_t) ) − ηδλ_t
9:   Receive the convex function f_t(x) and experience loss f_t(x_t)
10:  Update x_t and λ_t by
       x_{t+1} = Π_{(1−ξ)B}( x_t − η g̃_{x,t} ),  λ_{t+1} = Π_{[0,+∞)}( λ_t + η g̃_{λ,t} )
11: end for

Next we show that by imposing a simple condition on the linear constraints, we can relax Assumption 1. More specifically, the following proposition shows that as long as all the linear constraints a_i⊤x − b_i ≤ 0, i ∈ [m], are significantly different, i.e.,

|a_i⊤ a_j| ≤ 1/[2(m − 1)], ∀i ≠ j,   (30)

Assumption 1 holds for any convex domain of the form in Assumption 2 with σ = 1/√2.

Proposition 14 Let a_i, i ∈ [m], be a set of normalized vectors (i.e., ‖a_i‖₂ = 1, i ∈ [m]) that satisfies the conditions in (30), and let A = (a_1, . . . , a_m). Then we have σ_min(A) ≥ 1/√2.

Proof By the Gershgorin circle theorem, λ_min(A⊤A) ∈ ∪_{i=1}^m D_i, where D_i = {z : |z − [A⊤A]_{i,i}| ≤ Σ_{j≠i} |[A⊤A]_{i,j}|}. Hence

λ_min(A⊤A) ≥ min_{i ∈ [m]} [A⊤A]_{i,i} − Σ_{j≠i} |[A⊤A]_{i,j}| ≥ 1 − (m − 1) · 1/[2(m − 1)] = 1/2,

and therefore σ_min(A) ≥ 1/√2, leading to the conclusion in the proposition.
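Proposition 14 is easy to check numerically. The sketch below draws random unit vectors in a high dimension, where condition (30) holds with overwhelming probability (this construction is our own, for illustration), and verifies the guaranteed lower bound on σ_min(A).

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 2000, 5
# Independent random unit vectors in high dimension are nearly orthogonal,
# so their pairwise inner products satisfy (30) with overwhelming probability.
A = rng.normal(size=(d, m))
A /= np.linalg.norm(A, axis=0, keepdims=True)   # columns are unit vectors a_i

gram = A.T @ A
max_off = np.abs(gram - np.eye(m)).max()
assert max_off <= 1.0 / (2 * (m - 1)), "condition (30) failed; resample"

sigma_min = np.linalg.svd(A, compute_uv=False).min()
print(sigma_min >= 1.0 / np.sqrt(2))            # True, as Proposition 14 guarantees
```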


5. Online Convex Optimization with Long Term Constraints under Bandit Feedback for the Domain

We extend the gradient based convex-concave optimization algorithm discussed in Section 3 to the setting where the learner only receives partial feedback about the constraints. More specifically, beyond knowing that the solution lies within a ball B, the exact definition of the domain K is not exposed to the learner. Instead, after receiving a solution x_t, the oracle presents the learner with the convex loss function f_t(x) and the maximum violation of the constraints at x_t, i.e., g(x_t) = max_{i ∈ [m]} g_i(x_t). The convex-concave function defined in (14) becomes L_t(x, λ) = f_t(x) + λ g(x) − (δη/2)λ² in this case. Our goal is to generate a sequence of solutions with vanishing regret and vanishing violation of the constraints.

The mentioned setting is closely tied to bandit online convex optimization. In the bandit setting, in contrast to the full information setting, only the cost of the chosen decision (i.e., the incurred loss f_t(x_t)) is revealed to the algorithm, not the function itself. There is a rich body of literature on bandit online convex optimization. The seminal papers (Flaxman et al., 2005) and (Awerbuch and Kleinberg, 2004) showed that one can design algorithms with an Õ(T^{3/4}) regret bound even in the bandit setting, where the loss function is evaluated only at a single point. Specializing to online bandit optimization of linear loss functions, (Dani et al., 2007) proposed an inefficient algorithm with an Õ(√T) regret bound, and (Abernethy et al., 2008) obtained an Õ(√(T log T)) bound with an efficient algorithm when the convex set admits an efficiently computable self-concordant barrier. For general convex loss functions, Agarwal et al. (2010) proposed optimal algorithms in a new bandit setting in which multiple points can be queried for their cost values. By using multiple evaluations, they showed that a modified online gradient descent algorithm can achieve an Õ(√T) regret bound in expectation.

Before proceeding, we state a fact about the Lipschitz continuity of the function g in the following proposition.

Proposition 15 Assume that the functions g_i, i ∈ [m], are Lipschitz continuous with constant G. Then the function g(x) = max_{i ∈ [m]} g_i(x) is G-Lipschitz continuous, i.e., |g(x) − g(x′)| ≤ G‖x − x′‖ for any x and x′.

Proof First, we rewrite g as g(x) = max_{α ∈ ∆_m} Σ_{i=1}^m α_i g_i(x), where ∆_m is the m-simplex, i.e., ∆_m = {α ∈ R_+^m : Σ_{i=1}^m α_i = 1}. Then we have

|g(x) − g(x′)| = | max_{α ∈ ∆_m} Σ_{i=1}^m α_i g_i(x) − max_{α ∈ ∆_m} Σ_{i=1}^m α_i g_i(x′) |
≤ max_{α ∈ ∆_m} | Σ_{i=1}^m α_i ( g_i(x) − g_i(x′) ) |
≤ G‖x − x′‖,

where the second inequality follows from the Lipschitz continuity of g_i, i ∈ [m].


Algorithm 3 gives a complete description of the proposed algorithm under the bandit setting, which is a slight modification of Algorithm 1. To facilitate the analysis, we define

L̂_t(x, λ) = f_t(x) + λ ĝ(x) − (ηδ/2) λ²,

where ĝ is the smoothed version of g, defined as ĝ(x) = E_{v ∈ 𝔹}[ g(x + ζv) ] with 𝔹 the unit ball centered at the origin. Note that ĝ is Lipschitz continuous with the same constant G, and it is always differentiable even though g may not be. Since we do not have access to the function ĝ to compute ∇_x L̂, we need a way to estimate its gradient at the point x_t. Our gradient estimation closely follows the idea in (Agarwal et al., 2010) of querying the function g at two points. The main advantage of the two point gradient estimate over the one point estimate used in (Flaxman et al., 2005) is that the former has a norm bounded independently of ζ, which leads to improved regret bounds.

The gradient estimators for ∇_x L̂(x_t, λ_t) = ∇f_t(x_t) + λ_t ∇ĝ(x_t) and ∇_λ L̂(x_t, λ_t) = ĝ(x_t) − δηλ_t in Algorithm 3 are computed by evaluating the function g at two random points around x_t as

g̃_{x,t} = ∇f_t(x_t) + λ_t [ (d/(2ζ))( g(x_t + ζu_t) − g(x_t − ζu_t) ) u_t ]  and  g̃_{λ,t} = (1/2)( g(x_t + ζu_t) + g(x_t − ζu_t) ) − ηδλ_t,

where u_t is chosen uniformly at random from the surface S of the unit sphere. Using Stokes' theorem, Flaxman et al. (2005) showed that (d/(2ζ))( g(x_t + ζu_t) − g(x_t − ζu_t) ) u_t is a conditionally unbiased estimate of the gradient of ĝ. To make sure that the randomized points around x_t lie inside the convex domain B, we need to stay away from the boundary of the set so that the ball of radius ζ around x_t is contained in B. In particular, it is shown in (Flaxman et al., 2005) that for any x ∈ (1 − ξ)B and any unit vector u, we have (x + ζu) ∈ B as soon as ζ ∈ [0, ξr].

To facilitate the analysis of Algorithm 3, we define the convex-concave function H_t(·, ·) as

H_t(x, λ) = L̂_t(x, λ) + ( g̃_{x,t} − ∇_x L̂(x_t, λ_t) )⊤ x + ( g̃_{λ,t} − ∇_λ L̂(x_t, λ_t) ) λ.   (31)

It is easy to check that ∇_x H_t(x_t, λ_t) = g̃_{x,t} and ∇_λ H_t(x_t, λ_t) = g̃_{λ,t}. With the functions H_t so defined, Algorithm 3 reduces to Algorithm 1 performing gradient descent on the functions H_t, except that the projection is made onto the set (1 − ξ)B instead of B. We begin our analysis by reproducing Proposition 3 for the functions H_t.
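The two point estimator used in Algorithm 3 can be sanity-checked numerically. In the sketch below (our own toy setup), g is linear, so ∇ĝ coincides with ∇g = a; the average of the estimator over random unit vectors recovers a, and every single estimate has norm at most dG no matter how small ζ is.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 5
a = np.ones(d) / np.sqrt(d)        # g(x) = a.x - 0.3 is 1-Lipschitz (G = 1)

def g(x):
    return a @ x - 0.3

def two_point_grad(x, zeta):
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)         # uniform direction on the unit sphere
    return (d / (2.0 * zeta)) * (g(x + zeta * u) - g(x - zeta * u)) * u

x, zeta = np.zeros(d), 1e-3
est = np.mean([two_point_grad(x, zeta) for _ in range(20000)], axis=0)
print(np.linalg.norm(est - a))     # small Monte Carlo error: (nearly) unbiased
```

In contrast, the one point estimator (d/ζ) g(x + ζu) u has norm of order 1/ζ, which is why the two point scheme yields the better bounds quoted above.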


Lemma 16 If Algorithm 1 is performed over the convex set K with the functions H_t defined in (31), then for any x ∈ K we have

Σ_{t=1}^T H_t(x_t, λ) − H_t(x, λ_t) ≤ ( R² + ‖λ‖₂² )/(2η) + η(D² + G²)T + η(d²G² + η²δ²) Σ_{t=1}^T λ_t².

Proof We have ∇_x H_t(x_t, λ_t) = g̃_{x,t} and ∇_λ H_t(x_t, λ_t) = g̃_{λ,t}. It is straightforward to show that (d/(2ζ))( g(x_t + ζu_t) − g(x_t − ζu_t) ) u_t has norm bounded by Gd (Agarwal et al., 2010). So the norms of the gradients are bounded as ‖g̃_{x,t}‖₂² ≤ 2(G² + d²G²λ_t²) and ‖g̃_{λ,t}‖₂² ≤ 2(D² + η²δ²λ_t²). Using Lemma 2 and summing over all rounds, we get the desired inequality.

The following theorem gives the regret bound and the expected violation of the constraints in the long run for Algorithm 3.

Theorem 17 Let c = √(D² + G²)( √2 R + 2D/(δR) ) + (D/r + 1) GD/r. Set η = R/√( 2(D² + G²)T ), choose δ such that δ ≥ 2(d²G² + η²δ²), and let ζ = δ/T and ξ = ζ/r. Let x_t, t ∈ [T], be the sequence of solutions obtained by Algorithm 3. We then have

Σ_{t=1}^T f_t(x_t) − f_t(x*) ≤ GD/r + c√T,

E[ Σ_{t=1}^T g(x_t) ] ≤ Gδ + √( 2( δR/√(2(D² + G²)) + √(2(D² + G²))/R )( GD/r + c√T + FT ) √T ) = Õ(T^{3/4}).

Proof Using Lemma 2, we get

L̂_t(x_t, λ) − L̂_t(x, λ_t) ≤ (x_t − x)⊤ ∇_x L̂_t(x_t, λ_t) − (λ − λ_t)⊤ ∇_λ L̂_t(x_t, λ_t),
H_t(x_t, λ) − H_t(x, λ_t) ≤ (x_t − x)⊤ g̃_{x,t} − (λ − λ_t)⊤ g̃_{λ,t}.

Subtracting the preceding inequalities, taking expectation, and summing over t from 1 to T, we get

E Σ_{t=1}^T L̂_t(x_t, λ) − L̂_t(x, λ_t) = E Σ_{t=1}^T H_t(x_t, λ) − H_t(x, λ_t)
  + E Σ_{t=1}^T [ (x_t − x)⊤( ∇_x L̂_t(x_t, λ_t) − E_t[g̃_{x,t}] ) + (λ_t − λ)⊤( ∇_λ L̂_t(x_t, λ_t) − E_t[g̃_{λ,t}] ) ].   (32)

Next we provide an upper bound on the difference between the gradients of the two functions. First, E_t[g̃_{x,t}] = ∇_x L̂_t(x_t, λ_t), so g̃_{x,t} is an unbiased estimator of ∇_x L̂_t(x_t, λ_t). Considering the update rule for λ_{t+1}, we have |λ_{t+1}| ≤ (1 − η²δ)|λ_t| + ηD, which implies |λ_t| ≤ D/(δη) for all t. So we have


(λ_t − λ)⊤( ∇_λ L̂_t(x_t, λ_t) − E_t[g̃_{λ,t}] ) ≤ |λ_t − λ| · E_t| ∇_λ L̂_t(x_t, λ_t) − g̃_{λ,t} |
≤ (D/(δη)) · E_t| (1/2)( g(x_t + ζu_t) + g(x_t − ζu_t) ) − ĝ(x_t) | ≤ (DG/(δη)) ζ‖u_t‖ = DGζ/(δη),   (33)

where the last inequality follows from the Lipschitz property of the functions g and ĝ with the same constant G. Combining the inequalities (32) and (33) and using Lemma 16, we have

E Σ_{t=1}^T L̂_t(x_t, λ) − L̂_t(x, λ_t) ≤ ( R² + λ² )/(2η) + η(D² + G²)T + η(d²G² + η²δ²) Σ_{t=1}^T λ_t² + ( DGζ/(δη) ) T.

By expanding the l.h.s. we get

Σ_{t=1}^T [ f_t(x_t) − f_t((1 − ξ)x) ] + λ E Σ_{t=1}^T ĝ(x_t) − E ĝ((1 − ξ)x) Σ_{t=1}^T λ_t − (ηδT/2) λ² + (ηδ/2) Σ_{t=1}^T λ_t²
≤ ( R² + λ² )/(2η) + η(D² + G²)T + η(d²G² + η²δ²) Σ_{t=1}^T λ_t² + ( DGζ/(δη) ) T.

By choosing δ ≥ 2(d²G² + η²δ²) we cancel the λ_t² terms from both sides and obtain

Σ_{t=1}^T [ f_t(x_t) − f_t((1 − ξ)x) ] + λ E Σ_{t=1}^T ĝ(x_t) − E ĝ((1 − ξ)x) Σ_{t=1}^T λ_t − (ηδT/2) λ²
≤ ( R² + λ² )/(2η) + η(D² + G²)T + ( DGζ/(δη) ) T.   (34)

By convexity and the Lipschitz property of f_t and g we have

f_t((1 − ξ)x) ≤ (1 − ξ) f_t(x) + ξ f_t(0) ≤ f_t(x) + DGξ,  g(x) ≤ ĝ(x) + Gζ,   (35)
ĝ((1 − ξ)x) ≤ g((1 − ξ)x) + Gζ ≤ g(x) + Gζ + DGξ.   (36)

Plugging (35) and (36) back into (34), for any optimal solution x* ∈ K we get

Σ_{t=1}^T [ f_t(x_t) − f_t(x*) ] + λ E Σ_{t=1}^T g(x_t) − (ηδT/2) λ² − λGζT
≤ ( R² + λ² )/(2η) + η(D² + G²)T + ( DGζ/(δη) ) T + DGξT + ( DGξ + Gζ ) Σ_{t=1}^T λ_t.   (37)

Considering the fact that λ_t ≤ D/(δη), we have Σ_{t=1}^T λ_t ≤ DT/(δη). Plugging back into (37) and rearranging the terms, we have

Σ_{t=1}^T [ f_t(x_t) − f_t(x*) ] + λ E Σ_{t=1}^T g(x_t) − (ηδT/2) λ² − λGζT − λ²/(2η)
≤ R²/(2η) + η(D² + G²)T + ( DGζ/(δη) ) T + DGξT + ( DGξ + Gζ ) DT/(δη).

By setting ξ = ζ/r and ζ = δ/T, we get

Σ_{t=1}^T [ f_t(x_t) − f_t(x*) ] ≤ R²/(2η) + η(D² + G²)T + ( DGζ/(δη) ) T + ζDGT/r + ( D/r + 1 ) ζDGT/(δη),

which gives the stated regret bound after substituting the value of η. Maximizing over λ in the range (0, +∞) and using Σ_{t=1}^T f_t(x_t) − f_t(x*) ≥ −FT yields the following inequality for the violation of the constraints:

[ E Σ_{t=1}^T g(x_t) − GζT ]² / ( 4( δηT/2 + 1/(2η) ) ) ≤ GD/r + c√T + FT.

Plugging in the stated values of the parameters completes the proof. Note that δ = 4d²G² obeys the condition specified in the theorem.

6. Conclusion

In this study we have addressed the problem of online convex optimization with constraints, where we only require the constraints to be satisfied in the long run. In addition to the regret bound, which is the main tool for analyzing the performance of general online optimization algorithms, we defined a bound on the long term violation of the constraints, which measures the violation of the decisions over all rounds. Our setting applies to solving online convex optimization without projecting the solutions onto the complex convex domain at each iteration, which may be computationally expensive. In addition, we showed that, in a special case, our setting captures online convex optimization with side constraints. Our strategy is to turn the problem into an online convex-concave optimization problem and apply the online gradient descent algorithm to solve it. We proposed efficient algorithms in three different settings: when violation of the constraints is allowed, when the constraints need to be satisfied exactly in the long run, and when we have no access to the target convex domain except that it is bounded by a ball. Moreover, for domains determined by linear constraints, we exploited the Mirror Prox method, a simple gradient based algorithm for variational inequalities, and obtained an Õ(T^{2/3}) bound for both the regret and the violation of the constraints.

Our work leaves open a number of interesting directions for future work. In particular, it would be interesting to see whether it is possible to improve the bounds obtained in this paper, specifically to obtain an Õ(√T) regret bound and, simultaneously, a bound better than Õ(T^{3/4}) for the violation of the constraints for general convex domains. Proving optimal lower bounds for the proposed setting also remains an open question. Finally, relaxing the assumption made in order to exactly satisfy the constraints in the long run is worth investigating.

Acknowledgments



References

Jacob Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In COLT, pages 263–274, 2008.

Jacob Abernethy, Alekh Agarwal, Peter L. Bartlett, and Alexander Rakhlin. A stochastic view of optimal regret through minimax duality. In COLT, 2009.

Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In COLT, pages 28–40, 2010.

Baruch Awerbuch and Robert D. Kleinberg. Adaptive routing with end-to-end feedback: distributed learning and geometric approaches. In STOC, pages 45–53, 2004.

Peter L. Bartlett, Elad Hazan, and Alexander Rakhlin. Adaptive online gradient descent. In NIPS, 2007.

Dimitri P. Bertsekas, Angelia Nedic, and Asuman E. Ozdaglar. Convex Analysis and Optimization. Athena Scientific, 2003.

David Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6(1):1–8, 1956.

Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006. ISBN 0521841089.

Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.

Varsha Dani, Thomas P. Hayes, and Sham Kakade. The price of bandit information for online optimization. In NIPS, 2007.

John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In ICML, pages 272–279, 2008.

Eyal Even-Dar, Robert Kleinberg, Shie Mannor, and Yishay Mansour. Online learning for global cost functions. In COLT, 2009.

Abraham Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In SODA, pages 385–394, 2005.

Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69:169–192, 2007.

Branislav Kveton, Jia Yuan Yu, Georgios Theocharous, and Shie Mannor. Online learning with expert advice and finite-horizon constraints. In AAAI, pages 331–336, 2008.

Jun Liu and Jieping Ye. Efficient Euclidean projections in linear time. In ICML, page 83, 2009.

Shie Mannor and John N. Tsitsiklis. Online learning with constraints. In COLT, pages 529–543, 2006.

Arkadi Nemirovski. Efficient methods in convex programming. Lecture notes, available at http://www2.isye.gatech.edu/~nemirovs.

Arkadi Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2005.

Alexander Rakhlin. Lecture notes on online learning. Lecture notes, available at http://www-stat.wharton.upenn.edu/~rakhlin/papers, 2009.

Shai Shalev-Shwartz and Sham M. Kakade. Mind the duality gap: Logarithmic regret algorithms for online optimization. In NIPS, pages 1457–1464, 2008.

Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, pages 928–936, 2003.