A General Class of No-Regret Learning Algorithms and Game-Theoretic Equilibria

Amy Greenwald
[email protected]
Brown University, Providence, RI 02912

Amir Jafari
[email protected]
Northwestern University, Evanston, IL 60208

Casey Marks
[email protected]
Brown University, Providence, RI 02912

Editor:

Abstract

A general class of no-regret learning algorithms, called no-Φ-regret learning algorithms, is defined that spans the spectrum from no-external-regret learning to no-internal-regret learning and beyond. The set Φ describes the set of strategies to which the play of a given learning algorithm is compared. A learning algorithm satisfies no-Φ-regret if no regret is experienced for playing as the algorithm prescribes, rather than playing according to any of the strategies that arise by transforming the algorithm's play via the elements of Φ. We establish the existence of no-Φ-regret algorithms for the class of linear transformations. Analogously, a class of game-theoretic equilibria, called Φ-equilibria, is defined, and it is shown that the empirical distribution of play of no-Φ-regret algorithms converges to the set of Φ-equilibria. A necessary condition for convergence to this set is also derived. Perhaps surprisingly, for the class of linear transformations, the strongest form of no-Φ-regret learning is no-internal-regret learning. Hence, the tightest game-theoretic solution concept to which such no-Φ-regret algorithms converge is correlated equilibrium. In particular, Nash equilibrium is not a necessary outcome of learning via such no-Φ-regret algorithms.

Keywords: no-regret learning, repeated games, game-theoretic equilibrium

1. Introduction

Consider an agent that repeatedly faces some decision problem. The agent is presented with a choice of actions, each with a different outcome, or set of outcomes. After each choice is made, and the corresponding outcome is observed, the agent receives a reward. In this setting, one reasonable objective for an agent is to maximize its average rewards. If each action always leads to the same outcome, and if the action set is finite, the agent need only undertake a linear search for an action that yields the maximal reward, and choose that action forever after. But if the outcomes associated with each choice of action vary across iterations of the decision problem, even if the action set is finite, a more complex strategy, or learning algorithm, is called for, if indeed the agent seeks to maximize its average rewards.


No-regret learning algorithms are geared toward maximizing average rewards in such settings. The efficacy of a no-regret algorithm is determined by comparing the performance of the algorithm with the performance of a set of alternative strategies. For example, one might compare the performance of a learning algorithm with the set of strategies that always choose the same action. Learning algorithms that outperform this strategy set exhibit no-external-regret (Hannan, 1957). As another example, consider the set of strategies that are identical to the strategy played by the learning algorithm, except that every play of action a is replaced by an action a′, for all actions a and a′. Learning algorithms that outperform all strategies in this set exhibit no-internal-regret (Foster and Vohra, 1997).

This paper studies the outcome of no-regret learning among a set of agents, or players, playing repeated games. At each stage in a repeated game, each player makes a choice among a set of actions. The outcome, which is jointly determined by all players' choices, assigns a reward to each player. We assume players observe not only their own reward, but the reward they would have obtained had they instead made each of their other choices. Players observe nothing about one another's actions or rewards.

In this setting, the following results are well-known: In two-player, zero-sum, repeated games, if each player plays using a no-external-regret learning algorithm, then the empirical distribution of play converges to the set of minimax-valued equilibria (see, e.g., Freund and Schapire (1996)). In general-sum, repeated games, if each agent plays using a no-internal-regret learning algorithm, then the empirical distribution of play converges to the set of correlated equilibria (see, e.g., Hart and Mas-Colell (2000)).
In this article, we define a general class of no-regret learning algorithms, called no-Φ-regret learning algorithms, which spans the spectrum from no-external-regret to no-internal-regret, and beyond. The set Φ describes the set of strategies to which the play of a given learning algorithm is compared. A learning algorithm satisfies no-Φ-regret if no regret is experienced for playing as the algorithm prescribes, rather than playing according to any of the strategies that arise by transforming the algorithm's play via the elements of Φ. We establish the existence of no-Φ-regret algorithms for the class of linear transformations.

We also define a class of game-theoretic equilibria, called Φ-equilibria, and we show that the empirical distribution of play of no-Φ-regret algorithms converges to the set of Φ-equilibria. We obtain as corollaries of our theorem the aforementioned results that no-external-regret learning converges to the set of minimax-valued equilibria and no-internal-regret learning converges to the set of correlated equilibria. Interestingly, we also establish a necessary condition for convergence to the set of Φ-equilibria, namely that all players learn via algorithms that exhibit no-Φ-regret "on the path of play" (defined in Section 4). Perhaps surprisingly, for the class of linear transformations, we find that the strongest form of no-regret algorithm is the no-internal-regret algorithm. Consequently, the tightest game-theoretic solution concept to which no-Φ-regret algorithms converge is correlated equilibrium, for Φ a set of linear transformations. In particular, Nash equilibrium is not a necessary outcome of learning via such no-Φ-regret algorithms.
The research most closely related to this work is that of Lehrer (2003), who proves a "wide range no-regret theorem." In fact, Lehrer's theorem on the existence of no-regret algorithms generalizes our result on the existence of no-Φ-regret algorithms: e.g., Lehrer's "wide range" of alternative strategies includes history-dependent strategies. Nonetheless, we see one important advantage of our approach: in our formalism, a set Φ of alternative strategies can be represented as a set of stochastic matrices, a convex combination of which readily yields a simple no-Φ-regret algorithm (see Algorithm 1).

This article is organized as follows. In Section 2, we review Blackwell's approachability theory and present a generalization of Blackwell's approachability theorem. (The proof of this generalization appears in Appendix A.) In Section 3, we define no-Φ-regret learning, and establish the existence of no-Φ-regret algorithms. In Section 4, we define Φ-equilibrium, and prove that no-Φ-regret learning converges to the set of Φ-equilibria. In Section 5, we show that no-internal-regret is the strongest form of no-Φ-regret, for Φ linear.

2. Approachability

Consider an agent with a finite set of actions A (a ∈ A) playing a game against a set of opponents who play actions in the joint action space A′ (a′ ∈ A′). (The opponents' joint action space can be interpreted as the Cartesian product of independent action sets.) Associated with each possible outcome is some vector given by the vector-valued function ρ : A × A′ → V, where V is a vector space with an inner product · and a distance metric d defined by the inner product, that is, d(x, y) = ‖x − y‖₂, for all x, y ∈ V.

Given a game with the vector-valued function ρ, a learning algorithm A is a sequence of functions q_t = q_t(ρ) : (A × A′)^{t−1} → ∆(A), for t = 1, 2, …, where ∆(A) = {q ∈ ℝ^{|A|} | Σ_a q_a = 1 and q_a ≥ 0, ∀a ∈ A} and (A × A′)⁰ is defined as a single point: i.e., q₁ ∈ ∆(A). Note that, in general, a learning algorithm generates probabilistic actions.

Many examples of learning algorithms appear throughout the literature. A history-independent learning algorithm returns some constant element of ∆(A). The best-reply heuristic (Cournot, 1838) returns an element of ∆(A), at time t, that maximizes the agent's rewards with respect to only a′_{t−1}. Fictitious play (Brown, 1951; Robinson, 1951) returns an element of ∆(A), at time t, that maximizes the agent's rewards with respect to the empirical distribution of play through time t − 1.

Following Blackwell (1956), we define the notion of approachability as follows.

Definition 1 Given a game with vector-valued function ρ, a subset G ⊆ V is said to be ρ-approachable if there exists a learning algorithm A = q₁, q₂, … s.t. for any sequence of opponents' actions a′₁, a′₂, …, lim_{t→∞} d(G, ρ̄_t) = lim_{t→∞} inf_{g∈G} d(g, ρ̄_t) = 0, almost surely, where ρ̄_t(a₁, …, a_t) = (1/t)(ρ(a₁, a′₁) + … + ρ(a_t, a′_t)): i.e., P(d(G, ρ̄_t) → 0 as t → ∞) = 1. Equivalently, for all ε > 0, there exists t₀ s.t. P(∃t ≥ t₀ s.t. d(G, ρ̄_t) ≥ ε) < ε.

Thus, a subset G ⊆ V is ρ-approachable if there exists a learning algorithm for an agent that generates probabilistic actions (mixed strategies) for the agent which ensure that the distance from the set G to the average value of ρ through time t tends to zero as t tends to infinity, with probability one, regardless of the opponents' actions a′₁, a′₂, ….

We interpret the vector-valued function in games as regrets (rather than rewards). Moreover, throughout the remainder of this paper, we restrict our attention to V = ℝ^S and G = ℝ^S₋ = {(x_s)_{s∈S} | x_s ≤ 0, ∀s ∈ S}, for various choices of the set S. Thus, we seek algorithms whose regret approaches the negative orthant: i.e., no-regret algorithms.

Blackwell's seminal approachability theorem (Blackwell, 1956) provides a sufficient condition on learning algorithms which ensures that ℝ^S₋ (or, more generally, any convex subset G ⊆ V) is approachable. We present Blackwell's condition, as well as a generalization of his approachability theorem due to Jafari (2003).

Definition 2 Let Λ : ℝ^S → ℝ^S. Given a game with vector-valued function ρ, a learning algorithm A = q₁, q₂, … is Λ-compatible if there exists some constant c ≥ 0 and there exists t₀ ∈ ℕ s.t. for all t ≥ t₀,

Λ(ρ̄_t) · ρ(q_{t+1}, a′) ≤ c/(t+1)   a.s.   (1)

for all a′ ∈ A′. Here ρ(q, a′) = E_{a∼q}[ρ(a, a′)].

This definition of Λ-compatibility is inspired by Hart and Mas-Colell (2001), who introduce the notion of Λ-compatibility for c = 0. Blackwell's condition can be stated in terms of Λ-compatibility for a particular choice of Λ, namely Λ₀, which is defined as follows: Λ₀((x_s)_{s∈S}) = (x_s⁺)_{s∈S}, where x_s⁺ = max{x_s, 0}.

Theorem 3 (Blackwell, 1956) Given a game with vector-valued function ρ, if A and A′ are finite, then the set ℝ^S₋ is ρ-approachable if there exists a learning algorithm A that is Λ₀-compatible, with c = 0: i.e., for all times t and for all a′ ∈ A′, ρ̄_t⁺ · ρ(q_{t+1}, a′) ≤ 0. Moreover, the following algorithm can be used to ρ-approach ℝ^S₋: if ρ̄_t ∈ ℝ^S₋, play arbitrarily; otherwise, play according to learning algorithm A.

The following generalization of Blackwell's theorem states that Λ₀-compatibility implies ℝ^S₋ is ρ-approachable, even for c ≠ 0. For completeness, a proof is provided in Appendix A.

Theorem 4 (Jafari, 2003) Given a game with vector-valued function ρ, if ρ(A × A′) is bounded, then the set ℝ^S₋ is ρ-approachable if there exists a learning algorithm A that is Λ₀-compatible: i.e., there exists some constant c ≥ 0 and there exists t₀ ∈ ℕ s.t. for all times t ≥ t₀ and for all a′ ∈ A′, ρ̄_t⁺ · ρ(q_{t+1}, a′) ≤ c/(t+1). Moreover, the following algorithm can be used to ρ-approach ℝ^S₋: if ρ̄_t ∈ ℝ^S₋, play arbitrarily; otherwise, play according to learning algorithm A.

3. No-Regret Learning

Let Φ_ALL be the set of all mappings, or transformations, from ∆(A) to ∆(A). Thus, each φ ∈ Φ_ALL converts one probabilistic action into another. Given a finite set Φ ⊆ Φ_ALL, and given one distinguished agent's reward function r : A × A′ → ℝ, we define that agent's regret vector ρ_Φ : A × A′ → ℝ^Φ as follows:

ρ_Φ(a, a′) = (r(φ(e_a), a′) − r(a, a′))_{φ∈Φ}   (2)

Here, e_a is the basis vector of length |A| with a 1 in the entry corresponding to action a, and r(q, a′) = E_{a∼q}[r(a, a′)]. In words, ρ_Φ(a, a′) is a vector indexed by φ ∈ Φ for which each entry describes the regret the agent has for choosing action a rather than action φ(e_a), given opposing action a′. Using this framework, we define no-Φ-regret learning.

Definition 5 A no-Φ-regret learning algorithm is one that ρ_Φ-approaches ℝ^Φ₋.

Given a repeated game, if an agent with reward function r plays in such a way that ρ_Φ approaches the negative orthant ℝ^Φ₋, then the agent's learning algorithm is said to exhibit no-Φ-regret. There are two well-studied examples of no-regret properties, no-external-regret and no-internal-regret, both of which arise as instances of no-Φ-regret.

3.1 Linear Transformations

Given an agent with finite action set A, let Φ_EXT = {φ^(a) | a ∈ A} be the set of constant maps: i.e., φ^(a)(q) = e_a. In words, φ^(a)(q) ascribes zero probability to any action b ≠ a. Intuitively, a learning algorithm satisfies no-Φ_EXT-regret if it prevents the agent from experiencing regret relative to any fixed strategy. No-Φ_EXT-regret is equivalent to no-external-regret, also known as Hannan consistency (Hannan, 1957) and universal consistency (Fudenberg and Levine, 1995).

Our next choice of Φ, once again for finite A, gives rise to the definition of no-internal-regret. Let Φ_INT = {φ^(ab) | a, b ∈ A}, where

φ_c^(ab)(q) = { q_c if c ≠ a, b;  0 if c = a;  q_a + q_b if c = b }   (3)

Thus, φ^(ab) maps probabilistic action q into another that ascribes zero probability to a, but instead adds a's probability mass according to q to the probability q ascribes to b. Intuitively, an algorithm satisfies no-Φ_INT-regret if the agent does not feel regret for playing a instead of playing b, for all pairs of actions a and b. Following Foster and Vohra (1997), we call this property no-internal-regret, but it is sometimes called conditional universal consistency (Fudenberg and Levine, 1999).

The subset Φ_LINEAR ⊆ Φ_ALL is the set of linear mappings from ∆(A) to ∆(A): that is, φ : ∆(A) → ∆(A) is an element of Φ_LINEAR if for all 0 ≤ α ≤ 1, for all q₁, q₂ ∈ ∆(A),

φ(αq₁ + (1 − α)q₂) = αφ(q₁) + (1 − α)φ(q₂)   (4)

Assuming A is finite, a linear map can be represented by a stochastic matrix, applied on the right to a row vector q. For example, let A = {1, 2, 3, 4}, and let φ represent the linear map F = {1 ↦ 2, 2 ↦ 3, 3 ↦ 4, 4 ↦ 1}:

    φ = [ 0 1 0 0
          0 0 1 0
          0 0 0 1
          1 0 0 0 ]

so that ⟨q₁, q₂, q₃, q₄⟩φ = ⟨q₄, q₁, q₂, q₃⟩. Similarly, the transformations φ^(2) ∈ Φ_EXT and φ^(23) ∈ Φ_INT can be represented as follows:

    φ^(2) = [ 0 1 0 0        φ^(23) = [ 1 0 0 0
              0 1 0 0                   0 0 1 0
              0 1 0 0                   0 0 1 0
              0 1 0 0 ]                 0 0 0 1 ]

so that ⟨q₁, q₂, q₃, q₄⟩φ^(2) = ⟨0, 1, 0, 0⟩ and ⟨q₁, q₂, q₃, q₄⟩φ^(23) = ⟨q₁, 0, q₂ + q₃, q₄⟩.
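The matrix representations above can be checked directly. The following sketch (our own illustration, not from the text) encodes the three example transformations as row-stochastic matrices and applies them on the right to a row vector q:

```python
def apply(q, phi):
    # Right-multiply a row vector by a stochastic matrix: (q phi)_b = sum_a q_a phi[a][b]
    n = len(q)
    return [sum(q[a] * phi[a][b] for a in range(n)) for b in range(n)]

# The cyclic map F = {1 -> 2, 2 -> 3, 3 -> 4, 4 -> 1}
phi_cyc = [[0, 1, 0, 0],
           [0, 0, 1, 0],
           [0, 0, 0, 1],
           [1, 0, 0, 0]]

# phi^(2) in Phi_EXT: every action is mapped to action 2
phi_ext2 = [[0, 1, 0, 0]] * 4

# phi^(23) in Phi_INT: action 2's probability mass is moved onto action 3
phi_int23 = [[1, 0, 0, 0],
             [0, 0, 1, 0],
             [0, 0, 1, 0],
             [0, 0, 0, 1]]

q = [0.1, 0.2, 0.3, 0.4]
```

Applying `apply(q, phi_cyc)` shifts each coordinate's mass to its image under F, `apply(q, phi_ext2)` collapses all mass onto action 2, and `apply(q, phi_int23)` merges q₂ into q₃, exactly as in the displayed products.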


3.2 Existence

Indeed, no-Φ-regret learning algorithms exist. In particular, no-external-regret algorithms pervade the literature. The earliest date back to Blackwell (1956) and Hannan (1957); but, more recently, Foster and Vohra (1993), Littlestone and Warmuth (1994), Freund and Schapire (1995), Fudenberg and Levine (1995), Foster and Vohra (1997), Hart and Mas-Colell (2001), and others have studied such algorithms. Our first theorem establishes the existence of no-Φ-regret algorithms, for all finite sets Φ ⊆ Φ_LINEAR.

Theorem 6 Given a game and a distinguished agent's reward function r, assume rewards are bounded, i.e., r(A × A′) is bounded, so that regrets are bounded, i.e., ρ(A × A′) is bounded. For all finite sets Φ ⊆ Φ_LINEAR, there exists a learning algorithm that satisfies no-Φ-regret.

Proof It suffices to show that for all X ∈ ℝ^Φ \ ℝ^Φ₋, there exists q = q(X) ∈ ∆(A) s.t. X⁺ · ρ_Φ(q, a′) ≤ c, for all a′ ∈ A′ and some c ≥ 0. Letting X = ρ̄_t, for t = 1, 2, …, the result follows from Jafari's generalization of Blackwell's approachability theorem.¹ In fact, we show equality, with c = 0. Let q be the (row) vector representation of a mixed strategy, and let φ be the matrix representation of a linear transformation in Φ. In that case,

0 = X⁺ · ρ_Φ(q, a′)   (5)
  = Σ_{φ∈Φ} X_φ⁺ (r(qφ, a′) − r(q, a′))   (6)
  = Σ_{φ∈Φ} X_φ⁺ r(qφ, a′) − Σ_{φ∈Φ} X_φ⁺ r(q, a′)   (7)
  = Σ_{φ∈Φ} X_φ⁺ Σ_{a∈A} r(a, a′)(qφ)_a − Σ_{φ∈Φ} X_φ⁺ Σ_{a∈A} r(a, a′)q_a   (8)
  = Σ_{a∈A} r(a, a′) [ (q Σ_{φ∈Φ} X_φ⁺ φ)_a − ((Σ_{φ∈Φ} X_φ⁺) q)_a ]   (9)

Equation 6 follows from the definitions of the dot product and ρ_Φ. Equation 8 follows from the definition of expectation. Equation 9 follows via matrix algebra. Now it suffices to show the following:

q (Σ_{φ∈Φ} X_φ⁺ φ) = (Σ_{φ∈Φ} X_φ⁺) q   (10)

Define M : ∆(A) → ∆(A) as follows:

M = (Σ_{φ∈Φ} X_φ⁺ φ) / (Σ_{φ∈Φ} X_φ⁺)   (11)

The function M maps ∆(A), a nonempty, compact, convex set, into itself. Moreover, M is continuous, since all φ ∈ Φ are linear in finite-dimensional Euclidean space, and hence continuous. Therefore, by Brouwer's fixed point theorem, M has a fixed point. Further, since A is finite, M is a stochastic matrix with at least one nonnegative fixed point whose entries sum to 1. Any algorithm for computing such a fixed point of M gives rise to a Λ₀-compatible (thus, a no-Φ-regret) learning algorithm. ∎

1. Blackwell's approachability theorem is not directly applicable, since we do not assume A′ is finite.

Algorithm 1 lists the steps in the no-Φ-regret algorithm that is derived in Theorem 6. At all times t, the agent plays the mixed strategy q^t, realizes some pure action a, and observes an |A|-dimensional reward vector r^t, where r_a^t = r(a, a′_t). Given this reward vector, the agent computes its instantaneous regret in all dimensions φ ∈ Φ, and updates the cumulative regret vector accordingly. The next step is to compute the positive part of the cumulative regret vector. If this quantity is zero, then the algorithm outputs an arbitrary mixed strategy. Otherwise, the algorithm outputs a fixed point of M in Equation 11.

Algorithm 1 No-Φ-Regret Algorithm
1: for t = 1, …, do
2:   play mixed strategy q^t
3:   realize pure action a
4:   observe rewards r^t
5:   for all φ ∈ Φ do
6:     compute instantaneous regret x_φ^t = r^t · e_a φ − r^t · e_a
7:     update cumulative regret vector X_φ^t = X_φ^{t−1} + x_φ^t
8:   end for
9:   compute Y = (X^t)⁺
10:  if Y = 0 then
11:    choose q^{t+1} arbitrarily
12:  else
13:    let M = Σ_{φ∈Φ} Y_φ φ / Σ_{φ∈Φ} Y_φ
14:    solve for a fixed point q^{t+1} = q^{t+1} M
15:  end if
16: end for
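The strategy-update step of Algorithm 1 (lines 9–14) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: Φ is assumed to be given as a list of row-stochastic matrices, and the fixed point of M is computed here by a damped power iteration (our own choice; any fixed-point solver would do, since the damping only removes periodicity without changing the fixed points):

```python
def positive_part(x):
    return [max(v, 0.0) for v in x]

def mat_vec(q, M):
    # Row vector times matrix: (qM)_b = sum_a q_a M[a][b]
    n = len(q)
    return [sum(q[a] * M[a][b] for a in range(n)) for b in range(n)]

def fixed_point(M, iters=10000):
    # Damped power iteration: q <- (q + qM)/2 converges to a fixed point
    # of a row-stochastic M (damping makes the chain aperiodic).
    n = len(M)
    q = [1.0 / n] * n
    for _ in range(iters):
        qM = mat_vec(q, M)
        q = [(q[b] + qM[b]) / 2.0 for b in range(n)]
    return q

def phi_int(a, b, n):
    # phi^(ab): identity except a's probability mass is moved onto b (Eq. 3)
    M = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    M[a][a] = 0.0
    M[a][b] = 1.0
    return M

n = 3
Phi = [phi_int(a, b, n) for a in range(n) for b in range(n) if a != b]
X = [2.0, -1.0, 0.5, 0.0, 1.0, -0.3]   # hypothetical cumulative regrets X_phi
Y = positive_part(X)
total = sum(Y)
# M = sum_phi Y_phi * phi / sum_phi Y_phi  (Equation 11)
M = [[sum(Y[k] * Phi[k][a][b] for k in range(len(Phi))) / total
      for b in range(n)] for a in range(n)]
q = fixed_point(M)   # the next mixed strategy, q^{t+1} = q^{t+1} M
```

With Φ = Φ_INT, as here, this is exactly the internal-regret case; replacing the matrices by constant maps gives the Φ_EXT case.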

Note that the no-external-regret algorithm of Hart and Mas-Colell (2001) is a special case of Algorithm 1, with Φ = Φ_EXT. The no-internal-regret algorithm of Foster and Vohra (1997) is also closely related to Algorithm 1, with Φ = Φ_INT. The only difference between these two algorithms is the normalization constant in Equation 11: Foster and Vohra's algorithm incorporates a max, rather than a sum, but the fixed points are unchanged. Finally, we point out that by generalizing from Λ₀ to Λ₁((x_s)_{s∈S}) = (e^{x_s})_{s∈S}, we arrive at the no-external-regret algorithm of Freund and Schapire (1995), when Φ = Φ_EXT, and the no-internal-regret algorithm of Cesa-Bianchi and Lugosi (2003), when Φ = Φ_INT.


4. Φ⃗-Equilibrium

We now turn our attention to infinitely repeated games, and we explore the relationship between no-regret learning and game-theoretic equilibria. In particular, we define the notion of Φ-equilibrium, and we prove that learning algorithms that satisfy no-Φ-regret converge to the set of Φ-equilibria: no-Φ_EXT-regret algorithms (i.e., no-external-regret algorithms) converge to the set of Φ_EXT-equilibria, which correspond to generalized² minimax equilibria in zero-sum games; no-Φ_INT-regret algorithms (i.e., no-internal-regret algorithms) converge to the set of Φ_INT-equilibria, which correspond to correlated equilibria in general-sum games.

Consider an n-player game Γ = ⟨I, (A_i)_{1≤i≤n}, r⟩. The ith player in I chooses an action from the finite set A_i, and rewards are determined by the function r : A₁ × … × A_n → ℝⁿ. A linear map φ_i : ∆(A_i) → ∆(A_i) can be extended to a linear map φ̃_i : ∆(A₁ × … × A_n) → ∆(A₁ × … × A_n) as follows:

φ̃_i(q)(a_i, a₋ᵢ) ≡ Σ_{b_i∈A_i} q(b_i, a₋ᵢ) φ_i(e_{b_i})(a_i)   (12)

where a₋ᵢ ∈ ∏_{j≠i} A_j. This extension is indeed a probability measure, since

Σ_{a_i,a₋ᵢ} φ̃_i(q)(a_i, a₋ᵢ) = Σ_{a_i,a₋ᵢ} Σ_{b_i∈A_i} q(b_i, a₋ᵢ) φ_i(e_{b_i})(a_i)   (13)
  = Σ_{b_i,a₋ᵢ} q(b_i, a₋ᵢ) Σ_{a_i} φ_i(e_{b_i})(a_i)   (14)
  = 1   (15)

Equation 13 follows from the definition of φ̃_i in Equation 12. Equation 15 follows because φ_i(e_{b_i}) ∈ ∆(A_i) and Σ_{a_i∈A_i} q_i(a_i) = 1, for all q_i ∈ ∆(A_i).

An element q ∈ ∆(A₁ × … × A_n) is called independent if it can be written as the product of n independent elements q_i ∈ ∆(A_i): i.e., q = q₁ × … × q_n. For independent q ∈ ∆(A₁) × … × ∆(A_n) and linear φ_i : ∆(A_i) → ∆(A_i) (re: Equation 20),

φ̃_i(q)(a₁, …, a_{i−1}, a_i, a_{i+1}, …, a_n)   (16)
  = Σ_{b_i∈A_i} (q₁ × … × q_i × … × q_n)(a₁, …, b_i, …, a_n) φ_i(e_{b_i})(a_i)   (17)
  = Σ_{b_i∈A_i} q₁(a₁) … q_i(b_i) … q_n(a_n) φ_i(e_{b_i})(a_i)   (18)
  = q₁(a₁) … q_{i−1}(a_{i−1}) q_{i+1}(a_{i+1}) … q_n(a_n) Σ_{b_i∈A_i} q_i(b_i) φ_i(e_{b_i})(a_i)   (19)
  = q₁(a₁) … q_{i−1}(a_{i−1}) q_{i+1}(a_{i+1}) … q_n(a_n) φ_i(Σ_{b_i∈A_i} q_i(b_i) e_{b_i})(a_i)   (20)
  = q₁(a₁) … q_{i−1}(a_{i−1}) q_{i+1}(a_{i+1}) … q_n(a_n) φ_i(q_i)(a_i)   (21)
  = q₁(a₁) … q_{i−1}(a_{i−1}) φ_i(q_i)(a_i) q_{i+1}(a_{i+1}) … q_n(a_n)   (22)

2. The notion of minimax equilibrium that arises in this context, namely Φ_EXT-equilibrium, allows for correlations in the players' strategies, unlike von Neumann's concept of minimax equilibrium.
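The extension in Equation 12, and its product property for independent q, can be verified numerically. The following sketch (our own illustration; the particular map φ and distributions are arbitrary choices) computes φ̃₁(q) for a two-player game and checks both that it is a probability distribution (Equations 13–15) and that φ̃₁(q₁ × q₂) = φ₁(q₁) × q₂ (Equations 16–22):

```python
def extend(phi, q, n1, n2):
    # phi: n1 x n1 row-stochastic matrix for player 1 (row b is phi(e_b))
    # q: joint distribution over A1 x A2, indexed q[a1][a2]
    # Equation 12: phi~(q)(a1, a2) = sum_b q(b, a2) * phi[b][a1]
    return [[sum(q[b][a2] * phi[b][a1] for b in range(n1))
             for a2 in range(n2)] for a1 in range(n1)]

n1, n2 = 2, 3
phi = [[0.0, 1.0], [0.5, 0.5]]   # some linear map on Delta(A1)
q1 = [0.25, 0.75]
q2 = [0.2, 0.3, 0.5]
# independent joint distribution q = q1 x q2
q = [[q1[a] * q2[b] for b in range(n2)] for a in range(n1)]

ext = extend(phi, q, n1, n2)
# phi(q1) as a row-vector-times-matrix product
phi_q1 = [sum(q1[b] * phi[b][a] for b in range(n1)) for a in range(n1)]
```

For this independent q, every entry of `ext` equals `phi_q1[a1] * q2[a2]`, matching the product form derived above.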

Thus, the map φ̃_i : ∆(A₁ × … × A_n) → ∆(A₁ × … × A_n) extends φ_i : ∆(A_i) → ∆(A_i) with the property that φ̃_i(q) = q₁ × … × φ_i(q_i) × … × q_n, for independent q and linear φ_i. This derivation is intended to motivate the definition of φ̃_i.

Definition 7 Given a game Γ and given a vector Φ⃗ = (Φ_i)_{1≤i≤n}, with Φ_i ⊆ Φ_LINEAR for 1 ≤ i ≤ n, an element q ∈ ∆(A₁ × … × A_n) is called a Φ⃗-equilibrium if r_i(q) ≥ r_i(φ̃_i(q)), for all players i and for all φ_i ∈ Φ_i.

If Φ_i = Φ for 1 ≤ i ≤ n, then we refer to a Φ⃗-equilibrium as a Φ-equilibrium.³

4.1 Examples of Φ-Equilibrium

Generalized minimax, correlated, and Nash equilibrium are all special cases of Φ-equilibrium.

Generalized Minimax Equilibrium Letting Φ_i = Φ_EXT for all players i, for φ^(α) ∈ Φ_i and for (correlated) q ∈ ∆(A₁ × … × A_n), the expression r_i(φ̃^(α)(q)) simplifies as follows:

r_i(φ̃^(α)(q)) = Σ_{a_i,a₋ᵢ} r_i(a_i, a₋ᵢ) Σ_{b_i∈A_i} q(b_i, a₋ᵢ) φ^(α)(e_{b_i})(a_i)   (23)
  = Σ_{a_i,a₋ᵢ} r_i(a_i, a₋ᵢ) Σ_{b_i∈A_i} q(b_i, a₋ᵢ) e_α(a_i)   (24)
  = Σ_{a_i,a₋ᵢ} r_i(a_i, a₋ᵢ) 1_{α=a_i} Σ_{b_i∈A_i} q(b_i, a₋ᵢ)   (25)
  = Σ_{a₋ᵢ} r_i(α, a₋ᵢ) Σ_{b_i∈A_i} q(b_i, a₋ᵢ)   (26)
  = Σ_{a₋ᵢ} r_i(α, a₋ᵢ) q₋ᵢ(a₋ᵢ)   (27)
  = r_i(α, q₋ᵢ)   (28)

Here, q₋ᵢ ∈ ∆(∏_{j≠i} A_j). Thus, q ∈ ∆(A₁ × … × A_n) is a Φ_EXT-equilibrium if r_i(q) ≥ r_i(α, q₋ᵢ) for all α ∈ A_i. In two-player, zero-sum games, the Φ_EXT-equilibrium value coincides with the minimax equilibrium value (see, for example, Fudenberg and Levine (1995)). Note, however, that the sets of Φ_EXT-equilibria and minimax equilibria need not coincide, since the former allows for correlations while the latter does not.

Correlated Equilibrium Letting Φ_i = Φ_INT for all players i, for φ_i^(αβ) ∈ Φ_i and for (correlated) q ∈ ∆(A₁ × … × A_n), the expression r_i(φ̃^(αβ)(q)) simplifies as follows:

r_i(φ̃^(αβ)(q))   (29)
  = Σ_{a_i,a₋ᵢ} r_i(a_i, a₋ᵢ) Σ_{b_i∈A_i} q(b_i, a₋ᵢ) φ^(αβ)(e_{b_i})(a_i)   (30)
  = Σ_{a_i,a₋ᵢ} r_i(a_i, a₋ᵢ) Σ_{b_i∈A_i} q(b_i, a₋ᵢ) { e_β if b_i = α; e_{b_i} otherwise }(a_i)   (31)

3. Technically, if two players have different action sets (i.e., A_i ≠ A_j for j ≠ i), then they cannot share the same set of transformations because their transformations have different domains and ranges. For our purposes, it suffices that the Φ_i share some property (e.g., Φ_i = Φ_EXT(A_i)), for 1 ≤ i ≤ n.


  = Σ_{a_i,a₋ᵢ} r_i(a_i, a₋ᵢ) Σ_{b_i∈A_i} q(b_i, a₋ᵢ) { 1_{a_i=β} if b_i = α; 1_{a_i=b_i} otherwise }   (32)
  = Σ_{a_i,a₋ᵢ} r_i(a_i, a₋ᵢ) [ Σ_{b_i=α} q(b_i, a₋ᵢ) 1_{a_i=β} + Σ_{b_i≠α} q(b_i, a₋ᵢ) 1_{a_i=b_i} ]   (33)
  = Σ_{a_i,a₋ᵢ} r_i(a_i, a₋ᵢ) [ q(α, a₋ᵢ) 1_{a_i=β} + q(a_i, a₋ᵢ) 1_{a_i≠α} ]   (34)
  = Σ_{a₋ᵢ} r_i(β, a₋ᵢ) q(α, a₋ᵢ) + Σ_{a_i,a₋ᵢ} r_i(a_i, a₋ᵢ) q(a_i, a₋ᵢ) − Σ_{a₋ᵢ} r_i(α, a₋ᵢ) q(α, a₋ᵢ)   (35)

Therefore, since

r_i(q) = Σ_{a_i,a₋ᵢ} r_i(a_i, a₋ᵢ) q(a_i, a₋ᵢ)   (36)

it follows that

r_i(q) − r_i(φ̃^(αβ)(q)) ≥ 0   ∀α, β ∈ A_i   (37)
iff Σ_{a₋ᵢ} q(α, a₋ᵢ)(r_i(α, a₋ᵢ) − r_i(β, a₋ᵢ)) ≥ 0   ∀α, β ∈ A_i   (38)

Equation 38, holding for all players i, is precisely the definition of correlated equilibrium (Aumann, 1974).

Nash Equilibrium A Nash equilibrium is an element q ∈ ∆(A₁) × … × ∆(A_n) such that r_i(q) ≥ r_i(q₁, …, a′_i, …, q_n), for all players i and for all actions a′_i ∈ A_i. Notice that a Nash equilibrium is an independent element of ∆(A₁ × … × A_n). Thus, by definition, a Nash equilibrium is an independent Φ_EXT-equilibrium. Indeed, a Nash equilibrium is also an independent Φ_INT-equilibrium, since a Φ_INT-equilibrium is a correlated equilibrium, and a Nash equilibrium is an independent correlated equilibrium. Therefore, the set of independent Φ_EXT-equilibria is identical to the set of independent Φ_INT-equilibria.

4.2 Properties of Φ⃗-Equilibrium

Next we discuss two convexity properties of the set of Φ⃗-equilibria.

Proposition 8 Given a game Γ and given a vector Φ⃗ = (Φ_i)_{1≤i≤n}, with Φ_i ⊆ Φ_LINEAR for 1 ≤ i ≤ n, the set of Φ⃗-equilibria is convex.

Proof If q, q′ ∈ ∆(A₁ × … × A_n) are both Φ⃗-equilibria, then r_i(q) ≥ r_i(φ̃_i(q)) and r_i(q′) ≥ r_i(φ̃_i(q′)), for all players i and for all φ_i ∈ Φ_i. Because r_i is linear,

r_i(αq + (1 − α)q′) = αr_i(q) + (1 − α)r_i(q′)
  ≥ αr_i(φ̃_i(q)) + (1 − α)r_i(φ̃_i(q′))
  = r_i(αφ̃_i(q) + (1 − α)φ̃_i(q′))
  = r_i(φ̃_i(αq + (1 − α)q′))

for all α ∈ [0, 1] and for all players i. Thus, αq + (1 − α)q′ is a Φ⃗-equilibrium. ∎

Proposition 9 Given a game Γ and given a vector Φ⃗ = (Φ_i)_{1≤i≤n}, with Φ_i ⊆ Φ_LINEAR for 1 ≤ i ≤ n, if q ∈ ∆(A₁ × … × A_n) is a Φ⃗-equilibrium, then q is also a Φ⃗′-equilibrium, where Φ⃗′ = (Φ′_i)_{1≤i≤n} and Φ′_i is the convex hull of Φ_i for 1 ≤ i ≤ n.

Proof Since q is a Φ⃗-equilibrium, r_i(q) ≥ r_i(φ̃_i(q)), for all players i and for all φ_i ∈ Φ_i. In particular, r_i(q) ≥ r_i(φ̃_{i1}(q)) and r_i(q) ≥ r_i(φ̃_{i2}(q)), for any player i and any φ_{i1}, φ_{i2} ∈ Φ_i. Since r_i is linear,

r_i(q) = αr_i(q) + (1 − α)r_i(q)
  ≥ αr_i(φ̃_{i1}(q)) + (1 − α)r_i(φ̃_{i2}(q))
  = r_i(αφ̃_{i1}(q) + (1 − α)φ̃_{i2}(q))
  = r_i((αφ̃_{i1} + (1 − α)φ̃_{i2})(q))

for all α ∈ [0, 1]. Thus, since player i, φ_{i1}, and φ_{i2} were arbitrary, q is a Φ⃗′-equilibrium. ∎

4.3 No-Φ⃗-Regret Learning and Φ⃗-Equilibrium

In this section, we establish a fundamental relationship between no-regret algorithms and game-theoretic equilibria: if all players i play via some no-Φ_i-regret learning algorithm, then the joint empirical distribution of play approaches the set of Φ⃗-equilibria, almost surely. This result is well-known in the special cases where Φ_i = Φ_EXT, for all players i, and Φ_i = Φ_INT, for all players i: i.e., no-external-regret learning converges to the set of minimax-valued equilibria and no-internal-regret learning converges to the set of correlated equilibria. To prove this claim, we introduce the following notion of no-Φ-regret on the path of play:

Definition 10 Given a game with vector-valued function ρ, a learning algorithm A = q₁, q₂, … is said to exhibit no-Φ-regret on the path of play q′₁, q′₂, … if lim_{t→∞} d(G, ρ̄_t) = lim_{t→∞} inf_{g∈G} d(g, ρ̄_t) = 0, almost surely, where ρ̄_t is the average value of ρ through time t. Note that an algorithm that exhibits no-Φ-regret learning does so on all paths of play.

Equipped with Definition 10, we can now establish necessary and sufficient conditions for convergence to the set of Φ⃗-equilibria: each player i plays according to a learning algorithm that exhibits no-Φ_i-regret on the path of play if and only if the joint empirical distribution of play approaches the set of Φ⃗-equilibria, almost surely.

Given an infinitely repeated game Γ∞, define the empirical distribution z_t of play through time t as follows: for all actions a_i ∈ A_i and a₋ᵢ ∈ A₋ᵢ,

z_t(a_i, a₋ᵢ) = (1/t) Σ_{τ=1}^t 1_{a_{i,τ}=a_i} 1_{a_{−i,τ}=a_{−i}}   (39)

1_{x=y} denotes the indicator function, which equals 1 whenever x = y, and 0 otherwise. Given a vector Φ⃗ = (Φ_i)_{1≤i≤n}, let Z = {q ∈ ∆(A₁ × … × A_n) | r_i(q) ≥ r_i(φ̃_i(q)), ∀i, ∀φ_i ∈ Φ_i}.
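Equation 38 gives a directly checkable characterization of correlated equilibrium. As a concrete illustration (our own example, not from the text), the following sketch verifies the inequalities for a standard Chicken-style game with payoffs (0,0), (7,2), (2,7), (6,6), whose well-known correlated equilibrium puts probability 1/3 on each of the three non-(Dare, Dare) outcomes:

```python
def is_correlated_eq(q, r, n_actions, tol=1e-9):
    # q[a1][a2]: joint distribution; r[i][a1][a2]: player i's reward.
    # Equation 38: for each player i and each pair alpha, beta,
    #   sum_{a_-i} q(alpha, a_-i) * (r_i(alpha, a_-i) - r_i(beta, a_-i)) >= 0
    for alpha in range(n_actions):
        for beta in range(n_actions):
            # player 1 (row player)
            if sum(q[alpha][a2] * (r[0][alpha][a2] - r[0][beta][a2])
                   for a2 in range(n_actions)) < -tol:
                return False
            # player 2 (column player)
            if sum(q[a1][alpha] * (r[1][a1][alpha] - r[1][a1][beta])
                   for a1 in range(n_actions)) < -tol:
                return False
    return True

# Chicken: action 0 = Dare, action 1 = Chicken
r1 = [[0, 7], [2, 6]]
r2 = [[0, 2], [7, 6]]
r = [r1, r2]
third = 1.0 / 3.0
q_ce = [[0.0, third], [third, third]]   # the classic correlated equilibrium
q_bad = [[1.0, 0.0], [0.0, 0.0]]        # point mass on (Dare, Dare): not a CE
```

Here `q_ce` satisfies every inequality in Equation 38, while the point mass on (Dare, Dare) violates the inequality for α = Dare, β = Chicken.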


In this section, we show the following: if all Φi ⊆ ΦLINEAR are finite and all φi ∈ Φi are continuous, then each player i plays according to a learning algorithm that exhibits no-Φi -regret on the path of play if and only if d(zt , Z) → 0 as t → ∞, almost surely. Our proof of this result relies on a technical lemma, the statement and proof of which appear in Appendix B. We apply this lemma via the following immediate corollary. Corollary 11 Fix a player i and assume φi is a continuous, linear transformation. Define fi (q) = ri (φ˜i (q)) − ri (q), for all q ∈ ∆(A1 × . . . × An ). Now d (zt , Z) → 0 as t → ∞ if and only if d (fi (zt ), R− ) → 0 as t → ∞. ~ = (Φi )1≤i≤n with finite Theorem 12 Given an infinitely-repeated game Γ∞ and a vector Φ Φi ⊆ ΦLINEAR and all φi ∈ Φi continuous, each player i plays according to a learning algorithm that exhibits no-Φi -regret on the path of play if and only if the joint empirical ~ distribution of play approaches the set of Φ-equilibria, almost surely. Proof By Corollary 11, it suffices to show that all players i exhibit no-Φi -regret learning algorithm on the path of play if and only if for all players i and for all φi ∈ Φi , ri (φ˜i (zt )) − ri (zt ) → R− as t → ∞, almost surely. First, for arbitrary player i and for arbitrary φi ∈ Φi , X ri (φ˜i (zt )) = φ˜i (zt )(ai , a−i )ri (ai , a−i ) (40) ai ,a−i

=

X X

zt (bi , a−i )φi (ebi )(ai )ri (ai , a−i )

(41)

ai ,a−i bi ∈Ai

=

X

zt (bi , a−i )ri (φi (ebi ), a−i )

(42)

bi ,a−i t

=

1X ri (φi (eai,τ ), a−i,τ ) t τ =1

(43)

In Equation 40, we expand the definition of expected rewards, whereas in Equation 42 we collapse this definition. Equation 41 relies on the definition of φ˜i , the extension of φi . Equation 43 follows from the definition of the empirical distribution zt in Equation 39. Second, for arbitrary player i, t

ri (zt ) =

1X ri (ai,τ , a−i,τ ) t

(44)

τ =1

Case (⇒): Observe the following: for all ε > 0,

$$ \exists \phi_i \in \Phi_i \text{ s.t. } r_i(\tilde\phi_i(z_t)) - r_i(z_t) \ge \varepsilon \tag{45} $$
$$ \text{iff } \exists \phi_i \in \Phi_i \text{ s.t. } \frac{1}{t}\sum_{\tau=1}^{t} r_i(\phi_i(e_{a_{i,\tau}}), a_{-i,\tau}) - \frac{1}{t}\sum_{\tau=1}^{t} r_i(a_{i,\tau}, a_{-i,\tau}) \ge \varepsilon \tag{46} $$
$$ \text{iff } \exists \phi_i \in \Phi_i \text{ s.t. } (\bar\rho_t)_{\phi_i} \ge \varepsilon \tag{47} $$
$$ \Rightarrow\; d\!\left(R^{\Phi_i}_-, \bar\rho_{\Phi_i,t}\right) \ge \varepsilon \tag{48} $$

Equation 46 follows from Equations 43 and 44, and Equation 48 follows from the definition of Euclidean distance. Now if all players i exhibit no-Φi-regret on the path of play, then, for sufficiently large t0,

$$ P\!\left(\exists t \ge t_0,\ \phi_i \in \Phi_i \text{ s.t. } r_i(\tilde\phi_i(z_t)) - r_i(z_t) \ge \varepsilon\right) \tag{49} $$
$$ \le P\!\left(\exists t \ge t_0 \text{ s.t. } d\!\left(R^{\Phi_i}_-, \bar\rho_{\Phi_i,t}\right) \ge \varepsilon\right) \tag{50} $$
$$ < \varepsilon \tag{51} $$

Thus, for all players i and for all φi ∈ Φi, ri(φ̃i(zt)) − ri(zt) → R− as t → ∞, almost surely.

Case (⇐): Observe the following: for all ε > 0,

$$ d\!\left(R^{\Phi_i}_-, \bar\rho_{\Phi_i,t}\right) \ge \varepsilon \tag{52} $$
$$ \Rightarrow\; \exists \phi_i \in \Phi_i \text{ s.t. } (\bar\rho_t)_{\phi_i} \ge |\Phi_i|^{-\frac{1}{2}}\,\varepsilon \tag{53} $$
$$ \text{iff } \exists \phi_i \in \Phi_i \text{ s.t. } \frac{1}{t}\sum_{\tau=1}^{t} r_i(\phi_i(e_{a_{i,\tau}}), a_{-i,\tau}) - \frac{1}{t}\sum_{\tau=1}^{t} r_i(a_{i,\tau}, a_{-i,\tau}) \ge |\Phi_i|^{-\frac{1}{2}}\,\varepsilon \tag{54} $$
$$ \text{iff } \exists \phi_i \in \Phi_i \text{ s.t. } r_i(\tilde\phi_i(z_t)) - r_i(z_t) \ge |\Phi_i|^{-\frac{1}{2}}\,\varepsilon \tag{55} $$

As above, Equation 53 follows from the definition of Euclidean distance, and Equation 55 follows from Equations 43 and 44. Now if for all players i and for all φi ∈ Φi, ri(φ̃i(zt)) − ri(zt) → R− as t → ∞, almost surely, then, for sufficiently large t0,

$$ P\!\left(\exists t \ge t_0 \text{ s.t. } d\!\left(R^{\Phi_i}_-, \bar\rho_{\Phi_i,t}\right) \ge \varepsilon\right) \tag{56} $$
$$ \le P\!\left(\exists t \ge t_0,\ \phi_i \in \Phi_i \text{ s.t. } r_i(\tilde\phi_i(z_t)) - r_i(z_t) \ge |\Phi_i|^{-\frac{1}{2}}\,\varepsilon\right) \tag{57} $$
$$ < |\Phi_i|^{-\frac{1}{2}}\,\varepsilon \tag{58} $$
$$ \le \varepsilon \tag{59} $$

Therefore, player i's learning algorithm exhibits no-Φi-regret on the path of play.
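The chain of Equations 40–44 asserts that the φ-transformed expected reward under the empirical distribution zt is just a time average over the realized history. The sketch below (pure Python; the game, history, and transformation are illustrative, not from the paper) computes a player's empirical φ-regret both ways and checks that the two agree.

```python
from collections import Counter

def phi_regret_time_avg(history, reward, phi):
    # Equations 43 and 44: time-average rewards of transformed vs. actual play.
    t = len(history)
    transformed = sum(p * reward(b, aj)
                      for ai, aj in history
                      for b, p in enumerate(phi[ai])) / t
    actual = sum(reward(ai, aj) for ai, aj in history) / t
    return transformed - actual

def phi_regret_empirical(history, reward, phi):
    # Equations 40-42: the same quantity computed via the empirical distribution z_t.
    t = len(history)
    z = Counter(history)  # z_t(b, a_{-i}): frequency of each joint action
    transformed = sum((c / t) * p * reward(b, aj)
                      for (ai, aj), c in z.items()
                      for b, p in enumerate(phi[ai]))
    actual = sum((c / t) * reward(ai, aj) for (ai, aj), c in z.items())
    return transformed - actual

# Illustrative 2-action game (matching-pennies rewards for player i) and a
# linear transformation, given row-wise, that swaps the two actions.
reward = lambda ai, aj: 1.0 if ai == aj else -1.0
swap = [[0.0, 1.0], [1.0, 0.0]]
history = [(0, 0), (0, 1), (1, 1), (0, 0)]  # (own action, opponent action)

r1 = phi_regret_time_avg(history, reward, swap)
r2 = phi_regret_empirical(history, reward, swap)
assert abs(r1 - r2) < 1e-12
```

Here the regret is negative: on this history the player already outperforms the swapped version of its own play.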

5. No-Internal-Regret Learning: A Closer Look

Perhaps surprisingly, no-internal-regret is the strongest form of no-Φ-regret, for all finite Φ ⊆ ΦLINEAR. Therefore, the tightest game-theoretic solution concept to which such no-Φ-regret learning algorithms can be proven to converge is correlated equilibrium. In particular, Nash equilibrium is not a necessary outcome of learning via such no-Φ-regret algorithms.

Lemma 13 If learning algorithm A satisfies no-Φ-regret, for some finite set Φ ⊆ ΦLINEAR, then A also satisfies no-Φ′-regret, for all subsets Φ′ ⊆ sch(Φ), the super convex hull of Φ, defined as follows:

$$ \mathrm{sch}(\Phi) = \left\{ \sum_{i=1}^{k+1} \alpha_i \phi_i \;\middle|\; \phi_i \in \Phi \text{ for } 1 \le i \le k;\ \phi_{k+1} = I, \text{ the identity map};\ \alpha_i \ge 0 \text{ for } 1 \le i \le k;\ \alpha_{k+1} \in \mathbb{R};\ \text{and } \sum_{i=1}^{k+1} \alpha_i = 1 \right\} $$

Proof For φ′ ∈ Φ′ ⊆ sch(Φ), with φ′ = Σ_{i=1}^{k+1} αiφi as in the definition of sch(Φ),

$$ \rho_{\phi'}(a, a') = r(\phi'(e_a), a') - r(a, a') \tag{60} $$
$$ = r\!\left(\sum_{i=1}^{k+1} \alpha_i \phi_i(e_a),\ a'\right) - r(a, a') \tag{61} $$
$$ = \sum_{i=1}^{k+1} \alpha_i\, r(\phi_i(e_a), a') - \sum_{i=1}^{k+1} \alpha_i\, r(a, a') \tag{62} $$
$$ = \sum_{i=1}^{k+1} \alpha_i \left( r(\phi_i(e_a), a') - r(a, a') \right) \tag{63} $$
$$ = \sum_{i=1}^{k+1} \alpha_i\, \rho_{\phi_i}(a, a') \tag{64} $$
$$ = \sum_{i=1}^{k} \alpha_i\, \rho_{\phi_i}(a, a') \tag{65} $$

In the last step, ρ_{φ_{k+1}}(a, a′) = ρ_I(a, a′) = r(I(e_a), a′) − r(a, a′) = r(a, a′) − r(a, a′) = 0. Thus, we can define a transformation F : R^Φ → R^{Φ′} such that F(ρ_Φ(a, a′)) = ρ_{Φ′}(a, a′), for all a, a′. Because the αi, for 1 ≤ i ≤ k, are all nonnegative, F exhibits the following property:

$$ \rho_\Phi(a, a') \in R^{\Phi}_- \;\Rightarrow\; F(\rho_\Phi(a, a')) = \rho_{\Phi'}(a, a') \in R^{\Phi'}_- \tag{66} $$

i.e., F(R^Φ_−) ⊆ R^{Φ′}_−. Further, since Φ is finite, R^Φ is a finite-dimensional Euclidean space, so that F, which is linear, is also continuous. Now, for all δ > 0 there exists a γ > 0 such that

$$ d\!\left(R^{\Phi'}_-, \bar\rho_{\Phi',t}\right) \ge \delta \;\Rightarrow\; d\!\left(F(R^{\Phi}_-), \bar\rho_{\Phi',t}\right) \ge \delta \tag{67} $$
$$ \text{iff } d\!\left(F(R^{\Phi}_-), F(\bar\rho_{\Phi,t})\right) \ge \delta \tag{68} $$
$$ \Rightarrow\; d\!\left(R^{\Phi}_-, \bar\rho_{\Phi,t}\right) \ge \gamma \tag{69} $$

Equation 67 follows from Equation 66. Equation 68 follows by linearity and the definition of F. Equation 69 follows by continuity. Finally, for all ε > 0 and for all t,

$$ P\!\left(\exists t' \ge t \text{ s.t. } d(R^{\Phi}_-, \bar\rho_{\Phi,t'}) \ge \gamma\right) < \varepsilon \;\Rightarrow\; P\!\left(\exists t' \ge t \text{ s.t. } d(R^{\Phi'}_-, \bar\rho_{\Phi',t'}) \ge \delta\right) < \varepsilon \tag{70} $$

Therefore, no-Φ-regret implies no-Φ′-regret.

Corollary 14 If learning algorithm A satisfies no-Φ-regret, for some finite set Φ ⊆ ΦLINEAR, then A also satisfies no-Φ′-regret, for all subsets Φ′ ⊆ ch(Φ), the convex hull of Φ.

Proof Choose α_{k+1} = 0.

Theorem 15 If learning algorithm A satisfies no-internal-regret, then A also satisfies no-Φ-regret for all subsets Φ ⊆ ΦLINEAR.

Proof First note that the set of stochastic matrices represents ΦLINEAR, the set of linear maps on ∆(A), since A is finite, by assumption. Now define an elementary matrix as one with exactly one 1 per row, and 0's elsewhere. By Corollary 14, if an algorithm is no-Φ-regret for the set Φ of elementary matrices, then the algorithm satisfies no-Φ′-regret for all subsets Φ′ of the set of stochastic matrices, since the set of stochastic matrices is the convex hull of the set of elementary matrices. Thus, it suffices to show that an algorithm that satisfies no-internal-regret is no-Φ-regret for the set Φ of elementary matrices. By Lemma 13, it further suffices to show that any elementary matrix M can be expressed as follows: M = α1φ1 + . . . + αkφk + α_{k+1}I, where φi ∈ ΦINT, I is the identity map, αi ≥ 0 for 1 ≤ i ≤ k, α_{k+1} ∈ ℝ, and Σ_{i=1}^{k+1} αi = 1.

Let A = {1, . . . , m}. Let M(n1, . . . , nm) denote the elementary matrix with 0's everywhere except 1's at entries (i, ni) for 1 ≤ i ≤ m. Now the linear map φ_{i,j} defined in Equation 3 is represented by the elementary matrix with 1's on the diagonal except in row i, where the 1 appears in column j. Thus, φ_{1n1} + . . . + φ_{mnm} corresponds to the matrix with 1's at entries (i, ni) for 1 ≤ i ≤ m, plus (m − 1)'s on the diagonal. It follows that M(n1, . . . , nm) = φ_{1n1} + . . . + φ_{mnm} − (m − 1)I.
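The matrix identity at the end of this proof can be verified mechanically. In the sketch below (our code; the function names and the example are illustrative), phi_internal builds the matrix of the internal-regret map φ_{i,j}, and the decomposition φ_{1n1} + . . . + φ_{mnm} − (m − 1)I is checked against the elementary matrix M(n1, . . . , nm).

```python
def phi_internal(m, i, j):
    # The map phi_{i,j}: identity on every row except row i, whose 1 moves to column j.
    mat = [[1.0 if r == c else 0.0 for c in range(m)] for r in range(m)]
    mat[i] = [1.0 if c == j else 0.0 for c in range(m)]
    return mat

def elementary(m, targets):
    # M(n_1, ..., n_m): exactly one 1 per row, in column targets[i].
    return [[1.0 if c == targets[r] else 0.0 for c in range(m)] for r in range(m)]

def decompose(m, targets):
    # phi_{1 n_1} + ... + phi_{m n_m} - (m - 1) I, as in the proof of Theorem 15.
    total = [[0.0] * m for _ in range(m)]
    for i, ni in enumerate(targets):
        phi = phi_internal(m, i, ni)
        for r in range(m):
            for c in range(m):
                total[r][c] += phi[r][c]
    for r in range(m):
        total[r][r] -= m - 1
    return total

m, targets = 3, [2, 0, 1]
assert decompose(m, targets) == elementary(m, targets)
```

Note that the coefficients sum to m − (m − 1) = 1, with the identity's coefficient −(m − 1) negative, which is exactly why the super convex hull of Lemma 13 (rather than the ordinary convex hull) is needed.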

6. Conclusion

In this article, we defined a general class of no-regret learning algorithms, called no-Φ-regret learning algorithms, which spans the spectrum from no-external-regret learning to no-internal-regret learning and beyond. Analogously, we defined a general class of game-theoretic equilibria, called Φ-equilibria, and we showed that the empirical distribution of play of no-Φ-regret algorithms converges to the set of Φ-equilibria. In the future, we hope to extend our results to arbitrary action sets, rather than assume A finite, and to extend our existence theorem to handle arbitrary sets of transformations, rather than just finite subsets of ΦLINEAR. Our present interest in finite action sets and finite sets of transformations is not limiting, however, in our ongoing simulation work, where we are investigating the short- and medium-term dynamics of no-Φ-regret algorithms.

Appendix A. Blackwell's Approachability Theorem: A Generalization

For completeness, we include in this appendix a proof of Jafari's generalization of Blackwell's approachability theorem. This appendix is self-contained: we prove this theorem from first principles, developing the techniques used in the appendix of Foster and Vohra (1997).

Let (Ω, F, P) be a probability space with a filtration (Ft : t ≥ 0). A stochastic process Z = (Zt : t ≥ 0) is said to be adapted to a filtration (Ft : t ≥ 0) if Zt is Ft-measurable, for all times t: i.e., if the value of Zt is determined by Ft, the information available at time t. We denote by Et the conditional expectation with respect to Ft.

Greenwald, Jafari, and Marks

Lemma 16 Assume the following:

1. Z = (Zt : t ≥ 0) is an adapted process such that ∀t, 0 ≤ Zt < k a.s., for some k ∈ ℝ;
2. Et−1[Zt] ≤ ct a.s., for t ≥ 1, and E[Z0] ≤ c0, where ct ∈ ℝ, for all t.

For fixed T,

$$ E\!\left[\prod_{t=0}^{T} Z_t\right] \le \prod_{t=0}^{T} c_t \tag{71} $$

Proof The proof is by induction. The claim holds for T = 0 by assumption. We assume it also holds for T and show it holds for T + 1:

$$ E\!\left[\prod_{t=0}^{T+1} Z_t\right] = E\!\left[E_T\!\left[\prod_{t=0}^{T+1} Z_t\right]\right] \tag{72} $$
$$ = E\!\left[E_T\!\left[\left(\prod_{t=0}^{T} Z_t\right) Z_{T+1}\right]\right] \tag{73} $$
$$ = E\!\left[\left(\prod_{t=0}^{T} Z_t\right) E_T[Z_{T+1}]\right] \tag{74} $$
$$ \le E\!\left[\left(\prod_{t=0}^{T} Z_t\right) c_{T+1}\right] \tag{75} $$
$$ = c_{T+1}\, E\!\left[\prod_{t=0}^{T} Z_t\right] \tag{76} $$
$$ \le \prod_{t=0}^{T+1} c_t \tag{77} $$

Line (72) follows from the tower property, also known as the law of iterated expectations: if a random variable X satisfies E[|X|] < ∞ and H is a sub-σ-algebra of G, which in turn is a sub-σ-algebra of F, then E[E[X | G] | H] = E[X | H] a.s. (Williams, 1991). Note that $E[\prod_{t=0}^{T+1} Z_t] < \infty$. Line (74) follows because $\prod_{t=0}^{T} Z_t$ is F_T-measurable and $E[\prod_{t=0}^{T} Z_t] < \infty$. Line (75) follows by assumption, noting that Zt, for t ≥ 0, is nonnegative with probability 1. Line (77) follows from the induction hypothesis.
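Lemma 16 can be verified exactly on a toy two-step process (all values ours) in which Z1's conditional law genuinely depends on Z0, so the conditional bounds ct do real work. All expectations below are computed by enumerating the four sample paths.

```python
# A two-step adapted process on a binary tree: Z1's distribution depends on Z0.
z0_vals = [(0.5, 0.5), (1.5, 0.5)]          # (value, probability) pairs for Z0
z1_given = {0.5: [(1.0, 0.5), (2.0, 0.5)],  # E[Z1 | Z0 = 0.5] = 1.5
            1.5: [(0.0, 0.5), (1.0, 0.5)]}  # E[Z1 | Z0 = 1.5] = 0.5

c0 = sum(v * p for v, p in z0_vals)  # c0 bounds E[Z0] (here with equality)
# c1 must dominate E[Z1 | Z0] on every branch, so take the worst branch.
c1 = max(sum(v * p for v, p in branch) for branch in z1_given.values())

# E[Z0 * Z1] by exact enumeration of the four sample paths.
expected_product = sum(p0 * p1 * v0 * v1
                       for v0, p0 in z0_vals
                       for v1, p1 in z1_given[v0])

assert expected_product <= c0 * c1  # Lemma 16: E[prod Z_t] <= prod c_t
```

On this example the product bound is loose (0.75 versus 1.5), because the high-value branch of Z0 is paired with the low-value branch of Z1.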

Lemma 17 Assume the following:

1. (Mt : t ≥ 0) is a supermartingale, i.e., (Mt : t ≥ 0) is an adapted process s.t. for all t, E[|Mt|] < ∞ and, for t ≥ 1, Et−1[Mt] ≤ Mt−1 a.s.;
2. f is a nondecreasing positive function s.t. for t ≥ 1, |Mt − Mt−1| ≤ f(t) a.s.

If M0 = m ∈ ℝ a.s., then for fixed T,

$$ P[M_T \ge 2\varepsilon T f(T)] \le e^{\varepsilon m / f(0) - \varepsilon^2 T}, \quad \text{for all } \varepsilon \in [0, 1]. $$

Proof For t ≥ 0, let

$$ Y_t = \frac{M_t}{f(T)} \tag{78} $$

and for t ≥ 1, let Xt = Yt − Yt−1, so that

$$ Y_t = \sum_{\tau=0}^{t} X_\tau. \tag{79} $$

Note that X0 = Y0 = m/f(0) a.s. Because Mt is a supermartingale and f(t) is positive, for t ≥ 1,

$$ E_{t-1}[X_t] = E_{t-1}[Y_t] - Y_{t-1} = \frac{E_{t-1}[M_t] - M_{t-1}}{f(T)} \le 0 \quad \text{a.s.} \tag{80} $$

Because f is nondecreasing, f(t) ≤ f(T) for all t ≤ T; hence, for t ≥ 1,

$$ |X_t| = |Y_t - Y_{t-1}| = \left| \frac{M_t}{f(T)} - \frac{M_{t-1}}{f(T)} \right| \le \frac{f(t)}{f(T)} \le 1 \quad \text{a.s.} \tag{81} $$

Thus, for t ≥ 1,

$$ E_{t-1}\!\left[e^{\varepsilon X_t}\right] \le 1 + \varepsilon E_{t-1}[X_t] + \varepsilon^2 E_{t-1}[X_t^2] \le 1 + \varepsilon^2 \quad \text{a.s.} \tag{82} $$

The first inequality follows from the fact that e^y ≤ 1 + y + y² for y ≤ 1, and εXt ≤ 1 a.s., since ε ∈ [0, 1] and |Xt| ≤ 1 a.s. by Line (81). The second inequality follows from Line (80). Therefore,

$$ P[M_T \ge 2\varepsilon T f(T)] = P[Y_T \ge 2\varepsilon T] \tag{83} $$
$$ = P\!\left[e^{\varepsilon Y_T} \ge e^{2\varepsilon^2 T}\right] \tag{84} $$
$$ \le \frac{E\!\left[e^{\varepsilon Y_T}\right]}{e^{2\varepsilon^2 T}} \tag{85} $$
$$ = \frac{E\!\left[e^{\varepsilon \sum_{t=0}^{T} X_t}\right]}{e^{2\varepsilon^2 T}} \tag{86} $$
$$ = \frac{E\!\left[\prod_{t=0}^{T} e^{\varepsilon X_t}\right]}{e^{2\varepsilon^2 T}} \tag{87} $$
$$ \le \frac{e^{\varepsilon m / f(0)} \left(1 + \varepsilon^2\right)^T}{e^{2\varepsilon^2 T}} \tag{88} $$
$$ \le \frac{e^{\varepsilon m / f(0)}\, e^{\varepsilon^2 T}}{e^{2\varepsilon^2 T}} \tag{89} $$
$$ = e^{\varepsilon m / f(0) - \varepsilon^2 T} \tag{90} $$

Line (85) follows from Markov's inequality. Line (88) follows from Lemma 16, Line (82), and the assumption that M0 = m a.s. Line (89) follows from the fact that 1 + x ≤ e^x.
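Lemma 17's bound can be checked exactly in the simplest setting: a fair ±1 random walk is a martingale (hence a supermartingale) with increments bounded by f ≡ 1 and M0 = 0, so the lemma asserts P[M_T ≥ 2εT] ≤ e^{−ε²T}. The sketch below (our example) computes the left-hand side exactly from binomial coefficients and compares it against the bound.

```python
from math import ceil, comb, exp

def walk_tail(T, s):
    # Exact P[M_T >= s] for M_T a sum of T fair +/-1 steps: M_T = 2K - T
    # with K ~ Binomial(T, 1/2), so M_T >= s iff K >= (T + s) / 2.
    kmin = ceil((T + s) / 2)
    return sum(comb(T, k) for k in range(kmin, T + 1)) / 2 ** T

T, eps = 50, 0.3
exact = walk_tail(T, 2 * eps * T)      # P[M_T >= 2 eps T], computed exactly
bound = exp(-eps ** 2 * T)             # Lemma 17 with m = 0 and f = 1
assert 0.0 < exact <= bound
```

As is typical of exponential moment bounds, the inequality is far from tight here; its value lies in holding uniformly over all supermartingales with bounded increments.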


Lemma 18 For x, y ∈ ℝⁿ,

$$ \|(x + y)^+\|_2^2 \le \|x^+ + y\|_2^2 \tag{91} $$

Equivalently,

$$ \|(x + y)^+\|_2^2 \le \|x^+\|_2^2 + 2(x^+ \cdot y) + \|y\|_2^2 \tag{92} $$

Proof Observe that $\|(x + y)^+\|_2^2 = \sum_i ((x_i + y_i)^+)^2$ and $\|x^+ + y\|_2^2 = \sum_i (x_i^+ + y_i)^2$. Thus, it suffices to show that $((x_i + y_i)^+)^2 \le (x_i^+ + y_i)^2$ for all i.

Case 1: xi + yi ≤ 0. Here, ((xi + yi)⁺)² = 0 ≤ (xi⁺ + yi)².
Case 2: xi + yi > 0, xi ≥ 0. Here, ((xi + yi)⁺)² = (xi + yi)² = (xi⁺ + yi)².
Case 3: xi + yi > 0, xi < 0. Here, 0 < xi + yi < yi, so that ((xi + yi)⁺)² = (xi + yi)² < yi² = (xi⁺ + yi)².

Lemma 19 For x, y ∈ ℝⁿ,

$$ \|(x + y)^+\|_2^2 \ge 2(x^+ \cdot y) + \|x^+\|_2^2 \tag{93} $$

Proof

$$ \|(x + y)^+\|_2^2 - 2(x^+ \cdot y) - \|x^+\|_2^2 \tag{94} $$
$$ = \sum_{i=1}^{n} \left[ ((x_i + y_i)^+)^2 - 2 x_i^+ y_i - (x_i^+)^2 \right] \tag{95} $$
$$ = \sum_{i=1}^{n} \left[ ((x_i + y_i)^+)^2 - 2 x_i^+ y_i - 2 x_i^+ x_i + (x_i^+)^2 \right] \tag{96} $$
$$ = \sum_{i=1}^{n} \left[ ((x_i + y_i)^+)^2 - 2 x_i^+ (x_i + y_i) + (x_i^+)^2 \right] \tag{97} $$
$$ \ge \sum_{i=1}^{n} \left[ ((x_i + y_i)^+)^2 - 2 x_i^+ (x_i + y_i)^+ + (x_i^+)^2 \right] \tag{98} $$
$$ = \sum_{i=1}^{n} \left( (x_i + y_i)^+ - x_i^+ \right)^2 \tag{99} $$
$$ \ge 0 \tag{100} $$

Proof The proof follows directly from an application of Lemma 17. To apply this lemma, we define a stochastic process, and we use Lemmas 18 and 19 (together with the assumptions of the theorem) to show that this process satisfies the assumptions of Lemma 17. By assumption, ρ(A × A0 ) is bounded. WLOG, assume ρ : A × A0 → [−1, 1]S . Letting d = |S|, kρ(a, a0 )k22 ≤ d (101) for all a ∈ A and a0 ∈ A0 . P Define Rt = tτ =1 ρ(aτ , a0τ ). By the assumption of Λ0 -compatibility, there exists some constant c ≥ 0 and there exists t0 ∈ N s.t. for all times t ≥ t0 , ! + Rt−1 c · ρ(qt , a0t ) ≤ a.s. (102) t−1 t

for all a0 ∈ A. We rely on the following implication of this assumption, which follows because qt is Ft−1 -measurable:   c(t − 1) + t0

≤ =

X

−δ 4 t

e 25d2

t≥t0

e

−δ 4 t0 25d2 −δ 4

1 − e 25d2 20

(131)

(132)

Finally, for t0 sufficiently large,  P ∃t ≥ t0 s.t. d(¯ ρt , RS− ) ≥ δ ≤ δ

(133)
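Approachability here can be illustrated with a small self-contained simulation (our construction, not the general algorithm analyzed above): regret matching in the spirit of Hart and Mas-Colell (2000), which plays each action with probability proportional to its positive cumulative regret, drives the average external-regret vector toward the negative orthant in a 2×2 game against an i.i.d. opponent. The game, the opponent distribution, the horizon, and the seed are all illustrative.

```python
import random

random.seed(0)  # fixed seed for a reproducible illustrative run

# Row player's rewards in a 2x2 game (matching pennies, illustrative).
def reward(ai, aj):
    return 1.0 if ai == aj else -1.0

T = 20000
regret = [0.0, 0.0]  # cumulative external regret toward each pure action

for t in range(T):
    pos = [max(r, 0.0) for r in regret]
    total = sum(pos)
    if total > 0.0:
        # Regret matching: mix over actions in proportion to positive regret.
        ai = 0 if random.random() < pos[0] / total else 1
    else:
        ai = random.randrange(2)
    aj = 0 if random.random() < 0.7 else 1  # i.i.d. opponent
    for b in range(2):
        regret[b] += reward(b, aj) - reward(ai, aj)

# The average positive regret shrinks like O(1/sqrt(T)); the bar below is loose.
avg_positive_regret = max(max(regret) / T, 0.0)
assert avg_positive_regret < 0.1
```

The loose threshold reflects the theorem's conclusion qualitatively: the distance from ρ̄t to the negative orthant vanishes, at the usual O(1/√t) rate for this family of algorithms.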

Appendix B. Proof of Lemma 21

Lemma 21 Let (X, dX) be a compact metric space and let (Y, dY) be a metric space. Let {xt} be an X-valued sequence, and let S be a nonempty, closed subset of Y. If f : X → Y is continuous and if f⁻¹(S) is nonempty, then dX(xt, f⁻¹(S)) → 0 as t → ∞ if and only if dY(f(xt), S) → 0 as t → ∞.

Proof We write d for both dX and dY, since the appropriate choice of distance metric is always clear from the context.

To prove the forward implication, assume d(xt, f⁻¹(S)) → 0 as t → ∞, and fix ε > 0. Because X is compact and f is continuous, f is uniformly continuous, so there exists δ > 0 such that d(x, q) < δ implies d(f(x), f(q)) < ε. Choose t0 s.t. for all t ≥ t0, d(xt, f⁻¹(S)) < δ/2. For each such t, there exists q_t^{(δ/2)} ∈ f⁻¹(S) s.t. d(xt, q_t^{(δ/2)}) < d(xt, f⁻¹(S)) + δ/2. Now, since d(xt, q_t^{(δ/2)}) < δ/2 + δ/2 = δ, we have d(f(xt), f(q_t^{(δ/2)})) < ε. Therefore, d(f(xt), S) < ε, since f(q_t^{(δ/2)}) ∈ S.

To prove the reverse implication, assume d(f(xt), S) → 0 as t → ∞. We must show that for all ε > 0, there exists a t0 s.t. for all t ≥ t0, d(xt, f⁻¹(S)) < ε. Define T = {x ∈ X | d(x, f⁻¹(S)) ≥ ε}. If T = ∅, the claim holds. Otherwise, observe that T can be expressed as the complement of a union of open balls, so that T is closed and thus compact. Define g : X → ℝ as g(x) = d(f(x), S). By assumption S is closed; hence, for x ∈ T, f(x) ∉ S and so g(x) > 0. Because T is compact and g is continuous, g achieves some minimum value, say L > 0, on T. Choose t0 s.t. d(f(xt), S) < L for all t ≥ t0. Thus, for all t ≥ t0, g(xt) < L ⇒ xt ∉ T ⇒ d(xt, f⁻¹(S)) < ε.
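Lemma 21 can be illustrated concretely (all choices ours): take X = [−2, 2], f(x) = x², and S = {1}, so that f⁻¹(S) = {−1, 1}. The oscillating sequence below does not converge, yet d(f(xt), S) and d(xt, f⁻¹(S)) vanish together, exactly as the lemma requires.

```python
def dist(x, points):
    # Distance from a real number to a finite set of reals.
    return min(abs(x - p) for p in points)

f = lambda x: x * x
S = [1.0]
preimage = [-1.0, 1.0]  # f^{-1}(S) inside X = [-2, 2]

# An oscillating sequence: odd t near -1, even t near +1.
xs = [(-1.0 - 1.0 / t) if t % 2 else (1.0 + 1.0 / t) for t in range(1, 200)]

d_images = [dist(f(x), S) for x in xs]
d_points = [dist(x, preimage) for x in xs]

# Both distances shrink together, though xs itself does not converge.
assert max(d_points[-10:]) < 0.02 and max(d_images[-10:]) < 0.02
```

Compactness of X matters in the reverse direction: on a non-compact domain, f(xt) can approach S while xt escapes every neighborhood of f⁻¹(S).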

Acknowledgments

We gratefully acknowledge Dean Foster for sparking our interest in this topic, and for ongoing discussions that helped to clarify many of the technical points in this paper. We also thank Dave Seaman for providing a proof of the harder direction of Lemma 21; and David Gondek, Zheng Li, and Martin Zinkevich for their comments on the proof of Jafari's generalization of Blackwell's theorem. The content of this article is based on Greenwald and Jafari (2003), which in turn is based on Jafari's Master's thesis (Jafari, 2003). This research was supported by NSF Career Grant #IIS-0133689.


References

R. Aumann. Subjectivity and correlation in randomized strategies. Journal of Mathematical Economics, 1:67–96, 1974.

D. Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6:1–8, 1956.

G. Brown. Iterative solutions of games by fictitious play. In T. Koopmans, editor, Activity Analysis of Production and Allocation. Wiley, New York, 1951.

N. Cesa-Bianchi and G. Lugosi. Potential-based algorithms in on-line prediction and game theory. Machine Learning, 51(3):239–261, 2003.

A. Cournot. Recherches sur les Principes Mathématiques de la Théorie des Richesses. Hachette, 1838.

D. Foster and R. Vohra. A randomization rule for selecting forecasts. Operations Research, 41(4):704–709, 1993.

D. Foster and R. Vohra. Regret in the on-line decision problem. Games and Economic Behavior, 21:40–55, 1997.

Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Computational Learning Theory: Proceedings of the Second European Conference, pages 23–37. Springer-Verlag, 1995.

Y. Freund and R. Schapire. Game theory, on-line prediction, and boosting. In Proceedings of the 9th Annual Conference on Computational Learning Theory, pages 325–332. ACM Press, May 1996.

D. Fudenberg and D. K. Levine. Universal consistency and cautious fictitious play. Journal of Economic Dynamics and Control, 19:1065–1090, 1995.

D. Fudenberg and D. K. Levine. Conditional universal consistency. Games and Economic Behavior, 29:104–130, 1999.

A. Greenwald and A. Jafari. A general class of no-regret algorithms and game-theoretic equilibria. In Proceedings of the 2003 Computational Learning Theory Conference, pages 1–11, August 2003.

J. Hannan. Approximation to Bayes risk in repeated plays. In M. Dresher, A. W. Tucker, and P. Wolfe, editors, Contributions to the Theory of Games, volume 3, pages 97–139. Princeton University Press, 1957.

S. Hart and A. Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68:1127–1150, 2000.

S. Hart and A. Mas-Colell. A general class of adaptive strategies. Journal of Economic Theory, 98:26–54, 2001.

A. Jafari. On the Notion of Regret in Infinitely Repeated Games. Master's Thesis, Brown University, Providence, May 2003.

E. Lehrer. A wide range no-regret theorem. Games and Economic Behavior, 42(1):101–115, 2003.

N. Littlestone and M. Warmuth. The weighted majority algorithm. Information and Computation, 108:212–261, 1994.

J. Robinson. An iterative method of solving a game. Annals of Mathematics, 54:298–301, 1951.

D. Williams. Probability with Martingales. Cambridge University Press, Cambridge, 1991.
