Introduction to Bandits: Algorithms and Theory

Jean-Yves Audibert (1,2) & Rémi Munos (3)
1. Université Paris-Est, LIGM, Imagine
2. CNRS/École Normale Supérieure/INRIA, LIENS, Sierra
3. INRIA Sequential Learning team, France

ICML 2011, Bellevue (WA), USA


Outline

- Bandit problems and applications
- Bandits with small set of actions
  - Stochastic setting
  - Adversarial setting
- Bandits with large set of actions
  - unstructured set
  - structured set
    - linear bandits
    - Lipschitz bandits
    - tree bandits
- Extensions

Bandit game

Parameters available to the forecaster: the number of arms (or actions) K and the number of rounds n
Unknown to the forecaster: the way the gain vectors $g_t = (g_{1,t}, \dots, g_{K,t}) \in [0,1]^K$ are generated

For each round t = 1, 2, ..., n
1. the forecaster chooses an arm $I_t \in \{1, \dots, K\}$
2. the forecaster receives the gain $g_{I_t,t}$
3. only $g_{I_t,t}$ is revealed to the forecaster

Cumulative regret goal: maximize the cumulative gains obtained. More precisely, minimize
\[ R_n = \max_{i=1,\dots,K} \mathbb{E}\sum_{t=1}^n g_{i,t} \; - \; \mathbb{E}\sum_{t=1}^n g_{I_t,t} \]
where $\mathbb{E}$ comes from both a possible stochastic generation of the gain vector and a possible randomization in the choice of $I_t$.

Stochastic and adversarial environments

- Stochastic environment: the gain vector $g_t$ is sampled from an unknown product distribution $\nu_1 \otimes \cdots \otimes \nu_K$ on $[0,1]^K$, that is $g_{i,t} \sim \nu_i$.
- Adversarial environment: the gain vector $g_t$ is chosen by an adversary (which, at time t, knows all the past, but not $I_t$).

Numerous variants

- different environments: adversarial, "stochastic", non-stationary
- different targets: cumulative regret, simple regret, tracking the best expert
- continuous or discrete set of actions
- extensions with additional rules: varying set of arms, pay-per-observation, ...

Various applications

- Clinical trials (Thompson, 1933)
- Ads placement on webpages
- Nash equilibria (traffic or communication networks, agent simulation, phantom tic-tac-toe, ...)
- Game-playing computers (Go, Urban Rivals, ...)
- Packet routing, itinerary selection
- ...


Stochastic bandit game (Robbins, 1952)

Parameters available to the forecaster: K and n
Parameters unknown to the forecaster: the reward distributions $\nu_1, \dots, \nu_K$ of the arms (with respective means $\mu_1, \dots, \mu_K$)

For each round t = 1, 2, ..., n
1. the forecaster chooses an arm $I_t \in \{1, \dots, K\}$
2. the environment draws the gain vector $g_t = (g_{1,t}, \dots, g_{K,t})$ according to $\nu_1 \otimes \cdots \otimes \nu_K$
3. the forecaster receives the gain $g_{I_t,t}$

Notation: $i^* = \arg\max_{i=1,\dots,K} \mu_i$, $\mu^* = \max_{i=1,\dots,K} \mu_i$, $\Delta_i = \mu^* - \mu_i$, $T_i(n) = \sum_{t=1}^n \mathbb{1}_{I_t=i}$

Cumulative regret: $\hat R_n = \sum_{t=1}^n g_{i^*,t} - \sum_{t=1}^n g_{I_t,t}$

Goal: minimize the expected cumulative regret
\[ R_n = \mathbb{E}\hat R_n = n\mu^* - \mathbb{E}\sum_{t=1}^n g_{I_t,t} = n\mu^* - \mathbb{E}\sum_{i=1}^K T_i(n)\,\mu_i = \sum_{i=1}^K \Delta_i\,\mathbb{E}T_i(n) \]
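The regret decomposition $R_n = \sum_i \Delta_i\,\mathbb{E}T_i(n)$ can be checked numerically. Below is a minimal simulation sketch, not taken from the slides: Bernoulli arms, a deliberately bad uniformly random policy, and illustrative function and variable names. The two regret computations agree up to the noise of the realized gains.

```python
import random

def run_bandit(means, n, choose_arm, seed=0):
    """Simulate the stochastic bandit game with Bernoulli arms.

    means      : arm means mu_1, ..., mu_K (Bernoulli parameters)
    n          : horizon (number of rounds)
    choose_arm : function (t, counts, sums) -> arm index in {0, ..., K-1}
    Returns the realized cumulative gain and the pull counts T_i(n).
    """
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K      # T_i(t)
    sums = [0.0] * K      # cumulative observed reward of each arm
    total_gain = 0.0
    for t in range(1, n + 1):
        i = choose_arm(t, counts, sums)
        g = 1.0 if rng.random() < means[i] else 0.0   # g_{I_t,t} drawn from nu_{I_t}
        counts[i] += 1
        sums[i] += g
        total_gain += g
    return total_gain, counts

# A deliberately bad policy (uniformly random arm), just to exercise the protocol.
means, n = [0.6, 0.5, 0.4], 10000
gain, counts = run_bandit(means, n, lambda t, c, s: random.randrange(len(means)))
mu_star = max(means)
print("n*mu_star - realized gains :", n * mu_star - gain)
print("sum_i Delta_i T_i(n)       :", sum((mu_star - m) * T for m, T in zip(means, counts)))
```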

A simple policy: ε-greedy

For simplicity, all rewards are in [0, 1]
- Playing the arm with the highest empirical mean does not work
- ε-greedy: at time t,
  - with probability $1-\varepsilon_t$, play the arm with the highest empirical mean
  - with probability $\varepsilon_t$, play a random arm
- Theoretical guarantee (Auer, Cesa-Bianchi, Fischer, 2002):
  - let $\Delta = \min_{i:\Delta_i>0} \Delta_i$ and consider $\varepsilon_t = \min\big(\frac{6K}{\Delta^2 t}, 1\big)$
  - when $t \ge \frac{6K}{\Delta^2}$, the probability of choosing a suboptimal arm i is bounded by $\frac{C}{\Delta^2 t}$ for some constant $C > 0$
  - as a consequence, $\mathbb{E}[T_i(n)] \le \frac{C}{\Delta^2}\log n$ and $R_n \le \sum_{i:\Delta_i>0} \frac{C\,\Delta_i}{\Delta^2}\log n$, hence logarithmic regret
- Drawbacks:
  - naive exploration: for K > 2, no distinction between sub-optimal arms
  - requires knowledge of ∆
  - outperformed by the UCB policy in practice
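A minimal sketch of the $\varepsilon_t$-greedy selection rule described above, assuming the gap ∆ is known and rewards lie in [0, 1]; the function signature and variable names are mine, not from the slides.

```python
import random

def epsilon_greedy_arm(t, counts, sums, delta, rng):
    """One step of epsilon_t-greedy with epsilon_t = min(6K / (delta^2 * t), 1)."""
    K = len(counts)
    for i in range(K):
        if counts[i] == 0:                            # play each arm once to initialize
            return i
    eps_t = min(6 * K / (delta ** 2 * t), 1.0)
    if rng.random() < eps_t:
        return rng.randrange(K)                       # explore: uniformly random arm
    means = [sums[i] / counts[i] for i in range(K)]
    return max(range(K), key=lambda i: means[i])      # exploit: highest empirical mean

# Usage: plug this function into a simulation loop that maintains counts and sums.
rng = random.Random(0)
print(epsilon_greedy_arm(t=10, counts=[5, 4], sums=[3.0, 2.0], delta=0.1, rng=rng))
```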

Optimism in face of uncertainty

- At time t, from past observations and some probabilistic argument, you have an upper confidence bound (UCB) on the expected rewards.
- Simple implementation: play the arm having the largest UCB!

Why does it make sense?

- Could we keep drawing a wrong arm for a long time? No, since:
  - the more we draw a wrong arm i, the closer its UCB gets to the expected reward $\mu_i$,
  - and $\mu_i < \mu^* \le$ UCB on $\mu^*$

Illustration of UCB policy

[Figure.]

Confidence intervals vs sampling times

[Figure.]

Hoeffding-based UCB (Auer, Cesa-Bianchi, Fischer, 2002)

- Hoeffding's inequality: let $X, X_1, \dots, X_m$ be i.i.d. random variables taking their values in [0, 1]. For any ε > 0, with probability at least 1 − ε, we have
  \[ \mathbb{E}X \le \frac{1}{m}\sum_{s=1}^m X_s + \sqrt{\frac{\log(\varepsilon^{-1})}{2m}} \]
- UCB1 policy: at time t, play
  \[ I_t \in \arg\max_{i\in\{1,\dots,K\}} \left\{ \hat\mu_{i,t-1} + \sqrt{\frac{2\log t}{T_i(t-1)}} \right\} \]
  where $\hat\mu_{i,t-1} = \frac{1}{T_i(t-1)}\sum_{s=1}^{T_i(t-1)} X_{i,s}$
- Regret bound:
  \[ R_n \le \sum_{i\ne i^*} \min\Big(\frac{10}{\Delta_i}\log n,\ n\Delta_i\Big) \]

Hoeffding-based UCB (continued)

- UCB1 is an anytime policy: it does not need to know the horizon n to be implemented

Hoeffding-based UCB (continued)

- UCB1 corresponds to taking $2\log t = \frac{\log(\varepsilon^{-1})}{2}$ in Hoeffding's inequality, hence to the confidence level $\varepsilon = 1/t^4$
- Critical confidence level: $\varepsilon = 1/t$ (Lai & Robbins, 1985; Agrawal, 1995; Burnetas & Katehakis, 1996; Audibert, Munos, Szepesvári, 2009; Honda & Takemura, 2010)
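A minimal sketch of the UCB1 selection rule, with a tiny Bernoulli demo; the harness around the index is illustrative and not from the slides.

```python
import math
import random

def ucb1_choose(t, counts, sums):
    """UCB1: play the arm maximizing hat_mu_i + sqrt(2 * log t / T_i(t-1))."""
    K = len(counts)
    for i in range(K):
        if counts[i] == 0:        # each arm is played once before the indices are defined
            return i
    def index(i):
        return sums[i] / counts[i] + math.sqrt(2.0 * math.log(t) / counts[i])
    return max(range(K), key=index)

# Tiny demo on two Bernoulli arms.
rng = random.Random(0)
means, n = [0.6, 0.5], 5000
counts, sums = [0, 0], [0.0, 0.0]
for t in range(1, n + 1):
    i = ucb1_choose(t, counts, sums)
    g = 1.0 if rng.random() < means[i] else 0.0
    counts[i] += 1
    sums[i] += g
print("pulls of the suboptimal arm:", counts[1])
```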

Better confidence bounds imply smaller regret

- Hoeffding's inequality, $\frac{1}{t}$-confidence bound:
  \[ \mathbb{E}X \le \frac{1}{m}\sum_{s=1}^m X_s + \sqrt{\frac{\log t}{2m}} \]
- Bernstein's inequality, $\frac{1}{t}$-confidence bound:
  \[ \mathbb{E}X \le \frac{1}{m}\sum_{s=1}^m X_s + \sqrt{\frac{2\,\mathrm{Var}(X)\,\log t}{m}} + \frac{\log t}{3m} \]
- Empirical Bernstein's inequality, $\frac{3}{t}$-confidence bound:
  \[ \mathbb{E}X \le \frac{1}{m}\sum_{s=1}^m X_s + \sqrt{\frac{2\,\widehat{\mathrm{Var}}(X)\,\log t}{m}} + \frac{8\log t}{3m} \]
  (Audibert, Munos, Szepesvári, 2009; Maurer, 2009; Audibert, 2010)
- An asymptotic confidence bound leads to catastrophe:
  \[ \mathbb{E}X \le \frac{1}{m}\sum_{s=1}^m X_s + x\,\sqrt{\frac{\widehat{\mathrm{Var}}(X)}{m}} \quad\text{with } x \text{ s.t. } \int_x^{+\infty}\frac{e^{-u^2/2}}{\sqrt{2\pi}}\,du = \frac{1}{t} \]

Better confidence bounds imply smaller regret

- Hoeffding-based UCB:
  \[ \mathbb{E}X \le \frac{1}{m}\sum_{s=1}^m X_s + \sqrt{\frac{\log(\varepsilon^{-1})}{2m}} \qquad\Longrightarrow\qquad R_n \le \sum_{i\ne i^*}\min\Big(\frac{c}{\Delta_i}\log n,\ n\Delta_i\Big) \]
- Empirical Bernstein-based UCB:
  \[ \mathbb{E}X \le \frac{1}{m}\sum_{s=1}^m X_s + \sqrt{\frac{2\log(\varepsilon^{-1})\,\widehat{\mathrm{Var}}\,X}{m}} + \frac{8\log(\varepsilon^{-1})}{3m} \qquad\Longrightarrow\qquad R_n \le \sum_{i\ne i^*}\min\Big(c\Big(\frac{\mathrm{Var}\,\nu_i}{\Delta_i}+1\Big)\log n,\ n\Delta_i\Big) \]
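The slides state the empirical Bernstein confidence bound but give no pseudo-code; the sketch below computes the corresponding index in the spirit of UCB-V (Audibert, Munos, Szepesvári, 2009), transcribing the bound displayed above with confidence level roughly 1/t. Storing the raw rewards and the guard for small t are my own choices.

```python
import math

def empirical_bernstein_index(rewards_i, t):
    """Upper confidence bound for one arm from its observed rewards:
        hat_mu + sqrt(2 * hat_var * log t / m) + 8 * log t / (3 m),
    i.e. the empirical Bernstein bound at confidence level ~ 1/t."""
    m = len(rewards_i)
    hat_mu = sum(rewards_i) / m
    hat_var = sum((x - hat_mu) ** 2 for x in rewards_i) / m
    log_t = math.log(max(t, 2))
    return hat_mu + math.sqrt(2.0 * hat_var * log_t / m) + 8.0 * log_t / (3.0 * m)

# Example: a near-deterministic arm gets a much tighter bound than a noisy one.
print(empirical_bernstein_index([0.6] * 20, t=100))
print(empirical_bernstein_index([1.0, 0.0] * 10, t=100))
```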

Tuning the exploration: simple vs difficult bandit problems

- UCB1(ρ) policy: at time t, play
  \[ I_t \in \arg\max_{i\in\{1,\dots,K\}} \left\{ \hat\mu_{i,t-1} + \sqrt{\frac{\rho\log t}{T_i(t-1)}} \right\} \]

[Figures: expected regret of UCB1(ρ) as a function of the exploration parameter ρ, for n = 1000 and K = 2 arms. Left: Dirac(0.6) and Ber(0.5). Right: Ber(0.6) and Ber(0.5).]

Tuning the exploration parameter: from theory to practice

- Theory:
  - for ρ < 0.5, UCB1(ρ) has polynomial regret
  - for ρ > 0.5, UCB1(ρ) has logarithmic regret
  (Audibert, Munos, Szepesvári, 2009; Bubeck, 2010)
- Practice: ρ = 0.2 seems to be the best default value for $n < 10^8$

[Figures: expected regret of UCB1(ρ) as a function of ρ, for n = 1000. Left: K = 3 arms, Ber(0.6), Ber(0.5), Ber(0.5). Right: K = 5 arms, Ber(0.7), Ber(0.6), Ber(0.5), Ber(0.4), Ber(0.3).]
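A sketch of the kind of experiment behind these curves: UCB1(ρ) on Bernoulli arms, averaged over a few Monte Carlo runs. The slides do not give the exact experimental protocol, so the number of runs, the seeds, and the grid of ρ values below are arbitrary.

```python
import math
import random

def ucb1_rho_regret(means, n, rho, runs=50, seed=0):
    """Average pseudo-regret of UCB1(rho) over several runs on Bernoulli arms."""
    rng = random.Random(seed)
    mu_star, K = max(means), len(means)
    total = 0.0
    for _ in range(runs):
        counts, sums = [0] * K, [0.0] * K
        for t in range(1, n + 1):
            if t <= K:
                i = t - 1                      # play each arm once
            else:
                i = max(range(K), key=lambda j: sums[j] / counts[j]
                        + math.sqrt(rho * math.log(t) / counts[j]))
            g = 1.0 if rng.random() < means[i] else 0.0
            counts[i] += 1
            sums[i] += g
        total += sum((mu_star - means[i]) * counts[i] for i in range(K))
    return total / runs

for rho in [0.1, 0.2, 0.5, 1.0, 2.0]:
    print(rho, ucb1_rho_regret([0.6, 0.5, 0.5], n=1000, rho=rho))
```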

Deviations of UCB1 regret

- UCB1 policy: at time t, play
  \[ I_t \in \arg\max_{i\in\{1,\dots,K\}} \left\{ \hat\mu_{i,t-1} + \sqrt{\frac{2\log t}{T_i(t-1)}} \right\} \]
- An inequality of the form $\mathbb{P}(\hat R_n > \mathbb{E}\hat R_n + \gamma) \le c\,e^{-c\gamma}$ does not hold!
- If the smallest reward observable from the optimal arm is smaller than the mean reward of the second-best arm, then the regret of UCB1 satisfies: for any $C > 0$, there exists $C' > 0$ such that for any $n \ge 2$,
  \[ \mathbb{P}\big(\hat R_n > \mathbb{E}\hat R_n + C\log n\big) > \frac{1}{C'(\log n)^{C'}} \]
  (Audibert, Munos, Szepesvári, 2009)

Anytime UCB policies have a heavy-tailed regret

- For some difficult bandit problems, the regret of UCB1 satisfies: for any $C > 0$, there exists $C' > 0$ such that for any $n \ge 2$,
  \[ \mathbb{P}\big(\hat R_n > \mathbb{E}\hat R_n + C\log n\big) > \frac{1}{C'(\log n)^{C'}} \qquad (\star) \]
- UCB-Horizon policy: at time t, play
  \[ I_t \in \arg\max_{i\in\{1,\dots,K\}} \left\{ \hat\mu_{i,t-1} + \sqrt{\frac{2\log n}{T_i(t-1)}} \right\} \]
  (Audibert, Munos, Szepesvári, 2009)
- UCB-H satisfies $\mathbb{P}(\hat R_n > \mathbb{E}\hat R_n + C\log n) \le \frac{C}{n}$ for some $C > 0$
- $(\star)$ is unavoidable for anytime policies (Salomon, Audibert, 2011)
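UCB-Horizon differs from UCB1 only in the logarithmic term (log n instead of log t); a side-by-side sketch of the two indices, with an explicit exploration parameter ρ as used on the earlier slides (names are mine):

```python
import math

def ucb1_index(hat_mu, T_i, t, rho=2.0):
    """Anytime index: the exploration bonus grows with the current time t."""
    return hat_mu + math.sqrt(rho * math.log(t) / T_i)

def ucb_horizon_index(hat_mu, T_i, n, rho=2.0):
    """Horizon-dependent index: the bonus uses the known horizon n instead of t."""
    return hat_mu + math.sqrt(rho * math.log(n) / T_i)
```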

Comparison of UCB1 (solid lines) and UCB-H (dotted lines)

[Figure.]

Comparison of UCB1(ρ) and UCB-H(ρ) in expectation

[Figures: expected regret of GCL*, UCB1(ρ) and UCB-H(ρ) as a function of the exploration parameter ρ, for n = 1000 and K = 2 arms. Left: Dirac(0.6) vs Bernoulli(0.5). Right: Bernoulli(0.6) vs Dirac(0.5).]

Comparison of UCB1(ρ) and UCB-H(ρ) in deviations

- For n = 1000 and K = 2 arms: Bernoulli(0.6) and Dirac(0.5)

[Figures: GCL*, UCB1(0.2), UCB1(0.5), UCB-H(0.2), UCB-H(0.5). Left: smoothed probability mass function of the regret, $\mathbb{P}(r-2 \le \hat R_n \le r+2)$. Right: tail distribution of the regret, $\mathbb{P}(\hat R_n > r)$, as a function of the regret level r.]

Knowing the horizon: theory and practice

- Theory: use UCB-H to avoid heavy tails of the regret
- Practice: theory is right. Besides, thanks to this robustness, the expected regret of UCB-H(ρ) consistently outperforms the expected regret of UCB1(ρ). However:
  - the gain is small
  - a better way to have small regret tails is to take a larger ρ

Knowing µ*

- Hoeffding-based GCL* policy: play each arm once, then play
  \[ I_t \in \operatorname*{argmin}_{i\in\{1,\dots,K\}} T_i(t-1)\,\big(\mu^* - \hat\mu_{i,t-1}\big)_+^2 \]
  (Salomon, Audibert, 2011)
- Underlying ideas:
  - compare the p-values of the K tests $H_0 = \{\mu_i = \mu^*\}$, $i\in\{1,\dots,K\}$
  - the p-values are estimated using Hoeffding's inequality:
    \[ \mathbb{P}_{H_0}\Big(\hat\mu_{i,t-1} \le \hat\mu_{i,t-1}^{(\mathrm{obs})}\Big) \lessapprox \exp\Big(-2\,T_i(t-1)\,\big(\mu^* - \hat\mu_{i,t-1}^{(\mathrm{obs})}\big)_+^2\Big) \]
  - play the arm for which we have the Greatest Confidence Level that it is the optimal arm
- Advantages:
  - logarithmic expected regret
  - anytime policy
  - regret with a subexponential right tail
  - parameter-free policy!
  - outperforms any other Hoeffding-based algorithm!
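A minimal sketch of the Hoeffding-based GCL* selection rule when µ* is known; the surrounding bookkeeping (counts, sums) and names are illustrative.

```python
def gcl_star_choose(counts, sums, mu_star):
    """Hoeffding-based GCL*: after each arm has been played once, play
    argmin_i T_i(t-1) * (mu_star - hat_mu_i)_+^2, i.e. the arm whose test
    H0 = {mu_i = mu_star} currently has the largest estimated p-value."""
    K = len(counts)
    for i in range(K):
        if counts[i] == 0:
            return i
    def exponent(i):                                  # T_i * (mu_star - hat_mu_i)_+^2
        gap = max(mu_star - sums[i] / counts[i], 0.0)
        return counts[i] * gap ** 2
    return min(range(K), key=exponent)
```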

From Chernoff's inequality to KL-based algorithms

- Let $K(p,q)$ be the Kullback-Leibler divergence between Bernoulli distributions of respective parameters p and q
- Let $X_1, \dots, X_T$ be i.i.d. random variables of mean µ, taking their values in [0, 1]. Let $\bar X = \frac{1}{T}\sum_{i=1}^T X_i$. For any γ > 0,
  \[ \mathbb{P}\big(\bar X \le \mu - \gamma\big) \le \exp\big(-T\,K(\mu-\gamma,\mu)\big). \]
  In particular, we have
  \[ \mathbb{P}\big(\bar X \le \bar X^{(\mathrm{obs})}\big) \le \exp\Big(-T\,K\big(\min(\bar X^{(\mathrm{obs})},\mu),\mu\big)\Big). \]
- If µ* is known, using the same idea of comparing the p-values of the tests $H_0 = \{\mu_i = \mu^*\}$, $i\in\{1,\dots,K\}$, we get the Chernoff-based GCL* policy: play each arm once, then play
  \[ I_t \in \operatorname*{argmin}_{i\in\{1,\dots,K\}} T_i(t-1)\,K\big(\min(\hat\mu_{i,t-1},\mu^*),\mu^*\big) \]
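A sketch of the Bernoulli KL divergence and of the Chernoff-based GCL* rule above; the clipping constant eps guards against log(0) and is my own choice.

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """K(p, q): Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def chernoff_gcl_star_choose(counts, sums, mu_star):
    """Chernoff-based GCL*: play argmin_i T_i(t-1) * K(min(hat_mu_i, mu_star), mu_star)."""
    K = len(counts)
    for i in range(K):
        if counts[i] == 0:        # play each arm once
            return i
    def exponent(i):
        hat_mu = sums[i] / counts[i]
        return counts[i] * kl_bernoulli(min(hat_mu, mu_star), mu_star)
    return min(range(K), key=exponent)
```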

Back to unknown µ*

- When µ* is unknown, the principle "play the arm for which we have the greatest confidence level that it is the optimal arm" is replaced by "be optimistic in the face of uncertainty":
  - an arm i is represented by the highest mean of a distribution ν for which the hypothesis $H_0 = \{\nu_i = \nu\}$ has a p-value greater than $\frac{1}{t^\beta}$ (critical β = 1, as usual)
  - the arm with the highest index (= UCB) is played

KL-based algorithms when µ* is unknown

- Approximating the p-value using Sanov's theorem is tightly linked to the DMED policy, which satisfies
  \[ \limsup_{n\to+\infty}\ \frac{R_n}{\log n} \le \sum_{i:\Delta_i>0} \frac{\Delta_i}{\inf_{\nu:\ \mathbb{E}_{X\sim\nu}X \ge \mu^*} K(\nu_i,\nu)} \]
  (Burnetas & Katehakis, 1996; Honda & Takemura, 2010)
  It matches the lower bound
  \[ \liminf_{n\to+\infty}\ \frac{R_n}{\log n} \ge \sum_{i:\Delta_i>0} \frac{\Delta_i}{\inf_{\nu:\ \mathbb{E}_{X\sim\nu}X \ge \mu^*} K(\nu_i,\nu)} \]
  (Lai & Robbins, 1985; Burnetas & Katehakis, 1996)
- Approximating the p-value using a non-asymptotic version of Sanov's theorem leads to KL-UCB (Cappé & Garivier, COLT 2011) and the K-strategy (Maillard, Munos, Stoltz, COLT 2011)
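KL-UCB replaces the Hoeffding bonus by the largest mean q compatible with the observations at level roughly 1/t, i.e. the largest q with $T_i\,K(\hat\mu_i, q) \le \log t$. A bisection sketch for Bernoulli rewards follows; the exact threshold used by Cappé & Garivier (which may include an extra log log t term) is not reproduced here, so treat this as an assumption-laden illustration.

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(hat_mu, T_i, t, tol=1e-6):
    """Largest q >= hat_mu such that T_i * K(hat_mu, q) <= log t, found by bisection.
    K(hat_mu, q) is nondecreasing in q on [hat_mu, 1], so bisection applies."""
    threshold = math.log(max(t, 2)) / T_i
    lo, hi = hat_mu, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if kl_bernoulli(hat_mu, mid) <= threshold:
            lo = mid
        else:
            hi = mid
    return lo

# Example: the index shrinks toward the empirical mean as the arm is pulled more.
print(kl_ucb_index(0.5, T_i=10, t=1000), kl_ucb_index(0.5, T_i=1000, t=1000))
```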


Adversarial bandit

Parameters: the number of arms K and the number of rounds n

For each round t = 1, 2, ..., n
1. the forecaster chooses an arm $I_t \in \{1,\dots,K\}$, possibly with the help of an external randomization
2. the adversary chooses a gain vector $g_t = (g_{1,t},\dots,g_{K,t}) \in [0,1]^K$
3. the forecaster receives and observes only the gain $g_{I_t,t}$

Goal: maximize the cumulative gains obtained. We consider the regret
\[ R_n = \max_{i=1,\dots,K} \mathbb{E}\sum_{t=1}^n g_{i,t} \; - \; \mathbb{E}\sum_{t=1}^n g_{I_t,t}. \]

- In full information, step 3 is replaced by: the forecaster receives $g_{I_t,t}$ and observes the full gain vector $g_t$
- In both settings, the forecaster should use an external randomization to have o(n) regret.

Adversarial setting in full information: an optimal policy

- Cumulative reward on [1, t−1]: $G_{i,t-1} = \sum_{s=1}^{t-1} g_{i,s}$
- Follow-the-leader, $I_t \in \arg\max_{i\in\{1,\dots,K\}} G_{i,t-1}$, is a bad policy
- An "optimal" policy is obtained by considering
  \[ p_{i,t} = \mathbb{P}(I_t = i) = \frac{e^{\eta G_{i,t-1}}}{\sum_{k=1}^K e^{\eta G_{k,t-1}}} \]
- For this policy, $R_n \le \frac{n\eta}{8} + \frac{\log K}{\eta}$
- In particular, for $\eta = \sqrt{\frac{8\log K}{n}}$, we have $R_n \le \sqrt{\frac{n\log K}{2}}$
  (Littlestone, Warmuth, 1994; Long, 1996; Bylanger, 1997; Cesa-Bianchi, 1999)
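A sketch of the exponentially weighted forecaster in full information; the max-shift before exponentiation is a standard numerical-stability trick, not part of the slide, and the toy adversary in the demo is illustrative.

```python
import math
import random

def exp_weights_full_info(gains, eta, seed=0):
    """Exponentially weighted forecaster in full information.
    gains[t][i] is the gain of arm i at round t + 1; every g_{i,t} is observed."""
    rng = random.Random(seed)
    K = len(gains[0])
    G = [0.0] * K                                   # cumulative gains G_{i,t-1}
    total = 0.0
    for g_t in gains:
        m = max(G)                                  # shift by max(G) for numerical stability
        w = [math.exp(eta * (Gi - m)) for Gi in G]
        Z = sum(w)
        p = [wi / Z for wi in w]                    # p_{i,t} proportional to exp(eta * G_{i,t-1})
        i_t = rng.choices(range(K), weights=p)[0]   # draw I_t ~ p_t
        total += g_t[i_t]
        for i in range(K):                          # full information: all G_i are updated
            G[i] += g_t[i]
    return total, G

# Example: 1000 rounds, 2 arms, gains fixed in advance by an oblivious adversary.
gains = [[1.0, 0.0] if t % 3 else [0.0, 1.0] for t in range(1000)]
eta = math.sqrt(8 * math.log(2) / len(gains))
print(exp_weights_full_info(gains, eta))
```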

Proof of the regret bound

\[ p_{i,t} = \mathbb{P}(I_t = i) = \frac{e^{\eta G_{i,t-1}}}{\sum_{k=1}^K e^{\eta G_{k,t-1}}} \]

\begin{align*}
\mathbb{E}\sum_t g_{I_t,t}
&= \mathbb{E}\sum_t \sum_i p_{i,t}\, g_{i,t} \\
&= \mathbb{E}\sum_t \Big[ -\tfrac{1}{\eta}\log\Big(\textstyle\sum_i p_{i,t}\, e^{\eta(g_{i,t}-\sum_j p_{j,t} g_{j,t})}\Big) + \tfrac{1}{\eta}\log\Big(\textstyle\sum_i p_{i,t}\, e^{\eta g_{i,t}}\Big) \Big] \\
&= \mathbb{E}\sum_t \Big[ -\tfrac{1}{\eta}\log \mathbb{E}\, e^{\eta(V_t - \mathbb{E}V_t)} + \tfrac{1}{\eta}\log\frac{\sum_i e^{\eta G_{i,t}}}{\sum_i e^{\eta G_{i,t-1}}} \Big] \qquad \text{where } \mathbb{P}(V_t = g_{i,t}) = p_{i,t} \\
&\ge \mathbb{E}\sum_t \Big(-\tfrac{\eta}{8}\Big) + \tfrac{1}{\eta}\,\mathbb{E}\log\frac{\sum_j e^{\eta G_{j,n}}}{\sum_j e^{\eta G_{j,0}}} \qquad \text{(Hoeffding's inequality)} \\
&\ge -\tfrac{n\eta}{8} + \tfrac{1}{\eta}\,\mathbb{E}\log\frac{e^{\eta \max_j G_{j,n}}}{K} = -\tfrac{n\eta}{8} - \tfrac{\log K}{\eta} + \mathbb{E}\max_j G_{j,n}
\end{align*}

Adapting the exponentially weighted forecaster

- In the bandit setting, $G_{i,t-1}$, $i = 1,\dots,K$, are not observed. Trick: estimate them
- Precisely, $G_{i,t-1}$ is estimated by $\tilde G_{i,t-1} = \sum_{s=1}^{t-1}\tilde g_{i,s}$ with
  \[ \tilde g_{i,s} = 1 - \frac{1 - g_{I_s,s}}{p_{I_s,s}}\,\mathbb{1}_{I_s=i}. \]
  Note that
  \[ \mathbb{E}_{I_s\sim p_s}\,\tilde g_{i,s} = 1 - \sum_{k=1}^K p_{k,s}\,\frac{1-g_{k,s}}{p_{k,s}}\,\mathbb{1}_{k=i} = g_{i,s} \]
  \[ p_{i,t} = \mathbb{P}(I_t = i) = \frac{e^{\eta \tilde G_{i,t-1}}}{\sum_{k=1}^K e^{\eta \tilde G_{k,t-1}}} \]
- For this policy, $R_n \le \frac{nK\eta}{2} + \frac{\log K}{\eta}$
- In particular, for $\eta = \sqrt{\frac{2\log K}{nK}}$, we have $R_n \le \sqrt{2nK\log K}$
  (Auer, Cesa-Bianchi, Freund, Schapire, 1995)
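A sketch of the bandit version above (the Exp3-style forecaster with the gain estimates $\tilde g_{i,s}$ of this slide); the i.i.d. Bernoulli "adversary" in the demo is only there to exercise the code, and the names are mine.

```python
import math
import random

def exp3_bandit(K, n, get_gain, eta, seed=0):
    """Exponentially weighted forecaster with the estimated gains
    tilde_g_{i,s} = 1 - (1 - g_{I_s,s}) / p_{I_s,s} * 1{I_s = i}."""
    rng = random.Random(seed)
    G_tilde = [0.0] * K
    total = 0.0
    for t in range(1, n + 1):
        m = max(G_tilde)
        w = [math.exp(eta * (G - m)) for G in G_tilde]   # shift by max for stability
        Z = sum(w)
        p = [wi / Z for wi in w]
        i_t = rng.choices(range(K), weights=p)[0]
        g = get_gain(t, i_t)                             # only g_{I_t,t} is observed
        total += g
        for i in range(K):
            G_tilde[i] += 1.0 - (1.0 - g) / p[i_t] * (1.0 if i == i_t else 0.0)
    return total

# Example with a stochastic (hence non-adversarial) gain sequence, for illustration only.
means = [0.6, 0.5]
rng_env = random.Random(1)
gain_fn = lambda t, i: 1.0 if rng_env.random() < means[i] else 0.0
n, K = 5000, len(means)
eta = math.sqrt(2 * math.log(K) / (n * K))
print(exp3_bandit(K, n, gain_fn, eta))
```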

Implicitly Normalized Forecaster (Audibert, Bubeck, 2010)

Let $\psi : \mathbb{R}_-^* \to \mathbb{R}_+^*$ be increasing, convex, twice continuously differentiable, and such that $[\frac{1}{K}, 1] \subset \psi(\mathbb{R}_-^*)$
Let $p_1$ be the uniform distribution over $\{1,\dots,K\}$
For each round t = 1, 2, ...
- draw $I_t \sim p_t$
- compute $p_{t+1} = (p_{1,t+1},\dots,p_{K,t+1})$ where $p_{i,t+1} = \psi(\tilde G_{i,t} - C_t)$ and $C_t$ is the unique real number such that $\sum_{i=1}^K p_{i,t+1} = 1$
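A sketch of one INF update with $\psi(x) = (-\eta x)^{-q}$: $C_t$ is found by bisection, using the fact that $\sum_i \psi(\tilde G_{i,t} - C)$ is decreasing in C on $(\max_i \tilde G_{i,t}, +\infty)$. The values of η and q below are placeholders, not the tuned constants of the next slide.

```python
def inf_distribution(G_tilde, eta=1.0, q=2.0, tol=1e-10):
    """One Implicitly Normalized Forecaster step: p_{i,t+1} = psi(tilde_G_i - C_t)
    with psi(x) = (-eta * x)^(-q) on x < 0, and C_t chosen so the p_i sum to 1."""
    g_max = max(G_tilde)

    def total(C):                       # sum_i psi(tilde_G_i - C), decreasing in C
        return sum((eta * (C - G)) ** (-q) for G in G_tilde)

    lo = g_max + 1e-9                   # just above max_i tilde_G_i: the sum is huge
    hi = g_max + 1.0
    while total(hi) > 1.0:              # push hi up until the sum drops below 1
        hi = g_max + 2 * (hi - g_max)
    while hi - lo > tol:                # bisection for the unique root C_t
        mid = (lo + hi) / 2.0
        if total(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    C = (lo + hi) / 2.0
    return [(eta * (C - G)) ** (-q) for G in G_tilde]

# Example: with equal estimated gains, the forecaster returns the uniform distribution.
print(inf_distribution([0.0, 0.0, 0.0]))
```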

Minimax policy

- $\psi(x) = \exp(\eta x)$ with $\eta > 0$: this corresponds exactly to the exponentially weighted forecaster
- $\psi(x) = (-\eta x)^{-q}$ with $q > 1$ and $\eta > 0$: this is a new policy which is minimax optimal: for $q = 2$ and $\eta = \sqrt{2n}$, we have
  \[ R_n \le 2\sqrt{2nK} \]
  (Audibert, Bubeck, 2010; Audibert, Bubeck, Lugosi, 2011)
  while for any strategy, we have
  \[ \sup R_n \ge \frac{1}{20}\sqrt{nK} \]
  (Auer, Cesa-Bianchi, Freund, Schapire, 1995)
