Introduction to Bandits: Algorithms and Theory

Jean-Yves Audibert (1,2) & Rémi Munos (3)
1. Université Paris-Est, LIGM, Imagine
2. CNRS/École Normale Supérieure/INRIA, LIENS, Sierra
3. INRIA Sequential Learning team, France

ICML 2011, Bellevue (WA), USA


Outline

- Bandit problems and applications
- Bandits with small set of actions
  - Stochastic setting
  - Adversarial setting
- Bandits with large set of actions
  - unstructured set
  - structured set
    - linear bandits
    - Lipschitz bandits
    - tree bandits
- Extensions

Bandit game

Parameters available to the forecaster: the number of arms (or actions) K and the number of rounds n
Unknown to the forecaster: the way the gain vectors $g_t = (g_{1,t}, \dots, g_{K,t}) \in [0,1]^K$ are generated

For each round t = 1, 2, ..., n
1. the forecaster chooses an arm $I_t \in \{1, \dots, K\}$
2. the forecaster receives the gain $g_{I_t,t}$
3. only $g_{I_t,t}$ is revealed to the forecaster

Cumulative regret goal: maximize the cumulative gains obtained. More precisely, minimize
\[ R_n = \max_{i=1,\dots,K} \mathbb{E}\sum_{t=1}^n g_{i,t} \; - \; \mathbb{E}\sum_{t=1}^n g_{I_t,t} \]
where $\mathbb{E}$ comes from both a possible stochastic generation of the gain vector and a possible randomization in the choice of $I_t$.

Stochastic and adversarial environments

- Stochastic environment: the gain vector $g_t$ is sampled from an unknown product distribution $\nu_1 \otimes \cdots \otimes \nu_K$ on $[0,1]^K$, that is $g_{i,t} \sim \nu_i$.
- Adversarial environment: the gain vector $g_t$ is chosen by an adversary (which, at time t, knows all the past, but not $I_t$).

Numerous variants

- different environments: adversarial, "stochastic", non-stationary
- different targets: cumulative regret, simple regret, tracking the best expert
- continuous or discrete set of actions
- extensions with additional rules: varying set of arms, pay-per-observation, ...

Various applications

- Clinical trials (Thompson, 1933)
- Ads placement on webpages
- Nash equilibria (traffic or communication networks, agent simulation, phantom tic-tac-toe, ...)
- Game-playing computers (Go, Urban Rivals, ...)
- Packet routing, itinerary selection
- ...


Stochastic bandit game (Robbins, 1952)

Parameters available to the forecaster: K and n
Parameters unknown to the forecaster: the reward distributions $\nu_1, \dots, \nu_K$ of the arms (with respective means $\mu_1, \dots, \mu_K$)

For each round t = 1, 2, ..., n
1. the forecaster chooses an arm $I_t \in \{1, \dots, K\}$
2. the environment draws the gain vector $g_t = (g_{1,t}, \dots, g_{K,t})$ according to $\nu_1 \otimes \cdots \otimes \nu_K$
3. the forecaster receives the gain $g_{I_t,t}$

Notation: $i^* = \arg\max_{i=1,\dots,K} \mu_i$, $\mu^* = \max_{i=1,\dots,K} \mu_i$, $\Delta_i = \mu^* - \mu_i$, $T_i(n) = \sum_{t=1}^n \mathbb{1}_{I_t=i}$

Cumulative regret: $\hat R_n = \sum_{t=1}^n g_{i^*,t} - \sum_{t=1}^n g_{I_t,t}$

Goal: minimize the expected cumulative regret
\[ R_n = \mathbb{E}\hat R_n = n\mu^* - \mathbb{E}\sum_{t=1}^n g_{I_t,t} = n\mu^* - \mathbb{E}\sum_{i=1}^K T_i(n)\,\mu_i = \sum_{i=1}^K \Delta_i\,\mathbb{E}T_i(n) \]
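The regret decomposition $R_n = \sum_i \Delta_i\,\mathbb{E}T_i(n)$ can be checked numerically. Below is a minimal simulation sketch, not taken from the slides: Bernoulli arms, a deliberately bad uniformly random policy, and illustrative function and variable names. The two regret computations agree up to the noise of the realized gains.

```python
import random

def run_bandit(means, n, choose_arm, seed=0):
    """Simulate the stochastic bandit game with Bernoulli arms.

    means      : arm means mu_1, ..., mu_K (Bernoulli parameters)
    n          : horizon (number of rounds)
    choose_arm : function (t, counts, sums) -> arm index in {0, ..., K-1}
    Returns the realized cumulative gain and the pull counts T_i(n).
    """
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K      # T_i(t)
    sums = [0.0] * K      # cumulative observed reward of each arm
    total_gain = 0.0
    for t in range(1, n + 1):
        i = choose_arm(t, counts, sums)
        g = 1.0 if rng.random() < means[i] else 0.0   # g_{I_t,t} drawn from nu_{I_t}
        counts[i] += 1
        sums[i] += g
        total_gain += g
    return total_gain, counts

# A deliberately bad policy (uniformly random arm), just to exercise the protocol.
means, n = [0.6, 0.5, 0.4], 10000
gain, counts = run_bandit(means, n, lambda t, c, s: random.randrange(len(means)))
mu_star = max(means)
print("n*mu_star - realized gains :", n * mu_star - gain)
print("sum_i Delta_i T_i(n)       :", sum((mu_star - m) * T for m, T in zip(means, counts)))
```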

A simple policy: ε-greedy

For simplicity, all rewards are in [0, 1]
- Playing the arm with the highest empirical mean does not work
- ε-greedy: at time t,
  - with probability $1-\varepsilon_t$, play the arm with the highest empirical mean
  - with probability $\varepsilon_t$, play a random arm
- Theoretical guarantee (Auer, Cesa-Bianchi, Fischer, 2002):
  - let $\Delta = \min_{i:\Delta_i>0} \Delta_i$ and consider $\varepsilon_t = \min\big(\frac{6K}{\Delta^2 t}, 1\big)$
  - when $t \ge \frac{6K}{\Delta^2}$, the probability of choosing a suboptimal arm i is bounded by $\frac{C}{\Delta^2 t}$ for some constant $C > 0$
  - as a consequence, $\mathbb{E}[T_i(n)] \le \frac{C}{\Delta^2}\log n$ and $R_n \le \sum_{i:\Delta_i>0} \frac{C\,\Delta_i}{\Delta^2}\log n$, hence logarithmic regret
- Drawbacks:
  - naive exploration: for K > 2, no distinction between sub-optimal arms
  - requires knowledge of ∆
  - outperformed by the UCB policy in practice
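A minimal sketch of the $\varepsilon_t$-greedy selection rule described above, assuming the gap ∆ is known and rewards lie in [0, 1]; the function signature and variable names are mine, not from the slides.

```python
import random

def epsilon_greedy_arm(t, counts, sums, delta, rng):
    """One step of epsilon_t-greedy with epsilon_t = min(6K / (delta^2 * t), 1)."""
    K = len(counts)
    for i in range(K):
        if counts[i] == 0:                            # play each arm once to initialize
            return i
    eps_t = min(6 * K / (delta ** 2 * t), 1.0)
    if rng.random() < eps_t:
        return rng.randrange(K)                       # explore: uniformly random arm
    means = [sums[i] / counts[i] for i in range(K)]
    return max(range(K), key=lambda i: means[i])      # exploit: highest empirical mean

# Usage: plug this function into a simulation loop that maintains counts and sums.
rng = random.Random(0)
print(epsilon_greedy_arm(t=10, counts=[5, 4], sums=[3.0, 2.0], delta=0.1, rng=rng))
```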

Optimism in face of uncertainty

- At time t, from past observations and some probabilistic argument, you have an upper confidence bound (UCB) on the expected rewards.
- Simple implementation: play the arm having the largest UCB!

Why does it make sense?

- Could we keep drawing a wrong arm for a long time? No, since:
  - the more we draw a wrong arm i, the closer its UCB gets to the expected reward $\mu_i$,
  - and $\mu_i < \mu^* \le$ UCB on $\mu^*$

Illustration of UCB policy

[Figure.]

Confidence intervals vs sampling times

[Figure.]

Hoeffding-based UCB (Auer, Cesa-Bianchi, Fischer, 2002)

- Hoeffding's inequality: let $X, X_1, \dots, X_m$ be i.i.d. random variables taking their values in [0, 1]. For any ε > 0, with probability at least 1 − ε, we have
  \[ \mathbb{E}X \le \frac{1}{m}\sum_{s=1}^m X_s + \sqrt{\frac{\log(\varepsilon^{-1})}{2m}} \]
- UCB1 policy: at time t, play
  \[ I_t \in \arg\max_{i\in\{1,\dots,K\}} \left\{ \hat\mu_{i,t-1} + \sqrt{\frac{2\log t}{T_i(t-1)}} \right\} \]
  where $\hat\mu_{i,t-1} = \frac{1}{T_i(t-1)}\sum_{s=1}^{T_i(t-1)} X_{i,s}$
- Regret bound:
  \[ R_n \le \sum_{i\ne i^*} \min\Big(\frac{10}{\Delta_i}\log n,\ n\Delta_i\Big) \]

Hoeffding-based UCB (continued)

- UCB1 is an anytime policy: it does not need to know the horizon n to be implemented

Hoeffding-based UCB (continued)

- UCB1 corresponds to taking $2\log t = \frac{\log(\varepsilon^{-1})}{2}$ in Hoeffding's inequality, hence to the confidence level $\varepsilon = 1/t^4$
- Critical confidence level: $\varepsilon = 1/t$ (Lai & Robbins, 1985; Agrawal, 1995; Burnetas & Katehakis, 1996; Audibert, Munos, Szepesvári, 2009; Honda & Takemura, 2010)
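A minimal sketch of the UCB1 selection rule, with a tiny Bernoulli demo; the harness around the index is illustrative and not from the slides.

```python
import math
import random

def ucb1_choose(t, counts, sums):
    """UCB1: play the arm maximizing hat_mu_i + sqrt(2 * log t / T_i(t-1))."""
    K = len(counts)
    for i in range(K):
        if counts[i] == 0:        # each arm is played once before the indices are defined
            return i
    def index(i):
        return sums[i] / counts[i] + math.sqrt(2.0 * math.log(t) / counts[i])
    return max(range(K), key=index)

# Tiny demo on two Bernoulli arms.
rng = random.Random(0)
means, n = [0.6, 0.5], 5000
counts, sums = [0, 0], [0.0, 0.0]
for t in range(1, n + 1):
    i = ucb1_choose(t, counts, sums)
    g = 1.0 if rng.random() < means[i] else 0.0
    counts[i] += 1
    sums[i] += g
print("pulls of the suboptimal arm:", counts[1])
```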

Better confidence bounds imply smaller regret

- Hoeffding's inequality, $\frac{1}{t}$-confidence bound:
  \[ \mathbb{E}X \le \frac{1}{m}\sum_{s=1}^m X_s + \sqrt{\frac{\log t}{2m}} \]
- Bernstein's inequality, $\frac{1}{t}$-confidence bound:
  \[ \mathbb{E}X \le \frac{1}{m}\sum_{s=1}^m X_s + \sqrt{\frac{2\,\mathrm{Var}(X)\,\log t}{m}} + \frac{\log t}{3m} \]
- Empirical Bernstein's inequality, $\frac{3}{t}$-confidence bound:
  \[ \mathbb{E}X \le \frac{1}{m}\sum_{s=1}^m X_s + \sqrt{\frac{2\,\widehat{\mathrm{Var}}(X)\,\log t}{m}} + \frac{8\log t}{3m} \]
  (Audibert, Munos, Szepesvári, 2009; Maurer, 2009; Audibert, 2010)
- An asymptotic confidence bound leads to catastrophe:
  \[ \mathbb{E}X \le \frac{1}{m}\sum_{s=1}^m X_s + x\,\sqrt{\frac{\widehat{\mathrm{Var}}(X)}{m}} \quad\text{with } x \text{ s.t. } \int_x^{+\infty}\frac{e^{-u^2/2}}{\sqrt{2\pi}}\,du = \frac{1}{t} \]

Better confidence bounds imply smaller regret

- Hoeffding-based UCB:
  \[ \mathbb{E}X \le \frac{1}{m}\sum_{s=1}^m X_s + \sqrt{\frac{\log(\varepsilon^{-1})}{2m}} \qquad\Longrightarrow\qquad R_n \le \sum_{i\ne i^*}\min\Big(\frac{c}{\Delta_i}\log n,\ n\Delta_i\Big) \]
- Empirical Bernstein-based UCB:
  \[ \mathbb{E}X \le \frac{1}{m}\sum_{s=1}^m X_s + \sqrt{\frac{2\log(\varepsilon^{-1})\,\widehat{\mathrm{Var}}\,X}{m}} + \frac{8\log(\varepsilon^{-1})}{3m} \qquad\Longrightarrow\qquad R_n \le \sum_{i\ne i^*}\min\Big(c\Big(\frac{\mathrm{Var}\,\nu_i}{\Delta_i}+1\Big)\log n,\ n\Delta_i\Big) \]
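The slides state the empirical Bernstein confidence bound but give no pseudo-code; the sketch below computes the corresponding index in the spirit of UCB-V (Audibert, Munos, Szepesvári, 2009), transcribing the bound displayed above with confidence level roughly 1/t. Storing the raw rewards and the guard for small t are my own choices.

```python
import math

def empirical_bernstein_index(rewards_i, t):
    """Upper confidence bound for one arm from its observed rewards:
        hat_mu + sqrt(2 * hat_var * log t / m) + 8 * log t / (3 m),
    i.e. the empirical Bernstein bound at confidence level ~ 1/t."""
    m = len(rewards_i)
    hat_mu = sum(rewards_i) / m
    hat_var = sum((x - hat_mu) ** 2 for x in rewards_i) / m
    log_t = math.log(max(t, 2))
    return hat_mu + math.sqrt(2.0 * hat_var * log_t / m) + 8.0 * log_t / (3.0 * m)

# Example: a near-deterministic arm gets a much tighter bound than a noisy one.
print(empirical_bernstein_index([0.6] * 20, t=100))
print(empirical_bernstein_index([1.0, 0.0] * 10, t=100))
```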

Tuning the exploration: simple vs difficult bandit problems

- UCB1(ρ) policy: at time t, play
  \[ I_t \in \arg\max_{i\in\{1,\dots,K\}} \left\{ \hat\mu_{i,t-1} + \sqrt{\frac{\rho\log t}{T_i(t-1)}} \right\} \]

[Figures: expected regret of UCB1(ρ) as a function of the exploration parameter ρ, for n = 1000 and K = 2 arms. Left: Dirac(0.6) and Ber(0.5). Right: Ber(0.6) and Ber(0.5).]

Tuning the exploration parameter: from theory to practice

- Theory:
  - for ρ < 0.5, UCB1(ρ) has polynomial regret
  - for ρ > 0.5, UCB1(ρ) has logarithmic regret
  (Audibert, Munos, Szepesvári, 2009; Bubeck, 2010)
- Practice: ρ = 0.2 seems to be the best default value for $n < 10^8$

[Figures: expected regret of UCB1(ρ) as a function of ρ, for n = 1000. Left: K = 3 arms, Ber(0.6), Ber(0.5), Ber(0.5). Right: K = 5 arms, Ber(0.7), Ber(0.6), Ber(0.5), Ber(0.4), Ber(0.3).]
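A sketch of the kind of experiment behind these curves: UCB1(ρ) on Bernoulli arms, averaged over a few Monte Carlo runs. The slides do not give the exact experimental protocol, so the number of runs, the seeds, and the grid of ρ values below are arbitrary.

```python
import math
import random

def ucb1_rho_regret(means, n, rho, runs=50, seed=0):
    """Average pseudo-regret of UCB1(rho) over several runs on Bernoulli arms."""
    rng = random.Random(seed)
    mu_star, K = max(means), len(means)
    total = 0.0
    for _ in range(runs):
        counts, sums = [0] * K, [0.0] * K
        for t in range(1, n + 1):
            if t <= K:
                i = t - 1                      # play each arm once
            else:
                i = max(range(K), key=lambda j: sums[j] / counts[j]
                        + math.sqrt(rho * math.log(t) / counts[j]))
            g = 1.0 if rng.random() < means[i] else 0.0
            counts[i] += 1
            sums[i] += g
        total += sum((mu_star - means[i]) * counts[i] for i in range(K))
    return total / runs

for rho in [0.1, 0.2, 0.5, 1.0, 2.0]:
    print(rho, ucb1_rho_regret([0.6, 0.5, 0.5], n=1000, rho=rho))
```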

Deviations of UCB1 regret

- UCB1 policy: at time t, play
  \[ I_t \in \arg\max_{i\in\{1,\dots,K\}} \left\{ \hat\mu_{i,t-1} + \sqrt{\frac{2\log t}{T_i(t-1)}} \right\} \]
- An inequality of the form $\mathbb{P}(\hat R_n > \mathbb{E}\hat R_n + \gamma) \le c\,e^{-c\gamma}$ does not hold!
- If the smallest reward observable from the optimal arm is smaller than the mean reward of the second-best arm, then the regret of UCB1 satisfies: for any $C > 0$, there exists $C' > 0$ such that for any $n \ge 2$,
  \[ \mathbb{P}\big(\hat R_n > \mathbb{E}\hat R_n + C\log n\big) > \frac{1}{C'(\log n)^{C'}} \]
  (Audibert, Munos, Szepesvári, 2009)

Anytime UCB policies have a heavy-tailed regret

- For some difficult bandit problems, the regret of UCB1 satisfies: for any $C > 0$, there exists $C' > 0$ such that for any $n \ge 2$,
  \[ \mathbb{P}\big(\hat R_n > \mathbb{E}\hat R_n + C\log n\big) > \frac{1}{C'(\log n)^{C'}} \qquad (\star) \]
- UCB-Horizon policy: at time t, play
  \[ I_t \in \arg\max_{i\in\{1,\dots,K\}} \left\{ \hat\mu_{i,t-1} + \sqrt{\frac{2\log n}{T_i(t-1)}} \right\} \]
  (Audibert, Munos, Szepesvári, 2009)
- UCB-H satisfies $\mathbb{P}(\hat R_n > \mathbb{E}\hat R_n + C\log n) \le \frac{C}{n}$ for some $C > 0$
- $(\star)$ is unavoidable for anytime policies (Salomon, Audibert, 2011)
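UCB-Horizon differs from UCB1 only in the logarithmic term (log n instead of log t); a side-by-side sketch of the two indices, with an explicit exploration parameter ρ as used on the earlier slides (names are mine):

```python
import math

def ucb1_index(hat_mu, T_i, t, rho=2.0):
    """Anytime index: the exploration bonus grows with the current time t."""
    return hat_mu + math.sqrt(rho * math.log(t) / T_i)

def ucb_horizon_index(hat_mu, T_i, n, rho=2.0):
    """Horizon-dependent index: the bonus uses the known horizon n instead of t."""
    return hat_mu + math.sqrt(rho * math.log(n) / T_i)
```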

Comparison of UCB1 (solid lines) and UCB-H (dotted lines)

[Figure.]

Comparison of UCB1(ρ) and UCB-H(ρ) in expectation

[Figures: expected regret of GCL*, UCB1(ρ) and UCB-H(ρ) as a function of the exploration parameter ρ, for n = 1000 and K = 2 arms. Left: Dirac(0.6) vs Bernoulli(0.5). Right: Bernoulli(0.6) vs Dirac(0.5).]

Comparison of UCB1(ρ) and UCB-H(ρ) in deviations

- For n = 1000 and K = 2 arms: Bernoulli(0.6) and Dirac(0.5)

[Figures: GCL*, UCB1(0.2), UCB1(0.5), UCB-H(0.2), UCB-H(0.5). Left: smoothed probability mass function of the regret, $\mathbb{P}(r-2 \le \hat R_n \le r+2)$. Right: tail distribution of the regret, $\mathbb{P}(\hat R_n > r)$, as a function of the regret level r.]

Knowing the horizon: theory and practice

- Theory: use UCB-H to avoid heavy tails of the regret
- Practice: theory is right. Besides, thanks to this robustness, the expected regret of UCB-H(ρ) consistently outperforms the expected regret of UCB1(ρ). However:
  - the gain is small
  - a better way to have small regret tails is to take a larger ρ

Knowing µ*

- Hoeffding-based GCL* policy: play each arm once, then play
  \[ I_t \in \operatorname*{argmin}_{i\in\{1,\dots,K\}} T_i(t-1)\,\big(\mu^* - \hat\mu_{i,t-1}\big)_+^2 \]
  (Salomon, Audibert, 2011)
- Underlying ideas:
  - compare the p-values of the K tests $H_0 = \{\mu_i = \mu^*\}$, $i\in\{1,\dots,K\}$
  - the p-values are estimated using Hoeffding's inequality:
    \[ \mathbb{P}_{H_0}\Big(\hat\mu_{i,t-1} \le \hat\mu_{i,t-1}^{(\mathrm{obs})}\Big) \lessapprox \exp\Big(-2\,T_i(t-1)\,\big(\mu^* - \hat\mu_{i,t-1}^{(\mathrm{obs})}\big)_+^2\Big) \]
  - play the arm for which we have the Greatest Confidence Level that it is the optimal arm
- Advantages:
  - logarithmic expected regret
  - anytime policy
  - regret with a subexponential right tail
  - parameter-free policy!
  - outperforms any other Hoeffding-based algorithm!
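A minimal sketch of the Hoeffding-based GCL* selection rule when µ* is known; the surrounding bookkeeping (counts, sums) and names are illustrative.

```python
def gcl_star_choose(counts, sums, mu_star):
    """Hoeffding-based GCL*: after each arm has been played once, play
    argmin_i T_i(t-1) * (mu_star - hat_mu_i)_+^2, i.e. the arm whose test
    H0 = {mu_i = mu_star} currently has the largest estimated p-value."""
    K = len(counts)
    for i in range(K):
        if counts[i] == 0:
            return i
    def exponent(i):                                  # T_i * (mu_star - hat_mu_i)_+^2
        gap = max(mu_star - sums[i] / counts[i], 0.0)
        return counts[i] * gap ** 2
    return min(range(K), key=exponent)
```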

From Chernoff's inequality to KL-based algorithms

- Let $K(p,q)$ be the Kullback-Leibler divergence between Bernoulli distributions of respective parameters p and q
- Let $X_1, \dots, X_T$ be i.i.d. random variables of mean µ, taking their values in [0, 1]. Let $\bar X = \frac{1}{T}\sum_{i=1}^T X_i$. For any γ > 0,
  \[ \mathbb{P}\big(\bar X \le \mu - \gamma\big) \le \exp\big(-T\,K(\mu-\gamma,\mu)\big). \]
  In particular, we have
  \[ \mathbb{P}\big(\bar X \le \bar X^{(\mathrm{obs})}\big) \le \exp\Big(-T\,K\big(\min(\bar X^{(\mathrm{obs})},\mu),\mu\big)\Big). \]
- If µ* is known, using the same idea of comparing the p-values of the tests $H_0 = \{\mu_i = \mu^*\}$, $i\in\{1,\dots,K\}$, we get the Chernoff-based GCL* policy: play each arm once, then play
  \[ I_t \in \operatorname*{argmin}_{i\in\{1,\dots,K\}} T_i(t-1)\,K\big(\min(\hat\mu_{i,t-1},\mu^*),\mu^*\big) \]
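A sketch of the Bernoulli KL divergence and of the Chernoff-based GCL* rule above; the clipping constant eps guards against log(0) and is my own choice.

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """K(p, q): Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def chernoff_gcl_star_choose(counts, sums, mu_star):
    """Chernoff-based GCL*: play argmin_i T_i(t-1) * K(min(hat_mu_i, mu_star), mu_star)."""
    K = len(counts)
    for i in range(K):
        if counts[i] == 0:        # play each arm once
            return i
    def exponent(i):
        hat_mu = sums[i] / counts[i]
        return counts[i] * kl_bernoulli(min(hat_mu, mu_star), mu_star)
    return min(range(K), key=exponent)
```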

Back to unknown µ*

- When µ* is unknown, the principle "play the arm for which we have the greatest confidence level that it is the optimal arm" is replaced by "be optimistic in the face of uncertainty":
  - an arm i is represented by the highest mean of a distribution ν for which the hypothesis $H_0 = \{\nu_i = \nu\}$ has a p-value greater than $\frac{1}{t^\beta}$ (critical β = 1, as usual)
  - the arm with the highest index (= UCB) is played

KL-based algorithms when µ* is unknown

- Approximating the p-value using Sanov's theorem is tightly linked to the DMED policy, which satisfies
  \[ \limsup_{n\to+\infty}\ \frac{R_n}{\log n} \le \sum_{i:\Delta_i>0} \frac{\Delta_i}{\inf_{\nu:\ \mathbb{E}_{X\sim\nu}X \ge \mu^*} K(\nu_i,\nu)} \]
  (Burnetas & Katehakis, 1996; Honda & Takemura, 2010)
  It matches the lower bound
  \[ \liminf_{n\to+\infty}\ \frac{R_n}{\log n} \ge \sum_{i:\Delta_i>0} \frac{\Delta_i}{\inf_{\nu:\ \mathbb{E}_{X\sim\nu}X \ge \mu^*} K(\nu_i,\nu)} \]
  (Lai & Robbins, 1985; Burnetas & Katehakis, 1996)
- Approximating the p-value using a non-asymptotic version of Sanov's theorem leads to KL-UCB (Cappé & Garivier, COLT 2011) and the K-strategy (Maillard, Munos, Stoltz, COLT 2011)
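KL-UCB replaces the Hoeffding bonus by the largest mean q compatible with the observations at level roughly 1/t, i.e. the largest q with $T_i\,K(\hat\mu_i, q) \le \log t$. A bisection sketch for Bernoulli rewards follows; the exact threshold used by Cappé & Garivier (which may include an extra log log t term) is not reproduced here, so treat this as an assumption-laden illustration.

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(hat_mu, T_i, t, tol=1e-6):
    """Largest q >= hat_mu such that T_i * K(hat_mu, q) <= log t, found by bisection.
    K(hat_mu, q) is nondecreasing in q on [hat_mu, 1], so bisection applies."""
    threshold = math.log(max(t, 2)) / T_i
    lo, hi = hat_mu, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if kl_bernoulli(hat_mu, mid) <= threshold:
            lo = mid
        else:
            hi = mid
    return lo

# Example: the index shrinks toward the empirical mean as the arm is pulled more.
print(kl_ucb_index(0.5, T_i=10, t=1000), kl_ucb_index(0.5, T_i=1000, t=1000))
```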


Adversarial bandit

Parameters: the number of arms K and the number of rounds n

For each round t = 1, 2, ..., n
1. the forecaster chooses an arm $I_t \in \{1,\dots,K\}$, possibly with the help of an external randomization
2. the adversary chooses a gain vector $g_t = (g_{1,t},\dots,g_{K,t}) \in [0,1]^K$
3. the forecaster receives and observes only the gain $g_{I_t,t}$

Goal: maximize the cumulative gains obtained. We consider the regret
\[ R_n = \max_{i=1,\dots,K} \mathbb{E}\sum_{t=1}^n g_{i,t} \; - \; \mathbb{E}\sum_{t=1}^n g_{I_t,t}. \]

- In full information, step 3 is replaced by: the forecaster receives $g_{I_t,t}$ and observes the full gain vector $g_t$
- In both settings, the forecaster should use an external randomization to have o(n) regret.

Adversarial setting in full information: an optimal policy

- Cumulative reward on [1, t−1]: $G_{i,t-1} = \sum_{s=1}^{t-1} g_{i,s}$
- Follow-the-leader, $I_t \in \arg\max_{i\in\{1,\dots,K\}} G_{i,t-1}$, is a bad policy
- An "optimal" policy is obtained by considering
  \[ p_{i,t} = \mathbb{P}(I_t = i) = \frac{e^{\eta G_{i,t-1}}}{\sum_{k=1}^K e^{\eta G_{k,t-1}}} \]
- For this policy, $R_n \le \frac{n\eta}{8} + \frac{\log K}{\eta}$
- In particular, for $\eta = \sqrt{\frac{8\log K}{n}}$, we have $R_n \le \sqrt{\frac{n\log K}{2}}$
  (Littlestone, Warmuth, 1994; Long, 1996; Bylanger, 1997; Cesa-Bianchi, 1999)
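A sketch of the exponentially weighted forecaster in full information; the max-shift before exponentiation is a standard numerical-stability trick, not part of the slide, and the toy adversary in the demo is illustrative.

```python
import math
import random

def exp_weights_full_info(gains, eta, seed=0):
    """Exponentially weighted forecaster in full information.
    gains[t][i] is the gain of arm i at round t + 1; every g_{i,t} is observed."""
    rng = random.Random(seed)
    K = len(gains[0])
    G = [0.0] * K                                   # cumulative gains G_{i,t-1}
    total = 0.0
    for g_t in gains:
        m = max(G)                                  # shift by max(G) for numerical stability
        w = [math.exp(eta * (Gi - m)) for Gi in G]
        Z = sum(w)
        p = [wi / Z for wi in w]                    # p_{i,t} proportional to exp(eta * G_{i,t-1})
        i_t = rng.choices(range(K), weights=p)[0]   # draw I_t ~ p_t
        total += g_t[i_t]
        for i in range(K):                          # full information: all G_i are updated
            G[i] += g_t[i]
    return total, G

# Example: 1000 rounds, 2 arms, gains fixed in advance by an oblivious adversary.
gains = [[1.0, 0.0] if t % 3 else [0.0, 1.0] for t in range(1000)]
eta = math.sqrt(8 * math.log(2) / len(gains))
print(exp_weights_full_info(gains, eta))
```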

Proof of the regret bound

\[ p_{i,t} = \mathbb{P}(I_t = i) = \frac{e^{\eta G_{i,t-1}}}{\sum_{k=1}^K e^{\eta G_{k,t-1}}} \]

\begin{align*}
\mathbb{E}\sum_t g_{I_t,t}
&= \mathbb{E}\sum_t \sum_i p_{i,t}\, g_{i,t} \\
&= \mathbb{E}\sum_t \Big[ -\tfrac{1}{\eta}\log\Big(\textstyle\sum_i p_{i,t}\, e^{\eta(g_{i,t}-\sum_j p_{j,t} g_{j,t})}\Big) + \tfrac{1}{\eta}\log\Big(\textstyle\sum_i p_{i,t}\, e^{\eta g_{i,t}}\Big) \Big] \\
&= \mathbb{E}\sum_t \Big[ -\tfrac{1}{\eta}\log \mathbb{E}\, e^{\eta(V_t - \mathbb{E}V_t)} + \tfrac{1}{\eta}\log\frac{\sum_i e^{\eta G_{i,t}}}{\sum_i e^{\eta G_{i,t-1}}} \Big] \qquad \text{where } \mathbb{P}(V_t = g_{i,t}) = p_{i,t} \\
&\ge \mathbb{E}\sum_t \Big(-\tfrac{\eta}{8}\Big) + \tfrac{1}{\eta}\,\mathbb{E}\log\frac{\sum_j e^{\eta G_{j,n}}}{\sum_j e^{\eta G_{j,0}}} \qquad \text{(Hoeffding's inequality)} \\
&\ge -\tfrac{n\eta}{8} + \tfrac{1}{\eta}\,\mathbb{E}\log\frac{e^{\eta \max_j G_{j,n}}}{K} = -\tfrac{n\eta}{8} - \tfrac{\log K}{\eta} + \mathbb{E}\max_j G_{j,n}
\end{align*}

Adapting the exponentially weighted forecaster

- In the bandit setting, $G_{i,t-1}$, $i = 1,\dots,K$, are not observed. Trick: estimate them
- Precisely, $G_{i,t-1}$ is estimated by $\tilde G_{i,t-1} = \sum_{s=1}^{t-1}\tilde g_{i,s}$ with
  \[ \tilde g_{i,s} = 1 - \frac{1 - g_{I_s,s}}{p_{I_s,s}}\,\mathbb{1}_{I_s=i}. \]
  Note that
  \[ \mathbb{E}_{I_s\sim p_s}\,\tilde g_{i,s} = 1 - \sum_{k=1}^K p_{k,s}\,\frac{1-g_{k,s}}{p_{k,s}}\,\mathbb{1}_{k=i} = g_{i,s} \]
  \[ p_{i,t} = \mathbb{P}(I_t = i) = \frac{e^{\eta \tilde G_{i,t-1}}}{\sum_{k=1}^K e^{\eta \tilde G_{k,t-1}}} \]
- For this policy, $R_n \le \frac{nK\eta}{2} + \frac{\log K}{\eta}$
- In particular, for $\eta = \sqrt{\frac{2\log K}{nK}}$, we have $R_n \le \sqrt{2nK\log K}$
  (Auer, Cesa-Bianchi, Freund, Schapire, 1995)
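A sketch of the bandit version above (the Exp3-style forecaster with the gain estimates $\tilde g_{i,s}$ of this slide); the i.i.d. Bernoulli "adversary" in the demo is only there to exercise the code, and the names are mine.

```python
import math
import random

def exp3_bandit(K, n, get_gain, eta, seed=0):
    """Exponentially weighted forecaster with the estimated gains
    tilde_g_{i,s} = 1 - (1 - g_{I_s,s}) / p_{I_s,s} * 1{I_s = i}."""
    rng = random.Random(seed)
    G_tilde = [0.0] * K
    total = 0.0
    for t in range(1, n + 1):
        m = max(G_tilde)
        w = [math.exp(eta * (G - m)) for G in G_tilde]   # shift by max for stability
        Z = sum(w)
        p = [wi / Z for wi in w]
        i_t = rng.choices(range(K), weights=p)[0]
        g = get_gain(t, i_t)                             # only g_{I_t,t} is observed
        total += g
        for i in range(K):
            G_tilde[i] += 1.0 - (1.0 - g) / p[i_t] * (1.0 if i == i_t else 0.0)
    return total

# Example with a stochastic (hence non-adversarial) gain sequence, for illustration only.
means = [0.6, 0.5]
rng_env = random.Random(1)
gain_fn = lambda t, i: 1.0 if rng_env.random() < means[i] else 0.0
n, K = 5000, len(means)
eta = math.sqrt(2 * math.log(K) / (n * K))
print(exp3_bandit(K, n, gain_fn, eta))
```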

Implicitly Normalized Forecaster (Audibert, Bubeck, 2010)

Let $\psi : \mathbb{R}_-^* \to \mathbb{R}_+^*$ be increasing, convex, twice continuously differentiable, and such that $[\frac{1}{K}, 1] \subset \psi(\mathbb{R}_-^*)$
Let $p_1$ be the uniform distribution over $\{1,\dots,K\}$
For each round t = 1, 2, ...
- draw $I_t \sim p_t$
- compute $p_{t+1} = (p_{1,t+1},\dots,p_{K,t+1})$ where $p_{i,t+1} = \psi(\tilde G_{i,t} - C_t)$ and $C_t$ is the unique real number such that $\sum_{i=1}^K p_{i,t+1} = 1$
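A sketch of one INF update with $\psi(x) = (-\eta x)^{-q}$: $C_t$ is found by bisection, using the fact that $\sum_i \psi(\tilde G_{i,t} - C)$ is decreasing in C on $(\max_i \tilde G_{i,t}, +\infty)$. The values of η and q below are placeholders, not the tuned constants of the next slide.

```python
def inf_distribution(G_tilde, eta=1.0, q=2.0, tol=1e-10):
    """One Implicitly Normalized Forecaster step: p_{i,t+1} = psi(tilde_G_i - C_t)
    with psi(x) = (-eta * x)^(-q) on x < 0, and C_t chosen so the p_i sum to 1."""
    g_max = max(G_tilde)

    def total(C):                       # sum_i psi(tilde_G_i - C), decreasing in C
        return sum((eta * (C - G)) ** (-q) for G in G_tilde)

    lo = g_max + 1e-9                   # just above max_i tilde_G_i: the sum is huge
    hi = g_max + 1.0
    while total(hi) > 1.0:              # push hi up until the sum drops below 1
        hi = g_max + 2 * (hi - g_max)
    while hi - lo > tol:                # bisection for the unique root C_t
        mid = (lo + hi) / 2.0
        if total(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    C = (lo + hi) / 2.0
    return [(eta * (C - G)) ** (-q) for G in G_tilde]

# Example: with equal estimated gains, the forecaster returns the uniform distribution.
print(inf_distribution([0.0, 0.0, 0.0]))
```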

Minimax policy

- $\psi(x) = \exp(\eta x)$ with $\eta > 0$: this corresponds exactly to the exponentially weighted forecaster
- $\psi(x) = (-\eta x)^{-q}$ with $q > 1$ and $\eta > 0$: this is a new policy which is minimax optimal: for $q = 2$ and $\eta = \sqrt{2n}$, we have
  \[ R_n \le 2\sqrt{2nK} \]
  (Audibert, Bubeck, 2010; Audibert, Bubeck, Lugosi, 2011)
  while for any strategy, we have
  \[ \sup R_n \ge \frac{1}{20}\sqrt{nK} \]
  (Auer, Cesa-Bianchi, Freund, Schapire, 1995)
