Reinforcement Learning, Multi-Armed Bandits, and AB testing

Raphaël Féraud OLNC/OLPS/DIESE/DIA/PROF

October 2018


Outline

1. Introduction
2. Use cases
3. Non-stationary Bandits
4. Contextual and Structured Bandits
5. Decentralized Exploration
6. Octopus
7. References


Introduction

Reinforcement Learning

A generic problem: reinforcement learning

The agent observes the environment, executes an action, and receives a reward. The environment changes its internal state due to the action and emits a reward. Goal: maximize the long-term rewards of the agent.



The environment can be deterministic or stochastic, adversarial or not, fully observable or partially observable. The agent can be model-based (the agent has a model of the environment: planning) or model-free (the environment is initially unknown: reinforcement learning).



Main features of reinforcement learning: there is no supervisor, only the reward of the played action is revealed to the agent. The environment is initially unknown: the agent has to interact with the environment to gather information. The agent's actions affect the subsequent data it receives. The reward is delayed.



Multi-armed Bandits

The multi-armed bandit problem

Main features of the multi-armed bandit problem: there is no supervisor, only the reward of the played action is revealed to the agent. The environment is initially unknown: the agent has to interact with the environment to gather information. The played action does not affect the state of the environment.



Exploration / Exploitation dilemma: Exploration: the agent plays a loosely estimated action in order to build a better estimate. Exploitation: the agent plays the best estimated action in order to maximize her cumulative reward.



Upper Confidence Bound

A smart solution: the optimism in the face of uncertainty principle: play the arm with the highest index = estimated mean reward + confidence interval.
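As an illustration, a minimal UCB1-style sketch of this index policy is given below; the Bernoulli rewards and the exploration constant c are assumptions for the example, not part of the slides.

```python
import math
import random

def ucb1(n_arms, pull, horizon, c=2.0):
    """Play the arm with the highest index = empirical mean + confidence interval."""
    counts = [0] * n_arms          # number of times each arm was played
    sums = [0.0] * n_arms          # cumulated reward of each arm
    for t in range(1, horizon + 1):
        if t <= n_arms:            # play each arm once to initialize the estimates
            k = t - 1
        else:
            k = max(range(n_arms),
                    key=lambda a: sums[a] / counts[a]
                    + math.sqrt(c * math.log(t) / counts[a]))
        r = pull(k)                # reward in [0, 1] returned by the environment
        counts[k] += 1
        sums[k] += r
    return max(range(n_arms), key=lambda a: sums[a] / counts[a])

# Toy environment: three Bernoulli arms with unknown means.
means = [0.2, 0.5, 0.7]
best = ucb1(3, lambda k: float(random.random() < means[k]), horizon=10000)
```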



Best Arm Identification

AB testing

A/B testing: in order to test the efficiency of a new treatment A compared to treatment B, the physician must set in advance the size of the patient sample, according to the gap (i.e. the minimum efficiency difference to be measured between A and B) and the expected confidence level.

Sequential A/B testing: in continuous development, for marketing campaigns, to blacklist frequency bands..., A/B testing is done sequentially and stops when the minimum efficiency difference is measured at the expected confidence level. In practice, sequential A/B testing takes the observations into account to stop the test: a smaller sample, and no more need to know the gap between A and B in advance.



Sequential AB testing: principle

1: Inputs: two options (arms) A and B, a confidence level 1 − δ, an approximation factor ε
2: Output: the best option (arm)
3: repeat
4:   arms A and B are sampled
5: until P(r_A ≤ r_B) ≤ δ



Sequential Multivariate testing: Best Arm Identification

1: Inputs: a set of arms K, a confidence level 1 − δ, an approximation factor ε
2: Output: the best option (arm)
3: repeat
4:   a remaining arm k ∈ K is sampled
5:   let k′ be the arm with the best estimated reward
6:   if the upper confidence bound of r_k is lower than the lower confidence bound of r_k′ then remove k from K
7: until |K| = 1
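A minimal sketch of this elimination loop, using Hoeffding-style confidence bounds (the exact bound is an assumption; the slides do not specify it):

```python
import math
import random

def best_arm_identification(arms, pull, delta=0.05):
    """Round-robin sampling with elimination via Hoeffding-style confidence bounds."""
    remaining = list(arms)
    counts = {k: 0 for k in arms}
    sums = {k: 0.0 for k in arms}
    while len(remaining) > 1:
        for k in remaining:                       # sample every remaining arm once
            sums[k] += pull(k)
            counts[k] += 1
        def bounds(k):                            # (lower, upper) confidence bounds on r_k
            mean = sums[k] / counts[k]
            rad = math.sqrt(math.log(4 * len(arms) * counts[k] ** 2 / delta)
                            / (2 * counts[k]))
            return mean - rad, mean + rad
        best = max(remaining, key=lambda k: sums[k] / counts[k])
        lcb_best = bounds(best)[0]
        # Remove every arm whose upper bound falls below the best arm's lower bound.
        remaining = [k for k in remaining if k == best or bounds(k)[1] >= lcb_best]
    return remaining[0]

# Toy example with three Bernoulli arms.
means = {"A": 0.45, "B": 0.55, "C": 0.65}
winner = best_arm_identification(means, lambda k: float(random.random() < means[k]))
```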



Best Arm Identification versus Multi-Armed Bandits

[Figure: click-through rate versus time. The bandit gain curve dominates the A/B testing gain curve, which has an A/B testing phase (test A and B equally) followed by a deployment phase (play the best option).]

The expected gain of multi-armed bandit algorithms is higher than that of best arm identification algorithms. Best arm identification algorithms are used when there is a maintenance cost associated with each action: continuous development, clinical trials, phone marketing campaigns...



Use cases

Internet Of Things

Optimization of LoRa transmissions

Reinforcement learning: the device chooses the parameters of the next LoRa transmission; then the environment (the gateways, the other devices, the weather...) generates a reward that the device uses to optimize its choice of parameters.



Experimental setting

Simulator: a realistic simulator (Varsier and Schwoerer 2017), which implements propagation models, shadowing and fast fading, collision rules and retransmissions, is used for the experiments.

One versus all: a single indoor gateway with 100 nodes is considered. 99 nodes, randomly positioned within a radius of 2 km, use the Adaptive Data Rate (ADR) algorithm; 1 node uses bandit algorithms at different distances from the gateway: 592, 1000 and 1972 meters.

Scenario:
1. 1000 packets are sent by the optimized node. Due to the use of ADR on the 99 other nodes, the mean reward of each arm evolves over time.
2. After sending 500 packets, the node moves from 592 meters to 1975 meters from the gateway. The mean rewards of the arms change abruptly.



Results

Comparison with ADR: the ADR algorithm, which does not handle mobile nodes, is clearly dominated by the multi-armed bandit algorithms.

Multi-armed bandits: Switching Thompson Sampling with Bayesian Aggregation is the best-performing algorithm. Surprisingly, Thompson Sampling performs as well as STS and SW-UCB, which are designed for switching environments.



Sensor Networks

Scheduling process in IEEE 802.15.4 TSCH network

TSCH: Time Slotted Channel Hopping. Communication is coordinated by a scheduler (time slot × set of channels), which avoids collisions. A node sends a packet or receives an ack: the reward is the success of the transmission. Interference issues arise on the license-free bands.



Multi-armed Bandit Algorithms for TSCH-compliant channel selection

The setting

The available channels are restricted to a whitelist for each node. Multi-armed bandits are used to choose the best available channel.

[Figure: TSCH with MAB. Channel offsets from the whitelist (e.g. 1, 7, 5, 13) are mapped to one of the 16 available channels via (30 + offset) % 16, giving the channel to transmit on.]



Performance gain

Varying the size of the whitelist from 1 to 16. Switching the Packet Data Rate of the channels (a switch every 50, 100 and 200 packets). Comparing the performance of Thompson Sampling (TS), Switching Thompson Sampling with Bayesian Aggregation (STSBA) and standard TSCH. Best algorithm: Switching Thompson Sampling with Bayesian Aggregation.
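For reference, the simplest of the compared baselines, Bernoulli Thompson Sampling over a whitelist of channels, can be sketched as follows; the whitelist values and the acknowledgement-based reward are illustrative assumptions, and this is plain TS, not STSBA.

```python
import random

class ThompsonSamplingChannelSelector:
    """Bernoulli Thompson Sampling over a whitelist of channels (Beta(1,1) priors)."""
    def __init__(self, whitelist):
        self.whitelist = list(whitelist)
        self.successes = {c: 1 for c in self.whitelist}  # Beta prior: alpha = 1
        self.failures = {c: 1 for c in self.whitelist}   # Beta prior: beta = 1

    def choose_channel(self):
        # Sample a mean from each channel's posterior and pick the largest.
        samples = {c: random.betavariate(self.successes[c], self.failures[c])
                   for c in self.whitelist}
        return max(samples, key=samples.get)

    def update(self, channel, acked):
        # The reward is the success of the transmission (ack received or not).
        if acked:
            self.successes[channel] += 1
        else:
            self.failures[channel] += 1

selector = ThompsonSamplingChannelSelector(whitelist=[11, 15, 20, 25])
channel = selector.choose_channel()
selector.update(channel, acked=True)
```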



Marketing campaign optimization

Marketing campaign optimization problem

We would like to promote services to our customers: which message, which service, will obtain the highest subscription rate?

Moreover, our customers have different profiles: for a given profile, which message, which service shall we propose?



Simulation on marketing campaigns of the year 2013

In a simulation run on 1.5 million customers and based on 16 million contacts from the year 2013, the number of clicks generated by 18 emailing campaigns is increased by a factor of 2.



Optimization of routing policies

Compagnon de voyage: the promise

On my usual route, an incident occurred this morning. The application alerts me on my mobile and offers me a personalized alternative. When I search for a route, the Compagnon de voyage application offers me an optimal route according to the context and my preferences.




Functional architecture of the service

The Octopus server optimizes the choice of the multi-criteria route search policy according to the observed context. The reward is observed via a like or when the proposed route is chosen.



Advertising optimization

Octopus is one of the complementary modules that could be used to monetize the audience: the action is the choice of the displayed ad, the reward is the click, and the context describes the content of the page and the profile of the cookie.



Other use cases

Online Relevance Feedback: the optimization of search and recommendation interfaces is an online process where the feedback is strongly biased by the interface itself.
Online Customer Experience Optimization: knowing the customer journey and the customer profile, optimizing the next best action (marketing, after-sales services, customer care...) improves the customer experience.
Autonomous Terminals: complex and interconnected terminals, such as a LiveBox, have to take online decisions in order to configure themselves, to ensure self-care or security...
Self Organizing Networks: depending on the observed network state, the network sequentially optimizes itself.
Healthcare: back to the origin of bandit algorithms, a home automation system has to choose actions (send an alert, recall a medication...) depending on the available context (physiological measurements, cameras, microphones...).



Non-stationary Bandits

Adversarial Bandits

Main features of adversarial bandits: an adversary, which knows the algorithm of the player, has generated a deterministic sequence of rewards for each action. The player has to randomize the choice of arms. The player has to explore continuously over time. The time horizon is known. The player competes against the best arm or the best sequence of arms of the run.

Exp3: in practice, adversarial bandit algorithms (Exp3, Exp3.S...) are very conservative.
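A minimal sketch of Exp3 (the fixed exploration rate gamma is an assumption; in theory it is tuned with the horizon):

```python
import math
import random

def exp3(n_arms, pull, horizon, gamma=0.1):
    """Exp3: exponentially weighted forecaster with uniform exploration."""
    weights = [1.0] * n_arms
    for _ in range(horizon):
        total = sum(weights)
        probs = [(1 - gamma) * w / total + gamma / n_arms for w in weights]
        k = random.choices(range(n_arms), weights=probs)[0]   # randomized choice of arm
        reward = pull(k)                                      # reward in [0, 1]
        estimate = reward / probs[k]                          # importance-weighted estimate
        weights[k] *= math.exp(gamma * estimate / n_arms)
    return weights
```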



Switching Bandit Problem

In the switching bandit problem, the mean rewards of the arms change abruptly over time:

µ_{k,t} = µ_{k,t−1} with probability 1 − ρ,
µ_{k,t} = µ_new ~ U(0, 1) with probability ρ,

where ρ is the switching rate. The run can be divided into M segments with different means, and into N segments with different best arms (M ≥ N).



First solutions

Sliding Window UCB: UCB is run on a sliding window: choose the arm with the highest index (estimated mean plus confidence interval), where the index is evaluated on the most recent part of the history only.

Exp3.R: Exp3 is run: sample an arm according to a mixture of the uniform distribution and a probability mass exponential in the estimated mean rewards of the arms. The uniform exploration of Exp3 feeds a Hoeffding-based change point detector, and Exp3 is reset when a change is detected.

Drawback of these approaches: limited memory (reset of the estimators or truncation of the history).
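For illustration, a minimal sketch of the sliding-window idea (the window size and the confidence term are assumptions):

```python
import math
from collections import deque

def sliding_window_ucb(n_arms, pull, horizon, window=500, c=2.0):
    """UCB indices computed on the last `window` plays only."""
    history = deque(maxlen=window)          # (arm, reward) pairs inside the window
    total_reward = 0.0
    for t in range(1, horizon + 1):
        counts = [0] * n_arms
        sums = [0.0] * n_arms
        for arm, r in history:              # statistics restricted to the window
            counts[arm] += 1
            sums[arm] += r
        if 0 in counts:
            k = counts.index(0)             # play any arm unseen in the window
        else:
            k = max(range(n_arms),
                    key=lambda a: sums[a] / counts[a]
                    + math.sqrt(c * math.log(min(t, window)) / counts[a]))
        r = pull(k)
        history.append((k, r))              # deque(maxlen=...) drops the oldest pair
        total_reward += r
    return total_reward
```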



Memory Bandits. Main features: a growing number of experts; each expert is a Thompson Sampling algorithm; a Bayesian change point detector; Bayesian aggregation to choose the best arm.
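One of these ingredients, the Bayesian change point detector, can be sketched as a Bernoulli run-length filter in the spirit of Adams and MacKay (2007); the hazard rate and the Beta priors below are assumptions.

```python
class BernoulliBOCD:
    """Bayesian online change point detection for Bernoulli rewards: maintains a
    posterior over the current run length (in the spirit of Adams and MacKay, 2007)."""
    def __init__(self, hazard=0.01, a0=1.0, b0=1.0):
        self.hazard = hazard            # prior probability of a change at each step
        self.a0, self.b0 = a0, b0       # Beta prior for a new run
        self.a, self.b = [a0], [b0]     # Beta parameters, one per run-length hypothesis
        self.runlen = [1.0]             # posterior over run lengths

    def update(self, x):
        # Predictive probability of the binary reward x under each run length.
        pred = [(a if x else b) / (a + b) for a, b in zip(self.a, self.b)]
        growth = [p * w * (1 - self.hazard) for p, w in zip(pred, self.runlen)]
        change = sum(p * w * self.hazard for p, w in zip(pred, self.runlen))
        new_runlen = [change] + growth
        z = sum(new_runlen)
        self.runlen = [w / z for w in new_runlen]
        # Grow the sufficient statistics; a new run restarts from the prior.
        self.a = [self.a0] + [a + (1 if x else 0) for a in self.a]
        self.b = [self.b0] + [b + (0 if x else 1) for b in self.b]
        return self.runlen[0]           # posterior probability that a change just occurred
```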



Non-stationary Best Arm Identification

Problem setting: an adversary has chosen a sequence of mean rewards for each action: a generalization of adversarial bandits. Assumption 1: positive mean gap.

SER3: Successive Elimination with a random choice among the remaining actions, when Assumption 1 holds.
SER4: Successive Elimination with random resets and a random choice among the remaining actions, for any sequence of best arms (when Assumption 1 does not hold).



Contextual and Structured Bandits


Main features: there is no supervisor, only the reward of the played action is observed. The environment is initially unknown: the agent has to interact with the environment to gather information. Before choosing the action, the player observes: a description of each action (structured bandits), or a random variable (contextual bandits).



Bandit Forest

Combines:
- best arm identification (exploration/exploitation trade-off),
- an online-built decision tree to fit the profiles (which option is best according to the profile?),
- a random forest to optimize the performance and to tune the parameters.

Bandit Forest: tree = randomized decision tree, node = context variable, edge = variable value, leaf = arm.

[Figure: example of a randomized decision tree splitting "most people" from "rugby lovers", with Option A or Option B at the leaves.]



The worst case analysis of BANDIT FOREST shows that the number of time steps needed to find the best random forest with high probability is optimal up to logarithmic factors.



Some experimental results

[Figures: results on the Forest Cover Type dataset and on the Census dataset.]

Key benefits: Bandit Forest versus LinUCB: non-linearity. Bandit Forest versus NeuralBandit: theoretical guarantees.
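For comparison, the linear baseline mentioned above (LinUCB, Li et al. 2010) can be sketched as follows; it assumes the expected reward is linear in the context, which is precisely the limitation Bandit Forest avoids (the value of alpha is an assumption).

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB: one ridge regression per arm, play the arm with the
    highest upper confidence bound on the predicted reward. A minimal sketch."""
    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]     # X^T X + I for each arm
        self.b = [np.zeros(dim) for _ in range(n_arms)]   # X^T y for each arm

    def choose(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                             # ridge estimate of the weights
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```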



Decentralized Exploration


Decentralized best arm identification (i.e. decentralized exploration) offers several advantages: scalability, thanks to parallel processing; reduction of costs, thanks to the removal of a centralized processing architecture; increased responsiveness of the applications, thanks to the minimization of the information exchanged through the network; privacy, by processing the user's data locally.



Problem setting

1: Inputs: a set of arms K
2: Output: an arm for each player n
3: repeat
4:   a player n ~ P(z) is sampled
5:   player n gets the messages of the other players
6:   an action k ∈ K_n is played
7:   a reward y_k^n ∈ [0, 1] is received
8:   player n sends messages to the other players
9: until ∀n, |K_n| = 1



Privacy guarantee

We define the privacy level as the information about the preferred arms of a player that an adversary could infer by intercepting the messages of this player.

Definition 3 ((ε, η)-private).

The decentralized algorithm A is (ε, η)-private for finding an ε-approximation of the best arm if, for any player n, there exists η′, 0 < η′ < η < 1, such that an adversary that knows M_n, the set of messages of player n, and the algorithm A, can infer which arm is an ε-approximation of the best arm for player n with a probability of at least 1 − η′:

∀n ∈ N, P(K_n ⊆ K | M_n, A) ≥ 1 − η′.

1 − η is the confidence level associated with the decision of the adversary: the higher η, the higher the privacy protection.



Decentralized Elimination: the principle

Messages: the players exchange the indexes of the arms that they would like to eliminate, each with a high probability of failure η: the exchanged information cannot be exploited individually, but it makes it possible to take a collective decision.

Elimination of arms: the N players are independent, so when M ≤ N players want to eliminate an arm, each with a probability of failure η, the probability of failure of the group of M players is δ = η^M.
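A back-of-the-envelope sketch of this voting rule: with a per-player failure probability η, the number of independent votes needed to reach a group failure probability of at most δ is the smallest M such that η^M ≤ δ.

```python
import math

def votes_needed(eta, delta):
    """Smallest number M of independent eliminations, each failing with probability eta,
    such that the group failure probability eta**M is at most delta."""
    return math.ceil(math.log(delta) / math.log(eta))

# Example: eta = 0.3 per player, target delta = 0.01 for the group.
M = votes_needed(0.3, 0.01)          # -> 4, since 0.3**4 = 0.0081 <= 0.01
group_failure = 0.3 ** M
```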



Analysis of Decentralized Elimination

Theorem 1

DECENTRALIZED ELIMINATION is an (ε, η)-private algorithm that finds an ε-approximation of the best arm with a failure probability δ ≤ η^⌊log δ / log η⌋, and that exchanges at most N(K − 1)⌊log δ / log η⌋ + 1 messages.

Corollary 1

Using MEDIAN ELIMINATION as the ArmElimination subroutine, when the distribution of players is uniform, the number of samples in P_{x,y} needed to find with high probability an ε-approximation of the best arm is upper bounded by:

O( (K/ε²) log(1/δ) + √N (1/ε²) log(1/δ) ) samples in P_{x,y}.



Octopus

A generic architecture for reinforcement learning

1. The environment sends a context (for instance the profile of a customer).
2. The server chooses an arm (for instance a marketing campaign).
3. The environment sends the feedback (for instance a click).
4. The decision server updates its model.
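A minimal sketch of this loop from the decision server's point of view; the per-context epsilon-greedy model and all names below are illustrative assumptions, not the Octopus API.

```python
import random
from collections import defaultdict

class DecisionServer:
    """Toy online model: per-context epsilon-greedy over the arms (illustrative only)."""
    def __init__(self, n_arms, epsilon=0.1):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.counts = defaultdict(lambda: [0] * n_arms)
        self.sums = defaultdict(lambda: [0.0] * n_arms)

    def decide(self, context):
        # Steps 1-2: the environment sends a context, the server chooses an arm.
        if random.random() < self.epsilon or not any(self.counts[context]):
            return random.randrange(self.n_arms)
        return max(range(self.n_arms),
                   key=lambda a: self.sums[context][a] / max(1, self.counts[context][a]))

    def feedback(self, context, arm, reward):
        # Steps 3-4: the environment sends the feedback, the server updates its model.
        self.counts[context][arm] += 1
        self.sums[context][arm] += reward

server = DecisionServer(n_arms=3)
ctx = "young_urban_profile"
arm = server.decide(ctx)                # e.g. which marketing campaign to send
server.feedback(ctx, arm, reward=1.0)   # e.g. the customer clicked
```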



An open architecture based on two APIs

The management API defines the parameters of the decision problems (number of arms, format of contexts, range of rewards), administrates the servers (stop, start, reboot) and monitors the performance via a dashboard. The decision API handles the real-time queries (context, decision, feedback).



Where are the breaking points?

The sequential decision approach can handle partial information (global optimization of marketing campaigns) and dynamic environments (changes in customer behavior). No need to store billions of detailed logs: the decision model is built online from scratch, and the tuples (context, decision, feedback), which contain personal data, are never stored. The sequential decision server favors privacy: only the decision model is stored, to choose the decisions and to feed the dashboard.



References

Introduction

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, 2017 (new edition).
P. Auer, N. Cesa-Bianchi, P. Fischer: Finite-time Analysis of the Multiarmed Bandit Problem, Machine Learning, 2002.
E. Even-Dar, S. Mannor, Y. Mansour: Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems, JMLR, 2006.



Use Cases

R. Kerkouche, R. Alami, R. Féraud, N. Varsier, P. Maillé: Node-based optimization of LoRa transmissions with Multi-Armed Bandit algorithms, ICT, 2018.
H. Dakdouk, E. Tarazona, R. Alami, R. Féraud, G. Z. Papadopoulos, P. Maillé: Reinforcement Learning Techniques for Optimized Channel Hopping in IEEE 802.15.4-TSCH Networks, ACM MSWiM, 2018.
R. Bonnefoi, L. Besson, C. Moy, E. Kaufmann, J. Palicot: Multi-Armed Bandit Learning in IoT Networks: Learning helps even in non-stationary settings, CROWNCOM, 2017.



Non-stationary Bandits

P. Auer, N. Cesa-Bianchi, Y. Freund, R. E. Schapire: The nonstochastic multiarmed bandit problem, SIAM J. Comput., 2002.
R. Allesiardo, R. Féraud: EXP3 with Drift Detection for the Switching Bandit Problem, DSAA, 2015.
R. Allesiardo, R. Féraud, O. Maillard: The non-stationary stochastic multi-armed bandit problem, IJDSA, 2017.
R. Alami, O. Maillard, R. Féraud: Switching Thompson sampling with Bayesian aggregation, workshop BayesOpt, NIPS, 2017.
R. P. Adams, D. MacKay: Bayesian online changepoint detection, arXiv preprint, 2007.
D. Bouneffouf, R. Féraud: Multi-Armed Bandits with Known Trend, Neurocomputing, 2016.



Contextual and Structured Bandits

J. Langford, T. Zhang: The Epoch-Greedy Algorithm for Contextual Multi-armed Bandits, NIPS, 2007.
L. Li, W. Chu, J. Langford, R. E. Schapire: A Contextual-Bandit Approach to Personalized News Article Recommendation, WWW, 2010.
Y. Abbasi-Yadkori, D. Pál, C. Szepesvári: Improved Algorithms for Linear Stochastic Bandits, NIPS, 2011.
R. Féraud, R. Allesiardo, T. Urvoy, F. Clérot: Random Forest for the Contextual Bandit Problem, AISTATS, 2016.
R. Allesiardo, R. Féraud, D. Bouneffouf: A Neural Networks Committee for the Contextual Bandit Problem, ICONIP, 2014.
D. Bouneffouf, I. Rish, G. A. Cecchi, R. Féraud: Context Attentive Bandits: Contextual Bandit with Restricted Context, IJCAI, 2017.



Decentralized Exploration

E. Hillel, Z. Karnin, T. Koren, R. Lempel, O. Somekh: Distributed Exploration in Multi-Armed Bandits, NIPS, 2013.
P. Gajane, T. Urvoy, E. Kaufmann: Corrupt Bandits for Preserving Local Privacy, ALT, 2018.
R. Féraud, R. Alami, R. Laroche: Decentralized Exploration in Multi-Armed Bandits, Tech. Report, 2018.
