Reinforcement Learning, Multi-Armed Bandits, and A/B testing
Raphaël Féraud OLNC/OLPS/DIESE/DIA/PROF
October 2018
Introduction
Outline
1. Introduction
2. Use cases
3. Non-stationary Bandits
4. Contextual and Structured Bandits
5. Decentralized Exploration
6. Octopus
7. References
Reinforcement Learning
A generic problem: reinforcement learning
The agent: observes the environment, executes an action, receives a reward.
The environment: changes its internal state due to the action, emits a reward.
Goal: maximize the long-term rewards of the agent.
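To make the loop concrete, here is a minimal sketch of this agent/environment interaction (a toy illustration with made-up dynamics, not a specific library):

```python
import random

class Environment:
    """Toy stochastic environment: its internal state drifts towards the
    action and it emits a noisy reward."""
    def __init__(self):
        self.state = 0.0

    def step(self, action):
        self.state += 0.1 * (action - self.state)   # state changes due to the action
        return self.state + random.gauss(0.0, 0.1)  # reward emitted to the agent

class Agent:
    """Toy agent: observes the environment, executes an action, receives a reward."""
    def act(self, observation):
        return observation + random.choice([-1.0, 1.0])

    def learn(self, reward):
        pass  # a real agent would update its behavior here

env, agent = Environment(), Agent()
total_reward = 0.0
for _ in range(100):
    observation = env.state           # the agent observes the environment
    action = agent.act(observation)   # ... executes an action
    reward = env.step(action)         # the environment changes state and emits a reward
    agent.learn(reward)               # the agent receives the reward
    total_reward += reward            # goal: maximize the long-term rewards
```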
The environment can be: deterministic or stochastic, adversarial or not, fully observable or partially observable.
The agent can be: model-based, when the agent has a model of the environment (planning), or model-free, when the environment is initially unknown (reinforcement learning).
Main features of reinforcement learning:
- There is no supervisor; only the reward of the played action is revealed to the agent.
- The environment is initially unknown: the agent has to interact with it to gather information.
- The agent's actions affect the subsequent data it receives.
- The reward is delayed.
Multi-armed Bandits
The multi-armed bandit problem
Main features of the multi-armed bandit problem:
- There is no supervisor; only the reward of the played action is revealed.
- The environment is initially unknown: the agent has to interact with it to gather information.
- The played action does not affect the state of the environment.
Exploration / exploitation dilemma:
- Exploration: the agent plays a loosely estimated action in order to build a better estimate.
- Exploitation: the agent plays the best estimated action in order to maximize her cumulative reward.
Upper Confidence Bound
A smart solution, the optimism in the face of uncertainty principle: play the arm with the highest index = estimated mean reward + confidence interval.
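For instance, a minimal UCB1 sketch (following the index of Auer et al. 2002, with rewards assumed in [0, 1]):

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """Minimal UCB1 sketch: play each arm once, then always play the arm with
    the highest index (empirical mean reward + confidence interval)."""
    counts = [0] * n_arms      # number of pulls per arm
    sums = [0.0] * n_arms      # cumulated reward per arm
    for t in range(1, horizon + 1):
        if t <= n_arms:        # initialization: play each arm once
            k = t - 1
        else:
            k = max(range(n_arms),
                    key=lambda a: sums[a] / counts[a]
                    + math.sqrt(2.0 * math.log(t) / counts[a]))
        reward = pull(k)       # reward in [0, 1]
        counts[k] += 1
        sums[k] += reward
    return counts, sums

# usage: three Bernoulli arms with unknown means
means = [0.2, 0.5, 0.7]
counts, sums = ucb1(lambda k: float(random.random() < means[k]), n_arms=3, horizon=10000)
```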
Best Arm Identification
A/B testing
A/B testing: in order to test the efficiency of a new treatment A compared to a treatment B, the physician must set in advance the size of the patient sample according to the gap (i.e. the minimum efficiency difference to be measured between A and B) and the expected confidence level.
Sequential A/B testing: in continuous development, for marketing campaigns, to blacklist frequency bands..., A/B testing is done sequentially and stops when the minimum efficiency difference is measured at the expected confidence level. In practice, sequential A/B testing takes the observations into account to stop the test: a smaller sample, and no more need to know the gap between A and B in advance.
Sequential A/B testing: principle
1: Inputs: two options (arms) A and B, a confidence level 1 − δ, an approximation factor ε
2: Output: the best option (arm)
3: repeat
4:   arms A and B are sampled
5: until P(r_A ≤ r_B) ≤ δ
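A minimal sketch of this stopping rule with anytime Hoeffding confidence bounds (Bernoulli rewards assumed; with no approximation factor the test may run for a long time when A and B are nearly identical):

```python
import math
import random

def sequential_ab_test(pull_a, pull_b, delta=0.05):
    """Minimal sequential A/B test sketch: sample A and B equally and stop
    when their anytime confidence intervals separate."""
    sum_a = sum_b = 0.0
    t = 0
    while True:
        t += 1
        sum_a += pull_a()  # reward of option A in [0, 1]
        sum_b += pull_b()  # reward of option B in [0, 1]
        # anytime Hoeffding radius (union bound over time steps)
        radius = math.sqrt(math.log(4.0 * t * t / delta) / (2.0 * t))
        if abs(sum_a - sum_b) / t > 2.0 * radius:
            return "A" if sum_a > sum_b else "B"

# usage with two Bernoulli options
winner = sequential_ab_test(lambda: float(random.random() < 0.55),
                            lambda: float(random.random() < 0.45))
```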
Sequential Multivariate testing: Best Arm Identification
1: Inputs: a set of arms K, a confidence level 1 − δ, an approximation factor ε
2: Output: the best option (arm)
3: repeat
4:   a remaining arm k ∈ K is sampled
5:   let k' be the arm with the best estimated reward
6:   if the upper confidence bound of r_k is lower than the lower confidence bound of r_k', then remove k from K
7: until |K| = 1
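A sketch of this elimination loop, using Hoeffding confidence bounds as one simple choice of upper and lower confidence bounds (the exact bounds may differ from the original algorithm):

```python
import math
import random

def successive_elimination(pull, n_arms, delta=0.05):
    """Minimal successive elimination sketch: sample every remaining arm,
    then discard any arm whose upper confidence bound is below the lower
    confidence bound of the best estimated arm."""
    remaining = set(range(n_arms))
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    t = 0
    while len(remaining) > 1:
        t += 1
        for k in remaining:
            sums[k] += pull(k)   # reward in [0, 1]
            counts[k] += 1
        radius = math.sqrt(math.log(4.0 * n_arms * t * t / delta) / (2.0 * t))
        means = {k: sums[k] / counts[k] for k in remaining}
        best = max(means, key=means.get)
        remaining = {k for k in remaining
                     if means[k] + radius >= means[best] - radius}
    return remaining.pop()

best_arm = successive_elimination(lambda k: float(random.random() < [0.3, 0.5, 0.7][k]),
                                  n_arms=3)
```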
Best Arm Identification versus Multi-Armed Bandits
[Figure: clickthrough rate versus time — the bandit gain curve dominates the A/B testing gain curve, which splits into an A/B testing phase (test A and B equally) followed by a deployment phase (play the best option).]
The expected gain of Multi-Armed Bandits algorithms is higher than that of Best Arm Identification algorithms. Best Arm Identification algorithms are used when there is a maintenance cost associated with each action: continuous development, clinical trials, phone marketing campaigns...
Use cases
Internet Of Things
Optimization of LoRa transmissions
Reinforcement learning: the device chooses the parameters for the next LoRa transmission; then the environment, that is the gateways, the other devices, the weather..., generates a reward that is used by the device to optimize the choice of parameters.
Experimental setting
Simulator: a realistic simulator (Varsier and Schwoerer 2017), which implements propagation models, shadowing and fast fading, collision rules and retransmissions, is used for the experiments.
One versus all: a single indoor gateway with 100 nodes is considered: 99 nodes, randomly positioned within a radius of 2 km, use the Adaptive Data Rate (ADR) algorithm; 1 node uses bandit algorithms at different distances from the gateway: 592, 1000 and 1972 meters.
Scenario:
1. 1000 packets are sent by the optimized node. Due to the use of ADR on the 99 other nodes, the mean reward of each arm evolves over time.
2. After sending 500 packets, the node moves from 592 meters to 1975 meters from the gateway. The mean rewards of the arms abruptly change.
Results
Comparison with ADR: the ADR algorithm, which does not handle mobile nodes, is clearly dominated by MAB algorithms.
Multi-Armed Bandits: Switching Thompson Sampling with Bayesian Aggregation is the best-performing algorithm. Surprisingly, Thompson Sampling performs as well as STS and SWUCB, which are designed for switching environments.
Sensor Networks
Scheduling process in IEEE 802.15.4 TSCH network
TSCH: Time Slotted Channel Hopping. Communication is coordinated by a scheduler (time slot × set of channels), which avoids collisions. A node sends a packet or receives an ack: the reward is the success of the transmission. Interference issues arise on the license-free bands.
Multi-Armed Bandit algorithms for TSCH-compliant channel selection
The setting
The available channels are restricted to a whitelist for each node. Multi-Armed Bandits are used to choose the best available channel.
[Figure: TSCH with MAB — the chosen channel offset is mapped to the channel to transmit via (30 + offset) % 16.]
Performance gain
- Varying the size of the whitelist from 1 to 16.
- Switching the Packet Data Rate of the channels (a switch every 50, 100 and 200 packets).
- Comparing the performances of Thompson Sampling (TS), Switching Thompson Sampling with Bayesian Aggregation (STSBA), and standard TSCH.
Best algorithm: Switching Thompson Sampling with Bayesian Aggregation.
Marketing campaign optimization
The marketing campaign optimization problem
We would like to promote services to our customers: which message, which service, will obtain the highest subscription rate?
Moreover, our customers have different profiles: for a given profile, which message, which service, shall we propose?
Simulation on marketing campaigns of the year 2013
In a simulation run on 1.5 million customers, based on 16 million contacts from the year 2013, the number of clicks generated by 18 emailing campaigns is increased by a factor of 2.
Optimization of routing policies
Compagnon de voyage (travel companion): the promise
On my usual route, an incident occurred this morning. The application alerts me on my mobile and proposes a personalized alternative. When I search for a route, the travel companion application proposes an optimal route according to the context and to my preferences.
Functional architecture of the service: the Octopus server optimizes the choice of the multi-criteria route search policy according to the observed context. The reward is observed via a like, or when the proposed route is chosen.
Advertising optimization
Octopus is one of the complementary modules that could be used to monetize the audience: the action is the choice of the displayed ad, the reward is the click, and the context describes the content of the page and the profile of the cookie.
Other use cases
- Online relevance feedback: the optimization of search and recommendation interfaces is an online process where the feedback is strongly biased by the interface itself.
- Online customer experience optimization: knowing the customer journey and the customer profile, optimizing the next best action (marketing, after-sale services, customer care...) improves the customer experience.
- Autonomous terminals: complex and interconnected terminals, such as a LiveBox, have to take online decisions in order to configure themselves, to ensure self-care or security...
- Self-organizing networks: depending on the observed network state, the network sequentially optimizes itself.
- Healthcare: back to the origin of bandit algorithms, a domotic system has to choose actions (send an alert, recall medication...) depending on the available context (physiological measurements, cameras, microphones...).
Non-stationary Bandits
Adversarial Bandits
Main features of adversarial bandits:
- An adversary, which knows the algorithm of the player, has generated a deterministic sequence of rewards for each action.
- The player has to randomize the choice of arms.
- The player has to explore continuously over time.
- The time horizon is known.
- The player competes against the best arm, or the best sequence of arms, of the run.
Exp3: sample an arm according to a mixture of the uniform distribution and a probability mass exponential in the estimated reward of each arm.
In practice, adversarial bandit algorithms (Exp3, Exp3.S...) are very conservative.
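A minimal Exp3 sketch (assuming rewards in [0, 1] and a fixed exploration rate γ):

```python
import math
import random

def exp3(pull, n_arms, horizon, gamma=0.1):
    """Minimal Exp3 sketch: exponential weights mixed with uniform exploration,
    updated with importance-weighted reward estimates."""
    weights = [1.0] * n_arms
    for _ in range(horizon):
        total = sum(weights)
        probs = [(1.0 - gamma) * w / total + gamma / n_arms for w in weights]
        k = random.choices(range(n_arms), weights=probs)[0]
        reward = pull(k)                # reward in [0, 1]
        estimate = reward / probs[k]    # unbiased estimate of the reward of arm k
        weights[k] *= math.exp(gamma * estimate / n_arms)
    return weights
```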
October 2018
28 / 53
Non-stationary Bandits
Switching Bandit Problem
The mean reward of each arm evolves as
$$\mu_{k,t} = \begin{cases} \mu_{k,t-1} & \text{with probability } 1-\rho, \\ \mu_{\mathrm{new}} \sim \mathcal{U}(0,1) & \text{with probability } \rho, \end{cases}$$
where ρ is the switching rate.
In the switching bandit problem, the mean rewards of arms abruptly change over time. The run can be divided into M segments of different means, and into N segments of different best arms (M ≥ N).
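This switching environment is straightforward to simulate; a minimal sketch:

```python
import random

def switching_means(n_arms, horizon, rho=0.001):
    """Generate mean rewards following the switching model above: each arm
    keeps its mean with probability 1 - rho, and redraws it uniformly in
    [0, 1] with probability rho."""
    means = [random.random() for _ in range(n_arms)]
    trajectory = []
    for _ in range(horizon):
        means = [random.random() if random.random() < rho else m for m in means]
        trajectory.append(list(means))
    return trajectory   # trajectory[t][k] is the mean reward of arm k at time t
```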
First solutions
Sliding Window UCB: UCB is run in a sliding window: choose the arm with the highest index (estimated mean plus confidence interval), where the index is evaluated on the last part of the history.
Exp3.R: Exp3 is run: sample an arm according to a mixture of the uniform distribution and a probability mass exponential in the estimated mean reward of arms. The uniform exploration of Exp3 is used in a Hoeffding-based change point detector. Exp3 is reset when a change is detected.
Drawback of these approaches: limited memory, due to the reset of estimators or the cut of the history.
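A minimal Sliding Window UCB sketch (recomputing the statistics on the window at each step, for clarity rather than efficiency):

```python
import math
from collections import deque

def sliding_window_ucb(pull, n_arms, horizon, window=500):
    """Minimal Sliding Window UCB sketch: the index of each arm is computed
    only on the last `window` observations."""
    history = deque()                      # (arm, reward) pairs, oldest first
    for t in range(1, horizon + 1):
        counts = [0] * n_arms
        sums = [0.0] * n_arms
        for arm, reward in history:        # statistics restricted to the window
            counts[arm] += 1
            sums[arm] += reward
        if 0 in counts:                    # play any arm unseen in the window
            k = counts.index(0)
        else:
            k = max(range(n_arms),
                    key=lambda a: sums[a] / counts[a]
                    + math.sqrt(2.0 * math.log(min(t, window)) / counts[a]))
        history.append((k, pull(k)))       # reward in [0, 1]
        if len(history) > window:
            history.popleft()              # forget observations older than the window
```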
Memory Bandits
Main features:
- Growing number of experts.
- Each expert is a Thompson Sampling algorithm.
- Bayesian change point detector.
- Bayesian aggregation to choose the best arm.
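Each expert can be sketched as a standard Bernoulli Thompson Sampling learner (Beta priors; an illustration of the base learner only, not of the full Memory Bandits algorithm):

```python
import random

def thompson_sampling(pull, n_arms, horizon):
    """Minimal Bernoulli Thompson Sampling sketch: Beta(1, 1) priors; sample a
    mean for each arm and play the arm with the highest sampled mean."""
    alphas = [1] * n_arms   # successes + 1
    betas = [1] * n_arms    # failures + 1
    for _ in range(horizon):
        samples = [random.betavariate(alphas[k], betas[k]) for k in range(n_arms)]
        k = max(range(n_arms), key=lambda a: samples[a])
        if pull(k) > 0.5:   # binary reward
            alphas[k] += 1
        else:
            betas[k] += 1
    return alphas, betas
```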
Non-stationary Best Arm Identification
Problem setting: an adversary has chosen a sequence of mean rewards for each action: a generalization of adversarial bandits.
Assumption 1: positive mean gap.
- SER3: Successive Elimination with a random choice of remaining actions, when Assumption 1 holds.
- SER4: Successive Elimination with random reset and random choice of remaining actions, for any sequence of best arms (Assumption 1 does not hold).
Contextual and Structured Bandits
Main features:
- There is no supervisor; only the reward of the played action is revealed.
- The environment is initially unknown: the agent has to interact with it to gather information.
- Before choosing the action, the player observes: a description of each action (structured bandits), or a random variable (contextual bandits).
Bandit Forest
Combines:
- Best arm identification: exploration/exploitation tradeoff.
- An online-built decision tree: to fit the profiles.
- Random Forest: to optimize the performances and to tune the parameters.
Which option is best according to the profile? For instance, Option A for most people, Option B for rugby lovers.
Bandit Forest:
- tree: randomized decision tree
- node: context variable
- edge: variable value
- leaf: arm
The worst-case analysis of Bandit Forest shows that the number of time steps needed to find the best random forest with high probability is optimal up to logarithmic factors.
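As a toy illustration of the tree-of-bandits idea (a fixed decision stump on one binary context variable, with an independent bandit at each leaf; this is not the actual Bandit Forest algorithm, which also learns the tree structure online):

```python
import math

class LeafBandit:
    """One UCB-style bandit per leaf of the (here fixed) context tree."""
    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.sums = [0.0] * n_arms
        self.t = 0

    def select(self):
        self.t += 1
        for k, c in enumerate(self.counts):
            if c == 0:
                return k                    # play each arm once
        return max(range(len(self.counts)),
                   key=lambda k: self.sums[k] / self.counts[k]
                   + math.sqrt(2.0 * math.log(self.t) / self.counts[k]))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward

# node: the binary context variable "rugby lover"; leaves: one bandit over the options
leaves = {0: LeafBandit(n_arms=2), 1: LeafBandit(n_arms=2)}

def choose_option(rugby_lover):
    return leaves[rugby_lover].select()

def give_feedback(rugby_lover, option, click):
    leaves[rugby_lover].update(option, click)
```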
Some experimental results
[Figures: results on the Forest Cover Type and Census datasets.]
Key benefits:
- Bandit Forest versus LinUCB: non-linearity.
- Bandit Forest versus NeuralBandit: theoretical guarantees.
Decentralized Exploration
Decentralized best arm identification (i.e. decentralized exploration) offers several advantages:
- Scalability, thanks to parallel processing.
- Reduction of costs, thanks to the removal of a centralized processing architecture.
- Increased responsiveness of applications, thanks to the minimization of the information exchanged through the network.
- Privacy, by locally processing the user's data.
Problem setting
1: Inputs: a set of arms K
2: Output: an arm for each player n
3: repeat
4:   a player n is sampled (drawn from the distribution of players)
5:   player n gets the messages of the other players
6:   an action k ∈ K_n is played
7:   a reward y_k^n ∈ [0, 1] is received
8:   player n sends messages to the other players
9: until ∀n, |K_n| = 1
Privacy guarantee
We define the privacy level as the information about the preferred arms of a player that an adversary could infer by intercepting the messages of this player.
Definition 3 ((ε, η)-private).
The decentralized algorithm A is (ε, η)-private for finding an ε-approximation of the best arm if, for any player n, there does not exist η', 0 < η' < η < 1, such that an adversary, who knows M_n, the set of messages of player n, and the algorithm A, can infer which arm is an ε-approximation of the best arm for player n with a probability at least 1 − η':
$$\forall n \in N, \quad P\left(K_n \subset \mathcal{K}_\epsilon \mid M_n, A\right) \geq 1 - \eta'.$$
1 − η is the confidence level associated with the decision of the adversary: the higher η, the higher the privacy protection.
Decentralized Elimination: the principle
Messages: the players exchange the indexes of the arms that they would like to eliminate, with a high probability of failure η: the information exchanged cannot be exploited individually, but it makes it possible to take a collective decision.
Elimination of arms: the N players are independent, so when M ≤ N players want to eliminate an arm, each with a probability of failure η, the probability of failure of the group of M players is δ = η^M.
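As a numerical illustration of this rule: with η = 0.3 and δ = 0.01, M = ⌈log δ / log η⌉ = ⌈3.83⌉ = 4 players wanting to eliminate an arm are enough, since η^M = 0.3^4 ≈ 0.008 ≤ 0.01 = δ.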
Analysis of Decentralized Elimination
Theorem 1
Decentralized Elimination is an (ε, η)-private algorithm that finds an ε-approximation of the best arm with a failure probability at most $\eta^{\lceil \log\delta / \log\eta \rceil} \leq \delta$, and that exchanges at most $N(K-1)\left(\lceil \log\delta / \log\eta \rceil + 1\right)$ messages.
Corollary 1
Using Median Elimination as the ArmElimination subroutine, when the distribution of players is uniform, the number of samples needed to find with high probability an ε-approximation of the best arm is upper bounded by
$$O\left(\frac{K}{\epsilon^2}\log\frac{1}{\delta} + \sqrt{N\,\frac{1}{\epsilon^2}\log\frac{1}{\delta}}\right) \text{ samples in } P_{x,y}.$$
Octopus
A generic architecture for reinforcement learning
1. The environment sends a context (for instance the profile of a customer).
2. The server chooses an arm (for instance a marketing campaign).
3. The environment sends the feedback (for instance a click).
4. The decision server updates its model.
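A minimal sketch of this context/decision/feedback loop (the DecisionServer interface and names below are illustrative only, not the actual Octopus API):

```python
import random
from collections import defaultdict

class DecisionServer:
    """Illustrative decision server: a per-profile epsilon-greedy bandit."""
    def __init__(self, arms, epsilon=0.1):
        self.arms = arms
        self.epsilon = epsilon
        self.counts = defaultdict(lambda: defaultdict(int))    # profile -> arm -> pulls
        self.sums = defaultdict(lambda: defaultdict(float))    # profile -> arm -> rewards

    def decide(self, context):
        profile = context["profile"]
        if random.random() < self.epsilon:
            return random.choice(self.arms)                    # exploration
        return max(self.arms,
                   key=lambda a: self.sums[profile][a] / self.counts[profile][a]
                   if self.counts[profile][a] else float("inf"))

    def feedback(self, context, arm, reward):
        profile = context["profile"]
        self.counts[profile][arm] += 1
        self.sums[profile][arm] += reward

# one round of the loop above
server = DecisionServer(arms=["campaign_A", "campaign_B"])
context = {"profile": "young_urban"}        # 1. the environment sends a context
arm = server.decide(context)                # 2. the server chooses an arm
click = 1.0                                 # 3. the environment sends the feedback
server.feedback(context, arm, click)        # 4. the decision server updates its model
```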
An open architecture based on two APIs
The management API defines the parameters of the decision problems (number of arms, format of contexts, range of rewards), administers the servers (stop, start, reboot), and monitors the performances via a dashboard. The decision API handles the real-time queries (context, decision, feedback).
Where are the breaking points?
The sequential decision approach can handle partial information (global optimization of marketing campaigns) and dynamic environments (changes of customer behavior).
No need to store billions of detailed logs: the decision model is built online from scratch; the tuples (context, decision, feedback), which contain personal data, are never stored.
The sequential decision server favors privacy: only the decision model is stored, to choose the decisions and to feed the dashboard.
References
Introduction
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, 2017 (new edition).
P. Auer, N. Cesa-Bianchi, P. Fischer: Finite-time Analysis of the Multiarmed Bandit Problem, Machine Learning, 2002.
E. Even-Dar, S. Mannor, Y. Mansour: Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems, JMLR, 2006.
Use Cases
R. Kerkouche, R. Alami, R. Féraud, N. Varsier, P. Maillé: Node-based optimization of LoRa transmissions with Multi-Armed Bandit algorithms, ICT, 2018.
H. Dakdouk, E. Tarazona, R. Alami, R. Féraud, G. Z. Papadopoulos, P. Maillé: Reinforcement Learning Techniques for Optimized Channel Hopping in IEEE 802.15.4-TSCH Networks, ACM MSWiM, 2018.
R. Bonnefoi, L. Besson, C. Moy, E. Kaufmann, J. Palicot: Multi-Armed Bandit Learning in IoT Networks: Learning helps even in non-stationary settings, CROWNCOM, 2017.
Non-stationary Bandits
P. Auer, N. Cesa-Bianchi, Y. Freund, R. E. Schapire: The nonstochastic multiarmed bandit problem, SIAM J. Comput., 2002.
R. Allesiardo, R. Féraud: EXP3 with Drift Detection for the Switching Bandit Problem, DSAA, 2015.
R. Allesiardo, R. Féraud, O. Maillard: The non-stationary stochastic multi-armed bandit problem, IJDSA, 2017.
R. Alami, O. Maillard, R. Féraud: Switching Thompson sampling with Bayesian aggregation, workshop BayesOpt, NIPS, 2017.
R. P. Adams, D. MacKay: Bayesian online changepoint detection, arXiv preprint, 2007.
D. Bouneffouf, R. Féraud: Multi-Armed Bandits with Known Trend, Neurocomputing, 2016.
Contextual and Structured Bandits
J. Langford, T. Zhang: The Epoch-Greedy Algorithm for Contextual Multi-armed Bandits, NIPS, 2007.
L. Li, W. Chu, J. Langford, R. E. Schapire: A Contextual-Bandit Approach to Personalized News Article Recommendation, WWW, 2010.
Y. Abbasi-Yadkori, D. Pál, C. Szepesvári: Improved Algorithms for Linear Stochastic Bandits, NIPS, 2011.
R. Féraud, R. Allesiardo, T. Urvoy, F. Clérot: Random Forest for the Contextual Bandit Problem, AISTATS, 2016.
R. Allesiardo, R. Féraud, D. Bouneffouf: A Neural Networks Committee for the Contextual Bandit Problem, ICONIP, 2014.
D. Bouneffouf, I. Rish, G. A. Cecchi, R. Féraud: Context Attentive Bandits: Contextual Bandit with Restricted Context, IJCAI, 2017.
Decentralized Exploration
E. Hillel, Z. Karnin, T. Koren, R. Lempel, O. Somekh: Distributed Exploration in Multi-Armed Bandits, NIPS, 2013.
P. Gajane, T. Urvoy, E. Kaufmann: Corrupt Bandits for Preserving Local Privacy, ALT, 2018.
R. Féraud, R. Alami, R. Laroche: Decentralized Exploration in Multi-Armed Bandits, Tech. Report, 2018.