Optimization of online decisions
Raphaël Féraud, Tanguy Urvoy
Orange Labs – OLNC/OLPS/UCE/CRM-DA/PROF
LRI, November 6th, 2013
R. Féraud and T. Urvoy (Orange labs)
Optimization of online decisions
November 2013
1 / 32
Introduction : big-data, fast-data
Outline
1. Introduction : big-data, fast-data
2. Online decisions for advertising optimization
3. Online decisions for marketing optimization
4. Online decisions : other use cases
5. Online decisions : some selected references
Introduction : in praise of fast-data
Two complementary approaches :
- big-data : batch processing, data mining, fundamental trends;
- fast-data : stream processing, stream mining, contextual decisions, interaction.

"Fast Data Gets A Jump On Big Data" [Oracle 2013]. The case for fast-data :
- the online decision is often directly the problem we would like to solve;
- simpler, and therefore often easier to deploy and less resource-consuming;
- to sum up : smarter in a finite world.
Online decisions for advertising optimization
Advertising optimization : context
The three most common ways in which online advertising is purchased :
- CPM (Cost per Mille) : advertisers pay for exposure of their message to a specific audience;
- CPC (Cost per Click) : advertisers pay each time a user clicks on their ad and is redirected to their website;
- CPA (Cost per Action) : advertisers pay only for the number of users who, once redirected to their website, complete a transaction such as a purchase or sign-up.

We focus here on CPC.
Advertising optimization : functional architecture
Online decision optimization is one of several complementary modules used to monetize the audience.
Online decisions for advertising optimization : stochastic formulation
Let :
- $x_t$ be a context vector (page × profile),
- $A$ be a set of $K$ decisions (display a banner $k$),
- $y_k(t)$ be the reward of decision $k$ at time $t$ (click on the banner),
- $y_t \in [0,1]^K$ be the vector of bounded rewards at time $t$,
- $\Pi$ be the set of policies $\pi : X \to A$ (ad-server policies).

We would like to find the ad-server policy $\pi$ maximizing the number of clicks :

repeat
  $(x_t, y_t)$ is drawn according to $D_{x,y}$
  The player chooses a decision $k = \pi(x_t)$
  The reward $y_k(t)$ is revealed
until $t < T$
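As a minimal Python sketch of this interaction protocol (the environment, the click probabilities, and all function names below are illustrative assumptions, not from the talk), one can compare the context-aware optimal policy against a uniformly random one:

```python
import random

random.seed(0)

def run_protocol(policy, T, draw_context_reward):
    """Simulate the stochastic protocol: at each step a (context, reward
    vector) pair is drawn i.i.d., the policy picks a decision, and only
    the reward of that decision is revealed (bandit feedback)."""
    total = 0.0
    for _ in range(T):
        x_t, y_t = draw_context_reward()   # (x_t, y_t) ~ D_{x,y}
        k = policy(x_t)                    # k = pi(x_t)
        total += y_t[k]                    # only y_k(t) is revealed
    return total

# Toy environment (illustrative values): 2 contexts, 3 banners,
# click probabilities depending on the context.
probs = {0: [0.1, 0.5, 0.2], 1: [0.4, 0.1, 0.3]}

def draw_context_reward():
    x = random.randrange(2)
    y = [1.0 if random.random() < p else 0.0 for p in probs[x]]
    return x, y

best = run_protocol(lambda x: max(range(3), key=lambda k: probs[x][k]),
                    10_000, draw_context_reward)
uniform = run_protocol(lambda x: random.randrange(3), 10_000,
                       draw_context_reward)
```

The gap between `best` and `uniform` is exactly what an ad-server policy has to close online, without knowing `probs` in advance.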
The Multi-Armed Bandits for online advertising optimization
Exploration / exploitation tradeoff :
- exploration to find the best arm,
- exploitation of the best arm found so far.

The UCB policy is based on the optimism-in-face-of-uncertainty principle : it plays the arm with the highest upper confidence bound.

Allocate one MAB per page × profile :

repeat
  for all arms $k$ do
    $B_k(t) = \frac{1}{n_k(t)} \sum_{p=1}^{n_k(t)} y_k(p) + \sqrt{\frac{2 \log t}{n_k(t)}}$
  end for
  Play $k_t = \arg\max_k B_k(t)$
  Receive $y_{k_t}(t)$
until $t < T$
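The UCB1 loop above can be sketched in a few lines of Python (a self-contained sketch: the three Bernoulli "banners" and their click rates are illustrative assumptions):

```python
import math, random

def ucb1(reward_fns, T):
    """UCB1 (Auer et al. 2002): play each arm once, then always play the
    arm maximizing empirical mean + sqrt(2 log t / n_k)."""
    K = len(reward_fns)
    n = [0] * K          # n_k(t): number of plays of arm k
    s = [0.0] * K        # cumulative reward of arm k
    gain = 0.0
    for t in range(1, T + 1):
        if t <= K:
            k = t - 1    # initialisation: play each arm once
        else:
            k = max(range(K), key=lambda i:
                    s[i] / n[i] + math.sqrt(2 * math.log(t) / n[i]))
        y = reward_fns[k]()      # only the chosen banner's reward is seen
        n[k] += 1
        s[k] += y
        gain += y
    return gain, n

random.seed(1)
# Three Bernoulli "banners" with click rates 0.1, 0.3, 0.05 (illustrative).
arms = [lambda p=p: 1.0 if random.random() < p else 0.0
        for p in (0.1, 0.3, 0.05)]
gain, plays = ucb1(arms, 5000)
```

After 5000 rounds, the play counts concentrate on the best banner while the confidence term keeps a logarithmic amount of exploration on the others.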
Some results for a given web page
[Figures : gain, and regret against the best remaining ad]

Experimental setting :
- Logs from one hundred ads, displayed to 1/1000 of the users, were collected during ten days.
- Total of 2 500 000 displays and 11 000 clicks.
- Finite inventory for each ad, corresponding here to its number of displays during the collecting period.

The use of the UCB policy can increase the click rate by 20%.
Online decisions for marketing optimization
Emailing campaign optimization : the problem
We would like to promote a web service to our customers : which message will obtain the highest subscription rate ?
Moreover, our customers have different profiles, which can influence the subscription rate : which profile for which message ?
Online decisions for emailing campaign optimization : stochastic formulation

The Scratch Games problem is a variant of the MAB problem. Let :
- $x_t$ be a context vector (profile of the customer),
- $A$ be a set of $K$ decisions (send a message $k$),
- $y_t \in [0,1]^K$ be the vector of bounded rewards at time $t$,
- $y_k(t)$ be the reward of message $k$ at time $t$ (click on the enclosed link),
- $\Pi$ be the set of policies $\pi : X \to A$ (emailing policies).

We would like to find the emailing policy $\pi$ maximizing the number of clicks :

$\{(x_1, y_1), ..., (x_t, y_t), ..., (x_T, y_T)\}$ are drawn according to $D_{x,y}$
The set $\{x_1, ..., x_t, ..., x_T\}$ is revealed
repeat
  The player chooses a context $x_t$
  The player chooses a decision $k = \pi(x_t)$
  The reward $y_k(t)$ is revealed
until $t < T$
What is the breaking point ?

The contexts, which are the profiles of known customers, are drawn in advance. We assume that the rewards are also drawn in advance. This assumption corresponds to the sequential design of non-reproducible experiments : each customer can be reached only once per marketing campaign. A scratch game is functionally equivalent to an urn where the draws are performed without replacement.

In the multi-armed bandit setting, to maximize his gain the player has to find the best game as soon as possible and then exploit it. In the scratch game setting, when the player has found the best game, he knows that this game will expire : he needs to re-explore before the best game expires, in order to find the next best game. The usual tradeoff between exploration and exploitation has to be revisited.
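The urn equivalence above can be made concrete in a few lines (a sketch; the ticket counts and the `scratch_game` name are our own illustrative choices):

```python
import random

def scratch_game(n_winning, n_total, seed=0):
    """A scratch game modeled as an urn: a fixed, pre-shuffled list of
    tickets, drawn without replacement; once empty, the game has expired."""
    tickets = [1.0] * n_winning + [0.0] * (n_total - n_winning)
    random.Random(seed).shuffle(tickets)
    return tickets

game = scratch_game(30, 100)
# Playing to exhaustion reveals exactly the 30 winning tickets; after that
# no reward remains -- unlike a Bernoulli arm, which never expires.
total = sum(game.pop() for _ in range(100))
```

This finiteness is the breaking point: exploiting the best urn consumes it, so the player must re-explore before it runs dry.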
UCB Without Replacement
Allocate one scratch game per message × profile (the number of games equals the number of messages times the number of profiles) :

repeat
  for all arms $k \in [K_t]$ do
    $B_k(t) = \frac{1}{n_k(t)} \sum_{p=1}^{n_k(t)} y_k(p) + \sqrt{\left(1 - \frac{n_k(t)-1}{N_k}\right) \frac{2 \log t}{n_k(t)}}$
  end for
  Play $k_t = \arg\max_k B_k(t)$
  Receive $y_{k_t}(t)$
until $t < T$
Analysis of the algorithm UCB Without Replacement
UCBWR uses the Serfling concentration inequality rather than the Hoeffding inequality. Let $y_1, ..., y_n$ be a sample drawn without replacement from a finite list of values between 0 and 1, $Y_1, ..., Y_N$; then for all $\epsilon > 0$ :

$P\left( \left| \frac{1}{n}\sum_{t=1}^{n} y_t - \bar{Y} \right| \geq \epsilon \right) \leq 2 \exp\left( - \frac{2 n \epsilon^2}{1 - \frac{n-1}{N}} \right)$

Theorem. For all $K > 1$, if policy UCBWR is run on $K$ scratch games, each corresponding to a finite list of rewards, then for any suboptimal scratch game $i$ with $N_i > 0$ :

$E[n_i(t)] \leq 8 \left( 1 - \frac{E[n_i(t)] - 1}{N_i} \right) \frac{\log t}{\Delta_i^2(t)} + \frac{\pi^2}{3} + 1 \leq \frac{8 \log t}{\Delta_i^2(t)} + \frac{\pi^2}{3} + 1$

The obtained upper bound is lower than that of UCB.
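A quick numeric check of why the Serfling radius helps (a sketch with illustrative numbers; the two helper names are ours):

```python
import math

# Width of the confidence radius under Hoeffding (draws with replacement)
# versus Serfling (draws without replacement from an urn of size N).
def hoeffding_radius(n, t):
    return math.sqrt(2 * math.log(t) / n)

def serfling_radius(n, t, N):
    # the extra factor (1 - (n - 1)/N) vanishes as the urn empties
    return math.sqrt((1 - (n - 1) / N) * 2 * math.log(t) / n)

r_h = hoeffding_radius(50, 1000)
r_s = serfling_radius(50, 1000, 60)   # 50 of the 60 tickets already drawn
```

When most of an urn has been scratched, its empirical mean is almost the true mean, and the Serfling radius reflects this while the Hoeffding radius does not.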
Thompson Sampling

To solve the exploration / exploitation tradeoff, Thompson Sampling uses a randomized version of the optimism-in-face-of-uncertainty principle :
- when the number of draws of a game is low, its posterior distribution has a large variance, which promotes the exploration of this game;
- when a game has been chosen many times, its posterior distribution is sharp, which promotes the exploitation of games with a high value of $\mu_i(t)$.

repeat
  for all games $i \in [K_t]$ do
    Draw $\mu_i(t)$ according to $P(\mu_i | m_i, n_i)$
  end for
  Play the game $i_t \in [K_t]$ which maximizes $\mu_i(t)$
  $t = t + 1$
  $n_{i_t}(t) = n_{i_t}(t-1) + 1$
  Receive reward $y_{i_t}(t)$
  Update $P(\mu_{i_t} | m_{i_t}, n_{i_t})$
until $t = T$
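For Bernoulli rewards with replacement, the loop above specializes to Beta-Bernoulli Thompson Sampling, sketched below (the two "messages" and their subscription rates are illustrative assumptions):

```python
import random

def thompson(reward_fns, T, seed=2):
    """Thompson Sampling with a Beta(1,1) prior on each arm's mean:
    sample a mean from every posterior, play the argmax, update the
    played arm's Beta counts with the observed Bernoulli reward."""
    rng = random.Random(seed)
    K = len(reward_fns)
    a = [1] * K   # Beta parameter: successes + 1
    b = [1] * K   # Beta parameter: failures + 1
    gain = 0.0
    for _ in range(T):
        samples = [rng.betavariate(a[i], b[i]) for i in range(K)]
        k = max(range(K), key=lambda i: samples[i])
        y = reward_fns[k]()
        a[k] += int(y)
        b[k] += 1 - int(y)
        gain += y
    return gain, [a[i] + b[i] - 2 for i in range(K)]

arm_rng = random.Random(3)
# Two "messages" with subscription rates 0.2 and 0.6 (illustrative values).
arms = [lambda p=p: 1.0 if arm_rng.random() < p else 0.0 for p in (0.2, 0.6)]
gain, plays = thompson(arms, 2000)
```

The randomized draw replaces the explicit confidence bound of UCB: wide posteriors occasionally produce large samples, which is what drives exploration.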
Thompson Sampling Without Replacement

The probability of observing $m_i$ winning tickets is given by the hypergeometric law :

$P(m_i | \mu_i, n_i) = \frac{ \binom{N_i \mu_i}{m_i} \binom{N_i - N_i \mu_i}{n_i - m_i} }{ \binom{N_i}{n_i} }$

Using Bayes' rule, we can compute the posterior distribution of the mean reward $\mu_i$ :

$P(\mu_i | n_i, m_i) \propto P(m_i | \mu_i, n_i)\, P(\mu_i | n_i) = \frac{1}{N_i + 1} \cdot \frac{ \binom{n_i}{m_i} \binom{N_i - n_i}{N_i \mu_i - m_i} }{ \binom{N_i}{n_i} } = \binom{N_i - n_i}{N_i \mu_i - m_i} \frac{ \beta(N_i \mu_i + 1,\, N_i - N_i \mu_i + 1) }{ (n_i + 1)\, \beta(m_i + 1,\, n_i - m_i + 1) }$

where $\beta(a, b)$ denotes the beta function. The obtained posterior distribution is the beta-binomial distribution. The analysis of Thompson Sampling Without Replacement is an open problem.
Online decisions for emailing campaign optimization : adversarial formulation

The adversarial formulation of the Scratch Games problem allows us to cope with the non-stationarity of the data. Let :
- $x_t$ be a context vector (profile of the customer),
- $A$ be a set of $K$ decisions (send a message $k$),
- $y_t \in [0,1]^K$ be the vector of bounded rewards at time $t$,
- $y_k(t)$ be the reward of message $k$ at time $t$ (click on the enclosed link),
- $\Pi$ be the set of policies $\pi : X \to A$ (emailing policies).

We would like to find the emailing policy $\pi$ maximizing the number of clicks :

$\{(x_1, y_1), ..., (x_t, y_t), ..., (x_T, y_T)\}$ are chosen by an adversary
The set $\{x_1, ..., x_t, ..., x_T\}$ is revealed
repeat
  The player chooses a context $x_t$
  The player chooses a decision $k = \pi(x_t)$
  The reward $y_k(t)$ is revealed
until $t < T$
Exp3 for Finite Sequences
repeat
  for all games $i \in [K_m]$ do
    $p_i(t) = (1 - \gamma_m^*) \frac{w_i(t)}{\sum_{j \in [K_m]} w_j(t)} + \frac{\gamma_m^*}{K_m}$
  end for
  Draw $i_t$ randomly according to the probabilities $p_i(t)$
  Receive reward $y_{i_t}(t)$
  for all games $i \in [K_m]$ do
    $\hat{y}_i(t) = y_i(t) / p_i(t)$ if $i = i_t$, and $0$ otherwise
    $w_i(t+1) = w_i(t) \exp\left( \frac{\gamma_m^*}{K_m} \hat{y}_i(t) \right)$
  end for
  $t = t + 1$
  if a game ends then
    Evaluate $\gamma_m^*$
  end if
until $t = T$
Exp3FS in four points :
- A piecewise-constant exploration factor maintains a minimal probability of draws whatever the past rewards.
- The unbiased estimation of the reward allows quick changes.
- The number of current games depends on the past plays.
- Each time a game ends, the exploration / exploitation tradeoff is reassessed.
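The core Exp3 update inside Exp3FS can be sketched as follows (a simplification: $\gamma$ is held fixed here instead of being re-evaluated when a game ends, and the toy arms are illustrative assumptions):

```python
import math, random

def exp3(reward_fns, T, gamma, seed=4):
    """Core Exp3 step used inside Exp3FS: mix the weight distribution
    with uniform exploration, draw an arm from it, and feed the unbiased
    estimate y/p back into the drawn arm's weight only."""
    rng = random.Random(seed)
    K = len(reward_fns)
    w = [1.0] * K
    gain = 0.0
    for _ in range(T):
        total = sum(w)
        p = [(1 - gamma) * w[i] / total + gamma / K for i in range(K)]
        k = rng.choices(range(K), weights=p)[0]
        y = reward_fns[k]()
        gain += y
        w[k] *= math.exp(gamma * (y / p[k]) / K)   # unbiased estimate y/p
    return gain, w

arm_rng = random.Random(5)
# Two illustrative games with reward probabilities 0.2 and 0.6.
arms = [lambda p=p: 1.0 if arm_rng.random() < p else 0.0 for p in (0.2, 0.6)]
gain, w = exp3(arms, 2000, gamma=0.1)
```

The $\gamma/K$ floor on every probability is what lets the algorithm react when an adversary (or a finished game) changes which arm is best.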
Analysis of the algorithm Exp3 for Finite Sequences
Let $G_T^* = \sum_{t=1}^{T} \max_{i \in [K_t]} y_i(t)$ be the gain of the optimal policy. Let $\Delta_m = G_{T_{m+1}} - G_{T_m}$ be the gain between times $T_m$ and $T_{m+1}$, and $\Delta_m^* = G_{T_{m+1}}^* - G_{T_m}^*$ be the optimal gain between times $T_m$ and $T_{m+1}$.

Theorem 1. For all $K_m > 0$, if the E3FS policy runs during the time period $[T_m, T_{m+1})$ with $0 < \gamma_m \leq 1$, then we have :

$\Delta_m^* - E[\Delta_m] \leq (e - 1)\, \gamma_m\, \Delta_m^* + \frac{K_m \ln K_m}{\gamma_m}$
Corollary 1.1. For all $K_m > 0$, if the E3FS policy runs during the time period $[T_m, T_{m+1})$ with $0 < \gamma_m \leq 1$, the bound is optimized by :

$\gamma_m = \min\left( 1, \sqrt{ \frac{K_m \ln K_m}{(e-1)\, \Delta_m^*} } \right)$

Corollary 1.2. If the E3FS policy runs from time $t = 0$ to time $t = T$, with $K_m > 0$ for the $L$ time periods $[T_m, T_{m+1})$, we have :

$G_T^* - E[G_T] \leq 2 \sqrt{ (e-1) \sum_{m=1}^{L} \Delta_m^* K_m \ln K_m } \leq 2 \sqrt{ G_T^*\, (e-1)\, K \ln K }$

The obtained upper bound is lower than that of Exp3.
Test on synthetic problems : methodology
- 210 314 tickets, including 33 688 winning tickets, spread over 100 scratch games, have been drawn according, respectively, to a Pareto distribution and a Bernoulli distribution.
- For each simulation and each scratch game, a sequence of rewards is drawn according to the urn model parameterized by the number of winning tickets $m_i$ and the number of tickets $N_i$.
- The results shown correspond to the mean of one hundred simulations.
Synthesis : regret against the optimal static policy
TABLE : Mean regret and rank (non-stationary : for each game, the probability of reward changes at time $t = N/2$).

| Problem        | UCB1     | UCBWR    | Exp3     | E3FS     | TS       | TSWR     |
|----------------|----------|----------|----------|----------|----------|----------|
| finite budget  | 2030 (6) | 1648 (5) | 1498 (4) | 1433 (3) | 1381 (2) | 1354 (1) |
| non-stationary | 1154 (4) |  324 (1) |  709 (3) |  596 (2) | 1313 (6) | 1303 (5) |

As expected, we can take advantage of the scratch-games setting : UCBWR, E3FS, and TSWR outperform UCB1, Exp3, and TS respectively. When its prior holds, the Thompson Sampling algorithms outperform the other algorithms.
Results on finite sequences of rewards

[Figure : weak regret versus the number of scratched tickets]

- UCB1 spends too much time exploring small games; as expected, UCBWR outperforms UCB1.
- For E3FS, the value of $\gamma$ takes into account that the sequences of rewards are finite; as expected, E3FS outperforms Exp3.
- On the first part of the curve TS outperforms TSWR, but on the second part the effect of draws without replacement favours TSWR.
Results on non-stationary sequences

[Figure : weak regret versus the number of scratched tickets, when the reward distributions depend on a threshold function of time]

- As expected, E3FS and Exp3 maintain good performances.
- TS and TSWR are the worst in this case : they are based on a prior which does not hold here.
- Surprisingly, UCBWR performs very well on this problem. During the first period, it plays more the scratch games whose reward probabilities are multiplied by two. Thanks to the decrease of its exploration factor, it does not scratch all the tickets of these games. For the introduced non-stationarity this is useful, because most of the winning tickets of these games have been scratched before the end of this time period.
Some results for emailing campaign optimization

[Figures : gain, and regret against the best remaining game]

Experimental setting :
- Logs from 640 scratch games (128 campaigns × 5 profiles), sent to all customers, were collected during one month.
- Finite inventory for each game, corresponding here to the number of sent emails.
- Total of 11 006 000 sent emails and 221 000 clicks.

The use of the UCBWR policy can increase the click rate by 50%.
Works in progress
The proposed approach has a drawback :
- We create as many games as the number of messages times the number of profiles. The resulting number of scratch games can be high in comparison to the number of sent emails, which leads to poor performance.
- Our modeling of the context is naive : no dependence between scratch games is taken into account.

Explore-then-exploit approach :
- As in supervised learning, a model is built during a first period and then applied during a second period.
- The $(\epsilon, \delta)$-PAC framework is useful to calibrate the size of the first period : the sample complexity is the number of collected data points needed to obtain a model whose rewards are $\epsilon$-close to the optimal one with a probability of error $\delta$.
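As a back-of-the-envelope sketch of such a calibration (our assumption: a plain Hoeffding bound with a union bound over the $K$ arms, not necessarily the exact sample-complexity bound intended here):

```python
import math

def sample_size(epsilon, delta, K):
    """Hoeffding + union-bound sketch: number of draws per arm so that
    all K empirical mean rewards are within epsilon of their true means
    with probability at least 1 - delta."""
    return math.ceil(math.log(2 * K / delta) / (2 * epsilon ** 2))

n = sample_size(0.05, 0.05, 10)   # exploration-period size per arm
```

Loosening $\epsilon$ shrinks the exploration period quadratically, which is the knob an explore-then-exploit campaign would tune.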
Works in progress : Proof Of Concept Emailing Optimization
The functional architecture of the Proof of Concept : [Figure]
Online decisions : other use cases
Other online-decision use cases
- Online Relevance Feedback : the optimization of search and recommendation interfaces is an online process where the feedback is strongly biased by the interface itself.
- Online Customer Experience Optimization : knowing the customer journey and the customer profile, optimizing the next best action (marketing, after-sales services, customer care...) improves the customer experience.
- Autonomous Terminals : complex and interconnected terminals, such as a LiveBox, have to take online decisions in order to configure themselves, to ensure self-care or security...
- Dynamic Networks : depending on the network state, a router has to take online decisions to choose the best path in the network.
- Yield Management : online adaptation of communication prices to the network load maximizes the global revenue.
Online decisions : some selected references
Some selected references

Multi-armed bandits :
- Peter Auer, Nicolò Cesa-Bianchi and Paul Fischer : Finite-time Analysis of the Multiarmed Bandit Problem, Machine Learning, 47, 235-256 (2002)
- Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund and Robert E. Schapire : The Nonstochastic Multiarmed Bandit Problem, SIAM J. Comput., 32, 48-77 (2002)
- Eyal Even-Dar, Shie Mannor and Yishay Mansour : Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems, JMLR, 7, 1079-1105 (2006)

Our references on MAB :
- Raphaël Féraud and Tanguy Urvoy : A Stochastic Bandit Algorithm for Scratch Games, ACML, 25, 129-145 (2012)
- Raphaël Féraud and Tanguy Urvoy : Exploration and Exploitation of Scratch Games, Machine Learning, 92, 377-401 (2013)
- Tanguy Urvoy, Fabrice Clérot, Raphaël Féraud and Sami Naamane : Generic Exploration and K-armed Voting Bandits, ICML (2013)