Optimization of online decisions
Raphaël Féraud, Tanguy Urvoy
Orange Labs – OLNC/OLPS/UCE/CRM-DA/PROF
LRI, November 6th, 2013
R. Féraud and T. Urvoy (Orange labs)
Optimization of online decisions
November 2013
1 / 32
Introduction : big-data, fast-data
Outline
1. Introduction : big-data, fast-data
2. Online decisions for advertising optimization
3. Online decisions for marketing optimization
4. Online decisions : other use cases
5. Online decisions : some selected references
Introduction : in praise of fast-data
Two complementary approaches :
- big-data : batch processing, data mining, fundamental trends;
- fast-data : stream processing, stream mining, contextual decisions, interaction.

"Fast Data Gets A Jump On Big Data" [Oracle 2013]. The case for fast-data :
- the online decision is often directly the problem we would like to solve;
- simpler, and therefore often easier to deploy and less resource-consuming;
- to sum up : smarter in a finite world.
Online decisions for advertising optimization
Advertising optimization : context
The three most common ways in which online advertising is purchased :
- CPM (Cost per Mille) : advertisers pay for exposure of their message to a specific audience;
- CPC (Cost per Click) : advertisers pay each time a user clicks on their ad and is redirected to their website;
- CPA (Cost per Action) : advertisers pay only for the number of users who, once redirected to their website, complete a transaction such as a purchase or sign-up.

We focus here on CPC.
Advertising optimization : functional architecture
Online decision optimization is one of several complementary modules used to monetize the audience.
Online decisions for advertising optimization : stochastic formulation
Let :
- $x_t$ be a context vector (page × profile),
- $A$ be a set of $K$ decisions (display a banner $k$),
- $y_k(t)$ be the reward of decision $k$ at time $t$ (click on the banner),
- $y_t \in [0,1]^K$ be the vector of bounded rewards at time $t$,
- $\Pi$ be the set of policies $\pi : X \to A$ (ad-server policies).

We would like to find the ad-server policy $\pi$ maximizing the number of clicks :

repeat
  $(x_t, y_t)$ is drawn according to $D_{x,y}$
  The player chooses a decision $k = \pi(x_t)$
  The reward $y_k(t)$ is revealed
until $t < T$
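As a minimal Python sketch of this interaction protocol (the environment, the click probabilities, and all function names below are illustrative assumptions, not from the talk), one can compare the context-aware optimal policy against a uniformly random one:

```python
import random

random.seed(0)

def run_protocol(policy, T, draw_context_reward):
    """Simulate the stochastic protocol: at each step a (context, reward
    vector) pair is drawn i.i.d., the policy picks a decision, and only
    the reward of that decision is revealed (bandit feedback)."""
    total = 0.0
    for _ in range(T):
        x_t, y_t = draw_context_reward()   # (x_t, y_t) ~ D_{x,y}
        k = policy(x_t)                    # k = pi(x_t)
        total += y_t[k]                    # only y_k(t) is revealed
    return total

# Toy environment (illustrative values): 2 contexts, 3 banners,
# click probabilities depending on the context.
probs = {0: [0.1, 0.5, 0.2], 1: [0.4, 0.1, 0.3]}

def draw_context_reward():
    x = random.randrange(2)
    y = [1.0 if random.random() < p else 0.0 for p in probs[x]]
    return x, y

best = run_protocol(lambda x: max(range(3), key=lambda k: probs[x][k]),
                    10_000, draw_context_reward)
uniform = run_protocol(lambda x: random.randrange(3), 10_000,
                       draw_context_reward)
```

The gap between `best` and `uniform` is exactly what an ad-server policy has to close online, without knowing `probs` in advance.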
The Multi-Armed Bandits for online advertising optimization
Exploration / exploitation tradeoff :
- exploration to find the best arm,
- exploitation of the best arm found so far.

The UCB policy is based on the optimism-in-face-of-uncertainty principle : it plays the arm with the highest upper confidence bound.

Allocate one MAB per page × profile :

repeat
  for all arms $k$ do
    $B_k(t) = \frac{1}{n_k(t)} \sum_{p=1}^{n_k(t)} y_k(p) + \sqrt{\frac{2 \log t}{n_k(t)}}$
  end for
  Play $k_t = \arg\max_k B_k(t)$
  Receive $y_{k_t}(t)$
until $t < T$
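The UCB1 loop above can be sketched in a few lines of Python (a self-contained sketch: the three Bernoulli "banners" and their click rates are illustrative assumptions):

```python
import math, random

def ucb1(reward_fns, T):
    """UCB1 (Auer et al. 2002): play each arm once, then always play the
    arm maximizing empirical mean + sqrt(2 log t / n_k)."""
    K = len(reward_fns)
    n = [0] * K          # n_k(t): number of plays of arm k
    s = [0.0] * K        # cumulative reward of arm k
    gain = 0.0
    for t in range(1, T + 1):
        if t <= K:
            k = t - 1    # initialisation: play each arm once
        else:
            k = max(range(K), key=lambda i:
                    s[i] / n[i] + math.sqrt(2 * math.log(t) / n[i]))
        y = reward_fns[k]()      # only the chosen banner's reward is seen
        n[k] += 1
        s[k] += y
        gain += y
    return gain, n

random.seed(1)
# Three Bernoulli "banners" with click rates 0.1, 0.3, 0.05 (illustrative).
arms = [lambda p=p: 1.0 if random.random() < p else 0.0
        for p in (0.1, 0.3, 0.05)]
gain, plays = ucb1(arms, 5000)
```

After 5000 rounds, the play counts concentrate on the best banner while the confidence term keeps a logarithmic amount of exploration on the others.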
Some results for a given web page
[Figures : gain, and regret against the best remaining ad]

Experimental setting :
- Logs from one hundred ads, displayed to 1/1000 of the users, were collected during ten days.
- Total of 2 500 000 displays and 11 000 clicks.
- Finite inventory for each ad, corresponding here to its number of displays during the collecting period.

The use of the UCB policy can increase the click rate by 20%.
Online decisions for marketing optimization
Emailing campaign optimization : the problem
We would like to promote a web service to our customers : which message will obtain the highest subscription rate ?
Moreover, our customers have different profiles, which can influence the subscription rate : which profile for which message ?
Online decisions for emailing campaign optimization : stochastic formulation

The Scratch Games problem is a variant of the MAB problem. Let :
- $x_t$ be a context vector (profile of the customer),
- $A$ be a set of $K$ decisions (send a message $k$),
- $y_t \in [0,1]^K$ be the vector of bounded rewards at time $t$,
- $y_k(t)$ be the reward of message $k$ at time $t$ (click on the enclosed link),
- $\Pi$ be the set of policies $\pi : X \to A$ (emailing policies).

We would like to find the emailing policy $\pi$ maximizing the number of clicks :

$\{(x_1, y_1), ..., (x_t, y_t), ..., (x_T, y_T)\}$ are drawn according to $D_{x,y}$
The set $\{x_1, ..., x_t, ..., x_T\}$ is revealed
repeat
  The player chooses a context $x_t$
  The player chooses a decision $k = \pi(x_t)$
  The reward $y_k(t)$ is revealed
until $t < T$
What is the breaking point ?

The contexts, which are the profiles of known customers, are drawn in advance. We assume that the rewards are also drawn in advance. This assumption corresponds to the sequential design of non-reproducible experiments : each customer can be reached only once per marketing campaign. A scratch game is functionally equivalent to an urn where the draws are performed without replacement.

In the multi-armed bandit setting, to maximize his gain the player has to find the best game as soon as possible and then exploit it. In the scratch game setting, when the player has found the best game, he knows that this game will expire : he needs to re-explore before the best game expires, in order to find the next best game. The usual tradeoff between exploration and exploitation has to be revisited.
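The urn equivalence above can be made concrete in a few lines (a sketch; the ticket counts and the `scratch_game` name are our own illustrative choices):

```python
import random

def scratch_game(n_winning, n_total, seed=0):
    """A scratch game modeled as an urn: a fixed, pre-shuffled list of
    tickets, drawn without replacement; once empty, the game has expired."""
    tickets = [1.0] * n_winning + [0.0] * (n_total - n_winning)
    random.Random(seed).shuffle(tickets)
    return tickets

game = scratch_game(30, 100)
# Playing to exhaustion reveals exactly the 30 winning tickets; after that
# no reward remains -- unlike a Bernoulli arm, which never expires.
total = sum(game.pop() for _ in range(100))
```

This finiteness is the breaking point: exploiting the best urn consumes it, so the player must re-explore before it runs dry.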
UCB Without Replacement
Allocate one scratch game per message × profile (the number of games equals the number of messages times the number of profiles) :

repeat
  for all arms $k \in [K_t]$ do
    $B_k(t) = \frac{1}{n_k(t)} \sum_{p=1}^{n_k(t)} y_k(p) + \sqrt{\left(1 - \frac{n_k(t)-1}{N_k}\right) \frac{2 \log t}{n_k(t)}}$
  end for
  Play $k_t = \arg\max_k B_k(t)$
  Receive $y_{k_t}(t)$
until $t < T$
Analysis of the algorithm UCB Without Replacement
UCBWR uses the Serfling concentration inequality rather than the Hoeffding inequality. Let $y_1, ..., y_n$ be a sample drawn without replacement from a finite list of values between 0 and 1, $Y_1, ..., Y_N$; then for all $\epsilon > 0$ :

$P\left( \left| \frac{1}{n}\sum_{t=1}^{n} y_t - \bar{Y} \right| \geq \epsilon \right) \leq 2 \exp\left( - \frac{2 n \epsilon^2}{1 - \frac{n-1}{N}} \right)$

Theorem. For all $K > 1$, if policy UCBWR is run on $K$ scratch games, each corresponding to a finite list of rewards, then for any suboptimal scratch game $i$ with $N_i > 0$ :

$E[n_i(t)] \leq 8 \left( 1 - \frac{E[n_i(t)] - 1}{N_i} \right) \frac{\log t}{\Delta_i^2(t)} + \frac{\pi^2}{3} + 1 \leq \frac{8 \log t}{\Delta_i^2(t)} + \frac{\pi^2}{3} + 1$

The obtained upper bound is lower than that of UCB.
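A quick numeric check of why the Serfling radius helps (a sketch with illustrative numbers; the two helper names are ours):

```python
import math

# Width of the confidence radius under Hoeffding (draws with replacement)
# versus Serfling (draws without replacement from an urn of size N).
def hoeffding_radius(n, t):
    return math.sqrt(2 * math.log(t) / n)

def serfling_radius(n, t, N):
    # the extra factor (1 - (n - 1)/N) vanishes as the urn empties
    return math.sqrt((1 - (n - 1) / N) * 2 * math.log(t) / n)

r_h = hoeffding_radius(50, 1000)
r_s = serfling_radius(50, 1000, 60)   # 50 of the 60 tickets already drawn
```

When most of an urn has been scratched, its empirical mean is almost the true mean, and the Serfling radius reflects this while the Hoeffding radius does not.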
Thompson Sampling

To solve the exploration / exploitation tradeoff, Thompson Sampling uses a randomized version of the optimism-in-face-of-uncertainty principle :
- when the number of draws of a game is low, its posterior distribution has a large variance, which promotes the exploration of this game;
- when a game has been chosen many times, its posterior distribution is sharp, which promotes the exploitation of games with a high value of $\mu_i(t)$.

repeat
  for all games $i \in [K_t]$ do
    Draw $\mu_i(t)$ according to $P(\mu_i | m_i, n_i)$
  end for
  Play the game $i_t \in [K_t]$ which maximizes $\mu_i(t)$
  $t = t + 1$
  $n_{i_t}(t) = n_{i_t}(t-1) + 1$
  Receive reward $y_{i_t}(t)$
  Update $P(\mu_{i_t} | m_{i_t}, n_{i_t})$
until $t = T$
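For Bernoulli rewards with replacement, the loop above specializes to Beta-Bernoulli Thompson Sampling, sketched below (the two "messages" and their subscription rates are illustrative assumptions):

```python
import random

def thompson(reward_fns, T, seed=2):
    """Thompson Sampling with a Beta(1,1) prior on each arm's mean:
    sample a mean from every posterior, play the argmax, update the
    played arm's Beta counts with the observed Bernoulli reward."""
    rng = random.Random(seed)
    K = len(reward_fns)
    a = [1] * K   # Beta parameter: successes + 1
    b = [1] * K   # Beta parameter: failures + 1
    gain = 0.0
    for _ in range(T):
        samples = [rng.betavariate(a[i], b[i]) for i in range(K)]
        k = max(range(K), key=lambda i: samples[i])
        y = reward_fns[k]()
        a[k] += int(y)
        b[k] += 1 - int(y)
        gain += y
    return gain, [a[i] + b[i] - 2 for i in range(K)]

arm_rng = random.Random(3)
# Two "messages" with subscription rates 0.2 and 0.6 (illustrative values).
arms = [lambda p=p: 1.0 if arm_rng.random() < p else 0.0 for p in (0.2, 0.6)]
gain, plays = thompson(arms, 2000)
```

The randomized draw replaces the explicit confidence bound of UCB: wide posteriors occasionally produce large samples, which is what drives exploration.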
Thompson Sampling Without Replacement

The probability of observing $m_i$ winning tickets is given by the hypergeometric law :

$P(m_i | \mu_i, n_i) = \frac{ \binom{N_i \mu_i}{m_i} \binom{N_i - N_i \mu_i}{n_i - m_i} }{ \binom{N_i}{n_i} }$

Using Bayes' rule, we can compute the posterior distribution of the mean reward $\mu_i$ :

$P(\mu_i | n_i, m_i) \propto P(m_i | \mu_i, n_i)\, P(\mu_i | n_i) = \frac{1}{N_i + 1} \cdot \frac{ \binom{n_i}{m_i} \binom{N_i - n_i}{N_i \mu_i - m_i} }{ \binom{N_i}{n_i} } = \binom{N_i - n_i}{N_i \mu_i - m_i} \frac{ \beta(N_i \mu_i + 1,\, N_i - N_i \mu_i + 1) }{ (n_i + 1)\, \beta(m_i + 1,\, n_i - m_i + 1) }$

where $\beta(a, b)$ denotes the beta function. The obtained posterior distribution is the beta-binomial distribution. The analysis of Thompson Sampling Without Replacement is an open problem.
Online decisions for emailing campaign optimization : adversarial formulation

The adversarial formulation of the Scratch Games problem allows us to cope with the non-stationarity of the data. Let :
- $x_t$ be a context vector (profile of the customer),
- $A$ be a set of $K$ decisions (send a message $k$),
- $y_t \in [0,1]^K$ be the vector of bounded rewards at time $t$,
- $y_k(t)$ be the reward of message $k$ at time $t$ (click on the enclosed link),
- $\Pi$ be the set of policies $\pi : X \to A$ (emailing policies).

We would like to find the emailing policy $\pi$ maximizing the number of clicks :

$\{(x_1, y_1), ..., (x_t, y_t), ..., (x_T, y_T)\}$ are chosen by an adversary
The set $\{x_1, ..., x_t, ..., x_T\}$ is revealed
repeat
  The player chooses a context $x_t$
  The player chooses a decision $k = \pi(x_t)$
  The reward $y_k(t)$ is revealed
until $t < T$
Exp3 for Finite Sequences
repeat
  for all games $i \in [K_m]$ do
    $p_i(t) = (1 - \gamma_m^*) \frac{w_i(t)}{\sum_{j \in [K_m]} w_j(t)} + \frac{\gamma_m^*}{K_m}$
  end for
  Draw $i_t$ randomly according to the probabilities $p_i(t)$
  Receive reward $y_{i_t}(t)$
  for all games $i \in [K_m]$ do
    $\hat{y}_i(t) = y_i(t) / p_i(t)$ if $i = i_t$, and $0$ otherwise
    $w_i(t+1) = w_i(t) \exp\left( \frac{\gamma_m^*}{K_m} \hat{y}_i(t) \right)$
  end for
  $t = t + 1$
  if a game ends then
    Evaluate $\gamma_m^*$
  end if
until $t = T$
Exp3FS in four points :
- A piecewise-constant exploration factor maintains a minimal probability of draws whatever the past rewards.
- The unbiased estimation of the reward allows quick changes.
- The number of current games depends on the past plays.
- Each time a game ends, the exploration / exploitation tradeoff is reassessed.
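The core Exp3 update inside Exp3FS can be sketched as follows (a simplification: $\gamma$ is held fixed here instead of being re-evaluated when a game ends, and the toy arms are illustrative assumptions):

```python
import math, random

def exp3(reward_fns, T, gamma, seed=4):
    """Core Exp3 step used inside Exp3FS: mix the weight distribution
    with uniform exploration, draw an arm from it, and feed the unbiased
    estimate y/p back into the drawn arm's weight only."""
    rng = random.Random(seed)
    K = len(reward_fns)
    w = [1.0] * K
    gain = 0.0
    for _ in range(T):
        total = sum(w)
        p = [(1 - gamma) * w[i] / total + gamma / K for i in range(K)]
        k = rng.choices(range(K), weights=p)[0]
        y = reward_fns[k]()
        gain += y
        w[k] *= math.exp(gamma * (y / p[k]) / K)   # unbiased estimate y/p
    return gain, w

arm_rng = random.Random(5)
# Two illustrative games with reward probabilities 0.2 and 0.6.
arms = [lambda p=p: 1.0 if arm_rng.random() < p else 0.0 for p in (0.2, 0.6)]
gain, w = exp3(arms, 2000, gamma=0.1)
```

The $\gamma/K$ floor on every probability is what lets the algorithm react when an adversary (or a finished game) changes which arm is best.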
Analysis of the algorithm Exp3 for Finite Sequences
Let $G_T^* = \sum_{t=1}^{T} \max_{i \in [K_t]} y_i(t)$ be the gain of the optimal policy. Let $\Delta_m = G_{T_{m+1}} - G_{T_m}$ be the gain between times $T_m$ and $T_{m+1}$, and $\Delta_m^* = G_{T_{m+1}}^* - G_{T_m}^*$ be the optimal gain between times $T_m$ and $T_{m+1}$.

Theorem 1. For all $K_m > 0$, if the E3FS policy runs during the time period $[T_m, T_{m+1})$ with $0 < \gamma_m \leq 1$, then we have :

$\Delta_m^* - E[\Delta_m] \leq (e - 1)\, \gamma_m\, \Delta_m^* + \frac{K_m \ln K_m}{\gamma_m}$
Corollary 1.1. For all $K_m > 0$, if the E3FS policy runs during the time period $[T_m, T_{m+1})$ with $0 < \gamma_m \leq 1$, the bound is optimized by :

$\gamma_m = \min\left( 1, \sqrt{ \frac{K_m \ln K_m}{(e-1)\, \Delta_m^*} } \right)$

Corollary 1.2. If the E3FS policy runs from time $t = 0$ to time $t = T$, with $K_m > 0$ for the $L$ time periods $[T_m, T_{m+1})$, we have :

$G_T^* - E[G_T] \leq 2 \sqrt{ (e-1) \sum_{m=1}^{L} \Delta_m^* K_m \ln K_m } \leq 2 \sqrt{ G_T^*\, (e-1)\, K \ln K }$

The obtained upper bound is lower than that of Exp3.
Test on synthetic problems : methodology
- 210 314 tickets, including 33 688 winning tickets, spread over 100 scratch games, have been drawn according, respectively, to a Pareto distribution and a Bernoulli distribution.
- For each simulation and each scratch game, a sequence of rewards is drawn according to the urn model parameterized by the number of winning tickets $m_i$ and the number of tickets $N_i$.
- The results shown correspond to the mean of one hundred simulations.
Synthesis : regret against the optimal static policy
TABLE : Mean regret and rank (non-stationary : for each game, the probability of reward changes at time $t = N/2$).

| Problem        | UCB1     | UCBWR    | Exp3     | E3FS     | TS       | TSWR     |
|----------------|----------|----------|----------|----------|----------|----------|
| finite budget  | 2030 (6) | 1648 (5) | 1498 (4) | 1433 (3) | 1381 (2) | 1354 (1) |
| non-stationary | 1154 (4) |  324 (1) |  709 (3) |  596 (2) | 1313 (6) | 1303 (5) |

As expected, we can take advantage of the scratch-games setting : UCBWR, E3FS, and TSWR outperform UCB1, Exp3, and TS respectively. When its prior holds, the Thompson Sampling algorithms outperform the other algorithms.
Results on finite sequences of rewards

[Figure : weak regret versus the number of scratched tickets]

- UCB1 spends too much time exploring small games; as expected, UCBWR outperforms UCB1.
- For E3FS, the value of $\gamma$ takes into account that the sequences of rewards are finite; as expected, E3FS outperforms Exp3.
- On the first part of the curve TS outperforms TSWR, but on the second part the effect of draws without replacement favours TSWR.
Results on non-stationary sequences

[Figure : weak regret versus the number of scratched tickets, when the reward distributions depend on a threshold function of time]

- As expected, E3FS and Exp3 maintain good performances.
- TS and TSWR are the worst in this case : they are based on a prior which does not hold here.
- Surprisingly, UCBWR performs very well on this problem. During the first period, it plays more the scratch games whose reward probabilities are multiplied by two. Thanks to the decrease of its exploration factor, it does not scratch all the tickets of these games. For the introduced non-stationarity this is useful, because most of the winning tickets of these games have been scratched before the end of this time period.
Some results for emailing campaign optimization

[Figures : gain, and regret against the best remaining game]

Experimental setting :
- Logs from 640 scratch games (128 campaigns × 5 profiles), sent to all customers, were collected during one month.
- Finite inventory for each game, corresponding here to the number of sent emails.
- Total of 11 006 000 sent emails and 221 000 clicks.

The use of the UCBWR policy can increase the click rate by 50%.
Works in progress
The proposed approach has a drawback :
- We create as many games as the number of messages times the number of profiles. The resulting number of scratch games can be high in comparison to the number of sent emails, which leads to poor performance.
- Our modeling of the context is naive : no dependence between scratch games is taken into account.

Explore-then-exploit approach :
- As in supervised learning, a model is built during a first period and then applied during a second period.
- The $(\epsilon, \delta)$-PAC framework is useful to calibrate the size of the first period : the sample complexity is the number of collected data points needed to obtain a model whose rewards are $\epsilon$-close to the optimal one with a probability of error $\delta$.
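As a back-of-the-envelope sketch of such a calibration (our assumption: a plain Hoeffding bound with a union bound over the $K$ arms, not necessarily the exact sample-complexity bound intended here):

```python
import math

def sample_size(epsilon, delta, K):
    """Hoeffding + union-bound sketch: number of draws per arm so that
    all K empirical mean rewards are within epsilon of their true means
    with probability at least 1 - delta."""
    return math.ceil(math.log(2 * K / delta) / (2 * epsilon ** 2))

n = sample_size(0.05, 0.05, 10)   # exploration-period size per arm
```

Loosening $\epsilon$ shrinks the exploration period quadratically, which is the knob an explore-then-exploit campaign would tune.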
Works in progress : Proof Of Concept Emailing Optimization
The functional architecture of the Proof of Concept : [Figure]
Online decisions : other use cases
Other online-decision use cases
- Online Relevance Feedback : the optimization of search and recommendation interfaces is an online process where the feedback is strongly biased by the interface itself.
- Online Customer Experience Optimization : knowing the customer journey and the customer profile, optimizing the next best action (marketing, after-sales services, customer care...) improves the customer experience.
- Autonomous Terminals : complex and interconnected terminals, such as a LiveBox, have to take online decisions in order to configure themselves, to ensure self-care or security...
- Dynamic Networks : depending on the network state, a router has to take online decisions to choose the best path in the network.
- Yield Management : online adaptation of communication prices to the network load maximizes the global revenue.
Online decisions : some selected references
Some selected references

Multi-armed bandits :
- Peter Auer, Nicolò Cesa-Bianchi and Paul Fischer : Finite-time Analysis of the Multiarmed Bandit Problem, Machine Learning, 47, 235-256 (2002)
- Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund and Robert E. Schapire : The Nonstochastic Multiarmed Bandit Problem, SIAM J. Comput., 32, 48-77 (2002)
- Eyal Even-Dar, Shie Mannor and Yishay Mansour : Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems, JMLR, 7, 1079-1105 (2006)

Our references on MAB :
- Raphaël Féraud and Tanguy Urvoy : A Stochastic Bandit Algorithm for Scratch Games, ACML, 25, 129-145 (2012)
- Raphaël Féraud and Tanguy Urvoy : Exploration and Exploitation of Scratch Games, Machine Learning, 92, 377-401 (2013)
- Tanguy Urvoy, Fabrice Clérot, Raphaël Féraud and Sami Naamane : Generic Exploration and K-armed Voting Bandits, ICML (2013)