Learning in Sequential Decision Problems

Peter Bartlett
Mathematical Sciences, Queensland University of Technology
Computer Science and Statistics, University of California at Berkeley

AusDM, November 28, 2014
A Decision Problem

Order news stories by popularity. Multiclass classification.
Incur loss when a low-ranked story is chosen.
Aim to minimize expected loss.
Decision Problems

Decision problem | Approach                | Aim
Finite decision  | estimate expected loss  | close to optimal
A Decision Problem: Classification

Aim for small relative expected loss: close to the best in some comparison class of prediction rules.
Decision Problems

Decision problem | Approach                   | Aim
Finite decision  | estimate expected loss     | close to optimal
Classification   | e.g., logistic regression  | close to best linear classifier
A Decision Problem: Bandit Classification

But we do not get complete information: what would the user do if we offered a different recommendation?
Sequential decision problem: current decisions affect later performance.
Exploration versus exploitation.
Decision Problems

Decision problem  | Approach                            | Aim
Finite decision   | estimate expected loss              | close to optimal
Classification    | e.g., logistic regression           | close to best linear classifier
Contextual bandit | e.g., reduction to classification   | close to best in set of classifiers
A Decision Problem: Maximizing Value

State of user evolves.
Sequential decision problem: current decisions affect our knowledge and the subsequent state.
Exploration versus exploitation.
Decision Problems

Decision problem        | Approach                            | Aim
Finite decision         | estimate expected loss              | close to optimal
Classification          | e.g., logistic regression           | close to best linear classifier
Contextual bandit       | e.g., reduction to classification   | close to best in set of classifiers
Markov decision process |                                     | close to best in set of policies
Sequential Decision Problems: Key Ideas

Scaling back our ambitions: performance guarantees relative to a comparison class.
Two directions for Markov Decision Processes (MDPs):
1. Large-scale policy design for MDPs.
2. Learning changing MDPs with full information.
Markov Decision Processes

MDP: Managing Threatened Species
For t = 1, 2, . . .:
1. See state X_t (of ecosystem).
2. Play an action A_t (intervention: anti-poaching patrols).
3. Incur loss ℓ(X_t, A_t) ($, extinction).
4. State evolves to X_{t+1} ∼ P_{X_t, A_t}.

MDP: Web Customer Interactions
For t = 1, 2, . . .:
1. See state X_t (of customer interaction).
2. Play an action A_t (offer/advertisement).
3. Incur loss ℓ(X_t, A_t) (missed revenue).

Transition matrix: P : X × A → ∆(X). Policy: π : X → ∆(A).
Stationary distribution: µ_π. Average loss: µ_π^⊤ ℓ.

Performance measure: regret,
$$R_T = \mathbb{E}\sum_{t=1}^{T} \ell(X_t, A_t) \;-\; \min_{\pi}\, \mathbb{E}\sum_{t=1}^{T} \ell(X_t^{\pi}, \pi(X_t^{\pi})),$$
or excess average loss,
$$T\Bigl(\mu_{\pi}^{\top}\ell - \min_{\pi'} \mu_{\pi'}^{\top}\ell\Bigr).$$
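To make the average-loss performance measure concrete, here is a minimal sketch (not from the talk; the transition kernel, loss, and policy below are made up) that computes the stationary distribution of a fixed policy in a tiny MDP and its average loss µ_π^⊤ℓ.

```python
# A minimal sketch: stationary distribution and average loss of a fixed policy
# in a small, made-up MDP.
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# Hypothetical transition kernel P[x, a] in Delta(X) and loss ell(x, a).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
ell = rng.uniform(size=(n_states, n_actions))

# A fixed stochastic policy pi(a | x).
pi = rng.dirichlet(np.ones(n_actions), size=n_states)

# State transition matrix induced by pi: P_pi[x, x'] = sum_a pi(a|x) P[x, a, x'].
P_pi = np.einsum("xa,xay->xy", pi, P)

# Stationary distribution over states: left eigenvector of P_pi with eigenvalue 1.
evals, evecs = np.linalg.eig(P_pi.T)
mu_x = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
mu_x /= mu_x.sum()

# Stationary state-action distribution and average loss mu^T ell.
mu = mu_x[:, None] * pi
print("average loss:", np.sum(mu * ell))
```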
Large-Scale Sequential Decision Problems

Large MDP problems: when the state space X is large, we must scale back the ambition of optimal performance:
- In comparison to a restricted family of policies Π, e.g., linear value function approximation.
- Want a strategy that competes with the best policy in Π.
Outline

1. Large-Scale Policy Design
Compete with a restricted family of policies Π:
- Linearly parameterized policies.
- Stochastic gradient convex optimization.
- Competitive with policies near the approximating class.
- Without knowledge of optimal policy.
- Simulation results: queueing, crowdsourcing.

2. Learning Changing Dynamics
- Changing MDP; complete information.
- Exponential weights strategy.
- Competitive with small comparison class Π.
- Computationally efficient if Π has polynomial size.
- Hard for shortest path problems.
Linear Subspace of Stationary Distributions
Large-scale policy design (with Yasin Abbasi-Yadkori and Alan Malek, ICML 2014)

Stationary distributions are dual to value functions.
Consider a class of policies defined by a feature matrix Φ, a distribution µ_0, and parameters θ:
$$\pi_\theta(a \mid x) = \frac{[\mu_0(x,a) + \Phi_{(x,a),:}\,\theta]_+}{\sum_{a'} [\mu_0(x,a') + \Phi_{(x,a'),:}\,\theta]_+}.$$
Let µ_θ denote the stationary distribution of policy π_θ.
Find $\hat\theta$ such that $\mu_{\hat\theta}^{\top}\ell \le \min_{\theta\in\Theta} \mu_{\theta}^{\top}\ell + \epsilon$.
Large-scale policy design: independent of the size of X.
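As a concrete illustration of this policy class, here is a minimal sketch (toy sizes; Φ, µ_0, and θ are random placeholders) of how π_θ(a | x) can be computed from the features.

```python
# A minimal sketch of the policy parameterization pi_theta(a | x):
# positive part of mu0 + Phi theta, renormalized over the actions at state x.
import numpy as np

n_states, n_actions, d = 4, 3, 5
rng = np.random.default_rng(1)

Phi = rng.normal(size=(n_states * n_actions, d))    # feature matrix, one row per (x, a)
mu0 = rng.dirichlet(np.ones(n_states * n_actions))  # base distribution over (x, a)
theta = rng.normal(scale=0.1, size=d)

def pi_theta(x):
    """pi_theta(. | x) for rows ordered as (x, a) with index x * n_actions + a."""
    rows = [x * n_actions + a for a in range(n_actions)]
    scores = np.maximum(mu0[rows] + Phi[rows] @ theta, 0.0)
    if scores.sum() == 0.0:           # degenerate case: fall back to uniform
        return np.full(n_actions, 1.0 / n_actions)
    return scores / scores.sum()

print(pi_theta(x=2))
```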
Approach: a Reduction to Convex Optimization

Define a constraint violation function
$$V(\theta) = \underbrace{\bigl\|[\mu_0 + \Phi\theta]_-\bigr\|_1}_{\text{prob. dist.}} + \underbrace{\bigl\|(P - B)^{\top}(\mu_0 + \Phi\theta)\bigr\|_1}_{\text{stationarity}}$$
and consider the convex cost function
$$c(\theta) = \ell^{\top}(\mu_0 + \Phi\theta) + \alpha V(\theta).$$
Stochastic gradient: $\theta_{t+1} = \theta_t - \eta\, g_t(\theta_t)$, $\hat\theta_T = \frac{1}{T}\sum_{t=1}^{T}\theta_t$,
. . . with cheap, unbiased stochastic subgradient estimates:
$$g_t(\theta) = \ell^{\top}\Phi \;-\; \alpha\,\frac{\Phi_{(x_t,a_t),:}}{q_1(x_t,a_t)}\,\mathbb{I}\bigl\{\mu_0(x_t,a_t) + \Phi_{(x_t,a_t),:}\theta < 0\bigr\} \;+\; \alpha\,\frac{(P-B)^{\top}_{:,x'_t}\Phi}{q_2(x'_t)}\,\mathrm{sign}\bigl((P-B)^{\top}_{:,x'_t}\Phi\theta\bigr).$$
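A minimal sketch of the reduction, under simplifying assumptions: a tiny tabular MDP where P, B, Φ, and µ_0 fit in memory, so exact subgradients of c(θ) replace the sampled estimates g_t used in the large-scale setting.

```python
# A minimal sketch (not the paper's implementation): subgradient descent on the
# surrogate cost c(theta) = ell^T(mu0 + Phi theta) + alpha * V(theta) for a small,
# made-up MDP, using exact subgradients rather than the sampled estimates g_t.
import numpy as np

nX, nA, d = 5, 2, 4
rng = np.random.default_rng(2)

P = rng.dirichlet(np.ones(nX), size=(nX * nA,))   # transition matrix, rows indexed by (x, a)
B = np.repeat(np.eye(nX), nA, axis=0)             # B[(x, a), x'] = I{x = x'}
ell = rng.uniform(size=nX * nA)                   # losses
Phi = rng.normal(size=(nX * nA, d))               # features
mu0 = np.full(nX * nA, 1.0 / (nX * nA))           # base distribution
alpha, eta, T = 10.0, 0.01, 2000

def subgrad_c(theta):
    v = mu0 + Phi @ theta
    g = ell @ Phi                                 # gradient of ell^T (mu0 + Phi theta)
    g += alpha * (-(v < 0).astype(float)) @ Phi   # subgradient of ||[v]_-||_1
    g += alpha * np.sign((P - B).T @ v) @ ((P - B).T @ Phi)   # subgradient of ||(P-B)^T v||_1
    return g

theta = np.zeros(d)
avg = np.zeros(d)
for t in range(1, T + 1):
    theta = theta - eta * subgrad_c(theta)
    avg += (theta - avg) / t                      # running average theta_hat_T
print("theta_hat:", avg)
```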
Performance Bounds

Main Result. For $T = 1/\epsilon^4$ gradient estimates, with high probability (under a mixing assumption),
$$\mu_{\hat\theta_T}^{\top}\ell \;\le\; \min_{\theta\in\Theta}\Bigl(\mu_{\theta}^{\top}\ell + \frac{V(\theta)}{\epsilon}\Bigr) + O(\epsilon).$$

- Competitive with all policies (stationary distributions) in the linear subspace (i.e., V(θ) = 0).
- Competitive with other policies; the comparison is more favorable near some stationary distribution.
- Previous results of this kind require knowledge about the optimal policy, or require that the comparison class Π contains a near-optimal policy.
Simulation Results: Queueing
(Rybko and Stolyar, 1992; de Farias and Van Roy, 2003a)

[Simulation figure omitted; image: http://alzatex.org/]
Outline

1. Large-Scale Policy Design
Compete with a restricted family of policies Π:
- Linearly parameterized approximate stationary distributions.
- Linearly parameterized, exponentially transformed value function.
- Stochastic gradient convex optimization.
- Competitive with policies in the approximating class.
- Simulation results: crowdsourcing.
Total Cost, Kullback-Leibler Penalty
Large-scale policy design (with Yasin Abbasi-Yadkori, Xi Chen and Alan Malek)

Total cost (assume the chain almost surely hits an absorbing state with zero loss):
$$\mathbb{E}\sum_{t=1}^{\infty} \ell(X_t).$$
- Parameterized value functions.
- Convex cost function.
- Stochastic gradient.
Simulation Results: Crowdsourcing

- Classification task.
- Crowdsource: $ for labels.
- Fixed budget; minimize errors.
- Bayesian model: binary labels, i.i.d. crowd; Y_i ∼ Bernoulli(p_i); p_i ∼ Beta.
- State = posterior.

[Images omitted: http://www.technicalinfo.net/, http://cdns2.freepik.com/]
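A minimal sketch of the Bayesian state described above (toy example, not the paper's implementation): the state for one item is the Beta posterior over p_i, updated as crowd labels arrive.

```python
# A minimal sketch of the Beta-Bernoulli "state" used in the crowdsourcing MDP:
# the posterior over p_i for one item, summarized by Beta pseudo-counts.
from dataclasses import dataclass

@dataclass
class ItemState:
    alpha: float = 1.0   # prior Beta(1, 1) pseudo-counts of label 1
    beta: float = 1.0    # pseudo-counts of label 0

    def update(self, label: int) -> None:
        """Posterior update after observing one crowd label Y ~ Bernoulli(p_i)."""
        if label == 1:
            self.alpha += 1
        else:
            self.beta += 1

    def prob_label_one(self) -> float:
        """Posterior mean of p_i; thresholding at 1/2 gives the current prediction."""
        return self.alpha / (self.alpha + self.beta)

state = ItemState()
for y in [1, 1, 0, 1]:         # labels purchased from the crowd for one item
    state.update(y)
print(state.prob_label_one())  # ~0.667 -> predict label 1 if the budget runs out now
```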
Outline

1. Large-Scale Policy Design

2. Learning Changing Dynamics
- Changing MDP; complete information.
- Exponential weights strategy.
- Competitive with small comparison class Π.
- Computationally efficient if Π has polynomial size.
- Hard for shortest path problems.
Changing Dynamics
(with Yasin Abbasi-Yadkori, Varun Kanade, Yevgeny Seldin, Csaba Szepesvari, NIPS 2013)

Full information: observe P_t, ℓ_t after round t. Arbitrary, even adversarial.
Consider a comparison class: Π ⊂ {π | π : X → A}.
Optimal policy: $\pi^* = \arg\min_{\pi\in\Pi} \sum_{t=1}^{T} \ell_t(x_t^{\pi}, \pi(x_t^{\pi}))$.
Regret: $R_T = \sum_{t=1}^{T} \ell_t(x_t, a_t) - \sum_{t=1}^{T} \ell_t(x_t^{\pi^*}, \pi^*(x_t^{\pi^*}))$.
Aim for low regret: R_T / T → 0.
Computationally efficient low-regret strategies?
Regret Bound

Main Result. There is a strategy that achieves
$$\mathbb{E}[R_T] \;\le\; (4 + 2\tau^2)\Bigl(\sqrt{T\log|\Pi|} + \log|\Pi|\Bigr)$$
(under a τ-mixing assumption).
Exponential Weights

Strategy for a repeated game: choose action a ∈ A with probability proportional to exp(−η × (total loss a has incurred so far)).

Regret (total loss versus best in hindsight) over T rounds: $O\bigl(\sqrt{T\log|A|}\bigr)$.

Long history. Unreasonably broadly applicable:
- Zero-sum games.
- Shortest path problems.
- AdaBoost.
- Fast max-flow.
- Bandit problems.
- Fast graph sparsification.
- Linear programming.
- Model of evolution.
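For concreteness, a minimal sketch (not from the talk) of exponential weights for a repeated game with a finite action set and full information; η = √(2 log|A| / T) is a standard tuning behind the O(√(T log|A|)) regret guarantee.

```python
# A minimal sketch of exponential weights (Hedge) for a repeated game:
# play action a with probability proportional to exp(-eta * cumulative loss of a).
import numpy as np

def exponential_weights(loss_matrix, eta):
    """loss_matrix[t, a] = loss of action a at round t (revealed after playing)."""
    T, K = loss_matrix.shape
    cum_loss = np.zeros(K)
    rng = np.random.default_rng(0)
    total = 0.0
    for t in range(T):
        w = np.exp(-eta * (cum_loss - cum_loss.min()))  # shift for numerical stability
        p = w / w.sum()
        a = rng.choice(K, p=p)
        total += loss_matrix[t, a]
        cum_loss += loss_matrix[t]                      # full information: all losses observed
    return total, cum_loss.min()                        # learner's loss vs best action in hindsight

losses = np.random.default_rng(1).uniform(size=(1000, 5))
print(exponential_weights(losses, eta=np.sqrt(2 * np.log(5) / 1000)))
```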
Strategy

For all policies π ∈ Π, set w_{π,0} = 1. Let W_t = Σ_{π∈Π} w_{π,t} and p_{π,t} = w_{π,t−1} / W_{t−1}.
for t := 1, 2, . . . do
    With probability β_t = w_{π_{t−1},t−1} / w_{π_{t−1},t−2}, set π_t = π_{t−1}; otherwise draw π_t ∼ p_{·,t}.
    Choose action a_t ∼ π_t(· | x_t).
    Observe dynamics P_t and loss ℓ_t. Suffer ℓ_t(x_t, a_t).
    For all policies π, update w_{π,t} = w_{π,t−1} exp(−η E[ℓ_t(x_t^π, π)]).
end for

Exponential weights on Π. Rare, random changes to π_t.
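A minimal sketch of this strategy under simplifying assumptions: tiny state and action spaces, the comparison class Π of all deterministic policies, and randomly generated (stand-ins for adversarial) dynamics and losses. Each comparator's expected loss E[ℓ_t(x_t^π, π)] is computed by propagating its state distribution through the revealed P_1, . . . , P_t.

```python
# A minimal sketch of the exponential-weights strategy above, including the
# "rare, random changes" (lazy policy switching) rule, on a tiny changing MDP.
import itertools
import numpy as np

nX, nA, T, eta = 3, 2, 200, 0.3
rng = np.random.default_rng(3)

# Comparison class Pi: all deterministic policies pi: X -> A.
Pi = list(itertools.product(range(nA), repeat=nX))

# Dynamics P_t[x, a] in Delta(X) and losses ell_t(x, a), revealed after round t.
P_seq = rng.dirichlet(np.ones(nX), size=(T, nX, nA))
L_seq = rng.uniform(size=(T, nX, nA))

log_w = np.zeros(len(Pi))                        # log-weights of policies
dists = np.tile(np.eye(nX)[0], (len(Pi), 1))     # each comparator's state distribution (start at x = 0)
prev_log_w = log_w.copy()
x, pi_idx = 0, rng.integers(len(Pi))
learner_loss = 0.0

for t in range(T):
    # Lazy switching: keep pi_{t-1} w.p. beta_t = w_{pi,t-1}/w_{pi,t-2}, else resample from p_{.,t}.
    if t > 0:
        beta = np.exp(log_w[pi_idx] - prev_log_w[pi_idx])
        if rng.uniform() > beta:
            p = np.exp(log_w - log_w.max())
            pi_idx = rng.choice(len(Pi), p=p / p.sum())
    a = Pi[pi_idx][x]                            # deterministic policies: a_t = pi_t(x_t)
    learner_loss += L_seq[t, x, a]

    # Dynamics and loss revealed: update every comparator's expected loss and state distribution.
    prev_log_w = log_w.copy()
    for i, pi in enumerate(Pi):
        exp_loss = sum(dists[i][s] * L_seq[t, s, pi[s]] for s in range(nX))
        log_w[i] -= eta * exp_loss
        dists[i] = sum(dists[i][s] * P_seq[t, s, pi[s]] for s in range(nX))
    x = rng.choice(nX, p=P_seq[t, x, a])         # learner's own state evolves under P_t

best = -log_w.max() / eta                        # cumulative expected loss of the best comparator
print("learner loss:", learner_loss, "best comparator (expected):", best)
```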
Regret Bound

Main Result. There is a strategy that achieves
$$\mathbb{E}[R_T] \;\le\; (4 + 2\tau^2)\Bigl(\sqrt{T\log|\Pi|} + \log|\Pi|\Bigr)$$
(under a τ-mixing assumption).

- Adversarial dynamics and loss functions.
- Large state and action spaces.
- E[R_T] / T → 0; T = ω(log|Π|) suffices.
- Computationally efficient as long as |Π| is polynomial.
- No computationally efficient algorithm in general.
Shortest Path Problem

Special case of MDP: node = state; action = edge; loss = weight.

[Map images omitted: http://www.google.com/, http://www.meondirect.com/]
Computational Efficiency

Hardness Result. Suppose there is a strategy for the online adversarial shortest path problem that
1. runs in time poly(n, T), and
2. has regret R_T = O(poly(n) T^{1−δ}) for some constant δ > 0.
Then there is an efficient algorithm for online agnostic parity learning with sublinear regret.
Online Agnostic Parity Learning

Class of parity functions on {0,1}^n: $\mathrm{PARITIES} = \{\mathrm{PAR}_S \mid S \subset [n],\ \mathrm{PAR}_S(x) = \oplus_{i\in S}\, x_i\}$.
Learning problem: given $x_t \in \{0,1\}^n$, the learner predicts $\hat y_t \in \{0,1\}$, observes the true label $y_t$, and suffers loss $\mathbb{I}\{\hat y_t \ne y_t\}$.
Regret: $R_T = \sum_{t=1}^{T} \mathbb{I}\{\hat y_t \ne y_t\} - \min_{\mathrm{PAR}_S \in \mathrm{PARITIES}} \sum_{t=1}^{T} \mathbb{I}\{\mathrm{PAR}_S(x_t) \ne y_t\}$.
Is there an efficient (time polynomial in n, T) learning algorithm with sublinear regret ($R_T = O(\mathrm{poly}(n)\, T^{1-\delta})$ for some δ > 0)?
Very well studied. Widely believed to be hard: used for cryptographic schemes.
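A tiny sketch (hypothetical data) of the parity class itself: PAR_S(x) is the XOR of the bits of x indexed by S, and the comparator term in R_T counts how often a fixed PAR_S disagrees with the labels in hindsight.

```python
# A minimal sketch of the parity class PAR_S(x) = xor_{i in S} x_i and the
# hindsight mistake count of one parity on a short (made-up) label sequence.
from functools import reduce

def parity(S, x):
    """PAR_S(x): XOR of the bits of x indexed by the set S."""
    return reduce(lambda acc, i: acc ^ x[i], S, 0)

xs = [(1, 0, 1, 0, 1), (0, 1, 1, 0, 0), (1, 1, 0, 1, 0)]
ys = [0, 1, 1]
S = {0, 2}                                       # a hypothetical subset of [n]
mistakes = sum(parity(S, x) != y for x, y in zip(xs, ys))
print(mistakes)  # number of disagreements of PAR_S with the labels, in hindsight
```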
Reduction

[Diagram: the reduction from online agnostic parity learning to the online shortest path problem. The adversary's example (x, y), with e.g. x = (1, 0, 1, 0, 1) ∈ {0,1}^5, is converted (Conversion 1) into a layered graph with paired nodes 1a, 2a/2b, . . . , 6a/6b and a final node Z, with edges labelled y and 1 − y; the path returned by the shortest path algorithm is converted (Conversion 2) into a prediction ŷ_t = 0 or ŷ_t = 1.]
Online Shortest Path: Hard versus Easy

Edges (dynamics) | Weights (costs) | Result
Adversarial      | Adversarial     | As hard as noisy parity.
Stochastic       | Adversarial     | Efficient algorithm.
Adversarial      | Stochastic      | Efficient algorithm.
Sequential Decision Problems: Key Ideas

Scaling back our ambitions: performance guarantees relative to a comparison class.
Two directions for Markov Decision Processes (MDPs):

1. Large-Scale Policy Design
Compete with a restricted family of policies Π: linearly parameterized policies; stochastic gradient convex optimization.

2. Learning Changing Dynamics
Compete with a restricted family of policies Π: exponential weights strategy; computationally efficient if Π has polynomial size; hard for shortest path problems.
Acknowledgements: ARC, NSF.

Collaborators: Alan Malek, Yasin Abbasi-Yadkori, Varun Kanade, Yevgeny Seldin, Xi Chen, Csaba Szepesvari.