Learning in Sequential Decision Problems

Peter Bartlett
Mathematical Sciences, Queensland University of Technology
Computer Science and Statistics, University of California at Berkeley

AusDM, November 28, 2014
A Decision Problem

Order news stories by popularity. Multiclass classification.
Incur loss when a low-ranked story is chosen.
Aim to minimize expected loss.
Decision Problems

Decision problem | Approach                | Aim
Finite decision  | estimate expected loss  | close to optimal
A Decision Problem: Classification

Aim for small relative expected loss: close to the best in some comparison class of prediction rules.
Decision Problems

Decision problem | Approach                   | Aim
Finite decision  | estimate expected loss     | close to optimal
Classification   | e.g., logistic regression  | close to best linear classifier
A Decision Problem: Bandit Classification

But we do not get complete information: what would the user do if we offered a different recommendation?
Sequential decision problem: current decisions affect later performance.
Exploration versus exploitation.
Decision Problems

Decision problem  | Approach                            | Aim
Finite decision   | estimate expected loss              | close to optimal
Classification    | e.g., logistic regression           | close to best linear classifier
Contextual bandit | e.g., reduction to classification   | close to best in set of classifiers
A Decision Problem: Maximizing Value

State of user evolves.
Sequential decision problem: current decisions affect our knowledge and the subsequent state.
Exploration versus exploitation.
Decision Problems

Decision problem        | Approach                            | Aim
Finite decision         | estimate expected loss              | close to optimal
Classification          | e.g., logistic regression           | close to best linear classifier
Contextual bandit       | e.g., reduction to classification   | close to best in set of classifiers
Markov decision process |                                     | close to best in set of policies
Sequential Decision Problems: Key Ideas

Scaling back our ambitions: performance guarantees relative to a comparison class.
Two directions for Markov Decision Processes (MDPs):
1. Large-scale policy design for MDPs.
2. Learning changing MDPs with full information.
Markov Decision Processes

MDP: Managing Threatened Species
For t = 1, 2, . . .:
1. See state X_t (of ecosystem).
2. Play an action A_t (intervention: anti-poaching patrols).
3. Incur loss ℓ(X_t, A_t) ($, extinction).
4. State evolves to X_{t+1} ∼ P_{X_t, A_t}.

MDP: Web Customer Interactions
For t = 1, 2, . . .:
1. See state X_t (of customer interaction).
2. Play an action A_t (offer/advertisement).
3. Incur loss ℓ(X_t, A_t) (missed revenue).

Transition matrix: P : X × A → ∆(X). Policy: π : X → ∆(A).
Stationary distribution: µ_π. Average loss: µ_π^⊤ ℓ.

Performance measure: regret,
$$R_T = \mathbb{E}\sum_{t=1}^{T} \ell(X_t, A_t) \;-\; \min_{\pi}\, \mathbb{E}\sum_{t=1}^{T} \ell(X_t^{\pi}, \pi(X_t^{\pi})),$$
or excess average loss,
$$T\Bigl(\mu_{\pi}^{\top}\ell - \min_{\pi'} \mu_{\pi'}^{\top}\ell\Bigr).$$
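To make the average-loss performance measure concrete, here is a minimal sketch (not from the talk; the transition kernel, loss, and policy below are made up) that computes the stationary distribution of a fixed policy in a tiny MDP and its average loss µ_π^⊤ℓ.

```python
# A minimal sketch: stationary distribution and average loss of a fixed policy
# in a small, made-up MDP.
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# Hypothetical transition kernel P[x, a] in Delta(X) and loss ell(x, a).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
ell = rng.uniform(size=(n_states, n_actions))

# A fixed stochastic policy pi(a | x).
pi = rng.dirichlet(np.ones(n_actions), size=n_states)

# State transition matrix induced by pi: P_pi[x, x'] = sum_a pi(a|x) P[x, a, x'].
P_pi = np.einsum("xa,xay->xy", pi, P)

# Stationary distribution over states: left eigenvector of P_pi with eigenvalue 1.
evals, evecs = np.linalg.eig(P_pi.T)
mu_x = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
mu_x /= mu_x.sum()

# Stationary state-action distribution and average loss mu^T ell.
mu = mu_x[:, None] * pi
print("average loss:", np.sum(mu * ell))
```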
Large-Scale Sequential Decision Problems

Large MDP problems: when the state space X is large, we must scale back the ambition of optimal performance:
- In comparison to a restricted family of policies Π, e.g., linear value function approximation.
- Want a strategy that competes with the best policy in Π.
Outline

1. Large-Scale Policy Design
Compete with a restricted family of policies Π:
- Linearly parameterized policies.
- Stochastic gradient convex optimization.
- Competitive with policies near the approximating class.
- Without knowledge of optimal policy.
- Simulation results: queueing, crowdsourcing.

2. Learning Changing Dynamics
- Changing MDP; complete information.
- Exponential weights strategy.
- Competitive with small comparison class Π.
- Computationally efficient if Π has polynomial size.
- Hard for shortest path problems.
Linear Subspace of Stationary Distributions
Large-scale policy design (with Yasin Abbasi-Yadkori and Alan Malek, ICML 2014)

Stationary distributions are dual to value functions.
Consider a class of policies defined by a feature matrix Φ, a distribution µ_0, and parameters θ:
$$\pi_\theta(a \mid x) = \frac{[\mu_0(x,a) + \Phi_{(x,a),:}\,\theta]_+}{\sum_{a'} [\mu_0(x,a') + \Phi_{(x,a'),:}\,\theta]_+}.$$
Let µ_θ denote the stationary distribution of policy π_θ.
Find $\hat\theta$ such that $\mu_{\hat\theta}^{\top}\ell \le \min_{\theta\in\Theta} \mu_{\theta}^{\top}\ell + \epsilon$.
Large-scale policy design: independent of the size of X.
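As a concrete illustration of this policy class, here is a minimal sketch (toy sizes; Φ, µ_0, and θ are random placeholders) of how π_θ(a | x) can be computed from the features.

```python
# A minimal sketch of the policy parameterization pi_theta(a | x):
# positive part of mu0 + Phi theta, renormalized over the actions at state x.
import numpy as np

n_states, n_actions, d = 4, 3, 5
rng = np.random.default_rng(1)

Phi = rng.normal(size=(n_states * n_actions, d))    # feature matrix, one row per (x, a)
mu0 = rng.dirichlet(np.ones(n_states * n_actions))  # base distribution over (x, a)
theta = rng.normal(scale=0.1, size=d)

def pi_theta(x):
    """pi_theta(. | x) for rows ordered as (x, a) with index x * n_actions + a."""
    rows = [x * n_actions + a for a in range(n_actions)]
    scores = np.maximum(mu0[rows] + Phi[rows] @ theta, 0.0)
    if scores.sum() == 0.0:           # degenerate case: fall back to uniform
        return np.full(n_actions, 1.0 / n_actions)
    return scores / scores.sum()

print(pi_theta(x=2))
```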
Approach: a Reduction to Convex Optimization

Define a constraint violation function
$$V(\theta) = \underbrace{\bigl\|[\mu_0 + \Phi\theta]_-\bigr\|_1}_{\text{prob. dist.}} + \underbrace{\bigl\|(P - B)^{\top}(\mu_0 + \Phi\theta)\bigr\|_1}_{\text{stationarity}}$$
and consider the convex cost function
$$c(\theta) = \ell^{\top}(\mu_0 + \Phi\theta) + \alpha V(\theta).$$
Stochastic gradient: $\theta_{t+1} = \theta_t - \eta\, g_t(\theta_t)$, $\hat\theta_T = \frac{1}{T}\sum_{t=1}^{T}\theta_t$,
. . . with cheap, unbiased stochastic subgradient estimates:
$$g_t(\theta) = \ell^{\top}\Phi \;-\; \alpha\,\frac{\Phi_{(x_t,a_t),:}}{q_1(x_t,a_t)}\,\mathbb{I}\bigl\{\mu_0(x_t,a_t) + \Phi_{(x_t,a_t),:}\theta < 0\bigr\} \;+\; \alpha\,\frac{(P-B)^{\top}_{:,x'_t}\Phi}{q_2(x'_t)}\,\mathrm{sign}\bigl((P-B)^{\top}_{:,x'_t}\Phi\theta\bigr).$$
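A minimal sketch of the reduction, under simplifying assumptions: a tiny tabular MDP where P, B, Φ, and µ_0 fit in memory, so exact subgradients of c(θ) replace the sampled estimates g_t used in the large-scale setting.

```python
# A minimal sketch (not the paper's implementation): subgradient descent on the
# surrogate cost c(theta) = ell^T(mu0 + Phi theta) + alpha * V(theta) for a small,
# made-up MDP, using exact subgradients rather than the sampled estimates g_t.
import numpy as np

nX, nA, d = 5, 2, 4
rng = np.random.default_rng(2)

P = rng.dirichlet(np.ones(nX), size=(nX * nA,))   # transition matrix, rows indexed by (x, a)
B = np.repeat(np.eye(nX), nA, axis=0)             # B[(x, a), x'] = I{x = x'}
ell = rng.uniform(size=nX * nA)                   # losses
Phi = rng.normal(size=(nX * nA, d))               # features
mu0 = np.full(nX * nA, 1.0 / (nX * nA))           # base distribution
alpha, eta, T = 10.0, 0.01, 2000

def subgrad_c(theta):
    v = mu0 + Phi @ theta
    g = ell @ Phi                                 # gradient of ell^T (mu0 + Phi theta)
    g += alpha * (-(v < 0).astype(float)) @ Phi   # subgradient of ||[v]_-||_1
    g += alpha * np.sign((P - B).T @ v) @ ((P - B).T @ Phi)   # subgradient of ||(P-B)^T v||_1
    return g

theta = np.zeros(d)
avg = np.zeros(d)
for t in range(1, T + 1):
    theta = theta - eta * subgrad_c(theta)
    avg += (theta - avg) / t                      # running average theta_hat_T
print("theta_hat:", avg)
```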
Performance Bounds

Main Result. For $T = 1/\epsilon^4$ gradient estimates, with high probability (under a mixing assumption),
$$\mu_{\hat\theta_T}^{\top}\ell \;\le\; \min_{\theta\in\Theta}\Bigl(\mu_{\theta}^{\top}\ell + \frac{V(\theta)}{\epsilon}\Bigr) + O(\epsilon).$$

- Competitive with all policies (stationary distributions) in the linear subspace (i.e., V(θ) = 0).
- Competitive with other policies; the comparison is more favorable near some stationary distribution.
- Previous results of this kind require knowledge about the optimal policy, or require that the comparison class Π contains a near-optimal policy.
Simulation Results: Queueing
(Rybko and Stolyar, 1992; de Farias and Van Roy, 2003a)

[Simulation figure omitted; image: http://alzatex.org/]
Outline

1. Large-Scale Policy Design
Compete with a restricted family of policies Π:
- Linearly parameterized approximate stationary distributions.
- Linearly parameterized, exponentially transformed value function.
- Stochastic gradient convex optimization.
- Competitive with policies in the approximating class.
- Simulation results: crowdsourcing.
Total Cost, Kullback-Leibler Penalty
Large-scale policy design (with Yasin Abbasi-Yadkori, Xi Chen and Alan Malek)

Total cost (assume the chain almost surely hits an absorbing state with zero loss):
$$\mathbb{E}\sum_{t=1}^{\infty} \ell(X_t).$$
- Parameterized value functions.
- Convex cost function.
- Stochastic gradient.
Simulation Results: Crowdsourcing

- Classification task.
- Crowdsource: $ for labels.
- Fixed budget; minimize errors.
- Bayesian model: binary labels, i.i.d. crowd; Y_i ∼ Bernoulli(p_i); p_i ∼ Beta.
- State = posterior.

[Images omitted: http://www.technicalinfo.net/, http://cdns2.freepik.com/]
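A minimal sketch of the Bayesian state described above (toy example, not the paper's implementation): the state for one item is the Beta posterior over p_i, updated as crowd labels arrive.

```python
# A minimal sketch of the Beta-Bernoulli "state" used in the crowdsourcing MDP:
# the posterior over p_i for one item, summarized by Beta pseudo-counts.
from dataclasses import dataclass

@dataclass
class ItemState:
    alpha: float = 1.0   # prior Beta(1, 1) pseudo-counts of label 1
    beta: float = 1.0    # pseudo-counts of label 0

    def update(self, label: int) -> None:
        """Posterior update after observing one crowd label Y ~ Bernoulli(p_i)."""
        if label == 1:
            self.alpha += 1
        else:
            self.beta += 1

    def prob_label_one(self) -> float:
        """Posterior mean of p_i; thresholding at 1/2 gives the current prediction."""
        return self.alpha / (self.alpha + self.beta)

state = ItemState()
for y in [1, 1, 0, 1]:         # labels purchased from the crowd for one item
    state.update(y)
print(state.prob_label_one())  # ~0.667 -> predict label 1 if the budget runs out now
```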
Outline

1. Large-Scale Policy Design

2. Learning Changing Dynamics
- Changing MDP; complete information.
- Exponential weights strategy.
- Competitive with small comparison class Π.
- Computationally efficient if Π has polynomial size.
- Hard for shortest path problems.
Changing Dynamics
(with Yasin Abbasi-Yadkori, Varun Kanade, Yevgeny Seldin, Csaba Szepesvari, NIPS 2013)

Full information: observe P_t, ℓ_t after round t. Arbitrary, even adversarial.
Consider a comparison class: Π ⊂ {π | π : X → A}.
Optimal policy: $\pi^* = \arg\min_{\pi\in\Pi} \sum_{t=1}^{T} \ell_t(x_t^{\pi}, \pi(x_t^{\pi}))$.
Regret: $R_T = \sum_{t=1}^{T} \ell_t(x_t, a_t) - \sum_{t=1}^{T} \ell_t(x_t^{\pi^*}, \pi^*(x_t^{\pi^*}))$.
Aim for low regret: R_T / T → 0.
Computationally efficient low-regret strategies?
Regret Bound

Main Result. There is a strategy that achieves
$$\mathbb{E}[R_T] \;\le\; (4 + 2\tau^2)\Bigl(\sqrt{T\log|\Pi|} + \log|\Pi|\Bigr)$$
(under a τ-mixing assumption).
Exponential Weights

Strategy for a repeated game: choose action a ∈ A with probability proportional to exp(−η × (total loss a has incurred so far)).

Regret (total loss versus best in hindsight) over T rounds: $O\bigl(\sqrt{T\log|A|}\bigr)$.

Long history. Unreasonably broadly applicable:
- Zero-sum games.
- Shortest path problems.
- AdaBoost.
- Fast max-flow.
- Bandit problems.
- Fast graph sparsification.
- Linear programming.
- Model of evolution.
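For concreteness, a minimal sketch (not from the talk) of exponential weights for a repeated game with a finite action set and full information; η = √(2 log|A| / T) is a standard tuning behind the O(√(T log|A|)) regret guarantee.

```python
# A minimal sketch of exponential weights (Hedge) for a repeated game:
# play action a with probability proportional to exp(-eta * cumulative loss of a).
import numpy as np

def exponential_weights(loss_matrix, eta):
    """loss_matrix[t, a] = loss of action a at round t (revealed after playing)."""
    T, K = loss_matrix.shape
    cum_loss = np.zeros(K)
    rng = np.random.default_rng(0)
    total = 0.0
    for t in range(T):
        w = np.exp(-eta * (cum_loss - cum_loss.min()))  # shift for numerical stability
        p = w / w.sum()
        a = rng.choice(K, p=p)
        total += loss_matrix[t, a]
        cum_loss += loss_matrix[t]                      # full information: all losses observed
    return total, cum_loss.min()                        # learner's loss vs best action in hindsight

losses = np.random.default_rng(1).uniform(size=(1000, 5))
print(exponential_weights(losses, eta=np.sqrt(2 * np.log(5) / 1000)))
```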
Strategy

For all policies π ∈ Π, set w_{π,0} = 1. Let W_t = Σ_{π∈Π} w_{π,t} and p_{π,t} = w_{π,t−1} / W_{t−1}.
for t := 1, 2, . . . do
    With probability β_t = w_{π_{t−1},t−1} / w_{π_{t−1},t−2}, set π_t = π_{t−1}; otherwise draw π_t ∼ p_{·,t}.
    Choose action a_t ∼ π_t(· | x_t).
    Observe dynamics P_t and loss ℓ_t. Suffer ℓ_t(x_t, a_t).
    For all policies π, update w_{π,t} = w_{π,t−1} exp(−η E[ℓ_t(x_t^π, π)]).
end for

Exponential weights on Π. Rare, random changes to π_t.
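A minimal sketch of this strategy under simplifying assumptions: tiny state and action spaces, the comparison class Π of all deterministic policies, and randomly generated (stand-ins for adversarial) dynamics and losses. Each comparator's expected loss E[ℓ_t(x_t^π, π)] is computed by propagating its state distribution through the revealed P_1, . . . , P_t.

```python
# A minimal sketch of the exponential-weights strategy above, including the
# "rare, random changes" (lazy policy switching) rule, on a tiny changing MDP.
import itertools
import numpy as np

nX, nA, T, eta = 3, 2, 200, 0.3
rng = np.random.default_rng(3)

# Comparison class Pi: all deterministic policies pi: X -> A.
Pi = list(itertools.product(range(nA), repeat=nX))

# Dynamics P_t[x, a] in Delta(X) and losses ell_t(x, a), revealed after round t.
P_seq = rng.dirichlet(np.ones(nX), size=(T, nX, nA))
L_seq = rng.uniform(size=(T, nX, nA))

log_w = np.zeros(len(Pi))                        # log-weights of policies
dists = np.tile(np.eye(nX)[0], (len(Pi), 1))     # each comparator's state distribution (start at x = 0)
prev_log_w = log_w.copy()
x, pi_idx = 0, rng.integers(len(Pi))
learner_loss = 0.0

for t in range(T):
    # Lazy switching: keep pi_{t-1} w.p. beta_t = w_{pi,t-1}/w_{pi,t-2}, else resample from p_{.,t}.
    if t > 0:
        beta = np.exp(log_w[pi_idx] - prev_log_w[pi_idx])
        if rng.uniform() > beta:
            p = np.exp(log_w - log_w.max())
            pi_idx = rng.choice(len(Pi), p=p / p.sum())
    a = Pi[pi_idx][x]                            # deterministic policies: a_t = pi_t(x_t)
    learner_loss += L_seq[t, x, a]

    # Dynamics and loss revealed: update every comparator's expected loss and state distribution.
    prev_log_w = log_w.copy()
    for i, pi in enumerate(Pi):
        exp_loss = sum(dists[i][s] * L_seq[t, s, pi[s]] for s in range(nX))
        log_w[i] -= eta * exp_loss
        dists[i] = sum(dists[i][s] * P_seq[t, s, pi[s]] for s in range(nX))
    x = rng.choice(nX, p=P_seq[t, x, a])         # learner's own state evolves under P_t

best = -log_w.max() / eta                        # cumulative expected loss of the best comparator
print("learner loss:", learner_loss, "best comparator (expected):", best)
```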
Regret Bound

Main Result. There is a strategy that achieves
$$\mathbb{E}[R_T] \;\le\; (4 + 2\tau^2)\Bigl(\sqrt{T\log|\Pi|} + \log|\Pi|\Bigr)$$
(under a τ-mixing assumption).

- Adversarial dynamics and loss functions.
- Large state and action spaces.
- E[R_T] / T → 0; T = ω(log|Π|) suffices.
- Computationally efficient as long as |Π| is polynomial.
- No computationally efficient algorithm in general.
Shortest Path Problem

Special case of MDP: node = state; action = edge; loss = weight.

[Map images omitted: http://www.google.com/, http://www.meondirect.com/]
Computational Efficiency

Hardness Result. Suppose there is a strategy for the online adversarial shortest path problem that
1. runs in time poly(n, T), and
2. has regret R_T = O(poly(n) T^{1−δ}) for some constant δ > 0.
Then there is an efficient algorithm for online agnostic parity learning with sublinear regret.
Online Agnostic Parity Learning

Class of parity functions on {0,1}^n: $\mathrm{PARITIES} = \{\mathrm{PAR}_S \mid S \subset [n],\ \mathrm{PAR}_S(x) = \oplus_{i\in S}\, x_i\}$.
Learning problem: given $x_t \in \{0,1\}^n$, the learner predicts $\hat y_t \in \{0,1\}$, observes the true label $y_t$, and suffers loss $\mathbb{I}\{\hat y_t \ne y_t\}$.
Regret: $R_T = \sum_{t=1}^{T} \mathbb{I}\{\hat y_t \ne y_t\} - \min_{\mathrm{PAR}_S \in \mathrm{PARITIES}} \sum_{t=1}^{T} \mathbb{I}\{\mathrm{PAR}_S(x_t) \ne y_t\}$.
Is there an efficient (time polynomial in n, T) learning algorithm with sublinear regret ($R_T = O(\mathrm{poly}(n)\, T^{1-\delta})$ for some δ > 0)?
Very well studied. Widely believed to be hard: used for cryptographic schemes.
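A tiny sketch (hypothetical data) of the parity class itself: PAR_S(x) is the XOR of the bits of x indexed by S, and the comparator term in R_T counts how often a fixed PAR_S disagrees with the labels in hindsight.

```python
# A minimal sketch of the parity class PAR_S(x) = xor_{i in S} x_i and the
# hindsight mistake count of one parity on a short (made-up) label sequence.
from functools import reduce

def parity(S, x):
    """PAR_S(x): XOR of the bits of x indexed by the set S."""
    return reduce(lambda acc, i: acc ^ x[i], S, 0)

xs = [(1, 0, 1, 0, 1), (0, 1, 1, 0, 0), (1, 1, 0, 1, 0)]
ys = [0, 1, 1]
S = {0, 2}                                       # a hypothetical subset of [n]
mistakes = sum(parity(S, x) != y for x, y in zip(xs, ys))
print(mistakes)  # number of disagreements of PAR_S with the labels, in hindsight
```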
Reduction

[Diagram: the reduction from online agnostic parity learning to the online shortest path problem. The adversary's example (x, y), with e.g. x = (1, 0, 1, 0, 1) ∈ {0,1}^5, is converted (Conversion 1) into a layered graph with paired nodes 1a, 2a/2b, . . . , 6a/6b and a final node Z, with edges labelled y and 1 − y; the path returned by the shortest path algorithm is converted (Conversion 2) into a prediction ŷ_t = 0 or ŷ_t = 1.]
Online Shortest Path: Hard versus Easy

Edges (dynamics) | Weights (costs) | Result
Adversarial      | Adversarial     | As hard as noisy parity.
Stochastic       | Adversarial     | Efficient algorithm.
Adversarial      | Stochastic      | Efficient algorithm.
Sequential Decision Problems: Key Ideas

Scaling back our ambitions: performance guarantees relative to a comparison class.
Two directions for Markov Decision Processes (MDPs):

1. Large-Scale Policy Design
Compete with a restricted family of policies Π: linearly parameterized policies; stochastic gradient convex optimization.

2. Learning Changing Dynamics
Compete with a restricted family of policies Π: exponential weights strategy; computationally efficient if Π has polynomial size; hard for shortest path problems.
Acknowledgements: ARC, NSF.

Collaborators: Alan Malek, Yasin Abbasi-Yadkori, Varun Kanade, Yevgeny Seldin, Xi Chen, Csaba Szepesvari.