Introduction to Approximate Dynamic Programming
Dan Zhang
Leeds School of Business
University of Colorado at Boulder
Spring 2012
Key References
Bertsekas, D.P. 2011. Chapter 6: Approximate Dynamic Programming. Dynamic Programming and Optimal Control, 3rd Edition, Volume II. Available online at http://web.mit.edu/dimitrib/www/dpchapter.pdf.
Bertsekas, D.P., J.N. Tsitsiklis. 1996. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA.
Powell, W. 2007. Approximate Dynamic Programming: Solving the Curses of Dimensionality. Wiley-Interscience.
Outline
Setup: infinite-horizon discounted MDPs
Policy evaluation via Monte Carlo simulation
Q-learning
Linear programming based approximate dynamic programming
Approximate policy iteration
Rollout policies
State aggregation
Setup
Infinite-horizon discounted MDP
Transition probabilities $p_{ij}(u)$
Cost: $g(i, u, j)$ with $|g(i, u, j)| < \infty$
Discounting: $\alpha \in [0, 1)$
Discrete state space: $S$ is finite
Action space $U$
MDP Model

Let $\pi \in \Pi^{MD}$. The states visited under policy $\pi$ follow a Markov chain with transition probabilities $P^\pi = \{p_{ij}(\pi(i)) : i, j \in S\}$.

The total discounted cost is
$$J^\pi(i) = \lim_{T \to \infty} E\left[ \sum_{t=0}^{T} \alpha^t g\big(i_t^\pi, \pi(i_t^\pi), i_{t+1}^\pi\big) \right].$$

An optimal policy can be computed by solving the optimality equations
$$J(i) = \min_{u \in U(i)} \sum_{j \in S} p_{ij}(u) \left[ g(i, u, j) + \alpha J(j) \right].$$
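As a concrete illustration of this setup (not from the slides), here is a minimal Python/NumPy sketch of a hypothetical 3-state, 2-action MDP and one evaluation of the right-hand side of the optimality equation; all numbers are randomly generated placeholders.

```python
# A minimal sketch of the setup above on a hypothetical 3-state, 2-action MDP
# (all numbers are made up for illustration).
import numpy as np

S, U = 3, 2          # |S| states, |U| actions (every action allowed in every state)
alpha = 0.9          # discount factor

rng = np.random.default_rng(0)
# p[u, i, j] = p_ij(u): transition probabilities, each row normalized to sum to 1
p = rng.random((U, S, S))
p /= p.sum(axis=2, keepdims=True)
# g[u, i, j] = g(i, u, j): bounded one-stage costs
g = rng.random((U, S, S))

# One application of the right-hand side of the optimality equation to an
# arbitrary guess J: min_u sum_j p_ij(u) [g(i, u, j) + alpha J(j)]
J = np.zeros(S)
Q = np.einsum('uij,uij->ui', p, g) + alpha * np.einsum('uij,j->ui', p, J)
print(Q.min(axis=0))   # candidate new value for each state i
```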
Value Iteration
The optimality equation can be written as $J = TJ$, where
$$[TJ](i) = \min_{u \in U(i)} \sum_{j \in S} p_{ij}(u) \left[ g(i, u, j) + \alpha J(j) \right].$$

The value function $J^*$ is a fixed point of the operator $T$. Under suitable technical conditions, $J^* = \lim_{k \to \infty} T^k J^1$ for an initial vector $J^1$.
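A minimal value-iteration sketch of the operator $T$ above, again on a hypothetical randomly generated MDP; the tolerance and iteration limit are arbitrary choices.

```python
# Value iteration: repeatedly apply T until the sup-norm change is small.
import numpy as np

S, U, alpha = 3, 2, 0.9
rng = np.random.default_rng(0)
p = rng.random((U, S, S)); p /= p.sum(axis=2, keepdims=True)   # p_ij(u)
g = rng.random((U, S, S))                                      # g(i, u, j)
expected_g = np.einsum('uij,uij->ui', p, g)                    # sum_j p_ij(u) g(i, u, j)

def T(J):
    """Apply the DP operator: [TJ](i) = min_u sum_j p_ij(u) [g(i,u,j) + alpha J(j)]."""
    return (expected_g + alpha * np.einsum('uij,j->ui', p, J)).min(axis=0)

J = np.zeros(S)                       # initial vector J^1
for k in range(1000):
    J_next = T(J)
    if np.max(np.abs(J_next - J)) < 1e-8:   # sup-norm stopping test
        break
    J = J_next
print(J_next)                         # approximate fixed point J*
```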
Policy Iteration
Define
$$[T_\pi J](i) = \sum_{j \in S} p_{ij}(\pi(i)) \left[ g(i, \pi(i), j) + \alpha J(j) \right].$$

It can be shown that the discounted cost $J^\pi$ incurred by policy $\pi$ is a fixed point of the operator $T_\pi$; i.e., $J^\pi = T_\pi J^\pi$.
Policy Iteration (Continued)
Policy evaluation: Compute the discounted cost $J^{\pi^k}$ incurred by policy $\pi^k$, possibly by solving the system of linear equations $J^{\pi^k} = g^{\pi^k} + \alpha P^{\pi^k} J^{\pi^k}$.

Policy improvement: For all $i \in S$, define policy $\pi^{k+1}$ by
$$\pi^{k+1}(i) = \arg\min_{u \in U(i)} \sum_{j \in S} p_{ij}(u) \left[ g(i, u, j) + \alpha J^{\pi^k}(j) \right].$$

Stop if $\pi^k = \pi^{k+1}$.
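A minimal sketch of the two steps above on a hypothetical small MDP; the policy evaluation step solves the linear system directly, which is only practical when $S$ is small.

```python
# Policy iteration: exact evaluation (linear solve) + greedy improvement.
import numpy as np

S, U, alpha = 3, 2, 0.9
rng = np.random.default_rng(0)
p = rng.random((U, S, S)); p /= p.sum(axis=2, keepdims=True)   # p_ij(u)
g = rng.random((U, S, S))                                      # g(i, u, j)
expected_g = np.einsum('uij,uij->ui', p, g)                    # sum_j p_ij(u) g(i, u, j)

pi = np.zeros(S, dtype=int)           # initial policy
while True:
    # Policy evaluation: solve J = g^pi + alpha P^pi J as a linear system
    P_pi = p[pi, np.arange(S), :]                  # row i is p_{i.}(pi(i))
    g_pi = expected_g[pi, np.arange(S)]
    J_pi = np.linalg.solve(np.eye(S) - alpha * P_pi, g_pi)
    # Policy improvement: greedy with respect to J^pi
    Q = expected_g + alpha * np.einsum('uij,j->ui', p, J_pi)
    pi_next = Q.argmin(axis=0)
    if np.array_equal(pi_next, pi):    # stop when the policy is unchanged
        break
    pi = pi_next
print(pi, J_pi)
```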
Three Curses of Dimensionality (Powell, 2007)
State space is large: computing and storing the value function $v(\cdot)$ can be difficult for both the value iteration and policy iteration algorithms.
Action space is large: the minimization over actions in both value iteration and policy improvement becomes difficult.
Computing expectations with respect to the transition probability matrix can be difficult when the system dynamics are complex.
Policy Evaluation with Monte Carlo Simulation
Policy evaluation requires solving linear equations of the form $J^\pi = g^\pi + \alpha P^\pi J^\pi$. This is difficult when $P^\pi$ is large or unknown.

Idea: Simulate a sequence of states $\{i_0, i_1, \ldots\}$ by following a particular policy $\pi$. An approximation $J$ of $J^\pi$ can be updated as follows:
$$J(i_k) \leftarrow (1 - \gamma) J(i_k) + \gamma \left[ g(i_k, \pi(i_k), i_{k+1}) + \alpha J(i_{k+1}) \right].$$
Why does this make sense?
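A sketch of the simulated update above on a hypothetical small MDP. The fixed policy, the constant step size $\gamma$, and the run length are illustrative choices; in theory a diminishing step size is needed for convergence.

```python
# Simulation-based policy evaluation: run one long trajectory under pi and
# apply the update J(i_k) <- (1-gamma) J(i_k) + gamma [g + alpha J(i_{k+1})].
import numpy as np

S, U, alpha = 3, 2, 0.9
rng = np.random.default_rng(0)
p = rng.random((U, S, S)); p /= p.sum(axis=2, keepdims=True)   # p_ij(u)
g = rng.random((U, S, S))                                      # g(i, u, j)
pi = np.array([0, 1, 0])              # some fixed policy to evaluate

J = np.zeros(S)                        # tabular approximation of J^pi
gamma = 0.05                           # constant step size (illustrative)
i = 0                                  # i_0
for k in range(100_000):
    u = pi[i]
    j = rng.choice(S, p=p[u, i])       # sample i_{k+1} ~ p_{i.}(pi(i))
    J[i] = (1 - gamma) * J[i] + gamma * (g[u, i, j] + alpha * J[j])
    i = j
print(J)
```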
Policy Evaluation with Monte Carlo Simulation (continued)
The procedure requires storing $J$, which can be problematic when the state space $S$ is large.

Idea: Use a parameterized approximation architecture
$$\tilde{J}(i, r) = \sum_{l=1}^{L} r_l \phi_l(i),$$
where the $\phi_l(\cdot)$'s are pre-specified functions ("feature functions") and $r$ is a vector of adjustable parameters.

How to update $r$?
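One standard way to update $r$ (not spelled out on the slide) is a temporal-difference-style gradient step on the parameters of the linear architecture. The sketch below illustrates this on a hypothetical MDP; the feature functions, policy, and step size are illustrative assumptions.

```python
# TD-style update of the parameter vector r for J~(i, r) = phi(i) . r.
import numpy as np

S, U, alpha, L = 3, 2, 0.9, 2
rng = np.random.default_rng(0)
p = rng.random((U, S, S)); p /= p.sum(axis=2, keepdims=True)   # p_ij(u)
g = rng.random((U, S, S))                                      # g(i, u, j)
pi = np.array([0, 1, 0])               # fixed policy to evaluate

phi = rng.random((S, L))               # phi[i, l] = phi_l(i), pre-specified features
r = np.zeros(L)                        # adjustable parameters
gamma = 0.01
i = 0
for k in range(100_000):
    u = pi[i]
    j = rng.choice(S, p=p[u, i])
    # temporal-difference error for the approximation J~(i, r) = phi(i) . r
    delta = g[u, i, j] + alpha * phi[j] @ r - phi[i] @ r
    r += gamma * delta * phi[i]        # move r to reduce the error at state i
    i = j
print(r, phi @ r)                      # parameters and implied value approximation
```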
Q-Learning
Q-learning is based on an alternative representation of the optimality equation:
$$J(i) = \min_{u \in U(i)} Q(i, u), \qquad Q(i, u) = \sum_{j \in S} p_{ij}(u) \left[ g(i, u, j) + \alpha J(j) \right].$$

What is the interpretation of $Q(i, u)$?
Q-Learning (Continued)
Key idea: Evaluate $Q(i, u)$ by using a stochastic approximation iteration:
$$Q(i_k, u_k) \leftarrow (1 - \gamma) Q(i_k, u_k) + \gamma \left[ g(i_k, u_k, s_k) + \alpha \min_{v \in U(s_k)} Q(s_k, v) \right],$$
where the successor state $s_k$ is sampled according to the probabilities $\{p_{i_k j}(u_k) : j \in S\}$.

What is the benefit of Q-learning?
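A minimal Q-learning sketch of the iteration above on a hypothetical MDP; uniform action sampling for exploration, a fixed step size, and the run length are illustrative choices.

```python
# Q-learning on a hypothetical 3-state, 2-action MDP.
import numpy as np

S, U, alpha = 3, 2, 0.9
rng = np.random.default_rng(0)
p = rng.random((U, S, S)); p /= p.sum(axis=2, keepdims=True)   # p_ij(u)
g = rng.random((U, S, S))                                      # g(i, u, j)

Q = np.zeros((S, U))
gamma = 0.05
i = 0
for k in range(200_000):
    u = rng.integers(U)                # sample actions uniformly (for exploration)
    s = rng.choice(S, p=p[u, i])       # successor state s_k ~ p_{i_k .}(u_k)
    # Q(i_k,u_k) <- (1-gamma) Q(i_k,u_k) + gamma [g(i_k,u_k,s_k) + alpha min_v Q(s_k,v)]
    Q[i, u] = (1 - gamma) * Q[i, u] + gamma * (g[u, i, s] + alpha * Q[s].min())
    i = s
print(Q.min(axis=1))                   # implied value function J(i) = min_u Q(i, u)
```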
Linear Programming Approach

Let $\theta(i)$ be positive scalars such that $\sum_{i \in S} \theta(i) = 1$. The linear programming formulation is given by
$$\max_J \; \sum_{i \in S} \theta(i) J(i)$$
$$\text{s.t.} \quad J(i) - \sum_{j \in S} \alpha p_{ij}(u) J(j) \le \sum_{j \in S} p_{ij}(u) g(i, u, j), \quad \forall i \in S,\; u \in U(i).$$

The dual linear program is given by
$$\min_x \; \sum_{i \in S} \sum_{u \in U(i)} \sum_{j \in S} p_{ij}(u) g(i, u, j) x(i, u)$$
$$\text{s.t.} \quad \sum_{u \in U(i)} x(i, u) - \sum_{j \in S} \sum_{u \in U(j)} \alpha p_{ji}(u) x(j, u) = \theta(i), \quad \forall i \in S,$$
$$x(i, u) \ge 0, \quad \forall i \in S,\; u \in U(i).$$
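A sketch of the primal LP above using SciPy's linprog on a hypothetical small MDP; the uniform weights $\theta(i)$ and the use of SciPy are illustrative assumptions.

```python
# Build and solve the primal LP: max theta.J subject to one constraint per (i, u).
import numpy as np
from scipy.optimize import linprog

S, U, alpha = 3, 2, 0.9
rng = np.random.default_rng(0)
p = rng.random((U, S, S)); p /= p.sum(axis=2, keepdims=True)   # p_ij(u)
g = rng.random((U, S, S))                                      # g(i, u, j)
theta = np.full(S, 1.0 / S)            # positive weights summing to 1

# Constraint for (i, u): J(i) - alpha sum_j p_ij(u) J(j) <= sum_j p_ij(u) g(i,u,j)
A_ub, b_ub = [], []
for i in range(S):
    for u in range(U):
        row = -alpha * p[u, i]
        row[i] += 1.0
        A_ub.append(row)
        b_ub.append(p[u, i] @ g[u, i])

# linprog minimizes, so maximize theta.J by minimizing -theta.J; J is unbounded in sign
res = linprog(c=-theta, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * S)
print(res.x)                           # optimal J (recovers J* since every theta(i) > 0)
```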
Linear Programming Approach (continued)
The size of the LP can be reduced by using a parameterized approximation architecture (Schweitzer and Seidmann, 1985):
$$\tilde{J}(i, r) = \sum_{l=1}^{L} r_l \phi_l(i).$$

Two solution approaches:
Simulation-based approach (de Farias and Van Roy, 2003; see also Powell, 2007)
Mathematical programming-based approach (Adelman, 2003; Adelman, 2007)
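A sketch, under the same hypothetical data, of the reduced LP obtained by substituting $\tilde{J}(i, r) = \sum_l r_l \phi_l(i)$ into the LP above. The decision variables shrink from $|S|$ to $L$, but the number of constraints is unchanged, which is what the simulation-based and mathematical-programming-based approaches address; the random features here are purely illustrative.

```python
# Approximate LP: the variables are the parameters r of the linear architecture.
import numpy as np
from scipy.optimize import linprog

S, U, alpha, L = 3, 2, 0.9, 2
rng = np.random.default_rng(0)
p = rng.random((U, S, S)); p /= p.sum(axis=2, keepdims=True)   # p_ij(u)
g = rng.random((U, S, S))                                      # g(i, u, j)
theta = np.full(S, 1.0 / S)
phi = rng.random((S, L))               # phi[i, l] = phi_l(i)

# Constraint for (i, u): phi(i).r - alpha sum_j p_ij(u) phi(j).r <= sum_j p_ij(u) g(i,u,j)
A_ub, b_ub = [], []
for i in range(S):
    for u in range(U):
        A_ub.append(phi[i] - alpha * p[u, i] @ phi)
        b_ub.append(p[u, i] @ g[u, i])

# Objective: maximize sum_i theta(i) phi(i).r, i.e. minimize -(phi^T theta).r
res = linprog(c=-(phi.T @ theta), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * L)
print(res.x, phi @ res.x)              # fitted r and the implied approximation J~
```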
Approximate Policy Iteration
The policy evaluation step can be difficult when the problem is "large".
Idea: Carry out policy iteration approximately.
Approximate Policy Iteration (Continued)

Approximate policy evaluation: Simulate a sequence of states $\{i_0, i_1, \ldots\}$ by following policy $\pi^k$. Let $C_l = \sum_{t=l}^{\infty} \alpha^{t-l} g(i_t, \pi^k(i_t), i_{t+1})$ for all $l = 0, 1, \ldots$. Let $r^k$ be the solution to the regression problem
$$\min_r \sum_{t=0}^{\infty} \left[ \tilde{J}(i_t, r) - C_t \right]^2 = \min_r \sum_{t=0}^{\infty} \left[ \sum_{l=1}^{L} r_l \phi_l(i_t) - C_t \right]^2.$$

Approximate policy improvement: For all $i \in S$, define policy $\pi^{k+1}$ by
$$\pi^{k+1}(i) = \arg\min_{u \in U(i)} \sum_{j \in S} p_{ij}(u) \left[ g(i, u, j) + \alpha \tilde{J}(j, r^k) \right].$$
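A sketch of one pass of the scheme above on a hypothetical small MDP: the infinite sums $C_l$ are truncated at the end of a long simulated trajectory, and the features, trajectory length, and starting policy are illustrative choices.

```python
# One approximate policy-iteration pass: simulate under pi^k, regress the
# observed discounted costs-to-go on the features, then improve greedily.
import numpy as np

S, U, alpha, L = 3, 2, 0.9, 2
rng = np.random.default_rng(0)
p = rng.random((U, S, S)); p /= p.sum(axis=2, keepdims=True)   # p_ij(u)
g = rng.random((U, S, S))                                      # g(i, u, j)
phi = rng.random((S, L))               # features phi_l(i)
pi = np.zeros(S, dtype=int)            # current policy pi^k

# Simulate a long trajectory under pi^k and record one-stage costs
T = 5000
states, costs = [0], []
for t in range(T):
    i = states[-1]; u = pi[i]
    j = rng.choice(S, p=p[u, i])
    costs.append(g[u, i, j]); states.append(j)

# C_t ~ sum_{s >= t} alpha^(s-t) g(...), computed backwards, truncated at T
C = np.zeros(T)
acc = 0.0
for t in reversed(range(T)):
    acc = costs[t] + alpha * acc
    C[t] = acc

# Least-squares fit: min_r sum_t [phi(i_t).r - C_t]^2
Phi = phi[np.array(states[:T])]
r, *_ = np.linalg.lstsq(Phi, C, rcond=None)

# Greedy improvement with respect to J~(., r)
J_tilde = phi @ r
Q = np.einsum('uij,uij->ui', p, g) + alpha * np.einsum('uij,j->ui', p, J_tilde)
pi_next = Q.argmin(axis=0)
print(r, pi_next)
```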
Rollout Policy
Idea: Improve the performance of a given policy. Given policy $\pi$, let $\pi'$ be defined such that for each $i \in S$,
$$\pi'(i) = \arg\min_{u \in U(i)} \sum_{j \in S} p_{ij}(u) \left[ g(i, u, j) + \alpha J^\pi(j) \right].$$

Then it can be shown that $\pi'$ is at least as good as $\pi$. Simulation may be used to estimate $J^\pi$.
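A minimal rollout sketch on a hypothetical small MDP: $J^\pi$ is estimated by truncated Monte Carlo simulation under the base policy, and $\pi'$ acts greedily against that estimate. The number of rollouts and the horizon are illustrative choices.

```python
# Rollout: estimate J^pi by simulation, then take the one-step greedy action.
import numpy as np

S, U, alpha = 3, 2, 0.9
rng = np.random.default_rng(0)
p = rng.random((U, S, S)); p /= p.sum(axis=2, keepdims=True)   # p_ij(u)
g = rng.random((U, S, S))                                      # g(i, u, j)
pi = np.zeros(S, dtype=int)            # base policy

def estimate_J_pi(j, n_rollouts=200, horizon=100):
    """Monte Carlo estimate of J^pi(j) from truncated simulated trajectories."""
    total = 0.0
    for _ in range(n_rollouts):
        i, disc = j, 1.0
        for _ in range(horizon):
            u = pi[i]
            nxt = rng.choice(S, p=p[u, i])
            total += disc * g[u, i, nxt]
            disc *= alpha
            i = nxt
    return total / n_rollouts

# pi'(i) = argmin_u sum_j p_ij(u) [g(i, u, j) + alpha J^pi(j)]
J_hat = np.array([estimate_J_pi(j) for j in range(S)])
Q = np.einsum('uij,uij->ui', p, g) + alpha * np.einsum('uij,j->ui', p, J_hat)
pi_rollout = Q.argmin(axis=0)
print(pi_rollout)
```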
State Aggregation
Idea: Partition the state space into a number of subsets and assume the value function is constant over each subset. State aggregation can be combined with other approximation methods.
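A minimal sketch of state aggregation viewed as a special case of the linear architecture above: the feature functions are indicators of a hypothetical partition of $S$, so $\tilde{J}(\cdot, r)$ is constant on each subset.

```python
# State aggregation as indicator features: one parameter per aggregate state.
import numpy as np

S = 6
partition = np.array([0, 0, 1, 1, 2, 2])     # state i belongs to cluster partition[i]
L = partition.max() + 1                      # number of aggregate states

# phi_l(i) = 1 if state i lies in cluster l, else 0
phi = np.zeros((S, L))
phi[np.arange(S), partition] = 1.0

r = np.array([10.0, 20.0, 30.0])             # one value per aggregate state (illustrative)
J_tilde = phi @ r                            # constant within each cluster
print(J_tilde)                               # [10. 10. 20. 20. 30. 30.]
```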
Discussion
ADP aims to alleviate the computational effort required to solve large-scale dynamic programs.
The area is still in its infancy; a commonly accepted definition of ADP does not seem to exist.
Open problem: How to specify the "feature functions"? Research to date usually assumes they are fixed in advance. For recent advances, see Klabjan and Adelman (2007).
Developing efficient solution approaches for practical applications could be valuable.