Introduction to Approximate Dynamic Programming
Dan Zhang
Leeds School of Business
University of Colorado at Boulder
Spring 2012
Key References
Bertsekas, D.P. 2011. Chapter 6: Approximate Dynamic Programming. Dynamic Programming and Optimal Control, 3rd Edition, Volume II. Available online at http://web.mit.edu/dimitrib/www/dpchapter.pdf.
Bertsekas, D.P., J.N. Tsitsiklis. 1996. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA.
Powell, W. 2007. Approximate Dynamic Programming: Solving the Curses of Dimensionality. Wiley-Interscience.
Outline
Setup: infinite-horizon discounted MDPs
Policy evaluation via Monte Carlo simulation
Q-learning
Linear programming based approximate dynamic programming
Approximate policy iteration
Rollout policies
State aggregation
Setup
Infinite-horizon discounted MDP
Transition probabilities $p_{ij}(u)$
Cost: $g(i, u, j)$ with $|g(i, u, j)| < \infty$
Discounting: $\alpha \in [0, 1)$
Discrete state space: $S$ is finite
Action space $U$
MDP Model

Let $\pi \in \Pi^{MD}$. The states visited under policy $\pi$ follow a Markov chain with transition probabilities $P^\pi = \{p_{ij}(\pi(i)) : i, j \in S\}$.

The total discounted cost is
$$J^\pi(i) = \lim_{T \to \infty} E\left[ \sum_{t=0}^{T} \alpha^t g\big(i_t^\pi, \pi(i_t^\pi), i_{t+1}^\pi\big) \right].$$

An optimal policy can be computed by solving the optimality equations
$$J(i) = \min_{u \in U(i)} \sum_{j \in S} p_{ij}(u) \left[ g(i, u, j) + \alpha J(j) \right].$$
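As a concrete illustration of this setup (not from the slides), here is a minimal Python/NumPy sketch of a hypothetical 3-state, 2-action MDP and one evaluation of the right-hand side of the optimality equation; all numbers are randomly generated placeholders.

```python
# A minimal sketch of the setup above on a hypothetical 3-state, 2-action MDP
# (all numbers are made up for illustration).
import numpy as np

S, U = 3, 2          # |S| states, |U| actions (every action allowed in every state)
alpha = 0.9          # discount factor

rng = np.random.default_rng(0)
# p[u, i, j] = p_ij(u): transition probabilities, each row normalized to sum to 1
p = rng.random((U, S, S))
p /= p.sum(axis=2, keepdims=True)
# g[u, i, j] = g(i, u, j): bounded one-stage costs
g = rng.random((U, S, S))

# One application of the right-hand side of the optimality equation to an
# arbitrary guess J: min_u sum_j p_ij(u) [g(i, u, j) + alpha J(j)]
J = np.zeros(S)
Q = np.einsum('uij,uij->ui', p, g) + alpha * np.einsum('uij,j->ui', p, J)
print(Q.min(axis=0))   # candidate new value for each state i
```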
Value Iteration
The optimality equation can be written as $J = TJ$, where
$$[TJ](i) = \min_{u \in U(i)} \sum_{j \in S} p_{ij}(u) \left[ g(i, u, j) + \alpha J(j) \right].$$

The value function $J^*$ is a fixed point of the operator $T$. Under suitable technical conditions, $J^* = \lim_{k \to \infty} T^k J^1$ for an initial vector $J^1$.
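A minimal value-iteration sketch of the operator $T$ above, again on a hypothetical randomly generated MDP; the tolerance and iteration limit are arbitrary choices.

```python
# Value iteration: repeatedly apply T until the sup-norm change is small.
import numpy as np

S, U, alpha = 3, 2, 0.9
rng = np.random.default_rng(0)
p = rng.random((U, S, S)); p /= p.sum(axis=2, keepdims=True)   # p_ij(u)
g = rng.random((U, S, S))                                      # g(i, u, j)
expected_g = np.einsum('uij,uij->ui', p, g)                    # sum_j p_ij(u) g(i, u, j)

def T(J):
    """Apply the DP operator: [TJ](i) = min_u sum_j p_ij(u) [g(i,u,j) + alpha J(j)]."""
    return (expected_g + alpha * np.einsum('uij,j->ui', p, J)).min(axis=0)

J = np.zeros(S)                       # initial vector J^1
for k in range(1000):
    J_next = T(J)
    if np.max(np.abs(J_next - J)) < 1e-8:   # sup-norm stopping test
        break
    J = J_next
print(J_next)                         # approximate fixed point J*
```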
Policy Iteration
Define
$$[T_\pi J](i) = \sum_{j \in S} p_{ij}(\pi(i)) \left[ g(i, \pi(i), j) + \alpha J(j) \right].$$

It can be shown that the discounted cost $J^\pi$ incurred by policy $\pi$ is a fixed point of the operator $T_\pi$; i.e., $J^\pi = T_\pi J^\pi$.
Policy Iteration (Continued)
Policy evaluation: Compute the discounted cost $J^{\pi^k}$ incurred by policy $\pi^k$, possibly by solving the system of linear equations $J^{\pi^k} = g^{\pi^k} + \alpha P^{\pi^k} J^{\pi^k}$.

Policy improvement: For all $i \in S$, define policy $\pi^{k+1}$ by
$$\pi^{k+1}(i) = \arg\min_{u \in U(i)} \sum_{j \in S} p_{ij}(u) \left[ g(i, u, j) + \alpha J^{\pi^k}(j) \right].$$

Stop if $\pi^k = \pi^{k+1}$.
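A minimal sketch of the two steps above on a hypothetical small MDP; the policy evaluation step solves the linear system directly, which is only practical when $S$ is small.

```python
# Policy iteration: exact evaluation (linear solve) + greedy improvement.
import numpy as np

S, U, alpha = 3, 2, 0.9
rng = np.random.default_rng(0)
p = rng.random((U, S, S)); p /= p.sum(axis=2, keepdims=True)   # p_ij(u)
g = rng.random((U, S, S))                                      # g(i, u, j)
expected_g = np.einsum('uij,uij->ui', p, g)                    # sum_j p_ij(u) g(i, u, j)

pi = np.zeros(S, dtype=int)           # initial policy
while True:
    # Policy evaluation: solve J = g^pi + alpha P^pi J as a linear system
    P_pi = p[pi, np.arange(S), :]                  # row i is p_{i.}(pi(i))
    g_pi = expected_g[pi, np.arange(S)]
    J_pi = np.linalg.solve(np.eye(S) - alpha * P_pi, g_pi)
    # Policy improvement: greedy with respect to J^pi
    Q = expected_g + alpha * np.einsum('uij,j->ui', p, J_pi)
    pi_next = Q.argmin(axis=0)
    if np.array_equal(pi_next, pi):    # stop when the policy is unchanged
        break
    pi = pi_next
print(pi, J_pi)
```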
Three Curses of Dimensionality (Powell, 2007)
State space is large: computing and storing the value function $v(\cdot)$ can be difficult for both the value iteration and policy iteration algorithms.
Action space is large: the minimization over actions in both value iteration and policy improvement becomes difficult.
Computing expectations with respect to the transition probability matrix can be difficult when the system dynamics are complex.
Policy Evaluation with Monte Carlo Simulation
Policy evaluation requires solving linear equations of the form $J^\pi = g^\pi + \alpha P^\pi J^\pi$. This is difficult when $P^\pi$ is large or unknown.

Idea: Simulate a sequence of states $\{i_0, i_1, \ldots\}$ by following a particular policy $\pi$. An approximation $J$ of $J^\pi$ can be updated as follows:
$$J(i_k) \leftarrow (1 - \gamma) J(i_k) + \gamma \left[ g(i_k, \pi(i_k), i_{k+1}) + \alpha J(i_{k+1}) \right].$$
Why does this make sense?
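A sketch of the simulated update above on a hypothetical small MDP. The fixed policy, the constant step size $\gamma$, and the run length are illustrative choices; in theory a diminishing step size is needed for convergence.

```python
# Simulation-based policy evaluation: run one long trajectory under pi and
# apply the update J(i_k) <- (1-gamma) J(i_k) + gamma [g + alpha J(i_{k+1})].
import numpy as np

S, U, alpha = 3, 2, 0.9
rng = np.random.default_rng(0)
p = rng.random((U, S, S)); p /= p.sum(axis=2, keepdims=True)   # p_ij(u)
g = rng.random((U, S, S))                                      # g(i, u, j)
pi = np.array([0, 1, 0])              # some fixed policy to evaluate

J = np.zeros(S)                        # tabular approximation of J^pi
gamma = 0.05                           # constant step size (illustrative)
i = 0                                  # i_0
for k in range(100_000):
    u = pi[i]
    j = rng.choice(S, p=p[u, i])       # sample i_{k+1} ~ p_{i.}(pi(i))
    J[i] = (1 - gamma) * J[i] + gamma * (g[u, i, j] + alpha * J[j])
    i = j
print(J)
```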
Policy Evaluation with Monte Carlo Simulation (continued)
The procedure requires storing $J$, which can be problematic when the state space $S$ is large.

Idea: Use a parameterized approximation architecture
$$\tilde{J}(i, r) = \sum_{l=1}^{L} r_l \phi_l(i),$$
where the $\phi_l(\cdot)$'s are pre-specified functions ("feature functions") and $r$ is a vector of adjustable parameters.

How to update $r$?
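One standard way to update $r$ (not spelled out on the slide) is a temporal-difference-style gradient step on the parameters of the linear architecture. The sketch below illustrates this on a hypothetical MDP; the feature functions, policy, and step size are illustrative assumptions.

```python
# TD-style update of the parameter vector r for J~(i, r) = phi(i) . r.
import numpy as np

S, U, alpha, L = 3, 2, 0.9, 2
rng = np.random.default_rng(0)
p = rng.random((U, S, S)); p /= p.sum(axis=2, keepdims=True)   # p_ij(u)
g = rng.random((U, S, S))                                      # g(i, u, j)
pi = np.array([0, 1, 0])               # fixed policy to evaluate

phi = rng.random((S, L))               # phi[i, l] = phi_l(i), pre-specified features
r = np.zeros(L)                        # adjustable parameters
gamma = 0.01
i = 0
for k in range(100_000):
    u = pi[i]
    j = rng.choice(S, p=p[u, i])
    # temporal-difference error for the approximation J~(i, r) = phi(i) . r
    delta = g[u, i, j] + alpha * phi[j] @ r - phi[i] @ r
    r += gamma * delta * phi[i]        # move r to reduce the error at state i
    i = j
print(r, phi @ r)                      # parameters and implied value approximation
```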
Q-Learning
Q-learning is based on an alternative representation of the optimality equation:
$$J(i) = \min_{u \in U(i)} Q(i, u), \qquad Q(i, u) = \sum_{j \in S} p_{ij}(u) \left[ g(i, u, j) + \alpha J(j) \right].$$

What is the interpretation of $Q(i, u)$?
Q-Learning (Continued)
Key idea: Evaluate $Q(i, u)$ by using a stochastic approximation iteration:
$$Q(i_k, u_k) \leftarrow (1 - \gamma) Q(i_k, u_k) + \gamma \left[ g(i_k, u_k, s_k) + \alpha \min_{v \in U(s_k)} Q(s_k, v) \right],$$
where the successor state $s_k$ is sampled according to the probabilities $\{p_{i_k j}(u_k) : j \in S\}$.

What is the benefit of Q-learning?
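A minimal Q-learning sketch of the iteration above on a hypothetical MDP; uniform action sampling for exploration, a fixed step size, and the run length are illustrative choices.

```python
# Q-learning on a hypothetical 3-state, 2-action MDP.
import numpy as np

S, U, alpha = 3, 2, 0.9
rng = np.random.default_rng(0)
p = rng.random((U, S, S)); p /= p.sum(axis=2, keepdims=True)   # p_ij(u)
g = rng.random((U, S, S))                                      # g(i, u, j)

Q = np.zeros((S, U))
gamma = 0.05
i = 0
for k in range(200_000):
    u = rng.integers(U)                # sample actions uniformly (for exploration)
    s = rng.choice(S, p=p[u, i])       # successor state s_k ~ p_{i_k .}(u_k)
    # Q(i_k,u_k) <- (1-gamma) Q(i_k,u_k) + gamma [g(i_k,u_k,s_k) + alpha min_v Q(s_k,v)]
    Q[i, u] = (1 - gamma) * Q[i, u] + gamma * (g[u, i, s] + alpha * Q[s].min())
    i = s
print(Q.min(axis=1))                   # implied value function J(i) = min_u Q(i, u)
```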
Linear Programming Approach

Let $\theta(i)$ be positive scalars such that $\sum_{i \in S} \theta(i) = 1$. The linear programming formulation is given by
$$\max_J \; \sum_{i \in S} \theta(i) J(i)$$
$$\text{s.t.} \quad J(i) - \sum_{j \in S} \alpha p_{ij}(u) J(j) \le \sum_{j \in S} p_{ij}(u) g(i, u, j), \quad \forall i \in S,\; u \in U(i).$$

The dual linear program is given by
$$\min_x \; \sum_{i \in S} \sum_{u \in U(i)} \sum_{j \in S} p_{ij}(u) g(i, u, j) x(i, u)$$
$$\text{s.t.} \quad \sum_{u \in U(i)} x(i, u) - \sum_{j \in S} \sum_{u \in U(j)} \alpha p_{ji}(u) x(j, u) = \theta(i), \quad \forall i \in S,$$
$$x(i, u) \ge 0, \quad \forall i \in S,\; u \in U(i).$$
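A sketch of the primal LP above using SciPy's linprog on a hypothetical small MDP; the uniform weights $\theta(i)$ and the use of SciPy are illustrative assumptions.

```python
# Build and solve the primal LP: max theta.J subject to one constraint per (i, u).
import numpy as np
from scipy.optimize import linprog

S, U, alpha = 3, 2, 0.9
rng = np.random.default_rng(0)
p = rng.random((U, S, S)); p /= p.sum(axis=2, keepdims=True)   # p_ij(u)
g = rng.random((U, S, S))                                      # g(i, u, j)
theta = np.full(S, 1.0 / S)            # positive weights summing to 1

# Constraint for (i, u): J(i) - alpha sum_j p_ij(u) J(j) <= sum_j p_ij(u) g(i,u,j)
A_ub, b_ub = [], []
for i in range(S):
    for u in range(U):
        row = -alpha * p[u, i]
        row[i] += 1.0
        A_ub.append(row)
        b_ub.append(p[u, i] @ g[u, i])

# linprog minimizes, so maximize theta.J by minimizing -theta.J; J is unbounded in sign
res = linprog(c=-theta, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * S)
print(res.x)                           # optimal J (recovers J* since every theta(i) > 0)
```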
Linear Programming Approach (continued)
The size of the LP can be reduced by using a parameterized approximation architecture (Schweitzer and Seidmann, 1985):
$$\tilde{J}(i, r) = \sum_{l=1}^{L} r_l \phi_l(i).$$

Two solution approaches:
Simulation-based approach (de Farias and Van Roy, 2003; see also Powell, 2007)
Mathematical programming-based approach (Adelman, 2003; Adelman, 2007)
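A sketch, under the same hypothetical data, of the reduced LP obtained by substituting $\tilde{J}(i, r) = \sum_l r_l \phi_l(i)$ into the LP above. The decision variables shrink from $|S|$ to $L$, but the number of constraints is unchanged, which is what the simulation-based and mathematical-programming-based approaches address; the random features here are purely illustrative.

```python
# Approximate LP: the variables are the parameters r of the linear architecture.
import numpy as np
from scipy.optimize import linprog

S, U, alpha, L = 3, 2, 0.9, 2
rng = np.random.default_rng(0)
p = rng.random((U, S, S)); p /= p.sum(axis=2, keepdims=True)   # p_ij(u)
g = rng.random((U, S, S))                                      # g(i, u, j)
theta = np.full(S, 1.0 / S)
phi = rng.random((S, L))               # phi[i, l] = phi_l(i)

# Constraint for (i, u): phi(i).r - alpha sum_j p_ij(u) phi(j).r <= sum_j p_ij(u) g(i,u,j)
A_ub, b_ub = [], []
for i in range(S):
    for u in range(U):
        A_ub.append(phi[i] - alpha * p[u, i] @ phi)
        b_ub.append(p[u, i] @ g[u, i])

# Objective: maximize sum_i theta(i) phi(i).r, i.e. minimize -(phi^T theta).r
res = linprog(c=-(phi.T @ theta), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * L)
print(res.x, phi @ res.x)              # fitted r and the implied approximation J~
```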
Approximate Policy Iteration
The policy evaluation step can be difficult when the problem is "large".
Idea: Carry out policy iteration approximately.
Approximate Policy Iteration (Continued)

Approximate policy evaluation: Simulate a sequence of states $\{i_0, i_1, \ldots\}$ by following policy $\pi^k$. Let $C_l = \sum_{t=l}^{\infty} \alpha^{t-l} g(i_t, \pi^k(i_t), i_{t+1})$ for all $l = 0, 1, \ldots$. Let $r^k$ be the solution to the regression problem
$$\min_r \sum_{t=0}^{\infty} \left[ \tilde{J}(i_t, r) - C_t \right]^2 = \min_r \sum_{t=0}^{\infty} \left[ \sum_{l=1}^{L} r_l \phi_l(i_t) - C_t \right]^2.$$

Approximate policy improvement: For all $i \in S$, define policy $\pi^{k+1}$ by
$$\pi^{k+1}(i) = \arg\min_{u \in U(i)} \sum_{j \in S} p_{ij}(u) \left[ g(i, u, j) + \alpha \tilde{J}(j, r^k) \right].$$
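A sketch of one pass of the scheme above on a hypothetical small MDP: the infinite sums $C_l$ are truncated at the end of a long simulated trajectory, and the features, trajectory length, and starting policy are illustrative choices.

```python
# One approximate policy-iteration pass: simulate under pi^k, regress the
# observed discounted costs-to-go on the features, then improve greedily.
import numpy as np

S, U, alpha, L = 3, 2, 0.9, 2
rng = np.random.default_rng(0)
p = rng.random((U, S, S)); p /= p.sum(axis=2, keepdims=True)   # p_ij(u)
g = rng.random((U, S, S))                                      # g(i, u, j)
phi = rng.random((S, L))               # features phi_l(i)
pi = np.zeros(S, dtype=int)            # current policy pi^k

# Simulate a long trajectory under pi^k and record one-stage costs
T = 5000
states, costs = [0], []
for t in range(T):
    i = states[-1]; u = pi[i]
    j = rng.choice(S, p=p[u, i])
    costs.append(g[u, i, j]); states.append(j)

# C_t ~ sum_{s >= t} alpha^(s-t) g(...), computed backwards, truncated at T
C = np.zeros(T)
acc = 0.0
for t in reversed(range(T)):
    acc = costs[t] + alpha * acc
    C[t] = acc

# Least-squares fit: min_r sum_t [phi(i_t).r - C_t]^2
Phi = phi[np.array(states[:T])]
r, *_ = np.linalg.lstsq(Phi, C, rcond=None)

# Greedy improvement with respect to J~(., r)
J_tilde = phi @ r
Q = np.einsum('uij,uij->ui', p, g) + alpha * np.einsum('uij,j->ui', p, J_tilde)
pi_next = Q.argmin(axis=0)
print(r, pi_next)
```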
Rollout Policy
Idea: Improve the performance of a given policy. Given policy $\pi$, let $\pi'$ be defined such that for each $i \in S$,
$$\pi'(i) = \arg\min_{u \in U(i)} \sum_{j \in S} p_{ij}(u) \left[ g(i, u, j) + \alpha J^\pi(j) \right].$$

Then it can be shown that $\pi'$ is at least as good as $\pi$. Simulation may be used to estimate $J^\pi$.
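A minimal rollout sketch on a hypothetical small MDP: $J^\pi$ is estimated by truncated Monte Carlo simulation under the base policy, and $\pi'$ acts greedily against that estimate. The number of rollouts and the horizon are illustrative choices.

```python
# Rollout: estimate J^pi by simulation, then take the one-step greedy action.
import numpy as np

S, U, alpha = 3, 2, 0.9
rng = np.random.default_rng(0)
p = rng.random((U, S, S)); p /= p.sum(axis=2, keepdims=True)   # p_ij(u)
g = rng.random((U, S, S))                                      # g(i, u, j)
pi = np.zeros(S, dtype=int)            # base policy

def estimate_J_pi(j, n_rollouts=200, horizon=100):
    """Monte Carlo estimate of J^pi(j) from truncated simulated trajectories."""
    total = 0.0
    for _ in range(n_rollouts):
        i, disc = j, 1.0
        for _ in range(horizon):
            u = pi[i]
            nxt = rng.choice(S, p=p[u, i])
            total += disc * g[u, i, nxt]
            disc *= alpha
            i = nxt
    return total / n_rollouts

# pi'(i) = argmin_u sum_j p_ij(u) [g(i, u, j) + alpha J^pi(j)]
J_hat = np.array([estimate_J_pi(j) for j in range(S)])
Q = np.einsum('uij,uij->ui', p, g) + alpha * np.einsum('uij,j->ui', p, J_hat)
pi_rollout = Q.argmin(axis=0)
print(pi_rollout)
```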
State Aggregation
Idea: Partition the state space into a number of subsets and assume the value function is constant over each subset. State aggregation can be combined with other approximation methods.
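A minimal sketch of state aggregation viewed as a special case of the linear architecture above: the feature functions are indicators of a hypothetical partition of $S$, so $\tilde{J}(\cdot, r)$ is constant on each subset.

```python
# State aggregation as indicator features: one parameter per aggregate state.
import numpy as np

S = 6
partition = np.array([0, 0, 1, 1, 2, 2])     # state i belongs to cluster partition[i]
L = partition.max() + 1                      # number of aggregate states

# phi_l(i) = 1 if state i lies in cluster l, else 0
phi = np.zeros((S, L))
phi[np.arange(S), partition] = 1.0

r = np.array([10.0, 20.0, 30.0])             # one value per aggregate state (illustrative)
J_tilde = phi @ r                            # constant within each cluster
print(J_tilde)                               # [10. 10. 20. 20. 30. 30.]
```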
Discussion
ADP aims to alleviate the computational effort required to solve large-scale dynamic programs.
The area is still in its infancy; a commonly accepted definition of ADP does not seem to exist.
Open problem: How to specify the "feature functions"? Research to date usually assumes they are fixed in advance. For recent advances, see Klabjan and Adelman (2007).
Developing efficient solution approaches for practical applications could be valuable.