Nonlinear Policy Gradient Algorithms for Noise-Action MDPs
Krishnamurthy Dvijotham
Computer Science and Engineering
University of Washington
Seattle, WA 98105

Emanuel Todorov
Computer Science and Engineering, Applied Mathematics
University of Washington
Seattle, WA 98105

Abstract

We develop a general theory of efficient policy gradient algorithms for Noise-Action MDPs (NMDPs), a class of MDPs that generalizes Linearly Solvable MDPs (LMDPs). For finite horizon problems, these lead to simple update equations based on multiple rollouts of the system. We show that our policy gradient algorithms are faster than the PI2 algorithm, a state-of-the-art policy optimization algorithm. We provide an alternate interpretation of the PI2 algorithm that further justifies this: the PI2 algorithm is actually performing gradient descent with respect to a risk-seeking objective, rather than the desired expected cost objective. For infinite horizon problems, we develop algorithms that only require estimation of a state value function, rather than a state-action Q function or advantage function. We develop policy gradient algorithms for all MDP formulations (finite horizon, infinite horizon, first exit) for arbitrary policy parameterizations. We demonstrate the effectiveness of the policy gradient algorithms on simple 2-D nonlinear dynamical systems and large linear dynamical systems.
1 INTRODUCTION

MDPs are a very convenient formalism for specifying and solving sequential decision making problems. The dynamic programming principle and the associated Bellman equations lead to an elegant theory for the solution of MDPs. However, for many application domains with continuous state spaces, the curse of dimensionality implies that the computational complexity of solving MDPs grows exponentially with the number of state dimensions, making them impractical for many applications. Further, methods that try to find approximate solutions to the Bellman equation often suffer from the deficiency that the approximation error is not monotonically related to the performance of the policy as measured by the original cost or reward function. Policy gradient algorithms (Williams, 1992; Sutton et al., 1999; Peters and Schaal, 2008) are an alternate class of algorithms that work by considering a parametric class of policies and performing gradient descent with respect to the policy parameters. These methods have the advantage that they are theoretically guaranteed to improve the performance of the policy. Further, they do not require access to a model of the system dynamics or the cost function.

In this work, we develop efficient policy gradient algorithms for a special class of MDPs: Noise-Action MDPs. We introduce the notion of Noise-Action MDPs, a class of MDPs that includes Linearly Solvable MDPs (LMDPs) (Todorov, 2009) as a special case. We develop the theory of policy gradients for NMDPs. For finite horizon problems, we show that the policy gradient update has a simple form as an average over rollouts of the current policy. For infinite horizon problems, we show that the policy gradient can be efficiently computed for all problem formulations and only requires estimation of the cost-to-go or value function, rather than a state-action Q function or advantage function (as in previous algorithms). The main contributions of this paper are:

1. We introduce the notion of Noise-Action MDPs (NMDPs), a class of MDPs with very useful properties that has not been studied previously in the MDP literature. This class includes Linearly Solvable MDPs (Todorov, 2009) as a special case.

2. We derive a more efficient version of the REINFORCE algorithm (Williams, 1992; Baxter and Bartlett, 2001), called NMDP-REINFORCE, that only involves sampling state-space trajectories rather than state-action trajectories. Since these algorithms rely on sampling to get gradient estimates, this reduction in sampling reduces the variance of the gradient estimate significantly and leads to a much more efficient algorithm.

3. We provide an alternate interpretation of the PI2 algorithm (Theodorou et al., 2010a), a state-of-the-art reinforcement learning algorithm that has been successfully applied in several robotic applications, as a policy gradient algorithm performing gradient descent with respect to a risk-seeking objective. This alternate interpretation of PI2 leads to a simple convergence proof for PI2. Further, since the PI2 algorithm was originally developed for optimizing the standard expected cost objective, we show that one can actually do much better by performing gradient descent directly on the expected cost. Finally, the NMDP-REINFORCE algorithm we develop (Algorithm 1, Section 3.1) is more generally applicable than the PI2 algorithm, which only applies to a class of control-affine diffusion processes.

4. We develop efficient policy gradient algorithms for first-exit, discounted, and average cost NMDPs that only require estimation of a state-value function rather than a state-action Q-function or advantage function (Peters and Schaal, 2008; Sutton et al., 1999). These results generalize those in (Todorov, 2010), which considered the case of average cost Linearly Solvable MDPs with a specific policy parametrization.
2 NOISE-ACTION MDPs (NMDPs)

We define a new class of MDPs called Noise-Action MDPs (NMDPs). These MDPs have the crucial property that noise and controls are interchangeable, so that any transition that could have happened intentionally could also have happened by random chance. This enables us to develop the first policy gradient algorithms with deterministic policies, since the noise in the system can be used to drive exploration and compute a policy gradient estimate to improve the policy.

Definition 1. An NMDP is defined by specifying:

1. A state space X.
2. A reachable set N_ext(x) ⊆ X for every state x ∈ X.
3. A set of valid policies P(x) for each state x that satisfy Π(x'|x) = 0 ⟺ x' ∉ N_ext(x) for all Π ∈ P(x).
4. A cost function ℓ(x, x').
5. A problem formulation: Finite Horizon (FH), First Exit (FE), Infinite Horizon Average Cost (IH), or Infinite Horizon Discounted Cost (IHD).

The objective is to minimize the expected accumulated cost:

FH:  min_Π  E_{x_{t+1} ∼ Π(·|x_t)} [ Σ_{t=0}^{T−1} ℓ_t(x_t, x_{t+1}) + ℓ_f(x_T) ]

FE:  min_Π  E_{x_{t+1} ∼ Π(·|x_t)} [ Σ_{t=0}^{T_e−1} ℓ(x_t, x_{t+1}) + ℓ_f(x_{T_e}) ],   T_e = min{τ : x_τ ∈ T}, where T is the set of terminal states

IHD: min_Π  E_{x_{t+1} ∼ Π(·|x_t)} [ Σ_{t=0}^{∞} γ^t ℓ(x_t, x_{t+1}) ]

IH:  min_Π  E_{x_{t+1} ∼ Π(·|x_t)} [ lim_{T→∞} (1/T) Σ_{t=0}^{T−1} ℓ(x_t, x_{t+1}) ]     (1)
2.1 RELATIONSHIP TO TRADITIONAL MDPs

Consider an MDP with the following dynamics: the controller specifies an action u^c_t that gets corrupted by some noise process into a different action u_t, and the system then makes a deterministic transition to the next state:

x_{t+1} = F(x_t, u_t)   (deterministic dynamics),     u_t ∼ P_n(u_t | u^c_t)   (noisy controls).

Further, suppose that there are no two actions u, u' such that F(x, u) = F(x, u') for any x. In other words, we have a well-defined inverse dynamics function u_t = F^{−1}(x_t, x_{t+1}) (if x_{t+1} is not reachable from x_t, this function returns a default infinite-cost action). Then, one can represent policies as Π_f(x'|x) = P_n( F^{−1}(x, x') | f(x) ), where f is a deterministic policy mapping states to actions, as we do in NMDPs. Further, the state-action cost ℓ(x, u) = ℓ(x, F^{−1}(x, x')) = ℓ(x, x') can be represented in terms of (x, x'), as in NMDPs.
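To make this construction concrete, here is a minimal sketch (the chain dynamics, noise level, and function names are ours, not the paper's) of turning a traditional MDP with deterministic dynamics F, invertible in the action, and control noise P_n into an NMDP transition policy Π_f(x'|x) = P_n(F^{−1}(x, x') | f(x)).

```python
import numpy as np

# Toy chain with wrap-around moves {-1, 0, +1}, so F(x, .) is injective and the
# inverse dynamics F^{-1}(x, x') is well defined.
n_states = 5
actions = [-1, 0, 1]

def F(x, u):
    """Deterministic dynamics."""
    return (x + u) % n_states

def F_inv(x, x_next):
    """Inverse dynamics: the unique action mapping x to x_next, or None if unreachable."""
    for u in actions:
        if F(x, u) == x_next:
            return u
    return None

def P_n(u, u_c, eps=0.1):
    """Control noise: the commanded action u_c is executed w.p. 1 - eps, otherwise another action."""
    return (1.0 - eps) if u == u_c else eps / (len(actions) - 1)

def nmdp_policy(f, x, x_next):
    """State-transition policy Pi_f(x'|x) = P_n(F^{-1}(x, x') | f(x))."""
    u = F_inv(x, x_next)
    return 0.0 if u is None else P_n(u, f(x))

f = lambda x: 1   # deterministic policy: always command "move right"
print([round(nmdp_policy(f, 2, xp), 3) for xp in range(n_states)])   # sums to 1 over reachable x'
```

The probabilities over next states sum to one because the action-to-next-state map is a bijection onto the reachable set, which is exactly the interchangeability of noise and control that the NMDP definition requires.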
2.2 RELATIONSHIP TO LINEARLY SOLVABLE MDPs

Linearly solvable MDPs (Todorov, 2009) also use policies of the kind Π(x'|x). Further, they have a notion of passive dynamics Π^0(x'|x) that represents the uncontrolled dynamics of the system, and they impose the restriction that Π^0(x'|x) = 0 ⟹ Π(x'|x) = 0. In terms of NMDPs, this means that N_ext(x) = {x' : Π^0(x'|x) > 0} and P(x) = {Π : Π^0(x'|x) = 0 ⟹ Π(x'|x) = 0}. The cost function is given by

ℓ(x) + KL( Π(·|x) ‖ Π^0(·|x) ),

where ℓ(x) is the state cost and the KL divergence is the action cost. Writing this out, we get

ℓ(x) + E_{x' ∼ Π(·|x)} [ log( Π(x'|x) / Π^0(x'|x) ) ]  =  E_{x' ∼ Π(·|x)} [ ℓ(x) − log Π^0(x'|x) ]  −  H[ Π(·|x) ],

where H denotes entropy. Compare this with the expected immediate cost of an NMDP, E_{x' ∼ Π(·|x)} [ ℓ(x, x') ]. Thus, by choosing ℓ_t(x, x') = ℓ_t(x) − log Π^0(x'|x), we see that LMDPs are NMDPs with an additional entropy-maximizing cost. However, NMDPs are strictly more general because:

1. They allow one to impose specific constraints on the structure of Π(x'|x) through P(x): for example, NMDPs can require that all policies have a fixed variance Σ, some inherent noise level.

2. NMDPs can encode arbitrary state-action costs, as demonstrated through the relationship with traditional MDPs above, and do not need to impose an additional entropic or KL divergence cost.

Note that NMDPs do not lead to linear Bellman equations. However, they yield the policy gradient results that are the focus of this paper.
3 NMDP-REINFORCE

We start with some intuition behind stochastic policy gradient algorithms. The idea in these algorithms is to use noise to make improvements to the current policy. In the specific policy gradient algorithms we derive here, this intuition is particularly clear, since noise and controls are interchangeable. To develop some intuition for how this works, consider minimizing a function f(x) over x ∈ R^n, and suppose that f is only available as a black box, so that one can only query f at certain points x_1, x_2, . . .. Now, suppose that one wanted to use a gradient-based method to minimize this function. One way to do this is to minimize E_{x ∼ N(y, Σ)} [ f(x) ] with respect to y, the mean of the Gaussian. This is now a differentiable problem in terms of y, and the gradient is given by

− E_{x ∼ N(y, Σ)} [ f(x) Σ^{−1} (y − x) ].

One can estimate this gradient using sampling and go downhill with respect to it, checking for improvement in terms of the function value f. The estimation of the gradient by sampling is analogous to a finite difference estimate, with the resolution of the finite differencing governed by Σ.
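As an illustration of this smoothing argument, the following sketch (the quadratic test function and sample size are our choices, not from the paper) estimates the gradient of E_{x ∼ N(y, Σ)}[f(x)] by sampling, using the expression above.

```python
import numpy as np

# Smoothed-objective gradient: for J(y) = E_{x ~ N(y, Sigma)}[f(x)],
# grad J(y) = -E[ f(x) Sigma^{-1} (y - x) ], estimated by Monte Carlo.

def smoothed_gradient(f, y, Sigma, n_samples=10000, rng=np.random.default_rng(0)):
    Sigma_inv = np.linalg.inv(Sigma)
    x = rng.multivariate_normal(y, Sigma, size=n_samples)      # noisy probes of f
    fx = np.apply_along_axis(f, 1, x)                           # black-box evaluations
    return -(fx[:, None] * (y - x) @ Sigma_inv).mean(axis=0)    # Monte Carlo average

# Example: f(x) = ||x||^2; the smoothed gradient should be close to 2*y.
f = lambda x: float(x @ x)
y = np.array([1.0, -2.0])
Sigma = 0.1 * np.eye(2)
print(smoothed_gradient(f, y, Sigma))   # approximately [2, -4]
```

The covariance Σ plays the role of the finite-differencing resolution described above: smaller Σ probes f more locally but increases the variance of the estimate.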
3.1 POLICY GRADIENTS

In this section, we use a trajectory-based approach to computing policy gradients. The resulting policy gradient can naturally be expressed as an average over fixed-length rollouts of the current policy. This is very similar to the REINFORCE algorithm (Williams, 1992). However, NMDPs allow us to get a more efficient algorithm, NMDP-REINFORCE (Algorithm 1), that requires sampling only state-space trajectories rather than state-action trajectories. We also see that the resulting updates, when applied with a specific policy parametrization, have connections to the PI2 algorithm that has been successfully applied to several problems in robotics (Theodorou et al., 2010a).

Consider a finite horizon NMDP with horizon T. Let X = (x_0, x_1, x_2, . . . , x_T) be a feasible state space trajectory of the NMDP.

Algorithm 1: NMDP-REINFORCE
1. Perform k noisy rollouts with policy parameters θ.
2. Get the policy gradient estimate g (Theorem 1).
3. Compute a step length η using line search.
4. Update the parameters: θ ← θ − η g.

Theorem 1. Consider a family of parameterized policies Π^θ(x_{t+1}|x_t, t). The gradient of the expected cost E_{x_0 ∼ µ_0, x_{t+1} ∼ Π^θ(·|x_t, t)} [ S(X) ] is given by

g = Σ_X Π^θ(X) S(X) ( Σ_{t=0}^{T−1} ∇_θ log Π^θ(x_{t+1}|x_t, t) ),

where

Π^θ(X) = µ_0(x_0) ∏_{t=0}^{T−1} Π^θ(x_{t+1}|x_t, t),

S(X) = Σ_{t=0}^{T−1} ℓ_t(x_t, x_{t+1}) + ℓ_f(x_T)   for NMDPs,

S(X) = Σ_{t=0}^{T−1} ℓ_t(x_t) + ℓ_f(x_T) + log( Π^θ(X) / Π^0(X) )   for LMDPs.
Proof. For LMDPs, differentiating with respect to the policy parameters θ gives

Σ_X ∇Π^θ(X) S(X) + Σ_X Π^θ(X) ( ∇Π^θ(X) / Π^θ(X) )  =  Σ_X Π^θ(X) S(X) ∇ log Π^θ(X),

where the second term vanished because Σ_X ∇Π^θ(X) = 0, since Σ_X Π^θ(X) = 1 for all θ. For NMDPs, we directly get the RHS since S(X) does not depend on θ. Now,

∇ log Π^θ(X) = Σ_t ∇ log Π^θ(x_{t+1}|x_t, t).

Hence the result.

Remark 1. Contrast this with the classical REINFORCE policy gradient,

Σ_{X,U} Π^θ(X, U) S(X) ( Σ_{t=0}^{T−1} ∇_θ log Π^θ(u_t|x_t, t) ),

which requires sampling over both X and U; hence the gradient estimate has a much higher variance, resulting in slower convergence.
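The following sketch shows how the estimator of Theorem 1 and the steps of Algorithm 1 might be implemented. The rollout, the per-step score function ∇_θ log Π^θ(x_{t+1}|x_t, t), and the trajectory cost S(X) are system-specific and are left as user-supplied callables (these names are ours, not the paper's), and a fixed step size stands in for the line search.

```python
import numpy as np

# Generic NMDP-REINFORCE estimator (Theorem 1) and one update step (Algorithm 1).
# `rollout`, `grad_log_policy`, and `trajectory_cost` are placeholders to be
# supplied for a concrete system; they are not defined in the paper.

def nmdp_reinforce_gradient(theta, rollout, grad_log_policy, trajectory_cost, k=100):
    """Average of S(X) * sum_t grad_theta log Pi_theta(x_{t+1} | x_t, t) over k rollouts."""
    g = np.zeros_like(theta)
    for _ in range(k):
        X = rollout(theta)                       # state trajectory [x_0, ..., x_T]
        S = trajectory_cost(X)                   # accumulated cost S(X)
        score = sum(grad_log_policy(theta, X[t], X[t + 1], t)
                    for t in range(len(X) - 1))  # sum of per-step score functions
        g += S * score
    return g / k

def nmdp_reinforce_step(theta, rollout, grad_log_policy, trajectory_cost,
                        step=1e-2, k=100):
    """One iteration of Algorithm 1, with a fixed step in place of the line search."""
    g = nmdp_reinforce_gradient(theta, rollout, grad_log_policy, trajectory_cost, k)
    return theta - step * g
```

Note that only state-space trajectories are consumed: the score function depends on consecutive states, not on sampled actions, which is where the variance reduction of Remark 1 comes from.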
Further, in a special case, we can obtain a Gauss-Newton approximation to the policy Hessian.

Theorem 2. Suppose that we have an NMDP where, for the optimal policy, S(X) ≈ 0 with high probability, i.e., the minimum trajectory cost is 0 and is achieved with high probability by the optimal policy. Then, the following is a Gauss-Newton approximation to the Hessian:

Σ_X Π^θ(X) S(X) ∇_θ log Π^θ(X) ∇_θ log Π^θ(X)^T.

Further, this is guaranteed to be positive semi-definite for any θ.

Proof. Differentiating the policy gradient with respect to θ, we get

Σ_X Π^θ(X) S(X) ∇_θ log Π^θ(X) ∇_θ log Π^θ(X)^T + Σ_X Π^θ(X) ∇^2_θ log Π^θ(X) S(X).

Since S(X) ≈ 0 with high probability under Π^θ(X), we can ignore the second term and obtain the result. Further, since x x^T is positive semidefinite for any x and S ≥ 0, the resulting Hessian approximation is a sum of positive semidefinite terms and is hence positive semidefinite.
3.2 RELATIONSHIP TO THE PI2 ALGORITHM

In this section, we show the relationship between the NMDP-REINFORCE and PI2 algorithms (Theodorou et al., 2010a). To do this, we need to consider a specific policy parameterization and a specific class of NMDPs: control-affine dynamical systems. Control-affine dynamical systems are a popular class of systems that arise in mechanical systems. They are described by a stochastic differential equation of the following kind:

dx = a(x) dt + B(x) u^c dt + noise,   x ∈ R^n.

Theorem 3. With a deterministic policy parameterization u^c(x, t) = f(x) θ_t, the time-discretized version of the above controlled diffusion is an NMDP. Further, the policy gradient update is given by

θ_t ← θ_t + h η ∫ Π^θ(X) ( −S(X) ) ε_t dX,

where ε_t is the noise added at time t. The PI2 update for the same system is

θ_t ← θ_t + ( ∫ Π^θ(X) exp(−S(X)) ε_t(X) dX ) / ( ∫ Π^θ(X) exp(−S(X)) dX ).

Remark 2. The policy gradient update uses the unexponentiated trajectory cost, while the PI2 update uses the exponentiated version. Otherwise, the updates are very similar.

Proof. See appendix.

Theorem 4. PI2 is performing gradient descent with respect to the risk-seeking objective

− log ∫ Π^θ(X) exp(−S(X)) dX.

Proof. The gradient of this objective with respect to θ_t is given by

− ( ∫ Π^θ(X) exp(−S(X)) ∇_{θ_t} log Π^θ(X) dX ) / ( ∫ Π^θ(X) exp(−S(X)) dX ).

From the proof of Theorem 3 (appendix), we have that ∇_{θ_t} log Π^θ(X) = h ε_t(X). The PI2 update is therefore exactly in the opposite direction of this gradient (up to the constant factor h). Hence the result.

Remark 3. Thus PI2 performs gradient descent on a risk-seeking objective, rather than the risk-neutral or standard expected cost objective. This can be seen as a simple way to derive PI2, as well as a proof that PI2 converges to a local minimum of the risk-seeking objective under infinite sampling. Note that as the noise level becomes small, the risk-neutral and risk-seeking objectives become close and PI2 optimizes the right quantity, which probably explains its widespread success in various application domains. However, we show numerically in Section 5.1 that NMDP-REINFORCE outperforms PI2, likely because it is performing gradient descent directly on the expected cost objective.
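To make the contrast of Remark 2 concrete, here is a minimal sketch (our notation; a finite-sample stand-in for the integrals above) of the two updates of Theorem 3 computed from the same batch of rollouts: the policy gradient update weights the exploration noise by the negated, unexponentiated trajectory cost, while PI2 uses exponentiated, normalized weights.

```python
import numpy as np

def policy_gradient_update(theta_t, eps_t, S, h=0.01, eta=1.0):
    """theta_t <- theta_t + h * eta * mean_i[ -S_i * eps_{i,t} ]  (risk-neutral)."""
    return theta_t + h * eta * np.mean(-S[:, None] * eps_t, axis=0)

def pi2_update(theta_t, eps_t, S):
    """theta_t <- theta_t + sum_i w_i * eps_{i,t}, with w = softmax(-S)  (risk-seeking)."""
    w = np.exp(-(S - S.min()))          # subtract the minimum for numerical stability
    w /= w.sum()
    return theta_t + w @ eps_t

# Toy usage: 5 rollouts, 3-dimensional exploration noise at time t.
rng = np.random.default_rng(0)
eps_t = rng.standard_normal((5, 3))     # noise added at time t in each rollout
S = rng.uniform(0.0, 10.0, size=5)      # trajectory costs S(X_i)
theta_t = np.zeros(3)
print(policy_gradient_update(theta_t, eps_t, S))
print(pi2_update(theta_t, eps_t, S))
```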
4 INFINITE HORIZON POLICY GRADIENTS

We will now develop policy gradient theorems for LMDPs and NMDPs under the various problem formulations. We will consider a parameterized family of policies Π^θ(x'|x) and derive expressions for the gradient of the expected accumulated cost under this policy with respect to the policy parameters θ. The resulting expressions do not rely on the costs or dynamics being differentiable, and can be computed simply using on-policy sampling and value function estimation algorithms like LSTD(λ).

4.1 POLICY GRADIENTS IN IHD LMDPs/NMDPs

In the IHD formulation, the cost is added up over time, but future costs are discounted by a factor γ (see equation (1)):

E_{x_{t+1} ∼ Π(·|x_t)} [ Σ_{t=0}^{∞} γ^t ℓ(x_t, x_{t+1}) ]

for NMDPs, with a similar expression for LMDPs. The expectation is under the stochastic dynamics x_{t+1} ∼ Π(·|x_t), with x_0 ∼ µ_0, a fixed initial state distribution. The policy gradient in an IHD LMDP can be computed as follows.

Theorem 5. Let Π^θ(x'|x) be any parameterized family of policies for an IHD LMDP with passive dynamics Π^0(x'|x). Define the discounted state visit distribution to be µ^θ(x) = E_{Π^θ} [ Σ_{t=0}^{∞} γ^t I[x_t = x] ]. The policy gradient for an IHD LMDP with a parameterized policy is given by

Σ_{x,x'} µ^θ(x) ∇_θ Π^θ(x'|x) ( γ v^θ(x') + log( Π^θ(x'|x) / Π^0(x'|x) ) ).     (2)

For an NMDP, the expression changes slightly:

Σ_{x,x'} µ^θ(x) ∇_θ Π^θ(x'|x) ( γ v^θ(x') + ℓ(x, x') ).     (3)

Proof. See appendix.
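The sketch below (a toy tabular NMDP of our construction, not from the paper) evaluates equation (3) exactly: it computes v^θ from the policy-specific Bellman equation, µ^θ from the fixed point of Lemma 1 in the appendix, and checks the resulting gradient against a finite-difference derivative of the expected cost.

```python
import numpy as np

# Tabular IHD NMDP with a softmax transition policy Pi_theta(x'|x) over n states.
n, gamma = 4, 0.9
rng = np.random.default_rng(0)
cost = rng.uniform(0.0, 1.0, size=(n, n))     # state-transition cost l(x, x')
mu0 = np.full(n, 1.0 / n)                     # initial state distribution

def policy(theta):                            # theta has shape (n, n): transition logits
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def value(P):                                 # v = (I - gamma P)^{-1} E[l | x] (Bellman equation)
    return np.linalg.solve(np.eye(n) - gamma * P, (P * cost).sum(axis=1))

def expected_cost(theta):                     # J(theta) = sum_x mu0(x) v_theta(x)
    return mu0 @ value(policy(theta))

def policy_gradient(theta):                   # equation (3)
    P = policy(theta)
    v = value(P)
    mu = np.linalg.solve(np.eye(n) - gamma * P.T, mu0)   # Lemma 1 fixed point
    q = gamma * v[None, :] + cost                        # gamma*v(x') + l(x, x')
    g = np.zeros_like(theta)
    for x in range(n):
        # d Pi(x'|x) / d theta[x, :] for a softmax row: diag(P[x]) - P[x] P[x]^T
        dP = np.diag(P[x]) - np.outer(P[x], P[x])
        g[x] = mu[x] * dP @ q[x]
    return g

theta = rng.standard_normal((n, n))
print(policy_gradient(theta)[0])
# Finite-difference check of the same derivatives (they should match closely).
eps, d = 1e-5, np.zeros(n)
for j in range(n):
    t1, t2 = theta.copy(), theta.copy()
    t1[0, j] += eps; t2[0, j] -= eps
    d[j] = (expected_cost(t1) - expected_cost(t2)) / (2 * eps)
print(d)
```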
4.2 POLICY GRADIENTS IN FIRST-EXIT LMDPs

In FE LMDPs, the control policy tries to minimize the cost accumulated until one hits a goal or terminal state. Denote by T the set of terminal states and by N the set of non-terminal states. We will assume that all terminal states are absorbing, i.e., Π^θ(x|x) = 1 for all x ∈ T.

Theorem 6. Let Π^θ(x'|x) be a policy for an FE LMDP. Let µ^θ(x) = E_{Π^θ} [ Σ_{t=0}^{T_e−1} I[x_t = x] ], where T_e = min{τ : x_τ ∈ T}. The policy gradient is given by

Σ_{x∈N, x'} µ^θ(x) ∇Π^θ(x'|x) ( v^θ(x') + log( Π^θ(x'|x) / Π^0(x'|x) ) ).     (4)

For NMDPs, the expression changes to

Σ_{x∈N, x'} µ^θ(x) ∇Π^θ(x'|x) ( v^θ(x') + ℓ(x, x') ).     (5)

Proof. Similar to Theorem 5.
4.3 POLICY GRADIENTS IN INFINITE HORIZON AVERAGE COST LMDPs

The policy gradient theorem for infinite horizon average cost problems was first derived in (Todorov, 2010). For completeness, we restate the result here.

Theorem 7. For an IH LMDP, the policy gradient is given by

Σ_{x,x'} µ^θ(x) ∇Π^θ(x'|x) ( v^θ(x') + log( Π^θ(x'|x) / Π^0(x'|x) ) ).     (6)

4.4 NONLINEAR POLICY PARAMETERIZATIONS

In this section, we consider policy parameterizations of the form

Π^θ(x'|x) ∝ Π^0(x'|x) exp( −f^θ(x') ),

where f^θ(x) is an arbitrary nonlinear function approximator. Further, for NMDPs, we assume that ℓ(x, x') = ℓ(x'), so that there is only a state cost and no control cost. We also define the notation Π^θ[f](x) = Σ_{x'} Π^θ(x'|x) f(x'), the expectation of a function under the one-step controlled dynamics. With these assumptions, the policy gradient theorem reduces to the following.
Theorem 8. The policy gradient for the IHD formulation with parametrization Π^θ(x'|x) ∝ Π^0(x'|x) exp( −f^θ(x') ) is given by:

LMDPs:  Σ_x µ^θ(x) Π^θ[∇_θ f](x) Π^θ[ γ v^θ − f^θ ](x)  −  Σ_x ( ( µ^θ(x) − µ_0(x) ) / γ ) ∇_θ f(x) ( γ v^θ(x) − f^θ(x) )

NMDPs:  Σ_x µ^θ(x) Π^θ[∇_θ f](x) Π^θ[ γ v^θ + ℓ ](x)  −  Σ_x ( ( µ^θ(x) − µ_0(x) ) / γ ) ∇_θ f(x) ( γ v^θ(x) + ℓ(x) )     (7)

Similar results hold for the FE and IH formulations.

Remark 4. This is a generalization of the results derived in Todorov (2010) for LMDPs with linear policy parameterizations f^θ(x) = f(x)^T θ. Further, those results only applied to the infinite horizon average cost case. Here, we extend the results to arbitrary nonlinear functions f^θ(x) and to the other problem formulations. This generalization is particularly useful, since if we want to scale these algorithms up to high-dimensional continuous state spaces, we need to learn nontrivial features of the state space to represent the value function.

Remark 5. In Todorov (2010), the authors also derived a natural policy gradient theorem. However, this was done by computing the expression for the gradient and the Fisher information matrix, and then solving a linear system. This partially defeats the original advantage of natural policy gradients, which is that they can be computed simply by fitting the value function with a set of features, without having to compute the Fisher information matrix or the policy gradient. Hence, we do not develop the natural policy gradient in this paper.
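The following sketch (tabular, with our toy numbers) shows the Section 4.4 parameterization Π^θ(x'|x) ∝ Π^0(x'|x) exp(−f^θ(x')) and the one-step expectation operator Π^θ[f](x) used in Theorem 8, with f^θ represented simply as a table of next-state values.

```python
import numpy as np

def controlled_policy(P0, f):
    """Reweight the passive dynamics P0 (rows sum to 1) by exp(-f(x')) and renormalize."""
    W = P0 * np.exp(-f)[None, :]
    return W / W.sum(axis=1, keepdims=True)

def one_step_expectation(P, g):
    """Pi_theta[g](x) = sum_{x'} Pi_theta(x'|x) g(x')."""
    return P @ g

# Toy usage: 3 states, uniform passive dynamics, f favoring state 0.
P0 = np.full((3, 3), 1.0 / 3.0)
f = np.array([0.0, 1.0, 2.0])
P = controlled_policy(P0, f)
print(P)                              # rows concentrate on low-f states
print(one_step_expectation(P, f))     # Pi_theta[f](x) for each x
```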
5 EXPERIMENTS

5.1 NMDP-REINFORCE vs PI2

We compared the NMDP-REINFORCE algorithm with the PI2 algorithm on a 2-dimensional nonlinear control problem: the simple pendulum swing-up task. The state space consists of the angular position and velocity of the pendulum, (α, ω). We build a policy parameterization by placing radial basis functions, a normalized mixture of Gaussians, centered at points on a 10x10 2-D grid over the state space. The dynamics are given by

α_{t+1} = α_t + ω_t dt,    ω_{t+1} = ω_t − g sin(α_t) dt + u dt + dξ_t,

where ξ is Brownian noise. This system can be time-discretized to obtain an NMDP.
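For concreteness, here is a minimal sketch of the time-discretized pendulum dynamics and a normalized-RBF policy of the kind described above; the constants (dt, noise level, RBF width, grid ranges) are our choices and not taken from the paper.

```python
import numpy as np

dt, g = 0.01, 9.8
sigma = 0.5                                   # scale of the Brownian noise on omega

def pendulum_step(alpha, omega, u, rng):
    """One Euler step of the swing-up dynamics; noise and control enter the same channel."""
    alpha_next = alpha + omega * dt
    omega_next = (omega - g * np.sin(alpha) * dt + u * dt
                  + sigma * np.sqrt(dt) * rng.standard_normal())
    return alpha_next, omega_next

def rbf_features(alpha, omega, centers, width=1.0):
    """Normalized Gaussian RBFs centered on a grid over the state space."""
    d2 = ((centers - np.array([alpha, omega])) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2 * width ** 2))
    return w / w.sum()

# 10x10 grid of RBF centers; the policy commands the torque u = theta . phi(alpha, omega).
A, W = np.meshgrid(np.linspace(-np.pi, np.pi, 10), np.linspace(-3.0, 3.0, 10))
centers = np.column_stack([A.ravel(), W.ravel()])
theta = np.zeros(100)

rng = np.random.default_rng(0)
alpha, omega = np.pi, 0.0                     # start hanging down
u = theta @ rbf_features(alpha, omega, centers)
print(pendulum_step(alpha, omega, u, rng))
```

Because the exploration noise enters the ω channel exactly where the control does, the discretized system has the noise-action structure of Section 2.1.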
We then compare the PI2 algorithm with NMDP-REINFORCE. The experimental protocol is as follows: both algorithms start with the same random initial policy, perform 100 rollouts in each iteration, and use those rollouts to update the policy parameters for the next iteration. We found that performing a line search is crucial to getting these algorithms to work. In order to do so, we restricted the line-search objective to be the accumulated cost of one noiseless rollout under the current policy. The learning curves for both algorithms are plotted in Figure 1. The NMDP-REINFORCE algorithm (denoted PG) converges rapidly in the initial phase and then tapers off. The PI2 algorithm converges at a much slower rate. This confirms our intuition that doing direct policy gradients on NMDPs is likely more efficient than PI2, which actually performs gradient descent on a different objective.

Figure 1: Learning curves (cost vs. iterations) for PI2 and NMDP-REINFORCE on the pendulum swing-up task.
5.2 ANALYTICAL SOLUTIONS FOR LINEAR DYNAMICAL SYSTEMS

The NMDP-REINFORCE algorithm has another advantage: the policy gradient can be computed analytically for linear dynamical systems with linear feedback policies and a fairly general family of cost functions, namely any mixture of Gaussians and exponentials times polynomials. This is particularly useful for large-scale decentralized control problems: consensus on a network, traffic control, and power systems are all good examples. Consider a linear dynamical system x_{t+1} = A x_t + B u^c_t + noise. We consider a family of linear feedback policies u^c_t = θ_t x_t. We can enforce a decentralized policy by requiring that θ_t have a particular sparsity structure. Then, the controlled dynamics becomes

x_{t+1} = A x_t + B ( θ_t x_t + ε_t ),    ε_t ∼ N(0, Σ).

Since this is again a linear dynamical system with Gaussian noise, the joint probability of any trajectory Π^θ(X) is jointly Gaussian. Further, if the trajectory cost S(X) is a mixture of Gaussians and exponentials times polynomials, the NMDP-REINFORCE update can be computed analytically. We tested this algorithm on a simple scalar linear dynamical system,

x_{t+1} = θ_t x_t + ε_t,    ε_t ∼ N(0, 1),

where θ_t are the policy parameters, the linear control gains in this case. We use a trajectory cost that depends only on the final state and is an inverted Gaussian of the form 1 − exp(−x_T^2 / 2). Further, we impose a control cost by penalizing deviations of θ_t from 1, so that gains that change the state drastically are penalized. We can analytically compute the NMDP-REINFORCE gradient in this case, and perform gradient descent using a standard optimizer (Schmidt, 2005). The algorithm converges rapidly to the optimum, and the resulting controller drives the state to the goal while compensating for noise disturbances. Although this is a simple example, it illustrates the potential power of this algorithm to learn optimal controllers using just a very high-level delayed reward.
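As an illustration (not the paper's closed-form computation, which we do not reproduce), the following sketch estimates the NMDP-REINFORCE gradient for this scalar system by Monte Carlo and runs plain gradient descent on the gains; the horizon, penalty weight, and initial state are our choices.

```python
import numpy as np

T, lam, x0 = 10, 0.1, 2.0          # horizon, gain-penalty weight, initial state
rng = np.random.default_rng(0)

def reinforce_gradient(theta, n=5000):
    """Monte Carlo estimate of the NMDP-REINFORCE gradient for x_{t+1} = theta_t x_t + eps_t."""
    x = np.full(n, x0)
    score = np.zeros((n, T))
    for t in range(T):
        eps = rng.standard_normal(n)
        score[:, t] = eps * x                     # grad_{theta_t} log N(x_{t+1}; theta_t x_t, 1)
        x = theta[t] * x + eps
    terminal_cost = 1.0 - np.exp(-x ** 2 / 2.0)   # inverted-Gaussian cost on x_T
    g = (terminal_cost[:, None] * score).mean(axis=0)
    return g + 2.0 * lam * (theta - 1.0)          # deterministic penalty on gain deviations

theta = np.ones(T)                                # start from unit gains
for _ in range(50):
    theta -= 0.1 * reinforce_gradient(theta)      # plain gradient descent
print(theta)                                      # gains should drop below 1, pulling x_T toward 0
```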
6 CONCLUSIONS

We have identified a new class of MDPs, Noise-Action MDPs, and derived efficient policy gradient algorithms for MDPs in this class. We have identified the relationship between these algorithms and PI2, a state-of-the-art policy optimization algorithm that has been successfully applied to many robotic control tasks. We have presented theoretical and preliminary numerical evidence that our new policy gradient algorithm, NMDP-REINFORCE, is likely to be even more efficient. Further, we have derived the first policy gradient algorithms that can work with deterministic policies and use the noise in the system dynamics to compute policy improvements. Finally, we have derived policy gradient theorems for all infinite horizon formulations of MDPs (first exit, discounted, and average cost) with a general class of nonlinear policy parameterizations.

There is a lot of work to be done, both theoretically and empirically. The results relating PI2 and NMDP-REINFORCE are specialized to certain classes of nondegenerate diffusions; further investigation is needed to see whether these results hold more generally. Further investigation of the convergence properties of both algorithms, both theoretically and empirically, is required to get a thorough understanding of the relative pros and cons of each algorithm. NMDPs can model more general systems than the control-affine diffusions that PI2 is restricted to; developing practical algorithms for new classes of systems is another area for future work. The Gauss-Newton approximation of Theorem 2 is also quite promising and needs to be numerically evaluated to determine its effectiveness under sampling.
References
J. Baxter and P. L. Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319-350, 2001.

J. Peters and S. Schaal. Natural actor-critic. Neurocomputing, 71(7):1180-1190, 2008.

M. Schmidt. minFunc, 2005.

R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, 1999.

E. Theodorou, J. Buchli, and S. Schaal. Reinforcement learning of motor skills in high dimensions: A path integral approach. In Robotics and Automation (ICRA), 2010 IEEE International Conference on, pages 2397-2403. IEEE, 2010a.

E. Theodorou, J. Buchli, and S. Schaal. A generalized path integral control approach to reinforcement learning. The Journal of Machine Learning Research, 11:3137-3181, 2010b.

E. Todorov. Efficient computation of optimal actions. Proceedings of the National Academy of Sciences, 106(28):11478-11483, 2009.

E. Todorov. Policy gradients in linearly-solvable MDPs. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 2298-2306, 2010.

R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229-256, 1992.
Appendix

6.1 PROOF OF THEOREM 3
Proof. The h-step Euler discretization of this system is given by

x' = x + ( a(x) + B(x) u ) h + B(x) ε,    ε ∼ N(0, hI).

We parameterize the control law as u(x, t) = f^θ(x, t) = f(x) θ_t, so that the system dynamics under θ becomes

x_{t+1} = x_t + ( a(x_t) + B(x_t) f(x_t) θ_t ) h + B(x_t) ε,    ε ∼ N(0, hI).

We can redefine B(x) to be B(x) f(x), so that the system dynamics simply becomes

x_{t+1} = x_t + a(x_t) h + B(x_t) ( θ_t h + ε ),    ε ∼ N(0, hI).

The control θ_t is corrupted by noise ε, and then the deterministic dynamics is applied with the noisy control. Further, assuming that B(x) is full rank, no two controls lead to the same next state. Hence, this is an NMDP (see Section 2.1) with

Π^θ(x_{t+1}|x_t) = N( x_t + ( a(x_t) + B(x_t) θ_t ) h,  h B(x_t) B(x_t)^T ).

Suppose that some subset (c) of the state dimensions is directly actuated, and let B_(c)(x) be the invertible square submatrix of B(x) formed by picking the rows corresponding to (c). Letting

ε_t = B_(c)(x_t)^{−1} ( ( x_{t+1} − x_t ) / h − a(x_t) ) − θ_t,

we have

∇_{θ_t} log Π^θ(X) = ∇_{θ_t} log Π^θ(x_{t+1}|x_t) = −h ∇_{θ_t} ( ε_t^T ε_t / 2 ) = h ε_t.

Thus, the expression for the NMDP-REINFORCE update has the given form. The PI2 update for these systems is given by (Theodorou et al., 2010b)

θ_t ← θ_t + ( ∫ exp( −S̃(X) ) ε_t(X) dX ) / ( ∫ exp( −S̃(X) ) dX ),

where

S̃(X) = S(X) + Σ_{t=0}^{T−1} ( ε_t^T ε_t + log det( B(x_t) B(x_t)^T ) ) / 2.

The second term is precisely −log Π^θ(X), which gives the result for the PI2 update.
6.2 PROOF OF THEOREM 5

Proof. We give the proof for LMDPs; the proof for NMDPs is even simpler. The policy-specific Bellman equation for an IHD LMDP is

v^θ(x) = q(x) + Σ_{x'} Π^θ(x'|x) ( log( Π^θ(x'|x) / Π^0(x'|x) ) + γ v^θ(x') ).

Differentiating this with respect to θ, we get

∇v^θ(x) = γ Σ_{x'} Π^θ(x'|x) ∇v^θ(x') + Σ_{x'} ∇Π^θ(x'|x) + Σ_{x'} ∇Π^θ(x'|x) ( γ v^θ(x') + log( Π^θ(x'|x) / Π^0(x'|x) ) ).

Since Σ_{x'} Π^θ(x'|x) = 1 for all θ, the second term is 0. Multiplying throughout by µ^θ(x) and summing over x, we get

Σ_x µ^θ(x) ∇v^θ(x) − γ Σ_x µ^θ(x) Σ_{x'} Π^θ(x'|x) ∇v^θ(x')
  = Σ_{x,x'} µ^θ(x) ∇Π^θ(x'|x) ( γ v^θ(x') + log( Π^θ(x'|x) / Π^0(x'|x) ) ).

From Lemma 1 (below), µ^θ(x') − µ_0(x') = γ Σ_x µ^θ(x) Π^θ(x'|x). Plugging this in, the left-hand side becomes

Σ_x µ^θ(x) ∇v^θ(x) − Σ_{x'} ( µ^θ(x') − µ_0(x') ) ∇v^θ(x') = Σ_x µ_0(x) ∇v^θ(x).

The LHS is precisely the gradient of the cost-to-go averaged over the initial state distribution, Σ_x µ_0(x) v^θ(x). Hence the result.

Lemma 1. For IHD LMDPs and NMDPs, the discounted state visit distribution µ^θ(x) under the control policy Π^θ(x'|x) and initial state distribution µ_0(x) satisfies

µ^θ(x') = µ_0(x') + γ Σ_x µ^θ(x) Π^θ(x'|x).

Proof. Follows from the Markov properties of the MDP.
6.3 PROOF OF THEOREM 8

Proof. We do the case of LMDPs. For the given policy parametrization, we have

log( Π^θ(x'|x) / Π^0(x'|x) ) = −f^θ(x') − log E_{x' ∼ Π^0(·|x)} [ exp( −f^θ(x') ) ].

The second term is a function of x alone, and cancels out in the sum in Theorem 5: Σ_{x'} ∇_θ Π^θ(x'|x) h(x) = h(x) Σ_{x'} ∇_θ Π^θ(x'|x) = 0, since Σ_{x'} Π^θ(x'|x) = 1 for all θ. Further, we have ∇ log Π^θ(x'|x) = Π^θ[∇_θ f](x) − ∇_θ f(x'). Thus, the policy gradient reduces to

Σ_x µ^θ(x) Π^θ[∇_θ f](x) Π^θ[ γ v^θ − f^θ ](x) − Σ_{x,x'} µ^θ(x) Π^θ(x'|x) ∇_θ f(x') ( γ v^θ(x') − f^θ(x') ).

The second term reduces to Σ_{x'} ( ( µ^θ(x') − µ_0(x') ) / γ ) ∇_θ f(x') ( γ v^θ(x') − f^θ(x') ). Relabeling x' to x, we get the result.