Deep Reinforcement Learning with Relative Entropy Stochastic Search

arXiv:1705.07606v1 [stat.ML] 22 May 2017

Voot Tangkaratt¹, Abbas Abdolmaleki², and Masashi Sugiyama¹,³

¹RIKEN Center for Advanced Intelligence Project, Japan
²The University of Aveiro, Portugal
³The University of Tokyo, Japan

Abstract

Many reinforcement learning methods for continuous control tasks are based on updating a policy function by maximizing an approximated action-value function or Q-function. However, the Q-function also depends on the policy, and this dependency often leads to unstable policy learning. To overcome this issue, we propose a method that does not greedily exploit the Q-function. To do so, we upper-bound the Kullback-Leibler divergence of the new policy while maximizing the Q-function. Furthermore, we also lower-bound the entropy of the new policy to maintain its exploratory behavior. We show that by using a Gaussian policy and a Q-function that is quadratic in actions, we can solve the corresponding constrained optimization problem in closed form. In addition, we show that our method can be regarded as a variant of the well-known deterministic policy gradient method. Through experiments, we evaluate the proposed method using a neural network as a function approximator and show that it gives more stable learning performance than the deep deterministic policy gradient method and the continuous Q-learning method.

1 Introduction

Policy search methods [Deisenroth et al., 2013] are reinforcement learning methods that represent a policy as a parameterized function and learn the optimal policy parameters which maximize the expected cumulative rewards. Policy search is an appealing approach for tackling continuous control tasks since it is naturally applicable to continuous state and action spaces. With the recent advances in deep learning [Goodfellow et al., 2016], policy search methods that can efficiently learn neural network policies are of great interest.

Policy gradient methods [Williams, 1992, Sutton et al., 1999, Peters and Schaal, 2006] are the most well-known policy search methods, where the idea is to update policy parameters using gradient estimates of the expected cumulative rewards. However, the gradient estimates of a stochastic policy are known to have high variance, which leads to slow convergence or premature convergence to a local optimum [Greensmith et al., 2004, Zhao et al., 2012], and this makes standard stochastic policy gradients not well suited for training neural networks. The issue can be avoided by using a deterministic policy and the deterministic policy gradients (DPG) [Silver et al., 2014]. The basic idea of DPG is to update a policy in directions that increase the values of the action-value function, which is also known as the Q-function. This simple yet elegant approach was extended to become one of the state-of-the-art deep reinforcement learning methods, called deep DPG (DDPG) [Lillicrap et al., 2015]. DDPG represents both a deterministic policy and an action-value function as neural networks, where the policy network is trained by DPG and the action-value network is trained by a deep Q-network (DQN) [Mnih et al., 2015]. While this combination was shown to be effective, a deterministic policy requires a suitable exploration strategy in order to ensure adequate exploration. However, a suitable strategy is often subjective and task dependent, while an improper exploration strategy generally leads to unstable policy updates or premature convergence to a local optimum, both of which are undesirable.

DQN was also used in the continuous deep Q-learning with the normalized advantage function (Q-NAF) method [Gu et al., 2016]. Q-NAF is a continuous variant of Q-learning where it imposes a specific form of an approximated action-value function such that a maximizer can be obtained analytically. Although this strategy leads to fast learning, a direct maximization of an action-value function often leads to unstable behavior since small changes in an action-value function can lead to large changes in the policy. This instability also gets more severe when DQN is used since the action-value network is learned based on a separate target network [Mnih et al., 2015].

The main contribution of this paper is a reinforcement learning method that can overcome the above instability issue. Our main idea is to learn a policy under the Kullback-Leibler (KL) divergence constraint. However, it is known that learning a neural network under the KL divergence constraint is not trivial, even when an approximated KL divergence is used [Amari, 1998]. To overcome this issue, we adopt the idea of the guided policy search (GPS) methods [Levine and Koltun, 2013, Levine et al., 2016], where we firstly learn a guiding policy under the constraint and then train a neural network using the guiding policy as the target function. However, unlike existing GPS methods, which learn locally linear guiding policies based on trajectory optimization, our method learns a global guiding policy based on a novel extension of the relative entropy stochastic search (RESS) [Abdolmaleki et al., 2015] method¹. Through our analysis, we show that our method is closely related to DPG and Q-NAF. Furthermore, under a condition on the action-value function, the proposed method can be regarded as a generalization of these two methods where policy learning can be systematically stabilized via the KL divergence constraint.

2 Notations

We consider discrete-time Markov decision processes (MDPs) with continuous state space $\mathcal{S} \subseteq \mathbb{R}^{d_s}$ and continuous action space $\mathcal{A} \subseteq \mathbb{R}^{d_a}$. We denote the state and action at time step $t \in \mathbb{N}$ by $s_t$ and $a_t$, respectively. The initial state $s_1$ is determined by the initial state density $s_1 \sim p_1(s)$. At time step $t$, the agent in state $s_t$ takes an action $a_t$ according to a policy $a_t \sim \pi(a|s = s_t)$ and obtains a reward $r_t = r(s_t, a_t)$. Then, the next state $s_{t+1}$ is determined by the transition function $s_{t+1} \sim p(s'|s = s_t, a = a_t)$. The performance of a policy $\pi$ is measured by the expected cumulative rewards defined by

$$\mathcal{J}(\pi) = \mathbb{E}_{p_1(s_1),\, \pi(a_t|s_t)_{t \ge 1},\, p(s_{t+1}|s_t, a_t)_{t \ge 1}}\left[\sum_{t=1}^{\infty} \gamma^{t-1} r(s_t, a_t)\right] = \mathbb{E}_{p^{\pi}(s),\, \pi(a|s)}\left[r(s, a)\right], \qquad (1)$$

where $\mathbb{E}_{p}[\cdot]$ denotes the expectation over the density $p$ and the subscript $t \ge 1$ indicates that the expectation is taken over the densities at all time steps. The (improper) discounted state density is defined as $p^{\pi}(s') = \mathbb{E}_{p_1(s)}\left[\sum_{t=1}^{\infty} \gamma^{t-1} p(s \to s', t, \pi)\right]$, where $p(s \to s', t, \pi)$ denotes the density of being in $s'$ after transitioning from $s$ for $t$ steps following $\pi$ [Silver et al., 2014]. The discount factor $0 < \gamma < 1$ assigns different weights to rewards received at different time steps. The goal of reinforcement learning is to learn the optimal policy $\pi^{\ast} = \operatorname{argmax}_{\pi} \mathcal{J}(\pi)$. We define the value of being in state $s$ by the state-value function $V^{\pi}(s) = \mathbb{E}_{\pi(a_t|s_t)_{t \ge 1},\, p(s_{t+1}|s_t, a_t)_{t \ge 1}}\left[\sum_{t=1}^{\infty} \gamma^{t-1} r(s_t, a_t) \,\middle|\, s_1 = s\right]$ and define the value of choosing action $a$ in state $s$ by the action-value function $Q^{\pi}(s, a) = \mathbb{E}_{\pi(a_t|s_t)_{t \ge 2},\, p(s_{t+1}|s_t, a_t)_{t \ge 1}}\left[\sum_{t=1}^{\infty} \gamma^{t-1} r(s_t, a_t) \,\middle|\, s_1 = s, a_1 = a\right]$.

¹Abdolmaleki et al. [2015] refers to the framework as model-based relative entropy stochastic search (MORE) since it learns a model of the objective function. However, since the term "model-based" is used differently in reinforcement learning, i.e., learning the transition model, we omit this term to avoid confusion and abbreviate it as RESS instead.
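As a small illustration of these definitions, the numpy sketch below (our own, not part of the paper) estimates $\mathcal{J}(\pi)$ by Monte Carlo rollouts; the gym-style `reset`/`step` interface and the `policy` callable are assumptions.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Compute sum_t gamma^(t-1) * r_t for one sampled trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def estimate_J(env, policy, gamma=0.99, n_episodes=100, horizon=50):
    """Monte Carlo estimate of J(pi) = E[sum_t gamma^(t-1) r(s_t, a_t)].

    `env` is assumed to expose gym-style reset()/step(a), and `policy(s)`
    is assumed to sample an action a ~ pi(a|s); both are illustrative.
    """
    returns = []
    for _ in range(n_episodes):
        s = env.reset()
        rewards = []
        for _ in range(horizon):
            a = policy(s)
            s, r, done, _ = env.step(a)
            rewards.append(r)
            if done:
                break
        returns.append(discounted_return(rewards, gamma))
    return np.mean(returns)
```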

3 Deep Reinforcement Learning with Relative Entropy Stochastic Search

The proposed method performs the following three steps to update the policy in each update iteration (see Algorithm 1). The first step is the policy evaluation step, where we learn an approximation of the action-value function which is local in the action space using an appropriate policy evaluation method (Section 4). In the second step, we learn a guiding policy which maximizes the approximated action-value function under the KL divergence constraints (Section 3.1). Then, in the last step, we learn a parametric policy in a supervised learning manner by using the guiding policy as a target function (Section 3.2).

3.1 Learning the Guiding Policy

In this step, our goal is to learn a guiding policy that maximizes an approximated action-value function while staying close to the current policy and maintaining explorative behavior. This goal can be formally defined as the following optimization problem:

$$\max_{\tilde{\pi}} \ \mathbb{E}_{p^{\beta}(s),\, \tilde{\pi}(a|s)}\left[\widehat{Q}(s, a)\right],$$
$$\text{subject to} \quad \mathbb{E}_{p^{\beta}(s)}\left[\mathrm{KL}\left(\tilde{\pi}(a|s) \,\|\, \pi(a|s)\right)\right] \le \epsilon, \quad \mathbb{E}_{p^{\beta}(s)}\left[\mathrm{H}\left(\tilde{\pi}(a|s)\right)\right] \ge \kappa, \qquad (2)$$

where $\tilde{\pi}(a|s)$ is the guiding policy to be optimized, $\beta(a|s)$ is a behavioral policy, $\pi(a|s)$ is the current policy, and $\widehat{Q}(s, a)$ is an approximated action-value function. The KL divergence is defined as $\mathrm{KL}(p(x) \,\|\, q(x)) = \mathbb{E}_{p(x)}[\log p(x) - \log q(x)]$, and Shannon's entropy is defined as $\mathrm{H}(p(x)) = -\mathbb{E}_{p(x)}[\log p(x)]$. The upper-bound $\epsilon$ is non-negative, while the lower-bound $\kappa$ can be negative. The KL constraint is commonly used in policy search to prevent divergence of policies due to excessively greedy updates [Kakade, 2001, Peters et al., 2010, Levine and Koltun, 2013, Abdolmaleki et al., 2015, Schulman et al., 2015]. The entropy constraint has been relatively uncommon in policy search. However, it was shown to be crucial for maintaining explorative behavior of the policy and preventing premature convergence [Abdolmaleki et al., 2015, Akrour et al., 2016, Mnih et al., 2016].

The above optimization problem can be solved by the method of Lagrange multipliers. The solution is given by

$$\tilde{\pi}(a|s) \propto \pi(a|s)^{\frac{\eta}{\eta+\omega}} \exp\left(\frac{\widehat{Q}(s, a)}{\eta+\omega}\right), \qquad (3)$$

where $\eta > 0$ and $\omega > 0$ are dual variables corresponding to the KL and entropy constraints, respectively. These dual variables are obtained by minimizing the dual function given by

$$g(\eta, \omega) = \eta\epsilon - \omega\kappa + (\eta+\omega)\, \mathbb{E}_{p^{\beta}(s)}\left[\log\left(\int \pi(a|s)^{\frac{\eta}{\eta+\omega}} \exp\left(\frac{\widehat{Q}(s, a)}{\eta+\omega}\right) \mathrm{d}a\right)\right]. \qquad (4)$$

The solution in Eq.(3) tells us that the guiding policy is obtained by weighting the current policy with $\widehat{Q}(s, a)$, $\eta$, and $\omega$. If we set $\epsilon = 0$, which implies $\eta \to \infty$, then we have $\tilde{\pi} \to \pi_{\theta}$ and the policy will not change. On the other hand, if we set $\epsilon \to \infty$, which implies $\eta \to 0$, then we have $\tilde{\pi}(a|s) \propto \exp(\widehat{Q}(s, a)/\omega)$, which is a softmax action selection strategy where $\omega$ is the temperature parameter.

Computing $\tilde{\pi}(a|s)$ and $g(\eta, \omega)$ is non-trivial for an arbitrary $\pi(a|s)$ due to the exponent $\frac{\eta}{\eta+\omega}$ in Eq.(3) and Eq.(4). To overcome this issue, we employ the following two assumptions. The first assumption is that $\pi(a|s)$ is a Gaussian distribution with mean $\phi(s)$ and covariance $Q(s)$, i.e.,

$$\pi(a|s) = \mathcal{N}\left(a \,\middle|\, \phi(s), Q(s)\right). \qquad (5)$$

This assumption is quite common in the policy search literature [Deisenroth et al., 2013], although most authors assume that the covariance is independent of states for simplicity. In this paper, we do not assume any specific form of the mean and covariance functions. The second assumption we employ is that $\widehat{Q}(s, a)$ is a local approximation such that it is quadratic in terms of actions with negative curvature. A general form of a function approximator which satisfies this assumption is

$$\widehat{Q}(s, a) = -a^{\top} W(s) a + a^{\top} \psi(s) + \chi(s). \qquad (6)$$

The (positive definite) matrix-valued function $W : \mathcal{S} \mapsto \mathbb{S}^{d_a}_{+}$, the vector-valued function $\psi : \mathcal{S} \mapsto \mathbb{R}^{d_a}$, and the scalar-valued function $\chi : \mathcal{S} \mapsto \mathbb{R}$ are state-dependent functions to be learned, whose details will be explained later in Section 4. While this locally quadratic model may seem restrictive since the action-value function may not be quadratic in actions, it has a great benefit since it is compatible with the Gaussian policy and it allows us to evaluate the guiding policy in a closed form, as shown below. Moreover, the functions $W(s)$, $\psi(s)$, and $\chi(s)$ are not restricted by this assumption and can be highly complex functions that compensate for any model misspecification from the action-dependent terms. Employing the quadratic model and Gaussian assumptions allows us to compute the guiding policy in a closed form as

$$\tilde{\pi}(a|s) = \mathcal{N}\left(a \,\middle|\, F^{-1}(s) L(s),\ (\eta+\omega) F^{-1}(s)\right), \qquad (7)$$

where the matrix $F(s) \in \mathbb{R}^{d_a \times d_a}$ and vector $L(s) \in \mathbb{R}^{d_a}$ are defined as

$$F(s) = \eta Q^{-1}(s) + 2 W(s), \qquad (8)$$
$$L(s) = \eta Q^{-1}(s) \phi(s) + \psi(s). \qquad (9)$$

Furthermore, the dual function which contains the intractable integral can now be evaluated via simple matrix-vector products:

$$g(\eta, \omega) = \eta\epsilon - \omega\kappa + (\eta+\omega)\, \mathbb{E}_{p^{\beta}(s)}\left[\log \sqrt{\frac{\left|2\pi(\eta+\omega) F^{-1}(s)\right|}{\left|2\pi Q(s)\right|^{\frac{\eta}{\eta+\omega}}}}\,\right] + \frac{1}{2}\, \mathbb{E}_{p^{\beta}(s)}\left[L(s)^{\top} F^{-1}(s) L(s) - \eta\, \phi(s)^{\top} Q^{-1}(s) \phi(s)\right]. \qquad (10)$$

The expectation can be approximated from state samples, and $\operatorname{argmin}_{\eta > 0,\, \omega > 0} g(\eta, \omega)$ can be solved by bounded-constraint optimization methods such as L-BFGS-B [Nocedal and Wright, 2006]² (see the sketch at the end of this subsection). Note that to solve this sub-optimization problem we only need to compute $Q^{-1}(s)$, $W(s)$, $\phi(s)$, and $\psi(s)$ for each state sample, and the optimization variables are just two scalars. While we still need to compute $F^{-1}(s)$, this inversion is relatively inexpensive since the dimension of actions is much smaller than that of states in practice.

As shown in Eq.(7), the guiding policy is again a Gaussian distribution. It may then seem that one can use $\mathcal{N}(a \,|\, F^{-1}(s) L(s), (\eta+\omega) F^{-1}(s))$ as the policy for the next iteration. However, this guiding policy would be a function of all the past policies. More specifically, $F^{-1}_{i}(s)$ and $L_{i}(s)$ that are learned at the $i$-th iteration depend on $F^{-1}_{i-1}(s)$ and $L_{i-1}(s)$ from the $(i-1)$-th iteration, which again depend on $F^{-1}_{i-2}(s)$ and $L_{i-2}(s)$, and so on. This means that we would be required to keep track of all past parameters throughout the learning phase, which is severely impractical, especially for large neural networks.

A similar optimization problem to Eq.(2) was previously applied to trajectory optimization for reinforcement learning [Akrour et al., 2016]. The proposed approach differs from this previous work in two respects. Firstly, we allow the policy to be a Gaussian of any form, such as a nonlinear Gaussian policy, while the previous work requires the policy to be a linear Gaussian. This linear Gaussian assumption makes the previous work not suitable for training nonlinear Gaussian policies such as neural networks. Secondly, our method is an off-policy reinforcement learning method, which allows us to efficiently reuse samples from different policies.

²We did not take into account the discount factor $\gamma^{t-1}$ from the discounted state distribution when approximating expectations. However, neglecting the discount factor often leads to better data efficiency since none of the data samples are discarded [Thomas, 2014].
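To make Eqs.(7)-(10) concrete, the following numpy/scipy sketch computes the guiding-policy parameters and minimizes the dual over a mini-batch of sampled states. This is our own illustration rather than the authors' implementation; the per-state inputs (the current policy's $\phi(s)$ and $Q^{-1}(s)$, the critic's $W(s)$ and $\psi(s)$) and all names are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def guiding_policy(eta, omega, Q_inv, W, phi, psi):
    """Closed-form guiding policy of Eq.(7): mean F^-1 L, covariance (eta+omega) F^-1."""
    F = eta * Q_inv + 2.0 * W                      # Eq.(8)
    L = eta * Q_inv @ phi + psi                    # Eq.(9)
    F_inv = np.linalg.inv(F)
    return F_inv @ L, (eta + omega) * F_inv        # mean, covariance

def dual(params, batch, epsilon, kappa):
    """Sample-based dual g(eta, omega) of Eq.(10); `batch` holds per-state
    tuples (Q_inv, W, phi, psi) for the sampled states."""
    eta, omega = params
    g = eta * epsilon - omega * kappa
    for Q_inv, W, phi, psi in batch:
        F = eta * Q_inv + 2.0 * W
        L = eta * Q_inv @ phi + psi
        F_inv = np.linalg.inv(F)
        # log sqrt( |2 pi (eta+omega) F^-1| / |2 pi Q|^(eta/(eta+omega)) )
        _, logdet_Finv = np.linalg.slogdet(2.0 * np.pi * (eta + omega) * F_inv)
        _, logdet_Q = np.linalg.slogdet(2.0 * np.pi * np.linalg.inv(Q_inv))
        log_term = 0.5 * (logdet_Finv - eta / (eta + omega) * logdet_Q)
        quad_term = 0.5 * (L @ F_inv @ L - eta * phi @ Q_inv @ phi)
        g += ((eta + omega) * log_term + quad_term) / len(batch)
    return g

def solve_dual(batch, epsilon=0.1, kappa=-10.0, x0=(1.0, 1.0)):
    """Minimize g over eta > 0, omega > 0 with L-BFGS-B (numerical gradients)."""
    res = minimize(dual, x0, args=(batch, epsilon, kappa),
                   method="L-BFGS-B", bounds=[(1e-6, None), (1e-6, None)])
    return res.x  # (eta, omega)
```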

3.2 Learning the Parametric Policy

In this step, we learn a new parametric policy which well represents the guiding policy and subsequently use the learned parametric policy in the next iteration. Since the guiding policy is Gaussian and we also require a Gaussian policy for the next iteration, a natural choice for the parametric policy is a parameterized Gaussian policy $\pi_{\theta}(a|s) = \mathcal{N}(a \,|\, \mu_{\theta}(s), \Sigma_{\theta}(s))$ with parameters $\theta$. We propose to learn the parameters $\theta$ by minimizing a loss function defined as the expected KL divergence:

$$\mathcal{L}(\theta) = \mathbb{E}_{p^{\beta}(s)}\left[\mathrm{KL}\left(\pi_{\theta}(a|s) \,\|\, \tilde{\pi}(a|s)\right)\right]. \qquad (11)$$

Minimizing a divergence to learn the parameters of a distribution is a common approach in statistics and machine learning [Bishop, 2006]. While various divergences could be used instead of the KL divergence, there are two advantages of minimizing the KL divergence for our method. Firstly, for two Gaussian distributions $\pi_{\theta}(a|s)$ and $\tilde{\pi}(a|s)$, this loss function is expressed as

$$\mathbb{E}_{p^{\beta}(s)}\left[\frac{1}{\eta+\omega} \mathrm{Tr}\left(F(s) \Sigma_{\theta}(s)\right) - \log\left|\Sigma_{\theta}(s)\right| + \frac{1}{\eta+\omega} \left\|\mu_{\theta}(s) - F^{-1}(s) L(s)\right\|_{F}^{2}\right] + \mathrm{const}, \qquad (12)$$

where $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix and $\|x\|_{F}^{2} = x^{\top} F x$. This loss function can be straightforwardly minimized by first-order methods such as gradient descent, where the expectation is approximated from state samples. In our implementation, we represent the mean function $\mu_{\theta}(s)$ and the covariance function $\Sigma_{\theta}(s)$ by a neural network with two separate outputs. The output covariance is required to be positive definite, which can be achieved by minimizing a constrained variant of $\mathcal{L}(\theta)$. However, solving a constrained problem is often computationally expensive. To be computationally more efficient, we adopt the idea proposed by Gu et al. [2016] and reparameterize the covariance function by a lower-triangular matrix $U_{\theta}(s)$ (including the main diagonal), learning the vectorized matrix $\mathrm{vec}(U_{\theta}(s))$ as an output layer of the neural network. Then, we can simply obtain the covariance function based on the Cholesky decomposition of positive definite matrices, i.e., $\Sigma_{\theta}(s) = U_{\theta}(s) U_{\theta}(s)^{\top}$.

The second advantage of minimizing $\mathcal{L}(\theta)$ is that its gradient w.r.t. the parameters of $\mu_{\theta}(s)$ has an interesting interpretation, as follows. Firstly, notice that from $\widehat{Q}(s, a) = -a^{\top} W(s) a + a^{\top} \psi(s) + \chi(s)$, we can express an action as a function of $\nabla_{a} \widehat{Q}(s, a)$ as

$$a = -\frac{1}{2} W^{-1}(s) \nabla_{a} \widehat{Q}(s, a) + \frac{1}{2} W^{-1}(s) \psi(s). \qquad (13)$$

This is possible since $W(s)$ is positive definite and invertible by our construction. Letting $\mathcal{L}_{\mu}(\theta) = \frac{1}{\eta+\omega} \mathbb{E}_{p^{\beta}(s)}\left[\left\|\mu_{\theta}(s) - F^{-1}(s) L(s)\right\|_{F}^{2}\right]$, its gradient

can be expressed as a function of $\nabla_{a} \widehat{Q}(s, a)$ as follows:

$$-\nabla_{\theta} \mathcal{L}_{\mu}(\theta) = \frac{1}{\eta+\omega} \mathbb{E}_{p^{\beta}(s)}\left[\nabla_{\theta} \mu_{\theta}(s)^{\top} F(s) W^{-1}(s) \nabla_{a} \widehat{Q}(s, a)\big|_{a=\mu_{\theta}(s)}\right] - \frac{1}{\eta+\omega} \mathbb{E}_{p^{\beta}(s)}\left[\nabla_{\theta} \mu_{\theta}(s)^{\top} F(s) W^{-1}(s) \nabla_{a} \widehat{Q}(s, a)\big|_{a=F^{-1}(s)L(s)}\right], \qquad (14)$$

where $\nabla_{a} \widehat{Q}(s, a)\big|_{a=\mu_{\theta}(s)}$ denotes the gradient evaluated at the point $a = \mu_{\theta}(s)$. Since we have $F(s) = \eta Q^{-1}(s) + 2 W(s)$, we can express this gradient as

$$-\nabla_{\theta} \mathcal{L}_{\mu}(\theta) = \frac{2}{\eta+\omega} \mathbb{E}_{p^{\beta}(s)}\left[\nabla_{\theta} \mu_{\theta}(s)^{\top} \left(\nabla_{a} \widehat{Q}(s, a)\big|_{a=\mu_{\theta}(s)} - \nabla_{a} \widehat{Q}(s, a)\big|_{a=F^{-1}(s)L(s)}\right)\right] + \frac{\eta}{\eta+\omega} \mathbb{E}_{p^{\beta}(s)}\left[\nabla_{\theta} \mu_{\theta}(s)^{\top} Q^{-1}(s) W^{-1}(s) \left(\nabla_{a} \widehat{Q}(s, a)\big|_{a=\mu_{\theta}(s)} - \nabla_{a} \widehat{Q}(s, a)\big|_{a=F^{-1}(s)L(s)}\right)\right]. \qquad (15)$$

The first term $\mathbb{E}_{p^{\beta}(s)}\left[\nabla_{\theta} \mu_{\theta}(s)^{\top} \nabla_{a} \widehat{Q}(s, a)\big|_{a=\mu_{\theta}(s)}\right]$ is precisely the deterministic policy gradient with a deterministic policy $\mu_{\theta}(s)$ [Silver et al., 2014]. From this analysis, the gradient $-\nabla_{\theta} \mathcal{L}_{\mu}(\theta)$ can be viewed as a biased variant of the deterministic policy gradient, where the amount of bias depends on $\eta$. Indeed, we can verify that since $\lim_{\eta \to 0} F^{-1}(s) L(s) = \frac{1}{2} W^{-1}(s) \psi(s)$, we have $\lim_{\eta \to 0} -\nabla_{\theta} \mathcal{L}_{\mu}(\theta) = \frac{2}{\omega} \mathbb{E}_{p^{\beta}(s)}\left[\nabla_{\theta} \mu_{\theta}(s)^{\top} \nabla_{a} \widehat{Q}(s, a)\big|_{a=\mu_{\theta}(s)}\right]$, which is the original deterministic policy gradient where the factor $2/\omega$ can be subsumed into the gradient step size. Therefore, the mean update rule of the proposed method can be regarded as a generalization of the off-policy deterministic actor-critic algorithm [Silver et al., 2014] when a quadratic function is used to approximate the action-value function. It should be mentioned that while a loss function similar to $\mathcal{L}(\theta)$ was previously used to train a neural network policy from guiding policies by GPS [Levine et al., 2016], the relationship to deterministic policy gradients has not been explored in previous works.
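One way to realize this parameterization and loss is sketched below in PyTorch. This is our own illustration, not the authors' code: the layer sizes, the softplus applied to the diagonal of the Cholesky factor, and all names are our choices.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """pi_theta(a|s) = N(a | mu_theta(s), U_theta(s) U_theta(s)^T)."""
    def __init__(self, ds, da, hidden=100):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(ds, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.mean_head = nn.Linear(hidden, da)
        self.chol_head = nn.Linear(hidden, da * da)   # vec(U_theta(s))
        self.da = da

    def forward(self, s):
        h = self.trunk(s)
        mu = self.mean_head(h)
        U = torch.tril(self.chol_head(h).view(-1, self.da, self.da))
        # a softplus on the diagonal keeps Sigma = U U^T positive definite
        diag = torch.diagonal(U, dim1=-2, dim2=-1)
        U = U - torch.diag_embed(diag) \
              + torch.diag_embed(nn.functional.softplus(diag) + 1e-5)
        Sigma = U @ U.transpose(-1, -2)
        return mu, Sigma

def kl_loss(mu, Sigma, F, L, eta, omega):
    """Sample-based loss of Eq.(12), dropping additive constants:
    (1/(eta+omega)) Tr(F Sigma) - log|Sigma| + (1/(eta+omega)) ||mu - F^-1 L||_F^2."""
    guide_mean = torch.linalg.solve(F, L.unsqueeze(-1)).squeeze(-1)   # F^-1 L
    diff = (mu - guide_mean).unsqueeze(-1)
    quad = (diff.transpose(-1, -2) @ F @ diff).squeeze(-1).squeeze(-1)
    trace = torch.diagonal(F @ Sigma, dim1=-2, dim2=-1).sum(-1)
    logdet = torch.logdet(Sigma)
    return ((trace + quad) / (eta + omega) - logdet).mean()
```

In this sketch, `F`, `L`, `eta`, and `omega` are the per-state guiding-policy quantities from Section 3.1, and the returned scalar can be minimized with any first-order optimizer such as Adam.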

4 Policy Evaluation

The proposed method requires an approximated action-value function $\widehat{Q}(s, a)$ for the policy update. Many existing policy evaluation methods can be used to learn the function approximator $\widehat{Q}(s, a) = -a^{\top} W(s) a + a^{\top} \psi(s) + \chi(s)$. In this paper, we use a neural network as the function approximator and use deep Q-learning [Mnih et al., 2015] to learn the weight parameters of this network. The main reason that makes Q-learning suitable for our method is that the maximum of the approximated action-value function can be obtained analytically in our model. More specifically, a critical point of $\widehat{Q}(s, a)$ w.r.t. $a$ can be obtained by setting $\nabla_{a} \widehat{Q}(s, a)$ to zero. Since $\widehat{Q}(s, a)$ is concave in actions, such a critical point is the solution of $\max_{a} \widehat{Q}(s, a)$, and thus we have the maximum action-value:

$$\max_{a} \widehat{Q}(s, a) = \frac{1}{4} \psi(s)^{\top} W^{-1}(s) \psi(s) + \chi(s). \qquad (16)$$
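As a quick numerical sanity check of Eq.(16) (our own illustration, not part of the paper), the critical point of the quadratic model is $a^{\ast} = \frac{1}{2} W^{-1}(s) \psi(s)$, and substituting it back recovers the stated maximum:

```python
import numpy as np

rng = np.random.default_rng(0)
da = 3
A = rng.normal(size=(da, da))
W = A @ A.T + da * np.eye(da)      # a random positive definite W(s)
psi = rng.normal(size=da)          # psi(s)
chi = 1.7                          # chi(s)

Q_hat = lambda a: -a @ W @ a + a @ psi + chi

a_star = 0.5 * np.linalg.solve(W, psi)                    # critical point of Q_hat
max_value = 0.25 * psi @ np.linalg.solve(W, psi) + chi    # Eq.(16)

assert np.isclose(Q_hat(a_star), max_value)
# any perturbation can only decrease the value since Q_hat is concave in a
assert Q_hat(a_star + 0.1 * rng.normal(size=da)) <= max_value
```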

Then, by following the deep Q-learning method, the parameters of $W(s)$, $\psi(s)$, and $\chi(s)$ are learned by minimizing

$$\frac{1}{N} \sum_{n=1}^{N} \left[\widehat{Q}(s_n, a_n) - y_n\right]^{2}, \qquad (17)$$

where $y_n = r_n + \gamma\left(\frac{1}{4} \bar{\psi}(s'_n)^{\top} \bar{W}^{-1}(s'_n) \bar{\psi}(s'_n) + \bar{\chi}(s'_n)\right)$ is the target value computed from a separate target network $\bar{Q}(s, a) = -a^{\top} \bar{W}(s) a + a^{\top} \bar{\psi}(s) + \bar{\chi}(s)$. The target network is learned using soft weight updates $\bar{w} \leftarrow \tau w + (1 - \tau) \bar{w}$, where $w$ and $\bar{w}$ are the weights of the action-value network and the target network, respectively [Lillicrap et al., 2015]. The samples $\{(s_n, a_n, r_n, s'_n)\}_{n=1}^{N}$ are mini-batch samples drawn uniformly from a replay buffer. Since we require that $W(s)$ always gives positive definite matrices, we use the same trick as Gu et al. [2016], mentioned previously, and learn a lower-triangular matrix $L(s)$ such that $W(s) = L(s) L(s)^{\top}$ instead.

A similar quadratic model to ours was previously considered by Gu et al. [2016], which uses the following quadratic model to approximate the action-value function:

$$\widehat{Q}(s, a) = -(a - b(s))^{\top} P(s) (a - b(s)) + V(s), \qquad (18)$$

where $P(s)$ is a positive definite matrix. The authors called this model the normalized advantage function (NAF) since the quadratic term acts like an advantage function while $V(s)$ acts like a state-value function. Indeed, this model also satisfies our assumption, in which $W(s) = P(s)$, $\psi(s) = 2 P(s) b(s)$, and $\chi(s) = V(s) - b(s)^{\top} P(s) b(s)$. While the two models are

similar, Gu et al. [2016] used the model very differently from our approach. More specifically, they used the NAF model together with Q-learning to compute analytic actions which maximize the action-value function. Since the NAF model is maximized for $a = b(s)$, they directly used the function $b(s)$ as a deterministic policy. This method is called continuous Q-learning with the normalized advantage function (Q-NAF) and was shown to be able to outperform more complicated methods such as DDPG on various control tasks [Gu et al., 2016].

However, as previously mentioned, a greedy maximization approach such as Q-NAF is prone to instability since small changes in the action-value function may cause large changes in the policy, especially when deep Q-learning is used, where a target network is involved during learning. In contrast, our method finds a policy which maximizes $\widehat{Q}(s, a)$ under the KL and entropy constraints, which cause the policy to be updated gradually. While this could lead to slower learning when compared with greedy approaches, it also allows the agent to slowly explore the problem space, which can yield a better policy.

Another advantage of the NAF model is that the target value $y_n$ for Q-learning can be directly obtained as $y_n = r_n + \gamma V(s'_n)$ without the need to compute the inverse $W^{-1}(s)$ in Eq.(16). Due to this advantage, we also implement the NAF model as a function approximator for our method. Note that the number of parameters to be learned for the models in Eq.(6) and Eq.(18) is the same. The pseudocode of the proposed method using the NAF model and the deep Q-learning method is given in Algorithm 1. We call the proposed method Deep reinforcement learning with Relative Entropy Stochastic Search (DRESS).
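To make this section concrete, here is a minimal PyTorch sketch of a NAF-style critic whose quadratic term uses a lower-triangular factor, together with the Q-learning target $y_n = r_n + \gamma V(s'_n)$ and the soft target update. This is our own illustration with placeholder names and sizes, not the authors' code; mini-batch tensor shapes are assumptions.

```python
import torch
import torch.nn as nn

class NAFCritic(nn.Module):
    """Q_hat(s,a) = -(a - b(s))^T P(s) (a - b(s)) + V(s), with P(s) = L(s) L(s)^T."""
    def __init__(self, ds, da, hidden=100):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(ds, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.L_head = nn.Linear(hidden, da * da)
        self.b_head = nn.Linear(hidden, da)
        self.V_head = nn.Linear(hidden, 1)
        self.da = da

    def forward(self, s, a):
        h = self.trunk(s)
        L = torch.tril(self.L_head(h).view(-1, self.da, self.da))
        P = L @ L.transpose(-1, -2)                       # positive semi-definite by construction
        d = (a - self.b_head(h)).unsqueeze(-1)
        adv = -(d.transpose(-1, -2) @ P @ d).squeeze(-1)  # normalized advantage term
        v = self.V_head(h)
        return adv + v, v                                 # Q_hat(s,a), V(s)

def q_learning_step(critic, target, batch, optimizer, gamma=0.99, tau=0.001):
    s, a, r, s_next = batch                   # mini-batch tensors from the replay buffer
    with torch.no_grad():
        _, v_next = target(s_next, a)         # only V(s') is used; the action is ignored
        y = r + gamma * v_next                # target value y_n = r_n + gamma V(s'_n)
    q, _ = critic(s, a)
    loss = ((q - y) ** 2).mean()              # squared error of Eq.(17)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # soft target update: w_bar <- tau * w + (1 - tau) * w_bar
    for p, p_bar in zip(critic.parameters(), target.parameters()):
        p_bar.data.mul_(1.0 - tau).add_(tau * p.data)

# usage sketch: target = NAFCritic(ds, da); target.load_state_dict(critic.state_dict());
#               optimizer = torch.optim.Adam(critic.parameters(), lr=1e-3)
```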

5 Experiments

We evaluate the proposed method against the deep deterministic policy gradient (DDPG) method [Lillicrap et al., 2015] and continuous deep Q-learning with the normalized advantage function (Q-NAF) [Gu et al., 2016]. In order to focus on the quality of policy learning, all methods use the NAF model with the deep Q-learning method, where the neural network has 4 hidden layers and 100 hidden units per layer with the rectified linear unit (ReLU) activation function. The first two layers are fully connected layers, and the third and fourth layers are fully connected layers separated for the three components of the NAF model, i.e., $L(s)$, $b(s)$, and $V(s)$.

Algorithm 1 Deep reinforcement learning with relative entropy stochastic search
Input: Initial policy network $\pi_{\theta}(a|s)$, action-value network $\widehat{Q}(s, a)$, target action-value network $\bar{Q}(s, a)$, bounds $\epsilon$, $\kappa$, and data buffer $\mathcal{D} = \emptyset$.
for $i = 1, \ldots, I_{\max}$ do
    for $t = 1, \ldots, T_{\max}$ do
        Observe state $s_t$ and sample action $a_t \sim \pi_{\theta}(a|s_t)$
        Execute $a_t$, receive reward $r_t$ and next state $s'_t$
        Add transition $\{s_t, a_t, r_t, s'_t\}$ to the data buffer $\mathcal{D}$
        for $k = 1, \ldots, K_{\max}$ do
            Sample $N$ mini-batch samples $\{(s_n, a_n, r_n, s'_n)\}_{n=1}^{N}$ from $\mathcal{D}$
            Update $\widehat{Q}(s, a)$ and $\bar{Q}(s, a)$ by, e.g., deep Q-learning (Section 4)
            Compute $\tilde{\pi}(a|s)$ based on $\widehat{Q}(s, a)$ by relative entropy stochastic search (Section 3.1)
            Update $\pi_{\theta}(a|s)$ based on $\tilde{\pi}(a|s)$ by supervised learning (Section 3.2)
        end for
    end for
end for
Output: Learned policy $\pi_{\theta}(a|s)$.


For DDPG, the policy network has 4 hidden layers with 100 ReLU hidden units per layer. For the proposed method, the policy network outputs both the mean and covariance functions, where the first two layers are fully connected layers with 100 ReLU units and the last two layers are fully connected layers separated for each output. The weights are initialized randomly according to Glorot and Bengio [2010] and are trained by Adam [Kingma and Ba, 2014] with learning rate 0.001 for the Q-network and 0.0001 for the policy network. Note that under this architecture, Q-NAF learns the smallest number of parameters (only the Q-network) and the proposed method learns the largest number of parameters (the Q-network and a policy network for both mean and covariance).

5.1 Markov Chain

We first investigate the behavior of the proposed method on a continuous Markov chain task. The state and action spaces of the chain are $\mathcal{S} = [0, 1]^{d_s}$ and $\mathcal{A} = [-0.4, 0.4]^{d_a}$, respectively, where $d_s = d_a = 20$. The transition function is a linear function with Gaussian noise: $s' = s + a + e$, where $e \sim \mathcal{N}(e \,|\, 0, 0.3^2 I)$. The reward function is given by $r(s, a) = \exp\left(-\|s - 0.5\|_2^2 / (2 (0.6)^2)\right) - 0.5 \|a\|_1$. The initial value of the $i$-th dimension of $s$ can be either 0 or 1 with equal probability. We consider an episodic scenario where each episode lasts for 50 time steps. We set the discount factor to $\gamma = 0.99$. Although we tried to perform experiments where all methods have the same initial performance, we were not able to do so in this task because the initial performance of Q-NAF and DDPG is very sensitive to the exploration strategy due to the high-dimensional actions. We therefore tried several values of the variance $\sigma$ of the Ornstein-Uhlenbeck process and chose $\sigma = 0.0005$, which gave the best results at 50 training episodes.

The result in Figure 1 shows that all methods provide stable performance on this task. It also shows that DDPG and Q-NAF can improve the policy very quickly and yield better performance in the early stage of learning. However, they prematurely converge to local solutions and could not further improve their performance even after more training episodes are gathered. In contrast, the proposed method gradually improves the policy and is able to find a better policy than that of DDPG and Q-NAF, even though it has poorer initial performance.
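For reference, here is a minimal numpy sketch of the chain task described above. It is our own reconstruction from the text: the class name, the gym-style interface, and the clipping of states to $[0, 1]^{d_s}$ are assumptions.

```python
import numpy as np

class ContinuousChain:
    """20-dimensional continuous chain: s' = s + a + e, e ~ N(0, 0.3^2 I)."""
    def __init__(self, dim=20, horizon=50, seed=0):
        self.dim, self.horizon = dim, horizon
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.t = 0
        # each dimension starts at 0 or 1 with equal probability
        self.s = self.rng.integers(0, 2, size=self.dim).astype(float)
        return self.s.copy()

    def step(self, a):
        a = np.clip(a, -0.4, 0.4)
        # r(s, a) = exp(-||s - 0.5||_2^2 / (2 * 0.6^2)) - 0.5 * ||a||_1
        reward = (np.exp(-np.sum((self.s - 0.5) ** 2) / (2 * 0.6 ** 2))
                  - 0.5 * np.sum(np.abs(a)))
        noise = self.rng.normal(scale=0.3, size=self.dim)
        # clipping the next state to [0, 1]^ds is our assumption about the state space
        self.s = np.clip(self.s + a + noise, 0.0, 1.0)
        self.t += 1
        return self.s.copy(), reward, self.t >= self.horizon, {}
```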


5.2 Pendulum Swing Up

Next, we evaluate the performance using the pendulum swing-up task from the OpenAI Gym platform [Brockman et al., 2016]. In this standard benchmark task, the agent needs to swing up a pendulum and stabilize it in the upright position. The state is 3-dimensional and the action is 1-dimensional. This task is particularly interesting for our evaluation since the reward function is quadratic with negative curvature and is given by $r(s, a) = -0.0001 a^2 - \theta^2 - 0.1 \dot{\theta}^2$, where $\theta$ and $\dot{\theta}$ are computed from $s$. Since the transition function is deterministic, the Q-function is a discounted summation of quadratic functions with negative curvature. Therefore, a quadratic model such as the NAF model should be able to accurately approximate this function, and we could expect that methods which greedily maximize this approximation would give the best results. However, the results of Q-NAF and DDPG in Figure 2 show that greedy approaches can lead to unstable performance and premature convergence to suboptimal policies. In contrast, even though the proposed method seems to learn more slowly in the early stage of learning when compared to DDPG, its performance is more stable, which allows it to find a better policy. This result emphasizes the advantage of upper-bounding the KL divergence in order to prevent excessively greedy policy updates.

6 Conclusions

In this paper, we proposed a deep reinforcement learning method which focuses on stable policy learning. Unlike existing methods, which iteratively maximize an approximated Q-function in a greedy manner, our method instead iteratively maximizes the Q-function under KL divergence and entropy constraints. We showed through experiments that the stability introduced by these constraints allows our method to perform better and avoid poor local optima. We further showed that our method can be regarded as a variant of the deterministic policy gradient methods where the gradients are biased. This result suggests that a biased variant of the deterministic policy gradient can be useful, and this will be investigated in our future work. On a more practical side, while we only evaluated our method on simple benchmark tasks, the results still tell us that stable learning is important for obtaining better policies even on the simplest control tasks.


Figure 1: The performance of each method on the Markov chain task. The averaged returns are computed over 100 testing episodes. The error bars indicate the standard error, but they are small and negligible for this task.

Figure 2: The performance of each method on the pendulum task. The averaged returns are computed over 100 testing episodes. The error bars indicate the standard error.

Indeed, our future work includes using the proposed method for real-world tasks where stable learning behavior is strongly desired, such as in robot control.

References

Marc Peter Deisenroth, Gerhard Neumann, and Jan Peters. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1-2):1–142, 2013. doi: 10.1561/2300000021.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992. doi: 10.1007/BF00992696.

Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, NIPS 1999, Denver, Colorado, USA, November 29 - December 4, 1999, pages 1057–1063, 1999.


Jan Peters and Stefan Schaal. Policy gradient methods for robotics. In 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2006, October 9-15, 2006, Beijing, China, pages 2219–2225, 2006. doi: 10.1109/IROS.2006.282564.

Evan Greensmith, Peter L. Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5:1471–1530, 2004.

Tingting Zhao, Hirotaka Hachiya, Gang Niu, and Masashi Sugiyama. Analysis and improvement of policy gradient estimation. Neural Networks, 26:118–129, 2012. doi: 10.1016/j.neunet.2011.09.005.

David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin A. Riedmiller. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 387–395, 2014.

Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015. ISSN 0028-0836.

Shixiang Gu, Timothy P. Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep Q-learning with model-based acceleration. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 2829–2838, 2016.

Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, February 1998. ISSN 0899-7667. doi: 10.1162/089976698300017746.

Sergey Levine and Vladlen Koltun. Guided policy search. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pages 1–9, 2013.


Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(1):1334–1373, January 2016. ISSN 1532-4435.

Abbas Abdolmaleki, Rudolf Lioutikov, Jan Peters, Nuno Lau, Luís Paulo Reis, and Gerhard Neumann. Model-based relative entropy stochastic search. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 3537–3545, 2015.

Sham Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems 14, NIPS 2001, December 3-8, 2001, Vancouver, British Columbia, Canada, pages 1531–1538, 2001.

Jan Peters, Katharina Mülling, and Yasemin Altun. Relative entropy policy search. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2010, Atlanta, Georgia, USA, July 11-15, 2010, 2010.

John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, and Pieter Abbeel. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, pages 1889–1897. JMLR.org, 2015.

Riad Akrour, Gerhard Neumann, Hany Abdulsamad, and Abbas Abdolmaleki. Model-free trajectory optimization for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 2961–2970, 2016.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1928–1937, New York, NY, USA, 2016. PMLR.

Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, second edition, 2006.

Philip S. Thomas. Bias in natural actor-critic algorithms. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, pages I-441–I-448. JMLR.org, 2014.


Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, New York, NY, USA, 2006. ISBN 0387310738.

Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy, 2010. PMLR.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. CoRR, abs/1606.01540, 2016.

