arXiv:1610.01283v2 [cs.LG] 10 Oct 2016

EPOpt: Learning Robust Neural Network Policies Using Model Ensembles

Aravind Rajeswaran, University of Washington Seattle ([email protected])
Sarvjeet Ghotra, NITK Surathkal ([email protected])
Sergey Levine, University of California Berkeley ([email protected])
Balaraman Ravindran, Indian Institute of Technology Madras ([email protected])

Abstract

Sample complexity and safety are major challenges when learning policies with reinforcement learning for real-world tasks, especially when the policies are represented using rich function approximators like deep neural networks. Model-based methods, where the real-world target domain is approximated using a simulated source domain, provide an avenue to tackle these challenges by augmenting real data with simulated data. However, discrepancies between the simulated source domain and the target domain pose a challenge for simulated training. We introduce the EPOpt algorithm, which uses an ensemble of simulated source domains and a form of adversarial training to learn policies that are robust and generalize to a broad range of possible target domains, including ones with unmodeled effects. Further, the probability distribution over source domains in the ensemble can be adapted using data from the target domain and approximate Bayesian methods, to progressively make it a better approximation. Thus, learning on a model ensemble, along with source domain adaptation, provides the benefits of both robustness and learning/adaptation.

1 Introduction

Reinforcement learning methods with powerful function approximators such as deep neural networks (deep RL) have recently demonstrated remarkable success in a wide range of simulated tasks like games [1, 2], simulated robotic control problems [3, 4], and graphics [5]. However, high sample complexity poses a major challenge for directly applying these deep RL methods to real-world robotic control. Model-free methods like Q-learning, actor-critic, and policy gradients are known to suffer from long learning times [6], which is compounded when they are combined with expressive function approximators like deep neural networks. The challenge of gathering samples from the real world is further exacerbated by safety concerns for the agent and environment when sampling with partially learned policies, which could be unstable [7]. Thus, model-free deep RL methods often require a prohibitively large number of potentially dangerous samples for real-world control tasks.

Model-based methods, where the target domain is approximated with a simulated source domain, provide an avenue to tackle these challenges by learning policies using simulated data. The principal challenge with simulated training is the systematic discrepancy between source and target domains. We show that this discrepancy can be mitigated through two key ideas: (1) training on an ensemble of models in an adversarial fashion to learn policies that are robust not only to parametric model errors, but also to unmodeled effects; and (2) adaptation of the source domain distribution using data from the target domain, to progressively make it a better approximation. This approach can be viewed as an instance of model-based Bayesian RL [8], or as an instance of transfer learning from a collection of simulated source domains to a real-world target domain [9].

Standard model-based RL methods typically operate by finding a maximum-likelihood estimate of the target dynamics model [10, 11], followed by policy optimization. This approach has two drawbacks: (a) for safety-critical systems, a good policy might be required to gather data from the target domain for model identification, and obtaining such a policy is not directly addressed; (b) RL algorithms exploit all aspects of the model dynamics to find optimal policies, so errors in point estimates of model parameters can lead to inadvertent or sub-optimal behaviors in the target domain. Previously, Bayesian RL methods have been explored to address these drawbacks [12, 13]. Although such methods are highly general, they have not yet been demonstrated on high-dimensional, continuous control tasks of the scale explored in this paper.

In this work, we propose the Ensemble Policy Optimization (EPOpt-ε) algorithm, which uses an ensemble of source domains and a form of adversarial training to learn robust policies that generalize to a broad range of models. By robust, we primarily mean robustness to parametric model errors and near-optimal performance according to the following metrics: (a) jumpstart: average initial performance in the target domain; (b) worst-trajectory: return corresponding to the worst trajectory in the target domain. By adversarial training, we mean that model instances on which the policy performs poorly are sampled more often, in order to encourage learning of policies that perform well for a wide range of model instances. This is in contrast to methods that learn policies which are highly optimized for specific model instances, but brittle under model perturbations. Further, we show that policies learned using EPOpt are robust even to effects not modeled in the source domain. We also present an approach for adapting the source domain ensemble using approximate Bayesian updates, in order to progressively make it a better approximation of the target domain. In contrast to standard system identification [14], the data for model learning is obtained by executing a robust policy, which alleviates safety concerns during model identification. Thus, we are able to enjoy the benefits of both robustness and adaptation as outlined above. In addition, we leverage recent advances in fast physics simulators, deep learning, and policy search to overcome computational bottlenecks in policy learning for high-dimensional continuous control tasks.

We evaluate the proposed methods on the hopper (12-dimensional state space; 3-dimensional action space) and half-cheetah (18-dimensional state space; 6-dimensional action space) benchmarks, described further in Section 5. Our experimental results suggest that (a) adversarial training on model ensembles produces robust policies which generalize much better than policies trained on the maximum-likelihood model from the ensemble; and (b) policies trained on an ensemble perform better than policies trained on the best model within the considered ensemble in the presence of unmodeled effects.

2 Related Work

The general problem of finding reliable policies using imprecise or inadequate model information is ubiquitous, and has been attempted many times under different assumptions and settings. Robust control is a branch of control theory that formally studies the development of robust policies [15, 16]. However, typically no distributional assumption over source or target tasks is made, and a worst-case analysis is performed. Much of the work in this community has been concentrated on linear systems or finite MDPs, which often cannot adequately model the complexities of real-world tasks [17].

The broad field of model-based Bayesian RL maintains a belief over models for decision making under uncertainty [8, 13]. In Bayesian RL, through interaction with the target domain, the uncertainty is reduced to find the correct or closest model. Applying this idea in its full generality is difficult, and requires either restrictive assumptions like finite MDPs [18] or Gaussian dynamics [19], or task-specific innovations. Some previous methods have also suggested treating uncertain model parameters as unobserved state variables in a continuous POMDP framework, and solving the POMDP to obtain an optimal exploration-exploitation trade-off [20, 21]. While this approach is general, and allows automatic learning of epistemic actions, extending such methods to large continuous control tasks like those considered in this paper is difficult.

The use of model ensembles to produce robust controllers has been explored recently in robotics. Mordatch et al. [22] use model-based trajectory optimization and an ensemble with a small finite set of models, whereas we follow a sampling-based direct policy search approach over a continuous distribution of uncertain parameters and also show domain adaptation. Sampling-based approaches can be applied easily to complex models and discrete MDPs which cannot be planned through easily.

Another related theme of work involves learning a policy subspace using the source domain distribution, in order to reduce sample complexity when interacting with the target domain. Kolter [23] identified that parameters of optimal policies for MDPs in the source distribution lie in a low-dimensional subspace, and hence policy search can be performed in this lower-dimensional space with data from the target domain. Learning of parametrized skills [24] is also concerned with finding policies for a distribution of parametrized tasks. However, this setting is primarily geared towards situations where task parameters are revealed at test time, whereas our work is motivated by situations where parameters (e.g. friction) are unknown. A number of methods have also been suggested to reduce sample complexity when provided with either a baseline policy [25, 26], expert demonstrations [27, 28], or an approximate simulator [29, 30]. These are complementary to our work, in the sense that our policy, which has good jumpstart performance, can be used to sample from the target domain, and other off-policy methods could be explored for policy improvement.

3 Problem Formulation

We consider parametrized Markov Decision Processes (MDPs), which are tuples of the form M(p) ≡ ⟨S, A, T_p, R_p, γ, S_{0,p}⟩, where S and A are the (continuous) state and action spaces respectively; T_p and R_p are the state transition and reward functions, both parametrized by p; γ is the discount factor; and S_{0,p} is the initial state distribution, also parametrized by p. Thus, we consider a set of MDPs with the same state and action spaces, with each MDP in the set instantiated by a parameter vector p. Each MDP in this set could potentially have different transition functions, rewards, and initial state distributions. We use transition functions of the form S_{t+1} ≡ T_p(s_t, a_t), where T_p is a random process and S_{t+1} is a random variable.

We distinguish between source and target MDPs using M and W respectively. Our ultimate objective is to learn the optimal policy for W, and to do so we have access to M(p) for different choices of p. More concretely, we assume that we have a distribution D over the source domains (MDPs), generated by a distribution over the parameters P ≡ P(p) that captures our subjective belief about the parameters of W. Let this distribution over parameters, P, be parametrized by ψ (e.g. mean, standard deviation). For example, M could be a hopping task with reward proportional to hopping velocity, where falling down corresponds to a terminal state. For this task, p could correspond to parameters like torso mass, ground friction, and damping in the joints, all of which affect the dynamics. Ideally, we would like the target domain to be in the same class as M(p). However, in practice, there are likely to be unmodeled effects, and we analyze this setting in the experiments section.

For simplicity, we denote the source domain distribution by D ≡ M(P) and refer to it as the ensemble MDP. Here, D is both a distribution over MDPs as well as an MDP itself (p is treated as a state with no transition dynamics, i.e. p_{t+1} = p_t, and initial p_0 ∼ P). We wish to learn a policy a_t = π*(s_t; θ) which performs well for all (or most) M ∼ D. Note that this robust policy does not have an explicit dependence on p, and we require it to perform well on M ∼ D without knowledge of p.
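To make the formulation concrete, the following is a minimal sketch (not from the paper) of how the parametrized MDP family M(p) and the source distribution could be represented in code; all names here (ParametrizedMDP, sample_source_mdp, and the callables they take) are illustrative assumptions.

```python
# Illustrative sketch only: a parametrized MDP M(p) bundles dynamics, reward,
# and initial-state distribution that all depend on the parameter vector p.
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class ParametrizedMDP:
    p: np.ndarray          # model parameters (e.g. mass, friction, damping)
    transition: Callable   # (s, a, p) -> next state (possibly stochastic)
    reward: Callable       # (s, a, p) -> scalar reward
    init_state: Callable   # (p,) -> initial state s_0

def sample_source_mdp(sample_params, transition, reward, init_state):
    """Draw p ~ P (the belief over the target's parameters) and instantiate M(p)."""
    p = sample_params()
    return ParametrizedMDP(p, transition, reward, init_state)
```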

4 Learning protocol and EPOpt algorithm

We follow the round-based (or episodic) learning protocol described in Algorithm 1, which is similar to Bayesian model-based RL. In each round, we interact with the target domain after computing a robust policy on the simulated source domain distribution. Following this, we update the source domain distribution using data from the target domain collected by executing the robust policy.

Algorithm 1: Robust model-based learning protocol
1  Input: ψ_0, θ_0
2  Compute robust policy π(θ_0) on source domain distribution D ≡ M(P) with P(p) = P_{ψ_0}
3  for round i = 0, 1, 2, ... do
4      Interact with W to sample a trajectory τ_i = {s_t, a_t, r_t, s_{t+1}}_{t=0}^{T-1} using π(θ_i)
5      ψ_{i+1} ← BeliefUpdate(ψ_i, τ_i)
6      θ_{i+1} ← RobustPolicyUpdate(ψ_{i+1}, θ_i)
7  end
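A minimal Python sketch of this protocol is given below, assuming hypothetical helpers belief_update (the Bayesian update of ψ), robust_policy_update (robust policy search on the current source distribution), and rollout_target (one episode of interaction with W); none of these names come from the paper's code.

```python
# Sketch of Algorithm 1: alternate between robust policy training on the
# simulated ensemble and belief updates from a single target-domain episode.
def robust_model_based_learning(psi0, theta0, belief_update,
                                robust_policy_update, rollout_target,
                                num_rounds=10):
    psi, theta = psi0, theta0
    theta = robust_policy_update(psi, theta)      # line 2: robust policy on D = M(P_psi0)
    for _ in range(num_rounds):                   # line 3
        tau = rollout_target(theta)               # line 4: one trajectory from W
        psi = belief_update(psi, tau)             # line 5: BeliefUpdate
        theta = robust_policy_update(psi, theta)  # line 6: RobustPolicyUpdate
    return psi, theta
```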

Robust policy search  We introduce the EPOpt algorithm for finding the robust policy for a given source domain distribution. EPOpt is a policy gradient based meta-algorithm which uses standard batch policy optimization methods as a subroutine. The basic idea is to sample a collection of models from the source domain distribution, sample trajectories from each of these models, and make a gradient update based on the sampled trajectories. We first define evaluation metrics for the parametrized policy π(θ):

$$
\eta_{\mathcal{M}}(\theta, p) = \mathbb{E}_{\tilde{\tau}}\!\left[\left.\sum_{t=0}^{T-1} \gamma^{t}\, r_{t}(s_{t}, a_{t}) \,\right|\, p\right],
$$

$$
\eta_{\mathcal{D}}(\theta) = \mathbb{E}_{p \sim \mathcal{P}}\big[\eta_{\mathcal{M}}(\theta, p)\big]
= \mathbb{E}_{p \sim \mathcal{P}}\!\left[\mathbb{E}_{\tilde{\tau}}\!\left[\left.\sum_{t=0}^{T-1} \gamma^{t}\, r_{t}(s_{t}, a_{t}) \,\right|\, p\right]\right]
= \mathbb{E}_{\tau}\!\left[\sum_{t=0}^{T-1} \gamma^{t}\, r_{t}(s_{t}, a_{t})\right]. \tag{1}
$$

In (1), η_M(θ, p) is the evaluation of π(θ) on the model M(p), with τ̃ being trajectories generated by M(p) and π(θ): τ̃ = {s_t, a_t, r_t, s_{t+1}}_{t=0}^{T−1}, where s_{t+1} ∼ T_p(s_t, a_t), s_0 ∼ S_{0,p}, r_t ∼ R_p(s_t, a_t), and a_t ∼ π(s_t; θ). Similarly, η_D(θ) is the evaluation of π(θ) over the source domain distribution. The corresponding expectation is over trajectories τ generated by D and π(θ): τ = {s_t, a_t, r_t, s_{t+1}}_{t=0}^{T−1}, where s_{t+1} ∼ T_{p_t}(s_t, a_t), p_{t+1} = p_t, s_0 ∼ S_{0,p_0}, r_t ∼ R_{p_t}(s_t, a_t), a_t ∼ π(s_t; θ), and p_0 ∼ P. With this modified notation of trajectories, policy gradient methods (e.g. [31], [32], [33]) can be employed to find the optimal policy parameters θ* = argmax_θ η_D(θ). For simplicity, we refer to such policy gradient methods as batch policy optimization.
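As a concrete reading of (1), η_D(θ) can be estimated by Monte Carlo: sample p ∼ P, roll out π(θ) in M(p), and average the discounted returns. The sketch below assumes user-supplied sample_params and rollout helpers and is not part of the paper's implementation.

```python
# Monte Carlo estimate of eta_D(theta) from Eq. (1).
import numpy as np

def estimate_eta_D(policy, sample_params, rollout, gamma=0.99, n_models=50):
    returns = []
    for _ in range(n_models):
        p = sample_params()                  # p ~ P_psi
        rewards = rollout(policy, p)         # rewards r_t from one trajectory of M(p)
        returns.append(sum(gamma**t * r for t, r in enumerate(rewards)))
    return np.mean(returns)                  # averages over both p and trajectory noise
```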

Optimizing η_D allows us to learn a policy that performs best in expectation over models in the source domain distribution. However, this does not necessarily lead to a robust policy, since there could be high variability in performance for different models in the distribution. To explicitly seek a robust policy, we use a softer version of the max-min objective used in robust control:

$$
\max_{\theta, y} \;\int_{\mathcal{F}(\theta)} \eta_{\mathcal{M}}(\theta, p)\, \mathcal{P}(p)\, dp
\qquad \text{s.t.} \;\; \mathbb{P}\big(\eta_{\mathcal{M}}(\theta, P) \le y\big) = \epsilon, \tag{2}
$$

where F(θ) = {p | η_M(θ, p) ≤ y} is the set of parameters corresponding to models that produce the worst ε percentile of returns, and provides the domain of integration; η_M(θ, P) is the random variable of returns, which is induced by the distribution over model parameters; and ε is a hyperparameter which governs the level of relaxation from the max-min objective. The interpretation is that (2) maximizes the expected return for the worst ε-percentile of MDPs in the source domain distribution. A related line of work is percentile optimization, where the ε-percentile value of the return is directly optimized [34]. Most prior work in this area is confined to small and discrete MDPs, and finding the optimal policy under the chance-constrained formulation is NP-hard for general MDPs. We refer readers to [7] for a survey of related risk-sensitive RL methods in the context of robustness.

We adapt the previous policy gradient formulation to approximately optimize the objective in (2). The resulting algorithm, which we call EPOpt-ε, generalizes learning a policy using an ensemble of source MDPs which are sampled from a source domain distribution. In Algorithm 2, R(τ_k) ≡ Σ_{t=0}^{T−1} γ^t r_{t,k} denotes the discounted return obtained in trajectory sample τ_k. In line 7, we compute the ε-percentile value of returns from the N trajectories. In line 8, we find the subset of sampled trajectories which have returns lower than Q_ε, and in line 9 we make a gradient update using this subset.

Algorithm 2: EPOpt-ε
1  Input: ψ, θ_0, niter, N, ε
2  for iteration i = 0, 1, 2, ..., niter do
3      for k = 1, 2, ..., N do
4          sample model parameters p_k ∼ P_ψ
5          sample a trajectory τ_k = {s_t, a_t, r_t, s_{t+1}}_{t=0}^{T−1} from M(p_k) using policy π(θ_i)
6      end
7      compute Q_ε = ε-percentile of {R(τ_k)}_{k=1}^{N}
8      select the subset T_ε = {τ_k : R(τ_k) ≤ Q_ε}
9      update policy: θ_{i+1} = BatchPolOpt(θ_i, T_ε)
10 end
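The core of one EPOpt-ε iteration (lines 3-9 of Algorithm 2) can be sketched as follows; sample_params, rollout_with_return, and batch_pol_opt stand in for the source distribution sampler, the simulator rollout, and the batch policy optimizer (TRPO in the paper), and are assumptions of this sketch rather than the authors' code.

```python
# One EPOpt-epsilon iteration: sample N models, keep the worst epsilon-fraction
# of trajectories by discounted return, and update the policy on that subset.
import numpy as np

def epopt_iteration(theta, sample_params, rollout_with_return,
                    batch_pol_opt, N=200, eps=0.1):
    trajs, rets = [], []
    for _ in range(N):
        p = sample_params()                          # p_k ~ P_psi          (line 4)
        tau, R = rollout_with_return(theta, p)       # trajectory and return (line 5)
        trajs.append(tau)
        rets.append(R)
    Q_eps = np.percentile(rets, 100 * eps)           # epsilon-percentile Q_eps (line 7)
    worst = [tau for tau, R in zip(trajs, rets) if R <= Q_eps]  # subset (line 8)
    return batch_pol_opt(theta, worst)               # gradient update      (line 9)
```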

Adapting the source domain distribution In line with model-based Bayesian RL, we can adapt the ensemble distribution after observing trajectory data from the target domain. The Bayesian update can be written as:

$$
\mathcal{P}(P \mid \tau_k) = \frac{1}{Z} \times \mathcal{P}(\tau_k \mid P) \times \mathcal{P}(P)
= \frac{1}{Z} \times \prod_{t=0}^{T-1} \mathcal{P}(S_{t+1} = s_{t+1} \mid s_t, a_t, p) \times \mathcal{P}(P), \tag{3}
$$

where 1/Z is the normalization constant required to make the probabilities sum to 1, S_{t+1} is the random variable representing the next state, and s_{t+1} is the observed transition from τ_k. We try to explain the target trajectory using the stochasticity in the state-transition function, which also models sensor errors. This provides the following expression for the likelihood:

$$
\mathcal{P}(S_{t+1} \mid s_t, a_t, p) \equiv \mathcal{T}_p(s_t, a_t). \tag{4}
$$

In our experiments, we consider the case of a deterministic simulator and a known Gaussian sensor error model, but the approach is general. Based on the Gaussian transition model, we can simplify (4) as P(S_{t+1} | s_t, a_t, p) ≡ N(x_{t+1}, Σ), where x_{t+1} = f(s_t, a_t, p) is the next state predicted by the deterministic simulator.

In this work, we follow a bin-based approach to source domain adaptation, where the parameter range is discretized into bins, and a uniform density is assumed within each bin. Thus, each bin is associated with a probability, which can be updated in accordance with Bayes rule. Let b_i denote the i-th bin. According to Bayes rule,

$$
\mathcal{P}(b_i \mid \tau_k) = \frac{1}{Z} \times L(b_i, \tau_k) \times \mathcal{P}(b_i), \tag{5}
$$

where L(b_i, τ_k) is the likelihood of models in b_i generating τ_k. The likelihood is given by

$$
L(b_i, \tau_k) = \mathbb{E}_{p \sim \mathcal{P}_{b_i}}\!\left[\prod_{t} \mathcal{P}(S_{t+1} = s_{t+1} \mid s_t, a_t, p)\right],
$$

where the expectation is over parameters p drawn from b_i, which we take to be uniform. By sampling p from b_i, we can estimate the likelihood, which allows us to update the bin probabilities according to (5).
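A rough sketch of this bin-based update is shown below, assuming a deterministic simulator step f(s, a, p), an isotropic sensor covariance Σ = σ²I, and uniform sampling within each bin; the helper names and the Monte Carlo sample count are illustrative choices, not the paper's implementation.

```python
# Approximate Bayes update of bin probabilities, Eq. (5): estimate L(b_i, tau)
# by sampling p uniformly from each bin and scoring the observed transitions
# under a Gaussian sensor model centred on the simulator's prediction.
import numpy as np

def update_bin_probs(bin_probs, bins, trajectory, simulate_step, sigma, n_samples=20):
    """bins: list of (low, high) parameter boxes; trajectory: list of (s_t, a_t, s_{t+1})."""
    likelihoods = np.zeros(len(bins))
    for i, (low, high) in enumerate(bins):
        samples = []
        for _ in range(n_samples):
            p = np.random.uniform(low, high)            # p drawn uniformly from bin b_i
            log_lik = 0.0
            for s, a, s_next in trajectory:
                x_next = simulate_step(s, a, p)          # deterministic prediction f(s, a, p)
                # Gaussian log-likelihood up to a constant shared by all bins.
                log_lik += -0.5 * np.sum((np.asarray(s_next) - x_next) ** 2) / sigma**2
            samples.append(np.exp(log_lik))
        likelihoods[i] = np.mean(samples)                # estimate of L(b_i, tau)
    post = likelihoods * np.asarray(bin_probs)           # Bayes rule
    return post / post.sum()                             # normalize (1/Z)
```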

5 Experiments

We evaluate the proposed EPOpt-ε algorithm on the 2D hopper and half-cheetah simulated robotic tasks using the MuJoCo physics simulator [35]. Both tasks involve complex second-order dynamics and direct torque control. The tasks were implemented using base code provided with OpenAI Gym [36] and rllab [37]. We use TRPO [31] as our batch policy optimization subroutine. These tasks were chosen since falling down is a natural interpretation of failure and lack of robustness. Descriptions of the tasks follow.

Hopper: The hopper task is to make a 2D planar hopper with three joints and four body parts hop forward as fast as possible [38]. This problem has a 12-dimensional state space and a 3-dimensional action space corresponding to torques at the joints. We construct the source domain by considering a distribution over four parameters: torso mass, ground friction, armature (inertia), and damping of the foot.

Half-cheetah: The half-cheetah task [39] requires us to make a 2D cheetah with two legs run forward as fast as possible. The simulated robot has 8 body links, with an 18-dimensional state space and a 6-dimensional action space corresponding to joint torques. Again, we construct the source domain using a distribution over the following parameters: torso and head mass, ground friction, damping, and armature (inertia) of the foot joints.

We parametrize the stochastic policy using the scheme presented in Schulman et al. [31]. The policy is a Gaussian distribution whose mean is represented by a neural network with two hidden layers. Each hidden layer has 64 units with a tanh non-linearity, and the final output layer consists of linear units. Normally distributed multivariate noise (with diagonal covariance) is added to the output of this neural network, and we also learn the standard deviation of this distribution.

In the first experiment, we demonstrate the need for robustness using the hopper task, by learning policies for different model instances and then evaluating the learned policies on MDPs with perturbed parameters. Figure 2 illustrates the performance of control policies learned on different hopper configurations, which in this case correspond to different torso masses. We clearly see that there is no single task configuration which, if trained on, produces a policy that generalizes to a broad range of task parameters or configurations. Hence, an attractive approach to generating robust policies, meaning policies competent for multiple configurations, is to consider an ensemble of models and learn a single policy over such an ensemble. In Figure 2, we see that such a policy, trained over an ensemble, is robust and generalizes to all the parameters considered in the source distribution.

Next, we analyze the robustness of policies trained using EPOpt over a range of parameters. Figures 3 and 4 compare the performance of policies obtained using: (a) batch policy optimization on the average or most-likely model of the source domain distribution; (b) EPOpt(ε = 1) on the source domain distribution, i.e. the best policy in expectation over the source domain distribution; and (c) EPOpt(ε = 0.1) on the source domain distribution, i.e. the adversarially trained policy.
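For reference, a minimal framework-free sketch of the Gaussian policy class described above (mean given by a 64-64 tanh MLP with a linear output layer, plus a learned state-independent log standard deviation) is given below; the class name, initialization scale, and absence of a training loop are illustrative simplifications rather than the rllab implementation used in the paper.

```python
# Sketch of the Gaussian MLP policy: action = MLP(obs) + exp(log_std) * noise.
import numpy as np

class GaussianMLPPolicy:
    def __init__(self, obs_dim, act_dim, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (obs_dim, hidden)); self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.1, (hidden, hidden));  self.b2 = np.zeros(hidden)
        self.W3 = rng.normal(0.0, 0.1, (hidden, act_dim)); self.b3 = np.zeros(act_dim)
        self.log_std = np.zeros(act_dim)   # learned jointly with the weights in practice

    def act(self, obs):
        h = np.tanh(obs @ self.W1 + self.b1)        # first hidden layer, tanh
        h = np.tanh(h @ self.W2 + self.b2)          # second hidden layer, tanh
        mean = h @ self.W3 + self.b3                # linear output layer
        noise = np.random.standard_normal(mean.shape)
        return mean + np.exp(self.log_std) * noise  # diagonal-covariance Gaussian action
```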

Figure 1: Illustrations of the 2D simulated robot models used in the experiments. The hopper (a) and half-cheetah (b) tasks present the challenges of under-actuation and contact discontinuities. These challenges, when coupled with parameter uncertainties, lead to dramatic degradation in the quality of policies when robustness is not explicitly considered. A video demonstration of our method on these tasks is available here: https://youtu.be/w1YJ9vwaoto

[Figure 2 panels: policies trained with torso mass m = 3, m = 6, m = 9, and the ensemble policy; x-axis: Torso Mass, y-axis: Performance.]

Figure 2: Performance of different policies on a range of torso masses. The x and y axes represent torso mass and performance respectively, and are shared between the sub-plots. The policies have been trained on the respective masses specified in each sub-plot using TRPO. Performance is measured as return per trajectory, and the shaded region depicts the 10th and 90th percentiles of the return distribution for a given mass. The ensemble policy is trained using EPOpt(ε = 0.1) on the Gaussian distribution N(µ = 6, σ = 1.5). The percentile regions are shown for this policy as well, but they nearly overlap with the average value, indicating a highly reliable policy.

Figure 3: Performance of policies for various model instances of the hopper domain. Figures 3(a) and 3(b) correspond to the average and the 10th percentile of the performance (return) distribution respectively. The performance is depicted as a heat map for various model configurations, the parameters of which are given on the x and y axes. The 10th percentile value is used as a softer version of worst-case performance. The adversarially trained policy over the source domain distribution is observed to generalize to a wider range of models and is more robust according to both criteria.

The comparison is similar to what is done in robust control settings, where a policy is trained on an initial source distribution (or range) and evaluated on the same range, without any adaptation. This analysis allows us to understand the quality of the learned policy with respect to the jumpstart criterion.

The truncated Gaussian distribution described in Table 1 was used as the source domain distribution. It is observed that policies trained on the ensemble generalize better, especially when trained adversarially.

Figure 4: Performance of policies for various model instances of the cheetah domain, similar to Figure 3. Again, it is observed that the adversarially trained policy is robust and generalizes well to all models in the source distribution.

Table 1: Initial source domain distribution

Hopper             µ      σ      low    high
mass               6.0    1.5    3.0    9.0
ground friction    2.0    0.25   1.5    2.5
joint damping      2.5    1.0    1.0    4.0
armature           1.0    0.25   0.5    1.5

Half-Cheetah       µ      σ      low    high
mass               6.0    1.5    3.0    9.0
ground friction    0.5    0.1    0.3    0.7
joint damping      1.5    0.5    0.5    2.5
armature           0.125  0.04   0.05   0.2
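As an illustration of how the truncated Gaussian priors in Table 1 can be sampled when constructing the source ensemble, the sketch below uses simple rejection sampling with the hopper values from the table; the dictionary keys and the rejection scheme are illustrative, not the paper's code.

```python
# Draw hopper model parameters from the truncated Gaussians of Table 1:
# N(mu, sigma^2) clipped to [low, high] via rejection sampling.
import numpy as np

HOPPER_PRIOR = {                       # (mu, sigma, low, high) from Table 1
    "mass":            (6.0, 1.5,  3.0, 9.0),
    "ground_friction": (2.0, 0.25, 1.5, 2.5),
    "joint_damping":   (2.5, 1.0,  1.0, 4.0),
    "armature":        (1.0, 0.25, 0.5, 1.5),
}

def sample_hopper_params(rng=None):
    rng = rng or np.random.default_rng()
    p = {}
    for name, (mu, sigma, low, high) in HOPPER_PRIOR.items():
        x = rng.normal(mu, sigma)
        while not (low <= x <= high):  # reject draws outside the truncation bounds
            x = rng.normal(mu, sigma)
        p[name] = x
    return p
```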

To analyze the robustness of policies trained using EPOpt to unmodeled effects, our next experiment considers a setting where the torso mass variation is unmodeled. Specifically, the torso mass differs between the source and target domains, with all models in the source domain having the same incorrect torso mass. The source domain distribution is obtained by varying the three other parameters in Table 1: ground friction, joint damping, and joint armature. Figure 5 indicates that the EPOpt(ε = 0.1) policy is robust to a broad range of torso masses even when this uncertainty is not modeled. However, as expected, this policy is not as robust as in the case when mass is also modeled as part of the source domain distribution.

[Figure 5 panel: return vs. torso mass for the Ensemble (unmodeled) and Maximum-Likelihood policies.]

Figure 5: Performance of the policy learned on a source domain distribution which does not include variations in mass.

[Figure 6 panels: Iter 0 to Iter 3; x-axis: Torso Mass, y-axis: Friction.]

Figure 6: Adaptation of the source domain to learn parameters of the target domain. The red cross indicates the parameters of the target MDP. The contours in the plot, generated with kernel density estimation of model parameters sampled from the binned source domain distribution, indicate the distribution over models. Lighter colors and more concentrated contour lines indicate regions of higher density. Each iteration corresponds to one episode of interaction with the target domain. Note that the high-density regions gradually move toward the true model, but do so by maintaining probability mass at multiple parameter settings which can explain the behavior of the target domain.


The preceding experiments show that EPOpt can find robust policies, but the source domain distribution in these experiments was chosen to be broad enough that the target domain is not too far from the high-density regions of the distribution. However, for real-world problems, we might not have the domain knowledge to identify a good source domain distribution in advance. In such settings, domain adaptation allows us to change the parameters of the source distribution using data gathered from the target domain. Additionally, domain adaptation is helpful when the parameters of the target domain could change over time, for example through wear and tear. To illustrate domain adaptation, we perform an experiment where the target domain is very far from the high-density regions of the source domain, as depicted in Figure 6. We observe that, progressively, the source domain becomes a better approximation of the target domain, and consequently the performance improves. In this case, we initially observe a bimodal distribution, since the data from the target domain can be explained in two ways: when the mass is under-estimated, under-estimating the ground friction as well helps explain the additional slippage of a higher-mass body; similarly, over-estimating the friction partially nullifies the effect of over-estimating the mass. Eventually, after 11 iterations (episodes), the source domain distribution identifies the correct bin with near-1 probability. Figure 7 depicts the learning curve, and we see that a robust policy with return greater than 2000 can be discovered with just 3 trajectories from the target domain. Subsequently, the policy improves nearly monotonically, and we find a good policy with just 12 episodes worth of data from the target domain. We also learned a policy directly on the target domain, and observed that the final return achieved under this perfectly known model setting is approximately 3200, comparable to what is achieved by following the robust model-based learning protocol. Thus, maintaining a distribution over models and learning a robust policy, followed by domain adaptation, allows us to learn both quickly and robustly.

During our experiments with the EPOpt-ε algorithm, we found that directly optimizing the policy for a small value of ε leads to unstable learning. This is likely because policy gradient methods try to increase the probability of better performing trajectories and penalize trajectories that perform poorly, and due to its adversarial nature, EPOpt-ε emphasizes penalizing poor trajectories more. This might constrain the initial exploration needed to find the better trajectories, and hence hinder stable learning. However, our experiments also indicate that choosing a low value of ε (≈ 0.1) leads to better generalization. This difficulty in training can be overcome by gradually reducing ε over multiple iterations until we reach the desired relaxation level. This scheme roughly corresponds to exploring extensively at first to find promising trajectories, and then rapidly reducing the probability of those trajectories that do not generalize, i.e. those that perform poorly for some model instances.
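The ε-annealing scheme discussed above can be sketched as a simple schedule that starts at ε = 1 (average-case training) and decays toward the target value of roughly 0.1; the linear decay and the number of annealing iterations are illustrative choices, since the text does not specify an exact schedule.

```python
# Illustrative epsilon schedule: anneal from 1.0 (optimize the ensemble average)
# down to eps_final (adversarial training on the worst epsilon-fraction).
def epsilon_schedule(iteration, eps_final=0.1, anneal_iters=100):
    if iteration >= anneal_iters:
        return eps_final
    frac = iteration / anneal_iters
    return 1.0 + frac * (eps_final - 1.0)   # linear decay from 1.0 to eps_final
```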


Figure 7: Learning curve when following the robust model-based learning protocol with source domain adaptation. The shaded region describes the 10th and 90th percentile values of the performance distribution, and the solid line is the average performance.

6 Conclusions and Future Work

In this paper, we presented the EPOpt-ε algorithm for training robust policies on ensembles of source domains. Our method provides for robust policy training by using a distribution of models at training time, and supports an adversarial training regime designed to provide good jumpstart and worst-case performance. We also describe how our approach can be combined with Bayesian model adaptation to adapt the source domain ensemble to a target domain using a small amount of target domain experience. Our algorithm can be used to train robust and generalizable policies in an ensemble of simulated domains, and our experimental results demonstrate that the ensemble approach provides policies that are robust to some unmodeled effects. Our experiments also demonstrate that Bayesian source ensemble adaptation can produce distributions over models that yield better policies on the target domain than more standard maximum-likelihood estimation, particularly in the presence of unmodeled effects.

Although our method exhibits good generalization performance and supports Bayesian adaptation, the adaptation algorithm we propose currently relies on discretization of the model parameter space, which quickly becomes intractable as the number of model parameters increases. In the future, we plan to explore more sophisticated model ensemble parametrizations, such as flexible parametric probability distributions and non-parametric distributions. Such parametrizations and representations would provide more favorable performance with higher model parameter dimensionalities. One promising avenue is the use of general-purpose (Bayesian) neural network models, where the neural network parameters could be thought of as model parameters. These models could be pre-trained using physics-based simulators like MuJoCo to obtain a practical initialization of the neural network parameters. Such representations are likely to be useful when dealing with high-dimensional inputs and parametrizations, like simulated vision from rendered images or complex physics models, which are needed to train highly generalizable policies that can successfully transfer to physical robots acting in the real world.

Acknowledgments The authors would like to thank Emo Todorov and Sham Kakade for insightful comments about the work. The authors would also like to thank Emo Todorov for the MuJoCo simulator. Aravind Rajeswaran and Balaraman Ravindran acknowledge financial support from ILDS, IIT Madras.

References

[1] Volodymyr Mnih et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, Feb 2015.
[2] David Silver et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, Jan 2016.
[3] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. ArXiv e-prints, September 2015.
[4] Igor Mordatch, Kendall Lowrey, Galen Andrew, Zoran Popovic, and Emanuel V. Todorov. Interactive control of diverse complex characters with neural networks. In NIPS, 2015.
[5] Xue Bin Peng, Glen Berseth, and Michiel van de Panne. Terrain-adaptive locomotion skills using deep reinforcement learning. ACM Transactions on Graphics (Proc. SIGGRAPH 2016), 2016.
[6] Sham Kakade. On the Sample Complexity of Reinforcement Learning. PhD thesis, University College London, 2003.
[7] Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 2015.
[8] Mohammad Ghavamzadeh, Shie Mannor, Joelle Pineau, and Aviv Tamar. Bayesian reinforcement learning: A survey. Foundations and Trends in Machine Learning, 8(5-6):359–483, 2015.
[9] Matthew E. Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10:1633–1685, December 2009.
[10] Stephane Ross and Drew Bagnell. Agnostic system identification for model-based reinforcement learning. In ICML, 2012.
[11] Marc Peter Deisenroth, Gerhard Neumann, and Jan Peters. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1-2):1–142, 2013.
[12] Richard Dearden, Nir Friedman, and David Andre. Model based Bayesian exploration. In UAI, 1999.
[13] Nikos Vlassis, Mohammad Ghavamzadeh, Shie Mannor, and Pascal Poupart. Bayesian Reinforcement Learning, pages 359–386. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.
[14] Lennart Ljung. System Identification, pages 163–173. Birkhäuser Boston, Boston, MA, 1998.
[15] Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, 2005.
[16] Kemin Zhou, John C. Doyle, and Keith Glover. Robust and Optimal Control. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1996.
[17] Shiau Hong Lim, Huan Xu, and Shie Mannor. Reinforcement learning in robust Markov decision processes. In NIPS, 2013.
[18] Pascal Poupart, Nikos A. Vlassis, Jesse Hoey, and Kevin Regan. An analytic solution to discrete Bayesian reinforcement learning. In ICML, 2006.
[19] S. Ross, B. Chaib-draa, and J. Pineau. Bayesian reinforcement learning in continuous POMDPs with application to robot navigation. In ICRA, 2008.
[20] Michael O. Duff. Design for an optimal probe. In ICML, 2003.
[21] Josep M. Porta, Nikos A. Vlassis, Matthijs T. J. Spaan, and Pascal Poupart. Point-based value iteration for continuous POMDPs. Journal of Machine Learning Research, 7:2329–2367, 2006.
[22] I. Mordatch, K. Lowrey, and E. Todorov. Ensemble-CIO: Full-body dynamic motion planning that transfers to physical humanoids. In IROS, 2015.
[23] Zico Kolter. Learning and control with inaccurate models. PhD thesis, Stanford University, 2010.
[24] Bruno Castro da Silva, George Konidaris, and Andrew G. Barto. Learning parameterized skills. In ICML, 2012.
[25] Philip Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High-confidence off-policy evaluation. In AAAI Conference on Artificial Intelligence, 2015.
[26] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In ICML, 2002.
[27] Sergey Levine and Vladlen Koltun. Guided policy search. In ICML, 2013.
[28] Brenna D. Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.
[29] Aviv Tamar, Dotan Di Castro, and Ron Meir. Integrating a partial model into model free reinforcement learning. Journal of Machine Learning Research, 2012.
[30] Pieter Abbeel, Morgan Quigley, and Andrew Y. Ng. Using inaccurate models in reinforcement learning. In ICML, 2006.
[31] John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, and Pieter Abbeel. Trust region policy optimization. In ICML, 2015.
[32] Sham Kakade. A natural policy gradient. In NIPS, 2001.
[33] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.
[34] Erick Delage and Shie Mannor. Percentile optimization for Markov decision processes with parameter uncertainty. Operations Research, 58(1):203–213, 2010.
[35] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, Oct 2012.
[36] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.
[37] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In ICML, 2016.
[38] Tom Erez, Yuval Tassa, and Emanuel Todorov. Infinite-horizon model predictive control for periodic tasks with contacts. In Proceedings of Robotics: Science and Systems, 2011.
[39] Pawel Wawrzynski. Real-time reinforcement learning by sequential actor-critics and experience replay. Neural Networks, 22:1484–1497, 2009.
