A Behavior Based Kernel for Policy Search via Bayesian Optimization

Aaron Wilson    WILSONAA@EECS.OREGONSTATE.EDU
Alan Fern    AFERN@EECS.OREGONSTATE.EDU
Prasad Tadepalli    TADEPALL@EECS.OREGONSTATE.EDU
Oregon State University, School of EECS, 1148 Kelley Engineering Center, Corvallis, OR 97331

Abstract

We expand on past successes applying Bayesian Optimization (BO) to the Reinforcement Learning (RL) problem. BO is a general method of searching for the maximum of an unknown objective function. The BO method explicitly aims to reduce the number of samples needed to identify the optimal solution by exploiting a probabilistic model of the objective function. Much work in BO has focused on Gaussian Process (GP) models of the objective. The performance of these models relies on the design of the kernel function relating points in the solution space. Unfortunately, previous approaches adapting ideas from BO to the RL setting have focused on simple kernels that are not well justified in the RL context. We show that a new kernel can be motivated by examining an upper bound on the absolute difference in expected return between policies. The resulting kernel explicitly compares the behaviors of policies in terms of their trajectory probability densities. We incorporate the behavior-based kernel into a BO algorithm for policy search. Results reported on four standard benchmark domains show that our algorithm significantly outperforms alternative state-of-the-art algorithms.

Appearing in ICML 2011 Workshop: Planning and Acting with Uncertain Models, Bellevue, WA, USA, 2011. Copyright 2011 by the author(s)/owner(s).

1. Introduction

In the policy search setting, RL agents seek an optimal policy within a fixed set. In such a setting an agent executes a sequence of policies, searching for the true optimum. Naturally, future policy selection decisions should benefit from the information available in all samples. A question arises regarding how the expected return of untried policies can be estimated using the batch of samples, and how to best use

the estimated returns to perform policy search. In this work we propose explicitly constructing a probabilistic model of the expected return, informed by observations of past policy behaviors. We exploit this probabilistic model of the return by selecting new policies predicted to best improve on the performance of the policies in the sample set.

Our approach is based on adapting black-box Bayesian Optimization (BO) to the RL problem. BO is a method of sequentially planning queries of an unknown objective function in order to locate its maximum. It is an ideal method for tackling the basic problem of policy search, as it directly confronts the fundamental issue of trading off exploration of the objective function (global searches) with exploitation (local searches). Fundamental to the application of BO techniques is the definition of a Bayesian prior distribution over the objective function. The BO method searches this surrogate representation of the objective function for maximal points instead of directly querying the true objective. By using a large number of surrogate function evaluations (trading computational resources for higher-quality samples), the true maximum can ideally be identified with few queries to the true objective.

As in most Bayesian methods, the success of the BO technique rests on the quality of the modeling effort. How should the objective, the expected return in the RL case, be effectively modeled? In this work, similar to past efforts applying BO to RL, we focus on GP models of the expected return. The generalization performance of GP models, and hence the performance of the BO technique, is strongly impacted by the definition of the kernel function, which encodes a notion of relatedness between points in the function space. When applying BO to RL, this means encoding a notion of similarity between policies. Past work has used simple kernels defined on policy parameters (for instance, squared exponential kernels (Lizotte et al., 2007; Wilson et al., 2010)). Unfortunately, these kernels fail to account for the special properties of the sequential decision processes typical of RL problems. A more appropriate notion of relatedness is needed for the RL context. We propose that policies are better related by their behavior than by their parameters. Below we motivate our behavior-based kernel function. We then discuss how to incorporate the kernel into a BO approach when a sparse sample of policy trajectories is available. Empirically, we demonstrate that the behavior-based kernel significantly improves BO and outperforms a selection of standard algorithms on four benchmark domains.
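To make the preceding discussion concrete, the following is a minimal, hypothetical sketch of a BO policy-search loop built around a behavior-based kernel. It is not the algorithm developed in this paper: the linear-Gaussian policy form, the fixed set of probe states used to compare behaviors, the exponentiated-divergence kernel, and the UCB acquisition rule are all illustrative assumptions on our part.

```python
# A minimal sketch (not the authors' implementation) of BO-based policy search
# with a behavior-based kernel. Assumptions for illustration: a linear-Gaussian
# policy a ~ N(theta^T s, 1), a fixed set of "probe" states used to estimate
# how differently two policies behave, and a UCB acquisition rule.
import numpy as np


def behavior_divergence(theta_i, theta_j, states):
    """Average KL divergence between the two policies' action distributions
    over the probe states; for unit-variance Gaussians this is half the
    squared gap between the action means."""
    gaps = states @ (theta_i - theta_j)
    return 0.5 * np.mean(gaps ** 2)


def behavior_kernel(thetas_a, thetas_b, states, alpha=1.0):
    """Kernel matrix K[i, j] = exp(-alpha * divergence), comparing policies by
    behavior rather than by raw parameter distance."""
    K = np.zeros((len(thetas_a), len(thetas_b)))
    for i, ti in enumerate(thetas_a):
        for j, tj in enumerate(thetas_b):
            K[i, j] = np.exp(-alpha * behavior_divergence(ti, tj, states))
    return K


def gp_posterior(K_train, K_cross, k_test_diag, y, noise=1e-2):
    """Standard GP regression equations with precomputed kernel matrices."""
    L = np.linalg.cholesky(K_train + noise * np.eye(len(y)))
    coef = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_cross.T @ coef
    v = np.linalg.solve(L, K_cross)
    var = k_test_diag - np.sum(v ** 2, axis=0)
    return mean, np.maximum(var, 1e-12)


def bo_policy_search(evaluate_return, candidates, probe_states,
                     n_iters=20, kappa=2.0):
    """Repeatedly evaluate the candidate policy maximizing a UCB acquisition
    under a GP surrogate of the expected return."""
    tried = [candidates[0]]
    returns = [evaluate_return(candidates[0])]
    for _ in range(n_iters):
        K = behavior_kernel(tried, tried, probe_states)
        K_cross = behavior_kernel(tried, candidates, probe_states)
        mean, var = gp_posterior(K, K_cross, np.ones(len(candidates)),
                                 np.array(returns))
        pick = int(np.argmax(mean + kappa * np.sqrt(var)))
        tried.append(candidates[pick])
        returns.append(evaluate_return(candidates[pick]))
    best = int(np.argmax(returns))
    return tried[best], returns[best]
```

The point of contrast with parameter-based kernels is visible in behavior_kernel: two policies with very different parameter vectors can still receive a high kernel value if they induce nearly identical action distributions on the probe states, which is the kind of relatedness the behavior-based view argues for.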

2. Problem Setting

We study the Reinforcement Learning problem in the context of Markov Decision Processes (MDPs). MDPs are described by a tuple (S, A, P, P_0, R, π). We consider processes with continuous state and action values, where each state and action is a vector s ∈
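The section is truncated in this extract. For reference, a standard way to write the quantities that the behavior-based kernel compares, under the assumption of a finite horizon T and a trajectory return R(ξ) (our notation for this sketch, not necessarily the paper's), is:

```latex
% Trajectory density induced by a policy \pi, and the expected return;
% the finite horizon T and the trajectory return R(\xi) are notational
% assumptions for this sketch.
p(\xi \mid \pi) = P_0(s_0) \prod_{t=0}^{T-1} \pi(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t),
\qquad
\eta(\pi) = \mathbb{E}_{\xi \sim p(\cdot \mid \pi)}\big[ R(\xi) \big]
```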
