Learning Optimal Policies using Bound Estimation

Andrea Bonarini, Alessandro Lazaric, Marcello Restelli

May 30, 2005

Abstract

Reinforcement learning problems related to real-world applications are often characterized by large state spaces, which imply high memory requirements. In the past, function approximators have been studied as a way to obtain compact representations of the value function. Although function approximators perform well in the supervised learning context, when applied to the reinforcement learning framework they may find unsatisfactory solutions or even diverge. In this paper, we focus on a particular kind of function approximator: state aggregation. We show how to compute upper and lower bounds on the optimal values of the actions available in the states contained in each aggregate. We propose an algorithm that modifies the state aggregation until the optimal solution is reached. Furthermore, this approach is extended to a multi-representation algorithm in which overlapping partitions are used to compute tighter bounds. Although this paper limits its analysis to deterministic environments, it establishes new and relevant results that will be extended to stochastic problems in the near future.

1 Introduction

Research in the reinforcement learning field is focusing more and more on the use of function approximation in order to cope with problems with large state and action spaces. Despite the good results achieved by function approximators in supervised learning problems, their application to reinforcement learning has led to controversial results. The reason is that reinforcement learning algorithms bootstrap (i.e., they update value estimates using other value estimates), so the function is approximated using samples that are not taken from the function itself, but from its current approximation. In this paper, we study a particular kind of function approximator: state aggregation, which groups different states into aggregates. We propose a class of algorithms that, on the basis of bounds computed on the optimal values of each aggregate-action pair, determine whether they are able to learn an optimal policy over the given aggregation. When the given aggregation does not allow an optimal policy to be learned, we propose a heuristic which, at least in deterministic environments, modifies the aggregation until an optimal policy is learned.
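To make the bootstrapping issue concrete, here is a minimal Python sketch (not taken from the paper) of a Q-learning-style update in which all the states of an aggregate share a single value entry; the aggregation map `phi`, the learning rate, and the transition sample are hypothetical.

```python
from collections import defaultdict

# Hypothetical aggregation: phi maps each ground state to an aggregate id.
phi = {0: 'A', 1: 'A', 2: 'B', 3: 'B'}
actions = ['left', 'right']
gamma, alpha = 0.9, 0.1

# One value entry per (aggregate, action) pair instead of per (state, action) pair.
Q = defaultdict(float)

def update(s, a, r, s_next):
    """Bootstrapped update: the target is built from the approximated values of
    the next aggregate, not from samples of the true optimal value function."""
    target = r + gamma * max(Q[(phi[s_next], b)] for b in actions)
    key = (phi[s], a)
    Q[key] += alpha * (target - Q[key])

# Example transition sample (hypothetical): from state 1, action 'right'
# reaches state 2 with reward 0.
update(s=1, a='right', r=0.0, s_next=2)
```

Since states 0 and 1 share the entry of aggregate 'A', the bootstrapped target mixes values of states whose optimal values may differ, which is the source of the difficulties mentioned above.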


Furthermore, the bound approach is extended to a multi-representation scheme made up of different overlapping partitions of the original state space. In this way, several aggregations are considered simultaneously, and learning is achieved through their interaction. The following section introduces some basic notation about MDPs and reinforcement learning. Section 3 gives a brief overview of the theoretical studies concerning state aggregation. Section 4 gives the definitions of the bounds, while Section 5 presents on-line algorithms for learning them. In Section 6 we present an algorithm for changing the state aggregation in order to identify the optimal policy, and we show a simple experiment that highlights the differences among the bounds. In Section 7 we introduce the use of multiple overlapping partitions, and in Section 8 we adapt the refinement algorithm from a single partition to the multiple-partition setting. In Section 9 we present a brief analysis of the most relevant work related to state aggregation and the computation of bounds. In the last section we discuss the results of this work and present future directions.
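As a preview of the multiple-partition idea developed in Sections 7 and 8, the following sketch shows, under purely hypothetical partition maps and bound tables, how the intervals provided by two overlapping partitions for the same state can be intersected to obtain a tighter bound; it illustrates the principle only, not the algorithm defined in the paper.

```python
# Two hypothetical overlapping partitions of the same state space.
partition_1 = {0: 'P1_left', 1: 'P1_left', 2: 'P1_right', 3: 'P1_right'}
partition_2 = {0: 'P2_low', 1: 'P2_mid', 2: 'P2_mid', 3: 'P2_high'}

# Hypothetical (lower, upper) bounds on the optimal value of a fixed action,
# stored per aggregate of each partition.
bounds_1 = {'P1_left': (0.2, 0.9), 'P1_right': (0.5, 1.0)}
bounds_2 = {'P2_low': (0.1, 0.6), 'P2_mid': (0.4, 0.7), 'P2_high': (0.8, 1.0)}

def combined_bounds(state):
    """Each partition gives a valid interval for the same optimal value, so
    their intersection (max of lower bounds, min of upper bounds) is also
    valid and at least as tight as either one alone."""
    l1, u1 = bounds_1[partition_1[state]]
    l2, u2 = bounds_2[partition_2[state]]
    return max(l1, l2), min(u1, u2)

print(combined_bounds(1))  # (0.4, 0.7): tighter than either partition alone
```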

2 Markov Decision Processes

In a Reinforcement Learning (RL) problem, an agent learns to behave optimally in order to achieve its task through direct interaction with the environment in which it is placed. The learning process is fundamentally different from supervised learning because the agent is not told which is the best response to the current condition of the environment; it only perceives a numerical signal (the reinforcement) that measures the instantaneous value of the action taken. The agent must learn by itself the course of action (i.e., the policy) that guarantees the largest amount of reinforcement, not only at the current instant but also in the future. The environment is formally defined as a Markov Decision Process (MDP) characterized by:

• a finite discrete set of environment states S,

• a finite discrete set of available actions A,

• a state transition function T : S × A × S → [0, 1],
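For concreteness, here is a minimal sketch of a tabular representation consistent with the definition above; the states, actions, and transition probabilities are hypothetical. In the deterministic case considered in the paper, every transition distribution is degenerate, so T can equivalently be read as a function from S × A to S.

```python
# Minimal tabular MDP sketch (hypothetical numbers, for illustration only).
S = [0, 1]                      # finite discrete state set
A = ['stay', 'go']              # finite discrete action set

# Transition function T : S x A x S -> [0, 1], stored as nested dicts.
T = {
    (0, 'stay'): {0: 1.0, 1: 0.0},
    (0, 'go'):   {0: 0.0, 1: 1.0},
    (1, 'stay'): {0: 0.0, 1: 1.0},
    (1, 'go'):   {0: 1.0, 1: 0.0},
}

# Sanity check: for every (s, a), the next-state probabilities sum to one.
for s in S:
    for a in A:
        assert abs(sum(T[(s, a)].values()) - 1.0) < 1e-9

# In the deterministic case every distribution is degenerate, so the
# transition model reduces to a function S x A -> S.
def next_state(s, a):
    return max(T[(s, a)], key=T[(s, a)].get)
```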
