Explorations in Reinforcement and Model-based Learning
A thesis submitted by
Anthony J. Prescott Department of Psychology University of Sheffield
in partial fulfillment of the requirements for the degree of Doctor of Philosophy
Submitted December 1993 Accepted 1994.
Explorations in Reinforcement and Model-based Learning
Anthony J. Prescott

Summary

Reinforcement learning concerns the gradual acquisition of associations between events in the context of specific rewarding outcomes, whereas model-based learning involves the construction of representations of causal or world knowledge outside the context of any specific task. This thesis investigates issues in reinforcement learning concerned with exploration, the adaptive recoding of continuous input spaces, and learning with partial state information. It also explores the borderline between reinforcement and model-based learning in the context of the problem of navigation.

A connectionist learning architecture is developed for reinforcement and delayed reinforcement learning that performs adaptive recoding in tasks defined over continuous input spaces. This architecture employs networks of Gaussian basis function units with adaptive receptive fields. Simulation results show that networks with only a small number of units are capable of learning effective behaviour in real-time control tasks within reasonable time frames.

A tactical/strategic split in navigation skills is proposed and it is argued that tactical, local navigation can be performed by reactive, task-specific systems. Acquisition of an adaptive local navigation behaviour is demonstrated within a modular control architecture for a simulated mobile robot. The delayed reinforcement learning system for this task acquires successful, often plan-like strategies for control using only partial state information. The algorithm also demonstrates adaptive exploration using performance-related control over local search.

Finally, it is suggested that strategic, way-finding navigation skills require model-based, task-independent knowledge. A method for constructing spatial models based on multiple, quantitative local allocentric frames is described and simulated. This system exploits simple neural network learning, storage and search mechanisms, to support robust way-finding behaviour without the need to construct a unique global model of the environment.
Declaration

This thesis has been composed by myself and contains original work of my own execution. Some of the work reported here has previously been published:

Prescott, A.J. and Mayhew, J.E.W. (1992). Obstacle avoidance through reinforcement learning. In Moody, J.E., Hanson, S.J., and Lippmann, R.P. (eds) Advances in Neural Information Processing Systems 4, Morgan Kaufmann, New York.
Prescott, A.J. and Mayhew, J.E.W. (1992). Adaptive local navigation. In Blake, A. and Yuille, A. (eds) Active Vision, MIT Press, Cambridge MA.
Prescott, A.J. and Mayhew, J.E.W. (1993). Building long-range cognitive maps using local landmarks. In From Animals to Animats: Proceedings of the 2nd International Conference on Simulation of Adaptive Behaviour, MIT Press.
Tony Prescott, 14 December 1993.
We shall not cease from exploration
And the end of all our exploring
Will be to arrive where we started
And know the place for the first time.
T.S. Eliot: The Four Quartets.
For my parents— John and Diana
Acknowledgements I would like to thank the many people who have given me their help, advice and encouragement in researching and writing this dissertation. I am particularly indebted to the following. My supervisor John Mayhew for his depth of insight, forthrightness, and humour. His ideas have been a constant source of inspiration to me through the years. John Frisby for his patience, wise counsel, and generous support. John Porrill, Neil Thacker, and latterly Steve Hippisley-Cox, for guiding my stumbling steps through the often alien and bewildering realm of mathematics. Pat Langdon for some apposite advice on programming Lisp and tuning Morris Minor engines. Paul Dean, Pete Redgrave, Pete Coffey, Rod Nicolson and Mark Blades, for some inspiring conversations about animal and human intelligence. The remaining members of the AI Vision Research Unit and the Department of Psychology, past and present, for creating such a pleasant and rewarding environment in which to work and learn. I would not have been able to carry out this work without the friends who have given me their support and companionship since coming to Sheffield. I am particularly grateful to Phil Whyte, Leila Edwards and especially Sue Keeton for sharing their lives and homes with me. I am also grateful to Justin Avery for proof-reading parts of the text. Finally, I wish to thank the Science and Engineering Research Council and the University of Sheffield for the financial support I have received while carrying out this work.
Contents

One      Introduction and Overview                                               1
Two      Reinforcement Learning Systems                                         13
Three    Exploration                                                            48
Four     Input Coding for Reinforcement Learning                                63
Five     Experiments in Delayed Reinforcement Learning using Networks of
         Basis Function Units                                                   98
Six      Adaptive Local Navigation                                             128
Seven    Representations for Way-finding: Topological Models                   162
Eight    Representations for Way-finding: Local Allocentric Frames             190
Nine     Conclusions and Future Work                                           221

Appendices
A        Algorithm and Simulation Details for Chapter Three                    226
B        Algorithm and Simulation Details for Chapter Four                     229
C        Algorithm and Simulation Details for Chapter Five                     240
D        Algorithm and Simulation Details for Chapter Six                      247

Bibliography                                                                   253
Chapter 1
Introduction and Overview

Summary

The ‘explorations’ in this thesis consider learning from a computational perspective. In other words, it is assumed that both the biological ‘neural nets’ that underlie adaptive processes in animals and humans, and the electronic circuitry that allows learning in a robot or a computer simulation, can be considered as implementing similar types of abstract, information processing operations. A computational understanding of learning should then give insight into the adaptive capabilities of both natural and artificial systems. However, to arrive at such an understanding in the abstract is clearly a daunting, if not impossible, task. The task will be helped by looking both to natural systems, for inspiration concerning how effective adaptive mechanisms have evolved and are organised, and to artificial systems, as vehicles in which to embed and then evaluate theories of learning. This chapter introduces two key domains in the study of learning from each of these perspectives; it then sets out the research objectives that will be pursued in the remainder of the thesis.
Learning in natural systems

Research in psychology suggests that underlying a large number of observable phenomena of learning and memory, there are two broad clusters of learning processes. First, there are the associative learning processes involved in habit formation, the acquisition of motor skills, and certain forms of classical and instrumental conditioning. These processes involve incremental adaptation and do not seem to need awareness. Learning is driven by a significant outcome in the form of a positively or negatively reinforcing event. Further, it does not seem to require or involve the acquisition of knowledge about the causal processes underlying the task that is solved. Second, there are the learning processes involved in acquiring knowledge about the relationships between events (stimuli or responses), for instance, that one event follows another (causal knowledge), or is close to another (spatial knowledge). These forms of learning appear to have more of an all-or-none character, and may require awareness or involve attentional processes. They are also not directly involved in generating behaviour, and need not be acquired with respect to a specific task or desired outcome. The knowledge acquired can support both further learning and decision-making through inference. Patterns of learning impairment in human amnesiacs [153, 159, 160] and lesion studies with animals (e.g. with monkeys [104, 160], and with rats [63, 110]) indicate that the second style of learning relies on specific medial-temporal structures in the brain, in particular, the hippocampus. In contrast, the simpler associative forms of learning underlying habit and skill acquisition are not affected by damage to this brain region, but appear instead to be supported by neural systems that evolved much earlier. This view is supported by observations that all vertebrates and most invertebrates show the more ‘primitive’ learning abilities, whereas the more
‘cognitive’ learning styles have evolved primarily in higher vertebrates [62] coinciding with a massive increase in brain-size1. A variety of terms have been suggested to capture the qualitative distinctions between different learning processes, for instance, procedural and declarative [5, 186], dispositional and representational [32, 110], implicit and explicit [142], and, incremental and all-or-none [173]. This variation reflects the fact that there may be a number of important dimensions of difference involved. Here I will adopt the terms dispositional and representational suggested by Thomas [32] and Morris [110] to refer to these two clusters of learning processes. A fuller understanding of learning systems, in which their similarities, differences, and interactions are better understood, can be gained by realising the mechanisms in computational models and evaluating them in various task domains. This agenda describes much of the recent connectionist research in cognitive science and Artificial Intelligence (AI). Learning and connectionism The explosion of research in connectionist modelling in the last ten years has reawakened interest in associative learning and has motivated researchers to attempt to construct complex learning systems out of simpler associative components. Connectionist systems, or artificial neural networks, consist of highly interconnected networks of simple processing units in which knowledge is stored in the connection strengths or weights between units. These systems demonstrate remarkable learning capabilities yet adaptation of the network weights is governed by only a small number of simple rules. Many of these rules have their origins in psychological learning theories—the associationist ideas of Locke, James and others, Thorndyke’s ‘law of effect’, Hull’s stimulus-response theory, and the correlation learning principles proposed by Hebb. Although most contemporary connectionist models use more sophisticated learning rules and assume network circuitry unlikely to occur in real
1 When body-size is taken into account, the brains of higher vertebrates are roughly ten times as large as those of lower vertebrates [68].
neural nets, the impression remains of a deep similarity with the adaptive capabilities of biological systems. Classical connectionist research in the 1960s by Rosenblatt [139] and by Widrow and Hoff [182] concerned the acquisition of target associative mappings by adjusting a single layer of adaptive weights under feedback from a ‘teacher’ with knowledge of the correct responses. However, researchers have since relaxed many of the assumptions embodied in these early systems. First, multi-layer systems have been developed that adaptively recode the input by incorporating a trainable layer of ‘hidden’ units (e.g. [140]). This development surmounted what had been a major limitation of early connectionist systems—the inability of networks with only one adaptive layer to efficiently represent certain classes of input-output mappings [102]. Second, reinforcement learning systems have been developed that learn appropriate outputs without guidance from a ‘teacher’ by using environmental feedback in the form of positively or negatively reinforcing outcomes (e.g. [162]). These systems have been extended further to allow learning in delayed reward tasks in which reinforcing outcomes occur only after a sequence of stimuli have been observed and actions performed. Recent work has also considered reinforcement learning in multilayer systems with adaptive recodings (e.g. [167]). Finally, model-based associative learning systems have been developed that, rather than acquiring task knowledge directly, explicitly encode knowledge about causal processes (forward models) or environment structure (world models) [41, 69, 106, 107, 164, 166]. This knowledge then forms the basis either for task-specific learning or for decision-making by interpolation, inference, or planning. It is clear that there are certain parallels between the connectionist learning systems described so far and the classes of psychological learning processes described above. In particular, there seems to be a reasonable match between some forms of reinforcement learning and dispositional learning, and between model-based learning and certain aspects of representational learning processes. To summarise, the former pair are both concerned with the gradual acquisition of associations between events in the context of specific rewarding outcomes. Although these events might individually
be composed of elaborate compound patterns, and the acquired link may involve recoding processes, the input-output relation is of a simple, reflexive nature. On the other hand, model-based learning and representational learning, while being associative in a broad sense (in that they concern the acquisition of knowledge of the relationships between events), generally involve the construction of representations of causal or world knowledge to be used by other learning or decision-making processes. These learning processes may also have other characteristics such as the involvement of domain-specific learning mechanisms and/or memory structures. The ‘Animat’ approach to understanding adaptive behaviour The shared interest in adaptive systems, between psychologists and ethologists, on the one hand, and Artificial Intelligence researchers and roboticists on the other, has recently seen the development of a new inter-disciplinary research field. Being largely uncharted it goes by a variety of titles—‘comparative’ or ‘biomimetic’ cognitive science (Roitblat [138]), ‘computational neuroethology’ (Cliff [35]), ‘behaviour-based’ AI (Maes [91]), ‘animat’ (simulated animal) AI (Wilson [187]) or ‘Nouvelle’ AI (Brooks [24]). The common research aim is to understand how autonomous agents—animals, simulated animals, robots, or simulated robots—can survive and adapt in their environments, and be successful in fulfilling needs and achieving goals. The following seeks to identify some of the key elements of this approach by citing some of its leading proponents. Wilson [187] identifies the general methodology of this research programme as follows: “The basic strategy of the animat approach is to work towards higher levels of intelligence from below—using minimal ad hoc machinery. The essential process is incremental and holistic [...] it is vital (1) to maintain the realism and wholeness of the environment [...] (2) to maximise physicality in the sensory signals [...] and (3) to employ adaptive mechanisms maximally, to minimalise the rate of introduction of new machinery and maximise understanding of adaptation.” ([187] p. 16)
An important theme is that control, in the agent, is not centralised but is distributed between multiple task-oriented modules— “The goal is to build complete intelligent systems. To the extent that the system consists of modules, the modules are organised around activities, such as path-
finding, rather than around sensory or representational systems. Each activity is a complete behaving sub-system, which individually connects perception to action.” (Roitblat [138] p. 9)
The animat approach therefore seeks minimal reliance on internal world models and reasoning or planning processes— “We argue that the traditional idea of building a world model, or a representation of the state of the world is the wrong idea. Instead the creature [animat] needs to process only aspects of the world that are relevant to its task. Furthermore, we argue that it may be better to construct theoretical tools which instead of using the state of the world as their central formal notion, instead use the aspects that the creature is sensing as the primary formal notion.” (Brooks [23] p. 436)
It advocates, instead, an emphasis on the role of the agent’s interaction with its environment in driving the selection and performance of appropriate, generally reflexive, behaviours— “Rather than relying on reasoning to intervene between perception and action, we believe actions derive from very simple sorts of machinery interacting with the immediate situation. This machinery exploits regularities in its interaction with the world to engage in complex, apparently planful activity without requiring explicit models of the world.” (Chapman and Agre [29] p. 1) “One interesting hypothesis is that the most efficient systems will be those that convert every frequently encountered important situation to one of ‘virtual stimulus-response’ in which internal state (intention, memory) and sensory stimulus together form a compound stimulus that immediately implies the correct next intention or external action. This would be in contrast to a system that often tends to ‘figure out’ or undertake a chain of step by step reasoning to decide the next action.” (Wilson [187] p. 19)
Perception too is targeted at acquiring task-relevant information rather than delivering a general description of the current state of the perceived world— “The basic idea is that it is unnecessary to equip the animat with a sensory apparatus capable at all times of detecting and distinguishing between objects in its environment in order to ensure its adaptive competence. All that is required is that it be able to register only the features of a few key objects and ignore the rest. Also those objects should be indexed according to the intrinsic features and properties that make them significant.” (Meyer and Guillot [98] p. 3).
It is clear, from this brief overview, that the ‘Animat’ approach is in good accord with reinforcement learning approaches to the adaptation of behavioural competences. Given the stated aim of building ‘complete intelligent systems’ in an incremental, bottom-up fashion, this is wholly consistent with the earlier observation that learning in simpler animals is principally of a dispositional nature. However, the development of this research paradigm is already beginning to see the need for some representational learning. One reason for this is the emphasis on mobile robotics as the domain of choice for investigating animat AI. The next section contains a preliminary look at this issue.

Navigation as a forcing domain

The fundamental skill required by a mobile agent is the ability to move around in the immediate environment quickly and safely; this will be referred to here as local navigation competence. Research in animat AI has had considerable success in using pre-wired reactive competences to implement local navigation skills [6, 22, 38, 170]. The robustness, fluency, and responsiveness of these systems have played a significant role in promoting the animat methodology as a means for constructing effective, autonomous robots. In this thesis the possibility of acquiring adaptive local navigation competences through reinforcement learning is investigated and advanced as an appropriate mechanism for learning or fine-tuning such skills. However, a second highly valuable form of navigation expertise is the ability to find and follow paths to desired goals outside the current visual scene. This skill will be referred to here as way-finding. The literature on animal spatial learning differentiates the way-finding skills of invertebrates and lower vertebrates from those of higher vertebrates (birds and mammals). In particular, it suggests that invertebrate navigation is performed primarily by using path integration mechanisms and compass senses and secondarily by orienting to specific remembered stimulus patterns (landmarks) [26-28, 178, 179]. This suggests that invertebrates do not construct models of the spatial layout of their environment and that consequently, their way-finding behaviour is
relatively inflexible and restricted to homing or retracing familiar routes2. In contrast, higher vertebrates appear to construct and use representations of the spatial relations between locations in their environments (see, for example, [52, 119, 120, 122]). They are then able to use these models to select and follow paths to desired goals. This form of learning is often regarded as the classic example of a representational learning process (e.g. [152]). This evidence has clear implications for research in animat AI. First, it suggests that the current ethos of minimal representation and reactive competence could support way-finding behaviour similar to that of invertebrates3. Second, however, the acquisition of more flexible way-finding skills would appear to require model-based learning abilities; this raises the interesting issue of how control and learning architectures in animat AI should be developed to meet this need.
Content of the thesis

The above seeks to explain the motivation for the research described in the remaining chapters. However, although inspired by the desire to understand and explain learning in natural systems, the work to be described primarily concerns learning in artificial systems. The motivation, like much of the work in connectionism, is to seek to understand learning systems from a general perspective before attempting to apply this understanding to the interpretation and modelling of animal or human behaviour. I have suggested above that much of the learning that occurs in natural systems clusters into two fundamental classes—dispositional and representational learning. I have further suggested that these two classes are loosely analogous to reinforcement learning and model-based learning approaches in connectionism. Finally, I have proposed that a forcing domain for the development of model-based learning systems is that of navigation. These ideas form the focus for the work in this thesis.
2 Gould [54] has proposed a contrary view, that insects do construct models of spatial layout; however, the balance of evidence (cited above) appears to be against this position.
3 In particular it should be possible to exploit the good odometry information available to mobile robots.
The first objective, which is the focus of chapters two through five, is to understand reinforcement learning systems. A particular concern is with learning in continuous state-spaces and with continuous outputs. Many natural learning problems and most tasks in robot control are of this nature; however, much existing work in reinforcement learning has concentrated primarily on finite, discrete state and action spaces. These chapters concentrate on the issues relating to exploration and adaptive recoding in reinforcement learning. In particular, chapters four and five propose and evaluate a novel architecture for adaptive coding in which a network of local expert units with trainable receptive fields is applied to continuous reinforcement learning problems. A second objective, which is the topic of chapter six, is the consideration of reinforcement learning as a tool for acquiring adaptive local navigation competences. This chapter also introduces the theme of navigation, which is continued through chapters seven and eight where the possibility of model-based learning systems for way-finding is considered. The focus of these later chapters is on two questions: first, whether spatial representations for way-finding should encode topological or metric knowledge of spatial relations; and second, whether a global representation of space is desirable as opposed to multiple local models. Finally, chapter nine seeks to draw some conclusions from the work described and considers future directions for research. A more detailed summary of the contents of each chapter is as follows:

Chapter Two—Reinforcement Learning Systems introduces the study of learning systems in general and of reinforcement and delayed reinforcement learning systems in particular. It focuses specifically on learning in continuous state-spaces and on the Actor/Critic systems that have been proposed for such tasks, in which one learning element (the Actor) learns to control behaviour while the other (the Critic) learns to predict future rewards. The relationship of delayed reward learning to dynamic programming is reviewed and the possibility of systems that integrate reinforcement learning with model-based learning is considered. The chapter concludes by arguing that, despite the absence of strong theoretical results, reinforcement learning should be possible in tasks with only partial state information where the strict equivalence with stochastic dynamic programming does not apply.
Chapter Three—Exploration considers methods for determining effective exploration behaviour in reinforcement learning systems. This chapter primarily concerns the indirect effect on exploration of the predictions determined by the critic system. The analysis given shows that if the initial evaluation is optimistic relative to available rewards then an effective search of the state-space will arise that may prevent convergence on sub-optimal behaviours. The chapter concludes with a brief review of more direct methods for adapting exploration behaviour.

Chapter Four—Input Coding for Reinforcement Learning considers the task of recoding a continuous input space in a manner that will support successful reinforcement learning. Three general approaches to this problem are considered: fixed quantisation methods; unsupervised learning methods for adaptively generating an input coding; and adaptive methods that modify the input coding according to the reinforcement received. The advantages and drawbacks of various recoding techniques are considered and a novel multilayer learning architecture is described in which a recoding layer of Gaussian basis function (GBF) units with adaptive receptive fields is trained by generalised gradient descent to maximise the expected reinforcement. The performance of this algorithm is demonstrated on a simple immediate reinforcement task.

Chapter Five—Experiments in Delayed Reinforcement Learning Using Networks of Basis Function Units applies the algorithm developed in the previous chapter to a delayed reinforcement control task (the pole-balancer) that has often been used as a test-bed for reinforcement learning systems. The performance of the GBF algorithm is compared and contrasted with other work, and considered in relation to the problem of input sampling that arises in real-time control tasks. The interface between explicit task knowledge and adaptive reinforcement learning is considered, and it is proposed that the GBF algorithm may be suitable for refining the control behaviour of a coarsely pre-specified system.

Chapter Six—Adaptive Local Navigation introduces the topic of navigation and argues for the division of navigation competences between tactical, local navigation skills that deal with the immediate problems involved in moving efficiently while avoiding collisions, and strategic, way-finding skills that allow the successful planning and execution of paths to distant goals. It further argues that local navigation can be efficiently supported by adaptive dispositional learning processes, while way-
finding requires task-independent knowledge of the environment, in other words, it requires representational, or model-based, learning of the spatial layout of the world. A modular architecture in the spirit of Animat AI is proposed for the acquisition of local navigation skills through reinforcement learning. To evaluate this approach a prototype model of an acquired local navigation competence is described and successfully tested in a simulation of a mobile robot. Chapter Seven—Representations for Way-finding: Topological Models. Some recent research in Artificial Intelligence has favoured spatial representations of a primarily topological nature over more quantitative models on the grounds that they are: cheaper and easier to construct, more robust in the face of poor sensor data, simpler to represent, more economical to store, and also, perhaps, more biologically plausible. This chapter suggests that it may be possible, given these criteria, to construct sequential route-like knowledge of the environment, but that to integrate this information into more powerful layout models or maps may not be straightforward. It is argued that the construction of such models realistically demands the support of either strong vision capabilities or the ability to detect higher-order geometric relations. And further, that in the latter case, it seems hard to justify not using the acquired information to construct models with richer geometric structure that can provide more effective support to way-finding. Chapter Eight—Representations for Way-finding: Local Allocentric Frames. This chapter describes a representation of metric environmental spatial relations with respect to landmark-based local allocentric frames. The system works by recording in a relational network of linear units the locations of salient landmarks relative to barycentric coordinate frames defined by groups of three nearby cues. It is argued that the robust and economical character of this system makes it a feasible mechanism for way-finding in large-scale space. The chapter further argues for a heterarchical view of spatial knowledge for way-finding. It proposes that knowledge should be constructed in multiple representational ‘schemata’ where different schemata are distinguished not so much by their geometric content but by their dependence on different sensory modalities, environmental cues, or computational mechanisms. It thus argues against storing unified models of space, favouring instead the use of runtime arbitration mechanisms to decide the relative contributions of different local models in determining appropriate way-finding behaviour.
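As a concrete illustration of the coordinate scheme just summarised, the following sketch computes the barycentric coordinates of a landmark with respect to a frame defined by three nearby cues, and recovers the landmark position from those coordinates once the cues are re-observed. It is an illustrative fragment only: the relational network of linear units and the storage and search mechanisms of Chapter Eight are not reproduced, and the function names are illustrative rather than those used later in the thesis.

import numpy as np

def barycentric_coords(landmark, cue_a, cue_b, cue_c):
    # Solve landmark = u*A + v*B + w*C subject to u + v + w = 1 (2-D positions).
    M = np.array([[cue_a[0], cue_b[0], cue_c[0]],
                  [cue_a[1], cue_b[1], cue_c[1]],
                  [1.0,      1.0,      1.0]])
    return np.linalg.solve(M, np.array([landmark[0], landmark[1], 1.0]))

def locate(coords, cue_a, cue_b, cue_c):
    # Recover the landmark position from its stored coordinates and re-observed cues.
    return (coords[0] * np.asarray(cue_a, dtype=float)
            + coords[1] * np.asarray(cue_b, dtype=float)
            + coords[2] * np.asarray(cue_c, dtype=float))

if __name__ == "__main__":
    A, B, C = (0.0, 0.0), (4.0, 0.0), (0.0, 3.0)   # three nearby cues defining the local frame
    u = barycentric_coords((1.0, 1.0), A, B, C)
    print(u.sum(), locate(u, A, B, C))             # coordinates sum to 1; reconstruction returns [1. 1.]

Because the coordinates are defined relative to the cue triangle rather than to the observer, the same stored values remain valid wherever the cues are seen from, which is what makes the frame allocentric.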
Chapter Nine: Conclusions and Future Work summarises the findings of the thesis and considers some areas where further research might be worthwhile.
Chapter Two
Reinforcement Learning Systems

Summary

The purpose of this chapter is to set out the background to the learning systems described in later parts of the thesis. It therefore consists primarily of a description of reinforcement learning systems, and particularly of the actor/critic and temporal difference learning methods developed by Sutton and Barto. Reinforcement learning systems have been studied since the early days of artificial intelligence. An extensive review of this research has been provided by Sutton [162]. Williams [184] has also discussed a broad class of reinforcement learning algorithms, viewing them from the perspective of gradient ascent learning and in relation to the theory of stochastic learning automata. An account of the relationship between delayed reinforcement learning and the theory of dynamic programming has been provided by Watkins [177] and is clearly summarised in [14]. The theoretical understanding of these algorithms has recently seen several further advances [10, 41, 65]. In view of the thoroughness of these existing accounts, the scope of the review given here is limited to what I hope is a sufficient account of the theory of reinforcement learning to support the work described later.

The structure of this chapter is as follows. The study of learning systems and their embodiment in neural networks is briefly introduced from the perspective of function estimation. Reinforcement learning methods are then reviewed and considered as gradient ascent learning algorithms following the analysis given by Williams [184]. A sub-class of these learning methods is the reinforcement comparison algorithms developed by Sutton. Temporal difference methods for learning in delayed reinforcement tasks are then described within the framework developed by Sutton and
Barto [11, 162, 163] and by Watkins [177]. This section also includes a brief review of the relationship of TD methods to both supervised learning and dynamic programming, and describes the actor/critic architecture for learning with delayed rewards which is studied extensively in later chapters. Finally, a number of proposals for combining reinforcement learning with model-based learning are reviewed, and the chapter concludes by considering learning in circumstances where the system has access to only partial state information.
2.1 Associative Learning Systems
Learning appropriate behaviour for a task can be characterised as forming an associative memory that retrieves suitable actions in response to stimulus patterns. A system that includes such a memory and a mechanism by which to improve the stored associations during interactions with the environment is called an associative learning system. The stimulus patterns, which provide the inputs to a learning system, are measures of salient aspects of the environment from which suitable outputs (often actions) can be determined. However, a learning system may also attend to a second class of stimuli called feedback signals. These signals arise as part of the environment’s response to the recent actions of the system and provide measures of its performance. In general, therefore, we are concerned with learning systems, as depicted in Figure 2.1, that improve their responses to input stimuli under the influence of feedback from the environment.
Figure 2.1: A learning system viewed as an associative memory. Inputs (stimuli) are mapped by the memory component to outputs (actions); the learning mechanism causes associations to be formed in memory in accordance with feedback signals from the environment (adapted from [162]).
Associative memories are mappings

Mathematically, the behaviour of any system that transforms inputs into outputs—stimuli into responses—can be characterised as a function $f$ that maps an input domain $X$ to an output domain $Y$. Any associative memory is therefore a species of mapping. Generally we will be concerned with mappings over input and output domains that are multi-dimensional vector spaces. That is, the input stimulus will be described by a vector4 $\mathbf{x} = (x_1, x_2, \ldots, x_N)^T$ whose elements each measure some salient aspect of the current environmental state, and the output will also be a vector $\mathbf{y} = (y_1, y_2, \ldots, y_M)^T$ whose elements characterise the system’s response. In order to learn, a system must be able to modify the associations that encode the input-output mapping. These adaptable elements of memory are the parameters of the learning system and can be described by a vector $\mathbf{w}$ taken from a domain $W$. The mapping defined by the memory component of a learning system can therefore be written as the function $\mathbf{y} = f(\mathbf{w}, \mathbf{x})$, $f: W \times X \rightarrow Y$.
Varieties of learning problem

To improve implies a measure of performance. As suggested above such measures are generally provided in the form of feedback signals. The nature of the available feedback can be used to classify different learning problems as supervised, reinforcement, or unsupervised learning tasks. In supervised learning feedback plays an instructive role indicating, for any given input, what the output of the system ought to have been. The environment trains the learning system by supplying examples of a target mapping $\mathbf{y}^* = F(\mathbf{x})$, $F: X \rightarrow Y$.
4 Vectors are normally considered to be column vectors. Superscript T is used to indicate the transpose of a row vector into a column vector or vice versa.
For any input-output pair $(\mathbf{x}, \mathbf{y}^*)$ a measure of the error in the estimated output $\mathbf{y}$ can be determined. The task of the learning system is then to adapt the parameters $\mathbf{w}$ in a way that will minimise the total error over all the input patterns in the training set. Since the goal of learning is for the associative memory to approximate the target mapping, learning can be viewed as a problem of function approximation. In contrast, the feedback provided in a reinforcement learning task is of a far less specific nature. A reinforcement signal is a scalar value judging whether performance is good or bad but not indicating either the size or direction of the output error. Some reinforcement problems provide feedback in the form of regular assessments on sliding scales. In other tasks, however, reinforcement can be both more intermittent and less informative. At the most ‘minimalist’ end of the spectrum signals can indicate as little as whether the final outcome following a long sequence of actions was a success or a failure. In reinforcement learning the target mapping is any mapping that achieves maximum positive reinforcement and minimum negative reinforcement. This mapping is generally not known in advance; indeed, there may not be a unique optimal function. Learning therefore requires active exploration of alternative input-output mappings. In this process different outputs (for any given input stimulus) are tried out, the consequent rewards are observed, and the estimated mapping $f$ is adapted so as to prefer those outputs that are the most successful. Finally, in unsupervised learning there is no feedback; indeed, there is no teaching signal at all other than the input stimulus. Unsupervised training rules are generally devised, not with the primary goal of estimating a target function, but with the aim of developing useful or interesting representations of the input domain (for input to other processes). For example, a system might learn to code the input stimuli in a more compact form that retains a maximal amount of the information in the original signals.

Learning architectures

A functional description of an arbitrary learning system was given above as a mapping from an input domain $X$ to an output domain $Y$. In order to simplify the account this chapter focuses on mappings for which the input $\mathbf{x} \in X$ is multi-valued
and the output $y \in Y$ is a scalar. All the learning methods described will, however, generalise in a straightforward way to problems requiring a multi-valued output. In order to specify an appropriate system, $y = f(\mathbf{w}, \mathbf{x})$, for a particular task three principal issues need to be considered. Following Poggio and Girosi [126] these will be referred to as the representation, learning, and implementation problems.

• The representation problem concerns the choice of a suitable form of $f$ (that is, how $y$ depends on $\mathbf{w}$ and $\mathbf{x}$) such that a good mapping for the task can be learned. Choosing any particular form of $f$ can enormously constrain the range of mappings that can be approximated (to any degree of accuracy) regardless of how carefully the parameters $\mathbf{w}$ are selected.

• The learning problem is concerned with selecting appropriate rules for finding good parameter values with a given choice of $f$.

• Finally, the implementation problem concerns the choice of an efficient device in which to realise the abstract idea of the learning system (for instance, appropriate hardware).
This chapter is primarily concerned with the learning problem for the class of systems that are based on the linear mapping

$$ y = f(\mathbf{w}, \mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) \qquad (2.1.1) $$

That is, $y$ is chosen to be the product of a parameter vector $\mathbf{w} = (w_1, w_2, \ldots, w_P)^T$ and a recoding vector $\boldsymbol{\phi}(\mathbf{x}) \in \Phi$ whose elements $\phi_1(\mathbf{x}), \phi_2(\mathbf{x}), \ldots, \phi_P(\mathbf{x})$ are basis functions of the original stimulus pattern. In other words, we assume the existence of a recoding function $\boldsymbol{\phi}$ that maps each input pattern to a vector representation in a new domain $\Phi$. For any desired output mapping over $X$, an appropriate recoding can be defined that will allow a good approximation to be acquired. Of course, for any specific choice of $\boldsymbol{\phi}$ only the limited class of linear mappings over $\Phi$ can be estimated. The representation problem is not solved therefore; rather, it is transmuted into the problem of selecting, or learning, a suitable coding for a given task. What equation 2.1.1 does allow, however, is for a clear separation to be made between the choice of $\boldsymbol{\phi}$ and the choice of suitable learning rules for a linear system. As this chapter concentrates on the latter it therefore assumes the existence of an adequate, fixed
coding of the input. The recoding problem, which is clearly of critical importance, will be considered later as the main topic of Chapters Four and Five.

The process of learning involves a series of changes, or increments, to the parameters of the system. Thus, for instance, the jth update step for the parameter vector $\mathbf{w}$ could be written either as $\mathbf{w}_{j+1} = \mathbf{w}_j + \Delta\mathbf{w}_j$ or as $w_k(j+1) = w_k(j) + \Delta w_k(j)$, giving, respectively, the new value of the vector, or of the kth individual parameter. Since many rules of this type will be considered in this thesis, a more concise notation will be used whereby a rule is expressed in terms of the increment alone, that is, by defining either $\Delta\mathbf{w}$ or $\Delta w_k$.

Error minimisation and gradient descent learning

As suggested above, supervised training occurs by providing the learning system with a set of example input/output pairs $(\mathbf{x}, y^*)$ of the target function $F$. This allows the task of the learning system to be defined as a problem of error minimisation. The total error for a given value of $\mathbf{w}$ can be written as

$$ E(\mathbf{w}) = \sum_i \| y_i^* - y_i \| \qquad (2.1.2) $$
where $i$ indexes over the full training set and $\|\cdot\|$ is a distance measure. An optimal set of parameters $\mathbf{w}^*$ is one for which this total error is minimised, i.e. where $E(\mathbf{w}^*) \leq E(\mathbf{w})$ for every choice of the parameter vector $\mathbf{w}$. A gradient descent learning algorithm is an incremental method for improving the function approximation provided by the parameter vector $\mathbf{w}$. The error function $E(\mathbf{w})$ can be thought of as defining an error surface over the domain $W$. (For instance, if the parameter space is two-dimensional then $E(\mathbf{w})$ can be visualised as the height of a surface, above the 2-D plane, at each possible position $(w_1, w_2)$.) Starting from a given position $E(\mathbf{w}^{(0)})$ on the error surface, gradient descent involves moving the parameter vector a small distance in the direction of the steepest downward gradient and calculating a new estimate $E(\mathbf{w}^{(1)})$ of the total error. This process is then repeated over multiple iterations. On the jth pass through the training set the error
gradient $\mathbf{e}^{(j)}$ is given by the partial derivative of the total error with respect to the weights, i.e. by

$$ \mathbf{e}^{(j)} = - \frac{\partial E(\mathbf{w}^{(j)})}{\partial \mathbf{w}^{(j)}} \qquad (2.1.3) $$

This gives the iterative procedure for updating the parameter estimate

$$ \Delta\mathbf{w}^{(j)} = \alpha \, \mathbf{e}^{(j)} \qquad (2.1.4) $$
where α is the step size or learning rate (0 < α
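To make the preceding definitions concrete, the following sketch trains the linear mapping of equation 2.1.1 by per-pattern (stochastic) gradient descent on a squared output error, in the spirit of equations 2.1.2-2.1.4, using a fixed grid of Gaussian basis functions as the recoding φ. It is a minimal illustration only: the squared-error distance measure, the fixed centres and widths (the adaptive receptive fields of Chapters Four and Five are not implemented), and all names and parameter values are assumptions of the sketch.

import numpy as np

def gaussian_recoding(x, centres, width):
    # phi(x): one Gaussian basis function response per centre (the recoding of eq. 2.1.1).
    d2 = np.sum((centres - x) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * width ** 2))

def train_linear_map(samples, targets, centres, width, alpha=0.1, passes=50):
    # Per-pattern gradient descent on the squared output error.
    w = np.zeros(len(centres))
    for _ in range(passes):
        for x, y_star in zip(samples, targets):
            phi = gaussian_recoding(x, centres, width)
            y = w @ phi                        # current estimate y = w . phi(x)
            w += alpha * (y_star - y) * phi    # step along the negative error gradient
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    xs = rng.uniform(-1.0, 1.0, size=(200, 1))
    ys = np.sin(np.pi * xs[:, 0])                       # target mapping F(x)
    centres = np.linspace(-1.0, 1.0, 9).reshape(-1, 1)  # fixed grid of receptive fields
    w = train_linear_map(xs, ys, centres, width=0.25)
    print(w @ gaussian_recoding(np.array([0.5]), centres, 0.25))  # close to sin(pi/2) = 1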
$$ R(t) = \sum_{k=1}^{\infty} \gamma^{k-1} r(t+k) \qquad (2.3.1) $$

The goal of the critic in delayed reward tasks is to learn to anticipate the expected value of this return. In other words, the prediction V(t) should be an estimate of

$$ \mathrm{I\!E}(R) = \mathrm{I\!E}\left[ \sum_{k=1}^{\infty} \gamma^{k-1} r(t+k) \right] \qquad (2.3.2) $$
13 In some tasks, particularly those with deterministic transition and reward functions, an infinite time horizon clearly is desirable. However, a discounted return measure is still needed for most reinforcement learning algorithms as this gives the sum of future returns a finite value. A discussion of this issue and a proposal for a reinforcement learning method that allows an infinite time horizon is given in Schwartz [148].
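As a small illustration of the quantity being predicted, the fragment below computes the discounted return of equation 2.3.1 for a finite reward sequence (a finite list stands in for the infinite sum; the function name is illustrative).

def discounted_return(rewards, gamma):
    # R(t) = sum over k of gamma^(k-1) * r(t+k), for rewards = [r(t+1), r(t+2), ...] (eq. 2.3.1).
    return sum(gamma ** (k - 1) * r for k, r in enumerate(rewards, start=1))

# A reward of 1 arriving after three empty steps, discounted by gamma = 0.9:
print(discounted_return([0.0, 0.0, 0.0, 1.0], gamma=0.9))   # 0.9**3 = 0.729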
The TD(0) learning rule

One way of estimating the expected return would be to average values of the truncated return after n steps

$$ \sum_{k=1}^{n} \gamma^{k-1} r(t+k) \qquad (2.3.3) $$
In other words, the system could wait n steps, calculate this sum of discounted rewards, and then use it as a target value for computing a gradient descent error term. However, for any value of n, there will be an error in this estimate equal to the prospective rewards (for as yet unexperienced time-steps)

$$ \sum_{k=n+1}^{\infty} \gamma^{k-1} r(t+k) $$
The central idea of TD learning is to notice that this error can be reduced by making use of the predicted return V(t + n) associated with the context input at time t + n. Combining the truncated return (2.3.3) with this prediction gives an estimate called the corrected n-step return $R_n(t)$

$$ R_n(t) = \sum_{k=1}^{n} \gamma^{k-1} r(t+k) + \gamma^n V(t+n) \qquad (2.3.4) $$
Of course, at the start of training the predictions generated by the critic will be poor estimates of the unseen rewards. However, Watkins [177] has shown that the expected value of the corrected return will on average be a better estimate of R(t) than the current prediction at all stages of learning. This is called the error reduction property of $R_n(t)$. Because of this useful property, estimates of $R_n(t)$ are suitable targets for training the predictor. The estimator used in the temporal difference method which Sutton calls TD(0) is the one-step corrected return

$$ R_1(t) = r(t+1) + \gamma V(t+1) \qquad (2.3.5) $$

which leads to a gradient descent error term called the TD error

$$ e_{TD}(t+1) = [r(t+1) + \gamma V(t+1)] - V(t) \qquad (2.3.6) $$
Substituting this error into the critic learning rule (2.2.17) gives the update equation

$$ \Delta\mathbf{v} = \alpha \, e_{TD}(t+1) \, \boldsymbol{\phi}(t) \qquad (2.3.7) $$
The learning process occurs as follows. The system observes the current context, as encoded by the vector φ(t), and calculates the prediction V(t). It then performs the action associated with this context. The environment changes as a result of the action, generating a new context φ(t + 1) and a reinforcement signal r(t + 1) (which may be zero). The system then calculates the new prediction V(t + 1), establishes the error in the first prediction, and updates the parameter vector $\mathbf{v}$ appropriately. To prevent changes in the weights made after each step from biasing the TD error it is preferable to calculate the predictions V(t) and V(t + 1) using the same version of the parameter vector. To make this point clear, the notation V(a|b) is introduced to indicate the prediction computed with the parameter vector at time t = a for the context at time t = b. The TD error term is then written as

$$ e_{TD}(t+1) = r(t+1) + \gamma V(t\,|\,t+1) - V(t\,|\,t) \qquad (2.3.8) $$
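The cycle just described can be written down directly for a linear critic V(t) = v·φ(t). The sketch below is illustrative only: the actor is omitted, the names and parameter values are assumptions, and the usage example uses a fixed sequence of one-hot contexts simply to show the backward chaining of predictions described in the next paragraph.

import numpy as np

def td0_update(v, phi_t, phi_next, r_next, alpha=0.1, gamma=0.9):
    # One TD(0) step for a linear critic V = v . phi (eqs. 2.3.6-2.3.8).
    V_t = v @ phi_t                          # V(t|t): prediction for the old context
    V_next = v @ phi_next                    # V(t|t+1): prediction for the new context, same weights
    e_td = r_next + gamma * V_next - V_t     # TD error
    return v + alpha * e_td * phi_t, e_td    # credit the context that made the prediction (2.3.7)

if __name__ == "__main__":
    # Three one-hot contexts followed by a fixed reward of 1; the terminal prediction is taken as zero.
    contexts, v = np.eye(3), np.zeros(3)
    for trial in range(200):
        for i in range(3):
            phi_next = contexts[i + 1] if i < 2 else np.zeros(3)
            r_next = 1.0 if i == 2 else 0.0
            v, _ = td0_update(v, contexts[i], phi_next, r_next)
    print(np.round(v, 2))   # approaches [gamma**2, gamma, 1] = [0.81, 0.9, 1.0]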
The TD(0) learning rule carries the expectation of reward one interval back in time, thus allowing for the backward chaining of secondary reinforcement. For example, consider a task in which the learning system experiences, over repeated trials, the same sequence of context inputs (with no rewards attached) followed by a fixed reward signal. On the first trial the system will learn that the final pattern predicts the primary reinforcement. On the second trial it will learn that the penultimate pattern predicts the secondary reinforcement associated with the final pattern. In general, on the kth trial, the context that is seen k steps before the reward will start to predict the primary reinforcement.

The family of TD(λ) learning methods

The TD(0) learning rule will eventually carry the expectation of a reward signal back along a chain of stimuli of arbitrary length. The question that arises, however, is whether it is possible to propagate the expectation at a faster rate. Sutton suggested that this can be achieved by using the TD error to update the predictions associated with a sequence of past contexts, where the update size for each context is weighted according to recency. A learning rule [163] that incorporates this heuristic is

$$ \Delta\mathbf{v}(t) = \alpha \, e_{TD}(t+1) \sum_{k=1}^{t} \lambda^{t-k} \boldsymbol{\phi}(k) \qquad (2.3.9) $$
where " (0 ! " ! 1) is a decay parameter that causes an exponential fall-off in the update size as the time interval between context and reward lengthens. One of the advantages of this rule is that the sum on the right hand side can be computed recursively using an activity trace vector ! (t) given by (2.3.10)
! (t) = ! (t) + "! (t # 1)
where ! (0) = 0 (the zero vector). This gives the TD(") update rule
!v(t) = " eTD(t + 1) # (t) .
(2.3.11)
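A single step of this trace-based update can be sketched as follows. The names are illustrative; the trace follows Sutton's form (2.3.10), with decay λ, and should be reset to the zero vector at the start of each trial. Watkins' derivation, discussed next, substitutes γλ for λ as the trace decay.

import numpy as np

def td_lambda_step(v, trace, phi_t, phi_next, r_next, alpha=0.1, gamma=0.9, lam=0.5):
    # Decay-and-add the activity trace (2.3.10), then apply the TD(lambda) update (2.3.11).
    trace = phi_t + lam * trace                           # recent contexts, weighted by recency
    e_td = r_next + gamma * (v @ phi_next) - v @ phi_t    # TD error (2.3.8)
    v = v + alpha * e_td * trace                          # all recently visited contexts share the credit
    return v, trace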
Watkins shows an alternative method for deriving this learning rule. Instead of using just the $R_1(t)$ estimate, a weighted sum of different n-step corrected returns can be used to estimate the expected return. This is appropriate because such a sum also has the error reduction property14. The TD(λ) return $R^\lambda(t)$ is defined to be such a sum in which the weight for each $R_n(t)$ is proportional to $\lambda^{n-1}$, such that

$$ R^\lambda(t) = (1-\lambda)\left[ R_1(t) + \lambda R_2(t) + \lambda^2 R_3(t) + \ldots \right] $$

Watkins shows that this can be rewritten as the recursive expression

$$ R^\lambda(t) = r(t+1) + \gamma(1-\lambda)V(t+1) + \gamma\lambda R^\lambda(t+1) \qquad (2.3.12) $$
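The agreement between the weighted-sum definition and the recursion (2.3.12) can be checked numerically. The sketch below assumes an episodic sequence in which rewards and predictions are taken as zero beyond the recorded data; the function names are illustrative.

import numpy as np

def lambda_return_recursive(rewards, values, gamma, lam):
    # R^lambda(t) for t = 0..T-1 via the recursion (2.3.12); rewards[i] = r(i+1), values[i] = V(i+1).
    T = len(rewards)
    R = np.zeros(T + 1)
    for t in reversed(range(T)):
        R[t] = rewards[t] + gamma * (1 - lam) * values[t] + gamma * lam * R[t + 1]
    return R[:T]

def n_step_return(rewards, values, gamma, t, n):
    # Corrected n-step return R_n(t) (eq. 2.3.4), with rewards and values zero beyond the data.
    R = sum(gamma ** (k - 1) * rewards[t + k - 1]
            for k in range(1, n + 1) if t + k - 1 < len(rewards))
    if t + n - 1 < len(values):
        R += gamma ** n * values[t + n - 1]
    return R

def lambda_return_weighted(rewards, values, gamma, lam, t, n_max=200):
    # The weighted sum (1 - lam) * sum_n lam^(n-1) * R_n(t), truncated at a large n_max.
    return (1 - lam) * sum(lam ** (n - 1) * n_step_return(rewards, values, gamma, t, n)
                           for n in range(1, n_max + 1))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    r, V = rng.random(6), rng.random(6)
    rec = lambda_return_recursive(r, V, gamma=0.9, lam=0.7)
    print(np.allclose(rec[0], lambda_return_weighted(r, V, 0.9, 0.7, t=0)))   # True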
Using this $R^\lambda(t)$ estimator the gradient descent error for the context at time t is given by $R^\lambda(t) - V(t)$, for which a good approximation15 is the discounted sum of future TD errors

$$ e_{TD}(t+1) + \gamma\lambda \, e_{TD}(t+2) + (\gamma\lambda)^2 e_{TD}(t+3) + \ldots $$
14 Provided the weight on each of the corrected returns is between 0 and 1 and the sum of weights is unity (see Watkins [177]).
15 If learning occurs off-line (i.e. after all context and reinforcement inputs have been seen) then $R^\lambda(t) - V(t)$ can be given exactly as a discounted sum of TD errors. Otherwise, changes in the parameter vector over successive time-steps will bias the approximation by an amount equal to

$$ \sum_{k=1}^{\infty} (\gamma\lambda)^k \left[ V(t+k-1 \,|\, t+k) - V(t+k \,|\, t+k) \right] $$

i.e. the discounted sum of the differences in the prediction of each state visited for successive parameter vectors. This sum will be small if the learning rate is not too large.
From this a rule for updating all past contexts can be shown to be

$$ \Delta\mathbf{v}(t) = \alpha \, e_{TD}(t+1) \sum_{k=1}^{t} (\gamma\lambda)^{t-k} \boldsymbol{\phi}(k) $$

which is identical to Sutton’s update rule but for the substitution of the decay rate γλ for λ.

Understanding TD learning

This section reviews some of the findings concerning the behaviour of TD methods and their relationship to other ways of learning to predict. The aim is to establish a clearer understanding of what these algorithms actually do.

TD(λ) and the Widrow-Hoff rule

Consider the definition of the TD(λ) return (2.3.12). If λ is made equal to one it can easily be seen that $R^1(t)$ is the same as the actual return (2.3.1). Sutton [163] shows that the TD(1) learning rule is in fact equivalent16 to a Widrow-Hoff rule of the form

$$ \Delta\mathbf{v}(t) = \alpha \left( R(t) - V(t) \right) \boldsymbol{\phi}(t) $$

An important question is therefore whether TD methods are anything other than just an elegant, incremental method of implementing this supervised learning procedure. To address this issue Sutton carried out an experiment using a simple delayed reinforcement task called a bounded random walk. In this task there are only seven states A-B-C-D-E-F-G, two of which, A and G, are boundary states. In the non-boundary states there is a 50% chance of moving to the right or to the left along the chain. All states are encoded by orthogonal vectors. A sequence starts in a (random) non-boundary state and terminates when either A or G is reached. A reward of 1 is given at G and zero at A. The ideal prediction for any state is therefore just the probability of terminating at G.
16 This is strictly true only if TD(1) learning occurs off-line, i.e. if the parameters are updated at the end of each trial rather than after each step.
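The bounded random walk is simple enough to reproduce in a few lines. The sketch below is a simplified version of the experiment (a single long training run rather than Sutton's repeated presentations of small training sets, and γ = 1 for this undiscounted task); the one-hot encoding plays the role of the orthogonal context vectors, and all names and parameter values are illustrative.

import numpy as np

def random_walk_sequence(rng):
    # One bounded random walk over states A..G (0..6); start in D, step left/right with equal probability.
    state, states, reward = 3, [3], 0.0
    while state not in (0, 6):
        state += rng.choice((-1, 1))
        states.append(state)
        reward = 1.0 if state == 6 else 0.0
    return states, reward

def train_td_lambda(sequences, lam, alpha=0.05, gamma=1.0):
    # Train linear predictions for the five non-terminal states with TD(lambda) and one-hot contexts.
    v = np.full(7, 0.5)
    v[0], v[6] = 0.0, 0.0                     # terminal states carry no prediction
    for states, reward in sequences:
        trace = np.zeros(7)                   # activity trace reset at the start of each sequence
        for s, s_next in zip(states, states[1:]):
            phi = np.eye(7)[s]
            trace = phi + lam * trace         # activity trace (2.3.10)
            r = reward if s_next in (0, 6) else 0.0
            V_next = 0.0 if s_next in (0, 6) else v[s_next]
            e_td = r + gamma * V_next - v[s]  # TD error (2.3.8)
            v += alpha * e_td * trace         # TD(lambda) update (2.3.11)
    return v[1:6]

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    seqs = [random_walk_sequence(rng) for _ in range(1000)]
    print(np.round(train_td_lambda(seqs, lam=0.0), 2))   # approaches [1/6, 1/3, 1/2, 2/3, 5/6]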
Figure 2.7: Sutton’s random walk task. States A to G form a chain; the numbers indicate the rewards available in the terminal states (A: 0, G: 1) and the ideal predictions in the non-terminal states (B to F: 1/6, 1/3, 1/2, 2/3, 5/6).

Sutton generated a hundred sets of ten randomly generated sequences. TD(λ) learning procedures for seven values of λ, including 0 and 1 (the Widrow-Hoff rule), were then trained repeatedly17 on each set until the parameters converged to a stable solution. Sutton then measured the total mean squared error, between the predictions for each state generated by the learned parameters, and the ideal predictions. Significant variation in the size of this error was found. The total error in the predictions was lowest for λ=0, largest for λ=1, and increased monotonically between the two values. To understand this result it is important to note that each training run used only a small set of data. The supervised training rule minimises the mean squared error over the training set, but as Sutton points out, this is not necessarily the best way to minimise the error for future experience. In fact, Sutton was able to show that the predictions learned using TD(0) are optimal in a different sense in that they maximise the likelihood of correctly estimating the expected reward. He interpreted this finding in the following way:
Sutton performed a second experiment with the bounded random walk task looking at the speed at which good predictions were learned. If each training set was presented just once to each learning method then the best choice for ", in terms of reducing the error most rapidly, was an intermediate value of around 0.3. Watkins also considers
17 The best value of the learning rate α was found for each value of λ in order to make a fair comparison between the different methods. The weights were updated off-line.
this question and points out that the choice of λ is a trade-off between using biased estimates (λ=0) and estimates with high variance (λ=1). He suggests that if the current predictions are nearly optimal then the variance of the estimated returns will be lowest for λ=0 and that therefore the predictor should be trained using TD(0). However, if the predictions are currently poor approximations then the corrections added to the immediate reinforcement signals will be very inaccurate and introduce considerable bias. The best approach overall might therefore be to start with λ=1, giving unbiased estimates but with a high variance, then reduce λ towards zero as the predictions become more accurate.

TD and dynamic programming

A second way to understand TD learning is in relation to the Dynamic Programming methods for determining optimal control actions. ‘Heuristic’ methods of dynamic programming were first proposed by Werbos [180]. However, Watkins [177] has investigated the connection most thoroughly, showing that TD methods can be understood as incremental approximations to dynamic programming procedures. This approach to studying actor/critic learning systems has also been taken by Williams [185]. Dynamic Programming (the term was first introduced by Bellman [15]) is a search method for finding a suitable policy for a Markov decision process. A policy is optimal if the action chosen in every state maximises the expected return as defined above (2.3.2). To compute this optimal control requires accurate models of both the transition function and the reward function (which gives the value of the expected reward that will be received in any state). Given these prerequisites dynamic programming proceeds through an iterative, exhaustive search to calculate the maximum expected return, or optimal evaluation, for each state. Once this optimal evaluation function is known an optimal policy is easily found by selecting in each state the action that leads to the highest expected return in the next state. A significant disadvantage of dynamic programming is that it requires accurate models of the transition and reward functions. Watkins [177] has shown that TD algorithms can be considered as incremental forms of dynamic programming that require no advanced or explicit knowledge of state transitions or of the distribution of
available rewards18. Instead, the learning system uses its ongoing experience as a substitute for accurate models of these functions. His analysis of dynamic programming led Watkins to propose a learning method called ‘Q learning’ that arises directly from viewing the TD procedure as incremental dynamic programming. In Q learning a prediction is associated with each of the different actions available in a given state. While exploring the state space the system improves the prediction for each state/action pair using a gradient learning rule. This learning method does away with the need for explicitly learning the policy: the preferred action in any state is simply the one with the highest associated value, therefore as the system improves its predictions it also adapts its policy. If each action in each state is attempted a sufficient number of times then Q learning will eventually converge to an optimal set of evaluations. A family of Q(λ) learning algorithms that use activity traces similar to those given for TD(λ) can also be defined.

Convergence properties of TD methods

There are now several results showing that TD(λ) and Q(λ) learning algorithms will converge [10, 41, 65, 163, 177], many of them based on an underlying identity with stochastic dynamic programming. The latest proofs demonstrate convergence to the ideal values with probability of one in both batch and on-line training of both types of algorithm. These proofs generally assume tasks that, like the bounded random walk, are guaranteed to terminate, have one-to-one mappings from discrete states to contexts, fixed transition probabilities, and encode different contexts using orthogonal vectors.

Actor/Critic architectures for delayed rewards

The actor/critic learning methods developed by Barto and Sutton and described in section 2.2 can also be applied to learning in tasks with delayed rewards [11, 162]. The separation of action learning from prediction learning has several useful
18 Watkins uses the term 'primitive' learning to describe learning of this sort, likening it to what I have called dispositional learning in animals.
consequences, although analysing the behaviour of the system is more difficult. One important difference is that problems can be addressed in which actions are real-valued (Q learning is restricted to tasks with discrete action spaces). A second advantage arises in learning problems with continuous input spaces. Here the optimal policy and evaluation functions may have quite different non-linearities with respect to the input. Separating the two functions into distinct learning systems can therefore allow appropriate recodings to be developed for each. The benefit of this is shown clearly in the simulations described in Chapter 5. The training rules for actor/critic learning in delayed reward tasks are based on the rules described above for immediate reinforcement problems (section 2.2). With delayed rewards the goal of the system is to maximise the expected return $E(R)$. The critic element is therefore trained to predict $E(R)$ using the TD(λ) procedure, while the performance element is trained to maximise $E(R)$ using a variant of the reinforcement comparison learning rule (equation 2.2.14). This gives the update $\Delta w$ for the parameters of the actor learning system
!w(t) = " [ r(t + 1) + #V(t + 1) $ V (t)] %w(t) .
(2.3.13)
Here $\bar{e}(t)$ is a sum of past eligibility vectors (section 2.2) weighted according to recency. This allows the actions associated with several past contexts to be updated at each time-step. This eligibility trace is given by the recursive rule
$$\bar{e}(0) = 0\,, \qquad \bar{e}(t) = e(t) + \delta\,\bar{e}(t-1)$$
(2.3.14)
where $\delta$ is the rate of trace decay. The eligibility trace and the activity trace (2.3.10) (used to update the critic parameters) encode images of past contexts and behaviours that persist after the original stimuli have disappeared. They can therefore be viewed as short-term memory (STM) components of the learning system. The weight vectors encoding the prediction and action associations then constitute the long-term memory (LTM) components of the system. There are no convergence proofs for actor/critic methods for delayed reward tasks because of the problem of analysing the interaction of two concurrent learning systems. However, successful learning has been demonstrated empirically on several difficult tasks [3, 11, 167], which encourages the view that these training rules may share the desirable gradient learning properties of their simpler counterparts.
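As a concrete illustration, the following is a minimal sketch (my own tabular formulation, not the thesis code) of how updates of the form of equations 2.3.13 and 2.3.14 might be realised when contexts are discrete states:

```python
import numpy as np

def actor_critic_step(V, W, trace_v, trace_w, s, a, r, s_next,
                      alpha_v=0.1, alpha_w=0.1, gamma=0.95, decay=0.8):
    """One tabular actor/critic update using recency-weighted traces.

    V       : state value predictions (critic LTM), shape (n_states,)
    W       : action preferences (actor LTM), shape (n_states, n_actions)
    trace_v : activity trace over states (critic STM)
    trace_w : eligibility trace over state/action pairs (actor STM)
    """
    # TD error: r(t+1) + gamma * V(t+1) - V(t)
    td_error = r + gamma * V[s_next] - V[s]

    # Decay the old traces, then mark the current context and chosen action.
    trace_v *= decay
    trace_v[s] += 1.0
    trace_w *= decay
    trace_w[s, a] += 1.0

    # The same TD error moves both the predictions and the action preferences.
    V += alpha_v * td_error * trace_v      # critic update
    W += alpha_w * td_error * trace_w      # actor update (cf. eq. 2.3.13)
    return td_error
```

Here the two traces play the short-term memory role described above, while the arrays V and W hold the long-term memory of the system.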
2.4 Integrating reinforcement and model-based learning

Planning, world knowledge and search
The classical AI method of action selection is to form an explicit plan by searching an accurate internal representation of the environment appropriate to the current task (see, for example [32, 114]). However, any planning system is faced with a scaling problem. As the world becomes more complex, and as the system attempts to plan further ahead, the size of the search space expands at an alarming rate. In particular, problems arise as the system attempts to consider more of the available information about the world. With each additional variable another dimension is added to the input space, which can cause an exponential rise in the time and memory costs of search. Bellman [16] aptly described this as the "curse" of dimensionality. Dynamic Programming is as subject to these problems as any other search method. Incremental approximations to dynamic programming such as the TD learning methods attempt to circumvent forward planning by making appropriate use of past experience. Actions are chosen that in similar situations on repeated past occasions have proven to be successful. A given context triggers an action that is in effect the beginning of a 'compiled plan', summarising the best result from the history of past experiences down that branch of the search tree. Thus TD methods are a solution (of sorts) to the problem of acting in real time—there is no on-line search, and when the future is a little different from what was expected, there is often a 'compiled plan' available for that too. However, the problem of search does not go away. Instead, the size of the search space translates into the length of learning time required, and, when exploration is local (as in gradient learning methods), there is an increased likelihood of acquiring behaviours that are only locally optimal.

Optimal Learning, forward and world models

Real experience can be expensive to obtain—exploration can be a time-consuming, even dangerous, affair. Optimal learning, rather than learning of optimal behaviour19,
19 Watkins [177] gives a fuller discussion of this distinction.
is concerned with gaining all possible knowledge from each experience that can be used to maximise all future rewards and minimise future loss. Reinforcement learning methods are not optimal in this sense. They extract information from the temporal sequence of events that enables them to learn mappings of the following types:

stimulus → action (S → A) (actor)
stimulus → reward (S → R) (critic)
stimulus × action → reward (S×A → R) (Q learning)
Associative learning of this type is clearly dispositional: it encodes task-specific information and retains no knowledge of the underlying causal process. However, associative mappings obtained through model-based learning can clearly help in determining optimal behaviour. These can take the form of forward (causal) models, i.e.

stimulus × action → stimulus (S×A → S)
or world models encoding information about neighbouring or successive stimuli, i.e.

stimulus → stimulus (S → S)
Where the knowledge such mappings contain is independent of any reward contingencies they can be applied to any task defined over that environment. However, there can be substantial overhead in the extra computation and memory required to learn, store, and maintain models of the environment. Several methods of supplementing TD learning with different types of forward model have been proposed [41, 108, 164, 166] and are described further below.

Forward models

A mapping of the S×A → S type is effectively a model of the transition function used in dynamic programming. Sutton's [164] DYNA system uses such a model as an extension to Q learning: the agent uses its actual experience in the world both to learn evaluations for state/action pairs and to estimate the transition function. This allows on-line learning to be supplemented by learning during simulated experience. In other words, between taking actions in the real world and observing and learning
from their consequences, the agent performs actions in its 'imagination'. Using its current estimates of the evaluation and transition functions, it can then observe the 'imaginary' outcomes of these actions and learn from them accordingly. Sutton calls this process 'relaxation planning'—a large number of shallow searches, performed whenever the system has a 'free moment', will eventually approximate a full search of arbitrary depth. By carrying out this off-line search the system can propagate information about delayed rewards more rapidly. Its actual behaviour will therefore improve faster than by on-line learning alone. Moore [108] describes a related method for learning in tasks with continuous input spaces. To perform dynamic programming the continuous input space is partitioned or quantised into discrete regions and an optimal action and evaluation learned in each cell. The novel aspect of Moore's approach is to suggest heuristic methods for determining a suitable quantisation of the space that attempt to side-step the dimensionality problem. He proposes varying the resolution of the quantisation during learning, specifically, having a fine-grained quantisation in those parts of the state-space that are visited during an actual or simulated sequence of behaviour and a coarse granularity in the remainder. As the trajectory through the space changes over repeated trials the quantisation is then altered in accordance with the most recent behaviour.

World models

Sutton and Pinette [166] and Dayan [41] both propose learning S → S models. The essence of both approaches is to train a network to estimate, from the current context x(t), the discounted sum of future contexts
$$\sum_{k=1}^{\infty} \gamma^{k}\, x(t+k)\,.$$
One reason for learning this particular function is that a recursive error measure, similar to the TD error, can be used to adapt the network parameters. Having acquired such a mapping the resulting associations will reflect the topology of the task, which may differ from the topology of the input space. When motivated to achieve a specific goal, such a mapping may aid the learning system to distinguish situations in which different actions are required, and to recognise circumstances where similar behaviours can be applied.
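A rough sketch of how such a mapping could be trained with a one-step bootstrapped target (the matrix form and the names are my own; the networks discussed above are more general):

```python
import numpy as np

def update_expectation_model(M, x_t, x_next, alpha=0.1, gamma=0.9):
    """TD-style update of an S -> S expectation matrix M.

    x_t and x_next are context vectors (e.g. one-hot place codes).
    M @ x_t comes to approximate the discounted sum of future contexts,
    sum_k gamma^k x(t+k), via the recursion target = gamma*(x(t+1) + M x(t+1)).
    """
    prediction = M @ x_t
    target = gamma * (x_next + M @ x_next)           # bootstrapped estimate
    M += alpha * np.outer(target - prediction, x_t)  # recursive, TD-like error
    return M
```

Trained on sequences of experience, the rows of M acquire high entries for contexts that tend to follow one another, so the resulting code reflects the topology of the task rather than that of the raw input space.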
For instance, Dayan simulated a route-finding problem which involved moving within a 2-D bounded arena between start and goal positions. A barrier across part of the space obliged the agent to make detours when traversing between the separated regions. The agent was constrained to make small, local movements, hence the probability of co-occurrence of any two positions was generally a smooth function of the distance between them, except for the non-linearity introduced by the barrier. An S → S mapping was trained by allowing the agent to wander unrewarded through the space. Neighbouring positions thus acquired high expectations except where they lay either side of the barrier, where the expectation was low. The output of this mapping system was used as additional input to an actor/critic learning system that was reinforced for moving to a specific goal. The S → S associations aided the system in discriminating between positions on either side of the barrier. This allowed appropriate detour behaviour to be acquired more rapidly than when the system was trained without the model. One significant problem with this specific method for learning world models is that it is not independent of the behaviour of the system. If the S → S model in the route-finding task is updated while the agent is learning paths to a specific goal then the model will become biased to anticipate states that lie towards the target location. When the goal is subsequently moved the model will then be much less effective as an aid to learning. An additional problem with this mapping is that it is one-to-many, making it difficult to represent efficiently and giving it poor scaling properties.
2.5 Continuous input spaces and partial state knowledge
Progress in the theoretical understanding of delayed reward learning algorithms has largely depended on the assumptions of discrete states, and of a Markovian decision process. If either of these assumptions is relaxed then the proofs of convergence, noted above, no longer hold. Furthermore, it seems likely that strong theoretical results for tasks in which these restrictions do not apply will be difficult to obtain. One reason for this pessimism is that recent progress has depended on demonstrating an underlying equivalence with stochastic dynamic programming for which the same rigid assumptions are required. An alternative attack on these issues is of course an empirical one—to investigate problems in which the assumptions are relaxed and then observe the consequences.
This is a common approach in connectionist learning where the success of empirical studies has often inspired further theoretical advances. To be able to apply delayed reinforcement learning in tasks with continuous state spaces would clearly be of great value. Many interesting tasks in robot control, for example, are functions defined over real-valued inputs. It seems reasonable, given the success of supervised learning in this domain, to expect that delayed reinforcement learning will generalise to such tasks. This issue is one of the main focuses of the investigations in this thesis. Relaxing the assumption of Markovian state would give, perhaps, even greater gain. The assertion that context input should have the Markov property constitutes an extremely strict demand on a learning system. It seems likely that in many realistic task environments the underlying Markovian state-space will be so vast that dynamic programming in either its full or its incremental forms will not be viable. It is clear, however, that for many tasks performed in complex, dynamic environments, learning can occur perfectly well in the absence of full state information. This is because the data required to predict the next state is almost always a super-set of that needed to distinguish between the states for which different actions are required. The latter discrimination is really all that an adaptive agent needs to make. Consider, for instance, a hypothetical environment in which the Markovian state information is encoded in an N-bit vector. Let us assume that all N bits are required in order to predict the next state. It is clear that a binary-decision task could be defined for this environment in which the optimal output is based on the value at only a single bit position in the vector. An agent who observed the value of this bit and no other could then perform as well as an agent who observed the entire state description. Furthermore, this single-minded operator, who observes only the task-relevant elements of the state information, has a potentially huge advantage: the size of its search-space (2 contexts) is reduced enormously from that of the full Markovian task ($2^N$ contexts). The crucial problem, clearly, is finding the right variables to look at!20
20 This task has inspired research in reinforcement learning on perceptual aliasing—distinguishing states that have identical codings but require different actions (this issue is considered further in chapter four).
It is the above insight that has motivated the emphasis in Animat AI on reactive systems that detect and exploit only the minimal number of key environmental variables as opposed to attempting to construct a full world model. It is therefore to be strongly hoped that delayed reinforcement learning will generalise to tasks with only partial state information. Further, we might hope that it will degrade gracefully where this information is not fully sufficient to determine the correct input-output mapping. If these expectations are not met then these methods will not be free from the explosive consequences of dimensionality that make dynamic programming an interesting but largely inapplicable tool.
Conclusion

Reinforcement learning methods provide powerful mechanisms for learning in circumstances of truly minimal feedback from the environment. The research reviewed here shows that these learning systems can be viewed as climbing the gradient in the expected reinforcement to a locally maximal position. The use of a secondary system that predicts the expected future reward encourages successful learning because it gives better feedback about the direction of this uphill gradient. Many issues remain to be resolved. Success in reinforcement learning is largely dependent on effective exploration behaviour. Chapters three and six of this thesis are in part concerned with this issue. The learning systems described here have all been given as linear functions in an unspecified representation of the input. However, for continuous task spaces, finding an appropriate coding is clearly critical to the success of the learning endeavour. The question of how suitable representations can be chosen or learned will be taken up in chapters four and five.
Chapter 3
Exploration

Summary

Adaptive behaviour involves a trade-off between exploitation and exploration. At each decision point there is a choice between selecting actions for which the expected rewards are relatively well known and trying out other actions whose outcomes are less certain. More successful behaviours can only be found by attempting unknown actions, but the more likely short-term consequence of exploration is lower reward than would otherwise be achieved. This chapter discusses methods for determining effective exploration behaviour. It primarily concerns the indirect effect on exploration of the evaluation function. The analysis given here shows that if the initial evaluation is optimistic relative to available rewards then an effective search of the state-space will arise that may prevent convergence on sub-optimal behaviours. The chapter concludes with a brief summary of direct methods for adapting exploration behaviour.
3.1 Exploration and Expectation
Chapter two introduced learning rules for immediate reinforcement tasks of the form
!wi (t) = " [r(t + 1) # b(t)] $wi (t) . When there are a finite number of actions, altering the value of the reinforcement baseline b has an interesting effect on the pattern of exploration behaviour. Consider the task shown in figure 3.1 which could be viewed as a four armed maze.
Figure 3.1: A simple four-choice reinforcement learning problem; the four arms carry rewards $r_N$, $r_E$, $r_S$ and $r_W$.

Assume that for $i \in \{N, S, E, W\}$ the preference for choosing each arm is given by a parameter $w_i$ and the reward for selecting each arm by $r_i$; also let the action on any trial be to choose the arm for which
$$w_i + \eta_i$$
(3.1)
is highest, where $\eta_i$ is a random number drawn from a Gaussian distribution with a fixed standard deviation. The eligibility $e_i(t)$ is equal to one for the chosen arm and zero for all others.
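The baseline's effect can be seen in a small simulation of this task (the reward values, learning rate and noise level below are my own illustrative choices); the selection frequencies it produces match the result derived next:

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = {"N": 0.2, "E": 0.4, "S": 0.1, "W": 0.3}    # assumed arm rewards r_i
arms = list(rewards)
b = 0.5                                               # baseline above every r_i
w = np.zeros(4)                                       # action preferences w_i
alpha, noise_sd = 0.1, 0.01
counts = np.zeros(4)

for trial in range(5000):
    i = int(np.argmax(w + rng.normal(0.0, noise_sd, 4)))   # choice rule (3.1)
    counts[i] += 1
    w[i] += alpha * (rewards[arms[i]] - b)   # chosen arm punished by shortfall

print(dict(zip(arms, counts / counts.sum())))  # biased alternation frequencies
```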
From inspecting the learning rule it is clear that an action will be punished whenever the reinforcement achieved is less than the baseline b. Consider the case where the reward in each arm is zero but the baseline is some small positive value. It is easy to see that the preferred action at the start of each trial (i.e. the one with the least negative weighting) will be the one which has been attempted on fewest previous occasions. The preference weights effectively encode a search strategy based on action frequency. It is tempting to call this strategy spontaneous alternation (after the exploration behaviour observed in rodents21) since a given action is unlikely to be retried until all other alternatives have been attempted an equal number of times. If non-zero rewards are available in any of the maze arms, but the baseline is still higher than the maximum reward, then the behaviour of the system will follow an alternation pattern in which the frequency $p_i$ with which each arm is selected22 is
$$p_i = \frac{(r_i - b)^{-1}}{\sum_j (r_j - b)^{-1}}\,.$$
In other words, the alternation behaviour is biased to produce actions with higher reward levels more frequently. With a fixed baseline, the learning system will never converge on actions that achieve only lower levels of reward. Consequently, if the maximum reward value $r^*$ is known, setting $b = r^*$ will ensure that the system will not cease exploring unless an optimal behaviour is found. In Sutton's reinforcement comparison scheme, the baseline is replaced by the prediction of reinforcement V. From the above discussion it is clear that there is a greater likelihood of achieving optimal
21 Spontaneous alternation (e.g. [43]) is usually studied in Y or T mazes. Over two successive trials, rodents and other animals are observed to select the alternate arm on the second trial in approximately 80% of tests. Albeit that there may be a superficial likeness, the artificial spontaneous alternation described here is not intended as a psychological model—it seems probable that in most animals this phenomenon is due to representational learning.
22 If T is the total number of trials and $T_i$ is the number of trials in which arm i was selected, then as $T \to \infty$, $(r_i - b)T_i \to (r_j - b)T_j$ for any pair i, j. Therefore $T_i / T_j \to (r_i - b)^{-1} / (r_j - b)^{-1}$, hence the frequency with which arm i is chosen is $p_i = T_i / T = (r_i - b)^{-1} \big/ \sum_j (r_j - b)^{-1}$.
behaviour (and avoiding convergence on poor actions) if the initial prediction $V_0 \geq r^*$. Alternation behaviour will occur in a similar manner until $V \approx r^*$; thereafter the optimal action will be rewarded while all others continue to be punished. For associative learning, the expectation V(x) associated with a context x is described here as optimistic if $V(x) \geq r^*(x)$ (where $r^*(x)$ is the maximum possible reward for that context) and as pessimistic otherwise. In immediate reinforcement tasks, setting the initial expectation $V_0(x) = r^*(x)$ will give a greater likelihood of finding better actions than for $V_0(x) < r^*(x)$; it should also give faster learning than for $V_0(x) > r^*(x)$. When $r^*(x)$ is not known, an optimistic guess for $V_0(x)$ will give slower learning than a pessimistic guess but also a better chance of finding the best actions. A similar argument applies to delayed reinforcement tasks. In this case, an expectation is considered optimistic if $V(x) \geq R^*(x)$, where $R^*(x)$ is the maximum possible return. If the initial expectation $V_0$ is the same in all states and is globally optimistic then a form of spontaneous alternation will arise. While the predictions are over-valued the TD error will on average be negative and actions will generally be punished. However, transitions to states that have been visited less frequently will be punished less. The selection mechanism therefore favours those actions that have been attempted least frequently and that lead to the least visited states. When the expected return varies between states the alternation pattern should also be biased towards actions and states with higher levels of return. Hence, for an optimistic system, initial exploration is a function of action frequency, state frequency and estimated returns. This results in behaviour that traverses the state-space in a near systematic manner until expectation is reduced to match the true level of available reward. Sutton [162] performed several experiments investigating the effect on learning of reinforcements that occur after varying lengths of delay. He found that the learning system tends to adapt to maximise rewards that occur sooner rather than later. This arises because secondary reinforcement from more immediate rewards biases action selection before signals from later rewards have been backed-up sufficiently to have any influence. Clearly this problem is not overcome by altering rates of trace decay or learning since these parameters affect the rate of propagation of all rewards equally. Providing the learning system with an
optimistic initial expectation can, however, increase the likelihood of learning optimal behaviour. While the expectation is optimistic, action learning is postponed in favour of spontaneous alternation. Chains of secondary reinforcement only begin to form once the expectation falls below the level of available reward. Rewards of greater value will obtain a head start in this backing-up process, increasing the likelihood of learning appropriate, optimal actions. In the following section this effect of the initial expectation is demonstrated for learning in a simple maze-like task.

A maze learning task

Figure 3.2 shows a maze learning problem represented as a grid in which the cells of the grid correspond to intersections and the edges between cells to paths that connect at these intersections. In each cell in the rectangular grid shown there are therefore up to four paths leading to neighbouring places.
Figure 3.2: A maze learning task with 6x6 intersections. The agent (A) and goal (G) are in opposite corners and there are four 'hazard' areas (H).

Behaviour is modelled in discrete time intervals where at each time-step the agent makes a transition from one cell to a neighbour. A version of the actor/critic architecture is used in which each cell is given a unique, discrete encoding. The evaluation for a cell is encoded by a distinct parameter, and, as in the four-arm maze (figure 3.1), there is a separate weight for each action in each cell. The
action in any specific cell is chosen by equation 3.1. Further details of the algorithm are given in Appendix A. For the task considered here, one cell of the grid is assigned to be the starting position of the agent and a second cell is assigned to be the goal position where positive reward (+1) is available. Certain cells contain hazards where a negative reinforcement (-1) is given for entering the cell; in all other non-goal cells the reward is zero. Note that with this reward schedule it is the effect of the discounted time horizon (that delayed rewards are valued less) that encourages the agent to find direct paths. It is also possible that the learning system will fail to learn any route to the goal. This arises if a locally optimal, 'dithering' policy is found that involves swapping back and forth between adjacent non-hazard cells to avoid approaching punishing areas. The maximum return $R^*(x)$ for any cell is $\gamma^{s-1}$, where s is the minimum number of steps to the goal; hence, for all cells, $0 < R^*(x) \leq 1$. An initial expectation of zero is therefore pessimistic in all cells and an expectation of +1 optimistic. The effect on learning of these different expectations was examined in the following experiment. The agent was run on repeated trials with the maze configuration shown above. Each trial ended either when the agent reached the goal or after a thousand transitions had occurred. A run was terminated after 100 trials or after two successive trials in which the agent failed to reach the goal in the allotted time. Suitable global parameters for the learning system (learning rates, decay rates and discount factor) were determined by testing the system in a hazard-free maze. Out of ten learning runs starting from the pessimistic initial expectation, $V_0 = 0$, the agent failed to learn a path to the goal on all occasions as a result of learning a procrastination policy. Figure 3.3 shows a typical example of the associations acquired.
Figure 3.3: Action preferences and cell evaluations after a series of trials learning from initially pessimistic expectation. The arrows in the left diagram indicate the preferred direction of movement in each cell. The heights of the columns in the right diagram show the predicted return for each cell (white +ve, black -ve).
In this example the agent has gradually confined itself to the top left-hand corner—all preferences near the start cell direct the agent away from the hazards and back toward the initial position. A 'wall' of negative expectation prevents the agent from gaining any new experience near the goal. In contrast to the poor performance of this pessimistic learner, given an optimistic initial expectation23, $V_0 = +1$, successful learning of a direct path was achieved on all ten runs. Figure 3.4 illustrates the policy and evaluation after one particular run. On this occasion the prediction in any cell never fell below zero expectation, so convergence on a dithering policy could not occur.
23 The optimistic expectation is applied to all cells except the goal, which is given an initial value of zero. This does not imply any prior knowledge but merely indicates that once the goal is achieved the anticipation of it ceases. Experiments with a continuous version of the problem in which the agent moves from the goal cell back to its starting position (and updates the evaluation of the goal according to its experience in the next trial) support the conclusions reported here.
Figure 3.4: Action preferences and cell evaluations after a series of trials of learning from an initially optimistic expectation; the agent has learned a direct path to the goal (along the top row then down the left column).

To confirm that the exploration behaviour of an optimistic system is better than chance a simple experiment was performed using a 'dry' 6x6 maze (i.e. one without rewards of any kind). In each cell the action with the highest weighting was always selected (using random noise as a tie breaker). A good measure of how systematically the maze is traversed is the variance in the mean number of times each possible action is chosen out of a series of n actions. For $n = 120$, random action selection (or $V_0 = 0$) gave an average variance that was more than five times higher24 than optimistic exploration behaviour ($V_0 = +1$). In other words, the initial behaviour of an optimistic system traverses the maze in a manner that is considerably more systematic than a random walk. The effect of initial expectation is further demonstrated in figure 3.5. This graph shows the average number of transitions in the first ten trials of learning starting from different initial expectations25. Behaviour in the maze with hazards is contrasted with behaviour in a hazard-free maze. In the latter case the number of transitions gradually rises as the value of the initial expectation is increased (from zero through to one). This is entirely due to the alternation behaviour induced by
24 Since there are 120 actions in total, choosing n=120 makes the mean number of choices 1. Over ten trials, random selection gave an average variance in this mean of 1.42; for optimistic search the variance was only 0.26.
25 The averages were calculated over ten runs, with only those runs which were ultimately successful in learning a path to the goal being considered.
the more optimistic learning systems. In the hazardous maze, however, the trend is reversed. In this case, systems with lower initial expectations take longer to learn the task. More time is spent in unfruitful dithering behaviour and less in effective exploration.
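A compressed sketch of the kind of experiment reported above; the grid layout, hazard placement and parameter values here are guesses of my own rather than the configuration actually used:

```python
import numpy as np

def run_maze(v0, episodes=100, size=6, gamma=0.9, alpha=0.3, beta=0.3,
             noise_sd=0.01, seed=0):
    """Tabular actor/critic on a size x size grid, with initial evaluation v0.
    Returns True if the agent reaches the goal on the final trial."""
    rng = np.random.default_rng(seed)
    start, goal = (0, 0), (size - 1, size - 1)
    hazards = {(2, 2), (2, 3), (3, 2), (3, 3)}          # assumed hazard cells
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]

    V = np.full((size, size), float(v0))
    V[goal] = 0.0                                       # goal starts at zero
    W = np.zeros((size, size, 4))                       # action preferences
    reached = False
    for _ in range(episodes):
        s, reached = start, False
        for _ in range(1000):
            a = int(np.argmax(W[s] + rng.normal(0.0, noise_sd, 4)))
            nxt = (min(max(s[0] + moves[a][0], 0), size - 1),
                   min(max(s[1] + moves[a][1], 0), size - 1))
            r = 1.0 if nxt == goal else (-1.0 if nxt in hazards else 0.0)
            td = r + gamma * V[nxt] - V[s]              # TD error
            V[s] += beta * td                           # critic update
            W[s][a] += alpha * td                       # actor update
            s = nxt
            if s == goal:
                reached = True
                break
    return reached

print("pessimistic (V0 = 0):", run_maze(0.0))
print("optimistic  (V0 = 1):", run_maze(1.0))
```

With an optimistic start the predictions stay above zero during early exploration, so the dithering policies described above are much less likely to take hold.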
[Figure 3.5 plots the average number of steps per trial against the initial value, with separate curves for the maze with hazards and the hazard-free maze.]
Figure 3.5: Speed of learning over the first ten trials for initial value functions 0.0, 0.3, 0.6, 0.8 and 1.0. (For an initial value of zero in the maze with hazards all ten runs failed to find a successful route to the goal.)

Clearly, it is the relative values of the expectation and the maximum return that determine the extent to which the learning system is optimistic or pessimistic. Therefore an alternative way to modify exploration behaviour is to change the rewards and not the expectation. For the maze task described above, equivalent learning behaviour is generated if there is zero reward at the goal and negative rewards for transitions between all non-goal cells26.
26 In the task described above there is a reward $r_g$ at the goal and zero reward in all other non-hazard cells. Consider an identical task but with zero reward at the goal and a negative 'cost' reward $r_c$ for every transition between non-goal cells (Barto et al. [13] investigated a route-finding task of this nature). Let the maximum return in each cell for the first task be $R^G(x)$ and in the second task $R^C(x)$. It can easily be shown that, for a given discount factor $\gamma$, $R^C(x) = R^G(x) - 1$ iff $r_c = (\gamma - 1)r_g$. Learning behaviour will therefore be the same under both circumstances if the initial expectation $V_0^C = V_0^G - 1$.
As learning proceeds the effect of an optimistic bias in initial predictions gradually diminishes. To obtain a similar, but permanent, influence on exploration Williams [184] has therefore suggested adding a small negative bias to the error signal. This will provoke continuous exploration since any action which does not lead to a higher than expected level of reinforcement is penalised. The system will never actually converge on a fixed policy but better actions will be preferred.

Relationship to animal learning

The indirect exploration induced by the discrepancy between predictions and rewards seems to fit naturally with many of the characteristics of conditioning in animal learning. In particular, a large number of theorists have proposed that it is the 'surprise' generated by a stimulus that provokes exploration more than the status of the outcome as a positive or negative reinforcer (see, for instance, Lieberman [85]). The pessimistic-optimistic distinction seems also to have a parallel in animal learning which is shown by experiments on learned helplessness [123, 149]. This research demonstrates that animals fail to learn avoidance of aversive stimuli if they are pre-trained in situations where the punishing outcome is uncontrollable. In terms of the simulations described above, it could be argued that the induced negative expectation brought about by the pre-training reduces the discrepancy between the predicted outcome and the (aversive) reward and so removes the incentive for exploration behaviour.
3.2 Direct Exploration Methods
In addition to the indirect exploration strategies described above a number of methods have been proposed for directly encouraging efficient exploration. This section briefly reviews some of these techniques and, in this context, describes an exploration method due to Williams [184] that is employed in the simulations described in later chapters. The simplest procedure for controlling exploration is to start out with a high level of randomness in the action selection mechanism and reduce this toward zero as learning proceeds. An annealing process of this sort can be applied to the task as a whole, or for more efficient exploration, the probability function from which
actions are selected can be varied for different local regions of the state space. The following considers several such methods for tailoring local exploration; these fall into two general categories that I will call uncertainty and performance measures. Uncertainty measures attempt to estimate the accuracy of the learning system's current knowledge. Suitable heuristics are to attach more uncertainty to contexts that have been observed less frequently [108], or less recently [164], or for which recent estimates have shown large errors [108, 111, 145, 169]. Exploration can be made a direct function of uncertainty by making the probability of selecting an action depend on both the action preference and the uncertainty measure. The effect of biasing exploration in this manner is local, that is, it cannot direct the system to explore in a distant region of the state-space. An alternative approach is to apply the uncertainty heuristic indirectly by adding some positive function of uncertainty to the primary reinforcement (e.g. [164]). This mechanism works by redefining optimal behaviours as those that both maximise reward and minimise uncertainty. This method can produce a form of non-local exploration bias, since uncertainty will be propagated by the backward-chaining of prediction estimates, eventually having some effect on the decisions made in distant states. Performance measures estimate the success of the system's behaviour as compared with either a priori knowledge, or local estimates of available rewards. Gullapalli [56] describes a method for immediate reinforcement tasks in which the performance measure is a function of the difference between the predicted reward and the maximum possible reward $r^*(x)$ (which is assumed to be known). The amount of exploration, which varies from zero up to some maximum level, is in direct proportion to the size of this disparity. Williams [184] has proposed a method for adapting exploration behaviour that is suitable for learning real-valued actions when $r^*(x)$ is not known to the learning system. He suggests allowing the degree of search to adapt according to the variance in the expected reward around the current mean action. In other words, if actions close to the current mean are achieving lower rewards than actions further away, then the amount of noise in the decision rule should be increased to allow more distant actions to be tried more frequently. If, on the other hand, actions close to the mean are more successful than those further away, then noise should be reduced, so that the more distant (less successful) actions are sampled less often. A Gaussian action unit that performs adaptive exploration of this nature is illustrated below and described in detail in Appendix D. Here both the
mean and the standard deviation of a Gaussian pdf are learned. The mean, of course, is trained so as to move toward actions that are rewarded and away from those that are punished; however, the standard deviation is also adapted. The width of the probability function is increased when the mean is in a region of low reward (or surrounded by regions of high reward) and reduced when the mean is close to a local peak in the reinforcement landscape (see figure 3.6). This learning procedure results in an automatic annealing process as the variance of the Gaussian will shrink as the mean behaviour converges to the local maximum. However, the width of the Gaussian can also expand if the mean is locally sub-optimal, allowing for an increase in exploratory behaviour at the start of learning or if there are changes in the environment or in the availability of reward. It is interesting to contrast this annealing method with Gullapalli's approach. In the latter the aim of adaptive exploration is solely to enable optimal actions to be found; consequently as performance improves the noise in the decision rule is reduced to zero. In Williams' method, however, the goal of learning is to adapt the variance in the acquired actions to reflect the local slope of the expected return. The final width of the Gaussian should therefore depend on whether the local peak in this function is narrow or flat on top. The resulting variations in behaviour can be viewed not so much as noise but as acquired versatility. An application of Williams' method to a difficult delayed reinforcement task is described in chapter six where the value of this learned versatility can be clearly seen.
Figure 3.6: Learning the standard deviation of the action probability function. The figures indicate the direction of change in the standard deviation ($\sigma$) for actions (y) sampled less ($(y - \mu)^2 < \sigma^2$) or more ($(y - \mu)^2 > \sigma^2$) than one standard deviation from the mean, for different values of the reinforcement error signal (e). The bottom left figure indicates that the distribution will widen when the mean is in a local trough in the reinforcement landscape; the bottom right, that it will narrow over a local peak.
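A minimal sketch of a Gaussian action unit of this kind; the update rules below are the standard eligibility forms for a Gaussian mean and standard deviation, and may differ in detail from those given in Appendix D:

```python
import numpy as np

class GaussianActionUnit:
    """Real-valued action unit that adapts both the mean and the standard
    deviation of its action distribution from a reinforcement error signal e."""

    def __init__(self, mu=0.0, sigma=1.0, alpha_mu=0.05, alpha_sigma=0.05,
                 sigma_min=1e-3, seed=0):
        self.mu, self.sigma = mu, sigma
        self.alpha_mu, self.alpha_sigma = alpha_mu, alpha_sigma
        self.sigma_min = sigma_min
        self.rng = np.random.default_rng(seed)
        self.y = mu

    def act(self):
        self.y = self.rng.normal(self.mu, self.sigma)
        return self.y

    def learn(self, e):
        """e is the reinforcement error (reward minus its prediction)."""
        dev = self.y - self.mu
        # Move the mean toward rewarded actions and away from punished ones.
        self.mu += self.alpha_mu * e * dev / self.sigma ** 2
        # Widen sigma when out-lying actions ((y - mu)^2 > sigma^2) do better
        # than expected (e > 0), and narrow it in the opposite cases,
        # matching the four cases of figure 3.6.
        self.sigma += self.alpha_sigma * e * (dev ** 2 - self.sigma ** 2) / self.sigma ** 3
        self.sigma = max(self.sigma, self.sigma_min)
```

The width shrinks automatically as the mean settles over a local peak, but can grow again if the reward landscape changes.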
Direct exploration and model-based learning

Mechanisms that adapt exploration behaviour according to some performance measure can be seen as rather subtle forms of reinforcement (i.e. non model-based) learning. Here the acquired associations are clearly task-specific, although they specify the degree of variation in response as well as the response itself. Uncertainty measures, on the other hand, are more obviously 'knowledge about knowledge', implying some further degree of sophistication in the learning system. However, whether this knowledge should be characterised as model-based is arguable. To a considerable extent this may depend on how and where the knowledge is acquired and used.
For instance, a learning system could estimate the frequency or recency of different input patterns, or the size of their associated errors, in the context of learning a specific task. This measure could then be added to the internal reward signal (as described above) so indirectly biasing exploration. These heuristics will be available even in situations where the agent has only partial state knowledge. In contrast, where causal or world models are constructed, the same uncertainty estimates with respect to a given goal could be acquired within the context of a task-independent framework [108, 164]. Exploration might then be determined more directly and in a non-local fashion. That is, rather than simply biasing exploration toward under-explored contexts, the learning system could explicitly identify and move toward regions of the state-space where knowledge is known to be lacking. Clearly such strategies depend on full model-based knowledge of the task, although the motivating force for the exploration measure is still task-specific. Finally, uncertainty measures could be determined with respect to the causal or world models themselves; in other words, there could be task-independent knowledge of uncertainty, something perhaps more like true curiosity, which could then drive exploration behaviour.
Conclusion

This chapter has considered both direct and indirect methods for controlling the extent of search carried out by a reinforcement learning system. In particular, the value of the initial expectation (relative to the maximum available reward) has been shown to have an indirect effect on exploration behaviour and consequently on the likelihood of finding globally optimal solutions to the task in hand.
Chapter 4
Input Coding for Reinforcement Learning

Summary

The reinforcement learning methods described so far can be applied to any task in which the correct outputs (actions and predictions) can be learned as linear functions of the recoded input patterns. However, the nature of this recoding is obviously critical to the form and speed of learning. Three general approaches can be taken to the problem of choosing a suitable basis for recoding a continuous input space: fixed quantisation methods; unsupervised learning methods for adaptively generating an input coding; and adaptive methods that modify the input coding according to the reinforcement received. This chapter considers the advantages and drawbacks of various recoding techniques and describes a multilayer learning architecture in which a recoding layer of Gaussian basis function units with adaptive receptive fields is trained by generalised gradient descent to maximise the expected reinforcement.
4.1 Introduction
In chapter two multilayer neural networks were considered as function approximators. The lower layers of a network were viewed as providing a recoding of the input pattern, or recoding vector, which acts as the input to the upper network layer where desired outputs (actions and predictions) are learned as linear functions. This chapter addresses some of the issues concerned with selecting an appropriate architecture for the recoding layer(s). The discussion divides into three parts: methods that provide fixed or a priori codings; unsupervised or competitive learning methods for adaptively generating codings based on characteristics of the input; and methods that use the reinforcement feedback to adaptively improve the initial coding system. As discussed in chapter two, for any given encoding of an input space, a system with a single layer of adjustable weights can only learn a limited set of output functions [102]. Unfortunately, there is no simple way of ensuring that an arbitrary output function can be learned short of expanding the set of input patterns into a high-dimensional space in which they are all orthogonal to each other (for instance, by assigning a different coding unit to each pattern). This option is clearly ruled out for tasks defined over continuous input spaces as the set of possible input vectors is infinite. However, even for tasks defined over a finite set such a solution is undesirable because it allows no generalisation to occur between different inputs that require similar outputs. One very general assumption that is often made in choosing a representation is that the input/output function will be locally smooth. If this is true, it follows that generalisation from a learned input/output pairing will be worthwhile to nearby positions in the input space but less so to distant positions. The recoding methods discussed in this chapter exploit this assumption by mapping similar inputs to highly-correlated codings and dissimilar inputs to near-orthogonal codings. This allows local generalisation to occur whilst reducing crosstalk (interference between similar patterns requiring different outputs). To provide such a coding the input space is
mapped to a set of recoding units, or local experts1, each with a limited, local receptive field. Each element of the recoding vector then corresponds to the activation of one such unit. In selecting a good local representation for a learning task there are clearly two opposing requirements. The first concerns making the search problem tractable by limiting the size of recoded space. It is desirable to make the number of local experts small enough and their receptive fields large enough that sufficient experience can be gained by each expert over a reasonably short learning period. The more units there are the longer learning will take. The second requirement is that of making the desired mapping learnable. The recoding must be such that the non-linearities in the input-output mapping can be adequately described by the single layer of output weights. However, it is the nature of the task and not the nature of the input that determines where the significant changes in the input-output mapping occur. Therefore, in the absence of specific task knowledge, any a priori division of the space may result in some input codings that obscure important distinctions between input patterns. This problem—sometimes called perceptual aliasing—will here be described as the ambiguity problem as such codings are ambiguous with regard to discriminating contexts for which different outputs are required. In general the likelihood of ambiguous codings will be reduced by adding more coding units. Thus there is a direct trade-off between creating an adequate, unambiguous code and keeping the search tractable.
4.2 Fixed and unsupervised coding methods

Boundary methods
The simplest form of a priori coding is formed by a division of the continuous input space into bounded regions within each of which every input point is mapped to the same coded representation. A recoding of this form is sometimes called a hard
1 The use of this term follows that of Nowlan [117] and Jacobs [66], though no strict definition is intended here.
quantisation. Figure 4.1 illustrates a hard quantisation of an input space along a single dimension.
Figure 4.1: Quantisation of a continuous input variable.

A hard quantisation can be defined by a set of M cells denoted by $C = \{c_1, c_2, \ldots, c_M\}$, where for each cell $c_i$ the range $\{x_i^{min}, x_i^{max}\}$ is specified, and where the ranges of all cells cover the space and are non-overlapping. The current input pattern x(t) then maps to the single cell $c^*(t) \in C$ whose boundaries encompass this position in the input space. The elements of the recoding vector are in a one-to-one mapping to the quantisation cells. If the index of the winning cell is given by $i^*$ this gives $\phi(t)$ as a unit vector of size M where
$$\phi_i(t) = \delta(i, i^*) = \begin{cases} 1 & \text{iff } i = i^* \\ 0 & \text{otherwise} \end{cases}$$
(4.1.1)
(The Kronecker delta $\delta(i, j) = 1$ for $i = j$, $0$ for $i \neq j$ will also be used to indicate functions of this type). With a quantisation of this sort non-linearities that occur at the boundaries between cells can be acquired easily. However, this is achieved at the price of having no generalisation or transfer of learning between adjoining cells. For high-dimensional input spaces this straightforward approach of dividing up the Euclidean space using a fixed resolution grid rapidly falls prey to Bellman's curse. Not only is a large amount of space required to store the adjustable parameters (much of it possibly unused), but learning occurs extremely slowly as each cell is visited very infrequently.

Coarse coding

One way to reduce the amount of storage required is to vary the resolution of the quantisation cells. For instance, Barto et al. [11] (following Michie and Chambers [99]) achieve this by using a priori task knowledge about which regions of the state
space are most critical. An alternative approach is to use coarse-coding [61], or soft quantisation, methods where each input is mapped to a distribution over a subset of the recoding vector elements. This can reduce storage requirements and at the same time provide for a degree of local generalisation. A simple form of coarse coding is created by overlapping two or more hard codings or tilings as shown in figure 4.2.
Figure 4.2: Coarse-coding using offset tilings along a single dimension. One cell in each tiling is active.

If each of the T tilings is described by a set $C_j$ ($j = 1 \ldots T$) of quantisation cells, then the current input x(t) is mapped to a single cell $c_j^*(t)$ in each tiling and hence to a set $U(t) = \{c_1^*(t), c_2^*(t), \ldots, c_T^*(t)\}$ of T cells overall. If, again, there is a one-to-one mapping of cells to the elements $\phi_i(t)$ then the coding vector is given by

$$\phi_i(t) = \begin{cases} \frac{1}{T} & \text{iff } c_i \in U(t) \\ 0 & \text{otherwise} \end{cases}$$
(4.1.2)

Note that here each element is 'normalised' to the value $\frac{1}{T}$ so that the sum of activation across all the elements of the vector is unity.
A soft quantisation of this type can give a reasonably good approximation wherever the input/output mapping varies in a smooth fashion. However, any sharp nonlinearities in the function surface will necessarily be blurred as a result of taking the average of the values associated with multiple cells each covering a relatively large area. A coarse-coding of this type called the CMAC (“Cerebellar Model Articulation Computer”) was proposed by Albus [2] as a model of information processing in the mammalian cerebellum. The use of CMACs to recode the input to reinforcement
learning systems has been described by Watkins [177]; they are also employed in the learning system described in Chapter Six. The precise definition of a CMAC differs slightly from the method given above, in that the cells of the CMAC are not necessarily mapped in a one-to-one fashion to the elements of the recoding vector. Rather, a hashing function can be used to create a pseudo-random many-to-one mapping of cells to recoding vector elements. This can reduce the number of adaptive parameters needed to encode an input space that is only sparsely sampled by the input data. Compared with a representation consisting of a single hyper-cuboid grid a CMAC appears to quantise a high-dimensional input space to a similar resolution (albeit coarsely) using substantially less storage. For example, figure 4.3 shows a CMAC in a two-dimensional space consisting of four tilings with 64 adjustable parameters; this gives a degree of discrimination similar to a single grid with 169 parameters.
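The following sketch shows the kind of coding such a scheme produces, with offset tilings and an optional hashed, many-to-one mapping of cells to parameters; the details are illustrative rather than Albus's or Watkins's exact formulation:

```python
import numpy as np

def cmac_code(x, n_tilings=4, cells_per_dim=4, lo=0.0, hi=1.0, hash_size=None):
    """Coarse-code a continuous input vector using offset tilings.

    The active cell of each tiling contributes 1/n_tilings to the code.
    If hash_size is given, cells are mapped pseudo-randomly (many-to-one)
    onto hash_size parameters instead of one parameter per cell."""
    x = np.asarray(x, dtype=float)
    n_dim = x.size
    width = (hi - lo) / cells_per_dim
    cells_per_tiling = cells_per_dim ** n_dim
    size = hash_size if hash_size else n_tilings * cells_per_tiling
    phi = np.zeros(size)
    for t in range(n_tilings):
        offset = (t / n_tilings) * width          # displace successive tilings
        idx = np.floor((x - lo + offset) / width).astype(int)
        idx = np.clip(idx, 0, cells_per_dim - 1)
        flat = int(np.ravel_multi_index(idx, (cells_per_dim,) * n_dim))
        cell = t * cells_per_tiling + flat
        if hash_size:
            cell = hash((t, flat)) % hash_size    # pseudo-random many-to-one map
        phi[cell] += 1.0 / n_tilings
    return phi

# 2-D example: four 4x4 tilings give 64 parameters, as in figure 4.3.
phi = cmac_code([0.37, 0.62])
```

An output is then computed as a weighted sum of the parameters selected by the active cells.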
Figure 4.3: A CMAC consisting of four 4x4 tilings compared with a single grid (13x13) of similar resolution.

At first sight it appears that the economy in storage further improves as the dimensionality of the space increases—if all tilings are identical and each has a uniform resolution P in each of N input dimensions, then the total number of adjustable parameters (without hashing) is $P^N T$. This gives a maximum resolution
similar2 to a single grid of size $(PT)^N$; in other words, there is a saving in memory requirement of order $T^{N-1}$. However, this saving in memory actually incurs a significant loss of coding accuracy as the nature of the interpolation performed by the system varies along different directions in the input space. Specifically, the interpolation is generated evenly along the axis on which the tilings are displaced, but unevenly in other directions (in particular in directions orthogonal to this axis). Figure 4.4 illustrates this for the 2-dimensional CMAC shown above. The figure shows the sixteen high-resolution regions within a single cell of the foremost tiling. For each of these regions, the figure shows the low resolution cells in each of the tilings that contribute to coding that area. There is a clear difference in the way the interpolation occurs along the displacement axis (top-left to bottom-right) and orthogonal to it. This difference will be more pronounced when coding higher dimensional spaces.
2 This is strictly true only if the input space is a closed surface (i.e. an N-dimensional torus); if the input space is open (i.e. has boundaries) then the resolution is lower at the edges of the CMAC because the tilings are displaced relative to each other. If the space is open in all dimensions (as in figure 4.3) then the CMAC will have maximum resolution only over a grid of size $(PT + 1 - T)^N$.
Figure 4.4: Uneven interpolation in a two dimensional CMAC. The distribution of low resolution cells contributing to any single high-resolution area (the black squares) is more balanced along the displacement axis (top-left to bottom-right) than orthogonal to it (bottom-left to top-right).
Nearest Neighbour methods

Rather than defining the boundaries of the local regions it is a common practice to partition the input space by defining the centres of the receptive fields of the set of recoding units. This results in nearest-neighbour algorithms for encoding input patterns [70, 105]. Given a set of $i = 1 \ldots M$ units each centred at position $c_i$ in the input space, a hard, winner-takes-all coding is obtained by finding the centre that is closest to the current input x(t) according to some distance metric. For instance, if the Euclidean metric is chosen, then the winning node is the one for which
$$d_i = d(x(t), c_i) = \sqrt{\sum_{j=1}^{N} \left( x_j(t) - c_{ij} \right)^2}$$
(4.1.3)
is minimised. If the index of the winning node is written $i^*$ then the recoding vector is given by the unit vector
! i (t ) = " (i,i* )
(4.1.4)
This mechanism creates an implicit division of the space called a Voronoi tessellation3.

Radial basis functions

This nearest neighbour method can be extended to form a soft coding by computing for each unit a value $g_i$ which is some radially symmetric, non-linear function of the distance from the input point to the node centre, known as a radial basis function (RBF) (see [131] for a review). Although there are a number of possible choices for this function there are good reasons [127] for preferring the multi-dimensional Gaussian basis function (GBF). First, the 2-dimensional Gaussian has a natural interpretation as the 'receptive field' observed in biological neurons. Furthermore, it is the only radial basis function that is factorisable, in that a multi-dimensional Gaussian can be formed from the product of several lower dimensional Gaussians (this allows complex features to be built up by combining the outputs of two- or one-dimensional detectors). A radial Gaussian node has a spherical activation function of the form

$$g_i(t) = g(x(t), c_i, w) = \frac{1}{(2\pi)^{N/2} w^N} \exp\left( - \sum_{j=1}^{N} \frac{\left[ x_j(t) - c_{ij} \right]^2}{2w^2} \right)$$

(4.1.5)
where w denotes the width of the Gaussian distribution in all the input dimensions. It is convenient to use a normalised encoding which can be obtained by scaling the activation gi (t) of each unit according to the total activation of all the units. In other words, the recoding vector element for the ith unit is calculated by
3 See Kohonen [70] chapter five.
$$\phi_i(t) = \frac{g_i(t)}{\sum_{j=1}^{M} g_j(t)}\,.$$
(4.1.6)
Figure 4.5 illustrates a radial basis recoding of this type with equally spaced fixed centres and the Gaussian activation function.
Figure 4.5: A radial basis quantisation of an input variable. The coding of input x is distributed between the two closest nodes.
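A short sketch of the normalised Gaussian basis recoding of equations 4.1.5 and 4.1.6, here with a single shared width and fixed, equally spaced centres (the names are my own):

```python
import numpy as np

def gbf_code(x, centres, width):
    """Normalised activations of spherical Gaussian units.

    x       : input vector, shape (N,)
    centres : unit centres, shape (M, N)
    width   : common receptive-field width w
    """
    x = np.asarray(x, dtype=float)
    sq_dist = np.sum((centres - x) ** 2, axis=1)
    g = np.exp(-sq_dist / (2.0 * width ** 2))   # eq. 4.1.5 (the constant
    return g / g.sum()                          # factor cancels); eq. 4.1.6

centres = np.linspace(0.0, 1.0, 5).reshape(-1, 1)   # five units on one dimension
print(gbf_code([0.42], centres, width=0.25))
```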
Without task-specific knowledge selecting a set of fixed centres is a straight trade-off between the size of the search-space and the likelihood of generating ambiguous codings. For a task with a high-dimensional state-space learning can be slow and expensive in memory. It is therefore common practice, as we will see in the next section, to employ algorithms that position the nodes dynamically in response to the training data as this can create a coding that is both more compact and more directly suited to the task.

Unsupervised learning

Unsupervised learning methods encapsulate several useful heuristics for reducing either the bandwidth (dimensionality) of the input vector, or the number of units required to adequately encode the space. Most such algorithms operate by attempting to maximise the network's ability to reconstruct the input according to some particular criterion. This section discusses how adaptive basis functions can be used to learn the probability distribution of the data, perform appropriate rescaling, and learn the covariance of input patterns. To simplify the notation the dependence of the input and the node parameters on the time t is assumed hereafter.
Learning the distribution of the input data

Many authors have investigated what are commonly known as competitive learning methods (see [59] for a review), whereby a set of node centres are adapted according to a rule of the form

\Delta \mathbf{c}_i \propto \delta(i, i^*)\,(\mathbf{x} - \mathbf{c}_i)    (4.2.1)
where i^* is the winning (nearest neighbour) node. In other words, at each time-step the winning node is moved in the direction of the input vector4. The resulting network provides an effective winner-takes-all quantisation of an input space that may support supervised [105] or reinforcement learning (see next chapter). A similar learning rule for a soft competitive network of spherical Gaussian nodes5 is given by

\Delta \mathbf{c}_i \propto \rho_i\,(\mathbf{x} - \mathbf{c}_i)    (4.2.2)
In this rule the winner-only update is relaxed to allow each node to move toward the input in proportion to its normalised activation (4.1.6). Nowlan [115] points out that this learning procedure approximates a maximum likelihood fit of a set of Gaussian nodes to the training data. To avoid the problem of specifying the number of nodes and their initial positions in the input space, new nodes can be generated as and when they are required. A suitable heuristic for this (used, for instance, by Shannon and Mayhew [150]) is to create a new node whenever the error in the reconstructed input is greater than a threshold ε. In other words, given an estimate x̂ of the input calculated by

\hat{\mathbf{x}} = \sum_{i=1}^{M} \rho_i\,\mathbf{c}_i    (4.2.3)

a new node will be generated with its centre at x whenever

\lVert \mathbf{x} - \hat{\mathbf{x}} \rVert > \epsilon .    (4.2.4)
4 Learning is usually subject to an annealing process whereby the learning rate is gradually reduced to zero over the duration of training.
5 All nodes have equal variance and equal prior probability.
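A minimal sketch of the soft competitive rule with node generation (equations 4.2.2-4.2.4) is given below; the width, learning rate, threshold and the artificial two-cluster data are all illustrative choices rather than settings taken from the thesis.

```python
import numpy as np

def soft_competitive_step(x, centres, w, lr, threshold):
    """One update with spherical Gaussian nodes: every centre moves toward the
    input in proportion to its normalised activation (4.2.2); a new node is
    created at x when the reconstruction error exceeds the threshold (4.2.3, 4.2.4)."""
    if len(centres) == 0:
        return [x.copy()]
    g = np.array([np.exp(-np.sum((x - c) ** 2) / (2.0 * w ** 2)) for c in centres])
    rho = g / np.sum(g)                                        # normalised activations
    x_hat = np.sum(rho[:, None] * np.array(centres), axis=0)   # reconstruction (4.2.3)
    if np.linalg.norm(x - x_hat) > threshold:                  # poor fit: add a node (4.2.4)
        centres.append(x.copy())
    else:
        for i in range(len(centres)):                          # soft competitive move (4.2.2)
            centres[i] = centres[i] + lr * rho[i] * (x - centres[i])
    return centres

rng = np.random.default_rng(0)
centres = []
for _ in range(2000):   # inputs drawn from two small clusters
    loc = [0.3, 0.7] if rng.random() < 0.5 else [0.8, 0.2]
    x = rng.normal(loc=loc, scale=0.05)
    centres = soft_competitive_step(x, centres, w=0.1, lr=0.05, threshold=0.3)
print(len(centres), [c.round(2) for c in centres])
```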
One of the advantages of such a scheme, compared to an a priori quantisation, is in coding the input of a task with a high-dimensional state-space which is only sparsely visited. The node generation scheme will substantially reduce the number of units required to encode the input patterns, since units will never be assigned to unvisited regions of the state space.

Rescaling the input

Often a network is given a task where there is a qualitative difference between different input dimensions. For instance, in control tasks, inputs often describe variables such as Cartesian co-ordinates, velocities, accelerations, joint angles, angular velocities etc. There is no sense in applying the same distance metric to such different kinds of measure. In order for a metric such as the Euclidean norm to be of any use here, an appropriate scaling of the input dimensions must be performed. Fortunately, it is possible to modify the algorithm for adaptive Gaussian nodes so that each unit learns a distance metric which carries out a local rescaling of the input vectors (Nowlan [116]). This is achieved by adding to each node an additional set of adaptive parameters that describes the width of the Gaussian in each of the input dimensions. The (non-radial) activation function is therefore given by

g_i = g(\mathbf{x}, \mathbf{c}_i, \mathbf{w}_i) = \frac{1}{(2\pi)^{N/2} \prod_{j} w_{ij}} \exp\left( -\sum_{j=1}^{N} \frac{(x_j - c_{ij})^2}{2 w_{ij}^2} \right)    (4.2.5)
The receptive field of such a node can be visualised as a multi-dimensional ellipse that is extended in the jth input dimension in proportion to the width w_{ij}. Suitable rules for adapting the parameters of the ith node are

\Delta c_{ij} \propto \rho_i \frac{x_j - c_{ij}}{w_{ij}^2} , and \Delta w_{ij} \propto \rho_i \frac{(x_j - c_{ij})^2 - w_{ij}^2}{w_{ij}^3} .    (4.2.6)
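The per-dimension rules of equation 4.2.6 can be sketched as a single update function (Python with NumPy); the learning rate, the example node and the small floor placed on the widths are assumptions made for the illustration.

```python
import numpy as np

def diag_gauss_activation(x, c, w):
    """Gaussian with a separate width per input dimension (4.2.5)."""
    n = x.shape[0]
    norm = (2.0 * np.pi) ** (n / 2.0) * np.prod(w)
    return np.exp(-np.sum((x - c) ** 2 / (2.0 * w ** 2))) / norm

def update_diag_node(x, c, w, rho, lr):
    """Update rules (4.2.6): the centre moves toward x and each width is nudged
    toward the locally observed spread, both scaled by the node's normalised
    activation rho."""
    c_new = c + lr * rho * (x - c) / w ** 2
    w_new = w + lr * rho * ((x - c) ** 2 - w ** 2) / w ** 3
    return c_new, np.maximum(w_new, 1e-3)   # keep the widths positive (a practical guard)

# One illustrative update for a single node in a two-dimensional space
x, c, w = np.array([0.6, 0.2]), np.array([0.5, 0.5]), np.array([0.3, 0.3])
rho = 1.0   # with a single node the normalised activation is one
print(diag_gauss_activation(x, c, w))
print(update_diag_node(x, c, w, rho, lr=0.1))
```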
Figure 4.6 illustrates the receptive fields of a collection of nodes with adaptive variance trained to reflect the distribution of an artificial data set. The data shown is randomly distributed around three vertical lines in a two-dimensional space. The central line is shorter than the other two but generates the same number of data points. The nodes were initially placed at random positions near to the centre of the space.
Figure 4.6: Unsupervised learning with Gaussian nodes with adaptive variance.
Learning the covariance

Linsker [86, 87] proposed that unsupervised learning algorithms should seek to maximise the rate of information retained in the recoded signal. A related approach was taken by Sanger [141], who developed a Hebbian learning algorithm that learns the first P principal components of the input and therefore gives an optimal linear encoding (for P units) that minimises the mean squared error in the reconstructed input patterns. Such an algorithm performs data compression, allowing the input to be described in a lower-dimensional space and thus reducing the search problem. The principal components of the data are equivalent to the eigenvectors of the covariance (input correlation) matrix. Porrill [128] has also proposed the use of Gaussian basis function units that learn the local covariance of the data by a competitive learning rule. To adapt the receptive field of such a unit, a symmetric N×N covariance matrix S is trained together with the parameter vector c describing the position of the unit’s centre in the input space. The activation function of a Gaussian node with adaptive covariance is therefore given by

g(\mathbf{x}, \mathbf{c}, \mathbf{S}) = \frac{1}{(2\pi)^{N/2} |\mathbf{S}|^{1/2}} \exp\left( -\tfrac{1}{2} (\mathbf{x} - \mathbf{c})^{\top} \mathbf{S}^{-1} (\mathbf{x} - \mathbf{c}) \right)    (4.2.7)
where |·| denotes the determinant. The term Gaussian basis function (GBF) unit will be used hereafter to refer exclusively to units of this type. In practice, it is easier to work with the inverse of the covariance matrix, H = S^{-1}, which is known as the information matrix. With this substitution equation 4.2.7 becomes

g(\mathbf{x}, \mathbf{c}, \mathbf{H}) = (2\pi)^{-N/2}\, |\mathbf{H}|^{1/2} \exp\left( -\tfrac{1}{2} (\mathbf{x} - \mathbf{c})^{\top} \mathbf{H} (\mathbf{x} - \mathbf{c}) \right)    (4.2.8)
Given a collection of GBF nodes with normalised activations (eq. 4.1.6), suitable update rules for the parameters of the ith node are

\Delta \mathbf{c}_i \propto \rho_i\, \mathbf{H}_i (\mathbf{x} - \mathbf{c}_i) , and \Delta \mathbf{H}_i \propto \rho_i \left[ \mathbf{S}_i - (\mathbf{x} - \mathbf{c}_i)(\mathbf{x} - \mathbf{c}_i)^{\top} \right] .    (4.2.9)
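One such update step can be sketched as follows, implementing equations 4.2.8 and 4.2.9 literally; the two-node initialisation and learning rate are illustrative, and no safeguard is included here against the receptive-field instability discussed later in the chapter.

```python
import numpy as np

def gbf_activation(x, c, H):
    """Gaussian basis function in terms of the information matrix H = S^-1 (4.2.8)."""
    n = x.shape[0]
    d = x - c
    return (2.0 * np.pi) ** (-n / 2.0) * np.sqrt(np.linalg.det(H)) * np.exp(-0.5 * d @ H @ d)

def gbf_unsupervised_step(x, centres, Hs, lr):
    """One competitive update of a set of GBF nodes (4.2.9): each centre moves
    toward x along H_i(x - c_i), and H_i is nudged using the difference between
    the current covariance estimate S_i = H_i^-1 and the locally observed
    scatter, weighted by the normalised activation rho_i."""
    g = np.array([gbf_activation(x, c, H) for c, H in zip(centres, Hs)])
    rho = g / np.sum(g)
    for i in range(len(centres)):
        d = x - centres[i]
        S_i = np.linalg.inv(Hs[i])
        centres[i] = centres[i] + lr * rho[i] * (Hs[i] @ d)
        Hs[i] = Hs[i] + lr * rho[i] * (S_i - np.outer(d, d))
    return centres, Hs

# A single illustrative step with two nodes in a two-dimensional input space
x = np.array([0.4, 0.6])
centres, Hs = [np.zeros(2), np.ones(2)], [np.eye(2), np.eye(2)]
centres, Hs = gbf_unsupervised_step(x, centres, Hs, lr=0.05)
print(np.round(centres[0], 3), np.round(Hs[0], 3))
```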
A further discussion of these rules is given in Appendix B. The receptive field of a GBF node is a multi-dimensional ellipse whose axes align themselves along the principal components of the input. The value of storing and learning the extra parameters that describe the covariance is that this often allows the input data to be encoded to a given level of accuracy using a smaller number of units than if nodes with a more restricted parameter set are used. Figure 4.7 illustrates the use of this learning rule for unsupervised learning with a data set generated as a random distribution around an X shape in a 2-dimensional space. The left frame in the figure shows the receptive fields of four GBF nodes; for comparison, the right frame shows the fields learned by four units with adaptive variance alone.
Figure 4.7: Unsupervised learning with adaptive covariance (GBF) and adaptive variance units. The units in the former case learn a more accurate encoding of the input data.6

A set of GBF nodes trained in this manner can provide a set of local coordinate systems that efficiently describe the topology of the input data. In particular, they can be used to capture the shape of the manifold formed by the data-set in the input space. Porrill [128] gives a fuller discussion of this interesting topic.

Discussion

Unfortunately, none of the techniques described here for unsupervised learning is able to address the ambiguity problem directly. The information that is salient for a particular task may not be the same information that is predominant in the input. Hence, though a data-compression algorithm may retain 99% of the information in the input, it could be that the lost 1% is vital to solving the task. Secondly, unsupervised learning rules generally attempt to split the state-space in such a way that the different units get an equal share of the total input (or according to some similar heuristic). However, it may be the case that a particular task requires a very fine-grained
6 In this particular case the mean squared error in the reconstructed input was approximately half as large for the GBF units (left figure) as for the units with adaptive variance only (right figure) (mse = 0.0075 and 0.014 respectively after six thousand input presentations).
quantisation in certain regions of the space though not in others, and that this granularity is not reflected in the frequency distribution of the input data. Alternatively, the acquired quantisation may be sufficiently fine-grained but the region boundaries may not be aligned with the significant changes in the input-output mapping.

In many reinforcement learning tasks the input vectors are determined in part by the past actions of the system. Therefore as the system learns a policy for the task the distribution of inputs will change and the recoding will either become out-dated or will need to adapt continually. In the latter case the reinforcement learning system will be presented with a moving target problem where the same coding may represent different input patterns over the training period.

Finally, it is often the case that the temporal order in which the input vectors are provided to the system is a poor reflection of their overall distribution. Consider, for instance, a robot which is moving through an environment generating depth patterns as input to a learning system. The patterns generated in each local area of the space (and hence over any short time period) will be relatively similar and will not randomly sample the total set of depth patterns that can be encountered. In order to represent such an input space adequately the learning rate of the recoding algorithm will either have to be extremely slow, or some buffering of the input or batch training will be needed.

In general, unsupervised techniques can provide useful pre-processing but are not able to discover relevant task-related structure in the input. The next section describes some steps toward learning more appropriate input representations using algorithms in which the reinforcement learning signal is used to adapt the input coding.
4.3 Adaptive coding using the reinforcement signal

Non-gradient descent methods for discrete codings
I am aware of two methods that have been proposed for adaptively improving a discrete input quantisation. Both techniques are based on Watkins’ Q-learning, and therefore also require a discrete action space. Both also assume that the quantisation of the input space consists of hyper-cuboid regions.

Whitehead and Ballard [181] suggest the following method for finding unambiguous state representations. They observe that if a state is ambiguous relative to possible outcomes then the Q-learning algorithm will associate with each action in that state a value which is actually an average of the future returns. For an unambiguous state, however, the values converged upon by Q-learning will always be less than or equal to the true returns (but only if all states between it and the goal are also unambiguous—an important caveat!). Their algorithm therefore searches through possible input state representations, suppressing any representation that learns to overestimate some of its returns. In the long run unambiguous states will be suppressed less often and therefore come to dominate. This method has some similarities with selectionist models of learning (e.g. Edelman [45]) since it requires that there are several alternative input representations all competing to provide the recoding, with the less successful ones gradually dying out.

Chapman and Kaelbling [30] describe a more rigorous, statistical method based on a similar principle. Their ‘G algorithm’ attempts to improve a discrete quantisation by using the t-test to decide whether the reinforcement accruing on either side of a candidate splitting-point is derived from a single distribution. If the test suggests two different underlying distributions then the space is divided at that position. The technique can be applied recursively to any new quantisation cells that are generated. The algorithm is likely to require very extensive training periods for the following reasons. First, the evaluation function must be entirely re-learned every time the quantisation is altered. Second, because the secondary reinforcement is noisy whilst the values are being learned, it is necessary to split training into two phases—value function learning and t-test data acquisition. Finally, the requirement of the t-test that
data is drawn from normal distributions means that the same state must (ideally) be visited many times before the splitting test can be applied.

Both of these algorithms have a major limitation in that they require that the set of potential splitting points, or alternative quantisations, is finite and preferably small. This will clearly not be true for most tasks defined over continuous input spaces.

Gradient learning methods for continuous input spaces

In chapter two Williams’ [184] analysis of reinforcement learning algorithms as gradient ascent learning methods was reviewed. As Williams has pointed out, once the gradient of the error surface has been estimated it is possible to apply generalised gradient learning rules to train multilayer neural networks on such problems. This allows a suitable recoding of the input space to be learned dynamically by adaptation of the connection weights to a layer of hidden units.

There are basically two alternative means for training a hidden layer of coding units using the reinforcement signal. The first approach is the use of generalised gradient descent, whereby the error from the output layer is back-propagated to the weights on the hidden units. This is the usual, supervised learning, method for adapting an internal representation of the input. The second approach is a generalisation of the correlation-based reinforcement learning rule. That is, the coding layer (or each coding unit) attempts, independently of the rest of the net, to do stochastic gradient ascent in the expected reward signal. Learning in this case is of a trial and error nature where alternative codings are tested and judged by their effect on the global reinforcement. The output layer of the network has no more direct influence in this process than any other component of the changing environment in which the coding system is seeking to maximise its reward. A network architecture that uses a correlation rule to train hidden units has been proposed by Schmidhuber [143, 144] and is discussed in the next chapter.

In general, it will be more efficient to use correlation-based learning only when absolutely necessary [184]. That is, stochastic gradient ascent need only be used at the output layer of the network where no more direct measure of error is possible. Elsewhere, units that are trained deterministically by back-propagation of error should always learn more efficiently than stochastic units trained by the weaker rule. The
rest of this chapter therefore concerns methods based on the use of generalised gradient descent training.

Reinforcement learning in multilayer perceptrons

The now classical multilayer learning architecture is the multilayer perceptron (MLP) developed by Rumelhart et al. [140] and illustrated in figure 4.8.
Figure 4.8: Multilayer Perceptron architecture, with layer 0 (input units), layer 1 (hidden units) and layer 2 (output units). (The curved lines indicate non-linear activation functions.)

In a standard feed-forward MLP activation flows upward through the net, the output of the nodes in each layer acting as the inputs to the nodes in the layer above. Learning in the network is achieved by propagating errors in the reverse direction to the activation (hence the term back-propagation), where the generalised gradient-descent rule is used to calculate the desired alteration in the connection weights between units.

The activity in the hidden units provides a recoding of each input pattern appropriate to the task being learned. In achieving this coding each hidden unit acts by creating a partition of the space into two regions on either side of a hyperplane. The combined effect of all the partitions identifies the sub-region of the space to which the input
pattern is assigned. This form of recoding is thus considerably more distributed than the localist, basis function representations considered previously.

There have been several successful attempts to learn complex reinforcement learning tasks by combining TD methods with MLP-like networks, examples being Anderson's pole balancer [4] and Tesauro's backgammon player [167]. However, the degree of crosstalk incurred by the distributed representation means that learning in an MLP network can be exceptionally slow. This is especially a burden in reinforcement learning, where the error feedback signal is already extremely noisy. For this reason, a more localist form of representation may be more appropriate and effective for learning in such problems. This motivates the exploration of generalised gradient learning methods for training networks of basis function units on reinforcement learning tasks.

Reinforcement learning by generalised gradient learning in networks of Gaussian Basis Function units

This section describes a network architecture for reinforcement learning using a recoding layer of Gaussian basis function nodes with adaptive centres and receptive fields. The network is trained on-line using an approximate gradient learning rule. Franklin [51] and Millington [101] both describe reinforcement learning architectures consisting of Gaussian basis nodes with adaptive variance only. Clearly, such algorithms will be most effective only when the relevant task-dimensions are aligned with the dimensions of the input space. The architecture described here is based on units with adaptive covariance; the additional flexibility should provide a more general and powerful solution.

An intuition into how the receptive fields of the GBF units should be trained arises from considering Thorndike’s ‘law of effect’. The classic statement of this learning principle (see for instance [85]) is that a stimulus-action association should be strengthened if performing that action (after presentation of the stimulus) is followed by positive reinforcement, and weakened if the action is followed by negative reinforcement. Consider an artificial neuron that is constrained to always emit the same action but is able to vary its ‘bid’ as to how much it wants to respond to a given stimulus. The implication of the law of effect is clear. The node should learn to make
high bids for stimuli where its action is rewarded, and low bids for stimuli where its action is punished. Generalising this idea to a continuous domain suggests that the neuron should seek to move the centre of its receptive field (i.e. its maximum bid) towards regions of the space in which its action is highly rewarded and away from regions of low reward. If the neuron is also able to adapt the shape of its receptive field then it should expand wherever the feedback is positive and contract where it is negative. Now, if the constraint of a fixed action is relaxed then three adaptive processes will occur concurrently: the neuron adapts its action so as to be more successful in the region of the input space it currently occupies; meanwhile it moves its centre toward regions of the space in which its current action is most effective; finally it changes its receptive field in such a way as to cover the region of maximum reward as effectively as possible.

Figure 4.9 illustrates this process. The figure shows a single adaptive GBF unit in a two-dimensional space. The shape of the ellipse shows the width of the receptive field of the unit along its two principal axes. The unit's current action is a. If the unit adapts its receptive field in the manner just described then it will migrate and expand its receptive field towards regions in which action a receives positive reward and away from regions where the reward is negative. It will also migrate away from the position where the alternative action b is more successful. A group of units of this type should therefore learn to partition the space between them so that each is performing the optimal action for its ‘region of expertise’.
Figure 4.9: Adapting the receptive field of a Gaussian basis function unit according to the reinforcement received. The unit will migrate and expand towards regions where its current action is positively reinforced and will contract and move away from other regions.

In an early paper Sutton and Barto [165] termed an artificial neuron that adapts its output so as to maximise its reinforcement signal a ‘hedonistic’ neuron. This term perhaps even more aptly describes units of the type just described, in which both the output (action) and input (stimulus sensitivity) adapt so as to maximise the total ‘goodness’ obtained from the incoming reward signal.

The learning algorithm

As in equation 4.2.8 the activation of each expert node is given by the Gaussian

g(\mathbf{x}, \mathbf{c}, \mathbf{H}) = (2\pi)^{-N/2}\, |\mathbf{H}|^{1/2} \exp\left( -\tfrac{1}{2} (\mathbf{x} - \mathbf{c})^{\top} \mathbf{H} (\mathbf{x} - \mathbf{c}) \right)
where x is the current context, c is the parameter vector describing the position of the centre of the node in the input space, and H is the information matrix (the inverse of the covariance matrix). Here I assume a scalar output for the network to make the algorithm easier to describe; the extension to vector outputs is, however, straightforward. The net output y is given as some function of the net sum s. Now if the error e in the network output is known, then update rules for the output parameter vector w and the parameters c_i and H_i of the ith expert can be determined by the chain rule. Let

\delta = e \frac{\partial y}{\partial s} , \qquad \delta_i = \frac{\partial s}{\partial g_i} ,    (4.3.1)

then

\Delta \mathbf{w} \propto \delta \frac{\partial s}{\partial \mathbf{w}} = \delta\, \boldsymbol{\rho}(\mathbf{x}) ,    (4.3.2)

\Delta \mathbf{c}_i \propto \delta\, \delta_i \frac{\partial g_i}{\partial \mathbf{c}_i} , and    (4.3.3)

\Delta \mathbf{H}_i \propto \delta\, \delta_i \frac{\partial g_i}{\partial \mathbf{H}_i} .    (4.3.4)
To see that these learning rules behave in the manner described above, consider the following example. Assume a simple immediate reinforcement task in which the network output is given by a Gaussian probability function with a standard deviation of one and mean \mathbf{w} \cdot \boldsymbol{\rho}(\mathbf{x}) = s. After presentation of a context x the network receives a reward signal r. From Williams’ gradient ascent learning procedure (Section 2.2), and assuming a reinforcement baseline of zero, we have

\delta = r\,(y - s)    (4.3.5)
First of all consider the case where the outputs of the units are not normalised with respect to each other, that is, \rho_i(\mathbf{x}) = g_i(\mathbf{x}) = g_i for each expert i. We have

\delta_i = \frac{\partial}{\partial g_i} \Bigl( \sum_i w_i g_i \Bigr) = w_i ,    (4.3.6)

and the update rules are therefore given by
\Delta \mathbf{w} \propto r\,(y - s)\, \boldsymbol{\rho}(\mathbf{x}) ,    (4.3.7)

\Delta \mathbf{c}_i \propto r\,(y - s)\, w_i\, \rho_i\, \mathbf{H}_i (\mathbf{x} - \mathbf{c}_i) , and    (4.3.8)

\Delta \mathbf{H}_i \propto r\,(y - s)\, w_i\, \rho_i \left( \mathbf{H}_i^{-1} - (\mathbf{x} - \mathbf{c}_i)(\mathbf{x} - \mathbf{c}_i)^{\top} \right)    (4.3.9)
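A minimal sketch of one training step under these rules is given below, assuming the scalar Gaussian output unit of the example above (mean s = w·g, unit variance). The reward function, learning rate and initialisation are hypothetical stand-ins, and the determinant check is only a crude guard against the receptive field instability discussed later in this chapter; the full algorithm and parameter settings are those of Appendix B, not this sketch.

```python
import numpy as np

def gbf(x, c, H):
    """Gaussian basis function activation in terms of the information matrix H."""
    n = x.shape[0]
    d = x - c
    return (2.0 * np.pi) ** (-n / 2.0) * np.sqrt(np.linalg.det(H)) * np.exp(-0.5 * d @ H @ d)

def reward(x, y):
    """A hypothetical stand-in task (not from the thesis): reward actions
    close to the sum of the two inputs, punish the rest."""
    return 1.0 if abs(y - np.sum(x)) < 0.5 else -1.0

def reinforcement_step(x, centres, Hs, w, lr, rng, reward_fn):
    """One immediate-reinforcement update with unnormalised GBF activations.

    The scalar action y is drawn from a Gaussian with mean s = w.g and unit
    variance; after receiving the reward r the parameters are changed
    according to equations 4.3.7-4.3.9."""
    g = np.array([gbf(x, c, H) for c, H in zip(centres, Hs)])
    s = w @ g                           # mean action
    y = rng.normal(loc=s, scale=1.0)    # exploratory action
    r = reward_fn(x, y)
    w_old = w.copy()
    w = w + lr * r * (y - s) * g                                                       # (4.3.7)
    for i in range(len(centres)):
        d = x - centres[i]
        centres[i] = centres[i] + lr * r * (y - s) * w_old[i] * g[i] * (Hs[i] @ d)     # (4.3.8)
        Hs[i] = Hs[i] + lr * r * (y - s) * w_old[i] * g[i] * (np.linalg.inv(Hs[i]) - np.outer(d, d))  # (4.3.9)
        if np.linalg.det(Hs[i]) <= 1e-6:   # crude reinitialisation guard
            Hs[i] = np.eye(x.shape[0])
    return w, centres, Hs

rng = np.random.default_rng(0)
centres = [rng.normal(scale=0.1, size=2) + 0.5 for _ in range(4)]
Hs = [np.eye(2) for _ in range(4)]
w = np.zeros(4)
for _ in range(2000):
    x = rng.uniform(0.0, 1.0, size=2)
    w, centres, Hs = reinforcement_step(x, centres, Hs, w, lr=0.05, rng=rng, reward_fn=reward)
print(np.round(w, 2))
```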
Since ρ_i is always positive it will affect the size and not the direction of the change in the network parameters. The dependence of the direction of change in the parameters on the signs of the remaining components of the learning rules is illustrated in the table below.

        components                       direction of change
 row    r    y - s    δ_i = w_i          Δc_i            ΔH_i (receptive field)    Δw_i
  a     +      +          +              towards x       grow                       +
  b     +      -          -              towards x       grow                       -
  c     -      +          +              away from x     shrink                     -
  d     -      -          -              away from x     shrink                     +
  e     +      +          -              away from x     shrink                     +
  f     +      -          +              away from x     shrink                     -
  g     -      +          -              towards x       grow                       -
  h     -      -          +              towards x       grow                       +
The table shows that the learning procedure will, as expected, result in each local expert moving toward the input and expanding whenever the output weight of the unit has the same sign as the exploration component of the action, and the reward is positive (rows a, b). If the reward is negative the unit will move away and its receptive field will contract (c, d). If the sign of the weight is opposite to the direction of exploration then all these effects are reversed (e, f, g, h).
Explicit competition between units

There appears, then, to be good accordance between these training rules and the intuitive idea for adaptive coding as a generalisation of the law of effect. However, there remains an impression that this method of training the receptive fields is not quite ideal. This arises because the direction of change in each of these rules depends upon the correlation of the variation in the mean action with the absolute size of the output weight, that is, on (y - s)\,w_i. Intuitively, however, a more appropriate measure would seem to be the correlation of the variation in the mean action with the variation of the output (of this unit) compared with the mean output, that is, (y - s)(w_i - s). This measure seems more appropriate as it introduces an explicit comparison between the local experts, allowing them to judge their own success against the group mean rather than against an absolute and arbitrary standard. Fortunately, learning rules that incorporate this alternative measure arise directly if the normalised activation is used in determining the output of the local experts. If (as in equation 4.1.6) we have

\rho_i = \frac{g_i}{\sum_{j=1}^{M} g_j}
where M is the total number of local expert units, then
\delta_i = \frac{\partial s}{\partial g_i} = \frac{1}{\sum_j g_j} (w_i - s) .    (4.3.10)
From which we obtain the learning rules (from 4.3.3 and 4.3.4)

\Delta \mathbf{c}_i \propto \delta\,(w_i - s)\, \rho_i\, \mathbf{H}_i (\mathbf{x} - \mathbf{c}_i) , and    (4.3.11)

\Delta \mathbf{H}_i \propto \delta\,(w_i - s)\, \rho_i \left( \mathbf{H}_i^{-1} - (\mathbf{x} - \mathbf{c}_i)(\mathbf{x} - \mathbf{c}_i)^{\top} \right)    (4.3.12)
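The normalised-competition versions of the centre and receptive-field updates (equations 4.3.11 and 4.3.12) can be sketched as follows; the value of delta and the two-unit initialisation are illustrative.

```python
import numpy as np

def gbf(x, c, H):
    """Gaussian basis function activation in terms of the information matrix H."""
    n = x.shape[0]
    d = x - c
    return (2.0 * np.pi) ** (-n / 2.0) * np.sqrt(np.linalg.det(H)) * np.exp(-0.5 * d @ H @ d)

def normalised_competition_updates(x, centres, Hs, w, delta, lr):
    """Centre and receptive-field changes under equations 4.3.11 and 4.3.12.

    Each unit is judged by (w_i - s), the difference between its own output
    weight and the mean output s, so the local experts compete explicitly.
    delta = r(y - s) is assumed to be supplied by the caller, as in 4.3.5."""
    g = np.array([gbf(x, c, H) for c, H in zip(centres, Hs)])
    rho = g / np.sum(g)                     # normalised activations (4.1.6)
    s = w @ rho                             # mean output of the network
    new_centres, new_Hs = [], []
    for i in range(len(centres)):
        d = x - centres[i]
        new_centres.append(centres[i] + lr * delta * (w[i] - s) * rho[i] * (Hs[i] @ d))    # (4.3.11)
        new_Hs.append(Hs[i] + lr * delta * (w[i] - s) * rho[i]
                      * (np.linalg.inv(Hs[i]) - np.outer(d, d)))                           # (4.3.12)
    return new_centres, new_Hs

# One illustrative update with two competing units
x = np.array([0.4, 0.6])
centres, Hs = [np.zeros(2), np.ones(2)], [np.eye(2), np.eye(2)]
w = np.array([1.0, 0.0])
new_centres, new_Hs = normalised_competition_updates(x, centres, Hs, w, delta=0.5, lr=0.1)
print(np.round(new_centres[0], 3))
```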
In these rules the desired comparison measure (w_i - s) arises as a natural consequence of employing generalised gradient ascent learning.

Further refinements and potential sources of difficulty

The use of GBF networks for a simple immediate reinforcement learning task is described below; their application to difficult delayed reinforcement tasks is investigated in the next chapter. Before describing any implementations, however, some refinements to the learning system and potential problems (and possible solutions) will be considered.

Learning scaling parameters

An important extension of the learning scheme outlined above involves adapting the strength of response of each unit independently of the position and shape of its receptive field. This sets up a competition for a share in the output of the network between the different ‘regions of expertise’ occupied by the units. One of the benefits of this competition is a finer degree of control in specifying the shape and slope of the decision boundaries between competing nodes. For each unit an additional parameter \hat{p}_i is used which scales the activation of the ith node during the calculation of the network output, i.e. the activation equation becomes

g_i = \hat{p}_i\, (2\pi)^{-N/2}\, |\mathbf{H}_i|^{1/2} \exp\left( -\tfrac{1}{2} (\mathbf{x} - \mathbf{c}_i)^{\top} \mathbf{H}_i (\mathbf{x} - \mathbf{c}_i) \right)    (4.3.13)
The learning rule for the scale parameter is then given by

\Delta \hat{p}_i \propto \delta\, \delta_i \frac{\partial g_i}{\partial \hat{p}_i} .    (4.3.14)
The \hat{p}_i values must be non-zero and sum to unity over the network. This requires a slightly complicated learning rule, since a change in any one scale parameter must be met by a corresponding re-balancing of all the others. A suitable learning scheme (due to Millington [101]) is described in Appendix B.
Receptive field instability

The learning rules for the node receptive fields described above do not guarantee that the width of the field along each principal axis will always be positive. A sufficient condition for this is for the covariance matrix to be positive definite, which can be determined by checking for a positive, non-zero value of the determinant. A simple fix for this problem is to check this value after each update and reinitialise any node receptive field which fails the test. A better solution, however, is to adapt the square root of the covariance matrix rather than attempt to learn the covariance (or information) matrix directly. Algorithms using the square root technique are described in [17].

Keeping the GBF units apart

A further problem that can arise in training sets of GBF nodes is that two nodes will drift together, occupy the same space, and eventually become identical in every respect. This is a locally optimal solution for the gradient learning algorithm and is clearly an inefficient use of the basis units. To overcome this problem a spring component that implements a repulsive ‘force’ between all of the node centres can be added to the learning mechanism (see also [126]). This is not always a desirable solution, however. For instance, it could be the case that two nodes have their centres almost exactly aligned but differ considerably in both the shape of their receptive field and their output weights. This can be a very efficient manner of approximating some surfaces (see next chapter) but cannot arise if the spring heuristic is used to keep the units separated.

Staying in the data

The converse of the problem of units drifting together is that they may drift too far apart. Specifically, some units can be pushed beyond the margins of the sampled region of the input space through the action of the learning rule (for a unit to be totally inactive is another locally optimal solution). A possible way to keep units ‘in the data’ would be to implement some form of conscience mechanism whereby inactive units have higher learning rates (see Appendix B for more on this topic) or to use unsupervised learning to cause units that are under-used to migrate toward the
input. Both these devices will only be of use, however, if the temporal sequence of inputs approximates a random sampling of the input space, a requirement that rarely holds for learning in real-time control tasks.

Relationship to fuzzy inference systems

One of the most attractive features of basis function approximations is their relationship to rule-based forms of knowledge, in particular, what are known as fuzzy inference systems (FIS). An FIS is a device for function approximation based on fuzzy if-then rules such as

“If the pressure is high, then the volume is small.”

An FIS is defined by a set of fuzzy rules together with a set of membership functions for the linguistic components of the rules, and a mechanism, called fuzzy reasoning, for generating inferences. Networks of Gaussian basis function units have been shown to be functionally equivalent to fuzzy inference systems [67]. In other words, the local units in GBF networks can be directly equated with if-then type rules. For instance, if we have a network of two GBF units a and b (in a 2D space with a single scalar output), then an equivalent FIS would be described by

Rule A: If x_1 is A_1 and x_2 is A_2, then y = w_A ,
Rule B: If x_1 is B_1 and x_2 is B_2, then y = w_B .
Here the membership functions (A_1, A_2, B_1, and B_2) are the components of the (normalised) Gaussian receptive fields of the units in each input dimension. The functional equivalence between the two systems allows an easy transfer of explicit knowledge (fuzzy rules) into tuneable implicit knowledge (network parameters) and vice versa. In other words, a priori knowledge about a target function can be built into the initial conditions of the network. Provided these initial fuzzy rules give a reasonable first-order approximation to the target, learning should be greatly accelerated and the likelihood of local optima much reduced. This ability to start the learning process from good initial positions should be of great value in reinforcement learning where tabula rasa systems can take an inordinately long time to train.
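As a sketch of this equivalence, the fragment below builds a two-rule fuzzy inference system directly as a (normalised) GBF network; the rule centres, widths and consequent values are invented for the example.

```python
import numpy as np

# Two illustrative fuzzy rules over a 2-D input, each expressed as per-dimension
# Gaussian membership functions (centre, width) plus a consequent output value.
rules = [
    {"centres": np.array([0.2, 0.8]), "widths": np.array([0.15, 0.15]), "out": 1.0},  # Rule A
    {"centres": np.array([0.7, 0.3]), "widths": np.array([0.20, 0.20]), "out": 0.0},  # Rule B
]

def fis_output(x, rules):
    """Fuzzy reasoning with product conjunction and normalised rule strengths;
    functionally this is a GBF network whose units are initialised from the rules."""
    g = np.array([np.exp(-np.sum((x - r["centres"]) ** 2 / (2.0 * r["widths"] ** 2)))
                  for r in rules])
    rho = g / np.sum(g)                      # normalised rule strengths (cf. 4.1.6)
    return float(np.sum(rho * np.array([r["out"] for r in rules])))

print(fis_output(np.array([0.25, 0.75]), rules))   # dominated by Rule A -> close to 1
```

Tuning the centres, widths and consequents of such a network with the learning rules above then amounts to refining the initial fuzzy rules.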
A simple immediate reinforcement learning problem

To demonstrate the effectiveness of the GBF reinforcement learning mechanism, this section describes its application to a simple artificial immediate reinforcement learning task. Its use for more complex problems involving delayed reinforcement is discussed in the next chapter. Figure 4.10 shows a two-dimensional input space partitioned into two regions by the boundaries of an X shape. The ‘X’ task is defined such that to achieve maximum reward a system should output a one for inputs sampled from within the X shape and a zero for inputs sampled from the area outside it.
Figure 4.10: A simple immediate reinforcement task.

In the simulations described below the network architecture used a Bernoulli logistic unit (see section 2.2.2) to generate the required stochastic binary output. A spring mechanism was also employed to keep the GBF nodes separated. Full details of the algorithm are given in Appendix B, where suitable learning rate parameters are also described. Networks of between five and ten GBF nodes were each trained on forty thousand randomly selected inputs. Over the period of training the learning rates of the networks were gradually reduced to zero to ensure that the system settled to a stable configuration. Each network was then tested on ten thousand input points lying on a
uniform 100×100 grid. During this test phase the probabilistic output of the net was replaced with a deterministic one, i.e. the most likely output was always taken.

The learning mechanism was initially tested with units with fixed (equal) scale parameters. Ten runs were performed with each size of network. The results for each run, computed as the percentage of correct outputs during the test phase, are given in Appendix B. In all, the best average performance was found with networks of eight GBF units (hereafter 8-GBF nets).

Figure 4.11 shows a typical run for such a net. Initially the nodes were randomly positioned within 0.01 of the centre of the space. By five hundred training steps the spring component of the learning rule has caused the nodes to spread out slightly but they still have the appearance of a random cluster. The next phase of training, illustrated here by the snapshot at two thousand timesteps, is characterised by movement of the node centres to strategic parts of the space and adaptation of the output weights toward the optimal actions. Soon after, illustrated here at five thousand steps, the receptive fields begin to rotate to follow the shape of the desired output function. The last twenty thousand steps mainly involve the consolidation and fine-tuning of this configuration.
Figure 4.11: Learning the X task with an 8-GBF network. The figures show the position, receptive field and probability of outputting a one for each GBF unit after 500, 2,000, 5,000 and 40,000 training steps.

Figure 4.12 shows that the output of the network during the test phase (that is, after forty thousand steps) is a reasonable approximation to the X shape.
Figure 4.12: Test output of an 8-GBF network on the X task. Black shows a preferred output of 1, white a preferred output of zero.

Averaged over ten runs of 8-GBF networks, the mean score on the test phase was 93.6% optimal outputs (standard deviation 1.1%). This performance compared favourably with that of eight-unit networks with adaptive variance only. The latter, being unable to rotate the receptive fields of their basis units, achieved average scores of less than 90%. Performance at a similar level to the 8-GBF networks was also achieved on most runs with networks of seven units and on some runs using networks of six units (though in the latter case the final configurations of the units were substantially different). However, with fewer than eight units, locally optimal solutions in which the X shape is incompletely reproduced7 were more likely to arise.

Networks larger than eight units did not show any significant improvement in performance over the 8-GBF nets; indeed, if anything the performance was less consistently good. There are two observations that may be relevant to understanding this result. First, on some runs with larger network sizes one or more units is eventually pushed outside the space to a position in which it is almost entirely inactive. This effectively reduces the number of units participating in the function
7 For instance, one arm of the X might be missing or the space between the two arms incorrectly filled.
approximation. Second, with the larger nets, the number of alternative near-optimal configurations of units is increased. These networks are therefore less likely to converge to the (globally) best possible solution on every run.

The experiments with GBF networks of five to ten units were repeated, this time with the learning rule for adapting the scale parameters switched on. Though the overall performance was similar to that reported above, quantitatively the scores achieved with each size of network were slightly better. Again the best performance was achieved by the 8-GBF nets, with a mean score of 95.4% (σ = 0.56), showing a significant8 improvement when compared to networks of the same size without adaptive scaling. Once more there was no significant improvement for net sizes larger than eight units, indicating a clear ceiling effect9. Figure 4.13 shows the final configuration and test output of a typical 8-GBF network with adaptive scaling. The additional degrees of freedom provided by the scaling parameters result in the node centres being more widely spaced, generating an output which better reproduces the straight edges and square corners of the X.
8 t = 4.47, p = 0.0003.
9 A run of 15-GBF nets also failed to produce a higher performance than the 8-unit networks.
Figure 4.13: GBF network with adaptive priors. The numbers superimposed on the nodes show the acquired scale factors (the output probabilities were all near deterministic i.e. >0.99 or