Learning Working Memory Tasks by Reward Prediction in the Basal Ganglia and Prefrontal Cortex

by

Bryan Loughry
B.S., University of Colorado, Boulder, 1989

A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Master of Science, Department of Computer Science, 2003

This thesis entitled: Learning Working Memory Tasks by Reward Prediction in the Basal Ganglia and Prefrontal Cortex written by Bryan James Loughry has been approved for the Department of Computer Science

Michael C. Mozer

Randall C. O’Reilly

Clayton H. Lewis

Date

The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above-mentioned discipline.


Abstract

Loughry, Bryan James (M.S., Computer Science)
Learning Working Memory Tasks by Reward Prediction in the Basal Ganglia and Prefrontal Cortex
Thesis directed by Assistant Professor Randall C. O'Reilly

A detailed computational model composed of the structures of the basal ganglia and the prefrontal cortex is presented. The model implements a reinforcement learning mechanism serving working memory. The model expands on a previously demonstrated architecture by incorporating reward prediction, which facilitates the system's learning to selectively store and maintain relevant stimuli. The model is trained on a modified version of the CPT-AX task and is shown to learn an appropriate neural response for an expected reward. This work demonstrates how the basal ganglia can compute the difference between expected and actual reward and thereby implement a reinforcement learning algorithm analogous to Temporal Differences.


Contents

1 Introduction
  1.1 Working Memory Characteristics
    1.1.1 The 1,2-AX Task
    1.1.2 Functional Requirements of Working Memory
  1.2 General Paradigm for Working Memory and Executive Function
  1.3 Previous Work

2 Biological Details – Review of previous work
  2.1 General
  2.2 Reverberatory Loops
  2.3 Disinhibition of the PFC reverberatory loops
  2.4 Information Compression
  2.5 Maintenance
  2.6 Gating
  2.7 Competitive Inhibition within Stripes

3 Reinforcement Learning in BG, PFC interactions
  3.1 Computational Details of Reinforcement Learning
    3.1.1 The Temporal Differences Algorithm
  3.2 How the Brain Implements Reinforcement Learning in the BG
    3.2.1 Selectivity
    3.2.2 Dopamine and Derivatives
    3.2.3 Loops and Conjunctions
    3.2.4 Division of Labor within the Striatum

4 Computational Simulations
  4.1 Model Details
  4.2 Stimuli Processing on a Learned Task
  4.3 Stimuli Processing while Learning a Task
    4.3.1 Learning to Respond to X
    4.3.2 Learning sequences
  4.4 Results

5 Discussion
  5.1 Markov Decision Processes
  5.2 Model Predictions
  5.3 Model Limitations, Issues and Future Work

6 References

List of Figures

1  Sample 1,2-AX sequence
2  Complete Biological Schematic
3  Biological Schematic of Loops
4  Agent Model
5  Actor-Critic Model
6  Dopamine Modulation
7  Unlearned Dopamine Activity
8  Learned Dopamine Activity
9  Striatal Pathways
10 Learned 1AX sequence
11 XnR sequence
12 XR sequence
13 AX sequence, X rewards A
14 AX sequence, A learned
15 1A sequence, A rewards 1

List of Tables

1 Number of Neurons in Regions
2 Assignment of network components to the RL Model
3 Association of Network Layers to Biology

1 Introduction

The Basal Ganglia (BG) and Prefrontal Cortex (PFC) have long been associated with motor function (Jackson & Houghton, 1995; Graybiel & Kimura, 1995). Delay activity around motor responses is found in neurons of the PFC (Hoshi, Shima, & Tanji, 2000). Movement disorders, such as those caused by Parkinson's and Huntington's disease, are associated with BG dysfunction (Marsden, 1986; Dujardin, Krystkowiak, Defebvre, Blond, & Destee, 2000; Chesselet & Delfs, 1996), and the BG and PFC are known to be involved in sequential motor tasks (Matsumoto, Hanakawa, Maki, Graybiel, & Kimura, 1999). More recently, evidence has been presented that supports a broader view of the BG and PFC, one that includes higher-level cognitive function and executive planning and narrows the scope of the motor role to the planning or initiation of actions (Gobbel, 1997; Middleton & Strick, 2000; Berns & Sejnowski, 1996; Schultz, Apicella, Romo, & Scarnati, 1995a). Anatomically the BG is clearly in a position to influence cognitive processing, as it interacts with virtually the entire cortex (Wilson, 1990). Of interest to us, it is believed that the BG is involved in working memory (WM) and that the BG sub-serves working memory in much the same way it sub-serves motor function (Braver & Cohen, 2000; Gabrieli, 1995). Indeed, the only distinction between the motor and working memory roles may be the presence or absence of a physical manifestation. It has also been suggested that the BG plays a role in behavior reinforcement through the dopamine (DA) system. DA, which is primarily controlled by the BG, is believed to indicate reward expectation and possibly errors in such expectations (Montague, Dayan, & Sejnowski, 1996; Suri, Bargas, & Arbib, 2001; Schultz, Romo, Ljungberg, Mirenowicz, Hollerman, & Dickinson, 1995b; Wickens & Kotter, 1995). It has also been suggested that signals from the BG representing predicted reward and/or error corrections could be used by associated brain structures to implement the learning of WM tasks (O'Reilly & Munakata, 2000).
In an effort to better understand the fundamental computational task the BG perform in regard to working memory, researchers have utilized computational models. It is held here that such a model should span the gap from biologically valid neural activity to physical behavior. There exist theoretical models of working memory that are too abstract to be testable in a biological sense, as well as models that reflect the biology but are too computationally intensive to be implemented (Wickens, 1997). Models that bridge the gap between these extremes represent a balance in which abstract behavior can be tested using biologically plausible computation (Alexander, 1995). This approach is taken here, in an effort to answer the question of how the neural interactions between the Prefrontal Cortex (PFC) and BG enable the system to learn working memory tasks from reward.


The focus of this research is to understand the interactions between the BG and the PFC in the process of reward prediction. Connectivity between the PFC and BG is quite complex, making the understanding of their interactions difficult (Goldman-Rakic & Selemon, 1990). Identifying the computational roles of each of the individual regions involved is essential to determine how the system, as a whole, is capable of performing reward prediction and utilizing this information to learn WM tasks.
In the following, working memory is first discussed and the WM task that is modeled is described. Next we review previous work on the underlying architecture of this model and the relevant biology. We then turn to the details of learning from reward signals within the BG-PFC interactions. Unusual features are found in these interactions that play a key role in the learning process. Some simulations are then presented, followed by a brief discussion.

1.1 Working Memory Characteristics

Working memory is task-specific, activation-based memory involved in current tasks (Gazzaniga, 1995). It is information that is maintained to influence and constrain ongoing processing, as when solving a Rubik's Cube. In solving a Rubik's Cube one needs to remember the ultimate goal (all six sides solid colors), the current sub-goal (get the blue side to be all blue without messing up the red side), and the immediate goal (get the blue, red, yellow corner piece in the correct place relative to the red and blue sides). In this example we have a hierarchy of goals (six sides, one side, one piece) in which, at any level, multiple trials or loops may occur as one tries to figure out the successful moves. During this process one needs to maintain the hierarchy of goals while being able to update the state of a goal on a specific level in the hierarchy.
It is important that the memory utilized for such a task is activation based and not passive or weight based, in order to interact with and dynamically guide processing. Activation-based memory allows for updating the memory state as the task evolves. We may try various ways to get the blue, red, yellow corner piece in the right place and need to update the state of the system as we proceed. Once the piece is in place the modified state should guide the system toward a subsequent goal.
Physiologically, the distinction between active and passive memory is based on the mechanism that instantiates the memory. Active memory is a memory by virtue of persistent activity among a collection of neurons that represent the memory item (O'Reilly & Munakata, 2000). This allows the memory item to actively and directly influence on-going processing, because all neurons that are connected to these active neurons will receive activity and thereby be influenced in their processing.


Passive memory (associated with long-term memory) is memory by virtue of the weighting of connections between neurons, and requires a stimulus to travel the relevant connections. This form of memory is less accessible to on-going processing, because the memory is contained within the weights of the connections as opposed to the activity of neurons. Weight-based memory is also not easily modified, as connection strengths typically change on a slow time scale. For these reasons active memory is required to dynamically control the processing.

1.1.1 The 1,2-AX Task

A canonical example of a working memory task is the CPT-AX (Braver & Cohen, 2000). A modified version of this task was conceived to increase the dependence on the working memory system, creating the 1,2-AX task.
The traditional CPT-AX task is a sequential stimulus-response task. One of six stimuli (A, B, C, X, Y, Z) is presented at each step. Four of the stimuli are task relevant: A, B, X, Y, and two of them are non-relevant: C, Z. The task is to respond with a right button press to the target sequence (A followed by X) and to all other sequences with a left button press. The non-relevant stimuli are distractors and are to be ignored. That is, a pattern including C and Z should result in the same behavior as when C and Z are absent.
In order to create a more difficult working memory task, a context stimulus that dictates the target sequence was introduced. In the 1,2-AX task there are 9 possible stimuli (1, 2, 3, A, B, C, X, Y, Z) presented sequentially. Of the new stimuli, 1 and 2 are task relevant and stimulus 3 is a distractor. The numeric stimulus sets the target sequence: if the last task-relevant numeric stimulus was a 1 then the target sequence is A-X, and if it was a 2 then the target sequence is B-Y. The task is otherwise the same as the traditional CPT-AX task. Figure 1 illustrates the sequence 1,A,X,B,X,2,B,Y and the associated right and left button presses. This sequence demonstrates how the context is controlled by the 1 and 2. Note that distractors could be placed in the sequence without changing the response sequence.
The 1,2-AX task increases the demands placed on the working memory system because the 1 or 2 must be maintained over repeated trials of the A/B/C,X/Y/Z sequence in order to have the context indicating the target sequence available for correct action. This increases the difficulty of our working memory task as it contains an inner loop. With the CPT-AX task alone only the A or B needs to be maintained until an X or Y is seen. With the addition of the 1/2 we introduce another layer in the hierarchy, and the A/B level becomes an inner loop.


Figure 1: Sample 1,2-AX sequence. In the context of 1, an A,X sequence should be followed by a right button press. Still in the context of 1, the B,X pair is followed by a left button press. In the context of 2, the B,Y pair is followed by a right button press.

The behavior of the system looping over A/B-level trials depends on the context of the 1/2 outer loop. In the CPT-AX task this inner loop is absent, as the A/B represents the outer loop and the X/Y terminates the sequence. The 1,2-AX task therefore requires selective updating of the memory system, and is more akin to our Rubik's Cube example.
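To make the task rules concrete, here is a minimal sketch of the 1,2-AX response rule in Python (illustrative only, not part of the thesis model; the function and variable names are ours). It assumes, as described above, that the last 1 or 2 sets the target pair, that the last A or B is held until the next X or Y, and that 3, C and Z are distractors.

```python
def respond_12ax(sequence):
    """Return (probe, press) pairs for the X/Y probes in a 1,2-AX stimulus stream.

    The last task-relevant number (1 or 2) sets the target pair (A-X or B-Y);
    3, C and Z are distractors and change nothing. A right press ('R') follows
    a completed target pair, a left press ('L') any other completed pair.
    """
    context = None   # outer loop: last 1 or 2 seen
    letter = None    # inner loop: last A or B seen
    presses = []
    for s in sequence:
        if s in ('1', '2'):
            context = s
        elif s in ('A', 'B'):
            letter = s
        elif s in ('X', 'Y'):
            target = ('A', 'X') if context == '1' else ('B', 'Y')
            hit = context is not None and (letter, s) == target
            presses.append((s, 'R' if hit else 'L'))
        # '3', 'C' and 'Z' are distractors and are ignored
    return presses

# The sample sequence of Figure 1 (1,A,X,B,X,2,B,Y) yields R, L, R:
print(respond_12ax(list("1AXBX2BY")))
```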

1.1.2 Functional Requirements of Working Memory

As the 1,2-AX task illustrates, there are two functional demands placed on the working memory system. It must be able to maintain a memory state while processing stimuli, while also being able to rapidly update a selected part of the memory state. This all needs to be done while ignoring distractors. The distractors need to be processed to determine that they are non-relevant, but they should not affect the memory state. The system must both robustly maintain stimuli in the face of distraction and ongoing processing and selectively update the contents of the memory maintenance system (Braver & Cohen, 2000). For example, a 1 or a 2 should be selected for maintenance while a 3 should be ignored, and if the task context is a 1 then an A stimulus should be added to the memory state without disturbing the maintenance of the 1 context, in order to respond correctly if an X should follow.
The functions of maintenance and updating are at odds. A mechanism that rapidly updates will be prone to overwriting stored items. Conversely, a system that robustly maintains is not open to updating from incoming stimuli. One way a system can mediate between these opposing goals is via a gating mechanism that selects specific stimuli for updating particular regions of the maintenance system (Braver & Cohen, 1999). Such a mechanism has been implemented and is described below.


1.2 General Paradigm for Working Memory and Executive Function

Implementing the 1,2-AX task within a computational model serves the purpose of illustrating an overall paradigm for executive function. The solving of the 1,2-AX task is representative of how a general problem type might be solved. The general paradigm hypothesized is that problems or executive tasks have common traits that place them in a particular category or type. Through repeated experience of these demands the working memory system learns how and what to maintain in order to solve particular problem or task types. The pathways and connective weights in the brain become tuned to handle many different problems. The 1,2-AX task is representative of a possible problem type. It is not suggested that there exist dedicated processing structures unique to the 1,2-AX task, but rather that there are structures dedicated to general task types, and the 1,2-AX task is but one of these.

1.3 Previous Work

It has been demonstrated that the underlying architecture for the model proposed in this work is capable of providing the selective gating that is required to perform the 1,2-AX task. The biological and computational aspects of the model were discussed in detail, and it was shown that the system can both robustly maintain and selectively update, according to the stimuli presented (Frank, Loughry, & O'Reilly, 2001). The relevant details of this model are reviewed below.
As presented, the model lacked a mechanism for learning the correct behavior. Explicit representations were specified and presented as target values at each layer in the model. This circumvented the need for the model to learn what was predictive of reward. The focus of Frank et al. was to demonstrate a biologically plausible gating system in the service of working memory. Here we show that an extension to the model can learn the stimuli that are predictive of reward in order to perform the 1,2-AX task.

2 Biological Details – Review of previous work

Detailed models, such as the one presented here, allow us to address the anatomical and physiological data in a way unavailable to abstract models. However, some degree of abstraction is necessary to obtain a testable model and a macroscopic view of the whole system. Described below is an evolved version of the model presented in Frank et al. (2001).


Figure 2: Complete Biological Schematic. Connectivity of the network, including special connection types, is shown with BG structures highlighted.

2.1 General

The PFC is the main point of input and output to the WM system, as shown in figure 2, receiving sensory input from the posterior cortex (Input) as well as sending output to the posterior motor cortex (Output) (Wickens, 1997; Gobbel, 1997; Wilson, 1990; Wise, Murray, & Gerfen, 1996). The PFC can be viewed as having two functionally separate regions. The superficial and deep layers (layers 2-3 and 5-6, respectively; PFCm) have been shown to exhibit delay activity and are suggested to support a maintenance function. PFC layer 4 (PFCg) is the primary receiver of thalamic activity and has a modulated supporting role to the PFCm (Gobbel, 1997; Wickens, 1997), playing a role in the gating function.
The basal ganglia is composed of a number of structures including the striatum, the substantia nigra (SN) and the globus pallidus (GP) (Wickens, 1997; Gobbel, 1997; Wilson, 1990). Other structures associated with the BG include the thalamus and the ventral tegmental area (VTA). We will treat the striatum, the SN and the VTA separately within the model. In the model, the gating function of the globus pallidus and thalamus, demonstrated in our previous work and reviewed below, is collapsed into a single special modulatory connection type.
The striatum is the main point of input to the BG; virtually all areas of cortex project to the striatum (Wickens, 1997; Gobbel, 1997; Wilson, 1990). It can be broken down in several ways. Functionally the striatum is composed of associative, motor and limbic areas (Alexander, DeLong, & Strick, 1986; Joel & Weiner, 2000), but we will consider the associative and the limbic areas only.


Figure 3: Biological Schematic of Loops. Shows the disinhibition circuit, which is modeled as a single connection. The Posterior Cortex is associated with the Input layer.

The motor area is assumed to be analogous, with respect to motor function, to the associative area, which serves WM. Additionally, the striatum can be delineated by two types of cells (Gobbel, 1995; Beiser, Hua, & Houk, 1997; Wickens, 1997). These cells are known as striosomes (patches) and matrisomes (the matrix). Together they partition the striatum. The nature of the connectivity of the matrisomes and striosomes differs, and they are treated separately here.

2.2 Reverberatory Loops

As shown in figure 3, there are loops (or stripes) of PFC that are predominantly connected to loops of the striatum, which are connected to loops in the GP, which are connected to loops in the thalamus, which are connected back to the PFC (Amos, 2000; Beiser et al., 1997; Alexander et al., 1986; Groves, Garcia-Munoz, Linder, S., Martone, & Young, 1995). The striped connectivity between the various regions is topologically organized such that a path initiating in the PFC will eventually return to itself, forming a loop. The SN has a similar striped connectivity with the striatum. The PFC also has a topologically consistent connectivity that constitutes a reverberatory loop originating in the PFCm, out to the PFCg, and back to the originating region of the PFC. This loop is activated by the PFCm itself by way of the Thalamus. There are thus several intersecting striped loops of connectivity: PFC - Striatum - Globus Pallidus - Thalamus - PFC, Substantia Nigra - Striatum - Substantia Nigra, and PFCm - PFCg - PFCm.


2.3 Disinhibition of the PFC reverberatory loops

In the following, the details of the loops have largely been ignored to facilitate understanding of the interactions in question. However, one should assume that the interactions occur within loops and not between loops, unless specifically stated otherwise.
The connectivity from the matrisomes to the GP is known to be inhibitory, as is the connectivity from the GP to the Thalamus (Gobbel, 1997). Further, the GP is tonically active, which means the Thalamus is inhibited by default, and this results in the PFC's reverberatory loops being naturally inhibited. Recall that the PFCm connects back to itself by way of the PFCg, which is inhibited by the Thalamus. When striatal neurons fire they inhibit GP neurons, which are then prevented from inhibiting thalamic neurons, allowing the thalamus to become activated by the PFCm and thereby opening a reverberatory loop. This is the gate that is opened to allow selected activation patterns into the maintenance system. Essentially the PFC activates its own representations when a selected stripe is opened by the BG gating system.

2.4 Information Compression

The quantity of neurons within each layer decreases as activation descends from the PFCm into the BG (Wickens, 1997). Because of this, the amount of information that can be passed through the system decays as the number of neurons available for transmitting information is reduced. Based on the numbers of neurons available (table 1) for conveying information through the PFC–Striatum–GP loop, there is a great deal of information loss and/or integration of information. Wilson states that there is a minimal convergence of 100:1, and higher ratios for the SN (Wilson, 1990). The gating system therefore must concern itself with details at the stripe level. However, decisions must be made with the full content, which only the PFC layers contain. Consequently, it is essential that the PFCm and PFCg are connected in a one-to-one fashion to preserve relevant information, as the entire information content cannot pass through the BG. By allowing PFC loops to maintain themselves we eliminate the need for transmitting this information.
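As a rough check of these figures (a worked estimate using the approximate counts of Table 1 below, not a claim made in the text):

```latex
\frac{N_{\mathrm{striatum}}}{N_{\mathrm{GP}}} \approx \frac{1.0\times 10^{8}}{6.6\times 10^{5}} \approx 150:1,
\qquad
\frac{N_{\mathrm{striatum}}}{N_{\mathrm{SN}}} \approx \frac{1.0\times 10^{8}}{1.5\times 10^{5}} \approx 670:1,
```

which is consistent with the minimal 100:1 convergence, and the higher ratio for the SN, attributed to Wilson (1990) above.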

2.5 Maintenance

Because the PFCm exhibits prolonged neuronal activity during working memory tasks, it is commonly believed that the maintenance function lies here. Further, as pointed out above, there is a functional need for content to be stored in the PFC.

Region              Number of Neurons in Human Brain
Striatum            1.0x10^8
Globus Pallidus     6.6x10^5
Substantia Nigra    1.5x10^5

Table 1: Number of Neurons in Regions

There are several possible mechanisms for this prolonged activation. Intrinsic activation can be supplied by way of threshold-driven intracellular calcium or sodium channels (Gorelova & Yang, 2000; Abbott, Varela, Sen, & Nelson, 1997; Durstewitz, Kelc, & Gunturkun, 1999), or recurrent self-connections within the PFCm could also maintain activity. In the model the PFCm layer is self-connected and an intrinsic threshold-driven function is used as well. The PFCm is also reciprocally connected with the PFCg. The connectivity biases the PFCm to hold on to stimuli and allows the PFCm units to activate themselves under the control of the PFCg. Anatomically the PFCm layer is connected through the thalamus to the PFCg layer, and it is this circuit that is disinhibited, as shown in figure 2. Once disinhibited, it is the PFCm that activates the PFCg via the Thalamus, indirectly causing its own maintenance.

2.6 Gating

The combination of the disinhibition circuit and the intrinsic activation of PFC neurons provides a gating mechanism. When something is deemed to be predictive of reward and relevant to further processing, the striatum will send a signal to the GP which will open the gate. If the activation of the PFCm exceeds a threshold, the intrinsic activation of the related neurons will cause the stimulus to be maintained. PFCg neurons provide the extra excitation to push PFCm neurons over the threshold.
The striatum is the gate keeper. Matrisomes in the striatum receive highly conjunctive inputs from the PFCm and Input, and thereby experience the relevant stimuli that may be correlated with reward (Wilson, 1990). These matrisomal neurons receive activity from across the anatomical loops of the PFCm, so a single unit can receive from a broad, though specific, combination of PFCm and Input neurons. This allows it to react to the conjunction of the relevant PFCm/Input neurons to which it is tuned. The matrisomes then become active and open the gate by inhibiting the GP, which disinhibits a specific stripe of the Thalamus, completing the PFCm - PFCg loop. The matrisomes must learn which conjunction(s) are predictive of reward and which stripe (region) in the PFCm contains the essential stimuli to be maintained (refer to figures 3 and 2).


The matrisomes learn these conjunctions based on the correlation of the contents of the PFC with the receipt of a reward signal that is initiated by the striosomes, or primary reward. The striosomes' role in the learning process is to indicate to the matrisomes when the content of the PFC is correlated with reward. When the striosomes recognize that a pattern in the PFCm and Input is predictive of reward, they activate the SN, which releases dopamine and facilitates the matrisomes' strengthening of conjunctive connections. We explain below how the striosomes learn to provide this signal.
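The gate-and-maintain interaction described in sections 2.3-2.6 can be summarized in a small Python sketch (illustrative only; the threshold, boost and decay values are assumptions, and the real model implements this with the Leabra mechanisms described in section 4.1):

```python
def gate_is_open(matrisome_active):
    """Disinhibition chain of section 2.3: the GP is tonically active and
    inhibits the thalamus; matrisome firing inhibits the GP, which releases
    the thalamus and lets the PFCm drive the PFCg (the open gate)."""
    gp_active = not matrisome_active      # striatum inhibits GP
    thalamus_released = not gp_active     # GP inhibits thalamus
    return thalamus_released

def pfc_m_next(pfc_m_act, gate_open, threshold=0.5, boost=0.3, decay=0.5):
    """One stripe of PFCm: with the gate open, PFCg excitation pushes the unit
    over the intrinsic-maintenance threshold and the content is held; with the
    gate closed, activity simply decays (numbers are assumed, not model values)."""
    if gate_open and pfc_m_act + boost >= threshold:
        return 1.0              # stripe latches and maintains the stimulus
    return decay * pfc_m_act    # nothing is maintained
```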

2.7 Competitive Inhibition within Stripes

An essential feature of this system is that within each stripe there is competitive inhibition between the units, consistent with the previous work (Frank et al., 2001). This is implemented in a winner-take-all fashion, resulting in only one unit being fully active within any one stripe. Because of this, the system can maintain only one stimulus per stripe.

3 Reinforcement Learning in BG, PFC interactions

The system has the overall goal of receiving reward and avoiding punishment, as in the traditional RL framework. Rewards can be immediate (relative to the initial predictor of the reward), in which case a direct association can be made, making no demands on memory, or they can be time delayed, requiring some form of memory or maintenance to correlate the predictor with the reward (see figure 1). In our model the striosomes are learning to detect predictive stimuli and the matrisomes are learning, with the help of the striosomes, what to do with this information (i.e., to open the gate to the appropriate stripe and begin maintenance). Here we review the fundamentals of reinforcement learning and then place them within the model framework.

3.1 Computational Details of Reinforcement Learning

Classical conditioning was used for training of the 1,2-AX task in a bottom-up fashion, as one would train an animal on such a task. First the X was presented followed by a reward, until the network recognized the X as being predictive of reward. Then the A was presented, followed by the X, followed by the reward. Lastly, the 1, A, X, reward sequence was presented.


A similar procedure was used for the 2,B,Y sequence. During the testing phase distractor stimuli were introduced to show that the system is robust to distractors.

Figure 4: Agent Model. The agent selects an action in the environment, which provides a reward and state information to the agent.

This style of learning lies in the reinforcement learning (RL) paradigm, in which the feedback given for performance is a reward (or punishment) (Sutton & Barto, 1998). No explicit information about the correct response is given, which distinguishes RL from supervised learning. Dopamine has long been known to play a key role in modulating this learning behavior, often associated with reward-predicting stimuli (Brown, Bullock, & Grossberg, 1999; Schultz et al., 1995a; Houk, Adams, & Barto, 1995; Schultz et al., 1995b). In the following we review RL and show how the BG, utilizing dopamine, can implement an RL algorithm.
The basic elements involved in RL, shown in figure 4, include an agent and an environment (Sutton & Barto, 1998). The environment produces rewards and state information. The agent uses the rewards and state information to determine the optimal action to take. Most reinforcement learning techniques further break down the model to include a policy, a value function, a reward function and possibly a model of the environment. The policy is a mapping from states of the environment to actions of the agent, analogous to stimulus-response pairs. The reward function is a mapping from states to rewards. The agent strives to create a policy that will maximize the total rewards received, often including a discounting of future rewards. The value function places a value on states (or actions) that is representative of immediate rewards as well as a (possibly discounted) expectation of future rewards arising from that state (or action). The value function is therefore an approximation of the total rewards to be received from a given state (or action) and an integral part of policy formation.
As specified in table 2, the biological components of the PFC and BG can be considered to play particular primary roles in the RL model.


We do not claim that the biology strictly adheres to these roles. The functional boundaries of the decomposition are blurred in order to aid our understanding of the component interactions.
The learning problem in RL is to find an optimal policy (one that maximizes expected reward). This is done by learning a value function and choosing the greedy state to move to (or action to perform). There are a number of methods to optimize a policy, including dynamic programming (DP), Monte Carlo (MC) methods and temporal differences (TD) (Sutton & Barto, 1998). TD is the preferred learning method in this context for several reasons. TD does not require complete prior knowledge of the environment dynamics, as DP does. TD also allows leveraging value estimates by calculating new estimates based on previous value estimates (existing knowledge), which MC methods do not do. Because TD can utilize existing value estimates and does not need to store a model of the system, it is more suitable for learning within biological systems, where resources are limited. Finally, there is evidence that dopamine activity during RL-related learning is functionally related to the TD algorithm (Suri et al., 2001; Suri & Schultz, 2001; Kakade & Dayan, 2001).

3.1.1 The Temporal Differences Algorithm

The basic principle behind RL methods is that of generalized policy iteration (GPI) (Sutton & Barto, 1998). In GPI the agent iteratively evaluates the current values assigned to states (or actions) and updates the current policy, typically in a greedy fashion. As the number of iterations of value and policy updates increases, the agent converges on an optimal policy, thereby maximizing total rewards. Notice that this incremental updating is similar to the weight updating that occurs in neural networks. Temporal Differences represents one way of updating the state (or action) values. The basic TD algorithm is:

V_{t+1}(s) = V_t(s) + α [ r_{t+1} + γ V_t(s+1) − V_t(s) ]    (1)

             AGENT                                      ENVIRONMENT
Policy       Action         Value                 State               Reward
Matrisomes   PFCg, Output   Striosomes, Limbic    Inputs, PFCm, OFC   External Reward, SN, VTA

Table 2: Assignment of network components to the RL Model. The agent includes the policy, action and value components, and the environment includes the state and reward components. Assignments do not represent hard boundaries.

Figure 5: Actor-Critic Model. The TD error is used to optimize the agent's policy and value functions.

The value of a state is updated with the current estimate of the value of the state plus the difference between the new estimate (immediate reward plus the estimate of the state value in the next time step) and the current estimate. Values are updated according to the difference in estimates as time passes. The updating is typically controlled by two parameters: α, which scales the weighting between previous estimates and the new estimate, and γ, which represents a discount on the estimates of future rewards. The change in the value of a given state, V_{t+1}(s) − V_t(s) (the bracketed term scaled by α), represents the TD error and is used to adjust the value function, and possibly the policy, as shown in figure 5.
The TD model is often separated into actor and critic components. This actor-critic model has been previously proposed for the BG (Barto, 1995). In our model one can view the matrisomes and PFC as the actor and the striosomes, OFC and external reward as the critic. As mentioned, it has been suggested that DA provides the signal representing the temporal difference in the above algorithm. Dopamine activity represents an error signal indicating the difference in predicted reward from one point in time to the next (Suri et al., 2001). Where and how does the biology provide for a measure of the difference between expectation and actuality of reward?
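Equation (1) can be transcribed directly; the following sketch (ours, with illustrative parameter values) keeps a table of state values and returns the TD error, the quantity that the remainder of this chapter identifies with the phasic dopamine signal:

```python
def td_update(V, s, s_next, r_next, alpha=0.1, gamma=0.9):
    """One temporal-differences update of the value table V, as in equation (1).

    V maps states to value estimates; alpha weights the new estimate against
    the old one and gamma discounts future reward. The returned TD error is
    positive when things turn out better than predicted, negative when worse.
    """
    td_error = r_next + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error
```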

3.2 How the Brain Implements Reinforcement Learning in the BG

Shown below is a mechanism that, we propose, implements what is effectively a derivative calculation of striosome activity. This will be shown to result in dopamine exhibiting the desired error signal, enabling the system to learn which stimuli are predictive of reward.


Given that, we show how dopamine and conjunctions over loops work together to provide the needed gating selectivity.

3.2.1 Selectivity

Selectivity within the BG and PFC involves two aspects: a temporal aspect (at what point in time is the gate to open) and a spatial aspect (what region of the brain must be affected to gate the correct stimuli) (Houk et al., 1995).
Temporal selectivity is necessary to capture the predictive stimulus and not the preceding or following stimuli. The system must specify a point in time to initiate the updating of the maintenance system. It is critical that this signal refer to a distinct point in time and react quickly. Both of these features are necessary for the striatum to learn to open the gate at the right time and for only long enough to let in the appropriate stimuli. For example, in the sequence 1,3,A,X the gate must open while the 1 is being presented and close before the 3 is presented, resulting in only the 1 being stored. If the gating signal is not aligned in time with the predictive stimulus, the 1, it will fail to make the correct association. Additionally, if the signal is not sharp and distinct (or spiked) there could be ambiguity between the 1 and the 3.
Spatial selectivity is needed for the correct information (knowledge of having seen the stimulus) to be stored in the correct region of the PFC, reflecting its position in the hierarchy for the task at hand. In the 1,3,A,X sequence the signal that indicates that the A is predictive must be associated with the region of PFC in which the A is stored. Therefore a distinct time signal as well as a distinct location must be provided by the gating system.
First we present evidence that dopamine is associated with the temporal nature of the needed signal, and show how this is a result of a derivative calculation within the BG. Then we address the spatial nature of the signal, which is a result of conjunctive connectivity across loops.

3.2.2 Dopamine and Derivatives

Although dopamine (DA) has long been accepted as being directly involved in reward-driven behavior, motivation and motor tasks, and has been suggested to be involved in the above temporal differences computation in serving these tasks, no specific mechanism has been shown to produce the desired DA behavior. DA is a neuro-modulator (Gorelova & Yang, 2000). It increases the efficacy of neurons to become active as opposed to activating them directly. This makes it suitable for modulating the learning of associations between stimuli and reward (Gorelova & Yang, 2000).


Figure 6: Dopamine Modulation. Dopaminergic neurons of the SN and VTA modulate activity of the striatal components and the OFC.

However, DA activity tends to be diffuse and not well targeted, so it cannot be acting on specific neural representations (Groves et al., 1995). This makes it unlikely to independently play the role of the gate keeper, as it lacks spatial selectivity. Individual units (local neuron populations representing stimulus patterns) must be targeted, and dopamine does not provide this specificity.
The dopamine activity around rewarded stimuli is well documented (Schultz et al., 1995b; Wickens & Kotter, 1995). Classic DA activity is shown in the neural recordings presented in figures 7 and 8. The recordings represent before and after the learning mechanism has determined that a stimulus is predictive of reward. A light is illuminated, followed by a reward after a fixed time delay. Prior to learning there is a significant dopamine burst upon receipt of reward, and minor activity when the light is lit (figure 7). After learning there is a dopamine spike when the light is first illuminated and no dopamine activity on reward receipt (figure 8). There are two things to notice from these recordings. First, the dopamine signal is propagated back in time to the initial stimulus that predicts reward. Second, the DA response is a sharp, as opposed to broad, signal. These two features together can provide for temporal selectivity.
Looking closely at the connectivity between the striatum and the substantia nigra and VTA (the DA-supplying neurons), a very important biological feature is found. As shown in figure 9, there are two relevant paths connecting the striatum with the SN (technically the SN pars compacta) and VTA, one of which is direct and inhibitory and the other of which is indirect (via the SN pars reticulata) and disinhibitory (Chevalier & Deniau, 1990; Joel & Weiner, 2000).


Figure 7: Unlearned Dopamine Activity. The dopamine activity is correlated with the reward, not the predictive stimulus (light).

Figure 8: Learned Dopamine Activity. The dopamine activity is correlated with the predictive stimulus after learning.


Figure 9: Striatal Pathways. Two different pathways connect the striatum with the dopamine neurons. The slow one inhibits and the fast one excites. This results in the derivative of the activity in the striatum being communicated to the SN and VTA.

The neurotransmitter receptors (GABA-A) on the indirect path are faster acting than the receptors on the direct path (GABA-B). This means the SN is first disinhibited, then inhibited. Because the SN is tonically active (due to the subthalamic nucleus) (Chevalier & Deniau, 1990), the disinhibition is effectively excitation, which is followed by inhibition. We believe that what results is essentially the derivative of the striatal units' activity being transmitted to the SN and VTA (Rick Granger, personal communication, 2001). When a striatal unit is in a steady state, the excitatory and inhibitory paths balance and the SN (VTA) receives no net activity. If the unit suddenly becomes more active, the disinhibition (excitation) will reach the SN (VTA) first, activating the receiving neurons. This will be followed by the slower acting inhibition, which will cancel out the activity, and the system will return to the steady state of no activity being transmitted to the SN (VTA). This happens even though the striatal units remain at a higher activation level than they were initially. Thus a spike of activity will occur when a striatal unit moves from one activity level to a higher one. An analogous situation arises when the striatal unit becomes less active: the excitation is reduced first, leaving the inhibitory input acting alone, and later the inhibition is removed, causing a dip in dopamine and a return to the tonic state. The SN/VTA effectively sees net changes in striatal neuron activity, the derivative of striatal activity.
The derivative response is important on a number of levels. It will cause the SN (VTA) to exhibit the spiking DA behavior that has been documented, and it provides temporal selection. Dopamine signals appear to track a net change of expected reward versus actual reward (Montague et al., 1996), as this computation would produce. Finally, the derivative fits well with the role of the striatum as the critic in the TD algorithm (Barto, 1995), and in particular the striosomes as a value function.


It is important to point out that in the model dopamine activity is necessary for learning the correct responses but not for performing them. Once the system has learned its task it may not be necessary for DA to be available to execute the task. However, it seems likely that in the absence of dopamine the response will degrade, given the symptoms found in diseases of the BG (Gabrieli, 1995; Marsden, 1986; Chesselet & Delfs, 1996).
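The fast-excitatory / slow-inhibitory arrangement of figure 9 can be illustrated with a toy calculation (a sketch under the simplifying assumption that the slower GABA-B pathway is just a one-step-delayed copy of striatal activity). The net SN/VTA drive is then the current activity minus the delayed activity, i.e. the step-to-step change described above:

```python
def sn_vta_drive(striatal_activity):
    """Net drive to the SN/VTA for a time series of striatal (striosome) activity.

    The fast disinhibitory path delivers the current activity level, the slow
    inhibitory path delivers the previous level; their combination is the
    change in activity: a burst when activity steps up, a dip when it steps
    down, and nothing at steady state.
    """
    drive, previous = [], 0.0
    for current in striatal_activity:
        drive.append(current - previous)   # fast excitation minus slow inhibition
        previous = current
    return drive

# A striosome stepping up to a sustained level and later stepping down:
print(sn_vta_drive([0.0, 0.0, 0.75, 0.75, 0.75, 0.25, 0.25]))
# -> [0.0, 0.0, 0.75, 0.0, 0.0, -0.5, 0.0]  (burst on the rise, dip on the fall)
```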

3.2.3 Loops and Conjunctions

The brain needs to know not only when to maintain but also what to maintain. The "what" amounts to where the item is located within the PFC. While processing stimuli for the 1,2-AX task the PFC is receiving stimuli other than the sequence characters being presented (distractors). All the stimuli end up in different places in the PFC, and the gating system must specify where the predictive stimulus is located.
As described above, the matrisomes receive conjunctive activity and select one specific stripe to gate. In our model the matrisomes are trained up by the striosomes to respond to stimuli that have been shown to be predictive of reward. Recall that the matrisomes initiate the disinhibition of the PFC reverberatory loops, essentially opening the gate. In this training the matrisomes learn to gate the appropriate stripe of PFCg based on the context of the PFCm and the Inputs. From this we achieve spatial selectivity.

3.2.4 Division of Labor within the Striatum

The orbital frontal cortex (OFC) and limbic striatal areas have established relationships with motivation and reward-based performance (Tremblay & Schultz, 2000a, 2000b; Schultz, Tremblay, & Hollerman, 2000). In this model, the Limbic striatum (LS, or Nucleus Accumbens) receives the primary reward signal and distributes it to the SN and VTA. It also receives activation from the OFC, which is related to novel and reward-predicting stimuli (Tremblay & Schultz, 1999). The LS is functionally related to the OFC (Joel & Weiner, 2000) and, we believe, has the same relationship to the OFC as the striatum has with the PFC.
The initiating signal is the first signal in a sequence that is predictive of reward (e.g., the 1 in the 1,A,X sequence). The OFC learns this in a manner similar to the PFC - BG system but needs only to encode the initiating signal and not the hierarchy. As the OFC and LS work to encode the initial predictor, the PFC and striatum encode the entire hierarchy of predictive stimuli. The LS and striosomes utilize information mutually to determine the initial predictor and the hierarchy.


Network Layers      Biological Correlate
Input               Posterior Cortex
Output              Posterior Motor Cortex
PFCm                PFC layers V & VI
PFCg                PFC layer II
OFC                 Orbital Frontal Cortex
Matrisomes          Striatal matrisomes
Striosomes          Striatal striosomes
Limbic              Nucleus Accumbens
Substantia Nigra    Substantia Nigra pars compacta
VTA                 Ventral Tegmental Area
Rewards             Hypothalamus

Table 3: Association of Network Layers to Biology

The LS is like a manager and the striosomes are like workers. The LS tells the striosomes when to pay attention (the novel-stimulus DA response) and the striosomes in return tell the LS whether there is anything important in the stimulus (the prediction-of-reward DA response). The LS causes the OFC to maintain the stimulus that it has found to be the initial predictor, while the striatum is building a hierarchy of maintained stimuli within the PFC.

4 Computational Simulations

After presenting details of the underlying computational elements of the model, two scenarios are described that illustrate the PFC and BG interactions during learned and unlearned behavior. The learned scenario is presented first, followed by the unlearned.

4.1 Model Details

The model presented here was built using the Leabra++ implementational framework (O'Reilly & Munakata, 2000). The bulk of the processing components are from the standard implementation, though a few special processing unit specifications and connection specifications were used. The network is structured as shown in figure 10. The association of network components to the underlying biology is found in table 3. The special components facilitate modulation of units by DA, maintenance of the PFCm, and the derivative calculation.


These facilities were implemented with custom specifications and include: LeabraModulatedUnitSpec, LeabraMaintConSpec, LeabraGainConSpec, LeabraDerivConSpec, and LeabraPrvActUnitSpec. The LeabraMaintUnitSpec switches on hysteresis if the activation of a unit is over a given threshold. The LeabraGainConSpec works with the LeabraModulatedUnitSpec, allowing the units that are connected via the Gain connection to control the gain of the modulated unit. The LeabraDerivConSpec works with the LeabraPrvActUnitSpec to calculate and send the difference in activation (previous - current activation) from the PrvActUnit to units connected via the DerivCon.
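The actual specifications are C++ classes within the Leabra++ framework and are not reproduced in the thesis; the fragment below is only a simplified Python rendering of the behaviors just described (threshold-triggered hysteresis, gain modulation, and transmission of the activation difference), with all names and numbers being our own illustrative choices:

```python
class MaintUnit:
    """Hysteresis as described for the maintenance spec: once activation
    exceeds the threshold, the unit sustains its own activity."""
    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.act = 0.0
        self.maintaining = False

    def update(self, net_input):
        floor = self.act if self.maintaining else 0.0
        self.act = max(net_input, floor)
        self.maintaining = self.act > self.threshold
        return self.act

def modulated_act(base_act, gain_signal):
    """Gain connection + modulated unit: the sender scales the gain of the
    receiving unit instead of driving it directly."""
    return base_act * (1.0 + gain_signal)

def deriv_signal(previous_act, current_act):
    """Deriv connection + previous-activation unit: send the difference in
    activation (previous - current), as stated in the text."""
    return previous_act - current_act
```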

4.2 Stimuli Processing on a Learned Task

A stimulus is presented at the input (e.g., 1) and activates its representation in the PFCm. If this stimulus, along with the relevant context maintained in the PFCm, is predictive of reward (as the 1 is in our task), it also activates related units in the Orbital PFC as well as the matrisomes and striosomes (note that the striosomal response is a remnant from learning). The matrisomes then begin the disinhibition of the PFCg stripe (anatomically via the GP and thalamus) associated with the reward-predicting stimulus. The PFCm activates itself via the PFCg, which has now been disinhibited. The PFCg allows the PFCm neurons that are activating it to exceed the threshold required to trigger the intrinsic activation and begin maintenance. If the stimulus is non-predictive, the matrisomes will not become activated and there will be no gating signal.
The output layer is activated by the representations in the PFCm layer, which will contain all the information for selecting the appropriate output (button), as shown in figure 10. For example, in the sequence 1,3,A,C,X, prior to presenting the X the PFCm layer will contain both the 1 and the A. The addition of the X to the PFCm layer provides the complete information for the output to select the right button.
This is the essential computation that was modeled in Frank et al. (2001). In this expanded model there are some other interactions that occur, which are all artifacts of reinforcement learning. These are detailed in the next section.

4.3 Stimuli Processing while Learning a Task

4.3.1 Learning to Respond to X

Prior to learning, the model responds to all stimuli as if they were non-predictive of reward. A stimulus is presented at the input and activates a representation in the PFCm layer.

Learning Working Memory Tasks by Reward Prediction 0 0

3

0

2

0

1 0 Input

0

C

0

B

A

0

Z

Y

X 0.99098

0

I

0 0 0 0

0

I

0

I

I

0

0

II

I 0

I II 0.00069 0 Matrisomes

II

0

II

0

III

0

III

0

III 0.98621

0

0

I 0.00087

0

II

II 0.99106

Left

Right 0.98041 Output

Pfc_Gate

Pfc_Maint I 0.99095

21

0

0

III

II

0

III 0

I

I

0

I

0

I

0

I

0 0

0 0

II

0

II

II

II

II

0 0

III

III

OrbitalFC

III

0

III

0

0.0

III

0

0

0

0

0.90496 0

III 0.0

II

III 0.08548 0

III

I II 0.51065 0.50419 SubstantiaNigra

LimbicStr I II 0.97457 0.98690 Striosomes

III 0.49702

0.51746 VTA

III 0.99349

0.99607

0.99094 Reward

0

Neg_Reward

Figure 10: Learned 1AX sequence

No other layers are activated. Though they may receive some excitation, it is insufficient to activate their neurons (Figure 11). In particular, the striatal layers (limbic, matrisomes, striosomes) all receive some excitation. Given sufficient excitation, the output layer randomly (based on noise) selects an R, L or no button press, and if the selection is correct a reward is presented at the reward layer. The reward directly activates the limbic striatum, which activates the SN and VTA. The SN sends dopamine to the striosomes, matrisomes and back to the limbic striatum. The DA will lower the threshold of excitation required for the striosomes to become active; neurons in the respective stripe will activate and increase the dopamine response, which will then allow the matrisomes, in a like way, to become active (Figure 12). Some matrisomes will become active, with a bias for ones that lie in the same stripe as the stimulus activity in the PFCm (recall the matrisome connectivity is conjunctive, crossing over stripe boundaries). This will initiate the disinhibition of the PFC reverberatory loop so that maintenance can occur, as explained above. Simultaneously the VTA releases dopamine to the OFC and PFCm layer, which allows the OFC to become active and strengthens the PFCm response.
When correct responses are made the system is rewarded and the dopamine activity causes the above chain of events. Associations are strengthened, reinforcing the same behavior when the same stimuli are presented. In this way the model can learn a simple task like "Press the R button when an X is seen".
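The learning step in this account, with dopamine both easing activation and strengthening whatever associations are currently active, can be caricatured as a three-factor update (a generic sketch of DA-gated Hebbian strengthening; it is not the Leabra learning rule actually used, and the learning rate is an assumption):

```python
def da_modulated_update(w, pre_act, post_act, dopamine, lrate=0.05):
    """Adjust a connection weight in proportion to the coincidence of
    presynaptic and postsynaptic activity, gated by the dopamine signal.

    A dopamine burst (positive) reinforces the associations active when the
    reward arrived; a dip (negative) weakens them, as described for incorrect
    responses in the following sections.
    """
    return w + lrate * dopamine * pre_act * post_act
```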


Figure 11: XnR sequence


Figure 12: XR sequence


The next time a known (by the network) predictive stimulus is presented, the orbital PFC, striosomes and matrisomes will recognize it and become active when the stimulus is first presented, prior to the reward. This will cause the dopamine to be released when the X is seen and prevent the DA burst when the reward is presented. The DA response will come from the net change in striosome activity.
Note that the above description implies one-pass learning. In reality a number of trials are necessary to allow the changes to propagate, as is standard in TD learning. The weights between units grow slowly, until at some point they become strong enough to allow the next item in the chain of events to occur. This also fits with the notion of GPI as described above. For example, the PFCm-to-striosome weights increase until the striosomes become active enough to activate the SN, without the influence of the limbic units, to allow the matrisomes to become active. It is important for the weights to grow slowly in order to learn the correct structures/relationships. It is possible that incorrect or irrelevant information could be active in the PFCm and a correct press be made by chance. In one-pass learning the system would correlate these incorrectly. If the weights are changed gradually the system is tuned only to what is correlated over many trials.
Thus far we understand how an immediate stimulus response is learned, but we are far from the 1,2-AX working memory task. In order to learn the AX task we need a reward to be available when the A is being presented, but our reward does not occur until the X is seen. This is where the propagation of the dopamine response plays a key role. As explained above, prior to learning, the reward initiates the dopamine response. However, after the striosomes have learned to recognize the predictive stimuli, they initiate the dopamine response. Recall that the dopamine response occurs due to a net change in expectation of reward at the striosomes. This allows the X to act as a proxy for an external reward and implicitly rewards whatever is in the PFCm when the X arrives. This is not a full-strength reward, only enough to start the DA-driven learning process. If the subsequent output is correct and the reward is presented, a reduced DA response occurs, which further strengthens the relationship between activations in the PFC and striatum. If the output were incorrect, and no reward was received, the striosomes would experience a negative change in expected reward and the relationships would be weakened.
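This trial-by-trial migration of the reward prediction, from the reward itself back to the X, then to the A, and eventually to the 1, is exactly what a TD-style update produces on a chain of predictive states. A small self-contained sketch (ours; one state per stimulus, illustrative parameters, and no claim that the network reduces to this):

```python
def train_chain(trials, alpha=0.2, gamma=0.9):
    """Reward prediction for each stimulus in the rewarded 1 -> A -> X chain.

    Each trial presents 1, then A, then X, then the reward. The prediction
    grows for X first; only once X predicts reward does it begin to implicitly
    reward the A, and likewise the A the 1.
    """
    V = {'1': 0.0, 'A': 0.0, 'X': 0.0}
    for _ in range(trials):
        for s, s_next, r in (('1', 'A', 0.0), ('A', 'X', 0.0), ('X', None, 1.0)):
            v_next = V[s_next] if s_next is not None else 0.0
            V[s] += alpha * (r + gamma * v_next - V[s])
    return V

print(train_chain(5))    # X leads, A lags, and 1 lags further behind
print(train_chain(200))  # the prediction has propagated all the way back to the 1
```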

4.3.2 Learning sequences

When the A first arrives there is no activity in the BG, as it is not yet correlated with reward (figure 13). When the X arrives, the A is implicitly rewarded as described above.

[Figure 13: AX sequence, X rewards A (snapshot of network unit activations, same layout as Figure 12; graphic not reproduced).]

This causes the system to reinforce the A and the AX combination (figure 14). If a correct button press is made, all of the proper correlations are reinforced. If an incorrect button press were made, there would be a negative change in the expectation of reward and the connections would all be weakened, due to a decrease in dopamine. After repeated AX trials with correct responses the A activates the rewarding mechanisms in the BG, and ultimately the OFC recognizes the A as the initiator of the reward sequence. At this point the A can act as a proxy for reward. Learning the 1AX task follows a similar sequence of events: the 1 initially does not activate BG units (figure 15); the A implicitly rewards the 1, and a correct response strengthens all of the units (figure 10). The Y, BY and 2BY tasks are learned in the same way. Notice that the B can also implicitly reward the 1 and strengthen the 1BY relation. The implicit reward behavior can also occur for the 2AX, 1BX, 2BX and 1AY sequences, as all of these involve task-relevant stimuli that must be maintained for correct button presses to be made.
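How a correct or incorrect press then strengthens or weakens whatever is currently active can be sketched with a simple three-factor, dopamine-modulated Hebbian update (a hedged idealization; the model's actual learning rules are more involved, and the unit names and values below are purely illustrative).

```python
import numpy as np

# Hypothetical three-factor (dopamine-modulated Hebbian) weight update,
# illustrating how a DA increase strengthens, and a DA decrease weakens,
# whatever associations are currently active. Values are illustrative only.

def da_modulated_update(w, pre, post, da, lr=0.05):
    """delta_w = lr * da * post x pre (activity outer product, scaled by the DA signal)."""
    return w + lr * da * np.outer(post, pre)

pfc = np.array([1.0, 0.0, 0.0])       # e.g. the maintained A is active in PFCm
striatum = np.array([0.0, 1.0])       # e.g. the striatal unit driving the press
w = np.zeros((2, 3))                  # PFC -> striatum weights

w = da_modulated_update(w, pfc, striatum, da=+0.5)   # correct press, reward: DA burst
print(w)
w = da_modulated_update(w, pfc, striatum, da=-0.5)   # incorrect press, no reward: DA dip
print(w)   # the same association is weakened back toward zero
```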


[Figure 14: AX sequence, A learned (snapshot of network unit activations; graphic not reproduced).]

[Figure 15: 1A sequence, A rewards 1 (snapshot of network unit activations; graphic not reproduced).]


4.4 Results

Several different training regimes were explored, focusing to varying degrees on shaped versus unshaped conditioning. In shaped conditioning, more complex tasks are constructed from simpler ones, and distractors are initially avoided. In fully unshaped conditioning, full sequences including distractors are presented from the onset of training. An example of a shaped regime is to train to a correct response for X alone, then to a correct response for AX, and then for 1AX; the same is then done for 2BY, and finally the sequences are mixed. An example of the unshaped regime is to randomly select a 1, 2 or 3, then randomly select a number of inner loops to present; for each inner loop an A, B or C is selected, followed by an X, Y or Z. In the unshaped regime, sequences were constrained so that a task-relevant stimulus is always presented before proceeding to the next level. Various training regimes between these two extremes were also explored. In particular, the complete AX task was trained in the context of the 1 until it was learned properly (utilizing shaping along the way), and then the 2BY task was trained in a similar fashion before the sequences were mixed. This method proved to be the most efficient for learning the complete 1,2-AX task.

The number of trials needed to learn the complete task varied greatly across training regimes. For the shaped regime the number of presentations of a stimulus sequence needed to achieve full activation in the response neuron was quite low, typically two or three trials, though this of course varied with the learning rates used. The unshaped regime required a much lower learning rate (to prevent learning the wrong sequences) and many more presentations: training without shaping typically required hundreds of presentations of complete sequences, and occasionally the system was unable to learn the task at all. The presence or absence of distractors during training also made a significant difference in training times. Introducing distractors lengthened training under both regimes, but the effect was much more pronounced for the shaped regime, where training could take two or three times longer when distractors were present.
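For concreteness, the unshaped regime described above can be sketched as a simple sequence generator (a hypothetical helper, not the code used in the simulations; the way the task-relevance constraint is enforced here is one plausible reading of it).

```python
import random

# Hypothetical generator for unshaped 1,2-AX training sequences, following the
# recipe in the text: pick an outer cue (1, 2 or 3), then a random number of
# inner loops, each an inner cue (A, B or C) followed by a probe (X, Y or Z).
# The constraint that a task-relevant stimulus appears at each level is
# enforced here in one plausible way; the simulations may have done it differently.

OUTER, INNER, PROBE = ["1", "2", "3"], ["A", "B", "C"], ["X", "Y", "Z"]

def generate_trial(max_inner_loops=3):
    # outer cues may include the distractor 3, but a relevant 1 or 2 must
    # appear before moving on to the inner loops
    seq = [random.choice(OUTER)]
    while seq[-1] == "3":
        seq.append(random.choice(OUTER))
    n_loops = random.randint(1, max_inner_loops)
    for i in range(n_loops):
        if i == n_loops - 1:
            # force the final inner loop to contain task-relevant stimuli
            seq += [random.choice("AB"), random.choice("XY")]
        else:
            seq += [random.choice(INNER), random.choice(PROBE)]
    return seq

print(generate_trial())   # e.g. ['3', '1', 'C', 'Z', 'A', 'X']
```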

5 Discussion

There have been many models proposed for implementing working memory within the biological constraints of the human brain. Most of these use dopamine, which is highly correlated with working memory tasks, as the gating system. Dopamine activity, however, is neither well localized nor representation-specific, so its ability to select specific stimuli to be maintained is questionable. Above I have detailed a biological mechanism that implements a computation analogous to the Temporal Differences algorithm. This mechanism produces the dopamine activity found in working memory studies and learns to detect and respond correctly to stimuli that are predictive of reward. Utilizing this mechanism and the conjunctive connectivity of the matrisomes in the striatum, the network learns to detect predictive stimuli. The model was tested on the 1,2-AX working memory task and represents a biologically detailed model of working memory: as presented, it learns to identify which stimuli are predictive of reward and to initiate the correct neural response. There are a number of limitations of the model, as well as predictions that such a model makes. These are discussed below, following some necessary background information.

5.1 Markov Decision Processes

TD learning is typically applied to Markov decision processes (MDPs). An MDP is a process that has the Markov property: the optimal decision can be determined from the current state alone, and history need not be considered. As the stimuli are presented in the 1,2-AX task, the process is not an MDP. Specifically, the correct response (left or right button press) depends on the task context (1 or 2), the inner-loop context (A or B) and the final stimulus (X or Y). Because the stimuli are presented sequentially, the history is relevant. The goal of the system is to store the right information to turn the task into an MDP, and this is exactly what maintaining the task-level and inner-loop-level stimuli accomplishes. As presented, the problem is significantly more difficult than a typical MDP. To model it within a TD learning framework one would require states representing all combinations of historical data, and the system would then have to choose among these states given the current stimulus. The primary challenge of the working memory task is thus building an MDP from a process that is not an MDP. This is more or less difficult depending on the amount of historical information that must be maintained and the time interval over which it must be maintained. This differential difficulty was evident in the results of the simulations and makes specific predictions for training real biological systems.
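A minimal sketch of this state augmentation (hypothetical; it assumes the standard 1,2-AX response rule of pressing Right for an X after an A in task context 1 or a Y after a B in task context 2, and Left otherwise): once the maintained task cue and inner-loop cue are folded into the state along with the current stimulus, the correct response becomes a function of that state alone, restoring the Markov property.

```python
# Hypothetical illustration of turning the 1,2-AX stream into an MDP by
# augmenting the state with the maintained task cue and inner-loop cue.
# The Right/Left mapping assumed here is the standard task rule, not a
# detail taken from the simulations.

def respond(stream):
    """Walk a stimulus stream and emit a response to each probe."""
    task_cue, inner_cue = None, None          # working-memory contents
    responses = []
    for s in stream:
        if s in ("1", "2", "3"):
            task_cue = s                      # update (gate in) the outer context
        elif s in ("A", "B", "C"):
            inner_cue = s                     # update the inner-loop context
        else:                                 # probe: X, Y or Z
            state = (task_cue, inner_cue, s)  # augmented, Markov state
            target = state in {("1", "A", "X"), ("2", "B", "Y")}
            responses.append("Right" if target else "Left")
    return responses

print(respond(["1", "A", "X", "B", "Y"]))     # ['Right', 'Left']
```

In the model it is the gating and maintenance of the PFC stripes, rather than an explicit state variable, that carries this historical information forward.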

5.2 Model Predictions

This model and the MDP issues make specific predictions regarding the training of biological systems, such as monkeys, on the 1,2-AX task. Two issues are considered: shaped vs. unshaped training, and the presence vs. absence of distractors. From the MDP observations and the simulation results one would expect monkeys to learn the task more readily under a shaped training regime; indeed, shaped training is standard practice when training animals on these sorts of tasks. Shaped training presents the learner with a task that is much closer to an MDP. Learning the X (or Y) alone is clearly an MDP, and introducing the earlier stimuli then poses an easier problem than it would if the X and Y were not already known, because the X and Y are strongly associated with reward. This is the motivation and need for the backward chaining of reward implemented in the model. Under an unshaped regime, monkeys should not always learn the task successfully, and learning should take on the order of 100 times longer. With no shaping it is much more difficult to pick out the important stimuli, and distinguishing correct from incorrect responding is less tractable. The shaped regime requires learning a subtask before moving up to the more difficult task; once correct behavior is established for the subtask, adding a new level to it is clearly easier than being presented with the entire novel sequence at once. The presence or absence of distractors should likewise result in longer or shorter training times, respectively. This was found in the simulation results and coincides with the MDP issues: more distractors mean longer intervals between relevant stimuli as well as more stimuli to decipher, making the construction of an MDP more difficult.

5.3 Model Limitations, Issues and Future Work

The focus of this research was to demonstrate that the necessary interactions between the BG and PFC could be elicited. A number of limitations and issues arise in extending the model to a more complete system. One problem encountered is due to the discretization of what is a continuous-time task. Because much of the model is cyclical in nature, reverberation in activation can occur; this was specifically seen between the striosomes and the SN. The difficulty was overcome by adjusting time constants and neural saturation effects. This problem is believed to stem primarily from processing over discrete time steps; however, in simulating any continuous-time task on a digital computer one must always compromise between the granularity of the discretization and the efficiency of the simulation. As CPU rates increase this issue can be addressed more completely.

A more significant concern is the lack of a robust way to unlearn something, for example switching the roles of 1 and 2 after the system has fully learned the task. This is clearly something that people are able to do (e.g., unlearning the overgeneralization of the past-tense "-ed" ending). Because unlearning a behavior is clearly more difficult than learning a novel behavior, it was expected that this would be a harder task for the model; however, the model fails to learn this sort of reversal at all. A potential solution would be to include a strong negative reward signal, though the biological evidence for such a mechanism is questionable.

The primary concern with the model is the hard-wiring of the inputs to their respective stripes. One of the ultimate goals of this line of research is to eliminate the homunculus from the system; to some degree we have elevated the homunculus to a higher level. This issue can be explored by removing the hard-wiring of the inputs, which would likely require graded representations (bounded to stripes) rather than the discrete ones used here, and poses a significant research effort.

The logical next step for this work is to train the model directly on the response. This should be a fairly straightforward modification, requiring an additional hidden layer connected between the PFCm and the Output. The layer would be responsible for building representations for right and left button presses based on the content of the PFC. Through this mechanism the system should be trainable from the correct output response alone.
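A minimal sketch of this proposed extension, under several assumptions that are not part of the existing model (a one-hot encoding of the maintained PFC contents, ten hidden units, and plain backpropagation as the learning rule), shows a hidden layer mapping the PFCm contents to a Right/Left output and being trained from the correct response alone.

```python
import numpy as np

# Hypothetical sketch of the proposed extension: a hidden layer between the
# maintained PFC contents (PFCm) and the Output, trained only from the correct
# Left/Right response. Encoding, layer sizes and the learning rule are
# illustrative assumptions, not features of the existing model.

rng = np.random.default_rng(0)
CUES = ["1", "2", "A", "B", "X", "Y"]          # maintained cues plus the current probe

def encode(task, inner, probe):
    """One-hot encoding of the maintained PFC contents and the current stimulus."""
    v = np.zeros(len(CUES))
    for s in (task, inner, probe):
        v[CUES.index(s)] = 1.0
    return v

def target(task, inner, probe):
    """1.0 for a Right (target) press, 0.0 for Left, per the standard task rule."""
    return 1.0 if (task, inner, probe) in {("1", "A", "X"), ("2", "B", "Y")} else 0.0

n_hidden, lr = 10, 0.3
W1 = rng.normal(0.0, 0.5, (n_hidden, len(CUES)))   # PFCm -> hidden
W2 = rng.normal(0.0, 0.5, (1, n_hidden))           # hidden -> "Right" output unit
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

cases = [(t, i, p) for t in "12" for i in "AB" for p in "XY"]
for _ in range(20000):                     # train from the correct response alone
    t, i, p = cases[rng.integers(len(cases))]
    x, y = encode(t, i, p), target(t, i, p)
    h = sigmoid(W1 @ x)
    out = sigmoid(W2 @ h)[0]
    d_out = out - y                        # cross-entropy error at the output
    d_h = W2[0] * d_out * h * (1.0 - h)
    W2 -= lr * d_out * h[np.newaxis, :]
    W1 -= lr * np.outer(d_h, x)

for t, i, p in cases:
    out = sigmoid(W2 @ sigmoid(W1 @ encode(t, i, p)))[0]
    print(t + i + p, "->", "Right" if out > 0.5 else "Left")
```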
