Learning in Chaotic Recurrent Neural Networks David C. Sussillo
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy under the Executive Committee of the Graduate School of Arts and Sciences
COLUMBIA UNIVERSITY 2009
©2009 David C. Sussillo All Rights Reserved
ABSTRACT
Learning in Chaotic Recurrent Neural Networks David C. Sussillo
Training recurrent neural networks (RNNs) is a long-standing open problem in both theoretical neuroscience and machine learning. In particular, training chaotic RNNs was previously thought to be impossible. While some traditional methods for training RNNs exist, they are generally thought of as weak and typically fail on anything but the simplest of problems and smallest networks. We review previous methods such as gradient descent approaches and their problems, and we also review more recent approaches such as the Echostate Network and related ideas. We show that chaotic RNNs can be trained to generate multiple patterns. Further, we explain a novel supervised learning paradigm, which we call FORCE learning, that accomplishes the training. The network architectures we analyze, on the one extreme, include training only the input weights to a readout unit that has strong feedback to the network, and on the other extreme, involve generic learning of all synapses within the RNN. We present these models as potential networks for motor pattern generation that are able to learn multiple, high-dimensional patterns while coping with the complexities of a recurrent network that may have spontaneous, ongoing, and complex dynamics. We show an example of a single RNN that can generate the aperiodic dynamics of all 95 joint angles for both human walking and running motions captured via motion capture technology. Finally, we apply the learning techniques we developed for chaotic RNNs to a novel, unsupervised method for extracting predictable signals out of high-dimensional time series data, if such predictable signals exist.
Contents

List of Figures

Chapter 1  Introduction
  1.1  A Brief History
  1.2  Overview of Dissertation
  1.3  The Network and Problem Statement
  1.4  Liquid State Machine and Echostate Network
  1.5  The Echostate Network with Output Feedback
       1.5.1  Echostate Networks for Pattern Generation
       1.5.2  Problems with Training ESNs in Batch Mode
  1.6  Methods That Calculate Gradients Through Time
  1.7  Real-Time Recurrent Learning
  1.8  Back-Propagation Through Time
  1.9  Problems with Gradient Descent Based Techniques

Chapter 2  Generating Patterns of Activity from Chaotic Networks
  2.1  Introduction
  2.2  Results
       2.2.1  Avoiding Indirect Effects and Controlling Chaos
       2.2.2  FORCE Learning I
       2.2.3  Principal Component Analysis of Network Activity
       2.2.4  FORCE Learning II
       2.2.5  The Advantage of Chaotic Spontaneous Activity
       2.2.6  Biological Interpretation of Feedback Modification
       2.2.7  Switching Between Different Output Patterns
       2.2.8  Computations on Inputs
       2.2.9  A Motion Capture Example
  2.3  Discussion
  2.4  Methods
       2.4.1  Network Model with Output Feedback
       2.4.2  Lorenz attractor example
       2.4.3  Generator Network with Separate Feedback Network and Output
       2.4.4  Network with Internal Learning
       2.4.5  Many Output Patterns with Input Modulation
       2.4.6  4-Bit Memory Example
       2.4.7  Motion Capture Example
  Acknowledgments

Chapter 3  FORCE Learning for Internal Neurons
  3.1  Introduction
  3.2  Training a Control Unit with the Output Error
  3.3  An Example
  3.4  A Mathematical Explanation
       3.4.1  A Small Example
       3.4.2  Back to Large Dimensions

Chapter 4  The Prediction Machine
  4.1  Introduction
       4.1.1  Input Model
       4.1.2  FORCE Learning
       4.1.3  The Surprise Signal s(t)
  4.2  Approach to Finding u(t)
  4.3  Using z(t) as a Teacher Signal for f(t) Reduces the Surprise Signal
  4.4  Avoiding the z(t) = 0 Solution
  4.5  Example of the Extraction of a Predictable Signal
  4.6  Conclusion

Bibliography
List of Figures

1.1  Liquid State Machine / Echostate Network
1.2  Mackey Glass Implemented in an ESN with Feedback
1.3  Problems with Batch Training ESNs
1.4  Back Propagation Through Time
2.1  Network Architectures
2.2  FORCE Learning I
2.3  Principal Component Analysis of Network Activity
2.4  Examples of FORCE Learning II
2.5  Chaos Is Useful
2.6  Feedback Variants
2.7  Multiple Pattern Generation and 4-Bit Memory
2.8  A Network that Generates Both Running and Walking Human Motions
3.1  Network Architecture with Separate Control Unit
3.2  Motivation for Internal Learning
4.1  Predictor Machine Network Architecture
4.2  Input to Prediction Machine
4.3  Predictor Network Activity Before Training
4.4  The Predictable Signal Found by the Network
4.5  Sample Firing Rates of Predictor Network After Learning
Acknowledgments

I would like to acknowledge my family, which has helped me achieve the goal of completing my PhD. It's been a long road and in particular I would like to thank my wife Robin Sussillo, who was always extremely supportive. I would like to thank my aunt and uncle Mary and Elliot Zeisel for constant emotional and epicurean support. Thanks to Greg Gordon for being a brother as well as my best friend. Thanks also to JD Litster for helpful mentoring discussions about what getting a PhD even means. I would like to thank Nick and Sonnie Sussillo. I would also like to thank Maria Delaney for all those great summers where I was able to relax enough so that going back to school didn't seem unbearable. I would like to thank my sister, who was important in the early years and always acted like a little mother. I would also like to thank Dr. Laquercia for helping me to understand myself and unlock my potential. I would like to acknowledge Dr. Rafa Yuste, in whose lab I initially cut my teeth. I would also like to acknowledge Dr. Wolfgang Maass, with whom I wrote my first paper in neuroscience. I would also like to thank the department chair John Koester, who was instrumental in helping me find a "soft landing" when things got rough. The quality of the group is always reflective of the qualities of the leader. Most of all, I would like to whole-heartedly thank Dr. Larry Abbott, who agreed to let me do a rotation in his lab "sight-unseen." Larry taught me an incredible amount about research and allowed me to choose my own research interests. The rest, as they say, is history. :)
This work is dedicated to all those who have found their way through the storm and to those who were beacons along the way.
Chapter 1 Introduction

1.1 A Brief History
The study of recurrent neural networks (RNNs) has a long and rich history. In particular, the question of how to successfully train an RNN to perform some specified task has been a topic of intense study, due largely to the RNN's architectural similarity to the brain, which is connected in an extremely recurrent fashion. Further, if one can successfully train the synaptic strengths of the neurons, the recurrent neural network is an extremely general computational device and can approximate a wide range of continuous time dynamical systems (Doya, 1993). The modern study of training RNNs probably took off with the developments of Hopfield (Hopfield, 1982), (Hopfield, 1984) and his reformulation of simple neural networks as spin glasses.
In this early work Hopfield showed how to embed a
large number of fixed point attractors into the network by setting the strengths of synapses of the recurrent network to specific values. The synaptic strengths of the network were symmetric, and thus the system's evolution could be viewed as a ball
moving in an energy landscape. This meant that a lot of familiar machinery could be brought to bear on the problem. During this period, the computational neuroscience and machine learning communities were largely one and the same. The back-prop algorithm (Rumelhart et al., 1985) invigorated the community by providing a method to train multi-layer perceptrons (feed-forward networks), which could solve complicated problems as varied as digit recognition in zip codes (LeCun et al., 1989), diagnosing medical conditions such as lower back pain (Bounds et al., 1988), and damage detection in bridge structures (Pandey and Barai, 1995). Methods for training recurrent networks such as Real-Time Recurrent Learning (Williams and Zipser, 1989a) and Back-Propagation Through Time (Rumelhart et al., 1985), (Werbos, 1990) were also developed, although the widespread usage of RNNs did not take off in the same way as multi-layer perceptrons. Instead, most learning problems that involved time, for example speech recognition, were solved by other methods, such as Hidden Markov Models (Rabiner, 1989). Another difficulty for training RNNs was found by the authors of (Bengio et al., 1994). This problem was called the vanishing gradient and it called into question the strategic validity of methods like Real-Time Recurrent Learning and Back-Propagation Through Time. During the 1990s, the happy union of the computational neuroscience and machine learning communities suffered another major blow in the form of the Support Vector Machine (Cortes and Vapnik, 1995). Support Vector Machines were developed initially for classification tasks as an alternative to multi-layer perceptrons. While multi-layer perceptrons gave good performance for many tasks, they suffered from local minima, a situation made more problematic due to the large
number of parameters in these models. SVMs were based on perceptrons and were globally optimal and also provided guarantees on classification success, a theory known as structural risk minimization. SVMs inspired a new line of research, largely subsumed under the umbrella of "kernel methods", that had very little to do with neuroscience. It's safe to say that by 2000, if not earlier, the computational neuroscience and machine learning communities had become quite separate. The core of this division was a difference in priorities about what the research should yield. The machine learning researchers decided that performance of algorithms as implemented on computers is what mattered, regardless of any application to understanding the brain or methodological inspiration taken from it. The computational neuroscientists stayed focused on developing ideas and methods for neuroscience. More recently, there have been some attempts to bridge this gap, largely based on new developments in recurrent neural network research. Examples are the similar ideas of the Liquid State Machine (Maass et al., 2002) and the Echostate Network (Jaeger, 2001), which were independently developed. The Echostate approach was given even more power based on trained feedback loops in (Jaeger and Haas, 2004b). In another line of research, the authors of (Hinton and Salakhutdinov, 2006) developed a method called deep belief networks for reducing the dimensionality of data. These networks are based on (Restricted) Boltzmann Machines (Ackley et al., 1985) and are very similar to the Hopfield network. Further examples of this line of research include (Taylor et al., 2007), which was used to generate high-dimensional human motion capture data. We present this work in the hopes of again attempting to bridge the gap between the computational neuroscience and machine learning communities. We develop a method for training chaotic recurrent networks, which may have
applications in both fields. The learning method we develop, called FORCE learning, may have applications in machine learning, but also provides synapse electrophysiologists with food for thought. Finally, we develop models of motor pattern generation that may be useful in understanding the problem of volitional movement¹ from a biological point of view. More specifically, in this dissertation we are interested in training RNNs to produce activity such as periodic or aperiodic transient trajectories. Further, we would like to train RNNs in the face of ongoing, chaotic spontaneous activity in the network. We focus primarily on the problem of pattern generation without input to the circuit, that is, to create an RNN that will produce a specific temporal pattern such as a sine wave, e.g. f(t) = sin(ωt). We focus on network-generated dynamics because tasks that involve circuit input are often more easily solved. This is because if an input drives the network, it may not be necessary to address issues of network stability. Imagine, for example, a network driven with a strong input. If we create a learning task that is input dependent, then we want the network to learn some target function of the input, based on the input dynamics. After training, the network may produce the desired output exactly, but most likely it will not. Typically the network will generate f(t) + e(t), where e(t) is some small error. Because the network is driven by input, the chances are small that the recurrent network will compound the error. But if we imagine training the network to learn some input independent task then the appropriate dynamics within the network will have to be created, sustained and stabilized by the network parameters. After training, the network will have not learned f(t) exactly but rather f(t) + e(t). Since the network is recurrent it is likely that the network will compound the initially small error with every time step.

¹As opposed to reflex.
If this happens, the error may eventually grow so large that it completely changes the dynamics of the network such that nothing close to f(t) is generated. Thus the crucial problem of learning stable solutions while training RNNs is better studied in the absence of input. Nevertheless, input dependent learning problems are generally of interest so we will also show examples where f(t) is input dependent.
1.2 Overview of Dissertation
In this Introduction we formally introduce the problem, specifically, training chaotic recurrent neural networks to generate interesting outputs, either in the context of pattern generation, or as complex functions of input. We review the work that most immediately informs our own, the Liquid State Machine and Echostate Networks and describe the learning techniques used with these ideas. Reviewing these methods brings up a difficult and well-known problem in recurrent neural network training research, that of accounting for the indirect effects of changing synaptic weights in a recurrent network, that is, the effects of changing synaptic strengths that propagate through the network. We review the two long-standing methods of Real-Time Recurrent Learning and Back-Propagation Through Time in order to understand how this problem is traditionally solved. We describe problems with these methods when applied to chaotic networks. A review of this material sets a meaningful context for understanding our solutions to training chaotic, recurrent neural networks.
Chapter 2 presents the main body of research described in this dissertation. The main neuroscientific problem that we address is the application of chaotic recurrent neural networks, with ongoing, spontaneous activity, to motor pattern
generation. We describe a novel learning paradigm, which we call FORCE learning, that enables us to train this kind of network. We present many examples demonstrating the power of FORCE learning, including modeling complex, high-dimensional, aperiodic human motion obtained from motion capture technology. In Chapter 3 we give a mathematical explanation for a specific example we describe in Chapter 2. In particular, in Chapter 2 we use the FORCE learning method to modify the synaptic strengths of neurons internal to the network, even though the FORCE method is described only for training the weights of readout units. We derive the mathematics behind so-called internal FORCE learning. In the final chapter, Chapter 4, we change gears and explain a novel idea about sensory processing that utilizes the machinery we develop in Chapter 2. Specifically, we describe a neural network model that embodies the idea of perception as a form of modeling. A recurrent network receives extremely complex, multidimensional input and the task is to find the predictable elements of the input in an unsupervised manner, if any predictable signal exists. The key idea is to use FORCE learning to generate a signal that indicates the level of surprise of the network with respect to the current values of the input stream. Using this surprise signal, the network is able to extract predictable components of the input stream, despite the input stream being composed of, in addition to the predictable signal, an inherently unpredictable, high-dimensional, chaotic process and random noise.
1.3 The Network and Problem Statement
In this work we are interested in studying the dynamics of RNNs and also developing learning algorithms for RNNs that are defined by the following equations

\tau \dot{x}_i(t) = -x_i(t) + g \sum_{k=1}^{N} J_{ik} r_k(t) + \sum_{k=1}^{M} B_{ik} u_k(t) + w_{fi} z(t), \qquad (1.0)

where

r_i(t) = \tanh(x_i(t)), \qquad z(t) = \sum_{k=1}^{N} w_{ok} r_k(t).
The system of equations (1.0) is an RNN with N neurons. The x_i(t) are called the activations and r_i(t) the firing rates. Neuron i receives input from other neurons in the network with firing rates r_k(t) via synapses with strengths J_ik, as well as the M-dimensional input u_k(t) via synapses with strengths B_ik. The scalar g provides a global scaling of the recurrent synaptic weights and plays a role in setting the operating regime of the network. The neuronal time constant is τ and just sets the scale for time. In the following, we set τ equal to unity to simplify the formulas. Finally, z(t) provides a linear readout of the network state. The output unit receives input from the network via readout weights w_ok and sends inputs back into the network via synapses with strengths w_fi. For ease of explanation, we restrict ourselves to networks with one readout unit.² Note that typically the firing rates r(t) are considered the outputs of the neurons, but in the case of z(t), the activation
²Additional outputs present no real difficulties.
is used as the output instead of the firing rate, i.e. there is no nonlinearity like the tanh function. On occasion, and in all computer simulations, we analyze the network behavior of specific learning rules with the finite difference approximation to equation (1.0),
x_i(t + \Delta t) = (1 - \Delta t)\, x_i(t) + \Delta t \left( g \sum_{k=1}^{N} J_{ik} r_k(t) + \sum_{k=1}^{M} B_{ik} u_k(t) + w_{fi} z(t) \right), \qquad (1.1)
where Δt is the increment of time. In general, we require during training that the network learn to generate some target signal that may be both a function of time and of the inputs to the network, and we denote this function by f(t). The goal of learning is to minimize the error objective function
E(T) = \frac{1}{2} \int_0^T \big( z(t) - f(t) \big)^2 \, dt
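To make the network definition concrete, the following minimal sketch integrates equation (1.1) together with the linear readout. It is an illustrative implementation only; the network size, time step, connection statistics, and random seed are assumptions chosen for this example and are not the values given in the Methods.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 1000, 1            # number of neurons and input dimension (assumed)
dt = 0.1                  # time step Delta t, in units of tau (tau = 1)
g = 1.5                   # global scaling of the recurrent weights

J = rng.standard_normal((N, N)) / np.sqrt(N)   # recurrent strengths J_ik (assumed statistics)
B = rng.standard_normal((N, M))                # input strengths B_ik
w_f = 2.0 * rng.random(N) - 1.0                # feedback strengths w_fi
w_o = np.zeros(N)                              # readout weights w_ok (to be trained)

x = 0.5 * rng.standard_normal(N)               # activations x_i(0)
z = 0.0                                        # readout z(0)

def step(x, z, u):
    """One Euler step of equation (1.1), followed by the readout z(t)."""
    r = np.tanh(x)
    x_new = (1.0 - dt) * x + dt * (g * (J @ r) + B @ u + w_f * z)
    r_new = np.tanh(x_new)
    return x_new, r_new, w_o @ r_new

for _ in range(2000):                          # free-running network, no input
    x, r, z = step(x, z, np.zeros(M))
```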
1. Most importantly, there are no fixed points or periodic attractors on which to train the network via gradient descent when g > 1. Thus for large systems with g > 1, RTRL and BPTT will not work. While the situation appears grim, it is not an impossible one. Often the solution to parameter bifurcations for gradient descent methods that train networks involves teacher forcing. As stated earlier, teacher forcing means that the learning algorithm utilizes the actual value of the teacher signal f(t) in place of the value z(t) computed by the network. For example, in recurrent networks, f(t) is fed into the network via the synapses of the output neuron instead of z(t). Thus the network is "forced" with the signal f(t) as if it were injected as input via the output neuron's feedback synapses. It is well known that teacher forcing is necessary to train sinusoidal oscillations in both RTRL and BPTT. Of course it's not clear how a real biological network would implement teacher forcing. We've already discussed teacher forcing in the context of training Echostate Networks with feedback and we will come back to it in our presentation of the main results. With the use of teacher forcing, the difference between the RTRL and BPTT approaches and the ESN approach starts to blur. This is especially true for our particular network definition since teacher forcing removes the loop that requires
the computation of gradients through time. However, with the RTRL and BPTT approaches the ambitions of training are much greater, and these two rules differ in three significant ways from the ESN approach. The first difference is that the ESN architecture is chosen such that the values of the internal synaptic strengths gJ_ij are appropriately scaled so that a randomly connected network is a sufficiently powerful nonlinear filter. Thus, from the ESN/LSM perspective, there is no need, and thus no attempt, to train the internal synaptic strengths. Second, in RTRL and BPTT the error gradients are computed across all time, which means that the effect of the value x_k(t') on the objective function at time t, E(t), is evaluated for all t' < t. The ESN approach evaluates only the effect of x(t − Δt) on E(t) because teacher forcing breaks the loop and thus there is no need to compute gradients farther back in time. Finally, since the goal of the RTRL and BPTT algorithms typically is to change all J_ij (although not in our case), teacher forcing doesn't eliminate the difficult learning problem with these algorithms as it does with ESNs. So the trade-off with the ESN approach is to replace potentially powerful learning techniques that are difficult to implement and time consuming with potentially less powerful learning methods that are extremely convenient and quick. Specifically, the trade-off is to forgo training all the J_ij and focus on the readout weights, even if training only the readout weights is less powerful. It should be noted that a rank-one matrix update such as the addition of w_f w_o^T to J is nevertheless quite powerful. It is a well-known fact in the engineering community that a rank-one addition to a full rank matrix can determine every eigenvalue of the combined matrix, which provides the mathematical underpinning for many of the manipulations utilized in control theory.
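The effect of such a rank-one feedback term on the spectrum of the effective connectivity can be checked numerically. The short sketch below simply compares the largest-magnitude eigenvalue of a random matrix gJ before and after adding the outer product w_f w_o^T; the matrix size and weight statistics are assumptions made for illustration, and the example only shows that the spectrum moves, not the full eigenvalue-placement result from control theory.

```python
import numpy as np

rng = np.random.default_rng(1)
N, g = 500, 1.5                                   # assumed size and gain

J = g * rng.standard_normal((N, N)) / np.sqrt(N)  # random recurrent matrix gJ
w_f = 2.0 * rng.random(N) - 1.0                   # fixed feedback weights
w_o = rng.standard_normal(N) / np.sqrt(N)         # readout weights (as if learned)

J_eff = J + np.outer(w_f, w_o)                    # rank-one addition w_f w_o^T

print("largest |eigenvalue| of gJ   :", np.abs(np.linalg.eigvals(J)).max())
print("largest |eigenvalue| of J_eff:", np.abs(np.linalg.eigvals(J_eff)).max())
```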
Chapter 2 Generating Coherent Patterns of Activity from Chaotic Neural Networks¹

Neural circuits typically display a rich set of spontaneous activity patterns, and also exhibit complex activity when responding to a stimulus or generating a motor output. How are these two forms of activity related? We develop a procedure we call FORCE learning for modifying feedback loops either external to or within a model neural network to change chaotic spontaneous activity into a wide variety of desired activity patterns. FORCE learning, which involves error-directed modifications of synaptic strengths, works even though the networks we train are chaotic and, in contrast to earlier proposals, we leave all feedback loops intact and all neurons unconstrained during learning. Using this approach, we construct networks that produce a wide variety of complex output patterns, input-output transformations
¹L.F. Abbott was a coauthor on this chapter.
that require memory, multiple outputs that can be switched by control inputs, and motor patterns matching human motion capture data. Our results reproduce recent data on pre-movement activity in motor and premotor cortex, and suggest that synaptic plasticity may be a more rapid and powerful modulator of network activity than generally appreciated.
2.1 Introduction
When we voluntarily move a limb or perform some other motor action, what is the source of the neural activity that initiates and carries out this behavior? We explore the idea that such actions arise from the reorganization of spontaneous neural activity. This hypothesis raises another question: How can apparently chaotic spontaneous activity be reorganized into the coherent patterns required to generate controlled actions? Following, but modifying and extending, earlier work (Jaeger and Haas, 2004b), (Maass et al., 2007), we show how external feedback loops can be used to alter the chaotic activity of a recurrently connected neural network and generate complex but controlled patterns of activity. Two features of neural circuitry make it difficult to figure out how to mold spontaneous patterns of activity into something useful for behavior. First, the synaptic connectivity in neural circuits is highly recurrent. Researchers in the machine learning and computer vision communities have developed powerful methods for training artificial neural networks to perform complex tasks (Rumelhart and McClelland, 1988), (Hinton and Salakhutdinov, 2006), but these apply predominantly to networks with feedforward architectures. Learning procedures have also been developed for recurrently connected neural networks (Rumelhart et al., 1985), (Williams and Zipser, 1989b), (Pearlmutter, 1989), (Atiya and Parlos, 2000), but these are more computationally demanding and difficult to use than feedforward learning algorithms, and there are fundamental limitations to their applicability (Doya, 1993). In particular, these algorithms will not converge if applied to recurrent neural networks with chaotic activity. This limitation is severe because models of spontaneously active neural circuits typically exhibit chaotic dynamics. For
example, spiking models of spontaneous activity in cortical circuits (van Vreeswijk and Sompolinsky, 1996), (Amit and Brunel, 1997), (Brunel, 2000), which can generate remarkably realistic patterns of activity, and the analogous spontaneously active firing-rate model networks that we use here (Sompolinsky et al., 1988) have chaotic dynamics. To surmount the first of these problems, learning in recurrent networks, we borrow an idea developed to train so-called echo-state networks (Jaeger and Haas, 2004b). In this approach, an external feedback loop is added to carry activity from a network readout unit back into the network (figure 2.1A), and synaptic modifications are restricted to the connections that drive this readout unit (red lines in figure 2.1A). Otherwise, the original neural network remains unmodified. This restriction simplifies the learning process, but it does not, by itself, solve the problems associated with learning in recurrent networks. To get around these, Jaeger and Haas (Jaeger and Haas, 2004b) proposed breaking the external feedback loop during training and feeding an input equal to the desired network readout into the cut feedback pathway while modifying the readout connections, which is called teacher forcing. The feedback loop is only restored after training is completed. This procedure allows learning to take place in a purely feedforward context, greatly simplifying training, but it is not biologically plausible. Here, we show how learning can be performed while leaving all feedback loops intact. In our implementation, the desired output is not injected into the network, but only serves to generate an error signal that guides modifications of the readout connections. Keeping the feedback loop intact to satisfy biological plausibility adds a significant complication because it means we must solve the problem of learning in a truly recurrent network. However, once this is done, we find that imposing this biological constraint actually results in
a learning procedure that works significantly better than when the feedback loop is cut. In addition, leaving the feedback loop intact allows us to explore variants of the basic architecture in which the feedback loop and readout pathways are separated (figure 2.1B) or in which learning takes place within the network (figure 2.1C), both of which may be more biologically plausible. Neither the original formulation with a cut feedback loop, nor our modification with an intact loop, can work if the network is in a chaotic state during training. Thus, in addition to reconfiguring the network to perform a desired function, the learning procedure we propose must control the chaotic activity present in the network prior to training. As we will show, the solutions to learning in a recurrent network and to learning in a chaotic network are one and the same: synaptic modifications must be strong and rapid during the initial phases of training. Because of this, the learning procedure we propose is quite different from traditional training in neural networks. Usually, training consists of performing a sequence of modifications that slowly reduce the size of an error quantifying the difference between the actual and desired network outputs. In our case, this error is always small, even from the beginning of the learning process. As a result, the goal of training is not to reduce the error significantly. Instead, the goal is to reduce the amount of modification needed to keep the error small until, by the end of the training period, modification is no longer required at all and the network can generate the desired output autonomously. For reasons given in the Results, we call this approach FORCE learning. From a machine learning point of view, the FORCE procedure we propose provides a powerful algorithm for constructing recurrent neural networks that generate complex and controllable patterns of activity. From a biological perspective,
it can be viewed either as a model for the development and modification through training of motor circuits or, more conservatively, as a method for building models of these circuits for further study. In the former context, we present a more biologically plausible version of the FORCE algorithm that suggests that synaptic modifications may be more rapid and of larger magnitude than is normally assumed. Within the latter, we present a highly efficient though more complex algorithm and show how these circuits, once developed, can be controlled and switched between different output patterns, which allows us to make contact with experimental data. Either way, our approach introduces a novel way to think about learning in neural networks.
2.2 Results
The recurrent network that forms the basis of our studies is a conventional model in which the outputs of individual neurons are characterized by firing rates and neurons are sparsely interconnected through excitatory and inhibitory synapses of various strengths (see Methods). Following ideas developed in the context of liquid-state (Maass et al., 2002) and echo-state (Jaeger, 2001) models, we assume that this basic network is not designed for any specific task, but is instead a general purpose dynamical system that will be co-opted for a particular task by subsequent modifications. As a result, the connectivity and synaptic strengths of the network are chosen randomly (see Methods). For the parameters we use, the initial state of
the network is chaotic (figure 2.2A). To specify a task for this network, we must have a way of defining its output. In a full motor model, this would involve simulating activity all the way out to
the periphery. In the absence of such a complete model, we need to have a way of describing what the network is "doing", and here we follow another suggestion from the liquid- and echo-state approach (Maass et al., 2002), (Jaeger, 2001). We define the network output through a weighted sum of its activities. Denoting the activities of the network neurons at time t by assembling them into a column vector r(t) and the weights connecting these neurons to the output by another column vector w(t), we define the network output as

z(t) = w^T(t)\, r(t). \qquad (2.1)
Multiple readouts can be defined in a similar manner, each with its own set of weights, but we restrict the discussion to one readout unit at this point. This readout is a linear function of the network activities not because the motor pathways from the modeled circuit to the periphery are assumed to be linear (of course they are not), but because using a linear readout allows us to define what is being generated by the network itself. If we defined the network readout using a complex dynamic transformation of its activities, it would be difficult to determine whether a result being reported arose from the network or from the readout transformation. Thus, as proposed previously (Maass et al., 2002), (Jaeger, 2001), a linear readout is a useful way of defining what we mean by the output of a network. It must be kept in mind, however, that the linear readout unit depicted in figure 2.1 is a convenient computational tool that acts as a stand-in for complex transduction circuitry. For this reason, we refer to the output-generating element as a unit rather than a neuron, and we call the components of w weights rather than synaptic strengths. We leave a discussion of the biological substrate for these and other network elements to the
Discussion. Having specified the network output, we can now define the task we want the network to perform, which is to set z(t) = f(t) for a pre-defined target function f(t).
In the initial instantiation of our model (figure 2.1A), we follow Jaeger and
Haas (2004) and do this by modifying only the output weight vector w (figure 2.1A, red synapses). All other network connections are left unchanged from their initial, randomly chosen values (Methods). The critical element making such a procedure possible is a feedback loop that carries the output z back into the network (figure 2.1A). The weights of the synapses from this loop onto the neurons of the network are chosen randomly and are left unmodified (Methods). With the readout weight vector w also chosen randomly and the network in its initial chaotic state, the output z is a highly irregular and non-repeating function of time (figure 2.2A).
Figure 2.1: Network Architectures. A) The standard network has two parts: a recurrent generator network with firing rates r and a linear readout unit with output z. Only the readout weights w (red) connecting the generator network to the readout unit are modified during training. B) A network with feedback and readout separated. The readout unit with output z receives input from the generator network via modifiable readout weights w
(red). Neurons of the feedback network are recurrently connected and receive input from the generator network through synapses of strength J (red). These synapses are also modified during training. C) A network with no external feedback. Instead, feedback is generated within the network and modified by applying FORCE learning to the synapses with strengths J^GG internal to the network (red). The readout weights w (red) are also modified during training.
2.2.1 Avoiding Indirect Effects and Controlling Chaos
Training in the presence of a recurrent loop, such as the loop connecting the output in figure 2.1A back into the network, is challenging because modifying the readout weights produces indirect effects that can be difficult to calculate. Consider what happens if we modify the weights w that determine the output z through equation 2.1. Modifying w has the direct effect of changing how z depends on r, and it is easy to determine from this direct effect how to change w to make z closer to the desired output f. This is exactly the sort of gradient descent (on the error (z − f)²) learning done in feedforward networks. However, in a recurrent network, there is an additional indirect or delayed effect that greatly complicates learning. The indirect effect arises when we modify w and the resulting change in the output z propagates along the feedback pathway and through the network, resulting in a change in the network activities denoted by r. Because of this delayed effect, a weight modification that at first appears to bring z closer to f may later cause it to deviate away. Indirect effects are much more difficult to compute than direct effects. Although techniques have been developed to account for indirect effects (Rumelhart et al., 1985), (Williams and Zipser, 1989b), (Pearlmutter, 1989), (Atiya and Parlos, 2000), these are computationally challenging and sometimes fail to work. Of particular relevance to us, these techniques cannot be applied when network activity is chaotic. As stated in the Introduction, (Jaeger and Haas, 2004b) eliminated the problem of indirect effects by cutting the feedback loop during learning. We take another approach. Suppose that during learning, the output can be kept close to its target value, that is, z(t) = f(t) + e(t), where e is small, or even zero. The indirect effect discussed in the previous paragraph is of order e so, if the direct effect that drives
modification of the readout weights is significantly larger than e, indirect effects can be ignored. This is our strategy. As shown below, e can be held to a small value during learning while a larger "first-order" direct term controls weight modification. Because of this and because the method requires tight control of a small error, we call it First-Order, Reduced and Controlled Error or FORCE learning. In addition to dealing with recurrent effects, we must control the chaotic activity of the network during the learning process. Learning in the presence of chaos is impossible, because the exponential sensitivity of chaotic dynamical systems introduces instabilities in the computation of the gradients that must be followed to reduce output errors (Abarbanel et al., 2008). In this regard, it is important to note that the network we are considering is being driven by the signal f(t) through the feedback pathway. Recent work has shown that such an input can induce a transition between chaotic and non-chaotic states (Rajan et al., 2008), see also (Molgedey et al., 1992). This is how the problem of chaotic activity can be avoided. Provided that f(t) is sufficient to induce a transition to a non-chaotic state (the required properties are discussed in (Rajan et al., 2008)), learning can take place in the absence of chaotic activity, even though the network is chaotic prior to learning. Before we derive the specific modification procedures we propose and use, it is instructive to consider an example of FORCE learning so that its unconventional aspects can be pointed out. This example illustrates how the activity of an initially chaotic network can be modified so that it ends up producing a periodic, triangle-wave output autonomously. Initially, with the output weight vector w chosen randomly, the neurons in the network exhibit chaotic spontaneous activity (blue traces in figure 2.2A), as does the network output (red traces in figure 2.2A). When we start the learning procedure (described below), the weights of the readout
connections begin to fluctuate rapidly (orange trace in figure 2.2B). These fluctuations immediately force the output to match the target triangle wave (red trace in figure 2.2B), and they change the activity of the network so that it is periodic rather than chaotic (blue traces in figure 2.2B). The activity of the network neurons is periodic because they are being driven by feedback from the readout unit that is an excellent (though not perfect) approximation of the triangle wave, and this periodic drive switches the network from a chaotic to a periodic state (Rajan et al., 2008). The progression of learning can be tracked by monitoring the size of the fluctuations in the readout weights, which diminish over time (orange trace in figure 2.2B). As learning proceeds, the rapid fluctuations of the readout weights subside (orange trace in figure 2.2B) because the learning procedure establishes a set of static weights that generate the output that originally arose from dynamic weights. In other words, during learning the network takes more and more responsibility for generating the target function itself. Once the magnitude of the time-derivative of the readout weights is sufficiently small, the learning procedure can be turned off, and the network continues to generate the triangle wave output on its own (figure 2.2C). At this point, the readout weights are fixed, and the network will continue to generate the triangle wave indefinitely. As seen in figure 2.2B, the learning process is extremely rapid, taking only four cycles of the triangle wave in this example to set the weights to appropriate values. Having demonstrated how FORCE learning works, we now address the specific form of the weight modifications we use.
Figure 2.2: FORCE Learning I. A) Before training the network activity is chaotic. Network output, z, is in red, the firing rates of 10 sample neurons from the network are in blue and the orange trace is the magnitude of the time derivative of the readout weight vector. B) During learning, the output (red) matches the target function, in this case a triangle wave, and the network activity (blue traces) is periodic. The readout weights rapidly fluctuate (orange trace), but these fluctuations subside as learning progresses. C) After training, the network output (red) matches the target. The network activity (blue traces) is the same as during learning, but this is achieved autonomously with static readout weights (orange). D) An example of FORCE Learning I. Activity is initially chaotic (left of vertical line in left panel). Learning lasted 600 s (right of vertical line in left panel), after which (right panel) the network output (red) matches the target function with static readout weights (orange). E) The learning rate during the example of D.
2.2.2 FORCE Learning I
We present two versions of the FORCE learning weight modifications. The one presented in this section is simpler and more biologically plausible, but can only learn relatively simple patterns, whereas the other, presented later, is more efficient and powerful. Dividing time into small steps of size Δt, we introduce the quantity

e_-(t) = w^T(t - \Delta t)\, r(t) - f(t), \qquad (2.2)
which is the deviation of the true output from its desired value if z is calculated using the weights from the previous rather than the current time step, that is, using w(t − Δt) rather than w(t). Computationally, this means that the difference between the output and its target at each time step is evaluated after the network is updated but before the readout weights are modified. As a result, e_-(t) is what the deviation would be if the weights were not updated on that time step, and thus it can be used to determine if time-independent weights that generate the function f have been found. The first form of readout weight modification is the update rule
w(t) = w(t - \Delta t) - \eta(t)\, e_-(t)\, r(t), \qquad (2.3)
where η(t) is a time-dependent learning rate. This equation implies that each weight is modified by an amount that depends on the product of its input firing rate, the deviation e_-(t), and the learning rate. If we ignore the effects of feedback, this modification rule corresponds to gradient descent on the error e_-² with a time-dependent learning rate. The only unconventional aspect of this modification rule is that we have ignored the effects of feedback along the pathway from the output
back to the network in figure 2.1A. How is this justified? Recall from the above discussion that the indirect effects of feedback can be ignored if the deviation driving weight modification, which is e_-(t) in equation 2.3, is larger in magnitude than the error in the current output, e(t) = z(t) − f(t). A simple calculation shows that, when equation 2.3 is implemented, the output z(t) = w^T(t) r(t) is

z(t) = f(t) + e_-(t)\left( 1 - \eta\, r^T(t)\, r(t) \right). \qquad (2.4)
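A minimal sketch of one step of this first update rule (equations 2.2 and 2.3) is given below. The function and variable names are placeholders, and in the full model r(t) would come from the running network of figure 2.1A rather than being passed in.

```python
import numpy as np

def force_one_step(w, r, f_t, eta):
    """One step of the first FORCE rule.

    w    : readout weights w(t - dt)
    r    : firing-rate vector r(t)
    f_t  : target value f(t)
    eta  : time-dependent learning rate eta(t)
    """
    e_minus = w @ r - f_t            # equation 2.2: deviation using the old weights
    w_new = w - eta * e_minus * r    # equation 2.3: weight update
    e_new = w_new @ r - f_t          # residual error; equals e_minus * (1 - eta * r @ r)
    return w_new, e_minus, e_new
```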
From this result, it is clear that e(t), which is the second term in equation 2.4, will be much less than e_-(t) if 1 − η r^T(t) r(t)

P(t) = P(t - \Delta t) - \frac{P(t - \Delta t)\, r(t)\, r^T(t)\, P(t - \Delta t)}{1 + r^T(t)\, P(t - \Delta t)\, r(t)}. \qquad (2.8)
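Because equations 2.6 and 2.7 are not reproduced above, the sketch below fills in the weight update with the standard recursive-least-squares form that, as noted next, this algorithm is equivalent to; treat the update line and the initialization P(0) = I/α as assumptions rather than a transcription of the text.

```python
import numpy as np

def force_two_step(w, P, r, f_t):
    """One step of the second (recursive-least-squares style) FORCE rule."""
    e_minus = w @ r - f_t                            # pre-update error, as in equation 2.2
    Pr = P @ r
    P_new = P - np.outer(Pr, Pr) / (1.0 + r @ Pr)    # equation 2.8
    w_new = w - e_minus * (P_new @ r)                # RLS-style weight update (assumed form)
    return w_new, P_new, e_minus

# Assumed initialization: P(0) = I / alpha, with alpha acting as a learning rate.
# N, alpha = 1000, 1.0
# P = np.eye(N) / alpha
# w = np.zeros(N)
```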
The algorithm given by equations 2.6-2.8 is equivalent to the recursive least squares algorithm (Haykin, 2002) applied to the problem of matching the output to the target if we ignore the effects of the feedback loop. Not only does this rule produce learning rates for the different principal component projections of w proportional to the inverse of their eigenvalues, it also, automatically, generates a gradually slowing learning rate. The reason for this is that P is proportional to the inverse of the correlation matrix at sufficiently long times and, in addition, the constant of proportionality decreases as one over the learning time, P ∝ C⁻¹/T for learning time T. Thus, learning automatically slows as P decreases with time and FORCE learning converges onto a solution. The update equation for z, analogous to equation 2.4 for the earlier form of FORCE learning, is
z(t) = f(t) + e_-(t)\left( 1 - r^T(t)\, P(t)\, r(t) \right).
This form of FORCE learning works because, as before, the second term on the right side of this equation (the factor e discussed above) can be made much smaller than e_-. Initially, r^T P r is close to 1, keeping e < e_- (see Methods). As learning
proceeds, both r^T P r and e get smaller until both approach zero as learning is completed.
Thus, this form of FORCE learning, like the previous version, can
maintain a reduced and controlled magnitude of error while the effective learning rate slowly reduces to zero. The parameter α, which acts as a learning rate, must be adjusted depending on the particular target function being learned. Large α results in fast learning but sometimes makes synaptic changes so rapid that the algorithm becomes unstable. In those cases, smaller α should be used, but if α is too small, the FORCE algorithm may not induce the target input into the trained neuron, causing learning to fail. In practice, values from 0.001 to 10 are effective, depending on the task. Figure 2.4 shows examples of networks generating various target functions after being modified by this version of FORCE learning. In all of these examples, the red
trace shows the output of the network after FORCE learning has been applied and then turned off, so the network produces the indicated output pattern autonomously and indefinitely. The examples in figure 2.4A-D and F are periodic functions, which are convenient for our purposes because they naturally repeat during the training procedure, and because it is easy to see, later on, whether the target function is being reproduced accurately. The examples in figure 2.4A and 2.4B are composed of a sum of 4 and 16 commensurate sinusoidal functions with different frequencies and phases, respectively.
For figure 2.4C, the same target function as in figure
2.4A was used, but it was corrupted by white noise (green trace in figure 2.4C). Even though the error signal guiding FORCE learning was corrupted by noise, the network ended up generating a close approximation of the noise-free target function (red trace in figure 2.4C). The network can be configured by FORCE learning to generate an
approximation of a discontinuous function, the square wave in figure 2.4D, although the discontinuity generates some ringing. This example shows that the network does not have to reproduce the target function exactly; it can produce an approximation of the target function if an exact reproduction is impossible. Such examples are difficult for learning with a cut feedback loop. The dynamic range of the outputs that chaotic networks can be trained to generate by FORCE learning is impressive. In figure 2.4F, we show two examples of a 1000 neuron network with a time constant of 10 ms producing sine waves, one with a period of 60 ms and the other with a period of 8 s. These are roughly the upper and lower limits of what we can achieve in the frequency domain, namely a range slightly greater than a factor of 100. Even though they are convenient examples, FORCE learning is not restricted to periodic functions.
Figure 2.4E shows an example in which a network was
trained to produce an output matching one of the dynamic variables of the three-dimensional chaotic Lorenz attractor (Methods). To illustrate a match between these two dynamical systems, we started them with matched initial conditions at the left side of the traces in figure 2.4E. Eventually, the network output (red trace) and the variable from the Lorenz model (green trace) diverge, as they must, because of the chaotic nature of the trace being reproduced, but the two models are similar for a significant amount of time (see also Jaeger and Haas, 2004). In addition, after the two traces diverge, the network still produces a trace that looks like it comes from the Lorenz model. Figure 2.4G shows another non-periodic example, the network producing a segment matching a one-shot, non-repeating target function (the portion of the red trace between the blue arrows). To produce such a one-shot sequence, the network must be initialized properly, and we do this by introducing a fixed-point attractor as
well as the network configuration that produces the one-shot sequence. We do this by adding a second linear readout unit to the network that also provides feedback (network diagram in figure 2.4G). The two readouts and feedback loops are similar except for different random choices for the strengths of the feedback synapses onto the network neurons. We allow readout unit 1 to take two possible states, one called active in which its output is determined by equation 2.1, and another called inactive in which the output is z = 0. We begin the training process by activating unit 1 and applying FORCE learning to the weights onto both readout units. The target function for unit 1 is simply the constant value 1. The target function for readout unit 2 is, at this point, the initial value of the one-shot sequence we ultimately want to learn (the value of the red trace at the left blue arrow in figure 2.4G). After a brief learning period, the weights attain a value that puts the entire network into a fixed point state, with constant activities for all the network neurons, a constant value for output 1, and a constant value for output 2 equal to the initial value of its target sequence. This state is henceforth achieved whenever both readout units are activated. We then train the network to produce the one-shot sequence by first activating and then deactivating unit 1, and FORCE training the readout weights of unit 2 with the desired one-shot target function. This two-step process (activating and inactivating unit 1 while modifying the weights of unit 2) is repeated until the network can produce the desired output in this way without requiring weight modification. From then on, the one-shot sequence can be generated by activating readout unit 1 to briefly put the network into a fixed-point state and to initialize unit 2, and then inactivating it to generate the sequence. After the learned sequence is completed, the output of unit 2 becomes chaotic, although occasionally the learned sequence
or pieces of it may be generated spontaneously. In all of the cases shown in Figure 2.4 and, indeed, in all our figures, the network is initially in a chaotic state and then is modified by FORCE learning to produce a desired output. Learning is dramatically faster than it is with the first form of FORCE learning (equation 2.3). Despite the fact that more complex target functions were used in figure 2.4 than in figure 2.2D, learning typically converged in about 1000τ, equivalent to about 10 s. Because of the advantages in both learning speed and the complexity of the functions that can be reproduced, we use this form of FORCE learning in the remaining examples of the paper (and we used it for figure 2.2A). In our experience, a network with around 1000 neurons can be trained through this form of FORCE learning to generate a remarkably varied range of complex temporal sequences. However, training fails in a chaotic network for target functions that are close to zero. As discussed above, FORCE learning, when activated, always generates a network output that closely matches the target function. In order to find time-independent weights that allow the network to produce the target function autonomously, however, FORCE learning must induce a transition in the network from chaotic to non-chaotic activity. As shown in Rajan et al. (2008), this requires an input to the network, through the feedback loop in our case, of sufficient amplitude. Figure 2.4H shows an example in which we tried to train a network to generate a low amplitude sine wave. Although, as the red trace to the right of the vertical line indicates, FORCE learning produces the desired output during training, the activity of the network neurons remains chaotic (blue traces in figure 2.4H). When this happens, FORCE learning fails to converge and, as soon as it is turned off, the network reverts to a chaotic output (similar to the red trace
to the left of the vertical line in figure 2.4H). There are a number of solutions to this problem. It is possible for the network to generate low amplitude oscillatory and non-oscillatory functions if these are displaced from zero by a constant shift, but not if they are centered near zero as in this example. Alternatively, the networks shown in figures 2.1B and C can be trained to generate low amplitude signals centered near zero.
2.2.5 The Advantage of Chaotic Spontaneous Activity
Training a network that starts with spontaneous chaotic activity presents some challenges, as we have discussed, but does it also offer advantages? To address this question, we used FORCE learning to generate the output function shown in figure 2.4A in networks that were either non-chaotic or that exhibited various degrees of chaotic activity prior to training. The recurrent synaptic connections in the networks we study are scaled by a factor g (Methods) that controls whether or not they produce chaotic activity spontaneously. Networks with g < 1 are inactive prior to training in the sense that the output of each neuron is constant and takes a zero value. For g > 1, these networks exhibit chaotic spontaneous activity (Sompolinsky et al., 1988), which gets more irregular and fluctuates more rapidly as g is increased beyond 1. Figure 2.5A shows the number of cycles that it takes to train a network to produce a periodic output as a function of g. The number of cycles required to train a network to generate a periodic target function drops dramatically as a function of g, continuing to fall as g gets larger than 1, within the chaotic region (figure 2.5A). The average root-mean-square (RMS) error, indicating the difference between the target function and the output of the network after FORCE learning, also decreases with g, especially in the region g < 1 (figure 2.5B).
Figure 2.4: Examples of FORCE Learning II. The red traces are network outputs after training with the network running autonomously. A) Periodic function composed from 4 sinusoids (red). B) A more complicated periodic output composed from 16 sinusoids. C) Periodic function of 4 sinusoids (red) learned from a very noisy target function (green). D) Network output after training with a square-wave target function. E) The Lorenz attractor. Initial conditions of the network and the target were synchronized at the beginning of the traces. The network readout is in red, and the target function in green. F) Fast and slow sine waves with periods of 60 ms and 8 s, for τ = 10 ms. G) A one-shot example using a network with two readout units (circuit inset). When activated, unit 1 creates the fixed point to the left of the left-most blue arrow, establishing the appropriate initial condition. Unit 2 then produces the sequence between the two blue arrows. Following the second blue arrow, when the sequence is concluded, the network output returns to being chaotic. H) A low amplitude target function for which the FORCE procedure does not control network chaos and learning fails.
Another measure of training success is the magnitude of the readout weight vector |w|. Large values of |w| indicate that the solution found by a learning process involves cancellations between large positive and negative contributions. Such solutions tend to be unstable and sensitive to noise. The magnitude of the weight vector falls as a function of g and takes minimum values of about 0.1 in the region g > 1 characterized by chaotic spontaneous activity. These results indicate that networks that are initially in a chaotic state are quicker to train and produce more accurate and robust outputs than non-chaotic networks. Learning works best when g > 1 and, in fact, fails in this example for networks with g < 0.75 (the lower end of the range in figure 2.5). This might suggest that the larger the g value the better, which is true to an extent, but there is an upper limit. Recall from figure 2.4H that FORCE learning does not work if feeding back the target function from the readout unit to the network fails to suppress the chaos in the network. For any given target function and set of feedback synaptic strengths, there is an upper limit for g beyond which chaos cannot be suppressed by FORCE learning. Indeed, the range of g values in Figure 2.5 terminates at g ≈ 1.6 because learning did not converge for higher g values due to this problem. Thus, the best value of g for a particular target function is at the "edge of chaos" (Bertschinger and Natschlager, 2004), that is, the g value just below the point where FORCE learning fails to suppress chaotic activity during training.
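As a rough numerical illustration of the role of g, the sketch below builds sparse random recurrent weights scaled so that their spectral radius is approximately g, which crosses 1 at the transition to chaotic spontaneous activity. The connection probability, network size, and seed are assumptions made for the example, not the values used in the Methods.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 1000, 0.1              # network size and connection probability (assumed)

def recurrent_weights(g):
    """Sparse random weights; nonzero entries ~ N(0, g^2/(p*N)), spectral radius ~ g."""
    mask = rng.random((N, N)) < p
    return g * (rng.standard_normal((N, N)) * mask) / np.sqrt(p * N)

for g in (0.8, 1.0, 1.5):
    radius = np.abs(np.linalg.eigvals(recurrent_weights(g))).max()
    print(f"g = {g:.1f}: spectral radius ~ {radius:.2f}")
```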
2.2.6 Biological Interpretation of Feedback Modification
As discussed above, the linear readout unit was introduced into the network model as a stand-in for a much more complex, un-modeled peripheral system, in order to define what we mean by "network output".
Figure 2.5: Chaos Is Useful. Networks with different g values (Methods) were trained to produce the output of figure 2.4A. All results are plotted against g in the range 0.75 < g < 1.56, and learning failed outside this range. A) Number of cycles of the periodic target function required for training. B) The RMS error of the network output after training. C) The length of the readout weight vector |w| after training.

The critical information provided by the readout unit is the error signal needed to guide weight modification, so its biological interpretation should be as a system that computes or estimates the deviation between an action generated by a network and the desired action. However, in the network configuration presented to this point (figure 2.1A), the readout unit, in addition to generating the error signal that guides learning, is also the source of feedback.
Given that the output in a biological system is actually the result of a large amount of nonlinear processing, and that feedback may have to travel a significant distance before returning to the network, we begin this section by examining the effect of introducing delays and nonlinear distortions along the feedback pathway from the readout unit to the network neurons (gray oval in figure 2.6A).

Distorted and Delayed Feedback: The FORCE learning scheme is remarkably robust to distortions and, to a somewhat lesser extent, delays introduced along the feedback pathway. In the example of figure 2.6A, the feedback signal at time $t$ was $1.3\tanh(\sin(\pi z(t - 100\ \mathrm{ms})))$ rather than $z(t)$ as in previous examples.
Even though this feedback signal (cyan trace in figure 2.6A) is completely different from z (red trace in figure 2.6A), the network learns and performs successfully. We have also introduced low-pass filtering into the feedback pathway with similar results. Nonlinear distortions of the feedback signal can be introduced without disrupting learning, unless they diminish the temporal fluctuations of the output to the point where chaos cannot be suppressed. Filtering can also be quite extreme before the network fails to learn. Delays can be more problematic if they are too long. The critical point is that FORCE learning works as long as the feedback is of an appropriate form to suppress the initial chaos in the network.

Separating Feedback from Output: Even allowing for distortion and delay, the feedback pathway, originating as it does from the linear readout unit, is an abstract element of the networks we have been considering. To address this problem, we consider two ways of separating the feedback pathway from the linear readout of the network and modeling it more realistically. The first is to provide feedback to the network through a second neural network (figure 2.1B) rather than via the readout unit. To distinguish the two networks, we call the original network, present in figure 2.1A, the generator network and this new network the feedback network. The feedback network has nonlinear, dynamic neurons identical to those of the generator network, and it may or may not be recurrently connected (in the example of figure 2.6B, it is). Each unit of the feedback network produces a distinct output that is fed back to a subset of neurons in the generator network. In the original formulation of the model (figure 2.1A), exactly the same signal was fed from the readout unit to all of the neurons of the generator network along a single feedback pathway. In this second model (figure 2.1B), the task of carrying feedback is more realistically shared across a number of neurons in the feedback network with
non-identical outputs. As a result, the strengths of these synapses can take more reasonable values than they did in the original scheme in which one readout unit provided all the feedback (figure 2.1A). When we include a feedback network (figure 2.1B), FORCE learning takes place both on the weights connecting the generator network to the readout unit (as in the architecture of figure 2.1A) and on the synapses connecting the generator network to the feedback network (red connections in figure 2.1B). Separating feedback from output would appear to introduce a credit-assignment problem, a classic dilemma in network learning, because changes to the synapses connecting the generator network to the feedback network do not have a direct effect on the output. Fortunately, there is a simple solution to this problem within the FORCE learning scheme; we treat every neuron subject to synaptic modification as if it were the output unit, even when it is not. In other words, we apply equations 2.6-2.8 to every synapse connecting the generator network to the feedback network (Methods), and we also apply them to the weights driving the output unit. When we modify synapses onto a particular neuron of the feedback network, the vector r in these equations is composed of the firing rates of generator network neurons presynaptic to that feedback neuron, and the weight vector w is replaced by the strengths of the synapses it receives from these presynaptic neurons. However, the same error term that originates from the readout (equation 2.2) is used in these equations whether they are applied to the readout weights of the output unit or to synapses onto neurons of the feedback network (for the modification equations see Methods). FORCE learning with a feedback network and independent readout unit can generate complex outputs similar to those shown in figure 2.4, although convergence
may be slower and parameters such as α (equation 2.7) may require more careful adjustment. The output (red trace) in the example of figure 2.6B matches the target function, but the activities of neurons in the feedback network (the blue traces show representative examples) do not, despite the fact that the readout weights and the synapses onto the feedback network neurons were modified by the same algorithm. This difference arises because the feedback network neurons receive input from each other as well as from the generator network, and these other inputs are not modified by the FORCE procedure. Differences between the activity of feedback network neurons and the output of the readout unit can also arise from different values of the synapses and the readout weights prior to learning. Recall from the example shown in figure 2.6A that it is not essential for the signal fed back to the generator network to be identical to the output of the readout unit. When we have a separate feedback network, the feedback to an individual neuron of the generator network is a random combination of the activities of a subset of feedback neurons, summed through random synaptic weights. While these sums bear a certain resemblance to the target function, they are not identical to it, nor are they identical for each neuron of the generator network. Nevertheless, FORCE learning works. This extends the result of figure 2.6A, showing not only that the feedback does not have to be identical to the network output, but that it does not even have to be identical for each neuron of the generator network. It also indicates that the neurons of the feedback pathway can be modeled in a much more reasonable way than in the network of figure 2.1A, indeed in the same way that neurons of the generator network are modeled. Why does this form of learning, in which every neuron with synapses being modified is treated as if it were producing the output, work? In the example of
figure 2.6B, the connections from the generator network to the readout unit and to the feedback network are sparse and random (Methods), so that neurons in the feedback network do not receive the same inputs from the generator network as the readout unit. However, suppose for a moment that each neuron of the feedback network, as well as the output unit, received synapses from all of the neurons of the generator network. In this case, the changes to the synapses onto the feedback neurons would induce a signal identical to the output into each neuron of the feedback network. This occurs, even though there is no direct connection between these two circuit elements, because the same learning rule is being applied in both cases. The induced input is not the entire input to the feedback neurons, because they are connected to each other as well. Nevertheless, it is sufficient to suppress any chaotic activity in the feedback network, and it causes the feedback neurons to send signals to the generator network that suppress its chaotic activity and drive learning. The explanation of why FORCE learning works in the feedback network when the connections from the generator network are sparse rather than all-to-all (as in figure 2.6B) relies on the accuracy of randomly sampling a large system. With sparse connectivity, each neuron samples only a subset of the activities within the full generator network, but if this sample is large enough, it provides an accurate representation of the leading principal components of the activity of the generator network that drive learning. This is enough information to allow learning to proceed.

Learning Within the Network: The original generator network we used (figure 2.1A) is recurrent and can produce its own feedback. Why then do we need separate feedback loops from the output (figure 2.1A) or from a feedback network
(figure 2.1B)? Why can't we apply FORCE learning to the synapses of the generator network itself, in the arrangement shown in figure 2.1C, to provide this feedback? The answer, as shown in figure 2.6C, is that we can. To implement FORCE learning within the generator network (Methods), we modify every synapse in that network using equations 2.6-2.8. To apply these equations, the vector w is replaced by the set of synapses onto a particular neuron being modified, and r is replaced by the vector formed from the firing rates of all the neurons presynaptic to that network neuron. As in the example of learning in the feedback network, FORCE learning is also applied to the readout weights, and the same error e, given by equation 2.2, is used for every synapse or weight being modified. FORCE learning within the network can produce the same output used in the other examples of figure 2.6 (red trace in figure 2.6C). Even though the same basic modification rule is used within the network and for the readout weights, the network neurons do not reproduce the target function in their outputs (blue traces in figure 2.6C). Nevertheless, the periodic activity they produce is sufficient to suppress chaotic activity and induce learning. An argument similar to that given for learning within the feedback network can be applied to FORCE learning for synapses within the generator network. Because all of the synapses onto the network neurons are modified in this case, whereas the recurrent connections in the feedback network were not modified in the previous example, it is easier to illustrate the argument given above to account for how FORCE learning works. The input induced by FORCE learning is plotted (cyan traces overlaid in figure 2.6C) for each of the 5 network neurons with activities shown in blue. The thickness of the cyan trace shows that these 5 traces differ from
each other, and thus from z, slightly, but it is clear that a good approximation of the output z is induced by FORCE learning in each of these neurons (compare the cyan and red traces in figure 2.6C).
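The bookkeeping described above is straightforward to express in code. The sketch below is only a schematic illustration under simplifying assumptions (a toy stand-in for the network rates and target, and arbitrary sizes): every trained unit keeps its own weight vector and running inverse-correlation matrix P, but all of them are updated with the single error e(t) computed at the readout.

```python
import numpy as np

def rls_update(w, P, r, e):
    """One FORCE/RLS step for a single trained unit: w and P are updated
    in place using the presynaptic rates r and the shared readout error e."""
    Pr = P @ r
    k = Pr / (1.0 + r @ Pr)          # gain vector
    P -= np.outer(k, Pr)             # rank-1 update of the running inverse correlation
    w -= e * k                       # the same error e is used for every trained unit

# Toy setup: a readout unit plus 3 'internal' units, each with its own random
# sample of the N firing rates (no recurrent dynamics are simulated here).
rng = np.random.default_rng(1)
N, n_pre, alpha = 200, 100, 1.0
units = []
for _ in range(4):                   # unit 0 = readout, units 1-3 = internal
    idx = rng.choice(N, size=n_pre, replace=False)
    units.append({"idx": idx, "w": np.zeros(n_pre), "P": np.eye(n_pre) / alpha})

for t in range(500):
    r_full = np.tanh(rng.standard_normal(N))   # stand-in for network rates
    f_t = np.sin(2 * np.pi * t / 100.0)        # stand-in target
    z = units[0]["w"] @ r_full[units[0]["idx"]]
    e = z - f_t                                # single error from the readout
    for u in units:                            # every trained unit uses the same e
        rls_update(u["w"], u["P"], r_full[u["idx"]], e)
```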
2.2.7 Switching Between Different Output Patterns
We have shown that FORCE learning can train networks to generate a wide variety of output functions, but in all of our examples only a single target function was trained and generated. We now explore the possibility of training networks to produce a variety of diverse functions from a single output, with a set of inputs controlling which of them is generated at any particular time. We do this by introducing control inputs to the network neurons (Methods) and pairing each desired output function with a particular input pattern. The exact values of the control inputs are arbitrary; they are chosen randomly. When a particular target function is being either trained or generated, the control inputs to the network are set to the corresponding input pattern and held there until a different output is desired. The control inputs do not supply any temporal information to the network, so all the target patterns are produced autonomously. Instead, they act solely as a switching signal to select a particular output function. To illustrate switching between multiple output patterns, we return to the simple feedback scheme shown in figure 2.1A because it is easier to train for such demanding tasks. To train the network, we simply use FORCE learning as before, but during training we step through the different control input patterns and simultaneously change the target f to the function corresponding to that input. Figure 2.7B shows a set of 10 outputs (all of these are periodic, but only one cycle of each is shown) learned by the network in response to different static input patterns. When
Figure 2.6: Feedback Variants. A) A network can be trained to produce the output shown by the red trace when its feedback (cyan trace) is delayed and distorted (gray oval in circuit diagram) to be very different from the output. B) FORCE learning with a distinct feedback network and output (circuit diagram). The network can be trained to produce the output shown as the red trace. Activity traces from 5 feedback neurons are shown in blue. C) A network with no feedback from the output unit (circuit diagram) in which the internal synapses are trained to produce the output shown in red. Activities of 5 representative network neurons are shown in blue. The thick cyan trace is an overlay of the input to each of these 5 neurons induced by FORCE learning.
learning is complete, the network can generate any one of these outputs when the appropriate pattern of control inputs is imposed, and it transitions between these outputs when the input pattern is switched.
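As an illustration of the control-input scheme, the snippet below, a sketch with assumed dimensions rather than the trained network of figure 2.7, builds a set of random static control-input vectors, one per target pattern, and shows the static current each one would inject into the network while that pattern is selected.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n_inputs, n_patterns = 1000, 20, 10

# Random weights from the control inputs onto the network neurons,
# and one arbitrary static input vector per target pattern.
W_in = rng.uniform(-1.0, 1.0, size=(N, n_inputs))
control_patterns = rng.uniform(-1.0, 1.0, size=(n_patterns, n_inputs))

def control_current(pattern_id):
    """Static current injected into each neuron while pattern_id is selected.
    It carries no temporal information; it only biases the network state."""
    return W_in @ control_patterns[pattern_id]

# During training one would step through pattern ids, holding the
# corresponding current fixed while FORCE learning tracks that pattern's target.
for pid in range(n_patterns):
    I_ctrl = control_current(pid)
    # ... integrate tau*dx/dt = -x + g*J*r + w_f*z + I_ctrl and apply FORCE here ...
```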
2.2.8 Computations on Inputs
Up to now, we have treated the network we are studying as a source of signals analogous to motor output patterns. The network can also be used to generate complex input/output maps when inputs are present, as in the configuration of figure 2.7C. Figure 2.7C shows a particularly complex example of a network that functions as a 4-bit memory that is robust to input noise. This network has 4 readout units, all of which generate feedback (circuit diagram in figure 2.7C). Unlike the case of figure 2.4G, all 4 units are always left in an active state. This network also has 8 inputs that connect randomly to neurons in the network and are functionally divided into pairs (Methods). The input values are held at zero except for short pulses to a positive value. These pulses act as ON and OFF commands for the 4 output units. Input 1 is the ON command for output 1 and input 2 is its OFF command. Similarly, inputs 3 and 4 are the ON and OFF commands for output 2, and so on. Turning on an output means inducing a transition to a state with a fixed positive value, and turning it off means switching it to zero. The rows of traces in figure 2.7D show the 4 outputs (in red, with target values in green) with their ON and OFF inputs below them (purple). A careful examination of these traces shows that the inputs correctly turn the appropriate outputs on and off without any crosstalk between inputs and inappropriate outputs. This occurs even though the network is randomly connected, so the inputs are not segregated into different channels. This example requires that the network has,
after learning, 16 different fixed point attractors, one for each of the $2^4$ possible combinations of the 4 outputs, and that pulsing the 8 inputs induces the correct transitions between these attractors.
Note, for example, that when the output shown in the top trace in figure 2.7C is in its first off state, it is unaffected by additional pulsing of its OFF input as well as by pulsing other inputs. However, it is immediately switched to its on state by a pulse of its own ON input. Training for this example was performed by applying the FORCE procedure while introducing input pulses in random patterns and using the correct on- and off-state sequences as the target function for each output (green trace in figure 2.7C).
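To make the training protocol concrete, here is a small sketch, with assumed pulse widths and pulse rates (the exact values used for figure 2.7 are not specified here), that generates random ON/OFF input pulses for the 8 input lines and the corresponding target on/off state sequences for the 4 outputs.

```python
import numpy as np

def make_memory_task(T=5000, n_bits=4, pulse_len=20, pulse_prob=0.01, seed=3):
    """Return (inputs, targets): inputs is T x 2*n_bits (ON/OFF pulse pairs),
    targets is T x n_bits holding the state each output should maintain."""
    rng = np.random.default_rng(seed)
    inputs = np.zeros((T, 2 * n_bits))
    targets = np.zeros((T, n_bits))
    state = np.zeros(n_bits)
    for t in range(T):
        if rng.random() < pulse_prob and inputs[t].sum() == 0:
            line = rng.integers(2 * n_bits)
            inputs[t:t + pulse_len, line] = 1.0      # brief positive pulse
            bit, is_off = divmod(line, 2)            # even lines = ON, odd = OFF
            state[bit] = 0.0 if is_off else 1.0
        targets[t] = state                           # each output must hold its state
    return inputs, targets

inputs, targets = make_memory_task()
print(inputs.shape, targets.shape)   # (5000, 8) (5000, 4)
```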
2.2.9 A Motion Capture Example
Finally, we consider an example of running and walking based on data obtained from human subjects performing these actions while wearing a suit that allows variables such as joint angles to be measured (also studied, using a different type of network and learning procedure, by Taylor et al. (Taylor et al., 2007)). These data, from the CMU Motion Capture Library (http://mocap.cs.cmu.edu/), consist of 95 joint angles measured over many hundreds of time steps. We used an architecture as in figure 2.1A (Methods), except with multiple readout units, all of which provided feedback to the generator network (figure 2.8A). A set of 95 readout units generated the temporal sequences for the 95 joint angles in these data sets. Although running and walking might appear to be periodic motions, in fact the joint angles in the real data are non-periodic. Thus, the network had an additional 96th readout unit and associated feedback loop to set initial conditions, as in the example of figure 2.4G. Because we wanted a single network to generate both the running and walking motions, we introduced control inputs as
Figure 2.7: Multiple Pattern Generation and 4-Bit Memory. A) The network with control inputs used to produce multiple output patterns (inputs with modifiable readout weights are shown in red). Different, constant inputs (purple) are used to switch between multiple output functions in a single network. B) 10 outputs (only 1 cycle of a periodic function made from 3 sinusoids is shown) generated by a single input-controlled network. Switching between these patterns is induced by control inputs. C) A network with 4 outputs and 8 inputs used to produce a 4-bit memory (circuit diagram). Red traces are the 4 outputs, with green traces showing their target values. Purple traces show the 8 inputs, divided into ON and OFF pairs associated with the output trace above them. The upper input in each pair turns the output shown above it on, i.e. sets it to a positive output value. The lower input of each pair turns the associated output off, i.e. sets it to zero.
in the example of figure 2.7B. Two fixed control input patterns were applied to the network during training and, after training, to switch between the running and walking motions. Readout weights for all the output units were trained with FORCE learning. The successfully trained motions are shown in figure 2.8B (running) and 2.8D (walking). Ten sample firing rates from neurons in the generator network are shown in figure 2.8C. This example demonstrates that a single chaotic recurrent network can generate multiple, high-dimensional, non-periodic patterns that resemble human motions.
2.3 Discussion
Sensory processing is often studied as an end in itself, but its real purpose is to initiate and guide internally generated behaviors. In this spirit, we have studied how spontaneously active neural networks can be modified to generate desired outputs, and how control inputs can be used to initiate and select among those outputs. Although this has the most direct application to motor systems, it can be generalized to a broader picture of cognitive processing (Yuste et al., 2005), as our example of a 4-bit, input-controlled memory suggests. Each of the three architectures shown in figure 2.1 has a possible biological interpretation. The feedback loop in figure 2.1A can be interpreted as proprioceptive information about a motor action fed back to the motor circuit that generated it. The feedback network in figure 2.1B may model elements of the basal ganglia.
The architecture in figure 2.1C corresponds to modifications taking place within motor or premotor cortex. Probably all of these forms of feedback (and more) are involved in motor learning.
Figure 2.8: A Network that Generates Both Running and Walking Human Motions. A) The network used to generate the running and walking motions (modifiable readout weights are shown in red). The network utilized different, constant inputs to differentiate between running and walking (purple). The number of readout units providing output and feedback equals the number of joint angles, 95, plus an additional unit for setting initial conditions. Each joint angle is generated through time by one of the output units (gray arrows). B) The running motion generated by the network. Cyan frames show early and red frames late movement phases. C) 10 sample network neuron activities during the walking motion. D) The walking motion generated by the network, with colors as in B.
The additional element in our networks is the somewhat amorphous readout unit. As shown in figure 2.6, we only require the readout unit to compute an error signal that guides synaptic modifications, either in a feedback network (figure 2.6B) or within the original generator network (figure 2.6C). The cerebellum has been proposed as a possible locus for such computations (Miall et al., 1993). After learning, especially in the case of a non-periodic output (figure 2.4G or figure 2.8), a network may generate chaotic trajectories in addition to the desired pattern of activity. In this case, setting initial conditions in the network is important for avoiding the wrong trajectory. The two-step process by which we induced a chaotic network to produce a non-periodic sequence (illustrated in figures 2.4G and 2.8) may have an analog in motor and premotor cortex. The brief fixed point that we introduced to terminate chaotic activity results in a sharp drop in the fluctuations of network neurons just before the learned sequence is generated. Churchland et al. (Churchland et al., 2006), (Churchland and Shenoy, 2007) have reported just such a drop in variability in their recordings from these motor areas in monkeys immediately before they performed a reaching movement. Synaptic modification through FORCE learning is error directed and fast. For this method to work, it is essential that large synaptic modifications can take place rapidly. The architecture of figure 2.1B involving a separate feedback network may be a way to keep such aggressive plasticity from disrupting the generator network (motor cortex), a disruption that would be disastrous for all motor output, not merely the one being learned. Modification of synapses in a second network (as in figure 2.1B), such as circuits of the basal ganglia, may dominate when a motor task is first learned, whereas changes within motor cortex (analogous to learning within the network of figure 2.1C) may be reserved for "virtuoso" tuning of highly
trained motor actions. Focal dystonia is possibly an example of the dangers of excessive learning within a generator network. A key component of FORCE learning is producing the correct output even during the initial stages of training. Analogously, people cannot learn fine manual skills by randomly flailing their arms about and having their movement errors slowly diminish over time.
Motor learning works best when the desired motor
action is duplicated as accurately as possible during training. The initial phases of FORCE learning, when error-driven synaptic modification drives the output, may correspond to consciously guiding an action when learning a task.
Autonomous
generation of an output after learning may be analogous to being able to do a task without conscious intervention. There are some interesting and perhaps unsettling aspects of the networks we have studied. First, the connectivity of the generator network in the architectures of figures 2.1A and 2.1B is completely random, even after the network has been trained to perform a specific task. It would be impossible to understand how the generator network "works" by analyzing its synaptic connectivity. Even when the synapses of the generator network are modified (as in figure 2.1C), there is no obvious relationship between the task being performed and the connectivity, which is in any case not unique. The lesson here is that the activity, response properties and function of neurons in an area such as motor cortex can be drastically modified by feedback loops passing through distal networks. Circuits need to be studied with an eye toward how they modulate each other, rather than how they function in isolation. Except in the simplest of examples, the activity of the generator neurons bears little relationship to the output of the network (Robinson, 1992), (Fetz, 1992).
Trying to link single-neuron responses to motor actions may thus be misguided. Instead, results from our network models suggest that it may be more instructive to think about how the activity of individual neurons relates to the network-wide modes or patterns of activity extracted by principal components analysis of multiunit recordings. Our examples show the power of adding feedback loops as a way of modifying network activity. Nervous systems often seem to be composed of loops within loops within loops. Because adding a feedback loop leaves the original circuit unchanged, it is a non-destructive yet highly flexible way of increasing a behavioral repertoire through learning, as well as during development and evolution.
2.4 Methods

2.4.1 Network Model with Output Feedback
All the networks we consider are based on firing-rate descriptions of neural activity. The generator network has $N$ neurons with activity variables that obey, in the architecture of figure 2.1A,
$$\tau \dot{x}_i = -x_i + g\sum_{j=1}^{N} J^{GG}_{ij} r_j + J^{Gz}_i z,$$
with $r_i = \tanh(x_i)$ (these can be considered to be firing rates relative to a non-zero background rate and normalized to a maximum of 1) and $\tau = 10$ ms. The value of the output unit, $z$, is given, as in equation 2.1, by
$$z(t) = \mathbf{w}^T(t)\mathbf{r}(t),$$
where $\mathbf{w}$ are the readout weights. The matrix $J^{GG}$ describing the connectivity and synaptic strengths between neurons in the generator network is sparse and randomly generated. With probability $p$, a given synaptic strength $J^{GG}_{ij}$ is set to a value generated randomly from a Gaussian distribution with zero mean and variance $1/(pN)$. Otherwise, with probability $1-p$, it is set to zero. Unless otherwise specified, $N = 1000$, $p = 0.1$ and $g = 1.5$. The strengths of the synapses connecting the feedback to the network neurons, the parameters $J^{Gz}_i$, are generated randomly and independently from a uniform distribution between -1 and 1.
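For concreteness, the following is a compact sketch of a network of this type together with the recursive least-squares form of FORCE learning applied to the readout weights. It uses the connectivity statistics just described; the target function, integration step, training length, and the value of the regularization parameter alpha are illustrative assumptions rather than the exact settings used for the figures in this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, g, tau, dt, alpha = 1000, 0.1, 1.5, 10.0, 1.0, 1.0   # illustrative settings

# Sparse recurrent matrix: nonzero with probability p, variance 1/(p*N).
mask = rng.random((N, N)) < p
J = mask * rng.normal(0.0, 1.0 / np.sqrt(p * N), size=(N, N))
w_f = rng.uniform(-1.0, 1.0, size=N)        # feedback weights (J^{Gz})
w = np.zeros(N)                             # readout weights, trained
P = np.eye(N) / alpha                       # running inverse-correlation estimate

def target(t):                              # example target: sum of two sinusoids
    return (np.sin(2*np.pi*t/600) + 0.5*np.sin(2*np.pi*t/150)) / 1.5

x = 0.5 * rng.standard_normal(N)
r = np.tanh(x)
z = 0.0
for step in range(20000):
    t = step * dt
    x += dt / tau * (-x + g * (J @ r) + w_f * z)
    r = np.tanh(x)
    z = w @ r
    if step % 2 == 0:                       # update weights every other step
        e = z - target(t)                   # error before the weight update
        Pr = P @ r
        P -= np.outer(Pr, Pr) / (1.0 + r @ Pr)
        w -= e * (P @ r)                    # FORCE/RLS update of the readout
```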
Readout weights are set initially either to zero or to values generated by a Gaussian distribution with zero mean and variance $1/N$. They are then modified by either equations 2.3 and 2.5 or 2.6-2.8. In the latter case, we use $\alpha = 0.1$, unless otherwise specified. In our discussion of FORCE learning II, we made use of the fact that $\mathbf{r}^T P\mathbf{r}$ is close to 1 early in learning. To see this, note that, given the initial matrix of equation 2.7,
$$P(\Delta t) = \frac{1}{\alpha}\left(I - \frac{\mathbf{r}(\Delta t)\mathbf{r}^T(\Delta t)}{\alpha + \mathbf{r}^T(\Delta t)\mathbf{r}(\Delta t)}\right),$$
so
$$\mathbf{r}^T(\Delta t)P(\Delta t)\mathbf{r}(\Delta t) = \frac{\mathbf{r}^T(\Delta t)\mathbf{r}(\Delta t)}{\alpha + \mathbf{r}^T(\Delta t)\mathbf{r}(\Delta t)}.$$
Because $\mathbf{r}^T\mathbf{r}$ is of order $N$, this is very close to 1, provided that $\alpha \ll N$.

......

$$C_o(T) = \sum_{t=0}^{T}\mathbf{r}_o(t)\mathbf{r}_o(t)^T, \qquad (3.7)$$
$$C_c(T) = \sum_{t=0}^{T}\mathbf{r}_c(t)\mathbf{r}_c(t)^T, \qquad (3.8)$$
and $e(t) = f(t) - z(t)$.
Using the inverse matrix notation ignores the fact that $C_o^{-1}(t)$ and $C_c^{-1}(t)$ may be updated in a clever way (the "Recursive" in RLS), and focuses on the fact that the learning rules use information about the spatial correlations of the internal neurons to define the synaptic update. The particular details of the network are as follows: we assume the recurrent network is randomly connected with some sparseness and that the output and control units each receive $N_{out}$ connections. We assume that the synaptic strengths $w_f$ of the feedback synapses are sufficiently strong to drive the neurons in the recurrent network into non-chaotic activity when the network is driven through these synapses. We assume that the number of neurons in the recurrent network, $N$, is large and that the dimensionality of the driven network activity, $D$, is much smaller. Specifically, our assumptions on the network size are that $N > N_{out} \gg D$. In order to use FORCE learning effectively, the number of output weights $N_{out}$ should be at least a few hundred, but $N_{out}$ does not need to scale with $N$.
3.2 Training a Control Unit with the Output Error
In order for the network to learn to reproduce a pattern (i.e., $z(t) = f(t)$), it is obviously important to figure out what kind of $y(t)$ is needed. The output unit constructs its firing rate from the firing rates of the internal neurons. Since $y(t)$ is fed back into the network, it is responsible for driving the network in a way that allows the output unit to successfully extract $f(t)$ (see figure 3.1, panel B). One way to understand what $y(t)$ would enable the network to successfully generate $f(t)$ is to think of the RNN as a nonlinear filter, which is how it behaves when driven by a strong input. The RNN generates the base frequencies, harmonics² and phase shifts³ of any signal injected into it. If we wish to construct a specific $f(t)$ from the network activity, at the very least $y(t)$ should contain the incommensurate base frequencies contained in $f(t)$; the RNN will generate any necessary harmonics and phase shifts. If the learning rules given in equations (3.5) and (3.6) were to generate $y(t) \approx z(t) \approx f(t)$, there should be no problem, because the appropriate frequencies would be generated in the network, even if $y(t)$ does not exactly equal $f(t)$. It will be shown that, under appropriate circumstances, equations (3.5) and (3.6) do indeed generate a $y(t)$ that is very close to $f(t)$, in addition to generating a $z(t)$ that is effectively equal to $f(t)$.

²The hyperbolic tangent nonlinearity generates odd harmonics of any frequency injected into the network due to the $x \to -x$ symmetry of the function. If a small bias current is injected into the neurons in the network, it will also generate even harmonics, since the bias current removes this symmetry.

³The quality of phase shifting of an input signal appears to be heavily dependent on the scale factor g. In particular, simulations have shown that the phase shifting properties of the network grow with g, which is one of the reasons why training a network with g < 1 results in high-amplitude output weights.
The simpler of the two issues we need to address for learning with this architecture is what happens if $\mathbf{w}_c(0) \neq \mathbf{w}_o(0)$. It turns out that unequal initial conditions for the weights do not pose a problem. This follows from the considerations in the previous paragraph. If $\mathbf{w}_c(0) = \mathbf{w}_o(0)$ and all the inputs to the output and control units were shared, $y(t)$ would be identical at all times to $z(t)$ according to the learning rules, so a close approximation of $f(t)$ would be injected into the network and there would be no problem. We now imagine that the initial conditions are different. Since the network acts to generate harmonics of the control unit's firing rate $y(t)$, and additionally to generate all phases of the frequencies of $y(t)$, all that matters is that $y(t)$ contains the incommensurate base frequencies of $f(t)$. Poor approximations of $f(t)$ work just fine as long as the essential frequency content is present. This requirement is almost always met if we use learning rules (3.5) and (3.6), even if $\mathbf{w}_c(0) \neq \mathbf{w}_o(0)$, assuming that the initial conditions are not wildly far apart.

The second, and much more subtle, issue is what happens if the output and control units do not have common input from the recurrent network. To exaggerate this problem, we assume that the output and control units have no overlapping inputs from the recurrent network whatsoever. In practice they might be randomly selected and thus have some intersecting inputs, but if we can understand how FORCE learning works in the non-overlapping case, we should understand the case of a random intersection of connections as well. Finally, the hypothesis is that if the network activity is sufficiently correlated, i.e. $D \ll N_{out}$, learning between control and output units will be viable with a single output error, $e(t)$, because even with mutually exclusive sets of inputs from the recurrent network, the information seen by both the output and control units is essentially the same.
3.3 An Example
We now study an example of this network during learning. We learned the function $f(t) = 0.5\sin(0.05\pi t) + 0.6\sin(0.1\pi t)$ using the network described in this chapter and the learning rules given by equations (3.5) and (3.6) (see figure 3.2). Essential details of the simulation were $N = 1000$, $p_{conn} = 0.1$⁴, and both the control and output units each had $N_{out} = 500$ mutually exclusive inputs. The inputs to the output unit were labelled 1...500 and the inputs to the control unit were labelled 501...1000. Since the point of the exercise is to understand the complications introduced by having the control and output units receive separate inputs, the initial synaptic strengths were set such that $\mathbf{w}_c(0) = \mathbf{w}_o(0) = 0$.

The output and control units' firing rates $z(t)$ and $y(t)$ are shown during training in panel A of figure 3.2. The firing rate $z(t)$ (red) very closely followed $f(t)$ (green). The firing rate $y(t)$ also approximated $f(t)$, although not with the same exactness as $z(t)$. The lengths of the synaptic weight vectors, $|\mathbf{w}_o|$ and $|\mathbf{w}_c|$, are a good indication of the success of learning; when they stabilize, learning is typically finished. One can see that $|\mathbf{w}_o|$ and $|\mathbf{w}_c|$ were extremely similar at all times during training, which indicates that the update rules for the readout and control weights had similar effects on both sets of weights, despite their having mutually exclusive presynaptic neurons. After training, the network was run with the learned output and control weights. The results are shown in panel C. The network was able to generate $f(t)$ with $y(t)$ as the feedback to the recurrent network.

During the simulation, the correlation matrix of the network firing rates was measured and used to determine the dimensions of $\mathbf{r}$ with the highest variance using Principal Component Analysis (PCA). The number of dimensions necessary to explain 99.99% of the variance was 23 (so we say the network was active in a $D = 23$ dimensional subspace). In order to see how much of the signals $z(t)$ and $y(t)$ was composed of network firing rates active in the top $D$ dimensions, the output and control unit firing rates were recalculated using modified readout and control weight vectors $\mathbf{w}_o$ and $\mathbf{w}_c$ expressed in the PCA basis, but with all but the top $D$ dimensions set to zero. These firing rates are shown in panel D and are not distinguishable from the true firing rates shown in panel A. This result gives a strong indication that the learning rules' most important effects occur in the top $D$ dimensions of the PCA basis. Finally, panel E shows the component values for $\mathbf{w}_o$ and $\mathbf{w}_c$ in the top $D$ dimensions of the PCA basis (red $\mathbf{w}_o$, blue $\mathbf{w}_c$). Upon examination of each stem individually, it is clear that the update rules for $\mathbf{w}_c$ and $\mathbf{w}_o$ found very similar solutions for the synaptic strengths when viewed in the dimensions of highest variance, despite completely non-overlapping inputs. This is why $y(t)$ was so similar to $z(t)$, and thus why the learning rule works.

⁴The internal connectivity percentage $p_{conn}$ has surprisingly little effect on the network performance.
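A minimal sketch of the experiment just described, under the same kind of simplifying assumptions as before (a shortened run and an assumed integration step): the output and control units read from disjoint halves of the network, the control unit's rate y(t) provides the feedback, and both weight vectors are updated with the single output error e(t) = z(t) - f(t).

```python
import numpy as np

rng = np.random.default_rng(4)
N, g, tau, dt, alpha = 1000, 1.5, 10.0, 1.0, 1.0
n_out = N // 2
idx_o, idx_c = np.arange(0, n_out), np.arange(n_out, N)   # disjoint input sets

J = rng.normal(0.0, 1.0/np.sqrt(0.1*N), size=(N, N)) * (rng.random((N, N)) < 0.1)
w_f = rng.uniform(-1.0, 1.0, size=N)          # the feedback carries y(t), not z(t)
w_o, w_c = np.zeros(n_out), np.zeros(n_out)
P_o, P_c = np.eye(n_out)/alpha, np.eye(n_out)/alpha

def f(t):  # target used in this chapter's example
    return 0.5*np.sin(0.05*np.pi*t) + 0.6*np.sin(0.1*np.pi*t)

x = 0.5*rng.standard_normal(N); r = np.tanh(x); y = 0.0
for step in range(30000):
    t = step*dt
    x += dt/tau*(-x + g*(J @ r) + w_f*y)
    r = np.tanh(x)
    z, y = w_o @ r[idx_o], w_c @ r[idx_c]
    e = z - f(t)                              # one error signal for both units
    for w_vec, P, idx in ((w_o, P_o, idx_o), (w_c, P_c, idx_c)):
        rr = r[idx]; Pr = P @ rr
        P -= np.outer(Pr, Pr)/(1.0 + rr @ Pr)
        w_vec -= e*(P @ rr)                   # in-place RLS update
print(np.linalg.norm(w_o), np.linalg.norm(w_c))   # the two lengths end up similar
```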
Figure 3.2: A) The target $f(t)$ in green, the output unit's firing rate $z(t)$ (red) and the control unit's firing rate $y(t)$ (cyan) through time during training. B) The lengths of the readout and control weight vectors, $|\mathbf{w}_o|$ and $|\mathbf{w}_c|$, through time. C) The network after training, demonstrating that it successfully learned the task with feedback from $y(t)$ and not $z(t)$. D) Principal component analysis done on the network activity and then used to reconstruct $z(t)$ and $y(t)$; the panel shows the reconstruction using only the top 23 dimensions. E) The weights $\mathbf{w}_o$ (red) and $\mathbf{w}_c$ (blue) in the basis of the correlation matrix (the PCA basis). The dimension with the highest variance is at the left and the final dimension at the right.
3.4 A Mathematical Explanation
To show that it does not matter whether the control and output units receive different inputs when using $e(t)$ to train both units, we must explain why, for $d = 1..D$, we have
$$\tilde{w}_{od} \approx \tilde{w}_{cd}, \qquad (3.9)$$
where $\tilde{w}$ denotes a weight expressed in the PCA basis. This is clearly shown in panel E of figure 3.2 for the large eigenvalues. On an intuitive level, it is natural to think that if the number of connections from the recurrent network to the output and control units is much greater than the dimensionality of the network activity, i.e. $N_{out} \gg D$⁵, the correlation matrices $C_o$ and $C_c$ must share almost all the information about the network activity, because the synaptic inputs to the control and output units are highly correlated. The remainder of this chapter gives a mathematical explanation of how this occurs. We define $P_o$ as the $N_{out} \times N$ projection operator from the $N$-dimensional space to the $N_{out}$-dimensional subspace from which the output unit samples, and likewise $P_c$ as the $N_{out} \times N$ projection operator from the $N$-dimensional space to the $N_{out}$-dimensional subspace from which the control unit samples (the second $N_{out}$ indices).

⁵Typical values used in simulations are $N = 1000$, $N_{out} = 500$, and $D$ is found to be below 50.
$$P_o = \begin{pmatrix} 1 & 0 & \cdots & 0 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 & 0 & \cdots & 0 \\ \vdots & & \ddots & & & & \vdots \\ 0 & 0 & \cdots & 1 & 0 & \cdots & 0 \end{pmatrix} \qquad (3.10)$$

$$P_c = \begin{pmatrix} 0 & \cdots & 0 & 1 & 0 & \cdots & 0 \\ 0 & \cdots & 0 & 0 & 1 & \cdots & 0 \\ \vdots & & & & & \ddots & \vdots \\ 0 & \cdots & 0 & 0 & 0 & \cdots & 1 \end{pmatrix} \qquad (3.11)$$

That is, $P_o$ consists of the $N_{out} \times N_{out}$ identity followed by zeros, and $P_c$ consists of zeros followed by the identity. The firing rates of the inputs to the output and control units are then defined via the projection matrices as
$$\mathbf{r}_o = P_o\mathbf{r} \qquad (3.12)$$
$$\mathbf{r}_c = P_c\mathbf{r}. \qquad (3.13)$$
The unnormalized correlation matrix for all the neurons in the network is defined as
$$C(t) = \sum_{\tau=0}^{t}\mathbf{r}(\tau)\mathbf{r}(\tau)^T, \qquad (3.14)$$
where r(t) is the vector of all network firing rates at time t. Based on experience, the correlation matrices become reasonably accurate very quickly, typically within a period of f(t), and so as a simplification in the following analysis we assume
that $C(t) = tC$, $C_o(t) = tC_o$, and $C_c(t) = tC_c$, where $C$, $C_o$ and $C_c$ represent the long-term time-averaged values of the correlation matrices when the network is driven by $f(t)$ through the feedback synapses. In this fashion, we remove the dependence of the structure of the correlation matrices on time. However, since the learning algorithms use the inverses of the unnormalized correlation matrices, we keep the time dependence as a global scaling of the matrices. This approximation obviates the need to use the recursive version of the weight update rules in the following analysis. We express $C$ in terms of its eigenvectors and eigenvalues,
$$C = V\Lambda V^T. \qquad (3.15)$$
Typically, the eigenvalue spectrum for a driven network falls off exponentially (see for example figure 2.3C). For the remainder of this chapter we approximate the true eigenvalue matrix $\Lambda$ with a rank-$D$ matrix that contains only the top $D$ eigenvalues, which we call $\lambda_1, \lambda_2, \ldots, \lambda_D, 0, \ldots, 0$, ordered from largest to smallest. The column vectors of the eigenvector matrix, $V$, remain unchanged and are ordered likewise. Finally, we define
$$\tilde{\mathbf{w}}_o = V^T P_o^T\mathbf{w}_o \qquad (3.16)$$
$$\tilde{\mathbf{w}}_c = V^T P_c^T\mathbf{w}_c \qquad (3.17)$$
as the $N \times 1$ dimensional projections of the weight vectors onto the basis defined by the eigenvectors of $C$. Since we assume that $N > N_{out} \gg D$, it is necessary to employ the pseudo-inverse, because the inverses of $C_o$ and $C_c$ do not exist⁶. Thus the learning rules we analyze are
$$\dot{\mathbf{w}}_o(t) = C_o(t)^{+}\mathbf{r}_o(t)e(t) \qquad (3.18)$$
$$\dot{\mathbf{w}}_c(t) = C_c(t)^{+}\mathbf{r}_c(t)e(t), \qquad (3.19)$$
where the $^{+}$ notation means the matrix pseudo-inverse.
The key to showing that $\tilde{w}_{od} \approx \tilde{w}_{cd}$ for $d \in 1..D$ is to write the smaller correlation matrices $C_c$ and $C_o$ in terms of $V\Lambda V^T$, the diagonalized version of $C$, and then analyze the learning rules in this form. Thus for $C_o$ we have
$$C_o = P_o C P_o^T \qquad (3.20)$$
$$= (P_o V)\Lambda(V^T P_o^T) \qquad (3.21)$$
$$= \sum_{a=1}^{D}\lambda_a (P_o V^a)(P_o V^a)^T \qquad (3.22)$$
and likewise for $C_c$. The main question is: when are the eigenvectors of $C_c$ and $C_o$ equal to the projections of the eigenvectors of $C$ onto the $N_{out}$-dimensional subspaces of the inputs to the output and control units? In other words, when do the equations
$$V_o^a = P_o V^a \qquad (3.23)$$
$$V_c^a = P_c V^a \qquad (3.24)$$
hold for $a \in 1..D$, where $V_o^a$ and $V_c^a$ are the $a$th eigenvectors of $C_o$ and $C_c$, respectively? The reason this is important is that if the eigenvectors of $C_c$ and $C_o$ are the projected eigenvectors of $C$, then the learning rules will move the weights of the output and control units in similar directions, and thus $y(t)$ will be approximately equal to $z(t)$.

⁶In practice the RLS implementation is so-called diagonally loaded to keep the correlation matrix from being singular; that is, $C(t) = \sum_{\tau=0}^{t}\mathbf{r}(\tau)\mathbf{r}(\tau)^T + \epsilon I$, where $\epsilon$ is a small number.
3.4.1 A Small Example
Keeping in mind equation 3.22, let us look at a simple example with $N = 4$, $N_{out} = 2$ and $D = 2$. Below is the eigenvalue decomposition of a rank-2, 4-dimensional matrix, projected onto the first two dimensions, the 1,2-subplane.
$$P_o V\Lambda V^T P_o^T =
\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix}
\begin{pmatrix} v^1_1 & v^2_1 & v^3_1 & v^4_1 \\ v^1_2 & v^2_2 & v^3_2 & v^4_2 \\ v^1_3 & v^2_3 & v^3_3 & v^4_3 \\ v^1_4 & v^2_4 & v^3_4 & v^4_4 \end{pmatrix}
\begin{pmatrix} \lambda_1 & 0 & 0 & 0 \\ 0 & \lambda_2 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}
\begin{pmatrix} v^1_1 & v^1_2 & v^1_3 & v^1_4 \\ v^2_1 & v^2_2 & v^2_3 & v^2_4 \\ v^3_1 & v^3_2 & v^3_3 & v^3_4 \\ v^4_1 & v^4_2 & v^4_3 & v^4_4 \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \\ 0 & 0 \end{pmatrix} \qquad (3.25)$$

$$= \begin{pmatrix} v^1_1 & v^2_1 \\ v^1_2 & v^2_2 \end{pmatrix}
\begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix}
\begin{pmatrix} v^1_1 & v^1_2 \\ v^2_1 & v^2_2 \end{pmatrix} \qquad (3.26)$$

$$= \lambda_1\begin{pmatrix} v^1_1 \\ v^1_2 \end{pmatrix}\begin{pmatrix} v^1_1 & v^1_2 \end{pmatrix}
+ \lambda_2\begin{pmatrix} v^2_1 \\ v^2_2 \end{pmatrix}\begin{pmatrix} v^2_1 & v^2_2 \end{pmatrix}, \qquad (3.27)$$

where $v^a_i$ denotes the $i$th component of the $a$th eigenvector.
Notice the structure of equation 3.27: it has the form of an eigenvector decomposition of a 2-dimensional matrix. If the projections of the first two 4-dimensional eigenvectors onto the 1,2-plane, $(v^1_1\ v^1_2)^T$ and $(v^2_1\ v^2_2)^T$, were orthonormal, then equation 3.27 would be the eigenvector decomposition of the smaller matrix. The example illustrates the idea that the projected eigenvectors of $C$ are the potential eigenvectors of the smaller correlation matrices $C_o$ and $C_c$ in equation 3.22. Note that it is crucial that $D \le N_{out}$; otherwise the $D$ eigenvectors will mix in the subspace, leaving no possibility for a potential eigenvector decomposition.
3.4.2 Back to Large Dimensions
Unfortunately, the projected vectors are not in general orthogonal or normal. But things change when the dimensions start to get larger. If N is large, Nout is also quite large, and D is reasonably small, in general one can expect the projected eigenvectors to be essentially orthogonal in the subspaces if they were originally orthogonal in the full N dimensional space. In other words, if the above requirements hold, we can expect
$$(P_o V^a)\cdot(P_o V^b) = \begin{cases} 0 & \text{if } a \neq b \\ |P_o V^a|^2 & \text{if } a = b, \end{cases} \qquad (3.28)$$
and likewise for eigenvectors of $C$ projected onto the subspace spanned by the inputs to the control unit. This is a result of the random embedding theorem, which states that $D$ points in $\mathbb{R}^N$ can be embedded into an $N_{out}$-dimensional subspace with little distortion, assuming that $N_{out} \gg D$ (Dasgupta and Gupta, 2003). In our application, the points are the $D$
eigenvectors in $N$-dimensional space, and the subspaces are the two $N_{out}$-dimensional subspaces that $\mathbf{r}_o$ and $\mathbf{r}_c$ occupy. From these arguments we conclude that the following equations represent approximations of the eigenvalue decompositions of $C_o$ and $C_c$:
$$C_o \approx \sum_{a=1}^{D}\lambda_a\,\frac{(P_o V^a)(P_o V^a)^T}{|P_o V^a|^2} \qquad (3.29)$$
$$C_c \approx \sum_{a=1}^{D}\lambda_a\,\frac{(P_c V^a)(P_c V^a)^T}{|P_c V^a|^2}, \qquad (3.30)$$
where the denominators are introduced simply to account for renormalization of the projected eigenvectors in the subspaces. Going forward, we assume that equations 3.29 and 3.30 are legitimate eigenvector decompositions, even though in practice the degree to which the projected eigenvectors of $C$ are orthogonal in the subspaces depends on many factors, such as the network size, the complexity of the target function
$f(t)$, etc. We are now in a position to analyze the effect of the learning rules for $\mathbf{w}_o$ and $\mathbf{w}_c$. We proceed with the calculations for $\mathbf{w}_o$; the calculations for $\mathbf{w}_c$ are identical. The definition of the pseudo-inverse of a rank-$D$, square, symmetric matrix $C$ is
$$C^{+} = \sum_{a=1}^{D}\frac{1}{\lambda_a}V^a V^{aT}, \qquad (3.31)$$
where the $\lambda_a$ are the positive eigenvalues of $C$ and the $V^a$ are the corresponding eigenvectors. Since the meaningful network activity is only $D$-dimensional, we approximate the full network activity vector $\mathbf{r}(t)$ as
$$\mathbf{r}(t) = \sum_{a=1}^{D} a_a(t)\,V^a, \qquad (3.32)$$
where $a_a(t)$ gives the projection of $\mathbf{r}(t)$ onto the $a$th eigenvector. Writing down the learning rule for $\mathbf{w}_o$ from 3.18, and defining $V_o^a = P_o V^a$ and $|V_o^a|^2 = V_o^a\cdot V_o^a = (P_o V^a)\cdot(P_o V^a)$, as previously discussed, we have
$$\dot{\mathbf{w}}_o(t) = C_o(t)^{+}\mathbf{r}_o(t)\,e(t)$$
$$= \left(t\,P_o V\Lambda V^T P_o^T\right)^{+}\left(P_o\mathbf{r}(t)\right)e(t)$$
$$\approx \left(t\sum_{a=1}^{D}\lambda_a\frac{V_o^a V_o^{aT}}{|V_o^a|^2}\right)^{+}\left(P_o\mathbf{r}(t)\right)e(t)$$
$$= \frac{1}{t}\left(\sum_{a=1}^{D}\frac{1}{\lambda_a}\frac{V_o^a V_o^{aT}}{|V_o^a|^2}\right)\left(\sum_{b=1}^{D}a_b(t)\,V_o^b\right)e(t)$$
$$\approx \frac{1}{t}\sum_{a=1}^{D}\frac{a_a(t)}{\lambda_a}\,V_o^a\,e(t),$$
where the last step uses the near-orthogonality of the projected eigenvectors, $V_o^{aT}V_o^b \approx \delta_{ab}|V_o^a|^2$. Expressed in the $d$th dimension of the basis of $C$, this is
$$\dot{\tilde{w}}_{od}(t) = \frac{|V_o^d|^2}{t\,\lambda_d}\,a_d(t)\,e(t).$$
The expression for the change of $\mathbf{w}_c$ in the basis of $C$ is identical except for a change of subscript. Finally, we write down the values at time $t_2$ for both the output and control vectors, given some initial conditions at time $t_1$, for $d \in 1\ldots D$:
$$\tilde{w}_{od}(t_2) = \frac{|V_o^d|^2}{\lambda_d}\int_{t_1}^{t_2}\frac{1}{\tau}\,a_d(\tau)\,e(\tau)\,d\tau + \tilde{w}_{od}(t_1) \qquad (3.33)$$
$$\tilde{w}_{cd}(t_2) = \frac{|V_c^d|^2}{\lambda_d}\int_{t_1}^{t_2}\frac{1}{\tau}\,a_d(\tau)\,e(\tau)\,d\tau + \tilde{w}_{cd}(t_1) \qquad (3.34)$$
Thus, if the normalizing constants $|V_o^d|^2$ and $|V_c^d|^2$ have similar values (in general they should, if the numbers of synapses to the output and control units are roughly equal), the output and control synaptic weights will have similar values in exactly the dimensions of the network activity with the highest variance, leading to similar values for $z(t)$ and $y(t)$. Since the network does not require that $y(t)$ and $z(t)$ have exactly the same values (only the base frequency content should match), this is sufficient to successfully train the network. As mentioned earlier, the initial conditions, expressed in the above formulas as $\tilde{w}_{od}(t_1)$ and $\tilde{w}_{cd}(t_1)$, do not cause a problem as long as they are not too far apart.

Even though we have focused on learning for the case of a single internal neuron, there is no reason why we cannot construct a network architecture where the feedback is a multidimensional signal based on a number of neurons learning in the same fashion. This is what we did to implement the examples of figures 2.6B and 2.6C from the previous chapter. Each trained synapse or weight in these networks learned according to the principle outlined here. In the first architecture, an entire feedback network of connected neurons was trained based on the error signal of the readout unit. In the second architecture, we trained every neuron internal to the recurrent network. In the latter, each neuron drove its own activity with $f(t)$ by virtual teacher forcing through its own synapses. In all cases, the learning was accomplished using only the error of the readout unit, $e(t)$.
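The near-orthogonality property that this argument relies on (equation 3.28) is easy to check numerically. The sketch below, with arbitrary sizes, draws D orthonormal vectors in N dimensions, projects them onto a random N_out-dimensional coordinate subspace, and measures how far the projections are from orthogonal.

```python
import numpy as np

rng = np.random.default_rng(5)
N, N_out, D = 1000, 500, 20

# D orthonormal vectors in R^N (columns of Q).
Q, _ = np.linalg.qr(rng.standard_normal((N, D)))

# Project onto a random N_out-dimensional coordinate subspace.
idx = rng.choice(N, size=N_out, replace=False)
V_proj = Q[idx, :]                       # shape (N_out, D)

G = V_proj.T @ V_proj                    # Gram matrix of the projected vectors
off_diag = G - np.diag(np.diag(G))
print("typical squared norm:", np.diag(G).mean())               # about N_out / N
print("largest off-diagonal overlap:", np.abs(off_diag).max())  # small when N_out >> D
```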
Chapter 4 Perception as Modeling: A neural network that extracts and models predictable elements from input

4.1 Introduction
Previous chapters related mostly to motor control examples and training recurrent neural networks to autonomously generate patterns. We now switch gears and consider an application of the previous ideas in the context of sensory processing. In particular, we explore the idea that perception could be based on sampling external stimuli in order to better model those stimuli internally. Perception is not simply the passive analysis of input data; it involves modeling the external world, making inferences about predictable events, and noting when something unexpected happens. In most models of sensory processing, activity generated by external stimuli in peripheral sensors is used to drive networks that
perform computations such as object recognition. We consider a different scheme in which stimulus driven activity acts as a training signal for a spontaneously active network. This predictor network continually attempts to model the training data in the sense of being able to generate it spontaneously, even if the sensory input is cut off. Of course, only certain aspects of the sensory data may be predictable. To address this, an extractor circuit that is guided by feedback from the predictor produces the training signal for the predictor network. The extractor pulls out from the sensory stream aspects of the data that the predictor network can reproduce. This automatically divides the input data into three categories: 1) noise, which is the part of the input stream ignored by the extractor, 2) a predictable signal that is isolated by the extractor circuit and internally reproduced by the predictor, and 3) surprising events, which are elements of the input stream pulled out by the extractor that do not match the output of the predictor. The predictor is a recurrent network of firing-rate model neurons with a linear readout that provides both the output of, and feedback to, the network. This is identical to models in the previous chapters. As in the preceding chapters, the learning algorithm only modifies the weights connecting the network to the output unit. The key element is FORCE learning that restricts errors in the output to small values throughout training. This was a supervised learning scheme in the preceding chapters, but in the extractor-predictor approach, the extractor circuit acts as the supervisor for the predictor network, so the scheme is unsupervised. Due to the nature of FORCE learning, the predictor network always generates a close match to the target and thus does not produce hallucinations unrelated to external reality. The rate of change of the readout weights provides a measure of whether the target function can be generated autonomously by the predictor. We
use this measure as a supervisory signal for the extractor, which is modified if the target signal it is extracting from the input data cannot be autonomously modeled. The combined recurrent predictor network and linear-filter extractor is successful at finding and modeling predictable structure (if it exists) in high-dimensional time series data even when polluted by extremely complex noise. This network can be viewed as a general method for extracting patterns from complicated time series or, from a systems neuroscience perspective, as a model of perception where the way to understand the world is through internal prediction, with further processing used only when sensory input fails to match expectations.
4.1.1 Input Model
We propose a scheme involving two neural networks that is able to extract a predictable signal (if one is present) from a complex, noisy, and high-dimensional input stream. We define the $M$-dimensional input stream $I(t)$ as
$$I_i(t) = v_i u(t) + c_i(t) + \psi_i(t), \qquad (4.1)$$
for $i = 1, 2, \ldots, M$, where $v_i$ is a small constant, $u(t)$ is a predictable (for example, periodic) signal of interest¹, $\psi_i(t)$ is white noise, and the $c_i(t)$ are the time-dependent outputs of a chaotic dynamical system, typically with a range from -1 to 1. The chaotic system from which the $c_i(t)$ originate is by definition unpredictable over long time scales. The task is to extract the predictable function $u(t)$ by sampling from $I(t)$ without knowing a priori the values $u(t)$, $v_i$, $c_i(t)$ or $\psi_i(t)$. Because we do not know $u(t)$ beforehand, any algorithm used for this task must be unsupervised (see Figure 4.2).

¹If $u(t)$ is an order-one signal, $v_i$ should be no smaller than $1/\sqrt{M}$ so that we get an order-one signal when summed.
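For testing, an input stream of the form of equation (4.1) can be synthesized directly. In the sketch below the Lorenz system stands in for the otherwise unspecified chaotic source, and the signal of interest, scale factors, and noise level are all assumptions.

```python
import numpy as np

def lorenz(T, dt=0.01, sigma=10.0, rho=28.0, beta=8.0/3.0, seed=6):
    """Integrate the Lorenz system and return T samples of x(t), rescaled to roughly [-1, 1]."""
    rng = np.random.default_rng(seed)
    s = rng.standard_normal(3)
    out = np.empty(T)
    for t in range(T):
        x, y, z = s
        s = s + dt * np.array([sigma*(y - x), x*(rho - z) - y, x*y - beta*z])
        out[t] = s[0]
    return out / 20.0

M, T = 100, 10000
rng = np.random.default_rng(7)
u = np.sin(2*np.pi*np.arange(T)/500.0)                 # predictable signal of interest
v = rng.uniform(0.5, 1.5, size=M) / np.sqrt(M)         # small per-channel scale factors
c = np.stack([lorenz(T, seed=k) for k in range(M)])    # chaotic, unpredictable components
psi = 0.1 * rng.standard_normal((M, T))                # white noise
I = v[:, None]*u[None, :] + c + psi                    # M x T input stream, eq. (4.1)
```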
4.1.2 FORCE Learning
Our method for finding $u(t)$ is based on the FORCE learning paradigm explained in previous chapters. Briefly, the FORCE learning paradigm describes how to train the synapses of a recurrent neural network (RNN) such that the network learns a target function $f(t)$ in an online, supervised fashion. Using FORCE learning, the target function $f(t)$ plus small errors $e(t)$ is fed back into the network via the output unit while learning is taking place (see figure 4.1B). The synaptic weights initially change rapidly because the network has not yet learned the target function. As the network begins to learn $f(t)$, the synaptic weights fluctuate less, and by the end of learning they fluctuate hardly at all. Thus, by the end of learning, one can simply "turn off" the training mechanism and the network output unit will continue to generate $f(t)$ as it did during the training period.
4.1.3 The Surprise Signal s(t)
The key insight for the task at hand is that the fluctuations of the synaptic weights in the network provide a signal that is not present in other learning paradigms, such as a pure gradient-descent-based approach or batch learning algorithms. These fluctuations are a measure of the degree of surprise, or unpredictability, of the target function $f(t)$ for the network. We define the surprise signal $s(t)$ of a network during the training of a target function as the sum of squares of the derivatives of the synaptic weights as they change during the learning process. A low value of $s$ indicates that the network has found and is able to predict a signal extracted from the input. A large value means that it has not.
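One concrete reading of this definition is to accumulate the surprise online from the weight updates themselves. The helper below is hypothetical (the smoothing time constant and time step are assumptions), but it shows the bookkeeping: the squared rate of change of the readout weight vector, lightly smoothed.

```python
import numpy as np

class SurpriseMeter:
    """Track s(t) = |dw/dt|^2, the squared rate of change of the readout weights,
    smoothed with an exponential filter so brief fluctuations remain visible but not noisy."""
    def __init__(self, dt, tau_s=100.0):
        self.dt = dt
        self.decay = np.exp(-dt / tau_s)
        self.s = 0.0

    def update(self, delta_w):
        instantaneous = float(delta_w @ delta_w) / self.dt**2   # |dw/dt|^2 for this step
        self.s = self.decay * self.s + (1.0 - self.decay) * instantaneous
        return self.s

# Usage inside a FORCE training loop (delta_w is the weight change just applied):
# meter = SurpriseMeter(dt=1.0)
# s_t = meter.update(delta_w)   # small s_t: the extracted signal is being predicted
```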
4.2 Approach to Finding u(t)
The extractor-predictor network we propose is based on two coupled networks, the extractor and the predictor. For simplicity, we assume that the extractor network is defined by a linear combination of the multidimensional signal $I(t)$,
$$f(t) = \sum_{k=1}^{M}K_k I_k(t). \qquad (4.2)$$
More generally, the extractor network could be defined as a chaotic RNN such as that defined in equation (4.4) with $g > 1$; however, we stick to the definition given in equation (4.2). The output $z(t)$ of the predictor network is defined by
$$z(t) = \sum_{k=1}^{N}w_{ok}\,r_k(t), \qquad (4.3)$$
which is a linear readout of the firing rates of the predictor network, denoted by $r_k$. The synaptic strengths that control $z(t)$ are collectively denoted by $\mathbf{w}_o$. The equations for the activations $x_i$ of the predictor network, which determine the firing rates $r_i(t)$, are
$$\dot{x}_i(t) = -x_i(t) + g\sum_{k=1}^{N}J_{ik}\,r_k(t) + w_{fi}\,z(t) \qquad (4.4)$$
$$r_i(t) = \tanh(x_i(t)). \qquad (4.5)$$
The $J_{ik}$ are chosen from a normal distribution with mean 0 and standard deviation $1/\sqrt{N}$. The scaling factor $g$ sets the operating range of the system; for a chaotic system $g > 1$, which is the range in which the networks operate in this application. Note that the output $z(t)$ is fed into the network via the synapses with strengths $w_f$, and no neurons receive external input.²
Figure 4.1: The Predictor Machine network architecture. A) The extractor network takes the signal $I(t)$ as input. The extractor weights $K$ are trained to reduce the surprise signal of the predictor network. This is accomplished by training $f(t)$ to reproduce the output of the predictor network, $z(t)$, from the inputs. B) The predictor network has no inputs. Instead, it is a chaotic RNN with an output unit, $z(t)$, which feeds back to the network via the synapses with strengths $w_f$. It is trained to produce the output of the extractor, $f(t)$.
Having defined the extractor and predictor networks, we now explain how they communicate. Rather than the traditional method of communication via input synapses, our two networks communicate only via their respective error signals. The goal of the extractor network is to reduce the surprise signal $s(t)$ of the predictor network over time. In other words, we change the weights $K$ of the extractor unit such that the objective function
$$L(T) = \frac{1}{2}\int_0^T dt\,|\dot{\mathbf{w}}_o|^2 \qquad (4.6)$$
is minimized. The goal of the predictor network during training is to generate the output of the extractor network, $f(t)$. It will be shown below that an approximation to minimizing equation (4.6) is the much simpler goal of having the output of the extractor attempt to reproduce $z(t)$. In summary, during training the goal of the extractor network is to produce $f(t) = z(t)$ based on the inputs $I(t)$, and the goal of the predictor network is to produce $z(t) = f(t)$ based on its own internal dynamics, i.e., $\mathbf{r}(t)$ (see figure 4.1, panels A and B).

In other words, we wish to learn the signal $u(t)$, which is present with scale factor $v_i$ in each dimension of the input $I_i(t)$. We propose to do this by providing $I(t)$ as input to the extractor network. During training, the extractor attempts to generate $f(t) = z(t)$ by supervised training of the synapses $K_i$, and the predictor attempts to generate $z(t) = f(t)$ by training of the $\mathbf{w}_o$ synapses. When training is finished, both the predictor and the extractor should produce some approximation of $u(t)$, so that $f(t) \approx z(t) \approx u(t)$. Even though both sets of weights $K$ and $\mathbf{w}_o$ are trained in a supervised fashion, the overall training method is unsupervised, because they are trained on signals that are internal to the extractor and predictor networks.

²The requirement that the learned synapses be the $\mathbf{w}_o$ readout weights is not strictly necessary. Nor is the requirement that $z(t)$ be a linear readout, or even that $z(t)$ have instantaneous dynamics. Training with these alternative cases can be handled with the FORCE learning paradigm as shown in previous chapters.

Figure 4.2: A) 10 sample dimensions of $I(t)$. The sensory signal $I(t)$ is composed of a predictable signal, a chaotic signal, which is inherently unpredictable, and white noise.
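Pulling the pieces together, the following is a schematic sketch of one training step of the coupled system: the extractor weights K are nudged so that f(t) tracks z(t), and the predictor's readout weights w_o are updated by an RLS-style FORCE step so that z(t) tracks f(t). The use of a plain gradient step for K, the learning rate, and all sizes are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
N, M, g, tau, dt, alpha = 800, 100, 1.5, 10.0, 1.0, 1.0
J = rng.normal(0.0, 1.0/np.sqrt(N), size=(N, N))
w_f = rng.uniform(-1.0, 1.0, size=N)
K = rng.standard_normal(M) / np.sqrt(M)      # extractor weights (trained)
w_o = np.zeros(N)                            # predictor readout weights (trained)
P = np.eye(N) / alpha
eta_K = 1e-4                                 # assumed extractor learning rate

x = 0.5 * rng.standard_normal(N); r = np.tanh(x); z = 0.0

def train_step(I_t):
    """One step of coupled training; I_t is the M-dimensional input sample at this time."""
    global x, r, z
    f_t = K @ I_t                              # extractor output, eq. (4.2)
    x += dt/tau * (-x + g*(J @ r) + w_f*z)     # predictor dynamics, eq. (4.4)
    r = np.tanh(x)
    z = w_o @ r                                # predictor output, eq. (4.3)
    # Predictor: FORCE/RLS update so that z tracks f.
    e = z - f_t
    Pr = P @ r
    P -= np.outer(Pr, Pr) / (1.0 + r @ Pr)
    w_o[:] = w_o - e*(P @ r)
    # Extractor: gradient step so that f tracks z (reduces the predictor's surprise).
    K[:] = K - eta_K*(f_t - z)*I_t
    return f_t, z

# One step with a random input sample (in practice I_t would come from eq. 4.1):
f_t, z_t = train_step(rng.standard_normal(M))
```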
4.3 Using z(t) as a Teacher Signal for f(t) Reduces the Surprise Signal
In this section we derive our learning rule for the extractor network. As stated previously, the goal of the extractor network is to reduce the surprise signal of the predictor network, and we stated that this is accomplished by training the extractor network to generate $f(t) = z(t)$. Here we give a partial explanation of why this target works to reduce the surprise signal; this section can be omitted if one is not interested in the calculation. Learning in the predictor network has the goal of reducing the following objective function
$$E(T) = \frac{1}{2}\int_0^T dt\,\big(z(t) - f(t)\big)^2. \qquad (4.7)$$
Under a simple version of FORCE learning that does not use the network spatial correlations, the rule for the predictor network is
$$\dot{w}_{oj}(t) = -\eta\,\big(z(t) - f(t)\big)\,r_j(t). \qquad (4.8)$$
Differentiating equation (4.6) with respect to an extractor weight $K_i$, using this rule together with $\partial f(t)/\partial K_i = I_i(t)$, gives
$$\frac{dL(T)}{dK_i} = -\eta^2\int_0^T dt\,\big(z(t) - f(t)\big)\,I_i(t)\,|\mathbf{r}(t)|^2.$$
Jo
Since |r(i)| 2 is a positive quantity, we view it simply as part of the learning rate and thus ignore it. This gives us dL{T) = - m l OK, Jo with rjK = % J r l
>
an