Phase-Space Learning for Recurrent Networks

Fu-Sheng Tsung
Chung Tai Ch'an Temple
56, Yuon-fon Road, Yi-hsin Li, Pu-li
Nan-tou County, Taiwan 545
Republic of China

Garrison W. Cottrell1
Institute for Neural Computation
Computer Science & Engineering
University of California, San Diego
La Jolla, California 92093

March 29, 1995

1 Correspondence should be addressed to the second author:

[email protected]

Abstract

Current recurrent learning algorithms use a "trick", teacher forcing, that is not theoretically justified and leads to a paradox (explained below). Also, gradient descent alone is often inadequate at extracting the dynamical information of the target signals necessary to train hidden units properly. We introduce a conceptual framework, viewing recurrent training as matching vector fields of dynamical systems in phase space, that overcomes this difficulty by using the phase-space reconstruction technique to make the hidden states explicit. Thus learning temporal signals in a recurrent network is reduced to the easier problem of feedforward learning. In short, we propose viewing iterated prediction [LF88] as the best way of training recurrent networks on deterministic signals. Using this framework, we can train multiple trajectories, insure their stability, and design arbitrary dynamical systems. We show examples of training multiple, coexisting periodic attractors, present a novel online version (called ARTISTE) that can learn a periodic attractor without teacher forcing, and outline a method to store any number of periodic attractors of any waveform into a single network.

1 Introduction

Existing general-purpose recurrent algorithms are capable of rich dynamical behavior. Unfortunately, straightforward applications of these algorithms to train fully-recurrent networks to perform complex temporal tasks have had much less success than their feedforward counterparts. For example, to train a recurrent network to oscillate like a sine wave (the "hydrogen atom" of recurrent learning), existing techniques either produced a deformed waveform [WZ89], or required more hidden units than necessary. [WZ89] showed a two-unit network trained with RTRL, with one teacher signal. One unit of the resulting network showed a distorted waveform; the other had only half the amplitude. [Pea89, TB92] used four or more hidden units. However, our work demonstrates that a two-unit recurrent network with no hidden units can learn the sine wave very well [Tsu94]. Existing methods also have several other limitations. For example, the network often does not converge even though a solution is known to exist; teacher forcing is usually necessary to learn periodic signals; it is not clear how to train multiple trajectories at once, or how to insure that the trained trajectory is stable (an attractor). In this paper we analyze the algorithms to discover why they have such difficulties, and propose a general solution to the problem. First, we introduce the dynamical systems perspective. Then, we analyze two existing training methods, Real-Time Recurrent Learning (RTRL) [WZ89] and back propagation through time (BPTT) [RHW86] [Pea89], with respect to this framework. Next, we elucidate the phase-space learning framework. Following this, we present simulation results of a new algorithm inspired by this framework, ARTISTE, as well as simulations showing that phase-space learning can capture the dynamics of quite complex systems.

2 Background

Recurrent neural networks are dynamical systems; they evolve in time. Most recurrent networks in use today are discrete dynamical systems, that is, they are iterated maps represented by the system of equations:

y1(t + 1) = f1(y1(t), y2(t), ..., yn(t))
...
yn(t + 1) = fn(y1(t), y2(t), ..., yn(t))

or, in vector notation, y(t + 1) = f(y(t)), where y(t) = (y1(t), ..., yn(t)) is the vector of output activations of the network, and f = (f1, ..., fn) is the transfer function. We can consider this discrete system to be an approximation to the continuous system:

dy/dt = -y + f(y)
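The relation between the two forms can be made concrete: an Euler step of the continuous system with step size one, y + Δt(-y + f(y)), reduces exactly to the iterated map y(t+1) = f(y(t)) when Δt = 1. A minimal sketch in Python, where the logistic transfer function and the weight matrix W are illustrative assumptions, not from the paper:

```python
import numpy as np

def f(y, W):
    """Network transfer function: logistic squashing of a linear map."""
    return 1.0 / (1.0 + np.exp(-W @ y))

def iterate_discrete(y0, W, steps):
    """Discrete dynamics: y(t+1) = f(y(t))."""
    y = np.array(y0, dtype=float)
    for _ in range(steps):
        y = f(y, W)
    return y

def iterate_continuous(y0, W, steps, dt=1.0):
    """Euler integration of dy/dt = -y + f(y)."""
    y = np.array(y0, dtype=float)
    for _ in range(steps):
        y = y + dt * (-y + f(y, W))
    return y

W = np.array([[0.0, -4.0], [4.0, 0.0]])   # arbitrary example weights
y0 = [0.2, 0.8]
# With dt = 1, the Euler update y + (-y + f(y)) collapses to f(y) exactly,
# so both trajectories coincide.
assert np.allclose(iterate_discrete(y0, W, 50), iterate_continuous(y0, W, 50, dt=1.0))
```

For smaller dt the continuous system is followed more faithfully; the discrete network is the coarsest (dt = 1) discretization.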

In many situations, we want the behavior of the network y to match a smooth trajectory x(t). The trajectory x(t) is itself generated by a dynamical system X. The difference between X and x(t) is that a dynamical system, X, represents an infinite number of trajectories x(t, x0), depending on the initial conditions x0 (we write x(t) instead of x(t, x0) for brevity). Of these trajectories, we are often only interested in a very small number of them, the attractors of X, because they are what is left after the transients die out, and because they are stable to small perturbations. The phase space of a dynamical system is the space of its dependent variables; it excludes the time dimension t, the independent variable. For a dynamical system X with dimension m, the evolution of a trajectory x(t) traces out a phase curve, or orbit, in the m-dimensional phase space of X. Time is implicit in this representation as a parameterization of the curve. If a trajectory stops changing, it traces out an orbit that terminates at a point (the fixed point). If a trajectory is periodic, it traces out a closed orbit. A periodic attractor in phase space is a stable limit cycle, which means nearby orbits (other trajectories) converge to the closed orbit. The vector field of a dynamical system X is the point-wise derivative of X with respect to time, which can be represented as a tangent vector at each point of the phase space. These tangent vectors show graphically how each point in the phase space of the dynamical system evolves. The phase space of the n-unit recurrent network y is a hypercube in the n-dimensional Euclidean space consisting of the network activations (y1, ..., yn). For the network y to learn the trajectory (or trajectories) x(t), n must be equal to or greater than m.1 For example, if x(t) is a sine wave, then X must be at least two dimensional, so that y must consist of at least two units.
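To make the vector-field notion concrete, the sketch below samples the tangent vectors of a two-dimensional system on a grid of phase-space points. The system chosen is the van der Pol oscillator (which reappears in Section 5); the grid resolution and the parameter mu are arbitrary choices:

```python
import numpy as np

def van_der_pol(x, mu=1.0):
    """Tangent vector of the van der Pol oscillator at phase-space point x = (x1, x2):
    dx1/dt = x2,  dx2/dt = mu*(1 - x1^2)*x2 - x1."""
    x1, x2 = x
    return np.array([x2, mu * (1.0 - x1**2) * x2 - x1])

# Sample the vector field on a grid: one tangent vector per phase-space point.
grid = [(x1, x2) for x1 in np.linspace(-2, 2, 5) for x2 in np.linspace(-2, 2, 5)]
field = {p: van_der_pol(np.array(p)) for p in grid}

# At a fixed point the tangent vector vanishes; the origin is the only one here.
assert np.allclose(field[(0.0, 0.0)], [0.0, 0.0])
```

Plotting each tangent vector as an arrow at its base point gives the usual vector-field picture; every trajectory of the system is a curve that follows these arrows.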
The learning problem, then, may be viewed as matching dynamical systems in phase space; that is, we are minimizing a cost function between the m-dimensional orbit(s) x(t) of X (the teacher signals) and the orbits of a subset of m units (the visible units) of the network y, given the same initial conditions. [Hop82] showed how to store fixed-point attractors with recurrent networks; in this paper we show how to train periodic attractors, but this approach works as well for chaotic time series, since the ideas it is based upon come from the time series prediction community.

3 Analysis of current approaches

To get a better understanding of why recurrent algorithms have not been very effective, we look at what happens during training with two popular recurrent learning techniques: RTRL and back propagation through time (BPTT). With each, we illustrate a different problem, although the problems apply equally to each technique. RTRL is a forward-gradient algorithm that keeps a matrix of partial derivatives of the

1 There are cases where x(t) lies in a lower dimensional space than X. Also, throughout this paper we have mixed the terminologies of continuous and discrete dynamical systems. These technical details are irrelevant for our discussion.



Figure 1: Learning a pair of sine waves with RTRL learning. (a) Without teacher forcing, the network dynamics (solid arrows) take it far from where the teacher (dotted arrows) assumes it is, so the gradient is incorrect. (b) With teacher forcing, the network's visible units are returned to the trajectory.

network activation values with respect to every weight. To train a periodic trajectory, it is necessary to teacher-force the visible units [WZ89, Tsu91], i.e., on every iteration, after the gradient has been calculated, the activations of the visible units are replaced by the teacher. To see why, consider learning a pair of sine waves offset by 90°. In phase space, this becomes a circle (Figure 1a). Initially the network (thick arrows) is at position X0 and has arbitrary dynamics. After a few iterations, it wanders far away from where the teacher (dashed arrows) assumes it to be. The teacher then provides an incorrect next target from the network's current position. Teacher forcing (Figure 1b) resets the network back on the circle, where the teacher again provides useful information. However, if the network has hidden units, then the phase space of the visible units is just a projection of the actual phase space of the network, and the teaching signal gives no information as to where the hidden units should be in this higher-dimensional phase space. Hence the hidden unit states, unaltered by teacher forcing, may be entirely unrelated to what they should be. This leads to the moving targets problem: during training, every time the visible units revisit a point, the hidden unit activations will differ, so the mapping changes during learning. If one uses a smaller time step and teacher-forces frequently in the hope of maintaining the hidden units in "nearby" regions, then the gradients may become too small for the system to learn [Tsu91]. (See [Pin88, WZ89, WZ90] for other discussions of teacher forcing.) With BPTT, the network is unrolled in time (Figure 2).
This unrolling reveals another problem. Suppose that in the teaching signal, the visible units' next state is a non-linearly separable function of their current state. Then hidden units are needed between the visible unit layers, but there are no intermediate hidden units in the unrolled network. The network must thus take two time steps to get to the hidden units and back. One can deal with this by giving the teaching signal every other iteration, but clearly, this is not optimal (consider that the hidden units must "bide their time" on the alternate steps).2 The trajectories trained by RTRL and BPTT tend to be stable in simulations of simple

2 When this work was presented at NIPS in 1994, zero-delay connections to the hidden units were suggested, which is essentially part of the solution given here.


Figure 2: A nonlinearly separable mapping must be computed by the hidden units (the leftmost unit here) every other time step.


Figure 3: The paradox of attractor learning with teacher forcing. (a) During learning, the network learns to move from the trajectory to a point near the trajectory. (b) After learning, the network moves from nearby points towards the trajectory.

tasks [Pea89, TCS90], but this stability is paradoxical. Using teacher forcing, the networks are trained to go from a point on the trajectory to a point within the ball defined by the error criterion ε (see Figure 3(a)). However, after learning, the networks behave such that from a place near the trajectory, they head for the trajectory (Figure 3(b)). Hence the paradox. Possible reasons are: 1) the hidden unit moving targets provide training off the desired trajectory, so that if the training is successful, the desired trajectory is stable; 2) we would never consider the training successful if the network "learns" an unstable trajectory; 3) the stable dynamics in typical situations have simpler equations than the unstable dynamics [Nak93]. To create an unstable periodic trajectory would imply the existence of stable regions both inside and outside the unstable trajectory; dynamically this is more complicated than a simple periodic attractor. In dynamically complex tasks, a stable trajectory may no longer be the simplest solution, and stability could be a problem. In summary, we have pointed out several problems in the RTRL (forward-gradient) and BPTT (backward-gradient) classes of training algorithms:

1. Teacher forcing with hidden units is at best an approximation, and leads to the moving targets problem.


Figure 4: The network used for iterated prediction training. Dashed connections are used after learning.

2. Hidden units are not placed properly for some tasks.

3. Stability is paradoxical.

4 PHASE-SPACE LEARNING

The inspiration for our approach is prediction training [LF88], which at first appears similar to BPTT, but is subtly different. In the standard scheme, a feedforward network is given a time window, a set of previous points on the trajectory to be learned, as inputs. The output is the next point on the trajectory. Then, the inputs are shifted left and the network is trained on the next point (see Figure 4). Once the network has learned, it can be treated as recurrent by iterating on its own predictions. The prediction network differs from BPTT in two important ways. First, the visible units encode a selected temporal history of the trajectory (the time window). The point of this delay-space embedding is to reconstruct the phase space of the underlying system. [Tak81] has shown that this can always be done for a deterministic system. Consider the current time window as one vector-valued data point Xt. Shifting the window over time, an orbit in a higher dimensional space (the phase space) is formed from the original time series. When this embedding process is done properly, the orbital structure will accurately reflect that of the original system. An important property of a proper embedding is that each point on the orbit uniquely determines the next point on the orbit. This means that the evolution of the time series is now represented as a map, a set of vector pairs (Xt, Xt+1) where Xt+1 is the next-iterate of Xt. Hence what originally appeared to be a recurrent network problem can be converted into an entirely feedforward problem. Essentially, the delay-space reconstruction makes hidden states visible, and recurrent hidden units unnecessary. Crucially, dynamicists have developed excellent reconstruction algorithms that not only


Figure 5: Embedding a figure-8 from 2-D into 3-D.

automate the choices of delay and embedding dimension but also filter out noise or get a good reconstruction despite noise [FS91, Dav92, KBA92]. Thus, it is important not to confuse the assumption that the data (time series) is generated from a deterministic system with the requirement that the data must be noise-free. Real-world time series are always noisy, and the cited techniques have been developed to deal with this. Noise can also introduce artificial trajectory crossings in the training set; again, this can be easily checked and removed. On the other hand, we clearly cannot deal with non-deterministic systems by this method. The fact that phase space reconstruction results in a well-defined map is crucial. For example, to train a recurrent network with two visible units (as the x-y coordinates) to reproduce a figure-8, the network needs to know which way to go at the intersection. This corresponds to some internal state or hidden state of the network, and must be represented by recurrent hidden units. However, phase space reconstruction embeds the (x, y) time series into three dimensions, at least, so that the intersection point becomes two separate points (Figure 5). Now each point on the new orbit, (u, v, w), uniquely determines the next point, so there is no need for hidden states. Instead of a recurrent network with two visible units and some recurrent hidden units, now we have an iterated network with three visible units (the input-output layer) and a feedforward hidden layer. Although the feedforward hidden units are needed to learn the input-output mapping, they do not and cannot represent hidden states. Thus, phase space reconstruction is a general way of finding explicit representations for the hidden states so that recurrent hidden units (and the training difficulties associated with them) can be eliminated.
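The figure-8 disambiguation can be checked numerically. The sketch below traces a Lissajous figure-eight (an assumed parameterization, not data from the paper), which crosses itself at the origin in 2-D; appending one delayed coordinate separates the two visits to the crossing point:

```python
import numpy as np

t = np.linspace(0, 2 * np.pi, 2001)        # one full traversal of the curve
x, y = np.sin(2 * t), np.sin(t)            # Lissajous figure-eight; self-intersects at (0, 0)

tau = 100                                   # delay in samples (a free choice)
# 3-D embedding (u, v, w) = (x(t), y(t), y(t - tau)): one extra delayed coordinate.
u, v, w = x[tau:], y[tau:], y[:-tau]

# The curve passes through the crossing (0, 0) at t = pi and again at t = 2*pi.
i1 = np.argmin(np.abs(t[tau:] - np.pi))    # first visit
i2 = len(u) - 1                             # second visit
assert abs(u[i1]) < 1e-3 and abs(v[i1]) < 1e-3
assert abs(u[i2]) < 1e-3 and abs(v[i2]) < 1e-3
# In 3-D the two visits are distinct points: the delayed coordinate differs,
# so each embedded point uniquely determines its successor.
assert abs(w[i1] - w[i2]) > 0.5
```

The two passes through the 2-D intersection arrive from different directions, so their recent histories differ; the delayed coordinate records exactly that history.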
The second difference from BPTT is that the hidden units are between the visible units, allowing the network to produce nonlinearly separable transformations of the visible units in a single iteration. In the recurrent network produced by iterated prediction, the sandwiched hidden units can be considered "fast" units with delays on the input/output links summing to 1. Thus, phase-space learning (Figure 6) consists of: (1) embedding the temporal signal to recover its phase space structure, (2) generating local approximations of the vector field near the desired trajectory, and (3) functional approximation of the vector field with a feedforward


Figure 6: Phase-space learning. (a) The training set is a sample of the vector field. (b) Phase-space learning network. Dashed connections are used after learning.

network. Existing methods developed for these three problems can be directly and independently applied. Since feedforward networks are universal approximators [HSW89], we are assured that at least locally, the trajectory can be represented. The trajectory is recovered from the iterated output of the pre-embedded portion of the visible units. Phase-space learning is a conceptual framework with the iterated prediction network as the primary example of implementation. Within this framework, we view every temporal signal as generated by some m-dimensional deterministic dynamical system, and the learning task is to learn the vector field of the dynamical system as closely as possible. This perspective has some advantages that may not be immediately obvious from regular prediction training, which we will discuss in section 4.1. So far we have only talked about systems with no input. It is easy to extend the phase-space learning framework to include inputs that affect the output of the system. If we have a steady input (e.g., the input level determines different behavior, as in conductance level in a neuron), then the input can be treated as an ordinary, extra input unit. The situation is slightly more complicated if the input varies with time, since now we need to estimate how long past input influences the system. This means an embedding for the input needs to be performed. In the case of time-delay embedding (time window), the size of the input time window need not be the same as the size of the past-history time window from the time series itself.
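The core construction, turning the time series into the map of vector pairs (Xt, Xt+1) by sliding a window over it, can be sketched as follows; the window length, delay, and test signal are illustrative choices:

```python
import numpy as np

def delay_embed(series, dim, tau=1):
    """Time-delay embedding: row t is X_t = (s_{t+(dim-1)*tau}, ..., s_{t+tau}, s_t),
    newest coordinate first."""
    n = len(series) - (dim - 1) * tau
    return np.array([[series[t + (dim - 1 - k) * tau] for k in range(dim)]
                     for t in range(n)])

def training_pairs(series, dim, tau=1):
    """Feedforward training set: each embedded point paired with its successor."""
    X = delay_embed(series, dim, tau)
    return X[:-1], X[1:]

s = np.sin(2 * np.pi * np.arange(100) / 30)     # a sine wave, 30 samples per cycle
inputs, targets = training_pairs(s, dim=2)
assert inputs.shape == targets.shape == (98, 2)
# Consecutive windows overlap: the newest coordinate of X_t reappears as the
# second coordinate of X_{t+1}.
assert np.allclose(inputs[:, 0], targets[:, 1])
```

Any feedforward learner can then be fit to (inputs, targets); iterating its output reproduces the dynamics.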
The network will then be composed of two sets of inputs: one of m units from the previous output of the network, and another of p units from the input. These inputs are followed by some layer(s) of feedforward hidden units, and the output layer will have up to m units.
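A sketch of this architecture with random, untrained weights; the sizes m, p, h and the tanh hidden layer are illustrative assumptions. The m-dimensional output is fed back as the next state window while the p-dimensional input window is supplied externally:

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, h = 3, 2, 8          # state window, input window, hidden sizes (illustrative)

# Feedforward core: [state window, input window] -> tanh hidden layer -> next state.
W1 = rng.normal(scale=0.1, size=(h, m + p))
W2 = rng.normal(scale=0.1, size=(m, h))

def step(state, inp):
    """One iteration: concatenate both windows and pass through the hidden layer."""
    z = np.concatenate([state, inp])
    return W2 @ np.tanh(W1 @ z)

def run(state0, input_seq):
    """Closed-loop operation: each m-dimensional output becomes the next state window."""
    s = np.asarray(state0, dtype=float)
    trace = []
    for u in input_seq:
        s = step(s, u)
        trace.append(s.copy())
    return np.array(trace)

traj = run([0.1, 0.0, -0.1], rng.normal(size=(20, p)))
assert traj.shape == (20, m)
```

During training the state inputs come from the embedded teacher data; only after training is the output fed back, as with the dashed connections in Figure 6.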


4.1 Other advantages of phase-space learning

The phase-space learning framework, in many practical implementations, reduces to the familiar iterated prediction method. As such, it may be prone to stability problems just as the prediction network is, as mentioned in section 3. Having trained only one segment at a time, why should the network remain on the desired trajectory? Since we are now learning a mapping in phase space, this problem is easily solved by adding additional training examples that converge towards the desired orbit (in phase space). A common heuristic in recurrent training is to add noise to the input in prediction training. This practice is justified when we look at what it means in phase space: it is just a simple-minded way of adding convergence information. However, we can have much more control than that. For example, in learning a sine wave as a periodic attractor, if we want nearby orbits to spiral in to the sine wave at a particular speed, we simply add those vectors into the training set. Thus we can explicitly specify that the learned trajectory is to be an attractor, and we can specify the behavior near the attractor. In this framework, training multiple attractors is just training orbits in different parts of the phase space, so they simply add more patterns to the training set. In fact, when we add noise or the extra convergent vectors, we are already training multiple trajectories. We can randomize the training set sufficiently to insure proper training of all parts of the phase space; there is no particular reason one should train a trajectory in its given temporal sequence. In fact, we can now create designer dynamical systems possessing the properties we want, e.g., with combinations of fixed-point, periodic, or chaotic attractors.
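The noise-as-convergence-information idea can be sketched for a circular (sine-wave) attractor; the noise scale and orbit sampling below are arbitrary choices. Each training pair maps a jittered, off-orbit point to the next on-orbit point, so the training set itself encodes contraction onto the cycle:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 30                                     # iterations per cycle
theta = 2 * np.pi * np.arange(200) / T
clean = np.stack([np.sin(theta), np.cos(theta)], axis=1)   # the desired orbit (unit circle)

# Jittered inputs, clean next-point targets: every off-orbit point is mapped
# back onto the orbit, so the learned map contracts toward the cycle.
inputs = clean[:-1] + rng.normal(scale=0.05, size=(199, 2))
targets = clean[1:]

radii_in = np.linalg.norm(inputs, axis=1)
radii_out = np.linalg.norm(targets, axis=1)
assert np.allclose(radii_out, 1.0)         # targets lie exactly on the circle
assert radii_in.std() > 0.0                # inputs are scattered off it
```

Replacing the random jitter with explicitly constructed off-orbit points and targets gives the finer control over convergence speed and direction described above.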
As an example, to store any number of arbitrary periodic attractors zi(t) with periods Ti in a single recurrent network, create two new coordinates for each zi(t), (xi(t), yi(t)) = (sin(2πt/Ti), cos(2πt/Ti)), where (xi, yi) and (xj, yj) are disjoint circles for i ≠ j. Then (xi, yi, zi) is a valid embedding of all the periodic attractors in phase space, and the network can be trained. In essence, the first two dimensions form "clocks" for their associated trajectories.
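A sketch of this clock construction, where the waveforms, periods, circle centers, and radius are illustrative choices: each signal z_i gets a circular clock of its own period, and offsetting the circles keeps the embedded orbits disjoint:

```python
import numpy as np

def clock_embedding(z, T, center, radius=0.4):
    """Augment a periodic signal z(t) (period T samples) with a circular 'clock'
    of the same period, offset so different attractors' clocks stay disjoint."""
    t = np.arange(len(z))
    x = center[0] + radius * np.sin(2 * np.pi * t / T)
    y = center[1] + radius * np.cos(2 * np.pi * t / T)
    return np.stack([x, y, z], axis=1)

z1 = np.sin(2 * np.pi * np.arange(120) / 30)               # period-30 sine
z2 = np.sign(np.sin(2 * np.pi * np.arange(120) / 40))      # period-40 square wave
orbit1 = clock_embedding(z1, 30, center=(-1.0, 0.0))
orbit2 = clock_embedding(z2, 40, center=(+1.0, 0.0))

# Because the clock circles are disjoint, the embedded orbits never intersect:
# already in the (x, y) plane every pair of points is well separated.
d = np.linalg.norm(orbit1[:, None, :2] - orbit2[None, :, :2], axis=-1)
assert d.min() > 1.0
```

Since the orbits occupy disjoint regions of phase space, the combined training set defines a single well-defined map, and which attractor the trained network settles into depends only on the initial condition.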

4.2 Training sequence and ARTISTE


As to the mechanics of phase-space learning, there are several ways to specify the training sequence presented to the network. Since the training is feedforward, one can randomize the training set (consisting of data near the desired trajectory) and train the network in an epoch-wise fashion. This corresponds to, on average, training all local segments of the desired trajectory simultaneously. One may also present the data sequentially along the desired trajectory. This is like training with complete spatial teacher forcing. RTRL and BPTT with teacher forcing and no hidden units, and sequential prediction training, all fall into this category, except that the training is on the desired trajectory only. Another possibility is to allow the network to move freely while its dynamics is modified by online weight changes. Wherever the network is, a teacher showing the local vector field is given, as in Figure 6. In this case there is no predetermined training set; the local vector field is approximated on the spot. We call this version of the algorithm ARTISTE, for Autonomous Real TIme Selection of Training Examples. The advantage of this approach

is that the training examples are tailored for the current dynamics of the network. From training on the desired trajectory (100% spatial forcing), to autonomous training (0% spatial forcing), there is obviously a continuum of possibilities where the degree of spatial forcing is varied.

5 Simulation results

In this section we give some examples of using phase-space methods to learn periodic attractors. The first two tasks, learning a sine wave and the van der Pol oscillator, are trained online with ARTISTE. The third example shows batched training of the van der Pol oscillator. The last two examples show training of multiple attractors: one with four separate periodic attractors, the other with one attractor inside the basin of another attractor.

5.1 ARTISTE applied to learning sine waves

We describe a simple task trained using the ARTISTE algorithm, which allows the network to determine its own course of learning. This is the spatial, or phase-space, analog of online temporal training without teacher forcing. The task is to learn a pair of sine waves centered at (0.5, 0.5), with radius r = 0.4, as the desired limit cycle attractor, completing one full cycle in 30 iterations. We used a feedforward network as described above without hidden units for this task, as we know that two units can oscillate in this manner. We construct the vector field to converge towards the sine wave attractor as follows. The current position (x, y) of the network is converted to polar coordinates (r0, θ0). Then the new position (r1, θ1) is closer to the limit cycle and rotated; specifically,

r1 = a(r0 - r) + r
θ1 = θ0 + 2π/30
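The update above can be written directly as a target-generating map. In the sketch below, a is a contraction factor assumed to lie in (0, 1); the specific value 0.9 is our choice, not from the paper:

```python
import numpy as np

CX, CY, R, T = 0.5, 0.5, 0.4, 30    # center, attractor radius, iterations per cycle

def local_target(x, y, a=0.9):
    """Target next position for the network at (x, y): contract the radius toward R
    and advance the phase, i.e. r1 = a*(r0 - R) + R, theta1 = theta0 + 2*pi/T."""
    r0 = np.hypot(x - CX, y - CY)
    th0 = np.arctan2(y - CY, x - CX)
    r1 = a * (r0 - R) + R
    th1 = th0 + 2 * np.pi / T
    return CX + r1 * np.cos(th1), CY + r1 * np.sin(th1)

# Starting off the limit cycle, the radius contracts geometrically...
x, y = local_target(CX + 0.6, CY)                      # initial radius 0.6
assert abs(np.hypot(x - CX, y - CY) - 0.58) < 1e-9     # 0.9*(0.6 - 0.4) + 0.4
# ...and iterating the target map converges onto the radius-R cycle.
for _ in range(200):
    x, y = local_target(x, y)
assert abs(np.hypot(x - CX, y - CY) - R) < 1e-6
```

In ARTISTE this map is evaluated at whatever point the network currently occupies, so the training example is always tailored to the network's own dynamics.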
