published in:

Proceedings of the International Conference on Computational Intelligence for Modelling, Control and Automation (CIMCA'99, Vienna), IOS Press 1999.

Organisation of Past States in Recurrent Neural Networks: Implicit Embedding

Peter Stagge and Bernhard Sendhoff
Institut für Neuroinformatik, Ruhr-Universität Bochum, 44780 Bochum, Germany
{peter.stagge, bernhard.sendhoff}, [email protected]

Abstract:

Non-relaxing recurrent neural networks (RNNs) generalize feedforward neural networks (FFNNs) in a straightforward way. However, this change strongly influences how the network is used, for example its training, and the capabilities of the network. Due to their inherent dynamics, RNNs are often used for prediction and modeling of dynamical systems. The internal dynamics of the network can be used to represent information about the past, whereas in FFNNs a vector of the relevant past has to be built beforehand and passed to the network as input. The aim of this paper is to highlight the dependence of the reconstruction on the model which is used for the prediction and to propose an alternative approach to embedding based on recurrent neural networks. We analyse the reconstruction by RNNs, which is done implicitly by exploiting the ability of the network to represent past states in its activations.

1 Introduction

Prediction is a ubiquitous task and the idea behind it is rather simple. Given measurements of a dynamical system one can generally approach this task by choosing a vector of relevant past states (this process is called reconstruction or embedding) and then using an adequate model to accomplish the mapping between these states and the expected state a given time step ahead. For a nonlinear dynamic system this procedure was demonstrated e.g. in [1], where an information theoretic measure, the mutual information, was used to select a proper embedding and local maps were used for the prediction model. Formally, an embedding vector of a scalar time series {x(t)}_{t=1...N} is given by

    x_r(t) = (x(t), x(t - τ_1), ..., x(t - τ_{d_E - 1})),¹    (1)

where the embedding dimension d_E and the time-lags τ_i have to be specified. It has been shown that for a large enough embedding dimension there exists a diffeomorphism between the original attractor and the system described by the embedded vectors [2]. However, even if we know that such a mapping between the dynamical system and the reconstructed system exists, it does not tell us whether this diffeomorphism leads to an easy or complex prediction problem that has to be approximated by the model. The embedding parameters (d_E, τ_i) are often determined using criteria like False Nearest Neighbours for the dimension d_E [3] and information-theoretic criteria for the time-lags [4]. In such an approach the embedding is selected independently from the task one refers to afterwards (e.g. how far to predict into the future) and also from the model one uses to accomplish the given task. Within the framework of ARMA(p,q) models the approach is different: one starts with the given model and task and chooses the parameters p, q (something similar to the embedding procedure) in a way that relates the model to the task. In Section 2, we will show that there are several combinations of the embedding parameters for a specific model which lead to a minimal prediction error, but which do not correspond to the theoretical values. In order to circumvent the problem of a proper a priori (i.e. before the error is available) choice of the parameters, we introduce a recurrent neural network in Sec. 3.1, which implicitly and adaptively solves the problem of reconstruction and has successfully been applied to the modelling and prediction of time-series in [6]. Implicit embedding is realized by neurons which represent past states of the dynamical system. Evidence for this will be provided in Sec. 3.2 and 3.3 by applying an information theoretic measure for the analysis of the activations of neurons of the recurrent neural network. It provides some insight into how RNNs solve the twofold task of reconstruction and prediction and how reconstruction is organized adaptively in such a model.

¹ It is common in the literature to assume a constant τ and to set τ_i = i · τ.
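As a concrete illustration of Eq. (1), the following minimal sketch builds embedding vectors from a scalar time series with arbitrary time-lags (the function name and NumPy usage are ours, not from the paper):

```python
import numpy as np

def embed(x, lags):
    """Build delay-embedding vectors x_r(t) = (x(t), x(t - lags[0]), ...).

    x    : 1-D array, the scalar time series
    lags : increasing positive delays tau_1, ..., tau_{d_E - 1}
    Returns an array of shape (len(x) - max(lags), d_E).
    """
    x = np.asarray(x, dtype=float)
    max_lag = max(lags)
    rows = []
    for t in range(max_lag, len(x)):
        rows.append([x[t]] + [x[t - tau] for tau in lags])
    return np.array(rows)

# Toy series 0, 1, ..., 9 with lags (1, 3): first vector is (x(3), x(2), x(0)).
vecs = embed(np.arange(10), lags=(1, 3))
print(vecs[0])  # -> [3. 2. 0.]
```

Each row of the result is one vector x_r(t); a model then maps it to the state a given number of steps ahead.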

2 Task-Model-Embedding Dependence

We already indicated that the choice of the model will strongly influence the embedding parameters, d_E and τ_i, that lead to the best predictor. The coupling of model and embedding has been mostly neglected in the literature. Although it is unlikely that a general relation between representation and model can be devised, the common independent approach for the reconstruction is questionable, at least for the case when a minimal prediction error is wanted. In a more general framework the issue of potentials and limitations of nonlinear time-series analysis is commented on by Kantz [8]. In order to emphasize the dependence of the prediction error on the appropriate choice of the embedding parameters according to the model, we analysed the prediction of the x-coordinate of the Lorenz [9] attractor. (Data generated by 4th order Runge-Kutta integration, sampled at dt = 0.03, 750 data points as training/test data set.) The model we use is a 3-4-1 FFNN with full connections between input and hidden layer and between hidden layer and output unit. As input we take the following embedding vector:

    x_r(t) = (x(t), x(t - τ_1), x(t - τ_2)).
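The data generation described above can be sketched as follows; the paper does not state the Lorenz parameters or initial condition, so the standard values σ = 10, ρ = 28, β = 8/3 and the transient length are our assumptions:

```python
import numpy as np

def lorenz(s, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz equations [9] (standard parameters assumed)."""
    x, y, z = s
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

def rk4_step(f, s, dt):
    """One 4th-order Runge-Kutta step."""
    k1 = f(s)
    k2 = f(s + 0.5 * dt * k1)
    k3 = f(s + 0.5 * dt * k2)
    k4 = f(s + dt * k3)
    return s + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def lorenz_x_series(n=750, dt=0.03, s0=(1.0, 1.0, 1.0), transient=1000):
    """Sample the x-coordinate of the Lorenz system, as in the experiment."""
    s = np.array(s0, dtype=float)
    for _ in range(transient):          # discard transient to reach the attractor
        s = rk4_step(lorenz, s, dt)
    xs = np.empty(n)
    for i in range(n):
        xs[i] = s[0]
        s = rk4_step(lorenz, s, dt)
    return xs
```

Noise at the stated SNR would then be added to `xs` before building the embedding vectors.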

The task is to predict five steps ahead. Noise has been added to the data at a signal-to-noise ratio (SNR) of 30 dB. The results are shown in Fig. 1 (left). The mean prediction error, averaged over 8 runs, versus the embedding parameters (τ_1, τ_2) is shown in grey-scale. Light areas represent a small prediction error, dark areas a large one. We note that there are several different good combinations of (τ_1, τ_2): τ_1 = 1 with τ_2 = 4, 5, 6 and τ_1 = 4, 5 with τ_2 = 12, 13. These different combinations cannot be identified with the "standard" embedding methods, nor do they represent one connected area in the (τ_1, τ_2) diagram. The "standard" mutual information criterion for reconstruction results in the embedding parameters (τ_1, τ_2) = (3, 7) or (τ_1, τ_2) = (4, 7), [3], which does not correspond to a minimal prediction error. Similar results have been obtained for other noise levels and other time series. Additionally, the development of the prediction error over time is shown for three different parameter settings in Fig. 1 (right). We observe that the success of different embedding parameters furthermore depends on the adaptation process and its length.

Figure 1: Left: Prediction error for various τ_1, τ_2 values. Every FFNN is averaged over 8 runs for the five-step prediction task with SNR = 30 dB. There are two good regions, τ_1 = 1 with τ_2 = 4, 5, 6 and τ_2 = 12, 13 with τ_1 = 4, 5. Right: Averaged learning curve (20 runs) for 3 different (τ_1, τ_2) combinations.
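The experiment above amounts to a grid search over (τ_1, τ_2). A minimal sketch, with a linear least-squares predictor standing in for the paper's 3-4-1 FFNN (all names, and the stand-in model itself, are ours):

```python
import numpy as np

def five_step_dataset(x, tau1, tau2, horizon=5):
    """Inputs (x(t), x(t - tau1), x(t - tau2)), targets x(t + horizon)."""
    m = max(tau1, tau2)
    rows, targets = [], []
    for t in range(m, len(x) - horizon):
        rows.append([x[t], x[t - tau1], x[t - tau2]])
        targets.append(x[t + horizon])
    return np.array(rows), np.array(targets)

def grid_search(x, taus=range(1, 7)):
    """Mean squared five-step prediction error for each (tau1, tau2) pair."""
    errors = {}
    for t1 in taus:
        for t2 in taus:
            if t2 <= t1:
                continue
            X, y = five_step_dataset(x, t1, t2)
            X1 = np.hstack([X, np.ones((len(X), 1))])   # bias column
            w, *_ = np.linalg.lstsq(X1, y, rcond=None)  # linear stand-in model
            errors[(t1, t2)] = np.mean((X1 @ w - y) ** 2)
    return errors
```

For the paper's setting one would replace the linear model with the trained FFNN, average the error over several runs, and render the `errors` dictionary as the grey-scale map of Fig. 1.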

3 Implicit Embedding in Recurrent Neural Networks

In the last section we indicated that the embedding parameters which lead to a minimal prediction error differ from the ones which are determined via the mutual information criterion. This suggests that in general the best choice of the parameters will depend on the model and the task. It would therefore be desirable to incorporate reconstruction into model building. In this section we introduce an RNN which uses implicit embedding for the representation of past states.

3.1 The Recurrent Neural Network

The recurrent neural network which we use [6] is based on the Elman network [7] but generalized in two ways. First, the layer-restricted structure is relaxed to allow for arbitrary recurrent connections and all possible feedforward connections. This is due to the idea of optimizing the structure itself at a later time. The second change refers to the usage of not only one memory layer from one time-step ago but allowing several memory layers from several time-steps ago. We keep the time-step organization of the Elman network. This implies that the information is propagated from the input to the output in just one time-step. This is reasonable for the prediction task, as otherwise the structure of the network could implicitly turn a one-step prediction task into a several-step prediction task. This results in a network which obeys the following dynamics:

    v_i(t) = Σ_{j=1}^{i-1} w_ij σ(v_j(t)) + Σ_s^{memory layers} Σ_{j=1}^{N} r_ij^s σ(v_j(t - Δ_s)) + Σ_{k=1}^{K} f_ik x_k(t) + θ_i    (2)

    y_i(t+1) = σ(v_i(t)),    (3)

with

    y_i(t+1) : output of neuron i at time t+1
    v_i(t)   : activation of neuron i at time t
    N, K     : number of neurons and inputs
    σ        : sigmoidal function, e.g. tanh
    w_ij     : forward connection matrix, lower triangular, w_ii = 0
    f_ik     : input connection matrix
    r_ij^s   : recurrent connections from memory-layer s
    Δ_s      : time-delay of memory-layer s
    θ_i      : thresholds
    x_k(t)   : external input to the network at time t.
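One update step of these dynamics (Eqs. 2-3) can be sketched as follows; the array layout and names (W, R, F, theta, v_hist) are ours, not from the paper:

```python
import numpy as np

def rnn_step(W, R, F, theta, v_hist, x, deltas):
    """One step of the dynamics in Eqs. (2)-(3).

    W      : (N, N) lower-triangular forward weights, zero diagonal (w_ii = 0)
    R      : list of (N, N) recurrent weight matrices, one per memory layer s
    F      : (N, K) input weights
    theta  : (N,) thresholds
    v_hist : list of past activation vectors, v_hist[d] = v(t - d)
    x      : (K,) external input x(t)
    deltas : time-delay Delta_s of each memory layer
    Returns (v(t), y(t+1)).
    """
    N = len(theta)
    v = np.zeros(N)
    # W is lower triangular with zero diagonal, so within the current
    # time-step v_i only depends on v_j with j < i, already computed.
    for i in range(N):
        v[i] = (sum(W[i, j] * np.tanh(v[j]) for j in range(i))
                + sum(R_s[i] @ np.tanh(v_hist[d]) for R_s, d in zip(R, deltas))
                + F[i] @ x + theta[i])
    return v, np.tanh(v)
```

During operation, `v_hist` would be maintained as a rolling buffer of the last max(deltas) activation vectors, and the outputs y feed the memory layers of the next time-step.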

This approach has been successful for modeling the chaotic Rössler attractor [6]. Modeling here means more than just prediction: the final neural network is viewed as an autonomous system, i.e. the network's output constitutes its input for the next time step. The similarity of the two dynamical systems (the Rössler differential equations and the RNN) can be measured by invariants of the dynamics, in our case the largest Lyapunov exponent.
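To illustrate what such an invariant looks like computationally (for a simple map rather than the Rössler system of the paper; the example and all names are ours), the largest Lyapunov exponent of the logistic map can be estimated as the trajectory average of log |f'(x)|:

```python
import numpy as np

def largest_lyapunov_logistic(r=4.0, n=100000, x0=0.2):
    """Largest Lyapunov exponent of the logistic map x -> r x (1 - x),
    estimated as the average of log |f'(x)| = log |r (1 - 2x)| along
    a trajectory. For r = 4 the exact value is log 2."""
    x = x0
    total = 0.0
    for _ in range(n):
        # clip avoids log(0) on the rare visit near x = 0.5
        total += np.log(max(abs(r * (1.0 - 2.0 * x)), 1e-12))
        x = r * x * (1.0 - x)
    return total / n
```

For a system known only through a measured time series (the situation of the paper), the exponent would instead be estimated from the divergence of nearby trajectories in the reconstructed state space.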

3.2 Implicit Embedding

Information about past states can be exploited by recurrent neural networks for prediction; this is clearly demonstrated by the performance they yield. It is realized by those neurons in the network whose output is used as input in the next time-step; in the notation of equation (2) these are the neurons from the memory layers. Therefore, embedding can be realized by the network implicitly by using its memory. In turn, the memory is built up during adaptation via the feed-back connections of some neurons. As long as the representation is not introduced explicitly by the network structure, e.g. by tapped-delay lines in the input layer of a FFNN, it has to be learned. From observing the learning process we can gain insight into the way in which the network solves the embedding problem. In order to determine qualitatively whether a neuron takes part in the implicit embedding procedure, we employ an information theoretic measure, which is based on the fact that the outputs of the neurons represent part of the input of the network in the next time-step. The mutual information [10] measures nonlinear correlations between two random variables. In our approach, we use the external input, i.e. the original time-series, with different time delays d as the first random variable (X(t - d)) and the outputs of neurons with recurrent connections as the second random variable (Y_i(t)), i = 1 ... N (i denotes the neuron number).

Therefore, we calculate the mutual information between the internal inputs and the external input with and without time-delay:

    I(X(t - d); Y_i(t)) = H(X(t - d)) - H(X(t - d) | Y_i(t)).²    (4)

If the mutual information of e.g. neuron i, equation (4), has a large value for one particular delay d, then we can assume that the output of neuron i represents past values of the dynamical system and contributes to the embedding procedure in a similar way as the component x(t - τ_j) for τ_j = d in equation (1). The value of d corresponding to the maximum value of the mutual information is defined as

    d_i^max = argmax_d { I(X(t - d); Y_i(t)) }.    (5)

The value d_i^max reveals an implicit embedding of neuron i. Of course the implicit embedding and prediction task cannot be strictly separated from each other. Therefore, a value of d_i^max > 0 does not tell us that neuron i exclusively embeds.
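A histogram-based estimate of Eq. (4) and the delay search of Eq. (5) might look as follows; the paper does not specify its estimator, so the bin count and the histogram approach are our assumptions:

```python
import numpy as np

def mutual_information(a, b, bins=16):
    """Histogram estimate of I(A; B) = H(A) + H(B) - H(A, B), in nats."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p_ab = joint / joint.sum()
    p_a = p_ab.sum(axis=1)
    p_b = p_ab.sum(axis=0)
    nz = p_ab > 0
    return float(np.sum(p_ab[nz] * np.log(p_ab[nz] / np.outer(p_a, p_b)[nz])))

def d_max(x, y, max_delay=20, bins=16):
    """Delay d maximizing I(X(t - d); Y(t)), cf. Eqs. (4)-(5).

    x : external input series X(t); y : a neuron's output series Y_i(t).
    """
    scores = [mutual_information(x[max_delay - d : len(x) - d], y[max_delay:], bins)
              for d in range(max_delay + 1)]
    return int(np.argmax(scores)), scores
```

Applied to each neuron with recurrent connections, `d_max` yields the d_i^max values whose development during learning is analysed in the next section.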

3.3 Experiments

We used the introduced measure to analyse the outputs of the neurons of an RNN during learning of the Lorenz time-series [12] in its chaotic domain. The differential equation is three dimensional and only the x-coordinate was used as external input to the RNN, which consists of nine hidden neurons and one memory layer. (The data were generated using 4th order Runge-Kutta integration, sampled at dt = 0.05. The training data set consisted of a sequence with 1000 data points.) By observing the d_i^max values during learning one generally notices that d_i^max goes to zero for all neurons at the beginning of the adaptation process, see Fig. 2 (left). This behaviour can often be seen directly during the learning phase: at the beginning an RNN simply learns the trivial prediction, i.e. it produces the identity map, which means that the predicted output is just the input. At this stage there is no need for the RNN to use past information, which is revealed by the d_i^max values. At later stages of the learning process this changes. The prediction error decreases, past information gets used, and some neurons develop a value d_i^max ≠ 0. Embedding in addition to approximation obviously develops during learning. The RNN will not represent past states if it is not necessary. These observations hold both for learning the weights via backpropagation and for learning with Evolution Strategies [11]. The representation of past states in the final RNN is revealed by the mutual information I(X(t - d); Y_i(t)) (cf. equation 4), which is shown for one of the hidden neurons in Fig. 2 (right). This indicates a delay of d = 5 time-steps, which is verified by Fig. 3, showing the output of hidden neuron 7 and the external input-series. The time offset between the neuron's output and the original time-series is clearly visible in the enlargement of Fig. 3. This example shows that an efficient memory over several time-steps can be generated although the network possesses just a memory layer reaching one time-step into the past. One further test for this measure is to apply it to a problem without any information in the temporal order of the presented data. We did this by randomizing the order of the presented data during learning, and consistently all the values d_i^max became zero.

² H denotes the entropy of a random variable, defined by: H(X) = − Σ_i p_X(x_i) log p_X(x_i).

Figure 2: Mutual information between the external input (X(t)) and the internal input (Y_7(t)) produced by neuron 7, and development of d_7^max for that neuron.

Figure 3: External input-series (X(t)), dashed line, and internal input-series (Y_7(t)), solid line; one sees the shift (d ≈ 5) of the internal input. Note that the internal input, produced by neuron 7, is shown and not the outcome of the prediction.

4 Discussion

This article deals with the embedding problem for the prediction of time-series for which the full information about the state of the system is not available. We highlighted the importance of the choice of the embedding parameters with respect to a minimal prediction error and showed that the best setting is not necessarily the one which would be expected from theoretical observations. This is due to the fact that the theoretical existence of a diffeomorphism does not tell us anything about how hard it is to construct it with the chosen model. Recurrent neural networks of the structure introduced in this article seem to be able to solve this problem by using an embedding procedure which is hidden in the adaptation property of the network and which has therefore been termed implicit embedding. In order to reveal this behaviour we introduced a simple measure which observes the representation of past states in recurrent neural networks. This measure gives insight into the functionality of an RNN and can serve as a quantity to analyze the learning process of the network. Up to now we used RNN structures with just one input variable, so the network had to use its representation capabilities; however, it might also prove useful to take an explicitly embedded input vector, in which all necessary information should be available, and give the network the opportunity to find a more suitable representation via its recurrent activations.

References

[1] H.D.I. Abarbanel, T. Carroll, L. Pecora, J. Sidorowich, and L. Tsimring. Predicting physical variables in time-delay embedding. Phys. Rev. E, 49(3):1840-1853, 1994.
[2] T. Sauer, J. Yorke, and M. Casdagli. Embedology. J. Stat. Phys., 65, 1991.
[3] H. Abarbanel, R. Brown, J. Sidorowich, and L. Tsimring. Analysis of observed chaotic data in physical systems. Rev. Mod. Phys., 65:1331-1392, 1993.
[4] W. Wienholt and B. Sendhoff. How to determine the redundancy of noisy chaotic time series. Int. J. of Bifurcation and Chaos, 6(1), January 1996.
[5] J. Elman. Learning and development in neural networks: the importance of starting small. Cognition, 48:71-99, 1993.
[6] P. Stagge and B. Sendhoff. An extended Elman net for modeling time series. In W. Gerstner, editor, Int. Conference on Artificial Neural Networks, Lecture Notes in Computer Science. Springer, September 1997.
[7] J. Elman. Finding structure in time. Cognitive Science, 14:179-211, 1990.
[8] H. Kantz. Nonlinear time-series analysis: potentials and limitations. In J. Parisi, S. Müller, and W. Zimmermann, editors, Nonlinear Physics of Complex Systems. Current Status and Future Trends, pages 213-228. Springer Verlag, 1996.
[9] E.N. Lorenz. Deterministic nonperiodic flow. J. Atmos. Sci., 20:130-141, 1963.
[10] T. Cover. Elements of Information Theory. Wiley, 1991.
[11] W. Wienholt. Minimizing the system error in feedforward neural networks with evolution strategies. In ICANN'93. Springer Verlag, 1993.
[12] E.N. Lorenz. Irregularity: a fundamental property of the atmosphere. Tellus, 36(A), 1984.