Simple Learning Algorithm for Recurrent Networks to Realize Short-Term Memories
Katsunari SHIBATA*, Yoichi OKABE** and Koji ITO*
Email: [email protected]
* : Dept. of Computational Intelligence and Systems Science, Interdisciplinary Graduate School of Science and Engineering, Tokyo Inst. of Technology, 4259 Nagatsuta, Midori-ku, Yokohama 226 JAPAN
** : Research Center for Advanced Science and Technology, Univ. of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153 JAPAN
Abstract
A simple supervised learning algorithm for recurrent neural networks is proposed. By limiting the problems to delayed recognition (short-term memory) problems, it needs only O(n^2) memory and O(n^2) calculations, where n is the number of neurons. Since O(n^2) is the same as the order of the number of connections in the neural network, it is reasonable for implementation. The learning algorithm is similar to conventional static back-propagation learning: connection weights are modified by the products of the propagated error signal and some variables that hold information about the past output of the pre-synaptic neuron.
Key Words : Recurrent Neural Network, Back Propagation, Short-Term Memory, Delayed Recognition Problem
1. Introduction
We, living creatures, obtain many pieces of information from various sensors. However, since a great deal of information is necessary to represent our real world, we cannot recognize all of it at a single moment in time. We therefore utilize not only the present sensory signals but also the time history of sensory signals to generate motions or to recognize something. Since the raw sensory information is too much to be memorized, we have to extract the pieces of information to be memorized from it. The learning ability of neural networks is useful for obtaining flexible motions and recognition that adapt to the environment. In order to memorize past information, we employ recurrent neural networks. It is then expected that a recurrent network can extract and memorize the necessary information from the huge amount of past information in order to generate appropriate motions and recognitions.

Many learning algorithms for recurrent neural networks have already been proposed. However, none of them is realistic with respect to calculation time and memory size. Here, a realistic learning algorithm for recurrent neural networks is proposed to solve short-term memory problems, and some simulation results are presented.

(This research was supported by "The Japan Society for the Promotion of Science" as "Biologically Inspired Adaptive Systems" (JSPS-RFTF96I00105) in the "Research for the Future Program".)
2. Conventional Learning Algorithms
Two typical learning algorithms for recurrent neural networks have been proposed. One is BPTT (Back Propagation Through Time) ([1] etc. for discrete time, [2] etc. for continuous time), and the other is RTRL (Real Time Recurrent Learning) ([3] etc. for discrete time, [4] etc. for continuous time). In BPTT, it is necessary to propagate the error into the past. This means that the past states of the neural network have to be stored. If the propagation is truncated at T time steps, which is called truncated BPTT(T), the neural network cannot memorize signals from more than T time steps ago. BPTT(T) requires O(n^2 T) calculation time and O(nT) memory, where n represents the number of neurons [5]. In RTRL, on the other hand, the partial derivative of the output of each neuron with respect to each weight is calculated by simultaneous differential equations from the values at the previous time, and each weight is modified by putting the partial derivative into the steepest-descent equation. It is therefore not necessary to trace back in time, but O(n^4) calculation time and O(n^3) memory are required [5]. O(n^3) is more than the order of the number of weights, O(n^2), which means that even if a memory is assigned to each weight, the memory required per weight grows with the size of the neural network.

Our aim here is to give a learning algorithm for a recurrent neural network in which only O(n^2) calculation time and O(n^2) memory are necessary and tracing into the past is not needed. In order to realize a small calculation time and memory size, the problem is limited to short-term memory problems. Short-term memory problems are equivalent to delayed recognition ones, as shown in Fig. 1; that is, the input signals are reflected in the output of the neural network when the trigger signal comes on after some time lag from the inputs. Recently, Hochreiter et al. have proposed a special architecture for short-term memory, which includes units consisting of a memory cell, an input gate and an output gate [6]. These gates operate by multiplication, like σ-π units. They have also proposed a learning algorithm based on a variant of RTRL, which needs only O(n^2) memory. However, the authors will not employ any special structures here.

Fig. 1 An example of delayed recognition problem: input 1 (start), input 2 (trigger), input 3 (operand 1) and input 4 (operand 2); after a time lag T following the operands, the trigger signal arrives, and the ideal output is to output an operation result of the operand inputs after the trigger signal.
3. Forward Calculation
In the forward calculation, a model similar to the conventional continuous-time neural network model is used. The differences from the conventional one are that (a) the output function is a sigmoid whose value range is from -1 to 1, and (b) the internal state u(t) is not negative (the output x(t) is also not negative). This means that half of the sigmoid function is used, as shown in Fig. 2. Described by equations,

  τj duj(t)/dt = -uj(t) + Σi wji xi(t) + θj(t) ,    (1)

  duj(t)/dt = 0   when uj(t) = 0 and -uj(t) + Σi wji xi(t) + θj(t) < 0 ,    (2)

  xj(t) = f(uj(t)) = 2.0 / (1 + exp(-uj(t))) - 1.0 ,    (3)

where x : output of a neuron, u : internal state, θ : bias input, w : weight for connection, τ : time constant, f : output function.

Fig. 2 Output function employed here (x = f(u) for 0 ≤ u ≤ 10; the output rises from 0.0 toward 1.0).

The advantages of this model are that
(A) the state in which the output value is 0 is easily realized and stable; in the delayed recognition problem, the training signal in the non-excited state can be set to 0,
(B) the error can be removed when the output value is 0,
(C) by controlling the self feedback connection weight, two stable equilibrium points can be realized easily (see Appendix),
(D) since the differential of the output function is largest when the output value is 0, efficient learning can be realized,
(E) it is reasonable with respect to energy consumption that the output is 0 when all input signals are 0, and
(F) the outputs of neurons can be seen as pulse densities, because the output value is always positive.

In the case of a single neuron that has only a self feedback connection, the neuron has two stable equilibrium points, 0 and one other, when the following two conditions on w and θ are satisfied (see Appendix for the proof),

  w > 2 ,    (4)

  -{ √(w(w-2)) + log( w - 1 - √(w(w-2)) ) } ≤ θ ≤ 0 .    (5)
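As a rough, non-authoritative sketch (not from the paper), the forward dynamics of Eqs. (1)-(3) can be simulated with a forward-Euler step; the step size dt and the tiny one-neuron network below are illustrative assumptions:

```python
import numpy as np

def f(u):
    # Half sigmoid of Eq. (3): maps u >= 0 to x in [0, 1)
    return 2.0 / (1.0 + np.exp(-u)) - 1.0

def forward_step(u, x, w, theta, tau, dt=0.05):
    """One Euler step of Eq. (1), with the internal state kept
    non-negative in the spirit of Eq. (2)."""
    du = (-u + w @ x + theta) / tau
    u_new = np.maximum(u + dt * du, 0.0)  # u(t) is not allowed below 0
    return u_new, f(u_new)

# Tiny example: one neuron with a self feedback weight w > 2 (Eq. (4))
u = np.array([3.0])
x = f(u)
w = np.array([[2.5]])
theta = np.array([-0.1])
tau = np.array([0.4])
for _ in range(2000):
    u, x = forward_step(u, x, w, theta, tau)
print(x)  # settles near the non-zero stable equilibrium
```

Starting the same loop from u = 0 instead would stay at the other stable point, x = 0, which is the bistability the text exploits for memory.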
4. Learning
In order to keep the memory size and the calculation time small, the error signal is propagated backwards in the neural network and the connection weights are modified using the error signal, as in the conventional static Back-Propagation learning algorithm. In static Back-Propagation, weights are modified according to the product of the propagated error signal and the output of the pre-synaptic neuron. Here, instead of the output itself, some variables that hold information about the past output of the pre-synaptic neuron are used.

At the beginning, to prevent the propagated error signal δ from diverging, the weights of the connections from hidden neurons are limited. The connection weight wji is calculated by putting a variable w̃ji into the following equation,

  wji = 2W / (1 + exp(-w̃ji / W)) - W ,    (6)

where W : a constant (here W = 4), and w̃ji is modified using the error signal δ. The connection weights from input neurons are calculated by

  wji = w̃ji .    (7)

In the output layer, the error signal δ is calculated as

  δj(t) = trj(t) - xj(t) ,    (8)

where tr : a training signal, and in the hidden layer it is calculated as

  δi(t) = Σj vji(t) δj(t) / W ,    (9)

  dvji(t)/dt = ( wji(t) x'j(t) - vji(t) ) dxj(t)/dt ,    (10)

where x'(t) is set as follows,

  x'j(t) = dx̃j(ũj(t))/dũj(t) = (1.0 + x̃j(t))(1.0 - x̃j(t)) / 2.0   if xj(t) = 0.0 ,    (11)

  with x̃j(t) = f(ũj(t)) = 2.0 / (1.0 + exp(-ũj(t))) - 1.0 ,  ũj(t) = Σi wji xi(t) ,    (12)

  x'j(t) = dxj(uj(t))/duj(t) = (1.0 + xj(t))(1.0 - xj(t)) / 2.0   otherwise.    (13)

For the modification of the connection weights, the signals to be multiplied by the propagated error signal δ are as follows: (1) the latest outputs of the pre-synaptic neurons, (2) the output of the pre-synaptic neuron that changed recently among the inputs of the post-synaptic neuron, and (3) the output of the pre-synaptic neuron that caused the change of the post-synaptic neuron's output. Corresponding to (1), (2) and (3) above, the following three variables p, q, r are introduced,

  τj dpji(t)/dt = -pji(t) + xi(t) x'j(t) ,    (14)

  dqji(t)/dt = ( xi(t) x'j(t) - qji(t) ) Σi dxi(t)/dt ,    (15)

  drji(t)/dt = ( xi(t) x'j(t) - rji(t) ) dxj(t)/dt .    (16)

Furthermore, another variable q̄ji, which maintains q from the time when the post-synaptic neuron's output was positive, is introduced as follows,

  dq̄ji(t)/dt = 0   if xj(t) = 0 ,    (17)

  q̄ji(t) = qji(t)   otherwise.    (18)

Then p, q, r and q̄ are used to modify the connection weights. An example of the transition of these variables is shown in Fig. 3. In this case, x3, the output of neuron 3, is excited by x1 and inhibited by x2. Since q31 does not change its value while the values of all the pre-synaptic neurons (x1 and x2) are constant, it is useful for learning when no inputs have changed for a long time. q̄ji is employed because the latest inputs are not always important for hidden neurons. Since r31 does not change its value while the output of the post-synaptic neuron is constant, it is useful to modify the effect on the post-synaptic one.

Fig. 3 Example of the transition of the variables used for the learning (x1(t), x2(t), x3(t), p31(t), q31(t), q̄31(t) and r31(t) for 0 ≤ t ≤ 150). The initial value of each variable is 0.1.

The connection weights between the layers are modified as follows. First, dw is calculated as

  (hidden -> output)  dwji(t) = ( pji(t) + qji(t) + rji(t) ) δj(t) ,    (19)

  (input -> output)  dwji(t) = ( pji(t) + rji(t) ) δj(t) ,    (20)

  (hidden -> hidden, input -> hidden)  dwji(t) = ( q̄ji(t) + qji(t) + rji(t) ) δj(t) .    (21)

Then w̃ is calculated as

  dw̃ji(t)/dt = η dwji(t) ,    (22)

where η : the learning rate, and w is calculated by Eq. (6) and Eq. (7). Biases are modified in the same way as the weights, but for stability, the maximum value of the biases in the hidden layer is -0.1. There are no connections between output neurons.
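As a hedged illustration of Eqs. (6) and (14)-(16), an Euler-discretized sketch for a single connection follows (not the authors' code; the step size dt and the scalar setting are assumptions):

```python
import numpy as np

W = 4.0  # squashing constant of Eq. (6)

def squash_hidden(w_tilde):
    # Eq. (6): weights from hidden neurons are bounded in (-W, W)
    return 2.0 * W / (1.0 + np.exp(-w_tilde / W)) - W

def update_traces(p, q, r, x_pre, xp_post, dx_pre, dx_post, tau, dt=0.05):
    """Euler steps of Eqs. (14)-(16) for one connection j <- i.
    x_pre: x_i(t), xp_post: x'_j(t),
    dx_pre: sum over inputs of dx_i/dt, dx_post: dx_j/dt."""
    p += dt * (-p + x_pre * xp_post) / tau    # Eq. (14): decaying trace
    q += dt * (x_pre * xp_post - q) * dx_pre  # Eq. (15): moves only when inputs change
    r += dt * (x_pre * xp_post - r) * dx_post # Eq. (16): moves only when output changes
    return p, q, r

p, q, r = 0.1, 0.1, 0.1
for _ in range(100):
    p, q, r = update_traces(p, q, r, x_pre=0.6, xp_post=0.5,
                            dx_pre=0.0, dx_post=0.0, tau=0.4)
# with constant signals, q and r hold their values while p tracks x_i * x'_j
```

This reflects the text: q and r freeze while the pre- and post-synaptic signals are constant, which is what makes them usable long after the triggering event.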
5. Simulation of Delayed Recognition Problems
Some simulations of delayed recognition problems using the learning algorithm described above are shown. As shown in Fig. 1, input 1, which represents a start signal, is excited first, and then the other input signals, which are operands, are excited. After some time lag, another input, which serves as a trigger signal, is excited. Just after that, the ideal output, which is the training signal, is excited only when the operation result is 1. The excitation curve is equal to the output of a neuron driven by the trigger signal through some connection weight. All the inputs, start signal, operands and trigger signal, are put into the neural network without any discrimination. In each cycle, a presented pattern is chosen randomly, and each cycle is followed continuously by the next one. This means that none of the states in the neural network is reset at the boundary between cycles. In the initial state, the self feedback connection weights of the hidden neurons are set to 2.0 to accelerate the learning, and all the other weights are set to small positive random numbers. The time constant of each neuron is set to 0.4. The learning rate η is 20 for the output neuron and the self feedback connections, 160 for the other connections of the hidden neurons, 100 for the output neuron's bias and 1.0 for the hidden neurons' biases.
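For concreteness, one cycle of this setup can be sketched as follows. This is an illustration only: the pulse widths, the cycle length of 150 steps, and the lag range are assumptions, not values from the paper, and rectangular pulses stand in for the excitation curves described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_cycle(n_steps=150, lag_range=(30, 80)):
    """One cycle of a delayed recognition task (NOT with random input).
    Returns inputs (start, trigger, operand, random) and the ideal output."""
    start, trig, operand, rand_in = (np.zeros(n_steps) for _ in range(4))
    target = np.zeros(n_steps)
    op = rng.integers(0, 2)                    # operand value, 0 or 1
    start[0:5] = 1.0                           # start signal first
    operand[5:10] = float(op)                  # operand just after start
    rand_in[5:10] = float(rng.integers(0, 2))  # distractor with the same timing
    t_trig = int(rng.integers(*lag_range))     # random time lag before trigger
    trig[t_trig:t_trig + 5] = 1.0
    if op == 0:                                # NOT: result is 1 when operand is 0
        target[t_trig + 1:t_trig + 6] = 1.0    # ideal output just after trigger
    return np.stack([start, trig, operand, rand_in]), target
```

Because cycles follow one another without resetting the network state, a training loop would simply concatenate such cycles and integrate the dynamics straight through the boundaries.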
5.1 NOT with random input problem
The simplest problem is the NOT operation. In this case, one of the solutions is that a hidden neuron holds an excitation of the input and inhibits the output neuron. To examine whether the hidden neuron can extract the necessary information or not, a random input was added to the inputs with exactly the same timing as the operand input. There were 4 inputs (1 start, 1 operand, 1 random and 1 trigger) and 1 output, and 1 hidden neuron was employed here. The connection weights after learning are shown in Fig. 4. For the NOT operation, w53, the connection weight from input 3 (operand) to the hidden neuron, became positive, and w65, the weight from the hidden neuron to the output neuron, became negative. Since the self feedback connection weight of the hidden neuron, w55, satisfies Eqs. (4) and (5), the hidden neuron holds the operand excitation until the start signal is excited. Moreover, w54, the weight from the random input to the hidden neuron, became close to 0. We can say that the learning has the ability to select the necessary information.

5.2 Simplest sequence detection problem
Next, to examine the ability to learn to detect the sequence of inputs, a simulation of the simplest sequence problem was done. There are two operand inputs. When operand input 1 comes before operand input 2, the ideal output for training is excited, and when operand 1 comes after operand 2, which is called the inverse sequence here, the ideal output is not excited. The connection weights after learning are shown in Fig. 5. For detecting the sequence, w54 became positive, and in order that the hidden neuron excites the output neuron, w65 became positive. When the inverse sequence is presented, although the hidden neuron (x5) is excited once by operand 2 (x4), it is then inhibited by operand 1 (x3). The self feedback weight became large enough to hold the operation result.
Fig. 4 Connection weights after learning in a case of NOT with random input problem
Fig. 5 Connection weights after learning in a case of simplest sequence detection problem
5.3 NAND problem
Finally, the NAND problem was tried. The sequence of the two inputs was fixed, and the second operand input begins to excite after the first operand input becomes 0. In this problem, two hidden neurons are required. One of them has to hold the first operand signal, and the other has to perform an operation using the second operand signal and the first hidden neuron's output and to hold the result. The main point of this learning is whether the one hidden neuron comes to extract the first operand input or not. To learn to extract it, the hidden neuron has to use the error signal propagated through the other hidden neuron. Figure 6 shows the connection weights after learning, and Fig. 7 shows the transition of each output after learning. We can see that hidden neuron 1 (x5) is always excited by the start signal, and is inhibited when operand input 1 (x3) comes. Hidden neuron 2 (x6) is excited when operand input 2 (x4) comes and hidden neuron 1 (x5) keeps its excitation. The output is finally inhibited by hidden neuron 2. Figure 8 shows the changes of the error between the output and the training signal through a first-order delay filter, and of the connection weights, according to the progress of the learning. At first, the connection weights to the output neuron (w72, w76) grew. Next, the connections to hidden neuron 2 (w61, w63) grew. Then, according to the growth of the connection from hidden neuron 1 to hidden neuron 2 (w65), the absolute values of the weights from the input neurons to hidden neuron 1 (w51, w52) became large, and the weight from operand input 1 to hidden neuron 2 (w62) became close to 0. Finally, after 800 cycles, which is equivalent to 800 presentations of patterns, the error became close to 0.
In Fig. 6, connections with |wji| < 1.0 are omitted.
Fig. 6 Connection weights after learning in a case of NAND problem
Fig. 7 Transition of each neuron’s output after learning
Fig. 8 Change of connection weights according to the progress of learning in NAND problem (the filtered error err and the weights w51, w52, w53, w55, w61, w62, w63, w65, w66, w72, w75 and w76 over 1500 cycles).

6. Conclusion
We have proposed a simple learning algorithm for recurrent neural networks. The learning algorithm needs only O(n^2) calculation time and O(n^2) memory, which is reasonable for implementation. Our main aim is to realize short-term memory (delayed recognition problems). Connection weights that can solve the delayed recognition problems (NOT with random input, simplest sequence detection, and NAND) were obtained by the proposed learning.

Appendix
If a neuron with only a self feedback connection w is in an equilibrium state, Eq. (1) is modified as

  0 = -u + w x + θ .    (23)

This equation is divided into the following two,

  y = u - θ ,    (24)

  y = w x = w f(u) = w ( 2.0 / (1 + exp(-u)) - 1.0 ) .    (25)

The slopes of the u-y curves are

  y' = 1 ,    (26)

  y' = 2.0 w exp(-u) / (1 + exp(-u))^2 .    (27)

When these two slopes are the same,

  u = -log( w - 1 - √(w(w-2)) ) ,    (28)

where w > 2.0. By putting this u value into Eq. (24) and Eq. (25),

  y = u - θ = -log( w - 1 - √(w(w-2)) ) - θ ,    (29)

  y = w x = w ( 2.0 / (w - √(w(w-2))) - 1.0 ) .    (30)

In the limiting state in which the neuron has two stable equilibrium points, these two y values are the same. Then

  θ = -log( w - 1 - √(w(w-2)) ) - w ( 2.0 / (w - √(w(w-2))) - 1.0 )
    = -log( w - 1 - √(w(w-2)) ) - √(w(w-2)) .    (31)

The stability condition at output value 0 is

  θ ≤ 0 .    (32)

Then Eq. (4) and Eq. (5) are derived. When Eq. (4) and Eq. (5) are satisfied, the stable point other than 0 satisfies, from Eq. (23),

  (u + w - θ)(1 + exp(-u)) = 2w .    (33)

Fig. 9 Stability of a neuron with only a self feedback connection. (The lines (1) y = u - θ and (2) y = w f(u) are plotted against u; where (1) < (2), u increases, and where (1) > (2), u decreases, so the intersections give the stable and unstable equilibrium points. In the case w > 2 and -{√(w(w-2)) - log(w - 1 + √(w(w-2)))} < θ < 0, two stable equilibrium points exist.)
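The appendix conditions can be checked numerically; a minimal sketch follows (the particular values w = 3.0 and θ = -0.3 are illustrative choices, not from the paper):

```python
import math

def theta_limit(w):
    # Eq. (31): the most negative bias for which x = 0 remains one of two
    # stable equilibrium points of a self-connected neuron (requires w > 2)
    s = math.sqrt(w * (w - 2.0))
    return -math.log(w - 1.0 - s) - s

def fixed_point(w, theta, u0):
    # Iterate u <- max(0, w*f(u) + theta), i.e. Eq. (23) with the
    # non-negative internal state; converges to a stable equilibrium
    u = u0
    for _ in range(500):
        u = max(0.0, w * (2.0 / (1.0 + math.exp(-u)) - 1.0) + theta)
    return u

w, theta = 3.0, -0.3
assert theta_limit(w) <= theta <= 0.0    # Eqs. (4) and (5) satisfied
low = fixed_point(w, theta, 0.0)
high = fixed_point(w, theta, 5.0)
print(low, high)  # 0 and a positive stable point coexist
```

Taking θ below theta_limit(w) instead makes the zero state the only attractor reachable from small u, which is why Eq. (5) bounds the bias from below.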
References
[1] Rumelhart, D. E., Hinton, G. E. and Williams, R. J.: "Learning internal representations by error propagation", Parallel Distributed Processing, Vol. 1, MIT Press, pp. 318-362 (1986)
[2] Pearlmutter, B. A.: "Learning state space trajectories in recurrent neural networks", Neural Computation, Vol. 1, pp. 263-269 (1989)
[3] Williams, R. J. and Zipser, D.: "A learning algorithm for continually running fully recurrent neural networks", Neural Computation, Vol. 1, pp. 270-280 (1989)
[4] Doya, K. and Yoshizawa, S.: "Adaptive neural oscillator using continuous-time back-propagation learning", Neural Networks, Vol. 2, pp. 375-385 (1989)
[5] Williams, R. J. and Zipser, D.: "Gradient-Based Learning Algorithms for Recurrent Connectionist Networks", Northeastern University, College of Computer Science Technical Report NU-CCS-90-9 (1990)
[6] Hochreiter, S. and Schmidhuber, J.: "LSTM can solve hard long time lag problems", Advances in Neural Information Processing Systems 9, MIT Press, pp. 473-479 (1997)