Volume 2, Issue 1, 2010
Shape Recognition through an Alternative Recurrent Network Architecture

Tarik Rashid, Lecturer, SABIS® University, Hawler, Kurdistan, [email protected]

Abstract
This paper focuses on a study of different recurrent neural network architectures. A new alternative type of recurrent neural network is introduced. This alternative recurrent neural network is based on the simple recurrent network (SRN) but has an architecture that consists of fully interconnected layers. The network has feedback connections from the input, hidden, and output layers, with each connecting to its own designated context layer. A mathematical model for the network is detailed in the paper. The new alternative recurrent neural network, developed for recognition tasks, demonstrates superior performance over simple recurrent networks in recognition and prediction tasks.

Introduction

Standard artificial feed-forward neural networks (Haykin, 1994) transmit patterns from the initial-layer neurons (the input-layer neurons) to subsequent-layer neurons (the hidden-layer or output neurons), whereas recurrent networks (Elman, 1990; Boden, 2001; Castano, Casacuberta, and Bonet, 1997; Cheng, Karjala, and Himmelblau, 1997; Corradini and Cohen, 2002) feed patterns from subsequent layers back to earlier layers. Artificial recurrent neural networks differ from standard neural networks in that they contain links among neurons that form cycles, generating an internal network state. This implicit state allows the recurrent network to exhibit dynamic temporal behavior. An artificial recurrent neural network is developed from the standard feed-forward network; nevertheless, the two differ considerably in the evaluation of their functionality and in their learning algorithms. The Elman and Jordan networks are known as the standard SRNs (Elman, 1990; Boden, 2001; Castano, Casacuberta, and Bonet, 1997; Cheng, Karjala, and Himmelblau, 1997; Corradini and Cohen, 2002).
Jordan's network (Jordan, 1986) resembles the Elman network, but the context neurons are fed from the output layer instead of the hidden neurons. The Elman network stores the previous state of the hidden neurons and uses the activation of the hidden neurons as the basic recurrent relation. The context layer of neurons (equivalent to Jordan's state neurons) is completely internal to the SRN because it references only the hidden neurons; it does not interact with the outside world of input and output neurons at all. With this, the Elman network was able to demonstrate its usefulness on a set of tasks involving sequential input. Elman's network, with its ability to learn sequences, became more capable than the Jordan network, which is fed by the output layer. The Jordan network structure, with only output memory, cannot recall inputs that are not reflected in the output, and as a result, it cannot learn sequences (Boden, 2001; Cheng, Karjala, and Himmelblau, 1997). The SRN has some problems: its network memory consists of a single, rather small context layer; the mapping from hidden-layer neurons to output-layer neurons is limited; and the computational cost is high because more hidden neurons are needed (Dorffner, 1996; Wilson, 1996; Yeung, 1995). This paper shows how the new network is designed to solve the memory problem and provide a better mapping to the output-layer neurons. In the following sections, the new architecture is explained, the mathematical theory with the learning algorithm is detailed, and the simulation and results are presented. The paper concludes with a discussion of the main findings and a summary of our results.

1. Alternative Architecture & Mathematical Theory

The alternative recurrent neural network is based on the SRN (Elman, 1990; Boden, 2001; Castano, Casacuberta, and Bonet, 1997; Cheng, Karjala, and Himmelblau, 1997; Jordan, 1986) architecture. However, the modified network architecture consists of fully interconnected layers. The network has feedback connections from the input, hidden, and output layers, with each connected to its own designated context layer. This is a further refinement and improvement of the single-step recurrent network idea for sequential processing. Its function mainly lies in the
combination of several context layers that weigh the influence of sequence history length differently (Cheng, Karjala, and Himmelblau, 1997; Dorffner, 1996). The context layers are forward-connected to both the hidden and output layers. This forward connection speeds up the learning of the network and reduces the number of neurons in the hidden layer (Cheng, Karjala, and Himmelblau, 1997; Dorffner, 1996; Wilson, 1996). This unique feature of our network ensures the stability of the output and gives the network the ability to memorize and adapt based on previous data (Cheng, Karjala, and Himmelblau, 1997; Dorffner, 1996; Wilson, 1996). A snapshot of its architecture is illustrated in Figure 1.
Figure 1. The alternative recurrent neural network architecture. Here are the terms and notations used to discuss the new architecture:
Neurons and layers: $i$, $j$, and $k$ are the indices for the input, hidden, and output neurons, respectively; $l$, $p$, and $q$ are the indices for the context layers; $n_{in}$, $n_h$, and $n_{out}$ are the numbers of neurons in the input, hidden/context, and output layers, respectively.

Net inputs and outputs: $t$ is the current time step. $I_i(t)$, $H_j(t)$, and $O_k(t)$ represent the outputs of the input, hidden, and output layers' neurons, respectively. $C'_l(t)$, $C''_p(t)$, and $C'''_q(t)$ are the previous-time-step copies of the input, hidden, and output layers' neurons. $d_k(t)$ is the target of neuron $k$ in the output layer.

Connection weights: $v_{ji}$ denotes the weight connection from the input layer to the hidden layer; $v'_{jl}$, $v''_{jp}$, and $v'''_{jq}$ are the weight connections from the context layers to the hidden layer; $w'_{kl}$, $w''_{kp}$, and $w'''_{kq}$ are the weight connections from the context layers to the output layer; and $w_{kj}$ is the weight connection from the hidden layer to the output layer.

The back propagation algorithm (BP) (Haykin, 1994; Elman, 1990; Boden, 2001; Wilson, 1996; Yeung, 1995; Duch and Jankowski, 2001; Fausett, 1994; Rashid, Huang, Kechadi, and Glesson, 2007; Lu, Mak, and Siu, 1995; Prokhorov, 2004; Shi, Liang, and Xu, 2003) is an example of supervised learning. It is based on the gradient descent method for error minimization and is suitable for applications with a specified sequence length (Jordan, 1986). Since the data sequence length is known, we used the back propagation learning algorithm to train our network to recognize digits. The BP algorithm has been explained previously by Jordan (Jordan, 1986); therefore, we only briefly summarize its implementation for our network here. In the feed-forward pass, the output of the hidden layer can be expressed as a function of the outputs of the input and context layers. The output $H_j(t)$ of a neuron $j$ at time $t$ in the hidden layer is given by:

$$\tilde{h}_j(t) = \sum_{i=1}^{n_{in}} I_i(t)\,v_{ji}(t) + \sum_{l=1}^{n_{in}} C'_l(t)\,v'_{jl}(t) + \sum_{p=1}^{n_h} C''_p(t)\,v''_{jp}(t) + \sum_{q=1}^{n_{out}} C'''_q(t)\,v'''_{jq}(t) \qquad (1)$$

$$H_j(t) = f\big(\tilde{h}_j(t)\big) \qquad (2)$$

where $f$ in equation (2) is the activation function (Jordan, 1986).

The update of the first context layer, which comes from the previous input layer, is accomplished by:

$$C'_i(t) = I_i(t-1) \qquad (3)$$

The update of the second context layer, which comes from the previous hidden layer, is accomplished by:

$$C''_j(t) = H_j(t-1) \qquad (4)$$

The update of the third context layer, which comes from the previous output layer, is accomplished by:

$$C'''_k(t) = O_k(t-1) \qquad (5)$$

The output $O_k(t)$ of the $k$-th neuron in the output layer at time $t$ can be obtained by:

$$\tilde{o}_k(t) = \sum_{j=1}^{n_h} H_j(t)\,w_{kj}(t) + \sum_{l=1}^{n_{in}} C'_l(t)\,w'_{kl}(t) + \sum_{p=1}^{n_h} C''_p(t)\,w''_{kp}(t) + \sum_{q=1}^{n_{out}} C'''_q(t)\,w'''_{kq}(t) \qquad (6)$$

$$O_k(t) = f\big(\tilde{o}_k(t)\big) \qquad (7)$$

where $f$ in equation (7) is the activation function.
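The forward pass of equations (1)-(7), together with the context updates (3)-(5), can be sketched in NumPy as follows. This is a minimal illustration, assuming logistic activations and uniformly initialized weights; the class and attribute names are my own and not from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class AltRecurrentNet:
    """Sketch of the forward pass in equations (1)-(7).

    Three context layers hold one-step-delayed copies of the input,
    hidden, and output activations (equations (3)-(5)); all three feed
    both the hidden and the output layer.
    """

    def __init__(self, n_in, n_h, n_out, seed=0):
        rng = np.random.default_rng(seed)
        s = lambda *shape: rng.uniform(-0.5, 0.5, shape)
        # hidden-layer weights: from the input and the three context layers
        self.v  = s(n_h, n_in)     # v_ji
        self.v1 = s(n_h, n_in)     # v'_jl   (input-copy context)
        self.v2 = s(n_h, n_h)      # v''_jp  (hidden-copy context)
        self.v3 = s(n_h, n_out)    # v'''_jq (output-copy context)
        # output-layer weights: from the hidden and the three context layers
        self.w  = s(n_out, n_h)    # w_kj
        self.w1 = s(n_out, n_in)   # w'_kl
        self.w2 = s(n_out, n_h)    # w''_kp
        self.w3 = s(n_out, n_out)  # w'''_kq
        # context layers start at zero
        self.c1, self.c2, self.c3 = np.zeros(n_in), np.zeros(n_h), np.zeros(n_out)

    def step(self, x):
        # equations (1)-(2): hidden activations
        h = sigmoid(self.v @ x + self.v1 @ self.c1
                    + self.v2 @ self.c2 + self.v3 @ self.c3)
        # equations (6)-(7): output activations
        o = sigmoid(self.w @ h + self.w1 @ self.c1
                    + self.w2 @ self.c2 + self.w3 @ self.c3)
        # equations (3)-(5): contexts become one-step-delayed copies
        self.c1, self.c2, self.c3 = x.copy(), h.copy(), o.copy()
        return h, o
```

Calling `step` repeatedly on a sequence reproduces the recurrence: each call sees the previous step's input, hidden, and output activations through the three context layers.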
BP tries to minimize the error of the network, which is the difference between the output and the target. However, due to the gradient descent process, the BP technique tends to converge to local minima (Haykin, 1994). This new network has the ability to provide more information, allowing the system to converge to better solutions. This information is stored in the context layers and, at any given time, can be provided to the output layer to adjust the final solution; this facilitates good mapping and speeds up convergence (Cheng, Karjala, and Himmelblau, 1997; Dorffner, 1996; Wilson, 1996). Here we introduce the BP algorithm using an example of supervised learning. Let us assume $s$ is a pattern in the data set. The difference between the target pattern and its actual output in the output layer is defined as $e_{ks} = d_{ks} - O_{ks}$. The objective is to minimize the network cost by summing the errors over all past patterns of the network. This can be expressed as:

$$E_T = \frac{1}{2}\sum_{s=1}^{S}\sum_{k=1}^{n_{out}} e_{ks}^2 \qquad (8)$$

Assume $LG_{ks}$ is the local gradient of the $k$-th neuron in the output layer and $LG_{js}$ is the local gradient of the $j$-th neuron in the hidden layer when input pattern $s$ is presented. The gradient descent method is used to calculate the derivative of the error with respect to the variable weights of the system: $\partial E_T/\partial w_{kj}$, $\partial E_T/\partial v_{ji}$, $\partial E_T/\partial w'_{kl}$, $\partial E_T/\partial v'_{jl}$, $\partial E_T/\partial w''_{kp}$, $\partial E_T/\partial v''_{jp}$, $\partial E_T/\partial w'''_{kq}$, and $\partial E_T/\partial v'''_{jq}$. According to the chain rule, the defined partial gradient (local gradient) terms in the output and hidden layers are used to simplify the formulae. Therefore, $\partial E_T/\partial w_{kj}$ can be written as:

$$\frac{\partial E_T}{\partial w_{kj}} = \sum_{s=1}^{S} \frac{\partial E_s}{\partial e_{ks}}\,\frac{\partial e_{ks}}{\partial O_{ks}}\,\frac{\partial O_{ks}}{\partial \tilde{o}_{ks}}\,\frac{\partial \tilde{o}_{ks}}{\partial w_{kj}} \qquad (9)$$

$LG_{ks}$ can then be expressed as follows:

$$LG_{ks} = -\frac{\partial E_s}{\partial e_{ks}}\,\frac{\partial e_{ks}}{\partial O_{ks}}\,\frac{\partial O_{ks}}{\partial \tilde{o}_{ks}} = e_{ks}\, f'(\tilde{o}_{ks}) \qquad (10)$$

Also:

$$LG_{js} = -\frac{\partial E_s}{\partial \tilde{h}_{js}} = \left(\sum_{k=1}^{n_{out}} LG_{ks}\, w_{kj}\right) f'(\tilde{h}_{js}) \qquad (11)$$

The weight-changes for each category are as follows:

$$-\frac{\partial E_T}{\partial w_{kj}} = \sum_{s=1}^{S} LG_{ks}\, H_{js} \qquad (12)$$

$$-\frac{\partial E_T}{\partial w'''_{kq}} = \sum_{s=1}^{S} LG_{ks}\, C'''_{qs}(t) \qquad (13)$$

$$-\frac{\partial E_T}{\partial w''_{kp}} = \sum_{s=1}^{S} LG_{ks}\, C''_{ps}(t) \qquad (14)$$

$$-\frac{\partial E_T}{\partial w'_{kl}} = \sum_{s=1}^{S} LG_{ks}\, C'_{ls}(t) \qquad (15)$$

$$-\frac{\partial E_T}{\partial v_{ji}} = \sum_{s=1}^{S} LG_{js}\, I_{is} \qquad (16)$$

$$-\frac{\partial E_T}{\partial v'''_{jq}} = \sum_{s=1}^{S} LG_{js}\, C'''_{qs}(t) \qquad (17)$$

$$-\frac{\partial E_T}{\partial v''_{jp}} = \sum_{s=1}^{S} LG_{js}\, C''_{ps}(t) \qquad (18)$$

$$-\frac{\partial E_T}{\partial v'_{jl}} = \sum_{s=1}^{S} LG_{js}\, C'_{ls}(t) \qquad (19)$$

The momentum technique is employed to prevent the training of the network from falling into local minima and to speed up the training:

$$\Delta w_{ab} = -\eta\,\frac{\partial E_T}{\partial w_{ab}} + \alpha\, \Delta w_{ab}^{prev} \qquad (20)$$

The initial value of $\Delta w_{ab}$ for all connections is zero. $\eta$ is the learning rate and $\alpha$ is the momentum rate; both must be greater than zero (and are typically less than one). The adjustments to the weights are obtained by adding the corresponding weight-changes to the previous values:
$$w'_{ab} = w_{ab} + \Delta w_{ab} \qquad (21)$$

where $w'_{ab}$ is the adjusted weight between the two layers $a$ and $b$.
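As a compact illustration of the update rules (10)-(21), the following NumPy sketch performs one BP step with momentum. It is a simplification: the paper accumulates gradients over all $S$ patterns before updating, whereas this sketch updates after each single pattern. All function names and the weight initialization are my own, and logistic activations are assumed (so $f'(x) = f(x)(1-f(x))$).

```python
import numpy as np

def f(x):  # logistic activation; f'(x) = f(x) * (1 - f(x))
    return 1.0 / (1.0 + np.exp(-x))

def make_net(n_in, n_h, n_out, seed=0):
    rng = np.random.default_rng(seed)
    shapes = {"v": (n_h, n_in), "v1": (n_h, n_in), "v2": (n_h, n_h),
              "v3": (n_h, n_out), "w": (n_out, n_h), "w1": (n_out, n_in),
              "w2": (n_out, n_h), "w3": (n_out, n_out)}
    W  = {k: rng.uniform(-0.5, 0.5, s) for k, s in shapes.items()}
    dW = {k: np.zeros(s) for k, s in shapes.items()}  # momentum terms start at zero
    ctx = [np.zeros(n_in), np.zeros(n_h), np.zeros(n_out)]  # C', C'', C'''
    return W, dW, ctx

def train_step(W, dW, ctx, x, d, lr=0.25, mom=0.65):
    c1, c2, c3 = ctx
    # forward pass, equations (1)-(2) and (6)-(7)
    h = f(W["v"] @ x + W["v1"] @ c1 + W["v2"] @ c2 + W["v3"] @ c3)
    o = f(W["w"] @ h + W["w1"] @ c1 + W["w2"] @ c2 + W["w3"] @ c3)
    e = d - o                                  # output error e_ks
    lg_k = e * o * (1 - o)                     # LG_ks, equation (10)
    lg_j = (W["w"].T @ lg_k) * h * (1 - h)     # LG_js, equation (11)
    grads = {  # negative gradients, equations (12)-(19)
        "w": np.outer(lg_k, h),  "w1": np.outer(lg_k, c1),
        "w2": np.outer(lg_k, c2), "w3": np.outer(lg_k, c3),
        "v": np.outer(lg_j, x),  "v1": np.outer(lg_j, c1),
        "v2": np.outer(lg_j, c2), "v3": np.outer(lg_j, c3),
    }
    for k, g in grads.items():                 # momentum update, (20)-(21)
        dW[k] = lr * g + mom * dW[k]
        W[k] += dW[k]
    ctx[0], ctx[1], ctx[2] = x.copy(), h.copy(), o.copy()  # equations (3)-(5)
    return float(0.5 * e @ e)                  # per-pattern error, equation (8)
```

Repeated calls on a pattern sequence drive the per-pattern error term of equation (8) down, which is the behavior the derivation above establishes.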
2. Simulation and Results

a) Recognition Task
Forming the shape of a circle is a recognition task. The network attempts to draw the shape of a circle from a representative data set composed of 26 points (patterns) in Cartesian coordinates (x, y) in the range of 0 to 1. The patterns were designed as follows: the target of an input pattern is the next input pattern, and the target of the last input pattern is the first input pattern. A suitable network structure, made of 2 input neurons, 4 hidden neurons, and 2 output neurons, along with three context layers (2, 4, and 2 neurons, copying the input, hidden, and output layers, respectively), was selected for this task. The network was trained for 2000 cycles with BP, with a learning rate of 0.25 and a momentum value of 0.65. After training the network, an arbitrary point from the data set was input into the network. The actual output for the selected point was fed back again as a new input point to the network. This process continued until the circle was drawn correctly by the network (see Figure 2). The error rapidly but steadily diminished without falling into local minima. The network was able to learn this task and correctly produced the shape of the circle step by step (see Figure 3).
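The data-set construction and closed-loop drawing procedure described above can be sketched as follows. The circle's center (0.5) and radius (0.4) are assumptions of mine; the paper only states that the coordinates lie between 0 and 1. `predict` stands in for a trained network's one-step map and is hypothetical.

```python
import numpy as np

# 26 points on a circle, scaled into the range 0 to 1
n = 26
theta = 2 * np.pi * np.arange(n) / n
points = 0.5 + 0.4 * np.column_stack([np.cos(theta), np.sin(theta)])
# the target of each pattern is the next pattern; the last wraps to the first
targets = np.roll(points, -1, axis=0)

def draw_shape(predict, start, steps=26):
    """Closed-loop generation: the output for each point is fed back
    as the next input until the full shape is traced. `predict` stands
    in for a trained network's one-step map (hypothetical)."""
    pt, path = np.asarray(start, dtype=float), []
    for _ in range(steps):
        pt = predict(pt)
        path.append(pt)
    return np.array(path)
```

With a perfectly trained one-step map, 26 closed-loop steps return the trajectory to its starting point, which is exactly the behavior shown in Figure 2(b).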
Figure 2. Recognition of the circle shape: (a) a biased shape with two inner loops, created by the SRN; (b) a better-constructed shape with no inner loops, created by the alternative network architecture.
Figure 3. Error comparison of both networks.
b) Prediction Task
Prediction tasks rely primarily on the information that is presented to the network; nevertheless, trivial information adds complexity and is time-consuming, both of which are detrimental to the network's performance. The test we use to gauge the performance of a network is the prediction of a power plant's daily maximum load (DML). The details of the historical data are found in (Rashid, Huang, Kechadi, and Glesson, 2007). The pre-processing mainly involves the selection of input data. In accordance with the analysis of the historical data, seven variables were selected as inputs to the network: the forecasted day of the year, the forecasted day of the month, the forecasted day of the week, the daily average temperature of the forecasted day, the daily average temperature of the day before the forecasted day, the daily average temperature of the same day of the previous week, and a binary value to represent holidays or working days. The output of the network was a single variable, the DML. The network model for daily maximum load forecasting consisted of seven input neurons, five hidden neurons, and one output neuron. The parameters relied heavily on the size of the training and testing sets. The model was trained with a two-year training data set, a learning rate of 0.005, and a momentum value of 0.2. The training cycles of the network were varied, ranging from 15,000 to 40,000 cycles. Several experiments were carried out on our trained network. Our network was also evaluated against the SRN using two different measures: the Mean Absolute Percentage Error (MAPE) and the Maximum Error (MAX) (Rashid, Huang, Kechadi, and Glesson, 2007). The MAPE and MAX errors for our network were 4.32 and 56.43, compared to 5.06 and 62.28 for the SRN, an error difference of 0.74 and 5.85. Figure 4 shows the prediction results for both networks.
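The two evaluation measures can be written as follows. These are the standard definitions of MAPE and maximum absolute error; the exact formulas used in the paper are those of (Rashid, Huang, Kechadi, and Glesson, 2007), so this sketch is an assumption that the standard forms apply.

```python
import numpy as np

def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(100.0 * np.mean(np.abs((actual - forecast) / actual)))

def max_error(actual, forecast):
    """Largest absolute forecast error (MAX), in the units of the load."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.max(np.abs(actual - forecast)))
```

For example, `mape([100, 200], [110, 180])` gives 10.0 percent, and `max_error([100, 200], [110, 180])` gives 20.0.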
[Figure 4 chart: forecast DML vs. day (days 1-31) for the SRN, the new network, and the target.]
Figure 4. Prediction results of both networks vs. target results. The prediction values of the new model are significantly closer to the target values than those of the SRN. Additionally, the MAPE and MAX errors of our network were lower than those of the SRN.
3. Discussion and Conclusion

This paper examines the limitations (over-simplicity and restricted memory) of the most common recurrent neural network and suggests a new alternative recurrent network to overcome these shortcomings. This new alternative recurrent neural network was developed using a mathematical model based on variants of the SRN and trained with BP. The most important aspects of the mathematical model are the error gradients, which are minimized using BP. Simulations of both the SRN and our network were implemented and compared, and the results show that our network is substantially faster and more accurate than the SRN.
4. References
A. Corradini and P. Cohen (2002). Multimodal speech-gesture interface for hands-free painting on virtual paper using partial recurrent neural networks for gesture recognition. In Proc. of the Int'l Joint Conf. on Neural Networks (IJCNN'02), volume 3, pp. 2293-2298.

D. Prokhorov (2004). Backpropagation through Time and Derivative Adaptive Critics: A Common Framework for Comparison. Chapter in Learning and Approximate Dynamic Programming, Wiley.

D.-Y. Yeung (1995). A Locally Recurrent Neural Network Model for Grammatical Inference. In Proceedings of the International Conference on Neural Information Processing, pp. 1468-1473.

G. Dorffner (1996). Neural networks for time series processing. Neural Network World, 6(4), pp. 447-468.

J. L. Elman (1990). Finding structure in time. Cognitive Science, 14(2), pp. 179-211.

L. Fausett (1994). Fundamentals of Neural Networks: Architectures, Algorithms, and Applications. Englewood Cliffs, NJ: Prentice Hall.

M. Boden (2001). A guide to recurrent neural networks and backpropagation. The DALLAS project. Report from the NUTEK-supported project AIS-8, SICS. Holst: Application of Data Analysis with Learning Systems.

M. A. Castano, F. Casacuberta, and A. Bonet (1997). Training Simple Recurrent Networks Through Gradient Descent Algorithm. In Biological and Artificial Computation: From Neuroscience to Technology, volume 1240, ISBN 3-540-63047-3, pp. 493-500. Eds. J. Mira, R. Moreno-Diaz, and J. Cabestany, Springer Verlag.

M. I. Jordan (1986). Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the 8th Annual Conference of the Cognitive Science Society.

S. Haykin (1994). Neural Networks: A Comprehensive Foundation. MacMillan Publishing Company, New York.

T. Rashid, B. Q. Huang, M.-T. Kechadi, and B. Glesson (2007). Auto Regressive Recurrent Network Approach for Electricity Load Forecasting. International Journal of Computational Intelligence, 3(1), www.waset.org, Winter.

W. H. Wilson (1996). Learning Performance of Networks like Elman's Simple Recurrent Networks but having Multiple State Vectors. In the 7th Australian Conference on Neural Networks, Canberra, Australia.

W. Duch and N. Jankowski (2001). Transfer functions: hidden possibilities for better neural networks. In ESANN'2001 Proceedings, European Symposium on Artificial Neural Networks, ISBN 2-930307-01-3, pp. 25-27, Belgium. D-Facto.

X. H. Shi, Y. C. Liang, and X. Xu (2003). An improved Elman model and recurrent back-propagation control neural networks. Journal of Software, 6(14), pp. 1110-1119.

Y. Cheng, T. W. Karjala, and D. M. Himmelblau (1997). Closed loop non-linear process identification using internal recurrent nets. Neural Networks, 10(3), pp. 573-586.

Y.-L. Lu, M.-W. Mak, and W.-C. Siu (1995). Application of a fast real time recurrent learning algorithm to text-to-phone conversion. In Proceedings of the International Conference on Neural Networks, volume 5, pp. 2853-285.