LINEAR AND NON-LINEAR ARTIFICIAL NEURAL NETWORKS IN SIGNAL PREDICTION

ALEŠ PAVELKA, ALEŠ PROCHÁZKA

Prague Institute of Chemical Technology
Department of Computing and Control Engineering
Technická 1905, 166 28 Prague 6, Czech Republic
E-mail: [email protected], [email protected]
Phone: 00420-2-2435 4198, Fax: 00420-2-2435 5053

Abstract: This paper presents the comparison of prediction properties and results obtained by different types of non-linear artificial neural networks. The fundamental part of the contribution is devoted to multi-dimensional feed-forward and recurrent networks and their comparison with simple linear networks. The resulting algorithms are applied to the analysis and processing of data describing gas consumption in the Czech Republic.

Keywords: autoregressive linear model, linear neural network, feed forward neural network, recurrent neural network, epochwise backpropagation, spectral analysis, prediction, gas consumption

1 INTRODUCTION

Signal prediction is a very important tool of information engineering with many practical applications. Its theoretical basis always comes from the analysis of historical data. Various methods of signal prediction use autoregressive models, different types of neural networks and tools of system identification and optimization. The following paper presents their possible use in the prediction of gas consumption.

2 NEURAL NETWORK STRUCTURES FOR SIGNAL PREDICTION

2.1 Linear Neural Network

2.1.1 Architecture and Learning Algorithm

A linear neural network (Figs. 1 and 2) is one of the basic types of neural networks. The main characteristic feature of this network is its transfer function, which is an unbounded linear function with variable slope. For each input vector we can calculate the network's output vector. The difference between an output vector and its target vector is the error. We would like to find values of the network weights and biases such that the sum of the squares of the errors is minimized or falls below a specific value. This problem is manageable because linear systems have a single error minimum. In most cases, we can calculate a linear network directly, such that its error is minimal for the given input vectors and target vectors. In other cases, numerical problems prohibit direct calculation. Fortunately, we can always train the network to a minimum error by using the Least Mean Squares (Widrow-Hoff) algorithm [DB01]. In this respect the linear neural network is conceptually very close to autoregressive models. For the prediction of time series it is suitable to use a network architecture with delays. These delays are inherent in a time series – in its time dependence. The basic idea of the neural network calculation can be observed in the block schema shown in Fig. 1, where the network itself, with its matrix of weights and linear transfer function, is highlighted. Another possible visualization of the structure of the network and of the neuron is presented in Fig. 2.

Figure 1: Block schema of linear network
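As an illustration of the Widrow-Hoff (LMS) learning mentioned above, the following MATLAB sketch trains a single linear neuron to predict the next sample of a series from d delayed values. The example signal, the number of delays and the learning rate are illustrative assumptions only, not values taken from this study.

    % Minimal LMS (Widrow-Hoff) sketch of one step ahead prediction with a
    % single linear neuron; x, d and mu are example values only.
    x  = randn(1, 500);            % example series (replace by real data)
    d  = 15;                       % number of delayed inputs
    mu = 0.01;                     % learning rate
    w  = zeros(1, d);              % weights of the linear neuron
    b  = 0;                        % bias
    for n = d:length(x)-1
        p = x(n-d+1:n);            % input vector of d past samples
        y = w*p' + b;              % linear transfer function
        e = x(n+1) - y;            % prediction error
        w = w + 2*mu*e*p;          % Widrow-Hoff weight update
        b = b + 2*mu*e;            % bias update
    end

The same weights can also be obtained in one step by the direct least-squares design used later in Section 3.4.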

2.2 Feed-Forward Backpropagation Neural Network

2.2.1 Architecture and Learning Algorithm

A feed-forward backpropagation network (Fig. 3), like the linear neural network, has no special architecture in comparison with the recurrent neural network (Section 2.3). The essence of this network undoubtedly lies in its learning algorithm. For our purpose a two-layer network has been used: the first transfer function was the hyperbolic tangent sigmoid and the second was an unbounded linear function with variable slope. Exactly this network is shown in the block schema in Fig. 4, where the different colour marking represents the two layers of the network, the first with the hyperbolic tangent transfer function and the second with the linear transfer function. The Levenberg-Marquardt backpropagation method [Ano01, HM94] has been used as the basic learning algorithm for this network type. This algorithm is based on the calculation of the Jacobian matrix

J(a) = \begin{pmatrix}
\frac{\partial e_1(a)}{\partial a_1} & \frac{\partial e_1(a)}{\partial a_2} & \cdots & \frac{\partial e_1(a)}{\partial a_n} \\
\frac{\partial e_2(a)}{\partial a_1} & \frac{\partial e_2(a)}{\partial a_2} & \cdots & \frac{\partial e_2(a)}{\partial a_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial e_N(a)}{\partial a_1} & \frac{\partial e_N(a)}{\partial a_2} & \cdots & \frac{\partial e_N(a)}{\partial a_n}
\end{pmatrix}    (1)

and on its application in the Levenberg-Marquardt modification of backpropagation

\Delta a = \left( J^T(a)\, J(a) + \mu I \right)^{-1} J^T(a)\, e(a)    (2)

where a = [w_1(1,1)\; w_1(1,2)\; \cdots\; w_1(S_1,R)\; b_1(1)\; \cdots\; b_1(S_1)\; w_2(1,1)\; \cdots\; b_M(S_M)]^T is the parameter vector, I is the identity matrix, \mu is the learning parameter, e(a) is the error vector and M stands for the number of network layers.
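Equation (2) translates directly into a single parameter update. The following MATLAB sketch assumes hypothetical user-supplied functions jacobian(a) and residuals(a), returning the N x n Jacobian of Eq. (1) and the error vector e(a) for the current parameter vector a; they are not part of the paper or of the toolbox.

    % One Levenberg-Marquardt step according to Eq. (2); jacobian() and
    % residuals() are hypothetical helpers for the chosen network.
    J  = jacobian(a);                         % N x n matrix of de_i/da_j
    e  = residuals(a);                        % N x 1 error vector
    mu = 0.01;                                % learning (damping) parameter
    da = (J'*J + mu*eye(size(J,2))) \ (J'*e); % solve (J'J + mu*I) da = J'e
    a  = a - da;                              % update, with J defined as in Eq. (1)

In the Levenberg-Marquardt scheme the parameter mu is decreased after a successful step and increased otherwise, so that the method moves between Gauss-Newton and steepest-descent behaviour.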


Figure 2: Architecture of linear network and structure of the neuron

Figure 3: Architecture of a two-layer feed-forward network

Figure 4: Block schema of feed-forward network

2.3 Recurrent Neural Network

2.3.1 Net Architecture

A recurrent neural network (Fig. 5) is one in which the outputs from the output layer are fed back to a set of input units. This is in contrast to feed-forward networks, where the outputs are connected only to the inputs of units in subsequent layers. Neural networks of this kind are able to store information about time, and therefore they are particularly suitable for forecasting applications: they have been used with considerable success for predicting several types of time series. However, while recurrent neural networks have many desirable features, there are practical problems in developing effective training algorithms for them [Kot99]. For prediction we used a one-layer model with a variable number of external inputs and a variable number of hidden neurones. Hidden neurones in the architecture of the recurrent network are neurones without an external input; in contrast, the visible neurones are those which have an external output. Figure 5 shows a one-layer model of a recurrent network with one external input, one visible neurone (external output) and three hidden neurones. This network architecture can be written as 1-4-1. For training of the recurrent network we used the epochwise backpropagation through time algorithm [WP90], which can be derived by unfolding the temporal operation of the network into a multi-layer feed-forward network that grows by one layer at each time step (Fig. 6).

Figure 5: Architecture of recurrent network with one external input and one output [Hay94]

Figure 6: Recurrent network unfolded in time with 11 time layers

2.3.2 Mathematical Net Description

Let the network have n units [WP90, WZ89], with m external input lines. Let y(τ) denote the n-tuple (a tuple is an ordered sequence of fixed length of values of arbitrary types; a close term is vector) of outputs of the units in the network at time τ, and let x^{net}(τ) denote the m-tuple of external input signals to the network at time τ. We also define x(τ) to be the (m + n)-tuple obtained by concatenating x^{net}(τ) and y(τ) in some convenient fashion. To distinguish the components of x representing unit outputs from those representing external input values where necessary, let U denote the set of indices k such that x_k, the k-th component of x, is the output of a unit in the network, and let I denote the set of indices k for which x_k is an external input. Furthermore, we assume that the indices on y and x^{net} are chosen to correspond to those of x, so that

x_k(\tau) = \begin{cases} x^{net}_k(\tau) & \text{if } k \in I \\ y_k(\tau) & \text{if } k \in U. \end{cases}    (3)

Let W denote the single n × (m + n) weight matrix for the network. The element w_{kl} represents the weight on the connection to the k-th unit from either the l-th unit, if l ∈ U, or the l-th input line, if l ∈ I. Furthermore, note that to accommodate a bias for each unit we simply include among the m input lines one input whose value is always 1. For the units it is convenient also to introduce for each k the intermediate variable s_k(τ), which represents the net input to the k-th unit at time τ. Its value at time τ + 1 is computed in terms of both the state of and input to the network at time τ by

s_k(\tau + 1) = \sum_{l \in U \cup I} w_{kl}\, x_l(\tau).    (4)

The output of such a unit at time τ + 1 is then expressed in terms of the net input by

y_k(\tau + 1) = f_k(s_k(\tau + 1)),    (5)

where f_k is the unit's squashing function. Let T(τ) denote the set of indices k ∈ U for which there exists a specified target value d_k(τ) that the output of the k-th unit should match at time τ (note that this formulation allows for the possibility that target values are specified for different units at different times). Then define a time-varying n-tuple e by

e_k(\tau) = \begin{cases} d_k(\tau) - y_k(\tau) & \text{if } k \in T(\tau) \\ 0 & \text{otherwise.} \end{cases}    (6)

A natural objective of learning is to minimize the total error E^{total}, composed of the individual overall network errors at the times τ,

E^{total}(t_0, t_1) = \frac{1}{2} \sum_{\tau = t_0 + 1}^{t_1} \sum_{k \in U} e_k^2(\tau).    (7)

The key point is the computation of the error gradient, which is used to adjust the weights. One natural way to make these weight changes is along a constant negative multiple of this gradient, so that

\Delta w_{kl} = -\alpha\, \frac{\partial E^{total}(t_0, t_1)}{\partial w_{kl}},    (8)

where α is a positive learning-rate parameter. The algorithm used for computing the error gradient is the so-called epochwise backpropagation through time [WP90]. This algorithm is organized as follows. With t_0 denoting the start time of the epoch and t_1 denoting its end time, the objective is to compute the gradient of E^{total}(t_0, t_1). This is done by first letting the network run through the interval [t_0, t_1] and saving the entire history of inputs to the network, the network state, and the target vectors over this interval. Then a single backward pass over this history buffer is performed to compute the set of values of local gradients δ_k(τ) = -∂E^{total}(t_0, t_1)/∂s_k(τ), for all k ∈ U and τ ∈ (t_0, t_1], by means of the equations

\delta_k(\tau) = \begin{cases} f'_k(s_k(\tau))\, e_k(\tau) & \text{if } \tau = t_1 \\ f'_k(s_k(\tau)) \left[ e_k(\tau) + \sum_{l \in U} w_{lk}\, \delta_l(\tau + 1) \right] & \text{if } t_0 < \tau < t_1. \end{cases}    (9)

Once the backpropagation computation has been performed back to time t_0 + 1, the weight changes may be made along the negative gradient of the overall error by means of the equations

\Delta w_{kl} = -\alpha\, \frac{\partial E^{total}(t_0, t_1)}{\partial w_{kl}} = \alpha \sum_{\tau = t_0 + 1}^{t_1} \delta_k(\tau)\, x_l(\tau - 1).    (10)

The new weight matrix is then obtained as

w_{kl} \leftarrow w_{kl} + \Delta w_{kl}.    (11)
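The equations above can be collected into one compact epoch update. The following MATLAB sketch is one possible implementation under simplifying assumptions: the hyperbolic tangent is used as the squashing function of all units, one row of the external input is kept constant at 1 to carry the biases, and samples without a target are marked by NaN. The function name and the data layout are illustrative, not taken from the paper.

    % Epochwise backpropagation through time, a sketch of Eqs. (4)-(11).
    % W : n x (m+n) weight matrix, xin : m x T external inputs,
    % D : n x T targets (NaN = no target), alpha : learning rate.
    function W = bptt_epoch(W, xin, D, alpha)
        [n, T] = size(D);
        m = size(xin, 1);
        y = zeros(n, T);                      % unit outputs y(tau)
        x = [xin; y];                         % concatenated (m+n)-tuple x(tau)
        for tau = 1:T-1                       % forward pass, Eqs. (4)-(5)
            s = W * x(:, tau);
            y(:, tau+1) = tanh(s);
            x(m+1:end, tau+1) = y(:, tau+1);
        end
        e = D - y;                            % Eq. (6)
        e(isnan(e)) = 0;                      % zero error where no target is given
        fprim = 1 - y.^2;                     % derivative of tanh
        delta = zeros(n, T);                  % backward pass, Eq. (9)
        delta(:, T) = fprim(:, T) .* e(:, T);
        for tau = T-1:-1:2
            delta(:, tau) = fprim(:, tau) .* ...
                (e(:, tau) + W(:, m+1:end)' * delta(:, tau+1));
        end
        dW = zeros(size(W));                  % weight change, Eq. (10)
        for tau = 2:T
            dW = dW + delta(:, tau) * x(:, tau-1)';
        end
        W = W + alpha * dW;                   % Eq. (11)
    end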

3 RESULTS AND METHODS OF ONE STEP AHEAD PREDICTION

3.1 Data Specification

Two data sets of gas consumption in the Czech Republic were available. The first was measured every two hours for three years and three months (1 Jan 1997 – 31 Mar 2000) and the second represented the average daily gas consumption for eight years (1 Jan 1994 – 31 Dec 2001). For the same period (1 Jan 1994 – 31 Dec 2001) the average daily temperature in the Czech Republic was available (Fig. 9). The study of the given data revealed that some measurements were not taken or are missing. For further work with the time series it was important to fill in the missing measurements (Fig. 7). To achieve this goal the simplest approach has been used: missing values were obtained by linear interpolation of the closest known values. This means that, in the case of a single interpolated value, the missing value is the arithmetic mean of the two neighbouring values. In general form

\delta = \frac{y_n - y_1}{n - 1}    (12)

y_k = (k - 1)\,\delta + y_1, \qquad k = 2, \ldots, n - 1,    (13)

where (n − 2) is the total number of missing (unknown) values between the two known values y_1 and y_n. In the special case n = 3 we have

y_2 = (2 - 1)\,\frac{y_3 - y_1}{3 - 1} + y_1 = \frac{y_3 + y_1}{2}.    (14)

Figure 7: The replacement of missing values by new values

Application of this algorithm was necessary only for the data with the two-hour sampling period; the difference between the series with zero values and the series with the newly estimated data is shown in Fig. 8.
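A minimal sketch of this gap filling, assuming that missing samples are marked by NaN in the vector x; the example data and variable names are illustrative only.

    % Fill missing (NaN) samples by linear interpolation between the nearest
    % known neighbours, following Eqs. (12)-(13).
    x = [1.0 NaN NaN 2.5 3.0 NaN 2.0];            % example series with gaps
    for k = find(isnan(x))
        a = find(~isnan(x(1:k-1)), 1, 'last');    % last known value before the gap
        b = k + find(~isnan(x(k+1:end)), 1);      % first known value after the gap
        x(k) = x(a) + (k - a) * (x(b) - x(a)) / (b - a);
    end

In newer MATLAB versions a similar result can be obtained directly with the built-in interp1 function applied to the known samples.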

Figure 8: Data before and after estimation of missing values (years 1997 and 1998)

Figure 9: All three original time series before trend removal and standardization

Figure 10: Linear trends in given data

Figure 11: Data modified by difference of log

The linear trend (Fig. 10) has been removed from the given data of gas consumption and the evolution of the average daily temperature (see Fig. 9), and the data have been standardized by subtracting the mean and dividing by the standard deviation. For spectral analysis, the standardized data modified by the difference of logarithms (Fig. 11) have also been used:

x_i = \log(x_{i+1}) - \log(x_i)    (15)

3.2 Spectral Analysis

Spectral analysis has been performed by the FFT algorithm, available in the MATLAB environment as the fft function. A closer look at the spectral density graphs (see Figs. 12 and 13; the graphs do not show the whole range of frequencies) reveals that the main difference between the spectra calculated from the data modified by the logarithmic difference and from the real data is that the difference of logarithms emphasizes higher frequencies while lower frequencies are suppressed.
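The following MATLAB sketch shows how spectral densities of the kind plotted in Figs. 12 and 13 may be obtained with the fft function; the series x, the sampling frequency fs and the axis scaling are assumptions made for illustration (e.g. fs = 12 samples per day for the two-hour data, fs = 1 for the daily data).

    % Sketch of a one-sided power spectrum estimate of a standardized series x.
    N = length(x);
    X = fft(x);
    P = abs(X(1:floor(N/2))).^2 / N;          % one-sided power spectrum
    f = (0:floor(N/2)-1) * fs / N;            % frequency axis
    plot(1./f(2:end), P(2:end));              % plot against period, skip f = 0
    xlabel('period'); ylabel('spectral density');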

Figure 12: Spectral density of gas consumption and average daily temperature

Figure 13: Spectral density of gas consumption and temperature using difference of log

When we compare the spectra (12a) and (13a) of gas consumption (sampling period 2 hours), we see one main peak representing the period of one day, with further peaks at the positions of 12 hours and 8 hours. In the low-frequency area we find the periods of 3.5 days and 7 days, which are plainly seen in (12b) or (13b); in both graphs the period of 3.5 days and, of course, the period of one year can be noted. The spectral analysis of the periods occurring in the evolution of the average daily temperature in the Czech Republic is more difficult. In graphs (12c) and (13c) it is possible to find the period of one year and some peaks occurring around the period of one month. Furthermore, a number of periods around 16, 8, 7 and 6 days and more can be found, but none of them is particularly significant. The results of this analysis have been used in the selection of various empirical constants.

3.3 Data Pre-processing for All Models

For all following calculations only the data with the periodicity of one day have been used. With respect to the higher importance of heating and the total volume of gas consumption during the winter period of the year, the study has been devoted mainly to the winter period, namely the period of six months from October till March. This choice is supported by the correlation between gas consumption and temperature shown in Fig. 14; the values in the 48 bars represent the mutual correlation of gas consumption and temperature calculated month by month. All data have been normalized by trend removal, subtraction of the mean value and division by the standard deviation, and finally re-scaled into the interval −1 ≤ x ≤ 1. The first winter period 1994-1995 has been used as the learning data set and the second winter period 1995-1996 as the testing data set. The data modified by the difference of logarithms have not been used for the prediction.
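A compact sketch of the pre-processing chain described above; x stands for one of the series and the helper variables are illustrative.

    % Trend removal, standardization and re-scaling into [-1, 1].
    t  = (1:length(x))';
    p  = polyfit(t, x(:), 1);                 % linear trend coefficients
    xd = x(:) - polyval(p, t);                % trend removed
    xs = (xd - mean(xd)) / std(xd);           % zero mean, unit variance
    xr = 2*(xs - min(xs)) / (max(xs) - min(xs)) - 1;   % scaled into [-1, 1]

The built-in detrend function can replace the polyfit/polyval pair for the removal of the linear trend.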

Figure 14: Correlation between gas consumption and temperature

3.4 Linear Neural Network

A one-layer linear network with architecture 15-1 has been designed by MATLAB's function newlind from the Neural Network Toolbox. This architecture has been chosen with respect to the periodicity found in the signal. For the learning of this network, an overdetermined system of equations has been defined and solved by the least-squares method (LSM). The learning data set has been defined by the first winter period 1994-1995 and the testing data set by the second winter period 1995-1996. The advantages of this method include its short calculation time, good accuracy and transparency.
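A possible form of this design with newlind is sketched below; the vector g of pre-processed daily consumption, the delay layout and the variable names are assumptions for illustration.

    % Direct least-squares design of the 15-1 linear predictor with newlind.
    g = g(:).';                               % pre-processed daily consumption (row)
    d = 15;                                   % number of past days used as inputs
    P = zeros(d, length(g) - d);              % input matrix, one column per sample
    for k = 1:length(g) - d
        P(:, k) = g(k:k+d-1)';
    end
    T = g(d+1:end);                           % targets: the next-day values
    net = newlind(P, T);                      % linear layer designed by LSM
    Y   = sim(net, P);                        % one step ahead predictions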

The obtained results are summarized in Table 1 and shown in Figures 15-18, where the influence of adding more neurones to the input layer can be compared. The seven (7) newly added neurones are fed with the values of the daily temperature, changing the network architecture to 22-1. These figures and the values of SSE point to the significance of adding the temperature into the simple linear prediction model, resulting in higher accuracy and lower error values.

Figure 15: Real and predicted data – linear NN 15-1, learning data set winter 94/95

Figure 16: Real and predicted data – linear NN 15-1, testing data set winter 95/96

Figure 17: Real and predicted data – linear NN 22-1, learning data set winter 94/95

Figure 18: Real and predicted data – linear NN 22-1, testing data set winter 95/96

3.5 Feed-Forward Backpropagation Neural Network

A two-layer network created by the MATLAB command newff of the Neural Network Toolbox has been tested with Levenberg-Marquardt backpropagation [HM94] as the learning algorithm. Satisfactory results have been obtained with the network 15-3-1 trained for 10 epochs, with the hyperbolic tangent sigmoid as the first transfer function and an unbounded linear function with variable slope as the second one. All network parameters – the number of hidden neurones, the number of training epochs and the transfer functions – had a strong influence on the final prediction results. Finding their mutual combination resulting in better prediction properties forms one of the future goals.
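One possible way of creating and training such a network with the (older) toolbox syntax of the cited documentation is sketched below; P and T are the delayed-input matrix and the target vector constructed as in the previous section.

    % Two-layer 15-3-1 network with tansig/purelin transfer functions and
    % Levenberg-Marquardt learning, trained for 10 epochs.
    net = newff(minmax(P), [3 1], {'tansig', 'purelin'}, 'trainlm');
    net.trainParam.epochs = 10;
    net = train(net, P, T);
    Y   = sim(net, P);                        % predictions on the learning set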

The problem of suitable parameter estimation does not lie in decreasing the learning error, as it is relatively easy to train the network to minimize this error. The more important problem is the application of the resulting network to real data not used in the learning stage. In most cases the networks seem to be over-learned and unable to generalize, which results in an increase of the error values. To minimize this problem, several calculations have been carried out to find a suitable parameter regime, varying the number of hidden neurones (from 1 to 50) and the number of learning epochs (in the range 5-500). The strong influence of the initial matrix of weights has been slightly reduced by testing 200 various initial weight matrices, from which the one with the lowest value of SSE has been selected. The obtained results are summarized in Table 1 and in Figs. 19-22. As in the previous case, the 15-2-1 neural network model has been enlarged by seven (7) values of daily temperatures in the network inputs. This network of architecture 22-2-1 with 2 hidden neurones provided the best results.

Figure 19: Real and predicted data – FF NN 15-3-1, learning data set winter 94/95

Figure 20: Real and predicted data – FF NN 15-3-1, testing data set winter 95/96

Figure 21: Real and predicted data – FF NN 22-2-1, learning data set winter 94/95

Figure 22: Real and predicted data – FF NN 22-2-1, testing data set winter 95/96

R187 - 11

Figure 23: Real and predicted data – Recurrent NN 15-2-1, learning data set winter 94/95

Figure 24: Real and predicted data – Recurrent NN 15-2-1, testing data set winter 95/96

3.6 Recurrent Neural Network

The recurrent neural network has been designed with 15 external inputs, one external output (visible neurone) and a variable number of hidden neurones (1-20). The hyperbolic tangent has been used as the squashing (transfer) function and the epochwise backpropagation through time [WP90] as the learning algorithm. Every network with a specific architecture had 500 learning epochs. To find the typical behaviour of every single network architecture, two ways of finding a suitable initial matrix of weights have been used. The first one is based upon the same algorithm as in the previous case: a specific network architecture has been tested with 1000 different initial weight matrices, followed by the selection of the one whose output was closest to the desired one in the sense of SSE. The network has not been able to give an SSE lower than 15 and these results are not presented. The second algorithm used 20 different weight initializations, carrying out the whole learning process for each of them, so that 20 results were obtained for one specific network. This calculation takes approximately 4-20 hours depending on the computer and its parameters. For the selection of the best model the criterion of SSE (Sum of Squared Errors) has been used. At the end of the learning process all 501 weight matrices for each of the 20 initializations were evaluated, and the best model (by model we understand the matrix of weights) has been selected according to the minimal value of the learning error; for comparison, the last (501st) weight matrix has also been taken. The titles of Figures 23-26 provide information about the epoch with the best results. The learning constant, changing slightly during the learning process, was the next important element under study. The main reason for analysing this dynamic behaviour was the attempt to improve the learning process and achieve better results, since in many cases (for almost all tested network architectures) wrong learning processes can be observed in the evolution of the learning error. The obtained results are summarized in Table 1 and in Figures 23-26. As in the previous case, the 15-2-1 neural network model has been extended by seven values of daily temperatures in the network input. This modification resulted in the architecture 22-2-1 with 2 hidden neurones, which gave the best results.
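A sketch of the described multi-start strategy is given below; train_rnn() is a hypothetical wrapper around the epochwise backpropagation through time routine of Section 2.3.2, and the data variables are illustrative.

    % Train the same architecture from several random initial weight matrices
    % and keep the model (weight matrix) with the lowest SSE on the learning set.
    nStarts = 20;
    bestSSE = inf;
    for r = 1:nStarts
        W0 = 0.1 * randn(n, m + n);                        % random initial weights
        [W, Ylearn] = train_rnn(W0, Plearn, Tlearn, 500);  % 500 learning epochs
        sse = sum((Tlearn - Ylearn).^2);
        if sse < bestSSE
            bestSSE = sse;
            Wbest = W;
        end
    end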

Figure 25: Real and predicted data – Recurrent NN 22-2-1, learning data set winter 94/95

Figure 26: Real and predicted data – Recurrent NN 22-2-1, testing data set winter 95/96

Table 1: Comparison of prediction methods

    Type of Artificial Neural Network        MSE learning part   MSE validation part
    Linear 15-1                              3.5799              3.7215
    Linear 22-1 with temperature             3.3122              3.5025
    Feed-forward 15-3-1                      3.0938              4.1677
    Feed-forward 22-2-1 with temperature     3.1609              3.9882
    Recurrent 15-2-1                         9.1468              8.1744
    Recurrent 22-2-1 with temperature        10.1656             12.8883

4 CONCLUSION

All presented prediction methods (linear, feed-forward and recurrent neural networks) have been applied to a real data set of gas consumption in the Czech Republic during winter periods. In all cases the data from winter 94/95 have been used for learning (finding a suitable model) and the data from winter 95/96 for testing of the found models. The results are summarized in Table 1 and in Figures 15 – 26. According to the error values in Table 1, the linear neural network seems to be the best prediction method. It has to be kept in mind, however, that this research served only to model the prediction possibilities of the mentioned methods and to provide their simple comparison. For a true specification and a decision as to which method is better than the others, it will be necessary to carry out more prediction tests. There are many ways to improve the prediction accuracy and precision, for example by adding further data into the models, such as wind speed or gas prices, or by combining the results of several prediction methods. There are still some unanswered questions that should be studied. It is assumed that the future work based upon Mr. Pavelka's diploma thesis research will be devoted to:

• transformation of the predicted data back into the range of the raw un-preprocessed data
• data prediction with estimation of reliability limits
• prediction testing using different kinds of preprocessing
• selection of different data sets and their application in the developed algorithms
• multi-step prediction by improving (or modifying) the one-step prediction model

References

[Ano01] Anonymous. The MathWorks Online Documentation (Help Desk). The MathWorks, Inc., Natick, MA, 12.1 edition, 2001. http://www.mathworks.com.

[DB01] Demuth H. and Beale M. Neural Network Toolbox. The MathWorks, Inc., Natick, MA, 7th edition, March 2001. http://www.mathworks.com.

[FS92] Freeman J. A. and Skapura D. M. Neural Networks: Algorithms, Applications, and Programming Techniques. Addison-Wesley Publishing Company, Reading, Massachusetts, 1992.

[Hay94] Haykin S. Neural Networks. Prentice Hall, Inc., New Jersey, 1994.

[HM94] Hagan M. T. and Menhaj M. B. Training feedforward networks with the Marquardt algorithm. IEEE Transactions on Neural Networks, 5(6):989–993, 1994.

[Kan95] Kanjilal P. P. Adaptive Prediction and Predictive Control. Peter Peregrinus Ltd., London, 1995.

[KEL00] Khotanzad A., Elragal H., and Lu T. L. Combination of neural-network forecasters for prediction of natural gas consumption. IEEE Transactions on Neural Networks, 11(2):464–473, March 2000.

[Kot99] Kotzé N. Chapter: Neural networks. WWW, 1999. http://www.ee.up.ac.za/ee/telecoms/telecentre/nicchapternn.html.

[KvP99] Kánský Z., Šmíd M., and Procházka A. Prognózování spotřeby zemního plynu (Prediction of natural gas consumption). Plyn, 4:21–22, 1999.

[NI91] Nelson M. M. and Illingworth W. T. A Practical Guide to Neural Nets. Addison-Wesley Publishing Company, Reading, Massachusetts, 1991.

[Pro97] Procházka A. Neural networks and seasonal time-series prediction. In Fifth International Conference on Artificial Neural Networks, Cambridge, England, 1997.

[PSPK00] Procházka A., Stříbrský J., Ptáček J., and Kánský Z. Užití umělých neuronových sítí v nelineárních modelech spotřeby plynu (The use of artificial neural networks in non-linear models of gas consumption). Plyn, 3, 2000.

[Tay94] Taylor F. J. Principles of Signals and Systems. McGraw-Hill, Inc., New York, 1994.

[vP97] Šorek M. and Procházka A. Neural networks in signal prediction. Signal Analysis and Prediction, I:228–231, June 1997. The First European Conference on Signal Analysis and Prediction.

[WP90] Williams R. J. and Peng J. An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation, 2:490–501, 1990. http://www.ccs.neu.edu/home/rjw/pubs.html.

[WZ89] Williams R. J. and Zipser D. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1:270–280, 1989. http://www.ccs.neu.edu/home/rjw/pubs.html.