DIPARTIMENTO DI INFORMATICA E SCIENZE DELL'INFORMAZIONE
UNIVERSITÀ DI GENOVA - ITALY

Technical Report DISI-TR-00-4 (v 1.1)

Discontinuous and Intermittent Signal Forecasting: A Hybrid Approach

Francesco Masulli
Istituto Nazionale per la Fisica della Materia & DISI-Dipartimento di Informatica e Scienze dell'Informazione, Università di Genova, Via Dodecaneso 35, I-16146 Genova, Italy
E-mail: [email protected]

Giovambattista Cicioni
Istituto di Ricerca sulle Acque, Consiglio Nazionale delle Ricerche, Via Reno 1, I-00198 Roma, Italy
E-mail: [email protected]

Léonard Studer
Institut de Physique des Hautes Énergies, Université de Lausanne, CH-1015 Dorigny, Switzerland
E-mail: [email protected]

Abstract

A constructive methodology for shaping a neural model of a non-linear process, supported by results and prescriptions related to the Takens-Mañé theorem, has been recently proposed. Following this approach, the measurement of the first minimum of the mutual information of the output signal and the estimation of the embedding dimension using the method of global false nearest neighbors make it possible to design the input layer of a neural network or of a neuro-fuzzy system to be used as a predictor. In this paper we present an extension of this prediction methodology to discontinuous or intermittent signals. As the universal function approximation theorems for neural networks and fuzzy systems require the continuity of the function to be approximated, we apply Singular-Spectrum Analysis (SSA) to the original signal, in order to obtain a family of time series components that are more regular than the original signal and can, in principle, be individually predicted using the mentioned methodology. On the basis of the properties of SSA, the prediction of the original series can be recovered as the sum of the predictions of all the individual series components. We show an application of this prediction approach to a hydrology problem concerning the forecasting of daily rainfall intensity series, using a database collected over a period of 10 years from 135 stations distributed in the Tiber river basin.

Chapter 1

Introduction

In the last few years, neural networks and fuzzy systems have been widely used in non-linear dynamic system modeling and forecasting. Those applications are supported by the universal function approximation properties holding for both neural networks and fuzzy systems [3, 12, 33]. However, these theorems give no information on how to define the structure of the learning machine, while the application of this kind of system, such as a Multi-Layer Perceptron (MLP), to the problem of time series forecasting implies the setting of: the number of units in the input layer, the sampling time of the series, and the structure and dimension of the hidden layers. Neural network theory gives only general suggestions on how to choose these quantities. The specific features of the data set have to be taken into account at this level to tailor the MLP to the time series to be forecast.

As shown by our group in [26, 22], a constructive methodology for shaping a supervised neural model of a non-linear process can be based on the results and prescriptions related to the Takens-Mañé theorem and on the Global False Nearest Neighbors (FNN) method proposed by Abarbanel [1]. In [10] a similar methodology has been independently proposed.

In this paper we present an extension of this methodology to the prediction of discontinuous or intermittent signals, based on an ensemble method that uses a decomposition obtained with Singular-Spectrum Analysis (SSA) [31]. SSA decomposes the raw signal into reconstructed components. Each reconstructed component can, in principle, be predicted separately using the approach presented in [22]. Then the prediction of the original series can be recovered as the sum of the predictions of all the individual series components. The approach followed in this paper is based on the prediction of reconstructed waves, which are sums of disjoint groups of reconstructed components, and on the subsequent recomposition of the forecast of the raw signal by addition of the predictions of the individual reconstructed waves. We also present an application of these methods to the forecasting of rainfall intensity series collected in the Tiber basin.

In the next chapter, we present Multi-Layer Perceptrons and discuss their properties relevant to series forecasting. In Ch. 3 we give the basis of the methodology for time series forecasting. In Ch. 4 we introduce Singular Spectrum Analysis and the

ensemble method based on SSA decomposition. In Ch. 5 we present the application to rainfall forecasting and the obtained results. The conclusions of the work are drawn in Ch. 6.


Chapter 2

Learning Time Series using Multi-Layer Perceptrons

Theoretical and experimental results support the use of neural networks in many applicative tasks. In particular, it has been shown that such systems can perform function approximation [3, 12], Bayesian classification [24], unsupervised vector quantization clustering of inputs [14], content-addressable memories [11], and linear and non-linear principal component analysis and independent component analysis [9].

Artificial neural networks are made up of simple nodes, or neurons, interconnected with one another. Generally speaking, a node of a neural network can be regarded as a block that measures the similarity between the input vector and the parameter vector, or weight vector, associated with the node, followed by another block that computes an activation function, normally non-linear [18, 9]. The transfer function of an artificial neuron is given by the equation:

y = H\left( \sum_i w_i x_i - \theta \right)    (2.1)

where y is the output of the neuron, H is the activation function, the w_i are the weights, the x_i are the inputs, and θ is the threshold.

The most widely used neural network is the Multi-Layer Perceptron (MLP), a feed-forward model based on layers of neurons. The nodes of each layer are interconnected with all the nodes of the following layer. In this way Multi-Layer Perceptrons perform non-linear maps from an input space to an output space. Moreover, as demonstrated by the Universal Approximation Theorem [3, 12], an MLP with a single hidden layer¹ and sigmoid activation functions H(x) = 1 / (1 + exp(-a x)), where a is the slope parameter of the sigmoid, is sufficient to uniformly approximate any continuous function with support in a unit hypercube.

¹ The output nodes constitute the output of the MLP. The remaining nodes constitute the hidden layers of the network.

The non-linear map can be automatically learned from data by an MLP through supervised learning techniques based on the minimization of a cost function, such as


the Root Mean Square (RMS) error. The most widely used learning technique is Error Back-Propagation, which is an efficient application of the Gradient Descent method [25, 9].

The universal approximation property implies that, if the non-linear dynamical process can be represented by a continuous function, an efficient non-linear model can be built from data using a Multi-Layer Perceptron. Using MLPs, the costly detailed design step of the first-principles model usually required in non-linear system identification is transformed into a simpler structuring step of the MLP, plus an optional pre-processing (possibly driven by some understanding of the physical model of the process) of the raw data coming from the field.

Even if, in principle, the function approximation property of MLPs guarantees the feasibility of data-based models of non-linear dynamical systems, neural network theory does not give any suggestion about many details: for example, no general prescriptions are available concerning the dimension of the data window (i.e., the input layer of the MLP), the sampling rate of the input data, the dimension of the hidden layers, or the dimension of the training set. Most of the time those fundamental design parameters have to be obtained by experiments and heuristics.
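To make the above concrete, the following is a minimal sketch (not the implementation used in this work) of a one-hidden-layer MLP with sigmoid units, trained by plain batch gradient descent on the squared error and used as a one-step-ahead predictor on a sliding window of a scalar series; all function names and parameter values are illustrative choices:

```python
import numpy as np

def sigmoid(z, a=1.0):
    """Logistic activation H(x) = 1 / (1 + exp(-a x))."""
    return 1.0 / (1.0 + np.exp(-a * z))

def make_windows(series, d, T=1):
    """Input windows of d samples spaced by lag T, with the next sample as target."""
    series = np.asarray(series, dtype=float)
    span = (d - 1) * T
    X = np.array([series[i:i + span + 1:T] for i in range(len(series) - span - 1)])
    y = series[span + 1:]
    return X, y

def train_mlp(X, y, n_hidden=5, lr=0.05, epochs=2000, seed=0):
    """One hidden layer of sigmoid units, linear output, batch gradient descent."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.1, size=(X.shape[1], n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.1, size=n_hidden)
    b2 = 0.0
    n = len(y)
    for _ in range(epochs):
        h = sigmoid(X @ W1 + b1)                 # hidden activations
        out = h @ W2 + b2                        # network output
        err = out - y                            # residuals
        # backpropagated gradients of the mean squared error
        gW2, gb2 = h.T @ err / n, err.mean()
        dh = np.outer(err, W2) * h * (1.0 - h)
        gW1, gb1 = X.T @ dh / n, dh.mean(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2
    return W1, b1, W2, b2

def predict(X, W1, b1, W2, b2):
    return sigmoid(X @ W1 + b1) @ W2 + b2

# toy usage: one-step-ahead prediction of a smooth signal
s = np.sin(0.3 * np.arange(500))
X, y = make_windows(s, d=6, T=7)
W1, b1, W2, b2 = train_mlp(X, y)
rms = np.sqrt(np.mean((predict(X, W1, b1, W2, b2) - y) ** 2))
```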


Chapter 3

Hints from Dynamical Systems Theory

3.1 Dynamical Systems and Chaos Theory

3.1.1 State Space

A deterministic dynamical system is described by a set of differential equations. Its evolution is represented by the trajectory in state space (of dimension n) of the vector Q = (x, ẋ, y, ẏ, z, ż, ...), where x, ẋ, y, ẏ, z, ż, ... are the variables of the system and their derivatives. The figure drawn in state space by Q is the attractor of the system.

For non-linear systems, the dynamical variables (x, y, z, ...) are coupled. The evolution of one variable (say x) is not independent of all the other ones (y, z, ...). Except for a few simple phenomena, the set of differential equations is unknown. Often, even the whole set of relevant effective dynamical variables is not well defined. But, as the variables are interdependent, the observation of only one of them brings information, maybe in an implicit way, on the other ones and consequently on the complete dynamical system. This is the reason why time series of non-linear dynamic systems are so useful.

3.1.2 Embedding Theorem

The question is now: how can we reconstruct the complete dynamical system from only the one-variable time series (s_1, s_2, s_3, ...)? Here the theory of dynamical systems gives an answer. In particular, the Embedding Theorem, proposed independently in 1981 by Takens and Mañé [27, 17], answers this question. In the Takens-Mañé theorem we consider an augmented vector S built with d elements of the time series. The dimension d of the vector has to be greater than two times the box-counting dimension D_0 of the attractor of the system:

d > 2 D_0    (3.1)

A vector S satisfying the Takens-Mañé bound cited in the previous paragraph will evolve in a reconstructed state space, and its evolution will be in a diffeomorphic relation with the trajectory of Q in the original state space (a diffeomorphism is a smooth one-to-one relation). In other words, for every practical purpose the evolution of S is a faithful copy of the evolution of Q. It is worth noting that there is a distinction between the order of the differential equation (n), which is the dimension of the state space where the true state vector Q lives, and the sufficient dimension of a reconstructed state space (d) where the reconstructed vector S lives.

3.1.3 An Example

In order to elucidate the Embedding Theorem, consider a sine wave s(t) = A sin(t). In d = 1 (i.e., the s(t) space) this wave oscillates in the interval [-A, A]. Two points which are close in the Euclidean (or other) distance may have quite different values of ṡ(t). In this way two "close" points may move in opposite directions along the single spatial axis. In a two-dimensional space (s(t), s(t+T)), where T is a time lag, the ambiguity of the dynamics of the points is resolved. The system evolves on a figure (in general an ellipse) that is topologically equivalent to a circle. If we draw the sine wave in the three dimensions (s(t), s(t+T), s(t+2T)), no further unfolding occurs and the sine wave is again represented as an ellipse.

3.1.4 The Method of Embedding

In order to reconstruct the dynamical system we can use the time delay embedding method [1]. This method consists in building d-dimensional state vectors S_i = (s_i, s_{i+T}, ..., s_{i+(d-1)T}). In principle, it suffices that d ≥ n. But the effective dimension d is not directly related to the dynamical dimension n, as in the case of weakly coupled variables.

3.2 Choosing the time delay

The time delay T (or time lag) used in the embedding has to be chosen carefully. If it is too long, the samples s_i, s_{i+T}, ..., s_{i+(d-1)T} are not correlated¹ and then, in general, the dynamical system cannot be reconstructed. If it is too short, every sample is essentially a copy of the previous one, bringing very little information on the dynamical system.

We use Shannon's mutual information to quantify the amount of information shared by two samples, in order to get a useful estimation of the time lag T. Let us define the average mutual information between measurements a_i drawn from a set A and measurements b_i drawn from a set B. The set of measurements A is made of the values of the observable s_i and the set B is made of the values s_{i+t} (t is a time interval).

¹ This happens in particular for chaotic systems, for which even two initially close trajectories will diverge exponentially in time.


The average mutual information is then:

I(t) = \sum_{s_i \in A,\; s_{i+t} \in B} P(s_i, s_{i+t}) \log_2 \frac{P(s_i, s_{i+t})}{P(s_i)\, P(s_{i+t})}    (3.2)

where the P(·) are probability distributions based on frequency observations. It has been suggested [6, 5, 29, 1] to take the time T at which the first minimum of I(t) occurs as the time delay to use in the phase space reconstruction. In this way the values of s_n and s_{n+T} are the most independent of each other in an information-theoretic sense. Moreover, the first minimum of the average mutual information is a good candidate for the interval between the components of the state vectors that will be the input to the neural network model of the non-linear dynamical process.
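As an illustration of how the average mutual information of Eq. 3.2 and its first minimum could be estimated in practice, here is a small sketch based on histogram estimates of the probability distributions; the binning, the function names, and the local-minimum test are our own assumptions rather than prescriptions from the text:

```python
import numpy as np

def average_mutual_information(s, max_lag, n_bins=32):
    """Histogram estimate of I(t), Eq. 3.2, for lags t = 1, ..., max_lag."""
    s = np.asarray(s, dtype=float)
    edges = np.linspace(s.min(), s.max(), n_bins + 1)
    ami = []
    for t in range(1, max_lag + 1):
        joint, _, _ = np.histogram2d(s[:-t], s[t:], bins=[edges, edges])
        p_ab = joint / joint.sum()                 # joint probabilities P(s_i, s_{i+t})
        p_a = p_ab.sum(axis=1, keepdims=True)      # marginal P(s_i)
        p_b = p_ab.sum(axis=0, keepdims=True)      # marginal P(s_{i+t})
        nz = p_ab > 0
        ami.append(np.sum(p_ab[nz] * np.log2(p_ab[nz] / (p_a @ p_b)[nz])))
    return np.array(ami)

def first_minimum(ami):
    """Lag of the first local minimum of the I(t) curve (lags start at 1)."""
    for t in range(1, len(ami) - 1):
        if ami[t] < ami[t - 1] and ami[t] <= ami[t + 1]:
            return t + 1
    return None
```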

3.3 Evaluating the Global Embedding Dimension

From the Embedding Theorem, the box-counting dimension D_0 should be evaluated. In principle, it can be estimated directly from the time series itself, but this task is very sensitive to noise and needs a large set of data points (of the order of 10^{D_0} data points) [1]. In order to avoid these problems, we can estimate the embedding dimension d_E, defined as the lowest (integer) dimension which unfolds the attractor, i.e., the minimal dimension for which foldings due to the projection of the attractor onto a lower-dimensional space are avoided. The embedding dimension is a global dimension and in general is different from the local dimension of the underlying dynamics. The Embedding Theorem guarantees that if the dimension of the attractor is D_0, then we can unfold the attractor in a space of dimension d_E > 2 D_0. It is worth noting that this condition on d_E is sufficient but not necessary for unfolding. The input layer of the Multi-Layer Perceptron will then have a dimension high enough that the deterministic part of the dynamics of the system is unfolded.

3.3.1 Global False Nearest Neighbors

In practice, the method of Global False Nearest Neighbors proposed by Abarbanel [1] can be used to evaluate the embedding dimension d_E. Consider a data space reconstruction in dimension d, with data vectors S_i = (s_i, s_{i+T}, ..., s_{i+(d-1)T}), where the time delay T is the first minimum of the average mutual information (Eq. 3.2). Let S_i^{NN} = (s_i^{NN}, s_{i+T}^{NN}, ..., s_{i+(d-1)T}^{NN}) be the nearest neighbor vector of S_i in phase space. If the vector S_i^{NN} is a false neighbor (FNN) of S_i, having arrived in its neighborhood by projection from a higher dimension because the present dimension d does not unfold the attractor, then by going to the next dimension d+1 we may move this point out of the neighborhood of S_i.

We define the distance ξ between the points when seen in dimension d+1 relative to


the distance in dimension d as

\xi_i = \sqrt{ \frac{ R^2_{d+1}(i) - R^2_d(i) }{ R^2_d(i) } }    (3.3)

where R_d(i) denotes the distance between S_i and its nearest neighbor S_i^{NN} in dimension d; then

\xi_i = \frac{ \left| s_{i+dT} - s^{NN}_{i+dT} \right| }{ R_d(i) }    (3.4)

As suggested by Abarbanel [1], S_i^{NN} and S_i can be classified as false neighbors if ξ_i is greater than a threshold θ (ξ_i > θ). In many applications a good value for θ is 15. In the case of clean data from a dynamical system, we expect that the percentage of FNNs will drop from nearly 100% in dimension one to close to zero when d_E is reached.

It is worth noting that, as we go to higher-dimensional spaces, the volume available for the data grows as the distance to the power of the dimension, and no near neighbor will be classified as a close neighbor. In this case we can modify Eq. 3.4 as

\xi_i = \frac{ \left| s_{i+dT} - s^{NN}_{i+dT} \right| }{ R_A }    (3.5)

where R_A is the nominal "radius" of the attractor, defined as the Root Mean Square (RMS) value of the data about its mean, i.e.:

R_A = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( s_i - s_{av} \right)^2 }    (3.6)

s_{av} = \frac{1}{N} \sum_{i=1}^{N} s_i    (3.7)

In [20] a very efficient implementation of the FNN algorithm is presented. This algorithm is based on the work by Nene and Nayar [21].

It is worth noting that there are two main arguments suggesting that the input layer of a predictor based on MLPs can be sized smaller than the evaluation obtained with the FNN method: this evaluation is still an upper bound, and moreover, for an assigned size of the training set, a limitation of the complexity of the learning machine can lead to better generalization.
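A compact sketch of the Global False Nearest Neighbors test described above follows. The brute-force neighbor search, the fallback test on the attractor radius R_A with tolerance 2, and all names are illustrative assumptions; an efficient implementation along the lines of [20, 21] would replace the O(N²) search:

```python
import numpy as np

def delay_vectors(s, d, T):
    """Time-delay embedding: row i is S_i = (s_i, s_{i+T}, ..., s_{i+(d-1)T})."""
    n = len(s) - (d - 1) * T
    return np.column_stack([s[j * T : j * T + n] for j in range(d)])

def fnn_fraction(s, d, T, theta=15.0):
    """Fraction of false nearest neighbors when going from dimension d to d+1."""
    s = np.asarray(s, dtype=float)
    n = len(s) - d * T                 # keep vectors whose (d+1)-th coordinate exists
    X = delay_vectors(s, d, T)[:n]
    R_A = np.sqrt(np.mean((s - s.mean()) ** 2))   # attractor radius, Eq. 3.6
    n_false = 0
    for i in range(n):
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf
        j = int(np.argmin(dist))                  # nearest neighbor in dimension d
        R_d = dist[j]
        extra = abs(s[i + d * T] - s[j + d * T])  # extra coordinate in dimension d+1
        # Eq. 3.4 test with theta = 15; the second test (Eq. 3.5 style) uses a
        # tolerance of 2 on the attractor size, a common choice not stated in the text
        if (R_d > 0 and extra / R_d > theta) or extra / R_A > 2.0:
            n_false += 1
    return n_false / n

# expected behavior: the fraction drops to near zero at the embedding dimension d_E
# fractions = [fnn_fraction(series, d, T=7) for d in range(1, 11)]
```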

3.3.2 Bells, Whistles, and Pitfalls of FNN

- The global FNN calculation is simple and fast.

- The FNN calculation applied to signals coming from two different outputs of the same dynamical system gives, in general, two different values of d_E. From each signal we will then obtain a different reconstructed coordinate system, but both are consistent with the original dynamical system.

- The FNN method is valid even if the signal of interest results from a filtered output of a dynamical system [1, 4].

- If the signal is contaminated by noise (assumed to be generated by a high-dimensional system), it may be that the contamination dominates the signal of interest and FNN shows the dimension required to unfold the contamination. In this case, a simple byproduct of the FNN calculation is an indication of the noise level in the signal.


Chapter 4

Ensemble Method based on Singular Spectrum Analysis Decomposition

4.1 Singular Spectrum Analysis

The methodology described in the previous chapter has been successfully applied in the design of Multi-Layer Perceptrons and Neuro-Fuzzy systems for the forecasting of simulated non-linear and chaotic systems [26, 22] and of real-world processes, such as the modeling of the vibration dynamics of a 150 MW Siemens steam turbine [19]. The proposed methodology cannot be directly applied to the forecasting of discontinuous or intermittent signals, as the universal function approximation theorems for neural networks [3] and fuzzy systems [33] require the continuity of the function to be approximated.

In order to avoid the effect of the discontinuities of a signal, we can apply Singular-Spectrum Analysis (SSA) [15, 23, 31, 16] to the signal to be forecast. In SSA the state vector S_i = (s_i, s_{i+1}, ..., s_{i+M-1}) is a temporal window (augmented vector) of the series s, made up of a given number of samples M. The cornerstone of SSA is the Karhunen-Loève expansion, or Principal Component Analysis (PCA) [28], which is based on the eigenvalue problem of the lagged covariance matrix Z_s. Z_s has a Toeplitz structure, i.e., constant diagonals corresponding to equal lags:

Z_s =
\begin{pmatrix}
c(0)   & c(1)   & \cdots & \cdots & c(M-1) \\
c(1)   & c(0)   & c(1)   &        & \vdots \\
\vdots & c(1)   & \ddots & \ddots & \vdots \\
\vdots &        & \ddots & \ddots & c(1)   \\
c(M-1) & \cdots & \cdots & c(1)   & c(0)
\end{pmatrix}
\qquad (4.1)

In the absence of prior information about the signal, it has been suggested [31] to use the following estimate for Z_s:

c(j) = \frac{1}{N-j} \sum_{i=1}^{N-j} s_i \, s_{i+j}    (4.2)

The original series can be expanded with respect to the orthonormal basis corresponding to the eigenvectors of Z_s:

s_{i+j} = \sum_{k=1}^{M} p_i^k u_j^k, \qquad 1 \le j \le M, \quad 0 \le i \le N-M    (4.3)

where the p_i^k are called the principal components (PCs), the eigenvectors u_j^k are called the empirical orthogonal functions (EOFs), and the orthonormality property

\sum_{k=1}^{M} u_j^k u_l^k = \delta_{jl}, \qquad 1 \le j \le M, \quad 1 \le l \le M    (4.4)

holds. It is worth noting that SSA does not resolve periods longer than the window length M. Hence, if we want to reconstruct a strange attractor, whose spectrum includes periods of arbitrary length, the larger M the better, while avoiding exceeding roughly M = N/3 (otherwise statistical errors could dominate the last values of the auto-covariance function).

In [30, 8, 31, 13, 16, 7] many applications of Singular Spectrum Analysis have been presented, including noise reduction, detrending, spectral estimation, and prediction. The application of SSA to prediction, which is the main interest of the present paper, is supported by the following argument: since the PCs are filtered versions of the signal and typically band-limited, their behavior is more regular than that of the raw series s, and hence more predictable. Vautard and Ghil [31] fit an autoregressive (AR) model to each individual PC using Burg's AR coefficient estimate [2], while Lisi, Nicolis and Sandri [16] used Multi-Layer Perceptrons to predict filtered versions of the raw signal obtained with SSA. In order to reduce the computational costs, we decompose the raw series s into reconstructed waves corresponding to SSA subspaces of similar explained variance and predict them using Multi-Layer Perceptrons, combined with an independent evaluation, for each wave, of the time lag (first minimum of the mutual information) and of the embedding dimension (False Nearest Neighbors method).
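The following sketch illustrates the Toeplitz-based SSA decomposition of Eqs. 4.1-4.3 (lagged covariance estimate, EOFs, and principal components); it is a minimal illustration with our own function names, not the toolkit actually used for the experiments:

```python
import numpy as np

def ssa_decompose(s, M):
    """Toeplitz-based SSA: eigenvalues, EOFs (columns) and principal components."""
    s = np.asarray(s, dtype=float)
    N = len(s)
    # lagged covariance estimate, Eq. 4.2
    c = np.array([np.dot(s[:N - j], s[j:]) / (N - j) for j in range(M)])
    Z = np.array([[c[abs(j - l)] for l in range(M)] for j in range(M)])   # Eq. 4.1
    eigvals, eofs = np.linalg.eigh(Z)
    order = np.argsort(eigvals)[::-1]             # sort by decreasing variance
    eigvals, eofs = eigvals[order], eofs[:, order]
    # principal components p_i^k: projection of the augmented vectors onto the EOFs
    windows = np.array([s[i:i + M] for i in range(N - M + 1)])
    pcs = windows @ eofs
    return eigvals, eofs, pcs
```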

4.2 Reconstructed components and reconstructed waves

Following Vautard and Ghil [31], suppose we want to reconstruct the original signal s_i starting from an SSA subspace 𝒦 of eigenvectors. By analogy with Eq. 4.3, the problem can be formalized as the search for a series ŝ of length N such that the quantity

H^2(\hat{s}) = \sum_{i=0}^{N-M} \sum_{j=1}^{M} \left( \hat{s}_{i+j} - \sum_{k \in \mathcal{K}} p_i^k u_j^k \right)^2    (4.5)

is minimized. In other words, the optimal series ŝ is the one whose augmented version Ŝ is the closest, in the least-squares sense, to the projection of the augmented series S onto the EOFs with indices belonging to 𝒦. The solution of the least-squares problem of Eq. 4.5 is given by

\hat{s}_i =
\begin{cases}
\dfrac{1}{M} \sum_{j=1}^{M} \sum_{k \in \mathcal{K}} p_{i-j}^k u_j^k & \text{for } M \le i \le N-M+1 \\[1ex]
\dfrac{1}{i} \sum_{j=1}^{i} \sum_{k \in \mathcal{K}} p_{i-j}^k u_j^k & \text{for } 1 \le i \le M-1 \\[1ex]
\dfrac{1}{N-i+1} \sum_{j=i-N+M}^{M} \sum_{k \in \mathcal{K}} p_{i-j}^k u_j^k & \text{for } N-M+2 \le i \le N
\end{cases}
\qquad (4.6)

When 𝒦 consists of a single index k, the series ŝ is called the k-th RC, and will be denoted by ŝ^k. RCs have additive properties, i.e., the series reconstructed from a subspace 𝒦 is the sum of the RCs with indices in 𝒦:

\hat{s} = \sum_{k \in \mathcal{K}} \hat{s}^k    (4.7)

In particular, the series s can be expanded as the sum of all its RCs:

s = \sum_{k=1}^{M} \hat{s}^k    (4.8)

Note that, despite its linear aspect, the transform changing the series s into ŝ^k is, in fact, non-linear, since the eigenvectors u^k depend non-linearly on s. If we truncate this sum to an assigned number of RCs, the explained variance of the related augmented vector Ŝ is the sum of the eigenvalues associated with those RCs, while the estimate of the resulting reconstruction error is the sum of the eigenvalues corresponding to the remaining RCs. As a consequence, it is convenient to order the RCs according to the value of their eigenvalues. Let 𝒦_1, 𝒦_2, ..., 𝒦_L be L disjoint subspaces; then a reconstructed wave (RW) Ω_l (l = 1, ..., L) is defined as

\Omega_l = \sum_{k \in \mathcal{K}_l} \hat{s}^k, \qquad 1 \le l \le L    (4.9)

Then, from Eqs. 4.8 and 4.9, one obtains:

s = \sum_{l=1}^{L} \Omega_l    (4.10)

which says that the original series s can be recovered as the sum of all the individual RWs.
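A sketch of the reconstruction step is given below: it accumulates the products p_i^k u_j^k over all window positions and normalizes by the number of contributions, which is equivalent to the three cases of Eq. 4.6 (diagonal averaging); grouping the EOF indices into disjoint sets then yields the RWs of Eq. 4.9, and their sum recovers the series as in Eq. 4.10. Names and the commented usage are illustrative:

```python
import numpy as np

def reconstruct(pcs, eofs, indices, N):
    """Series reconstructed from the EOF indices in `indices` (RC or RW, Eq. 4.6)."""
    M = eofs.shape[0]
    rc = np.zeros(N)
    counts = np.zeros(N)
    for i in range(N - M + 1):
        for k in indices:
            rc[i:i + M] += pcs[i, k] * eofs[:, k]   # contribution p_i^k u_j^k
        counts[i:i + M] += 1                        # windows covering each sample
    return rc / counts

# illustrative usage: waves from disjoint index groups sum back to the series (Eq. 4.10)
# eigvals, eofs, pcs = ssa_decompose(s, M=182)
# groups = [range(0, 4), range(4, 11)]              # e.g. the groupings of Tab. 5.1
# waves = [reconstruct(pcs, eofs, g, len(s)) for g in groups]
```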

4.3 Hybrid Approach to Complex Signal Prediction

In order to design a predictor for complex signals, such as discontinuous and/or intermittent signals, we can apply the following approach, which combines an unsupervised step and a supervised one, building up in this way an ensemble of learning machines:

- Unsupervised decomposition: using Singular Spectrum Analysis, decompose the original signal s into reconstructed waves (RWs) corresponding to subspaces with equal explained variance;

- Supervised learning: prepare a predictor for each RW using the methodology described in Ch. 3;

- Operational phase: the prediction of the original signal s is then obtained as the sum of the predictions of the individual RWs, i.e., using Eq. 4.10 (a sketch of the whole pipeline is given below).

It is worth noting that sometimes the most complex waves (in general those corresponding to the low eigenvalues) cannot be satisfactorily predicted using the available data. Following the criterion of the best prediction [16], we can exclude them from the sum of Eq. 4.10 if, when included, they make the overall prediction worse.
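The sketch below strings the previous sketches together into the hybrid pipeline just described: SSA decomposition into waves of roughly equal explained variance, one MLP predictor per wave, recomposition of the forecast by summation as in Eq. 4.10, and clamping to non-negative values as done for the rainfall application of Ch. 5. The grouping routine, the parameter values, and the helper names refer to the earlier sketches and are illustrative only:

```python
import numpy as np

def group_by_variance(eigvals, n_groups=10):
    """Split EOF indices into consecutive groups of roughly equal explained variance."""
    frac = np.cumsum(eigvals) / np.sum(eigvals)
    groups, start = [], 0
    for g in range(1, n_groups + 1):
        stop = min(int(np.searchsorted(frac, g / n_groups)) + 1, len(eigvals))
        groups.append(range(start, stop))
        start = stop
    return groups

def forecast_next(s, M=182, n_groups=10):
    """1-day-ahead forecast: sum of the per-wave MLP forecasts (Eq. 4.10)."""
    eigvals, eofs, pcs = ssa_decompose(s, M)              # sketch from Ch. 4.1
    prediction = 0.0
    for group in group_by_variance(eigvals, n_groups):
        wave = reconstruct(pcs, eofs, group, len(s))      # sketch from Ch. 4.2
        X, y = make_windows(wave, d=5, T=1)               # 5 consecutive inputs (Ch. 5)
        params = train_mlp(X, y)                          # sketch from Ch. 2
        window = wave[-5:].reshape(1, -1)                 # last window of the wave
        prediction += predict(window, *params)[0]
    return max(prediction, 0.0)                           # forecast clamped to zero
```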


Chapter 5

Application to Rainfall Forecasting

5.1 Data Set and Methods

One of our applications of the previously described forecasting approach concerns the forecasting of daily rainfall intensity series of 3652 samples each, collected by 135 stations located in the Tiber river basin (see Fig. 5.1) in the period 01/01/1958 - 12/31/1967.

The data processing started by considering the series of the Mean Station (MS), defined as the average of all 135 rainfall intensity series (Fig. 5.2). In Fig. 5.3 a window on the period 07/01/66 - 12/30/66 is presented in order to better show the discontinuity and intermittence of the studied signal.

Fig. 5.4 shows the graph of the mutual information of the MS time series. Its first minimum gives T = 7. This value has been used as the time lag for the computation of the Global False Nearest Neighbors. The graph of FNN is shown in Fig. 5.5. Up to d = 6 the curve decreases as the dimension grows, and then it reaches a plateau of about 20%. The embedding dimension is then d_E = 6.

Following the constructive approach described in Ch. 3, we designed a predictor based on a Multi-Layer Perceptron. The MLP was made up of two hidden layers of 5 units and an input layer of 6 inputs spaced by a time lag of 7 days. The results obtained in this way are poor, due to the discontinuity of the hydrological variable.

In order to reduce the effects of the discontinuities, we used the SSA decomposition ensemble method presented in the previous chapter. We applied the Singular-Spectrum Analysis (SSA) to a signal corresponding to the first 3000 samples of the MS series. The window width used for the SSA was M = 182, i.e., 6 months, a period sufficient to take into account the seasonal periodicities of the related physical phenomena. Fig. 5.6 shows the ordered list of eigenvalues and the explained variance of the reconstructed signal using an increasing number of RCs.

Then, from the original MS series we obtained 10 waves Ω1, ..., Ω10 reconstructed from 10 disjoint sub-spaces, each of them representing 10% of the explained variance (Tab. 5.1).


Figure 5.1: Distribution of the 135 stations in the Tiber river basin.



Figure 5.2: Mean Station: Daily rain millimeters. Period 01/01/1958 - 12/31/1967.


Figure 5.3: Mean Station: Daily rain millimeters. Period 07/01/66 - 12/30/66.

Figure 5.4: Mean Station: Mutual Information. The first minimum is for t = 7.

Figure 5.5: Mean Station: Global False Nearest Neighbors.


Figure 5.6: Mean Station: Eigenvalues spectrum (up) and explained variance of the augmented vectors related to an increasing number of RCs (down).


Table 5.1: Reconstructed waves (RWs) from disjoint SSA subspaces (each of them explaining 10% of the variance) and corresponding reconstructed components (RCs). The SSA is performed using a window of 182 days.

RW     RCs
Ω1     1-4
Ω2     5-11
Ω3     12-19
Ω4     20-28
Ω5     29-39
Ω6     40-52
Ω7     53-70
Ω8     71-93
Ω9     94-126
Ω10    127-182

Waves Ω1, ..., Ω6 (corresponding to the first 52 RCs) are more regular, while the remaining waves (corresponding to subspaces with low eigenvalues) are more complex (Fig. 5.7). Fig. 5.8 shows the mutual information for each RW, while Fig. 5.9 shows the corresponding Global False Nearest Neighbors plots. The evaluations of the first minimum of mutual information and of d_E for each RW are presented in Tab. 5.2.

We then designed a neural predictor based on an MLP for each individual wave of the MS, following the constructive approach described in Ch. 3, implementing in this way an SSA decomposition ensemble of learning machines. The best results for each RW have been obtained using as inputs windows of 5 consecutive elements and two hidden layers with the dimensions reported in Tab. 5.3.

As each wave contains 3652 daily samples, for each wave we obtained a data set of 3646 associative couples, each consisting of a window of 5 consecutive elements as input and the next-day rainfall intensity as output. Each MLP was trained on the first 2000 associative couples (training set), using the error back-propagation algorithm with momentum [32] and a batch presentation of the samples. The following 1000 associative couples (validation set) were used to implement an early stopping of the training procedure. The remaining 646 were used for measuring the quality of the forecasting of the reconstructed wave (test set).
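As a small illustration of the data organization just described (split sizes taken from the text, error measures as reported in Tab. 5.3; function names are ours):

```python
import numpy as np

def split_couples(wave, n_in=5):
    """Associative couples (5 consecutive inputs, next-day target) and the 2000/1000/rest split."""
    X, y = make_windows(wave, d=n_in, T=1)
    return ((X[:2000], y[:2000]),            # training set
            (X[2000:3000], y[2000:3000]),    # validation set, used for early stopping
            (X[3000:], y[3000:]))            # test set

def rms_and_maxa(y_true, y_pred):
    """Root Mean Square error and Maximum Absolute error, as reported in Tab. 5.3."""
    err = np.asarray(y_pred) - np.asarray(y_true)
    return float(np.sqrt(np.mean(err ** 2))), float(np.max(np.abs(err)))
```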


Table 5.2: First minimum of Mutual Information (T) and embedding dimension (dE) computed using T and other time lags for each reconstructed wave.

RW     T    dE(T)   dE(7)   dE(1)
Ω1     22   4       3       2
Ω2     9    18      14      3
Ω3     4    10      7       4
Ω4     5    18      14      4
Ω5     4    14      9       4
Ω6     3    5       6       4
Ω7     2    4       4       5
Ω8     2    4       4       6
Ω9     2    5       5       4
Ω10    5    10      8       4

5.2 Results and Discussion

The prediction results for each reconstructed wave are presented in Tab. 5.3 and in Fig. 5.10. The predictions obtained using the SSA decomposition ensemble of learning machines (i.e., the sum of the predictions of the 10 waves) at 1 day ahead are very satisfactory: for the resulting MS prediction the Root Mean Square (RMS) error on the test set is 0.95 mm of rain, while the Maximum Absolute (MAXA) error is 6.47 mm, i.e., the predicted signal is substantially coincident with the measured MS rainfall intensity signal. As shown in Figs. 5.11, 5.12, and 5.13, the predicted MS rainfall intensity signal closely follows the measured one¹.

It is worth noting that the design of the ensemble learning machine is critical. Choosing a window M = 182 for the SSA, the best prediction results were obtained using MLPs with four or five inputs and two hidden layers. Using MLP predictors with four inputs we obtained slightly worse results: in this case the RMS for the MS is 1.05 mm and the MAXA is 8.05 mm. We notice that the Maximum Absolute error occurs on the same day (11/05/1967) as for the architecture using MLPs with five inputs.

A different window for the SSA can give results of inferior quality. E.g., using M = 256 as the window for the SSA we obtained good prediction performances only for the waves Ω1, ..., Ω6, corresponding to 60% of the explained variance (first 76 RCs). The resulting generalization of the SSA decomposition ensemble was poor, even when leaving out from Eq. 4.10 the predictions of Ω7, ..., Ω10, which, if included in the sum, make the overall prediction worse.

¹ Note that in the comparison shown in Fig. 5.11 the predicted signal is clamped to zero.


Table 5.3: Size of the hidden layers (L1 and L2), Root Mean Square (RMS) error and Maximum Absolute (MAXA) error for each reconstructed wave. Size of the MLP input layer = 5.

RW     L1   L2   RMS   MAXA
Ω1     6    4    .02   .05
Ω2     8    5    .03   .12
Ω3     6    4    .04   .15
Ω4     8    4    .04   .11
Ω5     8    5    .06   .14
Ω6     8    4    .15   .40
Ω7     4    4    .15   .38
Ω8     6    4    .64   1.92
Ω9     3    4    .75   2.40
Ω10    3    4    .29   .90

We underline that the dimension of the optimal input layer (i.e., 5) is smaller than the d_E evaluated with the FNN method (Tab. 5.2). This choice is supported by the generalization trade-off between the complexity of the learning machine and the limited size of the training set (see the discussion in Ch. 3.3.1). Concerning the time lag between inputs, we investigated different values, as the first minimum of the mutual information is only a prescription and not a theoretical result (see [26]). The plateau in the FNN plot of Fig. 5.5 is a symptom of the presence of high-dimensional noise [1]. After the SSA decomposition we can notice that the noise is concentrated mainly in RW10 and also in RW3, RW5, and RW9, as shown by the plateaus in their FNN plots (Fig. 5.9).


Figure 5.7: Reconstructed Waves. Period 07/01/1966 - 12/30/1966.

Figure 5.8: Reconstructed waves - Mutual Information.

Figure 5.9: Reconstructed Waves - Global False Nearest Neighbors using T = 7.


Figure 5.10: Reconstructed Waves - Scatter plots on the test set (using MLPs with 5 inputs).

Figure 5.11: Mean Station: 1 day ahead forecasting in the period 07/01/66 - 12/30/66 using the ensemble of 10 MLPs with 5 inputs.


Figure 5.12: Mean Station: 1 day ahead forecasting. Errors in the period 07/01/66 - 12/30/66 using the ensemble of 10 MLPs with 5 inputs.



Figure 5.13: Mean Station: scatter plot of the 1 day ahead forecasting on the test set using the ensemble of 10 MLPs with 5 inputs.


Chapter 6

Conclusions

In this paper we presented an extension of a methodology for signal forecasting [1, 26, 22] to the case of discontinuous and intermittent signals. As in [22], we used predictors based on Multi-Layer Perceptrons or Neuro-Fuzzy Systems characterized by the universal function approximation property. The input layer of those predictors is shaped using results and suggestions from the theory of dynamical systems linked to the Takens-Mañé theorem [27, 17] about the sufficient dimension of the reconstruction vector for the dynamics of an attractor.

In order to avoid the effect of the discontinuities, in this paper we have proposed an Ensemble Method based on Singular Spectrum Analysis Decomposition [15, 23, 31, 16], based on the following design steps:

- Unsupervised decomposition: using Singular Spectrum Analysis, decompose the original signal s into reconstructed waves (RWs) corresponding to subspaces with equal explained variance;

- Supervised learning: prepare a predictor for each RW using the methodology described in Ch. 3;

- Operational phase: the prediction of the original signal s is then obtained as the sum of the predictions of the individual RWs, i.e., using Eq. 4.10.

The presented methodology has been successfully applied to the forecasting of rainfall intensity series collected by 135 stations distributed in the Tiber river basin over a period of 10 years. Moreover, preliminary results for the forecasting of the rainfall series of an individual station are also in good agreement with the data [20].


Acknowledgments

This work was supported by IRSA-CNR, Progetto Finalizzato Madess II CNR, INFM, and Università di Genova. We thank Fabio Montarsolo and Daniela Baratta for their programming support.


Bibliography

[1] H.D.I. Abarbanel. Analysis of Observed Chaotic Data. Springer, New York, USA, 1996.
[2] J.P. Burg. Maximum entropy spectral analysis. In D.G. Childers, editor, Modern Spectrum Analysis, page 34. IEEE Press, New York, 1978.
[3] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2:303-314, 1989.
[4] M.E. Dave. Reconstruction of attractors from filtered time series. Physica D, 101:195-206, 1997.
[5] A. Fraser. Information theory and strange attractors. PhD thesis, University of Texas, Austin, 1989.
[6] A. Fraser and L. Swinney. Independent coordinates for strange attractors from mutual information. Physical Review, 33:1134-1140, 1986.
[7] M. Ghil. The SSA-MTM toolkit: Applications to analysis and prediction of time series. In B. Bosacchi, J.C. Bezdek, and D.B. Fogel, editors, Application of Soft Computing, volume 3165 of Proceedings of SPIE, pages 216-230, Bellingham, WA, 1997.
[8] M. Ghil and R. Vautard. Rapid disintegration of the Wordie Ice Shelf in response to atmospheric warming. Nature, 350:324, 1991.
[9] S. Haykin. Neural Networks: A Comprehensive Foundation (Second Edition). Prentice Hall, Upper Saddle River, NJ, 1999.
[10] S. Haykin and J. Principe. Making sense of a complex world: Using neural networks to dynamically model chaotic events such as sea clutter. IEEE Signal Processing Magazine, 15, 1998.
[11] J.J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, USA, 79:2554-2558, 1982.
[12] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359-366, 1989.
[13] C.L. Keppenne and M. Ghil. Adaptive filtering and prediction of noisy multivariate signals: an application to subannual variability in atmospheric angular momentum. International Journal of Bifurcation and Chaos, 3:625-634, 1993.
[14] T. Kohonen. Self-Organization and Associative Memory. Springer, Berlin, third edition, 1989.
[15] R. Kumaresan and D.W. Tufts. Data-adaptive principal component signal processing. In IEEE Proc. Conf. on Decision and Control, page 949, Albuquerque, USA, 1980. IEEE.
[16] F. Lisi, O. Nicolis, and M. Sandri. Combining Singular-Spectrum Analysis and neural networks for time series forecasting. Neural Processing Letters, 2:6-10, 1995.
[17] R. Mañé. On the dimension of the compact invariant sets of certain non-linear maps. In D.A. Rand and L.-S. Young, editors, Dynamical Systems and Turbulence, Warwick 1980, volume 898 of Lecture Notes in Mathematics, pages 230-242. Springer-Verlag, Berlin, 1981.
[18] F. Masulli. Bayesian classification by feedforward connectionist systems. In F. Masulli, P.G. Morasso, and A. Schenone, editors, Neural Networks in Biomedicine - Proceedings of the Advanced School of the Italian Biomedical Physics Association, Como (Italy), 1993, pages 145-162. World Scientific, Singapore, 1994.
[19] F. Masulli, R. Parenti, and L. Studer. Neural modeling of non-linear processes: Relevance of the Takens-Mañé theorem. International Journal on Chaos Theory and Applications, 4:59-74, 1999.
[20] F. Montarsolo. A toolkit for discontinuous series forecasting. Laurea thesis in Computer Science (in Italian), DISI - Department of Computer and Information Sciences, University of Genova, Genova, Italy, 1998.
[21] S.A. Nene and S.K. Nayar. A simple algorithm for nearest neighbor search in high dimensions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 1997.
[22] R. Parenti, F. Masulli, and L. Studer. Control of non-linear process by neural networks: Benefits using the Takens-Mañé theorem. In Proceedings of the ICSC Symposium on Intelligent Industrial Automation, IIA'97, pages 44-50, Millet, Canada, 1997. ICSC.
[23] E.R. Pike, J.G. McWhirter, M. Bertero, and C. de Mol. Generalized information theory for inverse problems in signal processing. IEE Proceedings, 59:660-667, 1984.
[24] D.W. Ruck, S.K. Rogers, M. Kabrisky, M.E. Oxley, and B.W. Suter. The multilayer perceptron as an approximation to a Bayes optimal discriminant function. IEEE Transactions on Neural Networks, 1:296-298, 1990.
[25] D.E. Rumelhart, G.E. Hinton, and R.J. Williams. Learning internal representations by error propagation. In D.E. Rumelhart and J.L. McClelland, editors, Parallel Distributed Processing, volume 1, chapter 8, pages 318-362. MIT Press, Cambridge, 1986.
[26] L. Studer and F. Masulli. Building a neuro-fuzzy system to efficiently forecast chaotic time series. Nuclear Instruments and Methods in Physics Research, Section A, 389:264-667, 1997.
[27] F. Takens. Detecting strange attractors in turbulence. In D.A. Rand and L.-S. Young, editors, Dynamical Systems and Turbulence, Warwick 1980, volume 898 of Lecture Notes in Mathematics, pages 366-381. Springer-Verlag, Berlin, 1981.
[28] C.W. Therrien. Decision, Estimation, and Classification: An Introduction to Pattern Recognition and Related Topics. Wiley, New York, 1989.
[29] J. Vastano and L. Rahman. Information transport in spatio-temporal chaos. Physical Review Letters, 72:241-275, 1989.
[30] R. Vautard and M. Ghil. Singular-spectrum analysis in nonlinear dynamics, with applications to paleoclimatic time series. Physica D, 35:395-424, 1989.
[31] R. Vautard, P. Yiou, and M. Ghil. Singular-spectrum analysis: A toolkit for short, noisy chaotic signals. Physica D, 58:95-126, 1992.
[32] T.P. Vogl, J.K. Mangis, A.K. Rigler, W.T. Zink, and D.L. Alkon. Accelerating the convergence of the back-propagation method. Biological Cybernetics, 59:257-263, 1988.
[33] L. Wang and J.M. Mendel. Fuzzy basis functions, universal approximation, and orthogonal least-squares learning. IEEE Trans. on Neural Networks, 5:807-814, 1992.
