Missing Data Interpolation By Using Local-Global Neural Networks

Mayte Fariñas & Carlos E. Pedreira
{pedreira or mayte}@ele.puc-rio.br
Catholic University, PUC-RIO
C.P. 38063; CEP 22452-970, Rio de Janeiro, Brazil

Abstract: In this paper a new connectionist model is applied to missing data interpolation. The proposed architecture is trained by a scheme based on a partition of the function domain, approximating the generator function by a set of very simple supporting functions. The method showed a very interesting interpolation ability. Both controlled numerical experiments and a real-data missing-data application to an electricity load series are presented.

1. Introduction

Missing value filling is a very important problem. Its solution, which is sometimes quite hard to obtain appropriately, can be viewed as a special kind of interpolation. In this kind of problem, the main goal is to emulate a function in a sector of the domain where only a fraction of the points are known. A new algorithm to reconstruct a generator function, based on local estimates (Pedreira et al [1][2]), is applied here. Prediction, i.e. estimation outside the pre-established domain, is not a goal of this paper.

The proposed architecture was first introduced in (Pedreira et al [1]), following some ideas originally proposed in (Pedroza & Pedreira [2]). It is trained by a scheme based on a partition of the function domain. The main idea is to approximate the original function by a set of very simple supporting functions. Although there are no theoretical limitations on the complexity of these functions, the supporting functions are in general linear. The input-output mapping is expressed by a piecewise structure: the network output is a combination of several pairs, each composed of a supporting function and a membership function. The membership functions define the role of the associated supporting function in each subset of the domain, and partial superposition of membership functions is allowed. In this way, the function approximation problem is approached through the specialization of neurons in each sector of the domain. In other words, the neurons are formed by pairs of membership and supporting functions that emulate the generator function in different parts of the domain, and the level of specialization in a given sector is proportional to the value of the membership function. It is well known that single-hidden-layer neural networks are able to universally approximate arbitrary continuous functions (Haykin [3]); a similar result for the proposed methodology is presented in section 3. Some contributions on function approximation in noisy environments with neural networks can be found in the literature. Carozza & Rampone [4] and Francis et al [5] propose a radial basis function (RBF) architecture, arguing that it produces more consistent results concerning generalization when compared to the classical multilayer perceptrons used in (Lawrence et al [6]).

Concerning the occurrence of missing observations, they constitute a quite common problem in real applications of time series, and their repercussion on modeling has been extensively discussed in the literature (e.g. Parzen [7] and Little & Rubin [8]). Several methodologies for missing value filling have been reported: Brubacher & Tunnicliffe Wilson [9], Ferreiro [10], Harvey & Pierse [11] and Ljung [12] use parametric alternatives. These approaches are based on fitting statistical models to the data and then interpolating, and in some cases of interest the outcome is not satisfactory, as will be discussed later on. Cubic splines have also been tried (see Ferreiro [10] and Koopman et al [13]), but the results are far from satisfactory in cases with a large number of consecutive misses.

2. The proposed architecture: Local-Global Neural Networks

Let us consider a network with m nodes or neurons. Let {x_i}_{i=1}^{n} be the subset of the available data that is used for training. For algebraic and notational simplicity we restrict ourselves to the case x ∈ ℜ (the subscript of x is omitted); the generalization to x ∈ ℜ^n is straightforward. Let us define, for each point x, m membership functions:

    B_j(x) \equiv -C_j \left[ \frac{1}{1+\exp\big(d_j(x-h_j^{(1)})\big)} - \frac{1}{1+\exp\big(d_j(x-h_j^{(2)})\big)} \right], \qquad j = 1, \dots, m \qquad (1)

where C_j, d_j, h_j^(1) and h_j^(2) are parameters to be adjusted. Note that the parameter C_j reflects the level of the membership function, while d_j is related to its slope. The parameters h_j^(1) and h_j^(2) delimit the sector of the domain where the associated supporting function is most active (see figure 1).
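For illustration (this short derivation is ours and follows directly from (1)): for a steep slope d_j, the membership function approaches a plateau of height C_j on [h_j^(1), h_j^(2)],

    \lim_{d_j \to \infty} B_j(x) =
    \begin{cases}
      -C_j\,(0-1) = C_j, & h_j^{(1)} < x < h_j^{(2)},\\
      -C_j\,(1-1) = 0,   & x < h_j^{(1)},\\
      -C_j\,(0-0) = 0,   & x > h_j^{(2)},
    \end{cases}

which is why C_j is read as the level of the membership function and h_j^(1), h_j^(2) as the limits of the sector where the associated supporting function acts.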


Figure 1. Example of membership functions: C = 1, d = 6, h^(1) = -2, h^(2) = 2; and C = 1.2, d = 6, h^(1) = 0, h^(2) = 4
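As an illustration (not part of the original paper), the short Python sketch below evaluates and plots the two membership functions of figure 1; the helper name lgn_membership and the plotting grid are our own choices.

    import numpy as np
    import matplotlib.pyplot as plt

    def lgn_membership(x, C, d, h1, h2):
        """Membership function B(x) of eq. (1): a smooth bump of height ~C on [h1, h2]."""
        return -C * (1.0 / (1.0 + np.exp(d * (x - h1)))
                     - 1.0 / (1.0 + np.exp(d * (x - h2))))

    x = np.linspace(-3.0, 5.0, 400)
    plt.plot(x, lgn_membership(x, C=1.0, d=6.0, h1=-2.0, h2=2.0), label="C=1, h=(-2, 2)")
    plt.plot(x, lgn_membership(x, C=1.2, d=6.0, h1=0.0, h2=4.0), label="C=1.2, h=(0, 4)")
    plt.xlabel("x"); plt.ylabel("B(x)"); plt.legend()
    plt.show()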

The supporting functions are typically linear or quadratic. Although more complex functions may be used, it seems that this additional complexity does not bring a corresponding refinement of the model. Let us consider linear supporting functions:

    \kappa_j(x) = a_j x + b_j, \qquad j = 1, \dots, m

where a_j and b_j are the parameters to be estimated. Each node or neuron is constituted by a pair {membership function; supporting function} (see figure 2). Thus, for each node one needs to estimate 6 parameters (7 in the quadratic case). As usual, the model complexity may be inferred from the number of nodes. The input is connected to the nodes, and each node outputs the product of its membership and supporting functions, B_j(x) κ_j(x). Note that no weights link the node outputs to the network output; they are simply summed (see figure 2). The output of the j-th neuron is therefore B_j(x) κ_j(x), and the network output is given by

    g_m(x) = \sum_{j=1}^{m} B_j(x)\,\kappa_j(x) \qquad (2)


Figure 2 - The proposed architecture

The central goal is to design a network that approximates a target function f(x) as well as possible. With this objective in mind, we define an error function as a convex combination of two quadratic error measures, E_1 and E_2:

    E \equiv \alpha \sum_{i=1}^{k} E_1^2(x_i) + (1-\alpha)\,E_2^2 \qquad (3)

where

    E_1(x_i) \equiv g_m(x_i) - y(x_i) \qquad \text{and} \qquad E_2 \equiv 1 - \sum_{j=1}^{m} C_j \qquad (4)

The term E_1 is associated with the quality of the obtained approximation, while E_2 is meant to keep the membership functions bounded: it penalizes solutions in which the sum of the parameters C_j exceeds 1. The unit-valued limit is not mandatory, although it may confer interpretability to the results. If one defines, for each neuron, a vector of parameters ℑ_j ≡ (C_j, d_j, h_j^(1), h_j^(2), a_j, b_j), the main goal is to find the ℑ_j that minimize the error function E.
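To make the model and the training objective concrete, here is a minimal Python/NumPy sketch of the network output (2) and of the error (3); the parameter layout, the finite-difference gradients and the plain gradient-descent loop are illustrative assumptions of this sketch, not the optimization scheme actually used by the authors.

    import numpy as np

    def lgn_output(x, params):
        """g_m(x) = sum_j B_j(x) * kappa_j(x), eq. (2), with linear supporting functions.
        params: array of shape (m, 6), one row (C, d, h1, h2, a, b) per neuron."""
        x = np.asarray(x, dtype=float)[:, None]            # shape (n, 1)
        C, d, h1, h2, a, b = params.T                      # each of shape (m,)
        B = -C * (1.0 / (1.0 + np.exp(d * (x - h1)))
                  - 1.0 / (1.0 + np.exp(d * (x - h2))))    # membership functions, eq. (1)
        kappa = a * x + b                                  # linear supporting functions
        return (B * kappa).sum(axis=1)

    def lgn_error(params, x, y, alpha=0.9):
        """Convex combination of the fit error and the penalty on sum_j C_j, eqs. (3)-(4)."""
        e1 = lgn_output(x, params) - y
        e2 = 1.0 - params[:, 0].sum()
        return alpha * np.sum(e1 ** 2) + (1.0 - alpha) * e2 ** 2

    def fit_lgn(x, y, params0, lr=1e-3, epochs=2000, eps=1e-6):
        """Crude finite-difference gradient descent on E (for illustration only)."""
        p = params0.astype(float).copy()
        for _ in range(epochs):
            grad = np.zeros_like(p)
            for idx in np.ndindex(p.shape):
                dp = p.copy(); dp[idx] += eps
                grad[idx] = (lgn_error(dp, x, y) - lgn_error(p, x, y)) / eps
            p -= lr * grad
        return p

A typical call would set params0 with one row per neuron, for instance produced by the initialization heuristic of section 4.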

3. Theoretical results

Theorem T-1 gives theoretical consistency to the proposed methodology: it shows that any L2-integrable function may be approximated by functions g_m(x). The following auxiliary results are needed to prove the theorem.

Auxiliary result AR-1: The simple functions

    S(x) = \sum_{i=1}^{m} \alpha_i \, X_{A_i}(x),

with constants α_i ∈ ℜ and where X_{A_i}(x) is the indicator function of the set A_i, are dense in L2.

Auxiliary result AR-2: There exists a sequence of functions g_n(x) ∈ {g_m(x)} such that g_n(x) → S(x) (L2 convergence), for every simple function S(x).

Proof: Let S(x) = X_{[0,1]}. Let us consider κ_n(x) = 1; then g_n(x) = B_n(x) with h^(1) = 0 and h^(2) = 1,

    B_n(x) = -C_n \left[ \frac{1}{1+e^{d_n x}} - \frac{1}{1+e^{d_n (x-1)}} \right],

where C_n = (e^{d_n/2} + 1)(e^{d_n/2} - 1)^{-1} and d_n → ∞. It is not difficult to show that B_n(x) → S(x) pointwise, and in L2 by using the Lebesgue dominated convergence theorem (De Barra [14]). Since pointwise convergence is already established, it is enough to prove that B_n(x) is dominated by an L2-integrable function; it is easy to verify that B_n(x) is bounded, for all x and all n, by an L2-integrable function g(x) that is constant on [0, 1] and decays exponentially outside this interval, as we wanted to prove. It remains to generalize from the interval [0, 1] to an arbitrary one. For that, one has to find a sequence B_n(x) that converges to functions of the type X_A, A = [δ_1, δ_2]; with this goal in mind, one may consider functions B_n(x) with h^(1) = δ_1 and h^(2) = δ_2. The extension of this result to any simple function S(x) follows from the fact that L2 convergence is preserved under addition and multiplication by constants.

Since simple functions are expressed as finite linear combinations of the indicator functions X_{A_i}, one obtains a sequence of functions of the type g_n(x) that converges to S(x) in L2. Note that in choosing g_n(x) we took κ_n(x) = 1, so the result holds even in this case; the choice of the linear form κ(x) = ax + b (subscripts omitted) has the purpose of improving the approximation for a limited number of neurons and of accelerating convergence.

Theorem T-1: Let

    g_m(x) = \sum_{j=1}^{m} B_j(x)\,\kappa_j(x),

where κ_j(x) = a_j x + b_j and

    B_j(x) \equiv -C_j \left[ \frac{1}{1+\exp\big(d_j(x-h_j^{(1)})\big)} - \frac{1}{1+\exp\big(d_j(x-h_j^{(2)})\big)} \right], \qquad j = 1, \dots, m.

Then any L2-integrable function may be approximated by functions of the form g_m(x).

Proof: One wants to prove that the set of functions {g_m(x)}, equipped with the L2 norm, is dense in L2. It must be shown that, for any function f(x) in L2, there exists a sequence of functions g_n(x) ∈ {g_m(x)} that converges to f(x) in L2. From AR-1 we have that the simple functions form a dense set in L2, so there exists a sequence of simple functions S_k(x) that converges to f(x) in L2. From AR-2, each simple function may be approximated by functions of the type g_m(x): for each simple function S_k(x) there exists a sequence g_{kn}(x) that converges to S_k(x) in L2,

    g_{11}(x)  g_{12}(x)  ...  g_{1n}(x)  → S_1(x)
    g_{21}(x)  g_{22}(x)  ...  g_{2n}(x)  → S_2(x)
       :          :                :
    g_{k1}(x)  g_{k2}(x)  ...  g_{kn}(x)  → S_k(x)

To construct the approximating sequence we use the Cantor diagonal argument (Bartle [15]): letting g_n(x) = g_{nn}(x), one obtains g_n(x) → f(x) in L2.
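As an informal numerical check of AR-2 (ours, not part of the paper), the sketch below estimates the L2 distance between B_n and the indicator of [0, 1] on a fixed grid for increasing d_n; the grid and the chosen values of d_n are arbitrary.

    import numpy as np

    def B_n(x, d):
        """B_n(x) of AR-2, with h1 = 0, h2 = 1 and C_n = (e^(d/2) + 1)/(e^(d/2) - 1)."""
        C = (np.exp(d / 2.0) + 1.0) / (np.exp(d / 2.0) - 1.0)
        return -C * (1.0 / (1.0 + np.exp(d * x)) - 1.0 / (1.0 + np.exp(d * (x - 1.0))))

    x = np.linspace(-5.0, 6.0, 20001)
    indicator = ((x >= 0.0) & (x <= 1.0)).astype(float)
    for d in (5.0, 20.0, 80.0):
        l2 = np.sqrt(np.trapz((B_n(x, d) - indicator) ** 2, x))
        print(f"d_n = {d:5.1f}   L2 distance to X_[0,1] ~ {l2:.4f}")   # decreases towards 0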

4. On the initial choice of parameters

The relationship between the network input and output is learned by estimating the parameters that define the membership and supporting functions. The membership functions may overlap in part of the domain, allowing a given point to be estimated by a balanced combination of more than one supporting function. The initial choice of the parameters h_1^(1) and h_m^(2) may reflect a priori knowledge of the function domain. With the goal of accelerating convergence, one may use the following initialization heuristic. The central idea is to divide the domain into intervals where the function is approximately monotonic. The starting point is to fit to the data a polynomial of degree equal to the number of supporting functions one decided to use. By calculating the maxima and minima of this polynomial, one determines the regions of the domain where the function remains approximately monotonic. To define the values of a and b associated with a linear supporting function, one adjusts, for each interval, a straight line by linear regression.
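One possible reading of this heuristic is sketched below in Python; the use of numpy.polyfit, the handling of intervals with too few points, and the initial values of C and d are our own assumptions, since the paper does not specify them.

    import numpy as np

    def init_lgn_params(x, y, m, d0=6.0):
        """Heuristic initialization: fit a degree-m polynomial, split the domain at its
        interior extrema, and fit one straight line per (roughly monotonic) interval."""
        coeffs = np.polyfit(x, y, deg=m)
        roots = np.roots(np.polyder(coeffs))                   # extrema of the polynomial
        roots = np.sort(roots[np.isreal(roots)].real)
        roots = roots[(roots > x.min()) & (roots < x.max())]
        edges = np.concatenate(([x.min()], roots, [x.max()]))  # interval boundaries

        params = []
        for h1, h2 in zip(edges[:-1], edges[1:]):
            mask = (x >= h1) & (x <= h2)
            if mask.sum() >= 2:
                a, b = np.polyfit(x[mask], y[mask], deg=1)     # per-interval linear regression
            else:
                a, b = 0.0, float(np.mean(y))
            C0 = 1.0 / (len(edges) - 1)                        # so that sum_j C_j is about 1
            params.append([C0, d0, h1, h2, a, b])
        return np.array(params)                                # rows: (C, d, h1, h2, a, b)

Since a degree-m polynomial has at most m - 1 interior extrema, at most m intervals (and hence at most m neurons) are produced, matching the number of supporting functions one decided to use.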

5. Numerical results

In the first numerical experiment (see figure 3), 100 points were generated by using the function f(x) = sin(x) + 2 + noise. (We added +2 to the sine in order to avoid MAPE instability for values around the origin.) The noisy signal was obtained by adding a Gaussian signal with zero mean and three different standard deviations: 0.1, 0.4 and 0.7. Table 1 summarizes the results of this experiment. We use the MAPE (Mean Absolute Percentage Error) as error measure:

    MAPE = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{g(x_i) - y(x_i)}{y(x_i)} \right| \times 100

In all simulations, 3 pairs of membership-supporting functions were used. The initialization heuristic described in section 4 was used in experiments 1, 2 and 3, but not in the fourth. The fifth column of table 1 refers to what we call the "denoised MAPE": the relative difference between the training MAPE and the noise MAPE, i.e. |training MAPE - noise MAPE| / training MAPE. The goal is to discount from the training MAPE the error introduced by the noise added to the data.

    Noise level   Number of epochs   Noise MAPE   Training MAPE   Denoised MAPE   Generalization MAPE
    0             111                0            0.14            1               0.157
    0.1           137                4.79         4.64            0.03            1.04
    0.4           78                 16.13        16.80           0.04            4.91
    0.7           37                 30.88        52.36           0.410           7.87

    Table 1 - Numerical results for different noise levels
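For reference, the two error measures of this section can be computed as follows (a short sketch; the helper names are ours and the denoised-MAPE expression follows the reading given above).

    import numpy as np

    def mape(g, y):
        """Mean Absolute Percentage Error between estimates g and targets y."""
        g, y = np.asarray(g, dtype=float), np.asarray(y, dtype=float)
        return 100.0 * np.mean(np.abs((g - y) / y))

    def denoised_mape(training_mape, noise_mape):
        """Relative difference between training MAPE and noise MAPE (column 5 of table 1)."""
        return abs(training_mape - noise_mape) / training_mape

    # e.g. the 0.4-noise row of table 1: denoised_mape(16.80, 16.13) is about 0.04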

[Figure 3: in-sample and out-of-sample tests, showing the sample points, sin(x)+2 and the obtained approximation.]
Figure 3a - noise level = 0.1
Figure 3b - noise level = 0.4

We used the same data as in experiment 4, but deliberately initialized the algorithm with a 'bad' initial condition. After 355 iterations we obtained the following errors: in-sample MAPE = 17.12 and out-of-sample MAPE = 3.20. The convergence may be observed in figure 4.

Figure 4. Convergence evolution without the use of the initialization heuristic

5.1 A comparison with other function approximation methods

We benchmarked the proposed methodology, the Local-Global network (LGN), against methods based on radial basis functions (Carozza & Rampone [4]) and multilayer perceptrons (Lawrence et al [6]). In the first three experiments noiseless data is used; in the remaining ones, noise is added and the generalization capacity is tested. Table 2 shows results for f(x) = sin(x) + sin(3x) + sin(6x) in the interval [-0.35, 3.50], table 3 presents results for f(x) = 0.5exp(-x) + sin(6x) in [-1, 1], and table 4 refers to f(x) = x in [0, 1]; these functions are the same ones simulated in (Carozza & Rampone [4]). Each table shows the mean squared error for both training and generalization. In each experiment, the training and generalization sets both consist of 100 random samples drawn from a uniform distribution over the considered interval. In order to compare with the results obtained in Carozza & Rampone [4] and Lawrence et al [6] we use, in tables 2, 3 and 4, the MSE (Mean Squared Error)

    MSE = \frac{1}{n} \sum_{i=1}^{n} \big( g(x_i) - y(x_i) \big)^2

as error measure.

                      Training        Generalization
    LGN (18)          0.0058          0.0059
    LGN (36)          0.0044          0.0049
    RBF (3), T=0.01   0.0099          0.0094
    RBF (8), T=0.001  0.001           0.0016

    Table 2. MSE for f(x) = sin(x) + sin(3x) + sin(6x) in [-0.35, 3.50]

                      Training        Generalization
    LGN (24)          0.0008          0.0012
    RBF (3), T=0.01   0.0090          0.0093
    RBF (6), T=0.001  0.0009          0.0013

    Table 3. MSE for f(x) = 0.5exp(-x) + sin(6x) in [-1, 1]

                      Training        Generalization
    LGN               7.76 x 10^-5    8.17 x 10^-5
    RBF, T=0.01       0.0022          0.0026
    RBF, T=0.001      0.0008          0.0010

    Table 4. MSE for f(x) = x in [0, 1]

In order to compare the generalization capacity we used the function f(x) = sin(x/3), adding uniform random noise in (-0.25, 0.25). We first considered the function in [0, 20], using the 21 integers in this interval for training; for generalization purposes we sampled x = 0:0.1:20. The same strategy was applied to [0, 5]. The results are presented in tables 5 and 6.

                      Training        Generalization
    LGN               0.010           0.0046
    RBF, T=0.1        0.014           0.0099
    RBF, T=0.01       0.0019          0.0027
    MLP               0.0358          0.0343
    MLP               0.0204          0.0222
    MLP               0.0204          0.0201

    Table 5. MSE for f(x) = sin(x/3) + noise, in [0, 20]

                      Training        Generalization
    LGN               0.0061          0.0027
    RBF, T=0.01       0.0012          0.0017
    MLP               0.0347          0.0761
    MLP               7.29 x 10^-5    0.1030

    Table 6. MSE for f(x) = sin(x/3) + noise, in [0, 5]

In a noisy environment, the LGN results are comparable to those obtained by Carozza & Rampone [4] for RBF and clearly better than those obtained for MLP (Lawrence et al [6]), especially concerning generalization. We believe that if the noise level is increased, the performance difference in favor of the LGN will also increase.

6. An application for missing data

The occurrence of missing observations is quite common in practical applications involving time series. The problem of estimating parameters in the presence of missing values has been discussed extensively in the literature: Little & Rubin [8] for statistical modeling in general and Parzen [7] in the time series context. Although some methods aiming to deal with missing values have appeared in the literature, many relevant cases are poorly solved by those methods.

Parametric approaches to interpolation (see Brubacher et al [9]) have been largely applied. These methods are based on adjusting statistical models to the data and later using these models for interpolation. Ferreiro [10], Harvey & Pierse [11] and Ljung [12] used this approach with classical time series models. The main drawback of this approach is that model identification and parameter estimation may be strongly affected by the missing values (see Stoffer [16]). Some authors propose iterative algorithms for estimation with robustness of the model in mind (Pourahmadi [17]). EM algorithms are proposed in Little & Rubin [8]; in this case the procedure may become computationally costly, making it impractical for large data sets.

The cubic spline has been proposed for filling missing values in time series (see Ferreiro [10] and Koopman et al [13]) with some good results, and this methodology is used in some commercial packages, e.g. SsfPack (Koopman et al [13]). Nevertheless, as expected, the spline methodology does not produce good results for series with a considerable number of consecutive missing values. This is basically because the spline generates solutions with accentuated curvatures, producing patterns quite different from the original data, as will be shown next. When one is dealing with high-frequency data, e.g. half-hour or finer electricity load data, a large number of consecutive misses is a very common occurrence; in those cases, classical interpolation methods tend to the conditional mean, producing quite poor results.

In this section we present an application of the proposed methodology to the missing data problem. Real electricity load data was used: the series consists of minute-by-minute measurements of electricity load in Brazil for 1st July 1999. Note that missing data are quite frequent in minute-by-minute electricity load measurements.

Figure 4.1 - Electricity load, minute-by-minute data, 1st July 1999

In order to simulate a missing data problem, we withdrew a fixed quantity of points from the original series; the series is then recomposed by using the proposed algorithm. The points were withdrawn randomly in different percentages: 5%, 10%, 20%, 30% and 40%. The results can be found in table 7.

    Withdrawn points (%)   Number of epochs   Training MAPE   Generalization MAPE
    5%                     75                 0.99            1.21
    10%                    333                1.01            1.15
    20%                    106                1.06            1.116
    30%                    379                1.00            1.119
    40%                    105                1.07            1.117

    Table 7 - Missing data simulation - real electricity load data
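A sketch of this simulation protocol is given below, under the assumption that the series is recomposed by fitting the network of section 2 to the retained points; it reuses the illustrative helpers lgn_output, fit_lgn, init_lgn_params and mape sketched in the previous sections, and is not the authors' code.

    import numpy as np

    def missing_data_experiment(t, load, withdraw_frac, m=3, seed=0):
        """Withdraw a random fraction of points, refit on the remaining ones, and report
        the MAPE on the retained (training) and withdrawn (generalization) points."""
        rng = np.random.default_rng(seed)
        missing = rng.random(len(t)) < withdraw_frac
        t_train, y_train = t[~missing], load[~missing]

        params = fit_lgn(t_train, y_train, init_lgn_params(t_train, y_train, m))
        train_mape = mape(lgn_output(t_train, params), y_train)
        gen_mape = mape(lgn_output(t[missing], params), load[missing])
        return train_mape, gen_mape

    # e.g. the 20% row of table 7 would correspond to
    # missing_data_experiment(minutes, load_series, 0.20)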

We compared our method with a smoothing cubic spline. One can notice, by observing figure 5, that the spline algorithm does not produce good results when there is a considerable number of consecutive missing values.

Figure 5. Proposed algorithm versus Cubic spline

7. Final remarks

In this paper a new connectionist architecture was applied to interpolation problems, with missing values in mind. The method has an interesting property concerning interpolation: an intrinsic ability to produce regular solutions. A synthetic experiment and a real-data missing-data application were presented.

The simulated results in noisy environments were particularly encouraging, producing very good generalization. Although a considerable dose of noise was introduced in some experiments, the method performed in a robust way, producing good results, especially for generalization. The proposed architecture also opens the door to a better interpretation of the results, since the location of the membership functions may indicate a change in the model: changes in the structure of the data-generating function are expected to be reflected in changes in the location and level of the membership functions. From the real-data point of view, the results for missing data were quite encouraging, demonstrating the method's capability to produce regular and consistent solutions.

References:

[1] Pedreira, C.E., Pedroza, L.C. and Fariñas, M. (2001) "Local-Global Neural Networks for Interpolation", Proceedings of ICANNGA 2001, Prague, April 2001.
[2] Pedroza, L.C. and Pedreira, C.E. (1999) "Multilayer Neural Networks and Function Reconstruction by Using a priori Knowledge", International Journal of Neural Systems, Vol. 9, No. 3, pp. 251-256.
[3] Haykin, S. (1999) Neural Networks - A Comprehensive Foundation, Prentice Hall, second edition.
[4] Carozza, M. and Rampone, S. (1999) "An Incremental Multivariate Regression Method for Function Approximation from Noisy Data", Pattern Recognition, Vol. 32, No. 11.
[5] Francis, N.M., Brown, A.G., Cannon, P.S. and Broomhead, D.S. (2000) "Nonlinear Prediction of the Hourly foF2 Time Series in Conjunction with the Interpolation of Missing Data Points", Phys. Chem. Earth, Vol. 25, No. 4, pp. 261-265.
[6] Lawrence, S., Giles, C.L. and Tsoi, A.C. (1996) "What Size Neural Network Gives Optimal Generalization? Convergence Properties of Backpropagation", University of Maryland Technical Report UMIACS-TR-96-22 and CS-TR-3617.
[7] Parzen, E. (ed.) (1983) Proceedings of Time Series Analysis of Irregularly Observed Data, Lecture Notes in Statistics 25, New York: Springer Verlag.
[8] Little, R. and Rubin, D. (1987) Statistical Analysis with Missing Data, New York: Wiley.
[9] Brubacher, S.R. and Tunnicliffe Wilson, G. (1976) "Interpolating Time Series with Application to the Estimation of Holiday Effects on Electricity", Applied Statistics, Vol. 25, No. 2, pp. 107-117.
[10] Ferreiro, O. (1987) "Methodologies for the Estimation of Missing Observations in Time Series", Statistics and Probability Letters, No. 5, pp. 565-69.
[11] Harvey, A.C. and Pierse, R.G. (1984) "Estimating Missing Observations in Economic Time Series", Journal of the American Statistical Association, 79(385), pp. 125-131.
[12] Ljung, G.M. (1989) "A Note on the Estimation of Missing Values in Time Series", Communications in Statistics - Simulation and Computation, 18(2), pp. 459-465.
[13] Koopman, S.J., Shephard, N. and Doornik, J.A. (1998) "Statistical Algorithms for Models in State Space Using SsfPack 2.2", Econometrics Journal, Vol. 1, pp. 1-55.
[14] De Barra, G. (1974) Introduction to Measure Theory, New York: Van Nostrand Reinhold.
[15] Bartle, R.G. (1966) The Elements of Integration, New York: Wiley.
[16] Stoffer, D.S. (1986) "Estimation and Identification of Space-Time ARMAX Models in the Presence of Missing Data", Journal of the American Statistical Association, Vol. 81, No. 395.
[17] Pourahmadi, M. (1988) "Estimation and Interpolation of Missing Values of a Stationary Time Series", Journal of Time Series Analysis, Vol. 10, No. 2.

Acknowledgments

The authors wish to acknowledge the careful revision and the suggestions kindly provided by Prof. Yaser Abu-Mustafa.
