Forecasting electricity demand using non-linear mixture of experts

Morgan Mangeas

Université Paris 1, research center SAMOS, 90 rue de Tolbiac, 75634 Paris Cedex 13, France
Électricité de France, Research Center, 1, avenue du Général de Gaulle, 92141 Clamart Cedex, France

1 Introduction

Linear or nonlinear global models are known to be well suited to problems with stationary dynamics. However, the assumption of stationarity is violated in many real-world phenomena. In the field of time series forecasting, an important sub-class of nonstationarity is piece-wise stationarity, where the series switches between different regimes. For example, regimes of electricity demand depend on the seasons, and regimes of financial forecasts (growth and recession) depend on the economy. Addressing these problems, we present a class of models for time series prediction that we call mixture of experts. They were introduced into the connectionist community by R. Jacobs, M. Jordan, S. Nowlan and G. Hinton [2]. In our model, we use a nonlinear gating network (a multilayer perceptron) and gated nonlinear experts (other multilayer perceptrons). The input space can be split nonlinearly through the hidden units of the gating network, and the learned

sub-processes can be nonlinear through the hidden units of the expert networks. In this paper, we discuss the assumptions gated experts make about the data-generating process and give a mathematical and probabilistic interpretation of the architecture, the cost function and the search. We also summarize and analyze the performance of the mixture of experts model for forecasting the electricity demand in France. Finally, we show how the use of statistical pruning can improve the robustness of the model. More details can be found in an article by A. Weigend et al. (1995) [8].

2 Mathematical framework

The mixture of experts model has a solid statistical basis. In the time series community, the idea of splitting an input space is not new. One of the first examples is the threshold autoregressive (TAR) model [7]. In contrast to mixture of experts, its splits are very simple and ad hoc; there is no probabilistic interpretation. In the following we will see that the mathematical framework of mixture of experts allows us more flexibility.

Let us consider an input-output model (input $x$ and output $Y$) made of a control module and $K$ sub-models called experts. The output $Y$ is a real random variable that depends on a discrete variable $I$, which switches between different states, $I \in \{1, 2, \ldots, K\}$. The distribution of $I$ depends on the input $x \in \mathbb{R}^d$, $d \ge 1$, and is given by the probabilities $P_x(I = j)$, $j = 1, 2, \ldots, K$. The control module computes these $K$ probabilities.

Then the output $Y$ can be written using the general formulation:

$$Y = f_j(x) + \varepsilon_j, \quad \text{if } I = j \qquad (1)$$

where $f_j$, a function from $\mathbb{R}^d$ to $\mathbb{R}$, characterizes the $j$th expert and where $\varepsilon_j$ is a random variable with zero mean. The term $f_j(x)$ is then the expectation of $Y$ if $I = j$. The distribution of $Y$ (for a given input $x$) is as follows:

$$P_x(Y = y) = \sum_{j=1}^{K} P_x(I = j)\, P_x(Y = y \mid I = j) \qquad (2)$$

and the expectation of $Y$:

$$E_x(Y) = \sum_{j=1}^{K} P_x(I = j)\, f_j(x). \qquad (3)$$
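To make the switching formulation concrete, here is a small illustrative Python sketch of the data-generating process behind equations (1)-(3): for a given input $x$, a hidden state $I$ is drawn according to $P_x(I = j)$, and $Y$ is produced by the selected expert plus zero-mean noise. The particular experts, gate probabilities and noise levels are toy assumptions, not taken from the paper.

```python
# Toy illustration of the switching model of equations (1)-(3).
import numpy as np

rng = np.random.default_rng(1)

# two toy experts f_1, f_2 and a toy gate giving P_x(I = 1), P_x(I = 2)
experts = [lambda x: 2.0 * x, lambda x: np.sin(x)]

def gate_probs(x):
    p1 = 1.0 / (1.0 + np.exp(-x))        # probability of regime 1
    return np.array([p1, 1.0 - p1])

def sample_y(x, noise_std=(0.1, 0.3)):
    """Draw one Y for input x following equation (1)."""
    p = gate_probs(x)
    j = rng.choice(len(experts), p=p)     # switch between regimes
    return experts[j](x) + rng.normal(0.0, noise_std[j])

def expected_y(x):
    """Expectation of Y given x, equation (3)."""
    p = gate_probs(x)
    return sum(p[j] * experts[j](x) for j in range(len(experts)))

print(sample_y(0.5), expected_y(0.5))
```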

The goal is to build a model that includes the control module and the $K$ experts. In our model, the control module is a feedforward neural network (with parameters $\theta_g$). The role of this network is to approximate the function $x \longmapsto (P_x(I = j))_{j=1,2,\ldots,K}$. Let $(g_j(x, \theta_g))_{j=1,2,\ldots,K}$ be the output vector of this network. Each expert $j$ is also implemented as a feedforward neural network, with parameters $\theta_j$, $j = 1, 2, \ldots, K$. Its output is denoted $f_j(x, \theta_j)$ and the network approximates $f_j(x)$ (defined in equation (1)). The global model can thus be summarized as:

$$Y = f_j(x, \theta_j) + \varepsilon_j, \quad \text{if } I = j \qquad (4)$$

where $f_j(x, \theta_j)$ is the output of the $j$th expert and $\varepsilon_j$ is a white noise with zero mean and variance $\sigma_j^2$. In the following we consider that the noises $(\varepsilon_j)_{j=1,2,\ldots,K}$ are Gaussian ($\varepsilon_j \sim \mathcal{N}(0, \sigma_j^2)$).

Fig. 1. Mixture of experts architecture. The input $x$ stands at the bottom. The $K$ outputs of the gating network, $(g_j(x, \theta_g))_{1 \le j \le K}$, weight the outputs of the experts; the global output is $\sum_{j=1}^{K} g_j(x, \theta_g)\, f_j(x, \theta_j)$.

The model is thus entirely specified, and the parameters $\Theta = \{\theta_g, \theta_1, \theta_2, \ldots, \theta_K, \sigma_1^2, \sigma_2^2, \ldots, \sigma_K^2\}$ can be estimated by maximizing the likelihood. In the following, in order to simplify the notation, we write $P(y \mid \cdot)$ for $P(Y = y \mid \cdot)$. The probability of having $Y = y$ given $x$ and $I = j$ (in this case the distribution of $Y$ is the one associated with expert $j$) is:

$$P_x(y \mid I = j) = \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\left(-\frac{(y - f_j(x, \theta_j))^2}{2\sigma_j^2}\right) \qquad (5)$$

and the global distribution of $Y$ (see equation (2)) is:

$$P_x(y) = \sum_{j=1}^{K} g_j(x, \theta_g)\, \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\left(-\frac{(y - f_j(x, \theta_j))^2}{2\sigma_j^2}\right). \qquad (6)$$

Then the expected value of the output $y$ given the input $x$ can be obtained:

$$\hat{y}(x) = E_x[Y] = \sum_{j=1}^{K} g_j(x, \theta_g)\, f_j(x, \theta_j). \qquad (7)$$
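As a concrete illustration of equations (4)-(7), the following minimal NumPy sketch implements the forward pass of such a mixture: a gating MLP whose softmax outputs play the role of $(g_j(x, \theta_g))_j$, $K$ expert MLPs producing $f_j(x, \theta_j)$, and the weighted sum of equation (7). Layer sizes, initializations and function names are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of the mixture-of-experts forward pass (equations (4)-(7)).
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, W1, b1, W2, b2):
    """One-hidden-layer perceptron with tanh hidden units."""
    return np.tanh(x @ W1 + b1) @ W2 + b2

def init_mlp(n_in, n_hidden, n_out):
    """Small random weights, zero biases (illustrative initialization)."""
    return (0.1 * rng.standard_normal((n_in, n_hidden)), np.zeros(n_hidden),
            0.1 * rng.standard_normal((n_hidden, n_out)), np.zeros(n_out))

def mixture_forward(x, gate_params, expert_params):
    """Return gate probabilities g (N, K), expert means f (N, K) and y_hat (N,)."""
    scores = mlp(x, *gate_params)                                # (N, K)
    g = np.exp(scores - scores.max(axis=1, keepdims=True))
    g /= g.sum(axis=1, keepdims=True)                            # softmax, rows sum to 1
    f = np.column_stack([mlp(x, *p).ravel() for p in expert_params])
    y_hat = (g * f).sum(axis=1)                                  # equation (7)
    return g, f, y_hat

# toy usage: d = 3 inputs, K = 2 experts, 4 sample inputs
d, K = 3, 2
gate_params = init_mlp(d, 10, K)
expert_params = [init_mlp(d, 5, 1) for _ in range(K)]
g, f, y_hat = mixture_forward(rng.standard_normal((4, d)), gate_params, expert_params)
```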

Assume that we have a set of $N$ pairs $(x^{(t)}, y^{(t)})_{t=1,2,\ldots,N}$ with which we associate $N$ random variables $(I^{(t)})_{t=1,2,\ldots,N}$, $I^{(t)} \in \{1, 2, \ldots, K\}$, and assume also that $I^{(t)}$ depends on $x^{(t)}$. Then we can compute the likelihood (with $\mathcal{Y} = \{(y^{(t)})_{t=1,2,\ldots,N}\}$, $\mathcal{X} = \{(x^{(t)})_{t=1,2,\ldots,N}\}$, and $L_{\mathcal{X}}(\mathcal{Y}; \Theta)$ denoted $L(\mathcal{Y}; \Theta; \mathcal{X})$):

$$L(\mathcal{Y}; \Theta; \mathcal{X}) = \prod_{t=1}^{N} P_{x^{(t)}}(y^{(t)}) = \prod_{t=1}^{N} \sum_{j=1}^{K} g_j(x^{(t)}, \theta_g)\, P_{x^{(t)}}(y^{(t)} \mid I^{(t)} = j) = \prod_{t=1}^{N} \sum_{j=1}^{K} g_j(x^{(t)}, \theta_g)\, \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\left(-\frac{(y^{(t)} - f_j(x^{(t)}, \theta_j))^2}{2\sigma_j^2}\right) \qquad (8)$$

and the associated cost function is the negative logarithm of the likelihood:

$$C(\mathcal{Y}; \Theta; \mathcal{X}) = -\ln L(\mathcal{Y}; \Theta; \mathcal{X}) = -\sum_{t=1}^{N} \ln\left[\sum_{j=1}^{K} g_j(x^{(t)}, \theta_g)\, \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\left(-\frac{(y^{(t)} - f_j(x^{(t)}, \theta_j))^2}{2\sigma_j^2}\right)\right].$$
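As a companion to equation (8), here is a hedged sketch of how the cost $C = -\ln L$ could be evaluated numerically, using a log-sum-exp over the experts for stability. Array shapes and names are assumptions: $y$ is the $(N,)$ target vector, $g$ and $f$ are the $(N, K)$ gate and expert outputs, and sigma2 holds the $K$ expert variances.

```python
# Hedged sketch of evaluating C = -ln L from equation (8).
import numpy as np
from scipy.special import logsumexp

def neg_log_likelihood(y, g, f, sigma2):
    log_norm = -0.5 * np.log(2.0 * np.pi * sigma2)                   # (K,)
    log_gauss = log_norm - (y[:, None] - f) ** 2 / (2.0 * sigma2)    # (N, K)
    log_mix = logsumexp(np.log(g) + log_gauss, axis=1)               # (N,)
    return -log_mix.sum()
```

With the outputs of the earlier forward-pass sketch, `neg_log_likelihood(y, g, f, sigma2)` gives the quantity minimized during training.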

Furthermore, the sum inside the logarithm makes the cost function significantly more complicated than in the case of a single neural network. However, we can reformulate the problem in a way that allows us to apply the Expectation-Maximization (EM) algorithm [3]. This algorithm is based on the assumption that some variables are missing. We choose to consider the set of random indicator variables associated with the random variable $I \in \{1, 2, \ldots, K\}$:

$$J_j^{(t)} = \begin{cases} 1 & \text{if } I^{(t)} = j \\ 0 & \text{otherwise} \end{cases} \qquad (9)$$

So, following equation (1), we have $\mathcal{Y}_{\text{missing}} = \{(J_j^{(t)})_{1 \le j \le K,\, 1 \le t \le N}\}$ and we can compute the complete-data likelihood:

$$L_2(\mathcal{Y}; \mathcal{Y}_{\text{missing}}; \Theta; \mathcal{X}) = \prod_{t=1}^{N} \prod_{j=1}^{K} \left[ g_j(x^{(t)}, \theta_g)\, P_{x^{(t)}}(y^{(t)} \mid I^{(t)} = j) \right]^{J_j^{(t)}} \qquad (10)$$

Now the key feature of the EM algorithm enters: it allows us to replace the missing variables $J_j^{(t)}$ by their expected values (the E step) and to minimize the new cost function, the negative logarithm of the complete-data likelihood (the M step). Summarizing, the basic idea behind mixture of experts is simple: simultaneously, the gating network learns to split the input space and the experts learn local features from the data. The problem is that the splitting of the input space is unknown, because the only information available is the next value of the series. This requires blending supervised and unsupervised learning. The supervised component learns to forecast the next value; the unsupervised component discovers the hidden regimes. The trade-off between flexibility in the gates and flexibility in the experts is an important degree of freedom in this model class.
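A minimal sketch, under the Gaussian assumption above, of what the E step computes: the missing indicators $J_j^{(t)}$ of equation (9) are replaced by their expected values, i.e. the posterior responsibility of expert $j$ for observation $t$. The closed-form variance re-estimate shown for the M step is the standard one for Gaussian mixtures; the network parameters themselves would be re-estimated by minimizing the new cost function (for instance by gradient descent). Shapes and names are illustrative assumptions.

```python
# E step: posterior responsibilities; M step (partial): per-expert variances.
import numpy as np

def e_step(y, g, f, sigma2):
    """Posterior responsibilities h_j^(t) = E[J_j^(t) | x_t, y_t], shape (N, K)."""
    gauss = (np.exp(-(y[:, None] - f) ** 2 / (2.0 * sigma2))
             / np.sqrt(2.0 * np.pi * sigma2))
    h = g * gauss                                  # gate weight times Gaussian density
    return h / h.sum(axis=1, keepdims=True)

def m_step_variances(y, f, h):
    """Responsibility-weighted residual variance for each expert (closed form)."""
    resid2 = (y[:, None] - f) ** 2
    return (h * resid2).sum(axis=0) / h.sum(axis=0)
```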

3 Electricity demand forecasting

To illustrate the mixture of experts approach, we attempt to model the electricity demand in France. From a time series perspective, this series exhibits three interesting features: it is multivariate (there are many inputs, encompassing both an endogenous variable, the past electricity demand, and exogenous variables such as temperatures, cloud coverage and other weather data); multiscale (there is structure on several time scales, e.g. daily patterns and yearly patterns); and multi-stationary (there are different regimes, e.g. holidays vs. workdays, summer vs. winter). The experts each had one layer of 5 tanh hidden units, and the gating network had one hidden layer with 10 tanh units. The experts and the gating network were connected to the same 51 exogenous inputs: proximity to holidays, task-independent unsupervised clustering (Kohonen), days of the week, proximity of special tariff days, annual cycle, temperatures and cloud coverage.

Fig. 2. Output of the gating networks. The annual cycle of the data, peaking in the cold winter months, is evident, and so is the summer vacation. Gate 2 picks out the holidays, gate 1 the days around holidays, gate 3 the warmer weather, and gate 4 the colder seasons.

Fig. 3. Evolution of the variances of the experts (log scale) over training iterations, on the training and test sets. At the end of training, expert 3 (summers) has the lowest variance, whereas expert 2 (holidays) has the largest variance. Experts 4 and 1 converge later than 2 and 3, and show some trade-off with each other before the final phase separation.

We performed 10 runs with different initial sets of weights, always starting with 8 experts. The final number of active experts varied from 2 (3 runs) to 4 (2 runs). We can see in Figure 2 that most of the time the output of gate 2 is binary, while the outputs of gates 3 and 4 complement each other. In the latter case, the switching between the two regimes is smooth; this feature does not appear in most partitioning models such as TAR. Another important feature is the different noise level associated with each expert (see Figure 3). It is worth noting that the variance assigned to expert 2 (the expert that deals with the holidays) is three times larger than the one assigned to expert 3 (the expert that deals with summer time). Beyond this analysis, in our experience, expert-specific variances are important in order to facilitate the segmentation (areas of different predictability are grouped together) and to better manage the resources of the model.

4 Statistical pruning

For the final performance, we use a standard pruning method called SSM [1] in order to remove the irrelevant weights. For the electricity prediction task, we typically removed 35% of the weights. The squared error of the resulting network was 4% smaller on the training set than on the test set, indicating that there was essentially no overfitting after pruning. This method is based on recent results about almost sure identification ([6, 5]). Let $(X_t)_{t>0}$ denote the data and $n(A)$ the number of free parameters of a given architecture $A$. Then we have an almost sure criterion for model selection:

$$\mathrm{LSC}(T, A) = \frac{S_T(A)}{T} + \lambda\, \frac{\ln T}{T}\, n(A) \qquad (11)$$

with $S_T(A) = \sum_{t=p+1}^{T} \left(X_t - f_W(X_{t-1}, X_{t-2}, \ldots, X_{t-p}, Y_t, 1)\right)^2$ and $\lambda$ any positive real number. The selected model is then $\hat{A}_T = \arg\min_{A \in \mathcal{A}} \mathrm{LSC}(T, A)$.
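A hedged sketch of how the criterion (11), as reconstructed above, could be evaluated for one candidate architecture: here `y_pred` stands for the fitted network's predictions, `n_params` for $n(A)$, and `lam` for the positive constant. These names and the commented selection line are illustrative assumptions, not the paper's code.

```python
# Penalized mean squared error of equation (11); smaller LSC is better.
import numpy as np

def lsc(y_true, y_pred, n_params, lam=1.0):
    T = len(y_true)
    s_T = np.sum((np.asarray(y_true) - np.asarray(y_pred)) ** 2)  # S_T(A)
    return s_T / T + lam * np.log(T) / T * n_params

# selection over a candidate family (pseudo-usage):
# best = min(candidates, key=lambda a: lsc(y, a.predict(X), a.n_params))
```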

We also denote by $W_{\max} = (w_1, w_2, \ldots, w_M)$ the associated dominant parameter vector. Theoretically, in order to estimate the true model, we would have to exhaustively explore a finite family and compute the criterion for all sub-models of $A$. But the number of these sub-models is exponentially large (of order $2^M$) and it is impossible to do this in practice. So, as in linear regression analysis, we propose a Statistical Stepwise Method (SSM) to guide the search. Such a descending strategy is based on the asymptotic normality of the estimator $\hat{W}_T$. Checking whether or not to delete a weight $w_l$ amounts to testing the null hypothesis "$w_l = 0$" against the alternative hypothesis "$w_l \ne 0$", which leads to a Student test on $w_l$ (in fact an asymptotic Gaussian test, since the normality of the estimated weight is ensured only when $T$ is large). See [1] for previous presentations of the SSM algorithm, with several examples.
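The test at the heart of SSM can be sketched as follows. How the standard error of an estimated weight is obtained (for instance from an estimate of the asymptotic covariance matrix) is deliberately left abstract here, and the function names and the 5% level are illustrative assumptions.

```python
# Asymptotic Gaussian test of "w_l = 0" versus "w_l != 0".
import numpy as np
from scipy.stats import norm

def weight_is_significant(w_l, std_err, alpha=0.05):
    z = abs(w_l) / std_err
    return z > norm.ppf(1.0 - alpha / 2.0)   # keep the weight if significant

# descending strategy (informal): while some weight fails the test,
# delete the least significant one, re-estimate the network, and repeat.
```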

5 Conclusion

This paper describes nonlinearly gated nonlinear experts with adaptive variances and applies them to the prediction and analysis of time series. The mixture of experts model can be viewed as a new method for data analysis, particularly well suited to discovering processes with changing conditions. The blend of supervised and unsupervised learning, embedded in the architecture and cost function, gives results in three areas:

- Prediction. This model is particularly well adapted to the forecasting of piece-wise stationary processes that switch between regimes with different dynamics and noise levels.

- Analysis. The model discovers regimes and allows us to analyse the predictability in each regime (the noise level). The analysis of the gate (e.g., the correlations between its outputs and auxiliary variables) also yields a deeper understanding of the underlying process.

The use of statistical pruning (such as the SSM methodology) on this model improves its robustness and allows each expert to limit its number of parameters to what is needed for the particular regime it deals with.

References

1. M. Cottrell, B. Girard, Y. Girard, M. Mangeas, and C. Muller. Neural modeling for time series: a statistical stepwise method for weight elimination. IEEE Transactions on Neural Networks, (in press), 1995.
2. R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3:79-87, 1991.
3. M. I. Jordan and L. Xu. Convergence results for the EM approach to mixtures of experts architectures. Neural Networks, (in press), 1995.
4. Y. le Cun, J. S. Denker, and S. A. Solla. Optimal brain damage. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 2 (NIPS*89), pages 598-605, San Mateo, CA, 1990. Morgan Kaufmann.
5. M. Mangeas, M. Cottrell, and J.-F. Yao. New criterion of identification in the multilayered perceptron modelling. In Proceedings of ESANN'97, Bruges, Belgium, 1997.
6. M. Mangeas and J.-F. Yao. Sur l'estimateur des moindres carrés d'un modèle autorégressif non-linéaire. Technical Report 53, SAMOS, Université Paris I, 1996.
7. H. Tong and K. S. Lim. Threshold autoregression, limit cycles and cyclical data. J. Roy. Stat. Soc. B, 42:245-292, 1980.
8. A. S. Weigend, M. Mangeas, and A. Srivastava. Nonlinear gated experts for time series: discovering regimes and avoiding overfitting. International Journal of Neural Systems, 6:373-399, 1995.
