Structurally Adaptive Modular Networks for Non-Stationary Environments

Viswanath Ramamurti and Joydeep Ghosh
Department of Electrical and Computer Engineering
The University of Texas at Austin
Austin, TX 78712-1084

Abstract

This paper introduces a neural network capable of dynamically adapting its architecture to realize time-varying non-linear input-output maps. This network has its roots in the mixture of experts framework but uses a localized model for the gating network. Modules or experts are grown or pruned depending on the complexity of the modeling problem. The structural adaptation procedure addresses the model selection problem and typically leads to much better parameter estimation. Batch-mode learning equations are extended to obtain on-line update rules, enabling the network to model time-varying environments. Simulation results are presented throughout the paper to support the proposed techniques.
This research was supported in part by ARO contracts DAAH04-94-G-0417 and 04-95-10494 and NSF grant ECS 9307632.
Contents

1 Introduction
2 Background on Mixture of Experts
  2.1 Generic Mixture of Experts Architecture
  2.2 Drawbacks of a Global Gating Network
  2.3 A Localized Model for the Gating Network
  2.4 Some Comments on Using the Localized Gating Network
3 Model Selection in Mixture of Experts
  3.1 The Growing Mixture of Experts Network
    3.1.1 Simulation Results
  3.2 Pruning and Growing Mixture of Experts
    3.2.1 Simulation Results
4 An on-line algorithm to train the mixture of experts network
  4.1 Simulation Results
5 Modeling Slowly Varying Environments
  5.1 Simulation Results
6 Modeling Rapidly Changing Environments
  6.1 Simulation Results
7 Summary and Related Work
1 Introduction

There are many situations wherein a learning system capable of adapting to changing environments is desirable. For example, one may want to model a plant process that undergoes seasonal variations in its input-output behavior. In the neural network community, the problem of realizing a time varying mapping is commonly encountered as the sequential learning problem. Neural networks such as multi-layered perceptrons (MLPs) are known to exhibit "Catastrophic Interference (CI)", i.e., they tend to forget previously learned mappings rather quickly when exposed to new mappings [1],[2]. CI is exhibited by MLPs mainly because such networks make highly coupled global parameter updates as they learn. Several researchers have tried to alleviate CI by reduced coupling or unlearning in MLP networks or by using networks of cells with localized responses [3],[4],[5]. The mixture of experts network provides an alternative approach, where the problem is "softly" decomposed into subtasks for the individual experts to work on, with minimal interaction among the experts [6].

In this paper we present a modular architecture, based on the mixture of experts framework, that is capable of growing and pruning modules or experts to maintain a suitable model complexity at all times. We first address the problem of determining an appropriate model to approximate a static interpolation task. A localized form of the gating network is considered that makes the mixture of experts network perform function approximation tasks very well with only one layer of experts. Two techniques are then proposed to overcome the model selection problem in the mixture of experts architecture. In the first technique, an efficient way to grow expert networks, so as to arrive at an appropriate number of experts for a given problem, is presented. In the second approach, one starts with a certain initial number of experts; while training, experts that become less useful are pruned and more effective experts are added. Simulation results show that the techniques presented yield better results than the standard static approach even when the final network sizes are chosen to be the same.

The training algorithm for the (static) mixture of experts network with the localized gating network is a batch algorithm. For a learning system to be capable of adapting to changing environments, an on-line training algorithm is necessary. Towards this end, an on-line extension of the batch algorithm is derived. Finally, on-line techniques are presented for the structural
adaptation of the mixture of experts network with the localized gating model. A summary of the type of experiments reported in this paper is provided in table 1 to help the reader better understand their intent and scope.

The next section briefly describes the mixture of experts network of [6], and motivates the use of the localized gating network of [7] rather than the more popular one based on softmax. Techniques to counter the model selection problem for solving static tasks are then proposed. The on-line training algorithm with the localized gating network and the on-line techniques for structural adaptation are described next. Finally, we examine related work and place the proposed techniques in perspective.

Table 1: Summary of Experiments

  Section  Purpose of Experiments                                       Data Sets
  3.1.1    Evaluate network growing procedure                           Sinc+Noise, Building2, Heartac1, Multivariate Fn.
  3.2.1    Evaluate the pruning and growing procedure                   Gabor Fn. Approximation, Multivariate Fn. Approximation
  4.1      Compare on-line training algorithm with the batch algorithm  Gabor Fn. Approximation, Multivariate Fn. Approximation
  5.1      Model slowly varying environments                            Time-varying version of Gabor Fn. Approximation
  6.1      Model rapidly changing environments                          Multimodal (modified) version of Building2 data set
2 Background on Mixture of Experts

2.1 Generic Mixture of Experts Architecture

A mixture of experts model is shown in figure 1. The model consists of a set of expert networks and a gating network. The gating network mediates the outputs of the expert networks. The
expert networks $j = 1, \dots, M$ look at an input vector $x$ to form outputs $\hat{y}_j(x)$. The gating network also looks at the input vector $x$ and produces outputs $g_j(x) \geq 0$, $\sum_j g_j(x) = 1$, which weight the expert network outputs to form the overall output $\hat{y}(x) = \sum_j g_j(x)\,\hat{y}_j(x)$. The expert networks are typically single layer linear networks (i.e., $\hat{y}_j(x) = \theta_j^T x$), since simple models are desired for localized fits. As introduced in [6], the gating network is a single layer network with a softmax output non-linearity. Let the $j$th output of the gating network prior to the output non-linearity be $u_j = \sum_i v_{ji} x_i$. The corresponding output $g_j(x)$ after the softmax non-linearity is given by:

$$g_j = \frac{\exp(u_j)}{\sum_i \exp(u_i)}, \qquad j = 1, \dots, M. \qquad (1)$$

The softmax function ensures that $g_j \geq 0$, $\sum_j g_j = 1$ and that $g_j > g_k \Leftrightarrow u_j > u_k$.
Figure 1: A mixture of experts network (from [6]).

The mixture of experts network can be viewed as modeling the conditional probability density of the target output given the network input, based on the training data. The conditional density is cast as a mixture of conditional probability densities, with each expert modeling the conditional density $P(y \mid x, j)$ of the target conditioned on that expert being picked and the network input. Thus, the entire network represents:
$$P(y \mid x) = \sum_j g_j\, P(y \mid x, j). \qquad (2)$$

The expert outputs $\hat{y}_j(x)$ are the means of the corresponding densities $P(y \mid x, j)$, while the gating network output $g_j$ is interpreted as the conditional probability of selecting expert $j$ given the input vector $x$. If the component densities are Gaussian distributed with identity covariance, the above mixture density function can be written in the form

$$P(y \mid x) = \sum_j g_j\, (2\pi)^{-n/2} \exp\{-0.5\,(y - \hat{y}_j)^T (y - \hat{y}_j)\}. \qquad (3)$$
The above equation can be viewed as a likelihood function with the unknown parameters being the gating and expert network weights. Therefore the problem of estimating the network parameters can be viewed as a maximum-likelihood problem. In [6] a gradient ascent algorithm is derived to maximize the log-likelihood function. Using Bayes' rule, one first defines posterior probabilities

$$h_j = \frac{g_j\, P(y \mid x, j)}{\sum_i g_i\, P(y \mid x, i)}. \qquad (4)$$

Then the learning rules based on gradient ascent are as follows. For the weight matrix $\theta_j$ in the $j$th expert:

$$\Delta \theta_j = \sum_t h_j^t\, (y^t - \hat{y}_j^t)(x^t)^T. \qquad (5)$$

For the $j$th weight vector $v_j$ in the gating network:

$$\Delta v_j = \sum_t (h_j^t - g_j^t)\, x^t. \qquad (6)$$
In [8], an EM algorithm has been derived for evaluating the network parameters. The EM algorithm trains the network faster than gradient ascent learning. In the EM iterations, the E-step turns out to be just the computation of the posterior probabilities $h_j$. The M-step reduces to the following separate maximization problems:

$$\theta_j^{(k+1)} = \arg\max_{\theta_j} \sum_t h_j^t \ln P(y^t \mid x^t, j), \qquad (7)$$

$$v_j^{(k+1)} = \arg\max_{v_j} \sum_t \sum_i h_i^t \ln g_i^t. \qquad (8)$$

In the above equations, $k$ denotes the iteration number. The superscript $t$ is the running variable corresponding to the individual input patterns.
The one level mixture of experts model was extended to a hierarchy in [8] by replacing each expert in figure 1 with a mixture of experts model. The resulting two level hierarchy could similarly be extended to form deeper trees.
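To make the notation concrete, the following Python sketch (our own illustration, not code from the paper; array shapes and variable names are assumptions) computes the softmax gating outputs of equation (1), the overall mixture output, and the posterior probabilities of equation (4) for a one-level mixture of linear experts with identity-covariance Gaussian noise.

```python
import numpy as np

def softmax(u):
    # Numerically stable softmax over the expert axis.
    u = u - u.max()
    e = np.exp(u)
    return e / e.sum()

def moe_forward(x, y, expert_W, gate_V):
    """One-level mixture of linear experts with a softmax gating network.

    x        : (d,)      input vector
    y        : (n,)      target vector (needed only for the posteriors)
    expert_W : (M, n, d) weight matrices theta_j of the M linear experts
    gate_V   : (M, d)    gating weight vectors, u_j = v_j^T x
    """
    y_hat_j = np.einsum('mnd,d->mn', expert_W, x)   # expert outputs y_hat_j(x)
    g = softmax(gate_V @ x)                         # gating outputs, eq. (1)
    y_hat = g @ y_hat_j                             # overall output y_hat(x)
    # Identity-covariance Gaussian component likelihoods, as in eq. (3)
    lik = np.exp(-0.5 * np.sum((y - y_hat_j) ** 2, axis=1))
    h = g * lik / np.sum(g * lik)                   # posteriors, eq. (4)
    return y_hat, g, h

# Tiny usage example with random parameters.
rng = np.random.default_rng(0)
M, d, n = 3, 4, 2
W, V = rng.normal(size=(M, n, d)), rng.normal(size=(M, d))
x, y = rng.normal(size=d), rng.normal(size=n)
print(moe_forward(x, y, W, V))
```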
2.2 Drawbacks of a Global Gating Network

The gating network in the mixture of experts network as defined by equation 1 divides the input space into overlapping regions by "soft" hyperplanes, and each region is assigned to one or more experts. It has been mathematically shown that the cost function in a mixture of experts network contains an "entropy" term which increases with the number of experts that are active for a given input (see equation 21 of [9]). Here we intuitively explain why the use of a gating network based on equation 1 leads to difficulties while modeling a function approximation task that is non-trivial. This is explained with the help of figure 2. Let us assume that a 2-d function $f(x_1, x_2)$ to be approximated has inputs falling in regions $R_1$ and $R_2$. Figure 2 shows the hyperplanes ($b$'s) corresponding to the gating weight vectors $v_j$, in one realization involving 5 experts. The idea is to have the gating outputs of experts 1 and 2 high for inputs falling in region $R_1$ and those of experts 3 and 4 high for inputs falling in region $R_2$. Ideally then, experts 1 and 2 will model the approximation in region $R_1$ and experts 3 and 4 will model the approximation in region $R_2$. Unfortunately, this does not happen: gating outputs intended for one region remain high for inputs falling in the other region. Hence, there is interference among the experts wherever the half-subspaces covered by different hyperplanes overlap. If this interference happens to be a constructive one, the approximation task can still be performed well. However, the probability of finding such a constructive interference decreases rapidly as the number of experts increases.

When one employs a hierarchy of experts, the above problem is circumvented, as the hierarchy provides isolation of sub-spaces. For example, if a two level hierarchy is employed for the task described in figure 2, the top level gating network can provide the partition $b_{top}$, which isolates regions $R_1$ and $R_2$ for individual mixture of experts networks to act upon.

Figure 2: Regions of expertise determined by a softmax based gating network.

Figure 3 illustrates the effect of a hierarchy. The task is to approximate a sinc function. A simple mixture of experts network consisting of 16 experts, a 4 level binary tree hierarchy and an 8 level binary tree hierarchy were employed. It is seen that the deeper the network, the better it performs. The same phenomenon is also reported in classification type HME networks [10],[11]. Unfortunately, the computational cost increases as the tree height increases, as more gating and expert networks need to be trained. The depth of the tree also leads to the question of what is the right size and structure of the tree to best solve a problem, i.e., one is faced with the model selection problem.

Figure 3: HMEs of different sizes trying to approximate a sinc function (true sinc; 1 level with 16 experts; 4 level binary tree; 8 level binary tree).
2.3 A Localized Model for the Gating Network

Xu, Jordan and Hinton [7] have proposed an alternative form for the gating network that divides the input space into "soft" hyper-ellipsoids as opposed to the "soft" hyperplane divisions created by the original gating network. With such localized regions of expertise, a single layer of linear experts is adequate in practice for function approximation [12]. Figure 4 shows the performance of the localized model with only one layer of 15 experts, on the same sinc function of figure 3.

Figure 4: Sinc function approximation with the localized model: 1 layer with 15 experts (true sinc vs. network output).

The gating network proposed in [7] is of the form

$$g_j(x; \nu) = \frac{\alpha_j P(x \mid j)}{\sum_i \alpha_i P(x \mid i)}, \qquad (9)$$

where

$$P(x \mid j) = (2\pi)^{-n/2}\, |\Sigma_j|^{-1/2} \exp\{-0.5\,(x - m_j)^T \Sigma_j^{-1} (x - m_j)\}. \qquad (10)$$
Thus the $j$th expert's influence is localized to a region around $m_j$ (figure 5). The EM formulation for estimating the gating network parameters $\alpha_j$, $m_j$, $\Sigma_j$ from the set of training patterns indexed by $t$ is given by:

E-step:

$$h_j^{(k)}(y^t \mid x^t) = \frac{g_j^{(k)}(x^t; \nu)\, \exp\{-0.5\,(y^t - \hat{y}_j)^T (y^t - \hat{y}_j)\}}{\sum_i g_i^{(k)}(x^t; \nu)\, \exp\{-0.5\,(y^t - \hat{y}_i)^T (y^t - \hat{y}_i)\}}. \qquad (11)$$

Figure 5: Localized gating network.

M-step (gating network):

$$\alpha_j^{(k+1)} = \frac{1}{N} \sum_t h_j^{(k)}(y^t \mid x^t), \qquad (12)$$

$$m_j^{(k+1)} = \frac{1}{\sum_t h_j^{(k)}(y^t \mid x^t)} \sum_t h_j^{(k)}(y^t \mid x^t)\, x^t, \qquad (13)$$

$$\Sigma_j^{(k+1)} = \frac{1}{\sum_t h_j^{(k)}(y^t \mid x^t)} \sum_t h_j^{(k)}(y^t \mid x^t)\, [x^t - m_j^{(k)}][x^t - m_j^{(k)}]^T. \qquad (14)$$
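The E-step of equation (11) and the gating M-step of equations (12)-(14) can be written compactly. The sketch below is our own illustrative code (shapes and names are assumptions); the weighted least-squares refit of the linear experts at the end is one standard way to carry out the expert M-step, consistent with the pseudo-inversion-based M-step mentioned in section 4.1.

```python
import numpy as np

def localized_gate(X, alpha, means, covs):
    """Gating outputs g_j(x) of eqs. (9)-(10) for all rows of X (N, d)."""
    N, d = X.shape
    G = np.zeros((N, len(alpha)))
    for j, (a, m, S) in enumerate(zip(alpha, means, covs)):
        diff = X - m
        inv, det = np.linalg.inv(S), np.linalg.det(S)
        quad = np.einsum('nd,dk,nk->n', diff, inv, diff)
        G[:, j] = a * (2 * np.pi) ** (-d / 2) * det ** -0.5 * np.exp(-0.5 * quad)
    return G / G.sum(axis=1, keepdims=True)

def em_step(X, Y, alpha, means, covs, W):
    """One EM iteration: E-step eq. (11), gating M-step eqs. (12)-(14),
    plus a weighted least-squares refit of the linear expert weights."""
    G = localized_gate(X, alpha, means, covs)
    Yhat = np.stack([X @ Wj.T for Wj in W], axis=1)            # (N, M, n)
    lik = np.exp(-0.5 * np.sum((Y[:, None, :] - Yhat) ** 2, axis=2))
    H = G * lik
    H /= H.sum(axis=1, keepdims=True)                          # posteriors, eq. (11)

    N = len(X)
    for j in range(len(alpha)):
        h = H[:, j]
        alpha[j] = h.sum() / N                                 # eq. (12)
        mj_old = means[j].copy()
        means[j] = (h[:, None] * X).sum(0) / h.sum()           # eq. (13)
        diff = X - mj_old                                      # old mean, as in eq. (14)
        covs[j] = (h[:, None, None]
                   * np.einsum('nd,nk->ndk', diff, diff)).sum(0) / h.sum()
        # Expert M-step: weighted least squares for the linear expert weights.
        sw = np.sqrt(h)[:, None]
        W[j] = np.linalg.lstsq(sw * X, sw * Y, rcond=None)[0].T
    return alpha, means, covs, W
```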
2.4 Some Comments on Using the Localized Gating Network

We have employed the localized gating network to solve a number of regression problems and have found that the network performs well with just a single layer of experts [12]. By requiring only one gating network, this model is significantly faster to train than a hierarchical model. Also, the M-step of the gating network has an exact one-pass analytical solution, as opposed to the approximate one-pass M-step solution for the original gating network (algorithm 2 described in [8]). However, there are a couple of things that need attention while training the localized model in practice. In the EM iterations, the E-step of the gating network requires inverting the covariance matrices $\Sigma_j$ obtained in the M-step. Some of the covariance matrices obtained often turn out to be close to singular. To avoid this problem, we consider only the diagonal elements of the covariance matrices and set the off-diagonal elements to zero. This would mean that only
the variances of the different input elements are being considered and the cross terms are being ignored. This does not affect the function approximation capability of the network; in the worst case the network needs more experts than it would have needed otherwise. The effect of this change on the EM derivation is that, while calculating the $\Sigma_j$ in the M-step, only the terms corresponding to the diagonal entries need be computed. Also, a lower bound on the diagonal entries is imposed so that the entries never become too small. The lower threshold for each entry is selected so as to reflect the variance of the individual input elements. Towards this end, the global covariance matrix, computed from the combined set of input vectors, is first determined and the diagonal entries of this global covariance matrix are scaled down by a large constant. The constant in our simulations was taken to be ten times the number of experts employed. The resulting scaled-down values form the set of lower thresholds. When a computed value at any time goes below the corresponding threshold, it is replaced by the constant threshold value. The algorithm has also been found to be sensitive to the initial random means associated with the experts. Initializing these means by the standard K-means algorithm works well in practice.

The capability of the localized model is illustrated by simulation results on the building2 data-set, a function approximation task from the PROBEN1 set of benchmark data sets [13]. In the building2 data-set the input is a 14-d vector which includes the date, time of day, outside temperature, outside air humidity, solar radiation and wind speed, and the output is a 3-d vector consisting of the hourly consumption of electrical energy, hot water and cold water. The task was to perform function approximation with 2104 training samples and 1052 samples to test the approximation. A single layer mixture of 10 experts was trained. The network gives an average test set MSE of 0.0084 and training set MSE of 0.0072, which is as good as the best results obtained using a fully connected MLP including short-cut connections [13] and slightly better than the results quoted in [13] when short-cut connections were not present in the MLP.
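A minimal sketch of the covariance safeguards just described (our own code; the scaling constant of ten times the number of experts follows the text, everything else is an assumed illustration):

```python
import numpy as np

def diagonal_covariance_update(X, h, m_j, global_var, num_experts):
    """Diagonal covariance for one expert with a lower bound on each entry.

    X          : (N, d) training inputs
    h          : (N,)   posteriors h_j^t for this expert
    m_j        : (d,)   current mean of the expert
    global_var : (d,)   diagonal of the covariance of the combined input set
    """
    # Off-diagonal terms are ignored: keep only per-dimension variances.
    var = (h[:, None] * (X - m_j) ** 2).sum(0) / h.sum()
    # Floor: global variances scaled down by ten times the number of experts.
    floor = global_var / (10.0 * num_experts)
    return np.maximum(var, floor)
```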
3 Model Selection in Mixture of Experts

The "mixture of experts" framework provides a modular and flexible approach to function approximation. However, the important problem of determining the appropriate number and complexity of experts had not been fully explored earlier. A common practice to counter the model selection problem is to train a sufficiently large network and, while training, monitor the mean squared error on a validation set in parallel. Besides the high computational cost of training possibly redundant experts, one ends up with a more complicated model to approximate the function, making both parameter estimation and model interpretation much more difficult. Another possible approach is to use some form of regularization, such as ridge regression, to reduce the effective number of parameters during training [14][15]. Several experiments may be needed to determine a suitable value for the regularization parameter. For several neural networks, including the mixture of experts, "early stopping" has a similar effect as regularization.

In this section we observe that the localized gating network introduced in section 2.3 provides a basis for overcoming the model selection problem. We propose two techniques to counter it. In the first technique, one starts by training a network consisting of only one expert, which corresponds to a linear regression, and grows more experts sequentially to fit the complexity of the problem. In the second technique, one starts training an adequately powerful network and, while training, prunes away experts which become less useful, and subsequently grows more effective experts if need be. The final network yielded by the second approach typically provides much better results, even when compared to training a fixed network of the same final size.
3.1 The Growing Mixture of Experts Network

Several researchers have proposed algorithms that dynamically adapt network structure by adding weights or nodes until the most appropriate network size is obtained for a given function approximation task [16] [17] [18]. In this section, a constructive algorithm is presented for building a mixture of experts network by exploiting the localized nature of the chosen gating network. The idea is to initially start with one expert, and add on experts one at a time, to systematically reduce the output error.

Let us assume that at some instant of time there are m experts in the network (m - 1 experts having been already added). The mixture of m experts is trained using the EM algorithm until the MSE on the validation set fails to decrease. The network parameters obtained are saved. We define a weighted MSE $E_j$, for each expert j, on the validation set samples p (as distinguished from the training set examples, which are indexed by t) as:

$$E_j = \frac{\sum_p h_j^p\, (y^p - \hat{y}^p)^2}{\sum_p h_j^p}, \qquad j = 1, \dots, m, \qquad (15)$$

where the $h_j^p$ are obtained by evaluating the E-step expression for the validation set samples p. $E_j$ gives a measure of how well the $j$th expert performed on the validation set. If the partitioning had been crisp, i.e., the $h_j^p$ were indicator functions, every sample in the validation set would have been associated with only one expert and $E_j$ would have been the MSE of expert j trying to approximate a sub-function from the samples associated with it. The $h_j^p$ in the mixture of experts architecture are not all 1s and 0s, and therefore the $E_j$ in reality correspond to a soft version of the mean-squared error.

If $E_j$ is large, it indicates one of two possibilities: (1) the localized model is not able to approximate well the target function for the weighted samples associated with expert j, or (2) the localized model has overfit the training data and therefore is not able to perform well on the validation set data. In the former case, addition of an expert to the localized space spanned by expert j would reduce the output error due to the added flexibility. In the latter case, such an addition would only further overfit the training data. Therefore, to add the most effective expert, the $E_j$ are ranked from largest to smallest. Let c denote the expert network with the current highest rank. We now try to add an expert m+1 which tries to reduce the error due to expert c. Towards this end, the new expert m+1 is created as follows:

- Weights of expert network m+1 = weights of expert network c.
- Perform "weighted 2-means" on the training samples t (weights = $h_c^t$). The two means obtained are the new means associated with the two experts, m+1 and c.
- Set $\Sigma_{m+1} = \Sigma_c^{new} = \frac{1}{2}\Sigma_c^{old}$.
- Set $\alpha_{m+1} = \alpha_c$.

The "weighted 2-means" algorithm is performed as follows:

- Initialize the two means to the two training samples having the largest $h_c^t$.
- Assign each point in the training set to subset $t_1$ or $t_2$ depending on whether it is closer to $\hat{m}_1$ or $\hat{m}_2$.
- Determine

$$\hat{m}_1^{new} = \frac{\sum_{t \in t_1} h_c^t\, x^t}{\sum_{t \in t_1} h_c^t}, \qquad (16)$$

$$\hat{m}_2^{new} = \frac{\sum_{t \in t_2} h_c^t\, x^t}{\sum_{t \in t_2} h_c^t}. \qquad (17)$$

- Iterate until convergence.

In the crisp case, the "weighted 2-means" procedure reduces to finding two means for the set of samples associated with expert c using the K-means algorithm. The newly formed network is trained for one EM iteration. If there is a decrease in MSE on the validation data-set, we proceed further. Else, we revert to the earlier saved network and try adding a new expert to reduce the error due to the expert with the next highest rank. This procedure is terminated when, at any stage, adding a new expert to reduce the error of any of the existing experts does not reduce the overall MSE on the validation data-set. Also, at any time during the growing procedure, when an overfitting expert is detected, its corresponding expert network parameters, mean and covariance matrix are frozen.
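A sketch of the splitting step and the weighted 2-means procedure (our own code; variable names and the list-based bookkeeping of expert parameters are assumptions):

```python
import numpy as np

def weighted_two_means(X, h_c, n_iter=50):
    """Weighted 2-means on the training inputs X (N, d) with weights h_c (N,)."""
    # Initialize the two means to the two samples with the largest weights.
    idx = np.argsort(h_c)[-2:]
    m1, m2 = X[idx[0]].copy(), X[idx[1]].copy()
    for _ in range(n_iter):
        closer1 = np.sum((X - m1) ** 2, 1) < np.sum((X - m2) ** 2, 1)
        w1, w2 = h_c * closer1, h_c * ~closer1
        new1 = (w1[:, None] * X).sum(0) / (w1.sum() + 1e-12)   # eq. (16)
        new2 = (w2[:, None] * X).sum(0) / (w2.sum() + 1e-12)   # eq. (17)
        if np.allclose(new1, m1) and np.allclose(new2, m2):
            break
        m1, m2 = new1, new2
    return m1, m2

def split_expert(c, X, H, W, alpha, means, covs):
    """Grow expert m+1 from expert c (parameter lists are modified in place)."""
    W.append(W[c].copy())                 # new expert inherits c's weights
    m_new, m_c = weighted_two_means(X, H[:, c])
    means.append(m_new); means[c] = m_c   # means from weighted 2-means
    covs[c] = 0.5 * covs[c]               # Sigma_{m+1} = Sigma_c(new) = Sigma_c(old)/2
    covs.append(covs[c].copy())
    alpha.append(alpha[c])                # alpha_{m+1} = alpha_c; priors are refit by EM
```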
3.1.1 Simulation Results

The above algorithm was tested on both synthetic and real life problems. The first task was to approximate a sinc function with Gaussian noise of variance 0.25 added to it. The training set contained 100 samples. Figure 6 shows results from the growing algorithm. The test MSE in this case was computed by measuring the error between the network output and the true sinc function. The MSE obtained at the end of the growing operation was found to be better than the MSE obtained by training static networks of size 10, 15 or 20 experts. Figure 7 shows the performance of the final trained network.
The algorithm was then tried on the building2 data set, which was described earlier. The results from a typical run are shown in table 2. Over multiple runs, with an average of 4.33 experts, the network gives an average validation set MSE of 0.0084, on par with that reported earlier.

We next worked on the heartac1 data from the PROBEN1 set of benchmark data-sets [13]. The task was to predict heart disease (single output between 0 and 1) from a 35-d input vector based on personal data such as age, sex, smoking habits, etc., and the results of various medical examinations. The data was originally obtained from the Cleveland Clinic Foundation (Robert Detrano, M.D., Ph.D.). For this problem, we obtained a test set MSE of 0.0461 with one expert. Further addition of experts did not help reduce the validation set MSE. It so happens that, for this task, a linear regressor or single layer perceptron network is enough to solve the problem [13]. Thus, our procedure did not overfit the data.

A multivariate function approximation task was considered next. The function to be approximated was $y = 0.79 + 1.27 x_1 x_2 + 1.56 x_1 x_4 + 3.42 x_2 x_5 + 2.06 x_3 x_4 x_5$. The training set consisted of 820 samples and the validation set had 204 samples. Results using static networks and the growing algorithm are shown in table 3. While using the growing algorithm, in one run the network grew to a final size of 15 experts and in another run the network grew to a final size of 16 experts. For the static networks, the MSE values are averaged over 5 runs. Once again, as seen in table 3, the growing procedure out-performs static networks.

The computational cost to train static networks is roughly equal to the cost of performing the initial K-means algorithm (K = number of experts), plus a cost proportional to the number of experts times the average number of EM iterations to converge. The average number of EM iterations depends on the problem, but is usually small, about 5-10 iterations. For the growing procedure, it has been observed that it usually takes only about one, or on rare occasions two, EM iterations for the validation set MSE to converge after the addition of an expert. Therefore, the computational cost for the growing procedure is roughly equal to a cost proportional to the square of the final number of experts, plus the number of experts times the cost involved in performing the 2-means algorithm. The cost of the K-means algorithm is roughly proportional to K when the number of samples to be clustered is large. Hence, as long as the final network structure is not too large, the computational cost involved in employing the growing algorithm is not very much more than the cost involved in training static networks. Also, the final network structure obtained with the growing algorithm is typically seen to out-perform a comparably sized static network.

Figure 6: Number of experts vs. test set MSE (sinc with noise).

Figure 7: Noisy sinc approximation with 10 experts at the end of the growing algorithm (test MSE = 0.0087).

Table 2: Growing mixture of experts network on the Building2 data-set.

  Number of Experts   Test set MSE
  1                   0.0100
  2                   0.0090
  3                   0.0087
  4                   0.0083

Table 3: Growing Mixture of Experts on the Multivariate Data-set.

                        Number of Experts   Ave. MSE
  Static Architecture   15                  0.0107
                        20                  0.0093
                        25                  0.0085
                        30                  0.0089
  Network Growth        15.5                0.0077
3.2 Pruning and Growing Mixture of Experts

Another popular approach to structural adaptation is to start with an adequately powerful model and then to simplify it based on training data [19]. Various weight decay strategies for pruning links have been proposed, especially for the MLP network [20]. Here, a simple and efficient technique to structurally adapt (prune and grow) the mixture of experts network is presented, wherein the localized nature of the chosen gating network is once again exploited. From section 2.3, it is observed that the prior $\alpha_j$ is computed in the M-step as

$$\alpha_j^{(k+1)} = \frac{1}{N} \sum_t h_j^{(k)}(y^t \mid x^t). \qquad (18)$$
$\alpha_j$ is seen to be proportional to the sum of the $h_j^t$ over all patterns t in the training set, $h_j^t$ being the posterior probability of selecting expert j given input $x^t$ and its corresponding output $y^t$. We therefore see that in every iteration of the EM algorithm, $\alpha_j$ directly gives us a measure of how important the network feels expert j is in relation to the other experts. Hence, when one needs to prune an expert, the obvious candidate is the expert which has the least value of $\alpha$. Pruning and growing can be combined while training a mixture of experts network to serve two purposes: (i) to remove redundant experts and (ii) to avoid local minima and achieve better generalization.

Pruning and Growing: Train the mixture of experts network, with a certain initial number of experts m, until convergence. Prune the expert corresponding to the lowest $\alpha$. If there is no significant change in network performance, continue pruning. Else, perform network growth as described in the previous section. While there is no strict order in which pruning and growing should be performed, it has generally been found more useful to prune initially and then grow.
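A minimal sketch of the pruning step (our own illustration). The renormalization of the remaining priors is an assumption on our part; in practice the next EM iteration recomputes them from equation (18) anyway.

```python
import numpy as np

def prune_least_useful(alpha, means, covs, W):
    """Remove the expert with the smallest prior alpha_j.

    alpha, means, covs, W are parallel lists over experts.
    """
    j = int(np.argmin(alpha))
    for params in (alpha, means, covs, W):
        del params[j]
    # Renormalize the remaining priors so they again sum to one (assumption;
    # the next EM iteration would recompute them via eq. (18) in any case).
    total = sum(alpha)
    for i in range(len(alpha)):
        alpha[i] /= total
    return j  # index of the pruned expert, for logging
```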
3.2.1 Simulation Results

The simultaneous pruning and growing technique was applied to two function approximation problems: (1) to approximate the two dimensional Gabor function,

$$G(x, y) = \frac{1}{2\pi(0.5)^2} \exp\!\left(-\frac{x^2 + y^2}{2(0.5)^2}\right) \cos(2\pi(x + y)),$$

with 64 training set samples and 192 validation set samples [16], and (2) to approximate the multivariate function described in the previous section.

For the Gabor function, the initial network configuration chosen had 30 experts. The network was trained using the EM algorithm described. The network converged to a validation set MSE of 0.0516. Network pruning was performed using the pruning method described above. No significant change in MSE was observed till the number of experts was brought down to 21. The MSE with 21 experts was 0.0503. Next, network growth was performed. As shown in table 4, significant performance gains were obtained by the addition of 2 more experts, lowering the MSE to 0.0296. It is to be noted that the result is much better than simply pruning the 30 expert network to 23 experts. Table 5 shows results on the multivariate data-set.

Table 4: Pruning and Growing on the 2-D Gabor Function.

           Number of Experts   MSE
  Pruning  30                  0.0516
           21                  0.0503
  Growing  22                  0.0451
           23                  0.0296

Table 5: Pruning and Growing on the Multivariate Data-set.

           Number of Experts   MSE
  Pruning  25                  0.0086
           16                  0.0091
  Growing  17                  0.0083
           19                  0.0080
           21                  0.0073
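For reference, a short sketch (our own code) of the 2-D Gabor benchmark function used above; the sampling range and grid in the usage example are assumptions, since the text states only the sample counts.

```python
import numpy as np

def gabor(x, y):
    """2-D Gabor function used as a benchmark in [16]."""
    return (1.0 / (2 * np.pi * 0.5 ** 2)
            * np.exp(-(x ** 2 + y ** 2) / (2 * 0.5 ** 2))
            * np.cos(2 * np.pi * (x + y)))

# Example: sample the function on a grid (the exact sampling scheme is assumed).
g = np.linspace(-0.5, 0.5, 8)
X, Y = np.meshgrid(g, g)
targets = gabor(X, Y)
inputs = np.column_stack([X.ravel(), Y.ravel()])
```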
4 An on-line algorithm to train the mixture of experts network

The training algorithm described in section 2.3 is a batch algorithm. In order to employ a network for learning non-stationary maps, one needs an on-line algorithm. The EM iterations to train the gating network parameters bear a close resemblance to the procedure of training the parameters of a mixture of Gaussian densities to model an unknown probability density [21], which has attracted several attempts at on-line versions. We take the approach of Traven [22] to derive a suitable on-line algorithm. All equations in the EM update rule and the batch mode growing technique are of the general form (t is now being used to denote discrete time)

$$\Phi^t = \frac{\sum_{k=1}^{t} h^k \phi(x^k)}{\sum_{k=1}^{t} h^k}. \qquad (19)$$

Thus,

$$\Phi^{t+1} = \frac{\sum_{k=1}^{t+1} h^k \phi(x^k)}{\sum_{k=1}^{t+1} h^k} \qquad (20)$$

$$= \frac{h^{t+1}\phi(x^{t+1}) + \sum_{k=1}^{t} h^k \phi(x^k)}{\sum_{k=1}^{t+1} h^k} \qquad (21)$$

$$= \frac{h^{t+1}\phi(x^{t+1}) + \Phi^t \sum_{k=1}^{t} h^k}{\sum_{k=1}^{t+1} h^k} \qquad (22)$$

$$= \frac{h^{t+1}\phi(x^{t+1}) + \Phi^t \left( \sum_{k=1}^{t+1} h^k - h^{t+1} \right)}{\sum_{k=1}^{t+1} h^k}. \qquad (23)$$

Rearranging, we have

$$\Phi^{t+1} = \Phi^t + \gamma^{t+1} \left[ \phi(x^{t+1}) - \Phi^t \right], \qquad (24)$$

where

$$\gamma^{t+1} = \frac{h^{t+1}}{\sum_{k=1}^{t+1} h^k}. \qquad (25)$$

Considering only the L most recent values of $x^k$, the denominator of $\gamma^{t+1}$ can be approximated by the recursion

$$D^{t+1} = (1 - 1/L)\, D^t + h^{t+1}. \qquad (26)$$

The gating network parameter update equations now become

$$\alpha_j^{t+1} = \alpha_j^t + (1/L)\left[ h_j^t - \alpha_j^t \right], \qquad (27)$$

$$m_j^{t+1} = m_j^t + \gamma_j^{t+1}\left[ x^t - m_j^t \right], \qquad (28)$$

$$\Sigma_j^{t+1} = \Sigma_j^t + \gamma_j^{t+1}\left[ (x^t - m_j^t)(x^t - m_j^t)^T - \Sigma_j^t \right]. \qquad (29)$$

As the expert networks are linear, their parameters can be updated using either LMS or RLS algorithms [23]. The RLS algorithm converges faster than the LMS algorithm. However, the RLS algorithm is sensitive to its forgetting factor $\lambda_{RLS}$ and hence is not as robust as the LMS algorithm. (The RLS algorithm can be made more robust by making $\lambda_{RLS}$ converge to unity as training proceeds.) The RLS algorithm to train the $i$th expert network on-line is described below. Initialize $P(0) = \delta^{-1} I$, with $\delta$ a small positive constant, and $\hat{w}(0)$ = small random values.

$$l(t) = \lambda_{RLS}^{-1} P(t-1)\, x(t) \qquad (30)$$

$$k(t) = \left[ h_i^{-1}(t) + x^T(t)\, l(t) \right]^{-1} l(t) \qquad (31)$$

$$P(t) = \lambda_{RLS}^{-1} P(t-1) - k(t)\, l^T(t) \qquad (32)$$

$$\epsilon(t) = y(t) - \hat{w}^T(t-1)\, x(t) \qquad (33)$$

$$\hat{w}^T(t) = \hat{w}^T(t-1) + k(t)\, \epsilon(t) \qquad (34)$$
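The update rules (26)-(29) translate directly into code. The sketch below (our own, with an assumed per-expert state dictionary) performs one on-line step of the gating updates for a single expert, followed by an LMS update of that expert's linear weights; the paper allows either LMS or RLS for the experts, and weighting the LMS step by the posterior is our own choice, mirroring the $h_i$ term in the RLS gain of equation (31).

```python
import numpy as np

def online_gate_update(x, h_j, state, L):
    """One on-line update of expert j's gating parameters, eqs. (26)-(29).

    state : dict with keys 'alpha', 'm', 'Sigma', 'D' for this expert
    h_j   : posterior of expert j for the current pattern x
    L     : effective window length
    """
    state['D'] = (1.0 - 1.0 / L) * state['D'] + h_j                    # eq. (26)
    gamma = h_j / state['D']                                           # windowed eq. (25)
    state['alpha'] += (1.0 / L) * (h_j - state['alpha'])               # eq. (27)
    diff = x - state['m']
    state['m'] += gamma * diff                                         # eq. (28)
    state['Sigma'] += gamma * (np.outer(diff, diff) - state['Sigma'])  # eq. (29)
    return state

def lms_expert_update(x, y, W_j, h_j, eta=0.05):
    """LMS update of a linear expert, weighted by its posterior h_j (our choice)."""
    err = y - W_j @ x
    return W_j + eta * h_j * np.outer(err, x)
```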
4.1 Simulation Results

A comparison of the on-line algorithm with the batch algorithm on the two synthetic data sets described earlier is shown in table 6. The results are averaged over 10 runs for each of the data sets. From the table it is seen that the performance of the on-line algorithm and the batch algorithm are comparable. The batch algorithm takes fewer epochs to train, since the pseudo-inversion technique is used to solve the weighted least squares problem in the M-step of the EM iterations. However, in terms of the total time taken to train the networks, the on-line algorithm is faster, as the training time for each epoch is much shorter than for the batch algorithm.

Table 6: Batch vs. on-line results for static networks.

                          Number of   Batch               On-line
                          Experts     Av. MSE   Std.      Av. MSE   Std.
  Gabor data-set          20          0.0541    0.0079    0.0410    0.0056
  Multivariate data-set   25          0.0079    0.0009    0.0082    0.0002
5 Modeling Slowly Varying Environments

The algorithm developed in section 3 for structural adaptation is again a batch algorithm. A similar approach, when extended to an on-line setting, would work well if the incoming training data come from a fixed distribution or from a distribution where changes are slow with respect to time. In this section we first present an on-line extension to the batch-mode growing technique. The on-line error estimate for the experts is given by

$$E_j^{t+1} = E_j^t + \gamma_j^{t+1}\left[ (y^t - \hat{y}^t)^2 - E_j^t \right]. \qquad (35)$$

In a scenario of changing environments, if the error corresponding to a particular expert increases beyond a predetermined threshold, the need for an additional expert is detected. The threshold is chosen based on some prior knowledge about the modeling complexity of the problem. In order to grow an additional expert, we follow the same procedure as in batch mode to create the
new parameters, with the exception of the gating network means. To create the gating network means, we employ an on-line version of the weighted 2-means algorithm as follows.

- Initialize the mean vector corresponding to the new expert to be equal to the mean vector corresponding to the expert to be split.
- Add a small random perturbation to the two means.
- For a window length T of input patterns, make parameter updates to all the experts except the expert being split and the new expert.
- For a window length T of input patterns, if the posterior $h_j$ corresponding to one of the two new experts is the highest among the posteriors of all the experts for an input pattern, make parameter updates for this expert also.

The window length T is chosen in such a way as to separate the two means. After this window of T samples, the two experts become normal experts. Pruning is performed by monitoring the parameter $\alpha_j$. When $\alpha_j$ becomes very small, the corresponding expert is pruned.
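A sketch of the on-line monitoring that drives growth and pruning in slowly varying environments (our own code; the threshold values echo the experiment in section 5.1, while the data structures are assumptions):

```python
import numpy as np

def monitor_and_adapt(y, y_hat, gamma, experts, E_grow=0.03,
                      alpha_grow=0.01, alpha_prune=1e-4, max_experts=20):
    """Update the running error estimate of eq. (35) for every expert and
    flag experts to split or prune.  Each expert is a dict holding
    'E' (running error) and 'alpha' (prior) among its parameters."""
    sq_err = float(np.sum((y - y_hat) ** 2))
    to_split, to_prune = [], []
    for j, ex in enumerate(experts):
        ex['E'] += gamma[j] * (sq_err - ex['E'])      # eq. (35)
        if ex['alpha'] < alpha_prune:
            to_prune.append(j)                        # prune very small priors
        elif (ex['E'] > E_grow and ex['alpha'] > alpha_grow
              and len(experts) < max_experts):
            to_split.append(j)                        # grow from this expert
    return to_split, to_prune
```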
5.1 Simulation Results

The above algorithm was applied to the following time varying version of the Gabor function approximation problem. The purpose of the experiment was to test the structural adaptation and modeling capabilities of the network under different environments. Uniformly spaced samples from the Gabor function $G(x, y) = \frac{1}{2\pi(0.5)^2} \exp\!\left(-\frac{x^2 + y^2}{2(0.5)^2}\right) \cos(2\pi(x + y))$ were first obtained in the range $-1.5 \le x \le 1.5$, $-1.5 \le y \le 1.5$ as a 49 by 49 grid. The data set was divided into three overlapping subsets, each consisting of 1225 samples, as shown in figure 8. Randomly ordered samples from each of these subsets were presented to the network for 100 epochs each. The LMS algorithm was used to make the expert network updates. The following restrictions were imposed during training.

- The number of experts in the network should always be less than 20.
- The threshold for pruning an expert: $\alpha_j = 0.0001$.
- The error threshold for growing an expert: $E_j = 0.03$. Also, an expert is grown only if the corresponding $\alpha_j > 0.01$.
Figure 8: (a) Global data set. (b), (c), (d) Overlapping subsets of the global data set.

The initial number of experts in the network was 10. Staying within the above constraints, for the first data set the network gives an MSE of 0.0089 with 15 experts. When the network is subsequently exposed to the second data set, it prunes away 2 experts and adds on 3 experts to give an MSE value of 0.0152. When the network is next exposed to the third data set, it prunes away five experts and gives an MSE of 0.0088 with 11 experts. The three data sets were presented for the same duration of 100 epochs to three independent MLP networks, each with 20 hidden units. The average MSE values, averaged over 5 runs, on the three data sets were 0.0147, 0.0311 and 0.0148 respectively.
6 Modeling Rapidly Changing Environments

In a setting where the input space, the output space or both could change with time rather quickly, a more rapidly adaptable scheme is required. An alternative way to grow a mixture of experts, based on the novelty of the training patterns, is described here. This approach to growing a neural network is not new; a similar approach based on input novelty has been used before in [5], [24], [25]. The present approach differs from those cited by incorporating output novelty within the framework of novelty detection.

Let the network at any instant of time have m experts. A new expert m+1 is added to the network if the current training pattern $x^t$ is far from all existing experts, i.e., if $\min_j\big[(x^t - m_j)^T \Sigma_j^{-1} (x^t - m_j)\big]$ exceeds a fixed threshold $\delta$, where the $m_j$ and $\Sigma_j$ are the means and covariances corresponding to the current set of experts. The mean of the new expert is initialized to the current input pattern. $\alpha_{m+1}$ is set equal to $\alpha_c/2$, where c corresponds to the expert whose mean is closest to the input pattern. $\alpha_c$ is also set to $\alpha_c/2$ so that the $\alpha_i$ sum to one. The new expert is given random initial weights. The covariance matrix is initialized to a constant diagonal matrix. Once a new expert has been added to the network, the network parameter updates are continued using the RLS algorithm.

This growing algorithm is capable of adapting to changes in the input space by creating new experts for input samples in regions not seen earlier by the network. However, in a modeling task with changing environments, there could be occasions when the output space changes and the input space does not change. When this happens, "Catastrophic Interference" takes place in networks which adapt only to changes in the input space. In order to take into account changes in the output space, while training the network, the gating network is fed with the combined data from both the input and the output space. This way distinct experts are created for changes in the output space also. The expert networks continue to see the data from the input space alone.

During "recall", i.e., while testing the network for a new input pattern, the target output is not available. In this case, we treat the gating network input as an input vector with missing elements [26], the missing elements being those corresponding to the output space. It should be pointed out here that in an on-line setting, training and testing take place alternately. The easiest way to deal with a missing feature
element is to choose the mean value of the missing feature as a substitute for the missing feature during "recall". However, in the mixture of experts network, the expert-specific mean, $m_j$, of the input feature vector is already being computed in the gating network (equation 28). Hence, no extra computation is required to determine the mean of the missing features. Feeding these mean values of the missing features to the gating network during "recall" effectively amounts to ignoring the elements corresponding to the output space in the gating network for the "test input". Patterns having the same values in the input space but different values in the output space would produce the same gating network outputs in this scheme during "recall". However, as network training is performed in parallel at all times, the $\alpha_j$ (equation 27) quickly tune themselves to the current environment and make the gating network weights higher for experts that are performing better in the current environment.
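The sketch below (our own code, with assumed shapes) illustrates the two ingredients of this scheme: the novelty test on the augmented gating input $z = [x; y]$, and "recall" with the output elements effectively ignored, which is what substituting each expert's own mean for the missing elements amounts to when diagonal covariances are used.

```python
import numpy as np

def is_novel(z, means, inv_covs, delta):
    """Novelty test on the augmented gating vector z = [x; y].

    A new expert is allocated when z is far from every current expert,
    i.e. when the smallest Mahalanobis distance exceeds the threshold delta.
    """
    d2 = [(z - m) @ S_inv @ (z - m) for m, S_inv in zip(means, inv_covs)]
    return min(d2) > delta

def gate_for_recall(x, alpha, means, inv_covs):
    """Gating outputs at recall time, when the output part of the gating
    input is missing.  With diagonal covariances, substituting each expert's
    own mean for the missing elements is equivalent to dropping those
    dimensions from the quadratic form, which is done here.  Normalization
    constants are omitted for brevity in this sketch."""
    d_in = len(x)
    g = np.zeros(len(alpha))
    for j, (a, m, S_inv) in enumerate(zip(alpha, means, inv_covs)):
        diff = x - m[:d_in]                      # input-space part only
        quad = diff @ S_inv[:d_in, :d_in] @ diff
        g[j] = a * np.exp(-0.5 * quad)
    return g / g.sum()
```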
6.1 Simulation Results

The application of the above growing procedure on a real life data set is now described. The data set employed is adapted from the "building" data set of the PROBEN1 benchmark suite [13]. The problem is to predict the hourly consumption of electrical energy, hot water, and cold water, based on the date, day of the week, time of day, outside temperature, outside air humidity, solar radiation and wind speed. The data set is spread over six months from September to February. On observing the data, it is seen that the hot water requirement has a strong correlation with the outside temperature. Upon ignoring the temperature parameter, the hot water requirement also varies according to the months. Therefore, without the temperature parameter, for similar values of the inputs, the hot water requirement on average is different for different months. In order to test the growing technique on a problem involving changes in both the input and output space, the data set is divided into three segments, each containing data corresponding to two consecutive months. The outside temperature is assumed not measured and the network is made to predict only the hot water requirement. The three data segments contain 1462, 1464 and 1282 samples respectively. Experiments with the three data segments with the growing network are described below. In the first experiment, different sets of 200 random samples are extracted from each of the
three data segments. The different sets of 200 samples are fed sequentially to the network: the first set from segment 1, the second from segment 2, the third from segment 3, the fourth from segment 1 and so on. The purpose of the experiment was to see how quickly the network is able to adapt to changing environments. The feedings of the different sets of data are termed phases. The experiment is conducted over 10 cycles, where each cycle is made up of 3 phases and thus involves 600 data points. The average MSE values, averaged over each phase, are plotted in figure 9. The performance of the "best" MLP network having 20 hidden nodes is also plotted. (While computing the MSE value of phase 1 in cycle 1, the first 50 sample output errors were ignored for both networks, as the networks started training from random initializations.) The number of experts in the network at the end of every phase of training is plotted in figure 10. It is seen that the growing mixture of experts network yields low MSE values within the first couple of cycles. The same observation is made in the following experiments also.

In the second experiment, the entire set of 4208 samples from the three data segments was sequentially fed to the growing mixture of experts network. This was done to examine the performance of the network when the actual daily data, spread over six months, is fed naturally to the network. The network grew to 33 experts starting from one expert and gave an average MSE of 0.0068. When the 4208 samples were fed for a second time, with the addition of just one more expert, the network gave an average MSE of 0.0037. An MLP network with 20 hidden nodes gives an average MSE of 0.0100 even after 15 cycles through all the samples.

In the third experiment, all the samples from the first and the third data segments, 2744 in number (corresponding to September, October, January, and February), were fed to the network sequentially. The intent was to compare the performance of the growing network with the MLP network on a data set involving a sudden change in the output space (at the boundary of the two data segments). The growing mixture of experts network grew to 27 experts starting from one expert and gave an average MSE of 0.0080 the first time through the sequence of 2744 samples. When the same sequence of samples was again presented to the network, with the addition of no more experts, the network gave an average MSE of 0.0028, essentially learning the modeling task well. The MLP network gives an average MSE of 0.0110 even after 15 cycles through the sequence of samples.
Figure 9: MSE comparison between the growing localized mixture of experts network and an optimized MLP network.
Figure 10: The number of experts at any time in the growing network.

7 Summary and Related Work

In this paper, we presented methods to address the model selection problem in the mixture of experts network and extended the training algorithm to on-line settings to model environments that are non-stationary. A localized model was employed for the gating network. We note that the computational efficiency of localized approaches as compared to the MLP (often two orders of magnitude for the same performance level) has been well documented (see for example [25], [10]). These advantages carry over to our methods, and for this reason we have not further emphasized this point by providing timing comparisons for most of the experiments.

Our approach can be compared to other localized connectionist architectures such as radial basis function (RBF) networks. One can view the localized mixture of experts network as a normalized RBF network with the output layer weights (now provided by the individual experts) being functions of the inputs to the network, rather than constants. The training algorithms are, however, quite different. Decision tree type approaches such as CART [27] and MARS [28] also work on localized subspaces. Besides employing a different technique for parameter estimation, these approaches differ in that splits are limited to be parallel to the co-ordinate axes, rather than being based on normalized ellipsoids.

The localized mixture of experts model can also be contrasted with ensemble methods [29], [30], [31]. Both schemes aim to combine results obtained by individual networks. The key difference is that each member of an ensemble is trained to do well on the entire data set, and expertise in a localized space is not the intent. For a detailed description of the differences between ensemble and modular approaches, see [32].

It is argued in this paper that a localized model for the gating network is better than the standard softmax based gating network for a single layer mixture of experts. We have repeatedly observed this for several data sets, but find no mention of this in the literature. Note that the comparison is only for linear experts. One way out is to use non-linear experts such as RBFs or MLPs. The problem then is that one expert may dominate by performing well over most of the input space, and the spirit of having simple localized fits is also lost. However, by careful initialization/partitioning so that the experts are less coupled, some successes in using non-linear experts (for example, MLPs for forecasting [9] and Kalman filters for trajectory estimation [33]) have been achieved. Weigend et al. [9] also used an MLP for the gating network. The MLP gating network alleviates the problems that were mentioned against the single layer softmax based gating network by allowing more complex regions of expertise. However, when an MLP is employed as the gating or expert network, the advantage of being able to train the network in one pass in the M-step of the EM iterations is lost.
Model selection via network growing or pruning has been widely studied for feedforward networks [20],[19],[17]. In the mixture of experts framework, Waterhouse et al. [18] have developed a growing technique to build a hierarchical mixture of experts, based on principles from Classification and Regression Trees [27]. The hierarchical network is grown one layer at a time by splitting an expert at a leaf of the tree into two experts and a gate. The expert to be split at any time is the one that maximizes the increase in log-likelihood due to the split. The scheme has a nice theoretical framework that parallels well known results on growing decision trees, but leads to deep trees.

A popular architecture in the neural network literature that is capable of modeling sequential learning tasks is the resource allocation network (RAN) [5], a constructive version of the RBF network. When a novel input pattern is encountered, an additional RBF unit is added to the network provided the network output is also far enough from the target output. At other times learning is performed using gradient descent. The growing algorithm cannot respond to changes in the output space alone. RAN does not have a pruning mechanism. The gradient descent learning rule could also pose problems if quick changes are to be tracked.

Schaal and Atkeson [25] have introduced the system of experts architecture wherein local experts are allocated depending on the spatial distribution of input samples. In this architecture each expert has a fixed center point associated with it in the input space and the expert outputs are weighted based on the distance of an input sample from the different center points. This architecture is somewhat related to our localized mixture of experts network. In [25], the center point for each expert's operation is fixed, and the weighting (gating) factors are a function of only the distance from the input sample to the center points associated with the different experts. Since the gating network outputs in the localized mixture of experts network depend also on a prior probability term $\alpha_j$ (equation 27), it is more attuned to the current environment in a scenario where modeling environments can change. Also, the training algorithm developed for the mixture of experts network makes it capable of modeling inverse problems [34],[9], which the system of experts network cannot model.

The modeling of non-stationary processes is a wide open field. Popular approaches include detrending, and identifying regimes that are (relatively) stationary. For time series applications,
Weigend et al. [9] have used a mixture of MLP experts to divide time series segments into different distinct sub-segments for the individual experts to act upon. By tuning the experts to different noise levels, the local model complexity attempts to match the local complexity of the data. This model has been applied to static time series data sets. On-line application of this architecture is not as practical, as it would suffer from slow learning.

Since the expert domains are localized and the data distribution may not be uniform, it is possible that some experts in the localized mixture of experts network get overtrained while others are still undertrained. Regularization techniques can be applied to the gating and expert networks to further improve the generalization performance. Another situation that can be further investigated is one in which a behavioral regime experienced earlier may recur. An example is a time series with cyclic behavior every 12 months. While modeling such non-stationary processes, it might be worth incorporating a memory scheme to store the $\alpha_j$ (equation 27) for each regime and recalling the corresponding $\alpha_j$ whenever a switch to a different but previously encountered regime is detected. Unfortunately, the lack of adequate benchmarks for such situations at present is a hindrance to this study.
References

[1] N. E. Sharkey and A. J. C. Sharkey. Understanding catastrophic interference in neural nets. Technical Report CS-94-4, Department of Computer Science, University of Sheffield, U.K., 1994.
[2] M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. The Psychology of Learning and Motivation, 24:109-165, 1989.
[3] R. Ratcliff. Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions. Psychological Review, 97(2):285-308, 1990.
[4] R. M. French. Semi-distributed representations and catastrophic forgetting in connectionist networks. Connection Science, 4(3):365-377, 1992.
[5] J. C. Platt. A resource-allocating network for function interpolation. Neural Computation, 3(2):213-225, 1991.
[6] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3:79-87, 1991.
[7] L. Xu, M. I. Jordan, and G. E. Hinton. An alternative model for mixtures of experts. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances in Neural Information Processing Systems 7, pages 633-640. The MIT Press, 1995.
[8] M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6:181-214, 1994.
[9] A. S. Weigend, M. Mangeas, and A. N. Srivastava. Nonlinear gated experts for time series: discovering regimes and avoiding overfitting. International Journal of Neural Systems, 6:373-399, 1995.
[10] V. Ramamurti and J. Ghosh. Advances in using hierarchical mixture of experts for signal classification. In Proceedings of ICASSP 96, pages 3569-3572, 1996.
[11] S. R. Waterhouse and A. J. Robinson. Classification using hierarchical mixtures of experts. In Neural Networks for Signal Processing IV: Proceedings of the IEEE Workshop, pages 177-186. IEEE Press, New York, 1994.
[12] V. Ramamurti and J. Ghosh. Structural adaptation in mixture of experts. In Proceedings of ICPR 96, Track D, pages 704-708, 1996.
[13] L. Prechelt. PROBEN1 - A set of benchmarks and benchmarking rules for neural network training algorithms. Technical Report 21/94, Fakultät für Informatik, Universität Karlsruhe, D-76128 Karlsruhe, Germany, September 1994. Anonymous FTP: /pub/papers/techreports/1994/1994-21.ps.Z on ftp.ira.uka.de.
[14] J. H. Friedman. An overview of predictive learning and function approximation. In V. Cherkassky, J. H. Friedman, and H. Wechsler, editors, From Statistics to Neural Networks, Proc. NATO/ASI Workshop. Springer Verlag, 1994.
[15] J. E. Moody. Note on generalization, regularization and architecture selection in nonlinear learning systems. In IEEE Workshop on Neural Networks for Signal Processing, pages 1-10, 1991.
[16] Y. Shin and J. Ghosh. Ridge polynomial networks. IEEE Transactions on Neural Networks, 6:610-622, May 1995.
[17] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. In Advances in Neural Information Processing Systems 2, pages 525-532. Morgan Kaufmann, San Mateo, CA, 1990.
[18] S. R. Waterhouse and A. J. Robinson. Constructive algorithms for hierarchical mixtures of experts. In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 584-590. The MIT Press, 1996.
[19] J. Ghosh and K. Tumer. Structural adaptation and generalization in supervised feedforward networks. Journal of Artificial Neural Networks, 1(4):431-458, 1994.
[20] R. Reed. Pruning algorithms - a survey. IEEE Transactions on Neural Networks, 4(5):740-747, 1993.
[21] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.
[22] H. G. C. Traven. A neural network approach to statistical pattern classification by "semiparametric" estimation of probability density functions. IEEE Transactions on Neural Networks, 2(3):366-377, May 1991.
[23] S. Haykin. Adaptive Filter Theory. Prentice-Hall, Englewood Cliffs, NJ, 1991.
[24] S. Roberts and L. Tarassenko. A probabilistic resource allocating network for novelty detection. Neural Computation, 6:271-284, 1994.
[25] S. Schaal and C. G. Atkeson. From isolation to cooperation: An alternative view of a system of experts. In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 605-611. The MIT Press, 1996.
[26] V. Tresp and R. Hofmann. Missing and noisy data in nonlinear time-series prediction. In F. Girosi, J. Makhoul, E. Manolakos, and E. Wilson, editors, Neural Networks for Signal Processing V, pages 1-10. IEEE, 1995.
[27] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth International Group, Belmont, CA, 1984.
[28] J. H. Friedman. Multivariate adaptive regression splines. The Annals of Statistics, 19:1-141, 1991.
[29] K. Tumer and J. Ghosh. Analysis of decision boundaries in linearly combined neural classifiers. Pattern Recognition, 29(2):341-348, Feb. 1996.
[30] M. P. Perrone and L. N. Cooper. Learning from what's been learned: Supervised learning in multi-neural network systems. In Proceedings of the World Congress on Neural Networks, pages III:354-357. INNS Press, 1993.
[31] L. K. Hansen and P. Salamon. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12:993-1002, 1990.
[32] A. J. C. Sharkey. On combining artificial neural networks. Connection Science, 8(3 and 4):299-314, Dec. 1996.
[33] W. Chaer, R. Bishop, and J. Ghosh. A mixture-of-experts framework for adaptive Kalman filtering. IEEE Transactions on Systems, Man and Cybernetics, to appear, 1997.
[34] C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.