Visualization of Learning in Neural Networks Using Principal Component Analysis

Marcus Gallagher and Tom Downs
Neural Networks Laboratory
Department of Electrical and Computer Engineering
University of Queensland, St. Lucia Q. 4072, Australia
[email protected]
ABSTRACT: We propose a new method for visualizing the learning process in artificial neural networks using Principal Component Analysis. The network weights constitute an evolving sequence of data in weight space that can be subjected to this form of analysis. We verify experimentally that the variance in the data is captured largely by the first few principal components. Our experimentation is applied to networks with various numbers of weights, and demonstrates that this approach is a useful technique for the visualization of learning in networks of practical size.

Keywords: Visualization, learning, neural networks
1 Introduction

In the training of neural networks, the concept of the error surface is of central importance. The training process is often described as a directed search of the error surface, attempting to locate a minimum point. For example, the commonly used back-propagation algorithm works by estimating the gradient of the error surface at the current position, and moving in the negative direction of this estimate. Although the error surface provides a useful conceptual notion of learning in feed-forward neural networks, there are problems with actually visualizing the error surface. A network with n weights defines an n-dimensional error surface, so that in practical networks we are generally searching a very high-dimensional space during training.

The area of visualization in neural networks is one which, surprisingly, has received little attention (see [1] for a review). Most methods proposed in this area scale awkwardly to networks with more than a few weights. Others, such as the well-known Hinton diagram, only provide information concerning the magnitude of weights in the network, and so reveal little about how a network behaves during training [2]. Some authors have provided visualization of the error surface by choosing two of the weights in the network and plotting these against the cost function, thereby revealing a 2-D "slice" of the multidimensional error surface [3], [4], [5], but how one should choose pairs of weights for visualization in any given situation remains problematic. In addition, analysis has revealed that the error surface can be surprisingly complex, even for simple problems such as XOR, and that simply visualizing slices through the error surface can provide misleading information [6].
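For reference, the back-propagation step described above is, in its simplest form, the standard gradient-descent update (textbook notation, not specific to the method introduced in this paper):

$$ w(t+1) = w(t) - \eta \, \nabla E\big(w(t)\big), $$

where $w(t)$ is the weight vector at step $t$, $E$ is the error function and $\eta$ is the learning rate.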
2 Principal Component Analysis and Weight Data

Principal Component Analysis (PCA) is a statistical technique which may be used to map high-dimensional data onto a space of lower dimensionality [7]. PCA attempts to capture the major directions of variation of a data set by performing the equivalent of a rotation of the original data space. After this is done, the principal component corresponding to the largest principal value is the direction of greatest variance in the data. Ignoring principal components with the smallest principal values corresponds to performing a dimensionality reduction, with the data being effectively projected onto a lower-dimensional space. In a linear sense PCA is optimal, in that the sum of the squares of the differences between the original distances between points and the projected distances is minimal. To the best of our knowledge, PCA has not previously been used as a mechanism for dimensionality reduction in weight space.

To calculate principal components we employ a technique described in [8]. Let $w = (w_1, \ldots, w_n)$ be a vector containing all the weights in a network, including bias weights. This vector defines a single point in weight space, or equivalently on the error surface. Training a network for $s$ epochs then produces a set of weight vectors $\{w_s\}$, which describes the trajectory followed by the learning process over the error surface. The usual coordinate system in weight space represents each $w$ as a linear combination of a set of $n$ orthonormal basis vectors $u_i$,
$$ w = \sum_{i=1}^{n} w_i u_i, $$
which is rotated under PCA to form a new coordinate system given by
$$ w = \sum_{i=1}^{n} z_i v_i. $$
If we calculate the covariance matrix of the set of vectors $\{w_s\}$,

$$ \Sigma = \sum_{s} (w_s - \bar{w})(w_s - \bar{w})^T, $$

where $\bar{w}$ denotes the mean weight vector, then the principal components and principal values are given respectively by the eigenvectors $v_i$ and eigenvalues $\lambda_i$ of the covariance matrix,

$$ \Sigma v_i = \lambda_i v_i. $$

To reduce the dimensionality of the weight space to some value $d < n$, we simply choose the eigenvectors corresponding to the $d$ largest eigenvalues and discard the remaining $n - d$.
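As an illustration only (not code from the original paper), the computation just described can be sketched in a few lines of Python/NumPy, assuming the weight vectors recorded during training have been stacked into an s-by-n array; the function name and return values are our own choices:

    import numpy as np

    def weight_trajectory_pca(W):
        # W: array of shape (s, n), one recorded weight vector per row.
        W = np.asarray(W, dtype=float)
        w_bar = W.mean(axis=0)            # mean weight vector
        X = W - w_bar                     # centred trajectory
        cov = X.T @ X                     # covariance matrix (n x n), up to scaling
        lam, V = np.linalg.eigh(cov)      # eigenvalues/eigenvectors (ascending order)
        order = np.argsort(lam)[::-1]     # reorder by decreasing eigenvalue
        lam, V = lam[order], V[:, order]
        explained = lam / lam.sum()       # fraction of variance per component
        Z = X @ V                         # coordinates z_i along each component
        return explained, V, Z

Keeping only the first $d$ columns of the returned eigenvector matrix (and of the projected coordinates) performs the dimensionality reduction described above; the overall scaling of the covariance matrix does not affect the variance fractions.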
3 Experimental Details

In order to test the effectiveness of the visualization method, a series of simulations was conducted. Three single-hidden-layer, fully-connected, feed-forward network configurations were chosen; in terms of input-hidden-output units they were 2-2-1, 4-4-1 and 8-8-1, which correspond to n = 9, 25 and 81 weights respectively (including bias weights). For each configuration, 10 randomly generated problems were used. In addition, 10 separate training runs were conducted for each problem, commencing from a different (randomly generated) weight initialization (range [-0.25, 0.25]). The random problems were produced by creating a "teacher" network of the same configuration, with weights generated from a Normal distribution with zero mean and unit variance. Thus a total of 300 training sessions were conducted.

Standard back-propagation was used, with a learning rate of 0.1. A training set of 10000 input patterns was randomly generated for each of the 10 problems, and the required outputs for each pattern were produced using the teacher networks. Weights were updated in stochastic mode (i.e., after each pattern presentation), using examples chosen at random from the training set. Note that this "student-teacher" model of learning guarantees the presence of global minima (i.e., points which are functionally equivalent to the teacher weight configuration), where the value of the error function is zero. Each network was trained for a total of 10000 epochs, and the values of the weights were recorded after every tenth epoch.

Table 1: Percentage of variance captured by PCA (standard deviations in brackets)

Network   PC1              PC2             PC3             PC4             PC5
2-2-1     93.5713 (6.16)   5.0838 (5.28)   1.0063 (0.94)   0.2060 (0.15)   0.0826 (0.07)
4-4-1     89.4590 (7.18)   7.2139 (5.48)   1.6822 (1.32)   0.5311 (0.41)   0.3072 (0.27)
8-8-1     80.5913 (7.62)   8.4673 (3.46)   3.0473 (1.37)   1.6361 (0.77)   1.0928 (0.49)
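The training procedure just described can be summarized in a short sketch (our own illustration, not code from the paper; the sigmoid activations, the input distribution and all function names are assumptions, and the epoch count is reduced for brevity):

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def forward(weights, x):
        # Single hidden layer; a constant 1 is appended for the bias weights.
        W1, W2 = weights
        h = sigmoid(W1 @ np.append(x, 1.0))
        y = sigmoid(W2 @ np.append(h, 1.0))
        return h, y

    def backprop_step(weights, x, t, lr=0.1):
        # One stochastic update on the squared error for pattern (x, t).
        W1, W2 = weights
        h, y = forward(weights, x)
        delta_out = (y - t) * y * (1.0 - y)                    # output-layer deltas
        delta_hid = (W2[:, :-1].T @ delta_out) * h * (1.0 - h) # hidden-layer deltas
        W2 -= lr * np.outer(delta_out, np.append(h, 1.0))
        W1 -= lr * np.outer(delta_hid, np.append(x, 1.0))

    n_in, n_hid, n_out = 2, 2, 1                               # 2-2-1 configuration (n = 9)
    teacher = [rng.standard_normal((n_hid, n_in + 1)),         # teacher weights ~ N(0, 1)
               rng.standard_normal((n_out, n_hid + 1))]
    student = [rng.uniform(-0.25, 0.25, W.shape) for W in teacher]

    X = rng.uniform(-1.0, 1.0, (10000, n_in))                  # assumed input distribution
    T = np.array([forward(teacher, x)[1] for x in X])          # targets from the teacher

    trajectory = []
    for epoch in range(100):                                   # paper: 10000 epochs
        for i in rng.integers(0, len(X), size=len(X)):         # patterns drawn at random
            backprop_step(student, X[i], T[i])
        if epoch % 10 == 0:                                    # record every tenth epoch
            trajectory.append(np.concatenate([W.ravel() for W in student]))

The rows collected in trajectory can then be passed directly to the PCA sketch given in Section 2.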
4 Results

A summary of the results is given in Table 1. Shown are the network configuration used, and the average percentage of variance captured by each of the first five principal components (each value being averaged over 10 problems and 10 separate training runs for each problem). Shown in brackets is the standard deviation for each mean value. Clearly, the majority of the variance in the weight data can be represented by the first principal value. The results reveal a decreasing trend in the average first principal value as the number of weights increases. This is to be expected, as increasing the number of weights increases the number of possible dimensions in which the training process can move. The variance lost by the first principal value as the number of weights increases is picked up by the remaining principal values, but remains heavily skewed towards the first values. What this reveals is that whilst training is a search across an n-dimensional weight surface, the intrinsic dimensionality of the path taken by back-propagation is in fact much smaller than n. We can therefore view the learning path along the first few principal components (in fact, usually only the first one or two) without any significant loss of information.

An example of the visualization provided by our method is shown in Fig. 1, which shows the mean-squared error (MSE) plotted against the first two principal components for a 2-2-1 network configuration. For this particular example, the first two principal components capture 83.00% and 14.12% of the variation in the data respectively. Plotting point values in this way gives some idea of the velocity of the trajectory (i.e., rapid change in both MSE and position initially, progressing to small movements towards the end of learning). Initially, the trajectory moves in a direction close to that of the second principal component, but a region of low error (MSE of approximately 0.0003) is rapidly found, where the network remains for the rest of training, moving in a direction similar to that of the first principal component. Clearly, the trajectory has revealed that from this particular starting point, there is a relatively steep gradient on the error surface (in the direction of the second principal component), which leads to a region of low error, sloping gradually in the direction of the first principal component. This final region must also be somewhat flat, as there is only a small amount of movement in the direction of the second principal component, and almost no movement (< 2.88% of total variance) in any other direction. The fact that back-propagation slows down as the error becomes small is also observable, from the amount of time spent following the gradient along the first principal component.

In Fig. 2, a 4-4-1 configuration network has been trained, and the training process has become trapped in a region of local minima (or, more precisely, a sub-optimal region of low gradient); the MSE is approximately 0.0155. The trajectory moves steadily in the positive direction along the first principal component. For the second principal component, the direction of movement gradually changes from positive to negative.
Figure 1: Examples of Learning Trajectories. MSE plotted against the first and second principal components: (a) 2-2-1 network; (b) 4-4-1 network.
The overall displacement in the direction of the second principal component is quite small, implying that movement along it has contributed little to the change in MSE. We have observed similar behaviour for higher principal components, indicating that the small amount of information lost in considering only the first one or two principal components is not significant to the overall visualization of the learning process. For this training run, 80.69% of the variance was captured by the first principal component, and 9.33% by the second. In comparison to Fig. 1, this trajectory is noticeably "rougher"; the underlying direction followed is affected by random, noise-like effects in both error and direction. This shows that the error surface is subject to noticeable changes in error over small changes in position. Unlike the example shown in Fig. 1, a smooth path was not found in following the gradient of this error surface from this starting point. The other training runs conducted on this problem revealed similar behaviour, suggesting this is a general property of the error surface (and therefore highly problem dependent).

The majority of the curves produced for the 8-8-1 networks revealed similar behaviour to that shown in Fig. 1 and Fig. 2. While most of the results produced were fairly "well-behaved", in that the trajectory dropped quickly to some minimum and remained there (as in Fig. 1 and Fig. 2), this is not always the case. Fig. 3 shows the trajectory produced by one of the 2-2-1 networks. This kind of trajectory was observed for 9 of the 10 training runs conducted on this particular problem. Clearly, our choice of starting-point initialization for this problem often places the network in a sub-optimal minimum, where it remains throughout training. The fact that weight updates cause oscillatory behaviour suggests that the network is trapped in a region with steep gradient in both directions along the first principal component. For this particular problem the first principal component captures 99.52% of the variance, indicating that the network is trapped in a steep "rain-gutter-like" local region, with steep positive gradient along the first principal component and almost no gradient in all other directions.
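For completeness, a plot of this kind can be produced from a recorded trajectory with a few lines of matplotlib (again our own sketch, reusing the weight_trajectory_pca helper sketched in Section 2; the mse argument is assumed to hold the error measured at the same recording points):

    import matplotlib.pyplot as plt

    def plot_learning_trajectory(trajectory, mse):
        # trajectory: (s, n) array of recorded weight vectors;
        # mse: length-s array of errors measured at the same points.
        explained, _, Z = weight_trajectory_pca(trajectory)
        fig = plt.figure()
        ax = fig.add_subplot(projection="3d")
        # Individual markers make the speed of the trajectory visible:
        # widely spaced points early in training, tight clusters later.
        ax.plot(Z[:, 0], Z[:, 1], mse, marker=".")
        ax.set_xlabel("1st Principal Component")
        ax.set_ylabel("2nd Principal Component")
        ax.set_zlabel("MSE")
        ax.set_title(f"First two PCs: {100.0 * explained[:2].sum():.1f}% of variance")
        plt.show()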
Figure 2: Local minimum behavior for one example problem. MSE plotted against the first and second principal components for a 4-4-1 network.

5 Conclusion

We have described a method for visualizing the learning process in a neural network which allows us to observe the properties of the error surface along the path taken by the network during training. Experimental results indicate that this method is applicable to networks of a practical size, unlike previous visualization methods. Problems defined by randomly created teacher networks were chosen to illustrate behaviour that was free of any problem dependency. In addition, our method is algorithm and model independent, and can therefore be applied to a wide range of computational problems, including not only the study of various learning algorithms and the effects of varying their parameters, but also more general optimization problems. Results gained from applying this method to more practical problems will be made available for presentation at the conference.
Acknowledgments

This work has been supported by a Departmental Scholarship from the Department of Electrical and Computer Engineering, University of Queensland.
References

[1] Mark W. Craven and Jude W. Shavlik, "Visualizing learning and computation in artificial neural networks", Tech. Rep. 91-5, University of Wisconsin Computer Sciences Department, 1991.
[2] J. Wejchert and G. Tesauro, "Visualizing processes in neural networks", IBM Journal of Research and Development, vol. 35, no. 1/2, pp. 244-253, 1991.
[3] Don R. Hush, John M. Salas, and Bill Horne, "Error surfaces for multi-layer perceptrons", in Proc. International Joint Conference on Neural Networks, Seattle, 1991, vol. I, pp. 759-764.
[4] Frederic Jordan and Guillaume Clement, "Using the symmetries of a multi-layered network to reduce the weight space", in Proc. International Joint Conference on Neural Networks, Seattle, 1991, vol. II, pp. 391-396.
[5] Yi Shang and Benjamin W. Wah, "Global optimization for neural network training", IEEE Computer, vol. 29, no. 3, pp. 45-54, 1996.
[6] Leonard G. C. Hamey, "The structure of neural network error surfaces", in Proc. Sixth Australian Conference on Neural Networks, Sydney, 1995, pp. 197-200.
[7] J. Edward Jackson, A User's Guide to Principal Components, Wiley, 1991.
[8] Christopher M. Bishop, Neural Networks for Pattern Recognition, Clarendon Press, 1995.