Exploring the Capacity of Simple Neural Networks

Aarnoud Hoekstra and Robert P.W. Duin

Department of Applied Physics, Pattern Recognition Group
Lorentzweg 1, 2628 CJ Delft, The Netherlands
e-mail: [email protected]

Published in: Proceedings of the first annual conference of the ASCI, pages 56-62, May 16-18, 1995.
Abstract

The application of neural networks to pattern recognition problems poses several difficult questions, such as initialization and the choice of the correct architecture. In this paper the problem of how networks find non-linear solutions is addressed. By examining the properties of a simple multilayer perceptron with one input, one output and sigmoidal transfer functions, we try to show to what extent the network is able to find non-linear solutions when it starts from a linear initialization. In order to formalize the network's behaviour, the discrimination capacity DC is introduced. This measure indicates whether the network has found a non-linear solution. Experiments show that the values for DC expected from the theory are indeed reached, but the mathematical analysis of the network is too complicated to point out exactly where the specific changes of DC occur.
1 Introduction

When applying neural networks (in our case multilayer perceptrons with sigmoidal transfer functions) to pattern classification tasks, several difficult problems have to be dealt with. Well-known examples are the choice of the architecture, the learning algorithm and the sample size. In this paper a problem of a more theoretical nature is addressed: how do neural networks find non-linear solutions when they start from a linear initialization? This question is important since most pattern recognition problems are non-linear, whereas neural networks always start learning from a linear solution, i.e. with the weights initialized around zero. The problem is treated both theoretically and experimentally, using a simple classification problem and a simple neural network. This is done in order to derive theoretical and experimental properties that can also be applied to larger networks. For a simple network, i.e. one that has only a single input, it can easily be identified when a non-linear solution has been found: the network's output function then reveals that the network discriminates more than two regions. This discrimination capacity of the network will be referred to as DC; it is defined as the maximum number of regions the network can discriminate. Throughout this paper DC_actual denotes the actual DC of a given network.
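As an illustration of how DC_actual can be measured for a one-input network, the following sketch samples the output function of the network used in this paper (the 1-2-1 architecture of figure 2, equation (1)) on a grid and counts its turning points; a function with k turning points can be separated into at most k + 2 intervals by a single output threshold. The grid, the tolerance and this operationalization are our own choices, since no formal measurement procedure is prescribed in the text.

```python
import numpy as np

def sigmoid(x):
    """Logistic transfer function f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-np.clip(x, -500.0, 500.0)))

def network_output(x, w1, w2, w3, w4, b1, b2, b3):
    """Output of the one-input network with two hidden units, equation (1)."""
    return sigmoid(w3 * sigmoid(w1 * x + b1) + w4 * sigmoid(w2 * x + b2) + b3)

def dc_actual(weights, x_range=(-8.0, 8.0), n_points=2001, tol=1e-6):
    """Estimate DC_actual: the number of turning points of o(x) plus two,
    i.e. the largest number of input intervals a single output threshold
    could separate."""
    x = np.linspace(*x_range, n_points)
    d = np.diff(network_output(x, *weights))
    d = d[np.abs(d) > tol]          # ignore numerically flat stretches
    if d.size == 0:
        return 1                    # (numerically) constant output: one region
    return 2 + int(np.sum(np.sign(d[1:]) != np.sign(d[:-1])))
```

For weights initialized around zero the estimate is 1 or 2 (a linear solution); an estimate of 3 or 4 indicates that a non-linear solution has been found.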
Traditionally, a worst-case approach is used to determine the capacity of a neural pattern classifier, based on the Vapnik-Chervonenkis dimension ([Vap92]), denoted VC. Note that VC refers to neural networks in general, whereas DC refers to a specific network. Using the Vapnik-Chervonenkis measure, the size of the learning set needed to train a neural network up to a certain error can be determined. Large networks have a large number of parameters, and training those parameters requires large training sets. Baum and Haussler [BH89] have shown that for 2-layer networks VC has an upper bound of 2W log_2(e(h+1)), where h is the number of hidden units and W the number of free parameters or weights.

Experimental research shows, however, that the learning sets normally used for neural networks are much smaller than one would expect from pattern recognition theory ([Sch93, KD94]). The work of Kraaijveld ([KD94, Kra93]) shows that the actual capacity of a classifier is much smaller than VC. However, he only showed the relation between the training procedure used and the actual capacity, and gave no theoretical analysis. Others, for instance Sprinkhuizen-Kuyper and Boers ([SB94]), have made a theoretical analysis of a simple neural network independent of the learning rule. The problem they investigated was quite restricted, although their approach of investigating the error landscape is useful. By examining the error "landscape" of the network we may be able to determine where the network's solutions change from linear to non-linear, i.e. where DC_actual changes. The error "landscape" may contain steep "mountains" and deep "valleys" in which the network gets stuck. Furthermore, there may be flat areas in which the learning algorithm moves around so slowly that it does not converge to a solution, or takes a very long time to find one. In [SB94] the XOR problem was investigated; there the search for minima is relatively simple because there are only four samples. In our case we use large numbers of samples, which makes it hard to write down the network equations for each sample separately. The analysis can, however, be made less complicated by assuming a simple network. Therefore we first make a theoretical analysis of the network's behaviour on a two-class problem, in order to find properties that indicate where the network changes its solution. By using a small network, DC_actual is limited to a small number of values, just as the number of weights is small.

In the next section the classification problem that we used is explained, followed by an analysis of the network behaviour. The paper is concluded with some results obtained from experiments.
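To give an impression of the orders of magnitude involved, the bound can be evaluated for the network studied in the remainder of this paper, which has W = 7 free parameters and h = 2 hidden units; the calculation below is only our own illustration and plays no role in the analysis:

    VC <= 2 W \log_2\bigl(e\,(h+1)\bigr) = 14 \log_2(3e) \approx 42,

whereas this specific network can discriminate at most DC = 4 regions (see section 2).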
2 The classification problem

The problem is to discriminate two Gaussian distributed classes (see figure 1) such that a quadratic discriminant function is needed. Both classes have zero mean: class A is N(0, 1) and class B is N(0, 3) distributed. The simplest suitable neural network architecture was chosen: 1 input unit, 2 hidden units and 1 output unit. In figure 2 the network with its parameters is depicted. The chosen network is able to implement a quadratic decision function (i.e. a linear function per hidden unit) and is in theory even able to discriminate 4 separate regions. This means that DC for this network is 4. The solution to the classification problem is easy to describe: a decision boundary at each crossing of the two class density functions. In the optimal case this results in two decision boundaries, which means that each hidden unit of the network should represent one of them. Experiments show that this is indeed the case: approximately at each of the crossings the output function has an inflection point (see also figure 3).
Figure 1: The two classes for our problem; both classes are Gaussian distributed with parameters (mu_1 = 0, sigma_1 = 1) and (mu_2 = 0, sigma_2 ≈ 1.7).
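Assuming equal class priors, the crossings of these two densities, and hence the optimal decision boundaries, can be computed explicitly (the equal-prior assumption and the numerical value are our own illustration):

    \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} = \frac{1}{\sqrt{6\pi}}\, e^{-x^2/6}
    \;\Rightarrow\; e^{-x^2/3} = \frac{1}{\sqrt{3}}
    \;\Rightarrow\; x = \pm\sqrt{\tfrac{3}{2}\ln 3} \approx \pm 1.28.

These are the two points near which the inflection points of the trained network's output function are expected (compare figure 3).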
Figure 2: The network under investigation, with input x, output o(x), weights w1, w2, w3, w4 and biases b1, b2, b3.
3 Network analysis

In order to understand the behaviour of the network a theoretical analysis was made. From figure 2 it follows that the output of the network is described by 7 parameters and one input x, or formally:
    o(x) = f\bigl( w_3 f(w_1 x + b_1) + w_4 f(w_2 x + b_2) + b_3 \bigr),    (1)
where f(x) is the well-known sigmoid f(x) = 1/(1 + e^{-x}). As error function for the network the usual mean squared error is taken, which is expressed by the following equation:
    E = P_A \int_{-\infty}^{\infty} (t_A - o(x))^2 f_A(x)\,dx + P_B \int_{-\infty}^{\infty} (t_B - o(x))^2 f_B(x)\,dx,    (2)

where P_A and P_B are the a priori class probabilities, f_A and f_B the class density functions, and t_A, t_B the targets for class A and class B respectively. Since those parts of the weight space
where the network has a minimal error are of interest, the derivative of E with respect to the parameters w1, w2, w3, w4, b1, b2 and b3 has to be examined. It is expected that the minima lie somewhere away from 0, so that DC_actual is higher there and the state of the network is non-linear. However, it will be hard to detect where DC_actual changes, because in a minimum it can only be detected that DC has changed. This means that there is no direct relation between DC_actual and the gradient of E, but the gradient does give an indication whether the correct path is followed during learning, i.e. the path leading to an optimal solution, and whether it is possible to achieve the required DC_actual. It is assumed that there is at least one point (w1, w2, w3, w4, b1, b2, b3) where E is minimal, so we have to solve:
    \frac{\partial E}{\partial (w_1, w_2, w_3, w_4, b_1, b_2, b_3)} = 0.    (3)
This equation is very hard to solve analytically, since we are looking for a point in a 7-dimensional space. It is therefore necessary to use approximations of the sigmoidal functions. The Taylor expansion seems a good candidate for approximating the output function (around zero the sigmoid expands as f(x) = 1/2 + x/4 - x^3/48 + ...). Note that an exact solution of equation 3 cannot easily be found, but by using a Taylor expansion it is possible to find indications of where minima are. If such a point is found, then by examining the space around it, it can be checked whether it really is a minimum. Furthermore, the expansion gives an indication whether there are dependencies among the weights, i.e. whether some weights can be expressed as a function of the other weights. For instance, if b3 is determined by instantiating (w1, w2, w3, w4, b1, b2), then b3 could be expressed as b3 = H(w1, w2, w3, w4, b1, b2), where H is some as yet unknown function. Note that this should hold for any input x. Consequently the learning algorithm restricts the weight space, because not all parameters can be chosen freely, and therefore the capacity of the network also remains restricted.

Another approach is to try to visualize the network's behaviour by making plots of DC_actual versus 2 weights while the other weights are kept constant. Using this subspace approach, areas in which DC_actual changes may be located. This can be done for all combinations of weight sets.

The last approach is to use knowledge about the behaviour of f(x) when the weights change. It is known that f(x) gets steeper for larger weights and shallower for smaller weights. This can be used to construct, beforehand, weight sets that have a certain DC_actual. In the case of the two-class problem a probable set of weights is (-10, 10, 1, 1, -20, -15, 0). Such a constructed weight set can be used as an initial setting for the network in order to explore the weight space around a minimum.
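The subspace approach can be sketched as follows. The snippet below evaluates the error of equation (2) numerically and, on a grid in the (w3, w4) subspace with the other parameters fixed at the constructed set above, reports both the error and the estimated DC_actual. The equal priors, the targets 0.8/0.2, the grid and the turning-point criterion are assumptions for illustration; they are not the exact settings of our experiments.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -500.0, 500.0)))   # clipped to avoid overflow

def network_output(x, w1, w2, w3, w4, b1, b2, b3):
    # Equation (1)
    return sigmoid(w3 * sigmoid(w1 * x + b1) + w4 * sigmoid(w2 * x + b2) + b3)

# Assumed settings: equal priors, targets 0.8 and 0.2, class densities of section 2.
P_A, P_B, t_A, t_B = 0.5, 0.5, 0.8, 0.2
f_A, f_B = norm(0.0, 1.0).pdf, norm(0.0, np.sqrt(3.0)).pdf

def mse_error(w):
    """Numerical evaluation of the expected squared error of equation (2)."""
    e_A, _ = quad(lambda x: (t_A - network_output(x, *w)) ** 2 * f_A(x), -np.inf, np.inf)
    e_B, _ = quad(lambda x: (t_B - network_output(x, *w)) ** 2 * f_B(x), -np.inf, np.inf)
    return P_A * e_A + P_B * e_B

# Scan the (w3, w4) subspace; the fixed values come from the constructed
# weight set (-10, 10, ., ., -20, -15, 0) mentioned above.
w1, w2, b1, b2, b3 = -10.0, 10.0, -20.0, -15.0, 0.0
xs = np.linspace(-8.0, 8.0, 2001)
for w3 in np.linspace(-10.0, 10.0, 11):
    for w4 in np.linspace(-10.0, 10.0, 11):
        w = (w1, w2, w3, w4, b1, b2, b3)
        d = np.diff(network_output(xs, *w))
        d = d[np.abs(d) > 1e-6]                 # turning points of o(x) give DC_actual
        dc = 1 if d.size == 0 else 2 + int(np.sum(np.sign(d[1:]) != np.sign(d[:-1])))
        print(f"w3={w3:6.1f} w4={w4:6.1f}  E={mse_error(w):.4f}  DC_actual={dc}")
```

A contour plot of the error values on such a grid gives pictures of the kind shown in figure 4.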
4 Results of experiments

The two-class problem requires DC_actual = 3 for an optimal discrimination. By performing experiments we tried to verify whether the network is able to find a solution with DC_actual = 3, preferably with a minimal error. Although not all of the experiments are finished yet, the results so far are satisfactory. The experiments were conducted with ANN/SPRLIB ([KSH94]) and the results were analysed using Matlab and Mathematica. In order to eliminate the effect of a poor initialization, one hundred networks were trained. From these trained networks the best ones were chosen to create a reference network, with which the search in the solution space was performed. Two sets were generated according to the class density functions: a training set of 200 samples (one hundred samples per class) to train the 100 networks, and a test set of 400 samples (200 samples per class) to check the performance and to select the best networks. For t_A and t_B the values 0.8 and 0.2 were chosen.
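For reference, a minimal sketch of this data generation (the random seed is arbitrary and NumPy is used here instead of the original ANN/SPRLIB code):

```python
import numpy as np

rng = np.random.default_rng(0)      # arbitrary seed

def make_set(n_per_class):
    """Class A ~ N(0,1), class B ~ N(0,3); targets 0.8 and 0.2."""
    x = np.concatenate([rng.normal(0.0, 1.0, n_per_class),
                        rng.normal(0.0, np.sqrt(3.0), n_per_class)])
    t = np.concatenate([np.full(n_per_class, 0.8),
                        np.full(n_per_class, 0.2)])
    return x, t

x_train, t_train = make_set(100)    # 200 samples to train the 100 networks
x_test, t_test = make_set(200)      # 400 samples to select the best networks
```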
Figure 3: The output function o(x) of the network versus the input x, together with the training samples.

A new network was constructed by simply averaging the weights of the best performing networks. This may cause problems if the best performing networks have complementary weight sets. Therefore it was necessary to investigate which weights the networks had found; this was done manually, since there were only a few best performing networks. Figure 3 depicts the samples used for training and the output function of the reference network. The figure shows that the points where the network switches classes approximately coincide with the points where the distributions cross. The values for the reference network were as follows:
w1 =  3.9561    w2 =  8.2066    w3 = -1.1699    w4 = 1.7718
b1 = -6.9256    b2 = 10.3333    b3 =  0.7630
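As a check, these reference weights can be substituted into the output function of equation (1); counting the turning points of o(x), as in the sketch of section 1, should then yield DC_actual = 3 (the input range and tolerance below are again our own choices):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Reference network obtained by averaging the best performing networks.
w1, w2, w3, w4, b1, b2, b3 = 3.9561, 8.2066, -1.1699, 1.7718, -6.9256, 10.3333, 0.7630

x = np.linspace(-6.0, 6.0, 1201)
o = sigmoid(w3 * sigmoid(w1 * x + b1) + w4 * sigmoid(w2 * x + b2) + b3)

d = np.diff(o)
d = d[np.abs(d) > 1e-6]                                      # drop numerically flat stretches
print(2 + int(np.sum(np.sign(d[1:]) != np.sign(d[:-1]))))    # expected: 3
```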
It was also investigated whether the network had found a real minimum, by expanding the weights around the minimum. This causes problems due to the dimensionality of the weight space, so we looked at the subspaces spanned by (w1, w2) and (w3, w4). Figure 4 shows the error plot, plus contours, for the expansion of the subspace (w3, w4). The little cross indicates the location of the minimum the network found. Furthermore, the straight (fat) lines indicate how DC behaves in that subspace: it is divided into three portions (DC = 1, 2, 3). The part containing the "+" has DC = 3, the part above it constitutes DC = 1, and the part on the left is DC = 2. Looking into these subspaces revealed that the found minimum is not a global minimum, but the path the network followed would certainly lead to a global minimum. Looking at the subspaces alone is, however, not sufficient; therefore simulations that investigate the whole space at once are still running.
Figure 4: The error plot for the subspace spanned by (w3, w4).
5 Discussion

Although we do not have final conclusions yet, some observations can already be made. When trying to analyse the behaviour of neural networks using a small and simple network, several problems arise:
- Even in this simple case we are not able to give a full analytical treatment. This is mainly due to the fact that the network contains seven parameters, which turns out to be too many. Furthermore, if the number of samples had been limited, as in the XOR problem, the analysis would have been easier. The derivative of the network output function was calculated using Mathematica, but it was about 10 pages long, which was not very useful for analysis.
- The network searches a 7-dimensional space, which is hard to visualize. By fixing sets of weights or by "clever" initialization the exploration of the space becomes easier.
- It was noticed that the network has DC = 4 and that the problem requires DC_actual = 3. However, in quite a few experiments this value was not reached at all. This reveals that even in such a simple case the weight space is very complicated.
The current work focuses on further theoretical analysis using Taylor expansions and on performing more experiments using the techniques described earlier.
References

[BH89] Baum, E.B. and Haussler, D. What size net gives valid generalization? Neural Computation, 1:151-160, 1989.
[KD94] Kraaijveld, M.A. and Duin, R.P.W. The effective capacity of multilayer feedforward network classifiers. In Proceedings of the 12th International Conference on Pattern Recognition, pages B-99 - B-103, 1994.
[Kra93] Kraaijveld, M.A. Small sample behavior of multi-layer feedforward network classifiers: theoretical and practical aspects. PhD thesis, Delft University of Technology, 1993.
[KSH94] Kraaijveld, M.A., Schmidt, W.F., and Hoekstra, A. Annlib/Sprlib: Introduction and reference manual, version 2.3. Technical report, Delft University of Technology, April 1994.
[SB94] Sprinkhuizen-Kuyper, I.G. and Boers, E.J.W. The error surface of the simplest XOR network has no local minima. Technical report, Department of Computer Science, Leiden University, 1994.
[Sch93] Schmidt, W.F. Neural pattern classifying systems. PhD thesis, Delft University of Technology, 1993.
[Vap92] Vapnik, V. Principles of risk minimization for learning theory. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 831-838. Morgan Kaufmann, San Mateo, CA, 1992.