PROBABILITY OF ERROR, MAXIMUM MUTUAL INFORMATION, AND SIZE MINIMIZATION OF NEURAL NETWORKS

Waleed Fakhr*, M. Kamel**, and M. I. Elmasry*
*VLSI Research Group, Electrical and Computer Eng. Dept.
**PAMI Research Group, Systems Design Eng. Dept.
University of Waterloo, Waterloo, Ontario, Canada, N2L 3G1

ABSTRACT
In this paper we employ an upper bound on the Bayes error probability as a generalization performance criterion for supervised neural network classifiers. We show that maximization of the mutual information is equivalent to minimization of this bound, and leads to a direct implementation of the Bayes framework for classification. We use this criterion both in training neural networks and in minimizing their size by adaptive pruning. We propose and apply a top-down heuristic for adaptively pruning nodes (weights). We apply our approach to both probabilistic neural networks and the multilayer perceptron. Results on two benchmark problems are given, verifying the validity of our approach.

1. INTRODUCTION
In statistical pattern recognition, minimum risk of misclassification, and equivalently minimum probability of error P(e), is achieved by implementing the Bayes framework, which assigns a pattern to the class with the highest a posteriori probability [1]. To do so, the classifier has to estimate the a posteriori probability of each class using a suitable model with a set of parameters, and a proper method for estimating these parameters from the given training data. On the one hand, model simplicity is highly desirable for two reasons. Firstly, it allows for efficient implementations of the classifier in either software or VLSI chips. Secondly, the less complex the model, the less its tendency to overfit the training data, and the better its generalization capability, especially when training data are poor both in quality and quantity. This is known as the parsimony principle [2], or the Occam razor rule [3]. On the other hand, optimal learning of classifier parameters would estimate those parameters such that the Bayes probability of error P(e) is minimum, or equivalently the generalization probability is maximum. Unfortunately, P(e) is usually impossible to express in a functional form, hence we rely on a tight upper bound P_u(e), which, when minimized, also reduces P(e). In this paper we use a variation of the Chernoff bound as P_u(e) [4], and we show that when it is minimized with respect to the model parameters the Bayes framework is directly implemented. We should expect that, since the minimization of P(e) maximizes the generalization probability, the complexity of the classifier will be reduced by pushing down those parameters which cause overfitting. This is verified in our results, which show that P(e) decreases consistently with complexity reduction, and that training with P(e) minimization results in less complex networks. In that sense, we propose and apply a top-down heuristic which starts with an oversized classifier and gradually reduces its size, using P_u(e) as a performance criterion. We have experimented with our approach on both the adaptive probabilistic neural network APNN [5] and the multilayer perceptron MLP. We also compared training the MLP with ordinary minimization of squared error and with the proposed minimization of the probability of error. The latter offers faster convergence, less weight complexity (smaller description lengths of weights at each layer), and better generalization.
2. AN UPPER BOUND FOR BAYES ERROR PROBABILITY

A tight upper bound on P(e) is given in [4]: P_u(e) = I/2, where I is the equivocation given by (for a two-class case):

I = -\int \left[ P(X,C_1)\,\log\frac{P(X,C_1)}{P(X)} + P(X,C_2)\,\log\frac{P(X,C_2)}{P(X)} \right] dX    (1)
where P(X,C_j) is the joint probability of class C_j and input pattern X, and P(X,C_1) + P(X,C_2) is the unconditional probability P(X). We have M = -I, where M is the mutual information (minus a constant dependent on the class prior probabilities); thus maximizing M directly minimizes the probability of error P(e). For a classifier which models the joint probability of each class by f(X,C_j,\theta_j), where \theta_j is the set of parameters for class C_j, the mutual information is given by:

M(\theta) = \int \left[ P(X,C_1)\log f(X,C_1,\theta_1) + P(X,C_2)\log f(X,C_2,\theta_2) \right] dX - \int P(X)\,\log\left[ f(X,C_1,\theta_1) + f(X,C_2,\theta_2) \right] dX    (2)
where \theta is the set of parameters in the classifier. Naturally M(\theta) differs from the true M. It can be shown that M(\theta) - M \le 0, with equality if f(C_j|X) = P(C_j|X), where f(C_j|X) = f(X,C_j) / [f(X,C_1) + f(X,C_2)], i.e., the modeled probabilities approximate the true a posteriori probabilities, hence implementing the Bayesian framework. This holds as long as the model satisfies the Bayesian probabilistic constraints [1,3]. Thus, maximizing M(\theta) with respect to the model parameters \theta pushes the inequality toward zero, and leads to the Bayesian approximation. The mutual information M can be expressed in terms of the likelihood L, since:

L = \int \left[ P(X,C_1)\log P(X,C_1) + P(X,C_2)\log P(X,C_2) \right] dX    (3)
Hence, M = L + E, where E is the classifier system entropy given by:

E = -\int P(X)\,\log P(X)\,dX    (4)
and P(X) is the unconditional probability. Maximizing M is obviously different from maximizing L: the former corresponds directly to minimizing P(e), while the latter does not. It is well known that entropy maximization leads to the least biased estimate of a probability [3]; in other words, it leads to the estimate least committed to the particular training set given. In that sense, the E-maximization term included in M acts as a regularizer of the biased maximum-likelihood estimate: it helps penalize overfitting of the training data, and serves as a penalty term for adding complexity to the model, as will be shown in the results. Training with P_u(e) minimization (i.e., mutual information maximization) is thus equivalent to training with maximum likelihood plus a complexity regularization term. The mutual information criterion therefore has great similarity to other information-theoretic approaches used for complexity regularization, such as Akaike's criterion and its variations [6]. Training a neural network with mutual information maximization thus eliminates the need to include an external complexity-penalizing term in the cost function.
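To make the decomposition concrete, the following Python sketch (not from the paper) evaluates M as the sum of a likelihood term L and an entropy term E from a pair of modeled joint densities, replacing the integrals by averages over training samples, as the discrete approximation of the next section does. The function names and the toy Gaussian class models are illustrative assumptions.

```python
import numpy as np

def mutual_information_criterion(f1, f2, X1, X2):
    """Approximate M = L + E from modeled joint densities f1(x) = f(x,C1), f2(x) = f(x,C2).

    X1, X2: arrays of training patterns from classes C1 and C2.
    Integrals are replaced by averages over the pooled training data.
    """
    X = np.concatenate([X1, X2])
    N = len(X)
    # likelihood term: average log of the modeled joint density of each pattern with its class
    L = (np.sum(np.log(f1(X1))) + np.sum(np.log(f2(X2)))) / N
    # entropy term: minus the average log of the modeled unconditional density f(x) = f1(x) + f2(x)
    E = -np.sum(np.log(f1(X) + f2(X))) / N
    return L + E, L, E

# toy example with 1-D Gaussian joint models (illustrative parameters only)
f1 = lambda x: 0.5 * np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)
f2 = lambda x: 0.5 * np.exp(-0.5 * (x + 1.0) ** 2) / np.sqrt(2 * np.pi)
X1 = np.random.normal(+1.0, 1.0, 200)
X2 = np.random.normal(-1.0, 1.0, 200)
M, L, E = mutual_information_criterion(f1, f2, X1, X2)
```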
3. DISCRETE APPROXIMATION AND STOCHASTIC LEARNING

We approximate the mutual information, and hence the likelihood and the entropy, by summations over the available data instead of integrations, where N is the number of training patterns; thus (2) becomes:

M = \frac{1}{N}\left[ \sum_{X \in C_1} \log f(X,C_1) + \sum_{X \in C_2} \log f(X,C_2) \right] - \frac{1}{N} \sum_{X} \log\left[ f(X,C_1) + f(X,C_2) \right]    (5)
where the first term is the likelihood and the second is the entropy. These approximations are valid if the data set is sufficiently large and is observed according to the underlying probabilities. We use a stochastic steepest-ascent maximization of M in the learning; for example, if we consider a parameter \theta in the model of f(X,C_1) and the criterion M, the updating rule for \theta is given by:

\Delta\theta = \alpha \left[ \frac{1}{f(X,C_1)} - \frac{1}{f(X,C_1) + f(X,C_2)} \right] \frac{\partial f(X,C_1)}{\partial \theta}
for training data {X} from class C_1, and

\Delta\theta = -\alpha\, \frac{1}{f(X,C_1) + f(X,C_2)}\, \frac{\partial f(X,C_1)}{\partial \theta}
for training data {X} from class C_2, where \alpha is the learning rate.
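The update rules above can be rendered as a small Python sketch. The interface (callables for the two modeled densities and for the gradient of f(X,C_1,\theta) with respect to a scalar parameter \theta) is an assumption made for illustration, not the authors' implementation.

```python
def mmi_update(theta, x, from_class1, f1, f2, grad_f1_theta, lr=0.01):
    """One stochastic steepest-ascent step on M for a parameter theta of f(X,C1,theta).

    f1, f2:        callables returning the modeled joint densities f(x,C1,theta) and f(x,C2)
    grad_f1_theta: callable returning d f(x,C1,theta) / d theta
    The gradient follows from the discrete criterion (5): for a pattern of class C1 both the
    likelihood and entropy terms depend on theta, for class C2 only the entropy term does.
    """
    g = grad_f1_theta(x, theta)
    denom = f1(x, theta) + f2(x)          # modeled unconditional density f(x)
    if from_class1:
        step = g / f1(x, theta) - g / denom
    else:
        step = -g / denom
    return theta + lr * step
```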
4. ADAPTIVE PROBABILISTIC NEURAL NETWORKS

In this class of neural networks the probability of each class is modeled by a sum of Gaussians [5,7]:

f_j(X) = \frac{1}{K} \sum_{i=1}^{K} \Phi_i(X, \theta_i)
where K is the number of Gaussian windows, each with its own set of parameters, namely the center vector m_i and the variance \sigma_i^2, and P is the input dimensionality. This model is equivalent to the Parzen technique [1], and to the probabilistic neural network PNN [7], when K is equal to the number of training patterns and each window is centered at a pattern. In this paper we start with the PNN, with a variance small enough to reduce the training error to zero. We then train the network with a variation of the leave-one-out method [1] and a stochastic steepest-ascent maximization of the mutual information to adjust the centers and the widths of the windows [5].
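As an illustration of this class model, a minimal numpy sketch of the sum-of-Gaussians density is given below. The isotropic Gaussian window and the 1/K weighting follow the description above, while the exact normalization (for example, how class priors are absorbed) is an assumption.

```python
import numpy as np

def apnn_class_density(X, centers, sigmas):
    """Class-conditional joint model as an equally weighted sum of K Gaussian windows.

    X:       (P,)  input pattern
    centers: (K,P) window centers m_i
    sigmas:  (K,)  window widths sigma_i (isotropic, one variance per window)
    Returns (1/K) * sum_i N(X; m_i, sigma_i^2 I), as in the Parzen / PNN model.
    """
    P = len(X)
    d2 = np.sum((centers - X) ** 2, axis=1)          # squared distance to each center
    windows = np.exp(-0.5 * d2 / sigmas ** 2) / ((2 * np.pi) ** (P / 2) * sigmas ** P)
    return np.mean(windows)
```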
5. TOP-DOWN HEURISTIC FOR SIZE MINIMIZATION

To minimize the size of the probabilistic neural network we apply a top-down heuristic, which starts with the leave-one-out trained PNN. At each step one window is removed or added only if the mutual information M would increase. This process is continued until M stops increasing. Afterwards, at each step, pruning is followed by stochastic steepest-ascent training to restore the value of M. The heuristic stops if the learning fails to restore the last M value. The use of stochastic learning to adjust the parameters eliminates the need for an exhaustive search method, and is capable of producing suboptimal results, both in size and in performance. We compared this heuristic with a bottom-up one, and with the reduced Parzen classifier of Fukunaga [1]; details of this comparison are given in [5]. Tables (1-4) show the results of this heuristic for two benchmark problems with different sizes, and Figure (1) shows the behavior of the mutual information, entropy, and likelihood with pruning. The results are in agreement with what we expected: E increases and L decreases with complexity reduction, while the resultant M = L + E increases, i.e., the probability of error decreases and the generalization improves.
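A schematic Python rendering of this two-phase heuristic is given below. The helper callables (mutual_information, remove_window, restore_window, train_stochastic) are hypothetical interfaces introduced for illustration, and the rule for selecting which window to prune in the second phase is left as a placeholder.

```python
def top_down_prune(model, mutual_information, remove_window, restore_window, train_stochastic):
    """Two-phase top-down pruning driven by the mutual information M (schematic)."""
    # Phase 1: remove any single window whose removal alone increases M;
    # repeat until no such window exists.
    M_best = mutual_information(model)
    changed = True
    while changed:
        changed = False
        for i in range(model.num_windows):
            remove_window(model, i)
            M_try = mutual_information(model)
            if M_try > M_best:
                M_best = M_try
                changed = True
                break                      # rescan the smaller model
            restore_window(model, i)

    # Phase 2: prune, then retrain by stochastic steepest ascent to restore M;
    # stop when retraining fails to recover the last value of M.
    while model.num_windows > 1:
        i = 0                              # placeholder window-selection rule (see [5])
        remove_window(model, i)
        M_new = train_stochastic(model)
        if M_new < M_best:
            restore_window(model, i)
            break
        M_best = M_new
    return model
```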
6. MULTILAYER PERCEPTRON AS A PROBABILISTIC CLASSIFIER

Let the unbounded output of the MLP (single output) be y(X,\theta), where \theta is the set of weights. We can view the MLP as a nonlinear transformation which maps the multi-dimensional input X to a one-dimensional output y (i.e., a dimensionality-reduction transform). We further assume that this transformation succeeds in making the probability of y unimodal, i.e., representable by a single one-dimensional Gaussian function. Let us assume that we can use the same MLP to model both
[email protected]) or simply f l and
[email protected],) or f2,which are the joint probability of y with class C1 and C2 respectively. These probabilities are modeled as:
with the assumption of unity variance. Let d_j be +1 for C_1 and -1 for C_2; then the likelihood function of the classifier is given by (neglecting the constant):

L = -\frac{1}{2N}\left[ \sum_{C_1} (y-1)^2 + \sum_{C_2} (y+1)^2 \right]
i.e., maximizing the likelihood function is equivalent to minimizing the squared-error criterion used in the ordinary training of the MLP. The minimum-probability-of-error training, i.e., maximum mutual information training, is to maximize M, given by:

M = -\frac{1}{2N}\left[ \sum_{C_1} (y-1)^2 + \sum_{C_2} (y+1)^2 \right] - \frac{1}{N}\sum_{X} \log\left[ \exp\left(-\frac{(y-1)^2}{2}\right) + \exp\left(-\frac{(y+1)^2}{2}\right) \right]
so that maximizing M is equivalent to minimizing the squared error plus maximizing the entropy of the model. We have applied both criteria to the MLP for the benchmark problems, and the results show that mutual information training of the MLP converges faster to good generalization (smaller probability of error), generalizes better than ordinary backpropagation training, and leads to a less complex network (smaller description lengths of weights at each layer). We also applied the top-down heuristic to prune the MLP trained with both techniques. We started with an oversized MLP, with 536 weights in 3 layers. As in the adaptive PNN case, reducing the complexity resulted in a monotonic increase in entropy and mutual information, i.e., a decrease in P(e). Tables (5-8) show the results of training and pruning the MLP, while Figure (2) shows the effect of pruning on the mutual information, entropy, and likelihood, which is minus the sum of squared errors in this case.
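A small numpy sketch of this criterion for a scalar MLP output is given below; it is an illustration under the unity-variance Gaussian output model above, not the authors' code, and gradients for backpropagation would be taken through both terms.

```python
import numpy as np

def mlp_mmi_criterion(y, d):
    """Mutual-information criterion for a scalar MLP output y with targets d in {+1, -1}.

    Joint models of the output: f1 ~ N(y; +1, 1), f2 ~ N(y; -1, 1) (constants dropped).
    Returns (M, L, E), where L is (minus) the usual squared-error term and E is the
    entropy term of the modeled output density f1 + f2.
    """
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=float)
    N = len(y)
    L = -0.5 * np.sum((y - d) ** 2) / N                 # likelihood term = -SSE / (2N)
    f1 = np.exp(-0.5 * (y - 1.0) ** 2)
    f2 = np.exp(-0.5 * (y + 1.0) ** 2)
    E = -np.sum(np.log(f1 + f2)) / N                    # entropy (regularization) term
    return L + E, L, E
```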
7. BENCHMARK PROBLEMS RESULTS

Two benchmark problems were used. Both are two-dimensional, and each one is run twice, with 50 and 100 training patterns per class, and 1000 independent test patterns per class (not used in training). The first problem is the generalized XOR problem, with inputs between (-1,1) drawn from uniform distributions. The second problem is similar, but with Gaussian distributions centered at each quarter of the two-dimensional input space, [(0.5,0.5),(-0.5,-0.5)] and [(-0.5,0.5),(0.5,-0.5)], each with a variance of 0.1. The first problem is separable, while the second is not, due to the overlapping of the Gaussians.
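A hedged numpy sketch of how such benchmark data could be generated is shown below; the sampling details (oversample-then-trim for the uniform case, equal split between the two cluster centers of each class) are assumptions, as the paper does not specify them.

```python
import numpy as np

def xor_uniform(n_per_class, seed=0):
    """Generalized XOR: inputs uniform on (-1,1)^2; class C1 if x1*x2 > 0, else C2."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(4 * n_per_class, 2))    # oversample, then trim
    in_c1 = X[:, 0] * X[:, 1] > 0
    return X[in_c1][:n_per_class], X[~in_c1][:n_per_class]

def xor_gaussian(n_per_class, seed=0):
    """Gaussian XOR: clusters at the quadrant centers, variance 0.1 (classes overlap)."""
    rng = np.random.default_rng(seed)
    std = np.sqrt(0.1)
    c1 = [(0.5, 0.5), (-0.5, -0.5)]                           # class C1 centers
    c2 = [(-0.5, 0.5), (0.5, -0.5)]                           # class C2 centers
    X1 = np.concatenate([rng.normal(c, std, size=(n_per_class // 2, 2)) for c in c1])
    X2 = np.concatenate([rng.normal(c, std, size=(n_per_class // 2, 2)) for c in c2])
    return X1, X2
```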
7.1. ADAPTIVE PROBABILISTIC NEURAL NETWORK RESULTS

Tables (1-4) show the results of the PNN trained with the leave-one-out, MMI (maximum mutual information) approach, versus the pruned version denoted by APNN. The tables show the M, E, and L of each network, the size (the number of windows), and the test error (given as the number of misclassified patterns out of 2000 test patterns).

Table(1): first problem, 50/class

         MI        L        E       error   windows
PNN     -0.037   -0.685    0.648    123     100
APNN    -0.033   -7.838    7.805     98      23
Table(2): first problem, 100/class

         MI        L        E       error   windows
PNN     -0.0035  -1.1387   1.135     87     200
APNN    -0.0003  -8.3      8.3003    42      20
Table(3): second problem, 50/class

         MI        L        E       error   windows
PNN     -0.185   -1.588    1.403    277     100
APNN    -0.00113 -9.27     9.269    266       4
Table(4): second problem, 100/class

         MI        L        E       error   windows
PNN     -0.1     -1.7      1.6      257     200
APNN    -0.0005  -11.4     11.399   203       4
Figure (1) shows the effect of pruning on the mutual information, entropy, and likelihood. Note that M is multiplied by 10 to fit on the same scale.
7.2. MULTILAYER PERCEPTRON RESULTS

Tables (5-8) show the results of the MLP trained with ML and MMI, denoted by MLO and MMIO, as well as the results of the MLP trained with ML and MMI after pruning, denoted by MLP and MMIP, respectively. Figure (2) shows the effect of pruning on M, E, and L, where M is multiplied by 10 to fit the scale. The last column of each table shows the variance of the first-layer weights; the smaller the variance, the less complex the network.

Table(5): first problem, 50/class (columns: MI, L, E, error, weights, Var.; rows: MLO, MMIO, MLP, MMIP)
Table(6): first problem, 100/class (columns: MI, L, E, error, weights, Var.; rows: MLO, MMIO, MLP, MMIP)
Table(7): second problem, 50/class

         MI        L        E       error   weights   Var.
MLO     -0.1411  -0.198    0.057    299     536       0.8
MMIO    -0.1166  -0.837    0.72     283     536       0.6
MLP     -0.13    -0.24     0.11     280     300       0.67
MMIP    -0.1     -0.74     0.64     277     300       0.55
Table(8): second problem, lOO/class
I
MI
I
L
I
E
I
ermr
I
heights
I
Var.
I
~~
MLO
-0.111
-0.122
MMIO
-0.08
-0.75
MLP
-0.07
-0.33
MMIP
I
-0.04
I
-0.98
2%
536
0.78
0.67
224
536
0.61
0.26
203
300
0.65
0.011
1
-0.94
I
200
I
300
I
0.55
I
8. SUMMARY

We have used an upper bound on the Bayes probability of error as a criterion for neural network classifier training and design. We have also shown that the maximization of mutual information is equivalent to the minimization of this bound. Using this criterion for training we obtain less complex networks, and by employing a top-down heuristic for pruning trained oversized networks, effective size minimization and better generalization are reached.

REFERENCES

[1] Keinosuke Fukunaga; Introduction to Statistical Pattern Recognition, Academic Press, Second Edition, 1990.
[2] R.L. Kashyap; A Bayesian Comparison of Different Classes of Dynamic Models Using Empirical Data, IEEE Transactions on Automatic Control, Vol. AC-22, pp. 715-727, Oct. 1977.
[3] Maximum Entropy and Bayesian Methods, Laramie, Wyoming, 1990. Edited by W.T. Grandy, Jr. and L.H. Schick. (Complex Systems, and Probability and Math. Sections.)
[4] Martin E. Hellman and Josef Raviv; Probability of Error, Equivocation, and the Chernoff Bound, IEEE Transactions on Information Theory, Vol. IT-16, no. 4, July 1970, pp. 368-372.
[5] Waleed Fakhr and M.I. Elmasry; Mutual Information Training and Size Minimization of Adaptive Probabilistic Neural Networks. Accepted for publication in the International Symposium on Circuits and Systems ISCAS'92.
[6] David B. Fogel; An Information Criterion for Optimal Neural Network Selection, IEEE Transactions on Neural Networks, Vol. 2, September 1991, pp. 490-497.
[7] D.F. Specht; Probabilistic Neural Networks and the Polynomial Adaline as Complementary Techniques for Classification, IEEE Transactions on Neural Networks, Vol. 1, no. 1, March 1990, pp. 111-121.
Fig(1): Complexity Reduction of PNN (mutual information, likelihood, and entropy curves).

Fig(2): Complexity Reduction of MLP (mutual information, likelihood, and entropy curves).