MINIMUM DESCRIPTION LENGTH PRUNING AND MAXIMUM MUTUAL INFORMATION TRAINING OF ADAPTIVE PROBABILISTIC NEURAL NETWORKS

Waleed Fakhr and M.I. Elmasry
VLSI Research Group, Elect. & Comp. Eng. Dept., University of Waterloo, Waterloo, Ontario, Canada, N2L 3G1
Abstract-The major problem in implementing artificial neural networks is their large number of processing units and interconnections. This is partly because their architectures are not optimal in size, where too many parameters are usually used when only a few would suffice. In this paper we apply an approximated version of the minimum description length criterion "MDL" to find optimal size adaptive probabilistic neural networks "APNN" by adaptively pruning Gaussian windows from the probabilistic neural network "PNN". We discuss and compare both stochastic maximum likelihood "ML" and stochastic maximum mutual information "MMI" training applied to the APNN, for probability density function "PDF" estimation and pattern recognition applications. Results on four benchmark problems show that the APNN performed better than or similar to the PNN, and that its size is optimal and much smaller than that of the PNN.

1. INTRODUCTION

The backpropagation "BP" trained multilayer perceptron "MLP" estimates the Bayes decision boundaries between classes by estimating their posterior probabilities in the mean-square sense [1,2]. Although it has proved very successful in many classification applications, the BP trained MLP has several drawbacks. The most serious drawback is that the MLP has the tendency to overfit training data when its architecture has more degrees of freedom, i.e., more free parameters, than necessary. Recently, many researchers have proposed various techniques to reduce the overfitting problem of the MLP [3-6]. Most of these techniques are based on the Bayesian inference approach, where smoothing priors for the model parameters are used during training. Even though employing smoothing priors produces reduced-variance estimates of the weights, there is no guarantee that any of the weights will actually go to zero so that they can be removed. In other words, the actual size of the MLP may remain unaltered. On the other hand, the probabilistic neural network "PNN" [7] estimates the probability density function "PDF" for each class, then uses these estimates to implement the Bayes rule. The PNN, however, suffers from two major drawbacks. Firstly, all the training data must be stored, making the PNN
unattractive for implementation, as well as making it an inefficient representation of the data [8,9]. Secondly, the PNN lacks any form of corrective training, which is very useful in many classification applications. In this paper we propose an adaptive neural network architecture which we call the adaptive probabilistic neural network "APNN" [10,11]. The initial APNN architecture is the PNN, and we adaptively reduce its size by removing Gaussian windows according to an approximated version of the minimum description length criterion "MDL" [8,9,12]. By doing this size reduction, we seek to find a minimum complexity PDF estimate for each data set, in the stochastic complexity sense [12]. After the pruning process is completed, stochastic maximum likelihood "ML" training is used to estimate the location and width of each Gaussian window for each class PDF separately. After the ML training, the APNN can be used for both density estimation and classification. In many situations, however, the sum-of-Gaussians PDF model is not accurate, e.g., when the data clusters are not Gaussian in shape. To overcome this drawback, we propose to use stochastic maximum mutual information "MMI" corrective training to enhance the classification performance of the APNN.
2. MDL PRUNING OF THE APNN
Each class PDF in the APNN is modeled by:

F(X|C_j) = \frac{1}{\sum_{i=1}^{N_j} a_i} \sum_{i=1}^{N_j} a_i \Phi_i(X)    (1)

\Phi_i(X) = \frac{1}{(2\pi)^{p/2} \sigma_i^{p}} \exp\!\left( -\frac{(X - X_i)^T (X - X_i)}{2\sigma_i^{2}} \right)    (2)
where N_j is the total number of training patterns for class C_j, p is the input pattern dimensionality, σ_i is a width parameter for the i-th window, X_i is the training pattern on which the window is centered, and the a_i are integer parameters that take only the values {0,1}, where 0 indicates that the corresponding window Φ_i is removed and 1 indicates that it is included. In this paper we employ the Kullback-Leibler leave-one-out criterion [13] to find a suitable width for the windows of each class separately. For each class we assume an equal σ for all windows.
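As a concrete illustration of the model (1)-(2), the following minimal sketch (our own code, not the paper's; the function names and toy values are assumptions) evaluates F(X|C_j) for a set of stored patterns with binary inclusion flags a_i and a shared width σ:

```python
import numpy as np

def gaussian_window(x, center, sigma):
    """Eq. (2): p-dimensional spherical Gaussian window centered on a stored training pattern."""
    p = center.shape[0]
    norm = (2.0 * np.pi) ** (p / 2.0) * sigma ** p
    return float(np.exp(-np.sum((x - center) ** 2) / (2.0 * sigma ** 2)) / norm)

def class_pdf(x, centers, a, sigma):
    """Eq. (1): F(x|Cj) as the a-weighted, normalized sum of the included windows."""
    windows = np.array([gaussian_window(x, c, sigma) for c in centers])
    return float(a @ windows / np.sum(a))

# Toy usage: 5 stored 2-D patterns, all windows included (a_i = 1), shared width 0.5.
rng = np.random.default_rng(0)
centers = rng.normal(size=(5, 2))
print(class_pdf(np.zeros(2), centers, a=np.ones(5), sigma=0.5))
```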
The KL leave-one-out criterion is given by:

KL = \frac{1}{N_j} \sum_{i=1}^{N_j} \log F(X_i)    (3)

where F(X_i) is the probability density function when the window centered at the pattern X_i is not included:

F(X_i) = \frac{1}{N_j - 1} \sum_{n=1,\, n \neq i}^{N_j} \Phi_n(X_i)    (4)
A plot is made between the KL criterion and σ, and the value corresponding to the maximum KL is taken. The KL criterion is used since it is a simple cross-validation criterion, which enhances the generalization capability of the network by making each Gaussian window extend its influence to its neighbors.
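A minimal sketch of this width selection, assuming the equal-width, all-windows-included setting of (1)-(4) (the function names, the σ grid and the toy data below are ours, not the paper's):

```python
import numpy as np

def loo_kl_score(patterns, sigma):
    """Eqs. (3)-(4): average leave-one-out log-density of one class's training patterns."""
    n, p = patterns.shape
    norm = (2.0 * np.pi) ** (p / 2.0) * sigma ** p
    d2 = np.sum((patterns[:, None, :] - patterns[None, :, :]) ** 2, axis=-1)
    k = np.exp(-d2 / (2.0 * sigma ** 2)) / norm
    np.fill_diagonal(k, 0.0)                         # the window centered at X_i is left out
    f_loo = k.sum(axis=1) / (n - 1)                  # eq. (4)
    return float(np.mean(np.log(f_loo + 1e-300)))    # eq. (3)

def select_width(patterns, sigmas):
    """Scan a grid of candidate widths and keep the one maximizing the KL criterion."""
    return max(sigmas, key=lambda s: loo_kl_score(patterns, s))

# Toy usage: 40 one-dimensional training patterns, as in the 1-D benchmarks.
rng = np.random.default_rng(0)
patterns = rng.normal(size=(40, 1))
print(select_width(patterns, np.linspace(0.05, 2.0, 40)))
```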
To prune the APNN, we apply an approximated version of the minimum description length criterion, as given by Rissanen [8,9]. The approximated MDL criterion for the APNN is given by:

-MDL = M = \frac{1}{N_j} \left[ \sum_{i=1}^{N_j} \log F(X_i|C_j) - \frac{k_j}{2} \log N_j \right]    (5)
where the first term between brackets is the log-likelihood, the second is the complexity-penalizing term, and k_j is the number of existing windows. In this paper we use the MDL criterion in (5) for pruning the APNN until a minimum of MDL (i.e., a maximum of M) is reached, and also as a performance criterion which indicates how good the PDF model is. The minimum description length criterion is employed since it can discover the closest probability distribution to the true one among many competing models [12]. During pruning the only adaptive variables are the a_i integer variables, which take only {0,1} values, and at each step one window is removed, with the corresponding a_i set to zero. The pruning steps are summarized in the following (a code sketch of these steps is given at the end of this section):
(1) Start with the KL-trained PNN, with all a_i equal to one.
(2) Search for the a_i which, when set to zero, results in the highest increase in M. Set this a_i to zero, i.e., remove the corresponding window.
(3) Re-adjust the σ value by using the KL criterion.
(4) Repeat steps 2 and 3 as long as M keeps increasing (or stays constant) at each step.
(5) After we reach the final architecture, we perform maximum likelihood "ML" training to enhance the PDF estimates. At this stage we allow all window parameters to adapt.
In applying the MDL, the window centers and widths are considered as fixed parameters for each competing model, hence the complexity of each model is proportional to the number of windows in that model through the factor (k_j/2) log N_j. The MDL is closely related in that sense to the Akaike criterion, which employs a term k_j; however, the MDL penalizes the number of parameters asymptotically much more severely [8].
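The greedy pruning loop of steps (1)-(5) could be sketched as follows (our own illustration under the paper's stated restrictions: equal width per class, binary a_i, one window removed per step; the width re-adjustment of step (3) is skipped here by keeping σ fixed, and the ML refinement of step (5) is omitted):

```python
import numpy as np

def window_matrix(patterns, centers, sigma):
    """Phi values of every stored window (columns) at every training pattern (rows)."""
    p = patterns.shape[1]
    norm = (2.0 * np.pi) ** (p / 2.0) * sigma ** p
    d2 = np.sum((patterns[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2)) / norm

def mdl_score(patterns, centers, a, sigma):
    """M = -MDL of eq. (5): log-likelihood minus the (k_j/2) log N_j penalty, normalized by N_j."""
    n = patterns.shape[0]
    f = window_matrix(patterns, centers, sigma) @ a / np.sum(a)
    return (np.sum(np.log(f + 1e-300)) - 0.5 * np.sum(a) * np.log(n)) / n

def prune(patterns, sigma):
    """Steps (1)-(4): greedily remove one window at a time while M does not decrease."""
    centers = patterns.copy()          # step (1): start from the PNN (one window per pattern)
    a = np.ones(len(centers))
    best = mdl_score(patterns, centers, a, sigma)
    while np.sum(a) > 1:
        # step (2): find the window whose removal increases M the most
        trials = [(mdl_score(patterns, centers, np.where(np.arange(len(a)) == i, 0.0, a), sigma), i)
                  for i in np.flatnonzero(a)]
        m, i = max(trials)
        if m < best:                   # step (4): stop once M would decrease
            break
        a[i], best = 0.0, m
    return a, best

# Toy usage: two well-separated 1-D clusters; only a few windows should survive.
rng = np.random.default_rng(2)
data = np.vstack([rng.normal(-4, 1, (20, 1)), rng.normal(4, 1, (20, 1))])
a, m = prune(data, sigma=1.0)
print(int(a.sum()), "windows kept, M =", round(m, 3))
```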
3. STOCHASTIC MAXIMUM LIKELIHOOD TRAINING OF THE APNN

In applying the MDL, we restricted the centers of the windows to be fixed, and their widths to be equal at each pruning step. We also restricted the coefficients a_i to take the binary values 0 or 1. These restrictions are tools to simplify the use of the MDL criterion, and to find the minimum number of windows needed in the model. The resultant pruned model's parameters, however, are not the ML estimates in the parameter space, and ML training can be used to obtain these estimates. Maximizing the log-likelihood is equivalent to minimizing the Kullback-Leibler probabilistic distance between the modeled probability and the true one. The ML parameters are obtained by maximizing:

L = \sum_{C_j} \log F(X|C_j)    (6)

with respect to the parameters of the model F(X|C_j), where Σ_{C_j} indicates a sum over the training data set of class C_j. A stochastic gradient-ascent maximization of L is proposed, where the adaptation equations are:
\Delta a_i = \mu_a \left[ \frac{\Phi_i}{\sum_{n=1}^{K_j} a_n \Phi_n} - \frac{1}{\sum_{n=1}^{K_j} a_n} \right]    (7)

\Delta w_i = \mu_w \, \frac{a_i \Phi_i}{\sum_{n=1}^{K_j} a_n \Phi_n} \, (X - w_i)    (9)
where μ_a and μ_w are learning rates which were chosen experimentally such that the likelihood function increases monotonically (a code sketch of these updates is given at the end of this section). It is to be noted that a similar stochastic (Robbins-Monro) ML training was used in [14] for simple Gaussian mixtures, where convergence properties were discussed. Up to this point we have used the APNN to estimate the PDF of the classes, where the quality of these estimates depends on the quality and quantity of the given data and on the accuracy of the sum-of-Gaussians model. In many cases, the ML training is not the best training approach when the APNN is to be used as a classifier, for example, when the training data is poor both in quality and quantity, and/or when the PDF models do not closely approximate the true PDFs. To enhance the classification performance of the APNN, MMI corrective training is proposed.
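A sketch of one stochastic ML step per training pattern, following the coefficient and center updates (7) and (9) as reconstructed above (our own illustration; the learning rates, the positivity clipping of a_i and the toy data are assumptions, and the width update is omitted):

```python
import numpy as np

def ml_step(x, centers, a, sigma, mu_a=0.01, mu_w=0.05):
    """One stochastic gradient-ascent step on the log-likelihood for a single pattern x,
    in the spirit of the reconstructed updates (7) and (9); learning rates are illustrative."""
    p = x.shape[0]
    norm = (2.0 * np.pi) ** (p / 2.0) * sigma ** p
    phi = np.exp(-np.sum((x - centers) ** 2, axis=1) / (2.0 * sigma ** 2)) / norm
    mix = a @ phi                                   # sum_n a_n Phi_n(x)
    resp = a * phi / mix                            # responsibility of each window for x
    a = a + mu_a * (phi / mix - 1.0 / np.sum(a))    # eq. (7): coefficient update
    centers = centers + mu_w * resp[:, None] * (x - centers)   # eq. (9): center update
    return centers, np.clip(a, 1e-6, None)          # keep coefficients positive (assumption)

# Toy usage: refine the two surviving windows of one 1-D class model.
rng = np.random.default_rng(0)
centers, a, sigma = np.array([[1.5], [-5.5]]), np.ones(2), 1.0
samples = np.concatenate([rng.normal(2.0, 1.0, 100), rng.normal(-6.0, 1.0, 100)])
rng.shuffle(samples)
for x in samples[:, None]:
    centers, a = ml_step(x, centers, a, sigma)
print(np.round(centers.ravel(), 2), np.round(a, 2))
```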
4. STOCHASTIC MAXIMUM MUTUAL INFORMATION TRAINING OF THE APNN

A tight upper bound on the Bayes error probability P(e) is given in [15]: P_u(e) = I/2, where "I" is the equivocation. The mutual information "MI" is equal to "-I" plus a constant, hence maximizing MI directly minimizes the P(e) upper bound [10,11]. The MI involves integration over the data space; however, to be able to use the criterion for training we rely on the large-sample approximation. The large-sample approximation of the MI for a 2-class case is:
MI = \frac{1}{N} \left[ \sum_{C_1} \log \frac{p_1 F(X|C_1)}{p_1 F(X|C_1) + p_2 F(X|C_2)} + \sum_{C_2} \log \frac{p_2 F(X|C_2)}{p_1 F(X|C_1) + p_2 F(X|C_2)} \right]    (10)
where Σ_{C_j} indicates a sum over the training data of class C_j, N is the total number of training data, and p_1 and p_2 represent the class prior probabilities, which can also be adaptive, subject to p_1 + p_2 = 1; we assume here that they are equal for simplicity. Similar to ML, we employ a stochastic gradient-ascent approximation to maximize MI with respect to the model parameters. The j-th class model parameters are updated after each pattern according to:
\Delta a_i = Z_j \, \mu_a \left[ \frac{\Phi_i}{\sum_{n=1}^{K_j} a_n \Phi_n} - \frac{1}{\sum_{n=1}^{K_j} a_n} \right]    (11)

where, for a 2-class case, Z_j is equal to:

Z_j = \frac{F(X|C_k)}{F(X|C_j) + F(X|C_k)}    (14)

if the data pattern is from class C_j and the parameter is from the model F(X|C_j), i.e., the same class, where F(X|C_j) is that class PDF and F(X|C_k) is the other class PDF, and:

Z_j = -\frac{F(X|C_j)}{F(X|C_j) + F(X|C_k)}    (15)

if the data pattern is from the opposite class.
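A sketch of the corresponding MMI step for the mixing coefficients, combining (11) with the Z_j of (14)-(15) (our own illustration; the two-class, equal-prior case is assumed, only the a_i update is shown, and the function names, learning rate and positivity clip are ours):

```python
import numpy as np

def class_density(x, centers, a, sigma):
    """F(x|Cj) of eqs. (1)-(2) plus the individual window values Phi_i(x)."""
    p = x.shape[0]
    norm = (2.0 * np.pi) ** (p / 2.0) * sigma ** p
    phi = np.exp(-np.sum((x - centers) ** 2, axis=1) / (2.0 * sigma ** 2)) / norm
    return a @ phi / np.sum(a), phi

def mmi_step_a(x, same_class, centers, a, sigma, other_density, mu_a=0.01):
    """One stochastic MMI step on the a_i of one class model, per eqs. (11), (14) and (15)."""
    f_j, phi = class_density(x, centers, a, sigma)
    if same_class:                                   # pattern belongs to this model's class
        z = other_density / (f_j + other_density)    # eq. (14)
    else:                                            # pattern belongs to the opposite class
        z = -f_j / (f_j + other_density)             # eq. (15)
    a = a + z * mu_a * (phi / (a @ phi) - 1.0 / np.sum(a))   # eq. (11)
    return np.clip(a, 1e-6, None)                    # positivity clip is our assumption

# Toy usage: a class-2 pattern pushes down the class-1 window that claims it (corrective effect).
c1, c2 = np.array([[0.5, -0.5], [-0.5, 0.5]]), np.array([[0.5, 0.5], [-0.5, -0.5]])
x = np.array([0.45, 0.40])                           # drawn from class 2
f2, _ = class_density(x, c2, np.ones(2), 0.3)
print(mmi_step_a(x, same_class=False, centers=c1, a=np.ones(2), sigma=0.3, other_density=f2))
```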
5. BENCHMARK PROBLEMS RESULTS

We have applied our proposed framework to four benchmark problems: 1.1 and 1.2 are 1-dimensional, and 2.1 and 2.2 are 2-dimensional. The training data used is 40 patterns/class for the 1-dimensional problems and 50 patterns/class for the 2-dimensional ones. For testing, 10,000 patterns/class are used so that the verification results are statistically valid. For each problem we show the PDF estimation results of the PNN and the ML-trained APNN, compared to the true PDFs used. We then show the classification results of the PNN, the ML-trained APNN and the MMI-trained APNN compared to the theoretical optimal Bayes classifier performance. We also show the size of the PNN and the APNN compared to the optimal size in terms of the number of Gaussian windows required for the optimal Bayes classifier. We compared the PDF estimates by using the 10,000/class data points for testing, and calculating the average Euclidean distance between the estimated and the true PDFs, which we denote here by DPDF1 and DPDF2 for classes 1 and 2 respectively. In each table, "#error" denotes the number of misclassified patterns out of 20,000 test patterns, "%recog." denotes the % recognition rate, "%opt." denotes the % recognition relative to the optimal (included only if the optimal is not 100%), Size1 and Size2 are the numbers of windows in the class-1 and class-2 networks respectively, and "NOT" denotes that this result is not included.
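To make the tabulated quantities concrete, the following minimal sketch computes the two evaluation measures (our own code; the "average Euclidean distance" is interpreted here as the mean absolute difference between the two density values at the test points, and an equal-prior Bayes decision is assumed):

```python
import numpy as np

def avg_pdf_distance(f_est, f_true, test_points):
    """DPDF: mean absolute difference between estimated and true density values at test points."""
    return float(np.mean([abs(f_est(x) - f_true(x)) for x in test_points]))

def recognition_rate(f1, f2, test1, test2):
    """% of test patterns assigned to the correct class by the equal-prior Bayes rule."""
    correct = sum(f1(x) >= f2(x) for x in test1) + sum(f2(x) > f1(x) for x in test2)
    return 100.0 * correct / (len(test1) + len(test2))

# Toy usage with stand-in densities for a 1-D problem.
pts = list(np.linspace(-8.0, 8.0, 11))
print(avg_pdf_distance(lambda x: 0.10, lambda x: 0.12, pts))
print(recognition_rate(lambda x: float(x < 0), lambda x: float(x >= 0), [-1.0, -2.0], [1.0, 2.0]))
```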
5.1. BENCHMARK PROBLEM 1.1

In this problem, each class PDF is composed of 2 uniform clusters of width d, with centers at 2 & -6 for C1 and -2 & 6 for C2, i.e., the two classes are completely separable. The optimal size is 2 Gaussian windows per class, and a 100% recognition.

TABLE I: RESULTS OF PROBLEM 1.1
            Size1  Size2  DPDF1   DPDF2   #error  %recog.
PNN          40     40    0.0026  0.003    565    97.175
ML-APNN       2      2    0.0028  0.0031   611    96.95
MMI-APNN      2      2    NOT     NOT      347    98.27
Optimal       2      2    0.0     0.0        0    100
5.2. BENCHMARK PROBLEM 1.2

In this problem, each class PDF is composed of 2 Gaussian clusters, with centers at 2 & -6 for C1 and -2 & 6 for C2 and with different widths; the two classes are highly overlapped, with an optimal recognition of only 86.8% and an optimal size of 2 Gaussian windows per class.
TABLE II: RESULTS OF PROBLEM 1.2
            Size1  Size2  DPDF1      DPDF2     #error  %recog.  %opt.
PNN          40     40    0.0002441  0.000521   2723    86.39   99.53
ML-APNN       2      2    0.0000674  0.000387   2698    86.51   99.67
MMI-APNN      2      2    NOT        NOT        2700    86.5    99.65
Optimal       2      2    0.0        0.0        2640    86.8    100
5.3. BENCHMARK PROBLEM 2.1

This problem is a 2-dimensional generalized XOR, where the data values range between (-1,1); class-1 is composed of 2 uniformly distributed clusters centered at (0.5,-0.5) and (-0.5,0.5), and class-2 is also composed of 2 uniform clusters, at (0.5,0.5) and (-0.5,-0.5). This problem is completely separable, with an optimal recognition of 100% and an optimal size of 2 Gaussian windows per class.
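Data for this generalized-XOR benchmark could be generated along the following lines (our own sketch; the cluster half-width of 0.5, the even split between the two clusters of each class and the random seed are assumptions not stated in the paper):

```python
import numpy as np

def xor_data(n_per_class, half_width=0.5, seed=1):
    """Generalized-XOR sketch: class 1 occupies the (+,-) and (-,+) quadrant centers,
    class 2 the (+,+) and (-,-) ones; half_width=0.5 and the even split are assumptions."""
    rng = np.random.default_rng(seed)
    def cluster(cx, cy, n):
        return np.column_stack([rng.uniform(cx - half_width, cx + half_width, n),
                                rng.uniform(cy - half_width, cy + half_width, n)])
    n1, n2 = n_per_class // 2, n_per_class - n_per_class // 2
    class1 = np.vstack([cluster(0.5, -0.5, n1), cluster(-0.5, 0.5, n2)])
    class2 = np.vstack([cluster(0.5, 0.5, n1), cluster(-0.5, -0.5, n2)])
    return class1, class2

x1, x2 = xor_data(50)        # 50 training patterns per class, as in the paper
print(x1.shape, x2.shape)    # (50, 2) (50, 2)
```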
TABLE III: RESULTS OF PROBLEM 2.1
            Size1  Size2  DPDF1   DPDF2   #error  %recog.
PNN          50     50    0.0063  0.0088   1331    93.35
ML-APNN       2      2    0.0059  0.0089   1113    94.44
MMI-APNN      2      2    NOT     NOT       926    95.37
Optimal       2      2    0.0     0.0         0    100
5.4. BENCHMARK PROBLEM 2.2

This problem is a 2-dimensional generalized XOR, where class-1 is composed of 2 Gaussian distributed clusters centered at (0.5,-0.5) and (-0.5,0.5), and class-2 is also composed of 2 Gaussian clusters, at (0.5,0.5) and (-0.5,-0.5). In this problem the classes are highly overlapped, with an optimal recognition of only 75.5%; the optimal size is again 2 Gaussian windows per class.

TABLE IV: RESULTS OF PROBLEM 2.2
            Size1  Size2  %recog.  %opt.
PNN          50     50    74.2     98.28
ML-APNN       2      2    74.65    98.87
MMI-APNN      2      2    74.64    98.86
Optimal       2      2    75.5     100
6. CONCLUSIONS

(1) In all problems considered the optimal size is reached with the MDL pruning, starting from the PNN. The resulting APNNs are much smaller in size than the PNN, which means a tremendous saving in implementation cost and computational time.
(2) The APNN also has a much smaller stochastic complexity than the PNN; in other words, it is a more efficient representation of the data, or a more compact form of coding the data.
(3) Since the MDL criterion is based on an asymptotic approximation of the stochastic complexity [8], it performs better for large data sets. We expect that it might fail to find the optimal number of clusters for more complex problems with small data sets. In such cases, other approximations of the stochastic complexity should be used [16].
(4) On the other hand, if the data set is large, the MDL works well, and the optimal model for the given data is most likely found with the minimum number of windows, unlike the PNN, which increases in size with larger data sets.
(5) In classification, ML training was superior to MMI when the sum-of-Gaussians model matched the true class distributions, i.e., in problems 1.2 and 2.2, while the MMI training was superior in the other two cases, where the true distributions were uniform, not Gaussian.
(6) In PDF estimation, the ML-APNN is much better than the PNN when the true distributions are Gaussian sums, and they are almost equivalent in the other cases.
REFERENCES
[1] Robert Schalkoff; "Pattern Recognition: Statistical, Structural, and Neural Approaches", John Wiley & Sons, Inc., 1992.
[2] John Makhoul; "Pattern Recognition Properties of Neural Networks", IEEE-SP Workshop on Neural Networks for Signal Processing, Princeton, NJ, 1991, pp. 173-187.
[3] Steven J. Nowlan and Geoffrey E. Hinton; "Adaptive Soft Weight Tying using Gaussian Mixtures", in J.E. Moody, S.J. Hanson and R.P. Lippmann (Eds.), Advances in Neural Information Processing Systems 4, Morgan Kaufmann, San Mateo, CA, 1992.
[4] Wray L. Buntine and Andreas S. Weigend; "Bayesian Back-Propagation", Complex Systems 5, 1991, pp. 603-643.
[5] D.J.C. MacKay; "A Practical Bayesian Framework for Backprop Networks", submitted to Neural Computation, 1991.
[6] John E. Moody; "Note on Generalization, Regularization, and Architecture Selection in Nonlinear Learning Systems", IEEE-SP Workshop on Neural Networks for Signal Processing, pp. 1-10, 1991.
[7] Donald F. Specht; "Probabilistic Neural Networks", Neural Networks, Vol. 3, pp. 109-118, 1990.
[8] Jorma Rissanen; "Stochastic Complexity in Statistical Inquiry", Series in Computer Science, Vol. 15, World Scientific.
[9] Jorma Rissanen; "Density Estimation by Stochastic Complexity", IEEE Trans. on Information Theory, Vol. 38, No. 2, pp. 315-323, March 1992.
[10] Waleed Fakhr and M.I. Elmasry; "Mutual Information Training and Size Minimization of Adaptive Probabilistic Neural Networks", ISCAS'92, San Diego, CA, May 1992.
[11] Waleed Fakhr, M. Kamel and M.I. Elmasry; "Probability of Error, Maximum Mutual Information, and Size Minimization of Neural Networks", IJCNN, Baltimore, MD, June 1992.
[12] Andrew R. Barron and Thomas M. Cover; "Minimum Complexity Density Estimation", IEEE Trans. on Information Theory, Vol. 37, No. 4, pp. 1034-1054, July 1991.
[13] Keinosuke Fukunaga; "Introduction to Statistical Pattern Recognition", 2nd Ed., Academic Press, Inc., 1990.
[14] Tzay Y. Young and T. Calvert; "Classification, Estimation and Pattern Recognition", American Elsevier, 1974.
[15] Martin E. Hellman and Josef Raviv; "Probability of Error, Equivocation, and the Chernoff Bound", IEEE Trans. on Information Theory, Vol. 16, No. 4, pp. 368-372, July 1970.
[16] Waleed Fakhr, M. Kamel and M.I. Elmasry; "Unsupervised Learning by Stochastic Complexity", unpublished.