Learning Vector Quantization for the Probabilistic Neural Network

Pietro Burrascano

Abstract: The probabilistic neural network (PNN) represents an interesting parallel implementation of a Bayes strategy for pattern classification. Its training phase consists in generating a new neuron for each training pattern, whose weights equal the pattern components. This noniterative training procedure is extremely fast, but leads to a very high number of neurons in those cases in which large data sets are available. This letter proposes a modified version of the PNN learning phase which allows a considerable simplification of the network structure by including a vector quantization of the learning data.
I. INTRODUCTION AND BACKGROUND

Bayes strategies for pattern classification rely on procedures that minimize the "expected risk" of misclassification. In the case of a two-category classification, for example, the problem is to decide whether a pattern X_i = [X_{i1}, X_{i2}, ..., X_{ip}] belongs to class θ_A or to class θ_B. In this case, following the notation used in [1] and [2], the Bayes decision rule can be written as
d(X) = θ_A   if h_A l_A f_A(X) > h_B l_B f_B(X)
d(X) = θ_B   if h_A l_A f_A(X) < h_B l_B f_B(X)     (1)
where f_A(X) and f_B(X) are the probability density functions for categories A and B, respectively; l_A is the loss associated with the decision d(X) = θ_B when the truth is θ_A; l_B is the loss associated with the decision d(X) = θ_A when the truth is θ_B; h_A is the a priori probability of occurrence of patterns from category A; and h_B = 1 - h_A is the a priori probability of occurrence of patterns from category B. The Bayes decision surface relative to the two-category problem we are considering is the boundary between the region in which d(X) = θ_A and the region in which d(X) = θ_B. This surface can be arbitrarily complex and consist of disconnected parts. The main problem in using Bayes strategies for classification is the estimation of the probability density functions relative to each class. This task is usually accomplished by using a set of training patterns with known classification. In [3] it is shown that a consistent estimate of a multivariate probability density function can be obtained by using the product of univariate kernels. In the particular case of the Gaussian kernel, the multivariate estimate of the probability density function of category A can be expressed as
f_A(X) = [1 / ((2π)^{p/2} σ^p m)] Σ_{i=1}^{m} exp[-(X - X_{Ai})^T (X - X_{Ai}) / (2σ²)]     (2)
where i is the current pattern index; m is the total number of training patterns; X_{Ai} is the ith training pattern from category θ_A; σ is a smoothing parameter; and p is the dimensionality of the input space. In using expression (2) it is implicitly assumed that X_{A,1}, X_{A,2}, ..., X_{A,p} are independent identically distributed random variables [1], [2]. It is worth noting that expression (2) is the sum of multivariate Gaussian distributions centered at each training sample; moreover, although Gaussian kernels have been used in (2), f_A(X) is not restricted to be Gaussian. The smoothing parameter σ defines the amount of interpolation between the different modes of f_A(X) corresponding to the locations of the training patterns. Varying σ also affects classification results: the classifier can be shown to be similar in effect to the nearest neighbor classifier as σ → 0, and to the k-nearest neighbor classifier for increasing values of σ. The PNN, as proposed in [1], [2], is a direct implementation of the Bayes classifier described above. Its structure, in the case of a two-category problem, is shown in Fig. 1: the input layer consists of simple fan-out units; in the second layer there is one pattern unit per training pattern. Each pattern unit performs a dot product Z_i = X · W_i of the input pattern vector X with a weight vector W_i, and then performs the nonlinear operation

exp[(Z_i - 1)/σ²]     (3)
which, assuming that both W_i and X are normalized to unit length, is equivalent to the exponentiation of formula (2). The summation units (one per category) simply sum the outputs from all pattern units of the corresponding class. In the two-category case there is one output unit, which implements the decision of equation (1). It is a two-input unit which produces a binary output with only a single variable weight C, defined as

C = -(h_B l_B n_A) / (h_A l_A n_B)     (4)
where n_A and n_B are the numbers of training patterns from categories θ_A and θ_B, respectively. If the numbers of training samples from the different categories are in proportion to their a priori probabilities and the class losses l_i do not reflect any bias in the decision, C may simplify to -1. The network is trained by setting the W_i weight vector in one of the pattern units equal to each X pattern in the training set and then connecting the pattern unit output to the appropriate summation unit.
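As a minimal illustration of the two-category decision rule (1) applied to the kernel estimate (2), the computation can be sketched as follows. This is only a flat numerical sketch, not the layered network of Fig. 1, and every name in it (pnn_density, pnn_classify, class_a, ...) is an assumption of this sketch rather than notation from the letter:

```python
import numpy as np

def pnn_density(x, patterns, sigma):
    """Parzen estimate of formula (2): mean of Gaussian kernels centered
    at each training pattern.  The normalization constant cancels in the
    two-class comparison, but it is kept here for completeness."""
    p = patterns.shape[1]
    diff = patterns - x                        # (m, p) differences
    sq = np.sum(diff * diff, axis=1)           # squared distances
    norm = (2 * np.pi) ** (p / 2) * sigma ** p * len(patterns)
    return np.sum(np.exp(-sq / (2 * sigma ** 2))) / norm

def pnn_classify(x, class_a, class_b, sigma, h_a=0.5, l_a=1.0, l_b=1.0):
    """Bayes rule (1): decide A iff h_A l_A f_A(x) > h_B l_B f_B(x)."""
    h_b = 1.0 - h_a
    fa = pnn_density(x, class_a, sigma)
    fb = pnn_density(x, class_b, sigma)
    return 'A' if h_a * l_a * fa > h_b * l_b * fb else 'B'
```

Note that the network of Fig. 1 computes the same quantities through dot products and the exponentiation (3), which coincide with the kernels above when all vectors are normalized to unit length.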
II. THE PROPOSED MODIFICATION
Manuscript received November 20, 1990. This work was supported by the Ministero dell'Università e della Ricerca Scientifica e Tecnologica (MURST) of Italy and by the Consiglio Nazionale delle Ricerche (CNR) of Italy. The author is with the INFO-COM Department, Università di Roma "La Sapienza," via Eudossiana 18, 00184 Rome, Italy. IEEE Log Number 9143343.
The brief description of the PNN clarifies why its noniterative learning phase requires an extremely light computational cost and justifies its very good, Bayes-like classification performance. Unfortunately, if large training sets are available, the computational cost associated with the testing phase of the PNN is much higher than that of the training phase and can become incompatible with real-time classification tasks. Moreover, if a hardware implementation of the PNN is envisaged, the problem of storing all training patterns and using all of them to obtain each classification result turns into the problem of building up an extremely large network.

1045-9227/91/0700-0458$01.00 © 1991 IEEE

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 2, NO. 4, JULY 1991

Fig. 1. The probabilistic neural network. (Inputs x_1, ..., x_p feed the first layer of fan-out units; the output + denotes category one and - denotes category two.)

In order to overcome this inconvenience, [2] proposes the PADALINE neural network, which is substantially a different implementation of the same Bayes classifier based, in this case, on a truncated Taylor series expansion. The present letter proposes a different solution to the problem of the PNN testing-time computational cost in the presence of large data sets. The proposed modification is based on a simple remark: in [1] and [2] it is shown that the best classification results are obtained by using values of σ chosen in an appropriate interval, which depends on the particular application. In this way a certain amount of interpolation is obtained among a number of samples; the higher the probability value, the larger the interpolation effect. In other words, the interpolation effect is higher in the zones of the pattern space where samples are denser. Consequently, in such zones, which are the most important for classification tasks, the probability estimate is performed by evaluating a mean of all patterns in some neighborhood of the considered point, averaged according to their distances: no single sample makes a particularly significant contribution to the estimate. In such a situation it is intuitive that a number of "reference vectors" lower than the number of training patterns, if appropriately located in the pattern space, can give an accurate representation of the underlying probability density function. As a consequence of the above remark, it can be assumed that a network with the same functional structure as the PNN but with a number of nodes per class in the second layer much lower than the number of training patterns can give classification results approximating those of the PNN. The problem becomes one of defining the number of "reference vectors" and their location in the pattern space.

This problem is widely treated in the technical literature, and there are several vector quantization techniques which can accomplish the desired task. To perform the reported experiments the learning vector quantization (LVQ) technique has been used, because it has been shown in a number of examples that the reference vectors it defines approximate the probability density functions of the pattern classes. Moreover, the algorithm is computationally extremely light and the convergence is reasonably fast [4]. The LVQ algorithm has been implemented as proposed in [4], and its description is not reported here for the sake of brevity. In conclusion, the modified version of the PNN proposed here maintains the basic structure of Fig. 1, but the number of nodes in the second layer assigned to each class is no longer equal to the number of training patterns of that class; instead, it equals the number of processing elements per class of the LVQ. The proposed learning procedure can be described as follows:

1) Define the total number of reference patterns (i.e., the number of units of layer two).
2) Assign to each class a number of nodes of layer two in proportion to the a priori probability of occurrence of that class (known or estimated).
3) Run the LVQ training procedure using all the available training patterns; the result of this step defines the weight vectors W_i relative to the connections from the input layer to the second layer.
4) Connect the output of each unit of the second layer to the appropriate summation unit of the third layer.
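As a hedged illustration, the basic LVQ1 update of Kohonen [4], used here to place the reference vectors, might be sketched as follows. All names are assumptions of this sketch, and the exact variant and schedule used in the experiments may differ in details:

```python
import numpy as np

def train_lvq1(patterns, labels, codebook, code_labels,
               n_steps=10_000, alpha0=0.5, seed=0):
    """Basic LVQ1 sketch (after Kohonen [4]): at each step pick a random
    training pattern, find its nearest reference vector, and move that
    vector toward the pattern if the class labels agree, away from it
    otherwise.  alpha decreases linearly from alpha0 to zero, matching
    the schedule described in the experiments."""
    rng = np.random.default_rng(seed)
    codebook = codebook.copy()
    for t in range(n_steps):
        alpha = alpha0 * (1.0 - t / n_steps)
        i = rng.integers(len(patterns))
        x = patterns[i]
        w = np.argmin(np.sum((codebook - x) ** 2, axis=1))   # winner
        sign = 1.0 if code_labels[w] == labels[i] else -1.0
        codebook[w] += sign * alpha * (x - codebook[w])
    return codebook
```

The resulting codebook vectors then play the role of the second-layer weight vectors W_i in the modified network.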
After the training phase has ended, the proposed network performs the same operations as the PNN [1], [2], but with a reduced number of elements in the second layer. Consequently, the computational cost associated with each classification operation is reduced proportionally.

III. EXPERIMENTAL RESULTS
Two synthetic data experiments have been performed in order to ascertain the effectiveness of the proposed modification. Both are two-category experiments similar to those described in [5]. The feature space is two-dimensional and samples are drawn from known probability distributions; a third input component is added in order to normalize the input vectors to unit length. For each experiment the same 1000-token training set and a 3000-token testing set are drawn independently from the specified distributions. The data sets consist of equal numbers of vectors from the two classes, giving the classes equal a priori probabilities. The same 1000-token data set is used to train a PNN, which consequently consists of 1000 nodes in the second layer, and two modified probabilistic neural networks consisting, in the second layer, of 100 and ten nodes respectively. Learning vector quantization of the training data consists, in both cases, in 10 000 presentations of single patterns randomly selected from the 1000-token training set and in the consequent weight corrections. The parameter α, defining the amount of correction at each step of the LVQ, is chosen to be linearly decreasing from 0.5 to zero. In the first experiment, referred to as the uniform distribution classification task, feature vectors from the triangular regions shown in Fig. 2 are classified. Within each class the probability distribution is uniform across the respective region; the distributions for the two classes are adjacent but nonoverlapping. In the second experiment, the Gaussian mixture distribution classification task, category 1 is a single Gaussian distribution, while category 2 is an equal-likelihood mixture of two Gaussians. The probability densities for categories 1 and 2, shown in Fig. 3, are

p_1(x, y) = N(x, 0, s²) N(y, μ_y, 4s²)
and
respectively, where
N(t, μ, s²) = (2πs²)^{-1/2} exp{-(t - μ)² / (2s²)}

is a Gaussian with mean μ and variance s². The values s = 0.1, μ_x = 1.188 s, and μ_y = 2.325 s have been used. The symbol s is used to denote standard deviation in order to avoid any confusion with σ in formula (2).
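For concreteness, the univariate Gaussian N(t, μ, s²) defined above, together with one common way of appending a third component so that a two-dimensional feature vector has unit length, can be sketched as follows. The text does not spell out the normalization scheme actually used, so the lift_to_unit_sphere helper (and its assumption that the two features have first been scaled so that x1² + x2² ≤ 1) is hypothetical:

```python
import math

def gaussian(t, mu, s2):
    """Univariate Gaussian N(t, mu, s^2) with mean mu and variance s2."""
    return math.exp(-(t - mu) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)

def lift_to_unit_sphere(x1, x2):
    """Append a third component so the resulting 3-vector has unit length.
    Assumes x1^2 + x2^2 <= 1 (i.e., features pre-scaled into the unit
    disk); this particular scheme is an assumption of the sketch."""
    return (x1, x2, math.sqrt(1.0 - x1 * x1 - x2 * x2))
```

With such a lift, the dot product Z_i = X · W_i of unit-length vectors carries the same information as the Euclidean distance used in the kernel estimate (2).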
Fig. 2. The uniform distributions. Feature vectors were uniformly distributed within each class region.
Fig. 4. Results of the uniform distribution experiment.
Fig. 5. Results of the Gaussian mixture experiment.
Fig. 3. The Gaussian mixture distribution.
To give wider validity to the comparison between the PNN and its modified version, classification tests have been repeated for σ values ranging from 0.01 to 10. The classification results obtained in the two experiments are reported in Figs. 4 and 5, which show that: in the case of the uniform distribution classification task, passing from the 1000-node PNN to the 100-node LVQ-PNN causes a worsening of the best classification result of about 1.6% (from 99.800% to 98.166%), and passing to the ten-node LVQ-PNN causes a further 1.16% loss (96.999%); in the case of the Gaussian mixture classification task, the best classification results remain unchanged even if the network complexity is dramatically reduced; in both experiments a reduction of network complexity causes an increased dependence on the σ value. The CPU time required to perform a single classification with the ten-node and the 100-node LVQ-PNN is 1.4% and 10.2% of the PNN testing time, respectively. As far as the LVQ training time for the ten- and the 100-node LVQ-PNN is concerned, it is of 1 min and 5 min 15 s respectively on a DEC MicroVAX, including pattern generation time.

It is apparent from Figs. 4 and 5 that, while in the Gaussian mixture case the results obtained can be considered excellent, in the uniform distribution case they can be considered satisfactory only if the corresponding simplification of network complexity is taken into account. These different behaviors suggest the hypothesis that the approach of modeling the underlying probability density function, followed by both the PNN and its modified version proposed here, is not an efficient one in the case of the uniform distribution task (and in general when the boundary between the different classes is known to be linear). In fact, excellent classification results can be obtained in this case, but only if a very complex network topology is considered. In contrast, the uniform distribution classification task is known to be particularly well suited to a simple, one-layer perceptron. It is interesting to compare the classification results obtained with the proposed LVQ-PNN network with those obtained by using the LVQ technique alone. Tables I and II report the best classification results of the LVQ-PNN in comparison with those of the LVQ trained with the same 1000-token set. The reported results confirm the good classification performance obtainable by using the LVQ and show that a noticeable improvement can be obtained by using the LVQ-PNN.

TABLE I
COMPARISON BETWEEN THE LVQ-PNN AND THE LVQ: UNIFORM DISTRIBUTION CLASSIFICATION TASK RESULTS (%)

             LVQ-PNN (Best Result)    LVQ
10 nodes     96.999                   96.133
100 nodes    98.166                   97.766

IV. CONCLUSIONS

A modified version of the PNN training procedure has been proposed, which includes a vector quantization of the training data: it can
TABLE II
COMPARISON BETWEEN THE LVQ-PNN AND THE LVQ: GAUSSIAN MIXTURE DISTRIBUTION CLASSIFICATION TASK RESULTS (%)

             LVQ-PNN (Best Result)    LVQ
10 nodes     94.833                   93.766
100 nodes    95.266                   94.533

be useful if large training sets are available. The procedure has been successfully tested in two synthetic data experiments; furthermore, the proposed network has been shown to improve the classification performance of the LVQ procedure. Further experiments are necessary in order to test its validity in real-world classification problems. No general rule can be given for the definition of the number of nodes to be used in the second layer of the network: in fact this number is strictly related to the actual shape of each probability density function to be approximated, as evidenced by the results of the two experiments. However, these results show that a gradual degradation of classification performance can be expected as the number of reference vectors in the pattern space is reduced: as a consequence, the procedure proposed in the present letter allows the user to obtain a parsimonious representation of the probability density functions while maintaining the classification performance allowed by the available data set. If further reductions of network complexity are required, the procedure makes it possible to choose the number of nodes in the second layer as a trade-off between acceptable network complexity and the desired classification accuracy.

REFERENCES

[1] D. F. Specht, "Probabilistic neural networks," Neural Networks, vol. 3, no. 1, pp. 109-118, Jan. 1990.
[2] D. F. Specht, "Probabilistic neural networks and polynomial Adaline as complementary techniques for classification," IEEE Trans. Neural Networks, vol. 1, pp. 111-121, Mar. 1990.
[3] T. Cacoullos, "Estimation of multivariate density," Ann. Inst. Statist. Math. (Tokyo), vol. 18, no. 2, pp. 179-189, 1966.
[4] T. Kohonen, Self-Organization and Associative Memory, 2nd ed. Berlin: Springer-Verlag, 1988.
[5] L. Niles, H. Silverman, G. Tajchman, and M. Bush, "How limited training data can allow a neural network to outperform an 'optimal' statistical classifier," in Proc. ICASSP-89 (Glasgow), May 1989, pp. 17-20.

A Real-Time Experiment Using a 50-Neuron CMOS Analog Silicon Chip with On-Chip Digital Learning

Fathi M. A. Salam and Yiwen Wang

Abstract: We present initial results of a pattern/character recognition and association experiment using a newly fabricated 50-neuron CMOS analog silicon chip with digital on-chip learning.

Manuscript received December 12, 1990. This work was supported by the Office of Naval Research under Grant N00014-89-J-1833 and by the Michigan Research Excellence Fund (REF). The authors are with the Department of Electrical Engineering, Michigan State University, East Lansing, MI 48824. IEEE Log Number 9100435.

I. INTRODUCTION: THE CIRCUIT ARCHITECTURE

In [1] and [2], a new model, or architecture, for artificial neural networks (ANNs) was introduced which has a direct MOS realization. The architecture has qualitatively the same dynamic properties as gradient continuous-time feedback ANNs. The introduced model (circuit) also has the following features:

i) It reduces the maximum number of connections to n(n + 1)/2, where n is the number of neuron processors in the network.
ii) It does not require the symmetry of interconnections (in order to ensure the convergence of all solutions to equilibria only).
iii) More importantly, it does not require the realization of linear resistive elements for the synaptic weights. Instead, the connections are realized via MOSFET (nonlinear) conductance elements.
iv) The introduced circuit lends itself naturally to direct analog MOS VLSI/LSI silicon implementation.

The ANN architecture is described in the following [1], [2].

A. One Neuron with Self-Feedback

Each neuron is a processing device with input u_i and output v_i. It is modeled as an operational amplifier (op-amp) with a capacitive element C_i and a resistive element R_i at its input node. There is a self-feedback MOSFET, with gate voltage V_{Rii}, connecting the output v_i to the input u_i. We shall refer to the neuron with self-feedback as a unit. For VLSI/LSI implementation, the capacitive element C_i and the resistive element R_i at the input node are eliminated, since the parasitics compensate for their roles. Moreover, the neuron is realized via two CMOS inverters in series instead of an op-amp.

B. General n-Neuron Circuit with Feedback

The input node of unit i is connected to the input node of unit j via a MOSFET with gate voltage V_{Rij}. The circuit architecture, for a three-unit prototype, is depicted in Fig. 1. Recall that a MOSFET current-voltage characteristic function, say I_{ds}, denotes the (positive) current flowing from the drain to the source. Denote the source, gate, drain, and threshold voltages respectively as v_s, v_g, v_d, and v_t. The current characteristic function I_{ds} is a (nonlinear) function of the voltages according to the region of operation, namely the cutoff, the triode, and the saturation region. It is denoted as I_{ds}(v_s, v_g, v_d, v_t), where v_t is approximately a constant. If the substrate voltage v_{sub} is connected to ground (or to the lowest voltage level), instead of the source node of the MOSFET, then the roles of the source and the drain are interchangeable. We refer to the gate voltages V_{Rii} and V_{Rij} as the weights for the nonlinear synapses realized via the preceding four-terminal (symmetric) nMOS transistors. Now, from Kirchhoff's current law (KCL), applied at every node, one obtains the mathematical model

v_i = S_i(u_i)     (1ii)

where I_{ij} (respectively, I_{ii}) is the current function through the interconnecting (respectively, self-feedback) MOSFET element [1], [2], and S_i are the transfer characteristics of the CMOS inverters.
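As a rough illustration of the three operating regions named above, a textbook (level-1) nMOS drain-current model can be sketched as follows. The function ids_nmos and the parameters vt and k are illustrative assumptions of this sketch and do not describe the actual devices or the model used on the chip:

```python
def ids_nmos(vgs, vds, vt=1.0, k=1.0):
    """Piecewise level-1 nMOS drain current, illustrating the cutoff,
    triode, and saturation regions.  vt is the threshold voltage; k
    lumps mobility and geometry (hypothetical values)."""
    if vgs <= vt:
        return 0.0                                    # cutoff
    vov = vgs - vt                                    # overdrive voltage
    if vds < vov:
        return k * (vov * vds - 0.5 * vds * vds)      # triode
    return 0.5 * k * vov * vov                        # saturation
```

The expression is continuous at vds = vgs - vt, where the triode and saturation branches coincide; the chip itself exploits exactly this voltage-dependent (nonlinear) conductance in place of linear synaptic resistors.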