Building Practical Classifiers Using Cerebellar Model Associative Memory Neural Networks

David Cornforth

School of Environmental and Information Sciences, Charles Sturt University, Australia (E-mail : [email protected])

Abstract. How can cerebellar model neural networks be successfully applied to automated classification? Simple modifications and a new training scheme are applied to the Cerebellar Model Articulation Controller (CMAC). This results in a classifier with fast training time, guaranteed convergence and a small number of parameters. How can model parameters be set to achieve optimum classifier accuracy, without lengthy empirical trials? A consideration of the most significant sources of classification error results in a simple method for estimating the optimum range for the parameters. The method is tested using empirical trials, and shown to be reliable. This makes the modified CMAC a desirable choice for automated classification in the context of pattern recognition and data mining.

Keywords: CMAC, Cerebellar Model, Associative Memory, Neural Networks, Machine Learning.

1.Introduction

A well-studied class of machine learning problems is automated classification. This is important because it is widely encountered in the fields of pattern recognition and data mining [10][13]. Here the key is to determine some relationship between a set of input vectors that represent stimuli, and a corresponding set of values on a nominal scale that represent category or class. The relationship is obtained by applying an algorithm to training samples that are 2-tuples (u, z), consisting of an input vector u and a class label z. The learned relationship can then be applied to instances of u not included in the training set, in order to discover the corresponding class label z [9]. Artificial Neural Networks have been shown to be effective in solving such problems, but often require long training periods, have no guarantee of convergence, and require the setting of a large number of parameters [10]. Cerebellar model neural networks such as the Cerebellar Model Articulation Controller (CMAC) offer fast training time, guaranteed convergence and a small number of parameters [21][29]. Although the CMAC has been proposed as a classifier, very little has appeared in the literature on this subject.

In this paper I show that simple modifications can result in a classifier which can be applied to data sets with an arbitrary number of classes. The revised CMAC model can furnish estimates of class membership probability using Bayes law, which is desirable, as it attaches confidence to predictions made by the classifier [4]. The difficulty of setting CMAC parameters has been noted [22], but this problem has received little attention in the literature. In this paper I show how the values of parameters can be determined by estimating the classification error.

2.The Cerebellar Model Articulation Controller

The Cerebellar Model Articulation Controller (CMAC) is an associative memory neural network based on the operation of the mammalian cerebellum. It was originally introduced as a continuous function modeller for real-time control of robotic manipulators [1]. To date its application has been mainly in this area [28][23][14][15]. The purpose of this paper is to develop its use as a classifier.

Fig. 1. The CMAC tile configuration, with a query point activating a tile in both the tile sets, after [27].

In the model, an input vector may be visualised as a point in a multi-dimensional input space [2][3]. Fig. 1 shows an example of a two-dimensional space with two discretisation, or quantising, functions, which form two overlapping sets of tiles. An input vector or query point activates one tile from each set. If there are q quantising functions, then any input vector will always activate exactly q tiles. If the value of the input vector changes enough to cause the point to cross one or more of the tile boundaries, the set of activated tiles changes.

Each tile is associated with a weight value. When a tile is activated, its corresponding weight value participates in the output. The output is formed by a simple summation of all q weight values corresponding to q activated tiles, as shown in Fig. 2. The weight values are determined by a training phase, during which these values are adjusted to minimise the difference between the CMAC output and its expected output.
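To make this concrete, here is a minimal sketch (mine, not code from the original work) of the forward pass just described: an input vector is quantised by q overlapping grids and the weights of the activated tiles are summed. The tile width, the diagonal grid offsets and the Python dictionary standing in for the hashed associative memory are all illustrative assumptions.

```python
def active_tiles(u, q, tile_width, offsets):
    """Map an input vector u to the q tiles it activates.

    Each quantising grid g is shifted by offsets[g] and quantised by
    integer division, so exactly one tile index is produced per grid.
    """
    tiles = []
    for g in range(q):
        coords = tuple(int((u_k + offsets[g][k]) // tile_width)
                       for k, u_k in enumerate(u))
        tiles.append((g,) + coords)          # tag with the grid number to keep grids distinct
    return tiles


def cmac_output(u, weights, q, tile_width, offsets):
    """Sum the weights of the activated tiles.

    'weights' is a dictionary keyed by tile index; it stands in for the
    hashed associative memory of the original model.
    """
    return sum(weights.get(tile, 0.0) for tile in active_tiles(u, q, tile_width, offsets))


# Example: 2-dimensional input, q = 2 grids offset diagonally as in Fig. 1.
q, tile_width = 2, 4.0
offsets = [(0.0, 0.0), (1.0, 1.0)]
weights = {}                                  # untrained memory
print(cmac_output([3.2, 7.5], weights, q, tile_width, offsets))   # 0.0
```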

The overlapping quantising functions provide a distance-dependent response around vectors used during training. The similarities between CMAC and Radial Basis Function networks are well known, as both use local generalisation and compact receptive fields [17][12]. The Cerebellar Model can be abstracted as a mapping from a multi-dimensional input space u to a single-dimensional output space z, following the convention of [11]. This mapping can be broken into three mappings. The mapping of input space to the set of tiles, E : u → x, can be implemented using simple integer division in each dimension. The mapping of tiles to memory cells, H : x → y, is implemented using a hashing function. The mapping of memory cells to the output, W : y → z, is a simple summation. Since the introduction of the CMAC, further analysis and improvements have appeared in the literature. The training method proposed by Albus was an iterative algorithm based on global error minimisation. This algorithm has been shown to converge [24][20]. Alternative basis functions have been proposed as an improvement over the step function proposed by Albus [19]. An alternative spacing of the overlapping quantising grids can improve precision [25].

3.The Kernel Addition Training Algorithm

The CMAC algorithm can be adapted for use as a classifier by adopting a suitable mapping between output space z and class label c. A scalar mapping [3] takes the form:

$$c = \begin{cases} 0 & : z < t \\ 1 & : z \ge t \end{cases} \qquad (1)$$

where t is a threshold. This is sufficient for two class problems, as in [11]. For problems with more than two classes, the threshold values may be set such as to divide up the scalar range of z into the number of classes to be represented:

$$c = v \;:\; t_v^{low} < z < t_v^{high} \qquad (2)$$

where threshold $t_v^{low} = t_{v-1}^{high}$.

Fig. 2. The two active tiles activate two memory locations. These contain values that are summed to produce the output.

This is not ideal, since there is no information about degree of membership of a class. As an improvement, we propose using a CMAC for each class, and define a vector mapping:

$$c = v \;:\; z_v = \max(z_1, z_2, \ldots, z_m) \qquad (3)$$

where m is the number of classes. As $z_v$ represents a relative probability of selecting class c, it is possible to take account of a priori probability using Bayes Law:

$$p(c_i \mid x) = \frac{p(c_i)\, p(x \mid c_i)}{\sum_i p(c_i)\, p(x \mid c_i)} \qquad (4)$$


where p represents probability [10]. The frequency of samples occurring in each class may be used to estimate $p(c_i)$, which forms a weighting for the output $z_i$. The adoption of a different output mapping requires a reconsideration of the training algorithm. From equation (3), the CMAC outputs must be proportional to the class probabilities, but are not constrained to be equal to any absolute values. This can be achieved by simply counting the number of training vectors that fall into each tile, for each class. As each training vector is presented, the activated weight values are each increased by one. If an alternative basis function is used, the weight values are increased by the value of the basis function. After training, each CMAC forms a piecewise model of the probability density function for the corresponding class [7][8]. This method is known as the Kernel Addition Training Algorithm (KATA).
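As a sketch of how KATA might be realised (mine, not the implementation used in this work), the following keeps one weight table per class, increments the activated cells by a kernel value (a constant 1.0 reproduces the original step kernel, while a higher-order kernel would supply a distance-dependent value), and classifies with the prior-weighted maximum of equations (3) and (4). The active_tiles helper is the illustrative function sketched in section 2.

```python
from collections import defaultdict

def kata_train(samples, m, q, tile_width, offsets, kernel=lambda u, tile: 1.0):
    """Kernel Addition Training Algorithm: one weight table per class.

    Each training vector increments its activated cells by the kernel value,
    so each table becomes a piecewise model of that class's density."""
    tables = [defaultdict(float) for _ in range(m)]
    counts = [0] * m                           # class frequencies, used to estimate p(c_i)
    for u, label in samples:
        counts[label] += 1
        for tile in active_tiles(u, q, tile_width, offsets):
            tables[label][tile] += kernel(u, tile)
    priors = [c / len(samples) for c in counts]
    return tables, priors


def kata_classify(u, tables, priors, q, tile_width, offsets):
    """Choose the class whose prior-weighted output is largest (equations (3)-(4)).

    The common denominator of equation (4) is omitted, as it does not change the maximum."""
    tiles = active_tiles(u, q, tile_width, offsets)
    scores = [priors[i] * sum(tables[i].get(tile, 0.0) for tile in tiles)
              for i in range(len(tables))]
    return max(range(len(scores)), key=scores.__getitem__)
```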


4.Parameter identification


The accuracy of a CMAC classifier can be evaluated by presenting previously unknown data and counting the number of samples correctly classified as a percentage of the total samples presented. The classifier accuracy depends upon the values of the CMAC parameters. These are the displacement between different quantising layers, the kernel function, and the grid spacing.

The errors due to non-optimum values for these parameters occur as a result of three scenarios: sparse training examples, class boundaries, and high spatial frequency. I analyse these below and derive simple rules for setting the parameters to achieve optimum classifier accuracy.

4.1.The displacement vector

The displacement vector controls the relative placement of grids. For example, in Fig. 1, the displacement vector is {0, 0} for the first quantising function and {1, 1} for the second. The displacement vector controls the shape of the receptive field, the region formed by the union of tiles activated by a point. Intuitively this region should be hyperspherical. This is borne out by Parks and Militzer [25], who published optimal values for the displacement vectors. This arrangement is in use among almost all applications of CMAC, and is applicable to the CMAC classifier.
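For illustration, the simple diagonal arrangement of Fig. 1 (grid g shifted by g resolution units in every dimension) can be generated as below; this is only a sketch, and in practice the improved displacement vectors published by Parks and Militzer [25] would be substituted.

```python
def diagonal_displacements(q, d):
    """Albus-style diagonal offsets: grid g is shifted by g units in every dimension."""
    return [tuple(float(g) for _ in range(d)) for g in range(q)]

print(diagonal_displacements(2, 2))   # [(0.0, 0.0), (1.0, 1.0)], as in Fig. 1
```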

4.2.The kernel function

The original CMAC of Albus had a constant response within a tile, that is, a step function whose boundaries are identical to the tile boundaries. The problem of finding the optimum kernel function has not received thorough coverage, but most studies conclude that a higher order kernel is preferable [5][11][6][12]. The linear kernel has been used in this work.

4.3.The grid spacing

The grid spacing is set by the number of quantising functions, q, and the resolution, r. If the boxed area of Fig. 1 represents the total input space, r would be 17 in the horizontal direction and 15 in the vertical direction. For convenience r may be made equal in all dimensions. The number of quantising functions q is also equal to the number of resolution units along each dimension of a tile. These two parameters in conjunction set the generalisation region of the CMAC, and so affect the accuracy of a CMAC when used as a classifier.

4.4.Errors due to insufficient training examples

For a point to be correctly classified, there must be at least one training point of the correct class within its receptive field. Fig. 3 shows a single point in a 2-dimensional space and its associated receptive field when q is 5. If no training points lie in the receptive field, the CMAC will guess the class at random. The area of the receptive field in the example is equal to 67 resolution units. Using the optimum displacement vectors, for large q the shape of this region can be approximated by a hypersphere with radius (q – 0.5). The probability of obtaining no points in this region may be modelled by the Poisson distribution

$$P(X = x) = \frac{e^{-\mu}\,\mu^{x}}{x!} \qquad (5)$$

for x = 0, where µ is the number of points expected, and is equal to

$$\mu = \frac{V_{u_i}\, N_t}{V_s} \qquad (6)$$

where $V_{u_i}$ is the volume of the hypersphere activated by point $u_i$, $N_t$ is the number of points in the training set, and $V_s$ is the total volume of the input space. This assumes a uniform distribution of training points in the input space. For m classes the expected error rate of this classifier is

$$E_s = \left(1 - \frac{1}{m}\right) p(x = 0) \qquad (7)$$

Fig. 3. The region activated by a single point, for q = 5 and d = 2.

The volume of a hypersphere in d dimensions is

$$V = \frac{\pi^{d/2}\, r_s^{\,d}}{(d/2)!} \qquad (8)$$

where $r_s$ is the radius of the sphere [16]. Errors of this type can be reduced by making q large and r small. A limit is reached where q = r – 1; any increase of q beyond this point simply introduces redundant quantising functions.
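To make the estimate concrete, the following sketch (my own, not from the original work) evaluates equations (5) to (8) under the stated uniform-distribution assumption. The input space is taken to be measured in resolution units, so its volume is r to the power d.

```python
import math

def hypersphere_volume(radius, d):
    """Equation (8): volume of a d-dimensional hypersphere.

    math.gamma(d / 2 + 1) generalises the (d/2)! term to odd dimensions."""
    return (math.pi ** (d / 2)) * radius ** d / math.gamma(d / 2 + 1)


def sparse_training_error(q, r, n_train, m, d):
    """Equations (5)-(7): expected error from receptive fields containing no training point."""
    v_field = hypersphere_volume(q - 0.5, d)    # receptive field approximated by a sphere of radius q - 0.5
    v_space = float(r) ** d                     # total input space volume V_s, in resolution units
    mu = v_field * n_train / v_space            # equation (6)
    p_empty = math.exp(-mu)                     # Poisson P(X = 0), equation (5)
    return (1.0 - 1.0 / m) * p_empty            # equation (7)


print(sparse_training_error(q=5, r=50, n_train=10000, m=2, d=2))
```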

4.5.Errors due to class boundary effects

After the CMAC has been trained, a region will exist at any boundary of two classes, populated by points of both classes. The classification error in the class boundaries can be estimated if the size of this region is known. The error is therefore dependent upon the distribution of vectors in the training data. As an example, I used the parity problem. In this problem there are m classes which cycle with spatial frequency $f_c$ in each dimension. If the input for dimension k is $u_k = [0, u_{k\,max}]$ then the output class is given by:

$$c = \left( \sum_{k=1}^{d} \mathrm{floor}\!\left( \frac{u_k\, m f_c}{u_{k\,max}} \right) \right) \bmod m \qquad (9)$$

where d is the number of dimensions of the problem domain. For $f_c$ = 1 and m = 2 the size of the boundary in resolution units is 2r in two dimensions, and 3r in three dimensions. Fig. 4 shows a representation of the parity problem with these parameters. For d dimensions the fraction of input space lying in the boundary region is

$$F_b = \frac{d\, r^{(d-1)}}{r^{d}} = \frac{d}{r} \qquad (10)$$

so the expected error in classification is

$$E_b = \left(1 - \frac{1}{m}\right) \frac{d}{r} \qquad (11)$$

Errors of this type can be reduced by making r large. However, this result only applies to the parity problem.

Fig. 4. A 3-dimensional space for the parity problem, with spatial frequency of 1. Light and dark regions represent class 0 and class 1, respectively.

4.6.Errors due to high spatial frequency

Nyquist’s sampling theory requires that the number of tiles along an axis should be at least double the spatial frequency along that axis, for each class [26]. This is consistent with the findings of [12]. The number of tiles along each axis is r/q for each quantising function. As there are q quantising functions, the total number of tiles along each axis is equal to r. Errors of this type can be reduced by ensuring that r ≥ 2m$f_c$. This condition is easily met in my example using the parity problem with $f_c$ = 1, so will be ignored here.

5.Empirical trials

Empirical trials were used to compare the errors calculated from equations (7) and (11) with errors obtained by running a CMAC implementation using the same parameter values. A program was prepared to run the CMAC using KATA (section 3). This was tested using the parity problem. This problem was chosen because of its well-defined spatial frequency and uniform density of points, which enables testing of the theory. Data for $f_c$ = 1 and m = 2 were generated using values of the input vector u drawn from a uniform random distribution, with d varying from 2 to 5. Each data set contained 10,000 records. The CMAC was configured with a uniform quantisation of input space in all dimensions. Tile offset vectors were based on the work of Parks and Militzer [25]. A hashing function with chaining was used to achieve zero collisions. The distance measure used for kernels was Euclidean, and the kernel function used was linear.
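For reference, parity data of this kind might be generated as follows (a sketch based on my reading of equation (9); the [0, 1) input range, sample size argument and seed are illustrative assumptions).

```python
import math
import random

def parity_class(u, m, f_c, u_max=1.0):
    """Equation (9): class label for the parity problem."""
    return sum(math.floor(u_k * m * f_c / u_max) for u_k in u) % m


def make_parity_data(n, d, m=2, f_c=1, seed=0):
    """Draw n points uniformly at random in [0, 1)^d and label them with equation (9)."""
    rng = random.Random(seed)
    return [(u, parity_class(u, m, f_c))
            for u in ([rng.random() for _ in range(d)] for _ in range(n))]


data = make_parity_data(n=10000, d=2)
```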


Fig. 5. Classifier accuracy against resolution parameter, r, for a 2-dimensional parity problem, using 4 quantising functions.


The performance of the algorithm was tested using a cross validation method. The data sets used were each divided into three parts at random. Training was performed using two parts of the data, and the trained model was tested on the remaining one part. This was done three times using a different part for testing. In this manner the model was tested on all data, and reported a number of correctly classified samples, which was divided by the size of the data set to obtain a percentage accuracy. The choice of the fraction one third is a compromise between using all data to train, which may result in over fitting the model, and using less data to be computationally efficient [18].
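A minimal sketch of this three-fold procedure is given below; train_fn and classify_fn are hypothetical stand-ins for the KATA training and classification routines of section 3.

```python
import random

def cross_validate(data, train_fn, classify_fn, folds=3, seed=0):
    """Shuffle the data, hold out each of 'folds' parts in turn, and return overall accuracy."""
    rng = random.Random(seed)
    data = list(data)
    rng.shuffle(data)
    correct = 0
    for i in range(folds):
        test = data[i::folds]                                        # held-out part
        train = [s for j, s in enumerate(data) if j % folds != i]    # remaining parts
        model = train_fn(train)
        correct += sum(classify_fn(model, u) == label for u, label in test)
    return correct / len(data)
```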

Fig. 6. Classifier accuracy against resolution parameter, r, for a 3-dimensional parity problem, using 6 quantising functions.

Fig. 7. Classifier accuracy against resolution parameter, r, for a 4-dimensional parity problem, using 8 quantising functions.

Fig. 8. Classifier accuracy against resolution parameter, r, for a 5-dimensional parity problem, using 10 quantising functions.

The results are shown in Figs. 5 to 8. The solid line shows the empirical results, while the dotted line shows the accuracy predicted from the theory presented above. Fig. 5 shows that the optimum value of r when classifying the 2-dimensional problem is above 20. Figs. 6 to 8 show a plateau of values of r for which classifier accuracy is good. Accuracy falls away on either side of this plateau. Fig. 6 shows that the optimum range of values is between 20 and 120. Fig. 7 shows that the optimum range is between 30 and 80. Fig. 8 shows that the optimum range is between 20 and 60. These results show that it is possible to predict the approximate range of parameters to achieve optimum classifier accuracy.

6.Parameter setting for a practical CMAC classifier

The theory presented is useful as a prediction of correct parameter values, but how can this be applied to the practical use of CMAC classifiers? In order to extend these findings to any data set, a number of assumptions must be relaxed. Data sets encountered in practice are unlikely to have a uniform distribution. It is more likely that data points will be distributed in clumps of high density, with a small proportion of points appearing in regions of relatively low density. If the parameters are set as if the data were uniformly distributed, then the areas of high density will be correctly classified, while the areas of low density will not. Errors will be confined to areas containing few points, so will have little effect on the classification accuracy. These preliminary results suggest that it is possible to obtain a rough estimate of these errors from equations (5) to (8). Errors due to boundary effects, presented in section 4.5 above, must be generalised. In order to estimate these errors accurately, a detailed analysis of the distribution of the data set would be required. However, an upper bound on the error may be estimated by assuming that all boundaries form (d−1)-dimensional regions parallel to one of the axes. The number of such regions may be estimated from the spatial frequency of class transitions in each dimension:

$$n_b = \prod_{k=1}^{d} (m - 1)\, f_c \qquad (12)$$

Errors due to high spatial frequency of class transitions may be reduced by adjusting the parameters to ensure that r ≥ 2mfc. The procedure for determining the optimum set of CMAC parameters for a given data set is to minimise the error terms given by equations 7 and 11, subject to the constraint of r ≥ 2mfc. To these may be added additional objectives, if for example, it is desirable to optimise the processing time or memory use of the algorithm. This allows CMAC to be used as a practical classifier, as the values of parameters may be easily calculated and set to achieve optimum classifier accuracy.
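As a sketch of this procedure (my own, combining equations (7) and (11) and reusing the sparse_training_error helper sketched in section 4.4), one can scan candidate resolutions and keep the one with the lowest estimated total error. The candidate range and the default q = 2d (mirroring the empirical trials above) are illustrative choices.

```python
def boundary_error(r, m, d):
    """Equation (11): expected error due to class-boundary regions."""
    return (1.0 - 1.0 / m) * d / r


def choose_resolution(n_train, m, d, f_c=1, q=None, r_candidates=range(10, 201, 10)):
    """Pick r minimising the estimated error E_s + E_b, subject to r >= 2 * m * f_c."""
    best_r, best_err = None, float("inf")
    for r in r_candidates:
        if r < 2 * m * f_c:
            continue                            # spatial-frequency constraint from section 4.6
        q_eff = q if q is not None else 2 * d   # q = 2d, as used in the trials of section 5
        err = sparse_training_error(q_eff, r, n_train, m, d) + boundary_error(r, m, d)
        if err < best_err:
            best_r, best_err = r, err
    return best_r, best_err


print(choose_resolution(n_train=10000, m=2, d=3))
```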

7.Conclusion

Current models of the cerebellum, which were designed as robotic controllers, can be adapted for use as automated classification algorithms. This work has demonstrated how the accuracy of such classifiers may be simply estimated from knowledge of the data set and the classifier parameters. The benefit of this is that estimation is much faster than actually running the classifier and measuring the accuracy for a variety of parameters. The estimation technique has been tested using empirical trials on artificial data sets. Preliminary results suggest that the theory provides a good estimate of the range of values for the model parameters, which will achieve optimum performance of the classifier. This work provides quick and simple methods to enable the use of automated classifier algorithms built on cerebellar models.

Acknowledgments

This work was funded by a CSU Faculty Seed Grant.

References

[1] J.S. Albus, “A Theory of Cerebellar Function”, Mathematical Biosciences, Vol. 10, pp. 25-61, 1971.
[2] J.S. Albus, “A New Approach to Manipulator Control: the Cerebellar Model Articulation Controller (CMAC)”, Transactions of the ASME, Series G, Journal of Dynamic Systems, Measurement and Control, Vol. 97, No. 3, pp. 220-233, 1975.
[3] J.S. Albus, “Mechanisms of Planning and Problem Solving in the Brain”, Mathematical Biosciences, Vol. 45, pp. 247-293, 1979.
[4] E.L. Allwein, R.E. Schapire and Y. Singer, “Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers”, Journal of Machine Learning Research, Vol. 1, pp. 113-141, 2000.
[5] P.C.E. An, W.T. Miller, and P.C. Parks, “Design Improvements in Associative Memories for Cerebellar Model Articulation Controllers”, Proc. ICANN, pp. 1207-1210, 1991.
[6] M. Brown, C.J. Harris and P.C. Parks, “The Interpolation Capabilities of the Binary CMAC”, Neural Networks, Vol. 6, No. 3, pp. 429-440, 1993.
[7] D. Cornforth and D. Elliman, “Modelling Probability Density Functions for Classifying using a CMAC”, in M. Taylor and P. Lisboa (Eds.), Techniques and Applications of Neural Networks, Ellis Horwood, 1993.
[8] D. Cornforth, Classifiers for Machine Intelligence, PhD Thesis, Nottingham University, UK, 1994.
[9] T.G. Dietterich and G. Bakiri, “Solving Multiclass Learning Problems Via Error-Correcting Output Codes”, Journal of Artificial Intelligence Research, Vol. 2, pp. 263-286, 1995.
[10] R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis, John Wiley and Sons, 1973.
[11] Z.J. Geng and W. Shen, “Fingerprint Classification Using Fuzzy Cerebellar Model Arithmetic Computer Neural Networks”, Journal of Electronic Imaging, Vol. 6, No. 3, pp. 311-318, 1997.
[12] F.J. Gonzalez-Serrano, A.R. Figueiras-Vidal and A. Artes-Rodriguez, “Fourier Analysis of the Generalized CMAC Neural Network”, Neural Networks, Vol. 11, pp. 391-396, 1998.
[13] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
[14] Y. Hirashima, Y. Iiguni, and N. Adachi, “An Adaptive Control System Design Using a Memory Based Learning System”, International Journal of Control, Vol. 68, No. 5, pp. 1085-1102, 1997.
[15] K. Hwang and C. Lin, “Smooth Trajectory Tracking of Three-Link Robot: a Self-Organising CMAC Approach”, IEEE Transactions on Systems, Man and Cybernetics Part B: Cybernetics, Vol. 28, No. 5, pp. 680-692, 1998.
[16] M.G. Kendall, A Course in the Geometry of n Dimensions, Hafner Publishing Co., 1961.
[17] A. Kolcz and N.M. Allinson, “The General Memory Neural Network and its Relationship with Basis Function Architectures”, Neurocomputing, Vol. 29, pp. 57-84, 1999.
[18] C.A. Kulikowski and S. Weiss (Eds.), Computer Systems That Learn: Classification and Prediction Methods From Statistics, Neural Nets, Machine Learning, and Expert Systems, Morgan Kaufmann, 1991.
[19] S.H. Lane, D.A. Handelman, and J.J. Gelfand, “Theory and Development of Higher-Order CMAC Neural Networks”, IEEE Control Systems, pp. 23-30, 1992.
[20] C. Lin and C. Chiang, “Learning Convergence of CMAC Technique”, IEEE Transactions on Neural Networks, Vol. 8, No. 6, pp. 1282-1292, 1997.
[21] W.T. Miller, F.H. Glanz, and L.G. Kraft, “CMAC: An Associative Neural Network Alternative to Backpropagation”, Proceedings of the IEEE, Vol. 78, No. 10, pp. 1561-1567, 1990.
[22] M. Miwa, T. Furuhashi, M. Matsuzaki and S. Okuma, “CMAC Modeling Using Bacterial Evolutionary Algorithm (BEA) on Field Programmable Gate Array”, Proceedings of the 26th Annual Conference of the IEEE Industrial Electronics Society, 2000.
[23] K. Motoyama, K. Suzuki, M. Yamamoto and A. Ohuchi, “Evolutionary State Space Configuration with Reinforcement Learning for Adaptive Airship Control”, Australia-Japan Workshop on Intelligent and Evolutionary Systems, pp. 131-138, 1999.
[24] P.C. Parks and J. Militzer, “Convergence Properties of Associative Memory Storage for Learning Control System”, Automat. Remote Contr., Vol. 50, pp. 254-286, 1989.
[25] P.C. Parks and J. Militzer, “Improved Allocation of Weights for Associative Memory Storage in Learning Control Systems”, IFAC Design Methods of Control Systems, Zurich, Switzerland, pp. 507-512, 1991.
[26] W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery, Numerical Recipes, Cambridge University Press, 1987.
[27] J.C. Santamaria, R.S. Sutton and A. Ram, “Experiments with Reinforcement Learning in Problems with Continuous State and Action Spaces”, Technical Report UM-CS-1996-088, Department of Computer Science, University of Massachusetts, MA, 1996.
[28] T. Yamamoto and M. Kaneda, “Intelligent Controller Using CMACs with Self-Organised Structure and its Application for a Process System”, IEICE Trans. Fundamentals, Vol. E82-A, No. 5, pp. 856-860, 1999.
[29] S. Yao and B. Zhang, “The Learning Convergence of CMAC in Cyclic Learning”, Proceedings of the International Joint Conference on Neural Networks, 1993.
