Prototype Search for a Nearest Neighbor Classifier by a Genetic Algorithm Jari Kangas Nokia Research Center, P.O. Box 100, FIN-33721 Tampere, Finland E-mail:
[email protected]

Abstract

This paper deals with the problem of finding good prototypes for a condensed nearest neighbor classifier for a character recognition system based on directional codes. A prototype search by a genetic algorithm capable of creating new prototypes is compared against a direct prototype selection algorithm. It is shown in a leave-one-out experiment that the prototypes found by the genetic algorithm give significantly smaller intraclass distances to class members and significantly larger normalized interclass distances.
1. Introduction

Nearest neighbor classification algorithms have been used with good results in many pattern recognition applications. Their main drawbacks are the large memory requirements and the high computational cost caused by the often large number of prototypes. Therefore, many different methods have been developed to decrease the number of necessary prototypes. In this paper we have looked for good class prototypes for a simple nearest neighbor classifier. Only one prototype per class was sought. The goal was to find prototypes that minimize the intraclass distance and maximize the interclass distances. Many algorithms use only the existing training samples as viable prototypes. In this paper we compare one such algorithm to another search algorithm that is able to create new prototypes.
2. Character recognition system

We applied a simple character recognition system based on a nearest neighbor classifier. This section explains the basics of the applied recognition system; for a somewhat similar character recognition system, see [10]. The input data for the recognition was collected from a drawing tablet and consisted of a sequence of coordinate pairs in tablet coordinates. The strokes were separated. Preprocessing consisted of normalization to a fixed size, data resampling, and transformation to direction codes (chain codes [4]). After preprocessing, the character data consisted of a list of strokes: the direction codes were given first, followed by the stroke begin and end locations. The direction codes were from the set {1,…,8}. Begin and end locations were given in the range [1,100]. The coordinate origin was in the upper-left corner. The order of coordinates was [begin-x, begin-y, end-x, end-y]. One sample of an "A" character consisting of two strokes is shown below:

"A" = {
[5 5 5 5 5 5 4 4 2 1 1 1 1 1 1] [0 100 59 100] [4 3 3 3 3] [6 65 36 65] }
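As an illustration, a preprocessed character of this kind could be stored as in the following Python sketch; the names Stroke and Character are introduced here only for the sketches in this paper and are not part of the recognition system itself.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Stroke:
    codes: List[int]          # direction codes, each in {1, ..., 8}
    begin: Tuple[int, int]    # (x, y) of the stroke start, range [1, 100]
    end: Tuple[int, int]      # (x, y) of the stroke end, range [1, 100]

@dataclass
class Character:
    label: str
    strokes: List[Stroke]

# The "A" sample listed above, written in this form:
sample_A = Character(
    label="A",
    strokes=[
        Stroke(codes=[5, 5, 5, 5, 5, 5, 4, 4, 2, 1, 1, 1, 1, 1, 1],
               begin=(0, 100), end=(59, 100)),
        Stroke(codes=[4, 3, 3, 3, 3], begin=(6, 65), end=(36, 65)),
    ],
)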
The k-nearest neighbor (k-nn) classifier [1-3], used in the recognition system, is one of the simplest classification methods. The main principle is to decide the class membership of a sample by computing its distances to preclassified examples and checking the class memberships of the k nearest ones. The algorithm assumes that the most similar examples belong to the correct class. The efficiency of a nearest neighbor classifier depends on the distance function used. In our system the total distance between two character samples was the sum of minimum distances between corresponding strokes (direction codes and location parts), plus an extra term for every stroke in one character that had no corresponding stroke in the other character.
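The stroke-level distances are not specified in detail above; as a rough illustration only, the following sketch assumes an edit distance on the direction-code sequences (cf. [9]), a city-block distance on the location parameters, in-order matching of strokes, and a fixed penalty for unmatched strokes. The weight and the penalty value are assumptions, not the values used in the experiments.

def code_distance(a, b):
    # Edit distance between two direction-code sequences (an assumption;
    # the text above only states that corresponding strokes are compared).
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n]

def location_distance(s, t):
    # City-block distance between the [begin-x, begin-y, end-x, end-y] parts.
    sa, ta = list(s.begin) + list(s.end), list(t.begin) + list(t.end)
    return sum(abs(x - y) for x, y in zip(sa, ta))

UNMATCHED_PENALTY = 50.0   # hypothetical cost for a stroke with no counterpart

def total_distance(c1, c2, w_loc=0.1):
    # Sum of per-stroke distances over pairs matched in writing order,
    # plus a penalty for every unmatched stroke (weights are assumptions).
    dist = 0.0
    for s, t in zip(c1.strokes, c2.strokes):
        dist += code_distance(s.codes, t.codes) + w_loc * location_distance(s, t)
    dist += UNMATCHED_PENALTY * abs(len(c1.strokes) - len(c2.strokes))
    return dist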
3. Genetic algorithms

Genetic algorithms belong to a group of heuristic optimization algorithms which mimic the problem solving methods of evolution. A group of potential solutions is stored, and survival of the fittest is applied. New solutions are created dynamically by combining the best solutions with modifications. For overviews of the research area, see, e.g., [6-7]. The generic structure of a genetic algorithm is as follows (a sketch of the corresponding loop is given after the list):
1) The algorithm maintains a population of potential solutions.
2) A measure of fitness is computed for each solution, i.e. how good it would be.
3) A new population of solutions is initialized by selecting the best ones from the previous population.
4) Some or all of the solutions undergo transformations, thereby introducing new trial solutions to the population. Several types of transformations exist: in some, each solution is modified separately (mutation); in others, several solutions are combined by taking parts from each (crossover).
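As a sketch, the four steps above correspond to a loop of the following form (Python; the function names and the rank-based selection scheme are illustrative, and the concrete operators used in this work are described in Section 4.2).

import random

def genetic_search(init_population, fitness, mutate, crossover,
                   generations=100, n_elite=2, n_mutate=4):
    # Generic GA skeleton following steps 1-4 above; here a lower fitness
    # value is taken to be better (e.g. a cumulative distance).
    population = list(init_population)
    for _ in range(generations):
        # 2) score every candidate solution
        ranked = sorted(population, key=fitness)
        # 3) keep the best candidates for the next generation
        next_pop = ranked[:n_elite]
        # 4a) mutation: perturb some of the next-best candidates
        next_pop += [mutate(p) for p in ranked[n_elite:n_elite + n_mutate]]
        # 4b) crossover: fill the population by combining the selected ones
        while len(next_pop) < len(population):
            a, b = random.sample(next_pop[:n_elite + n_mutate], 2)
            next_pop.append(crossover(a, b))
        population = next_pop
    return min(population, key=fitness)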
4. Prototype selection problem

In practice, using the k-nearest neighbor classifier is difficult because all the prototypes must be stored and the comparison phase often becomes time consuming. Many algorithms have been developed to reduce the number of prototypes (see, e.g., [8]). In many practical classification problems it is only necessary to store a small subset of samples without sacrificing accuracy. A well-known algorithm for reducing the number of prototypes is the Condensed Nearest Neighbor (CNN) rule [5]. In this paper we have compared two prototype selection algorithms. The task was to find only one (average) prototype for each class. The first algorithm is a straightforward selection among the example cases, i.e. one of the example cases is used as the prototype. The second algorithm is able to create novel prototypes which are not available among the example cases.

4.1 Prototype selection among the example cases

The selection of a class prototype among the examples was done as follows. For every sample s_i, the sum S_i of distances to the other members of its class A_i was computed:
S_i = \sum_{s_j \in A_i,\, j \neq i} D_{Tot}(s_i, s_j).
The sample that gave the minimum sum was used as the prototype. Because the search is restricted to the training samples only, we used a fairly large number of samples (30 per class) to make sure that there was enough variation to find a reasonable average. The method gives the best choice among all the class samples.
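A minimal sketch of this selection rule is the following; the distance argument stands for a total character distance such as the hypothetical total_distance sketch of Section 2.

def select_prototype(samples, distance):
    # Return the class sample with the smallest sum of distances S_i
    # to the other samples of the same class.
    best, best_sum = None, float("inf")
    for i, s in enumerate(samples):
        s_sum = sum(distance(s, t) for j, t in enumerate(samples) if j != i)
        if s_sum < best_sum:
            best, best_sum = s, s_sum
    return best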
4.2 Prototype search by a genetic algorithm

The prototype search by a genetic algorithm followed the basic function optimization approaches [6-7]. Directions were encoded as discrete symbols, locations as real-valued parameters. The length of the strokes (the number of direction codes in each stroke) was naturally variable. The number of strokes was fixed for all samples in each class; all the character samples were collected from one person who consistently used the same writing style. The fitness values were found by computing the cumulative distance of each prototype from all training samples. We used crossover and mutation operations. To generate the new individuals we used a mixture of techniques based on rank order. First we used "elitist" selection, saving the N_1 fittest prototypes directly for the next generation. Then the N_2 next fittest individuals were mutated and saved. Last, the already selected N_1 + N_2 individuals were used to generate new individuals for the next generation by the crossover operation. The number of individuals in each generation was fixed. The number of generations was also fixed; we used a number that was large enough for saturation.

Mutation operation. The mutation operation was applied separately to the direction codes and to the location parameters. The mutated strokes were selected at random. Each direction code in a sequence was mutated with a certain small probability; in mutation, a code was replaced by a random code from the set of all direction codes. Each location parameter was perturbed by a small random value.

Crossover operation. We used one-point crossover and applied it in the direction-code section. In mating, the location parameters were always copied from the first parent. The crossover location was first selected in one parent sequence, and then a "nearby" location was selected in the other parent sequence. Because the locations were not exactly the same (as measured from the beginning of the sequence), the length of the resulting offspring sequence could vary as well. That was very convenient, because the sample sequences were not of the same length and defining an optimal prototype sequence length analytically would have been difficult.
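The following sketch illustrates operators of this kind on the stroke representation of Section 2; the mutation probability, the size of the location perturbation, and the way a "nearby" crossover point is chosen are illustrative assumptions.

import copy
import random

def mutate_character(proto, p_code=0.05, loc_dev=3):
    # Mutate one randomly chosen stroke of a prototype (the probability and
    # perturbation size are assumptions, not the values used in this work).
    new = copy.deepcopy(proto)
    stroke = random.choice(new.strokes)
    # each direction code is replaced with a small probability by a random code
    stroke.codes = [random.randint(1, 8) if random.random() < p_code else c
                    for c in stroke.codes]
    # each location parameter is deviated by a small random value, kept in [1, 100]
    def clamp(v):
        return min(100, max(1, v))
    stroke.begin = tuple(clamp(x + random.randint(-loc_dev, loc_dev)) for x in stroke.begin)
    stroke.end = tuple(clamp(x + random.randint(-loc_dev, loc_dev)) for x in stroke.end)
    return new

def crossover_characters(parent1, parent2, window=2):
    # One-point crossover in the direction-code section; the cut point in the
    # second parent is a "nearby" location, so the offspring length may vary.
    # Location parameters are always copied from the first parent.
    child = copy.deepcopy(parent1)
    for s1, s2, sc in zip(parent1.strokes, parent2.strokes, child.strokes):
        cut1 = random.randint(1, max(1, len(s1.codes) - 1))
        lo, hi = max(1, cut1 - window), min(len(s2.codes), cut1 + window)
        cut2 = random.randint(lo, hi) if lo <= hi else len(s2.codes)
        sc.codes = s1.codes[:cut1] + s2.codes[cut2:]
    return child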
5. Experiments

5.1 Character data

For the prototype search experiments we collected data of 10 handprinted characters (uppercase characters "A"-"J"), 30 samples of each, from one writer only. One writer was used to keep the sample sets compact. It was expected that the character samples of one writer would form a unimodal set for each character, in which case the use of one class prototype is well motivated. If there were several modes in a character set, the use of the selection method in particular would be questionable. We used a Wacom ArtPad II drawing tablet connected to a Windows NT machine. The drawing area was a square of about 2 cm by 2 cm. Because the characters were later normalized, the character size was not critical. The sampling frequency was approximately 100 Hz, and the resolution in both directions was about 100 points.
5.2 Indicator variables

We used two prototype quality variables. The first variable D_1 was an estimate of how far, on average, the samples are located from the correct class prototypes. That value should be as small as possible. It is, however, sensitive to overall scale changes. To compensate for that we used another variable D_2, the ratio of the interclass distance of each sample from the nearest incorrect class prototype to the average intraclass distance over all classes:
D_1 = \frac{1}{N} \sum_i d_i^c, \qquad D_2 = \frac{1}{N \cdot \bar{D}_1} \sum_i d_i^w,
where N is the number of test samples, d_i^c is the distance of the i:th test sample to the correct prototype, and d_i^w is the distance of the i:th test sample to the nearest incorrect prototype. Indicator values are given classwise, i.e. samples of one class are grouped together. The normalizing value \bar{D}_1 is the mean of D_1 over all 10 classes.

5.3 Experimental setup

The test runs were conducted using the leave-one-out principle: one test sample at a time was left out from the training set, the prototypes were found using the remaining samples, and the distances of the test sample to the corresponding prototypes (the correct one and the nearest incorrect one) were computed. The average values were therefore computed using 300 pairs of distances (d_i^c and d_i^w) for both prototype search methods.
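A sketch of this evaluation loop is given below (Python; find_prototypes stands for either of the two search methods of Section 4 and is assumed to return a dictionary from class label to prototype, and distance stands for the total character distance).

from collections import defaultdict

def evaluate(samples, labels, find_prototypes, distance):
    # Leave-one-out evaluation of the indicator variables.  For each held-out
    # sample, d_c is its distance to the prototype of its own class and d_w
    # its distance to the nearest prototype of any other class.
    per_class = defaultdict(lambda: ([], []))     # label -> (d_c values, d_w values)
    for i, (s, y) in enumerate(zip(samples, labels)):
        train = [(t, l) for j, (t, l) in enumerate(zip(samples, labels)) if j != i]
        prototypes = find_prototypes(train)       # assumed: dict label -> prototype
        per_class[y][0].append(distance(s, prototypes[y]))
        per_class[y][1].append(min(distance(s, p)
                                   for l, p in prototypes.items() if l != y))
    d1 = {c: sum(dc) / len(dc) for c, (dc, dw) in per_class.items()}
    mean_d1 = sum(d1.values()) / len(d1)          # the normalizing mean of D_1
    d2 = {c: sum(dw) / (len(dw) * mean_d1) for c, (dc, dw) in per_class.items()}
    return d1, d2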
6. Results

For each of the ten classes we computed the average intraclass distances (Table 1). To check the significance of the results, we computed the mean of the classwise differences and the confidence interval for the mean. The mean value was 0.19, and the 99 % confidence interval was from 0.08 to 0.30. Because zero was outside the confidence interval, the intraclass distances given by the genetic search were significantly smaller than the corresponding values for the selection method.

Table 1. The average intraclass distances inside each character class by two different training methods.
Method      A      B      C      D      E      F      G      H      I      J      Means
Selection   3.38   2.45   0.62   2.37   3.91   2.01   1.70   4.77   0.71   0.56   2.25
GA          3.18   2.06   0.63   2.12   3.45   1.78   1.62   4.61   0.50   0.57   2.06
Diff.       0.20   0.39  -0.01   0.25   0.36   0.23   0.08   0.16   0.21  -0.01   0.19
Table 2. The average normalized interclass distances between the samples and the best incorrect class prototype given by two different training methods.
Method      A       B      C       D      E       F       G       H       I       J
Selection   25.92   2.39   11.79   1.72   10.74   10.51   12.82   29.04   9.37    13.94
GA          28.33   2.66   14.57   1.68   11.65   11.20   13.88   31.39   10.12   15.28
Diff.       -2.41   -0.27  -2.78   0.04   -0.91   -0.68   -1.06   -2.34   -0.75   -1.34
For the interclass distances (Table 2) we again computed the mean of the classwise differences and its confidence interval. The mean value was –1.25, and the 99 % confidence interval was from –2.03 to –0.47. Because zero was again outside the confidence interval, the normalized interclass distances given by the genetic search were significantly larger than the corresponding values for the selection method.
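For reference, a confidence interval of this kind for the mean of the classwise differences can be computed as in the following sketch; a normal approximation is assumed here, so it should be read as an illustration of the calculation rather than as the exact procedure used above.

import math

def mean_difference_ci(differences, z=2.576):
    # Mean of the paired differences and a 99 % confidence interval under a
    # normal approximation (z = 2.576); a t-interval would widen it slightly
    # when only 10 classwise differences are available.
    n = len(differences)
    mean = sum(differences) / n
    var = sum((d - mean) ** 2 for d in differences) / (n - 1)
    half = z * math.sqrt(var / n)
    return mean, (mean - half, mean + half)

# Classwise intraclass differences from Table 1:
diffs = [0.20, 0.39, -0.01, 0.25, 0.36, 0.23, 0.08, 0.16, 0.21, -0.01]
print(mean_difference_ci(diffs))   # mean of about 0.19, interval close to (0.08, 0.30)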
7. Conclusions

It has been shown that a genetic algorithm can be used to find good prototypes for a nearest neighbor classifier. The comparison here was made against a prototype selection scheme. The intraclass distances from the prototypes to the correct class samples were significantly decreased, and the normalized interclass distances between the samples and the best incorrect prototypes were significantly increased. The genetic algorithm was very straightforward and flexible, and it could find good prototypes using any distance function. The genetic algorithm would also make it possible to enlarge the character representation scheme of the prototypes to improve the prototype matching even further.
References

[1] T.M. Cover and P.E. Hart, "Nearest Neighbor Pattern Classification," IEEE Transactions on Information Theory, 13(1), pp. 21-27, January 1967.
[2] B.V. Dasarathy (editor), Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques, IEEE Computer Society Press, 1991.
[3] E. Fix and J.L. Hodges, "Discriminatory Analysis – Nonparametric Discrimination: Consistency Properties," Tech. Report No. 4, USAF School of Aviation Medicine, Texas, USA, 1951.
[4] H. Freeman, "On the Encoding of Arbitrary Geometric Configurations," IRE Transactions on Electronic Computers, pp. 260-268, June 1961.
[5] P.E. Hart, "The Condensed Nearest Neighbor Rule," IEEE Transactions on Information Theory, pp. 515-516, May 1968.
[6] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, Springer-Verlag, 1996.
[7] M. Mitchell, An Introduction to Genetic Algorithms, MIT Press, 1996.
[8] B.D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, 1996.
[9] D. Sankoff and J.B. Kruskal, Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, Addison-Wesley, 1983.
[10] H. Yuen, "A Chain Coding Approach for Real-time Recognition of On-line Handwritten Characters," in Proc. of Int. Conf. on Acoustics, Speech and Signal Processing, vol. 6, pp. 3426-3429, 1996.