Report No. 95-185
Vector Quantisation Classifiers for Handwritten Character Recognition, by M. Neschen, 1995
Proceedings of the ZEUS-95 Workshop, Linköping, Sweden, May 1995, IOS Press, Netherlands
Martin Neschen, Zentrum für Paralleles Rechnen (ZPR), Universität zu Köln, Weyertal 80, D-50931 Köln, Germany; phone: +49 221 470-6010; email:
Abstract. We present the development of a pattern recognition architecture based on vector quantization techniques and its application to the recognition of handwritten bank forms. After an overview of nearest-neighbor classification and clustering, a fast, completely binary version of the k-means algorithm is introduced, and results for large character databases are given. The integration of these methods into a multi-agent environment is discussed. Both an efficient implementation on general MIMD processors and a realization on a dedicated SIMD architecture are presented.
Keywords: pattern recognition, vector quantization, embedded systems
1 Introduction

The recognition of handwritten text is a highly relevant industrial problem with a wide range of applications, including the semi-automatic processing of bank forms, cheques, and many other kinds of handwritten documents, as well as the automatic sorting of millions of letters every day. Although standard Optical Character Recognition ("OCR") techniques are today successfully employed for typewritten documents, the reliable treatment of handwritten text still needs to be improved in order to limit manual verification and correction to a minimal number of cases.

Implementations relying on neural networks have been used quite successfully for the recognition of handwritten digits such as postal codes [1, 2] or credit card slips. However, for more complex tasks, such as the recognition of alphanumeric text on bank forms, these techniques seem to have reached their limits. More sophisticated "multi-agent" architectures have emerged recently, combining approaches from different areas such as neural networks, pattern matching, hidden Markov chains, and statistical optimization. These "Intelligent Character Recognition" (ICR) algorithms may allow a further reduction of the error rate and a better detection of possible ambiguities. On the other hand, the systems attaining the highest reliability require huge amounts of computing power and training on extensive sets of classified samples. The use of multi-processor computers in "embedded systems" and the realization of dedicated VLSI structures for pattern recognition are two options for providing that performance for industrial applications.
We have developed a complete system for character recognition based on nearest-neighbor classifiers. The main application we aim at is the semi-automatic recognition of bank transfer forms, which is time-critical and at the same time requires high reliability. Therefore, we put emphasis both on obtaining the lowest error rates and on realizing the most efficient implementation of our methods.

Just like neural networks, nearest-neighbor methods are able to model any classification function in the limit of an infinite number of prototypes. However, they can contribute complementary aspects, so a better overall classification behavior can be expected when their output is merged with the result of a neural network in a multi-agent structure. Confidence measures can be obtained easily and may be used for rejecting the classification of certain fields. Furthermore, there exist clustering techniques capable of "on-line learning", i.e. of learning incrementally on huge sets of sample patterns without iterating many times over the same data set. Sample data is abundant for our applications, so we try to "learn" as many patterns as possible in order to cover a wide variety of writing styles.

In this paper, we first give an overview of nearest-neighbor techniques relying on the direct matching between a segmented pattern and a large prototype database. Then we describe how clustering techniques can be used to "condense" the learning set in order to obtain an optimal prototype set. We show how these methods can be applied to binary image data, present a new, extremely efficient version of the k-means algorithm for binary vectors, and give preliminary results for tests on large databases. We further introduce a classification scheme based on hierarchical vector quantization, which not only requires far fewer pattern comparisons but can also deal with ambiguities between classes quite naturally.

Then, we explain how more complex "multi-agent architectures" can be built upon these basic techniques, using discrete optimization techniques to determine the globally most probable classification result instead of taking the decision at the character level. Finally, we give details on the segmentation and the actual implementation. We also present a special-purpose VLSI circuit implementing the comparison of binary patterns in an optimal way.
2 Classification by "k-Nearest-Neighbor" Methods

Nearest-neighbor methods are well-known techniques for the non-parametric classification of pattern vectors [3, 4]. They work by calculating the distances between one input pattern $x \in V$ of a vector space $V$ and a large number of prototypes $y_j \in V$. Although such direct techniques may involve a larger number of operations than, for example, feed-forward neural networks, the overall computing requirements can be lower due to the simplicity of the approach. In particular, in the binary case, the calculation of the Hamming distances reduces to the addition of single bits. The inherent massive parallelism over prototypes can easily be exploited on both SIMD and MIMD parallel structures.

Given a certain distribution

\[ p(x, c) \tag{1} \]

for the probability of a pattern $x$ belonging to a class $c \in C$, the "optimal" Bayes classifier $C_B$ simply chooses the most probable classification result:

\[ C_B(x) = c_m, \quad \text{such that} \quad p(x, c_m) = \max_{c \in C} p(x, c). \tag{2} \]

Given a limited number of sample vectors $y_j$ $(j = 1, \ldots, N)$ and class numbers $c_j = c(y_j)$, a k-Nearest-Neighbor ("kNN") classifier takes into account only the $k$ prototypes nearest to the input pattern. Usually, the decision is determined by the majority of the class values of these neighbors.

Vector Quantization, i.e. the approximation of a vector by the nearest vector out of a number of given prototypes, the "codebook", is a well-known technique for data compression [5]. Used for classification, it leads to the most reduced case of kNN, where the result is given directly by the class number of the nearest prototype:

\[ C_{VQ}(x) = c_j, \quad \text{where} \quad \|x - y_j\| \le \|x - y_l\| \ \text{for all } l \ne j. \tag{3} \]

It has been shown that kNN classifiers are almost optimal in the sense that, for a large number $N$ of prototypes, the classification error is always lower than twice the Bayesian error. For large $k$, the Bayes rule is even approximated [4].

On the other hand, in many applications it is not reasonable to determine only the most probable classification result. Instead, one is more interested in an approximation of the probability distribution $p(x, c)$. This strategy allows a decision to be taken at a higher level, for example after taking into account other classifiers on the same input data or context information. Therefore, we have generalized the output of such VQ classifiers by determining the distances between one input pattern $x$ and all classes $\tilde{c} \in C$ separately, which we define as the best match within that class:

\[ d(x, \tilde{c}) = \min_{\{j \,|\, c_j = \tilde{c}\}} \|x - y_j\|. \tag{4} \]

When the distance distribution for every class is taken into account, these values $d(x, c)$ can be translated into confidence probabilities. Let $c_1$ and $c_2$ denote the nearest and the second nearest class to $x$, i.e. $d(x, c_1) = d(x, C_{VQ}(x))$. Then the difference

\[ \Delta d := d(x, c_2) - d(x, c_1) \tag{5} \]

is a reasonable rejection criterion: a small $\Delta d$ indicates that two classes match the pattern almost equally well.
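As an illustration of eqs. (3) to (5), the following minimal Python sketch classifies a binary pattern by its nearest prototype and rejects it when the distance difference is small. It is not the implementation described in this paper: the packing of patterns into Python integers, the function names, the toy 8-bit prototypes, and the convention of rejecting when the difference is less than or equal to a threshold are assumptions made for this example.

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two bit vectors packed into Python ints."""
    return bin(a ^ b).count("1")


def class_distances(x, prototypes):
    """Eq. (4): for every class, the distance to its best-matching prototype.

    `prototypes` is a list of (bit_vector, class_label) pairs.
    """
    best = {}
    for y, c in prototypes:
        d = hamming(x, y)
        if c not in best or d < best[c]:
            best[c] = d
    return best


def classify(x, prototypes, reject_threshold=None):
    """Return (label, delta_d); label is None when the pattern is rejected.

    Assumed rejection rule: reject if delta_d <= reject_threshold.
    """
    ranked = sorted(class_distances(x, prototypes).items(), key=lambda kv: kv[1])
    c1, d1 = ranked[0]
    d2 = ranked[1][1] if len(ranked) > 1 else float("inf")
    delta_d = d2 - d1  # eq. (5)
    if reject_threshold is not None and delta_d <= reject_threshold:
        return None, delta_d
    return c1, delta_d


# Toy codebook of 8-bit patterns (hypothetical data).
PROTOTYPES = [(0b11110000, "A"), (0b11100000, "A"), (0b00001111, "B")]

print(classify(0b11110001, PROTOTYPES))                      # clear match
print(classify(0b00111100, PROTOTYPES, reject_threshold=0))  # ambiguous pattern
```

The per-class distances returned by `class_distances` are the quantities that a higher-level stage could turn into confidence values.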
2.1 Results for the NIST Database

In order to test the simple "kNN" approach for character recognition applications, we have performed extensive tests on patterns from the NIST standard database of handwritten characters [6]. These patterns were first sheared to their main vertical axis (after determination of the xy-correlation) and then normalized to a fixed size (i.e. with a different scale factor in the x and y directions). Both operations largely improve the recognition rates because they reduce the variability, so that fewer prototypes are required to cover all possible writing styles.
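The two preprocessing steps can be sketched as follows. This is only one plausible reading of the description above, not the library routine used for the experiments: the slant is estimated here as the regression slope of x on y over the foreground pixels (covariance divided by the variance of y), and the size normalization uses nearest-neighbor sampling of the bounding box. The function names and the list-of-rows image representation are assumptions.

```python
def shear_to_vertical(img):
    """Shear a binary image (list of rows of 0/1) so its main axis is vertical."""
    pts = [(x, y) for y, row in enumerate(img) for x, v in enumerate(row) if v]
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    cov_xy = sum((x - mx) * (y - my) for x, y in pts)
    var_y = sum((y - my) ** 2 for _, y in pts)
    slope = cov_xy / var_y if var_y else 0.0  # assumed slant estimate
    width = len(img[0])
    out = [[0] * width for _ in img]
    for x, y in pts:
        nx = round(x - slope * (y - my))  # shift each row against the slant
        if 0 <= nx < width:               # pixels shifted outside are dropped
            out[y][nx] = 1
    return out


def normalize(img, width, height):
    """Scale the bounding box of the foreground to width x height pixels,
    with independent scale factors in the x and y directions."""
    ys = [y for y, row in enumerate(img) if any(row)]
    xs = [x for row in img for x, v in enumerate(row) if v]
    x0, y0 = min(xs), min(ys)
    bw, bh = max(xs) - x0 + 1, max(ys) - y0 + 1
    return [[img[y0 + (j * bh) // height][x0 + (i * bw) // width]
             for i in range(width)]
            for j in range(height)]


# A diagonal stroke becomes a vertical one after shearing.
diagonal = [[1 if x == y else 0 for x in range(5)] for y in range(5)]
upright = shear_to_vertical(diagonal)
pattern = normalize(upright, 4, 6)
```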
Classes                    Resolution   Training set   Test set   Error rate
Digits 0,...,9             16 × 16      167 350        55 784     1.05%
Digits 0,...,9             20 × 24      167 350        55 784     0.66%
Capital letters A,...,Z    20 × 24      33 741         11 247     4.29%

Table 1: Error rates of binary kNN obtained for the NIST database (without rejection)

Table 1 shows the results obtained for the largest training sets and for different resolutions. The first 3/4 of the database were taken as reference patterns, whereas the last quarter was used for testing the recognition. The results are comparable to neural network solutions [1] and to new, sophisticated and time-consuming transform methods [2], which lead to error rates as low as 0.4%.

Table 2 shows that the distance difference $\Delta d$ is a useful criterion for rejection. In case A, rejecting only about 12% of all patterns reduces the classification error from 2% to about 0.2%.

It is also very interesting to study the recognition error as a function of the size of the database, shown in Table 3. In Fig. 1, the resulting errors are displayed in a log/log plot over the database size $N$ (curve A). A power law

\[ e \approx C N^{-\omega}, \quad \text{with } \omega = 0.37, \tag{6} \]

where $e$ denotes the classification error, seems to be a good approximation, and even lower error rates can be expected for larger training sets, because no sign of saturation for large $N$ is evident in Fig. 1.

Reject if Δd ≤    A: Rejection rate   A: Error rate   B: Rejection rate   B: Error rate
(no rejection)    0.0%                2.0%            0.0%                4.3%
0                 0.4%                1.7%            0.5%                4.1%
1                 1.2%                1.3%            1.3%                3.6%
2                 2.0%                1.0%            2.4%                3.1%
3                 2.6%                0.8%            3.2%                2.7%
5                 4.3%                0.54%           5.1%                2.1%
7                 6.8%                0.32%           7.0%                1.6%
10                11.9%               0.16%           10.5%               1.1%

Table 2: Error rates using the difference of the distances to the two nearest classes, Δd, as a rejection criterion. A: Digits 0,...,9 with 37 000 prototypes, size 16 × 16. B: Capital letters A,...,Z with 33 700 prototypes, size 20 × 24.

Number of prototypes N   1000   2000   5000   10000   20000   50000   167343
Error rate               6.8%   4.8%   4.0%   3.0%    2.4%    1.7%    1.05%

Table 3: Error rate for NIST digits (without rejection) as a function of the number of prototypes

The simple kNN methods presented here require a comparison with all training patterns for every new classification. With an optimized implementation of the binary nearest-neighbor algorithm on a SPARC-10 processor, we obtained about 1 million comparisons of 20 × 24 patterns per second, resulting in a classification speed of 6 digits per second at an error rate of 0.66%. Although such low error rates can justify this large computational effort, more intelligent methods need to be applied for a practical realization, where both memory and performance resources limit the number of prototypes.
Figure 1 (plot not reproduced; log/log axes: classification error e over database size N): Pattern recognition error e for different database sizes N and different algorithms. A: simple kNN without learning; B/C: kNN on prototypes chosen with the binary (B) and the standard (C) k-means method.
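The power law of eq. (6) can be checked against the numbers of Table 3 with a straight-line fit in log/log coordinates. The sketch below uses an unweighted least-squares fit over all seven points, which is an assumption: the fitting procedure behind the quoted exponent is not stated, so the result should only be expected to lie close to ω = 0.37, not to reproduce it exactly.

```python
import math

# Table 3: number of prototypes N and error rate in percent.
N_PROTOTYPES = [1000, 2000, 5000, 10000, 20000, 50000, 167343]
ERROR_PERCENT = [6.8, 4.8, 4.0, 3.0, 2.4, 1.7, 1.05]


def fit_power_law(ns, errs):
    """Least-squares fit of log(err) = log(C) - omega * log(N).

    Returns (C, omega) for the model err = C * N**(-omega).
    """
    xs = [math.log(n) for n in ns]
    ys = [math.log(e) for e in errs]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return math.exp(my - slope * mx), -slope


C, omega = fit_power_law(N_PROTOTYPES, ERROR_PERCENT)
print(f"C = {C:.1f} %, omega = {omega:.2f}")
```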
3 "Learning" using Clustering Techniques

When the prototype database, the "codebook", is composed directly of training samples, many regions may be overrepresented. Therefore, a search for the "optimal representation" can result in almost the same classification behavior with a much lower number of prototypes. There exist many solutions for this optimization task relying on real-valued calculations. The k-means or Lloyd algorithm [7, 8], which is well known for vector quantization, is an iterative clustering procedure aiming at minimizing the discretization error by choosing appropriate real-valued prototype vectors. Other, more complex "learning vector quantization" (LVQ) methods try to improve the class separation by iteratively modifying the prototypes [9].

As character images can be represented effectively in binary form, and as our implementation of the kNN method, using a most efficient calculation of the Hamming distance, requires binary vectors as prototypes, we have developed a completely binary and extremely efficient version of the k-means algorithm, relying only on the accumulation of single bits.

Here, we first introduce both the standard and the binary version of the "k-means" algorithm. We show that they always converge to a local minimum (of the sum of the distances). It is not evident that the discrete procedure converges to a representative set, because slight continuous changes cannot be performed in order to reduce a total error function by a steepest-descent method. Nevertheless, results for the NIST database show that both versions give stable and reasonable coverings of the training set and yield a reduction of the codebook size by a large factor.
3.1 The standard k-means Algorithm

In contrast to the LVQ method [9], k-means is usually applied to each class separately in order to optimize its covering. Given $M$ training patterns $x_i \in \mathbb{R}^K$ $(i = 1, \ldots, M)$, the goal is to choose a vector space decomposition into $N$ regions

\[ \mathbb{R}^K = \bigcup_{j=1}^{N} R_j, \quad \text{and} \quad R_i \cap R_j = \emptyset \ \text{for } i \ne j, \tag{7} \]

and $N$ representatives $y_j$ $(j = 1, \ldots, N)$ such that the total squared error

\[ E := \sum_{j=1}^{N} \sum_{\{i : x_i \in R_j\}} \|x_i - y_j\|^2 \tag{8} \]

takes on a minimal value. Fixing a given decomposition $\{R_j\}$ and replacing the prototypes by the averages (centroids) of the patterns in each region,

\[ \tilde{y}_j := \frac{1}{n_j} \sum_{\{i : x_i \in R_j\}} x_i, \quad \text{with} \quad n_j := \#\{i : x_i \in R_j\}, \tag{9} \]

results in a reduction of $E$, because the average minimizes the total squared error of a distribution:

\[ E = \sum_{j=1}^{N} \sum_{k=1}^{K} \sum_{\{i : x_i \in R_j\}} (x_{ik} - y_{jk})^2 \ \ge\ \sum_{j=1}^{N} \sum_{k=1}^{K} \sum_{\{i : x_i \in R_j\}} (x_{ik} - \tilde{y}_{jk})^2. \tag{10} \]

On the other hand, fixing the prototypes $y_j$ and replacing the regions by the nearest-neighbor or Voronoi decomposition

\[ R'_j := \{x : d(x, y_j) \le d(x, y_l) \ \text{for all } l = 1, \ldots, N\} \tag{11} \]

will also decrease $E$ or keep it at the same value. This leads to the k-means algorithm, which iterates the following two steps:

1. Assign each pattern $x_i$ to the nearest prototype $y_j$.
2. Replace each prototype by the average of the assigned patterns.

This procedure will always converge to a local minimum. However, in general, it will fail to reach the global optimum. The global update scheme (all prototypes are recalculated after all training patterns have been presented once) can be modified to allow continuous training on an input stream. In this case, for each training pattern, the nearest prototype is determined and only this one is updated immediately. Unlike the backpropagation learning algorithm for feed-forward neural networks, parameters are only updated locally, and they converge much faster to a local minimum. When training data is abundant, it is even sufficient to present each pattern once, and incremental learning is possible, where each prototype "remembers" the centroid of all assigned patterns and no reiteration over old patterns is necessary.
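Both update schemes can be sketched in a few lines of Python. This is a didactic sketch under stated assumptions, not the program used for the experiments: prototypes are initialized with randomly drawn training patterns, an empty region keeps its old prototype, and in the incremental variant each initial prototype is counted as one sample. All function names are chosen for this example.

```python
import random


def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def nearest(x, protos):
    """Index of the prototype nearest to x (ties go to the lowest index)."""
    return min(range(len(protos)), key=lambda j: sq_dist(x, protos[j]))


def total_error(data, protos):
    """Total squared error E of eq. (8) for the Voronoi regions of protos."""
    return sum(sq_dist(x, protos[nearest(x, protos)]) for x in data)


def kmeans(data, n_protos, n_iter=20, seed=0):
    """Global update scheme: reassign all patterns, then recompute centroids."""
    rng = random.Random(seed)
    protos = [list(p) for p in rng.sample(data, n_protos)]
    for _ in range(n_iter):
        regions = [[] for _ in protos]
        for x in data:                      # step 1: nearest-prototype regions
            regions[nearest(x, protos)].append(x)
        for j, members in enumerate(regions):
            if members:                     # step 2: centroid update, eq. (9)
                protos[j] = [sum(col) / len(members) for col in zip(*members)]
    return protos


def kmeans_online(stream, initial_protos):
    """Incremental scheme: each pattern is seen once and only its nearest
    prototype is moved, so that it stays the running centroid."""
    protos = [list(p) for p in initial_protos]
    counts = [1] * len(protos)  # assumption: the initial prototype counts once
    for x in stream:
        j = nearest(x, protos)
        counts[j] += 1
        protos[j] = [p + (v - p) / counts[j] for p, v in zip(protos[j], x)]
    return protos


# Two well-separated toy clusters (hypothetical data).
DATA = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(sorted(kmeans(DATA, 2)))
```

The incremental variant needs only the running centroid and a count per prototype, which is what makes a single pass over abundant training data feasible.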
3.2 The completely binary Version

Character images can reasonably be represented by binary images $x_i \in \{0, 1\}^K$. In this case, the standard k-means algorithm yields "gray" prototype pixel values $y_{jk} \in [0, 1]$. However, we are looking for binary prototypes too, since a binary kNN is much faster. When the centroid update rule (eq. 9) is replaced by the majority rule

\[ y'_{jk} := \begin{cases} 0 & \text{for } \sum_{\{i : x_i \in R_j\}} x_{ik} \le \frac{n_j}{2}, \\ 1 & \text{for } \sum_{\{i : x_i \in R_j\}} x_{ik} > \frac{n_j}{2}, \end{cases} \tag{12} \]

all prototype components $y_{jk} \in \{0, 1\}$ remain binary throughout the iteration, and $E$ is never increased. For binary vectors, $E$ simply counts the number of discrepancies $y_{jk} \ne x_{ik}$, which is not increased by the majority decision:

\[ \sum_{\{i : x_i \in R_j\}} (x_{ik} - y'_{jk})^2 = \sum_{\{i : x_i \in R_j\}} x_{ik} \oplus y'_{jk} \le \frac{n_j}{2}, \quad \text{whereas} \quad \sum_{\{i : x_i \in R_j\}} \bigl(x_{ik} - (1 - y'_{jk})\bigr)^2 = n_j - \sum_{\{i : x_i \in R_j\}} x_{ik} \oplus y'_{jk} \ge \frac{n_j}{2}. \tag{13} \]

The first step, the decomposition into Voronoi regions, remains unchanged, so the binary version of k-means also leads to a local minimum of $E$ in $\{0, 1\}^K$. This algorithm employs only single-bit add operations, which can be implemented highly efficiently by bitwise boolean operations, and integer comparisons.

We have realized the incremental version of this addition using an integer counter vector $C_j$ for each prototype $y_j$. Each element $C_{jk}$ remembers how often a pixel of value 1 has been presented to prototype $j$ at position $k$. After an initialization of the counters and the prototypes $y_j$ to patterns chosen at random from the training set, the following steps are performed for each new pattern $x_i$:

1. Determine $j$ such that $\sum_{k=1}^{K} x_{ik} \oplus y_{jk}$ is minimized (standard VQ search).
2. Update the counters: $C_{jk} := C_{jk} + x_{ik}$, $n_j := n_j + 1$.
3. Update the prototype: $y_{jk} := 0$ if $C_{jk} \le n_j / 2$, else $y_{jk} := 1$.
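The three steps above can be sketched as follows. The sketch stores patterns as plain Python lists of 0/1 for readability and therefore does not show the bitwise tricks that make the real implementation fast. Setting $n_j = 1$ and $C_j = y_j$ at initialization is an assumption about how the initial pattern is counted, and the class name is chosen for this example.

```python
def hamming(x, y):
    """Number of positions where two 0/1 lists differ (sum of XORs)."""
    return sum(a ^ b for a, b in zip(x, y))


class BinaryKMeans:
    """Incremental binary k-means with one counter vector per prototype."""

    def __init__(self, initial_prototypes):
        self.protos = [list(p) for p in initial_prototypes]
        # Assumption: the initial pattern counts as the first sample.
        self.counters = [list(p) for p in initial_prototypes]  # C_jk
        self.n = [1] * len(self.protos)                        # n_j

    def present(self, x):
        # Step 1: standard VQ search for the nearest prototype.
        j = min(range(len(self.protos)),
                key=lambda i: hamming(x, self.protos[i]))
        # Step 2: update the counters.
        self.n[j] += 1
        for k, bit in enumerate(x):
            self.counters[j][k] += bit
            # Step 3: majority rule, y_jk = 0 if C_jk <= n_j / 2, else 1.
            self.protos[j][k] = 1 if 2 * self.counters[j][k] > self.n[j] else 0
        return j


model = BinaryKMeans([[1, 1, 0, 0], [0, 0, 1, 1]])
model.present([1, 1, 1, 0])
model.present([1, 0, 1, 0])
print(model.protos, model.counters, model.n)
```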
3.3 Results

Tests on the NIST database showed that both versions of the algorithm are extremely stable in the sense that the locally optimal solution they converge to does not depend very much on the initialization. They usually do not get trapped too early in a local minimum that is a poor representation of the whole class. In Table 4, the error rates using a nearest-neighbor classifier on prototypes produced by the two versions of k-means are listed for the digits 0,...,9 and a resolution of 16 × 16 pixels. Figure 1 shows that both methods allow a significant reduction of the number of prototypes in comparison with the standard kNN working directly on the training set. For low numbers of prototypes, the same power law seems to be valid, while for large N a saturation can be seen, due to the limited number of training patterns. With the restriction to binary patterns, about twice as many prototypes are required as in the real-valued case to obtain the same error rate. Of course, the binary method is much more interesting, because a significant gain both in memory and in computing performance can be achieved.

N                   50     100    200    500    1000   2000   5000   10000
binary k-means      6.7%   4.8%   4.0%   3.0%   2.4%   2.0%   1.7%   1.57%
standard k-means    4.6%   3.7%   2.8%   2.1%   1.8%   1.6%   1.4%   1.22%

Table 4: kNN error rate after learning with k-means on 167 000 digits (classes 0,...,9), size 16 × 16 pixels
4 A Hierarchical k-Means Approach

Hierarchical vector quantization (HVQ) has already been proposed for image data compression in order to reduce the number of operations necessary to find a good approximation [10]. The basic idea is to start the quantization with a rather limited number of prototypes, giving a subdivision of the whole pattern space into a few regions. On consecutive stages, a separate vector quantization (with different prototypes) is executed for each of those regions. Thus, the problem is partitioned by restricting both the training and the recognition to the subregions. The prototypes $y_j$ at the leaves of the tree structure are again approximations of the input pattern, and the corresponding regions $R_j$ are again a decomposition of the vector space. However, this time it is not a Voronoi partitioning, because the search has been restricted to certain branches and the nearest prototype has not necessarily been found.

In many cases, classes have a large overlap, so that a separate training on each class is not reasonable. Therefore, we introduced a hierarchical VQ classification which does not attribute each prototype $y_j$ to a certain class. Instead, the number of cases $n_{jc}$ in which a training pattern belonging to class $c$ has been assigned to prototype $y_j$ is determined:

\[ n_{jc} := \#\{x_i \,|\, x_i \in R_j \wedge c_i = c\}, \quad N_j = \sum_{c} n_{jc}. \tag{14} \]

The probability distribution is then approximated by

\[ p_{HVQ}(x, c) = \frac{n_{jc}}{N_j}, \tag{15} \]

where $y_j$ is the prototype that the vector $x$ is assigned to in the hierarchical VQ. Again, this confidence vector $p$ can be exploited on a higher level of the recognition process.

The determination of a hierarchical structure with $N$ prototypes $y_j$, resulting in a decomposition, can now be formulated as an optimization problem. A good classification is obtained if, for every region $R_j$, most of the elements $x_i \in R_j$ belong to the same class. The total entropy remaining in this decomposition,

\[ S := - \sum_{j} \sum_{\{c \,|\, n_{jc} \ne 0\}} n_{jc} \log_2 \frac{n_{jc}}{N_j}, \tag{16} \]

indicates how much information is still necessary to classify all training patterns correctly. The global optimization of this function is a difficult task. However, we have implemented a greedy algorithm that tries to improve the tree locally by refining the region for which the reduction of entropy by an additional vector quantization is maximal.

Training on 167 000 normalized digits (20 × 24) of the NIST database resulted in a hierarchy containing 13 500 prototypes. As these patterns are binary vectors, they occupy only about 1 MB of main memory. The recognition was tested on the remaining 55 000 patterns. We obtained an error rate of only 0.90% at a classification speed of 210 patterns/s on a SPARC-10 processor. This result shows that a considerable acceleration of the recognition can be achieved using this hierarchical approach (210 patterns/s compared with 6 digits per second for the exhaustive kNN search), albeit at the expense of a higher classification error (0.90% compared with 0.66% at the same resolution, Table 1). As most bank applications are very critical, we still prefer the plain version of k-means and kNN, because performance problems can always be resolved by using parallel architectures.
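The quantities of eqs. (14) to (16) are simple to compute once every training pattern has been assigned to a leaf of the hierarchy. The sketch below assumes that this assignment is already available as a list of (leaf index, class label) pairs; how the tree is built and searched is not shown, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def class_counts(assignments):
    """Eq. (14): n_jc, the number of training patterns of class c in leaf j."""
    counts = defaultdict(Counter)
    for j, c in assignments:
        counts[j][c] += 1
    return counts


def p_hvq(counts, j):
    """Eq. (15): confidence vector n_jc / N_j for a pattern that lands in leaf j."""
    n_j = sum(counts[j].values())
    return {c: n_jc / n_j for c, n_jc in counts[j].items()}


def remaining_entropy(counts):
    """Eq. (16): total entropy S (in bits) remaining in the decomposition."""
    s = 0.0
    for per_class in counts.values():
        n_j = sum(per_class.values())
        for n_jc in per_class.values():  # a Counter stores only n_jc != 0
            s -= n_jc * math.log2(n_jc / n_j)
    return s


# Leaf 0 is pure, leaf 1 mixes two classes (toy data).
ASSIGNMENTS = [(0, "3")] * 4 + [(1, "5"), (1, "6")]
COUNTS = class_counts(ASSIGNMENTS)
print(p_hvq(COUNTS, 1), remaining_entropy(COUNTS))
```

In a greedy refinement of the kind described above, the per-leaf contribution to S would be evaluated before and after a trial split, and the leaf with the largest reduction would be refined.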
5 A Multi-Agent Structure

Up to here, we have shown how a single approach can be optimized to obtain the lowest error rates. However, even more reliable systems can be built when sophisticated techniques from different areas are combined and when redundant information is exploited in order to detect inconsistencies or to reduce the classification error probability. A true improvement can be expected when different agents work closely together. Recently, dynamic segmentation techniques using a classifier to estimate the correctness of a proposed cut have been shown to be quite successful [11]. A consistency check between the word and digit fields on Eurocheques is certainly indispensable in order to obtain acceptable error rates. We consider the correct interpretation of a given document as an optimization task in which a complex architecture of many different and specialized "agents" tries to evaluate the globally most probable solution. A comparison with the second most probable solution can provide confidence information and conditionally invoke human interaction.
5.1 Voter Agents

The same input pattern can be presented to different classifiers which have been trained separately on the same data set. The problem to be solved is how their results can be merged in order to achieve a lower error rate. This task is done by another agent, the "voter", which has to "learn" to judge the classification responses of all agents correctly on a second, independent training set. We envisage employing neural network and vector quantization classifiers on the same input data, both producing a vector of output values indicating the classification probabilities for all classes. The voter can be realized as another neural network, which receives the two confidence vectors and produces yet another output vector that can be analysed at a higher level.
5.2 Context Analysis

Although the error rates per character can be reduced significantly by learning on extremely large databases, it should be obvious that a recognition decision at the single-character level cannot be reliable enough to obtain acceptable error rates for entire fields. In many cases, characters can only be recognized in a certain context. For example, there is a large ambiguity between the classes "n" and "u" or "b" and "6", which means that even an optimal "Bayes" classifier is not acceptable. Therefore, context information has to be taken into account.

The so-called trigram technique exploits correlations between three consecutive characters in order to improve the recognition performance. We have determined occurrence frequencies $q(c_{i-1}, c_i, c_{i+1})$ of character class sequences independently for the different text fields (names, bank names, purpose) of German bank transfer forms. This environment-specific information has been used to improve the recognition probability. A dynamic programming approach is employed to find the class assignment $c_i$ for the sequence of segmented characters $x_i$ $(i = 1, \ldots, L)$ by optimizing the product

\[ \prod_{i=1}^{L} p(x_i, c_i) \ \prod_{i=2}^{L-1} q(c_{i-1}, c_i, c_{i+1}). \tag{17} \]
Furthermore, a complete list of 7806 German bank codes and bank names is used to detect many errors and to exploit inherent redundancy.
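One way to maximize the product of eq. (17) by dynamic programming is a Viterbi-style search whose state is the pair of the two most recent classes. The sketch below is an illustration under assumptions, not the program used in this work: it works with logarithms, replaces missing or zero probabilities by a small floor value so that the logarithm is defined, and uses hypothetical names and toy numbers.

```python
import math
from itertools import product


def best_sequence(p, q, classes, floor=1e-6):
    """Maximize prod_i p[i][c_i] * prod_i q[(c_{i-1}, c_i, c_{i+1})].

    p: list of dicts (class -> classifier confidence), one per character.
    q: dict mapping class triples to trigram frequencies.
    """
    def lp(i, c):
        return math.log(max(p[i].get(c, 0.0), floor))

    def lq(a, b, c):
        return math.log(max(q.get((a, b, c), 0.0), floor))

    length = len(p)
    if length == 1:
        return [max(classes, key=lambda c: lp(0, c))]
    # score[(a, b)]: best log-score of any prefix ending in the classes a, b.
    score = {(a, b): lp(0, a) + lp(1, b) for a, b in product(classes, repeat=2)}
    back = []
    for i in range(2, length):
        new_score, pointer = {}, {}
        for b, c in product(classes, repeat=2):
            a = max(classes, key=lambda a: score[(a, b)] + lq(a, b, c))
            new_score[(b, c)] = score[(a, b)] + lq(a, b, c) + lp(i, c)
            pointer[(b, c)] = a
        score = new_score
        back.append(pointer)
    # Backtrack from the best final pair of classes.
    seq = list(max(score, key=score.get))
    for pointer in reversed(back):
        seq.insert(0, pointer[(seq[0], seq[1])])
    return seq


# Toy example: the classifier slightly prefers "n" everywhere,
# but the trigram statistics favour "nun".
P = [{"n": 0.6, "u": 0.4}, {"n": 0.55, "u": 0.45}, {"n": 0.6, "u": 0.4}]
Q = {("n", "u", "n"): 0.9, ("n", "n", "n"): 0.01}
print(best_sequence(P, Q, ["n", "u"]))
```

The cost grows with the cube of the number of classes per character position, so a practical system would presumably restrict the search to the few most probable classes of each character.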
6 Implementation

6.1 Binary Operations

As general-purpose computers with a wide data word are not well adapted to operations on single bits, care has to be taken and special techniques must be developed in order to achieve high efficiency. Within our project, a highly optimized library in ANSI C for the treatment of binary bit fields has been realized. It includes functions for the most efficient determination of Hamming distances, bit counting in row and column direction, transposition, normalization, and alignment (transformation by shearing to the main axis).
Operation                                                  Time
Row histogram on 1000 × 800 binary image                   5 ms
Column histogram on 1000 × 800 binary image                5 ms
Transposition of 1000 × 800 binary matrix                  20 ms
Character normalization in x- and y-direction              0.3 ms
Character alignment to main vertical axis                  1 ms
Recognition of a NIST digit at 0.90% error rate            4.7 ms
10 000 Hamming distance calculations for 20 × 24 images    10 ms

Table 5: Execution time for some operations on a SPARC-10 processor

A comparison between two binary vectors can be realized efficiently by using bitwise XOR operations and some table lookups for each data word. In the case that one vector $x$ has to be compared with many prototype vectors $y_i$, a much more efficient incremental technique can be applied, reflecting an SIMD kind of parallelism:

\[
\begin{aligned}
d_i &= \|x - y_i\| = \sum_{k=1}^{K} x_k \oplus y_{ik}, \\
d'_i &= \|x' - y_i\| = \sum_{k=1}^{K} x'_k \oplus y_{ik} \\
&= d_i + \Biggl( \sum_{\{k \,|\, x_k = 1 \wedge x'_k = 0\}} (2 y_{ik} - 1) \Biggr) - \Biggl( \sum_{\{k \,|\, x_k = 0 \wedge x'_k = 1\}} (2 y_{ik} - 1) \Biggr).
\end{aligned}
\tag{18}
\]
Once the distances $d_i$ have been calculated for one input pattern $x$, only those positions that have changed need to be updated for the following pattern $x'$, by adding a binary vector to a vector of counters. This technique limits the binary add operations to about 20-30% of the data (depending on the input data).

We have developed special functions working on many bits of binary vectors in parallel. In particular, the bitwise accumulation of many binary vectors has been implemented most efficiently by emulating a bitwise full-adder operation with 5 boolean operations. A total of 160 million bits per second can be accumulated on a SPARC-10 processor. This procedure, which was originally written for binary kNN and k-means, is of equal interest for the segmentation, because it allows the calculation of row and column histograms at the same efficiency. Table 5 shows some performance values obtained on a SPARC-10 processor (40 MHz).

Built upon these efficient binary techniques, a complete program for the semi-automatic treatment of bank forms has been realized in our laboratory as a portable ANSI C program. On a SPARC-10 processor, the segmentation of a complete form into 100-150 segments takes about 0.2 s, whereas the recognition using nearest-neighbor techniques requires only 2 seconds per form. 60 000 classified patterns were used as input for the binary k-means techniques. However, much more data can reasonably be dealt with, due to the hierarchical approach of the vector quantization. The context analysis based on the trigram technique considerably improved the recognition result.
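The incremental update of eq. (18) can be verified with a short sketch. Bit vectors are packed into Python integers here, and the sum of $(2 y_{ik} - 1)$ over a set of positions is evaluated as twice the number of set prototype bits in that set minus the size of the set. This checks the arithmetic only; it does not reproduce the speed advantage of the ANSI C routines, and the function names are assumptions.

```python
def popcount(v: int) -> int:
    return bin(v).count("1")


def distances(x, prototypes):
    """Direct Hamming distances d_i between x and every prototype y_i."""
    return [popcount(x ^ y) for y in prototypes]


def update_distances(d, x_old, x_new, prototypes):
    """Eq. (18): derive the distances for x_new from those for x_old."""
    cleared = x_old & ~x_new   # positions with x_k = 1 and x'_k = 0
    newly_set = ~x_old & x_new  # positions with x_k = 0 and x'_k = 1
    updated = []
    for d_i, y in zip(d, prototypes):
        # Sum of (2*y_ik - 1) over a mask = 2*popcount(y & mask) - popcount(mask).
        plus = 2 * popcount(y & cleared) - popcount(cleared)
        minus = 2 * popcount(y & newly_set) - popcount(newly_set)
        updated.append(d_i + plus - minus)
    return updated


PROTOS = [0b1010, 0b0111, 0b0000]
x, x_next = 0b1100, 0b0110
d = distances(x, PROTOS)
print(update_distances(d, x, x_next, PROTOS), distances(x_next, PROTOS))
```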
6.2 MIMD implementations

The presented solution is an ideal application for MIMD architectures. If the processing speed achieved on one processor (one form in 2 seconds) is not high enough and if the latency is not crucial, the problem can be parallelized in the most efficient way using a "task farming" approach. This means that each form is treated separately by one processor. A TIFF file containing a binary image of a form at 200 DPI resolution has a typical size of 5 kB, whereas the classification output corresponds to a few hundred bytes. This data has to be transferred only every 2 seconds for each node, so communication will not be a limiting factor in embedded systems. Therefore, an almost ideal speed-up can be expected. On the other hand, a copy of all prototype data needs to be stored at every processing node. Usually, this is no limitation either, due to the use of binary prototypes.
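The task-farming scheme can be pictured with the following sketch, in which a pool of workers processes one form per task. It is purely illustrative: `recognize_form` is a hypothetical placeholder for the segmentation and recognition of one form, and Python threads merely stand in for the processing nodes of an MIMD machine (for CPU-bound work in Python, a process pool would be the more natural choice).

```python
from concurrent.futures import ThreadPoolExecutor


def recognize_form(form_image: bytes) -> str:
    """Hypothetical stand-in for segmentation + recognition of one form."""
    return f"result for a form of {len(form_image)} bytes"


def task_farm(forms, n_workers=4):
    """Hand each form to the next free worker; results keep the input order."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(recognize_form, forms))


# Three dummy "TIFF files" of about 5 kB each.
FORMS = [bytes(5000), bytes(5100), bytes(4900)]
for line in task_farm(FORMS):
    print(line)
```

Because every task is independent and only a few kilobytes are exchanged per form, the only shared state in such a scheme is the read-only prototype database held by each worker.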
6.3 A VLSI Structure for Pattern Recognition

General-purpose processors with a rather wide data word are not well adapted to operations on single bits. In particular, for the optimal implementation of the accumulation of single bits, five bitwise boolean operations are necessary, whereas on a special-purpose architecture this function can be realized in one cycle by providing counter units. Therefore, and because the recognition is the most time-consuming part of the application, we have conceived a dedicated VLSI structure, "CUMULUS", for the most efficient comparison of binary vectors. It works in a bit-sequential mode, leading to very compact processing elements, and exploits the inherent "single instruction multiple data" (SIMD) parallelism of multiple prototypes.

The main structure of this chip, which contains a two-dimensional array of asynchronous counters, is shown in Figure 2. Each of the counters accumulates the differences between one row data stream and one column data stream in a bit-sequential way. In a second phase, the minimal counter value within each column can be evaluated. By exploiting the two-fold parallelism over input patterns $x_i$ $(i = 1, \ldots, M)$ and prototype patterns $y_j$ $(j = 1, \ldots, N)$, a scalable I/O structure can be realized which allows the usage of large external dynamic memory units.

A prototype version of this chip has been successfully realized as a EUROCHIP project [12, 13]. A full system integrating a 32 × 32 counter array, DRAM and a controller unit on an SBUS board will be finished soon. It will provide a performance corresponding to about 50 SPARC-10 processors for nearest-neighbor applications. This dedicated architecture can be of high interest for high-performance pattern recognition applications running on embedded systems.
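The five boolean operations mentioned at the beginning of this section correspond to a full adder applied to whole data words at once. The sketch below shows one possible arrangement, which is an assumption and not necessarily the one used in our library: counters are stored as bit planes (bit k of plane p is bit p of counter k), two input vectors are added per full-adder step, and the resulting carry ripples through the higher planes with half-adder steps.

```python
def full_add(a: int, b: int, c: int):
    """Bitwise full adder on whole words, using five boolean operations."""
    t = a ^ b                   # operation 1
    s = t ^ c                   # operation 2: sum bits
    carry = (a & b) | (t & c)   # operations 3, 4, 5: carry bits
    return s, carry


def accumulate(vectors, n_planes=16):
    """Column-wise sum of many binary vectors, kept as bit planes."""
    planes = [0] * n_planes
    it = iter(vectors)
    for v1 in it:
        v2 = next(it, 0)  # pad with an empty vector if the count is odd
        planes[0], carry = full_add(planes[0], v1, v2)
        p = 1
        while carry and p < n_planes:  # ripple the carry into higher planes
            planes[p], carry = planes[p] ^ carry, planes[p] & carry
            p += 1
    return planes


def column_counts(planes, width):
    """Read the counters back: how many vectors had bit k set."""
    return [sum(((plane >> k) & 1) << p for p, plane in enumerate(planes))
            for k in range(width)]


VECTORS = [0b1011, 0b0011, 0b1001, 0b0001, 0b1111]
print(column_counts(accumulate(VECTORS), 4))
```

A dedicated counter per position, as provided by the CUMULUS array, replaces this sequence of word operations by a single increment.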
Figure 2: Two-dimensional CUMULUS chip architecture (diagram not reproduced; its labels are: Stimuli, Reference database, XOR, Counter +1, Minimal distances).

7 Conclusion

In this paper, we have introduced vector quantization classifiers as an attractive alternative to more sophisticated pattern recognition algorithms such as feed-forward neural networks. Clustering techniques allow an extraction of characteristic prototypes, and fast, incremental "learning" is viable even for an extremely high number of training patterns, due to the simplicity and locality of the k-means algorithm. We have presented a much more efficient version of k-means which produces binary prototype vectors and employs only single-bit additions. A hierarchical pattern search leads to an even faster classification, which can be of interest for those applications where a higher error rate can be accepted. Furthermore, vector quantization techniques can produce confidence vectors, so that they can be combined with other techniques in order to increase the overall reliability. A program for the recognition of handwritten bank forms has been realized which is based on an optimized implementation of the binary algorithms on general-purpose processors. In addition, a dedicated CMOS architecture for the recognition of binary patterns has been designed.
Acknowledgements

The presented project has been realized at the "Zentrum für Paralleles Rechnen" (ZPR) at the University of Cologne in collaboration with the "Laboratoire d'Informatique" (LIX) at the École Polytechnique, Palaiseau (France), where the CUMULUS architecture has been developed. The integration of neural networks and vector quantization approaches for a reliable recognition of handwritten bank forms is currently being done in cooperation with Parsytec Industriesysteme GmbH, Aachen.
References

[1] Y. Le Cun et al., "Handwritten Zip Code Recognition with Multilayer Networks", Proceedings of the 10th International Conference on Pattern Recognition, IEEE Comp. Soc. Press (1990)
[2] P. Simard, Y. Le Cun, J. Denker, "Efficient Pattern Recognition Using a New Transformation Distance", Neural Information Processing Systems, vol. 5, p. 50 (1993)
[3] T. M. Cover, P. E. Hart, "Nearest neighbor pattern classification", IEEE Trans. Info. Theory, IT-13, p. 21 (1967)
[4] R. O. Duda, P. E. Hart, "Pattern Classification and Scene Analysis", J. Wiley & Sons, New York (1973)
[5] A. Gersho, R. M. Gray, "Vector Quantization and Signal Compression", Kluwer Academic Publishers (1992)
[6] M. D. Garris, R. A. Wilkinson, NIST Special Database 3: Handwritten Segmented Characters (1992)
[7] J. MacQueen, "Some methods for classification and analysis of multivariate observations", Proc. of the Fifth Berkeley Symposium on Math. Stat. and Prob., vol. 1, p. 281 (1967)
[8] S. P. Lloyd, "Least squares quantization in PCM", IEEE Trans. Inform. Theory, vol. IT-28, no. 2, p. 129 (1982)
[9] T. Kohonen, "Learning Vector Quantization", Neural Networks, vol. 1, suppl. 1, p. 303 (1988)
[10] A. Gersho, Y. Shoham, "Hierarchical vector quantization of speech with dynamic codebook allocation", Proc. of the Int. Conf. on Acoustics, Speech, and Signal Processing, San Diego (1984)
[11] O. Matan et al., "Multi-Digit Recognition using a Space Displacement Neural Network", Neural Information Processing Systems, vol. 4, p. 488 (1992)
[12] M. Neschen, M. Gumm, "A Scalable Bit-Sequential SIMD Array for Pattern Recognition and Neural Networks", Proceedings of the PARLE'94 conference, Springer Verlag, Berlin (1994)
[13] M. Gumm, Technical Report, LIX/RT/93/02, LIX, École Polytechnique, Palaiseau