A Modular, Cyclic Neural Network for Character Recognition

M. Costa, E. Filippi and E. Pasero Dept. of Electronics, Politecnico di Torino C.so Duca degli Abruzzi, 24 - 10129 TORINO - ITALY

Abstract We present a multi-layer feed-forward neural network built for pattern recognition purposes. Since we put special emphasis on suitability for hardware implementation, we gave it a modular architecture of partially connected subnets. In turn, this organization makes it possible to avoid random initialization of the weights. The learning strategy combines adaptation of the learning rate with an intermediate "batching" of the error back-propagation. To test the performance of the net in the classification of hand-written numerals, we fed the input units with grey-shaded patterns obtained from a large data set of the US National Institute of Standards and Technology (NIST). The original samples were subjected only to scaling operations. The results obtained so far are quite encouraging: in this application our recognizer ranks among the ten best-rated OCR systems that competed in a world-wide contest organized by NIST itself in June 1992.

1 Introduction Multi-Layer Perceptrons (MLPs) are presently used in a large variety of pattern recognition tasks, so they constitute a clear example of general-purpose devices. However, when dealing with real-world applications like OCR, MLPs are seldom used in plain form: they are usually tailored to specific requirements, and sometimes they become part of more complex classifying systems. This stems from the fact that, although its operating principle is quite straightforward, the design of an efficient MLP classifier is a complex task, since at least three main issues have to be considered: 1) Experimental results clearly show that performance cannot be decoupled from the way information is encoded in the examples. The Data Preprocessing strategy thus assumes major relevance, in that it can dramatically affect the final result as well as the amount of resources needed to achieve it; 2) Given the application, we would like to determine the optimal Network Architecture by means of a few specifications and design rules. In practice, this process instead relies on heuristic choices; 3) The Learning Procedure usually involves several parameters. Again, their values must be determined mainly on a trial-and-error basis, especially if advanced techniques like weight decay[1] are adopted.

It is therefore evident that additional hints are needed in order to properly address the design strategy. Indeed, several solutions have been proposed that exploit suggestions often provided by the application itself. For instance, several works are concerned with the prior extraction of relevant features from the raw data: this is done by means of traditional algorithms[2], by setting up hybrid networks[3,4], or via highly constrained architectures with several hidden layers[5] that apply to MLPs some ideas borrowed from the Neocognitron[6]. Another interesting approach defines MLP committees in conjunction with data resampling and the generation of synthetic patterns[7]. On the other hand, we must take into account conventional techniques that have already proved very effective, especially in the OCR field[8]. Compared to them, Artificial Neural Networks can rely on their intrinsic, massive parallelism; but this winning characteristic cannot be exploited by software simulations on serial processors. We therefore believe that priority must be given to the hardware feasibility of the proposed solutions. This concept, while forcing us to cope with stringent constraints, surprisingly turned into a guideline in formulating the answers to the above issues that are described in the following sections.

2 Data Preprocessing For our training experiments we chose a database of hand-written digits collected by the US National Institute of Standards and Technology (NIST). It consists of 223125 samples stored as binary images inside matrices of 128*128 pixels each. We performed only scaling operations on the original data to produce patterns with specified dimensions and number of grey levels. For this purpose we developed two different algorithms: the first one forces the character to "touch" all four borders of the output image, while the other one preserves its original aspect ratio. At first we used these procedures to build two distinct training sets of 16*16 binary patterns. We then carried out several simulations using two copies of the same MLP to assess which alternative was better. Unfortunately, neither case gave encouraging results. However, we achieved substantial improvements by averaging the responses of the two nets. This suggested feeding a single MLP with both versions of the same data. In order to reduce the number of input components without losing information, we decided to use smaller patterns (8*8) with a higher number of grey levels (64). Fig. 1 shows some examples of patterns subjected to this twofold preprocessing (only 5 grey levels are displayed).

Fig. 1: Examples of twofold-preprocessed 8*8 patterns (only 5 of the 64 grey levels are displayed)
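The twofold preprocessing can be summarized in a short sketch. The following Python fragment is our reconstruction, not the authors' code: the bounding-box cropping, centering and average-pooling steps are assumptions about how the two scaling algorithms might be realized, and the function names are ours. It only illustrates how two 8*8, 64-grey-level versions of the same 128*128 binary sample can be produced and concatenated into the 128-component input vector.

import numpy as np

def _bounding_box(img):
    # Crop a binary image to the smallest rectangle containing the character.
    rows = np.flatnonzero(img.any(axis=1))
    cols = np.flatnonzero(img.any(axis=0))
    return img[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1].astype(float)

def _downsample(img, out_size=8, levels=64):
    # Average-pool to out_size*out_size and quantize to the requested grey levels.
    h, w = img.shape
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            block = img[i * h // out_size:(i + 1) * h // out_size,
                        j * w // out_size:(j + 1) * w // out_size]
            out[i, j] = block.mean() if block.size else 0.0
    return np.round(out * (levels - 1)) / (levels - 1)

def scale_touch_borders(img):
    # First algorithm: the character is stretched to "touch" all four borders.
    return _downsample(_bounding_box(img))

def scale_keep_aspect(img):
    # Second algorithm: the character is centred in a square box, preserving
    # its original aspect ratio, before being scaled down.
    crop = _bounding_box(img)
    side = max(crop.shape)
    square = np.zeros((side, side))
    r0 = (side - crop.shape[0]) // 2
    c0 = (side - crop.shape[1]) // 2
    square[r0:r0 + crop.shape[0], c0:c0 + crop.shape[1]] = crop
    return _downsample(square)

def preprocess(img):
    # Twofold preprocessing: both 8*8, 64-level versions are concatenated
    # into the 128-component input vector of the network.
    return np.concatenate([scale_touch_borders(img).ravel(),
                           scale_keep_aspect(img).ravel()])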

3 Network Architecture In view of future hardware implementation, we identified three primary requirements to be followed in the architectural definition of the net: 1) Every neuron must have a limited number of synapses (we imposed a maximum of 32, plus the threshold); 2) Interconnections must be planned so as to allow easy routing of the communication lines; 3) No additional constraints specifically related to OCR are to be imposed, so that in principle the same solution can be used directly for other classification tasks, or scaled to fit their requirements. We therefore designed an MLP with a modular architecture. The number of modules equals the number of classes to be distinguished. Every module can be viewed as a partially connected subnet with only one output neuron. Fig. 2 shows the general organization of the net: in particular, it can be seen that different modules do not share any connection. We can take great advantage of this, because we can plan to physically realize only a few modules and then multiplex them. Their actual number can be chosen so as to allow efficient pipelining of the preprocessing stage with the MLP operation proper. Moreover, this highly parallelizable structure guarantees low spatial cross-talk among hidden neurons, resulting in a fairly high convergence rate during the training phase[9]. Fig. 3 shows the inner structure of a single module. The output neuron is fully connected to the hidden layer of 32 elements. In turn, each hidden unit has access to only 28 input components in a cyclic, sequential fashion: i.e., inputs 1-28 are connected to the first neuron, inputs 29-56 to the second neuron, ..., inputs 113-128 and 1-12 to the fifth neuron (the one depicted dark in the figure), and so on. These choices ensure two important properties: first, each module covers the input vector an integer number of times, so that the very same connection scheme is preserved throughout the network; second, different hidden neurons in the same module are connected to different subsets of the input vector.

Fig. 2: General architecture of the modular MLP (outputs; modules of 32 hidden units with no shared connections; 128 inputs)

Fig. 3: Connection scheme of each module (output neuron with 32 synapses; hidden layer of 32 units with 28 synapses each; 128 inputs with cyclic, sequential access)
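The cyclic, sequential access can be stated compactly: hidden unit h of a module is connected to the 28 consecutive input components starting at offset 28*h, taken modulo 128, so that the 32 units of a module cover the 128 inputs exactly 32*28/128 = 7 times. A minimal sketch of one module's connectivity and forward pass follows; the 0-based indexing and the logistic function are our conventions, and the code is illustrative rather than the authors' implementation.

import numpy as np

N_INPUTS = 128   # two 8*8 grey-shaded versions of the same character
N_HIDDEN = 32    # hidden units per module
FAN_IN = 28      # input synapses per hidden unit (threshold excluded)

def module_connections():
    # For each hidden unit, the indices of the 28 inputs it is connected to:
    # unit 0 sees inputs 0-27, unit 1 sees 28-55, ..., unit 4 wraps around
    # (112-127 and 0-11), and so on; each module covers the inputs 7 times.
    return [np.arange(h * FAN_IN, h * FAN_IN + FAN_IN) % N_INPUTS
            for h in range(N_HIDDEN)]

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def module_forward(x, w_hid, b_hid, w_out, b_out):
    # One module: 32 partially connected logistic hidden units feeding a
    # single output neuron that is fully connected to the hidden layer.
    conn = module_connections()
    hidden = np.array([logistic(w_hid[h] @ x[conn[h]] + b_hid[h])
                       for h in range(N_HIDDEN)])
    return logistic(w_out @ hidden + b_out)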

4 Learning Procedure The special architecture of the net had a profound impact on the learning strategy itself. In fact, we noted that in this case random initialization of the weights could be avoided. We therefore started with null values (an entire class of MLPs with logistic neurons and a generic number of hidden layers can be initialized in this way; details are given in [10]). This choice carries some interesting properties: 1) We get rid of one heuristic parameter, i.e. the maximum absolute value of the initial random weights; 2) Since the neurons have a logistic transfer function, they start in the state farthest from saturation, in other words maximally sensitive to the error signal. Of course, it was necessary to avoid the sudden spreading of the weight values towards a substantially random distribution after a few updates. Two solutions presented themselves, both with the serious drawback of slowing down the evolution of the system: 1) performing training by epoch; 2) using a low learning rate. Concerning the first point, we found a satisfactory trade-off by making one update every 100 patterns presented to the net, thus realizing an intermediate "batching" of the error back-propagation. We then started with a low learning rate (0.01), and changed its value according to the Vogl adaptive technique[11]. After 100 training epochs, we halved all the weights and continued with the same procedure for 50 additional epochs. Although this operation can be considered a very crude form of weight decay, it has already proven to be quite effective[12]. Figures 4 and 5 show the behaviour of the learning rate and of the Mean Square Error (MSE) on the outputs vs. the number of training epochs. It should be noted that, as the learning rate increases, the MSE tends to saturate until it stops decreasing. When this happens, the learning rate is halved, thus allowing narrower zones of the error surface to be explored. As a result, the MSE starts going down again.
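The whole procedure can be sketched for a single module as follows. This is an illustration under several assumptions: the learning-rate increase factor (1.05 here) is taken from the spirit of [11] rather than from the text, the gradient accumulation is written in plain NumPy, and the partial connectivity is imposed through a 0/1 mask built from the cyclic scheme of the previous section; with null initial weights it is precisely the differing input subsets that allow the hidden units of a module to differentiate from one another.

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_module(X, T, epochs=150, batch=100, lr0=0.01,
                 lr_up=1.05, lr_down=0.5, halve_at=100,
                 n_hidden=32, fan_in=28):
    # X: (n_patterns, 128) inputs in [0, 1]; T: (n_patterns,) targets in {0, 1}
    # for the class this module is responsible for.
    n_in = X.shape[1]
    # Cyclic, sequential connectivity expressed as a 0/1 mask on the weights.
    mask = np.zeros((n_hidden, n_in))
    for h_idx in range(n_hidden):
        mask[h_idx, np.arange(h_idx * fan_in, h_idx * fan_in + fan_in) % n_in] = 1.0
    # Null initial weights: no random initialization is needed.
    W1, b1 = np.zeros((n_hidden, n_in)), np.zeros(n_hidden)
    W2, b2 = np.zeros(n_hidden), 0.0
    lr, prev_mse = lr0, np.inf
    for epoch in range(epochs):
        if epoch == halve_at:            # crude weight decay after 100 epochs
            W1 *= 0.5; b1 *= 0.5; W2 *= 0.5; b2 *= 0.5
        mse = 0.0
        for s in range(0, len(X), batch):
            xb, tb = X[s:s + batch], T[s:s + batch]
            h = logistic(xb @ W1.T + b1)            # hidden activations
            y = logistic(h @ W2 + b2)               # module output
            e = y - tb
            mse += np.sum(e ** 2)
            dy = e * y * (1.0 - y)                  # logistic derivatives
            dh = np.outer(dy, W2) * h * (1.0 - h)
            # Gradients accumulated over the 100-pattern block, single update.
            W2 -= lr * (dy @ h);           b2 -= lr * dy.sum()
            W1 -= lr * (dh.T @ xb) * mask; b1 -= lr * dh.sum(axis=0)
        mse /= len(X)
        # Vogl-style adaptation: grow the rate while the MSE keeps decreasing,
        # halve it as soon as the MSE saturates or increases.
        lr = lr * lr_up if mse < prev_mse else lr * lr_down
        prev_mse = mse
    return W1, b1, W2, b2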

Fig. 4: Learning rate vs. number of training epochs

Fig. 5: Output MSE vs. number of training epochs

5 Performance Evaluation In June 1992, NIST organized a world-wide contest with the purpose of evaluating the state of the art in the OCR field. As far as hand-written numerals are concerned, NIST provided a suggested training set (the one previously described) and a test set of 58646 samples purposely taken from a very different population. Therefore, recognition performance on the latter database is very revealing about the generalization capability of the system. We preprocessed these samples in the same way as their training counterparts (except that in this case we used only 16 grey levels), and then tried to classify them by means of our modular MLP. In particular, we were mainly interested in checking the behaviour of the net when constraints on the resolution of both memory and computing elements are applied[13,14]. We therefore quantized all the weights using 6 bits, and the transfer function of the hidden neurons using 4 bits. Output neurons are not involved: since their transfer function is monotonic, we can determine the winning class directly from their activations. In analog implementations these quantities are usually expressed in terms of currents, and very simple circuits can be designed to select the highest one[15]. With this configuration, forcing the net to take a decision in every case, we achieved a 3.69% error rate on the NIST Test Set. It is worth noting that the performance degradation with respect to floating-point weights and neurons is very limited, amounting to about 0.1%. This stems from the fact that the weight values turn out to be more uniformly distributed once decay is applied and additional training epochs are performed. We then rejected the most dubious cases by imposing a lower threshold on the winning output: clearly this is not the most effective solution, since the amount of information provided by the net is not fully taken into account; however, it was chosen for its simplicity. Fig. 6 summarizes the results we obtained: dots in the graph show the behaviour of the error rate as a function of the percentage of rejected samples.
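A sketch of the constrained-resolution evaluation is given below. The uniform quantizer, the symmetric weight range and the rejection on the raw winning activation are our assumptions; the paper only states the bit widths (6 for the weights, 4 for the hidden transfer function) and the fact that the output nonlinearity can be skipped because it is monotonic.

import numpy as np

def quantize(x, bits, lo, hi):
    # Uniform quantization to 2**bits levels over [lo, hi] (assumed scheme).
    n = 2 ** bits - 1
    return lo + np.round((np.clip(x, lo, hi) - lo) * n / (hi - lo)) * (hi - lo) / n

def logistic4(z):
    # Hidden-unit transfer function quantized to 4 bits (16 output levels).
    return quantize(1.0 / (1.0 + np.exp(-z)), 4, 0.0, 1.0)

def classify(x, modules, w_range=4.0, reject_threshold=None):
    # `modules` is a list of ten (W1, b1, W2, b2) tuples, one per digit class;
    # `w_range` is an assumed symmetric range for the 6-bit weight quantizer.
    activations = []
    for W1, b1, W2, b2 in modules:
        q = lambda w: quantize(w, 6, -w_range, w_range)   # 6-bit weights
        h = logistic4(q(W1) @ x + q(b1))
        # The monotonic output nonlinearity is skipped: the raw activation
        # already determines the winner.
        activations.append(float(q(W2) @ h + q(b2)))
    winner = int(np.argmax(activations))
    if reject_threshold is not None and activations[winner] < reject_threshold:
        return None        # dubious case rejected by thresholding the winner
    return winner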

Fig. 6: Error rate vs. percentage of rejected samples

6 Conclusions We showed how a large, real-world task like the recognition of hand-written numerals can be efficiently and economically accomplished by means of a rather general-purpose MLP. In fact, our classifier correctly recognized more than 96.3% of the samples contained in the NIST Test Set.

It therefore ranks sixth in the corresponding ranked list, even in the presence of the constraints we imposed for hardware feasibility. We want to point out the plainness of the solutions that allowed us to achieve this result: no complex features are extracted from the raw data, and no "a priori" knowledge about the problem to be solved is used in the architectural definition of the net. Moreover, the system has a total of 9610 free parameters: it is therefore about as large as a fully connected MLP with 128 inputs, one hidden layer of only 69 units, and 10 outputs.
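The parameter count is easy to verify from the stated architecture; the following check simply reproduces the arithmetic implied by the numbers above.

# Modular MLP: 10 modules, each with 32 hidden units (28 weights + 1 threshold)
# and one output neuron (32 weights + 1 threshold).
modular = 10 * (32 * (28 + 1) + (32 + 1))         # 9610 free parameters

# Comparable fully connected 128-69-10 MLP with thresholds.
fully_connected = 69 * (128 + 1) + 10 * (69 + 1)  # 9601 free parameters

print(modular, fully_connected)                   # 9610 9601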

Acknowledgements This work is partly supported by the Italian "Programma Nazionale di Ricerca sulle Tecnologie per la Bioelettronica".

References
[1] Krogh A. and Hertz J. A., "A Simple Weight Decay can Improve Generalization", in Advances in Neural Information Processing Systems 4, Moody J. E., Hanson S. J. and Lippmann R. P. (eds.), Morgan Kaufmann Publishers, San Mateo, California, 1992
[2] Garris M. D., Wilkinson R. A. and Wilson C. L., "Methods for Enhancing Neural Network Handwritten Character Recognition", in Proc. of the International Joint Conference on Neural Networks, Vol. 1, San Diego, California, 1991
[3] Costa M. and Pasero E., "FNC: a Fuzzy Neural Classifier with Bayesian Engine", in Proc. of the Italian Workshop on Neural Nets, Vietri sul Mare (SA), Italy, 1993
[4] Iwata A., Tohma T., Matsuo H. and Suzumura N., "A Large Scale Neural Network "CombNET" and its Application to Chinese Character Recognition", in Proc. of the International Neural Network Conference, Vol. 1, Paris, France, 1990
[5] LeCun Y., Boser B. E., Denker J. S., Henderson D., Howard R. E., Hubbard W. and Jackel L. D., "Handwritten Digit Recognition with a Back-Propagation Network", in Advances in Neural Information Processing Systems 2, Touretzky D. S. (ed.), Morgan Kaufmann Publishers, San Mateo, California, 1990
[6] Fukushima K. and Wake N., "Handwritten Alphanumeric Character Recognition by the Neocognitron", in IEEE Transactions on Neural Networks, Vol. 2, No. 3, May 1991
[7] Drucker H., Schapire R. and Simard P., "Improving Performance in Neural Networks Using a Boosting Algorithm", in Advances in Neural Information Processing Systems 5, Hanson S. J., Cowan J. D. and Giles C. L. (eds.), Morgan Kaufmann Publishers, San Mateo, California, 1993
[8] Mori S., Suen C. Y. and Yamamoto K., "Historical Review of OCR Research and Development", in Proceedings of the IEEE, Vol. 80, No. 7, July 1992
[9] Jacobs R. A., Jordan M. I. and Barto A. G., "Task Decomposition through Competition in a Modular Connectionist Architecture: the What and Where Vision Tasks", COINS Technical Report No. 90-27, Dept. of Computer & Information Science, University of Massachusetts, 1990
[10] Costa M., "Sulle Topologie di Connessioni in Percettroni Multistrato che Consentono di Inizializzare a Zero i Pesi della Rete" [On Connection Topologies in Multi-Layer Perceptrons that Allow the Network Weights to be Initialized to Zero], Internal Report No. DE-930715, Dept. of Electronics, Politecnico di Torino, 1993
[11] Vogl T. P., Mangis J. K., Rigler A. K., Zink W. T. and Alkon D. L., "Accelerating the Convergence of the Back-Propagation Method", in Biol. Cybern. 59, 1988
[12] Vassallo G., Gioiello M., Condemi C. and Sorbello F., "Neural Solutions for the Handwritten Character Recognition Task", submitted to Conference on Computer Vision and Pattern Recognition, Seattle, Washington, 1994
[13] Boser B. E., Sackinger E., Bromley J., LeCun Y. and Jackel L. D., "Hardware Requirements for Neural Network Pattern Classifiers", in IEEE Micro, February 1992
[14] Reyneri L. M. and Filippi E., "An Analysis on the Performance of Silicon Implementations of Backpropagation Algorithms for Artificial Neural Networks", in IEEE Transactions on Computers, Vol. 40, No. 12, December 1991
[15] Vittoz E., "Analog VLSI Implementation of Neural Networks", in Proc. of Journées d'Électronique - Réseaux de Neurones Artificiels / Artificial Neural Networks, Lausanne, Switzerland, 1989
