Experiments with Statistical Connectionist Methods and Hidden Markov Models for Recognition of Text in Telephone Company Drawings

Atul K. Chhabra and Vishal Misra

NYNEX Science & Technology, 500 Westchester Avenue, White Plains, NY 10604 USA
[email protected]

Department of Electrical and Computer Engineering, University of Massachusetts at Amherst, Amherst, MA 01003 USA
Abstract
At the last IWANNT workshop, we presented the framework for statistics-driven high-order feature selection for neural network based printed character recognition in the context of telephone company engineering drawing conversion. Here, we discuss the results of various experiments designed to improve the accuracy of text recognition in drawings. First, we study the effect of limiting the maximum number of higher order feature pairs that are selected for improving the discrimination of any given class pair. Classifiers obtained with different limits are found to make different mistakes. These classifiers could potentially be used together with a voting scheme to further reduce the classification error. Next, we attempt to model the limited-vocabulary language as a single hidden Markov model of the first order. This helps in improving segmentation in the case of touching characters and helps in correcting some errors made by the isolated character recognition system.
1 Introduction
Telephone companies have a huge number of records stored in the form of tabular or quasi-tabular data on engineering drawings. Assignment drawings, wiring lists, front equipment drawings, and distributing frame drawings are a few examples of such structured drawings (see figure 1 for examples of some of these drawings). The ASCII data in these drawings has traditionally been stored on paper or microfilm because some of the data dates back to the days when computers were not commonly used in telephone company operations. Conversion of the drawings to computer data is crucial for efficient operation and for utilizing emerging technologies such as automated mapping and facilities management. Conversion using manual data entry is very tedious and labor intensive.
Figure 1: Small extracts from E-size telephone company drawings: (a) DSX Assignment Drawing and (b) Front Equipment Drawing.

At the last IWANNT workshop, we presented early work on a semiautomatic system to convert the telephone company drawings using line drawing interpretation and neural network based text recognition [1]. In this paper, we describe progress made since then on improving the character recognition performance. We present experiments with statistical high-order feature selection used in conjunction with a feed-forward neural network for isolated character recognition, and experiments with hidden Markov models for text string recognition. First, we study the effect of limiting the maximum number of higher order feature pairs that are selected for improving the discrimination of any given class pair. By imposing differing limits, it is found that the resulting classifiers have a significant non-overlap in their classification errors. Fusion of the results of several such classifiers can be used to reduce the overall classification error. Next, we attempt to model the lexicon as a single first order hidden Markov model. This helps in improving segmentation in the case of touching characters and helps in correcting some errors made by the isolated character recognition system.
2 The Nature of Characters in the Drawings
The telephone company drawings contain text that is machine-printed (typewritten), drafted, and/or hand-printed. Even though most of the text is typewritten or drafted, the recognition of this text is much harder than conventional OCR of printed characters. This is due to the several reasons listed below.

- The drawings are drawn/typed on a medium known as mylar. The E-size mylars undergo severe warping, shearing, and other distortion due to improper handling over several years or decades, and due to hundreds of passes through roller-fed copying machines. This causes the characters and the line work to deform severely.
- With age, patches of ink begin to fall off from the mylars because of wear and tear. This results in fragmented characters.
- Different characters (even characters belonging to the same character string) may have been typed/printed at different times and with different pitch and font, resulting in misalignment, touching characters, uneven/unpredictable spacing, etc.

Due to the above reasons, it is not possible to use an off-the-shelf OCR package. We need a system that can be trained with real image data. We described such a system in [1, 2]. Here, we present enhancements to that system.
3 Statistical High-Order Feature Selection
Our intelligent character recognition (ICR) system consists of two major components: the single character recognizer (SCR) and the text string segmenter/interpreter. The SCR uses statistically derived high-order combinations of raw geometric features and a feed-forward neural network classifier [1, 2].
       Error on Training Data    Error on Test Data
NN1    1.09% (491 errors)        4.11% (491 errors)
NN2    0.84% (379 errors)        4.16% (497 errors)

Table 1: Performance of the two classifiers at 0% rejection rate. NN1 used high-order features computed as described in [2]. For NN2, a limit of four high-order features per class pair was imposed.

Most neural network based character recognition systems treat neural networks as a "black box". We showed [1, 2] that significant improvement in classification accuracy and speed can be obtained by intelligent selection of inputs to a neural network classifier. Specifically, we showed that high-order combinations of geometric features lead to better accuracy and speed than using pixel data or raw features as input to the neural network. In essence, the high-order combinations of raw features provide higher discriminating ability than the raw features themselves. The combinations are derived by a statistical discriminant analysis through an iterative optimization process. This process tends to allocate more high-order features to class pairs that are difficult to distinguish (such as the hand-printed letters O and D). This results in too little attention being given to the class pairs that are relatively easier to distinguish. To avoid this problem, we experimented with the idea of limiting the maximum number of high-order features devoted to a particular class pair. This results in a more even allocation of high-order features across the class pairs. Although this causes the discrimination of classes such as hand-printed O and D to deteriorate, such problems can easily be fixed with the help of context. For our experiment, we used hand-printed characters obtained from NIST Special Database 3 (training data) and NIST Test Data 1 [3]. We used only the upper-case letters in our experiment. The training set consisted of 44,951 character images and the test data contained 11,941 character samples. We trained two feed-forward neural networks, NN1 and NN2, using error backpropagation.
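The capped allocation can be sketched as a greedy search over candidate second-order (product) features. The scoring function below (a Fisher-style discriminant ratio) and the greedy loop are our assumptions for illustration only; the actual iterative discriminant analysis of [2] differs in detail.

```python
import itertools
import numpy as np

def select_high_order_features(X, y, n_features, max_per_pair=4):
    """Greedy sketch of capped high-order feature selection.

    X : (n_samples, n_raw) raw geometric features
    y : (n_samples,) class labels
    Candidate high-order features are pairwise products of raw
    features; each is scored per class pair by a Fisher-style
    ratio, and no class pair may claim more than max_per_pair
    of the selected features.
    """
    classes = np.unique(y)
    class_pairs = list(itertools.combinations(classes, 2))
    candidates = list(itertools.combinations(range(X.shape[1]), 2))

    # Score every candidate product feature for every class pair.
    scored = []
    for (i, j) in candidates:
        f = X[:, i] * X[:, j]                 # second-order feature
        for (a, b) in class_pairs:
            fa, fb = f[y == a], f[y == b]
            fisher = (fa.mean() - fb.mean()) ** 2 / (fa.var() + fb.var() + 1e-9)
            scored.append((fisher, (i, j), (a, b)))
    scored.sort(reverse=True, key=lambda t: t[0])

    # Pick features best-first, but cap how many any class pair gets.
    selected, per_pair = [], {p: 0 for p in class_pairs}
    for score, feat, pair in scored:
        if feat in selected or per_pair[pair] >= max_per_pair:
            continue
        selected.append(feat)
        per_pair[pair] += 1
        if len(selected) == n_features:
            break
    return selected
```

The cap forces later picks toward class pairs that have not yet claimed features, spreading discriminative power more evenly, as described in the text.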
The inputs to the two networks were two different sets of high-order features computed from the same training data. The first set of features was computed by imposing no restriction on the feature selection algorithm of [2]. For computing the second set of features, we limited the number of high-order features that could be selected for any class pair to four. The performance of the two networks is shown in table 1. Although there appears to be no significant difference in the overall performance of the two networks, we observed that they differ in the kinds of mistakes they make. The difference in errors is highlighted in table 2.

       D-O   H-M   V-Y   K-X   U-V   L-C   C-I
NN1     40    13    12    14    47    19     9
NN2     48    19    17    18    36    11     2

Table 2: Number of errors made by classifiers NN1 and NN2 in distinguishing between similar looking characters. Errors for several confusing class pairs are shown.

We observed that approximately one third of the errors made by the network NN1 are not repeated by NN2. The converse was also found to be true, i.e., about one third of the errors made by the network NN2 are not repeated by NN1. Therefore, the two networks can potentially be used together with a voting scheme to further reduce the error rate. As an aside, note that the performance of both networks is better than that reported earlier in [3]. In the NIST test [3], the NYNEX recognizer stood second out of 37 participating upper-case alpha recognizers. The error rate of the NYNEX recognizer was 4.9% ± 0.5% as compared with 3.7% ± 0.5% for the top scoring recognizer. With the new improved networks, our error rate of 4.1% is within the statistical error bounds of the top scoring network.
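The text leaves the voting scheme unspecified; one minimal rule, assuming each network reports a label and a confidence, might look like the following (the function, and the dominance factor of two, are hypothetical, not the scheme we have implemented).

```python
def vote(pred1, conf1, pred2, conf2):
    """Hypothetical two-classifier voting rule.

    If the networks agree, accept the shared label. If they
    disagree, accept the more confident network only when its
    confidence clearly dominates; otherwise reject the character
    and defer to contextual post-processing (e.g., the HMM).
    """
    if pred1 == pred2:
        return pred1
    if conf1 > 2 * conf2:
        return pred1
    if conf2 > 2 * conf1:
        return pred2
    return None  # reject: let context resolve the disagreement
```

Because roughly a third of each network's errors are not shared by the other, even this simple agreement test converts many outright errors into rejections that context can repair.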
4 Segmentation and Hidden Markov Model based Recognition of Text Strings
Once a text string has been extracted from the image of a drawing, and prior to applying the SCR, it is necessary to segment the text string into isolated characters. When neighboring characters are not touching, and when the characters are well formed, this is a trivial task. However, in practice, text strings frequently contain fragmented characters and touching characters. We address this problem by initially over-splitting the image of a text string into a large number of segments. The segmentation is done by computing the upper and lower profiles of the touching characters and by interpreting them along with their first and second derivatives. The profiles and the associated derivatives yield a set of likely "pinch-points". We then calculate the best split path which passes through a pair of these pinch-points. Following this, we try to recognize the initial segments and valid groups of the initial segments as single characters. In order to better guide the segmentation and grouping process and the overall string interpretation process, we can pre-define grammars or hidden Markov models (HMM's) of
the lexicon for text strings. A grammar defines the structure of a text string. Text entries in tabulated data are often very structured. For instance, text entries in a column of data are likely to conform to similar structure. When a grammar is specified for a text string, the string interpretation procedure simultaneously performs splitting-regrouping and string-grammar matching. The grammar based system was referred to in [1] and is described in detail in [4]. Here, we discuss the HMM based system for text string interpretation. For entries in the drawings which have no associated grammar, we use HMM's as an aid to recognition. HMM's have been used in speech recognition for several years. Recently, they have been applied to the recognition of cursive handwriting [5]. We have improved upon these ideas while applying them to printed character recognition. As stated above, our printed character recognition task is much harder than what conventional OCR systems can handle. We treat each character as a state in the Markov model and treat the corresponding image segment grouping as the symbol. We use a single path discriminant HMM for the lexicon. The HMM is built by first collecting all the ASCII strings that are expected in the drawings. Once the dictionary is ready, we calculate the transition probabilities between letters and the probability of occurrence of each letter as the first character in a word. In a first order HMM, we use transition probabilities only from one character to the next. The performance of the system is expected to improve with the use of a second or higher order model, i.e., if we consider transition probabilities across two or more characters. However, the growth in computational complexity for higher order models makes it infeasible for us to implement them for the real-time interactive system that we are building. We have tried both the stationary and non-stationary versions of the HMM.
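The stationary first-order construction described above follows directly from counts over the dictionary. A sketch, assuming the dictionary is given as a list of ASCII strings (the function name and data layout are ours):

```python
from collections import defaultdict

def build_hmm(lexicon):
    """Estimate initial and transition probabilities for a
    stationary first-order character HMM from the expected
    ASCII strings of the drawings.

    Returns:
      initial[c]  - probability that a word starts with character c
      trans[a][b] - probability that character a is followed by b
    """
    initial = defaultdict(float)
    trans = defaultdict(lambda: defaultdict(float))
    for word in lexicon:
        initial[word[0]] += 1.0
        for a, b in zip(word, word[1:]):   # adjacent character pairs
            trans[a][b] += 1.0

    # Normalize the raw counts into probabilities.
    total = sum(initial.values())
    initial = {c: n / total for c, n in initial.items()}
    trans = {a: {b: n / sum(d.values()) for b, n in d.items()}
             for a, d in trans.items()}
    return initial, trans
```

A non-stationary variant would instead keep a separate transition table per character position; that rigidity is exactly what the next paragraph weighs against the stationary model.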
In a non-stationary model the transition probabilities are calculated separately for each position in a string. In a stationary model the location of the transition in the string is ignored. If the dictionary is completely known, then the non-stationary model performs better. However, the non-stationary model becomes too rigid and the chance of it recognizing a new word is considerably lower than that of a stationary model. Conventionally, the Viterbi algorithm is used to estimate the most likely state sequence that resulted in a given string image. Experiments showed that retaining only the best transitions at each stage of the trellis frequently missed the correct answer. We therefore use a modified version of the Viterbi algorithm in which, at each stage, we retain the two to three best transitions to a state from the previous state. We pass each segment grouping through the modified Viterbi algorithm to come up with the most likely string interpretations. We also make sure that there is no string length bias in the recognizer.

Strings recognized correctly as the top choice of the HMM            93.97%
Strings recognized correctly in the top two choices of the HMM       97.41%
Strings recognized correctly in the top 10 choices of the HMM        98.27%
Strings not recognized correctly in the top 10 choices of the HMM     1.73%

Table 3: Performance of the HMM and Viterbi algorithm on test data extracted from drawings.

Use of the HMM significantly improves the recognition of badly degraded and/or touching strings in the drawings. We ran experiments on a few hundred random samples collected from the drawings. We used a lexicon of about 200 words and used a first order stationary HMM. The results are presented in table 3 and a sample run is shown in figure 2. The picture at the top of the figure shows the scanned image of the deformed word "ASSIGNMENT." Also shown is the initial segmentation of the image into "atomic" symbols, which are then regrouped in different ways to find a good match (using the Viterbi algorithm) with the HMM of the lexicon. The bottom part of the figure shows the top ten choices of text strings reported by the Viterbi algorithm. Next to the reported answers, you can see the confidence level for each character and an overall score from the Viterbi algorithm. Although the confidence levels for "A" and "N" are almost zero, the system is able to recognize the word correctly due to the HMM of the lexicon.
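The modified Viterbi search can be sketched as a closely related beam formulation: rather than keeping a single best path per state, the `beam` best partial hypotheses survive at each segment position, so the correct string is less likely to be pruned early. The data structures and the scoring by simple probability products below are our assumptions, not the exact implementation.

```python
import heapq

def beam_viterbi(obs_scores, initial, trans, beam=3):
    """Beam-search sketch of the modified Viterbi decoding.

    obs_scores : list of dicts; obs_scores[t][c] is the single
                 character recognizer's confidence that segment
                 grouping t is character c
    initial, trans : lexicon HMM probabilities (dicts of dicts)
    Returns candidate strings, ranked best-first.
    """
    # Each hypothesis is (score, partial_string); higher is better.
    hyps = [(initial.get(c, 1e-9) * s, c)
            for c, s in obs_scores[0].items()]
    hyps = heapq.nlargest(beam, hyps)
    for frame in obs_scores[1:]:
        extended = []
        for score, path in hyps:
            prev = path[-1]
            for c, s in frame.items():
                p = trans.get(prev, {}).get(c, 1e-9)
                extended.append((score * p * s, path + c))
        hyps = heapq.nlargest(beam, extended)  # keep the beam best
    return [path for _, path in sorted(hyps, reverse=True)]
```

Note how a weak recognizer confidence (like the near-zero "A" and "N" in figure 2) can still survive in a hypothesis whose transition probabilities are strong, which is how the lexicon HMM rescues badly degraded characters.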
5 Conclusion
In this paper, we presented techniques for improving the performance of single character recognition and word recognition. This work is part of a larger effort towards building an interactive system for converting telephone company drawings from paper into CAD and database form.
References
[1] A. Chhabra, S. Chandran, and R. Kasturi, "Table structure interpretation and neural network based text recognition for conversion of telephone company tabular drawings," in Proc. IWANNT'93, Princeton, NJ, pp. 92-98, October 1993.
Figure 2: Recognition of a text string with touching and fragmented characters.

[2] A. Chhabra et al., "High-order statistically derived combinations of geometric features for handprinted character recognition," in Proc. IAPR International Conference on Document Analysis and Recognition, Tsukuba Science City, Japan, pp. 397-401, October 1993.

[3] R. Wilkinson et al., "The first census optical character recognition system conference," Tech. Rep. NISTIR 4912, National Institute of Standards and Technology, Gaithersburg, Maryland, USA, May 1992.

[4] A. Chhabra, "Anatomy of a hand-filled form reader," in Proc. IEEE Workshop on Applications of Computer Vision, Sarasota, FL, December 1994.

[5] M. Chen, A. Kundu, and J. Zhou, "Off-line handwritten word recognition using a hidden Markov model type stochastic network," IEEE Transactions on PAMI, vol. 16, pp. 481-496, May 1994.