IJDAR (2000) 3: 34–39
Pitch-based segmentation and recognition of dot-matrix text Berrin A. Yanikoglu Faculty of Engineering and Natural Sciences, Sabanci University, Orhanli 81474 Istanbul, Turkey; e-mail:
[email protected] Received October 18, 1999 / Revised April 21, 2000
Abstract. Dot-matrix text recognition is a difficult problem, especially when characters are broken into several disconnected components. We present a dot-matrix text recognition system which uses the fact that dot-matrix fonts are fixed-pitch, in order to overcome the difficulty of the segmentation process. After finding the most likely pitch of the text, a decision is made as to whether the text is written in a fixed-pitch or proportional font. Fixed-pitch text is segmented using a pitch-based segmentation process that can successfully segment both touching and broken characters. We report performance results for the pitch estimation, fixed-pitch decision and segmentation, and recognition processes.
Fig. 1. Sample fixed-pitch text
Key words: Dot-matrix – Fixed-pitch – Pitch estimation – Segmentation – OCR
Fig. 2. Sample proportional-font text
1 Introduction
Fonts are classified into two categories, fixed-pitch and proportional, according to the amount of horizontal space taken by each character. In fixed-pitch text, characters are written in (imaginary) fixed-width boxes; hence, an “i” takes up as much overall space as an “m”. An example of fixed-pitch text is shown in Fig. 1¹. Although not necessarily by definition, typically the characters of a fixed-pitch font are centered in the imaginary box they are written in, so that the horizontal centers of consecutive characters are a fixed distance apart. The pitch indicates the number of characters per horizontal distance (e.g., per inch). The term is also used, however, to refer to the width, in pixels, of the imaginary box in which the character is positioned. In contrast, proportional fonts leave less space around narrow characters to improve the appearance of the text. A sample proportional-font text is shown in Fig. 2.
This work was done when the author was at IBM Almaden Research Center.
¹ All of the figures in this paper are segments from real postal address blocks, cropped for privacy.
Fig. 3. Sample dot-matrix text
Dot-matrix text is a special type of fixed-pitch text where characters are composed of small, often disconnected dots that are generated in raster scan order by dot-matrix printers (see Fig. 3). Even though this technology is now old, and the number of dot-matrix printers is declining, there is still a considerable amount of dot-matrix text being generated by existing dot-matrix printers. In addition, some publishers now simulate dot-matrix printing for a customized look in high-tech printing (e.g., the name of the receiver of a catalog). From a recent sampling of the postal mail, we found that the proportion of addresses written in dot-matrix fonts is typically between 3% and 6%. Recognizing dot-matrix text is generally more difficult than recognizing regular machine print. In particular, segmentation of dot-matrix text can be quite difficult when characters are composed of multiple disconnected components, as shown in Fig. 4. This type of text is often grossly over-segmented and, subsequently,
Berrin A. Yanikoglu: Pitch-based segmentation and recognition of dot-matrix text
not recognized by recognition systems designed for regular machine-print. Dot-matrix character recognition is also more difficult because of the greater variability in character shapes. Since the dots are generated in raster scan order as opposed to each character being printed in turn, even small alignment problems can cause significant variations in character shapes. Finally, dot-matrix text is often of poorer quality due to lower quality of printing mechanisms, resulting in significant degradation in character shapes.
The approach taken in these systems was to (optionally) eliminate components that are too big or too small, and then to calculate the center-to-center distances of the remaining characters. These distances were then compared to the small number of standard pitch values (6, 8, 10, 12, and 14 pitch) and a decision was made as to whether the text was fixed-pitch, along with the pitch value. Lu’s work [6] addresses the segmentation of machine print text in general. However, the segmentation of fixed-pitch text is the same as that described above: the pitch is estimated by computing the mode of the “interval distance” of connected components, which is defined to be the center-to-center distance. Even though these publications do not report performance results, the general approach taken is adequate for estimating the pitch of good-quality dot-matrix text. It would, however, fail on broken (Fig. 4) or underlined (Fig. 5) text. Furthermore, most of these systems were designed specifically to recognize fixed-pitch text, without addressing the issue of how to decide whether the text was fixed-pitch.
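The center-to-center approach described above can be illustrated with a minimal sketch. It is not taken from any of the cited systems: the function names, the rounding of distances to whole pixels, and the 200 dpi scan resolution used to convert a standard pitch (characters per inch) into a box width in pixels are all assumptions made for illustration.

```python
from collections import Counter

# Standard pitch values named in the text, in characters per inch.
STANDARD_PITCHES_CPI = (6, 8, 10, 12, 14)

def center_distances(boxes):
    """boxes: (left, right) x-extents of connected components, in any order.
    Returns the distances between consecutive component centers."""
    centers = sorted((left + right) / 2.0 for left, right in boxes)
    return [b - a for a, b in zip(centers, centers[1:])]

def mode_distance(distances):
    """Most frequent center-to-center distance, rounded to whole pixels
    (the rounding is an assumption of this sketch)."""
    counts = Counter(round(d) for d in distances)
    return counts.most_common(1)[0][0]

def nearest_standard_pitch(width_px, dpi=200):
    """Closest standard pitch as (cpi, expected box width in pixels).
    The 200 dpi default is an assumed scan resolution."""
    candidates = ((cpi, dpi / cpi) for cpi in STANDARD_PITCHES_CPI)
    return min(candidates, key=lambda c: abs(c[1] - width_px))

# Five well-separated components whose centers are 20 pixels apart.
boxes = [(2, 16), (22, 36), (42, 56), (62, 76), (82, 96)]
width = mode_distance(center_distances(boxes))
print(width, nearest_standard_pitch(width))
```

As the text notes, a scheme of this kind depends on each connected component being one character, which is exactly what fails for broken or underlined text.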
Fig. 4. Broken dot-matrix text
In this paper, we present a dot-matrix text recognition system which uses the fact that dot-matrix fonts are fixed-pitch, in order to overcome the segmentation difficulty. The system was designed to be an extension of the IBM address recognition system [3]. In this overall design, machine print text is analyzed to identify the fixed-pitch portion, which is then processed separately with the new system using a pitch-based segmentation algorithm. The remaining machine print text (proportional font) is processed as usual.
This work was motivated by the need to improve dot-matrix text recognition accuracy. However, the system is designed to process all types of fixed-pitch text, since distinguishing dot-matrix from other fixed-pitch text was found to be neither reliable nor necessary. A pitch-based segmentation algorithm can improve the segmentation of underlined, touching, and broken fixed-pitch (not necessarily dot-matrix) text, while not adversely affecting the segmentation of good-quality fixed-pitch text.
The organization of the rest of the paper is as follows. In Sect. 2, we discuss the previous work in the area. The algorithm for estimating the most likely pitch and the fixed-pitch decision criteria are explained in Sects. 3 and 4, respectively. Section 5 presents our pitch-based segmentation algorithm for fixed-pitch text. Recognition of segmented characters is described in Sect. 6. Finally, Sect. 7 summarizes the current work and describes possible extensions.
2 Previous work
Most recent work in dot-matrix text recognition is proprietary, and not published. There are, however, some earlier patents and technical reports [2, 1, 4, 5], as well as a few more recent articles [6] describing algorithms and machines for reading fixed-pitch text.
Fig. 5. Underlined fixed-pitch text
In this paper, we are addressing the issues of pitch estimation and font type classification (fixed-pitch or proportional), as well as presenting an algorithm to segment dot-matrix text by making use of the estimated pitch.
3 Pitch estimation
We evaluated three different approaches to estimate the pitch of a text line efficiently and reliably, even when several of the characters are broken or touching: auto-correlation, Fourier transform, and peak-valley analysis. All three approaches use the vertical projection profile of the text line as the input. The vertical projection profile of an image is commonly used in many document processing algorithms, as well as in previous pitch-estimation work. Also called the Radon transform, the vertical projection profile f(x) of a text image Img(x, y) is defined as:
$$f(x) = \sum_{y} \mathrm{Img}(x, y)$$
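As a small illustration of this definition, the profile is simply the column sums of a binary image. The sketch below assumes the image is given as a list of rows of 0/1 values (1 for an ON-pixel); that representation and the function name are choices made here, not part of the paper.

```python
def vertical_projection(img):
    """Column sums f(x) of a binary image.

    img is assumed to be a list of equally long rows, each a list of
    0/1 pixel values, with 1 meaning an ON (ink) pixel.
    """
    if not img:
        return []
    # zip(*img) iterates over columns; summing a column counts its ON-pixels.
    return [sum(column) for column in zip(*img)]

# A 3 x 5 toy image: two one-pixel-wide strokes separated by a blank column.
img = [
    [0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 0, 0],
]
print(vertical_projection(img))
```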
For pitch-based segmentation, one also needs to know the offset of the text, as well as its pitch. In this paper, the offset is defined to be the distance between the left margin and the left edge of the imaginary box in which the first character is written. These imaginary fixed-width boxes, starting from the offset, will be called the pitch windows. The vertical projection profile of a sample fixed-pitch text, as well as the correct pitch, offset, and pitch windows, are displayed in Fig. 6. The Radon transform of a fixed-pitch word with pitch p is roughly periodic with period p, though this is less
Auto-correlation of f(x) with lag t (Sect. 3.1):
$$\sum_{x} f(x)\, f(x + t)$$
Fig. 6. Pitch p, offset o and the pitch windows indicated by vertical bars are shown, along with the vertical projection profile of a sample fixed-pitch text
true for the whole text where there are spaces or small punctuation characters. By applying the limiting transformation shown in Fig. 7, we can remove some of the irrelevant information and bring out the periodicity further. This limiting transformation also reduces the adverse effect of large noise regions such as vertical frames, and has been found useful in increasing the overall pitch estimation accuracy.
To estimate the pitch of a given text line, we choose as the pitch the lag that maximizes the auto-correlation of the Radon transform of the text. In other words, for each possible pitch value t, we compute the auto-correlation coefficient and pick the pitch that gives the highest coefficient. The offset needs to be estimated in a second step; this could be done by sliding the pitch windows with the chosen width (pitch) over the text, and finding the best offset by some suitable criterion. One such criterion is to minimize the count of the ON-pixels at the pitch-window boundaries. This method fails to estimate the pitch when the text does not contain many regular characters, as is the case when there are several small punctuation or space characters in the line. A few examples where the pitch was incorrectly estimated are shown in Fig. 8. As mentioned before, in these cases the Radon transform is far from being periodic, and the result is expected. Computing the auto-correlation is fast: on the order of PN simple operations, where N is the length of the line and P is the number of possible pitches considered. The offset estimation, which needs to be done only for the chosen pitch, can be done in O(N) operations.
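The two steps just described (pick the lag with the highest auto-correlation, then slide the pitch windows to find the offset) might be sketched as follows. This is an illustrative sketch under stated assumptions: it uses the unnormalized sum of products as the auto-correlation score, restricts itself to integer candidate pitches, and uses the boundary criterion mentioned above for the offset. The function names are hypothetical.

```python
def autocorrelation(f, t):
    """Unnormalized auto-correlation of the profile f at integer lag t."""
    return sum(f[x] * f[x + t] for x in range(len(f) - t))

def estimate_pitch(f, candidates):
    """Candidate lag with the highest auto-correlation (ties go to the
    first candidate in iteration order)."""
    return max(candidates, key=lambda t: autocorrelation(f, t))

def estimate_offset(f, pitch):
    """Offset in 0..pitch-1 that minimizes the ON-pixel count on the
    pitch-window boundaries, one of the criteria suggested in the text."""
    def boundary_mass(o):
        return sum(f[x] for x in range(o, len(f), pitch))
    return min(range(pitch), key=boundary_mass)

# Synthetic profile: eight characters, each 6 columns wide, in 10-pixel boxes.
profile = [0, 0, 3, 3, 3, 3, 3, 3, 0, 0] * 8
pitch = estimate_pitch(profile, range(6, 21))
print(pitch, estimate_offset(profile, pitch))
```

For the chosen pitch the offset search touches each column of the line once, which is consistent with the O(N) cost stated above.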
Fig. 7. Limiting function applied to the Radon transform, f (x)
Finally, in all three approaches, we calculate a rough initial estimate of the pitch from the initial connected component analysis. This value is used for estimating a few non-sensitive threshold values, as well as for narrowing down the possible pitch range in the pitch estimation process. Note that since the characters in the text might all be broken or touching, limiting the possible pitch values is done very conservatively, substituting default values when necessary. This initial estimate is not essential for the estimation process; it is only included to speed the process up whenever possible. The results of all three approaches are reported in Sect. 3.4.
Fig. 8. Typical cases for pitch estimation problems using the auto-correlation method
3.1 Auto-correlation
One of the methods used to estimate the pitch is auto-correlation, which is the correlation between a function and a lagged version of itself. A high correlation is likely to indicate a periodicity of the input function. In particular, the auto-correlation of a periodic function with period p will have its first peak (maximum) at lag value p. For our purposes, the auto-correlation of f(x) with lag t is defined as:
3.2 Fourier analysis
Another approach used to estimate the pitch is Fourier analysis. Again, due to the roughly periodic nature of the Radon transform of a fixed-pitch text, we expect that, in the Fourier representation, the sinusoids which have a period equal to the pitch would correlate highly with the Radon transform. To find the sinusoids with the highest correlation, we need to find the magnitude of the corresponding complex coefficients. Using the Discrete Fourier Transform
(DFT), the Fourier coefficients $a_n$ and $b_n$ of the sinusoids can be calculated as:
$$a_n = \sum_{x} f(x) \cos\!\left(\frac{2\pi n}{N}\, x\right), \qquad b_n = \sum_{x} f(x) \sin\!\left(\frac{2\pi n}{N}\, x\right)$$
where N is the number of data points and the index n corresponds to the pitch equaling N/n. From these, we can estimate the magnitude of the complex coefficients $c_n$ as:
$$|c_n| = \sqrt{a_n^2 + b_n^2}$$
Finally, the pitch is computed as $N/n^*$, where N is the number of data points and $n^*$ is the index of the coefficient with the maximum magnitude. Without using an FFT implementation, the time necessary to estimate the pitch, as described above, is on the order of PN, where N is the length of the line and P is the number of possible pitches considered. However, the trigonometric function evaluations are more time consuming than the calculations used in the auto-correlation method. The offset, on the other hand, is calculated in a single step, as $\arctan(b_{n^*}/a_{n^*})$.
The performance of the Fourier analysis for estimating the pitch was similar to that of the auto-correlation method. One type of error, also common to the peak-valley analysis, occurs when there are too few characters on the text line and most of them have dominating side strokes (e.g., “H, N, U”) and/or broken centers, resulting in an incorrect pitch estimate that is roughly half the correct pitch.
3.3 Peak-valley analysis
The peak-valley analysis is designed to avoid the shortcomings of the previous two. In this method, we find the most likely estimates for the pitch p and the offset o, given the vertical projection profile of the text, such that the pitch windows defined by them align with the text in the best way (i.e., ON-pixels lie in the center of the windows and OFF-pixels lie on the boundary). Specifically, p and o are computed using the following maximization formula:
$$\arg\max_{p,\,o} \sum_{x = o + kp,\; k \in \mathbb{N}} \left\{ \sum_{t=-cst}^{cst} f\!\left(x + p/2 + t\right) \;-\; \alpha \times \min\bigl(f(x),\, f(x+1)\bigr) \right\}$$
The summation at the center of the pitch window is used to make the algorithm more robust against broken
characters (often at the center) as well as the ones with a thin stroke in the center (“h, j, n, r, u, v, H, J, N, U, V”). The constant cst is set to 4 (with common pitch values being 16–20 pixels), but adjusted accordingly if the initial estimate of the character width is very small or very large. Similarly, the minimum function on the boundary of the pitch window can accommodate single-pixel shifts and detect a 1-pixel space between characters, even when characters are written slightly off-center inside the pitch windows. We use a small but important modification to the above calculation: the summation term at a particular pitch window is set to zero if the positive portion (summation of f(x)) is not large enough to indicate the presence of a character. This adjustment is used to normalize the summation for different pitch values; otherwise, smaller pitch values would be assigned higher scores on a mostly blank line with a few characters. The performance of the peak-valley analysis is quite good, while its drawback is its time complexity. Since we optimize for various offset values (0 to p − 1) as well as pitch values, the time it takes is on the order of $P^2 N$, where N is the length of the line and P is the number of possible pitches considered.
3.4 Results
We tested the pitch estimation algorithm on a small test set of 228 fixed-pitch lines. These text lines were collected from real postal address blocks with an average of five lines each; however, information about other lines in the same address block was not used in estimating the pitch of an individual text line. True pitch values typically ranged from 10 to 22 pixels, while the range of considered pitch values was from 6 to 40 pixels, with increments of 0.5 pixels. With the peak-valley analysis, which was the best among the three approaches, the pitch estimation performance was 98.2% correct (up to 0.5 pixel difference from the true pitch).
In other words, the pitch was estimated incorrectly for only 1.8% of the text lines. When the pitch was incorrectly estimated, the error was usually large, typically about 30–50% of the correct pitch. Most of the errors were due to skew, stretching, or having too few characters on the text line (also see Sect. 5). The error rates using the auto-correlation and Fourier analysis methods were 5.7% and 7%, respectively. The average pitch estimation time was less than 100 ms on a RISC 6000 per address block (average five lines and 500 × 400 pixels at 200 dpi), or less than 20 ms per text line. In the next section, we discuss the performance of the font type classification process; in other words, the task of deciding whether a given text is fixed-pitch or proportional font. This decision is made both for individual text lines and the whole address block.
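Returning to the estimators of Sects. 3.2 and 3.3, the following is a minimal sketch of both, not the implementation evaluated above. Several details are assumptions made for illustration: only integer pitches are searched (the experiments used 0.5-pixel increments), the weight alpha = 10 and the empty-window threshold min_mass are hypothetical values that the text does not specify, and only pitch windows lying entirely on the line are scored. The function names are likewise hypothetical.

```python
import math

def fourier_pitch(f, min_pitch, max_pitch):
    """Pitch N/n* for the DFT index n* with the largest |c_n| (Sect. 3.2).
    Only indices n whose period N/n lies in [min_pitch, max_pitch] are tried."""
    N = len(f)
    best_n, best_mag = None, -1.0
    for n in range(math.ceil(N / max_pitch), N // min_pitch + 1):
        a = sum(v * math.cos(2 * math.pi * n * x / N) for x, v in enumerate(f))
        b = sum(v * math.sin(2 * math.pi * n * x / N) for x, v in enumerate(f))
        mag = math.hypot(a, b)          # |c_n| = sqrt(a_n^2 + b_n^2)
        if mag > best_mag:
            best_n, best_mag = n, mag
    return N / best_n

def peak_valley_score(f, p, o, cst=4, alpha=10.0, min_mass=1):
    """Score of integer pitch p >= 2 and offset o (Sect. 3.3).
    alpha and min_mass are assumed values, not taken from the paper."""
    score = 0.0
    x = o
    while x + p <= len(f):              # only windows that fit on the line
        c = x + p // 2                  # center column of this window
        lo, hi = max(x, c - cst), min(x + p, c + cst + 1)
        center = sum(f[lo:hi])          # ON-pixel mass around the center
        if center >= min_mass:          # windows judged empty contribute zero
            score += center - alpha * min(f[x], f[x + 1])
        x += p
    return score

def peak_valley_pitch(f, pitches, **params):
    """Exhaustive search over the given pitches and offsets 0..p-1."""
    return max(((p, o) for p in pitches for o in range(p)),
               key=lambda po: peak_valley_score(f, po[0], po[1], **params))

# Synthetic profile: six characters, each 14 columns wide, in 20-pixel boxes.
profile = ([0] * 3 + [5] * 14 + [0] * 3) * 6
print(fourier_pitch(profile, 12, 30), peak_valley_pitch(profile, range(12, 31)))
```

The nested search over pitches and offsets in `peak_valley_pitch` is what produces the $P^2 N$ cost discussed in Sect. 3.3, whereas `fourier_pitch` makes one pass over the line per candidate index.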
4 Fixed-pitch decision
After finding the most likely pitch, we compare the prominence of that pitch to that of the second best in order to decide whether the text line is fixed-pitch or proportional font. The second-best pitch is chosen from among the pitch values that are two or more pixel values farther than the best pitch, depending on the pitch estimate. Since labeling proportional font as fixed-pitch is a lot more serious an error than classifying fixed-pitch text as proportional font, we use an additional safeguard criterion: if the proportion of pitch window boundaries that contain ON-pixels is high, the text is decided to be proportional font regardless of the prominence of the pitch. This last test can be ignored if there is only one connected component in the line, as is the case with underlined fixed-pitch text (which could not be segmented correctly without the pitch information).
An address block is decided to be fixed-pitch if most of the lines in it are classified as fixed-pitch. This criterion worked well in our tests since the address block location algorithm was quite effective (i.e., the majority of the text lines in what was decided to be the address block corresponded to the true address lines), and usually there are no font changes within an address.
Using the peak-valley analysis, the font type (fixed-pitch or proportional font) classification algorithm was evaluated on 2556 machine print address blocks. This set was obtained from 3000 randomly chosen address blocks after eliminating 438 handwritten ones and nine garbage images (bad address block location). Among the remaining 2556 address blocks, approximately 55% were fixed-pitch and 45% were proportional font. Overall, only 1.1% of all the proportional font address blocks were classified as fixed-pitch (false positives) and 3.7% of fixed-pitch address blocks were not detected as fixed-pitch (false negatives).
These results are quite good, since most of the false negatives can easily be handled by the general machine print recognizer. A number of the false negatives are due to the use of the second factor of the above decision criteria (the number of pitch window boundaries cutting ON-pixels). As we mentioned before, we wanted the decision to classify a given text as fixed-pitch to be conservative since the regular machine print recognizer can segment fixed-pitch text well, unless it was degraded dot-matrix, whereas the fixed-pitch segmentation algorithm would not do a good job with proportional font.
Another test was done earlier where the decision as to whether an address block was fixed-pitch was based on the cumulative best pitch and second-best pitch scores. The cumulative scores were calculated by accumulating the computed scores over all the lines in the address block, and choosing the global best at the end, as opposed to deciding on a line-by-line basis. The performance results of both decision criteria were comparable.
Both the pitch estimation and the decision algorithms can handle small skews; however, as the skew increases, the pitch becomes more difficult or impossible to detect, since character centers may no longer remain a fixed distance apart.
5 Pitch-based segmentation
Once the pitch and the offset are known, segmentation is quite easy. The segmentation algorithm looks at consecutive pitch windows and groups all of the small components that lie mostly in that window into one character. Components larger than the pitch value, spanning more than one window, are split into multiple characters at the pitch window boundary. The performance of this segmentation algorithm was nearly perfect if the pitch estimation was correct. Most of the segmentation problems occur when the text is shifted due to small stretches in the scanning process or when the text has a small skew; in these cases the pitch is no longer perfect. For instance, when part of the document is slightly stretched in the scanning process, the characters near the end of the line might be off from their expected locations by a few pixels and might be over-segmented. Adapting the offset so as to compensate for some possible stretch near the end of a line has been tested; however, the benefits were about the same as the drawbacks. Similarly, if the address block has a small skew and is still classified as fixed-pitch, a few characters might be over-segmented, since they may span more than one pitch window.
6 Dot-matrix character recognition
Character recognition was done by the IBM character recognition engine, which was re-trained specifically for dot-matrix characters. The recognition engine is described here very briefly for the sake of completeness. Segmented fixed-pitch characters are recognized in a one-hidden-layer feed-forward neural network, using bending points features of the normalized bitmap character [7]. The recognizer was trained with regular machine print and dot-matrix characters, including degraded dot-matrix characters (as long as they were recognizable by a human truther). In our initial training and testing, the performance on test data was 97.5%.
However, the test and the training sets were not completely independent since they were collected from the same forms (though different fields). We have found that characters that are composed of several small components or are very degraded due to smearing or very light printing benefit significantly from a simple smoothing operation before the recognition process.
7 Summary
We describe a system to recognize fixed-pitch, and in particular dot-matrix text, usually poorly recognized by regular machine print recognition engines. The segmentation difficulty is overcome by estimating the pitch of the text and guiding the segmentation of dot-matrix text by the location of the pitch windows, defined by the estimated pitch and the offset. We present three different algorithms for estimating the pitch of a text line. Alternatively, the system can
be made faster by first analyzing the center-to-center distances of connected components to estimate the pitch; if the evidence is very strong, the methods described in this paper could be skipped. This would speed up the pitch estimation process for good-quality dot-matrix text.
Acknowledgement. We would like to thank Dr. Sean Moore for his valuable help with the use of the Fourier Transform.
References
1. Baumgartner R. J.: Character pitch determination. Technical Report TDB 03-72, pp. 3104–3107, IBM, 1972
2. Ett A. H.: Pitch determination. Technical Report TDB 02-70, pp. 1349–1349a, IBM, 1970
3. Gopisetty S., Lorie R., Mao J., Mohuiddin M., Sorin A., Yair E.: Automated forms-processing software and services. IBM J. Res. Dev., 40(2):211–230, 1996
4. Jih C. R.: Segmentation method for fixed pitch printed documents. Technical Report TDB 08-80, p. 1194, IBM, 1980
5. Lotspiech J. B.: Algorithm for the segmentation of printed fixed pitch documents. IBM Patent 4377803, 1983
6. Lu Y.: On the segmentation of touching characters. Proc. 2nd Int. Conf. Doc. Anal. Recognition, pp. 821–828, 1993
7. Takahashi H.: A neural net ocr using geometrical and zonal pattern features. 1st Int. Conf. Doc. Anal. Recognition, pp. 821–828, 1991
Berrin Yanikoglu received B.S. and B.A. degrees in Computer Engineering and in Mathematics respectively, from Bogazici University, Turkey, in 1988, and her Ph.D. degree in Computer Science from Dartmouth College, in 1993. After a year as a Postdoctoral Associate at the Rockefeller University, she joined Xerox in 1994, where she worked on automatic benchmarking of document page segmentation, and the document recognition group at IBM Almaden Research Center in 1996. She is currently a faculty member at Sabanci University. Her research interests are image processing, pattern recognition, neural networks and information retrieval.