Form Frame Line Removal With Line Width Thresholding Approach
Yefeng Zheng, Changsong Liu, Xiaoqing Ding
(Department of Electronic Engineering, Tsinghua University, Beijing, 100084)
ABSTRACT
Characters often overlap form frame lines, and such character-line overlap seriously degrades the character recognition rate. This paper presents a simple but powerful frame line removal algorithm, the Line Width Thresholding Approach (LWTA). By using a smaller threshold for the run-lengths within characters and a larger threshold for those between characters, the approach is very robust. For the case of digits overlapping frame lines, two ways of exploiting a-priori information are implemented and compared: one uses heuristic a-priori information, and the other uses the feedback of the recognition core. Experiments on value-added tax invoices show the effectiveness of the approach: the recognition rate of overlapped digit strings does not decrease.

Key Words: Form Processing, Character-Line Overlap, Form Frame Line Removal, Character Stroke Reservation
1 Introduction
Forms are a familiar kind of document. Because they are concise, formal, and easy to fill in and process, they are widely used, and the automatic input, storage and processing of form documents has become an important part of intelligent document processing. In some printed forms, the printed data overlap the form frame lines because of inexact printer registration, and in some handwritten forms the filled-in data often extend beyond the cell borders. Such character-line overlap degrades the character recognition rate, so it is very important to remove form frame lines while preserving character strokes.

As shown in figure 1, there are three types of character-line overlap: contact, intersection and superposition [1]. The contact type is the easiest to process: removing the frame lines completely does not affect character recognition. The intersection type is more difficult, but intensive image analysis can remove the frame lines with good results. The superposition type is the most difficult to deal with. As shown in figure 1 (c), some pixels in the overlapping area cannot be correctly classified as frame line or character stroke by image analysis alone. Some a-priori information, or even syntactic and semantic information, must be used in this case to make the correct judgment.
(a) contact
(b) intersection
(c) superposition
Figure 1. Three Types of Character-line Overlapping
Supported by the 863 Hi-tech Plan (project 863-306-ZT03-03-1) and the National Natural Science Foundation of China (project 69972024).
This paper has been accepted by the Journal of Pattern Recognition and Artificial Intelligence (Chinese), Aug. 2000. The original copy is in Chinese.

Researchers have made considerable efforts to solve this problem. Some have focused on improving character recognition algorithms to cope with overlap [2,3], while others have sought effective frame line removal algorithms. The removal algorithms developed so far fall into two kinds. The first kind removes the frame lines completely and then uses local properties of the overlapping areas, such as stroke direction and connection, to restore the missing parts of strokes [1,4,5,6]. For example, B. Yu [5] used the BAG (Block Adjacency Graph) as the structural element to detect and remove frame lines, and then interpolated the missing pixels between two broken segments lying on opposite sides of a frame line. Although this approach handles contact and intersection overlap successfully, it cannot cope with the superposition case, because a lot of useful information is lost after frame line removal; the strokes cannot be restored even with the direction and connection of the remaining segments.

The second kind analyzes the character-line overlapping areas first and then removes only the pixels belonging to frame lines, while preserving those belonging to characters [7,8,9,10]. After analyzing a large number of real handwritten form samples, J.Y. Yoo [7] summarized 13 types and 34 subtypes of overlapping modes produced by Korean characters overlapping frame lines. After detecting the overlapping mode, he applied a different method to each mode. This approach is very tedious and cannot cover all overlapping modes. Furthermore, it is rather difficult to detect the overlap mode in noisy form documents [7]. Using boundary junction points, Y. Chung [8] classified the overlapping areas into three types: restoration parts (RP), non-restoration parts (NRP), and candidate generation parts (CGP). The RP type is further divided into two subtypes. NRP areas are preserved; RP areas are restored using the line width and the kind and number of elements of the neighboring line components. The algorithm generates all digit images both with and without the CGP and lets the recognition core make the final judgment.

The existing approaches are either too tedious or cannot deal with all three types of overlap. We found that line width, one of the fundamental properties of a frame line, is preserved very well along the whole line.
In the contact and intersection types, the frame line width increases noticeably in the character-line overlapping areas. In the superposition case, the frame line width changes little or not at all, so there is no physical feature that can be exploited. Given the thousands of possible overlapping modes, it is impossible to find a general approach to judge whether a segment belongs to a character or to a frame line, and some a-priori information must be employed.

Based on this observation, we developed a simple but powerful frame line removal algorithm, LWTA (Line Width Thresholding Approach). We first decompose a frame line into an array of black-pixel run-lengths perpendicular to the running direction of the line. Run-lengths shorter than a threshold are removed as segments of the frame line. To cope with a character line overlapping a frame line, we use a smaller threshold for the run-lengths within characters and a larger threshold for those between characters. For the special case of digits overlapping form frame lines, we propose two ways of using a-priori information to improve the algorithm: one uses heuristic a-priori information, and the other uses the feedback of the recognition core. Our approach is similar to Chung's algorithm, but it is much simpler and can handle a whole character line overlapping frame lines.
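As a rough illustration of the two-threshold rule (a smaller threshold inside a character, a larger one between characters, as stated in the abstract), the sketch below selects which run-lengths to erase. The `(x, ys, ye)` run tuples, the `char_boxes` column intervals and both threshold values are assumptions made for this example; the paper does not prescribe a data layout.

```python
def threshold_for_column(x, char_boxes, t_within, t_between):
    """Pick the threshold for column x: smaller inside a character box."""
    for x_left, x_right in char_boxes:
        if x_left <= x <= x_right:
            return t_within
    return t_between


def runs_to_erase(runs, char_boxes, t_within, t_between):
    """Return the runs that are shorter than the threshold of their column.

    Each run is (x, ys, ye): column index plus first and last black row,
    so its length is ye - ys + 1.
    """
    return [
        (x, ys, ye)
        for x, ys, ye in runs
        if ye - ys + 1 < threshold_for_column(x, char_boxes, t_within, t_between)
    ]


# Hypothetical data: one character occupies columns 4..8.
runs = [(0, 10, 12), (1, 10, 14), (5, 10, 14), (6, 4, 14)]
erased = runs_to_erase(runs, char_boxes=[(4, 8)], t_within=4, t_between=6)
print(erased)
```

With these made-up values, the run of length 5 in column 1 (between characters) is erased, while the run of the same length in column 5 (inside the character box) is kept.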
2 LWTA Algorithm
2.1 Basic LWTA algorithm
Taking a horizontal line as an example, we move along its running direction and decompose it into an array of black-pixel run-lengths {R_i}, each taken perpendicular to the line, as shown in figure 2. Run-lengths longer than a threshold are preserved as parts of characters, while those shorter than the threshold are removed as segments of the frame line. The threshold is a very important parameter: if it is too small, the frame line cannot be removed completely; if it is too large, some character segments are removed. Normally the threshold is set 2-3 pixels larger than the average width of the frame line. However, skew, noise, and especially overlap with characters produce some long run-lengths (we call them "abnormal run-lengths"), which make it difficult to estimate the frame line width accurately: the average length of all run-lengths is larger than the real line width. We use a technique similar to median filtering to exclude the interference of abnormal run-lengths. The "normal" run-lengths are defined as:

$R_{norm} = \{ R_i \mid ye_i - ys_i + 1 < 2\, l_{mid} \}$

where $ys_i$ and $ye_i$ are the start and end coordinates of run-length $R_i$ (see figure 2) and $l_{mid}$ is the median of all run-lengths. The average length of all normal run-lengths is taken as the estimate of the frame line width. ...
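The width estimate and the thresholding step of Section 2.1 can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the binary image as a list of rows (1 = black), the helper names, the known row `y_line` of the frame line, and the `margin` of 2 pixels (the text suggests 2-3) are all choices made for the example.

```python
from statistics import mean, median


def decompose_line(image, y_line):
    """Return one vertical black run (x, ys, ye) per column crossing row y_line."""
    runs = []
    height = len(image)
    for x, pixel in enumerate(image[y_line]):
        if not pixel:
            continue
        ys = ye = y_line
        while ys > 0 and image[ys - 1][x]:
            ys -= 1
        while ye < height - 1 and image[ye + 1][x]:
            ye += 1
        runs.append((x, ys, ye))
    return runs


def estimate_line_width(runs):
    """Average the 'normal' run-lengths, i.e. those shorter than 2 * median."""
    lengths = [ye - ys + 1 for _, ys, ye in runs]
    l_mid = median(lengths)
    normal = [n for n in lengths if n < 2 * l_mid]
    return mean(normal)


def remove_frame_line(image, y_line, margin=2):
    """Erase runs shorter than (estimated width + margin); modifies image in place."""
    runs = decompose_line(image, y_line)
    if not runs:
        return image
    threshold = estimate_line_width(runs) + margin
    for x, ys, ye in runs:
        if ye - ys + 1 < threshold:
            for y in range(ys, ye + 1):
                image[y][x] = 0
    return image


# Toy image: a 2-pixel-thick horizontal line crossed by a vertical stroke.
image = [[0] * 8 for _ in range(9)]
for x in range(8):
    image[4][x] = image[5][x] = 1
for y in range(9):
    image[y][3] = 1
width = estimate_line_width(decompose_line(image, 4))
cleaned = remove_frame_line(image, y_line=4)
```

In the toy image the long run through the vertical stroke is excluded from the width estimate as "abnormal", so the estimate stays at the true line width and only the stroke column survives the removal.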
"Normal" run-lengths ys
Figure 2. Decomposing a Horizontal Line Into an Array of Black Pixel Run-lengths (labels shown in the figure: "normal" and "abnormal" run-lengths, with ys and ye marking the start and end of a run)

Sometimes the line width in a superposed area is very close to the average line width, while the variation introduced by skew and noise is much larger. As a result, some character strokes are removed by mistake and some frame line segments are left after line removal, as shown in figure 3-(b). This seriously degrades the subsequent character segmentation and recognition. We therefore introduce character segmentation information: a larger threshold is set for run-lengths between characters, and a smaller threshold for run-lengths within a character. We modified the segmentation module to make it capable of dealing with overlap. The character segmentation result is shown in figure 3-(c), where the rectangles are the bounding boxes of the characters. The line removal result using two thresholds is shown in figure 3-(d).

(a) Raw image with a character line overlapping a frame line
(b) Line removal result with single threshold
(c) The result of character segmentation (the rectangles are the bounding boxes of characters)
(d) Line removal result using two thresholds
Figure 3. Line Removal Result of LWTA

2.2 LWTA with heuristic a-priori information
Digits are often one of the parts, or the only part, of a form that needs to be recognized, so the case of digits overlapping frame lines deserves closer study. In this specific case more a-priori information is at hand to obtain better removal results. Figure 4-(b) shows the line removal result of the basic LWTA: the bottom segments of the digits 0, 2 and 8 are removed by mistake. A digit 0, if it is also broken at the top, will be recognized as two digits 1, as with the first digit 0 in figure 4-(b). If the horizontal segment of a digit 2 is removed, it will be recognized as a 7. The upper parts of the digits 2 and 7 differ only slightly: the upper stroke of 2, a half circle, is smoother than that of 7, which consists of a horizontal stroke and a vertical stroke. S. Naoi [2] used this difference to design a specific classifier that uses only the upper parts of 2 and 7. In the value-added tax invoices studied in our research, the digits are printed, not handwritten. At the right end of the bottom horizontal stroke of the digit 2 there is a small vertical stroke, as shown in figure 4. When a digit 2 is superposed on a frame line, this small vertical stroke can be used to tell it apart from 7 and to preserve the horizontal stroke.

After studying all cases, we found a simple way to improve the basic LWTA algorithm with a-priori information. Once superposition is detected, neighboring run-lengths that are shorter than the threshold and lie within the bounding box of a character are merged into segments. If the run-lengths on both sides of a segment are longer than a (larger) threshold, the segment is preserved; otherwise it is removed. Figure 4-(c) shows the improved result: the bottom horizontal strokes of the digits 0, 2 and 8 are all preserved correctly.
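The segment rule of Section 2.2 (merge neighboring short run-lengths inside a character's box into segments, and keep a segment only when the run-lengths on both of its sides are long) can be sketched as follows. The input format, a list of run-lengths ordered along the line for one character box, the function name and the two threshold values are assumptions made for this illustration.

```python
def classify_segments(lengths, threshold, side_threshold):
    """Group consecutive short runs into segments and decide their fate.

    lengths        -- run-lengths of one character box, ordered along the line
    threshold      -- runs shorter than this are frame-line candidates
    side_threshold -- larger value that both neighboring runs must exceed
    Returns a list of (first_index, last_index, preserve) tuples.
    """
    decisions = []
    i, n = 0, len(lengths)
    while i < n:
        if lengths[i] >= threshold:
            i += 1
            continue
        start = i
        while i < n and lengths[i] < threshold:
            i += 1
        left_ok = start > 0 and lengths[start - 1] > side_threshold
        right_ok = i < n and lengths[i] > side_threshold
        decisions.append((start, i - 1, left_ok and right_ok))
    return decisions


# Hypothetical digit 0 whose bottom stroke lies on the frame line:
# long runs at the two vertical strokes, short runs everywhere else.
zero = [3, 3, 12, 12, 3, 3, 3, 3, 12, 12, 3]
zero_decisions = classify_segments(zero, threshold=5, side_threshold=8)

# Hypothetical digit 7 touching the line only with its single vertical stroke.
seven = [3, 3, 3, 12, 12, 3, 3]
seven_decisions = classify_segments(seven, threshold=5, side_threshold=8)
```

In the made-up digit 0, only the middle segment, flanked by the two vertical strokes, is preserved as the bottom stroke; in the digit 7 no segment has long runs on both sides, so the frame line is removed entirely.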
Steps of the algorithm: project the image along the running direction of the frame line. On the smoothed projection histogram, search for the first valley on each side of the frame line. If the two valleys are very close to each other, no overlap is occurring and the frame line is removed completely. If both valleys are far from the frame line, intersection is occurring and the basic LWTA is applied. Otherwise superposition is occurring and the improved algorithm is used. For a line of digits, an overlap that is of the contact type for the digits 1, 4, 7 and 9 is of the superposition type for the digits 0, 2, 3, 5, 6 and 8, so we treat the contact type the same as the superposition type.

(a) Raw image with a digit line superposing a frame line
(b) Line removal result with basic LWTA
(c) Line removal result of LWTA with heuristic a-priori information
Figure 4. Using heuristic a-priori Information to Improve the Basic LWTA

2.3 LWTA with the recognition feedback
When a person separates strokes from frame lines, the whole image is always taken into consideration. Despite the interference of the frame lines, a person can recognize the character (even in some superposition cases) and then judge which pixels belong to strokes according to the whole character image stored in memory. Pixels that do not belong to the character are naturally classified as frame line. Finally, the running direction and line width of the frame line are used to verify that those pixels really belong to it. In other words, in human perception character recognition and character-line separation are carried out at the same time, with feedback between them at every moment. It is rather difficult to imitate this procedure, for two reasons: first, character recognition techniques cannot give correct results when characters are disturbed by frame lines; second, researchers have not found an efficient information fusion technique that combines the information obtained at every step into a final decision. We can only imitate this cognitive procedure with a simple method that uses the feedback of the recognition core.

The details of the algorithm are as follows. Neighboring run-lengths shorter than the threshold within the bounding box of a character are merged into segments; for digits there are ordinarily 2-4 segments. This step is the same as in the approach of section 2.2. Enumerating all segment removal instances yields an array of images, which are sent to the recognition core; the instance with the highest reliability is the final line removal result. In figure 5, the recognition result and its reliability are given at the bottom right corner of each segment removal instance. The instances are listed in order of ascending reliability, so the rightmost instance, with the highest reliability, is the final line removal result.

(a) Segment removal instances of digit 0, recognition result and reliability: 0/0.80, 0/0.97
(b) Segment removal instances of digit 2, recognition result and reliability: 7/0.85, 2/0.95
(c) Segment removal instances of digit 7, recognition result and reliability: 2/0.75, 3/0.80, 2/0.85, 7/0.96
(d) Segment removal instances of digit 4, recognition result and reliability: 1/0.80, 4/0.90, 4/0.93, 1/0.95
Figure 5. Using the Recognition Feedback to Improve the Basic LWTA

A good recognition core must give not only the correct recognition result but also a reasonable reliability. In practice, however, the reliability given by most recognition cores is not accurate. For abnormal input (input that does not belong to any class the recognition core can recognize), most recognition cores cannot give a small reliability. So although some useful information can be obtained, it is difficult to get the correct result in all cases. As shown in figure 5-(d), a frame line crosses the digit 4 and is superposed on its horizontal stroke. Because the recognition core gives the higher reliability of 0.95 to the instance with the stroke removed, the horizontal stroke is removed by mistake.
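The enumeration used by the recognition feedback approach of Section 2.3 can be sketched as follows. The `render` and `recognize` callables are hypothetical stand-ins for building a candidate image and for the recognition core, which the paper does not specify; the toy scores are invented for the demonstration and are not taken from the experiments.

```python
from itertools import product


def best_removal(num_segments, render, recognize):
    """Try every keep/remove combination and return the most reliable one.

    render(keep_flags)  -- hypothetical: builds the candidate character image
    recognize(image)    -- hypothetical: returns (label, reliability)
    Returns (keep_flags, label, reliability).
    """
    best = None
    for keep_flags in product((True, False), repeat=num_segments):
        label, reliability = recognize(render(keep_flags))
        if best is None or reliability > best[2]:
            best = (keep_flags, label, reliability)
    return best


# Toy stand-ins: the "image" is just the tuple of flags, and the recognizer
# is a lookup table with made-up reliabilities for a digit 2 on a frame line.
toy_scores = {
    (True, True): ("2", 0.95),
    (True, False): ("7", 0.85),
    (False, True): ("2", 0.60),
    (False, False): ("7", 0.70),
}
choice = best_removal(2, render=lambda flags: flags, recognize=toy_scores.get)
print(choice)
```

With k segments the loop calls the recognizer 2^k times, which is manageable for the 2-4 segments typical of digits but is the reason this approach is slower than the heuristic rule. As the digit 4 example in figure 5-(d) shows, the result is only as good as the reliabilities the recognizer reports.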
3 Experiments
To test the effectiveness of our approach, we used 793 value-added tax invoice samples, of which 486 have no overlap and 307 have overlap (2 samples with superposition on the top of the character lines, 47 with intersection, and 258 with superposition on the bottom of the character lines). Each sample contains two digit lines that need to be recognized. Figure 6 shows a sample; the two dashed rectangles in its mid-right area mark the two digit lines to be recognized. We compared four frame line removal approaches: 1. complete removal; 2. basic LWTA; 3. LWTA with recognition feedback; 4. LWTA with heuristic a-priori information. The experimental results are listed in table 1.

Figure 6. A Value-Added Tax Invoice Sample

Table 1. Results of 4 Line Removal Approaches
Overlap Type         | Digit Line Num | 1            | 2            | 3            | 4
Superposed on Top    | 4              | 0 (0%)       | 2 (50%)      | 4 (100%)     | 2 (50%)
Superposed on Bottom | 516            | 227 (44.0%)  | 502 (97.3%)  | 504 (97.7%)  | 507 (98.3%)
Intersection         | 94             | 54 (57.4%)   | 90 (95.7%)   | 89 (94.7%)   | 90 (95.7%)
Sum of Overlapped    | 614            | 281 (45.8%)  | 594 (96.7%)  | 597 (97.2%)  | 599 (97.6%)
No Overlap           | 972            | 941 (96.8%)  | 940 (96.7%)  | 940 (96.7%)  | 941 (96.8%)
Total Sum            | 1586           | 1222 (77.0%) | 1534 (96.7%) | 1537 (96.9%) | 1540 (97.1%)

Correct recognition means that all digits in a line, including the decimal point, are recognized correctly. Table 1 lists the number of correctly recognized digit lines under each approach. Removing the frame lines completely greatly degrades the recognition rate: only 45.8%, compared with 96.8% without overlap. Our basic LWTA achieves rather good results: the recognition rate of digit lines overlapping frame lines is 96.7%, almost the same as in the no-overlap case. It satisfies application needs fairly well and can be used when no a-priori information is available, for example when the character type is unknown. Using heuristic a-priori information or the recognition feedback increases the recognition rate further, although the room for improvement is very small. The recognition feedback approach is more generic: if the character type is known, different recognition cores can be called to process Chinese characters, English characters and digits. For digit lines overlapping frame lines, the approach with heuristic a-priori information achieved the highest recognition rate: 95.7% in the intersection case, 97.7% in the superposition case and 96.9% in the no-overlap case. The other merit of this approach is that it does not need to try all segment removal instances and call the recognition core, so it is much faster than the recognition feedback approach.

4 Conclusion
We developed a new algorithm, LWTA (Line Width Thresholding Approach), to remove form frame lines while preserving character strokes. Experiments on value-added tax invoices show the effectiveness of the approach: the recognition rate of overlapped digit lines does not decrease. To cope with digit lines overlapping frame lines, we make use of a-priori information to increase the recognition rate further. How to use a-priori information to cope with English characters and Chinese characters will be studied in future research.
References
[1] Naoi S, Hotta Y, Yabuki M, Asakawa A. Global Interpolation in the Segmentation of Handwritten Characters Overlapping a Border. Proc. of 1st Int. Conf. On Image Processing, 1994:149-153
[2] Naoi S, Yabuki M. Global Interpolation Method II for Handwritten Numbers Overlapping a Border by Automatic Knowledge Acquisition of Overlapped Condition. Proc. of 4th Int. Conf. On Document Analysis & Recognition, Ulm, Germany, 1997:86-90
[3] Rodriguez C, Muguerza J, Navarro M, Zarate A, Martin J L, Perez J M. A Two-stage Classifier for Broken and Blurred Digits in Forms. Proc. 14th Int. Conf. On Pattern Recognition, 1998:1101-1105
[4] Cheriet M, Said J N, Suen C Y. A Formal Model for Document Processing of Business Forms. Proc. of 3rd Int. Conf. On Document Analysis & Recognition, Montreal, Canada, 1995:210-213
[5] Yu B, Jain A K. A Generic System for Form Dropout. IEEE Trans. On Pattern Analysis & Machine Intelligence, 1996, 18(11):1127-1131
[6] Guillevic D, Suen C Y. Cursive Script Recognition: A fast reader scheme. Proc. of 2nd Int. Conf. On Document Analysis & Recognition, Tsukuba, Japan, 1993:311-314
[7] Yoo J Y, Kim M, Han S Y, Kwon Y B. Line Removal and Restoration of Handwritten Characters on the Form Documents. Proc. of 4th Int. Conf. On Document Analysis & Recognition, Ulm, Germany, 1997
[8] Chung Y, Lee K, Yaik J, Lee Y. Extraction and Restoration of Digits Touching or Overlapping Lines. Proc. 13th Int. Conf. On Pattern Recognition, 1996:155-159
[9] Ren K.P. The Extraction of Character Blocks in Form. Proc. of the 7th National Chinese Character Recognition, Kunming, P.R. China, 1999:147-153
[10] Hori O, Doermann D S. Robust Table-form Structure Analysis Based on Box-Driven Reasoning. Proc. of 3rd Int. Conf. On Document Analysis & Recognition, Montreal, Canada, 1995:218-221