MergeLayouts - Overcoming Faulty Segmentations by a Comprehensive Voting of Commercial OCR Devices Stefan Klink1, Thorsten Jäger German Research Center for Artificial Intelligence GmbH P.O. Box 2080, 67608 Kaiserslautern, Germany E-mail: {klink, tjaeger}@dfki.de http://www.dfki.uni-kl.de/~klink/ Phone: (+49 631) 205-3503, Fax: (+49 631) 205-3210
Abstract
In this paper, we present a comprehensive voting approach that takes entire layouts obtained from commercial OCR devices as input. Such a layout comprises segments of three kinds: lines, words, and characters. By combining all attributes of a segment (e.g. recognized text, identified font height, coordinates within the original image, etc.), we attain a "better" layout that represents the original page layout as well as possible. The voting process itself is hierarchically organized, starting with the line segments. For each level, a search tree is spawned and all fellow segments (segments from different layouts which denote the same image area) are established. A heuristic search method is utilized, guided by a similarity measure defined on segments. Deviations in the segmentation, as well as segmentation errors of individual commercial OCR devices, are compensated by an "equalization module".
Keywords: OCR, Voting, Layout Attributes, Page Layout Retention.
1. Author for correspondence

1 Introduction
The combination of several distinct classifiers working on the same classification problem ("voting") has become a widely accepted technique for improving classification results. It is particularly suited to classification problems that can hardly be solved by a single, omnipotent classifier. In the domain of OCR, the voting technique is already integrated into commercially available products like PrimeOCR [1]. In this paper, we focus on voting of OCR results obtained by commercial OCR devices. To the best of our knowledge, all published material on this topic has focused on more or less "isolated" voting approaches, mostly combining classification results of individual word segments or character segments [2]-[9]. We will present a more comprehensive approach, combining entire layouts. Hereafter, a layout is considered to be a structure of line, word, and character segments with certain attributes. Typical attributes of such a text segment are:
• coordinates indicating a segment's position within the original image
• font attributes, e.g. bold, italic, underline, font height
• recognition result(s): the recognized text for the segment's image portion.
Most often, when speaking of OCR, people are aiming at the latter point - the classification of digitized image portions into ASCII character(s). Nevertheless, the other attributes are worth recognizing with high accuracy, too. Thus, why not vote over all attributes, achieving a result that preserves the original page layout as precisely as possible, including a voting of the recognition results? The reliable layout structure obtained offers great benefit for subsequent document analysis steps, e.g. for contentual analysis (cf. [9]). Additionally, layout attributes might also support the identification of matching segments from different OCR devices, improving a voting algorithm solely based on recognition results. To this end, we introduce a search space, where each state represents a certain set of segments to be combined ("fellow" segments). By defining a quality measure on states, standard search algorithms are applicable to obtain an optimal path in the search tree that denotes all fellow segments from the different OCR devices. In the next step, we have to cope with different segmentations among fellow segments obtained from the basic OCR devices (e.g. 'from' - 'fr' 'om' on the word level). To do so, original segments might be merged or split. Hereafter, the modified fellow segments are combined by voting over all attributes, including the recognition results. For this, classic voting techniques, e.g. majority voting, may be applied.
We will not only discuss some theoretical aspects of this approach, but will also present results achieved by a prototypical implementation, proving its suitability, efficiency, and flexibility. In Section 2 we describe our system MergeLayouts from a global point of view. The essential processing steps, planning, equalization, and merging, are described in more detail in Section 3, Section 4, and Section 5, respectively. In Section 6 we present some achieved results in comparison with commercial OCR devices. We end with some concluding remarks in Section 7.
2 Global view
MergeLayouts is designed to combine results obtained from commercial OCR devices, comprising layout information as well as recognition results. The devices are restricted in no way; e.g. we do not rely on predefined zones or preset recognition attributes like language, font, page quality, etc. Thus, the inputs to our voting system are page layouts that are converted to an internal layout format (see [10], pp. 147ff). The output of a commercial OCR device serves as input for this conversion process, where the OCR output should contain as many layout attributes (coordinates, font attributes, etc.) and classification "attributes" (confidence measures, character alternatives, etc.) as possible. Usually, a commercial OCR API can be configured to produce an enhanced proprietary output, e.g. XDOC by Xerox Imaging Systems (cf. [11]). The internal layout is a strict hierarchy of line segments, word segments, and character segments. Hereafter, speaking of "segments of one kind", we refer to a certain level within this hierarchy, e.g. all line segments. Every segment is capable of holding classification results with confidence measures and alternatives as well as layout attributes. Whether a specific segment contains a certain layout attribute, e.g. that text is italic, depends on the capability of the commercial OCR device from which the segment was originally obtained. MergeLayouts combines the layouts obtained by different commercial OCR devices. The output is a new layout consisting of the "combined" segments. By combining (or "merging") segments, we combine recognition hypotheses as well as layout attributes. The combination process itself is top-down, starting with the line segments. Each segment level is processed separately. Within a certain level, we perform three steps. First, the planning module (see Section 3) identifies all fellow segments. Second, the equalization module (see Section 4) modifies the established fellows by splitting large segments into smaller ones. Third, the merge module (see Section 5) combines the modified fellows and produces a new segment containing the combined attributes of the original segments. In figure 1, the global processing steps are depicted for all line segments obtained from three commercial OCR devices.
3 Planning
The goal of the planning module is to find all segments from the different layouts that belong together (fellow segments) and to construct a plan for the succeeding merge module. Once the fellows of all inspected segments are established, the segments in the determined plan are modified, if necessary, by the succeeding equalization module. Intuitively, segments from two or more layouts belong together if they denote the same physical area within the underlying image. The establishment of these fellow segments is a non-trivial task, for the following reasons. Even if there is a one-to-one correspondence between two segments, their coordinates might not be exactly the same. Due to character classification errors, fellow segments might differ in their recognition results. Segmentation errors might cause an n:m correspondence of fellows obtained from two different layouts. A mixture of the former problems might occur, which gets even worse and more complicated when considering more than two layouts.
[Figure 1 shows the three processing steps on an example. Starting point: three initial layouts obtained from three commercial OCR devices, L1 = {r1, r2, ..., r5} (five line segments from OCR1), L2 = {s1, s2, ..., s7}, and L3 = {t1, t2, ..., t6}; e.g. r1: txt: "hi! date:", tl: (23, 10), rb: (1050, 34), bold: 0.1, italic: no; s1: txt: "hi!", tl: (23, 9), rb: (501, 33); s2: txt: "date", tl: (24, 517), rb: (1050, 35); t1: txt: "hil", tl: (23, 9), rb: (501, 33); t2: txt: "date:", tl: (24, 517), rb: (1051, 34) (tl: top-left, rb: right-bottom).
Step 1 (planning): establishing fellow segments:
Z^0 = (T^0, R^0) = ((Ø, Ø, Ø), (L1, L2, L3))
Z_3^1 = (T^1, R^1) = (({r1}, {s1, s2}, {t1, t2}), (L1 - {r1}, L2 - {s1, s2}, L3 - {t1, t2})) ...
Z_2^2 = (T^2, R^2) = (({r2}, {s3}, {t3}), (...)) ... Z_3^k = (T^k, R^k) = (({r5}, {s7}, {t6}), (Ø, Ø, Ø))
NOTE: only states along the optimal path are shown.
Step 2 (equalization): modifying fellow segments (splitting): T = T^1 = ({r1}, {s1, s2}, {t1, t2}) is split at the splitting spot into T' = ({r1'}, {s1}, {t1}) and T'' = ({r1''}, {s2}, {t2}).
Step 3 (merging): executing the optimal plan by combining all modified fellow segments, yielding L = {u1, u2, ..., u6}; e.g. u1 (merged from r1', s1, t1): txt: "hi!", tl: (23, 9), rb: (501, 33); u2 (merged from r1'', s2, t2): txt: "date:", tl: (24, 517), rb: (1050, 34).]
Figure 1: Global processing steps with intermediate results.
For the planning module, the membership of segments to a layout is essential. Thus, we define a tuple which contains the segments separated by their membership to a certain layout.
Definition 1: Let l1, ..., ln be layouts, where each layout is obtained from a different OCR device, and let L1, ..., Ln be the sets of all segments of one kind belonging to the layouts l1, ..., ln. A tuple of segment sets T = (S1, ..., Sn) is an n-tuple of segment sets with S_i ⊆ L_i, ∀i = 1, ..., n.

3.1 Searching for a good plan
Unfortunately, the input contains a large number of segments (e.g. the average number of word segments in the layout of a single-page business letter is approx. 300), and the problem of finding segment fellows is very complex. Our solution is to transform the problem into a classical search problem. We therefore introduce the definition of a planning tree (corresponding to the search tree) and the notion of a plan. Hereby, every node is a problem state and every edge is a transition to a successor state:
Definition 2: Let T and R be tuples of segment sets. A problem state Z = (T, R) is a pair of tuples of segment sets.
Definition 3: Let Z be a problem state and L1, ..., Ln be the sets of all segments of one kind of the layouts l1, ..., ln. Z = (T, R) is called a start state iff T = (∅, ..., ∅) and R = (R1, ..., Rn) with R_i = L_i, ∀i = 1, ..., n. In our case, there exists exactly one start state: Z^0 = ((∅, ..., ∅), (L1, ..., Ln)).
Definition 4: Let Z = (T, R) be a problem state. Z is called a final state iff T = (T1, ..., Tn) with T_i ⊆ L_i, ∀i = 1, ..., n, and R = (∅, ..., ∅).
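Definitions 1-4 map naturally onto simple data structures. The following Python sketch (our own illustration, not part of the original system; all names are ours) encodes a problem state, the start state, and the final-state test:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProblemState:
    """A problem state Z = (T, R): established fellows T and remaining segments R,
    each a tuple holding one segment set per layout."""
    T: tuple[frozenset, ...]
    R: tuple[frozenset, ...]

def start_state(layouts):
    """Z^0 = ((empty, ..., empty), (L1, ..., Ln)) -- Definition 3."""
    n = len(layouts)
    return ProblemState(T=tuple(frozenset() for _ in range(n)),
                        R=tuple(frozenset(layout) for layout in layouts))

def is_final(state):
    """A final state has no remaining segments: R = (empty, ..., empty) -- Definition 4."""
    return all(not r for r in state.R)
```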
Definition 5: Let Z^{k-1} = (T^{k-1}, R^{k-1}) = ((T_1^{k-1}, ..., T_n^{k-1}), (R_1^{k-1}, ..., R_n^{k-1})) and Z^k = (T^k, R^k) = ((T_1^k, ..., T_n^k), (R_1^k, ..., R_n^k)), k > 0, be problem states. A transition of a problem state Z^{k-1} → Z^k is defined with:
1. T_i^k ⊆ R_i^{k-1}, ∀i = 1, ..., n
2. |T_j^k| = 1 for some 1 ≤ j ≤ n
3. Let j be the index from 2. and s be the segment in the set T_j^k. Then it holds:
• ∀s̃ ∈ ∪_{i=1}^{n} T_i^k : area(s) ∩ area(s̃) ≠ ∅
• ∀s̃ ∈ ∪_{i=1}^{n} R_i^k : area(s) ∩ area(s̃) = ∅
4. R_i^k = R_i^{k-1} - T_i^k, ∀i = 1, ..., n
The problem state Z^k is also called a successor state of Z^{k-1}. A problem state Z^{k-1} may have more than one successor state; they will be enumerated with Z_1^k, ..., Z_{m_k}^k. In a transition from Z^{k-1} to Z^k, the established fellows denoting the same physical area are collected in T^k, whereas R^k holds the remaining segments, among which new fellows still have to be established. This process continues until all fellows have been established, resulting in a final state (empty R^k). In figure 1 (step 1), the line segments r1, s1, s2, t1, and t2 are established as fellows in the transition Z^0 → Z_3^1.
Next, we will describe the search process itself. As mentioned above, the planning can be seen as a traverse through a search tree. While traversing, a plan for the succeeding equalization module is constructed, which contains all established fellow segments of the current layouts. For this reason, we also call the search tree a planning tree, which is defined as follows:
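As an illustration, the transition conditions of Definition 5 can be checked programmatically. The following Python sketch is our own (the rectangle representation and all function names are assumptions for illustration); it tests conditions 1-3 for a candidate set of fellows, while condition 4 merely defines the new remainder:

```python
def overlaps(a, b):
    """Axis-aligned rectangle overlap; a rectangle is (left, top, right, bottom)."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def is_valid_transition(T_new, R_prev, area):
    """Check conditions 1-3 of Definition 5 for the candidate fellows T_new
    against the previous remainder R_prev; `area` maps a segment to its box."""
    n = len(R_prev)
    # 1. each T_i must be drawn from the previous remainder R_i
    if any(not T_new[i] <= R_prev[i] for i in range(n)):
        return False
    # 2. some layout contributes exactly one 'seed' segment s
    seeds = [next(iter(t)) for t in T_new if len(t) == 1]
    if not seeds:
        return False
    s = seeds[0]
    fellows = set().union(*T_new)
    remainder = set().union(*(R_prev[i] - T_new[i] for i in range(n)))
    # 3. s must overlap every fellow and no remaining segment
    return (all(overlaps(area[s], area[t]) for t in fellows) and
            not any(overlaps(area[s], area[r]) for r in remainder))
```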
Definition 6: A planning tree is a directed search tree in which every node is a problem state and every edge is a transition of a problem state to a successor state. A node is a start state iff it is the root of the planning tree and a final state iff it is a leaf of the planning tree.
Definition 7: Let Z be a problem state and L1, ..., Ln be the sets of all segments of one kind of the layouts l1, ..., ln. Let P be the associated planning tree. A plan is the sequence of problem states Z^k, k = 0, ..., m, along a path in the planning tree from the root (start state Z^0) to a leaf (final state Z^m). A subplan is the sequence of problem states Z^k, k = 0, ..., t, along a path in the planning tree from the root (start state Z^0) to an arbitrary state Z^t.
In figure 1 (step 1), P = (Z^0, Z_3^1, Z_2^2) is a subplan.
3.2 The heuristic search in the planning tree
As mentioned above, the planning module searches for an optimal plan. Therefore, the qualities of all plans are estimated, and the 'best' plan is taken as the solution of our search problem. To reduce the cost of finding an optimal plan, MergeLayouts utilizes a beam search algorithm (see [15], pp. 73ff), which finds a good plan without generating all plans and comparing them. The algorithm makes use of a so-called heuristic function, which defines the order of the expansion of the planning tree. The function estimates the quality of complete plans as well as the quality of subplans. This allows a (sub-)plan to be assessed during the search progression without completing the whole path to a leaf node. For further details refer to [10].
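A minimal sketch of the beam search idea in Python (generic; the actual heuristic function and state representation of MergeLayouts are described in [10] and are not reproduced here):

```python
import heapq

def beam_search(start, successors, quality, is_final, beam_width=5):
    """Generic beam search: at each depth, keep only the `beam_width` best
    subplans, ranked by the heuristic `quality` of their last state.
    Returns the first complete plan found, as a list of states."""
    beam = [[start]]
    while beam:
        candidates = []
        for plan in beam:
            for nxt in successors(plan[-1]):
                new_plan = plan + [nxt]
                if is_final(nxt):
                    return new_plan
                candidates.append(new_plan)
        # prune: keep only the beam_width most promising subplans
        beam = heapq.nlargest(beam_width, candidates,
                              key=lambda p: quality(p[-1]))
    return None
```

Usage on a toy problem: `beam_search(0, lambda n: [n + 1, n + 2] if n < 5 else [], lambda n: n, lambda n: n >= 5)` returns a short plan ending in a final state.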
4 The equalization module
One shortcoming of the planning module is that the generated plan might not be optimal for further processing. This is due to the following limitations. First, successor states have to obey
definition 5, points 2 and 3, which restrict the number of allowed transitions. Thus, not all possible combinations of segments are evaluated. Second, the planning module does not repair any segmentation errors caused by one of the utilized OCR devices. In the overwhelming majority of cases, the first limitation does not cause the planning module to neglect the optimal path. The latter limitation is compensated by the equalization module, which modifies a plan in some details before it is executed by the merge module.
4.1 The combining behaviour of the planning module
Due to definition 5, point 3, a segment s will be combined with all overlapping segments s̃ into a new tuple in a successor state. Thus, in some cases, the planning module combines too many segments. The resulting tuples have to be inspected, and all segments of the tuples have to be verified as to whether they represent the desired single segment. The more layouts are processed, the more often this behaviour can be observed. This problem is of a structural nature. It arises from the layouts obtained by the OCR devices, because the planning module does not modify the obtained segments but rather arranges them into separate tuples. If an OCR device fails in segmentation, the planning module cannot compensate this error by a search strategy. The equalization module verifies the correctness of the given segments within a tuple and compensates segmentation errors.
4.2 The directed splitting
4.2.1 Determining the splitting spot
Note that all segments of the same device have to be sorted in their natural reading order, i.e. from left to right. A splitting spot in a tuple is found by a majority voting over all spacings of all OCR devices in the tuple. If the majority detects a split in the same region, then
the tuple will be split into two tuples: T' to the left of the spacing and T'' to the right of the spacing. In figure 1 (step 2), the second and third OCR devices have detected two segments, while the first OCR device contributes a single line segment r1. r1 overlaps the segments s1, s2 and t1, t2, and they are all planned as fellow segments into the same tuple. Obviously, the splitting spot is between s1 and s2, and between t1 and t2, respectively.
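The majority voting over spacings could be sketched as follows. This is a simplified illustration under our own assumptions (segments are reduced to horizontal extents, and gaps agreeing within a tolerance are clustered); the actual implementation is described in [10]:

```python
def find_split_spot(tuple_segments, gap_tolerance=5):
    """Majority vote over inter-segment gaps: each OCR device whose segments
    in the tuple leave a horizontal gap 'votes' for that x-region. If a
    majority of devices agree on a region, its centre is the splitting spot.
    `tuple_segments` is a list (one entry per device) of lists of
    (left, right) x-extents sorted in reading order."""
    votes = []
    for segs in tuple_segments:
        for (l1, r1), (l2, r2) in zip(segs, segs[1:]):
            votes.append((r1, l2))  # gap between consecutive segments
    votes.sort()
    best, cluster = [], []
    for gap in votes:
        if cluster and gap[0] - cluster[-1][0] > gap_tolerance:
            cluster = []            # gap too far from cluster: start a new one
        cluster = cluster + [gap]
        if len(cluster) > len(best):
            best = cluster
    if len(best) * 2 > len(tuple_segments):  # majority of devices agree
        lefts, rights = zip(*best)
        return (min(lefts) + max(rights)) / 2
    return None
```

With the extents from figure 1 (OCR1 contributes one long segment, the other two devices a gap between x=501 and x=517), the spot lands in the middle of the agreed gap.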
4.2.2 The split of a tuple
For the splitting algorithm, two cases are important:
1. In the simple case, the splitting spot does not intersect a segment obtained from the OCR device. All segments to the left of the splitting spot are arranged in the first tuple and all segments to the right of the splitting spot are arranged in the next tuple. In this case, we only have a new arrangement of the segments into the two tuples T' and T'' (see figure 2, 1.).
2. In the second case, a segment given by the OCR device has to be broken into two segments. These two new segments can then be arranged into the two tuples T' and T'', respectively (see figure 2, 2.).
Figure 2: Splitting fellow segments. For simplicity, only segments of one OCR device are depicted: 1. the splitting spot does not intersect a segment, so the segments are merely rearranged into the tuples T' and T''; 2. the splitting spot intersects a segment s, which is split into s' and s'' before being arranged into T' and T''.
Breaking a segment into two is a non-trivial task. First, a suitable splitting spot has to be determined. Second, the segment has to be split into two segments s' and s''. Hereafter, all constituting segments have to be rearranged into the new segments s' and s''. For instance, if a line segment is split, all word and character segments constituting the line have to be rearranged into the "left" tuple T' or into the "right" tuple T'', according to their coordinates. In the case of character segments, the segment cannot be broken and is arranged, by analogy with the first ("simple") case, into the corresponding tuple.
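A simplified sketch of such a split, with segments reduced to dictionaries (our illustration; the real system splits full segment structures with all attributes):

```python
def split_segment(segment, spot):
    """Split a line segment at horizontal position `spot`: constituent
    (word/character) segments are rearranged into a 'left' and a 'right'
    part according to their coordinates. A segment is a dict with
    'left'/'right' extents and a list of 'children'."""
    left = {"left": segment["left"], "right": spot, "children": []}
    right = {"left": spot, "right": segment["right"], "children": []}
    for child in segment["children"]:
        # assign by the child's centre, so a child straddling the spot
        # still lands on exactly one side
        centre = (child["left"] + child["right"]) / 2
        (left if centre < spot else right)["children"].append(child)
    return left, right
```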
4.3 Specialisation to character segments
Different kinds of recognizers are suited to different subsets of attributes. Thus, every OCR device might contribute information that no other device does. For example, one OCR device might only contribute the recognized text and its font type, whereas another device delivers the recognized text, the attributes bold and italic, and the text height.
4.3.1 Problems with missing coordinates
In this context, the absence of coordinates for character segments is a severe problem which affects all processing modules. It is solved by an interpolation method. But this method causes new problems, especially with documents written in a proportional font:
Example 1: The character segments are assigned the following interpolated horizontal coordinates; the vertical coordinates are adopted from the word segment:
Figure 3: Interpolated coordinates.
Obviously, the segments are misplaced. The first character 'M' extends far into the
second segment and pushes the 'i' nearly into the third. If another OCR device supplies the coordinates exactly, the planning module could be confused and forced to construct wrong tuples of segment sets.
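The interpolation method can be sketched as follows (a naive linear scheme of our own choosing, for illustration). With a proportional font, the wide 'M' and the narrow 'i' receive equal widths, producing exactly the misplacement discussed above:

```python
def interpolate_char_coords(word_left, word_right, text):
    """Linear interpolation of character x-extents across a word box, as a
    device without character coordinates might do. Every character gets the
    same width, regardless of its true width in a proportional font."""
    width = (word_right - word_left) / len(text)
    return [(word_left + i * width, word_left + (i + 1) * width)
            for i in range(len(text))]
```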
4.3.2 Shifting of misplaced segments
With the following example, we want to elucidate the problem of misplaced segments arising from interpolated coordinates:
Example 2: All three OCR devices classify the first character segment as 'M'. Because of the interpolated coordinates, the 'i' of the first OCR device is 'planned into' the tuple T1 instead of T2 and is 'missing' in the next tuple T2.

Figure 4: Misplaced segment in neighbouring tuples. The word is recognized as 'M' 'i' 't' 't' 'e' by OCR 1 and OCR 2 and as 'M' 'l' 't' 't' 'e' by OCR 3; the character segments are arranged into the tuples T1 to T5, with the 'i' of OCR 1 falling into T1 instead of T2.
In this case, the succeeding merge module is affected in the following way:
• The tuple T1 holds two character segments of the first OCR device. The merge module will combine these two into a single one with independent text recognition results and averaged coordinates.
• The tuple T2 only holds the 'i' of the second and the 'l' of the third OCR device. Now, no majority can be determined by voting over the character recognition results.
To avoid this behaviour, a strategy is implemented in the equalization module which discovers misplaced segments by means of the recognized text and shifts them into the 'correct' tuple.
Definition 8: Let T = (S1, ..., Sn) be a tuple of segment sets and ŝ ∉ T be a segment. Let e(T) be the number of OCR devices which occur in T and let t(ŝ, T) be the number of segments s ∈ T whose recognized text is equal to the recognized text of ŝ. The membership criterion for ŝ and T is fulfilled iff t(ŝ, T) ≥ 1/2 · e(T).
Example 3: Let T1 to T5 be the tuples of example 2. Then e(T1) = e(T3) = e(T4) = e(T5) = 3 and e(T2) = 2. Let ŝ1 be the 'M' and ŝ2 the 'i' of OCR 1, ŝ3 the 'M' of OCR 2, and ŝ4 the 'M' of OCR 3, each in tuple T1. Then t(ŝ1, T2) = 0, t(ŝ2, T2) = 1, t(ŝ3, T2) = 0, and t(ŝ4, T2) = 0. The membership criterion is fulfilled only for t(ŝ2, T2).
Now we are able to formulate the strategy for misplaced segments: For all tuples of segment sets Ti, i = 1 to n, do: If Ti does not hold exactly one segment of each OCR device, then examine the tuple Ti+1 to the right with the membership criterion, checking whether there exists a misplaced segment which belongs in the tuple Ti, or, respectively, whether a 'missing' segment is wrongly arranged into tuple Ti. If the membership criterion t(ŝ, Ti) with ŝ ∈ Ti+1 is fulfilled, then shift the segment from Ti+1 to Ti; or, respectively, if the membership criterion t(ŝ, Ti+1) with ŝ ∈ Ti is fulfilled, then shift the segment from Ti to Ti+1.
Example 4: Let T1 to T5 be the tuples of example 2. With the strategy above, the misplaced segment 'i' in tuple T1 will be shifted into the correct tuple T2, and a majority voting will produce the correct result 'i'.
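The membership criterion and the rightward half of the shifting strategy can be sketched as follows (Python, our own illustration; segments are reduced to (device, text) pairs, and all names are ours):

```python
def membership(s_hat_text, tupel):
    """Definition 8: the criterion holds if at least half of the OCR
    devices occurring in the tuple recognised the same text.
    `tupel` is a list of (device_id, text) pairs."""
    e = len({dev for dev, _ in tupel})                  # devices occurring in T
    t = sum(1 for _, txt in tupel if txt == s_hat_text)
    return t >= 0.5 * e

def shift_misplaced(tuples, n_devices):
    """Shift a segment into the right-hand neighbour tuple when the
    membership criterion says it belongs there (simplified: only the
    T_i -> T_{i+1} direction of the strategy is shown)."""
    for i in range(len(tuples) - 1):
        if len(tuples[i]) != n_devices:       # not one segment per device
            for seg in list(tuples[i]):
                if membership(seg[1], tuples[i + 1]):
                    tuples[i].remove(seg)
                    tuples[i + 1].append(seg)
    return tuples
```

On the data of example 2, the 'i' of the first device moves from T1 to T2, restoring a clear majority in T2.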
5 The merge module
The merge module executes the plan generated by the planning module and modified by the equalization module. In this context, to "execute a plan" means to merge all segments within a tuple T into a single one. Doing so, all
information about the segments is combined with suitable voting mechanisms. The result of this process is depicted in figure 1 (step 3). The input segments are shown at the top of the figure.
5.1 Voting over all segments in a tuple T
Various methods exist to extract a segment's layout attributes and to achieve reliable classification results. Each one has its pros and cons, and in practice a large number of attributes cooperate in the text recognition. In MergeLayouts, all attributes are taken into account (for example recognized text, coordinates, font type, character height, boldness, super-/subscript, etc.) and are combined2. Unfortunately, it is not possible to combine every attribute with the same method because of their different representations (cf. [5]). For each representation, we have to define an individual method. In MergeLayouts, we have three of them, which are described below:
1. Binary decisions ("yes" or "no"3) are combined by majority voting, obtaining the most frequent value as result [2] - see attribute italic in figure 1.
2. Numerical values ("42") are combined by calculating a kind of average (e.g. arithmetic mean or median) which represents all input values best (cf. [10], pp. 84ff) - see attributes tl, rb in figure 1.
3. Classification results obtained as rankings, as well as results obtained as a subset of class labels with measurements, are combined on the rank level. For the latter, a ranking is obtained by ordering all classes according to their measure. The combination itself is done by applying the Borda Count method (cf. [4], pp. 53ff).

2. A detailed description of all considered attributes can be found in [10], pp. 149ff.
3. The answer might also be "don't know" if a reliable classification is impossible.
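The three combination methods can be sketched as follows (Python, our own illustration; the actual attribute representations in MergeLayouts are richer, and all function names are ours):

```python
from collections import Counter
from statistics import median

def vote_binary(values):
    """Majority voting over yes/no decisions; 'don't know' entries are ignored."""
    votes = Counter(v for v in values if v in ("yes", "no"))
    return votes.most_common(1)[0][0] if votes else "don't know"

def vote_numeric(values):
    """Numeric attributes (coordinates, font height) combined by a robust
    average; the median resists single outliers."""
    return median(values)

def borda_count(rankings):
    """Borda count over rankings: a class at rank r in a ranking of
    length m receives m - r - 1 points; the highest total wins."""
    scores = Counter()
    for ranking in rankings:
        m = len(ranking)
        for r, label in enumerate(ranking):
            scores[label] += m - r - 1
    return scores.most_common(1)[0][0]
```

For instance, `borda_count([["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"]])` awards a = 5, b = 3, c = 1 points, so "a" wins.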
6 Test results
6.1 Comparison of the accuracy with three commercial OCR devices
In this section, we describe a comparison of MergeLayouts with three commercial OCR devices on 22 business documents and on 20 facsimiles. Since the preparation of test data for an automatic evaluation of the page layout retention is a tedious and time-consuming task, we focused on the OCR accuracy as the single measure of comparison.
6.1.1 Comparison on business documents
For this comparison, 22 German business documents were processed in the following way:
• The initial layout structures, including the recognized text of each document, were obtained from the three OCR devices Recore [12], ScanWorX4 [13], and Easyreader [14], which received the same scanned 300 dpi document page as input.
• These three layout structures were processed by MergeLayouts, which combined them into a fourth layout structure.
• The recognized text within each layout structure of each document was extracted and written into a plain text file, which was compared with the ground truth text file5; the absolute number of recognition errors and the accuracy of the recognition result were determined.
For this comparison, we only regard zones with textual information and exclude graphical regions and signatures.
4. Xis is used hereafter as a synonym for ScanWorX. 5. Preparation of ground truth data was in close cooperation with ISRI, following their guidelines for preparing accurate ground truth data (cf. [17]).
Definition 9: Let n be the number of all characters in the ground truth file and let e be the number of errors based on the operations insert, delete, and substitute. The accuracy is defined as acc = (n - e) / n (cf. [17], p. 13).
The results of the comparison of MergeLayouts with the three commercial OCR devices with reference to the accuracy are shown in figure 5.
Figure 5: Comparison of MergeLayouts with 3 commercial OCR devices for scanned business documents.
Obviously, the business documents are very good-natured, with clean image copies. The overall accuracy of the recognition results for Easyreader is 98.18%, for Recore 97.28%, and for ScanWorX 98.52%. Nevertheless, with MergeLayouts we achieve considerable improvements: our overall accuracy is 99.45%.
Definition 10: Let e_x be the number of errors of the recognizer x. The relative error reduction is defined as Δr e_{x,y} = (e_x - e_y) / e_x.
Our relative error reduction with reference to the best recognizer ScanWorX is Δr e_{S,M} = 63%. With reference to Recore we achieve Δr e_{R,M} = 80%.
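Definitions 9 and 10 are straightforward to compute. The following Python sketch also reproduces the reported reductions from the overall accuracies (taking error counts proportional to 1 - accuracy, which holds when the same ground truth is used for all recognizers):

```python
def accuracy(n, e):
    """Definition 9: acc = (n - e) / n for n ground-truth characters and
    e errors (insertions, deletions, substitutions)."""
    return (n - e) / n

def relative_error_reduction(e_x, e_y):
    """Definition 10: relative error reduction of recognizer y over x."""
    return (e_x - e_y) / e_x

# Error counts proportional to 1 - accuracy (same ground truth throughout):
e_scanworx, e_recore, e_merge = 1 - 0.9852, 1 - 0.9728, 1 - 0.9945
reduction_s = relative_error_reduction(e_scanworx, e_merge)  # ~63%
reduction_r = relative_error_reduction(e_recore, e_merge)    # ~80%
```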
MergeLayouts is not only considerably better in the overall accuracy than any single recognizer; in 19 out of the 22 documents it also outperforms the best recognizer for that specific document. In terms of error reduction, MergeLayouts commits 37% fewer errors than the respective best OCR device and 84.5% fewer errors than the respective worst OCR device.
6.1.2 Comparison on facsimiles
For this comparison, 10 German and 10 English facsimiles of various kinds (offer, order, invoice, etc.) were processed in the same way as the business documents. The results of the comparison are shown in figure 6.
Figure 6: Comparison of MergeLayouts with 3 commercial OCR devices for facsimiles.
Again, MergeLayouts considerably improves the recognition results. The overall accuracy for Easyreader is 88.7%, for Recore 86.8%, and for ScanWorX 90.2%. With MergeLayouts we achieve 92%. The relative error reduction with reference to the best recognizer ScanWorX is ∆ r e S, M = 18.3% . With reference to Recore we achieve ∆ r e R, M = 40% . Even on facsimiles,
in 16 out of the 20 documents MergeLayouts outperforms the best recognizer for that specific document. Overall, MergeLayouts commits 18% fewer errors than the respective best OCR device and 48% fewer errors than the respective worst OCR device. Generally, it can be observed that the worse the accuracy of the OCR devices, the fewer errors can be corrected by a voting on segmentation and classification. With the following three examples, we want to elucidate some reasons why the facsimiles are so difficult to recognize and what kinds of problems might occur:
1. Due to the low resolution of the facsimiles (204x98 dpi) compared to a scanned image (300x300 dpi), the bit images of the characters are very hard to recognize. The lower resolution results in a considerably lower accuracy of all recognizers. This holds for all facsimiles and is a general problem. In the snippet in figure 7, this deficiency can be seen very clearly.

Figure 7: Low quality image of a faxed page.

2. Due to the low image quality, the recognition results on smaller fonts get even worse. If a document contains small fonts, it is scarcely possible for an OCR device to achieve a reliable recognition result (see the original-sized footnote in figure 8).

Figure 8: Unreadable small font appearing in a footnote of a page.
3. Further problems for present OCR devices are graphics or words in different font heights next to the text to be recognized. Often, the graphical objects are not recognized as non-text objects, and the text next to them is segmented in a wrong way. The snippet in figure 9 shows the faulty segmentation of a headline.

Figure 9: Intermixed text and graphic portions cause segmentation errors for text lines.

The stamp "10% Discount" on the left leads to two segmentation errors: first, some of the line segments are extended too far to the left, and second, the heights of some lines are completely wrong. Due to such segmentation errors of some commercial OCR devices, it is impossible for the planning module to guarantee a correct establishment of fellows for all segments.
7 Conclusion
A comprehensive voting approach has been presented, combining entire layouts obtained from commercial OCR devices. For each OCR device, its proprietary output format (e.g. XDOC) is converted into an internal layout hierarchy, consisting of line segments, word segments, and character segments. Every segment comprises recognition result(s), coordinates, and font information. As output, we produce a new, more reliable layout structure, preserving the original page layout as well as possible. Doing so, we take into account all of the aforementioned attributes and establish so-called "fellow segments".
The general voting process is top-down, processing each segment level, starting with the line segments. For each level, we perform three steps. First, a planning tree is constructed, representing a search space where each state describes a possible establishment of fellow segments. A state's quality relies on a similarity measure for segments of one kind. The similarity itself is based on a segment's recognition result and its coordinates and may easily be modified to consider various other attributes. Within this search space, a "good" path to a final state is found by utilizing the heuristic beam search algorithm. In the second step, the plan, which is fully described by the states along the optimal path, is modified in order to cope with deviations in the segmentations obtained by the commercial OCR devices. For every state within the plan, all fellow segments have to be rearranged in order to obtain the most probable number (and size) of segments reflecting the original page layout. This is done by merging and splitting the original segments obtained from the basic devices. In the third step, the modified plan is executed: all established fellow segments are combined by combining their attributes. To do so, classic voting techniques, e.g. majority voting, are utilized. To evaluate the presented approach, we focus on a single attribute, the recognition hypothesis of each combined segment. The well-known character accuracy is determined for all utilized OCR devices as well as for the presented voting approach. Comparing the results on two test sets (scanned business documents and facsimiles), MergeLayouts achieves a much higher recognition accuracy (scanned documents: 99.45%, facsimiles: 92%) than the best commercial OCR device (scanned documents: 98.52%, facsimiles: 90.2%). In terms of error reduction, we reduce the number of errors by 63% for scanned documents and by 18.3% for facsimiles, compared to the best commercial OCR device.
References
[1] Prime Recognition: PrimeOCR™, Access Kit Guide, Version 2.50; San Carlos, CA, 1996.
[2] Lei Xu, Adam Krzyzak, Ching Y. Suen: Methods of Combining Multiple Classifiers and Their Applications to Handwriting Recognition; IEEE Transactions on Systems, Man, and Cybernetics, Vol. 22, No. 3, May/June 1992, pp. 418-435.
[3] Jürgen Franke, Eberhard Mandler: A Comparison of Two Approaches for Combining the Votes of Cooperating Classifiers; 11th IAPR 1992, The Hague, The Netherlands, pp. 611-614.
[4] Tin Kam Ho: A Theory of Multiple Classifier Systems and Its Application to Visual Word Recognition; Doctoral Dissertation, Department of Computer Science, State University of New York at Buffalo; Buffalo, New York, May 1992.
[5] Tin Kam Ho, Jonathan J. Hull, Sargur N. Srihari: Decision Combination in Multiple Classifier Systems; IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 16, No. 1, January 1994, pp. 66-75.
[6] Y. S. Huang, Ching Y. Suen: Combination of Multiple Classifiers with Measurement Values; Proceedings of the 2nd ICDAR, Japan, October 1993, pp. 598-601.
[7] Xiaoning Ling, W. G. Rudd: Combining Opinions from Several Experts; Applied Artificial Intelligence, Vol. 3, 1989, pp. 439-452.
[8] Thorsten Jäger, Frank Hönes, Andreas Dengel: An Adaptive Metaclassifier for Word Recognition Based on Multiple Independent Classifiers; 4th SDAIR, April 1995, Las Vegas, Nevada, pp. 399-412.
[9] Thorsten Jäger: OCR and Voting Shell Fulfilling Specific Text Analysis Requirements; 5th SDAIR, April 1996, Las Vegas, Nevada, pp. 287-302.
[10] Stefan Klink: Entwurf, Implementierung und Vergleich von Algorithmen zum Merge von Segmentierungs- und Klassifikationsergebnissen unterschiedlicher OCR-Systeme; Master Thesis, DFKI, Kaiserslautern, Germany, October 1997.
[11] Xerox Corporation: XDOC Data Format: Technical Specification, Version 3.0; Peabody, Massachusetts, 1995.
[12] Ocron Inc.: Recore Developer's Guide V 2.0; Santa Clara, 1993.
[13] Xerox Imaging Systems: ScanWorX API, Programmer's Guide; Peabody, Massachusetts, 1993.
[14] Mimétics S.A.: Easy Reader API V 2.1, User Manual & Reference Manual; Chatenay-Malabry Cedex, France, October 1996.
[15] Elaine Rich, Kevin Knight: Artificial Intelligence; 2nd edition, McGraw-Hill, Inc., 1991.
[16] Nils J. Nilsson: Problem-Solving Methods in Artificial Intelligence; McGraw-Hill, Inc., 1974.
[17] Stephen V. Rice, Junichi Kanai, Thomas A. Nartker: The Third Annual Test of OCR Accuracy; Annual Research Report, Information Science Research Institute (ISRI), University of Nevada, Las Vegas, USA, 1994.