Message Extraction from Printed Documents — A Complete Solution —

Stephan Baumann, Majdi Ben Hadj Ali, Andreas Dengel, Thorsten Jäger, Michael Malburg, Achim Weigel, Claudia Wenzel
German Research Center for Artificial Intelligence (DFKI GmbH)
P.O. Box 20 80, D-6750 Kaiserslautern, Germany

Abstract

The task to be solved within our core research was the design and development of a document analysis toolbox covering typical document analysis tasks such as document understanding, information extraction and text recognition. In order to prove the feasibility of our concepts, we have developed the prototypical analysis system OfficeMAID. The system analyzes documents as used in the daily work of a purchasing department, drawing on a-priori knowledge about workflows and document features. In this way the system provides goal-directed information extraction, shallow understanding and process identification for given documents (paper, fax, e-mail).
This work has been supported by a grant from the BMBF (ITW 9702).

1 Introduction

Generally, printed documents are generated neither for scanning and automatic processing nor for easy integration into electronic workflows. It is therefore hard to transform them adequately for further processing by electronic means. This is the reason why document management systems (DMS) in many cases offer only limited functionality for image capturing (and OCR). In order to work with the resulting electronic document, this passive representation must be enriched by cumbersome manual indexing and archiving. This was the starting point for our work on document analysis and understanding (DAU) techniques, which intend to transform not only the iconic view of a document but also the information captured in its structure, its text, and in its message and role when related to other documents. Several recently published systems pursue similar goals for different applications. Nagy et al., for example, proposed a system for analyzing and labeling technical journals [18]. The system described by Mogg-Schneider and Augmuth addresses the problem of extracting information from tax assessments [15]. The application of transforming technical papers into SGML is presented by Spitz [17]. For the analysis of office documents, especially business letters, various systems show promising results [3], [5], [6], [11].

In this paper, we present the new prototype of the OfficeMAID system [7], whose name is an acronym for Mail Analysis, Interpretation and Delivery. The system analyzes the scanned mail in the purchasing department of a real company, extracts the relevant information, and identifies the respective workflow. Fig. 1 shows an example of a typical document to be processed by OfficeMAID. Due to the limited number of pages and the fact that this paper is a systems paper, we will not go into the technical details of the individual methods proposed, but rather refer to other publications of this group and to the demonstration we will give at this conference.
Figure 1: Typical binary document image received at the purchasing department, scanned at a resolution of 300 dpi.

The paper is structured as follows: In Section 2 we describe the application scenario, and Section 3 is concerned with the individual processing steps. In Section 4, experimental results are given, and a conclusion with an outlook on future work is sketched in Section 5.
2 Application Scenario

By cooperating with the purchasing department of the University of Kaiserslautern we were able to establish an ideal environment for our research: (1) we had access to a large amount of real-world documents as contained in business processes, (2) we could explore a-priori knowledge about the process and document features by interviewing the purchasing experts, and (3) we got feedback about the benefits of our application-oriented DAU prototype. In this section we sketch the handling of a purchasing process by a clerk using our system in his daily work.
A purchasing procedure consists of four phases: demanding, inquiring, purchasing, and paying. An initial demand may easily be distributed over the university LAN to the purchasing department using existing HTML documents and WWW browsers as clients. As a consequence, the purchasing department generates a new workflow instance. A human actor chooses a subset of the existing supplier/product DB and subsequently triggers an automated inquiry generation and sending to the suppliers. Corresponding offers, confirmations of order, delivery notes and invoices are sent to the purchasing department by fax or mail. According to their contents, such as address, product, message type, and date of creation, these documents have to be categorized and relevant information portions should be extracted. These actions are performed automatically by the OfficeMAID system. Relying on the extracted data, the documents are linked automatically to the appropriate workflow instances.

3 Processing Steps

Figure 2: Processing Model of OfficeMAID. (The figure shows the three phases operating on the scanned TIFF image: structure analysis, comprising image filtering, CC analysis, skew elimination, page segmentation and logical labeling, yielding connected components, layout segments and logical objects; text recognition, comprising OCR1, OCR2, dictionary look-up and voting, yielding characters and words; and information extraction, comprising morphological analysis, text/layout based classification, text/phrase based classification, text parsing, text pattern matching, voting and process identification, yielding the message type, the message information and the process.)

As shown in Fig. 2, OfficeMAID is based on a processing model with three phases: structure analysis, text recognition, and information extraction. The individual tasks are discussed in the following sections.

3.1 Structure Analysis

Structure analysis establishes the layout and logical structure of the given input image. It first applies a connected component analysis as well as image preprocessing steps. Afterwards, the layout hierarchy is constructed and logical labels are attached to individual layout segments.

3.1.1 Connected Component Analysis

CC analysis is performed by SPRLC (single-pass contour-line coding) [14]. The underlying method generates a hierarchy of four-connected black or eight-connected white components. For analysis, the image is considered as a hierarchy of several layers in which the lowest layer corresponds to the document page and the highest one to single connected components.

3.1.2 Image Filtering

If the SPRLC specialist finds background texture or noise in the current document image, the image is enhanced using a simple and very efficient technique for background noise detection and cleaning in document images [2]. It detects the image regions where background noise or shading degrades or disturbs the character images. This is achieved by dividing the image into small sub-images (tiles), computing for each non-empty tile a feature vector representing the black-pixel neighborhood, and classifying each tile as clean or noisy using a multi-layered neural net with the computed feature vector as net input. After that, NoDel cleans the located "noisy" regions by applying a standard noise cleaning method, the morphological opening operation. In this way image processing time is saved and the quality of the "clean" text regions is preserved.

3.1.3 Skew Elimination

Based on the connected component hierarchy, the angles between every connected component and its neighbors can easily be computed. The best possible unique skew is then computed from an angle histogram, applying knowledge-guided techniques. The document skew is finally corrected by applying an indirect affine rotation algorithm.

3.1.4 Page Segmentation

LEOPARD [9] is a robust and effective algorithm for layout extraction of mixed-mode documents. Black and white connected components represent the basic elements which are classified and segmented. As output, non-text components are collected in a separate block, whereas text elements are grouped as lines, words and characters. There is no text recognition in the entire analysis; the classification only considers rough object features and relations to neighboring objects. The algorithm is highly reliable for documents that conform to a few constraints on their characteristics. The approach has several advantages over other known techniques: it handles normal as well as inverse text, accepts characters in various font styles and sizes, is independent of text orientation, and allows the processing of characters that consist of more than one connected component.
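Purely as an illustration of the grouping performed during page segmentation, the following sketch collects character-level component boxes into text lines. This is our own simplified illustration, not the actual LEOPARD algorithm; the function name, box format and tolerance parameter are assumptions.

```python
def group_into_lines(boxes, y_tolerance=0.6):
    """Group character-level component boxes (x, y, w, h) into text lines.

    A box joins an open line if its vertical centre lies within
    y_tolerance times the height of that line's first box; otherwise it
    opens a new line.  Lines are sorted top-to-bottom, members
    left-to-right.  (Illustrative only; not the LEOPARD algorithm.)
    """
    lines = []  # each line is a list of boxes
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        cy = box[1] + box[3] / 2.0
        for line in lines:
            ref = line[0]
            ref_cy = ref[1] + ref[3] / 2.0
            if abs(cy - ref_cy) <= y_tolerance * ref[3]:
                line.append(box)
                break
        else:
            lines.append([box])
    for line in lines:
        line.sort(key=lambda b: b[0])   # left-to-right within a line
    lines.sort(key=lambda l: l[0][1])   # top-to-bottom
    return lines
```

A real segmenter would additionally split lines into words via horizontal gap analysis and separate text from non-text components first.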
3.1.5 Logical Labeling

For locating and interpreting the logical objects of a document, we use the system FormClas [8], which employs neither text fragments nor keyword information but learns the typical layout of documents. The system is trained by presenting it with examples of empty forms of typical office business documents. It scans the form images to identify characteristic layout constraints, which are taken as the basis for defining appropriate reference patterns. Later, during classification, these reference patterns are matched against an unknown form in order to classify the type of form and consequently to derive the regions of interest possibly capturing the logical objects. For results, we refer to [8].
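FormClas itself is described in [8]; purely as an illustration of the idea of OCR-free, layout-based form classification, a reference-pattern match could look like the sketch below. The box representation, the IoU scoring and all names are our assumptions, not the FormClas implementation.

```python
def classify_form(regions, reference_patterns):
    """Classify a form by matching observed layout regions against
    learned reference patterns; no text recognition is involved.

    regions:            list of (x, y, w, h) boxes found on the page.
    reference_patterns: dict form_type -> list of reference boxes.
    The form type whose reference boxes overlap the observed regions
    best (mean best intersection-over-union) wins.
    """
    def iou(a, b):
        ax, ay, aw, ah = a
        bx, by, bw, bh = b
        ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
        iy = max(0, min(ay + ah, by + bh) - max(ay, by))
        inter = ix * iy
        union = aw * ah + bw * bh - inter
        return inter / union if union else 0.0

    def score(pattern):
        # each reference box is matched with its best-overlapping region
        return sum(max((iou(ref, reg) for reg in regions), default=0.0)
                   for ref in pattern) / len(pattern)

    return max(reference_patterns, key=lambda t: score(reference_patterns[t]))
```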
3.2 Text Recognition Having extracted the physical and logical document structure, the text recognition step transforms the binary image data into machine readable ASCII text. This process is commonly referred to as optical character recognition (OCR). To achieve highly reliable recognition results we employ state-of-the-art recognition technology in combination with sophisticated postprocessing techniques.
3.2.1 Basic OCR

Two commercial OCR devices are homogeneously integrated into our processing model. The results obtained from Socrates (Aurora Technologies, Inc.) and ScanWorX (Xerox Imaging Systems, Inc.) are stored in the word segments of the layout structure. To do so, both devices are supplied with zone definitions corresponding to the individual word segments. For experimental research we have developed a polynomial classifier and several neural net classifiers. These classifiers are not meant as all-purpose classifiers, but can be trained for a specific domain-dependent classification task (e.g. when many 6 pt numerals occur).
3.2.2 Voting In order to obtain even more accurate recognition results, a voting component is invoked which combines the results of all basic OCR devices. The fundamental idea is to focus on the individual classifier’s strength while neglecting its weaknesses [10]. Besides the increase in accuracy, voting provides a perfect interface for subsequent processing by hiding basic OCR devices.
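The voting idea can be illustrated with a minimal character-level sketch. This is our simplification: the actual component combines full classifier result structures and learned knowledge about each device's strengths; the per-device weights here are illustrative assumptions.

```python
from collections import defaultdict

def vote(readings, weights=None):
    """Combine per-character OCR hypotheses by weighted voting.

    readings: list of strings, one per OCR device, all for the same word.
    weights:  optional per-device trust factors (defaults to 1.0 each).
    Characters are voted position by position; the highest weighted
    score wins, ties falling to the earlier device.
    """
    if weights is None:
        weights = [1.0] * len(readings)
    length = max(len(r) for r in readings)
    result = []
    for pos in range(length):
        scores = defaultdict(float)
        for reading, w in zip(readings, weights):
            if pos < len(reading):
                scores[reading[pos]] += w
        result.append(max(scores, key=scores.get))
    return "".join(result)
```

Because the voter hides the individual devices behind one interface, downstream modules such as the dictionary look-up see only the combined result.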
3.2.3 Dictionary Look-Up

The character recognition results generated by the voting module are represented in character hypotheses lattices (CHL), one for each word. The CHLs serve as input for the dictionary look-up module LexAb. LexAb compares each CHL with the words of the vocabulary, using a generalization of the weighted edit distance [19]. The search for the best word is based on a depth-first traversal of the paths of the CHL and is directed by several heuristics to achieve a reasonable processing speed without deteriorating the recognition rate significantly. In order to obtain a system which automatically adapts itself to the characteristics of the input data, an iterative supervised learning algorithm is integrated which determines the costs for the edit operations.
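A plain, non-lattice version of a weighted edit distance might look as follows. The confusion-cost table is illustrative; in the actual LexAb the edit costs are determined automatically by the learning algorithm described above.

```python
def weighted_edit_distance(hypothesis, word, sub_cost=None):
    """Weighted Levenshtein distance between an OCR hypothesis and a
    dictionary word.

    sub_cost maps (hyp_char, dict_char) pairs that OCR devices often
    confuse (e.g. ('1', 'l')) to reduced costs; unlisted substitutions,
    insertions and deletions cost 1.
    """
    sub_cost = sub_cost or {}
    m, n = len(hypothesis), len(word)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a, b = hypothesis[i - 1], word[j - 1]
            subst = 0.0 if a == b else sub_cost.get((a, b), 1.0)
            d[i][j] = min(d[i - 1][j] + 1.0,        # deletion
                          d[i][j - 1] + 1.0,        # insertion
                          d[i - 1][j - 1] + subst)  # substitution / match
    return d[m][n]

def best_match(hypothesis, vocabulary, sub_cost=None):
    """Return the vocabulary word with the smallest weighted distance."""
    return min(vocabulary,
               key=lambda w: weighted_edit_distance(hypothesis, w, sub_cost))
```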
3.3 Information Extraction

3.3.1 Morphological Analysis

Our tool Morphic-Plus has been designed for the inflectional morphological analysis of the German language [12]. Inflected word forms occurring in the text are reduced to their respective stems. Additional output of this step is morphological and syntactical information concerning the word form, such as part-of-speech, case, gender, and number. Moreover, the component is able to decompose the compound nouns frequently occurring in German into their multiple word stems.
3.3.2 Message Type Classification

The message type identification part consists of two competing classifiers and an overall voting component embodying knowledge about the strengths and weaknesses of the classifiers [4]. The individual classifiers differ strongly in their basic assumptions. Fontclass considers layout and typographic information with respect to certain keywords; it combines word and layout information of these keywords as classification features in a linear statistical classifier and is especially suited for entitled documents. The competing rule-based system Ruleclas2 is a more conventional text categorization approach which employs text patterns and morphological information to capture linguistic features. Typically, features such as single words, morphological word stems and text phrases are collected in the current document and used as initial facts. Certainty factors representing an estimate of the feature's occurrence in the document may be attached to the input features. Rules conjoin these features with the classes; the rules themselves may also carry certainty factors, so that the final classification consists of propagating these certainty factors through the rule network. A neural net, previously trained on a training set, takes the results of the single classifiers as input and calculates a final estimate of the correct message type [4].
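The certainty-factor propagation used by Ruleclas2 can be sketched as follows. This is a deliberately flat rule network with MYCIN-style evidence combination; the feature names, rule format and CF values are our assumptions, not the system's actual rule base.

```python
def classify(features, rules):
    """Propagate certainty factors (CFs) through a flat rule set.

    features: dict feature -> CF in [0, 1] observed in the document.
    rules:    list of (required_features, message_class, rule_cf) triples.
    A rule fires with the minimum CF of its conjoined features, scaled
    by its own CF; evidence for the same class is combined MYCIN-style:
    cf = cf1 + cf2 * (1 - cf1).
    """
    scores = {}
    for required, cls, rule_cf in rules:
        if all(f in features for f in required):
            fired = min(features[f] for f in required) * rule_cf
            prev = scores.get(cls, 0.0)
            scores[cls] = prev + fired * (1.0 - prev)
    return max(scores, key=scores.get) if scores else None
```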
3.3.3 Parsing

The identification of sender and recipient of a letter is performed by a standard island-driven parser for unification-based context-free grammars; such parsers are typically used in speech recognition. The result of this step is not only the respective company but also the clerk and department. In some cases, the FormClas component yields the document's sender, which is then only the sending company; thus, the results of parsing complement the results of FormClas. The main steps of the parsing component are as follows. First, the region expected to hold the address is determined. Second, the respective word hypotheses are fed into the component's kernel, the address parser, which yields one or more complete or partial parses. The ability to parse partially has been incorporated in order to improve recall: in case a complete parse is not possible, a partial parse may still yield the expected result. Accepted partial parses are case frames which should be filled as completely as possible. Third and finally, the resulting parses are matched against the entries of a pre-given address database. More details of this component are described in [13].
3.3.4 Pattern Matching

We employ a pattern matcher for the detection of text phrases and formulations typically used within the business context. It matches all occurrences of so-called text patterns in the current text and extracts the information they capture. A text pattern is a template specifying syntactic and semantic features of word contexts and consists of arbitrary nestings of word conjunctions, disjunctions, negations, skip (up to n words) and optionality operators. Our pattern matcher is based on a finite-state automaton that detects obligatory keywords or substrings of keywords of a pattern. In a second step, the component tries to match additional facultative words and to extract specific information (e.g. a number of a fixed format, or a date).
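A minimal pattern matcher supporting keywords, bounded skips and extraction slots might be sketched as follows. The operator set and the token-level matching are simplified relative to the component described above; all names are ours.

```python
import re

def match_pattern(tokens, pattern, max_skip=3):
    """Match a text pattern against a token list and extract information.

    pattern is a sequence of items:
      ("word", w)     -- obligatory keyword
      ("skip",)       -- skip up to max_skip arbitrary tokens
      ("extract", rx) -- token matching regex rx; captured for output
    Returns the list of extracted tokens, or None if no match is found.
    """
    def walk(ti, pi, captured):
        if pi == len(pattern):
            return captured
        if ti >= len(tokens):
            return None
        kind = pattern[pi][0]
        if kind == "word":
            if tokens[ti].lower() == pattern[pi][1]:
                return walk(ti + 1, pi + 1, captured)
            return None
        if kind == "skip":
            for n in range(max_skip + 1):
                r = walk(ti + n, pi + 1, captured)
                if r is not None:
                    return r
            return None
        if kind == "extract":
            if re.fullmatch(pattern[pi][1], tokens[ti]):
                return walk(ti + 1, pi + 1, captured + [tokens[ti]])
            return None
        raise ValueError(kind)

    for start in range(len(tokens)):
        result = walk(start, 0, [])
        if result is not None:
            return result
    return None
```

For example, the pattern "order … <number>" would extract the order number from a phrase such as "we refer to your order no 4711".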
3.3.5 Process Identification

Process identification represents the final processing step of OfficeMAID. In contrast to other DAU systems, which consider only isolated documents, our approach also takes into account relations between documents as they occur in document chains. Similar techniques for the treatment of discourse have a long tradition in the NLP community and can serve as a stimulating source for the development of process identification techniques. Our approach currently relies on an object-oriented workflow description of the purchasing process [1]. This description concentrates on the document-specific parts of the workflow, incorporating temporal constraints. The system selects related documents of incomplete workflow instances contained in the document archive by combining the information from the actual document (date, message type, sender, recipient, product, price) with the a-priori knowledge specified in the workflow description (namely the relations between these information units). The output of this processing step is a ranked list of best-fitting documents and workflow instances.
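The ranking of open workflow instances can be sketched as follows. The attribute names, the weights and the single temporal constraint are illustrative assumptions, not the system's actual workflow model.

```python
from datetime import date

def rank_workflow_instances(doc, instances):
    """Rank open workflow instances by fit with an analyzed document.

    doc and each instance are dicts holding extracted information units
    (sender, product, ...).  Each matching unit adds its weight to the
    score; a temporal constraint (the document must not precede the
    start of the workflow instance) filters instances out entirely.
    """
    weights = {"sender": 3.0, "product": 2.0, "recipient": 1.0}
    ranked = []
    for inst in instances:
        if doc["date"] < inst["started"]:   # temporal constraint
            continue
        score = sum(w for key, w in weights.items()
                    if doc.get(key) == inst.get(key))
        ranked.append((score, inst["id"]))
    ranked.sort(reverse=True)
    return [wf_id for score, wf_id in ranked if score > 0]
```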
4 Results At present, our document sample consists of about 500 business letters collected in the purchasing department of the University of Kaiserslautern.
We divided our set of business letters into three clusters: the first set (91 documents) was used as a training set for the individual specialists, i.e., these documents were utilized to learn font attributes of keywords for Fontclass and to extract meaningful words and text patterns for Ruleclas2. The next 176 documents served as a training set for the voting component, and the remaining 229 documents made up the test set with which we produced the results shown. Tab. 1 shows the effectiveness of the voting component compared to the employed basic OCR devices. The number of ground-truth characters in the test set was 69,290. For testing, the obtained results are transformed into plain ASCII and compared to the corresponding ground-truth data; from this, the number of errors as well as the character accuracy are calculated. Taking these numbers, we can state that the voting almost cuts the errors in half.

Table 1: Effectiveness of the voting component compared to the basic OCR devices.

  Recognizer   #Errors   Accuracy
  Socrates       3,950      94.3%
  ScanWorX       5,959      91.4%
  VOTE           2,040      97.1%
For the evaluation of text analysis results, the MUC conferences [16] have established a quasi-standard. In analogy to the MUC metrics, we measure our results in recall and precision. Tab. 2 summarizes the evaluation results of the address parsing component. The table shows that complete parsing has a higher precision (up to 100%) but a very low recall (11% for the sender and 17% for the recipient), while the partial parsing approach is not necessarily less precise (at least for the recipient a precision of 83% is reached) but far better in recall (up to 61%).

Table 2: Results of address parsing.

  Address Object (Parsing Strategy)       precision   recall   overgeneration
  Sender address (complete parsing)          100.0%    11.5%        0.0%
  Sender address (partial parsing)            50.0%    16.1%        3.6%
  Recipient address (complete parsing)        95.2%    17.2%        0.0%
  Recipient address (partial parsing)         82.6%    61.2%        1.2%
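The metrics just introduced can be reproduced from slot counts; the sketch below is a simplified version that ignores the partially-correct cases of the full MUC scoring scheme.

```python
def muc_scores(correct, spurious, missing):
    """Simplified MUC-style scores from slot counts.

    correct:  slots filled with the right value
    spurious: slots filled although no answer was expected
    missing:  expected slots the system left unfilled
    """
    actual = correct + spurious          # everything the system produced
    possible = correct + missing         # everything it should have produced
    precision = correct / actual if actual else 0.0
    recall = correct / possible if possible else 0.0
    overgeneration = spurious / actual if actual else 0.0
    return precision, recall, overgeneration
```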
The low precision in partial sender recognition is due to weak heuristics that cause rejects. Additionally, the sender address is almost always of poor printing quality and is not always given in a letter. Since the sender address is typically printed in a small font, the hypotheses yielded by text recognition are often similarly badly weighted. The adjustment of the respective rejection threshold, however, highly depends on the test set, and an optimal adjustment has therefore not yet been done. In Tab. 3 the evaluation results of the document categorization approaches are shown. Precision P and recall R scores are again given in per cent (%). In comparison to the Ruleclas2 results, Fontclass reaches a very high precision but obtains a comparatively low recall. This is caused by the fact that a document's title often is not recognized properly (recall) while, on the other hand, a well-recognized
title is a very good indicator for the appropriate message type (precision). Applying the voting component on top of these results leads to better overall results: the overall recall is improved and the overall precision is only slightly lower than the very high one gained with Fontclass. Thus, we can state that voting is a practical complement to the single classifiers.

Table 3: Results of message classification.

  message type             #    Ruleclas2        Fontclass        Voting
                                P       R        P       R        P       R
  confirmation of order    54   89.58   79.63    93.33   51.85    87.50   77.78
  delivery note            59   47.83   37.29    93.33   47.46    89.13   69.49
  request                  10   100.0   100.0    100.0   90.0     100.0   100.0
  announcement             92   86.59   77.17    79.03   53.26    76.25   66.3
  invoice                  14   92.86   92.86    90.00   64.29    92.86   92.86
  total                   229   79.5    69.43    87.23   53.71    84.34   72.93

5 Conclusion

Although contemporary workflow management systems (WFMS) offer integrated solutions for efficient business process handling, they obviously lack technology for automatically integrating informal documents into the formal concepts of a workflow. Yet this is an important requirement for overcoming the weakness of interacting with documents in two separate worlds — the physical world of the desk and the electronic world of the computer. The development of the OfficeMAID system has been a first step towards bridging this media gap. Our future work will therefore address open problems and pursue promising research directions in the field of workflow-centered document analysis and understanding. Document analysis and understanding strongly benefit from knowledge about a given domain; the integration of corporate knowledge, e.g. predefined workflows, and the employment of the runtime context of open workflow instances will be one central topic of our future research. Because different applications are characterized by different conditions, the diverse processing steps have to be complemented by mechanisms allowing for an interactive parametrization as well as the teaching of document knowledge and relevant problem-solving knowledge. In order to take the new capabilities of printing devices into account, we further investigate the employment of image features as contained in gray-level and color document images. By interlocking this knowledge with document analysis techniques, we expect to achieve better and additional results.

References

[1] S. Baumann, M. Malburg, C. Wenzel: May Document Analysis Tools Bridge the Gap between Paper and Workflow? A Critical Survey; Proceedings CoopIS-96, Int'l Conference on Cooperative Information Systems, Brussels, Belgium, June 1996, pp. 135-142.
[2] M. Ben Hadj Ali: Background Noise Detection and Cleaning in Document Images; Proceedings ICPR-96, Vienna, Austria, August 1996, Track C, pp. 758-762.
[3] T. Brückner et al.: Distribution by Automatic Address and Content Interpretation, Proceedings SDAIR-96, Las Vegas, NV, USA, 1996, pp. 67-75.
[4] C. Wenzel, S. Baumann, T. Jäger: Advances in Document Classification by Voting of Competitive Approaches, Proceedings DAS-96, Malvern, PA, USA, October 14-16, 1996, pp. 352-372.
[5] G. Ciardiello, M. T. Degrandi, M. P. Poccotelli, G. Scafuro, M. R. Spada: An Experimental System for Office Document Handling and Text Recognition, Proceedings 9th ICPR, Milano, Italy, 1988, pp. 739-743.
[6] A. Dengel, R. Bleisinger, R. Hoch, F. Fein, F. Hönes: From Paper to Office Document Standard Representation, IEEE Computer, Vol. 25, No. 7, July 1992, pp. 63-67.
[7] A. Dengel, R. Bleisinger, R. Hoch, F. Hönes, M. Malburg, F. Fein: Office-MAID — A System for Automatic Mail Analysis, Interpretation and Delivery; in: L. Spitz, A. Dengel (eds.), Document Analysis Systems, World Scientific, Singapore, 1995, pp. 52-75.
[8] F. Dubiel, A. Dengel: FormClas — OCR-Free Classification of Forms, Proceedings DAS-96, Int'l Association for Pattern Recognition Workshop on Document Analysis Systems, Malvern, PA, USA, October 1996, pp. 40-62.
[9] F. Hönes, J. Lichter: Layout Extraction of Mixed-Mode Documents, Machine Vision & Applications, Springer, No. 7, pp. 237-246.
[10] T. Jäger: OCR and Voting Shell Fulfilling Specific Text Analysis Requirements, Proceedings SDAIR-96, Las Vegas, NV, USA, April 15-17, 1996, pp. 287-302.
[11] M. Köppen, D. Waldöstl, B. Nickolay: A System for the Automated Evaluation of Invoices, Proceedings DAS-96, Malvern, PA, USA, 1996, pp. 3-21.
[12] O. Lutzy: Morphic-Plus: Ein morphologisches Analyseprogramm für die deutsche Flexionsmorphologie und Komposita-Analyse; DFKI Document D-95-07 (in German).
[13] M. Malburg: Address Recognition with Robust NLU Technology, Proceedings FLAIRS-96, Key West, FL, USA, May 1996.
[14] E. Mandler, M. Oberländer: One-Pass Encoding of Connected Components in Multi-Valued Images, Proceedings ICPR-90, Int'l Conference on Pattern Recognition, Atlantic City, NJ, USA, 1990, pp. 64-69.
[15] U. Mogg-Schneider, C. Augmuth: Information Extraction of Tax Assessments, Proceedings DAS-96, Malvern, PA, USA, 1996, pp. 564-577.
[16] Proceedings 4th Message Understanding Conference (MUC-4), Morgan Kaufmann, San Mateo, CA, USA, 1992, pp. 22-29.
[17] L. Spitz: Style Directed Document Recognition, Proceedings ICDAR'91, St. Malo, France, 1991, pp. 611-619.
[18] G. Nagy, S. Seth, M. Viswanathan: A Prototype Document Image Analysis System for Technical Journals, IEEE Computer, Vol. 25, No. 7, July 1992, pp. 10-24.
[19] A. Weigel, S. Baumann, J. Rohrschneider: Lexical Postprocessing by Heuristic Search and Automatic Determination of the Edit Costs, Proceedings ICDAR-95, 1995, pp. 857-860.