Making Documents Work: Challenges for Document Understanding

Andreas R. Dengel
German Research Center for Artificial Intelligence (DFKI)
P.O. Box 2080, D-67608 Kaiserslautern, Germany
email: [email protected]
Abstract

In this paper I will try to explain the nature of document understanding in all of its dimensions. I will first describe the characteristics of data, knowledge, and information in order to show their synergetic interweaving. After that, I will try to structure the inherent complexity of the sub-problems of document understanding, which may not be solved serially but rather are attributes of individual documents. This paper therefore focuses on system engineering challenges. In addition, I will show some recent work done on the different topics and give some insights into the individual techniques we chose at DFKI.

Keywords: Document understanding, categorization, table analysis, message understanding, system engineering
1. Introduction

Document understanding refers to the field concerned with the logical and semantic analysis of documents in order to extract human-understandable information and codify it into machine-readable form. This information comprises not only textual aspects, such as event dates, names, or identification numbers, but also logical objects which may be derived from the document structure as well as information added from background databases because of content-driven relationships. To this end, document understanding systems provide technology to automatically transform meaningful information from a raster image into a formal representation. It is thus a form of reasoning in which the meaning of a communication is deduced from the combination of the written text and its presentation, the logical structure of the message, the context, knowledge of the language, and knowledge about the world (see Fig. 1).
The output of document understanding may be of different quality, ranging from an editable description, or a representation useful for document sorting, routing, or filing, up to a structured message with attribute-value pairs describing all aspects of communication necessary to drive a workflow. To achieve this quality of information, the level of data present in a document has to be enriched to the desired level. Before doing so, a paper document has to be converted into an electronic form suitable for the application of document understanding techniques. At first glance, a document is an object describing data of different types:
• pictographic data, where the whole image is represented as a sequence of orthogonal pixel sequences;
• segment data, describing homogeneous color or texture image regions;
• layout data, describing the presentation of objects, i.e. their geometry, order, and nesting;
• character data, indicating that multiple glyph images share the same ASCII code.
Initially, none of these data types has information value. An analogous situation occurs when an ordinary European, like me, holds a letter written in Kanji or Hangul characters. For me, such a letter is nothing more than data: an image with segments arranged in a certain order, some of which look the same. The conversion from paper to electronic data of the four qualities described above is the initial step of document understanding, providing an electronic representation which is already sufficient for many applications, such as full-text indexing, electronic editing, or reuse of document content.
Figure 1: Document understanding as the process of transforming data into information by applying knowledge. (The figure contrasts DATA, i.e. the image, image objects, layout structure, and recognized characters, with INFORMATION, and shows the KNOWLEDGE sources involved: vocabulary, presentation, logical structure, message concepts, workflow processes, and a company database. The sample document is a German claim notification, a "Schadensmeldung", with fields for the injured party, the place of damage, the amount of damage, and the date of damage.)
However, documents are a means of communication among individuals and institutions. They are created to announce conferences or new products, to summarize agreements, to set out paragraphs of contracts or even laws, to publish innovative research results, or just to arouse, to promise, to inform, to question, to command, and to convince others. Documents are always present where information meets human beings whose work gets done through documents. With the intention to assist these work processes, the question is how to make document data useful. Basically, we may think about different complementary sub-areas to be combined for forming a higher-level representation, e.g. the interpretation of photographs, graphics, or musical notation. However, in the remainder of this paper I will focus on business documents because of their high relevance for decision making and their substantial impact on workflow progress in organizations.
2. Challenges of Document Understanding

In order to extract all relevant information from a document image, the employment of knowledge is of vital significance (see also [2]). Knowledge is a frequently used term with little precise meaning. For the task of document understanding, however, we may refer to various intuitively comprehensible sources that we use in our daily business to capture the important bits of information driving our decisions and processes (cf. Fig. 1). First of all, there is the terminology of communication, consisting of the vocabulary of a language enriched by special designators and identifiers. A layout structure may be knowledge as well, especially if it is typical of a certain class of documents we deal with. Consider, for example, the nested rectangles in the "Presentation" circle of Fig. 1. Although they represent simple geometric data, we can easily reason that they describe the typical layout of a business letter. Furthermore, it is easy to hypothesize where certain logical objects, such as the recipient or the date, are located. Neglecting the aspect of presentation, we are also able to describe a business letter by just its defining logical constituents, such as sender, recipient, subject, etc. In addition to these complementary structural views, we can more restrictively define the various message concepts relevant to our role and tasks in a company, for example "notification of claim" and "appraisal report" in
an insurance transaction, or "invoice" and "delivery note" in a purchasing department. Furthermore, all of these sources are only valuable if we know the insured persons, properties, and values, or the customers, suppliers, and products, which we get from our company's databases or other devices. A human being who interprets the semantics behind a document message makes use of these kinds of knowledge sources, either as tacit experience or by employing explicit expertise on the computer. It is one of the big challenges to combine these knowledge sources in such a way that the relevant messages captured in the bulk of documents can be extracted automatically and presented at the workplaces where they are needed. If this were possible, the extracted information might be added to the knowledge repository and used for later processing. For example, if a company receives an invoice, the existing information from the quote and from the order may be utilized as knowledge. Thus the process described in Fig. 1 also shows aspects of learning. In the remainder of this paper I will try to explain how we at DFKI have attempted to make use of all these different sources of information at the same time in order to make documents work.
In transferring the model shown in Fig. 1 to a computer, we first have to consider the diversity of documents that we rely upon to make decisions and to support our daily work. In larger organizations, we can distinguish documents by their structural complexity with respect to layout and logical composition:
1. Highly-structured documents, which have a static form built on a precisely defined page layout. They can be described by geometric templates including the regions of interest (ROIs) where the relevant information to be extracted is located.
2. Semi-structured documents, which allow a partial employment of templates. These are documents having only some regular geometric features to identify either the class of document or some of the ROIs, i.e. dynamic forms or business letters which have preprinted headers or footers and an empty page body for the message.
3. Loosely-structured documents, which cannot be characterized by a geometric template. They contain typical logical objects hidden in an irregular page layout.
Various approaches have been published in the past, some of which are applied to a single class of documents while others are applied to a number of different classes. Most techniques combine geometric template matching and character recognition in order to extract key word lists, generated by post-OCR fuzzy-match approaches, for indexing and retrieving the contained text. However, there are also successful approaches extracting structured messages.
Fig. 2 shows an attempt to categorize some of the important publications with respect to semi-structured and loosely-structured documents. In addition to the related work discussed in [2], I have added some recent publications. Unfortunately, I cannot go into deeper detail here because of the page limitation of this paper. However, this clustering should help readers to find well-grounded work on the topic.
(Figure 2 arranges the cited approaches, among them [1], [3], [5], [6], [7], [9], [12], [13], [15], [19], [21], [23], [24], [26], [27], [29], [34], [35], [36], [37], [39], [40], [41], [43], [44], and [47], along two axes: the structural complexity of the documents handled, from highly-structured via semi-structured to loosely-structured, and their applicability to one document class versus multiple classes.)
Figure 2: Categorization of document understanding approaches.
3. Key Tasks of Document Understanding

Large organizations receive thousands of documents daily, with widely varying structural complexity. The examples in Fig. 3, for instance, carry the same message concept, "invoice", with the same relevant bits of information, but show different complexity. Furthermore, there may be single-page or multi-page documents of different formats and paper quality. In many practical cases, documents of different structural complexity as well as different formats are found in a single mail envelope. In the domain of health insurance, for example, medical invoices are often sent together with prescriptions or medical estimates. Considering the bulk of documents at the mail entry point of an organization, we face mixed stacks of documents, all of which capture information relevant to individuals' roles, tasks, or interests. Due to the diversity of departments, qualifications, and workflow processes, the documents must be sorted according to the information demand. Although there are various tasks to be solved, three challenging problems have to be addressed synergistically in order to fully integrate printed information into business decisions:
• document categorization
• table analysis
• message understanding
Although there might be a normal order in which the three steps are initiated, they are strongly intertwined, and the order of their individual processing is thus directed by the intermediate results.
Figure 3: Typical examples of documents capturing the same message concept but showing different degrees of structural complexity.
For a well-known form, categorization is the obvious first step before the relevant information in the corresponding ROIs is further analyzed. For a semi-structured document containing tabular information, e.g. an invoice, it might be better to first identify the type of the tabular information in order to derive the document category; to do so, the first steps of information extraction are needed. In the following, however, I will consider the three problems one after the other, remarking on their interweaving where appropriate. Because of the problems' diversity, I will describe different approaches to solving a single problem which can be applied opportunistically. They are finally encapsulated in so-called specialists, which can be combined depending on the intermediate results produced during the document understanding process (see [8]).
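To make the idea of such an opportunistic combination concrete, the following is a minimal sketch, not the actual DFKI implementation: each specialist declares when it is applicable to the current analysis state, and a controller repeatedly picks an applicable specialist based on the intermediate results. All class names, state fields, and decisions are invented for this illustration.

```python
# Minimal sketch of opportunistic specialist combination (illustrative only;
# the specialist board of [8] is considerably more elaborate).

class Specialist:
    """A specialist inspects the analysis state and may refine it."""
    def applicable(self, state: dict) -> bool:
        raise NotImplementedError
    def run(self, state: dict) -> dict:
        raise NotImplementedError

class LayoutCategorizer(Specialist):
    def applicable(self, state):
        return "category" not in state          # only while still uncategorized
    def run(self, state):
        state["category"] = "invoice"           # placeholder decision
        return state

class TableAnalyzer(Specialist):
    def applicable(self, state):
        return state.get("category") == "invoice" and "table" not in state
    def run(self, state):
        state["table"] = [("Screws", "9,80")]   # placeholder extraction result
        return state

def analyze(image, specialists):
    state = {"image": image}
    progress = True
    while progress:                              # re-plan after every new result
        progress = False
        for s in specialists:
            if s.applicable(state):
                state = s.run(state)
                progress = True
                break
    return state

print(analyze("scan.tif", [TableAnalyzer(), LayoutCategorizer()]))
```

Note that although the table analyzer is listed first, it only fires after categorization has produced a result: the processing order emerges from the intermediate results rather than from a fixed pipeline.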
Document Categorization

Within an organization, many people may be working on the same topic at the same time, some of them in sales, others in engineering or administration. Consequently, there is a need to categorize many thousands of documents a day and, furthermore, to determine how to proceed depending on the topic described. Document categorization addresses the task of automatically determining that a document belongs to a certain predefined class, which allows routing or archiving documents, searching for notes in online service systems, responding to customer requests, extracting messages relevant to a workflow, and many more practical applications. For many problems it is advantageous not to use a flat set of document classes, as proposed in many papers, but to prefer a taxonomy in which the individual classes may be stepwise generalized into so-called topic groups.
A topic group taxonomy, as shown in Fig. 4, describes a finite set of information items which are relevant, e.g., to an organizational business unit. All non-terminal nodes represent subordinate topic groups, and the terminal nodes represent document classes. Such a model allows a better categorization if a unique categorization of a document is impossible because it addresses different topics: the document is then categorized into the more general topic group.
(Figure 4 shows a tree with topic groups such as "Clearing Center" and "Health Insurance", the topic group "Medical Invoice" with document classes such as "Dentist Invoice" and "Physician Invoice", further document classes such as "Application Form" and "Prescription", and the special class "unclassified".)
Figure 4: An exemplary topic group hierarchy for document categorization.

Before using a topic group taxonomy, all nodes are set to either the active or the inactive state, depending on the given task. Thus the taxonomy may be considered as a categorization filter.
For example, one company is interested only in sales documents, while another is interested only in special application forms returned by mail. For practical purposes, it is not only necessary to provide pre-modeled taxonomies but also to allow the description of new classes and topic groups. We therefore offer special templates which may be refined to suit a particular problem. Because no system is perfect, we also provide a special topic group called "unclassified" for rejects; this class is set to be active by default.
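As a minimal sketch of how such an active/inactive taxonomy can act as a categorization filter, consider the following. The node names follow Fig. 4, while the confidence threshold, the scoring interface, and all other code details are invented for illustration and are not the actual system.

```python
# Hypothetical sketch of a topic group taxonomy used as a categorization filter.
# When several terminal classes match, the document is generalized to the
# enclosing topic group, as described in the text.

class TopicNode:
    def __init__(self, name, children=(), active=True):
        self.name, self.children, self.active = name, list(children), active

def categorize(node, scores, threshold=0.7):
    """Return the most specific active node supported by the class scores.

    `scores` maps document class names (terminal nodes) to classifier
    confidences; `threshold` is an assumed acceptance level."""
    if not node.active:
        return None                             # filtered out for this task
    if not node.children:                       # terminal node = document class
        return node if scores.get(node.name, 0.0) >= threshold else None
    hits = [h for h in (categorize(c, scores, threshold)
                        for c in node.children) if h]
    if len(hits) == 1:
        return hits[0]                          # unique categorization
    if len(hits) > 1:
        return node                             # ambiguous: use the topic group
    return None

taxonomy = TopicNode("Health Insurance", [
    TopicNode("Medical Invoice", [TopicNode("Dentist Invoice"),
                                  TopicNode("Physician Invoice")]),
    TopicNode("Prescription"),
])
result = categorize(taxonomy, {"Dentist Invoice": 0.8, "Physician Invoice": 0.9})
print(result.name)   # -> "Medical Invoice": both invoice classes match, so the
                     #    document is categorized into the more general group
```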
When defining a new application, new document classes must be defined. To do so, a user may choose among the following templates:
• standard
• separate (separate_document, separate_file)
• unclassified
The standard template is used for all reference descriptions of documents to be categorized carrying any message concept. The separate-document template and the separate-file template are selected if an application requires a special document separator to denote the end of a multi-page document or to separate the contents of mail envelopes. Note that it is also possible to extract information from a separator template, i.e. where a company indexes the documents in a single mail envelope by some extrinsic features such as the processing date or address data, e.g. by using a bar code. If categorization fails, a document retains its unclassified categorization and thus belongs to the topic group "unclassified". The corresponding template allows the addition of information about how to proceed with the document, i.e. to route it to a verification workstation where it is manually categorized.
Documents may have multiple pages, and the number of pages may be fixed or variable. If we know the number of pages in advance, we are able to define one document class per page, summarizing all classes into an individual topic group. If the page number is not fixed, we may define a single class. However, this depends on the uniqueness of the categorization features: if we can distinguish the first page from the remaining pages, e.g. because of the header, we can define two corresponding classes. Separators are always single-page documents.
Categorization relies on methods that organize documents by structure or by content. In order to do so, we must define features either by training or by automatic learning. Fig. 5 illustrates the typical scenario when applying learning approaches for categorization. The model distinguishes between a learning phase and a run-time phase. As proposed by [17], learning may be done in three steps (a compact sketch follows the list):
• extraction of features based on a sample of representative training documents;
• transformation of the training documents into a feature-value representation based on the extracted features;
• learning of the corresponding classifier from the feature-value representation.
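The following compact sketch expresses these three steps using scikit-learn, which directly implements this feature-extraction, transformation, and learning model. The use of scikit-learn, the training texts, and the labels are assumptions of this example only.

```python
# Sketch of the three-step learning model (illustrative; any comparable
# machine learning toolkit would do). Training data are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_texts = ["wir bestellen hiermit ...", "hiermit kuendige ich ..."]  # OCR output
train_labels = ["order", "cancellation"]

# Step 1: extract features from representative training documents.
vectorizer = TfidfVectorizer(max_features=5000)
# Step 2: transform the training documents into a feature-value representation.
X_train = vectorizer.fit_transform(train_texts)
# Step 3: learn a classifier from that representation.
classifier = LinearSVC().fit(X_train, train_labels)

# Run-time phase: transform new documents with the *same* features (A),
# then categorize them (B).
X_new = vectorizer.transform(["ich widerrufe meine bestellung ..."])
print(classifier.predict(X_new))
```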
The resulting classifier is applied to the run-time documents by transforming them with respect to the selected features (A) and categorizing them (B); see Fig. 5.
(Figure 5 contrasts the learning phase, in which features are extracted from the training documents, the documents are transformed into a feature-value representation, and a feature-attribute learner produces a classifier, with the run-time phase, in which incoming documents are transformed with respect to the same features and the classifier assigns categories.)
Figure 5: Model for learning categorization systems.

When dealing with mixed stacks of documents, we cannot select a single approach but rather have to make use of different methods, including features from layout and logical structure as well as from text. For "unclassified" documents, no features are necessary. For all other document classes we can apply:
• layout features
• textual features
• search patterns
• format features
• special indicators
• tabular features
Layout features are preferred when a document class has a lot of preprinting, e.g. a form or a document with a fixed individual header. For the training of the corresponding class, it is most suitable to use a blank form; if none is available, we use an electronic eraser to delete those regions in the image which are filled in. This layout-based classifier has been proposed in an earlier paper [7]; it is very reliable and also fast because no OCR data is necessary.
Documents without typical layout features may be classified using textual features. Since we are required to support many applications, we prefer approaches which are adaptive, i.e. can be easily trained or even automatically learned. Learning approaches have been applied to document analysis and understanding problems before. In general, they may be grouped into statistical methods [14, 18, 26], neural nets [16, 31, 51], decision trees [25, 32], and rule learning techniques [10, 20, 33]; for comparisons of various approaches refer to [15, 49]. We have decided to use different approaches in order to take advantage of their inherent characteristics. The basic retrieval model we rely on is the vector space model (VSM, see [4]). In this model, documents as well as document classes are represented by vectors. Whether a document belongs to a certain document class is decided via a similarity function between the document vector and the vectors of all document classes. More formally, each position i in the vectors (w1, w2, …, wn) corresponds to a specific term ti in the collection, and the value wi indicates the weighted presence or absence of the respective term in the document or class. For the classification itself we use a k-nearest-neighbour method and a Support Vector Machine approach [15].
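The following toy sketch illustrates the vector space model with a k-nearest-neighbour decision. The four-term vocabulary, the raw term counts used as weights wi, and the cosine measure as similarity function are simplifying assumptions of this example; the actual system uses proper term weighting over much larger vocabularies.

```python
# Minimal vector space model sketch: documents and classes as term-weight
# vectors, similarity via the cosine measure, categorization via k-NN.
import math
from collections import Counter

def vectorize(text, vocabulary):
    counts = Counter(text.lower().split())
    return [counts[t] for t in vocabulary]   # w_i = presence/weight of term t_i

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_categorize(doc_vec, labelled_vecs, k=3):
    """labelled_vecs: list of (class_label, vector) built from training docs."""
    ranked = sorted(labelled_vecs, key=lambda lv: cosine(doc_vec, lv[1]),
                    reverse=True)
    top = [label for label, _ in ranked[:k]]
    return max(set(top), key=top.count)       # majority vote among neighbours

vocab = ["invoice", "amount", "contract", "rescind"]
training = [("invoice", vectorize("invoice amount due", vocab)),
            ("cancellation", vectorize("rescind the contract", vocab))]
print(knn_categorize(vectorize("total invoice amount", vocab), training, k=1))
```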
Often, relatively complex text patterns are good indicators for specific categories; e.g., "to rescind a contract" or "I withdraw ..." are typical indicators of a cancellation message. In order to use such document features for classification, we use a pattern-based classifier that relies on a special text pattern language. With this language we can not only represent arbitrary Boolean expressions of desired word occurrences but also account for the order and distance of specific words within documents. For instance, the pattern w1 [n] w2, with n ∈ ℕ0, represents two words occurring in a document with a syntactic distance of n words; typically, n ranges between 1 and 5. The pattern language also allows for the incorporation of synonyms and acronyms defined in a thesaurus as well as error-tolerant word searching in OCR texts using the Levenshtein edit distance [38]. For learning categorization rules from sample documents, we rely on a relational learning algorithm which was specifically designed for learning text patterns [19]. Pattern-based categorization is not as fast as layout-based categorization, but it is very powerful and practically relevant in the case where many different document classes are loosely structured.
In [20], a hybrid approach to document categorization is described. In this approach, the word-based document representation of the vector space model is first enriched by learned text patterns. Subsequently, the resulting document representation is used for learning classifiers in the vector space model. This approach proved to be significantly better than both the Support Vector Machine using simple terms as features and the pure pattern-based classifier.
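To make the pattern construct concrete, here is a toy interpreter for the single construct w1 [n] w2, with a crude Levenshtein-based tolerance for OCR errors. The real pattern language of [19, 20] is considerably richer (Boolean expressions, thesaurus lookup, and more); the tolerance of one edit and all example strings are assumptions of this sketch.

```python
# Toy interpreter for the pattern "w1 [n] w2": w2 must occur at most n words
# after w1, with a simple edit-distance tolerance for OCR errors.

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def fuzzy_eq(a, b, max_dist=1):               # tolerate one OCR error
    return levenshtein(a.lower(), b.lower()) <= max_dist

def match_pattern(w1, n, w2, words):
    """True if w2 follows w1 within a syntactic distance of n words."""
    for i, w in enumerate(words):
        if fuzzy_eq(w, w1):
            window = words[i + 1 : i + 1 + n]
            if any(fuzzy_eq(x, w2) for x in window):
                return True
    return False

text = "I hereby w1thdraw from the contract".split()
print(match_pattern("withdraw", 3, "contract", text))  # True despite OCR error
```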
There are additional features addressing the size of a document page, i.e. its horizontal and vertical dimensions. However, the corresponding classifier is not reliable enough for widespread application because of inaccurate format information from scanned documents. It should only be used for filtering out special formats such as prescriptions, which have a specific format in Germany and are usually accompanied by invoices.
All features may be combined within a document class. For example, we could add some textual features to distinguish document classes with very similar layout features: if layout-based categorization delivers, e.g., two alternatives, pattern-based categorization may resolve the ambiguity. In such a case there is always a fixed order in which the features are tested, and the features must be unique for a document class; if they are not, the system generates a warning message.
In order to further understand the document message, the system has to know not only what information to extract from the document but also where and how to extract the various bits, and what the information looks like. Information extraction may be applied to:
• ROIs having a fixed position
• ROIs based on search patterns
• tables
• special fields
ROIs with a fixed position may be specified for documents classified by layout-based categorization. In this case, the position of a field is determined by the layout dimensions, either with respect to the entire page or relative to a certain key word which initializes a search-pattern search. For extraction via search patterns, we define anchor words close to which information of a certain type and syntax is expected; e.g., next to the anchor word "insurance number" we expect a string matching a certain regular expression. In the worst case, a ROI is specified as the entire document page. Note that anchor words are connected via a thesaurus to resolve synonyms and acronyms. For details refer to [8].
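The following hedged sketch shows extraction via search patterns: an anchor word, resolved against a small thesaurus of synonyms, localizes the ROI, and a regular expression describes the syntax of the expected value. The anchor terms, synonyms, and number format are invented for illustration and do not reflect the actual system's configuration.

```python
# Sketch of anchor-word-based field extraction with a toy thesaurus and an
# invented value syntax.
import re

THESAURUS = {"insurance number": ["insurance number", "insurance no.", "ins.-no."]}
VALUE_SYNTAX = {"insurance number": re.compile(r"\b\d{3}-\d{6}\b")}

def extract_field(field, ocr_lines):
    anchors = THESAURUS[field]                 # synonyms/acronyms of the anchor
    syntax = VALUE_SYNTAX[field]               # expected syntax of the value
    for line in ocr_lines:
        low = line.lower()
        if any(low.find(a) >= 0 for a in anchors):
            m = syntax.search(line)            # value expected near the anchor
            if m:
                return m.group()
    return None                                # worst case: search the whole page

page = ["Policy holder: G. Mustermann", "Ins.-No. 123-456789   Date: 22.09.02"]
print(extract_field("insurance number", page))   # -> 123-456789
```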
Table Analysis

To extract tabular information, we have to take into account that tables have no fixed position but can be found more or less anywhere on a page. For this reason, we rely on logical table models which are defined by the column headers and the corresponding content types, specified via regular expressions. Table headers are extracted using a tool proposed in [22]. If our table analysis tool finds a line containing a number of terms consistent with column headers or their synonyms/acronyms, it attempts to verify the content types by checking the strings which are located geometrically below each header.
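An illustrative logical table model might look as follows: column headers plus content types given as regular expressions, where a candidate header line is verified by checking the strings geometrically below each header. The header terms, the content-type expressions, and the column tolerance are assumptions of this sketch, not the system's actual models.

```python
# Hedged sketch of a logical table model and its verification step.
import re

TABLE_MODEL = {                       # header term -> expected content syntax
    "date":   re.compile(r"\d{2}\.\d{2}\.\d{2}"),
    "amount": re.compile(r"\d+[.,]\d{2}"),
}

def verify_table(header_line, body_lines, tol=20):
    """header_line: list of (term, x); body_lines: rows of (word, x)."""
    columns = [(t.lower(), x) for t, x in header_line if t.lower() in TABLE_MODEL]
    if len(columns) < len(TABLE_MODEL):
        return False                  # not all modelled headers were found
    for term, x in columns:
        pattern = TABLE_MODEL[term]
        below = [w for row in body_lines for w, wx in row if abs(wx - x) < tol]
        if not all(pattern.fullmatch(w) for w in below):
            return False              # content below a header violates its type
    return True

header = [("Date", 10), ("Amount", 120)]
rows = [[("22.09.02", 11), ("2680,00", 121)],
        [("23.09.02", 10), ("75,50", 122)]]
print(verify_table(header, rows))     # -> True
```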
However, as e.g. [11, 21, 28] have shown, there are also reliable approaches for finding tables without header information. After identifying layout regularity as a hypothesis for a table object, the extracted information can be matched against appropriate database information. A closer look at the various papers about table location reveals that the earlier approaches relied exclusively on arbitrary combinations of the features table delineation, spacing, or known text patterns of column headers. These approaches are applicable if the respective features are present. However, table presentation varies widely, and there is no general definition of table appearance. As shown in Fig. 6, the same information may be presented differently, i.e. with delineation, large spacing, and column headers (a), but also in a way where all of these features are missing (b).
Figure 6: Tables with different appearances.

Many of these approaches allow the definition of specific table models which describe textual features [39, 42, 47] or layout features [11, 35, 46]. These generic models are then used to guide the analysis process; as a result, tables can not only be located but also categorized or even semantically interpreted. However, these approaches have an important shortcoming: they are not universally applicable and fail for a large number of tables. In many practical applications, the documents to be analyzed contain important core information in tables with a large variety of styles. Such tables not only resist model-based approaches but sometimes do not even show the typical (though not mandatory) table features. A sample document (actually its block segments) showing various potentials for misclassification is given in Fig. 7.

The approach described in [21] offers a possible solution to the problem of locating tables in that case. This basically model-free system relies only on word-bounding-box geometry and does its own bottom-up block clustering, recognizing table areas as well as their row and column structure. It does not depend on the presence of delineation or column headers and separates even the smallest column gaps; column elements do not even need to be aligned properly.

Figure 7: Document containing a sparse table without delineation and column headers.
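As a rough illustration of the bottom-up idea behind [21], consider the following strongly simplified sketch: words on adjacent text lines whose bounding boxes overlap horizontally are merged into the same block, so that column gaps separate the clusters even without delineation or headers. All of T-Recs' refinements are omitted here; the data and coordinates are invented.

```python
# Strongly simplified sketch of bottom-up word-box clustering (conveys only the
# core idea of T-Recs [21], not its actual algorithm).

def overlaps(a, b):
    """Horizontal overlap of two word boxes given as (x0, x1)."""
    return a[0] <= b[1] and b[0] <= a[1]

def cluster_words(lines):
    """lines: list of text lines, each a list of (word, x0, x1)."""
    blocks = []
    prev = []                                  # [(box, block)] of previous line
    for line in lines:
        cur = []
        for word, x0, x1 in line:
            block = next((blk for box, blk in prev
                          if overlaps((x0, x1), box)), None)
            if block is None:                  # no horizontal neighbour above:
                block = []                     # start a new block
                blocks.append(block)
            block.append(word)
            cur.append(((x0, x1), block))
        prev = cur
    return blocks

lines = [[("Pos", 0, 30), ("Article", 60, 120), ("Price", 200, 250)],
         [("1", 0, 10), ("Screws", 60, 110), ("9,80", 205, 240)],
         [("2", 0, 10), ("Nuts", 60, 95), ("4,20", 205, 240)]]
print(cluster_words(lines))   # -> three blocks, one per column: the column
                              #    gaps alone separate the clusters
```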
Message Understanding
Considering the aforementioned usage of a company's background knowledge, only static sources have been employed so far. However, additional knowledge may be available within business processes, which provides valuable contextual knowledge. Business processes represent a sequence of tasks which require information to go from one state to another. If we aim to fully integrate printed information into an electronic workflow, the context of the workflow must be connected to the document understanding system. A good example is given by the simplified scheme of a purchasing workflow in Fig. 8, where different outgoing and incoming documents (dashed lines in the original figure) may initiate state transitions. Supporting document understanding by contextual knowledge, e.g. about the documents which are likely to arrive, would assist not only message understanding but also progress in the workflow. This is possible if a document understanding system is able to access the status of the business processes and the knowledge about the involved documents. Workflow management systems (WfMS) allow an automation of business processes by defining the steps of a process, i.e. which persons or programs should execute a step, and when and with which information a workflow activity should be started. By introducing a message concept as a contextual window into the WfMS, it can be used as an information provider for the knowledge-intensive services [30] mentioned above.
We therefore derive context information not only from databases but also from an accompanying WfMS (see [48]). To accomplish a knowledge-based and context-enabled document understanding process, a repository is introduced. It uses a frame-based knowledge representation of message concepts, all of which may be captured in business documents. Within a frame, all relevant bits of a message are described; analysis results are represented by means of instances of these concepts.
(Figure 8 shows the states of a purchase, with outgoing and incoming documents such as a call for orders, an offer, an order, a delivery note, and an invoice triggering the state transitions.)
Figure 8: Simplified scheme of a purchasing workflow.

Besides modeling static document properties, the current context of the business processes is incorporated by so-called expectations, which are specializations of message concepts already defined and may be compared to open workflow instances. For example, if we already know that we have ordered a certain list of products from a supplier company (referring to Fig. 8), an expectation describes the possible message concept "invoice", already containing information about the expected sender as well as the expected goods. Furthermore, references to preceding documents are included, such as information from the corresponding request for quotation. Naturally, expectations are connected to the descriptions of document classes, and the corresponding specialists can be applied to perform categorization, information extraction, and table analysis. Consequently, the different specialists are used to fill the various slots of the expectations by relating the intermediate analysis results to background database knowledge. Using the partially filled expectations moreover saves time because the space of possible results is limited. Several strategies can be applied here: the closed-world strategy restricts the specialists' application to expectations only, so that only results which instantiate expectations are produced; the well-rounded strategy allows results which are consistent with expectations; and the comprehensive strategy allows the generation of both results based on expectations and results based on more general concepts at the same time (see [30, 48]). In our scenario, two tasks have been identified. The first one is called process determination and is inherent in the business letter domain, because incoming documents either initiate new processes or relate to existing business processes to which they have to be assigned. The second one is entire message understanding, which requests from the document the specific information necessary for further processing in the business process.
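An illustrative frame-like expectation might look as follows: a specialization of the message concept "invoice" derived from an open order in the WfMS, whose open slots are filled by the extraction specialists while filled slots constrain the admissible results. The slot names, the example values, and the consistency check (roughly in the spirit of the well-rounded strategy) are simplified placeholders, not the actual repository schema.

```python
# Hedged sketch of expectation-driven slot filling (placeholder schema).

expectation = {                       # built from an open purchase workflow
    "concept": "invoice",
    "sender": "ACME Supplies",        # the supplier we ordered from
    "order_no": "PO-4711",            # reference to the preceding order
    "amount": None,                   # still unknown: to be filled by extraction
}

def fill_expectation(expectation, extracted):
    """Fill open slots; reject results that contradict expected values."""
    result = dict(expectation)
    for slot, value in extracted.items():
        if slot not in result:
            continue                  # not part of this message concept
        if result[slot] is not None and result[slot] != value:
            return None               # inconsistent with the expectation
        result[slot] = value
    return result

extracted = {"sender": "ACME Supplies", "order_no": "PO-4711",
             "amount": "2680,00"}
print(fill_expectation(expectation, extracted))
```

A successfully filled expectation thus serves both tasks at once: it yields the understood message and assigns the document to its business process (process determination).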
Thus the final message understanding step bridges the gap for document-based business decisions within a WfMS and shows how dynamic background knowledge from the WfMS can be used to enhance document understanding results through an expectation-driven approach. To support business practice, however, a document understanding system needs deep integration into the existing platforms and sources.
4. Conclusions

Document understanding is a challenging field. It attempts to combine the different dimensions of data in order to filter out the relevant semantics. Given the inherent variety of document presentation, it is essential to interweave complementary techniques to extract information from the written text and its presentation, the logical structure of a message, the context, knowledge of the language, and knowledge about the world. This paper has sketched a multitude of components, i.e. specialists, that have been developed in recent years at DFKI and have been successfully applied to practical problems, in which their positive impact on office automation has been proven, even individually. I have furthermore envisioned the possibilities that evolve from combining these specialists into a document analysis framework able to handle a large variety of heterogeneous documents. In designing such a complex system, the primary intention is beneficial integration into business environments. The economic potential of such a system is enormous: on the one hand, the process of filing and indexing is supported in order to save valuable work time (e.g. categorization or workflow support); on the other hand, it enhances the quality of retrieval and enables additional services. Human beings also follow a multitude of techniques when reading documents. Thus a new challenge for document understanding is to further investigate the opportunistic combination of different analysis strategies (e.g. the sequence in which specialists are to be applied). In this sense, my hope is that our work at DFKI inspires other groups to build more systems for flexible and widespread applications.
Acknowledgements

I would like to cordially thank Thomas Kieninger, Markus Junker, and Heiko Maus for their contributions to this paper. I further appreciate the valuable comments and suggestions of Larry Spitz.
References

[1] M. Aiello, C. Monz and L. Todoran: Combining linguistic and spatial information for document analysis. In: J. Mariani and D. Harman (eds), Proc. RIAO 2000 Content-Based Multimedia Information Access, CID, 2000, pp 266-275.
[2] M. Aiello, C. Monz, L. Todoran and M. Worring: Document understanding for a broad class of documents. International Journal on Document Analysis and Recognition, Springer, 2002.
[3] A. Anjewierden: AIDAS: Incremental Logical Structure Discovery in PDF Documents. Proc. ICDAR 01, Seattle, WA, USA, 2001.
[4] R. Baeza-Yates and B. Ribeiro-Neto: Modern Information Retrieval. Addison-Wesley, 1999.
[5] A. Bagdanov and M. Worring: Fine-grained Document Genre Classification using First Order Random Graphs. Proc. ICDAR 01, Seattle, WA, USA, 2001.
[6] F. Cesarini, M. Gori, S. Marinai and G. Soda: Informys: a flexible invoice-like form-reader system. IEEE Trans. PAMI 20(7), pp 730-746, 1998.
[7] A. Dengel and F. Dubiel: Logical labeling of document images based on form layout features. Proc. 1997 Workshop on Document Image Analysis (DIA'97), pp 26-31, Ulm, Germany, 1997.
[8] A. Dengel and K. Hinkelmann: The Specialist Board - A Technology Workbench for Document Analysis and Understanding. Proc. 2nd World Conference on Integrated Design and Process Technology (IDPT '96), Austin, TX, USA, pp 36-47, 1996.
[9] A. Dengel and B. Klein: smartFIX: A Requirements-Driven System for Document Analysis and Understanding. Proc. 5th IAPR International Workshop on Document Analysis Systems (DAS 2002), pp 433-444, Princeton, NJ, USA, 2002.
[10] J.L. Goldberg: CDM: an Approach to Learning in Text Categorization. International Journal on Artificial Intelligence Tools 5(1-2), pp 229-253, 1996.
[11] E. Green and M. Krishnamoorthy: Model-Based Analysis of Printed Tables. Proc. ICDAR 95, Montreal, Canada, 1995.
[12] J. Hu, R. Kashi, D. Lopresti and G. Wilfong: Medium-Independent Table Detection. Proc. 7th SPIE Conference on Document Recognition, San Jose, CA, USA, January 2000.
[13] M. Hurst: Layout and Language: Exploring Text Block Discovery in Tables using Linguistic Resources. Proc. ICDAR 01, Seattle, WA, USA, 2001.
[14] D.J. Ittner, D.D. Lewis and D.D. Ahn: Text Categorization of Low Quality Images. Proc. 4th Annual Symposium on Document Analysis and Information Retrieval (SDAIR 95), Las Vegas, NV, USA, pp 1-13, 1995.
[15] T. Joachims: Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Proc. 10th European Conference on Machine Learning (ECML 98), pp 137-148, Chemnitz, Germany, 1998.
[16] T. Joachims: Transductive Inference for Text Classification using Support Vector Machines. Proc. 16th International Conference on Machine Learning (ICML 99), Bled, Slovenia, pp 200-209, 1999.
[17] M. Junker and R. Hoch: Evaluating OCR and Non-OCR Text Representations for Learning Document Classifiers. Proc. 4th International Conference on Document Analysis and Recognition (ICDAR 97), Ulm, Germany, pp 1060-1066, 1997.
[18] M. Junker and R. Hoch: An Experimental Evaluation of OCR Text Representations for Learning Document Classifiers. International Journal on Document Analysis and Recognition 1(2), pp 116-122, 1998.
[19] M. Junker: Lernen heuristischer Regeln für die Textkategorisierung. PhD thesis, University of Kaiserslautern, Germany, 2001 (in German).
[20] M. Junker, A. Abecker and A. Dengel: Text Categorization Using Learned Document Features. In: Hybrid Methods in Pattern Recognition, World Scientific, 2002.
[21] T. Kieninger and A. Dengel: Applying the T-Recs Table Recognition System to the Business Letter Domain. Proc. ICDAR 01, Seattle, WA, USA, 2001.
[22] B. Klein, S. Gökkus, T. Kieninger and A. Dengel: Three Approaches to "Industrial" Table Spotting. Proc. ICDAR 01, pp 513-517, Seattle, WA, USA, 2001.
[23] S. Klink, A. Dengel and T. Kieninger: Document structure analysis based on layout and textual features. Proc. 4th IAPR International Workshop on Document Analysis Systems (DAS 2000), pp 99-111, Rio de Janeiro, Brazil, 2000.
[24] K.H. Lee, Y.C. Choy and S.B. Cho: Geometric structure analysis of document images: a knowledge-based approach. IEEE Trans. PAMI 22(11), pp 1224-1240, 2000.
[25] D.D. Lewis and M. Ringuette: A Comparison of Two Learning Algorithms for Text Categorization. Proc. 3rd Annual Symposium on Document Analysis and Information Retrieval (SDAIR 94), Las Vegas, NV, USA, pp 81-93, 1994.
[26] X. Li and P.A. Ng: A document classification and extraction system with learning ability. Proc. ICDAR 99, pp 197-200, Bangalore, India, 1999.
[27] J. Lladós, F. Lumbreras, V. Chapaprieta and J. Queralt: Identity Card Automatic Reader. Proc. ICDAR 01, Seattle, WA, USA, 2001.
[28] D. Lopresti and G. Nagy: Automated table processing: an (opinionated) survey. Proc. IAPR Workshop on Graphics Recognition (GREC 99), Jaipur, India, pp 109-134, September 1999.
[29] D. Malerba, F. Esposito, F. Lisi and O. Altamura: Automated Discovery of Dependencies Between Logical Components in Document Image Understanding. Proc. ICDAR 01, Seattle, WA, USA, 2001.
[30] H. Maus: Workflow Context as a Means for Intelligent Information Support. In: V. Akman, P. Bouquet, R. Thomason and R.A. Young (eds), Modeling and Using Context, Proc. 3rd International and Interdisciplinary Conference (CONTEXT 2001), Springer LNAI 2116, Dundee, UK, July 2001.
[31] A. McCallum, R. Rosenfeld, T. Mitchell and A.Y. Ng: Improving Text Classification by Shrinkage in a Hierarchy of Classes. Proc. 15th International Conference on Machine Learning (ICML 98), Madison, WI, USA, pp 359-367, 1998.
[32] L. Monson: Classifying Text with ID3 and C4.5. Dr. Dobb's Journal 22(10), pp 117-119, 1997.
[33] I. Moulinier, G. Raskinis and J.-G. Ganascia: Text Categorization: a Symbolic Approach. Proc. 5th Annual Symposium on Document Analysis and Information Retrieval (SDAIR 96), Las Vegas, NV, USA, pp 87-99, 1996.
[34] D. Niyogi and S.N. Srihari: Using domain knowledge to derive the logical structure of documents. Proc. SPIE Document Recognition and Retrieval III, pp 114-125, San Jose, CA, USA, 1996.
[35] M.A. Rahgozar and R. Cooperman: A Graph-Based Table Recognition System. Proc. 3rd SPIE Conference on Document Recognition, San Jose, CA, USA, January 1996.
[36] D. Rus and K. Summers: Using White Space for Automated Document Structuring. Technical Report TR 94-1452, Department of Computer Science, Cornell University, 1994.
[37] G.I. Sainz Palmero and Y.A. Dimitriadis: Structured document labeling and rule extraction using a new recurrent fuzzy-neural system. Proc. ICDAR 99, pp 181-184, Bangalore, India, 1999.
[38] D. Sankoff and J.B. Kruskal: Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. 1983.
[39] J.H. Shamilian, H.S. Baird and T.L. Wood: A Retargetable Table Reader. Proc. 4th International Conference on Document Analysis and Recognition (ICDAR 97), Ulm, Germany, 1997.
[40] L. Todoran, M. Aiello, C. Monz and M. Worring: Logical structure detection for heterogeneous document classes. In: P.B. Kantor, D.P. Lopresti and J. Zhou (eds), Proc. SPIE vol. 3407, Document Recognition and Retrieval VIII, pp 99-111, San Jose, CA, USA, 2001.
[41] S. Tsujimoto and H. Asada: Major components of a complete text reading system. Proceedings of the IEEE 80(7), pp 1133-1149, 1992.
[42] S. Tupaj, Y. Shi and C. Hwa Chang: Extracting Tabular Information From Text Files. Available at http://www.ee.tufts.edu/~hchang/paper1.ps, 1996.
[43] B. Waked, C. Suen and S. Bergler: Segmenting Document Images using Diagonal White Runs and Vertical Edges. Proc. ICDAR 01, Seattle, WA, USA, 2001.
[44] H. Walischewski: Automatic knowledge acquisition for spatial document interpretation. Proc. 4th International Conference on Document Analysis and Recognition (ICDAR 97), pp 243-247, Ulm, Germany, 1997.
[45] Y. Wang, R. Haralick and I. Phillips: Zone Content Classification and its Performance Evaluation. Proc. ICDAR 01, Seattle, WA, USA, 2001.
[46] T. Watanabe and T. Fukumura: A Framework for Validating Recognized Results in Understanding Table-form Document Images. Proc. ICDAR 95, Montreal, Canada, 1995.
[47] C. Wenzel and W. Tersteegen: Precise Table Recognition by Making Use of Reference Tables. In: S.-W. Lee and Y. Nakano (eds), Lecture Notes in Computer Science, pp 283-294, Springer, August 1999.
[48] C. Wenzel and H. Maus: Leveraging corporate context within knowledge-based document analysis and understanding. International Journal on Document Analysis and Recognition 3(4), pp 248-260, 2001 (special issue on document analysis for office systems).
[49] Y. Yang and X. Liu: A Re-Examination of Text Categorization Methods. Proc. 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 99), pp 42-49, Berkeley, CA, USA, 1999.