Content-free Document Genre Classification using First Order Random Graphs
Andrew D. Bagdanov
[email protected]
Marcel Worring
[email protected]
Intelligent Sensory Information Systems, University of Amsterdam
Kruislaan 403, 1098 SJ Amsterdam, The Netherlands
Keywords: Document analysis, document understanding, genre classification, random graphs
Abstract
We approach the general problem of machine-printed document genre classification using content-free layout structure analysis. Document genre is determined from the layout structure detected in scanned binary images of the document pages, using no OCR results and minimal a priori knowledge of document logical structures. Our approach uses attributed relational graphs (ARGs) to represent the layout structure of document instances, and first order random graphs (FORGs) to represent document genres. In this paper we develop our FORG-based genre classification method and present a comparative evaluation between our technique and a variety of statistical pattern classifiers. FORGs are capable of modeling common layout structure within a document genre and are shown to outperform traditional pattern classification techniques when fine-grained genre distinctions must be drawn.
1
Introduction
The task of a complete document analysis system is to take a binarized scan of a paper document and accurately reconstruct the logical meaning of the document for indexing in a document retrieval system. The process is typically divided into two tasks: document image analysis (DIA) and document image understanding (DIU). Document image analysis includes scanning, binarization, image enhancement, segmentation, and region identification. Document understanding encompasses all of the procedures used to transform layout components detected in the DIA process into an accurate representation of the semantic content of the document.
The general unrestricted problem of document understanding is extremely difficult. One cause of this difficulty is the wide diversity of document types, or genres, within the domain of document processing systems. Even in the restricted domain of machine-printed documents, there exists a daunting variety of document types.
The process of transforming a printed document into a digital representation suitable for indexing can be thought of as the inverse of document authoring. There is a natural symmetry between the layout transformation functions in the document typesetting process and the layout segmentation and analysis modules in the document image analysis component. Therefore, an accurate model of document authoring can be used as a basis for a model of document understanding. Figure 1 shows the entire lifecycle of a document, starting with the document authoring process and ending with the indexing of the document in a document retrieval system.
The complexity of modern document analysis systems is such that incremental improvements in individual components have little overall effect on the performance of the entire system. Many researchers are turning to model-directed document processing in order to obtain highly accurate solutions to document understanding problems in restricted domains [2, 8]. Researchers have been able to construct and tune models to solve difficult problems in table understanding [2], business letter analysis [5], character recognition [8], office mail flow automation [16], and postal automation [15, 12]. These specialized solutions, however, leave the general problem of document understanding unchanged, since the same diversity of documents continues to move through a typical office workflow. A central problem in document image understanding then becomes the automatic determination of document genre, so that an appropriate model can be selected for further processing.
Figure 1: The document lifecycle. (Diagram labels: Document Authoring, Genre Selection, Document Typesetting, Proto-document selection, Content Mapping, Layout Mapping, Document Imaging, Conceptual Document, Genre Classification, Document Retrieval System, Logical Structure Analysis, Layout Structure Analysis, Paper Documents, Layout Segmentation, Document Image Understanding, Document Acquisition, Document Image Analysis.)
Document genre classification based on unstructured text is a well known topic in the information retrieval community [13], even for text documents containing OCR generated errors [9]. Since knowledge of document genre can guide much of the document understanding process, genre classification based on minimal logical information extraction is particularly desirable. Specifically, we are interested in the automatic determination of document genre before any OCR results are available, so traditional text retrieval approaches are not suitable.
In most document analysis systems the genre classification component is considered to be an intrinsic part of the logical information extraction phase. We take the view that genre classification plays the important role of bridging the gap between layout analysis and logical information analysis. So, as shown in Figure 1, we split genre detection out of the logical analysis module, and place it on the line separating document image analysis from document image understanding.
The rest of this paper is organized as follows. In the next section we define document genre in detail and discuss various requirements and approaches to automatic genre classification. Section 3 describes the first order random graph classifier in detail. A description of the test data used for our experiments is given at the beginning of Section 4 and is followed by
a comparative analysis between our algorithm and a number of statistical classification methods.
2
Document Genre and its Detection
Written communication is highly structured. This structure is necessary so that the semantic intent of the author can be accurately and efficiently reconstructed by the reader. In addition to linguistic structures, written communication employs both logical and physical structure to organize the information content of a message. In most forms of written communication there are specific logical information elements that are expected to be present. For example, a business letter is expected to have a date, from–address, to–address, opening, body, and closing. These logical entities may be atomic or compound, but must conform to generic rules for logical construction. The presence of these logical information elements, however, is not typically sufficient for a reader to reconstruct the intended message of the document. The logical content of the document must be transformed into physical representations of the logical structures selected by the author. This layout function serves to translate the logical, semantic intent of the author into a visual representation. Given an idea or message that an author wishes to convey, there are established forms and protocols for structuring the message content so that it can be
effectively communicated. There are strong logical and physical structures that, if accurately detected, are clear indicators of the document genre. Since the measurement of logical structures is highly dependent on the document genre itself, we concentrate our effort on detecting document genre based on measurable layout structures. This idea is closely related to the concept of document function as introduced by Doermann et al. [6]. We adopt a broad definition of document genre:
Definition 1 A document genre is a category of documents characterized by similarity of function, expression, style, form, or content.
It is important to realize that there is no single, universal partitioning of the universe of paper documents into a set of discrete genres. Document genre is intrinsically use-specific. Consider Table 1. Document genre can be defined in a top-down fashion, where the DIA/DIU application user specifies the top-level (functional) document genres that the system must handle. The logical goals are then defined as the logical information that should be extracted from documents of each type. The refinement of functional genres into sub-genres allows for the specification of unique strategies for the genre-specific extraction of logical information. Given this definition of document genre, we can think of document genre distinctions as being coarse-grained, as in the case of the Functional Genre column of Table 1, or fine-grained, as in the Sub-genre column.
2.1
Related Work
The approaches described in the literature can be loosely grouped into two broad categories: statistical/distance-based systems and rule-based methods. Hybrid techniques that combine elements of statistical and knowledge-based approaches are of interest, but are not considered at this time.
2.1.1
Statistical/Distance-Based Approaches
Techniques in this category describe document images using extracted feature vectors based on global image characteristics or local, textural structure. They are typically content-free, and use a distance metric based on a statistical model to perform classification. Shin and Doermann [14] have developed a method for classifying page images based on similarity of layout structures. The technique extracts a number of page-level as well as texture features from a document image and uses the OC1 decision tree classifier [10]. No OCR results are used for classification. It is applied to a coarse-grained genre classification problem, and an estimated accuracy of 88.51% is reported.
Hu et al. [7] report on a distance-based genre classification system that uses intervals of consistent texture from a document image. A distance metric based on string edit distance is given, as well as a document genre model based on Hidden Markov Models (HMMs). A classification accuracy of 84% is reported for the classifier based on HMMs.
For traditional statistical classifiers, a fixed-length feature vector must be used to represent each document instance. Global features must be extracted from each page, and this process misses much of the local layout structure of document images. This makes statistical classifiers less suitable for a fine-grained genre classification problem.
2.1.2
Rule-based Systems
Rule-based approaches to the genre classification problem use a set of typographical (and/or logical content) rules to construct a knowledge base for each document genre under consideration. They are typically content-based. A rule-based system may use grammars to model document structure, or a more explicit, PROLOG-like description of rules for document layout description. In this respect, this class of techniques could be called knowledge-based approaches.
Cesarini et al. [4] describe a method based on two types of knowledge representation: Class Independent Domain Knowledge (CIDK) and Class Dependent Domain Knowledge (CDDK). CIDK is knowledge about all documents under consideration, while CDDK is knowledge about documents of a specific genre. No experimental results about genre classification accuracy are provided.
Niyogi and Srihari [11] detail a method for logical information extraction that is based on a syntactic model of document logical structure. Two types of rules are used in the system. Control rules regulate the invocation of all other knowledge rules, while strategy rules guide the derivation of document logical structure in a more general way. Experimental results for this technique are encouraging, with a block classification accuracy of about 90% reported on scanned newspaper images.
A novel approach for categorizing documents is presented by Doermann et al. [6]. The idea is to pre-categorize documents by function before any logical information extraction is performed. The authors define three categories of document function: reading, browsing, and searching. Experiments using document function to classify document pages into title, reference, or body were performed. A classification accuracy of 90% is reported.
Rule-based systems are appealing because the process of manually constructing rules for detecting document genre is quite natural. Automatic inference of rules given sample images from a document genre is extremely difficult, however, and much user interaction during the training phase is required.
Functional Genre  | Sub-genres                                    | Logical Goals
Technical Article | IEEE PAMI Article, CVIU Article, CACM Article | Author names, data publ date, section headings, abstract
Business Letter   | Internal Memo, Invoice, Demand Letter         | To–address, from–address, required action
Formal Text       | Program Listing, Chess Listing, Recipe        | Executable sequence of machine readable instructions
Newspaper         | NY Times, De Telegraaf, USA Today             | Individual articles, author/wire service photos, captions
Directory         | Telephone Book, Street Index, Classified Ads  | Extract individual list items, insert into DBMS
Table 1: Document genre, sub-genre, and logical goals.
2.2
Our Approach
We have developed a technique for performing document genre classification based on layout structure. Our method is a hybrid technique that uses attributed relational graphs to model the layout structure of documents, and first order random graphs to model document genres. The technique is content-free (i.e. uses no OCR results), and performs classification on the layout structure of documents rather than on textural features of document images. As discussed above, our method is particularly suited for making fine-grained genre discriminations.
3
First Order Random Graphs
In this section we discuss the use of random graphs as an abstract classification technique. Random graphs have been used to model a number of recognition problems in which the structure of the observed objects is intrinsic to correct classification. While graph matching has been a common tool in structural pattern recognition [3], the method of random graphs is a hybrid method, as it introduces statistical uncertainty into the matching process. Recently Alquézar et al. [1] have proposed a method using random graphs to perform occluded face identification. We follow the development of Wong et al. [17], adopt some of the terminology from Alquézar [1] in the following discussion, and simplify the notation to suit our purposes. Since we are primarily concerned with the application of random graphs to the problem of document genre classification, we omit much of the detail in describing random graphs as an abstract classification technique. The interested reader should consult Wong [17] for additional details.
3.1
Modeling Document Layout
Our method uses attributed relational graphs (ARGs) as a structural representation of document layout. An ARG is defined as follows:
Definition 2 An attributed relational graph, $G$, over $L = (L_v, L_e)$ is a 4-tuple $(V, E, m_v, m_e)$, where $V$ is a set of vertices, $E \subseteq V \times V$ is a set of edges, $m_v : V \to L_v$ is the vertex interpretation function, and $m_e : E \to L_e$ is the edge interpretation function.
In the above definition $L_v$ and $L_e$ are known as the vertex attribute domain and edge attribute domain respectively. For representing document images, we define the vertex attribute domain to be the vector space of text zone features. A document $D_i$ is described by a set of text zone feature vectors as follows:
$$D_i = \{z^i_1, \ldots, z^i_{n_i}\}, \quad \text{where } z^i_j = (x^i_j, y^i_j, w^i_j, h^i_j, s^i_j, t^i_j).$$
In the above definition of a text zone feature vector,
• $x^i_j, y^i_j$ denote the position of the center of the text zone,
• $w^i_j$ and $h^i_j$ the width and height of the zone,
• $s^i_j$ denotes the average pointsize of text in the zone, and
• $t^i_j$ the number of textlines in the zone.
Each vertex in the ARG corresponds to a text zone in the segmented document image. Edges in our ARG representation of document images are defined over the null feature space. The presence of an edge between two nodes is used to indicate the Voronoi neighbor relation. We use the Voronoi neighbor relation, which is easily obtained from the Delaunay triangulation of the text zone pointset, to simplify our structural representation of document layout. We are interested in modeling the relationship between neighboring text zones only, and use the Delaunay triangulation to identify the important structural relationships within a document. Algorithm 1 details the ARG construction procedure for document images.
Algorithm 1 Construct ARG from document layout representation
Input: A document $D = \{z_1, \ldots, z_k\}$
Output: An ARG $G = (V, E, m_v, m_e)$
Let $V = E = m_v = m_e = \emptyset$
for all $z_i \in D$ do
  Assume $z_i = (x_i, y_i, w_i, h_i, s_i, t_i)$
  $V = V \cup \{v_i\}$ {i.e. one vertex for each zone}
  Let $m_v(v_i) = z_i$
end for
Compute Delaunay triangulation $T = (V_T, E_T)$ of zone coordinates in $m_v(V)$
for all $(v_i, v_j) \in E_T$ do
  $E = E \cup \{(v_i, v_j)\}$ {Delaunay neighbors}
end for
return $G$
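The ARG construction above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Zone`, `circumcircle`, `delaunay_edges`, and `build_arg` are hypothetical, and the Delaunay neighbors are found by a brute-force empty-circumcircle test, which is only practical for the few dozen text zones found on a typical page and assumes the zone centers are in general position.

```python
from itertools import combinations
from typing import NamedTuple


class Zone(NamedTuple):
    """Text zone feature vector (x, y, w, h, s, t); the class name is hypothetical."""
    x: float  # x coordinate of the zone center
    y: float  # y coordinate of the zone center
    w: float  # zone width
    h: float  # zone height
    s: float  # average point size of text in the zone
    t: int    # number of textlines in the zone


def circumcircle(a, b, c):
    """Return (center_x, center_y, radius_squared), or None if a, b, c are collinear."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return ux, uy, (ax - ux) ** 2 + (ay - uy) ** 2


def delaunay_edges(points):
    """Collect the edges of every triangle whose circumcircle holds no other point."""
    edges = set()
    for i, j, k in combinations(range(len(points)), 3):
        circle = circumcircle(points[i], points[j], points[k])
        if circle is None:
            continue
        ux, uy, r2 = circle
        others = (p for m, p in enumerate(points) if m not in (i, j, k))
        if all((px - ux) ** 2 + (py - uy) ** 2 >= r2 - 1e-9 for px, py in others):
            edges.update({(i, j), (i, k), (j, k)})
    return edges


def build_arg(zones):
    """Return (V, E, m_v): vertex ids, Delaunay neighbor edges, vertex attributes."""
    vertices = list(range(len(zones)))
    m_v = dict(enumerate(zones))  # vertex interpretation function as a dict
    edges = delaunay_edges([(z.x, z.y) for z in zones])
    return vertices, edges, m_v
```

The edge interpretation function is omitted because edges carry no attributes in this representation. Pages with fewer than three zones, or with all zone centers collinear, yield an empty edge set in this sketch.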
3.2
Modeling Document Genres
Given the above procedure for constructing an attributed graph from a segmented document image, a classification procedure could be based on matching an unclassified ARG against a library of classified prototype ARGs using standard graph matching techniques. Such a procedure is likely to be expensive and inaccurate, however, as graph matching is known to be difficult, and inexact graph matching techniques do not allow for much deviation from observed prototypes (thus requiring a large number of prototypes for each class). A random attributed relational graph (RARG) is essentially identical to an ARG, except that the vertex and edge interpretation functions do not take determined values, but vary randomly over the vertex and edge attribute domains according to some estimated density.
Definition 3 A random attributed relational graph, $R$, over $L = (L_v, L_e)$ is a 4-tuple $(V, E, \mu_v, \mu_e)$, where $V$ is a set of vertices, $E \subseteq V \times V$ is a set of edges, $\mu_v : V \to \Pi$ is the vertex interpretation function, and $\mu_e : E \to \Theta$ is the edge interpretation function. $\Pi = \{\pi_1, \ldots, \pi_{|V|}\}$ and $\Theta = \{\theta_1, \ldots, \theta_{|E|}\}$ are sets of random variables taking values in $L_v$ and $L_e$ respectively.
An ARG obtained from a RARG by instantiating all vertices and edges is called an outcome graph of the RARG. The joint probability distribution of all random elements induces a probability measure over the space of all outcome graphs, and by characterizing all pattern classes using a random graph we can construct a classification procedure. Estimation of this joint probability density, however, quickly becomes intractable for even moderately sized graphs, and Wong [17] introduces the following simplifying assumptions:
1. The random vertices $\pi_i$ are mutually independent.
2. Given values for each random vertex, the random edges $\theta_i$ are mutually independent.
3. A random edge $\theta$ is independent of all random vertices other than its endpoints.
Random attributed relational graphs satisfying these conditions are called first order random graphs, or FORGs.
Given an ARG $G = (V, E, m_v, m_e)$ and a FORG $R = (N, A, \mu_v, \mu_e)$, we can compute the probability that $G$ is an outcome graph of $R$. As in Wong [17], we assume without loss of generality that $R$ and $G$ are structurally isomorphic. Recall, however, that we must establish a common labeling between the vertices of $R$ and $G$. An arbitrary isomorphism, $\phi$, between $G$ and $R$, mapping each vertex of $G$ to a random vertex of $R$, serves to "orient" the random graph to the ARG whose probability of outcome we wish to compute. We can now define the probability that $G$ is an outcome graph of $R$ in terms of a vertex factor:
$$V_R(G, \phi) = \prod_{v_i \in V} p_{\phi(v_i)}(m_v(v_i)),$$
and an edge factor:
$$E_R(G, \phi) = \prod_{(v_i, v_j) \in E} p_{(\phi(v_i), \phi(v_j))}(m_v(v_i), m_v(v_j)),$$
for an arbitrary orientation $\phi$, where $p_{\phi(v_i)}$ and $p_{(\phi(v_i), \phi(v_j))}$ denote the densities of the corresponding random vertex and random edge of $R$. The probability that $G$ is an outcome graph of $R$ given orientation $\phi$ is $P_R(G, \phi) = V_R(G, \phi) \times E_R(G, \phi)$, and the probability that $G$ is an outcome graph of $R$ is then given by:
$$P_R(G) = \max_{\phi} \big( V_R(G, \phi) \times E_R(G, \phi) \big). \tag{1}$$
This probability estimate is central to the training of FORGs and also the FORG classification procedure.
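As an illustration of how the vertex and edge factors combine for one fixed orientation, the following Python sketch evaluates their product in log space to avoid numerical underflow. It is a sketch under stated assumptions, not the authors' code: the function names, the dictionary-based representation of the FORG densities, and the `gaussian_pdf` helper are all hypothetical.

```python
import math


def safe_log(p):
    """Natural log that maps a zero density to -infinity instead of raising."""
    return math.log(p) if p > 0.0 else float("-inf")


def log_outcome_probability(vertices, edges, m_v, phi, vertex_pdf, edge_pdf):
    """Log of (vertex factor x edge factor) for a fixed orientation phi.

    vertices   -- vertex ids of the ARG
    edges      -- iterable of (a, b) vertex id pairs of the ARG
    m_v        -- dict: ARG vertex id -> attribute value
    phi        -- dict: ARG vertex id -> FORG random vertex id
    vertex_pdf -- dict: FORG random vertex id -> density function of one value
    edge_pdf   -- dict: (FORG id, FORG id) -> density function of two values
    """
    log_vertex_factor = sum(safe_log(vertex_pdf[phi[v]](m_v[v])) for v in vertices)
    log_edge_factor = sum(
        safe_log(edge_pdf[(phi[a], phi[b])](m_v[a], m_v[b])) for a, b in edges
    )
    return log_vertex_factor + log_edge_factor


def gaussian_pdf(mean, sigma):
    """One-dimensional Gaussian density, used here only as a stand-in model."""
    def pdf(x):
        z = (x - mean) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return pdf
```

Maximizing this quantity over orientations gives the logarithm of the outcome probability; the sketch leaves the choice of orientation to the caller.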
3.3
Genre Classification using FORGs
This section describes the training and classification procedures used for construction of a classifier based on first order random graphs. We consider only the case where there is no unique labeling of vertices between pattern graphs, as this is the case in our application. The lack of a consistent labeling between pattern primitives complicates the training and classification procedures by introducing the need to establish an isomorphism between an unclassified ARG and each FORG in the set of pattern classes.
3.3.1
Training
In order to construct a FORG-based classifier, we must train a FORG to represent each genre in the classification task. Assume the entire pattern space is partitioned into $n$ classes, $\omega_1, \ldots, \omega_n$. For a given class, $\omega_i$, we construct a FORG to model the training samples using essentially the same hierarchical entropy minimization technique for unlabeled ARGs as proposed by Wong [17]. We use a mixture-of-Gaussians density estimator for estimating the densities of each of the random vertices in a FORG.
The main difference in our FORG construction procedure is in the establishment of a common vertex labeling between the training ARGs within a class. Ideally we would like to select the orientation $\phi$ that minimizes the increase in Shannon entropy of the random elements of the random graph $R$. Unfortunately the problem of determining the optimal isomorphism between two FORGs is NP-hard. Since we only use edges in FORGs to represent the gross structure of the training graphs, we have implemented a technique that determines a sub-optimal graph orientation based only on minimization of the entropy of the vertex components. Our graph orientation procedure uses the maximum-weight matching in a bipartite graph to achieve this, and is detailed in Algorithm 2.
Algorithm 2 Compute sub-optimal orientation of $G$ w.r.t. $R$
Input: A FORG $R = (V_R, E_R, \mu_v, \mu_e)$, an ARG $G = (V_G, E_G, m_v, m_e)$. Assume $|V_R| = |V_G| = n$
Output: A sub-optimal orientation of $G$ w.r.t. $R$
Construct the bipartite graph $B = \bar{K}_n \times \bar{K}_n$ (every $v_i \in V_G$ joined to every $v_j \in V_R$) with $w((v_i, v_j)) = p_{v_j}(m_v(v_i)) \log p_{v_j}(m_v(v_i))$
Compute the maximum weight matching $M$ on $B$.
return $\phi : V_G \to V_R$ defined as: $\phi(v_i) = v_j \iff (v_i, v_j) \in M$
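The orientation step can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the names `plogp`, `suboptimal_orientation`, and `gaussian_pdf` are hypothetical, the convention 0 log 0 = 0 is an assumption, and the maximum-weight matching is found by exhaustive search over permutations, which is only feasible for very small graphs. A practical version would substitute a polynomial-time assignment solver such as `scipy.optimize.linear_sum_assignment` with `maximize=True`.

```python
import math
from itertools import permutations


def plogp(p):
    """Edge weight p * log(p); 0 * log(0) is taken to be 0 (an assumption)."""
    return p * math.log(p) if p > 0.0 else 0.0


def suboptimal_orientation(g_vertices, r_vertices, m_v, vertex_pdf):
    """Return phi: ARG vertex -> FORG vertex maximizing the total matching weight.

    g_vertices -- vertex ids of the ARG G
    r_vertices -- random vertex ids of the FORG R (same count as g_vertices)
    m_v        -- dict: ARG vertex id -> attribute value
    vertex_pdf -- dict: FORG vertex id -> density function
    """
    if len(g_vertices) != len(r_vertices):
        raise ValueError("G and R must have the same number of vertices")
    # Weight of each edge (v_i, v_j) in the complete bipartite graph.
    weight = {
        (v, r): plogp(vertex_pdf[r](m_v[v])) for v in g_vertices for r in r_vertices
    }
    # Brute force: each permutation of R's vertices is one perfect matching.
    best = max(
        permutations(r_vertices),
        key=lambda perm: sum(weight[(v, r)] for v, r in zip(g_vertices, perm)),
    )
    return dict(zip(g_vertices, best))


def gaussian_pdf(mean, sigma):
    """One-dimensional Gaussian density, used here only as a stand-in model."""
    def pdf(x):
        z = (x - mean) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return pdf
```

Only the vertex densities enter the weights, which mirrors the decision to ignore edge entropy when orienting graphs.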
3.3.2
Classification using FORGs
Given a set of FORGs representing pattern classes and an ARG describing an unclassified object, the classification procedure is quite simple. For an unclassified sample, the procedure computes the sub-optimal orientation of the sample with respect to each FORG according to Algorithm 2. The FORG which generates the unclassified ARG with maximum likelihood according to equation 1 is selected.
Algorithm 3 Document genre classification by FORG
Input: $D$, a document of unknown genre; $\mathcal{R} = \{R_1, \ldots, R_n\}$, a set of FORGs representing document genres.
Output: $\omega_i$ for some $i \in \{1, \ldots, n\}$
Convert $D$ to an ARG $G$ using Algorithm 1.
for all $R_i \in \mathcal{R}$ do
  Compute orientation $\phi_i$ of $G$ w.r.t. $R_i$ by Algorithm 2
end for
$k = \arg\max_i P_{R_i}(G, \phi_i)$
return $\omega_k$
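The classification loop can be written as a short generic Python sketch. The function name `classify` and the two callables `orient` and `log_prob` are hypothetical placeholders for an orientation routine and a log-probability routine; they are passed in as parameters so that the sketch makes no claim about how the authors implemented either step.

```python
def classify(arg, forgs, orient, log_prob):
    """Return the genre whose FORG gives the sample the highest log-probability.

    arg      -- the ARG of the unclassified document
    forgs    -- dict: genre label -> FORG model (any object)
    orient   -- callable (arg, forg) -> orientation phi
    log_prob -- callable (arg, forg, phi) -> log-probability of arg under forg
    """
    best_genre, best_score = None, float("-inf")
    for genre, forg in forgs.items():
        phi = orient(arg, forg)            # sub-optimal orientation per genre
        score = log_prob(arg, forg, phi)   # likelihood under that orientation
        if best_genre is None or score > best_score:
            best_genre, best_score = genre, score
    return best_genre
```

Ties are broken in favor of the genre encountered first, which is an arbitrary choice made for this sketch.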
4
Test Data and Experimental Results
A number of experiments were performed to evaluate the performance of several genre classification strategies in comparison to the FORG-based genre classifier. This section describes the test data and experimental results.
4.1
Test Data
The test data used in all experiments is a set of 150 documents sampled from the Océ Competitive Business Archive (CBA). The collection contains sample documents from trade journals and product brochures. The sample consists of five genres, two of which have four subgenres. For our experiments we treat each subgenre as a genre in its own right, giving a total of 11 fine-grained genres (three genres without subgenres plus 2 × 4 subgenres). Each document was scanned as a binary image file at 300 dpi, and sample documents from several genres are shown in Table 2.
All documents in the CBA sample were processed with the ScanSoft TextBridge OCR system, which produces structured output in the XDOC format. An attributed graph is constructed to represent the first page of each document as described in Section 3.1. As with the statistical classifiers, only the layout information from the first page of a document is used since it contains most of the genre-specific information.
Table 2: Document images from the CBA sample. (Sample page images shown: BLI Test Report, Dataquest Service, MFP Report.)
4.2
Classifiers Evaluated
A variety of statistical classification techniques were evaluated in order to compare the effectiveness of genre classification by first order random graphs with traditional techniques. The classifiers evaluated are:
• NN – One-nearest neighbor classifier.
• NMC – Nearest mean classifier.
• LDC – Linear discriminant classifier.
• QDC – Quadratic discriminant classifier.
• Parzen – Parzen classifier on textzone geometry.
• DT – Linear oblique decision tree classifier [10].
• FORG – First order random graph classifier.
For the statistical, feature-based classifiers evaluated, global page-level features were extracted from the first page of each document. Table 3 shows the features used for the statistical classifiers. The text size histogram features in Table 3 are computed from the normalized frequency of occurrence of text at a given pointsize. The Parzen classifier mentioned in the list of methods above is not based on the features of Table 3. It is based on the same features as the FORG classifier. The method trains a Parzen classifier on all of the textzones within a document genre.
Feature Name              | Description
num pages                 | # of pages in document
num fonts                 | # of fonts used in document
num izones                | # of detected image zones
num tzones                | # of text zones
num textlines             | # of textlines
avg pointsize             | Average pointsize of text
prop table                | % of table text
prop image                | % of image area in document
prop text                 | % of text area in document
prop inverse              | % of inverse text in document
prop italic               | % of italicized text
prop roman                | % of roman text
prop bold                 | % of bold face text
tsize hist0 ... tsize hist9 | Text size histogram
Table 3: Features extracted from the CBA sample.
Table 4 gives the classification results for each of the methods on the CBA sample. During preliminary experiments, we discovered that the majority of misclassifications were directly related to one document genre, “Journal Article/Other”. After visually inspecting the documents in this genre, we discovered that it contains a large degree of heterogeneous document structures. Table 4 provides the classification accuracy for the CBA sample including and excluding the “JA/Other” genre. The classification accuracy for each method was estimated using leave-one-out cross validation. On the CBA sample, the FORG classifier significantly outperforms purely statistical classification methods.
Classifier | Estimated Accuracy w/ JA-Other | Estimated Accuracy w/o JA-Other
NN         | 40%                            | 42%
NMC        | 30%                            | 35%
LDC        | 58%                            | 68%
QDC        | 48%                            | 55%
Parzen     | 68%                            | 77%
DT         | 68%                            | 73%
FORG       | 91%                            | 97%
Table 4: Estimated classification accuracy.
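Leave-one-out cross validation, the estimator used for the accuracies in Table 4, can be sketched generically in Python. This is an illustration only: `leave_one_out_accuracy` and the one-nearest-neighbor helpers `train_1nn` and `predict_1nn` are hypothetical names, and the toy classifier works on scalar features rather than the features used in the experiments.

```python
def leave_one_out_accuracy(samples, labels, train, predict):
    """Hold out each sample once, train on the rest, and score the held-out one."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


def train_1nn(xs, ys):
    """A one-nearest-neighbor 'model' is just the stored training pairs."""
    return list(zip(xs, ys))


def predict_1nn(model, x):
    """Return the label of the stored sample closest to x."""
    return min(model, key=lambda pair: abs(pair[0] - x))[1]
```

Any classifier can be plugged in through the `train` and `predict` callables, so the same loop applies to each method being compared.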
5
Conclusions and Future Work
The experimental results given in the previous section indicate that modeling the layout structure within a document genre is quite useful for constructing classifiers capable of making fine-grained genre distinctions. The methods based on traditional statistical classifiers suffer from the need to define and extract global, page-level features. This representation is not sufficient for the fine-grained genre classification problem. The Parzen classification technique outperforms the other statistical classifiers because it is capable of modeling the textzone consistency within a document genre. It does not perform as well as the FORG classifier because it suffers from strong assumptions of independence between textzones within a document and within a document genre.
For the future, a much larger test set is required to support a strong conclusion that genre classification using first order random graphs is significantly better than statistical methods. Additionally, experiments must be performed to determine if the FORG classification technique will generalize to the coarse-grained genre distinction problem.
One feature of modeling document genres with FORGs is that a model of the consistent layout structure within a document genre is automatically acquired. The possibility of using the FORG model to automatically construct a logical model of document genre will be investigated. Also, the automatic detection of “Other” genres based on the entropy of the FORG representing them will be studied.
Acknowledgments
This work was sponsored by the Multimedia Information Analysis project and Océ Nederland.
References
[1] R. Alquézar, A. Sanfeliu, and F. Serratosa. Synthesis of function-described graphs. In Proc. Joint IAPR Int. Workshops SSPR’98 and SPR’98, number 1451 in LNCS, pages 112–121, 1998.
[2] H. S. Baird. Model-directed document image analysis. In Proceedings of the Symposium on Document Image Understanding Technology, 1999.
[3] H. Bunke. Recent developments in graph matching. In ICPR00, volume II, pages 117–124, 2000.
[4] F. Cesarini, E. Francesconi, M. Gori, and G. Soda. A two level knowledge approach for understanding documents of a multi-class domain. In The Proceedings of the International Conference on Document Analysis and Recognition, 1999.
[5] A. Dengel, R. Bleisinger, F. Hein, R. Hoch, and F. Hönes. Officemaid–a system for office mail analysis, interpretation and delivery. In Proceedings of First International Workshop on Document Analysis Systems, pages 253–275, 1994.
[6] D. Doermann, A. Rosenfeld, and E. Rivlin. The function of documents. In Proceedings of ICDAR97, 1997.
[7] J. Hu, R. Kashi, and G. Wilfong. Comparison and classification of documents based on layout similarity. Information Retrieval, 2(2):227–243, May 2000.
[8] G. E. Kopec and P. A. Chou. Document image decoding using Markov source models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(6), June 1994.
[9] D. Lopresti and J. Zhou. Retrieval strategies for noisy text. In Symposium on Document Analysis and Information Retrieval, pages 225–270, 1995.
[10] S. K. Murthy, S. Kasif, and S. Salzberg. A system for induction of oblique decision trees. Journal of Artificial Intelligence Research, 2, August 1994.
[11] D. Niyogi and S. N. Srihari. Knowledge-based derivation of document logical structure. In The Proceedings of the International Conference on Document Analysis and Recognition, 1995.
[12] P. Palumbo, S. Srihari, J. Soh, R. Sridhar, and V. Demjanenko. Postal address block location in real time. Computer, 25(7):34–42, July 1992.
[13] G. Salton. Automatic Text Processing: The Transformation, Analysis and Retrieval of Information By Computer. Addison–Wesley, 1989.
[14] C. K. Shin and D. S. Doermann. Classification of document page images based on visual similarity of layout structures. In Proceedings of the SPIE Document Recognition and Retrieval VII, 2001.
[15] S. Srihari, C. Wang, P. Palumbo, and J. Hull. Recognizing address blocks on mail pieces: Specialized tools and problem-solving architecture. AI Magazine, 8(4):25–40, 1987.
[16] C. Wenzel, S. Baumann, and T. Jäger. Advances in document classification by voting of competitive approaches. In J. J. Hull and S. Liebowitz, editors, Advances in Document Analysis Systems. World Scientific, 1997.
[17] A. K. C. Wong, J. Constant, and M. L. You. Random graphs. In H. Bunke and A. Sanfeliu, editors, Syntactic and Structural Pattern Recognition: Theory and Applications. World Scientific, 1990.