Supervised Learning for the Legacy Document Conversion
Boris Chidlovskii, Jérôme Fuselier
Xerox Research Centre Europe, 6, chemin de Maupertuis, F–38240 Meylan, France
{chidlovskii,fuselier}@xrce.xerox.com
ABSTRACT
We consider the problem of document conversion from the rendering-oriented HTML markup into a semantic-oriented XML annotation defined by user-specific DTDs or XML Schema descriptions. We represent both source and target documents as rooted ordered trees, so the conversion can be achieved by applying a set of tree transformations. We apply the supervised learning framework to the conversion task, according to which the tree transformations are learned from a set of training examples. We develop a two-step approach to the conversion problem that first labels the leaves in the source trees and then recomposes the target trees from the leaf labels. We present two solutions based on the leaf classification with the target terminals and paths. Moreover, we develop three methods for the leaf classification. All methods and solutions have been tested on two real collections.
1. INTRODUCTION
The widespread adoption and growing maturity of XML technologies have opened new opportunities in various domains, including content and document management, publishing and multimedia. The XML markup simplifies data exchange, content reuse and repurposing, so new functionalities can be offered, while many old functions can be accomplished in a faster and more cost-effective way. However, for companies and organizations that already own large document collections, the shift toward XML often raises the serious issue of legacy document conversion. The legacy documents are often available in electronic form, in one of the visualization-oriented formats like (X)HTML, PDF or MS Word, which describe how to render the document content but carry little information on what the content is (catalogs, bills, manuals, etc.) and how it is organized. In contrast, due to its extensible tag set, the XML markup addresses the semantic-oriented annotation of the content (titles, authors, references, tools, etc.), while the rendering issues are delegated to the reuse/repurposing component, which visualizes the content, for example on different devices, with the help of appropriate XSLT scripts. The conversion process conventionally assumes a rich target model, which is given by an XML schema definition, in the form of a
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright 2004 ACM X-XXXXX-XX-X/XX/XX ...$5.00.
Document Type Definition (DTD) or by an XML Schema; the target schema describes the company or user-specific elements and attributes, as well as constraints on their usage, like the element nesting or attribute uniqueness. The conversion not only means transforming legacy documents from one format into another, but it also aims at customizing information that is not explicitly encoded in legacy documents. In general, the conversion of legacy documents into the XML markup is often referred to as a transformation from the rendering-oriented content presentation to the semantic-oriented one. If legacy documents are available in proprietary formats (PDF, Microsoft Word, etc.), they are assumed to be first converted into a standard format like HTML¹. The converters rewrite the instructions of a proprietary format into structural and layout HTML annotations, with the major goal of preserving the document rendering. The converters are capable of recognizing certain structural entities like paragraphs, tables and lists; however, their output often remains insufficient from the point of view of the target document model, as the structural markup remains essentially layout-oriented. The further conversion of layout HTML annotations into the semantic XML may be achieved by a set of structural transformations. However, because of the ambiguity of the layout annotation, the conversion task can hardly be automated without bringing into the process either domain knowledge or a substantial set of examples that would instruct a human or a computer program how to produce the transformation rules. In this paper, we adopt the supervised learning concept for the legacy document conversion; we assume that some source documents are given together with their target XML annotation. Each such pair (source document, target document) exemplifies the conversion process and forms an instance of the training set.
We consider the source and target documents as special classes of labeled rooted trees; we then build a mapping function capable of expressing the transformation from source trees to target ones, and learn the parameters of the mapping function from the documents in the training set.
Figure 1: The example CV fragment.
¹ Some converters are available on the Web and from Adobe, Microsoft and CambridgeDocs [16, 8, 13].
2. CONVERSION MODEL
leaf classification in source documents and the target tree recomposition. In Section 3 we describe methods for the leaf classification step, and in Section 4 we discuss in detail the tree recomposition step. The experiments we ran and their analysis are presented in Section 5. Section 6 surveys the prior art and Section 7 concludes the paper.
2.1 XML documents and their schemas
Figure 2: Tree representations of the source HTML (a) and target XML (b) fragments.
We consider (X)HTML or XML documents as trees where inner nodes determine the structure of the document, and the leaf nodes and the tag attributes provide the document content. HTML and XML documents can be abstracted as the class of unranked labeled rooted trees [11]. A tree is defined over an alphabet $\Sigma$ of tag names. The set of trees, denoted by $T_\Sigma$, is inductively defined as follows:
1. every $\sigma \in \Sigma$ is a tree in $T_\Sigma$ (a tree leaf);
2. if $\sigma \in \Sigma$ and $t_1, \ldots, t_n \in T_\Sigma$, $n \geq 1$, then $\sigma(t_1, \ldots, t_n)$ is a tree in $T_\Sigma$.
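The inductive definition can be illustrated with a minimal Python sketch. The class name `Tree` and its helper methods are assumptions introduced here for illustration only, and content leaves are simplified to plain labels; this is not the representation used in the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Tree:
    """Unranked labeled rooted tree: a label plus an ordered list of subtrees.

    Hypothetical helper class; the number of children is not bounded.
    """
    label: str
    children: List["Tree"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children

    def size(self) -> int:
        # Number of nodes in the subtree rooted at this node.
        return 1 + sum(child.size() for child in self.children)

    def depth(self) -> int:
        # Leaves have depth 0; an inner node is one deeper than its deepest child.
        if self.is_leaf():
            return 0
        return 1 + max(child.depth() for child in self.children)

    def leaves(self) -> List[str]:
        # Leaf labels in left-to-right document order.
        if self.is_leaf():
            return [self.label]
        return [label for child in self.children for label in child.leaves()]


# One item element of the CV example, with content leaves as plain labels.
item = Tree("item", [
    Tree("year", [Tree("2002")]),
    Tree("title", [Tree("DEA Informatique")]),
    Tree("affiliation", [Tree("Universite de Savoie")]),
])
```

Here `item.leaves()` lists the content fragments in document order, which is the view of a document that a leaf classification step would operate on.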
<!ELEMENT curriculum (domain+, education)>
<!ELEMENT domain (#PCDATA)>
<!ELEMENT education (item+)>
<!ELEMENT item (year, title, affiliation?)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT affiliation (#PCDATA)>
<!ELEMENT title (#PCDATA)>
Table 1: Example target XML schema.
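Each content model in Table 1 is a regular expression over child element names, so checking whether a node's children fit the schema can be sketched with ordinary regular expressions. The dictionary `CONTENT_MODELS` and the function `children_valid` below are hypothetical names introduced for illustration; the sketch hand-translates the DTD of Table 1 and is not a general DTD validator.

```python
import re

# Content models of Table 1, hand-translated into regular expressions over
# space-terminated child tag names. Elements declared as (#PCDATA) have no
# element children, hence the empty pattern.
CONTENT_MODELS = {
    "curriculum": r"(domain )+education ",
    "education": r"(item )+",
    "item": r"year title (affiliation )?",
    "domain": r"",
    "year": r"",
    "title": r"",
    "affiliation": r"",
}


def children_valid(tag, child_tags):
    """Return True if the sequence of child tags matches the content model of tag."""
    model = CONTENT_MODELS.get(tag)
    if model is None:
        # Tag is not declared in the schema.
        return False
    # Terminate every name with a space so that one tag name can never
    # match as a prefix of another.
    sequence = "".join(name + " " for name in child_tags)
    return re.fullmatch(model, sequence) is not None
```

For instance, `children_valid("item", ["year", "title"])` holds because `affiliation` is optional, while `children_valid("item", ["title", "year"])` does not, since the content model fixes the order of the children.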
EXAMPLE 1. Consider the conversion of Curriculum Vitae documents into the XML format defined by the target DTD in Table 1. Figure 1 shows a fragment of a student CV that reports competence domains, studies, obtained degrees, etc. The content is presented in a way that eases the visual capture of information by humans. Figure 2.a shows the corresponding HTML source; it is a tree whose nodes are layout-oriented tags, which a browser interprets to render the document content. All tree leaves (given in bold) refer to content fragments given by PCDATA nodes; these fragments are called external or content leaves. The result of the conversion into the target XML is shown in Figure 2.b. The target tree provides the semantic annotation of the CV data. The tree leaves (given again in bold) refer to the external leaves "inherited" from the source tree. Figure 2 exemplifies certain difficulties anyone will face when trying to convert the source tree into the target one; namely, most but not all external leaves are preserved after the conversion, some content elements are transformed into semantic tags, the "shape" (the nesting of elements) of the target tree does not correspond to the shape of the input tree, etc. The transformation of source HTML trees into target XML trees that should fit the schema is the subject of our study. The remainder of the paper is organized as follows. Section 2 introduces the tree-based formalism for XML documents and their schema mechanisms. It then defines the conversion problem and introduces the two-step approach to the conversion, composed of the
There is no a priori bound on the number of children of a node in a tree; such trees are therefore unranked. In a tree $\sigma(t_1, \ldots, t_n)$, node $\sigma$ is the root and $t_1, \ldots, t_n$ are its subtrees. A forest is any subset of $T_\Sigma$. For any rooted subtree $t'$ of a tree $t$, we consider the path from the root of $t$ to $t'$, the depth of $t'$ (leaf nodes have depth 0), and the size of $t'$, which equals the number of nodes in the subtree. The alphabet $\Sigma$ is predefined for HTML³ but is extensible with XML. The set of trees $T_\Sigma$ can be constrained by a schema that is defined using DTD, W3C XML Schema or Relax NG mechanisms⁴. In the following, we refer to DTDs as the schema mechanism for both source HTML and target XML files. DTDs for later HTML versions are fixed and available from the W3C site⁵. In contrast, as the target XML is semantic-oriented, the corresponding DTDs are often user-defined and domain-specific. According to [12, 15], DTDs can be modeled as extended context-free grammars, where regular expressions over the alphabet $\Sigma$ are constructed using the two basic operators of concatenation and disjunction, together with the occurrence operators $*$ (Kleene closure), $?$ (zero or one occurrence) and $+$ (one or more occurrences). An extended context-free grammar (ECFG) is defined by a 4-tuple $G = (T, N, S, R)$, where $T$ and $N$ are disjoint sets of terminals and nonterminals in $\Sigma$, $\Sigma = T \cup N$; $S$ is an initial nonterminal and $R$ is a finite set of production rules of the form $A \rightarrow \alpha$ for $A \in N$, where $\alpha$ is a regular expression over $\Sigma$. When $A \rightarrow \alpha$ is in $R$ and $w \in L(\alpha)$, we say that the string $u w v$ can be derived from the string $u A v$ and denote the derivation by $u A v \Rightarrow u w v$. The language $L(G)$ defined by an ECFG $G$ is the set of terminal strings derivable from the starting symbol $S$ of $G$. Formally, $L(G) = \{ w \in T^* \mid S \Rightarrow^{+} w \}$, where $\Rightarrow^{+}$ denotes the transitive closure of the derivability relation. Any sentential form can be represented in at least one way as a derivation tree (or parse tree) that reflects the derivation steps. The