From Legacy Documents to XML: A Conversion Framework

Jean-Pierre Chanod, Boris Chidlovskii, Hervé Déjean, Olivier Fambon, Jérôme Fuselier, Thierry Jacquin, Jean-Luc Meunier

Xerox Research Centre Europe
6, chemin de Maupertuis, F-38240 Meylan, France
{firstname.lastname}@xrce.xerox.com

Abstract. We present an integrated framework for the conversion of documents from legacy formats to XML. We describe the LegDoC project, which aims at automating the conversion of layout-oriented formats such as PDF, PostScript and HTML into semantic-oriented annotations. A toolkit of components covers complementary techniques for logical document analysis and semantic annotation using machine learning methods. We use a real conversion project as a driving example to illustrate the different techniques implemented in the project.

1 Introduction

The eXtensible Markup Language (XML) is the modern industry standard for data exchange across service and enterprise boundaries. It has become common practice to use XML as the underlying data model for information capture, exchange and reuse. A large spectrum of activities led by the World Wide Web Consortium, OASIS and other actors around XML, including XML schemas, querying and transformation, has led to an increasing availability and exchange of data and documents in XML format, the proliferation of user-defined XML schema definitions (DTDs, XML Schema, Relax NG), and the integration of XML components in large-scale content management solutions.

The migration from legacy formats to XML has two important branches: data-oriented XML and document-oriented XML. Data-oriented XML refers to well-structured data in storage systems like databases or transactional systems. The migration of such data toward XML poses no serious problems, as the data is already well structured and ready for machine-oriented processing. In contrast, the migration of legacy documents toward XML raises important issues. Documents, which often form corporate and personal knowledge bases, are unstructured or semi-structured objects; they are stored in generic or specialized file systems, in a multitude of formats and forms. A large majority of documents are created for humans rather than machines, with various implicit assumptions and choices that are obvious to a human reader but difficult and ambiguous for computer programs. The migration of legacy documents toward XML addresses the transformation of documents into a form that eases their machine-oriented processing and reuse, through a process that makes all implicit assumptions and choices explicit and is guided by common sense or by specific domain knowledge.

In the mass migration of documents toward XML, source documents are available in rendering-oriented formats like Adobe PDF, PostScript or Microsoft Word. The migration result is expected to be XML documents that fit a user-defined or domain-specific XML schema. It is frequent in the conversion that the target documents preserve an important part of the source content but disregard all information relevant to the document presentation, such as pagination, headings, etc. Structurally, source and target documents are often very different, as they follow two opposite paradigms of document annotation: layout-oriented and semantic-oriented. The former refers to the traditional, human-oriented paradigm of document annotation; the latter is associated with a relatively new paradigm of semantic-oriented annotation of documents for machine processing. Currently, the conversion of legacy documents into dense semantic XML is performed by domain experts and remains essentially manual and expensive. The XML and Web communities offer various tools for transforming data into and from XML, including XSLT, XQuery and their graphical extensions [17, 18]. However, writing accurate transformation rules for mass document conversion appears difficult, if not impossible, because of the size and complexity of both the source documents and the target schema. The current state of the art in the domain of semantic annotation leaves little hope of achieving fully automated and accurate converters in the foreseeable future. Nevertheless, the conversion cost can be considerably reduced by deploying different and complementary methods. One well-established approach is the analysis of the logical structure of documents. Another approach is based on data mining or machine learning techniques, which attempt to infer accurate transformation rules from a subset of annotated source documents or their fragments.
We adopt an approach of managing the conversion complexity by decomposing the entire problem into a sequence of smaller, easier-to-handle conversion or transformation steps, where each sub-problem can be solved with an appropriate method. Each step performs a specific processing of the document, enriches it, and thus brings it closer to the target XML format.
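The decomposition described above can be sketched as a chain of small enrichment steps, each consuming and producing the document. This is an illustrative sketch only; the function names and the document representation are invented for the example and are not LegDoC's actual API.

```python
# Sketch of stepwise conversion: each step is a small function that
# enriches the document and brings it closer to the target XML form.
# The dict-based document model here is a stand-in for real XML data.

def detect_headers(doc):
    # Illustrative placeholder: record that header/footer detection ran.
    doc["applied"].append("headers-footers")
    return doc

def detect_reading_order(doc):
    doc["applied"].append("reading-order")
    return doc

def annotate_entities(doc):
    doc["applied"].append("entities")
    return doc

def convert(doc, steps):
    # Chain the sub-problems: each step's output feeds the next one.
    for step in steps:
        doc = step(doc)
    return doc

doc = convert({"applied": []},
              [detect_headers, detect_reading_order, annotate_entities])
print(doc["applied"])  # ['headers-footers', 'reading-order', 'entities']
```

The design point is that each sub-problem stays small and independently replaceable: a hand-written rule, a learned model, or an off-the-shelf tool can implement any single step without disturbing the rest of the chain.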

2 Legacy Document Conversion project

At Xerox Research Centre Europe, we are conducting the Legacy Document Conversion (LegDoC) project, aimed at automating different tasks of mass document conversion to XML. A typical conversion task starts with a large collection of legacy documents available in PDF, PostScript or Microsoft Word formats. The schema of the target documents is provided in the form of a DTD or W3C XML Schema description. The conversion goal is to migrate the source documents or their components into XML files structured according to the target schema.

The generic view of the conversion flow is presented in Figure 1. According to this figure, we distinguish among three types of document annotations. The first type refers to layout annotations, which cope with the document presentation in terms of the physical rendering of elements (x and y positions, width, height, font, etc.). The second type refers to a more abstract, logical structure of the document; it expresses spatial relationships between elements on a page, such as columns, headings, paragraphs and lines. The third type of annotation is the semantic one; it refers to the meaning of elements rather than to their appearance on a page. Semantic annotations may be of different granularity, two well-known examples being metadata and entities. Metadata refers to elements that describe the whole document, like title, authors, creationDate, etc.; all such elements are routinely indexed and used in content management applications. Entities are content elements of low granularity, like person names, tool names, index entry points, etc.
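The contrast between the layout-oriented and semantic-oriented paradigms can be made concrete with a small example. The element and attribute names below are invented for illustration and are not taken from the paper's schemas.

```python
# The same content fragment under the two annotation paradigms:
# the layout-oriented markup says how the text is rendered, the
# semantic-oriented markup says what the text means.
import xml.etree.ElementTree as ET

layout = ET.fromstring(
    '<text x="72" y="96" font="Times-Bold" size="14">Repair Manual</text>'
)
semantic = ET.fromstring('<title>Repair Manual</title>')

# Identical content, opposite paradigms.
assert layout.text == semantic.text
print(layout.attrib)   # rendering attributes: position, font, size
print(semantic.tag)    # meaning: this is the document title
```

Conversion is precisely the problem of inferring the second form from the first, since the rendering attributes only hint at the meaning (a large bold line near the top of a page is often, but not always, a title).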

Fig. 1. Three types of annotations and the conversion flow.

To achieve the target of converting legacy documents to XML, the LegDoC project offers a framework for modeling, evaluating and executing various conversion cases. The project framework is composed of the following components:

Raw XML : the conversion starts with rewriting documents from proprietary formats into raw XML. This step deploys off-the-shelf converters for Adobe PDF and other formats. The output of a converter is a set of XML files that preserve the rendering of the documents. All converters allow an accurate recognition of characters and lines together with their rendering attributes (x and y positions, fonts, etc.). However, they are fairly limited in the recognition of logical or semantic annotations.

Preprocessing : this component cleans up and indexes the raw XML files. The index entry points are associated with all XML nodes and remain persistent throughout the conversion process, enabling easy traceability and debugging of the different conversion steps.

Logical analysis : this includes methods for spatial analysis and extraction from the raw XML, the detection of headers and footers, the determination of the reading order, the structuring of the document using the table of contents where available, etc.

Semantic annotation : this covers methods for recognizing entities in the document content. The methods include both hand-crafted regular expressions and a collection of machine learning algorithms that allow one to build learning models from corpora of annotated documents and to apply the models to non-annotated documents.

Visualization and annotation : this includes assistance for the visualization and validation of the outputs of intermediate and final conversion steps.

Conversion management : this component offers support for building chains of transformation and enrichment steps that gradually migrate the document from the raw XML to the target XML. It represents an explicit set of agreements and requirements for a transformation chain and for the validity of the output of intermediate steps, including XML schema definitions for each step.

The core components of logical analysis, semantic annotation, and visualization and annotation are presented in more detail in Sections 4-6.
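The hand-crafted-rule flavour of the semantic annotation component can be sketched as a regular expression that wraps matching content spans in an entity element. The `tool` tag and the lexicon are invented for the example and do not come from the actual target schema.

```python
# Sketch of rule-based entity annotation: wrap each occurrence of a
# known tool name in a <tool> element. A real system would combine
# such rules with learned models trained on annotated corpora.
import re

def tag_tool_names(text, names):
    # Build one alternation over the lexicon; escape the names so
    # they are matched literally.
    pattern = re.compile("|".join(re.escape(n) for n in names))
    return pattern.sub(lambda m: f"<tool>{m.group(0)}</tool>", text)

line = "Loosen the bolts with the torque wrench."
print(tag_tool_names(line, ["torque wrench"]))
# Loosen the bolts with the <tool>torque wrench</tool>.
```

Regular expressions of this kind handle closed vocabularies and regular patterns (part numbers, dates) well; the machine learning algorithms mentioned above take over where the entities cannot be enumerated or described by a fixed pattern.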

Fig. 2. Conversion example: left) Source PDF file, right) Target XML with annotations.

3 Conversion example

The components of the LegDoC framework are developed in a generic manner; they nevertheless require adaptation to each specific conversion task. In the following, we use one case of technical documentation conversion as a driving example for presenting how the components contribute to the conversion process. The selected case is a collection of truck repair manuals. The target schema is a complex W3C XML Schema description which scrupulously describes all notions and entities relevant to the truck repair domain, including tools, operations, steps and items of mounting and demounting processes, etc. The bulk of the PDF documents is dedicated to repair operations; one such page is shown in Figure 2.a. Figure 2.b gives the SVG representation of the same page with all target XML annotations highlighted in different colors.
