XML Semantics and Digital Libraries

Allen Renear and David Dubin Graduate School of Library and Information Science University of Illinois at Urbana-Champaign {renear, ddubin}@uiuc.edu

C. M. Sperberg-McQueen World Wide Web Consortium MIT Laboratory for Computer Science [email protected]

Claus Huitfeldt Department for Culture, Language, and Information Technology Bergen University Research Foundation [email protected]

Abstract

The lack of a standard formalism for expressing the semantics of an XML vocabulary is a major obstacle to the development of high-function interoperable digital libraries. XML document type definitions (DTDs) provide a mechanism for specifying the syntax of an XML vocabulary, but there is no comparable mechanism for specifying the semantics of that vocabulary, where semantics simply means the basic facts and relationships represented by the occurrence of XML constructs. A substantial loss of functionality and interoperability in digital libraries results from not having a common machine-readable formalism for expressing these relationships for the XML vocabularies currently being used to encode content. Recently a number of projects and standards have begun taking up related topics. We describe the problem and our own project.

1 Introduction

Much textual content in digital libraries is encoded with XML document markup. XML provides a rigorous machine-readable technique for defining descriptive markup languages: languages usually designed to explicitly identify the underlying meaningful structure of a document, apart from any intended processing. The superiority of descriptive markup over earlier strategies has been well-confirmed [1], and the XML metalanguage supports the development of interoperable domain-specific descriptive markup vocabularies. However, it has always been clear that XML and descriptive markup alone cannot deliver the level of functionality and interoperability originally anticipated. The problem is that even though XML markup identifies a document’s meaningful structure, XML itself does not explicitly represent fundamental semantic relationships among document components and features. XML supports the specification of a machine-readable “grammar,” but because it has no mechanism for providing a semantics for that grammar, what an XML vocabulary means still cannot be formally specified. Even very simple fundamental semantic facts about a document markup system (facts that are routinely intended by markup language designers, and relied on by both markup language users and software designers) cannot be expressed. Prose documentation of markup vocabularies provides some assistance, of course, but even when established principles of documentation are followed, prose documentation is not a machine-readable formalism, and that is what is required to address current problems with digital libraries.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. JCDL ’03 May 1-2, 2003, Houston, Texas. Copyright 2000 ACM 1-58113-000-0/00/0000…$5.00.

2 Related Work

The lack of a machine-readable semantic description of SGML/XML constructs was noted in the 1980s [3, 4]. Recently a number of other technologies, standards, and research projects have recognized and responded to this challenge. For particularly promising research projects see [7, 8, 13, 14]. In addition, standards such as W3C Schema, ISO Topic Maps, RDF, and HyTime architectural forms address some of the problems mentioned here, but do not provide complete or systematic solutions. The W3C’s “Semantic Web” activity is certainly producing important relevant results, but its overall agenda is to develop XML-based techniques for knowledge representation in general, while our project focuses on identifying and processing actual document markup semantics, as found in existing document markup languages, and not on developing a new markup language for representing semantics in general.


3 XML Semantics: What it is

XML semantics in our sense refers simply to the facts and relationships expressed by XML markup. It does not refer to processing behavior, machine states, linguistic meaning, business logic, or any of the other things that are sometimes meant by “semantics”. Consider for example a paragraph element whose language attribute is set to English. The semantics of this markup might be informally expressed by saying that the markup attributes to its content the property of being a paragraph and being in the English language. This example suggests that a semantics for an XML markup vocabulary might be given by providing rules for a translation into predicate logic (or an equivalent formalism), perhaps along with appropriate axioms, and this is indeed our approach. It also suggests that the translation will be trivial, which turns out not to be the case, as the following examples of some specific problems show.

Propagation: Often the properties expressed by markup are understood to be propagated, according to certain rules, to child elements. For instance, if an element has the attribute specification lang="de", indicating that the text is in German, then all child elements have the property of being in German, unless the attribution is defeated by an intervening reassignment. Language designers, content developers, and software designers all depend upon a common understanding of such rules. But XML DTDs provide no formal notation for specifying which attributes are propagated or what the rules for propagation are. The property of being a paragraph, for example, is not propagated at all (the child elements of a paragraph aren’t necessarily paragraphs), the property of being in German is propagated until defeated, and the property of being rendered in Helvetica will be defeated by a subsequent rendition assignment of, say, TIMES, but not by a subsequent rendition assignment of BOLD. In addition, some properties are distributed to their textual content, and others apply only at the element level. Although there is no way to specify in a DTD which properties propagate, and what the logic of that propagation is, such relationships are intended by markup language designers, routinely assumed by content developers, and routinely inferred by software designers, and are reflected in tools and applications [11].

Class Relationships and Synonymy: XML itself contains no general constructs for expressing class membership or hierarchies among elements, attributes, or attribute values, which are among the most fundamental relationships in contemporary information modeling. Full and partial synonymy, within and across markup languages, is also an important semantic relationship, and the lack of characterizing mechanisms is an obstacle to dealing with heterogeneity.

Ontological variation in reference: XML markup might appear to indicate that the same thing is-a-noun, is-a-French-citizen, is-illegible, has-been-copyedited. But obviously either these predicates really refer to different things, or must be given non-standard interpretations. While human readers are not confused by such familiar ambiguities, they are an obstacle to automatic processing.

Arity and Deixis: Some properties expressed by markup appear to be monadic, some polyadic: a title that is the immediate first child of a section is probably the title of that section. But property arity is not evident from the markup itself, and it is necessary to provide “deictic” mechanisms to locate and identify arguments.

Parent/Child overloading: The parent/child relations of the XML tree data structure support a variety of implicit substantive relationships. A paragraph might have page break, sentence, and footnote as child elements, but in each case the parent/child relation represents a different substantive relationship: the parent/child relationship indicates that the sentence is part of the paragraph, but means something else for the page break and footnote [2].

These examples demonstrate several things: what XML semantics is, that it would be valuable to have a system for expressing XML semantics, and that it would be neither trivial nor excessively ambitious to develop such a system. We are not attempting to formalize common sense reasoning in general, but only the inferences that are routinely intended by markup designers, assumed by content developers, and inferred by software designers. (Parts of this section were adapted from Renear et al. 2002 [5]; much of the original analysis is from Sperberg-McQueen et al. 2000 [11].)
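As a rough illustration of the propagation rule and of translating markup into simple facts, the following Python sketch walks a small document and emits one fact per property per element. It is only a sketch under stated assumptions, not the notation or software of the project described here: the sample document, the function name `markup_facts`, the fact tuples, and the `PROPAGATED` table are all hypothetical. The table stands in for exactly the information a DTD cannot express, namely that `lang` is inherited until reassigned while other properties stay local.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element and attribute names are illustrative.
SAMPLE = (
    '<doc lang="en">'
    '<p id="p1">Some text <q lang="de">ein Zitat <em>betont</em></q></p>'
    '</doc>'
)

# Assumed propagation policy, written by hand because a DTD cannot state it:
# only the attributes listed here are inherited by descendant elements.
PROPAGATED = {"lang"}


def markup_facts(xml_text):
    """Return (predicate, element_path, value) facts for every element."""
    facts = []

    def walk(elem, path, inherited):
        # Being a <p>, <q>, etc. is a property of this element only;
        # it is never passed down to children.
        facts.append(("element_type", path, elem.tag))

        # A local attribute assignment defeats an inherited one.
        effective = {**inherited, **elem.attrib}
        for name, value in sorted(effective.items()):
            facts.append((name, path, value))

        # Only propagated properties travel further down the tree.
        passed_down = {n: v for n, v in effective.items() if n in PROPAGATED}

        counts = {}
        for child in elem:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            child_path = f"{path}/{child.tag}[{counts[child.tag]}]"
            walk(child, child_path, passed_down)

    root = ET.fromstring(xml_text)
    walk(root, "/" + root.tag, {})
    return facts


if __name__ == "__main__":
    for fact in markup_facts(SAMPLE):
        print(fact)
```

In this sketch the `em` element is reported as German because it inherits from the enclosing `q`, while the `id` of the paragraph is not passed on to any child. A fuller treatment would also need per-attribute defeat rules of the kind the Helvetica, TIMES, and BOLD example calls for, which this table does not model.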

4 Consequences for Digital Libraries

Much current digital library research is focused on three closely related problems: efficiency in the creation and management of high-performance content, functionality of software tools, and interoperability of both tools and content. A persistent problem in all three areas is the efficient exploitation of diverse systems for representing information. A standard metalanguage, such as XML, is an essential part of the solution, but by itself a metalanguage for specifying syntax does not provide the needed semantic information. Wide adoption within a domain of specific well-designed XML vocabularies, such as the TEI [9], does make important semantic information available, but only through prose documentation and shared practice, and information in this form is not sufficiently systematic, uniform, complete, or exploitable to achieve high levels of functionality and interoperability. On the other hand, a common standard for providing machine-readable semantics for XML vocabularies would go directly to the heart of the problem. Without a machine-readable semantics, computational access to even the simplest facts (that a word is in German, that a sentence is “part of” a paragraph, that a title is the title of a section) requires explicit human inference and intervention. This is because, as indicated above, the data structure provided by an XML document depends on a semantic interpretation in order to actually deliver the information it represents. Markup language designers, content developers, software engineers, and stylesheet developers easily carry out this interpretation. But they do so only opportunistically, and there is no way for them to formally express their decisions and inferences, either to each other or to software. The resulting scenarios of content development and exploitation are idiosyncratic, error-prone, incomplete, and involve a massive duplication of effort. This lack of interoperability results in systems that are low function. Digital libraries will not reach their full potential until the semantic information, which is easily and routinely inferred, is made computationally available in a standard format.

It is easy to see how a common standard for XML application would support functionality and interoperability in areas such as information retrieval, presentation, browsing, federation, inferencing, and conversion [2, 5, 12]. In addition, it is likely that a formalism for XML semantics can help in other areas of digital library research as well. For instance, XML semantics could support digital preservation and authentication by providing a representation of content at a higher level of abstraction than character streams, canonical serializations, or even data structures [6]. This suggests that beyond improving the functionality and interoperability of tools and content, the research in XML semantics may also result in important theoretical contributions to the digital library research agenda.

References

[1] J. H. Coombs, A. H. Renear, and S. J. DeRose. Markup systems and the future of scholarly text processing. Communications of the Association for Computing Machinery, 30(11):933–947, 1987.
[2] D. Dubin, C. M. Sperberg-McQueen, A. Renear, and C. Huitfeldt. A logic programming environment for document semantics and inference. Journal of Literary and Linguistic Computing, Forthcoming in 2003.
[3] D. R. Raymond and F. W. Tompa. Markup reconsidered. Technical Report 356, Department of Computer Science, The University of Western Ontario, 1993. An earlier version was circulated privately as “Markup Considered Harmful” in the late 1980s.
[4] D. R. Raymond, F. W. Tompa, and D. Wood. From data representation to data model: Meta-semantic issues in the evolution of SGML. Computer Standards and Interfaces, 18(1):25–36, January 1996.
[5] A. Renear, D. Dubin, C. M. Sperberg-McQueen, and C. Huitfeldt. Towards a semantics for XML markup. In R. Furuta, J. I. Maletic, and E. Munson, editors, Proceedings of the 2002 ACM Symposium on Document Engineering, pages 119–126, McLean, VA, November 2002. Association for Computing Machinery.
[6] A. Renear, D. Dubin, C. M. Sperberg-McQueen, and C. Huitfeldt. Towards identity conditions for digital documents. Technical Report UIUCLIS–2003/2+EPRG, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign, Champaign, IL, 2003.
[7] G. F. Simons. Conceptual modeling versus visual modeling: A technological key to building consensus. Computers and the Humanities, 30(4):303–319, 1997.
[8] G. F. Simons. Using architectural forms to map TEI data into an object-oriented database. Computers and the Humanities, 33(1-2):85–101, 1999.
[9] C. M. Sperberg-McQueen and L. Burnard, editors. Guidelines for Electronic Text Encoding and Interchange (TEI P4). TEI Consortium, Oxford, 2002.
[10] C. M. Sperberg-McQueen, D. Dubin, C. Huitfeldt, and A. Renear. Drawing inferences on the basis of markup. In B. T. Usdin and S. R. Newcomb, editors, Proceedings of Extreme Markup Languages 2002, Montreal, August 2002.
[11] C. M. Sperberg-McQueen, C. Huitfeldt, and A. Renear. Meaning and interpretation of markup. Markup Languages: Theory and Practice, 2(3):215–234, 2000.
[12] C. M. Sperberg-McQueen, A. Renear, C. Huitfeldt, and D. Dubin. Skeletons in the closet: Saying what markup means. Presented at ALLC/ACH, Tübingen, July 2002.
[13] C. Welty and N. Ide. Using the right tools: Enhancing retrieval from marked-up documents. Computers and the Humanities, 33(1-2):59–84, 1999. Originally delivered in 1997 at the TEI 10 conference in Providence, RI.
[14] V. Wuwongse, C. Anutariya, K. Akama, and E. Nantajeewarawat. XML declarative description: A language for the semantic web. IEEE Intelligent Systems, 16(3):54–65, May/June 2001.

5 The BECHAMEL Project

The BECHAMEL Markup Semantics Project, led by Sperberg-McQueen (W3C/MIT), grew out of research initiated in the late 1990s [11] and is a partnership with the research staff and faculty at Bergen University (Norway) and the Electronic Publishing Research Group at the University of Illinois. The project explores representation and inference issues in document markup semantics, surveys properties of popular markup languages, and is developing a formal, machine-readable declarative representation scheme in which the semantics of a markup language can be expressed. This scheme is applied to research on information retrieval, document understanding, conversion, preservation, and document authentication. An early Prolog inferencing system [11] has been developed into a prototype knowledge representation workbench for representing facts and rules of inference about structured documents [2]. Preliminary findings have been reported elsewhere [2, 5, 10–12].
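To give a feel for what “facts and rules of inference about structured documents” could look like, the sketch below derives two kinds of relationship from a parsed tree: that a sentence is part of its paragraph, and that a title that is the first child of a section is the title of that section. It is written in Python rather than Prolog and is not the BECHAMEL workbench or its representation scheme. The element names, the `PART_OF` table, and the function `infer_relations` are hypothetical and serve only to show how the same parent/child link can license different conclusions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a paragraph with a sentence, a page break, and a
# footnote as children, as in the parent/child overloading example.
SAMPLE = (
    '<section id="s1">'
    '<title id="t1">Introduction</title>'
    '<paragraph id="p1">'
    '<sentence id="x1">One sentence.</sentence>'
    '<pb id="b1"/>'
    '<footnote id="f1">A note.</footnote>'
    '</paragraph>'
    '</section>'
)

# Assumed rule table: (parent type, child type) pairs for which the tree's
# parent/child link is read as "the child is part of the parent".
PART_OF = {("section", "paragraph"), ("paragraph", "sentence")}


def infer_relations(xml_text):
    """Derive (relation, child_id, parent_id) facts from parent/child links."""
    derived = set()
    root = ET.fromstring(xml_text)
    for parent in root.iter():
        for index, child in enumerate(parent):
            pair = (parent.tag, child.tag)
            # Rule 1: some parent/child links mean "part of"; others, such
            # as a page break or footnote inside a paragraph, do not.
            if pair in PART_OF:
                derived.add(("part_of", child.get("id"), parent.get("id")))
            # Rule 2: a title that is the first child of a section is taken
            # to be the title of that section, which supplies the second
            # argument of a two-place property.
            if pair == ("section", "title") and index == 0:
                derived.add(("title_of", child.get("id"), parent.get("id")))
    return derived


if __name__ == "__main__":
    for relation in sorted(infer_relations(SAMPLE)):
        print(relation)
```

The page break and footnote are children of the paragraph in the tree, but no `part_of` fact is derived for them, because the assumed rule table does not list those pairs. A declarative rule language would state the same rules without the hand-written loop.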
