. The semantics of this expression might be informally expressed by saying that the markup attributes to its content the properties of being a paragraph and of being in the English language. This example suggests that a semantics for an XML markup vocabulary might be given by providing rules for a translation into predicate logic (or an equivalent formalism), perhaps along with appropriate axioms, and this is indeed our approach. It also suggests that the translation will be trivial, which turns out not to be the case, as the following examples of specific problems show.

Propagation: Often the properties expressed by markup are understood to be propagated, according to certain rules, to child elements. For instance, if an element has the attribute specification lang="de", indicating that the text is in German, then all child elements have the property of being in German, unless the attribution is defeated by an intervening reassignment. Language designers, content developers, and software designers all depend upon a common understanding of such rules. But XML DTDs provide no formal notation for specifying which attributes are propagated or what the rules for propagation are. The property of being a paragraph, for example, is not propagated at all (the child elements of a paragraph aren’t necessarily paragraphs), the property of being in German is propagated until defeated, and the property of being rendered in Helvetica will be defeated by a subsequent rendition assignment of, say, TIMES, but not by a subsequent rendition assignment of BOLD. In addition, some properties are distributed to the textual content of an element, while others apply only at the element level. Although a DTD cannot specify which properties propagate or what the logic of that propagation is, such relationships are intended by markup language designers, routinely assumed by content developers, routinely inferred by software designers, and reflected in tools and applications [11].
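The propagation rules described above can be illustrated with a minimal Python sketch. It is not the notation proposed in this paper; the sample document, the element names, and the grouping of rendition values into "typeface" and "weight" (REND_GROUP) are assumptions made only for the example. Language is inherited until an intervening reassignment, a rendition value defeats only an inherited value of the same group, and the element type itself is never inherited.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element and attribute names are illustrative only.
SAMPLE = """<text lang="de" rend="HELVETICA">
  <p>Guten Tag <q lang="en">hello</q></p>
  <p rend="BOLD">fett <hi rend="TIMES">Zeit</hi></p>
</text>"""

# Assumed grouping of rendition values: a new value defeats an inherited one
# only when both belong to the same group (typeface vs. weight).
REND_GROUP = {"HELVETICA": "typeface", "TIMES": "typeface", "BOLD": "weight"}


def propagate(elem, lang=None, rend=None, out=None):
    """Return (tag, language, sorted renditions) for elem and its descendants."""
    out = [] if out is None else out
    rend = dict(rend or {})  # copy so siblings do not see each other's values
    # Language propagates until an intervening reassignment defeats it.
    lang = elem.get("lang", lang)
    # Rendition: replace only the inherited value in the same group.
    value = elem.get("rend")
    if value is not None:
        rend[REND_GROUP[value]] = value
    # The element type (e.g. being a paragraph) is recorded but not inherited.
    out.append((elem.tag, lang, sorted(rend.values())))
    for child in elem:
        propagate(child, lang, rend, out)
    return out


facts = propagate(ET.fromstring(SAMPLE))
for tag, lang, rend in facts:
    print(f"{tag}: lang={lang} rend={rend}")
```

In this sketch the q element is in English although its ancestors are in German, and the hi element ends up rendered in both BOLD and TIMES because TIMES defeats HELVETICA but not BOLD.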
Class Relationships and Synonymy: XML itself contains no general constructs for expressing class membership or hierarchies among elements, attributes, or attribute values, even though class relationships are among the most fundamental relationships in contemporary information modeling. Full and partial synonymy, within and across markup languages, is also an important semantic relationship, and the lack of mechanisms for characterizing it is an obstacle to dealing with heterogeneity.

Ontological variation in reference: XML markup might appear to indicate that the same thing, is-a-noun, is-a-
4 Consequences for Digital Libraries

Much current digital library research is focused on three closely related problems: efficiency in the creation and management of high-performance content, functionality of software tools, and interoperability of both tools and content. A persistent problem in all three areas is the efficient exploitation of diverse systems for representing information. A standard metalanguage, such as XML, is an essential part of the solution, but by itself a metalanguage for specifying syntax does not provide the needed semantic information. Wide adoption within a domain of specific well-designed XML vocabularies, such as the TEI [9], does make important semantic information available, but only through prose documentation and shared practice, and information in this form is not sufficiently systematic, uniform, complete, or exploitable to achieve high levels of functionality and interoperability. On the other hand, a common standard for providing machine-readable semantics for XML vocabularies would go directly to the heart of the problem.

Without a machine-readable semantics, computational access to even the simplest facts (that a word is in German, that a sentence is “part of” a paragraph, that a title is the title of a section) requires explicit human inference and intervention. This is because, as indicated above, the data structure provided by an XML document depends on a semantic interpretation in order to actually deliver the information it represents. Markup language designers, content developers, software engineers, and stylesheet developers easily carry out this interpretation. But they do so only opportunistically, and there is no way for them to formally express their decisions and inferences, either to each other or to software. The resulting scenarios of content development and exploitation are idiosyncratic, error-prone, and incomplete, and they involve a massive duplication of effort. The resulting lack of interoperability yields systems with low functionality. Digital libraries will not reach their full potential until the semantic information, which is easily and routinely inferred, is made computationally available in a standard format.

It is easy to see how a common semantics standard for XML applications would support functionality and interoperability in areas such as information retrieval, presentation, browsing, federation, inferencing, and conversion [2, 5, 12]. In addition, it is likely that a formalism for XML semantics can help in other areas of digital library research as well. For instance, XML semantics could support digital preservation and authentication by providing a representation of content at a higher level of abstraction than character streams, canonical serializations, or even data structures [6]. This suggests that, beyond improving the functionality and interoperability of tools and content, research in XML semantics may also result in important theoretical contributions to the digital library research agenda.

References
[1] J. H. Coombs, A. H. Renear, and S. J. DeRose. Markup systems and the future of scholarly text processing. Communications of the Association for Computing Machinery, 30(11):933–947, 1987.
[2] D. Dubin, C. M. Sperberg-McQueen, A. Renear, and C. Huitfeldt. A logic programming environment for document semantics and inference. Journal of Literary and Linguistic Computing, Forthcoming in 2003.
[3] D. R. Raymond and F. W. Tompa. Markup reconsidered. Technical Report 356, Department of Computer Science, The University of Western Ontario, 1993. An earlier version was circulated privately as “Markup Considered Harmful” in the late 1980s.
[4] D. R. Raymond, F. W. Tompa, and D. Wood. From data representation to data model: Meta-semantic issues in the evolution of SGML. Computer Standards and Interfaces, 18(1):25–36, January 1996.
[5] A. Renear, D. Dubin, C. M. Sperberg-McQueen, and C. Huitfeldt. Towards a semantics for XML markup. In R. Furuta, J. I. Maletic, and E. Munson, editors, Proceedings of the 2002 ACM Symposium on Document Engineering, pages 119–126, McLean, VA, November 2002. Association for Computing Machinery.
[6] A. Renear, D. Dubin, C. M. Sperberg-McQueen, and C. Huitfeldt. Towards identity conditions for digital documents. Technical Report UIUCLIS–2003/2+EPRG, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign, Champaign, IL, 2003.
[7] G. F. Simons. Conceptual modeling versus visual modeling: A technological key to building consensus. Computers and the Humanities, 30(4):303–319, 1997.
[8] G. F. Simons. Using architectural forms to map TEI data into an object-oriented database. Computers and the Humanities, 33(1-2):85–101, 1999.
[9] C. M. Sperberg-McQueen and L. Burnard, editors. Guidelines for Electronic Text Encoding and Interchange (TEI P4). TEI Consortium, Oxford, 2002.
[10] C. M. Sperberg-McQueen, D. Dubin, C. Huitfeldt, and A. Renear. Drawing inferences on the basis of markup. In B. T. Usdin and S. R. Newcomb, editors, Proceedings of Extreme Markup Languages 2002, Montreal, August 2002.
[11] C. M. Sperberg-McQueen, C. Huitfeldt, and A. Renear. Meaning and interpretation of markup. Markup Languages: Theory and Practice, 2(3):215–234, 2000.
[12] C. M. Sperberg-McQueen, A. Renear, C. Huitfeldt, and D. Dubin. Skeletons in the closet: Saying what markup means. Presented at ALLC/ACH, Tübingen, July 2002.
[13] C. Welty and N. Ide. Using the right tools: Enhancing retrieval from marked-up documents. Computers and the Humanities, 33(1-2):59–84, 1999. Originally delivered in 1997 at the TEI 10 conference in Providence, RI.
[14] V. Wuwongse, C. Anutariya, K. Akama, and E. Nantajeewarawat. XML declarative description: A language for the semantic web. IEEE Intelligent Systems, 16(3):54–65, May/June 2001.
5 The BECHAMEL Project

The BECHAMEL Markup Semantics Project, led by Sperberg-McQueen (W3C/MIT), grew out of research initiated in the late 1990s [11] and is a partnership with the research staff and faculty at Bergen University (Norway) and the Electronic Publishing Research Group at the University of Illinois. The project explores representation and inference issues in document markup semantics, surveys properties of popular markup languages, and is developing a formal, machine-readable declarative representation scheme in which the semantics of a markup language can be expressed. This scheme is applied to research on information retrieval, document understanding, conversion, preservation, and document authentication. An early Prolog inferencing system [11] has been developed into a prototype knowledge representation workbench for representing facts and rules of inference about structured documents [2]. Preliminary findings have been reported elsewhere [2, 5, 10–12].
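To give a concrete sense of what "facts and rules of inference about structured documents" might look like, the following is a minimal Python sketch. It is not the BECHAMEL workbench, which is Prolog-based; the predicate names (is_a, part_of, title_of), the generated identifiers, and the sample document are all hypothetical and chosen only for illustration. Base facts are read off the element tree, and a single rule (transitivity of part_of) is applied until no new facts appear.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for this illustration.
SAMPLE = "<section><title>Intro</title><p><s>One.</s><s>Two.</s></p></section>"


def extract_facts(root):
    """Assign ids in document order and emit base facts as tuples."""
    ids = {elem: f"e{n}" for n, elem in enumerate(root.iter())}
    facts = {("is_a", ids[elem], elem.tag) for elem in ids}
    for parent in root.iter():
        for child in parent:
            facts.add(("part_of", ids[child], ids[parent]))
            # Assumed vocabulary rule: a title child names its parent element.
            if child.tag == "title":
                facts.add(("title_of", ids[child], ids[parent]))
    return facts


def close_part_of(facts):
    """Apply part_of(a, b) and part_of(b, c) => part_of(a, c) to a fixpoint."""
    facts = set(facts)
    while True:
        pairs = [(f[1], f[2]) for f in facts if f[0] == "part_of"]
        new = {("part_of", a, c) for a, b in pairs for b2, c in pairs if b == b2}
        if new <= facts:  # nothing new was derived
            return facts
        facts |= new


facts = close_part_of(extract_facts(ET.fromstring(SAMPLE)))
for fact in sorted(facts):
    print(fact)
```

Here the derived facts include that each sentence element is part of the section, although the markup states only that it is a child of the paragraph.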