SGML and Exceptions
Pekka Kilpeläinen and Derick Wood. Technical Report HKUST-CS96-30, June 1996
Department of Computer Science University of Helsinki Helsinki Finland
Department of Computer Science Hong Kong University of Science & Technology Clear Water Bay, Kowloon Hong Kong
Abstract
The Standard Generalized Markup Language (SGML) allows users to define document type definitions (DTDs), which are essentially extended context-free grammars in a notation that is similar to extended Backus-Naur form. The right-hand side of a production is called a content model and its semantics can be modified by exceptions. We give precise definitions of the semantics of exceptions and prove that they do not increase the expressive power of SGML. For each DTD with exceptions we can construct a structurally equivalent extended context-free grammar. On the other hand, exceptions are a powerful shorthand notation: eliminating them may cause exponential growth in the size of a DTD.
The research of the first author was supported by the Academy of Finland and the research of the second author was supported by grants from the Natural Sciences and Engineering Research Council of Canada and from the Information Technology Research Centre of Ontario.
The Hong Kong University of Science & Technology
Technical Report Series Department of Computer Science
July 3, 1996
1 Introduction
The Standard Generalized Markup Language (SGML) [9, 11] promotes the interchangeability and application-independent management of electronic documents by providing a syntactic metalanguage for the definition of textual markup systems. An SGML document consists of an SGML prolog and a marked-up document instance. The prolog contains a document type definition (DTD), which is an extended context-free grammar in which the right-hand sides of productions are both extended and restricted regular expressions called content models. Fig. 1 gives an example of a simple SGML DTD.
  <!DOCTYPE message [
  <!ELEMENT message - - (head, body)>
  <!ELEMENT head - - (from & to & subject)>
  <!ELEMENT from - - (person)>
  <!ELEMENT to - - (person)+>
  <!ELEMENT person - - (alias | (forename?, surname))>
  <!ELEMENT body - - (paragraph)*>
  <!ELEMENT subject, alias, forename, surname, paragraph - - (#PCDATA)>
  ]>
Figure 1: An example SGML DTD.
The DTD in Fig. 1 defines a document type for messages, which consist of a head followed by a body. The element (or nonterminal) head consists of subelements from, to, and subject that can appear in any order. The element from is defined to be a person that can be denoted either by an alias or by an optional forename followed by a surname. The element to consists of a nonempty list of persons. The body of a message consists of a (possibly empty) sequence of paragraphs. Finally, the last element definition specifies that elements subject, alias, forename, surname, and paragraph are unstructured strings, denoted by the keyword #PCDATA.
The structural elements of a document instance are made visible by enclosing them in matching pairs of start tags and end tags. A possible instance of the DTD of Fig. 1 is given in Fig. 2.
Boss Tomorrow's meeting... Franklin Betty ..has been cancelled.
Figure 2: An SGML document instance.
The semantics of content models can be modified by what the Standard calls exceptions. Inclusion exceptions allow named elements to appear anywhere within a content model and exclusion exceptions preclude named elements from appearing in a content model. To define the placement of sidebars, figures, equations, footnotes, and similar objects in a DTD using the usual grammatical approach is laborious; exceptions provide an alternative, concise, and formal mechanism. For example, with the DTD of Fig. 1, we might want to allow notes to appear anywhere in the bodies of messages, except within notes themselves. We could add the inclusion exception
  <!ELEMENT body - - (paragraph)* +(note)>
to the definition of element body. This modification allows notes to appear within notes; therefore, to prevent such recursive appearances we add an exclusion exception to the definition of element type note:
  <!ELEMENT note - - (#PCDATA) -(note)>.
Exclusion exceptions seem to be a useful concept, but their exact meaning is unclear from the Standard [11] and from Goldfarb's annotation of the Standard [9]. We give rigorous definitions for the meaning of exceptions. In the full paper [10], we also give algorithms for transforming grammars with exceptions to grammars without exceptions, as well as giving complete proofs of the results mentioned here. The correctness proofs of these methods imply that exceptions do not increase the expressiveness of SGML DTDs. An application that requires the elimination of exceptions from content models is the translation of DTDs into static database schemas. This method of integrating textual documents into an object-oriented database has been suggested by Christophides et al. [8].
The SGML Standard requires content models to be unambiguous, meaning that each nonempty prefix of an input string determines uniquely which symbols of the content model match the symbols of the prefix. Our methods of eliminating exceptions preserve the unambiguity of the original content models. In this respect our work extends the work of Brüggemann-Klein and Wood [3, 4, 5, 6, 7]. The Standard gives rather vague restrictions on the applicability of exclusion exceptions. We propose a simple and rigorous definition for the applicability of exclusions; in the full paper [10], we also present an optimal algorithm for testing applicability. In this extended abstract we focus on the essential ideas underlying our approach. For this reason, we consider the removal of exceptions from only extended context-free grammars with exceptions, although we mention the problems of transferring this approach to DTDs. We refer the reader to the full paper [10] for more details.
2 Extended Context-Free Grammars with Exceptions
We introduce extended context-free grammars as a model for SGML DTDs. We treat extended context-free grammars as context-free grammars in which the right-hand sides of productions are regular expressions. Let $V$ be an alphabet. Then, we define a regular expression over $V$ and its language in the usual way [1, 12]. The symbol $\lambda$ denotes the empty string. We denote by $\mathrm{sym}(E)$ the set of symbols of $V$ that appear in a regular expression $E$.
An extended context-free grammar $G$ is specified by a tuple $(N, \Sigma, P, S)$, where $N$ and $\Sigma$ are disjoint finite alphabets of nonterminal symbols and terminal symbols, respectively, $P$ is a finite set of production schemas, and the nonterminal $S$ is the sentence symbol. Each production schema has the form $A \to E$, where $A$ is a nonterminal and $E$ is a regular expression over $V = N \cup \Sigma$. When $\beta = \beta_1 A \beta_2 \in V^*$, $A \to E \in P$, and $\alpha \in L(E)$, the string $\beta_1 \alpha \beta_2$ can be derived from the string $\beta$ and we denote this fact by writing $\beta \Rightarrow \beta_1 \alpha \beta_2$. The language $L(G)$ of an extended context-free grammar $G$ is the set of terminal strings derivable from the sentence symbol of $G$. Formally, $L(G) = \{ w \in \Sigma^* \mid S \Rightarrow^{+} w \}$, where $\Rightarrow^{+}$ denotes the transitive closure of the derivability relation. Even though a production schema may correspond to an infinite number of ordinary context-free productions, it is known that extended and ordinary CFGs allow us to describe exactly the same languages; for example, see the text of Wood [12].
An extended context-free grammar $G$ with exceptions is specified by a tuple $(N, \Sigma, P, S)$ and is similar to an extended context-free grammar except that the production schemas in $P$ have the form $A \to E + I - X$, where $A$ is in $N$, $E$ is a regular expression over $V = N \cup \Sigma$, and $I$ and $X$ are subsets of $N$. The intuitive idea is that the derivation of any string $w$ from the nonterminal $A$ using the production schema $A \to E + I - X$ must not involve any nonterminal in $X$, yet $w$ may contain, in any position, strings that are derivable from nonterminals in $I$. When a nonterminal is both included and excluded, its exclusion overrides its inclusion.
We now define the effect of inclusions and exclusions on languages. Let $L$ be a language over the alphabet $V$ and let $I, X \subseteq V$. We define a language $L$ with inclusions $I$ as the language
$$L_{+I} = \{ w_0 a_1 w_1 \cdots a_n w_n \mid a_1 \cdots a_n \in L, \text{ for } n \ge 0, \text{ and } w_i \in I^*, \text{ for } i = 0, \ldots, n \}.$$
Thus, $L_{+I}$ consists of the strings in $L$ with arbitrary strings from $I^*$ inserted into them. The language $L$ with exclusions $X$ is defined as the language $L_{-X}$ that consists of the strings in $L$ that do not contain any symbol in $X$. Notice that $(L_{+I})_{-X} \subseteq (L_{-X})_{+I}$, but the converse does not hold in general. In the sequel we will write $L_{+I-X}$ for $(L_{+I})_{-X}$.
We formally describe the global effect of exceptions by attaching exceptions to nonterminals and by defining derivations from nonterminals with exceptions. We denote a nonterminal $A$ with inclusions $I$ and exclusions $X$ with the symbol $A_{+I-X}$. Normally, we rewrite the nonterminal $A$, say, with a string $\alpha$, where $A \to E$ is the production schema for $A$ and $\alpha \in L(E)$. But when $A$ has inclusions $I$ and exclusions $X$, and the production schema for $A$ is $A \to E + I_A - X_A$, we must cumulate the inclusions and exclusions in the string $\alpha$. Observe that $I$ and $X$ are the exceptions associated with $A$, whereas $I_A$ and $X_A$ are the exceptions to be applied to $A$'s derived strings. We, therefore, replace $A_{+I-X}$ with $\alpha_{\pm(I \cup I_A,\, X \cup X_A)}$. This cumulation of inclusions and exclusions is described informally in the Standard.
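The two language operations defined above can be made concrete for finite languages. The following Python sketch is an illustration only, not part of the original report: the function names `in_with_inclusions`, `with_inclusions`, and `with_exclusions` are hypothetical, symbols are assumed to be single characters, and $L_{+I}$ (which is infinite whenever $I$ is nonempty) is truncated to strings of bounded length. It also exhibits one choice of $L$, $I$, and $X$ for which the inclusion $(L_{+I})_{-X} \subseteq (L_{-X})_{+I}$ is proper.

```python
from itertools import product

def words(alphabet, max_len):
    """All strings over `alphabet` of length at most `max_len`."""
    return {"".join(t) for n in range(max_len + 1)
            for t in product(sorted(alphabet), repeat=n)}

def in_with_inclusions(w, L, I):
    """Decide w in L_{+I}: some subsequence of w is a string of L and
    every remaining symbol of w belongs to I."""
    def match(i, base):          # i: position in w, base: unmatched rest of a string of L
        if i == len(w):
            return base == ""
        if base and w[i] == base[0] and match(i + 1, base[1:]):
            return True          # w[i] is the next symbol of the base string
        return w[i] in I and match(i + 1, base)   # w[i] is an inserted symbol
    return any(match(0, u) for u in L)

def with_inclusions(L, I, alphabet, max_len):
    """The strings of L_{+I} of length at most max_len."""
    return {w for w in words(alphabet, max_len) if in_with_inclusions(w, L, I)}

def with_exclusions(L, X):
    """L_{-X}: the strings of L that contain no symbol of X."""
    return {w for w in L if not set(w) & set(X)}

L, I, X = {"ab", "c"}, {"c"}, {"c"}
lhs = with_exclusions(with_inclusions(L, I, "abc", 4), X)   # (L_{+I})_{-X}
rhs = with_inclusions(with_exclusions(L, X), I, "abc", 4)   # (L_{-X})_{+I}
print(sorted(lhs))   # ['ab']
print(lhs < rhs)     # True: the inclusion is proper here
```

In this example the symbol c is both included and excluded, so every string of the right-hand side that contains an inserted c is missing from the left-hand side.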
We modify the standard definition of a derivation step in an extended context-free grammar as follows. For a string $w$ over $\Sigma \cup N$, we denote by $w_{\pm(I,X)}$ the string obtained from $w$ by replacing every nonterminal $A \in \mathrm{sym}(w)$ with $A_{+I-X}$. Thus, we have attached the same inclusions and exclusions to every nonterminal in $w$. Let $\beta A_{+I-X} \gamma$ be a string of nonterminal symbols with exceptions and terminal symbols. We say that the string $\beta \alpha' \gamma$ can be derived from $\beta A_{+I-X} \gamma$, when the following two conditions hold:
1. $A \to E + I_A - X_A$ is a production schema in $P$.
2. For some string $\alpha$ in $L(E)_{+(I \cup I_A)-(X \cup X_A)}$, $\alpha' = \alpha_{\pm(I \cup I_A,\, X \cup X_A)}$.
Observe that the second condition reflects the idea that exceptions are propagated and cumulated by derivations. We illustrate these ideas with the following example grammar with exceptions. This grammar is also used to show that the exception-removal method we design can lead to an exponential blow-up in grammar size.
Example 1 The example grammar is specified as follows:
$$\begin{aligned}
A &\to (A_1 \mid \cdots \mid A_m) + \emptyset - \emptyset;\\
A_1 &\to (a_1 \mid A) + \{A_2\} - \emptyset;\\
A_2 &\to (a_2 \mid A) + \{A_3\} - \emptyset;\\
&\;\;\vdots\\
A_m &\to (a_m \mid A) + \{A_1\} - \emptyset.
\end{aligned}$$
We now demonstrate how exception propagation works. Consider a derivation step from $A_1$ with empty inclusions and empty exclusions (that is, from $(A_1)_{+\emptyset-\emptyset}$). Now, $(A_1)_{+\emptyset-\emptyset}$ derives
$$(A A_2 A_2)_{\pm(\{A_2\},\emptyset)} = A_{+\{A_2\}-\emptyset}\,(A_2)_{+\{A_2\}-\emptyset}\,(A_2)_{+\{A_2\}-\emptyset},$$
since the production schema $A_1 \to (a_1 \mid A) + \{A_2\} - \emptyset$ is in the grammar and $A A_2 A_2 \in L(a_1 \mid A)_{+\{A_2\}-\emptyset}$.
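The derivation step of Example 1 can be replayed mechanically. The following Python sketch is an illustration under stated assumptions: the names `NT`, `attach`, and `derive` are hypothetical, symbols are represented as strings, and the code does not check that the chosen string belongs to the language of the schema; it only performs the cumulation and attachment of exceptions described in the two conditions above.

```python
from typing import FrozenSet, NamedTuple

class NT(NamedTuple):
    """A nonterminal with attached inclusions and exclusions, read as A_{+I-X}."""
    name: str
    incl: FrozenSet[str]
    excl: FrozenSet[str]

def attach(w, incl, excl, nonterminals):
    """Attach the same exceptions (I, X) to every nonterminal of the string w."""
    return [NT(s, frozenset(incl), frozenset(excl)) if s in nonterminals else s
            for s in w]

def derive(nt, schema, alpha, nonterminals):
    """One derivation step from `nt` using the schema (E, I_A, X_A).
    `alpha` is assumed (not verified) to be a string of L(E) with the
    cumulated inclusions inserted; only the exclusions are checked."""
    _, incl_a, excl_a = schema
    incl = nt.incl | incl_a      # cumulate inclusions
    excl = nt.excl | excl_a      # cumulate exclusions
    assert not set(alpha) & excl, "alpha must avoid excluded symbols"
    return attach(alpha, incl, excl, nonterminals)

N = {"A", "A1", "A2", "A3"}
# A1 -> (a1 | A) + {A2} - {}   (second schema of Example 1)
schema_A1 = ("(a1|A)", frozenset({"A2"}), frozenset())
start = NT("A1", frozenset(), frozenset())
step = derive(start, schema_A1, ["A", "A2", "A2"], N)
for sym in step:
    print(sym.name, sorted(sym.incl), sorted(sym.excl))
```

Each of the three resulting nonterminals carries the inclusion set {A2} and an empty exclusion set, matching the derivation shown in the example.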
Finally, the language $L(G)$ of an extended context-free grammar $G$ with exceptions consists of the terminal strings derivable from the sentence symbol with empty inclusions and exclusions. Formally,
$$L(G) = \{ w \in \Sigma^* \mid S_{+\emptyset-\emptyset} \Rightarrow^{+} w \}.$$
Exceptions seem to be a context-dependent feature: legal expansions of a nonterminal depend on the context in which the nonterminal appears. We show, however, that exceptions do not extend the descriptive power of extended context-free grammars by giving a transformation that produces an extended context-free grammar that is structurally equivalent to an extended context-free grammar with exceptions. The transformation propagates exceptions to production schemas and modifies their associated regular expressions to capture the effect of exceptions.
Step 1: We explain how to modify regular expressions to capture the effect of exceptions. Let $E$ be a regular expression over $V = \Sigma \cup N$ and let $I = \{i_1, \ldots, i_k\}$ be a set of inclusion exceptions. First, observe that we can remove the $\emptyset$ symbol from the regular expression $E$ and maintain equivalence, if the language of the expression is not $\emptyset$. We modify $E$ to obtain a regular expression $E_{+I}$ such that $L(E_{+I}) = L(E)_{+I}$ by replacing each occurrence of a symbol $a \in \mathrm{sym}(E)$ with
$$(i_1 \mid i_2 \mid \cdots \mid i_k)^* \, a \, (i_1 \mid i_2 \mid \cdots \mid i_k)^*$$
and each occurrence of $\lambda$ with
$$(i_1 \mid i_2 \mid \cdots \mid i_k)^*.$$
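The symbol-by-symbol substitution for inclusions can be tried out with ordinary regular-expression tooling. The following Python sketch is an illustration, not the authors' implementation: `add_inclusions` is a hypothetical helper, symbols are assumed to be single lowercase letters, the expression is written in the syntax of Python's `re` module, and the input is assumed to contain no explicit empty-string or empty-set symbol. The check compares the rewritten expression with a brute-force definition of $L(E)_{+I}$ on all strings up to a fixed length, for one example expression whose language does not contain the empty string.

```python
import re
from itertools import product

def add_inclusions(expr, I):
    """Surround every symbol occurrence in expr with (i1|...|ik)*.
    Assumes single-letter symbols and no explicit empty-string symbol."""
    if not I:
        return expr
    star = "(?:" + "|".join(sorted(I)) + ")*"
    return re.sub(r"[a-z]", lambda m: star + m.group(0) + star, expr)

def words(alphabet, max_len):
    """All strings over `alphabet` of length at most `max_len`."""
    return ["".join(t) for n in range(max_len + 1)
            for t in product(alphabet, repeat=n)]

def in_with_inclusions(w, L, I):
    """Brute-force test of w in L_{+I} for a finite set L."""
    def match(i, base):
        if i == len(w):
            return base == ""
        if base and w[i] == base[0] and match(i + 1, base[1:]):
            return True
        return w[i] in I and match(i + 1, base)
    return any(match(0, u) for u in L)

E, I = "a(b|c)*", {"x", "c"}
E_plus = add_inclusions(E, I)
universe = words("abcx", 5)
L = {w for w in universe if re.fullmatch(E, w)}   # L(E), truncated to length 5
agree = all(bool(re.fullmatch(E_plus, w)) == in_with_inclusions(w, L, I)
            for w in universe)
print(E_plus)
print(agree)
```

The truncation is harmless for this comparison because the base string of any tested word is a subsequence of it and therefore no longer than the word itself.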
For a set $X$ of excluded elements, we obtain a regular expression $E_{-X}$ such that $L(E_{-X}) = L(E)_{-X}$ by replacing each occurrence of a symbol $a \in X$ in $E$ with $\emptyset$.
Step 2: We describe an algorithm for eliminating exceptions from an extended context-free grammar $G = (N, \Sigma, P, S)$ with exceptions. It propagates the exceptions in a production schema to nonterminals in the schema; see Fig. 3. The algorithm produces an extended context-free grammar $G' = (N', \Sigma', P', S')$ that is structurally equivalent to $G$. The nonterminals of $G'$ have the form $A_{+I-X}$, where $A \in N$ and $I, X \subseteq N$. A derivation step using a new production schema $A_{+I-X} \to E$ in $P'$ corresponds to a derivation step using an old production schema for nonterminal $A$ under inclusions $I$ and exclusions $X$.
  $N' := \{A_{+\emptyset-\emptyset} \mid A \in N\}$; $S' := S_{+\emptyset-\emptyset}$; $\Sigma' := \Sigma$;
  $Q := \{A_{+\emptyset-\emptyset} \to E + I - X \mid A \to E + I - X \in P\}$;
  $P'' := \emptyset$;
  for all $A_{+I_A-X_A} \to E + I - X \in Q$ do
    for all ($B \in (\mathrm{sym}(E) \cup I) - X$) and $B_{+I-X} \notin N'$ do
      $N' := N' \cup \{B_{+I-X}\}$;
      $Q := Q \cup \{B_{+I-X} \to E_B + (I \cup I_B) - (X \cup X_B) \mid B_{+\emptyset-\emptyset} \to E_B + I_B - X_B \in Q\}$
    od;
    $Q := Q - \{A_{+I_A-X_A} \to E + I - X\}$;
    $P'' := P'' \cup \{A_{+I_A-X_A} \to E + I - X\}$
  od;
  $P' := \{A_{+I_A-X_A} \to E_A \mid A_{+I_A-X_A} \to E + I - X \in P''$ and $E_A = ((E_{+I})_{-X})_{\pm(I,X)}\}$;
Figure 3: Exception elimination from an extended context-free grammar $(N, \Sigma, P, S)$ with exceptions.
Termination: The algorithm terminates since it generates, from each nonterminal $A$, at most $2^{2|N|}$ new nonterminals of the form $A_{+I-X}$. In the worst case the algorithm can exhibit this potentially exponential behavior. Given the grammar with exceptions that we defined in Example 1, the algorithm produces production schemas of the form
$$A_{+I-\emptyset} \to E$$
for every subset $I \subseteq \{A_1, \ldots, A_m\}$. We do not know whether this exponential behavior can be avoided. Is it always possible to obtain an extended context-free grammar $G'$ without exceptions that is (structurally) equivalent to an extended context-free grammar $G$ with exceptions such that the size of $G'$ is bounded by a polynomial in the size of $G$? We conjecture that the answer is negative.
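The propagation phase of the algorithm, and the exponential growth on the grammar of Example 1, can be observed with a small worklist computation. The following Python sketch is a simplified reading of Fig. 3 under stated assumptions, not the authors' code: `eliminate` and `example1` are hypothetical names, each production is abstracted to the triple (sym(E), I_A, X_A), the original schema of a nonterminal is looked up directly in the input grammar, and the final rewriting of the regular expressions is omitted.

```python
def eliminate(productions):
    """Propagate exceptions in the spirit of Fig. 3.
    `productions` maps each nonterminal A to (sym(E), I_A, X_A).
    Returns a dict from each generated nonterminal (A, I, X), read as
    A_{+I-X}, to the cumulated exceptions of its production schema."""
    empty = frozenset()
    schemas = {}
    queue = [(a, empty, empty) for a in productions]
    seen = set(queue)
    while queue:
        a, inc, exc = queue.pop()
        syms, inc_a, exc_a = productions[a]
        cum_inc, cum_exc = inc | inc_a, exc | exc_a   # cumulate exceptions
        schemas[(a, inc, exc)] = (cum_inc, cum_exc)
        # Nonterminals that may occur in strings derived with this schema.
        for b in (syms | cum_inc) - cum_exc:
            child = (b, cum_inc, cum_exc)
            if b in productions and child not in seen:
                seen.add(child)
                queue.append(child)
    return schemas

def example1(m):
    """The grammar of Example 1 with nonterminals A, A1, ..., Am."""
    names = [f"A{i}" for i in range(1, m + 1)]
    prods = {"A": (frozenset(names), frozenset(), frozenset())}
    for i, name in enumerate(names):
        nxt = names[(i + 1) % m]      # A1 includes A2, ..., Am includes A1
        prods[name] = (frozenset({f"a{i + 1}", "A"}),
                       frozenset({nxt}), frozenset())
    return prods

for m in (2, 3, 4):
    contexts = {inc for (a, inc, exc) in eliminate(example1(m)) if a == "A"}
    print(m, len(contexts))   # prints 2 4, then 3 8, then 4 16
```

For each m the nonterminal A is generated once for every subset of {A1, ..., Am}, which is the behavior described in the termination argument.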
3 Exception-Removal for DTDs
Document type definitions (DTDs) are, essentially, extended context-free grammars that have restricted and generalized regular expressions, called content models in the ISO Standard [9, 11], on the right-hand sides of their productions. The major difference between regular expressions and content models is that content models have the additional operators $F \,\&\, G$, $F?$, and $F^{+}$, where $F \,\&\, G \equiv FG \mid GF$.
The SGML Standard describes the basic meaning of inclusions as follows: "Elements named in an inclusion can occur anywhere within the content of the element being defined, including anywhere in the content of its subelements." The description is refined by the rule specifying that "...an element that can satisfy an element token in the content model is considered to do so, even if the element is also an inclusion." This refinement means, for example, that given the content model $(a \mid b)$ with inclusion $a$, $baa$ is a valid string of the content model, as one would expect intuitively; however, $aab$ is not a valid string of the content model. The reason is that the first $a$ in $aab$ must correspond to the $a$ in the content model and then the suffix $ab$ cannot be obtained. On the other hand, the string $aaa$ is a valid string of the content model.
The Standard recommends that inclusions "...should be used only for elements that are not logically part of the content"; for example, neither for $a$ nor for $b$ in the preceding example. Since the difficulty of understanding inclusions is caused, however, by the inclusion of elements that appear in the content model, we have to take them into account.
The basic idea of compiling the inclusion of the set $I = \{i_1, \ldots, i_k\}$ of symbols in a content model $E$ is to insert new subexpressions of the form $(i_1 \mid \cdots \mid i_k)^*$ in $E$. Preserving the unambiguity of the content model requires some extra care.
We define the SGML effect of inclusions $I$ on language $L \subseteq V^*$, where $V$ is an alphabet, as the language
$$L_{\oplus I} = \{ w_0 a_1 w_1 \cdots w_{n-1} a_n w_n \mid a_1 \cdots a_n \in L,\ n \ge 0,\ w_i \in (I - \mathrm{first}(\mathrm{tail}(L, a_1 \cdots a_i)))^*,\ i = 0, \ldots, n \},$$
where
$$\mathrm{first}(L) = \{ a \in V \mid au \in L \text{ for some } u \in V^* \}$$
and
$$\mathrm{tail}(L, w) = \{ u \in V^* \mid wu \in L \}.$$
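The definition above can be executed directly for finite languages. The following Python sketch is an illustration under stated assumptions: the function names are hypothetical, symbols are single characters, and $L$ is given as a finite set of strings. Because an inserted symbol may never belong to first(tail(L, ...)) of the prefix matched so far, a single left-to-right scan decides membership: a symbol that can continue the base string must do so, and any other symbol must be an inclusion.

```python
def first(L):
    """first(L): the symbols that can start a string of L."""
    return {u[0] for u in L if u}

def tail(L, w):
    """tail(L, w): the strings u such that wu is in L."""
    return {u[len(w):] for u in L if u.startswith(w)}

def in_sgml_inclusions(w, L, I):
    """Decide whether w belongs to the SGML effect of inclusions I on the
    finite language L."""
    matched = ""                       # prefix a_1...a_i of the base string
    for ch in w:
        if ch in first(tail(L, matched)):
            matched += ch              # ch satisfies a token of the model
        elif ch in I:
            continue                   # ch is an inserted inclusion
        else:
            return False
    return matched in L

L, I = {"a", "b"}, {"a"}               # content model (a|b) with inclusion a
for w in ("baa", "aab", "aaa"):
    print(w, in_sgml_inclusions(w, L, I))
# baa True
# aab False
# aaa True
```

The three results agree with the discussion of the content model $(a \mid b)$ with inclusion $a$ given earlier in this section.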
For example, the language $\{ab, ba\}_{\oplus\{a\}}$ consists of all strings of the forms $a^k b a^l$ and $b a^k$, where $k \ge 1$ and $l \ge 0$.
We introduce the difficulties caused by the & operator with the following example. Consider the content model $E = a?\,\&\,b$, which is unambiguous. A content model that captures the inclusion of symbol $a$ in $E$ should match an arbitrary sequence of $a$s after the $b$. A straightforward transformation would produce a content model of the form $F\,\&\,(ba^*)$ or of the form $(F\,\&\,b)a^*$, where $a \in \mathrm{first}(L(F))$ and $\lambda \in L(F)$. It is easy to see that these content models are ambiguous since, in each case, any $a$ following an initial $b$ can be matched by both $F$ and $a^*$.
Our strategy to handle such problematic subexpressions $F\,\&\,G$ is first to replace them by the equivalent subexpression $(FG \mid GF)$. (Notice that this substitution may not suffice, since $FG \mid GF$ can be ambiguous even if $F\,\&\,G$ is unambiguous. For example, the content model $(a?b \mid ba?)$ is ambiguous, whereas the content model $a?\,\&\,b$ is unambiguous.) Then, given a content model $E$ and a set $I$ of inclusions, we compute a new content model $E_{\oplus I}$ such that $L(E_{\oplus I}) = L(E)_{\oplus I}$.
Example 2 Let $E = (a?\,\&\,b?)c$ and $I = \{a, c\}$. We first transform it into the content model
$$(ab? \mid ba?)?c$$
and then into the content model
$$(aa^*(ba^*)? \mid b(aa^*)?)?\,c\,(a \mid c)^*.$$
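The two content models of Example 2 can be cross-checked against the definition of the SGML effect of inclusions. The following Python sketch is only a bounded sanity check, not a proof: the helper names are hypothetical, the content models are transcribed into the syntax of Python's `re` module, and the comparison covers all strings over {a, b, c} of length at most six.

```python
import re
from itertools import product

def first(L):
    return {u[0] for u in L if u}

def tail(L, w):
    return {u[len(w):] for u in L if u.startswith(w)}

def in_sgml_inclusions(w, L, I):
    """Membership of w in the SGML effect of inclusions I on a finite L."""
    matched = ""
    for ch in w:
        if ch in first(tail(L, matched)):
            matched += ch          # ch satisfies a token of the content model
        elif ch in I:
            continue               # ch is an inserted inclusion
        else:
            return False
    return matched in L

step1 = r"(ab?|ba?)?c"                    # (a? & b?)c with & expanded
step2 = r"(aa*(ba*)?|b(aa*)?)?c(a|c)*"    # inclusions {a, c} compiled in

universe = ["".join(t) for n in range(7) for t in product("abc", repeat=n)]
L = {w for w in universe if re.fullmatch(step1, w)}
print(sorted(L))   # ['abc', 'ac', 'bac', 'bc', 'c']

mismatches = [w for w in universe
              if bool(re.fullmatch(step2, w)) != in_sgml_inclusions(w, L, {"a", "c"})]
print(mismatches)  # []
```

An empty list of mismatches means that, on the tested strings, the second content model accepts exactly the SGML effect of the inclusions on the language of the first.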
In the full paper [10], we give a complete algorithm for computing the content model $E_{\oplus I}$ from a given content model $E$ and a given set of inclusions $I$.
Clause 11.2.5.2 of the SGML Standard states that "...exclusions modify the effect of model groups to which they apply by precluding options that would otherwise have been available". The exact meaning of the phrase "precluding options" is not clear from the Standard. Our first task is, therefore, to formalize the intuitive notion of exclusion. As a motivating example consider excluding the symbol $b$ from the content model $E = a(b \mid c)c$, which defines the language $L(E) = \{abc, acc\}$. The element $b$ is clearly an alternative to the first occurrence of $c$, and we can realize its exclusion by modifying the content model to obtain $E' = acc$. Now, consider excluding $b$ from the content model $F = a(bc \mid cc)$. The case is not as clear since $b$ appears in a seq subexpression. On the other hand, both $E$ and $F$ define the same language.
Let $L \subseteq V^*$ be a language and let $X \subseteq V$. Motivated by the preceding examples, we define the effect of excluding $X$ from $L$, which we denote by $L_{-X}$, to be the set of all strings in $L$ that do not contain any symbol of $X$. As an example, the effect of excluding $\{b\}$ from the language of the preceding content models $E$ and $F$ is
$$L(E)_{-\{b\}} = L(F)_{-\{b\}} = \{acc\}.$$
Notice that an exclusion always specifies a subset of the original language. In the full paper [10], we show how to compute a content model $E_{\ominus X}$ such that $L(E_{\ominus X}) = L(E)_{-X}$ from a given content model $E$ and a given set $X$ of exclusions. The modified content model $E_{\ominus X}$ is unambiguous if the original content model $E$ is unambiguous, and its computation takes time linear in the size of $E$.
As a restriction of the applicability of exclusions the Standard states that "...an exclusion cannot affect a specification in a model group that indicates that an element is required." The Standard does not specify rigorously how a model group (a subexpression of a content model) indicates that an element is required. The intent of the Standard appears to be that when $A$ is an element, then in the contexts $A?$, $(A \mid B)$, and $A^*$, the $A$ is optional, but in the contexts $A$, $A^{+}$, $A\,\&\,B$, it is required.
Note that a content model cannot denote a language that is either $\emptyset$ or $\{\lambda\}$. The Standard gives a syntactic definition of the applicability of exclusions; we prefer to give a semantic definition. Therefore, a reasonable requirement for the applicability of excluding $X$ from a content model $E$ is that $L(E)_{-X} \not\subseteq \{\lambda\}$. Intuitively, $E_{\ominus X} \equiv \emptyset$ or $E_{\ominus X} \equiv \lambda$ means that excluding $X$ from $E$ precludes all elements from the content of $E$. On the other hand, $E_{\ominus X} \not\equiv \emptyset$ and $E_{\ominus X} \not\equiv \{\lambda\}$ means that $X$ precludes only elements that are optional in $L(E)$. We propose that the preceding requirement be the formalization of how a content model indicates that an element is required. Notice that computing $E_{\ominus X}$ is a reasonable and efficient test for the applicability of exclusions $X$ to a content model $E$.
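The semantic view of exclusions and the proposed applicability requirement can be stated in a few lines for finite languages. The following Python sketch is an illustration, with hypothetical function names and single-character symbols; it works on the languages of the content models, not on the content models themselves, so it does not reproduce the linear-time construction of the full paper.

```python
def exclude(L, X):
    """The strings of L that contain no symbol of X."""
    return {w for w in L if not set(w) & set(X)}

def exclusion_applicable(L, X):
    """Proposed requirement: the language with X excluded must not be a
    subset of the set containing only the empty string."""
    return not exclude(L, X) <= {""}

E = {"abc", "acc"}   # language of a(b|c)c
F = {"abc", "acc"}   # language of a(bc|cc)

print(exclude(E, {"b"}), exclude(F, {"b"}))   # {'acc'} {'acc'}
print(exclusion_applicable(E, {"b"}))          # True
print(exclusion_applicable(E, {"a"}))          # False: nothing remains
```

Excluding b only removes an alternative, so the exclusion is applicable; excluding a removes every string of the language, so it is not.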
We are now in a position to consider the removal of exceptions from a DTD. Let $G_1 = (N_1, \Sigma, P_1, S_1)$ be an extended context-free grammar with exceptions and let $G_2 = (N_2, \Sigma, P_2, S_2)$ be the extended context-free grammar that results by eliminating exceptions from $G_1$ using the algorithm in Fig. 3. If $B_{+I-X} \in N_2$, then there is a production schema $B_{+I-X} \to E_B$ in $P_2$ if and only if there is a production schema $B \to E + I_B - X_B$ in $P_1$ such that $E_B = (E_{+(I \cup I_B)-(X \cup X_B)})_{\pm(I \cup I_B,\, X \cup X_B)}$. Lastly, we can apply the same idea to an SGML DTD with exceptions to obtain a structurally equivalent DTD without exceptions.
4 Concluding Remarks and Open Problems
When we apply the exception-removal transformation of Fig. 3 to an SGML DTD with exceptions, we do indeed obtain a new DTD without exceptions. Unfortunately, document instances of the original DTD do not conform to the new DTD, since the new DTD has new elements, and the tags that correspond to those elements do not appear in the old document instances. Therefore, how useful are our results?
First, the results are interesting in their own right as a contribution to the theory of extended context-free grammars and SGML DTDs. We can eliminate exceptions to give structurally equivalent grammars and DTDs while preserving their SGML unambiguity.
Second, during the DTD design phase, it may be convenient to use exceptions. Our results imply that we can eliminate the exceptions and produce a final DTD design without exceptions before any document instances are created.
Third, rather than producing a new DTD, we can emulate it with an extended context-free grammar. We first apply the exception-removal transformation to the extended context-free grammar with exceptions given by the original DTD with exceptions. We then modify its productions to explicitly include the old tags. For example, we transform a production of the form
$$A_{+I-X} \to E_A$$
into a production of the form
$$A_{+I-X} \to \texttt{<A>}\; E_A\; \texttt{</A>},$$
where $\texttt{<A>}$ and $\texttt{</A>} \in \Sigma'$ are the start and end tags that the new grammar has to use as delimiters for the element $A$. The new productions can be applied to the old DTD instances.
Lastly, we can attack the document-instance problem head on by translating old instances into new instances. A convenient technique is to use a generalization of syntax-directed translation grammars (see Aho and Ullman [1, 2] and Wood [12]) to give extended context-free transduction grammars and the corresponding transduction version of DTDs that we call "Document Type Transduction Definitions." We are currently investigating this approach, which would also be applicable to the DTD database schema issue raised by Christophides et al. [8]. It could also be used to convert a document marked up according to one DTD into a document marked up according to a different, but related, DTD.
Acknowledgements
We would like to thank Anne Brüggemann-Klein and Gaston Gonnet for the discussions that encouraged us to continue our investigation of the exception problem in SGML.
References
[1] A.V. Aho and J.D. Ullman. The Theory of Parsing, Translation, and Compiling, Vol. I: Parsing. Prentice-Hall, Inc., Englewood Cliffs, NJ, 1972.
[2] A.V. Aho and J.D. Ullman. The Theory of Parsing, Translation, and Compiling, Vol. II: Compiling. Prentice-Hall, Inc., Englewood Cliffs, NJ, 1973.
[3] A. Brüggemann-Klein. Unambiguity of extended regular expressions in SGML document grammars. In Th. Lengauer, editor, Algorithms - ESA 93. Springer-Verlag, 1993.
[4] A. Brüggemann-Klein. Regular expressions into finite automata. Theoretical Computer Science, 120:197-213, 1993.
[5] A. Brüggemann-Klein. Compiler-construction tools and techniques for SGML parsers: Difficulties and solutions. To appear in EPODD, 1996.
[6] A. Brüggemann-Klein and D. Wood. One-unambiguous regular languages. To appear in Information and Computation, 1996.
[7] A. Brüggemann-Klein and D. Wood. The validation of SGML content models. To appear in Mathematical and Computer Modelling, 1996.
[8] V. Christophides, S. Abiteboul, S. Cluet, and M. Scholl. From structured documents to novel query facilities. SIGMOD Record, 23(2):313-324, June 1994. (Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data).
[9] C. F. Goldfarb. The SGML Handbook. Clarendon Press, Oxford, 1990.
[10] P. Kilpeläinen and D. Wood. Exceptions in SGML document grammars, 1996. Submitted for publication.
[11] International Organization for Standardization. ISO 8879: Information Processing - Text and Office Systems - Standard Generalized Markup Language (SGML), October 1986.
[12] D. Wood. Theory of Computation. John Wiley, New York, NY, 1987.