Acta Informatica 14, 175 - 193 (1980)
© by Springer-Verlag 1980
On Parsing Two-Level Grammars
Lutz Michael Wegner
Institut für Angewandte Informatik und Formale Beschreibungsverfahren der Universität Karlsruhe, Postfach 6380, 7500 Karlsruhe 1, Germany (Fed. Rep.)
Summary. Making use of the fact that two-level grammars (TLGs) may be thought of as finite specifications of context-free grammars (CFGs) with "infinite" sets of productions, known techniques for parsing CFGs are applied to TLGs by first specifying a canonical CFG G' - called skeleton grammar - obtained from the "cross-reference" of the TLG G. Under very natural restrictions it can be shown that for these grammar pairs (G, G') there exists a 1-1 correspondence between leftmost derivations in G and leftmost derivations in G'. With these results a straightforward parsing algorithm for restricted TLGs is given.
0. Introduction
Two-level grammars (van Wijngaarden grammars, W-grammars, 2VWGs) were introduced by A. van Wijngaarden in 1965 [22] for the definition of ALGOL 68 [23, 24]. Major results concerning their formal properties were obtained by Sintzoff [17], Baker [4], Greibach [9] and Deussen [6, 7]. A rather complete bibliography of two-level grammars may be found in [8]. Since "straight" two-level grammars were considered well-suited for generative language definitions but impractical for parsing, variants of two-level grammars (TLGs) were widely investigated. Among these are "Two-level Graph Grammars" [11], "Nested Grammars" [3], and "Affix Grammars" [12, 18], to mention but a few.
This paper reconsiders "pure" two-level grammars with emphasis on regular metarules. The major aim is to give reasonable restrictions which guarantee solvability of the word problem for TLGs and which, going a step further, permit practical backtrack-free parsing methods. The basic approach is to make use of the fact that TLGs may be thought of as finite specifications of "infinite" context-free grammars, i.e. context-free grammars (CFGs) with a potentially infinite set of context-free productions. This way of looking at TLGs is particularly attractive since CFGs are most likely the
best-investigated part of formal language theory. Hence, it is only natural to hope that known techniques are applicable to the solution of problems involving TLGs.
The major problem, namely how to apply, say, precedence parsing methods to TLGs, i.e. how to compute precedence relations for a potentially infinite set of context-free productions, is solved indirectly. We introduce a context-free grammar, called the "skeleton grammar", which we obtain from the so-called "cross-reference" of a given TLG. This contrasts with the "head-grammar" of Watt [8] which is part of the Affix Grammar itself. Context-free grammars which "model" derivations in TLGs have been constructed before, e.g. for the Algol 68 grammar [5], but to our knowledge always in an ad hoc way. A result of this paper then is that we can show why, under very natural restrictions, the proposed definition yields exactly one context-free grammar G' for every TLG G, and that for certain of these grammar pairs (G, G') there exists a 1-1 correspondence between leftmost derivations in G' and leftmost derivations in G. Using this property, parsing algorithms can be given.
As just mentioned, the context-free skeleton grammar is obtained from the "cross-reference" of a TLG. As a consequence, the computability of the cross-reference and related problems have to be investigated, i.e. properties of the sets of strict notions which a hypernotion yields. In [19] it is shown that the computability of the cross-reference of a two-level grammar with regular metarules is a hard open problem. Furthermore it is shown there that a positive answer implies a positive answer to the famous "string unification problem" [15], while the converse is open. Following the publication of these results, it was learned that G.S. Makanin has claimed a solution to the string unification problem [13] which, of course, made the computability of the cross-reference for regular metarules even more challenging.
In this paper we shall concentrate on the parsing aspects and omit the lengthy discussion of these technical problems. Where needed, we shall quote the necessary results from [19].
1. Definitions
We assume the reader to be familiar with formal language theory (cf. [16]), in particular with context-free grammars (CFGs). As usual, $\emptyset$ will denote the empty set and $\varepsilon$ will be the empty word. The formal definitions connected with two-level grammars are adopted from [4] and [6] and use, to some extent, the terms coined in the Algol 68 Reports [23, 24].
Definition 1. A two-level grammar (TLG, 2VWG, van Wijngaarden grammar, W-grammar) is an ordered 7-tuple $(M, V, N, T, R_M, R_V, S)$, where
$M$ is a finite set of metanotions;
$V$ is a finite set of syntactic variables, $M \cap V = \emptyset$;
$N = \{\langle H \rangle \mid H \in (M \cup V)^+\}$, the finite set of hypernotions;
$T$ is a finite set of terminals;
$R_M$ is a finite set of metaproduction rules $X \to Y$, where $X \in M$, $Y \in (M \cup V)^*$, s.t. for all $W \in M$: $(M, V, R_M, W)$ is a context-free grammar;
$R_V \subseteq N \times (N \cup T)^*$, the finite set of hyperproduction rules $\langle H_0 \rangle \to h_1 h_2 \ldots h_m$ ($h_i \in T \cup N \cup \{\varepsilon\}$, $1 \le i \le m$);
$S = \langle s \rangle \in N$, $s \in V^+$, the start notion.
Definition 2. Given a TLG $G = (M, V, N, T, R_M, R_V, S)$ we define the set $R_S(r)$ of strict production rules of a hyperproduction rule $r = (\langle H_0 \rangle \to h_1 h_2 \ldots h_m)$ containing the $n \ge 0$ metanotions $W_1, W_2, \ldots, W_n$ as follows:
$$R_S(r) = \{\varphi(\langle H_0 \rangle) \to \varphi(h_1)\,\varphi(h_2) \ldots \varphi(h_m) \mid \varphi \text{ is a homomorphism with } \varphi(v) = v \text{ for } v \in V \cup T \cup \{\langle, \rangle\} \text{ and } \varphi(W_i) \in L((M, V, R_M, W_i)) \text{ for } 1 \le i \le n,\ \varphi(H_0) \ne \varepsilon\}.$$
$\varphi(\langle H_0 \rangle)$ (and $\varphi(h_i)$ for $h_i \notin T$) is called a strict notion. Furthermore, let $N_S$ denote the set of strict notions, i.e. $N_S = \{\varphi(\langle H \rangle) \mid \varphi \text{ a homomorphism from above and } \langle H \rangle \in N\}$.
Remark 3. Some of the conventions, abbreviations, and special terms of the Algol 68 Reports will be used without further introduction. In particular, the term "consistent substitution" ("consistent replacement") will be used to refer to the homomorphic replacement of Definition 2. For details, the reader should refer to [24], pp. 17-35.
By means of the possibly infinite set of strict rules of Definition 2, one defines a "derivation step" analogously to the definition of context-free derivations.
Definition 4. If $G = (M, V, N, T, R_M, R_V, S)$ is a TLG then the language specified by $G$ is $L(G) = \{x \in T^* \mid S \Rightarrow_G^* x\}$, where $\Rightarrow_G^*$ is the reflexive and transitive closure of the binary relation $\Rightarrow_G$ in $(N_S \cup T)^*$ which is defined by
$$X \Rightarrow_G Y \text{ iff } \exists P, Q \in (N_S \cup T)^*: X = P X' Q \text{ and } Y = P Y' Q \text{ and } (X' \to Y') \in R_S(r) \text{ for some } r \in R_V.$$
$X \Rightarrow_G Y$ is called a derivation step and the subscript $G$ is omitted whenever the context permits it. We say a derivation is leftmost (rightmost), short lm (rm), if in the derivation step $X \Rightarrow_G Y$ the leftmost (rightmost) strict notion not in $T$ is rewritten; formally:
$$X \stackrel{lm}{\Rightarrow}_G Y \text{ iff } \exists P \in T^*, Q \in (N_S \cup T)^*: X = P X' Q \text{ and } Y = P Y' Q \text{ and } (X' \to Y') \in R_S(r) \text{ for some } r \in R_V;$$
$$X \stackrel{rm}{\Rightarrow}_G Y \text{ iff } \exists P \in (N_S \cup T)^*, Q \in T^*: X = P X' Q \text{ and } Y = P Y' Q \text{ and } (X' \to Y') \in R_S(r) \text{ for some } r \in R_V.$$
In the following we shall denote metanotions by capital letters, syntactic variables and terminals by small letters. We also omit mentioning the "rightmost cases" whenever it is obvious that definitions and results apply to both the leftmost and rightmost case. We shall now define "derivations" and "parses" which look like their context-free counterparts but differ in one important aspect (cf. Theorem 9).
Definition 5. Given a TLG $G = (M, V, N, T, R_M, R_V, S)$, a sequence $D = \alpha_0, \alpha_1, \ldots, \alpha_n$ with $\alpha_i \in (N_S \cup T)^*$ is called a derivation of $\alpha_n$ in $G$ iff $\alpha_i \Rightarrow_G \alpha_{i+1}$ for $0 \le i \le n-1$. $D$ is called a leftmost derivation of $\alpha_n$ in $G$ iff $\alpha_i \stackrel{lm}{\Rightarrow}_G \alpha_{i+1}$ for $0 \le i \le n-1$. $D$ is called a terminal derivation of $\alpha_n$ in $G$ iff $\alpha_n \in L(G)$.
Definition 6. Let $G = (M, V, N, T, R_M, R_V, S)$ be a TLG and let $r_i$ denote a hyperrule in $R_V$, $1 \le i \le |R_V|$. A sequence $E = r_{i_1}, r_{i_2}, \ldots, r_{i_n}$ is called a left parse of $w \in (N_S \cup T)^*$ if there exists at least one leftmost derivation $D$ of $w$ in $G$ s.t. $D = \alpha_0, \alpha_1, \ldots, \alpha_n$ with $\alpha_n = w$ and for $1 \le j \le n$: $\alpha_{j-1} \stackrel{lm}{\Rightarrow}_G \alpha_j$ by use of hyperrule $r_{i_j}$.
For CFGs there is a 1-1 correspondence between leftmost derivations and left parses. In the case of TLGs there is the same correspondence between leftmost derivations and sequences of strict rules. As for the sequence of hyperrules (the left parse!), however, this does not hold. A hyperrule $r \in R_V$ may give rise to a possibly infinite number of strict rules $r_S \in R_S(r)$ (see Def. 2) with identical left-hand sides (lhs's) but distinct right-hand sides (rhs's) and thus one cannot necessarily obtain a unique lm derivation from a left parse. Conversely a strict rule $r_S$ may be yielded by two or more distinct hyperrules $r, r'$, i.e. $r_S \in R_S(r) \cap R_S(r')$ for $r \ne r'$. Thus, one cannot always obtain a unique left parse from a lm derivation.
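Example 7. Let $G = (\{A, B\}, \{a, s\}, \{\langle s \rangle, \langle A \rangle, \langle B \rangle\}, \{a\}, \{A \to aA \mid a,\ B \to A\}, \{(1)\ \langle s \rangle \to \langle A \rangle,\ (2)\ \langle A \rangle \to a,\ (3)\ \langle B \rangle \to a\}, \langle s \rangle)$ be a trivial TLG with $L(G) = \{a\}$. There are infinitely many lm derivations $D$ for the left parse $E = r_1, r_2$, namely $\langle s \rangle \Rightarrow \langle a \rangle \Rightarrow a$, $\langle s \rangle \Rightarrow \langle a^2 \rangle \Rightarrow a$, etc. Conversely the lm derivation $D = \langle s \rangle, \langle a \rangle, a$ has two left parses, namely $E_1 = r_1, r_2$ and $E_2 = r_1, r_3$.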
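The two phenomena of Example 7 can be made concrete with a small enumeration. The following is only a sketch: the list encoding of hypernotions, the helper names, and the length bound `BOUND` are assumptions introduced here so that the (infinite) sets of strict rules can be listed up to a cut-off.

```python
from itertools import product

# Metarule languages of Example 7, cut off at a length bound (an assumption
# made only to keep the enumeration finite): L(A) = L(B) = {a, aa, aaa, ...}.
def meta_words(max_len):
    return ["a" * k for k in range(1, max_len + 1)]

# Hyperrules as (label, lhs hypernotion, rhs). Hypernotions are token lists,
# upper-case tokens are metanotions, a plain string rhs is a terminal.
HYPERRULES = [
    ("r1", ["s"], ["A"]),   # <s> -> <A>
    ("r2", ["A"], "a"),     # <A> -> a
    ("r3", ["B"], "a"),     # <B> -> a
]

def substitute(hypernotion, phi):
    """Apply the consistent substitution phi and wrap the result in < >."""
    return "<" + "".join(phi.get(tok, tok) for tok in hypernotion) + ">"

def strict_rules(rule, max_len):
    """Enumerate the strict rules of one hyperrule up to the length bound."""
    _, lhs, rhs = rule
    rhs_tokens = rhs if isinstance(rhs, list) else []
    metas = sorted({t for t in lhs + rhs_tokens if t.isupper()})
    result = set()
    for words in product(meta_words(max_len), repeat=len(metas)):
        phi = dict(zip(metas, words))
        right = substitute(rhs, phi) if isinstance(rhs, list) else rhs
        result.add((substitute(lhs, phi), right))
    return result

BOUND = 3
rules_by_label = {r[0]: strict_rules(r, BOUND) for r in HYPERRULES}
# Strict rules yielded by both r2 and r3: the reason one lm derivation
# can have two left parses.
shared = rules_by_label["r2"] & rules_by_label["r3"]
```

Under these assumptions `rules_by_label["r1"]` contains one strict rule with lhs `<s>` per admissible value of A, which is why the left parse r1, r2 does not fix the derivation, and `shared` contains `("<a>", "a")`, which is why the derivation through `<a>` has two left parses.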
Remark 8. In the following, TLGs will be specified by $R_M$ and $R_V$ if the context is clear. We may also denote the language of a context-free grammar $G = (V_N, V_T, R, S)$ by $L(S)$ rather than $L((V_N, V_T, R, S))$ if $V_N, V_T, R$ are known. The following theorem is an easy consequence of the problem just discussed.
Theorem 9. It is undecidable whether an arbitrary TLG $G$ has the property that for each leftmost derivation there is at most one left parse.
Proof: Assume the contrary. Then consider two arbitrary CFGs $G' = (V_N', V_T', R', S')$ and $G'' = (V_N'', V_T'', R'', S'')$ and let $G = (V_N' \cup V_N'',\ V_T' \cup V_T'' \cup \{s\},\ \{\langle s \rangle, \langle S' \rangle, \langle S'' \rangle\},\ \{a\},\ R' \cup R'',\ \{(1)\ \langle s \rangle \to \langle S' \rangle,\ (2)\ \langle S' \rangle \to a,\ (3)\ \langle S'' \rangle \to a\},\ \langle s \rangle)$ be a TLG.
Clearly there exists a (leftmost) derivation with more than one left parse iff $L(G') \cap L(G'') \ne \emptyset$. But then our assumption implies that we can decide for arbitrary CF languages $L_1, L_2$ whether $L_1 \cap L_2 = \emptyset$ - a contradiction (see e.g. [16]).
The property that for each left parse there is exactly one leftmost derivation is also, in general, undecidable; this is easily shown with a similar construction.
2. Some Properties
The relation between leftmost derivations and left parses is of fundamental importance for our construction since the productions of the CF skeleton grammar, which we shall propose, relate to the hyperrules of the TLG. Thus, a lm derivation in the skeleton grammar will not yield a priori a unique lm derivation in the TLG. Section 3 will discuss this problem in detail and show solutions. One of the properties which we shall need there is "boundedness", which is defined below.
Definition 10 (Deussen). Let $|h_i|_A$ denote the number of occurrences of a metanotion $A \in M$ in hypernotion $h_i \in N$. Let $|h_i|_V$ denote the number of syntactic variables of $V$ occurring in $h_i$. If $h_i \in T$ then $|h_i|_V = 1$. A hyperrule $\langle H_0 \rangle \to h_1 h_2 \ldots h_m$ ($h_i \in N \cup T$ for $1 \le i \le m$) is said to be
leftbound if $\forall A \in M$: $|H_0|_A \ne 0 \Rightarrow |h_1|_A + |h_2|_A + \ldots + |h_m|_A \ne 0$;
rightbound if $\forall A \in M$: $|h_1|_A + |h_2|_A + \ldots + |h_m|_A \ne 0 \Rightarrow |H_0|_A \ne 0$.
The hyperrule is strictly leftbound if $\forall A \in M$: [...]; strictly rightbound if $\forall A \in M$: [...].
As expected a TLG is leftbound (rightbound, etc.) if all its hyperrules are leftbound (rightbound, etc.). A well-known result by John L. Baker is that for every context-sensitive Chomsky grammar there is an equivalent context-sensitive (by definition: strictly leftbound) TLG [4]. It is not difficult to see that context-sensitive (CS) TLGs guarantee solvability of the word problem: the strict boundedness of the hyperrules forces the sentential form $\alpha_i$ to be of at least the same length as $\alpha_{i-1}$ (in terms of syntactic variables).
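The two (non-strict) boundedness conditions of Definition 10 are simple occurrence checks. A minimal sketch follows, assuming that a hypernotion is given as a list of tokens and that a terminal on the rhs is given as a one-element list; the function names are hypothetical and not taken from the paper.

```python
from collections import Counter

def meta_counts(hypernotions, metanotions):
    """Count the occurrences |h|_A of each metanotion A over a list of hypernotions."""
    counts = Counter()
    for h in hypernotions:
        counts.update(tok for tok in h if tok in metanotions)
    return counts

def is_leftbound(lhs, rhs, metanotions):
    """Every metanotion occurring in the lhs also occurs somewhere in the rhs."""
    left = meta_counts([lhs], metanotions)
    right = meta_counts(rhs, metanotions)
    return all(right[a] != 0 for a in metanotions if left[a] != 0)

def is_rightbound(lhs, rhs, metanotions):
    """Every metanotion occurring in the rhs also occurs in the lhs."""
    left = meta_counts([lhs], metanotions)
    right = meta_counts(rhs, metanotions)
    return all(left[a] != 0 for a in metanotions if right[a] != 0)

M = {"A", "B"}
# Hyperrules (1) and (2) of Example 7.
rule1 = (["s"], [["A"]])      # <s> -> <A>
rule2 = (["A"], [["a"]])      # <A> -> a   (terminal as a one-token list)
```

With this encoding, hyperrule (1) of Example 7 is leftbound but not rightbound (A appears only on the rhs), while hyperrule (2) is rightbound but not leftbound.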
The restrictions which Deussen [6, 7] introduces guarantee that in going from $\alpha_i$ to $\alpha_{i+1}$ either the nonterminal strict notion $\langle H \rangle$ is replaced by another one, say $\langle H' \rangle$, in which case one requires that $|H|_V \le |H'|_V$, or $\langle H \rangle$ is replaced by $n > 1$ strict notions, in which case the length in terms of $V$ does not matter. Deussen requires for his class of decidable TLGs that hyperrules are $\varepsilon$-free, i.e. that the rhs of a strict rule is distinct from $\varepsilon$. In our approach we shall not exclude $\varepsilon$-hyperrules. However, in order to avoid an ambiguity we shall exclude the occurrence of the empty word in "hidden positions", i.e. the case where $\langle H \rangle$ is a hypernotion with $H \in M^+$ and for some homomorphic replacement $\varphi$: $\varphi(H) = \varepsilon$. The following lemma shows that it is not unreasonable to require that a TLG is free of hidden empty notions.
Lemma 11. For every TLG $G = (M, V, N, T, R_M, R_V, S)$ there is a TLG $G' = (M', V', N', T', R_M', R_V', S')$ s.t. $L(G) = L(G')$ and $G'$ is free of hidden empty notions.
Proof. Set $V' = V \cup \{\#\}$, $V \cap \{\#\} = \emptyset$. Let $M = M'$, $R_M = R_M'$, $T = T'$, $N' = \{\langle \# H \rangle \mid \langle H \rangle \in N\}$. $R_V'$ is the set of hyperrules obtained from $R_V$ by replacing each hypernotion $\langle H \rangle \in N$ by the corresponding element $\langle \# H \rangle \in N'$. Furthermore, add a hyperrule $\langle \# \rangle \to \varepsilon$ to $R_V'$, set $S' = \langle \# s \rangle$ and the proof is complete.
The method of making a TLG free of hidden empty notions, which was shown above, is very crude. Since it is decidable whether $H_i$ yields an empty notion one could easily give more sophisticated methods. Before being able to tie derivations and parses in TLGs together we need a few additional properties which are directly associated with the set of strict notions which a hypernotion yields.
Definition 12. Let $M, V, R_M$ be defined as for two-level grammars (Definition 1). A hypernotion system HS is a 4-tuple $(M, V, R_M, H)$, where $H \in (M \cup V)^+$, the hypernotion (axiom). The language of a hypernotion system $HS = (M, V, R_M, H)$, where $H$ contains the $n \ge 0$ metanotions $W_1, W_2, \ldots, W_n$, is defined as $L(HS) = \{\varphi(H) \mid \varphi \text{ is a homomorphism with } \varphi(v) = v \text{ for } v \in V \text{ and } \varphi(W_i) \in L((M, V, R_M, W_i)) \text{ for all } 1 \le i \le n\}$.
An equivalent way of defining $L(HS)$ is to require that identical derivation trees are obtained for identical metanotions in $H$. Because of the existence of ambiguous CFGs this is a slightly stronger restriction which, however, does not affect the set of languages generated.
Definition 13. If $HS = (M, V, R_M, H)$ is a hypernotion system with $H = X_1 X_2 \ldots X_n$, $X_i \in (M \cup V)$ for $1 \le i \le n$, then HS is uniquely assignable (u.a.) if for all $w \in L(HS)$ there is exactly one partition $(p_1, p_2, \ldots, p_n)$ s.t. $w = p_1 p_2 \ldots p_n$ and $\varphi(X_i) = p_i$ for $\varphi$ a homomorphism of Definition 12.
Example 14. Consider metarules $R_A = \{A \to aA \mid a,\ A1 \to A,\ A2 \to A\}$. Then
$HS_1 = (\{A\}, \{a\}, R_A, A\,A)$ is u.a.,
$HS_2 = (\{A\}, \{a\}, R_A, A1\,A2)$ is not u.a.,
$HS_3 = (\{A\}, \{a\}, R_A, A1\,A2\,A1)$ is not u.a.,
$HS_4 = (\{A\}, \{a, b\}, R_A, A1\,b\,A2\,A1)$ is u.a.
The following results are stated without proof; the reader is referred to [19].
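The four systems of Example 14 can be probed by brute force. The sketch below is an illustration under stated assumptions: the metanotion languages are supplied as Python functions that list their words up to a length bound, and all names are hypothetical. A bounded search of this kind can exhibit a word with two partitions (refuting unique assignability) but cannot prove unique assignability.

```python
from collections import defaultdict
from itertools import product

def partitions_by_word(axiom, meta_lang, max_len):
    """Map each generated word to the set of partitions that produce it."""
    metas = sorted({x for x in axiom if x in meta_lang})
    found = defaultdict(set)
    for values in product(*(meta_lang[m](max_len) for m in metas)):
        phi = dict(zip(metas, values))            # consistent substitution
        parts = tuple(phi.get(x, x) for x in axiom)
        found["".join(parts)].add(parts)
    return found

def counterexample(axiom, meta_lang, max_len):
    """Return a word with more than one partition, or None if none is found."""
    table = partitions_by_word(axiom, meta_lang, max_len)
    for word in sorted(table):
        if len(table[word]) > 1:
            return word
    return None

# L(A) = L(A1) = L(A2) = {a, aa, ...}, listed up to the given bound.
a_plus = lambda n: ["a" * k for k in range(1, n + 1)]
LANG = {"A": a_plus, "A1": a_plus, "A2": a_plus}

hs1 = counterexample(["A", "A"], LANG, 4)
hs2 = counterexample(["A1", "A2"], LANG, 4)
hs3 = counterexample(["A1", "A2", "A1"], LANG, 4)
hs4 = counterexample(["A1", "b", "A2", "A1"], LANG, 4)
```

For the bound chosen here the search finds the word `aaa` for $HS_2$ (split as a|aa or aa|a) and `aaaaa` for $HS_3$, and finds no counterexample for $HS_1$ and $HS_4$, in line with the example.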
Proposition 15. Membership and emptiness are decidable for languages of hypernotion systems. The "empty intersection" and the "equality problem" of hypernotion languages is, in general, undecidable for hypernotion systems with CF metarules and is a hard open problem for regular metarules.
It is important to note that a solution to the open problem implies a solution to the "string unification problem" (see e.g. [15]) for which G.S. Makanin published a positive answer just recently [13]. Attempts to directly show that solvability of the "string unification problem" implies solvability of the empty intersection problem have failed so far.
Proposition 16. For arbitrary hypernotion systems $HS = (M, V, R_M, H)$ it is
(a) undecidable whether HS is uniquely assignable if $R_M$ is context-free,
(b) decidable whether HS is uniquely assignable if $R_M$ is regular and no metanotion occurs more than once in $H$,
(c) decidable whether HS is uniquely assignable for $R_M$ regular iff the "empty intersection problem" for $R_M$ regular is decidable (see 15 above).
We should also mention the time complexity of the word problem for hypernotion languages. The best upper bound we were able to find is roughly $O(n^{3m})$, where $n$ is the length of the given word and $m$ is the number of metanotions occurring more than once in the axiom. The following definition introduces uniquely assignable TLGs.
Definition 17. We say a TLG $G = (M, V, N, T, R_M, R_V, S)$ is lhs uniquely assignable - short: lhs u.a. - if in every hyperrule $r = (\langle H_0 \rangle \to \ldots) \in R_V$ the lhs hypernotion system $(M, V, R_M, H_0)$ is u.a.; $G$ is rhs uniquely assignable - short: rhs u.a. - if in every hyperrule $r = (\langle H_0 \rangle \to h_1 h_2 \ldots h_m) \in R_V$ the rhs hypernotion system $(M, V \cup T \cup \{\langle, \rangle\}, R_M, h_1 h_2 \ldots h_m)$ is u.a., $(V \cup T) \cap \{\langle, \rangle\} = \emptyset$.
Note that rhs unique assignability is weaker than saying $HS_i = (M, V \cup T, R_M, h_i)$ is u.a. for $1 \le i \le m$.
We mentioned above that unique assignability is, in general, undecidable for hypernotions with CF metanotions and open for regular metanotions. For most of the published TLGs (e.g. [24]), one can show that a hyperrule is not u.a., but quite often the grammar never derives a strict notion which has more than one partition. A typical example is the hyperrule
$\langle$where DECSETY1 dec TAG DECSETY2 includes dec TAG$\rangle \to \langle$where true$\rangle$
with metarules
DECSETY $\to$ DECSETY dec TAG | EMPTY
TAG $\to$ TAG LETTER | LETTER
etc.
The hyperrule is clearly not lhs u.a. but "harmless" since care has usually been taken that each "TAG" (declaration) occurs exactly once in "DECSETY" (in the declaration list) or otherwise more than one partition is purposely intended.
The second property for which we may use results from the propositions above is disjointness. In a disjoint TLG there is at most one hyperrule $r$ for any given strict rule $r_S$ s.t. $r_S \in R_S(r)$, i.e. $r_S$ is derivable from at most one hyperrule. Apart from use in parsing, this property is important if semantic routines are attached to the hyperrules.
Definition 18. If $G = (M, V, N, T, R_M, R_V, S)$ is a TLG and $r, r'$ are hyperrules in $R_V$, then $r$ and $r'$ are disjoint iff $R_S(r) \cap R_S(r') = \emptyset$, where $R_S(r)$ is the set of strict rules of $r$ (cf. Def. 2); $G$ is said to be disjoint iff for all $r, r' \in R_V$ with $r \ne r'$: $R_S(r) \cap R_S(r') = \emptyset$.
There are several ways to make a TLG disjoint. The most important and natural one is using "left-disjointness". This property is informally known as "kosherness" among the ALGOL 68 community. Note that a hyperrule with alternatives separated by '|' (Algol 68 Report: ';') is shorthand for a set of distinct hyperrules with the same left-hand side.
Definition 19. A TLG $G = (M, V, N, T, R_M, R_V, S)$ is said to be left-disjoint if for any pair of hyperrules $r = (\langle H_0 \rangle \to h_1 h_2 \ldots h_m)$ and $r' = (\langle H_0' \rangle \to h_1' h_2' \ldots h_n')$ in $R_V$, $r \ne r'$,
(a) if $H_0 \ne H_0'$, $(\alpha \to \beta) \in R_S(r)$ and $(\gamma \to \delta) \in R_S(r')$ then $\alpha \ne \gamma$;
(b) if $H_0 = H_0'$, $(\alpha \to \beta) \in R_S(r)$ and $(\gamma \to \delta) \in R_S(r')$ then $\beta \ne \delta$.
$G$ is said to be right-disjoint if for all $r, r'$ with $r \ne r'$ and all $(\alpha \to \beta) \in R_S(r)$, $(\gamma \to \delta) \in R_S(r')$ we have $\beta \ne \delta$.
Informally: Left-disjointness implies that no two strict rules have the same lhs unless they come from two hyperrules with the same lhs. In that case the right-hand sides of the strict rules are different. Right-disjointness is a stronger restriction which requires that the right-hand sides of strict rules of different hyperrules are disjoint. This relates to a result from context-free grammars (see e.g. [1], p. 370), namely that for every CFG $G$ there is a CFG $G'$ with distinct right-hand sides (a uniquely invertible CFG).
3. Locally Unambiguous TLGs
We shall now introduce the term "locally unambiguous" which is to designate the ability to find for any given parse in a TLG G the corresponding derivation and vice versa (see Definitions 5, 6 and Example 7). TLGs with this property are the ones which are needed for the parsing method to follow. Furthermore, one can show that boundedness, unique assignability and disjointness are sufficient conditions for local unambiguity.
Definition 20. A TLG $G = (M, V, N, T, R_M, R_V, S)$ is locally unambiguous if
(a) for every lm derivation $D$ of $\alpha_n$ in $G$ there is exactly one left parse $E$ of $\alpha_n$ in $G$ and given $D$ we can effectively find $E$, and
(b) for every left parse $E$ of $\alpha_n$ in $G$ there is exactly one lm derivation $D$ of $\alpha_n$ in $G$ and given $E$ we can effectively find $D$ either starting with $\alpha_0$ (top-down) or starting with $\alpha_n$ (bottom-up).
Theorem 21. If a TLG $G$ is disjoint, free of hidden empty notions and either (a) lhs u.a. and rightbound, or (b) rhs u.a. and leftbound then $G$ is locally unambiguous.
Proof. Part I (given a lm derivation, determine the unique left parse). For $i = 0, 1, \ldots, n-1$ consider the leftmost nonterminal strict notion $x$ in $\alpha_i$ and its replacement $y$ in $\alpha_{i+1}$. Since $G$ is disjoint there is exactly one $\bar{r} \in R_V$ s.t. $(x \to y) \in R_S(\bar{r})$. This hyperrule $\bar{r}$ is found by solving the word problem for $x$ (resp. $y$) and the lhs (resp. rhs) of all (finitely many) $r \in R_V$.
Part II (given a left parse determine the unique lm derivation). Let $G$ be lhs u.a. and rightbound. Then we apply a top-down strategy; starting at $\alpha_0$ we determine $\alpha_1, \alpha_2, \ldots, \alpha_n$ by solving the word problem for the lm nonterminal strict notion $x$ in $\alpha_i$ and the lhs of the hyperrule $\bar{r}$ given by the left parse. Because $x$ has at most one partition with respect to the lhs hypernotion $H_0$ of $\bar{r}$ and since each metanotion on the rhs of $\bar{r}$ occurs also on the lhs we can uniquely determine $y$ s.t. $(x \to y) \in R_S(\bar{r})$ and thus there is exactly one $\alpha_{i+1}$ for $i = 0, 1, \ldots, n-1$.
Let $G$ be rhs u.a. and leftbound. Then we apply a bottom-up strategy; starting at $\alpha_n$ we determine $\alpha_{n-1}, \alpha_{n-2}, \ldots, \alpha_0$ by solving the word problem for some strict notions $y$ in $\alpha_{i+1}$ and the rhs of the hyperrule $\bar{r}$ given by the left parse. As we don't know which strict notions in $\alpha_{i+1}$ constitute the $y$ (the so-called handle) we might have to backtrack. Note that each reduction which is indicated by $\bar{r}$ is uniquely determined because $\bar{r}$ is rhs u.a., leftbound and $G$ is free of hidden empty notions. The last condition is needed to ensure that $\bar{r} = (\langle H_0 \rangle \to h_1 h_2 \ldots h_m)$ reduces exactly $m$ strict notions in the step from $\alpha_{i+1}$ to $\alpha_i$ (see Example 31 for a TLG $G$ which is not free of hidden empty notions). With the observation above and from the fact that the parse is finite we conclude that we find the unique lm derivation after finitely many steps.
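The top-down step of Part II can be illustrated for regular metarules. The sketch below is not the paper's procedure: it assumes that each metanotion's language is written as a Python regular expression (here for the metarules BETY → b BETY | EMPTY, EMPTY → ε used in Example 26), that the hyperrule is lhs u.a. and rightbound, and that strict notions are written without blanks. The names `META_REGEX`, `lhs_pattern` and `top_down_step` are hypothetical.

```python
import re

# Regular metarules written as regular expressions (an assumption of this
# sketch): BETY -> b BETY | EMPTY, EMPTY -> epsilon.
META_REGEX = {"BETY": "b*", "EMPTY": ""}

def lhs_pattern(hypernotion):
    """Compile a hypernotion into a regex; a repeated metanotion becomes a
    backreference, which enforces consistent substitution."""
    seen, parts = set(), []
    for tok in hypernotion:
        if tok in META_REGEX:
            if tok in seen:
                parts.append(f"(?P={tok})")
            else:
                seen.add(tok)
                parts.append(f"(?P<{tok}>{META_REGEX[tok]})")
        else:
            parts.append(re.escape(tok))
    return re.compile("".join(parts))

def top_down_step(strict_notion, lhs, rhs):
    """Solve the word problem for the strict notion against the lhs and
    instantiate the rhs with the substitution found; None if no match."""
    m = lhs_pattern(lhs).fullmatch(strict_notion)
    if m is None:
        return None
    phi = m.groupdict()
    out = []
    for member in rhs:
        if isinstance(member, str):      # terminal symbol, copied unchanged
            out.append(member)
        else:                            # hypernotion: substitute consistently
            out.append("<" + "".join(phi.get(t, t) for t in member) + ">")
    return out

# <growing BETY> -> a <growing b BETY> c
RULE2 = (["growing", "BETY"], ["a", ["growing", "b", "BETY"], "c"])
step = top_down_step("growingbb", *RULE2)
```

Because the rule is rightbound, every metanotion needed on the rhs receives a value from the lhs match, and because the lhs is u.a. that value is the only possible one; here `step` is `["a", "<growingbbb>", "c"]`.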
Theorem 22. For every TLG $G$ one can construct a locally unambiguous TLG $G'$ s.t. $L(G) = L(G')$.
Proof: Construct for a given TLG $G$ the equivalent Turing machine $T$ ([4], pp. 369-71) and for $T$ the equivalent Chomsky grammar (see e.g. [14], p. 151ff.). We can then apply the Sintzoff simulation technique [17] which is modified as follows:
i) add a unique label to each hypernotion rendering the resulting grammar left-disjoint;
ii) insert a marker "T" into each hypernotion to make it u.a. and add two hyperrules to move the marker to the left and to the right.
The Sintzoff proof [17] yielded a left- and rightbound TLG which is not affected by the modifications above. Furthermore, hidden empty notions have been removed through the inserted marker. Derivations in this modified, locally unambiguous grammar $G'$ simulate derivations in the Sintzoff grammar and it can easily be shown that $L(G) = L(G')$. The technical details may be found in [19].
It is clear that the TLGs which one obtains from Theorem 22 are impractical. However, the main purpose of the construction is to give a proof for the existence of a locally unambiguous TLG $G'$ for every TLG $G$. The reason we are interested in locally unambiguous TLGs is that the nondeterminism which is inherent in any generative rewriting system is turned into a choice between hyperrule alternatives rather than multiple partitions of strict notions or choices resulting from unbound metanotions.
4. Skeleton Grammars
We shall now define canonical CFGs which model derivations in TLGs under certain restrictions. The CFGs are constructed from "cross-references" of the TLG. Cross-references are known e.g. from the Algol 68 Reports, where they appeared as hyperrule references (section and line numbers) in curly brackets (cf. [24], p. 26). Precisely speaking, we may define a forward cross-reference of a rhs hypernotion $\langle H_i \rangle$ as the set of hyperrules $r = (\langle H_0 \rangle \to \ldots) \in R_V$ for which $L((M, V, R_M, H_i)) \cap L((M, V, R_M, H_0)) \ne \emptyset$. However, one can formulate a stronger condition by intersecting the complete rhs of a hyperrule, having $m \ge 1$ hypernotions, with a suitable combination of $m$ lhs hypernotions.
Example 23. Consider metarule $A \to aA \mid a$ and two hyperrules
(1) $\langle start \rangle \to \langle aA \rangle \langle A \rangle$
(2) $\langle AA \rangle \to a$.
Clearly $L((\ldots, aA)) \cap L((\ldots, AA))$ and $L((\ldots, A)) \cap L((\ldots, AA))$ yield non-empty cross-references. However, $L((\ldots, \langle aA \rangle \langle A \rangle)) \cap L((\ldots, \langle A1\,A1 \rangle \langle A2\,A2 \rangle)) = \emptyset$!
Definition 24. Given a TLG $G = (M, V, N, T, R_M, R_V, S)$ and a hyperrule $r = (\langle H_0 \rangle \to h_1 h_2 \ldots h_m) \in R_V$, $h_i \in (N \cup T)$ for $1 \le i \le m$, we define the cross-reference of $r$ as a set of $m$-tuples $\{(x_1, x_2, \ldots, x_m) \mid$
(a) if $h_i \in T$ then $x_i = h_i$; otherwise $x_i = \langle H_0' \rangle$ for some $(\langle H_0' \rangle \to \ldots) \in R_V$ [$x_i$ refers to $h_i$] or $x_i = \varepsilon$ if $\varepsilon \in L((M, V, R_M, h_i))$, $1 \le i \le m$;
(b) $L((M, V, R_M, h_1 h_2 \ldots h_m)) \cap L((M, V, R_M, \bar{x}_1 \bar{x}_2 \ldots \bar{x}_m)) \ne \emptyset$, where $\{\langle, \rangle\} \cap V = \emptyset$ and $\bar{x}_i$ is obtained from $x_i$ by renaming metanotions in $x_i$ s.t. they are distinct from those in $x_j$ for $j \ne i$, $1 \le i, j \le m\}$.
The condition (b) is somewhat complicated because we have to avoid that consistent substitution applies to metanotions of two or more lhs hypernotions $x_i$ and $x_j$, but we require that it applies to all of the rhs of $r$ and within each individual $x_i$. From Proposition 15 it follows that the cross-reference for TLGs with CF metarules is, in general, undecidable, but is an open problem for regular metarules.
Using Definition 24 we may now define a canonical CFG which represents the structure of a TLG $G$. It is assumed that the TLG $G$ is free of hidden empty notions.
Definition 25. Given a TLG $G = (M, V, N, T, R_M, R_V, S)$ which is free of hidden empty notions we define the skeleton grammar $G_{sk} = (\Phi, \Sigma, R', S')$ of $G$ in Backus-Naur-Form as follows: $\Phi = \{\langle H_0 \rangle \mid (\langle H_0 \rangle \to \ldots) \in R_V\} \subseteq N$, $\Sigma = T$, $S' = S$, $R' = \bigcup_{r \in R_V} R_{sk}(r)$, where
$$R_{sk}(r) = \{\langle H_0 \rangle ::= x_1 x_2 \ldots x_m \mid r = (\langle H_0 \rangle \to h_1 h_2 \ldots h_m),\ h_i \in (T \cup N) \text{ for } 1 \le i \le m \text{ and } (x_1, x_2, \ldots, x_m) \text{ a cross-reference of } r\}.$$
Example 26. Consider the TLG $G = (M, V, N, T, R_M, R_V, S)$ with
$M = \{$BETY, EMPTY$\}$, $V = \{a, b, \ldots, z\}$,
$R_M = \{$BETY $\to$ b BETY | EMPTY, EMPTY $\to \varepsilon\}$,
$N = \{\langle$start$\rangle$, $\langle$growing$\rangle$, $\langle$growing BETY$\rangle$, $\langle$growing b BETY$\rangle$, $\langle$terminal BETY$\rangle$, $\langle$terminal b BETY$\rangle$, $\langle$terminal b$\rangle\}$,
$T = \{a, b, c\}$, and $R_V = \{$
(1) $\langle$start$\rangle \to \langle$growing$\rangle$
(2) $\langle$growing BETY$\rangle \to$ a $\langle$growing b BETY$\rangle$ c
(3) $\langle$growing BETY$\rangle \to \langle$terminal BETY$\rangle$
(4) $\langle$terminal b BETY$\rangle \to$ b $\langle$terminal BETY$\rangle$
(5) $\langle$terminal b$\rangle \to$ b $\}$.
The skeleton grammar $G_{sk} = (\Phi, \Sigma, R', S')$ is then given by
$\Phi = \{\langle$start$\rangle$, $\langle$growing BETY$\rangle$, $\langle$terminal b BETY$\rangle$, $\langle$terminal b$\rangle\}$, $\Sigma = \{a, b, c\}$, $S' = \langle$start$\rangle$, $R' = \{$
$\langle$start$\rangle$ ::= $\langle$growing BETY$\rangle$,
$\langle$growing BETY$\rangle$ ::= a $\langle$growing BETY$\rangle$ c | $\langle$terminal b BETY$\rangle$ | $\langle$terminal b$\rangle$,
$\langle$terminal b BETY$\rangle$ ::= b $\langle$terminal b BETY$\rangle$ | b $\langle$terminal b$\rangle$,
$\langle$terminal b$\rangle$ ::= b $\}$.
We notice that $L(G)$ is the well-known non-context-free language $\{a^n b^n c^n \mid n \ge 1\}$.
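To see how the skeleton grammar of Example 26 relates to $L(G)$, one can test a few words against both. The following sketch is only illustrative: the skeleton productions are transcribed with shortened, hypothetical nonterminal names, and the small recursive recognizer relies on the fact that this particular grammar has no $\varepsilon$-productions and no cyclic unit productions.

```python
from functools import lru_cache

# Skeleton grammar of Example 26; nonterminal names are shortened here.
SKELETON = {
    "start":   [["growing"]],
    "growing": [["a", "growing", "c"], ["term_bB"], ["term_b"]],
    "term_bB": [["b", "term_bB"], ["b", "term_b"]],
    "term_b":  [["b"]],
}

def in_skeleton_language(word, start="start"):
    """Membership test for the skeleton CFG by memoized recursive descent."""
    @lru_cache(maxsize=None)
    def derives(symbols, i, j):
        # True iff the symbol sequence derives word[i:j].
        if not symbols:
            return i == j
        head, rest = symbols[0], symbols[1:]
        if head not in SKELETON:                       # terminal
            return i < j and word[i] == head and derives(rest, i + 1, j)
        # Every symbol derives at least one letter (no epsilon rules here),
        # so leave at least len(rest) letters for the remaining symbols.
        for k in range(i + 1, j - len(rest) + 1):
            if any(derives(tuple(alt), i, k) for alt in SKELETON[head]) \
                    and derives(rest, k, j):
                return True
        return False
    return derives((start,), 0, len(word))

def in_tlg_language(word):
    """Direct membership test for L(G) = { a^n b^n c^n | n >= 1 }."""
    n = len(word) // 3
    return n >= 1 and word == "a" * n + "b" * n + "c" * n
```

For instance, `abc` and `aabbcc` are accepted by both tests, whereas `abbc` is accepted by the skeleton grammar only: the skeleton keeps the context-free nesting of a's and c's but drops the count of b's that the metanotion BETY carries in the TLG.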
It was mentioned above that it is not clear whether the cross-reference of a TLG with regular metarules can be obtained in general. However, one can at least compute a superset of skeleton rules by using a "derived" TLG $G'$ of $G$, that is a TLG $G'$ where consistent substitution has been removed through renaming of metanotions.
In Definition 25 we used the lhs-hypernotions $\langle H_0 \rangle$ as "variable names" for $\Phi$ in $G_{sk}$. We can also give a CFG which uses the rhs-hypernotions $\langle H_i \rangle$ ($i \ge 1$). The CFG which we obtain in this way is called the "inverse skeleton grammar", short: $G_{sk}'$. Inverse skeleton grammars are the less natural of the two obtainable CFGs but $L(G_{sk}) = L(G_{sk}')$ for any TLG $G$ [19].
5. Proper Skeleton Grammars and Corresponding Parses
It is intuitively clear that $G_{sk}$ models $G$ and that it can be used to construct a derivation tree for a given $w \in T^*$. We shall show below that, indeed, $L(G) \subseteq L(G_{sk})$. If $w \in L(G_{sk})$ we want to use the derivation tree to trace out a lm derivation in $G$. It is only natural that $G_{sk}$ preferably be an unambiguous CFG. However, two problems remain. Assume for each lm derivation $D$ of $w$ in $G$ there exists at least one lm derivation $D'$ of $w$ in $G_{sk}$ and $G_{sk}$ is unambiguous. Then, in tracing out $D'$, we must be able to tell at each step $\alpha_j' \Rightarrow \alpha_{j+1}'$ in $G_{sk}$ using skeleton production $r'$ which hyperrule caused $r'$ to be included in $R'$ of $G_{sk}$. To allow this we shall define "proper" skeleton grammars.
The second problem is a familiar one. $D'$ gives a sequence of skeleton productions which have been applied and from that sequence (the parse in $G_{sk}$) we obtain the left parse in $G$, but not necessarily the lm derivation in $G$! This, however, we can handle by requiring that $G$ is a locally unambiguous TLG (see Sect. 3).
Definition 27. If $G_{sk}$ is the skeleton grammar of $G = (M, V, N, T, R_M, R_V, S)$ then we say $G_{sk}$ is proper iff for all $r, r'$ in $R_V$ with $r \ne r'$: $R_{sk}(r) \cap R_{sk}(r') = \emptyset$.
Lemma 28. $G_{sk}$ of $G$ is proper iff all hyperrules $r = (\langle H_0 \rangle \to h_1 \ldots h_m)$ and $r' = (\langle H_0' \rangle \to h_1' \ldots h_k')$ with $H_0 = H_0'$ have pairwise distinct cross-references.
Proof: (a) First we note that $R_{sk}(r) \cap R_{sk}(r') = \emptyset$ if $H_0 \ne H_0'$, i.e. we only have to look at hyperrules which are usually written as "alternatives" (cf. [24], p. 25, Sect. 1.1.3.4). (b) From Definition 24 it follows that only identical lhs combinations (i.e. cross-references) give identical right-hand sides in $G_{sk}$.
Given a left parse $E'$ in $G_{sk}$ we can uniquely determine a hyperrule sequence $E$ in $G$ if $G_{sk}$ is proper and from $E$ we can uniquely determine a lm derivation $D$ in $G$ (provided $D$ exists at all) if $G$ is locally unambiguous. The missing link is between $E'$ and $E$. The claim is that the hyperrule sequence $E$, obtained from $E'$, is the "corresponding" left parse in $G$.
Definition 29. For $G$ a TLG, $G_{sk}$ the skeleton grammar of $G$, $E = r_{i_1}, r_{i_2}, \ldots, r_{i_j}, \ldots, r_{i_n}$ a left parse of $w \in L(G)$, $E' = r_{i_1}', r_{i_2}', \ldots, r_{i_j}', \ldots, r_{i_n}'$ a left parse of $w \in L(G_{sk})$, we call $E'$ corresponding to $E$ iff for all $1 \le j \le n$: $r_{i_j}' \in R_{sk}(r_{i_j})$.
The following result closes the gap between $E$ and $E'$. Its proof is only technical and, therefore, has been omitted (cf. [19]).
Theorem 30. For any TLG $G$, skeleton grammar $G_{sk}$ of $G$, and any left parse $E$ of $w \in L(G)$ in $G$, there exists exactly one corresponding left parse $E'$ of $w \in L(G_{sk})$ in $G_{sk}$.
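It is important to note that $G$ must be free of hidden empty notions. To illustrate what may happen if that is not the case, consider the following example.
Example 31. Let $G$ have the metarules BETY $\to$ b | $\varepsilon$ and hyperrules
(1) $\langle start \rangle \to \langle a \rangle \langle$BETY$\rangle$
(2) $\langle a \rangle \to a\,\langle$BETY$\rangle$
(3) $\langle b \rangle \to b$.
The left parse $E = r_1, r_2, r_3$ gives rise to two lm derivations
$D_1 = \langle start \rangle \Rightarrow \langle a \rangle \Rightarrow a \langle b \rangle \Rightarrow ab$ and
$D_2 = \langle start \rangle \Rightarrow \langle a \rangle \langle b \rangle \Rightarrow a \langle b \rangle \Rightarrow ab$.
Figure 1 gives the distinct derivation trees in $G$. We note that $G$ is not locally unambiguous even though it is disjoint, rhs u.a., and leftbound. The skeleton rules for $G_{sk}$ of $G$ are the following (assuming $G_{sk}$ is also defined for $G$ not free of hidden empty notions).
$R_{sk}(r_1) = \{(1')\ \langle start \rangle ::= \langle a \rangle,\ (2')\ \langle start \rangle ::= \langle a \rangle \langle b \rangle\}$
$R_{sk}(r_2) = \{(3')\ \langle a \rangle ::= a,\ (4')\ \langle a \rangle ::= a \langle b \rangle\}$
$R_{sk}(r_3) = \{(5')\ \langle b \rangle ::= b\}$.
Clearly for $E = r_1, r_2, r_3$ there exist two corresponding left parses $E_1' = r_1', r_4', r_5'$ and $E_2' = r_2', r_3', r_5'$.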
Corollary 32. If $G$ is a TLG and $G_{sk}$ is the skeleton grammar of $G$ then $L(G) \subseteq L(G_{sk})$.
Proof: $L(G) \subseteq L(G_{sk})$: This is an immediate result from 30, where we proved that for each left parse of $w \in L(G)$ there is a left parse of $w$ in $G_{sk}$ as well.
Fig. 1. Derivation trees corresponding to $D_1$ and $D_2$
In 30 we showed that to each left parse $E$ in $G$ we can assign a unique left parse $E'$ in $G_{sk}$ - called the corresponding parse. Now we extend this result by showing that distinct left parses in $G$ are assigned distinct corresponding left parses in $G_{sk}$, provided $G_{sk}$ is proper (see Def. 27).
Corollary 33. If G is a TLG s.t. the skeleton grammar G_sk of G is proper, then for each two left parses E_1, E_2 of w ∈ L(G)
E_1 = E_2 holds iff E'_1 = E'_2,
where E'_j denotes the corresponding left parse of E_j (j ∈ {1, 2}).
Proof. "only if" follows trivially from Theorem 30. "if": Since G_sk is proper, each skeleton rule application (i.e. each parse step in E') determines uniquely a hyperrule application (parse step in E).

We now define ambiguous TLGs as an analogue to ambiguous CFGs.
Definition 34. A TLG G = (M, V, N, T, R_M, R_H, S) is said to be ambiguous if for some w ∈ L(G) there is more than one lm derivation of w from S in G. Otherwise G is said to be unambiguous.

The following theorem provides the basis for the backtrack-free parsing algorithm which we are going to suggest in the next section.
Theorem 35. If G is a locally unambiguous TLG and if G_sk is the proper and unambiguous skeleton grammar of G, then G is unambiguous.
Proof. Assume the contrary, i.e. G has for some w ∈ L(G) two distinct lm derivations D_1, D_2. From G locally unambiguous it follows that the left parses corresponding to D_1 and D_2, say E_1 and E_2, are distinct as well (see Theorem 21). From Corollary 33 it follows that E_1 ≠ E_2 implies E'_1 ≠ E'_2, where E'_1, E'_2 are the corresponding left parses of E_1, E_2 in G_sk. But then there exist two left parses of w ∈ L(G_sk), a contradiction to the assumption that G_sk is unambiguous.

The converse is not true, i.e. unambiguous TLGs do not necessarily yield unambiguous skeleton grammars. To see this consider a metarule A → aA | a and hyperrules
(1) ⟨start⟩ → ⟨aaaa⟩
(2) ⟨AA⟩ → ⟨A⟩
(3) ⟨a⟩ → a.
Clearly G is unambiguous, but G_sk with productions
⟨start⟩ ::= ⟨AA⟩
⟨AA⟩ ::= ⟨AA⟩ | ⟨a⟩
is not unambiguous, although G is locally unambiguous and G_sk is proper.
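Why G is unambiguous while G_sk is not can be made concrete with a small sketch. It is an illustration under stated assumptions, not the paper's construction: apply_rule_2 models hyperrule (2) with consistent substitution (both occurrences of A receive the same value from a+), while skeleton_parses only counts how often the cyclic production ⟨AA⟩ ::= ⟨AA⟩ is used. The rule labels s1, s2_cycle, s2_exit, s3 are hypothetical, and the production ⟨a⟩ ::= a behind s3 is assumed, since the text lists only the first two skeleton productions.

```python
from typing import List, Optional


def apply_rule_2(strict: str) -> Optional[str]:
    """Apply hyperrule (2), AA -> A, to the strict notion `strict`.

    A ranges over a+ (metarule A -> aA | a). Consistent substitution forces
    both occurrences of A to take the same value, so `strict` must be a^(2k)
    and the only possible result is a^k.
    """
    if strict and set(strict) == {"a"} and len(strict) % 2 == 0:
        return strict[: len(strict) // 2]
    return None


def derive(strict: str) -> List[str]:
    """Trace the unique chain of strict notions reachable via hyperrule (2)."""
    chain = [strict]
    while (nxt := apply_rule_2(chain[-1])) is not None:
        chain.append(nxt)
    return chain


def skeleton_parses(max_cycles: int) -> List[List[str]]:
    """Left parses of the word "a" in G_sk using the cycle at most max_cycles times.

    s3 (the production for the notion a) is an assumption, see the lead-in.
    """
    return [
        ["s1"] + ["s2_cycle"] * k + ["s2_exit", "s3"]
        for k in range(max_cycles + 1)
    ]


print(derive("aaaa"))           # ['aaaa', 'aa', 'a']
print(len(skeleton_parses(3)))  # 4
```

In G the chain aaaa, aa, a is forced, so there is exactly one derivation. The skeleton grammar has forgotten the lengths and admits one parse for every number of passes through the cycle.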
6. Backtrack-free Parsing of TLGs

The results of the previous sections suggest the following straightforward parsing algorithm.
Algorithm 36. The skeleton grammar parsing algorithm.
Input: A locally unambiguous TLG G, the proper and unambiguous skeleton grammar G_sk of G, a word w ∈ T*.
Output: "yes" if w ∈ L(G), "no" otherwise.
Method:
Step 1: Apply any of the known parsing algorithms for CFGs to G_sk and w and obtain the left parse E' (see e.g. [1], pp. 320-330). If w ∉ L(G_sk) then output "no".
Step 2: Obtain the left parse E in G starting from the start notion S (if G is lhs u.a. and rightbound) or from w (if G is rhs u.a. and leftbound). Apply hyperrule r to those strict notions which correspond to the terminals and nonterminals used in the derivation step in which skeleton rule r' ∈ R_sk(r) was applied. Because G is locally unambiguous, giving the handle and hyperrule is sufficient to reduce α_i to α_{i-1}, resp. expand α_i to α_{i+1}. Output "no" and stop if r cannot be applied. Output "yes" and stop if the derivation in G is complete.
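The two steps of Algorithm 36 can be sketched for one concrete grammar. The code below is an illustration only, hard-wired to the TLG of Example 39 (metarule B → b | bB, hyperrules ⟨start⟩ → a ⟨b⟩ and ⟨B⟩ → a ⟨BB⟩ | b), which is lhs u.a. and rightbound, so Step 2 runs top-down from the start notion. The skeleton rule labels s1, s2, s3 and all function names are hypothetical, and the hand-written Step 1 parser stands in for "any of the known parsing algorithms".

```python
from typing import List, Optional

# Skeleton grammar of Example 39 (labels are hypothetical):
#   s1: START ::= a B    (from hyperrule 1)
#   s2: B ::= a B        (from hyperrule 2, first alternative)
#   s3: B ::= b          (from hyperrule 2, second alternative)


def skeleton_left_parse(word: str) -> Optional[List[str]]:
    """Step 1: deterministic parse of `word` in the skeleton grammar."""
    if not word or word[0] != "a":
        return None
    parse = ["s1"]
    pos = 1
    while pos < len(word) and word[pos] == "a":
        parse.append("s2")
        pos += 1
    if pos == len(word) - 1 and word[pos] == "b":
        parse.append("s3")
        return parse
    return None


def is_b_plus(notion: str) -> bool:
    """True if `notion` is a value of the metanotion B, i.e. lies in b+."""
    return notion != "" and set(notion) == {"b"}


def trace_in_tlg(parse: List[str]) -> Optional[List[str]]:
    """Step 2: replay the skeleton parse with the hyperrules, top-down.

    Returns the strict notion expanded at each step, or None as soon as the
    hyperrule belonging to a skeleton rule cannot be applied.
    """
    expanded: List[str] = []
    current = "start"                # strict notion to be expanded next
    for rule in parse:
        expanded.append(current)
        if rule == "s1":             # (1) start -> a b
            if current != "start":
                return None
            current = "b"
        elif rule == "s2":           # (2) B -> a BB, consistent substitution
            if not is_b_plus(current):
                return None
            current = current + current
        else:                        # (2) B -> b
            if not is_b_plus(current):
                return None
            current = ""
    return expanded


def recognize(word: str) -> bool:
    """Algorithm 36 for this grammar: "yes" iff both steps succeed."""
    parse = skeleton_left_parse(word)
    return parse is not None and trace_in_tlg(parse) is not None


print(recognize("aaab"))                          # True
print(recognize("abb"))                           # False
print(trace_in_tlg(skeleton_left_parse("aaab")))  # ['start', 'b', 'bb', 'bbbb']
```

For this particular grammar Step 2 never rejects a word that Step 1 accepted, because L(G_sk) = L(G). In general the rejection in Step 2 is what separates L(G) from the larger L(G_sk).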
Theorem 37. For arbitrary w ∈ T*, Algorithm 36 correctly determines in a backtrack-free way whether or not w ∈ L(G).
Proof: By Theorem 35, G is unambiguous. Thus there exists at most one terminal lm derivation of w. Suppose now the left parse E which we obtain from E' in G_sk is not the left parse for w, i.e. there exists a left parse Ē in G, Ē ≠ E, s.t. Ē is a left parse of w ∈ L(G). By Corollary 33 there exists a left skeleton parse Ē', s.t. Ē' ≠ E', and Ē' is a left parse for w. But then G_sk has two left parses for w, a contradiction to the assumption that G_sk is unambiguous.

Let us give an example. We shall use the TLG of Example 26 for which we specified the skeleton grammar. This skeleton grammar has the additional property of being a "simple precedence grammar" (for definitions of "simple precedence" see e.g. [10], pp. 102-120). The example shows how w may be parsed in parallel using two stacks.
Example 38. Consider the TLG G of Example 26 and note that G is right-disjoint, thus disjoint. Also G is free of hidden empty notions, rhs u.a. and leftbound, which is a sufficient condition for G to be locally unambiguous. The skeleton grammar is clearly proper and we claim that G_sk is a simple precedence grammar with the following precedence table, which was computed using the "Translator Writing System - Precedence Syntax Analysis Program" of UCSB installed at UBC.

Fig. 2. Precedence table for G_sk of Ex. 26

Let ⊢ and ⊣ denote the left and right end markers. The stack containing the strict notions of G will be denoted by SNS, the stack for G_sk by CFS. We shall also list the unused part of the input tape for w, and the underlined symbol is the one presently under scan. Consider a sample parse for w = a^2 b^2 c^2.

      unused input    SNS                      CFS
4)    b c c ⊣         ⟨terminal b⟩ a a ⊢       ⟨terminal b⟩ a a ⊢
5)    c c ⊣           ⟨terminal bb⟩ a a ⊢      ⟨terminal b BETY⟩ a a ⊢
6)    c c ⊣           ⟨growing bb⟩ a a ⊢       ⟨growing BETY⟩ a a ⊢
7)    c ⊣             ⟨growing b⟩ a ⊢          ⟨growing BETY⟩ a ⊢
8)    ⊣               ⟨growing⟩ ⊢              ⟨growing BETY⟩ ⊢
9)    ⊣               ⟨start⟩ ⊢                ⟨start⟩ ⊢

stop, w accepted!
The reader should convince himself that a parse for, say, aabcc fails in G, although G_sk still suggests reductions, while a parse for abcc fails already in G_sk.

When we analyze the computational complexity of Algorithm 36 we have to take into account two different kinds of "steps". First there is the number of steps to compute the parse in G_sk. Let |w|_T denote the length of w ∈ T* measured in terms of the terminal vocabulary T. It is well known that parsing an unambiguous CFG takes at most O(n^2) steps for words of length n. Therefore computing the left parse in G_sk takes O(|w|_T^2) steps. Secondly there is the task of "tracing out" E, i.e. obtaining D. Clearly E and E' have the same length, namely O(|w|_T) elements. The third factor is the number of elementary computations at each parse step in order to solve the word problem for a hypernotion system HS and a strict notion v, v ∈ V*. In Sect. 2 we mentioned that this is possible in polynomial time. However, v may grow exponentially. Consider the following example.

Example 39. Let the metarules be B → b | bB and the hyperrules be
(1) ⟨start⟩ → a ⟨b⟩
(2) ⟨B⟩ → a ⟨BB⟩ | b.
Clearly L(G) = {a^n b | n ≥ 1}. The skeleton grammar of G has the following productions:
⟨start⟩ ::= a ⟨B⟩
⟨B⟩ ::= a ⟨B⟩ | b.
It is easy to see that G is disjoint, rightbound, lhs u.a., and free of hidden empty notions. By Theorem 21, G is locally unambiguous.
G_sk is proper, unambiguous and L(G_sk) = L(G) is regular. However, the number of b's doubles at each derivation step and, when finally applying the second alternative of hyperrule (2), we have to solve the word problem for the hypernotion system with axiom "B" and the strict notion b^(2^(n-1)), where n = |w|_T - 1 for w ∈ L(G).

By virtue of Example 39 we conclude that Algorithm 36 solves the word problem in possibly exponential time. It is easy to see that the ability to have more occurrences of a particular metanotion on the rhs than on the lhs makes a strict notion "grow too fast". The converse situation (more lhs than rhs occurrences) permits "fast shrinking". In [19] it is shown that if the number of occurrences of a metavariable within a hyperrule is bounded (cf. [4]) then Algorithm 36 works in polynomial time.
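The doubling in Example 39 can be written out as a short worked derivation. This is a sketch derived from the hyperrules of the example, with n denoting the number of a's in w = a^n b (so |w|_T = n + 1); the displayed form is not taken from the original paper.

```latex
\begin{align*}
\langle\mathit{start}\rangle
  &\Rightarrow a\,\langle b\rangle
   \Rightarrow a^{2}\,\langle b^{2}\rangle
   \Rightarrow a^{3}\,\langle b^{4}\rangle
   \Rightarrow \cdots
   \Rightarrow a^{n}\,\langle b^{2^{n-1}}\rangle
   \Rightarrow a^{n} b, \\
\bigl|\,b^{2^{n-1}}\bigr| &= 2^{\,n-1} = 2^{\,|w|_T-2}
  \qquad\text{since } |w|_T = n+1 .
\end{align*}
```

Each application of the first alternative of hyperrule (2) substitutes the current value of B twice on the righthand side, so the strict notion that must be matched in the last step has length exponential in the input length, although the input itself has only n + 1 symbols.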
7. Conclusion

Concluding, we propose that given a TLG G one proceeds as follows:
(i) check whether the metarules are regular (by inspection);
(ii) check whether G is disjoint and free of hidden empty notions, where the latter is always decidable and fulfilment of (i) guarantees that we can decide disjointness for at least a superset regardless of the open problems of Sect. 2;
(iii) check whether G is either lhs unique assignable and rightbound or rhs unique assignable and leftbound, and note that boundedness is decidable by inspection and that under fulfilment of (i) unique assignability is again decidable for a superset by ignoring consistent substitution;
(iv) obtain the skeleton grammar G_sk of G and check whether it is proper; properness of G_sk is decidable by inspection;
(v) try to show that G_sk is unambiguous; if yes, apply Algorithm 36 (of course there is no general procedure to decide (v), see e.g. [16], p. 283, but we are mostly interested in practical subclasses of CFLs such as LR(k) and LL(k) for some fixed k which we can always decide).

It is clear that for a given TLG G certain "ad hoc" methods to obtain the properties (i)-(v) may succeed, despite the fact that a general procedure does not exist. A critical examination of the restrictions (i) to (v) shows that for practical purposes regular metarules do not limit applications but that predicates and cyclic hyperrules lead to ambiguous skeleton grammars. Furthermore metaboundedness causes TLGs to degenerate, which is the reason why the Deussen Type L and Type R restrictions are unsuitable for language definitions as well. One solution is to introduce two disjoint sets of hyperrules, skeleton hyperrules and predicates, and to apply an attribute grammar-like technique to unbound hyperrules. The formal properties of the resulting type of TLG, called bracketed TLG, have been stated in [20] and they have successfully been used in syntax directed translations [21].
These developments and the obvious similarity with Affix grammars [12, 18], for which parsers are being implemented, should soon allow the step from "effectively decidable" to "efficiently parsable" TLGs.

Acknowledgements. The author wishes to acknowledge the detailed recommendations of the referees which helped to make the presentation more concise and readable.
References

1. Aho A, Ullman J (1972) The theory of parsing, translation, and compiling, Vol. I: Parsing. Prentice Hall, Englewood Cliffs, NJ
2. Aho A, Ullman J (1973) The theory of parsing, translation, and compiling, Vol. II: Compiling. Prentice Hall, Englewood Cliffs, NJ
3. Ambler AL (1973) Nested LR(k) parsing using grammars of the van Wijngaarden type, Ph.D. Thesis, Dept. of Computer Sciences, Univ. of Wisconsin
4. Baker JL (1972) Grammars with structured vocabulary: a model for the Algol 68 definition. Information and Control 20: 351-395
5. Branquart P, Cardinal JP, Delescaille JP, Lewi J (1972) A context-free syntax of Algol 68. Information Processing Lett. 1: 141-148
6. Deussen P (1975) A decidability criterion for van Wijngaarden grammars. Acta Informat. 5: 353-375
7. Deussen P, Mehlhorn K (1977) Van Wijngaarden grammars and space complexity class EXSPACE. Acta Informat. 8: 193-199
8. Deussen P, Wegner L (1978) A bibliography of van Wijngaarden grammars, Bulletin of the European Ass. for Theoretical Computer Science (EATCS) 6
9. Greibach SA (1974) Some restrictions on W-grammars, ACM Proceedings of the 6th Symposium on the Theory of Computing, Seattle
10. Gries D (1971) Compiler construction for digital computers. John Wiley, New York
11. Hesse W (1976) Vollständige formale Beschreibung von Programmiersprachen mit zweischichtigen Grammatiken, Dissertation, Technische Universität München, Bericht Nr. 7623
12. Koster CHA (1972) Affix grammars. In: Algol 68 implementation. Proc. of the IFIP Work. Conf. on Algol 68 implementation, 95-109, North Holland, Amsterdam
13. Makanin GS (1977) The problem of solvability of equations in a free semigroup (in Russian). Matematiceskij sbornik 103 (145), No. 2(6), pp. 147-236; English summary in: (1977) Dokl Akad Nauk SSSR, Soviet Math Dokl 18: 330-334
14. Maurer H (1969) Theoretische Grundlagen der Programmiersprachen - Theorie der Syntax. Bibliographisches Institut, Mannheim
15. Rounds WC (1973) Complexity of recognition in intermediate-level languages, in: Proc. of the IEEE 14th Annual Symposium on Switching and Automata Theory, 145-158
16. Salomaa A (1973) Formal languages. Academic Press, New York, London
17. Sintzoff M (1967) Existence of a van Wijngaarden syntax for every recursively enumerable set. Ann Soc Sci Bruxelles 81: 115-118
18. Watt DA (1977) The parsing problem for affix grammars. Acta Informat 8: 1-20
19. Wegner L (1977) Analysis of two-level grammars. Ph.D. thesis, Hochschul-Verlag, Stuttgart
20. Wegner L (1979) Bracketed two-level grammars - a decidable and practical approach to language definitions. ICALP 79, Graz. Springer, Berlin Heidelberg New York (Lecture Notes in Computer Science, Vol. 71, p. 668)
21. Wegner L (1979) Two-level grammar translations, GI-9. Jahrestagung, Bonn. Springer, Berlin Heidelberg New York (Informatik Fachberichte, Band 19, p. 163)
22. Wijngaarden A van (1965) Orthogonal design and description of formal languages. Mathematisch Centrum Amsterdam, MR 76
23. Wijngaarden A van (ed.) (1969) Report on the Algorithmic language ALGOL 68. Numer Math 14: 79-218
24. Wijngaarden A van, Mailloux BJ, Peck JEL, Koster CHA, Sintzoff M, Lindsey CH, Meertens LGLT, Fisker RG (eds.) (1976) Revised report on the Algorithmic language ALGOL 68. Springer, Berlin Heidelberg New York

Received August 17, 1978; Revised February 22, 1980