Acta Informatica 31, 341-378 (1994)
© Springer-Verlag 1994
Context-free graph languages of bounded degree are generated by apex graph grammars
Joost Engelfriet, Linda Heyker, and George Leih
Department of Computer Science, Leiden University, P.O. Box 9512, NL-2300 RA Leiden, The Netherlands
Received October 2, 1992 / October 26, 1993
Abstract. The apex graph grammars generate precisely the context-free graph languages of bounded degree, independently of whether one considers hyperedge replacement systems or (boundary or confluent) NLC or edNCE graph grammars. The main feature of apex graph grammars is that nodes cannot be "passed" from nonterminal to nonterminal. The proof is based on a normal form result for arbitrary hyperedge replacement systems that forbids "passing chains". This generalizes Greibach Normal Form.
0. Introduction
Context-free graph grammars are of importance because they provide a general means of studying graph properties of a recursive nature. Unfortunately, unlike the case of ordinary string grammars, there does not seem to be a single natural notion of context-free grammar for graphs. Thus, many different types of context-free graph grammar have been proposed in the literature, of different generating power. Natural and useful requirements for a graph grammar to deserve the predicate 'context-free' are formulated in [Cou87]. For several of the most interesting of these grammar types it has been shown recently that, although they generate different classes of graph languages, they generate the same graph languages of bounded degree (where a graph language is of bounded degree if there is a fixed upper bound on the degree of all nodes of all graphs in the language). This suggests the existence of a single natural class of context-free graph languages of bounded degree. In this paper we provide a grammatical characterization of this class by putting a natural static restriction on context-free graph grammars: the so-called apex property. These apex (context-free) graph grammars generate precisely "the" class of context-free graph languages of bounded degree. The specific feature of an apex graph grammar is that, during derivation, nodes of the generated graph cannot be "passed" from one nonterminal to another. In other words, a newly generated nonterminal can be attached only to newly generated nodes, which also implies that a newly generated edge of the graph can be attached only to "recently" generated nodes. This feature ensures that the generated graph grows in a very regular way, i.e., at the "outside" only (just as the configuration of a domino game, where it is forbidden to squeeze a domino stone between two adjacent stones). From this it is obvious that the graph language generated by an apex graph grammar is of bounded degree. The hard part of the proof is to show that all context-free graph languages of bounded degree can be generated by an apex graph grammar.
One attractive and well-studied type of context-free graph grammar is the context-free hypergraph grammar (CFHG grammar, also called hyperedge replacement system or HR grammar; see [HabKre87a/b; BauCou; Hab]). The apex property for CFHG grammars was introduced in [EngRoz90]. It says that no nonterminal edge in the right-hand side of a production is incident with an external node. This static restriction implies that, during derivation, nodes of the generated (hyper)graph cannot be "passed" from one nonterminal edge to another. The main result of this paper is that every CFHG (hyper)graph language of bounded degree can be generated by an apex CFHG grammar. The kernel of the proof is a Greibach Normal Form Theorem for arbitrary CFHG grammars. Greibach Normal Form for CFHG grammars means that, in the right-hand side of a production, an external node cannot be incident with just one nonterminal edge. This implies that, in a derivation step, a node cannot be "passed" from one nonterminal edge to just one other nonterminal edge: it should be passed to at least one terminal edge or at least two nonterminal edges (in other words, "passing chains" are forbidden). Note that in an apex grammar nodes are passed to terminal edges only.
This generalizes Greibach Normal Form from context-free string grammars ([Gre], see [Urb] for a short proof) to CFHG grammars, and it even generalizes Double Greibach Normal Form, in which it is required that every right-hand side of a production both begins and ends with a terminal. In fact, the construction in the proof of our Greibach Normal Form Theorem transforms an ordinary context-free grammar (viewed as a CFHG grammar in the obvious way) into an ordinary context-free grammar that is in Double Greibach Normal Form.
The remainder of this introduction is aimed at the reader who wishes to know more about the relationship between our main result and the existing literature on context-free graph grammars. Whereas the CFHG grammars are based on (hyper)edge replacement, the various NLC-like graph grammars (see [EngRoz91]) are based on node replacement. Since NLC-like grammars in general do not satisfy the context-freeness requirements of [Cou87], one has to consider restricted versions of them to obtain context-free NLC-like graph grammars. One possibility is the boundary restriction, giving rise to the B-NLC grammar [RozWel] and the more recent B-edNCE grammar [EngLeiWel], and another possibility is the (weaker) confluence restriction, leading, e.g., to the C-edNCE grammar [Bra88, Eng89/91]. Since in B-NLC grammars the node labels play a very specific role, no B-NLC grammar can generate, e.g., the graph language of all cycles of arbitrary length (with all nodes labeled by one fixed symbol) (see [EhrMaiRoz]). Since this graph language is intuitively context-free (of bounded degree) and can in fact be generated by the other grammar types, we prefer to consider the class of (node) relabelings of B-NLC graph languages (cf. [RozWel, Vog]) rather than
the class of B-NLC graph languages itself. The classes of CFHG graph languages, relabelings of B-NLC graph languages, B-edNCE graph languages, and C-edNCE graph languages form a properly increasing sequence of classes (see [Vog; RozWel; EngLeiWel; EngRoz90; Eng89]). However, they contain the same graph languages of bounded degree (see [EngRoz90, Theorem 7; Lau88b; Bra91; EngHey91b; CouEng]). This is "the" class of context-free graph languages of bounded degree discussed before.
The apex property was introduced for NLC-like graph grammars in [EngLeiRoz87] and studied in, e.g., [EngLeiRoz88/91; EngLei]. It was shown in [EngLeiWel; EngRoz90] that apex CFHG grammars, apex B-NLC grammars, and apex B-edNCE grammars all generate the same class of graph languages (where, for the apex NLC grammar, one should allow an additional relabeling, as discussed above for B-NLC). Thus, to prove our main result, we could choose between two formalisms: the CFHG grammar or the B-edNCE grammar (because the latter is easier to handle than the B-NLC grammar with a relabeling). We have chosen the CFHG grammar, for two reasons. First, CFHG grammars are (in our opinion) easier to understand and visualize than B-edNCE grammars (but note that, of course, in general B-edNCE grammars are more powerful). Second, since CFHG grammars can also be used to generate (directed) hypergraphs, we can prove our result for the more general case of hypergraphs.
We make three final observations. First, it has recently been shown in [Cou91] that the class of CFHG graph languages is exactly the class of C-edNCE graph languages $L$ of "bounded bi-degree" (meaning that there is an $N$ such that for every $n > N$, $K_{n,n}$ is not a subgraph of any graph in $L$). This is a grammatical characterization of the context-free graph languages of bounded bi-degree, similar to our result (and note that, of course, every language of bounded degree is of bounded bi-degree).
Second, both for CFHG grammars and NLC-like grammars it is decidable whether or not the generated language is of bounded degree ([HabKreVog; JanRozWel; Eng91]). Thus, by our result, it is decidable whether or not a context-free graph language can be generated by an apex graph grammar. In [Cou91] it is shown that (as a consequence of the above-mentioned characterization) it is decidable whether or not a context-free graph language (i.e., a C-edNCE language) can be generated by a CFHG grammar. Third, we observe that strings can be represented by graphs of degree $\le 2$ in a natural way. Thus, string languages are represented by graph languages of bounded degree, and consequently both CFHG grammars and the mentioned NLC-like grammars all generate the same class of string languages (studied in [HabKre87a; EngHey91a]). Our result shows that this is also the class of string languages generated by apex graph grammars.
This paper consists of four sections. Section 1 contains all needed preliminaries on hypergraphs and CFHG grammars. In Sect. 2 the main result is stated and it is shown that it is a consequence of the Greibach Normal Form Theorem, which is proved in Sect. 3. Section 4 contains a study of the linear case. As usual, a CFHG grammar is linear if each right-hand side of a production contains at most one nonterminal. Although the analogue of our result does not hold for linear grammars, it does hold for the "quasi-linear" grammars, corresponding to the quasi-rational or derivation-bounded context-free grammars in the case of strings. The results of this paper were announced in [Eng92].
1. Preliminaries
This section contains some terminology that we will use, in particular concerning hypergraphs and hypergraph grammars. The reader is assumed to be more or less familiar with context-free hypergraph grammars, also called hyperedge replacement systems. These grammars are studied, e.g., in [BauCou; HabKre87a/b; MonRos; Hab; Lau88a/b/90; EngHey91a/92; Cou88/90; EngRoz90; LenWan; Vog; HabKreVog]. $\mathbb{N} = \{0, 1, 2, \ldots\}$ and $\mathbb{N}_+ = \{1, 2, 3, \ldots\}$. For $n, m \in \mathbb{N}$, $[n, m]$ denotes the set
$\{i \in \mathbb{N} \mid n \le i \le m\}$. [...] where $X \in \Sigma - \Delta$, $H$ is a hypergraph over $\Sigma$, and $\mathrm{rank}(X) = \mathrm{rank}(H)$. □
Note that, for every production $X \to H$, $H$ is a hypergraph over an alphabet with terminal and nonterminal symbols. Hence (by our previous assumption) $\mathrm{ext}_H$ and $\mathrm{nod}_H(e)$, for every nonterminal edge $e \in E_H$, are sequences of distinct nodes. A CFHG grammar is isolation-free if all right-hand sides of its productions are isolation-free.
Let $G = (\Sigma, \Delta, P, S)$ be a CFHG grammar. Application of a production $X \to H$ of $G$ is defined as follows. Let $K$ be a hypergraph over $\Sigma$ and let $e$ be a nonterminal edge of $K$. Then $X \to H$ is applicable to $e$ if $\mathrm{lab}_K(e) = X$, and the result of application is the hypergraph $K[H/e]$. We write $K \Rightarrow K'$, where $K'$ is (isomorphic to) $K[H/e]$. As usual, $\Rightarrow^+$ and $\Rightarrow^*$ denote the transitive and the transitive-reflexive closure of $\Rightarrow$. For a CFHG grammar $G = (\Sigma, \Delta, P, S)$ the (hypergraph) language generated by $G$, denoted $L(G)$, is the set of hypergraphs $H$ over $\Delta$ such that $S \Rightarrow^* H$. A hypergraph $H$ over $\Sigma$ such that $S \Rightarrow^* H$ is a sentential form of $G$. We denote by CFHG the class of hypergraph languages generated by CFHG grammars. A CFHG grammar is reduced if every $X \in \Sigma - \Delta$ occurs in at least one derivation $S \Rightarrow^* H$, with $H \in L(G)$. It is easy to show that every CFHG grammar (with $L(G) \ne \emptyset$) can be turned into an equivalent reduced CFHG grammar by dropping its useless nonterminals and productions. Due to the confluence and associativity of substitution, CFHG grammars have "all" the nice properties of context-free grammars (cf. [Cou87]). Here,
Fig. 1. A CFHG grammar generating stars
we just state, without proof, the following Decomposition Lemma (cf. Lemma 2.14 of [Cou87]; Sect. II.2 of [Hab]; and Lemma 3.8 of [Lau88b]).
Lemma 1.4 (Decomposition Lemma) Let $G = (\Sigma, \Delta, P, S)$ be a CFHG grammar.
Let $H$ and $K$ be hypergraphs over $\Sigma$ and let $e_1, \ldots, e_n$ be all the nonterminal edges of $H$. Then $H \Rightarrow^* K$ if and only if there exist hypergraphs $K_1, \ldots, K_n$ over $\Sigma$ such that $K = H[K_1/e_1, \ldots, K_n/e_n]$ and $\mathrm{lab}_H(e_i) \Rightarrow^* K_i$ for every $i \in [1, n]$. Moreover, the length of the derivation $H \Rightarrow^* K$ equals the sum of the lengths of the derivations $\mathrm{lab}_H(e_i) \Rightarrow^* K_i$, $i \in [1, n]$.
As an example of a CFHG grammar, consider $G = (\Sigma, \Delta, P, S)$ where $\Sigma = \{S, X, a\}$, with $\mathrm{rank}(S) = 0$, $\mathrm{rank}(X) = 1$, and $\mathrm{rank}(a) = 2$, $\Delta = \{a\}$, and $P$ consists of the three productions that are drawn in Fig. 1. In pictures of hypergraphs (such as the right-hand sides of these productions) we use the following pictorial conventions. A node is indicated by a fat dot, as usual for graphs, and an edge $e$ is indicated by a square box containing $\mathrm{lab}(e)$, with a line between $e$ and $\mathrm{nod}(e, i)$ labeled by $i$, for each $i \in [1, \mathrm{rank}(e)]$. These lines, or the corresponding integers, are called the "tentacles" of the edge $e$. An edge $e$ with 2 tentacles (i.e., with $\mathrm{rank}(e) = 2$) is also drawn as a directed line from $\mathrm{nod}(e, 1)$ to $\mathrm{nod}(e, 2)$, with label $\mathrm{lab}(e)$, as usual for edges of graphs. The external node $\mathrm{ext}(i)$ is indicated by label $i$, for every $i \in [1, r]$, where $r$ is the rank of the hypergraph. The CFHG grammar $G$ of Fig. 1 generates the hypergraph language $L(G)$ that consists of all "stars", i.e., graphs in the form of a star, such as the one in Fig. 2. Note that $L(G)$ is not of bounded degree.
It is well known that the ordinary context-free grammar, generating strings, can be viewed in a simple way as a CFHG grammar, as follows. We first represent strings as graphs. Let $\Sigma$ be an alphabet and $w = a_1 \cdots a_n$, with $a_i \in \Sigma$, be a string over $\Sigma$. We represent $w$ by hypergraphs $\mathrm{gr}_2(w)$ and $\mathrm{gr}_0(w)$ over $\Sigma$, where we set $\mathrm{rank}(a) = 2$ for every $a \in \Sigma$; $\mathrm{gr}_2(w)$ is the hypergraph $(V, E, \mathrm{nod}, \mathrm{lab}, \mathrm{ext})$ of rank 2, with $V = \{v_0, v_1, \ldots, v_n\}$, $E = \{e_1, \ldots, e_n\}$, $\mathrm{nod}(e_i) = (v_{i-1}, v_i)$, $\mathrm{lab}(e_i) = a_i$, $\mathrm{ext}(1) = v_0$, and $\mathrm{ext}(2) = v_n$; and $\mathrm{gr}_0(w)$ is the hypergraph of rank 0 obtained
Fig. 2. A star
from $\mathrm{gr}_2(w)$ by changing $\mathrm{ext}$ into $()$. Now let $G$ be an ordinary context-free grammar, and assume that (1) $G$ is $\lambda$-free, i.e., the right-hand side of every production is non-empty, and (2) the initial nonterminal $S$ does not appear in any right-hand side of a production. Then we will view $G$ as a CFHG grammar that has the set of productions $\{S \to \mathrm{gr}_0(w) \mid S \to w \text{ is a production of } G\} \cup \{X \to \mathrm{gr}_2(w) \mid X \ne S,\ X \to w \text{ is a production of } G\}$. The initial nonterminal $S$ is given rank 0 and all other symbols rank 2. Obviously, string substitution is simulated faithfully by hypergraph substitution.
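The string encoding above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the class name `Hypergraph`, the helper `max_degree`, and the node and edge names `v0`, `e1`, ... are hypothetical, and the degree of a node is taken to be the number of tentacles attached to it.

```python
from dataclasses import dataclass


@dataclass
class Hypergraph:
    nodes: list   # V
    edges: list   # E (edge identifiers)
    nod: dict     # edge -> tuple of incident nodes
    lab: dict     # edge -> label
    ext: tuple    # sequence of external nodes


def gr2(w):
    """Represent the string w as a chain of binary edges, of rank 2."""
    n = len(w)
    nodes = [f"v{i}" for i in range(n + 1)]
    edges = [f"e{i}" for i in range(1, n + 1)]
    nod = {f"e{i}": (f"v{i - 1}", f"v{i}") for i in range(1, n + 1)}
    lab = {f"e{i}": w[i - 1] for i in range(1, n + 1)}
    return Hypergraph(nodes, edges, nod, lab, (nodes[0], nodes[-1]))


def gr0(w):
    """Same chain as gr2(w), but with the empty sequence of external nodes."""
    g = gr2(w)
    return Hypergraph(g.nodes, g.edges, g.nod, g.lab, ())


def max_degree(g):
    """Largest number of tentacles attached to a single node (assumed notion of degree)."""
    return max(sum(ns.count(v) for ns in g.nod.values()) for v in g.nodes)
```

For a non-empty string, every node of the chain has degree at most 2, which is why string languages correspond to graph languages of bounded degree.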
2. The main results
Let $X \to H$ be a production of a CFHG grammar and let $e$ be an edge of $H$. Suppose that $\mathrm{ext}_H(i) = \mathrm{nod}_H(e, j)$. Then we say that the node $\mathrm{ext}_H(i)$ is passed to $e$ (from tentacle $i$ to tentacle $j$). Also, when $X \to H$ is applied to an edge $f$ of hypergraph $K$ (with label $X$), the node $\mathrm{nod}_K(f, i)$ is said to be passed from $f$ to $e$ during application; this is because intuitively this node is the same as the node $\mathrm{nod}_{K[H/f]}(e, j)$. Thus, during a derivation, nodes are passed from nonterminal edges to terminal or nonterminal edges.
A CFHG grammar is apex if for every production $X \to H$, no external node of $H$ is incident with a nonterminal edge. In other words, a CFHG grammar is apex if nodes are passed to terminal edges only, i.e., nonterminal edges cannot pass nodes to each other. This means that terminal edges are attached to "recently" generated nodes only: the terminal graph grows at the "outside". Thus, the derivation process of an apex grammar is particularly easy to understand and visualize. Note that an ordinary context-free grammar, viewed as a CFHG grammar as discussed at the end of Sect. 1, is apex if and only if it is in Double Greibach Normal Form (i.e., every right-hand side of a production both begins and ends with a terminal). We denote by A-CFHG the class of hypergraph languages generated by apex CFHG grammars (first considered in [EngRoz90]). The main result of this paper is stated next.
Theorem 2.1 A-CFHG is the class of CFHG hypergraph languages of bounded degree.
One direction of this result (that apex hypergraph languages are of bounded degree) is straightforward. The other direction is non-trivial. This will be illustrated in the following examples, in which we will also indicate the main idea of the proof.
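The apex property is a purely static condition on the productions, so it can be checked mechanically. The following Python sketch does this for string productions viewed as CFHG productions (as at the end of Sect. 1). The encoding of a right-hand side as a dictionary with keys `ext` and `edges`, and the names `chain_rhs` and `is_apex`, are our own assumptions for illustration.

```python
def chain_rhs(symbols, rank):
    """Right-hand side of a string production, as a chain of binary edges.

    Nodes are numbered 0..len(symbols). With rank 2 the first and last
    node are external; with rank 0 (initial nonterminal) none is.
    """
    edges = [(sym, (i, i + 1)) for i, sym in enumerate(symbols)]
    ext = (0, len(symbols)) if rank == 2 else ()
    return {"ext": ext, "edges": edges}


def is_apex(productions, nonterminals):
    """True if no external node of any right-hand side is incident
    with a nonterminal edge."""
    for _lhs, rhs in productions:
        external = set(rhs["ext"])
        for label, nodes in rhs["edges"]:
            if label in nonterminals and external & set(nodes):
                return False
    return True


nonterminals = {"S", "X"}

# S -> aXb, X -> aXb | ab: every right-hand side begins and ends with a terminal.
double_greibach = [
    ("S", chain_rhs("aXb", 0)),
    ("X", chain_rhs("aXb", 2)),
    ("X", chain_rhs("ab", 2)),
]

# S -> aX, X -> aX | a: the last node of X -> aX is passed to the nonterminal X.
right_linear = [
    ("S", chain_rhs("aX", 0)),
    ("X", chain_rhs("aX", 2)),
    ("X", chain_rhs("a", 2)),
]
```

In this sketch `is_apex(double_greibach, nonterminals)` holds while `is_apex(right_linear, nonterminals)` does not, in agreement with the remark on Double Greibach Normal Form above.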
Fig. 3. A binary tree with a linked frontier
Fig. 4. A CFHG grammar generating binary trees with a linked frontier
Examples. (1)
To get some feeling for the non-triviality of the result the reader is challenged to find an apex CFHG grammar for the set of all graphs that are "binary trees with a linked frontier", such as the one in Fig. 3. Such a
Fig. 5. A cycle
Fig. 6. A CFHG grammar generating cycles
graph is an ordered binary tree (with its edges labeled $\ell$ and $r$, for 'left' and 'right') of which any two consecutive leaves (in the left to right order) are connected by two edges, labeled $b$ (for 'bottom') and incident with an additional "bottom" node. For technical reasons two additional $b$-labeled edges and a "bottom" node are connected to both the first leaf and the last leaf. Note that these graphs are of bounded degree with bound 3. The productions of a (non-apex) CFHG grammar $G = (\Sigma, \Delta, P, S)$ generating all "binary trees with a linked frontier" are given in Fig. 4, where $\Sigma = \{S, T, \ell, r, b\}$ and $\Delta = \{\ell, r, b\}$, with $\mathrm{rank}(S) = 0$, $\mathrm{rank}(T) = 3$, and $\mathrm{rank}(\ell) = \mathrm{rank}(r) = \mathrm{rank}(b) = 2$. An apex CFHG grammar equivalent with $G$ will be given later as an illustration of the proof of our result.
(2) Let us now consider an easier example: the set of all cycles of length $\ge 3$, such as the one in Fig. 5. This figure also suggests a way of generating the cycles, viz. by generating a linear chain and connecting the last node in the chain to the first; to do this, the nonterminal edge should keep one of its tentacles to the first node of the chain, during the whole derivation. A CFHG grammar $G$ implementing this idea is given in Fig. 6. Note that the grammar is not apex, because it passes the first node of the chain from nonterminal edge to nonterminal edge. The main idea in the proof of our result is that an apex grammar can be obtained by "folding" parts of the graph (or, more precisely, by "folding" parts of derivations of the given non-apex grammar). In this case 'folding' means
Fig. 7. A "folded" cycle
Fig. 8. An apex CFHG grammar generating "folded" cycles
that the chain is folded once, exactly in the middle. The result of folding the chain of Fig. 5 is shown in Fig. 7. This figure also suggests a way of generating cycles. An apex CFHG grammar implementing this way of generating cycles is given in Fig. 8.
(3) We now consider a slightly more difficult example that is analogous to the second example. Figure 9 shows a graph that we will call a "grasshopper". The "grass" is a chain of $a$-labeled edges and the hops of the grasshopper are represented by the $b$-labeled edges. We assume that the hops are of length
Fig. 9. A grasshopper
Fig. 10. A CFHG grammar generating grasshoppers
at least 2 (measured by the number of $a$-labeled edges). A non-apex CFHG grammar generating all these grasshoppers, in the way suggested by Fig. 9, is given in Fig. 10. To obtain an apex CFHG grammar, we "fold" each part of the chain of grass that is hopped over by the grasshopper. The result is shown in Fig. 11, and the apex CFHG grammar suggested by this figure is given in Fig. 12.
Fig. 11. A "folded" grasshopper
Fig. 12. An apex CFHG grammar generating "folded" grasshoppers
(4) Finally we consider the still more complicated example of two grasshoppers. An example of a "2-grasshopper" is given in Fig. 13. As before the "grass" is a chain of $a$-labeled edges. The hops of one grasshopper are represented by the $b$-labeled edges and the hops of a second grasshopper by the $c$-labeled edges. For reasons of simplicity we assume that the first grasshopper starts at the first node of the chain, and the second grasshopper at the second node. For the same reasons, we also assume that the grasshoppers never jump to the same node, except at the end. Otherwise the jumps of the two grasshoppers are totally unrelated. A non-apex CFHG grammar for the set of all 2-grasshoppers is given in Fig. 14. The reader may consider the problem of finding an apex grammar for this graph language. In this case the graph has to be folded twice: first one should fold the pieces of grass where both grasshoppers are "in the air", and then one has to fold the pieces of graph where one grasshopper is in the air while the other repeatedly jumps. In the case of three grasshoppers the graph has to be folded three times, etcetera.
Fig. 13. Two grasshoppers
Fig. 14. A CFHG grammar generating 2-grasshoppers
A related example is the language of all graphs of cutwidth $\le k$ (for some fixed $k$). It is easy to write a non-apex CFHG grammar for this language (cf. [EngLei]), but hard to find an apex CFHG grammar.
The above idea of "folding" (appropriately chosen) parts of the graphs is the central idea of our proof of the non-trivial direction of Theorem 2.1. The whole difficulty lies in showing that this simple idea not only works for simple example grammars like those in (2) and (3), but for arbitrary CFHG grammars generating hypergraphs of bounded degree. Actually, the folding idea also works for CFHG grammars that do not necessarily generate hypergraphs of bounded degree. In this case one of course does not obtain an apex grammar, but a grammar in a normal form that will be stated next. From this Greibach Normal Form result for arbitrary CFHG grammars Theorem 2.1 follows rather easily, as will be shown in the remainder of this section.
Recall from Sect. 1 that a nonterminal node is a node that is incident with a nonterminal edge. Recall also from Sect. 1 that a CFHG grammar is isolation-free if the right-hand sides of its productions do not have isolated external
nodes. The isolation-free property is similar to the "neighbourhood preserving" normal form of B-NLC grammars (see [RozWel; EngRoz90]).
Definition 2.2 A CFHG grammar $G$ is in Greibach Normal Form if (1) $G$ is isolation-free, and (2) for every production $X \to H$ of $G$, $H$ has no nonterminal nodes of degree 1.
Theorem 2.3 For every CFHG grammar an equivalent CFHG grammar in Greibach Normal Form can be constructed.
An ordinary context-free grammar (viewed as a CFHG grammar as discussed at the end of Sect. 1) is in Greibach Normal Form if and only if it is in Double Greibach Normal Form (as a string grammar). Thus, Theorem 2.3 generalizes the Double Greibach Normal Form result for context-free grammars. In fact, the construction in the proof of Theorem 2.3 transforms a context-free grammar into a context-free grammar. The idea of "folding" (on which the proof of Theorem 2.3 is based) is implicit in the proof of the Double Greibach Normal Form result in [Ros]. However, for ordinary context-free grammars it suffices to fold once; thus, the case of graphs is essentially more difficult than the case of strings.
The most important property of Greibach Normal Form for CFHG grammars is that there are no external nonterminal nodes of degree 1 (note that an apex CFHG grammar is one that has no external nonterminal nodes at all). Let $X \to H$ be a production of a CFHG grammar. In general an external node $v$ of $H$ can be passed to many edges of $H$, viz. all edges incident with $v$ in $H$. Thus, nodes can be "distributed" to many terminal and nonterminal edges. In the case that $v$ is a nonterminal node of degree 1, $v$ is passed to one nonterminal edge, and to no other edge. This phenomenon will be called "chain-passing". A node of a hypergraph is chain-passing if it is an external nonterminal node of degree 1.
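Definition 2.2 and the notion of a chain-passing node can be tested on a single right-hand side as in the following Python sketch. The encoding of a right-hand side (a dictionary with keys `ext` and `edges`) and all function names are hypothetical. We also assume that the degree of a node is its number of incident tentacles and that a node is isolated when its degree is 0.

```python
def degree(rhs, v):
    """Number of tentacles attached to node v (assumed notion of degree)."""
    return sum(nodes.count(v) for _label, nodes in rhs["edges"])


def nonterminal_nodes(rhs, nonterminals):
    """Nodes that are incident with at least one nonterminal edge."""
    return {v for label, nodes in rhs["edges"] if label in nonterminals for v in nodes}


def chain_passing_nodes(rhs, nonterminals):
    """External nonterminal nodes of degree 1."""
    nt_nodes = nonterminal_nodes(rhs, nonterminals)
    return [v for v in rhs["ext"] if v in nt_nodes and degree(rhs, v) == 1]


def in_greibach_normal_form(rhs, nonterminals):
    """Conditions (1) and (2) of Definition 2.2 for one right-hand side."""
    isolation_free = all(degree(rhs, v) >= 1 for v in rhs["ext"])
    no_degree_one = all(degree(rhs, v) != 1 for v in nonterminal_nodes(rhs, nonterminals))
    return isolation_free and no_degree_one


# X -> aX as a chain 0 -a- 1 -X- 2 with external nodes (0, 2).
rhs_aX = {"ext": (0, 2), "edges": [("a", (0, 1)), ("X", (1, 2))]}
# X -> aXb as a chain 0 -a- 1 -X- 2 -b- 3 with external nodes (0, 3).
rhs_aXb = {"ext": (0, 3), "edges": [("a", (0, 1)), ("X", (1, 2)), ("b", (2, 3))]}
```

In `rhs_aX` the node 2 is chain-passing, so that right-hand side is not in Greibach Normal Form, whereas `rhs_aXb` passes both tests; this matches the correspondence with Double Greibach Normal Form stated above.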
Thus, in an isolation-free grammar, an external node that is not chain-passing is either incident with a terminal edge, or incident with at least two nonterminal edges. It is the removal of chain-passing nodes that is the most difficult part of the proof of Theorem 2.3.
As an example, consider the grammar of Fig. 6, generating all cycles of length $\ge 3$. In the right-hand side of the second production, $\mathrm{ext}(2)$ is a chain-passing node. Note that indeed, during any derivation of a cycle, the node that is generated first is chain-passed throughout the derivation, in order to attach it by an edge to the node that is generated last. In the second production of Fig. 4 both $\mathrm{ext}(1)$ and $\mathrm{ext}(3)$ are chain-passing nodes, passed to different nonterminal edges. In the second production of Fig. 1 $\mathrm{ext}(1)$ is passed to the nonterminal edge, but it is not a chain-passing node because it is also incident with a terminal edge.
We now show that Theorem 2.3 implies Theorem 2.1. This implication is based on the "shortcut" construction, which will also be needed in the proof of Theorem 2.3. Thus, to be more precise, the basic ideas in the proof of Theorem 2.1 are "folding" and "shortcutting". However, "shortcutting" is a straightforward and well-known operation: it consists of applying productions to right-hand sides of productions. Let $G$ be a CFHG grammar. We define the CFHG grammar $\mathrm{shortcut}(G)$, as follows. It has the same nonterminals, terminals, and initial nonterminal.
Its productions are defined as follows. Let $X \to H$ be a production of $G$, and denote the nonterminal edges of $H$ by $e_1, \ldots, e_n$ with $n \ge 0$. Let $\mathrm{lab}_H(e_1) \to H_1, \ldots, \mathrm{lab}_H(e_n) \to H_n$ be productions of $G$. Then the production $X \to H[H_1/e_1, \ldots, H_n/e_n]$ is in $\mathrm{shortcut}(G)$. This new grammar has the following properties.
Lemma 2.4 Let $G$ be an isolation-free CFHG grammar. (1) $L(\mathrm{shortcut}(G)) = L(G)$, (2) $\mathrm{shortcut}(G)$ is isolation-free, and (3) for $k \ge 2$, if the degree of every external nonterminal node of (the right-hand sides of the productions of) $G$ is $\ge k$, then the degree of every external nonterminal node of $\mathrm{shortcut}(G)$ is $\ge k + 1$.
Proof (1). This also holds if $G$ is not isolation-free. Maybe it needs no proof, because it holds for all types of grammar that are context-free in the sense of [Cou87]. For those who are in doubt, here is the detailed proof. It is clear that $L(\mathrm{shortcut}(G)) \subseteq L(G)$: the production $X \to H[H_1/e_1, \ldots, H_n/e_n]$ can be simulated by the sequence of productions $X \to H$, $\mathrm{lab}_H(e_1) \to H_1, \ldots, \mathrm{lab}_H(e_n) \to H_n$, because, for any sentential form $K$ with nonterminal edge $e$ labeled $X$, $K[H/e][H_1/e_1, \ldots, H_n/e_n] = K[H[H_1/e_1, \ldots, H_n/e_n]/e]$ by associativity of substitution (see Sect. 1). The other direction follows from the Decomposition Lemma (Lemma 1.4); it can be shown in a straightforward way that if $X \Rightarrow^* F$ in $G$ then $X \Rightarrow^* F$ in $\mathrm{shortcut}(G)$, for every nonterminal $X$ and every terminal hypergraph $F$. In fact, $X \Rightarrow H \Rightarrow^* F$ in $G$, for some production $X \to H$; hence, by the Decomposition Lemma, $F = H[F_1/e_1, \ldots, F_n/e_n]$ with $\mathrm{lab}(e_j) \Rightarrow^* F_j$ in $G$, where $e_1, \ldots, e_n$ are the nonterminal edges of $H$. Hence there exist $H_j$ such that $\mathrm{lab}(e_j) \Rightarrow H_j \Rightarrow^* F_j$ in $G$, for some productions $\mathrm{lab}(e_j) \to H_j$ of $G$. Let $e_{j1}, \ldots, e_{jm_j}$ be the nonterminal edges of $H_j$. Then, again by the Decomposition Lemma, $F_j = H_j[F_{j1}/e_{j1}, \ldots, F_{jm_j}/e_{jm_j}]$ with $\mathrm{lab}(e_{jk}) \Rightarrow^* F_{jk}$ in $G$, for $k \in [1, m_j]$. Hence, by induction, $\mathrm{lab}(e_{jk}) \Rightarrow^* F_{jk}$ in $\mathrm{shortcut}(G)$. Now, by associativity, $F = (H[H_1/e_1, \ldots, H_n/e_n])[F_{11}/e_{11}, \ldots, F_{1m_1}/e_{1m_1}, \ldots, F_{n1}/e_{n1}, \ldots, F_{nm_n}/e_{nm_n}]$. Hence, by the Decomposition Lemma, $H[H_1/e_1, \ldots, H_n/e_n] \Rightarrow^* F$ in $\mathrm{shortcut}(G)$, and so, since the production $X \to H[H_1/e_1, \ldots, H_n/e_n]$ is in $\mathrm{shortcut}(G)$, $X \Rightarrow^* F$ in $\mathrm{shortcut}(G)$.
(2). This is immediate from the fact that substitution of hypergraphs preserves the isolation-free property (see Lemma 1.2).
(3). Let $v$ be an external nonterminal node of $H[H_1/e_1, \ldots, H_n/e_n]$. Let $e_{j_1}, \ldots, e_{j_m}$ be all the nonterminal edges incident with $v$ in $H$, and let $v = \mathrm{nod}(e_{j_i}, r_i)$ for every $i \in [1, m]$. Clearly, $m \ge 1$, and $\mathrm{ext}(r_i)$ is nonterminal for at least one of the $H_{j_i}$.
Let $d$ be the degree of $v$ in $H$ (so $d \ge k$) and let $d_i$ be the degree of $\mathrm{ext}(r_i)$ in $H_{j_i}$ (so at least one of the $d_i$ is $\ge k$). Then the degree of $v$ in $H[H_1/e_1, \ldots, H_n/e_n]$ is $d' = d + \sum_{i=1}^{m} (d_i - 1)$. Since $G$ is isolation-free, $d_i \ge 1$ for every $i$. Hence (since $d \ge k$, $d_i - 1 \ge 0$ for every $i$, and $d_i - 1 \ge k - 1$ for some $i$) $d' \ge k + k - 1$. Hence $d' \ge 2k - 1 \ge k + 1$ (because $k \ge 2$). □
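The arithmetic in part (3) can be sanity-checked by brute force over small values. The Python sketch below is not part of the proof; the function names and the search ranges `max_extra` and `max_m` are arbitrary choices of ours. It enumerates degrees $d \ge k$ and $d_i \ge 1$ with at least one $d_i \ge k$, and confirms $d' \ge 2k - 1 \ge k + 1$.

```python
from itertools import product


def degree_after_shortcut(d, ds):
    """d' = d + sum(d_i - 1): each of the nonterminal tentacles at v is
    replaced by the d_i tentacles at the matching external node."""
    return d + sum(di - 1 for di in ds)


def check_bound(k, max_extra=3, max_m=3):
    """Check d' >= 2k - 1 and d' >= k + 1 on a small range of degrees."""
    for m in range(1, max_m + 1):
        # v is incident with m nonterminal edges, hence d >= m as well as d >= k.
        for d in range(max(k, m), k + max_extra + 1):
            for ds in product(range(1, k + max_extra + 1), repeat=m):
                if max(ds) < k:
                    continue  # at least one d_i must be >= k
                d_new = degree_after_shortcut(d, ds)
                if d_new < 2 * k - 1 or d_new < k + 1:
                    return False
    return True
```

On this range the check succeeds for $k = 2, \ldots, 5$ and fails for $k = 1$ (take $d = d_1 = 1$), which illustrates why chain-passing nodes have to be removed before shortcutting.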
Proof of Theorem 2.1 (assuming Theorem 2.3). It is easy to see that an apex CFHG grammar generates a hypergraph language of bounded degree. In fact, if the maximal degree of a node in the right-hand side of a production is $m$, then the maximal degree of a node in a generated hypergraph is at most $m^2$ (to be more precise, it is the maximal number of nonterminal edges incident with a nonterminal node times the maximal degree of an external node).
For the other direction, let $G$ be a CFHG grammar such that $L(G)$ is of bounded degree with bound $b$. By Theorem 2.3 we may assume that $G$ is isolation-free and that $G$ has no chain-passing nodes. Define $G' = \mathrm{shortcut}^{b-1}(G)$, i.e., the result of applying the shortcut operation $b - 1$ times in succession, starting with $G$. Since $G$ has no chain-passing nodes, the degree of every external nonterminal node of $G$ is $\ge 2$. Hence, by repeated application of Lemma 2.4, the degree of every external nonterminal node of $G'$ is $\ge b + 1$. But this means that (after reduction of $G'$) no external node of $G'$ is nonterminal: after using the involved production such a node would lead to a node of degree $\ge b + 1$ in the generated terminal graph, because $G'$ is isolation-free. Hence $G'$ is apex.
Note that we have shown in the proof of Lemma 2.4(3) that the degree of every external nonterminal node of $\mathrm{shortcut}(G)$ is in fact $\ge 2k - 1$. This implies that it suffices in fact to apply the shortcut construction $1 + \log_2 b$ times rather than $b - 1$ times. □
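The two iteration counts mentioned in this proof can be compared with a small Python sketch. It simply iterates the guaranteed lower bound on the degree of external nonterminal nodes, starting from 2, until it reaches $b + 1$; the function names are hypothetical.

```python
import math


def plus_one(k):
    """Bound guaranteed by Lemma 2.4(3): k becomes k + 1."""
    return k + 1


def doubling(k):
    """Sharper bound from the proof of Lemma 2.4(3): k becomes 2k - 1."""
    return 2 * k - 1


def rounds_needed(b, step):
    """Number of shortcut rounds until the degree bound reaches b + 1."""
    k, rounds = 2, 0
    while k < b + 1:
        k = step(k)
        rounds += 1
    return rounds
```

With the weaker bound this gives exactly $b - 1$ rounds. With the sharper bound the lower bound after $t$ rounds is $2^t + 1$, so the number of rounds is the least $t$ with $2^t \ge b$, which never exceeds $1 + \log_2 b$. For $b = 10$ the counts are 9 and 4.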
3. Greibach normal form
This section is devoted to the proof of Theorem 2.3. We have to show that every CFHG grammar can be transformed into an isolation-free CFHG grammar that has no nonterminal nodes of degree 1. The latter requirement naturally splits into two. We say that a hypergraph $H$ is chain-passing-free if it has no external nonterminal nodes of degree 1 (i.e., no chain-passing nodes), and we say $H$ is floating-free if it has no non-external nonterminal nodes of degree 1. As usual, a CFHG grammar is chain-passing-free or floating-free if all right-hand sides of its productions are. Note that a "floating" tentacle of a nonterminal edge (i.e., a tentacle to a non-external node of degree 1) serves no purpose, because it links the edge to no other edge. We will need the following properties of floating-free hypergraphs, which are straightforward to show (see Lemma 2.4).
Lemma 3.1 (1) If $H$ and $K$ are floating-free hypergraphs, $K$ is isolation-free,
and $e$ is a nonterminal edge of $H$ with $\mathrm{rank}(e) = \mathrm{rank}(K)$, then $H[K/e]$ is floating-free. (2) If $G$ is an isolation-free, floating-free CFHG grammar, then so is $\mathrm{shortcut}(G)$.
Thus, we have to show that every CFHG grammar can be transformed into one that is isolation-free, floating-free, and chain-passing-free. The easy part is to obtain isolation-freeness and floating-freeness.
Lemma 3.2 For every CFHG grammar there is an equivalent CFHG grammar
that is isolation-free and floating-free. Proof. The proof is a standard exercise. It is given for completeness sake. We first show that every C F H G grammar G has an equivalent isolation-free C F H G grammar G1. If X ~ * F in G (where F is terminal) and extv(i) is isolated, then
the $i$-th tentacle of $X$ was useless in this derivation. In $G_1$ we add information to $X$ saying which tentacles are useless, and drop these tentacles. This information is computed in a bottom-up fashion (i.e., from right to left in a production).

Now, $G_1$ has nonterminals $(X, U)$ where $X$ is a nonterminal of $G$ and $U \subseteq [1, \mathrm{rank}(X)]$. Intuitively, $U$ is the set of useful tentacles of $X$ (in the derivation starting from $X$). The initial nonterminal of $G_1$ is $(S, \emptyset)$. The rank of a nonterminal $(X, U)$ is $\#U$, and we use $\tau$ to denote a fixed bijection between $U$ and $[1, \#U]$. Intuitively, $(X, U)$ has the tentacles in $U$, but they have to be renumbered (by $\tau$).

The productions of $G_1$ are defined as follows. Let $X \to H$ be any production of $G$ and let $e_1, \ldots, e_n$ be the nonterminal edges of $H$. Let $(X_1, U_1), \ldots, (X_n, U_n)$ be arbitrary nonterminals of $G_1$, with $X_k = \mathrm{lab}_H(e_k)$. Then $G_1$ contains the production $(X, U) \to H'$, where

$U = \{ i \in [1, \mathrm{rank}(X)] \mid \text{either } \mathrm{ext}_H(i) \text{ is incident with a terminal edge or } \mathrm{ext}_H(i) = \mathrm{nod}_H(e_k, j) \text{ for some } k \in [1, n] \text{ and some } j \in U_k \}$

and $H'$ is defined as follows: $V_{H'} = V_H - \{\mathrm{ext}_H(i) \mid i \notin U\}$, $E_{H'} = E_H$, $\mathrm{lab}_{H'}(e) = \mathrm{lab}_H(e)$ and $\mathrm{nod}_{H'}(e) = \mathrm{nod}_H(e)$ for every terminal edge $e$, $\mathrm{lab}_{H'}(e_k) = (X_k, U_k)$ and $\mathrm{nod}_{H'}(e_k, \tau(j)) = \mathrm{nod}_H(e_k, j)$ for every $k \in [1, n]$ and $j \in U_k$, and $\mathrm{ext}_{H'}(\tau(i)) = \mathrm{ext}_H(i)$ for every $i \in U$. This ends the construction of $G_1$. It should be clear that $G_1$ is isolation-free and that $L(G_1) = L(G)$.

We now show that every isolation-free CFHG grammar $G$ has an equivalent isolation-free floating-free CFHG grammar $G_1$. $G_1$ is obtained from $G$ by dropping all "floating" tentacles and the corresponding non-external nodes of degree 1 from the right-hand sides of the productions. A non-external node $v$ that is incident with nonterminal edge $e$ (in $G$) is generated in $G_1$ by the production that is applied to $e$ (in which the external node corresponding to the floating tentacle of $e$ is turned into a non-external node).
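The definition of the set $U$ in the first construction above (the one yielding isolation-freeness) can be sketched in a few lines. The representation of a right-hand side by plain tuples, the function names, and the order-preserving choice of the bijection $\tau$ are assumptions made for illustration only; the text merely requires some fixed bijection.

```python
def useful_tentacles(ext, terminal_edges, nonterminal_edges, chosen):
    """Return U for a production X -> H, given the sets U_k chosen for e_1..e_n.

    ext               -- external nodes of H; ext[i-1] plays the role of ext_H(i)
    terminal_edges    -- node tuples of the terminal edges of H
    nonterminal_edges -- node tuples of e_1..e_n; entry [k][j-1] is nod_H(e_k, j)
    chosen            -- chosen[k] is the set U_k (1-based tentacle numbers)
    """
    # nodes touched by a terminal edge
    touched = {v for nodes in terminal_edges for v in nodes}
    # nodes touched by a useful tentacle of a nonterminal edge
    for nodes, u_k in zip(nonterminal_edges, chosen):
        touched |= {nodes[j - 1] for j in u_k}
    return {i for i, v in enumerate(ext, start=1) if v in touched}


def renumber(u):
    """One possible bijection tau from U to [1, #U] (order-preserving, an assumption)."""
    return {i: pos for pos, i in enumerate(sorted(u), start=1)}


# Hypothetical right-hand side: external nodes a, b; a terminal edge (a, c);
# one nonterminal edge e_1 with nod(e_1, 1) = c and nod(e_1, 2) = b.
ext = ("a", "b")
terminal_edges = [("a", "c")]
nonterminal_edges = [("c", "b")]

print(useful_tentacles(ext, terminal_edges, nonterminal_edges, [{1}]))     # tentacle 2 of e_1 useless
print(useful_tentacles(ext, terminal_edges, nonterminal_edges, [{1, 2}]))  # both tentacles useful
print(renumber({2, 4}))
```

In the first call the second external node is reached only through a useless tentacle of $e_1$, so it is dropped from $U$; in the second call it is kept.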
$G_1$ has nonterminals of the form $(X, F)$, where $X$ is a nonterminal of $G$ and $F \subseteq [1, \mathrm{rank}(X)]$. The initial nonterminal of $G_1$ is $(S, \emptyset)$. Intuitively, $F$ is the set of non-floating tentacles of $X$ in the right-hand side of the production that generated $X$. As before, the rank of $(X, F)$ is $\#F$, and $\tau$ denotes a fixed bijection between $F$ and $[1, \#F]$.

The productions of $G_1$ are defined as follows. Let $X \to H$ be any production of $G$, and let $e_1, \ldots, e_n$ be the nonterminal edges of $H$, with $X_k = \mathrm{lab}_H(e_k)$. Let $F$ be any subset of $[1, \mathrm{rank}(X)]$. Then $G_1$ contains the production $(X, F) \to H'$ where $H'$ is defined as follows. $V_{H'} = V_H - \{ v \in V_H \mid v \text{ is a non-external nonterminal node of degree 1} \}$, $E_{H'} = E_H$, $\mathrm{lab}_{H'}(e) = \mathrm{lab}_H(e)$ and $\mathrm{nod}_{H'}(e) = \mathrm{nod}_H(e)$ for every terminal edge $e$, $\mathrm{lab}_{H'}(e_k) = (X_k, F_k)$ where $F_k = \{ j \in [1, \mathrm{rank}(X_k)] \mid \mathrm{nod}_H(e_k, j) \text{ is external or has degree} > 1 \}$ for every $k \in [1, n]$, $\mathrm{nod}_{H'}(e_k, \tau(j)) = \mathrm{nod}_H(e_k, j)$ for every $k \in [1, n]$ and $j \in F_k$, and $\mathrm{ext}_{H'}(\tau(i)) = \mathrm{ext}_H(i)$ for every $i \in F$. This ends the construction of $G_1$. It should be clear that $G_1$ is floating-free and that $L(G_1) = L(G)$. $G_1$ is still isolation-free because $H'$ is obtained from $H$ by turning some external nodes into non-external nodes and by dropping some tentacles that are connected to non-external nodes only. □

To formulate the result that will lead to the chain-passing-free property we need some more terminology. Let $H$ be a hypergraph and let $e$ be a nonterminal edge of $H$. We denote by $\mathrm{pass}_H(e)$ the set of chain-passing nodes of $H$ that are incident with $e$. Note that $\{ \mathrm{pass}_H(e) \mid e \text{ is a nonterminal edge of } H \}$ is a partition of the set of all chain-passing nodes of $H$ (possibly containing empty sets). As an example, in the right-hand side $H$ of the second production of Fig. 4, if $e_1$ and $e_2$ are the nonterminal edges (from left to right), then $\mathrm{pass}_H(e_1) = \{\mathrm{ext}_H(1)\}$ and $\mathrm{pass}_H(e_2)$
$= \{\mathrm{ext}_H(3)\}$; the set of all chain-passing nodes of $H$ is $\{\mathrm{ext}_H(1), \mathrm{ext}_H(3)\}$. As another example, if $e$ is the nonterminal edge of the right-hand side $H$ of production $p_2$ in Fig. 14, then $\mathrm{pass}_H(e) = \{\mathrm{ext}_H(1), \mathrm{ext}_H(2)\}$. We need the following fact that relates the $\mathrm{pass}_H$ function to substitution.

Lemma 3.3 Let $H$ and $K$ be isolation-free hypergraphs, and let $e$ be a nonterminal edge of $H$ with $\mathrm{rank}(e) = \mathrm{rank}(K)$. Assume that $\mathrm{nod}_H(e, i)$ and $\mathrm{ext}_K(i)$ are identified for every $i \in [1, \mathrm{rank}(e)]$. For every nonterminal edge $f$ of $H[K/e]$, if $f$ is an edge of $H$ (with $f \neq e$) then $\mathrm{pass}_{H[K/e]}(f) = \mathrm{pass}_H(f)$, and if $f$ is an edge of $K$, then $\mathrm{pass}_{H[K/e]}(f) = \mathrm{pass}_H(e) \cap \mathrm{pass}_K(f)$.

Proof. Let $H' = H[K/e]$. We want to know $\mathrm{pass}_{H'}(f)$ for every nonterminal edge $f$ of $H'$. Note that $H'$ and $H$ have the same external nodes.

Let $f$ be an edge of $H$ with $f \neq e$. If $v \in \mathrm{pass}_H(f)$, then $v$ is not incident with $e$ and hence is still chain-passing in $H'$; thus, $v \in \mathrm{pass}_{H'}(f)$. Vice versa, if $v$ is incident with $f$ but $v \notin \mathrm{pass}_H(f)$, then $v \notin \mathrm{pass}_{H'}(f)$: either $v$ is incident with a terminal edge in $H$ and it stays incident with that edge in $H'$; or $v$ is incident with another nonterminal edge in $H$, which may be $e$ but, since $K$ is isolation-free, $v$ has degree $\geq 2$ in $H'$. Hence $\mathrm{pass}_{H'}(f) = \mathrm{pass}_H(f)$.

Let $f$ be an edge of $K$. Clearly, $\mathrm{pass}_{H'}(f) = \{ \mathrm{nod}_H(e, j) \in \mathrm{pass}_H(e) \mid \mathrm{ext}_K(j) \in \mathrm{pass}_K(f) \} = \mathrm{pass}_H(e) \cap \mathrm{pass}_K(f)$. □
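Lemma 3.3 can be checked on a tiny instance with the following sketch. The class, the function names, and the example hypergraphs are hypothetical; the sketch assumes that the degree of a node is the number of tentacles attached to it, that the external nodes of $K$ are pairwise distinct, and that both hypergraphs are isolation-free (nodes are only represented through the edges they are incident with).

```python
from collections import Counter


class Hypergraph:
    def __init__(self, edges, ext):
        # edges: edge id -> (label, tuple of incident nodes, is_nonterminal)
        # ext:   tuple of external nodes; ext[i-1] plays the role of ext(i)
        self.edges = dict(edges)
        self.ext = tuple(ext)

    def degree(self):
        # number of tentacles attached to each node (assumed notion of degree)
        return Counter(v for _, nodes, _ in self.edges.values() for v in nodes)

    def pass_set(self, e):
        # chain-passing nodes incident with e: external nodes of degree 1
        _, nodes, nonterminal = self.edges[e]
        assert nonterminal, "pass is only defined for nonterminal edges"
        deg = self.degree()
        return {v for v in nodes if v in self.ext and deg[v] == 1}


def substitute(h, e, k):
    """Return H[K/e]: replace e by a fresh copy of K, gluing ext_K(i) to nod_H(e, i)."""
    attach = h.edges[e][1]
    assert len(attach) == len(k.ext)
    glue = dict(zip(k.ext, attach))

    def rename(v):
        return glue.get(v, ("K", v))  # internal nodes of K get fresh names

    edges = {f: data for f, data in h.edges.items() if f != e}
    for f, (label, nodes, nt) in k.edges.items():
        edges[("K", f)] = (label, tuple(rename(v) for v in nodes), nt)
    return Hypergraph(edges, h.ext)


# H: external nodes 1 and 3; nonterminal edge e on (1, 2), terminal edge t on (2, 3).
h = Hypergraph({"e": ("X", (1, 2), True), "t": ("b", (2, 3), False)}, ext=(1, 3))
# K: external nodes p and q; nonterminal edge f on (p, r), terminal edge g on (r, q).
k = Hypergraph({"f": ("Y", ("p", "r"), True), "g": ("a", ("r", "q"), False)}, ext=("p", "q"))

hk = substitute(h, "e", k)
glue = dict(zip(k.ext, h.edges["e"][1]))
image_of_pass_k = {glue[v] for v in k.pass_set("f")}
print(hk.pass_set(("K", "f")) == (h.pass_set("e") & image_of_pass_k))
```

Here $\mathrm{pass}_H(e)$ and the image of $\mathrm{pass}_K(f)$ both consist of the first external node of $H$, and so does the pass set of $f$ in $H[K/e]$, in line with the second case of the lemma.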
Let $G$ be a CFHG grammar. Define the number $\mathrm{maxp}(G)$ to be $\max\{ \#\mathrm{pass}_H(e) \mid e \text{ is a nonterminal edge of } H \text{ for some production } X \to H \text{ of } G \}$. A nonterminal edge $e$ of a right-hand side $H$ is maximal if $\#\mathrm{pass}_H(e) = \mathrm{maxp}(G)$. A production $X \to H$ is maximal if $H$ has a maximal nonterminal edge. As an example, for the grammar $G$ of Fig. 4, $\mathrm{maxp}(G) = 1$; the second production is the unique maximal production, and both nonterminal edges of its right-hand side are maximal. For the grammar $G$ of Fig. 14, $\mathrm{maxp}(G) = 2$ and $p_2$ is the unique maximal production (it is the production in which both grasshoppers are in the air). As another example, if $G$ is an ordinary context-free grammar without chain productions (i.e., productions $X \to Y$ where both $X$ and $Y$ are nonterminal) that is not in Double Greibach Normal Form, then $\mathrm{maxp}(G) = 1$; a production $X \to w$ is maximal if $w$ begins or ends with a nonterminal.

Note that a CFHG grammar $G$ is chain-passing-free if and only if $\mathrm{maxp}(G) = 0$. Thus, Theorem 2.3 follows from Lemma 3.2 and the iterated application of the next lemma, which is the main lemma of this paper. Unfortunately its proof is quite long.
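The string-grammar example can be made concrete with a short sketch. It assumes the usual encoding of a string $a_1 \cdots a_n$ as a chain hypergraph with nodes $0, \ldots, n$, an edge labelled $a_i$ from node $i-1$ to node $i$, and external nodes $0$ and $n$; the function names and the sample grammars are hypothetical.

```python
def pass_sizes(rhs, nonterminals):
    """#pass(e) for every nonterminal edge of the chain hypergraph of the string rhs.

    In a chain with n >= 1 edges only the two end nodes 0 and n are external,
    and both have degree 1; every inner node has degree 2.
    """
    n = len(rhs)
    sizes = []
    for i, symbol in enumerate(rhs, start=1):
        if symbol in nonterminals:
            # edge i runs from node i-1 to node i; count its chain-passing end nodes
            sizes.append(int(i - 1 == 0) + int(i == n))
    return sizes


def maxp(productions, nonterminals):
    """Maximum of #pass(e) over all nonterminal edges of all right-hand sides."""
    return max(
        (s for rhss in productions.values() for w in rhss
         for s in pass_sizes(w, nonterminals)),
        default=0,
    )


print(maxp({"S": ["aSb", "ab"]}, {"S"}))  # every right-hand side begins and ends with a terminal
print(maxp({"S": ["Sa", "b"]}, {"S"}))    # "Sa" begins with a nonterminal
print(pass_sizes("S", {"S"}))             # a chain production passes both end nodes
```

Under this encoding the first grammar has maxp 0, the second has maxp 1, and a chain production would contribute an edge with two chain-passing nodes, which is why chain productions are excluded in the example above.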
Lemma 3.4 For every isolation-free floating-free CFHG grammar $G$ with $\mathrm{maxp}(G) > 0$ there is an equivalent isolation-free floating-free CFHG grammar $G_2$ such that $\mathrm{maxp}(G_2) < \mathrm{maxp}(G)$.

Proof. A grammar with a smaller maxp is obtained by "folding" subderivations that use maximal productions only. However, the construction is in two steps. First an isolation-free floating-free grammar $G_1$ is constructed (with another, possibly larger, maxp) with the following property (P): if $X \to H$ is a production of $G_1$ and $e$ is a nonterminal edge of $H$ with $\#\mathrm{pass}_H(e) \geq \mathrm{maxp}(G)$, then, for every production $\mathrm{lab}_H(e) \to H'$ of $G_1$ and every nonterminal edge $f$ of $H'$, $\#\mathrm{pass}_{H'}(f) < \mathrm{maxp}(G)$. From $G_1$ an equivalent grammar $G_2$ is constructed with $\mathrm{maxp}(G_2) < \mathrm{maxp}(G)$.

[...] ) is a subset of $V$), $E = \{e, f\}$ with $\mathrm{lab}(e) = \langle X_2, \pi_2 \rangle$ and $\mathrm{lab}(f) = \langle X_1, \pi_1, X_2, \pi_2, \phi \rangle$, $\mathrm{nod}(e, j)$ is the unique node $(x, y)$ in $V$ with $y = j$, $\mathrm{nod}(f, \tau(i, j)) = (i, j)$ for every $(i, j) \in \mathrm{tent}(\langle X_1, \pi_1, X_2, \pi_2, \phi \rangle)$, and $\mathrm{ext}(i)$ is the unique node $(x, y)$ in $V$ with $x = i$. Note that $\phi_{e,M} = \phi$, $\mathrm{pass}_M(e) = \{ \mathrm{ext}_M(i) \mid i \in \pi_1 \} = \{ \mathrm{nod}_M(e, j) \mid j \in \pi_2 \}$, and $\mathrm{intprod}(\langle X_1, \pi_1, X_2, \pi_2, \phi \rangle)$ is isolation-free and floating-free.
The idea behind this intermediate production is that it represents the one-step transition from $\langle X_1, \pi_1 \rangle$ to $\langle X_2, \pi_2 \rangle$ discussed before. Formally, it is not difficult to show that if $K = \mathrm{fold}(H, e, \langle X_1, \pi_1, X_2, \pi_2, \phi \rangle)$ then $M[K/f] = H$ (or, more precisely, $M[K/f]$ and $H$ are isomorphic). Under the assumption (*), this means that for every hypergraph $H$ that contains nonterminals of $G'$ only, $\langle X_1, \pi_1 \rangle \Rightarrow^{+}_{\mathrm{mid}} H$ in $G'$ iff $\langle X_1, \pi_1 \rangle \Rightarrow^{+} H$ in the grammar that is obtained from $G_1$ by adding all intermediate productions (assuming that such a production is used in the first step of the derivation only). In the formal proof of this equivalence one should of course use the Decomposition Lemma (Lemma 1.4).

For the grammar of Fig. 4 (and Fig. 15) there are two intermediate productions, one for $D_1$ and one for $D_3$, as shown in Fig. 16. The bijection $\tau$ from $\mathrm{tent}(D_1)$ to $[1, \#\mathrm{tent}(D_1)]$ is taken as $\tau(2, *) = 1$, $\tau(3, *) = 2$, $\tau(*, 2) = 3$, $\tau(*, 3) = 4$, and the one for $D_3$ as $\tau(1, *) = 1$, $\tau(2, *) = 2$, $\tau(*, 1) = 3$, and $\tau(*, 2) = 4$.
Fig. 17. Tail productions
Next we define some more productions that will not be part of $G_1$, viz. the "tail productions". A tail production is a production of the form $\langle X_1, \pi_1 \rangle \to M[H/e]$ where $\langle X_1, \pi_1 \rangle \to M$ equals $\mathrm{intprod}(\langle X_1, \pi_1, X_2, \pi_2, \phi \rangle)$ for some nonterminal $\langle X_1, \pi_1, X_2, \pi_2, \phi \rangle$, $e$ is the edge of $M$ with label $\langle X_2, \pi_2 \rangle$, and $\langle X_2, \pi_2 \rangle \to H$ is a final production of $G'$. In view of the above (and still assuming (*)), it should be clear that such a production (followed by some derivation $\langle X_1, \pi_1, X_2, \pi_2, \phi \rangle \Rightarrow^* K$ in $G_1$) can simulate any derivation in $G'$ that uses (a positive number of) middle productions and ends by a final production. For the example grammar of Fig. 4 (and Fig. 15) there are two tail productions, shown in Fig. 17. One is obtained from the first intermediate production in Fig. 16 by application of the unique final production with left-hand side $T_1$ (in Fig. 15), and similarly for the other.

We finally turn to the definition of the productions of $G_1$. There are four types of them, which will be considered in turn.

(1) All non-maximal productions of $G$ are productions of $G_1$. These will be called old productions.

(2) Let $X \to H$ be any initial production of $G'$. Let $e_1, \ldots, e_n$ be the maximal edges of $H$ and let $\langle X_i, \pi_i \rangle = \mathrm{lab}_H(e_i)$. For $i \in [1, n]$, let $\langle X_i, \pi_i \rangle \to H_i$ be any final or tail production. Then $X \to H[H_1/e_1, \ldots, H_n/e_n]$ is a production of $G_1$. It will be called a bridge production. Intuitively, this production simulates a subderivation of $G'$ that does not use old productions; note that a tail production is substituted for $e_i$ in the case that middle productions are used in the subderivation starting with $\langle X_i, \pi_i \rangle$, whereas a final production is substituted in the case that no middle productions are used.

The remaining two types of productions have left-hand sides of the form $\langle X_1, \pi_1, X_2, \pi_2, \phi \rangle$.
Thus, still assuming (*) and using all previous observations on correctness, it can now be shown formally that, for every nonterminal $X$ of $G$ and every terminal hypergraph $F$, $X \Rightarrow^* F$ in $G_1$ iff $X \Rightarrow^* F$ in $G'$ (and this implies that $L(G_1) = L(G') = L(G)$). The formal proof is left to the reader; it is, of course, by induction on the length of the derivations, considering the production $X \to H$ applied in the first derivation step. The case that $X \to H$
is old is obvious. Also the case that $X \to H$ is a bridge production (in the only-if part of the proof) is straightforward. In the case that $X \to H$ is an initial production (in the if-part of the proof), one has to reorder the derivation $H \Rightarrow^* F$ as $H \Rightarrow^* K \Rightarrow^* F$ in such a way that $K$ contains nonterminals of $G$ only, and the derivation $H \Rightarrow^* K$ uses middle and final productions only. The derivation $X \Rightarrow H \Rightarrow^* K$ can then be simulated by a bridge production (followed of course by a derivation that uses the productions for the nonterminals of the form $\langle X_1, \pi_1, X_2, \pi_2, \phi \rangle$). The derivation $K \Rightarrow^* F$ can be simulated because of the induction hypothesis. Note that throughout the proof the Decomposition Lemma is needed; in particular, the existence of the above-mentioned reordering can be established by the Decomposition Lemma, using induction on the length of any derivation $\langle X, \pi \rangle \Rightarrow^* F'$.

For our example grammar $G$ of Fig. 4 (with $G'$ in Fig. 15), $G_1$ has two old productions (viz. the first two productions of Fig. 15), and four bridge productions as shown in Fig. 18. Note that $G'$ has a unique initial production (the third production of Fig. 15) and each of $T_1$ and $T_3$ has one final production and one tail production.

We now turn to the last two types of productions.

(3) Let $\langle X_1, \pi_1 \rangle \to H_1$ and $\langle Y_2, \mu_2 \rangle \to H_2$ be any two middle productions of $G'$. Let $H_1$ contain an edge $e_1$ with label $\langle Y_1, \mu_1 \rangle$, and $H_2$ an edge $e_2$ with label $\langle X_2, \pi_2 \rangle$. Let $\phi$ and $\psi$ be any two functions such that $\langle X_1, \pi_1, X_2, \pi_2, \phi \rangle$ and $\langle Y_1, \mu_1, Y_2, \mu_2, \psi \rangle$ are nonterminals of $G_1$, and such that $\phi = \phi_{e_2,H_2} \circ \psi \circ \phi_{e_1,H_1}$.
Then $\langle X_1, \pi_1, X_2, \pi_2, \phi \rangle \to K$ is a production of $G_1$, where

$K = \mathrm{fold}(H_1[M/e_1][H_2/e], e_2, \langle X_1, \pi_1, X_2, \pi_2, \phi \rangle)$,

$\langle Y_1, \mu_1 \rangle \to M$ is equal to $\mathrm{intprod}(\langle Y_1, \mu_1, Y_2, \mu_2, \psi \rangle)$, and $e$ is the edge of $M$ with label $\langle Y_2, \mu_2 \rangle$. Such a production will be called a (non-final) folding production. Intuitively it simulates the first and last step of a derivation $\langle X_1, \pi_1 \rangle \Rightarrow^{+}_{\mathrm{mid}} H$ in $G'$, as discussed before, where the first step uses production $\langle X_1, \pi_1 \rangle \to H_1$, the last step uses production $\langle Y_2, \mu_2 \rangle \to H_2$, and the remaining part of the derivation is represented by the intermediate production $\mathrm{intprod}(\langle Y_1, \mu_1, Y_2, \mu_2, \psi \rangle)$. Note that $H' = H_1[M/e_1][H_2/e]$ is the result of a three-step derivation that uses these three productions. Note also that $\phi_{e_2,H'} = \phi_{e_2,H_2} \circ \psi \circ \phi_{e_1,H_1} = \phi$ (because $\phi_{e,M} = \psi$), and (using Lemma 3.3) that $\mathrm{pass}_{H'}(e_2) = \{ \mathrm{ext}_{H'}(i) \mid i \in \pi_1 \} = \{ \mathrm{nod}_{H'}(e_2, j) \mid j \in \pi_2 \}$; thus, $K$ is well defined.

The example grammar $G_1$ has two non-final folding productions, shown in Fig. 19. The first production generates a left-most slope in a "folded" fashion. The top part of the right-hand side $H$ (i.e., the part connected to tentacles 1 and 2 of the nonterminal edge with label $D_1$) corresponds to the top of the left-most slope: two edges of the tree and a nonterminal edge with label $T$ that generates the right subtree. The bottom part of $H$ (connected to tentacles 3 and 4 of $D_1$) corresponds to the bottom of the left-most slope. Similarly the second production generates a folded right-most slope.

(4) Let $\langle X_1, \pi_1 \rangle \to H$ be any middle production of $G'$, and let $H$ contain an edge $e$ with label $\langle X_2, \pi_2 \rangle$. Then $\langle X_1, \pi_1, X_2, \pi_2, \phi \rangle \to \mathrm{fold}(H, e, \langle X_1, \pi_1, X_2, \pi_2, \phi \rangle)$
Fig. 18. Bridge productions
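is a production of $G_1$, where $\phi = \phi_{e,H}$. Let $H_1$ and [...] $\mathrm{tent}(\langle X_1, \ldots, X_n \rangle) = \{ (i, j) \mid i \in [1, n] \text{ and } j \in [1, \mathrm{rank}(X_i)] \}$, and its rank by $\#\mathrm{tent}(\langle X_1, \ldots, X_n \rangle)$. Intuitively, "tentacle" $(i, j)$ of $\langle X_1, \ldots, X_n \rangle$ represents tentacle $j$ of $X_i$. We denote by $\tau$ a fixed bijection between $\mathrm{tent}(\langle X_1, \ldots, X_n \rangle)$ and $[1, \#\mathrm{tent}(\langle X_1, \ldots, X_n \rangle)]$; thus, $\tau(i, j)$ is the tentacle number of "tentacle" $(i, j)$.

It remains to define the productions of $G'$. Let $\langle X_1, \ldots, X_n \rangle$ be a nonterminal of $G'$, let $k \in [1, n]$, and let $X_k \to H$ be in $P$. Then $G'$ contains the production $\langle X_1, \ldots, X_n \rangle \to H'$ where $H'$ is obtained from $H$ as follows.

(1) Add nodes $v_{i,j}$ with $i \in [1, n]$, $i \neq k$, and $j \in [1, \mathrm{rank}(X_i)]$.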
(2) Redefine the external nodes: for $j \in [1, \mathrm{rank}(X_k)]$, $\mathrm{ext}_{H'}(\tau(k, j)) = \mathrm{ext}_H(j)$, and for $i \in [1, n]$, $i \neq k$, and $j \in [1, \mathrm{rank}(X_i)]$, $\mathrm{ext}_{H'}(\tau(i, j)) = v_{i,j}$.

(3) Let $e_1, \ldots, e_m$ (with $m > 0$) be the nonterminal edges of $H$ (in some order), and let $Y_i = \mathrm{lab}_H(e_i)$. Remove $e_1, \ldots, e_m$ and add a new nonterminal edge $e$ such that $\mathrm{lab}_{H'}(e) =$