The Computer Journal Advance Access published July 29, 2005 © The Author 2005. Published by Oxford University Press on behalf of The British Computer Society. All rights reserved. For Permissions, please email: [email protected] doi:10.1093/comjnl/bxh102

Generalized Bottom Up Parsers With Reduced Stack Activity

Elizabeth Scott and Adrian Johnstone
Department of Computer Science, Royal Holloway, University of London, Egham, Surrey TW20 0EX, UK
Email: {e.scott,a.johnstone}@rhul.ac.uk

We describe a generalized bottom up parser in which non-embedded recursive rules are handled directly by the underlying automaton, thus limiting stack activity to the activation of rules displaying embedded recursion. Our strategy is motivated by Aycock and Horspool's approach, but uses a different automaton construction and leads to parsers that are correct for all context-free grammars, including those with hidden left recursion. The automaton features edges which directly connect states containing reduction actions with their associated goto state: hence we call the approach reduction incorporated generalized LR parsing. Our parser constructs shared packed parse forests in a style similar to that of Tomita parsers. We give formal proofs of the correctness of our algorithm, and compare it with Tomita's algorithm in terms of the space and time requirements of the running parsers and the size of the parsers' tables. Experimental results are given for standard grammars for ANSI-C and ISO-Pascal, for a non-deterministic grammar for IBM VS-COBOL, and for a small grammar that triggers asymptotic worst case behaviour in our parser.

Received 5 July 2004; revised 28 February 2005

1. INTRODUCTION

The best currently known general parsing algorithm, based on boolean matrix multiplication [1], has order n^2.376, although the large constants of proportionality make this algorithm effectively impractical. The best currently known practical general parsing algorithms have order n^3. The standard LR parsing algorithm [2] allows linear parsing of any LR(1) grammar. However, real programming languages are not usually LR(1). An illustration of the difficulties that can result when trying to use standard tools (such as the LALR(1)-based YACC) in such cases is given by Stroustrup [3] in his discussion of the development of C++. Problems can also arise when semantic actions are added to an initially LR(1) grammar. A standard LR parser generator requires such actions to be placed at the end of a rule; where an action appears in the middle of a rule, a new subrule will be inserted by the parser generator. The addition of such new rules can result in a grammar which is no longer LR(1), so the parser generator can no longer be applied. In recent years several parser generators, for example BTYACC [4], PCCTS/ANTLR [5, 6], JAVACC [7] and PRECC [8], have been developed that have some general capability in the form of backtracking or extended lookahead. However, in the worst case unlimited backtracking results in exponential-time parsers. Researchers in the area of natural languages often have to use grammars which are ambiguous, and therefore certainly not LR(1), and thus in this field there is a long-standing interest in general parsing techniques. One of the most

commonly used is Earley's algorithm [9], which is known to be worst case cubic in general and worst case quadratic on non-ambiguous grammars. It was pointed out by Tomita [10, 11] that the parser version of Earley's algorithm contains an error. Tomita gave an algorithm that, because it is a generalisation of the standard LR algorithm, is known as the GLR algorithm. This algorithm can be seen as a concretization of Lang's generalized automaton traverser [12], although it seems that Tomita was unaware of Lang's work at the time. Tomita's algorithm was designed with derivation tree construction in mind and thus its extension to a parser was straightforward. A description of derivation tree generation using Lang's algorithm can be found in [13]. As an extension of the standard LR algorithm, the GLR algorithm is attractive to programming language designers, allowing the possibility of addressing the problems associated with non-LR(1) grammars while retaining LR(1) behaviour for LR(1) grammars. This has generated further interest in general parsing. For example, the Bison parser generator [14, 15] now contains some general parsing capability loosely based on a GLR algorithm, and the ASF+SDF tool [16], which has been extensively used for re-engineering of legacy code, uses a GLR parser. The run time performance of bottom up generalized parsers is dominated by the need to maintain multiple stack contexts. Backtracking parsers such as BTYACC explore these contexts in a depth-first manner, leading to worst case exponential parse times. Breadth-first bottom-up parsers,

The Computer Journal, 2005


EA Scott and AIC Johnstone

typified by Tomita-like algorithms, maintain parallel stacks. It is easy to save storage by representing these stacks using a tree in which stack prefixes are shared (as done by the most recent versions of Bison), but the growth of the resulting structure is such as to make the technique impractical for even moderately non-deterministic grammars. More interestingly, Tomita noted that, owing to the context-free nature of the searches associated with reductions in shift-reduce parsers, stacks at the same ‘level’ which have the same state on top may be merged, giving a graph structure. Although there are technical errors in Tomita’s algorithms (later corrected by Farshi using brute force search [17] and efficiently by our RNGLR algorithm [18, 19]), Tomita’s core contribution allows both efficient recognition and, via the construction of a shared packed parse forest (SPPF), the delivery of a compact description of all string derivations. However, the algorithm remains worst case unbounded polynomial, so any efficiency gains that can be obtained by restricting stack activity are extremely valuable. Even users of standard deterministic bottom-up parsers appreciate the need to improve performance by reducing stack activity. The parsing power of standard LR(1) and LALR(1) techniques allows repetition in language constructs to be specified using either left or right recursion, but in practice right recursion is usually discouraged since it generates nested stack activity, whereas left recursion allows right hand sides to be built up on the stack and then immediately reduced. It is reasonable to ask if other nested stack activity can be avoided. Iteration expressed using head or tail recursion might be avoidable, but of course embedded recursion is central to the expressive power of context-free grammars: parenthesized expressions, for instance, are inherently context-free and cannot be reduced to regular rules.
However, in real programming languages the incidences of embedded recursion that cannot be avoided by simply writing the grammar differently are limited to parenthesized expressions, nested begin-end blocks and a few other constructs such as nested user-defined type declarations. One view of left recursion in bottom up parsing is that left recursion is ‘absorbed’ into the underlying automaton state: when we form the closure over a left recursive item we form a state with a self-loop. In this sense the head recursion in the grammar has been converted into iteration in the automaton. In top-down parsers it is possible to map tail recursion into iteration, and in fact many top down parser generators allow regular expressions to be used so that repetition in the language may be specified using closure operators that do not require stack activity; conversion of bottom-up style grammars with left recursion often proceeds by re-expressing this recursion using regular expressions. The extraction of a ‘context-free core’ from an arbitrary grammar and the conversion of the remaining parts to regular expressions would be a desirable operation, but sadly it is undecidable whether a given context-free grammar generates a regular language [20, p. 208, ex. 2.6.14(b)].

Nevertheless, algorithms that locate large regular sub-languages (as opposed to the largest regular sub-language) are still worthy of exploration. Aycock and Horspool [21, 22] have described such an approach, motivated by the reduction of stack calls in Tomita-style parsing, that can lead to lower run-time constants of proportionality. Both Tomita’s original algorithm and Aycock and Horspool’s algorithm fail to terminate if the grammar contains hidden left recursion, but the reasons for the failure in the two cases are different. Tomita’s original algorithm really failed on hidden right recursion, and it is the attempt to avoid this problem which fails on hidden left recursion; in fact the solution is to use Tomita’s original algorithm with modified parse tables [18]. In Aycock and Horspool’s case, the problem is that, for a non-terminal that displays hidden left recursion, the associated automaton can repeatedly call itself without reading any input. This paper describes our Reduction Incorporated Generalized LR (RIGLR) algorithm that is motivated by Aycock and Horspool’s strategy, but which terminates and is correct for all context-free grammars. (An outline of an earlier, less efficient version of this algorithm was presented in [23].) Our scheme yields high-performance general parsers but, as for Aycock and Horspool’s construction, the associated tables may be unreasonably large even on modern hardware, although performance and space may be traded within certain limits. This is achieved not by table compression (although that would be a useful adjunct to our scheme) but by varying the way in which non-terminal instances are processed by the parser. In practice, this tuning phase should be informed by some understanding of the kinds of strings to be parsed: it is possible to reduce table space at low cost in performance by nominating non-terminal instances associated with parser configurations that rarely occur.
In what follows we give full details of the RIGLR algorithm, formal proofs of its correctness, and a parser version of the algorithm which constructs SPPFs. We discuss the problems associated with turning our recognizer into a parser and then give a parser which constructs Tomita-style SPPFs. We also compare our algorithm with Tomita’s algorithm both in operational efficiency and in parser size.

2. INITIAL DEFINITIONS

A context-free grammar (CFG) consists of a set N of non-terminal symbols, a set T of terminal symbols that is disjoint from N, an element S ∈ N called the start symbol and a set of grammar rules of the form A ::= α where A ∈ N and α is a (possibly empty) string of terminals and non-terminals. As is standard in LR parsers, where necessary we shall assume that there is an augmented start rule, S′ ::= S, such that S′ does not appear on the right hand side of any grammar rule. A derivation step is an element of the form γ ⇒ δ where γ and δ are strings of terminals and non-terminals, and δ can be obtained from γ by replacing some non-terminal A in γ by a string α, where A ::= α is a grammar rule. A derivation of τ from σ is a sequence of derivation steps



σ ⇒ β1 ⇒ β2 ⇒ … ⇒ βn−1 ⇒ τ. We also write σ ⇒* τ, and σ ⇒n τ if τ can be derived from σ in n steps. We may write σ ⇒+ τ if n > 0. The symbol ε denotes the empty string and $ denotes the special end-of-string symbol. A sentential form is any string α such that S ⇒* α, and a sentence is a sentential form that contains only elements of T. The set, L(Γ), of sentences that can be derived from the start symbol of a grammar, Γ, is defined to be the language generated by Γ. For a grammar symbol x and a string γ we define

first_T(x) = {t ∈ T | for some string β, x ⇒* tβ}
first(ε) = {ε}
first(xγ) = first_T(x) ∪ first(γ), if x ⇒* ε; first_T(x), otherwise.

We say that a grammar has left (or right) recursion if there is a non-terminal A and a derivation A ⇒+ αAβ where α ⇒* ε (or β ⇒* ε). We say that this recursion is hidden if α ≠ ε (or β ≠ ε). A grammar has self embedding if there is some non-terminal, A, and strings α, β ≠ ε such that A ⇒* αAβ.

An item is an element A ::= α · β, where A ::= αβ is a grammar rule. (We write X ::= · rather than X ::= ·ε and X ::= ε·, so there is only one item associated with an ε-rule.)

A finite automaton (FA) is a set of states and a set of directed edges between these states. One of the states is singled out to be the start state, and one or more states are designated as accepting states. The edges are labeled with grammar symbols or with the empty string ε. For technical reasons, we shall want to label some of the edges with special versions of ε, corresponding to ‘performing a reduction by rule i’. We shall denote these as Ri and call them R-edges. An FA is a deterministic finite automaton (DFA) if there are no edges labeled ε or Ri and if there are no two edges with the same source state and the same label. A path is a sequence θ1 … θk of edges in the FA such that the source state of θi+1 is the target state of θi, for 1 ≤ i ≤ k − 1. A path through the FA is a path θ1 … θk such that the source state of θ1 is the start state and the target state of θk is an accepting state. For a path θ, lbl(θ) denotes the sequence of labels on the edges in θ, including ε and the Ri symbols. Formally, if θ is the empty path then lbl(θ) = ε, and if θ consists of one edge labeled x then lbl(θ) = (x). If θ has two or more edges and θ = φθ1, where θ1 is an edge labeled x, then lbl(θ) = (lbl(φ), x). We also have a notation, ℓ, for the result of removing the ε and Ri symbols: we define ℓ(ε) = ε, ℓ(Ri) = ε, ℓ(x) = x if x ≠ Ri, and, for sequences α of grammar, ε and Ri symbols,

ℓ(α, x) = ℓ(α), if x = Ri or x = ε; ℓ(α) x, otherwise.
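The first_T sets and the nullability test x ⇒* ε above can be computed together by a standard fixed-point iteration. A minimal Python sketch (the grammar encoding and variable names are ours, not the paper's), using the rules S ::= Sa | A, A ::= bA | ε as an example:

```python
# Fixed-point computation of first_T and nullability for a small example
# grammar: S ::= Sa | A, A ::= bA | epsilon (encoding is ours).
RULES = [("S'", ("S",)), ("S", ("S", "a")), ("S", ("A",)),
         ("A", ("b", "A")), ("A", ())]
NONTERMS = {lhs for lhs, _ in RULES}

first_T = {A: set() for A in NONTERMS}
nullable = set()            # non-terminals X with X =>* epsilon

changed = True
while changed:              # iterate until nothing new is added
    changed = False
    for lhs, rhs in RULES:
        for x in rhs:
            if x in NONTERMS:
                if not first_T[x] <= first_T[lhs]:
                    first_T[lhs] |= first_T[x]
                    changed = True
                if x not in nullable:
                    break   # x cannot vanish, so stop scanning this rule
            else:
                if x not in first_T[lhs]:
                    first_T[lhs].add(x)
                    changed = True
                break       # a terminal always stops the scan
        else:               # every symbol of rhs was nullable
            if lhs not in nullable:
                nullable.add(lhs)
                changed = True
```

Given these two quantities, first(xγ) follows directly from the displayed case analysis.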

We say that a string µ of grammar symbols is accepted by an FA, N, if there is a path θ through N such that ℓ(θ) = µ. In Section 4 we describe an FA, IRIA(Γ), that accepts all the sentential forms generated by a given CFG Γ. The general approach is as follows. A grammar Γ is first annotated so that non-terminal instances that specify embedded recursion are distinguished. These distinguished non-terminal instances are treated as terminals in some phases of the table construction process. We then construct a ‘multiplied out’ non-deterministic FA called the intermediate reduction incorporated automaton, IRIA(Γ). Some non-determinism is then removed by applying the subset construction, yielding the reduction incorporated automaton, RIA(Γ). We then describe how to combine RIAs for distinguished non-terminal instances into a recursion call automaton (RCA). The RIGLR parsing algorithm traverses the RCA, building a recursion call graph (RCG) which encodes the stack history of the reductions arising from embedded recursion (which could not be directly incorporated into the RIA). In the next section we informally describe IRIA(Γ) and RIA(Γ) and their relationship to the more conventional handle-finding automata used in standard LR parsing.

3. THE GENERAL PRINCIPLE

Since parsing involves comparing a sentential form with the rules of a grammar so as to detect derivation steps, it is natural to render a grammar as an FA in which the states correspond to positions between grammar symbols and the edges correspond to the matching of grammar symbols. The standard LR(0) automaton, see for example [24], can be thought of as being constructed from this FA. For each state labeled X ::= α · xβ there is an edge labeled x to a state labeled X ::= αx · β and, if x is a non-terminal, edges labeled ε to states labeled x ::= ·γ. The FA constructed in this way for Γ1, S ::= aAbb | cAbd

A ::= d

is given below, with the corresponding LR(0) DFA obtained by using the subset construction on the FA.
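The LR(0) DFA referred to here can equally be computed directly. A minimal Python sketch of the closure/goto subset construction for Γ1 (the item and grammar encodings are ours, not the paper's):

```python
# LR(0) item-set (subset) construction for Gamma_1; items are triples
# (lhs, rhs, dot).  Encoding is ours, following the text's description.
RULES = [("S'", ("S",)), ("S", ("a", "A", "b", "b")),
         ("S", ("c", "A", "b", "d")), ("A", ("d",))]
NONTERMS = {lhs for lhs, _ in RULES}

def closure(items):
    items, work = set(items), list(items)
    while work:
        lhs, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in NONTERMS:
            for l2, r2 in RULES:        # add fresh items Y ::= . delta
                if l2 == rhs[dot] and (l2, r2, 0) not in items:
                    items.add((l2, r2, 0))
                    work.append((l2, r2, 0))
    return frozenset(items)

def goto(items, x):                     # advance the dot over symbol x
    return closure({(l, r, d + 1) for l, r, d in items
                    if d < len(r) and r[d] == x})

start = closure({("S'", ("S",), 0)})
states, work = {start}, [start]
while work:
    I = work.pop()
    for x in {r[d] for _, r, d in I if d < len(r)}:
        J = goto(I, x)
        if J not in states:
            states.add(J)
            work.append(J)
```

Note that the single state {A ::= d·} is reached on d from both the a-branch and the c-branch; this shared state plays the rôle of the state L discussed below.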

Where no confusion will result, we shall write ℓ(θ) as shorthand for ℓ(lbl(θ)).


Standard LR DFAs act as ‘handle’ recognizers. A string of terminals and non-terminals is input and the DFA terminates in success when a handle α, which will be the right hand side of a production rule for a non-terminal X say, is read from the input stream. The handle is then replaced by X and the new string is fed into the DFA. Eventually, repeating this process, either the input string contains just the start symbol and the parse is complete, or the DFA fails to terminate in an accept state and the parse fails. To avoid re-reading the input string from the start each time a handle is replaced, a stack is used to record the path taken through the DFA and, when α is recognized, the stack is unwound to remove the symbols in α and the traversal continues assuming that the next input symbol is X (this is referred to as performing a reduction). If we could simply retrace back along the path labeled with α then we would not need to use the stack, since we could add an ε-arrow (a reduction arrow) from the end of this path to the target of the edge labeled X. Of course, this would result in an FA which recognizes sentential forms, and such an FA cannot always exist because some CFGs generate non-regular languages. The issue is that there can exist more than one path labeled with α which ends at a given state. This is certainly the case when the grammar contains self embedding, but, perhaps less obviously, this can also happen in grammars which do not contain self embedding. For example, in the grammar Γ1 above, the DFA state moved to after an application of the reduction A ::= d depends on the path taken to reach the state L. Adding ε-arrows from L would result in an NFA that accepts cdbb and adbd, which are not in L(Γ1).

The central observation is that a stack in an LR parser performs two rôles: it ensures that in an instance of self embedding, X ⇒* αXβ say, the number of αs matched is the same as the number of βs matched, and it ensures that when there are multiple instances of a non-terminal X in the set of grammar rules, the parser returns to the correct state in the FA after a reduction to X has been performed. From our point of view, the difference between these two rôles is that we can deal with multiple instances by replicating the states in the FA, but self embedding would require infinitely many replications. Thus we extend the FA by ‘multiplying out’, so that each occurrence of a non-terminal on the right hand side (RHS) of a production causes the entire set of items for that non-terminal to be added afresh. Then, for states labeled X ::= α·, where X ::= α is rule i, we add an edge labeled Ri to a state with a label of the form Y ::= δX · σ. This results in the following FA, IRIA(Γ1), for Γ1.

Traversing this FA, treating the R-edges as ε-edges, allows us to recognize L(Γ1). For recursive instances of non-terminals we add an ε-edge back to the most recent instance of the target item on a path from the start state to the current state. For example, from the recursive grammar Γ2 given by S ::= Sa | A

A ::= bA | ε

we generate the IRIA, IRIA(Γ2)

The ε-edge from L to M and the corresponding R3-edge from N to itself arise from the recursive occurrence of A in the rule A ::= bA. Edges which do not arise from recursion will be referred to as primary edges.


4. GENERATING IRIA(Γ) AND RIA(Γ)

We now give a formal algorithm for constructing the FA IRIA(Γ), which accepts all the sentential forms generated by Γ. If Γ does not contain any self embedding then IRIA(Γ) accepts a string of terminals u if and only if u ∈ L(Γ).

4.1. The IRIA(Γ) Construction Algorithm

Given an augmented grammar Γ, construct IRIA(Γ) as follows:

Step 1: Create a node labeled S′ ::= ·S.
Step 2: While there are nodes in the FA which are not marked as dealt with, carry out the following:
(i) Pick a node K, labeled A ::= µ · γ say, which is not marked as dealt with.
(ii) If γ ≠ ε then let γ = xγ′, where x ∈ N ∪ T, create a new node, M, labeled A ::= µx · γ′ and add an edge labeled x from K to M. This edge is defined to be a primary edge and K is the primary parent of M.
(iii) If x = Y, where Y is a non-terminal, and if either x ≠ A or µ ≠ ε, then for each rule Y ::= δ:
(a) if there is a node L, labeled Y ::= ·δ, and a path θ from L to K which consists of only primary edges and primary ε-edges (θ may be empty), add an edge labeled ε from K to L. (This new edge is not a primary ε-edge.)
(b) if (a) does not hold, create a new node with label Y ::= ·δ and add an edge labeled ε from K to this new node. This edge is defined to be a primary ε-edge.
(iv) If the label of K is of the form A ::= ·Aδ, let M be the primary parent of K. For each node H such that there is an ε-edge from M to H, add an ε-edge from K to H. (These are not primary edges.)
(v) Mark K as dealt with.
Step 3: Remove all the ‘dealt with’ marks from all nodes.
Step 4: While there are nodes labeled Y ::= γ· that are not dealt with, pick a node K labeled X ::= x1 … xn· that is not marked as dealt with. Let X ::= x1 … xn be rule i. If X ≠ S′ then find each node L labeled Z ::= δ · Xρ such that there is a path labeled (ε, x1, …, xn) from L to K, then add an edge labeled Ri from K to the child of L labeled Z ::= δX · ρ. Mark K as dealt with. The new edge is called a reduction edge or an R-edge, and if the first (ε labeled) edge of the corresponding path is a primary edge then this new edge is defined to be a primary R-edge.
Step 5: Mark the node labeled S′ ::= ·S as the start node and mark the node labeled S′ ::= S· as the accepting node.

The following theorems will be proved in Section 9.

Theorem 1. If α is a non-trivial sentential form of a CFG Γ, i.e. if S′ ⇒+ α, then α is accepted by IRIA(Γ) (i.e. there is a path θ through IRIA(Γ) such that ℓ(θ) = α).

Theorem 2. Let Γ be a CFG that does not contain any self embedding. If θ is accepted by IRIA(Γ) then S′ ⇒* ℓ(θ).

4.2. RIA(Γ)

IRIA(Γ) can be highly non-deterministic. We now define a less non-deterministic automaton RIA(Γ) that is equivalent to IRIA(Γ). First we note that IRIA(Γ) only ever has input that is a string of terminals; thus the edges labeled with non-terminals can be removed once the R-edges have been constructed. A non-deterministic automaton can always be transformed into a deterministic one by applying the standard subset construction. In IRIA(Γ) the R-edges consume no input and so could be treated as ε-edges. However, ultimately we want to produce derivations and thus we do not want to lose the information on which reductions were used. So we do not remove the R-edges. We could reduce the non-determinism further by assigning ‘lookahead’ sets to the R-edges. However, this would still not always resolve all non-determinism. So we leave the addition of lookahead symbols to the final PDA in Section 6 and define the reduction incorporated automaton for Γ, RIA(Γ), to be the FA obtained from IRIA(Γ) by removing the non-terminal edges and performing the subset construction with R-edges being treated as non-empty edges. The following is the result of applying this process to IRIA(Γ2) above.
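This variant of the subset construction differs from the textbook one only in which labels are closed over: true ε-edges are absorbed, while R-edges keep their labels and survive into the result. A generic sketch (the edge representation and function names are ours, not the paper's):

```python
# Subset construction in which only true epsilon-edges are closed over;
# R-edges keep their labels and so survive, as required for RIA(Gamma).
# Representation: edges are (src, label, dst) triples.
from collections import defaultdict

EPS = "eps"

def ria(edges, start):
    out = defaultdict(set)
    for s, x, t in edges:
        out[s].add((x, t))

    def eclose(states):                 # close under epsilon-edges only
        states, work = set(states), list(states)
        while work:
            s = work.pop()
            for x, t in out[s]:
                if x == EPS and t not in states:
                    states.add(t)
                    work.append(t)
        return frozenset(states)

    start_set = eclose({start})
    seen, work, arcs = {start_set}, [start_set], set()
    while work:
        S = work.pop()
        for x in {x for s in S for x, _ in out[s] if x != EPS}:
            T = eclose({t for s in S for y, t in out[s] if y == x})
            arcs.add((S, x, T))
            if T not in seen:
                seen.add(T)
                work.append(T)
    return start_set, seen, arcs

# Tiny example: the eps-edge is absorbed, the R1-edge is not.
start_set, states, arcs = ria({(0, EPS, 1), (1, "R1", 2), (2, "b", 3)}, 0)
```

In the example the R1-edge appears in the result as an ordinary labeled transition out of the state set {0, 1}, which is exactly the treatment of R-edges as non-empty edges described above.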

As there are no edges to the start state of IRIA(Γ), the start state of RIA(Γ) is the unique state whose label includes the item S′ ::= ·S. However, RIA(Γ) can have more than one accepting state.

5. GENERALISING REGULAR LANGUAGE RECOGNITION

In this section we describe how to build a PDA that can be used to recognize sentences in a language generated by a given CFG. The method is an extension of the construction given by Aycock and Horspool [21].

5.1. Recursion call automata

We begin with a grammar Γ that we modify as follows: if there is a non-terminal A and a derivation A ⇒+ αAβ, where α ≠ ε and β ≠ ε, replace an instance of A on the RHS of a rule with a special terminal of the form A⊥ so that this derivation is no longer possible. (Of course the language generated by the grammar will be changed by this.) Repeat




this process until the grammar has no self embedding. We call such a grammar a derived grammar of Γ and denote it ΓS. In order to be able to correctly construct all the derivations of a sentence, see Section 7, we shall also require that hidden left recursion has been removed from ΓS. Thus if ΓS has been constructed so that in addition it has no hidden left recursion, then we call it a derived parser grammar of Γ. For each non-terminal A (except S and S′) add a new rule SA ::= A and consider the grammar, ΓA, which has the same rules as ΓS but with start rule SA ::= A. (We take SS to be the augmented start symbol S′.) Then construct RIA(ΓA), in which the set of state labels is disjoint from the sets of state labels of all other automata constructed during this process. Link these automata together as follows: for each edge labeled A⊥ anywhere in any of the automata, suppose that the source node is labeled h and that the target node is labeled k, remove the edge from the automaton and add a new edge, labeled p(k), from h to the start node of RIA(ΓA). Label the accepting node of RIA(ΓA) with pop. The start state and accepting states of the new automaton are the start state and accepting states, respectively, of RIA(ΓS). We shall refer to this new automaton as a recursion call automaton associated with Γ, RCA(Γ).¹ For example, consider the grammar Γ3 given by S ::= BSb | b

B ::= b | ε

We remove the self embedding by replacing the second instance of S with S⊥, resulting in the grammar (Γ3)S:

0. S′ ::= S
1. S ::= B S⊥ b
2. S ::= b
3. B ::= b
4. B ::= ε

and RIA((Γ3)S) and RCA(Γ3) are

(Note: state 4 is reachable from state 3 via a pop action.) We represent RCA(Γ) as a table T(Γ) whose rows are indexed by the state numbers of RCA(Γ), with the start state by convention being numbered 0, and whose columns are indexed by the terminal symbols of Γ and $. The entries in the table are sets of actions. If there is an edge in RCA(Γ) from state h to state k labeled with

· the terminal a, then T(Γ)(h, a) contains sk,
· Ri, then, for all x, T(Γ)(h, x) contains R(i, k),
· p(l), then, for all x, T(Γ)(h, x) contains p(l, k).

If the action pop is associated with the state h then, for all x, T(Γ)(h, x) contains the action pop. We shall use the table and graphical forms of RCA(Γ) interchangeably. A configuration of RCA(Γ) is a pair (k, S), where k is a state and S is a stack. The initial configuration is (0, ε) and an accepting configuration is a configuration of the form (k, ε), where k is an accepting state of RCA(Γ). An execution step, θ, of RCA(Γ) has an input configuration (k, (l1, …, lm)), an action from row k of T(Γ) and a resulting configuration, defined as follows.

Action     Output configuration
sh         (h, (l1, …, lm))
R(i, h)    (h, (l1, …, lm))
p(l, h)    (h, (l1, …, lm, l))
pop        (lm, (l1, …, lm−1))

We shall often refer to an execution step as a shift (or a shift on a), a reduction, a push or a pop, where that is the action associated with the step. Note that, for a given state h, all the actions sh lie in the same column of T(Γ), labeled with an input symbol, a say; so there is a unique input symbol a associated with the action sh and we may refer to this action as a shift on a. An execution path Θ is a sequence of execution steps, θ1, …, θq, such that the input configuration of θi+1 is the resulting configuration of θi, 1 ≤ i ≤ q − 1. We define the resulting configuration of Θ to be the resulting configuration of θq. The string consumed by Θ is the sequence of input symbols associated with the steps θj whose actions are shifts. An execution path through RCA(Γ) on input u is an execution path, θ1 … θq, such that the input configuration of θ1 is the initial configuration and such that u is consumed by θ1 … θq. A string u is accepted by RCA(Γ) if there is an execution path Θ through RCA(Γ) whose resulting configuration is an accepting configuration, and which consumes u. We then have the following result.

Theorem 3. For any RCA(Γ), a string, u, of terminals is in L(Γ) if and only if u is accepted by RCA(Γ).

¹ Push down automata thought of in this way are sometimes referred to as recursive transition networks.

5.2. Example

We now consider how to determine whether or not RCA(Γ) accepts a given string u = a1 … an. The basic idea is to start in the start state and construct a set U of all states which can be reached without consuming any input. We then start a new set which contains all states which can be reached from a state in the first set along an edge labeled a1, and add all the states which can be reached without consuming any further input. If at some point we encounter an edge labeled with a push action, then we need to move to the state that is the target of this edge and we need to record the state we need to move back to when we arrive at the corresponding pop action. Of course we could just record this state with the state number we move to, but the possibility of nested calls and multiple execution step choices means that we need an efficient method of recording the return states. Following Aycock and Horspool, we create a recursion call


Generalized Bottom Up Parsers With Reduced Stack Activity graph (RCG) (that is structured in a similar way to Tomita’s graph structured stack) and associate each state we reach in a traversal with a node in the RCG. Thus at each step in the algorithm we record the RCA states we move to and the associated RCG nodes as pairs (k, q) in a set U . Right recursion and hidden left recursion in S cause loops of reductions in RCA(). (Non-hidden left recursion does not generate a recursive call because it is absorbed into the appropriate state when the automaton is constructed. This is exactly analogous to LR(1) DFAs that, for the same reason, admit direct left recursion but not hidden left recursion.) We ensure that the traversal construction process terminates in such cases by only adding each pair (k, q) to U once at any given step in the process. It is also possible to have loops that ∗ consume no input if, in the grammar S , we have A⇒αA⊥ β + where α ⇒. In this case the loop involves a push to the start state of A and is the source of the problem with the algorithm given in [21]. We deal with this using an idea similar to that used by Farshi [17] in his modification of Tomita’s algorithm: we introduce loops in the RCG. Before formally describing the process in the next section, we consider the grammar 3 above. We need to record the set of RCG nodes constructed at each step in order to check whether a node with a particular label has already been constructed, because in this case the node is reused. For this we use a set P , which is set to ∅ each time an input symbol is read. We traverse RCA(3 ) with input bb as follows. We begin all traversals by creating a base node in the RCG, q0 , labeled −1 (which is not the label of any RCA state) and set U = {(0, q0 )}, where 0 is the start state of RCA(). From state 0 we can reach state 2 along an R-edge, so we add (2, q0 ) to U . 
From state 2 we can reach state 0 along an edge labeled p(4), so we create a new node in the RCG, q1 , labeled 4,

and add (0, q1 ) to U . Because (0, q1 ) ∈ U we again traverse the reduction from state 0 to state 2 and add (2, q1 ) to U . Then, because (2, q1 ) ∈ U , we traverse the push edge from state 2 to state 0. However, this time we find that we already have a node in the RCG labeled 4 which was constructed at this step so we just add an edge from this node to itself.

As the element (0, q1 ) is already in U , no new elements are added to U and this step is complete, with U = {(0, q0 ), (2, q0 ), (0, q1 ), (2, q1 )}. Next we read the first input symbol, b, and construct U = {(1, q0 ), (1, q1 )} from the edge labeled b from


state 0. We then traverse the reductions from state 1 adding the corresponding elements to U , so that U = {(1, q0 ), (1, q1 ), (3, q0 ), (3, q1 ), (2, q0 ), (2, q1 )}. The node q0 has no children so the pop action associated with state 3 creates nothing, but from the element (3, q1 ) we can perform a pop in which 4, the label of q1 , is popped off the stack and the elements (4, q0 ) and (4, q1 ) are added to U (because q0 and q1 are the children of q1 ). When we traverse the push edge associated with the element (2, q0 ) we need to create an RCG node, q2 , labeled 4 as a parent of q0 . So we have

U = {(1, q0 ), (1, q1 ), (3, q0 ), (3, q1 ), (2, q0 ), (2, q1 ), (4, q0 ), (4, q1 ), (0, q2 )}, P = {q2 }. We traverse the push edge associated with (2, q1 ) and find that there is already an RCG node, q2 , labeled 4 constructed at this step, so we just add an edge from this node to q1 . Since (0, q2 ) ∈ U nothing is added to U . We then traverse the reduction from (0, q2 ), adding (2, q2 ) to U , and then traverse the push action. There is already an RCG node labeled 4 so we just add an edge from this node to q2 to complete this step.

U = {(1, q0), (1, q1), (3, q0), (3, q1), (2, q0), (2, q1), (4, q0), (4, q1), (0, q2), (2, q2)}. We now read the final input symbol, b, and construct U = {(5, q0), (5, q1), (1, q2)}. We traverse the reductions from these elements so that U = {(5, q0), (5, q1), (1, q2), (3, q0), (3, q1), (3, q2), (2, q2)}. Processing the elements of U as we did in the previous step, we get U = {(5, q0), (5, q1), (1, q2), (3, q0), (3, q1), (3, q2), (2, q2), (4, q0), (4, q1), (4, q2), (0, q3), (2, q3)}, and the RCG is extended accordingly.

Since the next input symbol is $ and U contains (3, q0), where 3 is an accepting state of RCA(Γ3) and q0 is the base node of the call stack, bb is accepted. There is one further issue that we need to address. As described so far, the algorithm does not deal with the possibility that a new RCG edge is added to a node from which a pop action has already been applied. In such a case the pop must be applied again.
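The traversal just described can be made concrete. The sketch below is a hypothetical Python rendering of the example: the shift, reduction, push and pop actions of RCA(Γ3) are hard-coded exactly as traced above (the paper's actual tables come from the RCA construction), and the re-pop refinement discussed at the end of this section is omitted, which does not affect acceptance on these inputs.

```python
# Hypothetical sketch of the RIGLR traversal of RCA(Γ3) as traced above.
# States 0-5 and the actions below are reconstructed from the worked
# example; the re-pop refinement is omitted.

class RCGNode:
    """A node of the recursive call graph: a label plus child edges."""
    def __init__(self, label):
        self.label = label
        self.children = []

SHIFT = {(0, 'b'): 1, (4, 'b'): 5}
REDUCTIONS = {0: [2], 1: [2, 3], 5: [3]}   # R-edges: state -> target states
PUSH = {2: (4, 0)}                          # state 2 has a p(4) edge to state 0
POP_STATES = {3}
ACCEPT_STATES = {3}

def closure(U):
    """Apply reductions, pushes and pops until no new (state, node) pair arises."""
    P = {}                        # RCG nodes created at this step, by label
    work = list(U)
    while work:
        k, q = work.pop()
        new = []
        for r in REDUCTIONS.get(k, []):          # traverse R-edges
            new.append((r, q))
        if k in PUSH:                            # traverse a p(m) edge
            m, s = PUSH[k]
            node = P.get(m)
            if node is None:                     # reuse the node made this step
                node = P[m] = RCGNode(m)
            if q not in node.children:
                node.children.append(q)
            new.append((s, node))
        if k in POP_STATES and q.label != -1:    # pop: return to state q.label
            for child in q.children:
                new.append((q.label, child))
        for elem in new:                         # add each pair to U only once
            if elem not in U:
                U.add(elem)
                work.append(elem)
    return U

def recognize(string):
    base = RCGNode(-1)                           # base node of the call stack
    U = closure({(0, base)})
    for a in string:
        U = closure({(SHIFT[(k, a)], q) for (k, q) in U if (k, a) in SHIFT})
    return any(k in ACCEPT_STATES and q.label == -1 for (k, q) in U)
```

On input bb this reproduces the sets U computed step by step in the text, and the pair (3, q0), an accepting state paired with the base node, signals acceptance.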

The Computer Journal, 2005


EA Scott and AIC Johnstone

FIGURE 1. RI recognizer algorithm

The effect of a pop on a node q with label h and child p is to add the element (h, p) to U, and is independent of the element that caused the pop. Thus if a new edge is later added from q to t say, all we have to do is add (h, t) to U. So we simply record whether a pop action has been applied to a given RCG node. Edges are only added to nodes constructed at the current step, i.e. nodes in P, so we actually store in P pairs (q, F), where F is a flag which records whether a pop action has been applied to q.

5.3. The RIGLR recognizer algorithm

In Figure 1 we give the algorithm that computes the results of all traversals of an RCA for a given input. We shall assume that the RCA is given as a table T . For input a1 . . . an , we begin with an empty set U0 and an RCG that contains a single node labeled −1 which is not the label of any state in the RCA. At the end of each step in the

process we have a set Ui of RCA states that can be reached using the portion of input consumed so far, together with the nodes which are the tops of the associated call stacks. At the beginning of each step we have the set of elements which can be reached from the previous set, Ui−1, via a shift action on the input symbol, ai, which has just been read. The non-base nodes of the RCG are all labeled with state numbers from the RCA. For every node in the RCG there will be a path from this node to the base node. In practice the labels on the RCG nodes will all be states in the RCA which appear as parameters to p() edges. Sets Pi are used to record the RCG nodes constructed at step i of the algorithm. (Both Ui and Pi can be discarded at the end of step i.) Theorem 4. Given an RCA for a CFG Γ and an input string a1 . . . an, the RIGLR recognizer terminates and reports success if a1 . . . an is in L(Γ) and terminates and reports failure if a1 . . . an is not in L(Γ).


6. REDUCING THE ACTIONS OF RCA(Γ)

We can reduce the non-determinism in RCA(Γ) by adding lookahead sets to the reduction, push and pop actions. The basic idea is that, for each edge θ, we find all the symbol labeled edges that can be reached from θ along edges which do not consume any input, and use these to form the lookahead set for θ. We then only traverse θ if the next input symbol belongs to this lookahead set. For a pop action the lookahead set is calculated from the states that the pop action can return to; these are the targets of transitions labelled with special terminals A⊥ in the original RIAs. The following is RCA(Γ3), above, with lookahead sets added.
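The lookahead sets are derived from standard first and follow computations. As an illustration, here is a minimal fixpoint computation of these sets; the rules used for Γ3 (S ::= BSb | b, B ::= b | ε) are an assumption reconstructed from the reductions appearing in the worked examples, not quoted from the paper.

```python
# FIRST/FOLLOW fixpoint computation. The grammar below is the assumed
# Γ3 (S ::= BSb | b, B ::= b | ε); '' stands for ε.

GRAMMAR = {'S': [['B', 'S', 'b'], ['b']], 'B': [['b'], []]}
START = 'S'
NONTERMS = set(GRAMMAR)

def first_of_seq(seq, first):
    """FIRST of a sentential sequence, given the current FIRST sets."""
    out = set()
    for x in seq:
        if x in NONTERMS:
            out |= first[x] - {''}
            if '' not in first[x]:
                return out
        else:
            out.add(x)
            return out
    out.add('')                 # every symbol in seq can derive ε
    return out

def first_follow():
    first = {A: set() for A in NONTERMS}
    follow = {A: set() for A in NONTERMS}
    follow[START].add('$')
    changed = True
    while changed:
        changed = False
        for A, alts in GRAMMAR.items():
            for alt in alts:
                f = first_of_seq(alt, first)
                if not f <= first[A]:
                    first[A] |= f
                    changed = True
                for i, x in enumerate(alt):
                    if x in NONTERMS:
                        tail = first_of_seq(alt[i + 1:], first)
                        add = (tail - {''}) | (follow[A] if '' in tail else set())
                        if not add <= follow[x]:
                            follow[x] |= add
                            changed = True
    return first, follow
```

For the assumed rules this yields first(S) = {b}, first(B) = {b, ε}, follow(S) = {b, $} and follow(B) = {b}.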


A further modification that can reduce the size of RCA(Γ) and the number of elements constructed by the traversal algorithm is to 'pre-compile' sequences of consecutive reductions. This is the same idea as is used to improve the efficiency of standard LR parsers by reducing the stack activity associated with the left hand sides of reduction rules, and it is the basis of the improved algorithm given in [22]. We shall not use such a modification, but we give a brief outline of the approach and then discuss some of its drawbacks. The idea is to compose sequences of R-edges and possibly a preceding shift edge and/or a following push edge. The new RCA has edges labeled with triples (a, R, P) where a is either a terminal or ε, R is a sequence of reductions, and P is a sequence of push actions. For example, we could rewrite RCA(Γ3) as

When an edge is traversed all the actions that label it are carried out. There are two drawbacks that need to be considered. First, if we wish to use lookahead then we have essentially constructed a two symbol lookahead parser. For example, the table version of the above RCA would contain the three edges (b, R2, ε), (b, R3, ε) and (b, ε, p(4)) in position (0, b, b), whereas the position (0, b, $) would contain only the edge (b, R2, ε). This increases the size of the table and hence the size of a corresponding parser. More seriously, if all possible sequences of reductions are composed then the number of edges in the new RCA can be O(2^(k+1)) when the number of edges in the original RCA was O(k). This can be illustrated using the following grammar family (see [25] for details).

S ::= A^k
A ::= B | b | ε
B ::= ε

We propose that limited composition is carried out in a way that is guaranteed not to increase the size of the underlying RCA. First we re-label all the RCA edges with triples (x, R, P). This triple is of the form (a, ε, ε) if the edge was labeled a, of the form (ε, Ri, ε) if the edge was labeled Ri and of the form (ε, ε, p(k)) if the edge was labeled p(k). If there is an RCA state v where either the in-degree or the out-degree of v is one and all the edges to v have labels of the form (x, R, ε) where x ∈ T ∪ {ε}, and all the edges from v have labels of the form (ε, R′, P) where P is ε or of the form p(k), then for each path consisting of an edge labeled (x, R, ε) from u to v followed by an edge labeled (ε, R′, P) from v to w, we add an edge labeled (x, RR′, P) from u to w, and then remove the state v and all its associated edges. We call the resulting RCA a composed RCA. We shall not use the composed version of the RCA in the discussions in the rest of this paper, but we would expect at least the conservative composition described above to be included in any implementation of this approach. (If only one symbol of lookahead is to be supported then replace x by ε in the above composition construction.)
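The blowup can be made concrete by enumeration: under the family as reconstructed above, each nullable instance of A has two distinct reduction sequences (A ::= ε directly, or B ::= ε followed by A ::= B), and full composition must distinguish every combination. A small sketch (the encoding of the reduction names is ours):

```python
# Counting distinct reduction sequences that derive ε from A^k, for the
# family S ::= A^k, A ::= B | b | ε, B ::= ε, as reconstructed above.
# Each nullable A offers two choices, so the count is 2**k.

from itertools import product

def epsilon_reduction_sequences(k):
    per_a = (('A::=e',), ('B::=e', 'A::=B'))   # the two choices for one A
    return [sum(choice, ()) for choice in product(per_a, repeat=k)]
```

For k = 5 this already gives 32 distinct sequences, illustrating why composing all reduction sequences is exponential in k.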

7. CONSTRUCTING DERIVATIONS

If a parsing algorithm is to be used as part of a translation process then it is likely that the grammar has been chosen to reflect the intended semantics of the language and hence that the derivation of a sentence will be helpful in determining the semantics. Tomita's algorithm outputs a shared packed parse forest (SPPF) representation of the set of derivation trees for a given sentence, and we shall discuss a modification of our algorithm to do the same thing. The problem is that this significantly affects the efficiency of the algorithm because for any integer k ≥ 1 there is a grammar and a sentence of length n such that the associated SPPF has size O(n^k), whereas it is straightforward to check that our recognizer is worst case cubic and requires at most quadratic space. In the long term we plan to use a different approach in which our algorithm generates a grammar which in turn generates the derivation sequences of a sentence (the sequences of rules which have to be applied in right-most derivations). However, this is still work-in-progress.

7.1. Derivation trees and SPPFs

When we traverse RCA(Γ) with an input string a1 . . . an, we can construct a derivation tree as follows. At each stage in the traversal we have a (finite) sequence, uk, . . . , u1 say, of tree nodes which are the root nodes of the subtrees constructed so far. If we traverse an edge labeled a then we create a new node u labeled a and add it to the end of the sequence,




uk, . . . , u1, u. If we traverse an edge labeled Ri, where rule i is A ::= xl . . . x1 and ui is labeled xi, 1 ≤ i ≤ l, then we create a new tree node, v, labeled A, with children ul, . . . , u1; we remove these nodes from the sequence and add the node v to the end, giving uk, . . . , ul+1, v. When we traverse a push edge or perform a pop operation no additions are made to the tree. For example, if we traverse RCA(Γ3), given in Section 6, in this way with input bbb, we construct the following derivation tree.
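The tree-building discipline above is easy to sketch: a shift appends a leaf to the sequence of roots, and a reduction for a rule with l right-hand-side symbols replaces the last l roots with a new parent. The traversal below is a hand-written sequence for bbb using the rules from the example (B ::= b, S ::= b, S ::= BSb):

```python
# Sketch of derivation tree construction during an RCA traversal:
# shifts append leaves, reductions gather the last l roots under a parent.

class TreeNode:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def build(traversal):
    """Apply ('shift', a) and ('reduce', A, l) steps to the root sequence."""
    roots = []
    for step in traversal:
        if step[0] == 'shift':
            roots.append(TreeNode(step[1]))
        else:                                  # ('reduce', A, l)
            _, A, l = step
            children = roots[len(roots) - l:]  # the last l roots (none if l == 0)
            del roots[len(roots) - l:]
            roots.append(TreeNode(A, children))
    return roots

# A hand-written traversal for bbb via B::=b, S::=b, then S::=BSb:
steps = [('shift', 'b'), ('reduce', 'B', 1),
         ('shift', 'b'), ('reduce', 'S', 1),
         ('shift', 'b'), ('reduce', 'S', 3)]
```

Running `build(steps)` leaves a single root labeled S with children B, S, b, matching the derivation tree described in the text.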

However, if RCA(Γ) is non-deterministic then we need to represent multiple subtrees (even if Γ is not ambiguous). We can reduce the amount of space required to represent multiple trees by merging them, sharing nodes that have the same tree below them and packing nodes that correspond to different derivations of the same substring.

We refer to a directed graph obtained by merging and packing the nodes as a shared packed parse forest (SPPF). Grammars which contain cycles generate sentences which can have infinitely many derivation trees, resulting in a finite SPPF with cycles. For example, for the grammar S ::= SS | a | ε, the string a has SPPF

7.2. SPPF construction

In this section we discuss informally the extension of the RIGLR recognizer to include SPPF construction. We build the SPPF for the input string as the algorithm executes, recording with each pair (k, q) the nodes of the SPPF that correspond to the roots of the subtrees constructed so far for the derivation associated with (k, q). To allow us to merge two nodes in the SPPF if they correspond to the same nonterminal and derive the same portion of the input string, SPPF nodes are labeled with a pair (x, j) where aj+1 . . . ai is the yield of the subgraph whose root is (x, j), and we keep a list of the SPPF nodes constructed at the current step. Two nodes are merged only if they have the same label and are constructed at the same step. If we simply append the list of current SPPF roots to the corresponding element (k, q), we may significantly increase the number of elements constructed at each step (an example is discussed in [25]). To represent the lists of current root nodes more efficiently we could use a graph structured stack, creating a special SPPF node graph, and replace the SPPF root node sequences in the process elements with the appropriate SPPF node graph node. However, in either case the SPPF constructed may not be correct as it can contain spurious derivation trees as well as the correct ones. The problem is that when two elements with the same state are constructed their call stacks are merged to prevent the size of the RCG from exploding. However, if two different reduction sequences are associated with these different elements then the information about which sequence belongs to which element is lost, leading ultimately to the combination of the first half of one sequence with the second half of another. An example illustrating this problem is given in [25], and the same problem occurs with the reduction graph approach to parsing proposed in [22], as can be demonstrated using the string baaaa with the grammar

S ::= BT | bTB
T ::= aTa | ε
B ::= b | ε

The solution we shall use is: when a push transition is applied, label the constructed RCG edge with the corresponding sequence of SPPF nodes. We illustrate the approach using Γ3 and the input string as in Section 5.2. We follow the same element construction and add the SPPF construction. We begin by creating a base node in the RCG, q0, labeled −1 and the element (0, q0, ε). From state 0 we can reach state 2 along an R-edge corresponding to the rule B ::= ε. We create an SPPF node labeled (B, ∞) with a child labeled ε, and a pointer wB to this node. Then we attach this pointer to the element we create, (2, q0, wB). From state 2 we can reach state 0 along an edge labeled p(4), so we create a new node in the RCG, q1, labeled 4, attach the corresponding SPPF pointer to the edge (q1, q0) and create a new element (0, q1, ε). We again traverse the reduction from state 0 to state 2 and, since there is already an SPPF node labeled (B, ∞), we just create the element (2, q1, wB). From the p(4) action in state 2 we create an edge (q1, q1) labeled wB, and this step is complete.

Next we read the first input symbol, b, and construct an SPPF node labeled (b, 0) and a pointer w1 to this node. Then we create the elements (1, q0, w1) and (1, q1, w1). From state 1 there are two reductions, B ::= b and S ::= b, thus we create two SPPF nodes, labeled (B, 0) and (S, 0), with child (b, 0), and pointers w2, w3 to these nodes. We then create the elements (2, q0, w2), (3, q0, w3), (2, q1, w2), and (3, q1, w3). Applying the pop action to (3, q1, w3) we traverse the edges from q1, collecting the label wB, and create a pointer, w4, to the nodes pointed to by w3 and wB. Then we create the elements (4, q0, w4) and (4, q1, w4). (The pointer w4 records the fact that the SPPF nodes labeled (S, 0) and (B, ∞) are the root nodes of the parse forest associated with these elements.)

Applying the push action from state 2 and the reduction action from state 0 as in the previous step, we complete this step, and then read the final input symbol. This creates an SPPF node labeled (b, 1) and a pointer w5, and the elements (4, q0, w4) and (4, q1, w4) require the creation of a pointer, w6, to the nodes pointed to by w5 and w4. So at this point we have the following SPPF and RCG

with U = {(5, q0 , w6 ), (5, q1 , w6 ), (1, q2 , w5 )}. Traversing the reduction S::=BSb from state 5 we take the three rightmost children of w6 (in this case all the children) and create an SPPF node labeled (S, 0) with these three nodes as children. The remaining actions add to the RCG as described in Section 5.2 and create additional SPPF nodes labeled (B, 1) and (S, 1). The final SPPF is the subgraph containing all the nodes and edges reachable from the root node (the node labeled (S, 0) constructed at the final step of the algorithm). In this case the final SPPF is
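The merging rule used throughout this example, namely reuse a node only if it has the same (symbol, start position) label and was constructed at the current step, and pack further derivations under it, amounts to a per-step memo table. A minimal sketch (class and method names are ours, not the paper's):

```python
# Sketch of per-step SPPF node sharing and packing: nodes carrying the
# same (symbol, start) label are shared within a step, and alternative
# derivations are packed as extra child families.

class SPPFNode:
    def __init__(self, symbol, start):
        self.label = (symbol, start)
        self.families = []            # packed alternatives: lists of children

class SPPFBuilder:
    def __init__(self):
        self.current = {}             # the per-step table of constructed nodes

    def new_step(self):
        """Cleared each time an input symbol is read."""
        self.current = {}

    def node(self, symbol, start, children):
        """Return the shared node for (symbol, start) at this step,
        packing `children` as a new derivation if it is not present."""
        n = self.current.get((symbol, start))
        if n is None:
            n = self.current[(symbol, start)] = SPPFNode(symbol, start)
        if children not in n.families:
            n.families.append(children)
        return n
```

Two requests for the same label within a step return the same object; after `new_step()` a fresh node is built, mirroring the "same label, same step" condition in the text.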

7.3. The RIGLR parsing algorithm

In Figures 2 and 3, we give formally the RIGLR parser which constructs the SPPF as it traverses the RCA. We assume that


the RCA has been constructed from a derived parser grammar for Γ and that it is given in the form of a table, T. We use sets N and W, respectively, to hold the SPPF nodes and pointer nodes which have been constructed at the current step of the algorithm. (The use of N reduces the searching required to find SPPF nodes that can be packed and means that we only have to label SPPF nodes with the index of the start position of the substring yielded.) We use a set W′ to record the pointer nodes which label edges in the RCG as these pointers are maintained throughout the algorithm. In addition to their rôles in the recognizer algorithm, the shift and reduce actions simply add appropriate nodes to the SPPF and create corresponding pointer nodes. The push and pop actions do not modify the SPPF in the sense that they do not create (or delete) any nodes or edges. However, as described in the above example, a push action creates a pointer to a sequence of SPPF nodes and labels the edge it creates in the RCG with this pointer. When a pop action is carried out the sequence of nodes which is pointed to by the label on the corresponding RCG edge, that is the nodes which were 'cached' by the earlier push action, are recovered and added to the nodes associated with the element being processed to create a new element in Ui. Thus pop actions can create new nodes that point to SPPF nodes. Furthermore, when a push action adds an edge to an existing node t in the RCG, any pop associated with this node must be applied down the new edge and new pointer nodes need to be created from the label of the edge from each child of t. Thus the elements of Pi are of the form (q, F) where, instead of simply a flag, F is the set of the labels on edges from q down which pop actions have so far been applied.
Note that, in order to ensure that the parser terminates in the case where the grammar contains a hidden cycle of the form A ⇒+ αA ⇒∗ A (which generates infinitely many derivations), we need to use the loops in the RCG. Thus we require that RCA(Γ) be constructed from a derived parser grammar ΓS, which has had the hidden left recursion removed (see Section 5.1).

8. EXPERIMENTAL RESULTS

In this paper we expend considerable energy on demonstrating the generality and correctness of our algorithm because the GLR field has historically been rather weak in this area. (Incomplete and even erroneous algorithms have been published and formal proofs of correctness are rare.) However, performance evaluation is, of course, critical to the acceptance of a new algorithm in this area. Asymptotic results for space and time performance are of little help because we know that the asymptotically best general parser has such large constants of proportionality that it is impractical for strings of normal length. Asymptotic and typical actual performance of the RIGLR algorithm will be demonstrated by comparing it with a Tomita-style algorithm, and with the traditional LR algorithm using longest match to resolve nondeterminisms. We shall use a highly artificial grammar to trigger worst case asymptotic behaviour, and publicly




FIGURE 2. RI parsing algorithm (part 1)




FIGURE 3. RI parsing algorithm (part 2)

available grammars for three widely used programming languages to look at 'real world' applications. Our parsers are characterised by: (i) the size of the parser, or rather its associated table; (ii) the size of the parse time structures, corresponding to maximum stack depth in an LR parser, GSS size in a GLR parser and RCG size in the RI parser; (iii) for the general parsers, the cost of constructing the parse time structures, expressed in terms of edge visits and the total cardinalities of certain sets. We also give some run times for our table generators, that is, we give the time taken to make the parser.

8.1. The need for lookahead

LR-family parsers (including Tomita style GLR parsers) can employ a variety of parse tables based on a variety of automata such as LR(0), SLR(1), LALR(1) and LR(1) tables. The choice of table affects the performance of GLR algorithms in a, perhaps, non-obvious way. For example, for an LR(1) table the reduction in the number of active paths in the constructed graph structured stack (GSS) must be traded with the potentially increased number of states created in the graph. As formally written above, the RI algorithm uses automata with no lookahead, corresponding to the use of LR(0) automata in standard LR and GLR parsing. There are a variety of opportunities for incorporating lookahead into the RI algorithm and/or its tables, some of which have been discussed in Section 6. To avoid further expanding the size of

the parser, we choose to retain the table construction outlined above and modify the runtime behaviour of the algorithm by directly testing against grammar-derived lookahead sets, resulting in an algorithm that corresponds most closely to an SLR(1) GLR parser. Both the RCA and the LR DFA can be seen as automata that can be 'traversed' with a given input string. For an SLR(1) DFA a traversal is carried out by reading the current input symbol, a say, and either moving to the next state down an arrow labeled a and reading the next input symbol (a shift action), or finding a reduction, A ::= x1 . . . xm say, such that a ∈ follow(A), tracing back through the DFA along a path labeled xm, . . . , x1 and then moving to the next state down the arrow labeled A (a reduce action). For the RCA the shift actions correspond directly to the LR shift actions and a push action can be seen as a move to an intermediate state that, in the LR DFA, would have been incorporated with the current state by the subset construction. A move along an RCA reduction arrow corresponds to an LR DFA reduce action without the need for tracing back. A pop action also corresponds to an LR DFA reduce action but the next state(s) must be found from the associated RCG. For the experimental results reported here, we modified the RI algorithm so that reduction and pop transitions are only carried out if the next input symbol is in the follow set (see [24]) of the corresponding non-terminal and a push transition is only carried out if the next input symbol is in the first set of the corresponding non-terminal, or if this first set contains ε and the current symbol is in the follow set of the non-terminal.
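The runtime filter just described reduces to two predicates over precomputed first and follow sets; the sets below are written out by hand for the assumed Γ3 rules (S ::= BSb | b, B ::= b | ε), which are a reconstruction rather than a quotation from the paper.

```python
# Sketch of the SLR(1)-style runtime lookahead filter described above.
# '' stands for ε; the sets are hand-computed for the assumed Γ3 rules.

FIRST = {'S': {'b'}, 'B': {'b', ''}}
FOLLOW = {'S': {'b', '$'}, 'B': {'b'}}

def do_reduce(A, next_sym):
    """Carry out a reduction (or pop) for non-terminal A only if the
    next input symbol is in follow(A)."""
    return next_sym in FOLLOW[A]

def do_push(A, next_sym):
    """Carry out a push for A only if the next symbol is in first(A),
    or first(A) contains ε and the next symbol is in follow(A)."""
    return (next_sym in FIRST[A] - {''} or
            ('' in FIRST[A] and next_sym in FOLLOW[A]))
```

For example, a reduction for B is only attempted when the next symbol is b, and a push for S is never attempted on $, since first(S) does not contain ε.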


8.2. Experimental grammars and test strings

To exercise the asymptotic behaviour of the algorithms, we use the grammar Γ5, whose rules are S ::= SSS | SS | a. This is a grammar on which Tomita's algorithm is known to have poor (quartic) performance. The language of Γ5 is {a^n | n > 0}: a '1-dimensional' language allowing simple graphical display of the asymptotic trend. To demonstrate the performance of the RI algorithm on non-pathological cases, we use grammars for ANSI-C, ISO Pascal and IBM VS-COBOL. The ANSI-C grammar is essentially that given in [26] with the optional rules expanded out. The grammar for Pascal is the grammar from the ISO-7185 standard converted from EBNF to BNF using largely right recursive expansion of closure operators. Some relatively minor grammar transformations result in a grammar which is LR(1) apart from the shift/reduce conflict arising from the if-then-else ambiguity. The IBM VS-COBOL grammar was provided by Steven Klusener and Ralf Lämmel, who describe its development in [27]. An online version of the grammar is available at www.cs.vu.nl/grammarware/browsable/vs-cobol-ii/. The version used here was derived from an ASF+SDF source file from which we extracted the context-free rules. The instances of self embedding in a grammar can be automatically detected by our gtb tool. The tool constructs a grammar dependency graph (GDG) whose nodes are nonterminals. There is an edge from A to B in the GDG if there is a grammar rule A ::= αBβ. The edges are labeled to record whether there is a non-trivial left context and/or a non-trivial right context for B in a rule for A, and also whether there are contexts which derive ε. From the cycles in this graph we can determine the instances of left and right recursion and proper self embedding in the grammar. gtb performs strongly connected component analysis on the GDG, identifying the maximal cycles.
We used this to determine an instance of a non-terminal A which could be terminalized (replaced with the special terminal A⊥) to remove a cycle, repeating the process until no self embedding remained. We did not attempt to minimize the number of terminalizations required, which is in general an NP-complete problem. In fact, in order to limit the size of the RCAs, when the grammars were reduced to produce the RIAs both self embedding and right recursion were removed. In addition, for ANSI-C one further instance of a non-terminal in the expression part of the grammar was terminalized to reduce the size of its RIA from approximately 6 × 10^6 to 1.5 × 10^6 states. Thus the resulting RCAs contain more recursive calls than are strictly necessary for the correctness of the algorithm. Part of the attractiveness of the RI algorithm is that it can be 'tuned', trading execution time for parser size by terminalizing more non-terminals in the reduced grammar. For Pascal, 10 non-terminals were terminalized in a total of 21 instances. For C, 14 non-terminals were terminalized
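The GDG construction described above is straightforward to reproduce: one node per nonterminal, an edge A → B for each occurrence of B in a rule for A, and the nonterminals taking part in recursion are exactly those lying on a cycle. A sketch using the rules we have been assuming for Γ3 (the context labels and ε-deriving contexts maintained by gtb are omitted):

```python
# Sketch of the grammar dependency graph (GDG) and cycle detection.
# The assumed Γ3 rules (S ::= BSb | b, B ::= b | ε) are used as the test
# grammar; '' stands for ε and never appears as a symbol.

def gdg(grammar):
    """Edge A -> B for each grammar rule A ::= αBβ."""
    nonterms = set(grammar)
    edges = {A: set() for A in grammar}
    for A, alts in grammar.items():
        for alt in alts:
            for x in alt:
                if x in nonterms:
                    edges[A].add(x)
    return edges

def on_cycle(edges):
    """Nonterminals that can reach themselves via at least one edge,
    i.e. those taking part in some recursion."""
    def reachable(start):
        seen, work = set(), list(edges[start])
        while work:
            x = work.pop()
            if x not in seen:
                seen.add(x)
                work.extend(edges[x])
        return seen
    return {A for A in edges if A in reachable(A)}

G3 = {'S': [['B', 'S', 'b'], ['b']], 'B': [['b'], []]}
```

Here S lies on a cycle (it occurs in its own rule with non-trivial contexts, i.e. it is self embedding) while B does not.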

TABLE 1. The size of parse tables

Grammar      RN SLR(1) table    RCA table
ANSI-C       1,797 × 158        1,517,390 × 86
ISO Pascal   434 × 286          288,117 × 78
Cobol        2,692 × 1,028      5,142,874 × 356

in a total of 46 instances. For Cobol, 19 non-terminals were terminalized in a total of 28 instances. The input string for ANSI-C is a 4291 token program that performs Quine McCluskey minimization. The input string for Pascal is a 4425 token program that allows tree data structures to be constructed and viewed. The input string for Cobol is a 2197 token program. In each case the input is existing code which has been written for other purposes. All strings were pre-tokenised to remove the overhead of lexical analysis and represented internally as arrays of 16-bit integers, one per token. We compare the RI algorithm with our GLR algorithm, the RNGLR algorithm [18, 19], running on right nulled SLR(1) tables. This algorithm constructs a GSS which represents the multiple stacks generated by non-determinism in the grammars, and maintains a queue R of (reduction, node) pairs. The reduction in a pair is applied by tracing back along the paths in the GSS from the corresponding node. We measure the efficiency of the RNGLR algorithm in terms of the number of edges visited in the course of carrying out the reductions, and the total number of (reduction, edge) pairs processed (added to R). We measure the space required by the RNGLR algorithm in terms of the numbers of nodes and edges in the GSS. For the RI algorithm we measure the efficiency of the algorithm in terms of the number of RCG edges visited in the execution of pop actions and the total number of elements added to the Ui. We measure the space required by the RI algorithm in terms of the number of nodes and edges in the RCG. Figures are derived using the recognizer versions of RIGLR and RNGLR parsing implemented in our grammar analysis and visualisation tools gtb and PAT [28].

8.3. The size of parsers

Table 1 gives the size of the underlying parse tables as the number of rows by the number of columns. The number of columns in an SLR(1) table is the size of the grammar’s alphabet: for the RCA table, the number of columns is just the number of terminals in the original grammar plus one (to allow actions on $, the end of string symbol). Parse table size is clearly the Achilles’ heel of reduction incorporated techniques; the necessary multiplying out in the IRIA construction can generate multiple subtrees corresponding to the same non-terminal as the grammar is ‘unrolled’. However, the rather intimidating number of states shown for all of the RCA tables should be viewed as an upper bound. Firstly, the isomorphic subtrees present significant opportunities for space saving, but more importantly for



practical engineering of RI parsers, we can mark extra instances of non-terminals as distinguished. The effect is to trade speed in the parser for space in the table. Interestingly, if we 'terminalize' all but the topmost instance of each non-terminal, we get a parser whose stack activity mimics that of a recursive descent parser, except that left recursion is allowable. A theoretical discussion of the effect of introducing additional terminalizations to grammars with long chains in the GDG can be found in [29] and, as we mentioned in Section 8.2, adding one terminalization in the middle of a chain in our C grammar dramatically reduces the RIA size. Of course, in an age when even portable computers have hundreds of megabytes of main memory, a table containing 5 million states is not the bar to practicality that it might have been when LR parsing was first investigated. Our prototype implementation, when running under Windows-XP on a 1.6GHz Pentium-M processor with 384MB of memory, constructed the table for ANSI-C in 105s, ISO-Pascal in 9.5s and VS-COBOL in 511s. This implementation is written in ANSI-C (not C++) and was built with the Borland C++ compiler version 5.1 using the command bcc32 -A -w. The -A flag disables Borland extensions and the -w flag switches on all warning messages. This is a conservative configuration. We might expect some improvement in these figures if we used the bcc32i compiler which attempts aggressive optimisations.

TABLE 2. The size of parse time structures

             ANSI-C             ISO Pascal         Cobol
GSS e × n    28,604 × 28,479    21,258 × 21,043    13,512 × 12,057
RCG e × n    3,083 × 2,615      944 × 943          1,415 × 592

TABLE 3. The execution cost of parse time structures

                                  ANSI-C    ISO Pascal    Cobol
GSS edge visits                   4,052     5,665         3,581
RCG edge visits                   3,336     1,178         1,742
total actions added to R          24,261    16,833        10,814
total processes added to the Ui   44,772    23,407        28,802

8.4. Asymptotic behaviour

FIGURE 4. Edge visits for RIGLR and RNGLR algorithms using Γ5 with input strings of length 10–200 tokens

Figure 4 shows the behaviour of RIGLR and RNGLR algorithms on Γ5, expressed in terms of the number of edge visits in the RCG and GSS respectively. RNGLR displays quartic time complexity and RIGLR displays cubic complexity on this highly ambiguous grammar. The actual runtimes for these experiments show that the memory requirements for the RNGLR algorithm's GSS start to generate paging activity for long strings. As a result, runtimes for RNGLR are artificially extended. However,

even before the onset of swapping, the real performance advantage of RIGLR is very significant: for strings of 80 tokens RNGLR requires some 73 CPU s whilst RIGLR requires only 0.08 s. Indeed RIGLR can process a string of 200 tokens in 1.01 CPU s: our RNGLR implementation exceeded available memory for such long strings but required 1,665 CPU s to parse a 100 token string. These timings all count user CPU seconds when running on the 1.6GHz Pentium-M system specified above.

8.5. Average behaviour on real grammars

We now turn to experiments with realistic grammars. Table 2 shows order-of-magnitude reductions in space requirements for the RCG of the RIGLR algorithm versus the GSS of the RNGLR parser. Table 3 presents results analogous to those of Figure 4 for our programming language grammars, and shows that over these program-sized input strings the parse times (expressed as edge visits within the parse-time structures) are practical. We have also characterised the runtime in terms of the activity within the R and Ui sets of the RIGLR algorithm. The cardinality of these sets might be thought to impact space requirements given the large number of actions and processes added, but in fact the sets are cleared at each input token, and the maximum cardinalities are very small.

8.6. Comparison with deterministic parsing

In the introduction we noted the attractiveness of GLR style algorithms to programming language designers because they display deterministic behaviour on LR grammars, but with the added advantages of generality and the ability to return all parses of an ambiguous grammar. To illustrate the comparability, we performed an LR parse using our Pascal grammar. This contains only the shift/reduce conflict arising from the if-then-else ambiguity, allowing us to use the standard longest match strategy, which results in a deterministic LR(1) parser that selects the shift rather than the reduce action at a point of conflict. Using this strategy the LR algorithm with the Pascal LR(1) grammar accepts the input string after performing 16,685 reductions, and with a maximum stack depth of 129. The equivalent GLR parse requires 16,833 reductions, showing that little extra activity

The Computer Journal, 2005


EA Scott and AIC Johnstone

is generated by the (very limited) non-determinism in the grammar.
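The longest-match treatment of the conflict described above can be sketched as follows. The grammar fragment and table cell are hypothetical toy values, not the Pascal tables; the sketch only shows the policy of choosing a shift over a reduce at a point of conflict.

```python
# Illustrative sketch of longest-match conflict resolution: when an LR
# table cell holds both a shift and a reduce action, prefer the shift.
# The cell contents below are hypothetical, not the authors' tables.
def resolve(actions):
    """Pick one action from a conflict cell, preferring shift."""
    shifts = [a for a in actions if a[0] == "shift"]
    return shifts[0] if shifts else actions[0]

# A dangling-else style conflict cell as a generator might record it:
cell = [("reduce", "Stmt ::= if Expr then Stmt"), ("shift", 7)]
chosen = resolve(cell)   # the shift wins, matching each else to the nearest if
```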

9. PROOFS

In this section we give the proofs of the results quoted in the preceding sections. We begin by making some elementary observations about IRIA(Γ). An edge labeled ε has target node with a label of the form Y ::= ·γ and source node with a label of the form X ::= α · Yβ. An edge labeled Ri has source node with a label of the form Y ::= γ ·, where Y ::= γ is rule i, and target node with a label of the form X ::= αY · β. An edge labeled x ∈ N ∪ T has source node with a label of the form X ::= α · xβ and target node with a label of the form X ::= αx · β. All entry edges to a node labeled Y ::= ·γ are labeled ε. All exit edges from a node labeled Y ::= γ · are R-edges. For any a ∈ T, a node labeled X ::= α · aβ has exactly one exit edge, which is labeled a, and a node labeled X ::= αa · β has exactly one entry edge, which is labeled a. For Y ∈ N, a node labeled X ::= α · Yβ has one exit edge labeled Y. All its other exit edges are labeled ε and have targets with labels of the form Y ::= ·γ. A node labeled X ::= αY · β has one entry edge labeled Y. All its other entry edges are R-edges and have source nodes with labels of the form Y ::= γ ·. All edges whose source node is the start node are primary edges or primary ε-edges, and no edge has the start node as its target. A path in IRIA(Γ) is primary if all its ε- and R-edges are primary edges. A path in IRIA(Γ) is non-reducing if none of its edges are R-edges.

9.1. Proof of Theorem 1

In this section we shall show that if α is a non-trivial sentential form in Γ, i.e. if S′ ⇒+ α, then IRIA(Γ) accepts α.

Lemma 1. If M is labeled Y ::= x1 . . . xn · β, where Y ≠ S′, then M has an ancestor K labeled X ::= γ · Yδ, there is a path labeled (ε, x1, . . . , xn) from K to M, and there is an edge labeled Y from K to a node labeled X ::= γY · δ. Conversely, if K is a node labeled X ::= γ · Yδ and if Y ::= x1 . . . xn β is a production rule then there is an edge labeled Y from K to a node labeled X ::= γY · δ, and there is a path labeled (ε, x1, . . . , xn) from K to a node labeled Y ::= x1 . . . xn · β.

Proof. By induction on n. If n = 0 then the result follows directly from Step 2 of the construction algorithm. For n ≥ 1 let α′ = (x1, . . . , xn−1). By Step 2, M has an ancestor, L, labeled Y ::= α′ · xn β. By induction, L has an ancestor K of the required form and there is a path labeled (ε, x1, . . . , xn−1) from K to L. Thus there is a path labeled (ε, x1, . . . , xn) from K to M. Conversely, if there exists a node K labeled X ::= γ · Yδ then, by induction, there exists a path labeled (ε, x1, . . . , xn−1) to a node L labeled Y ::= α′ · xn β, and by Step 2 there exists an edge labeled xn from L to a node labeled Y ::= α′ xn · β.

Theorem 1. If α is a non-trivial sentential form of a CFG Γ, i.e. if S′ ⇒+ α, then α is accepted by IRIA(Γ).

Proof. Suppose that S′ ⇒m α. We prove the result by induction on m. If m = 1 then α = S, which is clearly accepted by IRIA(Γ). So suppose that m ≥ 2 and that

S′ ⇒m−1 γXδ ⇒ γβδ

and, by induction, γXδ is accepted by IRIA(Γ). Thus there is a path θ through IRIA(Γ) such that θ̄ = γXδ. Write θ = θ1 f1 θ2 where θ̄1 = γ, θ̄2 = δ and lbl(f1) = X. So f1 has source node, M say, labeled Y ::= τ · Xσ and corresponding target node, L say, labeled Y ::= τX · σ. Since X is a non-terminal, by Lemma 1 there is a path, φ say, labeled (ε, y1, . . . , ym) from M to a node, K, labeled X ::= y1 . . . ym · and, by Step 5 of the IRIA construction procedure, there is an edge, f2 say, labeled Ri from K to L. Thus θ1 φ f2 θ2 is a path through IRIA(Γ) such that θ̄1 φ̄ θ̄2 = γβδ, as required.

9.2. Proof of Theorem 2

Lemma 2. Let Γ be an augmented grammar and let IRIA(Γ) be the associated FA constructed as above. If K and L are nodes labeled X ::= α1 · α2 and Z ::= β1 · β2 respectively, and if there is a non-reducing path θ from K to L, then for some σ, X ⇒∗ α1 θ̄ β2 σ.

Proof. The proof is by induction on the number, n, of ε-edges in θ. Suppose that lbl(θ) = (x1, . . . , xk). If n = 0 then, since θ is non-reducing, for all i, xi ∈ N ∪ T and θ̄ = x1 . . . xk. Thus the edges have source nodes with labels of the form Xi ::= τi · xi µi and target nodes labeled Xi ::= τi xi · µi. Since the target node for one edge is the source node for the next, we have X = X1 = X2 = . . . = Xk = Z, α2 = x1 µ1, µi = xi+1 µi+1, and µk = β2. Thus X ::= α1 θ̄ β2, because α2 = x1 µ1 = x1 x2 µ2 = . . . = x1 . . . xk µk = θ̄ β2.

Now suppose that n ≥ 1 and that the result is true for paths with at most (n − 1) ε-edges. Write θ = θ′ e φ, where e is labeled ε and φ has no ε-edges. So θ̄ = θ̄′ φ̄. We have that e has source and target nodes with labels of the form W ::= τ · Yµ and Y ::= ·γ, respectively. Since φ has no ε-edges, the case n = 0 above shows that Y ::= φ̄ β2. Since θ′ has (n − 1) ε-edges, the induction hypothesis gives that, for some σ,

X ⇒∗ α1 θ̄′ Yµσ

and so X ⇒∗ α1 θ̄′ φ̄ β2 µσ = α1 θ̄ β2 σ′, where σ′ = µσ.

Lemma 3. (i) Each node K, except the start node, is the target of either a primary edge or of a primary ε-edge. All other edges whose target is K are non-primary ε- or R-edges. (ii) If there is an ε-edge from M to K which is not a primary ε-edge then there is a non-reducing primary path, φ, from K to M. (iii) There is a non-reducing primary path from the start node to any node in IRIA(Γ). (iv) Any non-reducing path from the start node to a node K includes any non-reducing primary path from the start node to K. (Hence the non-reducing primary path to each K is unique.) (v) If θ and ψ are non-reducing primary paths from nodes L and M, say, to the same node, K say, then either θ is a final segment of ψ or ψ is a final segment of θ, i.e. either ψ = φθ or θ = φψ, for some path φ.


Proof. (i) This follows from the fact that each node apart from the start node is constructed at Step 2 of the algorithm as the target of a primary edge. (ii) This follows directly from the rules for constructing non-primary edges in Step 2(iii) and Step 2(iv). (iii) If K is the start node then it has the empty path to itself. If K is constructed in Step 2 of the algorithm then it is the target node of a primary edge or a primary ε-edge from the start symbol. Now suppose that all nodes constructed before K have the required property, and that when K is constructed it is the target node of a non-reduction primary edge, e, which has a node M as its source node. Since M must have been constructed before K, there is a non-reducing primary path φ from the start node to M. Hence φe is the required path. (iv) Let θ be any non-reducing path from the start node to K and let φ be a non-reducing primary path from the start node to K. We prove the result by induction on the length of θ. If θ has length 0 then K is the start node, which is not the target of any edge. Hence φ must also be the empty path. Now suppose that θ = θ′ e, where e is an edge with source node, M say, and target node K. Then K cannot be the start node, φ cannot be empty and hence φ = φ′ f, where f is the (unique) non-reduction primary edge whose target is K. If e = f then φ′ is a non-reducing primary path to M and so, by induction, θ′ contains φ′ and hence θ contains φ′e = φ. If f ≠ e then we must have that e is non-primary, and hence e is labeled ε and, by (ii), there is a non-reducing primary path, ψ say, from K to M. So φψ is a non-reducing primary path from the start node to M and, since θ′ is shorter than θ, by induction φψ is contained in θ′ and hence in θ, as required. (v) By (iii) we have that there are non-reducing primary paths, θ′ and ψ′ say, from the start node to L and M respectively. Thus θ′θ and ψ′ψ are both non-reducing primary paths to K and hence, by (iv), θ′θ = ψ′ψ. So for some φ, either θ = φψ or ψ = φθ, as required.

Lemma 4. Let K be a node labeled Y ::= γ · and let L be the node labeled Y ::= ·γ such that there is a (unique non-reducing primary) path ψ from L to K labeled (x1, . . . , xn), where γ = x1 . . . xn (here ψ is the empty path if γ = ε). Let θ be a non-empty path from K to a node M labeled U ::= γ1 Z · γ2, such that lbl(θ) contains only Rs, and let Q be the parent of M labeled U ::= γ1 · Zγ2. Let φ2 be the path, labeled (y1, . . . , ym), where γ1 = y1 . . . ym, from the node, T, labeled U ::= ·γ1 Zγ2 to Q. Then either (i) θ contains the primary R-edge from K, or (ii) there is a non-reducing primary path, φ, from L to T and Z ⇒∗ τY for some τ.

Proof. We prove the result by induction on the number, k, of R-edges in θ. If k = 1 then Z = Y and θ = r1, so looking at Step 5 in the construction algorithm we see that, in order for r1 to have been constructed, there must be an ε-edge, e1, from Q to L. If e1 is the primary ε-edge to L then r1 is the primary R-edge from K and (i) holds. If e1 is not the primary ε-edge then, from Step 2(iii(a)) of the construction algorithm, we see that there is a non-reducing primary path, φ1, from L to Q. By Lemma 3(v) either φ1 is a final segment of φ2 or φ2 is a final segment of φ1. In the former case, since φ2 does not contain any ε-edges, we must have U = Y and φ2 = φ1. Thus, in either case, for some (possibly empty) path φ3 we have φ1 = φ3 φ2 and we can take φ = φ3.

Now suppose that k ≥ 2 and write θ = θ′ r1. Since the target node, P, of θ′ is the source node of an R-edge, it must have a label of the form Z ::= µ1 V ·, where V is a non-terminal. From Step 5 in the construction algorithm, there is


an ε-edge, e1, from Q to the ancestor, R, of P labeled Z ::= ·µ1 V. By induction, either θ′ contains the primary R-edge from K, or there is a non-reducing primary path, φ4 say, from L to R, and V ⇒∗ τY.

In the latter case

Z ⇒ µ1 V ⇒∗ µ1 τY

and we need to show that there is a non-reducing primary path from L to T. If e1 is not the primary ε-edge then r1 is not a primary R-edge and so, from the case k = 1, we have that there is a non-reducing primary path, φ3 say, from R to T and hence φ4 φ3 is a primary non-reducing path from L to T. Now suppose that e1 is the primary ε-edge to R. If φ4 is empty then R = L, K = P and r1 is the primary R-edge from K, so (i) holds. So suppose that φ4 is non-empty, and thus, since φ4 is non-reducing and primary, that φ4 = φ′4 e1. By Lemma 3(v), either φ′4 is a final segment of φ2 or φ2 is a final segment of φ′4. In the former case, since φ2 does not contain any ε-edges, we must have that Y = U and T = L, so we can take φ to be the empty path. In the latter case we have φ′4 = φ3 φ2 and we can take φ = φ3.

Theorem 2. Let Γ be a CFG that does not contain any self embedding. If θ is accepted by IRIA(Γ) then S′ ⇒∗ θ̄.

Proof. All paths through IRIA(Γ) begin either with the edge, f0 say, labeled S whose target is the node, K1, labeled S′ ::= S ·, or with a primary ε-edge. All paths through IRIA(Γ), apart from the path f0 of length 1, end with a primary R-edge, because these are the only edges whose target is K1. Let θ be a path from the start node of IRIA(Γ) to the accept node, so that θ̄ is accepted by IRIA(Γ). We show that if Γ has no proper self embedding then θ̄ is a sentential form in Γ. The proof is by induction on the number of R-edges in θ. If θ has no R-edges then we must have θ = f0 and θ̄ = S is a sentential form. Now suppose that θ has one or more R-edges. Since the only path which starts with f0 is the path which just contains f0, θ must begin with an ε-edge. Write

θ = θ1 e1 φ r1 θ2

where θ1 is non-reducing, e1 is an ε-edge, φ has no R- or ε-edges and r1 is an R-edge. So θ̄ = θ̄1 φ̄ θ̄2 = θ̄1 γ θ̄2, and r1 is the first R-edge in θ.

We show that S′ ⇒∗ θ̄1 γ θ̄2. Let the source node, H, of e1 be labeled X ::= α · Yβ, the target node, L, of e1 be labeled Y ::= ·γ, the source node, K, of r1 be labeled Y ::= γ ·, and the target node, M, of r1 be labeled Z ::= τY · ρ. In particular, φ̄ = γ. From Step 5 of the construction algorithm we see that there is a node, P say, labeled Z ::= τ · Yρ, an edge, f1, labeled Y from P to M, and an ε-edge, e2, from P to L. There is also an edge, f2, labeled Y from H to a node, T say, labeled X ::= αY · β and an R-edge, r say, from K to T.

If e1 = e2 then θ1 f1 θ2 is a path through IRIA(Γ) and by induction S′ ⇒∗ θ̄1 f̄1 θ̄2. Since f̄1 = Y we have

S′ ⇒∗ θ̄1 Y θ̄2 ⇒ θ̄1 γ θ̄2.

So we shall suppose that e1 ≠ e2, and hence that r ≠ r1. If φ is empty then e2 is the unique edge whose target is L, so e1 = e2. Thus we also suppose that φ is non-empty. If e2 is the primary ε-edge, then by Lemma 3(iv), the path θ1 e1 must contain e2, and, since e1 ≠ e2, θ1 must contain e2. We write θ1 = θ5 e2 θ6. Since θ5 has target node P, there is a path θ5 f1 θ2, and by induction we have that

S′ ⇒∗ θ̄5 Y θ̄2 ⇒ θ̄5 γ θ̄2

and by Lemma 2, we have

Y ⇒∗ θ̄6 Yβσ, for some σ.

Since Γ does not contain self embedding, either θ̄6 = ε or βσ = ε. In the first case we have θ̄5 γ θ̄2 = θ̄1 γ θ̄2 and in the second case we have

S′ ⇒∗ θ̄5 θ̄6 Y θ̄2 ⇒ θ̄5 θ̄6 γ θ̄2 = θ̄1 γ θ̄2.

Now suppose that e2 is not primary, and hence that r1 is not the primary R-edge. Write

θ2 = θ3 f4 θ4,

where θ3 = r2 . . . rk consists only of R-edges and either f4 θ4 is empty or f4 is not a reduction edge. So θ̄2 = f̄4 θ̄4. We let Q be the target of rk, so Q has a label of the form W ::= νV · µ,


we let R be the ancestor of Q labeled W ::= ·νVµ, and we let φ3 be the path from R to Q labeled νV.


Since r1 θ3 is non-empty, by Lemma 4, either r1 θ3 contains the primary R-edge, r0 say, from K, or there is a primary non-reducing path from L to R and, for some δ, V ⇒∗ δY.

Since r1 ≠ r0, if r1 θ3 contains r0 then r1 θ3 = r1 . . . ri−1 r0 ri+1 . . . rk, where i ≥ 2 and r0 has source node K. Thus there is a path θ1 e1 φ r0 ri+1 . . . rk f4 θ4, with fewer R-edges than θ, and, by induction,

S′ ⇒∗ θ̄1 γ f̄4 θ̄4 = θ̄1 γ θ̄2.

If r1 θ3 does not contain r0 then, by Lemma 4, there is a primary non-reducing path, φ1 say, from L to R, V ⇒∗ δY and, by Lemma 2, there is a string σ such that Y ⇒∗ φ̄1 νVµσ. Thus

Y ⇒∗ φ̄1 νVµσ ⇒∗ φ̄1 νδYµσ ⇒ φ̄1 νδγµσ,

and by induction

S′ ⇒∗ θ̄1 φ̄1 φ̄3 f̄4 θ̄4 = θ̄1 φ̄1 νV θ̄2 ⇒∗ θ̄1 φ̄1 νδY θ̄2 ⇒ θ̄1 φ̄1 νδγ θ̄2.

Since Γ has no self embedding, either φ̄1 = ν = δ = ε or µ = σ = ε. In the first case, as required,

S′ ⇒∗ θ̄1 φ̄1 νδγ θ̄2 = θ̄1 γ θ̄2.

In the second case, since µ = ε and f4 is not an R-edge, we must have that f4 θ4 is empty and that Q is the accepting state of IRIA(Γ). Then, since φ3 is a non-reducing path, we must have that R is the start state of IRIA(Γ). However, since the edge e1 exists and there is no non-empty path to the start state, this is a contradiction and the second case cannot arise. This completes the proof.

9.3. Proof of Theorem 3

Lemma 5. Suppose that ΓS is a derived grammar of Γ in which some of the non-terminals A on the right hand sides of rules have been replaced with special non-terminals A⊥. We let ΓA be the component of ΓS whose start rule is SA ::= A. Then, for all non-terminals A and B such that B is reachable in ΓA, B ⇒m u if and only if B ⇒∗ΓA v0 B1⊥ v1 . . . Bn⊥ vn, where Bi ⇒qi wi, 1 ≤ i ≤ n, qi < m and u = v0 w1 v1 . . . wn vn.

Proof. If m = 1 then B ::= u is a rule in Γ and, since u does not contain any non-terminals and B is reachable, B ::= u is a rule in ΓA. Thus B ⇒ΓA u as required.

Now suppose that the result is true for derivations of length less than m and that B ⇒ x1 . . . xk ⇒m−1 u1 . . . uk, where xi ⇒qi ui, u1 . . . uk = u and q1 + . . . + qk = m − 1. By definition of ΓS we have B ⇒ΓA y1 . . . yk, where yi = xi or yi = xi⊥. By induction, if yi = xi, since xi is reachable in ΓA, xi ⇒∗ΓA vi0 Bi1⊥ vi1 . . . Bini⊥ vini, where Bij ⇒qij wij with qij ≤ qi < m, 1 ≤ j ≤ ni, and ui = vi0 wi1 vi1 . . . wini vini. Thus each yi is either a special terminal of the required form, or derives in ΓA a string of the required form; thus B derives in ΓA a string of the required form.

Conversely, suppose that B ⇒∗ΓA v0 B1⊥ v1 . . . Bn⊥ vn, where Bi ⇒qi wi, 1 ≤ i ≤ n, and u = v0 w1 v1 . . . wn vn. It is trivial to check, by induction on the derivation length, that B ⇒∗ v0 B1 v1 . . . Bn vn and so B ⇒m u where m > q1 + . . . + qn.

Theorem 3. For any RCA(Γ), a string, u, of terminals is in L(Γ) if and only if u is accepted by RCA(Γ).

Proof. Suppose that ΓS is a derived grammar of Γ which is obtained by replacing some instances of non-terminals on the right hand sides of rules with special terminals in such a way that ΓS has no self embedding, and that ΓA is the grammar obtained from ΓS by adding a new start rule SA ::= A. Suppose that TA is the state in RCA(Γ) that corresponds to the start state of RIA(ΓA).

First we show that if u ∈ L(Γ) then u is accepted by RCA(Γ). Suppose that A ⇒m u = b1 . . . bm, where A = S or A⊥ appears in ΓS. We show by induction on m that, for any configuration (TA, S), there is an execution path Π which consumes u, whose input configuration is (TA, S) and whose resulting configuration is (EA, S), where EA is an accepting state of RIA(ΓA). Then, taking A = S, we have the required result.

If m = 1, by Lemma 5 A ⇒∗ΓA u and so there is a path through IRIA(ΓA) whose label is of the form (b1, . . . , bm, R) which starts at the start state and ends at an accepting state. By the subset construction, this path exists in RIA(ΓA) and hence in RCA(Γ). Thus we can take Π to be the sequence of steps whose actions are shifts on b1, . . . , bm and then the reduction whose rule is A ::= u.

Now we suppose that the result is true for derivations of length less than m. By Lemma 5 we have that A ⇒∗ΓA v0 B1⊥ v1 . . . Bn⊥ vn, where Bi ⇒qi wi and qi < m.
Thus there is a path through RIA(ΓA) whose edges are labeled with the elements of v0, B1⊥, v1, . . . , Bn⊥, vn and whose final node EA is an accepting state of RIA(ΓA). We let hi and li be the source and target states on this path of the edge labeled Bi⊥. Then there are edges labeled with the elements of v0 from TA to the node in RCA(Γ) corresponding to h1, and an edge from this node, labeled p(l1) say, to TB1. We take Π1 to be the sequence of steps whose actions are shifts on each of the elements of v0 and then p(l1). The resulting configuration of Π1 is (TB1, (S, l1)). By induction there is an execution path Ψ1 with this input configuration which consumes w1 and whose resulting configuration is (EB1, (S, l1)). We let θ1 be the step whose action is pop, so that Π1 Ψ1 θ1 consumes v0 w1 and results in the configuration (l1, S). Continuing in this


fashion we construct an execution path Π1 Ψ1 θ1 . . . Πn Ψn θn which consumes v0 w1 . . . vn−1 wn, has input configuration (TA, S) and results in the configuration (ln, S). There is a path through RIA(ΓA) from ln to EA whose edges are labeled with the elements of vn and so, letting Πn+1 be the sequence of steps whose actions are shifts on the elements of vn, we can take Π = Π1 Ψ1 θ1 . . . Πn Ψn θn Πn+1.

Now suppose that there is an execution path Π in RCA(Γ) whose input configuration is (TA, S), whose final configuration is (EA, S), where EA corresponds to an accepting state of RIA(ΓA), and which consumes u. We prove, by induction on the number of steps in Π whose action is a push, that A ⇒∗ u.

If there are no push steps then all the steps in Π correspond to moves to states along edges in RCA(Γ). Since the initial state is TA, these steps also correspond to moves along edges in RIA(ΓA). Since the final move is to EA we have A ⇒∗ΓA u and hence, by Lemma 5, A ⇒∗ u.

Now suppose that Π = Π1 θ Π′, where Π1 contains no steps whose action is a push (and hence no step whose action is pop) and the action associated with θ is p(l1, TB1). So the input and resulting configurations of θ are (h1, S) and (TB1, (S, l1)) (nothing has been pushed onto the stack by actions associated with steps in Π1) and there is an edge (h1, l1) in RIA(ΓA) labeled B1⊥. Since the resulting configuration of Π is (EA, S) we must have Π′ = Ψ1 ψ Π′1, where the action associated with ψ is pop and its input configuration is of the form (EB1, (S, l1)), where EB1 corresponds to an accepting state of RIA(ΓB1). We have that Π1 consumes v0 say, and there is a path in RIA(ΓA) from the start state to l1 whose edges are labeled with reductions and the elements of v0. We then have that u = v0 w1 u1, where Ψ1 consumes w1 and Π′1 consumes u1. By induction we have that B1 ⇒∗ w1. We now apply the above reasoning again to Π′1, writing it in the form Π2 θ1 Π′2, where Π2 does not contain any steps whose action is a push.
Since steps involving push and pop actions occur in nested pairs and the input and output configurations of Π2 have the same stack, Π2 will also contain no steps whose action is a pop. Then, as above, we have a path from the start state of RIA(ΓA) labeled v0 B1⊥ v1 B2⊥, where B2 ⇒∗ w2 and u = v0 w1 v1 w2 u2, where u2 is consumed by Π′2. Eventually we see that there is a path through RIA(ΓA) that ends at the state corresponding to EA, labeled with reductions and v0 B1⊥ v1 . . . Bn⊥ vn, where Bi ⇒∗ wi, 1 ≤ i ≤ n, and u = v0 w1 v1 . . . wn vn. Then, by Lemma 5, A ⇒∗ u so, taking A = S, we have the required result.
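As a concrete illustration of the terminalization used in Lemma 5 and Theorem 3, the following worked example shows a derived grammar; the grammar itself is ours, chosen for illustration, and is not taken from the paper.

```latex
% Hypothetical example of a derived grammar (our illustration, not the paper's).
% Let \Gamma have the self-embedding rule set
%   S ::= a S b \mid c .
% Replacing the embedded occurrence of S by the special symbol S^{\perp}
% gives the derived grammar \Gamma_S:
%   S ::= a S^{\perp} b \mid c ,
% which contains no self embedding. The \Gamma-derivation
%   S \Rightarrow a S b \Rightarrow a c b \qquad (\text{length } m = 2)
% factors exactly as Lemma 5 states:
%   S \Rightarrow_{\Gamma_S} a S^{\perp} b , \qquad S \stackrel{1}{\Rightarrow} c ,
% so u = acb = v_0 w_1 v_1 with v_0 = a, w_1 = c, v_1 = b and q_1 = 1 < m.
```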

9.4. Proof of Theorem 4

Lemma 6. The RIGLR parser terminates for all possible inputs.

Proof. We shall say that a node t is at level i in the RCG if (t, F) ∈ Pi. By construction there is only one element of the form (t, F) ∈ Pi where t has label l. Thus the RCG has at most N level-i nodes, where N is the number of RCA states. An edge (p, q) is only added to the RCG if there is not already an edge from p to q, thus the RCG has size at most O(n²). The sets Ui contain elements with labels of the form (h, q), where h is an RCA state and q is a node in the RCG whose level is at most i. Thus Ui contains at most O(i²) elements. The while loop in the RIGLR parser algorithm only executes if the set A is non-empty, and it removes an element from A each time. The set A is finite at the start of each step, and an element (h, q) is only added to A when the element is created. An element (h, q) is only created if there is not already an element with this label in Ui, thus there are only finitely many elements which can be created. So the while loop only executes finitely many times. All other aspects of the RIGLR parser algorithm clearly terminate.

Lemma 7. If RCA(Γ) accepts a1 . . . an then the RIGLR parser algorithm terminates and reports success on table T(Γ) with input a1 . . . an.

Proof. We shall show that if an execution path θ1 . . . θi in T(Γ) has input configuration the initial configuration, resulting configuration (k, h1, . . . , ht) and consumes a1 . . . aj, then there is an element (k, q) ∈ Uj and there is a path in the RCG. This is sufficient because, if u is accepted by T(Γ), then there is an execution path which results in (h∞, ε), where h∞ is an accepting state, so (h∞, q0) ∈ Un and the RIGLR parser algorithm will report success.

If i = 0, so the execution path is empty, the resulting configuration is the initial configuration, j = 0, and U0 contains (0, q0), where q0 is labeled −1, as required. We now suppose that the resulting configuration of θ1 . . . θi−1 is (h, h′1, . . . , h′s). Then, by induction, there exist a node q′ and a path in the RCG and an element (h, q′). Furthermore, either θi is a shift on aj, and so θ1 . . . θi−1 consumes a1 . . . aj−1 and (h, q′) ∈ Uj−1, or θi is not a shift, and so (h, q′) ∈ Uj. We make the following four observations: that all elements in Uj have also been elements of the set A at some point in the jth step of the RIGLR parser algorithm; that when an element is removed from A it is processed; that the set A is empty when the jth step is complete; and that the jth step of the algorithm terminates. Thus every element in Uj is processed during the jth step of the RIGLR parser algorithm. Hence (h, q′) will be processed during the (finite) (j − 1)st or jth step of the RIGLR parser algorithm. Suppose first that θi is a shift, so that there is an entry sk ∈ T(Γ)(h, aj) and the resulting configuration of θ1 . . . θi is (k, h′1, . . . , h′s). Thus we can take q = q′, and step j − 1


of Algorithm 5.3 will result in the construction of an element (k, q′) which will be put in Uj, as required. If θi is a reduction then again s = t and h′r = hr, 1 ≤ r ≤ t, and (h, q′) ∈ Uj, so Algorithm 5.3 will cause (k, q′) to be added to Uj as required. If θi is a pop then s = t + 1, k = h′s and h′p = hp, 1 ≤ p ≤ t, so we have a path in the RCG. If the edge (q′, q) already exists when (h, q′) ∈ Uj is processed then (k, q) will be added to Uj as required. Otherwise, processing (h, q′) ensures that (q′, 1) is added to Pi, and (q′, q) must have been created as a result of processing some (k′, q) ∈ Uj and considering some action p(k, l′) ∈ T(Γ)(k′, aj+1). When this action is considered, since (q′, 1) ∈ Pi, (k, q) will be added to Uj when (k′, q) is processed, if it is not already there. If θi is a push then it is p(ht, k) ∈ T(Γ)(h, aj+1), and t = s + 1, h′p = hp, 1 ≤ p ≤ s. Since (h, q′) ∈ Uj, when this element is processed (k, q) will be added to Uj as required.

When an RCG node q ∈ Pi is created, an element v = (TA, q), for some A, is also added to Ui. We call v the element created with q. In particular, (0, q0) is the element created with q0.

Lemma 8. Suppose that we are given an RCA T(Γ) and an input string a1 . . . an. Let q be a node in the RCG, let Pi and Ui, 0 ≤ i ≤ n, be the sets as constructed by applying Algorithm 5.3, and let v = (TA, q) ∈ Ui be the element created with q. Let a path from q to q0 in the RCG be given. Then, if u = (k, q) ∈ Um, there is an execution path Π in T(Γ) whose input configuration is (TA, h1, . . . , ht), whose resulting configuration is (k, h1, . . . , ht) and which consumes ai+1 . . . am.

Proof. We prove the result by induction on the order in which the elements (k, q) are processed. We note that if j < i then all the elements in Uj are processed before any of the elements in Ui. We also note that if (h, p) is associated with the construction of q then any (k, q) must be processed after (h, p). If u = (0, q0) then u = v and the result is trivially true. So suppose that the result is true for elements processed before u. Suppose also that u was constructed while the element w = (h, p) was being processed and when an action act ∈ T(Γ)(h, b) had been selected (here b = am or am+1). If act is sk or R(j, k) then p = q and, by induction, there is a path Π′ whose input configuration is (TA, h1, . . . , ht) and whose resulting configuration is (h, h1, . . . , ht). If act is sk then w ∈ Um−1, Π′ consumes ai+1 . . . am−1,


sk ∈ T(Γ)(h, am) and we can take Π = Π′ θ, where θ is the step whose input configuration is (h, h1, . . . , ht) and whose associated action is sk. If act is R(j, k) then w ∈ Um, Π′ consumes ai+1 . . . am, R(j, k) ∈ T(Γ)(h, am+1) and we can take Π = Π′ θ, where θ is the step whose input configuration is (h, h1, . . . , ht) and whose associated action is R(j, k).

If act is pop then there is a path in RCA(Γ). The edge (q, p) must have been created when processing an element y = (l, q) ∈ Uj, where i ≤ j ≤ m, with an action push(k, TB) ∈ T(Γ)(l, aj+1). Then, by induction, there is an execution path Π1 whose input configuration is (TA, h1, . . . , ht), whose resulting configuration is (l, h1, . . . , ht) and which consumes ai+1 . . . aj, and also an execution path Π2 whose input configuration is (TB, h1, . . . , ht, k), whose resulting configuration is (h, h1, . . . , ht, k) and which consumes aj+1 . . . am. Let θ1 be the execution step whose input configuration is (l, h1, . . . , ht) and whose action is push(k, TB) ∈ T(Γ)(l, aj+1), and let θ2 be the step whose input configuration is (h, h1, . . . , ht, k) and whose action is pop. Then we can take Π = Π1 θ1 Π2 θ2.

Finally, suppose that act is p(l, TB) ∈ T(Γ)(h, am+1). Either q is created at this point, TB = TA, l = ht and u = (TA, q) = v, in which case we can take Π to be the empty path, or l = k, the push action creates an edge from a node q′ labeled k to q, and there is an element (k′, q′) ∈ Um such that pop ∈ T(Γ)(k′, am+1). For the edge (q′, q) to be created when processing (h, p) we must have p = q and so, by induction, there is an execution path Π1 whose input configuration is (TA, h1, . . . , ht), whose resulting configuration is (h, h1, . . . , ht) and which consumes ai+1 . . . am. Let θ1 be the step whose action is p(k, TB). Also by induction there is an execution path Π2 whose input configuration is (TB, h1, . . . , ht, k), whose resulting configuration is (k′, h1, . . . , ht, k) and which consumes the empty string. Letting θ2 be the step whose action is pop, we can take Π = Π1 θ1 Π2 θ2.

Theorem 4. Given an RCA for a CFG Γ and an input string a1 . . . an, Algorithm 5.3 terminates and reports success if a1 . . . an is in L(Γ), and terminates and reports failure if a1 . . . an is not in L(Γ).

Proof. If a1 . . . an is in the language generated by Γ then, by Theorem 3 and Lemma 7, Algorithm 5.3 terminates and reports success on input a1 . . . an. If Algorithm 5.3 terminates and reports success on input a1 . . . an then there must be an element (k, q0) ∈ Un, where k is an accepting state of the RCA. Since q0 ∈ P0, where 0 is


the start state of the RCA, by Lemma 8 it is possible to move, in the RCA, from 0 and the empty stack to k and the empty stack by consuming input a1 . . . an. Thus the RCA accepts the string a1 . . . an, and hence it must be in the language generated by Γ. If Algorithm 5.3 does not terminate and report success then, by Lemma 6, it terminates and reports failure. Thus, if a1 . . . an is not in the language generated by Γ, the algorithm will terminate and report failure, as required.
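The finiteness argument used in Lemma 6 is the standard termination argument for worklist algorithms, and can be sketched in isolation. The sketch below is not the authors' Algorithm 5.3: `successors` is a hypothetical transition function over a finite label set, and the two collections stand in schematically for the sets Ui and A.

```python
# Schematic worklist loop mirroring the shape of the termination argument
# in Lemma 6 (NOT the authors' Algorithm 5.3). Each label is created at
# most once, so with finitely many labels the loop must drain.
def drain(start, successors):
    seen = {start}          # plays the role of U_i: at most one entry per label
    work = [start]          # plays the role of the set A
    while work:             # the loop runs only while A is non-empty...
        h = work.pop()      # ...and removes an element each iteration
        for k in successors(h):
            if k not in seen:   # an element is only created if not already present
                seen.add(k)
                work.append(k)
    return seen
```

Since every label enters `work` at most once and the label set is finite, the loop executes finitely many times; Lemma 6 applies the same argument to the elements (h, q) of Ui.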

although they contain high levels of ambiguity and are natural candidates for RI based parsing.

10.

REFERENCES

CONCLUSIONS

We have described Reduction Incorporated parsing which provides high performance parsing but at potentially high table cost. Our approach is closely related to Aycock and Horspool’s parser but the RI parser terminates and is correct for all context-free grammars and we give a proof of correctness. The Achilles’ heel of both approaches is table size. It is probably fair to say that for many applications this style of parsing is on the cusp of practicallity today, but as long as Moore’s Law continues to yield a doubling of memory densities roughly every 18 months we must be cautious: the 1.5 million states required for an RIGLR table for our ANSI-C parser becomes exponentially less costly over time. An interesting feature of this style of parsing is that by electing to ‘over-terminalize’ one can significantly reduce the table size but at the cost of generating more parse time stack activity. We exploited this behaviour in our ANSIC parser by introducing one extra terminalization which reduced the tables size from around 6 to 1.5 million states. If we terminalize all non-terminal instances then we achieve a parser whose stack activity mimics that of a recursive descent parser on LL(1) grammars but which is completely general. We shall return to the subject of engineering these parsers by trading table space for stack activity in a future publication. Some work-in-progress results relating to RI parsing have been presented in conference papers. An earlier, less efficient version of the algorithm was introduced in [23] where it was called a generalised regular parser. The performance of RI parsing is compared with Farshi-style Tomita parsing and the RNGLR parser in [29] and [28]. One application area that may benefit from RI parsing is that of context-free searching of biological sequence data. 
At present, almost all sequence database searches use regular expresssions or some equivalent notation, yet it is well known that biological sequences contain context-free (and even context-sensitive) features. Using a regular search to find context-free features will in general yield false positives because a regular search will not be able to correctly match bracket-like structures. Context-free search tools do exist – perhaps the best developed is Searles’ GenLang tool, which provides context-free grammars for tRNA and some protein families. This line of development is presently moribund – perhaps because practicing biologists find context-free grammars (as opposed to regular expressions) hard to use but also because the performance of the parsing-based search engine is inadequate. The grammars used are not large

ACKNOWLEDGEMENTS

We are grateful to Ralf Lämmel for making available his reverse-engineered IBM VS-COBOL grammar (at http://www.cs.vu.nl/grammars/vs-cobol-ii/), and also to the anonymous referees for their helpful comments, suggestions and corrections.

REFERENCES

[1] Rytter, V. (1995) Context-free recognition via shortest paths computation: a version of Valiant’s algorithm. Theor. Comput. Sci., 143(2), 343–352.
[2] Knuth, D.E. (1965) On the translation of languages from left to right. Inform. Control, 8(6), 607–639.
[3] Stroustrup, B. (1994) The Design and Evolution of C++. Addison-Wesley Publishing Company, Reading, MA.
[4] Dodd, C. and Maslov, V. BtYacc home page. http://www.siber.com/btyacc. Last accessed June 2005.
[5] Parr, T.J. (1996) Language Translation Using PCCTS and C++. Automata Publishing Company, San Jose, CA.
[6] Parr, T. ANTLR home page. http://www.antlr.org. Last accessed June 2005.
[7] JAVACC project home page. https://javacc.dev.java.net. Last accessed June 2005.
[8] Breuer, P.T. and Bowen, J.P. (1995) A PREttier Compiler-Compiler: generating higher-order parsers in C. Software Pract. Exper., 25(11), 1263–1297.
[9] Earley, J. (1970) An efficient context-free parsing algorithm. Commun. ACM, 13(2), 94–102.
[10] Tomita, M. (1986) Efficient Parsing for Natural Language. Kluwer Academic Publishers, Boston.
[11] Grune, D. and Jacobs, C. (1990) Parsing Techniques: A Practical Guide. Ellis Horwood, Chichester, England. Available at http://www.cs.vu.nl/~dick/PTAPG.html.
[12] Lang, B. (1974) Deterministic techniques for efficient non-deterministic parsers. In Proc. 2nd Colloquium on Automata, Languages and Programming, University of Saarbrücken, July 29–August 2, LNCS, 14, 255–269.
[13] Billot, S. and Lang, B. (1989) The structure of shared forests in ambiguous parsing. In Proc. 27th Conf. of the Association for Computational Linguistics, Vancouver, British Columbia, Canada, June 26–29, 143–151. Association for Computational Linguistics.
[14] GNU Bison home page. http://www.gnu.org/software/bison. Last accessed June 2005.
[15] Eggert, P. On Bison’s general parser. http://compilers.iecc.com/comparch/article/03-01-042. Last accessed June 2005.
[16] van den Brand, M.G.J., Heering, J., Klint, P. and Olivier, P.A. (2002) Compiling language definitions: the ASF+SDF compiler. ACM Trans. Progr. Lang. Sys., 24(4), 334–368.
[17] Nozohoor-Farshi, R. (1991) GLR parsing for ε-grammars. In Tomita, M. (ed.) Generalized LR Parsing, 60–75. Kluwer Academic Publishers, Netherlands.
[18] Scott, E., Johnstone, A. and Hussain, S. Tomita-style generalised LR parsers. Updated version. Technical Report TR-00-12. Computer Science Department, Royal Holloway, University of London, London.

The Computer Journal, 2005

[19] Johnstone, A. and Scott, E. (2002) Generalised reduction modified LR parsing for domain specific language prototyping. In Proc. 35th Annual Hawaii Int. Conf. on System Sciences (HICSS02), Big Island, Hawaii, January 7–10, 282–291. IEEE Computer Society, New Jersey.
[20] Aho, A.V. and Ullman, J.D. (1972) The Theory of Parsing, Translation and Compiling. Series in Automatic Computation, Vol. 1: Parsing. Prentice-Hall, New Jersey.
[21] Aycock, J. and Horspool, N. (1999) Faster generalised LR parsing. In Proc. 8th Int. Conf. on Compiler Construction, CC’99, Amsterdam, The Netherlands, March 20–28, LNCS, 1575, 32–46. Springer-Verlag.
[22] Aycock, J., Horspool, R.N., Janousek, J. and Melichar, B. (2001) Even faster generalised LR parsing. Acta Inform., 37(8), 633–651.
[23] Johnstone, A. and Scott, E. (2003) Generalised regular parsers. In Hedin, G. (ed.) 12th Int. Conf. on Compiler Construction, CC’03, Warsaw, Poland, April 5–13, LNCS, 2622, 232–246.
[24] Aho, A.V., Sethi, R. and Ullman, J.D. (1986) Compilers: Principles, Techniques and Tools. Addison-Wesley, Reading, MA.


[25] Scott, E. and Johnstone, A. (2002) Table based parsers with reduced stack activity. Technical Report TR-02-08. Computer Science Department, Royal Holloway, University of London, London.
[26] Kernighan, B.W. and Ritchie, D.M. (1988) The C Programming Language (2nd edn). Prentice Hall, Englewood Cliffs, NJ.
[27] Lämmel, R. and Verhoef, C. (2001) Semi-automatic grammar recovery. Software Pract. Exper., 31(15), 1395–1438.
[28] Johnstone, A., Scott, E. and Economopoulos, G. (2004) The grammar tool box: a case study comparing GLR parsing algorithms. In Hedin, G. and Van Wick, E. (eds) Proc. 4th Workshop on Language Descriptions, Tools and Applications, LDTA 2004, Barcelona, Spain, April 3, Electronic Notes in Theoretical Computer Science, 110, 97–113.
[29] Johnstone, A., Scott, E. and Economopoulos, G. (2004) Generalised parsing: some costs. In Duesterwald, E. (ed.) 13th Int. Conf. on Compiler Construction, CC’04, Barcelona, Spain, March 29–April 2, LNCS, 2985, 89–103. Springer-Verlag, Berlin.

Note: [25] is a technical report version of this paper.

