A combinator parser for Earley's algorithm

Ian D. Peake ([email protected]) and Sean Seefried ([email protected])

September 14, 2004

Abstract

We show a combinator parser which corresponds to a standard variation of Earley's algorithm. Combinator parsers are recursive descent parsers written using functional programming techniques, offering access to useful tradeoffs between expressiveness and parse time efficiency, and a declarative approach to language processor specification enabling customisation and optimisation. Although it has been suggested [11] [13] that certain combinator parsers resemble general tabular parsing algorithms like Earley's, to our knowledge no presentation of a combinator parser algorithm has ever been claimed to be equivalent to Earley's algorithm. Johnson [11] shows how a combinator parser regime may be made general, that is, capable of parsing any context free language, including languages which are ambiguous or embody left recursive productions. However, Johnson's combinator parsers are not completely equivalent to Earley's algorithm: unlike in Earley's algorithm, processing of the input is (effectively) not restricted to a single input position at a time. The existence of a combinator parser for Earley's algorithm is useful both for explaining the relationship between top down parsers and tabular parsers, and for providing the basis for an efficient method of combining combinator parsers with conventional context dependent lexical analysers.
1 Introduction
Combinator parsers have proved both useful and interesting. Combinator parsers are useful because they support many forms of declarative parser definition at least as well as parser generators do. Combinator parsers also offer more flexible and fine grained control over parsing strategy than might otherwise be available using a typical parser generator. Combinator parsers are structured so that the syntax of a given language may be defined separately from the strategy used to recognise languages denoted by the specifications, if desired. Programmers may define language syntax in a functional style closely resembling (E)BNF, based on one of several related underlying parsing libraries. It is possible
not only to switch between standard parsing strategies easily, but also to create parsers that involve hybrid parsing strategies that could not necessarily be achieved via standard parsing algorithms. Historically, parsers have typically either been developed using off the shelf parser generators such as Yacc, which tend to implement a single, inflexible parser generation strategy, or written by hand as custom recursive descent parsers (e.g. in the implementation of FermaT [1] [18]); in either case, undesirable expense is often involved.

Combinator parsers are interesting because they epitomise the use of functional programming techniques to solve complex problems, in this case the problem of separating the concerns of syntax definition from parsing strategy without requiring a preprocessing phase. Another reason why combinator parsers are interesting is that they seem to resemble earlier, well understood techniques. It has been suggested [11] [13] that some kinds of combinator parsers resemble tabular parsers. For example, specific combinator parser variations have been shown to be efficient, flexible alternatives to tabular algorithms [11]. Although we are aware of informal claims relating combinator parsers to Earley's algorithm, we have not seen any formally presented demonstration of equivalence between Earley's algorithm and a combinator parsing regime, nor any detailed comparison of the inner workings of both.

In this paper we first introduce some essential background related to Earley's algorithm and combinator parsing (section 2). Next we show and analyse Johnson's memoized, continuation passing combinator parsing algorithm (section 3), comparing it to Earley's algorithm. While there is striking correspondence, there is also a significant difference: Johnson's parsers tend to "push ahead" in the input before all (possibly relevant) computation is performed at a given input position. This difference means that information about what tokens may be legitimately expected at a given input position is impossible to obtain, which has implications for lexical analysis. Next, we propose a variation — "synchronised" Earley-style continuation passing combinator parsing — that addresses this issue (section 4) and describe experiences and evaluation of the variation (section 5). Finally the significance of this work is summarised (section 6).
2 Background

2.1 Deterministic, non-general parsers
Non-general parser generators such as Yacc [12], Bison [2] and ANTLR [3] are based on restricted parser generation and parsing algorithms which do not necessarily accept every context free grammar. However, it is usually possible for an expert to transform a grammar for a given language into a form which is acceptable to a non-general generator. Non-general parsers have the advantage that they run in O(n) time, where n is the size of the input in tokens, and typically consume very modest amounts of space. Historically, language implementors have put up with the limitations of such
parser generators in deference to the realities of resource limitations. However, recently it has become apparent that general parsing algorithms, such as Earley’s algorithm, may be more appropriate in the light of the ongoing effects of Moore’s law.
2.2 Earley's algorithm
Earley's algorithm is an efficient general tabular parsing algorithm. Its worst case performance is O(n^3); however, for most grammars acceptable to non-general parsers, its performance is O(n), albeit with worse constant factors. Earley's parsing algorithm involves constructing, for each input position, a model of productions which might possibly match text surrounding that position. Other general parsing algorithms, such as GLR parsing, also have properties which make them attractive for use in modern language processor platforms such as ASF+SDF [17] [4]; we nevertheless select Earley's algorithm for consideration here.

2.2.1 Details
Since our work relates closely to Earley's algorithm, we sketch a version of the algorithm here as background. (Strictly, this is a typical "zero lookahead" variation of Earley's original algorithm.) First we present an informal description of the algorithm, then give a worked example showing the algorithm's operation on a simple grammar and input sentence.

For each token in the input a state set is constructed. A state set is analogous to a parser state in an LR parser, consisting of a set of marked productions (states), each of which partially matches a portion of the input, including the current token. Each state has a mark according to what has been matched (taking into account the current token). Initially, a special state set consisting of a single, special goal state with the start nonterminal S on its RHS is used to prime the process, and the scanning phase is skipped. Each given state set is constructed by processing each state after it is added to the state set, according to one of three processes: prediction, scanning and completion.

Scanning, for a given state set and a state s with a terminal symbol to the right of its mark, adds states to the next state set where the next input token matches the terminal symbol immediately to the right of the mark in the RHS of s. The next state set must include a suitably updated version of the same state (with the mark moved over the token).

Prediction, for a given state s with a nonterminal ("NT") symbol to the right of its mark, adds extra states to the state set analogously to the calculation of a closure in an LR parser. That is, all productions P1..PN of that NT are added as states to the state set, with their marks set to the left of their RHSs, and with their completion marks pointing back to the current state set.
S -> S a
S -> b

Figure 1: Grammar for a simple language

b a −|

Figure 2: Input sentence in simple language

Completion, for every state whose mark is at the right end of its RHS, adds the originating state (pointed to by the completion mark), suitably updated so that its mark is moved to the right of the now completed NT. The algorithm completes either when there are no more tokens or when the state set is empty. For every instance of the distinguished symbol completed, a separate, alternative valid parse is considered complete. More useful information about Earley's algorithm can be found in [6], [8] and [5].
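To make the three operations concrete, the following is a minimal sketch of this zero lookahead recogniser in Scheme, the language used for the parsers later in the paper. It is our own illustrative code rather than a standard presentation: grammars are association lists from a nonterminal to its right hand sides, a state is a list (lhs rhs dot origin), acceptance is checked directly on the goal state in place of the explicit end marker −|, and epsilon productions are assumed absent (otherwise completion would also have to examine the state set under construction).

(define grammar '((S (S a) (b))))                  ; S -> S a | b, as in figure 1

(define (state lhs rhs dot origin) (list lhs rhs dot origin))
(define (lhs s) (car s))
(define (rhs s) (cadr s))
(define (dot s) (caddr s))
(define (origin s) (cadddr s))
(define (after-dot s) (list-tail (rhs s) (dot s))) ; symbols right of the mark
(define (advance s) (state (lhs s) (rhs s) (+ 1 (dot s)) (origin s)))

(define (earley-recognise grammar start input)
  (let* ((n (length input))
         (sets (make-vector (+ n 1) '())))         ; one state set per position
    (define (add! i s)                             ; add s to set i; #t if new
      (if (member s (vector-ref sets i))
          #f
          (begin (vector-set! sets i (cons s (vector-ref sets i))) #t)))
    (define (close! i)                             ; prediction + completion
      (let loop ((todo (vector-ref sets i)))
        (if (pair? todo)
            (let ((s (car todo)) (new '()))
              (define (try! st) (if (add! i st) (set! new (cons st new))))
              (let ((next (after-dot s)))
                (cond
                  ((null? next)                    ; completion: advance parents
                   (for-each (lambda (p)
                               (if (and (pair? (after-dot p))
                                        (eq? (car (after-dot p)) (lhs s)))
                                   (try! (advance p))))
                             (vector-ref sets (origin s))))
                  ((assq (car next) grammar)       ; prediction: add productions
                   (for-each (lambda (r) (try! (state (car next) r 0 i)))
                             (cdr (assq (car next) grammar))))))
              (loop (append (cdr todo) new))))))
    (define (scan! i token)                        ; scanning into set i+1
      (for-each (lambda (s)
                  (let ((next (after-dot s)))
                    (if (and (pair? next) (eq? (car next) token))
                        (add! (+ i 1) (advance s)))))
                (vector-ref sets i)))
    (add! 0 (state 'goal (list start) 0 0))        ; prime with the goal state
    (let loop ((i 0) (toks input))
      (close! i)
      (if (pair? toks)
          (begin (scan! i (car toks)) (loop (+ i 1) (cdr toks)))))
    ;; accept iff the goal state is complete in the final state set
    (if (member (state 'goal (list start) 1 0) (vector-ref sets n)) #t #f)))

For the grammar and input of figures 1 and 2, (earley-recognise grammar 'S '(b a)) evaluates to #t, constructing exactly the state sets of the trace in figure 3 below.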
2.2.2 Example trace of Earley's algorithm
To illustrate this version of Earley's algorithm, here is a simple example which nevertheless suffices to illustrate the ability of the algorithm to cope with "exotic" languages, in this case a language containing left recursive definitions. Figure 1 gives a grammar for a simple language. Figure 2 gives a simple sentence in the language. Figure 3 shows a representation of the execution of the algorithm on this sentence, with notation as follows. The first column gives the state set (corresponding to an index into the input; that is, state set 0 corresponds to the first token in the input). The second column gives the states in the state set, in order of addition. The third column gives the origin state set for each state. The fourth column is a unique numbering of each state, and is for commentary purposes only.

An explanation of the workings of selected parts of the trace is as follows. In state set 0, the initial state 1 is created automatically. The predictor is applicable to state 1, since S is to the right of the dot. Therefore all productions with NT S are added to the state set (states 2, 3). The predictor is also applicable to state 2, but in this version of Earley's algorithm the effect of applying the predictor would be to redundantly add all productions with NT S to the state set. The scanner is applicable to state 3, resulting in the addition of state 4 to state set 1. In state set 1, the completer is applicable to state 4, so the states in its origin state set (0) associated with the NT S are added to state set 1, with the dot moved to the right of the S. The algorithm proceeds in the same vein until state set 3 is constructed, at which point the algorithm accepts the input string.
State set   States       Origin state set   Number
S0          φ → .S−|     0                  1
            S → .Sa      0                  2
            S → .b       0                  3
S1          S → b.       0                  4
            φ → S.−|     0                  5
            S → S.a      0                  6
S2          S → Sa.      0                  7
            φ → S.−|     0                  8
            S → S.a      0                  9
S3          φ → S−|.     0                  10
            accept                          11

Figure 3: Trace of Earley's algorithm
2.3 Combinator parsers
Combinator parsers are top down (essentially, recursive descent, backtracking) parsers, believed to be asymptotically as efficient as any general parsing algorithm, i.e. with average case performance O(n). The classical combinator parser style as presented by Hutton [9], while neither particularly efficient (exponential in n in the worst case) nor complete in coverage (it cannot handle left recursive grammars), was capable of parsing ambiguous languages and inputs, and exhibits the combinator parser's characteristically elegant declarative style. More recent presentations of combinator parsers have variously sought to improve: coverage [11]; efficiency [14] [7] [16] [11]; and the formal underpinnings of the architecture, by expressing combinator parsers as monads [10].

A combinator parser corresponding to a given language is a function which attempts to match its language against a stream of input and give all "possible parses" as output. Each "possible parse" corresponds to a plausible way in which the parser could have matched some prefix of the input, and includes both a new input stream resulting from removing the matched input portion and (possibly) a representation of what was matched. Combinator parsers are constructed either primitively, for example using a function which matches a single given token against the input (lexical analysis), or, critically, by using special combinators to combine parsers into more complex parsers, in ways analogous to BNF alternation, sequencing, and so on. Monadic presentations construct a specific well defined set of core combinators which do not all correspond straightforwardly to BNF forms (including, for example, "bind"). Since our presentation extends Johnson's, we have chosen to remain faithful to Johnson's style, which is not monadic.
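As a brief illustration, the classical list-of-successes style can be sketched in Scheme in a few lines (our own definitions, not Hutton's exact ones): each parser maps an input list to the list of possible remainder streams.

; terminal: match one token, yielding zero or one remainder streams
(define (terminal word)
  (lambda (pos)
    (if (and (pair? pos) (eq? (car pos) word))
        (list (cdr pos))
        '())))

; seq: run p2 on every remainder produced by p1
(define (seq p1 p2)
  (lambda (pos)
    (apply append (map p2 (p1 pos)))))

; alt: collect the successes of both alternatives
(define (alt p1 p2)
  (lambda (pos)
    (append (p1 pos) (p2 pos))))

Under these definitions a left recursive parser definition such as that for S in figure 1 would loop forever, which is precisely the limitation addressed by Johnson's techniques (section 3).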
Figure 4: Johnson's style — parser functions

Figure 5: Johnson's style — recognise function
2.4 Writing combinator parsers in Scheme
For historical reasons our combinator parser presentation is in Scheme and has yet to be presented in a pure functional language, which necessitates some workarounds to implement lazy evaluation and currying. Our implementation is based on Johnson's, which in turn relies on assignment to implement memoing. Pure functional languages typically do not support assignment; indeed the lack of support for assignment is normally regarded as a feature of pure functional languages, for good reason. Here, assignment is required to implement a special form of memoing and continuation passing. It seems likely that Johnson's parsers could be rewritten to do without assignment, so an equivalent version of our variation could be implemented properly in a pure functional language. However, for this paper, our variation is presented in Scheme, the same language used to present Johnson's parsers.

The two major drawbacks of not using a pure functional language for combinator parsing (aside from the fact that we are apparently sanctioning assignment) are that there is no builtin support for systematic lazy evaluation of function arguments, nor for currying, both of which are useful in combinator parser implementation. While systematic lazy evaluation of function arguments is not technically necessary for combinator parsing, lazy evaluation of at least some formal function parameters is necessary in certain circumstances, so ad hoc mechanisms must be provided instead to achieve this. While currying is also not strictly necessary for combinator parsing, and could be done without for this presentation, its absence would cause problems in practical adaptations of this work.

Despite choosing Scheme as the primary language of presentation, we use a Haskell-like notation for expressing the type signatures of Scheme functions.
3 Johnson's combinator parsers
The combinator parsing style explicated by Johnson [11] covers all context free languages and is efficient, but it does not synchronise computation corresponding to a particular input position in the same way as Earley’s algorithm does. Johnson’s parsers combine two techniques, memoing and continuation passing style, to avoid the problem of nontermination in left recursive function definitions.
Figure 6: Parser support functions
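The body of figure 6 is not reproduced in this copy; a minimal sketch of the support functions named in section 3.1.5 (our own, assuming tables are boxed association lists and an entry is a mutable pair of continuation and result lists) might read:

(define (make-table) (list '()))             ; a boxed association list

(define (make-entry) (cons '() '()))         ; (continuations . results)
(define (entry-continuations entry) (car entry))
(define (entry-results entry) (cdr entry))
(define (push-continuation! entry k) (set-car! entry (cons k (car entry))))
(define (push-result! entry r) (set-cdr! entry (cons r (cdr entry))))

; table-ref: fetch the entry for a key, creating it on first use
(define (table-ref table key)
  (let ((hit (assoc key (car table))))
    (if hit
        (cdr hit)
        (let ((entry (make-entry)))
          (set-car! table (cons (cons key entry) (car table)))
          entry))))

Note that assoc compares keys with equal?, so stream objects which are structurally identical but not identical by address share an entry, as required in section 3.1.5 below.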
3.1 Details
Johnson's scheme is fully documented in [11]. However, for completeness' sake we show in figures 4, 5 and 6 the relevant Scheme function definitions, and give a brief explanation of all relevant functions below.

3.1.1 Miscellaneous functions
Johnson provides the miscellaneous functions dolist and vacuous. The dolist function is a variation of the classical higher-order function map. The vacuous function exists to enable the definition of mutually-recursive parser functions in Scheme.
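Plausible definitions (ours, written as macros) are:

; dolist: apply the body to each element of a list, for effect
(define-syntax dolist
  (syntax-rules ()
    ((dolist (var lst) body ...)
     (for-each (lambda (var) body ...) lst))))

; vacuous: delay evaluation of fn until application, so that a parser
; definition such as that of S in figure 7 may refer to itself
(define-syntax vacuous
  (syntax-rules ()
    ((vacuous fn)
     (lambda args (apply fn args)))))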
3.1.2 Type signature — continuation passing
In contrast to the classical combinator parsing style, which explicitly returns the set of possible parses using a lazily constructed list of answers, Johnson's style makes use of continuation passing, as discussed below, which is reflected in the type signature for parsers.

; Parser :: Success-cont -> Stream -> side-effect
; Success-cont :: Stream -> side-effect

Each parser takes a success continuation and a "stream" (implemented here as a simple list), and "returns" a side effect. The behaviour of such parsers is explained in detail immediately below.

Continuation passing involves passing to a given function an additional argument, another function, defining "what to do next". When the first function is complete, it invokes the second function. This form of computation provides a limited form of reflection, since any given function has access to information about, and can exercise control over, what is done with its result.

In Johnson's parsers, the continuation passing style is used so that functions simulate relations. Functions do not "return" a single answer, or multiple answers wrapped as a single answer, or even a lazily constructed list of answers. Instead they return possible answers by applying their "continuation" argument to every answer. In this style, the actual returned value of a function is technically irrelevant. The "final" value of a completed parse must be communicated using a side effect. The signature of the (success) continuation reveals that an "answer" is in the form of a stream signifying the remainder of the input after a successful match has occurred — in this form, strictly, Johnson's parsers are recognisers.
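The bodies of figures 4 and 5 are not reproduced in this copy. The following sketch (ours, simplified so that an answer is a single remainder stream rather than an argument list) shows parser functions and a recognise function consistent with these signatures and with the trace of figure 8:

; terminal: in continuation passing style the parser applies its
; success continuation to each way of matching, instead of returning
(define (terminal word)
  (lambda (continuation pos)
    (if (and (pair? pos) (eq? (car pos) word))
        (continuation (cdr pos)))))

; seq: p1's continuation runs p2 on the remainder, then continues
(define (seq p1 p2)
  (lambda (continuation pos)
    (p1 (lambda (pos1) (p2 continuation pos1)) pos)))

; alt: try both alternatives, notifying the same continuation
(define (alt p1 p2)
  (lambda (continuation pos)
    (p1 continuation pos)
    (p2 continuation pos)))

; recognize: prime the parser with the "side-effecting" top level
; continuation k-r, which records a successful exhaustive match
(define (recognize parser input)
  (let ((recognized #f))
    (parser (lambda (pos) (if (null? pos) (set! recognized #t)))
            input)
    recognized))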
(define S
  (memo
    (vacuous
      (alt (seq S (terminal 'a))
           (terminal 'b)))))

Figure 7: Combinator parser for the nonterminal S

In a non-continuation passing calling context, allowances for the continuation passing style must be made. Invocation requires passing a special "side-effecting" continuation to the top-level parser function, which records the result of a successful parse in a global variable.

3.1.3 Primitive parser functions
Johnson's presentation simplistically assumes that input is already tokenised. The function terminal takes a parameter word and attempts to match that word against the input. (We will later show the implications of a more realistic scenario, where terminal must call a real lexical analyser.)

3.1.4 Parser combinators
The functions alt and seq (each taking two arguments) enable the construction of more complex parsers in a manner analogous to BNF-style alternation and sequencing. For example, the definition of the parser function S given in figure 7 implements both the parser for the nonterminal S and the simple grammar definition in which it is contained (shown earlier in figure 1).

3.1.5 Memoing
Memoing, as foreshadowed, is a way of modifying a function so that, when it is called with input for which a result has already been cached, it returns this result rather than redundantly recomputing it, usually for efficiency. Normally, to memoize a given function, a "wrapper" function is created which captures arguments to and results of the original function and maintains the mapping. For conventional top down combinator parsers, since each parser function may be regarded as a function from some position in the input to a set of possible parses, a memoized parser remembers whether it has ever been called on a given input position. There is an extensive treatment of memoing of conventional top down combinator parsers in a LISP-like language in [14]. There, the modification is used to improve efficiency dramatically when parser functions embody non-left-factored grammars, rather than to increase generality.

Johnson's parsers use a variation of memoing such that, for a given input, not only results but caller continuations are stored. To see how Johnson's parsers are memoized, consider a parser function FA implementing a continuation passing style parser for a nonterminal A. The continuation argument passed to FA need not be regarded as an argument for the purposes of memoing. However, the continuation cannot be completely ignored. For all calls to FA of the form FA(Cn, p), that is, for a single input position p, where Cn is a different continuation for each call, there may be several valid ways to match an A at the front of p. So FA's implementation must remember, for a given position p, both continuations and results, and ensure that every continuation is called with every result as an argument.

A lookup table is created corresponding to each function to be memoized. Although presented in the code as an association list (for consistency with Johnson), it is trivial to replace this mechanism with a hash table (to provide constant expected time retrieval and update); the remainder of our presentation assumes such a representation. The table is used by the memoized version of the function to provide lookup from the function's stream argument to lists of (stored) results and waiting continuations, if any exist. In Scheme, the effect of the let binding of the lookup table is to create a single table which is shared across all invocations of the memoized function. (Care must be taken to choose the correct function for hashing on streams. Depending on the representation of the input stream, it is possible for different parse threads independently to return stream objects which are not identical by address, but which point to the same position in the input, and so are identical by structure.)

Functions make-table and table-ref provide functionality for managing the lookup table. Functions make-entry, entry-continuations, entry-results, push-continuation! and push-result! provide access functions for managing table entries and associated lists of results and caller continuations.
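Putting these pieces together, a sketch of memo consistent with the trace of figure 8 (again ours, simplified so that a result is a single stream and result subsumption is reduced to a member test) is:

(define (memo fn)
  (let ((table (make-table)))                ; one shared table per parser
    (lambda (continuation pos)
      (let* ((entry (table-ref table pos))
             (first? (and (null? (entry-continuations entry))
                          (null? (entry-results entry)))))
        (push-continuation! entry continuation)
        (if first?
            ;; first call at pos: run the real parser with k-complete,
            ;; which stores each new result and wakes every waiting
            ;; continuation (cf. lines 7-16 of Johnson's memo)
            (fn (lambda (result)
                  (if (not (member result (entry-results entry)))
                      (begin
                        (push-result! entry result)
                        (dolist (cont (entry-continuations entry))
                                (cont result)))))
                pos)
            ;; later calls: replay the stored results to the new caller
            (dolist (result (entry-results entry))
                    (continuation result)))))))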
3.1.6 Referential transparency
Memoing is only reliable if referential transparency can be guaranteed: every memoized function must always return the same values when given the same arguments. In pure functional languages referential transparency is guaranteed by the language semantics, at the cost of disallowing side effects. Unfortunately, memoing in pure functional languages, unless built in, is seemingly harder, since memoing essentially requires side effects (cf. standard libraries in recent Haskell implementations). In an impure functional language like Scheme, where side effects are allowed, referential transparency is not automatically guaranteed; it must instead be verified. For this application, although the language does not enforce referential transparency, it is possible to show that it is not violated in any important way apart from the memoing itself. We simply require that all parser functions defined (based on the primitive combinators) have no side effects, aside from the memoing infrastructure itself and the as-yet unrevealed implementation of the opaque stream type, or at least that where side effects do occur, they have no significant effect on the behaviour of memoized functions. (In reality, there are several such side effects.)
3.2 Example trace of Johnson's algorithm
To illustrate Johnson's algorithm, figure 8 gives a sketch of the trace of the execution of the parser for S (shown earlier) on the input given earlier, with commentary below. The key to the presentation is as follows. Execution steps are labelled in angle bracketed, point-separated digit form (for example <1.2.1.1>). Substeps (e.g. from 1 to 1.1) show steps carried out in executing a parent step, that is, descending into the call tree. For conceptual convenience, the whole call tree is laid out in level order, but this means subsequent computations are not necessarily adjacent. Selected continuations are shown afterward, labelled with mnemonic names (e.g. k-r). At selected steps the subexpression to be reduced is identified by < .. > instead of ( .. ). Brevity mandates several further conventions. We do not show all steps, specifically omitting the detail of vacuous, and of alt and seq except initially. Function bodies are frequently omitted. Continuations are referred to via labels after introducing them explicitly, or by referring to the numbering of an expression shown in a previous step. Final versions of memo tables are shown rather than redundant intermediate steps.

The initial expression to be evaluated is (recognize S '(b a)). The evaluation of <R> leads to the evaluation of <1> and <2> (in that order) — see the body of recognize — where k-r is the top-level recognizing continuation that is passed as an argument to the parser as part of the call to recognize. Using the memo function as a reference we see that at this point in the evaluation of <1>, S has not yet been applied to the input '(b a). Lines 7-16 of memo are now evaluated. This causes the continuation k-r (<1.1>) to be added to the memo table for S (at (i)). Following this, the original unmemoized function S' is applied (redex <1.2>) to the continuation labelled k-complete. This evaluation is equivalent to the evaluation of redexes <1.2.1> and <1.2.2> (see the definitions of alt and seq), where k-1 corresponds to the continuation created by the application of seq within the definition of the parser S.

Redex <1.2.1> consists of an application of S (the memoized version). Since S has already been called with the argument '(b a), the only effect will be for k-1 to be pushed onto the memo table at (ii) (see redex <1.2.1.1>). At this point there are still no results in the memo table, so the evaluation of redex <1.2.1.2> does nothing. Only now is redex <1.2.2> evaluated. This will consume a token 'b and apply k-complete to the result, '(a) (redex <1.2.2.1>). After pushing the result '(a) onto the memo table at (iii) (redex <1.2.2.1.1>), the continuations k-1 and k-r are applied, in order, to this result (see redexes <1.2.2.1.2> and <1.2.2.1.3>).

It is at this point that the difference between Earley's algorithm and Johnson's parser becomes apparent. As we shall see, the application of k-1 will actually complete the parse before k-r is called upon '(a). (The step <1.2.2.1.2> has further substeps, shown later, which will be evaluated first.) If we consider the evaluation of the redexes <1.2.2.1.2> and <1.2.2.1.3> to correspond to operations on state set 1 in Earley's algorithm, it can be seen that the former races ahead to perform operations in state set 2 before state set 1 has been completed. k-1 consumes the next token, 'a, and k-complete is then called on the result, '() (redex <1.2.2.1.2.1>). This result is pushed onto the memo table (at (iv)), and again k-1 and k-r are applied to it (redexes <1.2.2.1.2.1.2> and <1.2.2.1.2.1.3>). Only this time, k-1 does nothing, since its argument is empty. k-r will set the value of recognized to #t. It is only now that the application of k-r (redex <1.2.2.1.3>) is made. Hence, we can see that operations on state set 2 have finished before those in state set 1, highlighting the inexact correspondence of Johnson's parser to Earley's algorithm.
3.3 Redundant lexical analysis
As noted by Johnson, implementations of Johnson's combinator parsing algorithm shown so far seem as asymptotically efficient at parse time as chart parsers, such as Earley's algorithm; however, some significant constant factor inefficiencies remain. One key inefficiency is the potential cost of redundant lexical analysis. Lexical analyser calls constitute a major portion of the algorithm's primitive operations. In many situations the combinator parser might have several candidate NTs which could match the current input, depending on what token comes next; for each of these NTs, a corresponding implementation function must invoke (terminal K_i), where K_i is the token expected, before it is clear which NT will actually be matched. If realistic lexical analysis were necessary and context dependent lexical analysis were required, the lexical analyser would have to be invoked repeatedly at the same point in the input for each instance of (terminal K_i); so significant amounts of redundant computation would occur.

An approach which somehow synchronised the separate lexical requests — that is, so that lexical requests at a given input position, and subsequent associated computation, were suspended until all requests for that input position had been issued — would have the useful effect of eliminating redundant lexical analysis. In the following section we will show how this synchronised state set approach can be achieved.
3.4 Comparison to Earley's algorithm
There are some noteworthy correspondences as well as one significant difference between Johnson’s style of combinator parsing and Earley’s algorithm. We have established that Johnson’s style covers all context free grammars, so is seemingly equivalent in power to Earley’s algorithm.
<R>            (recognize S '(b a))

<1> <2>        (let ((recognized #f)) (S k-r '(b a)) recognized)

<1.1> <1.2>    (push-continuation! entry k-r)
               (S' k-complete '(b a))

<1.2.1> <1.2.2>
               (begin (S k-1 '(b a))
                      ((terminal 'b) k-complete '(b a)))

<1.2.1.1> <1.2.1.2> <1.2.2.1>
               (push-continuation! entry k-1)
               (dolist (result (entry-results entry)) (apply continuation result))
               (k-complete '(a))

<1.2.2.1.1> <1.2.2.1.2> <1.2.2.1.3>
               (push-result! entry '(a))
               (k-1 '(a))
               (k-r '(a))

<1.2.2.1.2.1>  (k-complete '())

<1.2.2.1.2.1.1> <1.2.2.1.2.1.2> <1.2.2.1.2.1.3>
               (push-result! entry '())
               (k-1 '())
               (k-r '())

(k-r)          (lambda (pos) (if (null? pos) (set! recognized #t)))

(k-complete)   (lambda result
                 (if (not (result-subsumed? entry result))
                     (begin (push-result! entry result)
                            (dolist (cont (entry-continuations entry))
                                    (apply cont result)))))

(k-1)          (lambda (pos1) ((terminal 'a) k-complete pos1))

Figure 8: Johnson's style — execution trace
String position   Continuations       Results
'(b a)            k-1 (ii), k-r (i)   '(a) (iii), '() (iv)

Figure 9: The memo table for the parser S.

Figure 10: Synchronised style — parser functions

Intuitively, correspondence can be seen between combinator parsing and Earley's algorithm's construction of state sets. The function call tree arising from the recursive descent style, coupled with memoing (as described above), is similar to the predictor phase for a given state set in Earley's algorithm. However, there is a crucial difference between Earley's algorithm and Johnson's combinator parsing: the latter does not synchronise state set construction.

The fact that Johnson's parsers do not correspond to Earley's algorithm in this way is not necessarily bad. There are other combinator parsing styles, with greatly reduced coverage, which use this property to gain efficiency. However, we will show later that an alternative scheme which does correspond to Earley's algorithm has useful properties not shared by other combinator parsers.
4 New synchronised variation
Johnson's parsers can be modified to more closely resemble Earley's algorithm by ensuring that the equivalent of Earley's predictor phase actually finishes before the equivalent of a new Earley state set is constructed.

Many classes of parsing algorithms embody the concept of "state," corresponding to having reached a position in the input. (In Earley's algorithm the equivalent concept is that of the "state set".) Such algorithms often have associated information about which tokens are expected next in the input at a given position. Combinator parsers have not hitherto been constructed so that they can "know" all the tokens which are expected at a particular point in the input. In fact, such a request looks nonsensical, since there is no real concept of state. However, Johnson's combinator parsers can be adapted so that they do have something corresponding to the relevant state, and consequently can explicitly know what tokens are expected in the current state. The crucial change to Johnson's combinator parsers is support for recording and synchronisation.
4.1 Details

4.1.1 Lexical analysis must be deterministic
In this modified scheme, lexical analysers cannot be properly nondeterministic; that is, for a given input position, there must be at most one lexical interpretation of the input stream into tokens. The reasons for, and full implications of, this restriction are discussed below.

4.1.2 New version of token request functions
First, the parser architecture must be changed so that the parser can intervene at whatever point there is an attempt to match a token or keyword against the input, and record details about the calling context (that is, the continuation). The input stream interface function terminal is re-implemented so that, rather than extracting a token from the input immediately, the request for the given token (at the current input position) is instead simply recorded, along with the current continuation, and then the function returns as though it had failed. Thus, any given invocation of the top level parser will terminate after all requests for lexical analysis on the initial input position have been made. The parser superstructure must now recognise that whenever the whole parsing process "fails," this simply means that all "prediction" activity is over for a single token of input.
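A sketch of this re-implementation (with a global request list and names of our own choosing) might be:

(define *requests* '())     ; pending (continuation token stream) triples

; terminal, synchronised style: record the request and the calling
; context, then return as though the match had failed
(define (terminal word)
  (lambda (continuation pos)
    (set! *requests*
          (cons (list continuation word pos) *requests*))))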
4.1.3 New version of recognize
The top level function recognize now works as follows. The call to the top-level parser function eventually terminates, yielding an (implicit) token request list: a trace of the context of every call made to the lexical analysis layer. (The context is recorded as a continuation.)

The recursive function parse-by-token is invoked immediately after a set of token requests corresponding to the next token in the input has been generated. It calls another function, lex-step (see below), then effectively loops. The process terminates when, at a given token position, no new requests for a subsequent token are generated, because the previous call to lex-step rejected all calling contexts as invalid, either because of a syntax error or because the top level parser was satisfied and did not request a further token.

The function lex-step accepts a set of lexical analysis requests, each in the form of a continuation, an expected token and an input position. (Obviously, and importantly, this input position is the same for each lexical analysis request in a given cycle, which means that for the time being there are many redundant calls to the lexical analyser, leading to a key optimisation — see later.) After consulting the lexical analyser, the call to get-rest yields a new stream representing the input stream after extracting a token, as well as information representing the actual token discovered. From this, lex-step can filter the set of calling contexts, eliminating any which requested a token other than the one discovered.
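The following sketch shows one way the driver might be written under the same assumptions, with the input pre-tokenised so that consulting the lexical analyser is simply examining the head of the stream. Note the deliberately redundant per-request examination of the input, which the refactoring of section 5.1 removes:

; lex-step: process one batch of requests, all at the same position
(define (lex-step batch)
  (for-each
    (lambda (req)
      (let ((k (car req)) (want (cadr req)) (pos (caddr req)))
        (if (pair? pos)                      ; anything left to read?
            (let ((token (car pos)))         ; one redundant "lexer call"
              ;; resume only the contexts that asked for the token found
              (if (eq? want token) (k (cdr pos)))))))
    batch))

; parse-by-token: drain the request table one input position at a time
(define (parse-by-token)
  (let ((batch (reverse *requests*)))
    (set! *requests* '())
    (if (pair? batch)
        (begin (lex-step batch) (parse-by-token)))))

; recognize, synchronised style
(define (recognize parser input)
  (set! *requests* '())                      ; start with a fresh request table
  (let ((recognized #f))
    (parser (lambda (pos) (if (null? pos) (set! recognized #t))) input)
    (parse-by-token)
    recognized))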
4.2 Example trace of synchronised algorithm
Although the trace (figure 11) looks identical at first glance to that of Johnson's algorithm, the key difference is conveyed in an examination of the evaluation ordering. In performing a trace of this example no explicit reference will be made to the internal workings of the key function parse-by-token. Instead, an operational point of view is taken. For each request (i.e. a triple consisting of a continuation, a token request and an input stream) a check is made on whether the token requested can be extracted from the input stream, and if so the continuation is applied to the stream remaining after the token has been removed. If, after all requests have been processed, there are new requests in the table, parse-by-token is called recursively. If not, the function terminates.

The evaluation of (recognize S '(b a)) is equivalent to the evaluation of redexes <1>, <2> and <3>. (k-r, which is present in redex <1>, is defined to be the top-level recognizing continuation.) When redex <1> is evaluated, S has not yet been called on the input '(b a). Hence, lines 7-16 of memo are called. This causes the continuation k-r to be added to the memo table for S ((i) in the table) via the evaluation of redex <1.1>. Now the original unmemoized function S' is applied to a continuation k-complete. This continuation forms part of the definition of memo. It simply adds results to the memo table for S and applies all continuations in the memo table to each result. The evaluation of this redex (<1.2>) is equivalent to the evaluation of redexes <1.2.1> and <1.2.2>. This follows from the definitions of alt and seq. k-1 corresponds to the continuation formed within the body of the definition of seq.

The evaluation of redex <1.2.1> consists of an application of S (the memoized function) to k-1 and the input '(b a). Since S has already been called with the argument '(b a), the only effect will be for the continuation to S, which we denote k-1, to be pushed onto the memo table at (ii) (see redex <1.2.1.1>). At this point there are still no results in the memo table, so the evaluation of redex <1.2.1.2> does nothing. The next evaluation, that of redex <1.2.2>, presents us with our first use of the request table. When it is evaluated, entry (i) is added to the request table.

At this point evaluation of <1> has completed and we perform a call to parse-by-token (redex <2>). In Earley's algorithm this is equivalent to the situation where state set 0 has been processed and processing of state set 1 is about to commence. Since there is but a single request, redex <2.1> is evaluated. That is, parse-by-token applies k-complete (via the auxiliary function lex-step) to the input stream '(a) (this is the remainder of the input stream after the token 'b has been removed and matched with the token request). This leads to the evaluation of redexes <2.1.1>, <2.1.2> and <2.1.3>. This will simply push the result '(a) onto the memo table at (iii), after which the continuations k-1 and k-r are applied, in that order, to this result. k-1 places a request in the request table at (ii). Note that the pending portion of the request table was empty before this point (we represent the boundary between state sets in the request table via a horizontal line). Applying k-r to '(a) will do nothing.

Since a new lexical request has been generated, the evaluation of parse-by-token (redex <2>) will proceed via another call to itself. This time almost exactly the same process occurs. k-complete is applied to '(), which is pushed into the memo table at (iv) (redexes <2.2>, <2.2.1>). Next, redexes <2.2.2> and <2.2.3> are evaluated. k-1 is applied to '(), which puts another lexical request (at (iii)) in the (at present empty) request table. k-r is then applied to '(), which will set recognized to #t. The final call to parse-by-token (and by extension lex-step) will do nothing, since the input stream of the sole remaining lexical request is empty. This completes the original evaluation of parse-by-token. The value of recognized is now evaluated (redex <3>) and returned. This completes the example.
4.3 Comparison with Earley's algorithm
The synchronised variation of Johnson's algorithm is broadly similar to Earley's algorithm (with no lookahead), although some further considerations apply. The scanner operation involves explicit interpretation of the state set. The predictor and completer operations are "programmed", interpreting the state set only implicitly (via memoing, in the case of the predictor). Whereas in Earley's algorithm the state set corresponding to every position in the input is clearly distinguished, in our modified form only the state set corresponding to the current input position is actually explicit at any one time. Data corresponding to previous Earley state sets is distributed within the (frozen) state of individual continuations. This does not cause any immediate problems, since in the standard form of Earley's algorithm only the completer makes use of this data, and the completer's function is implicit in the activity of the individual parse functions.
5 Application and experience

5.1 Increased lexical analyser efficiency
A simple refactoring of the call to get-token to remove it from the body of the loop (since each call to get-token within a cycle is effectively the same) affords a drastic improvement in efficiency. Even more importantly, the fact that get-token can then be called with access to information about all lexical analysis requests makes interfacing with context dependent lexical analysers possible.
5.2 Context dependent lexical analysers
Synchronised combinator parsers can be interfaced with deterministic, context dependent lexical analysers (such as those used with LR parsers). Traditional combinator parsing is only compatible with nondeterministic lexical analysis, or with deterministic lexical analysis not requiring context-sensitivity. A useful paradigm in language processor architecture is to regard parsers and lexical analysers as coroutines: for each token, first the lexical analyser runs, then the parser, and so on, each having access to (an abstract view of) the other's state.
<R>            (recognize S '(b a))

<1> <2> <3>    (let ((recognized #f)) (S k-r '(b a)) (parse-by-token) recognized)

<1.1> <1.2>    (push-continuation! entry k-r)
               (S' k-complete '(b a))

<1.2.1> <1.2.2>
               (S k-1 '(b a))
               ((terminal 'b) k-complete '(b a))

<1.2.1.1> <1.2.1.2>
               (push-continuation! entry k-1)
               (dolist (result (entry-results entry)) (apply continuation result))

<2.1>          (k-complete '(a))

<2.1.1> <2.1.2> <2.1.3>
               (push-result! entry '(a))
               (k-1 '(a))
               (k-r '(a))

<2.2>          (k-complete '())

<2.2.1> <2.2.2> <2.2.3>
               (push-result! entry '())
               (k-1 '())
               (k-r '())

(k-r)          (lambda (pos) (if (null? pos) (set! recognized #t)))

(k-complete)   (lambda result
                 (if (not (result-subsumed? entry result))
                     (begin (push-result! entry result)
                            (dolist (cont (entry-continuations entry))
                                    (apply cont result)))))

(k-1)          (lambda (pos1) ((terminal 'a) k-complete pos1))

Figure 11: Synchronised style — execution trace
String position   Continuations       Results
'(b a)            k-1 (ii), k-r (i)   '() (iv), '(a) (iii)

Figure 12: The memo table for the parser S.
        Continuations   Token request   Input
(i)     k-complete      'b              '(b a)
------------------------------------------------
(ii)    k-complete      'a              '(a)
------------------------------------------------
(iii)   k-complete      'a              '()

Figure 13: The lexical requests table.

Context-sensitive lexical analysers use the parser state to determine how to interpret subsequent input as tokens; specifically, which tokens the parser is currently expecting at a given input position. Such information is normally available in parsing algorithms such as LL or LR, or in (simple variations of) Earley's algorithm. It is not normally available to combinator parsers, since there is no explicit synchronisation of requests for lexical information corresponding to each successive input position. The synchronised implementation shown earlier can be modified trivially so that the lexical analyser can be passed information about which tokens are acceptable at a given position. From the set of lexical analyser requests at a given input position, a set of expected tokens, that is, the set of valid tokens at that position in the input, can be generated. The get-token function is then augmented to accept such information and pass it to the real lexical analyser.

Nondeterministic lexical analysis can be performed by traditional combinator parsers as follows: ad hoc lexical analyser requests would either "split", giving multiple plausible tokens at a given point in the input, or behave differently according to the specific token requested, but without the benefit of information about all tokens requested at the given input position (arguably "context dependent", but not in the same, traditionally useful sense). There are undoubtedly applications where nondeterministic lexical analysis might be wholly appropriate. Deterministic, non-context dependent lexical analysis can be performed by traditional combinator parsers by simply regarding the input as an abstract sequence of tokens, lazily constructed by a lower level layer which interprets the input as tokens without reference to the parser.
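For illustration, with the lexical analyser call hoisted out of the loop as in section 5.1, lex-step can compute the expected token set and hand it to a context dependent analyser. Here get-token is an assumed interface, not a function defined in this paper: given the input position and the expected set, it is taken to return a pair of the token found and the remaining stream.

; lex-step, refactored: one lexical analyser call per input position,
; given the set of tokens the parser is prepared to accept there
(define (lex-step batch)
  (if (pair? batch)
      (let ((pos (caddr (car batch)))        ; same position for the batch
            (expected (map cadr batch)))     ; the valid-token set
        (if (pair? pos)
            ;; get-token is the assumed context dependent analyser;
            ;; it returns (token . rest), guided by the expected set
            (let* ((found (get-token pos expected))
                   (token (car found))
                   (rest (cdr found)))
              (for-each (lambda (req)
                          (if (eq? (cadr req) token)
                              ((car req) rest)))
                        batch))))))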
5.3 Adapting parsers to return results
Johnson's parsers, as originally presented, are strictly only recognisers. The scheme can be straightforwardly extended to perform true parsing by explicitly incorporating parse results in the relations. The extension amounts to adding an extra "result" argument to the type signature of continuations and modifying parser functions appropriately, as well as creating extra parser combinators for combining and aggregating results. Further details, in a more "monadic" style, are given in [15].
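A sketch of the extension (ours): continuations now take both the remaining stream and a result, terminal passes the matched token as its result, and seq pairs the sub-results. The memo tables would correspondingly store stream/result pairs.

; terminal: pass the matched token as the result
(define (terminal word)
  (lambda (continuation pos)
    (if (and (pair? pos) (eq? (car pos) word))
        (continuation (cdr pos) word))))

; seq: combine the two sub-results into a list
(define (seq p1 p2)
  (lambda (continuation pos)
    (p1 (lambda (pos1 r1)
          (p2 (lambda (pos2 r2)
                (continuation pos2 (list r1 r2)))
              pos1))
        pos)))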
5.4 Performance evaluation
Experiences with earlier, LISP implementations of the same algorithms presented here, on real, industrial grammars, are documented in [15]. Anecdotally (not documented in the above), there is a significant constant factor time benefit associated with the synchronisation variation. There is, however, still a significant and burdensome constant factor space usage penalty associated with combinator parser implementations of Earley's algorithm (typically 60 times more space). Such a space penalty has a consequent effect on time performance once input size is more than a few thousand tokens. We conjecture that Earley-style combinator parsers represent their implicit state set tables inefficiently, due to the need to store many unique closures. On the other hand, straightforward traditional implementations of Earley's algorithm maintain relatively efficient representations of their state set tables. Nevertheless, there are applications where combinator parser implementations might still be more appropriate, such as concrete syntax-based transformation [4], where expected input size is small.
6 Conclusions and future work
The synchronised adaptation of Johnson's combinator parser demonstrates how to implement a combinator parser corresponding exactly to Earley's algorithm, resolving an important open problem implicit in the history of combinator parsing. We have shown that some useful parser architecture practice — namely, deterministic, context dependent lexical analysis — normally associated with non-general LR parsing algorithms can be employed in combinator parsing. Although equivalence has not been formally proven here, the presentation used here is nevertheless an important step toward practical application of general parsing techniques. The (arguably) general importance of non-generational approaches to domain specific languages, and the consequent legitimacy of a combinator parser style approach to language prototyping, make this achievement significant in that it provides technical avenues of innovation in language prototyping.

Although continuation passing style parsers benefit from the synchronised adaptation, it is not clear whether the synchronisation adaptation is compatible with different combinator parsing architectures, such as the more classical recursive descent approach shown in [10]. Nevertheless, consideration should be given to how further optimisations of tabular parsers (for example [8]) might be adapted to combinator parsing. Further optimisations to deal with the large constant factor space usage also need to be investigated.
7 Acknowledgements
Sean Seefried is supported by an Australian Postgraduate Award. Work by Ian Peake which led to [15] was supported by an Australian Postgraduate Award and a DSTO Australia scholarship. We are grateful to our colleague Trent Waddington, who helped us to refine our understanding of Earley’s algorithm through discussion of implementation issues and trace output from an Earley implementation.
References

[1] http://www.dur.ac.uk/martin.ward/fermat.html.

[2] http://www.gnu.org/software/bison/bison.html.

[3] http://www.antlr.org.

[4] http://www.cwi.nl/htbin/sen1/twiki/bin/view/SEN1/MetaEnvironment.

[5] http://accent.compilertools.net/.

[6] Earley, J. An efficient context-free parsing algorithm. Communications of the ACM 26, 1 (Jan. 1983). Milestones of research - selected papers 1958-1982 (reprint of CACM 13(2), 94-102, 1970).

[7] Frost, R. Using memoization to achieve polynomial complexity of purely functional executable specifications of non-deterministic top-down parsers. SIGPLAN Notices 29, 4 (Apr. 1994), 23-30.

[8] Graham, S. L., Harrison, M. A., and Ruzzo, W. L. An improved context-free recognizer. ACM Trans. Program. Lang. Syst. 2, 3 (July 1980), 415-462.

[9] Hutton, G. Higher-order functions for parsing. J. Functional Programming 2, 3 (July 1992), 323-343.

[10] Hutton, G., and Meijer, E. Monadic parser combinators. Tech. Rep. NOTTCS-TR-96-4, Department of Computer Science, University of Nottingham, 1996.

[11] Johnson, M. Memoization in top-down parsing. Computational Linguistics 21, 3 (1995), 405-417. Discussion paper.

[12] Johnson, S. C. Yacc: Yet another compiler compiler. In UNIX Programmer's Manual, vol. 2. Holt, Rinehart, and Winston, New York, NY, USA, 1979, pp. 353-387.

[13] Leermakers, R. The functional treatment of parsing. Kluwer Academic Publishers Group, Netherlands, 1993.
[14] Norvig, P. Techniques for automatic memoization with applications to context-free parsing. Computational Linguistics 17, 1 (1991), 91-98.

[15] Peake, I. Enabling meta-level support for language design and implementation through modular parsers. PhD thesis, January 2000.

[16] Rojemo, N. Garbage collection and memory efficiency in lazy functional languages. PhD thesis, Chalmers University of Technology, 1995.

[17] van den Brand, M., Sellink, A., and Verhoef, C. Current parsing techniques in software renovation considered harmful. In Proc. 6th International Workshop on Program Comprehension (IWPC '98) (1998).

[18] Ward, M. Program slicing via FermaT transformations. In Proc. COMPSAC 2002 (Oxford, England), IEEE.