XParse: A Language for Parsing Text to XML

James Cheney
Cornell University
Ithaca, NY 14853

ABSTRACT

This paper presents a domain-specific language, XParse, that attempts to combine the power of tools like lex and yacc, which generate efficient parsers from declarative specifications, with the convenience, safety, and usability of text- or XML-processing languages such as Perl or XSLT. XParse is a standalone language which provides lex-style regular expression matching and yacc-style LALR(1) parsing. Existing parsing tools such as lex and yacc can be difficult to learn, are usually highly language-dependent, and often vary across languages and platforms. Parsers are therefore difficult to create and reuse, so they are usually developed on a per-application basis rather than a per-language basis. Unlike traditional, language-dependent parser tools, the semantic actions in XParse denote XML fragments rather than uninterpreted source code. This design facilitates independent typechecking and analysis of XParse programs, without first translating to some other programming language, as is the norm for existing parsing tools. Furthermore, since XML already enjoys wide support, XParse programs can be reused in many environments without modification. We present several applications of XParse, including providing more user-friendly concrete syntax for existing XML dialects like XSLT, parsing existing languages such as Java, and parsing XParse programs themselves. These applications show that XParse (in concert with other XML tools) can be used to help develop useful language tools quickly and conveniently.

1. INTRODUCTION

Scripting languages such as Perl make it easy to accomplish complex tasks using simple yet powerful text processing techniques such as regular expression pattern matching. XML standards for stylesheets, transformations and queries (including XSLT [4] and XQuery [6]) and scripting or general-purpose languages (including XDuce [11] and CDuce [2]) bring this combination of convenience and power to the world of more richly structured XML data. However, for a variety of reasons (inertia among the most
important), many forms of interesting data are still stored and maintained in a human-readable form rather than XML. Examples include most programming languages, TeX markup, and legacy HTML. Consequently, tasks which seem straightforward in theory (such as writing a script to search for or rename all occurrences of a given variable) are impractical because of the effort required to parse such legacy formats. This is not because parsing is inherently difficult, but because existing parsing tools are difficult to learn and use. To make matters worse, finished parsers are usually tailored to a particular task and programming environment, so reusing them often requires considerable effort. We believe that there is a need for text-to-XML parsing tools that combine the power of standard parsing tools with the convenience of scripting languages. We have developed a language called XParse that addresses this need. XParse combines the regular expression matching and LALR(1) parsing techniques of traditional AT&T-style lex and yacc tools for C and other languages with the convenience and flexibility of scripting languages. Like existing tools, XParse programs are declarative specifications of lexical analyzers and parsers. However, the semantic actions of XParse specifications are XML fragments indicating how to interpret the input as tokens or parse trees, rather than arbitrary, uninterpreted code in some general-purpose language as in traditional lex and yacc. This has several positive consequences:

1. XParse specifications require no "boilerplate" coding to implement abstract syntax tree data structures, and can be combined with other XML- or text-processing tools without requiring low-level programming.

2. XParse is a standalone language, so specifications can be statically analyzed and typechecked on their own, without first being translated to some target language.

3. Similarly, since XParse programs have meaning independent of any other language, they can be interpreted, compiled to executable programs, or even compiled to libraries that produce XML data using a standard interface like SAX or DOM.

XParse can be used to develop portable and reusable standalone parsers from existing text formats to XML quickly, facilitating rapid development of programming tools using XParse in concert with other XML tools. Existing XML standards such as XSLT and XML Schema are presented as XML dialects, which are very machine-friendly but not as convenient for programmers to read and write. XParse can

be used to parse programmer-friendly concrete syntaxes for such languages to their XML equivalents. Thus, XParse illustrates both how programming research and development can benefit from XML technologies and how XML standards can be enriched with more usable interfaces using established programming language tools. In this paper we present a design for XParse, based in part on insights gained from a simpler prototype implementation. This design is a work in progress. The rest of this paper is structured as follows. In Section 2 we give an overview of the standard lex and yacc style tools. Section 3 describes the XParse language design. Section 4 describes the prototype implementation of XParse, and Section 5 discusses several sample applications, including XParse parsers for XSLT stylesheets, Java source files, and XParse specifications themselves. Section 6 discusses related and future work and concludes.

2. BACKGROUND: LEX AND YACC

In this paper, we assume familiarity with XML but not with lex and yacc. In this section, we give an overview of the design and implementation of these tools. For good in-depth coverage of lex and yacc, see [15]. We discuss this background for several reasons: first, to make this article accessible to readers not intimately familiar with such tools; second, to introduce the underlying scanning and parsing techniques upon which XParse is based; and finally, to support our position that such tools are difficult to learn and use, in part because of their software architecture rather than the (admitted) inherent difficulty of parsing. A lexical analyzer (or scanner) splits a text stream into tokens. For example, a simple lexical analyzer for English text might tokenize it into words (that is, contiguous sequences of letters) and punctuation, ignoring whitespace (i.e., spaces, tabs, and newlines). Lexical analyzer generators like lex convert declarative specifications of lexical analyzers into efficient programs. The specifications relate regular expressions to semantic actions describing how the text is tokenized. In lex, some rules for tokenizing a simple form of English are:

[a-zA-Z]+  { yylval = yytext; return WORD; }
","        { return COMMA; }
"."        { return PERIOD; }

Semantic actions, enclosed in braces, are uninterpreted C code to be executed when the rule is applied. When used in conjunction with yacc, the value returned by a semantic action indicates a token value for the matched string. However, semantic actions can include arbitrary code and need not return. The yylval variable can be used to associate a semantic value with a token (here, the string value of a WORD token). Ambiguities among collections of rules are resolved by giving priority to the longest match and then to the earliest-declared rule. lex translates these rules into nondeterministic finite automata. These NDFAs are converted to deterministic automata using standard algorithms [10]. The program constructed by lex simulates the deterministic automaton, keeping track of the most recent accepting state it has passed through; when no further transition is possible, it backs up to the longest match found and executes the associated semantic action.
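In other words, the generated scanner's inner loop is a maximal-munch DFA simulation. A sketch of this loop in C (the table names here are illustrative, not taken from any particular lex implementation):

#define START_STATE 0
#define DEAD (-1)

extern int delta[][256];   /* DFA transition table (illustrative) */
extern int accept_rule[];  /* rule accepted in each state, or -1 */

/* Scan one token starting at pos; returns the end position of the
   longest match and stores the matched rule, or returns -1. */
int scan(const char *input, int pos, int *rule_out)
{
    int state = START_STATE;
    int end = -1;
    *rule_out = -1;
    while (input[pos] != '\0') {
        state = delta[state][(unsigned char)input[pos]];
        if (state == DEAD)
            break;
        pos++;
        if (accept_rule[state] >= 0) {  /* earliest rule wins ties */
            end = pos;
            *rule_out = accept_rule[state];
        }
    }
    return end;
}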

Pure regular expression matching is not powerful enough to handle some languages. In particular, tokenization is not always uniform throughout a language; for example, many languages include string literals (i.e., strings delimited by quotes) within which the usual tokenization rules do not apply. lex's start states make it possible to recognize such constructs. Each rule can be associated with one or more start states, and the rule applies only when the lexer is in one of the listed states. The action BEGIN transitions to another state. Also, some lex variants maintain a stack of start states, with semantic actions PUSH and POP. In addition, lex specifications can define and maintain arbitrary additional state; common examples include counters for line numbering and delimiter nesting, and symbol tables for correctly parsing C-style typedef identifiers. A parser is a program that reads a sequence of tokens and attempts to build an abstract syntax tree according to the rules of a context-free grammar. Parser generators like yacc transform declarative descriptions of such grammars into efficient parsers. The rules used in yacc grammars are production rules of the form S : α1 ... αn, where S is a nonterminal symbol and the αi are either tokens or nonterminal symbols. Such rules are associated with semantic actions which specify how to construct a semantic value for S, given semantic values for the αi. In yacc's C notation, some rules for parsing simple arithmetic expressions might be:

exp : INT               { $$ = mkInt($1); }
    | exp PLUS exp      { $$ = mkPlus($1,$3); }
    | exp TIMES exp     { $$ = mkTimes($1,$3); }
    | LPAREN exp RPAREN { $$ = $2; }

As in lex, the semantic actions are arbitrary C code. The symbol $$ denotes the return value of the action, and symbols $1, $2, etc. denote the semantic values (if any) of the first, second, etc. symbols on the right side of the production rule. Parsers generated by yacc use a shift-reduce parsing algorithm. A shift-reduce parser maintains a stack whose entries are tokens that have been read or nonterminals that have already been reduced. At every step, the parser decides whether to shift a new token from the input and push it onto the stack, or to reduce the top few symbols on the stack according to a matching grammar rule. Shift-reduce parsing can involve a lot of nondeterminism: at each step we need to decide whether to shift or reduce and (if the latter) which rule to reduce by. To make these choices, yacc parsers use a finite-state control (with a state stack) to decide when to shift or reduce. Each automaton state is labeled with a collection of partial rules that may eventually match the current stack. The automaton's input is the next unshifted symbol in the input, and the transition for each token indicates whether to shift a new token onto the stack (simultaneously transitioning to a new state), or to reduce the current top of the stack using a grammar rule. The technique used by yacc to construct this automaton is called LALR(1). When nondeterministic behavior is still possible, yacc produces warnings called shift-reduce or reduce-reduce conflicts, depending on whether the ambiguity is between shifting and reducing, or between reducing by two distinct rules. yacc produces a text file describing the automaton and any conflicts on request. Simplistic grammars such as the above expression grammar are usually ambiguous in inessential ways, resulting in shift-reduce conflicts. For example, the expression 1 + 2 * 3 could be parsed as (1 + 2) * 3 or 1 + (2 * 3) (the usual convention is to assume the latter form).
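To illustrate, assuming the conflict at * is resolved in favor of shifting (which the precedence directives described below arrange), a yacc-generated parser processes 1 + 2 * 3 roughly as follows; only the symbol stack is shown, whereas the real LALR(1) automaton works with states:

Stack                    Input        Action
                         1 + 2 * 3    shift 1
INT                      + 2 * 3      reduce exp : INT
exp                      + 2 * 3      shift +
exp PLUS                 2 * 3        shift 2, reduce exp : INT
exp PLUS exp             * 3          shift * (TIMES binds tighter than PLUS)
exp PLUS exp TIMES       3            shift 3, reduce exp : INT
exp PLUS exp TIMES exp                reduce exp : exp TIMES exp
exp PLUS exp                          reduce exp : exp PLUS exp
exp                                   accept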

Expressions such as 1 + 2 + 3 are also ambiguous, because both (1 + 2) + 3 and 1 + (2 + 3) are possible parses. Even though we might consider these parses to be semantically equivalent, they are not equivalent as far as yacc is concerned. This is not always undesirable, because not all operations are associative (for example, arithmetic operations in C are not guaranteed to be associative). There are two standard ways to disambiguate such grammars. First, the grammar rules can be rewritten to be more precise:

exp        : exp PLUS times_exp
           | times_exp
times_exp  : times_exp TIMES atomic_exp
           | atomic_exp
atomic_exp : INT
           | LPAREN exp RPAREN

Implicitly, these rules make both + and * left-associative (that is, 1 + 2 + 3 parses to (1 + 2) + 3), and they give higher precedence to * than +. This solution makes a grammar considerably more complicated and can be tricky to implement correctly, and it contradicts the spirit of declarative programming, where the ideal is to write as clear a description as possible rather than specialize the rules to a particular model of computation. Unfortunately, sometimes this is the only solution. The second standard solution is to employ precedence directives that specify the associativity and relative precedences of the conflicting operators. The following directives suffice to disambiguate the original ambiguous grammar:

%left PLUS
%left TIMES

These directives state that both PLUS and TIMES are left associative, and the order of the declarations is used to establish their precedence order. The effect of these directives on the expression grammar is identical to the above rewritten grammar. Precedence directives work well for binary operations and a few other common cases, such as dangling-else conflicts.

lex and yacc were designed to work together (although they can also be used separately). lex produces a C source file that implements a function

int yylex();

which reads the next token from its input and returns a token value (implemented in C using #defined integer constants). Other communication between the lexer and the rest of the program is provided by global variables; for example, yylval contains the semantic value associated with a token, if any. yacc produces C source files that expect a yylex() function of the above type to be provided by the program with which it is compiled, and provides a function

int yyparse();

which parses the input, using yylex() to fetch tokens from it.
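The generated pieces are then glued together by a small amount of hand-written C. A minimal driver sketch (the file handling is illustrative; yyin is the scanner's input stream in flex and most lex variants):

#include <stdio.h>

extern FILE *yyin;         /* provided by the lex-generated scanner */
extern int yyparse(void);  /* provided by the yacc-generated parser;
                              returns 0 on a successful parse */

int main(int argc, char **argv)
{
    if (argc > 1 && (yyin = fopen(argv[1], "r")) == NULL) {
        perror(argv[1]);
        return 1;
    }
    /* yyparse() repeatedly calls yylex() to fetch tokens. */
    return yyparse();
}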

3. THE XPARSE LANGUAGE

XParse combines two rather different sublanguages, one for tokenizing and one for parsing. Historically, lex and yacc were developed as separate tools. Traditional yacc-style tools produce an interface file that declares the token symbols in the target language, and corresponding lex-style tools must import this interface in order for the resulting code to make sense and be correct. Since lexers and parsers are usually designed in tandem, this design forces the programmer to switch between two files frequently, and can result in hard-to-find bugs due to inconsistencies between them. We see few reasons to keep this design. Instead, XParse programs combine declarations, lexer rules, and grammar rules into one file. XParse programs consist of three sections:

1. A declarations section, describing the tokens and nonterminals and their types, along with declarations of state names and variables, regular expression abbreviations, token precedences/associativities, and the starting nonterminal.

2. A lexer section, consisting of a collection of rules associating regular expressions with actions that construct a token sequence.

3. A parser section, consisting of a collection of rules associating grammar productions with semantic actions that construct an XML syntax tree.

This makes each parser a self-contained specification that is more convenient to write and maintain, and easier to check for errors, than two separate files are. As in traditional lex/yacc tools, these sections are separated by lines beginning with %%. Figures 1-3 show a complete XParse program, which implements a simple parser for arithmetic expressions with nested comments and variables.

%regexp var = [a-z]+
%regexp digits = [0-9]+
%regexp ws = [\ \t]
%token INT VAR : string
%token PLUS TIMES MINUS DIV LPAREN RPAREN
%state INITIAL : token
%state COMMENT : token\token
%left PLUS MINUS
%left TIMES DIV
%type Exp = (plus|times|minus|div|int|var)
%element doc : Exp
%element plus times minus div : (Exp,Exp)
%element int var : EMPTY
%attlist int (value:string)
%attlist var (name:string)
%nonterm exp : Exp
%start start : doc
%counter cdepth

Figure 1: Complete XParse Expression Parser I: Declarations

%% Lexer
<INITIAL> {
  {digits}  { token(<INT>[$$]) }
  {var}     { token(<VAR>[$$]) }
  "+"       { token(<PLUS/>) }
  "*"       { token(<TIMES/>) }
  "-"       { token(<MINUS/>) }
  "/"       { token(<DIV/>) }
  "("       { token(<LPAREN/>) }
  ")"       { token(<RPAREN/>) }
  "(*"      { incr(cdepth); push(COMMENT) }
  "\n"      { skip }
  {ws}+     { skip }
  <<EOF>>   { exit }
}
<COMMENT> {
  "*)"      { decr(cdepth); if(cdepth == 0) then {pop} else {skip} }
  "(*"      { incr(cdepth); skip }
  .         { skip }
  "\n"      { skip }
  <<EOF>>   { error("unclosed comment") }
}

Figure 2: Complete XParse Expression Parser II: Lexer

Unlike traditional parser specifications, XParse programs can be either interpreted directly or compiled to runnable code. A session with an XParse interpreter might go as follows:

% xparse -interactive exp.xp
- 1 + 5
<doc><plus><int value="1"/><int value="5"/></plus></doc>
- 1 + 2 + 3
<doc><plus><plus><int value="1"/><int value="2"/></plus><int value="3"/></plus></doc>
- x y z : (x + y) * (1 / z)
...
Alternatively, we can use such an XParse interpreter as a filter in conjunction with other tools. For example, suppose exp2mathml.xsl is an XSLT stylesheet that renders expressions into MathML. We can transform text representations of expressions directly to MathML as follows:

xparse exp.xp | xslt exp2mathml.xsl

There are many possible execution models for XParse programs, for example compilation to a native-code executable, or to Java classes implementing an XML interface such as SAX or DOM. The following three subsections describe the contents of these XParse program sections in greater detail.
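Because an XParse parser simply emits XML, any XML-aware consumer can sit downstream of it. As a sketch of that scenario, hypothetical C glue code could consume the textual output of the xparse command directly using the expat library (the popen command line mirrors the filter example above; the handler and file names are illustrative):

#include <stdio.h>
#include <expat.h>

/* Report each element as the parse tree streams past. */
static void XMLCALL on_start(void *data, const XML_Char *name,
                             const XML_Char **atts)
{
    (void)data; (void)atts;
    printf("element: %s\n", name);
}

int main(void)
{
    char buf[4096];
    size_t n;
    FILE *in = popen("xparse exp.xp < input.txt", "r");
    if (in == NULL)
        return 1;

    XML_Parser p = XML_ParserCreate(NULL);
    XML_SetElementHandler(p, on_start, NULL);
    while ((n = fread(buf, 1, sizeof buf, in)) > 0)
        XML_Parse(p, buf, (int)n, 0);
    XML_Parse(p, buf, 0, 1);   /* signal end of input */

    XML_ParserFree(p);
    pclose(in);
    return 0;
}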

3.1 Declarations Section

Because an XParse document combines the functionality of lex and yacc, the declarations section must combine the

%% Parser
start : exp                 { accept(<doc>[$1]) }
      ;
exp   : INT                 { <int>[@value[$1]] }
      | VAR                 { <var>[@name[$1]] }
      | exp PLUS exp        { <plus>[$1,$3] }
      | exp TIMES exp       { <times>[$1,$3] }
      | exp MINUS exp       { <minus>[$1,$3] }
      | exp DIV exp         { <div>[$1,$3] }
      | LPAREN exp RPAREN   { $2 }
      ;

Figure 3: Complete XParse Expression Parser III: Parser

information formerly distributed between the lex and yacc declaration sections. Besides this change, we have made a few improvements to the design of the declarations. The first declaration in Figure 1,

%regexp var = [a-z]+

defines a regexp abbreviation for use in lexer rules (as {var}, etc.). The traditional lex syntax for such abbreviations is

digits [0-9]+

This notation really denotes a textual substitution, so that references to {digits} within regular expressions are replaced by the string [0-9]+ prior to parsing the expression. But this would permit partial or nonsensical abbreviations as well, such as [0-, with an error (if any) flagged only when the abbreviation is used. In XParse, abbreviations must be complete, well-formed regular expressions. Token declarations such as

%token INT VAR : string

define tokens and declare their semantic value types (if any). State declarations like

%state INITIAL : token

define lexer start states and their types. (The types are explained in detail in the next section.) These are followed by precedence/associativity declarations, which behave the same as in traditional lex/yacc. Next are some type definitions and declarations, including

%type Exp = (plus|times|minus|div|int|var)
%element doc : Exp
%element int : EMPTY
%attlist int (value:string)

These declarations define a type abbreviation called Exp which describes the content model of expressions, declare and assign content models to the doc and int elements, and assign an attribute list to int. These declarations are equivalent to the following DTD declarations:

<!ELEMENT doc (plus|times|minus|div|int|var)>
<!ELEMENT int EMPTY>
<!ATTLIST int value CDATA #REQUIRED>

The above DTD fragment can easily be generated from the declaration forms.

Note that we use the %type keyword differently than in yacc, where it assigns a type to a nonterminal. We use the %nonterm keyword for this purpose instead. The last several lines

%nonterm exp : Exp
%start start : doc
%counter cdepth

declare the nonterminal exp and its semantic value type, declare the start symbol start, and declare a counter cdepth.

rule    ::= <state> regexp { act_seq }
regexp  ::= c | r s | r|s | r* | {abbrev}
trans   ::= stay | begin(state) | push(state) | pop | exit | error(s)
exp     ::= i | s | cntr | $$ | cut(exp, exp) | trim(exp, exp) | <token>[exp]
value   ::= token(exp) | skip | start(token) | continue(exp) | end
cmd     ::= incr(cntr) | decr(cntr) | add(symtbl, exp) | remove(symtbl, exp)
          | if(test)then{act_seq}else{act_seq}
act     ::= cmd | value & trans
act_seq ::= act | act_seq; act
test    ::= exp rel exp | member(symtbl)
rel     ::= == | != | < | > | =

where cntr, token, symtbl, and state indicate declared counter, token, symbol table, and state symbols

Figure 4: Lexer rule syntax

3.2 Lexer Section

The lexer section comes second in an XParse document following the declarations. This section is a collection of rules defining a scanner. A scanner can be thought of as a transducer from an input character stream to an output token stream. Token streams can be thought of as XML documents consisting of a flat sequence of elements that represent tokens. Valueless tokens are represented with empty elements, and tokens that carry a (text) value are represented by elements with text content. The only allowed token value type is string (since this is the only built-in type in plain XML). XParse lexer rules look very similar to standard lex rules. In fact, the regular expression syntax is identical (we adapted it from that of AT&T lex). For brevity, we describe only the core regular expression syntax. Instead of being arbitrary C code, XParse lexer actions are drawn from a restricted language of actions. This language includes most of the standard built-in lex actions such as transitioning to a new state and emitting tokens, along with some new forms that restore some of the expressiveness XParse sacrifices by forbidding arbitrary code. The abstract syntax of lexer rules is shown in Figure 4. A rule consists of a start state, a regular expression, and an action sequence. The rule applies when the lexer is in the specified state and the regular expression matches the current input (ties are broken by giving priority to the longest

match and then the earlier rule). Regular expressions include characters c, sequential composition rs, alternative choice r|s, and iteration r*. (Readers familiar with lex will note that we have omitted discussing "inclusive" vs. "exclusive" start states, as well as additional regular expression forms such as character classes, ?, +, $, and lookahead expressions r/s; these are also supported in XParse.) The transition operations stay, begin, push, pop, exit, and error describe state transitions. The operation stay means stay in the same state; begin(s) transitions to a new state s; push(s) transitions to s and saves the old state on the control stack; pop pops the control stack and transitions to the state formerly at the top; exit terminates execution; and error(msg) terminates execution with an error message msg. The expressions describe integer, string, and token values. They include integer and string constants i and s respectively, as well as counter names cntr and the symbol $$, denoting the currently matched string. The cut(s, i) and trim(s, i) expressions remove i characters from the head or tail of the string s. Token expressions <t>[s] describe XML tokens of the form <t>s</t>. The value constructors define the output performed by an action: token(exp) emits a complete token exp, whereas start(t), continue(s), and end respectively begin, add to the content of, and end a token. The commands modify state data such as declared counters and symbol tables. Commands incr and decr increment and decrement counters. The add and remove commands insert or remove a string in the given symbol table. Conditional actions are provided using the if construct. The tests used in if include arithmetic comparisons and testing membership in symbol tables. A semantic action is a sequence of commands followed by a value and transition (written val & trans). Either the value or the transition may be omitted (along with the & operator); if the value is omitted then it is treated as skip, whereas if the transition is omitted then it is assumed to be stay. Values and transitions can only be the last instructions in any control flow path. This is not guaranteed by the syntax; instead, it is checked during typechecking. Here is an example of the use of start/continue/end and push/pop to tokenize string literals, replacing escaped symbols with unescaped ones:

<INITIAL> {
  ["]      { start(LITERAL) & push(STRING) }
  .|\n     { skip }
}
<STRING> {
  "\\\""   { continue("\"") }
  "\\n"    { continue("\n") }
  /* other escapes ... */
  \n       { error("Newline in literal") }
  ["]      { end & pop }
  .        { continue($$) }
}

Figure 5: Example: String literal tokenization

We could equally well have used begin to transition from INITIAL to STRING and back, but by using push and pop we can reuse the STRING state in other contexts. Our syntax admits some programs that may misbehave

or produce nonsense token sequences. Because of their unusual control flow, XParse lexers pose distinctive problems for typechecking. For example, as in lex, XParse lexer rules ought to return at most one value and perform at most one control flow action (flex's state stacks permit multiple pushes and pops within an action, but we limit this to one such control operation per action in order to keep the type system simple), and so we require that return values and transitions occur only at the end of control flow paths through actions. Programs might also output ill-formed token sequences. For example, if we exit after starting but before ending a token, then the result will have an unclosed token. Similarly, we cannot start while within another start, because tokens cannot be nested. In the absence of push and pop, this is relatively easy to handle by associating an output type with each state, indicating the kind of output it is permitted to perform. Only two output types are needed, token and string (although further refinements are obviously possible). The initial state must only output tokens, whereas within a start...end sequence only strings may be output. start and end transition from token to string output and back. Values token and tokentext require and result in token type, whereas continue expects and results in string type. On executing a begin, we must check that the current type matches that of the target state. push and pop complicate this picture, because we need to prevent executing a pop from a state that expects one type to a state expecting a different type. For example, we need to rule out programs such as the following:

<INITIAL> a { start(T) & push(S) }
<S>       b { pop }
<INITIAL> b { skip }

which produces an unclosed token on input abb and a nested token on input ababb. Also, we need to prohibit pop in toplevel states like the initial state, where no previous push has occurred. We can fix both problems by associating a pop type with each state. The pop type indicates what the output type must be prior to executing a pop. We also add a void type that can never be the current type; in effect, pop type void indicates that popping is forbidden in a state. The initial state must have output type token and pop type void. The types of states are declared as follows:

%state INITIAL : token\void
%state COMMENT : token\token
%state LITERAL : string\token

where the first type is the output type and the second (if any) is the pop type. To typecheck a begin, we must check that the current output and pop types match those of the target state. To typecheck a pop, we check that the current output type matches the pop type. To typecheck a push, we must check that the current output type matches the target's output type, and that the current state's output type matches the target's pop type. The limitation to one push/pop per action helps keep the type system simple. If multiple pushes/pops were allowed, then we would need to track the output types of all the states on the control stack. This could make type annotations very unwieldy, without significantly adding to the expressiveness of lexers.
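The checks themselves amount to simple equality tests on state types. A small sketch in C (the representation is illustrative, not taken from the XParse implementation):

/* Output and pop types for lexer states. */
typedef enum { T_TOKEN, T_STRING, T_VOID } otype;

typedef struct {
    otype out;  /* kind of output the state may perform */
    otype pop;  /* output type required before a pop; T_VOID forbids pop */
} state_ty;

/* begin: the current output and pop types must match the target's. */
int check_begin(state_ty cur, state_ty target)
{
    return cur.out == target.out && cur.pop == target.pop;
}

/* push: the current output type must match the target's output type,
   and the target's pop type must record the current output type so
   that a later pop restores a consistent context. */
int check_push(state_ty cur, state_ty target)
{
    return cur.out == target.out && cur.out == target.pop;
}

/* pop: forbidden when the pop type is void; otherwise the current
   output type must match the pop type. */
int check_pop(state_ty cur)
{
    return cur.pop != T_VOID && cur.out == cur.pop;
}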

3.3 Grammar Section

The final section of an XParse document is the grammar section. An XParse grammar can be thought of as defining a transformation from a sequence of XML tokens (the output of the lexer) to a parse tree, also represented as XML. Semantic values are XML fragments denoting partial parse trees. We can also think of tokens and nonterminals as XML fragments; that is, a symbol S with semantic value V can be thought of as an XML fragment <S>V</S>. As with ordinary yacc, the specification consists of a list of context-free grammar rules paired with semantic actions. Here, however, the semantic actions are expressions constructing output XML terms. For example, a parser for expressions might include rules such as

exp : INT               { <int>[@value[$1]] }
    | exp PLUS exp      { <plus>[$1,$3] }
    | exp TIMES exp     { <times>[$1,$3] }
    | LPAREN exp RPAREN { $2 }

As in yacc, the expressions $1, $2, etc., denote the semantic values of the first, second, etc., terminals/nonterminals on the right side of the rule. The syntax of grammar rules is very simple and close to that of yacc grammar rules:

rule   ::= nonterm : items { action }
items  ::= nonterm items | token items | ε
action ::= cmd; action | exp | accept(exp) | error(exp)
exp    ::= <exp>[exp] | @exp[exp] | s | exp, exp | ε | $n

where s denotes a string constant and ε denotes the empty sequence. Here, nonterm and token denote nonterminal and token symbols. Commands cmd are the same as for lexers. Expressions define XML content: <exp1>[exp2] denotes an element with name defined by the string value of exp1 and contents defined by exp2; @exp1[exp2] denotes an attribute whose name and value are the string values of exp1 and exp2 respectively; exp1, exp2 concatenates sequences; and $n is a variable referring to the semantic value of the nth item in the rule (if any). We view attributes associated with an element as children, so for example we write <int>[@value["1"]] for the XML fragment <int value="1"/>. Parsers also have access to counters and symbol tables via commands. A general action is a semicolon-separated list of commands terminated by an expression. Actions can also be error(msg), which signals a parse error, or accept(t), which accepts a term and terminates parsing. Accepting is only allowed in the start state. The types used in XParse parsers are based on the content models of XML DTDs (with which we assume some familiarity):

cm ::= string | (elt1 | ... | eltn | string)* | any | empty | t
t  ::= elt | (t1, ..., tn) | (t1 | ... | tn) | t* | t+ | t?
al ::= (att1 : string, ..., attn : string)

Types denote sets of sequences of elements and strings. The first two types describe mixed content (that is, text possibly alternating with elements in any order). empty is the type consisting only of the empty sequence and any is a catch-all type. The other types can be interpreted as regular expressions describing the sequence of toplevel elements. Attribute

lists al associated with elements are just definitions of the possible attribute fields (all of which can only have type string). Typechecking for grammar rules is fairly straightforward. For ground (that is, variable-free) expressions typechecking is essentially the same as DTD validation. The algorithm calculates a principal type for each expression that describes the sequence of toplevel content constructors (elements, attributes, or text nodes) that the expression can denote. For example, the principal type of <t>["a"], <u>[] is (t, u). We can check that an expression like <v>[<t>["a"], <u>[]] is well-formed provided that the principal type of v's content matches v's declared type. For example, if v expects content of type (t, u+, v*), we can verify that <v>[<t>["a"], <u>[]] is well-formed by checking that the string tu matches tu+v*. Variables make typechecking more complex. Because variables can have arbitrary regular expression "content models" as types, the principal type of an expression may not always be a simple sequence, so in general we need to check inclusions among regular expressions. For example, to typecheck <v>[$1, $2] where v has content type (t, u+, v*), $1 : (t, u|v), and $2 : (u|v), we need to check whether the regular language t(u|v)(u|v) is included in tu+v* (it is not, as witnessed by tvu). In general, deciding inclusions among regular languages is a PSPACE-complete problem [10, 18], but simple algorithms such as automata minimization are usually efficient enough for the kinds of inclusions needed to typecheck typical XParse programs. Attribute list typechecking is relatively simple: we just check that each attribute associated with an element is in its declared attribute list (and that there are no duplicates). Nonterminals can also have attribute-list types. When such attribute list values are used, we need only check that the attributes in the list are a subset of those declared for the element in which the list is used. Dynamic checks ensure that attribute list-valued nonterminals do not contain duplicate attributes.
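Such inclusion checks can be implemented directly on automata: compile both content models to complete DFAs and search the product automaton for a state that accepts in the first but not in the second. A rough sketch in C (the fixed bounds and representation are illustrative assumptions):

#include <stdbool.h>

#define N 8       /* maximum DFA states (illustrative bound) */
#define SIGMA 4   /* alphabet size: one symbol per element name */

typedef struct {
    int start;
    int delta[N][SIGMA];  /* complete transition function */
    bool accept[N];
} dfa;

/* Decide L(a) subset-of L(b) by depth-first search of the product. */
bool included(const dfa *a, const dfa *b)
{
    bool seen[N][N] = {{false}};
    int stack[N * N][2];
    int top = 0;

    stack[top][0] = a->start;
    stack[top][1] = b->start;
    top++;
    seen[a->start][b->start] = true;

    while (top > 0) {
        top--;
        int p = stack[top][0], q = stack[top][1];
        if (a->accept[p] && !b->accept[q])
            return false;  /* witness word in L(a) but not in L(b) */
        for (int c = 0; c < SIGMA; c++) {
            int p2 = a->delta[p][c], q2 = b->delta[q][c];
            if (!seen[p2][q2]) {
                seen[p2][q2] = true;
                stack[top][0] = p2;
                stack[top][1] = q2;
                top++;
            }
        }
    }
    return true;
}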

3.4 Discussion

We summarize some of the major design choices we have made.

1. XParse combines lexer and parser specifications, which have traditionally been separate. This makes typechecking and interpretation much simpler and helps increase uniformity between the two sublanguages (an ideal we do not claim to have achieved). On the other hand, our design does tie XParse lexers and parsers closely together, making it difficult to use alternative lexical analysis or parsing techniques with XParse as it stands.

2. XParse requires user-supplied types and semantic actions for the results of parsing. These actions and types could instead be generated automatically from the grammar. However, this would sacrifice some flexibility, since the rule structure desirable for LALR(1) parsing is not necessarily congruent with the desired parse tree structure, and it would be necessary to postprocess the output to bring it into the desired form. For example, the usual way to parse a list in yacc is:

list : list COMMA elt
     | elt

Automatically generating parse trees from these rules would yield nested lists <list>[..., <elt>] rather than flat lists <list>[<elt>, ..., <elt>]. On the other hand, generating some types and actions automatically could make XParse easier to use. In particular, inferring type declarations for nonterminals and elements from actions could be very helpful.

3. XParse lexer actions are "imperative" statements controlling a state machine. This is the traditional design, and we see no reason to change it, since even in pure functional languages, lexers are usually stateful. The state in our system is not difficult to encode in a purely functional language. On the other hand, this design choice has little to justify it other than history and expediency, and it leads to some awkward developments (such as start/end and pop types).

4. XParse grammar actions are "functional" expressions (with a little state thrown in). This is the norm for yacc-like tools for functional languages, whereas in C, yacc actions are statements that construct the semantic value of the rule's nonterminal by assigning it to $$. We find the functional view somewhat cleaner, and believe that it fits the intuitive model of XML parse tree generation better than C-style yacc actions. However, these actions are more limited than most functional languages: we can only construct expressions, not take them apart (that is, there is no case expression), and there are no numeric types or computations, only elements, attributes, strings, and concatenation. This keeps the language simple and closer to "real" XML, but may prevent programs from constructing the most convenient abstract syntax trees.

The last two points illuminate an essential tension in our design, one with which we have not yet become comfortable.

4. IMPLEMENTATION

We have developed a prototype implementation of XParse using a combination of GNU flex/bison and the Apache project's Xalan XSLT interpreter (available at http://xml.apache.org/). The implementations of xlex and xyacc comprise approximately 1500 and 1200 lines of code, respectively; in both systems approximately 800 lines are XSLT stylesheets and the rest are flex, bison, and C code. (The xlex/xyacc implementation is publicly available at http://www.cs.cornell.edu/People/jcheney/xparse/.) In this paper we have described XParse lexers and parsers as if they communicate across a textual XML interface. However, xlex and xyacc actually use the yylex() interface used by lex/yacc. This is much more efficient, but does not sacrifice any expressiveness, because it simulates the "logical" XML token-stream interface presented here. The logical token stream can optionally be printed out for debugging purposes. The prototype version of XParse is much more primitive than the version sketched in this paper. Lexer and parser specifications are separated into two files, which makes maintaining them difficult. The expression forms for constructing XML fragments in xyacc distinguish between single elements and lists of values, and confusing the two can lead

to lost or ill-formed data or crashes. Type errors are often not caught until their translations are checked by lex/yacc or gcc, and may not be caught then either. Even if the errors are caught, they refer to lines in the output lex/yacc specification, not the input xlex/xyacc file. The design for XParse set forth in this paper incorporates experience with this primitive version. However, the prototype version is already quite useful; in particular, we expect it to be very helpful in constructing parsers for the new version of XParse.

5. EXAMPLES

5.1 XSLT Concrete Syntax

Our first example is a concrete syntax for XSLT transformations. Such transformations are primarily collections of templates, or rules that state what output to produce when the current XML node matches a given XPath expression. In XSLT's XML syntax, a simple template might look like this:

<xsl:template match="foo">
  <xsl:text>"</xsl:text>
  <xsl:value-of select="."/>
  <xsl:text>"</xsl:text>
</xsl:template>

This template prints the text value of a foo element delimited in quotes. We used xlex and xyacc to develop a more legible syntax for XSLT. In our concrete syntax, this template can be written far more concisely as

"foo" -> { text "\"", value-of ".", text "\"" };

Here, sequences are written using brace-delimited, comma-separated lists, rather than juxtaposition as in XML. The arrow syntax introduces a template matching the element foo. Within the template body, plain quoted strings denote xsl:text elements, whereas the value-of keyword (denoting the xsl:value-of element) takes a single argument, which can be either a brace-delimited content sequence or a string indicating the XPath expression to evaluate. A signal advantage of this concrete syntax is that a range of XML well-formedness errors are ruled out by construction. For example, it is very easy to forget a closing tag (or the closing slash on a compact empty tag such as <foo/>) when writing XML as text. The error messages resulting from this situation tend not to be very helpful. But XML text generated by XParse tools can never be ill-formed in this way. Although the source file may itself be ill-formed and fail to parse, most text editors support parenthesis matching that helps detect and prevent such syntax errors. The xlex and xyacc programs for parsing to XSLT comprise 306 lines of code. We rewrote all the XSLT stylesheets used in xlex and xyacc to use our human-readable syntax and found that the resulting programs had about the same number of lines, but were about 50% as large as the original XML representations.

5.2 xlex/xyacc Specifications

As a qualitative validation of the expressiveness of XParse, we developed XParse grammars for xlex and xyacc themselves, to replace the hand-written flex and bison grammars that we developed first. Once early versions of both xlex and xyacc were working, it was straightforward to convert their lex/yacc specifications to xlex/xyacc. Getting these specifications to work correctly helped identify many bugs, which were subsequently fixed. Although the resulting specifications are about the same size as the original versions, we find them cleaner, more readable, and more maintainable than the originals. Of course, this is a subjective judgment. As an extreme (but telling) example, here are two lines of code from the bison and xyacc grammars for xyacc. The bison version:

term : ID LANGLE atts RANGLE LBRACK terms RBRACK
       { $$ = ELEMENT3("element",
                ELEMENT1("name", ELEMENT1("string", TEXT($1))),
                ELEMENT1("attlist", ELEMENT("list", $3)),
                ELEMENT1("content", ELEMENT("list", $6))); }

In XParse syntax this would be

term : ID LANGLE atts RANGLE LBRACK terms RBRACK
       { <element>[<name>[<string>[text($1)]],
                   <attlist>[<list>[$3]],
                   <content>[<list>[$6]]] }

These parsers for xlex and xyacc specifications consist of 524 and 359 lines of code, respectively.

5.3 Java

The Java programming language specification includes a LALR(1) grammar suitable for use with yacc ([8], chapter 19). We converted this grammar into an xyacc program and constructed an appropriate xlex lexer specification (based on [8], chapter 3). The result is a parser that parses Java programs to abstract syntax trees. These trees can be rendered using XML stylesheets in a number of ways, including as syntax-highlighted HTML pages or as text suitable for compilation. In addition, XML scripting languages can be used to perform source-to-source transformations on such syntax trees. Currently the parser does not support more recent features such as inner classes; however, since the most recent edition of the Java grammar is considerably shorter than the original one [9], we expect implementing it in XParse to be even easier. The Java xyacc grammar is 1200 lines long and the xlex lexer is 216 lines. It took the author about fourteen hours to write and debug (spread over three days). Writing the parser took three hours, the lexer took one hour, and testing and debugging took about seven additional hours. Three hours were spent writing a DTD characterizing the output and testing that the parser actually generates valid output relative to the DTD for several large examples. Much of the time needed for writing the parser consisted of writing down type declarations for the 175 nonterminals in the Java grammar. Much of the debugging time was spent correcting dynamic type errors such as using a list where a single element was expected or vice versa. In future versions of XParse, type checking and inference should help decrease the annotation burden, catch errors earlier and decrease the amount of time needed for testing and debugging.

5.4 Discussion

All these examples were easy to write and to modify in response to testing and debugging. Writing and maintaining similar parsers using existing tools would require doing additional work, such as defining an abstract syntax type, writing semantic actions that construct appropriate values, and printing out the ASTs as XML or in some other format. It is true that these tasks can be simplified by using a language with good symbolic computation support such as ML or using XML libraries to construct and serialize XML, but the amount of code that needs to be written is still usually larger than for XParse scripts and care still must be taken to avoid silly errors. On the other hand, our experiments with writing parsers using the prototype xlex/xyacc tools showed up several deficiencies that have motivated improvements in the design described in this paper.

6. CONCLUDING REMARKS

Related work. Scanning/parsing tools in the style of lex and yacc have a long history (see [14, 12, 1, 15]). However, there are many interesting alternative approaches to programming parsers, including recursive-descent parser generators, Haskell's parser combinator libraries (see for example [13]), and Prolog's definite clause grammars [17]. We have focused on emulating lex/yacc because they are among the most common parsing tools, but each of these alternative approaches has advantages, and there is no reason why any of them could not (or should not) be used for parsing to XML. Libraries for XML parsing now exist for almost every programming language. There are many tools for converting among various forms of XML, converting text to specific forms of XML, or converting between XML and native data in some general-purpose language (for example HaXML [20] for Haskell, IoXML [5] for OCaml, and Flea [16] for C). ASDLGen [21] represents the apotheosis of such tools, since it generates code that converts among a large number of language-specific abstract syntax representations as well as XML. XParse and these other conversion tools add to the usefulness of general XML tools by lowering the barriers between source text, unparsed XML data, and language-internal data structures. No other work of which we are aware addresses general text-to-XML parsing. Type systems such as XML Schema [7, 19, 3] and those of typed XML processing languages like XDuce and CDuce provide much richer ways to describe the structure of XML documents than the DTD-style type system of XParse. For example, in XML Schema it is possible to give different types to the same element based on context, and to describe the allowed textual content of attributes and text nodes (e.g., integer constants, dates, etc.). This richness comes at the cost of more complex specifications (the XML Schema specification is 339 pages long); similarly, validation and typechecking are much more complex than for XParse. There are algorithms that solve these problems well in practice, so it should be possible to enrich XParse's type system to include some of these richer datatypes and structures. Future work. One feature that is sorely lacking in XParse's current design is the ability to preserve position information (file name, line number, and character range), which is important for generating effective error messages. We plan to generalize the token stream model to permit tokens to be arbitrary XML tags, which could carry position information

in attributes, and to make this data available to parsers. Another very useful feature that is not well supported by either traditional lex/yacc or XParse is user-defined infix operators, as provided by languages like Prolog, ML, and Haskell. We wish to experiment with declarative support for parsing such operators. A third future design direction is generalizing XParse's character escaping conventions from those of lex and C to XML-style escaping and Unicode character codes. Conclusion. XParse is simple but powerful. Since the expressive power of semantic actions is limited to XML fragments, XParse specifications can be interpreted and typechecked independently, and type-correct XParse programs are guaranteed to be "safe", that is, not to crash or misuse resources. Moreover, XParse programs have meaning in their own right, independent of any other general-purpose language, so they can be executed in a variety of ways (e.g., interpretation, source-to-source translation, native compilation) just like any other programming language. Because XParse produces XML data, XParse parsers can be used with any other program that understands XML. This includes a variety of powerful scripting languages as well as standalone applications and programming languages. Also, parsers can be reused in situ without porting between incompatible lex/yacc implementations, rewriting semantic actions, or recompiling programs. These advantages distinguish XParse from the majority of parser programming tools, most of which are highly language-dependent and cannot be typechecked or executed without first being translated to another language. Furthermore, parsers developed using such tools are often difficult to reuse because they tend to become specialized to their original applications. Although XParse has several acknowledged shortcomings in its current form, we believe that with additional development, XParse could help dramatically improve the software architecture of programming language tools.

Acknowledgments Thanks to Erick Breck for many discussions about XParse and to both him and Greg Morrisett for comments on this paper.

7. REFERENCES

[1] Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques and Tools. Addison-Wesley, Reading, MA, 1986.
[2] Veronique Benzaken, Giuseppe Castagna, and Alain Frisch. CDuce: An XML-centric general-purpose language. In Proceedings of the 8th ACM SIGPLAN International Conference on Functional Programming (ICFP 2003), pages 51-63. ACM Press, 2003.
[3] Paul V. Biron and Ashok Malhotra. XML Schema Part 2: Datatypes. W3C Recommendation, May 2001. http://www.w3.org/TR/xmlschema-2.
[4] J. Clark. XSL Transformations (XSLT). W3C Recommendation, November 1999. http://www.w3.org/TR/xslt.
[5] Daniel de Rauglaudre. IoXML, version 0.6. http://pauillac.inria.fr/~ddr/IoXML/.
[6] D. Chamberlin et al. XQuery 1.0: An XML query language. W3C Working Draft, June 2001. http://www.w3.org/TR/xquery.
[7] David C. Fallside (Ed.). XML Schema Part 0: Primer. W3C Recommendation, May 2001. http://www.w3.org/TR/xmlschema-0.
[8] James Gosling, Bill Joy, and Guy L. Steele. The Java Language Specification. Addison-Wesley, Reading, MA, first edition, 1996.
[9] James Gosling, Bill Joy, Guy L. Steele, and Gilad Bracha. The Java Language Specification. Addison-Wesley, Reading, MA, second edition, 2000.
[10] John E. Hopcroft and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, 1979.
[11] H. Hosoya and B. C. Pierce. XDuce: A typed XML processing language. In International Workshop on the Web and Databases (WebDB), Dallas, TX, 2000.
[12] S. C. Johnson. Yacc: Yet another compiler compiler. Computer Science Technical Report #32, Bell Laboratories, Murray Hill, NJ, 1975.
[13] Daan Leijen and Erik Meijer. Parsec: Direct style monadic parser combinators for the real world. http://www.cs.uu.nl/~daan/parsec.html.
[14] M. E. Lesk. Lex: A lexical analyzer generator. Computer Science Technical Report #39, Bell Laboratories, Murray Hill, NJ, 1975.
[15] J. Levine, T. Mason, and D. Brown. lex & yacc. O'Reilly, second edition, 1992.
[16] Luca Padovani. Flea: A yacc grammar generator for parsing XML documents. http://www.cs.unibo.it/~lpadovan/flea/.
[17] F. C. N. Pereira and D. H. D. Warren. Definite clause grammars for language analysis. Artificial Intelligence, 13:231-278, 1980.
[18] L. J. Stockmeyer and A. R. Meyer. Word problems requiring exponential space. In Proceedings of the 5th ACM Symposium on Theory of Computing, pages 1-9, 1973.
[19] Henry S. Thompson, David Beech, Murray Maloney, and Noah Mendelsohn. XML Schema Part 1: Structures. W3C Recommendation, May 2001. http://www.w3.org/TR/xmlschema-1.
[20] Malcolm Wallace and Colin Runciman. Haskell and XML: Generic combinators or type-based translation? In Proceedings of the Fourth ACM SIGPLAN International Conference on Functional Programming (ICFP '99), pages 148-159, New York, 1999. ACM Press.
[21] Daniel C. Wang, Andrew W. Appel, Jeff L. Korn, and Christopher S. Serra. The Zephyr abstract syntax description language. In Proceedings of the USENIX Conference on Domain-Specific Languages, pages 213-228, October 1997.